# Presidio Analyzer

Presidio Analyzer is a Python-based service for detecting PII (Personally Identifiable Information) entities in unstructured text. It provides a pluggable and customizable framework using Named Entity Recognition, regular expressions, rule-based logic, and checksums to identify over 50 types of PII entities across multiple languages.

## Package Information

- **Package Name**: presidio_analyzer
- **Language**: Python
- **Installation**: `pip install presidio-analyzer`
- **Supported Python**: 3.9, 3.10, 3.11, 3.12

## Core Imports

```python
from presidio_analyzer import AnalyzerEngine
```

For comprehensive imports:

```python
from presidio_analyzer import (
    AnalyzerEngine,
    BatchAnalyzerEngine,
    RecognizerResult,
    PatternRecognizer,
    Pattern,
    AnalyzerEngineProvider,
)
```

## Basic Usage

```python
from presidio_analyzer import AnalyzerEngine

# Initialize analyzer
analyzer = AnalyzerEngine()

# Analyze text for PII
text = "My name is John Doe and my phone number is 555-123-4567"
results = analyzer.analyze(text=text, language="en")

# Process results
for result in results:
    print(f"Entity: {result.entity_type}")
    print(f"Text: {text[result.start:result.end]}")
    print(f"Score: {result.score}")
    print(f"Location: {result.start}-{result.end}")
```

## Architecture

Presidio Analyzer follows a modular architecture:

- **AnalyzerEngine**: Central orchestrator that coordinates all analysis operations
- **RecognizerRegistry**: Manages and holds all entity recognizers
- **EntityRecognizer**: Base class for all PII detection logic (pattern-based, ML-based, remote)
- **NlpEngine**: Abstraction layer over NLP preprocessing (spaCy, Stanza, Transformers)
- **ContextAwareEnhancer**: Improves detection accuracy using surrounding context
- **BatchAnalyzerEngine**: Enables efficient processing of large datasets

This design allows for flexible deployment options, from Python scripts to Docker containers and Kubernetes orchestration, while remaining highly extensible through custom recognizers and detection logic.

## Capabilities

### Core Analysis Engine

Central PII detection functionality including the main AnalyzerEngine class, request handling, and result processing. Provides the primary interface for detecting PII entities in text.

```python { .api }
class AnalyzerEngine:
    def __init__(
        self,
        registry: RecognizerRegistry = None,
        nlp_engine: NlpEngine = None,
        app_tracer: AppTracer = None,
        log_decision_process: bool = False,
        default_score_threshold: float = 0,
        supported_languages: List[str] = None,
        context_aware_enhancer: Optional[ContextAwareEnhancer] = None
    ): ...

    def analyze(
        self,
        text: str,
        language: str,
        entities: Optional[List[str]] = None,
        correlation_id: Optional[str] = None,
        score_threshold: Optional[float] = None,
        return_decision_process: Optional[bool] = False,
        ad_hoc_recognizers: Optional[List[EntityRecognizer]] = None,
        context: Optional[List[str]] = None,
        allow_list: Optional[List[str]] = None,
        allow_list_match: Optional[str] = "exact",
        regex_flags: Optional[int] = None,
        nlp_artifacts: Optional[NlpArtifacts] = None
    ) -> List[RecognizerResult]: ...
```

[Core Analysis Engine](./core-analysis.md)

### Batch Processing

High-performance analysis of large datasets including iterables, dictionaries, and structured data, with multiprocessing support and configurable batch sizes.

```python { .api }
class BatchAnalyzerEngine:
    def __init__(self, analyzer_engine: Optional[AnalyzerEngine] = None): ...

    def analyze_iterator(
        self,
        texts: Iterable[Union[str, bool, float, int]],
        language: str,
        batch_size: int = 1,
        n_process: int = 1,
        **kwargs
    ) -> List[List[RecognizerResult]]: ...

    def analyze_dict(
        self,
        input_dict: Dict[str, Union[Any, Iterable[Any]]],
        language: str,
        keys_to_skip: Optional[List[str]] = None,
        batch_size: int = 1,
        n_process: int = 1,
        **kwargs
    ) -> Iterator[DictAnalyzerResult]: ...
```

[Batch Processing](./batch-processing.md)

### Entity Recognizers

Framework for creating custom PII recognizers, including abstract base classes, pattern-based recognizers, and remote service integration capabilities.

```python { .api }
class EntityRecognizer:
    def __init__(
        self,
        supported_entities: List[str],
        name: str = None,
        supported_language: str = "en",
        version: str = "0.0.1",
        context: Optional[List[str]] = None
    ): ...

    def analyze(
        self,
        text: str,
        entities: List[str],
        nlp_artifacts: NlpArtifacts
    ) -> List[RecognizerResult]: ...

class PatternRecognizer(LocalRecognizer):
    def __init__(
        self,
        supported_entity: str,
        name: str = None,
        supported_language: str = "en",
        patterns: List[Pattern] = None,
        deny_list: List[str] = None,
        context: List[str] = None,
        deny_list_score: float = 1.0,
        global_regex_flags: Optional[int] = None,
        version: str = "0.0.1"
    ): ...
```

[Entity Recognizers](./entity-recognizers.md)

### Predefined Recognizers

Comprehensive collection of over 50 built-in recognizers for common PII types, including generic entities (emails, phone numbers, credit cards) and country-specific identifiers (SSNs, passport numbers, tax IDs).

```python { .api }
# Generic recognizers
class CreditCardRecognizer(PatternRecognizer): ...
class EmailRecognizer(PatternRecognizer): ...
class PhoneRecognizer(PatternRecognizer): ...
class IpRecognizer(PatternRecognizer): ...

# US-specific recognizers
class UsSsnRecognizer(PatternRecognizer): ...
class UsLicenseRecognizer(PatternRecognizer): ...
class UsPassportRecognizer(PatternRecognizer): ...

# International recognizers
class IbanRecognizer(PatternRecognizer): ...
class AuMedicareRecognizer(PatternRecognizer): ...
class UkNinoRecognizer(PatternRecognizer): ...
```

[Predefined Recognizers](./predefined-recognizers.md)

### Context Enhancement

Advanced context-aware enhancement that improves detection accuracy by analyzing surrounding text using lemmatization and contextual similarity scoring.

```python { .api }
class ContextAwareEnhancer:
    def __init__(
        self,
        context_similarity_factor: float,
        min_score_with_context_similarity: float,
        context_prefix_count: int,
        context_suffix_count: int
    ): ...

    def enhance_using_context(
        self,
        text: str,
        raw_results: List[RecognizerResult],
        nlp_artifacts: NlpArtifacts,
        recognizers: List[EntityRecognizer],
        context: Optional[List[str]] = None
    ) -> List[RecognizerResult]: ...

class LemmaContextAwareEnhancer(ContextAwareEnhancer):
    def __init__(
        self,
        context_similarity_factor: float = 0.35,
        min_score_with_context_similarity: float = 0.4,
        context_prefix_count: int = 5,
        context_suffix_count: int = 0
    ): ...
```

[Context Enhancement](./context-enhancement.md)

### Configuration and Setup

Flexible configuration system supporting YAML-based setup, multiple NLP engines (spaCy, Stanza, Transformers), and customizable recognizer registries.

```python { .api }
class AnalyzerEngineProvider:
    def __init__(
        self,
        analyzer_engine_conf_file: Optional[Union[Path, str]] = None,
        nlp_engine_conf_file: Optional[Union[Path, str]] = None,
        recognizer_registry_conf_file: Optional[Union[Path, str]] = None
    ): ...

    def create_engine(self) -> AnalyzerEngine: ...

class RecognizerRegistry:
    def __init__(
        self,
        recognizers: Optional[Iterable[EntityRecognizer]] = None,
        global_regex_flags: Optional[int] = None,
        supported_languages: Optional[List[str]] = None
    ): ...

    def load_predefined_recognizers(
        self,
        languages: Optional[List[str]] = None,
        nlp_engine: NlpEngine = None
    ) -> None: ...
```

[Configuration and Setup](./configuration.md)

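A configuration file for `AnalyzerEngineProvider` is typically YAML. The fragment below is a sketch; the key names follow Presidio's documented analyzer configuration format, but verify them against the version in use:

```yaml
# analyzer-config.yml (illustrative)
supported_languages:
  - en
default_score_threshold: 0.4
nlp_configuration:
  nlp_engine_name: spacy
  models:
    - lang_code: en
      model_name: en_core_web_lg
```

The engine is then built with `AnalyzerEngineProvider(analyzer_engine_conf_file="analyzer-config.yml").create_engine()`.
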
## Types

### Core Result Types

```python { .api }
class RecognizerResult:
    def __init__(
        self,
        entity_type: str,
        start: int,
        end: int,
        score: float,
        analysis_explanation: AnalysisExplanation = None,
        recognition_metadata: Dict = None
    ): ...

    # Properties
    entity_type: str  # Type of detected entity (e.g., "PERSON", "PHONE_NUMBER")
    start: int  # Start position in text
    end: int  # End position in text
    score: float  # Confidence score (0.0 to 1.0)
    analysis_explanation: AnalysisExplanation  # Detailed detection explanation
    recognition_metadata: Dict  # Additional recognizer-specific metadata

class DictAnalyzerResult:
    key: str  # Dictionary key that was analyzed
    value: Union[str, List[str], dict]  # Original value
    recognizer_results: Union[
        List[RecognizerResult],
        List[List[RecognizerResult]],
        Iterator[DictAnalyzerResult]
    ]  # Detection results

class AnalysisExplanation:
    def __init__(
        self,
        recognizer: str,
        original_score: float,
        pattern_name: str = None,
        pattern: str = None,
        validation_result: float = None,
        textual_explanation: str = None,
        regex_flags: int = None
    ): ...

    # Properties
    recognizer: str  # Name of recognizer that made the detection
    original_score: float  # Initial confidence score
    score: float  # Final confidence score (after enhancements)
    pattern_name: str  # Name of matching pattern (if applicable)
    textual_explanation: str  # Human-readable explanation
```

### Pattern and Configuration Types

```python { .api }
class Pattern:
    def __init__(self, name: str, regex: str, score: float): ...

    # Properties
    name: str  # Descriptive name for the pattern
    regex: str  # Regular expression pattern
    score: float  # Confidence score when the pattern matches

class AnalyzerRequest:
    def __init__(self, req_data: Dict): ...

    # Properties extracted from req_data
    text: str  # Text to analyze
    language: str  # Language code (e.g., "en")
    entities: Optional[List[str]]  # Entity types to detect
    correlation_id: Optional[str]  # Request tracking ID
    score_threshold: Optional[float]  # Minimum confidence score
    return_decision_process: Optional[bool]  # Include analysis explanations
    ad_hoc_recognizers: Optional[List[EntityRecognizer]]  # Custom recognizers
    context: Optional[List[str]]  # Context keywords for enhancement
    allow_list: Optional[List[str]]  # Values to exclude from detection
    allow_list_match: Optional[str]  # Match strategy ("exact" or "regex")
    regex_flags: Optional[int]  # Regex compilation flags
```

## Error Handling

Presidio Analyzer uses standard Python exceptions. Common error scenarios:

- **ValueError**: Invalid parameters (e.g., unsupported language, invalid score threshold)
- **TypeError**: Incorrect parameter types
- **ImportError**: Missing optional dependencies (e.g., transformers, stanza)
- **FileNotFoundError**: Missing configuration files when using AnalyzerEngineProvider

## Multi-language Support

Supported languages: English (en), Hebrew (he), Spanish (es), German (de), French (fr), Italian (it), Portuguese (pt), Chinese (zh), Japanese (ja), Hindi (hi), Arabic (ar).

Language-specific recognizers are automatically loaded based on the `language` parameter in `analyze()` calls. Some recognizers support multiple languages, while others are region-specific.