# Presidio Analyzer

Presidio Analyzer is a Python-based service for detecting PII (Personally Identifiable Information) entities in unstructured text. It provides a pluggable and customizable framework using Named Entity Recognition, regular expressions, rule-based logic, and checksums to identify over 50 types of PII entities across multiple languages.

## Package Information

- **Package Name**: presidio_analyzer
- **Language**: Python
- **Installation**: `pip install presidio-analyzer`
- **Supported Python**: 3.9, 3.10, 3.11, 3.12

## Core Imports

```python
from presidio_analyzer import AnalyzerEngine
```

For comprehensive imports:

```python
from presidio_analyzer import (
    AnalyzerEngine,
    BatchAnalyzerEngine,
    RecognizerResult,
    PatternRecognizer,
    Pattern,
    AnalyzerEngineProvider,
)
```

## Basic Usage

```python
from presidio_analyzer import AnalyzerEngine

# Initialize analyzer
analyzer = AnalyzerEngine()

# Analyze text for PII
text = "My name is John Doe and my phone number is 555-123-4567"
results = analyzer.analyze(text=text, language="en")

# Process results
for result in results:
    print(f"Entity: {result.entity_type}")
    print(f"Text: {text[result.start:result.end]}")
    print(f"Score: {result.score}")
    print(f"Location: {result.start}-{result.end}")
```

## Architecture

Presidio Analyzer follows a modular architecture:

- **AnalyzerEngine**: Central orchestrator that coordinates all analysis operations
- **RecognizerRegistry**: Manages and holds all entity recognizers
- **EntityRecognizer**: Base class for all PII detection logic (pattern-based, ML-based, remote)
- **NlpEngine**: Abstraction layer over NLP preprocessing (spaCy, Stanza, Transformers)
- **ContextAwareEnhancer**: Improves detection accuracy using surrounding context
- **BatchAnalyzerEngine**: Enables efficient processing of large datasets

This design allows for flexible deployment options, from Python scripts to Docker containers and Kubernetes orchestration, while remaining highly extensible through custom recognizers and detection logic.

## Capabilities

### Core Analysis Engine

Central PII detection functionality including the main AnalyzerEngine class, request handling, and result processing. Provides the primary interface for detecting PII entities in text.

```python { .api }
class AnalyzerEngine:
    def __init__(
        self,
        registry: RecognizerRegistry = None,
        nlp_engine: NlpEngine = None,
        app_tracer: AppTracer = None,
        log_decision_process: bool = False,
        default_score_threshold: float = 0,
        supported_languages: List[str] = None,
        context_aware_enhancer: Optional[ContextAwareEnhancer] = None
    ): ...

    def analyze(
        self,
        text: str,
        language: str,
        entities: Optional[List[str]] = None,
        correlation_id: Optional[str] = None,
        score_threshold: Optional[float] = None,
        return_decision_process: Optional[bool] = False,
        ad_hoc_recognizers: Optional[List[EntityRecognizer]] = None,
        context: Optional[List[str]] = None,
        allow_list: Optional[List[str]] = None,
        allow_list_match: Optional[str] = "exact",
        regex_flags: Optional[int] = None,
        nlp_artifacts: Optional[NlpArtifacts] = None
    ) -> List[RecognizerResult]: ...
```

[Core Analysis Engine](./core-analysis.md)

### Batch Processing

High-performance analysis of large datasets including iterables, dictionaries, and structured data, with multiprocessing support and configurable batch sizes.

```python { .api }
class BatchAnalyzerEngine:
    def __init__(self, analyzer_engine: Optional[AnalyzerEngine] = None): ...

    def analyze_iterator(
        self,
        texts: Iterable[Union[str, bool, float, int]],
        language: str,
        batch_size: int = 1,
        n_process: int = 1,
        **kwargs
    ) -> List[List[RecognizerResult]]: ...

    def analyze_dict(
        self,
        input_dict: Dict[str, Union[Any, Iterable[Any]]],
        language: str,
        keys_to_skip: Optional[List[str]] = None,
        batch_size: int = 1,
        n_process: int = 1,
        **kwargs
    ) -> Iterator[DictAnalyzerResult]: ...
```

[Batch Processing](./batch-processing.md)

### Entity Recognizers

Framework for creating custom PII recognizers, including abstract base classes, pattern-based recognizers, and remote service integration capabilities.

```python { .api }
class EntityRecognizer:
    def __init__(
        self,
        supported_entities: List[str],
        name: str = None,
        supported_language: str = "en",
        version: str = "0.0.1",
        context: Optional[List[str]] = None
    ): ...

    def analyze(
        self,
        text: str,
        entities: List[str],
        nlp_artifacts: NlpArtifacts
    ) -> List[RecognizerResult]: ...

class PatternRecognizer(LocalRecognizer):
    def __init__(
        self,
        supported_entity: str,
        name: str = None,
        supported_language: str = "en",
        patterns: List[Pattern] = None,
        deny_list: List[str] = None,
        context: List[str] = None,
        deny_list_score: float = 1.0,
        global_regex_flags: Optional[int] = None,
        version: str = "0.0.1"
    ): ...
```

[Entity Recognizers](./entity-recognizers.md)

### Predefined Recognizers

Comprehensive collection of over 50 built-in recognizers for common PII types, including generic entities (emails, phone numbers, credit cards) and country-specific identifiers (SSNs, passport numbers, tax IDs).

```python { .api }
# Generic recognizers
class CreditCardRecognizer(PatternRecognizer): ...
class EmailRecognizer(PatternRecognizer): ...
class PhoneRecognizer(PatternRecognizer): ...
class IpRecognizer(PatternRecognizer): ...

# US-specific recognizers
class UsSsnRecognizer(PatternRecognizer): ...
class UsLicenseRecognizer(PatternRecognizer): ...
class UsPassportRecognizer(PatternRecognizer): ...

# International recognizers
class IbanRecognizer(PatternRecognizer): ...
class AuMedicareRecognizer(PatternRecognizer): ...
class UkNinoRecognizer(PatternRecognizer): ...
```

[Predefined Recognizers](./predefined-recognizers.md)

### Context Enhancement

Advanced context-aware enhancement that improves detection accuracy by analyzing surrounding text using lemmatization and contextual similarity scoring.

```python { .api }
class ContextAwareEnhancer:
    def __init__(
        self,
        context_similarity_factor: float,
        min_score_with_context_similarity: float,
        context_prefix_count: int,
        context_suffix_count: int
    ): ...

    def enhance_using_context(
        self,
        text: str,
        raw_results: List[RecognizerResult],
        nlp_artifacts: NlpArtifacts,
        recognizers: List[EntityRecognizer],
        context: Optional[List[str]] = None
    ) -> List[RecognizerResult]: ...

class LemmaContextAwareEnhancer(ContextAwareEnhancer):
    def __init__(
        self,
        context_similarity_factor: float = 0.35,
        min_score_with_context_similarity: float = 0.4,
        context_prefix_count: int = 5,
        context_suffix_count: int = 0
    ): ...
```

[Context Enhancement](./context-enhancement.md)

### Configuration and Setup

Flexible configuration system supporting YAML-based setup, multiple NLP engines (spaCy, Stanza, Transformers), and customizable recognizer registries.

```python { .api }
class AnalyzerEngineProvider:
    def __init__(
        self,
        analyzer_engine_conf_file: Optional[Union[Path, str]] = None,
        nlp_engine_conf_file: Optional[Union[Path, str]] = None,
        recognizer_registry_conf_file: Optional[Union[Path, str]] = None
    ): ...

    def create_engine(self) -> AnalyzerEngine: ...

class RecognizerRegistry:
    def __init__(
        self,
        recognizers: Optional[Iterable[EntityRecognizer]] = None,
        global_regex_flags: Optional[int] = None,
        supported_languages: Optional[List[str]] = None
    ): ...

    def load_predefined_recognizers(
        self,
        languages: Optional[List[str]] = None,
        nlp_engine: NlpEngine = None
    ) -> None: ...
```

[Configuration and Setup](./configuration.md)

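A configuration file for `AnalyzerEngineProvider` is typically YAML. The fragment below is a sketch; the key names follow Presidio's documented analyzer configuration format, but verify them against the version in use:

```yaml
# analyzer-config.yml (illustrative)
supported_languages:
  - en
default_score_threshold: 0.4
nlp_configuration:
  nlp_engine_name: spacy
  models:
    - lang_code: en
      model_name: en_core_web_lg
```

The engine is then built with `AnalyzerEngineProvider(analyzer_engine_conf_file="analyzer-config.yml").create_engine()`.
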
## Types

### Core Result Types

```python { .api }
class RecognizerResult:
    def __init__(
        self,
        entity_type: str,
        start: int,
        end: int,
        score: float,
        analysis_explanation: AnalysisExplanation = None,
        recognition_metadata: Dict = None
    ): ...

    # Properties
    entity_type: str  # Type of detected entity (e.g., "PERSON", "PHONE_NUMBER")
    start: int  # Start position in text
    end: int  # End position in text
    score: float  # Confidence score (0.0 to 1.0)
    analysis_explanation: AnalysisExplanation  # Detailed detection explanation
    recognition_metadata: Dict  # Additional recognizer-specific metadata

class DictAnalyzerResult:
    key: str  # Dictionary key that was analyzed
    value: Union[str, List[str], dict]  # Original value
    recognizer_results: Union[
        List[RecognizerResult],
        List[List[RecognizerResult]],
        Iterator[DictAnalyzerResult]
    ]  # Detection results

class AnalysisExplanation:
    def __init__(
        self,
        recognizer: str,
        original_score: float,
        pattern_name: str = None,
        pattern: str = None,
        validation_result: float = None,
        textual_explanation: str = None,
        regex_flags: int = None
    ): ...

    # Properties
    recognizer: str  # Name of recognizer that made the detection
    original_score: float  # Initial confidence score
    score: float  # Final confidence score (after enhancements)
    pattern_name: str  # Name of matching pattern (if applicable)
    textual_explanation: str  # Human-readable explanation
```

### Pattern and Configuration Types

```python { .api }
class Pattern:
    def __init__(self, name: str, regex: str, score: float): ...

    # Properties
    name: str  # Descriptive name for the pattern
    regex: str  # Regular expression pattern
    score: float  # Confidence score when the pattern matches

class AnalyzerRequest:
    def __init__(self, req_data: Dict): ...

    # Properties extracted from req_data
    text: str  # Text to analyze
    language: str  # Language code (e.g., "en")
    entities: Optional[List[str]]  # Entity types to detect
    correlation_id: Optional[str]  # Request tracking ID
    score_threshold: Optional[float]  # Minimum confidence score
    return_decision_process: Optional[bool]  # Include analysis explanations
    ad_hoc_recognizers: Optional[List[EntityRecognizer]]  # Custom recognizers
    context: Optional[List[str]]  # Context keywords for enhancement
    allow_list: Optional[List[str]]  # Values to exclude from detection
    allow_list_match: Optional[str]  # Match strategy ("exact" or "regex")
    regex_flags: Optional[int]  # Regex compilation flags
```

## Error Handling

Presidio Analyzer uses standard Python exceptions. Common error scenarios:

- **ValueError**: Invalid parameters (e.g., unsupported language, invalid score threshold)
- **TypeError**: Incorrect parameter types
- **ImportError**: Missing optional dependencies (e.g., transformers, stanza)
- **FileNotFoundError**: Missing configuration files when using AnalyzerEngineProvider

## Multi-language Support

Supported languages: English (en), Hebrew (he), Spanish (es), German (de), French (fr), Italian (it), Portuguese (pt), Chinese (zh), Japanese (ja), Hindi (hi), Arabic (ar).

Language-specific recognizers are automatically loaded based on the `language` parameter in `analyze()` calls. Some recognizers support multiple languages, while others are region-specific.