# Entity Recognizers

The entity recognizer framework provides the foundation for creating custom PII detection logic. It includes abstract base classes, pattern-based recognizers, and integration capabilities for remote services.

## Capabilities

### EntityRecognizer Base Class

Abstract base class that defines the interface for all PII entity recognizers in Presidio Analyzer.

```python { .api }
class EntityRecognizer:
    """
    Abstract base class for all PII entity recognizers.

    Args:
        supported_entities: List of entity types this recognizer can detect
        name: Unique identifier for the recognizer (auto-generated if None)
        supported_language: Primary language code supported (default: "en")
        version: Version string for the recognizer
        context: Optional context keywords that improve detection accuracy
    """
    def __init__(
        self,
        supported_entities: List[str],
        name: str = None,
        supported_language: str = "en",
        version: str = "0.0.1",
        context: Optional[List[str]] = None
    ): ...

    def analyze(
        self,
        text: str,
        entities: List[str],
        nlp_artifacts: NlpArtifacts
    ) -> List[RecognizerResult]:
        """
        Abstract method: Analyze text to detect PII entities.

        Args:
            text: Input text to analyze
            entities: List of entity types to look for
            nlp_artifacts: Pre-processed NLP data (tokens, lemmas, etc.)

        Returns:
            List of RecognizerResult objects for detected entities
        """

    def load(self) -> None:
        """Abstract method: Initialize recognizer resources (models, patterns, etc.)"""

    def enhance_using_context(
        self,
        text: str,
        raw_results: List[RecognizerResult],
        nlp_artifacts: NlpArtifacts,
        recognizers: List[EntityRecognizer],
        context: Optional[List[str]] = None
    ) -> List[RecognizerResult]:
        """
        Enhance detection results using contextual information.
        Can be overridden by subclasses for custom enhancement logic.

        Args:
            text: Original input text
            raw_results: Initial detection results
            nlp_artifacts: NLP processing artifacts
            recognizers: All available recognizers
            context: Context keywords for enhancement

        Returns:
            Enhanced list of RecognizerResult objects
        """

    def get_supported_entities(self) -> List[str]:
        """Get list of entity types this recognizer supports."""

    def get_supported_language(self) -> str:
        """Get primary supported language code."""

    def get_version(self) -> str:
        """Get recognizer version string."""

    def to_dict(self) -> Dict:
        """Serialize recognizer configuration to dictionary."""

    @classmethod
    def from_dict(cls, entity_recognizer_dict: Dict) -> EntityRecognizer:
        """Create recognizer instance from dictionary configuration."""

    @staticmethod
    def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
        """Remove duplicate results based on entity type and position."""

    @staticmethod
    def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
        """Clean input text using replacement patterns."""

    # Properties
    supported_entities: List[str]  # Entity types this recognizer detects
    name: str  # Unique recognizer identifier
    supported_language: str  # Primary language code
    version: str  # Version string
    is_loaded: bool  # Whether recognizer resources are loaded
    context: Optional[List[str]]  # Context keywords for enhancement
    id: str  # Unique instance identifier

    # Constants
    MIN_SCORE = 0  # Minimum confidence score
    MAX_SCORE = 1.0  # Maximum confidence score
```
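
To make the `sanitize_value` helper more concrete, here is a minimal standalone sketch of the behavior its docstring describes. It assumes each pair is applied in order as a plain (non-regex) substring replacement; `sanitize_value_sketch` is a hypothetical stand-in written for illustration, not the library function itself.

```python
from typing import List, Tuple


def sanitize_value_sketch(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """Apply each (search, replacement) pair to the text, in order.

    Hypothetical stand-in for EntityRecognizer.sanitize_value; assumes
    plain substring replacement rather than regex substitution.
    """
    for search, replacement in replacement_pairs:
        text = text.replace(search, replacement)
    return text


# Strip separators from a card-like number before validating it.
cleaned = sanitize_value_sketch("4532-0151 1283-0366", [("-", ""), (" ", "")])
print(cleaned)
```

A helper like this is typically called at the start of `validate_result`, so that checksum logic only ever sees the bare digits.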

### LocalRecognizer

Abstract class for recognizers that run in the same process as the AnalyzerEngine.

```python { .api }
class LocalRecognizer(EntityRecognizer):
    """
    Abstract base class for recognizers that execute locally within the analyzer process.
    Inherits all methods and properties from EntityRecognizer.
    """
    pass
```

### PatternRecognizer

Concrete implementation for pattern-based PII detection using regular expressions and deny lists.

```python { .api }
class PatternRecognizer(LocalRecognizer):
    """
    PII entity recognizer using regular expressions and deny lists.

    Args:
        supported_entity: Single entity type this recognizer detects
        name: Unique identifier for the recognizer
        supported_language: Language code (default: "en")
        patterns: List of Pattern objects containing regex rules
        deny_list: List of strings that should always be detected
        context: Context keywords that improve detection accuracy
        deny_list_score: Confidence score for deny list matches (default: 1.0)
        global_regex_flags: Default regex compilation flags
        version: Version string
    """
    def __init__(
        self,
        supported_entity: str,
        name: str = None,
        supported_language: str = "en",
        patterns: List[Pattern] = None,
        deny_list: List[str] = None,
        context: List[str] = None,
        deny_list_score: float = 1.0,
        global_regex_flags: Optional[int] = None,  # Default: re.DOTALL | re.MULTILINE | re.IGNORECASE
        version: str = "0.0.1"
    ): ...

    def analyze(
        self,
        text: str,
        entities: List[str],
        nlp_artifacts: Optional[NlpArtifacts] = None,
        regex_flags: Optional[int] = None
    ) -> List[RecognizerResult]:
        """
        Analyze text using configured patterns and deny lists.

        Args:
            text: Input text to analyze
            entities: Entity types to detect (must include supported_entity)
            nlp_artifacts: Pre-processed NLP data (optional for pattern matching)
            regex_flags: Override default regex compilation flags

        Returns:
            List of RecognizerResult objects for pattern matches
        """

    def validate_result(self, pattern_text: str) -> Optional[bool]:
        """
        Validate pattern match using custom logic (override in subclasses).

        Args:
            pattern_text: Matched text from pattern

        Returns:
            True if valid, False if invalid, None if no validation performed
        """

    def invalidate_result(self, pattern_text: str) -> Optional[bool]:
        """
        Check if pattern match should be invalidated (override in subclasses).

        Args:
            pattern_text: Matched text from pattern

        Returns:
            True if should be invalidated, False if valid, None if no check performed
        """

    @staticmethod
    def build_regex_explanation(
        recognizer_name: str,
        pattern_name: str,
        pattern: str,
        original_score: float,
        validation_result: Optional[bool] = None
    ) -> AnalysisExplanation:
        """Build detailed explanation for regex-based detection."""

    def to_dict(self) -> Dict:
        """Serialize pattern recognizer configuration to dictionary."""

    @classmethod
    def from_dict(cls, entity_recognizer_dict: Dict) -> PatternRecognizer:
        """Create PatternRecognizer from dictionary configuration."""

    # Properties
    patterns: List[Pattern]  # List of regex Pattern objects
    deny_list: List[str]  # List of strings that indicate PII
    context: Optional[List[str]]  # Context keywords for enhancement
    deny_list_score: float  # Confidence score for deny list matches
    global_regex_flags: Optional[int]  # Default regex compilation flags
```
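
The tri-state return values of `validate_result` and `invalidate_result` are easier to reason about with a small example. The sketch below shows one plausible way the two hooks could feed into a match's final score. It assumes that a successful validation raises the score to `MAX_SCORE`, while a failed validation or a positive invalidation lowers it to `MIN_SCORE`. The function `adjust_score` is hypothetical, and the library's internal logic may differ in detail.

```python
from typing import Optional

MIN_SCORE = 0.0  # mirrors EntityRecognizer.MIN_SCORE
MAX_SCORE = 1.0  # mirrors EntityRecognizer.MAX_SCORE


def adjust_score(
    pattern_score: float,
    validation: Optional[bool],
    invalidation: Optional[bool],
) -> float:
    """Combine a pattern's base score with the two hook results.

    Assumed behavior (not confirmed by the API listing above):
    None means "no opinion" and leaves the score untouched.
    """
    score = pattern_score
    if validation is not None:
        score = MAX_SCORE if validation else MIN_SCORE
    if invalidation:
        score = MIN_SCORE
    return score


print(adjust_score(0.6, True, None))   # validated match
print(adjust_score(0.6, None, None))   # no hooks implemented
print(adjust_score(0.6, True, True))   # invalidation takes precedence here
```

Under these assumptions, a pattern can start with a deliberately low score and rely on `validate_result` to promote only the matches that pass a checksum or similar test.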

### RemoteRecognizer

Abstract class for recognizers that call external services or run in separate processes.

```python { .api }
class RemoteRecognizer(EntityRecognizer):
    """
    Abstract base class for recognizers that call external services.

    Args:
        supported_entities: List of entity types this recognizer can detect
        name: Unique identifier for the recognizer
        supported_language: Language code
        version: Version string
        context: Optional context keywords
    """
    def __init__(
        self,
        supported_entities: List[str],
        name: Optional[str],
        supported_language: str,
        version: str,
        context: Optional[List[str]] = None
    ): ...

    def analyze(
        self,
        text: str,
        entities: List[str],
        nlp_artifacts: NlpArtifacts
    ) -> List[RecognizerResult]:
        """
        Abstract method: Call external service for PII detection.
        Must be implemented by concrete subclasses.
        """

    def get_supported_entities(self) -> List[str]:
        """Abstract method: Get supported entities from external service."""
```

### Pattern Class

Represents a regular expression pattern used by PatternRecognizer.

```python { .api }
class Pattern:
    """
    Regular expression pattern for PII detection.

    Args:
        name: Descriptive name for the pattern
        regex: Regular expression string
        score: Confidence score when pattern matches (0.0-1.0)
    """
    def __init__(self, name: str, regex: str, score: float): ...

    def to_dict(self) -> Dict:
        """Serialize pattern to dictionary format."""

    @classmethod
    def from_dict(cls, pattern_dict: Dict) -> Pattern:
        """Create Pattern from dictionary data."""

    # Properties
    name: str  # Descriptive pattern name
    regex: str  # Regular expression string
    score: float  # Confidence score for matches
    compiled_regex: re.Pattern  # Compiled regex object
    compiled_with_flags: re.Pattern  # Compiled regex with flags
```
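
The `to_dict` / `from_dict` pair is meant to round-trip a pattern through a plain dictionary, for example when storing recognizer definitions in a config file. The following sketch illustrates that round trip with a hypothetical stand-in class, `PatternSketch`. It assumes the dictionary keys mirror the constructor arguments (`name`, `regex`, `score`); it is not the library's `Pattern` class.

```python
from dataclasses import asdict, dataclass
from typing import Dict


@dataclass
class PatternSketch:
    """Hypothetical stand-in for Pattern, used only to show the round trip."""

    name: str
    regex: str
    score: float

    def to_dict(self) -> Dict:
        # Assumed layout: one key per constructor argument.
        return asdict(self)

    @classmethod
    def from_dict(cls, pattern_dict: Dict) -> "PatternSketch":
        return cls(**pattern_dict)


original = PatternSketch(name="employee_id", regex=r"\bEMP-\d{5}\b", score=0.9)
restored = PatternSketch.from_dict(original.to_dict())
print(restored == original)
```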

## Usage Examples

### Creating a Custom PatternRecognizer

```python
from presidio_analyzer import PatternRecognizer, Pattern

# Define patterns for custom entity type
employee_id_patterns = [
    Pattern(
        name="employee_id_format_1",
        regex=r"\bEMP-\d{5}\b",
        score=0.9
    ),
    Pattern(
        name="employee_id_format_2",
        regex=r"\b[Ee]mployee\s*[Ii][Dd]\s*:?\s*(\d{5})\b",
        score=0.8
    )
]

# Create custom recognizer
employee_recognizer = PatternRecognizer(
    supported_entity="EMPLOYEE_ID",
    name="EmployeeIdRecognizer",
    patterns=employee_id_patterns,
    context=["employee", "staff", "worker", "personnel"]
)

# Test the recognizer
from presidio_analyzer.nlp_engine import SpacyNlpEngine

nlp_engine = SpacyNlpEngine()
nlp_engine.load()

text = "Contact employee ID: 12345 or use EMP-98765"
nlp_artifacts = nlp_engine.process_text(text, "en")

results = employee_recognizer.analyze(
    text=text,
    entities=["EMPLOYEE_ID"],
    nlp_artifacts=nlp_artifacts
)

for result in results:
    detected_text = text[result.start:result.end]
    print(f"Found {result.entity_type}: '{detected_text}' (score: {result.score})")
```

### Using Deny Lists

```python
from presidio_analyzer import PatternRecognizer

# Create recognizer with deny list
sensitive_terms_recognizer = PatternRecognizer(
    supported_entity="SENSITIVE_TERM",
    name="SensitiveTermsRecognizer",
    deny_list=[
        "confidential",
        "classified",
        "internal use only",
        "proprietary"
    ],
    deny_list_score=0.95
)

# Test with text containing deny list terms
text = "This document is marked as confidential and internal use only"
results = sensitive_terms_recognizer.analyze(
    text=text,
    entities=["SENSITIVE_TERM"],
    nlp_artifacts=None  # Deny lists don't need NLP processing
)

print(f"Found {len(results)} sensitive terms")
```
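
Because `PatternRecognizer` is regex-based, a deny list can be thought of as one more pattern built from the listed terms. The sketch below shows one way such a pattern could be assembled with the standard `re` module: each term is escaped, the terms are joined with `|`, and the match is required to sit between non-word characters or the string edges. The function name `deny_list_to_regex` and the exact boundary expression are assumptions made for illustration, not the library's confirmed implementation.

```python
import re
from typing import List


def deny_list_to_regex(deny_list: List[str]) -> str:
    """Build a single alternation regex from literal deny-list terms."""
    escaped_terms = [re.escape(term) for term in deny_list]
    # Require a non-word character (or a string edge) on both sides,
    # so "confidential" does not match inside "nonconfidential".
    return r"(?:^|(?<=\W))(" + "|".join(escaped_terms) + r")(?:(?=\W)|$)"


regex = deny_list_to_regex(["confidential", "internal use only"])
text = "This document is marked as confidential and internal use only"
matches = [m.group(1) for m in re.finditer(regex, text, re.IGNORECASE)]
print(matches)
```

Escaping each term matters: a deny-list entry such as `a.b` should match only that literal string, not any character between `a` and `b`.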

### Creating a Custom Validation Recognizer

```python
import re
from typing import Optional

from presidio_analyzer import PatternRecognizer, Pattern

class CustomCreditCardRecognizer(PatternRecognizer):
    """Custom credit card recognizer with Luhn algorithm validation."""

    def __init__(self):
        patterns = [
            Pattern(
                name="credit_card_generic",
                regex=r"\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b",
                score=0.6  # Lower initial score, validation will increase
            )
        ]

        super().__init__(
            supported_entity="CREDIT_CARD",
            name="CustomCreditCardRecognizer",
            patterns=patterns
        )

    def validate_result(self, pattern_text: str) -> Optional[bool]:
        """Validate credit card number using Luhn algorithm."""
        # Remove separators (hyphens and whitespace)
        digits = re.sub(r'[-\s]', '', pattern_text)

        if not digits.isdigit() or len(digits) != 16:
            return False

        # Luhn algorithm validation
        def luhn_check(card_num):
            def digits_of(n):
                return [int(d) for d in str(n)]

            digits = digits_of(card_num)
            odd_digits = digits[-1::-2]
            even_digits = digits[-2::-2]
            checksum = sum(odd_digits)
            for d in even_digits:
                checksum += sum(digits_of(d*2))
            return checksum % 10 == 0

        return luhn_check(digits)

# Use custom recognizer
recognizer = CustomCreditCardRecognizer()

# Test with valid and invalid credit card numbers
text = "Valid: 4532015112830366, Invalid: 1234567890123456"
results = recognizer.analyze(
    text=text,
    entities=["CREDIT_CARD"],
    nlp_artifacts=None
)

for result in results:
    card_num = text[result.start:result.end]
    print(f"Credit card: {card_num}, Score: {result.score}")
```

### Integrating Custom Recognizer with AnalyzerEngine

```python
from presidio_analyzer import (
    AnalyzerEngine,
    Pattern,
    PatternRecognizer,
    RecognizerRegistry,
)

# Create custom recognizer
custom_recognizer = PatternRecognizer(
    supported_entity="PRODUCT_CODE",
    name="ProductCodeRecognizer",
    patterns=[
        Pattern(
            name="product_code_pattern",
            regex=r"\bPRD-[A-Z]{2}-\d{4}\b",
            score=0.9
        )
    ]
)

# Create registry with custom recognizer
registry = RecognizerRegistry()
registry.add_recognizer(custom_recognizer)

# Load default recognizers
registry.load_predefined_recognizers(languages=["en"])

# Create analyzer with custom registry
analyzer = AnalyzerEngine(registry=registry)

# Test analysis
text = "Order product PRD-AB-1234 and contact john@email.com"
results = analyzer.analyze(text=text, language="en")

for result in results:
    detected_text = text[result.start:result.end]
    print(f"Found {result.entity_type}: '{detected_text}'")
```

### Remote Recognizer Implementation Example

```python
from typing import List

import requests
from presidio_analyzer import RemoteRecognizer, RecognizerResult

class APIBasedRecognizer(RemoteRecognizer):
    """Example remote recognizer that calls external API."""

    def __init__(self, api_endpoint: str, api_key: str):
        super().__init__(
            supported_entities=["CUSTOM_PII"],
            name="APIBasedRecognizer",
            supported_language="en",
            version="1.0.0"
        )
        self.api_endpoint = api_endpoint
        self.api_key = api_key

    def load(self) -> None:
        """Initialize connection to remote service."""
        # Test API connectivity
        headers = {"Authorization": f"Bearer {self.api_key}"}
        response = requests.get(f"{self.api_endpoint}/health", headers=headers)
        if response.status_code != 200:
            raise ConnectionError("Cannot connect to remote PII service")

    def analyze(self, text: str, entities: List[str], nlp_artifacts) -> List[RecognizerResult]:
        """Call remote API for PII detection."""
        if "CUSTOM_PII" not in entities:
            return []

        headers = {"Authorization": f"Bearer {self.api_key}"}
        payload = {"text": text, "entities": entities}

        response = requests.post(
            f"{self.api_endpoint}/analyze",
            json=payload,
            headers=headers
        )

        results = []
        if response.status_code == 200:
            api_results = response.json()
            for detection in api_results.get("detections", []):
                result = RecognizerResult(
                    entity_type=detection["entity_type"],
                    start=detection["start"],
                    end=detection["end"],
                    score=detection["score"]
                )
                results.append(result)

        return results

    def get_supported_entities(self) -> List[str]:
        """Get supported entities from remote service."""
        headers = {"Authorization": f"Bearer {self.api_key}"}
        response = requests.get(f"{self.api_endpoint}/entities", headers=headers)

        if response.status_code == 200:
            return response.json().get("entities", [])
        return self.supported_entities

# Usage (assuming you have an API endpoint)
# remote_recognizer = APIBasedRecognizer(
#     api_endpoint="https://api.example.com/pii",
#     api_key="your-api-key"
# )
# remote_recognizer.load()
```
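
A remote service can return malformed or out-of-range detections, so it can help to validate the response before building results from it. The sketch below is a standalone illustration using only the standard library. It assumes the same response shape as the example above (a `detections` list with `entity_type`, `start`, `end`, and `score`); `Detection` and `parse_detections` are hypothetical names introduced here, not part of Presidio.

```python
from typing import Any, Dict, List, NamedTuple


class Detection(NamedTuple):
    """Hypothetical, library-independent view of one remote detection."""

    entity_type: str
    start: int
    end: int
    score: float


def parse_detections(payload: Dict[str, Any], text_length: int) -> List[Detection]:
    """Keep only well-formed detections whose span fits inside the text."""
    detections = []
    for item in payload.get("detections") or []:
        try:
            detection = Detection(
                entity_type=str(item["entity_type"]),
                start=int(item["start"]),
                end=int(item["end"]),
                score=float(item["score"]),
            )
        except (KeyError, TypeError, ValueError):
            continue  # skip entries with missing or non-numeric fields
        span_ok = 0 <= detection.start < detection.end <= text_length
        score_ok = 0.0 <= detection.score <= 1.0
        if span_ok and score_ok:
            detections.append(detection)
    return detections


payload = {
    "detections": [
        {"entity_type": "CUSTOM_PII", "start": 0, "end": 5, "score": 0.9},
        {"entity_type": "CUSTOM_PII", "start": 3},  # missing fields
        {"entity_type": "CUSTOM_PII", "start": 10, "end": 50, "score": 0.9},  # past end
    ]
}
parsed = parse_detections(payload, text_length=20)
print(parsed)
```

Filtering at this boundary keeps bad spans from reaching later slicing code such as `text[result.start:result.end]`.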

### Configuration-Driven Recognizer Creation

```python
from presidio_analyzer import PatternRecognizer, Pattern
import yaml

def create_recognizer_from_config(config_file: str) -> PatternRecognizer:
    """Create PatternRecognizer from YAML configuration."""
    with open(config_file, 'r') as f:
        config = yaml.safe_load(f)

    # Create patterns from configuration
    patterns = []
    for pattern_config in config.get('patterns', []):
        pattern = Pattern(
            name=pattern_config['name'],
            regex=pattern_config['regex'],
            score=pattern_config['score']
        )
        patterns.append(pattern)

    # Create recognizer
    recognizer = PatternRecognizer(
        supported_entity=config['entity_type'],
        name=config['name'],
        patterns=patterns,
        deny_list=config.get('deny_list', []),
        context=config.get('context', []),
        supported_language=config.get('language', 'en')
    )

    return recognizer

# Example YAML configuration file (recognizer_config.yaml):
"""
name: "CustomBankAccountRecognizer"
entity_type: "BANK_ACCOUNT"
language: "en"
patterns:
  - name: "routing_account_pattern"
    regex: "\\b\\d{9}[-\\s]\\d{10,12}\\b"
    score: 0.8
  - name: "account_number_pattern"
    regex: "Account\\s*:?\\s*(\\d{10,12})"
    score: 0.7
deny_list:
  - "0000000000"
  - "1111111111"
context:
  - "account"
  - "banking"
  - "routing"
"""

# Create recognizer from configuration
# recognizer = create_recognizer_from_config("recognizer_config.yaml")
```
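
Configuration files are easy to get wrong, so it can be useful to check a loaded config before constructing the recognizer. The following sketch works on the already-parsed dictionary (so it needs no YAML library) and returns a list of human-readable problems. The function `find_config_errors` is a hypothetical helper, and the rules it enforces (required `name` and `entity_type`, at least one pattern or deny-list entry, scores between 0 and 1) are assumptions based on the config layout shown above.

```python
import re
from typing import Any, Dict, List


def find_config_errors(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a recognizer config dict."""
    errors = []
    for key in ("name", "entity_type"):
        if not config.get(key):
            errors.append(f"missing required key: {key}")

    patterns = config.get("patterns") or []
    if not patterns and not config.get("deny_list"):
        errors.append("config needs at least one pattern or deny_list entry")

    for index, pattern in enumerate(patterns):
        label = pattern.get("name") or f"patterns[{index}]"
        regex = pattern.get("regex")
        if not isinstance(regex, str) or not regex:
            errors.append(f"{label}: regex must be a non-empty string")
        else:
            try:
                re.compile(regex)
            except re.error as exc:
                errors.append(f"{label}: invalid regex ({exc})")
        score = pattern.get("score")
        if not isinstance(score, (int, float)) or not 0.0 <= score <= 1.0:
            errors.append(f"{label}: score must be a number between 0 and 1")
    return errors


good = {
    "name": "CustomBankAccountRecognizer",
    "entity_type": "BANK_ACCOUNT",
    "patterns": [
        {"name": "account_number_pattern", "regex": r"Account\s*:?\s*(\d{10,12})", "score": 0.7}
    ],
}
bad = {"name": "Broken", "patterns": [{"name": "p", "regex": "(", "score": 1.5}]}

print(find_config_errors(good))
print(find_config_errors(bad))
```

Collecting all problems in one pass, rather than failing on the first, tends to make config errors quicker to fix.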

## Best Practices

### Pattern Design

- Use word boundaries (`\b`) to avoid partial matches
- Test patterns with various input formats
- Start with lower confidence scores and use validation to increase them
- Include context keywords to improve accuracy
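
The effect of word boundaries is easy to see with the standard `re` module. This small sketch reuses the `EMP-\d{5}` pattern from the usage examples; the sample text is made up for illustration.

```python
import re

text = "Ticket TEMP-123456 was filed by EMP-98765"

# Without boundaries, the pattern also fires inside the longer token "TEMP-123456".
unbounded = re.findall(r"EMP-\d{5}", text)

# With \b on both sides, only the standalone employee ID is matched.
bounded = re.findall(r"\bEMP-\d{5}\b", text)

print(unbounded)
print(bounded)
```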

### Performance Optimization

- Compile patterns once during initialization
- Filter on the requested entity types in `analyze()` and return early when none apply
- Implement efficient validation logic
- Consider caching for expensive operations
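
As a rough illustration of compiling once and caching, the sketch below memoizes compiled regexes with `functools.lru_cache`, keyed on the pattern string and flags. The helper name `get_compiled` is hypothetical; a recognizer could equally store compiled patterns as attributes in its constructor.

```python
import re
from functools import lru_cache


@lru_cache(maxsize=128)
def get_compiled(regex: str, flags: int = 0) -> "re.Pattern[str]":
    """Compile a regex once per (regex, flags) pair and reuse it afterwards."""
    return re.compile(regex, flags)


first = get_compiled(r"\bEMP-\d{5}\b", re.IGNORECASE)
second = get_compiled(r"\bEMP-\d{5}\b", re.IGNORECASE)

# Both calls return the same compiled object; the second is a cache hit.
print(first is second)
print(get_compiled.cache_info().hits)
```

The same decorator can wrap other pure, expensive helpers, such as checksum lookups, as long as their arguments are hashable.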

### Error Handling

- Validate input parameters in constructor
- Handle regex compilation errors gracefully
- Implement proper logging for debugging
- Return empty results rather than raising exceptions for invalid input
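
One way to combine these points is to compile user-supplied patterns defensively, logging and skipping any that fail instead of aborting. The sketch below uses only the standard library; `compile_or_skip` is a hypothetical helper name, and whether to skip or fail fast is a design choice that depends on the application.

```python
import logging
import re
from typing import Optional

logger = logging.getLogger(__name__)


def compile_or_skip(name: str, regex: str) -> "Optional[re.Pattern[str]]":
    """Compile a regex, returning None (and logging a warning) if it is invalid."""
    try:
        return re.compile(regex)
    except re.error as exc:
        logger.warning("Skipping pattern %r: invalid regex (%s)", name, exc)
        return None


candidates = {"employee_id": r"\bEMP-\d{5}\b", "broken": r"(\d{5}"}
compiled = {}
for name, regex in candidates.items():
    pattern = compile_or_skip(name, regex)
    if pattern is not None:
        compiled[name] = pattern

print(sorted(compiled))
```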