# Context Enhancement

Context-aware enhancement improves PII detection accuracy by analyzing the text surrounding a detected entity and using contextual clues to boost confidence scores for likely PII entities.

## Capabilities

### ContextAwareEnhancer Base Class

Abstract base class for implementing context-aware enhancement logic.

```python { .api }
class ContextAwareEnhancer:
    """
    Abstract base class for context-aware enhancement implementations.

    Args:
        context_similarity_factor: Weight factor for context similarity (0.0-1.0)
        min_score_with_context_similarity: Minimum score required for context enhancement
        context_prefix_count: Number of words to analyze before detected entity
        context_suffix_count: Number of words to analyze after detected entity
    """
    def __init__(
        self,
        context_similarity_factor: float,
        min_score_with_context_similarity: float,
        context_prefix_count: int,
        context_suffix_count: int
    ): ...

    def enhance_using_context(
        self,
        text: str,
        raw_results: List[RecognizerResult],
        nlp_artifacts: NlpArtifacts,
        recognizers: List[EntityRecognizer],
        context: Optional[List[str]] = None
    ) -> List[RecognizerResult]:
        """
        Abstract method: enhance detection results using contextual information.

        Args:
            text: Original input text
            raw_results: Initial detection results from recognizers
            nlp_artifacts: NLP processing artifacts (tokens, lemmas, etc.)
            recognizers: List of all available recognizers
            context: Optional context keywords for enhancement

        Returns:
            Enhanced list of RecognizerResult objects with improved scores
        """

    # Properties
    context_similarity_factor: float          # Weight for context similarity scoring
    min_score_with_context_similarity: float  # Minimum score threshold for enhancement
    context_prefix_count: int                 # Words to analyze before entity
    context_suffix_count: int                 # Words to analyze after entity

    # Constants
    MIN_SCORE = 0    # Minimum confidence score
    MAX_SCORE = 1.0  # Maximum confidence score
```
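A custom enhancer subclasses this base and overrides `enhance_using_context`. The following self-contained sketch uses simplified stand-in types rather than the real `presidio_analyzer` classes (and a trimmed-down method signature), so it illustrates the structure only:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class RecognizerResult:  # stand-in for presidio_analyzer.RecognizerResult
    entity_type: str
    start: int
    end: int
    score: float


class ContextAwareEnhancer(ABC):
    MIN_SCORE = 0
    MAX_SCORE = 1.0

    def __init__(self, context_similarity_factor: float,
                 min_score_with_context_similarity: float,
                 context_prefix_count: int, context_suffix_count: int):
        self.context_similarity_factor = context_similarity_factor
        self.min_score_with_context_similarity = min_score_with_context_similarity
        self.context_prefix_count = context_prefix_count
        self.context_suffix_count = context_suffix_count

    @abstractmethod
    def enhance_using_context(self, text: str, raw_results: List[RecognizerResult],
                              context: Optional[List[str]] = None) -> List[RecognizerResult]:
        ...


class KeywordBoostEnhancer(ContextAwareEnhancer):
    """Toy enhancer: boost every result when any context keyword appears in the text."""

    def enhance_using_context(self, text, raw_results, context=None):
        keywords = [k.lower() for k in (context or [])]
        if not any(k in text.lower() for k in keywords):
            return raw_results
        # Add the similarity factor, capped at MAX_SCORE
        return [replace(r, score=min(r.score + self.context_similarity_factor, self.MAX_SCORE))
                for r in raw_results]


enhancer = KeywordBoostEnhancer(0.35, 0.4, 5, 0)
raw = [RecognizerResult("PHONE_NUMBER", 33, 41, 0.5)]
enhanced = enhancer.enhance_using_context(
    "Please update my phone number to 555-0199", raw, context=["phone"])
print(f"{enhanced[0].score:.2f}")  # 0.85
```

The real base class additionally receives `nlp_artifacts` and the recognizer list, which concrete implementations use to inspect lemmas and per-recognizer context keywords.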
### LemmaContextAwareEnhancer

Concrete implementation that uses lemmatization for context-aware enhancement.

```python { .api }
class LemmaContextAwareEnhancer(ContextAwareEnhancer):
    """
    Context-aware enhancer using lemma-based similarity analysis.

    Args:
        context_similarity_factor: Weight factor for similarity scoring (default: 0.35)
        min_score_with_context_similarity: Minimum score for enhancement (default: 0.4)
        context_prefix_count: Words to analyze before entity (default: 5)
        context_suffix_count: Words to analyze after entity (default: 0)
    """
    def __init__(
        self,
        context_similarity_factor: float = 0.35,
        min_score_with_context_similarity: float = 0.4,
        context_prefix_count: int = 5,
        context_suffix_count: int = 0
    ): ...

    def enhance_using_context(
        self,
        text: str,
        raw_results: List[RecognizerResult],
        nlp_artifacts: NlpArtifacts,
        recognizers: List[EntityRecognizer],
        context: Optional[List[str]] = None
    ) -> List[RecognizerResult]:
        """
        Enhance results using lemma-based context analysis.

        Compares lemmatized forms of surrounding words with recognizer context
        keywords to identify supporting contextual evidence.

        Args:
            text: Original input text
            raw_results: Initial detection results
            nlp_artifacts: NLP processing results with lemmas
            recognizers: Available recognizers with context keywords
            context: Additional context keywords for this analysis

        Returns:
            Enhanced RecognizerResult list with boosted confidence scores
        """

    @staticmethod
    def _find_supportive_word_in_context(
        context_list: List[str],
        recognizer_context_list: List[str]
    ) -> str:
        """
        Find context words that support PII detection.

        Args:
            context_list: Surrounding words from text
            recognizer_context_list: Context keywords from recognizer

        Returns:
            First matching supportive word, or an empty string
        """

    def _extract_surrounding_words(
        self,
        nlp_artifacts: NlpArtifacts,
        word: str,
        start: int
    ) -> List[str]:
        """
        Extract surrounding words from NLP artifacts.

        Args:
            nlp_artifacts: NLP processing results
            word: Target word/entity
            start: Start position of entity in text

        Returns:
            List of surrounding word lemmas
        """
```
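The core boost behavior can be sketched in plain Python. This is a simplified illustration of the mechanism described above (find a supportive word among the surrounding lemmas, then add `context_similarity_factor` to the score, clamped to the configured bounds), not the actual implementation:

```python
from typing import List

CONTEXT_SIMILARITY_FACTOR = 0.35
MIN_SCORE_WITH_CONTEXT_SIMILARITY = 0.4
MAX_SCORE = 1.0


def find_supportive_word_in_context(context_list: List[str],
                                    recognizer_context_list: List[str]) -> str:
    """Return the first surrounding word that matches a recognizer context keyword."""
    keywords = {w.lower() for w in recognizer_context_list}
    for word in context_list:
        if word.lower() in keywords:
            return word
    return ""


def boost_score(raw_score: float, supportive_word: str) -> float:
    """Add the similarity factor when supportive context was found (simplified)."""
    if not supportive_word:
        return raw_score
    boosted = min(raw_score + CONTEXT_SIMILARITY_FACTOR, MAX_SCORE)
    # Enhanced results are lifted to at least the configured minimum
    return max(boosted, MIN_SCORE_WITH_CONTEXT_SIMILARITY)


# Surrounding lemmas for "555-0199" in "Please update my phone number to 555-0199"
surrounding = ["please", "update", "my", "phone", "number", "to"]
word = find_supportive_word_in_context(surrounding, ["phone", "telephone", "cell"])
print(word)                              # phone
print(f"{boost_score(0.1, word):.2f}")   # 0.45 — boosted, above the 0.4 floor
print(f"{boost_score(0.9, word):.2f}")   # 1.00 — capped at MAX_SCORE
```

Note how both bounds participate: weak raw scores are lifted toward the minimum, and strong ones are capped at 1.0.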
## Usage Examples

### Basic Context Enhancement Setup

```python
from presidio_analyzer import AnalyzerEngine, LemmaContextAwareEnhancer

# Create context enhancer with custom settings
enhancer = LemmaContextAwareEnhancer(
    context_similarity_factor=0.45,         # Stronger context influence
    min_score_with_context_similarity=0.3,  # Lower threshold for enhancement
    context_prefix_count=3,                 # Look at 3 words before
    context_suffix_count=2                  # Look at 2 words after
)

# Initialize analyzer with context enhancement
analyzer = AnalyzerEngine(context_aware_enhancer=enhancer)

# Analyze text that benefits from context
text = "Please update my phone number to 555-0199 in the system"

results = analyzer.analyze(
    text=text,
    language="en",
    return_decision_process=True  # Needed so analysis_explanation is populated
)

for result in results:
    detected_text = text[result.start:result.end]
    print(f"Entity: {result.entity_type}")
    print(f"Text: '{detected_text}'")
    print(f"Score: {result.score:.3f}")
    if result.analysis_explanation:
        print(f"Context boost: {result.analysis_explanation.textual_explanation}")
```
### Providing Explicit Context Keywords

```python
from presidio_analyzer import AnalyzerEngine, LemmaContextAwareEnhancer

# Set up a context-aware analyzer
enhancer = LemmaContextAwareEnhancer()
analyzer = AnalyzerEngine(context_aware_enhancer=enhancer)

# Text with ambiguous numbers
text = "My new contact is 555-0123 and my employee ID is 98765"

# Provide context to help distinguish phone numbers from other numbers
context_keywords = [
    "contact", "phone", "call", "number", "telephone", "mobile", "cell"
]

results = analyzer.analyze(
    text=text,
    language="en",
    context=context_keywords,
    return_decision_process=True  # Needed so analysis_explanation is populated
)

# Context should help boost phone number confidence
for result in results:
    detected_text = text[result.start:result.end]
    print(f"Found {result.entity_type}: '{detected_text}' (score: {result.score:.3f})")

    if result.analysis_explanation and result.analysis_explanation.textual_explanation:
        print(f"  Enhancement: {result.analysis_explanation.textual_explanation}")
```
### Comparing with and without Context Enhancement

```python
from presidio_analyzer import AnalyzerEngine, LemmaContextAwareEnhancer

# Create analyzer without context enhancement
analyzer_basic = AnalyzerEngine()

# Create analyzer with context enhancement
enhancer = LemmaContextAwareEnhancer()
analyzer_enhanced = AnalyzerEngine(context_aware_enhancer=enhancer)

# Test text with contextual clues
text = "The patient's medical record shows phone: 555-0199"

# Analyze without context enhancement
basic_results = analyzer_basic.analyze(text=text, language="en")

# Analyze with context enhancement
enhanced_results = analyzer_enhanced.analyze(
    text=text,
    language="en",
    return_decision_process=True  # Needed so analysis_explanation is populated
)

print("Without context enhancement:")
for result in basic_results:
    if result.entity_type == "PHONE_NUMBER":
        print(f"  Phone score: {result.score:.3f}")

print("\nWith context enhancement:")
for result in enhanced_results:
    if result.entity_type == "PHONE_NUMBER":
        print(f"  Phone score: {result.score:.3f}")
        if result.analysis_explanation:
            print(f"  Explanation: {result.analysis_explanation.textual_explanation}")
```
### Context Enhancement for Multiple Entity Types

```python
from presidio_analyzer import (
    AnalyzerEngine, LemmaContextAwareEnhancer, PatternRecognizer,
    Pattern, RecognizerRegistry
)

# Create custom recognizer with context keywords
employee_recognizer = PatternRecognizer(
    supported_entity="EMPLOYEE_ID",
    name="EmployeeRecognizer",
    patterns=[Pattern("emp_id", r"\b\d{5}\b", 0.6)],
    context=["employee", "staff", "worker", "personnel", "emp"]
)

# Set up context-aware analysis
enhancer = LemmaContextAwareEnhancer(
    context_similarity_factor=0.4,
    min_score_with_context_similarity=0.3
)

# Create analyzer with the custom recognizer and context enhancement
registry = RecognizerRegistry()
registry.load_predefined_recognizers(languages=["en"])
registry.add_recognizer(employee_recognizer)

analyzer = AnalyzerEngine(
    registry=registry,
    context_aware_enhancer=enhancer
)

# Test text with multiple contextual entities
text = """
HR Records:
- Employee John Smith (ID: 12345)
- Contact phone: 555-0199
- SSN for tax purposes: 123-45-6789
"""

results = analyzer.analyze(
    text=text,
    language="en",
    return_decision_process=True  # Needed so analysis_explanation is populated
)

# Show how context affects different entity types
entity_scores = {}
for result in results:
    detected_text = text[result.start:result.end]

    entity_scores.setdefault(result.entity_type, []).append({
        'text': detected_text,
        'score': result.score,
        'enhanced': bool(result.analysis_explanation and
                         result.analysis_explanation.textual_explanation)
    })

for entity_type, detections in entity_scores.items():
    print(f"\n{entity_type}:")
    for detection in detections:
        enhancement_marker = " (enhanced)" if detection['enhanced'] else ""
        print(f"  '{detection['text']}': {detection['score']:.3f}{enhancement_marker}")
```
### Fine-tuning Context Parameters

```python
from presidio_analyzer import AnalyzerEngine, LemmaContextAwareEnhancer

def test_context_parameters(text, context_params):
    """Test different context enhancement parameters."""
    results_comparison = {}

    for name, params in context_params.items():
        enhancer = LemmaContextAwareEnhancer(**params)
        analyzer = AnalyzerEngine(context_aware_enhancer=enhancer)

        results = analyzer.analyze(
            text=text,
            language="en",
            return_decision_process=True  # Needed so analysis_explanation is populated
        )

        results_comparison[name] = []
        for result in results:
            results_comparison[name].append({
                'entity_type': result.entity_type,
                'score': result.score,
                'enhanced': bool(result.analysis_explanation and
                                 result.analysis_explanation.textual_explanation)
            })

    return results_comparison

# Test text
text = "Customer service representative phone number is 555-0123"

# Different parameter configurations
context_configs = {
    'conservative': {
        'context_similarity_factor': 0.2,
        'min_score_with_context_similarity': 0.6,
        'context_prefix_count': 3,
        'context_suffix_count': 0
    },
    'balanced': {
        'context_similarity_factor': 0.35,
        'min_score_with_context_similarity': 0.4,
        'context_prefix_count': 5,
        'context_suffix_count': 0
    },
    'aggressive': {
        'context_similarity_factor': 0.5,
        'min_score_with_context_similarity': 0.2,
        'context_prefix_count': 7,
        'context_suffix_count': 3
    }
}

# Compare results
comparison = test_context_parameters(text, context_configs)

for config_name, results in comparison.items():
    print(f"\n{config_name.upper()} configuration:")
    for result in results:
        enhancement = " (enhanced)" if result['enhanced'] else ""
        print(f"  {result['entity_type']}: {result['score']:.3f}{enhancement}")
```
### Context Enhancement with Custom Entity Types

```python
from presidio_analyzer import (
    AnalyzerEngine, LemmaContextAwareEnhancer, PatternRecognizer,
    Pattern, RecognizerRegistry
)

# Create domain-specific recognizers with context keywords
medical_id_recognizer = PatternRecognizer(
    supported_entity="MEDICAL_ID",
    name="MedicalIdRecognizer",
    patterns=[Pattern("medical_id", r"\bMED-\d{6}\b", 0.7)],
    context=["medical", "patient", "healthcare", "diagnosis", "treatment", "hospital"]
)

patient_id_recognizer = PatternRecognizer(
    supported_entity="PATIENT_ID",
    name="PatientIdRecognizer",
    patterns=[Pattern("patient_id", r"\bPT-\d{5}\b", 0.6)],
    context=["patient", "admission", "discharge", "medical", "record"]
)

# Set up context-aware enhancement
enhancer = LemmaContextAwareEnhancer(
    context_similarity_factor=0.4,
    min_score_with_context_similarity=0.3,
    context_prefix_count=6,  # Look at more words for medical context
    context_suffix_count=2
)

# Create analyzer with the medical recognizers
registry = RecognizerRegistry()
registry.load_predefined_recognizers(languages=["en"])
registry.add_recognizer(medical_id_recognizer)
registry.add_recognizer(patient_id_recognizer)

analyzer = AnalyzerEngine(
    registry=registry,
    context_aware_enhancer=enhancer
)

# Medical text with contextual clues
medical_text = """
Patient medical record shows:
- Patient ID: PT-12345 for admission
- Medical diagnosis code: MED-987654
- Contact phone: 555-0199
- Healthcare provider: Dr. Smith
"""

results = analyzer.analyze(
    text=medical_text,
    language="en",
    return_decision_process=True  # Needed so analysis_explanation is populated
)

# Show context enhancement effects
for result in results:
    detected_text = medical_text[result.start:result.end]
    print(f"\nEntity: {result.entity_type}")
    print(f"Text: '{detected_text}'")
    print(f"Score: {result.score:.3f}")

    if result.analysis_explanation and result.analysis_explanation.textual_explanation:
        print(f"Context enhancement: {result.analysis_explanation.textual_explanation}")
```
### Debugging Context Enhancement

```python
from presidio_analyzer import AnalyzerEngine, LemmaContextAwareEnhancer

# Enable detailed decision process logging
enhancer = LemmaContextAwareEnhancer()
analyzer = AnalyzerEngine(
    context_aware_enhancer=enhancer,
    log_decision_process=True  # Enable detailed logging
)

text = "Update customer phone number 555-0123 in the database"

results = analyzer.analyze(
    text=text,
    language="en",
    return_decision_process=True  # Include decision details in results
)

for result in results:
    detected_text = text[result.start:result.end]
    print(f"\nDetected: {result.entity_type} - '{detected_text}'")
    print(f"Final Score: {result.score:.3f}")

    if result.analysis_explanation:
        exp = result.analysis_explanation
        print(f"Original Score: {exp.original_score:.3f}")
        if exp.score != exp.original_score:
            score_change = exp.score - exp.original_score
            print(f"Score Change: +{score_change:.3f}")

        if exp.textual_explanation:
            print(f"Explanation: {exp.textual_explanation}")
```
### Performance Considerations for Context Enhancement

```python
import time

from presidio_analyzer import AnalyzerEngine, LemmaContextAwareEnhancer

def benchmark_context_enhancement(texts, with_context=True):
    """Benchmark context enhancement performance."""

    if with_context:
        enhancer = LemmaContextAwareEnhancer()
        analyzer = AnalyzerEngine(context_aware_enhancer=enhancer)
        label = "with context enhancement"
    else:
        analyzer = AnalyzerEngine()
        label = "without context enhancement"

    start_time = time.time()

    total_results = 0
    for text in texts:
        results = analyzer.analyze(text=text, language="en")
        total_results += len(results)

    end_time = time.time()
    processing_time = end_time - start_time

    print(f"Processing {len(texts)} texts {label}:")
    print(f"  Time: {processing_time:.3f} seconds")
    print(f"  Results: {total_results}")
    print(f"  Rate: {len(texts)/processing_time:.1f} texts/second")

    return processing_time

# Test texts
test_texts = [
    "Customer phone number is 555-0123",
    "Employee ID 12345 needs update",
    "Medical record MED-98765 for patient",
    "Contact email john@company.com for support",
    "SSN 123-45-6789 for tax purposes"
] * 20  # 100 texts total

# Benchmark both configurations
time_without = benchmark_context_enhancement(test_texts, with_context=False)
time_with = benchmark_context_enhancement(test_texts, with_context=True)

overhead = ((time_with - time_without) / time_without) * 100
print(f"\nContext enhancement overhead: {overhead:.1f}%")
```
## Context Enhancement Best Practices

### When to Use Context Enhancement

- **Ambiguous patterns**: Numbers that could be phone numbers, IDs, or dates
- **Low-confidence detections**: Borderline matches that need confirmation
- **Domain-specific text**: Medical, legal, and financial documents with specialized terminology
- **Multi-language content**: Where context helps disambiguate similar patterns

### Tuning Parameters

- **context_similarity_factor**:
  - Lower (0.1-0.3): Conservative enhancement
  - Higher (0.4-0.6): Aggressive enhancement
- **min_score_with_context_similarity**:
  - Higher (0.6+): Only enhance high-confidence detections
  - Lower (0.2-0.4): Enhance more borderline cases
- **context_prefix_count**:
  - 3-5: Standard context window
  - 7+: Larger context for complex documents
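As a rough illustration of how these knobs interact (assuming, as a simplification, that a context match adds `context_similarity_factor` to the raw score and the result is clamped between `min_score_with_context_similarity` and 1.0), a raw score of 0.5 lands differently under each profile:

```python
def enhanced_score(raw: float, factor: float, min_with_context: float) -> float:
    """Simplified model: add the factor on a context match, clamp to [min_with_context, 1.0]."""
    return min(max(raw + factor, min_with_context), 1.0)

raw = 0.5
profiles = {"conservative": 0.2, "balanced": 0.35, "aggressive": 0.5}
for name, factor in profiles.items():
    print(f"{name}: {enhanced_score(raw, factor, 0.4):.2f}")
# conservative: 0.70
# balanced: 0.85
# aggressive: 1.00
```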
### Performance Optimization

- Context enhancement adds processing overhead (roughly 10-30%)
- Consider disabling it for high-throughput scenarios where accuracy is less critical
- Use smaller context windows for better performance
- Pre-compute NLP artifacts when analyzing the same text multiple times