
# Context Enhancement

Context-aware enhancement improves PII detection accuracy by analyzing surrounding text and using contextual clues to boost confidence scores for likely PII entities.
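Conceptually, the enhancer inspects a small window of words around each detected entity and looks for supportive keywords there. The following is a minimal, hypothetical sketch of that windowing idea (not the library's implementation; `context_window` is an illustrative name):

```python
def context_window(tokens, entity_index, prefix_count=5, suffix_count=0):
    """Return the words before/after a detected entity that serve as context."""
    start = max(0, entity_index - prefix_count)
    before = tokens[start:entity_index]
    after = tokens[entity_index + 1:entity_index + 1 + suffix_count]
    return before + after

tokens = "Please update my phone number to 555-0199 in the system".split()
# "555-0199" is the detected entity at index 6; "phone" and "number"
# inside the window are the kind of supportive words that trigger a boost.
print(context_window(tokens, 6, prefix_count=5, suffix_count=0))
```

With `prefix_count=5` the window covers "update my phone number to", which is why nearby words such as "phone" can lift the score of an otherwise ambiguous number.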

## Capabilities

### ContextAwareEnhancer Base Class

Abstract base class for implementing context-aware enhancement logic.

```python { .api }
class ContextAwareEnhancer:
    """
    Abstract base class for context-aware enhancement implementations.

    Args:
        context_similarity_factor: Weight factor for context similarity (0.0-1.0)
        min_score_with_context_similarity: Minimum score required for context enhancement
        context_prefix_count: Number of words to analyze before detected entity
        context_suffix_count: Number of words to analyze after detected entity
    """
    def __init__(
        self,
        context_similarity_factor: float,
        min_score_with_context_similarity: float,
        context_prefix_count: int,
        context_suffix_count: int
    ): ...

    def enhance_using_context(
        self,
        text: str,
        raw_results: List[RecognizerResult],
        nlp_artifacts: NlpArtifacts,
        recognizers: List[EntityRecognizer],
        context: Optional[List[str]] = None
    ) -> List[RecognizerResult]:
        """
        Abstract method: Enhance detection results using contextual information.

        Args:
            text: Original input text
            raw_results: Initial detection results from recognizers
            nlp_artifacts: NLP processing artifacts (tokens, lemmas, etc.)
            recognizers: List of all available recognizers
            context: Optional context keywords for enhancement

        Returns:
            Enhanced list of RecognizerResult objects with improved scores
        """

    # Properties
    context_similarity_factor: float  # Weight for context similarity scoring
    min_score_with_context_similarity: float  # Minimum score threshold for enhancement
    context_prefix_count: int  # Words to analyze before entity
    context_suffix_count: int  # Words to analyze after entity

    # Constants
    MIN_SCORE = 0  # Minimum confidence score
    MAX_SCORE = 1.0  # Maximum confidence score
```

### LemmaContextAwareEnhancer

Concrete implementation that uses lemmatization for context-aware enhancement.

```python { .api }
class LemmaContextAwareEnhancer(ContextAwareEnhancer):
    """
    Context-aware enhancer using lemma-based similarity analysis.

    Args:
        context_similarity_factor: Weight factor for similarity scoring (default: 0.35)
        min_score_with_context_similarity: Minimum score for enhancement (default: 0.4)
        context_prefix_count: Words to analyze before entity (default: 5)
        context_suffix_count: Words to analyze after entity (default: 0)
    """
    def __init__(
        self,
        context_similarity_factor: float = 0.35,
        min_score_with_context_similarity: float = 0.4,
        context_prefix_count: int = 5,
        context_suffix_count: int = 0
    ): ...

    def enhance_using_context(
        self,
        text: str,
        raw_results: List[RecognizerResult],
        nlp_artifacts: NlpArtifacts,
        recognizers: List[EntityRecognizer],
        context: Optional[List[str]] = None
    ) -> List[RecognizerResult]:
        """
        Enhance results using lemma-based context analysis.

        Compares lemmatized forms of surrounding words with recognizer context
        keywords to identify supporting contextual evidence.

        Args:
            text: Original input text
            raw_results: Initial detection results
            nlp_artifacts: NLP processing results with lemmas
            recognizers: Available recognizers with context keywords
            context: Additional context keywords for this analysis

        Returns:
            Enhanced RecognizerResult list with boosted confidence scores
        """

    @staticmethod
    def _find_supportive_word_in_context(
        context_list: List[str],
        recognizer_context_list: List[str]
    ) -> str:
        """
        Find context words that support PII detection.

        Args:
            context_list: Surrounding words from text
            recognizer_context_list: Context keywords from recognizer

        Returns:
            First matching supportive word or empty string
        """

    def _extract_surrounding_words(
        self,
        nlp_artifacts: NlpArtifacts,
        word: str,
        start: int
    ) -> List[str]:
        """
        Extract surrounding words from NLP artifacts.

        Args:
            nlp_artifacts: NLP processing results
            word: Target word/entity
            start: Start position of entity in text

        Returns:
            List of surrounding word lemmas
        """
```

## Usage Examples

### Basic Context Enhancement Setup

```python
from presidio_analyzer import AnalyzerEngine, LemmaContextAwareEnhancer

# Create context enhancer with custom settings
enhancer = LemmaContextAwareEnhancer(
    context_similarity_factor=0.45,         # Stronger context influence
    min_score_with_context_similarity=0.3,  # Lower threshold for enhancement
    context_prefix_count=3,                 # Look at 3 words before
    context_suffix_count=2                  # Look at 2 words after
)

# Initialize analyzer with context enhancement
analyzer = AnalyzerEngine(context_aware_enhancer=enhancer)

# Analyze text with context benefit
text = "Please update my phone number to 555-0199 in the system"

# return_decision_process=True is required for analysis_explanation to be populated
results = analyzer.analyze(text=text, language="en", return_decision_process=True)

for result in results:
    detected_text = text[result.start:result.end]
    print(f"Entity: {result.entity_type}")
    print(f"Text: '{detected_text}'")
    print(f"Score: {result.score:.3f}")
    if result.analysis_explanation:
        print(f"Context boost: {result.analysis_explanation.textual_explanation}")
```

### Providing Explicit Context Keywords

```python
from presidio_analyzer import AnalyzerEngine, LemmaContextAwareEnhancer

# Setup context-aware analyzer
enhancer = LemmaContextAwareEnhancer()
analyzer = AnalyzerEngine(context_aware_enhancer=enhancer)

# Text with ambiguous numbers
text = "My new contact is 555-0123 and my employee ID is 98765"

# Provide context to help distinguish phone numbers from other numbers
context_keywords = [
    "contact", "phone", "call", "number", "telephone", "mobile", "cell"
]

results = analyzer.analyze(
    text=text,
    language="en",
    context=context_keywords,
    return_decision_process=True  # needed to inspect analysis_explanation
)

# Context should help boost phone number confidence
for result in results:
    detected_text = text[result.start:result.end]
    print(f"Found {result.entity_type}: '{detected_text}' (score: {result.score:.3f})")

    if result.analysis_explanation and result.analysis_explanation.textual_explanation:
        print(f"  Enhancement: {result.analysis_explanation.textual_explanation}")
```

### Comparing with and without Context Enhancement

```python
from presidio_analyzer import AnalyzerEngine, LemmaContextAwareEnhancer

# Create analyzer without context enhancement
analyzer_basic = AnalyzerEngine()

# Create analyzer with context enhancement
enhancer = LemmaContextAwareEnhancer()
analyzer_enhanced = AnalyzerEngine(context_aware_enhancer=enhancer)

# Test text with contextual clues
text = "The patient's medical record shows phone: 555-0199"

# Analyze without context enhancement
basic_results = analyzer_basic.analyze(text=text, language="en")

# Analyze with context enhancement (request the decision process for explanations)
enhanced_results = analyzer_enhanced.analyze(
    text=text, language="en", return_decision_process=True
)

print("Without context enhancement:")
for result in basic_results:
    if result.entity_type == "PHONE_NUMBER":
        print(f"  Phone score: {result.score:.3f}")

print("\nWith context enhancement:")
for result in enhanced_results:
    if result.entity_type == "PHONE_NUMBER":
        print(f"  Phone score: {result.score:.3f}")
        if result.analysis_explanation:
            print(f"  Explanation: {result.analysis_explanation.textual_explanation}")
```

### Context Enhancement for Multiple Entity Types

```python
from presidio_analyzer import (
    AnalyzerEngine, LemmaContextAwareEnhancer, PatternRecognizer,
    Pattern, RecognizerRegistry
)

# Create custom recognizer with context keywords
employee_recognizer = PatternRecognizer(
    supported_entity="EMPLOYEE_ID",
    name="EmployeeRecognizer",
    patterns=[Pattern("emp_id", r"\b\d{5}\b", 0.6)],
    context=["employee", "staff", "worker", "personnel", "emp"]
)

# Setup context-aware analysis
enhancer = LemmaContextAwareEnhancer(
    context_similarity_factor=0.4,
    min_score_with_context_similarity=0.3
)

# Create a registry with the predefined recognizers plus the custom one
registry = RecognizerRegistry()
registry.load_predefined_recognizers(languages=["en"])
registry.add_recognizer(employee_recognizer)

analyzer = AnalyzerEngine(
    registry=registry,
    context_aware_enhancer=enhancer
)

# Test text with multiple contextual entities
text = """
HR Records:
- Employee John Smith (ID: 12345)
- Contact phone: 555-0199
- SSN for tax purposes: 123-45-6789
"""

results = analyzer.analyze(text=text, language="en", return_decision_process=True)

# Show how context affects different entity types
entity_scores = {}
for result in results:
    entity_type = result.entity_type
    detected_text = text[result.start:result.end]

    if entity_type not in entity_scores:
        entity_scores[entity_type] = []

    entity_scores[entity_type].append({
        'text': detected_text,
        'score': result.score,
        'enhanced': bool(result.analysis_explanation and
                         result.analysis_explanation.textual_explanation)
    })

for entity_type, detections in entity_scores.items():
    print(f"\n{entity_type}:")
    for detection in detections:
        enhancement_marker = " (enhanced)" if detection['enhanced'] else ""
        print(f"  '{detection['text']}': {detection['score']:.3f}{enhancement_marker}")
```

### Fine-tuning Context Parameters

```python
from presidio_analyzer import AnalyzerEngine, LemmaContextAwareEnhancer

def test_context_parameters(text, context_params):
    """Test different context enhancement parameters."""
    results_comparison = {}

    for name, params in context_params.items():
        enhancer = LemmaContextAwareEnhancer(**params)
        analyzer = AnalyzerEngine(context_aware_enhancer=enhancer)

        results = analyzer.analyze(text=text, language="en",
                                   return_decision_process=True)

        results_comparison[name] = []
        for result in results:
            results_comparison[name].append({
                'entity_type': result.entity_type,
                'score': result.score,
                'enhanced': bool(result.analysis_explanation and
                                 result.analysis_explanation.textual_explanation)
            })

    return results_comparison

# Test text
text = "Customer service representative phone number is 555-0123"

# Different parameter configurations
context_configs = {
    'conservative': {
        'context_similarity_factor': 0.2,
        'min_score_with_context_similarity': 0.6,
        'context_prefix_count': 3,
        'context_suffix_count': 0
    },
    'balanced': {
        'context_similarity_factor': 0.35,
        'min_score_with_context_similarity': 0.4,
        'context_prefix_count': 5,
        'context_suffix_count': 0
    },
    'aggressive': {
        'context_similarity_factor': 0.5,
        'min_score_with_context_similarity': 0.2,
        'context_prefix_count': 7,
        'context_suffix_count': 3
    }
}

# Compare results
comparison = test_context_parameters(text, context_configs)

for config_name, results in comparison.items():
    print(f"\n{config_name.upper()} configuration:")
    for result in results:
        enhancement = " (enhanced)" if result['enhanced'] else ""
        print(f"  {result['entity_type']}: {result['score']:.3f}{enhancement}")
```

### Context Enhancement with Custom Entity Types

```python
from presidio_analyzer import (
    AnalyzerEngine, LemmaContextAwareEnhancer, PatternRecognizer,
    Pattern, RecognizerRegistry
)

# Create domain-specific recognizers with context keywords
medical_id_recognizer = PatternRecognizer(
    supported_entity="MEDICAL_ID",
    name="MedicalIdRecognizer",
    patterns=[Pattern("medical_id", r"\bMED-\d{6}\b", 0.7)],
    context=["medical", "patient", "healthcare", "diagnosis", "treatment", "hospital"]
)

patient_id_recognizer = PatternRecognizer(
    supported_entity="PATIENT_ID",
    name="PatientIdRecognizer",
    patterns=[Pattern("patient_id", r"\bPT-\d{5}\b", 0.6)],
    context=["patient", "admission", "discharge", "medical", "record"]
)

# Setup context-aware enhancement
enhancer = LemmaContextAwareEnhancer(
    context_similarity_factor=0.4,
    min_score_with_context_similarity=0.3,
    context_prefix_count=6,  # Look at more words for medical context
    context_suffix_count=2
)

# Create analyzer with the predefined plus medical recognizers
registry = RecognizerRegistry()
registry.load_predefined_recognizers(languages=["en"])
registry.add_recognizer(medical_id_recognizer)
registry.add_recognizer(patient_id_recognizer)

analyzer = AnalyzerEngine(
    registry=registry,
    context_aware_enhancer=enhancer
)

# Medical text with contextual clues
medical_text = """
Patient medical record shows:
- Patient ID: PT-12345 for admission
- Medical diagnosis code: MED-987654
- Contact phone: 555-0199
- Healthcare provider: Dr. Smith
"""

results = analyzer.analyze(text=medical_text, language="en",
                           return_decision_process=True)

# Show context enhancement effects
for result in results:
    detected_text = medical_text[result.start:result.end]
    print(f"\nEntity: {result.entity_type}")
    print(f"Text: '{detected_text}'")
    print(f"Score: {result.score:.3f}")

    if result.analysis_explanation and result.analysis_explanation.textual_explanation:
        print(f"Context enhancement: {result.analysis_explanation.textual_explanation}")
```

### Debugging Context Enhancement

```python
from presidio_analyzer import AnalyzerEngine, LemmaContextAwareEnhancer

# Enable detailed decision process logging
enhancer = LemmaContextAwareEnhancer()
analyzer = AnalyzerEngine(
    context_aware_enhancer=enhancer,
    log_decision_process=True  # Enable detailed logging
)

text = "Update customer phone number 555-0123 in the database"

results = analyzer.analyze(
    text=text,
    language="en",
    return_decision_process=True  # Include decision details in results
)

for result in results:
    detected_text = text[result.start:result.end]
    print(f"\nDetected: {result.entity_type} - '{detected_text}'")
    print(f"Final Score: {result.score:.3f}")

    if result.analysis_explanation:
        exp = result.analysis_explanation
        print(f"Original Score: {exp.original_score:.3f}")
        if exp.score != exp.original_score:
            score_change = exp.score - exp.original_score
            print(f"Score Change: +{score_change:.3f}")

        if exp.textual_explanation:
            print(f"Explanation: {exp.textual_explanation}")
```

### Performance Considerations for Context Enhancement

```python
import time

from presidio_analyzer import AnalyzerEngine, LemmaContextAwareEnhancer

def benchmark_context_enhancement(texts, with_context=True):
    """Benchmark context enhancement performance."""

    if with_context:
        enhancer = LemmaContextAwareEnhancer()
        analyzer = AnalyzerEngine(context_aware_enhancer=enhancer)
        label = "with context enhancement"
    else:
        analyzer = AnalyzerEngine()
        label = "without context enhancement"

    start_time = time.time()

    total_results = 0
    for text in texts:
        results = analyzer.analyze(text=text, language="en")
        total_results += len(results)

    end_time = time.time()
    processing_time = end_time - start_time

    print(f"Processing {len(texts)} texts {label}:")
    print(f"  Time: {processing_time:.3f} seconds")
    print(f"  Results: {total_results}")
    print(f"  Rate: {len(texts)/processing_time:.1f} texts/second")

    return processing_time

# Test texts
test_texts = [
    "Customer phone number is 555-0123",
    "Employee ID 12345 needs update",
    "Medical record MED-98765 for patient",
    "Contact email john@company.com for support",
    "SSN 123-45-6789 for tax purposes"
] * 20  # 100 texts total

# Benchmark both configurations
time_without = benchmark_context_enhancement(test_texts, with_context=False)
time_with = benchmark_context_enhancement(test_texts, with_context=True)

overhead = ((time_with - time_without) / time_without) * 100
print(f"\nContext enhancement overhead: {overhead:.1f}%")
```

## Context Enhancement Best Practices

### When to Use Context Enhancement

- **Ambiguous patterns**: Numbers that could be phone numbers, IDs, or dates
- **Low-confidence detections**: Borderline matches that need confirmation
- **Domain-specific text**: Medical, legal, financial documents with specialized terminology
- **Multi-language content**: Where context helps disambiguate similar patterns

### Tuning Parameters

- **context_similarity_factor**:
  - Lower (0.1-0.3): Conservative enhancement
  - Higher (0.4-0.6): Aggressive enhancement
- **min_score_with_context_similarity**:
  - Higher (0.6+): Only enhance high-confidence detections
  - Lower (0.2-0.4): Enhance more borderline cases
- **context_prefix_count**:
  - 3-5: Standard context window
  - 7+: Larger context for complex documents
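To see how the first two knobs interact, here is a rough, hypothetical sketch of an additive boost (an approximation for intuition, not the library's exact scoring code): when a supportive word is found, the raw score gains `context_similarity_factor`, is raised to at least `min_score_with_context_similarity`, and is capped at 1.0.

```python
def enhanced_score(raw_score: float,
                   context_similarity_factor: float,
                   min_score_with_context_similarity: float) -> float:
    """Approximate score after a supportive context word is found."""
    boosted = raw_score + context_similarity_factor
    # A supported match gets at least the minimum context score...
    boosted = max(boosted, min_score_with_context_similarity)
    # ...and never exceeds the maximum confidence of 1.0.
    return min(boosted, 1.0)

raw = 0.45  # a borderline detection
configs = {
    "conservative": dict(context_similarity_factor=0.2,
                         min_score_with_context_similarity=0.6),
    "aggressive": dict(context_similarity_factor=0.5,
                       min_score_with_context_similarity=0.2),
}
for name, cfg in configs.items():
    print(f"{name}: {enhanced_score(raw, **cfg):.2f}")
```

Under these assumptions the same 0.45 detection ends up noticeably higher with the aggressive settings than with the conservative ones, which is why the factor and threshold should be tuned together.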

### Performance Optimization

- Context enhancement adds processing overhead (~10-30%)
- Consider disabling for high-throughput, low-accuracy scenarios
- Use smaller context windows for better performance
- Pre-compute NLP artifacts when analyzing multiple times