# Configuration and Setup

Presidio Analyzer provides flexible configuration through YAML files, supporting multiple NLP engines, customizable recognizer registries, and various deployment scenarios.

## Capabilities

### AnalyzerEngineProvider

Utility class for creating AnalyzerEngine instances from YAML configuration files.

```python { .api }
class AnalyzerEngineProvider:
    """
    Factory class for creating configured AnalyzerEngine instances.

    Args:
        analyzer_engine_conf_file: Path to analyzer configuration YAML file
        nlp_engine_conf_file: Path to NLP engine configuration YAML file
        recognizer_registry_conf_file: Path to recognizer registry configuration YAML file
    """
    def __init__(
        self,
        analyzer_engine_conf_file: Optional[Union[Path, str]] = None,
        nlp_engine_conf_file: Optional[Union[Path, str]] = None,
        recognizer_registry_conf_file: Optional[Union[Path, str]] = None
    ): ...

    def create_engine(self) -> AnalyzerEngine:
        """
        Create and configure AnalyzerEngine from configuration files.

        Returns:
            Fully configured AnalyzerEngine instance
        """

    def get_configuration(self, conf_file: Optional[Union[Path, str]]) -> Dict[str, Any]:
        """
        Load configuration from YAML file.

        Args:
            conf_file: Path to configuration file

        Returns:
            Dictionary containing configuration data
        """

    # Properties
    configuration: Dict[str, Any]  # Loaded configuration data
    nlp_engine_conf_file: Optional[str]  # Path to NLP engine configuration
    recognizer_registry_conf_file: Optional[str]  # Path to recognizer registry configuration
```

### RecognizerRegistry

Registry that manages and organizes entity recognizers for the analyzer.

```python { .api }
class RecognizerRegistry:
    """
    Registry for managing entity recognizers.

    Args:
        recognizers: Initial collection of recognizers to register
        global_regex_flags: Default regex compilation flags for pattern recognizers
        supported_languages: List of supported language codes
    """
    def __init__(
        self,
        recognizers: Optional[Iterable[EntityRecognizer]] = None,
        global_regex_flags: Optional[int] = None,  # Default: re.DOTALL | re.MULTILINE | re.IGNORECASE
        supported_languages: Optional[List[str]] = None
    ): ...

    def load_predefined_recognizers(
        self,
        languages: Optional[List[str]] = None,
        nlp_engine: Optional[NlpEngine] = None
    ) -> None:
        """
        Load built-in recognizers into the registry.

        Args:
            languages: Language codes for recognizers to load (None = all supported)
            nlp_engine: NLP engine instance for NLP-based recognizers
        """

    def add_nlp_recognizer(self, nlp_engine: NlpEngine) -> None:
        """
        Add an NLP-based recognizer (spaCy, Stanza, or Transformers) to the registry.

        Args:
            nlp_engine: Configured NLP engine instance
        """

    # Properties
    recognizers: List[EntityRecognizer]  # List of registered recognizers
    global_regex_flags: Optional[int]  # Default regex flags
    supported_languages: Optional[List[str]]  # Supported language codes
```

### NlpEngineProvider

Factory for creating configured NLP engine instances.

```python { .api }
class NlpEngineProvider:
    """
    Factory class for creating NLP engine instances from configuration.

    Args:
        nlp_configuration: Dictionary containing NLP engine configuration
    """
    def __init__(self, nlp_configuration: Optional[Dict] = None): ...

    def create_engine(self) -> NlpEngine:
        """
        Create an NLP engine instance based on configuration.

        Returns:
            Configured NLP engine (spaCy, Stanza, or Transformers)
        """

    @staticmethod
    def create_nlp_engine_with_spacy(
        model_name: str,
        nlp_ta_prefix_list: Optional[List[str]] = None
    ) -> SpacyNlpEngine:
        """Create a spaCy-based NLP engine with the specified model."""

    @staticmethod
    def create_nlp_engine_with_stanza(
        model_name: str,
        nlp_ta_prefix_list: Optional[List[str]] = None
    ) -> StanzaNlpEngine:
        """Create a Stanza-based NLP engine with the specified model."""

    @staticmethod
    def create_nlp_engine_with_transformers(
        model_name: str,
        nlp_ta_prefix_list: Optional[List[str]] = None
    ) -> TransformersNlpEngine:
        """Create a Transformers-based NLP engine with the specified model."""
```

## Configuration File Formats

### Default Analyzer Configuration

```yaml
# default_analyzer.yaml
nlp_engine_name: spacy
models:
  - lang_code: en
    model_name: en_core_web_lg
  - lang_code: es
    model_name: es_core_news_md

# Context enhancement settings
context_aware_enhancer:
  enable: true
  context_similarity_factor: 0.35
  min_score_with_context_similarity: 0.4
  context_prefix_count: 5
  context_suffix_count: 0

# Default score threshold
default_score_threshold: 0.0

# Supported languages
supported_languages:
  - en
  - es
  - fr
  - de
  - it
```
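To make the `context_aware_enhancer` knobs concrete, here is an illustrative sketch (not the library's code) of how a context-word hit could adjust a recognizer score, assuming the behavior the parameters describe: boost the raw score by `context_similarity_factor`, floor the boosted score at `min_score_with_context_similarity`, and cap it at 1.0.

```python
# Illustrative sketch only; the real logic lives in the library's
# context-aware enhancer. The formula below is an assumption based on
# the parameter names and defaults shown in the YAML above.
def enhance_score(
    raw_score: float,
    context_word_found: bool,
    context_similarity_factor: float = 0.35,
    min_score_with_context_similarity: float = 0.4,
) -> float:
    if not context_word_found:
        return raw_score
    boosted = raw_score + context_similarity_factor
    boosted = max(boosted, min_score_with_context_similarity)
    return min(boosted, 1.0)

print(round(enhance_score(0.3, context_word_found=False), 2))  # 0.3  (no context hit)
print(round(enhance_score(0.3, context_word_found=True), 2))   # 0.65 (boosted)
print(round(enhance_score(0.01, context_word_found=True), 2))  # 0.4  (floored)
```

The floor matters for weak pattern matches: a low-confidence hit near a strong context word is promoted to at least the configured minimum.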

### NLP Engine Configurations

#### spaCy Configuration

```yaml
# spacy.yaml
nlp_engine_name: spacy
models:
  - lang_code: en
    model_name: en_core_web_lg
  - lang_code: es
    model_name: es_core_news_md
  - lang_code: fr
    model_name: fr_core_news_md
  - lang_code: de
    model_name: de_core_news_md
  - lang_code: it
    model_name: it_core_news_md
```

#### Stanza Configuration

```yaml
# stanza.yaml
nlp_engine_name: stanza
models:
  - lang_code: en
    model_name: en
  - lang_code: es
    model_name: es
  - lang_code: fr
    model_name: fr
  - lang_code: de
    model_name: de
  - lang_code: it
    model_name: it
```

#### Transformers Configuration

```yaml
# transformers.yaml
nlp_engine_name: transformers
models:
  - lang_code: en
    model_name: dslim/bert-base-NER
  - lang_code: es
    model_name: mrm8488/bert-spanish-cased-finetuned-ner
```

### Recognizer Registry Configuration

```yaml
# default_recognizers.yaml
recognizers:
  - name: "CreditCardRecognizer"
    supported_language: "en"
    supported_entities: ["CREDIT_CARD"]
    patterns:
      - name: "credit_card_visa"
        regex: "4[0-9]{12}(?:[0-9]{3})?"
        score: 0.9
      - name: "credit_card_mastercard"
        regex: "5[1-5][0-9]{14}"
        score: 0.9
    context: ["credit", "card", "payment"]

  - name: "PhoneRecognizer"
    supported_language: "en"
    supported_entities: ["PHONE_NUMBER"]
    patterns:
      - name: "us_phone"
        regex: "\\b\\d{3}[-.]?\\d{3}[-.]?\\d{4}\\b"
        score: 0.7
    context: ["phone", "call", "number", "contact"]
```
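Before wiring a pattern into a recognizer configuration, it is worth sanity-checking the regex in plain Python with the registry's default flags (`re.DOTALL | re.MULTILINE | re.IGNORECASE`, per the `RecognizerRegistry` API above). The sample inputs below are made up for illustration.

```python
import re

# Default flags used for pattern recognizers, as documented above.
FLAGS = re.DOTALL | re.MULTILINE | re.IGNORECASE

# Patterns copied from the YAML configuration.
visa = re.compile(r"4[0-9]{12}(?:[0-9]{3})?", FLAGS)
us_phone = re.compile(r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b", FLAGS)

assert visa.search("Card: 4111111111111111") is not None   # 16-digit Visa test number
assert us_phone.search("call 555-123-4567") is not None
assert us_phone.search("not a number") is None
print("pattern sanity checks passed")
```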

## Usage Examples

### Basic Configuration Setup

```python
from presidio_analyzer import AnalyzerEngineProvider

# Create analyzer from default configuration
provider = AnalyzerEngineProvider()
analyzer = provider.create_engine()

# Use the configured analyzer
text = "Contact John at john@email.com or call 555-123-4567"
results = analyzer.analyze(text=text, language="en")

print(f"Found {len(results)} PII entities using default configuration")
```

### Custom Configuration Files

```python
from presidio_analyzer import AnalyzerEngineProvider

# Create analyzer with custom configuration files
provider = AnalyzerEngineProvider(
    analyzer_engine_conf_file="config/custom_analyzer.yaml",
    nlp_engine_conf_file="config/custom_nlp.yaml",
    recognizer_registry_conf_file="config/custom_recognizers.yaml"
)

analyzer = provider.create_engine()

# Test with custom configuration
text = "Custom entity detection test"
results = analyzer.analyze(text=text, language="en")
```

### Programmatic Configuration

```python
from presidio_analyzer import (
    AnalyzerEngine, RecognizerRegistry, LemmaContextAwareEnhancer
)
from presidio_analyzer.nlp_engine import SpacyNlpEngine

# Configure NLP engine
nlp_engine = SpacyNlpEngine(models={"en": "en_core_web_lg"})

# Configure recognizer registry
registry = RecognizerRegistry(supported_languages=["en"])
registry.load_predefined_recognizers(languages=["en"], nlp_engine=nlp_engine)

# Configure context enhancement
enhancer = LemmaContextAwareEnhancer(
    context_similarity_factor=0.4,
    min_score_with_context_similarity=0.3
)

# Create analyzer with custom configuration
analyzer = AnalyzerEngine(
    registry=registry,
    nlp_engine=nlp_engine,
    context_aware_enhancer=enhancer,
    default_score_threshold=0.5,
    supported_languages=["en"]
)
```

### Multi-language Configuration

```python
from presidio_analyzer import AnalyzerEngineProvider
import yaml

# Create multi-language configuration
multilingual_config = {
    'nlp_engine_name': 'spacy',
    'models': [
        {'lang_code': 'en', 'model_name': 'en_core_web_lg'},
        {'lang_code': 'es', 'model_name': 'es_core_news_md'},
        {'lang_code': 'fr', 'model_name': 'fr_core_news_md'},
        {'lang_code': 'de', 'model_name': 'de_core_news_md'}
    ],
    'supported_languages': ['en', 'es', 'fr', 'de'],
    'default_score_threshold': 0.6
}

# Save configuration to file
with open('multilingual_config.yaml', 'w') as f:
    yaml.dump(multilingual_config, f)

# Create analyzer from configuration
provider = AnalyzerEngineProvider(
    analyzer_engine_conf_file='multilingual_config.yaml'
)
analyzer = provider.create_engine()

# Test with different languages
texts = {
    'en': "Contact John Smith at john@email.com",
    'es': "Contacta con Juan en juan@email.com",
    'fr': "Contactez Jean à jean@email.com",
    'de': "Kontaktieren Sie Johann unter johann@email.com"
}

for language, text in texts.items():
    results = analyzer.analyze(text=text, language=language)
    print(f"{language}: Found {len(results)} entities")
```

### Environment-based Configuration

```python
from presidio_analyzer import AnalyzerEngineProvider
import os
from pathlib import Path

def create_analyzer_from_environment():
    """Create analyzer using environment-specific configuration."""

    # Get configuration paths from environment variables
    config_dir = os.getenv('PRESIDIO_CONFIG_DIR', 'config')

    analyzer_config = os.getenv(
        'PRESIDIO_ANALYZER_CONFIG',
        f'{config_dir}/analyzer.yaml'
    )

    nlp_config = os.getenv(
        'PRESIDIO_NLP_CONFIG',
        f'{config_dir}/nlp.yaml'
    )

    recognizer_config = os.getenv(
        'PRESIDIO_RECOGNIZER_CONFIG',
        f'{config_dir}/recognizers.yaml'
    )

    # Verify configuration files exist
    for config_file in [analyzer_config, nlp_config, recognizer_config]:
        if not Path(config_file).exists():
            print(f"Warning: Configuration file not found: {config_file}")

    # Create analyzer with environment-specific configuration
    provider = AnalyzerEngineProvider(
        analyzer_engine_conf_file=analyzer_config,
        nlp_engine_conf_file=nlp_config,
        recognizer_registry_conf_file=recognizer_config
    )

    return provider.create_engine()

# Usage with environment variables
# export PRESIDIO_CONFIG_DIR=/etc/presidio
# export PRESIDIO_ANALYZER_CONFIG=/etc/presidio/production_analyzer.yaml

analyzer = create_analyzer_from_environment()
```

### Docker Configuration

```python
from presidio_analyzer import AnalyzerEngineProvider
import yaml

def create_docker_optimized_analyzer():
    """Create analyzer optimized for Docker deployment."""

    # Docker-optimized configuration
    docker_config = {
        'nlp_engine_name': 'spacy',
        'models': [
            {
                'lang_code': 'en',
                'model_name': 'en_core_web_sm'  # Smaller model for containers
            }
        ],
        'supported_languages': ['en'],
        'default_score_threshold': 0.5,
        'context_aware_enhancer': {
            'enable': True,
            'context_similarity_factor': 0.35,
            'min_score_with_context_similarity': 0.4
        }
    }

    # Write configuration to container filesystem
    config_path = '/tmp/docker_analyzer_config.yaml'
    with open(config_path, 'w') as f:
        yaml.dump(docker_config, f)

    # Create analyzer
    provider = AnalyzerEngineProvider(
        analyzer_engine_conf_file=config_path
    )

    return provider.create_engine()

# Docker deployment usage
analyzer = create_docker_optimized_analyzer()
```

### High-Performance Configuration

```python
from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
from presidio_analyzer.nlp_engine import SpacyNlpEngine

def create_high_performance_analyzer():
    """Create analyzer optimized for high-throughput scenarios."""

    # Use a smaller, faster NLP model for lightweight processing
    nlp_engine = SpacyNlpEngine(models={"en": "en_core_web_sm"})

    # Create registry with only essential recognizers
    registry = RecognizerRegistry(supported_languages=["en"])

    # Keep only high-confidence, fast recognizers
    essential_recognizers = [
        "EmailRecognizer",
        "PhoneRecognizer",
        "CreditCardRecognizer",
        "UsSsnRecognizer"
    ]

    registry.load_predefined_recognizers(
        languages=["en"],
        nlp_engine=nlp_engine
    )

    # Filter to essential recognizers only
    registry.recognizers = [
        r for r in registry.recognizers
        if r.name in essential_recognizers
    ]

    # Create analyzer without context enhancement for speed
    analyzer = AnalyzerEngine(
        registry=registry,
        nlp_engine=nlp_engine,
        context_aware_enhancer=None,  # Disable for performance
        default_score_threshold=0.7  # Higher threshold for precision
    )

    return analyzer

# High-performance deployment
analyzer = create_high_performance_analyzer()
```

### Custom Recognizer Configuration

```python
from presidio_analyzer import AnalyzerEngineProvider
import yaml

def create_custom_recognizer_config():
    """Create an analyzer from a configuration with custom recognizers."""

    # Define custom recognizers in YAML-compatible format
    custom_config = {
        'recognizers': [
            {
                'name': 'CustomEmployeeIdRecognizer',
                'supported_language': 'en',
                'supported_entities': ['EMPLOYEE_ID'],
                'patterns': [
                    {
                        'name': 'emp_id_pattern_1',
                        'regex': r'\bEMP-\d{5}\b',
                        'score': 0.9
                    },
                    {
                        'name': 'emp_id_pattern_2',
                        'regex': r'\b[Ee]mployee\s*[Ii][Dd]\s*:?\s*(\d{5})\b',
                        'score': 0.8
                    }
                ],
                'context': ['employee', 'staff', 'worker', 'personnel']
            },
            {
                'name': 'CustomProductCodeRecognizer',
                'supported_language': 'en',
                'supported_entities': ['PRODUCT_CODE'],
                'patterns': [
                    {
                        'name': 'product_code_pattern',
                        'regex': r'\bPRD-[A-Z]{2}-\d{4}\b',
                        'score': 0.9
                    }
                ],
                'context': ['product', 'item', 'catalog', 'inventory']
            }
        ]
    }

    # Save custom recognizer configuration
    with open('custom_recognizers.yaml', 'w') as f:
        yaml.dump(custom_config, f)

    # Create analyzer with custom recognizers
    provider = AnalyzerEngineProvider(
        recognizer_registry_conf_file='custom_recognizers.yaml'
    )

    return provider.create_engine()

# Usage with custom recognizers
analyzer = create_custom_recognizer_config()

test_text = "Employee ID: 12345 ordered product PRD-AB-1234"
results = analyzer.analyze(text=test_text, language="en")

for result in results:
    detected_text = test_text[result.start:result.end]
    print(f"Found {result.entity_type}: '{detected_text}'")
```

### Configuration Validation

```python
from presidio_analyzer import AnalyzerEngineProvider
import yaml
from pathlib import Path

def validate_configuration(config_file: str) -> bool:
    """Validate analyzer configuration file."""

    try:
        # Check if file exists
        if not Path(config_file).exists():
            print(f"Error: Configuration file not found: {config_file}")
            return False

        # Load and validate YAML syntax
        with open(config_file, 'r') as f:
            config = yaml.safe_load(f)

        # Validate required fields
        required_fields = ['nlp_engine_name', 'models', 'supported_languages']
        for field in required_fields:
            if field not in config:
                print(f"Error: Missing required field: {field}")
                return False

        # Validate NLP engine name
        valid_engines = ['spacy', 'stanza', 'transformers']
        if config['nlp_engine_name'] not in valid_engines:
            print(f"Error: Invalid NLP engine: {config['nlp_engine_name']}")
            return False

        # Validate models configuration
        if not isinstance(config['models'], list) or not config['models']:
            print("Error: Models must be a non-empty list")
            return False

        for model in config['models']:
            if 'lang_code' not in model or 'model_name' not in model:
                print("Error: Each model must have 'lang_code' and 'model_name'")
                return False

        # Try to create an analyzer to validate the configuration end to end
        provider = AnalyzerEngineProvider(analyzer_engine_conf_file=config_file)
        provider.create_engine()

        print(f"Configuration validation successful: {config_file}")
        return True

    except yaml.YAMLError as e:
        print(f"YAML syntax error: {e}")
        return False
    except Exception as e:
        print(f"Configuration error: {e}")
        return False

# Validate configuration before deployment
config_file = "config/analyzer.yaml"
if validate_configuration(config_file):
    provider = AnalyzerEngineProvider(analyzer_engine_conf_file=config_file)
    analyzer = provider.create_engine()
    print("Analyzer created successfully")
else:
    print("Configuration validation failed")
```

## Configuration Best Practices

### Performance Optimization

- Use smaller spaCy models (e.g., `en_core_web_sm`) for faster processing
- Disable context enhancement for high-throughput scenarios
- Load only the recognizers your use case needs
- Set appropriate score thresholds to filter low-confidence results
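The thresholding point can be seen with a plain filter over hypothetical `(entity, score)` results, mirroring what `default_score_threshold` does inside the analyzer:

```python
# Illustrative only: the entity/score pairs below are made-up results.
results = [("EMAIL_ADDRESS", 1.0), ("PHONE_NUMBER", 0.7), ("PERSON", 0.4)]

threshold = 0.5
filtered = [(entity, score) for entity, score in results if score >= threshold]
print(filtered)  # [('EMAIL_ADDRESS', 1.0), ('PHONE_NUMBER', 0.7)]
```

Raising the threshold trades recall for precision: the weak `PERSON` hit is dropped while high-confidence matches survive.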

### Security Considerations

- Store configuration files in secure locations with appropriate permissions
- Use environment variables for sensitive configuration values
- Validate configuration files before deployment
- Regularly update NLP models and recognizer patterns
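The environment-variable recommendation can be sketched with stdlib code; `PRESIDIO_MODEL_TOKEN` here is a hypothetical variable name for illustration, not one the library reads.

```python
import os
from typing import Optional

# Minimal sketch: resolve a sensitive value from the environment
# rather than hard-coding it in a YAML file checked into source control.
def resolve_secret(name: str, default: Optional[str] = None) -> str:
    value = os.getenv(name, default)
    if value is None:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

# In a real deployment this is set by the orchestrator, not by the code.
os.environ["PRESIDIO_MODEL_TOKEN"] = "example-token"
print(resolve_secret("PRESIDIO_MODEL_TOKEN"))  # example-token
```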

### Deployment Strategies

- **Development**: Use comprehensive configurations with all recognizers
- **Production**: Use optimized configurations with essential recognizers only
- **Docker**: Use lightweight models and configurations for container efficiency
- **Multi-language**: Configure only the languages you actually need

### Monitoring and Maintenance

- Log configuration loading and validation results
- Monitor analyzer performance metrics
- Regularly review and update recognizer patterns
- Test configuration changes in staging environments before production deployment