# Configuration and Setup

Presidio Analyzer provides flexible configuration through YAML files, supporting multiple NLP engines, customizable recognizer registries, and various deployment scenarios.

## Capabilities

### AnalyzerEngineProvider

Utility class for creating AnalyzerEngine instances from YAML configuration files.

```python { .api }
class AnalyzerEngineProvider:
    """
    Factory class for creating configured AnalyzerEngine instances.

    Args:
        analyzer_engine_conf_file: Path to analyzer configuration YAML file
        nlp_engine_conf_file: Path to NLP engine configuration YAML file
        recognizer_registry_conf_file: Path to recognizer registry configuration YAML file
    """
    def __init__(
        self,
        analyzer_engine_conf_file: Optional[Union[Path, str]] = None,
        nlp_engine_conf_file: Optional[Union[Path, str]] = None,
        recognizer_registry_conf_file: Optional[Union[Path, str]] = None
    ): ...

    def create_engine(self) -> AnalyzerEngine:
        """
        Create and configure AnalyzerEngine from configuration files.

        Returns:
            Fully configured AnalyzerEngine instance
        """

    def get_configuration(self, conf_file: Optional[Union[Path, str]]) -> Dict[str, Any]:
        """
        Load configuration from YAML file.

        Args:
            conf_file: Path to configuration file

        Returns:
            Dictionary containing configuration data
        """

    # Properties
    configuration: Dict[str, Any]  # Loaded configuration data
    nlp_engine_conf_file: Optional[str]  # Path to NLP engine configuration
    recognizer_registry_conf_file: Optional[str]  # Path to recognizer registry configuration
```

### RecognizerRegistry

Registry that manages and organizes entity recognizers for the analyzer.

```python { .api }
class RecognizerRegistry:
    """
    Registry for managing entity recognizers.

    Args:
        recognizers: Initial collection of recognizers to register
        global_regex_flags: Default regex compilation flags for pattern recognizers
        supported_languages: List of supported language codes
    """
    def __init__(
        self,
        recognizers: Optional[Iterable[EntityRecognizer]] = None,
        global_regex_flags: Optional[int] = None,  # Default: re.DOTALL | re.MULTILINE | re.IGNORECASE
        supported_languages: Optional[List[str]] = None
    ): ...

    def load_predefined_recognizers(
        self,
        languages: Optional[List[str]] = None,
        nlp_engine: Optional[NlpEngine] = None
    ) -> None:
        """
        Load built-in recognizers into the registry.

        Args:
            languages: Language codes for recognizers to load (None = all supported)
            nlp_engine: NLP engine instance for NLP-based recognizers
        """

    def add_nlp_recognizer(self, nlp_engine: NlpEngine) -> None:
        """
        Add an NLP-based recognizer (spaCy, Stanza, or Transformers) to the registry.

        Args:
            nlp_engine: Configured NLP engine instance
        """

    # Properties
    recognizers: List[EntityRecognizer]  # List of registered recognizers
    global_regex_flags: Optional[int]  # Default regex flags
    supported_languages: Optional[List[str]]  # Supported language codes
```

### NlpEngineProvider

Factory for creating configured NLP engine instances.

```python { .api }
class NlpEngineProvider:
    """
    Factory class for creating NLP engine instances from configuration.

    Args:
        nlp_configuration: Dictionary containing NLP engine configuration
    """
    def __init__(self, nlp_configuration: Optional[Dict] = None): ...

    def create_engine(self) -> NlpEngine:
        """
        Create an NLP engine instance based on configuration.

        Returns:
            Configured NLP engine (spaCy, Stanza, or Transformers)
        """

    @staticmethod
    def create_nlp_engine_with_spacy(
        model_name: str,
        nlp_ta_prefix_list: Optional[List[str]] = None
    ) -> SpacyNlpEngine:
        """Create a spaCy-based NLP engine with the specified model."""

    @staticmethod
    def create_nlp_engine_with_stanza(
        model_name: str,
        nlp_ta_prefix_list: Optional[List[str]] = None
    ) -> StanzaNlpEngine:
        """Create a Stanza-based NLP engine with the specified model."""

    @staticmethod
    def create_nlp_engine_with_transformers(
        model_name: str,
        nlp_ta_prefix_list: Optional[List[str]] = None
    ) -> TransformersNlpEngine:
        """Create a Transformers-based NLP engine with the specified model."""
```

## Configuration File Formats

### Default Analyzer Configuration

```yaml
# default_analyzer.yaml
nlp_engine_name: spacy
models:
  - lang_code: en
    model_name: en_core_web_lg
  - lang_code: es
    model_name: es_core_news_md

# Context enhancement settings
context_aware_enhancer:
  enable: true
  context_similarity_factor: 0.35
  min_score_with_context_similarity: 0.4
  context_prefix_count: 5
  context_suffix_count: 0

# Default score threshold
default_score_threshold: 0.0

# Supported languages
supported_languages:
  - en
  - es
  - fr
  - de
  - it
```
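To make the `context_aware_enhancer` knobs concrete, here is an illustrative sketch (not the library's code) of how a context-word hit could adjust a recognizer score, assuming the behavior the parameters describe: boost the raw score by `context_similarity_factor`, floor the boosted score at `min_score_with_context_similarity`, and cap it at 1.0.

```python
# Illustrative sketch only; the real logic lives in the library's
# context-aware enhancer. The formula below is an assumption based on
# the parameter names and defaults shown in the YAML above.
def enhance_score(
    raw_score: float,
    context_word_found: bool,
    context_similarity_factor: float = 0.35,
    min_score_with_context_similarity: float = 0.4,
) -> float:
    if not context_word_found:
        return raw_score
    boosted = raw_score + context_similarity_factor
    boosted = max(boosted, min_score_with_context_similarity)
    return min(boosted, 1.0)

print(round(enhance_score(0.3, context_word_found=False), 2))  # 0.3  (no context hit)
print(round(enhance_score(0.3, context_word_found=True), 2))   # 0.65 (boosted)
print(round(enhance_score(0.01, context_word_found=True), 2))  # 0.4  (floored)
```

The floor matters for weak pattern matches: a low-confidence hit near a strong context word is promoted to at least the configured minimum.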

### NLP Engine Configurations

#### spaCy Configuration

```yaml
# spacy.yaml
nlp_engine_name: spacy
models:
  - lang_code: en
    model_name: en_core_web_lg
  - lang_code: es
    model_name: es_core_news_md
  - lang_code: fr
    model_name: fr_core_news_md
  - lang_code: de
    model_name: de_core_news_md
  - lang_code: it
    model_name: it_core_news_md
```

#### Stanza Configuration

```yaml
# stanza.yaml
nlp_engine_name: stanza
models:
  - lang_code: en
    model_name: en
  - lang_code: es
    model_name: es
  - lang_code: fr
    model_name: fr
  - lang_code: de
    model_name: de
  - lang_code: it
    model_name: it
```

#### Transformers Configuration

```yaml
# transformers.yaml
nlp_engine_name: transformers
models:
  - lang_code: en
    model_name: dslim/bert-base-NER
  - lang_code: es
    model_name: mrm8488/bert-spanish-cased-finetuned-ner
```

### Recognizer Registry Configuration

```yaml
# default_recognizers.yaml
recognizers:
  - name: "CreditCardRecognizer"
    supported_language: "en"
    supported_entities: ["CREDIT_CARD"]
    patterns:
      - name: "credit_card_visa"
        regex: "4[0-9]{12}(?:[0-9]{3})?"
        score: 0.9
      - name: "credit_card_mastercard"
        regex: "5[1-5][0-9]{14}"
        score: 0.9
    context: ["credit", "card", "payment"]

  - name: "PhoneRecognizer"
    supported_language: "en"
    supported_entities: ["PHONE_NUMBER"]
    patterns:
      - name: "us_phone"
        regex: "\\b\\d{3}[-.]?\\d{3}[-.]?\\d{4}\\b"
        score: 0.7
    context: ["phone", "call", "number", "contact"]
```
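Before wiring a pattern into a recognizer configuration, it is worth sanity-checking the regex in plain Python with the registry's default flags (`re.DOTALL | re.MULTILINE | re.IGNORECASE`, per the `RecognizerRegistry` API above). The sample inputs below are made up for illustration.

```python
import re

# Default flags used for pattern recognizers, as documented above.
FLAGS = re.DOTALL | re.MULTILINE | re.IGNORECASE

# Patterns copied from the YAML configuration.
visa = re.compile(r"4[0-9]{12}(?:[0-9]{3})?", FLAGS)
us_phone = re.compile(r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b", FLAGS)

assert visa.search("Card: 4111111111111111") is not None   # 16-digit Visa test number
assert us_phone.search("call 555-123-4567") is not None
assert us_phone.search("not a number") is None
print("pattern sanity checks passed")
```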

## Usage Examples

### Basic Configuration Setup

```python
from presidio_analyzer import AnalyzerEngineProvider

# Create analyzer from default configuration
provider = AnalyzerEngineProvider()
analyzer = provider.create_engine()

# Use the configured analyzer
text = "Contact John at john@email.com or call 555-123-4567"
results = analyzer.analyze(text=text, language="en")

print(f"Found {len(results)} PII entities using default configuration")
```

### Custom Configuration Files

```python
from presidio_analyzer import AnalyzerEngineProvider

# Create analyzer with custom configuration files
provider = AnalyzerEngineProvider(
    analyzer_engine_conf_file="config/custom_analyzer.yaml",
    nlp_engine_conf_file="config/custom_nlp.yaml",
    recognizer_registry_conf_file="config/custom_recognizers.yaml"
)

analyzer = provider.create_engine()

# Test with custom configuration
text = "Custom entity detection test"
results = analyzer.analyze(text=text, language="en")
```

### Programmatic Configuration

```python
from presidio_analyzer import (
    AnalyzerEngine, RecognizerRegistry, LemmaContextAwareEnhancer
)
from presidio_analyzer.nlp_engine import SpacyNlpEngine

# Configure NLP engine
nlp_engine = SpacyNlpEngine(models={"en": "en_core_web_lg"})

# Configure recognizer registry
registry = RecognizerRegistry(supported_languages=["en"])
registry.load_predefined_recognizers(languages=["en"], nlp_engine=nlp_engine)

# Configure context enhancement
enhancer = LemmaContextAwareEnhancer(
    context_similarity_factor=0.4,
    min_score_with_context_similarity=0.3
)

# Create analyzer with custom configuration
analyzer = AnalyzerEngine(
    registry=registry,
    nlp_engine=nlp_engine,
    context_aware_enhancer=enhancer,
    default_score_threshold=0.5,
    supported_languages=["en"]
)
```

### Multi-language Configuration

```python
from presidio_analyzer import AnalyzerEngineProvider
import yaml

# Create multi-language configuration
multilingual_config = {
    'nlp_engine_name': 'spacy',
    'models': [
        {'lang_code': 'en', 'model_name': 'en_core_web_lg'},
        {'lang_code': 'es', 'model_name': 'es_core_news_md'},
        {'lang_code': 'fr', 'model_name': 'fr_core_news_md'},
        {'lang_code': 'de', 'model_name': 'de_core_news_md'}
    ],
    'supported_languages': ['en', 'es', 'fr', 'de'],
    'default_score_threshold': 0.6
}

# Save configuration to file
with open('multilingual_config.yaml', 'w') as f:
    yaml.dump(multilingual_config, f)

# Create analyzer from configuration
provider = AnalyzerEngineProvider(
    analyzer_engine_conf_file='multilingual_config.yaml'
)
analyzer = provider.create_engine()

# Test with different languages
texts = {
    'en': "Contact John Smith at john@email.com",
    'es': "Contacta con Juan en juan@email.com",
    'fr': "Contactez Jean à jean@email.com",
    'de': "Kontaktieren Sie Johann unter johann@email.com"
}

for language, text in texts.items():
    results = analyzer.analyze(text=text, language=language)
    print(f"{language}: Found {len(results)} entities")
```

### Environment-based Configuration

```python
from presidio_analyzer import AnalyzerEngineProvider
import os
from pathlib import Path

def create_analyzer_from_environment():
    """Create analyzer using environment-specific configuration."""

    # Get configuration paths from environment variables
    config_dir = os.getenv('PRESIDIO_CONFIG_DIR', 'config')

    analyzer_config = os.getenv(
        'PRESIDIO_ANALYZER_CONFIG',
        f'{config_dir}/analyzer.yaml'
    )

    nlp_config = os.getenv(
        'PRESIDIO_NLP_CONFIG',
        f'{config_dir}/nlp.yaml'
    )

    recognizer_config = os.getenv(
        'PRESIDIO_RECOGNIZER_CONFIG',
        f'{config_dir}/recognizers.yaml'
    )

    # Verify configuration files exist
    for config_file in [analyzer_config, nlp_config, recognizer_config]:
        if not Path(config_file).exists():
            print(f"Warning: Configuration file not found: {config_file}")

    # Create analyzer with environment-specific configuration
    provider = AnalyzerEngineProvider(
        analyzer_engine_conf_file=analyzer_config,
        nlp_engine_conf_file=nlp_config,
        recognizer_registry_conf_file=recognizer_config
    )

    return provider.create_engine()

# Usage with environment variables
# export PRESIDIO_CONFIG_DIR=/etc/presidio
# export PRESIDIO_ANALYZER_CONFIG=/etc/presidio/production_analyzer.yaml

analyzer = create_analyzer_from_environment()
```

### Docker Configuration

```python
from presidio_analyzer import AnalyzerEngineProvider
import yaml

def create_docker_optimized_analyzer():
    """Create analyzer optimized for Docker deployment."""

    # Docker-optimized configuration
    docker_config = {
        'nlp_engine_name': 'spacy',
        'models': [
            {
                'lang_code': 'en',
                'model_name': 'en_core_web_sm'  # Smaller model for containers
            }
        ],
        'supported_languages': ['en'],
        'default_score_threshold': 0.5,
        'context_aware_enhancer': {
            'enable': True,
            'context_similarity_factor': 0.35,
            'min_score_with_context_similarity': 0.4
        }
    }

    # Write configuration to container filesystem
    config_path = '/tmp/docker_analyzer_config.yaml'
    with open(config_path, 'w') as f:
        yaml.dump(docker_config, f)

    # Create analyzer
    provider = AnalyzerEngineProvider(
        analyzer_engine_conf_file=config_path
    )

    return provider.create_engine()

# Docker deployment usage
analyzer = create_docker_optimized_analyzer()
```

### High-Performance Configuration

```python
from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
from presidio_analyzer.nlp_engine import SpacyNlpEngine

def create_high_performance_analyzer():
    """Create analyzer optimized for high-throughput scenarios."""

    # Use a smaller, faster NLP model for lightweight processing
    nlp_engine = SpacyNlpEngine(models={"en": "en_core_web_sm"})

    # Create registry with only essential recognizers
    registry = RecognizerRegistry(supported_languages=["en"])

    # Keep only high-confidence, fast recognizers
    essential_recognizers = [
        "EmailRecognizer",
        "PhoneRecognizer",
        "CreditCardRecognizer",
        "UsSsnRecognizer"
    ]

    registry.load_predefined_recognizers(
        languages=["en"],
        nlp_engine=nlp_engine
    )

    # Filter to essential recognizers only
    registry.recognizers = [
        r for r in registry.recognizers
        if r.name in essential_recognizers
    ]

    # Create analyzer without context enhancement for speed
    analyzer = AnalyzerEngine(
        registry=registry,
        nlp_engine=nlp_engine,
        context_aware_enhancer=None,  # Disable for performance
        default_score_threshold=0.7  # Higher threshold for precision
    )

    return analyzer

# High-performance deployment
analyzer = create_high_performance_analyzer()
```

### Custom Recognizer Configuration

```python
from presidio_analyzer import AnalyzerEngineProvider
import yaml

def create_custom_recognizer_config():
    """Create an analyzer from a configuration with custom recognizers."""

    # Define custom recognizers in YAML-compatible format
    custom_config = {
        'recognizers': [
            {
                'name': 'CustomEmployeeIdRecognizer',
                'supported_language': 'en',
                'supported_entities': ['EMPLOYEE_ID'],
                'patterns': [
                    {
                        'name': 'emp_id_pattern_1',
                        'regex': r'\bEMP-\d{5}\b',
                        'score': 0.9
                    },
                    {
                        'name': 'emp_id_pattern_2',
                        'regex': r'\b[Ee]mployee\s*[Ii][Dd]\s*:?\s*(\d{5})\b',
                        'score': 0.8
                    }
                ],
                'context': ['employee', 'staff', 'worker', 'personnel']
            },
            {
                'name': 'CustomProductCodeRecognizer',
                'supported_language': 'en',
                'supported_entities': ['PRODUCT_CODE'],
                'patterns': [
                    {
                        'name': 'product_code_pattern',
                        'regex': r'\bPRD-[A-Z]{2}-\d{4}\b',
                        'score': 0.9
                    }
                ],
                'context': ['product', 'item', 'catalog', 'inventory']
            }
        ]
    }

    # Save custom recognizer configuration
    with open('custom_recognizers.yaml', 'w') as f:
        yaml.dump(custom_config, f)

    # Create analyzer with custom recognizers
    provider = AnalyzerEngineProvider(
        recognizer_registry_conf_file='custom_recognizers.yaml'
    )

    return provider.create_engine()

# Usage with custom recognizers
analyzer = create_custom_recognizer_config()

test_text = "Employee ID: 12345 ordered product PRD-AB-1234"
results = analyzer.analyze(text=test_text, language="en")

for result in results:
    detected_text = test_text[result.start:result.end]
    print(f"Found {result.entity_type}: '{detected_text}'")
```

### Configuration Validation

```python
from presidio_analyzer import AnalyzerEngineProvider
import yaml
from pathlib import Path

def validate_configuration(config_file: str) -> bool:
    """Validate analyzer configuration file."""

    try:
        # Check if file exists
        if not Path(config_file).exists():
            print(f"Error: Configuration file not found: {config_file}")
            return False

        # Load and validate YAML syntax
        with open(config_file, 'r') as f:
            config = yaml.safe_load(f)

        # Validate required fields
        required_fields = ['nlp_engine_name', 'models', 'supported_languages']
        for field in required_fields:
            if field not in config:
                print(f"Error: Missing required field: {field}")
                return False

        # Validate NLP engine name
        valid_engines = ['spacy', 'stanza', 'transformers']
        if config['nlp_engine_name'] not in valid_engines:
            print(f"Error: Invalid NLP engine: {config['nlp_engine_name']}")
            return False

        # Validate models configuration
        if not isinstance(config['models'], list) or not config['models']:
            print("Error: Models must be a non-empty list")
            return False

        for model in config['models']:
            if 'lang_code' not in model or 'model_name' not in model:
                print("Error: Each model must have 'lang_code' and 'model_name'")
                return False

        # Try to create an analyzer to validate the configuration end to end
        provider = AnalyzerEngineProvider(analyzer_engine_conf_file=config_file)
        provider.create_engine()

        print(f"Configuration validation successful: {config_file}")
        return True

    except yaml.YAMLError as e:
        print(f"YAML syntax error: {e}")
        return False
    except Exception as e:
        print(f"Configuration error: {e}")
        return False

# Validate configuration before deployment
config_file = "config/analyzer.yaml"
if validate_configuration(config_file):
    provider = AnalyzerEngineProvider(analyzer_engine_conf_file=config_file)
    analyzer = provider.create_engine()
    print("Analyzer created successfully")
else:
    print("Configuration validation failed")
```

## Configuration Best Practices

### Performance Optimization

- Use smaller spaCy models (e.g., `en_core_web_sm`) for faster processing
- Disable context enhancement for high-throughput scenarios
- Load only the recognizers your use case needs
- Set appropriate score thresholds to filter low-confidence results
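The thresholding point can be seen with a plain filter over hypothetical `(entity, score)` results, mirroring what `default_score_threshold` does inside the analyzer:

```python
# Illustrative only: the entity/score pairs below are made-up results.
results = [("EMAIL_ADDRESS", 1.0), ("PHONE_NUMBER", 0.7), ("PERSON", 0.4)]

threshold = 0.5
filtered = [(entity, score) for entity, score in results if score >= threshold]
print(filtered)  # [('EMAIL_ADDRESS', 1.0), ('PHONE_NUMBER', 0.7)]
```

Raising the threshold trades recall for precision: the weak `PERSON` hit is dropped while high-confidence matches survive.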

### Security Considerations

- Store configuration files in secure locations with appropriate permissions
- Use environment variables for sensitive configuration values
- Validate configuration files before deployment
- Regularly update NLP models and recognizer patterns
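The environment-variable recommendation can be sketched with stdlib code; `PRESIDIO_MODEL_TOKEN` here is a hypothetical variable name for illustration, not one the library reads.

```python
import os
from typing import Optional

# Minimal sketch: resolve a sensitive value from the environment
# rather than hard-coding it in a YAML file checked into source control.
def resolve_secret(name: str, default: Optional[str] = None) -> str:
    value = os.getenv(name, default)
    if value is None:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

# In a real deployment this is set by the orchestrator, not by the code.
os.environ["PRESIDIO_MODEL_TOKEN"] = "example-token"
print(resolve_secret("PRESIDIO_MODEL_TOKEN"))  # example-token
```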

### Deployment Strategies

- **Development**: Use comprehensive configurations with all recognizers
- **Production**: Use optimized configurations with essential recognizers only
- **Docker**: Use lightweight models and configurations for container efficiency
- **Multi-language**: Configure only the languages you actually need

### Monitoring and Maintenance

- Log configuration loading and validation results
- Monitor analyzer performance metrics
- Regularly review and update recognizer patterns
- Test configuration changes in staging environments before production deployment