
# Configuration and Setup

Presidio Analyzer provides flexible configuration through YAML files, supporting multiple NLP engines, customizable recognizer registries, and various deployment scenarios.

## Capabilities

### AnalyzerEngineProvider

Utility class for creating AnalyzerEngine instances from YAML configuration files.

```python { .api }
class AnalyzerEngineProvider:
    """
    Factory class for creating configured AnalyzerEngine instances.

    Args:
        analyzer_engine_conf_file: Path to analyzer configuration YAML file
        nlp_engine_conf_file: Path to NLP engine configuration YAML file
        recognizer_registry_conf_file: Path to recognizer registry configuration YAML file
    """
    def __init__(
        self,
        analyzer_engine_conf_file: Optional[Union[Path, str]] = None,
        nlp_engine_conf_file: Optional[Union[Path, str]] = None,
        recognizer_registry_conf_file: Optional[Union[Path, str]] = None
    ): ...

    def create_engine(self) -> AnalyzerEngine:
        """
        Create and configure AnalyzerEngine from configuration files.

        Returns:
            Fully configured AnalyzerEngine instance
        """

    def get_configuration(self, conf_file: Optional[Union[Path, str]]) -> Dict[str, Any]:
        """
        Load configuration from YAML file.

        Args:
            conf_file: Path to configuration file

        Returns:
            Dictionary containing configuration data
        """

    # Properties
    configuration: Dict[str, Any]  # Loaded configuration data
    nlp_engine_conf_file: Optional[str]  # Path to NLP engine configuration
    recognizer_registry_conf_file: Optional[str]  # Path to recognizer registry configuration
```

### RecognizerRegistry

Registry that manages and organizes entity recognizers for the analyzer.

```python { .api }
class RecognizerRegistry:
    """
    Registry for managing entity recognizers.

    Args:
        recognizers: Initial collection of recognizers to register
        global_regex_flags: Default regex compilation flags for pattern recognizers
        supported_languages: List of supported language codes
    """
    def __init__(
        self,
        recognizers: Optional[Iterable[EntityRecognizer]] = None,
        global_regex_flags: Optional[int] = None,  # Default: re.DOTALL | re.MULTILINE | re.IGNORECASE
        supported_languages: Optional[List[str]] = None
    ): ...

    def load_predefined_recognizers(
        self,
        languages: Optional[List[str]] = None,
        nlp_engine: Optional[NlpEngine] = None
    ) -> None:
        """
        Load built-in recognizers into the registry.

        Args:
            languages: Language codes for recognizers to load (None = all supported)
            nlp_engine: NLP engine instance for NLP-based recognizers
        """

    def add_nlp_recognizer(self, nlp_engine: NlpEngine) -> None:
        """
        Add an NLP-based recognizer (spaCy, Stanza, or Transformers) to the registry.

        Args:
            nlp_engine: Configured NLP engine instance
        """

    # Properties
    recognizers: List[EntityRecognizer]  # List of registered recognizers
    global_regex_flags: Optional[int]  # Default regex flags
    supported_languages: Optional[List[str]]  # Supported language codes
```

### NlpEngineProvider

Factory for creating configured NLP engine instances.

```python { .api }
class NlpEngineProvider:
    """
    Factory class for creating NLP engine instances from configuration.

    Args:
        nlp_configuration: Dictionary containing NLP engine configuration
    """
    def __init__(self, nlp_configuration: Optional[Dict] = None): ...

    def create_engine(self) -> NlpEngine:
        """
        Create NLP engine instance based on configuration.

        Returns:
            Configured NLP engine (spaCy, Stanza, or Transformers)
        """

    @staticmethod
    def create_nlp_engine_with_spacy(
        model_name: str,
        nlp_ta_prefix_list: Optional[List[str]] = None
    ) -> SpacyNlpEngine:
        """Create spaCy-based NLP engine with specified model."""

    @staticmethod
    def create_nlp_engine_with_stanza(
        model_name: str,
        nlp_ta_prefix_list: Optional[List[str]] = None
    ) -> StanzaNlpEngine:
        """Create Stanza-based NLP engine with specified model."""

    @staticmethod
    def create_nlp_engine_with_transformers(
        model_name: str,
        nlp_ta_prefix_list: Optional[List[str]] = None
    ) -> TransformersNlpEngine:
        """Create Transformers-based NLP engine with specified model."""
```

## Configuration File Formats

### Default Analyzer Configuration

```yaml
# default_analyzer.yaml
nlp_engine_name: spacy
models:
  - lang_code: en
    model_name: en_core_web_lg
  - lang_code: es
    model_name: es_core_news_md

# Context enhancement settings
context_aware_enhancer:
  enable: true
  context_similarity_factor: 0.35
  min_score_with_context_similarity: 0.4
  context_prefix_count: 5
  context_suffix_count: 0

# Default score threshold
default_score_threshold: 0.0

# Supported languages
supported_languages:
  - en
  - es
  - fr
  - de
  - it
```

176

177

### NLP Engine Configurations

178

179

#### spaCy Configuration

180

181

```yaml

182

# spacy.yaml

183

nlp_engine_name: spacy

184

models:

185

- lang_code: en

186

model_name: en_core_web_lg

187

- lang_code: es

188

model_name: es_core_news_md

189

- lang_code: fr

190

model_name: fr_core_news_md

191

- lang_code: de

192

model_name: de_core_news_md

193

- lang_code: it

194

model_name: it_core_news_md

195

```

#### Stanza Configuration

```yaml
# stanza.yaml
nlp_engine_name: stanza
models:
  - lang_code: en
    model_name: en
  - lang_code: es
    model_name: es
  - lang_code: fr
    model_name: fr
  - lang_code: de
    model_name: de
  - lang_code: it
    model_name: it
```

#### Transformers Configuration

```yaml
# transformers.yaml
nlp_engine_name: transformers
models:
  - lang_code: en
    model_name: dslim/bert-base-NER
  - lang_code: es
    model_name: mrm8488/bert-spanish-cased-finetuned-ner
```

### Recognizer Registry Configuration

```yaml
# default_recognizers.yaml
recognizers:
  - name: "CreditCardRecognizer"
    supported_language: "en"
    supported_entities: ["CREDIT_CARD"]
    patterns:
      - name: "credit_card_visa"
        regex: "4[0-9]{12}(?:[0-9]{3})?"
        score: 0.9
      - name: "credit_card_mastercard"
        regex: "5[1-5][0-9]{14}"
        score: 0.9
    context: ["credit", "card", "payment"]

  - name: "PhoneRecognizer"
    supported_language: "en"
    supported_entities: ["PHONE_NUMBER"]
    patterns:
      - name: "us_phone"
        regex: "\\b\\d{3}[-.]?\\d{3}[-.]?\\d{4}\\b"
        score: 0.7
    context: ["phone", "call", "number", "contact"]
```
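Regular expressions in a registry file can be sanity-checked before deployment with plain `re`. The quick sketch below exercises the Visa and US phone patterns from the YAML above; the sample values (the standard `4111…` Visa test number and made-up phone numbers) are for illustration only:

```python
import re

# Patterns copied from default_recognizers.yaml
visa_pattern = r"4[0-9]{12}(?:[0-9]{3})?"
phone_pattern = r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b"

# Fabricated sample values to exercise each pattern
print(re.findall(visa_pattern, "Paid with 4111111111111111 yesterday"))
# → ['4111111111111111']
print(re.findall(phone_pattern, "Call 555-123-4567 or 555.987.6543"))
# → ['555-123-4567', '555.987.6543']
```

Testing patterns in isolation like this catches escaping mistakes (e.g., a single backslash in YAML where `\\b` was intended) before they silently disable a recognizer.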

## Usage Examples

### Basic Configuration Setup

```python
from presidio_analyzer import AnalyzerEngineProvider

# Create analyzer from default configuration
provider = AnalyzerEngineProvider()
analyzer = provider.create_engine()

# Use the configured analyzer
text = "Contact John at john@email.com or call 555-123-4567"
results = analyzer.analyze(text=text, language="en")

print(f"Found {len(results)} PII entities using default configuration")
```

### Custom Configuration Files

```python
from presidio_analyzer import AnalyzerEngineProvider

# Create analyzer with custom configuration files
provider = AnalyzerEngineProvider(
    analyzer_engine_conf_file="config/custom_analyzer.yaml",
    nlp_engine_conf_file="config/custom_nlp.yaml",
    recognizer_registry_conf_file="config/custom_recognizers.yaml"
)

analyzer = provider.create_engine()

# Test with custom configuration
text = "Custom entity detection test"
results = analyzer.analyze(text=text, language="en")
```

### Programmatic Configuration

```python
from presidio_analyzer import (
    AnalyzerEngine, RecognizerRegistry, LemmaContextAwareEnhancer
)
from presidio_analyzer.nlp_engine import SpacyNlpEngine

# Configure NLP engine
nlp_engine = SpacyNlpEngine(models={"en": "en_core_web_lg"})

# Configure recognizer registry
registry = RecognizerRegistry(supported_languages=["en"])
registry.load_predefined_recognizers(languages=["en"], nlp_engine=nlp_engine)

# Configure context enhancement
enhancer = LemmaContextAwareEnhancer(
    context_similarity_factor=0.4,
    min_score_with_context_similarity=0.3
)

# Create analyzer with custom configuration
analyzer = AnalyzerEngine(
    registry=registry,
    nlp_engine=nlp_engine,
    context_aware_enhancer=enhancer,
    default_score_threshold=0.5,
    supported_languages=["en"]
)
```

### Multi-language Configuration

```python
from presidio_analyzer import AnalyzerEngineProvider
import yaml

# Create multi-language configuration
multilingual_config = {
    'nlp_engine_name': 'spacy',
    'models': [
        {'lang_code': 'en', 'model_name': 'en_core_web_lg'},
        {'lang_code': 'es', 'model_name': 'es_core_news_md'},
        {'lang_code': 'fr', 'model_name': 'fr_core_news_md'},
        {'lang_code': 'de', 'model_name': 'de_core_news_md'}
    ],
    'supported_languages': ['en', 'es', 'fr', 'de'],
    'default_score_threshold': 0.6
}

# Save configuration to file
with open('multilingual_config.yaml', 'w') as f:
    yaml.dump(multilingual_config, f)

# Create analyzer from configuration
provider = AnalyzerEngineProvider(
    analyzer_engine_conf_file='multilingual_config.yaml'
)
analyzer = provider.create_engine()

# Test with different languages
texts = {
    'en': "Contact John Smith at john@email.com",
    'es': "Contacta con Juan en juan@email.com",
    'fr': "Contactez Jean à jean@email.com",
    'de': "Kontaktieren Sie Johann unter johann@email.com"
}

for language, text in texts.items():
    results = analyzer.analyze(text=text, language=language)
    print(f"{language}: Found {len(results)} entities")
```

### Environment-based Configuration

```python
from presidio_analyzer import AnalyzerEngineProvider
import os
from pathlib import Path

def create_analyzer_from_environment():
    """Create analyzer using environment-specific configuration."""

    # Get configuration paths from environment variables
    config_dir = os.getenv('PRESIDIO_CONFIG_DIR', 'config')

    analyzer_config = os.getenv(
        'PRESIDIO_ANALYZER_CONFIG',
        f'{config_dir}/analyzer.yaml'
    )

    nlp_config = os.getenv(
        'PRESIDIO_NLP_CONFIG',
        f'{config_dir}/nlp.yaml'
    )

    recognizer_config = os.getenv(
        'PRESIDIO_RECOGNIZER_CONFIG',
        f'{config_dir}/recognizers.yaml'
    )

    # Verify configuration files exist
    for config_file in [analyzer_config, nlp_config, recognizer_config]:
        if not Path(config_file).exists():
            print(f"Warning: Configuration file not found: {config_file}")

    # Create analyzer with environment-specific configuration
    provider = AnalyzerEngineProvider(
        analyzer_engine_conf_file=analyzer_config,
        nlp_engine_conf_file=nlp_config,
        recognizer_registry_conf_file=recognizer_config
    )

    return provider.create_engine()

# Usage with environment variables:
# export PRESIDIO_CONFIG_DIR=/etc/presidio
# export PRESIDIO_ANALYZER_CONFIG=/etc/presidio/production_analyzer.yaml

analyzer = create_analyzer_from_environment()
```

### Docker Configuration

```python
from presidio_analyzer import AnalyzerEngineProvider
import yaml

def create_docker_optimized_analyzer():
    """Create analyzer optimized for Docker deployment."""

    # Docker-optimized configuration
    docker_config = {
        'nlp_engine_name': 'spacy',
        'models': [
            {
                'lang_code': 'en',
                'model_name': 'en_core_web_sm'  # Smaller model for containers
            }
        ],
        'supported_languages': ['en'],
        'default_score_threshold': 0.5,
        'context_aware_enhancer': {
            'enable': True,
            'context_similarity_factor': 0.35,
            'min_score_with_context_similarity': 0.4
        }
    }

    # Write configuration to container filesystem
    config_path = '/tmp/docker_analyzer_config.yaml'
    with open(config_path, 'w') as f:
        yaml.dump(docker_config, f)

    # Create analyzer
    provider = AnalyzerEngineProvider(
        analyzer_engine_conf_file=config_path
    )

    return provider.create_engine()

# Docker deployment usage
analyzer = create_docker_optimized_analyzer()
```

### High-Performance Configuration

```python
from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
from presidio_analyzer.nlp_engine import SpacyNlpEngine

def create_high_performance_analyzer():
    """Create analyzer optimized for high-throughput scenarios."""

    # Use lightweight NLP processing
    nlp_engine = SpacyNlpEngine(
        models={"en": "en_core_web_sm"}  # Smaller, faster model
    )

    # Create registry with only essential recognizers
    registry = RecognizerRegistry(supported_languages=["en"])

    # Names of high-confidence, fast recognizers to keep
    essential_recognizers = [
        "EmailRecognizer",
        "PhoneRecognizer",
        "CreditCardRecognizer",
        "UsSsnRecognizer"
    ]

    registry.load_predefined_recognizers(
        languages=["en"],
        nlp_engine=nlp_engine
    )

    # Filter to essential recognizers only
    registry.recognizers = [
        r for r in registry.recognizers
        if r.name in essential_recognizers
    ]

    # Create analyzer without context enhancement for speed
    analyzer = AnalyzerEngine(
        registry=registry,
        nlp_engine=nlp_engine,
        context_aware_enhancer=None,  # Disable for performance
        default_score_threshold=0.7  # Higher threshold for precision
    )

    return analyzer

# High-performance deployment
analyzer = create_high_performance_analyzer()
```

### Custom Recognizer Configuration

```python
from presidio_analyzer import AnalyzerEngineProvider
import yaml

def create_custom_recognizer_config():
    """Create configuration with custom recognizers."""

    # Define custom recognizers in YAML-compatible form
    custom_config = {
        'recognizers': [
            {
                'name': 'CustomEmployeeIdRecognizer',
                'supported_language': 'en',
                'supported_entities': ['EMPLOYEE_ID'],
                'patterns': [
                    {
                        'name': 'emp_id_pattern_1',
                        'regex': r'\bEMP-\d{5}\b',
                        'score': 0.9
                    },
                    {
                        'name': 'emp_id_pattern_2',
                        'regex': r'\b[Ee]mployee\s*[Ii][Dd]\s*:?\s*(\d{5})\b',
                        'score': 0.8
                    }
                ],
                'context': ['employee', 'staff', 'worker', 'personnel']
            },
            {
                'name': 'CustomProductCodeRecognizer',
                'supported_language': 'en',
                'supported_entities': ['PRODUCT_CODE'],
                'patterns': [
                    {
                        'name': 'product_code_pattern',
                        'regex': r'\bPRD-[A-Z]{2}-\d{4}\b',
                        'score': 0.9
                    }
                ],
                'context': ['product', 'item', 'catalog', 'inventory']
            }
        ]
    }

    # Save custom recognizer configuration
    with open('custom_recognizers.yaml', 'w') as f:
        yaml.dump(custom_config, f)

    # Create analyzer with custom recognizers
    provider = AnalyzerEngineProvider(
        recognizer_registry_conf_file='custom_recognizers.yaml'
    )

    return provider.create_engine()

# Usage with custom recognizers
analyzer = create_custom_recognizer_config()

test_text = "Employee ID: 12345 ordered product PRD-AB-1234"
results = analyzer.analyze(text=test_text, language="en")

for result in results:
    detected_text = test_text[result.start:result.end]
    print(f"Found {result.entity_type}: '{detected_text}'")
```

580

581

### Configuration Validation

582

583

```python

584

from presidio_analyzer import AnalyzerEngineProvider

585

import yaml

586

from pathlib import Path

587

588

def validate_configuration(config_file: str) -> bool:

589

"""Validate analyzer configuration file."""

590

591

try:

592

# Check if file exists

593

if not Path(config_file).exists():

594

print(f"Error: Configuration file not found: {config_file}")

595

return False

596

597

# Load and validate YAML syntax

598

with open(config_file, 'r') as f:

599

config = yaml.safe_load(f)

600

601

# Validate required fields

602

required_fields = ['nlp_engine_name', 'models', 'supported_languages']

603

for field in required_fields:

604

if field not in config:

605

print(f"Error: Missing required field: {field}")

606

return False

607

608

# Validate NLP engine name

609

valid_engines = ['spacy', 'stanza', 'transformers']

610

if config['nlp_engine_name'] not in valid_engines:

611

print(f"Error: Invalid NLP engine: {config['nlp_engine_name']}")

612

return False

613

614

# Validate models configuration

615

if not isinstance(config['models'], list) or not config['models']:

616

print("Error: Models must be a non-empty list")

617

return False

618

619

for model in config['models']:

620

if 'lang_code' not in model or 'model_name' not in model:

621

print("Error: Each model must have 'lang_code' and 'model_name'")

622

return False

623

624

# Try to create analyzer to validate configuration

625

provider = AnalyzerEngineProvider(analyzer_engine_conf_file=config_file)

626

analyzer = provider.create_engine()

627

628

print(f"Configuration validation successful: {config_file}")

629

return True

630

631

except yaml.YAMLError as e:

632

print(f"YAML syntax error: {e}")

633

return False

634

except Exception as e:

635

print(f"Configuration error: {e}")

636

return False

637

638

# Validate configuration before deployment

639

config_file = "config/analyzer.yaml"

640

if validate_configuration(config_file):

641

provider = AnalyzerEngineProvider(analyzer_engine_conf_file=config_file)

642

analyzer = provider.create_engine()

643

print("Analyzer created successfully")

644

else:

645

print("Configuration validation failed")

646

```

## Configuration Best Practices

### Performance Optimization

- Use smaller spaCy models (e.g., `en_core_web_sm`) for faster processing
- Disable context enhancement for high-throughput scenarios
- Load only necessary recognizers for your use case
- Set appropriate score thresholds to filter low-confidence results
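The effect of a score threshold can be sketched with plain Python. `ScoredEntity` below is a hypothetical stand-in for Presidio's result objects (which carry an `entity_type` and a `score`), used only to illustrate the filtering:

```python
from dataclasses import dataclass

@dataclass
class ScoredEntity:
    # Hypothetical stand-in for a Presidio recognizer result
    entity_type: str
    score: float

def filter_by_threshold(results, threshold):
    """Drop detections whose confidence falls below the threshold."""
    return [r for r in results if r.score >= threshold]

results = [
    ScoredEntity("EMAIL_ADDRESS", 1.0),
    ScoredEntity("PHONE_NUMBER", 0.7),
    ScoredEntity("PERSON", 0.4),
]

# A 0.5 threshold keeps only the two high-confidence detections
kept = filter_by_threshold(results, 0.5)
print([r.entity_type for r in kept])  # → ['EMAIL_ADDRESS', 'PHONE_NUMBER']
```

In practice this filtering happens inside the analyzer when `default_score_threshold` is set, so downstream code never sees the low-confidence detections.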

### Security Considerations

- Store configuration files in secure locations with appropriate permissions
- Use environment variables for sensitive configuration values
- Validate configuration files before deployment
- Regularly update NLP models and recognizer patterns
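A permissions check like the following (a minimal, POSIX-only sketch using the standard library) can flag configuration files that are readable by users other than the owner:

```python
import os
import stat

def is_world_readable(path: str) -> bool:
    """Return True if the file grants read access to group or others (POSIX)."""
    mode = os.stat(path).st_mode
    return bool(mode & (stat.S_IRGRP | stat.S_IROTH))

# Example: tighten permissions on a config file before deployment
# os.chmod("config/analyzer.yaml", 0o600)  # owner read/write only
```

Running such a check at startup, alongside the configuration validation shown earlier, gives an early warning before a deployment exposes recognizer patterns or file paths.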

### Deployment Strategies

- **Development**: Use comprehensive configurations with all recognizers
- **Production**: Use optimized configurations with essential recognizers only
- **Docker**: Use lightweight models and configurations for container efficiency
- **Multi-language**: Configure only the languages you actually need

### Monitoring and Maintenance

- Log configuration loading and validation results
- Monitor analyzer performance metrics
- Regularly review and update recognizer patterns
- Test configuration changes in staging environments before production deployment