
# Entity Recognizers

The entity recognizer framework provides the foundation for creating custom PII detection logic. It includes abstract base classes, pattern-based recognizers, and integration capabilities for remote services.

## Capabilities

### EntityRecognizer Base Class

Abstract base class that defines the interface for all PII entity recognizers in Presidio Analyzer.

```python { .api }
class EntityRecognizer:
    """
    Abstract base class for all PII entity recognizers.

    Args:
        supported_entities: List of entity types this recognizer can detect
        name: Unique identifier for the recognizer (auto-generated if None)
        supported_language: Primary language code supported (default: "en")
        version: Version string for the recognizer
        context: Optional context keywords that improve detection accuracy
    """
    def __init__(
        self,
        supported_entities: List[str],
        name: str = None,
        supported_language: str = "en",
        version: str = "0.0.1",
        context: Optional[List[str]] = None
    ): ...

    def analyze(
        self,
        text: str,
        entities: List[str],
        nlp_artifacts: NlpArtifacts
    ) -> List[RecognizerResult]:
        """
        Abstract method: Analyze text to detect PII entities.

        Args:
            text: Input text to analyze
            entities: List of entity types to look for
            nlp_artifacts: Pre-processed NLP data (tokens, lemmas, etc.)

        Returns:
            List of RecognizerResult objects for detected entities
        """

    def load(self) -> None:
        """Abstract method: Initialize recognizer resources (models, patterns, etc.)"""

    def enhance_using_context(
        self,
        text: str,
        raw_results: List[RecognizerResult],
        nlp_artifacts: NlpArtifacts,
        recognizers: List[EntityRecognizer],
        context: Optional[List[str]] = None
    ) -> List[RecognizerResult]:
        """
        Enhance detection results using contextual information.
        Can be overridden by subclasses for custom enhancement logic.

        Args:
            text: Original input text
            raw_results: Initial detection results
            nlp_artifacts: NLP processing artifacts
            recognizers: All available recognizers
            context: Context keywords for enhancement

        Returns:
            Enhanced list of RecognizerResult objects
        """

    def get_supported_entities(self) -> List[str]:
        """Get list of entity types this recognizer supports."""

    def get_supported_language(self) -> str:
        """Get primary supported language code."""

    def get_version(self) -> str:
        """Get recognizer version string."""

    def to_dict(self) -> Dict:
        """Serialize recognizer configuration to dictionary."""

    @classmethod
    def from_dict(cls, entity_recognizer_dict: Dict) -> EntityRecognizer:
        """Create recognizer instance from dictionary configuration."""

    @staticmethod
    def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
        """Remove duplicate results based on entity type and position."""

    @staticmethod
    def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
        """Clean input text using replacement patterns."""

    # Properties
    supported_entities: List[str]  # Entity types this recognizer detects
    name: str  # Unique recognizer identifier
    supported_language: str  # Primary language code
    version: str  # Version string
    is_loaded: bool  # Whether recognizer resources are loaded
    context: Optional[List[str]]  # Context keywords for enhancement
    id: str  # Unique instance identifier

    # Constants
    MIN_SCORE = 0  # Minimum confidence score
    MAX_SCORE = 1.0  # Maximum confidence score
```
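
Subclasses often call `sanitize_value` to strip separators from a match before validating it. The sketch below illustrates the idea in plain Python, assuming the helper applies each `(search, replacement)` pair in order using literal string replacement. The function name `sanitize_value_sketch` is hypothetical and this is not the library's source.

```python
from typing import List, Tuple


def sanitize_value_sketch(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """Apply each (search, replacement) pair to the text, in order (illustrative only)."""
    for search, replacement in replacement_pairs:
        text = text.replace(search, replacement)
    return text


# Strip dashes and spaces from a card-like match before running a checksum on it.
cleaned = sanitize_value_sketch("4532-0151 1283-0366", [("-", ""), (" ", "")])
print(cleaned)  # 4532015112830366
```

Because the pairs are applied sequentially, their order can matter when one replacement produces text that a later pair also matches.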

### LocalRecognizer

Abstract class for recognizers that run in the same process as the AnalyzerEngine.

```python { .api }
class LocalRecognizer(EntityRecognizer):
    """
    Abstract base class for recognizers that execute locally within the analyzer process.
    Inherits all methods and properties from EntityRecognizer.
    """
    pass
```

### PatternRecognizer

Concrete implementation for pattern-based PII detection using regular expressions and deny lists.

```python { .api }
class PatternRecognizer(LocalRecognizer):
    """
    PII entity recognizer using regular expressions and deny lists.

    Args:
        supported_entity: Single entity type this recognizer detects
        name: Unique identifier for the recognizer
        supported_language: Language code (default: "en")
        patterns: List of Pattern objects containing regex rules
        deny_list: List of strings that should always be detected
        context: Context keywords that improve detection accuracy
        deny_list_score: Confidence score for deny list matches (default: 1.0)
        global_regex_flags: Default regex compilation flags
        version: Version string
    """
    def __init__(
        self,
        supported_entity: str,
        name: str = None,
        supported_language: str = "en",
        patterns: List[Pattern] = None,
        deny_list: List[str] = None,
        context: List[str] = None,
        deny_list_score: float = 1.0,
        global_regex_flags: Optional[int] = None,  # Default: re.DOTALL | re.MULTILINE | re.IGNORECASE
        version: str = "0.0.1"
    ): ...

    def analyze(
        self,
        text: str,
        entities: List[str],
        nlp_artifacts: Optional[NlpArtifacts] = None,
        regex_flags: Optional[int] = None
    ) -> List[RecognizerResult]:
        """
        Analyze text using configured patterns and deny lists.

        Args:
            text: Input text to analyze
            entities: Entity types to detect (must include supported_entity)
            nlp_artifacts: Pre-processed NLP data (optional for pattern matching)
            regex_flags: Override default regex compilation flags

        Returns:
            List of RecognizerResult objects for pattern matches
        """

    def validate_result(self, pattern_text: str) -> Optional[bool]:
        """
        Validate pattern match using custom logic (override in subclasses).

        Args:
            pattern_text: Matched text from pattern

        Returns:
            True if valid, False if invalid, None if no validation performed
        """

    def invalidate_result(self, pattern_text: str) -> Optional[bool]:
        """
        Check if pattern match should be invalidated (override in subclasses).

        Args:
            pattern_text: Matched text from pattern

        Returns:
            True if should be invalidated, False if valid, None if no check performed
        """

    @staticmethod
    def build_regex_explanation(
        recognizer_name: str,
        pattern_name: str,
        pattern: str,
        original_score: float,
        validation_result: Optional[bool] = None
    ) -> AnalysisExplanation:
        """Build detailed explanation for regex-based detection."""

    def to_dict(self) -> Dict:
        """Serialize pattern recognizer configuration to dictionary."""

    @classmethod
    def from_dict(cls, entity_recognizer_dict: Dict) -> PatternRecognizer:
        """Create PatternRecognizer from dictionary configuration."""

    # Properties
    patterns: List[Pattern]  # List of regex Pattern objects
    deny_list: List[str]  # List of strings that indicate PII
    context: Optional[List[str]]  # Context keywords for enhancement
    deny_list_score: float  # Confidence score for deny list matches
    global_regex_flags: Optional[int]  # Default regex compilation flags
```
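
The tri-state return values of `validate_result` and `invalidate_result` are easier to follow with a worked example. The sketch below is a plain-Python illustration under an assumption about how the hooks feed into the final score: a validated match is raised to the maximum score, a match that fails validation or is invalidated drops to the minimum, and `None` leaves the pattern score untouched. The helpers `adjust_score` and `keep_match` are hypothetical names, not library functions.

```python
from typing import Optional

MIN_SCORE = 0.0  # mirrors EntityRecognizer.MIN_SCORE
MAX_SCORE = 1.0  # mirrors EntityRecognizer.MAX_SCORE


def adjust_score(
    pattern_score: float,
    validation_result: Optional[bool],
    invalidation_result: Optional[bool],
) -> float:
    """Combine a pattern's base score with the outcome of both hooks (assumed flow)."""
    score = pattern_score
    if validation_result is not None:
        # An explicit True/False from validation overrides the pattern score.
        score = MAX_SCORE if validation_result else MIN_SCORE
    if invalidation_result:
        # Invalidation wins over everything else.
        score = MIN_SCORE
    return score


def keep_match(score: float) -> bool:
    """Assumed rule: matches that end up at the minimum score are discarded."""
    return score > MIN_SCORE


print(adjust_score(0.6, True, None))   # 1.0 (validated)
print(adjust_score(0.6, None, None))   # 0.6 (no validation performed)
print(adjust_score(0.6, None, True))   # 0.0 (invalidated)
```

Under this reading, a low base score combined with a strict `validate_result` lets the pattern stay permissive while the validator decides which matches survive.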

### RemoteRecognizer

Abstract class for recognizers that call external services or run in separate processes.

```python { .api }
class RemoteRecognizer(EntityRecognizer):
    """
    Abstract base class for recognizers that call external services.

    Args:
        supported_entities: List of entity types this recognizer can detect
        name: Unique identifier for the recognizer
        supported_language: Language code
        version: Version string
        context: Optional context keywords
    """
    def __init__(
        self,
        supported_entities: List[str],
        name: Optional[str],
        supported_language: str,
        version: str,
        context: Optional[List[str]] = None
    ): ...

    def analyze(
        self,
        text: str,
        entities: List[str],
        nlp_artifacts: NlpArtifacts
    ) -> List[RecognizerResult]:
        """
        Abstract method: Call external service for PII detection.
        Must be implemented by concrete subclasses.
        """

    def get_supported_entities(self) -> List[str]:
        """Abstract method: Get supported entities from external service."""
```

### Pattern Class

Represents a regular expression pattern used by PatternRecognizer.

```python { .api }
class Pattern:
    """
    Regular expression pattern for PII detection.

    Args:
        name: Descriptive name for the pattern
        regex: Regular expression string
        score: Confidence score when pattern matches (0.0-1.0)
    """
    def __init__(self, name: str, regex: str, score: float): ...

    def to_dict(self) -> Dict:
        """Serialize pattern to dictionary format."""

    @classmethod
    def from_dict(cls, pattern_dict: Dict) -> Pattern:
        """Create Pattern from dictionary data."""

    # Properties
    name: str  # Descriptive pattern name
    regex: str  # Regular expression string
    score: float  # Confidence score for matches
    compiled_regex: Optional[re.Pattern]  # Compiled regex object (set on first use)
    compiled_with_flags: Optional[int]  # Regex flags the pattern was last compiled with
```
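
To show how the `to_dict` / `from_dict` pair can round-trip a pattern, here is a small stand-in written with a dataclass. `PatternSketch` is a hypothetical class for illustration only. It assumes the dictionary keys match the constructor arguments (`name`, `regex`, `score`) and that out-of-range scores are rejected, which may differ from the library's actual behaviour.

```python
from dataclasses import asdict, dataclass
from typing import Dict


@dataclass
class PatternSketch:
    """Hypothetical stand-in for Pattern: a name, a regex string, and a score."""

    name: str
    regex: str
    score: float

    def __post_init__(self) -> None:
        # Reject scores outside the documented 0.0-1.0 range (assumed behaviour).
        if not 0.0 <= self.score <= 1.0:
            raise ValueError(f"score must be between 0 and 1, got {self.score}")

    def to_dict(self) -> Dict:
        """Serialize to a plain dictionary keyed by field name."""
        return asdict(self)

    @classmethod
    def from_dict(cls, pattern_dict: Dict) -> "PatternSketch":
        """Rebuild an instance from the dictionary produced by to_dict()."""
        return cls(**pattern_dict)


original = PatternSketch(name="employee_id", regex=r"\bEMP-\d{5}\b", score=0.9)
restored = PatternSketch.from_dict(original.to_dict())
print(restored == original)  # True
```

A round trip like this is what makes it practical to keep pattern definitions in JSON or YAML and load them at startup.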

## Usage Examples

### Creating a Custom PatternRecognizer

```python
from presidio_analyzer import PatternRecognizer, Pattern

# Define patterns for custom entity type
employee_id_patterns = [
    Pattern(
        name="employee_id_format_1",
        regex=r"\bEMP-\d{5}\b",
        score=0.9
    ),
    Pattern(
        name="employee_id_format_2",
        regex=r"\b[Ee]mployee\s*[Ii][Dd]\s*:?\s*(\d{5})\b",
        score=0.8
    )
]

# Create custom recognizer
employee_recognizer = PatternRecognizer(
    supported_entity="EMPLOYEE_ID",
    name="EmployeeIdRecognizer",
    patterns=employee_id_patterns,
    context=["employee", "staff", "worker", "personnel"]
)

# Test the recognizer
from presidio_analyzer.nlp_engine import SpacyNlpEngine

nlp_engine = SpacyNlpEngine()
nlp_engine.load()

text = "Contact employee ID: 12345 or use EMP-98765"
nlp_artifacts = nlp_engine.process_text(text, "en")

results = employee_recognizer.analyze(
    text=text,
    entities=["EMPLOYEE_ID"],
    nlp_artifacts=nlp_artifacts
)

for result in results:
    detected_text = text[result.start:result.end]
    print(f"Found {result.entity_type}: '{detected_text}' (score: {result.score})")
```

### Using Deny Lists

```python
from presidio_analyzer import PatternRecognizer

# Create recognizer with deny list
sensitive_terms_recognizer = PatternRecognizer(
    supported_entity="SENSITIVE_TERM",
    name="SensitiveTermsRecognizer",
    deny_list=[
        "confidential",
        "classified",
        "internal use only",
        "proprietary"
    ],
    deny_list_score=0.95
)

# Test with text containing deny list terms
text = "This document is marked as confidential and internal use only"
results = sensitive_terms_recognizer.analyze(
    text=text,
    entities=["SENSITIVE_TERM"],
    nlp_artifacts=None  # Deny lists don't need NLP processing
)

print(f"Found {len(results)} sensitive terms")
```
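
Conceptually, a deny list is a shortcut for a single regex that matches any listed term as a whole word. The sketch below shows one way such a regex could be built with the standard library. The helper `deny_list_to_regex` is a hypothetical name, and the exact expression a recognizer generates internally may differ.

```python
import re
from typing import List


def deny_list_to_regex(deny_list: List[str]) -> str:
    """Build one alternation that matches any deny-list term as a whole word."""
    # Escape each term so characters such as '.' or '+' are matched literally.
    escaped = [re.escape(term) for term in deny_list]
    # Require start-of-text or a non-word character before the term,
    # and a non-word character or end-of-text after it.
    return r"(?:^|(?<=\W))(" + "|".join(escaped) + r")(?:(?=\W)|$)"


pattern = re.compile(
    deny_list_to_regex(["confidential", "internal use only"]),
    re.IGNORECASE,
)
text = "This document is marked as Confidential and internal use only"
matches = [m.group(1) for m in pattern.finditer(text)]
print(matches)  # ['Confidential', 'internal use only']
```

The boundary checks keep "confidential" from matching inside a longer word such as "confidentiality".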

### Creating a Custom Validation Recognizer

```python
import re
from typing import Optional

from presidio_analyzer import PatternRecognizer, Pattern

class CustomCreditCardRecognizer(PatternRecognizer):
    """Custom credit card recognizer with Luhn algorithm validation."""

    def __init__(self):
        patterns = [
            Pattern(
                name="credit_card_generic",
                regex=r"\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b",
                score=0.6  # Lower initial score, validation will increase
            )
        ]

        super().__init__(
            supported_entity="CREDIT_CARD",
            name="CustomCreditCardRecognizer",
            patterns=patterns
        )

    def validate_result(self, pattern_text: str) -> Optional[bool]:
        """Validate credit card number using Luhn algorithm."""
        # Remove separators (dashes and whitespace)
        digits = re.sub(r'[-\s]', '', pattern_text)

        if not digits.isdigit() or len(digits) != 16:
            return False

        # Luhn algorithm validation
        def luhn_check(card_num):
            def digits_of(n):
                return [int(d) for d in str(n)]

            digits = digits_of(card_num)
            odd_digits = digits[-1::-2]
            even_digits = digits[-2::-2]
            checksum = sum(odd_digits)
            for d in even_digits:
                checksum += sum(digits_of(d*2))
            return checksum % 10 == 0

        return luhn_check(digits)

# Use custom recognizer
recognizer = CustomCreditCardRecognizer()

# Test with valid and invalid credit card numbers
text = "Valid: 4532015112830366, Invalid: 1234567890123456"
results = recognizer.analyze(
    text=text,
    entities=["CREDIT_CARD"],
    nlp_artifacts=None
)

for result in results:
    card_num = text[result.start:result.end]
    print(f"Credit card: {card_num}, Score: {result.score}")
```

### Integrating Custom Recognizer with AnalyzerEngine

```python
from presidio_analyzer import (
    AnalyzerEngine,
    Pattern,
    PatternRecognizer,
    RecognizerRegistry,
)

# Create custom recognizer
custom_recognizer = PatternRecognizer(
    supported_entity="PRODUCT_CODE",
    name="ProductCodeRecognizer",
    patterns=[
        Pattern(
            name="product_code_pattern",
            regex=r"\bPRD-[A-Z]{2}-\d{4}\b",
            score=0.9
        )
    ]
)

# Create registry with custom recognizer
registry = RecognizerRegistry()
registry.add_recognizer(custom_recognizer)

# Load default recognizers
registry.load_predefined_recognizers(languages=["en"])

# Create analyzer with custom registry
analyzer = AnalyzerEngine(registry=registry)

# Test analysis
text = "Order product PRD-AB-1234 and contact john@email.com"
results = analyzer.analyze(text=text, language="en")

for result in results:
    detected_text = text[result.start:result.end]
    print(f"Found {result.entity_type}: '{detected_text}'")
```

### Remote Recognizer Implementation Example

```python
from typing import List

import requests
from presidio_analyzer import RemoteRecognizer, RecognizerResult

class APIBasedRecognizer(RemoteRecognizer):
    """Example remote recognizer that calls external API."""

    def __init__(self, api_endpoint: str, api_key: str):
        super().__init__(
            supported_entities=["CUSTOM_PII"],
            name="APIBasedRecognizer",
            supported_language="en",
            version="1.0.0"
        )
        self.api_endpoint = api_endpoint
        self.api_key = api_key

    def load(self) -> None:
        """Initialize connection to remote service."""
        # Test API connectivity
        headers = {"Authorization": f"Bearer {self.api_key}"}
        response = requests.get(f"{self.api_endpoint}/health", headers=headers)
        if response.status_code != 200:
            raise ConnectionError("Cannot connect to remote PII service")

    def analyze(self, text: str, entities: List[str], nlp_artifacts) -> List[RecognizerResult]:
        """Call remote API for PII detection."""
        if "CUSTOM_PII" not in entities:
            return []

        headers = {"Authorization": f"Bearer {self.api_key}"}
        payload = {"text": text, "entities": entities}

        response = requests.post(
            f"{self.api_endpoint}/analyze",
            json=payload,
            headers=headers
        )

        results = []
        if response.status_code == 200:
            api_results = response.json()
            for detection in api_results.get("detections", []):
                result = RecognizerResult(
                    entity_type=detection["entity_type"],
                    start=detection["start"],
                    end=detection["end"],
                    score=detection["score"]
                )
                results.append(result)

        return results

    def get_supported_entities(self) -> List[str]:
        """Get supported entities from remote service."""
        headers = {"Authorization": f"Bearer {self.api_key}"}
        response = requests.get(f"{self.api_endpoint}/entities", headers=headers)

        if response.status_code == 200:
            return response.json().get("entities", [])
        return self.supported_entities

# Usage (assuming you have an API endpoint)
# remote_recognizer = APIBasedRecognizer(
#     api_endpoint="https://api.example.com/pii",
#     api_key="your-api-key"
# )
# remote_recognizer.load()
```
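
A remote service can return malformed or out-of-range detections, so it may be worth validating each entry before turning it into a result. The sketch below does this with the standard library only. The payload shape (`detections` with `entity_type`, `start`, `end`, `score`) follows the example above, while `Detection` and `parse_detections` are hypothetical helpers introduced for illustration.

```python
from typing import Any, Dict, List, NamedTuple


class Detection(NamedTuple):
    """Plain container for one validated detection (illustrative stand-in)."""

    entity_type: str
    start: int
    end: int
    score: float


def parse_detections(payload: Dict[str, Any], text: str) -> List[Detection]:
    """Keep only well-formed detections whose span and score are in range."""
    parsed: List[Detection] = []
    for item in payload.get("detections") or []:
        try:
            detection = Detection(
                entity_type=str(item["entity_type"]),
                start=int(item["start"]),
                end=int(item["end"]),
                score=float(item["score"]),
            )
        except (KeyError, TypeError, ValueError):
            continue  # Skip entries with missing or non-numeric fields.
        in_bounds = 0 <= detection.start < detection.end <= len(text)
        if in_bounds and 0.0 <= detection.score <= 1.0:
            parsed.append(detection)
    return parsed


text = "Call 555-1234"
payload = {
    "detections": [
        {"entity_type": "CUSTOM_PII", "start": 5, "end": 13, "score": 0.8},
        {"entity_type": "CUSTOM_PII", "start": 5},  # missing fields
        {"entity_type": "CUSTOM_PII", "start": 10, "end": 99, "score": 0.9},  # out of bounds
    ]
}
valid = parse_detections(payload, text)
print([text[d.start:d.end] for d in valid])  # ['555-1234']
```

Filtering at this boundary keeps a single bad entry from raising inside `analyze()` and discarding the detections that were fine.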

### Configuration-Driven Recognizer Creation

```python
from presidio_analyzer import PatternRecognizer, Pattern
import yaml

def create_recognizer_from_config(config_file: str) -> PatternRecognizer:
    """Create PatternRecognizer from YAML configuration."""
    with open(config_file, 'r') as f:
        config = yaml.safe_load(f)

    # Create patterns from configuration
    patterns = []
    for pattern_config in config.get('patterns', []):
        pattern = Pattern(
            name=pattern_config['name'],
            regex=pattern_config['regex'],
            score=pattern_config['score']
        )
        patterns.append(pattern)

    # Create recognizer
    recognizer = PatternRecognizer(
        supported_entity=config['entity_type'],
        name=config['name'],
        patterns=patterns,
        deny_list=config.get('deny_list', []),
        context=config.get('context', []),
        supported_language=config.get('language', 'en')
    )

    return recognizer

# Example YAML configuration file (recognizer_config.yaml):
"""
name: "CustomBankAccountRecognizer"
entity_type: "BANK_ACCOUNT"
language: "en"
patterns:
  - name: "routing_account_pattern"
    regex: "\\b\\d{9}[-\\s]\\d{10,12}\\b"
    score: 0.8
  - name: "account_number_pattern"
    regex: "Account\\s*:?\\s*(\\d{10,12})"
    score: 0.7
deny_list:
  - "0000000000"
  - "1111111111"
context:
  - "account"
  - "banking"
  - "routing"
"""

# Create recognizer from configuration
# recognizer = create_recognizer_from_config("recognizer_config.yaml")
```

## Best Practices

### Pattern Design

- Use word boundaries (`\b`) to avoid partial matches
- Test patterns with various input formats
- Start with lower confidence scores and use validation to increase them
- Include context keywords to improve accuracy
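
The effect of word boundaries is easiest to see by running the same pattern with and without `\b`. This short sketch uses only the standard `re` module, and the sample text is made up for illustration.

```python
import re

text = "Ticket XEMP-123456 was filed by EMP-98765."

loose = re.compile(r"EMP-\d{5}")       # no boundaries
strict = re.compile(r"\bEMP-\d{5}\b")  # must start and end on a word boundary

loose_matches = loose.findall(text)
strict_matches = strict.findall(text)

print(loose_matches)   # ['EMP-12345', 'EMP-98765']
print(strict_matches)  # ['EMP-98765']
```

The unbounded pattern picks up a fragment of the unrelated token `XEMP-123456`, while the bounded version reports only the standalone ID.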

### Performance Optimization

- Compile patterns once during initialization
- Return early from `analyze()` when none of the requested entity types is supported
- Keep validation logic cheap, since it runs on every match
- Consider caching for expensive operations
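
One way to make sure a regex is compiled only once, even when helper code is called repeatedly, is to memoize compilation. The sketch below uses `functools.lru_cache` from the standard library. The `compiled` helper and `DEFAULT_FLAGS` constant are hypothetical names, and the flag combination simply mirrors the default noted for `global_regex_flags`.

```python
import re
from functools import lru_cache

DEFAULT_FLAGS = re.DOTALL | re.MULTILINE | re.IGNORECASE


@lru_cache(maxsize=256)
def compiled(regex: str, flags: int = DEFAULT_FLAGS) -> "re.Pattern[str]":
    """Compile a regex once per (regex, flags) pair and reuse the result."""
    return re.compile(regex, flags)


first = compiled(r"\bPRD-[A-Z]{2}-\d{4}\b")
second = compiled(r"\bPRD-[A-Z]{2}-\d{4}\b")

print(first is second)  # True: the second call is served from the cache
```

Python's `re` module keeps its own small internal cache, so the gain here is mainly predictability: the cache size is explicit and lookups skip re-parsing the flags and pattern string.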

### Error Handling

- Validate input parameters in constructor
- Handle regex compilation errors gracefully
- Implement proper logging for debugging
- Return empty results rather than raising exceptions for invalid input
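
As a rough illustration of these points, the sketch below logs and skips patterns that fail to compile, and returns an empty list for unusable input instead of raising. It relies only on the standard library. `compile_or_none` and `find_spans` are hypothetical helpers, not part of Presidio.

```python
import logging
import re
from typing import List, Optional, Tuple

logger = logging.getLogger("recognizer_sketch")


def compile_or_none(regex: str, flags: int = 0) -> Optional["re.Pattern[str]"]:
    """Compile a regex, logging and returning None if it is invalid."""
    try:
        return re.compile(regex, flags)
    except re.error as exc:
        logger.warning("Skipping invalid pattern %r: %s", regex, exc)
        return None


def find_spans(text: object, regexes: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) spans for all valid patterns; empty list on bad input."""
    if not isinstance(text, str) or not text:
        return []
    spans: List[Tuple[int, int]] = []
    for regex in regexes:
        pattern = compile_or_none(regex)
        if pattern is None:
            continue  # One broken pattern should not block the others.
        spans.extend(match.span() for match in pattern.finditer(text))
    return spans


print(find_spans("EMP-12345", [r"EMP-\d{5}", r"(unclosed"]))  # [(0, 9)]
print(find_spans(None, [r"EMP-\d{5}"]))                       # []
```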