
tessl/pypi-presidio-analyzer

Python-based service for detecting PII entities in text using Named Entity Recognition, regular expressions, rule-based logic, and checksums.

- Workspace: tessl
- Visibility: Public
- Describes: pkg:pypi/presidio-analyzer@2.2.x

To install, run:

```
npx @tessl/cli install tessl/pypi-presidio-analyzer@2.2.0
```

# Presidio Analyzer

Presidio Analyzer is a Python-based service for detecting PII (Personally Identifiable Information) entities in unstructured text. It provides a pluggable and customizable framework using Named Entity Recognition, regular expressions, rule-based logic, and checksums to identify over 50 types of PII entities across multiple languages.

## Package Information

- **Package Name**: presidio_analyzer
- **Language**: Python
- **Installation**: `pip install presidio-analyzer`
- **Supported Python**: 3.9, 3.10, 3.11, 3.12
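The checksum-based detection mentioned above refers to validators such as the Luhn algorithm used for credit card numbers. A minimal self-contained sketch of that check (illustration only, not Presidio's internal code):

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    digits = [int(d) for d in number if d.isdigit()]
    # Double every second digit from the right; subtract 9 if the result exceeds 9
    for i in range(len(digits) - 2, -1, -2):
        digits[i] = digits[i] * 2
        if digits[i] > 9:
            digits[i] -= 9
    return sum(digits) % 10 == 0

print(luhn_valid("4111 1111 1111 1111"))  # a well-known Luhn-valid test number
```

A recognizer can pair such a checksum with a regex match, raising confidence only for digit runs that actually validate.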

## Core Imports

```python
from presidio_analyzer import AnalyzerEngine
```

For comprehensive imports:

```python
from presidio_analyzer import (
    AnalyzerEngine,
    BatchAnalyzerEngine,
    RecognizerResult,
    PatternRecognizer,
    Pattern,
    AnalyzerEngineProvider
)
```

## Basic Usage

```python
from presidio_analyzer import AnalyzerEngine

# Initialize analyzer
analyzer = AnalyzerEngine()

# Analyze text for PII
text = "My name is John Doe and my phone number is 555-123-4567"
results = analyzer.analyze(text=text, language="en")

# Process results
for result in results:
    print(f"Entity: {result.entity_type}")
    print(f"Text: {text[result.start:result.end]}")
    print(f"Score: {result.score}")
    print(f"Location: {result.start}-{result.end}")
```

## Architecture

Presidio Analyzer follows a modular architecture:

- **AnalyzerEngine**: Central orchestrator that coordinates all analysis operations
- **RecognizerRegistry**: Manages and holds all entity recognizers
- **EntityRecognizer**: Base class for all PII detection logic (pattern-based, ML-based, remote)
- **NlpEngine**: Abstraction layer over NLP preprocessing (spaCy, Stanza, Transformers)
- **ContextAwareEnhancer**: Improves detection accuracy using surrounding context
- **BatchAnalyzerEngine**: Enables efficient processing of large datasets

This design allows for flexible deployment options, from Python scripts to Docker containers and Kubernetes orchestration, while maintaining high extensibility for custom recognizers and detection logic.

## Capabilities

### Core Analysis Engine

Central PII detection functionality including the main AnalyzerEngine class, request handling, and result processing. Provides the primary interface for detecting PII entities in text.

```python { .api }
class AnalyzerEngine:
    def __init__(
        self,
        registry: RecognizerRegistry = None,
        nlp_engine: NlpEngine = None,
        app_tracer: AppTracer = None,
        log_decision_process: bool = False,
        default_score_threshold: float = 0,
        supported_languages: List[str] = None,
        context_aware_enhancer: Optional[ContextAwareEnhancer] = None
    ): ...

    def analyze(
        self,
        text: str,
        language: str,
        entities: Optional[List[str]] = None,
        correlation_id: Optional[str] = None,
        score_threshold: Optional[float] = None,
        return_decision_process: Optional[bool] = False,
        ad_hoc_recognizers: Optional[List[EntityRecognizer]] = None,
        context: Optional[List[str]] = None,
        allow_list: Optional[List[str]] = None,
        allow_list_match: Optional[str] = "exact",
        regex_flags: Optional[int] = None,
        nlp_artifacts: Optional[NlpArtifacts] = None
    ) -> List[RecognizerResult]: ...
```

[Core Analysis Engine](./core-analysis.md)
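Two of `analyze()`'s parameters, `entities` and `score_threshold`, act as post-detection filters. A simplified stand-in showing the effect (hypothetical helper over (type, score) pairs, not the engine's actual implementation):

```python
from typing import List, Optional, Tuple

# (entity_type, score) pairs stand in for RecognizerResult objects
Result = Tuple[str, float]

def filter_results(
    results: List[Result],
    entities: Optional[List[str]] = None,
    score_threshold: float = 0.0,
) -> List[Result]:
    """Keep results whose type was requested and whose score clears the threshold."""
    return [
        (etype, score)
        for etype, score in results
        if (entities is None or etype in entities) and score >= score_threshold
    ]

raw = [("PERSON", 0.85), ("PHONE_NUMBER", 0.4), ("EMAIL_ADDRESS", 0.95)]
print(filter_results(raw, entities=["PERSON", "EMAIL_ADDRESS"], score_threshold=0.5))
# → [('PERSON', 0.85), ('EMAIL_ADDRESS', 0.95)]
```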

### Batch Processing

High-performance analysis of large datasets including iterables, dictionaries, and structured data, with multiprocessing support and configurable batch sizes.

```python { .api }
class BatchAnalyzerEngine:
    def __init__(self, analyzer_engine: Optional[AnalyzerEngine] = None): ...

    def analyze_iterator(
        self,
        texts: Iterable[Union[str, bool, float, int]],
        language: str,
        batch_size: int = 1,
        n_process: int = 1,
        **kwargs
    ) -> List[List[RecognizerResult]]: ...

    def analyze_dict(
        self,
        input_dict: Dict[str, Union[Any, Iterable[Any]]],
        language: str,
        keys_to_skip: Optional[List[str]] = None,
        batch_size: int = 1,
        n_process: int = 1,
        **kwargs
    ) -> Iterator[DictAnalyzerResult]: ...
```

[Batch Processing](./batch-processing.md)
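The `batch_size` parameter controls how many texts are handed to the NLP pipeline at once. Conceptually, the input iterable is chunked as in this sketch (illustration of the idea, not the library's code):

```python
from itertools import islice
from typing import Iterable, Iterator, List

def chunked(texts: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive batches of at most batch_size texts."""
    it = iter(texts)
    while batch := list(islice(it, batch_size)):
        yield batch

print(list(chunked(["a", "b", "c", "d", "e"], 2)))
# → [['a', 'b'], ['c', 'd'], ['e']]
```

Larger batches amortize pipeline overhead; `n_process` then distributes batches across worker processes.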

### Entity Recognizers

Framework for creating custom PII recognizers, including abstract base classes, pattern-based recognizers, and remote service integration capabilities.

```python { .api }
class EntityRecognizer:
    def __init__(
        self,
        supported_entities: List[str],
        name: str = None,
        supported_language: str = "en",
        version: str = "0.0.1",
        context: Optional[List[str]] = None
    ): ...

    def analyze(
        self,
        text: str,
        entities: List[str],
        nlp_artifacts: NlpArtifacts
    ) -> List[RecognizerResult]: ...

class PatternRecognizer(LocalRecognizer):
    def __init__(
        self,
        supported_entity: str,
        name: str = None,
        supported_language: str = "en",
        patterns: List[Pattern] = None,
        deny_list: List[str] = None,
        context: List[str] = None,
        deny_list_score: float = 1.0,
        global_regex_flags: Optional[int] = None,
        version: str = "0.0.1"
    ): ...
```

[Entity Recognizers](./entity-recognizers.md)
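A `PatternRecognizer` pairs each `Pattern` (name, regex, score) with an entity type and reports every match as a text span plus a confidence score. The core idea, reduced to plain `re` (an illustrative stand-in, not Presidio's implementation):

```python
import re
from typing import List, Tuple

def pattern_matches(text: str, regex: str, score: float) -> List[Tuple[int, int, float]]:
    """Return (start, end, score) for each regex match, mimicking RecognizerResult spans."""
    return [(m.start(), m.end(), score) for m in re.finditer(regex, text)]

# A toy five-digit ZIP-code pattern with a modest confidence score
print(pattern_matches("Ship to 98052 or 10001.", r"\b\d{5}\b", 0.4))
# → [(8, 13, 0.4), (17, 22, 0.4)]
```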

### Predefined Recognizers

Comprehensive collection of over 50 built-in recognizers for common PII types, including generic entities (emails, phone numbers, credit cards) and country-specific identifiers (SSNs, passport numbers, tax IDs).

```python { .api }
# Generic recognizers
class CreditCardRecognizer(PatternRecognizer): ...
class EmailRecognizer(PatternRecognizer): ...
class PhoneRecognizer(PatternRecognizer): ...
class IpRecognizer(PatternRecognizer): ...

# US-specific recognizers
class UsSsnRecognizer(PatternRecognizer): ...
class UsLicenseRecognizer(PatternRecognizer): ...
class UsPassportRecognizer(PatternRecognizer): ...

# International recognizers
class IbanRecognizer(PatternRecognizer): ...
class AuMedicareRecognizer(PatternRecognizer): ...
class UkNinoRecognizer(PatternRecognizer): ...
```

[Predefined Recognizers](./predefined-recognizers.md)

### Context Enhancement

Advanced context-aware enhancement that improves detection accuracy by analyzing surrounding text using lemmatization and contextual similarity scoring.

```python { .api }
class ContextAwareEnhancer:
    def __init__(
        self,
        context_similarity_factor: float,
        min_score_with_context_similarity: float,
        context_prefix_count: int,
        context_suffix_count: int
    ): ...

    def enhance_using_context(
        self,
        text: str,
        raw_results: List[RecognizerResult],
        nlp_artifacts: NlpArtifacts,
        recognizers: List[EntityRecognizer],
        context: Optional[List[str]] = None
    ) -> List[RecognizerResult]: ...

class LemmaContextAwareEnhancer(ContextAwareEnhancer):
    def __init__(
        self,
        context_similarity_factor: float = 0.35,
        min_score_with_context_similarity: float = 0.4,
        context_prefix_count: int = 5,
        context_suffix_count: int = 0
    ): ...
```

[Context Enhancement](./context-enhancement.md)
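The enhancer raises a result's score when one of the recognizer's context words appears near the match (by default within the 5 preceding tokens, per `context_prefix_count`). A simplified stand-in using the default factors above (hypothetical helper, not the real implementation, which also lemmatizes tokens):

```python
from typing import List

def enhance_score(
    base_score: float,
    nearby_tokens: List[str],
    context_words: List[str],
    similarity_factor: float = 0.35,
    min_score_with_context: float = 0.4,
) -> float:
    """Boost the score when a context word occurs among the nearby tokens."""
    if any(tok.lower() in context_words for tok in nearby_tokens):
        boosted = base_score + similarity_factor
        # Never drop below the floor, never exceed 1.0
        return min(1.0, max(boosted, min_score_with_context))
    return base_score

# "phone" appearing just before a digit match strengthens a PHONE_NUMBER detection
print(enhance_score(0.4, ["my", "phone", "number", "is"], ["phone", "telephone"]))
```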

### Configuration and Setup

Flexible configuration system supporting YAML-based setup, multiple NLP engines (spaCy, Stanza, Transformers), and customizable recognizer registries.

```python { .api }
class AnalyzerEngineProvider:
    def __init__(
        self,
        analyzer_engine_conf_file: Optional[Union[Path, str]] = None,
        nlp_engine_conf_file: Optional[Union[Path, str]] = None,
        recognizer_registry_conf_file: Optional[Union[Path, str]] = None
    ): ...

    def create_engine(self) -> AnalyzerEngine: ...

class RecognizerRegistry:
    def __init__(
        self,
        recognizers: Optional[Iterable[EntityRecognizer]] = None,
        global_regex_flags: Optional[int] = None,
        supported_languages: Optional[List[str]] = None
    ): ...

    def load_predefined_recognizers(
        self,
        languages: Optional[List[str]] = None,
        nlp_engine: NlpEngine = None
    ) -> None: ...
```

[Configuration and Setup](./configuration.md)

## Types

### Core Result Types

```python { .api }
class RecognizerResult:
    def __init__(
        self,
        entity_type: str,
        start: int,
        end: int,
        score: float,
        analysis_explanation: AnalysisExplanation = None,
        recognition_metadata: Dict = None
    ): ...

    # Properties
    entity_type: str    # Type of detected entity (e.g., "PERSON", "PHONE_NUMBER")
    start: int          # Start position in text
    end: int            # End position in text
    score: float        # Confidence score (0.0 to 1.0)
    analysis_explanation: AnalysisExplanation  # Detailed detection explanation
    recognition_metadata: Dict                 # Additional recognizer-specific metadata

class DictAnalyzerResult:
    key: str                            # Dictionary key that was analyzed
    value: Union[str, List[str], dict]  # Original value
    recognizer_results: Union[
        List[RecognizerResult],
        List[List[RecognizerResult]],
        Iterator[DictAnalyzerResult]
    ]                                   # Detection results

class AnalysisExplanation:
    def __init__(
        self,
        recognizer: str,
        original_score: float,
        pattern_name: str = None,
        pattern: str = None,
        validation_result: float = None,
        textual_explanation: str = None,
        regex_flags: int = None
    ): ...

    # Properties
    recognizer: str           # Name of the recognizer that made the detection
    original_score: float     # Initial confidence score
    score: float              # Final confidence score (after enhancements)
    pattern_name: str         # Name of the matching pattern (if applicable)
    textual_explanation: str  # Human-readable explanation
```
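When several recognizers flag overlapping spans, results are typically deduplicated by keeping the higher-scoring span and dropping spans it fully contains. A simplified sketch of that idea using a stand-in result type (not `RecognizerResult` itself):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Span:
    entity_type: str
    start: int
    end: int
    score: float

def remove_contained(results: List[Span]) -> List[Span]:
    """Drop any span fully contained within an already-kept, higher-scoring span."""
    kept: List[Span] = []
    for r in sorted(results, key=lambda s: s.score, reverse=True):
        if not any(k.start <= r.start and r.end <= k.end for k in kept):
            kept.append(r)
    return kept

hits = [Span("PERSON", 11, 19, 0.85), Span("PERSON", 11, 15, 0.4)]
print(remove_contained(hits))
# → [Span(entity_type='PERSON', start=11, end=19, score=0.85)]
```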

### Pattern and Configuration Types

```python { .api }
class Pattern:
    def __init__(self, name: str, regex: str, score: float): ...

    # Properties
    name: str     # Descriptive name for the pattern
    regex: str    # Regular expression pattern
    score: float  # Confidence score when the pattern matches

class AnalyzerRequest:
    def __init__(self, req_data: Dict): ...

    # Properties extracted from req_data
    text: str                                # Text to analyze
    language: str                            # Language code (e.g., "en")
    entities: Optional[List[str]]            # Entity types to detect
    correlation_id: Optional[str]            # Request tracking ID
    score_threshold: Optional[float]         # Minimum confidence score
    return_decision_process: Optional[bool]  # Include analysis explanations
    ad_hoc_recognizers: Optional[List[EntityRecognizer]]  # Custom recognizers
    context: Optional[List[str]]             # Context keywords for enhancement
    allow_list: Optional[List[str]]          # Values to exclude from detection
    allow_list_match: Optional[str]          # Match strategy ("exact" or "fuzzy")
    regex_flags: Optional[int]               # Regex compilation flags
```
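The `allow_list` suppresses detections whose matched text is explicitly permitted; with the `"exact"` match strategy this is a literal string comparison. A simplified stand-in over (start, end) spans (hypothetical helper, not the engine's code):

```python
from typing import List, Tuple

def apply_allow_list(
    text: str,
    results: List[Tuple[int, int]],  # (start, end) spans standing in for results
    allow_list: List[str],
) -> List[Tuple[int, int]]:
    """Drop spans whose exact matched text appears in the allow list."""
    return [(s, e) for s, e in results if text[s:e] not in allow_list]

text = "Contact info@example.com or admin@internal.test"
spans = [(8, 24), (28, 47)]
print(apply_allow_list(text, spans, allow_list=["admin@internal.test"]))
# → [(8, 24)]
```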

## Error Handling

Presidio Analyzer uses standard Python exceptions. Common error scenarios:

- **ValueError**: Invalid parameters (e.g., unsupported language, invalid score threshold)
- **TypeError**: Incorrect parameter types
- **ImportError**: Missing optional dependencies (e.g., transformers, stanza)
- **FileNotFoundError**: Missing configuration files when using AnalyzerEngineProvider

## Multi-language Support

Supported languages: English (en), Hebrew (he), Spanish (es), German (de), French (fr), Italian (it), Portuguese (pt), Chinese (zh), Japanese (ja), Hindi (hi), Arabic (ar).

Language-specific recognizers are loaded automatically based on the `language` parameter in `analyze()` calls. Some recognizers support multiple languages, while others are region-specific.