# Batch Processing

The `BatchAnonymizerEngine` anonymizes lists and dictionaries in bulk, so multiple texts or structured records can be processed in a single call.

## Capabilities

### Initialize Batch Engine

Create a batch processor with an optional custom `AnonymizerEngine`.

```python { .api }
def __init__(self, anonymizer_engine: Optional[AnonymizerEngine] = None):
    """
    Initialize BatchAnonymizerEngine.

    Parameters:
    - anonymizer_engine (Optional[AnonymizerEngine]): Custom anonymizer instance,
      defaults to new AnonymizerEngine()
    """
```

**Usage Example:**

```python
from presidio_anonymizer import BatchAnonymizerEngine, AnonymizerEngine

# Use default engine
batch_engine = BatchAnonymizerEngine()

# Use custom engine with added operators.
# MyCustomOperator is a placeholder for your own operator class; define it before this call.
custom_engine = AnonymizerEngine()
custom_engine.add_anonymizer(MyCustomOperator)
batch_engine = BatchAnonymizerEngine(anonymizer_engine=custom_engine)
```

### List Anonymization

Anonymize a list of texts, where each text has a corresponding list of analyzer results at the same position.

```python { .api }
def anonymize_list(
    self,
    texts: List[Optional[Union[str, bool, int, float]]],
    recognizer_results_list: List[List[RecognizerResult]],
    **kwargs
) -> List[Union[str, Any]]:
    """
    Anonymize a list of strings.

    Parameters:
    - texts (List[Optional[Union[str, bool, int, float]]]): List of texts to anonymize.
      Non-string types (bool, int, float) are converted to string; other types pass through unchanged
    - recognizer_results_list (List[List[RecognizerResult]]): List of analyzer results for each text
    - **kwargs: Additional arguments passed to AnonymizerEngine.anonymize()

    Returns:
    List[Union[str, Any]]: List of anonymized texts, with non-anonymizable items unchanged
    """
```

**Usage Examples:**

```python
from presidio_anonymizer import BatchAnonymizerEngine
from presidio_anonymizer.entities import RecognizerResult, OperatorConfig

batch_engine = BatchAnonymizerEngine()

# Anonymize multiple texts
texts = [
    "John Doe lives in New York",
    "Contact Sarah at sarah@email.com",
    "Call Mike at 555-1234",
    42,  # Non-string type
    None,  # None value
]

# One list of results per text; start/end are character offsets into that text
analyzer_results = [
    [RecognizerResult("PERSON", 0, 8, 0.9), RecognizerResult("LOCATION", 18, 26, 0.8)],
    [RecognizerResult("PERSON", 8, 13, 0.9), RecognizerResult("EMAIL_ADDRESS", 17, 32, 0.9)],
    [RecognizerResult("PERSON", 5, 9, 0.9), RecognizerResult("PHONE_NUMBER", 13, 21, 0.8)],
    [],  # No analyzer results for number
    [],  # No analyzer results for None
]

operators = {
    "PERSON": OperatorConfig("replace", {"new_value": "[PERSON]"}),
    # from_end is set explicitly; the first 5 characters are masked
    "EMAIL_ADDRESS": OperatorConfig(
        "mask", {"masking_char": "*", "chars_to_mask": 5, "from_end": False}
    ),
    "PHONE_NUMBER": OperatorConfig("redact"),
    "LOCATION": OperatorConfig("replace", {"new_value": "[LOCATION]"}),
}

result = batch_engine.anonymize_list(
    texts=texts,
    recognizer_results_list=analyzer_results,
    operators=operators,
)

print(result)
# ['[PERSON] lives in [LOCATION]', 'Contact [PERSON] at *****@email.com', 'Call [PERSON] at ', '42', None]
```
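
Because each entry in `recognizer_results_list` is matched to a text by position, and each result carries character offsets into that text, a quick pre-check can catch misaligned inputs before anonymizing. The following is a minimal sketch and not part of presidio: `Span` is a hypothetical stand-in for any object with `start` and `end` attributes (as `RecognizerResult` has), and `check_alignment` is an illustrative helper name.

```python
from dataclasses import dataclass
from typing import Any, List, Sequence


@dataclass
class Span:
    """Hypothetical stand-in for a result object exposing start/end offsets."""
    start: int
    end: int


def check_alignment(texts: Sequence[Any], results: Sequence[Sequence[Span]]) -> List[str]:
    """Return a list of problems found; an empty list means the inputs line up."""
    problems = []
    if len(texts) != len(results):
        problems.append(f"{len(texts)} texts but {len(results)} result lists")
    for i, (text, spans) in enumerate(zip(texts, results)):
        # Only str/bool/int/float items are treated as text; other items get length 0.
        length = len(str(text)) if type(text) in (str, bool, int, float) else 0
        for span in spans:
            if not 0 <= span.start < span.end <= length:
                problems.append(
                    f"item {i}: span {span.start}-{span.end} outside text of length {length}"
                )
    return problems


texts = ["Contact Sarah at sarah@email.com", 42]
results = [[Span(8, 13), Span(17, 33)], []]
print(check_alignment(texts, results))
# ['item 0: span 17-33 outside text of length 32']
```

A check like this is cheap relative to anonymization and makes off-by-one offsets visible early, rather than surfacing as an error or a wrongly cut string later.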

### Dictionary Anonymization

Anonymize values in nested dictionaries and structured data.

```python { .api }
def anonymize_dict(
    self,
    analyzer_results: Iterable[DictRecognizerResult],
    **kwargs
) -> Dict[str, str]:
    """
    Anonymize values in a dictionary.

    Parameters:
    - analyzer_results (Iterable[DictRecognizerResult]): Iterator of DictRecognizerResult
      containing analyzer results for dictionary values
    - **kwargs: Additional arguments passed to AnonymizerEngine.anonymize()

    Returns:
    Dict[str, str]: Dictionary with anonymized values
    """
```

**Usage Example:**

```python
from presidio_anonymizer import BatchAnonymizerEngine
from presidio_anonymizer.entities import DictRecognizerResult, OperatorConfig

batch_engine = BatchAnonymizerEngine()

# Example dictionary data
data_dict = {
    "user_info": {
        "name": "John Doe",
        "email": "john@example.com"
    },
    "contacts": ["Alice Johnson", "Bob Smith"],
    "phone": "555-1234",
    "age": 30
}

# DictRecognizerResult pairs a key and its value with the analyzer results for that value
# This would typically come from presidio-analyzer's analyze_dict method
dict_analyzer_results = [
    DictRecognizerResult(
        key="user_info",
        value={"name": "John Doe", "email": "john@example.com"},
        recognizer_results=[
            # Nested analyzer results for the dictionary value
        ]
    ),
    # Additional results for other keys...
]

operators = {
    "PERSON": OperatorConfig("replace", {"new_value": "[PERSON]"}),
    "EMAIL_ADDRESS": OperatorConfig(
        "mask", {"masking_char": "*", "chars_to_mask": 5, "from_end": False}
    )
}

anonymized_dict = batch_engine.anonymize_dict(
    analyzer_results=dict_analyzer_results,
    operators=operators
)
```

## Data Type Handling

The batch engine handles each data type as follows:

### String Types
- Processed through the anonymization engine
- Returned as anonymized strings

### Numeric and Boolean Types (int, float, bool)
- Converted to strings and processed
- Returned as anonymized strings

### Other Types
- Pass through unchanged (objects, None, custom classes)
- No anonymization applied

### Nested Structures
- Dictionaries: Recursively processed
- Lists/Iterables: Each item processed individually
- Mixed types: Handled according to their individual type rules
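
The rules above can be sketched as a small recursive function. This is an illustration of the rules as listed here, not presidio's implementation, and the behavior of your installed version may differ for individual values. `apply_type_rules` and `fake_anonymize` are hypothetical names; `fake_anonymize` stands in for a real call to the anonymizer.

```python
from typing import Any, Callable


def apply_type_rules(value: Any, anonymize_text: Callable[[str], str]) -> Any:
    """Illustrative only: route a value according to the documented type rules."""
    if isinstance(value, dict):
        # Dictionaries: recurse into each value
        return {k: apply_type_rules(v, anonymize_text) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        # Lists: process each item individually
        return [apply_type_rules(v, anonymize_text) for v in value]
    if type(value) in (str, bool, int, float):
        # Strings, numbers and booleans: convert to str, then anonymize
        return anonymize_text(str(value))
    # Everything else (None, custom objects) passes through unchanged
    return value


def fake_anonymize(text: str) -> str:
    """Stand-in for a real anonymizer call."""
    return text.replace("John Doe", "[PERSON]")


record = {"name": "John Doe", "age": 30, "tags": ["John Doe", None]}
print(apply_type_rules(record, fake_anonymize))
# {'name': '[PERSON]', 'age': '30', 'tags': ['[PERSON]', None]}
```

Note that the integer `30` comes back as the string `'30'` under these rules, so downstream code that expects numeric types may need to convert values back.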

## Performance Considerations

- **Batch Processing**: More efficient than individual calls for large datasets, since one engine instance is reused for every item
- **Memory Usage**: Processes entire lists/dictionaries in memory; inputs and outputs are not streamed
- **Parallelization**: Not automatically parallelized; consider external solutions for very large datasets
- **Result Caching**: Each text is processed independently; no caching between items
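
Since items are processed independently and nothing is parallelized for you, one external option is to split the aligned inputs into chunks and hand each chunk to a worker. The sketch below uses only the standard library. `process_in_chunks` is a hypothetical helper, and `process_chunk` is a placeholder for whatever does the real work, for example a function wrapping `batch_engine.anonymize_list`. Whether a shared engine is safe to use from several threads, and whether threads give any speedup for this workload, are assumptions to verify before relying on this.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, List, Sequence


def process_in_chunks(
    texts: Sequence[Any],
    results: Sequence[Any],
    process_chunk: Callable[[Sequence[Any], Sequence[Any]], List[Any]],
    chunk_size: int = 1000,
    max_workers: int = 4,
) -> List[Any]:
    """Split aligned inputs into chunks, process them concurrently, keep order."""
    if len(texts) != len(results):
        raise ValueError("texts and results must have the same length")
    starts = range(0, len(texts), chunk_size)
    text_chunks = [texts[i:i + chunk_size] for i in starts]
    result_chunks = [results[i:i + chunk_size] for i in starts]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields outputs in the order the chunks were submitted
        chunk_outputs = pool.map(process_chunk, text_chunks, result_chunks)
        return [item for chunk in chunk_outputs for item in chunk]


# Stand-in worker: pairs each text with the number of results it received
def demo_worker(chunk_texts, chunk_results):
    return [(t, len(r)) for t, r in zip(chunk_texts, chunk_results)]


print(process_in_chunks(["a", "b", "c"], [[], [1], [1, 2]], demo_worker, chunk_size=2))
# [('a', 0), ('b', 1), ('c', 2)]
```

For CPU-bound work, a process pool may be a better fit than threads, at the cost of constructing an engine in each worker process.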

## Common Patterns

### Processing CSV-like Data

```python
# Process rows of tabular data
rows = [
    ["John Doe", "john@email.com", "555-1234"],
    ["Jane Smith", "jane@email.com", "555-5678"]
]

# Flatten for processing
texts = [item for row in rows for item in row]
# Process with appropriate analyzer results...
```
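
After a flattened list has been anonymized, the cells usually need to go back into their original rows. A minimal sketch of that round trip follows; `flatten_rows` and `restore_rows` are hypothetical helper names, and the list comprehension marked as a stand-in takes the place of a real `anonymize_list` call.

```python
from typing import Any, List, Sequence


def flatten_rows(rows: Sequence[Sequence[Any]]) -> List[Any]:
    """Flatten rows into one list, row by row."""
    return [cell for row in rows for cell in row]


def restore_rows(flat: Sequence[Any], template: Sequence[Sequence[Any]]) -> List[List[Any]]:
    """Rebuild rows from a flat list, using the original rows for their lengths."""
    restored, pos = [], 0
    for row in template:
        restored.append(list(flat[pos:pos + len(row)]))
        pos += len(row)
    if pos != len(flat):
        raise ValueError("flat list length does not match the template rows")
    return restored


rows = [
    ["John Doe", "john@email.com"],
    ["Jane Smith", "jane@email.com", "555-5678"],
]
flat = flatten_rows(rows)
# Stand-in for the anonymized output of the flattened list
anonymized_flat = ["[REDACTED]" if "@" in cell else cell for cell in flat]
print(restore_rows(anonymized_flat, rows))
# [['John Doe', '[REDACTED]'], ['Jane Smith', '[REDACTED]', '555-5678']]
```

Using the original rows as the template keeps this working for ragged rows, where a fixed column count would not.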

### Configuration Consistency

```python
from presidio_anonymizer.entities import OperatorConfig

# Use same operators across all batch operations
standard_operators = {
    "PERSON": OperatorConfig("replace", {"new_value": "[PERSON]"}),
    "EMAIL_ADDRESS": OperatorConfig(
        "mask", {"masking_char": "*", "chars_to_mask": 5, "from_end": False}
    )
}

# batch_engine, texts, analyzer_results and dict_analyzer_results
# are the objects defined in the earlier examples

# Apply to lists
list_result = batch_engine.anonymize_list(texts, analyzer_results, operators=standard_operators)

# Apply to dictionaries
dict_result = batch_engine.anonymize_dict(dict_analyzer_results, operators=standard_operators)
```