Tessl Tile for pypi/presidio-anonymizer@2.2.0

# Batch Processing

The BatchAnonymizerEngine provides efficient anonymization for lists and dictionaries, enabling bulk processing of multiple texts or structured data formats.

## Capabilities

### Initialize Batch Engine

Create a batch processor with an optional custom AnonymizerEngine.

```python { .api }
def __init__(self, anonymizer_engine: Optional[AnonymizerEngine] = None):
    """
    Initialize BatchAnonymizerEngine.

    Parameters:
    - anonymizer_engine (Optional[AnonymizerEngine]): Custom anonymizer instance,
      defaults to new AnonymizerEngine()
    """
```

**Usage Example:**

```python
from presidio_anonymizer import BatchAnonymizerEngine, AnonymizerEngine

# Use default engine
batch_engine = BatchAnonymizerEngine()

# Use custom engine with added operators.
# MyCustomOperator is a placeholder for your own Operator subclass;
# define or import it before running this snippet.
custom_engine = AnonymizerEngine()
custom_engine.add_anonymizer(MyCustomOperator)
batch_engine = BatchAnonymizerEngine(anonymizer_engine=custom_engine)
```

### List Anonymization

Anonymize a list of texts with corresponding analyzer results.

```python { .api }
def anonymize_list(
    self,
    texts: List[Optional[Union[str, bool, int, float]]],
    recognizer_results_list: List[List[RecognizerResult]],
    **kwargs
) -> List[Union[str, Any]]:
    """
    Anonymize a list of strings.

    Parameters:
    - texts (List[Optional[Union[str, bool, int, float]]]): List of texts to anonymize.
      Non-string types (bool, int, float) are converted to string; other types pass through unchanged
    - recognizer_results_list (List[List[RecognizerResult]]): List of analyzer results for each text
    - **kwargs: Additional arguments passed to AnonymizerEngine.anonymize()

    Returns:
    List[Union[str, Any]]: List of anonymized texts, with non-anonymizable items unchanged
    """
```

**Usage Examples:**

```python
from presidio_anonymizer import BatchAnonymizerEngine
from presidio_anonymizer.entities import RecognizerResult, OperatorConfig

batch_engine = BatchAnonymizerEngine()

# Anonymize multiple texts
texts = [
    "John Doe lives in New York",
    "Contact Sarah at sarah@email.com",
    "Call Mike at 555-1234",
    42,  # Non-string type
    None  # None value
]

# One list of results per text; offsets are (start, end) with end exclusive
analyzer_results = [
    [RecognizerResult("PERSON", 0, 8, 0.9), RecognizerResult("LOCATION", 18, 26, 0.8)],
    [RecognizerResult("PERSON", 8, 13, 0.9), RecognizerResult("EMAIL_ADDRESS", 17, 32, 0.9)],
    [RecognizerResult("PERSON", 5, 9, 0.9), RecognizerResult("PHONE_NUMBER", 13, 21, 0.8)],
    [],  # No analyzer results for number
    []   # No analyzer results for None
]

operators = {
    "PERSON": OperatorConfig("replace", {"new_value": "[PERSON]"}),
    # The mask operator also expects "from_end"; False masks from the start of the span
    "EMAIL_ADDRESS": OperatorConfig("mask", {"masking_char": "*", "chars_to_mask": 5, "from_end": False}),
    "PHONE_NUMBER": OperatorConfig("redact"),
    "LOCATION": OperatorConfig("replace", {"new_value": "[LOCATION]"})
}

result = batch_engine.anonymize_list(
    texts=texts,
    recognizer_results_list=analyzer_results,
    operators=operators
)

print(result)
# ['[PERSON] lives in [LOCATION]', 'Contact [PERSON] at *****@email.com', 'Call [PERSON] at ', '42', None]
```
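Because `texts` and `recognizer_results_list` are matched by position, it can help to confirm that the two lists line up before calling `anonymize_list`. The following is a minimal sketch of such a guard; `check_aligned` is a hypothetical helper, not part of presidio-anonymizer, and the result lists here are placeholder strings rather than real `RecognizerResult` objects.

```python
from typing import Any, List, Sequence


def check_aligned(texts: Sequence[Any], results_list: Sequence[List[Any]]) -> None:
    """Raise ValueError unless there is exactly one result list per text."""
    if len(texts) != len(results_list):
        raise ValueError(
            f"Got {len(texts)} texts but {len(results_list)} result lists"
        )


texts = ["John Doe lives in New York", 42, None]
# Placeholders standing in for lists of RecognizerResult objects
results_list = [["<results for text 0>"], [], []]

check_aligned(texts, results_list)  # lengths match, so nothing is raised

try:
    check_aligned(texts, results_list[:2])  # one result list is missing
except ValueError as err:
    message = str(err)
```

Running a check like this first turns a silent misalignment into an explicit error at the call site.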

### Dictionary Anonymization

Anonymize values in nested dictionaries and structured data.

```python { .api }
def anonymize_dict(
    self,
    analyzer_results: Iterable[DictRecognizerResult],
    **kwargs
) -> Dict[str, str]:
    """
    Anonymize values in a dictionary.

    Parameters:
    - analyzer_results (Iterable[DictRecognizerResult]): Iterator of DictRecognizerResult
      containing analyzer results for dictionary values
    - **kwargs: Additional arguments passed to AnonymizerEngine.anonymize()

    Returns:
    Dict[str, str]: Dictionary with anonymized values
    """
```

**Usage Example:**

```python
from presidio_anonymizer import BatchAnonymizerEngine
from presidio_anonymizer.entities import DictRecognizerResult, OperatorConfig

batch_engine = BatchAnonymizerEngine()

# Example dictionary data
data_dict = {
    "user_info": {
        "name": "John Doe",
        "email": "john@example.com"
    },
    "contacts": ["Alice Johnson", "Bob Smith"],
    "phone": "555-1234",
    "age": 30
}

# DictRecognizerResult contains analyzer results for structured data
# This would typically come from presidio-analyzer's analyze_dict method
dict_analyzer_results = [
    DictRecognizerResult(
        key="user_info",
        value={"name": "John Doe", "email": "john@example.com"},
        recognizer_results=[
            # Nested analyzer results for the dictionary value
        ]
    ),
    # Additional results for other keys...
]

operators = {
    "PERSON": OperatorConfig("replace", {"new_value": "[PERSON]"}),
    # The mask operator also expects "from_end"; False masks from the start of the span
    "EMAIL_ADDRESS": OperatorConfig("mask", {"masking_char": "*", "chars_to_mask": 5, "from_end": False})
}

anonymized_dict = batch_engine.anonymize_dict(
    analyzer_results=dict_analyzer_results,
    operators=operators
)
```

## Data Type Handling

The batch engine handles different data types appropriately:

### String Types
- Processed through the anonymization engine
- Converted to anonymized strings

### Numeric Types (int, float, bool)
- Converted to strings and processed
- Returned as anonymized strings

### Other Types
- Pass through unchanged (objects, None, custom classes)
- No anonymization applied

### Nested Structures
- Dictionaries: Recursively processed
- Lists/Iterables: Each item processed individually
- Mixed types: Handled according to their individual type rules
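The rules above can be pictured as a small recursive dispatch. The sketch below is a simplified model of the rules as listed in this section, not the library's code: `process_value` and `stand_in` are hypothetical names, and `stand_in` is a plain string replacement standing in for a real anonymization call.

```python
from typing import Any, Callable

SCALAR_TYPES = (str, bool, int, float)


def process_value(value: Any, anonymize: Callable[[str], str]) -> Any:
    """Apply the type rules to one value, recursing into containers."""
    if type(value) in SCALAR_TYPES:
        # Scalars are converted to strings and then processed
        return anonymize(str(value))
    if isinstance(value, dict):
        return {key: process_value(item, anonymize) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [process_value(item, anonymize) for item in value]
    return value  # None, custom objects, etc. pass through unchanged


def stand_in(text: str) -> str:
    # Hypothetical stand-in for a real anonymization call
    return text.replace("John Doe", "[PERSON]")


record = {"user": {"name": "John Doe"}, "contacts": ["John Doe", 7], "extra": None}
processed = process_value(record, stand_in)
```

In this model the nested name and the list entry are replaced, the integer `7` comes back as the string `"7"`, and `None` is returned untouched.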

## Performance Considerations

- **Batch Processing**: More efficient than individual calls for large datasets
- **Memory Usage**: Entire lists and dictionaries are processed in memory, so peak memory grows with input size
- **Parallelization**: Items are not parallelized automatically; for very large datasets, consider splitting the work externally
- **Result Caching**: Each text is processed independently; nothing is cached between items
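One way to keep memory bounded, and to prepare work for an external parallel runner, is to feed the engine fixed-size chunks instead of one very large list. This is a minimal sketch of that idea using only the standard library; `chunked` is a hypothetical helper and the chunk size of 2 is arbitrary.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


texts = [f"text {i}" for i in range(5)]
chunks = list(chunked(texts, 2))
```

To keep texts and their analyzer results together, the same helper can be applied to `zip(texts, recognizer_results_list)` and each chunk unzipped before it is passed to `anonymize_list`.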

## Common Patterns

### Processing CSV-like Data

```python
# Process rows of tabular data
rows = [
    ["John Doe", "john@email.com", "555-1234"],
    ["Jane Smith", "jane@email.com", "555-5678"]
]

# Flatten for processing
texts = [item for row in rows for item in row]
# Process with appropriate analyzer results...
```
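After the flattened list has been processed, the rows usually need to be rebuilt. The sketch below assumes the result list has the same length and order as the flattened input; `reshape_rows` is a hypothetical helper, and the bracketed strings stand in for the output of `anonymize_list`.

```python
from typing import Any, List


def reshape_rows(flat: List[Any], width: int) -> List[List[Any]]:
    """Split a flat list back into rows of `width` items each."""
    if width < 1 or len(flat) % width != 0:
        raise ValueError("flat list length must be a multiple of the row width")
    return [flat[i:i + width] for i in range(0, len(flat), width)]


rows = [
    ["John Doe", "john@email.com", "555-1234"],
    ["Jane Smith", "jane@email.com", "555-5678"],
]
flat = [item for row in rows for item in row]

# Stand-in for the anonymized output: same length and order as the input
flat_result = [f"<{value}>" for value in flat]

restored = reshape_rows(flat_result, width=len(rows[0]))
```

This only works when every row has the same number of columns; ragged rows would need their lengths recorded before flattening.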

### Configuration Consistency

```python
from presidio_anonymizer.entities import OperatorConfig

# Use same operators across all batch operations
standard_operators = {
    "PERSON": OperatorConfig("replace", {"new_value": "[PERSON]"}),
    # The mask operator also expects "from_end"; False masks from the start of the span
    "EMAIL_ADDRESS": OperatorConfig("mask", {"masking_char": "*", "chars_to_mask": 5, "from_end": False})
}

# batch_engine, texts, analyzer_results and dict_analyzer_results
# are defined as in the earlier examples

# Apply to lists
list_result = batch_engine.anonymize_list(texts, analyzer_results, operators=standard_operators)

# Apply to dictionaries
dict_result = batch_engine.anonymize_dict(dict_analyzer_results, operators=standard_operators)
```
