# Batch Processing

The BatchAnonymizerEngine provides efficient anonymization for lists and dictionaries, enabling bulk processing of multiple texts or structured data formats.

## Capabilities

### Initialize Batch Engine

Create a batch processor with an optional custom AnonymizerEngine.

```python { .api }
def __init__(self, anonymizer_engine: Optional[AnonymizerEngine] = None):
    """
    Initialize BatchAnonymizerEngine.

    Parameters:
    - anonymizer_engine (Optional[AnonymizerEngine]): Custom anonymizer instance,
      defaults to new AnonymizerEngine()
    """
```

**Usage Example:**

```python
from presidio_anonymizer import BatchAnonymizerEngine, AnonymizerEngine

# Use default engine
batch_engine = BatchAnonymizerEngine()

# Use custom engine with added operators
# (MyCustomOperator is a placeholder for your own Operator subclass)
custom_engine = AnonymizerEngine()
custom_engine.add_anonymizer(MyCustomOperator)
batch_engine = BatchAnonymizerEngine(anonymizer_engine=custom_engine)
```

### List Anonymization

Anonymize a list of texts with corresponding analyzer results.

```python { .api }
def anonymize_list(
    self,
    texts: List[Optional[Union[str, bool, int, float]]],
    recognizer_results_list: List[List[RecognizerResult]],
    **kwargs
) -> List[Union[str, Any]]:
    """
    Anonymize a list of strings.

    Parameters:
    - texts (List[Optional[Union[str, bool, int, float]]]): List of texts to anonymize.
      Non-string types (bool, int, float) are converted to string; other types pass through unchanged
    - recognizer_results_list (List[List[RecognizerResult]]): List of analyzer results for each text
    - **kwargs: Additional arguments passed to AnonymizerEngine.anonymize()

    Returns:
    List[Union[str, Any]]: List of anonymized texts, with non-anonymizable items unchanged
    """
```

**Usage Examples:**

```python
from presidio_anonymizer import BatchAnonymizerEngine
from presidio_anonymizer.entities import RecognizerResult, OperatorConfig

batch_engine = BatchAnonymizerEngine()

# Anonymize multiple texts
texts = [
    "John Doe lives in New York",
    "Contact Sarah at sarah@email.com",
    "Call Mike at 555-1234",
    42,  # Non-string type
    None  # None value
]

# One list of results per text; start/end are character offsets into that text
analyzer_results = [
    [RecognizerResult("PERSON", 0, 8, 0.9), RecognizerResult("LOCATION", 18, 26, 0.8)],
    [RecognizerResult("PERSON", 8, 13, 0.9), RecognizerResult("EMAIL_ADDRESS", 17, 32, 0.9)],
    [RecognizerResult("PERSON", 5, 9, 0.9), RecognizerResult("PHONE_NUMBER", 13, 21, 0.8)],
    [],  # No analyzer results for number
    []  # No analyzer results for None
]

operators = {
    "PERSON": OperatorConfig("replace", {"new_value": "[PERSON]"}),
    # Mask the first 5 characters of the detected email address
    "EMAIL_ADDRESS": OperatorConfig("mask", {"masking_char": "*", "chars_to_mask": 5, "from_end": False}),
    "PHONE_NUMBER": OperatorConfig("redact"),
    "LOCATION": OperatorConfig("replace", {"new_value": "[LOCATION]"})
}

result = batch_engine.anonymize_list(
    texts=texts,
    recognizer_results_list=analyzer_results,
    operators=operators
)

print(result)
# ['[PERSON] lives in [LOCATION]', 'Contact [PERSON] at *****@email.com', 'Call [PERSON] at ', '42', None]
```

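Each analyzer result carries character offsets into its own text, so an offset that runs past the end of the text is an easy mistake to make when results are written or transformed by hand. The following is a minimal sketch of a pre-flight check, not part of presidio-anonymizer: `find_bad_spans` and the `(entity_type, start, end)` tuple format are hypothetical stand-ins for `RecognizerResult` objects.

```python
from typing import List, Sequence, Tuple

# Hypothetical plain-tuple stand-in for a recognizer result: (entity_type, start, end)
Span = Tuple[str, int, int]


def find_bad_spans(
    texts: Sequence[object], spans_per_text: Sequence[Sequence[Span]]
) -> List[Tuple[int, Span]]:
    """Return (text_index, span) pairs whose offsets fall outside the text."""
    if len(texts) != len(spans_per_text):
        raise ValueError("texts and spans_per_text must have the same length")
    bad = []
    for index, (text, spans) in enumerate(zip(texts, spans_per_text)):
        # Only str/bool/int/float items are treated as text; anything else has no valid span
        length = len(str(text)) if isinstance(text, (str, bool, int, float)) else 0
        for span in spans:
            _, start, end = span
            if not (0 <= start <= end <= length):
                bad.append((index, span))
    return bad


texts = ["John Doe lives in New York", "Contact Sarah at sarah@email.com", 42, None]
spans = [
    [("PERSON", 0, 8), ("LOCATION", 18, 26)],
    [("PERSON", 8, 13), ("EMAIL_ADDRESS", 17, 33)],  # end is one past the text length (32)
    [],
    [],
]

print(find_bad_spans(texts, spans))
# [(1, ('EMAIL_ADDRESS', 17, 33))]
```

The same loop could read `start` and `end` from real `RecognizerResult` objects instead of tuples.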

### Dictionary Anonymization

Anonymize values in nested dictionaries and structured data.

```python { .api }
def anonymize_dict(
    self,
    analyzer_results: Iterable[DictRecognizerResult],
    **kwargs
) -> Dict[str, str]:
    """
    Anonymize values in a dictionary.

    Parameters:
    - analyzer_results (Iterable[DictRecognizerResult]): Iterator of DictRecognizerResult
      containing analyzer results for dictionary values
    - **kwargs: Additional arguments passed to AnonymizerEngine.anonymize()

    Returns:
    Dict[str, str]: Dictionary with anonymized values
    """
```

124

125

**Usage Example:**

126

127

```python
from presidio_anonymizer import BatchAnonymizerEngine
from presidio_anonymizer.entities import DictRecognizerResult, OperatorConfig

batch_engine = BatchAnonymizerEngine()

# Example dictionary data
data_dict = {
    "user_info": {
        "name": "John Doe",
        "email": "john@example.com"
    },
    "contacts": ["Alice Johnson", "Bob Smith"],
    "phone": "555-1234",
    "age": 30
}

# DictRecognizerResult contains analyzer results for structured data
# This would typically come from presidio-analyzer's analyze_dict method
dict_analyzer_results = [
    DictRecognizerResult(
        key="user_info",
        value={"name": "John Doe", "email": "john@example.com"},
        recognizer_results=[
            # Nested analyzer results for the dictionary value
        ]
    ),
    # Additional results for other keys...
]

operators = {
    "PERSON": OperatorConfig("replace", {"new_value": "[PERSON]"}),
    "EMAIL_ADDRESS": OperatorConfig("mask", {"masking_char": "*", "chars_to_mask": 5, "from_end": False})
}

anonymized_dict = batch_engine.anonymize_dict(
    analyzer_results=dict_analyzer_results,
    operators=operators
)
```

## Data Type Handling

The batch engine applies different rules depending on the type of each value:

### String Types

- Processed through the anonymization engine
- Converted to anonymized strings

### Numeric Types (int, float, bool)

- Converted to strings and processed
- Returned as anonymized strings

### Other Types

- Pass through unchanged (objects, None, custom classes)
- No anonymization applied

### Nested Structures

- Dictionaries: Recursively processed
- Lists/Iterables: Each item processed individually
- Mixed types: Handled according to their individual type rules

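The type rules listed in this section can be pictured as a small recursive dispatch. The sketch below only illustrates those rules as stated; it is not the library's implementation, and the real engine may differ in details. `apply_type_rules` and `fake_anonymize` are hypothetical names, with `fake_anonymize` standing in for a call to the anonymizer.

```python
from typing import Any, Callable


def apply_type_rules(value: Any, anonymize_text: Callable[[str], str]) -> Any:
    """Illustrate the documented type rules with a pluggable text anonymizer."""
    if isinstance(value, dict):
        # Dictionaries: recurse into every value
        return {key: apply_type_rules(item, anonymize_text) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        # Lists: process each item individually
        return [apply_type_rules(item, anonymize_text) for item in value]
    if isinstance(value, (str, bool, int, float)):
        # Strings and numeric types: convert to str, then anonymize
        return anonymize_text(str(value))
    # Everything else (None, custom objects) passes through unchanged
    return value


def fake_anonymize(text: str) -> str:
    """Stand-in anonymizer: replaces one hard-coded name."""
    return text.replace("John Doe", "[PERSON]")


record = {"name": "John Doe", "age": 30, "tags": ["John Doe", None], "active": True}
print(apply_type_rules(record, fake_anonymize))
# {'name': '[PERSON]', 'age': '30', 'tags': ['[PERSON]', None], 'active': 'True'}
```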

## Performance Considerations

- **Batch Processing**: More efficient than individual calls for large datasets
- **Memory Usage**: Processes entire lists/dictionaries in memory
- **Parallelization**: Not automatically parallelized; consider external solutions for very large datasets
- **Result Caching**: Each text is processed independently; no caching between items

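Because items are processed independently and nothing is parallelized for you, one external option is to split the input into chunks and hand each chunk to a worker. The following is a minimal sketch using only the standard library. `chunked`, `process_in_chunks`, and `shout` are hypothetical names, and `shout` stands in for a real batch call. Whether a shared engine can safely be used from several threads is an assumption you would need to verify, and CPU-bound work may benefit more from a process pool than from threads.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def chunked(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def process_in_chunks(
    items: Iterable[str],
    size: int,
    process_chunk: Callable[[List[str]], List[str]],
    max_workers: int = 4,
) -> List[str]:
    """Run process_chunk over fixed-size chunks and return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves input order; note that it submits every chunk up front
        chunk_results = list(pool.map(process_chunk, chunked(items, size)))
    return [item for chunk in chunk_results for item in chunk]


def shout(chunk: List[str]) -> List[str]:
    """Stand-in for a real batch call; upper-cases each text."""
    return [text.upper() for text in chunk]


print(process_in_chunks(["a", "b", "c", "d", "e"], size=2, process_chunk=shout))
# ['A', 'B', 'C', 'D', 'E']
```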

## Common Patterns

### Processing CSV-like Data

```python
# Process rows of tabular data
rows = [
    ["John Doe", "john@email.com", "555-1234"],
    ["Jane Smith", "jane@email.com", "555-5678"]
]

# Flatten for processing
texts = [item for row in rows for item in row]
# Process with appropriate analyzer results...
```

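After flattening, the processed values usually need to go back into their original row shape. The sketch below shows one way to do that round trip; `flatten_rows` and `regroup` are hypothetical helpers, and the `processed` list is a stand-in for the output of a batch call. Analyzer results would have to be flattened in the same order as the texts.

```python
from typing import Any, List, Sequence


def flatten_rows(rows: Sequence[Sequence[Any]]) -> List[Any]:
    """Flatten rows into one list, row by row."""
    return [cell for row in rows for cell in row]


def regroup(flat: Sequence[Any], row_lengths: Sequence[int]) -> List[List[Any]]:
    """Rebuild rows of the given lengths from a flat sequence."""
    regrouped, position = [], 0
    for length in row_lengths:
        regrouped.append(list(flat[position:position + length]))
        position += length
    if position != len(flat):
        raise ValueError("row_lengths do not add up to the number of flat items")
    return regrouped


rows = [
    ["John Doe", "john@email.com", "555-1234"],
    ["Jane Smith", "jane@email.com", "555-5678"],
]
row_lengths = [len(row) for row in rows]

flat_texts = flatten_rows(rows)
# Stand-in for the anonymized output of a batch call over flat_texts
processed = ["[PERSON]" if " " in text else text for text in flat_texts]

print(regroup(processed, row_lengths))
# [['[PERSON]', 'john@email.com', '555-1234'], ['[PERSON]', 'jane@email.com', '555-5678']]
```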

### Configuration Consistency

```python
# Use same operators across all batch operations
# (batch_engine, texts, analyzer_results and dict_analyzer_results are defined in the examples above)
standard_operators = {
    "PERSON": OperatorConfig("replace", {"new_value": "[PERSON]"}),
    "EMAIL_ADDRESS": OperatorConfig("mask", {"masking_char": "*", "chars_to_mask": 5, "from_end": False})
}

# Apply to lists
list_result = batch_engine.anonymize_list(texts, analyzer_results, operators=standard_operators)

# Apply to dictionaries
dict_result = batch_engine.anonymize_dict(dict_analyzer_results, operators=standard_operators)
```