tessl/pypi-presidio-anonymizer

Presidio Anonymizer package - replaces analyzed text with desired values.

Batch Processing

Name: tessl/pypi-presidio-anonymizer
Author: tessl

The BatchAnonymizerEngine provides efficient anonymization for lists and dictionaries, enabling bulk processing of multiple texts or structured data formats.

Capabilities

Initialize Batch Engine

Create a batch processor with an optional custom AnonymizerEngine.

def __init__(self, anonymizer_engine: Optional[AnonymizerEngine] = None):
    """
    Initialize BatchAnonymizerEngine.

    Parameters:
    - anonymizer_engine (Optional[AnonymizerEngine]): Custom anonymizer instance, 
      defaults to new AnonymizerEngine()
    """

Usage Example:

from presidio_anonymizer import BatchAnonymizerEngine, AnonymizerEngine

# Use default engine
batch_engine = BatchAnonymizerEngine()

# Use custom engine with added operators
# MyCustomOperator is a placeholder for your own Operator subclass, defined elsewhere
custom_engine = AnonymizerEngine()
custom_engine.add_anonymizer(MyCustomOperator)
batch_engine = BatchAnonymizerEngine(anonymizer_engine=custom_engine)

List Anonymization

Anonymize a list of texts with corresponding analyzer results.

def anonymize_list(
    self,
    texts: List[Optional[Union[str, bool, int, float]]],
    recognizer_results_list: List[List[RecognizerResult]],
    **kwargs
) -> List[Union[str, Any]]:
    """
    Anonymize a list of strings.

    Parameters:
    - texts (List[Optional[Union[str, bool, int, float]]]): List of texts to anonymize.
      Non-string types (bool, int, float) are converted to string; other types pass through unchanged
    - recognizer_results_list (List[List[RecognizerResult]]): List of analyzer results for each text
    - **kwargs: Additional arguments passed to AnonymizerEngine.anonymize()

    Returns:
    List[Union[str, Any]]: List of anonymized texts, with non-anonymizable items unchanged
    """

Usage Examples:

from presidio_anonymizer import BatchAnonymizerEngine
from presidio_anonymizer.entities import RecognizerResult, OperatorConfig

batch_engine = BatchAnonymizerEngine()

# Anonymize multiple texts
texts = [
    "John Doe lives in New York",
    "Contact Sarah at sarah@email.com", 
    "Call Mike at 555-1234",
    42,  # Non-string type
    None  # None value
]

analyzer_results = [
    [RecognizerResult("PERSON", 0, 8, 0.9), RecognizerResult("LOCATION", 18, 26, 0.8)],
    [RecognizerResult("PERSON", 8, 13, 0.9), RecognizerResult("EMAIL_ADDRESS", 17, 32, 0.9)],
    [RecognizerResult("PERSON", 5, 9, 0.9), RecognizerResult("PHONE_NUMBER", 13, 21, 0.8)],
    [],  # No analyzer results for number
    []   # No analyzer results for None
]

operators = {
    "PERSON": OperatorConfig("replace", {"new_value": "[PERSON]"}),
    # The mask operator also expects from_end; False masks from the start of the match
    "EMAIL_ADDRESS": OperatorConfig("mask", {"masking_char": "*", "chars_to_mask": 5, "from_end": False}),
    "PHONE_NUMBER": OperatorConfig("redact"),
    "LOCATION": OperatorConfig("replace", {"new_value": "[LOCATION]"})
}

result = batch_engine.anonymize_list(
    texts=texts,
    recognizer_results_list=analyzer_results,
    operators=operators
)

print(result)
# ['[PERSON] lives in [LOCATION]', 'Contact [PERSON] at *****@email.com', 'Call [PERSON] at ', '42', None]
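
Each analyzer result carries character offsets into its own text, so a misaligned list or an out-of-range span is an easy mistake to make when building these inputs by hand. The following is a minimal sketch of a pre-flight check. It uses plain (start, end) tuples instead of Presidio objects, and `check_spans` is a hypothetical helper for illustration, not part of presidio-anonymizer.

```python
from typing import List, Optional, Sequence, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def check_spans(
    texts: Sequence[Optional[object]],
    spans_list: Sequence[Sequence[Span]],
) -> List[str]:
    """Return human-readable problems; an empty list means the inputs line up."""
    problems = []
    if len(texts) != len(spans_list):
        problems.append(f"{len(texts)} texts but {len(spans_list)} span lists")
    for i, (text, spans) in enumerate(zip(texts, spans_list)):
        # Items are measured by their string form; None is treated as empty
        length = len(str(text)) if text is not None else 0
        for start, end in spans:
            if not (0 <= start < end <= length):
                problems.append(
                    f"item {i}: span ({start}, {end}) outside text of length {length}"
                )
    return problems


texts = ["John Doe lives in New York", "Call Mike at 555-1234", 42, None]
spans = [[(0, 8), (18, 26)], [(5, 9), (13, 21)], [], []]

print(check_spans(texts, spans))       # aligned input
print(check_spans(texts, spans[:-1]))  # one span list missing
```

Running a check like this before calling anonymize_list keeps offset errors close to the code that produced them.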

Dictionary Anonymization

Anonymize values in nested dictionaries and structured data.

def anonymize_dict(
    self,
    analyzer_results: Iterable[DictRecognizerResult],
    **kwargs
) -> Dict[str, str]:
    """
    Anonymize values in a dictionary.

    Parameters:
    - analyzer_results (Iterable[DictRecognizerResult]): Iterator of DictRecognizerResult
      containing analyzer results for dictionary values
    - **kwargs: Additional arguments passed to AnonymizerEngine.anonymize()

    Returns:
    Dict[str, str]: Dictionary with anonymized values
    """

Usage Example:

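from presidio_anonymizer import BatchAnonymizerEngine
from presidio_anonymizer.entities import DictRecognizerResult, OperatorConfig

batch_engine = BatchAnonymizerEngine()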

# Example dictionary data
data_dict = {
    "user_info": {
        "name": "John Doe",
        "email": "john@example.com"
    },
    "contacts": ["Alice Johnson", "Bob Smith"],
    "phone": "555-1234",
    "age": 30
}

# DictRecognizerResult contains analyzer results for structured data.
# These would typically come from presidio-analyzer's BatchAnalyzerEngine.analyze_dict method.
dict_analyzer_results = [
    DictRecognizerResult(
        key="user_info",
        value={"name": "John Doe", "email": "john@example.com"},
        recognizer_results=[
            # For a dict value, nested DictRecognizerResult items go here (left empty in this example)
        ]
    ),
    # Additional results for other keys...
]

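operators = {
    "PERSON": OperatorConfig("replace", {"new_value": "[PERSON]"}),
    # The mask operator also expects from_end; False masks from the start of the match
    "EMAIL_ADDRESS": OperatorConfig("mask", {"masking_char": "*", "chars_to_mask": 5, "from_end": False})
}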

anonymized_dict = batch_engine.anonymize_dict(
    analyzer_results=dict_analyzer_results,
    operators=operators
)

Data Type Handling

The batch engine handles different data types appropriately:

String Types

Processed through the anonymization engine
Converted to anonymized strings

Numeric Types (int, float, bool)

Converted to strings and processed
Returned as anonymized strings

Other Types

Pass through unchanged (objects, None, custom classes)
No anonymization applied

Nested Structures

Dictionaries: Recursively processed
Lists/Iterables: Each item processed individually
Mixed types: Handled according to their individual type rules
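
The type rules above can be sketched in plain Python. This is an illustration of the described dispatch only, not the library's implementation: `apply_to_values` and `mask_digits` are hypothetical names, and the real engine additionally needs analyzer results to know which spans to change.

```python
from collections.abc import Iterable
from typing import Any, Callable


def apply_to_values(value: Any, anonymize_text: Callable[[str], str]) -> Any:
    """Apply anonymize_text to a value following the type rules listed above."""
    if isinstance(value, dict):
        # Dictionaries: recurse into every value, keep keys as they are
        return {k: apply_to_values(v, anonymize_text) for k, v in value.items()}
    if isinstance(value, (str, bool, int, float)):
        # Strings and scalar types: convert to string, then process
        return anonymize_text(str(value))
    if isinstance(value, Iterable):
        # Lists and other iterables: handle each item by its own type
        return [apply_to_values(item, anonymize_text) for item in value]
    # Anything else (None, custom objects) passes through unchanged
    return value


def mask_digits(text: str) -> str:
    """Stand-in for a real anonymizer: replace every digit with '#'."""
    return "".join("#" if ch.isdigit() else ch for ch in text)


record = {"name": "Ann", "age": 30, "phones": ["555-1234", None], "nested": {"ok": True}}
print(apply_to_values(record, mask_digits))
# {'name': 'Ann', 'age': '##', 'phones': ['###-####', None], 'nested': {'ok': 'True'}}
```

Note that the integer and boolean values come back as strings, which matters if downstream code expects the original types.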

Performance Considerations

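Batch Processing: One call covers a whole list or dictionary, but items are still anonymized one at a time, so expect simpler calling code rather than a large speedup over individual calls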
Memory Usage: Processes entire lists/dictionaries in memory
Parallelization: Not automatically parallelized; consider external solutions for very large datasets
Result Caching: Each text is processed independently; no caching between items
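
Since the engine does not parallelize on its own, one external option is to split the input into chunks and hand each chunk to a worker. The sketch below uses only the standard library. `chunked` and `process_in_chunks` are hypothetical helpers, and `worker` stands in for a function that would call anonymize_list on one chunk with its matching analyzer results. Whether threads help depends on the workload, and sharing one engine across threads is an assumption to verify; a process pool may suit CPU-bound work better.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_in_chunks(
    items: Sequence,
    worker: Callable[[Sequence], List],
    chunk_size: int = 1000,
    max_workers: int = 4,
) -> List:
    """Run `worker` on each chunk in a thread pool and return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the order the chunks were submitted
        chunk_results = pool.map(worker, chunked(items, chunk_size))
        return [item for chunk in chunk_results for item in chunk]


# Stand-in worker: upper-cases each text instead of anonymizing it
demo = process_in_chunks(["a", "b", "c", "d", "e"], lambda c: [t.upper() for t in c], chunk_size=2)
print(demo)  # ['A', 'B', 'C', 'D', 'E']
```

Chunking also bounds how much data each call holds at once, which addresses the memory point above.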

Common Patterns

Processing CSV-like Data

# Process rows of tabular data
rows = [
    ["John Doe", "john@email.com", "555-1234"],
    ["Jane Smith", "jane@email.com", "555-5678"]
]

# Flatten for processing
texts = [item for row in rows for item in row]
# Process with appropriate analyzer results...
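
After the flattened cells have been processed, they need to be regrouped into rows. A minimal sketch of that round trip follows. `anonymize_rows` is a hypothetical helper, and `anonymize_flat` stands in for a function that would wrap batch_engine.anonymize_list together with analyzer results for each cell.

```python
from typing import Callable, List, Sequence


def anonymize_rows(
    rows: Sequence[Sequence],
    anonymize_flat: Callable[[List], List],
) -> List[List]:
    """Flatten rows, process all cells in one call, then rebuild the row shape."""
    widths = [len(row) for row in rows]
    flat = [cell for row in rows for cell in row]
    flat_out = anonymize_flat(flat)
    if len(flat_out) != len(flat):
        raise ValueError("processing must return one output per input cell")
    result, pos = [], 0
    for width in widths:
        # Slice out exactly as many cells as the original row had
        result.append(flat_out[pos:pos + width])
        pos += width
    return result


def hide_emails(cells: List) -> List:
    """Stand-in for the real anonymizer: blank out any cell containing '@'."""
    return ["<EMAIL>" if "@" in str(cell) else cell for cell in cells]


rows = [
    ["John Doe", "john@email.com", "555-1234"],
    ["Jane Smith", "jane@email.com", "555-5678"],
]
print(anonymize_rows(rows, hide_emails))
# [['John Doe', '<EMAIL>', '555-1234'], ['Jane Smith', '<EMAIL>', '555-5678']]
```

Recording row widths rather than assuming a fixed column count keeps the helper usable on ragged data.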

Configuration Consistency

# Use same operators across all batch operations
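standard_operators = {
    "PERSON": OperatorConfig("replace", {"new_value": "[PERSON]"}),
    # The mask operator also expects from_end; False masks from the start of the match
    "EMAIL_ADDRESS": OperatorConfig("mask", {"masking_char": "*", "chars_to_mask": 5, "from_end": False})
}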

# Apply to lists
list_result = batch_engine.anonymize_list(texts, analyzer_results, operators=standard_operators)

# Apply to dictionaries  
dict_result = batch_engine.anonymize_dict(dict_analyzer_results, operators=standard_operators)

Install with Tessl CLI

npx tessl i tessl/pypi-presidio-anonymizer
