tessl/pypi-pyod

A comprehensive Python library for detecting anomalous/outlying objects in multivariate data with 45+ algorithms.

Data Utilities

Comprehensive utilities for data generation, preprocessing, evaluation, and visualization to support the complete outlier detection workflow. These utilities are essential for testing detectors, preparing data, and evaluating results.

Capabilities

Data Generation

Generate synthetic datasets with controlled outlier characteristics for testing and benchmarking outlier detection algorithms.

def generate_data(n_train=200, n_test=100, n_features=2, contamination=0.1,
                  train_only=False, offset=10, random_state=None):
    """
    Generate synthetic dataset with outliers for testing detectors.
    
    Parameters:
    - n_train (int): Number of training samples
    - n_test (int): Number of test samples  
    - n_features (int): Number of features
    - contamination (float): Proportion of outliers in dataset
    - train_only (bool): If True, only return training data
    - offset (int): Offset for outlier generation
    - random_state (int): Random number generator seed
    
    Returns:
    - X_train (array): Training data of shape (n_train, n_features)
    - X_test (array): Test data of shape (n_test, n_features) 
    - y_train (array): Training labels (0: inlier, 1: outlier)
    - y_test (array): Test labels (0: inlier, 1: outlier)
    """

Usage example:

from pyod.utils.data import generate_data

# Generate 2D dataset with 10% outliers
X_train, X_test, y_train, y_test = generate_data(
    n_train=500, n_test=200, n_features=2, 
    contamination=0.1, random_state=42
)

# Generate high-dimensional dataset
X_train, X_test, y_train, y_test = generate_data(
    n_train=1000, n_test=300, n_features=20,
    contamination=0.05, random_state=123
)

Evaluation Functions

Comprehensive evaluation metrics specifically designed for outlier detection tasks.

def evaluate_print(clf_name, y, y_scores):
    """
    Print comprehensive evaluation metrics for outlier detection.
    
    Parameters:
    - clf_name (str): Name of the classifier for display
    - y (array): True binary labels (0: inlier, 1: outlier)
    - y_scores (array): Outlier scores from detector
    
    Prints:
    - ROC AUC score
    - Precision at rank n (P@n), where n is the number of true outliers
    """
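To make the printed metrics concrete, here is an illustrative NumPy sketch (not pyod's implementation) of how ROC AUC and Precision@n can be computed for a toy labeled dataset:

```python
import numpy as np

# Toy ground truth and detector scores (hypothetical values)
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.2, 0.9, 0.8])

# ROC AUC via the rank-comparison (Mann-Whitney) formulation; ties ignored
pos, neg = scores[y == 1], scores[y == 0]
auc = np.mean(pos[:, None] > neg[None, :])

# Precision at rank n, with n = number of true outliers
n = int(y.sum())
top_n = np.argsort(scores)[::-1][:n]
p_at_n = y[top_n].mean()

print(f"ROC AUC: {auc:.3f}, Precision@{n}: {p_at_n:.3f}")
```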

Data Preprocessing

Standardization and normalization utilities optimized for outlier detection workflows.

def standardizer(X, X_t=None, keep_scalar=False):
    """
    Standardize datasets to zero mean and unit variance (z-score scaling).
    
    Parameters:
    - X (array): Training data used to fit the scaler
    - X_t (array, optional): Test data to transform with the fitted scaler
    - keep_scalar (bool): Whether to also return the fitted scaler
    
    Returns:
    - X_scaled (array): Scaled training data
    - X_t_scaled (array): Scaled test data (only if X_t is provided)
    - scalar (object): Fitted scaler (only if keep_scalar=True)
    """
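The key point is that the test set is transformed with statistics fitted on the training set. A minimal NumPy sketch of that z-score logic, using hypothetical data:

```python
import numpy as np

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])   # training data
X_t = np.array([[2.0, 20.0]])                            # test data

# Fit mean/std on X only, then apply the same transform to both sets
mu, sigma = X.mean(axis=0), X.std(axis=0)
X_scaled = (X - mu) / sigma
X_t_scaled = (X_t - mu) / sigma
print(X_t_scaled)  # a test row equal to the training mean maps to zeros
```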

Score Processing

Utilities for converting and processing outlier scores for different use cases.

def score_to_label(scores, outliers_fraction=0.1):
    """
    Convert outlier scores to binary labels based on contamination rate.
    
    Parameters:
    - scores (array): Outlier scores
    - outliers_fraction (float): Expected fraction of outliers
    
    Returns:
    - labels (array): Binary labels (0: inlier, 1: outlier)
    """
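A NumPy sketch of the underlying thresholding logic (an equivalent, not pyod's exact code): scores above the (1 - fraction) percentile are labeled outliers.

```python
import numpy as np

scores = np.array([0.1, 0.5, 0.2, 0.9, 0.3, 0.8, 0.2, 0.1, 0.4, 0.7])
outliers_fraction = 0.2

# Threshold at the (1 - fraction) percentile of the score distribution
threshold = np.percentile(scores, 100 * (1 - outliers_fraction))
labels = (scores > threshold).astype(int)
print(labels)  # the top ~20% of scores are flagged as outliers
```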

def precision_n_scores(y, y_pred, n=None):
    """
    Calculate precision at rank n for a single detector's scores.
    
    Parameters:
    - y (array): True binary labels
    - y_pred (array): Outlier scores from a detector
    - n (int): Rank threshold (default: number of outliers in y)
    
    Returns:
    - precision (float): Precision@n score
    """

def get_label_n(y, y_scores, n=None):
    """
    Get binary labels by selecting top n highest scores as outliers.
    
    Parameters:
    - y (array): True binary labels (for determining n if not provided)
    - y_scores (array): Outlier scores
    - n (int): Number of top scores to label as outliers
    
    Returns:
    - labels (array): Binary labels (0: inlier, 1: outlier)
    """
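A short NumPy sketch of the top-n labeling idea behind get_label_n, using hypothetical data: when n is not given, it is inferred from the number of true outliers in y.

```python
import numpy as np

y = np.array([0, 0, 0, 1, 0, 1])          # two true outliers -> n defaults to 2
y_scores = np.array([0.2, 0.1, 0.3, 0.9, 0.2, 0.8])

n = int(y.sum())                           # infer n from the ground truth
labels = np.zeros_like(y)
labels[np.argsort(y_scores)[::-1][:n]] = 1 # flag the n highest-scoring samples
print(labels)
```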

def argmaxn(value_list, n, order='desc'):
    """
    Get indices of n largest or smallest values.
    
    Parameters:
    - value_list (array): Input values
    - n (int): Number of indices to return
    - order (str): Sort order ('desc' for largest, 'asc' for smallest)
    
    Returns:
    - indices (array): Indices of n extreme values
    """
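The same result can be obtained with a plain argsort, shown here as a sketch of the 'desc' behaviour:

```python
import numpy as np

values = np.array([3.0, 9.0, 1.0, 7.0, 5.0])

# Indices of the 2 largest values (order='desc'): sort ascending, reverse, slice
idx_desc = np.argsort(values)[::-1][:2]
print(idx_desc)  # indices of 9.0 and 7.0
```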

def invert_order(scores, method='multiplication'):
    """
    Invert the order of outlier scores (lower becomes higher).
    
    Parameters:
    - scores (array): Input outlier scores
    - method (str): Inversion method ('multiplication' negates the scores;
      'subtraction' subtracts each score from the maximum)
    
    Returns:
    - inverted_scores (array): Inverted outlier scores
    """
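A NumPy sketch of the two inversion strategies, useful when a detector emits scores where lower means more anomalous:

```python
import numpy as np

scores = np.array([0.2, 0.9, 0.5])

# 'multiplication': negate, so the ordering flips
inv_mult = scores * -1
# 'subtraction': subtract from the maximum, keeping scores non-negative
inv_sub = scores.max() - scores

print(inv_mult, inv_sub)
```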

Visualization

Visualization utilities for 2D datasets and outlier detection results.

def visualize(clf_name, X_train, y_train, X_test, y_test,
              y_train_pred, y_test_pred, show_figure=True, save_figure=False):
    """
    Visualize outlier detection results for 2D datasets.
    
    Parameters:
    - clf_name (str): Name of the classifier for the plot title
    - X_train (array): Training data (must be 2D)
    - y_train (array): True training labels
    - X_test (array): Test data (must be 2D)
    - y_test (array): True test labels
    - y_train_pred (array): Predicted training labels
    - y_test_pred (array): Predicted test labels
    - show_figure (bool): Whether to display the plot
    - save_figure (bool): Whether to save the plot to file
    """

Statistical Utilities

Statistical functions and distance computations for outlier detection algorithms.

def pairwise_distances_no_broadcast(X, Y):
    """
    Compute the row-wise Euclidean distance between two equally shaped
    matrices without broadcasting, for memory efficiency.
    
    Parameters:
    - X (array): First set of points, shape (n_samples, n_features)
    - Y (array): Second set of points, same shape as X
    
    Returns:
    - distances (array): Distance between each matched row pair, shape (n_samples,)
    """
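In pyod this utility returns matched-row (not all-pairs) Euclidean distances, which is what reconstruction-error detectors such as autoencoders need. A NumPy equivalent:

```python
import numpy as np

X = np.array([[0.0, 0.0], [1.0, 1.0]])
Y = np.array([[3.0, 4.0], [1.0, 1.0]])

# Euclidean distance between row i of X and row i of Y
dist = np.sqrt(np.sum((X - Y) ** 2, axis=1))
print(dist)  # [5.0, 0.0]
```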

def wpearsonr(x, y, w):
    """
    Calculate weighted Pearson correlation coefficient.
    
    Parameters:
    - x (array): First variable
    - y (array): Second variable
    - w (array): Weights for each observation
    
    Returns:
    - correlation (float): Weighted Pearson correlation
    """
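The weighted Pearson correlation replaces the usual means and variances with weighted averages. A self-contained sketch of that formula (not pyod's implementation):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])   # perfectly linear in x
w = np.array([1.0, 1.0, 1.0, 1.0])   # uniform weights for illustration

# Weighted covariance divided by the product of weighted standard deviations
mx, my = np.average(x, weights=w), np.average(y, weights=w)
cov = np.average((x - mx) * (y - my), weights=w)
r = cov / np.sqrt(np.average((x - mx) ** 2, weights=w) *
                  np.average((y - my) ** 2, weights=w))
print(r)  # a perfect linear relationship gives 1.0
```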

def pearsonr_mat(mat, w=None):
    """
    Calculate Pearson correlation matrix with optional weights.
    
    Parameters:
    - mat (array): Data matrix
    - w (array, optional): Weights for observations
    
    Returns:
    - corr_matrix (array): Correlation matrix
    """

def get_optimal_n_bins(X, upper_bound=300):
    """
    Get optimal number of bins for histogram-based methods.
    
    Parameters:
    - X (array): Input data
    - upper_bound (int): Maximum number of bins
    
    Returns:
    - n_bins (int): Optimal number of bins
    """

def check_parameter(param, low=float('-inf'), high=float('inf'), 
                   param_name='', include_left=False, include_right=False):
    """
    Validate parameter values within specified bounds.
    
    Parameters:
    - param: Parameter value to check
    - low: Lower bound
    - high: Upper bound  
    - param_name (str): Name of parameter for error messages
    - include_left (bool): Whether to include lower bound
    - include_right (bool): Whether to include upper bound
    
    Raises:
    - ValueError: If parameter is outside valid range
    """
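A minimal sketch of this kind of range validation (the helper name check_in_range is hypothetical, not pyod's code), showing how the inclusive/exclusive bound flags interact:

```python
# Hypothetical range validator mirroring check_parameter's behaviour
def check_in_range(param, low, high, name="param",
                   include_left=False, include_right=False):
    ok_left = param >= low if include_left else param > low
    ok_right = param <= high if include_right else param < high
    if not (ok_left and ok_right):
        raise ValueError(f"{name}={param} is outside the valid range")
    return True

check_in_range(0.1, 0, 0.5, name="contamination")      # passes silently
try:
    check_in_range(0.6, 0, 0.5, name="contamination")  # out of range
except ValueError as e:
    print(e)
```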

PyTorch Utilities

Specialized utilities for deep learning models using PyTorch framework.

# Neural network components and utilities for deep learning models
# Available in pyod.utils.torch_utility module

class TorchModel:
    """Base class for PyTorch-based outlier detection models"""
    
class InnerAutoencoder:
    """Autoencoder architecture for deep anomaly detection"""
    
class VAE_Encoder:
    """Variational autoencoder encoder network"""
    
class VAE_Decoder: 
    """Variational autoencoder decoder network"""

Usage Patterns

Complete Workflow Example

from pyod.models.lof import LOF
from pyod.models.iforest import IForest
from pyod.utils.data import generate_data, evaluate_print
from pyod.utils.utility import standardizer, precision_n_scores
from pyod.utils.example import visualize

# 1. Generate synthetic data
X_train, X_test, y_train, y_test = generate_data(
    n_train=400, n_test=150, n_features=2,
    contamination=0.1, random_state=42
)

# 2. Preprocess data
X_train_scaled, X_test_scaled = standardizer(X_train, X_test)

# 3. Train multiple detectors
lof = LOF(contamination=0.1)
iforest = IForest(contamination=0.1)

lof.fit(X_train_scaled)
iforest.fit(X_train_scaled)

# 4. Get predictions
lof_scores = lof.decision_function(X_test_scaled) 
lof_pred = lof.predict(X_test_scaled)

iforest_scores = iforest.decision_function(X_test_scaled)
iforest_pred = iforest.predict(X_test_scaled)

# 5. Evaluate results
evaluate_print('LOF', y_test, lof_scores)
evaluate_print('IForest', y_test, iforest_scores)

# 6. Compare precision@n (one call per detector's score array)
lof_pn = precision_n_scores(y_test, lof_scores)
iforest_pn = precision_n_scores(y_test, iforest_scores)
print(f"Precision@n - LOF: {lof_pn:.3f}, IForest: {iforest_pn:.3f}")

# 7. Visualize results (for 2D data)
visualize('LOF', X_train, y_train, X_test, y_test,
          lof.labels_, lof_pred, show_figure=True, save_figure=False)

Batch Evaluation

from pyod.models.lof import LOF
from pyod.models.iforest import IForest
from pyod.models.ocsvm import OCSVM
from pyod.utils.data import generate_data, evaluate_print

# Generate test datasets with different characteristics
datasets = []
for contamination in [0.05, 0.1, 0.2]:
    for n_features in [2, 5, 10]:
        X_train, X_test, y_train, y_test = generate_data(
            n_train=500, n_test=200, n_features=n_features,
            contamination=contamination, random_state=42
        )
        datasets.append((X_train, X_test, y_train, y_test, 
                        f"cont_{contamination}_feat_{n_features}"))

# Test multiple detectors
detectors = [
    ('LOF', LOF()),
    ('IForest', IForest()), 
    ('OCSVM', OCSVM())
]

# Evaluate all combinations
for X_train, X_test, y_train, y_test, dataset_name in datasets:
    print(f"\nDataset: {dataset_name}")
    for detector_name, detector in detectors:
        detector.fit(X_train)
        scores = detector.decision_function(X_test)
        evaluate_print(f"{detector_name}", y_test, scores)

Best Practices

Data Generation

  • Use consistent random seeds for reproducible experiments
  • Match contamination rate between training and test sets
  • Consider different outlier patterns (clustered, scattered, etc.)

Preprocessing

  • Standardize features for distance-based methods
  • Consider feature scaling impact on tree-based methods
  • Handle categorical variables appropriately

Evaluation

  • Use multiple metrics (ROC-AUC, Precision@n, Average Precision)
  • Consider class imbalance in evaluation metrics
  • Validate on multiple datasets with different characteristics

Visualization

  • Use visualization primarily for 2D data and method demonstration
  • Consider dimensionality reduction for high-dimensional visualization
  • Include both training and test data in visualizations for complete picture

Install with Tessl CLI

npx tessl i tessl/pypi-pyod
