tessl/pypi-pyod

A comprehensive Python library for detecting anomalous/outlying objects in multivariate data with 45+ algorithms.

Classical Detection Models

Traditional outlier detection algorithms that have proven effectiveness across various domains. These methods form the foundation of anomaly detection and are often the first choice for many applications due to their interpretability and reliability.

Capabilities

Local Outlier Factor (LOF)

Computes the local density deviation of a data point with respect to its neighbors. Samples with a substantially lower density than their neighbors are considered outliers.

class LOF:
    def __init__(self, n_neighbors=20, algorithm='auto', leaf_size=30, 
                 metric='minkowski', p=2, metric_params=None, 
                 contamination=0.1, n_jobs=1, novelty=True):
        """
        Parameters:
        - n_neighbors (int): Number of neighbors to consider
        - algorithm (str): Algorithm for nearest neighbors ('auto', 'ball_tree', 'kd_tree', 'brute')
        - leaf_size (int): Leaf size for tree-based algorithms
        - metric (str): Distance metric to use
        - p (float): Parameter for the Minkowski metric
        - contamination (float): Proportion of outliers in dataset
        - n_jobs (int): Number of parallel jobs
        - novelty (bool): Whether to use novelty detection mode
        """

Usage example:

from pyod.models.lof import LOF
from pyod.utils.data import generate_data

# Synthetic data: 1,000 train / 500 test points by default, 10% outliers
X_train, X_test, y_train, y_test = generate_data(contamination=0.1, random_state=42)

clf = LOF(n_neighbors=20, contamination=0.1)
clf.fit(X_train)
y_pred = clf.predict(X_test)  # binary labels: 0 = inlier, 1 = outlier

Isolation Forest

Isolates anomalies by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of that feature. Anomalies are easier to isolate and therefore have shorter average path lengths in the resulting trees.

class IForest:
    def __init__(self, n_estimators=100, max_samples='auto', contamination=0.1,
                 max_features=1.0, bootstrap=False, n_jobs=1, random_state=None,
                 verbose=0, behaviour='deprecated'):
        """
        Parameters:
        - n_estimators (int): Number of isolation trees
        - max_samples (int or str): Number of samples to draw for each tree
        - contamination (float): Proportion of outliers in dataset
        - max_features (int or float): Number of features to draw for each tree
        - bootstrap (bool): Whether to use bootstrap sampling
        - n_jobs (int): Number of parallel jobs
        - random_state (int): Random number generator seed
        - verbose (int): Verbosity level
        """

One-Class Support Vector Machine (OCSVM)

Finds a hyperplane that separates the data from the origin with maximum margin. Points far from the hyperplane are considered outliers.

class OCSVM:
    def __init__(self, kernel='rbf', degree=3, gamma='scale', coef0=0.0,
                 tol=1e-3, nu=0.5, shrinking=True, cache_size=200,
                 verbose=False, max_iter=-1, contamination=0.1):
        """
        Parameters:
        - kernel (str): Kernel type ('linear', 'poly', 'rbf', 'sigmoid')
        - degree (int): Degree for polynomial kernel
        - gamma (str or float): Kernel coefficient
        - coef0 (float): Independent term for polynomial/sigmoid kernels
        - tol (float): Tolerance for stopping criterion
        - nu (float): Upper bound on fraction of training errors
        - contamination (float): Proportion of outliers in dataset
        """

k-Nearest Neighbors (KNN)

Uses the distance to the k-th nearest neighbor as the outlier score. Data points with large distances to their k-th nearest neighbor are considered outliers.

class KNN:
    def __init__(self, contamination=0.1, n_neighbors=5, method='largest',
                 radius=1.0, algorithm='auto', leaf_size=30, metric='minkowski',
                 p=2, metric_params=None, n_jobs=1):
        """
        Parameters:
        - contamination (float): Proportion of outliers in dataset
        - n_neighbors (int): Number of neighbors to consider
        - method (str): Method for computing outlier scores ('largest', 'mean', 'median')
        - radius (float): Range of parameter space for radius_neighbors
        - algorithm (str): Algorithm for nearest neighbors
        - metric (str): Distance metric to use
        - n_jobs (int): Number of parallel jobs
        """

Principal Component Analysis (PCA)

Uses the sum of weighted projected distances to the eigenvector hyperplanes as outlier scores. Assumes that normal data can be represented in lower dimensional space.

class PCA:
    def __init__(self, n_components=None, n_selected_components=None, copy=True,
                 whiten=False, svd_solver='auto', tol=0.0, iterated_power='auto',
                 contamination=0.1, random_state=None, weighted=True,
                 standardization=True):
        """
        Parameters:
        - n_components (int): Number of components to keep
        - n_selected_components (int): Number of selected components for outlier detection
        - copy (bool): Whether to copy data
        - whiten (bool): Whether to whiten components
        - svd_solver (str): SVD solver to use
        - contamination (float): Proportion of outliers in dataset
        - weighted (bool): Whether to use weighted PCA
        - standardization (bool): Whether to standardize data
        """

Minimum Covariance Determinant (MCD)

Finds the subset of observations whose empirical covariance has the smallest determinant. Data points far from this "central" subset are considered outliers.

class MCD:
    def __init__(self, contamination=0.1, store_precision=True,
                 assume_centered=False, support_fraction=None,
                 random_state=None):
        """
        Parameters:
        - contamination (float): Proportion of outliers in dataset
        - store_precision (bool): Whether to store precision matrix
        - assume_centered (bool): Whether data is centered
        - support_fraction (float): Fraction of points to include in support
        - random_state (int): Random number generator seed
        """

Histogram-Based Outlier Score (HBOS)

Constructs histograms for each feature and calculates the outlier score as the inverse of the estimated density. Assumes feature independence but is efficient for large datasets.

class HBOS:
    def __init__(self, n_bins=10, alpha=0.1, tol=0.5, contamination=0.1):
        """
        Parameters:
        - n_bins (int or str): Number of bins for histogram
        - alpha (float): Regularization parameter
        - tol (float): Tolerance for minimum density
        - contamination (float): Proportion of outliers in dataset
        """

Additional Classical Models

class ABOD:
    """Angle-Based Outlier Detection"""
    def __init__(self, contamination=0.1, n_neighbors=5): ...

class CBLOF:
    """Clustering-Based Local Outlier Factor"""
    def __init__(self, n_clusters=8, contamination=0.1, clustering_estimator=None, **kwargs): ...

class COF:
    """Connectivity-Based Outlier Factor"""
    def __init__(self, contamination=0.1, n_neighbors=20): ...

class GMM:
    """Gaussian Mixture Model for outlier detection"""
    def __init__(self, n_components=1, contamination=0.1, **kwargs): ...

class KDE:
    """Kernel Density Estimation"""
    def __init__(self, contamination=0.1, bandwidth=1.0, algorithm='auto', **kwargs): ...

class MAD:
    """Median Absolute Deviation"""
    def __init__(self, threshold=3.5, contamination=0.1): ...

Usage Patterns

All classical models follow the same usage pattern:

# 1. Import the model
from pyod.models.lof import LOF

# 2. Initialize with parameters
clf = LOF(n_neighbors=20, contamination=0.1)

# 3. Fit on training data
clf.fit(X_train)

# 4. Access fitted attributes
train_scores = clf.decision_scores_
train_labels = clf.labels_
threshold = clf.threshold_

# 5. Predict on test data
test_labels = clf.predict(X_test)
test_scores = clf.decision_function(X_test)
test_proba = clf.predict_proba(X_test)

Model Selection Guidelines

  • LOF: Good for datasets with varying density regions
  • IForest: Excellent for high-dimensional data and large datasets
  • OCSVM: Effective with small to medium datasets, works well with kernels
  • KNN: Simple and interpretable, good baseline method
  • PCA: Effective when outliers don't lie in principal component subspace
  • MCD: Robust for multivariate normal data with outliers
  • HBOS: Fast for large datasets when features are independent

Install with Tessl CLI

npx tessl i tessl/pypi-pyod
