Tessl Tile for pypi/pyod@2.0.0

# Classical Detection Models

Traditional outlier detection algorithms with proven effectiveness across many domains. These methods form the foundation of anomaly detection and are often the first choice because they are interpretable and reliable.

## Capabilities

### Local Outlier Factor (LOF)

Computes the local density deviation of each data point with respect to its neighbors. Samples with substantially lower density than their neighbors are considered outliers.

```python { .api }
class LOF:
    def __init__(self, n_neighbors=20, algorithm='auto', leaf_size=30,
                 metric='minkowski', p=2, metric_params=None,
                 contamination=0.1, n_jobs=1, novelty=True):
        """
        Parameters:
        - n_neighbors (int): Number of neighbors to consider
        - algorithm (str): Algorithm for nearest neighbors ('auto', 'ball_tree', 'kd_tree', 'brute')
        - leaf_size (int): Leaf size for tree-based algorithms
        - metric (str): Distance metric to use
        - p (float): Parameter for the Minkowski metric
        - contamination (float): Proportion of outliers in dataset
        - n_jobs (int): Number of parallel jobs
        - novelty (bool): Whether to use novelty detection mode
        """
```

Usage example:

```python
from pyod.models.lof import LOF
from pyod.utils.data import generate_data

# Synthetic train/test split with 10% outliers
X_train, X_test, y_train, y_test = generate_data(contamination=0.1, random_state=42)

clf = LOF(n_neighbors=20, contamination=0.1)
clf.fit(X_train)
y_pred = clf.predict(X_test)  # binary labels: 0 = inlier, 1 = outlier
```

### Isolation Forest

Isolates anomalies by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature. Anomalies are more susceptible to isolation and hence have short average path lengths.

```python { .api }
class IForest:
    def __init__(self, n_estimators=100, max_samples='auto', contamination=0.1,
                 max_features=1.0, bootstrap=False, n_jobs=1, random_state=None,
                 verbose=0, behaviour='deprecated'):
        """
        Parameters:
        - n_estimators (int): Number of isolation trees
        - max_samples (int or str): Number of samples to draw for each tree
        - contamination (float): Proportion of outliers in dataset
        - max_features (int or float): Number of features to draw for each tree
        - bootstrap (bool): Whether to use bootstrap sampling
        - n_jobs (int): Number of parallel jobs
        - random_state (int): Random number generator seed
        - verbose (int): Verbosity level
        """
```

### One-Class Support Vector Machine (OCSVM)

Finds a hyperplane in kernel feature space that separates the data from the origin with maximum margin. Points that fall on the origin side of the hyperplane are considered outliers.

```python { .api }
class OCSVM:
    def __init__(self, kernel='rbf', degree=3, gamma='scale', coef0=0.0,
                 tol=1e-3, nu=0.5, shrinking=True, cache_size=200,
                 verbose=False, max_iter=-1, contamination=0.1):
        """
        Parameters:
        - kernel (str): Kernel type ('linear', 'poly', 'rbf', 'sigmoid')
        - degree (int): Degree for polynomial kernel
        - gamma (str or float): Kernel coefficient
        - coef0 (float): Independent term for polynomial/sigmoid kernels
        - tol (float): Tolerance for stopping criterion
        - nu (float): Upper bound on fraction of training errors and lower bound on fraction of support vectors
        - contamination (float): Proportion of outliers in dataset
        """
```
82

83
### k-Nearest Neighbors (KNN)
84

85
Uses the distance to the k-th nearest neighbor as the outlier score. Data points with large distances to their k-th nearest neighbor are considered outliers.
86

87
```python { .api }
88
class KNN:
89
    def __init__(self, contamination=0.1, n_neighbors=5, method='largest',
90
                 radius=1.0, algorithm='auto', leaf_size=30, metric='minkowski',
91
                 p=2, metric_params=None, n_jobs=1):
92
        """
93
        Parameters:
94
        - contamination (float): Proportion of outliers in dataset
95
        - n_neighbors (int): Number of neighbors to consider
96
        - method (str): Method for computing outlier scores ('largest', 'mean', 'median')
97
        - radius (float): Range of parameter space for radius_neighbors
98
        - algorithm (str): Algorithm for nearest neighbors
99
        - metric (str): Distance metric to use
100
        - n_jobs (int): Number of parallel jobs
101
        """
102
```
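
To make the three `method` options concrete, here is a brute-force NumPy sketch of k-NN distance scoring. It is an illustration of the idea only, not PyOD's implementation; the helper name `knn_outlier_scores` is hypothetical, and it assumes Euclidean distance.

```python
import numpy as np

def knn_outlier_scores(X, n_neighbors=5, method="largest"):
    """Brute-force k-NN outlier scores (illustrative; O(n^2) memory)."""
    # Pairwise Euclidean distances between all rows of X
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Exclude each point's zero distance to itself
    np.fill_diagonal(dists, np.inf)
    # Distances to the k nearest neighbors, sorted ascending
    knn_dists = np.sort(dists, axis=1)[:, :n_neighbors]
    if method == "largest":
        return knn_dists[:, -1]            # distance to the k-th neighbor
    if method == "mean":
        return knn_dists.mean(axis=1)      # average over the k neighbors
    if method == "median":
        return np.median(knn_dists, axis=1)
    raise ValueError(f"unknown method: {method}")

# Four points on a unit square plus one distant point
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [10.0, 10.0]])
scores = knn_outlier_scores(X, n_neighbors=2, method="largest")
```

The distant point receives by far the largest score, since even its nearest neighbors are far away.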

### Principal Component Analysis (PCA)

Uses the sum of weighted projected distances to the eigenvector hyperplanes as outlier scores. Assumes that normal data can be represented in a lower-dimensional space.

```python { .api }
class PCA:
    def __init__(self, n_components=None, n_selected_components=None, copy=True,
                 whiten=False, svd_solver='auto', tol=0.0, iterated_power='auto',
                 contamination=0.1, random_state=None, weighted=True,
                 standardization=True):
        """
        Parameters:
        - n_components (int): Number of components to keep
        - n_selected_components (int): Number of selected components for outlier detection
        - copy (bool): Whether to copy data
        - whiten (bool): Whether to whiten components
        - svd_solver (str): SVD solver to use
        - contamination (float): Proportion of outliers in dataset
        - weighted (bool): Whether to use the eigenvalues as weights in score computation
        - standardization (bool): Whether to standardize data
        """
```

### Minimum Covariance Determinant (MCD)

Finds the subset of observations whose empirical covariance has the smallest determinant. Data points far from this "central" subset are considered outliers.

```python { .api }
class MCD:
    def __init__(self, contamination=0.1, store_precision=True,
                 assume_centered=False, support_fraction=None,
                 random_state=None):
        """
        Parameters:
        - contamination (float): Proportion of outliers in dataset
        - store_precision (bool): Whether to store precision matrix
        - assume_centered (bool): Whether data is centered
        - support_fraction (float): Fraction of points to include in support
        - random_state (int): Random number generator seed
        """
```

### Histogram-Based Outlier Score (HBOS)

Constructs histograms for each feature and calculates the outlier score as the inverse of the estimated density. Assumes feature independence but is efficient for large datasets.

```python { .api }
class HBOS:
    def __init__(self, n_bins=10, alpha=0.1, tol=0.5, contamination=0.1):
        """
        Parameters:
        - n_bins (int or str): Number of bins for histogram
        - alpha (float): Regularization parameter
        - tol (float): Tolerance for minimum density
        - contamination (float): Proportion of outliers in dataset
        """
```
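
The histogram idea can be sketched in a few lines of NumPy. This is a simplified illustration, not PyOD's implementation: the helper `hbos_scores` is hypothetical, it uses equal-width bins, ignores `tol`, and treats `alpha` simply as a constant added to the density before taking the logarithm.

```python
import numpy as np

def hbos_scores(X, n_bins=10, alpha=0.1):
    """Simplified histogram-based outlier scores (illustrative only)."""
    n_samples, n_features = X.shape
    scores = np.zeros(n_samples)
    for j in range(n_features):
        # Density-normalized histogram of feature j
        hist, edges = np.histogram(X[:, j], bins=n_bins, density=True)
        # Map each value to its bin index (0 .. n_bins - 1) via interior edges
        bin_idx = np.digitize(X[:, j], edges[1:-1])
        density = hist[bin_idx]
        # Low density -> high score; alpha keeps the logarithm finite
        scores += -np.log2(density + alpha)
    return scores

rng = np.random.default_rng(0)
# 500 Gaussian points plus one point far out in both features
X = np.vstack([rng.normal(size=(500, 2)), [[8.0, 8.0]]])
scores = hbos_scores(X)
```

Summing per-feature log scores is where the feature-independence assumption enters: each feature contributes without regard to the others.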

### Additional Classical Models

```python { .api }
class ABOD:
    """Angle-Based Outlier Detection"""
    def __init__(self, contamination=0.1, n_neighbors=5): ...

class CBLOF:
    """Clustering-Based Local Outlier Factor"""
    def __init__(self, n_clusters=8, contamination=0.1, clustering_estimator=None, **kwargs): ...

class COF:
    """Connectivity-Based Outlier Factor"""
    def __init__(self, contamination=0.1, n_neighbors=20): ...

class GMM:
    """Gaussian Mixture Model for outlier detection"""
    def __init__(self, n_components=1, contamination=0.1, **kwargs): ...

class KDE:
    """Kernel Density Estimation"""
    def __init__(self, contamination=0.1, bandwidth=1.0, algorithm='auto', **kwargs): ...

class MAD:
    """Median Absolute Deviation"""
    def __init__(self, threshold=3.5, contamination=0.1): ...
```
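
As an illustration of the idea behind `MAD`, the sketch below computes modified z-scores for one-dimensional data and flags values above the default `threshold=3.5`. The helper `mad_scores` is hypothetical and written in plain NumPy; the 0.6745 scaling constant is the conventional choice and is assumed here rather than taken from this document.

```python
import numpy as np

def mad_scores(x):
    """Modified z-scores based on the median absolute deviation."""
    x = np.asarray(x, dtype=float)
    median = np.median(x)
    mad = np.median(np.abs(x - median))
    # 0.6745 rescales the MAD so it is comparable to a standard deviation
    # for normally distributed data. Note: mad == 0 would divide by zero.
    return 0.6745 * np.abs(x - median) / mad

x = np.array([10.0, 11.0, 9.0, 10.0, 12.0, 8.0, 10.0, 50.0])
scores = mad_scores(x)
is_outlier = scores > 3.5   # only the value 50.0 exceeds the threshold
```

Using the median for both the center and the spread keeps a single extreme value from masking itself, which can happen with a mean and standard deviation.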

## Usage Patterns

All classical models follow the same usage pattern:

```python
# 1. Import the model (and a data helper for this example)
from pyod.models.lof import LOF
from pyod.utils.data import generate_data

X_train, X_test, y_train, y_test = generate_data(contamination=0.1, random_state=42)

# 2. Initialize with parameters
clf = LOF(n_neighbors=20, contamination=0.1)

# 3. Fit on training data
clf.fit(X_train)

# 4. Access fitted attributes
train_scores = clf.decision_scores_  # raw outlier scores (higher = more anomalous)
train_labels = clf.labels_           # binary labels: 0 = inlier, 1 = outlier
threshold = clf.threshold_           # score cutoff derived from contamination

# 5. Predict on test data
test_labels = clf.predict(X_test)            # binary labels
test_scores = clf.decision_function(X_test)  # raw outlier scores
test_proba = clf.predict_proba(X_test)       # outlier probabilities
```
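
The relationship between `contamination`, `threshold_`, and `labels_` can be sketched with NumPy alone. This assumes the threshold is taken as the `100 * (1 - contamination)` percentile of the training scores, which is how PyOD's base detector is understood to behave; treat it as an assumption and check the library source if the exact rule matters. The scores are made up.

```python
import numpy as np

contamination = 0.1
# Hypothetical training scores: nine ordinary values and one extreme value
train_scores = np.array([0.2, 0.1, 0.4, 0.3, 0.5, 0.2, 0.3, 0.1, 0.4, 9.0])

# Cutoff such that roughly `contamination` of the training scores lie above it
threshold = np.percentile(train_scores, 100 * (1 - contamination))

# Scores strictly above the cutoff are labeled as outliers (1)
train_labels = (train_scores > threshold).astype(int)

# New scores are compared against the same training-derived cutoff
new_scores = np.array([0.3, 5.0])
new_labels = (new_scores > threshold).astype(int)
```

This is why `contamination` affects only the binary labels: the raw scores are the same regardless of its value.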

## Model Selection Guidelines

- **LOF**: Good for datasets with regions of varying density
- **IForest**: Excellent for high-dimensional data and large datasets
- **OCSVM**: Effective with small to medium datasets, works well with kernels
- **KNN**: Simple and interpretable, good baseline method
- **PCA**: Effective when outliers do not lie in the principal component subspace
- **MCD**: Robust for multivariate normal data with outliers
- **HBOS**: Fast for large datasets when features are independent
