# Classical Detection Models

These traditional outlier detection algorithms have proven effective across many domains. They form the foundation of anomaly detection and are often the first choice for an application because they are interpretable and reliable.

## Capabilities

### Local Outlier Factor (LOF)

Computes the local density deviation of a data point with respect to its neighbors. Samples whose density is substantially lower than that of their neighbors are considered outliers.

```python { .api }
class LOF:
    def __init__(self, n_neighbors=20, algorithm='auto', leaf_size=30,
                 metric='minkowski', p=2, metric_params=None,
                 contamination=0.1, n_jobs=1, novelty=True):
        """
        Parameters:
        - n_neighbors (int): Number of neighbors to consider
        - algorithm (str): Algorithm for nearest neighbors ('auto', 'ball_tree', 'kd_tree', 'brute')
        - leaf_size (int): Leaf size for tree-based algorithms
        - metric (str): Distance metric to use
        - p (float): Parameter for the Minkowski metric
        - contamination (float): Proportion of outliers in dataset
        - n_jobs (int): Number of parallel jobs
        - novelty (bool): Whether to use novelty detection mode
        """
```

Usage example:

```python
from pyod.models.lof import LOF
from pyod.utils.data import generate_data

# Synthetic data in which 10% of the samples are outliers
X_train, X_test, y_train, y_test = generate_data(contamination=0.1, random_state=42)

clf = LOF(n_neighbors=20, contamination=0.1)
clf.fit(X_train)
y_pred = clf.predict(X_test)  # binary labels: 0 = inlier, 1 = outlier
```
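
To make the density comparison concrete, here is a minimal pure-Python sketch of the LOF score. It is an illustration, not PyOD's implementation: the function name `lof_scores` and the toy data are assumptions, and ties at the k-th distance and duplicate points are ignored for brevity.

```python
import math

def lof_scores(points, k=2):
    """Local Outlier Factor per point (values well above 1 suggest outliers)."""
    n = len(points)
    # For each point, its k nearest (distance, index) pairs, excluding itself.
    neigh = []
    for i in range(n):
        d = sorted((math.dist(points[i], points[j]), j) for j in range(n) if j != i)
        neigh.append(d[:k])
    k_dist = [nb[-1][0] for nb in neigh]  # distance to the k-th neighbor

    def lrd(i):
        # Local reachability density: inverse of the mean reachability distance.
        reach = [max(k_dist[j], d) for d, j in neigh[i]]
        return len(reach) / sum(reach)

    lrds = [lrd(i) for i in range(n)]
    # LOF: mean ratio of the neighbors' density to the point's own density.
    return [sum(lrds[j] for _, j in neigh[i]) / (k * lrds[i]) for i in range(n)]

# Illustrative data: four clustered points and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
scores = lof_scores(points, k=2)
```

The clustered points receive a score close to 1 because their density matches that of their neighbors, while the isolated point scores much higher.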

### Isolation Forest

Isolates observations by randomly selecting a feature and then randomly selecting a split value between the minimum and maximum values of that feature. Anomalies are easier to isolate, so they have shorter average path lengths across the trees.

```python { .api }
class IForest:
    def __init__(self, n_estimators=100, max_samples='auto', contamination=0.1,
                 max_features=1.0, bootstrap=False, n_jobs=1, random_state=None,
                 verbose=0, behaviour='deprecated'):
        """
        Parameters:
        - n_estimators (int): Number of isolation trees
        - max_samples (int or str): Number of samples to draw for each tree
        - contamination (float): Proportion of outliers in dataset
        - max_features (int or float): Number of features to draw for each tree
        - bootstrap (bool): Whether to use bootstrap sampling
        - n_jobs (int): Number of parallel jobs
        - random_state (int): Random number generator seed
        - verbose (int): Verbosity level
        """
```
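
The isolation idea can be sketched without building a tree, by following a single point through a sequence of random splits. This is a simplified illustration that assumes no subsampling and no path-length normalization; `path_length` and `mean_path_length` are hypothetical helper names, not part of PyOD.

```python
import random

def path_length(x, sample, rng, depth=0, max_depth=10):
    """Number of random splits applied before x is isolated within sample."""
    if len(sample) <= 1 or depth >= max_depth:
        return depth
    f = rng.randrange(len(x))            # pick a random feature
    lo = min(p[f] for p in sample)
    hi = max(p[f] for p in sample)
    if lo == hi:
        return depth                     # no split possible on this feature
    split = rng.uniform(lo, hi)          # pick a random split value
    # Keep only the points that fall on the same side of the split as x.
    same_side = [p for p in sample if (p[f] < split) == (x[f] < split)]
    return path_length(x, same_side, rng, depth + 1, max_depth)

def mean_path_length(x, data, n_trees=200, seed=0):
    """Average path length of x over many independent random split sequences."""
    rng = random.Random(seed)
    return sum(path_length(x, data, rng) for _ in range(n_trees)) / n_trees

# Illustrative data: a tight cluster plus one distant point.
data = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.2, 0.4),
        (0.4, 0.2), (0.1, 0.4), (0.3, 0.1), (10.0, 10.0)]
outlier_depth = mean_path_length((10.0, 10.0), data)
inlier_depth = mean_path_length((0.2, 0.1), data)
```

The distant point is usually cut off by the first split, so its average path length is shorter than that of a point inside the cluster.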

### One-Class Support Vector Machine (OCSVM)

Learns a boundary that, in the kernel feature space, separates the training data from the origin with maximum margin. Points that fall on the origin side of this boundary are considered outliers.

```python { .api }
class OCSVM:
    def __init__(self, kernel='rbf', degree=3, gamma='scale', coef0=0.0,
                 tol=1e-3, nu=0.5, shrinking=True, cache_size=200,
                 verbose=False, max_iter=-1, contamination=0.1):
        """
        Parameters:
        - kernel (str): Kernel type ('linear', 'poly', 'rbf', 'sigmoid')
        - degree (int): Degree for polynomial kernel
        - gamma (str or float): Kernel coefficient
        - coef0 (float): Independent term for polynomial/sigmoid kernels
        - tol (float): Tolerance for stopping criterion
        - nu (float): Upper bound on the fraction of training errors and lower bound on the fraction of support vectors
        - contamination (float): Proportion of outliers in dataset
        """
```

### k-Nearest Neighbors (KNN)

By default, uses the distance to the k-th nearest neighbor as the outlier score (`method='largest'`); the `'mean'` and `'median'` methods use the mean or median of the distances to the k nearest neighbors instead. Data points with large scores are considered outliers.

```python { .api }
class KNN:
    def __init__(self, contamination=0.1, n_neighbors=5, method='largest',
                 radius=1.0, algorithm='auto', leaf_size=30, metric='minkowski',
                 p=2, metric_params=None, n_jobs=1):
        """
        Parameters:
        - contamination (float): Proportion of outliers in dataset
        - n_neighbors (int): Number of neighbors to consider
        - method (str): Method for computing outlier scores ('largest', 'mean', 'median')
        - radius (float): Range of parameter space for radius_neighbors
        - algorithm (str): Algorithm for nearest neighbors
        - metric (str): Distance metric to use
        - n_jobs (int): Number of parallel jobs
        """
```
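
The three scoring methods can be sketched in a few lines of pure Python. This is a brute-force illustration under the assumption of Euclidean distance; `knn_scores` and the toy data are hypothetical and do not reflect how PyOD organizes its neighbor search.

```python
import math
import statistics

def knn_scores(points, k=2, method="largest"):
    """Score each point from the distances to its k nearest neighbors."""
    scores = []
    for i, p in enumerate(points):
        # Distances to every other point, nearest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        nearest = dists[:k]
        if method == "largest":
            scores.append(nearest[-1])                  # distance to the k-th neighbor
        elif method == "mean":
            scores.append(statistics.mean(nearest))     # mean of the k distances
        elif method == "median":
            scores.append(statistics.median(nearest))   # median of the k distances
        else:
            raise ValueError(f"unknown method: {method}")
    return scores

# Illustrative data: four clustered points and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
scores = knn_scores(points, k=2, method="largest")
most_anomalous = max(range(len(points)), key=scores.__getitem__)
```

Here `most_anomalous` is the index of the isolated point, whose second-nearest neighbor is far away compared with the unit distances inside the cluster.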

### Principal Component Analysis (PCA)

Uses the sum of weighted projected distances to the eigenvector hyperplanes as the outlier score. Assumes that normal data can be represented in a lower-dimensional space.

```python { .api }
class PCA:
    def __init__(self, n_components=None, n_selected_components=None, copy=True,
                 whiten=False, svd_solver='auto', tol=0.0, iterated_power='auto',
                 contamination=0.1, random_state=None, weighted=True,
                 standardization=True):
        """
        Parameters:
        - n_components (int): Number of components to keep
        - n_selected_components (int): Number of selected components for outlier detection
        - copy (bool): Whether to copy data
        - whiten (bool): Whether to whiten components
        - svd_solver (str): SVD solver to use
        - contamination (float): Proportion of outliers in dataset
        - weighted (bool): Whether to weight the projected distances by the eigenvalues when scoring
        - standardization (bool): Whether to standardize data
        """
```

### Minimum Covariance Determinant (MCD)

Finds the subset of observations whose empirical covariance has the smallest determinant. Data points far from this "central" subset, as measured by Mahalanobis distance, are considered outliers.

```python { .api }
class MCD:
    def __init__(self, contamination=0.1, store_precision=True,
                 assume_centered=False, support_fraction=None,
                 random_state=None):
        """
        Parameters:
        - contamination (float): Proportion of outliers in dataset
        - store_precision (bool): Whether to store precision matrix
        - assume_centered (bool): Whether data is centered
        - support_fraction (float): Fraction of points to include in support
        - random_state (int): Random number generator seed
        """
```
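
For intuition, the subset search can be written as a brute-force loop on a tiny 2-D dataset. This sketch is an assumption-laden toy: it enumerates every subset of size `h`, which is only feasible for a handful of points, whereas practical implementations rely on approximate search. The helper names `mean_cov` and `mcd_distances` are hypothetical.

```python
from itertools import combinations

def mean_cov(pts):
    """Mean and 2x2 sample covariance (sxx, sxy, syy) of 2-D points."""
    n = len(pts)
    mx = sum(p[0] for p in pts) / n
    my = sum(p[1] for p in pts) / n
    sxx = sum((p[0] - mx) ** 2 for p in pts) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in pts) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pts) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def mcd_distances(points, h):
    """Squared Mahalanobis distances based on the h-subset with the smallest covariance determinant."""
    best = None
    for subset in combinations(points, h):
        mean, (sxx, sxy, syy) = mean_cov(subset)
        det = sxx * syy - sxy ** 2
        if det > 0 and (best is None or det < best[0]):
            best = (det, mean, (sxx, sxy, syy))
    det, (mx, my), (sxx, sxy, syy) = best
    dists = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # Quadratic form d^T * inverse(cov) * d, using the closed-form 2x2 inverse.
        dists.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return dists

# Illustrative data: six clustered points and one distant point.
points = [(0.0, 0.0), (1.0, 0.2), (0.2, 1.0), (1.0, 1.0),
          (0.5, 0.4), (0.4, 0.6), (8.0, 8.0)]
dists = mcd_distances(points, h=5)
```

Because the distant point inflates the determinant of any subset that contains it, the selected subset excludes it, and its distance to the robust center is far larger than that of the clustered points.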

### Histogram-Based Outlier Score (HBOS)

Builds a histogram for each feature and scores each sample by the inverse of its estimated density, combined across features. Assumes feature independence but is efficient for large datasets.

```python { .api }
class HBOS:
    def __init__(self, n_bins=10, alpha=0.1, tol=0.5, contamination=0.1):
        """
        Parameters:
        - n_bins (int or str): Number of bins for histogram
        - alpha (float): Regularizer for preventing overflow
        - tol (float): Flexibility for handling samples that fall outside the bins
        - contamination (float): Proportion of outliers in dataset
        """
```
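
The scoring rule can be illustrated with equal-width histograms in pure Python. This is a simplified sketch, assuming static bins, densities normalized so the tallest bin equals 1, and scoring of the training rows only; `hbos_scores` is a hypothetical name and the details differ from PyOD's implementation.

```python
import math

def hbos_scores(rows, n_bins=5, alpha=0.1):
    """Sum over features of log(1 / (relative bin density + alpha))."""
    n_features = len(rows[0])
    scores = [0.0] * len(rows)
    for f in range(n_features):
        col = [r[f] for r in rows]
        lo, hi = min(col), max(col)
        width = (hi - lo) / n_bins or 1.0    # guard against constant features
        # Assign each value to an equal-width bin; the maximum lands in the last bin.
        bins = [min(int((v - lo) / width), n_bins - 1) for v in col]
        counts = [0] * n_bins
        for b in bins:
            counts[b] += 1
        peak = max(counts)
        for i, b in enumerate(bins):
            density = counts[b] / peak       # normalized so the tallest bin is 1
            # alpha is a small regularizer that keeps the logarithm bounded.
            scores[i] += math.log(1.0 / (density + alpha))
    return scores

# Illustrative data: five clustered rows and one distant row.
rows = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.0, 1.2), (1.2, 1.0), (9.0, 9.0)]
scores = hbos_scores(rows)
```

Each feature contributes independently to the score, which is why the method is fast but cannot detect outliers that are unusual only in the combination of features.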

### Additional Classical Models

```python { .api }
class ABOD:
    """Angle-Based Outlier Detection"""
    def __init__(self, contamination=0.1, n_neighbors=5): ...

class CBLOF:
    """Clustering-Based Local Outlier Factor"""
    def __init__(self, n_clusters=8, contamination=0.1, clustering_estimator=None, **kwargs): ...

class COF:
    """Connectivity-Based Outlier Factor"""
    def __init__(self, contamination=0.1, n_neighbors=20): ...

class GMM:
    """Gaussian Mixture Model for outlier detection"""
    def __init__(self, n_components=1, contamination=0.1, **kwargs): ...

class KDE:
    """Kernel Density Estimation"""
    def __init__(self, contamination=0.1, bandwidth=1.0, algorithm='auto', **kwargs): ...

class MAD:
    """Median Absolute Deviation"""
    def __init__(self, threshold=3.5, contamination=0.1): ...
```
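
As an example of how simple some of these detectors are, the Median Absolute Deviation rule can be sketched for a single feature as a modified z-score compared against a threshold such as 3.5. The function `mad_scores` and the sample values are illustrative assumptions, and the 0.6745 scaling constant is the conventional choice that makes the MAD comparable to a standard deviation for normally distributed data.

```python
import statistics

def mad_scores(values):
    """Modified z-scores based on the median absolute deviation (univariate)."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        raise ValueError("MAD is zero; scores are undefined")
    # 0.6745 rescales the MAD so scores are comparable to ordinary z-scores.
    return [0.6745 * abs(v - med) / mad for v in values]

# Illustrative readings with one obvious outlier.
values = [9.8, 10.0, 10.1, 10.2, 9.9, 10.0, 15.0]
scores = mad_scores(values)
flags = [s > 3.5 for s in scores]   # 3.5 mirrors the default threshold above
```

Because both the center and the spread are medians, a single extreme value barely shifts them, which is what makes the rule robust.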

## Usage Patterns

All classical models follow the same usage pattern:

```python
# 1. Import the model and a helper that generates sample data
from pyod.models.lof import LOF
from pyod.utils.data import generate_data

X_train, X_test, y_train, y_test = generate_data(contamination=0.1, random_state=42)

# 2. Initialize with parameters
clf = LOF(n_neighbors=20, contamination=0.1)

# 3. Fit on training data
clf.fit(X_train)

# 4. Access fitted attributes
train_scores = clf.decision_scores_  # outlier scores of the training data (higher = more abnormal)
train_labels = clf.labels_           # binary labels (0 = inlier, 1 = outlier)
threshold = clf.threshold_           # score cutoff derived from contamination

# 5. Predict on test data
test_labels = clf.predict(X_test)
test_scores = clf.decision_function(X_test)
test_proba = clf.predict_proba(X_test)
```
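
The relationship between the fitted attributes can be sketched as follows. This assumes the threshold is the `100 * (1 - contamination)` percentile of the training scores and that samples scoring above it are labeled 1; the helper names and scores are hypothetical, and the percentile uses linear interpolation.

```python
def percentile(sorted_vals, q):
    """Linearly interpolated percentile of sorted values, with q in [0, 100]."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * frac

def threshold_and_labels(scores, contamination=0.1):
    """Derive a score cutoff and binary labels from raw outlier scores."""
    threshold = percentile(sorted(scores), 100 * (1 - contamination))
    labels = [int(s > threshold) for s in scores]   # 1 = outlier, 0 = inlier
    return threshold, labels

# Illustrative training scores: nine ordinary values and one large one.
train_scores = [0.9, 1.1, 1.0, 1.2, 0.8, 1.3, 0.7, 1.05, 0.95, 6.0]
threshold, labels = threshold_and_labels(train_scores, contamination=0.1)
```

With `contamination=0.1` and ten samples, only the highest-scoring sample ends up above the cutoff, so the choice of `contamination` directly controls how many training points are flagged.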

## Model Selection Guidelines

- **LOF**: Good for datasets whose regions vary in density
- **IForest**: Scales well to large and high-dimensional datasets
- **OCSVM**: Effective on small to medium datasets; kernels allow non-linear decision boundaries
- **KNN**: Simple and interpretable; a good baseline method
- **PCA**: Effective when normal data lies close to a low-dimensional linear subspace and outliers deviate from it
- **MCD**: Robust choice for approximately multivariate normal data contaminated by outliers
- **HBOS**: Fast on large datasets when features are approximately independent