# Modern Detection Models

State-of-the-art outlier detection algorithms that often provide better performance and scalability than classical approaches. These methods incorporate recent advances in machine learning and statistical theory.

## Capabilities

### Empirical Cumulative Distribution-based Outlier Detection (ECOD)

A parameter-free, highly interpretable outlier detection algorithm based on empirical cumulative distribution functions. ECOD is efficient, robust, and provides excellent performance across various datasets.

```python { .api }
class ECOD:
    def __init__(self, contamination=0.1, n_jobs=1):
        """
        Parameters:
        - contamination (float): Proportion of outliers in dataset
        - n_jobs (int): Number of parallel jobs for computation
        """
```

Usage example:

```python
from pyod.models.ecod import ECOD
from pyod.utils.data import generate_data

X_train, X_test, y_train, y_test = generate_data(contamination=0.1, random_state=42)

clf = ECOD(contamination=0.1, n_jobs=2)
clf.fit(X_train)
y_pred = clf.predict(X_test)
scores = clf.decision_function(X_test)
```

### Copula-Based Outlier Detection (COPOD)

Uses copula functions to model the dependence structure among features, providing a robust approach to outlier detection that captures complex relationships between variables.

```python { .api }
class COPOD:
    def __init__(self, contamination=0.1, n_jobs=1):
        """
        Parameters:
        - contamination (float): Proportion of outliers in dataset
        - n_jobs (int): Number of parallel jobs for computation
        """
```

### Scalable Unsupervised Outlier Detection (SUOD)

A framework for accelerating outlier detection by using multiple base estimators with approximate methods. Provides significant speedup while maintaining detection quality.

```python { .api }
class SUOD:
    def __init__(self, base_estimators=None, n_jobs=1, rp_clf_list=None,
                 rp_ng_clf_list=None, rp_flag_global=True, jl_method='basic',
                 jl_proj_nums=None, cost_forecast_loc_fit=None,
                 cost_forecast_loc_pred=None, approx_flag_global=False,
                 approx_clf_list=None, approx_ng_clf_list=None,
                 contamination=0.1, combination='average', verbose=False,
                 random_state=None):
        """
        Parameters:
        - base_estimators (list): List of base detectors
        - n_jobs (int): Number of parallel jobs
        - rp_clf_list (list): List of detectors for random projection
        - jl_method (str): Johnson-Lindenstrauss method ('basic', 'discrete', 'circulant')
        - contamination (float): Proportion of outliers in dataset
        - combination (str): Combination method for scores ('average', 'maximization')
        - verbose (bool): Whether to print progress information
        """
```

### Unifying Local Outlier Detection Methods via Graph Neural Networks (LUNAR)

Learns a unified, trainable version of local outlier methods: a graph neural network scores each point from the distances to its nearest neighbors, generalizing detectors such as KNN and LOF. Particularly effective for datasets with complex local patterns.

```python { .api }
class LUNAR:
    def __init__(self, model_type='regressor', n_neighbours=5,
                 negative_sampling=1, val_size=0.1, scaler='MinMaxScaler',
                 contamination=0.1):
        """
        Parameters:
        - model_type (str): Type of base model ('regressor', 'classifier')
        - n_neighbours (int): Number of neighbors for local modeling
        - negative_sampling (int): Negative sampling ratio
        - val_size (float): Validation set size fraction
        - scaler (str): Scaler type for preprocessing
        - contamination (float): Proportion of outliers in dataset
        """
```

### Deviation-based Outlier Detection (LMDD)

Detects outliers using the smoothing-factor concept: a point is a deviation if removing it substantially reduces the overall dissimilarity of the dataset. Performs well on datasets with complex structure.

```python { .api }
class LMDD:
    def __init__(self, contamination=0.1, n_iter=50, dis_measure='aad',
                 random_state=None):
        """
        Parameters:
        - contamination (float): Proportion of outliers in dataset
        - n_iter (int): Number of iterations for optimization
        - dis_measure (str): Dissimilarity measure ('aad' = average absolute deviation, 'var', 'iqr')
        - random_state (int): Random number generator seed
        """
```

### Lightweight On-line Detector of Anomalies (LODA)

A fast, online outlier detection algorithm that uses sparse random projections. Effective for high-dimensional data and streaming applications.

```python { .api }
class LODA:
    def __init__(self, contamination=0.1, n_bins=10, n_random_cuts=100):
        """
        Parameters:
        - contamination (float): Proportion of outliers in dataset
        - n_bins (int): Number of bins for histogram
        - n_random_cuts (int): Number of random projections
        """
```

### Isolation-based Anomaly Detection Using Nearest-Neighbor Ensembles (INNE)

Combines the benefits of isolation-based methods with nearest neighbor approaches, providing robust detection across various data distributions.

```python { .api }
class INNE:
    def __init__(self, n_estimators=200, max_samples=256, contamination=0.1,
                 random_state=None):
        """
        Parameters:
        - n_estimators (int): Number of estimators in ensemble
        - max_samples (int): Maximum number of samples per estimator
        - contamination (float): Proportion of outliers in dataset
        - random_state (int): Random number generator seed
        """
```

### Subspace Outlier Detection (SOD)

Detects outliers in relevant subspaces rather than the full feature space, making it effective for high-dimensional data where outliers may only be visible in certain dimensions.

```python { .api }
class SOD:
    def __init__(self, n_neighbors=20, ref_set=10, alpha=0.8, contamination=0.1):
        """
        Parameters:
        - n_neighbors (int): Number of neighbors to consider
        - ref_set (int): Size of reference set (must be smaller than n_neighbors)
        - alpha (float): Weight parameter for subspace selection
        - contamination (float): Proportion of outliers in dataset
        """
```

### Stochastic Outlier Selection (SOS)

Uses stochastic methods to compute outlier probabilities, providing uncertainty estimates along with outlier scores.

```python { .api }
class SOS:
    def __init__(self, perplexity=4.5, metric='euclidean', eps=1e-5,
                 contamination=0.1):
        """
        Parameters:
        - perplexity (float): Perplexity parameter for probability computation
        - metric (str): Distance metric to use
        - eps (float): Numerical stability parameter
        - contamination (float): Proportion of outliers in dataset
        """
```

### Rotation-based Outlier Detection (ROD)

Generates diverse feature representations through random rotations and combines multiple detectors for improved robustness.

```python { .api }
class ROD:
    def __init__(self, base_estimator=None, n_estimators=100,
                 max_features=1.0, contamination=0.1, random_state=None):
        """
        Parameters:
        - base_estimator: Base detector to use
        - n_estimators (int): Number of estimators
        - max_features (float): Fraction of features to use
        - contamination (float): Proportion of outliers in dataset
        - random_state (int): Random number generator seed
        """
```

### Additional Modern Models

```python { .api }
class LOCI:
    """Local Correlation Integral"""
    def __init__(self, contamination=0.1, alpha=0.5, k=3): ...

class CD:
    """Cook's Distance"""
    def __init__(self, contamination=0.1, whitening=True): ...

class QMCD:
    """Quasi-Monte Carlo Discrepancy"""
    def __init__(self, contamination=0.1, ref_set=10): ...

class Sampling:
    """Sampling-based outlier detection"""
    def __init__(self, contamination=0.1, subset_size=20, metric='euclidean'): ...
```

## Usage Patterns

Modern models follow the same interface as classical models:

```python
# Example with ECOD
from pyod.models.ecod import ECOD
from pyod.utils.data import generate_data

# Generate data
X_train, X_test, y_train, y_test = generate_data(
    n_train=500, n_test=200, contamination=0.1, random_state=42
)

# Initialize and fit
clf = ECOD(contamination=0.1, n_jobs=2)
clf.fit(X_train)

# Get results
train_scores = clf.decision_scores_
test_scores = clf.decision_function(X_test)
test_labels = clf.predict(X_test)
```

## Performance Characteristics

### ECOD
- **Strengths**: Parameter-free, highly interpretable, excellent empirical performance
- **Best for**: General-purpose outlier detection, when interpretability is important
- **Time complexity**: O(n*d) where n = samples, d = features

### COPOD
- **Strengths**: Captures feature dependencies, robust to different data distributions
- **Best for**: Datasets with complex feature relationships
- **Time complexity**: O(n*d²)

### SUOD
- **Strengths**: Significant speedup for ensemble methods, maintains detection quality
- **Best for**: Large datasets requiring fast ensemble-based detection
- **Time complexity**: Depends on the base estimators; random projection and approximation reduce the effective cost

### LUNAR
- **Strengths**: Learns a unified local outlier score from nearest-neighbor distances
- **Best for**: Datasets with complex local patterns worth the cost of training
- **Time complexity**: O(n²) for neighborhood graph construction

## Model Selection Guidelines

- **ECOD**: First choice for most applications due to its parameter-free design and strong performance
- **COPOD**: When feature dependencies are important for outlier detection
- **SUOD**: When you need ensemble methods but have performance constraints
- **LUNAR**: When learning a data-specific local score is worth the training cost
- **LODA**: For high-dimensional data or streaming applications
- **SOD**: For high-dimensional data where outliers exist in subspaces