# Ensemble Models

Combination methods that leverage multiple base detectors to improve detection performance through diversity and aggregation strategies. Ensemble methods often provide more robust and reliable outlier detection than individual models.

## Capabilities

### Feature Bagging

Combines multiple base detectors trained on different feature subsets. This approach increases diversity and reduces the impact of irrelevant features on outlier detection.

```python { .api }
class FeatureBagging:
    def __init__(self, base_estimator=None, n_estimators=10, max_features=1.0,
                 bootstrap_features=False, check_detector=True, check_estimator=False,
                 n_jobs=1, random_state=None, combination='average',
                 verbose=0, estimator_params=None, contamination=0.1):
        """
        Parameters:
        - base_estimator: Base detector (default: LOF)
        - n_estimators (int): Number of estimators in the ensemble
        - max_features (int or float): Number/fraction of features per estimator
        - bootstrap_features (bool): Whether to use bootstrap sampling for features
        - n_jobs (int): Number of parallel jobs
        - combination (str): Method to combine scores ('average', 'max')
        - contamination (float): Proportion of outliers in the dataset
        - estimator_params (dict): Parameters for the base estimator
        """
```

Usage example:

```python
from pyod.models.feature_bagging import FeatureBagging
from pyod.models.lof import LOF
from pyod.utils.data import generate_data

X_train, X_test, y_train, y_test = generate_data(
    n_train=500, n_test=200, n_features=10, contamination=0.1, random_state=42
)

# Use LOF as the base estimator
clf = FeatureBagging(
    base_estimator=LOF(),
    n_estimators=10,
    max_features=0.7,
    contamination=0.1,
    n_jobs=2
)
clf.fit(X_train)
y_pred = clf.predict(X_test)
```

### Locally Selective Combination in Parallel (LSCP)

Combines multiple detectors by selecting the most competent detector for each data point based on local performance. This adaptive approach leverages detector diversity more effectively than static averaging.

```python { .api }
class LSCP:
    def __init__(self, detector_list, local_region_size=30, local_max_features=1.0,
                 n_bins=10, random_state=None, contamination=0.1):
        """
        Parameters:
        - detector_list (list): Base detectors to combine; LSCP fits them internally
        - local_region_size (int): Size of the local region for competence estimation
        - local_max_features (float): Maximum fraction of features for local region construction
        - n_bins (int): Number of bins for histogram-based selection
        - contamination (float): Proportion of outliers in the dataset
        """
```

Usage example:

```python
from pyod.models.lscp import LSCP
from pyod.models.lof import LOF
from pyod.models.iforest import IForest
from pyod.models.ocsvm import OCSVM

# Base detectors; LSCP fits them during clf.fit()
detector_list = [LOF(), IForest(), OCSVM()]

# Combine with LSCP
clf = LSCP(
    detector_list=detector_list,
    local_region_size=30,
    contamination=0.1
)
clf.fit(X_train)
y_pred = clf.predict(X_test)
```

94
95
### Model Combination Functions
96
97
PyOD provides several functions for combining outlier scores from multiple detectors:
98
99
```python { .api }
100
def average(scores):
101
"""
102
Simple average combination of multiple outlier score matrices.
103
104
Parameters:
105
- scores (array): Score matrix of shape (n_samples, n_detectors)
106
107
Returns:
108
- combined_scores (array): Combined outlier scores
109
"""
110
111
def maximization(scores):
112
"""
113
Maximization combination: take maximum score across detectors.
114
115
Parameters:
116
- scores (array): Score matrix of shape (n_samples, n_detectors)
117
118
Returns:
119
- combined_scores (array): Combined outlier scores
120
"""
121
122
def aom(scores, n_buckets=5, method='static'):
123
"""
124
Average of Maximum: divide detectors into buckets and average the maximum scores.
125
126
Parameters:
127
- scores (array): Score matrix of shape (n_samples, n_detectors)
128
- n_buckets (int): Number of buckets to divide detectors
129
- method (str): Bucketing method ('static', 'dynamic')
130
131
Returns:
132
- combined_scores (array): Combined outlier scores
133
"""
134
135
def moa(scores, n_buckets=5, method='static'):
136
"""
137
Maximum of Average: take maximum of averaged scores from each bucket.
138
139
Parameters:
140
- scores (array): Score matrix of shape (n_samples, n_detectors)
141
- n_buckets (int): Number of buckets to divide detectors
142
- method (str): Bucketing method ('static', 'dynamic')
143
144
Returns:
145
- combined_scores (array): Combined outlier scores
146
"""
147
148
def median(scores):
149
"""
150
Median combination of multiple outlier score matrices.
151
152
Parameters:
153
- scores (array): Score matrix of shape (n_samples, n_detectors)
154
155
Returns:
156
- combined_scores (array): Combined outlier scores
157
"""
158
```
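The bucketing behavior of `aom` and `moa` is easiest to see on a toy example. Below is a simplified pure-Python sketch, not the PyOD implementation (which operates on NumPy arrays and supports dynamic bucketing); the names `aom_sketch` and `moa_sketch` are illustrative, and equal-size static buckets are assumed:

```python
def aom_sketch(scores, n_buckets):
    """Average of Maximum: max within each detector bucket, then average."""
    n_detectors = len(scores[0])
    size = n_detectors // n_buckets  # assumes n_detectors divisible by n_buckets
    buckets = [range(b * size, (b + 1) * size) for b in range(n_buckets)]
    return [sum(max(row[j] for j in b) for b in buckets) / n_buckets
            for row in scores]

def moa_sketch(scores, n_buckets):
    """Maximum of Average: average within each detector bucket, then max."""
    n_detectors = len(scores[0])
    size = n_detectors // n_buckets
    buckets = [range(b * size, (b + 1) * size) for b in range(n_buckets)]
    return [max(sum(row[j] for j in b) / size for b in buckets)
            for row in scores]

# One sample scored by 4 detectors, split into 2 buckets of 2:
row = [0.1, 0.9, 0.2, 0.3]
aom_val = aom_sketch([row], 2)[0]  # max(0.1, 0.9) and max(0.2, 0.3), then mean
moa_val = moa_sketch([row], 2)[0]  # avg(0.1, 0.9) and avg(0.2, 0.3), then max
```

Note that `aom` dampens single-detector spikes less than `moa` does: one extreme detector dominates its bucket's maximum but is diluted in its bucket's average.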
## Usage Patterns

### Creating Custom Ensembles

```python
from pyod.models.combination import average, maximization, aom, moa
from pyod.models.lof import LOF
from pyod.models.iforest import IForest
from pyod.models.ocsvm import OCSVM
import numpy as np

# Train multiple detectors
detectors = [LOF(), IForest(), OCSVM()]
for detector in detectors:
    detector.fit(X_train)

# Collect scores from all detectors
train_scores = np.zeros((len(X_train), len(detectors)))
test_scores = np.zeros((len(X_test), len(detectors)))

for i, detector in enumerate(detectors):
    train_scores[:, i] = detector.decision_scores_
    test_scores[:, i] = detector.decision_function(X_test)

# Combine scores using different methods
combined_avg = average(test_scores)
combined_max = maximization(test_scores)
combined_aom = aom(test_scores, n_buckets=3)
combined_moa = moa(test_scores, n_buckets=3)
```

### Dynamic Detector Selection

```python
from pyod.models.lscp import LSCP
from pyod.models.lof import LOF
from pyod.models.iforest import IForest
from pyod.models.knn import KNN
from pyod.models.ecod import ECOD

# Create a diverse set of detectors; LSCP fits them during clf.fit()
detectors = [
    LOF(n_neighbors=20),
    LOF(n_neighbors=40),  # same algorithm, different parameters
    IForest(n_estimators=100),
    KNN(n_neighbors=5, method='mean'),
    ECOD()
]

# Use LSCP for adaptive combination
clf = LSCP(
    detector_list=detectors,
    local_region_size=40,
    contamination=0.1
)
clf.fit(X_train)
y_pred = clf.predict(X_test)
```

### Advanced Ensemble Strategies

```python
import numpy as np

from pyod.models.combination import average
from pyod.models.feature_bagging import FeatureBagging
from pyod.models.lof import LOF
from pyod.models.iforest import IForest

# Create ensembles of different base detectors
lof_ensemble = FeatureBagging(
    base_estimator=LOF(n_neighbors=20),
    n_estimators=10,
    max_features=0.8,
    contamination=0.1
)

iforest_ensemble = FeatureBagging(
    base_estimator=IForest(n_estimators=50),
    n_estimators=5,
    max_features=0.9,
    contamination=0.1
)

# Fit ensembles
lof_ensemble.fit(X_train)
iforest_ensemble.fit(X_train)

# Combine ensemble scores
lof_scores = lof_ensemble.decision_function(X_test)
iforest_scores = iforest_ensemble.decision_function(X_test)

ensemble_scores = np.column_stack([lof_scores, iforest_scores])
final_scores = average(ensemble_scores)
```

## Ensemble Design Principles

### Diversity Strategies

1. **Algorithm Diversity**: Use different types of detectors (distance-based, density-based, tree-based)
2. **Parameter Diversity**: Same algorithm with different hyperparameters
3. **Feature Diversity**: Train detectors on different feature subsets
4. **Sample Diversity**: Use bootstrap sampling or different training subsets

### Combination Strategies

1. **Simple Average**: Good baseline; works well when detectors have similar performance
2. **Weighted Average**: Weight detectors by their individual performance
3. **Dynamic Selection**: Choose different detectors for different regions (LSCP)
4. **Rank-based**: Combine based on rank orders rather than raw scores
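
Of these strategies, only simple averaging and dynamic selection map directly onto the APIs above; weighted-average and rank-based combination are easy to implement by hand. The sketch below uses illustrative helper names (`weighted_average`, `rank_combine` — not PyOD API) and takes one score list per detector:

```python
def weighted_average(score_lists, weights):
    """Combine per-detector score lists using per-detector weights."""
    total = sum(weights)
    n = len(score_lists[0])
    return [sum(w * s[i] for w, s in zip(weights, score_lists)) / total
            for i in range(n)]

def rank_combine(score_lists):
    """Average each sample's rank across detectors (ties ignored in this
    sketch); a higher average rank means more anomalous."""
    n = len(score_lists[0])
    ranks = [0.0] * n
    for s in score_lists:
        order = sorted(range(n), key=lambda i: s[i])
        for r, i in enumerate(order):
            ranks[i] += r
    return [r / len(score_lists) for r in ranks]
```

Rank-based combination is insensitive to differences in score scale between detectors, which makes it a reasonable default when scores are not normalized.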
## Model Selection Guidelines

### FeatureBagging
- **Best for**: High-dimensional data, when features have varying importance
- **Base detector**: Works well with LOF, KNN, or other distance-based methods
- **Performance**: Usually improves over a single detector, especially in the presence of irrelevant features

### LSCP
- **Best for**: Heterogeneous data, when detector performance varies by region
- **Detector mix**: Combine diverse algorithms (LOF, IForest, OCSVM, etc.)
- **Performance**: Often achieves the best results but requires more computation

### Manual Combination
- **Best for**: When you want full control over the combination strategy
- **Flexibility**: Can implement custom weighting and selection schemes
- **Performance**: Depends on the combination method and detector diversity

## Best Practices

1. **Detector Diversity**: Use complementary algorithms rather than similar ones
2. **Parameter Tuning**: Tune individual detectors before combining
3. **Validation**: Use a validation set to select the best combination method
4. **Computational Cost**: Balance ensemble size against available computational resources
5. **Score Normalization**: Consider normalizing scores before combination
6. **Performance Monitoring**: Track individual detector contributions to the ensemble
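
On score normalization (point 5): different detectors produce scores on very different scales, so averaging raw scores lets one detector dominate. PyOD provides `pyod.utils.utility.standardizer` for this; the stdlib-only sketch below shows the underlying z-score idea (the `zscore` name is illustrative, not PyOD API):

```python
from statistics import mean, pstdev

def zscore(scores):
    """Standardize one detector's scores to zero mean and unit variance.
    Assumes the scores are not all identical (pstdev > 0)."""
    m, s = mean(scores), pstdev(scores)
    return [(x - m) / s for x in scores]

# Two detectors on very different scales but with the same ranking
det_a = [0.1, 0.2, 0.9]        # e.g. an LOF-like scale
det_b = [100.0, 200.0, 900.0]  # same ordering, 1000x the scale

# After standardization the two score lists are nearly identical,
# so neither detector dominates a subsequent average
norm_a, norm_b = zscore(det_a), zscore(det_b)
```

Standardize using statistics of the training scores, then apply the same mean and standard deviation to test scores, so that train and test scores stay comparable.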