# Data Utilities

Comprehensive utilities for data generation, preprocessing, evaluation, and visualization to support the complete outlier detection workflow. These utilities are essential for testing detectors, preparing data, and evaluating results.

## Capabilities

### Data Generation

Generate synthetic datasets with controlled outlier characteristics for testing and benchmarking outlier detection algorithms.

```python { .api }
def generate_data(n_train=200, n_test=100, n_features=2, contamination=0.1,
                  train_only=False, offset=10, random_state=None):
    """
    Generate synthetic dataset with outliers for testing detectors.

    Parameters:
    - n_train (int): Number of training samples
    - n_test (int): Number of test samples
    - n_features (int): Number of features
    - contamination (float): Proportion of outliers in dataset
    - train_only (bool): If True, only return training data
    - offset (int): Offset for outlier generation
    - random_state (int): Random number generator seed

    Returns:
    - X_train (array): Training data of shape (n_train, n_features)
    - X_test (array): Test data of shape (n_test, n_features)
    - y_train (array): Training labels (0: inlier, 1: outlier)
    - y_test (array): Test labels (0: inlier, 1: outlier)
    """
```
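
To make `contamination` and `offset` concrete, the following dependency-free sketch shows one way such a dataset could be produced. The sampling scheme (a tight Gaussian cluster of inliers centered at `offset`, outliers drawn uniformly from `[-offset, offset]`) and the name `make_toy_outlier_data` are assumptions for illustration, not PyOD's actual generator.

```python
import random


def make_toy_outlier_data(n_samples=200, n_features=2, contamination=0.1,
                          offset=10, random_state=None):
    """Hypothetical toy generator; not PyOD's implementation."""
    rng = random.Random(random_state)
    n_outliers = int(n_samples * contamination)
    n_inliers = n_samples - n_outliers

    # Inliers: tight Gaussian cluster centered at `offset` in every feature.
    inliers = [[rng.gauss(offset, 0.3) for _ in range(n_features)]
               for _ in range(n_inliers)]
    # Outliers: spread uniformly over the wider box [-offset, offset].
    outliers = [[rng.uniform(-offset, offset) for _ in range(n_features)]
                for _ in range(n_outliers)]

    X = inliers + outliers
    y = [0] * n_inliers + [1] * n_outliers  # 0: inlier, 1: outlier
    return X, y


X, y = make_toy_outlier_data(n_samples=200, contamination=0.1, random_state=42)
print(len(X), sum(y))  # total samples, number of outliers
```

In this sketch the outlier count is `int(n_samples * contamination)`, so the realized contamination can be slightly below the requested rate when the product is not a whole number.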

Usage example:
```python
from pyod.utils.data import generate_data

# Generate 2D dataset with 10% outliers
X_train, X_test, y_train, y_test = generate_data(
    n_train=500, n_test=200, n_features=2,
    contamination=0.1, random_state=42
)

# Generate high-dimensional dataset
X_train, X_test, y_train, y_test = generate_data(
    n_train=1000, n_test=300, n_features=20,
    contamination=0.05, random_state=123
)
```

### Evaluation Functions

Comprehensive evaluation metrics specifically designed for outlier detection tasks.

```python { .api }
def evaluate_print(clf_name, y, y_scores):
    """
    Print comprehensive evaluation metrics for outlier detection.

    Parameters:
    - clf_name (str): Name of the classifier for display
    - y (array): True binary labels (0: inlier, 1: outlier)
    - y_scores (array): Outlier scores from detector

    Prints:
    - ROC AUC score
    - Precision at rank n (P@n) where n = number of outliers
    - Average precision score
    """
```
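
To show what two of the listed metrics measure, here is a plain-Python sketch of ROC AUC and average precision computed directly from labels and scores. The helper names `roc_auc` and `average_precision` are hypothetical, and the sketch is not how `evaluate_print` computes its output; tied scores are handled only in the simplest way.

```python
def roc_auc(y, scores):
    """Probability that a random outlier outscores a random inlier (ties count half)."""
    pos = [s for s, label in zip(scores, y) if label == 1]
    neg = [s for s, label in zip(scores, y) if label == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y, scores):
    """Mean of the precision values measured at the rank of each true outlier."""
    ranked = sorted(zip(scores, y), key=lambda pair: pair[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)


y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]
print(f"ROC AUC: {roc_auc(y_true, y_scores):.3f}")                      # 0.750
print(f"Average precision: {average_precision(y_true, y_scores):.3f}")  # 0.833
```

Both metrics depend only on the ranking of the scores, which is why raw outlier scores can be evaluated without first choosing a threshold.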

### Data Preprocessing

Standardization utilities for outlier detection workflows.

```python { .api }
def standardizer(X, X_t=None, keep_scalar=False):
    """
    Standardize datasets using z-score scaling (zero mean, unit variance),
    fitted on X.

    Parameters:
    - X (array): Training data to fit scaler
    - X_t (array, optional): Test data to transform (if None, transform X)
    - keep_scalar (bool): Whether to return the fitted scaler

    Returns:
    - X_scaled (array): Scaled training data
    - X_t_scaled (array): Scaled test data (if X_t provided)
    - scalar (object): Fitted scaler (if keep_scalar=True)
    """
```
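
The key point of the signature above is that the scaler is fitted on `X` and then applied to `X_t`. The sketch below illustrates that fit-on-train, apply-to-test pattern, assuming z-score scaling (zero mean and unit variance per feature). The helpers `fit_zscore` and `apply_zscore` are hypothetical names, not part of PyOD.

```python
from statistics import fmean, pstdev


def fit_zscore(X):
    """Return per-feature (means, stds) computed from the training rows."""
    columns = list(zip(*X))
    means = [fmean(col) for col in columns]
    # Constant features get a scale of 1.0 to avoid division by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def apply_zscore(X, means, stds):
    """Scale rows using statistics that were fitted elsewhere."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]


X_train = [[1.0, 10.0], [3.0, 10.0]]
X_test = [[5.0, 12.0]]

means, stds = fit_zscore(X_train)  # statistics come from training data only
X_train_scaled = apply_zscore(X_train, means, stds)
X_test_scaled = apply_zscore(X_test, means, stds)
print(X_train_scaled)  # [[-1.0, 0.0], [1.0, 0.0]]
print(X_test_scaled)   # [[3.0, 2.0]]
```

Reusing the training statistics keeps information about the test set out of the preprocessing step.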

### Score Processing

Utilities for converting and processing outlier scores for different use cases.

```python { .api }
def score_to_label(scores, outliers_fraction=0.1):
    """
    Convert outlier scores to binary labels based on contamination rate.

    Parameters:
    - scores (array): Outlier scores
    - outliers_fraction (float): Expected fraction of outliers

    Returns:
    - labels (array): Binary labels (0: inlier, 1: outlier)
    """

def precision_n_scores(y, y_scores, n=None):
    """
    Calculate precision at rank n for a single detector.

    Parameters:
    - y (array): True binary labels
    - y_scores (array): Outlier scores from one detector
    - n (int): Rank threshold (default: number of outliers in y)

    Returns:
    - precision_at_rank_n (float): Precision@n score
    """

def get_label_n(y, y_scores, n=None):
    """
    Get binary labels by selecting top n highest scores as outliers.

    Parameters:
    - y (array): True binary labels (for determining n if not provided)
    - y_scores (array): Outlier scores
    - n (int): Number of top scores to label as outliers

    Returns:
    - labels (array): Binary labels (0: inlier, 1: outlier)
    """

def argmaxn(value_list, n, order='desc'):
    """
    Get indices of n largest or smallest values.

    Parameters:
    - value_list (array): Input values
    - n (int): Number of indices to return
    - order (str): Sort order ('desc' for largest, 'asc' for smallest)

    Returns:
    - indices (array): Indices of n extreme values
    """

def invert_order(scores, method='multiplication'):
    """
    Invert the order of outlier scores (lower becomes higher).

    Parameters:
    - scores (array): Input outlier scores
    - method (str): Inversion method ('multiplication' multiplies by -1,
      'subtraction' computes max(scores) - scores)

    Returns:
    - inverted_scores (array): Inverted outlier scores
    """
```
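
The rank-based operations above can be illustrated with a short plain-Python sketch. The helpers `top_n_labels`, `precision_at_n`, and `invert_scores` are hypothetical names that mirror the ideas behind the functions in this section; PyOD's own implementations may treat ties and thresholds differently.

```python
def top_n_labels(scores, n):
    """Label the n highest-scoring samples as outliers (1), the rest as inliers (0)."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = set(ranked[:n])
    return [1 if i in top else 0 for i in range(len(scores))]


def precision_at_n(y, scores, n=None):
    """Fraction of the top-n scored samples that are true outliers."""
    if n is None:
        n = sum(y)  # default: number of true outliers
    predicted = top_n_labels(scores, n)
    hits = sum(1 for truth, pred in zip(y, predicted) if truth == 1 and pred == 1)
    return hits / n


def invert_scores(scores):
    """Flip the ranking so that the smallest score becomes the largest."""
    return [-s for s in scores]


y_true = [0, 0, 1, 0, 1, 0]
y_scores = [0.1, 0.9, 0.8, 0.2, 0.3, 0.4]

print(top_n_labels(y_scores, n=2))       # [0, 1, 1, 0, 0, 0]
print(precision_at_n(y_true, y_scores))  # 0.5
print(invert_scores([1.0, 3.0, 2.0]))    # [-1.0, -3.0, -2.0]
```

Inverting scores is useful when a model reports "normality" (higher means more typical) but downstream code expects higher values to mean more anomalous.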

### Visualization

Visualization utilities for 2D datasets and outlier detection results.

```python { .api }
def visualize(clf_name, X_train, y_train, X_test, y_test,
              y_train_pred, y_test_pred, show_figure=True, save_figure=False):
    """
    Visualize outlier detection results for 2D datasets.

    Parameters:
    - clf_name (str): Name of the classifier for plot title
    - X_train (array): Training data (must be 2D)
    - y_train (array): True training labels
    - X_test (array): Test data (must be 2D)
    - y_test (array): True test labels
    - y_train_pred (array): Predicted training labels
    - y_test_pred (array): Predicted test labels
    - show_figure (bool): Whether to display the plot
    - save_figure (bool): Whether to save the plot to file
    """
```

### Statistical Utilities

Statistical functions and distance computations for outlier detection algorithms.

```python { .api }
def pairwise_distances_no_broadcast(X, Y):
    """
    Compute row-wise Euclidean distances between two matrices. Unlike a
    full pairwise computation, row i of X is compared only with row i of Y.

    Parameters:
    - X (array): First set of points, shape (n_samples, n_features)
    - Y (array): Second set of points, same shape as X

    Returns:
    - distances (array): Row-wise distances of shape (n_samples,)
    """

def wpearsonr(x, y, w):
    """
    Calculate weighted Pearson correlation coefficient.

    Parameters:
    - x (array): First variable
    - y (array): Second variable
    - w (array): Weights for each observation

    Returns:
    - correlation (float): Weighted Pearson correlation
    """

def pearsonr_mat(mat, w=None):
    """
    Calculate Pearson correlation matrix with optional weights.

    Parameters:
    - mat (array): Data matrix
    - w (array, optional): Weights for observations

    Returns:
    - corr_matrix (array): Correlation matrix
    """

def get_optimal_n_bins(X, upper_bound=300):
    """
    Get optimal number of bins for histogram-based methods.

    Parameters:
    - X (array): Input data
    - upper_bound (int): Maximum number of bins

    Returns:
    - n_bins (int): Optimal number of bins
    """

def check_parameter(param, low=float('-inf'), high=float('inf'),
                    param_name='', include_left=False, include_right=False):
    """
    Validate parameter values within specified bounds.

    Parameters:
    - param: Parameter value to check
    - low: Lower bound
    - high: Upper bound
    - param_name (str): Name of parameter for error messages
    - include_left (bool): Whether to include lower bound
    - include_right (bool): Whether to include upper bound

    Raises:
    - ValueError: If parameter is outside valid range
    """
```
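
A few of these utilities are easier to follow with small numbers. The sketch below uses hypothetical helpers (`rowwise_euclidean`, `weighted_pearson`, `in_range`) written in plain Python. It assumes that "no broadcast" means row `i` of `X` is compared only with row `i` of `Y`, and that `include_left` / `include_right` switch the bounds from exclusive to inclusive; it is not PyOD's code.

```python
import math


def rowwise_euclidean(X, Y):
    """Distance between row i of X and row i of Y; returns one value per row."""
    if len(X) != len(Y):
        raise ValueError("X and Y must have the same number of rows")
    return [math.dist(x_row, y_row) for x_row, y_row in zip(X, Y)]


def weighted_pearson(x, y, w):
    """Pearson correlation where observation i contributes with weight w[i]."""
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov_xy = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y))
    return cov_xy / math.sqrt(var_x * var_y)


def in_range(value, low, high, include_left=False, include_right=False):
    """Check low < value < high, optionally allowing equality at either end."""
    above = value >= low if include_left else value > low
    below = value <= high if include_right else value < high
    return above and below


print(rowwise_euclidean([[0, 0], [1, 1]], [[3, 4], [1, 1]]))  # [5.0, 0.0]
print(weighted_pearson([1, 2, 3], [2, 4, 6], [1, 1, 1]))      # 1.0
print(in_range(0.5, 0, 0.5))                                  # False
print(in_range(0.5, 0, 0.5, include_right=True))              # True
```

The last two lines show why the inclusion flags matter for a parameter such as a contamination rate with an upper limit of 0.5: whether exactly 0.5 is accepted depends on `include_right`.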

### PyTorch Utilities

Specialized utilities for deep learning models using PyTorch framework.

```python { .api }
# Neural network components and utilities for deep learning models
# Available in pyod.utils.torch_utility module

class TorchModel:
    """Base class for PyTorch-based outlier detection models"""

class InnerAutoencoder:
    """Autoencoder architecture for deep anomaly detection"""

class VAE_Encoder:
    """Variational autoencoder encoder network"""

class VAE_Decoder:
    """Variational autoencoder decoder network"""
```

## Usage Patterns

### Complete Workflow Example

```python
from pyod.models.lof import LOF
from pyod.models.iforest import IForest
from pyod.utils.data import generate_data, evaluate_print
from pyod.utils.utility import standardizer, precision_n_scores
from pyod.utils.example import visualize

# 1. Generate synthetic data
X_train, X_test, y_train, y_test = generate_data(
    n_train=400, n_test=150, n_features=2,
    contamination=0.1, random_state=42
)

# 2. Preprocess data (scaler is fitted on the training set only)
X_train_scaled, X_test_scaled = standardizer(X_train, X_test)

# 3. Train multiple detectors
lof = LOF(contamination=0.1)
iforest = IForest(contamination=0.1)

lof.fit(X_train_scaled)
iforest.fit(X_train_scaled)

# 4. Get predictions (scores are continuous, predictions are 0/1 labels)
lof_scores = lof.decision_function(X_test_scaled)
lof_pred = lof.predict(X_test_scaled)

iforest_scores = iforest.decision_function(X_test_scaled)
iforest_pred = iforest.predict(X_test_scaled)

# 5. Evaluate results
evaluate_print('LOF', y_test, lof_scores)
evaluate_print('IForest', y_test, iforest_scores)

# 6. Compare precision@n (computed for one detector's scores at a time)
lof_p_at_n = precision_n_scores(y_test, lof_scores)
iforest_p_at_n = precision_n_scores(y_test, iforest_scores)
print(f"Precision@n - LOF: {lof_p_at_n:.3f}, IForest: {iforest_p_at_n:.3f}")

# 7. Visualize results (for 2D data); arguments are grouped as (X, y) pairs
visualize('LOF', X_train, y_train, X_test, y_test,
          lof.labels_, lof_pred, show_figure=True)
```

### Batch Evaluation

```python
from pyod.models.lof import LOF
from pyod.models.iforest import IForest
from pyod.models.ocsvm import OCSVM
from pyod.utils.data import generate_data, evaluate_print

# Generate test datasets with different characteristics
datasets = []
for contamination in [0.05, 0.1, 0.2]:
    for n_features in [2, 5, 10]:
        X_train, X_test, y_train, y_test = generate_data(
            n_train=500, n_test=200, n_features=n_features,
            contamination=contamination, random_state=42
        )
        datasets.append((X_train, X_test, y_train, y_test,
                         f"cont_{contamination}_feat_{n_features}"))

# Test multiple detectors
detectors = [
    ('LOF', LOF()),
    ('IForest', IForest()),
    ('OCSVM', OCSVM())
]

# Evaluate all combinations
for X_train, X_test, y_train, y_test, dataset_name in datasets:
    print(f"\nDataset: {dataset_name}")
    for detector_name, detector in detectors:
        detector.fit(X_train)
        scores = detector.decision_function(X_test)
        evaluate_print(f"{detector_name}", y_test, scores)
```

## Best Practices

### Data Generation
- Use consistent random seeds for reproducible experiments
- Match contamination rate between training and test sets
- Consider different outlier patterns (clustered, scattered, etc.)
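
The first point above can be checked directly: a generator seeded with the same value reproduces the same samples. This is a minimal standard-library sketch; `sample_points` is a hypothetical helper, and the same idea applies when passing `random_state` to a data generator.

```python
import random


def sample_points(n, seed):
    """Draw n Gaussian values from a generator seeded with `seed`."""
    rng = random.Random(seed)  # local generator; global random state is untouched
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


run_a = sample_points(5, seed=42)
run_b = sample_points(5, seed=42)
run_c = sample_points(5, seed=7)

print(run_a == run_b)  # True: same seed, same data
print(run_a == run_c)  # False: different seed, different data
```

Using a local generator instead of the global one keeps an experiment reproducible even when other code also draws random numbers.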

### Preprocessing
- Standardize features for distance-based methods
- Consider feature scaling impact on tree-based methods
- Handle categorical variables appropriately
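
To see why scaling matters for distance-based methods, consider a toy case where one feature has a much larger numeric range than the other. The points and the `feature_scales` values below are made up for illustration, and dividing by an assumed per-feature scale stands in for whatever standardization is actually used.

```python
import math

# Hypothetical points: feature 0 spans roughly [0, 1], feature 1 roughly [0, 1000].
points = {"a": (0.0, 0.0), "b": (0.9, 20.0), "c": (0.1, 100.0)}
feature_scales = (1.0, 1000.0)  # assumed typical range of each feature


def nearest_to_a(pts):
    """Name of the point closest to 'a' under Euclidean distance."""
    others = {name: p for name, p in pts.items() if name != "a"}
    return min(others, key=lambda name: math.dist(pts["a"], others[name]))


scaled = {name: tuple(v / s for v, s in zip(p, feature_scales))
          for name, p in points.items()}

print(nearest_to_a(points))  # b: the large-range feature dominates the distance
print(nearest_to_a(scaled))  # c: both features now contribute comparably
```

The nearest neighbor of `a` changes after scaling, which is exactly the kind of shift that alters the output of neighbor-based detectors such as LOF.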

### Evaluation
- Use multiple metrics (ROC-AUC, Precision@n, Average Precision)
- Consider class imbalance in evaluation metrics
- Validate on multiple datasets with different characteristics
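
Class imbalance is the reason a single metric can mislead. In the made-up example below, a detector that never flags anything still reaches 95% accuracy while finding none of the outliers; `accuracy` and `recall` are small hypothetical helpers written for this sketch.

```python
def accuracy(y, pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(1 for t, p in zip(y, pred) if t == p) / len(y)


def recall(y, pred):
    """Fraction of true outliers that were flagged."""
    flagged = sum(1 for t, p in zip(y, pred) if t == 1 and p == 1)
    return flagged / sum(y)


# 95 inliers and 5 outliers, a 5% contamination rate.
y_true = [0] * 95 + [1] * 5
always_inlier = [0] * 100  # a useless detector that never flags anything

print(accuracy(y_true, always_inlier))  # 0.95
print(recall(y_true, always_inlier))    # 0.0
```

Ranking-based metrics such as ROC-AUC and Precision@n avoid this trap because they reward placing true outliers near the top of the score ordering.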

### Visualization
- Use visualization primarily for 2D data and method demonstration
- Consider dimensionality reduction for high-dimensional visualization
- Include both training and test data in visualizations for complete picture