Tessl Tile for pypi/pyod@2.0.0

# Data Utilities

Utilities for data generation, preprocessing, evaluation, and visualization that cover the full outlier detection workflow: testing detectors, preparing data, and evaluating results.

## Capabilities

### Data Generation

Generate synthetic datasets with a controlled proportion of outliers for testing and benchmarking outlier detection algorithms.

```python { .api }
def generate_data(n_train=1000, n_test=500, n_features=2, contamination=0.1,
                  train_only=False, offset=10, random_state=None):
    """
    Generate synthetic dataset with outliers for testing detectors.

    Parameters:
    - n_train (int): Number of training samples
    - n_test (int): Number of test samples
    - n_features (int): Number of features
    - contamination (float): Proportion of outliers in dataset
    - train_only (bool): If True, return only X_train and y_train
    - offset (int): Offset for outlier generation
    - random_state (int): Random number generator seed

    Returns:
    - X_train (array): Training data of shape (n_train, n_features)
    - X_test (array): Test data of shape (n_test, n_features)
    - y_train (array): Training labels (0: inlier, 1: outlier)
    - y_test (array): Test labels (0: inlier, 1: outlier)
    """
```

Usage example:
```python
from pyod.utils.data import generate_data

# Generate 2D dataset with 10% outliers
X_train, X_test, y_train, y_test = generate_data(
    n_train=500, n_test=200, n_features=2,
    contamination=0.1, random_state=42
)

# Generate high-dimensional dataset
X_train, X_test, y_train, y_test = generate_data(
    n_train=1000, n_test=300, n_features=20,
    contamination=0.05, random_state=123
)
```
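To make the role of `contamination` concrete, here is a toy generator that uses only the standard library. It is a sketch and not PyOD's implementation: the Gaussian inliers, the uniformly scattered outliers, and the names `toy_generate` and `spread` are assumptions chosen for illustration. It shows how a contamination rate turns into an outlier count and a label vector.

```python
import random


def toy_generate(n_samples=500, n_features=2, contamination=0.1,
                 spread=8.0, seed=42):
    """Toy stand-in for a contaminated dataset (hypothetical, not PyOD's code).

    Inliers are drawn from a standard Gaussian; outliers are drawn uniformly
    from the wider box [-spread, spread] in each dimension.
    """
    rng = random.Random(seed)
    n_outliers = round(n_samples * contamination)
    n_inliers = n_samples - n_outliers

    inliers = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
               for _ in range(n_inliers)]
    outliers = [[rng.uniform(-spread, spread) for _ in range(n_features)]
                for _ in range(n_outliers)]

    X = inliers + outliers
    y = [0] * n_inliers + [1] * n_outliers  # 0: inlier, 1: outlier
    return X, y


X, y = toy_generate(n_samples=500, contamination=0.1)
print(len(X), sum(y))  # 500 samples, 50 of them labeled as outliers
```

Swapping the uniform draw for a second, shifted Gaussian would give clustered rather than scattered outliers, which is one way to vary the outlier pattern between experiments.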

### Evaluation Functions

Evaluation helper that reports ranking metrics commonly used for outlier detection.

```python { .api }
def evaluate_print(clf_name, y, y_pred):
    """
    Print ROC AUC and precision at rank n for a detector's outlier scores.

    Parameters:
    - clf_name (str): Name of the classifier for display
    - y (array): True binary labels (0: inlier, 1: outlier)
    - y_pred (array): Raw outlier scores from the detector (e.g. decision_function output)

    Prints:
    - ROC AUC score
    - Precision at rank n (P@n) where n = number of outliers
    """
```
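The two ranking metrics, ROC AUC and precision at rank n, can be sketched in plain Python to show what they measure. This is an illustration rather than the library's code: the helper names `roc_auc` and `precision_at_n` are hypothetical, ROC AUC is computed by brute-force pairwise comparison, and ties at the rank cutoff are broken by sort order.

```python
def roc_auc(y_true, scores):
    """Probability that a random outlier scores higher than a random inlier."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def precision_at_n(y_true, scores, n=None):
    """Fraction of true outliers among the n highest-scoring samples."""
    if n is None:
        n = sum(y_true)  # default: n = number of true outliers
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = ranked[:n]
    return sum(y_true[i] for i in top) / n


y_true = [0, 0, 0, 1, 0, 1]
scores = [0.1, 0.2, 0.9, 0.8, 0.3, 0.7]

print(roc_auc(y_true, scores))         # 0.75
print(precision_at_n(y_true, scores))  # 0.5
```

Both metrics depend only on the ordering of the scores, which is why raw `decision_function` output is passed in rather than thresholded 0/1 predictions.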

### Data Preprocessing

Standardization utility for putting features on a comparable scale before fitting detectors.

```python { .api }
def standardizer(X, X_t=None, keep_scalar=False):
    """
    Standardize datasets with Z-normalization (zero mean, unit variance).

    Parameters:
    - X (array): Training data used to fit the scaler
    - X_t (array, optional): Test data to transform with the scaler fitted on X
      (if None, only X is transformed)
    - keep_scalar (bool): Whether to return the fitted scaler

    Returns:
    - X_scaled (array): Scaled training data
    - X_t_scaled (array): Scaled test data (if X_t provided)
    - scalar (object): Fitted scaler (if keep_scalar=True)
    """
```
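The fit-on-train, apply-to-test pattern behind this function can be sketched without any dependencies. The helpers `fit_zscore` and `apply_zscore` are hypothetical names, and the fallback of 1.0 for constant features is an assumption of this sketch, not documented library behavior.

```python
from statistics import fmean, pstdev


def fit_zscore(X):
    """Return per-feature (mean, std) computed from the training rows."""
    columns = list(zip(*X))
    means = [fmean(col) for col in columns]
    # Population standard deviation; fall back to 1.0 for constant features
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def apply_zscore(X, means, stds):
    """Scale rows with statistics that were fitted elsewhere."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]


X_train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
X_test = [[2.0, 40.0]]

means, stds = fit_zscore(X_train)
X_train_scaled = apply_zscore(X_train, means, stds)
X_test_scaled = apply_zscore(X_test, means, stds)  # reuses training statistics

print(means)  # [2.0, 20.0]
```

Reusing the training statistics for the test set keeps information about the test distribution out of the fitted model.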

### Score Processing

Utilities for converting, ranking, and inverting outlier scores.

```python { .api }
def score_to_label(scores, outliers_fraction=0.1):
    """
    Convert outlier scores to binary labels based on contamination rate.

    Parameters:
    - scores (array): Outlier scores
    - outliers_fraction (float): Expected fraction of outliers

    Returns:
    - labels (array): Binary labels (0: inlier, 1: outlier)
    """

def precision_n_scores(y, y_pred, n=None):
    """
    Calculate precision at rank n for a single detector.

    Parameters:
    - y (array): True binary labels
    - y_pred (array): Outlier scores from one detector
    - n (int): Rank threshold (default: number of outliers in y)

    Returns:
    - precision_at_rank_n (float): Precision@n score
    """

def get_label_n(y, y_pred, n=None):
    """
    Get binary labels by selecting top n highest scores as outliers.

    Parameters:
    - y (array): True binary labels (for determining n if not provided)
    - y_pred (array): Outlier scores
    - n (int): Number of top scores to label as outliers

    Returns:
    - labels (array): Binary labels (0: inlier, 1: outlier)
    """

def argmaxn(value_list, n, order='desc'):
    """
    Get indices of n largest or smallest values.

    Parameters:
    - value_list (array): Input values
    - n (int): Number of indices to return
    - order (str): Sort order ('desc' for largest, 'asc' for smallest)

    Returns:
    - indices (array): Indices of n extreme values
    """

def invert_order(scores, method='multiplication'):
    """
    Invert the order of outlier scores (lower becomes higher).

    Parameters:
    - scores (array): Input outlier scores
    - method (str): Inversion method ('multiplication': multiply by -1,
      'subtraction': max(scores) - scores)

    Returns:
    - inverted_scores (array): Inverted outlier scores
    """
```
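The ideas behind these helpers (flagging the top fraction of scores, flipping a ranking, and picking extreme indices) can be sketched in plain Python. The function names below are hypothetical stand-ins, not the PyOD functions themselves, and tie handling at the cutoff may differ from the library, which thresholds on a percentile.

```python
def top_fraction_labels(scores, outliers_fraction=0.1):
    """Label the highest-scoring fraction of samples as outliers (1)."""
    n_outliers = round(len(scores) * outliers_fraction)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    flagged = set(ranked[:n_outliers])
    return [1 if i in flagged else 0 for i in range(len(scores))]


def flip_scores(scores, method="multiplication"):
    """Reverse the ranking so the smallest score becomes the largest."""
    if method == "multiplication":
        return [-s for s in scores]
    if method == "subtraction":
        top = max(scores)
        return [top - s for s in scores]
    raise ValueError(f"unknown method: {method}")


def top_n_indices(values, n, order="desc"):
    """Indices of the n largest ('desc') or n smallest ('asc') values."""
    ranked = sorted(range(len(values)), key=lambda i: values[i],
                    reverse=(order == "desc"))
    return ranked[:n]


scores = [0.2, 0.9, 0.1, 0.4, 0.8, 0.3, 0.5, 0.6, 0.7, 0.0]

print(top_fraction_labels(scores, 0.2))       # [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]
print(top_n_indices(scores, 2))               # [1, 4]
print(flip_scores([1, 2, 3], "subtraction"))  # [2, 1, 0]
```

Flipping is useful when combining detectors whose scores point in opposite directions, for example one that returns a similarity and another that returns a distance.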

### Visualization

Visualization utility for 2D datasets and outlier detection results.

```python { .api }
def visualize(clf_name, X_train, y_train, X_test, y_test,
              y_train_pred, y_test_pred, show_figure=True, save_figure=False):
    """
    Visualize outlier detection results for 2D datasets.

    Parameters:
    - clf_name (str): Name of the classifier for plot title
    - X_train (array): Training data (must have exactly 2 features)
    - y_train (array): True training labels
    - X_test (array): Test data (must have exactly 2 features)
    - y_test (array): True test labels
    - y_train_pred (array): Predicted training labels
    - y_test_pred (array): Predicted test labels
    - show_figure (bool): Whether to display the plot
    - save_figure (bool): Whether to save the plot to file
    """
```

### Statistical Utilities

Statistical functions and distance computations for outlier detection algorithms.

```python { .api }
def pairwise_distances_no_broadcast(X, Y):
    """
    Compute row-wise Euclidean distances between two matrices of the same
    shape, without broadcasting to a full pairwise matrix.

    Parameters:
    - X (array): First set of points, shape (n_samples, n_features)
    - Y (array): Second set of points, same shape as X

    Returns:
    - distances (array): Distance between X[i] and Y[i], shape (n_samples,)
    """

def wpearsonr(x, y, w=None):
    """
    Calculate weighted Pearson correlation coefficient.

    Parameters:
    - x (array): First variable
    - y (array): Second variable
    - w (array, optional): Weights for each observation

    Returns:
    - correlation (float): Weighted Pearson correlation
    """

def pearsonr_mat(mat, w=None):
    """
    Calculate Pearson correlation matrix with optional weights.

    Parameters:
    - mat (array): Data matrix
    - w (array, optional): Weights for observations

    Returns:
    - corr_matrix (array): Correlation matrix
    """

def get_optimal_n_bins(X, upper_bound=300):
    """
    Get optimal number of bins for histogram-based methods.

    Parameters:
    - X (array): Input data
    - upper_bound (int): Maximum number of bins

    Returns:
    - n_bins (int): Optimal number of bins
    """

def check_parameter(param, low=float('-inf'), high=float('inf'),
                    param_name='', include_left=False, include_right=False):
    """
    Validate parameter values within specified bounds.

    Parameters:
    - param: Parameter value to check
    - low: Lower bound
    - high: Upper bound
    - param_name (str): Name of parameter for error messages
    - include_left (bool): Whether to include lower bound
    - include_right (bool): Whether to include upper bound

    Raises:
    - ValueError: If parameter is outside valid range
    """
```
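Two of these computations are easy to misread, so here is a plain-Python sketch of each. A distance computed "without broadcasting" pairs row i of one matrix with row i of the other, and a weighted Pearson correlation lets each observation contribute in proportion to its weight. The names `rowwise_distances` and `weighted_pearson` are hypothetical, and the code illustrates the idea rather than the library's implementation.

```python
import math


def rowwise_distances(X, Y):
    """Euclidean distance between X[i] and Y[i] for each row i.

    Returns len(X) values, not a len(X) x len(Y) matrix.
    """
    if len(X) != len(Y):
        raise ValueError("X and Y must have the same number of rows")
    return [math.dist(x, y) for x, y in zip(X, Y)]


def weighted_pearson(x, y, w):
    """Pearson correlation where observation i contributes with weight w[i]."""
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mean_x) * (yi - mean_y)
              for wi, xi, yi in zip(w, x, y))
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y))
    return cov / math.sqrt(var_x * var_y)


X = [[0.0, 0.0], [1.0, 1.0]]
Y = [[3.0, 4.0], [1.0, 1.0]]
print(rowwise_distances(X, Y))  # [5.0, 0.0]

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
# A perfectly linear relation gives a correlation of 1 (up to rounding)
print(weighted_pearson(x, y, [1.0, 1.0, 2.0, 2.0]))
```

The row-wise form needs memory proportional to the number of rows, whereas a full pairwise matrix grows with the product of the two row counts.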

### PyTorch Utilities

Utilities for deep learning models built on the PyTorch framework.

```python { .api }
# Neural network components and utilities for deep learning models
# Available in pyod.utils.torch_utility module

class TorchModel:
    """Base class for PyTorch-based outlier detection models"""

class InnerAutoencoder:
    """Autoencoder architecture for deep anomaly detection"""

class VAE_Encoder:
    """Variational autoencoder encoder network"""

class VAE_Decoder:
    """Variational autoencoder decoder network"""
```

## Usage Patterns

### Complete Workflow Example

```python
from pyod.models.lof import LOF
from pyod.models.iforest import IForest
from pyod.utils.data import generate_data, evaluate_print
from pyod.utils.utility import standardizer, precision_n_scores
from pyod.utils.example import visualize

# 1. Generate synthetic data
X_train, X_test, y_train, y_test = generate_data(
    n_train=400, n_test=150, n_features=2,
    contamination=0.1, random_state=42
)

# 2. Preprocess data (scaler is fitted on X_train and applied to X_test)
X_train_scaled, X_test_scaled = standardizer(X_train, X_test)

# 3. Train multiple detectors
lof = LOF(contamination=0.1)
iforest = IForest(contamination=0.1)

lof.fit(X_train_scaled)
iforest.fit(X_train_scaled)

# 4. Get predictions (raw scores and binary labels)
lof_scores = lof.decision_function(X_test_scaled)
lof_pred = lof.predict(X_test_scaled)

iforest_scores = iforest.decision_function(X_test_scaled)
iforest_pred = iforest.predict(X_test_scaled)

# 5. Evaluate results
evaluate_print('LOF', y_test, lof_scores)
evaluate_print('IForest', y_test, iforest_scores)

# 6. Compare precision@n (one score array per call)
lof_prn = precision_n_scores(y_test, lof_scores)
iforest_prn = precision_n_scores(y_test, iforest_scores)
print(f"Precision@n - LOF: {lof_prn:.3f}, IForest: {iforest_prn:.3f}")

# 7. Visualize results (for 2D data); note the X_train, y_train, X_test, y_test order
visualize('LOF', X_train, y_train, X_test, y_test,
          lof.labels_, lof_pred, show_figure=True)
```

### Batch Evaluation

```python
from pyod.models.lof import LOF
from pyod.models.iforest import IForest
from pyod.models.ocsvm import OCSVM
from pyod.utils.data import generate_data, evaluate_print

# Generate test datasets with different characteristics
datasets = []
for contamination in [0.05, 0.1, 0.2]:
    for n_features in [2, 5, 10]:
        X_train, X_test, y_train, y_test = generate_data(
            n_train=500, n_test=200, n_features=n_features,
            contamination=contamination, random_state=42
        )
        datasets.append((X_train, X_test, y_train, y_test,
                         f"cont_{contamination}_feat_{n_features}"))

# Test multiple detectors
detectors = [
    ('LOF', LOF()),
    ('IForest', IForest()),
    ('OCSVM', OCSVM())
]

# Evaluate all combinations
for X_train, X_test, y_train, y_test, dataset_name in datasets:
    print(f"\nDataset: {dataset_name}")
    for detector_name, detector in detectors:
        detector.fit(X_train)
        scores = detector.decision_function(X_test)
        evaluate_print(f"{detector_name}", y_test, scores)
```

## Best Practices

### Data Generation
- Use consistent random seeds for reproducible experiments
- Match contamination rate between training and test sets
- Consider different outlier patterns (clustered, scattered, etc.)

### Preprocessing
- Standardize features for distance-based methods
- Consider feature scaling impact on tree-based methods
- Handle categorical variables appropriately

### Evaluation
- Use multiple metrics (ROC-AUC, Precision@n, Average Precision)
- Consider class imbalance in evaluation metrics
- Validate on multiple datasets with different characteristics

### Visualization
- Use visualization primarily for 2D data and method demonstration
- Consider dimensionality reduction for high-dimensional visualization
- Include both training and test data in visualizations for complete picture
