# Data Utilities

Comprehensive utilities for data generation, preprocessing, evaluation, and visualization to support the complete outlier detection workflow. These utilities are essential for testing detectors, preparing data, and evaluating results.

## Capabilities

### Data Generation

Generate synthetic datasets with controlled outlier characteristics for testing and benchmarking outlier detection algorithms.

```python { .api }
def generate_data(n_train=200, n_test=100, n_features=2, contamination=0.1,
                  train_only=False, offset=10, random_state=None):
    """
    Generate synthetic dataset with outliers for testing detectors.

    Parameters:
    - n_train (int): Number of training samples
    - n_test (int): Number of test samples
    - n_features (int): Number of features
    - contamination (float): Proportion of outliers in dataset
    - train_only (bool): If True, only return training data
    - offset (int): Offset for outlier generation
    - random_state (int): Random number generator seed

    Returns:
    - X_train (array): Training data of shape (n_train, n_features)
    - X_test (array): Test data of shape (n_test, n_features)
    - y_train (array): Training labels (0: inlier, 1: outlier)
    - y_test (array): Test labels (0: inlier, 1: outlier)
    """
```

Usage example:

```python
from pyod.utils.data import generate_data

# Generate 2D dataset with 10% outliers
X_train, X_test, y_train, y_test = generate_data(
    n_train=500, n_test=200, n_features=2,
    contamination=0.1, random_state=42
)

# Generate high-dimensional dataset
X_train, X_test, y_train, y_test = generate_data(
    n_train=1000, n_test=300, n_features=20,
    contamination=0.05, random_state=123
)
```
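
To make the roles of `contamination`, `offset`, and the random seed concrete, the following dependency-free sketch builds a labeled toy dataset. It is an illustration only and is not PyOD's generation algorithm: the helper name `make_toy_data`, the Gaussian inlier cluster, and the uniform outlier box are all assumptions chosen for simplicity.

```python
import random


def make_toy_data(n_samples=200, n_features=2, contamination=0.1,
                  offset=10, seed=None):
    """Return (X, y): Gaussian inliers (label 0) plus uniform outliers (label 1)."""
    rng = random.Random(seed)  # local generator keeps runs reproducible
    n_outliers = round(n_samples * contamination)
    n_inliers = n_samples - n_outliers

    # Inliers: a tight Gaussian cluster centered at `offset` in every dimension
    inliers = [[rng.gauss(offset, 1.0) for _ in range(n_features)]
               for _ in range(n_inliers)]
    # Outliers: spread uniformly over a wide box around the origin
    outliers = [[rng.uniform(-offset, offset) for _ in range(n_features)]
                for _ in range(n_outliers)]

    X = inliers + outliers
    y = [0] * n_inliers + [1] * n_outliers
    return X, y


X, y = make_toy_data(n_samples=200, contamination=0.1, seed=42)
X_again, _ = make_toy_data(n_samples=200, contamination=0.1, seed=42)
print(len(X), sum(y), X == X_again)
```

Fixing the seed yields identical data on every call, which is what makes detector comparisons repeatable.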

### Evaluation Functions

Comprehensive evaluation metrics specifically designed for outlier detection tasks.

```python { .api }
def evaluate_print(clf_name, y, y_scores):
    """
    Print ranking-based evaluation metrics for outlier detection.

    Parameters:
    - clf_name (str): Name of the classifier for display
    - y (array): True binary labels (0: inlier, 1: outlier)
    - y_scores (array): Outlier scores from detector

    Prints:
    - ROC AUC score
    - Precision at rank n (P@n) where n = number of outliers
    """
```
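
The two ranking metrics named above can be sketched without any dependencies. This is a minimal illustration rather than PyOD's implementation: the helper names `precision_at_n` and `roc_auc` are hypothetical, and tied scores may be handled differently by the library.

```python
def precision_at_n(y_true, scores, n=None):
    """Fraction of true outliers among the n highest-scoring samples."""
    if n is None:
        n = sum(y_true)  # default: number of true outliers
    # Indices of the n largest scores
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]
    return sum(y_true[i] for i in top) / n


def roc_auc(y_true, scores):
    """Probability that a random outlier outscores a random inlier (ties count 0.5)."""
    pos = [s for label, s in zip(y_true, scores) if label == 1]
    neg = [s for label, s in zip(y_true, scores) if label == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 0, 1, 0]
scores = [0.1, 0.4, 0.9, 0.2, 0.3, 0.05]
p_at_n = precision_at_n(y_true, scores)
auc = roc_auc(y_true, scores)
print(f"P@n: {p_at_n:.3f}, ROC AUC: {auc:.3f}")
```

Both metrics depend only on the ordering of the scores, so they can be computed from raw `decision_function` output without choosing a threshold.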

### Data Preprocessing

Standardization and normalization utilities optimized for outlier detection workflows.

```python { .api }
def standardizer(X, X_t=None, keep_scalar=False):
    """
    Standardize datasets using z-score normalization (zero mean, unit variance).

    Parameters:
    - X (array): Training data to fit scaler
    - X_t (array, optional): Test data to transform with the scaler fitted on X
    - keep_scalar (bool): Whether to return the fitted scaler

    Returns:
    - X_scaled (array): Scaled training data
    - X_t_scaled (array): Scaled test data (if X_t provided)
    - scalar (object): Fitted scaler (if keep_scalar=True)
    """
```
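
The key point in this step is that the scaling statistics come from the training data only and are then reused for the test data. The sketch below shows that pattern for z-score standardization in plain Python; the helper names `fit_zscore` and `apply_zscore` are hypothetical and are not part of PyOD.

```python
from statistics import fmean, pstdev


def fit_zscore(X):
    """Return per-column (means, stds) computed from the training rows."""
    columns = list(zip(*X))
    means = [fmean(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid division by zero
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def apply_zscore(X, means, stds):
    """Scale every row with previously fitted statistics."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]


X_train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
X_test = [[2.0, 40.0]]

means, stds = fit_zscore(X_train)  # fit on training data only
X_train_scaled = apply_zscore(X_train, means, stds)
X_test_scaled = apply_zscore(X_test, means, stds)  # reuse training statistics
print(X_test_scaled)
```

Fitting a separate scaler on the test set would shift the test points relative to the training distribution and can hide genuine outliers.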

### Score Processing

Utilities for converting and processing outlier scores for different use cases.

```python { .api }
def score_to_label(scores, outliers_fraction=0.1):
    """
    Convert outlier scores to binary labels based on contamination rate.

    Parameters:
    - scores (array): Outlier scores
    - outliers_fraction (float): Expected fraction of outliers

    Returns:
    - labels (array): Binary labels (0: inlier, 1: outlier)
    """

def precision_n_scores(y, y_scores, n=None):
    """
    Calculate precision at rank n for a single detector's scores.

    Parameters:
    - y (array): True binary labels
    - y_scores (array): Outlier scores from one detector
    - n (int): Rank threshold (default: number of outliers in y)

    Returns:
    - precision (float): Precision@n score
    """

def get_label_n(y, y_scores, n=None):
    """
    Get binary labels by selecting top n highest scores as outliers.

    Parameters:
    - y (array): True binary labels (for determining n if not provided)
    - y_scores (array): Outlier scores
    - n (int): Number of top scores to label as outliers

    Returns:
    - labels (array): Binary labels (0: inlier, 1: outlier)
    """

def argmaxn(value_list, n, order='desc'):
    """
    Get indices of n largest or smallest values.

    Parameters:
    - value_list (array): Input values
    - n (int): Number of indices to return
    - order (str): Sort order ('desc' for largest, 'asc' for smallest)

    Returns:
    - indices (array): Indices of n extreme values
    """

def invert_order(scores, method='multiplication'):
    """
    Invert the order of outlier scores (lower becomes higher).

    Parameters:
    - scores (array): Input outlier scores
    - method (str): Inversion method ('multiplication' negates the scores,
      'subtraction' computes max(scores) - scores)

    Returns:
    - inverted_scores (array): Inverted outlier scores
    """
```
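
The score-to-label conversion and order inversion described above can be illustrated with a small standalone sketch. It uses a simple top-k rule and negation; the helper names `label_top_fraction` and `invert_by_negation` are hypothetical, and a library implementation may instead threshold at a percentile, so results at the boundary or with tied scores can differ.

```python
def label_top_fraction(scores, outliers_fraction=0.1):
    """Mark the highest-scoring fraction of samples as outliers (1)."""
    k = round(len(scores) * outliers_fraction)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    labels = [0] * len(scores)
    for i in ranked[:k]:
        labels[i] = 1
    return labels


def invert_by_negation(scores):
    """Flip the ranking so that formerly low scores become the highest."""
    return [-s for s in scores]


scores = [0.2, 0.9, 0.1, 0.4, 0.8, 0.3, 0.05, 0.6, 0.7, 0.5]
labels = label_top_fraction(scores, outliers_fraction=0.2)
inverted_labels = label_top_fraction(invert_by_negation(scores),
                                     outliers_fraction=0.2)
print(labels)
print(inverted_labels)
```

Inversion is useful when a model reports a "normality" score, since the labeling and evaluation utilities assume that higher scores mean more anomalous.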

### Visualization

Visualization utilities for 2D datasets and outlier detection results.

```python { .api }
def visualize(clf_name, X_train, y_train, X_test, y_test,
              y_train_pred, y_test_pred, show_figure=True, save_figure=False):
    """
    Visualize outlier detection results for 2D datasets.

    Parameters:
    - clf_name (str): Name of the classifier for plot title
    - X_train (array): Training data (must have exactly 2 features)
    - y_train (array): True training labels
    - X_test (array): Test data (must have exactly 2 features)
    - y_test (array): True test labels
    - y_train_pred (array): Predicted training labels
    - y_test_pred (array): Predicted test labels
    - show_figure (bool): Whether to display the plot
    - save_figure (bool): Whether to save the plot to file
    """
```

### Statistical Utilities

Statistical functions and distance computations for outlier detection algorithms.

```python { .api }
def pairwise_distances_no_broadcast(X, Y):
    """
    Compute row-wise Euclidean distances between two matrices without
    broadcasting, for memory efficiency.

    Parameters:
    - X (array): First set of points, shape (n_samples, n_features)
    - Y (array): Second set of points, same shape as X

    Returns:
    - distances (array): Distance between X[i] and Y[i], shape (n_samples,)
    """

def wpearsonr(x, y, w):
    """
    Calculate weighted Pearson correlation coefficient.

    Parameters:
    - x (array): First variable
    - y (array): Second variable
    - w (array): Weights for each observation

    Returns:
    - correlation (float): Weighted Pearson correlation
    """

def pearsonr_mat(mat, w=None):
    """
    Calculate Pearson correlation matrix with optional weights.

    Parameters:
    - mat (array): Data matrix
    - w (array, optional): Weights for observations

    Returns:
    - corr_matrix (array): Correlation matrix
    """

def get_optimal_n_bins(X, upper_bound=300):
    """
    Get optimal number of bins for histogram-based methods.

    Parameters:
    - X (array): Input data
    - upper_bound (int): Maximum number of bins

    Returns:
    - n_bins (int): Optimal number of bins
    """

def check_parameter(param, low=float('-inf'), high=float('inf'),
                    param_name='', include_left=False, include_right=False):
    """
    Validate parameter values within specified bounds.

    Parameters:
    - param: Parameter value to check
    - low: Lower bound
    - high: Upper bound
    - param_name (str): Name of parameter for error messages
    - include_left (bool): Whether to include lower bound
    - include_right (bool): Whether to include upper bound

    Raises:
    - ValueError: If parameter is outside valid range
    """
```
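
As a worked illustration of the weighted Pearson correlation listed above, the sketch below computes it from weighted means, variances, and covariance in plain Python. The function name `weighted_pearson` is hypothetical; it shows the standard formula rather than the library's code.

```python
import math


def weighted_pearson(x, y, w):
    """Weighted Pearson correlation: cov_w(x, y) / sqrt(var_w(x) * var_w(y))."""
    total = sum(w)
    # Weighted means
    mx = sum(wi * xi for wi, xi in zip(w, x)) / total
    my = sum(wi * yi for wi, yi in zip(w, y)) / total
    # Weighted covariance and variances
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y)) / total
    var_x = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x)) / total
    var_y = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y)) / total
    return cov / math.sqrt(var_x * var_y)


x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]       # exact linear function of x
w = [1.0, 1.0, 2.0, 2.0]       # later observations count double

r_pos = weighted_pearson(x, y, w)
r_neg = weighted_pearson(x, list(reversed(y)), w)
print(r_pos, r_neg)
```

With equal weights the result reduces to the ordinary Pearson coefficient; unequal weights let more reliable observations dominate the estimate.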

### PyTorch Utilities

Specialized utilities for deep learning models built on the PyTorch framework.

```python { .api }
# Neural network components and utilities for deep learning models
# Available in pyod.utils.torch_utility module

class TorchModel:
    """Base class for PyTorch-based outlier detection models"""

class InnerAutoencoder:
    """Autoencoder architecture for deep anomaly detection"""

class VAE_Encoder:
    """Variational autoencoder encoder network"""

class VAE_Decoder:
    """Variational autoencoder decoder network"""
```

## Usage Patterns

### Complete Workflow Example

```python
from pyod.models.lof import LOF
from pyod.models.iforest import IForest
from pyod.utils.data import generate_data, evaluate_print
from pyod.utils.utility import standardizer, precision_n_scores
from pyod.utils.example import visualize

# 1. Generate synthetic data
X_train, X_test, y_train, y_test = generate_data(
    n_train=400, n_test=150, n_features=2,
    contamination=0.1, random_state=42
)

# 2. Preprocess data (scaler is fitted on the training set only)
X_train_scaled, X_test_scaled = standardizer(X_train, X_test)

# 3. Train multiple detectors
lof = LOF(contamination=0.1)
iforest = IForest(contamination=0.1)

lof.fit(X_train_scaled)
iforest.fit(X_train_scaled)

# 4. Get predictions (raw scores and binary labels)
lof_scores = lof.decision_function(X_test_scaled)
lof_pred = lof.predict(X_test_scaled)

iforest_scores = iforest.decision_function(X_test_scaled)
iforest_pred = iforest.predict(X_test_scaled)

# 5. Evaluate results
evaluate_print('LOF', y_test, lof_scores)
evaluate_print('IForest', y_test, iforest_scores)

# 6. Compare precision@n (computed separately for each detector)
lof_precision = precision_n_scores(y_test, lof_scores)
iforest_precision = precision_n_scores(y_test, iforest_scores)
print(f"Precision@n - LOF: {lof_precision:.3f}, IForest: {iforest_precision:.3f}")

# 7. Visualize results (for 2D data); note the X, y ordering of the arguments
visualize('LOF', X_train, y_train, X_test, y_test,
          lof.labels_, lof_pred, show_figure=True)
```

### Batch Evaluation

```python
from pyod.models.lof import LOF
from pyod.models.iforest import IForest
from pyod.models.ocsvm import OCSVM
from pyod.utils.data import generate_data, evaluate_print

# Generate test datasets with different characteristics
datasets = []
for contamination in [0.05, 0.1, 0.2]:
    for n_features in [2, 5, 10]:
        X_train, X_test, y_train, y_test = generate_data(
            n_train=500, n_test=200, n_features=n_features,
            contamination=contamination, random_state=42
        )
        datasets.append((X_train, X_test, y_train, y_test,
                         f"cont_{contamination}_feat_{n_features}"))

# Test multiple detectors
detectors = [
    ('LOF', LOF()),
    ('IForest', IForest()),
    ('OCSVM', OCSVM())
]

# Evaluate all combinations
for X_train, X_test, y_train, y_test, dataset_name in datasets:
    print(f"\nDataset: {dataset_name}")
    for detector_name, detector in detectors:
        detector.fit(X_train)
        scores = detector.decision_function(X_test)
        evaluate_print(f"{detector_name}", y_test, scores)
```

## Best Practices

### Data Generation

- Use consistent random seeds for reproducible experiments
- Match contamination rate between training and test sets
- Consider different outlier patterns (clustered, scattered, etc.)

### Preprocessing

- Standardize features for distance-based methods
- Consider feature scaling impact on tree-based methods
- Handle categorical variables appropriately

### Evaluation

- Use multiple metrics (ROC-AUC, Precision@n, Average Precision)
- Consider class imbalance in evaluation metrics
- Validate on multiple datasets with different characteristics
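
Average precision is one of the metrics recommended above, and it is more sensitive to class imbalance than ROC-AUC because it focuses on the ranked outliers. The following sketch computes it in plain Python; the helper name `average_precision` is hypothetical, and it assumes no tied scores (library implementations handle ties by grouping equal thresholds).

```python
def average_precision(y_true, scores):
    """Mean of the precision values at each rank where a true outlier appears."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / hits if hits else 0.0


y_true = [0, 0, 1, 0, 1, 0]
scores = [0.1, 0.4, 0.9, 0.2, 0.3, 0.05]
ap = average_precision(y_true, scores)
print(f"Average precision: {ap:.3f}")
```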

### Visualization

- Use visualization primarily for 2D data and method demonstration
- Consider dimensionality reduction for high-dimensional visualization
- Include both training and test data in visualizations for complete picture