# Classical Detection Models

Classical outlier detection algorithms with proven effectiveness across many domains. These methods form the foundation of anomaly detection and are often the first choice in practice because they are interpretable and reliable.

## Capabilities

### Local Outlier Factor (LOF)

Computes the local density deviation of a given data point with respect to its neighbors. Considers as outliers the samples that have a substantially lower density than their neighbors.

```python { .api }
class LOF:
    def __init__(self, n_neighbors=20, algorithm='auto', leaf_size=30,
                 metric='minkowski', p=2, metric_params=None,
                 contamination=0.1, n_jobs=1, novelty=True):
        """
        Parameters:
        - n_neighbors (int): Number of neighbors to consider
        - algorithm (str): Algorithm for nearest neighbors ('auto', 'ball_tree', 'kd_tree', 'brute')
        - leaf_size (int): Leaf size for tree-based algorithms
        - metric (str): Distance metric to use
        - p (float): Parameter for the Minkowski metric
        - contamination (float): Proportion of outliers in dataset
        - n_jobs (int): Number of parallel jobs
        - novelty (bool): Whether to use novelty detection mode
        """
```

Usage example:

```python
from pyod.models.lof import LOF
from pyod.utils.data import generate_data

# Generate a synthetic dataset in which 10% of the samples are outliers
X_train, X_test, y_train, y_test = generate_data(contamination=0.1, random_state=42)

# Fit the detector on the training data, then label the test data
clf = LOF(n_neighbors=20, contamination=0.1)
clf.fit(X_train)
y_pred = clf.predict(X_test)  # 0 = inlier, 1 = outlier
```
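
To make the local-density idea concrete, here is a minimal pure-Python sketch of the LOF score. It is an illustration only, not PyOD's implementation: the helper name `lof_scores` is hypothetical, ties in neighbor distances are ignored, and duplicate points (which would make the reachability sum zero) are not handled.

```python
import math

def lof_scores(points, k=2):
    """Local Outlier Factor per point (hypothetical helper, simplified)."""
    n = len(points)
    # For every point, the k nearest neighbors as (distance, index) pairs.
    neighbors = []
    for i in range(n):
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        neighbors.append(dists[:k])

    # k-distance: distance from each point to its k-th nearest neighbor.
    k_dist = [nb[-1][0] for nb in neighbors]

    def local_reachability_density(i):
        # Reachability distance to neighbor j is max(k_dist[j], d(i, j)).
        reach = [max(k_dist[j], d) for d, j in neighbors[i]]
        return len(reach) / sum(reach)

    lrd = [local_reachability_density(i) for i in range(n)]

    # LOF: mean density of the neighbors divided by the point's own density.
    return [
        sum(lrd[j] for _, j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]

# Four points on a unit square plus one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
for point, score in zip(points, lof_scores(points, k=2)):
    print(point, round(score, 3))
```

Scores close to 1 mean a point is about as dense as its neighbors; scores well above 1 mean it sits in a sparser region than they do.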

### Isolation Forest

Isolates anomalies by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature. Anomalies are more susceptible to isolation and hence have short average path lengths.

```python { .api }
class IForest:
    def __init__(self, n_estimators=100, max_samples='auto', contamination=0.1,
                 max_features=1.0, bootstrap=False, n_jobs=1, random_state=None,
                 verbose=0, behaviour='deprecated'):
        """
        Parameters:
        - n_estimators (int): Number of isolation trees
        - max_samples (int or str): Number of samples to draw for each tree
        - contamination (float): Proportion of outliers in dataset
        - max_features (int or float): Number of features to draw for each tree
        - bootstrap (bool): Whether to use bootstrap sampling
        - n_jobs (int): Number of parallel jobs
        - random_state (int): Random number generator seed
        - verbose (int): Verbosity level
        """
```
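
The isolation idea can be sketched in a few lines of pure Python. This is a simplified illustration rather than the library's algorithm: the helper names `path_length` and `average_path_length` are hypothetical, every "tree" uses the full dataset instead of a subsample, and the path length is not normalized into a score.

```python
import random

def path_length(point, data, rng, depth=0, max_depth=12):
    """Number of random splits needed to isolate `point` within `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    feature = rng.randrange(len(point))      # randomly select a feature
    values = [row[feature] for row in data]
    lo, hi = min(values), max(values)
    if lo == hi:                             # no split possible on this feature
        return depth
    split = rng.uniform(lo, hi)              # random split between min and max
    # Keep only the rows that fall on the same side of the split as `point`.
    side = point[feature] < split
    same_side = [row for row in data if (row[feature] < split) == side]
    return path_length(point, same_side, rng, depth + 1, max_depth)

def average_path_length(point, data, n_trees=200, seed=0):
    """Average isolation depth of `point` over many random split sequences."""
    rng = random.Random(seed)
    return sum(path_length(point, data, rng) for _ in range(n_trees)) / n_trees

# A tight cluster in the unit square plus one far-away point.
rng = random.Random(42)
cluster = [(rng.random(), rng.random()) for _ in range(50)]
outlier = (10.0, 10.0)
data = cluster + [outlier]

print("outlier:", average_path_length(outlier, data))
print("inlier: ", average_path_length(cluster[0], data))
```

The distant point is usually cut off by the first split or two, while a point inside the cluster needs many more splits before it stands alone.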

### One-Class Support Vector Machine (OCSVM)

Finds a hyperplane in the kernel feature space that separates the data from the origin with maximum margin. Points that fall on the origin side of the hyperplane are considered outliers.

```python { .api }
class OCSVM:
    def __init__(self, kernel='rbf', degree=3, gamma='scale', coef0=0.0,
                 tol=1e-3, nu=0.5, shrinking=True, cache_size=200,
                 verbose=False, max_iter=-1, contamination=0.1):
        """
        Parameters:
        - kernel (str): Kernel type ('linear', 'poly', 'rbf', 'sigmoid')
        - degree (int): Degree for polynomial kernel
        - gamma (str or float): Kernel coefficient
        - coef0 (float): Independent term for polynomial/sigmoid kernels
        - tol (float): Tolerance for stopping criterion
        - nu (float): Upper bound on fraction of training errors
        - contamination (float): Proportion of outliers in dataset
        """
```

### k-Nearest Neighbors (KNN)

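Uses the distance to the k-th nearest neighbor as the outlier score (the default `largest` method); the `mean` and `median` methods aggregate the distances to all k neighbors instead. Data points with large distances to their nearest neighbors are considered outliers.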

```python { .api }
class KNN:
    def __init__(self, contamination=0.1, n_neighbors=5, method='largest',
                 radius=1.0, algorithm='auto', leaf_size=30, metric='minkowski',
                 p=2, metric_params=None, n_jobs=1):
        """
        Parameters:
        - contamination (float): Proportion of outliers in dataset
        - n_neighbors (int): Number of neighbors to consider
        - method (str): Method for computing outlier scores ('largest', 'mean', 'median')
        - radius (float): Range of parameter space for radius_neighbors
        - algorithm (str): Algorithm for nearest neighbors
        - metric (str): Distance metric to use
        - n_jobs (int): Number of parallel jobs
        """
```
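
As a rough illustration of how the three `method` options differ, the sketch below scores points by brute-force Euclidean distance in pure Python. The helper name `knn_scores` is hypothetical, and the code is not how the library computes neighbors (which, per the signature above, can use tree-based search).

```python
import math
import statistics

def knn_scores(points, k=2, method="largest"):
    """Outlier score per point from the distances to its k nearest neighbors."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        nearest = dists[:k]
        if method == "largest":      # distance to the k-th nearest neighbor
            scores.append(nearest[-1])
        elif method == "mean":       # average distance to the k neighbors
            scores.append(statistics.fmean(nearest))
        elif method == "median":     # median distance to the k neighbors
            scores.append(statistics.median(nearest))
        else:
            raise ValueError(f"unknown method: {method!r}")
    return scores

# Four points on a unit square plus one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
for method in ("largest", "mean", "median"):
    print(method, [round(s, 3) for s in knn_scores(points, k=2, method=method)])
```

Whichever aggregation is used, the isolated point receives the highest score because all of its neighbors are far away.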

### Principal Component Analysis (PCA)

Uses the sum of weighted projected distances to the eigenvector hyperplanes as outlier scores. Assumes that normal data can be represented in lower dimensional space.

```python { .api }
class PCA:
    def __init__(self, n_components=None, n_selected_components=None, copy=True,
                 whiten=False, svd_solver='auto', tol=0.0, iterated_power='auto',
                 contamination=0.1, random_state=None, weighted=True,
                 standardization=True):
        """
        Parameters:
        - n_components (int): Number of components to keep
        - n_selected_components (int): Number of selected components for outlier detection
        - copy (bool): Whether to copy data
        - whiten (bool): Whether to whiten components
        - svd_solver (str): SVD solver to use
        - contamination (float): Proportion of outliers in dataset
        - weighted (bool): Whether to use weighted PCA
        - standardization (bool): Whether to standardize data
        """
```

### Minimum Covariance Determinant (MCD)

Finds the subset of observations whose empirical covariance has the smallest determinant. Data points far from this "central" subset are considered outliers.

```python { .api }
class MCD:
    def __init__(self, contamination=0.1, store_precision=True,
                 assume_centered=False, support_fraction=None,
                 random_state=None):
        """
        Parameters:
        - contamination (float): Proportion of outliers in dataset
        - store_precision (bool): Whether to store precision matrix
        - assume_centered (bool): Whether data is centered
        - support_fraction (float): Fraction of points to include in support
        - random_state (int): Random number generator seed
        """
```

### Histogram-Based Outlier Score (HBOS)

Constructs histograms for each feature and calculates the outlier score as the inverse of the estimated density. Assumes feature independence but is efficient for large datasets.

```python { .api }
class HBOS:
    def __init__(self, n_bins=10, alpha=0.1, tol=0.5, contamination=0.1):
        """
        Parameters:
        - n_bins (int or str): Number of bins for histogram
        - alpha (float): Regularization parameter
        - tol (float): Tolerance for minimum density
        - contamination (float): Proportion of outliers in dataset
        """
```
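
The per-feature histogram idea can be sketched as follows. This is a simplified illustration under stated assumptions: the helper name `hbos_scores` is hypothetical, bins are equal-width, and the `alpha` and `tol` adjustments from the signature above are omitted, so the numbers will not match the library's output.

```python
import math

def hbos_scores(rows, n_bins=5):
    """Sum of log inverse histogram densities, one histogram per feature."""
    scores = [0.0] * len(rows)
    for f in range(len(rows[0])):
        column = [row[f] for row in rows]
        lo, hi = min(column), max(column)
        width = (hi - lo) / n_bins
        if width == 0:
            continue  # a constant feature carries no information
        # Equal-width bins; the maximum value is clamped into the last bin.
        bins = [min(int((v - lo) / width), n_bins - 1) for v in column]
        counts = [0] * n_bins
        for b in bins:
            counts[b] += 1
        peak = max(counts)
        for i, b in enumerate(bins):
            density = counts[b] / peak        # tallest bin has density 1.0
            scores[i] += math.log(1.0 / density)
    return scores

rows = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.0, 1.05), (5.0, 5.0)]
for row, score in zip(rows, hbos_scores(rows)):
    print(row, round(score, 3))
```

Because each feature contributes its own term to the sum, the score treats features as independent, which is the assumption noted in the description above.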

### Additional Classical Models

```python { .api }
class ABOD:
    """Angle-Based Outlier Detection"""
    def __init__(self, contamination=0.1, n_neighbors=5): ...

class CBLOF:
    """Clustering-Based Local Outlier Factor"""
    def __init__(self, n_clusters=8, contamination=0.1, clustering_estimator=None, **kwargs): ...

class COF:
    """Connectivity-Based Outlier Factor"""
    def __init__(self, contamination=0.1, n_neighbors=20): ...

class GMM:
    """Gaussian Mixture Model for outlier detection"""
    def __init__(self, n_components=1, contamination=0.1, **kwargs): ...

class KDE:
    """Kernel Density Estimation"""
    def __init__(self, contamination=0.1, bandwidth=1.0, algorithm='auto', **kwargs): ...

class MAD:
    """Median Absolute Deviation"""
    def __init__(self, threshold=3.5, contamination=0.1): ...
```
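
Of these, MAD is simple enough to sketch in a few lines. The example below uses the commonly cited modified z-score, `0.6745 * |x - median| / MAD`, with the `threshold=3.5` default shown above. Whether the library computes exactly this formula is an assumption, and the helper name `mad_outliers` is hypothetical. The sketch handles a single feature only.

```python
import statistics

def mad_outliers(values, threshold=3.5):
    """Modified z-scores and 0/1 outlier flags for one-dimensional data."""
    median = statistics.median(values)
    # Median of the absolute deviations from the median.
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; the modified z-score is undefined")
    scores = [0.6745 * abs(v - median) / mad for v in values]
    flags = [int(s > threshold) for s in scores]
    return scores, flags

values = [10, 11, 9, 10, 12, 10, 50]
scores, flags = mad_outliers(values)
for value, score, flag in zip(values, scores, flags):
    print(value, round(score, 3), flag)
```

Because both the center and the spread are medians, a single extreme value barely shifts them, which is what makes this score robust.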

## Usage Patterns

All classical models follow the same usage pattern:

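```python
# 1. Import the model (plus a helper that generates synthetic data)
from pyod.models.lof import LOF
from pyod.utils.data import generate_data

X_train, X_test, y_train, y_test = generate_data(contamination=0.1, random_state=42)

# 2. Initialize with parameters
clf = LOF(n_neighbors=20, contamination=0.1)

# 3. Fit on training data
clf.fit(X_train)

# 4. Access fitted attributes
train_scores = clf.decision_scores_  # raw outlier scores (higher = more abnormal)
train_labels = clf.labels_           # binary labels (0 = inlier, 1 = outlier)
threshold = clf.threshold_           # score cutoff derived from contamination

# 5. Predict on test data
test_labels = clf.predict(X_test)            # binary labels
test_scores = clf.decision_function(X_test)  # raw outlier scores
test_proba = clf.predict_proba(X_test)       # outlier probabilities
```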
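
The relationship between `contamination`, `threshold_`, and `labels_` can be sketched with NumPy alone. The sketch assumes the threshold is the `100 * (1 - contamination)` percentile of the training scores and that samples scoring above it are labeled as outliers; treat that rule as an assumption about the library's behavior. The score values below are made up for illustration.

```python
import numpy as np

contamination = 0.1

# Hypothetical training scores; higher means more abnormal.
scores = np.array([0.10, 0.20, 0.15, 0.30, 0.25, 0.20, 0.10, 0.35, 0.30, 5.00])

# Cutoff chosen so that roughly `contamination` of the samples lie above it.
threshold = np.percentile(scores, 100 * (1 - contamination))

# Samples scoring above the cutoff are flagged as outliers (1), the rest as inliers (0).
labels = (scores > threshold).astype(int)

print("threshold:", threshold)
print("labels:   ", labels)
```

Under this rule, `contamination` does not change the scores themselves, only where the cutoff between inliers and outliers is placed.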

## Model Selection Guidelines

- **LOF**: Good for datasets with varying density regions
- **IForest**: Excellent for high-dimensional data and large datasets
- **OCSVM**: Effective with small to medium datasets, works well with kernels
- **KNN**: Simple and interpretable, good baseline method
- **PCA**: Effective when outliers don't lie in principal component subspace
- **MCD**: Robust for multivariate normal data with outliers
- **HBOS**: Fast for large datasets when features are independent