# Metrics and Model Selection

High-performance implementations of evaluation metrics and model selection utilities with Intel hardware acceleration. These functions speed up model evaluation, distance computations, and data splitting, with the largest gains on large datasets.

## Capabilities

### Ranking Metrics

#### ROC AUC Score

Intel-accelerated computation of the Area Under the ROC Curve for binary and multiclass classification.

```python { .api }
def roc_auc_score(
    y_true,
    y_score,
    average='macro',
    sample_weight=None,
    max_fpr=None,
    multi_class='raise',
    labels=None
):
    """
    Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC).

    Intel-optimized implementation that speeds up scoring on large datasets
    through vectorized operations and efficient curve computation.

    Parameters:
        y_true (array-like): True binary labels or multiclass labels
        y_score (array-like): Target scores (probabilities or decision values)
        average (str or None): Averaging strategy for multiclass ('macro', 'weighted', 'micro');
            None returns one score per class
        sample_weight (array-like): Sample weights
        max_fpr (float): Maximum false positive rate for partial AUC
        multi_class (str): Multiclass strategy ('raise', 'ovr', 'ovo')
        labels (array-like): Labels to include for multiclass problems

    Returns:
        float: Area under ROC curve score (an array of per-class scores when average=None)

    Example:
        >>> from sklearnex.metrics import roc_auc_score
        >>> y_true = [0, 0, 1, 1]
        >>> y_scores = [0.1, 0.4, 0.35, 0.8]
        >>> roc_auc_score(y_true, y_scores)
        0.75
    """
```
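
The binary score in the docstring example can be checked by hand, because ROC AUC equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score. The sketch below is a plain-Python illustration of that definition, not the accelerated implementation; `auc_by_pair_counting` is a hypothetical helper name, and it assumes labels are encoded as 0 and 1.

```python
from itertools import product


def auc_by_pair_counting(y_true, y_score):
    """Binary ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a correct pair.
    Assumes labels are 0 (negative) and 1 (positive).
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined when only one class is present")

    wins = 0.0
    for p, n in product(pos, neg):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]
print(auc_by_pair_counting(y_true, y_scores))  # 0.75
```

Three of the four positive/negative pairs are ordered correctly (only 0.35 vs 0.4 is not), which gives 0.75. The pair-counting loop is quadratic, so it is only suitable for small sanity checks.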

### Distance Metrics

#### Pairwise Distances

Intel-accelerated computation of pairwise distances between samples.

```python { .api }
def pairwise_distances(
    X,
    Y=None,
    metric='euclidean',
    n_jobs=None,
    force_all_finite=True,
    **kwds
):
    """
    Compute pairwise distances between samples.

    Intel-optimized implementation that speeds up computation through vectorized
    distance kernels and efficient memory access patterns.

    Parameters:
        X (array-like): Input samples of shape (n_samples_X, n_features)
        Y (array-like): Second set of samples (n_samples_Y, n_features), optional;
            when omitted, distances are computed between the rows of X
        metric (str or callable): Distance metric to use
        n_jobs (int): Number of parallel jobs
        force_all_finite (bool): Whether to check for finite values
            (newer scikit-learn releases rename this to ensure_all_finite)
        **kwds: Additional parameters for distance metric

    Returns:
        ndarray: Distance matrix of shape (n_samples_X, n_samples_Y)

    Accepted metrics (those without an accelerated path fall back to scikit-learn):
        - 'euclidean': L2 norm distance
        - 'manhattan': L1 norm distance
        - 'cosine': Cosine distance
        - 'minkowski': Minkowski distance
        - 'chebyshev': Chebyshev distance
        - 'hamming': Hamming distance
        - 'jaccard': Jaccard distance
        - callable: Custom distance function

    Example:
        >>> from sklearnex.metrics import pairwise_distances
        >>> import numpy as np
        >>> X = np.array([[0, 1], [1, 0], [2, 2]])
        >>> pairwise_distances(X, metric='euclidean').round(4)
        array([[0.    , 1.4142, 2.2361],
               [1.4142, 0.    , 2.2361],
               [2.2361, 2.2361, 0.    ]])
    """
```
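
For reference, the Euclidean case reduces to the L2 norm of each row difference. The following is a minimal standard-library sketch of that definition, useful for checking small results by hand; `euclidean_distance_matrix` is a hypothetical helper and makes no attempt to reproduce the optimized code path.

```python
import math


def euclidean_distance_matrix(X, Y=None):
    """Return the len(X) x len(Y) matrix of Euclidean (L2) distances as nested lists.

    When Y is omitted, distances are computed between the rows of X.
    """
    if Y is None:
        Y = X
    return [[math.dist(x, y) for y in Y] for x in X]


X = [[0, 1], [1, 0], [2, 2]]
D = euclidean_distance_matrix(X)
for row in D:
    print([round(d, 4) for d in row])
# [0.0, 1.4142, 2.2361]
# [1.4142, 0.0, 2.2361]
# [2.2361, 2.2361, 0.0]
```

The point `[2, 2]` is equally far from `[0, 1]` and `[1, 0]` (both differences have squared length 5), so the last row and column repeat the value 2.2361.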

### Model Selection Utilities

#### Train Test Split

Intel-accelerated data splitting for model validation with optimized random sampling.

```python { .api }
def train_test_split(
    *arrays,
    test_size=None,
    train_size=None,
    random_state=None,
    shuffle=True,
    stratify=None
):
    """
    Split arrays or matrices into random train and test subsets.

    Intel-optimized implementation with efficient random sampling and
    memory-optimized array operations for large datasets.

    Parameters:
        *arrays: Sequence of indexable arrays with same length/shape[0]
        test_size (float or int): Size of test set (0.0-1.0 for proportion, int for absolute count)
        train_size (float or int): Size of train set (same conventions as test_size)
        random_state (int): Controls random number generation for reproducibility
        shuffle (bool): Whether to shuffle data before splitting
        stratify (array-like): If not None, data is split in stratified fashion using these labels

    Returns:
        list: Train and test subsets of each input, in order (length 2 * len(arrays))

    Example:
        >>> from sklearnex.model_selection import train_test_split
        >>> import numpy as np
        >>> X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
        >>> y = np.array([1, 2, 1, 2])
        >>> X_train, X_test, y_train, y_test = train_test_split(
        ...     X, y, test_size=0.5, random_state=42)
        >>> X_train.shape, X_test.shape
        ((2, 2), (2, 2))
    """
```
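
How `test_size` and `train_size` translate into row counts can be sketched as follows. This is an illustration under an assumed rounding convention (fractional test sizes round up, fractional train sizes round down, and a 0.25 test fraction when neither is given), which is my understanding of scikit-learn's behaviour; `split_sizes` is a hypothetical helper, not part of either library.

```python
import math


def split_sizes(n_samples, test_size=None, train_size=None):
    """Return (n_train, n_test) for float fractions or int counts.

    Assumed convention: a fractional test size is rounded up, a fractional
    train size is rounded down, and an unspecified side gets the remainder.
    With neither given, the test fraction defaults to 0.25.
    """
    if test_size is None and train_size is None:
        test_size = 0.25

    n_test = n_train = None
    if test_size is not None:
        n_test = math.ceil(test_size * n_samples) if isinstance(test_size, float) else test_size
    if train_size is not None:
        n_train = math.floor(train_size * n_samples) if isinstance(train_size, float) else train_size

    if n_test is None:
        n_test = n_samples - n_train
    if n_train is None:
        n_train = n_samples - n_test

    if n_train <= 0 or n_test <= 0 or n_train + n_test > n_samples:
        raise ValueError("requested sizes do not fit the number of samples")
    return n_train, n_test


print(split_sizes(4, test_size=0.5))      # (2, 2)
print(split_sizes(10, test_size=0.25))    # (7, 3)
print(split_sizes(1000, train_size=600))  # (600, 400)
```

The first call matches the docstring example above: four rows with `test_size=0.5` give two training and two test rows.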

## Usage Examples

### ROC AUC Score Computation

```python
import numpy as np
from sklearnex.metrics import roc_auc_score
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Binary classification example
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Get prediction probabilities
y_proba = clf.predict_proba(X_test)[:, 1]  # Probabilities for positive class

# Compute ROC AUC
auc_score = roc_auc_score(y_test, y_proba)
print(f"Binary ROC AUC: {auc_score:.3f}")

# Multiclass example
X_multi, y_multi = make_classification(n_samples=1000, n_features=20, n_classes=3,
                                       n_informative=10, random_state=42)
X_train_multi, X_test_multi, y_train_multi, y_test_multi = train_test_split(
    X_multi, y_multi, test_size=0.2, random_state=42)

clf_multi = RandomForestClassifier(n_estimators=100, random_state=42)
clf_multi.fit(X_train_multi, y_train_multi)

# Get prediction probabilities for all classes
y_proba_multi = clf_multi.predict_proba(X_test_multi)

# Compute multiclass ROC AUC with different averaging strategies
auc_macro = roc_auc_score(y_test_multi, y_proba_multi, multi_class='ovr', average='macro')
auc_weighted = roc_auc_score(y_test_multi, y_proba_multi, multi_class='ovr', average='weighted')

print(f"Multiclass ROC AUC (macro): {auc_macro:.3f}")
print(f"Multiclass ROC AUC (weighted): {auc_weighted:.3f}")

# Per-class ROC AUC (average=None returns one score per class)
auc_per_class = roc_auc_score(y_test_multi, y_proba_multi, multi_class='ovr', average=None)
for i, auc in enumerate(auc_per_class):
    print(f"Class {i} ROC AUC: {auc:.3f}")

# One-vs-One strategy
auc_ovo = roc_auc_score(y_test_multi, y_proba_multi, multi_class='ovo', average='macro')
print(f"Multiclass ROC AUC (OvO): {auc_ovo:.3f}")
```
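
The difference between `average='macro'` and `average='weighted'` in the one-vs-rest case comes down to how the per-class scores are combined: macro takes a plain mean, while weighted scales each class score by its share of the true labels. The sketch below illustrates that arithmetic with made-up per-class scores and class counts; `average_scores` is a hypothetical helper, not a library function.

```python
def average_scores(per_class_scores, class_counts, average="macro"):
    """Combine per-class scores into a single number.

    'macro' is the unweighted mean; 'weighted' weights each class by its
    number of true samples (its support).
    """
    if len(per_class_scores) != len(class_counts):
        raise ValueError("scores and counts must have the same length")
    if average == "macro":
        return sum(per_class_scores) / len(per_class_scores)
    if average == "weighted":
        total = sum(class_counts)
        return sum(s * c for s, c in zip(per_class_scores, class_counts)) / total
    raise ValueError(f"unsupported average: {average!r}")


# Illustrative values only: three classes with 60, 30 and 10 test samples.
scores = [0.9, 0.8, 0.6]
counts = [60, 30, 10]
print(round(average_scores(scores, counts, "macro"), 4))     # 0.7667
print(round(average_scores(scores, counts, "weighted"), 4))  # 0.84
```

With imbalanced classes the weighted figure is pulled toward the majority class, so a gap between the two values is a hint that minority classes are scored worse.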

### Pairwise Distance Computations

```python
import time

import numpy as np
from sklearnex.metrics import pairwise_distances
from sklearn.datasets import make_blobs

# Generate sample data
X, _ = make_blobs(n_samples=500, centers=3, n_features=10, random_state=42)
Y = X[:100]  # Subset for pairwise comparison

# Compute various distance metrics
# (X[:5] and Y[:5] are the same rows here, so each minimum is ~0)
metrics = ['euclidean', 'manhattan', 'cosine', 'chebyshev']

for metric in metrics:
    distances = pairwise_distances(X[:5], Y[:5], metric=metric)
    print(f"{metric.capitalize()} distances shape: {distances.shape}")
    print(f"{metric.capitalize()} distance range: [{distances.min():.3f}, {distances.max():.3f}]")

# Self-distance matrix (symmetric)
euclidean_self = pairwise_distances(X[:10], metric='euclidean')
print(f"Self-distance matrix shape: {euclidean_self.shape}")
print(f"Diagonal elements (should be ~0): {np.diag(euclidean_self)}")

# Minkowski distance with different p values
for p in [1, 2, 3]:
    minkowski_dist = pairwise_distances(X[:5], Y[:5], metric='minkowski', p=p)
    print(f"Minkowski distance (p={p}) range: [{minkowski_dist.min():.3f}, {minkowski_dist.max():.3f}]")

# Large dataset performance example
X_large = np.random.randn(2000, 50)
Y_large = np.random.randn(1000, 50)

# perf_counter is a monotonic, high-resolution clock suited to timing
start_time = time.perf_counter()
distances_large = pairwise_distances(X_large, Y_large, metric='euclidean')
computation_time = time.perf_counter() - start_time

print(f"Large dataset distances shape: {distances_large.shape}")
print(f"Computation time: {computation_time:.2f} seconds")

# Memory-efficient chunked computation for very large datasets
def chunked_pairwise_distances(X, Y, chunk_size=1000, metric='euclidean'):
    """Compute pairwise distances in row chunks of X to limit peak memory per call."""
    n_samples_X = X.shape[0]
    distances = []

    for i in range(0, n_samples_X, chunk_size):
        end_idx = min(i + chunk_size, n_samples_X)
        chunk_distances = pairwise_distances(X[i:end_idx], Y, metric=metric)
        distances.append(chunk_distances)

    # Note: stacking still materializes the full (n_samples_X, n_samples_Y) matrix
    return np.vstack(distances)

# Example with chunked computation
X_very_large = np.random.randn(5000, 20)
Y_subset = np.random.randn(500, 20)

chunked_distances = chunked_pairwise_distances(X_very_large, Y_subset, chunk_size=1000)
print(f"Chunked distances shape: {chunked_distances.shape}")
```
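
The Minkowski loop above sweeps `p` because the metric generalizes two of the others: `p=1` is the Manhattan distance and `p=2` is the Euclidean distance, while large `p` approaches the Chebyshev distance. A small standard-library sketch of those definitions (the helper names are hypothetical) makes the relationship concrete:

```python
def minkowski(x, y, p):
    """Minkowski distance of order p >= 1 between two equal-length vectors."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def chebyshev(x, y):
    """Chebyshev distance: the largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))


x, y = [0.0, 3.0], [4.0, 0.0]
print(minkowski(x, y, 1))  # 7.0 (Manhattan: 4 + 3)
print(minkowski(x, y, 2))  # 5.0 (Euclidean: sqrt(16 + 9))
print(chebyshev(x, y))     # 4.0 (largest coordinate difference)

# For large p the Minkowski distance approaches the Chebyshev distance.
print(abs(minkowski(x, y, 50) - chebyshev(x, y)) < 1e-3)  # True
```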

### Train-Test Split Operations

```python
from collections import Counter

import numpy as np
from sklearnex.model_selection import train_test_split
from sklearn.datasets import make_classification, make_regression

# Basic train-test split
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)

# Split with different test sizes
test_sizes = [0.2, 0.3, 0.5]
for test_size in test_sizes:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42
    )
    print(f"Test size {test_size}: Train={X_train.shape[0]}, Test={X_test.shape[0]}")

# Stratified split to preserve class distribution
# (n_informative=5 because make_classification requires
#  n_classes * n_clusters_per_class <= 2 ** n_informative)
X_imbalanced, y_imbalanced = make_classification(
    n_samples=1000, n_features=20, n_classes=3, n_informative=5,
    weights=[0.6, 0.3, 0.1], random_state=42
)

X_train_strat, X_test_strat, y_train_strat, y_test_strat = train_test_split(
    X_imbalanced, y_imbalanced, test_size=0.2, stratify=y_imbalanced, random_state=42
)

# Check class distributions
print("Original distribution:", Counter(y_imbalanced))
print("Train distribution:", Counter(y_train_strat))
print("Test distribution:", Counter(y_test_strat))

# Multiple array splitting
X_reg, y_reg = make_regression(n_samples=800, n_features=15, random_state=42)
sample_weights = np.random.rand(800)
groups = np.random.randint(0, 5, 800)

X_train, X_test, y_train, y_test, weights_train, weights_test, groups_train, groups_test = train_test_split(
    X_reg, y_reg, sample_weights, groups,
    test_size=0.25, random_state=42
)

print("Multiple arrays split:")
print(f"X: {X_train.shape[0]} train, {X_test.shape[0]} test")
print(f"y: {y_train.shape[0]} train, {y_test.shape[0]} test")
print(f"weights: {weights_train.shape[0]} train, {weights_test.shape[0]} test")
print(f"groups: {groups_train.shape[0]} train, {groups_test.shape[0]} test")

# No shuffle option
X_ordered = np.arange(100).reshape(50, 2)
y_ordered = np.arange(50)

X_train_ns, X_test_ns, y_train_ns, y_test_ns = train_test_split(
    X_ordered, y_ordered, test_size=0.2, shuffle=False
)

print("No shuffle - first few train indices:", y_train_ns[:5])
print("No shuffle - first few test indices:", y_test_ns[:5])

# Fixed train size instead of test size
X_train_fixed, X_test_fixed, y_train_fixed, y_test_fixed = train_test_split(
    X, y, train_size=600, random_state=42
)

print(f"Fixed train size: Train={X_train_fixed.shape[0]}, Test={X_test_fixed.shape[0]}")

# Reproducibility check
splits = []
for seed in [42, 42, 42]:  # Same seed
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    splits.append(y_tr[:5])

print("Reproducibility check (should be identical):")
for i, split in enumerate(splits):
    print(f"Split {i+1}: {split}")
```
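
What `stratify=` accomplishes can be sketched in a few lines: group the row indices by class, shuffle each group, and cut every group at the same fraction. The version below is a simplified standard-library illustration, not the algorithm either library uses; `stratified_split_indices` is a hypothetical helper, and rounding each class's test count up is an assumption made for the sketch.

```python
import math
import random
from collections import Counter, defaultdict


def stratified_split_indices(labels, test_size=0.25, seed=42):
    """Return (train_idx, test_idx) with every class split in about the same ratio."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        n_test = math.ceil(test_size * len(members))  # per-class rounding (assumption)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])

    # Shuffle again so rows are not ordered by class.
    rng.shuffle(train_idx)
    rng.shuffle(test_idx)
    return train_idx, test_idx


labels = [0] * 60 + [1] * 28 + [2] * 12
train_idx, test_idx = stratified_split_indices(labels, test_size=0.25)
print(sorted(Counter(labels[i] for i in test_idx).items()))   # [(0, 15), (1, 7), (2, 3)]
print(sorted(Counter(labels[i] for i in train_idx).items()))  # [(0, 45), (1, 21), (2, 9)]
```

Each class contributes exactly a quarter of its rows to the test set, so the 60/28/12 class mix is the same in both subsets.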

### Combined Metrics and Model Selection Workflow

```python
import numpy as np
from sklearnex.model_selection import train_test_split
from sklearnex.metrics import roc_auc_score, pairwise_distances
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Generate dataset
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=15,
    n_classes=2, weights=[0.7, 0.3], random_state=42
)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Train multiple models
models = {
    'RandomForest': RandomForestClassifier(n_estimators=100, random_state=42),
    'LogisticRegression': LogisticRegression(random_state=42),
    'KNN': KNeighborsClassifier(n_neighbors=5)
}

# Models that are sensitive to feature scale
needs_scaling = ('LogisticRegression', 'KNN')

results = {}

for name, model in models.items():
    # Fit model
    if name in needs_scaling:
        # Scale features for these models (fit the scaler on training data only)
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)

        model.fit(X_train_scaled, y_train)
        y_proba = model.predict_proba(X_test_scaled)[:, 1]
    else:
        model.fit(X_train, y_train)
        y_proba = model.predict_proba(X_test)[:, 1]

    # Compute ROC AUC
    auc = roc_auc_score(y_test, y_proba)
    results[name] = auc

    print(f"{name} ROC AUC: {auc:.3f}")

# Find best model
best_model = max(results, key=results.get)
print(f"\nBest model: {best_model} (AUC: {results[best_model]:.3f})")

# Distance-based analysis
# Compute pairwise distances between test samples
test_distances = pairwise_distances(X_test[:100], metric='euclidean')

# Analyze distance distribution (the matrix includes the zero diagonal)
print("\nDistance analysis on test set:")
print(f"Mean distance: {test_distances.mean():.3f}")
print(f"Std distance: {test_distances.std():.3f}")
print(f"Min non-zero distance: {test_distances[test_distances > 0].min():.3f}")
print(f"Max distance: {test_distances.max():.3f}")

# Repeated stratified hold-out splits for a more robust estimate
# (random re-splits with different seeds, not k-fold cross-validation)
cv_scores = []
for i in range(5):
    X_cv_train, X_cv_test, y_cv_train, y_cv_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=i
    )

    # Train a fresh, unfitted copy of the best model on each split
    best_clf = clone(models[best_model])
    if best_model in needs_scaling:
        scaler = StandardScaler()
        X_cv_train_scaled = scaler.fit_transform(X_cv_train)
        X_cv_test_scaled = scaler.transform(X_cv_test)

        best_clf.fit(X_cv_train_scaled, y_cv_train)
        y_cv_proba = best_clf.predict_proba(X_cv_test_scaled)[:, 1]
    else:
        best_clf.fit(X_cv_train, y_cv_train)
        y_cv_proba = best_clf.predict_proba(X_cv_test)[:, 1]

    cv_auc = roc_auc_score(y_cv_test, y_cv_proba)
    cv_scores.append(cv_auc)

print(f"\nRepeated hold-out results ({len(cv_scores)} splits):")
print(f"Mean AUC: {np.mean(cv_scores):.3f} ± {np.std(cv_scores):.3f}")
print(f"Individual scores: {[f'{score:.3f}' for score in cv_scores]}")
```
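
The workflow above fits the scaler on the training rows and only applies it to the test rows. The reason is that the mean and standard deviation are learned parameters, so computing them on test data would leak information into evaluation. A minimal single-column sketch of that fit/transform split follows; the helper names are hypothetical, and it uses the population standard deviation, which is the convention `StandardScaler` follows.

```python
import statistics


def fit_standardizer(train_column):
    """Learn the mean and population standard deviation from training values only."""
    mean = statistics.fmean(train_column)
    std = statistics.pstdev(train_column)
    # Guard against constant columns, which would otherwise divide by zero.
    return mean, (std if std > 0 else 1.0)


def standardize(column, mean, std):
    """Apply previously learned statistics to any column (train or test)."""
    return [(value - mean) / std for value in column]


train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [6.0, 0.0]

mean, std = fit_standardizer(train)
print(mean, round(std, 4))                                  # 3.0 1.4142
print([round(v, 4) for v in standardize(test, mean, std)])  # [2.1213, -2.1213]
```

The test values land outside the range seen in training, which is expected: they are expressed in units of the training distribution rather than re-centred on themselves.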

### Performance Comparison

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import pairwise_distances as standard_distances
from sklearn.model_selection import train_test_split as standard_split
from sklearnex.metrics import pairwise_distances as intel_distances
from sklearnex.model_selection import train_test_split as intel_split

# Imports are kept outside the timed regions so that module load time
# does not distort the comparison.

# Generate large dataset for performance testing
X_large, y_large = make_classification(
    n_samples=100000, n_features=50, n_classes=2, random_state=42
)

# Test train_test_split performance
print("Train-test split performance:")

# Intel-optimized version
start_time = time.perf_counter()
X_train_intel, X_test_intel, y_train_intel, y_test_intel = intel_split(
    X_large, y_large, test_size=0.2, random_state=42
)
intel_split_time = time.perf_counter() - start_time

# Standard version
start_time = time.perf_counter()
X_train_std, X_test_std, y_train_std, y_test_std = standard_split(
    X_large, y_large, test_size=0.2, random_state=42
)
standard_split_time = time.perf_counter() - start_time

print(f"Intel train_test_split: {intel_split_time:.3f} seconds")
print(f"Standard train_test_split: {standard_split_time:.3f} seconds")
print(f"Speedup: {standard_split_time / intel_split_time:.1f}x")

# Test pairwise_distances performance
X_dist_test = np.random.randn(2000, 30)
Y_dist_test = np.random.randn(1500, 30)

print("\nPairwise distances performance:")

# Intel-optimized version
start_time = time.perf_counter()
distances_intel = intel_distances(X_dist_test, Y_dist_test, metric='euclidean')
intel_dist_time = time.perf_counter() - start_time

# Standard version
start_time = time.perf_counter()
distances_std = standard_distances(X_dist_test, Y_dist_test, metric='euclidean')
standard_dist_time = time.perf_counter() - start_time

print(f"Intel pairwise_distances: {intel_dist_time:.3f} seconds")
print(f"Standard pairwise_distances: {standard_dist_time:.3f} seconds")
print(f"Speedup: {standard_dist_time / intel_dist_time:.1f}x")

# Verify results agree within floating-point tolerance
print(f"Results match: {np.allclose(distances_intel, distances_std)}")
```
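
Single timed runs like the ones above are sensitive to cache warm-up and background load. A common way to get steadier numbers is to run each call once untimed and then keep the best of several timed repetitions. The helper below is a generic standard-library sketch of that idea; `best_of` and `sum_of_squares` are hypothetical names, and the toy workload stands in for a call such as `pairwise_distances`.

```python
import time


def best_of(func, *args, repeats=5, **kwargs):
    """Warm up once, then return (best_seconds, last_result) over `repeats` timed runs."""
    func(*args, **kwargs)  # warm-up run, not timed

    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best, result


def sum_of_squares(n):
    """Toy workload used only to demonstrate the timing helper."""
    return sum(i * i for i in range(n))


elapsed, value = best_of(sum_of_squares, 100_000, repeats=3)
print(f"best of 3: {elapsed:.6f} s, result={value}")
```

Taking the minimum rather than the mean filters out runs that were slowed by unrelated system activity; reporting the spread as well is reasonable when variance itself is of interest.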

## Performance Notes

- ROC AUC computation shows its largest speedups on datasets with more than 10,000 samples
- Pairwise distance calculations benefit most from Intel optimization on high-dimensional data
- Train-test split optimizations are most noticeable on very large datasets (more than 50,000 samples)
- Memory usage is comparable to the standard scikit-learn versions
- All functions are designed to return the same results as their scikit-learn counterparts, up to floating-point tolerance
- Vectorized operations provide the greatest performance improvements