Tessl Tile for pypi/scikit-learn-intelex@2024.7.0

# Metrics and Model Selection

High-performance implementations of evaluation metrics and model selection utilities with Intel hardware acceleration. These functions provide significant speedups for model evaluation, distance computations, and data splitting operations.

## Capabilities

### Ranking Metrics

#### ROC AUC Score

Intel-accelerated computation of Area Under the ROC Curve for binary and multiclass classification.

```python { .api }
def roc_auc_score(
    y_true,
    y_score,
    average='macro',
    sample_weight=None,
    max_fpr=None,
    multi_class='raise',
    labels=None
):
    """
    Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC).

    Intel-optimized implementation providing significant speedup for large datasets
    through vectorized operations and efficient curve computation.

    Parameters:
        y_true (array-like): True binary labels or multiclass labels
        y_score (array-like): Target scores (probabilities or decision values)
        average (str): Averaging strategy for multiclass ('macro', 'weighted', 'micro')
        sample_weight (array-like): Sample weights
        max_fpr (float): Maximum false positive rate for partial AUC
        multi_class (str): Multiclass strategy ('raise', 'ovr', 'ovo')
        labels (array-like): Labels to include for multiclass problems

    Returns:
        float: Area under ROC curve score

    Example:
        >>> from sklearnex.metrics import roc_auc_score
        >>> y_true = [0, 0, 1, 1]
        >>> y_scores = [0.1, 0.4, 0.35, 0.8]
        >>> roc_auc_score(y_true, y_scores)
        0.75
    """
```
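
To make the `0.75` in the docstring example concrete, the sketch below recomputes it using the well-known interpretation of binary ROC AUC as the fraction of (positive, negative) pairs that the scores rank correctly, with ties counted as half. The helper name `pairwise_auc` is hypothetical and exists only for illustration. This shows what the metric measures, not how `sklearnex` computes it.

```python
from itertools import product


def pairwise_auc(y_true, y_score):
    """Binary AUC as the share of (positive, negative) pairs ranked correctly.

    Hypothetical helper for illustration only; ties count as half a win.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Same data as the docstring example: 3 of the 4 pairs are ordered correctly
print(pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

This quadratic loop is only practical for tiny inputs. It is meant as a reading aid for the metric.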

### Distance Metrics

#### Pairwise Distances

Intel-accelerated computation of pairwise distances between samples.

```python { .api }
def pairwise_distances(
    X,
    Y=None,
    metric='euclidean',
    n_jobs=None,
    force_all_finite=True,
    **kwds
):
    """
    Compute pairwise distances between samples.

    Intel-optimized implementation with significant speedup through vectorized
    distance computations and efficient memory access patterns.

    Parameters:
        X (array-like): Input samples of shape (n_samples_X, n_features)
        Y (array-like): Second set of samples (n_samples_Y, n_features), optional
        metric (str or callable): Distance metric to use
        n_jobs (int): Number of parallel jobs
        force_all_finite (bool): Whether to check for finite values
        **kwds: Additional parameters for distance metric

    Returns:
        ndarray: Distance matrix of shape (n_samples_X, n_samples_Y)

    Supported metrics:
        - 'euclidean': L2 norm distance
        - 'manhattan': L1 norm distance
        - 'cosine': Cosine distance
        - 'minkowski': Minkowski distance
        - 'chebyshev': Chebyshev distance
        - 'hamming': Hamming distance
        - 'jaccard': Jaccard distance
        - callable: Custom distance function

    Example (values shown rounded to 4 decimals):
        >>> from sklearnex.metrics import pairwise_distances
        >>> import numpy as np
        >>> X = np.array([[0, 1], [1, 0], [2, 2]])
        >>> pairwise_distances(X, metric='euclidean')
        array([[0.    , 1.4142, 2.2361],
               [1.4142, 0.    , 2.2361],
               [2.2361, 2.2361, 0.    ]])
    """
```
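
As a cross-check of the Euclidean example, the distance matrix for the same three points can be recomputed with the standard library alone. The helper `euclidean_matrix` is a hypothetical name used only in this sketch, and it makes no claim about how the accelerated implementation works internally.

```python
import math


def euclidean_matrix(X, Y=None):
    """Dense Euclidean distance matrix between rows of X and rows of Y.

    Hypothetical reference helper; when Y is omitted, distances are
    computed between the rows of X themselves.
    """
    Y = X if Y is None else Y
    return [[math.dist(x, y) for y in Y] for x in X]


X = [[0, 1], [1, 0], [2, 2]]
D = euclidean_matrix(X)
for row in D:
    print([round(d, 4) for d in row])
# [0.0, 1.4142, 2.2361]
# [1.4142, 0.0, 2.2361]
# [2.2361, 2.2361, 0.0]
```

The point `[2, 2]` is at distance sqrt(5) from both other points, so the last row and column contain 2.2361 twice.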

### Model Selection Utilities

#### Train Test Split

Intel-accelerated data splitting for model validation with optimized random sampling.

```python { .api }
def train_test_split(
    *arrays,
    test_size=None,
    train_size=None,
    random_state=None,
    shuffle=True,
    stratify=None
):
    """
    Split arrays or matrices into random train and test subsets.

    Intel-optimized implementation with efficient random sampling and
    memory-optimized array operations for large datasets.

    Parameters:
        *arrays: Sequence of indexable arrays with same length/shape[0]
        test_size (float or int): Size of test set (0.0-1.0 for proportion, int for absolute)
        train_size (float or int): Size of train set
        random_state (int): Controls random number generation for reproducibility
        shuffle (bool): Whether to shuffle data before splitting
        stratify (array-like): If not None, data split in stratified fashion

    Returns:
        list: List containing train-test split of inputs

    Example:
        >>> from sklearnex.model_selection import train_test_split
        >>> import numpy as np
        >>> X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
        >>> y = np.array([1, 2, 1, 2])
        >>> X_train, X_test, y_train, y_test = train_test_split(
        ...     X, y, test_size=0.5, random_state=42)
        >>> X_train.shape, X_test.shape
        ((2, 2), (2, 2))
    """
```
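
The `test_size` parameter accepts either a proportion or an absolute count. The sketch below shows how the resulting subset sizes can be worked out by hand. It assumes the rounding rule used by scikit-learn's `train_test_split` (a fractional test size is rounded up, and the remainder goes to the training set when `train_size` is not given), which the accelerated version is expected to follow. `split_sizes` is a hypothetical helper, not part of either library.

```python
import math


def split_sizes(n_samples, test_size=0.25):
    """Return (n_train, n_test) for a float proportion or an int count.

    Hypothetical helper; assumes a fractional test size is rounded up and
    that the train set receives all remaining samples.
    """
    if isinstance(test_size, float):
        if not 0.0 < test_size < 1.0:
            raise ValueError("float test_size must be in (0, 1)")
        n_test = math.ceil(test_size * n_samples)
    else:
        n_test = int(test_size)
    n_train = n_samples - n_test
    if n_train <= 0 or n_test <= 0:
        raise ValueError("both subsets must be non-empty")
    return n_train, n_test


print(split_sizes(4, 0.5))     # (2, 2), matching the docstring example
print(split_sizes(10, 0.25))   # (7, 3): 2.5 test samples rounds up to 3
print(split_sizes(1000, 400))  # (600, 400): an int is an absolute count
```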

## Usage Examples

### ROC AUC Score Computation

```python
import numpy as np
from sklearnex.metrics import roc_auc_score
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Binary classification example
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Get prediction probabilities
y_proba = clf.predict_proba(X_test)[:, 1]  # Probabilities for positive class

# Compute ROC AUC
auc_score = roc_auc_score(y_test, y_proba)
print(f"Binary ROC AUC: {auc_score:.3f}")

# Multiclass example
X_multi, y_multi = make_classification(n_samples=1000, n_features=20, n_classes=3,
                                       n_informative=10, random_state=42)
X_train_multi, X_test_multi, y_train_multi, y_test_multi = train_test_split(
    X_multi, y_multi, test_size=0.2, random_state=42)

clf_multi = RandomForestClassifier(n_estimators=100, random_state=42)
clf_multi.fit(X_train_multi, y_train_multi)

# Get prediction probabilities for all classes
y_proba_multi = clf_multi.predict_proba(X_test_multi)

# Compute multiclass ROC AUC with different averaging strategies
auc_macro = roc_auc_score(y_test_multi, y_proba_multi, multi_class='ovr', average='macro')
auc_weighted = roc_auc_score(y_test_multi, y_proba_multi, multi_class='ovr', average='weighted')

print(f"Multiclass ROC AUC (macro): {auc_macro:.3f}")
print(f"Multiclass ROC AUC (weighted): {auc_weighted:.3f}")

# Per-class ROC AUC
auc_per_class = roc_auc_score(y_test_multi, y_proba_multi, multi_class='ovr', average=None)
for i, auc in enumerate(auc_per_class):
    print(f"Class {i} ROC AUC: {auc:.3f}")

# One-vs-One strategy
auc_ovo = roc_auc_score(y_test_multi, y_proba_multi, multi_class='ovo', average='macro')
print(f"Multiclass ROC AUC (OvO): {auc_ovo:.3f}")
```
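
The difference between `average='macro'` and `average='weighted'` in the one-vs-rest case comes down to how the per-class scores are combined: macro is a plain mean, while weighted uses each class's support (its number of true samples) as the weight. The sketch below uses made-up per-class AUC values and class counts purely for illustration. The helper names are hypothetical.

```python
def macro_average(scores):
    """Unweighted mean of per-class scores."""
    return sum(scores) / len(scores)


def weighted_average(scores, supports):
    """Mean of per-class scores weighted by class support."""
    total = sum(supports)
    return sum(s * n for s, n in zip(scores, supports)) / total


# Illustrative numbers only, not the output of any real model
per_class_auc = [0.90, 0.80, 0.60]
supports = [60, 30, 10]  # true samples per class

print(f"macro:    {macro_average(per_class_auc):.3f}")               # 0.767
print(f"weighted: {weighted_average(per_class_auc, supports):.3f}")  # 0.840
```

With imbalanced classes the two averages can differ noticeably, because the weighted average is dominated by the majority class.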

### Pairwise Distance Computations

```python
import time

import numpy as np
from sklearnex.metrics import pairwise_distances
from sklearn.datasets import make_blobs

# Generate sample data
X, _ = make_blobs(n_samples=500, centers=3, n_features=10, random_state=42)
Y = X[:100]  # Subset for pairwise comparison

# Compute various distance metrics
metrics = ['euclidean', 'manhattan', 'cosine', 'chebyshev']

for metric in metrics:
    distances = pairwise_distances(X[:5], Y[:5], metric=metric)
    print(f"{metric.capitalize()} distances shape: {distances.shape}")
    print(f"{metric.capitalize()} distance range: [{distances.min():.3f}, {distances.max():.3f}]")

# Self-distance matrix (symmetric)
euclidean_self = pairwise_distances(X[:10], metric='euclidean')
print(f"Self-distance matrix shape: {euclidean_self.shape}")
print(f"Diagonal elements (should be ~0): {np.diag(euclidean_self)}")

# Minkowski distance with different p values
for p in [1, 2, 3]:
    minkowski_dist = pairwise_distances(X[:5], Y[:5], metric='minkowski', p=p)
    print(f"Minkowski distance (p={p}) range: [{minkowski_dist.min():.3f}, {minkowski_dist.max():.3f}]")

# Large dataset performance example
X_large = np.random.randn(2000, 50)
Y_large = np.random.randn(1000, 50)

# perf_counter is a monotonic, high-resolution clock suited to timing code
start_time = time.perf_counter()
distances_large = pairwise_distances(X_large, Y_large, metric='euclidean')
computation_time = time.perf_counter() - start_time

print(f"Large dataset distances shape: {distances_large.shape}")
print(f"Computation time: {computation_time:.2f} seconds")

# Memory-efficient chunked computation for very large datasets
def chunked_pairwise_distances(X, Y, chunk_size=1000, metric='euclidean'):
    """Compute pairwise distances in chunks to manage memory usage."""
    n_samples_X = X.shape[0]
    distances = []

    for i in range(0, n_samples_X, chunk_size):
        end_idx = min(i + chunk_size, n_samples_X)
        chunk_distances = pairwise_distances(X[i:end_idx], Y, metric=metric)
        distances.append(chunk_distances)

    return np.vstack(distances)

# Example with chunked computation
X_very_large = np.random.randn(5000, 20)
Y_subset = np.random.randn(500, 20)

chunked_distances = chunked_pairwise_distances(X_very_large, Y_subset, chunk_size=1000)
print(f"Chunked distances shape: {chunked_distances.shape}")
```
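
When choosing a `chunk_size`, it helps to estimate how much memory a dense distance matrix needs. The sketch below does that arithmetic under the assumption that results are stored as 8-byte floats (`float64`). Both helper names are hypothetical.

```python
def distance_matrix_bytes(n_x, n_y, itemsize=8):
    """Bytes needed for a dense (n_x, n_y) matrix; itemsize=8 assumes float64."""
    return n_x * n_y * itemsize


def rows_per_chunk(n_y, budget_bytes, itemsize=8):
    """Largest number of X rows whose distance block fits in budget_bytes."""
    return max(1, budget_bytes // (n_y * itemsize))


# Shapes taken from the example above
print(distance_matrix_bytes(2000, 1000))  # 16000000 bytes, about 16 MB
print(distance_matrix_bytes(5000, 500))   # 20000000 bytes, about 20 MB
print(rows_per_chunk(500, 4_000_000))     # 1000 rows fit in a 4 MB block
```

Note that the chunked helper in the example still stacks every block into one array, so the final result occupies the full matrix size. Chunking only bounds the size of each intermediate computation.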

### Train-Test Split Operations

```python
from collections import Counter

import numpy as np
from sklearnex.model_selection import train_test_split
from sklearn.datasets import make_classification, make_regression

# Basic train-test split
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)

# Split with different test sizes
test_sizes = [0.2, 0.3, 0.5]
for test_size in test_sizes:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42
    )
    print(f"Test size {test_size}: Train={X_train.shape[0]}, Test={X_test.shape[0]}")

# Stratified split to preserve class distribution
X_imbalanced, y_imbalanced = make_classification(
    n_samples=1000, n_features=20, n_classes=3,
    weights=[0.6, 0.3, 0.1], random_state=42
)

X_train_strat, X_test_strat, y_train_strat, y_test_strat = train_test_split(
    X_imbalanced, y_imbalanced, test_size=0.2, stratify=y_imbalanced, random_state=42
)

# Check class distributions
print("Original distribution:", Counter(y_imbalanced))
print("Train distribution:", Counter(y_train_strat))
print("Test distribution:", Counter(y_test_strat))

# Multiple array splitting
X_reg, y_reg = make_regression(n_samples=800, n_features=15, random_state=42)
sample_weights = np.random.rand(800)
groups = np.random.randint(0, 5, 800)

X_train, X_test, y_train, y_test, weights_train, weights_test, groups_train, groups_test = train_test_split(
    X_reg, y_reg, sample_weights, groups,
    test_size=0.25, random_state=42
)

print("Multiple arrays split:")
print(f"X: {X_train.shape[0]} train, {X_test.shape[0]} test")
print(f"y: {y_train.shape[0]} train, {y_test.shape[0]} test")
print(f"weights: {weights_train.shape[0]} train, {weights_test.shape[0]} test")
print(f"groups: {groups_train.shape[0]} train, {groups_test.shape[0]} test")

# No shuffle option
X_ordered = np.arange(100).reshape(50, 2)
y_ordered = np.arange(50)

X_train_ns, X_test_ns, y_train_ns, y_test_ns = train_test_split(
    X_ordered, y_ordered, test_size=0.2, shuffle=False
)

print("No shuffle - first few train indices:", y_train_ns[:5])
print("No shuffle - first few test indices:", y_test_ns[:5])

# Fixed train size instead of test size
X_train_fixed, X_test_fixed, y_train_fixed, y_test_fixed = train_test_split(
    X, y, train_size=600, random_state=42
)

print(f"Fixed train size: Train={X_train_fixed.shape[0]}, Test={X_test_fixed.shape[0]}")

# Reproducibility check
splits = []
for seed in [42, 42, 42]:  # Same seed
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    splits.append(y_tr[:5])

print("Reproducibility check (should be identical):")
for i, split in enumerate(splits):
    print(f"Split {i+1}: {split}")
```
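
To illustrate what "stratified" means in the example above, the following sketch splits indices class by class so that each class contributes roughly the same fraction to the test set. It is a simplified stand-in written with the standard library, not the algorithm used by scikit-learn or `sklearnex`. The function name and its rounding choice are assumptions made for this illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_indices(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), sampling test_size of each class.

    Simplified illustration: shuffles the indices of each class with a
    seeded RNG and sends the first round(test_size * count) to the test set.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        n_test = round(test_size * len(members))
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return train_idx, test_idx


# 60/30/10 class mix, mirroring the weights used in the example above
labels = [0] * 60 + [1] * 30 + [2] * 10
train_idx, test_idx = stratified_indices(labels)
test_counts = Counter(labels[i] for i in test_idx)
print(test_counts[0], test_counts[1], test_counts[2])  # 12 6 2
```

The test set keeps the 60/30/10 class mix, which is the property the `stratify` argument is meant to preserve.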

### Combined Metrics and Model Selection Workflow

```python
import numpy as np
from sklearnex.model_selection import train_test_split
from sklearnex.metrics import roc_auc_score, pairwise_distances
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Generate dataset
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=15,
    n_classes=2, weights=[0.7, 0.3], random_state=42
)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Train multiple models
models = {
    'RandomForest': RandomForestClassifier(n_estimators=100, random_state=42),
    'LogisticRegression': LogisticRegression(random_state=42),
    'KNN': KNeighborsClassifier(n_neighbors=5)
}

results = {}

for name, model in models.items():
    # Fit model
    if name == 'LogisticRegression' or name == 'KNN':
        # Scale features for these models
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)

        model.fit(X_train_scaled, y_train)
        y_proba = model.predict_proba(X_test_scaled)[:, 1]
    else:
        model.fit(X_train, y_train)
        y_proba = model.predict_proba(X_test)[:, 1]

    # Compute ROC AUC
    auc = roc_auc_score(y_test, y_proba)
    results[name] = auc

    print(f"{name} ROC AUC: {auc:.3f}")

# Find best model
best_model = max(results, key=results.get)
print(f"\nBest model: {best_model} (AUC: {results[best_model]:.3f})")

# Distance-based analysis
# Compute pairwise distances between test samples
test_distances = pairwise_distances(X_test[:100], metric='euclidean')

# Analyze distance distribution
print("\nDistance analysis on test set:")
print(f"Mean distance: {test_distances.mean():.3f}")
print(f"Std distance: {test_distances.std():.3f}")
print(f"Min non-zero distance: {test_distances[test_distances > 0].min():.3f}")
print(f"Max distance: {test_distances.max():.3f}")

# Repeated stratified hold-out evaluation
# (several random train-test splits; the test sets may overlap, unlike k-fold CV)
cv_scores = []
for i in range(5):
    X_cv_train, X_cv_test, y_cv_train, y_cv_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=i
    )

    # Refit the best model on each new training split
    best_clf = models[best_model]
    if best_model == 'LogisticRegression' or best_model == 'KNN':
        scaler = StandardScaler()
        X_cv_train_scaled = scaler.fit_transform(X_cv_train)
        X_cv_test_scaled = scaler.transform(X_cv_test)

        best_clf.fit(X_cv_train_scaled, y_cv_train)
        y_cv_proba = best_clf.predict_proba(X_cv_test_scaled)[:, 1]
    else:
        best_clf.fit(X_cv_train, y_cv_train)
        y_cv_proba = best_clf.predict_proba(X_cv_test)[:, 1]

    cv_auc = roc_auc_score(y_cv_test, y_cv_proba)
    cv_scores.append(cv_auc)

print(f"\nRepeated hold-out results ({len(cv_scores)} splits):")
print(f"Mean AUC: {np.mean(cv_scores):.3f} ± {np.std(cv_scores):.3f}")
print(f"Individual scores: {[f'{score:.3f}' for score in cv_scores]}")
```

### Performance Comparison

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import pairwise_distances as standard_distances
from sklearn.model_selection import train_test_split as standard_split
from sklearnex.metrics import pairwise_distances as intel_distances
from sklearnex.model_selection import train_test_split as intel_split

# All imports happen up front so that import time is not counted in the timings

# Generate large dataset for performance testing
X_large, y_large = make_classification(
    n_samples=100000, n_features=50, n_classes=2, random_state=42
)

# Test train_test_split performance
print("Train-test split performance:")

# Intel-optimized version
start_time = time.perf_counter()
X_train_intel, X_test_intel, y_train_intel, y_test_intel = intel_split(
    X_large, y_large, test_size=0.2, random_state=42
)
intel_split_time = time.perf_counter() - start_time

# Standard version
start_time = time.perf_counter()
X_train_std, X_test_std, y_train_std, y_test_std = standard_split(
    X_large, y_large, test_size=0.2, random_state=42
)
standard_split_time = time.perf_counter() - start_time

print(f"Intel train_test_split: {intel_split_time:.3f} seconds")
print(f"Standard train_test_split: {standard_split_time:.3f} seconds")
print(f"Speedup: {standard_split_time / intel_split_time:.1f}x")

# Test pairwise_distances performance
X_dist_test = np.random.randn(2000, 30)
Y_dist_test = np.random.randn(1500, 30)

print("\nPairwise distances performance:")

# Intel-optimized version
start_time = time.perf_counter()
distances_intel = intel_distances(X_dist_test, Y_dist_test, metric='euclidean')
intel_dist_time = time.perf_counter() - start_time

# Standard version
start_time = time.perf_counter()
distances_std = standard_distances(X_dist_test, Y_dist_test, metric='euclidean')
standard_dist_time = time.perf_counter() - start_time

print(f"Intel pairwise_distances: {intel_dist_time:.3f} seconds")
print(f"Standard pairwise_distances: {standard_dist_time:.3f} seconds")
print(f"Speedup: {standard_dist_time / intel_dist_time:.1f}x")

# Verify results agree within floating-point tolerance
print(f"Results match (np.allclose): {np.allclose(distances_intel, distances_std)}")
```
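
A single timed run, as in the comparison above, can be noisy because of caching, warm-up, and background load. One common way to reduce that noise is to repeat each call and keep the best time. The sketch below is a minimal harness of that kind using only the standard library. `best_time` is a hypothetical helper, and `sorted` stands in for the function being measured.

```python
import time


def best_time(func, *args, repeats=5, **kwargs):
    """Run func(*args, **kwargs) several times and return the fastest wall time."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best


# Stand-in workload; replace `sorted` with the function under test
data = list(range(100_000, 0, -1))
elapsed = best_time(sorted, data, repeats=3)
print(f"best of 3: {elapsed:.4f} s")
```

In the comparison above, the same harness could wrap both the `sklearnex` and the `sklearn` call so that each is measured under the same conditions.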

## Performance Notes

- ROC AUC computation shows significant speedups on datasets with >10000 samples
- Pairwise distance calculations benefit most from Intel optimization with high-dimensional data
- Train-test split optimizations are most noticeable with very large datasets (>50000 samples)
- Memory usage is comparable to standard scikit-learn versions
- All functions maintain identical results to scikit-learn implementations
- Vectorized operations provide the greatest performance improvements
