# Metrics and Model Selection

Accelerated implementations of evaluation metrics and model selection utilities from Intel Extension for Scikit-learn (`sklearnex`). Each function keeps the signature of its scikit-learn counterpart and is intended to speed up model evaluation, distance computation, and data splitting on Intel hardware.

## Capabilities

### Ranking Metrics

#### ROC AUC Score

Intel-accelerated computation of Area Under the ROC Curve for binary and multiclass classification.

```python { .api }
def roc_auc_score(
    y_true,
    y_score,
    average='macro',
    sample_weight=None,
    max_fpr=None,
    multi_class='raise',
    labels=None
):
    """
    Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC).

    Intel-optimized implementation providing significant speedup for large datasets
    through vectorized operations and efficient curve computation.

    Parameters:
        y_true (array-like): True binary labels or multiclass labels
        y_score (array-like): Target scores (probabilities or decision values)
        average (str): Averaging strategy for multiclass ('macro', 'weighted', 'micro')
        sample_weight (array-like): Sample weights
        max_fpr (float): Maximum false positive rate for partial AUC
        multi_class (str): Multiclass strategy ('raise', 'ovr', 'ovo')
        labels (array-like): Labels to include for multiclass problems

    Returns:
        float: Area under ROC curve score

    Example:
        >>> from sklearnex.metrics import roc_auc_score
        >>> y_true = [0, 0, 1, 1]
        >>> y_scores = [0.1, 0.4, 0.35, 0.8]
        >>> roc_auc_score(y_true, y_scores)
        0.75
    """
```
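
To make the returned number concrete, here is a minimal pure-Python sketch of what a binary ROC AUC measures: the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half. The helper name `pairwise_auc` is hypothetical and is not part of `sklearnex`; it ignores `sample_weight` and `max_fpr` and is far slower than the library function, so treat it as an illustration only.

```python
def pairwise_auc(y_true, y_score):
    """Illustrative binary AUC: share of correctly ordered (positive, negative) pairs."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")

    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0   # positive ranked above negative
            elif p == n:
                correct += 0.5   # ties count as half

    return correct / (len(pos) * len(neg))


# Same data as the docstring example: 3 of the 4 pairs are ordered correctly
auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75
```

Only the pair (0.35, 0.4) is ordered incorrectly, which is why the score is 3/4.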

### Distance Metrics

#### Pairwise Distances

Intel-accelerated computation of pairwise distances between samples.

```python { .api }
def pairwise_distances(
    X,
    Y=None,
    metric='euclidean',
    n_jobs=None,
    force_all_finite=True,
    **kwds
):
    """
    Compute pairwise distances between samples.

    Intel-optimized implementation with significant speedup through vectorized
    distance computations and efficient memory access patterns.

    Parameters:
        X (array-like): Input samples of shape (n_samples_X, n_features)
        Y (array-like): Second set of samples (n_samples_Y, n_features), optional
        metric (str or callable): Distance metric to use
        n_jobs (int): Number of parallel jobs
        force_all_finite (bool): Whether to check for finite values
        **kwds: Additional parameters for distance metric

    Returns:
        ndarray: Distance matrix of shape (n_samples_X, n_samples_Y)

    Supported metrics:
        - 'euclidean': L2 norm distance
        - 'manhattan': L1 norm distance
        - 'cosine': Cosine distance
        - 'minkowski': Minkowski distance
        - 'chebyshev': Chebyshev distance
        - 'hamming': Hamming distance
        - 'jaccard': Jaccard distance
        - callable: Custom distance function

    Example:
        >>> from sklearnex.metrics import pairwise_distances
        >>> import numpy as np
        >>> X = np.array([[0, 1], [1, 0], [2, 2]])
        >>> pairwise_distances(X, metric='euclidean').round(4)
        array([[0.    , 1.4142, 2.2361],
               [1.4142, 0.    , 2.2361],
               [2.2361, 2.2361, 0.    ]])
    """
```
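
As a rough reference for the shape and meaning of the result, the sketch below computes the same kind of distance matrix in pure Python for the three points from the docstring example. `pairwise_euclidean` is a hypothetical helper, not a library function, and it covers only the Euclidean metric.

```python
import math


def pairwise_euclidean(X, Y=None):
    """Illustrative distance matrix: entry [i][j] is the distance from X[i] to Y[j]."""
    Y = X if Y is None else Y
    return [[math.dist(x, y) for y in Y] for x in X]


X = [[0, 1], [1, 0], [2, 2]]
D = pairwise_euclidean(X)

for row in D:
    print([round(d, 4) for d in row])
# [0.0, 1.4142, 2.2361]
# [1.4142, 0.0, 2.2361]
# [2.2361, 2.2361, 0.0]
```

When `Y` is omitted the matrix is square and symmetric with a zero diagonal; with a separate `Y` it has one row per sample in `X` and one column per sample in `Y`.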

### Model Selection Utilities

#### Train Test Split

Intel-accelerated data splitting for model validation with optimized random sampling.

```python { .api }
def train_test_split(
    *arrays,
    test_size=None,
    train_size=None,
    random_state=None,
    shuffle=True,
    stratify=None
):
    """
    Split arrays or matrices into random train and test subsets.

    Intel-optimized implementation with efficient random sampling and
    memory-optimized array operations for large datasets.

    Parameters:
        *arrays: Sequence of indexable arrays with same length/shape[0]
        test_size (float or int): Size of test set (0.0-1.0 for proportion, int for absolute)
        train_size (float or int): Size of train set
        random_state (int): Controls random number generation for reproducibility
        shuffle (bool): Whether to shuffle data before splitting
        stratify (array-like): If not None, data split in stratified fashion

    Returns:
        list: List containing train-test split of inputs

    Example:
        >>> from sklearnex.model_selection import train_test_split
        >>> import numpy as np
        >>> X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
        >>> y = np.array([1, 2, 1, 2])
        >>> X_train, X_test, y_train, y_test = train_test_split(
        ...     X, y, test_size=0.5, random_state=42)
        >>> X_train.shape, X_test.shape
        ((2, 2), (2, 2))
    """
```
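
The sketch below illustrates, in pure Python, how `test_size` and a seed translate into a shuffled split. `simple_split` is a hypothetical helper written for this illustration; it handles a single sequence, no stratification, and will not reproduce the exact partition that the library produces for a given `random_state`. Rounding a fractional `test_size` up mirrors scikit-learn's behavior.

```python
import math
import random


def simple_split(data, test_size=0.25, seed=None):
    """Illustrative shuffled split of one sequence into (train, test) lists."""
    n = len(data)
    # A float is a proportion (rounded up); an int is an absolute sample count
    n_test = math.ceil(test_size * n) if isinstance(test_size, float) else test_size

    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # the seed makes the shuffle repeatable

    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [data[i] for i in train_idx], [data[i] for i in test_idx]


train, test = simple_split(list(range(10)), test_size=0.25, seed=0)
print(len(train), len(test))  # 7 3
```

Calling the helper twice with the same seed returns the same partition, which is the property `random_state` provides in the real function.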

## Usage Examples

### ROC AUC Score Computation

```python
import numpy as np
from sklearnex.metrics import roc_auc_score
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Binary classification example
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Get prediction probabilities
y_proba = clf.predict_proba(X_test)[:, 1]  # Probabilities for positive class

# Compute ROC AUC
auc_score = roc_auc_score(y_test, y_proba)
print(f"Binary ROC AUC: {auc_score:.3f}")

# Multiclass example
X_multi, y_multi = make_classification(n_samples=1000, n_features=20, n_classes=3,
                                       n_informative=10, random_state=42)
X_train_multi, X_test_multi, y_train_multi, y_test_multi = train_test_split(
    X_multi, y_multi, test_size=0.2, random_state=42)

clf_multi = RandomForestClassifier(n_estimators=100, random_state=42)
clf_multi.fit(X_train_multi, y_train_multi)

# Get prediction probabilities for all classes
y_proba_multi = clf_multi.predict_proba(X_test_multi)

# Compute multiclass ROC AUC with different averaging strategies
auc_macro = roc_auc_score(y_test_multi, y_proba_multi, multi_class='ovr', average='macro')
auc_weighted = roc_auc_score(y_test_multi, y_proba_multi, multi_class='ovr', average='weighted')

print(f"Multiclass ROC AUC (macro): {auc_macro:.3f}")
print(f"Multiclass ROC AUC (weighted): {auc_weighted:.3f}")

# Per-class ROC AUC
auc_per_class = roc_auc_score(y_test_multi, y_proba_multi, multi_class='ovr', average=None)
for i, auc in enumerate(auc_per_class):
    print(f"Class {i} ROC AUC: {auc:.3f}")

# One-vs-One strategy
auc_ovo = roc_auc_score(y_test_multi, y_proba_multi, multi_class='ovo', average='macro')
print(f"Multiclass ROC AUC (OvO): {auc_ovo:.3f}")
```
```
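
The `multi_class='ovr'` strategy used above can be sketched by hand: treat each class in turn as the positive class, compute a binary AUC from that class's probability column, then combine the per-class scores. The pure-Python sketch below uses hypothetical helper names (`binary_auc`, `ovr_auc`) and a tiny made-up probability table; it illustrates the `'macro'` and `'weighted'` averages only and is not the library's implementation.

```python
def binary_auc(is_positive, scores):
    """Share of (positive, negative) pairs ordered correctly; ties count as half."""
    pos = [s for flag, s in zip(is_positive, scores) if flag]
    neg = [s for flag, s in zip(is_positive, scores) if not flag]
    correct = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return correct / (len(pos) * len(neg))


def ovr_auc(y_true, y_proba, average="macro"):
    """One-vs-rest AUC: one binary AUC per class, then an average over classes."""
    classes = sorted(set(y_true))
    per_class, supports = [], []
    for column, cls in enumerate(classes):
        is_positive = [label == cls for label in y_true]
        scores = [row[column] for row in y_proba]
        per_class.append(binary_auc(is_positive, scores))
        supports.append(sum(is_positive))  # number of true samples of this class

    if average == "macro":
        return sum(per_class) / len(per_class)
    if average == "weighted":
        return sum(a * s for a, s in zip(per_class, supports)) / sum(supports)
    raise ValueError("this sketch supports only 'macro' and 'weighted'")


# Made-up class probabilities for six samples and three classes
y_true = [0, 0, 1, 1, 2, 2]
y_proba = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.3, 0.6, 0.1],
    [0.5, 0.3, 0.2],
    [0.1, 0.2, 0.7],
    [0.2, 0.2, 0.6],
]

macro = ovr_auc(y_true, y_proba, average="macro")
print(f"{macro:.3f}")  # 0.917
```

Here the per-class scores are 0.875, 0.875, and 1.0. Because every class has the same number of samples, the weighted average equals the macro average; on imbalanced data the two diverge.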

### Pairwise Distance Computations

```python
import time

import numpy as np
from sklearnex.metrics import pairwise_distances
from sklearn.datasets import make_blobs

# Generate sample data
X, _ = make_blobs(n_samples=500, centers=3, n_features=10, random_state=42)
Y = X[:100]  # Subset for pairwise comparison

# Compute various distance metrics
metrics = ['euclidean', 'manhattan', 'cosine', 'chebyshev']

for metric in metrics:
    distances = pairwise_distances(X[:5], Y[:5], metric=metric)
    print(f"{metric.capitalize()} distances shape: {distances.shape}")
    print(f"{metric.capitalize()} distance range: [{distances.min():.3f}, {distances.max():.3f}]")

# Self-distance matrix (symmetric)
euclidean_self = pairwise_distances(X[:10], metric='euclidean')
print(f"Self-distance matrix shape: {euclidean_self.shape}")
print(f"Diagonal elements (should be ~0): {np.diag(euclidean_self)}")

# Minkowski distance with different p values
for p in [1, 2, 3]:
    minkowski_dist = pairwise_distances(X[:5], Y[:5], metric='minkowski', p=p)
    print(f"Minkowski distance (p={p}) range: [{minkowski_dist.min():.3f}, {minkowski_dist.max():.3f}]")

# Large dataset performance example
X_large = np.random.randn(2000, 50)
Y_large = np.random.randn(1000, 50)

# perf_counter is a monotonic, high-resolution clock suited to timing
start_time = time.perf_counter()
distances_large = pairwise_distances(X_large, Y_large, metric='euclidean')
computation_time = time.perf_counter() - start_time

print(f"Large dataset distances shape: {distances_large.shape}")
print(f"Computation time: {computation_time:.2f} seconds")

# Memory-efficient chunked computation for very large datasets
def chunked_pairwise_distances(X, Y, chunk_size=1000, metric='euclidean'):
    """Compute pairwise distances in chunks to manage memory usage."""
    n_samples_X = X.shape[0]
    distances = []

    for i in range(0, n_samples_X, chunk_size):
        end_idx = min(i + chunk_size, n_samples_X)
        chunk_distances = pairwise_distances(X[i:end_idx], Y, metric=metric)
        distances.append(chunk_distances)

    return np.vstack(distances)

# Example with chunked computation
X_very_large = np.random.randn(5000, 20)
Y_subset = np.random.randn(500, 20)

chunked_distances = chunked_pairwise_distances(X_very_large, Y_subset, chunk_size=1000)
print(f"Chunked distances shape: {chunked_distances.shape}")
```
```
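
The Minkowski loop above relies on a relationship that a small pure-Python sketch makes explicit: `p=1` gives the Manhattan distance, `p=2` gives the Euclidean distance, and as `p` grows the value approaches the Chebyshev distance. The helpers `minkowski` and `chebyshev` below are hypothetical stand-ins for the library metrics, shown on a single pair of points.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def chebyshev(x, y):
    """Largest absolute coordinate difference (the limit of Minkowski as p grows)."""
    return max(abs(a - b) for a, b in zip(x, y))


x, y = (0, 0), (3, 4)

print(minkowski(x, y, 1))   # 7.0  (Manhattan: 3 + 4)
print(minkowski(x, y, 2))   # 5.0  (Euclidean: sqrt(9 + 16))
print(chebyshev(x, y))      # 4    (max(3, 4))
```

For a large order such as `p=50`, `minkowski(x, y, 50)` is already within a small fraction of the Chebyshev value of 4.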

### Train-Test Split Operations

```python
import numpy as np
from collections import Counter
from sklearnex.model_selection import train_test_split
from sklearn.datasets import make_classification, make_regression

# Basic train-test split
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)

# Split with different test sizes
test_sizes = [0.2, 0.3, 0.5]
for test_size in test_sizes:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42
    )
    print(f"Test size {test_size}: Train={X_train.shape[0]}, Test={X_test.shape[0]}")

# Stratified split to preserve class distribution
# n_informative is raised from its default of 2: make_classification requires
# n_classes * n_clusters_per_class <= 2 ** n_informative, which 3 classes would violate
X_imbalanced, y_imbalanced = make_classification(
    n_samples=1000, n_features=20, n_classes=3, n_informative=5,
    weights=[0.6, 0.3, 0.1], random_state=42
)

X_train_strat, X_test_strat, y_train_strat, y_test_strat = train_test_split(
    X_imbalanced, y_imbalanced, test_size=0.2, stratify=y_imbalanced, random_state=42
)

# Check class distributions
print("Original distribution:", Counter(y_imbalanced))
print("Train distribution:", Counter(y_train_strat))
print("Test distribution:", Counter(y_test_strat))

# Multiple array splitting
X_reg, y_reg = make_regression(n_samples=800, n_features=15, random_state=42)
sample_weights = np.random.rand(800)
groups = np.random.randint(0, 5, 800)

(X_train, X_test, y_train, y_test,
 weights_train, weights_test, groups_train, groups_test) = train_test_split(
    X_reg, y_reg, sample_weights, groups,
    test_size=0.25, random_state=42
)

print("Multiple arrays split:")
print(f"X: {X_train.shape[0]} train, {X_test.shape[0]} test")
print(f"y: {y_train.shape[0]} train, {y_test.shape[0]} test")
print(f"weights: {weights_train.shape[0]} train, {weights_test.shape[0]} test")
print(f"groups: {groups_train.shape[0]} train, {groups_test.shape[0]} test")

# No shuffle option
X_ordered = np.arange(100).reshape(50, 2)
y_ordered = np.arange(50)

X_train_ns, X_test_ns, y_train_ns, y_test_ns = train_test_split(
    X_ordered, y_ordered, test_size=0.2, shuffle=False
)

print("No shuffle - first few train indices:", y_train_ns[:5])
print("No shuffle - first few test indices:", y_test_ns[:5])

# Fixed train size instead of test size
X_train_fixed, X_test_fixed, y_train_fixed, y_test_fixed = train_test_split(
    X, y, train_size=600, random_state=42
)

print(f"Fixed train size: Train={X_train_fixed.shape[0]}, Test={X_test_fixed.shape[0]}")

# Reproducibility check
splits = []
for seed in [42, 42, 42]:  # Same seed
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    splits.append(y_tr[:5])

print("Reproducibility check (should be identical):")
for i, split in enumerate(splits):
    print(f"Split {i+1}: {split}")
```
```
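
What `stratify` preserves can be shown with a simplified pure-Python sketch: shuffle and split each class separately so that the test set keeps the class proportions of the full data. `stratified_split_indices` is a hypothetical helper; the per-class rounding used here is an assumption made for the illustration and is simpler than the allocation scheme scikit-learn uses.

```python
import random
from collections import Counter, defaultdict


def stratified_split_indices(labels, test_fraction=0.2, seed=0):
    """Illustrative stratified split: split each class on its own, then merge."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        n_test = round(test_fraction * len(members))  # simplified per-class rounding
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return train_idx, test_idx


# 60% / 30% / 10% class mix, matching the weights in the example above
labels = [0] * 60 + [1] * 30 + [2] * 10
train_idx, test_idx = stratified_split_indices(labels, test_fraction=0.2, seed=42)

test_counts = Counter(labels[i] for i in test_idx)
print(test_counts)  # Counter({0: 12, 1: 6, 2: 2})
```

The test set of 20 samples keeps the 60/30/10 mix of the full data, which an unstratified shuffle does not guarantee for the rare class.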

### Combined Metrics and Model Selection Workflow

```python
import numpy as np
from sklearnex.model_selection import train_test_split
from sklearnex.metrics import roc_auc_score, pairwise_distances
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Generate dataset
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=15,
    n_classes=2, weights=[0.7, 0.3], random_state=42
)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Train multiple models
models = {
    'RandomForest': RandomForestClassifier(n_estimators=100, random_state=42),
    'LogisticRegression': LogisticRegression(random_state=42),
    'KNN': KNeighborsClassifier(n_neighbors=5)
}

results = {}

for name, model in models.items():
    # Fit model
    if name == 'LogisticRegression' or name == 'KNN':
        # Scale features for these models
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)

        model.fit(X_train_scaled, y_train)
        y_proba = model.predict_proba(X_test_scaled)[:, 1]
    else:
        model.fit(X_train, y_train)
        y_proba = model.predict_proba(X_test)[:, 1]

    # Compute ROC AUC
    auc = roc_auc_score(y_test, y_proba)
    results[name] = auc

    print(f"{name} ROC AUC: {auc:.3f}")

# Find best model
best_model = max(results, key=results.get)
print(f"\nBest model: {best_model} (AUC: {results[best_model]:.3f})")

# Distance-based analysis
# Compute pairwise distances between test samples
test_distances = pairwise_distances(X_test[:100], metric='euclidean')

# Analyze distance distribution (the zero diagonal is included in mean and std)
print("\nDistance analysis on test set:")
print(f"Mean distance: {test_distances.mean():.3f}")
print(f"Std distance: {test_distances.std():.3f}")
print(f"Min non-zero distance: {test_distances[test_distances > 0].min():.3f}")
print(f"Max distance: {test_distances.max():.3f}")

# Repeated stratified hold-out splits for a more robust estimate
# (random re-splits with different seeds, not disjoint k-fold partitions)
cv_scores = []
for i in range(5):
    X_cv_train, X_cv_test, y_cv_train, y_cv_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=i
    )

    # Refit the best model on this split
    best_clf = models[best_model]
    if best_model == 'LogisticRegression' or best_model == 'KNN':
        scaler = StandardScaler()
        X_cv_train_scaled = scaler.fit_transform(X_cv_train)
        X_cv_test_scaled = scaler.transform(X_cv_test)

        best_clf.fit(X_cv_train_scaled, y_cv_train)
        y_cv_proba = best_clf.predict_proba(X_cv_test_scaled)[:, 1]
    else:
        best_clf.fit(X_cv_train, y_cv_train)
        y_cv_proba = best_clf.predict_proba(X_cv_test)[:, 1]

    cv_auc = roc_auc_score(y_cv_test, y_cv_proba)
    cv_scores.append(cv_auc)

print(f"\nRepeated hold-out results ({len(cv_scores)} splits):")
print(f"Mean AUC: {np.mean(cv_scores):.3f} ± {np.std(cv_scores):.3f}")
print(f"Individual scores: {[f'{score:.3f}' for score in cv_scores]}")
```
```

### Performance Comparison

```python
import time
import numpy as np
from sklearn.datasets import make_classification

# Import both implementations up front so that import time is not measured
from sklearnex.model_selection import train_test_split as intel_split
from sklearn.model_selection import train_test_split as standard_split
from sklearnex.metrics import pairwise_distances as intel_distances
from sklearn.metrics import pairwise_distances as standard_distances

# Generate large dataset for performance testing
X_large, y_large = make_classification(
    n_samples=100000, n_features=50, n_classes=2, random_state=42
)

# Test train_test_split performance
print("Train-test split performance:")

# Intel-optimized version
start_time = time.perf_counter()
X_train_intel, X_test_intel, y_train_intel, y_test_intel = intel_split(
    X_large, y_large, test_size=0.2, random_state=42
)
intel_split_time = time.perf_counter() - start_time

# Standard version
start_time = time.perf_counter()
X_train_std, X_test_std, y_train_std, y_test_std = standard_split(
    X_large, y_large, test_size=0.2, random_state=42
)
standard_split_time = time.perf_counter() - start_time

print(f"Intel train_test_split: {intel_split_time:.3f} seconds")
print(f"Standard train_test_split: {standard_split_time:.3f} seconds")
print(f"Speedup: {standard_split_time / intel_split_time:.1f}x")

# Test pairwise_distances performance
X_dist_test = np.random.randn(2000, 30)
Y_dist_test = np.random.randn(1500, 30)

print("\nPairwise distances performance:")

# Intel-optimized version
start_time = time.perf_counter()
distances_intel = intel_distances(X_dist_test, Y_dist_test, metric='euclidean')
intel_dist_time = time.perf_counter() - start_time

# Standard version
start_time = time.perf_counter()
distances_std = standard_distances(X_dist_test, Y_dist_test, metric='euclidean')
standard_dist_time = time.perf_counter() - start_time

print(f"Intel pairwise_distances: {intel_dist_time:.3f} seconds")
print(f"Standard pairwise_distances: {standard_dist_time:.3f} seconds")
print(f"Speedup: {standard_dist_time / intel_dist_time:.1f}x")

# Verify results match up to floating-point tolerance
print(f"Results match: {np.allclose(distances_intel, distances_std)}")
```
```
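
A single timed call is sensitive to warm-up effects and background load, so speedup figures are usually steadier when each call is repeated and the fastest run is kept. The sketch below shows one way to do that with the standard library; `best_of` is a hypothetical helper, and the built-in `sum` stands in for the function being measured.

```python
import time


def best_of(func, *args, repeats=5, **kwargs):
    """Run func several times; return (fastest elapsed seconds, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best, result


# Stand-in workload; swap in intel_split or intel_distances from the block above
elapsed, total = best_of(sum, range(1_000_000), repeats=3)
print(f"best of 3: {elapsed:.4f} seconds, result={total}")
```

Comparing the `best_of` times of the two implementations gives a speedup estimate that is less affected by one-off delays than a single measurement.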

## Performance Notes

- ROC AUC computation shows significant speedups on datasets with more than 10,000 samples
- Pairwise distance calculations benefit most from Intel optimization on high-dimensional data
- Train-test split optimizations are most noticeable on very large datasets (more than 50,000 samples)
- Memory usage is comparable to the standard scikit-learn versions
- All functions are intended to return the same results as the scikit-learn implementations; the performance comparison checks this with `np.allclose`
- Vectorized operations provide the greatest performance improvements