
# Advanced Features

Advanced capabilities, including preview APIs and distributed computing with SPMD (Single Program Multiple Data) support. Preview APIs give early access to experimental optimizations, and SPMD enables distributed machine learning on large-scale datasets.

## Preview API

Preview features are experimental implementations that provide early access to new algorithms and optimizations. Enable preview features by setting the `SKLEARNEX_PREVIEW` environment variable.

```bash
export SKLEARNEX_PREVIEW=1
```
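The same switch can be set from Python. The following is a minimal sketch, assuming the variable has to be in the environment before `patch_sklearn()` is called. `enable_preview` is a hypothetical helper for illustration, not part of sklearnex.

```python
import os


def enable_preview(environ=os.environ):
    """Set SKLEARNEX_PREVIEW and report whether it was already set.

    Hypothetical helper: it only touches the environment mapping passed in
    and does not import sklearnex itself.
    """
    already_set = "SKLEARNEX_PREVIEW" in environ
    environ["SKLEARNEX_PREVIEW"] = "1"
    return already_set


was_set = enable_preview()
print(f"SKLEARNEX_PREVIEW was already set: {was_set}")

# Patch only after the variable is in place. The import is guarded so the
# sketch also runs on machines without sklearnex installed.
try:
    from sklearnex import patch_sklearn
except ImportError:
    print("sklearnex is not installed; skipping patch_sklearn()")
else:
    patch_sklearn()
```

Passing a plain `dict` instead of `os.environ` makes the helper easy to exercise in isolation.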

## Hyperparameter Utilities

Advanced utilities for accessing Intel oneDAL hyperparameters for specific algorithms and operations.

```python { .api }
def get_hyperparameters(algorithm, op):
    """
    Get hyperparameter object for specific Intel oneDAL algorithm operation.

    Provides access to low-level hyperparameters for Intel oneDAL algorithms,
    allowing fine-tuning of algorithm behavior and performance characteristics.

    Parameters:
        algorithm (str): Algorithm name (e.g., 'linear_regression', 'covariance')
        op (str): Operation name (e.g., 'train', 'compute')

    Returns:
        HyperParameters: Object with algorithm-specific hyperparameters
        None: If oneDAL version < 2024.0.0

    Raises:
        KeyError: If algorithm/operation combination is not supported

    Example:
        from sklearnex import get_hyperparameters

        # Get hyperparameters for linear regression training
        hparams = get_hyperparameters('linear_regression', 'train')

        if hparams is not None:
            # Access hyperparameter values
            current_params = hparams.to_dict()
            print(f"Current parameters: {current_params}")

            # Modify hyperparameters (if setters available)
            # hparams.some_parameter = new_value
    """
```

### Supported Algorithm Operations

Currently supported hyperparameter combinations:

```python
from sklearnex import get_hyperparameters

# Linear regression training hyperparameters
linear_hparams = get_hyperparameters('linear_regression', 'train')

# Covariance computation hyperparameters
cov_hparams = get_hyperparameters('covariance', 'compute')
```

## Utility Functions

Core utility functions for array handling and validation with Intel optimization.

```python { .api }
def get_namespace(x, xp=None):
    """
    Get array namespace for input arrays.

    Determines the appropriate array namespace (NumPy, CuPy, etc.)
    for the given input arrays, enabling cross-library compatibility.

    Parameters:
        x (array-like): Input array to determine namespace for
        xp (module, optional): Preferred array namespace module

    Returns:
        module: Array namespace module (numpy, cupy, etc.)

    Example:
        from sklearnex.utils import get_namespace
        import numpy as np

        data = np.array([[1, 2], [3, 4]])
        xp = get_namespace(data)
        # xp will be numpy module

        result = xp.mean(data, axis=0)
    """

def _assert_all_finite(X, allow_nan=False, msg_dtype=None):
    """
    Assert that all values in array are finite.

    Validates that input arrays contain only finite values,
    with Intel-optimized checking for large arrays.

    Parameters:
        X (array-like): Input array to validate
        allow_nan (bool): Whether to allow NaN values
        msg_dtype (str): Data type name for error messages

    Raises:
        ValueError: If array contains non-finite values

    Example:
        from sklearnex.utils import _assert_all_finite
        import numpy as np

        # Valid array - no error
        valid_data = np.array([[1.0, 2.0], [3.0, 4.0]])
        _assert_all_finite(valid_data)

        # Invalid array - raises ValueError
        invalid_data = np.array([[1.0, np.inf], [3.0, 4.0]])
        # _assert_all_finite(invalid_data)  # Would raise ValueError
    """
```
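The `allow_nan` flag changes what counts as invalid. Following the scikit-learn convention, infinities are always rejected, while NaN is rejected only when `allow_nan` is false. The sketch below illustrates that rule in plain Python on a flat list of floats. `check_all_finite` is a hypothetical stand-in and says nothing about how sklearnex implements the check internally.

```python
import math


def check_all_finite(values, allow_nan=False):
    """Raise ValueError if `values` holds inf, or NaN when not allowed."""
    for v in values:
        if math.isinf(v):
            raise ValueError("Input contains infinity.")
        if math.isnan(v) and not allow_nan:
            raise ValueError("Input contains NaN.")


# Finite data passes silently.
check_all_finite([1.0, 2.0, 3.0])

# NaN is tolerated only when explicitly allowed.
check_all_finite([1.0, float("nan")], allow_nan=True)

# Infinity is rejected even with allow_nan=True.
try:
    check_all_finite([1.0, float("inf")], allow_nan=True)
except ValueError as exc:
    print(f"Rejected: {exc}")
```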

### Preview Capabilities

#### Preview K-Means Clustering

Enhanced K-means implementation with advanced optimization techniques.

```python { .api }
from sklearnex.preview.cluster import KMeans

class KMeans:
    """
    Preview K-means clustering with advanced optimizations.

    Features experimental improvements including better initialization,
    adaptive convergence criteria, and enhanced memory efficiency.
    """

    def __init__(
        self,
        n_clusters=8,
        init='k-means++',
        n_init=10,
        max_iter=300,
        tol=1e-4,
        random_state=None,
        copy_x=True,
        algorithm='auto'
    ):
        """Enhanced K-means with experimental optimizations."""
```

#### Preview Empirical Covariance

Advanced covariance estimation with improved numerical stability.

```python { .api }
from sklearnex.preview.covariance import EmpiricalCovariance

class EmpiricalCovariance:
    """
    Preview empirical covariance with enhanced numerical methods.

    Provides improved stability for high-dimensional and near-singular
    covariance matrices through advanced regularization techniques.
    """

    def __init__(
        self,
        store_precision=True,
        assume_centered=False
    ):
        """Enhanced empirical covariance estimation."""
```

#### Preview Incremental PCA

Advanced incremental Principal Component Analysis implementation.

```python { .api }
from sklearnex.preview.decomposition import IncrementalPCA

class IncrementalPCA:
    """
    Preview Incremental PCA with memory and computational optimizations.

    Enhanced version supporting larger batch sizes and improved
    numerical stability for streaming high-dimensional data.
    """

    def __init__(
        self,
        n_components=None,
        whiten=False,
        copy=True,
        batch_size=None
    ):
        """Advanced incremental PCA implementation."""
```

#### Preview Ridge Regression

Enhanced Ridge regression with advanced solver algorithms.

```python { .api }
from sklearnex.preview.linear_model import Ridge

class Ridge:
    """
    Preview Ridge regression with experimental solver improvements.

    Features advanced optimization techniques for better convergence
    and handling of ill-conditioned problems.
    """

    def __init__(
        self,
        alpha=1.0,
        fit_intercept=True,
        normalize='deprecated',
        copy_X=True,
        max_iter=None,
        tol=1e-3,
        solver='auto',
        positive=False,
        random_state=None
    ):
        """Enhanced Ridge regression with advanced solvers."""
```

## SPMD (Single Program Multiple Data) API

SPMD provides distributed computing capabilities for large-scale machine learning across multiple nodes. It requires the oneDAL SPMD backend and an appropriate distributed computing environment.

### SPMD Setup and Configuration

```python
# SPMD requires distributed computing setup
# Example with mpi4py (Message Passing Interface)

from mpi4py import MPI
import os

# Initialize MPI environment
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Ensure oneDAL SPMD is available
os.environ['ONEAPI_DAAL_SPMD'] = '1'

# Import SPMD modules after MPI setup
from sklearnex.spmd import patch_sklearn
patch_sklearn()
```
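In an SPMD program every rank runs the same script on its own share of the data. One common way to pick that share, sketched here under the assumption that the full dataset has `n_samples` rows addressable by index, is to give each rank a contiguous block and spread the remainder over the first ranks. `local_bounds` is a hypothetical helper. In a real run, `rank` and `size` would come from `comm.Get_rank()` and `comm.Get_size()`.

```python
def local_bounds(n_samples, rank, size):
    """Return the half-open row range [start, stop) owned by `rank`."""
    base, extra = divmod(n_samples, size)
    # The first `extra` ranks each take one additional row.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


# Simulate 4 ranks sharing 10 rows, without needing MPI.
for rank in range(4):
    start, stop = local_bounds(10, rank, 4)
    print(f"rank {rank}: rows [{start}, {stop})")
```

The ranges are disjoint and cover every row exactly once, so no sample is dropped or counted twice when results are aggregated.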

### SPMD Capabilities

#### Distributed Basic Statistics

```python { .api }
from sklearnex.spmd.basic_statistics import BasicStatistics

class BasicStatistics:
    """
    Distributed basic statistics computation across multiple nodes.

    Automatically partitions data across available MPI processes and
    aggregates results for scalable statistical analysis.
    """

    def fit(self, X, y=None):
        """
        Compute statistics on distributed data.

        Each MPI rank processes its portion of data, with automatic
        aggregation of results across all nodes.
        """
```
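Aggregating per-rank results into global statistics can be done exactly, without moving the raw data. As an illustration of the idea (not of oneDAL's actual implementation), the sketch below merges per-rank sample counts, means, and sums of squared deviations into a global mean and population variance using the standard pairwise update. All function names are hypothetical, and the three lists stand in for the local data of three ranks.

```python
import statistics


def local_summary(values):
    """Return (count, mean, sum of squared deviations) for one rank."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values)
    return n, mean, m2


def merge(a, b):
    """Merge two summaries without revisiting the underlying samples."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta ** 2 * n_a * n_b / n
    return n, mean, m2


# Three "ranks", each holding its own slice of one feature column.
chunks = [[1.0, 2.0, 3.0], [10.0, 12.0], [4.0, 4.0, 5.0, 7.0]]
total = local_summary(chunks[0])
for chunk in chunks[1:]:
    total = merge(total, local_summary(chunk))

n, mean, m2 = total
print(f"merged:    n={n}, mean={mean:.4f}, variance={m2 / n:.4f}")

# Reference values computed on the concatenated data.
flat = [v for chunk in chunks for v in chunk]
print(f"reference: n={len(flat)}, mean={statistics.fmean(flat):.4f}, "
      f"variance={statistics.pvariance(flat):.4f}")
```

Only three numbers per feature travel between ranks, which is why this kind of reduction scales well.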

#### Distributed Clustering

```python { .api }
from sklearnex.spmd.cluster import KMeans, DBSCAN

class KMeans:
    """
    Distributed K-means clustering across multiple nodes.

    Scales to very large datasets by distributing computation
    and coordinating centroid updates across MPI processes.
    """

class DBSCAN:
    """
    Distributed DBSCAN clustering for large-scale density analysis.

    Enables clustering of massive datasets through distributed
    density computation and neighbor finding.
    """
```

#### Distributed Linear Models

```python { .api }
from sklearnex.spmd.linear_model import LinearRegression, LogisticRegression

class LinearRegression:
    """
    Distributed linear regression using distributed gradient computation.

    Scales to massive datasets through distributed normal equation
    or gradient-based solving across multiple nodes.
    """

class LogisticRegression:
    """
    Distributed logistic regression with distributed gradient descent.

    Handles very large classification problems through distributed
    optimization and coordinated parameter updates.
    """
```

#### Distributed Ensemble Methods

```python { .api }
from sklearnex.spmd.ensemble import RandomForestClassifier, RandomForestRegressor

class RandomForestClassifier:
    """
    Distributed Random Forest classification.

    Distributes tree construction across nodes while maintaining
    ensemble diversity and prediction accuracy.
    """

class RandomForestRegressor:
    """
    Distributed Random Forest regression.

    Scales tree ensemble training to very large datasets through
    distributed bootstrap sampling and tree building.
    """
```

## Usage Examples

### Preview API Examples

```python
import os
import numpy as np

# Enable preview features
os.environ['SKLEARNEX_PREVIEW'] = '1'

from sklearnex.preview.cluster import KMeans as PreviewKMeans
from sklearnex.preview.covariance import EmpiricalCovariance as PreviewCovariance
from sklearnex.preview.decomposition import IncrementalPCA as PreviewIPCA
from sklearnex.preview.linear_model import Ridge as PreviewRidge

from sklearn.datasets import make_blobs, make_regression

# Preview K-Means Example
print("Testing Preview K-Means:")
X_kmeans, _ = make_blobs(n_samples=2000, centers=5, n_features=20, random_state=42)

preview_kmeans = PreviewKMeans(n_clusters=5, random_state=42)
preview_kmeans.fit(X_kmeans)

print(f"Preview K-means inertia: {preview_kmeans.inertia_:.2f}")
print(f"Cluster centers shape: {preview_kmeans.cluster_centers_.shape}")

# Preview Empirical Covariance Example
print("\nTesting Preview Empirical Covariance:")
X_cov = np.random.randn(1000, 50)

preview_cov = PreviewCovariance(store_precision=True)
preview_cov.fit(X_cov)

print(f"Covariance matrix shape: {preview_cov.covariance_.shape}")
print(f"Precision matrix available: {hasattr(preview_cov, 'precision_')}")
print(f"Log-likelihood: {preview_cov.score(X_cov[:100]):.2f}")

# Preview Incremental PCA Example
print("\nTesting Preview Incremental PCA:")
X_pca = np.random.randn(2000, 100)

preview_ipca = PreviewIPCA(n_components=20, batch_size=200)

# Fit in batches
for i in range(0, X_pca.shape[0], 200):
    batch = X_pca[i:i+200]
    preview_ipca.partial_fit(batch)

# Transform data
X_transformed = preview_ipca.transform(X_pca[:500])
print(f"Transformed data shape: {X_transformed.shape}")
print(f"Explained variance ratio sum: {preview_ipca.explained_variance_ratio_.sum():.3f}")

# Preview Ridge Regression Example
print("\nTesting Preview Ridge:")
X_ridge, y_ridge = make_regression(n_samples=1500, n_features=50, noise=0.1, random_state=42)

preview_ridge = PreviewRidge(alpha=1.0, solver='auto')
preview_ridge.fit(X_ridge, y_ridge)

print(f"Ridge R² score: {preview_ridge.score(X_ridge, y_ridge):.3f}")
print(f"Coefficients shape: {preview_ridge.coef_.shape}")
```

### SPMD Distributed Computing Examples

```python
# Note: This example requires an MPI environment and multiple processes
# Run with: mpirun -n 4 python spmd_example.py

try:
    from mpi4py import MPI
    import numpy as np

    # Initialize MPI
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    print(f"Process {rank} of {size} started")

    # Enable SPMD mode
    import os
    os.environ['ONEAPI_DAAL_SPMD'] = '1'

    from sklearnex.spmd.basic_statistics import BasicStatistics as SPMDStats
    from sklearnex.spmd.cluster import KMeans as SPMDKMeans
    from sklearnex.spmd.linear_model import LinearRegression as SPMDLinear

    # Generate distributed data (each process has its portion)
    np.random.seed(42 + rank)  # Different seed per process
    local_samples = 2500  # Samples per process
    n_features = 30

    X_local = np.random.randn(local_samples, n_features)
    y_local = np.random.randn(local_samples)

    if rank == 0:
        print(f"Total dataset: {size * local_samples} samples, {n_features} features")
        print(f"Each process handles: {local_samples} samples")

    # Distributed Basic Statistics
    if rank == 0:
        print("\n=== Distributed Basic Statistics ===")

    spmd_stats = SPMDStats(result_options='all')
    spmd_stats.fit(X_local)

    if rank == 0:
        print(f"Global mean computed: {spmd_stats.mean_[:5]}...")  # Show first 5
        print(f"Global variance computed: {spmd_stats.variance_[:5]}...")
        print(f"Total samples processed: {spmd_stats.n_samples_seen_}")

    # Distributed K-Means
    if rank == 0:
        print("\n=== Distributed K-Means ===")

    spmd_kmeans = SPMDKMeans(n_clusters=8, random_state=42)
    spmd_kmeans.fit(X_local)

    if rank == 0:
        print(f"Global inertia: {spmd_kmeans.inertia_:.2f}")
        print(f"Cluster centers shape: {spmd_kmeans.cluster_centers_.shape}")

    # Distributed Linear Regression
    if rank == 0:
        print("\n=== Distributed Linear Regression ===")

    spmd_linear = SPMDLinear()
    spmd_linear.fit(X_local, y_local)

    if rank == 0:
        print(f"Global coefficients computed: {spmd_linear.coef_[:5]}...")
        print(f"Intercept: {spmd_linear.intercept_:.4f}")

    # Performance comparison (simulate)
    if rank == 0:
        print("\n=== Performance Summary ===")
        print(f"Distributed processing across {size} processes")
        print(f"Each process: {local_samples} samples")
        print(f"Total effective dataset: {size * local_samples} samples")
        print(f"Memory per process: ~{X_local.nbytes / 1024**2:.1f} MB")
        print(f"Total memory distributed: ~{size * X_local.nbytes / 1024**2:.1f} MB")

except ImportError:
    print("MPI not available. SPMD examples require mpi4py and MPI environment.")
    print("Install with: pip install mpi4py")
    print("Run with: mpirun -n 4 python script.py")

    # Fallback: Show SPMD API without execution
    print("\nSPMD API available for:")
    try:
        from sklearnex import spmd
        print("- Basic Statistics (distributed)")
        print("- Clustering (KMeans, DBSCAN)")
        print("- Linear Models (LinearRegression, LogisticRegression)")
        print("- Ensemble Methods (RandomForest)")
        print("- Decomposition (PCA)")
        print("- Covariance (EmpiricalCovariance)")
        print("- Neighbors (KNeighbors)")
    except ImportError as e:
        print(f"SPMD modules not available: {e}")
```

### Hybrid Preview + SPMD Example

```python
# Advanced example combining Preview and SPMD features
import os
import numpy as np

# Enable both preview and SPMD
os.environ['SKLEARNEX_PREVIEW'] = '1'
os.environ['ONEAPI_DAAL_SPMD'] = '1'

try:
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    # Generate large-scale synthetic dataset
    np.random.seed(42 + rank)
    local_samples = 5000
    n_features = 100

    X_local = np.random.randn(local_samples, n_features)

    if rank == 0:
        print("=== Hybrid Preview + SPMD Workflow ===")
        print(f"Dataset: {size * local_samples} samples, {n_features} features")
        print(f"Processes: {size}")

    # Step 1: Distributed statistics with SPMD
    from sklearnex.spmd.basic_statistics import BasicStatistics

    stats = BasicStatistics(result_options=['mean', 'variance'])
    stats.fit(X_local)

    if rank == 0:
        print("\nStep 1 - Global Statistics:")
        print(f"Mean range: [{stats.mean_.min():.3f}, {stats.mean_.max():.3f}]")
        print(f"Variance range: [{stats.variance_.min():.3f}, {stats.variance_.max():.3f}]")

    # Step 2: Local preprocessing
    # Standardize each rank's data using the global statistics
    X_standardized = (X_local - stats.mean_) / np.sqrt(stats.variance_)

    # Step 3: Distributed clustering with SPMD K-means
    from sklearnex.spmd.cluster import KMeans

    kmeans = KMeans(n_clusters=10, n_init=3, random_state=42)
    kmeans.fit(X_standardized)

    if rank == 0:
        print("\nStep 3 - Distributed Clustering:")
        print(f"Global inertia: {kmeans.inertia_:.2f}")
        print(f"Iterations: {kmeans.n_iter_}")

    # Step 4: Local analysis on cluster assignments
    local_labels = kmeans.predict(X_standardized)
    local_cluster_counts = np.bincount(local_labels, minlength=10)

    # Aggregate cluster counts across all processes
    global_cluster_counts = comm.allreduce(local_cluster_counts, op=MPI.SUM)

    if rank == 0:
        print("\nStep 4 - Global Cluster Analysis:")
        for i, count in enumerate(global_cluster_counts):
            percentage = 100 * count / (size * local_samples)
            print(f"Cluster {i}: {count} samples ({percentage:.1f}%)")

    if rank == 0:
        print("\nWorkflow completed successfully!")
        print(f"Total computation distributed across {size} processes")

except ImportError as e:
    print(f"Advanced features require MPI: {e}")
    print("This example demonstrates the potential of combining:")
    print("- Preview APIs for enhanced algorithms")
    print("- SPMD for distributed computation")
    print("- Hybrid workflows for large-scale ML")
```
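The standardization step in the workflow above divides by the square root of the global variance, which produces `inf` or NaN for constant features. A guarded variant is sketched below with plain lists so that it runs without NumPy. `standardize_row` and the `eps` threshold are hypothetical, and mapping constant features to `0.0` is one possible convention rather than something the library prescribes.

```python
import math


def standardize_row(row, means, variances, eps=1e-12):
    """Standardize one sample; (near-)constant features map to 0.0."""
    out = []
    for x, mu, var in zip(row, means, variances):
        if var <= eps:
            # A constant feature carries no scale information.
            out.append(0.0)
        else:
            out.append((x - mu) / math.sqrt(var))
    return out


# Global statistics as they might come back from the statistics step.
means = [1.0, 5.0, 0.0]
variances = [4.0, 0.0, 1.0]  # the second feature is constant

print(standardize_row([3.0, 5.0, -2.0], means, variances))
```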

### Environment and Configuration

```python
import os
import sys

def setup_advanced_features():
    """Setup and verify advanced feature availability."""

    print("=== Advanced Features Configuration ===")

    # Preview API setup
    os.environ['SKLEARNEX_PREVIEW'] = '1'
    print("✓ Preview API enabled")

    # Check available preview modules
    try:
        from sklearnex import preview
        print("✓ Preview modules available:")
        print("  - preview.cluster (enhanced K-means)")
        print("  - preview.covariance (advanced covariance)")
        print("  - preview.decomposition (enhanced PCA)")
        print("  - preview.linear_model (improved Ridge)")
    except ImportError as e:
        print(f"✗ Preview modules error: {e}")

    # SPMD setup check
    try:
        from mpi4py import MPI
        comm = MPI.COMM_WORLD
        rank = comm.Get_rank()
        size = comm.Get_size()
        print(f"✓ MPI available: rank {rank} of {size}")

        os.environ['ONEAPI_DAAL_SPMD'] = '1'
        print("✓ SPMD mode enabled")

        try:
            from sklearnex import spmd
            print("✓ SPMD modules available:")
            print("  - spmd.basic_statistics")
            print("  - spmd.cluster")
            print("  - spmd.linear_model")
            print("  - spmd.ensemble")
            print("  - spmd.decomposition")
        except ImportError as e:
            print(f"✗ SPMD modules error: {e}")

    except ImportError:
        print("✗ MPI not available (install mpi4py for SPMD)")

    # OneDAL configuration
    dalroot = os.environ.get('DALROOT')
    if dalroot:
        print(f"✓ OneDAL root: {dalroot}")
    else:
        print("ℹ OneDAL root not set (may use system installation)")

    # Memory and threading info
    print("\nSystem Information:")
    print(f"Python version: {sys.version}")
    print(f"Available CPU cores: {os.cpu_count()}")

    # Threading environment variables
    threading_vars = ['OMP_NUM_THREADS', 'MKL_NUM_THREADS', 'NUMBA_NUM_THREADS']
    for var in threading_vars:
        value = os.environ.get(var, 'not set')
        print(f"{var}: {value}")

if __name__ == "__main__":
    setup_advanced_features()
```

## Performance and Scaling Notes

### Preview API Performance

- Preview features may perform differently from their stable counterparts
- Some preview algorithms are optimized for specific hardware configurations
- Memory usage may differ from the standard implementations
- API stability is not guaranteed (experimental features)

### SPMD Scaling Characteristics

- Linear scaling is achievable with proper data distribution
- Communication overhead increases with the number of processes
- Optimal performance is typically reached with 2-16 processes per node
- Memory requirements are distributed across all processes
- Network bandwidth is important for large-scale deployments
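The tension between parallel speedup and communication overhead can be made concrete with a toy cost model. The sketch assumes, purely for illustration, that compute time divides evenly across `p` processes while communication cost grows with `log2(p)`. The constants are made up and would need to be measured on a real cluster. Nothing here is derived from oneDAL.

```python
import math


def estimated_time(p, compute_s=100.0, comm_s=2.0):
    """Toy model: perfectly parallel compute plus log-scaled communication."""
    return compute_s / p + comm_s * math.log2(p)


def speedup(p, **kwargs):
    """Speedup of p processes relative to a single process."""
    return estimated_time(1, **kwargs) / estimated_time(p, **kwargs)


for p in (1, 2, 4, 8, 16, 32, 64):
    print(f"p={p:>2}  time={estimated_time(p):7.2f}s  speedup={speedup(p):5.2f}x")
```

With these made-up constants the estimated time stops improving between 32 and 64 processes, which is the kind of plateau worth looking for when benchmarking a real deployment.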

### Best Practices

- Test preview features thoroughly before production use
- Monitor SPMD communication patterns for performance
- Use appropriate batch sizes for distributed processing
- Balance computation and communication costs
- Validate results against single-node implementations

### Hardware Recommendations

- Intel CPUs for optimal oneDAL acceleration
- High-bandwidth interconnects for SPMD (InfiniBand recommended)
- Sufficient memory per node for local data portions
- NVMe storage for large dataset staging
- Consider NUMA topology for multi-socket systems