# Advanced Features

This section covers advanced capabilities: preview APIs and distributed computing with SPMD (Single Program Multiple Data) support. These features provide experimental optimizations and enable distributed machine learning on large-scale datasets.

## Preview API

Preview features are experimental implementations that provide early access to new algorithms and optimizations. Enable preview features by setting the `SKLEARNEX_PREVIEW` environment variable.

```bash
export SKLEARNEX_PREVIEW=1
```
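The same flag can be set from Python. The sketch below assumes the variable has to be in place before `patch_sklearn()` runs, and it guards the import so the snippet still executes where `sklearnex` is not installed. The `patched` variable is only an illustrative name.

```python
import os

# Set the flag before importing sklearnex so that patching can see it
# (assumption: the variable is read when patch_sklearn() is called).
os.environ["SKLEARNEX_PREVIEW"] = "1"

try:
    from sklearnex import patch_sklearn

    patch_sklearn()
    patched = True
except ImportError:
    # sklearnex is not installed; the environment variable is still set.
    patched = False

print(f"SKLEARNEX_PREVIEW={os.environ['SKLEARNEX_PREVIEW']}, patched={patched}")
```

Setting the variable in the shell is usually the safer choice, because it cannot be reordered after an import by accident.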

## Hyperparameter Utilities

Advanced utilities for accessing Intel oneDAL hyperparameters for specific algorithms and operations.

```python { .api }
def get_hyperparameters(algorithm, op):
    """
    Get hyperparameter object for specific Intel oneDAL algorithm operation.

    Provides access to low-level hyperparameters for Intel oneDAL algorithms,
    allowing fine-tuning of algorithm behavior and performance characteristics.

    Parameters:
        algorithm (str): Algorithm name (e.g., 'linear_regression', 'covariance')
        op (str): Operation name (e.g., 'train', 'compute')

    Returns:
        HyperParameters: Object with algorithm-specific hyperparameters
        None: If oneDAL version < 2024.0.0

    Raises:
        KeyError: If algorithm/operation combination is not supported

    Example:
        from sklearnex import get_hyperparameters

        # Get hyperparameters for linear regression training
        hparams = get_hyperparameters('linear_regression', 'train')

        if hparams is not None:
            # Access hyperparameter values
            current_params = hparams.to_dict()
            print(f"Current parameters: {current_params}")

            # Modify hyperparameters (if setters available)
            # hparams.some_parameter = new_value
    """
```

### Supported Algorithm Operations

Currently supported hyperparameter combinations:

```python
from sklearnex import get_hyperparameters

# Linear regression training hyperparameters
linear_hparams = get_hyperparameters('linear_regression', 'train')

# Covariance computation hyperparameters
cov_hparams = get_hyperparameters('covariance', 'compute')
```

## Utility Functions

Core utility functions for array handling and validation with Intel optimization.

```python { .api }
def get_namespace(x, xp=None):
    """
    Get array namespace for input arrays.

    Determines the appropriate array namespace (NumPy, CuPy, etc.)
    for the given input arrays, enabling cross-library compatibility.

    Parameters:
        x (array-like): Input array to determine namespace for
        xp (module, optional): Preferred array namespace module

    Returns:
        module: Array namespace module (numpy, cupy, etc.)

    Example:
        from sklearnex.utils import get_namespace
        import numpy as np

        data = np.array([[1, 2], [3, 4]])
        xp = get_namespace(data)
        # xp will be numpy module

        result = xp.mean(data, axis=0)
    """

def _assert_all_finite(X, allow_nan=False, msg_dtype=None):
    """
    Assert that all values in array are finite.

    Validates that input arrays contain only finite values,
    with Intel-optimized checking for large arrays.

    Parameters:
        X (array-like): Input array to validate
        allow_nan (bool): Whether to allow NaN values
        msg_dtype (str): Data type name for error messages

    Raises:
        ValueError: If array contains non-finite values

    Example:
        from sklearnex.utils import _assert_all_finite
        import numpy as np

        # Valid array - no error
        valid_data = np.array([[1.0, 2.0], [3.0, 4.0]])
        _assert_all_finite(valid_data)

        # Invalid array - raises ValueError
        invalid_data = np.array([[1.0, np.inf], [3.0, 4.0]])
        # _assert_all_finite(invalid_data)  # Would raise ValueError
    """
```

### Preview Capabilities

#### Preview K-Means Clustering

Enhanced K-means implementation with advanced optimization techniques.

```python { .api }
from sklearnex.preview.cluster import KMeans

class KMeans:
    """
    Preview K-means clustering with advanced optimizations.

    Features experimental improvements including better initialization,
    adaptive convergence criteria, and enhanced memory efficiency.
    """

    def __init__(
        self,
        n_clusters=8,
        init='k-means++',
        n_init=10,
        max_iter=300,
        tol=1e-4,
        random_state=None,
        copy_x=True,
        algorithm='auto'
    ):
        """Enhanced K-means with experimental optimizations."""
```

#### Preview Empirical Covariance

Advanced covariance estimation with improved numerical stability.

```python { .api }
from sklearnex.preview.covariance import EmpiricalCovariance

class EmpiricalCovariance:
    """
    Preview empirical covariance with enhanced numerical methods.

    Provides improved stability for high-dimensional and near-singular
    covariance matrices through advanced regularization techniques.
    """

    def __init__(
        self,
        store_precision=True,
        assume_centered=False
    ):
        """Enhanced empirical covariance estimation."""
```

#### Preview Incremental PCA

Advanced incremental Principal Component Analysis implementation.

```python { .api }
from sklearnex.preview.decomposition import IncrementalPCA

class IncrementalPCA:
    """
    Preview Incremental PCA with memory and computational optimizations.

    Enhanced version supporting larger batch sizes and improved
    numerical stability for streaming high-dimensional data.
    """

    def __init__(
        self,
        n_components=None,
        whiten=False,
        copy=True,
        batch_size=None
    ):
        """Advanced incremental PCA implementation."""
```

#### Preview Ridge Regression

Enhanced Ridge regression with advanced solver algorithms.

```python { .api }
from sklearnex.preview.linear_model import Ridge

class Ridge:
    """
    Preview Ridge regression with experimental solver improvements.

    Features advanced optimization techniques for better convergence
    and handling of ill-conditioned problems.
    """

    def __init__(
        self,
        alpha=1.0,
        fit_intercept=True,
        normalize='deprecated',
        copy_X=True,
        max_iter=None,
        tol=1e-3,
        solver='auto',
        positive=False,
        random_state=None
    ):
        """Enhanced Ridge regression with advanced solvers."""
```

## SPMD (Single Program Multiple Data) API

SPMD provides distributed computing capabilities for large-scale machine learning across multiple nodes. It requires the oneDAL SPMD backend and an appropriate distributed computing environment, such as MPI.
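In the SPMD model every process runs the same program on its own portion of the data. The sketch below shows one common way to derive that portion from a process rank. `local_slice` is a hypothetical helper written for illustration and is not part of `sklearnex`; in a real run, `rank` and `size` would come from the MPI communicator.

```python
def local_slice(n_samples, size, rank):
    """Return (start, stop) of the contiguous block of rows owned by `rank`.

    Hypothetical helper: rows are split as evenly as possible, and the
    first `n_samples % size` ranks receive one extra row each.
    """
    base, extra = divmod(n_samples, size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


# Simulate 4 processes sharing 10 rows without needing MPI.
bounds = [local_slice(10, 4, rank) for rank in range(4)]
print(bounds)
```

Each process would then load or generate only rows `start:stop` and pass that local block to the estimator's `fit`.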

### SPMD Setup and Configuration

```python
# SPMD requires distributed computing setup
# Example with mpi4py (Message Passing Interface)

import os

from mpi4py import MPI

# Initialize MPI environment
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Ensure oneDAL SPMD is available
os.environ['ONEAPI_DAAL_SPMD'] = '1'

# Import SPMD modules after MPI setup
from sklearnex.spmd import patch_sklearn
patch_sklearn()
```

### SPMD Capabilities

#### Distributed Basic Statistics

```python { .api }
from sklearnex.spmd.basic_statistics import BasicStatistics

class BasicStatistics:
    """
    Distributed basic statistics computation across multiple nodes.

    Automatically partitions data across available MPI processes and
    aggregates results for scalable statistical analysis.
    """

    def fit(self, X, y=None):
        """
        Compute statistics on distributed data.

        Each MPI rank processes its portion of data, with automatic
        aggregation of results across all nodes.
        """
```

#### Distributed Clustering

```python { .api }
from sklearnex.spmd.cluster import KMeans, DBSCAN

class KMeans:
    """
    Distributed K-means clustering across multiple nodes.

    Scales to very large datasets by distributing computation
    and coordinating centroid updates across MPI processes.
    """

class DBSCAN:
    """
    Distributed DBSCAN clustering for large-scale density analysis.

    Enables clustering of massive datasets through distributed
    density computation and neighbor finding.
    """
```

#### Distributed Linear Models

```python { .api }
from sklearnex.spmd.linear_model import LinearRegression, LogisticRegression

class LinearRegression:
    """
    Distributed linear regression using distributed gradient computation.

    Scales to massive datasets through distributed normal equation
    or gradient-based solving across multiple nodes.
    """

class LogisticRegression:
    """
    Distributed logistic regression with distributed gradient descent.

    Handles very large classification problems through distributed
    optimization and coordinated parameter updates.
    """
```

#### Distributed Ensemble Methods

```python { .api }
from sklearnex.spmd.ensemble import RandomForestClassifier, RandomForestRegressor

class RandomForestClassifier:
    """
    Distributed Random Forest classification.

    Distributes tree construction across nodes while maintaining
    ensemble diversity and prediction accuracy.
    """

class RandomForestRegressor:
    """
    Distributed Random Forest regression.

    Scales tree ensemble training to very large datasets through
    distributed bootstrap sampling and tree building.
    """
```

## Usage Examples

### Preview API Examples

```python
import os
import numpy as np

# Enable preview features
os.environ['SKLEARNEX_PREVIEW'] = '1'

from sklearnex.preview.cluster import KMeans as PreviewKMeans
from sklearnex.preview.covariance import EmpiricalCovariance as PreviewCovariance
from sklearnex.preview.decomposition import IncrementalPCA as PreviewIPCA
from sklearnex.preview.linear_model import Ridge as PreviewRidge

from sklearn.datasets import make_blobs, make_regression

# Preview K-Means Example
print("Testing Preview K-Means:")
X_kmeans, _ = make_blobs(n_samples=2000, centers=5, n_features=20, random_state=42)

preview_kmeans = PreviewKMeans(n_clusters=5, random_state=42)
preview_kmeans.fit(X_kmeans)

print(f"Preview K-means inertia: {preview_kmeans.inertia_:.2f}")
print(f"Cluster centers shape: {preview_kmeans.cluster_centers_.shape}")

# Preview Empirical Covariance Example
print("\nTesting Preview Empirical Covariance:")
X_cov = np.random.randn(1000, 50)

preview_cov = PreviewCovariance(store_precision=True)
preview_cov.fit(X_cov)

print(f"Covariance matrix shape: {preview_cov.covariance_.shape}")
print(f"Precision matrix available: {hasattr(preview_cov, 'precision_')}")
print(f"Log-likelihood: {preview_cov.score(X_cov[:100]):.2f}")

# Preview Incremental PCA Example
print("\nTesting Preview Incremental PCA:")
X_pca = np.random.randn(2000, 100)

preview_ipca = PreviewIPCA(n_components=20, batch_size=200)

# Fit in batches
for i in range(0, X_pca.shape[0], 200):
    batch = X_pca[i:i+200]
    preview_ipca.partial_fit(batch)

# Transform data
X_transformed = preview_ipca.transform(X_pca[:500])
print(f"Transformed data shape: {X_transformed.shape}")
print(f"Explained variance ratio sum: {preview_ipca.explained_variance_ratio_.sum():.3f}")

# Preview Ridge Regression Example
print("\nTesting Preview Ridge:")
X_ridge, y_ridge = make_regression(n_samples=1500, n_features=50, noise=0.1, random_state=42)

preview_ridge = PreviewRidge(alpha=1.0, solver='auto')
preview_ridge.fit(X_ridge, y_ridge)

print(f"Ridge R² score: {preview_ridge.score(X_ridge, y_ridge):.3f}")
print(f"Coefficients shape: {preview_ridge.coef_.shape}")
```

### SPMD Distributed Computing Examples

```python
# Note: This example requires an MPI environment and multiple processes
# Run with: mpirun -n 4 python spmd_example.py

try:
    from mpi4py import MPI
    import numpy as np

    # Initialize MPI
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    print(f"Process {rank} of {size} started")

    # Enable SPMD mode
    import os
    os.environ['ONEAPI_DAAL_SPMD'] = '1'

    from sklearnex.spmd.basic_statistics import BasicStatistics as SPMDStats
    from sklearnex.spmd.cluster import KMeans as SPMDKMeans
    from sklearnex.spmd.linear_model import LinearRegression as SPMDLinear

    # Generate distributed data (each process has its portion)
    np.random.seed(42 + rank)  # Different seed per process
    local_samples = 2500  # Samples per process
    n_features = 30

    X_local = np.random.randn(local_samples, n_features)
    y_local = np.random.randn(local_samples)

    if rank == 0:
        print(f"Total dataset: {size * local_samples} samples, {n_features} features")
        print(f"Each process handles: {local_samples} samples")

    # Distributed Basic Statistics
    if rank == 0:
        print("\n=== Distributed Basic Statistics ===")

    spmd_stats = SPMDStats(result_options='all')
    spmd_stats.fit(X_local)

    if rank == 0:
        print(f"Global mean computed: {spmd_stats.mean_[:5]}...")  # Show first 5
        print(f"Global variance computed: {spmd_stats.variance_[:5]}...")
        print(f"Total samples processed: {spmd_stats.n_samples_seen_}")

    # Distributed K-Means
    if rank == 0:
        print("\n=== Distributed K-Means ===")

    spmd_kmeans = SPMDKMeans(n_clusters=8, random_state=42)
    spmd_kmeans.fit(X_local)

    if rank == 0:
        print(f"Global inertia: {spmd_kmeans.inertia_:.2f}")
        print(f"Cluster centers shape: {spmd_kmeans.cluster_centers_.shape}")

    # Distributed Linear Regression
    if rank == 0:
        print("\n=== Distributed Linear Regression ===")

    spmd_linear = SPMDLinear()
    spmd_linear.fit(X_local, y_local)

    if rank == 0:
        print(f"Global coefficients computed: {spmd_linear.coef_[:5]}...")
        print(f"Intercept: {spmd_linear.intercept_:.4f}")

    # Summary of the distributed run (sizes only, no timing is measured)
    if rank == 0:
        print("\n=== Performance Summary ===")
        print(f"Distributed processing across {size} processes")
        print(f"Each process: {local_samples} samples")
        print(f"Total effective dataset: {size * local_samples} samples")
        print(f"Memory per process: ~{X_local.nbytes / 1024**2:.1f} MB")
        print(f"Total memory distributed: ~{size * X_local.nbytes / 1024**2:.1f} MB")

except ImportError:
    print("MPI not available. SPMD examples require mpi4py and MPI environment.")
    print("Install with: pip install mpi4py")
    print("Run with: mpirun -n 4 python script.py")

    # Fallback: Show SPMD API without execution
    print("\nSPMD API available for:")
    try:
        from sklearnex import spmd
        print("- Basic Statistics (distributed)")
        print("- Clustering (KMeans, DBSCAN)")
        print("- Linear Models (LinearRegression, LogisticRegression)")
        print("- Ensemble Methods (RandomForest)")
        print("- Decomposition (PCA)")
        print("- Covariance (EmpiricalCovariance)")
        print("- Neighbors (KNeighbors)")
    except ImportError as e:
        print(f"SPMD modules not available: {e}")
```

### Hybrid Preview + SPMD Example

```python
# Advanced example combining Preview and SPMD features
import os
import numpy as np

# Enable both preview and SPMD
os.environ['SKLEARNEX_PREVIEW'] = '1'
os.environ['ONEAPI_DAAL_SPMD'] = '1'

try:
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    # Generate large-scale synthetic dataset
    np.random.seed(42 + rank)
    local_samples = 5000
    n_features = 100

    X_local = np.random.randn(local_samples, n_features)

    if rank == 0:
        print("=== Hybrid Preview + SPMD Workflow ===")
        print(f"Dataset: {size * local_samples} samples, {n_features} features")
        print(f"Processes: {size}")

    # Step 1: Distributed statistics with SPMD
    from sklearnex.spmd.basic_statistics import BasicStatistics

    stats = BasicStatistics(result_options=['mean', 'variance'])
    stats.fit(X_local)

    if rank == 0:
        print("\nStep 1 - Global Statistics:")
        print(f"Mean range: [{stats.mean_.min():.3f}, {stats.mean_.max():.3f}]")
        print(f"Variance range: [{stats.variance_.min():.3f}, {stats.variance_.max():.3f}]")

    # Step 2: Local preprocessing
    # Standardize each local block using the global statistics
    X_standardized = (X_local - stats.mean_) / np.sqrt(stats.variance_)

    # Step 3: Distributed clustering with SPMD K-means
    from sklearnex.spmd.cluster import KMeans

    kmeans = KMeans(n_clusters=10, n_init=3, random_state=42)
    kmeans.fit(X_standardized)

    if rank == 0:
        print("\nStep 3 - Distributed Clustering:")
        print(f"Global inertia: {kmeans.inertia_:.2f}")
        print(f"Iterations: {kmeans.n_iter_}")

    # Step 4: Local analysis on cluster assignments
    local_labels = kmeans.predict(X_standardized)
    local_cluster_counts = np.bincount(local_labels, minlength=10)

    # Aggregate cluster counts across all processes
    global_cluster_counts = comm.allreduce(local_cluster_counts, op=MPI.SUM)

    if rank == 0:
        print("\nStep 4 - Global Cluster Analysis:")
        for i, count in enumerate(global_cluster_counts):
            percentage = 100 * count / (size * local_samples)
            print(f"Cluster {i}: {count} samples ({percentage:.1f}%)")

    if rank == 0:
        print("\nWorkflow completed successfully!")
        print(f"Total computation distributed across {size} processes")

except ImportError as e:
    print(f"Advanced features require MPI: {e}")
    print("This example demonstrates the potential of combining:")
    print("- Preview APIs for enhanced algorithms")
    print("- SPMD for distributed computation")
    print("- Hybrid workflows for large-scale ML")
```

### Environment and Configuration

```python
import os
import sys

def setup_advanced_features():
    """Setup and verify advanced feature availability."""

    print("=== Advanced Features Configuration ===")

    # Preview API setup
    os.environ['SKLEARNEX_PREVIEW'] = '1'
    print("✓ Preview API enabled")

    # Check available preview modules
    try:
        from sklearnex import preview
        print("✓ Preview modules available:")
        print("  - preview.cluster (enhanced K-means)")
        print("  - preview.covariance (advanced covariance)")
        print("  - preview.decomposition (enhanced PCA)")
        print("  - preview.linear_model (improved Ridge)")
    except ImportError as e:
        print(f"✗ Preview modules error: {e}")

    # SPMD setup check
    try:
        from mpi4py import MPI
        comm = MPI.COMM_WORLD
        rank = comm.Get_rank()
        size = comm.Get_size()
        print(f"✓ MPI available: rank {rank} of {size}")

        os.environ['ONEAPI_DAAL_SPMD'] = '1'
        print("✓ SPMD mode enabled")

        try:
            from sklearnex import spmd
            print("✓ SPMD modules available:")
            print("  - spmd.basic_statistics")
            print("  - spmd.cluster")
            print("  - spmd.linear_model")
            print("  - spmd.ensemble")
            print("  - spmd.decomposition")
        except ImportError as e:
            print(f"✗ SPMD modules error: {e}")

    except ImportError:
        print("✗ MPI not available (install mpi4py for SPMD)")

    # oneDAL configuration
    dalroot = os.environ.get('DALROOT')
    if dalroot:
        print(f"✓ oneDAL root: {dalroot}")
    else:
        print("ℹ oneDAL root not set (may use system installation)")

    # Memory and threading info
    print("\nSystem Information:")
    print(f"Python version: {sys.version}")
    print(f"Available CPU cores: {os.cpu_count()}")

    # Threading environment variables
    threading_vars = ['OMP_NUM_THREADS', 'MKL_NUM_THREADS', 'NUMBA_NUM_THREADS']
    for var in threading_vars:
        value = os.environ.get(var, 'not set')
        print(f"{var}: {value}")

if __name__ == "__main__":
    setup_advanced_features()
```

## Performance and Scaling Notes

### Preview API Performance
- Preview features may have different performance characteristics than their stable counterparts
- Some preview algorithms are optimized for specific hardware configurations
- Memory usage may differ from the standard implementations
- API stability is not guaranteed (experimental features)
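Because preview APIs can change or disappear between releases, it can help to isolate the import behind a fallback. The following is a minimal sketch of that pattern, assuming stock scikit-learn is an acceptable substitute. The `source` variable is only an illustrative name for reporting which implementation was picked.

```python
# Prefer the preview estimator, fall back to stock scikit-learn, and
# record which one was picked so that results can be interpreted later.
try:
    from sklearnex.preview.covariance import EmpiricalCovariance
    source = "sklearnex.preview"
except ImportError:
    try:
        from sklearn.covariance import EmpiricalCovariance
        source = "sklearn"
    except ImportError:
        EmpiricalCovariance = None
        source = "unavailable"

print(f"EmpiricalCovariance provided by: {source}")
```

Logging `source` alongside benchmark numbers makes it clear which implementation a given measurement refers to.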

### SPMD Scaling Characteristics
- Linear scaling is achievable with proper data distribution
- Communication overhead increases with the number of processes
- Optimal performance is typically reached with 2-16 processes per node
- Memory requirements are distributed across all processes
- Network bandwidth is important for large-scale deployments
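The trade-off between evenly divided computation and growing communication cost can be illustrated with a toy cost model. Everything in the sketch below is an assumption made for illustration: the `log2(p)` communication term, the `compute_s` and `comm_s` constants, and the function name `estimated_time`. None of it is measured from oneDAL.

```python
import math


def estimated_time(p, compute_s=100.0, comm_s=0.5):
    """Toy model: compute time splits evenly over `p` processes, while
    communication cost grows with log2(p). All constants are illustrative."""
    return compute_s / p + comm_s * math.log2(p)


baseline = estimated_time(1)
for p in (1, 2, 4, 8, 16, 32):
    t = estimated_time(p)
    print(f"{p:>2} processes: {t:6.2f} s, speedup {baseline / t:5.2f}x")
```

In this model the speedup stays below the process count as soon as `p` exceeds 1, which is the qualitative behaviour the notes above describe. Real workloads should be profiled rather than modelled.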

### Best Practices
- Test preview features thoroughly before production use
- Monitor SPMD communication patterns for performance
- Use appropriate batch sizes for distributed processing
- Balance computation and communication costs
- Validate results against single-node implementations
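Validation against a single-node result can be rehearsed without MPI. The sketch below uses only the standard library: it splits data into chunks that stand in for per-process partitions, merges per-chunk statistics with the standard pairwise update for counts, means, and sums of squared deviations, and compares the outcome with a direct computation. `local_stats` and `merge` are hypothetical helpers, not `sklearnex` functions.

```python
import math
import random
import statistics


def local_stats(chunk):
    """Per-partition count, mean, and sum of squared deviations (M2)."""
    n = len(chunk)
    mean = sum(chunk) / n
    m2 = sum((x - mean) ** 2 for x in chunk)
    return n, mean, m2


def merge(a, b):
    """Combine two partial results with the pairwise update formula."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta ** 2 * n_a * n_b / n
    return n, mean, m2


random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(1000)]
chunks = [data[i::4] for i in range(4)]  # stand-ins for 4 process partitions

total = local_stats(chunks[0])
for chunk in chunks[1:]:
    total = merge(total, local_stats(chunk))
n, mean, m2 = total
variance = m2 / n  # population variance

# Compare the merged result with a single-pass, single-node computation.
assert math.isclose(mean, statistics.fmean(data), abs_tol=1e-12)
assert math.isclose(variance, statistics.pvariance(data), rel_tol=1e-9)
print(f"n={n}, mean={mean:.6f}, variance={variance:.6f}")
```

The same comparison applies to a real SPMD run: gather a small sample on one rank, fit the single-node estimator, and check that the fitted attributes agree within a tolerance.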

### Hardware Recommendations
- Intel CPUs for optimal oneDAL acceleration
- High-bandwidth interconnects for SPMD (InfiniBand recommended)
- Sufficient memory per node for local data portions
- NVMe storage for large dataset staging
- Consider NUMA topology for multi-socket systems