Tessl Tile for pypi/scikit-learn-intelex@2024.7.0

# Advanced Features

This page covers the advanced capabilities of the package: preview APIs, hyperparameter and array utilities, and distributed computing with SPMD (Single Program Multiple Data). These features give early access to new optimizations and enable distributed machine learning on large datasets.

## Preview API

Preview features are experimental implementations that provide early access to new algorithms and optimizations. Enable them by setting the `SKLEARNEX_PREVIEW` environment variable before `sklearnex` is imported.

```bash
export SKLEARNEX_PREVIEW=1
```
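
The same switch can be flipped from Python instead of the shell. The sketch below assumes the variable must be in place before `sklearnex` is imported; `enable_preview` is a hypothetical helper written for illustration, not part of the package.

```python
import importlib.util
import os


def enable_preview() -> bool:
    """Set SKLEARNEX_PREVIEW and report whether sklearnex is importable.

    Hypothetical helper for illustration; not part of sklearnex.
    """
    # Set the flag first so that it is visible when sklearnex is imported.
    os.environ["SKLEARNEX_PREVIEW"] = "1"
    # find_spec returns None when the package is not installed.
    return importlib.util.find_spec("sklearnex") is not None


if __name__ == "__main__":
    available = enable_preview()
    print(f"SKLEARNEX_PREVIEW={os.environ['SKLEARNEX_PREVIEW']}")
    print(f"sklearnex installed: {available}")
```

Calling the helper at the very top of a script keeps the ordering explicit: flag first, imports second.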

## Hyperparameter Utilities

Advanced utilities for accessing Intel oneDAL hyperparameters for specific algorithms and operations.

```python { .api }
def get_hyperparameters(algorithm, op):
    """
    Get hyperparameter object for specific Intel oneDAL algorithm operation.

    Provides access to low-level hyperparameters for Intel oneDAL algorithms,
    allowing fine-tuning of algorithm behavior and performance characteristics.

    Parameters:
        algorithm (str): Algorithm name (e.g., 'linear_regression', 'covariance')
        op (str): Operation name (e.g., 'train', 'compute')

    Returns:
        HyperParameters: Object with algorithm-specific hyperparameters
        None: If oneDAL version < 2024.0.0

    Raises:
        KeyError: If algorithm/operation combination is not supported

    Example:
        from sklearnex import get_hyperparameters

        # Get hyperparameters for linear regression training
        hparams = get_hyperparameters('linear_regression', 'train')

        if hparams is not None:
            # Access hyperparameter values
            current_params = hparams.to_dict()
            print(f"Current parameters: {current_params}")

            # Modify hyperparameters (if setters available)
            # hparams.some_parameter = new_value
    """
```

### Supported Algorithm Operations

Currently supported algorithm/operation combinations:

```python
from sklearnex import get_hyperparameters

# Linear regression training hyperparameters
linear_hparams = get_hyperparameters('linear_regression', 'train')

# Covariance computation hyperparameters
cov_hparams = get_hyperparameters('covariance', 'compute')
```
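
Because the documented contract has two failure modes (a `None` return on older oneDAL versions and a `KeyError` for unsupported pairs), calling code may want to handle both in one place. The following is a minimal sketch under that assumption; `describe_hyperparameters` is a hypothetical wrapper, and the getter is passed in as an argument so the logic can be exercised without the package installed.

```python
from typing import Any, Callable, Optional


def describe_hyperparameters(
    getter: Callable[[str, str], Any], algorithm: str, op: str
) -> Optional[dict]:
    """Return the hyperparameters as a dict, or None if unavailable.

    `getter` is expected to behave like `get_hyperparameters` as documented
    above: it returns None on older oneDAL versions and raises KeyError for
    unsupported algorithm/operation pairs.
    """
    try:
        hparams = getter(algorithm, op)
    except KeyError:
        # The algorithm/operation pair is outside the supported list.
        return None
    if hparams is None:
        # Documented behavior when oneDAL is older than 2024.0.0.
        return None
    return hparams.to_dict()
```

With the package installed, `describe_hyperparameters(get_hyperparameters, 'linear_regression', 'train')` would return either a dictionary of current values or `None`, so the caller needs only one check.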

## Utility Functions

Core utility functions for array handling and validation with Intel optimization.

```python { .api }
def get_namespace(x, xp=None):
    """
    Get array namespace for input arrays.

    Determines the appropriate array namespace (NumPy, CuPy, etc.)
    for the given input arrays, enabling cross-library compatibility.

    Parameters:
        x (array-like): Input array to determine namespace for
        xp (module, optional): Preferred array namespace module

    Returns:
        module: Array namespace module (numpy, cupy, etc.)

    Example:
        from sklearnex.utils import get_namespace
        import numpy as np

        data = np.array([[1, 2], [3, 4]])
        xp = get_namespace(data)
        # xp will be numpy module

        result = xp.mean(data, axis=0)
    """

def _assert_all_finite(X, allow_nan=False, msg_dtype=None):
    """
    Assert that all values in array are finite.

    Validates that input arrays contain only finite values,
    with Intel-optimized checking for large arrays.

    Parameters:
        X (array-like): Input array to validate
        allow_nan (bool): Whether to allow NaN values
        msg_dtype (str): Data type name for error messages

    Raises:
        ValueError: If array contains non-finite values

    Example:
        from sklearnex.utils import _assert_all_finite
        import numpy as np

        # Valid array - no error
        valid_data = np.array([[1.0, 2.0], [3.0, 4.0]])
        _assert_all_finite(valid_data)

        # Invalid array - raises ValueError
        invalid_data = np.array([[1.0, np.inf], [3.0, 4.0]])
        # _assert_all_finite(invalid_data)  # Would raise ValueError
    """
```
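
To make the role of `allow_nan` concrete, here is a pure-Python reference of the check as it is described above. It is only an illustration of the documented semantics (NaN is tolerated when `allow_nan=True`, infinity never is), not the optimized implementation; `assert_all_finite_reference` is a hypothetical name.

```python
import math
from typing import Iterable


def assert_all_finite_reference(
    rows: Iterable[Iterable[float]], allow_nan: bool = False
) -> None:
    """Raise ValueError if any value is infinite, or NaN when not allowed.

    Illustrative only: mirrors the documented behavior, not the
    optimized oneDAL-backed implementation.
    """
    for row in rows:
        for value in row:
            if math.isnan(value):
                if not allow_nan:
                    raise ValueError("Input contains NaN.")
            elif math.isinf(value):
                raise ValueError("Input contains infinity.")


if __name__ == "__main__":
    assert_all_finite_reference([[1.0, 2.0], [3.0, 4.0]])  # passes silently
    assert_all_finite_reference([[float("nan"), 1.0]], allow_nan=True)  # passes
    try:
        assert_all_finite_reference([[1.0, float("inf")]])
    except ValueError as exc:
        print(f"Rejected: {exc}")
```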

### Preview Capabilities

#### Preview K-Means Clustering

Enhanced K-means implementation with advanced optimization techniques.

```python { .api }
from sklearnex.preview.cluster import KMeans

class KMeans:
    """
    Preview K-means clustering with advanced optimizations.

    Features experimental improvements including better initialization,
    adaptive convergence criteria, and enhanced memory efficiency.
    """

    def __init__(
        self,
        n_clusters=8,
        init='k-means++',
        n_init=10,
        max_iter=300,
        tol=1e-4,
        random_state=None,
        copy_x=True,
        algorithm='auto'
    ):
        """Enhanced K-means with experimental optimizations."""
```

#### Preview Empirical Covariance

Advanced covariance estimation with improved numerical stability.

```python { .api }
from sklearnex.preview.covariance import EmpiricalCovariance

class EmpiricalCovariance:
    """
    Preview empirical covariance with enhanced numerical methods.

    Provides improved stability for high-dimensional and near-singular
    covariance matrices through advanced regularization techniques.
    """

    def __init__(
        self,
        store_precision=True,
        assume_centered=False
    ):
        """Enhanced empirical covariance estimation."""
```

#### Preview Incremental PCA

Advanced incremental Principal Component Analysis implementation.

```python { .api }
from sklearnex.preview.decomposition import IncrementalPCA

class IncrementalPCA:
    """
    Preview Incremental PCA with memory and computational optimizations.

    Enhanced version supporting larger batch sizes and improved
    numerical stability for streaming high-dimensional data.
    """

    def __init__(
        self,
        n_components=None,
        whiten=False,
        copy=True,
        batch_size=None
    ):
        """Advanced incremental PCA implementation."""
```
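
Incremental PCA is fed data one batch at a time, so the caller has to cut the dataset into consecutive chunks. A minimal sketch of that slicing follows; `iter_batches` is a hypothetical helper, and the `partial_fit` call in the comment assumes the preview estimator follows the usual scikit-learn incremental interface.

```python
from typing import Iterator


def iter_batches(n_samples: int, batch_size: int) -> Iterator[slice]:
    """Yield consecutive slices covering range(n_samples).

    Hypothetical helper; the last slice may be shorter than batch_size.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, n_samples, batch_size):
        yield slice(start, min(start + batch_size, n_samples))


if __name__ == "__main__":
    for batch in iter_batches(n_samples=450, batch_size=200):
        # With an estimator and data in hand: ipca.partial_fit(X[batch])
        print(batch)
```

scikit-learn's own `IncrementalPCA` rejects a batch with fewer samples than `n_components`, so it is worth checking the size of the final, shorter slice before passing it on.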

#### Preview Ridge Regression

Enhanced Ridge regression with advanced solver algorithms.

```python { .api }
from sklearnex.preview.linear_model import Ridge

class Ridge:
    """
    Preview Ridge regression with experimental solver improvements.

    Features advanced optimization techniques for better convergence
    and handling of ill-conditioned problems.
    """

    def __init__(
        self,
        alpha=1.0,
        fit_intercept=True,
        normalize='deprecated',
        copy_X=True,
        max_iter=None,
        tol=1e-3,
        solver='auto',
        positive=False,
        random_state=None
    ):
        """Enhanced Ridge regression with advanced solvers."""
```

## SPMD (Single Program Multiple Data) API

SPMD provides distributed computing capabilities for large-scale machine learning across multiple nodes. It requires the oneDAL SPMD backend and a working distributed computing environment, such as MPI with `mpi4py`.

### SPMD Setup and Configuration

```python
# SPMD requires distributed computing setup
# Example with mpi4py (Message Passing Interface)

from mpi4py import MPI
import os

# Initialize MPI environment
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Enable oneDAL SPMD mode
os.environ['ONEAPI_DAAL_SPMD'] = '1'

# Import SPMD modules after MPI setup
from sklearnex.spmd import patch_sklearn
patch_sklearn()
```
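
In an SPMD program every process runs the same script and uses its `rank` and the communicator `size` to decide which rows it owns. One common way to split a dataset evenly is sketched below; `rank_slice` is a hypothetical helper that needs only the two integers MPI reports, so it can be tried without an MPI installation.

```python
def rank_slice(n_samples: int, rank: int, size: int) -> slice:
    """Rows owned by `rank` when n_samples rows are split over `size` ranks.

    The first (n_samples % size) ranks receive one extra row, so the
    partition sizes differ by at most one.
    """
    if not 0 <= rank < size:
        raise ValueError("rank must satisfy 0 <= rank < size")
    base, extra = divmod(n_samples, size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return slice(start, stop)


if __name__ == "__main__":
    # Simulate a 4-process run over 10 rows.
    for r in range(4):
        print(r, rank_slice(10, r, 4))
```

Under MPI the call would be `rank_slice(n_samples, comm.Get_rank(), comm.Get_size())`, with each process loading only its own slice.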

### SPMD Capabilities

#### Distributed Basic Statistics

```python { .api }
from sklearnex.spmd.basic_statistics import BasicStatistics

class BasicStatistics:
    """
    Distributed basic statistics computation across multiple nodes.

    Automatically partitions data across available MPI processes and
    aggregates results for scalable statistical analysis.
    """

    def fit(self, X, y=None):
        """
        Compute statistics on distributed data.

        Each MPI rank processes its portion of data, with automatic
        aggregation of results across all nodes.
        """
```
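
The aggregation step can be pictured as each rank reducing its data to a few numbers that are then summed across ranks. The sketch below simulates this in plain Python for a single feature; it illustrates the general idea only and is not a description of how oneDAL computes its results. `local_summary` and `combine` are hypothetical names, and the variance shown is the population variance.

```python
from typing import List, Sequence, Tuple

Summary = Tuple[int, float, float]


def local_summary(values: Sequence[float]) -> Summary:
    """Per-rank summary: (count, sum, sum of squares)."""
    return len(values), sum(values), sum(v * v for v in values)


def combine(summaries: List[Summary]) -> Tuple[float, float]:
    """Global mean and population variance from per-rank summaries."""
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    total_sq = sum(s[2] for s in summaries)
    mean = total / n
    # E[x^2] - E[x]^2; simple, but can lose precision on large offsets.
    variance = total_sq / n - mean * mean
    return mean, variance


if __name__ == "__main__":
    # Three simulated ranks holding different numbers of samples.
    ranks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
    mean, variance = combine([local_summary(r) for r in ranks])
    print(f"mean={mean:.4f} variance={variance:.4f}")
```

Only three numbers per feature cross the network in this scheme, regardless of how many rows each rank holds.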

#### Distributed Clustering

```python { .api }
from sklearnex.spmd.cluster import KMeans, DBSCAN

class KMeans:
    """
    Distributed K-means clustering across multiple nodes.

    Scales to very large datasets by distributing computation
    and coordinating centroid updates across MPI processes.
    """

class DBSCAN:
    """
    Distributed DBSCAN clustering for large-scale density analysis.

    Enables clustering of massive datasets through distributed
    density computation and neighbor finding.
    """
```
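
Coordinating centroid updates across processes can be sketched as follows: each rank accumulates per-cluster coordinate sums and counts for its own points, the partial results are added together, and the new centroids are the summed coordinates divided by the summed counts. This is a plain-Python simulation of that idea for 2-D points, with hypothetical function names; it is not the library's implementation.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]
Partial = Tuple[List[List[float]], List[int]]


def local_partials(
    points: Sequence[Point], labels: Sequence[int], n_clusters: int
) -> Partial:
    """Per-rank coordinate sums and point counts for each cluster."""
    sums = [[0.0, 0.0] for _ in range(n_clusters)]
    counts = [0] * n_clusters
    for (x, y), k in zip(points, labels):
        sums[k][0] += x
        sums[k][1] += y
        counts[k] += 1
    return sums, counts


def reduce_centroids(
    partials: Sequence[Partial], n_clusters: int
) -> List[Optional[Point]]:
    """Combine all ranks' partials into new centroids (None if empty)."""
    centroids: List[Optional[Point]] = []
    for k in range(n_clusters):
        count = sum(p[1][k] for p in partials)
        if count == 0:
            centroids.append(None)
            continue
        sx = sum(p[0][k][0] for p in partials)
        sy = sum(p[0][k][1] for p in partials)
        centroids.append((sx / count, sy / count))
    return centroids


if __name__ == "__main__":
    rank0 = local_partials([(0.0, 0.0), (2.0, 2.0)], [0, 1], n_clusters=2)
    rank1 = local_partials([(2.0, 0.0), (4.0, 4.0)], [0, 1], n_clusters=2)
    print(reduce_centroids([rank0, rank1], n_clusters=2))
```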

#### Distributed Linear Models

```python { .api }
from sklearnex.spmd.linear_model import LinearRegression, LogisticRegression

class LinearRegression:
    """
    Distributed linear regression using distributed gradient computation.

    Scales to massive datasets through distributed normal equation
    or gradient-based solving across multiple nodes.
    """

class LogisticRegression:
    """
    Distributed logistic regression with distributed gradient descent.

    Handles very large classification problems through distributed
    optimization and coordinated parameter updates.
    """
```
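
The normal-equation route distributes well because the required sums are additive across ranks. For a single feature, the sketch below shows each simulated rank computing five sums and the combined totals yielding the slope and intercept. It is an illustration of the principle with hypothetical names, not the solver oneDAL uses.

```python
from typing import List, Sequence, Tuple

Moments = Tuple[int, float, float, float, float]


def local_moments(xs: Sequence[float], ys: Sequence[float]) -> Moments:
    """Per-rank sums: (n, sum x, sum y, sum x*y, sum x*x)."""
    return (
        len(xs),
        sum(xs),
        sum(ys),
        sum(x * y for x, y in zip(xs, ys)),
        sum(x * x for x in xs),
    )


def solve(moments: List[Moments]) -> Tuple[float, float]:
    """Least-squares slope and intercept from the combined sums."""
    n, sx, sy, sxy, sxx = (sum(m[i] for m in moments) for i in range(5))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


if __name__ == "__main__":
    # Two simulated ranks holding points from the line y = 2x + 1.
    rank0 = local_moments([0.0, 1.0], [1.0, 3.0])
    rank1 = local_moments([2.0, 3.0], [5.0, 7.0])
    slope, intercept = solve([rank0, rank1])
    print(f"slope={slope:.3f} intercept={intercept:.3f}")
```

With many features the same pattern applies to the matrices XᵀX and Xᵀy, which are likewise sums over rows.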

#### Distributed Ensemble Methods

```python { .api }
from sklearnex.spmd.ensemble import RandomForestClassifier, RandomForestRegressor

class RandomForestClassifier:
    """
    Distributed Random Forest classification.

    Distributes tree construction across nodes while maintaining
    ensemble diversity and prediction accuracy.
    """

class RandomForestRegressor:
    """
    Distributed Random Forest regression.

    Scales tree ensemble training to very large datasets through
    distributed bootstrap sampling and tree building.
    """
```

## Usage Examples

### Preview API Examples

```python
import os
import numpy as np

# Enable preview features
os.environ['SKLEARNEX_PREVIEW'] = '1'

from sklearnex.preview.cluster import KMeans as PreviewKMeans
from sklearnex.preview.covariance import EmpiricalCovariance as PreviewCovariance
from sklearnex.preview.decomposition import IncrementalPCA as PreviewIPCA
from sklearnex.preview.linear_model import Ridge as PreviewRidge

from sklearn.datasets import make_blobs, make_regression

# Preview K-Means Example
print("Testing Preview K-Means:")
X_kmeans, _ = make_blobs(n_samples=2000, centers=5, n_features=20, random_state=42)

preview_kmeans = PreviewKMeans(n_clusters=5, random_state=42)
preview_kmeans.fit(X_kmeans)

print(f"Preview K-means inertia: {preview_kmeans.inertia_:.2f}")
print(f"Cluster centers shape: {preview_kmeans.cluster_centers_.shape}")

# Preview Empirical Covariance Example
print("\nTesting Preview Empirical Covariance:")
X_cov = np.random.randn(1000, 50)

preview_cov = PreviewCovariance(store_precision=True)
preview_cov.fit(X_cov)

print(f"Covariance matrix shape: {preview_cov.covariance_.shape}")
print(f"Precision matrix available: {hasattr(preview_cov, 'precision_')}")
print(f"Log-likelihood: {preview_cov.score(X_cov[:100]):.2f}")

# Preview Incremental PCA Example
print("\nTesting Preview Incremental PCA:")
X_pca = np.random.randn(2000, 100)

preview_ipca = PreviewIPCA(n_components=20, batch_size=200)

# Fit in batches
for i in range(0, X_pca.shape[0], 200):
    batch = X_pca[i:i+200]
    preview_ipca.partial_fit(batch)

# Transform data
X_transformed = preview_ipca.transform(X_pca[:500])
print(f"Transformed data shape: {X_transformed.shape}")
print(f"Explained variance ratio sum: {preview_ipca.explained_variance_ratio_.sum():.3f}")

# Preview Ridge Regression Example
print("\nTesting Preview Ridge:")
X_ridge, y_ridge = make_regression(n_samples=1500, n_features=50, noise=0.1, random_state=42)

preview_ridge = PreviewRidge(alpha=1.0, solver='auto')
preview_ridge.fit(X_ridge, y_ridge)

print(f"Ridge R² score: {preview_ridge.score(X_ridge, y_ridge):.3f}")
print(f"Coefficients shape: {preview_ridge.coef_.shape}")
```

### SPMD Distributed Computing Examples

```python
# Note: This example requires MPI environment and multiple processes
# Run with: mpirun -n 4 python spmd_example.py

try:
    from mpi4py import MPI
    import numpy as np

    # Initialize MPI
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    print(f"Process {rank} of {size} started")

    # Enable SPMD mode
    import os
    os.environ['ONEAPI_DAAL_SPMD'] = '1'

    from sklearnex.spmd.basic_statistics import BasicStatistics as SPMDStats
    from sklearnex.spmd.cluster import KMeans as SPMDKMeans
    from sklearnex.spmd.linear_model import LinearRegression as SPMDLinear

    # Generate distributed data (each process has its portion)
    np.random.seed(42 + rank)  # Different seed per process
    local_samples = 2500  # Samples per process
    n_features = 30

    X_local = np.random.randn(local_samples, n_features)
    y_local = np.random.randn(local_samples)

    if rank == 0:
        print(f"Total dataset: {size * local_samples} samples, {n_features} features")
        print(f"Each process handles: {local_samples} samples")

    # Distributed Basic Statistics
    if rank == 0:
        print("\n=== Distributed Basic Statistics ===")

    spmd_stats = SPMDStats(result_options='all')
    spmd_stats.fit(X_local)

    if rank == 0:
        print(f"Global mean computed: {spmd_stats.mean_[:5]}...")  # Show first 5
        print(f"Global variance computed: {spmd_stats.variance_[:5]}...")
        print(f"Total samples processed: {spmd_stats.n_samples_seen_}")

    # Distributed K-Means
    if rank == 0:
        print("\n=== Distributed K-Means ===")

    spmd_kmeans = SPMDKMeans(n_clusters=8, random_state=42)
    spmd_kmeans.fit(X_local)

    if rank == 0:
        print(f"Global inertia: {spmd_kmeans.inertia_:.2f}")
        print(f"Cluster centers shape: {spmd_kmeans.cluster_centers_.shape}")

    # Distributed Linear Regression
    if rank == 0:
        print("\n=== Distributed Linear Regression ===")

    spmd_linear = SPMDLinear()
    spmd_linear.fit(X_local, y_local)

    if rank == 0:
        print(f"Global coefficients computed: {spmd_linear.coef_[:5]}...")
        print(f"Intercept: {spmd_linear.intercept_:.4f}")

    # Summary of the distributed run (sizes only; nothing is timed here)
    if rank == 0:
        print("\n=== Performance Summary ===")
        print(f"Distributed processing across {size} processes")
        print(f"Each process: {local_samples} samples")
        print(f"Total effective dataset: {size * local_samples} samples")
        print(f"Memory per process: ~{X_local.nbytes / 1024**2:.1f} MB")
        print(f"Total memory distributed: ~{size * X_local.nbytes / 1024**2:.1f} MB")

except ImportError:
    print("MPI not available. SPMD examples require mpi4py and MPI environment.")
    print("Install with: pip install mpi4py")
    print("Run with: mpirun -n 4 python script.py")

    # Fallback: Show SPMD API without execution
    print("\nSPMD API available for:")
    try:
        from sklearnex import spmd
        print("- Basic Statistics (distributed)")
        print("- Clustering (KMeans, DBSCAN)")
        print("- Linear Models (LinearRegression, LogisticRegression)")
        print("- Ensemble Methods (RandomForest)")
        print("- Decomposition (PCA)")
        print("- Covariance (EmpiricalCovariance)")
        print("- Neighbors (KNeighbors)")
    except ImportError as e:
        print(f"SPMD modules not available: {e}")
```

### Hybrid Preview + SPMD Example

```python
# Advanced example combining Preview and SPMD features
import os
import numpy as np

# Enable both preview and SPMD
os.environ['SKLEARNEX_PREVIEW'] = '1'
os.environ['ONEAPI_DAAL_SPMD'] = '1'

try:
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    # Generate large-scale synthetic dataset
    np.random.seed(42 + rank)
    local_samples = 5000
    n_features = 100

    X_local = np.random.randn(local_samples, n_features)

    if rank == 0:
        print("=== Hybrid Preview + SPMD Workflow ===")
        print(f"Dataset: {size * local_samples} samples, {n_features} features")
        print(f"Processes: {size}")

    # Step 1: Distributed statistics with SPMD
    from sklearnex.spmd.basic_statistics import BasicStatistics

    stats = BasicStatistics(result_options=['mean', 'variance'])
    stats.fit(X_local)

    if rank == 0:
        print("\nStep 1 - Global Statistics:")
        print(f"Mean range: [{stats.mean_.min():.3f}, {stats.mean_.max():.3f}]")
        print(f"Variance range: [{stats.variance_.min():.3f}, {stats.variance_.max():.3f}]")

    # Step 2: Local preprocessing
    # Standardize each process's data using the global statistics
    X_standardized = (X_local - stats.mean_) / np.sqrt(stats.variance_)

    # Step 3: Distributed clustering
    from sklearnex.spmd.cluster import KMeans

    kmeans = KMeans(n_clusters=10, n_init=3, random_state=42)
    kmeans.fit(X_standardized)

    if rank == 0:
        print("\nStep 3 - Distributed Clustering:")
        print(f"Global inertia: {kmeans.inertia_:.2f}")
        print(f"Iterations: {kmeans.n_iter_}")

    # Step 4: Local analysis on cluster assignments
    local_labels = kmeans.predict(X_standardized)
    local_cluster_counts = np.bincount(local_labels, minlength=10)

    # Aggregate cluster counts across all processes
    global_cluster_counts = comm.allreduce(local_cluster_counts, op=MPI.SUM)

    if rank == 0:
        print("\nStep 4 - Global Cluster Analysis:")
        for i, count in enumerate(global_cluster_counts):
            percentage = 100 * count / (size * local_samples)
            print(f"Cluster {i}: {count} samples ({percentage:.1f}%)")

    if rank == 0:
        print("\nWorkflow completed successfully!")
        print(f"Total computation distributed across {size} processes")

except ImportError as e:
    print(f"Advanced features require MPI: {e}")
    print("This example demonstrates the potential of combining:")
    print("- Preview APIs for enhanced algorithms")
    print("- SPMD for distributed computation")
    print("- Hybrid workflows for large-scale ML")
```

### Environment and Configuration

```python
import os
import sys

def setup_advanced_features():
    """Setup and verify advanced feature availability."""

    print("=== Advanced Features Configuration ===")

    # Preview API setup
    os.environ['SKLEARNEX_PREVIEW'] = '1'
    print("✓ Preview API enabled")

    # Check available preview modules
    try:
        from sklearnex import preview
        print("✓ Preview modules available:")
        print("  - preview.cluster (enhanced K-means)")
        print("  - preview.covariance (advanced covariance)")
        print("  - preview.decomposition (enhanced PCA)")
        print("  - preview.linear_model (improved Ridge)")
    except ImportError as e:
        print(f"✗ Preview modules error: {e}")

    # SPMD setup check
    try:
        from mpi4py import MPI
        comm = MPI.COMM_WORLD
        rank = comm.Get_rank()
        size = comm.Get_size()
        print(f"✓ MPI available: rank {rank} of {size}")

        os.environ['ONEAPI_DAAL_SPMD'] = '1'
        print("✓ SPMD mode enabled")

        try:
            from sklearnex import spmd
            print("✓ SPMD modules available:")
            print("  - spmd.basic_statistics")
            print("  - spmd.cluster")
            print("  - spmd.linear_model")
            print("  - spmd.ensemble")
            print("  - spmd.decomposition")
        except ImportError as e:
            print(f"✗ SPMD modules error: {e}")

    except ImportError:
        print("✗ MPI not available (install mpi4py for SPMD)")

    # oneDAL configuration
    dalroot = os.environ.get('DALROOT')
    if dalroot:
        print(f"✓ oneDAL root: {dalroot}")
    else:
        print("ℹ oneDAL root not set (may use system installation)")

    # Memory and threading info
    print("\nSystem Information:")
    print(f"Python version: {sys.version}")
    print(f"Available CPU cores: {os.cpu_count()}")

    # Threading environment variables
    threading_vars = ['OMP_NUM_THREADS', 'MKL_NUM_THREADS', 'NUMBA_NUM_THREADS']
    for var in threading_vars:
        value = os.environ.get(var, 'not set')
        print(f"{var}: {value}")

if __name__ == "__main__":
    setup_advanced_features()
```

## Performance and Scaling Notes

### Preview API Performance
- Preview estimators may perform differently from their stable counterparts
- Some preview algorithms are optimized for specific hardware configurations
- Memory usage may differ from the standard implementations
- API stability is not guaranteed, because these features are experimental

### SPMD Scaling Characteristics
- Near-linear scaling is achievable when data is distributed evenly across processes
- Communication overhead increases with the number of processes
- Performance is typically best with 2-16 processes per node
- Memory requirements are spread across all processes
- Network bandwidth matters for large-scale deployments
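
Since each process holds only its own rows, the per-process footprint of a dense array can be estimated before launching a job. The sketch below is a back-of-the-envelope calculation assuming 8-byte (`float64`) values and counting only the input array, not any working memory the algorithm needs; `local_memory_mb` is a hypothetical helper.

```python
def local_memory_mb(n_samples: int, n_features: int, itemsize: int = 8) -> float:
    """Size in MiB of a dense n_samples x n_features array.

    itemsize=8 corresponds to float64; use 4 for float32.
    """
    return n_samples * n_features * itemsize / 1024**2


if __name__ == "__main__":
    per_process = local_memory_mb(2500, 30)
    n_processes = 4
    print(f"Per process: ~{per_process:.2f} MB")
    print(f"Across {n_processes} processes: ~{n_processes * per_process:.2f} MB")
```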

### Best Practices
- Test preview features thoroughly before production use
- Monitor SPMD communication patterns to find performance bottlenecks
- Use batch sizes appropriate for distributed processing
- Balance computation and communication costs
- Validate results against single-node implementations
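
Validating a distributed result against a single-node run usually means comparing fitted values within a tolerance, because the order of floating-point reductions differs between the two. A minimal sketch using only the standard library follows; `results_match` and its default tolerances are assumptions chosen for illustration.

```python
import math
from typing import Sequence


def results_match(
    distributed: Sequence[float],
    reference: Sequence[float],
    rel_tol: float = 1e-6,
    abs_tol: float = 1e-9,
) -> bool:
    """True if both sequences have equal length and close values."""
    if len(distributed) != len(reference):
        return False
    return all(
        math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
        for a, b in zip(distributed, reference)
    )


if __name__ == "__main__":
    single_node = [0.5, -1.25, 3.0]
    spmd_run = [0.5000000001, -1.25, 3.0]
    print(results_match(spmd_run, single_node))
```

For fitted estimators, the inputs could be flattened coefficient arrays such as `model.coef_.ravel().tolist()` from each run.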

### Hardware Recommendations
- Intel CPUs for optimal oneDAL acceleration
- High-bandwidth interconnects for SPMD (InfiniBand recommended)
- Sufficient memory per node for local data portions
- NVMe storage for large dataset staging
- NUMA-aware process placement on multi-socket systems
