# Statistics and Manifold Learning

High-performance implementations of statistical analysis and manifold learning algorithms with Intel hardware acceleration. These algorithms provide significant speedups for statistical computations and dimensionality reduction on large datasets.

## Capabilities

### Basic Statistics

#### BasicStatistics

Intel-accelerated computation of basic statistical metrics with vectorized operations for large datasets.

```python { .api }
class BasicStatistics:
    """
    Basic statistics computation with Intel optimization.

    Provides efficient computation of fundamental statistical metrics
    including mean, variance, coefficient of variation, min/max,
    sums, and sums of squares.
    """

    def __init__(
        self,
        result_options='all',
        algorithm='by_default'
    ):
        """
        Initialize BasicStatistics estimator.

        Parameters:
            result_options (str or list): Statistics to compute
                ('all', 'mean', 'variance', 'variation', 'sum', 'sum_squares',
                'sum_squares_centered', 'second_order_raw_moment', 'min', 'max')
            algorithm (str): Algorithm implementation to use
        """

    def fit(self, X, y=None):
        """
        Compute basic statistics for the input data.

        Parameters:
            X (array-like): Input data of shape (n_samples, n_features)
            y: Ignored, present for API consistency

        Returns:
            self: Fitted estimator with computed statistics
        """

    def partial_fit(self, X, y=None):
        """
        Update statistics with new batch of data.

        Parameters:
            X (array-like): New batch of data
            y: Ignored

        Returns:
            self: Updated estimator
        """

    def finalize_fit(self):
        """
        Finalize the computation of statistics.

        Returns:
            self: Finalized estimator
        """

    # Attributes available after fitting
    min_: ...  # Minimum values per feature
    max_: ...  # Maximum values per feature
    sum_: ...  # Sum of values per feature
    mean_: ...  # Mean values per feature
    variance_: ...  # Variance per feature
    variation_: ...  # Coefficient of variation per feature
    sum_squares_: ...  # Sum of squares per feature
    sum_squares_centered_: ...  # Centered sum of squares per feature
    second_order_raw_moment_: ...  # Second order raw moments
    n_samples_seen_: ...  # Number of samples processed
```
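
To make the attribute names concrete, the following NumPy sketch computes the same per-feature quantities under their usual textbook definitions. It does not call the library: the helper name `reference_statistics` is hypothetical, and the `ddof=1` (sample variance) default is an assumption that should be checked against the installed version.

```python
import numpy as np

def reference_statistics(X, ddof=1):
    """Plain-NumPy reference for the per-feature statistics listed above.

    Hypothetical helper for illustration only; ddof=1 is an assumption.
    """
    X = np.asarray(X, dtype=np.float64)
    n = X.shape[0]
    mean = X.mean(axis=0)
    variance = X.var(axis=0, ddof=ddof)
    return {
        "min": X.min(axis=0),
        "max": X.max(axis=0),
        "sum": X.sum(axis=0),
        "mean": mean,
        "variance": variance,
        # Coefficient of variation: standard deviation divided by mean
        "variation": np.sqrt(variance) / mean,
        # Raw sum of squares: sum of x**2
        "sum_squares": (X ** 2).sum(axis=0),
        # Centered sum of squares: sum of (x - mean)**2
        "sum_squares_centered": ((X - mean) ** 2).sum(axis=0),
        # Second-order raw moment: mean of x**2
        "second_order_raw_moment": (X ** 2).sum(axis=0) / n,
    }

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
ref = reference_statistics(X)
for name, value in ref.items():
    print(f"{name}: {value}")
```

A reference like this is mainly useful for spot-checking results on a small array before trusting them on a large one.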

#### IncrementalBasicStatistics

Intel-accelerated incremental computation of basic statistics for streaming data.

```python { .api }
class IncrementalBasicStatistics:
    """
    Incremental basic statistics with Intel optimization.

    Enables efficient online computation of statistical metrics
    for streaming data or datasets that don't fit in memory.
    """

    def __init__(
        self,
        result_options='all',
        algorithm='by_default'
    ):
        """
        Initialize IncrementalBasicStatistics estimator.

        Parameters:
            result_options (str or list): Statistics to compute
                ('all', 'mean', 'variance', 'variation', 'sum', 'sum_squares',
                'sum_squares_centered', 'second_order_raw_moment', 'min', 'max')
            algorithm (str): Algorithm implementation to use
        """

    def partial_fit(self, X, y=None):
        """
        Update statistics incrementally with new data batch.

        Parameters:
            X (array-like): New batch of data
            y: Ignored, present for API consistency

        Returns:
            self: Updated estimator
        """

    def fit(self, X, y=None):
        """
        Compute statistics for input data (equivalent to single partial_fit).

        Parameters:
            X (array-like): Input data of shape (n_samples, n_features)
            y: Ignored

        Returns:
            self: Fitted estimator
        """

    def finalize_fit(self):
        """
        Finalize incremental statistics computation.

        Returns:
            self: Finalized estimator with complete statistics
        """

    # Attributes available after fitting
    min_: ...  # Minimum values per feature
    max_: ...  # Maximum values per feature
    sum_: ...  # Sum of values per feature
    mean_: ...  # Mean values per feature
    variance_: ...  # Variance per feature
    variation_: ...  # Coefficient of variation per feature
    sum_squares_: ...  # Sum of squares per feature
    sum_squares_centered_: ...  # Centered sum of squares per feature
    second_order_raw_moment_: ...  # Second order raw moments
    n_samples_seen_: ...  # Total number of samples processed
```
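
The idea behind incremental statistics can be sketched without the library: keep a small running state and fold each batch into it. The example below uses the standard pairwise update for count, mean, and centered sum of squares. The function name `merge_moments` is hypothetical, and this is an illustration of the technique, not a description of how the accelerated implementation works internally.

```python
import numpy as np

def merge_moments(n_a, mean_a, m2_a, batch):
    """Fold one batch into running (count, mean, centered sum of squares)."""
    batch = np.asarray(batch, dtype=np.float64)
    n_b = batch.shape[0]
    mean_b = batch.mean(axis=0)
    m2_b = ((batch - mean_b) ** 2).sum(axis=0)

    n = n_a + n_b
    delta = mean_b - mean_a
    # Weighted shift of the running mean toward the batch mean
    mean = mean_a + delta * n_b / n
    # Combine centered sums of squares, correcting for the mean shift
    m2 = m2_a + m2_b + delta ** 2 * n_a * n_b / n
    return n, mean, m2

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 3))

# Start from an empty state and process the data in batches of 100 rows
n, mean, m2 = 0, np.zeros(3), np.zeros(3)
for start in range(0, len(data), 100):
    n, mean, m2 = merge_moments(n, mean, m2, data[start:start + 100])

variance = m2 / (n - 1)  # sample variance from the accumulated state
print(np.allclose(mean, data.mean(axis=0)))
print(np.allclose(variance, data.var(axis=0, ddof=1)))
```

The state here is three small arrays regardless of how many rows have been seen, which is what makes batch-wise processing of larger-than-memory data possible.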

### Covariance Estimation

#### IncrementalEmpiricalCovariance

Intel-accelerated incremental empirical covariance estimation for streaming data and large datasets.

```python { .api }
class IncrementalEmpiricalCovariance:
    """
    Incremental empirical covariance estimation with Intel optimization.

    Efficiently computes sample covariance matrix incrementally, making it
    suitable for streaming data and datasets too large to fit in memory.
    """

    def __init__(
        self,
        store_precision=True,
        assume_centered=False
    ):
        """
        Initialize Incremental Empirical Covariance.

        Parameters:
            store_precision (bool): Whether to store precision matrix
            assume_centered (bool): Whether data is already centered
        """

    def fit(self, X, y=None):
        """
        Fit covariance model to data.

        Parameters:
            X (array-like): Training data of shape (n_samples, n_features)
            y: Ignored, present for API consistency

        Returns:
            self: Fitted estimator
        """

    def partial_fit(self, X, y=None):
        """
        Incrementally fit covariance model.

        Parameters:
            X (array-like): Data batch of shape (n_samples, n_features)
            y: Ignored

        Returns:
            self: Updated estimator
        """

    def score(self, X, y=None):
        """
        Compute log-likelihood under the model.

        Parameters:
            X (array-like): Test data
            y: Ignored

        Returns:
            float: Average log-likelihood
        """

    # Attributes available after fitting
    covariance_: ...  # Estimated covariance matrix
    location_: ...  # Estimated location (mean)
    precision_: ...  # Estimated precision matrix (if store_precision=True)
    n_samples_seen_: ...  # Number of samples processed
```
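
As a rough picture of what incremental covariance estimation involves, the sketch below accumulates per-batch sums and derives the covariance matrix on demand. The class `RunningCovariance` is hypothetical and written in plain NumPy; it divides by `n` (the maximum-likelihood convention used by scikit-learn's `EmpiricalCovariance`), and whether the accelerated estimator uses the same internal scheme is not something this example asserts.

```python
import numpy as np

class RunningCovariance:
    """Hypothetical illustration: accumulate sums, derive covariance on demand."""

    def __init__(self, n_features):
        self.n = 0
        self.sum_x = np.zeros(n_features)
        self.sum_xxt = np.zeros((n_features, n_features))

    def update(self, batch):
        """Fold one batch of shape (n_samples, n_features) into the state."""
        batch = np.asarray(batch, dtype=np.float64)
        self.n += batch.shape[0]
        self.sum_x += batch.sum(axis=0)
        self.sum_xxt += batch.T @ batch
        return self

    def covariance(self):
        """Covariance with an n denominator: E[x x^T] - mean mean^T."""
        mean = self.sum_x / self.n
        return self.sum_xxt / self.n - np.outer(mean, mean)

rng = np.random.default_rng(1)
data = rng.normal(size=(2000, 4))

rc = RunningCovariance(n_features=4)
for start in range(0, len(data), 250):
    rc.update(data[start:start + 250])

print(np.allclose(rc.covariance(), np.cov(data, rowvar=False, bias=True)))
```

The raw-sum formulation is the simplest to read but can lose precision when feature means are large relative to their spread; a production implementation would typically use a centered update instead.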

### Manifold Learning

#### t-SNE (t-Distributed Stochastic Neighbor Embedding)

Intel-accelerated t-SNE for non-linear dimensionality reduction and visualization.

```python { .api }
class TSNE:
    """
    t-distributed Stochastic Neighbor Embedding with Intel optimization.

    Provides efficient non-linear dimensionality reduction for visualization
    and exploratory data analysis with optimized gradient computations.
    """

    def __init__(
        self,
        n_components=2,
        perplexity=30.0,
        early_exaggeration=12.0,
        learning_rate='warn',
        n_iter=1000,
        n_iter_without_progress=300,
        min_grad_norm=1e-7,
        metric='euclidean',
        init='warn',
        verbose=0,
        random_state=None,
        method='barnes_hut',
        angle=0.5,
        n_jobs=None,
        square_distances='deprecated'
    ):
        """
        Initialize t-SNE estimator.

        Parameters:
            n_components (int): Dimension of embedded space (usually 2 or 3)
            perplexity (float): Related to number of nearest neighbors
            early_exaggeration (float): How tight natural clusters are in original space
            learning_rate (float or str): Learning rate for optimization
            n_iter (int): Maximum number of iterations
            n_iter_without_progress (int): Maximum iterations without progress
            min_grad_norm (float): Minimum gradient norm for early stopping
            metric (str): Distance metric to use
            init (str or array): Initialization method ('random', 'pca', array)
            verbose (int): Verbosity level
            random_state (int): Random state for reproducibility
            method (str): Algorithm to use ('barnes_hut', 'exact')
            angle (float): Trade-off between speed and accuracy for Barnes-Hut
            n_jobs (int): Number of parallel jobs
            square_distances (str): Deprecated parameter
        """

    def fit(self, X, y=None):
        """
        Fit X into an embedded space.

        Parameters:
            X (array-like): Input data of shape (n_samples, n_features)
            y: Ignored, present for API consistency

        Returns:
            self: Fitted estimator
        """

    def fit_transform(self, X, y=None):
        """
        Fit X into an embedded space and return transformed array.

        Parameters:
            X (array-like): Input data of shape (n_samples, n_features)
            y: Ignored

        Returns:
            array: Embedded coordinates of shape (n_samples, n_components)
        """

    # Attributes available after fitting
    embedding_: ...  # Stores embedding vectors
    kl_divergence_: ...  # Kullback-Leibler divergence after optimization
    n_features_in_: ...  # Number of features in input data
    n_iter_: ...  # Number of iterations run
    learning_rate_: ...  # Effective learning rate
```
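
The `perplexity` parameter is easier to reason about with a small worked example. In the standard t-SNE formulation, each point gets a Gaussian bandwidth chosen so that the perplexity (2 raised to the Shannon entropy) of its neighbor distribution equals the requested value. The sketch below does this for a single point in plain NumPy; the function names and the bisection bounds are hypothetical choices for illustration and are not part of the API above.

```python
import numpy as np

def conditional_probabilities(sq_dists, sigma):
    """Gaussian neighbor probabilities for one point, given squared distances."""
    logits = -sq_dists / (2.0 * sigma ** 2)
    logits -= logits.max()  # stabilize the exponentials
    p = np.exp(logits)
    return p / p.sum()

def perplexity_of(p):
    """Perplexity = 2 ** (Shannon entropy in bits)."""
    p = p[p > 0]
    return 2.0 ** (-(p * np.log2(p)).sum())

def sigma_for_perplexity(sq_dists, target, lo=1e-4, hi=1e4, n_steps=60):
    """Bisect on sigma (geometrically) until the perplexity matches target."""
    for _ in range(n_steps):
        mid = np.sqrt(lo * hi)
        # Perplexity grows with sigma, so shrink the upper bound if too high
        if perplexity_of(conditional_probabilities(sq_dists, mid)) > target:
            hi = mid
        else:
            lo = mid
    return np.sqrt(lo * hi)

rng = np.random.default_rng(0)
points = rng.normal(size=(200, 5))
# Squared distances from the first point to every other point
sq_dists = ((points[1:] - points[0]) ** 2).sum(axis=1)

sigma = sigma_for_perplexity(sq_dists, target=30.0)
achieved = perplexity_of(conditional_probabilities(sq_dists, sigma))
print(f"sigma={sigma:.4f}, perplexity={achieved:.4f}")
```

A uniform distribution over k neighbors has perplexity exactly k, which is why the parameter is often described as an effective number of nearest neighbors.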

## Usage Examples

### Basic Statistics Computation

```python
import numpy as np
from sklearnex.basic_statistics import BasicStatistics
from sklearn.datasets import make_regression

# Generate sample data
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# Compute basic statistics
stats = BasicStatistics(result_options='all')
stats.fit(X)

print("Basic Statistics Results:")
print(f"Data shape: {X.shape}")
print(f"Samples processed: {stats.n_samples_seen_}")

# Access computed statistics
print(f"Mean per feature: {stats.mean_}")
print(f"Variance per feature: {stats.variance_}")
print(f"Min values: {stats.min_}")
print(f"Max values: {stats.max_}")
print(f"Sum per feature: {stats.sum_}")

# Coefficient of variation (std/mean)
print(f"Coefficient of variation: {stats.variation_}")

# Statistical moments
print(f"Sum of squares: {stats.sum_squares_}")
print(f"Centered sum of squares: {stats.sum_squares_centered_}")
print(f"Second order raw moment: {stats.second_order_raw_moment_}")

# Compute specific statistics only
stats_subset = BasicStatistics(result_options=['mean', 'variance', 'min', 'max'])
stats_subset.fit(X)

print("\nSubset of statistics:")
print(f"Mean: {stats_subset.mean_}")
print(f"Variance: {stats_subset.variance_}")
print(f"Min: {stats_subset.min_}")
print(f"Max: {stats_subset.max_}")

# Verify against NumPy computations
print("\nVerification against NumPy:")
print(f"Mean matches NumPy: {np.allclose(stats.mean_, np.mean(X, axis=0))}")
# ddof=1 assumes the library reports the sample variance (n - 1 denominator);
# use ddof=0 instead if your version reports the population variance
print(f"Variance matches NumPy: {np.allclose(stats.variance_, np.var(X, axis=0, ddof=1))}")
print(f"Min matches NumPy: {np.allclose(stats.min_, np.min(X, axis=0))}")
print(f"Max matches NumPy: {np.allclose(stats.max_, np.max(X, axis=0))}")
```

### Incremental Statistics for Streaming Data

```python
import numpy as np
# BasicStatistics is needed for the batch comparison further down
from sklearnex.basic_statistics import BasicStatistics, IncrementalBasicStatistics

# Simulate streaming data
np.random.seed(42)
total_samples = 5000
batch_size = 500
n_features = 8

# Create incremental statistics estimator
inc_stats = IncrementalBasicStatistics(result_options='all')

# Process data in batches
all_data = []
for batch_idx in range(0, total_samples, batch_size):
    # Generate batch of data
    batch_data = np.random.randn(batch_size, n_features)
    all_data.append(batch_data)

    # Update statistics incrementally
    inc_stats.partial_fit(batch_data)

    print(f"Processed batch {batch_idx//batch_size + 1}: "
          f"{inc_stats.n_samples_seen_} total samples")

# Finalize computation
inc_stats.finalize_fit()

# Compare with batch computation
full_data = np.vstack(all_data)
batch_stats = BasicStatistics(result_options='all')
batch_stats.fit(full_data)

print("\nIncremental vs Batch Statistics Comparison:")
print(f"Samples processed - Incremental: {inc_stats.n_samples_seen_}, "
      f"Batch: {batch_stats.n_samples_seen_}")

# Verify results agree to floating-point tolerance
print(f"Mean identical: {np.allclose(inc_stats.mean_, batch_stats.mean_)}")
print(f"Variance identical: {np.allclose(inc_stats.variance_, batch_stats.variance_)}")
print(f"Min identical: {np.allclose(inc_stats.min_, batch_stats.min_)}")
print(f"Max identical: {np.allclose(inc_stats.max_, batch_stats.max_)}")

# Demonstrate memory efficiency for large datasets
print("\nMemory-efficient processing example:")
inc_stats_large = IncrementalBasicStatistics(result_options=['mean', 'variance'])

# Simulate processing very large dataset in small batches
n_batches = 100
batch_size = 1000

for i in range(n_batches):
    # Generate and immediately process batch (no storage)
    batch = np.random.normal(loc=i*0.1, scale=1.0, size=(batch_size, n_features))
    inc_stats_large.partial_fit(batch)

    if (i + 1) % 20 == 0:
        print(f"  Processed {inc_stats_large.n_samples_seen_} samples")

inc_stats_large.finalize_fit()
print(f"Final mean: {inc_stats_large.mean_}")
print(f"Final variance: {inc_stats_large.variance_}")
```

### t-SNE for Dimensionality Reduction and Visualization

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearnex.manifold import TSNE
from sklearn.datasets import load_digits, make_blobs

# Example 1: Digits dataset visualization
digits = load_digits()
X_digits, y_digits = digits.data, digits.target

print(f"Digits dataset shape: {X_digits.shape}")
print(f"Number of classes: {len(np.unique(y_digits))}")

# Apply t-SNE for 2D visualization
tsne = TSNE(n_components=2, perplexity=30, random_state=42, verbose=1)
X_tsne = tsne.fit_transform(X_digits)

print(f"t-SNE embedding shape: {X_tsne.shape}")
print(f"KL divergence: {tsne.kl_divergence_:.4f}")
print(f"Iterations run: {tsne.n_iter_}")

# Visualize results
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_digits, cmap='tab10', s=20, alpha=0.7)
plt.colorbar()
plt.title('t-SNE: Digits Dataset (Colored by Digit)')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')

# Example 2: High-dimensional synthetic data
X_synthetic, y_synthetic = make_blobs(
    n_samples=1000, centers=5, n_features=50,
    cluster_std=2.0, random_state=42
)

print(f"\nSynthetic dataset shape: {X_synthetic.shape}")

# t-SNE with different parameters
tsne_synthetic = TSNE(
    n_components=2,
    perplexity=50,
    early_exaggeration=12.0,
    learning_rate=200.0,
    n_iter=1000,
    random_state=42
)
X_tsne_synthetic = tsne_synthetic.fit_transform(X_synthetic)

plt.subplot(1, 2, 2)
plt.scatter(X_tsne_synthetic[:, 0], X_tsne_synthetic[:, 1],
            c=y_synthetic, cmap='viridis', s=20, alpha=0.7)
plt.colorbar()
plt.title('t-SNE: Synthetic High-D Data')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')

plt.tight_layout()
plt.show()

# Example 3: 3D embedding
tsne_3d = TSNE(n_components=3, perplexity=30, random_state=42)
X_tsne_3d = tsne_3d.fit_transform(X_digits[:500])  # Use subset for faster computation

print(f"\n3D t-SNE embedding shape: {X_tsne_3d.shape}")

# 3D visualization
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
scatter = ax.scatter(X_tsne_3d[:, 0], X_tsne_3d[:, 1], X_tsne_3d[:, 2],
                     c=y_digits[:500], cmap='tab10', s=30, alpha=0.7)
ax.set_xlabel('t-SNE Component 1')
ax.set_ylabel('t-SNE Component 2')
ax.set_zlabel('t-SNE Component 3')
ax.set_title('3D t-SNE: Digits Dataset')
plt.colorbar(scatter)
plt.show()

# Parameter sensitivity analysis
perplexity_values = [5, 15, 30, 50, 100]
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.ravel()

for i, perp in enumerate(perplexity_values):
    if i >= len(axes):
        break

    tsne_param = TSNE(n_components=2, perplexity=perp, random_state=42)
    X_param = tsne_param.fit_transform(X_digits[:1000])  # Use subset for speed

    axes[i].scatter(X_param[:, 0], X_param[:, 1], c=y_digits[:1000],
                    cmap='tab10', s=10, alpha=0.7)
    axes[i].set_title(f'Perplexity = {perp}')
    axes[i].set_xlabel('t-SNE Component 1')
    axes[i].set_ylabel('t-SNE Component 2')

# Hide the last subplot if not used
if len(perplexity_values) < len(axes):
    axes[-1].axis('off')

plt.tight_layout()
plt.show()
```

### Combined Statistics and Manifold Analysis

```python
import numpy as np
import matplotlib.pyplot as plt  # required for the plots below
from sklearnex.basic_statistics import BasicStatistics
from sklearnex.manifold import TSNE
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

# Load real-world dataset
cancer = load_breast_cancer()
X_cancer, y_cancer = cancer.data, cancer.target

print(f"Breast cancer dataset shape: {X_cancer.shape}")
print(f"Feature names: {cancer.feature_names[:5]}...")  # Show first 5 features

# Compute basic statistics on raw data
raw_stats = BasicStatistics(result_options='all')
raw_stats.fit(X_cancer)

print("\nRaw data statistics:")
print(f"Mean range: [{raw_stats.mean_.min():.2f}, {raw_stats.mean_.max():.2f}]")
print(f"Variance range: [{raw_stats.variance_.min():.2e}, {raw_stats.variance_.max():.2e}]")
print(f"Min values range: [{raw_stats.min_.min():.2f}, {raw_stats.min_.max():.2f}]")
print(f"Max values range: [{raw_stats.max_.min():.2f}, {raw_stats.max_.max():.2f}]")

# Identify features with high variance
high_var_features = np.where(raw_stats.variance_ > np.percentile(raw_stats.variance_, 90))[0]
print(f"High variance features: {[cancer.feature_names[i] for i in high_var_features]}")

# Standardize data for better t-SNE performance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_cancer)

# Compute statistics on scaled data
scaled_stats = BasicStatistics(result_options=['mean', 'variance'])
scaled_stats.fit(X_scaled)

print("\nScaled data statistics:")
print(f"Mean after scaling: {scaled_stats.mean_}")
print(f"Variance after scaling: {scaled_stats.variance_}")

# Apply t-SNE to scaled data
tsne_cancer = TSNE(
    n_components=2,
    perplexity=30,
    learning_rate=200,
    n_iter=1000,
    random_state=42,
    verbose=1
)
X_tsne_cancer = tsne_cancer.fit_transform(X_scaled)

# Analyze t-SNE embedding statistics
tsne_stats = BasicStatistics(result_options='all')
tsne_stats.fit(X_tsne_cancer)

print("\nt-SNE embedding statistics:")
print(f"Embedding mean: {tsne_stats.mean_}")
print(f"Embedding variance: {tsne_stats.variance_}")
print(f"Embedding range: [{tsne_stats.min_}, {tsne_stats.max_}]")

# Visualize results with statistics
plt.figure(figsize=(15, 5))

# In load_breast_cancer, target 0 is malignant and target 1 is benign
class_label = 'Malignant (0) / Benign (1)'

# Original data: first two features
plt.subplot(1, 3, 1)
plt.scatter(X_cancer[:, 0], X_cancer[:, 1], c=y_cancer, cmap='coolwarm', alpha=0.7)
plt.xlabel(f"{cancer.feature_names[0]}")
plt.ylabel(f"{cancer.feature_names[1]}")
plt.title("Original Data (First 2 Features)")
plt.colorbar(label=class_label)

# Scaled data: first two features
plt.subplot(1, 3, 2)
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=y_cancer, cmap='coolwarm', alpha=0.7)
plt.xlabel(f"Scaled {cancer.feature_names[0]}")
plt.ylabel(f"Scaled {cancer.feature_names[1]}")
plt.title("Scaled Data (First 2 Features)")
plt.colorbar(label=class_label)

# t-SNE embedding
plt.subplot(1, 3, 3)
plt.scatter(X_tsne_cancer[:, 0], X_tsne_cancer[:, 1], c=y_cancer, cmap='coolwarm', alpha=0.7)
plt.xlabel("t-SNE Component 1")
plt.ylabel("t-SNE Component 2")
plt.title(f"t-SNE Embedding (KL={tsne_cancer.kl_divergence_:.2f})")
plt.colorbar(label=class_label)

plt.tight_layout()
plt.show()

# Pairwise feature correlation analysis (computed with NumPy)
feature_correlations = []
for i in range(X_cancer.shape[1]):
    for j in range(i+1, X_cancer.shape[1]):
        corr = np.corrcoef(X_cancer[:, i], X_cancer[:, j])[0, 1]
        feature_correlations.append({
            'feature1': cancer.feature_names[i],
            'feature2': cancer.feature_names[j],
            'correlation': abs(corr)
        })

# Find most correlated features
feature_correlations.sort(key=lambda x: x['correlation'], reverse=True)
print("\nTop 5 most correlated feature pairs:")
for i in range(5):
    fc = feature_correlations[i]
    print(f"  {fc['feature1']} <-> {fc['feature2']}: {fc['correlation']:.3f}")
```

### Performance Comparison

```python
import time
import numpy as np
from sklearn.datasets import make_regression
from sklearn.manifold import TSNE as StandardTSNE
from sklearnex.basic_statistics import BasicStatistics as IntelStats
from sklearnex.manifold import TSNE as IntelTSNE

# Imports are done up front so that import time is not counted in the timings

# Generate large dataset for performance testing
X_large, _ = make_regression(n_samples=100000, n_features=50, random_state=42)

print("Performance comparison on large dataset:")
print(f"Dataset shape: {X_large.shape}")

# Test BasicStatistics performance
print("\nBasic Statistics Performance:")

# Intel-optimized version
start_time = time.perf_counter()
intel_stats = IntelStats(result_options='all')
intel_stats.fit(X_large)
intel_time = time.perf_counter() - start_time

print(f"Intel BasicStatistics: {intel_time:.3f} seconds")

# NumPy comparison
# ddof=1 assumes the library reports the sample variance (n - 1 denominator)
start_time = time.perf_counter()
numpy_mean = np.mean(X_large, axis=0)
numpy_var = np.var(X_large, axis=0, ddof=1)
numpy_min = np.min(X_large, axis=0)
numpy_max = np.max(X_large, axis=0)
numpy_sum = np.sum(X_large, axis=0)
numpy_time = time.perf_counter() - start_time

print(f"NumPy equivalent computations: {numpy_time:.3f} seconds")
print(f"Speedup: {numpy_time / intel_time:.1f}x")

# Verify results match to floating-point tolerance
print("Results identical:")
print(f"  Mean: {np.allclose(intel_stats.mean_, numpy_mean)}")
print(f"  Variance: {np.allclose(intel_stats.variance_, numpy_var)}")
print(f"  Min: {np.allclose(intel_stats.min_, numpy_min)}")
print(f"  Max: {np.allclose(intel_stats.max_, numpy_max)}")

# Test t-SNE performance (smaller dataset for practical timing)
X_tsne_test = X_large[:5000, :20]  # Reduce size for t-SNE timing

print(f"\nt-SNE Performance (shape: {X_tsne_test.shape}):")

# Intel-optimized version
start_time = time.perf_counter()
intel_tsne = IntelTSNE(n_components=2, perplexity=30, random_state=42, verbose=0)
intel_embedding = intel_tsne.fit_transform(X_tsne_test)
intel_tsne_time = time.perf_counter() - start_time

print(f"Intel t-SNE: {intel_tsne_time:.2f} seconds")
print(f"KL divergence: {intel_tsne.kl_divergence_:.4f}")

# Standard scikit-learn version
start_time = time.perf_counter()
standard_tsne = StandardTSNE(n_components=2, perplexity=30, random_state=42, verbose=0)
standard_embedding = standard_tsne.fit_transform(X_tsne_test)
standard_tsne_time = time.perf_counter() - start_time

print(f"Standard t-SNE: {standard_tsne_time:.2f} seconds")
print(f"KL divergence: {standard_tsne.kl_divergence_:.4f}")
print(f"Speedup: {standard_tsne_time / intel_tsne_time:.1f}x")

# Point-wise difference between the two embeddings.
# t-SNE layouts are only defined up to rotation, reflection, and translation,
# so a large value here does not by itself indicate lower quality;
# the KL divergences printed above are the more meaningful comparison.
embedding_diff = np.mean(np.abs(intel_embedding - standard_embedding))
print(f"Mean absolute difference in embeddings: {embedding_diff:.4f}")
```

## Performance Notes

- BasicStatistics shows significant speedups on datasets with >10000 samples
- IncrementalBasicStatistics enables processing of datasets larger than memory
- t-SNE optimization provides substantial improvements on high-dimensional data (>20 features)
- Statistical computations benefit most from vectorized operations on wide datasets
- Memory usage for statistics is minimal and constant
- t-SNE memory usage scales with sample count, similar to scikit-learn
- All algorithms maintain high numerical accuracy compared to standard implementations
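
Speedups of this kind depend on hardware, data shape, and library versions, so it can be worth measuring them locally. The helper below is a minimal, library-independent sketch that reports the fastest of several runs, which reduces the influence of one-off effects such as cache warm-up. The name `best_of` and the default of five repeats are illustrative choices, not part of any API described here.

```python
import time

def best_of(fn, repeats=5):
    """Return the fastest wall-clock time (seconds) of `repeats` calls to fn()."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)

# Placeholder workload; substitute e.g. lambda: estimator.fit(X)
elapsed = best_of(lambda: sum(i * i for i in range(100_000)))
print(f"Best of 5 runs: {elapsed:.4f} seconds")
```

Passing the same callable shape for both the accelerated and the baseline implementation keeps the comparison like-for-like.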