# Datasets and Data Generation

This document covers all dataset loading, fetching, generation, and utility functions in scikit-learn.

## Built-in Toy Datasets

### Classification Datasets

#### load_iris { .api }
```python
from sklearn.datasets import load_iris

load_iris(
    return_X_y: bool = False,
    as_frame: bool = False
) -> Bunch | tuple[ArrayLike, ArrayLike]
```
Load and return the iris dataset (classification).

#### load_digits { .api }
```python
from sklearn.datasets import load_digits

load_digits(
    n_class: int = 10,
    return_X_y: bool = False,
    as_frame: bool = False
) -> Bunch | tuple[ArrayLike, ArrayLike]
```
Load and return the digits dataset (classification).
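
As a small illustration of the `n_class` parameter, the sketch below restricts the digits data to the first three classes. The choice of `n_class=3` is arbitrary and only meant to show the effect on the returned labels.

```python
import numpy as np
from sklearn.datasets import load_digits

# Keep only the samples labelled 0, 1, or 2.
X, y = load_digits(n_class=3, return_X_y=True)

print(X.shape)       # (n_samples, 64): each 8x8 image is flattened to 64 features
print(np.unique(y))  # class labels present in the subset
```

With `return_X_y=True` the function returns a plain `(data, target)` tuple instead of a `Bunch`.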

#### load_wine { .api }
```python
from sklearn.datasets import load_wine

load_wine(
    return_X_y: bool = False,
    as_frame: bool = False
) -> Bunch | tuple[ArrayLike, ArrayLike]
```
Load and return the wine dataset (classification).

#### load_breast_cancer { .api }
```python
from sklearn.datasets import load_breast_cancer

load_breast_cancer(
    return_X_y: bool = False,
    as_frame: bool = False
) -> Bunch | tuple[ArrayLike, ArrayLike]
```
Load and return the breast cancer wisconsin dataset (classification).

### Regression Datasets

#### load_diabetes { .api }
```python
from sklearn.datasets import load_diabetes

load_diabetes(
    return_X_y: bool = False,
    as_frame: bool = False,
    scaled: bool = True
) -> Bunch | tuple[ArrayLike, ArrayLike]
```
Load and return the diabetes dataset (regression).

#### load_linnerud { .api }
```python
from sklearn.datasets import load_linnerud

load_linnerud(
    return_X_y: bool = False,
    as_frame: bool = False
) -> Bunch | tuple[ArrayLike, ArrayLike]
```
Load and return the linnerud dataset (multivariate regression).

### General Data Loading

#### load_files { .api }
```python
from sklearn.datasets import load_files

load_files(
    container_path: str,
    description: str | None = None,
    categories: list[str] | None = None,
    load_content: bool = True,
    shuffle: bool = True,
    encoding: str | None = None,
    decode_error: str = "strict",
    random_state: int | RandomState | None = 0
) -> Bunch
```
Load text files with categories as subfolder names.
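
To make the expected folder layout concrete, here is a minimal sketch that builds a throwaway directory with one subfolder per category and loads it. The category names (`ham`, `spam`) and file names are hypothetical and chosen only for illustration.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout: one subfolder per category, one text file per sample.
    docs = {
        "ham": ["see you at noon", "lunch tomorrow?"],
        "spam": ["win a prize now"],
    }
    for category, texts in docs.items():
        folder = Path(root) / category
        folder.mkdir()
        for i, text in enumerate(texts):
            (folder / f"doc_{i}.txt").write_text(text, encoding="utf-8")

    # Passing an encoding makes load_files decode the file contents to str.
    corpus = load_files(root, encoding="utf-8", random_state=0)

print(corpus.target_names)                    # subfolder names, used as class names
print(len(corpus.data), corpus.target.shape)  # number of documents and label array shape
```

Without `encoding`, the entries of `corpus.data` are raw `bytes` rather than `str`.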

## Sample Images

#### load_sample_images { .api }
```python
from sklearn.datasets import load_sample_images

load_sample_images() -> Bunch
```
Load sample images for image manipulation.

#### load_sample_image { .api }
```python
from sklearn.datasets import load_sample_image

load_sample_image(
    image_name: str
) -> ArrayLike
```
Load the numpy array of a single sample image.

## Real-World Datasets (Fetch Functions)

### Text Datasets

#### fetch_20newsgroups { .api }
```python
from sklearn.datasets import fetch_20newsgroups

fetch_20newsgroups(
    data_home: str | None = None,
    subset: str = "train",
    categories: list[str] | None = None,
    shuffle: bool = True,
    random_state: int | RandomState | None = 42,
    remove: tuple | None = (),
    download_if_missing: bool = True,
    return_X_y: bool = False
) -> Bunch | tuple[list[str], ArrayLike]
```
Load the filenames and data from the 20 newsgroups dataset.

#### fetch_20newsgroups_vectorized { .api }
```python
from sklearn.datasets import fetch_20newsgroups_vectorized

fetch_20newsgroups_vectorized(
    subset: str = "train",
    remove: tuple = (),
    data_home: str | None = None,
    download_if_missing: bool = True,
    return_X_y: bool = False,
    normalize: bool = True,
    as_frame: bool = False
) -> Bunch | tuple[ArrayLike, ArrayLike]
```
Load the 20 newsgroups dataset and vectorize it.

#### fetch_rcv1 { .api }
```python
from sklearn.datasets import fetch_rcv1

fetch_rcv1(
    data_home: str | None = None,
    subset: str = "all",
    download_if_missing: bool = True,
    random_state: int | RandomState | None = None,
    shuffle: bool = False,
    return_X_y: bool = False
) -> Bunch | tuple[ArrayLike, ArrayLike]
```
Load the RCV1 multilabel dataset.

### Computer Vision Datasets

#### fetch_lfw_people { .api }
```python
from sklearn.datasets import fetch_lfw_people

fetch_lfw_people(
    data_home: str | None = None,
    funneled: bool = True,
    resize: float = 0.5,
    min_faces_per_person: int = 0,
    color: bool = False,
    slice_: tuple | None = (slice(70, 195), slice(78, 172)),
    download_if_missing: bool = True,
    return_X_y: bool = False
) -> Bunch | tuple[ArrayLike, ArrayLike]
```
Load the Labeled Faces in the Wild (LFW) people dataset.

#### fetch_lfw_pairs { .api }
```python
from sklearn.datasets import fetch_lfw_pairs

fetch_lfw_pairs(
    subset: str = "train",
    data_home: str | None = None,
    funneled: bool = True,
    resize: float = 0.5,
    color: bool = False,
    slice_: tuple | None = (slice(70, 195), slice(78, 172)),
    download_if_missing: bool = True
) -> Bunch
```
Load the Labeled Faces in the Wild (LFW) pairs dataset.

#### fetch_olivetti_faces { .api }
```python
from sklearn.datasets import fetch_olivetti_faces

fetch_olivetti_faces(
    data_home: str | None = None,
    shuffle: bool = False,
    random_state: int | RandomState | None = 0,
    download_if_missing: bool = True,
    return_X_y: bool = False
) -> Bunch | tuple[ArrayLike, ArrayLike]
```
Load the Olivetti faces dataset.

### Real Estate and Regression Datasets

#### fetch_california_housing { .api }
```python
from sklearn.datasets import fetch_california_housing

fetch_california_housing(
    data_home: str | None = None,
    download_if_missing: bool = True,
    return_X_y: bool = False,
    as_frame: bool = False
) -> Bunch | tuple[ArrayLike, ArrayLike]
```
Load the California housing dataset.

### Network Security Datasets

#### fetch_kddcup99 { .api }
```python
from sklearn.datasets import fetch_kddcup99

fetch_kddcup99(
    subset: str | None = None,
    data_home: str | None = None,
    shuffle: bool = False,
    random_state: int | RandomState | None = None,
    percent10: bool = True,
    download_if_missing: bool = True,
    return_X_y: bool = False,
    as_frame: bool = False
) -> Bunch | tuple[ArrayLike, ArrayLike]
```
Load the kddcup99 dataset.

### Environmental Datasets

#### fetch_covtype { .api }
```python
from sklearn.datasets import fetch_covtype

fetch_covtype(
    data_home: str | None = None,
    download_if_missing: bool = True,
    random_state: int | RandomState | None = None,
    shuffle: bool = False,
    return_X_y: bool = False,
    as_frame: bool = False
) -> Bunch | tuple[ArrayLike, ArrayLike]
```
Load the covertype dataset.

#### fetch_species_distributions { .api }
```python
from sklearn.datasets import fetch_species_distributions

fetch_species_distributions(
    data_home: str | None = None,
    download_if_missing: bool = True
) -> Bunch
```
Loader for species distribution dataset.

### OpenML Integration

#### fetch_openml { .api }
```python
from sklearn.datasets import fetch_openml

fetch_openml(
    name: str | int | None = None,
    version: int | str = "active",
    data_id: int | None = None,
    data_home: str | None = None,
    target_column: str | list | None = "default-target",
    cache: bool = True,
    return_X_y: bool = False,
    as_frame: bool | str = "auto",
    n_retries: int = 3,
    delay: float = 1.0,
    parser: str = "auto",
    read_csv_kwargs: dict | None = None
) -> Bunch | tuple[ArrayLike, ArrayLike]
```
Fetch dataset from openml by name or dataset id.

### General File Fetching

#### fetch_file { .api }
```python
from sklearn.datasets import fetch_file

fetch_file(
    url: str,
    folder: str | Path | None = None,
    local_filename: str | None = None,
    sha256: str | None = None,
    n_retries: int = 3,
    delay: int = 1
) -> Path
```
Fetch a file from the web if it is not already present in the local folder, and return the path to the local copy. Available in scikit-learn 1.6 and later.

## Synthetic Data Generation

### Classification Data Generation

#### make_classification { .api }
```python
from sklearn.datasets import make_classification

make_classification(
    n_samples: int = 100,
    n_features: int = 20,
    n_informative: int = 2,
    n_redundant: int = 2,
    n_repeated: int = 0,
    n_classes: int = 2,
    n_clusters_per_class: int = 2,
    weights: ArrayLike | None = None,
    flip_y: float = 0.01,
    class_sep: float = 1.0,
    hypercube: bool = True,
    shift: float | ArrayLike | None = 0.0,
    scale: float | ArrayLike | None = 1.0,
    shuffle: bool = True,
    random_state: int | RandomState | None = None
) -> tuple[ArrayLike, ArrayLike] | tuple[ArrayLike, ArrayLike, ArrayLike]
```
Generate a random n-class classification problem.

#### make_multilabel_classification { .api }
```python
from sklearn.datasets import make_multilabel_classification

make_multilabel_classification(
    n_samples: int = 100,
    n_features: int = 20,
    n_classes: int = 5,
    n_labels: int = 2,
    length: int = 50,
    allow_unlabeled: bool = True,
    sparse: bool = False,
    return_indicator: str = "dense",
    return_distributions: bool = False,
    random_state: int | RandomState | None = None
) -> tuple[ArrayLike, ArrayLike] | tuple[ArrayLike, ArrayLike, ArrayLike, ArrayLike]
```
Generate a random multilabel classification problem.
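
With the default `return_indicator="dense"`, the labels come back as a binary indicator matrix with one column per class. The sketch below uses small, arbitrary sizes to show the shapes involved.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification

X, Y = make_multilabel_classification(
    n_samples=50, n_features=8, n_classes=4, n_labels=2, random_state=0
)

print(X.shape, Y.shape)  # feature matrix and (n_samples, n_classes) indicator matrix
print(Y[:3])             # each row marks which classes apply to that sample
```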

#### make_hastie_10_2 { .api }
```python
from sklearn.datasets import make_hastie_10_2

make_hastie_10_2(
    n_samples: int = 12000,
    random_state: int | RandomState | None = None
) -> tuple[ArrayLike, ArrayLike]
```
Generate data for binary classification used in Hastie et al. 2009.

#### make_gaussian_quantiles { .api }
```python
from sklearn.datasets import make_gaussian_quantiles

make_gaussian_quantiles(
    mean: ArrayLike | None = None,
    cov: float = 1.0,
    n_samples: int = 100,
    n_features: int = 2,
    n_classes: int = 3,
    shuffle: bool = True,
    random_state: int | RandomState | None = None
) -> tuple[ArrayLike, ArrayLike]
```
Generate isotropic Gaussian and label samples by quantile.

### Regression Data Generation

#### make_regression { .api }
```python
from sklearn.datasets import make_regression

make_regression(
    n_samples: int = 100,
    n_features: int = 100,
    n_informative: int = 10,
    n_targets: int = 1,
    bias: float = 0.0,
    effective_rank: int | None = None,
    tail_strength: float = 0.5,
    noise: float = 0.0,
    shuffle: bool = True,
    coef: bool = False,
    random_state: int | RandomState | None = None
) -> tuple[ArrayLike, ArrayLike] | tuple[ArrayLike, ArrayLike, ArrayLike]
```
Generate a random regression problem.
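
The three-element return value appears when `coef=True`. As a rough check of what the generator does, the sketch below turns noise off so that the targets should equal the features times the returned coefficients, with only `n_informative` coefficients being non-zero.

```python
import numpy as np
from sklearn.datasets import make_regression

X, y, coef = make_regression(
    n_samples=200, n_features=6, n_informative=3,
    noise=0.0, coef=True, random_state=0,
)

print(np.count_nonzero(coef))    # number of features that actually drive the target
print(np.allclose(X @ coef, y))  # with noise=0 and bias=0, y is an exact linear function of X
```

Having the ground-truth coefficients is convenient for checking whether a fitted linear model recovers the informative features.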

#### make_friedman1 { .api }
```python
from sklearn.datasets import make_friedman1

make_friedman1(
    n_samples: int = 100,
    n_features: int = 10,
    noise: float = 0.0,
    random_state: int | RandomState | None = None
) -> tuple[ArrayLike, ArrayLike]
```
Generate the "Friedman #1" regression problem.
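
For reference, the scikit-learn documentation states the Friedman #1 target as `10 * sin(pi * x0 * x1) + 20 * (x2 - 0.5)**2 + 10 * x3 + 5 * x4` plus Gaussian noise, with the remaining features unused. The sketch below recomputes that expression with noise disabled and compares it to the generated target.

```python
import numpy as np
from sklearn.datasets import make_friedman1

X, y = make_friedman1(n_samples=100, n_features=10, noise=0.0, random_state=0)

# Recompute the target from the first five features only.
expected = (
    10 * np.sin(np.pi * X[:, 0] * X[:, 1])
    + 20 * (X[:, 2] - 0.5) ** 2
    + 10 * X[:, 3]
    + 5 * X[:, 4]
)

print(np.allclose(y, expected))  # features 5..9 carry no signal
```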

#### make_friedman2 { .api }
```python
from sklearn.datasets import make_friedman2

make_friedman2(
    n_samples: int = 100,
    noise: float = 0.0,
    random_state: int | RandomState | None = None
) -> tuple[ArrayLike, ArrayLike]
```
Generate the "Friedman #2" regression problem.

#### make_friedman3 { .api }
```python
from sklearn.datasets import make_friedman3

make_friedman3(
    n_samples: int = 100,
    noise: float = 0.0,
    random_state: int | RandomState | None = None
) -> tuple[ArrayLike, ArrayLike]
```
Generate the "Friedman #3" regression problem.

#### make_sparse_uncorrelated { .api }
```python
from sklearn.datasets import make_sparse_uncorrelated

make_sparse_uncorrelated(
    n_samples: int = 100,
    n_features: int = 10,
    random_state: int | RandomState | None = None
) -> tuple[ArrayLike, ArrayLike]
```
Generate a random regression problem with sparse uncorrelated design.

### Clustering Data Generation

#### make_blobs { .api }
```python
from sklearn.datasets import make_blobs

make_blobs(
    n_samples: int | ArrayLike = 100,
    n_features: int = 2,
    centers: int | ArrayLike | None = None,
    cluster_std: float | ArrayLike = 1.0,
    center_box: tuple[float, float] = (-10.0, 10.0),
    shuffle: bool = True,
    random_state: int | RandomState | None = None,
    return_centers: bool = False
) -> tuple[ArrayLike, ArrayLike] | tuple[ArrayLike, ArrayLike, ArrayLike]
```
Generate isotropic Gaussian blobs for clustering.
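
Two less obvious options are passing a sequence for `n_samples` (one size per cluster) and `return_centers=True` (which adds the cluster centers as a third return value). The cluster sizes below are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_blobs

# One entry per cluster: three clusters with 10, 20, and 30 points.
X, y, centers = make_blobs(
    n_samples=[10, 20, 30], n_features=2, return_centers=True, random_state=0
)

print(np.bincount(y))  # points per cluster label
print(centers.shape)   # one center per cluster, one column per feature
```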

#### make_circles { .api }
```python
from sklearn.datasets import make_circles

make_circles(
    n_samples: int | tuple[int, int] = 100,
    shuffle: bool = True,
    noise: float | None = None,
    random_state: int | RandomState | None = None,
    factor: float = 0.8
) -> tuple[ArrayLike, ArrayLike]
```
Make a large circle containing a smaller circle in 2d.
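
The `factor` parameter is the ratio between the inner and outer circle radii. A quick way to see this, sketched below with noise disabled, is to compute each point's distance from the origin per class.

```python
import numpy as np
from sklearn.datasets import make_circles

X, y = make_circles(n_samples=200, noise=None, factor=0.5, random_state=0)

radii = np.linalg.norm(X, axis=1)
print(radii[y == 0].round(3)[:3])  # outer circle (label 0): radius 1
print(radii[y == 1].round(3)[:3])  # inner circle (label 1): radius equal to factor
```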

#### make_moons { .api }
```python
from sklearn.datasets import make_moons

make_moons(
    n_samples: int | tuple[int, int] = 100,
    shuffle: bool = True,
    noise: float | None = None,
    random_state: int | RandomState | None = None
) -> tuple[ArrayLike, ArrayLike]
```
Make two interleaving half circles.

### Manifold Data Generation

#### make_swiss_roll { .api }
```python
from sklearn.datasets import make_swiss_roll

make_swiss_roll(
    n_samples: int = 100,
    noise: float = 0.0,
    random_state: int | RandomState | None = None,
    hole: bool = False
) -> tuple[ArrayLike, ArrayLike]
```
Generate a swiss roll dataset.
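
The second return value is the position of each sample along the roll, not a class label. The sketch below, with noise disabled, checks that the first and third coordinates trace the spiral `(t * cos(t), t * sin(t))`.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll

X, t = make_swiss_roll(n_samples=300, noise=0.0, random_state=0)

print(X.shape, t.shape)                          # 3-D points and their 1-D manifold coordinate
print(np.allclose(X[:, 0], t * np.cos(t)))       # x follows the spiral
print(np.allclose(X[:, 2], t * np.sin(t)))       # z follows the spiral
```

Because `t` varies smoothly along the manifold, it is commonly used to color plots when evaluating manifold learning methods.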

#### make_s_curve { .api }
```python
from sklearn.datasets import make_s_curve

make_s_curve(
    n_samples: int = 100,
    noise: float = 0.0,
    random_state: int | RandomState | None = None
) -> tuple[ArrayLike, ArrayLike]
```
Generate an S curve dataset.

### Biclustering Data Generation

#### make_biclusters { .api }
```python
from sklearn.datasets import make_biclusters

make_biclusters(
    shape: tuple[int, int],
    n_clusters: int,
    noise: float = 0.0,
    minval: int = 10,
    maxval: int = 100,
    shuffle: bool = True,
    random_state: int | RandomState | None = None
) -> tuple[ArrayLike, ArrayLike, ArrayLike]
```
Generate an array with constant block diagonal structure.
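
The three return values are the data matrix plus boolean membership indicators for rows and columns. The sketch below uses an arbitrary 30 x 20 matrix with three biclusters to show their shapes.

```python
import numpy as np
from sklearn.datasets import make_biclusters

data, rows, cols = make_biclusters(
    shape=(30, 20), n_clusters=3, noise=0.0, random_state=0
)

print(data.shape)         # the generated matrix
print(rows.shape)         # (n_clusters, n_rows): rows[i, r] is True if row r is in bicluster i
print(cols.shape)         # (n_clusters, n_cols): same for columns
print(rows.sum(axis=0))   # every row belongs to exactly one bicluster
```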

#### make_checkerboard { .api }
```python
from sklearn.datasets import make_checkerboard

make_checkerboard(
    shape: tuple[int, int],
    n_clusters: int | tuple[int, int],
    noise: float = 0.0,
    minval: int = 10,
    maxval: int = 100,
    shuffle: bool = True,
    random_state: int | RandomState | None = None
) -> tuple[ArrayLike, ArrayLike, ArrayLike]
```
Generate an array with block checkerboard structure.

### Matrix Generation

#### make_low_rank_matrix { .api }
```python
from sklearn.datasets import make_low_rank_matrix

make_low_rank_matrix(
    n_samples: int = 100,
    n_features: int = 100,
    effective_rank: int = 10,
    tail_strength: float = 0.5,
    random_state: int | RandomState | None = None
) -> ArrayLike
```
Generate a mostly low rank matrix with bell-shaped singular values.

#### make_sparse_coded_signal { .api }
```python
from sklearn.datasets import make_sparse_coded_signal

make_sparse_coded_signal(
    n_samples: int,
    n_components: int,
    n_features: int,
    n_nonzero_coefs: int,
    random_state: int | RandomState | None = None
) -> tuple[ArrayLike, ArrayLike, ArrayLike]
```
Generate a signal as a sparse combination of dictionary elements.

#### make_spd_matrix { .api }
```python
from sklearn.datasets import make_spd_matrix

make_spd_matrix(
    n_dim: int,
    random_state: int | RandomState | None = None
) -> ArrayLike
```
Generate a random symmetric, positive-definite matrix.
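
Both properties named in the description can be verified numerically. The sketch below generates a small matrix (the size 4 is arbitrary) and checks symmetry and the sign of its eigenvalues.

```python
import numpy as np
from sklearn.datasets import make_spd_matrix

A = make_spd_matrix(n_dim=4, random_state=0)

is_symmetric = np.allclose(A, A.T)
min_eigenvalue = np.linalg.eigvalsh(A).min()  # eigvalsh is meant for symmetric matrices

print(A.shape, is_symmetric, min_eigenvalue > 0)
```

Matrices like this are useful as synthetic covariance matrices, for example when sampling from a multivariate normal distribution.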

#### make_sparse_spd_matrix { .api }
```python
from sklearn.datasets import make_sparse_spd_matrix

make_sparse_spd_matrix(
    dim: int = 1,
    alpha: float = 0.95,
    norm_diag: bool = False,
    smallest_coef: float = 0.1,
    largest_coef: float = 0.9,
    random_state: int | RandomState | None = None
) -> ArrayLike
```
Generate a sparse symmetric definite positive matrix.

## File I/O Utilities

### SVMLight Format

#### load_svmlight_file { .api }
```python
from sklearn.datasets import load_svmlight_file

load_svmlight_file(
    f: str | IO,
    n_features: int | None = None,
    dtype: type = ...,
    multilabel: bool = False,
    zero_based: bool | str = "auto",
    query_id: bool = False,
    offset: int = 0,
    length: int = -1
) -> tuple[ArrayLike, ArrayLike] | tuple[ArrayLike, ArrayLike, ArrayLike]
```
Load datasets in the svmlight / libsvm format into sparse CSR matrix.

#### load_svmlight_files { .api }
```python
from sklearn.datasets import load_svmlight_files

load_svmlight_files(
    files: list[str | IO],
    n_features: int | None = None,
    dtype: type = ...,
    multilabel: bool = False,
    zero_based: bool | str = "auto",
    query_id: bool = False,
    offset: int = 0,
    length: int = -1
) -> list[ArrayLike]
```
Load datasets from multiple files in SVMlight format. The result is a flat list `[X1, y1, ..., Xn, yn]` (or `[X1, y1, q1, ..., Xn, yn, qn]` when `query_id=True`), and every `X` has the same number of features.
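
The practical difference from calling `load_svmlight_file` once per file is that all returned matrices share the same number of columns, and the results come back as one flat list. The sketch below writes two small files with different feature counts to a temporary directory; the file names are hypothetical.

```python
import os
import tempfile

import numpy as np
from sklearn.datasets import dump_svmlight_file, load_svmlight_files

# Two toy datasets with 3 and 5 features respectively.
X_a, y_a = np.ones((4, 3)), np.array([0, 1, 0, 1])
X_b, y_b = np.ones((2, 5)), np.array([1, 0])

with tempfile.TemporaryDirectory() as tmp:
    path_a = os.path.join(tmp, "a.svmlight")  # hypothetical file names
    path_b = os.path.join(tmp, "b.svmlight")
    dump_svmlight_file(X_a, y_a, path_a, zero_based=True)
    dump_svmlight_file(X_b, y_b, path_b, zero_based=True)

    # The result is a flat list: [X1, y1, X2, y2].
    Xa_loaded, ya_loaded, Xb_loaded, yb_loaded = load_svmlight_files(
        [path_a, path_b], zero_based=True
    )

print(Xa_loaded.shape, Xb_loaded.shape)  # both matrices share the widest feature count
```

Passing `zero_based=True` explicitly on both the write and read side avoids relying on the `"auto"` heuristic for column indexing.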

#### dump_svmlight_file { .api }
```python
from sklearn.datasets import dump_svmlight_file

dump_svmlight_file(
    X: ArrayLike,
    y: ArrayLike,
    f: str | IO,
    zero_based: bool = True,
    comment: str | bytes | None = None,
    query_id: ArrayLike | None = None,
    multilabel: bool = False
) -> None
```
Dump the dataset in svmlight / libsvm file format.

## Data Directory Management

#### get_data_home { .api }
```python
from sklearn.datasets import get_data_home

get_data_home(
    data_home: str | None = None
) -> str
```
Return the path to scikit-learn data dir.
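
When `data_home` is given explicitly, the function resolves that path and creates the folder if it does not exist yet. The sketch below points it at a subfolder of a temporary directory; the folder name `my_sklearn_cache` is hypothetical.

```python
import os
import tempfile

from sklearn.datasets import get_data_home

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "my_sklearn_cache")  # hypothetical cache location

    returned = get_data_home(data_home=target)
    created = os.path.isdir(target)  # the directory is created on demand

print(os.path.basename(str(returned)), created)
```

The same `data_home` value can then be passed to the `fetch_*` functions so that downloads are cached in that folder.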

#### clear_data_home { .api }
```python
from sklearn.datasets import clear_data_home

clear_data_home(
    data_home: str | None = None
) -> None
```
Delete all the content in the data home cache.

## Examples

### Loading Built-in Datasets

```python
from sklearn.datasets import load_iris, load_digits, load_wine

# Load iris dataset
iris = load_iris()
X_iris, y_iris = iris.data, iris.target
print(f"Iris dataset: {X_iris.shape}, classes: {len(iris.target_names)}")

# Load digits dataset
digits = load_digits(n_class=10)
X_digits, y_digits = digits.data, digits.target
print(f"Digits dataset: {X_digits.shape}")

# Load wine dataset as tuple
X_wine, y_wine = load_wine(return_X_y=True)
print(f"Wine dataset: {X_wine.shape}")

# Load as pandas DataFrame
wine_frame = load_wine(as_frame=True)
df = wine_frame.frame
print(df.head())
```

### Fetching Real-World Datasets

```python
from sklearn.datasets import fetch_california_housing, fetch_20newsgroups

# Fetch California housing dataset
housing = fetch_california_housing()
X_housing, y_housing = housing.data, housing.target
print(f"Housing dataset: {X_housing.shape}")
print(f"Features: {housing.feature_names}")

# Fetch text data (20 newsgroups)
newsgroups = fetch_20newsgroups(
    subset='train',
    categories=['alt.atheism', 'sci.space']
)
print(f"Newsgroups: {len(newsgroups.data)} documents")
print(f"Categories: {newsgroups.target_names}")
```

### Generating Synthetic Data

```python
from sklearn.datasets import (
    make_classification, make_regression, make_blobs,
    make_circles, make_moons
)

# Classification data
X_clf, y_clf = make_classification(
    n_samples=1000, n_features=20, n_informative=10,
    n_redundant=5, n_classes=3, random_state=42
)
print(f"Classification data: {X_clf.shape}")

# Regression data
X_reg, y_reg = make_regression(
    n_samples=1000, n_features=20, n_informative=10,
    noise=0.1, random_state=42
)
print(f"Regression data: {X_reg.shape}")

# Clustering data - blobs
X_blobs, y_blobs = make_blobs(
    n_samples=300, centers=4, n_features=2,
    random_state=42, cluster_std=0.8
)

# Non-linear clustering data
X_circles, y_circles = make_circles(
    n_samples=300, noise=0.05, factor=0.6, random_state=42
)

X_moons, y_moons = make_moons(
    n_samples=300, noise=0.1, random_state=42
)

print(f"Blobs: {X_blobs.shape}, Circles: {X_circles.shape}, Moons: {X_moons.shape}")
```

### Manifold Learning Data

```python
from sklearn.datasets import make_swiss_roll, make_s_curve

# Generate swiss roll manifold
X_swiss, t_swiss = make_swiss_roll(n_samples=1000, noise=0.1, random_state=42)
print(f"Swiss roll: {X_swiss.shape}")

# Generate S-curve manifold
X_s_curve, t_s_curve = make_s_curve(n_samples=1000, noise=0.1, random_state=42)
print(f"S-curve: {X_s_curve.shape}")
```

### Working with SVMLight Format

```python
import os
import tempfile

from sklearn.datasets import (
    dump_svmlight_file, load_svmlight_file, make_classification
)

# Create sample data
X, y = make_classification(n_samples=100, n_features=10, random_state=42)

# Work inside a temporary directory, which is removed automatically on exit
with tempfile.TemporaryDirectory() as tmp_dir:
    filename = os.path.join(tmp_dir, "data.svmlight")

    # Save to SVMLight format
    dump_svmlight_file(X, y, filename)

    # Load from SVMLight format; X_loaded is a SciPy sparse CSR matrix.
    # Passing n_features keeps the column count identical to the original.
    X_loaded, y_loaded = load_svmlight_file(filename, n_features=X.shape[1])

print(f"Original: {X.shape}, Loaded: {X_loaded.shape}")
```

### Custom Dataset Creation

```python
import numpy as np
from sklearn.utils import Bunch

def create_custom_dataset(n_samples=100, seed=42):
    """Create a custom dataset with specific characteristics."""
    # Use a local generator so the global NumPy random state is left untouched
    rng = np.random.default_rng(seed)

    # Generate features
    X = rng.standard_normal((n_samples, 5))

    # Create target with specific pattern
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    # Create a Bunch object similar to sklearn datasets
    return Bunch(
        data=X,
        target=y,
        feature_names=[f'feature_{i}' for i in range(5)],
        target_names=['class_0', 'class_1'],
        DESCR='Custom synthetic dataset'
    )

# Use custom dataset
custom_data = create_custom_dataset(500)
print(f"Custom dataset: {custom_data.data.shape}")
print(f"Features: {custom_data.feature_names}")
```