Tessl Tile for pypi/scikit-learn@1.7.0

# Datasets and Data Generation

This document covers all dataset loading, fetching, generation, and utility functions in scikit-learn.

## Built-in Toy Datasets

### Classification Datasets

#### load_iris { .api }
```python
from sklearn.datasets import load_iris

load_iris(
    return_X_y: bool = False,
    as_frame: bool = False
) -> Bunch | tuple[ArrayLike, ArrayLike]
```
Load and return the iris dataset (classification).

#### load_digits { .api }
```python
from sklearn.datasets import load_digits

load_digits(
    n_class: int = 10,
    return_X_y: bool = False,
    as_frame: bool = False
) -> Bunch | tuple[ArrayLike, ArrayLike]
```
Load and return the digits dataset (classification).

#### load_wine { .api }
```python
from sklearn.datasets import load_wine

load_wine(
    return_X_y: bool = False,
    as_frame: bool = False
) -> Bunch | tuple[ArrayLike, ArrayLike]
```
Load and return the wine dataset (classification).

#### load_breast_cancer { .api }
```python
from sklearn.datasets import load_breast_cancer

load_breast_cancer(
    return_X_y: bool = False,
    as_frame: bool = False
) -> Bunch | tuple[ArrayLike, ArrayLike]
```
Load and return the breast cancer wisconsin dataset (classification).

### Regression Datasets

#### load_diabetes { .api }
```python
from sklearn.datasets import load_diabetes

load_diabetes(
    return_X_y: bool = False,
    as_frame: bool = False,
    scaled: bool = True
) -> Bunch | tuple[ArrayLike, ArrayLike]
```
Load and return the diabetes dataset (regression).

#### load_linnerud { .api }
```python
from sklearn.datasets import load_linnerud

load_linnerud(
    return_X_y: bool = False,
    as_frame: bool = False
) -> Bunch | tuple[ArrayLike, ArrayLike]
```
Load and return the linnerud dataset (multivariate regression).

### General Data Loading

#### load_files { .api }
```python
from sklearn.datasets import load_files

load_files(
    container_path: str,
    description: str | None = None,
    categories: list[str] | None = None,
    load_content: bool = True,
    shuffle: bool = True,
    encoding: str | None = None,
    decode_error: str = "strict",
    random_state: int | RandomState | None = 0,
    allowed_extensions: list[str] | None = None
) -> Bunch
```
Load text files with categories as subfolder names.
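
The folder convention expected by `load_files` can be sketched as follows. The category names, file names, and texts are made up for illustration; the directory tree is built in a temporary folder so the snippet is self-contained.

```python
import os
import tempfile

from sklearn.datasets import load_files

with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout: one subfolder per category, one text file per sample
    samples = {"ham": ["hello there", "see you soon"], "spam": ["win a prize"]}
    for category, texts in samples.items():
        os.makedirs(os.path.join(root, category))
        for i, text in enumerate(texts):
            path = os.path.join(root, category, f"doc_{i}.txt")
            with open(path, "w", encoding="utf-8") as fh:
                fh.write(text)

    # encoding="utf-8" decodes file contents to str instead of bytes
    corpus = load_files(root, encoding="utf-8", random_state=0)

print(corpus.target_names)
print(len(corpus.data), corpus.target.shape)
```

Category names are taken from the subfolder names, and `target` holds the index of each sample's category in `target_names`.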

## Sample Images

#### load_sample_images { .api }
```python
from sklearn.datasets import load_sample_images

load_sample_images() -> Bunch
```
Load sample images for image manipulation.

#### load_sample_image { .api }
```python
from sklearn.datasets import load_sample_image

load_sample_image(
    image_name: str
) -> ArrayLike
```
Load the numpy array of a single sample image.

## Real-World Datasets (Fetch Functions)

### Text Datasets

#### fetch_20newsgroups { .api }
```python
from sklearn.datasets import fetch_20newsgroups

fetch_20newsgroups(
    data_home: str | None = None,
    subset: str = "train",
    categories: list[str] | None = None,
    shuffle: bool = True,
    random_state: int | RandomState | None = 42,
    remove: tuple | None = (),
    download_if_missing: bool = True,
    return_X_y: bool = False
) -> Bunch | tuple[list[str], ArrayLike]
```
Load the filenames and data from the 20 newsgroups dataset.

#### fetch_20newsgroups_vectorized { .api }
```python
from sklearn.datasets import fetch_20newsgroups_vectorized

fetch_20newsgroups_vectorized(
    subset: str = "train",
    remove: tuple = (),
    data_home: str | None = None,
    download_if_missing: bool = True,
    return_X_y: bool = False,
    normalize: bool = True,
    as_frame: bool = False
) -> Bunch | tuple[ArrayLike, ArrayLike]
```
Load the 20 newsgroups dataset and vectorize it.

#### fetch_rcv1 { .api }
```python
from sklearn.datasets import fetch_rcv1

fetch_rcv1(
    data_home: str | None = None,
    subset: str = "all",
    download_if_missing: bool = True,
    random_state: int | RandomState | None = None,
    shuffle: bool = False,
    return_X_y: bool = False
) -> Bunch | tuple[ArrayLike, ArrayLike]
```
Load the RCV1 multilabel dataset.

### Computer Vision Datasets

#### fetch_lfw_people { .api }
```python
from sklearn.datasets import fetch_lfw_people

fetch_lfw_people(
    data_home: str | None = None,
    funneled: bool = True,
    resize: float = 0.5,
    min_faces_per_person: int = 0,
    color: bool = False,
    slice_: tuple | None = (slice(70, 195), slice(78, 172)),
    download_if_missing: bool = True,
    return_X_y: bool = False
) -> Bunch | tuple[ArrayLike, ArrayLike]
```
Load the Labeled Faces in the Wild (LFW) people dataset.

#### fetch_lfw_pairs { .api }
```python
from sklearn.datasets import fetch_lfw_pairs

fetch_lfw_pairs(
    subset: str = "train",
    data_home: str | None = None,
    funneled: bool = True,
    resize: float = 0.5,
    color: bool = False,
    slice_: tuple | None = (slice(70, 195), slice(78, 172)),
    download_if_missing: bool = True
) -> Bunch
```
Load the Labeled Faces in the Wild (LFW) pairs dataset.

#### fetch_olivetti_faces { .api }
```python
from sklearn.datasets import fetch_olivetti_faces

fetch_olivetti_faces(
    data_home: str | None = None,
    shuffle: bool = False,
    random_state: int | RandomState | None = 0,
    download_if_missing: bool = True,
    return_X_y: bool = False
) -> Bunch | tuple[ArrayLike, ArrayLike]
```
Load the Olivetti faces dataset.

### Real Estate and Regression Datasets

#### fetch_california_housing { .api }
```python
from sklearn.datasets import fetch_california_housing

fetch_california_housing(
    data_home: str | None = None,
    download_if_missing: bool = True,
    return_X_y: bool = False,
    as_frame: bool = False
) -> Bunch | tuple[ArrayLike, ArrayLike]
```
Load the California housing dataset.

### Network Security Datasets

#### fetch_kddcup99 { .api }
```python
from sklearn.datasets import fetch_kddcup99

fetch_kddcup99(
    subset: str | None = None,
    data_home: str | None = None,
    shuffle: bool = False,
    random_state: int | RandomState | None = None,
    percent10: bool = True,
    download_if_missing: bool = True,
    return_X_y: bool = False,
    as_frame: bool = False
) -> Bunch | tuple[ArrayLike, ArrayLike]
```
Load the kddcup99 dataset.

### Environmental Datasets

#### fetch_covtype { .api }
```python
from sklearn.datasets import fetch_covtype

fetch_covtype(
    data_home: str | None = None,
    download_if_missing: bool = True,
    random_state: int | RandomState | None = None,
    shuffle: bool = False,
    return_X_y: bool = False,
    as_frame: bool = False
) -> Bunch | tuple[ArrayLike, ArrayLike]
```
Load the covertype dataset.

#### fetch_species_distributions { .api }
```python
from sklearn.datasets import fetch_species_distributions

fetch_species_distributions(
    data_home: str | None = None,
    download_if_missing: bool = True
) -> Bunch
```
Loader for species distribution dataset.

### OpenML Integration

#### fetch_openml { .api }
```python
from sklearn.datasets import fetch_openml

fetch_openml(
    name: str | int | None = None,
    version: int | str = "active",
    data_id: int | None = None,
    data_home: str | None = None,
    target_column: str | list | None = "default-target",
    cache: bool = True,
    return_X_y: bool = False,
    as_frame: bool | str = "auto",
    n_retries: int = 3,
    delay: float = 1.0,
    parser: str = "auto",
    read_csv_kwargs: dict | None = None
) -> Bunch | tuple[ArrayLike, ArrayLike]
```
Fetch dataset from openml by name or dataset id.

### General File Fetching

#### fetch_file { .api }
```python
from sklearn.datasets import fetch_file

fetch_file(
    url: str,
    folder: str | Path | None = None,
    local_filename: str | None = None,
    sha256: str | None = None,
    n_retries: int = 3,
    delay: int = 1
) -> Path
```
Fetch a file from the web if it is not already present in the local folder.

## Synthetic Data Generation

### Classification Data Generation

#### make_classification { .api }
```python
from sklearn.datasets import make_classification

make_classification(
    n_samples: int = 100,
    n_features: int = 20,
    n_informative: int = 2,
    n_redundant: int = 2,
    n_repeated: int = 0,
    n_classes: int = 2,
    n_clusters_per_class: int = 2,
    weights: ArrayLike | None = None,
    flip_y: float = 0.01,
    class_sep: float = 1.0,
    hypercube: bool = True,
    shift: float | ArrayLike | None = 0.0,
    scale: float | ArrayLike | None = 1.0,
    shuffle: bool = True,
    random_state: int | RandomState | None = None,
    return_X_y: bool = True
) -> tuple[ArrayLike, ArrayLike] | Bunch
```
Generate a random n-class classification problem. With the default `return_X_y=True` the result is an `(X, y)` tuple; with `return_X_y=False` a `Bunch` is returned instead.

#### make_multilabel_classification { .api }
```python
from sklearn.datasets import make_multilabel_classification

make_multilabel_classification(
    n_samples: int = 100,
    n_features: int = 20,
    n_classes: int = 5,
    n_labels: int = 2,
    length: int = 50,
    allow_unlabeled: bool = True,
    sparse: bool = False,
    return_indicator: str = "dense",
    return_distributions: bool = False,
    random_state: int | RandomState | None = None
) -> tuple[ArrayLike, ArrayLike] | tuple[ArrayLike, ArrayLike, ArrayLike, ArrayLike]
```
Generate a random multilabel classification problem.
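
The four-element return value of `make_multilabel_classification` can be sketched as below. The variable names `p_c` and `p_w_c` are illustrative labels for the class priors and the per-class feature probabilities returned when `return_distributions=True`.

```python
from sklearn.datasets import make_multilabel_classification

X, Y, p_c, p_w_c = make_multilabel_classification(
    n_samples=100, n_features=20, n_classes=5,
    return_distributions=True, random_state=0
)

print(X.shape)      # (100, 20)
print(Y.shape)      # (100, 5): one 0/1 indicator column per class
print(p_c.shape)    # (5,): probability of each class being drawn
print(p_w_c.shape)  # (20, 5): probability of each feature given each class
```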

#### make_hastie_10_2 { .api }
```python
from sklearn.datasets import make_hastie_10_2

make_hastie_10_2(
    n_samples: int = 12000,
    random_state: int | RandomState | None = None
) -> tuple[ArrayLike, ArrayLike]
```
Generate data for binary classification used in Hastie et al. 2009.
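
A small sketch of how the `make_hastie_10_2` labels relate to the features, assuming NumPy is available. The ten features are standard Gaussian, and the target is +1 when the squared norm of a sample exceeds 9.34 and -1 otherwise; `y_rule` is an illustrative name for that rule recomputed by hand.

```python
import numpy as np
from sklearn.datasets import make_hastie_10_2

X, y = make_hastie_10_2(n_samples=1000, random_state=0)

# Recompute the labels from the documented threshold on the squared norm
y_rule = np.where((X ** 2.0).sum(axis=1) > 9.34, 1.0, -1.0)

print(X.shape, np.unique(y))
print(np.array_equal(y, y_rule))
```

Note that the labels are -1 and +1 rather than 0 and 1.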

#### make_gaussian_quantiles { .api }
```python
from sklearn.datasets import make_gaussian_quantiles

make_gaussian_quantiles(
    mean: ArrayLike | None = None,
    cov: float = 1.0,
    n_samples: int = 100,
    n_features: int = 2,
    n_classes: int = 3,
    shuffle: bool = True,
    random_state: int | RandomState | None = None
) -> tuple[ArrayLike, ArrayLike]
```
Generate isotropic Gaussian and label samples by quantile.
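
To make "label samples by quantile" concrete, the sketch below (assuming NumPy and the default zero mean) checks that `make_gaussian_quantiles` produces classes that form concentric shells of roughly equal size around the origin. The `radius` variable is introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_gaussian_quantiles

X, y = make_gaussian_quantiles(
    n_samples=300, n_features=2, n_classes=3, random_state=0
)

# Distance of every sample from the (default, zero) mean
radius = np.linalg.norm(X, axis=1)

# Class sizes are balanced and the mean radius grows with the class label
for label in range(3):
    mask = y == label
    print(label, int(mask.sum()), round(float(radius[mask].mean()), 2))
```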

### Regression Data Generation

#### make_regression { .api }
```python
from sklearn.datasets import make_regression

make_regression(
    n_samples: int = 100,
    n_features: int = 100,
    n_informative: int = 10,
    n_targets: int = 1,
    bias: float = 0.0,
    effective_rank: int | None = None,
    tail_strength: float = 0.5,
    noise: float = 0.0,
    shuffle: bool = True,
    coef: bool = False,
    random_state: int | RandomState | None = None
) -> tuple[ArrayLike, ArrayLike] | tuple[ArrayLike, ArrayLike, ArrayLike]
```
Generate a random regression problem.

#### make_friedman1 { .api }
```python
from sklearn.datasets import make_friedman1

make_friedman1(
    n_samples: int = 100,
    n_features: int = 10,
    noise: float = 0.0,
    random_state: int | RandomState | None = None
) -> tuple[ArrayLike, ArrayLike]
```
Generate the "Friedman #1" regression problem.
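
As a sketch of what the "Friedman #1" problem computes, the snippet below (assuming NumPy) rebuilds the noise-free target from the first five features using the formula given in the scikit-learn docstring. `y_formula` is an illustrative name.

```python
import numpy as np
from sklearn.datasets import make_friedman1

X, y = make_friedman1(n_samples=200, n_features=10, noise=0.0, random_state=0)

# Only the first five features drive the target; the remaining
# features are independent of y.
y_formula = (
    10 * np.sin(np.pi * X[:, 0] * X[:, 1])
    + 20 * (X[:, 2] - 0.5) ** 2
    + 10 * X[:, 3]
    + 5 * X[:, 4]
)

print(X.shape, y.shape)
print(np.allclose(y, y_formula))
```

This makes the dataset a convenient test bed for feature selection, since the irrelevant features are known in advance.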

#### make_friedman2 { .api }
```python
from sklearn.datasets import make_friedman2

make_friedman2(
    n_samples: int = 100,
    noise: float = 0.0,
    random_state: int | RandomState | None = None
) -> tuple[ArrayLike, ArrayLike]
```
Generate the "Friedman #2" regression problem.

#### make_friedman3 { .api }
```python
from sklearn.datasets import make_friedman3

make_friedman3(
    n_samples: int = 100,
    noise: float = 0.0,
    random_state: int | RandomState | None = None
) -> tuple[ArrayLike, ArrayLike]
```
Generate the "Friedman #3" regression problem.

#### make_sparse_uncorrelated { .api }
```python
from sklearn.datasets import make_sparse_uncorrelated

make_sparse_uncorrelated(
    n_samples: int = 100,
    n_features: int = 10,
    random_state: int | RandomState | None = None
) -> tuple[ArrayLike, ArrayLike]
```
Generate a random regression problem with sparse uncorrelated design.

### Clustering Data Generation

#### make_blobs { .api }
```python
from sklearn.datasets import make_blobs

make_blobs(
    n_samples: int | ArrayLike = 100,
    n_features: int = 2,
    centers: int | ArrayLike | None = None,
    cluster_std: float | ArrayLike = 1.0,
    center_box: tuple[float, float] = (-10.0, 10.0),
    shuffle: bool = True,
    random_state: int | RandomState | None = None,
    return_centers: bool = False
) -> tuple[ArrayLike, ArrayLike] | tuple[ArrayLike, ArrayLike, ArrayLike]
```
Generate isotropic Gaussian blobs for clustering.
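
The three-element return form of `make_blobs` can be sketched as follows. Passing a sequence for `n_samples` sets the size of each cluster individually; the specific sizes used here are arbitrary.

```python
from sklearn.datasets import make_blobs

X, y, centers = make_blobs(
    n_samples=[50, 100, 150],  # one entry per cluster
    n_features=2,
    cluster_std=0.5,
    return_centers=True,       # also return the generated cluster centers
    random_state=0,
)

print(X.shape, centers.shape)
print([int((y == k).sum()) for k in range(3)])
```

Having the true centers available is handy when checking how closely a clustering algorithm recovers them.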

#### make_circles { .api }
```python
from sklearn.datasets import make_circles

make_circles(
    n_samples: int | tuple[int, int] = 100,
    shuffle: bool = True,
    noise: float | None = None,
    random_state: int | RandomState | None = None,
    factor: float = 0.8
) -> tuple[ArrayLike, ArrayLike]
```
Make a large circle containing a smaller circle in 2d.

#### make_moons { .api }
```python
from sklearn.datasets import make_moons

make_moons(
    n_samples: int | tuple[int, int] = 100,
    shuffle: bool = True,
    noise: float | None = None,
    random_state: int | RandomState | None = None
) -> tuple[ArrayLike, ArrayLike]
```
Make two interleaving half circles.

### Manifold Data Generation

#### make_swiss_roll { .api }
```python
from sklearn.datasets import make_swiss_roll

make_swiss_roll(
    n_samples: int = 100,
    noise: float = 0.0,
    random_state: int | RandomState | None = None,
    hole: bool = False
) -> tuple[ArrayLike, ArrayLike]
```
Generate a swiss roll dataset.

#### make_s_curve { .api }
```python
from sklearn.datasets import make_s_curve

make_s_curve(
    n_samples: int = 100,
    noise: float = 0.0,
    random_state: int | RandomState | None = None
) -> tuple[ArrayLike, ArrayLike]
```
Generate an S curve dataset.

### Biclustering Data Generation

#### make_biclusters { .api }
```python
from sklearn.datasets import make_biclusters

make_biclusters(
    shape: tuple[int, int],
    n_clusters: int,
    noise: float = 0.0,
    minval: int = 10,
    maxval: int = 100,
    shuffle: bool = True,
    random_state: int | RandomState | None = None
) -> tuple[ArrayLike, ArrayLike, ArrayLike]
```
Generate an array with constant block diagonal structure.
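
The three arrays returned by `make_biclusters` can be sketched as below. The names `data`, `rows`, and `cols` are illustrative; the last two are boolean indicator arrays with one row per bicluster.

```python
from sklearn.datasets import make_biclusters

data, rows, cols = make_biclusters(
    shape=(30, 20), n_clusters=3, noise=0.5, shuffle=True, random_state=0
)

print(data.shape)  # (30, 20)
print(rows.shape)  # (3, 30): row membership per bicluster
print(cols.shape)  # (3, 20): column membership per bicluster

# Extract the sub-matrix that belongs to the first bicluster
block = data[rows[0]][:, cols[0]]
print(block.shape)
```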

#### make_checkerboard { .api }
```python
from sklearn.datasets import make_checkerboard

make_checkerboard(
    shape: tuple[int, int],
    n_clusters: int | tuple[int, int],
    noise: float = 0.0,
    minval: int = 10,
    maxval: int = 100,
    shuffle: bool = True,
    random_state: int | RandomState | None = None
) -> tuple[ArrayLike, ArrayLike, ArrayLike]
```
Generate an array with block checkerboard structure.

### Matrix Generation

#### make_low_rank_matrix { .api }
```python
from sklearn.datasets import make_low_rank_matrix

make_low_rank_matrix(
    n_samples: int = 100,
    n_features: int = 100,
    effective_rank: int = 10,
    tail_strength: float = 0.5,
    random_state: int | RandomState | None = None
) -> ArrayLike
```
Generate a mostly low rank matrix with bell-shaped singular values.

#### make_sparse_coded_signal { .api }
```python
from sklearn.datasets import make_sparse_coded_signal

make_sparse_coded_signal(
    n_samples: int,
    n_components: int,
    n_features: int,
    n_nonzero_coefs: int,
    random_state: int | RandomState | None = None
) -> tuple[ArrayLike, ArrayLike, ArrayLike]
```
Generate a signal as a sparse combination of dictionary elements.

#### make_spd_matrix { .api }
```python
from sklearn.datasets import make_spd_matrix

make_spd_matrix(
    n_dim: int,
    random_state: int | RandomState | None = None
) -> ArrayLike
```
Generate a random symmetric, positive-definite matrix.
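
A quick sanity check of the two properties promised by `make_spd_matrix`, assuming NumPy is available. The matrix size of 4 is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_spd_matrix

A = make_spd_matrix(n_dim=4, random_state=0)

is_symmetric = np.allclose(A, A.T)
# A symmetric matrix is positive definite when all eigenvalues are positive
is_positive_definite = bool(np.all(np.linalg.eigvalsh(A) > 0))

print(A.shape, is_symmetric, is_positive_definite)
```

Such a matrix can serve, for example, as a random covariance matrix when sampling from a multivariate normal distribution.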

#### make_sparse_spd_matrix { .api }
```python
from sklearn.datasets import make_sparse_spd_matrix

make_sparse_spd_matrix(
    n_dim: int = 1,
    alpha: float = 0.95,
    norm_diag: bool = False,
    smallest_coef: float = 0.1,
    largest_coef: float = 0.9,
    sparse_format: str | None = None,
    random_state: int | RandomState | None = None
) -> ArrayLike
```
Generate a sparse symmetric positive-definite matrix. The size parameter is named `n_dim`; the older `dim` name was deprecated in scikit-learn 1.4.

## File I/O Utilities

### SVMLight Format

#### load_svmlight_file { .api }
```python
from sklearn.datasets import load_svmlight_file

load_svmlight_file(
    f: str | IO,
    n_features: int | None = None,
    dtype: type = np.float64,
    multilabel: bool = False,
    zero_based: bool | str = "auto",
    query_id: bool = False,
    offset: int = 0,
    length: int = -1
) -> tuple[ArrayLike, ArrayLike] | tuple[ArrayLike, ArrayLike, ArrayLike]
```
Load a dataset in the svmlight / libsvm format into a sparse CSR matrix. A third array of query ids is returned when `query_id=True`.

#### load_svmlight_files { .api }
```python
from sklearn.datasets import load_svmlight_files

load_svmlight_files(
    files: list[str | IO],
    n_features: int | None = None,
    dtype: type = np.float64,
    multilabel: bool = False,
    zero_based: bool | str = "auto",
    query_id: bool = False,
    offset: int = 0,
    length: int = -1
) -> list[ArrayLike]
```
Load datasets from multiple files in SVMlight format. The result is a single flat list, `[X1, y1, ..., Xn, yn]`, or `[X1, y1, q1, ..., Xn, yn, qn]` when `query_id=True`; all `X` matrices share the same number of features.
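
The flat return list of `load_svmlight_files` can be sketched as follows. The two tiny matrices and the file names are made up for illustration, and the files are written to a temporary directory so the snippet cleans up after itself.

```python
import os
import tempfile

import numpy as np
from sklearn.datasets import dump_svmlight_file, load_svmlight_files

X_a = np.array([[1.0, 0.0, 2.0], [0.0, 3.0, 0.0]])
X_b = np.array([[0.0, 1.0, 0.0]])  # the last column is all zeros here

with tempfile.TemporaryDirectory() as tmp_dir:
    path_a = os.path.join(tmp_dir, "a.svmlight")
    path_b = os.path.join(tmp_dir, "b.svmlight")
    dump_svmlight_file(X_a, np.array([0, 1]), path_a)
    dump_svmlight_file(X_b, np.array([1]), path_b)

    # The result is one flat list, unpacked here file by file
    X1, y1, X2, y2 = load_svmlight_files([path_a, path_b], zero_based=True)

# Both matrices end up with the same number of columns
print(X1.shape, X2.shape)
```

Loading the files together is what keeps the feature dimension consistent, which matters when one file is a training set and the other a test set.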

#### dump_svmlight_file { .api }
```python
from sklearn.datasets import dump_svmlight_file

dump_svmlight_file(
    X: ArrayLike,
    y: ArrayLike,
    f: str | IO,
    zero_based: bool = True,
    comment: str | bytes | None = None,
    query_id: ArrayLike | None = None,
    multilabel: bool = False
) -> None
```
Dump the dataset in svmlight / libsvm file format.

## Data Directory Management

#### get_data_home { .api }
```python
from sklearn.datasets import get_data_home

get_data_home(
    data_home: str | None = None
) -> str
```
Return the path to scikit-learn data dir.

#### clear_data_home { .api }
```python
from sklearn.datasets import clear_data_home

clear_data_home(
    data_home: str | None = None
) -> None
```
Delete all the content in the data home cache.
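
A minimal sketch of `get_data_home` and `clear_data_home` working together, using a throwaway cache location so that the real cache (by default `~/scikit_learn_data`, or the path in the `SCIKIT_LEARN_DATA` environment variable) is left untouched. The folder and file names are hypothetical.

```python
import os
import tempfile

from sklearn.datasets import clear_data_home, get_data_home

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical cache location; the directory is created if missing
    cache_dir = get_data_home(data_home=os.path.join(tmp_dir, "sklearn_cache"))
    created = os.path.isdir(cache_dir)

    # Simulate a cached download, then wipe the cache
    marker = os.path.join(cache_dir, "dummy.txt")
    with open(marker, "w", encoding="utf-8") as fh:
        fh.write("cached")
    clear_data_home(data_home=cache_dir)
    still_cached = os.path.exists(marker)

print(created, still_cached)
```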

## Examples

### Loading Built-in Datasets

```python
from sklearn.datasets import load_iris, load_digits, load_wine

# Load iris dataset
iris = load_iris()
X_iris, y_iris = iris.data, iris.target
print(f"Iris dataset: {X_iris.shape}, classes: {len(iris.target_names)}")

# Load digits dataset
digits = load_digits(n_class=10)
X_digits, y_digits = digits.data, digits.target
print(f"Digits dataset: {X_digits.shape}")

# Load wine dataset as tuple
X_wine, y_wine = load_wine(return_X_y=True)
print(f"Wine dataset: {X_wine.shape}")

# Load as pandas DataFrame (requires pandas to be installed)
wine_frame = load_wine(as_frame=True)
df = wine_frame.frame
print(df.head())
```

### Fetching Real-World Datasets

```python
from sklearn.datasets import fetch_california_housing, fetch_20newsgroups

# Fetch California housing dataset
housing = fetch_california_housing()
X_housing, y_housing = housing.data, housing.target
print(f"Housing dataset: {X_housing.shape}")
print(f"Features: {housing.feature_names}")

# Fetch text data (20 newsgroups)
newsgroups = fetch_20newsgroups(
    subset='train',
    categories=['alt.atheism', 'sci.space']
)
print(f"Newsgroups: {len(newsgroups.data)} documents")
print(f"Categories: {newsgroups.target_names}")
```

### Generating Synthetic Data

```python
from sklearn.datasets import (
    make_classification, make_regression, make_blobs,
    make_circles, make_moons
)

# Classification data
X_clf, y_clf = make_classification(
    n_samples=1000, n_features=20, n_informative=10,
    n_redundant=5, n_classes=3, random_state=42
)
print(f"Classification data: {X_clf.shape}")

# Regression data
X_reg, y_reg = make_regression(
    n_samples=1000, n_features=20, n_informative=10,
    noise=0.1, random_state=42
)
print(f"Regression data: {X_reg.shape}")

# Clustering data - blobs
X_blobs, y_blobs = make_blobs(
    n_samples=300, centers=4, n_features=2,
    random_state=42, cluster_std=0.8
)

# Non-linear clustering data
X_circles, y_circles = make_circles(
    n_samples=300, noise=0.05, factor=0.6, random_state=42
)

X_moons, y_moons = make_moons(
    n_samples=300, noise=0.1, random_state=42
)

print(f"Blobs: {X_blobs.shape}, Circles: {X_circles.shape}, Moons: {X_moons.shape}")
```

### Manifold Learning Data

```python
from sklearn.datasets import make_swiss_roll, make_s_curve

# Generate swiss roll manifold
X_swiss, t_swiss = make_swiss_roll(n_samples=1000, noise=0.1, random_state=42)
print(f"Swiss roll: {X_swiss.shape}")

# Generate S-curve manifold
X_s_curve, t_s_curve = make_s_curve(n_samples=1000, noise=0.1, random_state=42)
print(f"S-curve: {X_s_curve.shape}")
```

### Working with SVMLight Format

```python
import os
import tempfile

from sklearn.datasets import (
    dump_svmlight_file, load_svmlight_file, make_classification
)

# Create sample data
X, y = make_classification(n_samples=100, n_features=10, random_state=42)

# Save to SVMLight format inside a temporary directory,
# which is removed automatically when the block exits
with tempfile.TemporaryDirectory() as tmp_dir:
    filename = os.path.join(tmp_dir, "data.svmlight")
    dump_svmlight_file(X, y, filename)

    # Load from SVMLight format (X_loaded is a sparse CSR matrix)
    X_loaded, y_loaded = load_svmlight_file(filename)

print(f"Original: {X.shape}, Loaded: {X_loaded.shape}")
```

### Custom Dataset Creation

```python
import numpy as np
from sklearn.utils import Bunch

def create_custom_dataset(n_samples=100):
    """Create a custom dataset with specific characteristics."""
    # Local generator: reproducible without touching NumPy's global seed
    rng = np.random.default_rng(42)

    # Generate features
    X = rng.standard_normal((n_samples, 5))

    # Create target with specific pattern
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    # Create a Bunch object similar to sklearn datasets
    return Bunch(
        data=X,
        target=y,
        feature_names=[f'feature_{i}' for i in range(5)],
        target_names=['class_0', 'class_1'],
        DESCR='Custom synthetic dataset'
    )

# Use custom dataset
custom_data = create_custom_dataset(500)
print(f"Custom dataset: {custom_data.data.shape}")
print(f"Features: {custom_data.feature_names}")
```
