# Datasets and Data Generation

This document covers all dataset loading, fetching, generation, and utility functions in scikit-learn.

## Built-in Toy Datasets

### Classification Datasets

#### load_iris { .api }
```python
from sklearn.datasets import load_iris

load_iris(
    return_X_y: bool = False,
    as_frame: bool = False
) -> Bunch | tuple[ArrayLike, ArrayLike]
```
Load and return the iris dataset (classification).

#### load_digits { .api }
```python
from sklearn.datasets import load_digits

load_digits(
    n_class: int = 10,
    return_X_y: bool = False,
    as_frame: bool = False
) -> Bunch | tuple[ArrayLike, ArrayLike]
```
Load and return the digits dataset (classification).
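
As a rough illustration of `n_class`, the sketch below loads only the first three digit classes and inspects which labels come back. The variable names are illustrative and not part of the API.

```python
import numpy as np
from sklearn.datasets import load_digits

# Keep only the samples whose label is 0, 1, or 2.
X_small, y_small = load_digits(n_class=3, return_X_y=True)

# Each sample is a flattened 8x8 image, so there are 64 features per row.
print(X_small.shape)
print(np.unique(y_small))
```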

#### load_wine { .api }
```python
from sklearn.datasets import load_wine

load_wine(
    return_X_y: bool = False,
    as_frame: bool = False
) -> Bunch | tuple[ArrayLike, ArrayLike]
```
Load and return the wine dataset (classification).

#### load_breast_cancer { .api }
```python
from sklearn.datasets import load_breast_cancer

load_breast_cancer(
    return_X_y: bool = False,
    as_frame: bool = False
) -> Bunch | tuple[ArrayLike, ArrayLike]
```
Load and return the breast cancer Wisconsin dataset (classification).

### Regression Datasets

#### load_diabetes { .api }
```python
from sklearn.datasets import load_diabetes

load_diabetes(
    return_X_y: bool = False,
    as_frame: bool = False,
    scaled: bool = True
) -> Bunch | tuple[ArrayLike, ArrayLike]
```
Load and return the diabetes dataset (regression).

#### load_linnerud { .api }
```python
from sklearn.datasets import load_linnerud

load_linnerud(
    return_X_y: bool = False,
    as_frame: bool = False
) -> Bunch | tuple[ArrayLike, ArrayLike]
```
Load and return the linnerud dataset (multivariate regression).

### General Data Loading

#### load_files { .api }
```python
from sklearn.datasets import load_files

load_files(
    container_path: str,
    description: str | None = None,
    categories: list[str] | None = None,
    load_content: bool = True,
    shuffle: bool = True,
    encoding: str | None = None,
    decode_error: str = "strict",
    random_state: int | RandomState | None = 0
) -> Bunch
```
Load text files with categories as subfolder names.
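
To make the expected folder layout concrete, here is a small sketch that builds a throwaway directory with one subfolder per category and loads it. The category names, file names, and texts are hypothetical.

```python
import tempfile
from pathlib import Path
from sklearn.datasets import load_files

with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout: one subfolder per category, one file per document.
    for category, text in [("neg", "terrible plot"), ("pos", "great acting")]:
        folder = Path(root) / category
        folder.mkdir()
        (folder / "doc0.txt").write_text(text, encoding="utf-8")

    # Passing an encoding decodes file contents to str instead of bytes.
    bunch = load_files(root, encoding="utf-8", shuffle=False)

print(bunch.target_names)
print(list(zip(bunch.target, bunch.data)))
```

Subfolder names become `target_names`, and `target` holds the index of each document's category.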

## Sample Images

#### load_sample_images { .api }
```python
from sklearn.datasets import load_sample_images

load_sample_images() -> Bunch
```
Load sample images for image manipulation.

#### load_sample_image { .api }
```python
from sklearn.datasets import load_sample_image

load_sample_image(
    image_name: str
) -> ArrayLike
```
Load the numpy array of a single sample image.

## Real-World Datasets (Fetch Functions)

### Text Datasets

#### fetch_20newsgroups { .api }
```python
from sklearn.datasets import fetch_20newsgroups

fetch_20newsgroups(
    data_home: str | None = None,
    subset: str = "train",
    categories: list[str] | None = None,
    shuffle: bool = True,
    random_state: int | RandomState | None = 42,
    remove: tuple | None = (),
    download_if_missing: bool = True,
    return_X_y: bool = False
) -> Bunch | tuple[list[str], ArrayLike]
```
Load the filenames and data from the 20 newsgroups dataset.

#### fetch_20newsgroups_vectorized { .api }
```python
from sklearn.datasets import fetch_20newsgroups_vectorized

fetch_20newsgroups_vectorized(
    subset: str = "train",
    remove: tuple = (),
    data_home: str | None = None,
    download_if_missing: bool = True,
    return_X_y: bool = False,
    normalize: bool = True,
    as_frame: bool = False
) -> Bunch | tuple[ArrayLike, ArrayLike]
```
Load the 20 newsgroups dataset and vectorize it.

#### fetch_rcv1 { .api }
```python
from sklearn.datasets import fetch_rcv1

fetch_rcv1(
    data_home: str | None = None,
    subset: str = "all",
    download_if_missing: bool = True,
    random_state: int | RandomState | None = None,
    shuffle: bool = False,
    return_X_y: bool = False
) -> Bunch | tuple[ArrayLike, ArrayLike]
```
Load the RCV1 multilabel dataset.

### Computer Vision Datasets

#### fetch_lfw_people { .api }
```python
from sklearn.datasets import fetch_lfw_people

fetch_lfw_people(
    data_home: str | None = None,
    funneled: bool = True,
    resize: float = 0.5,
    min_faces_per_person: int = 0,
    color: bool = False,
    slice_: tuple | None = (slice(70, 195), slice(78, 172)),
    download_if_missing: bool = True,
    return_X_y: bool = False
) -> Bunch | tuple[ArrayLike, ArrayLike]
```
Load the Labeled Faces in the Wild (LFW) people dataset.

#### fetch_lfw_pairs { .api }
```python
from sklearn.datasets import fetch_lfw_pairs

fetch_lfw_pairs(
    subset: str = "train",
    data_home: str | None = None,
    funneled: bool = True,
    resize: float = 0.5,
    color: bool = False,
    slice_: tuple | None = (slice(70, 195), slice(78, 172)),
    download_if_missing: bool = True
) -> Bunch
```
Load the Labeled Faces in the Wild (LFW) pairs dataset.

#### fetch_olivetti_faces { .api }
```python
from sklearn.datasets import fetch_olivetti_faces

fetch_olivetti_faces(
    data_home: str | None = None,
    shuffle: bool = False,
    random_state: int | RandomState | None = 0,
    download_if_missing: bool = True,
    return_X_y: bool = False
) -> Bunch | tuple[ArrayLike, ArrayLike]
```
Load the Olivetti faces dataset.

### Real Estate and Regression Datasets

#### fetch_california_housing { .api }
```python
from sklearn.datasets import fetch_california_housing

fetch_california_housing(
    data_home: str | None = None,
    download_if_missing: bool = True,
    return_X_y: bool = False,
    as_frame: bool = False
) -> Bunch | tuple[ArrayLike, ArrayLike]
```
Load the California housing dataset.

### Network Security Datasets

#### fetch_kddcup99 { .api }
```python
from sklearn.datasets import fetch_kddcup99

fetch_kddcup99(
    subset: str | None = None,
    data_home: str | None = None,
    shuffle: bool = False,
    random_state: int | RandomState | None = None,
    percent10: bool = True,
    download_if_missing: bool = True,
    return_X_y: bool = False,
    as_frame: bool = False
) -> Bunch | tuple[ArrayLike, ArrayLike]
```
Load the kddcup99 dataset.

### Environmental Datasets

#### fetch_covtype { .api }
```python
from sklearn.datasets import fetch_covtype

fetch_covtype(
    data_home: str | None = None,
    download_if_missing: bool = True,
    random_state: int | RandomState | None = None,
    shuffle: bool = False,
    return_X_y: bool = False,
    as_frame: bool = False
) -> Bunch | tuple[ArrayLike, ArrayLike]
```
Load the covertype dataset.

#### fetch_species_distributions { .api }
```python
from sklearn.datasets import fetch_species_distributions

fetch_species_distributions(
    data_home: str | None = None,
    download_if_missing: bool = True
) -> Bunch
```
Loader for species distribution dataset.

### OpenML Integration

#### fetch_openml { .api }
```python
from sklearn.datasets import fetch_openml

fetch_openml(
    name: str | int | None = None,
    version: int | str = "active",
    data_id: int | None = None,
    data_home: str | None = None,
    target_column: str | list | None = "default-target",
    cache: bool = True,
    return_X_y: bool = False,
    as_frame: bool | str = "auto",
    n_retries: int = 3,
    delay: float = 1.0,
    parser: str = "auto",
    read_csv_kwargs: dict | None = None
) -> Bunch | tuple[ArrayLike, ArrayLike]
```
Fetch a dataset from OpenML by name or dataset id.

### General File Fetching

#### fetch_file { .api }
```python
from sklearn.datasets import fetch_file

fetch_file(
    url: str,
    folder: str | Path | None = None,
    local_filename: str | None = None,
    sha256: str | None = None,
    n_retries: int = 3,
    delay: int = 1
) -> Path
```
Fetch a file from the web if it is not already present in the local folder.

## Synthetic Data Generation

### Classification Data Generation

#### make_classification { .api }
```python
from sklearn.datasets import make_classification

make_classification(
    n_samples: int = 100,
    n_features: int = 20,
    n_informative: int = 2,
    n_redundant: int = 2,
    n_repeated: int = 0,
    n_classes: int = 2,
    n_clusters_per_class: int = 2,
    weights: ArrayLike | None = None,
    flip_y: float = 0.01,
    class_sep: float = 1.0,
    hypercube: bool = True,
    shift: float | ArrayLike | None = 0.0,
    scale: float | ArrayLike | None = 1.0,
    shuffle: bool = True,
    random_state: int | RandomState | None = None
) -> tuple[ArrayLike, ArrayLike] | tuple[ArrayLike, ArrayLike, ArrayLike]
```
Generate a random n-class classification problem.
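
One common use of `weights` is to simulate class imbalance. The sketch below asks for roughly a 90/10 split and turns off label flipping with `flip_y=0.0` so the proportions are not perturbed; the parameter values are illustrative choices, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_classification

X_imb, y_imb = make_classification(
    n_samples=1000,
    n_features=5,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    weights=[0.9, 0.1],  # target share of samples per class
    flip_y=0.0,          # disable random label noise
    random_state=0,
)

# Count how many samples ended up in each class.
class_counts = np.bincount(y_imb)
print(class_counts)
```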

#### make_multilabel_classification { .api }
```python
from sklearn.datasets import make_multilabel_classification

make_multilabel_classification(
    n_samples: int = 100,
    n_features: int = 20,
    n_classes: int = 5,
    n_labels: int = 2,
    length: int = 50,
    allow_unlabeled: bool = True,
    sparse: bool = False,
    return_indicator: str = "dense",
    return_distributions: bool = False,
    random_state: int | RandomState | None = None
) -> tuple[ArrayLike, ArrayLike] | tuple[ArrayLike, ArrayLike, ArrayLike, ArrayLike]
```
Generate a random multilabel classification problem.

#### make_hastie_10_2 { .api }
```python
from sklearn.datasets import make_hastie_10_2

make_hastie_10_2(
    n_samples: int = 12000,
    random_state: int | RandomState | None = None
) -> tuple[ArrayLike, ArrayLike]
```
Generate data for binary classification used in Hastie et al. 2009.

#### make_gaussian_quantiles { .api }
```python
from sklearn.datasets import make_gaussian_quantiles

make_gaussian_quantiles(
    mean: ArrayLike | None = None,
    cov: float = 1.0,
    n_samples: int = 100,
    n_features: int = 2,
    n_classes: int = 3,
    shuffle: bool = True,
    random_state: int | RandomState | None = None
) -> tuple[ArrayLike, ArrayLike]
```
Generate isotropic Gaussian and label samples by quantile.
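
"Label samples by quantile" means the classes form nested shells around the mean, with about the same number of samples in each. A small sketch to illustrate this, assuming the default zero mean; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_gaussian_quantiles

X_gq, y_gq = make_gaussian_quantiles(
    n_samples=300, n_features=2, n_classes=3, random_state=0
)

# Distance of every sample from the origin (the default mean).
dist = np.linalg.norm(X_gq, axis=1)

# Average distance per class: class 0 is the innermost shell.
mean_dist_per_class = [dist[y_gq == k].mean() for k in range(3)]
print(np.bincount(y_gq))
print(mean_dist_per_class)
```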

### Regression Data Generation

#### make_regression { .api }
```python
from sklearn.datasets import make_regression

make_regression(
    n_samples: int = 100,
    n_features: int = 100,
    n_informative: int = 10,
    n_targets: int = 1,
    bias: float = 0.0,
    effective_rank: int | None = None,
    tail_strength: float = 0.5,
    noise: float = 0.0,
    shuffle: bool = True,
    coef: bool = False,
    random_state: int | RandomState | None = None
) -> tuple[ArrayLike, ArrayLike] | tuple[ArrayLike, ArrayLike, ArrayLike]
```
Generate a random regression problem.
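
Passing `coef=True` also returns the coefficients of the underlying linear model, which is handy for checking what an estimator recovers. A minimal sketch with no noise and no bias, so the target is an exact linear function of the features:

```python
import numpy as np
from sklearn.datasets import make_regression

X_lin, y_lin, true_coef = make_regression(
    n_samples=50, n_features=8, n_informative=3,
    noise=0.0, coef=True, random_state=0,
)

# Only the informative features carry a non-zero coefficient.
n_nonzero = int(np.count_nonzero(true_coef))

# Without noise or bias, the target is reproduced by X @ coef.
reconstructs = np.allclose(X_lin @ true_coef, y_lin)
print(n_nonzero, reconstructs)
```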

#### make_friedman1 { .api }
```python
from sklearn.datasets import make_friedman1

make_friedman1(
    n_samples: int = 100,
    n_features: int = 10,
    noise: float = 0.0,
    random_state: int | RandomState | None = None
) -> tuple[ArrayLike, ArrayLike]
```
Generate the "Friedman #1" regression problem.
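
For reference, the Friedman #1 target depends only on the first five features: `y = 10 * sin(pi * x0 * x1) + 20 * (x2 - 0.5)**2 + 10 * x3 + 5 * x4`, plus Gaussian noise scaled by `noise`. The sketch below recomputes it by hand for the noise-free case; `y_manual` is an illustrative name.

```python
import numpy as np
from sklearn.datasets import make_friedman1

X_f1, y_f1 = make_friedman1(n_samples=200, n_features=10, noise=0.0, random_state=0)

# Recompute the target from the first five columns; the rest are irrelevant.
y_manual = (
    10 * np.sin(np.pi * X_f1[:, 0] * X_f1[:, 1])
    + 20 * (X_f1[:, 2] - 0.5) ** 2
    + 10 * X_f1[:, 3]
    + 5 * X_f1[:, 4]
)
matches = np.allclose(y_f1, y_manual)
print(matches)
```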

#### make_friedman2 { .api }
```python
from sklearn.datasets import make_friedman2

make_friedman2(
    n_samples: int = 100,
    noise: float = 0.0,
    random_state: int | RandomState | None = None
) -> tuple[ArrayLike, ArrayLike]
```
Generate the "Friedman #2" regression problem.

#### make_friedman3 { .api }
```python
from sklearn.datasets import make_friedman3

make_friedman3(
    n_samples: int = 100,
    noise: float = 0.0,
    random_state: int | RandomState | None = None
) -> tuple[ArrayLike, ArrayLike]
```
Generate the "Friedman #3" regression problem.

#### make_sparse_uncorrelated { .api }
```python
from sklearn.datasets import make_sparse_uncorrelated

make_sparse_uncorrelated(
    n_samples: int = 100,
    n_features: int = 10,
    random_state: int | RandomState | None = None
) -> tuple[ArrayLike, ArrayLike]
```
Generate a random regression problem with sparse uncorrelated design.

### Clustering Data Generation

#### make_blobs { .api }
```python
from sklearn.datasets import make_blobs

make_blobs(
    n_samples: int | ArrayLike = 100,
    n_features: int = 2,
    centers: int | ArrayLike | None = None,
    cluster_std: float | ArrayLike = 1.0,
    center_box: tuple[float, float] = (-10.0, 10.0),
    shuffle: bool = True,
    random_state: int | RandomState | None = None,
    return_centers: bool = False
) -> tuple[ArrayLike, ArrayLike] | tuple[ArrayLike, ArrayLike, ArrayLike]
```
Generate isotropic Gaussian blobs for clustering.
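
Two options that are easy to miss: `n_samples` may be a sequence giving the size of each cluster, and `return_centers=True` adds the generated centers to the return value. A short sketch with illustrative sizes:

```python
import numpy as np
from sklearn.datasets import make_blobs

# Three clusters of unequal size; centers are drawn at random inside center_box.
X_b, y_b, centers_b = make_blobs(
    n_samples=[50, 30, 20], n_features=2, centers=None,
    cluster_std=0.5, random_state=0, return_centers=True,
)

cluster_sizes = np.bincount(y_b)
print(X_b.shape, centers_b.shape, cluster_sizes)
```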

#### make_circles { .api }
```python
from sklearn.datasets import make_circles

make_circles(
    n_samples: int | tuple[int, int] = 100,
    shuffle: bool = True,
    noise: float | None = None,
    random_state: int | RandomState | None = None,
    factor: float = 0.8
) -> tuple[ArrayLike, ArrayLike]
```
Make a large circle containing a smaller circle in 2d.
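
`factor` is the ratio between the inner and outer circle radii. A minimal sketch that confirms this on noise-free data, where label 0 marks the outer circle and label 1 the inner one:

```python
import numpy as np
from sklearn.datasets import make_circles

X_c, y_c = make_circles(n_samples=200, noise=None, factor=0.5, random_state=0)

# Distance of each point from the origin.
radii = np.linalg.norm(X_c, axis=1)
outer_radius = radii[y_c == 0].mean()
inner_radius = radii[y_c == 1].mean()
print(outer_radius, inner_radius)
```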

#### make_moons { .api }
```python
from sklearn.datasets import make_moons

make_moons(
    n_samples: int | tuple[int, int] = 100,
    shuffle: bool = True,
    noise: float | None = None,
    random_state: int | RandomState | None = None
) -> tuple[ArrayLike, ArrayLike]
```
Make two interleaving half circles.

### Manifold Data Generation

#### make_swiss_roll { .api }
```python
from sklearn.datasets import make_swiss_roll

make_swiss_roll(
    n_samples: int = 100,
    noise: float = 0.0,
    random_state: int | RandomState | None = None,
    hole: bool = False
) -> tuple[ArrayLike, ArrayLike]
```
Generate a swiss roll dataset.

#### make_s_curve { .api }
```python
from sklearn.datasets import make_s_curve

make_s_curve(
    n_samples: int = 100,
    noise: float = 0.0,
    random_state: int | RandomState | None = None
) -> tuple[ArrayLike, ArrayLike]
```
Generate an S curve dataset.

### Biclustering Data Generation

#### make_biclusters { .api }
```python
from sklearn.datasets import make_biclusters

make_biclusters(
    shape: tuple[int, int],
    n_clusters: int,
    noise: float = 0.0,
    minval: int = 10,
    maxval: int = 100,
    shuffle: bool = True,
    random_state: int | RandomState | None = None
) -> tuple[ArrayLike, ArrayLike, ArrayLike]
```
Generate an array with constant block diagonal structure.
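
The three return values are the data matrix plus two boolean indicator arrays that record which rows and which columns belong to each bicluster. A small sketch of their shapes, with illustrative sizes:

```python
from sklearn.datasets import make_biclusters

data, rows, cols = make_biclusters(
    shape=(30, 20), n_clusters=3, noise=0.0, random_state=0
)

# rows[k, i] is True when row i belongs to bicluster k (likewise for cols).
print(data.shape, rows.shape, cols.shape)

# In the block diagonal structure every row sits in exactly one bicluster.
memberships_per_row = rows.sum(axis=0)
print(memberships_per_row)
```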

#### make_checkerboard { .api }
```python
from sklearn.datasets import make_checkerboard

make_checkerboard(
    shape: tuple[int, int],
    n_clusters: int | tuple[int, int],
    noise: float = 0.0,
    minval: int = 10,
    maxval: int = 100,
    shuffle: bool = True,
    random_state: int | RandomState | None = None
) -> tuple[ArrayLike, ArrayLike, ArrayLike]
```
Generate an array with block checkerboard structure.

### Matrix Generation

#### make_low_rank_matrix { .api }
```python
from sklearn.datasets import make_low_rank_matrix

make_low_rank_matrix(
    n_samples: int = 100,
    n_features: int = 100,
    effective_rank: int = 10,
    tail_strength: float = 0.5,
    random_state: int | RandomState | None = None
) -> ArrayLike
```
Generate a mostly low rank matrix with bell-shaped singular values.

#### make_sparse_coded_signal { .api }
```python
from sklearn.datasets import make_sparse_coded_signal

make_sparse_coded_signal(
    n_samples: int,
    n_components: int,
    n_features: int,
    n_nonzero_coefs: int,
    random_state: int | RandomState | None = None
) -> tuple[ArrayLike, ArrayLike, ArrayLike]
```
Generate a signal as a sparse combination of dictionary elements.

#### make_spd_matrix { .api }
```python
from sklearn.datasets import make_spd_matrix

make_spd_matrix(
    n_dim: int,
    random_state: int | RandomState | None = None
) -> ArrayLike
```
Generate a random symmetric, positive-definite matrix.
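
A quick way to sanity-check the result is to test symmetry and the sign of the eigenvalues. The sketch below also runs a Cholesky factorization, which only succeeds for positive-definite input; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_spd_matrix

A = make_spd_matrix(n_dim=4, random_state=0)

is_symmetric = np.allclose(A, A.T)
min_eig = np.linalg.eigvalsh(A).min()  # smallest eigenvalue should be positive

# Cholesky raises LinAlgError if A is not positive-definite.
chol = np.linalg.cholesky(A)
print(is_symmetric, min_eig > 0)
```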

#### make_sparse_spd_matrix { .api }
```python
from sklearn.datasets import make_sparse_spd_matrix

make_sparse_spd_matrix(
    n_dim: int = 1,
    alpha: float = 0.95,
    norm_diag: bool = False,
    smallest_coef: float = 0.1,
    largest_coef: float = 0.9,
    random_state: int | RandomState | None = None
) -> ArrayLike
```
Generate a sparse symmetric positive-definite matrix.

## File I/O Utilities

### SVMLight Format

#### load_svmlight_file { .api }
```python
from sklearn.datasets import load_svmlight_file

load_svmlight_file(
    f: str | IO,
    n_features: int | None = None,
    dtype: type = ...,
    multilabel: bool = False,
    zero_based: bool | str = "auto",
    query_id: bool = False,
    offset: int = 0,
    length: int = -1
) -> tuple[ArrayLike, ArrayLike] | tuple[ArrayLike, ArrayLike, ArrayLike]
```
Load datasets in the svmlight / libsvm format into sparse CSR matrix.

#### load_svmlight_files { .api }
```python
from sklearn.datasets import load_svmlight_files

load_svmlight_files(
    files: list[str | IO],
    n_features: int | None = None,
    dtype: type = ...,
    multilabel: bool = False,
    zero_based: bool | str = "auto",
    query_id: bool = False,
    offset: int = 0,
    length: int = -1
) -> list[tuple[ArrayLike, ArrayLike]] | list[tuple[ArrayLike, ArrayLike, ArrayLike]]
```
Load dataset from multiple files in SVMlight format.

#### dump_svmlight_file { .api }
```python
from sklearn.datasets import dump_svmlight_file

dump_svmlight_file(
    X: ArrayLike,
    y: ArrayLike,
    f: str | IO,
    zero_based: bool = True,
    comment: str | bytes | None = None,
    query_id: ArrayLike | None = None,
    multilabel: bool = False
) -> None
```
Dump the dataset in svmlight / libsvm file format.

## Data Directory Management

#### get_data_home { .api }
```python
from sklearn.datasets import get_data_home

get_data_home(
    data_home: str | None = None
) -> str
```
Return the path to scikit-learn data dir.
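
By default the data directory is a `scikit_learn_data` folder in the user's home directory, and it can also be set through the `SCIKIT_LEARN_DATA` environment variable. The sketch below passes an explicit location instead; the `sk_cache` folder name is a hypothetical choice for illustration.

```python
import os
import tempfile
from sklearn.datasets import get_data_home

with tempfile.TemporaryDirectory() as tmp:
    cache_dir = os.path.join(tmp, "sk_cache")  # hypothetical cache location
    resolved = get_data_home(data_home=cache_dir)

    # The folder is created automatically if it does not exist yet.
    created = os.path.isdir(resolved)

print(resolved, created)
```

The same `data_home` value can then be passed to the `fetch_*` functions so that downloads land in that folder.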

#### clear_data_home { .api }
```python
from sklearn.datasets import clear_data_home

clear_data_home(
    data_home: str | None = None
) -> None
```
Delete all the content in the data home cache.

## Examples

### Loading Built-in Datasets

```python
from sklearn.datasets import load_iris, load_digits, load_wine

# Load iris dataset
iris = load_iris()
X_iris, y_iris = iris.data, iris.target
print(f"Iris dataset: {X_iris.shape}, classes: {len(iris.target_names)}")

# Load digits dataset
digits = load_digits(n_class=10)
X_digits, y_digits = digits.data, digits.target
print(f"Digits dataset: {X_digits.shape}")

# Load wine dataset as tuple
X_wine, y_wine = load_wine(return_X_y=True)
print(f"Wine dataset: {X_wine.shape}")

# Load as pandas DataFrame
wine_frame = load_wine(as_frame=True)
df = wine_frame.frame
print(df.head())
```

### Fetching Real-World Datasets

```python
from sklearn.datasets import fetch_california_housing, fetch_20newsgroups

# Fetch California housing dataset
housing = fetch_california_housing()
X_housing, y_housing = housing.data, housing.target
print(f"Housing dataset: {X_housing.shape}")
print(f"Features: {housing.feature_names}")

# Fetch text data (20 newsgroups)
newsgroups = fetch_20newsgroups(
    subset='train',
    categories=['alt.atheism', 'sci.space']
)
print(f"Newsgroups: {len(newsgroups.data)} documents")
print(f"Categories: {newsgroups.target_names}")
```

### Generating Synthetic Data

```python
from sklearn.datasets import (
    make_classification, make_regression, make_blobs,
    make_circles, make_moons
)

# Classification data
X_clf, y_clf = make_classification(
    n_samples=1000, n_features=20, n_informative=10,
    n_redundant=5, n_classes=3, random_state=42
)
print(f"Classification data: {X_clf.shape}")

# Regression data
X_reg, y_reg = make_regression(
    n_samples=1000, n_features=20, n_informative=10,
    noise=0.1, random_state=42
)
print(f"Regression data: {X_reg.shape}")

# Clustering data - blobs
X_blobs, y_blobs = make_blobs(
    n_samples=300, centers=4, n_features=2,
    random_state=42, cluster_std=0.8
)

# Non-linear clustering data
X_circles, y_circles = make_circles(
    n_samples=300, noise=0.05, factor=0.6, random_state=42
)

X_moons, y_moons = make_moons(
    n_samples=300, noise=0.1, random_state=42
)

print(f"Blobs: {X_blobs.shape}, Circles: {X_circles.shape}, Moons: {X_moons.shape}")
```

### Manifold Learning Data

```python
from sklearn.datasets import make_swiss_roll, make_s_curve

# Generate swiss roll manifold
X_swiss, t_swiss = make_swiss_roll(n_samples=1000, noise=0.1, random_state=42)
print(f"Swiss roll: {X_swiss.shape}")

# Generate S-curve manifold
X_s_curve, t_s_curve = make_s_curve(n_samples=1000, noise=0.1, random_state=42)
print(f"S-curve: {X_s_curve.shape}")
```

### Working with SVMLight Format

```python
from sklearn.datasets import (
    dump_svmlight_file, load_svmlight_file, make_classification
)
import tempfile
import os

# Create sample data
X, y = make_classification(n_samples=100, n_features=10, random_state=42)

# Reserve a temporary file name, then close the handle before writing to it
with tempfile.NamedTemporaryFile(delete=False, suffix='.svmlight') as f:
    filename = f.name

# Save to SVMLight format
dump_svmlight_file(X, y, filename)

# Load from SVMLight format (X_loaded is a sparse CSR matrix)
X_loaded, y_loaded = load_svmlight_file(filename)
print(f"Original: {X.shape}, Loaded: {X_loaded.shape}")

# Clean up
os.unlink(filename)
```

### Custom Dataset Creation

```python
import numpy as np
from sklearn.utils import Bunch

def create_custom_dataset(n_samples=100):
    """Create a custom dataset with specific characteristics."""
    np.random.seed(42)

    # Generate features
    X = np.random.randn(n_samples, 5)

    # Create target with specific pattern
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    # Create a Bunch object similar to sklearn datasets
    return Bunch(
        data=X,
        target=y,
        feature_names=[f'feature_{i}' for i in range(5)],
        target_names=['class_0', 'class_1'],
        DESCR='Custom synthetic dataset'
    )

# Use custom dataset
custom_data = create_custom_dataset(500)
print(f"Custom dataset: {custom_data.data.shape}")
print(f"Features: {custom_data.feature_names}")
```