
# Dataset Utilities

CatBoost includes built-in datasets for testing, learning, and benchmarking machine learning algorithms. These datasets span classification, regression, and ranking tasks across several domains, and each comes preprocessed with accompanying metadata.

## Capabilities

### Built-in Dataset Loading Functions

Pre-processed datasets ready for immediate use with CatBoost models.

```python { .api }
def titanic():
    """
    Load the famous Titanic survival dataset for binary classification.

    Returns:
        tuple: (train_df, test_df)
            - train_df: Training DataFrame with features and 'Survived' target
            - test_df: Test DataFrame with features (no target)

    Features:
        - Passenger class, sex, age, siblings/spouses, parents/children
        - Fare, embarked port, cabin, ticket information
        - Mixed categorical and numerical features
        - Target: Binary survival (0/1)
    """

def amazon():
    """
    Load Amazon employee access dataset for binary classification.

    Returns:
        tuple: (train_df, test_df)
            - train_df: Training DataFrame with features and 'ACTION' target
            - test_df: Test DataFrame with features (no target)

    Features:
        - Employee resource access request attributes
        - All categorical features (role, department, etc.)
        - Target: Binary access approval (0/1)
    """

def adult():
    """
    Load Adult (Census Income) dataset for binary classification.

    Returns:
        tuple: (train_df, test_df)
            - train_df: Training DataFrame with features and income target
            - test_df: Test DataFrame with features (no target)

    Features:
        - Demographics (age, workclass, education, marital status)
        - Work information (occupation, relationship, race, sex)
        - Financial information (capital gain/loss, hours per week)
        - Mixed categorical and numerical features
        - Target: Binary income level (<=50K, >50K)
    """

def epsilon():
    """
    Load Epsilon dataset for binary classification (large-scale dataset).

    Returns:
        tuple: (train_df, test_df)
            - train_df: Training DataFrame (400,000 samples)
            - test_df: Test DataFrame (100,000 samples)

    Features:
        - 2000 numerical features
        - Sparse feature representation
        - Target: Binary classification (0/1)
        - Commonly used for large-scale ML benchmarking
    """

def higgs():
    """
    Load HIGGS dataset for binary classification (physics domain).

    Returns:
        tuple: (train_df, test_df)
            - train_df: Training DataFrame (10.5M samples)
            - test_df: Test DataFrame (500K samples)

    Features:
        - 28 numerical features from particle physics simulations
        - High-energy physics particle collision data
        - Target: Binary classification (signal/background)
        - Benchmark for large-scale classification
    """
```
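
As a quick illustration of the all-categorical case, the sketch below trains directly on amazon(); it assumes the 'ACTION' target column named in the docstring above.

```python
from catboost import CatBoostClassifier
from catboost.datasets import amazon

train_df, _ = amazon()

# Everything except the 'ACTION' target is a categorical feature
X = train_df.drop('ACTION', axis=1)
y = train_df['ACTION']

model = CatBoostClassifier(iterations=100, verbose=False,
                           cat_features=list(X.columns))
model.fit(X, y)
```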

### Text and Sentiment Datasets

Datasets specifically designed for text classification and sentiment analysis tasks.

```python { .api }
def imdb():
    """
    Load IMDB movie reviews dataset for sentiment classification.

    Returns:
        tuple: (train_df, test_df)
            - train_df: Training DataFrame with 'text' and 'label' columns
            - test_df: Test DataFrame with 'text' and 'label' columns

    Features:
        - Movie review text (strings)
        - Preprocessed and cleaned text data
        - Target: Binary sentiment (positive/negative)
        - Suitable for text feature processing in CatBoost
    """

def rotten_tomatoes():
    """
    Load Rotten Tomatoes movie reviews for sentiment classification.

    Returns:
        tuple: (train_df, test_df)
            - train_df: Training DataFrame with review text and sentiment
            - test_df: Test DataFrame with review text and sentiment

    Features:
        - Short movie review snippets
        - Text preprocessing for CatBoost text features
        - Target: Binary sentiment classification
        - Smaller dataset compared to IMDB
    """
```
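
Before wiring either loader into a model, it is worth checking the exact column names the installed CatBoost release returns; a quick, assumption-free inspection sketch:

```python
from catboost.datasets import rotten_tomatoes

# Peek at the layout before building Pools: column names for text and
# sentiment may differ between CatBoost releases
train_df, test_df = rotten_tomatoes()
print(train_df.columns.tolist())
print(train_df.head())
```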

### Ranking Datasets

Specialized datasets for learning-to-rank and information retrieval tasks.

```python { .api }
def msrank():
    """
    Load Microsoft Learning-to-Rank dataset (full version).

    Returns:
        tuple: (train_df, test_df)
            - train_df: Training DataFrame with features, relevance, and query_id
            - test_df: Test DataFrame with features, relevance, and query_id

    Features:
        - 136 numerical features from web search
        - Query-document relevance scores (0-4 scale)
        - Query group identifiers for ranking evaluation
        - Standard benchmark for learning-to-rank algorithms
    """

def msrank_10k():
    """
    Load Microsoft Learning-to-Rank dataset (10K subset).

    Returns:
        tuple: (train_df, test_df)
            - train_df: Training DataFrame (subset of msrank)
            - test_df: Test DataFrame (subset of msrank)

    Features:
        - Same features as msrank() but smaller size
        - Suitable for quick testing and prototyping
        - Maintains query group structure for ranking
    """
```

### Synthetic and Mathematical Datasets

Datasets with known mathematical properties for algorithm testing.

```python { .api }
def monotonic1():
    """
    Load first monotonic regression dataset.

    Returns:
        tuple: (train_df, test_df)
            - train_df: Training DataFrame with monotonic relationships
            - test_df: Test DataFrame for evaluation

    Features:
        - Features with known monotonic relationships to target
        - Useful for testing monotonic constraints in CatBoost
        - Synthetic data with controlled properties
    """

def monotonic2():
    """
    Load second monotonic regression dataset.

    Returns:
        tuple: (train_df, test_df)
            - train_df: Training DataFrame with different monotonic patterns
            - test_df: Test DataFrame for evaluation

    Features:
        - Alternative monotonic feature patterns
        - Complementary to monotonic1() for comprehensive testing
        - Different complexity and noise levels
    """
```
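
These loaders pair naturally with CatBoost's `monotone_constraints` training parameter. The sketch below is illustrative rather than definitive: the exact column layout of monotonic1() is not documented here, so inspect the returned frames and adjust the target/feature split accordingly.

```python
from catboost import CatBoostRegressor
from catboost.datasets import monotonic1

train_df, test_df = monotonic1()

# Assumed layout: first column is the target, the rest are features;
# inspect train_df.head() and adjust for your CatBoost version
y_train = train_df.iloc[:, 0]
X_train = train_df.iloc[:, 1:]

# Constrain the first feature to have a non-decreasing effect on the
# prediction (1 = increasing, -1 = decreasing, 0 = unconstrained)
constraints = [1] + [0] * (X_train.shape[1] - 1)

model = CatBoostRegressor(
    iterations=100,
    monotone_constraints=constraints,
    verbose=False
)
model.fit(X_train, y_train)
```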

### Dataset Cache Management

Functions for managing dataset storage and caching.

```python { .api }
def set_cache_path(path):
    """
    Set the cache directory for downloaded datasets.

    Parameters:
        - path: Directory path for caching datasets (string)
        - Must be a writable directory
        - Datasets will be downloaded and stored here
        - Subsequent calls will use cached versions

    Example:
        set_cache_path('/path/to/dataset/cache')
    """
```

## Dataset Usage Examples

### Basic Dataset Loading

```python
from catboost.datasets import titanic
from catboost import CatBoostClassifier

# Load Titanic dataset
train_df, test_df = titanic()
print(f"Titanic - Train shape: {train_df.shape}, Test shape: {test_df.shape}")

# Prepare features and target (PassengerId is a row identifier, not a feature)
X_train = train_df.drop(['Survived', 'PassengerId'], axis=1)
y_train = train_df['Survived']

# CatBoost tolerates NaN in numerical features (e.g. Age) but not in
# categorical ones, so fill the missing categorical values first
X_train[['Cabin', 'Embarked']] = X_train[['Cabin', 'Embarked']].fillna('Unknown')

# Declare every non-numeric column (plus the integer-coded Pclass) as categorical
cat_features = ['Pclass', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']

# Train model
model = CatBoostClassifier(
    iterations=100,
    verbose=False,
    cat_features=cat_features
)

model.fit(X_train, y_train)
print("Model trained on Titanic dataset")
```
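
Because the Titanic test split ships without the 'Survived' target, the fitted model can be used for inference directly. A short continuation of the example above, applying the same preprocessing to test_df:

```python
# Mirror the training-time preprocessing on the test split
X_test = test_df.drop('PassengerId', axis=1)
X_test[['Cabin', 'Embarked']] = X_test[['Cabin', 'Embarked']].fillna('Unknown')

survival_pred = model.predict(X_test)         # hard 0/1 predictions
survival_proba = model.predict_proba(X_test)  # class probabilities
print(survival_pred[:10])
```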

### Text Dataset Processing

```python
from catboost.datasets import imdb
from catboost import CatBoostClassifier, Pool

# Load IMDB dataset
train_df, test_df = imdb()
print(f"IMDB - Train shape: {train_df.shape}")

# Create pools with text features (the label column must not be part of the data)
train_pool = Pool(
    data=train_df.drop('label', axis=1),
    label=train_df['label'],
    text_features=['text']  # Specify text column
)

test_pool = Pool(
    data=test_df.drop('label', axis=1),
    label=test_df['label'],
    text_features=['text']
)

# Train model with text processing
model = CatBoostClassifier(
    iterations=200,
    verbose=50,
    text_processing={
        'tokenizers': [{'tokenizer_id': 'Space', 'separator_type': 'ByDelimiter', 'delimiter': ' '}],
        'dictionaries': [{'dictionary_id': 'Word', 'max_dictionary_size': '50000'}],
        'feature_processing': {
            'default': [{'dictionaries_names': ['Word'], 'feature_calcers': ['BoW']}]
        }
    }
)

model.fit(train_pool, eval_set=test_pool)
print("Model trained on IMDB text data")
```

### Ranking Dataset Usage

```python
from catboost.datasets import msrank_10k
from catboost import CatBoostRanker, Pool

# Load ranking dataset
train_df, test_df = msrank_10k()
print(f"MSRank 10K - Train shape: {train_df.shape}")

# Extract features, labels, and group IDs
feature_cols = [col for col in train_df.columns if col not in ['label', 'query_id']]
X_train = train_df[feature_cols]
y_train = train_df['label']
group_id_train = train_df['query_id']

X_test = test_df[feature_cols]
y_test = test_df['label']
group_id_test = test_df['query_id']

# Create pools for ranking
train_pool = Pool(
    data=X_train,
    label=y_train,
    group_id=group_id_train
)

test_pool = Pool(
    data=X_test,
    label=y_test,
    group_id=group_id_test
)

# Train ranking model
ranker = CatBoostRanker(
    iterations=200,
    learning_rate=0.1,
    depth=6,
    loss_function='YetiRank',
    eval_metric='NDCG',
    verbose=50
)

ranker.fit(train_pool, eval_set=test_pool)
print("Ranking model trained on MSRank dataset")
```
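
After fitting, the evaluation history is available from the model itself; a brief continuation of the example, reading back the best metric values recorded during training:

```python
# Best scores observed on the train and eval sets, and the iteration
# at which the best eval-set NDCG was reached
print(ranker.get_best_score())
print(ranker.get_best_iteration())
```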

### Large Dataset Handling

```python
from catboost.datasets import epsilon, set_cache_path
from catboost import CatBoostClassifier, Pool
import os

# Set cache directory for large datasets
cache_dir = '/tmp/catboost_datasets'
os.makedirs(cache_dir, exist_ok=True)
set_cache_path(cache_dir)

# Load large dataset (this may take time on first run)
print("Loading epsilon dataset...")
train_df, test_df = epsilon()
print(f"Epsilon - Train: {train_df.shape}, Test: {test_df.shape}")

# For very large datasets, consider file-based training:
# save to TSV files and point Pool at the file paths
train_df.to_csv('epsilon_train.tsv', sep='\t', index=False)
test_df.to_csv('epsilon_test.tsv', sep='\t', index=False)

# Create pools from files for memory efficiency (without a column
# description file, CatBoost treats the first column as the label)
train_pool = Pool('epsilon_train.tsv', delimiter='\t', has_header=True)
test_pool = Pool('epsilon_test.tsv', delimiter='\t', has_header=True)

# Train with limited memory usage
model = CatBoostClassifier(
    iterations=100,
    learning_rate=0.1,
    depth=6,
    verbose=25,
    used_ram_limit='4gb'  # Limit RAM usage
)

model.fit(train_pool, eval_set=test_pool)
print("Large dataset model training completed")
```

### Dataset Comparison and Analysis

```python
from catboost.datasets import titanic, adult, amazon

def analyze_dataset(load_func, name, target_col):
    """Analyze a CatBoost dataset."""
    train_df, test_df = load_func()

    print(f"\n{name} Dataset Analysis:")
    print(f"  Train shape: {train_df.shape}")
    print(f"  Test shape: {test_df.shape}")
    print(f"  Features: {train_df.shape[1] - 1}")  # Excluding target

    # Identify column types
    numeric_cols = train_df.select_dtypes(include=['number']).columns
    categorical_cols = train_df.select_dtypes(include=['object', 'category']).columns

    print(f"  Numeric features: {len(numeric_cols)}")
    print(f"  Categorical features: {len(categorical_cols)}")

    # Target analysis (the target is not always the last column, so it is
    # passed in explicitly)
    target_unique = train_df[target_col].nunique()
    print(f"  Target classes: {target_unique}")
    print(f"  Target distribution: {dict(train_df[target_col].value_counts())}")

# Analyze multiple datasets; target names follow the loader docstrings above
# (verify the adult() target column name against your CatBoost version)
datasets = [
    (titanic, "Titanic", "Survived"),
    (adult, "Adult", "income"),
    (amazon, "Amazon", "ACTION")
]

for load_func, name, target_col in datasets:
    analyze_dataset(load_func, name, target_col)
```

### Custom Dataset Cache Management

```python
from catboost.datasets import set_cache_path
import os

# Set custom cache location
custom_cache = "/home/user/ml_datasets"
os.makedirs(custom_cache, exist_ok=True)
set_cache_path(custom_cache)

print(f"Cache path set to: {custom_cache}")

# Load dataset (will cache in new location)
from catboost.datasets import titanic
train_df, test_df = titanic()

# List cached files
cache_files = os.listdir(custom_cache)
print(f"Cached files: {cache_files}")
```