CatBoost is a fast, scalable, high-performance library for gradient boosting on decision trees, used for ranking, classification, regression, and other ML tasks.
CatBoost includes built-in datasets for testing, learning, and benchmarking machine learning algorithms. These datasets cover various domains including classification, regression, and ranking tasks, with proper preprocessing and metadata.
Pre-processed datasets ready for immediate use with CatBoost models.
def titanic():
    """
    Load the famous Titanic survival dataset for binary classification.

    Returns:
        tuple: (train_df, test_df)
            - train_df: Training DataFrame with features and 'Survived' target
            - test_df: Test DataFrame with features (no target)

    Features:
        - Passenger class, sex, age, siblings/spouses, parents/children
        - Fare, embarked port, cabin, ticket information
        - Mixed categorical and numerical features
        - Target: Binary survival (0/1)
    """
def amazon():
    """
    Load Amazon employee access dataset for binary classification.

    Returns:
        tuple: (train_df, test_df)
            - train_df: Training DataFrame with features and 'ACTION' target
            - test_df: Test DataFrame with features (no target)

    Features:
        - Employee resource access request attributes
        - All categorical features (role, department, etc.)
        - Target: Binary access approval (0/1)
    """
def adult():
    """
    Load Adult (Census Income) dataset for binary classification.

    Returns:
        tuple: (train_df, test_df)
            - train_df: Training DataFrame with features and income target
            - test_df: Test DataFrame with features (no target)

    Features:
        - Demographics (age, workclass, education, marital status)
        - Work information (occupation, relationship, race, sex)
        - Financial information (capital gain/loss, hours per week)
        - Mixed categorical and numerical features
        - Target: Binary income level (<=50K, >50K)
    """
def epsilon():
    """
    Load Epsilon dataset for binary classification (large-scale dataset).

    Returns:
        tuple: (train_df, test_df)
            - train_df: Training DataFrame (400,000 samples)
            - test_df: Test DataFrame (100,000 samples)

    Features:
        - 2000 numerical features
        - Sparse feature representation
        - Target: Binary classification (0/1)
        - Commonly used for large-scale ML benchmarking
    """
def higgs():
    """
    Load HIGGS dataset for binary classification (physics domain).

    Returns:
        tuple: (train_df, test_df)
            - train_df: Training DataFrame (10.5M samples)
            - test_df: Test DataFrame (500K samples)

    Features:
        - 28 numerical features from particle physics simulations
        - High-energy physics particle collision data
        - Target: Binary classification (signal/background)
        - Benchmark for large-scale classification
    """

Datasets specifically designed for text classification and sentiment analysis tasks.
def imdb():
    """
    Load IMDB movie reviews dataset for sentiment classification.

    Returns:
        tuple: (train_df, test_df)
            - train_df: Training DataFrame with 'text' and 'label' columns
            - test_df: Test DataFrame with 'text' and 'label' columns

    Features:
        - Movie review text (strings)
        - Preprocessed and cleaned text data
        - Target: Binary sentiment (positive/negative)
        - Suitable for text feature processing in CatBoost
    """
def rotten_tomatoes():
    """
    Load Rotten Tomatoes movie reviews for sentiment classification.

    Returns:
        tuple: (train_df, test_df)
            - train_df: Training DataFrame with review text and sentiment
            - test_df: Test DataFrame with review text and sentiment

    Features:
        - Short movie review snippets
        - Text preprocessing for CatBoost text features
        - Target: Binary sentiment classification
        - Smaller dataset compared to IMDB
    """

Specialized datasets for learning-to-rank and information retrieval tasks.
def msrank():
    """
    Load Microsoft Learning-to-Rank dataset (full version).

    Returns:
        tuple: (train_df, test_df)
            - train_df: Training DataFrame with features, relevance, and query id
            - test_df: Test DataFrame with features, relevance, and query id

    Features:
        - 136 numerical features from web search
        - Query-document relevance scores (0-4 scale)
        - Query group identifiers for ranking evaluation
        - Standard benchmark for learning-to-rank algorithms
    """
def msrank_10k():
    """
    Load Microsoft Learning-to-Rank dataset (10K subset).

    Returns:
        tuple: (train_df, test_df)
            - train_df: Training DataFrame (subset of msrank)
            - test_df: Test DataFrame (subset of msrank)

    Features:
        - Same features as msrank() but smaller size
        - Suitable for quick testing and prototyping
        - Maintains query group structure for ranking
    """

Datasets with known mathematical properties for algorithm testing.
def monotonic1():
    """
    Load first monotonic regression dataset.

    Returns:
        tuple: (train_df, test_df)
            - train_df: Training DataFrame with monotonic relationships
            - test_df: Test DataFrame for evaluation

    Features:
        - Features with known monotonic relationships to target
        - Useful for testing monotonic constraints in CatBoost
        - Synthetic data with controlled properties
    """
def monotonic2():
    """
    Load second monotonic regression dataset.

    Returns:
        tuple: (train_df, test_df)
            - train_df: Training DataFrame with different monotonic patterns
            - test_df: Test DataFrame for evaluation

    Features:
        - Alternative monotonic feature patterns
        - Complementary to monotonic1() for comprehensive testing
        - Different complexity and noise levels
    """

Functions for managing dataset storage and caching.
def set_cache_path(path):
    """
    Set the cache directory for downloaded datasets.

    Parameters:
        - path: Directory path for caching datasets (string)
          - Must be a writable directory
          - Datasets will be downloaded and stored here
          - Subsequent calls will use cached versions

    Example:
        set_cache_path('/path/to/dataset/cache')
    """

from catboost.datasets import titanic
from catboost import CatBoostClassifier

# Load Titanic dataset
train_df, test_df = titanic()
print(f"Titanic - Train shape: {train_df.shape}, Test shape: {test_df.shape}")

# Prepare features and target
X_train = train_df.drop('Survived', axis=1)
y_train = train_df['Survived']

# CatBoost does not accept NaN in categorical features, so fill missing values
X_train = X_train.fillna(-999)

# Declare the string columns (plus the integer-coded Pclass) as categorical
cat_features = list(X_train.select_dtypes(include='object').columns) + ['Pclass']

# Train model
model = CatBoostClassifier(
    iterations=100,
    verbose=False,
    cat_features=cat_features
)
model.fit(X_train, y_train)
print("Model trained on Titanic dataset")

from catboost.datasets import imdb
from catboost import CatBoostClassifier, Pool

# Load IMDB dataset
train_df, test_df = imdb()
print(f"IMDB - Train shape: {train_df.shape}")

# Create pools with text features; drop the label column from the feature data
train_pool = Pool(
    data=train_df.drop('label', axis=1),
    label=train_df['label'],
    text_features=['text']  # Specify text column
)
test_pool = Pool(
    data=test_df.drop('label', axis=1),
    label=test_df['label'],
    text_features=['text']
)

# Train model with custom text processing
model = CatBoostClassifier(
    iterations=200,
    verbose=50,
    text_processing={
        'tokenizers': [{'tokenizer_id': 'Space', 'separator_type': 'ByDelimiter', 'delimiter': ' '}],
        'dictionaries': [{'dictionary_id': 'Word', 'max_dictionary_size': '50000'}],
        'feature_processing': {
            'default': [{'dictionaries_names': ['Word'], 'feature_calcers': ['BoW']}]
        }
    }
)
model.fit(train_pool, eval_set=test_pool)
print("Model trained on IMDB text data")

from catboost.datasets import msrank_10k
from catboost import CatBoostRanker, Pool

# Load ranking dataset; the DataFrames have unnamed integer columns:
# column 0 is the relevance label, column 1 is the query id,
# and the remaining columns are the numerical features
train_df, test_df = msrank_10k()
print(f"MSRank 10K - Train shape: {train_df.shape}")

# Extract features, labels, and group IDs
X_train = train_df.drop([0, 1], axis=1)
y_train = train_df[0]
group_id_train = train_df[1]

X_test = test_df.drop([0, 1], axis=1)
y_test = test_df[0]
group_id_test = test_df[1]

# Create pools for ranking
train_pool = Pool(
    data=X_train,
    label=y_train,
    group_id=group_id_train
)
test_pool = Pool(
    data=X_test,
    label=y_test,
    group_id=group_id_test
)

# Train ranking model
ranker = CatBoostRanker(
    iterations=200,
    learning_rate=0.1,
    depth=6,
    loss_function='YetiRank',
    eval_metric='NDCG',
    verbose=50
)
ranker.fit(train_pool, eval_set=test_pool)
print("Ranking model trained on MSRank dataset")

from catboost.datasets import epsilon, set_cache_path
from catboost import CatBoostClassifier, Pool
import os

# Set cache directory for large datasets
cache_dir = '/tmp/catboost_datasets'
os.makedirs(cache_dir, exist_ok=True)
set_cache_path(cache_dir)

# Load large dataset (this may take time on first run)
print("Loading epsilon dataset...")
train_df, test_df = epsilon()
print(f"Epsilon - Train: {train_df.shape}, Test: {test_df.shape}")

# For very large datasets, consider file-based training:
# save to TSV and create Pools from file paths for memory efficiency
train_df.to_csv('epsilon_train.tsv', sep='\t', index=False)
test_df.to_csv('epsilon_test.tsv', sep='\t', index=False)

train_pool = Pool('epsilon_train.tsv', delimiter='\t', has_header=True)
test_pool = Pool('epsilon_test.tsv', delimiter='\t', has_header=True)

# Train with limited memory usage
model = CatBoostClassifier(
    iterations=100,
    learning_rate=0.1,
    depth=6,
    verbose=25,
    used_ram_limit='4gb'  # Limit RAM usage
)
model.fit(train_pool, eval_set=test_pool)
print("Large dataset model training completed")

from catboost.datasets import titanic, adult, amazon
import pandas as pd

def analyze_dataset(load_func, name):
    """Print basic statistics for a CatBoost dataset."""
    train_df, test_df = load_func()
    print(f"\n{name} Dataset Analysis:")
    print(f"  Train shape: {train_df.shape}")
    print(f"  Test shape: {test_df.shape}")
    print(f"  Features: {train_df.shape[1] - 1}")  # Excluding target

    # Identify column types
    numeric_cols = train_df.select_dtypes(include=['number']).columns
    categorical_cols = train_df.select_dtypes(include=['object', 'category']).columns
    print(f"  Numeric features: {len(numeric_cols)}")
    print(f"  Categorical features: {len(categorical_cols)}")

    # Target analysis (heuristic: assume the last column is the target;
    # this does not hold for every dataset, e.g. Titanic's target is 'Survived')
    target_col = train_df.columns[-1]
    target_unique = train_df[target_col].nunique()
    print(f"  Target classes: {target_unique}")
    print(f"  Target distribution: {dict(train_df[target_col].value_counts())}")

# Analyze multiple datasets
datasets = [
    (titanic, "Titanic"),
    (adult, "Adult"),
    (amazon, "Amazon")
]
for load_func, name in datasets:
    analyze_dataset(load_func, name)

from catboost.datasets import set_cache_path
import os

# Set custom cache location
custom_cache = "/home/user/ml_datasets"
os.makedirs(custom_cache, exist_ok=True)
set_cache_path(custom_cache)
print(f"Cache path set to: {custom_cache}")

# Load dataset (will cache in the new location)
from catboost.datasets import titanic
train_df, test_df = titanic()

# List cached files
cache_files = os.listdir(custom_cache)
print(f"Cached files: {cache_files}")

Install with Tessl CLI:

npx tessl i tessl/pypi-catboost