
tessl/pypi-catboost

CatBoost is a fast, scalable, high-performance library for gradient boosting on decision trees, used for ranking, classification, regression, and other ML tasks.


Dataset Utilities

CatBoost includes built-in datasets for testing, learning, and benchmarking machine learning algorithms. These datasets cover various domains including classification, regression, and ranking tasks, with proper preprocessing and metadata.

Capabilities

Built-in Dataset Loading Functions

Pre-processed datasets ready for immediate use with CatBoost models.

def titanic():
    """
    Load the famous Titanic survival dataset for binary classification.
    
    Returns:
    tuple: (train_df, test_df)
        - train_df: Training DataFrame with features and 'Survived' target
        - test_df: Test DataFrame with features (no target)
        
    Features:
    - Passenger class, sex, age, siblings/spouses, parents/children
    - Fare, embarked port, cabin, ticket information
    - Mixed categorical and numerical features
    - Target: Binary survival (0/1)
    """

def amazon():
    """
    Load Amazon employee access dataset for binary classification.
    
    Returns:
    tuple: (train_df, test_df)
        - train_df: Training DataFrame with features and 'ACTION' target
        - test_df: Test DataFrame with features (no target)
        
    Features:
    - Employee resource access request attributes
    - All categorical features (role, department, etc.)
    - Target: Binary access approval (0/1)
    """

def adult():
    """
    Load Adult (Census Income) dataset for binary classification.
    
    Returns:
    tuple: (train_df, test_df)
        - train_df: Training DataFrame with features and income target
        - test_df: Test DataFrame with features (no target)
        
    Features:
    - Demographics (age, workclass, education, marital status)
    - Work information (occupation, relationship, race, sex)
    - Financial information (capital gain/loss, hours per week)
    - Mixed categorical and numerical features
    - Target: Binary income level (<=50K, >50K)
    """

def epsilon():
    """
    Load Epsilon dataset for binary classification (large-scale dataset).
    
    Returns:
    tuple: (train_df, test_df)
        - train_df: Training DataFrame (400,000 samples)
        - test_df: Test DataFrame (100,000 samples)
        
    Features:
    - 2000 numerical features
    - Sparse feature representation
    - Target: Binary classification (0/1)
    - Commonly used for large-scale ML benchmarking
    """

def higgs():
    """
    Load HIGGS dataset for binary classification (physics domain).
    
    Returns:
    tuple: (train_df, test_df)
        - train_df: Training DataFrame (10.5M samples)
        - test_df: Test DataFrame (500K samples)
        
    Features:
    - 28 numerical features from particle physics simulations
    - High-energy physics particle collision data
    - Target: Binary classification (signal/background)
    - Benchmark for large-scale classification
    """

Text and Sentiment Datasets

Datasets specifically designed for text classification and sentiment analysis tasks.

def imdb():
    """
    Load IMDB movie reviews dataset for sentiment classification.
    
    Returns:
    tuple: (train_df, test_df)
        - train_df: Training DataFrame with 'text' and 'label' columns
        - test_df: Test DataFrame with 'text' and 'label' columns
        
    Features:
    - Movie review text (strings)
    - Preprocessed and cleaned text data
    - Target: Binary sentiment (positive/negative)
    - Suitable for text feature processing in CatBoost
    """

def rotten_tomatoes():
    """
    Load Rotten Tomatoes movie reviews for sentiment classification.
    
    Returns:
    tuple: (train_df, test_df)
        - train_df: Training DataFrame with review text and sentiment
        - test_df: Test DataFrame with review text and sentiment
        
    Features:
    - Short movie review snippets
    - Text preprocessing for CatBoost text features
    - Target: Binary sentiment classification
    - Smaller dataset compared to IMDB
    """

Ranking Datasets

Specialized datasets for learning-to-rank and information retrieval tasks.

def msrank():
    """
    Load Microsoft Learning-to-Rank dataset (full version).
    
    Returns:
    tuple: (train_df, test_df)
        - train_df: Training DataFrame with features, relevance, and query_id
        - test_df: Test DataFrame with features, relevance, and query_id
        
    Features:
    - 136 numerical features from web search
    - Query-document relevance scores (0-4 scale)
    - Query group identifiers for ranking evaluation
    - Standard benchmark for learning-to-rank algorithms
    """

def msrank_10k():
    """
    Load Microsoft Learning-to-Rank dataset (10K subset).
    
    Returns:
    tuple: (train_df, test_df)
        - train_df: Training DataFrame (subset of msrank)
        - test_df: Test DataFrame (subset of msrank)
        
    Features:
    - Same features as msrank() but smaller size
    - Suitable for quick testing and prototyping
    - Maintains query group structure for ranking
    """

Synthetic and Mathematical Datasets

Datasets with known mathematical properties for algorithm testing.

def monotonic1():
    """
    Load first monotonic regression dataset.
    
    Returns:
    tuple: (train_df, test_df)
        - train_df: Training DataFrame with monotonic relationships
        - test_df: Test DataFrame for evaluation
        
    Features:
    - Features with known monotonic relationships to target
    - Useful for testing monotonic constraints in CatBoost
    - Synthetic data with controlled properties
    """

def monotonic2():
    """
    Load second monotonic regression dataset.
    
    Returns:
    tuple: (train_df, test_df)
        - train_df: Training DataFrame with different monotonic patterns
        - test_df: Test DataFrame for evaluation
        
    Features:
    - Alternative monotonic feature patterns
    - Complementary to monotonic1() for comprehensive testing
    - Different complexity and noise levels
    """

Dataset Cache Management

Functions for managing dataset storage and caching.

def set_cache_path(path):
    """
    Set the cache directory for downloaded datasets.
    
    Parameters:
    - path: Directory path for caching datasets (string)
        - Must be writable directory
        - Datasets will be downloaded and stored here
        - Subsequent calls will use cached versions
    
    Example:
    set_cache_path('/path/to/dataset/cache')
    """

Dataset Usage Examples

Basic Dataset Loading

from catboost.datasets import titanic, adult, amazon
from catboost import CatBoostClassifier, Pool

# Load Titanic dataset
train_df, test_df = titanic()
print(f"Titanic - Train shape: {train_df.shape}, Test shape: {test_df.shape}")

# Prepare features and target
X_train = train_df.drop('Survived', axis=1)
y_train = train_df['Survived']

# Declare every non-numeric column as categorical; CatBoost requires
# categorical features to be free of NaN, so fill missing values first
cat_features = X_train.select_dtypes(exclude='number').columns.tolist()
X_train[cat_features] = X_train[cat_features].fillna('Unknown')

# Train model
model = CatBoostClassifier(
    iterations=100,
    verbose=False,
    cat_features=cat_features
)

model.fit(X_train, y_train)
print("Model trained on Titanic dataset")

Text Dataset Processing

from catboost.datasets import imdb
from catboost import CatBoostClassifier, Pool

# Load IMDB dataset
train_df, test_df = imdb()
print(f"IMDB - Train shape: {train_df.shape}")

# Create pools with text features (keep the label column out of the data)
train_pool = Pool(
    data=train_df.drop('label', axis=1),
    label=train_df['label'],
    text_features=['text']  # Specify text column
)

test_pool = Pool(
    data=test_df.drop('label', axis=1),
    label=test_df['label'],
    text_features=['text']
)

# Train model with text processing
model = CatBoostClassifier(
    iterations=200,
    verbose=50,
    text_processing={
        'tokenizers': [{'tokenizer_id': 'Space', 'separator_type': 'ByDelimiter', 'delimiter': ' '}],
        'dictionaries': [{'dictionary_id': 'Word', 'max_dictionary_size': '50000'}],
        'feature_processing': {
            'default': [{'dictionaries_names': ['Word'], 'feature_calcers': ['BoW']}]
        }
    }
)

model.fit(train_pool, eval_set=test_pool)
print("Model trained on IMDB text data")

Ranking Dataset Usage

from catboost.datasets import msrank_10k
from catboost import CatBoostRanker, Pool

# Load ranking dataset
train_df, test_df = msrank_10k()
print(f"MSRank 10K - Train shape: {train_df.shape}")

# Extract features, labels, and group IDs. The loader returns positionally
# indexed columns: the first holds the relevance label, the second the
# query id, and the rest are features.
label_col, group_col = train_df.columns[0], train_df.columns[1]
feature_cols = train_df.columns[2:]

X_train = train_df[feature_cols]
y_train = train_df[label_col]
group_id_train = train_df[group_col]

X_test = test_df[feature_cols]
y_test = test_df[label_col]
group_id_test = test_df[group_col]

# Create pools for ranking
train_pool = Pool(
    data=X_train,
    label=y_train,
    group_id=group_id_train
)

test_pool = Pool(
    data=X_test,
    label=y_test,
    group_id=group_id_test
)

# Train ranking model
ranker = CatBoostRanker(
    iterations=200,
    learning_rate=0.1,
    depth=6,
    loss_function='YetiRank',
    eval_metric='NDCG',
    verbose=50
)

ranker.fit(train_pool, eval_set=test_pool)
print("Ranking model trained on MSRank dataset")

Large Dataset Handling

from catboost.datasets import epsilon, higgs, set_cache_path
from catboost import CatBoostClassifier
import os

# Set cache directory for large datasets
cache_dir = '/tmp/catboost_datasets'
os.makedirs(cache_dir, exist_ok=True)
set_cache_path(cache_dir)

# Load large dataset (this may take time on first run)
print("Loading epsilon dataset...")
train_df, test_df = epsilon()
print(f"Epsilon - Train: {train_df.shape}, Test: {test_df.shape}")

# For very large datasets, consider file-based training: save to TSV and
# pass file paths to Pool. With no column-description file, CatBoost
# treats the first column of the file as the label by default.
train_df.to_csv('epsilon_train.tsv', sep='\t', index=False)
test_df.to_csv('epsilon_test.tsv', sep='\t', index=False)

# Create pools from files for memory efficiency
from catboost import Pool
train_pool = Pool('epsilon_train.tsv', delimiter='\t', has_header=True)
test_pool = Pool('epsilon_test.tsv', delimiter='\t', has_header=True)

# Train with limited memory usage
model = CatBoostClassifier(
    iterations=100,
    learning_rate=0.1,
    depth=6,
    verbose=25,
    used_ram_limit='4gb'  # Limit RAM usage
)

model.fit(train_pool, eval_set=test_pool)
print("Large dataset model training completed")

Dataset Comparison and Analysis

from catboost.datasets import titanic, adult, amazon
import pandas as pd

def analyze_dataset(load_func, name, target_col):
    """Analyze a CatBoost dataset given its target column name."""
    train_df, test_df = load_func()
    
    print(f"\n{name} Dataset Analysis:")
    print(f"  Train shape: {train_df.shape}")
    print(f"  Test shape: {test_df.shape}")
    print(f"  Features: {train_df.shape[1] - 1}")  # Excluding target
    
    # Identify column types
    numeric_cols = train_df.select_dtypes(include=['number']).columns
    categorical_cols = train_df.select_dtypes(include=['object', 'category']).columns
    
    print(f"  Numeric features: {len(numeric_cols)}")
    print(f"  Categorical features: {len(categorical_cols)}")
    
    # Target analysis (target column names differ per dataset)
    if target_col in train_df.columns:
        print(f"  Target classes: {train_df[target_col].nunique()}")
        print(f"  Target distribution: {dict(train_df[target_col].value_counts())}")

# Analyze multiple datasets; adjust the target names if your CatBoost
# version labels them differently
datasets = [
    (titanic, "Titanic", "Survived"),
    (adult, "Adult", "income"),
    (amazon, "Amazon", "ACTION")
]

for load_func, name, target_col in datasets:
    analyze_dataset(load_func, name, target_col)

Custom Dataset Cache Management

from catboost.datasets import set_cache_path
import os

# Set custom cache location
custom_cache = "/home/user/ml_datasets"
os.makedirs(custom_cache, exist_ok=True)
set_cache_path(custom_cache)

print(f"Cache path set to: {custom_cache}")

# Load dataset (will cache in new location)
from catboost.datasets import titanic
train_df, test_df = titanic()

# List cached files
cache_files = os.listdir(custom_cache)
print(f"Cached files: {cache_files}")

Install with Tessl CLI

npx tessl i tessl/pypi-catboost
