CatBoost is a fast, scalable, high-performance library for gradient boosting on decision trees, used for ranking, classification, regression, and other ML tasks.
CatBoost includes built-in datasets for testing, learning, and benchmarking machine learning algorithms. These datasets cover various domains including classification, regression, and ranking tasks, with proper preprocessing and metadata.
Pre-processed datasets ready for immediate use with CatBoost models.
def titanic():
    """
    Load the famous Titanic survival dataset for binary classification.

    Returns:
        tuple: (train_df, test_df)
            - train_df: Training DataFrame with features and 'Survived' target
            - test_df: Test DataFrame with features (no target)

    Features:
        - Passenger class, sex, age, siblings/spouses, parents/children
        - Fare, embarked port, cabin, ticket information
        - Mixed categorical and numerical features
        - Target: Binary survival (0/1)
    """
def amazon():
    """
    Load Amazon employee access dataset for binary classification.

    Returns:
        tuple: (train_df, test_df)
            - train_df: Training DataFrame with features and 'ACTION' target
            - test_df: Test DataFrame with features (no target)

    Features:
        - Employee resource access request attributes
        - All categorical features (role, department, etc.)
        - Target: Binary access approval (0/1)
    """
def adult():
    """
    Load Adult (Census Income) dataset for binary classification.

    Returns:
        tuple: (train_df, test_df)
            - train_df: Training DataFrame with features and income target
            - test_df: Test DataFrame with features (no target)

    Features:
        - Demographics (age, workclass, education, marital status)
        - Work information (occupation, relationship, race, sex)
        - Financial information (capital gain/loss, hours per week)
        - Mixed categorical and numerical features
        - Target: Binary income level (<=50K, >50K)
    """
def epsilon():
    """
    Load Epsilon dataset for binary classification (large-scale dataset).

    Returns:
        tuple: (train_df, test_df)
            - train_df: Training DataFrame (400,000 samples)
            - test_df: Test DataFrame (100,000 samples)

    Features:
        - 2000 numerical features
        - Sparse feature representation
        - Target: Binary classification (0/1)
        - Commonly used for large-scale ML benchmarking
    """
def higgs():
    """
    Load HIGGS dataset for binary classification (physics domain).

    Returns:
        tuple: (train_df, test_df)
            - train_df: Training DataFrame (10.5M samples)
            - test_df: Test DataFrame (500K samples)

    Features:
        - 28 numerical features from particle physics simulations
        - High-energy physics particle collision data
        - Target: Binary classification (signal/background)
        - Benchmark for large-scale classification
    """

Datasets specifically designed for text classification and sentiment analysis tasks.
def imdb():
    """
    Load IMDB movie reviews dataset for sentiment classification.

    Returns:
        tuple: (train_df, test_df)
            - train_df: Training DataFrame with 'text' and 'label' columns
            - test_df: Test DataFrame with 'text' and 'label' columns

    Features:
        - Movie review text (strings)
        - Preprocessed and cleaned text data
        - Target: Binary sentiment (positive/negative)
        - Suitable for text feature processing in CatBoost
    """
def rotten_tomatoes():
    """
    Load Rotten Tomatoes movie reviews for sentiment classification.

    Returns:
        tuple: (train_df, test_df)
            - train_df: Training DataFrame with review text and sentiment
            - test_df: Test DataFrame with review text and sentiment

    Features:
        - Short movie review snippets
        - Text preprocessing for CatBoost text features
        - Target: Binary sentiment classification
        - Smaller dataset compared to IMDB
    """

Specialized datasets for learning-to-rank and information retrieval tasks.
def msrank():
    """
    Load Microsoft Learning-to-Rank dataset (full version).

    Returns:
        tuple: (train_df, test_df)
            - train_df: Training DataFrame with features, relevance, and query id
            - test_df: Test DataFrame with features, relevance, and query id

    Features:
        - 136 numerical features from web search
        - Query-document relevance scores (0-4 scale)
        - Query group identifiers for ranking evaluation
        - Standard benchmark for learning-to-rank algorithms
    """
def msrank_10k():
    """
    Load Microsoft Learning-to-Rank dataset (10K subset).

    Returns:
        tuple: (train_df, test_df)
            - train_df: Training DataFrame (subset of msrank)
            - test_df: Test DataFrame (subset of msrank)

    Features:
        - Same features as msrank() but smaller size
        - Suitable for quick testing and prototyping
        - Maintains query group structure for ranking
    """

Datasets with known mathematical properties for algorithm testing.
def monotonic1():
    """
    Load first monotonic regression dataset.

    Returns:
        tuple: (train_df, test_df)
            - train_df: Training DataFrame with monotonic relationships
            - test_df: Test DataFrame for evaluation

    Features:
        - Features with known monotonic relationships to target
        - Useful for testing monotonic constraints in CatBoost
        - Synthetic data with controlled properties
    """
def monotonic2():
    """
    Load second monotonic regression dataset.

    Returns:
        tuple: (train_df, test_df)
            - train_df: Training DataFrame with different monotonic patterns
            - test_df: Test DataFrame for evaluation

    Features:
        - Alternative monotonic feature patterns
        - Complementary to monotonic1() for comprehensive testing
        - Different complexity and noise levels
    """

Functions for managing dataset storage and caching.
def set_cache_path(path):
    """
    Set the cache directory for downloaded datasets.

    Parameters:
        - path: Directory path for caching datasets (string)
          - Must be a writable directory
          - Datasets will be downloaded and stored here
          - Subsequent calls will use cached versions

    Example:
        set_cache_path('/path/to/dataset/cache')
    """

from catboost.datasets import titanic
from catboost import CatBoostClassifier

# Load Titanic dataset
train_df, test_df = titanic()
print(f"Titanic - Train shape: {train_df.shape}, Test shape: {test_df.shape}")

# Prepare features and target
X_train = train_df.drop('Survived', axis=1)
y_train = train_df['Survived']

# CatBoost does not accept NaN in categorical features, so fill missing values
X_train = X_train.fillna(-999)

# Declare the string columns (plus the integer-coded Pclass) as categorical
cat_features = list(X_train.select_dtypes(include='object').columns) + ['Pclass']

# Train model
model = CatBoostClassifier(
    iterations=100,
    verbose=False,
    cat_features=cat_features
)
model.fit(X_train, y_train)
print("Model trained on Titanic dataset")

from catboost.datasets import imdb
from catboost import CatBoostClassifier, Pool

# Load IMDB dataset
train_df, test_df = imdb()
print(f"IMDB - Train shape: {train_df.shape}")

# Create pools with text features; drop the label column from the feature data
train_pool = Pool(
    data=train_df.drop('label', axis=1),
    label=train_df['label'],
    text_features=['text']  # Specify text column
)
test_pool = Pool(
    data=test_df.drop('label', axis=1),
    label=test_df['label'],
    text_features=['text']
)

# Train model with custom text processing
model = CatBoostClassifier(
    iterations=200,
    verbose=50,
    text_processing={
        'tokenizers': [{'tokenizer_id': 'Space', 'separator_type': 'ByDelimiter', 'delimiter': ' '}],
        'dictionaries': [{'dictionary_id': 'Word', 'max_dictionary_size': '50000'}],
        'feature_processing': {
            'default': [{'dictionaries_names': ['Word'], 'feature_calcers': ['BoW']}]
        }
    }
)
model.fit(train_pool, eval_set=test_pool)
print("Model trained on IMDB text data")

from catboost.datasets import msrank_10k
from catboost import CatBoostRanker, Pool

# Load ranking dataset; the DataFrames have unnamed integer columns:
# column 0 is the relevance label, column 1 is the query id,
# and the remaining columns are the numerical features
train_df, test_df = msrank_10k()
print(f"MSRank 10K - Train shape: {train_df.shape}")

# Extract features, labels, and group IDs
X_train = train_df.drop([0, 1], axis=1)
y_train = train_df[0]
group_id_train = train_df[1]

X_test = test_df.drop([0, 1], axis=1)
y_test = test_df[0]
group_id_test = test_df[1]

# Create pools for ranking
train_pool = Pool(
    data=X_train,
    label=y_train,
    group_id=group_id_train
)
test_pool = Pool(
    data=X_test,
    label=y_test,
    group_id=group_id_test
)

# Train ranking model
ranker = CatBoostRanker(
    iterations=200,
    learning_rate=0.1,
    depth=6,
    loss_function='YetiRank',
    eval_metric='NDCG',
    verbose=50
)
ranker.fit(train_pool, eval_set=test_pool)
print("Ranking model trained on MSRank dataset")

from catboost.datasets import epsilon, set_cache_path
from catboost import CatBoostClassifier, Pool
import os

# Set cache directory for large datasets
cache_dir = '/tmp/catboost_datasets'
os.makedirs(cache_dir, exist_ok=True)
set_cache_path(cache_dir)

# Load large dataset (this may take time on first run)
print("Loading epsilon dataset...")
train_df, test_df = epsilon()
print(f"Epsilon - Train: {train_df.shape}, Test: {test_df.shape}")

# For very large datasets, consider file-based training:
# save to TSV and create Pools from file paths for memory efficiency
train_df.to_csv('epsilon_train.tsv', sep='\t', index=False)
test_df.to_csv('epsilon_test.tsv', sep='\t', index=False)

train_pool = Pool('epsilon_train.tsv', delimiter='\t', has_header=True)
test_pool = Pool('epsilon_test.tsv', delimiter='\t', has_header=True)

# Train with limited memory usage
model = CatBoostClassifier(
    iterations=100,
    learning_rate=0.1,
    depth=6,
    verbose=25,
    used_ram_limit='4gb'  # Limit RAM usage
)
model.fit(train_pool, eval_set=test_pool)
print("Large dataset model training completed")

from catboost.datasets import titanic, adult, amazon
import pandas as pd

def analyze_dataset(load_func, name):
    """Print basic statistics for a CatBoost dataset."""
    train_df, test_df = load_func()
    print(f"\n{name} Dataset Analysis:")
    print(f"  Train shape: {train_df.shape}")
    print(f"  Test shape: {test_df.shape}")
    print(f"  Features: {train_df.shape[1] - 1}")  # Excluding target

    # Identify column types
    numeric_cols = train_df.select_dtypes(include=['number']).columns
    categorical_cols = train_df.select_dtypes(include=['object', 'category']).columns
    print(f"  Numeric features: {len(numeric_cols)}")
    print(f"  Categorical features: {len(categorical_cols)}")

    # Target analysis (heuristic: assume the last column is the target;
    # this does not hold for every dataset, e.g. Titanic's target is 'Survived')
    target_col = train_df.columns[-1]
    target_unique = train_df[target_col].nunique()
    print(f"  Target classes: {target_unique}")
    print(f"  Target distribution: {dict(train_df[target_col].value_counts())}")

# Analyze multiple datasets
datasets = [
    (titanic, "Titanic"),
    (adult, "Adult"),
    (amazon, "Amazon")
]
for load_func, name in datasets:
    analyze_dataset(load_func, name)

from catboost.datasets import set_cache_path
import os

# Set custom cache location
custom_cache = "/home/user/ml_datasets"
os.makedirs(custom_cache, exist_ok=True)
set_cache_path(custom_cache)
print(f"Cache path set to: {custom_cache}")

# Load dataset (will cache in the new location)
from catboost.datasets import titanic
train_df, test_df = titanic()

# List cached files
cache_files = os.listdir(custom_cache)
print(f"Cached files: {cache_files}")

Install with Tessl CLI:

npx tessl i tessl/pypi-catboost