tessl/pypi-catboost

CatBoost is a fast, scalable, high-performance library for gradient boosting on decision trees, used for ranking, classification, regression, and other ML tasks.

CatBoost

CatBoost is a fast, scalable, high-performance library for gradient boosting on decision trees, used for ranking, classification, regression, and other ML tasks. CatBoost provides superior quality compared to other GBDT libraries, best-in-class prediction speed, native GPU and multi-GPU support, built-in visualization tools, and distributed training capabilities.

Package Information

  • Package Name: catboost
  • Package Type: pypi
  • Language: Python
  • Installation: pip install catboost

Core Imports

import catboost

Common imports for working with models:

from catboost import CatBoostClassifier, CatBoostRegressor, CatBoostRanker
from catboost import Pool, cv, train

Submodule imports:

# Dataset utilities
from catboost import datasets
# or specific functions
from catboost.datasets import titanic, adult, amazon

# Utility functions
from catboost import utils
# or specific functions  
from catboost.utils import eval_metric, get_roc_curve, create_cd

# Evaluation framework (note: this import shadows Python's builtin eval)
from catboost import eval
# or specific classes
from catboost.eval import CatboostEvaluation, EvaluationResults

# Metrics framework
from catboost import metrics
# or specific metrics
from catboost.metrics import Logloss, AUC, RMSE

# Text processing
from catboost.text_processing import Tokenizer, Dictionary

# Model interpretation
from catboost.monoforest import to_polynom, explain_features

Basic Usage

from catboost import CatBoostClassifier, Pool
import pandas as pd
import numpy as np

# Prepare data
train_data = pd.DataFrame({
    'feature1': np.random.randn(1000),
    'feature2': np.random.randn(1000),
    'category': np.random.choice(['A', 'B', 'C'], 1000)
})
train_labels = np.random.randint(0, 2, 1000)

# Create CatBoost pool with categorical features
train_pool = Pool(
    data=train_data,
    label=train_labels,
    cat_features=['category']
)

# Initialize and train classifier
model = CatBoostClassifier(
    iterations=100,
    learning_rate=0.1,
    depth=6,
    verbose=True
)

model.fit(train_pool)

# Make predictions
predictions = model.predict(train_data)
probabilities = model.predict_proba(train_data)

# Get feature importance
feature_importance = model.get_feature_importance()

Architecture

CatBoost is built around several key components:

  • Model Classes: CatBoost, CatBoostClassifier, CatBoostRegressor, and CatBoostRanker provide different interfaces for gradient boosting tasks
  • Data Handling: Pool class efficiently manages training data with categorical features, text features, and metadata
  • Training Pipeline: Support for cross-validation, hyperparameter tuning, and early stopping
  • Feature Analysis: Comprehensive feature importance, SHAP values, and automatic feature selection
  • GPU Acceleration: Native GPU support for training and prediction across multiple devices

Capabilities

Core Model Classes

Scikit-learn compatible classifier, regressor, and ranker implementations with the base CatBoost class providing the core gradient boosting functionality.

class CatBoostClassifier:
    def __init__(self, iterations=500, learning_rate=None, depth=6, l2_leaf_reg=3.0, 
                 loss_function='Logloss', **kwargs): ...
    def fit(self, X, y, cat_features=None, sample_weight=None, baseline=None, 
            use_best_model=None, eval_set=None, **kwargs): ...
    def predict(self, data, prediction_type='Class', **kwargs): ...
    def predict_proba(self, X, **kwargs): ...

class CatBoostRegressor:
    def __init__(self, iterations=500, learning_rate=None, depth=6, l2_leaf_reg=3.0,
                 loss_function='RMSE', **kwargs): ...
    def fit(self, X, y, **kwargs): ...
    def predict(self, data, **kwargs): ...

class CatBoostRanker:
    def __init__(self, iterations=500, learning_rate=None, depth=6, l2_leaf_reg=3.0,
                 loss_function='YetiRank', **kwargs): ...
    def fit(self, X, y, **kwargs): ...
    def predict(self, data, **kwargs): ...

Data Handling

Pool class and FeaturesData for efficient data management with categorical features, text features, embeddings, and metadata like groups and weights.

class Pool:
    def __init__(self, data, label=None, cat_features=None, text_features=None,
                 embedding_features=None, column_description=None, pairs=None, 
                 delimiter='\t', has_header=False, weight=None, group_id=None, 
                 **kwargs): ...
    def slice(self, rindex): ...
    def save(self, fname): ...
    def quantize(self, **kwargs): ...

class FeaturesData:
    # Container for feature data with metadata
    ...

Training and Evaluation

Cross-validation, training functions, and model evaluation utilities for comprehensive model development and assessment.

def train(pool, params=None, dtrain=None, logging_level=None, verbose=None, 
          iterations=None, **kwargs): ...

def cv(pool, params=None, dtrain=None, iterations=None, num_boost_round=None,
       fold_count=3, inverted=False, shuffle=True, partition_random_seed=0,
       stratified=None, **kwargs): ...

def sample_gaussian_process(X, y, **kwargs): ...

Feature Analysis

Feature importance calculation, SHAP values, feature selection algorithms, and interpretability tools for understanding model behavior.

# Enums for feature analysis
class EFstrType:
    PredictionValuesChange = 0
    LossFunctionChange = 1
    FeatureImportance = 2
    Interaction = 3
    ShapValues = 4
    PredictionDiff = 5
    ShapInteractionValues = 6
    SageValues = 7

class EShapCalcType:
    Regular = "Regular"
    Approximate = "Approximate"
    Exact = "Exact"

class EFeaturesSelectionAlgorithm:
    RecursiveByPredictionValuesChange = "RecursiveByPredictionValuesChange"
    RecursiveByLossFunctionChange = "RecursiveByLossFunctionChange"
    RecursiveByShapValues = "RecursiveByShapValues"

class EFeaturesSelectionGrouping:
    Individual = "Individual"
    ByTags = "ByTags"

Utility Functions

Model conversion, GPU utilities, metric evaluation, confusion matrices, ROC curves, and threshold selection tools.

def sum_models(models, weights=None, ctr_merge_policy='IntersectingCountersAverage'): ...
def to_regressor(model): ...
def to_classifier(model): ...
def to_ranker(model): ...

# From catboost.utils
def eval_metric(label, approx, metric, weight=None, group_id=None, **kwargs): ...
def get_gpu_device_count(): ...
def get_confusion_matrix(model, data, thread_count=-1): ...
def get_roc_curve(model, data, thread_count=-1, plot=False): ...
def select_threshold(model, data, curve=None, FPR=None, FNR=None, thread_count=-1): ...

Dataset Utilities

Built-in datasets for testing and learning, including Titanic, Amazon, IMDB, Adult, Higgs, and ranking datasets.

# From catboost.datasets
def titanic(): ...
def amazon(): ...
def adult(): ...
def imdb(): ...
def higgs(): ...
def msrank(): ...
def msrank_10k(): ...
def epsilon(): ...
def rotten_tomatoes(): ...
def monotonic1(): ...
def monotonic2(): ...
def set_cache_path(path): ...

Visualization

Interactive widgets for Jupyter notebooks, metrics plotting, and compatibility with XGBoost and LightGBM plotting callbacks.

# From catboost.widget (conditionally imported)
class MetricVisualizer:
    # Interactive metric visualization widget for Jupyter
    ...

class MetricsPlotter:
    # Plotting utility for training metrics
    ...

def XGBPlottingCallback(): ...
def lgbm_plotting_callback(): ...

Advanced Features

Text processing, monoforest model interpretation, custom metrics and objectives for specialized use cases.

# Custom metrics and objectives
class MultiRegressionCustomMetric: ...
class MultiRegressionCustomObjective: ...
class MultiTargetCustomMetric: ...  # Alias
class MultiTargetCustomObjective: ...  # Alias

# From catboost.text_processing
class Tokenizer: ...
class Dictionary: ...

# From catboost.monoforest
def to_polynom(model): ...
def to_polynom_string(model): ...
def explain_features(model): ...
class FeatureExplanation: ...

Model Evaluation Framework

Comprehensive evaluation framework for statistical testing, performance comparisons, and model validation with confidence intervals.

# From catboost.eval
class EvalType: ...
class CatboostEvaluation: ...
class ScoreType: ...
class ScoreConfig: ...
class CaseEvaluationResult: ...
class MetricEvaluationResult: ...
class EvaluationResults: ...
class ExecutionCase: ...

def calc_wilcoxon_test(): ...
def calc_bootstrap_ci_for_mean(): ...
def make_dirs_if_not_exists(): ...
def series_to_line(): ...
def save_plot(): ...

Metrics Framework

Dynamic metric classes for evaluating model performance across classification, regression, and ranking tasks.

# From catboost.metrics
class BuiltinMetric:
    def eval(self, label, approx, weight=None, group_id=None, **kwargs): ...
    def is_max_optimal(self): ...
    def is_min_optimal(self): ...
    def set_hints(self, **hints): ...
    @staticmethod
    def params_with_defaults(): ...

# Dynamically generated metric classes (examples)
class Logloss(BuiltinMetric): ...
class CrossEntropy(BuiltinMetric): ...
class Accuracy(BuiltinMetric): ...
class AUC(BuiltinMetric): ...
class RMSE(BuiltinMetric): ...
class MAE(BuiltinMetric): ...
class NDCG(BuiltinMetric): ...
class MAP(BuiltinMetric): ...

Constants and Exceptions

class CatBoostError(Exception):
    """Main exception class for CatBoost errors."""
    ...

# Compatibility alias
CatboostError = CatBoostError

__version__: str  # Currently '1.2.8'

Install with Tessl CLI

npx tessl i tessl/pypi-catboost

docs

  • advanced-features.md
  • core-models.md
  • data-handling.md
  • datasets.md
  • evaluation.md
  • feature-analysis.md
  • index.md
  • metrics.md
  • training-evaluation.md
  • utilities.md
  • visualization.md

tile.json