# CatBoost

CatBoost is a fast, scalable, high-performance library for gradient boosting on decision trees, used for ranking, classification, regression, and other ML tasks. It provides strong prediction quality compared to other GBDT libraries, fast prediction, native GPU and multi-GPU support, built-in visualization tools, and distributed training capabilities.

## Package Information

- **Package Name**: catboost
- **Package Type**: pypi
- **Language**: Python
- **Installation**: `pip install catboost`
## Core Imports

```python
import catboost
```

Common imports for working with models:

```python
from catboost import CatBoostClassifier, CatBoostRegressor, CatBoostRanker
from catboost import Pool, cv, train
```
Submodule imports:

```python
# Dataset utilities
from catboost import datasets
# or specific functions
from catboost.datasets import titanic, adult, amazon

# Utility functions
from catboost import utils
# or specific functions
from catboost.utils import eval_metric, get_roc_curve, create_cd

# Evaluation framework
from catboost import eval
# or specific classes
from catboost.eval import CatboostEvaluation, EvaluationResults

# Metrics framework
from catboost import metrics
# or specific metrics
from catboost.metrics import Logloss, AUC, RMSE

# Text processing
from catboost.text_processing import Tokenizer, Dictionary

# Model interpretation
from catboost.monoforest import to_polynom, explain_features
```
## Basic Usage

```python
from catboost import CatBoostClassifier, Pool
import pandas as pd
import numpy as np

# Prepare data
train_data = pd.DataFrame({
    'feature1': np.random.randn(1000),
    'feature2': np.random.randn(1000),
    'category': np.random.choice(['A', 'B', 'C'], 1000)
})
train_labels = np.random.randint(0, 2, 1000)

# Create CatBoost pool with categorical features
train_pool = Pool(
    data=train_data,
    label=train_labels,
    cat_features=['category']
)

# Initialize and train classifier
model = CatBoostClassifier(
    iterations=100,
    learning_rate=0.1,
    depth=6,
    verbose=True
)

model.fit(train_pool)

# Make predictions
predictions = model.predict(train_data)
probabilities = model.predict_proba(train_data)

# Get feature importance
feature_importance = model.get_feature_importance()
```
## Architecture

CatBoost is built around several key components:

- **Model Classes**: CatBoost, CatBoostClassifier, CatBoostRegressor, and CatBoostRanker provide different interfaces for gradient boosting tasks
- **Data Handling**: Pool class efficiently manages training data with categorical features, text features, and metadata
- **Training Pipeline**: Support for cross-validation, hyperparameter tuning, and early stopping
- **Feature Analysis**: Comprehensive feature importance, SHAP values, and automatic feature selection
- **GPU Acceleration**: Native GPU support for training and prediction across multiple devices
## Capabilities
### Core Model Classes

Scikit-learn compatible classifier, regressor, and ranker implementations, with the base CatBoost class providing the core gradient boosting functionality.

```python { .api }
class CatBoostClassifier:
    def __init__(self, iterations=500, learning_rate=None, depth=6, l2_leaf_reg=3.0,
                 loss_function='Logloss', **kwargs): ...
    def fit(self, X, y, cat_features=None, sample_weight=None, baseline=None,
            use_best_model=None, eval_set=None, **kwargs): ...
    def predict(self, data, prediction_type='Class', **kwargs): ...
    def predict_proba(self, X, **kwargs): ...

class CatBoostRegressor:
    def __init__(self, iterations=500, learning_rate=None, depth=6, l2_leaf_reg=3.0,
                 loss_function='RMSE', **kwargs): ...
    def fit(self, X, y, **kwargs): ...
    def predict(self, data, **kwargs): ...

class CatBoostRanker:
    def __init__(self, iterations=500, learning_rate=None, depth=6, l2_leaf_reg=3.0,
                 loss_function='YetiRank', **kwargs): ...
    def fit(self, X, y, **kwargs): ...
    def predict(self, data, **kwargs): ...
```
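As a usage sketch (synthetic data; the parameter values and early-stopping settings are illustrative, not recommendations), a regressor can be trained against a validation set and restricted to its best iteration:

```python
from catboost import CatBoostRegressor
import numpy as np

# Synthetic regression data split into train and validation parts
X = np.random.randn(1000, 10)
y = 2.0 * X[:, 0] + np.random.randn(1000) * 0.1
X_train, y_train = X[:800], y[:800]
X_val, y_val = X[800:], y[800:]

model = CatBoostRegressor(iterations=1000, learning_rate=0.05, depth=6,
                          loss_function='RMSE')
# Stop when the eval-set RMSE has not improved for 50 rounds and keep
# the model from the best iteration
model.fit(X_train, y_train,
          eval_set=(X_val, y_val),
          early_stopping_rounds=50,
          use_best_model=True,
          verbose=False)

preds = model.predict(X_val)
print(model.get_best_iteration(), model.get_best_score())
```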
[Core Model Classes](./core-models.md)
133
134
### Data Handling

Pool class and FeaturesData for efficient data management with categorical features, text features, embeddings, and metadata like groups and weights.

```python { .api }
class Pool:
    def __init__(self, data, label=None, cat_features=None, text_features=None,
                 embedding_features=None, column_description=None, pairs=None,
                 delimiter='\t', has_header=False, weight=None, group_id=None,
                 **kwargs): ...
    def slice(self, rindex): ...
    def save(self, fname): ...
    def quantize(self, **kwargs): ...

class FeaturesData:
    # Container for feature data with metadata
    ...
```
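A minimal Pool sketch (toy data; the column names, labels, and weights are made up for illustration):

```python
from catboost import Pool
import numpy as np
import pandas as pd

# Toy frame with one numeric and one categorical column
df = pd.DataFrame({
    'price': np.random.rand(100) * 10,
    'city': np.random.choice(['London', 'Paris'], 100),
})
labels = np.random.randint(0, 2, 100)
weights = np.random.rand(100)

pool = Pool(data=df, label=labels, cat_features=['city'], weight=weights)

print(pool.num_row(), pool.num_col())  # dataset dimensions
subset = pool.slice(list(range(10)))   # Pool restricted to the first 10 rows
```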
[Data Handling](./data-handling.md)
### Training and Evaluation

Cross-validation, training functions, and model evaluation utilities for comprehensive model development and assessment.

```python { .api }
def train(pool, params=None, dtrain=None, logging_level=None, verbose=None,
          iterations=None, **kwargs): ...

def cv(pool, params=None, dtrain=None, iterations=None, num_boost_round=None,
       fold_count=3, inverted=False, shuffle=True, partition_random_seed=0,
       stratified=None, **kwargs): ...

def sample_gaussian_process(X, y, **kwargs): ...
```
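A cross-validation sketch (random data and arbitrary parameters; the metric column names in the result follow the chosen loss function):

```python
from catboost import Pool, cv
import numpy as np

pool = Pool(np.random.randn(500, 5), np.random.randint(0, 2, 500))

params = {'iterations': 200, 'learning_rate': 0.1, 'loss_function': 'Logloss'}

# 5-fold CV; the result is a DataFrame of per-iteration metric statistics
results = cv(pool, params, fold_count=5, shuffle=True, partition_random_seed=42)
print(results[['iterations', 'test-Logloss-mean', 'test-Logloss-std']].tail())
```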
[Training and Evaluation](./training-evaluation.md)
### Feature Analysis

Feature importance calculation, SHAP values, feature selection algorithms, and interpretability tools for understanding model behavior.

```python { .api }
# Enums for feature analysis
class EFstrType:
    PredictionValuesChange = 0
    LossFunctionChange = 1
    FeatureImportance = 2
    Interaction = 3
    ShapValues = 4
    PredictionDiff = 5
    ShapInteractionValues = 6
    SageValues = 7

class EShapCalcType:
    Regular = "Regular"
    Approximate = "Approximate"
    Exact = "Exact"

class EFeaturesSelectionAlgorithm:
    RecursiveByPredictionValuesChange = "RecursiveByPredictionValuesChange"
    RecursiveByLossFunctionChange = "RecursiveByLossFunctionChange"
    RecursiveByShapValues = "RecursiveByShapValues"

class EFeaturesSelectionGrouping:
    Individual = "Individual"
    ByTags = "ByTags"
```
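A sketch of pulling importances and SHAP values from a fitted model (random data; `EFstrType` is importable from the top-level `catboost` package):

```python
from catboost import CatBoostClassifier, Pool, EFstrType
import numpy as np

pool = Pool(np.random.randn(300, 4), np.random.randint(0, 2, 300))

model = CatBoostClassifier(iterations=50, verbose=False)
model.fit(pool)

# Default importances (one value per feature)
importances = model.get_feature_importance(pool)

# Per-sample SHAP values: shape (n_samples, n_features + 1); the last
# column holds the expected value of the model output
shap_values = model.get_feature_importance(pool, type=EFstrType.ShapValues)
```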
[Feature Analysis](./feature-analysis.md)
### Utility Functions

Model conversion, GPU utilities, metric evaluation, confusion matrices, ROC curves, and threshold selection tools.

```python { .api }
def sum_models(models, weights=None, ctr_merge_policy='IntersectingCountersAverage'): ...
def to_regressor(model): ...
def to_classifier(model): ...
def to_ranker(model): ...

# From catboost.utils
def eval_metric(label, approx, metric, weight=None, group_id=None, **kwargs): ...
def get_gpu_device_count(): ...
def get_confusion_matrix(model, data, thread_count=-1): ...
def get_roc_curve(model, data, thread_count=-1, plot=False): ...
def select_threshold(model, data, curve=None, FPR=None, FNR=None, thread_count=-1): ...
```
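A small sketch of standalone metric evaluation (toy labels and scores chosen only for illustration):

```python
from catboost.utils import eval_metric, get_gpu_device_count

# Evaluate a metric directly from labels and predicted scores;
# the result is a list of metric values
labels = [0, 1, 1, 0, 1]
approx = [0.1, 0.8, 0.6, 0.3, 0.9]
print(eval_metric(labels, approx, 'AUC'))  # all positives rank above negatives -> [1.0]

# Number of GPUs visible to CatBoost (0 on CPU-only machines)
print(get_gpu_device_count())
```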
[Utilities](./utilities.md)
### Dataset Utilities

Built-in datasets for testing and learning, including Titanic, Amazon, IMDB, Adult, Higgs, and ranking datasets.

```python { .api }
# From catboost.datasets
def titanic(): ...
def amazon(): ...
def adult(): ...
def imdb(): ...
def higgs(): ...
def msrank(): ...
def msrank_10k(): ...
def epsilon(): ...
def rotten_tomatoes(): ...
def monotonic1(): ...
def monotonic2(): ...
def set_cache_path(path): ...
```
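As an example, the Titanic loader returns a (train, test) pair of pandas DataFrames; the data is downloaded on first use and cached locally:

```python
from catboost.datasets import titanic

train_df, test_df = titanic()
print(train_df.shape, test_df.shape)
print(train_df[['Pclass', 'Sex', 'Age', 'Survived']].head())
```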
[Dataset Utilities](./datasets.md)
### Visualization

Interactive widgets for Jupyter notebooks, metrics plotting, and compatibility with XGBoost and LightGBM plotting callbacks.

```python { .api }
# From catboost.widget (conditionally imported)
class MetricVisualizer:
    # Interactive metric visualization widget for Jupyter
    ...

class MetricsPlotter:
    # Plotting utility for training metrics
    ...

def XGBPlottingCallback(): ...
def lgbm_plotting_callback(): ...
```
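The simplest way to get the interactive training plot is `plot=True` on `fit`; a sketch, assuming a Jupyter environment with ipywidgets installed:

```python
from catboost import CatBoostClassifier, Pool
import numpy as np

pool = Pool(np.random.randn(200, 3), np.random.randint(0, 2, 200))

# In a notebook, plot=True renders the live training-metrics widget;
# outside a notebook, simply omit the argument
model = CatBoostClassifier(iterations=100, verbose=False)
model.fit(pool, plot=True)
```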
[Visualization](./visualization.md)
### Advanced Features

Text processing, monoforest model interpretation, and custom metrics and objectives for specialized use cases.

```python { .api }
# Custom metrics and objectives
class MultiRegressionCustomMetric: ...
class MultiRegressionCustomObjective: ...
class MultiTargetCustomMetric: ...  # Alias
class MultiTargetCustomObjective: ...  # Alias

# From catboost.text_processing
class Tokenizer: ...
class Dictionary: ...

# From catboost.monoforest
def to_polynom(model): ...
def to_polynom_string(model): ...
def explain_features(model): ...
class FeatureExplanation: ...
```
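A tokenizer sketch; the constructor options shown (`lowercasing`, `separator_type`, `delimiter`) are assumptions based on the text_processing module and may differ slightly between versions:

```python
from catboost.text_processing import Tokenizer

# Lowercasing word tokenizer splitting on a single-space delimiter
tokenizer = Tokenizer(lowercasing=True, separator_type='ByDelimiter', delimiter=' ')
print(tokenizer.tokenize('CatBoost handles TEXT features natively'))
# expected: ['catboost', 'handles', 'text', 'features', 'natively']
```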
[Advanced Features](./advanced-features.md)
### Model Evaluation Framework

Comprehensive evaluation framework for statistical testing, performance comparisons, and model validation with confidence intervals.

```python { .api }
# From catboost.eval
class EvalType: ...
class CatboostEvaluation: ...
class ScoreType: ...
class ScoreConfig: ...
class CaseEvaluationResult: ...
class MetricEvaluationResult: ...
class EvaluationResults: ...
class ExecutionCase: ...

def calc_wilcoxon_test(): ...
def calc_bootstrap_ci_for_mean(): ...
def make_dirs_if_not_exists(): ...
def series_to_line(): ...
def save_plot(): ...
```
[Model Evaluation Framework](./evaluation.md)
### Metrics Framework

Dynamic metric classes for evaluating model performance across classification, regression, and ranking tasks.

```python { .api }
# From catboost.metrics
class BuiltinMetric:
    def eval(self, label, approx, weight=None, group_id=None, **kwargs): ...
    def is_max_optimal(self): ...
    def is_min_optimal(self): ...
    def set_hints(self, **hints): ...
    @staticmethod
    def params_with_defaults(): ...

# Dynamically generated metric classes (examples)
class Logloss(BuiltinMetric): ...
class CrossEntropy(BuiltinMetric): ...
class Accuracy(BuiltinMetric): ...
class AUC(BuiltinMetric): ...
class RMSE(BuiltinMetric): ...
class MAE(BuiltinMetric): ...
class NDCG(BuiltinMetric): ...
class MAP(BuiltinMetric): ...
```
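A short sketch of using metric objects, following the `BuiltinMetric.eval` signature above; the toy labels and scores are illustrative only:

```python
from catboost import CatBoostClassifier
from catboost.metrics import AUC, Logloss

# Evaluate a metric object directly on labels and predicted scores
labels = [0, 1, 1, 0]
approx = [0.2, 0.7, 0.6, 0.4]
print(AUC().eval(labels, approx))

# Metric objects can also serve as the loss function / eval metric
model = CatBoostClassifier(iterations=50, loss_function=Logloss(),
                           eval_metric=AUC(), verbose=False)
```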
[Metrics Framework](./metrics.md)
## Constants and Exceptions

```python { .api }
class CatBoostError(Exception):
    """Main exception class for CatBoost errors."""
    ...

# Compatibility alias
CatboostError = CatBoostError

__version__: str  # Currently '1.2.8'
```
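A quick sketch of checking the installed version and catching CatBoost errors; the specific failure used here (predicting with an untrained model) is just one of many conditions that raise `CatBoostError`:

```python
import catboost
from catboost import CatBoostClassifier, CatBoostError

print(catboost.__version__)

try:
    # Predicting with a model that has not been fitted raises CatBoostError
    CatBoostClassifier().predict([[0.0, 1.0]])
except CatBoostError as err:
    print('prediction failed:', err)
```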