# Advanced Features

CatBoost provides specialized features for advanced use cases, including text processing, monoforest model interpretation, custom metrics and objectives, and evaluation frameworks. These capabilities extend CatBoost to specialized domains and research applications.

## Capabilities

### Custom Metrics and Objectives

Base classes for implementing custom loss functions and evaluation metrics for specialized machine learning tasks.

```python { .api }
class MultiRegressionCustomMetric:
    """
    Base class for implementing custom metrics for multi-regression tasks.

    Allows creation of domain-specific evaluation metrics that can be used
    during training and validation of multi-output regression models.
    """

    def __init__(self):
        """Initialize custom metric."""
        pass

    def is_max_optimal(self):
        """
        Specify optimization direction for the metric.

        Returns:
        bool: True if higher values are better, False if lower values are better
        """
        raise NotImplementedError()

    def evaluate(self, approxes, target, weight):
        """
        Calculate metric value for given predictions and targets.

        Parameters:
        - approxes: Model predictions (list of numpy.ndarray for each target)
        - target: True target values (numpy.ndarray)
        - weight: Sample weights (numpy.ndarray, optional)

        Returns:
        tuple: (metric_value, weight_sum)
            - metric_value: Calculated metric value (float)
            - weight_sum: Sum of weights used (float)
        """
        raise NotImplementedError()

    def get_final_error(self, error, weight):
        """
        Calculate final error from accumulated values.

        Parameters:
        - error: Accumulated error value (float)
        - weight: Accumulated weight (float)

        Returns:
        float: Final metric value
        """
        return error / weight if weight != 0 else 0

class MultiRegressionCustomObjective:
    """
    Base class for implementing custom loss functions (objectives) for multi-regression.

    Enables implementation of specialized loss functions tailored to specific
    problem domains or research requirements.
    """

    def __init__(self):
        """Initialize custom objective."""
        pass

    def calc_ders_range(self, approxes, targets, weights):
        """
        Calculate first and second derivatives of the loss function.

        Parameters:
        - approxes: Current model predictions (list of numpy.ndarray)
        - targets: True target values (numpy.ndarray)
        - weights: Sample weights (numpy.ndarray)

        Returns:
        tuple: (first_derivatives, second_derivatives)
            - first_derivatives: First derivatives (list of numpy.ndarray)
            - second_derivatives: Second derivatives (list of numpy.ndarray)
        """
        raise NotImplementedError()

# Type aliases for multi-target scenarios
MultiTargetCustomMetric = MultiRegressionCustomMetric
MultiTargetCustomObjective = MultiRegressionCustomObjective
```
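To make the derivative contract of `calc_ders_range` more concrete, here is a minimal, dependency-free sketch for a weighted squared-error objective. It is a standalone class rather than a subclass of the CatBoost base class, and it rests on two assumptions that are not stated above: that `targets` can be indexed per output dimension in the same way as `approxes`, and that derivatives are taken of the function being maximized (the negative loss), which is the convention used in CatBoost's published custom-objective examples. The name `SquaredErrorObjectiveSketch` is hypothetical.

```python
class SquaredErrorObjectiveSketch:
    """Illustrative objective: maximizes -0.5 * w * (target - approx)^2."""

    def calc_ders_range(self, approxes, targets, weights):
        first_derivatives = []
        second_derivatives = []
        # One pass per output dimension (assumed layout: [dimension][sample])
        for dim_approx, dim_target in zip(approxes, targets):
            n = len(dim_approx)
            w = weights if weights is not None else [1.0] * n
            # d/da of -0.5 * w * (t - a)^2 is w * (t - a)
            first_derivatives.append(
                [w[i] * (dim_target[i] - dim_approx[i]) for i in range(n)]
            )
            # The second derivative is the constant -w
            second_derivatives.append([-w[i] for i in range(n)])
        return first_derivatives, second_derivatives


objective = SquaredErrorObjectiveSketch()
ders1, ders2 = objective.calc_ders_range(
    approxes=[[0.0, 1.0], [2.0, 2.0]],  # two output dimensions, two samples
    targets=[[1.0, 1.0], [0.0, 2.0]],
    weights=[1.0, 2.0],
)
print(ders1)
print(ders2)
```

The first derivative points from the current prediction toward the target, scaled by the sample weight, so a sample that is already predicted exactly contributes a zero gradient.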

### Text Processing

Specialized classes for handling text features within CatBoost's gradient boosting framework.

```python { .api }
class Tokenizer:
    """
    Text tokenization utility for preprocessing text features in CatBoost.

    Provides various tokenization strategies optimized for gradient boosting
    on text data, with support for different languages and text types.
    """

    def __init__(self, tokenizer_id='Space', separator_type='ByDelimiter',
                 delimiter=' ', **kwargs):
        """
        Initialize text tokenizer.

        Parameters:
        - tokenizer_id: Tokenizer type ('Space', 'SentencePiece', 'Regexp')
        - separator_type: How to separate tokens ('ByDelimiter', 'BySense')
        - delimiter: Token delimiter character (string)
        - kwargs: Additional tokenizer-specific parameters
        """
        self.tokenizer_id = tokenizer_id
        self.separator_type = separator_type
        self.delimiter = delimiter

    def tokenize(self, text):
        """
        Tokenize input text.

        Parameters:
        - text: Input text string

        Returns:
        list: List of tokens
        """
        # Implementation depends on tokenizer type
        pass

class Dictionary:
    """
    Dictionary builder for text feature processing in CatBoost.

    Creates and manages vocabularies for text features, with support for
    frequency-based filtering and domain-specific dictionaries.
    """

    def __init__(self, dictionary_id='Word', max_dictionary_size=50000,
                 occurrence_lower_bound=1, **kwargs):
        """
        Initialize dictionary builder.

        Parameters:
        - dictionary_id: Dictionary identifier (string)
        - max_dictionary_size: Maximum vocabulary size (int)
        - occurrence_lower_bound: Minimum token frequency (int)
        - kwargs: Additional dictionary parameters
        """
        self.dictionary_id = dictionary_id
        self.max_dictionary_size = max_dictionary_size
        self.occurrence_lower_bound = occurrence_lower_bound

    def build(self, texts):
        """
        Build dictionary from text corpus.

        Parameters:
        - texts: List of text documents

        Returns:
        Dictionary object ready for use in text processing
        """
        pass

    def get_dictionary_info(self):
        """
        Get information about the built dictionary.

        Returns:
        dict: Dictionary statistics including size and coverage
        """
        pass
```
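The interplay of `delimiter`, `occurrence_lower_bound`, and `max_dictionary_size` can be illustrated without CatBoost. The sketch below is plain Python, not the library's implementation: the helper names `tokenize_by_delimiter` and `build_vocabulary` are hypothetical, lowercasing is an added assumption, and a token is assumed to be kept when its count is at least `occurrence_lower_bound`.

```python
from collections import Counter


def tokenize_by_delimiter(text, delimiter=" "):
    """Split on the delimiter and drop empty tokens."""
    return [token for token in text.split(delimiter) if token]


def build_vocabulary(texts, max_dictionary_size=50000,
                     occurrence_lower_bound=1, delimiter=" "):
    """Map the most frequent tokens to integer ids."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize_by_delimiter(text.lower(), delimiter))

    # Frequency filter: keep tokens seen at least occurrence_lower_bound times
    kept = [(tok, c) for tok, c in counts.items() if c >= occurrence_lower_bound]
    # Most frequent first; ties broken alphabetically for determinism
    kept.sort(key=lambda item: (-item[1], item[0]))
    # Size cap: truncate to max_dictionary_size entries
    return {tok: idx for idx, (tok, _) in enumerate(kept[:max_dictionary_size])}


corpus = [
    "This is a positive review",
    "Bad negative review text",
    "Another positive text sample",
]
vocab = build_vocabulary(corpus, max_dictionary_size=3, occurrence_lower_bound=2)
print(vocab)
```

Raising `occurrence_lower_bound` removes rare tokens before the size cap is applied, so on small corpora the frequency filter is usually the limit that matters.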

### Monoforest Interpretation

Tools for interpreting monotonic forest models and converting them to polynomial representations.

```python { .api }
def to_polynom(model):
    """
    Convert monoforest model to polynomial representation.

    Parameters:
    - model: Trained CatBoost model with monotonic constraints

    Returns:
    Polynomial representation of the model that can be used for analysis
    and interpretation of feature relationships.
    """
    pass

def to_polynom_string(model):
    """
    Convert monoforest model to human-readable polynomial string.

    Parameters:
    - model: Trained CatBoost model with monotonic constraints

    Returns:
    string: Mathematical polynomial expression representing the model
    """
    pass

def explain_features(model):
    """
    Generate feature explanations for monoforest models.

    Parameters:
    - model: Trained CatBoost model with monotonic constraints

    Returns:
    FeatureExplanation: Detailed explanation of feature contributions
    """
    pass

class FeatureExplanation:
    """
    Container for detailed feature explanations from monoforest models.

    Provides structured information about how features contribute to
    predictions in monotonic gradient boosting models.
    """

    def __init__(self):
        """Initialize feature explanation container."""
        self.feature_effects = {}
        self.monotonic_constraints = {}
        self.feature_interactions = {}

    def get_feature_effect(self, feature_idx):
        """
        Get the effect description for a specific feature.

        Parameters:
        - feature_idx: Feature index (int)

        Returns:
        dict: Feature effect information including direction and magnitude
        """
        return self.feature_effects.get(feature_idx, {})

    def get_monotonic_constraints(self):
        """
        Get monotonic constraints applied to features.

        Returns:
        dict: Feature indices mapped to constraint types (1, -1, or 0)
        """
        return self.monotonic_constraints

    def summary(self):
        """
        Generate summary of feature explanations.

        Returns:
        string: Human-readable summary of model feature behavior
        """
        pass
```
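One way to read "polynomial representation" is that a decision tree can be rewritten as a sum of products of split indicators. The sketch below shows this for a single depth-2 tree with one split per feature. It is an illustration of the general idea under that assumption, not the output format of `to_polynom`, and all function names in it are hypothetical.

```python
def tree_predict(x, thresholds, leaves):
    """Evaluate the tree directly: each split contributes one indicator bit."""
    s0 = int(x[0] > thresholds[0])
    s1 = int(x[1] > thresholds[1])
    return leaves[(s0, s1)]


def tree_to_polynomial(leaves):
    """Return {tuple of split ids: coefficient} for the equivalent polynomial."""
    v00, v01 = leaves[(0, 0)], leaves[(0, 1)]
    v10, v11 = leaves[(1, 0)], leaves[(1, 1)]
    return {
        (): v00,                        # constant term
        (0,): v10 - v00,                # effect of split 0 alone
        (1,): v01 - v00,                # effect of split 1 alone
        (0, 1): v11 - v10 - v01 + v00,  # interaction of both splits
    }


def polynomial_predict(x, thresholds, monomials):
    """Evaluate the polynomial as a sum of coefficient * product of indicators."""
    indicators = [int(x[i] > thresholds[i]) for i in range(len(thresholds))]
    total = 0.0
    for split_ids, coefficient in monomials.items():
        product = 1
        for i in split_ids:
            product *= indicators[i]
        total += coefficient * product
    return total


thresholds = [0.5, 2.0]
leaves = {(0, 0): 1.0, (0, 1): 2.5, (1, 0): -0.5, (1, 1): 4.0}
monomials = tree_to_polynomial(leaves)

for point in [(0.0, 0.0), (0.0, 3.0), (1.0, 0.0), (1.0, 3.0)]:
    print(point, tree_predict(point, thresholds, leaves),
          polynomial_predict(point, thresholds, monomials))
```

In this form the coefficient on a single indicator reads as a main effect of that split, and the coefficient on a product reads as an interaction between splits.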

### Advanced Evaluation Framework

Comprehensive evaluation system for complex model assessment scenarios.

```python { .api }
class CatboostEvaluation:
    """
    Advanced evaluation framework for comprehensive model assessment.

    Provides tools for statistical testing, confidence intervals, and
    rigorous comparison of CatBoost models across different scenarios.
    """

    def __init__(self, eval_type='Classification', score_type='Logloss'):
        """
        Initialize evaluation framework.

        Parameters:
        - eval_type: Type of evaluation ('Classification', 'Regression', 'Ranking')
        - score_type: Primary scoring metric (string)
        """
        self.eval_type = eval_type
        self.score_type = score_type

    def add_case(self, case_name, model, test_data, test_labels):
        """
        Add evaluation case to the framework.

        Parameters:
        - case_name: Identifier for this evaluation case (string)
        - model: Trained CatBoost model
        - test_data: Test dataset (Pool or array-like)
        - test_labels: True labels (array-like)
        """
        pass

    def evaluate(self):
        """
        Perform comprehensive evaluation across all cases.

        Returns:
        EvaluationResults: Detailed evaluation results with statistical tests
        """
        pass

class EvaluationResults:
    """Container for comprehensive evaluation results."""

    def __init__(self):
        self.case_results = {}
        self.statistical_tests = {}
        self.confidence_intervals = {}

    def get_case_result(self, case_name):
        """Get results for specific evaluation case."""
        return self.case_results.get(case_name)

    def get_statistical_comparison(self, case1, case2):
        """Get statistical comparison between two cases."""
        pass

    def summary_table(self):
        """Generate summary table of all evaluation results."""
        pass

def calc_wilcoxon_test(scores1, scores2, alternative='two-sided'):
    """
    Calculate Wilcoxon signed-rank test for comparing model performance.

    Parameters:
    - scores1: Performance scores from first model (array-like)
    - scores2: Performance scores from second model (array-like)
    - alternative: Test alternative ('two-sided', 'less', 'greater')

    Returns:
    tuple: (statistic, p_value)
        - statistic: Test statistic (float)
        - p_value: P-value of the test (float)
    """
    pass

def calc_bootstrap_ci_for_mean(scores, confidence_level=0.95, num_bootstrap=10000):
    """
    Calculate bootstrap confidence interval for mean performance.

    Parameters:
    - scores: Performance scores (array-like)
    - confidence_level: Confidence level (float, 0-1)
    - num_bootstrap: Number of bootstrap samples (int)

    Returns:
    tuple: (lower_bound, upper_bound, mean_estimate)
        - lower_bound: Lower confidence bound (float)
        - upper_bound: Upper confidence bound (float)
        - mean_estimate: Bootstrap mean estimate (float)
    """
    pass
```
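For readers unfamiliar with the technique behind `calc_bootstrap_ci_for_mean`, the following standard-library sketch computes a percentile bootstrap interval for a mean. It mirrors the parameter names and return order documented above, but it is an independent illustration, not CatBoost's implementation. The function name `bootstrap_ci_for_mean_sketch`, the `seed` argument, and the percentile method itself are assumptions.

```python
import random
import statistics


def bootstrap_ci_for_mean_sketch(scores, confidence_level=0.95,
                                 num_bootstrap=2000, seed=0):
    """Percentile bootstrap: resample with replacement, collect the means."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    scores = list(scores)
    means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(num_bootstrap)
    )
    alpha = 1.0 - confidence_level
    # Positions of the alpha/2 and 1 - alpha/2 quantiles in the sorted means
    lower_idx = int((alpha / 2) * (num_bootstrap - 1))
    upper_idx = int((1 - alpha / 2) * (num_bootstrap - 1))
    return means[lower_idx], means[upper_idx], statistics.fmean(means)


fold_scores = [0.81, 0.79, 0.83, 0.80, 0.82]
lower, upper, mean_est = bootstrap_ci_for_mean_sketch(fold_scores)
print(f"mean={mean_est:.4f}, 95% CI=[{lower:.4f}, {upper:.4f}]")
```

With only a handful of cross-validation folds the interval is coarse, because every resample is drawn from the same few values. Treat it as a rough indication of spread rather than a precise bound.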

## Advanced Features Examples

### Custom Metric Implementation

```python
from catboost import CatBoostRegressor
from catboost import MultiRegressionCustomMetric
import numpy as np

class MeanAbsolutePercentageError(MultiRegressionCustomMetric):
    """Custom MAPE metric implementation."""

    def is_max_optimal(self):
        return False  # Lower MAPE is better

    def evaluate(self, approxes, target, weight):
        """Return (weighted sum of absolute percentage errors, weight sum)."""
        # approxes[0] contains predictions for single-output regression
        predictions = np.asarray(approxes[0], dtype=float)
        target = np.asarray(target, dtype=float)
        if weight is None:
            weight = np.ones_like(target)
        else:
            weight = np.asarray(weight, dtype=float)

        # Skip zero targets to avoid division by zero
        mask = target != 0
        if not np.any(mask):
            return 0.0, 0.0

        # Absolute percentage error for non-zero targets only
        ape = np.abs((target[mask] - predictions[mask]) / target[mask]) * 100

        # Return the accumulated error, not the mean:
        # get_final_error() divides it by the weight sum
        return float(np.sum(ape * weight[mask])), float(np.sum(weight[mask]))

# Use custom metric
# (X_train, y_train, X_val, y_val are assumed to be defined elsewhere)
custom_mape = MeanAbsolutePercentageError()

model = CatBoostRegressor(
    iterations=200,
    eval_metric=custom_mape,  # Use custom metric for evaluation
    verbose=50
)

model.fit(X_train, y_train, eval_set=[(X_val, y_val)])
print("Model trained with custom MAPE metric")
```
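The base-class docstrings describe `evaluate` as returning an (error, weight) pair and `get_final_error` as reducing accumulated values. The sketch below illustrates that two-step contract in plain Python. The batching shown is an assumption about how the pairs might be accumulated, and `evaluate_abs_error` is a hypothetical stand-in for a metric's `evaluate` method.

```python
def evaluate_abs_error(approx, target, weight=None):
    """Return (weighted sum of absolute errors, sum of weights)."""
    w = weight if weight is not None else [1.0] * len(target)
    error_sum = sum(wi * abs(t - a) for a, t, wi in zip(approx, target, w))
    return error_sum, sum(w)


def get_final_error(error, weight):
    """Same reduction as the base class: accumulated error over accumulated weight."""
    return error / weight if weight != 0 else 0


# Two batches evaluated separately, then combined
error_1, weight_1 = evaluate_abs_error([1.0, 2.0], [2.0, 2.0])
error_2, weight_2 = evaluate_abs_error([0.0], [3.0], weight=[2.0])

final = get_final_error(error_1 + error_2, weight_1 + weight_2)
print(final)
```

Returning sums rather than an already averaged value keeps the result independent of how the data is split into batches, since the division happens only once at the end.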

### Text Processing Configuration

```python
import pandas as pd
from catboost import CatBoostClassifier, Pool
from catboost.text_processing import Tokenizer, Dictionary

# Prepare text data
text_data = pd.DataFrame({
    'text_feature': [
        "This is a positive review",
        "Negative sentiment example",
        "Another positive text sample",
        "Bad negative review text"
    ],
    'category': ['A', 'B', 'A', 'B'],
    'target': [1, 0, 1, 0]
})

# Create pool with text features
text_pool = Pool(
    data=text_data.drop('target', axis=1),
    label=text_data['target'],
    text_features=['text_feature'],
    cat_features=['category']
)

# Configure text processing
text_processing_config = {
    'tokenizers': [
        {
            'tokenizer_id': 'Space',
            'separator_type': 'ByDelimiter',
            'delimiter': ' '
        },
        {
            'tokenizer_id': 'SentencePiece',
            'number_of_tokens': 1000
        }
    ],
    'dictionaries': [
        {
            'dictionary_id': 'Word',
            'max_dictionary_size': 50000,
            'occurrence_lower_bound': 1
        },
        {
            'dictionary_id': 'Bigram',
            'max_dictionary_size': 50000,
            'occurrence_lower_bound': 2,
            'gram_order': 2
        }
    ],
    'feature_processing': {
        'default': [
            {
                'dictionaries_names': ['Word', 'Bigram'],
                'feature_calcers': ['BoW', 'NaiveBayes'],
                'tokenizers_names': ['Space']
            }
        ]
    }
}

# Train model with text processing
text_model = CatBoostClassifier(
    iterations=100,
    text_processing=text_processing_config,
    verbose=50
)

text_model.fit(text_pool)
print("Text classification model trained")

# Get feature importance (includes the text feature)
text_importance = text_model.get_feature_importance(prettified=True)
print(text_importance)
```

### Monoforest Model Interpretation

```python
from catboost import CatBoostRegressor
from catboost.monoforest import to_polynom_string, explain_features
import numpy as np
import pandas as pd

# Create synthetic data with known monotonic relationships
np.random.seed(42)
n_samples = 1000

X_mono = pd.DataFrame({
    'increasing_feature': np.random.uniform(0, 10, n_samples),
    'decreasing_feature': np.random.uniform(0, 5, n_samples),
    'neutral_feature': np.random.uniform(-2, 2, n_samples)
})

# Create target with monotonic relationships
y_mono = (
    2 * X_mono['increasing_feature'] +      # Positive monotonic
    -1.5 * X_mono['decreasing_feature'] +   # Negative monotonic
    0.5 * X_mono['neutral_feature']**2 +    # Non-monotonic
    np.random.normal(0, 0.5, n_samples)     # Noise
)

# Train model with monotonic constraints
mono_model = CatBoostRegressor(
    iterations=200,
    depth=4,
    monotone_constraints=[1, -1, 0],  # +1: increasing, -1: decreasing, 0: no constraint
    verbose=50
)

mono_model.fit(X_mono, y_mono)

# Convert to polynomial representation
try:
    poly_string = to_polynom_string(mono_model)
    print("Polynomial representation:")
    print(poly_string)
except Exception:
    print("Polynomial conversion not available for this model type")

# Get feature explanations
try:
    explanations = explain_features(mono_model)
    print("\nFeature explanations:")
    print(explanations.summary())
except Exception:
    print("Feature explanations not available")

# Verify monotonic behavior
test_values = np.linspace(0, 10, 100)
predictions_increasing = []
predictions_decreasing = []

for val in test_values:
    # Test increasing feature
    test_data_inc = pd.DataFrame({
        'increasing_feature': [val],
        'decreasing_feature': [2.5],  # Fixed value
        'neutral_feature': [0]        # Fixed value
    })
    pred_inc = mono_model.predict(test_data_inc)[0]
    predictions_increasing.append(pred_inc)

    # Test decreasing feature
    test_data_dec = pd.DataFrame({
        'increasing_feature': [5],    # Fixed value
        'decreasing_feature': [val],
        'neutral_feature': [0]        # Fixed value
    })
    pred_dec = mono_model.predict(test_data_dec)[0]
    predictions_decreasing.append(pred_dec)

# Check monotonicity
increasing_diff = np.diff(predictions_increasing)
decreasing_diff = np.diff(predictions_decreasing)

print("\nMonotonic constraint verification:")
print(f"Increasing feature violations: {np.sum(increasing_diff < 0)} / {len(increasing_diff)}")
print(f"Decreasing feature violations: {np.sum(decreasing_diff > 0)} / {len(decreasing_diff)}")
```

### Advanced Model Evaluation

```python
from catboost import CatBoostClassifier
from catboost.eval import calc_wilcoxon_test, calc_bootstrap_ci_for_mean
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score
import numpy as np

# X_train, y_train, X_test, y_test are assumed to be defined elsewhere

# Train multiple models for comparison
models = {
    'shallow': CatBoostClassifier(iterations=100, depth=4, verbose=False),
    'medium': CatBoostClassifier(iterations=200, depth=6, verbose=False),
    'deep': CatBoostClassifier(iterations=300, depth=8, verbose=False)
}

# Perform cross-validation for each model
cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
    cv_scores[name] = scores
    print(f"{name} model - CV AUC: {scores.mean():.4f} ± {scores.std():.4f}")

# Statistical comparison between models
comparisons = [('shallow', 'medium'), ('medium', 'deep'), ('shallow', 'deep')]

for model1, model2 in comparisons:
    statistic, p_value = calc_wilcoxon_test(
        cv_scores[model1],
        cv_scores[model2],
        alternative='two-sided'
    )

    print(f"\nWilcoxon test: {model1} vs {model2}")
    print(f"Statistic: {statistic:.4f}, P-value: {p_value:.4f}")

    if p_value < 0.05:
        better_model = model1 if cv_scores[model1].mean() > cv_scores[model2].mean() else model2
        print(f"Significant difference (p < 0.05): {better_model} performs better")
    else:
        print("No significant difference (p >= 0.05)")

# Bootstrap confidence intervals
for name, scores in cv_scores.items():
    lower, upper, mean_est = calc_bootstrap_ci_for_mean(
        scores,
        confidence_level=0.95,
        num_bootstrap=10000
    )

    print(f"\n{name} model - Bootstrap 95% CI:")
    print(f"Mean: {mean_est:.4f}, CI: [{lower:.4f}, {upper:.4f}]")

# Final model selection and evaluation
best_model_name = max(cv_scores.keys(), key=lambda k: cv_scores[k].mean())
best_model = models[best_model_name]

print(f"\nSelected best model: {best_model_name}")

# Train best model on full training set and evaluate on test set
best_model.fit(X_train, y_train)
test_predictions = best_model.predict_proba(X_test)[:, 1]
test_auc = roc_auc_score(y_test, test_predictions)

print(f"Final test AUC: {test_auc:.4f}")

# Calculate bootstrap CI for test performance
# Convert labels to an array so positional indexing also works for a pandas Series
y_test_array = np.asarray(y_test)
test_bootstrap_scores = []
for _ in range(1000):
    indices = np.random.choice(len(y_test_array), len(y_test_array), replace=True)
    boot_auc = roc_auc_score(y_test_array[indices], test_predictions[indices])
    test_bootstrap_scores.append(boot_auc)

test_lower, test_upper, test_mean = calc_bootstrap_ci_for_mean(
    test_bootstrap_scores,
    confidence_level=0.95
)

print(f"Test AUC 95% Bootstrap CI: [{test_lower:.4f}, {test_upper:.4f}]")
```