Tessl Tile for pypi/catboost@1.2.0

# Advanced Features

CatBoost provides specialized features for advanced use cases including text processing, monoforest model interpretation, custom metrics and objectives, and evaluation frameworks. These capabilities extend CatBoost's functionality for specialized domains and research applications.

## Capabilities

### Custom Metrics and Objectives

Base classes for implementing custom loss functions and evaluation metrics for specialized machine learning tasks.

```python { .api }
class MultiRegressionCustomMetric:
    """
    Base class for implementing custom metrics for multi-regression tasks.

    Allows creation of domain-specific evaluation metrics that can be used
    during training and validation of multi-output regression models.
    """

    def __init__(self):
        """Initialize custom metric."""
        pass

    def is_max_optimal(self):
        """
        Specify optimization direction for the metric.

        Returns:
        bool: True if higher values are better, False if lower values are better
        """
        raise NotImplementedError()

    def evaluate(self, approxes, target, weight):
        """
        Accumulate the metric for given predictions and targets.

        Parameters:
        - approxes: Model predictions (list of numpy.ndarray for each target)
        - target: True target values (numpy.ndarray)
        - weight: Sample weights (numpy.ndarray, optional)

        Returns:
        tuple: (error_sum, weight_sum)
            - error_sum: Accumulated (weighted) error, later divided by the
              weight in get_final_error (float)
            - weight_sum: Sum of weights used (float)
        """
        raise NotImplementedError()

    def get_final_error(self, error, weight):
        """
        Calculate final error from accumulated values.

        Parameters:
        - error: Accumulated error value (float)
        - weight: Accumulated weight (float)

        Returns:
        float: Final metric value
        """
        return error / weight if weight != 0 else 0


class MultiRegressionCustomObjective:
    """
    Base class for implementing custom loss functions (objectives) for multi-regression.

    Enables implementation of specialized loss functions tailored to specific
    problem domains or research requirements.
    """

    def __init__(self):
        """Initialize custom objective."""
        pass

    def calc_ders_range(self, approxes, targets, weights):
        """
        Calculate first and second derivatives of the loss function.

        Parameters:
        - approxes: Current model predictions (list of numpy.ndarray)
        - targets: True target values (numpy.ndarray)
        - weights: Sample weights (numpy.ndarray)

        Returns:
        tuple: (first_derivatives, second_derivatives)
            - first_derivatives: First derivatives (list of numpy.ndarray)
            - second_derivatives: Second derivatives (list of numpy.ndarray)
        """
        raise NotImplementedError()


# Type aliases for multi-target scenarios
MultiTargetCustomMetric = MultiRegressionCustomMetric
MultiTargetCustomObjective = MultiRegressionCustomObjective
```
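
To make the return shape of `calc_ders_range` concrete, here is a minimal sketch of a weighted squared-error objective in plain Python. It follows the tuple layout described above; the class name, the per-dimension list layout of `targets`, and the stand-alone design (no catboost import) are assumptions for illustration. CatBoost's documented custom-objective examples treat the objective as a quantity to maximize, so the first derivative is `target - approx` and the second is negative.

```python
class SquaredErrorObjectiveSketch:
    """Hypothetical stand-alone objective; it does not inherit from catboost."""

    def calc_ders_range(self, approxes, targets, weights):
        # approxes: one list of predictions per target dimension
        # targets:  one list of true values per target dimension (assumed layout)
        first, second = [], []
        for dim_approx, dim_target in zip(approxes, targets):
            n = len(dim_approx)
            w = weights if weights is not None else [1.0] * n
            # Objective to maximize: -0.5 * w * (approx - target)^2
            first.append([wi * (t - a) for a, t, wi in zip(dim_approx, dim_target, w)])
            second.append([-wi for wi in w])
        return first, second


objective = SquaredErrorObjectiveSketch()
ders1, ders2 = objective.calc_ders_range(
    approxes=[[1.0, 2.0], [0.0, 4.0]],
    targets=[[1.5, 1.0], [0.0, 2.0]],
    weights=None,
)
print(ders1)  # first derivatives, one list per target dimension
print(ders2)  # second derivatives, one list per target dimension
```

A real objective would subclass the base class above and would typically operate on numpy arrays rather than lists.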

### Text Processing

Specialized classes for handling text features within CatBoost's gradient boosting framework.

```python { .api }
class Tokenizer:
    """
    Text tokenization utility for preprocessing text features in CatBoost.

    Provides various tokenization strategies optimized for gradient boosting
    on text data, with support for different languages and text types.
    """

    def __init__(self, tokenizer_id='Space', separator_type='ByDelimiter',
                 delimiter=' ', **kwargs):
        """
        Initialize text tokenizer.

        Parameters:
        - tokenizer_id: Tokenizer type ('Space', 'SentencePiece', 'Regexp')
        - separator_type: How to separate tokens ('ByDelimiter', 'BySense')
        - delimiter: Token delimiter character (string)
        - kwargs: Additional tokenizer-specific parameters
        """
        self.tokenizer_id = tokenizer_id
        self.separator_type = separator_type
        self.delimiter = delimiter

    def tokenize(self, text):
        """
        Tokenize input text.

        Parameters:
        - text: Input text string

        Returns:
        list: List of tokens
        """
        # Implementation depends on tokenizer type
        pass


class Dictionary:
    """
    Dictionary builder for text feature processing in CatBoost.

    Creates and manages vocabularies for text features, with support for
    frequency-based filtering and domain-specific dictionaries.
    """

    def __init__(self, dictionary_id='Word', max_dictionary_size=50000,
                 occurrence_lower_bound=1, **kwargs):
        """
        Initialize dictionary builder.

        Parameters:
        - dictionary_id: Dictionary identifier (string)
        - max_dictionary_size: Maximum vocabulary size (int)
        - occurrence_lower_bound: Minimum token frequency (int)
        - kwargs: Additional dictionary parameters
        """
        self.dictionary_id = dictionary_id
        self.max_dictionary_size = max_dictionary_size
        self.occurrence_lower_bound = occurrence_lower_bound

    def build(self, texts):
        """
        Build dictionary from text corpus.

        Parameters:
        - texts: List of text documents

        Returns:
        Dictionary object ready for use in text processing
        """
        pass

    def get_dictionary_info(self):
        """
        Get information about the built dictionary.

        Returns:
        dict: Dictionary statistics including size and coverage
        """
        pass
```
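
The effect of `delimiter`, `occurrence_lower_bound`, and `max_dictionary_size` can be sketched in plain Python. This is only an illustration of the idea (split on a delimiter, count tokens, keep the frequent ones up to a size cap); the helper names are hypothetical and this is not CatBoost's implementation.

```python
from collections import Counter


def tokenize_by_delimiter(text, delimiter=" "):
    """Split on a delimiter and drop empty tokens."""
    return [tok for tok in text.split(delimiter) if tok]


def build_dictionary(texts, max_dictionary_size=50000, occurrence_lower_bound=1):
    """Map the most frequent tokens to integer ids (hypothetical helper)."""
    counts = Counter(tok for text in texts for tok in tokenize_by_delimiter(text.lower()))
    # Drop tokens that occur fewer times than the lower bound
    frequent = [(tok, c) for tok, c in counts.items() if c >= occurrence_lower_bound]
    # Sort by descending frequency, then alphabetically for a stable order
    frequent.sort(key=lambda item: (-item[1], item[0]))
    return {tok: idx for idx, (tok, _) in enumerate(frequent[:max_dictionary_size])}


corpus = ["good good movie", "bad movie", "good plot"]
vocab = build_dictionary(corpus, max_dictionary_size=2, occurrence_lower_bound=2)
print(vocab)  # {'good': 0, 'movie': 1}
```

Raising `occurrence_lower_bound` removes rare tokens first; `max_dictionary_size` then caps whatever remains.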

### Monoforest Interpretation

Tools for interpreting monotonic forest models and converting them to polynomial representations.

```python { .api }
def to_polynom(model):
    """
    Convert monoforest model to polynomial representation.

    Parameters:
    - model: Trained CatBoost model with monotonic constraints

    Returns:
    Polynomial representation of the model that can be used for analysis
    and interpretation of feature relationships.
    """
    pass


def to_polynom_string(model):
    """
    Convert monoforest model to human-readable polynomial string.

    Parameters:
    - model: Trained CatBoost model with monotonic constraints

    Returns:
    string: Mathematical polynomial expression representing the model
    """
    pass


def explain_features(model):
    """
    Generate feature explanations for monoforest models.

    Parameters:
    - model: Trained CatBoost model with monotonic constraints

    Returns:
    FeatureExplanation: Detailed explanation of feature contributions
    """
    pass


class FeatureExplanation:
    """
    Container for detailed feature explanations from monoforest models.

    Provides structured information about how features contribute to
    predictions in monotonic gradient boosting models.
    """

    def __init__(self):
        """Initialize feature explanation container."""
        self.feature_effects = {}
        self.monotonic_constraints = {}
        self.feature_interactions = {}

    def get_feature_effect(self, feature_idx):
        """
        Get the effect description for a specific feature.

        Parameters:
        - feature_idx: Feature index (int)

        Returns:
        dict: Feature effect information including direction and magnitude
        """
        return self.feature_effects.get(feature_idx, {})

    def get_monotonic_constraints(self):
        """
        Get monotonic constraints applied to features.

        Returns:
        dict: Feature indices mapped to constraint types (1, -1, or 0)
        """
        return self.monotonic_constraints

    def summary(self):
        """
        Generate summary of feature explanations.

        Returns:
        string: Human-readable summary of model feature behavior
        """
        pass
```
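
One way to read a polynomial representation of a tree ensemble is as a sum of monomials, each a weight multiplied by a product of binary split indicators such as `[x_i > border]`. The sketch below evaluates such a polynomial by hand. The `(splits, value)` layout and the helper name are assumptions made for illustration; they are not the objects returned by `to_polynom`.

```python
def evaluate_polynomial(monomials, features):
    """Sum value * prod([features[idx] > border]) over all monomials."""
    total = 0.0
    for splits, value in monomials:
        # A monomial is active only if every one of its splits is satisfied;
        # an empty split list is the constant term and is always active
        if all(features[idx] > border for idx, border in splits):
            total += value
    return total


# Hypothetical polynomial: 1.0 + 2.0*[x0 > 3] - 0.5*[x0 > 3]*[x1 > 1]
monomials = [
    ([], 1.0),
    ([(0, 3.0)], 2.0),
    ([(0, 3.0), (1, 1.0)], -0.5),
]

print(evaluate_polynomial(monomials, [5.0, 0.0]))  # 3.0
print(evaluate_polynomial(monomials, [5.0, 2.0]))  # 2.5
print(evaluate_polynomial(monomials, [1.0, 2.0]))  # 1.0
```

Monomials with a single split describe the effect of one feature on its own, while monomials with several splits describe interactions between features.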

### Advanced Evaluation Framework

Comprehensive evaluation system for complex model assessment scenarios.

```python { .api }
class CatboostEvaluation:
    """
    Advanced evaluation framework for comprehensive model assessment.

    Provides tools for statistical testing, confidence intervals, and
    rigorous comparison of CatBoost models across different scenarios.
    """

    def __init__(self, eval_type='Classification', score_type='Logloss'):
        """
        Initialize evaluation framework.

        Parameters:
        - eval_type: Type of evaluation ('Classification', 'Regression', 'Ranking')
        - score_type: Primary scoring metric (string)
        """
        self.eval_type = eval_type
        self.score_type = score_type

    def add_case(self, case_name, model, test_data, test_labels):
        """
        Add evaluation case to the framework.

        Parameters:
        - case_name: Identifier for this evaluation case (string)
        - model: Trained CatBoost model
        - test_data: Test dataset (Pool or array-like)
        - test_labels: True labels (array-like)
        """
        pass

    def evaluate(self):
        """
        Perform comprehensive evaluation across all cases.

        Returns:
        EvaluationResults: Detailed evaluation results with statistical tests
        """
        pass


class EvaluationResults:
    """Container for comprehensive evaluation results."""

    def __init__(self):
        self.case_results = {}
        self.statistical_tests = {}
        self.confidence_intervals = {}

    def get_case_result(self, case_name):
        """Get results for specific evaluation case."""
        return self.case_results.get(case_name)

    def get_statistical_comparison(self, case1, case2):
        """Get statistical comparison between two cases."""
        pass

    def summary_table(self):
        """Generate summary table of all evaluation results."""
        pass


def calc_wilcoxon_test(scores1, scores2, alternative='two-sided'):
    """
    Calculate Wilcoxon signed-rank test for comparing model performance.

    Parameters:
    - scores1: Performance scores from first model (array-like)
    - scores2: Performance scores from second model (array-like)
    - alternative: Test alternative ('two-sided', 'less', 'greater')

    Returns:
    tuple: (statistic, p_value)
        - statistic: Test statistic (float)
        - p_value: P-value of the test (float)
    """
    pass


def calc_bootstrap_ci_for_mean(scores, confidence_level=0.95, num_bootstrap=10000):
    """
    Calculate bootstrap confidence interval for mean performance.

    Parameters:
    - scores: Performance scores (array-like)
    - confidence_level: Confidence level (float, 0-1)
    - num_bootstrap: Number of bootstrap samples (int)

    Returns:
    tuple: (lower_bound, upper_bound, mean_estimate)
        - lower_bound: Lower confidence bound (float)
        - upper_bound: Upper confidence bound (float)
        - mean_estimate: Bootstrap mean estimate (float)
    """
    pass
```
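
For intuition about what a bootstrap confidence interval for the mean computes, the following plain-Python sketch uses the percentile method: resample the scores with replacement, take the mean of each resample, and read the interval off the sorted means. The function name, the `seed` parameter, and the choice of the percentile method are assumptions for illustration; CatBoost's own routine may differ.

```python
import random


def bootstrap_ci_for_mean(scores, confidence_level=0.95, num_bootstrap=2000, seed=0):
    """Percentile bootstrap interval for the mean (illustrative helper)."""
    rng = random.Random(seed)
    n = len(scores)
    # Mean of each resample drawn with replacement, sorted ascending
    means = sorted(
        sum(rng.choice(scores) for _ in range(n)) / n
        for _ in range(num_bootstrap)
    )
    alpha = 1.0 - confidence_level
    lower = means[int((alpha / 2) * (num_bootstrap - 1))]
    upper = means[int((1 - alpha / 2) * (num_bootstrap - 1))]
    return lower, upper, sum(means) / num_bootstrap


scores = [0.81, 0.84, 0.79, 0.86, 0.83]
lower, upper, mean_estimate = bootstrap_ci_for_mean(scores)
print(f"mean={mean_estimate:.4f}, CI=[{lower:.4f}, {upper:.4f}]")
```

With only a handful of scores the interval is coarse, because the resampled means can take relatively few distinct values.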

## Advanced Features Examples

### Custom Metric Implementation

```python
from catboost import CatBoostRegressor
from catboost import MultiRegressionCustomMetric
import numpy as np

# X_train, y_train, X_val, y_val are assumed to be defined elsewhere


class MeanAbsolutePercentageError(MultiRegressionCustomMetric):
    """Custom MAPE metric implementation."""

    def is_max_optimal(self):
        return False  # Lower MAPE is better

    def evaluate(self, approxes, target, weight):
        """Accumulate the weighted sum of absolute percentage errors."""
        # approxes[0] contains predictions for single-output regression
        predictions = np.asarray(approxes[0])
        target = np.asarray(target)

        # Skip zero targets to avoid division by zero
        mask = target != 0
        if not np.any(mask):
            return 0.0, 0.0

        # Absolute percentage error for non-zero targets only
        ape = np.abs((target[mask] - predictions[mask]) / target[mask]) * 100

        # Unweighted case: every object has weight 1
        w = np.ones(ape.shape[0]) if weight is None else np.asarray(weight)[mask]

        # Return the weighted error sum and the weight sum;
        # get_final_error() divides them to obtain the mean
        return float(np.sum(ape * w)), float(np.sum(w))


# Use custom metric
custom_mape = MeanAbsolutePercentageError()

model = CatBoostRegressor(
    iterations=200,
    eval_metric=custom_mape,  # Use custom metric for evaluation
    verbose=50
)

model.fit(X_train, y_train, eval_set=[(X_val, y_val)])
print("Model trained with custom MAPE metric")
```
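
Because `get_final_error` in the base class divides the accumulated error by the accumulated weight, `evaluate` is expected to return a weighted error sum rather than an average. The plain-Python sketch below shows that division of labour with a weighted mean absolute error. The class name is hypothetical and the class deliberately does not depend on catboost, so it can be run on its own.

```python
class WeightedMAESketch:
    """Hypothetical metric following the evaluate/get_final_error contract."""

    def is_max_optimal(self):
        return False  # lower error is better

    def evaluate(self, approxes, target, weight):
        preds = approxes[0]
        w = weight if weight is not None else [1.0] * len(target)
        # Accumulate the weighted absolute error; do not average here
        error_sum = sum(wi * abs(t - p) for p, t, wi in zip(preds, target, w))
        return error_sum, sum(w)

    def get_final_error(self, error, weight):
        # The averaging happens exactly once, at the end
        return error / weight if weight != 0 else 0


metric = WeightedMAESketch()
error_sum, weight_sum = metric.evaluate(
    [[1.0, 2.0, 4.0]], [1.0, 3.0, 2.0], [1.0, 1.0, 2.0]
)
print(error_sum, weight_sum, metric.get_final_error(error_sum, weight_sum))  # 5.0 4.0 1.25
```

Returning an already averaged value from `evaluate` would cause the metric to be divided by the weight a second time.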

### Text Processing Configuration

```python
import pandas as pd
from catboost import CatBoostClassifier, Pool
from catboost.text_processing import Tokenizer, Dictionary

# Prepare text data
text_data = pd.DataFrame({
    'text_feature': [
        "This is a positive review",
        "Negative sentiment example",
        "Another positive text sample",
        "Bad negative review text"
    ],
    'category': ['A', 'B', 'A', 'B'],
    'target': [1, 0, 1, 0]
})

# Create pool with text features
text_pool = Pool(
    data=text_data.drop('target', axis=1),
    label=text_data['target'],
    text_features=['text_feature'],
    cat_features=['category']
)

# Configure text processing
text_processing_config = {
    'tokenizers': [
        {
            'tokenizer_id': 'Space',
            'separator_type': 'ByDelimiter',
            'delimiter': ' '
        },
        {
            'tokenizer_id': 'SentencePiece',
            'number_of_tokens': 1000
        }
    ],
    'dictionaries': [
        {
            'dictionary_id': 'Word',
            'max_dictionary_size': 50000,
            'occurrence_lower_bound': 1
        },
        {
            'dictionary_id': 'Bigram',
            'max_dictionary_size': 50000,
            'occurrence_lower_bound': 2,
            'gram_order': 2
        }
    ],
    'feature_processing': {
        'default': [
            {
                'dictionaries_names': ['Word', 'Bigram'],
                'feature_calcers': ['BoW', 'NaiveBayes'],
                'tokenizers_names': ['Space']
            }
        ]
    }
}

# Train model with text processing
text_model = CatBoostClassifier(
    iterations=100,
    text_processing=text_processing_config,
    verbose=50
)

text_model.fit(text_pool)
print("Text classification model trained")

# Get text feature importance
text_importance = text_model.get_feature_importance(prettified=True)
print(text_importance)
```
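
The `gram_order` setting and the `BoW` calcer in the configuration above can be pictured with a small plain-Python sketch: consecutive tokens are joined into n-grams, and a bag-of-words vector marks which dictionary entries occur in a text. The helper names and the dictionary entries are made up for illustration, and the binary indicator form shown here is one common variant of bag-of-words, not necessarily what CatBoost computes internally.

```python
def ngrams(tokens, gram_order):
    """Join each run of gram_order consecutive tokens into one n-gram."""
    return [
        " ".join(tokens[i:i + gram_order])
        for i in range(len(tokens) - gram_order + 1)
    ]


def bag_of_words(grams, dictionary):
    """Binary indicator vector: 1 if the dictionary entry occurs in the text."""
    present = set(grams)
    return [1 if entry in present else 0 for entry in dictionary]


tokens = "bad negative review text".split(" ")
bigrams = ngrams(tokens, gram_order=2)
dictionary = ["negative review", "positive review", "review text"]  # made-up entries

print(bigrams)                            # ['bad negative', 'negative review', 'review text']
print(bag_of_words(bigrams, dictionary))  # [1, 0, 1]
```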

### Monoforest Model Interpretation

```python
from catboost import CatBoostRegressor
from catboost.monoforest import to_polynom_string, explain_features
import numpy as np
import pandas as pd

# Create synthetic data with known monotonic relationships
np.random.seed(42)
n_samples = 1000

X_mono = pd.DataFrame({
    'increasing_feature': np.random.uniform(0, 10, n_samples),
    'decreasing_feature': np.random.uniform(0, 5, n_samples),
    'neutral_feature': np.random.uniform(-2, 2, n_samples)
})

# Create target with monotonic relationships
y_mono = (
    2 * X_mono['increasing_feature'] +          # Positive monotonic
    -1.5 * X_mono['decreasing_feature'] +       # Negative monotonic
    0.5 * X_mono['neutral_feature']**2 +        # Non-monotonic
    np.random.normal(0, 0.5, n_samples)         # Noise
)

# Train model with monotonic constraints
mono_model = CatBoostRegressor(
    iterations=200,
    depth=4,
    monotone_constraints=[1, -1, 0],  # +1: increasing, -1: decreasing, 0: no constraint
    verbose=50
)

mono_model.fit(X_mono, y_mono)

# Convert to polynomial representation
try:
    poly_string = to_polynom_string(mono_model)
    print("Polynomial representation:")
    print(poly_string)
except Exception:
    print("Polynomial conversion not available for this model type")

# Get feature explanations
try:
    explanations = explain_features(mono_model)
    print("\nFeature explanations:")
    print(explanations.summary())
except Exception:
    print("Feature explanations not available")

# Verify monotonic behavior
test_values = np.linspace(0, 10, 100)
predictions_increasing = []
predictions_decreasing = []

for val in test_values:
    # Test increasing feature
    test_data_inc = pd.DataFrame({
        'increasing_feature': [val],
        'decreasing_feature': [2.5],  # Fixed value
        'neutral_feature': [0]        # Fixed value
    })
    pred_inc = mono_model.predict(test_data_inc)[0]
    predictions_increasing.append(pred_inc)

    # Test decreasing feature
    test_data_dec = pd.DataFrame({
        'increasing_feature': [5],    # Fixed value
        'decreasing_feature': [val],
        'neutral_feature': [0]        # Fixed value
    })
    pred_dec = mono_model.predict(test_data_dec)[0]
    predictions_decreasing.append(pred_dec)

# Check monotonicity
increasing_diff = np.diff(predictions_increasing)
decreasing_diff = np.diff(predictions_decreasing)

print("\nMonotonic constraint verification:")
print(f"Increasing feature violations: {np.sum(increasing_diff < 0)} / {len(increasing_diff)}")
print(f"Decreasing feature violations: {np.sum(decreasing_diff > 0)} / {len(decreasing_diff)}")
```

### Advanced Model Evaluation

```python
from catboost import CatBoostClassifier
from catboost.eval import calc_wilcoxon_test, calc_bootstrap_ci_for_mean
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score
import numpy as np

# X_train, y_train, X_test, y_test are assumed to be defined (binary target)

# Train multiple models for comparison
models = {
    'shallow': CatBoostClassifier(iterations=100, depth=4, verbose=False),
    'medium': CatBoostClassifier(iterations=200, depth=6, verbose=False),
    'deep': CatBoostClassifier(iterations=300, depth=8, verbose=False)
}

# Perform cross-validation for each model
cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
    cv_scores[name] = scores
    print(f"{name} model - CV AUC: {scores.mean():.4f} ± {scores.std():.4f}")

# Statistical comparison between models
comparisons = [('shallow', 'medium'), ('medium', 'deep'), ('shallow', 'deep')]

for model1, model2 in comparisons:
    statistic, p_value = calc_wilcoxon_test(
        cv_scores[model1],
        cv_scores[model2],
        alternative='two-sided'
    )

    print(f"\nWilcoxon test: {model1} vs {model2}")
    print(f"Statistic: {statistic:.4f}, P-value: {p_value:.4f}")

    if p_value < 0.05:
        better_model = model1 if cv_scores[model1].mean() > cv_scores[model2].mean() else model2
        print(f"Significant difference (p < 0.05): {better_model} performs better")
    else:
        print("No significant difference (p >= 0.05)")

# Bootstrap confidence intervals
for name, scores in cv_scores.items():
    lower, upper, mean_est = calc_bootstrap_ci_for_mean(
        scores,
        confidence_level=0.95,
        num_bootstrap=10000
    )

    print(f"\n{name} model - Bootstrap 95% CI:")
    print(f"Mean: {mean_est:.4f}, CI: [{lower:.4f}, {upper:.4f}]")

# Final model selection and evaluation
best_model_name = max(cv_scores.keys(), key=lambda k: cv_scores[k].mean())
best_model = models[best_model_name]

print(f"\nSelected best model: {best_model_name}")

# Train best model on full training set and evaluate on test set
best_model.fit(X_train, y_train)
test_predictions = best_model.predict_proba(X_test)[:, 1]
test_auc = roc_auc_score(y_test, test_predictions)

print(f"Final test AUC: {test_auc:.4f}")

# Percentile bootstrap CI for test performance
rng = np.random.default_rng(42)
y_test_arr = np.asarray(y_test)  # positional indexing also works for pandas input
test_bootstrap_scores = []
for _ in range(1000):
    indices = rng.integers(0, len(y_test_arr), size=len(y_test_arr))
    if np.unique(y_test_arr[indices]).size < 2:
        continue  # AUC is undefined when a resample contains a single class
    boot_auc = roc_auc_score(y_test_arr[indices], test_predictions[indices])
    test_bootstrap_scores.append(boot_auc)

# The 2.5th and 97.5th percentiles of the resampled AUCs bound the interval.
# (A CI for the *mean* of these replicates would only measure resampling noise.)
test_lower, test_upper = np.percentile(test_bootstrap_scores, [2.5, 97.5])

print(f"Test AUC 95% Bootstrap CI: [{test_lower:.4f}, {test_upper:.4f}]")
```
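
When interpreting the model comparison above, it helps to know how little resolution a signed-rank test has on five cross-validation folds. The plain-Python sketch below computes an exact two-sided Wilcoxon signed-rank p-value by enumerating all sign assignments. The function name is hypothetical and the exact enumeration is an assumption for illustration; library implementations may use a normal approximation and report different values.

```python
from itertools import product


def wilcoxon_signed_rank_exact(scores1, scores2):
    """Exact two-sided signed-rank test for small paired samples (sketch)."""
    diffs = [a - b for a, b in zip(scores1, scores2) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0

    # Rank the absolute differences, giving tied values their average rank
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    statistic = min(w_plus, w_minus)

    # Under the null hypothesis every sign pattern is equally likely
    total_rank = sum(ranks)
    extreme = 0
    for signs in product((0, 1), repeat=n):
        wp = sum(r for r, s in zip(ranks, signs) if s)
        if min(wp, total_rank - wp) <= statistic:
            extreme += 1
    return statistic, extreme / 2 ** n


# Five folds where the second model is better on every fold
fold_scores_a = [80, 82, 83, 84, 86]
fold_scores_b = [81, 84, 86, 88, 91]
statistic, p_value = wilcoxon_signed_rank_exact(fold_scores_a, fold_scores_b)
print(statistic, p_value)  # 0.0 0.0625
```

Even in this most favourable case the exact two-sided p-value is 2/32 = 0.0625, so with `cv=5` an exact test cannot fall below the 0.05 threshold; more folds or repeated cross-validation are needed for the comparison to be decisive. The enumeration grows as 2^n and is only practical for small samples.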
