# Tabular Machine Learning

Automated machine learning for structured (tabular) data, covering binary classification, multiclass classification, and regression. `TabularPredictor` handles feature engineering, model selection, hyperparameter tuning, and ensembling automatically, so strong predictive performance is reachable with minimal configuration.

## Capabilities

### TabularPredictor Class

Main predictor class for tabular data. It automates the ML pipeline from data preprocessing through model training to deployment.

```python { .api }
class TabularPredictor:
    def __init__(
        self,
        label: str,
        problem_type: str = None,
        eval_metric: str = None,
        path: str = None,
        verbosity: int = 2,
        sample_weight: str = None,
        weight_evaluation: bool = False,
        groups: str = None,
        **kwargs
    ):
        """
        Initialize TabularPredictor for automated machine learning on tabular data.

        Parameters:
        - label: Name of the target column to predict
        - problem_type: Type of problem ('binary', 'multiclass', 'regression', 'quantile')
        - eval_metric: Evaluation metric ('accuracy', 'roc_auc', 'rmse', etc.)
        - path: Directory to save models and artifacts
        - verbosity: Logging verbosity level (0-4)
        - sample_weight: Column name for sample weights
        - weight_evaluation: Whether to weight evaluation metrics
        - groups: Column name for group information (for grouped CV)
        """
```

### Model Training

Train and automatically tune machine learning models on tabular data with automatic preprocessing and model selection.

```python { .api }
def fit(
    self,
    train_data,
    tuning_data=None,
    time_limit: float = None,
    presets: str = None,
    hyperparameters=None,
    feature_metadata=None,
    infer_limit: float = None,
    infer_limit_batch_size: int = None,
    fit_weighted_ensemble: bool = True,
    dynamic_stacking: bool = False,
    calibrate_decision_threshold: str = "auto",
    num_cpus: str = "auto",
    num_gpus: str = "auto",
    fit_strategy: str = "sequential",
    memory_limit: str = "auto",
    excluded_model_types: list = None,
    included_model_types: list = None,
    holdout_frac: float = None,
    callbacks: list = None,
    **kwargs
):
    """
    Fit TabularPredictor on training data.

    Parameters:
    - train_data: Training data (DataFrame, file path, or TabularDataset)
    - tuning_data: Validation data for hyperparameter tuning
    - time_limit: Maximum training time in seconds
    - presets: Quality/speed presets ('best_quality', 'high_quality', 'medium_quality', 'optimize_for_deployment')
    - hyperparameters: Custom hyperparameter configurations
    - feature_metadata: Manual feature type specifications or 'infer'
    - infer_limit: Target inference time in seconds per row that the fitted predictor should satisfy
    - infer_limit_batch_size: Batch size assumed when estimating per-row inference time for infer_limit
    - fit_weighted_ensemble: Whether to fit weighted ensemble models
    - dynamic_stacking: Enable dynamic stacking for ensemble models
    - calibrate_decision_threshold: Auto-calibrate decision threshold ('auto', True, False)
    - num_cpus: Number of CPU cores ('auto' or int)
    - num_gpus: Number of GPUs ('auto' or int)
    - fit_strategy: Model fitting strategy ('sequential', 'parallel')
    - memory_limit: Memory limit for training ('auto' or float)
    - excluded_model_types: List of model types to exclude
    - included_model_types: List of model types to include only
    - holdout_frac: Fraction of data to hold out for validation
    - callbacks: List of callback functions for training

    Returns:
    TabularPredictor: Fitted predictor instance
    """
```
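
To make the `fit_weighted_ensemble` option more concrete, the sketch below shows only the general idea of blending base-model outputs with weights. It is a minimal illustration on made-up numbers, not AutoGluon's implementation: the helper `weighted_average_proba`, the model names, and the weights are all hypothetical.

```python
from typing import Dict, List


def weighted_average_proba(
    model_probas: Dict[str, List[float]],
    weights: Dict[str, float],
) -> List[float]:
    """Blend per-model positive-class probabilities using normalized weights.

    Hypothetical helper for illustration only.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_rows = len(next(iter(model_probas.values())))
    blended = [0.0] * n_rows
    for name, probas in model_probas.items():
        # Each model contributes in proportion to its share of the total weight.
        share = weights.get(name, 0.0) / total
        for i, p in enumerate(probas):
            blended[i] += share * p
    return blended


# Made-up predictions from two hypothetical base models on three rows.
probas = {"model_a": [0.9, 0.2, 0.6], "model_b": [0.7, 0.4, 0.2]}
blend = weighted_average_proba(probas, {"model_a": 3.0, "model_b": 1.0})
print(blend)
```

In a real run the weights are learned from validation data rather than supplied by hand.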

### Prediction

Generate predictions and prediction probabilities for new data using the trained model ensemble.

```python { .api }
def predict(
    self,
    data,
    model: str = None,
    as_pandas: bool = True,
    transform_features: bool = True
):
    """
    Generate predictions for new data.

    Parameters:
    - data: Input data (DataFrame, file path, or TabularDataset)
    - model: Specific model name to use for prediction
    - as_pandas: Return results as pandas Series
    - transform_features: Apply feature transformations

    Returns:
    Predictions as pandas Series or numpy array
    """

def predict_proba(
    self,
    data,
    model: str = None,
    as_pandas: bool = True,
    as_multiclass: bool = True,
    transform_features: bool = True
):
    """
    Generate prediction probabilities for classification tasks.

    Parameters:
    - data: Input data (DataFrame, file path, or TabularDataset)
    - model: Specific model name to use for prediction
    - as_pandas: Return results as pandas DataFrame
    - as_multiclass: Return all class probabilities vs just positive class
    - transform_features: Apply feature transformations

    Returns:
    Prediction probabilities as pandas DataFrame or numpy array
    """
```
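
The relationship between the two outputs can be sketched in plain Python. The example assumes class labels are chosen as the highest-probability class (ignoring any calibrated decision threshold) and that the "positive class only" form is simply one column of the full probability table. The helpers `labels_from_proba` and `positive_class_proba` and the data are hypothetical and not part of the library.

```python
from typing import Dict, List


def labels_from_proba(proba_rows: List[Dict[str, float]]) -> List[str]:
    """Pick the class with the highest probability in each row."""
    return [max(row, key=row.get) for row in proba_rows]


def positive_class_proba(
    proba_rows: List[Dict[str, float]], positive_class: str
) -> List[float]:
    """Keep only the positive-class probability, one value per row."""
    return [row[positive_class] for row in proba_rows]


# Made-up class probabilities for three rows of a binary problem.
proba_rows = [
    {"no": 0.8, "yes": 0.2},
    {"no": 0.3, "yes": 0.7},
    {"no": 0.45, "yes": 0.55},
]

labels = labels_from_proba(proba_rows)
positive_only = positive_class_proba(proba_rows, "yes")
print(labels)
print(positive_only)
```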

### Model Evaluation

Evaluate model performance and analyze results with comprehensive metrics and model comparison capabilities.

```python { .api }
def evaluate(
    self,
    data,
    model: str = None,
    auxiliary_metrics: bool = True,
    detailed_report: bool = False,
    silent: bool = False
):
    """
    Evaluate predictor performance on test data.

    Parameters:
    - data: Test data (DataFrame, file path, or TabularDataset)
    - model: Specific model to evaluate
    - auxiliary_metrics: Include additional evaluation metrics
    - detailed_report: Generate detailed evaluation report
    - silent: Suppress output

    Returns:
    dict: Dictionary of evaluation metrics
    """

def leaderboard(
    self,
    data=None,
    extra_info: bool = False,
    only_pareto_frontier: bool = False,
    skip_score: bool = False,
    silent: bool = False
):
    """
    Display model leaderboard with performance rankings.

    Parameters:
    - data: Test data for evaluation (optional)
    - extra_info: Include additional model information
    - only_pareto_frontier: Show only Pareto optimal models
    - skip_score: Skip performance scoring
    - silent: Suppress output

    Returns:
    DataFrame: Model leaderboard with performance metrics
    """
```
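
The `only_pareto_frontier` option refers to Pareto-optimal models. As a rough illustration, assume the trade-off is between a score (higher is better) and prediction time (lower is better): a model is on the frontier if no other model is both at least as fast and more accurate. The `ModelRow` type, the `pareto_frontier` helper, and the numbers below are hypothetical and only sketch the concept.

```python
from typing import List, NamedTuple


class ModelRow(NamedTuple):
    name: str
    score: float      # higher is better
    pred_time: float  # seconds, lower is better


def pareto_frontier(rows: List[ModelRow]) -> List[ModelRow]:
    """Keep models that beat the score of every faster model."""
    frontier = []
    best_score = float("-inf")
    # Walk from fastest to slowest; ties on time put the higher score first.
    for row in sorted(rows, key=lambda r: (r.pred_time, -r.score)):
        if row.score > best_score:
            frontier.append(row)
            best_score = row.score
    return frontier


# Made-up leaderboard entries.
rows = [
    ModelRow("fast", 0.80, 0.1),
    ModelRow("mid", 0.78, 0.5),
    ModelRow("slow", 0.90, 2.0),
    ModelRow("slower", 0.88, 3.0),
]
frontier_names = [r.name for r in pareto_frontier(rows)]
print(frontier_names)
```

Here `mid` and `slower` are dropped because a faster model already achieves a higher score.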

### Feature Analysis

Analyze feature importance and understand model behavior through interpretability tools.

```python { .api }
def feature_importance(
    self,
    data=None,
    model: str = None,
    features: list = None,
    feature_stage: str = 'original',
    subsample_size: int = 5000,
    silent: bool = False
):
    """
    Calculate feature importance scores.

    Parameters:
    - data: Data for importance calculation
    - model: Specific model to analyze
    - features: Specific features to analyze
    - feature_stage: Feature processing stage ('original' or 'transformed')
    - subsample_size: Sample size for efficient computation
    - silent: Suppress output

    Returns:
    DataFrame: Feature importance scores
    """

def fit_summary(self, verbosity: int = 1, show_plot: bool = False):
    """
    Display summary of training process and results.

    Parameters:
    - verbosity: Detail level (0-4)
    - show_plot: Show training plots

    Returns:
    dict: Training summary information
    """
```
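
One common way to score features is permutation importance: shuffle one column at a time and measure how much the model's error grows. The sketch below illustrates that technique on a toy model in plain Python. It is an assumption-laden illustration, not the library's code: `permutation_importance`, `mean_abs_error`, and `toy_model` are hypothetical names.

```python
import random
from typing import Callable, List

Row = List[float]


def mean_abs_error(model: Callable[[Row], float], X: List[Row], y: List[float]) -> float:
    """Average absolute difference between predictions and targets."""
    return sum(abs(model(row) - target) for row, target in zip(X, y)) / len(y)


def permutation_importance(
    model: Callable[[Row], float],
    X: List[Row],
    y: List[float],
    seed: int = 0,
) -> List[float]:
    """Error increase caused by shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = mean_abs_error(model, X, y)
    importances = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        rng.shuffle(column)
        # Rebuild the rows with only column j replaced by its shuffled values.
        X_perm = [row[:j] + [value] + row[j + 1:] for row, value in zip(X, column)]
        importances.append(mean_abs_error(model, X_perm, y) - baseline)
    return importances


def toy_model(row: Row) -> float:
    """Toy model that uses feature 0 and ignores feature 1."""
    return 2.0 * row[0]


X = [[float(i), float(i % 3)] for i in range(20)]
y = [2.0 * i for i in range(20)]
importances = permutation_importance(toy_model, X, y)
print(importances)
```

Shuffling the ignored feature leaves the error unchanged, so its importance is zero, while shuffling the feature the model relies on increases the error.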

### Model Persistence

Save and load trained predictors for deployment and reuse.

```python { .api }
def save(self, path: str = None):
    """
    Save trained predictor to disk.

    Parameters:
    - path: Directory to save predictor
    """

@classmethod
def load(cls, path: str, verbosity: int = 2):
    """
    Load saved predictor from disk.

    Parameters:
    - path: Directory containing saved predictor
    - verbosity: Logging verbosity level

    Returns:
    TabularPredictor: Loaded predictor instance
    """
```

### Advanced Features

Advanced model configuration and specialized functionality for power users.

```python { .api }
def refit_full(self, model: str = 'best'):
    """
    Refit model on full dataset (train + validation).

    Parameters:
    - model: Model to refit ('best', 'all', or specific model name)

    Returns:
    dict: Refit results
    """

def distill(
    self,
    train_data=None,
    tuning_data=None,
    time_limit: int = None,
    hyperparameters=None,
    **kwargs
):
    """
    Create distilled (compressed) version of ensemble model.

    Parameters:
    - train_data: Training data for distillation
    - tuning_data: Validation data for distillation
    - time_limit: Maximum distillation time
    - hyperparameters: Distillation hyperparameters

    Returns:
    dict: Distillation results
    """

def persist_models(self, models: list = None, with_ancestors: bool = True):
    """
    Load models into memory and keep them there so that repeated
    predictions avoid reloading them from disk (reduces inference latency).

    Parameters:
    - models: List of model names to persist
    - with_ancestors: Include ancestor models in persistence
    """

def unpersist_models(self, models: list = None):
    """
    Release persisted models from memory to reduce memory usage.
    The models remain saved on disk.

    Parameters:
    - models: List of model names to unpersist
    """

def calibrate_decision_threshold(
    self,
    data=None,
    metric: str = None,
    return_optimization_curve: bool = False,
    verbose: bool = True
):
    """
    Calibrate decision threshold for binary classification to optimize specified metric.

    Parameters:
    - data: Data to use for threshold calibration
    - metric: Metric to optimize ('f1', 'balanced_accuracy', 'mcc', etc.)
    - return_optimization_curve: Return threshold vs metric curve
    - verbose: Print optimization results

    Returns:
    dict or tuple: Calibration results, optionally with optimization curve
    """

def clone(self, path: str, *, return_clone: bool = False, dirs_exist_ok: bool = False):
    """
    Create a copy of the predictor at a new location.

    Parameters:
    - path: Directory path for the cloned predictor
    - return_clone: Return the cloned predictor instance
    - dirs_exist_ok: Allow overwriting existing directory

    Returns:
    str or TabularPredictor: Path to clone or cloned predictor instance
    """

def clone_for_deployment(
    self,
    path: str,
    *,
    model: str = "best",
    return_clone: bool = False,
    dirs_exist_ok: bool = False
):
    """
    Create optimized copy of predictor for deployment with minimal storage footprint.

    Parameters:
    - path: Directory path for deployment clone
    - model: Model to include in deployment clone
    - return_clone: Return the cloned predictor instance
    - dirs_exist_ok: Allow overwriting existing directory

    Returns:
    str or TabularPredictor: Path to clone or cloned predictor instance
    """
```
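
Decision threshold calibration can be pictured as a search over candidate thresholds for the one that maximizes the chosen metric. The sketch below does this for F1 on made-up data with a simple grid. It is a hedged illustration of the idea only: `f1_score`, `calibrate_threshold`, the grid, and the data are assumptions, and the library's actual search procedure may differ.

```python
from typing import List, Sequence, Tuple


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for binary labels, defined as 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def calibrate_threshold(
    y_true: Sequence[int],
    proba: Sequence[float],
    thresholds: Sequence[float],
) -> Tuple[float, float]:
    """Return the first threshold with the highest F1, and that F1."""
    best_threshold, best_score = 0.5, -1.0
    for threshold in thresholds:
        preds = [1 if p >= threshold else 0 for p in proba]
        score = f1_score(y_true, preds)
        if score > best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Made-up labels and positive-class probabilities.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
proba = [0.12, 0.37, 0.42, 0.47, 0.80, 0.22, 0.90, 0.31]
grid: List[float] = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95

best_threshold, best_f1 = calibrate_threshold(y_true, proba, grid)
default_f1 = f1_score(y_true, [1 if p >= 0.5 else 0 for p in proba])
print(best_threshold, best_f1, default_f1)
```

On this toy data the default threshold of 0.5 misses two positives, while a lower threshold separates the classes cleanly, which is the kind of gain calibration aims for on imbalanced or poorly calibrated problems.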

### InterpretableTabularPredictor Class

**[EXPERIMENTAL]** Specialized TabularPredictor subclass focused on interpretable models with simple, human-readable rules. Trades accuracy for interpretability by limiting to simple models and disabling complex ensemble techniques.

```python { .api }
class InterpretableTabularPredictor(TabularPredictor):
    def __init__(self, *args, **kwargs):
        """
        Initialize InterpretableTabularPredictor with same parameters as TabularPredictor.
        Automatically restricts to interpretable models and preprocessing.
        """

    def fit(
        self,
        train_data,
        tuning_data=None,
        time_limit: float = None,
        *,
        presets: str = "interpretable",
        **kwargs
    ):
        """
        Fit interpretable models with automatic preset selection for interpretability.

        Parameters:
        - train_data: Training data (same as TabularPredictor)
        - tuning_data: Validation data (optional)
        - time_limit: Maximum training time
        - presets: Defaults to "interpretable" preset

        Note: Bagging, stacking, and complex ensembles are disabled for interpretability
        """

    def leaderboard_interpretable(self, verbose: bool = False, **kwargs):
        """
        Leaderboard with model complexity scores for interpretable model selection.

        Parameters:
        - verbose: Print detailed leaderboard

        Returns:
        DataFrame: Leaderboard with additional 'complexity' column showing rule count
        """

    def print_interpretable_rules(
        self,
        complexity_threshold: int = 10,
        model_name: str = None
    ):
        """
        Print human-readable rules from the best interpretable model.

        Parameters:
        - complexity_threshold: Maximum rule complexity to display
        - model_name: Specific model to show rules for
        """
```

## Usage Examples

### Basic Classification

```python
from autogluon.tabular import TabularPredictor

# Binary classification
predictor = TabularPredictor(label='target')
predictor.fit('train.csv', presets='best_quality', time_limit=3600)

# Make predictions
predictions = predictor.predict('test.csv')
probabilities = predictor.predict_proba('test.csv')

# Evaluate performance
scores = predictor.evaluate('test.csv')
print(f"Accuracy: {scores['accuracy']:.3f}")

# View model leaderboard
leaderboard = predictor.leaderboard('test.csv')
print(leaderboard)
```

### Custom Configuration

```python
from autogluon.tabular import TabularDataset, TabularPredictor

# Load the training data ('train.csv' is a placeholder path)
train_data = TabularDataset('train.csv')

# Custom hyperparameters and model selection
hyperparameters = {
    'GBM': {'num_boost_round': 1000, 'learning_rate': 0.01},
    'RF': {'n_estimators': 500, 'max_depth': 20},
    'XGB': {'n_estimators': 1000, 'learning_rate': 0.01}
}

predictor = TabularPredictor(
    label='price',
    problem_type='regression',
    eval_metric='rmse',
    path='./models'
)

predictor.fit(
    train_data,
    hyperparameters=hyperparameters,
    excluded_model_types=['KNN', 'LR'],  # Exclude certain model types
    time_limit=7200,
    presets='high_quality'
)

# Feature importance analysis
importance = predictor.feature_importance(train_data)
print(importance.head(10))
```