AutoGluon TabularPredictor for automated machine learning on tabular datasets
AutoGluon Tabular provides extensive configuration options through presets, hyperparameter configurations, and feature processing settings. These configurations enable users to optimize for different objectives like accuracy, speed, interpretability, or deployment constraints.
Pre-configured settings optimized for different use cases, balancing accuracy, training time, and computational resources.
# Available preset configurations
from typing import Literal

PRESET_CONFIGURATIONS = Literal[
"best_quality", # Maximum accuracy, longer training time
"high_quality", # High accuracy with fast inference
"good_quality", # Good accuracy with very fast inference
"medium_quality", # Medium accuracy, very fast training (default)
"optimize_for_deployment", # Optimizes for deployment by cleaning up models
"interpretable" # Interpretable models only
]
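As a quick illustration, the preset names above can be validated at runtime with `typing.get_args`; the helper `validate_preset` below is a sketch, not part of the AutoGluon API, and the `Literal` is restated so the snippet is self-contained.

```python
from typing import Literal, get_args

# Mirrors the Literal above (restated here so the sketch is self-contained)
PRESET_CONFIGURATIONS = Literal[
    "best_quality", "high_quality", "good_quality",
    "medium_quality", "optimize_for_deployment", "interpretable"
]

def validate_preset(preset: str) -> str:
    """Raise ValueError if `preset` is not a known preset name."""
    valid = get_args(PRESET_CONFIGURATIONS)
    if preset not in valid:
        raise ValueError(f"Unknown preset {preset!r}; choose one of {valid}")
    return preset
```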
def get_preset_config(preset: str) -> dict:
"""
Get configuration dictionary for a specific preset.
Parameters:
- preset: Name of the preset configuration
Returns:
Dictionary with preset configuration parameters
    """

Systematic hyperparameter configuration system for customizing model training and optimization strategies.
def get_hyperparameter_config(
    preset: str | None = None,
    model_types: list[str] | None = None,
    search_strategy: str = "auto"
) -> dict:
"""
Generate hyperparameter configuration for specified models and preset.
Parameters:
- preset: Base preset configuration
- model_types: List of model types to configure
- search_strategy: Hyperparameter search strategy ('grid', 'random', 'bayesian', 'auto')
Returns:
Dictionary mapping model names to hyperparameter configurations
"""
# Hyperparameter configuration structure
from typing import Any

HYPERPARAMETER_CONFIG = dict[str, dict[str, Any]]
# Example: {'LGB': {'num_leaves': [31, 127], 'learning_rate': [0.01, 0.1]}}
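To make this structure concrete, the sketch below builds a configuration of that shape and counts the grid combinations each model would produce; the specific values and the `count_candidates` helper are illustrative, not part of AutoGluon.

```python
from math import prod
from typing import Any

# Hypothetical config following the dict[str, dict[str, Any]] structure above
search_space: dict[str, dict[str, Any]] = {
    "LGB": {"num_leaves": [31, 127], "learning_rate": [0.01, 0.1]},
    "XGB": {"max_depth": [3, 6], "n_estimators": [100, 300]},
}

def count_candidates(config: dict[str, dict[str, Any]]) -> dict[str, int]:
    """Number of grid combinations per model (product of list lengths)."""
    return {
        model: prod(len(v) if isinstance(v, list) else 1 for v in params.values())
        for model, params in config.items()
    }
```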
def get_hyperparameter_config_options() -> list[str]:
"""
Get list of available hyperparameter configuration presets.
Returns:
List of available configuration names
"""
def get_hyperparameter_config(config_name: str) -> dict:
"""
Get specific hyperparameter configuration by name.
Parameters:
- config_name: Name of the hyperparameter configuration preset
Returns:
Hyperparameter configuration dictionary
    """

Automated feature engineering and preprocessing configuration system for handling diverse data types and feature transformations.
def get_default_feature_generator(
feature_generator: str = "auto",
    feature_metadata: 'FeatureMetadata | None' = None,
    init_kwargs: dict | None = None
) -> 'AutoMLPipelineFeatureGenerator':
"""
Get default feature generator with specified configuration.
Parameters:
- feature_generator: Feature generation preset ('auto', 'interpretable')
- feature_metadata: Metadata for feature processing
- init_kwargs: Additional initialization arguments
Returns:
Configured feature generator instance
"""
class FeatureGenerator:
"""Base class for feature generation and preprocessing."""
def fit_transform(
self,
X: pd.DataFrame,
feature_metadata: 'FeatureMetadata' = None,
**kwargs
) -> pd.DataFrame:
"""
Fit feature generator and transform input data.
Parameters:
- X: Input dataframe
- feature_metadata: Feature type metadata
Returns:
Transformed feature dataframe
"""
def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        """Transform input data using fitted generator."""

Configuration options for advanced training strategies including bagging, stacking, and resource management.
class AGArgsFit:
"""Arguments for controlling model fitting behavior."""
    num_cpus: int | str = "auto" # CPU cores for training ("auto" or a count)
    num_gpus: int = 0 # GPU devices to use
    memory_limit: int | None = None # Memory limit in MB
    disk_limit: int | None = None # Disk space limit in MB
    time_limit: float | None = None # Time limit per model in seconds
name_suffix: str = "" # Suffix for model names
priority: int = 0 # Training priority
class AGArgsEnsemble:
"""Arguments for controlling ensemble behavior."""
fold_fitting_strategy: str = "sequential_local" # Fold fitting strategy
auto_stack: bool = True # Enable automatic stacking
bagging_mode: str = "oob" # Bagging validation mode
stack_mode: str = "infer" # Stacking mode
ensemble_size_max: int = 25 # Maximum ensemble size
# Training configuration structure
TRAINING_CONFIG = {
'num_bag_folds': int, # Number of bagging folds (default: auto)
'num_bag_sets': int, # Number of bagging sets (default: auto)
'num_stack_levels': int, # Number of stacking levels (default: auto)
'ag_args_fit': dict, # Advanced fitting arguments
'ag_args_ensemble': dict, # Advanced ensemble arguments
}

Configuration for evaluation metrics, validation strategies, and performance measurement.
# Classification metrics
CLASSIFICATION_METRICS = [
"accuracy", "balanced_accuracy", "log_loss",
"f1", "f1_macro", "f1_micro", "f1_weighted",
"roc_auc", "roc_auc_ovo", "roc_auc_ovo_macro", "roc_auc_ovo_weighted",
"roc_auc_ovr", "roc_auc_ovr_macro", "roc_auc_ovr_micro", "roc_auc_ovr_weighted",
"average_precision", "precision", "precision_macro", "precision_micro", "precision_weighted",
"recall", "recall_macro", "recall_micro", "recall_weighted",
"mcc", "pac_score"
]
# Regression metrics
REGRESSION_METRICS = [
"root_mean_squared_error", "mean_squared_error", "mean_absolute_error",
"median_absolute_error", "mean_absolute_percentage_error",
"r2", "symmetric_mean_absolute_percentage_error"
]
# Quantile regression metrics
QUANTILE_METRICS = ["pinball_loss"]
def get_metric_config(
    problem_type: str,
    eval_metric: str | None = None,
    greater_is_better: bool | None = None
) -> dict:
"""
Get metric configuration for evaluation.
Parameters:
- problem_type: Type of ML problem
- eval_metric: Primary evaluation metric
- greater_is_better: Whether higher metric values are better
Returns:
Metric configuration dictionary
    """

Settings for optimizing computational resource usage, memory management, and training performance.
class ResourceConfig:
"""Configuration for computational resources and performance optimization."""
    # CPU and Memory
    num_cpus: int | str = "auto" # Number of CPU cores ("auto" or a count)
    memory_limit_mb: int | None = None # Memory limit in megabytes
    # GPU Configuration
    num_gpus: int = 0 # Number of GPU devices
    gpu_memory_limit: int | None = None # GPU memory limit
    # Disk and Storage
    disk_limit_mb: int | None = None # Disk space limit
    cache_data: bool = True # Cache preprocessed data
    # Performance Optimization
    enable_multiprocessing: bool = True # Enable multiprocessing
    max_concurrent_models: int = 1 # Maximum concurrent model training
    early_stopping_rounds: int | None = None # Early stopping configuration
# Inference Optimization
optimize_for_deployment: bool = False # Optimize for deployment
    model_compression: bool = False # Enable model compression

from autogluon.tabular import TabularPredictor
import pandas as pd
# Load data
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
# Different preset configurations
presets = ['good_quality', 'best_quality', 'optimize_for_deployment', 'interpretable']
results = {}
for preset in presets:
print(f"\nTraining with preset: {preset}")
predictor = TabularPredictor(
label='target',
path=f'./models_{preset}/'
)
predictor.fit(
train_data,
presets=preset,
time_limit=600 # 10 minutes per preset
)
# Evaluate performance
performance = predictor.evaluate(test_data)
leaderboard = predictor.leaderboard(test_data)
results[preset] = {
'score': performance,
'best_model': leaderboard.iloc[0]['model'],
'num_models': len(leaderboard)
}
print(f"Best score: {performance}")
print(f"Best model: {results[preset]['best_model']}")
print(f"Total models trained: {results[preset]['num_models']}")
# Compare results
print("\nPreset Comparison:")
for preset, result in results.items():
    print(f"{preset}: {result['score']} ({result['num_models']} models)")

from autogluon.tabular import TabularPredictor
# Advanced hyperparameter configuration
hyperparameters = {
# Gradient Boosting Models
'LGB': [
# Fast configuration
{
'num_leaves': 31,
'learning_rate': 0.1,
'feature_fraction': 0.9,
'bagging_fraction': 0.8,
'bagging_freq': 5,
'min_data_in_leaf': 20,
'objective': 'binary',
'max_depth': -1,
'save_binary': True,
'ag_args': {'name_suffix': '_Fast', 'priority': 1}
},
# Accurate configuration
{
'num_leaves': 127,
'learning_rate': 0.05,
'feature_fraction': 0.8,
'bagging_fraction': 0.9,
'bagging_freq': 5,
'min_data_in_leaf': 10,
'reg_alpha': 0.1,
'reg_lambda': 0.1,
'ag_args': {'name_suffix': '_Accurate', 'priority': 2}
}
],
'XGB': {
'n_estimators': [100, 300, 500],
'max_depth': [3, 6, 10],
'learning_rate': [0.01, 0.1, 0.2],
'subsample': [0.8, 0.9, 1.0],
'colsample_bytree': [0.8, 0.9, 1.0],
'reg_alpha': [0, 0.1, 1],
'reg_lambda': [0, 0.1, 1]
},
# Neural Networks
'NN_TORCH': [
# Small network
{
'num_epochs': 50,
'learning_rate': 0.001,
'weight_decay': 1e-4,
'dropout_prob': 0.1,
'embedding_size_factor': 1.0,
'ag_args': {'name_suffix': '_Small'}
},
# Large network
{
'num_epochs': 100,
'learning_rate': 0.0005,
'weight_decay': 1e-5,
'dropout_prob': 0.2,
'embedding_size_factor': 2.0,
'ag_args': {'name_suffix': '_Large'}
}
]
}
# Train with custom hyperparameters
predictor = TabularPredictor(label='target')
predictor.fit(
train_data,
hyperparameters=hyperparameters,
time_limit=1800, # 30 minutes
num_bag_folds=5,
num_stack_levels=2
)

from autogluon.tabular import TabularPredictor
# Advanced training arguments
ag_args_fit = {
'num_cpus': 8, # Use 8 CPU cores
'num_gpus': 1, # Use 1 GPU
'memory_limit': 16000, # 16GB memory limit
'time_limit': 300, # 5 minutes per model
}
ag_args_ensemble = {
'fold_fitting_strategy': 'sequential_local',
'auto_stack': True,
'bagging_mode': 'oob', # Out-of-bag validation
'stack_mode': 'infer',
'ensemble_size_max': 50 # Maximum ensemble size
}
# Feature generation configuration
feature_generator_kwargs = {
'enable_raw_text_features': True,
'enable_nlp_features': True,
'text_ngram_size': 300,
'text_special_features': ['word_count', 'char_count']
}
predictor = TabularPredictor(
label='target',
eval_metric='roc_auc',
sample_weight='sample_weights'
)
predictor.fit(
train_data,
tuning_data=validation_data,
time_limit=3600, # 1 hour total
presets='best_quality',
# Advanced configurations
ag_args_fit=ag_args_fit,
ag_args_ensemble=ag_args_ensemble,
feature_generator_kwargs=feature_generator_kwargs,
# Bagging and stacking
num_bag_folds=10,
num_bag_sets=3,
num_stack_levels=3,
# Model selection
excluded_model_types=['KNN'], # Exclude slow models
# Hyperparameter tuning
hyperparameter_tune_kwargs={
'scheduler': 'local',
'searcher': 'bayesopt',
'num_trials': 100
}
)

from autogluon.tabular import TabularPredictor
# Configuration optimized for deployment
deployment_hyperparameters = {
'LGB': {
'num_leaves': 31, # Smaller trees
'max_depth': 6,
'min_data_in_leaf': 50, # Regularization
'bagging_freq': 0, # Disable bagging for speed
'feature_fraction': 1.0, # Use all features
},
'CAT': {
'iterations': 100, # Fewer iterations
'depth': 6,
'l2_leaf_reg': 3,
'bootstrap_type': 'No' # Disable bootstrap
}
}
predictor = TabularPredictor(
label='target',
path='./deployment_model/'
)
predictor.fit(
train_data,
presets='optimize_for_deployment',
hyperparameters=deployment_hyperparameters,
time_limit=300, # Fast training
num_bag_folds=0, # Disable bagging
num_stack_levels=0, # Disable stacking
# Focus on fast, simple models
included_model_types=['LGB', 'CAT', 'LR']
)
# Create deployment-optimized clone
deployment_predictor = predictor.clone_for_deployment(
path='./deployment_ready/',
model='best' # Single best model only
)
# Test inference speed
import time
start_time = time.time()
predictions = deployment_predictor.predict(test_data)
inference_time = time.time() - start_time
print(f"Inference time: {inference_time:.3f} seconds")
print(f"Predictions per second: {len(test_data) / inference_time:.0f}")

from autogluon.tabular import TabularPredictor
# Configuration for interpretable models
interpretable_hyperparameters = {
'LR': { # Logistic Regression
'C': [0.01, 0.1, 1.0, 10], # Regularization
'penalty': ['l1', 'l2'],
'solver': ['liblinear', 'saga']
},
'RF': { # Random Forest
'n_estimators': [50, 100, 200],
'max_depth': [3, 5, 10], # Limit depth for interpretability
'min_samples_split': [10, 20, 50],
'max_features': ['sqrt', 'log2']
},
'XGB': { # XGBoost (regularized)
'n_estimators': [50, 100],
'max_depth': [3, 4, 5], # Shallow trees
'learning_rate': [0.1, 0.2],
'reg_alpha': [0.1, 1.0], # L1 regularization
'reg_lambda': [0.1, 1.0] # L2 regularization
}
}
predictor = TabularPredictor(
label='target',
eval_metric='accuracy'
)
predictor.fit(
train_data,
presets='interpretable',
hyperparameters=interpretable_hyperparameters,
# Enable only interpretable models
included_model_types=['LR', 'RF', 'XGB'],
# Simpler ensemble strategies
num_bag_folds=3,
num_stack_levels=1,
    # Feature processing
    feature_generator='auto' # Default automated feature engineering
)
# Analyze model interpretability
leaderboard = predictor.leaderboard(extra_info=True)
print("Interpretable models ranking:")
print(leaderboard[['model', 'score_val', 'fit_time']].head())

| Preset | Training Time | Model Diversity | Ensembling | Best For |
|---|---|---|---|---|
| medium_quality | Low | Medium | None | Quick prototyping, default preset |
| good_quality | Medium | High | Moderate | General use, balanced performance |
| high_quality | High | High | Extensive | High accuracy with fast inference |
| best_quality | Very High | Very High | Extensive | Maximum accuracy, competitions |
| optimize_for_deployment | - | - | - | Post-training optimization |
| interpretable | Low | Limited | Simple | Regulated industries, explainability |
| Code | Full Name | Category |
|---|---|---|
| LGB | LightGBM | Gradient Boosting |
| XGB | XGBoost | Gradient Boosting |
| CAT | CatBoost | Gradient Boosting |
| RF | Random Forest | Tree Ensemble |
| XT | Extra Trees | Tree Ensemble |
| LR | Linear/Logistic Regression | Linear |
| KNN | K-Nearest Neighbors | Instance-based |
| NN_TORCH | PyTorch Neural Network | Deep Learning |
| FASTAI | FastAI Neural Network | Deep Learning |
| TABPFN | TabPFN | Foundation Model |
| Use Case | CPU Cores | Memory (GB) | Time Limit | Bag Folds |
|---|---|---|---|---|
| Quick Prototype | 2-4 | 4-8 | 5-15 min | 2-3 |
| Production Model | 8-16 | 16-32 | 30-60 min | 5-10 |
| Competition | 16-32 | 32-64 | 2-8 hours | 10-20 |
| Large Dataset | 16+ | 64+ | 4+ hours | 5-10 |
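The recommendations above can be encoded as a lookup that translates a row of the table into `fit`-style keyword arguments. The `RESOURCE_GUIDE` values (upper bounds of the ranges) and the `fit_kwargs_for` helper are a sketch for illustration, not part of AutoGluon.

```python
# Illustrative encoding of the resource guide above (upper bounds of each range)
RESOURCE_GUIDE = {
    "quick_prototype": {"num_cpus": 4, "memory_gb": 8, "time_limit_s": 15 * 60, "num_bag_folds": 3},
    "production": {"num_cpus": 16, "memory_gb": 32, "time_limit_s": 60 * 60, "num_bag_folds": 10},
    "competition": {"num_cpus": 32, "memory_gb": 64, "time_limit_s": 8 * 3600, "num_bag_folds": 20},
}

def fit_kwargs_for(use_case: str) -> dict:
    """Translate a guide row into TabularPredictor.fit-style kwargs (sketch)."""
    cfg = RESOURCE_GUIDE[use_case]
    return {
        "time_limit": cfg["time_limit_s"],
        "num_bag_folds": cfg["num_bag_folds"],
        "ag_args_fit": {"num_cpus": cfg["num_cpus"], "memory_limit": cfg["memory_gb"] * 1000},
    }
```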
Install with Tessl CLI
npx tessl i tessl/pypi-autogluon--tabular