# Experimental Scikit-learn Compatible Interfaces

AutoGluon provides experimental scikit-learn compatible interfaces for integration with existing scikit-learn workflows, pipelines, and ecosystem tools. These classes expose the familiar fit/predict API while using AutoGluon's automated machine learning capabilities underneath.

## Capabilities

### Tabular Classification

A scikit-learn compatible classifier interface that wraps AutoGluon's TabularPredictor for classification tasks and follows standard sklearn API conventions.

```python { .api }
class TabularClassifier:
    """
    Scikit-learn compatible classifier using AutoGluon's automated ML.

    Provides standard sklearn interface (fit, predict, predict_proba, score)
    while leveraging AutoGluon's model selection and ensemble capabilities.
    """

    def __init__(
        self,
        eval_metric: str = None,
        time_limit: float = None,
        presets: list[str] | str = None,
        hyperparameters: dict | str = None,
        path: str = None,
        verbosity: int = 2,
        init_args: dict = None,
        fit_args: dict = None
    ):
        """
        Initialize TabularClassifier.

        Parameters:
        - eval_metric: Evaluation metric for model selection
        - time_limit: Maximum training time in seconds
        - presets: Preset configurations for training
        - hyperparameters: Custom hyperparameter configurations
        - path: Directory to save models
        - verbosity: Logging level (0-4)
        - init_args: Additional initialization arguments
        - fit_args: Additional fitting arguments
        """

    def fit(
        self,
        X: pd.DataFrame | np.ndarray,
        y: pd.Series | np.ndarray,
        **kwargs
    ) -> 'TabularClassifier':
        """
        Train the classifier on the provided data.

        Parameters:
        - X: Training features
        - y: Training labels
        - kwargs: Additional arguments passed to TabularPredictor.fit()

        Returns:
        Self (fitted TabularClassifier)
        """

    def predict(
        self,
        X: pd.DataFrame | np.ndarray
    ) -> np.ndarray:
        """
        Generate class predictions for input data.

        Parameters:
        - X: Input features

        Returns:
        Predicted class labels as numpy array
        """

    def predict_proba(
        self,
        X: pd.DataFrame | np.ndarray
    ) -> np.ndarray:
        """
        Generate class probabilities for input data.

        Parameters:
        - X: Input features

        Returns:
        Class probabilities as numpy array
        """

    def score(
        self,
        X: pd.DataFrame | np.ndarray,
        y: pd.Series | np.ndarray,
        sample_weight: np.ndarray = None
    ) -> float:
        """
        Calculate accuracy score on the given test data and labels.

        Parameters:
        - X: Test features
        - y: True labels
        - sample_weight: Sample weights for scoring

        Returns:
        Mean accuracy score
        """
```
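
The method surface above follows the usual scikit-learn estimator conventions: `fit` returns the estimator itself, `predict` returns one label per row, `predict_proba` returns one row of class probabilities per input row, and `score` reports mean accuracy. The sketch below illustrates that contract with a hypothetical stand-in class, `MajorityClassifier`, written in plain Python. It is not AutoGluon code and trains nothing useful; it only mirrors the shape of the interface.

```python
from collections import Counter


class MajorityClassifier:
    """Hypothetical stand-in mirroring the fit/predict/predict_proba/score surface."""

    def fit(self, X, y):
        counts = Counter(y)
        # Attributes with a trailing underscore are set only during fit (sklearn convention).
        self.classes_ = sorted(counts)
        self.priors_ = [counts[c] / len(y) for c in self.classes_]
        return self  # returning self allows chaining: clf.fit(X, y).predict(X)

    def predict(self, X):
        # Always predict the most frequent training class, one label per row.
        majority = self.classes_[self.priors_.index(max(self.priors_))]
        return [majority for _ in X]

    def predict_proba(self, X):
        # One row per sample, one column per class, ordered like classes_.
        return [list(self.priors_) for _ in X]

    def score(self, X, y):
        # Mean accuracy: fraction of predictions that match the true labels.
        predictions = self.predict(X)
        return sum(p == t for p, t in zip(predictions, y)) / len(y)


X = [[0.1], [0.4], [0.9], [1.3]]
y = ["a", "b", "b", "b"]

clf = MajorityClassifier().fit(X, y)
print(clf.predict(X))
print(clf.predict_proba(X)[0])
print(clf.score(X, y))
```

Any object that honors this contract can be swapped into code written for sklearn classifiers, which is the property the usage examples below rely on.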

### Tabular Regression

A scikit-learn compatible regressor interface that wraps AutoGluon's TabularPredictor for regression tasks and follows standard sklearn API conventions.

```python { .api }
class TabularRegressor:
    """
    Scikit-learn compatible regressor using AutoGluon's automated ML.

    Provides standard sklearn interface (fit, predict, score)
    while leveraging AutoGluon's model selection and ensemble capabilities.
    """

    def __init__(
        self,
        eval_metric: str = None,
        time_limit: float = None,
        presets: list[str] | str = None,
        hyperparameters: dict | str = None,
        path: str = None,
        verbosity: int = 2,
        init_args: dict = None,
        fit_args: dict = None
    ):
        """
        Initialize TabularRegressor.

        Parameters:
        - eval_metric: Evaluation metric for model selection
        - time_limit: Maximum training time in seconds
        - presets: Preset configurations for training
        - hyperparameters: Custom hyperparameter configurations
        - path: Directory to save models
        - verbosity: Logging level (0-4)
        - init_args: Additional initialization arguments
        - fit_args: Additional fitting arguments
        """

    def fit(
        self,
        X: pd.DataFrame | np.ndarray,
        y: pd.Series | np.ndarray,
        **kwargs
    ) -> 'TabularRegressor':
        """
        Train the regressor on the provided data.

        Parameters:
        - X: Training features
        - y: Training target values
        - kwargs: Additional arguments passed to TabularPredictor.fit()

        Returns:
        Self (fitted TabularRegressor)
        """

    def predict(
        self,
        X: pd.DataFrame | np.ndarray
    ) -> np.ndarray:
        """
        Generate predictions for input data.

        Parameters:
        - X: Input features

        Returns:
        Predicted values as numpy array
        """

    def score(
        self,
        X: pd.DataFrame | np.ndarray,
        y: pd.Series | np.ndarray,
        sample_weight: np.ndarray = None
    ) -> float:
        """
        Calculate R² coefficient of determination on test data.

        Parameters:
        - X: Test features
        - y: True target values
        - sample_weight: Sample weights for scoring

        Returns:
        R² score
        """
```
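
To make the return value of `score` concrete, the following sketch computes the coefficient of determination by hand using the standard definition, R² = 1 - SS_res / SS_tot, with optional per-sample weights. The helper name `r2_score_sketch` is hypothetical and the code is a plain-Python illustration of the formula, not the implementation used by the regressor.

```python
def r2_score_sketch(y_true, y_pred, sample_weight=None):
    """Weighted R^2 = 1 - SS_res / SS_tot (undefined when y_true is constant)."""
    if sample_weight is None:
        sample_weight = [1.0] * len(y_true)
    total_weight = sum(sample_weight)
    # Weighted mean of the true targets.
    mean_true = sum(w * t for w, t in zip(sample_weight, y_true)) / total_weight
    # Residual sum of squares: error left after using the model's predictions.
    ss_res = sum(w * (t - p) ** 2 for w, t, p in zip(sample_weight, y_true, y_pred))
    # Total sum of squares: error of always predicting the weighted mean.
    ss_tot = sum(w * (t - mean_true) ** 2 for w, t in zip(sample_weight, y_true))
    return 1.0 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]

perfect = r2_score_sketch(y_true, y_true)                   # predictions equal targets
baseline = r2_score_sketch(y_true, [6.0] * 4)               # always predict the mean
close = r2_score_sketch(y_true, [2.0, 5.0, 7.0, 10.0])      # two predictions off by 1
print(perfect, baseline, close)
```

A score of 1.0 means perfect predictions, 0.0 means the model does no better than predicting the mean, and negative values mean it does worse than that baseline.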

## Usage Examples

### Classification with Scikit-learn Pipeline

```python
from autogluon.tabular.experimental import TabularClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
import pandas as pd

# Load data (StandardScaler below expects numeric feature columns only)
X_train = pd.read_csv('X_train.csv')
y_train = pd.read_csv('y_train.csv').squeeze()
X_test = pd.read_csv('X_test.csv')

# Create sklearn-compatible classifier
classifier = TabularClassifier(
    eval_metric='roc_auc',
    verbosity=1
)

# Use in sklearn pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', classifier)
])

# Cross-validation with sklearn ('roc_auc' scoring assumes a binary target)
cv_scores = cross_val_score(
    pipeline,
    X_train,
    y_train,
    cv=5,
    scoring='roc_auc'
)

print(f"Cross-validation AUC: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")

# Fit and predict
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
probabilities = pipeline.predict_proba(X_test)
```

### Regression with GridSearchCV

```python
from autogluon.tabular.experimental import TabularRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
import pandas as pd
import numpy as np

# Load regression data
X_train = pd.read_csv('X_train.csv')
y_train = pd.read_csv('y_train.csv').squeeze()
X_test = pd.read_csv('X_test.csv')
y_test = pd.read_csv('y_test.csv').squeeze()

# Create regressor
regressor = TabularRegressor(verbosity=1)

# Grid search over AutoGluon parameters
param_grid = {
    'eval_metric': ['mean_squared_error', 'mean_absolute_error'],
    'time_limit': [300, 600],
    'presets': ['good_quality', 'best_quality']
}

# Perform grid search
grid_search = GridSearchCV(
    regressor,
    param_grid,
    cv=3,
    scoring='neg_mean_squared_error',
    n_jobs=1  # AutoGluon handles parallelization internally
)

# Fit with grid search
grid_search.fit(X_train, y_train)

# Best model predictions
best_model = grid_search.best_estimator_
predictions = best_model.predict(X_test)

# Evaluate
mse = mean_squared_error(y_test, predictions)
rmse = np.sqrt(mse)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Test RMSE: {rmse:.4f}")
print(f"Test R²: {best_model.score(X_test, y_test):.4f}")
```
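
Because every grid point triggers a full AutoGluon training run per fold, the grid above can take a long time. The sketch below estimates the number of fits and an upper bound on training time for that grid. It assumes each fit runs for at most its `time_limit` (which AutoGluon may treat as approximate), that `GridSearchCV` keeps its default `refit=True`, and it ignores prediction and scoring overhead.

```python
from itertools import product

param_grid = {
    "eval_metric": ["mean_squared_error", "mean_absolute_error"],
    "time_limit": [300, 600],
    "presets": ["good_quality", "best_quality"],
}
cv_folds = 3

# Expand the grid into one dict per parameter combination.
keys = list(param_grid)
combinations = [dict(zip(keys, values)) for values in product(*param_grid.values())]

# Each combination is fitted once per fold; refit=True adds one final fit on all data.
n_fits = len(combinations) * cv_folds + 1

# Sum the per-fit limits over all folds to bound the cross-validation phase.
cv_seconds = sum(combo["time_limit"] for combo in combinations) * cv_folds
# The final refit uses the winning combination; take the largest limit as the worst case.
refit_seconds = max(param_grid["time_limit"])
total_hours = (cv_seconds + refit_seconds) / 3600

print(f"{len(combinations)} combinations, {n_fits} fits, up to {total_hours:.1f} h of training")
```

For this grid the estimate is 8 combinations and 25 fits, with an upper bound of roughly 3.2 hours, so it is worth shrinking the grid or the time limits before running it on real data.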

### Integration with Model Selection

```python
from autogluon.tabular.experimental import TabularClassifier
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
import pandas as pd

# Prepare data
X = pd.read_csv('features.csv')
y = pd.read_csv('target.csv').squeeze()
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Compare AutoGluon with sklearn models
models = {
    'AutoGluon': TabularClassifier(time_limit=300, verbosity=0),
    'RandomForest': RandomForestClassifier(n_estimators=100, random_state=42),
    'LogisticRegression': LogisticRegression(random_state=42)
}

results = {}
for name, model in models.items():
    # Fit model
    model.fit(X_train, y_train)

    # Predictions
    predictions = model.predict(X_val)

    # Store results
    results[name] = {
        'accuracy': model.score(X_val, y_val),
        'predictions': predictions
    }

    print(f"\n{name} Results:")
    print(f"Accuracy: {results[name]['accuracy']:.4f}")
    print(classification_report(y_val, predictions))
```

### Advanced Usage with Custom Configurations

```python
from autogluon.tabular.experimental import TabularClassifier

# X_train, y_train, and X_test are assumed to be loaded as in the examples above

# Custom hyperparameters for AutoGluon models.
# Keys are AutoGluon model names ('GBM' is LightGBM); a list of dicts
# trains one model per configuration.
hyperparameters = {
    'GBM': [{'num_leaves': 26}, {'num_leaves': 66}, {'num_leaves': 176}],
    'XGB': [{'n_estimators': 50}, {'n_estimators': 100}, {'n_estimators': 200}],
    'CAT': [{'iterations': 100}, {'iterations': 200}, {'iterations': 500}]
}

# Advanced classifier with custom settings.
# All options go through the documented constructor parameters:
# problem_type via init_args, extra fit options via fit_args.
classifier = TabularClassifier(
    eval_metric='f1_macro',
    time_limit=900,
    presets='best_quality',
    hyperparameters=hyperparameters,
    path='./sklearn_compatible_models/',
    verbosity=2,
    init_args={'problem_type': 'multiclass'},
    fit_args={'num_bag_folds': 5}
)

# Fit with the configuration above
classifier.fit(X_train, y_train)

# Access the underlying AutoGluon predictor for advanced functionality.
# Assumption: it is exposed as `predictor_`, following sklearn's naming for
# fitted attributes; check the attribute name in your installed version.
autogluon_predictor = classifier.predictor_
leaderboard = autogluon_predictor.leaderboard(extra_info=True)
print(leaderboard)

# Standard sklearn predictions
predictions = classifier.predict(X_test)
probabilities = classifier.predict_proba(X_test)
```

## Notes

- **Experimental Status**: These interfaces are experimental and may change in future versions.
- **Feature Compatibility**: Most AutoGluon features remain accessible through the underlying predictor.
- **Performance**: Because the classes wrap TabularPredictor, predictive performance should match using TabularPredictor directly with the same settings.
- **Integration**: Designed to work with sklearn pipelines, grid search, and cross-validation.
- **Persistence**: Trained models are saved to the directory given by `path`.
- **Parallelization**: AutoGluon handles parallelization internally; keep `n_jobs=1` in sklearn tools to avoid nested parallelization.