Pandas integration with sklearn providing DataFrameMapper for bridging DataFrame columns to sklearn transformations
npx @tessl/cli install tessl/pypi-sklearn-pandas@2.2.00

# sklearn-pandas

Pandas integration with scikit-learn, providing a bridge between pandas DataFrames and sklearn's machine learning transformations. The core component is DataFrameMapper, which maps DataFrame columns to different sklearn transformations that are later recombined into features.

## Package Information

- **Package Name**: sklearn-pandas
- **Language**: Python
- **Installation**: `pip install sklearn-pandas` or `conda install -c conda-forge sklearn-pandas`

## Core Imports

```python
from sklearn_pandas import DataFrameMapper
```

Additional utilities:

```python
from sklearn_pandas import gen_features, NumericalTransformer
```

## Basic Usage

```python
import pandas as pd
import numpy as np
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import StandardScaler, LabelBinarizer

# Create sample data
data = pd.DataFrame({
    'pet': ['cat', 'dog', 'dog', 'fish', 'cat', 'dog'],
    'children': [4., 6, 3, 3, 2, 3],
    'salary': [90., 24, 44, 27, 32, 59]
})

# Define feature mappings
mapper = DataFrameMapper([
    ('pet', LabelBinarizer()),
    (['children'], StandardScaler()),
    (['salary'], StandardScaler())
])

# Fit and transform the data
X_transformed = mapper.fit_transform(data)

# Or use in an sklearn pipeline
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('mapper', mapper),
    ('classifier', LogisticRegression())
])
```

## Architecture

sklearn-pandas provides several key components:

- **DataFrameMapper**: Main transformer class for mapping DataFrame columns to sklearn transformations
- **Feature Definition System**: Flexible column selection using strings, lists, or callable functions
- **Pipeline Integration**: Full compatibility with sklearn pipelines and cross-validation
- **Output Flexibility**: Support for numpy arrays, sparse matrices, or pandas DataFrames
- **Feature Naming**: Automatic generation of meaningful feature names

## Capabilities

### Data Frame Mapping

The core functionality for mapping pandas DataFrame columns to sklearn transformations, with support for flexible column selection, multiple transformers per column, and comprehensive output options.

```python { .api }
class DataFrameMapper:
    def __init__(
        self,
        features,
        default=False,
        sparse=False,
        df_out=False,
        input_df=False,
        drop_cols=None
    ):
        """
        Map pandas DataFrame columns to sklearn transformations.

        Parameters:
        - features: List of tuples with feature definitions [(columns, transformer, options), ...]
        - default: Default transformer for unselected columns (False=discard, None=passthrough, transformer=apply)
        - sparse: Return sparse matrix if True and any extracted features are sparse
        - df_out: Return pandas DataFrame with named columns
        - input_df: Pass DataFrame/Series to transformers instead of numpy arrays
        - drop_cols: List of columns to drop entirely
        """

    def fit(self, X, y=None):
        """
        Fit transformations to the data.

        Parameters:
        - X: pandas DataFrame to fit
        - y: Target vector (optional)

        Returns:
        DataFrameMapper instance
        """

    def transform(self, X):
        """
        Transform data using fitted transformations.

        Parameters:
        - X: pandas DataFrame to transform

        Returns:
        numpy array, sparse matrix, or pandas DataFrame based on configuration
        """

    def fit_transform(self, X, y=None):
        """
        Fit transformations and transform data in one step.

        Parameters:
        - X: pandas DataFrame to fit and transform
        - y: Target vector (optional)

        Returns:
        numpy array, sparse matrix, or pandas DataFrame based on configuration
        """

    def get_names(self, columns, transformer, x, alias=None, prefix='', suffix=''):
        """
        Generate verbose names for transformed columns.

        Parameters:
        - columns: Original column name(s)
        - transformer: Applied transformer
        - x: Transformed data
        - alias: Custom base name for columns
        - prefix: Prefix for column names
        - suffix: Suffix for column names

        Returns:
        List of column names
        """

    def get_dtypes(self, extracted):
        """
        Get data types for all extracted features.

        Parameters:
        - extracted: List of extracted feature arrays/DataFrames

        Returns:
        List of data types for all features
        """

    def get_dtype(self, ex):
        """
        Get data type(s) for a single extracted feature.

        Parameters:
        - ex: Single extracted feature (numpy array, sparse matrix, or DataFrame)

        Returns:
        List of data types (one per column)
        """

    # Attributes (set after transform)
    transformed_names_: list
    """
    List of column names for transformed features.
    Set automatically after calling transform() or fit_transform().
    """

    built_features: list
    """
    List of built feature definitions after calling fit().
    Contains tuples of (columns, transformer, options).
    """

    built_default: object
    """
    Built default transformer for unselected columns, if any.
    Set after calling fit().
    """
```

### Feature Generation Utilities

Helper functions for programmatically generating feature definitions and applying transformations.

```python { .api }
def gen_features(columns, classes=None, prefix='', suffix=''):
    """
    Generate a feature definition list for DataFrameMapper.

    Parameters:
    - columns: List of column names to generate features for
    - classes: List of transformer classes or dicts with class and params
    - prefix: Prefix for transformed column names
    - suffix: Suffix for transformed column names

    Returns:
    List of feature definition tuples
    """
```

### Pipeline Components

Custom pipeline components for transformer chaining and cross-validation compatibility. These must be imported from submodules.

```python
from sklearn_pandas.pipeline import TransformerPipeline, make_transformer_pipeline, _call_fit
```

```python { .api }
class TransformerPipeline(Pipeline):
    def __init__(self, steps):
        """
        Pipeline expecting all steps to be transformers.
        Inherits from sklearn.pipeline.Pipeline.

        Parameters:
        - steps: List of (name, transformer) tuples
        """

    def fit(self, X, y=None, **fit_params):
        """Fit the pipeline."""

    def transform(self, X):
        """Transform data using the pipeline."""

    def fit_transform(self, X, y=None, **fit_params):
        """Fit and transform using the pipeline."""

def make_transformer_pipeline(*steps):
    """
    Construct a TransformerPipeline from the given estimators.

    Parameters:
    - steps: Transformer instances

    Returns:
    TransformerPipeline instance
    """

def _call_fit(fit_method, X, y=None, **kwargs):
    """
    Helper for calling fit or fit_transform with the correct parameters.
    Handles transformers that may or may not accept a y parameter.

    Parameters:
    - fit_method: fit or fit_transform method of the transformer
    - X: Data to fit
    - y: Target vector relative to X (optional)
    - **kwargs: Keyword arguments to the fit method

    Returns:
    Result of the fit or fit_transform call
    """
```

### Legacy Transformers

Deprecated numerical transformers maintained for backward compatibility.

```python { .api }
class NumericalTransformer:
    """
    DEPRECATED: Will be removed in version 3.0.
    Use sklearn.base.TransformerMixin for custom transformers instead.
    """

    SUPPORTED_FUNCTIONS = ['log', 'log1p']

    def __init__(self, func):
        """
        Parameters:
        - func: Function name ('log' or 'log1p')
        """

    def fit(self, X, y=None):
        """Fit the transformer (no-op)."""

    def transform(self, X, y=None):
        """Apply the numerical transformation."""
```

### Cross-validation Support

Compatibility wrapper for older sklearn versions. Must be imported from its submodule.

```python
from sklearn_pandas.cross_validation import DataWrapper
```

```python { .api }
class DataWrapper:
    def __init__(self, df):
        """
        Wrapper for a DataFrame with positional indexing support.

        Parameters:
        - df: pandas DataFrame to wrap
        """

    def __len__(self):
        """Get the length of the wrapped DataFrame."""

    def __getitem__(self, key):
        """Get rows using iloc indexing."""
```

## Feature Definition Format

Feature definitions are tuples with 1-3 elements:

1. **Column selector** (required): String, list of strings, or callable function
2. **Transformer** (required): sklearn transformer instance or list of transformers
3. **Options** (optional): Dictionary with transformation options

### Column Selection Patterns

```python
# Single column as string - passes a 1D array, suiting transformers
# such as LabelBinarizer that expect 1D input
('column_name', LabelBinarizer())

# Single column as list - passes a 2D array to the transformer
(['column_name'], StandardScaler())

# Multiple columns
(['col1', 'col2', 'col3'], StandardScaler())

# Callable column selector
(lambda df: df.select_dtypes(include=[np.number]).columns, StandardScaler())
```

### Transformer Options

```python
# Custom column naming
('salary', StandardScaler(), {'alias': 'normalized_salary'})

# Column prefixes and suffixes
('category', LabelBinarizer(), {'prefix': 'cat_', 'suffix': '_flag'})

# Input format control
('text_col', CountVectorizer(), {'input_df': True})
```

### Multiple Transformers per Column

```python
# Chain transformers with TransformerPipeline
('numeric_col', [StandardScaler(), PCA(n_components=2)])

# Equivalent using make_transformer_pipeline
from sklearn_pandas.pipeline import make_transformer_pipeline
('numeric_col', make_transformer_pipeline(StandardScaler(), PCA(n_components=2)))
```

## Common Usage Patterns

### Working with Mixed Data Types

```python
# Handle categorical and numerical columns differently
mapper = DataFrameMapper([
    # Categorical columns - use LabelBinarizer with 1D input
    ('category', LabelBinarizer()),
    ('status', LabelBinarizer()),

    # Numerical columns - use list notation for 2D input
    (['price'], StandardScaler()),
    (['quantity'], StandardScaler()),

    # Text columns with custom options
    ('description', CountVectorizer(), {'input_df': True})
])
```

### Pipeline Integration

```python
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

# Create a complete ML pipeline
pipeline = Pipeline([
    ('features', DataFrameMapper([
        ('category', LabelBinarizer()),
        (['numerical_col'], StandardScaler())
    ])),
    ('classifier', RandomForestClassifier())
])

# Use in cross-validation
scores = cross_val_score(pipeline, df, target, cv=5)
```

### Preserving DataFrame Structure

```python
# Return transformed data as a DataFrame with named columns
mapper = DataFrameMapper([
    ('cat_col', LabelBinarizer()),
    (['num_col'], StandardScaler())
], df_out=True)

transformed_df = mapper.fit_transform(data)
# Result is a pandas DataFrame with meaningful column names
```

### Handling Default Columns

```python
# Apply a default transformation to unselected columns
mapper = DataFrameMapper([
    (['specific_col'], StandardScaler())
], default=StandardScaler())  # Apply StandardScaler to all other columns

# Or pass unselected columns through unchanged
mapper = DataFrameMapper([
    (['specific_col'], StandardScaler())
], default=None)  # Keep other columns as-is
```

## Error Handling

DataFrameMapper provides enhanced error messages that include column names for easier debugging:

```python
# If a transformation fails, the error message includes the problematic column names
try:
    mapper.fit_transform(data)
except Exception as e:
    # The message is prefixed with the column names,
    # e.g. "['column_name']: Original error message"
    print(e)
```

## Types

```python { .api }
# Feature definition tuple format
FeatureDefinition = tuple  # Format: (column_selector, transformer(s), options)
# column_selector: str, list of str, or callable
# transformer(s): sklearn transformer instance, list of transformers, or None
# options: dict or None (optional third element)

# Common option keys
TransformationOptions = dict  # {
#     'alias': str,     # Custom name for transformed features
#     'prefix': str,    # Prefix for column names
#     'suffix': str,    # Suffix for column names
#     'input_df': bool  # Pass DataFrame instead of numpy array
# }
```