Pandas integration with sklearn providing DataFrameMapper for bridging DataFrame columns to sklearn transformations
npx @tessl/cli install tessl/pypi-sklearn-pandas@2.2.00

# sklearn-pandas

Pandas integration with scikit-learn, providing a bridge between pandas DataFrames and sklearn's machine learning transformations. The core component is DataFrameMapper, which maps DataFrame columns to different sklearn transformations that are later recombined into features.

## Package Information

- **Package Name**: sklearn-pandas
- **Language**: Python
- **Installation**: `pip install sklearn-pandas` or `conda install -c conda-forge sklearn-pandas`

## Core Imports

```python
from sklearn_pandas import DataFrameMapper
```

Additional utilities:

```python
from sklearn_pandas import gen_features, NumericalTransformer
```

## Basic Usage

```python
import pandas as pd
import numpy as np
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import StandardScaler, LabelBinarizer

# Create sample data
data = pd.DataFrame({
    'pet': ['cat', 'dog', 'dog', 'fish', 'cat', 'dog'],
    'children': [4., 6, 3, 3, 2, 3],
    'salary': [90., 24, 44, 27, 32, 59]
})

# Define feature mappings
mapper = DataFrameMapper([
    ('pet', LabelBinarizer()),
    (['children'], StandardScaler()),
    (['salary'], StandardScaler())
])

# Fit and transform the data
X_transformed = mapper.fit_transform(data)

# Or use in an sklearn pipeline
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('mapper', mapper),
    ('classifier', LogisticRegression())
])
```

## Architecture

sklearn-pandas provides several key components:

- **DataFrameMapper**: Main transformer class for mapping DataFrame columns to sklearn transformations
- **Feature Definition System**: Flexible column selection using strings, lists, or callable functions
- **Pipeline Integration**: Full compatibility with sklearn pipelines and cross-validation
- **Output Flexibility**: Support for numpy arrays, sparse matrices, or pandas DataFrames
- **Feature Naming**: Automatic generation of meaningful feature names

## Capabilities

### Data Frame Mapping

The core functionality for mapping pandas DataFrame columns to sklearn transformations, with support for flexible column selection, multiple transformers per column, and comprehensive output options.

```python { .api }
class DataFrameMapper:
    def __init__(
        self,
        features,
        default=False,
        sparse=False,
        df_out=False,
        input_df=False,
        drop_cols=None
    ):
        """
        Map pandas DataFrame columns to sklearn transformations.

        Parameters:
        - features: List of tuples with feature definitions [(columns, transformer, options), ...]
        - default: Default transformer for unselected columns (False=discard, None=passthrough, transformer=apply)
        - sparse: Return sparse matrix if True and any extracted features are sparse
        - df_out: Return pandas DataFrame with named columns
        - input_df: Pass DataFrame/Series to transformers instead of numpy arrays
        - drop_cols: List of columns to drop entirely
        """

    def fit(self, X, y=None):
        """
        Fit transformations to the data.

        Parameters:
        - X: pandas DataFrame to fit
        - y: Target vector (optional)

        Returns:
        DataFrameMapper instance
        """

    def transform(self, X):
        """
        Transform data using fitted transformations.

        Parameters:
        - X: pandas DataFrame to transform

        Returns:
        numpy array, sparse matrix, or pandas DataFrame based on configuration
        """

    def fit_transform(self, X, y=None):
        """
        Fit transformations and transform data in one step.

        Parameters:
        - X: pandas DataFrame to fit and transform
        - y: Target vector (optional)

        Returns:
        numpy array, sparse matrix, or pandas DataFrame based on configuration
        """

    def get_names(self, columns, transformer, x, alias=None, prefix='', suffix=''):
        """
        Generate verbose names for transformed columns.

        Parameters:
        - columns: Original column name(s)
        - transformer: Applied transformer
        - x: Transformed data
        - alias: Custom base name for columns
        - prefix: Prefix for column names
        - suffix: Suffix for column names

        Returns:
        List of column names
        """

    def get_dtypes(self, extracted):
        """
        Get data types for all extracted features.

        Parameters:
        - extracted: List of extracted feature arrays/DataFrames

        Returns:
        List of data types for all features
        """

    def get_dtype(self, ex):
        """
        Get data type(s) for a single extracted feature.

        Parameters:
        - ex: Single extracted feature (numpy array, sparse matrix, or DataFrame)

        Returns:
        List of data types (one per column)
        """

    # Attributes (set after transform)
    transformed_names_: list
    """
    List of column names for transformed features.
    Set automatically after calling transform() or fit_transform().
    """

    built_features: list
    """
    List of built feature definitions after calling fit().
    Contains tuples of (columns, transformer, options).
    """

    built_default: object
    """
    Built default transformer for unselected columns, if any.
    Set after calling fit().
    """
```

### Feature Generation Utilities

Helper functions for programmatically generating feature definitions and applying transformations.

```python { .api }
def gen_features(columns, classes=None, prefix='', suffix=''):
    """
    Generate a feature definition list for DataFrameMapper.

    Parameters:
    - columns: List of column names to generate features for
    - classes: List of transformer classes or dicts with class and params
    - prefix: Prefix for transformed column names
    - suffix: Suffix for transformed column names

    Returns:
    List of feature definition tuples
    """
```

### Pipeline Components

Custom pipeline components for transformer chaining and cross-validation compatibility. These must be imported from submodules.

```python
from sklearn_pandas.pipeline import TransformerPipeline, make_transformer_pipeline, _call_fit
```

```python { .api }
class TransformerPipeline(Pipeline):
    def __init__(self, steps):
        """
        Pipeline expecting all steps to be transformers.
        Inherits from sklearn.pipeline.Pipeline.

        Parameters:
        - steps: List of (name, transformer) tuples
        """

    def fit(self, X, y=None, **fit_params):
        """Fit the pipeline."""

    def transform(self, X):
        """Transform data using the pipeline."""

    def fit_transform(self, X, y=None, **fit_params):
        """Fit and transform using the pipeline."""

def make_transformer_pipeline(*steps):
    """
    Construct a TransformerPipeline from the given estimators.

    Parameters:
    - steps: Transformer instances

    Returns:
    TransformerPipeline instance
    """

def _call_fit(fit_method, X, y=None, **kwargs):
    """
    Helper for calling fit or fit_transform with the correct parameters.
    Handles transformers that may or may not accept a y parameter.

    Parameters:
    - fit_method: fit or fit_transform method of the transformer
    - X: Data to fit
    - y: Target vector relative to X (optional)
    - **kwargs: Keyword arguments to the fit method

    Returns:
    Result of the fit or fit_transform call
    """
```

### Legacy Transformers

Deprecated numerical transformers maintained for backward compatibility.

```python { .api }
class NumericalTransformer:
    """
    DEPRECATED: Will be removed in version 3.0.
    Use sklearn.base.TransformerMixin for custom transformers instead.
    """

    SUPPORTED_FUNCTIONS = ['log', 'log1p']

    def __init__(self, func):
        """
        Parameters:
        - func: Function name ('log' or 'log1p')
        """

    def fit(self, X, y=None):
        """Fit the transformer (no-op)."""

    def transform(self, X, y=None):
        """Apply the numerical transformation."""
```

### Cross-validation Support

Compatibility wrapper for older sklearn versions. Must be imported from its submodule.

```python
from sklearn_pandas.cross_validation import DataWrapper
```

```python { .api }
class DataWrapper:
    def __init__(self, df):
        """
        Wrapper for a DataFrame with positional indexing support.

        Parameters:
        - df: pandas DataFrame to wrap
        """

    def __len__(self):
        """Get the length of the wrapped DataFrame."""

    def __getitem__(self, key):
        """Get rows using iloc indexing."""
```

## Feature Definition Format

Feature definitions are tuples with 1-3 elements:

1. **Column selector** (required): String, list of strings, or callable function
2. **Transformer** (required): sklearn transformer instance or list of transformers
3. **Options** (optional): Dictionary with transformation options

### Column Selection Patterns

```python
# Single column as string - passes a 1D array, suiting transformers
# such as LabelBinarizer that expect 1D input
('column_name', LabelBinarizer())

# Single column as list - passes a 2D array to the transformer
(['column_name'], StandardScaler())

# Multiple columns
(['col1', 'col2', 'col3'], StandardScaler())

# Callable column selector
(lambda df: df.select_dtypes(include=[np.number]).columns, StandardScaler())
```

### Transformer Options

```python
# Custom column naming
('salary', StandardScaler(), {'alias': 'normalized_salary'})

# Column prefixes and suffixes
('category', LabelBinarizer(), {'prefix': 'cat_', 'suffix': '_flag'})

# Input format control
('text_col', CountVectorizer(), {'input_df': True})
```

### Multiple Transformers per Column

```python
# Chain transformers with TransformerPipeline
('numeric_col', [StandardScaler(), PCA(n_components=2)])

# Equivalent using make_transformer_pipeline
from sklearn_pandas.pipeline import make_transformer_pipeline
('numeric_col', make_transformer_pipeline(StandardScaler(), PCA(n_components=2)))
```

## Common Usage Patterns

### Working with Mixed Data Types

```python
# Handle categorical and numerical columns differently
mapper = DataFrameMapper([
    # Categorical columns - use LabelBinarizer with 1D input
    ('category', LabelBinarizer()),
    ('status', LabelBinarizer()),

    # Numerical columns - use list notation for 2D input
    (['price'], StandardScaler()),
    (['quantity'], StandardScaler()),

    # Text columns with custom options
    ('description', CountVectorizer(), {'input_df': True})
])
```

### Pipeline Integration

```python
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

# Create a complete ML pipeline
pipeline = Pipeline([
    ('features', DataFrameMapper([
        ('category', LabelBinarizer()),
        (['numerical_col'], StandardScaler())
    ])),
    ('classifier', RandomForestClassifier())
])

# Use in cross-validation
scores = cross_val_score(pipeline, df, target, cv=5)
```

### Preserving DataFrame Structure

```python
# Return transformed data as a DataFrame with named columns
mapper = DataFrameMapper([
    ('cat_col', LabelBinarizer()),
    (['num_col'], StandardScaler())
], df_out=True)

transformed_df = mapper.fit_transform(data)
# Result is a pandas DataFrame with meaningful column names
```

### Handling Default Columns

```python
# Apply a default transformation to unselected columns
mapper = DataFrameMapper([
    (['specific_col'], StandardScaler())
], default=StandardScaler())  # Apply StandardScaler to all other columns

# Or pass unselected columns through unchanged
mapper = DataFrameMapper([
    (['specific_col'], StandardScaler())
], default=None)  # Keep other columns as-is
```

## Error Handling

DataFrameMapper provides enhanced error messages that include column names for easier debugging:

```python
# If a transformation fails, the error message includes the problematic column names
try:
    mapper.fit_transform(data)
except Exception as e:
    # The message is prefixed with the column names,
    # e.g. "['column_name']: Original error message"
    print(e)
```

## Types

```python { .api }
# Feature definition tuple format
FeatureDefinition = tuple  # Format: (column_selector, transformer(s), options)
# column_selector: str, list of str, or callable
# transformer(s): sklearn transformer instance, list of transformers, or None
# options: dict or None (optional third element)

# Common option keys
TransformationOptions = dict  # {
#     'alias': str,     # Custom name for transformed features
#     'prefix': str,    # Prefix for column names
#     'suffix': str,    # Suffix for column names
#     'input_df': bool  # Pass DataFrame instead of numpy array
# }
```