
# sklearn-pandas

Pandas integration with scikit-learn, providing a bridge between pandas DataFrames and sklearn's machine learning transformations. The core component is DataFrameMapper, which maps DataFrame columns to sklearn transformations whose outputs are later recombined into a single feature matrix.

## Package Information

- **Package Name**: sklearn-pandas
- **Language**: Python
- **Installation**: `pip install sklearn-pandas` or `conda install -c conda-forge sklearn-pandas`

## Core Imports

```python
from sklearn_pandas import DataFrameMapper
```

Additional utilities:

```python
from sklearn_pandas import gen_features, NumericalTransformer
```

## Basic Usage

```python
import pandas as pd
import numpy as np
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import StandardScaler, LabelBinarizer

# Create sample data
data = pd.DataFrame({
    'pet': ['cat', 'dog', 'dog', 'fish', 'cat', 'dog'],
    'children': [4., 6, 3, 3, 2, 3],
    'salary': [90., 24, 44, 27, 32, 59]
})

# Define feature mappings
mapper = DataFrameMapper([
    ('pet', LabelBinarizer()),
    (['children'], StandardScaler()),
    (['salary'], StandardScaler())
])

# Fit and transform the data
X_transformed = mapper.fit_transform(data)

# Or use in an sklearn pipeline
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('mapper', mapper),
    ('classifier', LogisticRegression())
])
```

## Architecture

sklearn-pandas provides several key components:

- **DataFrameMapper**: Main transformer class for mapping DataFrame columns to sklearn transformations
- **Feature Definition System**: Flexible column selection using strings, lists, or callable functions
- **Pipeline Integration**: Full compatibility with sklearn pipelines and cross-validation
- **Output Flexibility**: Support for numpy arrays, sparse matrices, or pandas DataFrames
- **Feature Naming**: Automatic generation of meaningful feature names

## Capabilities

### Data Frame Mapping

The core functionality for mapping pandas DataFrame columns to sklearn transformations, with support for flexible column selection, multiple transformers per column, and comprehensive output options.

```python { .api }
class DataFrameMapper:
    def __init__(
        self,
        features,
        default=False,
        sparse=False,
        df_out=False,
        input_df=False,
        drop_cols=None
    ):
        """
        Map pandas DataFrame columns to sklearn transformations.

        Parameters:
        - features: List of tuples with feature definitions [(columns, transformer, options), ...]
        - default: Default transformer for unselected columns (False=discard, None=passthrough, transformer=apply)
        - sparse: Return sparse matrix if True and any features are sparse
        - df_out: Return pandas DataFrame with named columns
        - input_df: Pass DataFrame/Series to transformers instead of numpy arrays
        - drop_cols: List of columns to drop entirely
        """

    def fit(self, X, y=None):
        """
        Fit transformations to the data.

        Parameters:
        - X: pandas DataFrame to fit
        - y: Target vector (optional)

        Returns:
        DataFrameMapper instance
        """

    def transform(self, X):
        """
        Transform data using fitted transformations.

        Parameters:
        - X: pandas DataFrame to transform

        Returns:
        numpy array, sparse matrix, or pandas DataFrame based on configuration
        """

    def fit_transform(self, X, y=None):
        """
        Fit transformations and transform data in one step.

        Parameters:
        - X: pandas DataFrame to fit and transform
        - y: Target vector (optional)

        Returns:
        numpy array, sparse matrix, or pandas DataFrame based on configuration
        """

    def get_names(self, columns, transformer, x, alias=None, prefix='', suffix=''):
        """
        Generate verbose names for transformed columns.

        Parameters:
        - columns: Original column name(s)
        - transformer: Applied transformer
        - x: Transformed data
        - alias: Custom base name for columns
        - prefix: Prefix for column names
        - suffix: Suffix for column names

        Returns:
        List of column names
        """

    def get_dtypes(self, extracted):
        """
        Get data types for all extracted features.

        Parameters:
        - extracted: List of extracted feature arrays/DataFrames

        Returns:
        List of data types for all features
        """

    def get_dtype(self, ex):
        """
        Get data type(s) for a single extracted feature.

        Parameters:
        - ex: Single extracted feature (numpy array, sparse matrix, or DataFrame)

        Returns:
        List of data types (one per column)
        """

    # Attributes (set after transform)
    transformed_names_: list
    """
    List of column names for transformed features.
    Set automatically after calling transform() or fit_transform().
    """

    built_features: list
    """
    List of built feature definitions after calling fit().
    Contains tuples of (columns, transformer, options).
    """

    built_default: object
    """
    Built default transformer for unselected columns, if any.
    Set after calling fit().
    """
```

### Feature Generation Utilities

Helper functions for programmatically generating feature definitions and applying transformations.

```python { .api }
def gen_features(columns, classes=None, prefix='', suffix=''):
    """
    Generate feature definition list for DataFrameMapper.

    Parameters:
    - columns: List of column names to generate features for
    - classes: List of transformer classes or dicts with class and params
    - prefix: Prefix for transformed column names
    - suffix: Suffix for transformed column names

    Returns:
    List of feature definition tuples
    """
```

### Pipeline Components

Custom pipeline components for transformer chaining and cross-validation compatibility. These must be imported from submodules.

```python
from sklearn_pandas.pipeline import TransformerPipeline, make_transformer_pipeline, _call_fit
```

```python { .api }
class TransformerPipeline(Pipeline):
    def __init__(self, steps):
        """
        Pipeline expecting all steps to be transformers.
        Inherits from sklearn.pipeline.Pipeline.

        Parameters:
        - steps: List of (name, transformer) tuples
        """

    def fit(self, X, y=None, **fit_params):
        """Fit the pipeline."""

    def transform(self, X):
        """Transform data using the pipeline."""

    def fit_transform(self, X, y=None, **fit_params):
        """Fit and transform using the pipeline."""

def make_transformer_pipeline(*steps):
    """
    Construct TransformerPipeline from estimators.

    Parameters:
    - steps: Transformer instances

    Returns:
    TransformerPipeline instance
    """

def _call_fit(fit_method, X, y=None, **kwargs):
    """
    Helper function for calling fit or fit_transform methods with correct parameters.
    Handles transformers that may or may not accept the y parameter.

    Parameters:
    - fit_method: fit or fit_transform method of the transformer
    - X: Data to fit
    - y: Target vector relative to X (optional)
    - **kwargs: Keyword arguments to the fit method

    Returns:
    Result of the fit or fit_transform method
    """
```

### Legacy Transformers

Deprecated numerical transformers maintained for backward compatibility.

```python { .api }
class NumericalTransformer:
    """
    DEPRECATED: Will be removed in version 3.0.
    Use sklearn.base.TransformerMixin for custom transformers.
    """

    SUPPORTED_FUNCTIONS = ['log', 'log1p']

    def __init__(self, func):
        """
        Parameters:
        - func: Function name ('log' or 'log1p')
        """

    def fit(self, X, y=None):
        """Fit transformer (no-op)."""

    def transform(self, X, y=None):
        """Apply numerical transformation."""
```
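Since `NumericalTransformer` is deprecated, one common replacement is plain sklearn's `FunctionTransformer`, which covers the same `log`/`log1p` use cases (this substitution is our suggestion, not part of sklearn-pandas itself):

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Equivalent of NumericalTransformer('log1p') using only sklearn
log1p = FunctionTransformer(np.log1p)
result = log1p.fit_transform(np.array([[0.0], [np.e - 1]]))
# result is approximately [[0.0], [1.0]]
```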

### Cross-validation Support

Compatibility wrapper for older sklearn versions. Must be imported from its submodule.

```python
from sklearn_pandas.cross_validation import DataWrapper
```

```python { .api }
class DataWrapper:
    def __init__(self, df):
        """
        Wrapper for a DataFrame with indexing support.

        Parameters:
        - df: pandas DataFrame to wrap
        """

    def __len__(self):
        """Get length of the wrapped DataFrame."""

    def __getitem__(self, key):
        """Get item using iloc indexing."""
```

## Feature Definition Format

Feature definitions are tuples with 1-3 elements:

1. **Column selector** (required): String, list of strings, or callable function
2. **Transformer** (required): sklearn transformer instance or list of transformers
3. **Options** (optional): Dictionary with transformation options

### Column Selection Patterns

```python
# Single column as string - passes 1D array to transformer
('column_name', StandardScaler())

# Single column as list - passes 2D array to transformer
(['column_name'], StandardScaler())

# Multiple columns
(['col1', 'col2', 'col3'], StandardScaler())

# Callable column selector
(lambda df: df.select_dtypes(include=[np.number]).columns, StandardScaler())
```
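The difference between the string and list selectors comes down to array dimensionality, which can be seen with pandas alone (using a hypothetical column `x`):

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]})

# String selector: the transformer sees a 1D array (from a Series)
print(df['x'].to_numpy().shape)    # (3,)

# List selector: the transformer sees a 2D array (from a one-column DataFrame)
print(df[['x']].to_numpy().shape)  # (3, 1)
```

This is why transformers that require 2D input, such as StandardScaler, are paired with the list form, while transformers expecting 1D input, such as LabelBinarizer, take the string form.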

### Transformer Options

```python
# Custom column naming
('salary', StandardScaler(), {'alias': 'normalized_salary'})

# Column prefixes and suffixes
('category', LabelBinarizer(), {'prefix': 'cat_', 'suffix': '_flag'})

# Input format control
('text_col', CountVectorizer(), {'input_df': True})
```

### Multiple Transformers per Column

```python
# Chain transformers: a list is wrapped in a TransformerPipeline.
# Note the list selector: PCA requires 2D input, and n_components=2
# needs at least two input columns.
(['col1', 'col2', 'col3'], [StandardScaler(), PCA(n_components=2)])

# Equivalent using make_transformer_pipeline
from sklearn_pandas.pipeline import make_transformer_pipeline
(['col1', 'col2', 'col3'], make_transformer_pipeline(StandardScaler(), PCA(n_components=2)))
```

## Common Usage Patterns

### Working with Mixed Data Types

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelBinarizer, StandardScaler
from sklearn_pandas import DataFrameMapper

# Handle categorical and numerical columns differently
mapper = DataFrameMapper([
    # Categorical columns - use LabelBinarizer with 1D input
    ('category', LabelBinarizer()),
    ('status', LabelBinarizer()),

    # Numerical columns - use list notation for 2D input
    (['price'], StandardScaler()),
    (['quantity'], StandardScaler()),

    # Text columns with custom options
    ('description', CountVectorizer(), {'input_df': True})
])
```

### Pipeline Integration

```python
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

# Create complete ML pipeline
pipeline = Pipeline([
    ('features', DataFrameMapper([
        ('category', LabelBinarizer()),
        (['numerical_col'], StandardScaler())
    ])),
    ('classifier', RandomForestClassifier())
])

# Use in cross-validation
scores = cross_val_score(pipeline, df, target, cv=5)
```

### Preserving DataFrame Structure

```python
# Return transformed data as DataFrame with named columns
mapper = DataFrameMapper([
    ('cat_col', LabelBinarizer()),
    (['num_col'], StandardScaler())
], df_out=True)

transformed_df = mapper.fit_transform(data)
# Result is a pandas DataFrame with meaningful column names
```

### Handling Default Columns

```python
# Apply a default transformation to unselected columns.
# Note the list selector: StandardScaler requires 2D input.
mapper = DataFrameMapper([
    (['specific_col'], StandardScaler())
], default=StandardScaler())  # Apply StandardScaler to all other columns

# Or pass unselected columns through unchanged
mapper = DataFrameMapper([
    (['specific_col'], StandardScaler())
], default=None)  # Keep other columns as-is
```

## Error Handling

DataFrameMapper provides enhanced error messages that include column names for easier debugging:

```python
# If transformation fails, error message includes problematic column names
try:
    mapper.fit_transform(data)
except Exception as e:
    # Error message will include column names like: "['column_name']: Original error message"
    print(e)
```

## Types

```python { .api }
# Feature definition tuple format
FeatureDefinition = tuple  # Format: (column_selector, transformer(s), options)
# column_selector: str, list of str, or callable
# transformer(s): sklearn transformer instance, list of transformers, or None
# options: dict (optional third element)

# Common option keys
TransformationOptions = dict  # {
#     'alias': str,     # Custom name for transformed features
#     'prefix': str,    # Prefix for column names
#     'suffix': str,    # Suffix for column names
#     'input_df': bool  # Pass DataFrame instead of numpy array
# }
```