
# Categorical Variable Encoding


Transformers for converting categorical variables into numerical representations using various encoding methods including one-hot, ordinal, target-based, frequency-based, and weight of evidence encoders.


## Capabilities


### One-Hot Encoding


Replaces categorical variables by binary variables representing each category.

```python { .api }
class OneHotEncoder:
    def __init__(self, top_categories=None, drop_last=False, drop_last_binary=False,
                 variables=None, ignore_format=False):
        """
        Initialize OneHotEncoder.

        Parameters:
        - top_categories (int): Number of most frequent categories to encode. If None, encodes all categories
        - drop_last (bool): Whether to create k-1 dummy variables (drop last category to avoid multicollinearity)
        - drop_last_binary (bool): Whether to return 1 dummy for binary variables instead of 2
        - variables (list): List of categorical variables to encode. If None, selects all object variables
        - ignore_format (bool): Whether to ignore variable format and accept numerical variables
        """

    def fit(self, X, y=None):
        """
        Learn unique categories per variable.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series, optional): Target variable (not used)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Replace categorical variables with binary dummy variables.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with categorical variables replaced by dummy variables
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""
```

**Usage Example**:

```python
from feature_engine.encoding import OneHotEncoder
import pandas as pd

# Sample categorical data
data = {'color': ['red', 'blue', 'green', 'red', 'blue'],
        'size': ['S', 'M', 'L', 'M', 'S']}
df = pd.DataFrame(data)

# Basic one-hot encoding
encoder = OneHotEncoder()
df_encoded = encoder.fit_transform(df)
# Creates one dummy per category: color_blue, color_green, color_red,
# size_L, size_M, size_S (column order may differ)

# Drop last category to avoid multicollinearity
encoder = OneHotEncoder(drop_last=True)
df_encoded = encoder.fit_transform(df)
# Creates k-1 dummies per variable (2 for color, 2 for size);
# inspect encoder.encoder_dict_ to see which categories were kept

# Encode only top N categories
encoder = OneHotEncoder(top_categories=2)
df_encoded = encoder.fit_transform(df)

# Access learned categories
print(encoder.encoder_dict_)  # Shows categories for each variable
```
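To make the effect of `drop_last` concrete, here is a minimal plain-Python sketch of one-hot encoding. It illustrates the idea only and is not feature_engine's implementation: `one_hot` is a hypothetical helper, and ordering categories by first appearance is an assumption of this sketch.

```python
def one_hot(values, drop_last=False):
    """Return (categories, rows); each row holds one 0/1 indicator per category."""
    # Assumption: categories are kept in order of first appearance.
    categories = list(dict.fromkeys(values))
    if drop_last:
        # k-1 dummies: the dropped category is implied when all indicators are 0.
        categories = categories[:-1]
    rows = [[int(v == c) for c in categories] for v in values]
    return categories, rows


colors = ["red", "blue", "green", "red", "blue"]

cats, rows = one_hot(colors)
cats_k1, rows_k1 = one_hot(colors, drop_last=True)

print(cats, rows[0])        # all k categories
print(cats_k1, rows_k1[2])  # k-1 categories; the dropped one encodes as all zeros
```

With k-1 dummies no information is lost: a row of zeros identifies the dropped category, which is why this form avoids perfectly collinear columns.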


### Ordinal Encoding


Replaces categories by ordinal numbers (0, 1, 2, 3, etc).

```python { .api }
class OrdinalEncoder:
    def __init__(self, encoding_method='ordered', variables=None, ignore_format=False, errors='ignore'):
        """
        Initialize OrdinalEncoder.

        Parameters:
        - encoding_method (str): 'ordered' (integers follow the target mean per category; requires y) or 'arbitrary' (integers assigned arbitrarily)
        - variables (list): List of categorical variables to encode. If None, selects all object variables
        - ignore_format (bool): Whether to ignore variable format and accept numerical variables
        - errors (str): How to handle unseen categories - 'ignore' or 'raise'
        """

    def fit(self, X, y=None):
        """
        Learn integer mappings for categories.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series): Target variable (required if encoding_method='ordered')

        Returns:
        - self
        """

    def transform(self, X):
        """
        Encode categories to ordinal numbers.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with categories replaced by ordinal numbers
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""

    def inverse_transform(self, X):
        """
        Convert encoded numbers back to the original categories.

        Parameters:
        - X (pandas.DataFrame): Dataset with encoded values

        Returns:
        - pandas.DataFrame: Dataset with original category labels
        """
```

**Usage Example**:

```python
from feature_engine.encoding import OrdinalEncoder

# Assumes df is the sample DataFrame from the One-Hot Encoding example
# and y is a numeric target Series aligned with df

# Arbitrary encoding
encoder = OrdinalEncoder(encoding_method='arbitrary')
df_encoded = encoder.fit_transform(df)
# Each category gets an arbitrary integer; inspect encoder.encoder_dict_ for the mapping

# Ordered encoding based on target mean
encoder = OrdinalEncoder(encoding_method='ordered')
df_encoded = encoder.fit_transform(df, y)
# Categories ordered by target mean value

# Reverse the encoding
df_original = encoder.inverse_transform(df_encoded)
```
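The `'ordered'` method can be sketched in plain Python as: compute the target mean per category, sort categories by that mean, and number them in that order. This is a sketch of the idea under stated assumptions, not feature_engine's code; `ordered_mapping` and the target values are hypothetical.

```python
from collections import defaultdict


def ordered_mapping(categories, target):
    """Map each category to its rank when categories are sorted by target mean."""
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, t in zip(categories, target):
        sums[cat] += t
        counts[cat] += 1
    means = {cat: sums[cat] / counts[cat] for cat in sums}
    ranked = sorted(means, key=means.get)  # ascending target mean
    return {cat: rank for rank, cat in enumerate(ranked)}


colors = ["red", "blue", "green", "red", "blue"]
target = [10, 2, 5, 12, 4]  # illustrative values only

mapping = ordered_mapping(colors, target)
encoded = [mapping[c] for c in colors]

# Because the mapping is one-to-one, inverting it recovers the labels.
inverse = {rank: cat for cat, rank in mapping.items()}
decoded = [inverse[r] for r in encoded]
```

Here the per-category means are blue = 3, green = 5, red = 11, so blue gets the lowest integer and red the highest.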


### Target Mean Encoding


Replaces categories by the mean value of the target for each category.

```python { .api }
class MeanEncoder:
    def __init__(self, variables=None, ignore_format=False, errors='ignore'):
        """
        Initialize MeanEncoder.

        Parameters:
        - variables (list): List of categorical variables to encode. If None, selects all object variables
        - ignore_format (bool): Whether to ignore variable format and accept numerical variables
        - errors (str): How to handle unseen categories - 'ignore' or 'raise'
        """

    def fit(self, X, y):
        """
        Learn target mean value per category per variable.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series): Target variable (required)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Encode categories to target mean values.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with categories replaced by target means
        """

    def fit_transform(self, X, y):
        """Fit to data, then transform it."""

    def inverse_transform(self, X):
        """
        Encode numbers back to original categories (approximate).

        Parameters:
        - X (pandas.DataFrame): Dataset with encoded values

        Returns:
        - pandas.DataFrame: Dataset with closest matching category labels
        """
```

**Usage Example**:

```python
from feature_engine.encoding import MeanEncoder

# Assumes df is the sample DataFrame from the One-Hot Encoding example
# and y is a numeric target Series aligned with df

# Target encoding
encoder = MeanEncoder()
df_encoded = encoder.fit_transform(df, y)
# Each category replaced by mean target value for that category

# Access learned mappings
print(encoder.encoder_dict_)  # Shows target mean per category per variable
```
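As a rough illustration of what the learned mapping contains, the per-category target mean can be computed by hand. The helper `mean_mapping` and the data below are hypothetical; the sketch returns `None` for a category it has not seen, whereas the library's handling of that case is governed by `errors`.

```python
from collections import defaultdict


def mean_mapping(categories, target):
    """Return {category: mean of the target over rows with that category}."""
    totals = defaultdict(lambda: [0.0, 0])  # category -> [sum, count]
    for cat, t in zip(categories, target):
        totals[cat][0] += t
        totals[cat][1] += 1
    return {cat: s / n for cat, (s, n) in totals.items()}


colors = ["red", "blue", "green", "red", "blue"]
target = [1, 0, 0, 1, 1]  # illustrative binary target

means = mean_mapping(colors, target)
encoded = [means[c] for c in colors]

# A category absent from the training data has no learned mean.
unseen = means.get("purple")
```

Since the means are learned from the target, fitting on training data only and then transforming held-out data avoids leaking target information into the evaluation set.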


### Count and Frequency Encoding


Replaces categories by their count or frequency in the dataset.

```python { .api }
class CountFrequencyEncoder:
    def __init__(self, encoding_method='count', variables=None, ignore_format=False):
        """
        Initialize CountFrequencyEncoder.

        Parameters:
        - encoding_method (str): 'count' (absolute count) or 'frequency' (relative frequency)
        - variables (list): List of categorical variables to encode. If None, selects all object variables
        - ignore_format (bool): Whether to ignore variable format and accept numerical variables
        """

    def fit(self, X, y=None):
        """
        Learn count or frequency for each category per variable.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series, optional): Target variable (not used)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Encode categories to counts or frequencies.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with categories replaced by counts or frequencies
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""
```

**Usage Example**:

```python
from feature_engine.encoding import CountFrequencyEncoder

# Count encoding
encoder = CountFrequencyEncoder(encoding_method='count')
df_encoded = encoder.fit_transform(df)
# Each category replaced by its count in training data

# Frequency encoding
encoder = CountFrequencyEncoder(encoding_method='frequency')
df_encoded = encoder.fit_transform(df)
# Each category replaced by its relative frequency (0-1)
```
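The two encoding methods differ only in normalisation, which a short plain-Python sketch makes visible. `count_frequency_mapping` is a hypothetical helper written for this illustration, not part of feature_engine.

```python
from collections import Counter


def count_frequency_mapping(values, encoding_method="count"):
    """Return {category: count} or {category: count / number of rows}."""
    counts = Counter(values)
    if encoding_method == "count":
        return dict(counts)
    if encoding_method == "frequency":
        n = len(values)
        return {cat: k / n for cat, k in counts.items()}
    raise ValueError("encoding_method must be 'count' or 'frequency'")


colors = ["red", "blue", "green", "red", "blue"]

count_map = count_frequency_mapping(colors, "count")
freq_map = count_frequency_mapping(colors, "frequency")

encoded_counts = [count_map[c] for c in colors]
```

Note that two categories with the same count (here `red` and `blue`) receive the same encoded value, so this encoding does not keep them distinguishable.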


### Decision Tree Encoder


Replaces categories with predictions of a decision tree trained to predict the target.

```python { .api }
class DecisionTreeEncoder:
    def __init__(self, variables=None, ignore_format=False, cv=3, scoring='accuracy',
                 param_grid=None, regression=False, random_state=None):
        """
        Initialize DecisionTreeEncoder.

        Parameters:
        - variables (list): List of categorical variables to encode. If None, selects all object variables
        - ignore_format (bool): Whether to ignore variable format and accept numerical variables
        - cv (int): Cross-validation folds for hyperparameter tuning
        - scoring (str): Scoring metric for model selection
        - param_grid (dict): Parameter grid for decision tree hyperparameter tuning
        - regression (bool): Whether target is continuous (True) or categorical (False)
        - random_state (int): Random state for reproducibility
        """

    def fit(self, X, y):
        """
        Train decision trees per variable to predict target from categories.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series): Target variable (required)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Encode categories using decision tree predictions.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with categories replaced by decision tree predictions
        """

    def fit_transform(self, X, y):
        """Fit to data, then transform it."""
```

**Usage Example**:

```python
from feature_engine.encoding import DecisionTreeEncoder

# Assumes df holds the categorical features, y is a class-label Series
# and y_continuous is a numeric Series, both aligned with df

# Decision tree encoding for classification
encoder = DecisionTreeEncoder(cv=5, scoring='accuracy')
df_encoded = encoder.fit_transform(df, y)

# For regression tasks
encoder = DecisionTreeEncoder(
    regression=True,
    scoring='neg_mean_squared_error',
    random_state=42
)
df_encoded = encoder.fit_transform(df, y_continuous)

# Access trained models
print(encoder.encoder_)  # Shows trained decision trees per variable
```


### Rare Label Encoder


Groups infrequent categories into a single category.

```python { .api }
class RareLabelEncoder:
    def __init__(self, tol=0.05, n_categories=10, max_n_categories=None,
                 variables=None, ignore_format=False):
        """
        Initialize RareLabelEncoder.

        Parameters:
        - tol (float): Minimum frequency threshold (0-1) for category to be kept separate
        - n_categories (int): Minimum number of categories a variable must have for rare labels to be grouped. Variables with fewer categories are left unchanged
        - max_n_categories (int): Maximum number of frequent categories to keep per variable. If None, keeps every category with frequency above tol
        - variables (list): List of categorical variables to encode. If None, selects all object variables
        - ignore_format (bool): Whether to ignore variable format and accept numerical variables
        """

    def fit(self, X, y=None):
        """
        Identify frequent categories per variable.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series, optional): Target variable (not used)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Replace rare categories with 'Rare' label.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with rare categories grouped as 'Rare'
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""
```

**Usage Example**:

```python
from feature_engine.encoding import RareLabelEncoder

# Group categories appearing in less than 5% of observations
encoder = RareLabelEncoder(tol=0.05)
df_encoded = encoder.fit_transform(df)

# Keep at most the 3 most frequent categories per variable
encoder = RareLabelEncoder(max_n_categories=3)
df_encoded = encoder.fit_transform(df)

# Access frequent categories
print(encoder.encoder_dict_)  # Shows kept categories per variable
```
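The interaction between `tol` and `max_n_categories` can be sketched in plain Python. The helpers `frequent_categories` and `group_rare` are hypothetical, and treating the threshold as inclusive (`>= tol`) is an assumption of this sketch rather than a statement about the library.

```python
from collections import Counter


def frequent_categories(values, tol=0.05, max_n_categories=None):
    """Return categories whose relative frequency reaches tol, most frequent first."""
    n = len(values)
    counts = Counter(values)
    # Assumption: the threshold is inclusive.
    frequent = [cat for cat, k in counts.most_common() if k / n >= tol]
    if max_n_categories is not None:
        frequent = frequent[:max_n_categories]
    return frequent


def group_rare(values, frequent, replace_with="Rare"):
    """Replace every value that is not a frequent category with one shared label."""
    return [v if v in frequent else replace_with for v in values]


values = ["a"] * 6 + ["b"] * 3 + ["c"]  # frequencies 0.6, 0.3, 0.1

frequent = frequent_categories(values, tol=0.2)
grouped = group_rare(values, frequent)

# Capping the number of kept categories pushes "b" into the rare group too.
top1 = frequent_categories(values, tol=0.2, max_n_categories=1)
grouped_top1 = group_rare(values, top1)
```

Grouping rare labels before applying another encoder keeps the downstream mapping small and gives categories unseen at fit time a natural place to land.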


### Weight of Evidence Encoder


Replaces categories with Weight of Evidence (WoE) values for binary classification.

```python { .api }
class WoEEncoder:
    def __init__(self, variables=None, ignore_format=False, errors='ignore'):
        """
        Initialize WoEEncoder.

        Parameters:
        - variables (list): List of categorical variables to encode. If None, selects all object variables
        - ignore_format (bool): Whether to ignore variable format and accept numerical variables
        - errors (str): How to handle unseen categories - 'ignore' or 'raise'
        """

    def fit(self, X, y):
        """
        Calculate Weight of Evidence for each category.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series): Binary target variable (required)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Encode categories to Weight of Evidence values.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with categories replaced by WoE values
        """

    def fit_transform(self, X, y):
        """Fit to data, then transform it."""
```

**Usage Example**:

```python
from feature_engine.encoding import WoEEncoder

# Assumes df holds the categorical features and y_binary is a 0/1 target
# Series aligned with df

# Weight of Evidence encoding for binary classification
encoder = WoEEncoder()
df_encoded = encoder.fit_transform(df, y_binary)

# Access learned WoE values
print(encoder.encoder_dict_)  # Shows WoE values per category per variable
```
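For reference, Weight of Evidence for a category is commonly defined as the natural log of the ratio between the share of all positives and the share of all negatives that fall into that category. The sketch below computes it by hand under that definition; `woe_mapping` and the data are hypothetical and this is not feature_engine's implementation.

```python
import math
from collections import Counter


def woe_mapping(categories, target):
    """Return {category: ln(share of positives / share of negatives)} for a 0/1 target."""
    pos_total = sum(target)
    neg_total = len(target) - pos_total
    pos = Counter(c for c, t in zip(categories, target) if t == 1)
    neg = Counter(c for c, t in zip(categories, target) if t == 0)
    mapping = {}
    for cat in dict.fromkeys(categories):
        if pos[cat] == 0 or neg[cat] == 0:
            # The log ratio is undefined when a category contains a single class.
            raise ValueError(f"category {cat!r} does not contain both classes")
        mapping[cat] = math.log((pos[cat] / pos_total) / (neg[cat] / neg_total))
    return mapping


categories = ["A", "A", "A", "A", "B", "B", "B", "B"]
target = [1, 1, 1, 0, 1, 0, 0, 0]  # illustrative binary target

woe = woe_mapping(categories, target)
```

Category `A` holds 3 of the 4 positives and 1 of the 4 negatives, so its WoE is ln(3); `B` is the mirror image with ln(1/3). Positive values mark categories where positives are over-represented.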


### Probability Ratio Encoder


Replaces categories with probability ratios for binary classification.

```python { .api }
class PRatioEncoder:
    def __init__(self, variables=None, ignore_format=False, errors='ignore'):
        """
        Initialize PRatioEncoder.

        Parameters:
        - variables (list): List of categorical variables to encode. If None, selects all object variables
        - ignore_format (bool): Whether to ignore variable format and accept numerical variables
        - errors (str): How to handle unseen categories - 'ignore' or 'raise'
        """

    def fit(self, X, y):
        """
        Calculate probability ratios for each category.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series): Binary target variable (required)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Encode categories to probability ratio values.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with categories replaced by probability ratios
        """

    def fit_transform(self, X, y):
        """Fit to data, then transform it."""
```

**Usage Example**:

```python
from feature_engine.encoding import PRatioEncoder

# Assumes df holds the categorical features and y_binary is a 0/1 target
# Series aligned with df

# Probability ratio encoding for binary classification
encoder = PRatioEncoder()
df_encoded = encoder.fit_transform(df, y_binary)

# Access learned probability ratios
print(encoder.encoder_dict_)  # Shows probability ratios per category per variable
```
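A probability ratio is usually taken to be p(1) / p(0) within each category, where p(1) is the mean of the binary target for that category and p(0) = 1 - p(1). The following plain-Python sketch assumes that definition; `pratio_mapping` and the data are hypothetical and not taken from feature_engine.

```python
from collections import defaultdict


def pratio_mapping(categories, target):
    """Return {category: p(target=1) / p(target=0)} computed within each category."""
    ones, counts = defaultdict(int), defaultdict(int)
    for cat, t in zip(categories, target):
        ones[cat] += t
        counts[cat] += 1
    mapping = {}
    for cat, n in counts.items():
        p1 = ones[cat] / n
        p0 = 1 - p1
        if p0 == 0:
            # The ratio is undefined when a category contains only positives.
            raise ValueError(f"p(0) is zero for category {cat!r}")
        mapping[cat] = p1 / p0
    return mapping


categories = ["A", "A", "A", "A", "B", "B", "B", "B"]
target = [1, 1, 1, 0, 1, 0, 0, 0]  # illustrative binary target

ratios = pratio_mapping(categories, target)
```

With p(1) = 0.75 for `A` and 0.25 for `B`, the ratios are 3 and 1/3. Unlike Weight of Evidence, the ratio is computed within each category and is not log-transformed.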


## Common Attributes


All encoding transformers share these fitted attributes:

- `variables_` (list): Variables that will be transformed
- `n_features_in_` (int): Number of features in training set
- `encoder_dict_` (dict): Dictionary with category mappings per variable

Additional attributes for specific encoders:

- `variables_binary_` (list): Binary variables identified in data (OneHotEncoder)
- `encoder_` (dict): Trained models per variable (DecisionTreeEncoder)