# Categorical Variable Encoding

Transformers for converting categorical variables into numerical representations using various encoding methods including one-hot, ordinal, target-based, frequency-based, and weight of evidence encoders.

## Capabilities

### One-Hot Encoding

Replaces categorical variables by binary variables representing each category.

```python { .api }
class OneHotEncoder:
    def __init__(self, top_categories=None, drop_last=False, drop_last_binary=False,
                 variables=None, ignore_format=False):
        """
        Initialize OneHotEncoder.

        Parameters:
        - top_categories (int): Number of most frequent categories to encode. If None, encodes all categories
        - drop_last (bool): Whether to create k-1 dummy variables (drop last category to avoid multicollinearity)
        - drop_last_binary (bool): Whether to return 1 dummy for binary variables instead of 2
        - variables (list): List of categorical variables to encode. If None, selects all object variables
        - ignore_format (bool): Whether to ignore variable format and accept numerical variables
        """

    def fit(self, X, y=None):
        """
        Learn unique categories per variable.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series, optional): Target variable (not used)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Replace categorical variables with binary dummy variables.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with categorical variables replaced by dummy variables
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""
```

**Usage Example**:
```python
from feature_engine.encoding import OneHotEncoder
import pandas as pd

# Sample categorical data
data = {'color': ['red', 'blue', 'green', 'red', 'blue'],
        'size': ['S', 'M', 'L', 'M', 'S']}
df = pd.DataFrame(data)

# Basic one-hot encoding
encoder = OneHotEncoder()
df_encoded = encoder.fit_transform(df)
# Creates one column per category, named <variable>_<category>:
# color_blue, color_green, color_red, size_L, size_M, size_S (column order may differ)

# Drop last category to avoid multicollinearity
encoder = OneHotEncoder(drop_last=True)
df_encoded = encoder.fit_transform(df)
# Creates k-1 columns per variable: two color_* columns and two size_* columns

# Encode only top N categories
encoder = OneHotEncoder(top_categories=2)
df_encoded = encoder.fit_transform(df)

# Access learned categories
print(encoder.encoder_dict_)  # Shows categories for each variable
```
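To make the difference between k and k-1 dummy variables concrete, here is a minimal plain-Python sketch of the idea. It is not feature_engine's implementation: the helper `one_hot` is hypothetical, and ordering categories by first appearance is an assumption of the sketch.

```python
def one_hot(values, drop_last=False):
    """Return (categories, rows) where each row holds 0/1 indicators per category."""
    # Assumption of this sketch: categories are ordered by first appearance.
    categories = list(dict.fromkeys(values))
    if drop_last:
        # k-1 dummies: the dropped category is represented by an all-zero row.
        categories = categories[:-1]
    rows = [[int(v == c) for c in categories] for v in values]
    return categories, rows


colors = ['red', 'blue', 'green', 'red', 'blue']

cols, rows = one_hot(colors)                         # one indicator per category
cols_k1, rows_k1 = one_hot(colors, drop_last=True)   # last category dropped
```

With `drop_last=True`, the row for `'green'` contains only zeros, so no information is lost while one redundant column is removed.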

### Ordinal Encoding

Replaces categories by ordinal numbers (0, 1, 2, 3, etc).

```python { .api }
class OrdinalEncoder:
    def __init__(self, encoding_method='ordered', variables=None, ignore_format=False, errors='ignore'):
        """
        Initialize OrdinalEncoder.

        Parameters:
        - encoding_method (str): 'ordered' (categories numbered by target mean, requires target y) or 'arbitrary' (categories numbered arbitrarily, no target needed)
        - variables (list): List of categorical variables to encode. If None, selects all object variables
        - ignore_format (bool): Whether to ignore variable format and accept numerical variables
        - errors (str): How to handle unseen categories - 'ignore' or 'raise'
        """

    def fit(self, X, y=None):
        """
        Learn integer mappings for categories.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series): Target variable (required if encoding_method='ordered')

        Returns:
        - self
        """

    def transform(self, X):
        """
        Encode categories to ordinal numbers.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with categories replaced by ordinal numbers
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""

    def inverse_transform(self, X):
        """
        Encode numbers back to original categories.

        Parameters:
        - X (pandas.DataFrame): Dataset with encoded values

        Returns:
        - pandas.DataFrame: Dataset with original category labels
        """
```

**Usage Example**:
```python
from feature_engine.encoding import OrdinalEncoder

# Assumes df is the DataFrame from the One-Hot example and y is a numeric
# target Series aligned with df (y is not defined in this document)

# Arbitrary encoding (no target needed)
encoder = OrdinalEncoder(encoding_method='arbitrary')
df_encoded = encoder.fit_transform(df)
# Each category is assigned an arbitrary integer: 0, 1, 2, ...

# Ordered encoding based on target mean
encoder = OrdinalEncoder(encoding_method='ordered')
df_encoded = encoder.fit_transform(df, y)
# Categories ordered by target mean value

# Reverse the encoding
df_original = encoder.inverse_transform(df_encoded)
```
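The 'ordered' method can be illustrated without the library. The sketch below is plain Python, not feature_engine's code: the helper `ordered_mapping` and the toy target `y` are hypothetical, and ascending order of the target mean is an assumption of the sketch.

```python
from collections import defaultdict


def ordered_mapping(categories, target):
    """Map each category to an integer rank, ordered by ascending target mean."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(categories, target):
        sums[c] += t
        counts[c] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    ranked = sorted(means, key=means.get)  # lowest target mean first
    return {c: i for i, c in enumerate(ranked)}


colors = ['red', 'blue', 'green', 'red', 'blue']
y = [1, 0, 0, 1, 1]  # toy target: red mean 1.0, blue mean 0.5, green mean 0.0

mapping = ordered_mapping(colors, y)
encoded = [mapping[c] for c in colors]

# Reversing the encoding only needs the inverted dictionary.
inverse = {i: c for c, i in mapping.items()}
decoded = [inverse[i] for i in encoded]
```

Because the integers follow the target mean, the encoded variable has a monotonic relationship with the target, which is the motivation for choosing 'ordered' over 'arbitrary'.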

### Target Mean Encoding

Replaces categories by the mean value of the target for each category.

```python { .api }
class MeanEncoder:
    def __init__(self, variables=None, ignore_format=False, errors='ignore'):
        """
        Initialize MeanEncoder.

        Parameters:
        - variables (list): List of categorical variables to encode. If None, selects all object variables
        - ignore_format (bool): Whether to ignore variable format and accept numerical variables
        - errors (str): How to handle unseen categories - 'ignore' or 'raise'
        """

    def fit(self, X, y):
        """
        Learn target mean value per category per variable.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series): Target variable (required)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Encode categories to target mean values.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with categories replaced by target means
        """

    def fit_transform(self, X, y):
        """Fit to data, then transform it."""

    def inverse_transform(self, X):
        """
        Encode numbers back to original categories (approximate).

        Parameters:
        - X (pandas.DataFrame): Dataset with encoded values

        Returns:
        - pandas.DataFrame: Dataset with closest matching category labels
        """
```

**Usage Example**:
```python
from feature_engine.encoding import MeanEncoder

# Assumes df is the DataFrame from the One-Hot example and y is a numeric
# target Series aligned with df (y is not defined in this document)

# Target encoding
encoder = MeanEncoder()
df_encoded = encoder.fit_transform(df, y)
# Each category replaced by mean target value for that category

# Access learned mappings
print(encoder.encoder_dict_)  # Shows target mean per category per variable
```
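The mapping that target mean encoding learns is just a per-category average. A minimal plain-Python sketch follows; the helper `target_means` and the toy target values are hypothetical and are not part of feature_engine.

```python
from collections import defaultdict


def target_means(categories, target):
    """Return {category: mean of the target over rows with that category}."""
    totals = defaultdict(lambda: [0.0, 0])  # category -> [sum, count]
    for c, t in zip(categories, target):
        totals[c][0] += t
        totals[c][1] += 1
    return {c: s / n for c, (s, n) in totals.items()}


sizes = ['S', 'M', 'L', 'M', 'S']
y = [10.0, 20.0, 60.0, 40.0, 20.0]  # toy continuous target

means = target_means(sizes, y)
encoded = [means[s] for s in sizes]
```

Since the mapping is computed from the target, it should be learned on training data only and then applied to validation or test data, which is what the separate `fit` and `transform` methods allow.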

### Count and Frequency Encoding

Replaces categories by their count or frequency in the dataset.

```python { .api }
class CountFrequencyEncoder:
    def __init__(self, encoding_method='count', variables=None, ignore_format=False):
        """
        Initialize CountFrequencyEncoder.

        Parameters:
        - encoding_method (str): 'count' (absolute count) or 'frequency' (relative frequency)
        - variables (list): List of categorical variables to encode. If None, selects all object variables
        - ignore_format (bool): Whether to ignore variable format and accept numerical variables
        """

    def fit(self, X, y=None):
        """
        Learn count or frequency for each category per variable.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series, optional): Target variable (not used)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Encode categories to counts or frequencies.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with categories replaced by counts or frequencies
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""
```

**Usage Example**:
```python
from feature_engine.encoding import CountFrequencyEncoder

# Assumes df is the DataFrame from the One-Hot example

# Count encoding
encoder = CountFrequencyEncoder(encoding_method='count')
df_encoded = encoder.fit_transform(df)
# Each category replaced by its count in training data

# Frequency encoding
encoder = CountFrequencyEncoder(encoding_method='frequency')
df_encoded = encoder.fit_transform(df)
# Each category replaced by its relative frequency (0-1)
```
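Both encoding methods reduce to counting labels. The following plain-Python sketch shows the two mappings side by side; it uses only `collections.Counter` and is an illustration, not the library's implementation.

```python
from collections import Counter

colors = ['red', 'blue', 'green', 'red', 'blue']

counts = Counter(colors)
n = len(colors)

# 'count': absolute number of rows per category
count_map = dict(counts)
# 'frequency': share of rows per category, between 0 and 1
freq_map = {c: k / n for c, k in counts.items()}

count_encoded = [count_map[c] for c in colors]
freq_encoded = [freq_map[c] for c in colors]
```

Note that two categories with the same count (here `'red'` and `'blue'`) receive the same encoded value and become indistinguishable after encoding.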

### Decision Tree Encoder

Replaces categories with predictions of a decision tree trained to predict the target.

```python { .api }
class DecisionTreeEncoder:
    def __init__(self, variables=None, ignore_format=False, cv=3, scoring='accuracy',
                 param_grid=None, regression=False, random_state=None):
        """
        Initialize DecisionTreeEncoder.

        Parameters:
        - variables (list): List of categorical variables to encode. If None, selects all object variables
        - ignore_format (bool): Whether to ignore variable format and accept numerical variables
        - cv (int): Cross-validation folds for hyperparameter tuning
        - scoring (str): Scoring metric for model selection
        - param_grid (dict): Parameter grid for decision tree hyperparameter tuning
        - regression (bool): Whether target is continuous (True) or categorical (False)
        - random_state (int): Random state for reproducibility
        """

    def fit(self, X, y):
        """
        Train decision trees per variable to predict target from categories.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series): Target variable (required)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Encode categories using decision tree predictions.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with categories replaced by decision tree predictions
        """

    def fit_transform(self, X, y):
        """Fit to data, then transform it."""
```

**Usage Example**:
```python
from feature_engine.encoding import DecisionTreeEncoder

# Assumes df (categorical variables), y (class labels) and y_continuous
# (numeric target) are already defined; none are created in this snippet.
# The dataset needs enough rows per class for the chosen number of cv folds.

# Decision tree encoding for classification
encoder = DecisionTreeEncoder(cv=5, scoring='accuracy')
df_encoded = encoder.fit_transform(df, y)

# For regression tasks
encoder = DecisionTreeEncoder(
    regression=True,
    scoring='neg_mean_squared_error',
    random_state=42
)
df_encoded = encoder.fit_transform(df, y_continuous)

# Access trained models
print(encoder.encoder_)  # Shows trained decision trees per variable
```

### Rare Label Encoder

Groups infrequent categories into a single category.

```python { .api }
class RareLabelEncoder:
    def __init__(self, tol=0.05, n_categories=10, max_n_categories=None,
                 variables=None, ignore_format=False):
        """
        Initialize RareLabelEncoder.

        Parameters:
        - tol (float): Minimum frequency threshold (0-1) for category to be kept separate
        - n_categories (int): Minimum number of categories a variable must have for rare labels to be grouped. Variables with fewer categories are left unchanged
        - max_n_categories (int): Maximum number of frequent categories to keep per variable. If None, keeps every category above tol
        - variables (list): List of categorical variables to encode. If None, selects all object variables
        - ignore_format (bool): Whether to ignore variable format and accept numerical variables
        """

    def fit(self, X, y=None):
        """
        Identify frequent categories per variable.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series, optional): Target variable (not used)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Replace rare categories with 'Rare' label.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with rare categories grouped as 'Rare'
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""
```

**Usage Example**:
```python
from feature_engine.encoding import RareLabelEncoder

# Assumes df is a DataFrame of categorical variables, such as the one
# from the One-Hot example

# Group categories appearing in less than 5% of observations
encoder = RareLabelEncoder(tol=0.05)
df_encoded = encoder.fit_transform(df)

# Keep at most the 3 most frequent categories per variable
encoder = RareLabelEncoder(max_n_categories=3)
df_encoded = encoder.fit_transform(df)

# Access frequent categories
print(encoder.encoder_dict_)  # Shows kept categories per variable
```
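The effect of the `tol` threshold can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the library's code: the helper `group_rare` is hypothetical, and treating a category as frequent when its share is greater than or equal to `tol` is an assumption of the sketch.

```python
from collections import Counter


def group_rare(values, tol=0.05, replace_with='Rare'):
    """Replace categories whose relative frequency is below tol with one label."""
    n = len(values)
    freqs = {c: k / n for c, k in Counter(values).items()}
    # Assumption of this sketch: a share equal to tol still counts as frequent.
    frequent = {c for c, f in freqs.items() if f >= tol}
    return [v if v in frequent else replace_with for v in values]


# 20 observations: 'a' has share 0.50, 'b' has 0.45, 'c' has 0.05
cities = ['a'] * 10 + ['b'] * 9 + ['c']

grouped = group_rare(cities, tol=0.1)
```

Grouping rare labels before applying another encoder reduces the number of distinct values and avoids learning unstable statistics from categories with very few observations.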

### Weight of Evidence Encoder

Replaces categories with Weight of Evidence (WoE) values for binary classification.

```python { .api }
class WoEEncoder:
    def __init__(self, variables=None, ignore_format=False, errors='ignore'):
        """
        Initialize WoEEncoder.

        Parameters:
        - variables (list): List of categorical variables to encode. If None, selects all object variables
        - ignore_format (bool): Whether to ignore variable format and accept numerical variables
        - errors (str): How to handle unseen categories - 'ignore' or 'raise'
        """

    def fit(self, X, y):
        """
        Calculate Weight of Evidence for each category.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series): Binary target variable (required)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Encode categories to Weight of Evidence values.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with categories replaced by WoE values
        """

    def fit_transform(self, X, y):
        """Fit to data, then transform it."""
```

**Usage Example**:
```python
from feature_engine.encoding import WoEEncoder

# Assumes df is a DataFrame of categorical variables and y_binary is a
# 0/1 target Series aligned with df (neither is created in this snippet)

# Weight of Evidence encoding for binary classification
encoder = WoEEncoder()
df_encoded = encoder.fit_transform(df, y_binary)

# Access learned WoE values
print(encoder.encoder_dict_)  # Shows WoE values per category per variable
```
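For reference, a common definition of Weight of Evidence for a category is the natural log of the share of positives falling in that category divided by the share of negatives falling in it. The plain-Python sketch below follows that definition; the helper `woe_mapping` and the toy data are hypothetical, and whether the library uses exactly this formula is an assumption.

```python
import math
from collections import Counter


def woe_mapping(categories, target):
    """Return {category: ln(share of positives / share of negatives)}."""
    pos = Counter(c for c, t in zip(categories, target) if t == 1)
    neg = Counter(c for c, t in zip(categories, target) if t == 0)
    pos_total, neg_total = sum(pos.values()), sum(neg.values())
    mapping = {}
    for c in set(categories):
        if pos[c] == 0 or neg[c] == 0:
            # The log is undefined when a category contains only one class.
            raise ValueError(f"WoE undefined for category {c!r}")
        mapping[c] = math.log((pos[c] / pos_total) / (neg[c] / neg_total))
    return mapping


colors = ['red', 'red', 'red', 'red', 'blue', 'blue', 'blue', 'blue']
y_binary = [1, 1, 1, 0, 1, 0, 0, 0]

woe = woe_mapping(colors, y_binary)  # red: ln(3), blue: ln(1/3)
```

Positive values mark categories where positives are over-represented, negative values mark the opposite, and a category that contains only one class has no defined value under this formula.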

### Probability Ratio Encoder

Replaces categories with probability ratios for binary classification.

```python { .api }
class PRatioEncoder:
    def __init__(self, variables=None, ignore_format=False, errors='ignore'):
        """
        Initialize PRatioEncoder.

        Parameters:
        - variables (list): List of categorical variables to encode. If None, selects all object variables
        - ignore_format (bool): Whether to ignore variable format and accept numerical variables
        - errors (str): How to handle unseen categories - 'ignore' or 'raise'
        """

    def fit(self, X, y):
        """
        Calculate probability ratios for each category.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series): Binary target variable (required)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Encode categories to probability ratio values.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with categories replaced by probability ratios
        """

    def fit_transform(self, X, y):
        """Fit to data, then transform it."""
```

**Usage Example**:
```python
from feature_engine.encoding import PRatioEncoder

# Assumes df is a DataFrame of categorical variables and y_binary is a
# 0/1 target Series aligned with df (neither is created in this snippet)

# Probability ratio encoding for binary classification
encoder = PRatioEncoder()
df_encoded = encoder.fit_transform(df, y_binary)

# Access learned probability ratios
print(encoder.encoder_dict_)  # Shows probability ratios per category per variable
```
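One common reading of "probability ratio" is p(1) / p(0) within each category, where p(1) is the share of rows with target 1. The plain-Python sketch below assumes that definition; the helper `probability_ratio_mapping` and the toy data are hypothetical and may not match the library's exact computation.

```python
from collections import defaultdict


def probability_ratio_mapping(categories, target):
    """Return {category: p(target=1) / p(target=0)} for a 0/1 target."""
    pos, total = defaultdict(int), defaultdict(int)
    for c, t in zip(categories, target):
        pos[c] += t
        total[c] += 1
    mapping = {}
    for c in total:
        p1 = pos[c] / total[c]
        p0 = 1 - p1
        if p0 == 0:
            # The ratio is undefined when every row in the category is positive.
            raise ValueError(f"ratio undefined for category {c!r}")
        mapping[c] = p1 / p0
    return mapping


colors = ['red', 'red', 'red', 'red', 'blue', 'blue', 'blue', 'blue']
y_binary = [1, 1, 1, 0, 1, 0, 0, 0]

ratios = probability_ratio_mapping(colors, y_binary)  # red: 3.0, blue: about 0.33
```

Unlike Weight of Evidence, this ratio is computed within each category only and does not depend on the overall class balance of the dataset.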

## Common Attributes

All encoding transformers share these fitted attributes:

- `variables_` (list): Variables that will be transformed
- `n_features_in_` (int): Number of features in training set
- `encoder_dict_` (dict): Dictionary with category mappings per variable

Additional attributes for specific encoders:
- `variables_binary_` (list): Binary variables identified in data (OneHotEncoder)
- `encoder_` (dict): Trained models per variable (DecisionTreeEncoder)