# Missing Data Imputation

Transformers for handling missing values in numerical and categorical variables using statistical methods, arbitrary values, random sampling, and missing data indicators.

## Capabilities

### Mean and Median Imputation

Replaces missing data by the mean or median value of numerical variables.

```python { .api }
class MeanMedianImputer:
    def __init__(self, imputation_method='median', variables=None):
        """
        Initialize MeanMedianImputer.

        Parameters:
        - imputation_method (str): 'mean' or 'median'
        - variables (list): List of numerical variables to impute. If None, selects all numerical variables
        """

    def fit(self, X, y=None):
        """
        Learn mean/median values for each variable.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series, optional): Target variable (not used)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Impute missing data using learned parameters.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Transformed dataset with imputed values
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""
```

**Usage Example**:
```python
from feature_engine.imputation import MeanMedianImputer
import pandas as pd

# Sample data with missing values
data = {'var1': [1.0, 2.0, None, 4.0], 'var2': [10, None, 30, 40]}
df = pd.DataFrame(data)

# Mean imputation
imputer = MeanMedianImputer(imputation_method='mean')
df_imputed = imputer.fit_transform(df)

# Median imputation (default)
imputer = MeanMedianImputer()
df_imputed = imputer.fit_transform(df)

# Access learned parameters (medians, from the most recent fit)
print(imputer.imputer_dict_)  # {'var1': 2.0, 'var2': 30.0}
```
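
The split between `fit` and `transform` matters when data is divided into training and test sets: the statistics are learned once and then reused. The following is a minimal pandas-only sketch of that idea, not the library's implementation; the `train` and `test` frames are made up for illustration.

```python
import pandas as pd

# Hypothetical training and test sets
train = pd.DataFrame({'var1': [1.0, 2.0, None, 4.0], 'var2': [10, None, 30, 40]})
test = pd.DataFrame({'var1': [None, 5.0], 'var2': [20.0, None]})

# "fit": learn one median per column from the training data only
learned = train.median().to_dict()

# "transform": reuse the learned values on any dataset
train_imputed = train.fillna(learned)
test_imputed = test.fillna(learned)

print(learned)
print(test_imputed)
```

The test set is filled with the training medians, so no statistic is computed from the data being transformed.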
68
69
### Arbitrary Number Imputation
70
71
Replaces missing data by an arbitrary value determined by the user for numerical variables.
72
73
```python { .api }
74
class ArbitraryNumberImputer:
75
def __init__(self, arbitrary_number=999, variables=None, imputer_dict=None):
76
"""
77
Initialize ArbitraryNumberImputer.
78
79
Parameters:
80
- arbitrary_number (int/float): Number to replace missing data (ignored if imputer_dict provided)
81
- variables (list): List of variables to impute. If None, selects all numerical variables
82
- imputer_dict (dict): Dictionary mapping variables to imputation values
83
"""
84
85
def fit(self, X, y=None):
86
"""
87
Validate input data (no parameters learned).
88
89
Parameters:
90
- X (pandas.DataFrame): Training dataset
91
- y (pandas.Series, optional): Target variable (not used)
92
93
Returns:
94
- self
95
"""
96
97
def transform(self, X):
98
"""
99
Impute missing data with arbitrary values.
100
101
Parameters:
102
- X (pandas.DataFrame): Dataset to transform
103
104
Returns:
105
- pandas.DataFrame: Transformed dataset with imputed values
106
"""
107
108
def fit_transform(self, X, y=None):
109
"""Fit to data, then transform it."""
110
```
111
112
**Usage Example**:
113
```python
114
from feature_engine.imputation import ArbitraryNumberImputer
115
116
# Single value for all variables
117
imputer = ArbitraryNumberImputer(arbitrary_number=-999)
118
df_imputed = imputer.fit_transform(df)
119
120
# Different values per variable
121
imputer = ArbitraryNumberImputer(
122
imputer_dict={'var1': 0, 'var2': -1, 'var3': 99}
123
)
124
df_imputed = imputer.fit_transform(df)
125
```
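
For intuition, a per-variable mapping such as `imputer_dict` behaves much like passing a dictionary to pandas `fillna`. This is a sketch of the equivalent operation in plain pandas, not the transformer's own code.

```python
import pandas as pd

df = pd.DataFrame({'var1': [1.0, 2.0, None, 4.0], 'var2': [10, None, 30, 40]})

# One replacement value per column, keyed by column name
fill_values = {'var1': 0, 'var2': -1}
df_filled = df.fillna(value=fill_values)

print(df_filled)
```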
126
127
### Categorical Variable Imputation
128
129
Replaces missing data in categorical variables by an arbitrary value or the most frequent category.
130
131
```python { .api }
132
class CategoricalImputer:
133
def __init__(self, imputation_method='missing', fill_value='Missing',
134
variables=None, return_object=False, ignore_format=False):
135
"""
136
Initialize CategoricalImputer.
137
138
Parameters:
139
- imputation_method (str): 'missing' (use fill_value) or 'frequent' (use mode)
140
- fill_value (str/int/float): Value to replace missing data when method='missing'
141
- variables (list): List of categorical variables to impute. If None, selects all object variables
142
- return_object (bool): Whether to return variables as object dtype
143
- ignore_format (bool): Whether to ignore variable format and accept numerical variables
144
"""
145
146
def fit(self, X, y=None):
147
"""
148
Learn most frequent category or assign arbitrary value per variable.
149
150
Parameters:
151
- X (pandas.DataFrame): Training dataset
152
- y (pandas.Series, optional): Target variable (not used)
153
154
Returns:
155
- self
156
"""
157
158
def transform(self, X):
159
"""
160
Impute missing data in categorical variables.
161
162
Parameters:
163
- X (pandas.DataFrame): Dataset to transform
164
165
Returns:
166
- pandas.DataFrame: Transformed dataset with imputed values
167
"""
168
169
def fit_transform(self, X, y=None):
170
"""Fit to data, then transform it."""
171
```
172
173
**Usage Example**:
174
```python
175
from feature_engine.imputation import CategoricalImputer
176
177
# Impute with most frequent category
178
imputer = CategoricalImputer(imputation_method='frequent')
179
df_imputed = imputer.fit_transform(df)
180
181
# Impute with custom value
182
imputer = CategoricalImputer(
183
imputation_method='missing',
184
fill_value='Unknown'
185
)
186
df_imputed = imputer.fit_transform(df)
187
```
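
The two strategies can be pictured with plain pandas. The sketch below is an illustration only, with made-up column names; it takes the first value returned by `mode()` when categories tie, which is a choice of this sketch and not necessarily what the transformer does.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'colour': ['red', 'blue', None, 'red'],
    'size': ['S', None, 'M', 'M'],
})

# 'frequent': learn the most frequent category of each column
modes = {col: df_demo[col].mode()[0] for col in df_demo.columns}
frequent = df_demo.fillna(modes)

# 'missing': replace every missing value with one fixed label
labelled = df_demo.fillna('Missing')

print(modes)
print(labelled)
```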
188
189
### End Tail Imputation
190
191
Replaces missing data by values at either tail of the distribution for numerical variables.
192
193
```python { .api }
194
class EndTailImputer:
195
def __init__(self, imputation_method='gaussian', tail='right', fold=3, variables=None):
196
"""
197
Initialize EndTailImputer.
198
199
Parameters:
200
- imputation_method (str): 'gaussian' (mean ± fold*std), 'iqr' (Q1/Q3 ± fold*IQR), 'max' (fold*max/min)
201
- tail (str): 'right' (upper tail) or 'left' (lower tail)
202
- fold (int/float): Factor to multiply std, IQR or max values
203
- variables (list): List of numerical variables to impute
204
"""
205
206
def fit(self, X, y=None):
207
"""
208
Learn values at end of distribution for each variable.
209
210
Parameters:
211
- X (pandas.DataFrame): Training dataset
212
- y (pandas.Series, optional): Target variable (not used)
213
214
Returns:
215
- self
216
"""
217
218
def transform(self, X):
219
"""
220
Impute missing data with end tail values.
221
222
Parameters:
223
- X (pandas.DataFrame): Dataset to transform
224
225
Returns:
226
- pandas.DataFrame: Transformed dataset with imputed values
227
"""
228
229
def fit_transform(self, X, y=None):
230
"""Fit to data, then transform it."""
231
```
232
233
**Usage Example**:
234
```python
235
from feature_engine.imputation import EndTailImputer
236
237
# Right tail using IQR method
238
imputer = EndTailImputer(
239
imputation_method='iqr',
240
tail='right',
241
fold=3
242
)
243
df_imputed = imputer.fit_transform(df)
244
245
# Left tail using gaussian method
246
imputer = EndTailImputer(
247
imputation_method='gaussian',
248
tail='left',
249
fold=2
250
)
251
df_imputed = imputer.fit_transform(df)
252
```
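
To make the tail formulas concrete, here is a worked example on a small series using pandas. It is a sketch under assumptions: pandas defaults are used for the standard deviation (sample, `ddof=1`) and for quantile interpolation, which may differ from what the transformer computes internally, and the `'max'` variant is shown for the maximum only.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, None])
fold = 3

mean, std = s.mean(), s.std()                 # NaN is skipped by default
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1

right = {
    'gaussian': mean + fold * std,
    'iqr': q3 + fold * iqr,
    'max': s.max() * fold,
}
left = {
    'gaussian': mean - fold * std,
    'iqr': q1 - fold * iqr,
}

print(right)
print(left)
```

With these five observed values, the IQR rule gives 10.0 on the right and -4.0 on the left, both well outside the observed range of 1 to 5, which is the point of end-tail imputation.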
253
254
### Missing Data Indicators
255
256
Adds binary variables that indicate if data was missing for each variable.
257
258
```python { .api }
259
class AddMissingIndicator:
260
def __init__(self, missing_only=True, variables=None):
261
"""
262
Initialize AddMissingIndicator.
263
264
Parameters:
265
- missing_only (bool): Whether to add indicators only for variables with missing data in train set
266
- variables (list): List of variables to create indicators for. If None, evaluates all variables
267
"""
268
269
def fit(self, X, y=None):
270
"""
271
Find variables for which missing indicators will be created.
272
273
Parameters:
274
- X (pandas.DataFrame): Training dataset
275
- y (pandas.Series, optional): Target variable (not used)
276
277
Returns:
278
- self
279
"""
280
281
def transform(self, X):
282
"""
283
Add binary missing indicators to dataset.
284
285
Parameters:
286
- X (pandas.DataFrame): Dataset to transform
287
288
Returns:
289
- pandas.DataFrame: Dataset with additional binary indicator columns
290
"""
291
292
def fit_transform(self, X, y=None):
293
"""Fit to data, then transform it."""
294
```

**Usage Example**:
```python
from feature_engine.imputation import AddMissingIndicator

# Add indicators only for variables with missing data
indicator = AddMissingIndicator(missing_only=True)
df_with_indicators = indicator.fit_transform(df)

# Creates new columns like 'var1_na', 'var2_na' where missing data existed
print(df_with_indicators.columns)  # Original columns + indicator columns
```
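
The effect of `missing_only=True` can be illustrated with plain pandas. This sketch is not the transformer's code; `var3` is a hypothetical complete column added to show that it receives no indicator.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'var1': [1.0, 2.0, None, 4.0],
    'var2': [10, None, 30, 40],
    'var3': [1, 2, 3, 4],          # no missing values
})

# Only columns that contain missing data get an indicator
cols_with_na = [c for c in df_demo.columns if df_demo[c].isna().any()]

df_ind = df_demo.copy()
for c in cols_with_na:
    df_ind[f'{c}_na'] = df_demo[c].isna().astype(int)

print(df_ind.columns.tolist())
```

The original columns are left untouched, so indicators are usually combined with one of the imputers above.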

### Random Sample Imputation

Replaces missing data with a random sample drawn from the observed values of each variable in the training set.

```python { .api }
class RandomSampleImputer:
    def __init__(self, variables=None, random_state=None, seed='general', seeding_method='add'):
        """
        Initialize RandomSampleImputer.

        Parameters:
        - variables (list): List of variables to be imputed. If None, selects all variables in the training set
        - random_state (int/str/list): Integer seed when seed='general'; name(s) of the variable(s) whose values seed each observation when seed='observation'
        - seed (str): 'general' (single seed) or 'observation' (seed per observation)
        - seeding_method (str): 'add' or 'multiply', how the values of the random_state variables are combined into a per-observation seed
        """

    def fit(self, X, y=None):
        """
        Store copy of training dataset for sampling.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series, optional): Target variable (not used)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Impute missing data with random samples from training set.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Transformed dataset with imputed values
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""
```
351
352
**Usage Example**:
353
```python
354
from feature_engine.imputation import RandomSampleImputer
355
356
# Random sampling with fixed seed
357
imputer = RandomSampleImputer(random_state=42, seed='general')
358
df_imputed = imputer.fit_transform(df)
359
360
# Different seed per observation
361
imputer = RandomSampleImputer(
362
random_state=42,
363
seed='observation',
364
seeding_method='add'
365
)
366
df_imputed = imputer.fit_transform(df)
367
```
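
The general-seed case can be sketched in pandas as drawing, with replacement, from the observed training values. This is an illustration under assumptions, not the library's sampling procedure; the `train` and `test` frames are made up.

```python
import pandas as pd

train = pd.DataFrame({'var1': [1.0, 2.0, None, 4.0]})
test = pd.DataFrame({'var1': [None, 3.0, None]})

# Pool of observed values learned from the training set
observed = train['var1'].dropna()

# Draw one value per missing entry, reproducibly
mask = test['var1'].isna()
sampled = observed.sample(n=int(mask.sum()), replace=True, random_state=42)

test_imputed = test.copy()
test_imputed.loc[mask, 'var1'] = sampled.to_numpy()

print(test_imputed)
```

Because imputed values come from the observed pool, the variable's distribution is roughly preserved, unlike mean or arbitrary-value imputation.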

### Drop Missing Data

Deletes rows containing missing values, similar to `pandas.DataFrame.dropna()`.

```python { .api }
class DropMissingData:
    def __init__(self, missing_only=True, threshold=None, variables=None):
        """
        Initialize DropMissingData.

        Parameters:
        - missing_only (bool): If True, consider only variables with missing data in train set
        - threshold (int/float): Fraction (0 < threshold <= 1) of non-NA values required to keep a row. If None, rows with NA in any evaluated variable are dropped
        - variables (list): List of variables to evaluate for missing data. If None, uses all variables
        """

    def fit(self, X, y=None):
        """
        Find variables for missing data evaluation.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series, optional): Target variable (not used)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Remove rows with missing data based on specified criteria.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with rows containing missing data removed
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""

    def return_na_data(self, X):
        """
        Return subset of dataframe with rows that would be removed.

        Parameters:
        - X (pandas.DataFrame): Dataset to evaluate

        Returns:
        - pandas.DataFrame: Rows that contain missing data
        """
```

**Usage Example**:
```python
from feature_engine.imputation import DropMissingData

# Drop rows with any missing data
dropper = DropMissingData()
df_clean = dropper.fit_transform(df)

# Keep rows with at least 80% non-missing data
dropper = DropMissingData(threshold=0.8)
df_clean = dropper.fit_transform(df)

# See which rows would be dropped
rows_to_drop = dropper.return_na_data(df)
```
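
Since the transformer is described as similar to pandas' `dropna`, the threshold behaviour can be approximated there with the `thresh` argument, which expects a count of non-NA values. The sketch below converts a fraction to a count with `math.ceil`; that rounding is an assumption of this sketch, not a documented detail of the transformer.

```python
import math
import pandas as pd

df_demo = pd.DataFrame({
    'a': [1.0, None, 3.0],
    'b': [1.0, None, None],
    'c': [1.0, 2.0, 3.0],
    'd': [1.0, 2.0, 3.0],
})

# No threshold: any missing value removes the row
strict = df_demo.dropna()

# Fractional threshold: keep rows with at least 75% non-NA values
threshold = 0.75
min_non_na = math.ceil(threshold * df_demo.shape[1])   # 3 of 4 columns
relaxed = df_demo.dropna(thresh=min_non_na)

# Rows that the relaxed rule removes
dropped = df_demo.loc[df_demo.index.difference(relaxed.index)]

print(strict.index.tolist(), relaxed.index.tolist(), dropped.index.tolist())
```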

## Common Attributes

All imputation transformers share these fitted attributes:

- `variables_` (list): Variables that will be transformed
- `n_features_in_` (int): Number of features in training set
- `imputer_dict_` (dict): Dictionary with imputation values per variable (where applicable)
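
The trailing underscore marks attributes that exist only after `fit` has run. The class below is a hypothetical stand-in (`MedianImputerSketch` is not part of the library) showing how such attributes could be populated; it is a sketch of the convention, not the library's implementation.

```python
import pandas as pd

class MedianImputerSketch:
    """Hypothetical minimal imputer, for illustration only."""

    def __init__(self, variables=None):
        self.variables = variables

    def fit(self, X, y=None):
        # Fitted attributes are set here and carry a trailing underscore
        self.variables_ = self.variables or X.select_dtypes('number').columns.tolist()
        self.n_features_in_ = X.shape[1]
        self.imputer_dict_ = X[self.variables_].median().to_dict()
        return self

    def transform(self, X):
        return X.fillna(self.imputer_dict_)

df_demo = pd.DataFrame({
    'var1': [1.0, 2.0, None, 4.0],
    'var2': [10, None, 30, 40],
    'label': ['a', 'b', 'c', 'd'],
})

sketch = MedianImputerSketch().fit(df_demo)
print(sketch.variables_, sketch.n_features_in_, sketch.imputer_dict_)
```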