# Missing Data Imputation

Transformers for handling missing values in numerical and categorical variables using statistical methods, arbitrary values, random sampling, and missing data indicators.

## Capabilities

### Mean and Median Imputation

Replaces missing data with the mean or median of each numerical variable.

```python { .api }
class MeanMedianImputer:
    def __init__(self, imputation_method='median', variables=None):
        """
        Initialize MeanMedianImputer.

        Parameters:
        - imputation_method (str): 'mean' or 'median'
        - variables (list): List of numerical variables to impute. If None, selects all numerical variables
        """

    def fit(self, X, y=None):
        """
        Learn mean/median values for each variable.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series, optional): Target variable (not used)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Impute missing data using learned parameters.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Transformed dataset with imputed values
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""
```

**Usage Example**:

```python
from feature_engine.imputation import MeanMedianImputer
import pandas as pd

# Sample data with missing values
data = {'var1': [1.0, 2.0, None, 4.0], 'var2': [10, None, 30, 40]}
df = pd.DataFrame(data)

# Mean imputation
imputer = MeanMedianImputer(imputation_method='mean')
df_imputed = imputer.fit_transform(df)

# Median imputation (default)
imputer = MeanMedianImputer()
df_imputed = imputer.fit_transform(df)

# Access learned parameters (medians here, because the last imputer uses the default method)
print(imputer.imputer_dict_)  # {'var1': 2.0, 'var2': 30.0}
```
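
The split between `fit` and `transform` matters most when data is divided into training and test sets: the statistic is learned once from the training data and then reused on any later dataset. The following pandas-only sketch illustrates that idea. It is not feature_engine's implementation, and the `train`, `test`, and `learned` names are made up for illustration.

```python
import pandas as pd

train = pd.DataFrame({"var1": [1.0, 2.0, None, 4.0], "var2": [10, None, 30, 40]})
test = pd.DataFrame({"var1": [None, 5.0], "var2": [20.0, None]})

# "fit": learn one statistic per column from the training data only
learned = train.median().to_dict()

# "transform": fill gaps in any dataset with the values learned above
test_imputed = test.fillna(value=learned)

print(learned)
print(test_imputed)
```

The test set never contributes to the learned medians, which is the behaviour the `fit`/`transform` pair is meant to provide.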

### Arbitrary Number Imputation

Replaces missing data in numerical variables with an arbitrary value chosen by the user.

```python { .api }
class ArbitraryNumberImputer:
    def __init__(self, arbitrary_number=999, variables=None, imputer_dict=None):
        """
        Initialize ArbitraryNumberImputer.

        Parameters:
        - arbitrary_number (int/float): Number to replace missing data (ignored if imputer_dict provided)
        - variables (list): List of variables to impute. If None, selects all numerical variables
        - imputer_dict (dict): Dictionary mapping variables to imputation values
        """

    def fit(self, X, y=None):
        """
        Validate input data (no parameters learned).

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series, optional): Target variable (not used)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Impute missing data with arbitrary values.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Transformed dataset with imputed values
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""
```

**Usage Example**:

```python
from feature_engine.imputation import ArbitraryNumberImputer

# df is the DataFrame from the first example (columns 'var1' and 'var2')

# Single value for all variables
imputer = ArbitraryNumberImputer(arbitrary_number=-999)
df_imputed = imputer.fit_transform(df)

# Different values per variable (each key must be a column of df)
imputer = ArbitraryNumberImputer(
    imputer_dict={'var1': 0, 'var2': -1}
)
df_imputed = imputer.fit_transform(df)
```
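
As a rough illustration of what constant-value imputation does to the data, the sketch below reproduces both variants (one value for all columns, one value per column) with plain pandas. It is only an approximation of the idea, not the library's code, and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"var1": [1.0, 2.0, None, 4.0], "var2": [10, None, 30, 40]})

# One constant for every column
same_value = df.fillna(-999)

# A different constant per column, keyed by column name
per_column = df.fillna({"var1": 0, "var2": -1})

# Observed values are untouched; only the gaps take the constant
print(per_column)
```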

### Categorical Variable Imputation

Replaces missing data in categorical variables with an arbitrary value or with the most frequent category.

```python { .api }
class CategoricalImputer:
    def __init__(self, imputation_method='missing', fill_value='Missing',
                 variables=None, return_object=False, ignore_format=False):
        """
        Initialize CategoricalImputer.

        Parameters:
        - imputation_method (str): 'missing' (use fill_value) or 'frequent' (use mode)
        - fill_value (str/int/float): Value to replace missing data when imputation_method='missing'
        - variables (list): List of categorical variables to impute. If None, selects all object variables
        - return_object (bool): Whether to return variables as object dtype
        - ignore_format (bool): Whether to ignore variable format and accept numerical variables
        """

    def fit(self, X, y=None):
        """
        Learn most frequent category or assign arbitrary value per variable.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series, optional): Target variable (not used)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Impute missing data in categorical variables.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Transformed dataset with imputed values
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""
```

**Usage Example**:

```python
from feature_engine.imputation import CategoricalImputer
import pandas as pd

# Sample data with a categorical column (the earlier df holds only numbers)
df_cat = pd.DataFrame({'color': ['red', 'blue', None, 'red']})

# Impute with most frequent category
imputer = CategoricalImputer(imputation_method='frequent')
df_imputed = imputer.fit_transform(df_cat)

# Impute with custom value
imputer = CategoricalImputer(
    imputation_method='missing',
    fill_value='Unknown'
)
df_imputed = imputer.fit_transform(df_cat)
```
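
To make the two strategies concrete, here is a small pandas-only sketch of what "most frequent category" and "fill with a label" mean for a single column. It is an illustration under the assumption that the mode is taken over observed values only. The `colors` series is invented for the example.

```python
import pandas as pd

colors = pd.Series(["red", "blue", None, "red"], name="color")

# 'frequent' strategy: the mode is computed over the observed values only
most_frequent = colors.mode().iloc[0]
filled_frequent = colors.fillna(most_frequent)

# 'missing' strategy: every gap receives the same explicit label
filled_label = colors.fillna("Missing")

print(filled_frequent.tolist())
print(filled_label.tolist())
```

The label strategy keeps missingness visible as its own category, while the mode strategy blends the gaps into the dominant category.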

### End Tail Imputation

Replaces missing data in numerical variables with a value at the left or right tail of each variable's distribution.

```python { .api }
class EndTailImputer:
    def __init__(self, imputation_method='gaussian', tail='right', fold=3, variables=None):
        """
        Initialize EndTailImputer.

        Parameters:
        - imputation_method (str): 'gaussian' (mean ± fold*std), 'iqr' (Q3 + fold*IQR or Q1 - fold*IQR), 'max' (fold*max)
        - tail (str): 'right' (upper tail) or 'left' (lower tail)
        - fold (int/float): Factor to multiply std, IQR or max values
        - variables (list): List of numerical variables to impute. If None, selects all numerical variables
        """

    def fit(self, X, y=None):
        """
        Learn values at end of distribution for each variable.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series, optional): Target variable (not used)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Impute missing data with end tail values.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Transformed dataset with imputed values
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""
```

**Usage Example**:

```python
from feature_engine.imputation import EndTailImputer

# Right tail using IQR method
imputer = EndTailImputer(
    imputation_method='iqr',
    tail='right',
    fold=3
)
df_imputed = imputer.fit_transform(df)

# Left tail using gaussian method
imputer = EndTailImputer(
    imputation_method='gaussian',
    tail='left',
    fold=2
)
df_imputed = imputer.fit_transform(df)
```
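
The arithmetic behind the three methods can be worked through by hand on a single column. The sketch below does so with pandas, assuming the sample standard deviation and linearly interpolated quartiles that pandas uses by default. It is meant as a worked example of the formulas listed in the parameter description, not as the library's exact code.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, None, 4.0], name="var1")
fold = 3

# Gaussian rule: mean plus or minus fold * standard deviation
gaussian_right = s.mean() + fold * s.std()
gaussian_left = s.mean() - fold * s.std()

# IQR rule: third quartile + fold * IQR, or first quartile - fold * IQR
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_right = q3 + fold * iqr   # 3.0 + 3 * 1.5 = 7.5
iqr_left = q1 - fold * iqr    # 1.5 - 3 * 1.5 = -3.0

# Max rule: fold * maximum observed value
max_value = fold * s.max()    # 3 * 4.0 = 12.0

# Imputing with the right-tail IQR value
filled = s.fillna(iqr_right)
print(filled.tolist())
```

All three rules place the imputed value well outside the bulk of the observed data, so imputed rows remain distinguishable from genuine observations.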

### Missing Data Indicators

Adds a binary variable for each selected variable, flagging the rows where its value was missing.

```python { .api }
class AddMissingIndicator:
    def __init__(self, missing_only=True, variables=None):
        """
        Initialize AddMissingIndicator.

        Parameters:
        - missing_only (bool): Whether to add indicators only for variables with missing data in train set
        - variables (list): List of variables to create indicators for. If None, evaluates all variables
        """

    def fit(self, X, y=None):
        """
        Find variables for which missing indicators will be created.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series, optional): Target variable (not used)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Add binary missing indicators to dataset.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with additional binary indicator columns
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""
```

**Usage Example**:

```python
from feature_engine.imputation import AddMissingIndicator

# Add indicators only for variables with missing data
indicator = AddMissingIndicator(missing_only=True)
df_with_indicators = indicator.fit_transform(df)

# Creates new columns like 'var1_na', 'var2_na' where missing data existed
print(df_with_indicators.columns)  # Original columns + indicator columns
```
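
For readers who want to see what the indicator columns contain, the following pandas-only sketch builds them by hand. It mirrors the `missing_only=True` behaviour described above and reuses the `_na` suffix from the usage example. Everything else (the sample frame and the loop) is an illustrative assumption rather than the library's implementation.

```python
import pandas as pd

df = pd.DataFrame({"var1": [1.0, 2.0, None, 4.0], "var2": [10, 20, 30, 40]})

# Keep only the columns that actually contain missing values
cols_with_na = [col for col in df.columns if df[col].isna().any()]

with_flags = df.copy()
for col in cols_with_na:
    # 1 where the value was missing, 0 otherwise
    with_flags[f"{col}_na"] = df[col].isna().astype(int)

print(with_flags)
```

The original columns are left as they are, so indicators are typically combined with one of the imputers above rather than used alone.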

### Random Sample Imputation

Replaces missing data with values drawn at random from the observed values of the same variable in the training set.

```python { .api }
class RandomSampleImputer:
    def __init__(self, variables=None, random_state=None, seed='general', seeding_method='add'):
        """
        Initialize RandomSampleImputer.

        Parameters:
        - variables (list): List of variables to be imputed. If None, selects all variables in the training set
        - random_state (int/str/list): Integer seed when seed='general'; variable name or list of names when seed='observation'
        - seed (str): 'general' (single seed) or 'observation' (seed per observation)
        - seeding_method (str): 'add' or 'multiply' when combining seeds
        """

    def fit(self, X, y=None):
        """
        Store copy of training dataset for sampling.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series, optional): Target variable (not used)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Impute missing data with random samples from training set.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Transformed dataset with imputed values
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""
```

**Usage Example**:

```python
from feature_engine.imputation import RandomSampleImputer

# Random sampling with fixed seed
imputer = RandomSampleImputer(random_state=42, seed='general')
df_imputed = imputer.fit_transform(df)

# Different seed per observation: random_state names the variable(s)
# whose values are combined into the seed for each row
df_seeded = df.assign(row_id=[1, 2, 3, 4])
imputer = RandomSampleImputer(
    random_state=['row_id'],
    seed='observation',
    seeding_method='add'
)
df_imputed = imputer.fit_transform(df_seeded)
```
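
The core of random sample imputation is drawing replacement values from the observed values of the same variable. A minimal pandas sketch of that step follows, assuming a single general seed. The names `observed`, `missing_idx`, and `draws` are illustrative, and the library's own sampling details may differ.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, None, 4.0, None], name="var1")

observed = s.dropna()
missing_idx = s.index[s.isna()]

# Draw one observed value per gap; a fixed random_state makes the draw repeatable
draws = observed.sample(n=len(missing_idx), replace=True, random_state=42)

filled = s.copy()
filled.loc[missing_idx] = draws.to_numpy()

print(filled.tolist())
```

Because the replacements come from the variable's own observed values, the imputed column keeps roughly the same distribution, unlike mean or constant imputation.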

### Drop Missing Data

Deletes rows containing missing values, similar to `pandas.DataFrame.dropna()`.

```python { .api }
class DropMissingData:
    def __init__(self, missing_only=True, threshold=None, variables=None):
        """
        Initialize DropMissingData.

        Parameters:
        - missing_only (bool): If True, consider only variables with missing data in train set
        - threshold (int/float): Fraction (between 0 and 1) of non-NA values required to keep a row. If None, rows with any missing value are dropped
        - variables (list): List of variables to evaluate for missing data. If None, uses all variables
        """

    def fit(self, X, y=None):
        """
        Find variables for missing data evaluation.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series, optional): Target variable (not used)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Remove rows with missing data based on specified criteria.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with rows containing missing data removed
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""

    def return_na_data(self, X):
        """
        Return subset of dataframe with rows that would be removed.

        Parameters:
        - X (pandas.DataFrame): Dataset to evaluate

        Returns:
        - pandas.DataFrame: Rows that contain missing data
        """
```

**Usage Example**:

```python
from feature_engine.imputation import DropMissingData

# Drop rows with any missing data
dropper = DropMissingData()
df_clean = dropper.fit_transform(df)

# Keep rows with at least 80% non-missing data
dropper = DropMissingData(threshold=0.8)
df_clean = dropper.fit_transform(df)

# See which rows would be dropped
rows_to_drop = dropper.return_na_data(df)
```
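
Since the transformer is described as similar to `dropna()`, a side-by-side pandas sketch may help clarify the three operations: dropping any incomplete row, keeping rows above a completeness threshold, and inspecting the incomplete rows. The conversion from a fraction to a count with `math.ceil` is an assumption of this sketch, not necessarily how the library rounds.

```python
import math
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, None, 3.0, None],
    "b": [1.0, 2.0, None, None],
    "c": [1.0, 2.0, 3.0, None],
})

# Drop every row that has at least one missing value
complete_rows = df.dropna()

# Keep rows where at least half of the columns are filled.
# pandas expects a count, so the fraction is converted here.
min_filled = math.ceil(0.5 * df.shape[1])
mostly_filled = df.dropna(thresh=min_filled)

# Rows that contain any missing value, for inspection before dropping
rows_with_na = df[df.isna().any(axis=1)]

print(complete_rows.index.tolist(), mostly_filled.index.tolist(), rows_with_na.index.tolist())
```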

## Common Attributes

All imputation transformers share these fitted attributes:

- `variables_` (list): Variables that will be transformed
- `n_features_in_` (int): Number of features in training set
- `imputer_dict_` (dict): Dictionary with imputation values per variable (where applicable)
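
The trailing underscore marks attributes that exist only after `fit` has run. The sketch below shows that convention with a hypothetical `MedianImputerSketch` class written in plain pandas. It is a stand-in for illustration and does not reproduce any feature_engine class.

```python
import pandas as pd


class MedianImputerSketch:
    """Hypothetical stand-in that mimics the fitted-attribute convention."""

    def fit(self, X, y=None):
        # Attributes ending in "_" are set during fit, never in __init__
        self.variables_ = list(X.select_dtypes("number").columns)
        self.n_features_in_ = X.shape[1]
        self.imputer_dict_ = X[self.variables_].median().to_dict()
        return self

    def transform(self, X):
        # Only the columns listed in imputer_dict_ are filled
        return X.fillna(value=self.imputer_dict_)


df = pd.DataFrame({"var1": [1.0, None, 3.0], "label": ["a", "b", None]})
imp = MedianImputerSketch().fit(df)

print(imp.variables_, imp.n_features_in_, imp.imputer_dict_)
```

Inspecting these attributes after fitting is a quick way to confirm which variables a transformer selected and which values it will apply.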