
# Preprocessing Utilities

General preprocessing functions and transformers for data preparation and for matching variables between datasets, ensuring consistency and compatibility across machine learning workflows.

## Capabilities

### Variable Matching

Ensures that the variables in a dataset match those in a reference dataset, handling missing columns and maintaining a consistent structure across training and prediction datasets.

```python { .api }
class MatchVariables:
    def __init__(self, missing_values='raise'):
        """
        Initialize MatchVariables.

        Parameters:
        - missing_values (str): Whether to raise an error ('raise') or
          proceed ('ignore') when the data contain missing values
        """

    def fit(self, X, y=None):
        """
        Learn the reference set of variables from training data.

        Parameters:
        - X (pandas.DataFrame): Reference dataset (typically training data)
        - y (pandas.Series, optional): Target variable (not used)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Transform dataset to match reference variables.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with variables matching the reference set
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""
```

**Usage Example**:

```python
from feature_engine.preprocessing import MatchVariables
import pandas as pd
import numpy as np

# Training dataset
train_data = {
    'feature1': np.random.randn(100),
    'feature2': np.random.randn(100),
    'feature3': np.random.randn(100),
    'target': np.random.randint(0, 2, 100)
}
df_train = pd.DataFrame(train_data)

# Test dataset with missing feature and extra feature
test_data = {
    'feature1': np.random.randn(50),
    'feature2': np.random.randn(50),
    # feature3 is missing
    'feature4': np.random.randn(50)  # Extra feature
}
df_test = pd.DataFrame(test_data)

# Match test data to training data structure
matcher = MatchVariables(missing_values='ignore')
matcher.fit(df_train.drop('target', axis=1))  # Fit on features only
df_test_matched = matcher.transform(df_test)

print("Training features:", df_train.drop('target', axis=1).columns.tolist())
print("Original test features:", df_test.columns.tolist())
print("Matched test features:", df_test_matched.columns.tolist())
# Result: df_test_matched will have feature1, feature2, feature3 (with NaN)
# feature4 is dropped
```

## Usage Patterns

### Model Deployment Pipeline

```python
from sklearn.pipeline import Pipeline
from feature_engine.imputation import MeanMedianImputer
from feature_engine.preprocessing import MatchVariables
from feature_engine.encoding import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

# Training pipeline
training_pipeline = Pipeline([
    ('imputer', MeanMedianImputer()),
    ('encoder', OneHotEncoder()),
    ('classifier', RandomForestClassifier())
])

# Fit on training data
training_pipeline.fit(X_train, y_train)

# Deployment pipeline with variable matching prepended. The fitted steps are
# reused directly, because assigning into named_steps does not modify a Pipeline.
deployment_pipeline = Pipeline([
    ('matcher', MatchVariables(missing_values='ignore').fit(X_train)),  # Ensure consistent variables
    ('imputer', training_pipeline.named_steps['imputer']),
    ('encoder', training_pipeline.named_steps['encoder']),
    ('classifier', training_pipeline.named_steps['classifier'])
])

# Now new data with a different column structure can be handled
predictions = deployment_pipeline.predict(X_new)
```

### Cross-Dataset Validation

```python
# Different datasets with potentially different features
dataset1 = pd.DataFrame({
    'age': [25, 30, 35],
    'income': [50000, 60000, 70000],
    'education': ['BS', 'MS', 'PhD']
})

dataset2 = pd.DataFrame({
    'age': [28, 32],
    'income': [55000, 65000],
    'experience': [3, 5]  # Different feature
})

dataset3 = pd.DataFrame({
    'income': [45000, 75000],
    'education': ['BS', 'MS'],
    'location': ['NYC', 'LA']  # Different feature
})

# Use first dataset as reference
matcher = MatchVariables(missing_values='ignore')
matcher.fit(dataset1)

# Transform other datasets to match
dataset2_matched = matcher.transform(dataset2)
dataset3_matched = matcher.transform(dataset3)

print("Reference columns:", dataset1.columns.tolist())
print("Dataset2 matched:", dataset2_matched.columns.tolist())
print("Dataset3 matched:", dataset3_matched.columns.tolist())
# All will have: age, income, education (with NaN where missing)
```

### Feature Engineering Consistency

```python
from feature_engine.creation import MathematicalCombination
from feature_engine.datetime import DatetimeFeatures

# Complex feature engineering pipeline
feature_pipeline = Pipeline([
    ('datetime_features', DatetimeFeatures(
        features_to_extract=['month', 'day_of_week']
    )),
    ('math_combinations', MathematicalCombination(
        variables_to_combine=['feature1', 'feature2'],
        math_operations=['sum', 'prod']
    )),
    ('matcher', MatchVariables())  # Ensure final consistency
])

# Fit on training data
feature_pipeline.fit(X_train)

# Apply to validation/test data with potentially missing features
X_val_processed = feature_pipeline.transform(X_val)
X_test_processed = feature_pipeline.transform(X_test)

# All datasets will have a consistent feature structure
```

### Handling Schema Changes

```python
# Original model trained on v1 data schema
v1_schema = ['customer_id', 'purchase_amount', 'product_category', 'region']
v1_data = pd.DataFrame({col: np.random.randn(100) for col in v1_schema})

# New data has updated schema
v2_schema = ['customer_id', 'purchase_amount', 'product_category', 'region', 'channel', 'discount']
v2_data = pd.DataFrame({col: np.random.randn(50) for col in v2_schema})

# Legacy data missing new columns
legacy_schema = ['customer_id', 'purchase_amount', 'product_category']  # Missing region
legacy_data = pd.DataFrame({col: np.random.randn(25) for col in legacy_schema})

# Train matcher on original schema
schema_matcher = MatchVariables(missing_values='ignore')
schema_matcher.fit(v1_data)

# All datasets can be processed consistently
v2_matched = schema_matcher.transform(v2_data)          # Extra columns removed
legacy_matched = schema_matcher.transform(legacy_data)  # Missing column added with NaN

print("V1 schema:", v1_data.columns.tolist())
print("V2 matched:", v2_matched.columns.tolist())
print("Legacy matched:", legacy_matched.columns.tolist())
# All have same columns: customer_id, purchase_amount, product_category, region
```

### API Integration

```python
import json

def preprocess_api_data(api_response, trained_matcher):
    """
    Preprocess data from API response to match model expectations.
    """
    # Parse API response
    data = json.loads(api_response)
    df = pd.DataFrame([data])  # Single row from API

    # Match to expected schema
    df_matched = trained_matcher.transform(df)

    return df_matched

# Example API responses with different structures
api_response_1 = '{"feature1": 1.0, "feature2": 2.0, "feature3": 3.0}'
api_response_2 = '{"feature1": 1.5, "feature2": 2.5}'  # Missing feature3
api_response_3 = '{"feature1": 2.0, "feature2": 3.0, "feature3": 4.0, "extra_field": 5.0}'

# The trained matcher expects feature1, feature2 and feature3. Fit on a small
# reference frame rather than an empty one, since fit validates the input data.
reference = pd.DataFrame({'feature1': [0.0], 'feature2': [0.0], 'feature3': [0.0]})
matcher = MatchVariables()
matcher.fit(reference)

# All API responses can be handled consistently
for i, response in enumerate([api_response_1, api_response_2, api_response_3], 1):
    processed = preprocess_api_data(response, matcher)
    print(f"API response {i} processed shape:", processed.shape)
    print(f"Columns: {processed.columns.tolist()}")
```

### Error Handling Modes

```python
# Strict mode - raise an error when the data to fit or transform contain NaN
# (df_with_nan is a hypothetical DataFrame containing missing values)
strict_matcher = MatchVariables(missing_values='raise')
strict_matcher.fit(df_train)

try:
    result = strict_matcher.transform(df_with_nan)
except ValueError as e:
    print(f"Strict mode error: {e}")

# Lenient mode - tolerate missing values in the data
lenient_matcher = MatchVariables(missing_values='ignore')
lenient_matcher.fit(df_train)
result = lenient_matcher.transform(df_with_nan)  # Succeeds; NaN pass through

# In both modes, absent columns are added and extra columns are dropped
```

## Best Practices

### 1. Use in Production Pipelines
Always include MatchVariables in production pipelines to handle schema changes gracefully.

### 2. Fit on Training Data Only
Fit the matcher on training data to establish the canonical variable set.

### 3. Handle Missing Data Downstream
Use missing_values='ignore' and handle NaN values with appropriate imputation strategies.
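
A minimal sketch of this pattern, reusing `df_train` and `df_test` from the usage example above (the matcher adds absent columns as NaN, which the imputer then fills with statistics learned at fit time):

```python
from sklearn.pipeline import Pipeline
from feature_engine.imputation import MeanMedianImputer
from feature_engine.preprocessing import MatchVariables

prep = Pipeline([
    ('matcher', MatchVariables(missing_values='ignore')),
    ('imputer', MeanMedianImputer(imputation_method='median'))
])

# Fit on complete training features; feature3's median is learned here
prep.fit(df_train.drop('target', axis=1))

# df_test lacks feature3: the matcher adds it as NaN, the imputer fills it
df_test_prepared = prep.transform(df_test)
```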

### 4. Version Control Schemas
Keep track of expected schemas when deploying models to different environments.
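
For example, the fitted reference set can be persisted alongside the model artifacts and checked in each environment. This is a sketch, not a feature_engine facility; `schema_v1.json` and `df_new` are illustrative names:

```python
import json

# Save the reference schema established at fit time
with open('schema_v1.json', 'w') as f:
    json.dump({'version': 1, 'columns': list(matcher.variables_to_match_)}, f)

# In another environment, compare incoming data against the saved schema
with open('schema_v1.json') as f:
    schema = json.load(f)
missing = set(schema['columns']) - set(df_new.columns)
extra = set(df_new.columns) - set(schema['columns'])
if missing or extra:
    print(f"Schema v{schema['version']} drift - missing: {sorted(missing)}, extra: {sorted(extra)}")
```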

### 5. Monitor Schema Drift
Log when MatchVariables adds or removes columns to detect data drift.
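
A minimal sketch of such monitoring, assuming a fitted matcher and the standard `logging` module (the wrapper function is illustrative):

```python
import logging

logger = logging.getLogger("schema_drift")

def transform_with_drift_log(matcher, X):
    """Transform X, logging the columns the matcher adds or drops."""
    expected = set(matcher.variables_to_match_)
    incoming = set(X.columns)
    added = sorted(expected - incoming)    # will be added as NaN columns
    dropped = sorted(incoming - expected)  # will be dropped
    if added:
        logger.warning("Schema drift: adding missing columns %s", added)
    if dropped:
        logger.warning("Schema drift: dropping unexpected columns %s", dropped)
    return matcher.transform(X)
```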


## Common Attributes

MatchVariables has these fitted attributes:

- `variables_to_match_` (list): Reference set of variables established during fit
- `n_features_in_` (int): Number of features in the training set
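
For example, after fitting on the training features from the usage example above:

```python
matcher = MatchVariables(missing_values='ignore')
matcher.fit(df_train.drop('target', axis=1))

print(matcher.variables_to_match_)  # ['feature1', 'feature2', 'feature3']
print(matcher.n_features_in_)       # 3
```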


The transformer ensures that output datasets always have exactly the variables specified in `variables_to_match_`, adding missing variables as NaN columns and dropping extra variables.
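
Conceptually, the transform step behaves like a pandas reindex over the reference columns. The sketch below mimics that behaviour; it is not the library's implementation:

```python
import numpy as np
import pandas as pd

def match_variables_sketch(X: pd.DataFrame, reference_columns) -> pd.DataFrame:
    """Return X with exactly reference_columns: absent columns are added
    as NaN, unexpected columns are dropped, order follows the reference."""
    return X.reindex(columns=list(reference_columns), fill_value=np.nan)
```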