# Preprocessing Utilities

General preprocessing functions and transformers for data preparation, including variable matching between datasets to keep training and prediction data consistent in machine learning workflows.

## Capabilities

### Variable Matching

Ensures that the variables in a dataset match those in a reference dataset, adding missing columns and dropping extra ones so that training and prediction datasets share the same structure.

```python { .api }
class MatchVariables:
    def __init__(self, fill_value=np.nan, missing_values='raise', verbose=True):
        """
        Initialize MatchVariables.

        Parameters:
        - fill_value: Value used to fill variables that are in the reference
          set but absent from the dataset being transformed (np.nan by default)
        - missing_values (str): Whether to 'raise' an error or 'ignore' when
          the datasets contain NaN values
        - verbose (bool): Whether to print the names of the variables added
          and dropped during transform
        """

    def fit(self, X, y=None):
        """
        Learn the reference set of variables from the training data.

        Parameters:
        - X (pandas.DataFrame): Reference dataset (typically training data)
        - y (pandas.Series, optional): Target variable (not used)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Transform the dataset to match the reference variables.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with variables matching the reference set
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""
```

**Usage Example**:
```python
from feature_engine.preprocessing import MatchVariables
import pandas as pd
import numpy as np

# Training dataset
train_data = {
    'feature1': np.random.randn(100),
    'feature2': np.random.randn(100),
    'feature3': np.random.randn(100),
    'target': np.random.randint(0, 2, 100)
}
df_train = pd.DataFrame(train_data)

# Test dataset with a missing feature and an extra feature
test_data = {
    'feature1': np.random.randn(50),
    'feature2': np.random.randn(50),
    # feature3 is missing
    'feature4': np.random.randn(50)  # Extra feature
}
df_test = pd.DataFrame(test_data)

# Match test data to the training data structure
matcher = MatchVariables(missing_values='ignore')
matcher.fit(df_train.drop('target', axis=1))  # Fit on features only
df_test_matched = matcher.transform(df_test)

print("Training features:", df_train.drop('target', axis=1).columns.tolist())
print("Original test features:", df_test.columns.tolist())
print("Matched test features:", df_test_matched.columns.tolist())
# Result: df_test_matched has feature1, feature2, feature3 (filled with NaN);
# feature4 is dropped
```

## Usage Patterns

### Model Deployment Pipeline

```python
from sklearn.pipeline import Pipeline
from feature_engine.imputation import MeanMedianImputer
from feature_engine.preprocessing import MatchVariables
from feature_engine.encoding import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

# Training pipeline
training_pipeline = Pipeline([
    ('imputer', MeanMedianImputer()),
    ('encoder', OneHotEncoder()),
    ('classifier', RandomForestClassifier())
])

# Fit on training data
training_pipeline.fit(X_train, y_train)

# Fit a matcher on the training features to learn the reference schema
matcher = MatchVariables()
matcher.fit(X_train)

# Deployment pipeline: prepend the fitted matcher and reuse the
# already-fitted steps from the training pipeline
deployment_pipeline = Pipeline([
    ('matcher', matcher),  # Ensure consistent variables
    ('imputer', training_pipeline.named_steps['imputer']),
    ('encoder', training_pipeline.named_steps['encoder']),
    ('classifier', training_pipeline.named_steps['classifier'])
])

# New data with a different column structure can now be scored
predictions = deployment_pipeline.predict(X_new)
```

### Cross-Dataset Validation

```python
# Different datasets with potentially different features
dataset1 = pd.DataFrame({
    'age': [25, 30, 35],
    'income': [50000, 60000, 70000],
    'education': ['BS', 'MS', 'PhD']
})

dataset2 = pd.DataFrame({
    'age': [28, 32],
    'income': [55000, 65000],
    'experience': [3, 5]  # Different feature
})

dataset3 = pd.DataFrame({
    'income': [45000, 75000],
    'education': ['BS', 'MS'],
    'location': ['NYC', 'LA']  # Different feature
})

# Use the first dataset as the reference
matcher = MatchVariables(missing_values='ignore')
matcher.fit(dataset1)

# Transform the other datasets to match
dataset2_matched = matcher.transform(dataset2)
dataset3_matched = matcher.transform(dataset3)

print("Reference columns:", dataset1.columns.tolist())
print("Dataset2 matched:", dataset2_matched.columns.tolist())
print("Dataset3 matched:", dataset3_matched.columns.tolist())
# All have: age, income, education (with NaN where missing)
```

### Feature Engineering Consistency

```python
from sklearn.pipeline import Pipeline
from feature_engine.creation import MathFeatures
from feature_engine.datetime import DatetimeFeatures

# Complex feature engineering pipeline
feature_pipeline = Pipeline([
    ('datetime_features', DatetimeFeatures(
        features_to_extract=['month', 'day_of_week']
    )),
    ('math_combinations', MathFeatures(
        variables=['feature1', 'feature2'],
        func=['sum', 'prod']
    )),
    ('matcher', MatchVariables())  # Ensure final consistency
])

# Fit on training data
feature_pipeline.fit(X_train)

# Apply to validation/test data that may be missing features
X_val_processed = feature_pipeline.transform(X_val)
X_test_processed = feature_pipeline.transform(X_test)

# All datasets now share a consistent feature structure
```

### Handling Schema Changes

```python
# Original model trained on the v1 data schema
v1_schema = ['customer_id', 'purchase_amount', 'product_category', 'region']
v1_data = pd.DataFrame({col: np.random.randn(100) for col in v1_schema})

# New data has an updated schema
v2_schema = ['customer_id', 'purchase_amount', 'product_category', 'region', 'channel', 'discount']
v2_data = pd.DataFrame({col: np.random.randn(50) for col in v2_schema})

# Legacy data is missing newer columns
legacy_schema = ['customer_id', 'purchase_amount', 'product_category']  # Missing region
legacy_data = pd.DataFrame({col: np.random.randn(25) for col in legacy_schema})

# Fit the matcher on the original schema
schema_matcher = MatchVariables(missing_values='ignore')
schema_matcher.fit(v1_data)

# All datasets can be processed consistently
v2_matched = schema_matcher.transform(v2_data)          # Extra columns dropped
legacy_matched = schema_matcher.transform(legacy_data)  # Missing column added as NaN

print("V1 schema:", v1_data.columns.tolist())
print("V2 matched:", v2_matched.columns.tolist())
print("Legacy matched:", legacy_matched.columns.tolist())
# All have the same columns: customer_id, purchase_amount, product_category, region
```

### API Integration

```python
import json

def preprocess_api_data(api_response, trained_matcher):
    """
    Preprocess data from an API response to match model expectations.
    """
    # Parse the API response
    data = json.loads(api_response)
    df = pd.DataFrame([data])  # Single row from the API

    # Match to the expected schema
    df_matched = trained_matcher.transform(df)

    return df_matched

# Example API responses with different structures
api_response_1 = '{"feature1": 1.0, "feature2": 2.0, "feature3": 3.0}'
api_response_2 = '{"feature1": 1.5, "feature2": 2.5}'  # Missing feature3
api_response_3 = '{"feature1": 2.0, "feature2": 3.0, "feature3": 4.0, "extra_field": 5.0}'

# The matcher learns the expected schema (feature1, feature2, feature3) from
# a reference DataFrame; fit needs actual data, not just column labels
reference = pd.DataFrame({'feature1': [0.0], 'feature2': [0.0], 'feature3': [0.0]})
matcher = MatchVariables()
matcher.fit(reference)

# All API responses can be handled consistently
for i, response in enumerate([api_response_1, api_response_2, api_response_3], 1):
    processed = preprocess_api_data(response, matcher)
    print(f"API response {i} processed shape:", processed.shape)
    print(f"Columns: {processed.columns.tolist()}")
```

### Error Handling Modes

```python
# A dataset that contains NaN values
df_with_nan = df_test.copy()
df_with_nan.loc[0, 'feature1'] = np.nan

# Strict mode - raise an error when the data contain NaN values
strict_matcher = MatchVariables(missing_values='raise')
strict_matcher.fit(df_train)

try:
    result = strict_matcher.transform(df_with_nan)
except ValueError as e:
    print(f"Strict mode error: {e}")

# Lenient mode - tolerate NaN values
lenient_matcher = MatchVariables(missing_values='ignore')
lenient_matcher.fit(df_train)
result = lenient_matcher.transform(df_with_nan)  # Succeeds

# In both modes, variables absent from the data to transform are added
# (filled with fill_value, NaN by default) and extra variables are dropped
```

## Best Practices

### 1. Use in Production Pipelines
Always include MatchVariables in production pipelines to handle schema changes gracefully (see the Model Deployment Pipeline pattern above).

### 2. Fit on Training Data Only
Fit the matcher on the training data to establish the canonical variable set.

### 3. Handle Missing Data Downstream
Columns added by the matcher are filled with NaN by default, so use missing_values='ignore' and pair the matcher with an appropriate imputation step downstream, as in the sketch below.

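A minimal sketch of this pattern, assuming numerical features; the matcher runs first so that any columns it adds as NaN are filled by the imputer with statistics learned during training:

```python
from sklearn.pipeline import Pipeline
from feature_engine.imputation import MeanMedianImputer
from feature_engine.preprocessing import MatchVariables
import pandas as pd
import numpy as np

df_train = pd.DataFrame({
    'feature1': np.random.randn(100),
    'feature2': np.random.randn(100),
})

prep = Pipeline([
    ('matcher', MatchVariables(missing_values='ignore')),
    ('imputer', MeanMedianImputer(imputation_method='median')),
])
prep.fit(df_train)

# A dataset missing feature2 gets the column back, filled with the
# median of feature2 learned from the training data
df_new = pd.DataFrame({'feature1': np.random.randn(10)})
print(prep.transform(df_new).head())
```
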
### 4. Version Control Schemas
Keep track of the expected schema when deploying models to different environments; see the sketch below.

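One lightweight option, sketched here as a general pattern rather than a feature of the library (the file name is illustrative): persist the learned schema next to the model artifacts and check it at deployment time.

```python
import json

# Record the reference schema of a fitted matcher
with open('schema_v1.json', 'w') as f:
    json.dump({'version': 1, 'columns': list(matcher.feature_names_in_)}, f)

# At deployment time, verify the matcher against the recorded schema
with open('schema_v1.json') as f:
    expected = json.load(f)['columns']
assert list(matcher.feature_names_in_) == expected, "Schema mismatch"
```
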
### 5. Monitor Schema Drift
Log when MatchVariables adds or removes columns to detect schema drift; see the sketch below.

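With `verbose=True` (the default) MatchVariables prints the variables it adds and drops; for structured monitoring you can log the column differences yourself. A minimal sketch (the wrapper function is illustrative, not part of the library):

```python
import logging

logger = logging.getLogger('schema_drift')

def transform_with_drift_log(matcher, df):
    """Transform df, logging any columns the matcher adds or drops."""
    expected = set(matcher.feature_names_in_)
    incoming = set(df.columns)
    added = expected - incoming    # columns that will be filled in
    dropped = incoming - expected  # columns that will be discarded
    if added:
        logger.warning("Input missing columns, added: %s", sorted(added))
    if dropped:
        logger.warning("Input had unexpected columns, dropped: %s", sorted(dropped))
    return matcher.transform(df)
```
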
## Common Attributes

MatchVariables has these fitted attributes:

- `feature_names_in_` (list): Reference set of variables learned during fit
- `n_features_in_` (int): Number of features in the training set

The transformer ensures that output datasets always have exactly the variables in `feature_names_in_`, adding missing variables as new columns (filled with `fill_value`, NaN by default) and dropping extra variables.

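A quick way to inspect the fitted state, reusing `df_train` from the usage example above:

```python
matcher = MatchVariables(missing_values='ignore')
matcher.fit(df_train.drop('target', axis=1))

print(matcher.feature_names_in_)  # ['feature1', 'feature2', 'feature3']
print(matcher.n_features_in_)     # 3
```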