# Mathematical Transformations

Transformers for applying mathematical functions to numerical variables, including logarithmic, power, reciprocal, Box-Cox, and Yeo-Johnson transformations, to improve data distribution and model performance.

## Capabilities

### Logarithmic Transformation

Applies the natural logarithm or the base 10 logarithm to numerical variables.

```python { .api }
class LogTransformer:
    def __init__(self, variables=None, base='e'):
        """
        Initialize LogTransformer.

        Parameters:
        - variables (list): List of numerical variables to transform. If None, selects all numerical variables
        - base (str): 'e' for natural logarithm or '10' for base 10 logarithm
        """

    def fit(self, X, y=None):
        """
        Validate that variables are positive (no parameters learned).

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series, optional): Target variable (not used)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Apply logarithm transformation to variables.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with log-transformed variables
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""

    def inverse_transform(self, X):
        """
        Convert back to original representation using exponential.

        Parameters:
        - X (pandas.DataFrame): Dataset with log-transformed values

        Returns:
        - pandas.DataFrame: Dataset with original scale restored
        """
```

**Usage Example**:
```python
from feature_engine.transformation import LogTransformer
import pandas as pd
import numpy as np

# Sample data with positive values
data = {'price': [100, 200, 500, 1000, 2000],
        'volume': [10, 25, 50, 100, 200]}
df = pd.DataFrame(data)

# Natural log transformation
transformer = LogTransformer(base='e')
df_transformed = transformer.fit_transform(df)

# Base 10 log transformation
transformer = LogTransformer(base='10')
df_transformed = transformer.fit_transform(df)

# Inverse transformation
df_original = transformer.inverse_transform(df_transformed)
```

### Log Plus Constant Transformation

Applies the transformation log(x + C), where C is a constant chosen so that x + C is positive. This is useful for variables that contain zeros or negative values.

```python { .api }
class LogCpTransformer:
    def __init__(self, variables=None, base='e', C='auto'):
        """
        Initialize LogCpTransformer.

        Parameters:
        - variables (list): List of numerical variables to transform. If None, selects all numerical variables
        - base (str): 'e' for natural logarithm or '10' for base 10 logarithm
        - C (int/float/str/dict): Constant to add before taking the log. 'auto' sets C = abs(min(x)) + 1 per variable; a dict maps variable names to constants
        """

    def fit(self, X, y=None):
        """
        Learn constant C if C='auto', otherwise validate input.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series, optional): Target variable (not used)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Apply log(x + C) transformation to variables.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with log(x + C) transformed variables
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""

    def inverse_transform(self, X):
        """
        Convert back to original representation using exp(x) - C.

        Parameters:
        - X (pandas.DataFrame): Dataset with log-transformed values

        Returns:
        - pandas.DataFrame: Dataset with original scale restored
        """
```

**Usage Example**:
```python
from feature_engine.transformation import LogCpTransformer

# Auto-calculate C per variable: C = abs(min(x)) + 1
transformer = LogCpTransformer(C='auto')
df_transformed = transformer.fit_transform(df)

# Specify constant C
transformer = LogCpTransformer(C=1)
df_transformed = transformer.fit_transform(df)

# Different C per variable (keys must be columns of df)
transformer = LogCpTransformer(C={'price': 1, 'volume': 5})
df_transformed = transformer.fit_transform(df)

# Access the C values used
print(transformer.C_)  # Shows C value per variable
```
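
The arithmetic behind `C='auto'` and `inverse_transform` can be sketched without the library. This is a minimal pure-Python illustration, not feature_engine's implementation; it assumes the `C = abs(min(x)) + 1` rule, and the helper names and the `balance` column are hypothetical.

```python
import math


def auto_constant(values):
    """Return C = abs(min(values)) + 1, the rule assumed here for C='auto'."""
    return abs(min(values)) + 1


def log_cp(values, c):
    """Apply log(x + C) using the natural logarithm."""
    return [math.log(x + c) for x in values]


def inverse_log_cp(values, c):
    """Undo log(x + C) by computing exp(x) - C."""
    return [math.exp(x) - c for x in values]


# Hypothetical column containing negative values and a zero
balance = [-5.0, 0.0, 3.0, 20.0]

c = auto_constant(balance)            # abs(-5.0) + 1
transformed = log_cp(balance, c)      # the minimum maps to log(1)
restored = inverse_log_cp(transformed, c)
```

With this rule the smallest shifted value is always at least 1, so the logarithm is defined for every training value and the smallest transformed value is never negative.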

### Box-Cox Transformation

Applies Box-Cox transformation to numerical variables to achieve normality.

```python { .api }
class BoxCoxTransformer:
    def __init__(self, variables=None):
        """
        Initialize BoxCoxTransformer.

        Parameters:
        - variables (list): List of numerical variables to transform. If None, selects all numerical variables
        """

    def fit(self, X, y=None):
        """
        Learn optimal lambda parameter for Box-Cox transformation per variable.

        Parameters:
        - X (pandas.DataFrame): Training dataset (must contain positive values)
        - y (pandas.Series, optional): Target variable (not used)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Apply Box-Cox transformation using learned lambda values.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with Box-Cox transformed variables
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""

    def inverse_transform(self, X):
        """
        Convert back to original representation using inverse Box-Cox.

        Parameters:
        - X (pandas.DataFrame): Dataset with Box-Cox transformed values

        Returns:
        - pandas.DataFrame: Dataset with original scale restored
        """
```

**Usage Example**:
```python
from feature_engine.transformation import BoxCoxTransformer

# Box-Cox transformation (requires positive values)
transformer = BoxCoxTransformer()
df_transformed = transformer.fit_transform(df)

# Access learned lambda parameters
print(transformer.lambda_dict_)  # Shows optimal lambda per variable

# Inverse transformation
df_original = transformer.inverse_transform(df_transformed)
```
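
For reference, the Box-Cox formula and its inverse for a single value can be sketched as below. This is a pure-Python illustration of the standard definition, with lambda taken as given rather than estimated; the function names are hypothetical and are not part of feature_engine.

```python
import math


def box_cox(x, lam):
    """Box-Cox transform of one strictly positive value."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(x)            # limiting case as lambda -> 0
    return (x ** lam - 1) / lam


def inverse_box_cox(y, lam):
    """Invert the Box-Cox transform for the same lambda."""
    if lam == 0:
        return math.exp(y)
    return (lam * y + 1) ** (1 / lam)


lam = 0.5                             # illustrative value, not a fitted one
y = box_cox(4.0, lam)                 # (sqrt(4) - 1) / 0.5
x_back = inverse_box_cox(y, lam)
```

A lambda of 1 only shifts the data by a constant, while a lambda of 0 reduces to the natural logarithm, which is why a fitted lambda near 0 suggests a plain log transformation would work about as well.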

### Yeo-Johnson Transformation

Applies Yeo-Johnson transformation to numerical variables, which works with positive and negative values.

```python { .api }
class YeoJohnsonTransformer:
    def __init__(self, variables=None):
        """
        Initialize YeoJohnsonTransformer.

        Parameters:
        - variables (list): List of numerical variables to transform. If None, selects all numerical variables
        """

    def fit(self, X, y=None):
        """
        Learn optimal lambda parameter for Yeo-Johnson transformation per variable.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series, optional): Target variable (not used)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Apply Yeo-Johnson transformation using learned lambda values.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with Yeo-Johnson transformed variables
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""

    def inverse_transform(self, X):
        """
        Convert back to original representation using inverse Yeo-Johnson.

        Parameters:
        - X (pandas.DataFrame): Dataset with Yeo-Johnson transformed values

        Returns:
        - pandas.DataFrame: Dataset with original scale restored
        """
```

**Usage Example**:
```python
from feature_engine.transformation import YeoJohnsonTransformer

# Yeo-Johnson transformation (works with positive and negative values)
transformer = YeoJohnsonTransformer()
df_transformed = transformer.fit_transform(df)

# Access learned lambda parameters
print(transformer.lambda_dict_)  # Shows optimal lambda per variable

# Inverse transformation
df_original = transformer.inverse_transform(df_transformed)
```
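
The reason Yeo-Johnson accepts zero and negative values is that it uses a different branch on each side of zero. The sketch below follows the standard piecewise definition for a single value; it is illustrative only, takes lambda as given, and the function name is hypothetical.

```python
import math


def yeo_johnson(x, lam):
    """Yeo-Johnson transform of one value for a given lambda."""
    if x >= 0:
        if lam == 0:
            return math.log1p(x)                   # log(x + 1)
        return ((x + 1) ** lam - 1) / lam
    if lam == 2:
        return -math.log1p(-x)                     # -log(-x + 1)
    return -((1 - x) ** (2 - lam) - 1) / (2 - lam)


lam = 0.5                                          # illustrative value, not a fitted one
values = [-3.0, 0.0, 3.0]
transformed = [yeo_johnson(v, lam) for v in values]
```

Zero always maps to zero and the sign of each value is preserved. With lambda equal to 1 both branches reduce to the identity.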

### Power Transformation

Applies power transformation (x^exp) to numerical variables.

```python { .api }
class PowerTransformer:
    def __init__(self, variables=None, exp=0.5):
        """
        Initialize PowerTransformer.

        Parameters:
        - variables (list): List of numerical variables to transform. If None, selects all numerical variables
        - exp (int/float): Exponent for power transformation, applied to every selected variable. Defaults to 0.5 (square root)
        """

    def fit(self, X, y=None):
        """
        Validate input data (no parameters learned).

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series, optional): Target variable (not used)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Apply power transformation to variables.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with power-transformed variables
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""

    def inverse_transform(self, X):
        """
        Convert back to original representation using x^(1/exp).

        Parameters:
        - X (pandas.DataFrame): Dataset with power-transformed values

        Returns:
        - pandas.DataFrame: Dataset with original scale restored
        """
```

**Usage Example**:
```python
from feature_engine.transformation import PowerTransformer

# Square root transformation (the default, exp=0.5)
transformer = PowerTransformer()
df_transformed = transformer.fit_transform(df)

# Square transformation
transformer = PowerTransformer(exp=2)
df_transformed = transformer.fit_transform(df)

# Inverse transformation (applies x ** (1 / exp))
df_original = transformer.inverse_transform(df_transformed)

# Different exponents per variable: chain one transformer per exponent
square = PowerTransformer(variables=['price'], exp=2)
cube = PowerTransformer(variables=['volume'], exp=3)
df_mixed = cube.fit_transform(square.fit_transform(df))
```
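
One caveat about inverting a power transformation is easiest to see with plain arithmetic. The sketch below is a pure-Python illustration with hypothetical helper names and data, not the library's code; it assumes the inverse is computed as `x ** (1 / exp)`.

```python
def power(values, exp):
    """Raise every value to the given exponent."""
    return [x ** exp for x in values]


def inverse_power(values, exp):
    """Undo the power transformation with the reciprocal exponent."""
    return [x ** (1 / exp) for x in values]


# Non-negative data round-trips cleanly
positive = [1.0, 4.0, 9.0]
roots = power(positive, 0.5)
restored = inverse_power(roots, 0.5)

# An even exponent discards the sign, so the inverse cannot recover it
mixed = [-3.0, 2.0]
squared = power(mixed, 2)
recovered = inverse_power(squared, 2)   # the first value comes back positive
```

If a variable contains negative values and the original scale matters downstream, an odd integer exponent or the Yeo-Johnson transformation avoids this loss of sign.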

### Reciprocal Transformation

Applies reciprocal transformation (1/x) to numerical variables.

```python { .api }
class ReciprocalTransformer:
    def __init__(self, variables=None):
        """
        Initialize ReciprocalTransformer.

        Parameters:
        - variables (list): List of numerical variables to transform. If None, selects all numerical variables
        """

    def fit(self, X, y=None):
        """
        Validate that variables don't contain zeros (no parameters learned).

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series, optional): Target variable (not used)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Apply reciprocal transformation (1/x) to variables.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with reciprocal-transformed variables
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""

    def inverse_transform(self, X):
        """
        Convert back to original representation using reciprocal (1/x).

        Parameters:
        - X (pandas.DataFrame): Dataset with reciprocal-transformed values

        Returns:
        - pandas.DataFrame: Dataset with original scale restored
        """
```

**Usage Example**:
```python
from feature_engine.transformation import ReciprocalTransformer

# Reciprocal transformation (1/x)
transformer = ReciprocalTransformer()
df_transformed = transformer.fit_transform(df)

# Inverse transformation (also 1/x)
df_original = transformer.inverse_transform(df_transformed)
```

## Usage Patterns

### Selecting Appropriate Transformations

```python
import matplotlib.pyplot as plt
from scipy import stats

# Assess data distribution before transformation
def assess_normality(data, variable):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

    # Histogram
    ax1.hist(data[variable], bins=30)
    ax1.set_title(f'{variable} Distribution')

    # Q-Q plot
    stats.probplot(data[variable], dist="norm", plot=ax2)
    ax2.set_title(f'{variable} Q-Q Plot')

    plt.tight_layout()
    plt.show()

    # Shapiro-Wilk test
    stat, p_value = stats.shapiro(data[variable].dropna())
    print(f"Shapiro-Wilk test p-value: {p_value}")

# Test different transformations
from feature_engine.transformation import LogTransformer, BoxCoxTransformer

transformers = {
    'log': LogTransformer(),
    'boxcox': BoxCoxTransformer()
}

for name, transformer in transformers.items():
    try:
        df_transformed = transformer.fit_transform(df)
        print(f"{name} transformation successful")
    except Exception as e:
        print(f"{name} transformation failed: {e}")
```

### Pipeline Integration

```python
from sklearn.pipeline import Pipeline
from feature_engine.imputation import MeanMedianImputer
from feature_engine.transformation import LogCpTransformer
from sklearn.preprocessing import StandardScaler

# Preprocessing pipeline with transformation
pipeline = Pipeline([
    ('imputer', MeanMedianImputer()),
    ('transformer', LogCpTransformer(C='auto')),
    ('scaler', StandardScaler())
])

# StandardScaler is the last step, so the result is a NumPy array by default
df_processed = pipeline.fit_transform(df)
```

## Common Attributes

All transformers in this module expose these attributes after fitting:

- `variables_` (list): Variables that will be transformed
- `n_features_in_` (int): Number of features in the training set

Transformer-specific attributes:
- `C_` (dict): Constant C values per variable (LogCpTransformer)
- `lambda_dict_` (dict): Lambda parameters per variable (BoxCoxTransformer, YeoJohnsonTransformer)
- `exp` (int/float): Exponent as passed to the constructor; PowerTransformer learns no parameters, so there is no fitted counterpart