# Stateful Transforms

Transform functions that maintain state across data-processing operations. These transforms remember characteristics of the training data and apply consistent transformations to new data, which is essential for preprocessing in statistical modeling.

## Capabilities

### Stateful Transform Decorator

Creates stateful transform callable objects from classes implementing the stateful transform protocol.

```python { .api }
def stateful_transform(class_):
    """
    Create a stateful transform callable from a class implementing the
    stateful transform protocol.

    Parameters:
    - class_: A class implementing the stateful transform protocol with methods:
      - __init__(): Initialize the transform
      - memorize_chunk(input_data): Process data during the learning phase
      - memorize_finish(): Finalize the learning phase
      - transform(input_data): Apply the transformation to data

    Returns:
    Callable transform object that can be used in formulas
    """
```
#### Usage Examples

```python
import patsy
import numpy as np

# Define a custom stateful transform class that rescales data by the
# largest absolute value seen during training
class CustomScale:
    def __init__(self):
        self._max = None
        self.scale_factor = None

    def memorize_chunk(self, input_data):
        # Accumulate data statistics during training
        chunk_max = np.max(np.abs(input_data))
        if self._max is None or chunk_max > self._max:
            self._max = chunk_max

    def memorize_finish(self):
        # Finalize computation after seeing all training data
        self.scale_factor = 1.0 / self._max

    def transform(self, input_data):
        # Apply the learned scaling consistently to new data
        return np.asarray(input_data) * self.scale_factor

# Create the stateful transform
custom_scale = patsy.stateful_transform(CustomScale)

# Use in formulas (conceptually)
# design = patsy.dmatrix("custom_scale(x)", data)
```
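The memorize/transform split exists so statistics can be gathered incrementally before any transformation happens. A minimal sketch of the same protocol outside patsy (the `ChunkedMean` class and its chunk sizes are illustrative, not part of patsy's API):

```python
# ChunkedMean mimics center(): it learns the mean incrementally across
# chunks, then subtracts it from any data it is given.
class ChunkedMean:
    def __init__(self):
        self._total = 0.0
        self._count = 0
        self.mean = None

    def memorize_chunk(self, input_data):
        # Accumulate sufficient statistics from each training chunk
        self._total += sum(input_data)
        self._count += len(input_data)

    def memorize_finish(self):
        # Finalize the learned parameter once all chunks are seen
        self.mean = self._total / self._count

    def transform(self, input_data):
        # Apply the learned mean consistently to any data
        return [v - self.mean for v in input_data]

t = ChunkedMean()
t.memorize_chunk([1, 2, 3])
t.memorize_chunk([4, 5])
t.memorize_finish()
print(t.mean)               # 3.0
print(t.transform([3, 6]))  # [0.0, 3.0]
```

Note that the result is identical no matter how the training data is split into chunks, which is exactly the property patsy relies on.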
### Centering Transform

Subtracts the mean from data, centering it around zero while preserving the scale.

```python { .api }
def center(x):
    """
    Stateful transform that centers input data by subtracting the mean.

    Parameters:
    - x: Array-like data to center

    Returns:
    Array with same shape as input, centered around zero

    Notes:
    - For multi-column input, centers each column separately
    - Equivalent to standardize(x, rescale=False)
    - State: remembers the mean of the training data
    """
```
#### Usage Examples

```python
import patsy
import numpy as np
import pandas as pd

# Sample data
data = pd.DataFrame({
    'x': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'y': [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
})

# Center a variable in a formula
design = patsy.dmatrix("center(x)", data)
print(f"Original mean: {np.mean(data['x'])}")
# Column 0 is the intercept, so look at column 1 for center(x)
print(f"Centered mean: {np.mean(design[:, 1])}")  # Should be close to 0

# Center multiple variables
design = patsy.dmatrix("center(x) + center(y)", data)

# Complete model with centering
y_matrix, X_matrix = patsy.dmatrices("y ~ center(x)", data)

# Centering preserves relationships but changes the intercept's interpretation
print("Design matrix mean by column:", np.mean(X_matrix, axis=0))
```
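The intercept-interpretation point can be made concrete: when the predictor is centered, the fitted intercept equals the mean of the response. A plain-NumPy check (using `np.polyfit` rather than patsy):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Least-squares line on centered x: the intercept becomes mean(y)
slope, intercept = np.polyfit(x - x.mean(), y, 1)
print(round(float(intercept), 6))  # 6.0, which equals y.mean()
```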
### Standardization Transform

Centers data and scales it to unit variance (z-score standardization).

```python { .api }
def standardize(x, center=True, rescale=True, ddof=0):
    """
    Stateful transform that standardizes input data (z-score normalization).

    Parameters:
    - x: Array-like data to standardize
    - center (bool): Whether to subtract the mean (default: True)
    - rescale (bool): Whether to divide by the standard deviation (default: True)
    - ddof (int): Delta degrees of freedom for the standard deviation (default: 0)

    Returns:
    Array with same shape as input, standardized

    Notes:
    - ddof=0 gives the maximum likelihood estimate (divides by n)
    - ddof=1 gives the unbiased estimate (divides by n - 1)
    - For multi-column input, standardizes each column separately
    - State: remembers the mean and standard deviation of the training data
    """
```
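To make the ddof options concrete, here is the arithmetic standardize() performs for a single column, in plain NumPy (illustrative only; patsy also handles multi-column and chunked input):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# ddof=0 divides by n; ddof=1 divides by n - 1
z_mle = (x - x.mean()) / x.std(ddof=0)
z_unbiased = (x - x.mean()) / x.std(ddof=1)

print(round(float(x.std(ddof=0)), 4))      # 1.4142 (sqrt of 2)
print(round(float(x.std(ddof=1)), 4))      # 1.5811 (sqrt of 2.5)
print(round(float(z_mle.std(ddof=0)), 4))  # 1.0
```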
#### Usage Examples

```python
import patsy
import numpy as np
import pandas as pd

# Sample data with different scales
data = pd.DataFrame({
    'small': [0.1, 0.2, 0.3, 0.4, 0.5],
    'large': [100, 200, 300, 400, 500],
    'y': [1, 2, 3, 4, 5]
})

# Standardize variables to have mean 0, std 1
design = patsy.dmatrix("standardize(small) + standardize(large)", data)
print("Column means:", np.mean(design, axis=0))  # ~0 for each column after the intercept
print("Column stds:", np.std(design, axis=0))    # ~1 for each column after the intercept

# Only center, without rescaling
design = patsy.dmatrix("standardize(small, rescale=False)", data)

# Only rescale, without centering
design = patsy.dmatrix("standardize(small, center=False)", data)

# Use the unbiased standard deviation (ddof=1)
design = patsy.dmatrix("standardize(small, ddof=1)", data)

# Complete model with standardization
y_matrix, X_matrix = patsy.dmatrices("y ~ standardize(small) + standardize(large)", data)
```
### Scale Transform

Alias for the standardize function, providing the same functionality.

```python { .api }
def scale(x, ddof=0):
    """
    Alias for the standardize() function.

    Equivalent to standardize(x, center=True, rescale=True, ddof=ddof).

    Parameters:
    - x: Array-like data to scale
    - ddof (int): Delta degrees of freedom for the standard deviation

    Returns:
    Standardized array (mean 0, standard deviation 1)
    """
```
#### Usage Examples

```python
import patsy
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'x': [10, 20, 30, 40, 50],
    'y': [1, 4, 9, 16, 25]
})

# scale() is equivalent to standardize()
design1 = patsy.dmatrix("scale(x)", data)
design2 = patsy.dmatrix("standardize(x)", data)
print("Designs are equal:", np.allclose(design1, design2))

# Complete model using scale
y_matrix, X_matrix = patsy.dmatrices("y ~ scale(x)", data)
```
## Transform Behavior and State

### Stateful Nature

Stateful transforms work in two phases:

1. **Learning phase** (during initial matrix construction):
   - `memorize_chunk()`: Process training data chunks
   - `memorize_finish()`: Finalize parameter computation

2. **Transform phase** (during application to new data):
   - `transform()`: Apply the learned parameters to new data
### Consistent Application

```python
import patsy

# Training data
train_data = {'x': [1, 2, 3, 4, 5]}
train_design = patsy.dmatrix("standardize(x)", train_data)

# The standardize transform has memorized the mean and std of the training
# data; that state lives in the design matrix's DesignInfo. Apply it
# consistently to new data with build_design_matrices:
test_data = {'x': [1.5, 2.5, 3.5]}
(test_design,) = patsy.build_design_matrices([train_design.design_info], test_data)
# test_design uses the training mean/std, not statistics from test_data
```
### Integration with Incremental Processing

Stateful transforms work with Patsy's incremental processing for large datasets:

```python
import patsy

def data_chunks():
    # Generator yielding data chunks
    for i in range(0, 10000, 1000):
        yield {'x': list(range(i, i + 1000))}

# Learn transform state incrementally; incr_dbuilder iterates over the
# chunks without loading all the data at once
builder = patsy.incr_dbuilder("standardize(x)", data_chunks)

# Apply to new data using the learned parameters
new_data = {'x': [5000, 5001, 5002]}
(design,) = patsy.build_design_matrices([builder], new_data)
```
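A single pass over the chunks suffices here because the mean and (ddof=0) standard deviation can be recovered from running sums. The sketch below shows that idea; it is an assumption about the concept, not patsy's actual implementation, which uses a more numerically stable update:

```python
import numpy as np

# Accumulate count, sum, and sum of squares chunk by chunk
count, total, total_sq = 0, 0.0, 0.0
chunks = [np.arange(0, 1000), np.arange(1000, 2000), np.arange(2000, 3000)]
for chunk in chunks:
    count += chunk.size
    total += chunk.sum()
    total_sq += (chunk.astype(float) ** 2).sum()

mean = total / count
std = np.sqrt(total_sq / count - mean ** 2)  # ddof=0

# Matches a single pass over the full data
full = np.concatenate(chunks)
print(bool(np.isclose(mean, full.mean())))  # True
print(bool(np.isclose(std, full.std())))    # True
```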
## Advanced Transform Usage

### Multiple Transforms

```python
# Nest transforms (note: centering already-standardized data is redundant,
# since standardized data is already centered)
design = patsy.dmatrix("center(standardize(x))", data)

# Apply different transforms to different variables
design = patsy.dmatrix("center(x1) + standardize(x2) + scale(x3)", data)
```
### Custom Transform Development

```python
import numpy as np
import patsy

class RobustScale:
    """Custom stateful transform using the median and MAD instead of mean and std."""

    def __init__(self):
        self.median = None
        self.mad = None

    def memorize_chunk(self, input_data):
        # Simplified: uses only the first chunk. In practice you would
        # accumulate the raw data (or sufficient statistics) across chunks
        # and compute the median/MAD in memorize_finish().
        data = np.asarray(input_data)
        if self.median is None:
            self.median = np.median(data)
            self.mad = np.median(np.abs(data - self.median))

    def memorize_finish(self):
        # Finalize computation if needed
        pass

    def transform(self, input_data):
        data = np.asarray(input_data)
        return (data - self.median) / (1.4826 * self.mad)  # 1.4826 for consistency with a normal std

# Create the transform
robust_scale = patsy.stateful_transform(RobustScale)
```
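Why prefer the median and MAD? A quick numeric check, independent of patsy, of how a single outlier affects the learned location and scale:

```python
import numpy as np

def mad_center(data):
    # Median and MAD-based scale (1.4826 for consistency with a normal std)
    med = np.median(data)
    mad = np.median(np.abs(data - med))
    return float(med), float(1.4826 * mad)

clean = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
dirty = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])

print(mad_center(clean))  # (3.0, 1.4826)
print(mad_center(dirty))  # (3.0, 1.4826) -- unchanged by the outlier
print(dirty.mean())       # 202.0 -- the mean is pulled hard
```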
### Transform with Model Fitting

```python
import patsy
from sklearn.linear_model import LinearRegression

# Create standardized design matrices
data = {'x': [1, 2, 3, 4, 5], 'y': [2, 4, 6, 8, 10]}
y, X = patsy.dmatrices("y ~ standardize(x)", data)

# Fit model
model = LinearRegression(fit_intercept=False)
model.fit(X, y.ravel())

# Reuse the memorized standardization parameters for new predictions.
# (Calling patsy.dmatrix on new_data directly would re-learn the mean/std
# from new_data, giving inconsistent predictions.)
new_data = {'x': [1.5, 2.5, 3.5]}
(X_new,) = patsy.build_design_matrices([X.design_info], new_data)
predictions = model.predict(X_new)
```
```