# Stateful Transforms

Transform functions that maintain state across data processing operations. These transforms remember characteristics of the training data and apply consistent transformations to new data, which is essential for preprocessing in statistical modeling.
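To make the train/apply split concrete, here is a minimal sketch using the built-in `center()` transform and `patsy.build_design_matrices()` (both covered below): the mean learned from the training data is reused when transforming new data.

```python
import numpy as np
import patsy

train = {'x': [1.0, 2.0, 3.0, 4.0]}         # mean = 2.5
design = patsy.dmatrix("center(x)", train)  # learns the mean during construction

# Transform new data with the *training* mean, not the new data's own mean
new = {'x': [10.0, 20.0]}
(new_design,) = patsy.build_design_matrices([design.design_info], new)
print(np.asarray(new_design))  # column 1 holds x - 2.5 -> 7.5 and 17.5
```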

## Capabilities

### Stateful Transform Decorator

Creates stateful transform callable objects from classes implementing the stateful transform protocol.

```python { .api }
def stateful_transform(class_):
    """
    Create a stateful transform callable from a class implementing the
    stateful transform protocol.

    Parameters:
    - class_: A class implementing the stateful transform protocol with methods:
        - __init__(): Initialize the transform
        - memorize_chunk(input_data): Process data during the learning phase
        - memorize_finish(): Finalize the learning phase
        - transform(input_data): Apply the transformation to data

    Returns:
    Callable transform object that can be used in formulas
    """
```

#### Usage Examples

```python
import patsy
import numpy as np

# Define a custom stateful transform class
class CustomScale:
    def __init__(self):
        self._max_abs = 0.0
        self.scale_factor = None

    def memorize_chunk(self, input_data):
        # Accumulate data statistics during training
        self._max_abs = max(self._max_abs, float(np.max(np.abs(input_data))))

    def memorize_finish(self):
        # Finalize computation after seeing all training data
        self.scale_factor = 1.0 / self._max_abs if self._max_abs else 1.0

    def transform(self, input_data):
        # Apply the learned scaling consistently to new data
        return np.asarray(input_data) * self.scale_factor

# Create the stateful transform
custom_scale = patsy.stateful_transform(CustomScale)

# Use in formulas (conceptually)
# design = patsy.dmatrix("custom_scale(x)", data)
```

### Centering Transform

Subtracts the mean from data, centering it around zero while preserving the scale.

```python { .api }
def center(x):
    """
    Stateful transform that centers input data by subtracting the mean.

    Parameters:
    - x: Array-like data to center

    Returns:
    Array with same shape as input, centered around zero

    Notes:
    - For multi-column input, centers each column separately
    - Equivalent to standardize(x, rescale=False)
    - State: Remembers the mean of the training data
    """
```

#### Usage Examples

```python
import patsy
import numpy as np
import pandas as pd

# Sample data
data = pd.DataFrame({
    'x': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'y': [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
})

# Center a variable in a formula
design = patsy.dmatrix("center(x)", data)
print(f"Original mean: {np.mean(data['x'])}")
# Column 0 is the intercept; column 1 is the centered variable
print(f"Centered mean: {np.mean(design[:, 1])}")  # Should be close to 0

# Center multiple variables
design = patsy.dmatrix("center(x) + center(y)", data)

# Complete model with centering
y_matrix, X_matrix = patsy.dmatrices("y ~ center(x)", data)

# Centering preserves relationships but changes the intercept's interpretation
print("Design matrix mean by column:", np.mean(X_matrix, axis=0))
```

### Standardization Transform

Centers data and scales to unit variance (z-score standardization).

```python { .api }
def standardize(x, center=True, rescale=True, ddof=0):
    """
    Stateful transform that standardizes input data (z-score normalization).

    Parameters:
    - x: Array-like data to standardize
    - center (bool): Whether to subtract the mean (default: True)
    - rescale (bool): Whether to divide by the standard deviation (default: True)
    - ddof (int): Delta degrees of freedom for the standard deviation computation (default: 0)

    Returns:
    Array with same shape as input, standardized

    Notes:
    - ddof=0 gives the maximum likelihood estimate (divides by n)
    - ddof=1 gives the unbiased estimate (divides by n-1)
    - For multi-column input, standardizes each column separately
    - State: Remembers the mean and standard deviation of the training data
    """
```

#### Usage Examples

```python
import patsy
import numpy as np
import pandas as pd

# Sample data with different scales
data = pd.DataFrame({
    'small': [0.1, 0.2, 0.3, 0.4, 0.5],
    'large': [100, 200, 300, 400, 500],
    'y': [1, 2, 3, 4, 5]
})

# Standardize variables to have mean 0, std 1
design = patsy.dmatrix("standardize(small) + standardize(large)", data)
# Skip column 0, which is the intercept
print("Standardized means:", np.mean(design[:, 1:], axis=0))  # Should be ~0
print("Standardized stds:", np.std(design[:, 1:], axis=0))    # Should be ~1

# Only center, without rescaling
design = patsy.dmatrix("standardize(small, rescale=False)", data)

# Only rescale, without centering
design = patsy.dmatrix("standardize(small, center=False)", data)

# Use the unbiased standard deviation (ddof=1)
design = patsy.dmatrix("standardize(small, ddof=1)", data)

# Complete model with standardization
y_matrix, X_matrix = patsy.dmatrices("y ~ standardize(small) + standardize(large)", data)
```

### Scale Transform

Alias for the standardize function, providing the same functionality.

```python { .api }
def scale(x, ddof=0):
    """
    Alias for the standardize() function.

    Equivalent to standardize(x, center=True, rescale=True, ddof=ddof)

    Parameters:
    - x: Array-like data to scale
    - ddof (int): Delta degrees of freedom for the standard deviation computation

    Returns:
    Standardized array (mean 0, standard deviation 1)
    """
```

#### Usage Examples

```python
import patsy
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'x': [10, 20, 30, 40, 50],
    'y': [1, 4, 9, 16, 25]
})

# scale() is equivalent to standardize()
design1 = patsy.dmatrix("scale(x)", data)
design2 = patsy.dmatrix("standardize(x)", data)
print("Designs are equal:", np.allclose(design1, design2))

# Complete model using scale
y_matrix, X_matrix = patsy.dmatrices("y ~ scale(x)", data)
```

## Transform Behavior and State

### Stateful Nature

Stateful transforms work in two phases:

1. **Learning Phase** (during initial matrix construction):
   - `memorize_chunk()`: Process training data chunks
   - `memorize_finish()`: Finalize parameter computation

2. **Transform Phase** (during application to new data):
   - `transform()`: Apply learned parameters to new data
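The two phases can be driven by hand. This sketch uses a hypothetical `MeanCenter` class (not part of patsy) that accumulates a running sum across chunks during the learning phase:

```python
import numpy as np

# Hypothetical minimal transform class following the protocol
class MeanCenter:
    def __init__(self):
        self._sum = 0.0
        self._count = 0
        self.mean = None

    def memorize_chunk(self, data):
        # Learning phase: accumulate statistics chunk by chunk
        arr = np.asarray(data, dtype=float)
        self._sum += arr.sum()
        self._count += arr.size

    def memorize_finish(self):
        # Learning phase: finalize the learned parameter
        self.mean = self._sum / self._count

    def transform(self, data):
        # Transform phase: apply the learned parameter
        return np.asarray(data, dtype=float) - self.mean

t = MeanCenter()
t.memorize_chunk([1.0, 2.0])  # chunk 1
t.memorize_chunk([3.0, 4.0])  # chunk 2
t.memorize_finish()           # mean is now 2.5
print(t.transform([5.0]))     # -> [2.5]
```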

### Consistent Application

```python
import patsy

# Training data
train_data = {'x': [1, 2, 3, 4, 5]}
design = patsy.dmatrix("standardize(x)", train_data)

# The standardize transform has learned the mean and std from the training
# data; build_design_matrices() applies those same parameters to new data
test_data = {'x': [1.5, 2.5, 3.5]}
(test_design,) = patsy.build_design_matrices([design.design_info], test_data)
```

### Integration with Incremental Processing

Stateful transforms work with Patsy's incremental processing for large datasets:

```python
import patsy

def data_chunks():
    # Generator yielding data chunks
    for i in range(0, 10000, 1000):
        yield {'x': list(range(i, i + 1000))}

# Learn the transform parameters incrementally, one chunk at a time
design_info = patsy.incr_dbuilder("standardize(x)", data_chunks)

# Apply to new data using the learned parameters
new_data = {'x': [5000, 5001, 5002]}
(design,) = patsy.build_design_matrices([design_info], new_data)
```

## Advanced Transform Usage

### Multiple Transforms

```python
import patsy
import pandas as pd

data = pd.DataFrame({
    'x': [1.0, 2.0, 3.0],
    'x1': [1.0, 2.0, 3.0],
    'x2': [10.0, 20.0, 30.0],
    'x3': [5.0, 6.0, 7.0],
})

# Nest transforms (note: centering already-standardized data is redundant,
# since standardize() already centers)
design = patsy.dmatrix("center(standardize(x))", data)

# Apply different transforms to different variables
design = patsy.dmatrix("center(x1) + standardize(x2) + scale(x3)", data)
```

### Custom Transform Development

```python
import numpy as np
import patsy

class RobustScale:
    """Custom stateful transform using the median and MAD instead of the mean and std"""

    def __init__(self):
        self.median = None
        self.mad = None

    def memorize_chunk(self, input_data):
        # In practice, you'd accumulate statistics across chunks;
        # for simplicity this uses only the first chunk seen
        data = np.asarray(input_data)
        if self.median is None:
            self.median = np.median(data)
            self.mad = np.median(np.abs(data - self.median))

    def memorize_finish(self):
        # Finalize computation if needed
        pass

    def transform(self, input_data):
        data = np.asarray(input_data)
        return (data - self.median) / (1.4826 * self.mad)  # 1.4826 for consistency with the normal std

# Create the transform
robust_scale = patsy.stateful_transform(RobustScale)
```

### Transform with Model Fitting

```python
import patsy
from sklearn.linear_model import LinearRegression

# Create standardized design matrices
data = {'x': [1, 2, 3, 4, 5], 'y': [2, 4, 6, 8, 10]}
y, X = patsy.dmatrices("y ~ standardize(x)", data)

# Fit model
model = LinearRegression(fit_intercept=False)
model.fit(X, y.ravel())

# Reuse the learned standardization parameters for predictions on new data.
# Calling dmatrix() directly on new_data would re-learn the mean and std
# from the new data, so rebuild from the trained design info instead:
new_data = {'x': [1.5, 2.5, 3.5]}
(X_new,) = patsy.build_design_matrices([X.design_info], new_data)
predictions = model.predict(X_new)
```