# High-Level Interface

These functions are the main entry points for creating design matrices from formula strings. They handle the complete workflow, from formula parsing to matrix construction, and are the most convenient interface for typical statistical modeling tasks.

## Capabilities

### Single Design Matrix Construction

Constructs a single design matrix from a formula specification, commonly used for creating predictor matrices in regression models.

```python { .api }
def dmatrix(formula_like, data={}, eval_env=0, NA_action="drop", return_type="matrix"):
    """
    Construct a single design matrix given a formula_like and data.

    Parameters:
    - formula_like: Formula string, ModelDesc, DesignInfo, explicit matrix, or object with __patsy_get_model_desc__ method
    - data (dict-like): Dict-like object to look up variables referenced in formula
    - eval_env (int or EvalEnvironment): Environment for variable lookup (0=caller frame, 1=caller's caller, etc.)
    - NA_action (str or NAAction): Strategy for handling missing data ("drop", "raise", or NAAction object)
    - return_type (str): "matrix" for numpy arrays or "dataframe" for pandas DataFrames

    Returns:
    DesignMatrix (numpy.ndarray subclass with metadata) or pandas DataFrame
    """
```

#### Usage Examples

```python
import patsy
import pandas as pd

# A DataFrame supplies NumPy-backed columns, so arithmetic such as x**2 works
# (a plain Python list would raise a TypeError inside I(x**2))
data = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [2, 4, 6, 8]})

# Simple linear terms
design = patsy.dmatrix("x", data)

# Polynomial terms with I() function
design = patsy.dmatrix("x + I(x**2)", data)

# Categorical variables
data = pd.DataFrame({
    'x': [1, 2, 3, 4],
    'treatment': ['A', 'B', 'A', 'B'],
    'response': [1, 2, 3, 4]
})
design = patsy.dmatrix("C(treatment)", data)

# Interactions (x must be present in data alongside treatment)
design = patsy.dmatrix("x * C(treatment)", data)
```

### Dual Design Matrix Construction

Constructs both outcome and predictor design matrices from a formula specification, the standard approach for regression modeling.

```python { .api }
def dmatrices(formula_like, data={}, eval_env=0, NA_action="drop", return_type="matrix"):
    """
    Construct two design matrices given a formula_like and data.

    This function requires a two-sided formula (outcome ~ predictors) and returns
    two matrices: the outcome (y) and predictor (X) matrices.

    Parameters:
    - formula_like: Two-sided formula string or equivalent (must specify both outcome and predictors)
    - data (dict-like): Dict-like object to look up variables referenced in formula
    - eval_env (int or EvalEnvironment): Environment for variable lookup
    - NA_action (str or NAAction): Strategy for handling missing data
    - return_type (str): "matrix" for numpy arrays or "dataframe" for pandas DataFrames

    Returns:
    Tuple of (outcome_matrix, predictor_matrix) - both DesignMatrix objects or DataFrames
    """
```

#### Usage Examples

```python
import patsy
import pandas as pd

# Basic regression model
data = pd.DataFrame({
    'y': [1, 2, 3, 4, 5],
    'x1': [1, 2, 3, 4, 5],
    'x2': [2, 4, 6, 8, 10]
})

# Two-sided formula
y, X = patsy.dmatrices("y ~ x1 + x2", data)
print("Outcome shape:", y.shape)
print("Predictors shape:", X.shape)

# More complex model with interactions and transformations
y, X = patsy.dmatrices("y ~ x1 * x2 + I(x1**2)", data)

# Categorical predictors
data = pd.DataFrame({
    'y': [1, 2, 3, 4, 5, 6],
    'x': [1, 2, 3, 4, 5, 6],
    'group': ['A', 'A', 'B', 'B', 'C', 'C']
})
y, X = patsy.dmatrices("y ~ x + C(group)", data)
```
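
To show how the two matrices feed into a regression, here is a minimal sketch that fits ordinary least squares with `numpy.linalg.lstsq`. The toy data (constructed so that `y = 1 + 2*x` exactly) and the variable names are assumptions chosen for illustration; patsy itself only builds the matrices and does not fit models.

```python
import numpy as np
import pandas as pd
import patsy

# Illustrative data where y = 1 + 2*x exactly (an assumption for this sketch)
data = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0], "y": [1.0, 3.0, 5.0, 7.0]})

# y is a single-column matrix; X holds an Intercept column followed by x
y, X = patsy.dmatrices("y ~ x", data)

# Solve the least-squares problem on the plain ndarray views
coef, residuals, rank, _ = np.linalg.lstsq(np.asarray(X), np.asarray(y), rcond=None)
intercept, slope = coef.ravel()

# Column names come from the metadata attached to the design matrix
column_names = X.design_info.column_names
print(dict(zip(column_names, (intercept, slope))))
```

Because the intercept column is added automatically, the fitted coefficients line up with `X.design_info.column_names` without any manual bookkeeping.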

### Incremental Design Matrix Builders

For large datasets that don't fit in memory, these functions create builders that can process data incrementally.

```python { .api }
def incr_dbuilder(formula_like, data_iter_maker, eval_env=0, NA_action="drop"):
    """
    Construct a design matrix builder incrementally from a large data set.

    Parameters:
    - formula_like: Formula string, ModelDesc, DesignInfo, or object with __patsy_get_model_desc__ method (explicit matrices not allowed)
    - data_iter_maker: Zero-argument callable returning iterator over dict-like data objects
    - eval_env (int or EvalEnvironment): Environment for variable lookup
    - NA_action (str or NAAction): Strategy for handling missing data

    Returns:
    Builder object (a DesignInfo in patsy 0.4.0 and later, a DesignMatrixBuilder in
    earlier releases) to pass to build_design_matrices() along with each chunk of data
    """

def incr_dbuilders(formula_like, data_iter_maker, eval_env=0, NA_action="drop"):
    """
    Construct two design matrix builders incrementally from a large data set.

    This is the incremental version of dmatrices(), for processing large datasets
    that require multiple passes or don't fit in memory.

    Parameters:
    - formula_like: Two-sided formula string or equivalent
    - data_iter_maker: Zero-argument callable returning iterator over dict-like data objects
    - eval_env (int or EvalEnvironment): Environment for variable lookup
    - NA_action (str or NAAction): Strategy for handling missing data

    Returns:
    Tuple of (outcome_builder, predictor_builder) - both builder objects as described
    for incr_dbuilder()
    """
```

#### Usage Examples

```python
import numpy as np
import patsy

# Function that returns an iterator over data chunks
def data_chunks():
    # This could read from a database, files, etc.
    for i in range(0, 10000, 1000):
        x = np.arange(i, i + 1000)  # NumPy arrays, so x**2 is elementwise
        yield {'x': x, 'y': 2 * x}

# Build incremental design matrix builder
builder = patsy.incr_dbuilder("x + I(x**2)", data_chunks)

# Use the builder to process new data; build_design_matrices() takes a list
# of builders and returns a list of matrices
new_data = {'x': np.array([1, 2, 3]), 'y': np.array([2, 4, 6])}
design_matrix = patsy.build_design_matrices([builder], new_data)[0]

# For two-sided formulas
y_builder, X_builder = patsy.incr_dbuilders("y ~ x + I(x**2)", data_chunks)
```
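
The incremental pass matters most for transforms that need statistics over the whole dataset. The sketch below uses patsy's built-in `center()` transform to illustrate this: the mean is accumulated across all chunks and then applied to new data. The `chunks` generator and its three small arrays are hypothetical stand-ins for a real chunked data source.

```python
import numpy as np
import patsy

def chunks():
    # Hypothetical data source: three chunks whose overall mean of x is 5.0
    yield {"x": np.array([1.0, 2.0, 3.0])}
    yield {"x": np.array([4.0, 5.0, 6.0])}
    yield {"x": np.array([7.0, 8.0, 9.0])}

# The builder memorizes the mean of x across every chunk
design_info = patsy.incr_dbuilder("center(x)", chunks)

# New data is centered with the memorized mean, not with its own mean
new_data = {"x": np.array([5.0, 6.0])}
(matrix,) = patsy.build_design_matrices([design_info], new_data)

# Column 0 is the intercept; column 1 is center(x)
centered = np.asarray(matrix)[:, 1]
print(centered)
```

Note that `data_iter_maker` may be called more than once, which is why it must be a callable that returns a fresh iterator rather than an iterator itself.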

## Formula Types

The `formula_like` parameter accepts several types:

- **String formulas**: R-style formula strings like `"y ~ x1 + x2"`
- **ModelDesc objects**: Parsed formula representations
- **DesignInfo objects**: Metadata about matrix structure
- **Explicit matrices**: numpy arrays or pandas DataFrames (a single matrix for `dmatrix`, a tuple of two matrices for `dmatrices`; not accepted by the incremental builders)
- **Objects with a `__patsy_get_model_desc__` method**: Custom formula-like objects
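
As a rough illustration of these input types, the sketch below passes a string, a `ModelDesc`, a `DesignInfo`, and an explicit array to `dmatrix`. The variable names and sample values are assumptions made for the example.

```python
import numpy as np
import patsy

data = {"x1": np.array([1.0, 2.0, 3.0]), "x2": np.array([4.0, 5.0, 6.0])}

# 1. String formula
from_string = patsy.dmatrix("x1 + x2", data)

# 2. ModelDesc parsed ahead of time from the same formula
desc = patsy.ModelDesc.from_formula("x1 + x2")
from_desc = patsy.dmatrix(desc, data)

# 3. DesignInfo taken from an earlier matrix and reused on new data
new_data = {"x1": np.array([10.0]), "x2": np.array([20.0])}
from_info = patsy.dmatrix(from_string.design_info, new_data)

# 4. Explicit matrix: passed through as-is, with metadata attached
from_array = patsy.dmatrix(np.array([[1.0, 2.0], [3.0, 4.0]]))
```

Reusing a `DesignInfo` is the usual way to guarantee that prediction-time data is encoded with the same columns as the training data.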

## Return Types

Functions support two return types via the `return_type` parameter:

- **"matrix"** (default): Returns DesignMatrix objects (numpy.ndarray subclasses with metadata)
- **"dataframe"**: Returns pandas DataFrames (requires pandas installation)
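
A short sketch comparing the two return types on the same formula; the sample data is an assumption for illustration.

```python
import numpy as np
import pandas as pd
import patsy

data = pd.DataFrame({"x": [1.0, 2.0, 3.0]})

# Default: a DesignMatrix, which is a numpy.ndarray subclass
as_matrix = patsy.dmatrix("x", data)

# Alternative: a pandas DataFrame with named columns
as_frame = patsy.dmatrix("x", data, return_type="dataframe")

# Column names live in design_info for the matrix and in .columns for the frame
matrix_columns = as_matrix.design_info.column_names
frame_columns = list(as_frame.columns)
```

The DataFrame form is convenient for inspection, while the matrix form plugs directly into NumPy-based numerical code.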

## Missing Data Handling

The `NA_action` parameter controls missing data handling:

- **"drop"** (default): Remove rows with any missing values
- **"raise"**: Raise an exception if missing values are encountered
- **NAAction object**: Custom missing data handling strategy
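
The three options can be compared on a small array containing one `NaN`. This is a minimal sketch with made-up data; the custom `NAAction` shown here (an empty `NA_types` list, so nothing counts as missing) is just one possible configuration.

```python
import numpy as np
import patsy

data = {"x": np.array([1.0, np.nan, 3.0])}

# "drop" (default): the row containing NaN is removed
dropped = patsy.dmatrix("x", data, NA_action="drop")

# "raise": missing values produce a PatsyError instead
try:
    patsy.dmatrix("x", data, NA_action="raise")
    raised = False
except patsy.PatsyError:
    raised = True

# Custom NAAction: treat nothing as missing, so the NaN row is kept
keep_all = patsy.NAAction(NA_types=[])
kept = patsy.dmatrix("x", data, NA_action=keep_all)
```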