A Python package for describing statistical models and for building design matrices.
The main entry points for creating design matrices from formula strings. These functions handle the complete workflow from formula parsing to matrix construction, providing the most convenient interface for typical statistical modeling tasks.
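As a rough sketch of that workflow, the snippet below builds a design matrix from a one-term formula and inspects the column names patsy attaches to the result. The variable names and toy values are assumptions made for illustration only.

```python
import numpy as np
import patsy

# Toy data (assumed for illustration): a single numeric predictor
data = {"x": np.array([1.0, 2.0, 3.0, 4.0])}

# Parse the formula and build the matrix in one call
design = patsy.dmatrix("x", data)

# The result carries metadata; an intercept column is added by default
column_names = design.design_info.column_names
print(column_names)
print(np.asarray(design).shape)
```

The `design_info` attribute is what distinguishes a DesignMatrix from a plain NumPy array: it records which formula terms produced which columns.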
Constructs a single design matrix from a formula specification, commonly used for creating predictor matrices in regression models.
def dmatrix(formula_like, data={}, eval_env=0, NA_action="drop", return_type="matrix"):
    """
    Construct a single design matrix given a formula_like and data.

    Parameters:
    - formula_like: Formula string, ModelDesc, DesignInfo, explicit matrix, or object with __patsy_get_model_desc__ method
    - data (dict-like): Dict-like object to look up variables referenced in formula
    - eval_env (int or EvalEnvironment): Environment for variable lookup (0=caller frame, 1=caller's caller, etc.)
    - NA_action (str or NAAction): Strategy for handling missing data ("drop", "raise", or NAAction object)
    - return_type (str): "matrix" for numpy arrays or "dataframe" for pandas DataFrames

    Returns:
    DesignMatrix (numpy.ndarray subclass with metadata) or pandas DataFrame
    """

import patsy
import pandas as pd
# Simple linear terms
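# (a DataFrame is used so that x**2 below is evaluated element-wise; a plain list would raise TypeError)
data = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [2, 4, 6, 8]})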
design = patsy.dmatrix("x", data)
# Polynomial terms with I() function
design = patsy.dmatrix("x + I(x**2)", data)
# Categorical variables
# ('x' is included so the interaction example can find it)
data = pd.DataFrame({
    'x': [1, 2, 3, 4],
    'treatment': ['A', 'B', 'A', 'B'],
    'response': [1, 2, 3, 4]
})
design = patsy.dmatrix("C(treatment)", data)
# Interactions
design = patsy.dmatrix("x * C(treatment)", data)

Constructs both outcome and predictor design matrices from a formula specification, the standard approach for regression modeling.
def dmatrices(formula_like, data={}, eval_env=0, NA_action="drop", return_type="matrix"):
    """
    Construct two design matrices given a formula_like and data.

    This function requires a two-sided formula (outcome ~ predictors) and returns
    two matrices: the outcome (y) and predictor (X) matrices.

    Parameters:
    - formula_like: Two-sided formula string or equivalent (must specify both outcome and predictors)
    - data (dict-like): Dict-like object to look up variables referenced in formula
    - eval_env (int or EvalEnvironment): Environment for variable lookup
    - NA_action (str or NAAction): Strategy for handling missing data
    - return_type (str): "matrix" for numpy arrays or "dataframe" for pandas DataFrames

    Returns:
    Tuple of (outcome_matrix, predictor_matrix) - both DesignMatrix objects or DataFrames
    """

import patsy
import pandas as pd
# Basic regression model
data = pd.DataFrame({
    'y': [1, 2, 3, 4, 5],
    'x1': [1, 2, 3, 4, 5],
    'x2': [2, 4, 6, 8, 10]
})
# Two-sided formula
y, X = patsy.dmatrices("y ~ x1 + x2", data)
print("Outcome shape:", y.shape)
print("Predictors shape:", X.shape)
# More complex model with interactions and transformations
y, X = patsy.dmatrices("y ~ x1 * x2 + I(x1**2)", data)
# Categorical predictors
data = pd.DataFrame({
    'y': [1, 2, 3, 4, 5, 6],
    'x': [1, 2, 3, 4, 5, 6],
    'group': ['A', 'A', 'B', 'B', 'C', 'C']
})
y, X = patsy.dmatrices("y ~ x + C(group)", data)

For large datasets that don't fit in memory, these functions create builders that can process data incrementally.
def incr_dbuilder(formula_like, data_iter_maker, eval_env=0, NA_action="drop"):
    """
    Construct a design matrix builder incrementally from a large data set.

    Parameters:
    - formula_like: Formula string, ModelDesc, DesignInfo, or object with __patsy_get_model_desc__ method (explicit matrices not allowed)
    - data_iter_maker: Zero-argument callable returning iterator over dict-like data objects
    - eval_env (int or EvalEnvironment): Environment for variable lookup
    - NA_action (str or NAAction): Strategy for handling missing data

    Returns:
    DesignInfo object (a DesignMatrixBuilder in older patsy releases) that can be passed to build_design_matrices() to build matrices from new data
    """
def incr_dbuilders(formula_like, data_iter_maker, eval_env=0, NA_action="drop"):
    """
    Construct two design matrix builders incrementally from a large data set.

    This is the incremental version of dmatrices(), for processing large datasets
    that require multiple passes or don't fit in memory.

    Parameters:
    - formula_like: Two-sided formula string or equivalent
    - data_iter_maker: Zero-argument callable returning iterator over dict-like data objects
    - eval_env (int or EvalEnvironment): Environment for variable lookup
    - NA_action (str or NAAction): Strategy for handling missing data

    Returns:
    Tuple of (outcome_builder, predictor_builder) - both DesignInfo objects (DesignMatrixBuilder objects in older patsy releases)
    """

import patsy
import numpy as np

# Function that returns an iterator over data chunks
def data_chunks():
    # This could read from a database, files, etc.
    for i in range(0, 10000, 1000):
        # NumPy arrays are used so that x**2 is evaluated element-wise
        yield {'x': np.arange(i, i + 1000),
               'y': 2 * np.arange(i, i + 1000)}

# Build incremental design matrix builder
builder = patsy.incr_dbuilder("x + I(x**2)", data_chunks)
# Use the builder to process new data
new_data = {'x': np.array([1, 2, 3]), 'y': np.array([2, 4, 6])}
# build_design_matrices() takes a list of builders and returns a list of matrices
design_matrix = patsy.build_design_matrices([builder], new_data)[0]
# For two-sided formulas
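y_builder, X_builder = patsy.incr_dbuilders("y ~ x + I(x**2)", data_chunks)

The formula_like parameter accepts several types: a formula string such as "y ~ x1 + x2", a ModelDesc, a DesignInfo, an explicit matrix (not allowed for the incremental builders), or an object with a __patsy_get_model_desc__ method.

Functions support two return types via the return_type parameter: "matrix" (the default) returns DesignMatrix objects, and "dataframe" returns pandas DataFrames.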
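The NA_action parameter controls missing data handling: "drop" (the default) removes rows that contain missing values, "raise" raises an error when missing values are found, and an NAAction object can be passed for custom handling.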
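The return_type and NA_action options can be sketched together on a small data set with one missing value. The column name and values below are assumptions chosen for illustration; only `dmatrix`, its `return_type` and `NA_action` arguments, and `patsy.PatsyError` come from the library.

```python
import numpy as np
import pandas as pd
import patsy

# Toy data (assumed for illustration) with one missing value
data = pd.DataFrame({"x": [1.0, np.nan, 3.0, 4.0]})

# Defaults: return_type="matrix", and NA_action="drop" removes the NaN row
as_matrix = patsy.dmatrix("x", data)
print(type(as_matrix).__name__, as_matrix.shape)

# return_type="dataframe" gives a pandas DataFrame with named columns
as_frame = patsy.dmatrix("x", data, return_type="dataframe")
print(list(as_frame.columns))

# NA_action="raise" refuses to build a matrix from incomplete data
try:
    patsy.dmatrix("x", data, NA_action="raise")
    raised = False
except patsy.PatsyError:
    raised = True
print("raised:", raised)
```

Dropping rows silently is convenient for exploration, while "raise" may be the safer choice in a pipeline where missing values indicate an upstream problem.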