
tessl/pypi-patsy

A Python package for describing statistical models and for building design matrices.


High-Level Interface

The main entry points for creating design matrices from formula strings. These functions handle the complete workflow from formula parsing to matrix construction, providing the most convenient interface for typical statistical modeling tasks.

Capabilities

Single Design Matrix Construction

Constructs a single design matrix from a formula specification, commonly used for creating predictor matrices in regression models.

def dmatrix(formula_like, data={}, eval_env=0, NA_action="drop", return_type="matrix"):
    """
    Construct a single design matrix given a formula_like and data.

    Parameters:
    - formula_like: Formula string, ModelDesc, DesignInfo, explicit matrix, or object with __patsy_get_model_desc__ method
    - data (dict-like): Dict-like object to look up variables referenced in formula
    - eval_env (int or EvalEnvironment): Environment for variable lookup (0=caller frame, 1=caller's caller, etc.)
    - NA_action (str or NAAction): Strategy for handling missing data ("drop", "raise", or NAAction object)
    - return_type (str): "matrix" for numpy arrays or "dataframe" for pandas DataFrames

    Returns:
    DesignMatrix (numpy.ndarray subclass with metadata) or pandas DataFrame
    """

Usage Examples

import numpy as np
import patsy

# Simple linear terms (NumPy arrays, so arithmetic inside formulas is elementwise)
data = {'x': np.array([1, 2, 3, 4]), 'y': np.array([2, 4, 6, 8])}
design = patsy.dmatrix("x", data)

# Polynomial terms with I() function
design = patsy.dmatrix("x + I(x**2)", data)

# Categorical variables
data = {'x': np.array([1, 2, 3, 4]),
        'treatment': ['A', 'B', 'A', 'B'],
        'response': [1, 2, 3, 4]}
design = patsy.dmatrix("C(treatment)", data)

# Interactions (both x and treatment must be available in data)
design = patsy.dmatrix("x * C(treatment)", data)
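
The eval_env lookup described in the parameter list can be illustrated with a short sketch. It assumes the default eval_env=0 and that the code runs at the scope where x is defined; the variable names are illustrative only.

```python
import numpy as np
import patsy

x = np.array([1.0, 2.0, 3.0, 4.0])  # lives in the calling scope, not in `data`

# With the default eval_env=0, names missing from `data` are looked up
# in the frame that calls dmatrix(), so both `x` and `np` resolve here.
design_from_env = patsy.dmatrix("x + np.log(x)")

# Entries in `data` take precedence over the enclosing scope.
design_from_data = patsy.dmatrix("x", {"x": np.array([10.0, 20.0])})

print(design_from_env.design_info.column_names)
print(np.asarray(design_from_data))
```

Passing all variables through data is usually easier to reason about; the environment lookup is mainly a convenience for interactive work.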

Dual Design Matrix Construction

Constructs both outcome and predictor design matrices from a formula specification, the standard approach for regression modeling.

def dmatrices(formula_like, data={}, eval_env=0, NA_action="drop", return_type="matrix"):
    """
    Construct two design matrices given a formula_like and data.
    
    This function requires a two-sided formula (outcome ~ predictors) and returns
    two matrices: the outcome (y) and predictor (X) matrices.

    Parameters:
    - formula_like: Two-sided formula string or equivalent (must specify both outcome and predictors)
    - data (dict-like): Dict-like object to look up variables referenced in formula
    - eval_env (int or EvalEnvironment): Environment for variable lookup
    - NA_action (str or NAAction): Strategy for handling missing data
    - return_type (str): "matrix" for numpy arrays or "dataframe" for pandas DataFrames

    Returns:
    Tuple of (outcome_matrix, predictor_matrix) - both DesignMatrix objects or DataFrames
    """

Usage Examples

import patsy
import pandas as pd

# Basic regression model
data = pd.DataFrame({
    'y': [1, 2, 3, 4, 5],
    'x1': [1, 2, 3, 4, 5],
    'x2': [2, 4, 6, 8, 10]
})

# Two-sided formula
y, X = patsy.dmatrices("y ~ x1 + x2", data)
print("Outcome shape:", y.shape)
print("Predictors shape:", X.shape)

# More complex model with interactions and transformations
y, X = patsy.dmatrices("y ~ x1 * x2 + I(x1**2)", data)

# Categorical predictors
data = pd.DataFrame({
    'y': [1, 2, 3, 4, 5, 6],
    'x': [1, 2, 3, 4, 5, 6],
    'group': ['A', 'A', 'B', 'B', 'C', 'C']
})
y, X = patsy.dmatrices("y ~ x + C(group)", data)
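
As a rough illustration of how the two matrices feed a regression, the sketch below fits ordinary least squares with NumPy. The data are made up so that y = 1 + 2x exactly, and np.linalg.lstsq is only one of several possible solvers.

```python
import numpy as np
import pandas as pd
import patsy

data = pd.DataFrame({
    "y": [1.0, 3.0, 5.0, 7.0, 9.0],
    "x": [0.0, 1.0, 2.0, 3.0, 4.0],
})

y, X = patsy.dmatrices("y ~ x", data)

# Ordinary least squares via NumPy; np.asarray drops the patsy metadata.
coef, *_ = np.linalg.lstsq(np.asarray(X), np.asarray(y), rcond=None)

# Pair each coefficient with the column name recorded in design_info.
fitted = dict(zip(X.design_info.column_names, coef.ravel()))
print(fitted)
```

The column names stored in X.design_info make it possible to label coefficients without tracking column order by hand.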

Incremental Design Matrix Builders

For large datasets that don't fit in memory, these functions create builders that can process data incrementally.

def incr_dbuilder(formula_like, data_iter_maker, eval_env=0, NA_action="drop"):
    """
    Construct a design matrix builder incrementally from a large data set.

    Parameters:
    - formula_like: Formula string, ModelDesc, DesignInfo, or object with __patsy_get_model_desc__ method (explicit matrices not allowed)
    - data_iter_maker: Zero-argument callable returning iterator over dict-like data objects
    - eval_env (int or EvalEnvironment): Environment for variable lookup
    - NA_action (str or NAAction): Strategy for handling missing data

    Returns:
    DesignInfo object (named DesignMatrixBuilder in patsy releases before 0.4.0) that can be
    passed to build_design_matrices() to process data incrementally
    """

def incr_dbuilders(formula_like, data_iter_maker, eval_env=0, NA_action="drop"):
    """
    Construct two design matrix builders incrementally from a large data set.
    
    This is the incremental version of dmatrices(), for processing large datasets
    that require multiple passes or don't fit in memory.

    Parameters:
    - formula_like: Two-sided formula string or equivalent
    - data_iter_maker: Zero-argument callable returning iterator over dict-like data objects
    - eval_env (int or EvalEnvironment): Environment for variable lookup
    - NA_action (str or NAAction): Strategy for handling missing data

    Returns:
    Tuple of (outcome_builder, predictor_builder) - both DesignInfo objects
    (named DesignMatrixBuilder in patsy releases before 0.4.0)
    """

Usage Examples

import numpy as np
import patsy

# Function that returns a fresh iterator over data chunks each time it is called
def data_chunks():
    # This could read from a database, files, etc.
    for i in range(0, 10000, 1000):
        x = np.arange(i, i + 1000)  # arrays, so x**2 is elementwise
        yield {'x': x, 'y': 2 * x}

# Build incremental design matrix builder
builder = patsy.incr_dbuilder("x + I(x**2)", data_chunks)

# Use the builder to process new data
new_data = {'x': np.array([1, 2, 3]), 'y': np.array([2, 4, 6])}
design_matrix = patsy.build_design_matrices([builder], new_data)[0]

# For two-sided formulas
y_builder, X_builder = patsy.incr_dbuilders("y ~ x + I(x**2)", data_chunks)
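
Why data_iter_maker must be a callable rather than a single iterator becomes clearer with a stateful transform. The sketch below assumes two small in-memory chunks (a hypothetical stand-in for a real chunked source) and uses patsy's built-in center() transform together with build_design_matrices().

```python
import numpy as np
import patsy

def data_chunks():
    # Hypothetical chunk source: two small in-memory chunks.
    yield {"x": np.array([1.0, 2.0]), "y": np.array([2.0, 4.0])}
    yield {"x": np.array([3.0, 4.0]), "y": np.array([6.0, 8.0])}

# center() is a stateful transform: patsy iterates over data_chunks() to
# learn the overall mean of x (2.5 here) before any matrix is built.
y_info, X_info = patsy.incr_dbuilders("y ~ center(x)", data_chunks)

# Separate, explicit pass: build the matrices one chunk at a time.
X_parts = []
for chunk in data_chunks():
    y_part, X_part = patsy.build_design_matrices([y_info, X_info], chunk)
    X_parts.append(np.asarray(X_part))

X_full = np.vstack(X_parts)
print(X_full[:, 1])  # x centered on the mean of all chunks
```

Because the mean is learned over every chunk, each chunk is centered consistently even though no single chunk contains the full data.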

Formula Types

The formula_like parameter accepts several types:

  • String formulas: R-style formula strings like "y ~ x1 + x2"
  • ModelDesc objects: Parsed formula representations
  • DesignInfo objects: Metadata about matrix structure
  • Explicit matrices: array-like objects for dmatrix, or a (y, X) tuple for dmatrices; not accepted by the incremental builders
  • Objects with a __patsy_get_model_desc__ method: Custom formula-like objects
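
A minimal sketch of three of these input types follows. The data and variable names are made up for illustration; the point is that a DesignInfo taken from one matrix can be reused so new data is coded the same way.

```python
import numpy as np
import patsy

train = {"x": np.array([1.0, 2.0, 3.0]), "g": ["a", "b", "a"]}
new = {"x": np.array([4.0, 5.0]), "g": ["b", "a"]}

# 1. String formula.
X_train = patsy.dmatrix("x + C(g)", train)

# 2. DesignInfo reused as formula_like: the new data gets the same columns
#    and the same category coding as the training data.
X_new = patsy.dmatrix(X_train.design_info, new)

# 3. Explicit matrix: passed through and wrapped as a DesignMatrix.
X_raw = patsy.dmatrix(np.array([[1.0, 2.0], [3.0, 4.0]]))

print(X_new.design_info.column_names)
print(X_raw.shape)
```

Reusing the DesignInfo is the usual way to build prediction-time matrices that line up column for column with the matrix used for fitting.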

Return Types

Functions support two return types via the return_type parameter:

  • "matrix" (default): Returns DesignMatrix objects (numpy.ndarray subclasses with metadata)
  • "dataframe": Returns pandas DataFrames (requires pandas installation)
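
The difference between the two return types can be seen in a small sketch, assuming pandas is installed; the row labels r1 to r3 are arbitrary.

```python
import numpy as np
import pandas as pd
import patsy

data = pd.DataFrame({"x": [1.0, 2.0, 3.0]}, index=["r1", "r2", "r3"])

as_matrix = patsy.dmatrix("x", data)  # default return_type="matrix"
as_frame = patsy.dmatrix("x", data, return_type="dataframe")

print(type(as_matrix).__name__)           # DesignMatrix, an ndarray subclass
print(isinstance(as_matrix, np.ndarray))
print(list(as_frame.columns))             # column names become DataFrame columns
print(list(as_frame.index))               # the input index is carried over
```

The DataFrame form is convenient when row labels matter, for example when rows are later dropped for missing data and results must be aligned back to the input.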

Missing Data Handling

The NA_action parameter controls missing data handling:

  • "drop" (default): Remove rows with any missing values
  • "raise": Raise an exception if missing values are encountered
  • NAAction object: Custom missing data handling strategy
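
The three options can be compared on a tiny dataset with one NaN. This is a sketch under the assumption that NaN is the only kind of missing value present; the NAAction configuration shown is just one possible custom strategy.

```python
import numpy as np
import patsy

data = {"x": np.array([1.0, np.nan, 3.0]), "y": np.array([2.0, 4.0, 6.0])}

# "drop" (default): the row with the missing x is removed from both matrices.
y, X = patsy.dmatrices("y ~ x", data)
print(y.shape, X.shape)

# "raise": the same data triggers a PatsyError instead.
raised = False
try:
    patsy.dmatrices("y ~ x", data, NA_action="raise")
except patsy.PatsyError:
    raised = True
print("raised:", raised)

# NAAction object: here only None counts as missing, so the NaN row is kept.
keep_nan = patsy.NAAction(on_NA="drop", NA_types=["None"])
X_all = patsy.dmatrix("x", data, NA_action=keep_nan)
print(X_all.shape)
```

Note that with dmatrices the drop is applied jointly, so the outcome and predictor matrices always keep the same rows.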

Install with Tessl CLI

npx tessl i tessl/pypi-patsy
