A Python package for describing statistical models and for building design matrices.
Classes implementing different contrast coding schemes for categorical variables. These coding schemes determine how categorical factors are represented in design matrices, affecting the interpretation of model coefficients.
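As a rough illustration of how the coding scheme changes coefficient interpretation, the sketch below fits the same toy data under treatment and sum coding using ordinary least squares. The data values and the `fit` helper are hypothetical, chosen only for this example; `Sum` is resolved by patsy's formula builtins, so it needs no import here.

```python
import numpy as np
import patsy

# Hypothetical toy data: three groups, two observations each.
data = {
    "group": ["A", "B", "C", "A", "B", "C"],
    "y": [1.0, 2.0, 3.0, 1.5, 2.5, 3.5],
}


def fit(formula):
    """Build design matrices with patsy and solve ordinary least squares."""
    y, X = patsy.dmatrices(formula, data)
    X_arr, y_arr = np.asarray(X), np.asarray(y)
    coef, *_ = np.linalg.lstsq(X_arr, y_arr, rcond=None)
    named = dict(zip(X.design_info.column_names, coef.ravel()))
    return named, X_arr @ coef


treatment_coef, treatment_fit = fit("y ~ C(group)")
sum_coef, sum_fit = fit("y ~ C(group, Sum)")

# Treatment: the intercept is the mean of reference level A.
# Sum: the intercept is the mean of the three group means.
for label, coefs in [("Treatment", treatment_coef), ("Sum", sum_coef)]:
    print(label)
    for name, value in coefs.items():
        print(f"  {name}: {value:.3f}")
```

Both codings span the same column space, so the fitted values agree; only the meaning of the individual coefficients differs.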
The foundation class for all contrast coding schemes, containing the actual coding matrix and column naming information.
class ContrastMatrix:
    """
    Container for a matrix used for coding categorical factors.

    Attributes:
    - matrix: 2d ndarray where each column corresponds to one design matrix column
      and each row contains entries for a single categorical level
    - column_suffixes: List of strings appended to factor names for column names
    """

    def __init__(self, matrix, column_suffixes):
        """
        Create a contrast matrix.

        Parameters:
        - matrix: 2d array-like coding matrix
        - column_suffixes: List of suffix strings for column naming
        """

The default contrast coding scheme, comparing each level to a reference level.
class Treatment:
    """
    Treatment coding (dummy coding) - the default contrast scheme.

    For reduced-rank coding, one level is the reference (represented by intercept),
    and each column represents the difference between a level and the reference.
    For full-rank coding, classic dummy coding with each level having its own column.
    """

    def __init__(self, reference=None):
        """
        Parameters:
        - reference: Level to use as reference (default: first level)
        """

import patsy
from patsy import Treatment
import pandas as pd
data = pd.DataFrame({
'group': ['A', 'B', 'C', 'A', 'B', 'C'],
'y': [1, 2, 3, 1.5, 2.5, 3.5]
})
# Default treatment contrasts (first level as reference)
y, X = patsy.dmatrices("y ~ C(group)", data)
print(X.design_info.column_names) # ['Intercept', 'C(group)[T.B]', 'C(group)[T.C]']
# Specify reference level
y, X = patsy.dmatrices("y ~ C(group, Treatment(reference='B'))", data)
print(X.design_info.column_names)
# ['Intercept', "C(group, Treatment(reference='B'))[T.A]", "C(group, Treatment(reference='B'))[T.C]"]

Compares each level to the grand mean, with coefficients that sum to zero.
class Sum:
    """
    Deviation coding (sum-to-zero coding).

    Compares the mean of each level to the mean-of-means (overall mean in balanced designs).
    Coefficients sum to zero, making interpretation relative to the grand mean.
    """

    def __init__(self, omit=None):
        """
        Parameters:
        - omit: Level to omit to avoid redundancy (default: last level)
        """

import patsy
from patsy import Sum
# Sum-to-zero contrasts
y, X = patsy.dmatrices("y ~ C(group, Sum)", data)
print(X.design_info.column_names)  # ['Intercept', 'C(group, Sum)[S.A]', 'C(group, Sum)[S.B]']
# Specify which level to omit
y, X = patsy.dmatrices("y ~ C(group, Sum(omit='A'))", data)

Compares each level with the average of all preceding levels.
class Helmert:
    """
    Helmert contrasts.

    Compares the second level with the first, the third with the average of
    the first two, and so on. Useful for ordered factors.

    Warning: Multiple definitions of 'Helmert coding' exist. Verify this matches
    your expected interpretation.
    """

import pandas as pd
import patsy
from patsy import Helmert
# Helmert contrasts for ordered factors
data = pd.DataFrame({
'dose': ['low', 'medium', 'high', 'low', 'medium', 'high'],
'response': [1, 2, 4, 1.2, 2.1, 3.8]
})
y, X = patsy.dmatrices("response ~ C(dose, Helmert, levels=['low', 'medium', 'high'])", data)
print(X.design_info.column_names)  # Intercept plus columns suffixed [H.medium] and [H.high]

Treats categorical levels as ordered samples for polynomial trend analysis.
class Poly:
    """
    Orthogonal polynomial contrast coding.

    Treats levels as ordered samples from an underlying continuous scale,
    decomposing effects into linear, quadratic, cubic, etc. components.
    Useful for ordered factors with potentially nonlinear relationships.
    """

import pandas as pd
import patsy
from patsy import Poly
# Polynomial contrasts for dose-response analysis
data = pd.DataFrame({
'dose': [1, 2, 3, 4, 1, 2, 3, 4], # Numeric levels
'response': [1, 1.8, 3.2, 4.5, 1.1, 1.9, 3.1, 4.6]
})
y, X = patsy.dmatrices("response ~ C(dose, Poly)", data)
print(X.design_info.column_names)  # Linear, quadratic, cubic terms

Compares each level with the immediately preceding level, useful for ordered factors.
class Diff:
    """
    Backward difference coding.

    Compares each level with the preceding level: second minus first,
    third minus second, etc. Useful for ordered factors to examine
    step-wise changes between adjacent levels.
    """

import pandas as pd
import patsy
from patsy import Diff
# Difference contrasts for time periods
data = pd.DataFrame({
'period': ['pre', 'during', 'post', 'pre', 'during', 'post'],
'measurement': [10, 15, 12, 9, 16, 13]
})
y, X = patsy.dmatrices("measurement ~ C(period, Diff, levels=['pre', 'during', 'post'])", data)
print(X.design_info.column_names)  # Columns suffixed [D.pre] and [D.during] estimate during - pre and post - during

| Contrast Type | Best For | Interpretation |
|---|---|---|
| Treatment | General categorical factors | Difference from reference level |
| Sum | Balanced designs, ANOVA-style analysis | Deviation from grand mean |
| Helmert | Ordered factors, progressive comparisons | Each level vs. the mean of all preceding levels |
| Poly | Ordered factors, trend analysis | Linear, quadratic, cubic trends |
| Diff | Ordered factors, adjacent comparisons | Step-wise changes |
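
One way to check what each scheme in the table actually encodes (the Helmert docstring warns that several definitions exist) is to print the coding matrices directly. This is a minimal sketch, assuming three hypothetical ordered levels; it calls each scheme's `code_without_intercept()` method, which returns the reduced-rank `ContrastMatrix` used when the model already has an intercept.

```python
import numpy as np
from patsy import Treatment, Sum, Helmert, Poly, Diff

levels = ["low", "medium", "high"]  # hypothetical ordered levels

schemes = {
    "Treatment": Treatment(),
    "Sum": Sum(),
    "Helmert": Helmert(),
    "Poly": Poly(),
    "Diff": Diff(),
}

coded = {}
for name, scheme in schemes.items():
    # Rows follow the order of `levels`; each column becomes one
    # design matrix column, named by the matching suffix.
    cm = scheme.code_without_intercept(levels)
    coded[name] = cm
    print(name, cm.column_suffixes)
    print(np.round(cm.matrix, 3))
```

Reading down a column shows the weight each level receives in that column, which makes it easier to confirm that a scheme matches the comparison you have in mind.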
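import numpy as np
import patsy
from patsy import ContrastMatrix

# Create a custom contrast matrix: one row per level, one column per design matrix column
custom_matrix = np.array([[1, 0], [0, 1], [-1, -1]])
custom_contrasts = ContrastMatrix(custom_matrix, ["[custom.1]", "[custom.2]"])

# Pass the ContrastMatrix to C() in a formula; rows map to the levels in order (A, B, C)
data = {'group': ['A', 'B', 'C', 'A', 'B', 'C']}
X = patsy.dmatrix("C(group, custom_contrasts)", data)

Contrast schemes are passed to C() alongside other categorical options such as levels: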
import patsy
from patsy import C, Treatment, Sum
data = {'factor': ['A', 'B', 'C'] * 10, 'y': range(30)}
# Combine C() with contrast specification
designs = [
patsy.dmatrix("C(factor, Treatment)", data),
patsy.dmatrix("C(factor, Sum)", data),
patsy.dmatrix("C(factor, levels=['C', 'B', 'A'])", data) # Custom ordering
]
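
The Treatment docstring distinguishes reduced-rank from full-rank coding. As a small sketch of how the rest of the formula selects between them (the data here is hypothetical), removing the intercept with `0 +` leaves no reference level to absorb, so each level gets its own column:

```python
import patsy

data = {"factor": ["A", "B", "C"] * 2}  # hypothetical toy data

# With an intercept, the first level is absorbed as the reference.
with_intercept = patsy.dmatrix("C(factor)", data)

# Without an intercept, every level gets its own indicator column.
without_intercept = patsy.dmatrix("0 + C(factor)", data)

print(with_intercept.design_info.column_names)
print(without_intercept.design_info.column_names)
```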