# Categorical Variables

Functions and classes for handling categorical data in statistical models. Patsy provides automatic detection of categorical variables and flexible manual specification with custom contrast coding schemes.

## Capabilities

### Categorical Variable Specification

Explicitly marks data as categorical and specifies how it should be interpreted in formulas.

```python { .api }
def C(data, contrast=None, levels=None):
    """
    Marks data as categorical and specifies interpretation options.

    Parameters:
    - data: Array-like data to be treated as categorical
    - contrast (contrast object or None): Contrast coding scheme to use (Treatment, Sum, Helmert, etc.)
    - levels (sequence or None): Explicit ordering of category levels

    Returns:
    Categorical factor object for use in formulas
    """
```

#### Usage Examples

```python
import patsy
import pandas as pd

data = pd.DataFrame({
    'treatment': ['control', 'drug_a', 'drug_b', 'control', 'drug_a'],
    'outcome': [1.2, 2.3, 3.1, 1.8, 2.9]
})

# Basic categorical specification
design = patsy.dmatrix("C(treatment)", data)

# With custom level ordering
design = patsy.dmatrix("C(treatment, levels=['control', 'drug_a', 'drug_b'])", data)

# With custom contrast coding
from patsy import Sum
design = patsy.dmatrix("C(treatment, Sum)", data)

# Combining with other terms
y, X = patsy.dmatrices("outcome ~ C(treatment) + I(treatment=='control')", data)
```
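
The `contrast` argument also accepts a configured contrast object. As a minimal sketch, assuming the goal is to pick a non-default baseline, the example below passes `Treatment(reference='drug_a')` so that `drug_a` becomes the reference level. The data dictionary is illustrative only.

```python
import patsy

# Illustrative data, mirroring the 'treatment' column used above
data = {'treatment': ['control', 'drug_a', 'drug_b', 'control', 'drug_a']}

# Treatment is available inside formulas without an explicit import
design = patsy.dmatrix("C(treatment, Treatment(reference='drug_a'))", data)

# One intercept column plus one indicator column per non-reference level
column_names = design.design_info.column_names
print(column_names)
```

The reference level gets no column of its own; its effect is absorbed into the intercept.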

### Automatic Categorical Detection

Determines whether data should be treated as categorical by default, based on its type. Numeric data is treated as continuous; everything else (strings, booleans, pandas categoricals, data wrapped in `C()`) is treated as categorical.

```python { .api }
def guess_categorical(data):
    """
    Determine if data should be treated as categorical.

    Parameters:
    - data: Array-like data to examine

    Returns:
    bool: True if data appears categorical, False otherwise
    """
```

#### Usage Examples

```python
import numpy as np
from patsy.categorical import guess_categorical

# String data is guessed as categorical
text_data = ['A', 'B', 'A', 'C', 'B']
print(guess_categorical(text_data))  # True

# Numeric data is not guessed as categorical, even with few unique values;
# wrap it in C() to mark it explicitly
numeric_groups = [1, 2, 1, 3, 2, 1, 3]
print(guess_categorical(numeric_groups))  # False

# Continuous numeric data is not categorical
continuous = np.random.normal(0, 1, 100)
print(guess_categorical(continuous))  # False
```
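
Two further cases are worth checking: boolean arrays and numeric data wrapped in `C()`. The sketch below assumes `C` and `guess_categorical` are imported from `patsy.categorical`; the variable names are illustrative.

```python
import numpy as np
from patsy.categorical import C, guess_categorical

# Boolean arrays are not a numeric dtype, so they are guessed as categorical
flags = np.array([True, False, True])
flags_is_categorical = guess_categorical(flags)

# Wrapping numeric data in C() marks it as categorical explicitly
boxed_is_categorical = guess_categorical(C([1, 2, 1, 3]))

# The same numbers without C() are treated as continuous
plain_is_categorical = guess_categorical([1, 2, 1, 3])

print(flags_is_categorical, boxed_is_categorical, plain_is_categorical)
```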

### Categorical Data Conversion

Converts categorical data to integer codes for internal processing.

```python { .api }
def categorical_to_int(data, levels, NA_action, origin=None):
    """
    Convert categorical data to integer representation.

    Parameters:
    - data: Categorical data to convert
    - levels (tuple): Level ordering; each value is coded as its index in this tuple
    - NA_action (NAAction): Determines which values count as missing
    - origin: Origin information for error reporting

    Returns:
    Integer array with category codes, with missing values as -1
    """
```

#### Usage Examples

```python
from patsy.categorical import categorical_to_int
from patsy.missing import NAAction

# Convert string categories to integers
categories = ['A', 'B', 'A', 'C', 'B']
int_codes = categorical_to_int(categories, ('A', 'B', 'C'), NAAction())
print(int_codes)  # [0 1 0 2 1]

# With a different level ordering
int_codes = categorical_to_int(categories, ('C', 'B', 'A'), NAAction())
print(int_codes)  # [2 1 2 0 1]
```
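
The `-1` code for missing values can be seen directly. This is a minimal sketch, assuming the `categorical_to_int(data, levels, NA_action)` call form from `patsy.categorical` and a default `NAAction()`, which treats `None` and `NaN` as missing. The input list is illustrative.

```python
from patsy.categorical import categorical_to_int
from patsy.missing import NAAction

levels = ('A', 'B')
values = ['A', None, 'B', 'A']  # illustrative data with one missing entry

# Each value is coded as its index in `levels`; the missing entry becomes -1
codes = categorical_to_int(values, levels, NAAction())
print(codes)
```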

### Automatic Categorical Detection Class

A class that can detect and handle categorical variables automatically during formula evaluation.

```python { .api }
class CategoricalSniffer:
    """
    Automatically detects and handles categorical variables during formula processing.
    """
    def __init__(self, NA_action, origin=None):
        """
        Initialize categorical detection.

        Parameters:
        - NA_action: Strategy for handling missing data
        - origin: Origin information for error reporting
        """
```

#### Usage Examples

```python
from patsy.categorical import CategoricalSniffer
from patsy.missing import NAAction

# Create a categorical sniffer
na_action = NAAction()
sniffer = CategoricalSniffer(na_action)

# The sniffer is typically used internally by patsy,
# but can be used manually for custom processing
```
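
For manual use, the sniffer can be fed data in chunks and then asked for the levels it has seen. The sketch below assumes the `sniff()` and `levels_contrast()` methods of `patsy.categorical.CategoricalSniffer`, which are not listed in the API block above; the chunked input is illustrative.

```python
from patsy.categorical import CategoricalSniffer
from patsy.missing import NAAction

sniffer = CategoricalSniffer(NAAction())

# Feed the data in two chunks; missing values (None) are skipped
sniffer.sniff(['b', 'a', None])
sniffer.sniff(['c', 'a'])

# Collect the sorted levels and any contrast attached to the data
levels, contrast = sniffer.levels_contrast()
print(levels, contrast)
```

No contrast is reported here because the chunks are plain lists rather than data wrapped in `C(..., contrast)`.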

## Categorical Data Types

Patsy recognizes several types of categorical data:

### Pandas Categorical

```python
import pandas as pd
import patsy

# Pandas categorical data
cat_data = pd.Categorical(['A', 'B', 'A', 'C'], categories=['A', 'B', 'C'])
design = patsy.dmatrix("cat_data", {'cat_data': cat_data})
```
167
168
### String/Text Data
169
170
```python
171
# String data is automatically treated as categorical
172
text_groups = ['control', 'treatment', 'control', 'treatment']
173
design = patsy.dmatrix("C(text_groups)", {'text_groups': text_groups})
174
```

### Numeric Categories

```python
import patsy

# Numeric data is treated as continuous unless explicitly marked categorical with C()
numeric_groups = [1, 2, 1, 3, 2]
design = patsy.dmatrix("C(numeric_groups)", {'numeric_groups': numeric_groups})
```
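
The effect of `C()` on numeric data shows up in the design matrix columns. As a small sketch using the same illustrative values, the example below builds the matrix with and without `C()` and compares the column names.

```python
import patsy

data = {'numeric_groups': [1, 2, 1, 3, 2]}

# Without C(): one continuous column next to the intercept
as_number = patsy.dmatrix("numeric_groups", data)

# With C(): one indicator column per non-reference level
as_category = patsy.dmatrix("C(numeric_groups)", data)

print(as_number.design_info.column_names)
print(as_category.design_info.column_names)
```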

## Integration with Contrast Coding

The contrast coding scheme for a categorical variable is selected through the second argument to `C()`:

```python
import patsy
from patsy import Treatment, Sum, Helmert

data = {'group': ['A', 'B', 'C', 'A', 'B', 'C']}

# Default treatment contrasts
design1 = patsy.dmatrix("C(group)", data)

# Sum-to-zero contrasts
design2 = patsy.dmatrix("C(group, Sum)", data)

# Helmert contrasts
design3 = patsy.dmatrix("C(group, Helmert)", data)
```
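
To see what each scheme does to the levels, the contrast objects can be asked for their coding matrices directly. This is a sketch assuming the `code_without_intercept()` method of patsy's contrast classes; the `coded` dictionary is just a helper for this example.

```python
from patsy import Treatment, Sum, Helmert

levels = ['A', 'B', 'C']

# Build the coding matrix each scheme uses when an intercept is present
coded = {
    scheme.__name__: scheme().code_without_intercept(levels)
    for scheme in (Treatment, Sum, Helmert)
}

for name, contrast in coded.items():
    # One row per level, one column per design matrix column
    print(name, contrast.column_suffixes)
    print(contrast.matrix)
```

Each matrix has one row per level and one fewer column than there are levels, because the intercept accounts for the remaining degree of freedom.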

## Missing Data Handling

Categorical functions respect Patsy's missing data handling:

- Missing values in categorical data (`None` or `NaN` by default) are coded as -1 internally
- The `NA_action` parameter controls how missing values affect matrix construction: the default, `"drop"`, removes rows that contain missing values, while `"raise"` raises an error instead
- Categories with all missing values may be handled specially
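
The two `NA_action` behaviours can be compared on a small input. This is a minimal sketch with illustrative data, using the `NA_action` keyword of `patsy.dmatrix`.

```python
import patsy

data = {'g': ['a', None, 'b', 'a']}  # illustrative data with one missing entry

# Default NA_action="drop": the row with the missing value is removed
design = patsy.dmatrix("C(g)", data)
print(design.shape)

# NA_action="raise": the same input is rejected instead
raised = False
try:
    patsy.dmatrix("C(g)", data, NA_action="raise")
except patsy.PatsyError as err:
    raised = True
    print("raised:", err)
```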