# Categorical Variables

Functions and classes for handling categorical data in statistical models. Patsy provides automatic detection of categorical variables and flexible manual specification with custom contrast coding schemes.

## Capabilities

### Categorical Variable Specification

Explicitly marks data as categorical and specifies how it should be interpreted in formulas.

```python { .api }
def C(data, contrast=None, levels=None):
    """
    Marks data as categorical and specifies interpretation options.

    Parameters:
    - data: Array-like data to be treated as categorical
    - contrast (contrast object or None): Contrast coding scheme to use (Treatment, Sum, Helmert, etc.)
    - levels (sequence or None): Explicit ordering of category levels

    Returns:
    Categorical factor object for use in formulas
    """
```

#### Usage Examples

```python
import patsy
import pandas as pd

data = pd.DataFrame({
    'treatment': ['control', 'drug_a', 'drug_b', 'control', 'drug_a'],
    'outcome': [1.2, 2.3, 3.1, 1.8, 2.9]
})

# Basic categorical specification
design = patsy.dmatrix("C(treatment)", data)

# With custom level ordering
design = patsy.dmatrix("C(treatment, levels=['control', 'drug_a', 'drug_b'])", data)

# With custom contrast coding
from patsy import Sum
design = patsy.dmatrix("C(treatment, Sum)", data)

# Combining with other terms
y, X = patsy.dmatrices("outcome ~ C(treatment) + I(treatment=='control')", data)
```
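To see what `levels` changes, it can help to inspect the column names of the resulting design matrix. The sketch below assumes the default Treatment coding, where the first level acts as the reference and gets no column of its own; the variable names are illustrative only.

```python
from patsy import dmatrix

data = {"treatment": ["control", "drug_a", "drug_b", "control", "drug_a"]}

# Default ordering: levels are sorted, so "control" is the reference level
default = dmatrix("C(treatment)", data)

# Explicit ordering: "drug_b" comes first and becomes the reference level
reordered = dmatrix("C(treatment, levels=['drug_b', 'drug_a', 'control'])", data)

default_columns = default.design_info.column_names
reordered_columns = reordered.design_info.column_names
print(default_columns)
print(reordered_columns)
```

In both cases the matrix has an intercept plus one column per non-reference level; only the choice of reference differs.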

### Automatic Categorical Detection

Determines whether data should be automatically treated as categorical based on its type. Strings, booleans, and pandas categoricals are treated as categorical; numeric data is not.

```python { .api }
def guess_categorical(data):
    """
    Determine if data should be treated as categorical.

    Parameters:
    - data: Array-like data to examine

    Returns:
    bool: True if data appears categorical, False otherwise
    """
```

#### Usage Examples

```python
import numpy as np
from patsy.categorical import guess_categorical

# String data is treated as categorical
text_data = ['A', 'B', 'A', 'C', 'B']
print(guess_categorical(text_data))  # True

# Numeric data is not auto-detected as categorical, even with few unique values
numeric_groups = [1, 2, 1, 3, 2, 1, 3]
print(guess_categorical(numeric_groups))  # False

# Continuous numeric data is not categorical
continuous = np.random.normal(0, 1, 100)
print(guess_categorical(continuous))  # False
```
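A few more input types are worth checking. This is a minimal sketch that assumes the detection rule in patsy's source (pandas categoricals and any non-numeric dtype count as categorical) and imports the function from `patsy.categorical`; the `results` dictionary is only there for illustration.

```python
import numpy as np
import pandas as pd
from patsy.categorical import guess_categorical

# Collect the verdict for several kinds of input
results = {
    "strings": guess_categorical(["a", "b", "a"]),
    "booleans": guess_categorical([True, False, True]),
    "integers": guess_categorical([1, 2, 1, 3]),
    "floats": guess_categorical(np.array([0.5, 1.5, 2.5])),
    "pandas_categorical": guess_categorical(pd.Categorical([1, 2, 1])),
}

for kind, is_categorical in results.items():
    print(kind, is_categorical)
```

Integer group codes are therefore treated as numeric unless they are wrapped in `C()` or stored as a pandas categorical.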

### Categorical Data Conversion

Converts categorical data to integer codes for internal processing.

```python { .api }
def categorical_to_int(data, levels, NA_action, origin=None):
    """
    Convert categorical data to integer representation.

    Parameters:
    - data: Categorical data to convert
    - levels (tuple): Category levels in order; each value is coded as its index in this tuple
    - NA_action (NAAction): Determines which values count as missing
    - origin: Origin information for error reporting

    Returns:
    Integer array with category codes, with missing values as -1
    """
```

#### Usage Examples

```python
from patsy.categorical import categorical_to_int
from patsy.missing import NAAction

# Convert string categories to integers
categories = ['A', 'B', 'A', 'C', 'B']
int_codes = categorical_to_int(categories, ('A', 'B', 'C'), NAAction())
print(int_codes)  # [0 1 0 2 1]

# With a different level ordering
int_codes = categorical_to_int(categories, ('C', 'B', 'A'), NAAction())
print(int_codes)  # [2 1 2 0 1]
```
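The `-1` code for missing values can be observed directly. The sketch below assumes the positional form `categorical_to_int(data, levels, NA_action)` found in patsy's source, with `levels` given as a tuple and a default `NAAction()` that treats `None` and NaN as missing.

```python
import numpy as np
from patsy.categorical import categorical_to_int
from patsy.missing import NAAction

values = ["A", None, "B", "A"]

# None is recognised as missing by the default NAAction and coded as -1
codes = categorical_to_int(values, ("A", "B"), NAAction())
print(codes)

# Boolean mask of the missing entries, derived from the sentinel code
missing_mask = np.asarray(codes) == -1
print(missing_mask)
```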

### Automatic Categorical Detection Class

A class that can detect and handle categorical variables automatically during formula evaluation.

```python { .api }
class CategoricalSniffer:
    """
    Automatically detects and handles categorical variables during formula processing.
    """
    def __init__(self, NA_action, origin=None):
        """
        Initialize categorical detection.

        Parameters:
        - NA_action: Strategy for handling missing data
        - origin: Origin information for error reporting
        """
```

#### Usage Examples

```python
from patsy.categorical import CategoricalSniffer
from patsy.missing import NAAction

# Create a categorical sniffer
na_action = NAAction()
sniffer = CategoricalSniffer(na_action)

# The sniffer is typically used internally by patsy,
# but can be used manually for custom processing
```
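For manual use, a sniffer can be fed data in chunks and then asked for the levels it has collected. The `sniff()` and `levels_contrast()` methods used here are not listed in the API block above; they are taken from patsy's source and are internal, so treat this as a sketch that may not hold across versions.

```python
from patsy.categorical import CategoricalSniffer
from patsy.missing import NAAction

sniffer = CategoricalSniffer(NAAction())

# Feed the data in two chunks; missing values (None) are skipped
sniffer.sniff(["b", "a", None])
sniffer.sniff(["c", "a"])

# levels: sorted tuple of every level seen so far
# contrast: None unless the data carried a contrast, e.g. via C(..., Sum)
levels, contrast = sniffer.levels_contrast()
print(levels, contrast)
```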

## Categorical Data Types

Patsy recognizes several types of categorical data:

### Pandas Categorical

```python
import pandas as pd
import patsy

# Pandas categorical data
cat_data = pd.Categorical(['A', 'B', 'A', 'C'], categories=['A', 'B', 'C'])
design = patsy.dmatrix("cat_data", {'cat_data': cat_data})
```
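With a pandas categorical, the order of `categories` is what determines the level order, and so the reference level under the default Treatment coding. A minimal sketch with a deliberately reversed category order:

```python
import pandas as pd
from patsy import dmatrix

# Categories listed in reverse, so "C" becomes the reference level
cat = pd.Categorical(["A", "B", "A", "C"], categories=["C", "B", "A"])
design = dmatrix("cat", {"cat": cat})

columns = design.design_info.column_names
print(columns)
```

No `C()` wrapper is needed here, because pandas categoricals are detected automatically.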

### String/Text Data

```python
import patsy

# String data is automatically treated as categorical
text_groups = ['control', 'treatment', 'control', 'treatment']
design = patsy.dmatrix("C(text_groups)", {'text_groups': text_groups})
```

### Numeric Categories

```python
import patsy

# Numeric data can be explicitly marked categorical
numeric_groups = [1, 2, 1, 3, 2]
design = patsy.dmatrix("C(numeric_groups)", {'numeric_groups': numeric_groups})
```
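The difference that `C()` makes for numeric data shows up in the shape of the design matrix. The following sketch compares the same integer codes used as a plain numeric term and as a categorical term; the variable name `g` is illustrative.

```python
from patsy import dmatrix

data = {"g": [1, 2, 1, 3, 2]}

# Without C(): one numeric column next to the intercept
numeric = dmatrix("g", data)

# With C(): one indicator column per non-reference level
categorical = dmatrix("C(g)", data)

numeric_columns = numeric.design_info.column_names
categorical_columns = categorical.design_info.column_names
print(numeric.shape, numeric_columns)
print(categorical.shape, categorical_columns)
```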

## Integration with Contrast Coding

Any of Patsy's contrast coding schemes can be passed as the second argument to `C()`:

```python
import patsy
from patsy import Treatment, Sum, Helmert

data = {'group': ['A', 'B', 'C', 'A', 'B', 'C']}

# Default treatment contrasts
design1 = patsy.dmatrix("C(group)", data)

# Sum-to-zero contrasts
design2 = patsy.dmatrix("C(group, Sum)", data)

# Helmert contrasts
design3 = patsy.dmatrix("C(group, Helmert)", data)
```
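To compare the schemes without building a full design matrix, the contrast objects can be asked for their coding matrices directly. This sketch uses `code_without_intercept()`, the method patsy calls when the model already contains an intercept; each row of the matrix corresponds to one level.

```python
import numpy as np
from patsy import Treatment, Sum, Helmert

levels = ["A", "B", "C"]

# Each call returns a ContrastMatrix with .matrix and .column_suffixes
treatment = Treatment().code_without_intercept(levels)
sum_to_zero = Sum().code_without_intercept(levels)
helmert = Helmert().code_without_intercept(levels)

for name, contrast in [("Treatment", treatment), ("Sum", sum_to_zero), ("Helmert", helmert)]:
    print(name, list(contrast.column_suffixes))
    print(np.asarray(contrast.matrix))
```

Treatment coding compares each level with the reference level, while the Sum columns add up to zero across levels.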

## Missing Data Handling

Categorical functions respect Patsy's missing data handling:

- Missing values in categorical data are coded as -1 internally
- The `NA_action` parameter controls how missing values affect matrix construction: by default, rows containing missing values are dropped, while `NA_action="raise"` raises an error instead
- Categories with all missing values may be handled specially
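
The effect of `NA_action` on a categorical column can be sketched as follows. The example assumes the two string options accepted by `dmatrix`, `"drop"` (the default) and `"raise"`; the `raised` flag is only there to make the outcome visible.

```python
from patsy import dmatrix, PatsyError

data = {"group": ["A", None, "B", "A"]}

# Default behaviour: the row containing None is dropped
dropped = dmatrix("C(group)", data)
print(dropped.shape)

# Strict behaviour: a missing value raises an error instead
raised = False
try:
    dmatrix("C(group)", data, NA_action="raise")
except PatsyError:
    raised = True
print(raised)
```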