# CRF Estimator

The `CRF` class provides a scikit-learn compatible interface for Conditional Random Field (CRF) sequence labeling. It wraps the CRFsuite C++ implementation and works with scikit-learn's tools for model selection, cross-validation, and pipelines.

## Capabilities

### Constructor and Configuration

Create a CRF estimator by choosing a training algorithm and setting its hyperparameters. Parameters tagged with an algorithm name in parentheses apply only to that algorithm.

```python { .api }
class CRF:
    def __init__(
        self,
        algorithm='lbfgs',
        min_freq=0,
        all_possible_states=False,
        all_possible_transitions=False,
        c1=0,
        c2=1.0,
        max_iterations=None,
        num_memories=6,
        epsilon=1e-5,
        period=10,
        delta=1e-5,
        linesearch='MoreThuente',
        max_linesearch=20,
        calibration_eta=0.1,
        calibration_rate=2.0,
        calibration_samples=1000,
        calibration_candidates=10,
        calibration_max_trials=20,
        pa_type=1,
        c=1,
        error_sensitive=True,
        averaging=True,
        variance=1,
        gamma=1,
        verbose=False,
        model_filename=None,
        keep_tempfiles=False,
        trainer_cls=None
    ):
        """
        Initialize CRF estimator.

        Parameters:
        - algorithm: str, training algorithm ('lbfgs', 'l2sgd', 'ap', 'pa', 'arow')
        - min_freq: float, cut-off threshold for feature occurrence frequency
        - all_possible_states: bool, generate state features for all attribute-label combinations
        - all_possible_transitions: bool, generate transition features for all label pairs
        - c1: float, L1 regularization coefficient (lbfgs only)
        - c2: float, L2 regularization coefficient
        - max_iterations: int, maximum number of optimization iterations
        - num_memories: int, number of limited memories used to approximate the inverse Hessian (lbfgs)
        - epsilon: float, convergence condition parameter
        - period: int, number of iterations over which the stopping criterion is tested
        - delta: float, stopping criterion threshold
        - linesearch: str, line search algorithm ('MoreThuente', 'Backtracking', 'StrongBacktracking')
        - max_linesearch: int, maximum number of line search trials
        - calibration_eta: float, initial learning rate for calibration (l2sgd)
        - calibration_rate: float, rate of change of the learning rate during calibration (l2sgd)
        - calibration_samples: int, number of instances used for calibration (l2sgd)
        - calibration_candidates: int, number of learning rate candidates (l2sgd)
        - calibration_max_trials: int, maximum number of calibration trials (l2sgd)
        - pa_type: int, passive aggressive strategy (0=no slack, 1=PA-I, 2=PA-II)
        - c: float, aggressiveness parameter (pa)
        - error_sensitive: bool, include the prediction error count in the objective
        - averaging: bool, compute averaged feature weights
        - variance: float, initial feature weight variance (arow)
        - gamma: float, tradeoff between loss and weight change (arow)
        - verbose: bool, enable training progress output
        - model_filename: str, path to an existing model file
        - keep_tempfiles: bool, preserve temporary model files
        - trainer_cls: class, custom trainer class
        """
```

**Usage Example:**

```python
from sklearn_crfsuite import CRF

# Basic L-BFGS with L1 and L2 regularization
crf = CRF(algorithm='lbfgs', c1=0.1, c2=0.1, max_iterations=100)

# Stochastic gradient descent setup
crf_sgd = CRF(
    algorithm='l2sgd',
    c2=1.0,
    calibration_eta=0.01,
    calibration_samples=500,
    verbose=True
)

# Passive Aggressive configuration
crf_pa = CRF(algorithm='pa', pa_type=1, c=0.5, error_sensitive=True)
```

### Training

Train the CRF model on sequence data, with an optional development set for validation.

```python { .api }
def fit(self, X, y, X_dev=None, y_dev=None):
    """
    Train the CRF model.

    Parameters:
    - X: List[List[Dict]], feature sequences for training documents
    - y: List[List[str]], label sequences for training documents
    - X_dev: List[List[Dict]], optional development/validation feature sequences
    - y_dev: List[List[str]], optional development/validation label sequences

    Returns:
    - self: fitted CRF instance
    """
```

**Usage Example:**

```python
# Basic training
crf.fit(X_train, y_train)

# Training with validation set
crf.fit(X_train, y_train, X_dev=X_val, y_dev=y_val)
```
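
The `X` argument to `fit` is a list of sequences, each a list of per-token feature dicts, and `y` is a parallel list of label sequences. As a minimal sketch of how such inputs might be built from tokenized sentences: the helper names (`token_features`, `sentence_features`), the feature keys, and the toy sentences and labels below are all illustrative assumptions, not part of the library.

```python
from typing import Dict, List


def token_features(sentence: List[str], i: int) -> Dict[str, object]:
    """Build a feature dict for the token at position i (hypothetical feature set)."""
    word = sentence[i]
    features: Dict[str, object] = {
        "bias": 1.0,
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:],
    }
    # Add context from the neighbouring tokens, marking sequence boundaries.
    if i == 0:
        features["BOS"] = True
    else:
        features["prev.lower"] = sentence[i - 1].lower()
    if i == len(sentence) - 1:
        features["EOS"] = True
    else:
        features["next.lower"] = sentence[i + 1].lower()
    return features


def sentence_features(sentence: List[str]) -> List[Dict[str, object]]:
    """Convert one tokenized sentence into a List[Dict] feature sequence."""
    return [token_features(sentence, i) for i in range(len(sentence))]


# Toy data, invented for illustration only.
sentences = [["Alice", "visited", "Paris"], ["Bob", "left"]]
y_train = [["B-PER", "O", "B-LOC"], ["B-PER", "O"]]

# X_train has the List[List[Dict]] shape that fit() expects.
X_train = [sentence_features(s) for s in sentences]
```

Each feature sequence must have the same length as its label sequence, so building both from the same tokenization keeps them aligned.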

### Prediction

Predict label sequences or per-token marginal probabilities for new sequences, either for a batch or for a single sequence.

```python { .api }
def predict(self, X):
    """
    Predict labels for input sequences.

    Parameters:
    - X: List[List[Dict]], feature sequences to predict

    Returns:
    - List[List[str]]: predicted label sequences
    """

def predict_single(self, xseq):
    """
    Predict labels for a single sequence.

    Parameters:
    - xseq: List[Dict], single feature sequence

    Returns:
    - List[str]: predicted labels for the sequence
    """

def predict_marginals(self, X):
    """
    Get marginal probabilities for all labels at each position.

    Parameters:
    - X: List[List[Dict]], feature sequences

    Returns:
    - List[List[Dict[str, float]]]: marginal probabilities for each position
    """

def predict_marginals_single(self, xseq):
    """
    Get marginal probabilities for a single sequence.

    Parameters:
    - xseq: List[Dict], single feature sequence

    Returns:
    - List[Dict[str, float]]: marginal probabilities for each position
    """
```

**Usage Example:**

```python
# Basic prediction
predictions = crf.predict(X_test)

# Single sequence prediction
single_pred = crf.predict_single(X_test[0])

# Get prediction confidence
marginals = crf.predict_marginals(X_test)
for seq_marginals in marginals:
    for pos_probs in seq_marginals:
        best_label = max(pos_probs, key=pos_probs.get)
        confidence = pos_probs[best_label]
        print(f"Label: {best_label}, Confidence: {confidence:.3f}")
```
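
Marginals are often used to flag uncertain tokens for review. The sketch below works on any value with the `List[List[Dict[str, float]]]` shape that `predict_marginals` returns. The helper `low_confidence_tokens`, the 0.6 threshold, and the sample probabilities are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Marginals = List[List[Dict[str, float]]]


def low_confidence_tokens(
    marginals: Marginals, threshold: float = 0.6
) -> List[Tuple[int, int, str, float]]:
    """Return (sequence index, position, best label, probability) for every
    token whose most likely label has a marginal below the threshold."""
    flagged = []
    for seq_idx, seq_marginals in enumerate(marginals):
        for pos, probs in enumerate(seq_marginals):
            best_label = max(probs, key=probs.get)
            if probs[best_label] < threshold:
                flagged.append((seq_idx, pos, best_label, probs[best_label]))
    return flagged


# Hand-written marginals standing in for crf.predict_marginals(X_test).
example_marginals = [
    [{"B-PER": 0.9, "O": 0.1}, {"B-PER": 0.45, "O": 0.55}],
    [{"B-PER": 0.2, "O": 0.8}],
]

flagged = low_confidence_tokens(example_marginals, threshold=0.6)
for seq_idx, pos, label, prob in flagged:
    print(f"sequence {seq_idx}, token {pos}: {label} ({prob:.2f})")
```

Here only the second token of the first sequence is flagged, because its best label has probability 0.55.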

### Evaluation

Evaluate the model with the built-in `score` method, which reports token-level accuracy.

```python { .api }
def score(self, X, y):
    """
    Return token-level accuracy score.

    Parameters:
    - X: List[List[Dict]], feature sequences
    - y: List[List[str]], true label sequences

    Returns:
    - float: flat accuracy score (token-level accuracy)
    """
```
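
Flat (token-level) accuracy treats every token in every sequence as one prediction and reports the fraction that match. The following sketch illustrates that definition in plain Python. The function `flat_accuracy` here is a hypothetical stand-in written for this example, not the library's own code, and the label sequences are invented.

```python
from typing import List


def flat_accuracy(y_true: List[List[str]], y_pred: List[List[str]]) -> float:
    """Fraction of tokens, pooled over all sequences, whose labels match."""
    correct = 0
    total = 0
    for true_seq, pred_seq in zip(y_true, y_pred):
        if len(true_seq) != len(pred_seq):
            raise ValueError("true and predicted sequences differ in length")
        for true_label, pred_label in zip(true_seq, pred_seq):
            correct += int(true_label == pred_label)
            total += 1
    return correct / total if total else 0.0


# Invented labels: one of the five tokens is mislabeled.
y_true = [["B-PER", "O", "B-LOC"], ["O", "O"]]
y_pred = [["B-PER", "O", "O"], ["O", "O"]]

accuracy = flat_accuracy(y_true, y_pred)
print(f"Flat accuracy: {accuracy:.2f}")
```

Because long sequences contribute more tokens, this metric weights them more heavily than short ones, and a dominant label such as `O` can inflate it.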

### Model Introspection

Access learned model parameters and feature information.

```python { .api }
@property
def classes_(self):
    """List of class labels learned during training."""

@property
def tagger_(self):
    """Underlying pycrfsuite.Tagger instance."""

@property
def size_(self):
    """Model size in bytes."""

@property
def num_attributes_(self):
    """Number of non-zero CRF attributes."""

@property
def attributes_(self):
    """List of learned feature attributes."""

@property
def state_features_(self):
    """
    Dict mapping (attribute_name, label) tuples to feature coefficients.
    Shows learned weights for state features.
    """

@property
def transition_features_(self):
    """
    Dict mapping (label_from, label_to) tuples to transition coefficients.
    Shows learned weights for label transitions.
    """

@property
def training_log_(self):
    """Training log parser with iteration details."""
```

**Usage Example:**

```python
# Inspect learned model
print(f"Model size: {crf.size_} bytes")
print(f"Number of attributes: {crf.num_attributes_}")
print(f"Learned labels: {crf.classes_}")

# Examine state feature weights
for (attr, label), weight in crf.state_features_.items():
    if abs(weight) > 0.1:  # Show only weights above an arbitrary cutoff
        print(f"Feature '{attr}' -> '{label}': {weight:.3f}")

# Check transition patterns
for (from_label, to_label), weight in crf.transition_features_.items():
    if abs(weight) > 0.1:
        print(f"Transition '{from_label}' -> '{to_label}': {weight:.3f}")
```
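
For larger models, ranking weights is usually more informative than a fixed cutoff. This sketch sorts any dict shaped like `state_features_` or `transition_features_` (tuple keys, float weights) and returns the highest and lowest entries. The helper `rank_weights` and the sample weights are made up for illustration, and the attribute names in the sample are not taken from a real model.

```python
from typing import Dict, List, Tuple

WeightItem = Tuple[Tuple[str, str], float]


def rank_weights(
    weights: Dict[Tuple[str, str], float], n: int = 2
) -> Tuple[List[WeightItem], List[WeightItem]]:
    """Return the n highest-weighted and n lowest-weighted entries."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    highest = ranked[:n]
    lowest = ranked[::-1][:n]  # reverse so the lowest weight comes first
    return highest, lowest


# Invented weights standing in for crf.state_features_.
sample_state_features = {
    ("word.lower:paris", "B-LOC"): 2.1,
    ("word.istitle", "O"): -1.4,
    ("bias", "O"): 0.3,
    ("suffix3:ted", "O"): 0.9,
}

highest, lowest = rank_weights(sample_state_features, n=2)
for (attr, label), weight in highest:
    print(f"{weight:+.3f}  {attr} -> {label}")
for (attr, label), weight in lowest:
    print(f"{weight:+.3f}  {attr} -> {label}")
```

The same helper applies to transition weights, where the key is a `(label_from, label_to)` pair.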