A CRFsuite (python-crfsuite) wrapper that provides an interface similar to scikit-learn
npx @tessl/cli install tessl/pypi-sklearn-crfsuite@0.3.00
# sklearn-crfsuite

A scikit-learn compatible wrapper for CRFsuite that enables Conditional Random Fields (CRF) for sequence labeling tasks. It provides a familiar fit/predict interface while leveraging the efficient C++ CRFsuite implementation through python-crfsuite, making it well suited to named entity recognition, part-of-speech tagging, and other structured prediction tasks.

## Package Information

- **Package Name**: sklearn-crfsuite
- **Package Type**: pypi
- **Language**: Python
- **Installation**: `pip install sklearn-crfsuite`

## Core Imports

```python
from sklearn_crfsuite import CRF
```

Common pattern for metrics and evaluation:

```python
from sklearn_crfsuite import metrics
```

For scikit-learn integration:

```python
from sklearn_crfsuite import scorers
```

For utility functions:

```python
from sklearn_crfsuite import utils
```

For advanced trainer customization:

```python
from sklearn_crfsuite import trainer
```

## Basic Usage

```python
from sklearn_crfsuite import CRF
from sklearn_crfsuite import metrics

# Prepare training data: a list of sequences, each a list of per-token feature dicts
X_train = [
    [{'word': 'I', 'pos': 'PRP'}, {'word': 'love', 'pos': 'VBP'}, {'word': 'Python', 'pos': 'NNP'}],
    [{'word': 'CRF', 'pos': 'NNP'}, {'word': 'models', 'pos': 'NNS'}, {'word': 'work', 'pos': 'VBP'}]
]

# Labels for each sequence (one label per token)
y_train = [
    ['O', 'O', 'B-LANG'],
    ['B-TECH', 'I-TECH', 'O']
]

# Create and train the CRF model
crf = CRF(algorithm='lbfgs', c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)

# Make predictions
X_test = [
    [{'word': 'Java', 'pos': 'NNP'}, {'word': 'is', 'pos': 'VBZ'}, {'word': 'popular', 'pos': 'JJ'}]
]
y_pred = crf.predict(X_test)

# Evaluate with token-level and sequence-level metrics
y_test = [['B-LANG', 'O', 'O']]
accuracy = metrics.flat_accuracy_score(y_test, y_pred)
seq_accuracy = metrics.sequence_accuracy_score(y_test, y_pred)

print(f"Token accuracy: {accuracy}")
print(f"Sequence accuracy: {seq_accuracy}")
```

## Architecture

sklearn-crfsuite bridges two key technologies:

- **CRFsuite**: High-performance C++ implementation of Conditional Random Fields
- **scikit-learn**: Python machine learning ecosystem providing standardized interfaces

The library maintains compatibility with sklearn's model selection utilities (cross-validation, grid search, pipeline integration) while providing access to CRF-specific features like marginal probabilities and feature introspection.

## Capabilities

### CRF Estimator

The main `CRF` class provides a scikit-learn compatible interface for Conditional Random Field sequence labeling, with a choice of training algorithms and configurable hyperparameters.

```python { .api }
class CRF:
    def __init__(self, algorithm='lbfgs', c1=0, c2=1.0, max_iterations=None, **kwargs): ...
    def fit(self, X, y, X_dev=None, y_dev=None): ...
    def predict(self, X): ...
    def predict_marginals(self, X): ...
    def score(self, X, y): ...
```

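To make the output of `predict_marginals` more concrete, here is a minimal sketch that works on the nested list-of-dicts shape listed under Types. It needs no trained model: the probabilities below are made-up illustration values, and `best_labels` is a hypothetical helper, not part of the package. Note that picking the highest marginal per token is not guaranteed to reproduce the labels returned by `predict`.

```python
from typing import Dict, List, Tuple

MarginalProbs = Dict[str, float]


def best_labels(seq_marginals: List[MarginalProbs]) -> List[Tuple[str, float]]:
    """Return the most probable label and its probability for each token."""
    return [max(token.items(), key=lambda kv: kv[1]) for token in seq_marginals]


# Hypothetical marginals for one three-token sequence (made-up numbers).
example = [
    {'O': 0.10, 'B-LANG': 0.90},
    {'O': 0.95, 'B-LANG': 0.05},
    {'O': 0.80, 'B-LANG': 0.20},
]

for label, prob in best_labels(example):
    print(f"{label}: {prob:.2f}")
```
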
[CRF Estimator](./crf-estimator.md)

### Evaluation Metrics

Specialized metrics for sequence labeling evaluation, including both token-level (flat) and sequence-level accuracy measures designed for structured prediction tasks.

```python { .api }
def flat_accuracy_score(y_true, y_pred): ...
def flat_precision_score(y_true, y_pred, **kwargs): ...
def flat_recall_score(y_true, y_pred, **kwargs): ...
def flat_f1_score(y_true, y_pred, **kwargs): ...
def sequence_accuracy_score(y_true, y_pred): ...
```

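The difference between token-level and sequence-level accuracy can be sketched in plain Python. This illustrates the two ideas only and is not the package's implementation; `token_accuracy` and `sequence_accuracy` are hypothetical names, and the labels are toy data. Token accuracy counts matching labels across all tokens, while sequence accuracy counts only sequences whose every label matches.

```python
from typing import List

Labels = List[List[str]]


def token_accuracy(y_true: Labels, y_pred: Labels) -> float:
    """Fraction of individual tokens whose predicted label is correct."""
    true_flat = [label for seq in y_true for label in seq]
    pred_flat = [label for seq in y_pred for label in seq]
    correct = sum(t == p for t, p in zip(true_flat, pred_flat))
    return correct / len(true_flat)


def sequence_accuracy(y_true: Labels, y_pred: Labels) -> float:
    """Fraction of sequences that are predicted entirely correctly."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


y_true = [['O', 'O', 'B-LANG'], ['B-TECH', 'I-TECH', 'O']]
y_pred = [['O', 'O', 'B-LANG'], ['B-TECH', 'O', 'O']]

print(token_accuracy(y_true, y_pred))     # 5 of 6 tokens match
print(sequence_accuracy(y_true, y_pred))  # 1 of 2 sequences matches exactly
```
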
[Evaluation Metrics](./metrics.md)

### Scikit-learn Integration

Ready-to-use scorer functions compatible with scikit-learn's cross-validation, grid search, and model selection utilities for seamless integration into ML pipelines.

```python { .api }
flat_accuracy: sklearn.metrics.scorer
sequence_accuracy: sklearn.metrics.scorer
```

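scikit-learn's model selection tools call a scorer as `scorer(estimator, X, y)`. The sketch below shows that calling convention with a deliberately simplified stand-in for `sklearn.metrics.make_scorer`; it is not how the package builds its scorers. `make_simple_scorer`, `exact_match`, and `AlwaysOTagger` are hypothetical names used for illustration only.

```python
from typing import Callable, List

Labels = List[List[str]]


def make_simple_scorer(metric: Callable[[Labels, Labels], float]):
    """Wrap metric(y_true, y_pred) into a scorer(estimator, X, y) callable."""
    def scorer(estimator, X, y) -> float:
        return metric(y, estimator.predict(X))
    return scorer


def exact_match(y_true: Labels, y_pred: Labels) -> float:
    """Fraction of sequences predicted without any error."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


class AlwaysOTagger:
    """Hypothetical stand-in model that labels every token 'O'."""

    def predict(self, X):
        return [['O'] * len(seq) for seq in X]


X = [[{'word': 'hello'}], [{'word': 'Python'}]]
y = [['O'], ['B-LANG']]

scorer = make_simple_scorer(exact_match)
print(scorer(AlwaysOTagger(), X, y))  # one of the two sequences matches
```
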
[Scikit-learn Integration](./sklearn-integration.md)

### Utility Functions

Helper functions for working with sequence data and CRF-specific data transformations.

```python { .api }
def flatten(sequences): ...
```

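Flattening turns a list of label sequences into a single list of labels, which is the shape token-level metrics operate on. The following is a minimal sketch of that transformation using only the standard library; `flatten_sketch` is a hypothetical name, and the package's own `flatten` may differ in detail.

```python
from itertools import chain


def flatten_sketch(sequences):
    """Concatenate a list of sequences into one flat list."""
    return list(chain.from_iterable(sequences))


y = [['O', 'O', 'B-LANG'], ['B-TECH', 'I-TECH', 'O']]
print(flatten_sketch(y))
```
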
[Utility Functions](./utils.md)

### Advanced Features

Advanced customization options including custom trainer classes for specialized training workflows and logging.

```python { .api }
class LinePerIterationTrainer: ...
```

[Advanced Features](./advanced.md)

## Types

```python { .api }
from typing import Dict, List, Union

# Feature representation for CRF input
FeatureDict = Dict[str, Union[str, int, float, bool]]
Sequence = List[FeatureDict]
Dataset = List[Sequence]

# Label representation
LabelSequence = List[str]
LabelDataset = List[LabelSequence]

# Marginal probabilities output
MarginalProbs = Dict[str, float]
SequenceMarginals = List[MarginalProbs]
DatasetMarginals = List[SequenceMarginals]
```

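As an illustration of how a `Dataset` in the shape above might be built from raw tokens, here is a small feature-extraction sketch. The feature names (`word.lower`, `suffix3`, `BOS`, and so on) and the helpers `token_features` and `sentence_features` are assumptions chosen for this example, not something the package prescribes.

```python
from typing import Dict, List, Union

FeatureDict = Dict[str, Union[str, int, float, bool]]


def token_features(words: List[str], i: int) -> FeatureDict:
    """Build a feature dict for the token at position i of a sentence."""
    word = words[i]
    features: FeatureDict = {
        'word.lower': word.lower(),
        'word.istitle': word.istitle(),
        'word.isdigit': word.isdigit(),
        'suffix3': word[-3:],
    }
    if i == 0:
        features['BOS'] = True  # beginning of sentence
    else:
        features['prev.lower'] = words[i - 1].lower()
    if i == len(words) - 1:
        features['EOS'] = True  # end of sentence
    return features


def sentence_features(words: List[str]) -> List[FeatureDict]:
    """Convert one tokenized sentence into a sequence of feature dicts."""
    return [token_features(words, i) for i in range(len(words))]


# One-sentence dataset in the list-of-lists-of-dicts shape.
X = [sentence_features(['I', 'love', 'Python'])]
print(X[0][2])
```
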