CRFsuite (python-crfsuite) wrapper that provides an interface similar to scikit-learn.
npx @tessl/cli install tessl/pypi-sklearn-crfsuite@0.3.0

A scikit-learn compatible wrapper for CRFsuite that enables Conditional Random Fields (CRF) for sequence labeling tasks. It provides a familiar fit/predict interface while leveraging the efficient C++ CRFsuite implementation through python-crfsuite, making it ideal for named entity recognition, part-of-speech tagging, and other structured prediction tasks.
pip install sklearn-crfsuite

from sklearn_crfsuite import CRF

Common pattern for metrics and evaluation:

from sklearn_crfsuite import metrics

For scikit-learn integration:

from sklearn_crfsuite import scorers

For utility functions:

from sklearn_crfsuite import utils

For advanced trainer customization:

from sklearn_crfsuite import trainer

from sklearn_crfsuite import CRF
from sklearn_crfsuite import metrics
# Prepare training data (list of lists of feature dicts)
X_train = [
    [{'word': 'I', 'pos': 'PRP'}, {'word': 'love', 'pos': 'VBP'}, {'word': 'Python', 'pos': 'NNP'}],
    [{'word': 'CRF', 'pos': 'NNP'}, {'word': 'models', 'pos': 'NNS'}, {'word': 'work', 'pos': 'VBP'}]
]
# Labels for each sequence
y_train = [
    ['O', 'O', 'B-LANG'],
    ['B-TECH', 'I-TECH', 'O']
]
# Create and train the CRF model
crf = CRF(algorithm='lbfgs', c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)
# Make predictions
X_test = [
    [{'word': 'Java', 'pos': 'NNP'}, {'word': 'is', 'pos': 'VBZ'}, {'word': 'popular', 'pos': 'JJ'}]
]
y_pred = crf.predict(X_test)
# Evaluate with sequence-level metrics
y_test = [['B-LANG', 'O', 'O']]
accuracy = metrics.flat_accuracy_score(y_test, y_pred)
seq_accuracy = metrics.sequence_accuracy_score(y_test, y_pred)
print(f"Token accuracy: {accuracy}")
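print(f"Sequence accuracy: {seq_accuracy}")

sklearn-crfsuite bridges two key technologies: CRFsuite, the efficient C++ CRF implementation accessed through python-crfsuite, and scikit-learn, whose fit/predict estimator interface it follows.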
The library maintains compatibility with sklearn's model selection utilities (cross-validation, grid search, pipeline integration) while providing access to CRF-specific features like marginal probabilities and feature introspection.
The main CRF class provides a scikit-learn compatible interface for Conditional Random Field sequence labeling, with a choice of CRFsuite training algorithms ('lbfgs', 'l2sgd', 'ap', 'pa', 'arow') and configurable hyperparameters.
class CRF:
    def __init__(self, algorithm='lbfgs', c1=0, c2=1.0, max_iterations=None, **kwargs): ...
    def fit(self, X, y, X_dev=None, y_dev=None): ...
    def predict(self, X): ...
    def predict_marginals(self, X): ...
    def score(self, X, y): ...

Specialized metrics for sequence labeling evaluation, including both token-level (flat) and sequence-level accuracy measures designed for structured prediction tasks.
def flat_accuracy_score(y_true, y_pred): ...
def flat_precision_score(y_true, y_pred, **kwargs): ...
def flat_recall_score(y_true, y_pred, **kwargs): ...
def flat_f1_score(y_true, y_pred, **kwargs): ...
def sequence_accuracy_score(y_true, y_pred): ...

Ready-to-use scorer functions compatible with scikit-learn's cross-validation, grid search, and model selection utilities for seamless integration into ML pipelines.
flat_accuracy: sklearn.metrics.scorer
sequence_accuracy: sklearn.metrics.scorer

Helper functions for working with sequence data and CRF-specific data transformations.
def flatten(sequences): ...

Advanced customization options including custom trainer classes for specialized training workflows and logging.
class LinePerIterationTrainer: ...

from typing import Dict, List, Union

# Feature representation for CRF input
FeatureDict = Dict[str, Union[str, int, float, bool]]
Sequence = List[FeatureDict]
Dataset = List[Sequence]
# Label representation
LabelSequence = List[str]
LabelDataset = List[LabelSequence]
# Marginal probabilities output
MarginalProbs = Dict[str, float]
SequenceMarginals = List[MarginalProbs]
DatasetMarginals = List[SequenceMarginals]
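
The type aliases describe the expected input shape but not how to produce it. The following is a minimal sketch of turning tokenized sentences into a `Dataset`; the helper names `word2features` and `sent2features` and the particular feature set are hypothetical choices for illustration, not part of the library.

```python
from typing import Dict, List, Union

FeatureDict = Dict[str, Union[str, int, float, bool]]


def word2features(tokens: List[str], i: int) -> FeatureDict:
    """Build a feature dict for tokens[i] (illustrative feature set)."""
    word = tokens[i]
    features: FeatureDict = {
        'bias': 1.0,
        'word.lower': word.lower(),
        'word.istitle': word.istitle(),
        'word.isdigit': word.isdigit(),
    }
    if i > 0:
        # Context feature: the previous token.
        features['prev.lower'] = tokens[i - 1].lower()
    else:
        features['BOS'] = True  # beginning of sequence
    if i == len(tokens) - 1:
        features['EOS'] = True  # end of sequence
    return features


def sent2features(tokens: List[str]) -> List[FeatureDict]:
    """Convert one tokenized sentence into a sequence of feature dicts."""
    return [word2features(tokens, i) for i in range(len(tokens))]


sentences = [['I', 'love', 'Python'], ['CRF', 'models', 'work']]
X = [sent2features(tokens) for tokens in sentences]
print(X[0][2])
```

The resulting `X` has the list-of-lists-of-dicts shape that `CRF.fit` and `CRF.predict` take, with one feature dict per token.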