tessl/pypi-sklearn-crfsuite

CRFsuite (python-crfsuite) wrapper which provides an interface similar to scikit-learn

Utility Functions

Name: tessl/pypi-sklearn-crfsuite
Author: tessl

Helper functions for working with sequence data and CRF-specific data transformations. These utilities are primarily used internally by the metrics module but are available for advanced use cases requiring sequence data manipulation.

Capabilities

Sequence Flattening

Converts a list of sequences into a single flat list. This is what lets CRF sequence labels be scored with standard scikit-learn metrics, which expect flat label arrays.

def flatten(sequences):
    """
    Flatten a list of sequences into a single list.

    Parameters:
    - sequences: List[List[Any]], list of sequences to flatten

    Returns:
    - List[Any]: flattened list combining all sequence elements
    """

Usage Example:

from sklearn_crfsuite.utils import flatten

# Flatten sequence labels for use with sklearn metrics
y_sequences = [['B-PER', 'I-PER', 'O'], ['O', 'B-LOC']]
y_flat = flatten(y_sequences)
print(y_flat)  # ['B-PER', 'I-PER', 'O', 'O', 'B-LOC']

# Flatten feature sequences (less common use case)
feature_sequences = [
    [{'word': 'John'}, {'word': 'Smith'}],
    [{'word': 'New'}, {'word': 'York'}]
]
# Note: flatten combines the elements of each inner list, so extract the values of interest first
flat_features = flatten([[f['word'] for f in seq] for seq in feature_sequences])
print(flat_features)  # ['John', 'Smith', 'New', 'York']
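
The flattening behaviour shown above can be sketched with the standard library alone. The helper name `flatten_one_level` is hypothetical and this is not the library's source. It is a minimal stand-in, assuming the documented contract (a list of sequences in, one combined list out), and it also shows that only one level of nesting is removed.

```python
from itertools import chain
from typing import Any, Iterable, List


def flatten_one_level(sequences: Iterable[Iterable[Any]]) -> List[Any]:
    """Hypothetical stand-in for the documented contract of flatten."""
    # chain.from_iterable concatenates the inner sequences in order
    return list(chain.from_iterable(sequences))


labels = [['B-PER', 'I-PER', 'O'], ['O', 'B-LOC']]
flat = flatten_one_level(labels)
print(flat)  # ['B-PER', 'I-PER', 'O', 'O', 'B-LOC']

# Only one level is removed: deeper lists stay intact
nested = [[['a'], ['b']], [['c']]]
print(flatten_one_level(nested))  # [['a'], ['b'], ['c']]
```

If the data is nested more deeply than a list of sequences, the inner values need to be extracted first, as in the feature example above.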

Integration with Metrics

The "flat" metrics in sklearn_crfsuite.metrics (the functions named flat_*) call flatten on y_true and y_pred before passing them to the corresponding scikit-learn metric functions, so sequence data can be passed to them directly.

Usage Pattern:

from sklearn_crfsuite import metrics
from sklearn_crfsuite.utils import flatten
from sklearn.metrics import classification_report

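# Example data: one label sequence per sentence (illustrative values)
y_true = [['B-PER', 'I-PER', 'O'], ['O', 'B-LOC']]
y_pred = [['B-PER', 'O', 'O'], ['O', 'B-LOC']]

# Automatic flattening (recommended)
report = metrics.flat_classification_report(y_true, y_pred)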

# Manual flattening (for custom metrics)
y_true_flat = flatten(y_true)
y_pred_flat = flatten(y_pred)
custom_report = classification_report(y_true_flat, y_pred_flat)
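
One way to picture the automatic flattening is as a decorator that flattens both label arguments before calling a flat metric. The sketch below is an illustration under that assumption, not the library's implementation: `flattens_inputs` and `token_accuracy` are hypothetical names, and the metric is written in plain Python so the example does not depend on scikit-learn.

```python
from functools import wraps
from itertools import chain


def flattens_inputs(metric):
    """Hypothetical decorator: flatten y_true and y_pred, then call metric."""
    @wraps(metric)
    def wrapper(y_true, y_pred, *args, **kwargs):
        y_true_flat = list(chain.from_iterable(y_true))
        y_pred_flat = list(chain.from_iterable(y_pred))
        return metric(y_true_flat, y_pred_flat, *args, **kwargs)
    return wrapper


@flattens_inputs
def token_accuracy(y_true, y_pred):
    """Illustrative metric: fraction of tokens whose labels match."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


y_true = [['B-PER', 'I-PER', 'O'], ['O', 'B-LOC']]
y_pred = [['B-PER', 'O', 'O'], ['O', 'B-LOC']]
score = token_accuracy(y_true, y_pred)
print(score)  # 0.8 (4 of 5 tokens match)
```

The same wrapper could be applied to any function that takes flat label lists as its first two arguments, which is why manual flattening is only needed for metrics that are not already wrapped.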

Data Preprocessing Applications

Beyond metrics, flatten is useful for preprocessing tasks that need every token or label at once, such as counting label frequencies or building a vocabulary:

Usage Example:

from sklearn_crfsuite.utils import flatten
from collections import Counter

def analyze_label_distribution(y_sequences):
    """Analyze label distribution across all sequences."""
    all_labels = flatten(y_sequences)
    return Counter(all_labels)

def create_vocabulary(feature_sequences, feature_key='word'):
    """Create vocabulary from feature sequences, skipping tokens that lack the key."""
    all_words = flatten([[token[feature_key] for token in seq if feature_key in token]
                         for seq in feature_sequences])
    return set(all_words)

# Example usage
y_train = [['B-PER', 'I-PER', 'O'], ['O', 'B-LOC', 'I-LOC']]
label_dist = analyze_label_distribution(y_train)
print(f"Label distribution: {label_dist}")

X_train = [
    [{'word': 'John', 'pos': 'NNP'}, {'word': 'lives', 'pos': 'VBZ'}],
    [{'word': 'in', 'pos': 'IN'}, {'word': 'Boston', 'pos': 'NNP'}]
]
vocab = create_vocabulary(X_train)
print(f"Vocabulary: {sorted(vocab)}")
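
When flattening manually for a custom metric, sequences of different lengths in y_true and y_pred can leave the flat lists misaligned without raising an error. A small guard like the one below may help. It is a sketch only: `flatten_aligned` is a hypothetical helper, not part of sklearn_crfsuite, and it uses itertools directly so that it runs on its own.

```python
from itertools import chain


def flatten_aligned(y_true, y_pred):
    """Hypothetical helper: flatten both inputs after checking they align."""
    if len(y_true) != len(y_pred):
        raise ValueError("different number of sequences")
    for i, (t, p) in enumerate(zip(y_true, y_pred)):
        if len(t) != len(p):
            raise ValueError(
                f"sequence {i}: {len(t)} true labels vs {len(p)} predicted"
            )
    # Both inputs have the same shape, so positions correspond after flattening
    return list(chain.from_iterable(y_true)), list(chain.from_iterable(y_pred))


y_true = [['B-PER', 'I-PER', 'O'], ['O', 'B-LOC', 'I-LOC']]
y_pred = [['B-PER', 'O', 'O'], ['O', 'B-LOC', 'I-LOC']]
true_flat, pred_flat = flatten_aligned(y_true, y_pred)
print(len(true_flat), len(pred_flat))  # 6 6
```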
