# Utility Functions

Helper functions for working with sequence data and CRF-specific data transformations. These utilities are primarily used internally by the metrics module but are available for advanced use cases requiring sequence data manipulation.

## Capabilities

### Sequence Flattening

Converts nested sequence structures into flat lists, essential for adapting CRF sequence data to work with standard scikit-learn metrics that expect flat label arrays.

```python { .api }
def flatten(sequences):
    """
    Flatten a list of sequences into a single list.

    Parameters:
    - sequences: List[List[Any]], list of sequences to flatten

    Returns:
    - List[Any]: flattened list combining all sequence elements
    """
```

**Usage Example:**

```python
from sklearn_crfsuite.utils import flatten

# Flatten sequence labels for use with sklearn metrics
y_sequences = [['B-PER', 'I-PER', 'O'], ['O', 'B-LOC']]
y_flat = flatten(y_sequences)
print(y_flat)  # ['B-PER', 'I-PER', 'O', 'O', 'B-LOC']

# Flatten feature sequences (less common use case)
feature_sequences = [
    [{'word': 'John'}, {'word': 'Smith'}],
    [{'word': 'New'}, {'word': 'York'}]
]
# Note: flatten removes one level of nesting, so extract the
# 'word' value from each feature dict before flattening
flat_features = flatten([[f['word'] for f in seq] for seq in feature_sequences])
print(flat_features)  # ['John', 'Smith', 'New', 'York']
```
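
The behaviour shown above amounts to removing one level of nesting. As a minimal sketch of an equivalent function, assuming the standard-library `itertools.chain.from_iterable` is an acceptable stand-in (the name `flatten_sketch` is hypothetical and not part of `sklearn_crfsuite`):

```python
from itertools import chain


def flatten_sketch(sequences):
    """Hypothetical stand-in for flatten: concatenate sequences in order."""
    # chain.from_iterable walks each inner sequence in turn,
    # so exactly one level of nesting is removed.
    return list(chain.from_iterable(sequences))


# Empty sequences contribute no elements.
labels = flatten_sketch([['B-PER', 'I-PER', 'O'], [], ['O', 'B-LOC']])

# Deeper nesting is left intact: only the outer level is removed.
nested = flatten_sketch([[['a'], ['b']], [['c']]])
```

In this sketch, `labels` keeps the original token order across sequences, and `nested` still contains lists, which is why feature dicts or deeper structures need to be reduced to plain values before flattening.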

### Integration with Metrics

The `flatten` function is applied automatically by every "flat" metric in `sklearn_crfsuite.metrics`, which flattens the sequence data before passing it to the corresponding scikit-learn metric function.

**Usage Pattern:**

```python
from sklearn_crfsuite import metrics
from sklearn_crfsuite.utils import flatten
from sklearn.metrics import classification_report

# Example data: one list of labels per sequence
y_true = [['B-PER', 'I-PER', 'O'], ['O', 'B-LOC']]
y_pred = [['B-PER', 'O', 'O'], ['O', 'B-LOC']]

# Automatic flattening (recommended)
report = metrics.flat_classification_report(y_true, y_pred)

# Manual flattening (for custom metrics)
y_true_flat = flatten(y_true)
y_pred_flat = flatten(y_pred)
custom_report = classification_report(y_true_flat, y_pred_flat)
```
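
The automatic flattening can be pictured as a small wrapper around any metric that takes flat label lists. The sketch below is illustrative only and is not the library's actual implementation: `make_flat_metric` and `token_accuracy` are hypothetical names, and `itertools.chain` stands in for `flatten` so the example runs without `sklearn_crfsuite` installed.

```python
from functools import wraps
from itertools import chain


def make_flat_metric(metric):
    """Hypothetical helper: wrap a flat-label metric so it accepts sequences."""
    @wraps(metric)
    def wrapper(y_true, y_pred, **kwargs):
        # Remove one level of nesting from both inputs before scoring.
        y_true_flat = list(chain.from_iterable(y_true))
        y_pred_flat = list(chain.from_iterable(y_pred))
        return metric(y_true_flat, y_pred_flat, **kwargs)
    return wrapper


def token_accuracy(y_true, y_pred):
    """Fraction of positions where the two flat label lists agree."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    if not y_true:
        return 0.0
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


flat_token_accuracy = make_flat_metric(token_accuracy)

y_true = [['B-PER', 'I-PER', 'O'], ['O', 'B-LOC']]
y_pred = [['B-PER', 'O', 'O'], ['O', 'B-LOC']]
score = flat_token_accuracy(y_true, y_pred)  # 4 of 5 tokens match
```

The same wrapper shape would apply to any custom metric that expects two flat label lists, which is the case the manual flattening pattern above is meant for.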

### Data Preprocessing Applications

`flatten` is also useful for sequence preprocessing tasks such as counting labels or building a vocabulary:

**Usage Example:**

```python
from sklearn_crfsuite.utils import flatten
from collections import Counter

def analyze_label_distribution(y_sequences):
    """Analyze label distribution across all sequences."""
    all_labels = flatten(y_sequences)
    return Counter(all_labels)

def create_vocabulary(feature_sequences, feature_key='word'):
    """Create vocabulary from feature sequences."""
    all_words = flatten([[token.get(feature_key, '') for token in seq]
                         for seq in feature_sequences])
    return set(all_words)

# Example usage
y_train = [['B-PER', 'I-PER', 'O'], ['O', 'B-LOC', 'I-LOC']]
label_dist = analyze_label_distribution(y_train)
print(f"Label distribution: {label_dist}")

X_train = [
    [{'word': 'John', 'pos': 'NNP'}, {'word': 'lives', 'pos': 'VBZ'}],
    [{'word': 'in', 'pos': 'IN'}, {'word': 'Boston', 'pos': 'NNP'}]
]
vocab = create_vocabulary(X_train)
print(f"Vocabulary: {sorted(vocab)}")
```
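
Flattening discards sequence boundaries, so any per-token result computed on a flat list has to be regrouped if sequence-level output is needed. One way to do that is sketched here, under the assumption that the original sequence lengths are still available. `unflatten` is a hypothetical helper, not part of `sklearn_crfsuite.utils`, and `itertools.chain` stands in for `flatten`.

```python
from itertools import chain, islice


def unflatten(flat, lengths):
    """Hypothetical inverse of flatten: regroup a flat list by sequence length."""
    if sum(lengths) != len(flat):
        raise ValueError("lengths must sum to the number of flat items")
    it = iter(flat)
    # islice consumes exactly n items from the shared iterator per sequence.
    return [list(islice(it, n)) for n in lengths]


y_sequences = [['B-PER', 'I-PER', 'O'], ['O', 'B-LOC', 'I-LOC']]
lengths = [len(seq) for seq in y_sequences]
y_flat = list(chain.from_iterable(y_sequences))  # stand-in for flatten
restored = unflatten(y_flat, lengths)
```

Recording `lengths` before flattening is the only extra bookkeeping this approach needs; `restored` then has the same shape as `y_sequences`.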