# Utility Functions

Helper functions for working with sequence data and CRF-specific data transformations. These utilities are primarily used internally by the metrics module but are available for advanced use cases requiring sequence data manipulation.

## Capabilities

### Sequence Flattening

Converts nested sequence structures into flat lists, essential for adapting CRF sequence data to work with standard scikit-learn metrics that expect flat label arrays.

```python { .api }
def flatten(sequences):
    """
    Flatten a list of sequences into a single list.

    Parameters:
    - sequences: List[List[Any]], list of sequences to flatten

    Returns:
    - List[Any]: flattened list combining all sequence elements
    """
```
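
The behaviour described by this signature can be approximated in a few lines. The following is a minimal sketch, not necessarily the library's implementation: `flatten_sketch` is a hypothetical name, and the sketch assumes that exactly one level of nesting is removed, as the `List[List[Any]]` parameter type suggests.

```python
from itertools import chain


def flatten_sketch(sequences):
    """Hypothetical stand-in for flatten: concatenate the inner lists in order."""
    # chain.from_iterable walks each inner sequence in turn, so only the
    # outermost level of nesting is removed (an assumption of this sketch).
    return list(chain.from_iterable(sequences))


labels = flatten_sketch([['B-PER', 'I-PER', 'O'], ['O', 'B-LOC']])
print(labels)  # ['B-PER', 'I-PER', 'O', 'O', 'B-LOC']
```

Under this assumption, elements that are themselves lists stay intact, so deeper nesting would need a separate pass.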

**Usage Example:**

```python
from sklearn_crfsuite.utils import flatten

# Flatten sequence labels for use with sklearn metrics
y_sequences = [['B-PER', 'I-PER', 'O'], ['O', 'B-LOC']]
y_flat = flatten(y_sequences)
print(y_flat)  # ['B-PER', 'I-PER', 'O', 'O', 'B-LOC']

# Flatten feature sequences (less common use case)
feature_sequences = [
    [{'word': 'John'}, {'word': 'Smith'}],
    [{'word': 'New'}, {'word': 'York'}]
]
# Note: flatten works on any nested list structure
flat_features = flatten([[f['word'] for f in seq] for seq in feature_sequences])
print(flat_features)  # ['John', 'Smith', 'New', 'York']
```

### Integration with Metrics

The `flatten` function is applied automatically by all "flat" metrics in `sklearn_crfsuite.metrics`: each one flattens the true and predicted label sequences before passing them to the corresponding scikit-learn metric function.

**Usage Pattern:**

```python
from sklearn_crfsuite import metrics
from sklearn_crfsuite.utils import flatten
from sklearn.metrics import classification_report

# Illustrative gold and predicted label sequences (one inner list per sentence)
y_true = [['B-PER', 'I-PER', 'O'], ['O', 'B-LOC']]
y_pred = [['B-PER', 'O', 'O'], ['O', 'B-LOC']]

# Automatic flattening (recommended)
report = metrics.flat_classification_report(y_true, y_pred)

# Manual flattening (for custom metrics)
y_true_flat = flatten(y_true)
y_pred_flat = flatten(y_pred)
custom_report = classification_report(y_true_flat, y_pred_flat)
```
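
To illustrate the "flatten, then delegate" pattern, here is a rough sketch of how a flat metric could be built by wrapping an ordinary metric. The names `flattens_labels`, `token_accuracy`, and `flat_token_accuracy` are hypothetical and chosen for this example; the sketch uses only the standard library and is not presented as how `sklearn_crfsuite.metrics` is actually implemented.

```python
from functools import wraps
from itertools import chain


def flattens_labels(metric):
    """Hypothetical decorator: flatten y_true and y_pred, then call `metric`."""
    @wraps(metric)
    def wrapper(y_true, y_pred, *args, **kwargs):
        # Remove one level of nesting from both label structures.
        y_true_flat = list(chain.from_iterable(y_true))
        y_pred_flat = list(chain.from_iterable(y_pred))
        return metric(y_true_flat, y_pred_flat, *args, **kwargs)
    return wrapper


def token_accuracy(y_true, y_pred):
    """Plain flat metric: fraction of positions where the labels match."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


flat_token_accuracy = flattens_labels(token_accuracy)

y_true = [['B-PER', 'I-PER', 'O'], ['O', 'B-LOC']]
y_pred = [['B-PER', 'O', 'O'], ['O', 'B-LOC']]
print(flat_token_accuracy(y_true, y_pred))  # 0.8
```

Wrapping keeps the underlying metric unaware of sequence boundaries, which is why any function that accepts flat label lists can be adapted this way.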

### Data Preprocessing Applications

`flatten` is also useful for preprocessing tasks that need a corpus-wide view of sequence data, such as counting labels or building a vocabulary:

**Usage Example:**

```python
from sklearn_crfsuite.utils import flatten
from collections import Counter

def analyze_label_distribution(y_sequences):
    """Analyze label distribution across all sequences."""
    all_labels = flatten(y_sequences)
    return Counter(all_labels)

def create_vocabulary(feature_sequences, feature_key='word'):
    """Create vocabulary from feature sequences."""
    all_words = flatten([[token.get(feature_key, '') for token in seq]
                         for seq in feature_sequences])
    return set(all_words)

# Example usage
y_train = [['B-PER', 'I-PER', 'O'], ['O', 'B-LOC', 'I-LOC']]
label_dist = analyze_label_distribution(y_train)
print(f"Label distribution: {label_dist}")

X_train = [
    [{'word': 'John', 'pos': 'NNP'}, {'word': 'lives', 'pos': 'VBZ'}],
    [{'word': 'in', 'pos': 'IN'}, {'word': 'Boston', 'pos': 'NNP'}]
]
vocab = create_vocabulary(X_train)
print(f"Vocabulary: {sorted(vocab)}")
```
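
Another preprocessing task in the same spirit is deriving the list of entity labels to report on, leaving out the outside tag. The sketch below is illustrative only: `entity_labels` and its `outside_label` parameter are hypothetical names, and `itertools.chain` stands in for `flatten` so that the block runs on its own.

```python
from itertools import chain


def entity_labels(y_sequences, outside_label='O'):
    """Hypothetical helper: distinct labels except `outside_label`, grouped by entity type."""
    # Stand-in for sklearn_crfsuite.utils.flatten so the sketch is self-contained.
    all_labels = chain.from_iterable(y_sequences)
    labels = set(all_labels) - {outside_label}
    # Sort by entity type first ('LOC' before 'PER'), then by prefix ('B' before 'I').
    return sorted(labels, key=lambda name: (name[1:], name[0]))


y_train = [['B-PER', 'I-PER', 'O'], ['O', 'B-LOC', 'I-LOC']]
print(entity_labels(y_train))  # ['B-LOC', 'I-LOC', 'B-PER', 'I-PER']
```

A list like this could then be passed as the `labels` argument that scikit-learn's `classification_report` accepts, so the report covers only entity tags.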