or run

npx @tessl/cli init
Log in

Version

Tile

Overview

Evals

Files

Files

docs

edit-distance.mdfactory-utilities.mdindex.mdngram-shingle.mdphonetic-record-linkage.mdsequence-based.md

sequence-based.mddocs/

0

# Sequence-Based Algorithms

1

2

Algorithms based on longest common subsequences and sequence alignment techniques. These algorithms are commonly used in diff utilities, version control systems, bioinformatics applications, and other domains where sequence comparison is important.

3

4

## Capabilities

5

6

### Longest Common Subsequence (LCS)

7

8

Computes the distance based on the longest common subsequence between two strings. Unlike substrings, subsequences don't need to be contiguous, making this algorithm useful for comparing strings with insertions and rearrangements.

9

10

```python { .api }

11

class LongestCommonSubsequence(StringDistance):

12

def distance(self, s0: str, s1: str) -> float:

13

"""

14

Calculate LCS distance: len(s0) + len(s1) - 2 * LCS_length.

15

16

Args:

17

s0: First string

18

s1: Second string

19

20

Returns:

21

float: LCS distance (minimum 0, maximum len(s0) + len(s1))

22

23

Raises:

24

TypeError: If either string is None

25

"""

26

27

@staticmethod

28

def length(s0: str, s1: str) -> float:

29

"""

30

Calculate the length of the longest common subsequence.

31

32

Args:

33

s0: First string

34

s1: Second string

35

36

Returns:

37

float: Length of the longest common subsequence

38

39

Raises:

40

TypeError: If either string is None

41

"""

42

```

43

44

**Usage Examples:**

45

46

```python

47

from similarity.longest_common_subsequence import LongestCommonSubsequence

48

49

lcs = LongestCommonSubsequence()

50

51

# Basic LCS distance

52

distance = lcs.distance('AGCAT', 'GAC') # Returns: 4.0

53

distance = lcs.distance('AGCAT', 'AGCT') # Returns: 1.0

54

55

# LCS length calculation

56

length = lcs.length('AGCAT', 'GAC') # Returns: 2.0 (common: 'AC')

57

length = lcs.length('ABCDEF', 'ACBDEF') # Returns: 5.0 (common: 'ACDEF')

58

59

# Practical example: comparing file versions

60

version1 = "def hello_world():\n print('Hello')\n return"

61

version2 = "def hello_world():\n print('Hello, World!')\n return"

62

lcs_len = lcs.length(version1, version2)

63

lcs_dist = lcs.distance(version1, version2)

64

print(f"Common characters: {lcs_len}")

65

print(f"LCS distance: {lcs_dist}")

66

```

67

68

### Metric LCS

69

70

A normalized version of LCS distance that returns values in the range [0.0, 1.0] and satisfies the triangle inequality, making it a proper metric distance. Based on the paper "An LCS-based string metric" by Daniel Bakkelund.

71

72

```python { .api }

73

class MetricLCS(MetricStringDistance, NormalizedStringDistance):

74

def __init__(self):

75

"""Initialize MetricLCS with internal LCS instance."""

76

77

def distance(self, s0: str, s1: str) -> float:

78

"""

79

Calculate normalized metric LCS distance: 1 - |LCS(s0,s1)| / max(|s0|, |s1|).

80

81

Args:

82

s0: First string

83

s1: Second string

84

85

Returns:

86

float: Normalized distance in range [0.0, 1.0] where 0.0 = identical

87

88

Raises:

89

TypeError: If either string is None

90

"""

91

```

92

93

**Usage Examples:**

94

95

```python

96

from similarity.metric_lcs import MetricLCS

97

98

metric_lcs = MetricLCS()

99

100

# Basic normalized distance

101

distance = metric_lcs.distance('ABCDEFG', 'ABCDEFHJKL') # Returns: 0.4

102

distance = metric_lcs.distance('ABDEF', 'ABDIF') # Returns: ~0.2

103

104

# Practical comparison with different string lengths

105

strings = ['programming', 'programing', 'programs', 'grogram']

106

base = 'program'

107

108

for s in strings:

109

dist = metric_lcs.distance(base, s)

110

print(f"'{base}' vs '{s}': {dist:.3f}")

111

112

# Triangle inequality verification (metric property)

113

s1, s2, s3 = 'abc', 'def', 'ghi'

114

d12 = metric_lcs.distance(s1, s2)

115

d23 = metric_lcs.distance(s2, s3)

116

d13 = metric_lcs.distance(s1, s3)

117

print(f"Triangle inequality: {d13} <= {d12} + {d23} = {d12 + d23}")

118

assert d13 <= d12 + d23 # Should always be True for metric distances

119

```

120

121

### Algorithm Comparison

122

123

**LCS Distance** is ideal for:

124

- Diff utilities and version control

125

- Comparing sequences where order matters but gaps are allowed

126

- Applications needing raw distance values

127

- When you don't need normalized results

128

129

**Metric LCS** is ideal for:

130

- Applications requiring normalized distances (0.0-1.0)

131

- When you need triangle inequality guarantees

132

- Clustering or nearest-neighbor algorithms

133

- Comparing strings of very different lengths

134

135

**Comparative Example:**

136

137

```python

138

from similarity.longest_common_subsequence import LongestCommonSubsequence

139

from similarity.metric_lcs import MetricLCS

140

141

lcs = LongestCommonSubsequence()

142

metric_lcs = MetricLCS()

143

144

test_pairs = [

145

('ABCDEF', 'ACEF'), # Deletions

146

('ABC', 'XAYBZC'), # Insertions

147

('hello', 'world'), # Different strings

148

('programming', 'program'), # Substring relationship

149

]

150

151

for s1, s2 in test_pairs:

152

lcs_dist = lcs.distance(s1, s2)

153

lcs_len = lcs.length(s1, s2)

154

metric_dist = metric_lcs.distance(s1, s2)

155

156

print(f"'{s1}' vs '{s2}':")

157

print(f" LCS length: {lcs_len}")

158

print(f" LCS distance: {lcs_dist}")

159

print(f" Metric LCS distance: {metric_dist:.3f}")

160

print()

161

```

162

163

### Implementation Notes

164

165

Both algorithms use dynamic programming with O(m*n) time complexity and space requirements, where m and n are the string lengths. The LCS algorithms are equivalent to Levenshtein distance when only insertions and deletions are allowed (no substitutions), or when the substitution cost is double the insertion/deletion cost.