0
# Sequence-Based Algorithms
1
2
Algorithms based on longest common subsequences and sequence alignment techniques. These algorithms are commonly used in diff utilities, version control systems, bioinformatics applications, and other domains where sequence comparison is important.
3
4
## Capabilities
5
6
### Longest Common Subsequence (LCS)
7
8
Computes the distance based on the longest common subsequence between two strings. Unlike substrings, subsequences don't need to be contiguous, making this algorithm useful for comparing strings with insertions and rearrangements.
9
10
```python { .api }
11
class LongestCommonSubsequence(StringDistance):
12
def distance(self, s0: str, s1: str) -> float:
13
"""
14
Calculate LCS distance: len(s0) + len(s1) - 2 * LCS_length.
15
16
Args:
17
s0: First string
18
s1: Second string
19
20
Returns:
21
float: LCS distance (minimum 0, maximum len(s0) + len(s1))
22
23
Raises:
24
TypeError: If either string is None
25
"""
26
27
@staticmethod
28
def length(s0: str, s1: str) -> float:
29
"""
30
Calculate the length of the longest common subsequence.
31
32
Args:
33
s0: First string
34
s1: Second string
35
36
Returns:
37
float: Length of the longest common subsequence
38
39
Raises:
40
TypeError: If either string is None
41
"""
42
```
43
44
**Usage Examples:**
45
46
```python
47
from similarity.longest_common_subsequence import LongestCommonSubsequence
48
49
lcs = LongestCommonSubsequence()
50
51
# Basic LCS distance
52
distance = lcs.distance('AGCAT', 'GAC') # Returns: 4.0
53
distance = lcs.distance('AGCAT', 'AGCT') # Returns: 1.0
54
55
# LCS length calculation
56
length = lcs.length('AGCAT', 'GAC') # Returns: 2.0 (common: 'AC')
57
length = lcs.length('ABCDEF', 'ACBDEF') # Returns: 5.0 (common: 'ACDEF')
58
59
# Practical example: comparing file versions
60
version1 = "def hello_world():\n print('Hello')\n return"
61
version2 = "def hello_world():\n print('Hello, World!')\n return"
62
lcs_len = lcs.length(version1, version2)
63
lcs_dist = lcs.distance(version1, version2)
64
print(f"Common characters: {lcs_len}")
65
print(f"LCS distance: {lcs_dist}")
66
```
67
68
### Metric LCS
69
70
A normalized version of LCS distance that returns values in the range [0.0, 1.0] and satisfies the triangle inequality, making it a proper metric distance. Based on the paper "An LCS-based string metric" by Daniel Bakkelund.
71
72
```python { .api }
73
class MetricLCS(MetricStringDistance, NormalizedStringDistance):
74
def __init__(self):
75
"""Initialize MetricLCS with internal LCS instance."""
76
77
def distance(self, s0: str, s1: str) -> float:
78
"""
79
Calculate normalized metric LCS distance: 1 - |LCS(s0,s1)| / max(|s0|, |s1|).
80
81
Args:
82
s0: First string
83
s1: Second string
84
85
Returns:
86
float: Normalized distance in range [0.0, 1.0] where 0.0 = identical
87
88
Raises:
89
TypeError: If either string is None
90
"""
91
```
92
93
**Usage Examples:**
94
95
```python
96
from similarity.metric_lcs import MetricLCS
97
98
metric_lcs = MetricLCS()
99
100
# Basic normalized distance
101
distance = metric_lcs.distance('ABCDEFG', 'ABCDEFHJKL') # Returns: 0.4
102
distance = metric_lcs.distance('ABDEF', 'ABDIF') # Returns: ~0.2
103
104
# Practical comparison with different string lengths
105
strings = ['programming', 'programing', 'programs', 'grogram']
106
base = 'program'
107
108
for s in strings:
109
dist = metric_lcs.distance(base, s)
110
print(f"'{base}' vs '{s}': {dist:.3f}")
111
112
# Triangle inequality verification (metric property)
113
s1, s2, s3 = 'abc', 'def', 'ghi'
114
d12 = metric_lcs.distance(s1, s2)
115
d23 = metric_lcs.distance(s2, s3)
116
d13 = metric_lcs.distance(s1, s3)
117
print(f"Triangle inequality: {d13} <= {d12} + {d23} = {d12 + d23}")
118
assert d13 <= d12 + d23 # Should always be True for metric distances
119
```
120
121
### Algorithm Comparison
122
123
**LCS Distance** is ideal for:
124
- Diff utilities and version control
125
- Comparing sequences where order matters but gaps are allowed
126
- Applications needing raw distance values
127
- When you don't need normalized results
128
129
**Metric LCS** is ideal for:
130
- Applications requiring normalized distances (0.0-1.0)
131
- When you need triangle inequality guarantees
132
- Clustering or nearest-neighbor algorithms
133
- Comparing strings of very different lengths
134
135
**Comparative Example:**
136
137
```python
138
from similarity.longest_common_subsequence import LongestCommonSubsequence
139
from similarity.metric_lcs import MetricLCS
140
141
lcs = LongestCommonSubsequence()
142
metric_lcs = MetricLCS()
143
144
test_pairs = [
145
('ABCDEF', 'ACEF'), # Deletions
146
('ABC', 'XAYBZC'), # Insertions
147
('hello', 'world'), # Different strings
148
('programming', 'program'), # Substring relationship
149
]
150
151
for s1, s2 in test_pairs:
152
lcs_dist = lcs.distance(s1, s2)
153
lcs_len = lcs.length(s1, s2)
154
metric_dist = metric_lcs.distance(s1, s2)
155
156
print(f"'{s1}' vs '{s2}':")
157
print(f" LCS length: {lcs_len}")
158
print(f" LCS distance: {lcs_dist}")
159
print(f" Metric LCS distance: {metric_dist:.3f}")
160
print()
161
```
162
163
### Implementation Notes
164
165
Both algorithms use dynamic programming with O(m*n) time complexity and space requirements, where m and n are the string lengths. The LCS algorithms are equivalent to Levenshtein distance when only insertions and deletions are allowed (no substitutions), or when the substitution cost is double the insertion/deletion cost.