# N-Gram and Shingle-Based Algorithms

Algorithms that convert strings into sets or profiles of n-character sequences (shingles or n-grams) and compute similarity based on these representations. These algorithms support both direct string comparison and pre-computed profile optimization for large datasets.

## Capabilities

### Q-Gram Distance

Q-gram distance as defined by Ukkonen, computed as the L1 norm of the difference between character n-gram profiles. This algorithm provides a lower bound on Levenshtein distance but can be computed in O(m + n) time compared to Levenshtein's O(m*n).

```python { .api }
class QGram(ShingleBased, StringDistance):
    def __init__(self, k: int = 3):
        """
        Initialize Q-Gram with shingle size.

        Args:
            k: Size of character n-grams (shingles) to use (default: 3)
        """

    def distance(self, s0: str, s1: str) -> float:
        """
        Calculate Q-gram distance between two strings.

        Args:
            s0: First string
            s1: Second string

        Returns:
            float: Q-gram distance (sum of absolute differences in n-gram counts)

        Raises:
            TypeError: If either string is None
        """

    @staticmethod
    def distance_profile(profile0: dict, profile1: dict) -> float:
        """
        Calculate Q-gram distance between pre-computed profiles.

        Args:
            profile0: N-gram profile dictionary for first string
            profile1: N-gram profile dictionary for second string

        Returns:
            float: Q-gram distance between profiles
        """
```

**Usage Examples:**

```python
from similarity.qgram import QGram

# Basic Q-gram distance with different k values
qgram2 = QGram(2)  # Bigrams
qgram3 = QGram(3)  # Trigrams

distance = qgram2.distance('ABCD', 'ABCE')    # Returns: 2 (different endings)
distance = qgram3.distance('hello', 'hallo')  # Returns: 4 (hel, ell, hal, all are unmatched)

# Profile-based computation for large datasets
qgram = QGram(2)
strings = ['hello', 'hallo', 'hullo', 'hillo']

# Pre-compute profiles once
profiles = [qgram.get_profile(s) for s in strings]

# Compare all pairs efficiently
for i in range(len(strings)):
    for j in range(i + 1, len(strings)):
        dist = QGram.distance_profile(profiles[i], profiles[j])
        print(f"'{strings[i]}' vs '{strings[j]}': {dist}")
```
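
The profile arithmetic behind `distance` can be sketched in a few lines of plain Python. The helper names `qgram_profile` and `qgram_distance` below are illustrative, not part of the library's API:

```python
from collections import Counter

def qgram_profile(s: str, k: int) -> Counter:
    # Count every overlapping k-character substring (shingle).
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def qgram_distance(s0: str, s1: str, k: int = 3) -> int:
    # L1 norm of the difference between the two k-gram profiles:
    # sum of absolute count differences over every gram seen in either string.
    p0, p1 = qgram_profile(s0, k), qgram_profile(s1, k)
    return sum(abs(p0[g] - p1[g]) for g in p0.keys() | p1.keys())

print(qgram_distance('ABCD', 'ABCE', k=2))    # 2
print(qgram_distance('hello', 'hallo', k=3))  # 4
```

Because each profile is built in a single pass over its string, the cost is linear in the combined string lengths, which is what makes the O(m + n) bound possible.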

### Cosine Similarity

Cosine similarity treats n-gram profiles as vectors and computes the cosine of the angle between them. This normalized measure is effective for comparing documents and text with different lengths.

```python { .api }
class Cosine(ShingleBased, NormalizedStringDistance, NormalizedStringSimilarity):
    def __init__(self, k: int):
        """
        Initialize Cosine similarity with shingle size.

        Args:
            k: Size of character n-grams (shingles) to use
        """

    def distance(self, s0: str, s1: str) -> float:
        """
        Calculate cosine distance (1 - cosine similarity).

        Args:
            s0: First string
            s1: Second string

        Returns:
            float: Cosine distance in range [0.0, 1.0] where 0.0 = identical

        Raises:
            TypeError: If either string is None
        """

    def similarity(self, s0: str, s1: str) -> float:
        """
        Calculate cosine similarity between two strings.

        Args:
            s0: First string
            s1: Second string

        Returns:
            float: Cosine similarity in range [0.0, 1.0] where 1.0 = identical

        Raises:
            TypeError: If either string is None
        """

    def similarity_profiles(self, profile0: dict, profile1: dict) -> float:
        """
        Calculate cosine similarity between pre-computed profiles.

        Args:
            profile0: N-gram profile dictionary for first string
            profile1: N-gram profile dictionary for second string

        Returns:
            float: Cosine similarity between profiles
        """
```

**Usage Examples:**

```python
from similarity.cosine import Cosine

# Document similarity with different k values
cosine3 = Cosine(3)
cosine4 = Cosine(4)

# Text comparison
text1 = "The quick brown fox jumps over the lazy dog"
text2 = "A quick brown fox jumped over a lazy dog"

similarity = cosine3.similarity(text1, text2)
distance = cosine3.distance(text1, text2)
print(f"Cosine similarity: {similarity:.3f}")
print(f"Cosine distance: {distance:.3f}")

# Profile-based computation
cosine = Cosine(2)
documents = [
    "machine learning algorithms",
    "deep learning neural networks",
    "artificial intelligence systems",
    "natural language processing"
]

# Pre-compute profiles
profiles = [cosine.get_profile(doc) for doc in documents]

# Find most similar documents
max_sim = 0
best_pair = None
for i in range(len(documents)):
    for j in range(i + 1, len(documents)):
        sim = cosine.similarity_profiles(profiles[i], profiles[j])
        if sim > max_sim:
            max_sim = sim
            best_pair = (i, j)

if best_pair:
    i, j = best_pair
    print(f"Most similar: '{documents[i]}' and '{documents[j]}' ({max_sim:.3f})")
```
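
Treating each profile as a sparse vector, the similarity reduces to a dot product over shared n-grams divided by the product of the vector norms. A stdlib sketch (the helper names `profile` and `cosine_similarity` are illustrative, not the library's API):

```python
import math
from collections import Counter

def profile(s: str, k: int) -> Counter:
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def cosine_similarity(p0: Counter, p1: Counter) -> float:
    # Only n-grams present in both profiles contribute to the dot product.
    dot = sum(p0[g] * p1[g] for g in p0.keys() & p1.keys())
    norm = (math.sqrt(sum(v * v for v in p0.values()))
            * math.sqrt(sum(v * v for v in p1.values())))
    return dot / norm if norm else 0.0

print(cosine_similarity(profile('hello', 2), profile('hallo', 2)))  # 0.5
```

Because the dot product is normalized by the vector lengths, a long document and a short one with the same n-gram proportions still score close to 1.0, which is why this measure suits texts of different lengths.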

### Jaccard Index

Jaccard index treats n-gram profiles as sets (ignoring counts) and computes the ratio of intersection to union. This metric is particularly effective for comparing texts with different vocabulary sizes.

```python { .api }
class Jaccard(ShingleBased, MetricStringDistance, NormalizedStringDistance, NormalizedStringSimilarity):
    def __init__(self, k: int):
        """
        Initialize Jaccard index with shingle size.

        Args:
            k: Size of character n-grams (shingles) to use
        """

    def distance(self, s0: str, s1: str) -> float:
        """
        Calculate Jaccard distance (1 - Jaccard similarity).

        Args:
            s0: First string
            s1: Second string

        Returns:
            float: Jaccard distance in range [0.0, 1.0] where 0.0 = identical

        Raises:
            TypeError: If either string is None
        """

    def similarity(self, s0: str, s1: str) -> float:
        """
        Calculate Jaccard similarity between two strings.

        Args:
            s0: First string
            s1: Second string

        Returns:
            float: Jaccard similarity in range [0.0, 1.0] where 1.0 = identical

        Raises:
            TypeError: If either string is None
        """
```

**Usage Examples:**

```python
from similarity.jaccard import Jaccard

# Set-based similarity
jaccard2 = Jaccard(2)
jaccard3 = Jaccard(3)

# Compare strings as sets of character n-grams
similarity = jaccard2.similarity('hello', 'hallo')  # Set overlap
distance = jaccard2.distance('hello', 'hallo')

# Compare different length strings
similarity = jaccard3.similarity('programming', 'program')
print(f"'programming' vs 'program': {similarity:.3f}")

# Document deduplication example
documents = [
    "artificial intelligence machine learning",
    "machine learning artificial intelligence",
    "deep learning neural networks",
    "neural networks for deep learning"
]

jaccard = Jaccard(3)
threshold = 0.5

print("Similar document pairs (Jaccard > 0.5):")
for i in range(len(documents)):
    for j in range(i + 1, len(documents)):
        sim = jaccard.similarity(documents[i], documents[j])
        if sim > threshold:
            print(f"  '{documents[i]}' vs '{documents[j]}': {sim:.3f}")
```

### Sorensen-Dice Coefficient

Similar to Jaccard but uses a different formula: 2 * |intersection| / (|set1| + |set2|). This gives higher weight to matches and is often more sensitive to similarities than Jaccard.

```python { .api }
class SorensenDice(ShingleBased, NormalizedStringDistance, NormalizedStringSimilarity):
    def __init__(self, k: int = 3):
        """
        Initialize Sorensen-Dice with shingle size.

        Args:
            k: Size of character n-grams (shingles) to use (default: 3)
        """

    def distance(self, s0: str, s1: str) -> float:
        """
        Calculate Sorensen-Dice distance (1 - Sorensen-Dice similarity).

        Args:
            s0: First string
            s1: Second string

        Returns:
            float: Sorensen-Dice distance in range [0.0, 1.0] where 0.0 = identical

        Raises:
            TypeError: If either string is None
        """

    def similarity(self, s0: str, s1: str) -> float:
        """
        Calculate Sorensen-Dice similarity between two strings.

        Args:
            s0: First string
            s1: Second string

        Returns:
            float: Sorensen-Dice similarity in range [0.0, 1.0] where 1.0 = identical

        Raises:
            TypeError: If either string is None
        """
```

**Usage Examples:**

```python
from similarity.sorensen_dice import SorensenDice

# Basic Sorensen-Dice similarity
dice = SorensenDice(2)
similarity = dice.similarity('hello', 'hallo')
distance = dice.distance('hello', 'hallo')

# Compare with Jaccard
from similarity.jaccard import Jaccard
jaccard = Jaccard(2)

test_pairs = [
    ('programming', 'program'),
    ('hello', 'world'),
    ('similar', 'similarity')
]

for s1, s2 in test_pairs:
    dice_sim = dice.similarity(s1, s2)
    jaccard_sim = jaccard.similarity(s1, s2)
    print(f"'{s1}' vs '{s2}':")
    print(f"  Dice: {dice_sim:.3f}, Jaccard: {jaccard_sim:.3f}")
```
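
The formula can be sketched directly from the definition. Since the two measures relate as Dice = 2J / (1 + J), Dice is always at least as large as Jaccard on the same pair, which is the "higher weight to matches" noted above (the helper names `shingles` and `dice_similarity` are illustrative, not the library's API):

```python
def shingles(s: str, k: int) -> set:
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def dice_similarity(s0: str, s1: str, k: int = 3) -> float:
    a, b = shingles(s0, k), shingles(s1, k)
    if not a and not b:
        return 1.0
    # Matches are counted twice in the numerator, so Dice >= Jaccard.
    return 2 * len(a & b) / (len(a) + len(b))

print(dice_similarity('hello', 'hallo', k=2))  # 0.5, versus Jaccard's 1/3 on the same pair
```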

### N-Gram Distance (Kondrak)

Normalized n-gram distance as defined by Kondrak, using special character affixing to weight initial characters more heavily. This algorithm is designed specifically for string similarity rather than set operations.

```python { .api }
class NGram(NormalizedStringDistance):
    def __init__(self, n: int = 2):
        """
        Initialize N-Gram distance with gram size.

        Args:
            n: Size of character n-grams to use (default: 2)
        """

    def distance(self, s0: str, s1: str) -> float:
        """
        Calculate normalized N-gram distance with special character affixing.

        Args:
            s0: First string
            s1: Second string

        Returns:
            float: Normalized distance in range [0.0, 1.0] where 0.0 = identical

        Raises:
            TypeError: If either string is None
        """
```

**Usage Examples:**

```python
from similarity.ngram import NGram

# Different n-gram sizes
twogram = NGram(2)
fourgram = NGram(4)

# String comparison with weighted prefixes
distance = twogram.distance('ABCD', 'ABTUIO')
print(f"2-gram distance: {distance:.3f}")

# Longer strings with different n-gram sizes
s1 = 'Adobe CreativeSuite 5 Master Collection from cheap 4zp'
s2 = 'Adobe CreativeSuite 5 Master Collection from cheap d1x'

distance2 = twogram.distance(s1, s2)
distance4 = fourgram.distance(s1, s2)
print(f"Similar strings - 2-gram: {distance2:.3f}, 4-gram: {distance4:.3f}")
```
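
The affixing step on its own is easy to picture: the string is prefixed with n-1 copies of a sentinel character before grams are taken, so the first real character participates in extra grams and a mismatch at the start of the string is penalized more. Below is a sketch of just this padding step; the full Kondrak algorithm additionally aligns the padded grams with dynamic programming, and `padded_ngrams` is an illustrative helper, not the library's API:

```python
def padded_ngrams(s: str, n: int = 2, pad: str = '\0') -> list:
    # Prefix n-1 sentinel characters so the first real character
    # appears in n-grams that no interior character can match by accident.
    padded = pad * (n - 1) + s
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(padded_ngrams('abc'))  # ['\x00a', 'ab', 'bc']
# A difference in the first character now breaks two grams instead of one:
print(padded_ngrams('xbc'))  # ['\x00x', 'xb', 'bc']
```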

### ShingleBased Base Class

All shingle-based algorithms inherit from this base class, which provides profile computation:

```python { .api }
class ShingleBased:
    def __init__(self, k: int = 3):
        """
        Initialize with shingle size.

        Args:
            k: Size of character n-grams (shingles) to use
        """

    def get_k(self) -> int:
        """
        Get the current shingle size.

        Returns:
            int: Current shingle size
        """

    def get_profile(self, string: str) -> dict:
        """
        Convert string to n-gram profile (dictionary of n-gram counts).

        Args:
            string: Input string to profile

        Returns:
            dict: Dictionary mapping n-grams to their occurrence counts
        """
```

**Profile Usage Example:**

```python
from similarity.cosine import Cosine

cosine = Cosine(3)

# Get n-gram profiles
text = "hello world"
profile = cosine.get_profile(text)
print(f"3-gram profile for '{text}': {profile}")

# Profiles can be reused for multiple comparisons
profile1 = cosine.get_profile("machine learning")
profile2 = cosine.get_profile("deep learning")
profile3 = cosine.get_profile("learning algorithms")

similarity_12 = cosine.similarity_profiles(profile1, profile2)
similarity_13 = cosine.similarity_profiles(profile1, profile3)
print(f"ML vs DL: {similarity_12:.3f}")
print(f"ML vs Algorithms: {similarity_13:.3f}")
```
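
Conceptually, a profile is a multiset of overlapping substrings: a string of length n yields n - k + 1 shingles, and repeats are counted rather than deduplicated. A stdlib sketch of this semantics (assuming contiguous character shingles; `make_profile` is an illustrative name, and the library's exact dict type may differ):

```python
from collections import Counter

def make_profile(string: str, k: int = 3) -> Counter:
    # Slide a window of width k across the string, counting each shingle.
    return Counter(string[i:i + k] for i in range(len(string) - k + 1))

# 'banana' has four trigrams (ban, ana, nan, ana); 'ana' occurs twice.
print(make_profile('banana'))
```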