# N-Gram and Shingle-Based Algorithms

Algorithms that convert strings into sets or profiles of n-character sequences (shingles or n-grams) and compute similarity based on these representations. These algorithms support both direct string comparison and pre-computed profile optimization for large datasets.

## Capabilities

### Q-Gram Distance

Q-gram distance as defined by Ukkonen, computed as the L1 norm of the difference between character n-gram profiles. This algorithm provides a lower bound on Levenshtein distance but can be computed in O(m + n) time compared to Levenshtein's O(m*n).

```python { .api }
class QGram(ShingleBased, StringDistance):
    def __init__(self, k: int = 3):
        """
        Initialize Q-Gram with shingle size.

        Args:
            k: Size of character n-grams (shingles) to use (default: 3)
        """

    def distance(self, s0: str, s1: str) -> float:
        """
        Calculate Q-gram distance between two strings.

        Args:
            s0: First string
            s1: Second string

        Returns:
            float: Q-gram distance (sum of absolute differences in n-gram counts)

        Raises:
            TypeError: If either string is None
        """

    @staticmethod
    def distance_profile(profile0: dict, profile1: dict) -> float:
        """
        Calculate Q-gram distance between pre-computed profiles.

        Args:
            profile0: N-gram profile dictionary for first string
            profile1: N-gram profile dictionary for second string

        Returns:
            float: Q-gram distance between profiles
        """
```

**Usage Examples:**

```python
from similarity.qgram import QGram

# Basic Q-gram distance with different k values
qgram2 = QGram(2)  # Bigrams
qgram3 = QGram(3)  # Trigrams

distance = qgram2.distance('ABCD', 'ABCE')    # Returns: 2 (different endings)
distance = qgram3.distance('hello', 'hallo')  # Returns: 4 (hel, ell, hal, all are unmatched)

# Profile-based computation for large datasets
qgram = QGram(2)
strings = ['hello', 'hallo', 'hullo', 'hillo']

# Pre-compute profiles once
profiles = [qgram.get_profile(s) for s in strings]

# Compare all pairs efficiently
for i in range(len(strings)):
    for j in range(i + 1, len(strings)):
        dist = QGram.distance_profile(profiles[i], profiles[j])
        print(f"'{strings[i]}' vs '{strings[j]}': {dist}")
```
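
The profile arithmetic behind `distance` can be sketched in a few lines of plain Python. The helper names `qgram_profile` and `qgram_distance` below are illustrative, not part of the library's API:

```python
from collections import Counter

def qgram_profile(s: str, k: int) -> Counter:
    # Count every overlapping k-character substring (shingle).
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def qgram_distance(s0: str, s1: str, k: int = 3) -> int:
    # L1 norm of the difference between the two k-gram profiles:
    # sum of absolute count differences over every gram seen in either string.
    p0, p1 = qgram_profile(s0, k), qgram_profile(s1, k)
    return sum(abs(p0[g] - p1[g]) for g in p0.keys() | p1.keys())

print(qgram_distance('ABCD', 'ABCE', k=2))    # 2
print(qgram_distance('hello', 'hallo', k=3))  # 4
```

Because each profile is built in a single pass over its string, the cost is linear in the combined string lengths, which is what makes the O(m + n) bound possible.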

### Cosine Similarity

Cosine similarity treats n-gram profiles as vectors and computes the cosine of the angle between them. This normalized measure is effective for comparing documents and text with different lengths.

```python { .api }
class Cosine(ShingleBased, NormalizedStringDistance, NormalizedStringSimilarity):
    def __init__(self, k: int):
        """
        Initialize Cosine similarity with shingle size.

        Args:
            k: Size of character n-grams (shingles) to use
        """

    def distance(self, s0: str, s1: str) -> float:
        """
        Calculate cosine distance (1 - cosine similarity).

        Args:
            s0: First string
            s1: Second string

        Returns:
            float: Cosine distance in range [0.0, 1.0] where 0.0 = identical

        Raises:
            TypeError: If either string is None
        """

    def similarity(self, s0: str, s1: str) -> float:
        """
        Calculate cosine similarity between two strings.

        Args:
            s0: First string
            s1: Second string

        Returns:
            float: Cosine similarity in range [0.0, 1.0] where 1.0 = identical

        Raises:
            TypeError: If either string is None
        """

    def similarity_profiles(self, profile0: dict, profile1: dict) -> float:
        """
        Calculate cosine similarity between pre-computed profiles.

        Args:
            profile0: N-gram profile dictionary for first string
            profile1: N-gram profile dictionary for second string

        Returns:
            float: Cosine similarity between profiles
        """
```

**Usage Examples:**

```python
from similarity.cosine import Cosine

# Document similarity with different k values
cosine3 = Cosine(3)
cosine4 = Cosine(4)

# Text comparison
text1 = "The quick brown fox jumps over the lazy dog"
text2 = "A quick brown fox jumped over a lazy dog"

similarity = cosine3.similarity(text1, text2)
distance = cosine3.distance(text1, text2)
print(f"Cosine similarity: {similarity:.3f}")
print(f"Cosine distance: {distance:.3f}")

# Profile-based computation
cosine = Cosine(2)
documents = [
    "machine learning algorithms",
    "deep learning neural networks",
    "artificial intelligence systems",
    "natural language processing"
]

# Pre-compute profiles
profiles = [cosine.get_profile(doc) for doc in documents]

# Find most similar documents
max_sim = 0
best_pair = None
for i in range(len(documents)):
    for j in range(i + 1, len(documents)):
        sim = cosine.similarity_profiles(profiles[i], profiles[j])
        if sim > max_sim:
            max_sim = sim
            best_pair = (i, j)

if best_pair:
    i, j = best_pair
    print(f"Most similar: '{documents[i]}' and '{documents[j]}' ({max_sim:.3f})")
```
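
Treating each profile as a sparse vector, the similarity reduces to a dot product over shared n-grams divided by the product of the vector norms. A stdlib sketch (the helper names `profile` and `cosine_similarity` are illustrative, not the library's API):

```python
import math
from collections import Counter

def profile(s: str, k: int) -> Counter:
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def cosine_similarity(p0: Counter, p1: Counter) -> float:
    # Only n-grams present in both profiles contribute to the dot product.
    dot = sum(p0[g] * p1[g] for g in p0.keys() & p1.keys())
    norm = (math.sqrt(sum(v * v for v in p0.values()))
            * math.sqrt(sum(v * v for v in p1.values())))
    return dot / norm if norm else 0.0

print(cosine_similarity(profile('hello', 2), profile('hallo', 2)))  # 0.5
```

Because the dot product is normalized by the vector lengths, a long document and a short one with the same n-gram proportions still score close to 1.0, which is why this measure suits texts of different lengths.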

### Jaccard Index

Jaccard index treats n-gram profiles as sets (ignoring counts) and computes the ratio of intersection to union. This metric is particularly effective for comparing texts with different vocabulary sizes.

```python { .api }
class Jaccard(ShingleBased, MetricStringDistance, NormalizedStringDistance, NormalizedStringSimilarity):
    def __init__(self, k: int):
        """
        Initialize Jaccard index with shingle size.

        Args:
            k: Size of character n-grams (shingles) to use
        """

    def distance(self, s0: str, s1: str) -> float:
        """
        Calculate Jaccard distance (1 - Jaccard similarity).

        Args:
            s0: First string
            s1: Second string

        Returns:
            float: Jaccard distance in range [0.0, 1.0] where 0.0 = identical

        Raises:
            TypeError: If either string is None
        """

    def similarity(self, s0: str, s1: str) -> float:
        """
        Calculate Jaccard similarity between two strings.

        Args:
            s0: First string
            s1: Second string

        Returns:
            float: Jaccard similarity in range [0.0, 1.0] where 1.0 = identical

        Raises:
            TypeError: If either string is None
        """
```

**Usage Examples:**

```python
from similarity.jaccard import Jaccard

# Set-based similarity
jaccard2 = Jaccard(2)
jaccard3 = Jaccard(3)

# Compare strings as sets of character n-grams
similarity = jaccard2.similarity('hello', 'hallo')  # Set overlap
distance = jaccard2.distance('hello', 'hallo')

# Compare different length strings
similarity = jaccard3.similarity('programming', 'program')
print(f"'programming' vs 'program': {similarity:.3f}")

# Document deduplication example
documents = [
    "artificial intelligence machine learning",
    "machine learning artificial intelligence",
    "deep learning neural networks",
    "neural networks for deep learning"
]

jaccard = Jaccard(3)
threshold = 0.5

print("Similar document pairs (Jaccard > 0.5):")
for i in range(len(documents)):
    for j in range(i + 1, len(documents)):
        sim = jaccard.similarity(documents[i], documents[j])
        if sim > threshold:
            print(f"  '{documents[i]}' vs '{documents[j]}': {sim:.3f}")
```

### Sorensen-Dice Coefficient

Similar to Jaccard but uses a different formula: 2 * |intersection| / (|set1| + |set2|). This gives higher weight to matches and is often more sensitive to similarities than Jaccard.

```python { .api }
class SorensenDice(ShingleBased, NormalizedStringDistance, NormalizedStringSimilarity):
    def __init__(self, k: int = 3):
        """
        Initialize Sorensen-Dice with shingle size.

        Args:
            k: Size of character n-grams (shingles) to use (default: 3)
        """

    def distance(self, s0: str, s1: str) -> float:
        """
        Calculate Sorensen-Dice distance (1 - Sorensen-Dice similarity).

        Args:
            s0: First string
            s1: Second string

        Returns:
            float: Sorensen-Dice distance in range [0.0, 1.0] where 0.0 = identical

        Raises:
            TypeError: If either string is None
        """

    def similarity(self, s0: str, s1: str) -> float:
        """
        Calculate Sorensen-Dice similarity between two strings.

        Args:
            s0: First string
            s1: Second string

        Returns:
            float: Sorensen-Dice similarity in range [0.0, 1.0] where 1.0 = identical

        Raises:
            TypeError: If either string is None
        """
```

**Usage Examples:**

```python
from similarity.sorensen_dice import SorensenDice

# Basic Sorensen-Dice similarity
dice = SorensenDice(2)
similarity = dice.similarity('hello', 'hallo')
distance = dice.distance('hello', 'hallo')

# Compare with Jaccard
from similarity.jaccard import Jaccard
jaccard = Jaccard(2)

test_pairs = [
    ('programming', 'program'),
    ('hello', 'world'),
    ('similar', 'similarity')
]

for s1, s2 in test_pairs:
    dice_sim = dice.similarity(s1, s2)
    jaccard_sim = jaccard.similarity(s1, s2)
    print(f"'{s1}' vs '{s2}':")
    print(f"  Dice: {dice_sim:.3f}, Jaccard: {jaccard_sim:.3f}")
```
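
The formula can be sketched directly from the definition. Since the two measures relate as Dice = 2J / (1 + J), Dice is always at least as large as Jaccard on the same pair, which is the "higher weight to matches" noted above (the helper names `shingles` and `dice_similarity` are illustrative, not the library's API):

```python
def shingles(s: str, k: int) -> set:
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def dice_similarity(s0: str, s1: str, k: int = 3) -> float:
    a, b = shingles(s0, k), shingles(s1, k)
    if not a and not b:
        return 1.0
    # Matches are counted twice in the numerator, so Dice >= Jaccard.
    return 2 * len(a & b) / (len(a) + len(b))

print(dice_similarity('hello', 'hallo', k=2))  # 0.5, versus Jaccard's 1/3 on the same pair
```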

### N-Gram Distance (Kondrak)

Normalized n-gram distance as defined by Kondrak, using special character affixing to weight initial characters more heavily. This algorithm is designed specifically for string similarity rather than set operations.

```python { .api }
class NGram(NormalizedStringDistance):
    def __init__(self, n: int = 2):
        """
        Initialize N-Gram distance with gram size.

        Args:
            n: Size of character n-grams to use (default: 2)
        """

    def distance(self, s0: str, s1: str) -> float:
        """
        Calculate normalized N-gram distance with special character affixing.

        Args:
            s0: First string
            s1: Second string

        Returns:
            float: Normalized distance in range [0.0, 1.0] where 0.0 = identical

        Raises:
            TypeError: If either string is None
        """
```

**Usage Examples:**

```python
from similarity.ngram import NGram

# Different n-gram sizes
twogram = NGram(2)
fourgram = NGram(4)

# String comparison with weighted prefixes
distance = twogram.distance('ABCD', 'ABTUIO')
print(f"2-gram distance: {distance:.3f}")

# Longer strings with different n-gram sizes
s1 = 'Adobe CreativeSuite 5 Master Collection from cheap 4zp'
s2 = 'Adobe CreativeSuite 5 Master Collection from cheap d1x'

distance2 = twogram.distance(s1, s2)
distance4 = fourgram.distance(s1, s2)
print(f"Similar strings - 2-gram: {distance2:.3f}, 4-gram: {distance4:.3f}")
```
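
The affixing step on its own is easy to picture: the string is prefixed with n-1 copies of a sentinel character before grams are taken, so the first real character participates in extra grams and a mismatch at the start of the string is penalized more. Below is a sketch of just this padding step; the full Kondrak algorithm additionally aligns the padded grams with dynamic programming, and `padded_ngrams` is an illustrative helper, not the library's API:

```python
def padded_ngrams(s: str, n: int = 2, pad: str = '\0') -> list:
    # Prefix n-1 sentinel characters so the first real character
    # appears in n-grams that no interior character can match by accident.
    padded = pad * (n - 1) + s
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(padded_ngrams('abc'))  # ['\x00a', 'ab', 'bc']
# A difference in the first character now breaks two grams instead of one:
print(padded_ngrams('xbc'))  # ['\x00x', 'xb', 'bc']
```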

### ShingleBased Base Class

All shingle-based algorithms inherit from this base class, which provides profile computation:

```python { .api }
class ShingleBased:
    def __init__(self, k: int = 3):
        """
        Initialize with shingle size.

        Args:
            k: Size of character n-grams (shingles) to use
        """

    def get_k(self) -> int:
        """
        Get the current shingle size.

        Returns:
            int: Current shingle size
        """

    def get_profile(self, string: str) -> dict:
        """
        Convert string to n-gram profile (dictionary of n-gram counts).

        Args:
            string: Input string to profile

        Returns:
            dict: Dictionary mapping n-grams to their occurrence counts
        """
```

**Profile Usage Example:**

```python
from similarity.cosine import Cosine

cosine = Cosine(3)

# Get n-gram profiles
text = "hello world"
profile = cosine.get_profile(text)
print(f"3-gram profile for '{text}': {profile}")

# Profiles can be reused for multiple comparisons
profile1 = cosine.get_profile("machine learning")
profile2 = cosine.get_profile("deep learning")
profile3 = cosine.get_profile("learning algorithms")

similarity_12 = cosine.similarity_profiles(profile1, profile2)
similarity_13 = cosine.similarity_profiles(profile1, profile3)
print(f"ML vs DL: {similarity_12:.3f}")
print(f"ML vs Algorithms: {similarity_13:.3f}")
```
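
Conceptually, a profile is a multiset of overlapping substrings: a string of length n yields n - k + 1 shingles, and repeats are counted rather than deduplicated. A stdlib sketch of this semantics (assuming contiguous character shingles; `make_profile` is an illustrative name, and the library's exact dict type may differ):

```python
from collections import Counter

def make_profile(string: str, k: int = 3) -> Counter:
    # Slide a window of width k across the string, counting each shingle.
    return Counter(string[i:i + k] for i in range(len(string) - k + 1))

# 'banana' has four trigrams (ban, ana, nan, ana); 'ana' occurs twice.
print(make_profile('banana'))
```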