A library implementing different string similarity and distance measures
npx @tessl/cli install tessl/pypi-strsim@0.0.00
# strsim

A comprehensive Python library implementing different string similarity and distance measures. strsim provides over a dozen algorithms including Levenshtein edit distance, Jaro-Winkler, Longest Common Subsequence, cosine similarity, and various n-gram based measures for applications such as record linkage, duplicate detection, typo correction, and general text comparison tasks.

## Package Information

- **Package Name**: strsim
- **Language**: Python
- **Installation**: `pip install strsim`

## Core Imports

Direct algorithm imports:

```python
from similarity.levenshtein import Levenshtein
from similarity.jarowinkler import JaroWinkler
from similarity.cosine import Cosine
from similarity.jaccard import Jaccard
```

Factory pattern imports:

```python
from similarity.similarity import Factory, Algorithm
```

## Basic Usage

```python
from similarity.levenshtein import Levenshtein
from similarity.jarowinkler import JaroWinkler
from similarity.similarity import Factory, Algorithm

# Direct algorithm usage
levenshtein = Levenshtein()
distance = levenshtein.distance('hello', 'hallo')  # Returns: 1.0

# Normalized similarity
jarowinkler = JaroWinkler()
similarity = jarowinkler.similarity('hello', 'hallo')  # Returns: 0.88 (approximately)

# Factory pattern usage
algorithm = Factory.get_algorithm(Algorithm.LEVENSHTEIN)
distance = algorithm.distance('hello', 'hallo')  # Returns: 1.0
```

## Architecture

The library is built around a hierarchy of base interface classes that define different types of string comparison algorithms:

- **StringDistance**: Base interface for distance algorithms (0 = identical strings)
- **StringSimilarity**: Base interface for similarity algorithms (higher = more similar)
- **NormalizedStringDistance**: Distance algorithms returning values in the [0.0, 1.0] range
- **NormalizedStringSimilarity**: Similarity algorithms returning values in the [0.0, 1.0] range
- **MetricStringDistance**: Distance algorithms satisfying the triangle inequality
- **ShingleBased**: Base class for n-gram/shingle-based algorithms with profile computation

This design gives every algorithm type a consistent interface while leaving room for specialized implementations and for pre-computed profiles when comparing large datasets.

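The interface hierarchy above can be illustrated with a minimal, self-contained sketch. It does not import strsim: the base classes are simplified stand-ins, and `HammingRatio` and `closest` are hypothetical names invented for this example. It only shows why a shared `distance` method lets calling code stay independent of the concrete algorithm.

```python
from abc import ABC, abstractmethod


class StringDistance(ABC):
    """Stand-in for the distance interface: 0 means identical strings."""

    @abstractmethod
    def distance(self, s0: str, s1: str) -> float: ...


class NormalizedStringDistance(StringDistance):
    """Stand-in marker interface: distance() stays within [0.0, 1.0]."""


class HammingRatio(NormalizedStringDistance):
    """Hypothetical measure for illustration only; not part of strsim."""

    def distance(self, s0: str, s1: str) -> float:
        if s0 == s1:
            return 0.0
        longest = max(len(s0), len(s1))
        # Differing positions, plus the characters one string has in excess.
        mismatches = sum(a != b for a, b in zip(s0, s1)) + abs(len(s0) - len(s1))
        return mismatches / longest


def closest(query: str, candidates: list, measure: StringDistance) -> str:
    """Works with any object exposing distance(), whatever the algorithm."""
    return min(candidates, key=lambda c: measure.distance(query, c))


best = closest("hello", ["world", "hallo", "help"], HammingRatio())
```

Because `closest` depends only on the interface, swapping in a different measure would not require changing the calling code.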
## Capabilities

### Edit Distance Algorithms

Classic edit distance algorithms including Levenshtein, Damerau-Levenshtein, Weighted Levenshtein, and Optimal String Alignment. These algorithms measure the minimum number of character operations needed to transform one string into another.

```python { .api }
class Levenshtein(MetricStringDistance):
    def distance(self, s0: str, s1: str) -> float: ...

class NormalizedLevenshtein(NormalizedStringDistance, NormalizedStringSimilarity):
    def __init__(self): ...
    def distance(self, s0: str, s1: str) -> float: ...
    def similarity(self, s0: str, s1: str) -> float: ...

class WeightedLevenshtein(StringDistance):
    def __init__(self, character_substitution: CharacterSubstitutionInterface,
                 character_ins_del: CharacterInsDelInterface = None): ...
    def distance(self, s0: str, s1: str) -> float: ...
```

[Edit Distance Algorithms](./edit-distance.md)

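To make "minimum number of character operations" concrete, here is a minimal pure-Python sketch of the classic two-row dynamic programming formulation of Levenshtein distance. It is an illustration, not the library's code. The normalization by the longer string's length is an assumption about how a normalized variant is commonly defined.

```python
def levenshtein(s0: str, s1: str) -> int:
    """Count the insertions, deletions, and substitutions turning s0 into s1."""
    if s0 == s1:
        return 0
    if not s0:
        return len(s1)
    if not s1:
        return len(s0)
    # prev[j] holds the distance between the processed prefix of s0 and s1[:j].
    prev = list(range(len(s1) + 1))
    for i, c0 in enumerate(s0, start=1):
        curr = [i]
        for j, c1 in enumerate(s1, start=1):
            cost = 0 if c0 == c1 else 1
            curr.append(min(prev[j] + 1,          # delete c0
                            curr[j - 1] + 1,      # insert c1
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def normalized_levenshtein(s0: str, s1: str) -> float:
    """Assumed normalization: divide by the length of the longer string."""
    longest = max(len(s0), len(s1))
    if longest == 0:
        return 0.0
    return levenshtein(s0, s1) / longest
```

Keeping only two rows reduces memory from O(m·n) to O(n) while leaving the result unchanged.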
### Phonetic and Record Linkage Algorithms

Algorithms designed for fuzzy matching, typo correction, and record linkage applications. These are optimized for short strings like person names and handle character transpositions intelligently.

```python { .api }
class JaroWinkler(NormalizedStringSimilarity, NormalizedStringDistance):
    def __init__(self, threshold: float = 0.7): ...
    def get_threshold(self) -> float: ...
    def similarity(self, s0: str, s1: str) -> float: ...
    def distance(self, s0: str, s1: str) -> float: ...

class Damerau(MetricStringDistance):
    def distance(self, s0: str, s1: str) -> float: ...
```

[Phonetic and Record Linkage](./phonetic-record-linkage.md)

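As a rough illustration of how transpositions and shared prefixes affect the score, the following is a sketch of the textbook Jaro-Winkler formula in pure Python. It is not the library's implementation. The `threshold` gate (prefix boost applied only above that Jaro score), the prefix scale of 0.1, and the prefix cap of 4 are assumptions based on the usual definition.

```python
def jaro(s0: str, s1: str) -> float:
    if s0 == s1:
        return 1.0
    len0, len1 = len(s0), len(s1)
    if len0 == 0 or len1 == 0:
        return 0.0
    # Characters match only if they sit within this distance of each other.
    window = max(max(len0, len1) // 2 - 1, 0)
    matched0 = [False] * len0
    matched1 = [False] * len1
    matches = 0
    for i, ch in enumerate(s0):
        for j in range(max(0, i - window), min(len1, i + window + 1)):
            if not matched1[j] and s1[j] == ch:
                matched0[i] = matched1[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters appearing in a different order count as half a transposition each.
    seq0 = [c for c, m in zip(s0, matched0) if m]
    seq1 = [c for c, m in zip(s1, matched1) if m]
    transpositions = sum(a != b for a, b in zip(seq0, seq1)) // 2
    return (matches / len0 + matches / len1
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s0: str, s1: str, threshold: float = 0.7,
                 prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    score = jaro(s0, s1)
    if score <= threshold:
        return score  # assumed behaviour: no prefix boost for weak matches
    prefix = 0
    for a, b in zip(s0[:max_prefix], s1[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1.0 - score)
```

With these assumptions, `jaro_winkler('hello', 'hallo')` evaluates to about 0.88: four of five characters match, there are no transpositions, and the shared prefix is one character long.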
### Sequence-Based Algorithms

Algorithms based on longest common subsequences and sequence alignment, commonly used in diff utilities, version control systems, and bioinformatics applications.

```python { .api }
class LongestCommonSubsequence(StringDistance):
    def distance(self, s0: str, s1: str) -> float: ...
    @staticmethod
    def length(s0: str, s1: str) -> float: ...

class MetricLCS(MetricStringDistance, NormalizedStringDistance):
    def __init__(self): ...
    def distance(self, s0: str, s1: str) -> float: ...
```

[Sequence-Based Algorithms](./sequence-based.md)

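A short pure-Python sketch may help relate the LCS length to the two distances. The formulas used here (`len(s0) + len(s1) - 2 * lcs` for the plain distance and `1 - lcs / max(len)` for the metric variant) are the commonly cited definitions and are stated as assumptions, not as a description of the library's internals.

```python
def lcs_length(s0: str, s1: str) -> int:
    """Length of the longest common subsequence, using two DP rows."""
    prev = [0] * (len(s1) + 1)
    for c0 in s0:
        curr = [0]
        for j, c1 in enumerate(s1, start=1):
            if c0 == c1:
                curr.append(prev[j - 1] + 1)             # extend the subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))   # skip one character
        prev = curr
    return prev[-1]


def lcs_distance(s0: str, s1: str) -> int:
    """Assumed definition: characters left over once the LCS is removed from both."""
    return len(s0) + len(s1) - 2 * lcs_length(s0, s1)


def metric_lcs_distance(s0: str, s1: str) -> float:
    """Assumed definition: normalized to [0, 1] by the longer string's length."""
    longest = max(len(s0), len(s1))
    if longest == 0:
        return 0.0
    return 1.0 - lcs_length(s0, s1) / longest
```

Under these definitions the plain distance equals the number of insertions and deletions needed when substitutions are not allowed, which is why it suits diff-style tooling.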
### N-Gram and Shingle-Based Algorithms

Algorithms that convert strings into sets or profiles of n-character sequences (shingles) and compute similarity based on these representations. They support both direct string comparison and comparison of pre-computed profiles, which avoids recomputing shingles on large datasets.

```python { .api }
class Cosine(ShingleBased, NormalizedStringDistance, NormalizedStringSimilarity):
    def __init__(self, k: int): ...
    def distance(self, s0: str, s1: str) -> float: ...
    def similarity(self, s0: str, s1: str) -> float: ...
    def similarity_profiles(self, profile0: dict, profile1: dict) -> float: ...

class QGram(ShingleBased, StringDistance):
    def __init__(self, k: int = 3): ...
    def distance(self, s0: str, s1: str) -> float: ...
    @staticmethod
    def distance_profile(profile0: dict, profile1: dict) -> float: ...
```

[N-Gram and Shingle-Based](./ngram-shingle.md)

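The idea of a shingle profile can be sketched without the library. In the example below, a profile is assumed to be a mapping from each k-character shingle to its count. The helper names are hypothetical, and any whitespace normalization the library may perform is omitted.

```python
import math
from collections import Counter


def get_profile(text: str, k: int = 3) -> Counter:
    """Count every overlapping k-character shingle in text."""
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))


def cosine_similarity(p0: dict, p1: dict) -> float:
    """Cosine of the angle between two shingle-count vectors."""
    dot = sum(count * p1.get(shingle, 0) for shingle, count in p0.items())
    norm0 = math.sqrt(sum(c * c for c in p0.values()))
    norm1 = math.sqrt(sum(c * c for c in p1.values()))
    if norm0 == 0 or norm1 == 0:
        return 0.0  # a string shorter than k has an empty profile
    return dot / (norm0 * norm1)


def qgram_distance(p0: dict, p1: dict) -> int:
    """Sum of absolute count differences over all shingles in either profile."""
    return sum(abs(p0.get(s, 0) - p1.get(s, 0)) for s in set(p0) | set(p1))


# Profiles can be computed once and reused across many comparisons.
profile_a = get_profile("ABCD", k=2)
profile_b = get_profile("ABCE", k=2)
similarity = cosine_similarity(profile_a, profile_b)
distance = qgram_distance(profile_a, profile_b)
```

Computing each profile once and comparing profiles afterwards is what makes this family attractive when one string is compared against many others.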
### Factory and Utility Classes

Factory pattern for algorithm instantiation and utility interfaces for customizing algorithm behavior.

```python { .api }
from enum import IntEnum

class Algorithm(IntEnum):
    COSINE = 1
    DAMERAU = 2
    JACCARD = 3
    JARO_WINKLE = 4
    LEVENSHTEIN = 5
    LCS = 6
    METRIC_LCS = 7
    N_GRAM = 8
    NORMALIZED_LEVENSHTEIN = 9
    OPTIMAL_STRING_ALIGNMENT = 10
    Q_GRAM = 11
    SORENSEN_DICE = 12
    WEIGHTED_LEVENSHTEIN = 13

class Factory:
    @staticmethod
    def get_algorithm(algorithm: Algorithm, k: int = 3): ...
    @staticmethod
    def get_weighted_levenshtein(char_sub: CharacterSubstitutionInterface,
                                 char_change: CharacterInsDelInterface): ...
```

[Factory and Utilities](./factory-utilities.md)

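The factory idea can be sketched generically as an enum-to-constructor registry. Everything below (`Measure`, `ExactMatch`, `SharedPrefix`, `get_measure`) is hypothetical and exists only to illustrate the pattern; it is not how strsim's `Factory` is implemented. The `k` argument mirrors the signature above in that only some measures need it.

```python
from enum import IntEnum


class Measure(IntEnum):
    """Hypothetical enum standing in for an algorithm selector."""
    EXACT = 1
    PREFIX = 2


class ExactMatch:
    def distance(self, s0: str, s1: str) -> float:
        return 0.0 if s0 == s1 else 1.0


class SharedPrefix:
    def __init__(self, k: int = 3):
        self.k = k

    def distance(self, s0: str, s1: str) -> float:
        return 0.0 if s0[:self.k] == s1[:self.k] else 1.0


# Each entry receives k and decides whether to use it.
_REGISTRY = {
    Measure.EXACT: lambda k: ExactMatch(),
    Measure.PREFIX: lambda k: SharedPrefix(k),
}


def get_measure(measure: Measure, k: int = 3):
    """Return a fresh instance for the requested measure."""
    try:
        return _REGISTRY[measure](k)
    except KeyError:
        raise ValueError(f"unsupported measure: {measure!r}") from None
```

A registry like this keeps the selection logic in one place, so callers can choose an algorithm from configuration without importing each class.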
## Types

```python { .api }
# Base interface classes
class StringSimilarity:
    def similarity(self, s0: str, s1: str) -> float: ...

class NormalizedStringSimilarity(StringSimilarity):
    def similarity(self, s0: str, s1: str) -> float: ...

class StringDistance:
    def distance(self, s0: str, s1: str) -> float: ...

class NormalizedStringDistance(StringDistance):
    def distance(self, s0: str, s1: str) -> float: ...

class MetricStringDistance(StringDistance):
    def distance(self, s0: str, s1: str) -> float: ...

# Shingle-based algorithms base class
class ShingleBased:
    def __init__(self, k: int = 3): ...
    def get_k(self) -> int: ...
    def get_profile(self, string: str) -> dict: ...

# Weighted Levenshtein interfaces
class CharacterSubstitutionInterface:
    def cost(self, c0: str, c1: str) -> float: ...

class CharacterInsDelInterface:
    def deletion_cost(self, c: str) -> float: ...
    def insertion_cost(self, c: str) -> float: ...
```
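
To show what a substitution-cost object is for, here is a self-contained sketch of a weighted edit distance that accepts any object with a `cost(c0, c1)` method. `ConfusableSubstitution` and `weighted_levenshtein` are hypothetical names, and the flat insertion and deletion costs are a simplification of the per-character `CharacterInsDelInterface` shown above.

```python
class ConfusableSubstitution:
    """Hypothetical cost model: swapping 't' and 'r' is cheaper than other edits."""

    def cost(self, c0: str, c1: str) -> float:
        if {c0, c1} == {"t", "r"}:
            return 0.5
        return 1.0


def weighted_levenshtein(s0: str, s1: str, substitution,
                         insertion_cost: float = 1.0,
                         deletion_cost: float = 1.0) -> float:
    """Edit distance where substitutions are priced by substitution.cost()."""
    # prev[j] is the cost of building s1[:j] from an empty prefix of s0.
    prev = [j * insertion_cost for j in range(len(s1) + 1)]
    for i, c0 in enumerate(s0, start=1):
        curr = [i * deletion_cost]
        for j, c1 in enumerate(s1, start=1):
            sub = 0.0 if c0 == c1 else substitution.cost(c0, c1)
            curr.append(min(prev[j] + deletion_cost,
                            curr[j - 1] + insertion_cost,
                            prev[j - 1] + sub))
        prev = curr
    return prev[-1]


# One cheap 't' -> 'r' swap (0.5) plus one ordinary substitution (1.0).
result = weighted_levenshtein("String1", "Srring2", ConfusableSubstitution())
```

Passing the cost model as an object keeps the dynamic programming loop generic, so domain knowledge such as keyboard adjacency or OCR confusions can be plugged in without touching the algorithm.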