Tessl Tile for pypi/strsim@0.0.0

# strsim

A comprehensive Python library implementing different string similarity and distance measures. strsim provides over a dozen algorithms including Levenshtein edit distance, Jaro-Winkler, Longest Common Subsequence, cosine similarity, and various n-gram based measures for applications such as record linkage, duplicate detection, typo correction, and general text comparison tasks.

## Package Information

- **Package Name**: strsim
- **Language**: Python
- **Installation**: `pip install strsim`

## Core Imports

Direct algorithm imports:

```python
from similarity.levenshtein import Levenshtein
from similarity.jarowinkler import JaroWinkler
from similarity.cosine import Cosine
from similarity.jaccard import Jaccard
```

Factory pattern imports:

```python
from similarity.similarity import Factory, Algorithm
```

## Basic Usage

```python
from similarity.levenshtein import Levenshtein
from similarity.jarowinkler import JaroWinkler
from similarity.similarity import Factory, Algorithm

# Direct algorithm usage
levenshtein = Levenshtein()
distance = levenshtein.distance('hello', 'hallo')  # Returns: 1.0

# Normalized similarity
jarowinkler = JaroWinkler()
similarity = jarowinkler.similarity('hello', 'hallo')  # Returns: approximately 0.88

# Factory pattern usage
algorithm = Factory.get_algorithm(Algorithm.LEVENSHTEIN)
distance = algorithm.distance('hello', 'hallo')  # Returns: 1.0
```

## Architecture

The library is built around a hierarchy of base interface classes, each defining one kind of string comparison algorithm:

- **StringDistance**: Base interface for distance algorithms (0 = identical strings)
- **StringSimilarity**: Base interface for similarity algorithms (higher = more similar)
- **NormalizedStringDistance**: Distance algorithms returning values in the [0.0, 1.0] range
- **NormalizedStringSimilarity**: Similarity algorithms returning values in the [0.0, 1.0] range
- **MetricStringDistance**: Distance algorithms satisfying the triangle inequality
- **ShingleBased**: Base class for n-gram/shingle-based algorithms with profile computation

This design gives every algorithm type a consistent interface while leaving room for specialized implementations and for pre-computed profiles, which avoid repeated work on large datasets.
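
As a rough illustration of how such a hierarchy can be consumed, the sketch below re-declares two of the interfaces as abstract base classes and plugs in a toy algorithm. It does not reproduce strsim's source: `ExactMatch` and `most_similar` are hypothetical names introduced only for this example.

```python
from abc import ABC, abstractmethod
from typing import Sequence


class StringSimilarity(ABC):
    """Stand-in for the base similarity interface (higher = more similar)."""

    @abstractmethod
    def similarity(self, s0: str, s1: str) -> float: ...


class NormalizedStringSimilarity(StringSimilarity):
    """Marker interface: similarity() is expected to stay within [0.0, 1.0]."""


class ExactMatch(NormalizedStringSimilarity):
    """Hypothetical toy algorithm: 1.0 for identical strings, otherwise 0.0."""

    def similarity(self, s0: str, s1: str) -> float:
        return 1.0 if s0 == s1 else 0.0


def most_similar(query: str, candidates: Sequence[str], algo: StringSimilarity) -> str:
    """Pick the best candidate using any object that implements similarity()."""
    return max(candidates, key=lambda c: algo.similarity(query, c))


best = most_similar("hello", ["hallo", "hello", "help"], ExactMatch())
print(best)  # hello
```

Because `most_similar` depends only on the `similarity()` method, any algorithm that implements the same interface could be passed in without changing the calling code.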

## Capabilities

### Edit Distance Algorithms

Classic edit distance algorithms including Levenshtein, Damerau-Levenshtein, Weighted Levenshtein, and Optimal String Alignment. These algorithms measure the minimum number of character operations needed to transform one string into another.

```python { .api }
class Levenshtein(MetricStringDistance):
    def distance(self, s0: str, s1: str) -> float: ...

class NormalizedLevenshtein(NormalizedStringDistance, NormalizedStringSimilarity):
    def __init__(self): ...
    def distance(self, s0: str, s1: str) -> float: ...
    def similarity(self, s0: str, s1: str) -> float: ...

class WeightedLevenshtein(StringDistance):
    def __init__(self, character_substitution: CharacterSubstitutionInterface,
                 character_ins_del: CharacterInsDelInterface = None): ...
    def distance(self, s0: str, s1: str) -> float: ...
```

[Edit Distance Algorithms](./edit-distance.md)
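
To make "minimum number of character operations" concrete, here is a minimal pure-Python sketch of the textbook dynamic-programming Levenshtein distance. It is not strsim's implementation, and the normalization by the longer string's length is an assumption about how a normalized variant is commonly defined.

```python
def levenshtein(s0: str, s1: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if s0 == s1:
        return 0
    if not s0:
        return len(s1)
    if not s1:
        return len(s0)

    # previous[j] is the distance between the processed prefix of s0 and s1[:j].
    previous = list(range(len(s1) + 1))
    for i, c0 in enumerate(s0, start=1):
        current = [i]
        for j, c1 in enumerate(s1, start=1):
            cost = 0 if c0 == c1 else 1
            current.append(min(
                current[j - 1] + 1,      # insertion
                previous[j] + 1,         # deletion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def normalized_levenshtein(s0: str, s1: str) -> float:
    """Assumed normalization: divide by the longer length to land in [0.0, 1.0]."""
    longest = max(len(s0), len(s1))
    return 0.0 if longest == 0 else levenshtein(s0, s1) / longest


print(levenshtein("hello", "hallo"))    # 1
print(levenshtein("kitten", "sitting")) # 3
```

Keeping only two rows of the table, as above, uses memory proportional to the length of `s1` rather than to the product of both lengths.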

### Phonetic and Record Linkage Algorithms

Algorithms designed for fuzzy matching, typo correction, and record linkage applications. These are optimized for short strings like person names and handle character transpositions intelligently.

```python { .api }
class JaroWinkler(NormalizedStringSimilarity, NormalizedStringDistance):
    def __init__(self, threshold: float = 0.7): ...
    def get_threshold(self) -> float: ...
    def similarity(self, s0: str, s1: str) -> float: ...
    def distance(self, s0: str, s1: str) -> float: ...

class Damerau(MetricStringDistance):
    def distance(self, s0: str, s1: str) -> float: ...
```

[Phonetic and Record Linkage](./phonetic-record-linkage.md)
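
The sketch below shows how transpositions and a shared prefix feed into a Jaro-Winkler score. It follows the textbook definition (prefix capped at four characters, scaling factor 0.1) and reuses the 0.7 threshold default from the signature above; strsim's own implementation may differ in details, so treat the function names and the `scaling` parameter as assumptions made for illustration.

```python
def jaro(s0: str, s1: str) -> float:
    """Textbook Jaro similarity in [0.0, 1.0]."""
    if s0 == s1:
        return 1.0
    len0, len1 = len(s0), len(s1)
    if len0 == 0 or len1 == 0:
        return 0.0

    # Characters match only if they are equal and close enough in position.
    window = max(max(len0, len1) // 2 - 1, 0)
    matched0 = [False] * len0
    matched1 = [False] * len1
    matches = 0
    for i, c in enumerate(s0):
        for j in range(max(0, i - window), min(len1, i + window + 1)):
            if not matched1[j] and s1[j] == c:
                matched0[i] = matched1[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Matched characters that appear in a different order count as transpositions.
    seq0 = [c for c, m in zip(s0, matched0) if m]
    seq1 = [c for c, m in zip(s1, matched1) if m]
    transpositions = sum(a != b for a, b in zip(seq0, seq1)) // 2

    return (matches / len0 + matches / len1
            + (matches - transpositions) / matches) / 3.0


def jaro_winkler(s0: str, s1: str, threshold: float = 0.7, scaling: float = 0.1) -> float:
    """Boost the Jaro score for strings sharing a common prefix (up to 4 chars)."""
    score = jaro(s0, s1)
    if score <= threshold:
        return score
    prefix = 0
    for a, b in zip(s0, s1):
        if a != b or prefix == 4:
            break
        prefix += 1
    return score + prefix * scaling * (1.0 - score)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
print(round(jaro_winkler("hello", "hallo"), 2))    # 0.88
```

The prefix boost is why this measure suits short strings such as names, where typos tend to occur after the first few characters.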

### Sequence-Based Algorithms

Algorithms based on longest common subsequences and sequence alignment, commonly used in diff utilities, version control systems, and bioinformatics applications.

```python { .api }
class LongestCommonSubsequence(StringDistance):
    def distance(self, s0: str, s1: str) -> float: ...
    @staticmethod
    def length(s0: str, s1: str) -> float: ...

class MetricLCS(MetricStringDistance, NormalizedStringDistance):
    def __init__(self): ...
    def distance(self, s0: str, s1: str) -> float: ...
```

[Sequence-Based Algorithms](./sequence-based.md)
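
A minimal sketch of the underlying idea follows: compute the longest common subsequence length with dynamic programming, then derive two distances from it. The two distance formulas are commonly used definitions and are an assumption here, not a statement of strsim's exact behavior; the function names are illustrative.

```python
def lcs_length(s0: str, s1: str) -> int:
    """Length of the longest (not necessarily contiguous) common subsequence."""
    # table[i][j] is the LCS length of s0[:i] and s1[:j].
    table = [[0] * (len(s1) + 1) for _ in range(len(s0) + 1)]
    for i, c0 in enumerate(s0, start=1):
        for j, c1 in enumerate(s1, start=1):
            if c0 == c1:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


def lcs_distance(s0: str, s1: str) -> int:
    """Assumed definition: characters left over once the LCS is removed from both."""
    return len(s0) + len(s1) - 2 * lcs_length(s0, s1)


def metric_lcs_distance(s0: str, s1: str) -> float:
    """Assumed normalized variant: 1 - |LCS| / max(|s0|, |s1|)."""
    longest = max(len(s0), len(s1))
    return 0.0 if longest == 0 else 1.0 - lcs_length(s0, s1) / longest


print(lcs_length("AGCAT", "GAC"))    # 2
print(lcs_distance("AGCAT", "GAC"))  # 4
```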

### N-Gram and Shingle-Based Algorithms

Algorithms that convert strings into sets or profiles of n-character sequences (shingles) and compute similarity based on these representations. Support both direct string comparison and pre-computed profile optimization for large datasets.

```python { .api }
class Cosine(ShingleBased, NormalizedStringDistance, NormalizedStringSimilarity):
    def __init__(self, k: int): ...
    def distance(self, s0: str, s1: str) -> float: ...
    def similarity(self, s0: str, s1: str) -> float: ...
    def similarity_profiles(self, profile0: dict, profile1: dict) -> float: ...

class QGram(ShingleBased, StringDistance):
    def __init__(self, k: int = 3): ...
    def distance(self, s0: str, s1: str) -> float: ...
    @staticmethod
    def distance_profile(profile0: dict, profile1: dict) -> float: ...
```

[N-Gram and Shingle-Based](./ngram-shingle.md)
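
The idea of profiles can be sketched in a few lines of standard-library Python: count each k-character shingle, then compare the resulting count vectors. This is an illustration under assumptions, not strsim's code; in particular, any preprocessing the library applies before shingling is not modeled, and the standalone function names are hypothetical.

```python
import math
from collections import Counter


def get_profile(s: str, k: int = 3) -> Counter:
    """Count every k-character shingle that occurs in the string."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def cosine_similarity(p0: dict, p1: dict) -> float:
    """Cosine of the angle between two shingle-count vectors."""
    dot = sum(count * p1.get(shingle, 0) for shingle, count in p0.items())
    norm0 = math.sqrt(sum(c * c for c in p0.values()))
    norm1 = math.sqrt(sum(c * c for c in p1.values()))
    if norm0 == 0 or norm1 == 0:
        return 0.0
    return dot / (norm0 * norm1)


# Profiles are computed once and can be reused across many comparisons.
query = get_profile("ABAB", k=2)  # AB occurs twice, BA once
corpus = {text: get_profile(text, k=2) for text in ["BABA", "ABAB", "CDCD"]}
scores = {text: cosine_similarity(query, profile) for text, profile in corpus.items()}
print(max(scores, key=scores.get))  # ABAB
```

Building each profile once and comparing profiles directly is what makes this family practical for large datasets, since the shingling cost is not paid again for every pair.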

### Factory and Utility Classes

Factory pattern for algorithm instantiation and utility interfaces for customizing algorithm behavior.

```python { .api }
from enum import IntEnum

class Algorithm(IntEnum):
    COSINE = 1
    DAMERAU = 2
    JACCARD = 3
    JARO_WINKLE = 4
    LEVENSHTEIN = 5
    LCS = 6
    METRIC_LCS = 7
    N_GRAM = 8
    NORMALIZED_LEVENSHTEIN = 9
    OPTIMAL_STRING_ALIGNMENT = 10
    Q_GRAM = 11
    SORENSEN_DICE = 12
    WEIGHTED_LEVENSHTEIN = 13

class Factory:
    @staticmethod
    def get_algorithm(algorithm: Algorithm, k: int = 3): ...
    @staticmethod
    def get_weighted_levenshtein(char_sub: CharacterSubstitutionInterface,
                                char_change: CharacterInsDelInterface): ...
```

[Factory and Utilities](./factory-utilities.md)

## Types

```python { .api }
# Base interface classes
class StringSimilarity:
    def similarity(self, s0: str, s1: str) -> float: ...

class NormalizedStringSimilarity(StringSimilarity):
    def similarity(self, s0: str, s1: str) -> float: ...

class StringDistance:
    def distance(self, s0: str, s1: str) -> float: ...

class NormalizedStringDistance(StringDistance):
    def distance(self, s0: str, s1: str) -> float: ...

class MetricStringDistance(StringDistance):
    def distance(self, s0: str, s1: str) -> float: ...

# Shingle-based algorithms base class
class ShingleBased:
    def __init__(self, k: int = 3): ...
    def get_k(self) -> int: ...
    def get_profile(self, string: str) -> dict: ...

# Weighted Levenshtein interfaces
class CharacterSubstitutionInterface:
    def cost(self, c0: str, c1: str) -> float: ...

class CharacterInsDelInterface:
    def deletion_cost(self, c: str) -> float: ...
    def insertion_cost(self, c: str) -> float: ...
```
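
To show what a substitution-cost interface is for, the sketch below re-declares `CharacterSubstitutionInterface` locally and feeds a custom cost model into a small weighted edit distance. Everything here is an assumption made for illustration: `OcrSubstitution`, the confusable pairs, the flat `ins_del_cost` parameter, and the standalone `weighted_levenshtein` function are hypothetical and do not reproduce strsim's code.

```python
class CharacterSubstitutionInterface:
    """Local stand-in for the interface listed above."""

    def cost(self, c0: str, c1: str) -> float:
        raise NotImplementedError


class OcrSubstitution(CharacterSubstitutionInterface):
    """Hypothetical cost model: visually similar characters are cheap to swap."""

    CONFUSABLE = {frozenset("0O"), frozenset("1l")}

    def cost(self, c0: str, c1: str) -> float:
        return 0.5 if frozenset((c0, c1)) in self.CONFUSABLE else 1.0


def weighted_levenshtein(s0: str, s1: str,
                         substitution: CharacterSubstitutionInterface,
                         ins_del_cost: float = 1.0) -> float:
    """Edit distance where each substitution is priced by the given cost model."""
    previous = [j * ins_del_cost for j in range(len(s1) + 1)]
    for i, c0 in enumerate(s0, start=1):
        current = [i * ins_del_cost]
        for j, c1 in enumerate(s1, start=1):
            sub = 0.0 if c0 == c1 else substitution.cost(c0, c1)
            current.append(min(
                current[j - 1] + ins_del_cost,  # insertion
                previous[j] + ins_del_cost,     # deletion
                previous[j - 1] + sub,          # weighted substitution
            ))
        previous = current
    return previous[-1]


print(weighted_levenshtein("B00K", "BOOK", OcrSubstitution()))  # 1.0
print(weighted_levenshtein("BACK", "BOOK", OcrSubstitution()))  # 2.0
```

With this cost model, two likely OCR confusions cost the same as a single ordinary substitution, so "B00K" ranks closer to "BOOK" than "BACK" does.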
