# strsim


strsim is a Python library implementing a range of string similarity and distance measures. It provides over a dozen algorithms, including Levenshtein edit distance, Jaro-Winkler, Longest Common Subsequence, cosine similarity, and various n-gram based measures, for applications such as record linkage, duplicate detection, typo correction, and general text comparison.

## Package Information

- **Package Name**: strsim
- **Language**: Python
- **Installation**: `pip install strsim`

## Core Imports

Direct algorithm imports:

```python
from similarity.levenshtein import Levenshtein
from similarity.jarowinkler import JaroWinkler
from similarity.cosine import Cosine
from similarity.jaccard import Jaccard
```

Factory pattern imports:

```python
from similarity.similarity import Factory, Algorithm
```

## Basic Usage

```python
from similarity.levenshtein import Levenshtein
from similarity.jarowinkler import JaroWinkler
from similarity.similarity import Factory, Algorithm

# Direct algorithm usage
levenshtein = Levenshtein()
distance = levenshtein.distance('hello', 'hallo')  # Returns: 1 (one substitution)

# Normalized similarity
jarowinkler = JaroWinkler()
similarity = jarowinkler.similarity('hello', 'hallo')  # Returns: approximately 0.88

# Factory pattern usage
algorithm = Factory.get_algorithm(Algorithm.LEVENSHTEIN)
distance = algorithm.distance('hello', 'hallo')  # Returns: 1 (one substitution)
```

## Architecture

The library is built around a hierarchy of base interface classes that define different types of string comparison algorithms:

- **StringDistance**: Base interface for distance algorithms (0 = identical strings)
- **StringSimilarity**: Base interface for similarity algorithms (higher = more similar)
- **NormalizedStringDistance**: Distance algorithms returning values in the range [0.0, 1.0]
- **NormalizedStringSimilarity**: Similarity algorithms returning values in the range [0.0, 1.0]
- **MetricStringDistance**: Distance algorithms satisfying the triangle inequality
- **ShingleBased**: Base class for n-gram/shingle-based algorithms with profile computation

This design enables consistent interfaces across algorithm types while providing flexibility for specialized implementations and pre-computed profile optimizations for large datasets.
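To illustrate how these interfaces compose, the sketch below defines stand-in base classes that only mirror the names listed above, plus a hypothetical `ExactMatch` algorithm and a hypothetical `rank_by_distance` helper. None of this is the library's own code; it is a minimal illustration of why a shared `distance` method lets calling code stay independent of the concrete algorithm.

```python
# Stand-in base classes mirroring the interface names above (not the library's code).
class StringDistance:
    def distance(self, s0: str, s1: str) -> float:
        raise NotImplementedError


class StringSimilarity:
    def similarity(self, s0: str, s1: str) -> float:
        raise NotImplementedError


class NormalizedStringDistance(StringDistance):
    """Distances are expected to fall in [0.0, 1.0]."""


class NormalizedStringSimilarity(StringSimilarity):
    """Similarities are expected to fall in [0.0, 1.0]."""


class ExactMatch(NormalizedStringDistance, NormalizedStringSimilarity):
    """Hypothetical algorithm: 1.0 similarity for equal strings, else 0.0."""

    def similarity(self, s0: str, s1: str) -> float:
        return 1.0 if s0 == s1 else 0.0

    def distance(self, s0: str, s1: str) -> float:
        # For normalized measures, distance is the complement of similarity.
        return 1.0 - self.similarity(s0, s1)


def rank_by_distance(measure: StringDistance, query: str, candidates: list) -> list:
    """Hypothetical helper: works with any object exposing distance()."""
    return sorted(candidates, key=lambda c: measure.distance(query, c))


print(rank_by_distance(ExactMatch(), "b", ["a", "b"]))
```

Because `rank_by_distance` depends only on the `distance` method, any class implementing that interface could be passed in its place.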


## Capabilities


### Edit Distance Algorithms

Classic edit distance algorithms including Levenshtein, Damerau-Levenshtein, Weighted Levenshtein, and Optimal String Alignment. These algorithms measure the minimum number of character operations needed to transform one string into another.

```python { .api }
class Levenshtein(MetricStringDistance):
    def distance(self, s0: str, s1: str) -> float: ...

class NormalizedLevenshtein(NormalizedStringDistance, NormalizedStringSimilarity):
    def __init__(self): ...
    def distance(self, s0: str, s1: str) -> float: ...
    def similarity(self, s0: str, s1: str) -> float: ...

class WeightedLevenshtein(StringDistance):
    def __init__(self, character_substitution: CharacterSubstitutionInterface,
                 character_ins_del: CharacterInsDelInterface = None): ...
    def distance(self, s0: str, s1: str) -> float: ...
```

[Edit Distance Algorithms](./edit-distance.md)
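For readers unfamiliar with how an edit distance is computed, here is a minimal, self-contained sketch of the classic two-row dynamic programming approach to Levenshtein distance. The function names `levenshtein` and `normalized_similarity` are illustrative only, and the normalization (dividing by the longer string's length) is a common convention assumed here, not a statement about the library's internals.

```python
def levenshtein(s0: str, s1: str) -> int:
    """Minimum number of insertions, deletions, and substitutions."""
    if s0 == s1:
        return 0
    if not s0:
        return len(s1)
    if not s1:
        return len(s0)

    # previous[j] holds the distance between the processed prefix of s0
    # and the first j characters of s1.
    previous = list(range(len(s1) + 1))
    for i, c0 in enumerate(s0, start=1):
        current = [i]
        for j, c1 in enumerate(s1, start=1):
            cost = 0 if c0 == c1 else 1
            current.append(min(
                previous[j] + 1,         # delete c0
                current[j - 1] + 1,      # insert c1
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def normalized_similarity(s0: str, s1: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(s0), len(s1))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(s0, s1) / longest


print(levenshtein("hello", "hallo"))
print(normalized_similarity("hello", "hallo"))
```

Keeping only two rows of the table holds memory at O(len(s1)) while the running time stays O(len(s0) × len(s1)).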


### Phonetic and Record Linkage Algorithms

Algorithms designed for fuzzy matching, typo correction, and record linkage applications. These are optimized for short strings like person names and handle character transpositions intelligently.

```python { .api }
class JaroWinkler(NormalizedStringSimilarity, NormalizedStringDistance):
    def __init__(self, threshold: float = 0.7): ...
    def get_threshold(self) -> float: ...
    def similarity(self, s0: str, s1: str) -> float: ...
    def distance(self, s0: str, s1: str) -> float: ...

class Damerau(MetricStringDistance):
    def distance(self, s0: str, s1: str) -> float: ...
```

[Phonetic and Record Linkage](./phonetic-record-linkage.md)
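The following sketch shows one textbook formulation of Jaro-Winkler similarity, to make concrete how matching windows, transpositions, and the common-prefix bonus interact. It is not the library's implementation: the function names are illustrative, and it assumes the `threshold` parameter acts as the usual Winkler boost threshold, with a scaling factor of 0.1 and a prefix capped at four characters.

```python
def jaro(s0: str, s1: str) -> float:
    """Jaro similarity in [0.0, 1.0]."""
    if s0 == s1:
        return 1.0
    len0, len1 = len(s0), len(s1)
    if len0 == 0 or len1 == 0:
        return 0.0

    # Characters match only if they lie within this window of each other.
    window = max(max(len0, len1) // 2 - 1, 0)
    matched0 = [False] * len0
    matched1 = [False] * len1
    matches = 0
    for i, c in enumerate(s0):
        lo = max(0, i - window)
        hi = min(len1, i + window + 1)
        for j in range(lo, hi):
            if not matched1[j] and s1[j] == c:
                matched0[i] = matched1[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Matched characters that appear in a different order count as half a
    # transposition each.
    seq0 = [c for c, m in zip(s0, matched0) if m]
    seq1 = [c for c, m in zip(s1, matched1) if m]
    transpositions = sum(a != b for a, b in zip(seq0, seq1)) // 2

    return (matches / len0 + matches / len1
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s0: str, s1: str, threshold: float = 0.7,
                 scaling: float = 0.1) -> float:
    """Boost the Jaro score for strings sharing a prefix (assumed parameters)."""
    score = jaro(s0, s1)
    if score <= threshold:
        return score
    prefix = 0
    for a, b in zip(s0[:4], s1[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * scaling * (1.0 - score)


print(round(jaro_winkler("hello", "hallo"), 4))
print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
```

The prefix bonus is what makes the measure favor strings that agree at the start, which tends to suit person names where typos are more common later in the word.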


### Sequence-Based Algorithms

Algorithms based on longest common subsequences and sequence alignment, commonly used in diff utilities, version control systems, and bioinformatics applications.

```python { .api }
class LongestCommonSubsequence(StringDistance):
    def distance(self, s0: str, s1: str) -> float: ...
    @staticmethod
    def length(s0: str, s1: str) -> float: ...

class MetricLCS(MetricStringDistance, NormalizedStringDistance):
    def __init__(self): ...
    def distance(self, s0: str, s1: str) -> float: ...
```

[Sequence-Based Algorithms](./sequence-based.md)
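As a rough illustration of the idea behind these measures, the sketch below computes the length of the longest common subsequence with dynamic programming and derives a distance from it. The function names are illustrative, and the distance formula `len(s0) + len(s1) - 2 * lcs_length` is a common definition assumed here; it should be checked against the library before relying on exact values.

```python
def lcs_length(s0: str, s1: str) -> int:
    """Length of the longest (not necessarily contiguous) common subsequence."""
    previous = [0] * (len(s1) + 1)
    for c0 in s0:
        current = [0]
        for j, c1 in enumerate(s1, start=1):
            if c0 == c1:
                # Extend the best subsequence that ends before both characters.
                current.append(previous[j - 1] + 1)
            else:
                # Otherwise carry over the better result from skipping one character.
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def lcs_distance(s0: str, s1: str) -> int:
    """Assumed definition: characters that fall outside the common subsequence."""
    return len(s0) + len(s1) - 2 * lcs_length(s0, s1)


print(lcs_length("AGCAT", "GAC"))
print(lcs_distance("AGCAT", "GAC"))
```

Under this definition the distance equals the number of insertions and deletions needed when substitutions are not allowed, which is why it suits diff-style comparisons.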


### N-Gram and Shingle-Based Algorithms

Algorithms that convert strings into sets or profiles of n-character sequences (shingles) and compute similarity based on these representations. They support both direct string comparison and pre-computed profile optimization for large datasets.

```python { .api }
class Cosine(ShingleBased, NormalizedStringDistance, NormalizedStringSimilarity):
    def __init__(self, k: int): ...
    def distance(self, s0: str, s1: str) -> float: ...
    def similarity(self, s0: str, s1: str) -> float: ...
    def similarity_profiles(self, profile0: dict, profile1: dict) -> float: ...

class QGram(ShingleBased, StringDistance):
    def __init__(self, k: int = 3): ...
    def distance(self, s0: str, s1: str) -> float: ...
    @staticmethod
    def distance_profile(profile0: dict, profile1: dict) -> float: ...
```

[N-Gram and Shingle-Based](./ngram-shingle.md)
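To show what a shingle profile is and why pre-computing it helps, here is a minimal sketch that builds k-shingle count profiles and compares two profiles with cosine similarity. The standalone functions `get_profile` and `cosine_similarity` are illustrative; the library may preprocess input differently (for example, whitespace handling), so this is an assumption-laden sketch, not its implementation.

```python
from collections import Counter
from math import sqrt


def get_profile(text: str, k: int = 3) -> Counter:
    """Count every k-character window (shingle) in the text."""
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))


def cosine_similarity(profile0: dict, profile1: dict) -> float:
    """Cosine of the angle between two shingle-count vectors."""
    dot = sum(count * profile1.get(shingle, 0)
              for shingle, count in profile0.items())
    norm0 = sqrt(sum(c * c for c in profile0.values()))
    norm1 = sqrt(sum(c * c for c in profile1.values()))
    if norm0 == 0 or norm1 == 0:
        return 0.0
    return dot / (norm0 * norm1)


# Profiles can be computed once per string and reused across many comparisons.
query = get_profile("ABCAB", k=2)
corpus = {name: get_profile(name, k=2) for name in ["ABCAB", "BCABX", "XYZ"]}
for name, profile in corpus.items():
    print(name, round(cosine_similarity(query, profile), 3))
```

When one string is compared against a large collection, computing each profile once avoids re-shingling the same text for every pair.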

136

137

### Factory and Utility Classes

Factory pattern for algorithm instantiation and utility interfaces for customizing algorithm behavior.

```python { .api }
from enum import IntEnum

class Algorithm(IntEnum):
    COSINE = 1
    DAMERAU = 2
    JACCARD = 3
    JARO_WINKLE = 4
    LEVENSHTEIN = 5
    LCS = 6
    METRIC_LCS = 7
    N_GRAM = 8
    NORMALIZED_LEVENSHTEIN = 9
    OPTIMAL_STRING_ALIGNMENT = 10
    Q_GRAM = 11
    SORENSEN_DICE = 12
    WEIGHTED_LEVENSHTEIN = 13

class Factory:
    @staticmethod
    def get_algorithm(algorithm: Algorithm, k: int = 3): ...
    @staticmethod
    def get_weighted_levenshtein(char_sub: CharacterSubstitutionInterface,
                                 char_change: CharacterInsDelInterface): ...
```

[Factory and Utilities](./factory-utilities.md)
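The sketch below shows one way an enum-keyed factory of this shape could be organized, to clarify why `get_algorithm` accepts a `k` argument that only some algorithms use. Everything in it is hypothetical: `DemoAlgorithm`, `DemoFactory`, `ExactMatch`, and `ShingleOverlap` are stand-ins invented for illustration and do not reflect how the library's `Factory` is written.

```python
from enum import IntEnum


class DemoAlgorithm(IntEnum):
    """Hypothetical enum, standing in for the library's Algorithm."""
    EXACT_MATCH = 1
    SHINGLE_OVERLAP = 2


class ExactMatch:
    """Stand-in algorithm that takes no constructor arguments."""

    def distance(self, s0: str, s1: str) -> float:
        return 0.0 if s0 == s1 else 1.0


class ShingleOverlap:
    """Stand-in shingle-based algorithm that needs a shingle size k."""

    def __init__(self, k: int = 3):
        self.k = k

    def _shingles(self, text: str) -> set:
        return {text[i:i + self.k] for i in range(len(text) - self.k + 1)}

    def distance(self, s0: str, s1: str) -> float:
        # Number of shingles present in exactly one of the two strings.
        return float(len(self._shingles(s0) ^ self._shingles(s1)))


class DemoFactory:
    _plain = {DemoAlgorithm.EXACT_MATCH: ExactMatch}
    _shingle_based = {DemoAlgorithm.SHINGLE_OVERLAP: ShingleOverlap}

    @staticmethod
    def get_algorithm(algorithm: DemoAlgorithm, k: int = 3):
        # k is forwarded only to algorithms that are shingle-based.
        if algorithm in DemoFactory._shingle_based:
            return DemoFactory._shingle_based[algorithm](k)
        if algorithm in DemoFactory._plain:
            return DemoFactory._plain[algorithm]()
        raise TypeError(f"Unsupported algorithm: {algorithm!r}")


measure = DemoFactory.get_algorithm(DemoAlgorithm.SHINGLE_OVERLAP, k=2)
print(measure.distance("abc", "abd"))
```

Selecting algorithms by enum value keeps configuration (for example, a value read from a settings file) separate from the classes that implement each measure.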


## Types

```python { .api }
# Base interface classes
class StringSimilarity:
    def similarity(self, s0: str, s1: str) -> float: ...

class NormalizedStringSimilarity(StringSimilarity):
    def similarity(self, s0: str, s1: str) -> float: ...

class StringDistance:
    def distance(self, s0: str, s1: str) -> float: ...

class NormalizedStringDistance(StringDistance):
    def distance(self, s0: str, s1: str) -> float: ...

class MetricStringDistance(StringDistance):
    def distance(self, s0: str, s1: str) -> float: ...

# Shingle-based algorithms base class
class ShingleBased:
    def __init__(self, k: int = 3): ...
    def get_k(self) -> int: ...
    def get_profile(self, string: str) -> dict: ...

# Weighted Levenshtein interfaces
class CharacterSubstitutionInterface:
    def cost(self, c0: str, c1: str) -> float: ...

class CharacterInsDelInterface:
    def deletion_cost(self, c: str) -> float: ...
    def insertion_cost(self, c: str) -> float: ...

```
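
To make the role of the weighted Levenshtein interfaces more concrete, the sketch below implements a small weighted edit distance that asks a substitution-cost object for the price of each substitution. The stand-in `CharacterSubstitutionInterface`, the `SimilarLetters` class, and the `weighted_levenshtein` function are all hypothetical and written only for illustration; insertion and deletion costs are simplified to fixed numbers instead of a `CharacterInsDelInterface` object.

```python
class CharacterSubstitutionInterface:
    """Stand-in for the interface of the same name (not the library's code)."""

    def cost(self, c0: str, c1: str) -> float:
        raise NotImplementedError


class SimilarLetters(CharacterSubstitutionInterface):
    """Hypothetical cost model: confusing 't' and 'r' is cheaper than other swaps."""

    def cost(self, c0: str, c1: str) -> float:
        if {c0, c1} == {"t", "r"}:
            return 0.5
        return 1.0


def weighted_levenshtein(s0: str, s1: str,
                         substitution: CharacterSubstitutionInterface,
                         insert_cost: float = 1.0,
                         delete_cost: float = 1.0) -> float:
    """Edit distance where each substitution is priced by the cost object."""
    previous = [j * insert_cost for j in range(len(s1) + 1)]
    for c0 in s0:
        current = [previous[0] + delete_cost]
        for j, c1 in enumerate(s1, start=1):
            sub = 0.0 if c0 == c1 else substitution.cost(c0, c1)
            current.append(min(
                previous[j] + delete_cost,     # delete c0
                current[j - 1] + insert_cost,  # insert c1
                previous[j - 1] + sub,         # substitute c0 with c1
            ))
        previous = current
    return previous[-1]


print(weighted_levenshtein("String1", "Srring2", SimilarLetters()))
```

A cost model like this is typically used to encode domain knowledge, such as visually similar characters in OCR output or adjacent keys on a keyboard.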