or run

npx @tessl/cli init
Log in

Version

Tile

Overview

Evals

Files

Files

docs

edit-operations.mdindex.mdstring-averaging.mdstring-distance.md

string-distance.mddocs/

0

# String Distance and Similarity

1

2

Functions for computing various string distance metrics and similarity scores. These functions provide the core string comparison capabilities including Levenshtein distance, normalized similarity ratios, Hamming distance, and Jaro/Jaro-Winkler similarities.

3

4

## Capabilities

5

6

### Levenshtein Distance

7

8

Calculates the minimum number of insertions, deletions, and substitutions required to change one sequence into another according to Levenshtein with custom costs for insertion, deletion and substitution.

9

10

```python { .api }

11

def distance(s1, s2, *, weights=(1, 1, 1), processor=None, score_cutoff=None, score_hint=None):

12

"""

13

Calculate Levenshtein distance with custom operation weights.

14

15

Parameters:

16

- s1: First string to compare

17

- s2: Second string to compare

18

- weights: Tuple of (insertion, deletion, substitution) weights. Default (1, 1, 1)

19

- processor: Optional callable to preprocess strings before comparison

20

- score_cutoff: Maximum distance to consider. Returns cutoff + 1 if exceeded

21

- score_hint: Expected distance hint for algorithm optimization

22

23

Returns:

24

int: Distance between s1 and s2

25

26

Raises:

27

ValueError: If unsupported weights are provided

28

"""

29

```

30

31

**Usage Examples:**

32

33

```python

34

import Levenshtein

35

36

# Basic distance calculation

37

dist = Levenshtein.distance("kitten", "sitting")

38

print(dist) # 3

39

40

# Custom weights (insertion, deletion, substitution)

41

dist = Levenshtein.distance("kitten", "sitting", weights=(1, 1, 2))

42

print(dist) # 4 (substitutions cost more)

43

44

# With score cutoff for performance

45

dist = Levenshtein.distance("very long string", "another long string", score_cutoff=5)

46

print(dist) # Returns actual distance or 6 if > 5

47

```

48

49

### Similarity Ratio

50

51

Calculates a normalized indel similarity in the range [0, 1]. The indel distance calculates the minimum number of insertions and deletions required to change one sequence into the other.

52

53

```python { .api }

54

def ratio(s1, s2, *, processor=None, score_cutoff=None):

55

"""

56

Calculate normalized indel similarity ratio [0, 1].

57

58

This is calculated as 1 - (distance / (len1 + len2))

59

60

Parameters:

61

- s1: First string to compare

62

- s2: Second string to compare

63

- processor: Optional callable to preprocess strings before comparison

64

- score_cutoff: Minimum similarity threshold. Returns 0 if below threshold

65

66

Returns:

67

float: Normalized similarity between s1 and s2 as a float between 0 and 1.0

68

"""

69

```

70

71

**Usage Examples:**

72

73

```python

74

import Levenshtein

75

76

# Basic similarity ratio

77

ratio = Levenshtein.ratio("kitten", "sitting")

78

print(f"{ratio:.3f}") # 0.615

79

80

# With score cutoff

81

ratio = Levenshtein.ratio("hello", "world", score_cutoff=0.5)

82

print(ratio) # 0.0 (similarity is below 0.5)

83

84

# With custom processor

85

def preprocess(s):

86

return s.lower().strip()

87

88

ratio = Levenshtein.ratio(" Hello ", "HELLO", processor=preprocess)

89

print(ratio) # 1.0 (identical after processing)

90

```

91

92

### Hamming Distance

93

94

Calculates the Hamming distance between two strings. The hamming distance is defined as the number of positions where the two strings differ. It describes the minimum amount of substitutions required to transform s1 into s2.

95

96

```python { .api }

97

def hamming(s1, s2, *, pad=True, processor=None, score_cutoff=None):

98

"""

99

Calculate Hamming distance (substitutions only).

100

101

Parameters:

102

- s1: First string to compare

103

- s2: Second string to compare

104

- pad: Should strings be padded if there is a length difference

105

- processor: Optional callable to preprocess strings before comparison

106

- score_cutoff: Maximum distance to consider. Returns cutoff + 1 if exceeded

107

108

Returns:

109

int: Hamming distance between s1 and s2

110

111

Raises:

112

ValueError: If s1 and s2 have different length and pad=False

113

"""

114

```

115

116

**Usage Examples:**

117

118

```python

119

import Levenshtein

120

121

# Same length strings

122

dist = Levenshtein.hamming("karolin", "kathrin")

123

print(dist) # 3

124

125

# Different length strings with padding (default)

126

dist = Levenshtein.hamming("karolin", "kath")

127

print(dist) # 7 (3 substitutions + 4 for length difference)

128

129

# Different length strings without padding raises error

130

try:

131

dist = Levenshtein.hamming("karolin", "kath", pad=False)

132

except ValueError as e:

133

print("Length mismatch error")

134

```

135

136

### Jaro Similarity

137

138

Calculates the Jaro similarity between two strings. Jaro similarity is particularly effective for comparing names and short strings with character transpositions.

139

140

```python { .api }

141

def jaro(s1, s2, *, processor=None, score_cutoff=None):

142

"""

143

Calculate Jaro similarity.

144

145

Parameters:

146

- s1: First string to compare

147

- s2: Second string to compare

148

- processor: Optional callable to preprocess strings before comparison

149

- score_cutoff: Minimum similarity threshold. Returns 0 if below threshold

150

151

Returns:

152

float: Jaro similarity between s1 and s2 as a float between 0 and 1.0

153

"""

154

```

155

156

**Usage Examples:**

157

158

```python

159

import Levenshtein

160

161

# Basic Jaro similarity

162

sim = Levenshtein.jaro("martha", "marhta")

163

print(f"{sim:.3f}") # 0.944

164

165

# Names comparison

166

sim = Levenshtein.jaro("DIXON", "DICKSONX")

167

print(f"{sim:.3f}") # 0.767

168

```

169

170

### Jaro-Winkler Similarity

171

172

Calculates the Jaro-Winkler similarity, which gives more favorable ratings to strings with common prefixes. This is particularly useful for comparing names and addresses.

173

174

```python { .api }

175

def jaro_winkler(s1, s2, *, prefix_weight=0.1, processor=None, score_cutoff=None):

176

"""

177

Calculate Jaro-Winkler similarity with prefix weighting.

178

179

Parameters:

180

- s1: First string to compare

181

- s2: Second string to compare

182

- prefix_weight: Weight used for common prefix (0 to 0.25). Default 0.1

183

- processor: Optional callable to preprocess strings before comparison

184

- score_cutoff: Minimum similarity threshold. Returns 0 if below threshold

185

186

Returns:

187

float: Jaro-Winkler similarity between s1 and s2 as a float between 0 and 1.0

188

189

Raises:

190

ValueError: If prefix_weight is not between 0 and 0.25

191

"""

192

```

193

194

**Usage Examples:**

195

196

```python

197

import Levenshtein

198

199

# Basic Jaro-Winkler similarity

200

sim = Levenshtein.jaro_winkler("martha", "marhta")

201

print(f"{sim:.3f}") # 0.961 (higher than Jaro due to common prefix "mar")

202

203

# Custom prefix weight

204

sim = Levenshtein.jaro_winkler("prefix_test", "prefix_demo", prefix_weight=0.2)

205

print(f"{sim:.3f}") # Higher weight for common prefix

206

207

# Names with common prefix

208

sim = Levenshtein.jaro_winkler("JOHNSON", "JOHNSTON")

209

print(f"{sim:.3f}") # 0.957

210

```

211

212

## Common Usage Patterns

213

214

### Fuzzy String Matching

215

216

```python

217

import Levenshtein

218

219

def fuzzy_match(target, candidates, threshold=0.8):

220

"""Find best matches from candidates list."""

221

matches = []

222

for candidate in candidates:

223

ratio = Levenshtein.ratio(target, candidate)

224

if ratio >= threshold:

225

matches.append((candidate, ratio))

226

return sorted(matches, key=lambda x: x[1], reverse=True)

227

228

# Example usage

229

candidates = ["apple", "application", "apply", "april", "ample"]

230

matches = fuzzy_match("appl", candidates, threshold=0.6)

231

print(matches) # [('apply', 0.8), ('apple', 0.8), ('ample', 0.8)]

232

```

233

234

### Spell Checking

235

236

```python

237

import Levenshtein

238

239

def suggest_corrections(word, dictionary, max_distance=2, max_suggestions=5):

240

"""Suggest spelling corrections."""

241

suggestions = []

242

for dict_word in dictionary:

243

dist = Levenshtein.distance(word, dict_word)

244

if dist <= max_distance:

245

ratio = Levenshtein.ratio(word, dict_word)

246

suggestions.append((dict_word, dist, ratio))

247

248

# Sort by distance (ascending) then ratio (descending)

249

suggestions.sort(key=lambda x: (x[1], -x[2]))

250

return [word for word, _, _ in suggestions[:max_suggestions]]

251

```

252

253

### Performance Optimization

254

255

```python

256

import Levenshtein

257

258

# Use score_cutoff for early termination

259

def fast_filter(query, candidates, max_distance=3):

260

"""Quickly filter candidates by maximum distance."""

261

results = []

262

for candidate in candidates:

263

dist = Levenshtein.distance(query, candidate, score_cutoff=max_distance)

264

if dist <= max_distance:

265

results.append((candidate, dist))

266

return results

267

268

# Use score_hint when you have an expected distance

269

def optimized_distance(s1, s2, expected_dist=None):

270

"""Calculate distance with optimization hint."""

271

return Levenshtein.distance(s1, s2, score_hint=expected_dist)

272

```