# Jellyfish

Jellyfish is a high-performance Python library for approximate and phonetic string matching. It provides fast implementations of string distance and similarity metrics and of phonetic encoding algorithms. The core is written in Rust for speed and exposed through an easy-to-use Python interface.

## Package Information

- **Package Name**: jellyfish
- **Package Type**: pypi
- **Language**: Python with Rust implementation
- **Installation**: `pip install jellyfish`

## Core Imports

```python
import jellyfish
```

Individual function imports:

```python
from jellyfish import levenshtein_distance, jaro_similarity, soundex, metaphone
```

## Basic Usage

```python
import jellyfish

# String distance calculations
distance = jellyfish.levenshtein_distance('jellyfish', 'smellyfish')
print(distance)  # 2

similarity = jellyfish.jaro_similarity('jellyfish', 'smellyfish')
print(similarity)  # 0.896...

# Phonetic encoding
code = jellyfish.soundex('Jellyfish')
print(code)  # 'J412'

metaphone_code = jellyfish.metaphone('Jellyfish')
print(metaphone_code)  # 'JLFX'
```

## Capabilities

### String Distance and Similarity Functions

Distance and similarity metrics for comparing strings, useful for fuzzy matching, data deduplication, and record linkage applications.

#### Levenshtein Distance

Calculates the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into another.

```python { .api }
def levenshtein_distance(s1: str, s2: str) -> int:
    """
    Calculate the Levenshtein distance between two strings.

    Parameters:
    - s1: First string to compare
    - s2: Second string to compare

    Returns:
    int: Number of edits required to transform s1 to s2

    Raises:
    TypeError: If either argument is not a string
    """
```
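
To make the definition concrete, here is a minimal pure-Python sketch of the classic dynamic-programming computation. It is an illustration only, not jellyfish's Rust implementation, and the name `levenshtein_sketch` is hypothetical.

```python
def levenshtein_sketch(s1: str, s2: str) -> int:
    """Illustrative two-row dynamic-programming Levenshtein distance."""
    # At the start of row i, previous[j] is the distance between
    # s1[:i-1] and s2[:j].
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # turning s1[:i] into "" takes i deletions
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein_sketch('jellyfish', 'smellyfish'))  # 2
```

Keeping only two rows holds memory proportional to the length of `s2` rather than to the product of both lengths.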

#### Damerau-Levenshtein Distance

Calculates distance allowing insertions, deletions, substitutions, and transpositions (swapping of adjacent characters).

```python { .api }
def damerau_levenshtein_distance(s1: str, s2: str) -> int:
    """
    Calculate the Damerau-Levenshtein distance between two strings.

    Parameters:
    - s1: First string to compare
    - s2: Second string to compare

    Returns:
    int: Number of edits (including transpositions) required to transform s1 to s2

    Raises:
    TypeError: If either argument is not a string
    """
```
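
The effect of allowing transpositions can be seen in a small pure-Python sketch. It uses the unrestricted Damerau-Levenshtein formulation; whether jellyfish agrees with it on every edge case is an assumption, and the name `damerau_levenshtein_sketch` is hypothetical.

```python
def damerau_levenshtein_sketch(s1: str, s2: str) -> int:
    """Illustrative unrestricted Damerau-Levenshtein distance."""
    len1, len2 = len(s1), len(s2)
    infinite = len1 + len2  # larger than any real distance
    last_row = {}           # last (1-based) row where each character of s1 was seen

    # The score matrix carries an extra border row and column filled
    # with `infinite`, so transposition lookups never go out of range.
    score = [[0] * (len2 + 2) for _ in range(len1 + 2)]
    score[0][0] = infinite
    for i in range(len1 + 1):
        score[i + 1][0] = infinite
        score[i + 1][1] = i
    for j in range(len2 + 1):
        score[0][j + 1] = infinite
        score[1][j + 1] = j

    for i in range(1, len1 + 1):
        last_match_col = 0  # last column in this row where the characters matched
        for j in range(1, len2 + 1):
            i1 = last_row.get(s2[j - 1], 0)
            j1 = last_match_col
            cost = 1
            if s1[i - 1] == s2[j - 1]:
                cost = 0
                last_match_col = j
            score[i + 1][j + 1] = min(
                score[i][j] + cost,   # substitution (or match)
                score[i + 1][j] + 1,  # insertion
                score[i][j + 1] + 1,  # deletion
                # transposition, plus the edits between the swapped characters
                score[i1][j1] + (i - i1 - 1) + 1 + (j - j1 - 1),
            )
        last_row[s1[i - 1]] = i
    return score[len1 + 1][len2 + 1]


print(damerau_levenshtein_sketch('jellyfish', 'jellyfihs'))  # 1
```

Swapping the adjacent `s` and `h` counts as one edit here, whereas plain Levenshtein distance would count two substitutions.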
90
91
#### Hamming Distance
92
93
Calculates the number of positions at which corresponding characters are different. Handles strings of different lengths by including the length difference.
94
95
```python { .api }
96
def hamming_distance(s1: str, s2: str) -> int:
97
"""
98
Calculate the Hamming distance between two strings.
99
100
Parameters:
101
- s1: First string to compare
102
- s2: Second string to compare
103
104
Returns:
105
int: Number of differing positions plus length difference
106
107
Raises:
108
TypeError: If either argument is not a string
109
"""
110
```
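
The length-difference rule described above can be sketched in a few lines of pure Python. This is an illustration of the stated behavior, not the library's code, and `hamming_sketch` is a hypothetical name.

```python
def hamming_sketch(s1: str, s2: str) -> int:
    """Illustrative Hamming distance that tolerates unequal lengths."""
    # Every position beyond the end of the shorter string counts as a difference.
    distance = abs(len(s1) - len(s2))
    # zip() stops at the shorter string, so only overlapping positions are compared.
    distance += sum(c1 != c2 for c1, c2 in zip(s1, s2))
    return distance


print(hamming_sketch('jellyfish', 'smellyfish'))  # 9
```

Because comparison is strictly positional, a single inserted character near the start shifts everything after it and inflates the result, which is why edit distances are usually preferred for strings of different lengths.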

#### Jaro Similarity

Calculates Jaro similarity, which considers character matches and transpositions.

```python { .api }
def jaro_similarity(s1: str, s2: str) -> float:
    """
    Calculate the Jaro similarity between two strings.

    Parameters:
    - s1: First string to compare
    - s2: Second string to compare

    Returns:
    float: Similarity score between 0.0 (no similarity) and 1.0 (identical)

    Raises:
    TypeError: If either argument is not a string
    """
```
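
How matches and transpositions combine into a score can be sketched with the textbook Jaro formula. The code below is a pure-Python illustration under that assumption; `jaro_sketch` is a hypothetical name, and returning 0.0 for empty input is a choice made here, not documented jellyfish behavior.

```python
def jaro_sketch(s1: str, s2: str) -> float:
    """Illustrative textbook Jaro similarity."""
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0  # assumption: empty input has no similarity

    # Characters match only if they are equal and close enough in position.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Walk both strings' matched characters in order; each out-of-order
    # pair is half a transposition.
    k = 0
    half_transpositions = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                half_transpositions += 1
            k += 1
    transpositions = half_transpositions // 2

    return (
        matches / len1
        + matches / len2
        + (matches - transpositions) / matches
    ) / 3


print(round(jaro_sketch('jellyfish', 'smellyfish'), 4))  # 0.8963
```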

#### Jaro-Winkler Similarity

Enhanced Jaro similarity that gives higher scores to strings with common prefixes, with optional long string tolerance.

```python { .api }
def jaro_winkler_similarity(s1: str, s2: str, long_tolerance: Optional[bool] = None) -> float:
    """
    Calculate the Jaro-Winkler similarity between two strings.

    Parameters:
    - s1: First string to compare
    - s2: Second string to compare
    - long_tolerance: Apply long string tolerance adjustment for extended similarity calculation (None and False behave identically)

    Returns:
    float: Similarity score between 0.0 (no similarity) and 1.0 (identical)

    Raises:
    TypeError: If either argument is not a string
    """
```
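
The common-prefix bonus can be sketched as a small adjustment applied to an already computed Jaro score. The scaling factor of 0.1, the prefix cap of 4, and the 0.7 boost threshold are the conventional Winkler values and are assumptions here, not values confirmed from jellyfish. The long string tolerance adjustment is not covered, and `winkler_boost_sketch` is a hypothetical name.

```python
def winkler_boost_sketch(
    jaro_score: float,
    s1: str,
    s2: str,
    scaling: float = 0.1,          # assumed conventional value
    max_prefix: int = 4,           # assumed conventional value
    boost_threshold: float = 0.7,  # assumed conventional value
) -> float:
    """Illustrative Winkler prefix adjustment on top of a Jaro score."""
    # Weak matches are left untouched.
    if jaro_score <= boost_threshold:
        return jaro_score

    # Count identical leading characters, up to max_prefix.
    prefix = 0
    for c1, c2 in zip(s1[:max_prefix], s2[:max_prefix]):
        if c1 != c2:
            break
        prefix += 1

    # Move the score part of the way towards 1.0.
    return jaro_score + prefix * scaling * (1.0 - jaro_score)


# 17/18 is the textbook Jaro similarity of 'MARTHA' and 'MARHTA'.
print(round(winkler_boost_sketch(17 / 18, 'MARTHA', 'MARHTA'), 4))  # 0.9611
```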

#### Jaccard Similarity

Calculates Jaccard similarity/index using either word-level or character n-gram comparison.

```python { .api }
def jaccard_similarity(s1: str, s2: str, ngram_size: Optional[int] = None) -> float:
    """
    Calculate the Jaccard similarity between two strings.

    Parameters:
    - s1: First string to compare
    - s2: Second string to compare
    - ngram_size: Size for character n-grams; if None, uses word-level comparison

    Returns:
    float: Similarity score between 0.0 (no similarity) and 1.0 (identical)

    Raises:
    TypeError: If either argument is not a string
    """
```
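
The difference between word-level and n-gram comparison can be sketched with plain Python sets. This illustrates the Jaccard index itself, intersection size over union size. Whitespace tokenization and the 0.0 result for two empty token sets are assumptions, and `jaccard_sketch` is a hypothetical name.

```python
from typing import Optional, Set


def jaccard_sketch(s1: str, s2: str, ngram_size: Optional[int] = None) -> float:
    """Illustrative Jaccard index over words or character n-grams."""

    def tokens(s: str) -> Set[str]:
        if ngram_size is None:
            return set(s.split())  # assumption: words are whitespace-separated
        return {s[i:i + ngram_size] for i in range(len(s) - ngram_size + 1)}

    a, b = tokens(s1), tokens(s2)
    union = a | b
    if not union:
        return 0.0  # assumption: nothing to compare
    return len(a & b) / len(union)


# Word level: {'the', 'brown'} shared out of 6 distinct words.
print(round(jaccard_sketch('the quick brown fox', 'the lazy brown dog'), 4))  # 0.3333

# Character bigrams: only 'ht' is shared out of 7 distinct bigrams.
print(round(jaccard_sketch('night', 'nacht', ngram_size=2), 4))  # 0.1429
```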

#### Match Rating Comparison

Compares two strings using the Match Rating Approach algorithm, returning a boolean match result or None if comparison cannot be made.

```python { .api }
def match_rating_comparison(s1: str, s2: str) -> Optional[bool]:
    """
    Compare two strings using Match Rating Approach algorithm.

    Parameters:
    - s1: First string to compare
    - s2: Second string to compare

    Returns:
    Optional[bool]: True if strings match, False if they don't, None if length difference >= 3

    Raises:
    TypeError: If either argument is not a string
    """
```
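
Because the result has three states, a plain truthiness check would treat `None` the same as `False`. The sketch below shows one way to keep them apart. `describe_match` is a hypothetical helper that takes a value already returned by `match_rating_comparison`.

```python
from typing import Optional


def describe_match(result: Optional[bool]) -> str:
    """Turn a tri-state comparison result into a readable label."""
    # Check for None explicitly: `if not result` would also catch False.
    if result is None:
        return "not comparable"
    return "match" if result else "no match"


for value in (True, False, None):
    print(value, "->", describe_match(value))
```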

### Phonetic Encoding Functions

Phonetic encoding algorithms that convert strings to phonetic codes, enabling "sounds-like" matching for names and words.
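
A common way to use such codes for "sounds-like" matching is to group strings by their code, so that only strings sharing a code need to be compared further. The sketch below assumes that workflow. `group_by_code` and `toy_encode` are hypothetical names, and `toy_encode` is a deliberately crude stand-in so the example runs without jellyfish installed.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def group_by_code(
    names: Iterable[str], encode: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Group names under the code produced by the given encoder."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in names:
        groups[encode(name)].append(name)
    return dict(groups)


def toy_encode(name: str) -> str:
    """Stand-in encoder (not a phonetic algorithm): upper-cased first letter."""
    return name[:1].upper()


print(group_by_code(['Ann', 'anna', 'Bob'], toy_encode))
# {'A': ['Ann', 'anna'], 'B': ['Bob']}
```

With jellyfish installed, `encode` could be `jellyfish.soundex` or `jellyfish.metaphone` instead of the stand-in.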

#### Soundex

American Soundex algorithm that encodes strings based on their English pronunciation.

```python { .api }
def soundex(s: str) -> str:
    """
    Calculate the American Soundex code for a string.

    Parameters:
    - s: String to encode

    Returns:
    str: 4-character soundex code (letter followed by 3 digits)

    Raises:
    TypeError: If argument is not a string
    """
```
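
The letter-plus-three-digits format follows from the standard American Soundex rules, sketched below in pure Python. This is an illustration of those rules rather than jellyfish's code. `soundex_sketch` is a hypothetical name, and its handling of empty input and non-ASCII letters is an assumption.

```python
# Consonant groups and their digits; vowels, H, W and Y have no digit.
_SOUNDEX_CODES = {
    ch: digit
    for letters, digit in (
        ("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
        ("L", "4"), ("MN", "5"), ("R", "6"),
    )
    for ch in letters
}


def soundex_sketch(s: str) -> str:
    """Illustrative American Soundex encoding."""
    letters = [c for c in s.upper() if c.isalpha()]
    if not letters:
        return ""  # assumption: nothing to encode

    result = letters[0]
    prev = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            result += code
            if len(result) == 4:
                break
        prev = code  # a vowel resets prev, so a repeated code counts again
    return result.ljust(4, "0")  # pad short codes with zeros


print(soundex_sketch('Jellyfish'))  # J412
```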

#### Metaphone

Metaphone phonetic encoding algorithm for English pronunciation matching.

```python { .api }
def metaphone(s: str) -> str:
    """
    Calculate the Metaphone code for a string.

    Parameters:
    - s: String to encode

    Returns:
    str: Metaphone phonetic code

    Raises:
    TypeError: If argument is not a string
    """
```

#### NYSIIS

New York State Identification and Intelligence System phonetic encoding.

```python { .api }
def nysiis(s: str) -> str:
    """
    Calculate the NYSIIS (New York State Identification and Intelligence System) code.

    Parameters:
    - s: String to encode

    Returns:
    str: NYSIIS phonetic code

    Raises:
    TypeError: If argument is not a string
    """
```

#### Match Rating Codex

Match Rating Approach codex encoding for string comparison preparation.

```python { .api }
def match_rating_codex(s: str) -> str:
    """
    Calculate the Match Rating Approach codex for a string.

    Parameters:
    - s: String to encode (must contain only alphabetic characters)

    Returns:
    str: Match Rating codex (up to 6 characters)

    Raises:
    TypeError: If argument is not a string
    ValueError: If string contains non-alphabetic characters
    """
```
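
The six-character limit comes from the Match Rating Approach encoding rules, sketched below in pure Python: drop vowels after the first character, collapse repeated consonants, then keep only the first three and last three characters of long results. This follows the general description of the approach. Exact outputs may differ from jellyfish, and `match_rating_codex_sketch` is a hypothetical name.

```python
def match_rating_codex_sketch(s: str) -> str:
    """Illustrative Match Rating Approach codex."""
    s = s.upper().replace(" ", "")  # assumption: spaces are ignored
    if s and not s.isalpha():
        raise ValueError("string must contain only alphabetic characters")

    codex = []
    prev = None
    for i, ch in enumerate(s):
        # Keep the first character, and any consonant that does not
        # repeat the character immediately before it.
        if i == 0 or (ch not in "AEIOU" and ch != prev):
            codex.append(ch)
        prev = ch

    # Long codes are reduced to their first three and last three characters.
    if len(codex) > 6:
        codex = codex[:3] + codex[-3:]
    return "".join(codex)


print(match_rating_codex_sketch('Byrne'))        # BYRN
print(match_rating_codex_sketch('Christopher'))  # CHRPHR
```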

## Types

```python { .api }
from typing import Optional

# All functions accept str arguments and have specific return types as documented above
# No custom classes or complex types are exposed in the public API
```