Approximate and phonetic matching of strings
npx @tessl/cli install tessl/pypi-jellyfish@1.2.0

A high-performance Python library for approximate and phonetic string matching. Jellyfish provides fast implementations of string distance and similarity metrics along with phonetic encoding algorithms. It is built with Rust for performance and exposed through a Python interface.

Installation:
pip install jellyfish

Basic import:
import jellyfish

Individual function imports:
from jellyfish import levenshtein_distance, jaro_similarity, soundex, metaphone

Basic usage:
import jellyfish
# String distance calculations
distance = jellyfish.levenshtein_distance('jellyfish', 'smellyfish')
print(distance) # 2
similarity = jellyfish.jaro_similarity('jellyfish', 'smellyfish')
print(similarity) # 0.896...
# Phonetic encoding
code = jellyfish.soundex('Jellyfish')
print(code) # 'J412'
metaphone_code = jellyfish.metaphone('Jellyfish')
print(metaphone_code) # 'JLFX'

Distance and similarity metrics for comparing strings, useful for fuzzy matching, data deduplication, and record linkage applications.
Calculates the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into another.
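To make the definition concrete, the following is a minimal pure-Python sketch of the classic dynamic-programming approach. It is an illustration only, not the library's Rust implementation, and the name `levenshtein_sketch` is hypothetical.

```python
def levenshtein_sketch(s1: str, s2: str) -> int:
    """Illustrative Levenshtein distance using a rolling DP row."""
    # previous[j] holds the distance between the processed prefix of s1
    # and the first j characters of s2.
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(
                previous[j] + 1,         # delete c1
                current[j - 1] + 1,      # insert c2
                previous[j - 1] + cost,  # substitute (or keep a match)
            ))
        previous = current
    return previous[-1]


print(levenshtein_sketch("jellyfish", "smellyfish"))  # 2
print(levenshtein_sketch("kitten", "sitting"))        # 3
```

The first result agrees with the basic usage output shown earlier for the same pair of strings.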
def levenshtein_distance(s1: str, s2: str) -> int:
    """
    Calculate the Levenshtein distance between two strings.

    Parameters:
    - s1: First string to compare
    - s2: Second string to compare

    Returns:
    int: Number of edits required to transform s1 to s2

    Raises:
    TypeError: If either argument is not a string
    """

Calculates distance allowing insertions, deletions, substitutions, and transpositions (swapping of adjacent characters).
def damerau_levenshtein_distance(s1: str, s2: str) -> int:
    """
    Calculate the Damerau-Levenshtein distance between two strings.

    Parameters:
    - s1: First string to compare
    - s2: Second string to compare

    Returns:
    int: Number of edits (including transpositions) required to transform s1 to s2

    Raises:
    TypeError: If either argument is not a string
    """

Calculates the number of positions at which corresponding characters are different. Handles strings of different lengths by including the length difference.
def hamming_distance(s1: str, s2: str) -> int:
    """
    Calculate the Hamming distance between two strings.

    Parameters:
    - s1: First string to compare
    - s2: Second string to compare

    Returns:
    int: Number of differing positions plus length difference

    Raises:
    TypeError: If either argument is not a string
    """

Calculates Jaro similarity, which considers character matches and transpositions.
def jaro_similarity(s1: str, s2: str) -> float:
    """
    Calculate the Jaro similarity between two strings.

    Parameters:
    - s1: First string to compare
    - s2: Second string to compare

    Returns:
    float: Similarity score between 0.0 (no similarity) and 1.0 (identical)

    Raises:
    TypeError: If either argument is not a string
    """

Enhanced Jaro similarity that gives higher scores to strings with common prefixes, with optional long string tolerance.
def jaro_winkler_similarity(s1: str, s2: str, long_tolerance: Optional[bool] = None) -> float:
    """
    Calculate the Jaro-Winkler similarity between two strings.

    Parameters:
    - s1: First string to compare
    - s2: Second string to compare
    - long_tolerance: Apply long string tolerance adjustment for extended similarity calculation (None and False behave identically)

    Returns:
    float: Similarity score between 0.0 (no similarity) and 1.0 (identical)

    Raises:
    TypeError: If either argument is not a string
    """

Calculates Jaccard similarity/index using either word-level or character n-gram comparison.
def jaccard_similarity(s1: str, s2: str, ngram_size: Optional[int] = None) -> float:
    """
    Calculate the Jaccard similarity between two strings.

    Parameters:
    - s1: First string to compare
    - s2: Second string to compare
    - ngram_size: Size for character n-grams; if None, uses word-level comparison

    Returns:
    float: Similarity score between 0.0 (no similarity) and 1.0 (identical)

    Raises:
    TypeError: If either argument is not a string
    """

Compares two strings using the Match Rating Approach algorithm, returning a boolean match result or None if comparison cannot be made.
def match_rating_comparison(s1: str, s2: str) -> Optional[bool]:
    """
    Compare two strings using Match Rating Approach algorithm.

    Parameters:
    - s1: First string to compare
    - s2: Second string to compare

    Returns:
    Optional[bool]: True if strings match, False if they don't, None if length difference >= 3

    Raises:
    TypeError: If either argument is not a string
    """

Phonetic encoding algorithms that convert strings to phonetic codes, enabling "sounds-like" matching for names and words.
American Soundex algorithm that encodes strings based on their English pronunciation.
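For readers unfamiliar with the encoding, the sketch below applies the commonly documented American Soundex rules in pure Python: keep the first letter, map consonants to digits, collapse runs of the same digit, let H and W pass through without breaking a run, and pad to four characters. It is an illustration, not the library's implementation. The name `soundex_sketch` and the empty-string result for empty input are assumptions.

```python
# Consonant groups and their Soundex digits.
_SOUNDEX_CODES = {}
for _letters, _digit in (
    ("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
    ("L", "4"), ("MN", "5"), ("R", "6"),
):
    for _ch in _letters:
        _SOUNDEX_CODES[_ch] = _digit


def soundex_sketch(s: str) -> str:
    """Illustrative American Soundex encoding."""
    letters = [c for c in s.upper() if c.isalpha()]
    if not letters:
        return ""  # assumption: nothing to encode
    result = [letters[0]]
    prev = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            result.append(code)
        prev = code  # vowels reset prev, so a repeated code is kept after them
    return ("".join(result) + "000")[:4]


print(soundex_sketch("Jellyfish"))  # J412
print(soundex_sketch("Robert"))     # R163
print(soundex_sketch("Rupert"))     # R163
```

"Robert" and "Rupert" receive the same code, which is the kind of "sounds-like" grouping the encoding is meant to provide.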
def soundex(s: str) -> str:
    """
    Calculate the American Soundex code for a string.

    Parameters:
    - s: String to encode

    Returns:
    str: 4-character soundex code (letter followed by 3 digits)

    Raises:
    TypeError: If argument is not a string
    """

Metaphone phonetic encoding algorithm for English pronunciation matching.
def metaphone(s: str) -> str:
    """
    Calculate the Metaphone code for a string.

    Parameters:
    - s: String to encode

    Returns:
    str: Metaphone phonetic code

    Raises:
    TypeError: If argument is not a string
    """

New York State Identification and Intelligence System phonetic encoding.
def nysiis(s: str) -> str:
    """
    Calculate the NYSIIS (New York State Identification and Intelligence System) code.

    Parameters:
    - s: String to encode

    Returns:
    str: NYSIIS phonetic code

    Raises:
    TypeError: If argument is not a string
    """

Match Rating Approach codex encoding for string comparison preparation.
def match_rating_codex(s: str) -> str:
    """
    Calculate the Match Rating Approach codex for a string.

    Parameters:
    - s: String to encode (must contain only alphabetic characters)

    Returns:
    str: Match Rating codex (up to 6 characters)

    Raises:
    TypeError: If argument is not a string
    ValueError: If string contains non-alphabetic characters
    """

from typing import Optional
# All functions accept str arguments and have specific return types as documented above
# No custom classes or complex types are exposed in the public API