tessl/pypi-jellyfish

Approximate and phonetic matching of strings

Workspace: tessl
Visibility: Public
Describes: pypipkg:pypi/jellyfish@1.2.x

To install, run

npx @tessl/cli install tessl/pypi-jellyfish@1.2.0


Jellyfish

Jellyfish is a Python library for approximate and phonetic string matching. It provides string distance and similarity metrics along with phonetic encoding algorithms, implemented in Rust for speed and exposed through a plain Python interface.

Package Information

  • Package Name: jellyfish
  • Package Type: pypi
  • Language: Python with Rust implementation
  • Installation: pip install jellyfish

Core Imports

import jellyfish

Individual function imports:

from jellyfish import levenshtein_distance, jaro_similarity, soundex, metaphone

Basic Usage

import jellyfish

# String distance calculations
distance = jellyfish.levenshtein_distance('jellyfish', 'smellyfish')
print(distance)  # 2

similarity = jellyfish.jaro_similarity('jellyfish', 'smellyfish')
print(similarity)  # 0.896...

# Phonetic encoding
code = jellyfish.soundex('Jellyfish')
print(code)  # 'J412'

metaphone_code = jellyfish.metaphone('Jellyfish')
print(metaphone_code)  # 'JLFX'

Capabilities

String Distance and Similarity Functions

Distance and similarity metrics for comparing strings, useful for fuzzy matching, data deduplication, and record linkage applications.

Levenshtein Distance

Calculates the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into another.

def levenshtein_distance(s1: str, s2: str) -> int:
    """
    Calculate the Levenshtein distance between two strings.
    
    Parameters:
    - s1: First string to compare
    - s2: Second string to compare
    
    Returns:
    int: Number of edits required to transform s1 to s2
    
    Raises:
    TypeError: If either argument is not a string
    """
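
To make the definition concrete, the following is a minimal pure-Python sketch of the classic dynamic-programming computation. The name `levenshtein_sketch` is hypothetical and this is not the library's Rust implementation; it only illustrates how the edit count arises.

```python
def levenshtein_sketch(s1: str, s2: str) -> int:
    """Illustrative two-row dynamic-programming edit distance (hypothetical helper)."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    # previous[j] = distance between the empty prefix of s1 and s2[:j]
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # distance between s1[:i] and the empty string
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein_sketch("jellyfish", "smellyfish"))  # 2
```

The result for this pair agrees with the Basic Usage example above: one insertion plus one substitution.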

Damerau-Levenshtein Distance

Calculates the edit distance when transpositions (swaps of two adjacent characters) are allowed in addition to insertions, deletions, and substitutions.

def damerau_levenshtein_distance(s1: str, s2: str) -> int:
    """
    Calculate the Damerau-Levenshtein distance between two strings.
    
    Parameters:
    - s1: First string to compare
    - s2: Second string to compare
    
    Returns:
    int: Number of edits (including transpositions) required to transform s1 to s2
    
    Raises:
    TypeError: If either argument is not a string
    """
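
The effect of counting a transposition as a single edit can be seen in a pure-Python sketch of the unrestricted Damerau-Levenshtein algorithm. The name `damerau_levenshtein_sketch` is hypothetical, and the assumption that the library uses this unrestricted variant (rather than the restricted "optimal string alignment" variant) on every input is not confirmed by the text above.

```python
def damerau_levenshtein_sketch(s1: str, s2: str) -> int:
    """Illustrative unrestricted Damerau-Levenshtein distance (hypothetical helper)."""
    len1, len2 = len(s1), len(s2)
    max_dist = len1 + len2
    last_row = {}  # last row (1-based) in which each character of s1 was seen
    # The table has one extra border row and column holding max_dist as a sentinel.
    d = [[0] * (len2 + 2) for _ in range(len1 + 2)]
    d[0][0] = max_dist
    for i in range(len1 + 1):
        d[i + 1][0] = max_dist
        d[i + 1][1] = i
    for j in range(len2 + 1):
        d[0][j + 1] = max_dist
        d[1][j + 1] = j

    for i in range(1, len1 + 1):
        last_match_col = 0  # last column in this row where the characters matched
        for j in range(1, len2 + 1):
            k = last_row.get(s2[j - 1], 0)
            l = last_match_col
            if s1[i - 1] == s2[j - 1]:
                cost = 0
                last_match_col = j
            else:
                cost = 1
            d[i + 1][j + 1] = min(
                d[i][j] + cost,                          # substitution or match
                d[i + 1][j] + 1,                         # insertion
                d[i][j + 1] + 1,                         # deletion
                d[k][l] + (i - k - 1) + 1 + (j - l - 1)  # transposition
            )
        last_row[s1[i - 1]] = i
    return d[len1 + 1][len2 + 1]


print(damerau_levenshtein_sketch("ab", "ba"))  # 1 (plain Levenshtein would give 2)
```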

Hamming Distance

Calculates the number of positions at which corresponding characters differ. Unlike the classical definition, which requires equal-length inputs, strings of different lengths are accepted: the length difference is added to the count.

def hamming_distance(s1: str, s2: str) -> int:
    """
    Calculate the Hamming distance between two strings.
    
    Parameters:
    - s1: First string to compare
    - s2: Second string to compare
    
    Returns:
    int: Number of differing positions plus length difference
    
    Raises:
    TypeError: If either argument is not a string
    """
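
The "differing positions plus length difference" rule described above can be sketched in a few lines of pure Python. The name `hamming_sketch` is hypothetical; the code only mirrors the documented behaviour.

```python
def hamming_sketch(s1: str, s2: str) -> int:
    """Illustrative Hamming distance that tolerates unequal lengths (hypothetical helper)."""
    # Compare characters position by position over the shared length...
    differing = sum(a != b for a, b in zip(s1, s2))
    # ...then count every extra character of the longer string as a difference.
    return differing + abs(len(s1) - len(s2))


print(hamming_sketch("abc", "abd"))    # 1
print(hamming_sketch("abc", "abcde"))  # 2
```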

Jaro Similarity

Calculates the Jaro similarity, which is based on the number of matching characters found within a limited window and the number of transpositions among those matches.

def jaro_similarity(s1: str, s2: str) -> float:
    """
    Calculate the Jaro similarity between two strings.
    
    Parameters:
    - s1: First string to compare
    - s2: Second string to compare
    
    Returns:
    float: Similarity score between 0.0 (no similarity) and 1.0 (identical)
    
    Raises:
    TypeError: If either argument is not a string
    """
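
As a rough illustration of how matches and transpositions combine into a score, here is a pure-Python sketch of the textbook Jaro formula. The name `jaro_sketch` is hypothetical, and returning 0.0 for empty input is an assumption of this sketch rather than documented library behaviour.

```python
def jaro_sketch(s1: str, s2: str) -> float:
    """Illustrative textbook Jaro similarity (hypothetical helper)."""
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0  # assumption: empty input scores 0.0 in this sketch
    # Characters count as matching only if they are at most `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as half a transposition each.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2 + (matches - transpositions) / matches) / 3


print(round(jaro_sketch("jellyfish", "smellyfish"), 3))  # 0.896
```

For this pair there are 8 matches and no transpositions, so the score is (8/9 + 8/10 + 8/8) / 3, in line with the Basic Usage example.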

Jaro-Winkler Similarity

A variant of Jaro similarity that gives higher scores to strings sharing a common prefix, with an optional long-string tolerance adjustment.

def jaro_winkler_similarity(s1: str, s2: str, long_tolerance: Optional[bool] = None) -> float:
    """
    Calculate the Jaro-Winkler similarity between two strings.
    
    Parameters:
    - s1: First string to compare
    - s2: Second string to compare
    - long_tolerance: Apply long string tolerance adjustment for extended similarity calculation (None and False behave identically)
    
    Returns:
    float: Similarity score between 0.0 (no similarity) and 1.0 (identical)
    
    Raises:
    TypeError: If either argument is not a string
    """
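
The prefix adjustment can be sketched separately from the Jaro computation itself. In the hypothetical helper below, the scaling factor of 0.1, the 4-character prefix cap, and the 0.7 boost threshold are Winkler's commonly cited defaults; that the library uses exactly these values is an assumption, and the `long_tolerance` adjustment is not modelled.

```python
def winkler_boost(jaro_score: float, s1: str, s2: str,
                  prefix_scale: float = 0.1,
                  max_prefix: int = 4,
                  threshold: float = 0.7) -> float:
    """Apply a Winkler-style prefix bonus to an existing Jaro score (hypothetical helper)."""
    # Low-scoring pairs are left untouched (assumed threshold).
    if jaro_score <= threshold:
        return jaro_score
    # Count the shared prefix, capped at max_prefix characters.
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    # Move the score toward 1.0 in proportion to the shared prefix length.
    return jaro_score + prefix * prefix_scale * (1.0 - jaro_score)


# "MARTHA" / "MARHTA" have a Jaro score of 17/18 and share the prefix "MAR".
print(round(winkler_boost(17 / 18, "MARTHA", "MARHTA"), 4))  # 0.9611
```

Pairs with no shared prefix, such as 'jellyfish' and 'smellyfish', keep their Jaro score unchanged under this adjustment.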

Jaccard Similarity

Calculates Jaccard similarity/index using either word-level or character n-gram comparison.

def jaccard_similarity(s1: str, s2: str, ngram_size: Optional[int] = None) -> float:
    """
    Calculate the Jaccard similarity between two strings.
    
    Parameters:
    - s1: First string to compare
    - s2: Second string to compare
    - ngram_size: Size for character n-grams; if None, uses word-level comparison
    
    Returns:
    float: Similarity score between 0.0 (no similarity) and 1.0 (identical)
    
    Raises:
    TypeError: If either argument is not a string
    """
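
The difference between word-level and n-gram comparison is easiest to see in code. The sketch below uses the hypothetical name `jaccard_sketch`; splitting on whitespace for word mode and returning 0.0 when both sets are empty are assumptions of this sketch, not documented library behaviour.

```python
from typing import Optional


def jaccard_sketch(s1: str, s2: str, ngram_size: Optional[int] = None) -> float:
    """Illustrative Jaccard index over word sets or character n-gram sets (hypothetical helper)."""
    if ngram_size is None:
        # Word-level comparison: each string becomes a set of whitespace-separated tokens.
        a, b = set(s1.split()), set(s2.split())
    else:
        # Character-level comparison: each string becomes a set of overlapping n-grams.
        a = {s1[i:i + ngram_size] for i in range(len(s1) - ngram_size + 1)}
        b = {s2[i:i + ngram_size] for i in range(len(s2) - ngram_size + 1)}
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets score 0.0 in this sketch
    return len(a & b) / len(union)


print(jaccard_sketch("the quick brown fox", "the lazy brown dog"))  # 2 shared words of 6
print(jaccard_sketch("night", "nacht", ngram_size=2))               # 1 shared bigram of 7
```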

Match Rating Comparison

Compares two strings using the Match Rating Approach algorithm, returning a boolean match result or None if comparison cannot be made.

def match_rating_comparison(s1: str, s2: str) -> Optional[bool]:
    """
    Compare two strings using Match Rating Approach algorithm.
    
    Parameters:
    - s1: First string to compare
    - s2: Second string to compare
    
    Returns:
    Optional[bool]: True if strings match, False if they don't, None if length difference >= 3
    
    Raises:
    TypeError: If either argument is not a string
    """
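
Because the return type is `Optional[bool]`, a plain truthiness check would treat "cannot compare" the same as "does not match". A minimal sketch of handling the three outcomes explicitly, using a hypothetical helper `describe_match` that takes the value returned by `match_rating_comparison`:

```python
from typing import Optional


def describe_match(result: Optional[bool]) -> str:
    """Map the three possible outcomes to a label (hypothetical helper)."""
    # Test for None first: `not result` would conflate None with False.
    if result is None:
        return "not comparable"
    return "match" if result else "no match"


print(describe_match(True))   # match
print(describe_match(False))  # no match
print(describe_match(None))   # not comparable
```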

Phonetic Encoding Functions

Phonetic encoding algorithms that convert strings to phonetic codes, enabling "sounds-like" matching for names and words.
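
One common way to use such codes for "sounds-like" matching is to group names by their code. The sketch below is encoder-agnostic: `group_by_code` and `toy_encoder` are hypothetical names, and `toy_encoder` is only a stand-in so the example runs without the library; in practice one of the encoders documented below would be passed instead.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def group_by_code(names: Iterable[str],
                  encoder: Callable[[str], str]) -> Dict[str, List[str]]:
    """Bucket names that share the same phonetic code (hypothetical helper)."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in names:
        groups[encoder(name)].append(name)
    return dict(groups)


def toy_encoder(s: str) -> str:
    """Stand-in encoder for illustration only: the upper-cased first letter."""
    return s[:1].upper()


print(group_by_code(["Ann", "anne", "Bob"], toy_encoder))
# {'A': ['Ann', 'anne'], 'B': ['Bob']}
```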

Soundex

American Soundex algorithm that encodes strings based on their English pronunciation.

def soundex(s: str) -> str:
    """
    Calculate the American Soundex code for a string.
    
    Parameters:
    - s: String to encode
    
    Returns:
    str: 4-character soundex code (letter followed by 3 digits)
    
    Raises:
    TypeError: If argument is not a string
    """
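
For readers unfamiliar with the encoding rules, the following pure-Python sketch follows the standard American Soundex rules (keep the first letter, map consonants to digits, collapse adjacent duplicates, let H and W pass through without separating duplicates, pad to four characters). The name `soundex_sketch` is hypothetical, and exact agreement with the library on edge cases such as non-alphabetic input is an assumption.

```python
_SOUNDEX_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _SOUNDEX_CODES[_ch] = _digit


def soundex_sketch(s: str) -> str:
    """Illustrative American Soundex encoder (hypothetical helper)."""
    letters = [c for c in s.upper() if c.isalpha()]
    if not letters:
        return ""
    out = [letters[0]]
    # The first letter's own code still suppresses an identical code right after it.
    prev = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            out.append(code)
            if len(out) == 4:
                break
        prev = code  # vowels reset prev, so a repeated code after a vowel is kept
    return "".join(out).ljust(4, "0")


print(soundex_sketch("Jellyfish"))  # J412
print(soundex_sketch("Ashcraft"))   # A261
```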

Metaphone

Metaphone phonetic encoding algorithm for English pronunciation matching.

def metaphone(s: str) -> str:
    """
    Calculate the Metaphone code for a string.
    
    Parameters:
    - s: String to encode
    
    Returns:
    str: Metaphone phonetic code
    
    Raises:
    TypeError: If argument is not a string
    """

NYSIIS

New York State Identification and Intelligence System phonetic encoding.

def nysiis(s: str) -> str:
    """
    Calculate the NYSIIS (New York State Identification and Intelligence System) code.
    
    Parameters:
    - s: String to encode
    
    Returns:
    str: NYSIIS phonetic code
    
    Raises:
    TypeError: If argument is not a string
    """

Match Rating Codex

Match Rating Approach codex encoding for string comparison preparation.

def match_rating_codex(s: str) -> str:
    """
    Calculate the Match Rating Approach codex for a string.
    
    Parameters:
    - s: String to encode (must contain only alphabetic characters)
    
    Returns:
    str: Match Rating codex (up to 6 characters)
    
    Raises:
    TypeError: If argument is not a string
    ValueError: If string contains non-alphabetic characters
    """

Types

from typing import Optional

# All functions accept str arguments and have specific return types as documented above
# No custom classes or complex types are exposed in the public API