tessl/pypi-thefuzz

Fuzzy string matching library using Levenshtein Distance algorithms for approximate text comparison

TheFuzz

Name: tessl/pypi-thefuzz
Author: tessl

TheFuzz is a Python fuzzy string matching library that uses Levenshtein Distance algorithms to calculate similarities between text sequences. It provides multiple scoring strategies for approximate string comparison, from simple ratio calculations to advanced weighted algorithms that combine various matching techniques.

Package Information

Package Name: thefuzz
Language: Python
Installation: pip install thefuzz
Minimum Python Version: 3.8

Core Imports

from thefuzz import fuzz
from thefuzz import process
from thefuzz import utils

Individual function imports:

from thefuzz.fuzz import ratio, WRatio, token_sort_ratio
from thefuzz.process import extractOne, extract, dedupe
from thefuzz.utils import full_process

Basic Usage

from thefuzz import fuzz
from thefuzz import process

# Basic string similarity scoring
ratio = fuzz.ratio("this is a test", "this is a test!")
print(ratio)  # 97

# Token-based matching (handles word order differences)
token_ratio = fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
print(token_ratio)  # 100

# Weighted ratio (combines multiple algorithms)
weighted_ratio = fuzz.WRatio("this is a test", "this is a test!")
print(weighted_ratio)  # 100 (WRatio preprocesses both strings by default, so the trailing "!" is ignored)

# Find best match from a list of choices
choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
result = process.extractOne("new york jets", choices)
print(result)  # ('New York Jets', 100)

# Get multiple best matches
results = process.extract("new york", choices, limit=2)
print(results)  # [('New York Jets', 90), ('New York Giants', 90)]

Architecture

TheFuzz builds on the high-performance rapidfuzz library while maintaining backward compatibility with the original fuzzywuzzy interface. The library is organized into three main modules:

fuzz: Core string similarity scoring functions using various algorithms
process: Functions for finding best matches in collections of strings
utils: String preprocessing and normalization utilities

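Most of the library shares one preprocessing step that runs before scoring: non-alphanumeric characters are replaced with whitespace, the text is converted to lowercase, leading and trailing whitespace is trimmed, and non-ASCII characters are optionally dropped. Scorers that accept a full_process flag run this step by default; ratio and partial_ratio compare the strings exactly as given.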
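As a rough illustration of the preprocessing described above, the sketch below approximates it with the standard library only. The function name `normalize` is hypothetical and this is not the library's own code (TheFuzz delegates the real work to rapidfuzz); it only mirrors the described behaviour under the assumption that each non-alphanumeric character becomes a space.

```python
def normalize(s: str, force_ascii: bool = False) -> str:
    """Approximate the described preprocessing (illustrative, not the library's code)."""
    if force_ascii:
        # Assumption: "forcing ASCII" means dropping characters outside 7-bit ASCII.
        s = "".join(ch for ch in s if ord(ch) < 128)
    # Replace each non-alphanumeric character with a space and lowercase the rest.
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in s)
    # Trim leading and trailing whitespace.
    return cleaned.strip()


print(normalize("  New-York JETS!  "))      # new york jets
print(normalize("Café", force_ascii=True))  # caf
```

Because punctuation and case are normalized away, two inputs that differ only in those respects compare as identical once preprocessed.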

Capabilities

String Similarity Scoring

Core fuzzy string matching functions that calculate similarity ratios between two strings using different algorithms including basic ratio, partial matching, and token-based comparisons.

def ratio(s1: str, s2: str) -> int: ...
def partial_ratio(s1: str, s2: str) -> int: ...
def token_sort_ratio(s1: str, s2: str, force_ascii: bool = True, full_process: bool = True) -> int: ...
def token_set_ratio(s1: str, s2: str, force_ascii: bool = True, full_process: bool = True) -> int: ...
def WRatio(s1: str, s2: str, force_ascii: bool = True, full_process: bool = True) -> int: ...
def QRatio(s1: str, s2: str, force_ascii: bool = True, full_process: bool = True) -> int: ...
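To give a feel for what the basic and token-sort scores measure, here is a minimal standard-library sketch. It assumes the basic ratio behaves like a normalized longest-common-subsequence similarity, 2 * LCS / (len(a) + len(b)), scaled to 0-100. The names `lcs_length`, `simple_ratio`, and `simple_token_sort_ratio` are hypothetical, and the real scorers are implemented by rapidfuzz, so results may differ in edge cases.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def simple_ratio(a: str, b: str) -> int:
    """Similarity on a 0-100 scale: 2 * LCS / (len(a) + len(b))."""
    total = len(a) + len(b)
    if total == 0:
        return 100
    return round(100 * 2 * lcs_length(a, b) / total)


def simple_token_sort_ratio(a: str, b: str) -> int:
    """Sort whitespace-separated tokens first, so word order is ignored."""
    return simple_ratio(" ".join(sorted(a.split())), " ".join(sorted(b.split())))


print(simple_ratio("this is a test", "this is a test!"))  # 97
print(simple_token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear"))  # 100
```

The first call differs by one character out of 29, which rounds to 97; the second scores 100 because both strings contain the same tokens once sorted.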

String Processing and Extraction

Functions for finding the best matches in collections of strings, including single and multiple match extraction, duplicate removal, and configurable scoring with custom processors.

def extractOne(query: str, choices, processor=None, scorer=None, score_cutoff: int = 0): ...
def extract(query: str, choices, processor=None, scorer=None, limit: int = 5): ...
def extractBests(query: str, choices, processor=None, scorer=None, score_cutoff: int = 0, limit: int = 5): ...
def extractWithoutOrder(query: str, choices, processor=None, scorer=None, score_cutoff: int = 0): ...
def dedupe(contains_dupes: list, threshold: int = 70, scorer=None): ...

String Utilities

Utility functions for string preprocessing and normalization: ASCII conversion, and full text cleaning that replaces non-alphanumeric characters with whitespace, lowercases the text, and trims leading and trailing whitespace.

def full_process(s: str, force_ascii: bool = False) -> str: ...
def ascii_only(s: str) -> str: ...

Types

from typing import Callable, Union, Tuple, List, Dict, Any, Generator, TypeVar, Sequence
from collections.abc import Mapping

# Core type aliases from process module
ChoicesT = Union[Mapping[str, str], Sequence[str]]
T = TypeVar('T')
ProcessorT = Union[Callable[[str, bool], str], Callable[[Any], Any]]
ScorerT = Callable[[str, str, bool, bool], int]

# Additional type aliases for better understanding
Scorer = Callable[[str, str], int]
Processor = Callable[[str], str]
Choice = Union[str, Tuple[Any, ...], Dict[str, Any]]
Choices = Union[List[Choice], Dict[Any, Choice]]
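To show how callables of the simplified `Scorer` and `Processor` shapes fit together, here is a standard-library sketch. The scorer uses `difflib.SequenceMatcher` as a stand-in rather than a Levenshtein-based algorithm, and `difflib_scorer`, `lowercase_processor`, and `best_match` are hypothetical names, not part of thefuzz.

```python
from difflib import SequenceMatcher
from typing import Callable, Optional, Sequence, Tuple

Scorer = Callable[[str, str], int]
Processor = Callable[[str], str]


def difflib_scorer(s1: str, s2: str) -> int:
    """Stand-in scorer built on difflib; returns an integer from 0 to 100."""
    return round(100 * SequenceMatcher(None, s1, s2).ratio())


def lowercase_processor(s: str) -> str:
    """Minimal processor: lowercase and trim."""
    return s.lower().strip()


def best_match(
    query: str,
    choices: Sequence[str],
    processor: Processor,
    scorer: Scorer,
) -> Optional[Tuple[str, int]]:
    """Hypothetical helper: return the (choice, score) pair with the highest score."""
    if not choices:
        return None
    processed_query = processor(query)
    scored = [(choice, scorer(processed_query, processor(choice))) for choice in choices]
    # max() keeps the first of several equally scored choices.
    return max(scored, key=lambda pair: pair[1])


choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
print(best_match("NEW YORK JETS ", choices, lowercase_processor, difflib_scorer))
# ('New York Jets', 100)
```

Any function with the same two-string, integer-returning shape could be swapped in for `difflib_scorer` without changing `best_match`.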

Install with Tessl CLI

npx tessl i tessl/pypi-thefuzz

Describes: pkg:pypi/thefuzz@0.22.x