# RapidFuzz

A high-performance Python library for fuzzy string matching. It computes string similarity with algorithms such as Levenshtein distance, Hamming distance, and Jaro-Winkler, is built on a C++ core for speed, and provides a comprehensive set of matching functions along with efficient batch processing.

## Package Information

- **Package Name**: rapidfuzz
- **Language**: Python
- **Installation**: `pip install rapidfuzz`
- **Requires**: Python 3.10 or later

## Core Imports

```python
import rapidfuzz
```

Common patterns for specific functionality:

```python
from rapidfuzz import fuzz, process, distance, utils
```

Import specific functions:

```python
from rapidfuzz.fuzz import ratio, partial_ratio, partial_ratio_alignment, token_ratio, WRatio, QRatio
from rapidfuzz.process import extractOne, extract, extract_iter, cdist, cpdist
from rapidfuzz.distance import Levenshtein, Hamming, Jaro, JaroWinkler, DamerauLevenshtein
from rapidfuzz.distance import OSA, Indel, LCSseq, Prefix, Postfix
from rapidfuzz.utils import default_process
```

## Basic Usage

```python
from rapidfuzz import fuzz, process

# Basic string similarity
score = fuzz.ratio("this is a test", "this is a test!")
print(f"Similarity: {score}") # 96.55

# Partial matching (substring matching)
score = fuzz.partial_ratio("this is a test", "this is a test!")
print(f"Partial similarity: {score}") # 100.0

# Find best match from a list
choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
match = process.extractOne("new york jets", choices)
print(f"Best match: {match}") # ('New York Jets', 76.92, 1)

# Find multiple matches
matches = process.extract("new york jets", choices, limit=2)
print(f"Top matches: {matches}")
# [('New York Jets', 76.92, 1), ('New York Giants', 64.29, 2)]

# With string preprocessing
from rapidfuzz import utils
match = process.extractOne("new york jets", choices, processor=utils.default_process)
print(f"Preprocessed match: {match}") # ('New York Jets', 100.0, 1)
```

## Architecture

RapidFuzz is organized into four main modules, each serving a distinct purpose:

- **fuzz**: High-level similarity functions (ratio, partial_ratio, token_sort_ratio, WRatio, QRatio)
- **process**: Batch processing functions for comparing a query against lists of choices (extract, extractOne, cdist)
- **distance**: Low-level distance metrics and edit operations (Levenshtein, Hamming, Jaro, etc.)
- **utils**: String preprocessing utilities (default_process)

The library automatically selects optimized C++ implementations (using AVX2 or SSE2 where the CPU supports them) and falls back to a pure Python implementation for compatibility; the short sketch below shows how the four modules fit together.
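
A minimal sketch with one call from each module (illustrative only; exact scores depend on the inputs):

```python
from rapidfuzz import fuzz, process, utils
from rapidfuzz.distance import Levenshtein

# fuzz: one-off similarity between two strings
print(fuzz.WRatio("fuzzy wuzzy", "wuzzy fuzzy"))

# process: compare one query against many choices
print(process.extractOne("fuzzy wuzzy", ["fuzz", "fuzzy wuzzy was a bear"]))

# distance: raw edit-distance metrics
print(Levenshtein.distance("fuzzy", "wuzzy"))  # 1

# utils: normalize strings before comparison
print(utils.default_process("  Fuzzy, Wuzzy!  "))
```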

## Core Functions

### C++ Extension Support

```python { .api }
def get_include() -> str: ...
```

Returns the directory containing RapidFuzz header files for building C++ extensions that use RapidFuzz functionality.

**Usage Example:**
```python
import rapidfuzz

include_dir = rapidfuzz.get_include()
print(f"Header files located at: {include_dir}")

# Use in setup.py for C++ extensions
from setuptools import Extension
ext = Extension(
    'my_extension',
    sources=['my_extension.cpp'],
    include_dirs=[rapidfuzz.get_include()]
)
```

## Capabilities

### Fuzzy String Matching

High-level string similarity functions including basic ratios, partial matching, token-based comparisons, and weighted algorithms optimized for different use cases.

```python { .api }
def ratio(s1, s2, *, processor=None, score_cutoff=0) -> float: ...
def partial_ratio(s1, s2, *, processor=None, score_cutoff=0) -> float: ...
def partial_ratio_alignment(s1, s2, *, processor=None, score_cutoff=0) -> ScoreAlignment | None: ...
def token_sort_ratio(s1, s2, *, processor=None, score_cutoff=0) -> float: ...
def token_set_ratio(s1, s2, *, processor=None, score_cutoff=0) -> float: ...
def token_ratio(s1, s2, *, processor=None, score_cutoff=0) -> float: ...
def partial_token_sort_ratio(s1, s2, *, processor=None, score_cutoff=0) -> float: ...
def partial_token_set_ratio(s1, s2, *, processor=None, score_cutoff=0) -> float: ...
def partial_token_ratio(s1, s2, *, processor=None, score_cutoff=0) -> float: ...
def WRatio(s1, s2, *, processor=None, score_cutoff=0) -> float: ...
def QRatio(s1, s2, *, processor=None, score_cutoff=0) -> float: ...
```
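
A brief, hedged illustration of the token-based and alignment variants (exact scores depend on the inputs):

```python
from rapidfuzz import fuzz

# Token-based scorers ignore word order
print(fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear"))  # 100.0

# partial_ratio_alignment also reports where the best-matching window lies
alignment = fuzz.partial_ratio_alignment("a certain string", "this is a certain string!")
if alignment is not None:
    print(alignment.score, alignment.dest_start, alignment.dest_end)
```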

[Fuzzy String Matching](./fuzzy-matching.md)

### Batch Processing

Efficient functions for comparing a query string against lists or collections of candidate strings, with support for finding single best matches, top-N matches, and distance matrices.

```python { .api }
def extractOne(query, choices, *, scorer=WRatio, processor=None, score_cutoff=None) -> tuple | None: ...
def extract(query, choices, *, scorer=WRatio, processor=None, limit=5, score_cutoff=None) -> list: ...
def extract_iter(query, choices, *, scorer=WRatio, processor=None, score_cutoff=None) -> Generator: ...
def cdist(queries, choices, *, scorer=ratio, processor=None, workers=1) -> numpy.ndarray: ...
def cpdist(queries, choices, *, scorer=ratio, processor=None, workers=1) -> numpy.ndarray: ...
```
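
As a sketch of the batch-oriented API, the example below uses a score_cutoff to drop weak candidates early and cdist to score every query against every choice as a NumPy matrix (values are illustrative):

```python
from rapidfuzz import process, fuzz

queries = ["apple", "banana"]
choices = ["apples", "bananas", "cherry"]

# score_cutoff skips candidates below the threshold
best = process.extractOne("appel", choices, scorer=fuzz.ratio, score_cutoff=60)
print(best)

# Full pairwise similarity matrix, shape (len(queries), len(choices));
# workers=-1 uses all available CPU cores
matrix = process.cdist(queries, choices, scorer=fuzz.ratio, workers=-1)
print(matrix.shape)  # (2, 3)
```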

[Batch Processing](./batch-processing.md)

### Distance Metrics

Low-level distance algorithms providing raw distance calculations, similarity scores, normalized metrics, and edit operation sequences for advanced string analysis.

```python { .api }
class Levenshtein:
    @staticmethod
    def distance(s1, s2, *, score_cutoff=None) -> int: ...
    @staticmethod
    def similarity(s1, s2, *, score_cutoff=None) -> int: ...
    @staticmethod
    def normalized_distance(s1, s2, *, score_cutoff=None) -> float: ...
    @staticmethod
    def normalized_similarity(s1, s2, *, score_cutoff=None) -> float: ...
```
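
For instance (a minimal sketch; the other metric modules expose the same four functions):

```python
from rapidfuzz.distance import Levenshtein, JaroWinkler

print(Levenshtein.distance("kitten", "sitting"))               # 3 edits
print(Levenshtein.normalized_similarity("kitten", "sitting"))  # ~0.571
print(JaroWinkler.similarity("hello", "hallo"))                # value in [0, 1]
```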

[Distance Metrics](./distance-metrics.md)

### String Preprocessing

Utilities for normalizing and preprocessing strings before comparison, including case normalization, whitespace handling, and non-alphanumeric character removal.

```python { .api }
def default_process(sentence: str) -> str: ...
```
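
A small sketch of how preprocessing changes a comparison:

```python
from rapidfuzz import fuzz
from rapidfuzz.utils import default_process

a, b = "New York Jets!", "new york jets"
print(fuzz.ratio(a, b))                             # lowered by case and punctuation
print(fuzz.ratio(a, b, processor=default_process))  # 100.0 after normalization
```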

[String Preprocessing](./string-preprocessing.md)

## Types

```python { .api }
from typing import Sequence, Hashable, Callable, Iterable, Mapping, Any
from collections.abc import Generator
import numpy

# Core types for string inputs
StringType = Sequence[Hashable]  # Accepts strings, lists, tuples of hashable items

# Edit operation types
class Editop:
    def __init__(self, tag: str, src_pos: int, dest_pos: int) -> None: ...
    tag: str       # 'replace', 'delete', 'insert'
    src_pos: int   # Position in source string
    dest_pos: int  # Position in destination string

class Editops:
    # List-like container of Editop objects
    def __init__(self, editops: list | None = None, src_len: int = 0, dest_len: int = 0) -> None: ...
    def __len__(self) -> int: ...
    def __getitem__(self, index: int) -> Editop: ...
    def as_opcodes(self) -> Opcodes: ...
    def as_matching_blocks(self) -> list[MatchingBlock]: ...
    def as_list(self) -> list[tuple[str, int, int]]: ...
    def copy(self) -> Editops: ...
    def inverse(self) -> Editops: ...
    def remove_subsequence(self, subsequence: Editops) -> Editops: ...
    def apply(self, source_string: str | bytes, destination_string: str | bytes) -> str: ...
    @classmethod
    def from_opcodes(cls, opcodes: Opcodes) -> Editops: ...
    src_len: int
    dest_len: int

class Opcode:
    def __init__(self, tag: str, a1: int, a2: int, b1: int, b2: int) -> None: ...
    tag: str  # 'replace', 'delete', 'insert', 'equal'
    a1: int   # Start position in first string
    a2: int   # End position in first string
    b1: int   # Start position in second string
    b2: int   # End position in second string

class Opcodes:
    # List-like container of Opcode objects
    def __init__(self, opcodes: list | None = None, src_len: int = 0, dest_len: int = 0) -> None: ...
    def __len__(self) -> int: ...
    def __getitem__(self, index: int) -> Opcode: ...
    def as_editops(self) -> Editops: ...
    def as_matching_blocks(self) -> list[MatchingBlock]: ...
    def as_list(self) -> list[tuple[str, int, int, int, int]]: ...
    def copy(self) -> Opcodes: ...
    def inverse(self) -> Opcodes: ...
    def apply(self, source_string: str | bytes, destination_string: str | bytes) -> str: ...
    @classmethod
    def from_editops(cls, editops: Editops) -> Opcodes: ...
    src_len: int
    dest_len: int

class MatchingBlock:
    def __init__(self, a: int, b: int, size: int) -> None: ...
    a: int     # Start position in first string
    b: int     # Start position in second string
    size: int  # Length of the matching block

class ScoreAlignment:
    def __init__(self, score: float, src_start: int, src_end: int, dest_start: int, dest_end: int) -> None: ...
    score: float     # Similarity/distance score
    src_start: int   # Start position in source
    src_end: int     # End position in source
    dest_start: int  # Start position in destination
    dest_end: int    # End position in destination

# Process function return types
ExtractResult = tuple[str, float, int]         # (match, score, index)
ExtractResultMapping = tuple[str, float, Any]  # (match, score, key)
```
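
To show how these containers are typically obtained, here is a hedged sketch using Levenshtein.editops (the exact sequence of operations the library chooses may differ from what you expect):

```python
from rapidfuzz.distance import Levenshtein

ops = Levenshtein.editops("spam", "spots")   # Editops container
for op in ops:
    print(op.tag, op.src_pos, op.dest_pos)

opcodes = ops.as_opcodes()                   # equivalent Opcodes view
print(opcodes.as_list())

print(ops.apply("spam", "spots"))            # reconstructs "spots" from "spam"
```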