# Batch Processing

Efficient functions for comparing a query string against a collection of candidate strings. They are optimized for large collections and return results in several formats: a single best match, a ranked list, a lazy iterator, or a NumPy array.

## Capabilities

### Extract Single Best Match

Finds the single best match from a collection of choices.

```python { .api }
def extractOne(
    query: Sequence[Hashable] | None,
    choices: Iterable[Sequence[Hashable] | None] | Mapping[Any, Sequence[Hashable] | None],
    *,
    scorer: Callable = WRatio,
    processor: Callable | None = None,
    score_cutoff: float | None = None,
    score_hint: float | None = None,
    scorer_kwargs: dict[str, Any] | None = None
) -> tuple[Sequence[Hashable], float, int | Any] | None
```

**Parameters:**
- `query`: String to find matches for
- `choices`: Iterable of strings, or a mapping `{key: string}`
- `scorer`: Scoring function (default: `WRatio`)
- `processor`: Function applied to the query and each choice before scoring (default: `None`, no preprocessing)
- `score_cutoff`: Score threshold; choices scoring worse than it are discarded (below it for similarity scorers, above it for distance scorers)
- `score_hint`: Expected score, used as an optimization hint
- `scorer_kwargs`: Additional keyword arguments passed to the scorer

**Returns:** `(match, score, index_or_key)` tuple, or `None` if no choice meets the cutoff

**Usage Example:**
```python
from rapidfuzz import process, fuzz

choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]

# Find best match (matching is case-sensitive unless a processor is given)
match = process.extractOne("new york jets", choices)
print(match)  # ('New York Jets', 76.92..., 1)

# With custom scorer
match = process.extractOne("Cowboys", choices, scorer=fuzz.partial_ratio)
print(match)  # ('Dallas Cowboys', 100.0, 3)

# With score cutoff
match = process.extractOne("chicago", choices, score_cutoff=50)
print(match)  # None if no choice scores at least 50

# With mapping: the third element is the key instead of the index
choices_dict = {"team1": "Atlanta Falcons", "team2": "New York Jets"}
match = process.extractOne("Jets", choices_dict)
print(match)  # ('New York Jets', 90.0, 'team2')
```

### Extract Multiple Matches

Finds the top N matches from a collection, sorted by score in descending order.

```python { .api }
def extract(
    query: Sequence[Hashable] | None,
    choices: Collection[Sequence[Hashable] | None] | Mapping[Any, Sequence[Hashable] | None],
    *,
    scorer: Callable = WRatio,
    processor: Callable | None = None,
    limit: int | None = 5,
    score_cutoff: float | None = None,
    score_hint: float | None = None,
    scorer_kwargs: dict[str, Any] | None = None
) -> list[tuple[Sequence[Hashable], float, int | Any]]
```

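**Parameters:**
- `limit`: Maximum number of matches to return (default: 5); `None` returns every match that meets the cutoff

All other parameters are the same as for `extractOne`.

**Returns:** List of `(match, score, index_or_key)` tuples, sorted best match first (descending score for similarity scorers)
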
**Usage Example:**
```python
from rapidfuzz import process, utils

choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]

# Get top 2 matches
matches = process.extract("new york jets", choices, limit=2)
print(matches)
# [('New York Jets', 76.92..., 1), ('New York Giants', 64.28..., 2)]

# With preprocessing for better matches
matches = process.extract("new york jets", choices,
                          processor=utils.default_process, limit=3)
print(matches)
# [('New York Jets', 100.0, 1), ('New York Giants', 78.57..., 2), ...]

# Get all matches above threshold
matches = process.extract("new", choices, score_cutoff=30, limit=None)
print(len(matches))  # Number of choices with score >= 30
```

### Extract Iterator

Yields every match that meets the score cutoff, in the order the choices are iterated (results are not sorted by score). Useful for memory-efficient processing of large choice sets.

```python { .api }
def extract_iter(
    query: Sequence[Hashable] | None,
    choices: Iterable[Sequence[Hashable] | None] | Mapping[Any, Sequence[Hashable] | None],
    *,
    scorer: Callable = WRatio,
    processor: Callable | None = None,
    score_cutoff: float | None = None,
    score_hint: float | None = None,
    scorer_kwargs: dict[str, Any] | None = None
) -> Generator[tuple[Sequence[Hashable], float, int | Any], None, None]
```

**Returns:** Generator yielding `(match, score, index_or_key)` tuples

**Usage Example:**
```python
from rapidfuzz import process

choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]

# Process matches one at a time
for match, score, index in process.extract_iter("new", choices, score_cutoff=50):
    print(f"Match: {match}, Score: {score:.1f}, Index: {index}")

# Memory-efficient processing of large datasets
large_choices = [f"item {i}" for i in range(100_000)]  # Stand-in for a large list of strings
best_score = 0
best_match = None

for match, score, index in process.extract_iter("query", large_choices):
    if score > best_score:
        best_score = score
        best_match = (match, score, index)
```

### Cross-Distance Matrix

Computes the similarity or distance between every query and every choice and returns the results as a matrix. Requires NumPy.

```python { .api }
def cdist(
    queries: Collection[Sequence[Hashable] | None],
    choices: Collection[Sequence[Hashable] | None],
    *,
    scorer: Callable = ratio,
    processor: Callable | None = None,
    score_cutoff: float | None = None,
    score_hint: float | None = None,
    score_multiplier: float = 1,
    dtype: Any = None,
    workers: int = 1,
    scorer_kwargs: dict[str, Any] | None = None
) -> numpy.ndarray
```

**Parameters:**
- `queries`: Collection of query strings
- `choices`: Collection of choice strings
- `score_multiplier`: Factor each score is multiplied by, useful for storing normalized scores in an integer `dtype`
- `dtype`: NumPy data type of the result array
- `workers`: Number of parallel workers (`-1` uses all available CPU cores)

**Returns:** 2D NumPy array with shape `(len(queries), len(choices))`

**Usage Example:**
```python
import numpy as np
from rapidfuzz import process

queries = ["apple", "orange"]
choices = ["apples", "oranges", "banana"]

# Compute the full similarity matrix (the default scorer, ratio, is a similarity)
matrix = process.cdist(queries, choices)
print(matrix.shape)  # (2, 3)
print(matrix)
# [[similarity(apple, apples), similarity(apple, oranges), similarity(apple, banana)],
#  [similarity(orange, apples), similarity(orange, oranges), similarity(orange, banana)]]

# Find best match for each query
best_indices = np.argmax(matrix, axis=1)
for i, query in enumerate(queries):
    best_choice = choices[best_indices[i]]
    best_score = matrix[i, best_indices[i]]
    print(f"{query} -> {best_choice} ({best_score:.1f})")
```

### Pairwise Distance

Computes the similarity or distance between corresponding elements of `queries` and `choices`: `queries[i]` is compared only with `choices[i]`. Both collections must have the same length. Requires NumPy.

```python { .api }
def cpdist(
    queries: Collection[Sequence[Hashable] | None],
    choices: Collection[Sequence[Hashable] | None],
    *,
    scorer: Callable = ratio,
    processor: Callable | None = None,
    score_cutoff: float | None = None,
    score_hint: float | None = None,
    score_multiplier: float = 1,
    dtype: Any = None,
    workers: int = 1,
    scorer_kwargs: dict[str, Any] | None = None
) -> numpy.ndarray
```

**Returns:** NumPy array with one score per aligned pair, that is, `len(queries)` elements

## Usage Patterns

### Choosing the Right Function

- **`extractOne`**: You need only the single best match
- **`extract`**: You need the top N matches as a sorted list
- **`extract_iter`**: Large choice sets, memory constraints, or streaming results
- **`cdist`**: You need the full score matrix for multiple queries
- **`cpdist`**: You need one score per aligned pair (`queries[i]` vs `choices[i]`)

### Performance Optimization

```python
from rapidfuzz import process, fuzz

choices = ["..."] * 10000  # Large choice list
queries = ["query one", "query two"]  # Example queries for cdist

# Use score_cutoff to filter weak matches early
matches = process.extract("query", choices, score_cutoff=80)

# Use score_hint if you know expected score range
matches = process.extract("query", choices, score_hint=85)

# Use a cheaper scorer than the default WRatio when its extra heuristics are not needed
matches = process.extract("query", choices, scorer=fuzz.QRatio)

# Parallel processing for matrix operations
matrix = process.cdist(queries, choices, workers=4)
```

### Handling Different Input Types

```python
from rapidfuzz import process

# List of strings (most common)
choices = ["option1", "option2", "option3"]
match = process.extractOne("query", choices)
# Returns: (match_string, score, index)

# Dictionary mapping
choices = {"a": "option1", "b": "option2", "c": "option3"}
match = process.extractOne("query", choices)
# Returns: (match_string, score, key)

# Pandas Series (if pandas available)
import pandas as pd
choices = pd.Series(["option1", "option2", "option3"])
match = process.extractOne("query", choices)
# Returns: (match_string, score, index)

# Handle None values in choices
choices = ["option1", None, "option3"]
matches = process.extract("query", choices)  # None values are skipped
```

### Custom Scoring Functions

```python
from rapidfuzz import process, fuzz, distance

choices = ["option1", "option2", "option3"]

# Use distance metrics directly
matches = process.extract("query", choices, scorer=distance.Levenshtein.distance)
# Returns edit distance (lower = more similar)

# Custom scorer function; **kwargs absorbs keyword arguments such as score_cutoff
def custom_scorer(s1, s2, **kwargs):
    # Illustrative logic only: case-insensitive similarity in the range 0-100
    return fuzz.ratio(s1.lower(), s2.lower())

matches = process.extract("query", choices, scorer=custom_scorer)

# Pass additional arguments to scorer
matches = process.extract("query", choices,
                          scorer=distance.Levenshtein.distance,
                          scorer_kwargs={"weights": (1, 2, 1)})
```