Tessl Tile for pypi/rapidfuzz@3.14.0

# Batch Processing

Efficient functions for comparing a query string against a collection of candidate strings. They are optimized for large lists and return results in several formats: a single tuple, a sorted list, a lazy generator, or a NumPy array.

## Capabilities

### Extract Single Best Match

Finds the single best match from a collection of choices.

```python { .api }
def extractOne(
    query: Sequence[Hashable] | None,
    choices: Iterable[Sequence[Hashable] | None] | Mapping[Any, Sequence[Hashable] | None],
    *,
    scorer: Callable = WRatio,
    processor: Callable | None = None,
    score_cutoff: float | None = None,
    score_hint: float | None = None,
    scorer_kwargs: dict[str, Any] | None = None
) -> tuple[Sequence[Hashable], float, int | Any] | None
```

**Parameters:**
- `query`: String to find matches for
- `choices`: Iterable of strings or mapping `{key: string}`
- `scorer`: Scoring function (default: `WRatio`)
- `processor`: Optional function applied to the query and each choice before scoring (default: `None`, no preprocessing)
- `score_cutoff`: Score threshold: a minimum for similarity scorers, a maximum for distance scorers
- `score_hint`: Expected score for optimization
- `scorer_kwargs`: Additional keyword arguments passed to the scorer

**Returns:** `(match, score, index_or_key)` tuple, or `None` if no choice meets the cutoff

**Usage Example:**
```python
from rapidfuzz import process, fuzz

choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]

# Find best match (no preprocessing by default, so case differences lower the score)
match = process.extractOne("new york jets", choices)
print(match)  # ('New York Jets', 76.92307692307692, 1)

# With custom scorer (query capitalized, since the comparison is case-sensitive)
match = process.extractOne("Cowboys", choices, scorer=fuzz.partial_ratio)
print(match)  # ('Dallas Cowboys', 100.0, 3)

# With score cutoff
match = process.extractOne("chicago", choices, score_cutoff=50)
print(match)  # None when no choice scores at least 50

# With mapping: the third element is the key instead of the index
choices_dict = {"team1": "Atlanta Falcons", "team2": "New York Jets"}
match = process.extractOne("Jets", choices_dict)
print(match)  # ('New York Jets', 90.0, 'team2')
```

### Extract Multiple Matches

Finds the top N matches from a collection, sorted from best to worst (descending score for similarity scorers).

```python { .api }
def extract(
    query: Sequence[Hashable] | None,
    choices: Collection[Sequence[Hashable] | None] | Mapping[Any, Sequence[Hashable] | None],
    *,
    scorer: Callable = WRatio,
    processor: Callable | None = None,
    limit: int | None = 5,
    score_cutoff: float | None = None,
    score_hint: float | None = None,
    scorer_kwargs: dict[str, Any] | None = None
) -> list[tuple[Sequence[Hashable], float, int | Any]]
```

**Parameters:**
- `limit`: Maximum number of matches to return (default: 5); `None` returns all matches

**Returns:** List of `(match, score, index_or_key)` tuples, sorted best first (descending score for similarity scorers)

**Usage Example:**
```python
from rapidfuzz import process, utils

choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]

# Get top 2 matches
matches = process.extract("new york jets", choices, limit=2)
print(matches)
# [('New York Jets', 76.92307692307692, 1), ('New York Giants', 64.28571428571428, 2)]

# With preprocessing for better matches
matches = process.extract("new york jets", choices,
                          processor=utils.default_process, limit=3)
print(matches)
# [('New York Jets', 100.0, 1), ('New York Giants', 78.57142857142857, 2), ...]

# Get all matches above threshold
matches = process.extract("new", choices, score_cutoff=30, limit=None)
print(len(matches))  # Number of choices with score >= 30
```

### Extract Iterator

Returns a generator over all matches that meet the score cutoff, yielded in the order the choices are iterated rather than sorted by score. Useful for memory-efficient processing of large choice sets.

```python { .api }
def extract_iter(
    query: Sequence[Hashable] | None,
    choices: Iterable[Sequence[Hashable] | None] | Mapping[Any, Sequence[Hashable] | None],
    *,
    scorer: Callable = WRatio,
    processor: Callable | None = None,
    score_cutoff: float | None = None,
    score_hint: float | None = None,
    scorer_kwargs: dict[str, Any] | None = None
) -> Generator[tuple[Sequence[Hashable], float, int | Any], None, None]
```

**Returns:** Generator yielding `(match, score, index_or_key)` tuples

**Usage Example:**
```python
from rapidfuzz import process

choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]

# Process matches one at a time
for match, score, index in process.extract_iter("new", choices, score_cutoff=50):
    print(f"Match: {match}, Score: {score:.1f}, Index: {index}")

# Memory-efficient processing of large datasets
large_choices = [...]  # Placeholder: substitute a large list of strings
best_score = 0
best_match = None

# Track the running best without building a full result list
for match, score, index in process.extract_iter("query", large_choices):
    if score > best_score:
        best_score = score
        best_match = (match, score, index)
```

### Cross-Distance Matrix

Computes a similarity/distance matrix between all queries and all choices. Requires NumPy.

```python { .api }
def cdist(
    queries: Collection[Sequence[Hashable] | None],
    choices: Collection[Sequence[Hashable] | None],
    *,
    scorer: Callable = ratio,
    processor: Callable | None = None,
    score_cutoff: float | None = None,
    score_hint: float | None = None,
    score_multiplier: float = 1,
    dtype: Any = None,
    workers: int = 1,
    scorer_kwargs: dict[str, Any] | None = None
) -> numpy.ndarray
```

**Parameters:**
- `queries`: List of query strings
- `choices`: List of choice strings
- `score_multiplier`: Multiply scores by this factor
- `dtype`: NumPy data type for result array
- `workers`: Number of parallel workers (default: 1; `-1` uses all available cores)

**Returns:** 2D NumPy array with shape `(len(queries), len(choices))`

**Usage Example:**
```python
import numpy as np
from rapidfuzz import process

queries = ["apple", "orange"]
choices = ["apples", "oranges", "banana"]

# Compute full similarity matrix (default scorer: ratio, higher = more similar)
matrix = process.cdist(queries, choices)
print(matrix.shape)  # (2, 3)
print(matrix)
# [[similarity(apple, apples), similarity(apple, oranges), similarity(apple, banana)],
#  [similarity(orange, apples), similarity(orange, oranges), similarity(orange, banana)]]

# Find best match for each query
best_indices = np.argmax(matrix, axis=1)
for i, query in enumerate(queries):
    best_choice = choices[best_indices[i]]
    best_score = matrix[i, best_indices[i]]
    print(f"{query} -> {best_choice} ({best_score:.1f})")
```

### Pairwise Distance

Computes the score for each pair of corresponding elements, `queries[i]` against `choices[i]`. Both inputs must have the same length. Requires NumPy.

```python { .api }
def cpdist(
    queries: Collection[Sequence[Hashable] | None],
    choices: Collection[Sequence[Hashable] | None],
    *,
    scorer: Callable = ratio,
    processor: Callable | None = None,
    score_cutoff: float | None = None,
    score_hint: float | None = None,
    score_multiplier: float = 1,
    dtype: Any = None,
    workers: int = 1,
    scorer_kwargs: dict[str, Any] | None = None
) -> numpy.ndarray
```

**Returns:** 1D NumPy array with `len(queries)` elements

## Usage Patterns

### Choosing the Right Function

- **`extractOne`**: Need single best match
- **`extract`**: Need top N matches, known small result set
- **`extract_iter`**: Large choice sets, memory-constrained, or streaming results
- **`cdist`**: Need complete similarity matrix, multiple queries
- **`cpdist`**: Need element-wise scores for two equal-length sequences (`queries[i]` vs `choices[i]`)

### Performance Optimization

```python
from rapidfuzz import process, fuzz

queries = ["query"]
choices = ["..."] * 10000  # Large choice list

# Use score_cutoff to filter weak matches early
matches = process.extract("query", choices, score_cutoff=80)

# Use score_hint if you know expected score range
matches = process.extract("query", choices, score_hint=85)

# Use a simpler, faster scorer when WRatio's extra heuristics are not needed
matches = process.extract("query", choices, scorer=fuzz.QRatio)

# Parallel processing for matrix operations
matrix = process.cdist(queries, choices, workers=4)
```

### Handling Different Input Types

```python
from rapidfuzz import process

# List of strings (most common)
choices = ["option1", "option2", "option3"]
match = process.extractOne("query", choices)
# Returns: (match_string, score, index)

# Dictionary mapping
choices = {"a": "option1", "b": "option2", "c": "option3"}
match = process.extractOne("query", choices)
# Returns: (match_string, score, key)

# Pandas Series (if pandas available)
import pandas as pd
choices = pd.Series(["option1", "option2", "option3"])
match = process.extractOne("query", choices)
# Returns: (match_string, score, index)

# Handle None values in choices
choices = ["option1", None, "option3"]
matches = process.extract("query", choices)  # None values are skipped
```

### Custom Scoring Functions

```python
from rapidfuzz import process, distance, fuzz

choices = ["option1", "option2", "option3"]

# Use distance metrics directly
matches = process.extract("query", choices, scorer=distance.Levenshtein.distance)
# Returns edit distances (lower = more similar), smallest distance first

# Custom scorer function
def custom_scorer(s1, s2, **kwargs):
    # Placeholder logic: swap in your own similarity computation here.
    # **kwargs absorbs keyword arguments such as score_cutoff.
    return fuzz.ratio(s1, s2)

matches = process.extract("query", choices, scorer=custom_scorer)

# Pass additional arguments to scorer
matches = process.extract("query", choices,
                          scorer=distance.Levenshtein.distance,
                          scorer_kwargs={"weights": (1, 2, 1)})
```
