
# Batch Processing

Functions for comparing a query string against a collection of candidate strings. They are optimized for large lists and return results in several formats: a single best match, a ranked list, a lazy iterator, or a NumPy array.

## Capabilities

### Extract Single Best Match

Finds the single best match from a collection of choices.

```python { .api }
def extractOne(
    query: Sequence[Hashable] | None,
    choices: Iterable[Sequence[Hashable] | None] | Mapping[Any, Sequence[Hashable] | None],
    *,
    scorer: Callable = WRatio,
    processor: Callable | None = None,
    score_cutoff: float | None = None,
    score_hint: float | None = None,
    scorer_kwargs: dict[str, Any] | None = None
) -> tuple[Sequence[Hashable], float, int | Any] | None
```

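**Parameters:**
- `query`: String to find matches for
- `choices`: Iterable of strings, or a mapping `{key: string}`
- `scorer`: Scoring function (default: `WRatio`)
- `processor`: Optional preprocessing function applied to the query and each choice before scoring (default: `None`, no preprocessing)
- `score_cutoff`: Score threshold; choices scoring worse than this are not returned
- `score_hint`: Expected score of the best match, used as an optimization hint
- `scorer_kwargs`: Additional keyword arguments passed to the scorer

**Returns:** `(match, score, index_or_key)` tuple, or `None` if no choice meets the cutoff. The third element is the list index for iterables and the key for mappings.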
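
To make the parameters above concrete, here is a rough pure-Python sketch of the behavior they describe. It is not rapidfuzz's implementation: `best_match` and `stand_in_scorer` are hypothetical names, and the standard library's `difflib.SequenceMatcher` stands in for a real scorer, scaled to 0-100.

```python
from collections.abc import Mapping
from difflib import SequenceMatcher


def stand_in_scorer(a, b):
    # Assumption: difflib replaces a rapidfuzz scorer here; scaled to 0-100.
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def best_match(query, choices, *, scorer=stand_in_scorer, processor=None, score_cutoff=None):
    # Mappings yield (key, value) pairs; other iterables yield (index, value).
    items = choices.items() if isinstance(choices, Mapping) else enumerate(choices)
    if processor is not None:
        query = processor(query)
    best = None
    for key, choice in items:
        if choice is None:  # None entries are skipped
            continue
        candidate = processor(choice) if processor is not None else choice
        score = scorer(query, candidate)
        if score_cutoff is not None and score < score_cutoff:
            continue
        if best is None or score > best[1]:
            best = (choice, score, key)  # the original, unprocessed choice is kept
    return best


teams = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
print(best_match("dallas cowboys", teams, processor=str.lower))
print(best_match("zzzz", teams, score_cutoff=50))  # no team contains "z", so None
print(best_match("New York Jets", {"team2": "New York Jets"}))
```

The sketch only mirrors the documented contract (processor, cutoff, index or key in the result); the real function is considerably faster and uses `score_hint` and scorer-specific optimizations that are not modeled here.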


**Usage Example:**
```python
from rapidfuzz import process, fuzz, utils

choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]

# Find best match (comparison is case-sensitive unless a processor is given)
match = process.extractOne("new york jets", choices)
print(match)  # ('New York Jets', 76.92, 1), score rounded

# With custom scorer; default_process lowercases both sides before scoring
match = process.extractOne("cowboys", choices, scorer=fuzz.partial_ratio,
                           processor=utils.default_process)
print(match)  # ('Dallas Cowboys', 100.0, 3)

# With score cutoff
match = process.extractOne("chicago", choices, score_cutoff=50)
print(match)  # None (no choice scores at least 50)

# With mapping: the third element is the key instead of the index
choices_dict = {"team1": "Atlanta Falcons", "team2": "New York Jets"}
match = process.extractOne("jets", choices_dict, processor=utils.default_process)
print(match)  # ('New York Jets', 90.0, 'team2')
```


### Extract Multiple Matches

Finds the top N matches from a collection, sorted from best to worst score (descending for similarity scorers).

```python { .api }
def extract(
    query: Sequence[Hashable] | None,
    choices: Collection[Sequence[Hashable] | None] | Mapping[Any, Sequence[Hashable] | None],
    *,
    scorer: Callable = WRatio,
    processor: Callable | None = None,
    limit: int | None = 5,
    score_cutoff: float | None = None,
    score_hint: float | None = None,
    scorer_kwargs: dict[str, Any] | None = None
) -> list[tuple[Sequence[Hashable], float, int | Any]]
```

**Parameters:**
- `limit`: Maximum number of matches to return (default: 5); pass `None` to return every match that meets the cutoff
- The remaining parameters behave as in `extractOne`

**Returns:** List of `(match, score, index_or_key)` tuples, sorted by score descending

**Usage Example:**
```python
from rapidfuzz import process, utils

choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]

# Get top 2 matches
matches = process.extract("new york jets", choices, limit=2)
print(matches)
# [('New York Jets', 76.92, 1), ('New York Giants', 64.29, 2)]  (scores rounded)

# With preprocessing for better matches
matches = process.extract("new york jets", choices,
                          processor=utils.default_process, limit=3)
print(matches)
# [('New York Jets', 100.0, 1), ('New York Giants', 78.57, 2), ...]

# Get all matches above threshold
matches = process.extract("new", choices, score_cutoff=30, limit=None)
print(len(matches))  # Number of choices with score >= 30
```


### Extract Iterator

Returns an iterator over all matches that meet the score cutoff. Results are yielded in the order of `choices` rather than sorted by score, which makes this useful for memory-efficient processing of large choice sets.

```python { .api }
def extract_iter(
    query: Sequence[Hashable] | None,
    choices: Iterable[Sequence[Hashable] | None] | Mapping[Any, Sequence[Hashable] | None],
    *,
    scorer: Callable = WRatio,
    processor: Callable | None = None,
    score_cutoff: float | None = None,
    score_hint: float | None = None,
    scorer_kwargs: dict[str, Any] | None = None
) -> Generator[tuple[Sequence[Hashable], float, int | Any], None, None]
```

**Returns:** Generator yielding `(match, score, index_or_key)` tuples

**Usage Example:**
```python
from rapidfuzz import process

choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]

# Process matches one at a time
for match, score, index in process.extract_iter("new", choices, score_cutoff=50):
    print(f"Match: {match}, Score: {score:.1f}, Index: {index}")

# Memory-efficient processing of large datasets
large_choices = [...]  # Placeholder: substitute a large list (or generator) of strings
best_score = 0
best_match = None

for match, score, index in process.extract_iter("query", large_choices):
    if score > best_score:
        best_score = score
        best_match = (match, score, index)
```


### Cross-Distance Matrix

Computes a similarity or distance matrix between every query and every choice. Requires NumPy.

```python { .api }
def cdist(
    queries: Collection[Sequence[Hashable] | None],
    choices: Collection[Sequence[Hashable] | None],
    *,
    scorer: Callable = ratio,
    processor: Callable | None = None,
    score_cutoff: float | None = None,
    score_hint: float | None = None,
    score_multiplier: float = 1,
    dtype: Any = None,
    workers: int = 1,
    scorer_kwargs: dict[str, Any] | None = None
) -> numpy.ndarray
```

**Parameters:**
- `queries`: Collection of query strings
- `choices`: Collection of choice strings
- `score_multiplier`: Factor each score is multiplied by before it is stored (mainly useful with integer `dtype`s)
- `dtype`: NumPy data type of the result array (default: chosen based on the scorer)
- `workers`: Number of parallel workers (default: 1)

**Returns:** 2D NumPy array with shape `(len(queries), len(choices))`

**Usage Example:**
```python
import numpy as np
from rapidfuzz import process

queries = ["apple", "orange"]
choices = ["apples", "oranges", "banana"]

# Compute the full similarity matrix (default scorer: ratio)
matrix = process.cdist(queries, choices)
print(matrix.shape)  # (2, 3)
print(matrix)
# [[similarity(apple, apples), similarity(apple, oranges), similarity(apple, banana)],
#  [similarity(orange, apples), similarity(orange, oranges), similarity(orange, banana)]]

# Find best match for each query
best_indices = np.argmax(matrix, axis=1)
for i, query in enumerate(queries):
    best_choice = choices[best_indices[i]]
    best_score = matrix[i, best_indices[i]]
    print(f"{query} -> {best_choice} ({best_score:.1f})")
```


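### Pairwise Distance

Computes the distance or similarity between corresponding elements of `queries` and `choices` (`queries[i]` against `choices[i]`), so both collections must have the same length. Requires NumPy.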

```python { .api }
def cpdist(
    queries: Collection[Sequence[Hashable] | None],
    choices: Collection[Sequence[Hashable] | None],
    *,
    scorer: Callable = ratio,
    processor: Callable | None = None,
    score_cutoff: float | None = None,
    score_hint: float | None = None,
    score_multiplier: float = 1,
    dtype: Any = None,
    workers: int = 1,
    scorer_kwargs: dict[str, Any] | None = None
) -> numpy.ndarray
```

**Returns:** 1D NumPy array with `len(queries)` elements, one score per `(queries[i], choices[i])` pair

## Usage Patterns

### Choosing the Right Function

- **`extractOne`**: Need single best match
- **`extract`**: Need top N matches, known small result set
- **`extract_iter`**: Large choice sets, memory-constrained, or streaming results
- **`cdist`**: Need complete similarity matrix, multiple queries
- **`cpdist`**: Need element-wise scores for two equal-length collections, as a flat array

### Performance Optimization

```python
from rapidfuzz import process, fuzz

choices = ["..."] * 10000  # Placeholder for a large choice list
queries = ["query"]  # Placeholder query list for cdist

# Use score_cutoff to filter weak matches early
matches = process.extract("query", choices, score_cutoff=80)

# Use score_hint if you know expected score range
matches = process.extract("query", choices, score_hint=85)

# Use a simpler, faster scorer when WRatio's extra heuristics are not needed
matches = process.extract("query", choices, scorer=fuzz.QRatio)

# Parallel processing for matrix operations
matrix = process.cdist(queries, choices, workers=4)
```

### Handling Different Input Types

```python
from rapidfuzz import process

# List of strings (most common)
choices = ["option1", "option2", "option3"]
match = process.extractOne("query", choices)
# Returns: (match_string, score, index)

# Dictionary mapping
choices = {"a": "option1", "b": "option2", "c": "option3"}
match = process.extractOne("query", choices)
# Returns: (match_string, score, key)

# Pandas Series (if pandas available)
import pandas as pd
choices = pd.Series(["option1", "option2", "option3"])
match = process.extractOne("query", choices)
# Returns: (match_string, score, index), where index is the Series index label

# Handle None values in choices
choices = ["option1", None, "option3"]
matches = process.extract("query", choices)  # None values ignored
```

### Custom Scoring Functions

```python
from rapidfuzz import process, distance, fuzz

choices = ["option1", "option2", "option3"]

# Use distance metrics directly
matches = process.extract("query", choices, scorer=distance.Levenshtein.distance)
# Returns edit distance (lower = more similar)

# Custom scorer function; **kwargs absorbs extra keyword arguments such as score_cutoff
def custom_scorer(s1, s2, **kwargs):
    # Custom scoring logic (illustrative: case-insensitive ratio)
    return fuzz.ratio(s1.lower(), s2.lower())

matches = process.extract("query", choices, scorer=custom_scorer)

# Pass additional arguments to scorer
matches = process.extract("query", choices,
                          scorer=distance.Levenshtein.distance,
                          scorer_kwargs={"weights": (1, 2, 1)})
```