# String Processing and Extraction

Functions for finding the closest matches to a query string in a list or dictionary of choices.

## Capabilities

### Single Best Match Extraction

Find the single best match from a collection of choices, with optional score thresholding.

```python { .api }
def extractOne(query: str, choices, processor=None, scorer=None, score_cutoff: int = 0):
    """
    Find the single best match above a score threshold.

    Args:
        query: String to match against
        choices: List or dictionary of choices to search through
        processor: Optional function to preprocess strings before matching (default: utils.full_process)
        scorer: Optional scoring function (default: fuzz.WRatio)
        score_cutoff: Minimum score threshold for matches (default: 0)

    Returns:
        Tuple of (match, score) for list choices, or (match, score, key) for
        dictionary choices. Returns None if no match meets score_cutoff.
    """
```
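
The `None` return is the part of this contract that callers most often need to handle. The sketch below illustrates it without the library: `simple_score` is a hypothetical stand-in scorer built on the standard library's `difflib.SequenceMatcher`, and `extract_one_sketch` is an illustrative name, not thefuzz's implementation. It assumes the cutoff is inclusive.

```python
from difflib import SequenceMatcher


def simple_score(a: str, b: str) -> int:
    """Hypothetical stand-in scorer: case-insensitive similarity from 0 to 100."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def extract_one_sketch(query, choices, scorer=simple_score, score_cutoff=0):
    """Return (match, score) for the best choice, or None if none reaches score_cutoff."""
    best = None
    for choice in choices:
        score = scorer(query, choice)
        # Assumption: a score equal to the cutoff still counts as a match.
        if score >= score_cutoff and (best is None or score > best[1]):
            best = (choice, score)
    return best


teams = ["Atlanta Falcons", "New York Jets", "Dallas Cowboys"]

# Identical after lowercasing, so the stand-in scorer gives 100.
print(extract_one_sketch("new york jets", teams))

# No team name contains the letter "z", so nothing reaches the cutoff.
result = extract_one_sketch("zzz", teams, score_cutoff=50)
print("no match" if result is None else result)
```

Checking for `None` before unpacking the result avoids a `TypeError` when the cutoff filters out every choice.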

### Multiple Match Extraction

Extract multiple best matches from a collection, with configurable limits and score cutoffs.

```python { .api }
def extract(query: str, choices, processor=None, scorer=None, limit: int = 5):
    """
    Get a list of the best matches from choices.

    Args:
        query: String to match against
        choices: List or dictionary of choices to search through
        processor: Optional function to preprocess strings before matching (default: utils.full_process)
        scorer: Optional scoring function (default: fuzz.WRatio)
        limit: Maximum number of matches to return (default: 5)

    Returns:
        List of tuples: (match, score) for list choices, or
        (match, score, key) for dictionary choices
    """

def extractBests(query: str, choices, processor=None, scorer=None, score_cutoff: int = 0, limit: int = 5):
    """
    Get best matches with both score cutoff and limit controls.

    Args:
        query: String to match against
        choices: List or dictionary of choices to search through
        processor: Optional function to preprocess strings before matching (default: utils.full_process)
        scorer: Optional scoring function (default: fuzz.WRatio)
        score_cutoff: Minimum score threshold for matches (default: 0)
        limit: Maximum number of matches to return (default: 5)

    Returns:
        List of tuples: (match, score) for list choices, or
        (match, score, key) for dictionary choices
    """

def extractWithoutOrder(query: str, choices, processor=None, scorer=None, score_cutoff: int = 0):
    """
    Extract all matches that meet the threshold, without ordering or limit.

    Args:
        query: String to match against
        choices: List or dictionary of choices to search through
        processor: Optional function to preprocess strings before matching (default: utils.full_process)
        scorer: Optional scoring function (default: fuzz.WRatio)
        score_cutoff: Minimum score threshold for matches (default: 0)

    Returns:
        Generator yielding tuples: (match, score) for list choices, or
        (match, score, key) for dictionary choices
    """
```
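
One way to read the three functions above is that `extractWithoutOrder` is the lazy primitive, while the ordered variants filter, sort, and truncate its output. The sketch below mirrors that relationship with hypothetical names (`without_order_sketch`, `bests_sketch`) and a stand-in `simple_score` based on `difflib`. It illustrates the documented behaviour only; it is not thefuzz's code, and the inclusive cutoff is an assumption.

```python
import heapq
from difflib import SequenceMatcher


def simple_score(a: str, b: str) -> int:
    """Hypothetical stand-in scorer: case-insensitive similarity from 0 to 100."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def without_order_sketch(query, choices, scorer=simple_score, score_cutoff=0):
    """Lazily yield (match, score) in input order for choices that meet the cutoff."""
    for choice in choices:
        score = scorer(query, choice)
        if score >= score_cutoff:  # assumption: the cutoff is inclusive
            yield choice, score


def bests_sketch(query, choices, scorer=simple_score, score_cutoff=0, limit=5):
    """Apply the cutoff, then keep the `limit` highest-scoring pairs, best first."""
    scored = without_order_sketch(query, choices, scorer, score_cutoff)
    return heapq.nlargest(limit, scored, key=lambda pair: pair[1])


teams = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]

# Cutoff and limit together: only the two New York teams score 60 or more here.
top = bests_sketch("new york jets", teams, score_cutoff=60, limit=2)
print(top)

# With a cutoff of 0 only the limit applies, which is what `extract` is documented to do.
print(bests_sketch("new york jets", teams, limit=3))
```

Because the generator is consumed once by `heapq.nlargest`, the full list of scored pairs is never sorted; only the top `limit` entries are kept.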

### Duplicate Removal

Remove fuzzy duplicates from a list of strings using a configurable similarity threshold.

```python { .api }
def dedupe(contains_dupes: list, threshold: int = 70, scorer=None):
    """
    Remove fuzzy duplicates from a list of strings.

    Uses fuzzy matching to identify duplicates above the threshold score,
    then returns the longest string from each duplicate group.

    Args:
        contains_dupes: List of strings that may contain duplicates
        threshold: Similarity threshold for considering strings duplicates (default: 70)
        scorer: Optional scoring function (default: fuzz.token_set_ratio)

    Returns:
        List of deduplicated strings
    """
```
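
The "longest string from each duplicate group" rule can be sketched in a few lines. The code below is an illustration under stated assumptions, not the library's implementation: `dedupe_sketch` is a hypothetical name, `simple_score` is a stand-in scorer built on `difflib` rather than `fuzz.token_set_ratio`, the threshold is treated as inclusive, and ties in length are broken alphabetically.

```python
from difflib import SequenceMatcher


def simple_score(a: str, b: str) -> int:
    """Hypothetical stand-in scorer: case-insensitive similarity from 0 to 100."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def dedupe_sketch(items, threshold=70, scorer=simple_score):
    """For each item, keep the longest string among the items it matches."""
    kept = []
    for item in items:
        # Every item matches itself, so the group is never empty.
        group = [other for other in items if scorer(item, other) >= threshold]
        # Longest string wins; equal lengths fall back to alphabetical order.
        representative = max(group, key=lambda s: (len(s), s))
        if representative not in kept:
            kept.append(representative)
    return kept


names = ["Frodo Baggins", "Frodo Baggin", "Samwise Gamgee", "Gandalf"]
print(dedupe_sketch(names))  # ['Frodo Baggins', 'Samwise Gamgee', 'Gandalf']
```

Each item is compared against every other item, so the cost grows quadratically with the length of the list; this matters when choosing whether to deduplicate large inputs in one pass.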

### Default Configuration

The process module defines the scorer and processor that are used when none is passed explicitly.

```python { .api }
# Module-level constants
default_scorer = fuzz.WRatio
default_processor = utils.full_process
```

## Usage Examples

### Basic Extraction

```python
from thefuzz import process

choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]

# Find single best match
result = process.extractOne("new york jets", choices)
print(result)  # ('New York Jets', 100)

# Find multiple matches
results = process.extract("new york", choices, limit=2)
print(results)  # [('New York Jets', 90), ('New York Giants', 90)]
```

### Working with Dictionaries

```python
from thefuzz import process

# Dictionary choices return key information
team_info = {
    "ATL": "Atlanta Falcons",
    "NYJ": "New York Jets",
    "NYG": "New York Giants",
    "DAL": "Dallas Cowboys"
}

result = process.extractOne("new york jets", team_info)
print(result)  # ('New York Jets', 100, 'NYJ')
```

### Custom Scoring and Processing

```python
from thefuzz import process, fuzz

choices = [" ATLANTA FALCONS ", "new york jets", "New York Giants"]

# Custom processor that normalizes case and surrounding whitespace
def clean_processor(s):
    return s.strip().lower()

# Token-sort scoring compares strings regardless of word order
results = process.extract(
    "new york",
    choices,
    processor=clean_processor,
    scorer=fuzz.token_sort_ratio,
    limit=2
)
```

### Score Thresholding

```python
from thefuzz import process

choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]

# Only return matches that meet a score cutoff of 80 (scores range from 0 to 100)
results = process.extractBests("york", choices, score_cutoff=80, limit=10)
print(results)  # Only high-quality matches
```

### Duplicate Removal

```python
from thefuzz import process

# List with fuzzy duplicates
names = [
    "Frodo Baggins",
    "Frodo Baggin",
    "F. Baggins",
    "Samwise Gamgee",
    "Gandalf",
    "Bilbo Baggins"
]

# Remove duplicates (default threshold: 70)
deduplicated = process.dedupe(names)
# e.g. ['Frodo Baggins', 'Samwise Gamgee', 'Gandalf', 'Bilbo Baggins'];
# the exact entries and their order may differ between library versions
print(deduplicated)

# Use stricter threshold
strict_dedupe = process.dedupe(names, threshold=90)
```

### Generator-Based Processing

```python
from thefuzz import process

# For large datasets, iterate over the generator instead of building a full result list
choices = ["choice1", "choice2"]  # ... plus many more entries in a large list

for match, score in process.extractWithoutOrder("query", choices, score_cutoff=75):
    if score > 90:
        print(f"High confidence match: {match} ({score})")
    else:
        print(f"Moderate match: {match} ({score})")
```