# String Collection Processing

Functions for processing collections of strings and finding best matches using fuzzy string matching. These functions enable searching, ranking, and deduplication operations on lists or dictionaries of strings.

## Default Settings

```python { .api }
default_scorer = fuzz.WRatio  # Default scoring function
default_processor = utils.full_process  # Default string preprocessing function
```
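
To make the preprocessing step more concrete, the sketch below approximates what a default processor of this kind is generally understood to do: replace non-alphanumeric characters with spaces, lowercase the text, and trim it. The name `approx_full_process` is hypothetical, and the function is a standard-library approximation for illustration, not the library's own code.

```python
import re

def approx_full_process(s: str) -> str:
    """Hypothetical approximation of the default preprocessing step."""
    # Replace every non-word character (punctuation, symbols, spaces) with a space.
    cleaned = re.sub(r"\W", " ", s)
    # Lowercase and trim surrounding whitespace.
    return cleaned.lower().strip()

print(approx_full_process("  New York Jets!! "))  # new york jets
```

Normalizing both the query and the choices this way is why a query such as `"new york jets"` can score 100 against `"New York Jets"`.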

## Capabilities

### Single Best Match Extraction

Find the single best match at or above a score threshold in a collection of choices.

```python { .api }
def extractOne(query: str, choices, processor=default_processor, scorer=default_scorer, score_cutoff: int = 0):
    """
    Find the single best match at or above a score cutoff in a list or dictionary of choices.

    Parameters:
        query: String to match against
        choices: List or dict of choices to search through
        processor: Function to preprocess strings before matching
        scorer: Function to score matches (default: fuzz.WRatio)
        score_cutoff: Minimum score threshold (default: 0)

    Returns:
        tuple: (match, score) if found, None if no match reaches the cutoff
        tuple: (match, score, key) if choices is a dictionary
    """
```

**Usage Example:**
```python
from fuzzywuzzy import process

choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
result = process.extractOne("new york jets", choices)
print(result)  # ("New York Jets", 100)

# With score cutoff
result = process.extractOne("new york", choices, score_cutoff=80)
print(result)  # ("New York Jets", 90) or ("New York Giants", 90)

# With dictionary
choices_dict = {"team1": "Atlanta Falcons", "team2": "New York Jets"}
result = process.extractOne("jets", choices_dict)
print(result)  # ("New York Jets", 90, "team2")
```
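
Because the documented return value can be `None`, a 2-tuple, or a 3-tuple, calling code usually has to handle all three shapes. The helper below is one possible way to normalize them; `unpack_match` is a hypothetical name, not part of the library, and the inputs are literal tuples in the shapes documented above.

```python
from typing import Any, Optional, Tuple

def unpack_match(result: Optional[tuple]) -> Tuple[Optional[str], int, Any]:
    """Normalize an extractOne-style result to (match, score, key)."""
    if result is None:
        # Nothing reached score_cutoff.
        return None, 0, None
    if len(result) == 3:
        # Dictionary choices: (match, score, key).
        match, score, key = result
        return match, score, key
    # List choices: (match, score).
    match, score = result
    return match, score, None

print(unpack_match(("New York Jets", 100)))          # ('New York Jets', 100, None)
print(unpack_match(("New York Jets", 90, "team2")))  # ('New York Jets', 90, 'team2')
print(unpack_match(None))                            # (None, 0, None)
```

In real use the argument would be the value returned by `process.extractOne(...)`.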

### Multiple Match Extraction

Extract the best matches from a collection, optionally limiting how many are returned.

```python { .api }
def extract(query: str, choices, processor=default_processor, scorer=default_scorer, limit: int = 5):
    """
    Select the best matches in a list or dictionary of choices.

    Parameters:
        query: String to match against
        choices: List or dict of choices to search through
        processor: Function to preprocess strings before matching
        scorer: Function to score matches (default: fuzz.WRatio)
        limit: Maximum number of results to return (default: 5); None returns all matches

    Returns:
        list: List of (match, score) tuples sorted by score descending
        list: List of (match, score, key) tuples if choices is a dictionary
    """
```

**Usage Example:**
```python
from fuzzywuzzy import process

choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
results = process.extract("new york", choices, limit=2)
print(results)  # [("New York Jets", 90), ("New York Giants", 90)]

# Get all matches
all_results = process.extract("new", choices, limit=None)
print(all_results)  # All matches sorted by score
```
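
The ranking behavior described above (score every choice, sort descending, keep at most `limit`) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's implementation: `rank_choices` and `stand_in_scorer` are hypothetical names, and the scorer is a `difflib`-based stand-in that will not reproduce `fuzz.WRatio` scores.

```python
import heapq
from difflib import SequenceMatcher

def stand_in_scorer(a: str, b: str) -> int:
    """Hypothetical scorer: difflib similarity scaled to 0-100 (not fuzz.WRatio)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())

def rank_choices(query, choices, scorer=stand_in_scorer, limit=5):
    """Score every choice, then keep the `limit` highest (all if limit is None)."""
    if hasattr(choices, "items"):
        # Dictionary-like input: score the values and remember the keys.
        scored = [(value, scorer(query, value), key) for key, value in choices.items()]
    else:
        scored = [(choice, scorer(query, choice)) for choice in choices]
    if limit is None:
        return sorted(scored, key=lambda item: item[1], reverse=True)
    return heapq.nlargest(limit, scored, key=lambda item: item[1])

teams = {"team1": "Atlanta Falcons", "team2": "New York Jets"}
print(rank_choices("new york jets", teams, limit=1))  # [('New York Jets', 100, 'team2')]
```

Using `heapq.nlargest` avoids sorting the full list when only a few results are wanted; a plain `sorted(...)[:limit]` would give the same result.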

### Threshold-Based Multiple Extraction

Extract the matches scoring at or above a threshold, optionally limiting how many are returned.

```python { .api }
def extractBests(query: str, choices, processor=default_processor, scorer=default_scorer, score_cutoff: int = 0, limit: int = 5):
    """
    Get a list of the best matches at or above a score threshold.

    Parameters:
        query: String to match against
        choices: List or dict of choices to search through
        processor: Function to preprocess strings before matching
        scorer: Function to score matches (default: fuzz.WRatio)
        score_cutoff: Minimum score threshold (default: 0)
        limit: Maximum number of results to return (default: 5)

    Returns:
        list: List of (match, score) tuples at or above the cutoff, sorted by score descending
        list: List of (match, score, key) tuples if choices is a dictionary
    """
```

**Usage Example:**
```python
from fuzzywuzzy import process

choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
results = process.extractBests("new", choices, score_cutoff=50, limit=3)
print(results)  # Only matches scoring 50 or higher
```
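
To show how `score_cutoff` and `limit` interact, the sketch below applies the cutoff first and the limit second to a list of already-scored tuples. The helper `keep_best` is hypothetical, and the scores are made up purely to illustrate the filtering; they are not real scorer output.

```python
def keep_best(scored, score_cutoff=0, limit=5):
    """Filter (match, score, ...) tuples by cutoff, then keep the top `limit`."""
    # The cutoff is inclusive in this sketch: a score equal to it is kept.
    kept = [item for item in scored if item[1] >= score_cutoff]
    # Stable sort: equal scores keep their original relative order.
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept if limit is None else kept[:limit]

# Hypothetical scores, chosen only to illustrate the filtering.
scored = [("Atlanta Falcons", 30), ("New York Jets", 90),
          ("New York Giants", 90), ("Dallas Cowboys", 45)]
print(keep_best(scored, score_cutoff=50, limit=3))
# [('New York Jets', 90), ('New York Giants', 90)]
```

Note that the result can be shorter than `limit` when few choices reach the cutoff, as it is here.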

### Unordered Extraction Generator

Generator function that yields matches without sorting, useful for large datasets.

```python { .api }
def extractWithoutOrder(query: str, choices, processor=default_processor, scorer=default_scorer, score_cutoff: int = 0):
    """
    Generator yielding best matches without ordering, for memory efficiency.

    Parameters:
        query: String to match against
        choices: List or dict of choices to search through
        processor: Function to preprocess strings before matching
        scorer: Function to score matches (default: fuzz.WRatio)
        score_cutoff: Minimum score threshold (default: 0)

    Yields:
        tuple: (match, score) for list choices
        tuple: (match, score, key) for dictionary choices
    """
```

**Usage Example:**
```python
from fuzzywuzzy import process

choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
for match in process.extractWithoutOrder("new", choices, score_cutoff=60):
    print(match)  # Yields matches as found, not sorted
```
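
One practical benefit of a generator is that the caller can stop early, so later choices are never scored. The sketch below demonstrates this with a stand-in generator written for illustration; `score_stream`, `first_at_least`, and `counting_scorer` are hypothetical names, and the scorer is a deliberately trivial exact-match check rather than a `fuzz` function.

```python
def score_stream(query, choices, scorer, score_cutoff=0):
    """Yield (choice, score) lazily, in input order, for scores at or above the cutoff."""
    for choice in choices:
        score = scorer(query, choice)
        if score >= score_cutoff:
            yield (choice, score)

def first_at_least(stream, wanted):
    """Return the first item whose score reaches `wanted`, or None."""
    return next((item for item in stream if item[1] >= wanted), None)

calls = []

def counting_scorer(query, choice):
    """Hypothetical scorer: 100 for a case-insensitive exact match, else 0."""
    calls.append(choice)  # record which choices were actually scored
    return 100 if query.lower() == choice.lower() else 0

choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
hit = first_at_least(score_stream("new york jets", choices, counting_scorer), 100)
print(hit)         # ('New York Jets', 100)
print(len(calls))  # 2 (the last two choices were never scored)
```

The same pattern should apply to any generator of `(match, score)` tuples, which is the shape documented for `extractWithoutOrder`.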

### Fuzzy Deduplication

Remove duplicates from a list using fuzzy matching to identify similar items.

```python { .api }
def dedupe(contains_dupes: list, threshold: int = 70, scorer=fuzz.token_set_ratio):
    """
    Remove duplicates from a list using fuzzy matching.

    Uses fuzzy matching to identify duplicates that score above the threshold.
    For each group of duplicates, returns the longest item (most information),
    breaking ties alphabetically.

    Parameters:
        contains_dupes: List of strings that may contain duplicates
        threshold: Score threshold for considering items duplicates (default: 70)
        scorer: Function to score similarity (default: fuzz.token_set_ratio)

    Returns:
        list: Deduplicated list with longest representative from each group
    """
```

**Usage Example:**
```python
from fuzzywuzzy import process

duplicates = [
    'Frodo Baggin',
    'Frodo Baggins',
    'F. Baggins',
    'Samwise G.',
    'Gandalf',
    'Bilbo Baggins'
]

deduped = process.dedupe(duplicates)
print(deduped)  # ['Frodo Baggins', 'Samwise G.', 'Bilbo Baggins', 'Gandalf']

# Lower threshold finds more duplicates
deduped_aggressive = process.dedupe(duplicates, threshold=50)
print(deduped_aggressive)  # Even fewer items returned
```
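
The grouping rule described in the docstring (collect items scoring above the threshold, keep the longest, break ties alphabetically) can be sketched as follows. This is a minimal illustration, not the library's code: `dedupe_sketch` and `stand_in_scorer` are hypothetical names, and the `difflib`-based scorer stands in for `fuzz.token_set_ratio`, so results on other inputs may differ from the real function.

```python
from difflib import SequenceMatcher

def stand_in_scorer(a: str, b: str) -> int:
    """Hypothetical scorer: difflib similarity scaled to 0-100 (not token_set_ratio)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())

def dedupe_sketch(items, threshold=70, scorer=stand_in_scorer):
    """Keep the longest member of each group of similar items."""
    representatives = []
    for item in items:
        # Everything scoring strictly above the threshold counts as a duplicate of item.
        group = [other for other in items if scorer(item, other) > threshold]
        if not group:
            group = [item]
        # Longest first; alphabetical order breaks ties between equal lengths.
        best = sorted(group, key=lambda s: (-len(s), s))[0]
        representatives.append(best)
    # Drop repeated representatives while preserving first-seen order.
    return list(dict.fromkeys(representatives))

names = ["Frodo Baggin", "Frodo Baggins", "Gandalf"]
print(dedupe_sketch(names))  # ['Frodo Baggins', 'Gandalf']
```

Every item is compared against every other item, so the cost of this approach grows quadratically with the length of the list.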

## Custom Processors and Scorers

You can pass custom processing and scoring functions through the `processor` and `scorer` parameters:

**Usage Example:**
```python
from fuzzywuzzy import process, fuzz

# Custom processor that only looks at first word
def first_word_processor(s):
    return s.split()[0] if s else ""

# Custom scorer that uses partial ratio
choices = ["John Smith", "Jane Smith", "Bob Johnson"]
result = process.extractOne(
    "John",
    choices,
    processor=first_word_processor,
    scorer=fuzz.partial_ratio
)
print(result)  # ("John Smith", 100)

# No processing
result = process.extractOne(
    "JOHN SMITH",
    choices,
    processor=None,  # No preprocessing
    scorer=fuzz.ratio
)
print(result)  # Lower score due to case mismatch
```
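
The example above reuses a built-in scorer; a fully custom one only needs to follow the same calling convention. The sketch below assumes that convention is "a processor takes one string and returns a string, and a scorer takes two strings and returns a number from 0 to 100". Both `strip_punctuation` and `prefix_scorer` are hypothetical functions written for illustration.

```python
def strip_punctuation(s: str) -> str:
    """Processor: takes one string, returns the string to be compared."""
    kept = "".join(ch for ch in s if ch.isalnum() or ch.isspace())
    return kept.lower().strip()

def prefix_scorer(s1: str, s2: str) -> int:
    """Scorer: takes two strings, returns an integer from 0 to 100."""
    if not s1 or not s2:
        return 0
    # Count how many leading characters the two strings share.
    shared = 0
    for a, b in zip(s1, s2):
        if a != b:
            break
        shared += 1
    return round(100 * shared / max(len(s1), len(s2)))

print(strip_punctuation("J. Smith!"))       # j smith
print(prefix_scorer("john", "john smith"))  # 40
print(prefix_scorer("john", "john"))        # 100
```

Assuming fuzzywuzzy is installed, these could be passed as `process.extractOne(query, choices, processor=strip_punctuation, scorer=prefix_scorer)`, in which case the scores would follow the prefix rule above rather than any `fuzz` function.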