Tessl Tile for pypi/thefuzz@0.22.0

# String Processing and Extraction

Functions for finding the best matches in collections of strings. These functions enable searching through lists or dictionaries of choices to find the closest matches to a query string.

## Capabilities

### Single Best Match Extraction

Find the single best match from a collection of choices, with optional score thresholding.

```python { .api }
def extractOne(query: str, choices, processor=None, scorer=None, score_cutoff: int = 0):
    """
    Find the single best match above a score threshold.

    Args:
        query: String to match against
        choices: List or dictionary of choices to search through
        processor: Optional function to preprocess strings before matching
        scorer: Optional scoring function (default: fuzz.WRatio)
        score_cutoff: Minimum score threshold for matches (default: 0)

    Returns:
        Tuple of (match, score) for list choices, or (match, score, key) for
        dictionary choices. Returns None if no match above score_cutoff.
    """
```
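
The return shapes described above can be illustrated with a small standard-library sketch. It is not thefuzz's implementation: `simple_ratio` and `extract_one_sketch` are hypothetical names, and `difflib.SequenceMatcher` stands in for `fuzz.WRatio`, so real scores will differ.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Stand-in scorer (hypothetical): 0-100 similarity from difflib, not fuzz.WRatio."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def extract_one_sketch(query, choices, scorer=simple_ratio, score_cutoff=0):
    """Return the best-scoring choice, or None if nothing reaches score_cutoff."""
    # Dictionaries are matched on their values; the key is carried along.
    if isinstance(choices, dict):
        candidates = [(value, scorer(query, value), key) for key, value in choices.items()]
    else:
        candidates = [(choice, scorer(query, choice)) for choice in choices]

    # Drop candidates below the cutoff, then pick the highest score.
    candidates = [c for c in candidates if c[1] >= score_cutoff]
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[1])


teams = ["Atlanta Falcons", "New York Jets", "Dallas Cowboys"]
best = extract_one_sketch("new york jets", teams)
by_key = extract_one_sketch("new york jets", {"NYJ": "New York Jets", "DAL": "Dallas Cowboys"})
nothing = extract_one_sketch("zzzz", teams, score_cutoff=90)
```

Here `best` is a two-element tuple, `by_key` gains the dictionary key as a third element, and `nothing` is `None` because no candidate reaches the cutoff.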

### Multiple Match Extraction

Extract multiple best matches from a collection, with configurable limits and score cutoffs.

```python { .api }
def extract(query: str, choices, processor=None, scorer=None, limit: int = 5):
    """
    Get a list of the best matches from choices.

    Args:
        query: String to match against
        choices: List or dictionary of choices to search through
        processor: Optional function to preprocess strings before matching
        scorer: Optional scoring function (default: fuzz.WRatio)
        limit: Maximum number of matches to return (default: 5)

    Returns:
        List of tuples: (match, score) for list choices, or
        (match, score, key) for dictionary choices
    """

def extractBests(query: str, choices, processor=None, scorer=None, score_cutoff: int = 0, limit: int = 5):
    """
    Get best matches with both score cutoff and limit controls.

    Args:
        query: String to match against
        choices: List or dictionary of choices to search through
        processor: Optional function to preprocess strings before matching
        scorer: Optional scoring function (default: fuzz.WRatio)
        score_cutoff: Minimum score threshold for matches (default: 0)
        limit: Maximum number of matches to return (default: 5)

    Returns:
        List of tuples: (match, score) for list choices, or
        (match, score, key) for dictionary choices
    """

def extractWithoutOrder(query: str, choices, processor=None, scorer=None, score_cutoff: int = 0):
    """
    Extract all matches above threshold without ordering or limit.

    Args:
        query: String to match against
        choices: List or dictionary of choices to search through
        processor: Optional function to preprocess strings before matching
        scorer: Optional scoring function (default: fuzz.WRatio)
        score_cutoff: Minimum score threshold for matches (default: 0)

    Returns:
        Generator yielding tuples: (match, score) for list choices, or
        (match, score, key) for dictionary choices
    """
```
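
One way to picture how these three functions relate is as a pipeline: score lazily, filter by cutoff, then sort and truncate. The sketch below follows that reading of the docstrings using only the standard library. The `*_sketch` names and `simple_ratio` are hypothetical, `difflib` replaces `fuzz.WRatio`, and only list choices are handled.

```python
import heapq
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Stand-in scorer (hypothetical): 0-100 similarity from difflib, not fuzz.WRatio."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def without_order_sketch(query, choices, scorer=simple_ratio, score_cutoff=0):
    """Lazily yield (choice, score) pairs that reach score_cutoff, in input order."""
    for choice in choices:
        score = scorer(query, choice)
        if score >= score_cutoff:
            yield choice, score


def bests_sketch(query, choices, scorer=simple_ratio, score_cutoff=0, limit=5):
    """Apply the cutoff, then keep the `limit` highest-scoring pairs, best first."""
    scored = without_order_sketch(query, choices, scorer, score_cutoff)
    if limit is None:
        return sorted(scored, key=lambda pair: pair[1], reverse=True)
    return heapq.nlargest(limit, scored, key=lambda pair: pair[1])


def extract_sketch(query, choices, scorer=simple_ratio, limit=5):
    """Same as bests_sketch, but with no score cutoff."""
    return bests_sketch(query, choices, scorer, score_cutoff=0, limit=limit)


teams = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
top_two = extract_sketch("new york jets", teams, limit=2)
strict = bests_sketch("new york jets", teams, score_cutoff=95)
```

With this stand-in scorer, `top_two` lists the Jets first and the Giants second, while `strict` keeps only the exact match because the cutoff removes everything else.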

### Duplicate Removal

Remove fuzzy duplicates from a list of strings using configurable similarity thresholds.

```python { .api }
def dedupe(contains_dupes: list, threshold: int = 70, scorer=None):
    """
    Remove fuzzy duplicates from a list of strings.

    Uses fuzzy matching to identify duplicates above the threshold score,
    then returns the longest string from each duplicate group.

    Args:
        contains_dupes: List of strings that may contain duplicates
        threshold: Similarity threshold for considering strings duplicates (default: 70)
        scorer: Optional scoring function (default: fuzz.token_set_ratio)

    Returns:
        List of deduplicated strings
    """
```
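
The "longest string from each duplicate group" rule can be sketched as follows. This is an illustration of the idea in the docstring, not the library's code: `dedupe_sketch` and `simple_ratio` are hypothetical, `difflib` replaces `fuzz.token_set_ratio`, and keeping first-seen order is a choice made for this sketch only.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Stand-in scorer (hypothetical): 0-100 similarity from difflib, not token_set_ratio."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def dedupe_sketch(items, threshold=70, scorer=simple_ratio):
    """Map every item to the longest member of its group of near-duplicates."""
    kept = []
    for item in items:
        # Every item matches itself, so the group is never empty.
        group = [other for other in items if scorer(item, other) >= threshold]
        # Longest string wins; ties are broken alphabetically.
        representative = max(group, key=lambda s: (len(s), s))
        if representative not in kept:
            kept.append(representative)
    return kept


names = ["Frodo Baggins", "Frodo Baggin", "Samwise Gamgee", "Gandalf"]
unique = dedupe_sketch(names)
```

Both "Frodo" spellings fall into one group, and the longer spelling is the one kept. Each item is compared against every other item, so the cost of this approach grows quadratically with the list length.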

### Default Configuration

The `process` module defines the scorer and processor that the extraction functions fall back to when none is supplied.

```python { .api }
# Module-level constants
default_scorer = fuzz.WRatio
default_processor = utils.full_process
```
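
To give a feel for what a processor does before scoring, here is a rough standard-library approximation. It assumes a default-style processor lowercases text, replaces non-alphanumeric characters with spaces, and trims the ends. `normalize_sketch` is a hypothetical name, it handles ASCII only, and it is not `utils.full_process` itself.

```python
import re


def normalize_sketch(text: str) -> str:
    """Rough approximation of a default-style processor (not utils.full_process)."""
    # Replace each run of non-alphanumeric ASCII characters with a single space.
    text = re.sub(r"[^0-9A-Za-z]+", " ", text)
    # Lowercase and trim so case and padding do not affect scores.
    return text.lower().strip()


cleaned = normalize_sketch("  New-York   JETS!! ")
```

After normalization, inputs that differ only in case, punctuation, or padding compare as equal, which is why a processor is usually applied before the scorer runs.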

## Usage Examples

### Basic Extraction

```python
from thefuzz import process

choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]

# Find single best match
result = process.extractOne("new york jets", choices)
print(result)  # ('New York Jets', 100)

# Find multiple matches
results = process.extract("new york", choices, limit=2)
print(results)  # [('New York Jets', 90), ('New York Giants', 90)]
```

### Working with Dictionaries

```python
from thefuzz import process

# Dictionary choices return key information
team_info = {
    "ATL": "Atlanta Falcons",
    "NYJ": "New York Jets",
    "NYG": "New York Giants",
    "DAL": "Dallas Cowboys"
}

result = process.extractOne("new york jets", team_info)
print(result)  # ('New York Jets', 100, 'NYJ')
```

### Custom Scoring and Processing

```python
from thefuzz import process, fuzz

choices = ["  ATLANTA FALCONS  ", "new york jets", "New York Giants"]

# Custom processor to handle case and whitespace
def clean_processor(s):
    return s.strip().lower()

# Use token-based scoring for better word order handling
results = process.extract(
    "new york",
    choices,
    processor=clean_processor,
    scorer=fuzz.token_sort_ratio,
    limit=2
)
```
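
The word-order point in the comment above can be shown with a small sketch of the token-sort idea: split each string into tokens, sort them, rejoin, and only then compare. `token_sort_sketch` and `plain_ratio` are hypothetical helpers built on `difflib`, not `fuzz.token_sort_ratio`, so their scores are only indicative.

```python
from difflib import SequenceMatcher


def plain_ratio(a: str, b: str) -> int:
    """Order-sensitive stand-in scorer (hypothetical), scaled to 0-100."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def token_sort_sketch(a: str, b: str) -> int:
    """Compare two strings after sorting their lowercase tokens alphabetically."""
    sorted_a = " ".join(sorted(a.lower().split()))
    sorted_b = " ".join(sorted(b.lower().split()))
    return plain_ratio(sorted_a, sorted_b)


reordered = token_sort_sketch("Jets New York", "new york jets")
unsorted = plain_ratio("Jets New York", "new york jets")
```

Sorting the tokens makes the two inputs identical, so `reordered` is 100, while `unsorted` is lower because the direct comparison is penalized by the different word order.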

### Score Thresholding

```python
from thefuzz import process

choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]

# Only return matches that score at least 80 out of 100
results = process.extractBests("york", choices, score_cutoff=80, limit=10)
print(results)  # Contains only matches scoring 80 or higher
```

### Duplicate Removal

```python
from thefuzz import process

# List with fuzzy duplicates
names = [
    "Frodo Baggins",
    "Frodo Baggin",
    "F. Baggins",
    "Samwise Gamgee",
    "Gandalf",
    "Bilbo Baggins"
]

# Remove duplicates (default threshold: 70)
deduplicated = process.dedupe(names)
print(deduplicated)  # e.g. ['Frodo Baggins', 'Samwise Gamgee', 'Gandalf', 'Bilbo Baggins'] (order may vary)

# Use stricter threshold
strict_dedupe = process.dedupe(names, threshold=90)
```

### Generator-Based Processing

```python
from thefuzz import process

# For large datasets, iterate over the generator instead of building a full result list
choices = ["choice1", "choice2"]  # ... plus many more entries in a large list

for match, score in process.extractWithoutOrder("query", choices, score_cutoff=75):
    if score > 90:
        print(f"High confidence match: {match} ({score})")
    else:
        print(f"Moderate match: {match} ({score})")
```
