# String Processing and Extraction

Functions for finding the best matches in collections of strings. These functions enable searching through lists or dictionaries of choices to find the closest matches to a query string.

## Capabilities

### Single Best Match Extraction

Find the single best match from a collection of choices, with optional score thresholding.

```python { .api }
def extractOne(query: str, choices, processor=None, scorer=None, score_cutoff: int = 0):
    """
    Find the single best match above a score threshold.

    Args:
        query: String to match against
        choices: List or dictionary of choices to search through
        processor: Optional function to preprocess strings before matching
        scorer: Optional scoring function (default: fuzz.WRatio)
        score_cutoff: Minimum score threshold for matches (default: 0)

    Returns:
        Tuple of (match, score) for list choices, or (match, score, key) for
        dictionary choices. Returns None if no match above score_cutoff.
    """
```
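
The return shapes described in the docstring can be illustrated without the library. The following is a minimal sketch, not thefuzz's implementation: `extract_one_sketch` and `ratio` are hypothetical helpers, `difflib.SequenceMatcher` stands in for the default `fuzz.WRatio` scorer (so scores will generally differ from what thefuzz reports), and the cutoff is treated as inclusive, which is an assumption.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> int:
    """Stand-in scorer (assumption): case-insensitive similarity from 0 to 100."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def extract_one_sketch(query, choices, score_cutoff=0):
    """Hypothetical helper mirroring the documented return shapes of extractOne."""
    is_mapping = hasattr(choices, "items")
    # For dictionaries the values are compared and the key is reported alongside.
    pairs = choices.items() if is_mapping else enumerate(choices)
    best = None
    for key, choice in pairs:
        score = ratio(query, choice)
        # Assumption: a score equal to the cutoff still counts as a match.
        if score >= score_cutoff and (best is None or score > best[1]):
            best = (choice, score, key)
    if best is None:
        return None  # nothing reached the cutoff
    return best if is_mapping else best[:2]


teams = {"NYJ": "New York Jets", "DAL": "Dallas Cowboys"}
print(extract_one_sketch("new york jets", list(teams.values())))  # ('New York Jets', 100)
print(extract_one_sketch("new york jets", teams))  # ('New York Jets', 100, 'NYJ')
print(extract_one_sketch("zzz", teams, score_cutoff=90))  # None
```

Because the result can be `None`, callers that set a cutoff should check for it before unpacking the tuple.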

### Multiple Match Extraction

Extract multiple best matches from a collection, with configurable limits and score cutoffs.

```python { .api }
def extract(query: str, choices, processor=None, scorer=None, limit: int = 5):
    """
    Get a list of the best matches from choices.

    Args:
        query: String to match against
        choices: List or dictionary of choices to search through
        processor: Optional function to preprocess strings before matching
        scorer: Optional scoring function (default: fuzz.WRatio)
        limit: Maximum number of matches to return (default: 5)

    Returns:
        List of tuples: (match, score) for list choices, or
        (match, score, key) for dictionary choices
    """

def extractBests(query: str, choices, processor=None, scorer=None, score_cutoff: int = 0, limit: int = 5):
    """
    Get best matches with both score cutoff and limit controls.

    Args:
        query: String to match against
        choices: List or dictionary of choices to search through
        processor: Optional function to preprocess strings before matching
        scorer: Optional scoring function (default: fuzz.WRatio)
        score_cutoff: Minimum score threshold for matches (default: 0)
        limit: Maximum number of matches to return (default: 5)

    Returns:
        List of tuples: (match, score) for list choices, or
        (match, score, key) for dictionary choices
    """

def extractWithoutOrder(query: str, choices, processor=None, scorer=None, score_cutoff: int = 0):
    """
    Extract all matches above threshold without ordering or limit.

    Args:
        query: String to match against
        choices: List or dictionary of choices to search through
        processor: Optional function to preprocess strings before matching
        scorer: Optional scoring function (default: fuzz.WRatio)
        score_cutoff: Minimum score threshold for matches (default: 0)

    Returns:
        Generator yielding tuples: (match, score) for list choices, or
        (match, score, key) for dictionary choices
    """
```
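
One way to read these three functions is as layers over a single scoring pass: a lazy, unordered stream of scored candidates, and a sorted, truncated view of that stream. The sketch below illustrates that reading only; it is not the library's code. `without_order`, `bests`, and `ratio` are hypothetical names, `difflib.SequenceMatcher` is a stand-in scorer, and the inclusive cutoff is an assumption.

```python
import heapq
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> int:
    """Stand-in scorer (assumption): case-insensitive similarity from 0 to 100."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def without_order(query, choices, score_cutoff=0):
    """Lazily yield (choice, score) pairs at or above the cutoff, in input order."""
    for choice in choices:
        score = ratio(query, choice)
        if score >= score_cutoff:
            yield choice, score


def bests(query, choices, score_cutoff=0, limit=5):
    """Sort the lazy stream by descending score and keep at most `limit` results."""
    scored = without_order(query, choices, score_cutoff)
    if limit is None:
        return sorted(scored, key=lambda pair: pair[1], reverse=True)
    return heapq.nlargest(limit, scored, key=lambda pair: pair[1])


choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
top_two = bests("new york", choices, limit=2)
print([name for name, _ in top_two])  # ['New York Jets', 'New York Giants']
```

In this reading, the unordered variant is the one to reach for when results are consumed once and sorting would be wasted work.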

### Duplicate Removal

Remove fuzzy duplicates from a list of strings using configurable similarity thresholds.

```python { .api }
def dedupe(contains_dupes: list, threshold: int = 70, scorer=None):
    """
    Remove fuzzy duplicates from a list of strings.

    Uses fuzzy matching to identify duplicates above the threshold score,
    then returns the longest string from each duplicate group.

    Args:
        contains_dupes: List of strings that may contain duplicates
        threshold: Similarity threshold for considering strings duplicates (default: 70)
        scorer: Optional scoring function (default: fuzz.token_set_ratio)

    Returns:
        List of deduplicated strings
    """
```
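
The "longest string from each duplicate group" rule in the docstring can be sketched in plain Python. This is an illustration under stated assumptions rather than the library's implementation: `dedupe_sketch` and `ratio` are hypothetical helpers, `difflib.SequenceMatcher` replaces the default `fuzz.token_set_ratio` scorer, and the threshold is treated as inclusive.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> int:
    """Stand-in scorer (assumption): case-insensitive similarity from 0 to 100."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def dedupe_sketch(items, threshold=70):
    """For each item, collect its near-duplicates and keep the longest one."""
    kept = []
    for item in items:
        # Every item matches itself, so the group is never empty.
        group = [other for other in items if ratio(item, other) >= threshold]
        # Longest string wins; equal lengths fall back to string comparison.
        representative = max(group, key=lambda s: (len(s), s))
        if representative not in kept:
            kept.append(representative)
    return kept


names = ["Frodo Baggins", "Frodo Baggin", "Samwise Gamgee", "Gandalf"]
print(dedupe_sketch(names))  # ['Frodo Baggins', 'Samwise Gamgee', 'Gandalf']
```

Note that each item is compared against every other item, so the cost of this approach grows quadratically with the length of the list.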

### Default Configuration

The `process` module exposes its default scorer and default processor as module-level constants.

```python { .api }
# Module-level constants
default_scorer = fuzz.WRatio
default_processor = utils.full_process
```
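
To give a feel for what a default processor of this kind does before scoring, here is a rough approximation. It assumes that `utils.full_process` replaces non-alphanumeric characters with spaces, lowercases, and trims whitespace; `full_process_sketch` is a hypothetical name, and edge cases such as underscores or non-ASCII input may be handled differently by the real function.

```python
import re


def full_process_sketch(s: str) -> str:
    """Approximation (assumption): non-word characters become spaces, then lowercase and trim."""
    return re.sub(r"\W", " ", s).lower().strip()


print(full_process_sketch("  ATLANTA Falcons!! "))  # atlanta falcons
print(full_process_sketch("New-York Jets"))  # new york jets
```

Normalisation of this sort is why a lowercase query such as `"new york jets"` can score 100 against `"New York Jets"`.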

## Usage Examples

### Basic Extraction

```python
from thefuzz import process

choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]

# Find single best match
result = process.extractOne("new york jets", choices)
print(result)  # ('New York Jets', 100)

# Find multiple matches
results = process.extract("new york", choices, limit=2)
print(results)  # [('New York Jets', 90), ('New York Giants', 90)]
```

### Working with Dictionaries

```python
from thefuzz import process

# Dictionary choices return key information
team_info = {
    "ATL": "Atlanta Falcons",
    "NYJ": "New York Jets",
    "NYG": "New York Giants",
    "DAL": "Dallas Cowboys"
}

result = process.extractOne("new york jets", team_info)
print(result)  # ('New York Jets', 100, 'NYJ')
```

### Custom Scoring and Processing

```python
from thefuzz import process, fuzz

choices = [" ATLANTA FALCONS ", "new york jets", "New York Giants"]

# Custom processor to handle case and whitespace
def clean_processor(s):
    return s.strip().lower()

# Use token-based scoring for better word order handling
results = process.extract(
    "new york",
    choices,
    processor=clean_processor,
    scorer=fuzz.token_sort_ratio,
    limit=2
)
```

### Score Thresholding

```python
from thefuzz import process

choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]

# Keep only matches that reach the score cutoff of 80 (scores range from 0 to 100)
results = process.extractBests("york", choices, score_cutoff=80, limit=10)
print(results)  # Only high-quality matches
```

### Duplicate Removal

```python
from thefuzz import process

# List with fuzzy duplicates
names = [
    "Frodo Baggins",
    "Frodo Baggin",
    "F. Baggins",
    "Samwise Gamgee",
    "Gandalf",
    "Bilbo Baggins"
]

# Remove duplicates (default threshold: 70)
deduplicated = process.dedupe(names)
# Result order may vary between versions
print(deduplicated)  # ['Frodo Baggins', 'Samwise Gamgee', 'Gandalf', 'Bilbo Baggins']

# Use stricter threshold
strict_dedupe = process.dedupe(names, threshold=90)
```

### Generator-Based Processing

```python
from thefuzz import process

# For large datasets, iterate over the generator instead of building a full result list
choices = ["choice1", "choice2"]  # ... plus many more entries

for match, score in process.extractWithoutOrder("query", choices, score_cutoff=75):
    if score > 90:
        print(f"High confidence match: {match} ({score})")
    else:
        print(f"Moderate match: {match} ({score})")
```