# String Collection Processing


Functions for processing collections of strings and finding best matches using fuzzy string matching. These functions enable searching, ranking, and deduplication operations on lists or dictionaries of strings.


## Default Settings

```python { .api }
default_scorer = fuzz.WRatio  # Default scoring function
default_processor = utils.full_process  # Default string preprocessing function
```


## Capabilities


### Single Best Match Extraction


Find the single best match above a score threshold in a collection of choices.

```python { .api }
def extractOne(query: str, choices, processor=default_processor, scorer=default_scorer, score_cutoff: int = 0):
    """
    Find the single best match above a score in a list of choices.

    Parameters:
        query: String to match against
        choices: List or dict of choices to search through
        processor: Function to preprocess strings before matching
        scorer: Function to score matches (default: fuzz.WRatio)
        score_cutoff: Minimum score threshold (default: 0)

    Returns:
        tuple: (match, score) if found, None if no match above cutoff
        tuple: (match, score, key) if choices is a dictionary
    """
```


**Usage Example:**

```python
from fuzzywuzzy import process

choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
result = process.extractOne("new york jets", choices)
print(result)  # ("New York Jets", 100)

# With score cutoff
result = process.extractOne("new york", choices, score_cutoff=80)
print(result)  # ("New York Jets", 90) or ("New York Giants", 90)

# With dictionary
choices_dict = {"team1": "Atlanta Falcons", "team2": "New York Jets"}
result = process.extractOne("jets", choices_dict)
print(result)  # ("New York Jets", 90, "team2")
```
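Because `extractOne` returns `None` when nothing clears `score_cutoff`, callers should check the result before unpacking it. The following is a minimal standard-library sketch of that contract, not the library's implementation: `simple_score` and `extract_one_sketch` are hypothetical names, and the scorer is `difflib.SequenceMatcher` rather than `fuzz.WRatio`, so the scores differ from the real ones.

```python
from difflib import SequenceMatcher


def simple_score(a, b):
    """Stand-in scorer (an assumption, not WRatio): 0-100 similarity, case-insensitive."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def extract_one_sketch(query, choices, score_cutoff=0):
    """Hypothetical stand-in for extractOne over a list of choices."""
    best = None
    for choice in choices:
        score = simple_score(query, choice)
        # Keep the first choice that reaches the highest score at or above the cutoff.
        if score >= score_cutoff and (best is None or score > best[1]):
            best = (choice, score)
    return best


choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]

hit = extract_one_sketch("new york jets", choices, score_cutoff=80)
miss = extract_one_sketch("zzzz", choices, score_cutoff=80)

# Guard against None before unpacking the (match, score) tuple.
for result in (hit, miss):
    if result is None:
        print("no match above cutoff")
    else:
        match, score = result
        print(match, score)
```

The same `if result is None` guard applies when calling `process.extractOne` with a non-zero `score_cutoff`.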


### Multiple Match Extraction


Extract multiple best matches from a collection with optional limits.

```python { .api }
def extract(query: str, choices, processor=default_processor, scorer=default_scorer, limit: int = 5):
    """
    Select the best matches in a list or dictionary of choices.

    Parameters:
        query: String to match against
        choices: List or dict of choices to search through
        processor: Function to preprocess strings before matching
        scorer: Function to score matches (default: fuzz.WRatio)
        limit: Maximum number of results to return (default: 5)

    Returns:
        list: List of (match, score) tuples sorted by score descending
        list: List of (match, score, key) tuples if choices is a dictionary
    """
```


**Usage Example:**

```python
from fuzzywuzzy import process

choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
results = process.extract("new york", choices, limit=2)
print(results)  # [("New York Jets", 90), ("New York Giants", 90)]

# Get all matches
all_results = process.extract("new", choices, limit=None)
print(all_results)  # All matches sorted by score
```


### Threshold-Based Multiple Extraction


Extract multiple matches above a score threshold with optional limits.

```python { .api }
def extractBests(query: str, choices, processor=default_processor, scorer=default_scorer, score_cutoff: int = 0, limit: int = 5):
    """
    Get a list of the best matches above a score threshold.

    Parameters:
        query: String to match against
        choices: List or dict of choices to search through
        processor: Function to preprocess strings before matching
        scorer: Function to score matches (default: fuzz.WRatio)
        score_cutoff: Minimum score threshold (default: 0)
        limit: Maximum number of results to return (default: 5)

    Returns:
        list: List of (match, score) tuples above cutoff, sorted by score
    """
```


**Usage Example:**

```python
from fuzzywuzzy import process

choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
results = process.extractBests("new", choices, score_cutoff=50, limit=3)
print(results)  # Only matches scoring 50 or higher
```


### Unordered Extraction Generator


Generator function that yields matches without sorting, useful for large datasets.

```python { .api }
def extractWithoutOrder(query: str, choices, processor=default_processor, scorer=default_scorer, score_cutoff: int = 0):
    """
    Generator yielding best matches without ordering, for memory efficiency.

    Parameters:
        query: String to match against
        choices: List or dict of choices to search through
        processor: Function to preprocess strings before matching
        scorer: Function to score matches (default: fuzz.WRatio)
        score_cutoff: Minimum score threshold (default: 0)

    Yields:
        tuple: (match, score) for list choices
        tuple: (match, score, key) for dictionary choices
    """
```


**Usage Example:**

```python
from fuzzywuzzy import process

choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
for match in process.extractWithoutOrder("new", choices, score_cutoff=60):
    print(match)  # Yields matches as found, not sorted
```
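One way to take advantage of an unsorted generator is to keep only the top few results while streaming, instead of materializing and sorting every match. The sketch below illustrates that pattern with the standard library only: `score_stream` is a hypothetical stand-in for `extractWithoutOrder`, and it scores with `difflib.SequenceMatcher` rather than `fuzz.WRatio`, so the numbers will not match the library's.

```python
import heapq
from difflib import SequenceMatcher


def score_stream(query, choices, score_cutoff=0):
    """Hypothetical stand-in for extractWithoutOrder: lazily yields (choice, score)."""
    for choice in choices:
        score = round(100 * SequenceMatcher(None, query.lower(), choice.lower()).ratio())
        if score >= score_cutoff:
            yield choice, score


choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]

# nlargest consumes the generator once and holds at most two candidates at a time.
top_two = heapq.nlargest(2, score_stream("new york jets", choices), key=lambda pair: pair[1])
print(top_two)
```

The same `heapq.nlargest` call should work on the real generator, since it only needs an iterable of tuples with the score in position 1.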


### Fuzzy Deduplication


Remove duplicates from a list using fuzzy matching to identify similar items.

```python { .api }
def dedupe(contains_dupes: list, threshold: int = 70, scorer=fuzz.token_set_ratio):
    """
    Remove duplicates from a list using fuzzy matching.

    Uses fuzzy matching to identify duplicates that score above the threshold.
    For each group of duplicates, returns the longest item (most information),
    breaking ties alphabetically.

    Parameters:
        contains_dupes: List of strings that may contain duplicates
        threshold: Score threshold for considering items duplicates (default: 70)
        scorer: Function to score similarity (default: fuzz.token_set_ratio)

    Returns:
        list: Deduplicated list with longest representative from each group
    """
```


**Usage Example:**

```python
from fuzzywuzzy import process

duplicates = [
    'Frodo Baggin',
    'Frodo Baggins',
    'F. Baggins',
    'Samwise G.',
    'Gandalf',
    'Bilbo Baggins'
]

deduped = process.dedupe(duplicates)
print(deduped)  # ['Frodo Baggins', 'Samwise G.', 'Bilbo Baggins', 'Gandalf']

# Lower threshold finds more duplicates
deduped_aggressive = process.dedupe(duplicates, threshold=50)
print(deduped_aggressive)  # Even fewer items returned
```
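The selection rule described above (group items that score above the threshold, keep the longest, break ties alphabetically) can be sketched with the standard library. This is an illustration under stated assumptions, not the library's code: `dedupe_sketch` and `simple_score` are hypothetical names, and the scorer is `difflib.SequenceMatcher` rather than `fuzz.token_set_ratio`, so which items count as duplicates may differ from `process.dedupe`.

```python
from difflib import SequenceMatcher


def simple_score(a, b):
    """Stand-in scorer (an assumption, not token_set_ratio): 0-100, case-insensitive."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def dedupe_sketch(items, threshold=70):
    """Hypothetical sketch: one representative per group of similar strings."""
    representatives = []
    for item in items:
        # An item always belongs to its own group; add others scoring above the threshold.
        group = [item] + [
            other for other in items
            if other != item and simple_score(item, other) > threshold
        ]
        # Longest first; ties broken alphabetically.
        group.sort(key=lambda s: (-len(s), s))
        representatives.append(group[0])
    # Drop repeated representatives while preserving first-seen order.
    return list(dict.fromkeys(representatives))


names = ["Frodo Baggins", "Frodo Baggin", "Gandalf"]
unique = dedupe_sketch(names)
print(unique)  # ['Frodo Baggins', 'Gandalf']
```

Preferring the longest string in each group reflects the stated rationale that the longest item carries the most information.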


## Custom Processors and Scorers


You can provide custom processing and scoring functions:


**Usage Example:**

```python
from fuzzywuzzy import process, fuzz

# Custom processor that only looks at first word
def first_word_processor(s):
    return s.split()[0] if s else ""

# Custom scorer that uses partial ratio
choices = ["John Smith", "Jane Smith", "Bob Johnson"]
result = process.extractOne(
    "John",
    choices,
    processor=first_word_processor,
    scorer=fuzz.partial_ratio
)
print(result)  # ("John Smith", 100)

# No processing
result = process.extractOne(
    "JOHN SMITH",
    choices,
    processor=None,  # No preprocessing
    scorer=fuzz.ratio
)
print(result)  # Lower score due to case mismatch
```
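A processor is just a function from string to string, so it can be written and checked on its own before being passed as `processor=`. The sketch below shows one possible normalizer; `normalize_name` is a hypothetical helper, not part of the library, and the exact cleanup rules are an assumption chosen for illustration.

```python
import re


def normalize_name(s):
    """Hypothetical processor: lowercase, replace punctuation with spaces, collapse whitespace."""
    s = re.sub(r"[^\w\s]", " ", s.lower())
    return " ".join(s.split())


cleaned = normalize_name("  John   SMITH, Jr. ")
print(cleaned)  # john smith jr
```

A function like this could then be supplied as `processor=normalize_name` in any of the extraction calls above, in place of the default processor.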