# String Utilities

Utility functions for string preprocessing and normalization. These functions prepare strings for fuzzy matching by cleaning and standardizing their format.

## Capabilities

### Full String Processing

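`full_process` normalizes a string before fuzzy matching: it replaces non-alphanumeric characters with spaces, trims the ends, and lowercases the result, so that differences in punctuation and case do not lower the match score.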

```python { .api }
def full_process(s: str, force_ascii: bool = False) -> str:
    """
    Process string for fuzzy matching by normalizing format.

    Processing steps:
    1. Convert to string if not already
    2. Optionally remove characters with code points 128-255 (force_ascii)
    3. Replace each non-alphanumeric character with a space
    4. Trim leading/trailing whitespace
    5. Convert to lowercase

    Runs of internal whitespace are kept as-is; they are not collapsed
    into a single space.

    Args:
        s: String to process
        force_ascii: If True, remove characters with code points 128-255
            (accented characters are dropped, not transliterated)

    Returns:
        str: Processed and normalized string
    """
```
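
The processing steps can be approximated with the standard library alone. The following is a sketch, not thefuzz's implementation: `sketch_full_process` and `LATIN1_HIGH` are hypothetical names, and using the regex `\W` to decide what counts as non-alphanumeric is an assumption (it treats underscores as word characters). The sketch replaces each non-alphanumeric character with exactly one space and does not collapse runs of spaces afterwards.

```python
import re

# Hypothetical stand-in for a translation table that deletes code points 128-255.
LATIN1_HIGH = {code: None for code in range(128, 256)}


def sketch_full_process(s, force_ascii=False):
    """Approximate the documented steps; illustrative only."""
    s = str(s)                         # coerce to str
    if force_ascii:
        s = s.translate(LATIN1_HIGH)   # drop characters 128-255
    s = re.sub(r"\W", " ", s)          # each non-word character becomes one space
    return s.strip().lower()           # trim the ends, then lowercase


print(sketch_full_process("user@email.com"))                   # user email com
print(sketch_full_process("Café Münchën", force_ascii=True))   # caf mnchn
```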

### ASCII Conversion

`ascii_only` removes characters with code points 128-255 (Latin-1 accented letters and symbols) from a string. The characters are dropped, not transliterated, so "é" disappears instead of becoming "e".

```python { .api }
def ascii_only(s: str) -> str:
    """
    Convert string to ASCII by removing non-ASCII characters.

    Removes characters with code points 128-255, which strips Latin-1
    accented characters and symbols. Characters above code point 255
    (for example emoji) fall outside the range covered by the
    translation table.

    Args:
        s: String to convert

    Returns:
        str: Copy of the string without characters in the 128-255 range
    """
```

### Module Constants

```python { .api }
# Translation table for ASCII conversion (removes chars 128-255)
translation_table: dict
```
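
A table that maps code points 128-255 to `None` gives the removal behavior described in the comment above when passed to `str.translate`. The sketch below is an assumption about how such a table could be built, with hypothetical names (`sketch_table`, `sketch_ascii_only`). It also shows the range boundary: characters above code point 255 pass through unchanged.

```python
# Hypothetical table: map every code point in 128-255 to None (delete it).
sketch_table = {code: None for code in range(128, 256)}


def sketch_ascii_only(s):
    """Drop characters in the 128-255 range; illustrative only."""
    return s.translate(sketch_table)


print(sketch_ascii_only("Caf\u00e9"))               # Caf
# U+0141 and U+017A lie above 255, so only U+00F3 is removed.
print(sketch_ascii_only("\u0141\u00f3d\u017a"))     # "\u0141d\u017a"
```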

## Usage Examples

### Basic String Processing

```python
from thefuzz import utils

# Standard text normalization
text = " Hello, World! "
processed = utils.full_process(text)
print(processed)  # "hello  world" (the comma and the following space each yield a space)

# Handle special characters
text = "New York Mets vs. Atlanta Braves"
processed = utils.full_process(text)
print(processed)  # "new york mets vs  atlanta braves"
```

### ASCII Conversion

```python
from thefuzz import utils

# Strip accented characters (they are removed, not transliterated)
text = "Café Münchën"
ascii_text = utils.ascii_only(text)
print(ascii_text)  # "Caf Mnchn"

# Full processing with ASCII conversion
processed = utils.full_process("Café Münchën", force_ascii=True)
print(processed)  # "caf mnchn"
```

### Integration with Fuzzy Matching

```python
from thefuzz import fuzz, utils

# Manual preprocessing before comparison
s1 = utils.full_process("New York Mets!")
s2 = utils.full_process("new york mets")
score = fuzz.ratio(s1, s2)
print(score)  # 100 (perfect match after processing)

# Compare with and without processing
raw_score = fuzz.ratio("New York Mets!", "new york mets")
processed_score = fuzz.ratio(
    utils.full_process("New York Mets!"),
    utils.full_process("new york mets"),
)
print(f"Raw: {raw_score}, Processed: {processed_score}")
```

### Custom Processing Pipeline

```python
from thefuzz import process, utils


def custom_processor(text):
    """Custom processing for specific use case."""
    # First apply standard processing
    processed = utils.full_process(text, force_ascii=True)

    # Add custom logic (here: normalize street-type abbreviations;
    # stop-word removal and similar steps would go in the same place)
    replacements = {
        "street": "st",
        "avenue": "ave",
        "boulevard": "blvd",
    }

    for old, new in replacements.items():
        processed = processed.replace(old, new)

    return processed


# Use with fuzzy matching
addresses = ["123 Main Street", "456 Oak Avenue", "789 First Boulevard"]
result = process.extractOne("main st", addresses, processor=custom_processor)
```
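
Note that `str.replace` in the processor above matches substrings, so a word such as "streetcar" would also be rewritten. A token-based variant avoids that. The sketch below is standalone and uses a hypothetical `normalize_tokens` helper that expects already-processed (lowercase, punctuation-free) text; it is one possible design, not part of thefuzz.

```python
# Hypothetical abbreviation map, mirroring the example above.
REPLACEMENTS = {"street": "st", "avenue": "ave", "boulevard": "blvd"}


def normalize_tokens(processed):
    """Replace whole words only; expects lowercase, punctuation-free text."""
    tokens = processed.split()  # split() also collapses runs of whitespace
    return " ".join(REPLACEMENTS.get(tok, tok) for tok in tokens)


print(normalize_tokens("123 main street"))     # 123 main st
print(normalize_tokens("streetcar  avenue"))   # streetcar ave
```

Such a helper could be called at the end of `custom_processor` in place of the `replace` loop.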

### Performance Considerations

```python
from thefuzz import utils

# For batch processing, consider preprocessing once
texts = ["Text 1", "Text 2", "Text 3"]  # ...and any further strings
processed_texts = [utils.full_process(text) for text in texts]

# Then use the processed texts for multiple comparisons
# This avoids repeated preprocessing in fuzzy matching functions
```
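
The preprocess-once pattern can be sketched end to end without thefuzz. In the example below, `simple_process` is a hypothetical stand-in for `utils.full_process`, and `difflib.SequenceMatcher` stands in for a fuzzy scorer; both are assumptions made so the block runs on the standard library alone. Each choice is processed a single time and reused for every query.

```python
import re
from difflib import SequenceMatcher


def simple_process(s):
    """Hypothetical stand-in for utils.full_process (no ASCII stripping)."""
    return re.sub(r"\W", " ", str(s)).strip().lower()


def best_matches(queries, choices):
    """Preprocess each choice once, then reuse it for every query."""
    processed_choices = [(c, simple_process(c)) for c in choices]
    results = {}
    for query in queries:
        pq = simple_process(query)
        # SequenceMatcher.ratio() stands in for a fuzzy scorer here.
        best = max(
            processed_choices,
            key=lambda pair: SequenceMatcher(None, pq, pair[1]).ratio(),
        )
        results[query] = best[0]
    return results


choices = ["New York Mets", "Atlanta Braves", "Boston Red Sox"]
print(best_matches(["new york mets!", "Atlanta Braves?"], choices))
# {'new york mets!': 'New York Mets', 'Atlanta Braves?': 'Atlanta Braves'}
```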

## Processing Behavior

### Character Handling

- **Alphanumeric**: Preserved (letters and numbers)
- **Whitespace**: Trimmed at both ends; internal runs of spaces are kept as-is
- **Punctuation**: Removed (each character is replaced with a space)
- **Accented characters**: Preserved by default; removed, not transliterated, when `force_ascii=True`
- **Case**: Converted to lowercase
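
Because `force_ascii=True` drops accented characters instead of mapping them to base letters, some callers may prefer to transliterate before processing. One standard-library approach, which is not what thefuzz does, is NFKD normalization followed by removal of combining marks. `strip_accents` below is a hypothetical helper.

```python
import unicodedata


def strip_accents(s):
    """Map accented letters to their base letters (e.g. é -> e) via NFKD."""
    decomposed = unicodedata.normalize("NFKD", s)
    # Combining marks (accents, diaereses) have a non-zero combining class.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("Café Münchën"))  # Cafe Munchen
```

The result can then be passed on to the regular processing step, so "Café" and "Cafe" compare as equal.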

### Examples of Processing Results

```python
from thefuzz import utils

examples = [
    "Hello, World!",  # → "hello  world"
    " Multiple Spaces ",  # → "multiple spaces"
    "New York Mets vs. ATL",  # → "new york mets vs  atl"
    "Café Münchën",  # → "café münchën" (or "caf mnchn" with force_ascii=True)
    "user@email.com",  # → "user email com"
    "1st & 2nd Avenue",  # → "1st   2nd avenue"
]

for text in examples:
    processed = utils.full_process(text)
    processed_ascii = utils.full_process(text, force_ascii=True)
    print(f"'{text}' → '{processed}' → '{processed_ascii}'")
```