# Text Processing

Functions for splitting text and performing substitutions using regular expressions. These operations are fundamental for text processing, data cleaning, and string manipulation tasks.

## Capabilities

### Text Splitting

Splits text into a list using a regular expression pattern as the delimiter, with optional control over the maximum number of splits.

```python { .api }
def split(pattern, text, maxsplit=0, options=None):
    """
    Split text by pattern occurrences.

    Args:
        pattern (str): Regular expression pattern used as delimiter
        text (str): Input text to split
        maxsplit (int): Maximum number of splits (0 = no limit)
        options (Options, optional): Compilation options

    Returns:
        list: List of text segments
    """
```

Example usage:

```python
import re2

# Split on whitespace
text = "apple banana cherry"
parts = re2.split(r'\s+', text)
print(parts)  # ['apple', 'banana', 'cherry']

# Split with limit
text = "one,two,three,four"
parts = re2.split(r',', text, maxsplit=2)
print(parts)  # ['one', 'two', 'three,four']

# Split capturing delimiter
text = "word1,word2;word3"
parts = re2.split(r'([,;])', text)
print(parts)  # ['word1', ',', 'word2', ';', 'word3']
```
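
A common use of `maxsplit` is to separate a key from a value that may itself contain the delimiter. The sketch below is illustrative only: `parse_header` is a hypothetical helper, and the fallback to the standard library `re` module is an assumption made so the snippet runs where `re2` is not installed.

```python
try:
    import re2 as regex  # the module documented here
except ImportError:
    import re as regex  # assumption: stdlib fallback with the same split() behavior


def parse_header(line):
    """Split 'Key: value' at the first colon only; return None if there is no colon."""
    parts = regex.split(r'\s*:\s*', line.strip(), maxsplit=1)
    if len(parts) != 2:
        return None
    return parts[0], parts[1]


# The value contains colons, but only the first one acts as a delimiter
print(parse_header("Date: 2024-01-01T10:30:00"))  # ('Date', '2024-01-01T10:30:00')
print(parse_header("no separator here"))          # None
```

Without `maxsplit=1`, the timestamp would be split at every colon and the value would have to be rejoined by hand.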

### Text Substitution

Replaces occurrences of a pattern with a replacement string, with optional control over the number of replacements.

```python { .api }
def sub(pattern, repl, text, count=0, options=None):
    """
    Replace pattern occurrences with replacement string.

    Args:
        pattern (str): Regular expression pattern to match
        repl (str or callable): Replacement string or function
        text (str): Input text to process
        count (int): Maximum number of replacements (0 = all)
        options (Options, optional): Compilation options

    Returns:
        str: Text with replacements made
    """
```

Example usage:

```python
import re2

# Simple replacement
text = "Hello world"
result = re2.sub(r'world', 'universe', text)
print(result)  # "Hello universe"

# Replace with group references
text = "John Smith, Jane Doe"
result = re2.sub(r'(\w+) (\w+)', r'\2, \1', text)
print(result)  # "Smith, John, Doe, Jane"

# Limited replacements
text = "foo foo foo"
result = re2.sub(r'foo', 'bar', text, count=2)
print(result)  # "bar bar foo"

# Using callable replacement
def upper_match(match):
    return match.group().upper()

text = "hello world"
result = re2.sub(r'\w+', upper_match, text)
print(result)  # "HELLO WORLD"
```
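
A callable replacement is handy when the replacement text depends on what was matched. As a minimal sketch, the hypothetical helper `fill_template` below looks up each `{name}` placeholder in a dictionary. The helper name and the stdlib fallback import are assumptions for illustration, not part of the documented API.

```python
try:
    import re2 as regex  # the module documented here
except ImportError:
    import re as regex  # assumption: stdlib fallback with the same sub() behavior


def fill_template(template, values):
    """Replace {name} placeholders with values; unknown names are left unchanged."""
    def lookup(match):
        name = match.group(1)
        # Fall back to the original placeholder text when the name is unknown
        return str(values.get(name, match.group(0)))

    return regex.sub(r'\{(\w+)\}', lookup, template)


message = fill_template(
    "Hello {name}, you have {count} new messages",
    {"name": "Ada", "count": 3},
)
print(message)  # Hello Ada, you have 3 new messages
```

Returning the original match for unknown names keeps the function total: a missing key never raises and never silently deletes text.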

### Text Substitution with Count

Performs substitution like `sub()` but returns both the modified text and the number of substitutions made.

```python { .api }
def subn(pattern, repl, text, count=0, options=None):
    """
    Replace pattern occurrences and return (result, count).

    Args:
        pattern (str): Regular expression pattern to match
        repl (str or callable): Replacement string or function
        text (str): Input text to process
        count (int): Maximum number of replacements (0 = all)
        options (Options, optional): Compilation options

    Returns:
        tuple: (modified_text, substitution_count)
    """
```

Example usage:

```python
import re2

# Get substitution count (only "over" and "lazy" have exactly four letters)
text = "The quick brown fox jumps over the lazy dog"
result, num_subs = re2.subn(r'\b\w{4}\b', 'WORD', text)
print(result)  # "The quick brown fox jumps WORD the WORD dog"
print(num_subs)  # 2

# Check if any substitutions were made
text = "no matches here"
result, num_subs = re2.subn(r'\d+', 'NUMBER', text)
if num_subs == 0:
    print("No changes made")
```
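
The count returned by `subn()` can also serve as a loop termination signal when a rewrite has to be repeated until nothing changes. The following sketch removes nested parenthesized groups from the inside out. `strip_parenthesized` is a hypothetical helper, and the stdlib fallback import is an assumption so the snippet runs without `re2`.

```python
try:
    import re2 as regex  # the module documented here
except ImportError:
    import re as regex  # assumption: stdlib fallback with the same subn() behavior


def strip_parenthesized(text):
    """Remove parenthesized groups, innermost first, until none remain.

    Returns (cleaned_text, total_number_of_groups_removed).
    """
    total = 0
    while True:
        # Each pass removes only groups that contain no further parentheses
        text, n = regex.subn(r'\([^()]*\)', '', text)
        if n == 0:
            return text, total
        total += n


print(strip_parenthesized("a(b(c)d)e"))  # ('ae', 2)
print(strip_parenthesized("plain"))      # ('plain', 0)
```

A single regular expression cannot match arbitrarily nested parentheses, so repeating a simple pattern until the count reaches zero is one straightforward alternative.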

### Utility Functions

Additional text processing utilities for escaping special characters and managing the compiled pattern cache.

```python { .api }
def escape(pattern):
    """
    Escape special regex characters in pattern.

    Args:
        pattern (str): String to escape

    Returns:
        str: Pattern with special characters escaped
    """

def purge():
    """
    Clear the compiled regular expression cache.

    This function clears the internal LRU cache that stores
    compiled pattern objects for better performance.
    """
```

Example usage:

```python
import re2

# Escape special characters
literal_text = "Price: $19.99 (20% off)"
escaped = re2.escape(literal_text)
print(escaped)  # $ . ( ) are backslash-escaped; spaces and other punctuation may be escaped too

# Use escaped text as literal pattern
text = "Item costs $19.99 (20% off) today"
match = re2.search(r'\$19\.99 \(20% off\)', text)
print(match is not None)  # True

# The escaped form of a string always matches that string literally
print(re2.search(escaped, literal_text) is not None)  # True

# Clear pattern cache (useful for memory management)
re2.purge()
```
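
One frequent use of `escape()` is building a pattern from a list of literal strings that may contain metacharacters. This is a minimal sketch under stated assumptions: `build_literal_matcher` is a hypothetical helper, and the import falls back to the standard library `re` module where `re2` is unavailable.

```python
try:
    import re2 as regex  # the module documented here
except ImportError:
    import re as regex  # assumption: stdlib fallback with compatible escape()/compile()


def build_literal_matcher(words):
    """Compile an alternation that matches any of the given strings literally."""
    # Put longer strings first so "C++" is tried before its prefix "C"
    alternatives = sorted(words, key=len, reverse=True)
    return regex.compile('|'.join(regex.escape(w) for w in alternatives))


matcher = build_literal_matcher(["C", "C++", "C#"])
replaced = matcher.sub("<lang>", "I use C++ and C# but not C.")
print(replaced)  # I use <lang> and <lang> but not <lang>.
```

Without escaping, `C++` would be rejected or misread as a quantified pattern rather than treated as the literal language name.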

## Pattern Object Text Processing

When using compiled pattern objects, the text processing functions are available as instance methods:

```python { .api }
class _Regexp:
    """Compiled regular expression pattern object."""

    def split(self, text, maxsplit=0):
        """Split text using this pattern as delimiter."""

    def sub(self, repl, text, count=0):
        """Replace matches with replacement string."""

    def subn(self, repl, text, count=0):
        """Replace matches and return (result, count)."""
```

Example usage:

```python
import re2

# Compile pattern once, use multiple times
pattern = re2.compile(r'[,;]\s*')
text1 = "apple, banana; cherry"
text2 = "red,green;blue"

# Split multiple texts with same pattern
parts1 = pattern.split(text1)  # ['apple', 'banana', 'cherry']
parts2 = pattern.split(text2)  # ['red', 'green', 'blue']

# Replace using compiled pattern
result = pattern.sub(' | ', text1)  # "apple | banana | cherry"
```
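
Compiling once pays off most when the same pattern is applied to many inputs. The sketch below reuses one compiled pattern across a batch of lines and uses the `subn()` method to count how many whitespace runs were collapsed. `normalize_lines` and `MULTI_SPACE` are hypothetical names, and the stdlib fallback import is an assumption for environments without `re2`.

```python
try:
    import re2 as regex  # the module documented here
except ImportError:
    import re as regex  # assumption: stdlib fallback with the same pattern methods

# Compiled once at import time and shared by every call below
MULTI_SPACE = regex.compile(r'\s{2,}')


def normalize_lines(lines):
    """Trim each line and collapse runs of two or more whitespace characters.

    Returns (cleaned_lines, number_of_runs_collapsed).
    """
    cleaned = []
    collapsed = 0
    for line in lines:
        new_line, n = MULTI_SPACE.subn(' ', line.strip())
        cleaned.append(new_line)
        collapsed += n
    return cleaned, collapsed


lines, collapsed = normalize_lines(["alpha   beta", "  gamma  delta  ", "epsilon zeta"])
print(lines)      # ['alpha beta', 'gamma delta', 'epsilon zeta']
print(collapsed)  # 2
```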