0
# Pattern Search
1
2
Efficient multi-pattern search operations using the built automaton, supporting various search modes including standard iteration, longest-match iteration, and callback-based processing.
3
4
## Capabilities
5
6
### Standard Search Iterator
7
8
Iterate over all pattern matches in text, returning overlapping matches.
9
10
```python { .api }
11
def iter(self, string, start=0, end=None, ignore_white_space=False):
12
"""
13
Perform Aho-Corasick search procedure using the provided input string.
14
15
Parameters:
16
- string: Input text to search
17
- start: Optional start index for search slice
18
- end: Optional end index for search slice
19
- ignore_white_space: If True, ignore whitespace characters in matching
20
21
Returns:
22
Iterator of tuples (end_index, value) where:
23
- end_index is the end index in input string where a trie key was found
24
- value is the value associated with found key string
25
26
Raises:
27
- AttributeError: If automaton not built with make_automaton()
28
"""
29
```
30
31
#### Usage Examples
32
33
```python
34
import ahocorasick
35
36
# Create and build automaton
37
automaton = ahocorasick.Automaton()
38
words = ['he', 'she', 'his', 'hers']
39
for i, word in enumerate(words):
40
automaton.add_word(word, (i, word))
41
42
automaton.make_automaton()
43
44
# Basic search
45
text = "she sells seashells by the seashore"
46
for end_index, (insert_order, original_string) in automaton.iter(text):
47
start_index = end_index - len(original_string) + 1
48
print(f"Found '{original_string}' at {start_index}-{end_index}")
49
50
# Search with slice bounds
51
text = "he she his hers"
52
matches = list(automaton.iter(text, start=3, end=10)) # "she hi"
53
for end_index, (insert_order, original_string) in matches:
54
print(f"Match: '{original_string}' ending at {end_index}")
55
56
# Ignore whitespace during matching
57
automaton_no_space = ahocorasick.Automaton()
58
automaton_no_space.add_word('hello', 'greeting')
59
automaton_no_space.make_automaton()
60
61
text_with_spaces = "h e l l o w o r l d"
62
matches = list(automaton_no_space.iter(text_with_spaces, ignore_white_space=True))
63
print(matches) # Will find 'hello' despite spaces
64
```
65
66
### Longest Match Iterator
67
68
Iterate over non-overlapping longest matches only.
69
70
```python { .api }
71
def iter_long(self, string, start=0, end=None):
72
"""
73
Perform modified Aho-Corasick search which matches the longest words from set.
74
75
Parameters:
76
- string: Input text to search
77
- start: Optional start index for search slice
78
- end: Optional end index for search slice
79
80
Returns:
81
Iterator of tuples (end_index, value) for non-overlapping longest matches
82
83
Raises:
84
- AttributeError: If automaton not built with make_automaton()
85
"""
86
```
87
88
#### Usage Example
89
90
```python
91
automaton = ahocorasick.Automaton()
92
# Add overlapping patterns of different lengths
93
patterns = ['he', 'her', 'hers', 'she']
94
for i, pattern in enumerate(patterns):
95
automaton.add_word(pattern, (i, pattern))
96
97
automaton.make_automaton()
98
99
text = "she hers"
100
101
# Standard iterator finds all overlapping matches
102
standard_matches = list(automaton.iter(text))
103
print("Standard matches:", standard_matches)
104
# Output: [(2, (3, 'she')), (2, (0, 'he')), (7, (0, 'he')), (7, (1, 'her')), (7, (2, 'hers'))]
105
106
# Long iterator finds only longest non-overlapping matches
107
long_matches = list(automaton.iter_long(text))
108
print("Longest matches:", long_matches)
109
# Output: [(2, (3, 'she')), (7, (2, 'hers'))]
110
```
111
112
### Callback-based Search
113
114
Search for patterns and invoke a callback function for each match.
115
116
```python { .api }
117
def find_all(self, string, callback, start=0, end=None):
118
"""
119
Perform Aho-Corasick search procedure and invoke callback for each match.
120
121
Parameters:
122
- string: Input text to search
123
- callback: Callable that accepts (end_index, value) arguments
124
- start: Optional start index for search slice
125
- end: Optional end index for search slice
126
127
Raises:
128
- AttributeError: If automaton not built with make_automaton()
129
"""
130
```
131
132
#### Usage Example
133
134
```python
135
automaton = ahocorasick.Automaton()
136
words = ['cat', 'car', 'card']
137
for word in words:
138
automaton.add_word(word, word.upper())
139
140
automaton.make_automaton()
141
142
# Callback function to process matches
143
matches_found = []
144
def collect_matches(end_index, value):
145
matches_found.append({
146
'end_position': end_index,
147
'matched_word': value,
148
'length': len(value)
149
})
150
151
# Search with callback
152
text = "the car and cat ran to the card table"
153
automaton.find_all(text, collect_matches)
154
155
for match in matches_found:
156
start_pos = match['end_position'] - match['length'] + 1
157
print(f"Found '{match['matched_word']}' at position {start_pos}-{match['end_position']}")
158
```
159
160
### Search Iterator Manipulation
161
162
Advanced control over search iteration state.
163
164
```python { .api }
165
class AutomatonSearchIter:
166
def set(self, string, reset=False):
167
"""
168
Set a new string to search in the iterator.
169
170
Parameters:
171
- string: New string to search
172
- reset: If False (default), continue search state;
173
if True, reset automaton state
174
175
Usage:
176
Allows searching large strings in chunks or continuing
177
search across multiple string segments.
178
"""
179
```
180
181
#### Usage Example
182
183
```python
184
automaton = ahocorasick.Automaton()
185
automaton.add_word('hello', 'greeting')
186
automaton.add_word('world', 'place')
187
automaton.make_automaton()
188
189
# Get iterator
190
search_iter = automaton.iter("hello")
191
192
# Process first chunk
193
for match in search_iter:
194
print(f"Match in first chunk: {match}")
195
196
# Continue with second chunk (maintains state)
197
search_iter.set(" world", reset=False)
198
for match in search_iter:
199
print(f"Match in second chunk: {match}")
200
201
# Reset and search new content
202
search_iter.set("hello again", reset=True)
203
for match in search_iter:
204
print(f"Match after reset: {match}")
205
```
206
207
### Long Search Iterator Manipulation
208
209
Advanced control over longest-match search iteration state.
210
211
```python { .api }
212
class AutomatonSearchIterLong:
213
def set(self, string, reset=False):
214
"""
215
Set a new string to search in the longest-match iterator.
216
217
Parameters:
218
- string: New string to search
219
- reset: If False (default), continue search state;
220
if True, reset automaton state
221
222
Usage:
223
Allows searching large strings in chunks for longest matches
224
or continuing search across multiple string segments.
225
"""
226
```
227
228
#### Usage Example
229
230
```python
231
automaton = ahocorasick.Automaton()
232
automaton.add_word('he', 'short')
233
automaton.add_word('hello', 'long')
234
automaton.make_automaton()
235
236
# Get longest-match iterator
237
long_iter = automaton.iter_long("hello")
238
239
# Process first chunk
240
for match in long_iter:
241
print(f"Longest match in first chunk: {match}")
242
243
# Continue with second chunk (maintains state)
244
long_iter.set(" world", reset=False)
245
for match in long_iter:
246
print(f"Longest match in second chunk: {match}")
247
248
# Reset and search new content
249
long_iter.set("hello again", reset=True)
250
for match in long_iter:
251
print(f"Longest match after reset: {match}")
252
```
253
254
## Search Performance Notes
255
256
### Memory Efficiency
257
- All search methods return iterators, not lists
258
- Memory usage is constant regardless of text size or match count
259
- Suitable for processing large texts or streaming data
260
261
### Time Complexity
262
- Search time is O(n + m) where n is text length and m is number of matches
263
- Performance independent of number of patterns (within reasonable limits)
264
- Worst-case and average-case performance are similar
265
266
### Pattern Overlap Handling
267
- `iter()`: Returns all overlapping matches
268
- `iter_long()`: Returns only longest non-overlapping matches
269
- `find_all()`: Processes all overlapping matches via callback
270
271
## Common Search Patterns
272
273
### Extract All Matches with Positions
274
275
```python
276
def extract_matches(automaton, text):
277
"""Extract all matches with their positions and original text."""
278
matches = []
279
for end_index, value in automaton.iter(text):
280
# Reconstruct original pattern length
281
if isinstance(value, tuple) and len(value) == 2:
282
_, pattern = value
283
start_index = end_index - len(pattern) + 1
284
matches.append({
285
'pattern': pattern,
286
'start': start_index,
287
'end': end_index,
288
'text': text[start_index:end_index + 1]
289
})
290
return matches
291
```
292
293
### Count Pattern Occurrences
294
295
```python
296
def count_patterns(automaton, text):
297
"""Count occurrences of each pattern."""
298
counts = {}
299
for end_index, value in automaton.iter(text):
300
pattern = value if isinstance(value, str) else value[1]
301
counts[pattern] = counts.get(pattern, 0) + 1
302
return counts
303
```
304
305
### Find Non-overlapping Matches
306
307
```python
308
def find_non_overlapping(automaton, text):
309
"""Find all non-overlapping matches (greedy left-to-right)."""
310
matches = []
311
last_end = -1
312
313
for end_index, value in automaton.iter(text):
314
pattern = value if isinstance(value, str) else value[1]
315
start_index = end_index - len(pattern) + 1
316
317
if start_index > last_end:
318
matches.append((start_index, end_index, pattern))
319
last_end = end_index
320
321
return matches
322
```