Tessl Tile for pypi/pyahocorasick@2.2.0

or run

npx @tessl/cli init

Version

Tile

Overview

Evals

Files

docs

automaton-construction.md dictionary-interface.md index.md pattern-search.md serialization.md

pattern-search.mddocs/

0
# Pattern Search
1

2
Efficient multi-pattern search operations using the built automaton, supporting various search modes including standard iteration, longest-match iteration, and callback-based processing.
3

4
## Capabilities
5

6
### Standard Search Iterator
7

8
Iterate over all pattern matches in text, returning overlapping matches.
9

10
```python { .api }
11
def iter(self, string, start=0, end=None, ignore_white_space=False):
12
    """
13
    Perform Aho-Corasick search procedure using the provided input string.
14

15
    Parameters:
16
    - string: Input text to search
17
    - start: Optional start index for search slice
18
    - end: Optional end index for search slice  
19
    - ignore_white_space: If True, ignore whitespace characters in matching
20

21
    Returns:
22
    Iterator of tuples (end_index, value) where:
23
    - end_index is the end index in input string where a trie key was found
24
    - value is the value associated with found key string
25

26
    Raises:
27
    - AttributeError: If automaton not built with make_automaton()
28
    """
29
```
30

31
#### Usage Examples
32

33
```python
34
import ahocorasick
35

36
# Create and build automaton
37
automaton = ahocorasick.Automaton()
38
words = ['he', 'she', 'his', 'hers']
39
for i, word in enumerate(words):
40
    automaton.add_word(word, (i, word))
41

42
automaton.make_automaton()
43

44
# Basic search
45
text = "she sells seashells by the seashore"
46
for end_index, (insert_order, original_string) in automaton.iter(text):
47
    start_index = end_index - len(original_string) + 1
48
    print(f"Found '{original_string}' at {start_index}-{end_index}")
49

50
# Search with slice bounds
51
text = "he she his hers"
52
matches = list(automaton.iter(text, start=3, end=10))  # "she hi"
53
for end_index, (insert_order, original_string) in matches:
54
    print(f"Match: '{original_string}' ending at {end_index}")
55

56
# Ignore whitespace during matching
57
automaton_no_space = ahocorasick.Automaton()
58
automaton_no_space.add_word('hello', 'greeting')
59
automaton_no_space.make_automaton()
60

61
text_with_spaces = "h e l l o   w o r l d"
62
matches = list(automaton_no_space.iter(text_with_spaces, ignore_white_space=True))
63
print(matches)  # Will find 'hello' despite spaces
64
```
65

66
### Longest Match Iterator
67

68
Iterate over non-overlapping longest matches only.
69

70
```python { .api }
71
def iter_long(self, string, start=0, end=None):
72
    """
73
    Perform modified Aho-Corasick search which matches the longest words from set.
74

75
    Parameters:
76
    - string: Input text to search
77
    - start: Optional start index for search slice
78
    - end: Optional end index for search slice
79

80
    Returns:
81
    Iterator of tuples (end_index, value) for non-overlapping longest matches
82

83
    Raises:
84
    - AttributeError: If automaton not built with make_automaton()
85
    """
86
```
87

88
#### Usage Example
89

90
```python
91
automaton = ahocorasick.Automaton()
92
# Add overlapping patterns of different lengths
93
patterns = ['he', 'her', 'hers', 'she']
94
for i, pattern in enumerate(patterns):
95
    automaton.add_word(pattern, (i, pattern))
96

97
automaton.make_automaton()
98

99
text = "she hers"
100

101
# Standard iterator finds all overlapping matches
102
standard_matches = list(automaton.iter(text))
103
print("Standard matches:", standard_matches)
104
# Output: [(2, (3, 'she')), (2, (0, 'he')), (7, (0, 'he')), (7, (1, 'her')), (7, (2, 'hers'))]
105

106
# Long iterator finds only longest non-overlapping matches
107
long_matches = list(automaton.iter_long(text))
108
print("Longest matches:", long_matches)
109
# Output: [(2, (3, 'she')), (7, (2, 'hers'))]
110
```
111

112
### Callback-based Search
113

114
Search for patterns and invoke a callback function for each match.
115

116
```python { .api }
117
def find_all(self, string, callback, start=0, end=None):
118
    """
119
    Perform Aho-Corasick search procedure and invoke callback for each match.
120

121
    Parameters:
122
    - string: Input text to search
123
    - callback: Callable that accepts (end_index, value) arguments
124
    - start: Optional start index for search slice
125
    - end: Optional end index for search slice
126

127
    Raises:
128
    - AttributeError: If automaton not built with make_automaton()
129
    """
130
```
131

132
#### Usage Example
133

134
```python
135
automaton = ahocorasick.Automaton()
136
words = ['cat', 'car', 'card']
137
for word in words:
138
    automaton.add_word(word, word.upper())
139

140
automaton.make_automaton()
141

142
# Callback function to process matches
143
matches_found = []
144
def collect_matches(end_index, value):
145
    matches_found.append({
146
        'end_position': end_index,
147
        'matched_word': value,
148
        'length': len(value)
149
    })
150

151
# Search with callback
152
text = "the car and cat ran to the card table"
153
automaton.find_all(text, collect_matches)
154

155
for match in matches_found:
156
    start_pos = match['end_position'] - match['length'] + 1
157
    print(f"Found '{match['matched_word']}' at position {start_pos}-{match['end_position']}")
158
```
159

160
### Search Iterator Manipulation
161

162
Advanced control over search iteration state.
163

164
```python { .api }
165
class AutomatonSearchIter:
166
    def set(self, string, reset=False):
167
        """
168
        Set a new string to search in the iterator.
169

170
        Parameters:
171
        - string: New string to search
172
        - reset: If False (default), continue search state; 
173
                 if True, reset automaton state
174

175
        Usage:
176
        Allows searching large strings in chunks or continuing
177
        search across multiple string segments.
178
        """
179
```
180

181
#### Usage Example
182

183
```python
184
automaton = ahocorasick.Automaton()
185
automaton.add_word('hello', 'greeting')
186
automaton.add_word('world', 'place')
187
automaton.make_automaton()
188

189
# Get iterator
190
search_iter = automaton.iter("hello")
191

192
# Process first chunk
193
for match in search_iter:
194
    print(f"Match in first chunk: {match}")
195

196
# Continue with second chunk (maintains state)
197
search_iter.set(" world", reset=False)
198
for match in search_iter:
199
    print(f"Match in second chunk: {match}")
200

201
# Reset and search new content
202
search_iter.set("hello again", reset=True)
203
for match in search_iter:
204
    print(f"Match after reset: {match}")
205
```
206

207
### Long Search Iterator Manipulation
208

209
Advanced control over longest-match search iteration state.
210

211
```python { .api }
212
class AutomatonSearchIterLong:
213
    def set(self, string, reset=False):
214
        """
215
        Set a new string to search in the longest-match iterator.
216

217
        Parameters:
218
        - string: New string to search
219
        - reset: If False (default), continue search state; 
220
                 if True, reset automaton state
221

222
        Usage:
223
        Allows searching large strings in chunks for longest matches
224
        or continuing search across multiple string segments.
225
        """
226
```
227

228
#### Usage Example
229

230
```python
231
automaton = ahocorasick.Automaton()
232
automaton.add_word('he', 'short')
233
automaton.add_word('hello', 'long')
234
automaton.make_automaton()
235

236
# Get longest-match iterator
237
long_iter = automaton.iter_long("hello")
238

239
# Process first chunk
240
for match in long_iter:
241
    print(f"Longest match in first chunk: {match}")
242

243
# Continue with second chunk (maintains state)
244
long_iter.set(" world", reset=False)
245
for match in long_iter:
246
    print(f"Longest match in second chunk: {match}")
247

248
# Reset and search new content
249
long_iter.set("hello again", reset=True)
250
for match in long_iter:
251
    print(f"Longest match after reset: {match}")
252
```
253

254
## Search Performance Notes
255

256
### Memory Efficiency
257
- All search methods return iterators, not lists
258
- Memory usage is constant regardless of text size or match count
259
- Suitable for processing large texts or streaming data
260

261
### Time Complexity
262
- Search time is O(n + m) where n is text length and m is number of matches
263
- Performance independent of number of patterns (within reasonable limits)
264
- Worst-case and average-case performance are similar
265

266
### Pattern Overlap Handling
267
- `iter()`: Returns all overlapping matches
268
- `iter_long()`: Returns only longest non-overlapping matches
269
- `find_all()`: Processes all overlapping matches via callback
270

271
## Common Search Patterns
272

273
### Extract All Matches with Positions
274

275
```python
276
def extract_matches(automaton, text):
277
    """Extract all matches with their positions and original text."""
278
    matches = []
279
    for end_index, value in automaton.iter(text):
280
        # Reconstruct original pattern length
281
        if isinstance(value, tuple) and len(value) == 2:
282
            _, pattern = value
283
            start_index = end_index - len(pattern) + 1
284
            matches.append({
285
                'pattern': pattern,
286
                'start': start_index,
287
                'end': end_index,
288
                'text': text[start_index:end_index + 1]
289
            })
290
    return matches
291
```
292

293
### Count Pattern Occurrences
294

295
```python
296
def count_patterns(automaton, text):
297
    """Count occurrences of each pattern."""
298
    counts = {}
299
    for end_index, value in automaton.iter(text):
300
        pattern = value if isinstance(value, str) else value[1]
301
        counts[pattern] = counts.get(pattern, 0) + 1
302
    return counts
303
```
304

305
### Find Non-overlapping Matches
306

307
```python
308
def find_non_overlapping(automaton, text):
309
    """Find all non-overlapping matches (greedy left-to-right)."""
310
    matches = []
311
    last_end = -1
312
    
313
    for end_index, value in automaton.iter(text):
314
        pattern = value if isinstance(value, str) else value[1]
315
        start_index = end_index - len(pattern) + 1
316
        
317
        if start_index > last_end:
318
            matches.append((start_index, end_index, pattern))
319
            last_end = end_index
320
    
321
    return matches
322
```

Version

Tile

Files

pattern-search.md.css-3qkkll{font-size:var(--chakra-font-sizes-sm);font-weight:var(--chakra-font-weights-normal);color:var(--chakra-colors-gray-300);}docs/

pattern-search.mddocs/