0
# Automaton Construction
1
2
Core functionality for creating and managing Aho-Corasick automata, including adding and removing patterns, configuring storage types, and converting tries to search-ready automatons.
3
4
## Capabilities
5
6
### Constructor
7
8
Create a new empty Automaton with configurable storage and key types.
9
10
```python { .api }
11
class Automaton:
12
def __init__(self, store=ahocorasick.STORE_ANY, key_type=ahocorasick.KEY_STRING):
13
"""
14
Create a new empty Automaton.
15
16
Parameters:
17
- store: Storage type for values (STORE_ANY, STORE_INTS, STORE_LENGTH)
18
- key_type: Type of keys (KEY_STRING, KEY_SEQUENCE)
19
"""
20
```
21
22
#### Usage Example
23
24
```python
25
import ahocorasick
26
27
# Default: store any objects with string keys
28
automaton = ahocorasick.Automaton()
29
30
# Store integers only
31
int_automaton = ahocorasick.Automaton(ahocorasick.STORE_INTS)
32
33
# Store key lengths automatically
34
length_automaton = ahocorasick.Automaton(ahocorasick.STORE_LENGTH)
35
36
# Use integer sequences as keys
37
seq_automaton = ahocorasick.Automaton(key_type=ahocorasick.KEY_SEQUENCE)
38
```
39
40
### Add Words
41
42
Add key strings to the trie with associated values.
43
44
```python { .api }
45
def add_word(self, key, value=None):
46
"""
47
Add a key string to the trie and associate with a value.
48
49
Parameters:
50
- key: String key to add (or integer sequence if KEY_SEQUENCE)
51
- value: Associated value (required for STORE_ANY, optional for STORE_INTS,
52
not allowed for STORE_LENGTH)
53
54
Returns:
55
bool: True if new word added, False if word already exists
56
57
Raises:
58
- ValueError: If value type doesn't match store type
59
- TypeError: If key type is incorrect
60
"""
61
```
62
63
#### Usage Examples
64
65
```python
66
# STORE_ANY (default) - value required
67
automaton = ahocorasick.Automaton()
68
automaton.add_word('hello', 'greeting')
69
automaton.add_word('world', {'type': 'noun', 'meaning': 'earth'})
70
71
# STORE_INTS - value optional (defaults to insertion order)
72
int_automaton = ahocorasick.Automaton(ahocorasick.STORE_INTS)
73
int_automaton.add_word('cat', 1)
74
int_automaton.add_word('dog') # defaults to 1 (second word added)
75
76
# STORE_LENGTH - value not allowed (automatically uses key length)
77
length_automaton = ahocorasick.Automaton(ahocorasick.STORE_LENGTH)
78
length_automaton.add_word('cat') # value will be 3
79
length_automaton.add_word('mouse') # value will be 5
80
81
# KEY_SEQUENCE - use integer sequences as keys
82
seq_automaton = ahocorasick.Automaton(key_type=ahocorasick.KEY_SEQUENCE)
83
seq_automaton.add_word([1, 2, 3], 'sequence1')
84
seq_automaton.add_word((4, 5, 6), 'sequence2')
85
```
86
87
### Remove Words
88
89
Remove words from the trie.
90
91
```python { .api }
92
def remove_word(self, key):
93
"""
94
Remove given word from the trie.
95
96
Parameters:
97
- key: Word to remove
98
99
Returns:
100
bool: True if word was found and removed, False otherwise
101
"""
102
103
def pop(self, key):
104
"""
105
Remove word from trie and return its associated value.
106
107
Parameters:
108
- key: Word to remove
109
110
Returns:
111
The value associated with key
112
113
Raises:
114
- KeyError: If key not found
115
"""
116
```
117
118
#### Usage Examples
119
120
```python
121
automaton = ahocorasick.Automaton()
122
automaton.add_word('hello', 'greeting')
123
automaton.add_word('world', 'noun')
124
125
# Remove word and check if it existed
126
was_removed = automaton.remove_word('hello') # True
127
was_removed = automaton.remove_word('missing') # False
128
129
# Remove word and get its value
130
value = automaton.pop('world') # 'noun'
131
132
# Pop raises KeyError if key not found
133
try:
134
value = automaton.pop('missing')
135
except KeyError:
136
print("Key not found")
137
```
138
139
### Clear Automaton
140
141
Remove all words from the automaton.
142
143
```python { .api }
144
def clear(self):
145
"""
146
Remove all keys from the trie.
147
This method invalidates all iterators.
148
"""
149
```
150
151
### Build Automaton
152
153
Convert the trie to an Aho-Corasick automaton for efficient searching.
154
155
```python { .api }
156
def make_automaton(self):
157
"""
158
Finalize and create the Aho-Corasick automaton based on the keys
159
already added to the trie. After successful creation, the
160
automaton.kind attribute is set to ahocorasick.AHOCORASICK.
161
"""
162
```
163
164
#### Usage Example
165
166
```python
167
automaton = ahocorasick.Automaton()
168
169
# Add words to the trie
170
words = ['he', 'she', 'his', 'hers']
171
for i, word in enumerate(words):
172
automaton.add_word(word, i)
173
174
print(automaton.kind) # ahocorasick.TRIE
175
176
# Convert to automaton for searching
177
automaton.make_automaton()
178
179
print(automaton.kind) # ahocorasick.AHOCORASICK
180
181
# Now ready for efficient pattern search
182
text = "she sells seashells"
183
matches = list(automaton.iter(text))
184
```
185
186
### Automaton Properties
187
188
Access automaton state and configuration information.
189
190
```python { .api }
191
@property
192
def kind(self):
193
"""
194
Current state of automaton.
195
196
Returns:
197
int: One of EMPTY, TRIE, or AHOCORASICK
198
"""
199
200
@property
201
def store(self):
202
"""
203
Storage type used by automaton.
204
205
Returns:
206
int: One of STORE_ANY, STORE_INTS, or STORE_LENGTH
207
"""
208
209
def __len__(self):
210
"""
211
Number of words in automaton.
212
213
Returns:
214
int: Count of distinct words added
215
"""
216
```
217
218
### Statistics
219
220
Get detailed information about the automaton's structure and memory usage.
221
222
```python { .api }
223
def get_stats(self):
224
"""
225
Return a dictionary containing Automaton statistics.
226
227
Returns:
228
dict: Statistics with keys:
229
- nodes_count: Total number of nodes
230
- words_count: Number of distinct words (same as len(automaton))
231
- longest_word: Length of the longest word
232
- links_count: Number of edges
233
- sizeof_node: Size of single node in bytes
234
- total_size: Total size of trie in bytes
235
"""
236
```
237
238
#### Usage Example
239
240
```python
241
automaton = ahocorasick.Automaton()
242
for word in ['cat', 'car', 'card', 'care', 'careful']:
243
automaton.add_word(word, len(word))
244
245
automaton.make_automaton()
246
247
stats = automaton.get_stats()
248
print(f"Nodes: {stats['nodes_count']}")
249
print(f"Words: {stats['words_count']}")
250
print(f"Longest word: {stats['longest_word']}")
251
print(f"Memory usage: {stats['total_size']} bytes")
252
```
253
254
### Memory Size
255
256
Get the approximate memory usage of the automaton instance.
257
258
```python { .api }
259
def __sizeof__(self):
260
"""
261
Return the approximate size in bytes occupied by the Automaton instance in memory.
262
263
Returns:
264
int: Size in bytes (excludes size of associated objects when using STORE_ANY)
265
"""
266
```
267
268
### Structure Dump
269
270
Get detailed internal structure information for debugging and inspection.
271
272
```python { .api }
273
def dump(self):
274
"""
275
Return a three-tuple of lists describing the Automaton as a graph of nodes, edges, failure links.
276
277
Returns:
278
tuple: (nodes, edges, failure_links) where:
279
- nodes: list of (node_id, end_of_word_marker) pairs
280
- edges: list of (node_id, label_char, child_node_id) triples
281
- failure_links: list of (source_node_id, target_node_id) pairs
282
283
Returns None if automaton is empty.
284
"""
285
```
286
287
#### Usage Example
288
289
```python
290
automaton = ahocorasick.Automaton()
291
automaton.add_word('cat', 1)
292
automaton.add_word('car', 2)
293
automaton.make_automaton()
294
295
# Get memory usage
296
memory_size = automaton.__sizeof__()
297
print(f"Automaton uses {memory_size} bytes")
298
299
# Get internal structure for debugging
300
structure = automaton.dump()
301
if structure:
302
nodes, edges, failure_links = structure
303
print(f"Nodes: {len(nodes)}")
304
print(f"Edges: {len(edges)}")
305
print(f"Failure links: {len(failure_links)}")
306
307
# Example: print first few nodes
308
for node_id, is_end in nodes[:3]:
309
print(f"Node {node_id}: end_marker={is_end}")
310
```
311
312
## Storage Type Details
313
314
### STORE_ANY (Default)
315
- Stores arbitrary Python objects as values
316
- Value parameter is required in `add_word()`
317
- Most flexible but uses more memory
318
- Suitable for complex applications with rich metadata
319
320
### STORE_INTS
321
- Stores 32-bit integers only
322
- Value parameter is optional (defaults to insertion order)
323
- Memory efficient for simple use cases
324
- Good for basic pattern indexing
325
326
### STORE_LENGTH
327
- Automatically stores the length of each key as its value
328
- Value parameter not allowed in `add_word()`
329
- Most memory efficient
330
- Ideal when you only need to know pattern lengths
331
332
## Key Type Details
333
334
### KEY_STRING (Default)
335
- Uses string keys for pattern matching
336
- Supports Unicode or bytes depending on build configuration
337
- Standard choice for text processing
338
339
### KEY_SEQUENCE
340
- Uses sequences of integers as keys
341
- Useful for numeric pattern matching
342
- Keys can be lists, tuples, or other integer sequences