npx @tessl/cli install tessl/pypi-pyahocorasick@2.2.00
# pyahocorasick

A fast and memory efficient library for exact or approximate multi-pattern string search using the Aho-Corasick algorithm. With pyahocorasick, you can find multiple key string occurrences at once in input text, making it ideal for applications requiring high-throughput pattern matching such as bioinformatics, log parsing, content filtering, and data mining.

## Package Information

- **Package Name**: pyahocorasick
- **Language**: Python (C extension)
- **Installation**: `pip install pyahocorasick`

## Core Imports

```python
import ahocorasick
```

## Basic Usage

```python
import ahocorasick

# Create an automaton
automaton = ahocorasick.Automaton()

# Add words to the trie
for idx, key in enumerate(['he', 'she', 'his', 'hers']):
    automaton.add_word(key, (idx, key))

# Convert the trie to an automaton for searching
automaton.make_automaton()

# Search for patterns in text
text = "she sells seashells by the seashore"
for end_index, (insert_order, original_string) in automaton.iter(text):
    start_index = end_index - len(original_string) + 1
    print(f"Found '{original_string}' at positions {start_index}-{end_index}")
```

## Architecture

pyahocorasick implements a two-stage pattern matching system:

- **Trie Stage**: Dictionary-like structure for storing patterns with associated values
- **Automaton Stage**: Aho-Corasick finite state machine for efficient multi-pattern search

The library supports flexible value storage (arbitrary objects, integers, or automatic length calculation) and operates on either Unicode strings or byte sequences, depending on build configuration.

## Capabilities

### Automaton Construction

Core functionality for creating and managing Aho-Corasick automata, including adding/removing patterns, configuring storage types, and converting tries to search-ready automata.

```python { .api }
class Automaton:
    def __init__(self, store=ahocorasick.STORE_ANY, key_type=ahocorasick.KEY_STRING): ...
    def add_word(self, key, value=None): ...
    def remove_word(self, key): ...
    def make_automaton(self): ...
```

[Automaton Construction](./automaton-construction.md)

### Pattern Search

Efficient multi-pattern search operations using the built automaton, supporting various search modes including standard iteration, longest-match iteration, and callback-based processing.

```python { .api }
def iter(self, string, start=0, end=None, ignore_white_space=False): ...
def iter_long(self, string, start=0, end=None): ...
def find_all(self, string, callback, start=0, end=None): ...
```

[Pattern Search](./pattern-search.md)

### Dictionary Interface

Dict-like operations for accessing stored patterns and values, including existence checking, value retrieval, and iteration over keys, values, and items with optional filtering.

```python { .api }
def get(self, key, default=None): ...
def exists(self, key): ...
def keys(self, prefix=None, wildcard=None, how=ahocorasick.MATCH_AT_LEAST_PREFIX): ...
def values(self): ...
def items(self): ...
```

[Dictionary Interface](./dictionary-interface.md)

### Serialization

Save and load automaton instances to/from disk with support for custom serialization functions for arbitrary object storage and efficient built-in serialization for integer storage.

```python { .api }
def save(self, path, serializer=None): ...
def load(path, deserializer=None): ...
```

[Serialization](./serialization.md)

## Constants

### Storage Types

```python { .api }
STORE_ANY     # Store arbitrary Python objects (default)
STORE_INTS    # Store integers only
STORE_LENGTH  # Store key lengths automatically
```

### Key Types

```python { .api }
KEY_STRING    # String keys (default)
KEY_SEQUENCE  # Integer sequence keys
```

### Automaton States

```python { .api }
EMPTY        # No words added
TRIE         # Trie built but not converted to automaton
AHOCORASICK  # Full automaton ready for searching
```

### Pattern Matching Types

```python { .api }
MATCH_EXACT_LENGTH     # Exact length matching for wildcard patterns
MATCH_AT_LEAST_PREFIX  # At least prefix length matching (default)
MATCH_AT_MOST_PREFIX   # At most prefix length matching
```

### Build Configuration

```python { .api }
unicode  # Integer flag (0 or 1) indicating Unicode build support
```

## Error Handling

The library raises standard Python exceptions:

- **ValueError**: Invalid arguments, wrong store/key types, malformed data
- **TypeError**: Wrong argument types
- **KeyError**: Key not found during lookup
- **AttributeError**: Calling search methods before building the automaton
- **IndexError**: Invalid range parameters