tessl/pypi-pyahocorasick

Describes: pypipkg:pypi/pyahocorasick@2.2.x

To install, run

npx @tessl/cli install tessl/pypi-pyahocorasick@2.2.0

pyahocorasick

A fast and memory efficient library for exact or approximate multi-pattern string search using the Aho-Corasick algorithm. With pyahocorasick, you can find every occurrence of many key strings in a single pass over the input text, which suits high-throughput pattern matching in areas such as bioinformatics, log parsing, content filtering, and data mining.

Package Information

  • Package Name: pyahocorasick
  • Language: Python (C extension)
  • Installation: pip install pyahocorasick

Core Imports

import ahocorasick

Basic Usage

import ahocorasick

# Create an automaton
automaton = ahocorasick.Automaton()

# Add words to the trie
for idx, key in enumerate(['he', 'she', 'his', 'hers']):
    automaton.add_word(key, (idx, key))

# Convert to automaton for searching
automaton.make_automaton()

# Search for patterns in text
text = "she sells seashells by the seashore"
for end_index, (insert_order, original_string) in automaton.iter(text):
    start_index = end_index - len(original_string) + 1
    print(f"Found '{original_string}' at positions {start_index}-{end_index}")

Architecture

pyahocorasick implements a two-stage pattern matching system:

  • Trie Stage: Dictionary-like structure for storing patterns with associated values
  • Automaton Stage: Aho-Corasick finite state machine for efficient multi-pattern search

The library supports flexible value storage (arbitrary objects, integers, or automatically computed key lengths) and operates on either Unicode strings or byte strings, depending on how the extension was built.

Capabilities

Automaton Construction

Core functionality for creating and managing Aho-Corasick automata, including adding and removing patterns, configuring storage types, and converting tries to search-ready automata.

class Automaton:
    def __init__(self, store=ahocorasick.STORE_ANY, key_type=ahocorasick.KEY_STRING): ...
    def add_word(self, key, value=None): ...
    def remove_word(self, key): ...
    def make_automaton(self): ...

Pattern Search

Efficient multi-pattern search operations using the built automaton, supporting various search modes including standard iteration, longest-match iteration, and callback-based processing.

def iter(self, string, start=0, end=None, ignore_white_space=False): ...
def iter_long(self, string, start=0, end=None): ...
def find_all(self, string, callback, start=0, end=None): ...

Dictionary Interface

Dict-like operations for accessing stored patterns and values, including existence checking, value retrieval, and iteration over keys, values, and items with optional filtering.

def get(self, key, default=None): ...
def exists(self, key): ...
def keys(self, prefix=None, wildcard=None, how=ahocorasick.MATCH_EXACT_LENGTH): ...
def values(self): ...
def items(self): ...

Serialization

Save and load automaton instances to/from disk with support for custom serialization functions for arbitrary object storage and efficient built-in serialization for integer storage.

def save(self, path, serializer=None): ...
def load(path, deserializer=None): ...

Constants

Storage Types

STORE_ANY      # Store arbitrary Python objects (default)
STORE_INTS     # Store integers only
STORE_LENGTH   # Store key lengths automatically

Key Types

KEY_STRING     # String keys (default)
KEY_SEQUENCE   # Integer sequence keys

Automaton States

EMPTY          # No words added
TRIE           # Trie built but not converted to automaton
AHOCORASICK    # Full automaton ready for searching

Pattern Matching Types

MATCH_EXACT_LENGTH      # Keys exactly as long as the wildcard pattern (default)
MATCH_AT_LEAST_PREFIX   # Keys at least as long as the pattern
MATCH_AT_MOST_PREFIX    # Keys at most as long as the pattern

Build Configuration

unicode                 # Integer flag (0 or 1) indicating Unicode build support

Error Handling

The library raises standard Python exceptions:

  • ValueError: Invalid arguments, wrong store/key types, malformed data
  • TypeError: Wrong argument types
  • KeyError: A requested key is not present and no default was supplied
  • AttributeError: Calling search methods before building automaton
  • IndexError: Invalid range parameters