or run

npx @tessl/cli init
Log in

Version

Tile

Overview

Evals

Files

Files

docs

automaton-construction.mddictionary-interface.mdindex.mdpattern-search.mdserialization.md

pattern-search.mddocs/

0

# Pattern Search

1

2

Efficient multi-pattern search operations using the built automaton, supporting various search modes including standard iteration, longest-match iteration, and callback-based processing.

3

4

## Capabilities

5

6

### Standard Search Iterator

7

8

Iterate over all pattern matches in text, returning overlapping matches.

9

10

```python { .api }

11

def iter(self, string, start=0, end=None, ignore_white_space=False):

12

"""

13

Perform Aho-Corasick search procedure using the provided input string.

14

15

Parameters:

16

- string: Input text to search

17

- start: Optional start index for search slice

18

- end: Optional end index for search slice

19

- ignore_white_space: If True, ignore whitespace characters in matching

20

21

Returns:

22

Iterator of tuples (end_index, value) where:

23

- end_index is the end index in input string where a trie key was found

24

- value is the value associated with found key string

25

26

Raises:

27

- AttributeError: If automaton not built with make_automaton()

28

"""

29

```

30

31

#### Usage Examples

32

33

```python

34

import ahocorasick

35

36

# Create and build automaton

37

automaton = ahocorasick.Automaton()

38

words = ['he', 'she', 'his', 'hers']

39

for i, word in enumerate(words):

40

automaton.add_word(word, (i, word))

41

42

automaton.make_automaton()

43

44

# Basic search

45

text = "she sells seashells by the seashore"

46

for end_index, (insert_order, original_string) in automaton.iter(text):

47

start_index = end_index - len(original_string) + 1

48

print(f"Found '{original_string}' at {start_index}-{end_index}")

49

50

# Search with slice bounds

51

text = "he she his hers"

52

matches = list(automaton.iter(text, start=3, end=10)) # "she hi"

53

for end_index, (insert_order, original_string) in matches:

54

print(f"Match: '{original_string}' ending at {end_index}")

55

56

# Ignore whitespace during matching

57

automaton_no_space = ahocorasick.Automaton()

58

automaton_no_space.add_word('hello', 'greeting')

59

automaton_no_space.make_automaton()

60

61

text_with_spaces = "h e l l o w o r l d"

62

matches = list(automaton_no_space.iter(text_with_spaces, ignore_white_space=True))

63

print(matches) # Will find 'hello' despite spaces

64

```

65

66

### Longest Match Iterator

67

68

Iterate over non-overlapping longest matches only.

69

70

```python { .api }

71

def iter_long(self, string, start=0, end=None):

72

"""

73

Perform modified Aho-Corasick search which matches the longest words from set.

74

75

Parameters:

76

- string: Input text to search

77

- start: Optional start index for search slice

78

- end: Optional end index for search slice

79

80

Returns:

81

Iterator of tuples (end_index, value) for non-overlapping longest matches

82

83

Raises:

84

- AttributeError: If automaton not built with make_automaton()

85

"""

86

```

87

88

#### Usage Example

89

90

```python

91

automaton = ahocorasick.Automaton()

92

# Add overlapping patterns of different lengths

93

patterns = ['he', 'her', 'hers', 'she']

94

for i, pattern in enumerate(patterns):

95

automaton.add_word(pattern, (i, pattern))

96

97

automaton.make_automaton()

98

99

text = "she hers"

100

101

# Standard iterator finds all overlapping matches

102

standard_matches = list(automaton.iter(text))

103

print("Standard matches:", standard_matches)

104

# Output: [(2, (3, 'she')), (2, (0, 'he')), (7, (0, 'he')), (7, (1, 'her')), (7, (2, 'hers'))]

105

106

# Long iterator finds only longest non-overlapping matches

107

long_matches = list(automaton.iter_long(text))

108

print("Longest matches:", long_matches)

109

# Output: [(2, (3, 'she')), (7, (2, 'hers'))]

110

```

111

112

### Callback-based Search

113

114

Search for patterns and invoke a callback function for each match.

115

116

```python { .api }

117

def find_all(self, string, callback, start=0, end=None):

118

"""

119

Perform Aho-Corasick search procedure and invoke callback for each match.

120

121

Parameters:

122

- string: Input text to search

123

- callback: Callable that accepts (end_index, value) arguments

124

- start: Optional start index for search slice

125

- end: Optional end index for search slice

126

127

Raises:

128

- AttributeError: If automaton not built with make_automaton()

129

"""

130

```

131

132

#### Usage Example

133

134

```python

135

automaton = ahocorasick.Automaton()

136

words = ['cat', 'car', 'card']

137

for word in words:

138

automaton.add_word(word, word.upper())

139

140

automaton.make_automaton()

141

142

# Callback function to process matches

143

matches_found = []

144

def collect_matches(end_index, value):

145

matches_found.append({

146

'end_position': end_index,

147

'matched_word': value,

148

'length': len(value)

149

})

150

151

# Search with callback

152

text = "the car and cat ran to the card table"

153

automaton.find_all(text, collect_matches)

154

155

for match in matches_found:

156

start_pos = match['end_position'] - match['length'] + 1

157

print(f"Found '{match['matched_word']}' at position {start_pos}-{match['end_position']}")

158

```

159

160

### Search Iterator Manipulation

161

162

Advanced control over search iteration state.

163

164

```python { .api }

165

class AutomatonSearchIter:

166

def set(self, string, reset=False):

167

"""

168

Set a new string to search in the iterator.

169

170

Parameters:

171

- string: New string to search

172

- reset: If False (default), continue search state;

173

if True, reset automaton state

174

175

Usage:

176

Allows searching large strings in chunks or continuing

177

search across multiple string segments.

178

"""

179

```

180

181

#### Usage Example

182

183

```python

184

automaton = ahocorasick.Automaton()

185

automaton.add_word('hello', 'greeting')

186

automaton.add_word('world', 'place')

187

automaton.make_automaton()

188

189

# Get iterator

190

search_iter = automaton.iter("hello")

191

192

# Process first chunk

193

for match in search_iter:

194

print(f"Match in first chunk: {match}")

195

196

# Continue with second chunk (maintains state)

197

search_iter.set(" world", reset=False)

198

for match in search_iter:

199

print(f"Match in second chunk: {match}")

200

201

# Reset and search new content

202

search_iter.set("hello again", reset=True)

203

for match in search_iter:

204

print(f"Match after reset: {match}")

205

```

206

207

### Long Search Iterator Manipulation

208

209

Advanced control over longest-match search iteration state.

210

211

```python { .api }

212

class AutomatonSearchIterLong:

213

def set(self, string, reset=False):

214

"""

215

Set a new string to search in the longest-match iterator.

216

217

Parameters:

218

- string: New string to search

219

- reset: If False (default), continue search state;

220

if True, reset automaton state

221

222

Usage:

223

Allows searching large strings in chunks for longest matches

224

or continuing search across multiple string segments.

225

"""

226

```

227

228

#### Usage Example

229

230

```python

231

automaton = ahocorasick.Automaton()

232

automaton.add_word('he', 'short')

233

automaton.add_word('hello', 'long')

234

automaton.make_automaton()

235

236

# Get longest-match iterator

237

long_iter = automaton.iter_long("hello")

238

239

# Process first chunk

240

for match in long_iter:

241

print(f"Longest match in first chunk: {match}")

242

243

# Continue with second chunk (maintains state)

244

long_iter.set(" world", reset=False)

245

for match in long_iter:

246

print(f"Longest match in second chunk: {match}")

247

248

# Reset and search new content

249

long_iter.set("hello again", reset=True)

250

for match in long_iter:

251

print(f"Longest match after reset: {match}")

252

```

253

254

## Search Performance Notes

255

256

### Memory Efficiency

257

- All search methods return iterators, not lists

258

- Memory usage is constant regardless of text size or match count

259

- Suitable for processing large texts or streaming data

260

261

### Time Complexity

262

- Search time is O(n + m) where n is text length and m is number of matches

263

- Performance independent of number of patterns (within reasonable limits)

264

- Worst-case and average-case performance are similar

265

266

### Pattern Overlap Handling

267

- `iter()`: Returns all overlapping matches

268

- `iter_long()`: Returns only longest non-overlapping matches

269

- `find_all()`: Processes all overlapping matches via callback

270

271

## Common Search Patterns

272

273

### Extract All Matches with Positions

274

275

```python

276

def extract_matches(automaton, text):

277

"""Extract all matches with their positions and original text."""

278

matches = []

279

for end_index, value in automaton.iter(text):

280

# Reconstruct original pattern length

281

if isinstance(value, tuple) and len(value) == 2:

282

_, pattern = value

283

start_index = end_index - len(pattern) + 1

284

matches.append({

285

'pattern': pattern,

286

'start': start_index,

287

'end': end_index,

288

'text': text[start_index:end_index + 1]

289

})

290

return matches

291

```

292

293

### Count Pattern Occurrences

294

295

```python

296

def count_patterns(automaton, text):

297

"""Count occurrences of each pattern."""

298

counts = {}

299

for end_index, value in automaton.iter(text):

300

pattern = value if isinstance(value, str) else value[1]

301

counts[pattern] = counts.get(pattern, 0) + 1

302

return counts

303

```

304

305

### Find Non-overlapping Matches

306

307

```python

308

def find_non_overlapping(automaton, text):

309

"""Find all non-overlapping matches (greedy left-to-right)."""

310

matches = []

311

last_end = -1

312

313

for end_index, value in automaton.iter(text):

314

pattern = value if isinstance(value, str) else value[1]

315

start_index = end_index - len(pattern) + 1

316

317

if start_index > last_end:

318

matches.append((start_index, end_index, pattern))

319

last_end = end_index

320

321

return matches

322

```