0
# Specialized Algorithms & Structures
1
2
Advanced data structures for specific algorithmic applications including bloom filters, suffix arrays, and search indexes.
3
4
## Capabilities
5
6
### BloomFilter
7
8
Probabilistic data structure for membership testing with configurable false positive rate.
9
10
```typescript { .api }
11
/**
12
* Bloom filter for probabilistic membership testing
13
* @param capacity Expected number of items
14
*/
15
function BloomFilter(capacity: number): BloomFilter;
16
17
/**
18
* Bloom filter with options object
19
* @param options Configuration object with capacity and optional errorRate
20
*/
21
function BloomFilter(options: BloomFilterOptions): BloomFilter;
22
23
type BloomFilterOptions = {
24
capacity: number;
25
errorRate?: number;
26
}
27
28
interface BloomFilter {
29
/** Expected capacity */
30
readonly capacity: number;
31
32
/** Configured error rate */
33
readonly errorRate: number;
34
35
/** Number of hash functions used */
36
readonly hashFunctions: number;
37
38
/** Add string to filter */
39
add(item: string): this;
40
41
/** Test if string might be in filter */
42
test(item: string): boolean;
43
44
/** Remove all items */
45
clear(): void;
46
47
/** Export filter state as Uint8Array */
48
toJSON(): Uint8Array;
49
50
inspect(): any;
51
52
/** Create bloom filter from iterable */
53
static from(iterable: Iterable<string>, options?: number | BloomFilterOptions): BloomFilter;
54
}
55
```
56
57
### SuffixArray
58
59
Suffix array for efficient string matching and substring queries.
60
61
```typescript { .api }
62
/**
63
* Suffix array for string pattern matching
64
* @param text Input string or array of strings
65
*/
66
function SuffixArray(text: string | Array<string>): SuffixArray;
67
68
interface SuffixArray {
69
/** The suffix array */
70
readonly array: Uint32Array;
71
72
/** Length of the suffix array */
73
readonly length: number;
74
75
/** Original string(s) */
76
readonly string: string | Array<string>;
77
78
/** Search for pattern occurrences */
79
search(pattern: string): Array<number>;
80
81
/** Find longest common substring */
82
longestCommonSubstring(): {length: number, index: number} | null;
83
84
inspect(): any;
85
}
86
```
87
88
### GeneralizedSuffixArray
89
90
Suffix array for multiple strings with cross-string pattern matching.
91
92
```typescript { .api }
93
/**
94
* Generalized suffix array for multiple strings
95
* @param strings Array of input strings
96
*/
97
function GeneralizedSuffixArray(strings: Array<string>): GeneralizedSuffixArray;
98
99
interface GeneralizedSuffixArray {
100
readonly array: Uint32Array;
101
readonly length: number;
102
readonly strings: Array<string>;
103
104
/** Search for pattern across all strings */
105
search(pattern: string): Array<{string: number, index: number}>;
106
107
/** Find longest common substring across strings */
108
longestCommonSubstring(): {length: number, strings: Array<number>, index: number} | null;
109
110
inspect(): any;
111
}
112
```
113
114
### InvertedIndex
115
116
Full-text search index with TF-IDF scoring and configurable tokenization.
117
118
```typescript { .api }
119
/**
120
* Inverted index for full-text search
121
* @param tokenizer Optional custom tokenizer function
122
*/
123
function InvertedIndex(tokenizer?: TokenizerFunction): InvertedIndex;
124
125
interface InvertedIndex {
126
readonly size: number;
127
128
/** Add document with its tokens */
129
add(document: any, tokens: Array<string>): this;
130
131
/** Search for documents matching query */
132
search(query: Array<string>): Array<{document: any, score: number}>;
133
134
/** Get all documents containing term */
135
get(term: string): Array<any>;
136
137
/** Remove all documents and terms */
138
clear(): void;
139
140
inspect(): any;
141
}
142
143
type TokenizerFunction = (text: string) => Array<string>;
144
```
145
146
### SymSpell
147
148
Spelling correction data structure using symmetric delete spelling correction algorithm.
149
150
```typescript { .api }
151
/**
152
* SymSpell algorithm for spelling correction
153
* @param options Configuration options
154
*/
155
function SymSpell(options?: SymSpellOptions): SymSpell;
156
157
interface SymSpellOptions {
158
/** Maximum edit distance for corrections (default: 2) */
159
maxEditDistance?: number;
160
/** Minimum frequency threshold (default: 1) */
161
countThreshold?: number;
162
/** Maximum number of suggestions (default: 5) */
163
maxSuggestions?: number;
164
}
165
166
interface SymSpell {
167
readonly size: number;
168
169
/** Add word with optional frequency */
170
add(word: string, frequency?: number): this;
171
172
/** Get spelling corrections for query */
173
search(query: string, maxEditDistance?: number, suggestion?: SuggestionVerbosity): Array<Suggestion>;
174
175
/** Remove all words */
176
clear(): void;
177
178
inspect(): any;
179
}
180
181
interface Suggestion {
182
term: string;
183
distance: number;
184
frequency: number;
185
}
186
187
enum SuggestionVerbosity {
188
Top, // Only top suggestion
189
Closest, // All suggestions with minimum edit distance
190
All // All suggestions within edit distance
191
}
192
```
193
194
### PassjoinIndex
195
196
Index for efficient similarity joins using the Passjoin algorithm.
197
198
```typescript { .api }
199
/**
200
* Passjoin index for similarity joins
201
* @param options Configuration for similarity matching
202
*/
203
function PassjoinIndex(options: PassjoinOptions): PassjoinIndex;
204
205
interface PassjoinOptions {
206
/** Similarity threshold (0-1) */
207
threshold: number;
208
/** Tokenizer function */
209
tokenizer: TokenizerFunction;
210
/** Optional distance function */
211
distance?: DistanceFunction;
212
}
213
214
interface PassjoinIndex {
215
readonly size: number;
216
217
/** Add record to index */
218
add(record: any): this;
219
220
/** Find similar records above threshold */
221
search(query: any, threshold?: number): Array<{record: any, similarity: number}>;
222
223
/** Remove all records */
224
clear(): void;
225
226
inspect(): any;
227
}
228
```
229
230
### HashedArrayTree
231
232
Hash array mapped trie for efficient dynamic array operations.
233
234
```typescript { .api }
235
/**
236
* Hashed Array Tree for dynamic arrays
237
* @param options Optional configuration
238
*/
239
function HashedArrayTree<T>(options?: HATOptions): HashedArrayTree<T>;
240
241
interface HATOptions {
242
initialCapacity?: number;
243
factor?: number;
244
}
245
246
interface HashedArrayTree<T> {
247
readonly length: number;
248
readonly size: number;
249
250
/** Add item to end */
251
push(item: T): number;
252
253
/** Get item at index */
254
get(index: number): T | undefined;
255
256
/** Set item at index */
257
set(index: number, value: T): void;
258
259
/** Remove and return last item */
260
pop(): T | undefined;
261
262
/** Remove all items */
263
clear(): void;
264
265
values(): IterableIterator<T>;
266
[Symbol.iterator](): IterableIterator<T>;
267
268
inspect(): any;
269
}
270
```
271
272
## Usage Examples
273
274
### BloomFilter Example
275
276
```typescript
277
import {BloomFilter} from "mnemonist";
278
279
// URL deduplication with 1% false positive rate
280
const seenUrls = new BloomFilter(1000000, 0.01);
281
282
function crawlUrl(url: string) {
283
if (seenUrls.test(url)) {
284
console.log("Probably already crawled:", url);
285
return; // Might be false positive, but likely already seen
286
}
287
288
// Crawl the URL
289
console.log("Crawling new URL:", url);
290
seenUrls.add(url);
291
}
292
293
crawlUrl("https://example.com");
294
crawlUrl("https://example.org");
295
crawlUrl("https://example.com"); // "Probably already crawled"
296
```
297
298
### SuffixArray Example
299
300
```typescript
301
import {SuffixArray} from "mnemonist";
302
303
// Text search and analysis
304
const text = "banana";
305
const suffixArray = new SuffixArray(text);
306
307
// Find all occurrences of pattern
308
const matches = suffixArray.search("ana");
309
console.log(matches); // [1, 3] - positions where "ana" occurs
310
311
// Find longest repeated substring
312
const lcs = suffixArray.longestCommonSubstring();
313
console.log(lcs); // {length: 3, index: 1} - "ana" at position 1
314
315
// DNA sequence analysis
316
const dna = "ATCGATCGATCG";
317
const dnaArray = new SuffixArray(dna);
318
const repeats = dnaArray.search("ATCG");
319
console.log(repeats); // All positions of ATCG pattern
320
```
321
322
### InvertedIndex Example
323
324
```typescript
325
import {InvertedIndex} from "mnemonist";
326
327
// Simple search engine
328
const searchIndex = new InvertedIndex();
329
330
// Custom tokenizer
331
function tokenize(text: string): Array<string> {
332
return text.toLowerCase()
333
.replace(/[^\w\s]/g, "")
334
.split(/\s+/)
335
.filter(token => token.length > 2);
336
}
337
338
const index = new InvertedIndex(tokenize);
339
340
// Add documents
341
index.add({id: 1, title: "Machine Learning Basics"},
342
tokenize("Introduction to machine learning algorithms"));
343
index.add({id: 2, title: "Deep Learning Guide"},
344
tokenize("Deep neural networks and machine learning"));
345
index.add({id: 3, title: "Data Science Tools"},
346
tokenize("Tools for data analysis and statistics"));
347
348
// Search for documents
349
const results = index.search(tokenize("machine learning"));
350
results.forEach(result => {
351
console.log(`Document ${result.document.id}: ${result.score}`);
352
});
353
```
354
355
### SymSpell Example
356
357
```typescript
358
import {SymSpell} from "mnemonist";
359
360
// Spell checker
361
const spellChecker = new SymSpell({
362
maxEditDistance: 2,
363
maxSuggestions: 5
364
});
365
366
// Add dictionary words with frequencies
367
const dictionary = [
368
{word: "javascript", freq: 1000},
369
{word: "typescript", freq: 800},
370
{word: "python", freq: 1200},
371
{word: "programming", freq: 900}
372
];
373
374
dictionary.forEach(({word, freq}) => {
375
spellChecker.add(word, freq);
376
});
377
378
// Get spelling suggestions
379
const suggestions = spellChecker.search("javascrip");
380
suggestions.forEach(suggestion => {
381
console.log(`${suggestion.term} (distance: ${suggestion.distance}, freq: ${suggestion.frequency})`);
382
});
383
// Output: javascript (distance: 1, freq: 1000)
384
```
385
386
### PassjoinIndex Example
387
388
```typescript
389
import {PassjoinIndex} from "mnemonist";
390
391
// Duplicate detection in records
392
function jaccardSimilarity(tokensA: Set<string>, tokensB: Set<string>): number {
393
const intersection = new Set([...tokensA].filter(x => tokensB.has(x)));
394
const union = new Set([...tokensA, ...tokensB]);
395
return intersection.size / union.size;
396
}
397
398
const duplicateDetector = new PassjoinIndex({
399
threshold: 0.8,
400
tokenizer: (text: string) => text.toLowerCase().split(/\s+/),
401
distance: jaccardSimilarity
402
});
403
404
// Add product records
405
duplicateDetector.add({id: 1, name: "Apple iPhone 13 Pro Max"});
406
duplicateDetector.add({id: 2, name: "Samsung Galaxy S21 Ultra"});
407
duplicateDetector.add({id: 3, name: "iPhone 13 Pro Max Apple"});
408
409
// Find duplicates
410
const query = {name: "Apple iPhone 13 Pro Max 256GB"};
411
const similar = duplicateDetector.search(query, 0.7);
412
similar.forEach(match => {
413
console.log(`Similar: ${match.record.name} (${match.similarity})`);
414
});
415
```
416
417
## Performance Characteristics
418
419
| Structure | Add/Insert | Search | Space | Notes |
420
|-----------|------------|--------|-------|--------|
421
| BloomFilter | O(k) | O(k) | O(m) | k=hash functions, m=bits |
422
| SuffixArray | O(n log n) | O(m log n) | O(n) | n=text length, m=pattern |
423
| InvertedIndex | O(d) | O(q + r) | O(V + D) | d=doc terms, q=query, r=results |
424
| SymSpell | O(1) | O(n) | O(n·k) | n=dictionary size, k=deletions |
425
| PassjoinIndex | O(s) | O(s·c) | O(n·s) | s=signature size, c=candidates |
426
| HashedArrayTree | O(1) amortized | O(1) | O(n) | Dynamic array operations |
427
428
## Use Cases
429
430
### BloomFilter
431
- Web crawling deduplication
432
- Database query optimization
433
- Distributed caching
434
- Network packet filtering
435
436
### SuffixArray
437
- Bioinformatics (DNA/protein analysis)
438
- Full-text search engines
439
- Data compression algorithms
440
- Plagiarism detection
441
442
### InvertedIndex
443
- Search engines
444
- Document retrieval systems
445
- Log analysis
446
- Content management systems
447
448
### SymSpell
449
- Spell checkers
450
- Query suggestion systems
451
- OCR error correction
452
- Fuzzy search applications
453
454
### PassjoinIndex
455
- Duplicate record detection
456
- Data cleaning pipelines
457
- Entity resolution
458
- Similarity search systems