or run

npx @tessl/cli init
Log in

Version

Tile

Overview

Evals

Files

Files

docs

caches.mdheaps.mdindex.mdlinear.mdmaps.mdsets.mdspecialized.mdtrees.mdvectors.md

specialized.mddocs/

0

# Specialized Algorithms & Structures

1

2

Advanced data structures for specific algorithmic applications including bloom filters, suffix arrays, and search indexes.

3

4

## Capabilities

5

6

### BloomFilter

7

8

Probabilistic data structure for membership testing with configurable false positive rate.

9

10

```typescript { .api }

11

/**

12

* Bloom filter for probabilistic membership testing

13

* @param capacity Expected number of items

14

*/

15

function BloomFilter(capacity: number): BloomFilter;

16

17

/**

18

* Bloom filter with options object

19

* @param options Configuration object with capacity and optional errorRate

20

*/

21

function BloomFilter(options: BloomFilterOptions): BloomFilter;

22

23

type BloomFilterOptions = {

24

capacity: number;

25

errorRate?: number;

26

}

27

28

interface BloomFilter {

29

/** Expected capacity */

30

readonly capacity: number;

31

32

/** Configured error rate */

33

readonly errorRate: number;

34

35

/** Number of hash functions used */

36

readonly hashFunctions: number;

37

38

/** Add string to filter */

39

add(item: string): this;

40

41

/** Test if string might be in filter */

42

test(item: string): boolean;

43

44

/** Remove all items */

45

clear(): void;

46

47

/** Export filter state as Uint8Array */

48

toJSON(): Uint8Array;

49

50

inspect(): any;

51

52

/** Create bloom filter from iterable */

53

static from(iterable: Iterable<string>, options?: number | BloomFilterOptions): BloomFilter;

54

}

55

```

56

57

### SuffixArray

58

59

Suffix array for efficient string matching and substring queries.

60

61

```typescript { .api }

62

/**

63

* Suffix array for string pattern matching

64

* @param text Input string or array of strings

65

*/

66

function SuffixArray(text: string | Array<string>): SuffixArray;

67

68

interface SuffixArray {

69

/** The suffix array */

70

readonly array: Uint32Array;

71

72

/** Length of the suffix array */

73

readonly length: number;

74

75

/** Original string(s) */

76

readonly string: string | Array<string>;

77

78

/** Search for pattern occurrences */

79

search(pattern: string): Array<number>;

80

81

/** Find longest common substring */

82

longestCommonSubstring(): {length: number, index: number} | null;

83

84

inspect(): any;

85

}

86

```

87

88

### GeneralizedSuffixArray

89

90

Suffix array for multiple strings with cross-string pattern matching.

91

92

```typescript { .api }

93

/**

94

* Generalized suffix array for multiple strings

95

* @param strings Array of input strings

96

*/

97

function GeneralizedSuffixArray(strings: Array<string>): GeneralizedSuffixArray;

98

99

interface GeneralizedSuffixArray {

100

readonly array: Uint32Array;

101

readonly length: number;

102

readonly strings: Array<string>;

103

104

/** Search for pattern across all strings */

105

search(pattern: string): Array<{string: number, index: number}>;

106

107

/** Find longest common substring across strings */

108

longestCommonSubstring(): {length: number, strings: Array<number>, index: number} | null;

109

110

inspect(): any;

111

}

112

```

113

114

### InvertedIndex

115

116

Full-text search index with TF-IDF scoring and configurable tokenization.

117

118

```typescript { .api }

119

/**

120

* Inverted index for full-text search

121

* @param tokenizer Optional custom tokenizer function

122

*/

123

function InvertedIndex(tokenizer?: TokenizerFunction): InvertedIndex;

124

125

interface InvertedIndex {

126

readonly size: number;

127

128

/** Add document with its tokens */

129

add(document: any, tokens: Array<string>): this;

130

131

/** Search for documents matching query */

132

search(query: Array<string>): Array<{document: any, score: number}>;

133

134

/** Get all documents containing term */

135

get(term: string): Array<any>;

136

137

/** Remove all documents and terms */

138

clear(): void;

139

140

inspect(): any;

141

}

142

143

type TokenizerFunction = (text: string) => Array<string>;

144

```

145

146

### SymSpell

147

148

Spelling correction data structure using symmetric delete spelling correction algorithm.

149

150

```typescript { .api }

151

/**

152

* SymSpell algorithm for spelling correction

153

* @param options Configuration options

154

*/

155

function SymSpell(options?: SymSpellOptions): SymSpell;

156

157

interface SymSpellOptions {

158

/** Maximum edit distance for corrections (default: 2) */

159

maxEditDistance?: number;

160

/** Minimum frequency threshold (default: 1) */

161

countThreshold?: number;

162

/** Maximum number of suggestions (default: 5) */

163

maxSuggestions?: number;

164

}

165

166

interface SymSpell {

167

readonly size: number;

168

169

/** Add word with optional frequency */

170

add(word: string, frequency?: number): this;

171

172

/** Get spelling corrections for query */

173

search(query: string, maxEditDistance?: number, suggestion?: SuggestionVerbosity): Array<Suggestion>;

174

175

/** Remove all words */

176

clear(): void;

177

178

inspect(): any;

179

}

180

181

interface Suggestion {

182

term: string;

183

distance: number;

184

frequency: number;

185

}

186

187

enum SuggestionVerbosity {

188

Top, // Only top suggestion

189

Closest, // All suggestions with minimum edit distance

190

All // All suggestions within edit distance

191

}

192

```

193

194

### PassjoinIndex

195

196

Index for efficient similarity joins using the Passjoin algorithm.

197

198

```typescript { .api }

199

/**

200

* Passjoin index for similarity joins

201

* @param options Configuration for similarity matching

202

*/

203

function PassjoinIndex(options: PassjoinOptions): PassjoinIndex;

204

205

interface PassjoinOptions {

206

/** Similarity threshold (0-1) */

207

threshold: number;

208

/** Tokenizer function */

209

tokenizer: TokenizerFunction;

210

/** Optional distance function */

211

distance?: DistanceFunction;

212

}

213

214

interface PassjoinIndex {

215

readonly size: number;

216

217

/** Add record to index */

218

add(record: any): this;

219

220

/** Find similar records above threshold */

221

search(query: any, threshold?: number): Array<{record: any, similarity: number}>;

222

223

/** Remove all records */

224

clear(): void;

225

226

inspect(): any;

227

}

228

```

229

230

### HashedArrayTree

231

232

Hash array mapped trie for efficient dynamic array operations.

233

234

```typescript { .api }

235

/**

236

* Hashed Array Tree for dynamic arrays

237

* @param options Optional configuration

238

*/

239

function HashedArrayTree<T>(options?: HATOptions): HashedArrayTree<T>;

240

241

interface HATOptions {

242

initialCapacity?: number;

243

factor?: number;

244

}

245

246

interface HashedArrayTree<T> {

247

readonly length: number;

248

readonly size: number;

249

250

/** Add item to end */

251

push(item: T): number;

252

253

/** Get item at index */

254

get(index: number): T | undefined;

255

256

/** Set item at index */

257

set(index: number, value: T): void;

258

259

/** Remove and return last item */

260

pop(): T | undefined;

261

262

/** Remove all items */

263

clear(): void;

264

265

values(): IterableIterator<T>;

266

[Symbol.iterator](): IterableIterator<T>;

267

268

inspect(): any;

269

}

270

```

271

272

## Usage Examples

273

274

### BloomFilter Example

275

276

```typescript

277

import {BloomFilter} from "mnemonist";

278

279

// URL deduplication with 1% false positive rate

280

const seenUrls = new BloomFilter(1000000, 0.01);

281

282

function crawlUrl(url: string) {

283

if (seenUrls.test(url)) {

284

console.log("Probably already crawled:", url);

285

return; // Might be false positive, but likely already seen

286

}

287

288

// Crawl the URL

289

console.log("Crawling new URL:", url);

290

seenUrls.add(url);

291

}

292

293

crawlUrl("https://example.com");

294

crawlUrl("https://example.org");

295

crawlUrl("https://example.com"); // "Probably already crawled"

296

```

297

298

### SuffixArray Example

299

300

```typescript

301

import {SuffixArray} from "mnemonist";

302

303

// Text search and analysis

304

const text = "banana";

305

const suffixArray = new SuffixArray(text);

306

307

// Find all occurrences of pattern

308

const matches = suffixArray.search("ana");

309

console.log(matches); // [1, 3] - positions where "ana" occurs

310

311

// Find longest repeated substring

312

const lcs = suffixArray.longestCommonSubstring();

313

console.log(lcs); // {length: 3, index: 1} - "ana" at position 1

314

315

// DNA sequence analysis

316

const dna = "ATCGATCGATCG";

317

const dnaArray = new SuffixArray(dna);

318

const repeats = dnaArray.search("ATCG");

319

console.log(repeats); // All positions of ATCG pattern

320

```

321

322

### InvertedIndex Example

323

324

```typescript

325

import {InvertedIndex} from "mnemonist";

326

327

// Simple search engine

328

const searchIndex = new InvertedIndex();

329

330

// Custom tokenizer

331

function tokenize(text: string): Array<string> {

332

return text.toLowerCase()

333

.replace(/[^\w\s]/g, "")

334

.split(/\s+/)

335

.filter(token => token.length > 2);

336

}

337

338

const index = new InvertedIndex(tokenize);

339

340

// Add documents

341

index.add({id: 1, title: "Machine Learning Basics"},

342

tokenize("Introduction to machine learning algorithms"));

343

index.add({id: 2, title: "Deep Learning Guide"},

344

tokenize("Deep neural networks and machine learning"));

345

index.add({id: 3, title: "Data Science Tools"},

346

tokenize("Tools for data analysis and statistics"));

347

348

// Search for documents

349

const results = index.search(tokenize("machine learning"));

350

results.forEach(result => {

351

console.log(`Document ${result.document.id}: ${result.score}`);

352

});

353

```

354

355

### SymSpell Example

356

357

```typescript

358

import {SymSpell} from "mnemonist";

359

360

// Spell checker

361

const spellChecker = new SymSpell({

362

maxEditDistance: 2,

363

maxSuggestions: 5

364

});

365

366

// Add dictionary words with frequencies

367

const dictionary = [

368

{word: "javascript", freq: 1000},

369

{word: "typescript", freq: 800},

370

{word: "python", freq: 1200},

371

{word: "programming", freq: 900}

372

];

373

374

dictionary.forEach(({word, freq}) => {

375

spellChecker.add(word, freq);

376

});

377

378

// Get spelling suggestions

379

const suggestions = spellChecker.search("javascrip");

380

suggestions.forEach(suggestion => {

381

console.log(`${suggestion.term} (distance: ${suggestion.distance}, freq: ${suggestion.frequency})`);

382

});

383

// Output: javascript (distance: 1, freq: 1000)

384

```

385

386

### PassjoinIndex Example

387

388

```typescript

389

import {PassjoinIndex} from "mnemonist";

390

391

// Duplicate detection in records

392

function jaccardSimilarity(tokensA: Set<string>, tokensB: Set<string>): number {

393

const intersection = new Set([...tokensA].filter(x => tokensB.has(x)));

394

const union = new Set([...tokensA, ...tokensB]);

395

return intersection.size / union.size;

396

}

397

398

const duplicateDetector = new PassjoinIndex({

399

threshold: 0.8,

400

tokenizer: (text: string) => text.toLowerCase().split(/\s+/),

401

distance: jaccardSimilarity

402

});

403

404

// Add product records

405

duplicateDetector.add({id: 1, name: "Apple iPhone 13 Pro Max"});

406

duplicateDetector.add({id: 2, name: "Samsung Galaxy S21 Ultra"});

407

duplicateDetector.add({id: 3, name: "iPhone 13 Pro Max Apple"});

408

409

// Find duplicates

410

const query = {name: "Apple iPhone 13 Pro Max 256GB"};

411

const similar = duplicateDetector.search(query, 0.7);

412

similar.forEach(match => {

413

console.log(`Similar: ${match.record.name} (${match.similarity})`);

414

});

415

```

416

417

## Performance Characteristics

418

419

| Structure | Add/Insert | Search | Space | Notes |

420

|-----------|------------|--------|-------|--------|

421

| BloomFilter | O(k) | O(k) | O(m) | k=hash functions, m=bits |

422

| SuffixArray | O(n log n) | O(m log n) | O(n) | n=text length, m=pattern |

423

| InvertedIndex | O(d) | O(q + r) | O(V + D) | d=doc terms, q=query, r=results |

424

| SymSpell | O(1) | O(n) | O(n·k) | n=dictionary size, k=deletions |

425

| PassjoinIndex | O(s) | O(s·c) | O(n·s) | s=signature size, c=candidates |

426

| HashedArrayTree | O(1) amortized | O(1) | O(n) | Dynamic array operations |

427

428

## Use Cases

429

430

### BloomFilter

431

- Web crawling deduplication

432

- Database query optimization

433

- Distributed caching

434

- Network packet filtering

435

436

### SuffixArray

437

- Bioinformatics (DNA/protein analysis)

438

- Full-text search engines

439

- Data compression algorithms

440

- Plagiarism detection

441

442

### InvertedIndex

443

- Search engines

444

- Document retrieval systems

445

- Log analysis

446

- Content management systems

447

448

### SymSpell

449

- Spell checkers

450

- Query suggestion systems

451

- OCR error correction

452

- Fuzzy search applications

453

454

### PassjoinIndex

455

- Duplicate record detection

456

- Data cleaning pipelines

457

- Entity resolution

458

- Similarity search systems