
# N-Gram and Shingle-Based Algorithms


Algorithms that convert strings into sets or profiles of n-character sequences (shingles or n-grams) and compute similarity based on these representations. These algorithms support both direct string comparison and pre-computed profile optimization for large datasets.

## Capabilities

### Q-Gram Distance

Q-gram distance as defined by Ukkonen: the L1 norm of the difference between the two strings' character n-gram profiles. It yields a (scaled) lower bound on Levenshtein distance but can be computed in O(m + n) time compared to Levenshtein's O(m*n).

```python { .api }
class QGram(ShingleBased, StringDistance):
    def __init__(self, k: int = 3):
        """
        Initialize Q-Gram with shingle size.

        Args:
            k: Size of character n-grams (shingles) to use (default: 3)
        """

    def distance(self, s0: str, s1: str) -> float:
        """
        Calculate Q-gram distance between two strings.

        Args:
            s0: First string
            s1: Second string

        Returns:
            float: Q-gram distance (sum of absolute differences in n-gram counts)

        Raises:
            TypeError: If either string is None
        """

    @staticmethod
    def distance_profile(profile0: dict, profile1: dict) -> float:
        """
        Calculate Q-gram distance between pre-computed profiles.

        Args:
            profile0: N-gram profile dictionary for first string
            profile1: N-gram profile dictionary for second string

        Returns:
            float: Q-gram distance between profiles
        """
```

**Usage Examples:**

```python
from similarity.qgram import QGram

# Basic Q-gram distance with different k values
qgram2 = QGram(2)  # Bigrams
qgram3 = QGram(3)  # Trigrams

distance = qgram2.distance('ABCD', 'ABCE')  # 2: the bigrams 'CD' and 'CE' differ
distance = qgram3.distance('hello', 'hallo')  # 4: trigrams 'hel', 'ell' vs 'hal', 'all'

# Profile-based computation for large datasets
qgram = QGram(2)
strings = ['hello', 'hallo', 'hullo', 'hillo']

# Pre-compute profiles once
profiles = [qgram.get_profile(s) for s in strings]

# Compare all pairs efficiently
for i in range(len(strings)):
    for j in range(i + 1, len(strings)):
        dist = QGram.distance_profile(profiles[i], profiles[j])
        print(f"'{strings[i]}' vs '{strings[j]}': {dist}")
```
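The L1-norm definition can be reproduced with the standard library alone. The helpers below are an illustrative sketch, not part of the package's API, and they skip any whitespace normalization the library may apply before shingling:

```python
from collections import Counter

def qgram_profile(s: str, k: int = 3) -> Counter:
    # Count every k-character substring (shingle) in s.
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def qgram_distance(s0: str, s1: str, k: int = 3) -> int:
    # L1 norm of the profile difference: sum of |count in s0 - count in s1|
    # over every k-gram that appears in either string.
    p0, p1 = qgram_profile(s0, k), qgram_profile(s1, k)
    return sum(abs(p0[g] - p1[g]) for g in p0.keys() | p1.keys())

print(qgram_distance('ABCD', 'ABCE', k=2))  # 2: 'CD' and 'CE' each occur in only one string
```

Because each profile is built in a single pass over its string, the distance costs O(m + n) rather than the O(m*n) of an edit-distance table.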

### Cosine Similarity

Cosine similarity treats n-gram profiles as vectors and computes the cosine of the angle between them. Because the measure is normalized by vector length, it works well for comparing documents and texts of different lengths.

```python { .api }
class Cosine(ShingleBased, NormalizedStringDistance, NormalizedStringSimilarity):
    def __init__(self, k: int):
        """
        Initialize Cosine similarity with shingle size.

        Args:
            k: Size of character n-grams (shingles) to use
        """

    def distance(self, s0: str, s1: str) -> float:
        """
        Calculate cosine distance (1 - cosine similarity).

        Args:
            s0: First string
            s1: Second string

        Returns:
            float: Cosine distance in range [0.0, 1.0] where 0.0 = identical

        Raises:
            TypeError: If either string is None
        """

    def similarity(self, s0: str, s1: str) -> float:
        """
        Calculate cosine similarity between two strings.

        Args:
            s0: First string
            s1: Second string

        Returns:
            float: Cosine similarity in range [0.0, 1.0] where 1.0 = identical

        Raises:
            TypeError: If either string is None
        """

    def similarity_profiles(self, profile0: dict, profile1: dict) -> float:
        """
        Calculate cosine similarity between pre-computed profiles.

        Args:
            profile0: N-gram profile dictionary for first string
            profile1: N-gram profile dictionary for second string

        Returns:
            float: Cosine similarity between profiles
        """
```

**Usage Examples:**

```python
from similarity.cosine import Cosine

# Document similarity with different k values
cosine3 = Cosine(3)
cosine4 = Cosine(4)

# Text comparison
text1 = "The quick brown fox jumps over the lazy dog"
text2 = "A quick brown fox jumped over a lazy dog"

similarity = cosine3.similarity(text1, text2)
distance = cosine3.distance(text1, text2)
print(f"Cosine similarity: {similarity:.3f}")
print(f"Cosine distance: {distance:.3f}")

# Profile-based computation
cosine = Cosine(2)
documents = [
    "machine learning algorithms",
    "deep learning neural networks",
    "artificial intelligence systems",
    "natural language processing"
]

# Pre-compute profiles
profiles = [cosine.get_profile(doc) for doc in documents]

# Find the most similar pair of documents
max_sim = 0.0
best_pair = None
for i in range(len(documents)):
    for j in range(i + 1, len(documents)):
        sim = cosine.similarity_profiles(profiles[i], profiles[j])
        if sim > max_sim:
            max_sim = sim
            best_pair = (i, j)

if best_pair:
    i, j = best_pair
    print(f"Most similar: '{documents[i]}' and '{documents[j]}' ({max_sim:.3f})")
```
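To see what a profile-based cosine computation amounts to, here is a hedged standard-library sketch: `profile` below is a stand-in for the library's `get_profile` (minus any whitespace normalization), and `cosine_similarity` is the usual dot product over shared k-grams divided by the product of the vector norms:

```python
import math
from collections import Counter

def profile(s: str, k: int = 2) -> Counter:
    # k-gram occurrence counts, treating the profile as a sparse vector.
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def cosine_similarity(p0: Counter, p1: Counter) -> float:
    # Dot product over k-grams common to both profiles...
    dot = sum(p0[g] * p1[g] for g in p0.keys() & p1.keys())
    # ...normalized by the Euclidean norm of each count vector.
    norm0 = math.sqrt(sum(c * c for c in p0.values()))
    norm1 = math.sqrt(sum(c * c for c in p1.values()))
    return dot / (norm0 * norm1)

print(cosine_similarity(profile('hello'), profile('hello')))  # 1.0 for identical strings
```

The normalization by vector length is what makes the score insensitive to string length: doubling every count leaves the cosine unchanged.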

### Jaccard Index

Jaccard index treats n-gram profiles as sets (ignoring counts) and computes the size of their intersection divided by the size of their union. This metric is particularly effective for comparing texts with different vocabulary sizes.

```python { .api }
class Jaccard(ShingleBased, MetricStringDistance, NormalizedStringDistance, NormalizedStringSimilarity):
    def __init__(self, k: int):
        """
        Initialize Jaccard index with shingle size.

        Args:
            k: Size of character n-grams (shingles) to use
        """

    def distance(self, s0: str, s1: str) -> float:
        """
        Calculate Jaccard distance (1 - Jaccard similarity).

        Args:
            s0: First string
            s1: Second string

        Returns:
            float: Jaccard distance in range [0.0, 1.0] where 0.0 = identical

        Raises:
            TypeError: If either string is None
        """

    def similarity(self, s0: str, s1: str) -> float:
        """
        Calculate Jaccard similarity between two strings.

        Args:
            s0: First string
            s1: Second string

        Returns:
            float: Jaccard similarity in range [0.0, 1.0] where 1.0 = identical

        Raises:
            TypeError: If either string is None
        """
```

**Usage Examples:**

```python
from similarity.jaccard import Jaccard

# Set-based similarity
jaccard2 = Jaccard(2)
jaccard3 = Jaccard(3)

# Compare strings as sets of character n-grams
similarity = jaccard2.similarity('hello', 'hallo')  # Set overlap
distance = jaccard2.distance('hello', 'hallo')

# Compare strings of different lengths
similarity = jaccard3.similarity('programming', 'program')
print(f"'programming' vs 'program': {similarity:.3f}")

# Document deduplication example
documents = [
    "artificial intelligence machine learning",
    "machine learning artificial intelligence",
    "deep learning neural networks",
    "neural networks for deep learning"
]

jaccard = Jaccard(3)
threshold = 0.5

print("Similar document pairs (Jaccard > 0.5):")
for i in range(len(documents)):
    for j in range(i + 1, len(documents)):
        sim = jaccard.similarity(documents[i], documents[j])
        if sim > threshold:
            print(f"  '{documents[i]}' vs '{documents[j]}': {sim:.3f}")
```
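The intersection-over-union definition fits in a few standard-library lines. This is a sketch of the idea, not the package's implementation; note the counts are deliberately discarded by building sets:

```python
def shingles(s: str, k: int = 2) -> set:
    # Distinct k-grams only; repeated shingles count once.
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def jaccard(s0: str, s1: str, k: int = 2) -> float:
    a, b = shingles(s0, k), shingles(s1, k)
    # |intersection| / |union|
    return len(a & b) / len(a | b)

# 'hello' -> {he, el, ll, lo}, 'hallo' -> {ha, al, ll, lo}
# shared: {ll, lo}; union has 6 bigrams, so the score is 2/6
print(jaccard('hello', 'hallo'))
```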

### Sorensen-Dice Coefficient

Similar to Jaccard but uses a different formula: 2 * |intersection| / (|set1| + |set2|). Counting each match twice in the numerator gives higher weight to matches, so Sorensen-Dice is often more sensitive to similarities than Jaccard.

```python { .api }
class SorensenDice(ShingleBased, NormalizedStringDistance, NormalizedStringSimilarity):
    def __init__(self, k: int = 3):
        """
        Initialize Sorensen-Dice with shingle size.

        Args:
            k: Size of character n-grams (shingles) to use (default: 3)
        """

    def distance(self, s0: str, s1: str) -> float:
        """
        Calculate Sorensen-Dice distance (1 - Sorensen-Dice similarity).

        Args:
            s0: First string
            s1: Second string

        Returns:
            float: Sorensen-Dice distance in range [0.0, 1.0] where 0.0 = identical

        Raises:
            TypeError: If either string is None
        """

    def similarity(self, s0: str, s1: str) -> float:
        """
        Calculate Sorensen-Dice similarity between two strings.

        Args:
            s0: First string
            s1: Second string

        Returns:
            float: Sorensen-Dice similarity in range [0.0, 1.0] where 1.0 = identical

        Raises:
            TypeError: If either string is None
        """
```

**Usage Examples:**

```python
from similarity.sorensen_dice import SorensenDice

# Basic Sorensen-Dice similarity
dice = SorensenDice(2)
similarity = dice.similarity('hello', 'hallo')
distance = dice.distance('hello', 'hallo')

# Compare with Jaccard
from similarity.jaccard import Jaccard
jaccard = Jaccard(2)

test_pairs = [
    ('programming', 'program'),
    ('hello', 'world'),
    ('similar', 'similarity')
]

for s1, s2 in test_pairs:
    dice_sim = dice.similarity(s1, s2)
    jaccard_sim = jaccard.similarity(s1, s2)
    print(f"'{s1}' vs '{s2}':")
    print(f"  Dice: {dice_sim:.3f}, Jaccard: {jaccard_sim:.3f}")
```
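The formula 2 * |intersection| / (|set1| + |set2|) can be checked by hand with a short standard-library sketch (again not the package's own code, and skipping any whitespace handling it performs):

```python
def shingles(s: str, k: int = 2) -> set:
    # Distinct k-grams, ignoring counts.
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def dice(s0: str, s1: str, k: int = 2) -> float:
    a, b = shingles(s0, k), shingles(s1, k)
    # Each match contributes twice to the numerator, so for any nonzero
    # overlap this score is at least as large as the Jaccard index.
    return 2 * len(a & b) / (len(a) + len(b))

# 'hello' and 'hallo' share 2 of their 4 bigrams each:
# Dice = 2*2 / (4 + 4) = 0.5, versus Jaccard = 2/6
print(dice('hello', 'hallo'))
```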

### N-Gram Distance (Kondrak)

Normalized n-gram distance as defined by Kondrak, which affixes special characters to the start of each string so that initial characters are weighted more heavily. This algorithm is designed specifically for string similarity rather than set operations.

```python { .api }
class NGram(NormalizedStringDistance):
    def __init__(self, n: int = 2):
        """
        Initialize N-Gram distance with gram size.

        Args:
            n: Size of character n-grams to use (default: 2)
        """

    def distance(self, s0: str, s1: str) -> float:
        """
        Calculate normalized N-gram distance with special character affixing.

        Args:
            s0: First string
            s1: Second string

        Returns:
            float: Normalized distance in range [0.0, 1.0] where 0.0 = identical

        Raises:
            TypeError: If either string is None
        """
```

**Usage Examples:**

```python
from similarity.ngram import NGram

# Different n-gram sizes
twogram = NGram(2)
fourgram = NGram(4)

# String comparison with weighted prefixes
distance = twogram.distance('ABCD', 'ABTUIO')
print(f"2-gram distance: {distance:.3f}")

# Longer strings with different n-gram sizes
s1 = 'Adobe CreativeSuite 5 Master Collection from cheap 4zp'
s2 = 'Adobe CreativeSuite 5 Master Collection from cheap d1x'

distance2 = twogram.distance(s1, s2)
distance4 = fourgram.distance(s1, s2)
print(f"Similar strings - 2-gram: {distance2:.3f}, 4-gram: {distance4:.3f}")
```
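Kondrak's full algorithm also involves an alignment step, but the affixing idea on its own can be illustrated in a few lines. This is a conceptual sketch only; the `'#'` pad character is an arbitrary choice for illustration, not the library's actual sentinel:

```python
def affixed_ngrams(s: str, n: int = 2, pad: str = '#') -> list:
    # Prepend n-1 pad characters so the first letters of the string
    # participate in more n-grams than they otherwise would.
    padded = pad * (n - 1) + s
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(affixed_ngrams('ABCD'))  # ['#A', 'AB', 'BC', 'CD']
print(affixed_ngrams('ZBCD'))  # ['#Z', 'ZB', 'BC', 'CD']
```

Changing only the first character now alters two bigrams ('#A' and 'AB') instead of one, which is how initial characters receive extra weight.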

### ShingleBased Base Class

All shingle-based algorithms inherit from this base class, which provides profile computation:

```python { .api }
class ShingleBased:
    def __init__(self, k: int = 3):
        """
        Initialize with shingle size.

        Args:
            k: Size of character n-grams (shingles) to use
        """

    def get_k(self) -> int:
        """
        Get the current shingle size.

        Returns:
            int: Current shingle size
        """

    def get_profile(self, string: str) -> dict:
        """
        Convert string to n-gram profile (dictionary of n-gram counts).

        Args:
            string: Input string to profile

        Returns:
            dict: Dictionary mapping n-grams to their occurrence counts
        """
```

**Profile Usage Example:**

```python
from similarity.cosine import Cosine

cosine = Cosine(3)

# Get n-gram profiles
text = "hello world"
profile = cosine.get_profile(text)
print(f"3-gram profile for '{text}': {profile}")

# Profiles can be reused for multiple comparisons
profile1 = cosine.get_profile("machine learning")
profile2 = cosine.get_profile("deep learning")
profile3 = cosine.get_profile("learning algorithms")

similarity_12 = cosine.similarity_profiles(profile1, profile2)
similarity_13 = cosine.similarity_profiles(profile1, profile3)
print(f"ML vs DL: {similarity_12:.3f}")
print(f"ML vs Algorithms: {similarity_13:.3f}")
```
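Conceptually, `get_profile` slides a window of width k across the string and counts each shingle. The sketch below reproduces that idea with the standard library; it ignores any whitespace normalization the library performs, so its output may differ on strings containing runs of spaces:

```python
from collections import Counter

def get_profile_sketch(string: str, k: int = 3) -> dict:
    # Slide a k-wide window across the string and count each shingle.
    return dict(Counter(string[i:i + k] for i in range(len(string) - k + 1)))

print(get_profile_sketch('hello', k=3))  # {'hel': 1, 'ell': 1, 'llo': 1}
```

Because profiles are plain dictionaries, they can be computed once per string and reused across every pairwise comparison, which is the basis of the profile-based methods shown above.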