
# Sparse Encoder

Sparse encoders generate sparse embeddings that combine the efficiency of traditional sparse retrieval methods (like BM25) with neural approaches, providing efficient storage and fast retrieval for large-scale systems.

## SparseEncoder Class

### Constructor

```python
SparseEncoder(
    model_name_or_path: str | None = None,
    modules: list[torch.nn.Module] | None = None,
    device: str | None = None,
    prompts: dict[str, str] | None = None,
    default_prompt_name: str | None = None,
    similarity_fn_name: str | SimilarityFunction | None = None,
    cache_folder: str | None = None,
    trust_remote_code: bool = False,
    revision: str | None = None,
    local_files_only: bool = False,
    token: str | bool | None = None,
    max_active_dims: int | None = None,
    model_kwargs: dict[str, Any] | None = None
)
```

`{ .api }`

Initialize a SparseEncoder model for generating sparse embeddings.

**Parameters**:
- `model_name_or_path`: Pre-trained model name or path
- `modules`: List of PyTorch modules for custom architecture
- `device`: Device to run the model on ("cuda", "cpu", "mps", "npu")
- `prompts`: Dictionary of prompts for different contexts (e.g., {"query": "query: ", "passage": "passage: "})
- `default_prompt_name`: Default prompt to use if prompts are provided
- `similarity_fn_name`: Similarity function name ("cosine", "dot", "euclidean", "manhattan") or SimilarityFunction
- `cache_folder`: Custom cache directory for models
- `trust_remote_code`: Allow custom code execution from HuggingFace Hub
- `revision`: Model revision/branch/tag to load
- `local_files_only`: Use only cached files, don't download
- `token`: HuggingFace authentication token
- `max_active_dims`: Maximum number of active (non-zero) dimensions in output embeddings
- `model_kwargs`: Additional model arguments (torch_dtype, attn_implementation, etc.)
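As a rough illustration of what `max_active_dims` controls, the sketch below prunes an embedding to its k largest values. It assumes the `indices`/`values` dictionary layout used in this document's usage examples, and it assumes that pruning keeps the largest values. `prune_to_top_k` is a hypothetical helper, not part of the library, whose own pruning may work differently.

```python
def prune_to_top_k(embedding, max_active_dims):
    """Keep only the `max_active_dims` largest values of a sparse embedding.

    Hypothetical helper for illustration; not a library function.
    """
    # Pair each index with its value and keep the k largest values.
    pairs = sorted(
        zip(embedding["indices"], embedding["values"]),
        key=lambda pair: pair[1],
        reverse=True,
    )[:max_active_dims]
    # Restore ascending index order so the layout stays canonical.
    pairs.sort(key=lambda pair: pair[0])
    return {
        "indices": [idx for idx, _ in pairs],
        "values": [val for _, val in pairs],
    }


example = {"indices": [3, 17, 42, 99], "values": [0.2, 1.5, 0.7, 0.05]}
pruned = prune_to_top_k(example, max_active_dims=2)
print(pruned)  # {'indices': [17, 42], 'values': [1.5, 0.7]}
```

A smaller limit shrinks storage and speeds up scoring, at the cost of dropping low-weight terms that might still matter for recall.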

### Encoding Methods

```python
def encode(
    sentences: list[str] | str,
    batch_size: int = 32,
    show_progress_bar: bool | None = None,
    convert_to_numpy: bool = True,
    convert_to_tensor: bool = False,
    device: str | None = None
) -> list[dict[str, Any]] | dict[str, Any]
```

`{ .api }`

Encode sentences into sparse embeddings.

**Parameters**:
- `sentences`: Input text(s) to encode
- `batch_size`: Batch size for processing
- `show_progress_bar`: Display progress bar
- `convert_to_numpy`: Return numpy arrays
- `convert_to_tensor`: Return PyTorch tensors
- `device`: Device for computation

**Returns**: Sparse embeddings as dictionaries with indices and values
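To make the return format concrete, here is a minimal sketch of how a dense vector maps to an `indices`/`values` dictionary and back. The helper names `dense_to_sparse` and `sparse_to_dense` are hypothetical and only illustrate the layout. They are not library functions.

```python
def dense_to_sparse(dense):
    """Keep only non-zero entries as parallel index/value lists."""
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    return {"indices": indices, "values": [dense[i] for i in indices]}


def sparse_to_dense(embedding, dim):
    """Expand a sparse embedding back to a dense list of length `dim`."""
    dense = [0.0] * dim
    for idx, val in zip(embedding["indices"], embedding["values"]):
        dense[idx] = val
    return dense


dense = [0.0, 0.8, 0.0, 0.0, 1.2, 0.0]
sparse = dense_to_sparse(dense)
print(sparse)  # {'indices': [1, 4], 'values': [0.8, 1.2]}
restored = sparse_to_dense(sparse, dim=len(dense))
```

In a real model, `dim` would be the vocabulary size, so the sparse form stores a few hundred pairs instead of tens of thousands of floats.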

```python
def encode_queries(
    queries: list[str] | str,
    **kwargs
) -> list[dict[str, Any]] | dict[str, Any]
```
`{ .api }`

Encode queries with query-specific processing.

```python
def encode_corpus(
    corpus: list[str] | str,
    **kwargs
) -> list[dict[str, Any]] | dict[str, Any]
```
`{ .api }`

Encode corpus documents with document-specific processing.
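One plausible reading of "query-specific" and "document-specific" processing is that a prompt from the `prompts` dictionary is prepended to each text before encoding. The sketch below shows that idea only. It is an assumption about the behavior, and `apply_prompt` is a hypothetical helper rather than a library function.

```python
# Prompt dictionary in the shape shown for the `prompts` constructor parameter.
prompts = {"query": "query: ", "passage": "passage: "}


def apply_prompt(texts, prompts, prompt_name=None):
    """Prepend the named prompt to each text (hypothetical illustration)."""
    if isinstance(texts, str):
        texts = [texts]
    # Unknown or missing prompt names fall back to no prefix.
    prefix = prompts.get(prompt_name, "") if prompt_name else ""
    return [prefix + text for text in texts]


prefixed_queries = apply_prompt("What is machine learning?", prompts, "query")
prefixed_docs = apply_prompt(["Machine learning is a subset of AI"], prompts, "passage")
print(prefixed_queries)  # ['query: What is machine learning?']
```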

### Model Information

```python
def get_sentence_embedding_dimension() -> int
```
`{ .api }`

Get the vocabulary size (sparse embedding dimension).

```python
def get_max_seq_length() -> int
```
`{ .api }`

Get the maximum sequence length the model can handle.

```python
def tokenize(
    texts: list[str] | str,
    **kwargs
) -> dict[str, torch.Tensor]
```
`{ .api }`

Tokenize input texts using the model's tokenizer.

### Model Persistence

```python
def save(
    path: str,
    model_name: str | None = None,
    create_model_card: bool = True,
    train_datasets: list[str] | None = None,
    safe_serialization: bool = True
) -> None
```
`{ .api }`

Save the sparse encoder model to a directory.

```python
def save_pretrained(
    save_directory: str,
    **kwargs
) -> None
```
`{ .api }`

Save using HuggingFace format.

```python
def save_to_hub(
    repo_id: str,
    **kwargs
) -> None
```
`{ .api }`

Save and push to HuggingFace Hub.

```python
def push_to_hub(
    repo_id: str,
    **kwargs
) -> None
```
`{ .api }`

Push existing model to HuggingFace Hub.

### Evaluation

```python
def evaluate(
    evaluator: SentenceEvaluator,
    output_path: str | None = None
) -> float | dict[str, float]
```
`{ .api }`

Evaluate the model using the provided evaluator.
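For retrieval models, an evaluator usually reduces ranked results to a single score. The sketch below computes mean reciprocal rank (MRR@k) from precomputed rankings to show the kind of number such an evaluator might return. The function and its input format are assumptions made for illustration. They do not reflect the `SentenceEvaluator` interface.

```python
def mean_reciprocal_rank(rankings, relevant, k=10):
    """MRR@k: average of 1/rank of the first relevant document per query.

    rankings: {query_id: [doc_id, ...]} ordered best-first.
    relevant: {query_id: set of relevant doc_ids}.
    """
    total = 0.0
    for query_id, ranked_doc_ids in rankings.items():
        for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1):
            if doc_id in relevant[query_id]:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings) if rankings else 0.0


rankings = {"q1": ["d3", "d1", "d7"], "q2": ["d2", "d5", "d9"]}
relevant = {"q1": {"d1"}, "q2": {"d2"}}
mrr = mean_reciprocal_rank(rankings, relevant)
print(mrr)  # (1/2 + 1/1) / 2 = 0.75
```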

### Properties

```python
@property
def device() -> torch.device
```
`{ .api }`

Current device of the model.

```python
@property
def tokenizer() -> PreTrainedTokenizer
```
`{ .api }`

Access to the model's tokenizer.

```python
@property
def max_seq_length() -> int
```
`{ .api }`

Maximum sequence length.

## SparseEncoderTrainer

### Constructor

```python
SparseEncoderTrainer(
    model: SparseEncoder | None = None,
    args: SparseEncoderTrainingArguments | None = None,
    train_dataset: Dataset | None = None,
    eval_dataset: Dataset | None = None,
    tokenizer: PreTrainedTokenizer | None = None,
    data_collator: DataCollator | None = None,
    compute_metrics: callable | None = None,
    callbacks: list[TrainerCallback] | None = None,
    optimizers: tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR] = (None, None),
    preprocess_logits_for_metrics: callable | None = None
)
```

`{ .api }`

Trainer for sparse encoder models.

**Parameters**:
- `model`: SparseEncoder model to train
- `args`: Training arguments
- `train_dataset`: Training dataset
- `eval_dataset`: Evaluation dataset
- `tokenizer`: Tokenizer (auto-detected from model)
- `data_collator`: Data collator for batching
- `compute_metrics`: Metrics computation function
- `callbacks`: Training callbacks
- `optimizers`: Custom optimizer and scheduler
- `preprocess_logits_for_metrics`: Logits preprocessing

### Training Methods

```python
def train(
    resume_from_checkpoint: str | bool | None = None,
    trial: dict[str, Any] | None = None,
    ignore_keys_for_eval: list[str] | None = None,
    **kwargs
) -> TrainOutput
```
`{ .api }`

Train the sparse encoder model.

```python
def evaluate(
    eval_dataset: Dataset | None = None,
    ignore_keys: list[str] | None = None,
    metric_key_prefix: str = "eval"
) -> dict[str, float]
```
`{ .api }`

Evaluate model performance.

## SparseEncoderTrainingArguments

```python
class SparseEncoderTrainingArguments(TrainingArguments):
    def __init__(
        self,
        output_dir: str,
        evaluation_strategy: str | IntervalStrategy = "no",
        eval_steps: int | None = None,
        eval_delay: float = 0,
        logging_dir: str | None = None,
        logging_strategy: str | IntervalStrategy = "steps",
        logging_steps: int = 500,
        save_strategy: str | IntervalStrategy = "steps",
        save_steps: int = 500,
        save_total_limit: int | None = None,
        seed: int = 42,
        data_seed: int | None = None,
        jit_mode_eval: bool = False,
        use_ipex: bool = False,
        bf16: bool = False,
        fp16: bool = False,
        fp16_opt_level: str = "O1",
        half_precision_backend: str = "auto",
        bf16_full_eval: bool = False,
        fp16_full_eval: bool = False,
        tf32: bool | None = None,
        local_rank: int = -1,
        ddp_backend: str | None = None,
        tpu_num_cores: int | None = None,
        tpu_metrics_debug: bool = False,
        debug: str | list[DebugOption] = "",
        dataloader_drop_last: bool = False,
        dataloader_num_workers: int = 0,
        past_index: int = -1,
        run_name: str | None = None,
        disable_tqdm: bool | None = None,
        remove_unused_columns: bool = True,
        label_names: list[str] | None = None,
        load_best_model_at_end: bool = False,
        ignore_data_skip: bool = False,
        fsdp: str | list[str] = "",
        fsdp_min_num_params: int = 0,
        fsdp_config: dict[str, Any] | None = None,
        fsdp_transformer_layer_cls_to_wrap: str | None = None,
        deepspeed: str | None = None,
        label_smoothing_factor: float = 0.0,
        optim: str | OptimizerNames = "adamw_torch",
        optim_args: str | None = None,
        adafactor: bool = False,
        group_by_length: bool = False,
        length_column_name: str | None = "length",
        report_to: str | list[str] | None = None,
        ddp_find_unused_parameters: bool | None = None,
        ddp_bucket_cap_mb: int | None = None,
        ddp_broadcast_buffers: bool | None = None,
        dataloader_pin_memory: bool = True,
        skip_memory_metrics: bool = True,
        use_legacy_prediction_loop: bool = False,
        push_to_hub: bool = False,
        resume_from_checkpoint: str | None = None,
        hub_model_id: str | None = None,
        hub_strategy: str | HubStrategy = "every_save",
        hub_token: str | None = None,
        hub_private_repo: bool = False,
        hub_always_push: bool = False,
        gradient_checkpointing: bool = False,
        include_inputs_for_metrics: bool = False,
        auto_find_batch_size: bool = False,
        full_determinism: bool = False,
        torchdynamo: str | None = None,
        ray_scope: str | None = "last",
        ddp_timeout: int = 1800,
        torch_compile: bool = False,
        torch_compile_backend: str | None = None,
        torch_compile_mode: str | None = None,
        dispatch_batches: bool | None = None,
        split_batches: bool | None = None,
        include_tokens_per_second: bool = False,
        **kwargs
    )
```
`{ .api }`

Training arguments for sparse encoder training.

## SparseEncoderModelCardData

```python
class SparseEncoderModelCardData:
    def __init__(
        self,
        language: str | list[str] | None = None,
        license: str | None = None,
        tags: str | list[str] | None = None,
        model_name: str | None = None,
        model_id: str | None = None,
        eval_results: list[EvalResult] | None = None,
        train_datasets: str | list[str] | None = None,
        eval_datasets: str | list[str] | None = None
    )
```

`{ .api }`

Data class for generating model cards for sparse encoder models.

**Parameters**:
- `language`: Language(s) supported
- `license`: Model license
- `tags`: Categorization tags
- `model_name`: Human-readable name
- `model_id`: Model identifier
- `eval_results`: Evaluation results
- `train_datasets`: Training datasets used
- `eval_datasets`: Evaluation datasets used

## Usage Examples

### Basic Sparse Encoding

```python
from sentence_transformers import SparseEncoder

# Load a sparse encoder model
sparse_model = SparseEncoder('naver/splade-cocondenser-ensembledistil')

# Encode sentences to sparse embeddings
sentences = [
    "Machine learning is transforming technology",
    "Artificial intelligence applications are growing",
    "Data science requires statistical knowledge"
]

# Get sparse embeddings
sparse_embeddings = sparse_model.encode(sentences)

# Each embedding is a dictionary with 'indices' and 'values'
vocab_size = sparse_model.get_sentence_embedding_dimension()
for i, embedding in enumerate(sparse_embeddings):
    print(f"Sentence {i}:")
    print(f"  Active dimensions: {len(embedding['indices'])}")
    # Fraction of the vocabulary that is non-zero (lower means sparser)
    print(f"  Active fraction: {len(embedding['indices']) / vocab_size:.4f}")
    print(f"  Max value: {max(embedding['values']):.4f}")
    print()
```

### Asymmetric Retrieval

```python
# For retrieval tasks with different query/document processing
queries = [
    "What is machine learning?",
    "How do neural networks work?"
]

documents = [
    "Machine learning is a subset of artificial intelligence that focuses on algorithms",
    "Neural networks are computational models inspired by biological neural networks",
    "Data preprocessing is crucial for machine learning success",
    "Deep learning uses multiple layers to model complex patterns"
]

# Encode queries and documents separately
query_embeddings = sparse_model.encode_queries(queries)
doc_embeddings = sparse_model.encode_corpus(documents)

print("Query embeddings:")
for i, emb in enumerate(query_embeddings):
    print(f"  Query {i}: {len(emb['indices'])} active dimensions")

print("Document embeddings:")
for i, emb in enumerate(doc_embeddings):
    print(f"  Document {i}: {len(emb['indices'])} active dimensions")
```

### Similarity Computation for Sparse Embeddings

```python
import numpy as np

def sparse_dot_product(emb1, emb2):
    """Compute dot product between two sparse embeddings."""
    # Convert to dictionaries for efficient lookup
    dict1 = dict(zip(emb1['indices'], emb1['values']))
    dict2 = dict(zip(emb2['indices'], emb2['values']))

    # Only indices active in both embeddings contribute to the dot product
    common_indices = set(dict1.keys()) & set(dict2.keys())
    return sum(dict1[idx] * dict2[idx] for idx in common_indices)

def sparse_cosine_similarity(emb1, emb2):
    """Compute cosine similarity between sparse embeddings."""
    dot_product = sparse_dot_product(emb1, emb2)
    norm1 = np.sqrt(sum(v**2 for v in emb1['values']))
    norm2 = np.sqrt(sum(v**2 for v in emb2['values']))
    # Guard against empty embeddings (zero norm)
    return dot_product / (norm1 * norm2) if norm1 * norm2 > 0 else 0.0

# Example usage
query_emb = query_embeddings[0]
similarities = []
for doc_emb in doc_embeddings:
    sim = sparse_cosine_similarity(query_emb, doc_emb)
    similarities.append(sim)

print("Similarity scores:")
for i, sim in enumerate(similarities):
    print(f"  Query 0 - Document {i}: {sim:.4f}")
```

### Training a Sparse Encoder

```python
from sentence_transformers import SparseEncoder, SparseEncoderTrainer, SparseEncoderTrainingArguments
from sentence_transformers.losses import MultipleNegativesRankingLoss
from datasets import Dataset

# Create training dataset
train_data = [
    {"query": "python programming", "positive": "Python is a programming language", "negative": "Cats are pets"},
    {"query": "machine learning", "positive": "ML algorithms learn patterns", "negative": "Cooking recipes vary"},
    {"query": "data science", "positive": "Data analysis and statistics", "negative": "Weather forecast"}
]

# Convert to dataset format expected by trainer
def prepare_dataset(data):
    dataset_dict = {"query": [], "positive": [], "negative": []}
    for item in data:
        dataset_dict["query"].append(item["query"])
        dataset_dict["positive"].append(item["positive"])
        dataset_dict["negative"].append(item["negative"])
    return Dataset.from_dict(dataset_dict)

train_dataset = prepare_dataset(train_data)

# Initialize sparse encoder model
model = SparseEncoder('distilbert-base-uncased')

# Training arguments. Step-based evaluation and load_best_model_at_end
# are left out because this example passes no eval_dataset.
args = SparseEncoderTrainingArguments(
    output_dir='./sparse-encoder-output',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    logging_steps=10,
    save_steps=100,
    save_total_limit=2,
)

# Create trainer
trainer = SparseEncoderTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
)

# Train the model
trainer.train()

# Save trained model
model.save('./my-sparse-encoder')
```

### Advanced Usage - Custom Sparse Architecture

```python
from sentence_transformers.models import Transformer, SparseLinear
from sentence_transformers import SparseEncoder

# Create custom sparse encoder architecture
transformer = Transformer('distilbert-base-uncased')
sparse_linear = SparseLinear(
    transformer.get_word_embedding_dimension(),
    vocab_size=30522,  # BERT vocabulary size
    activation='relu'
)

# Combine modules
sparse_model = SparseEncoder(modules=[transformer, sparse_linear])

# Use the custom model
embeddings = sparse_model.encode(["Custom sparse encoder example"])
```

### Efficiency Analysis

```python
import numpy as np

def analyze_sparsity(embeddings, vocab_size=None):
    """Analyze sparsity patterns in sparse embeddings."""
    if not isinstance(embeddings, list):
        embeddings = [embeddings]

    total_active = []
    total_values = []

    for emb in embeddings:
        active_dims = len(emb['indices'])
        total_active.append(active_dims)
        total_values.extend(emb['values'])

    if vocab_size:
        # Fraction of the vocabulary that is non-zero (lower means sparser)
        avg_active_fraction = sum(total_active) / (len(embeddings) * vocab_size)
        print(f"Average active fraction: {avg_active_fraction:.6f}")

    print(f"Average active dimensions: {np.mean(total_active):.1f}")
    print(f"Min/Max active dimensions: {min(total_active)}/{max(total_active)}")
    print(f"Average value: {np.mean(total_values):.4f}")
    print(f"Value range: {min(total_values):.4f} to {max(total_values):.4f}")

# Analyze encodings
analyze_sparsity(sparse_embeddings, vocab_size=sparse_model.get_sentence_embedding_dimension())
```

### Model Card and Saving

```python
from sentence_transformers import SparseEncoderModelCardData

# Create model card
model_card_data = SparseEncoderModelCardData(
    language=['en'],
    license='apache-2.0',
    tags=['sentence-transformers', 'sparse-encoder', 'retrieval'],
    model_name='Custom Sparse Encoder',
    train_datasets=['ms-marco'],
    eval_datasets=['beir']
)

# Save with model card.
# `model_card_data` is not listed in the `save` signature documented above;
# check that your installed version accepts it.
sparse_model.save('./my-sparse-model', model_card_data=model_card_data)

# Push to hub
sparse_model.push_to_hub('my-username/my-sparse-encoder')
```

## Storage and Deployment

### Efficient Storage Format

```python
import numpy as np

def sparse_to_compressed(sparse_embedding):
    """Convert sparse embedding to compressed format."""
    return {
        'indices': np.array(sparse_embedding['indices'], dtype=np.uint32),
        'values': np.array(sparse_embedding['values'], dtype=np.float32)
    }

def compressed_to_sparse(compressed_embedding):
    """Convert compressed format back to sparse embedding."""
    return {
        'indices': compressed_embedding['indices'].tolist(),
        'values': compressed_embedding['values'].tolist()
    }

# Compress embeddings for storage
compressed_embeddings = [sparse_to_compressed(emb) for emb in sparse_embeddings]
```
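To put rough numbers on the storage benefit, the sketch below packs one embedding into a compact byte string with the standard-library `struct` module and compares its size with a dense float32 vector. The byte layout (a count followed by uint32 indices and float32 values) is an assumption chosen for illustration, and `pack_sparse` and `unpack_sparse` are hypothetical helpers.

```python
import struct


def pack_sparse(embedding):
    """Pack as little-endian: uint32 count, `count` uint32 indices, `count` float32 values."""
    n = len(embedding["indices"])
    return struct.pack(f"<I{n}I{n}f", n, *embedding["indices"], *embedding["values"])


def unpack_sparse(blob):
    """Inverse of pack_sparse."""
    (n,) = struct.unpack_from("<I", blob, 0)
    fields = struct.unpack_from(f"<{n}I{n}f", blob, 4)
    return {"indices": list(fields[:n]), "values": list(fields[n:])}


embedding = {"indices": [7, 1024, 30000], "values": [0.5, 1.25, 2.0]}
blob = pack_sparse(embedding)
sparse_bytes = len(blob)   # 4 + 3 * 4 + 3 * 4 = 28 bytes
dense_bytes = 30522 * 4    # one float32 per vocabulary entry = 122088 bytes
print(sparse_bytes, dense_bytes)
```

Values are stored as float32, so a round trip is exact only for values representable in that precision, as in this example.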

### Batch Processing for Large Corpora

```python
def encode_large_corpus(sparse_model, texts, batch_size=1000, report_every=10000):
    """Encode a large corpus in batches with periodic progress reporting."""
    all_embeddings = []

    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        batch_embeddings = sparse_model.encode(
            batch,
            batch_size=32,
            show_progress_bar=True,
            convert_to_numpy=False
        )
        all_embeddings.extend(batch_embeddings)

        # Report progress periodically; persist all_embeddings here if needed
        if (i + batch_size) % report_every == 0:
            print(f"Processed {i + batch_size} documents...")

    return all_embeddings

# Example with large dataset
large_corpus = [f"Document {i} with content" for i in range(50000)]
corpus_embeddings = encode_large_corpus(sparse_model, large_corpus)
```
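The loop above keeps every embedding in memory until it returns. One way to persist results as you go is to append each batch to a JSON Lines file and resume from the number of lines already written. The sketch below shows that pattern. `encode_with_checkpoints` is hypothetical, and `fake_encode` is a stand-in for `sparse_model.encode` so the example runs without a model.

```python
import json
import os
import tempfile


def encode_with_checkpoints(encode_fn, texts, path, batch_size=1000):
    """Encode `texts` in batches, appending one JSON line per embedding.

    Returns the number of documents that were already on disk at start.
    """
    done = 0
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            done = sum(1 for _ in f)  # one line per finished document
    with open(path, "a", encoding="utf-8") as f:
        for start in range(done, len(texts), batch_size):
            batch = texts[start:start + batch_size]
            for emb in encode_fn(batch):
                f.write(json.dumps(emb) + "\n")
            f.flush()  # make the finished batch durable before continuing
    return done


def fake_encode(batch):
    # Stand-in encoder: returns one toy sparse embedding per text.
    return [{"indices": [len(text) % 7], "values": [1.0]} for text in batch]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "embeddings.jsonl")
    texts = [f"Document {i} with content" for i in range(25)]
    encode_with_checkpoints(fake_encode, texts, path, batch_size=10)
    with open(path, encoding="utf-8") as f:
        written = [json.loads(line) for line in f]
    # A second call finds all 25 documents on disk and encodes nothing new.
    resumed_from = encode_with_checkpoints(fake_encode, texts, path, batch_size=10)
```

This sketch does not handle a line that was only partially written before a crash. A production version would need to validate or truncate the last line before resuming.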

## Best Practices

1. **Sparsity Control**: Monitor sparsity levels to balance efficiency and quality
2. **Vocabulary Management**: Understand the vocabulary size and active dimensions
3. **Storage Efficiency**: Use compressed formats for large-scale deployment
4. **Retrieval Systems**: Implement efficient sparse similarity computation
5. **Training Data**: Use diverse query-document pairs for robust training
6. **Evaluation**: Test on retrieval benchmarks like BEIR for comprehensive evaluation
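For the retrieval point above, a common way to score sparse embeddings efficiently is an inverted index, which visits only documents that share at least one active dimension with the query. The following is a minimal in-memory sketch under the `indices`/`values` layout used in this document. `build_inverted_index` and `search` are hypothetical helpers, not library functions.

```python
from collections import defaultdict


def build_inverted_index(doc_embeddings):
    """Map each active dimension to the (doc_id, value) pairs that use it."""
    index = defaultdict(list)
    for doc_id, emb in enumerate(doc_embeddings):
        for idx, val in zip(emb["indices"], emb["values"]):
            index[idx].append((doc_id, val))
    return index


def search(index, query_emb, top_k=3):
    """Accumulate dot-product scores over the query's active dimensions only."""
    scores = defaultdict(float)
    for idx, q_val in zip(query_emb["indices"], query_emb["values"]):
        for doc_id, d_val in index.get(idx, ()):
            scores[doc_id] += q_val * d_val
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]


docs = [
    {"indices": [1, 2], "values": [0.5, 1.0]},
    {"indices": [5, 9], "values": [1.0, 1.0]},
    {"indices": [3], "values": [4.0]},
]
query = {"indices": [1, 5, 9], "values": [1.0, 0.5, 2.0]}

index = build_inverted_index(docs)
results = search(index, query)
print(results)  # [(1, 2.5), (0, 0.5)]
```

Document 2 shares no dimension with the query, so it is never touched during scoring.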