# Sparse Encoder

Sparse encoders generate sparse embeddings that combine the efficiency of traditional sparse retrieval methods (like BM25) with neural approaches, providing efficient storage and fast retrieval for large-scale systems.

## SparseEncoder Class

### Constructor

```python
SparseEncoder(
    model_name_or_path: str | None = None,
    modules: list[torch.nn.Module] | None = None,
    device: str | None = None,
    prompts: dict[str, str] | None = None,
    default_prompt_name: str | None = None,
    similarity_fn_name: str | SimilarityFunction | None = None,
    cache_folder: str | None = None,
    trust_remote_code: bool = False,
    revision: str | None = None,
    local_files_only: bool = False,
    token: str | bool | None = None,
    max_active_dims: int | None = None,
    model_kwargs: dict[str, Any] | None = None
)
```
`{ .api }`

Initialize a SparseEncoder model for generating sparse embeddings.

**Parameters**:
- `model_name_or_path`: Pre-trained model name or path
- `modules`: List of PyTorch modules for custom architecture
- `device`: Device to run the model on ("cuda", "cpu", "mps", "npu")
- `prompts`: Dictionary of prompts for different contexts (e.g., `{"query": "query: ", "passage": "passage: "}`)
- `default_prompt_name`: Default prompt to use if prompts are provided
- `similarity_fn_name`: Similarity function name ("cosine", "dot", "euclidean", "manhattan") or SimilarityFunction
- `cache_folder`: Custom cache directory for models
- `trust_remote_code`: Allow custom code execution from HuggingFace Hub
- `revision`: Model revision/branch/tag to load
- `local_files_only`: Use only cached files, don't download
- `token`: HuggingFace authentication token
- `max_active_dims`: Maximum number of active (non-zero) dimensions in output embeddings
- `model_kwargs`: Additional model arguments (torch_dtype, attn_implementation, etc.)
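
To illustrate what a cap on active dimensions such as `max_active_dims` means in practice, here is a minimal NumPy sketch of top-k pruning. It is an assumption about the general idea, not the library's implementation, and `prune_to_top_k` is a hypothetical helper name.

```python
import numpy as np

def prune_to_top_k(dense_vector, k):
    """Hypothetical helper: keep at most the k largest positive entries."""
    dense_vector = np.asarray(dense_vector, dtype=np.float32)
    if k <= 0:
        return {"indices": [], "values": []}
    # Positions of the active (positive) dimensions
    active = np.flatnonzero(dense_vector > 0)
    if len(active) > k:
        # argpartition selects the k largest values without a full sort
        top = np.argpartition(dense_vector[active], -k)[-k:]
        active = np.sort(active[top])
    return {"indices": active.tolist(), "values": dense_vector[active].tolist()}

# Toy vector with four active dimensions, capped at two
pruned = prune_to_top_k([0.0, 0.9, 0.1, 0.0, 0.5, 0.3], k=2)
print(pruned["indices"])
```

A lower cap shrinks storage and speeds up retrieval, at the cost of dropping low-weight dimensions that may still carry some signal.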

### Encoding Methods

```python
def encode(
    sentences: list[str] | str,
    batch_size: int = 32,
    show_progress_bar: bool | None = None,
    convert_to_numpy: bool = True,
    convert_to_tensor: bool = False,
    device: str | None = None
) -> list[dict[str, Any]] | dict[str, Any]
```
`{ .api }`

Encode sentences into sparse embeddings.

**Parameters**:
- `sentences`: Input text(s) to encode
- `batch_size`: Batch size for processing
- `show_progress_bar`: Display progress bar
- `convert_to_numpy`: Return numpy arrays
- `convert_to_tensor`: Return PyTorch tensors
- `device`: Device for computation

**Returns**: Sparse embeddings as dictionaries with `indices` (the active dimensions) and `values` (their weights). A single string input yields one dictionary; a list of strings yields a list of dictionaries.
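
To make the return format concrete, the sketch below builds one embedding of that shape by hand and expands it into a dense vector. The dimension of 8 and the values are toy assumptions, and `to_dense` is a hypothetical helper, not part of the library.

```python
import numpy as np

def to_dense(sparse_embedding, dim):
    """Hypothetical helper: expand an indices/values dictionary into a dense vector."""
    dense = np.zeros(dim, dtype=np.float32)
    dense[sparse_embedding["indices"]] = sparse_embedding["values"]
    return dense

# Hand-written embedding in the indices/values layout described above
embedding = {"indices": [1, 4, 6], "values": [0.8, 0.25, 1.5]}
dense = to_dense(embedding, dim=8)
print(dense)
```

Only the three listed dimensions are non-zero, which is why storing the index/value pairs is much cheaper than storing the full vector when the vocabulary is large.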

```python
def encode_queries(
    queries: list[str] | str,
    **kwargs
) -> list[dict[str, Any]] | dict[str, Any]
```
`{ .api }`

Encode queries with query-specific processing.

```python
def encode_corpus(
    corpus: list[str] | str,
    **kwargs
) -> list[dict[str, Any]] | dict[str, Any]
```
`{ .api }`

Encode corpus documents with document-specific processing.

### Model Information

```python
def get_sentence_embedding_dimension() -> int
```
`{ .api }`

Get the vocabulary size (sparse embedding dimension).

```python
def get_max_seq_length() -> int
```
`{ .api }`

Get the maximum sequence length the model can handle.

```python
def tokenize(
    texts: list[str] | str,
    **kwargs
) -> dict[str, torch.Tensor]
```
`{ .api }`

Tokenize input texts using the model's tokenizer.

### Model Persistence

```python
def save(
    path: str,
    model_name: str | None = None,
    create_model_card: bool = True,
    train_datasets: list[str] | None = None,
    safe_serialization: bool = True
) -> None
```
`{ .api }`

Save the sparse encoder model to a directory.

```python
def save_pretrained(
    save_directory: str,
    **kwargs
) -> None
```
`{ .api }`

Save the model using the HuggingFace format.

```python
def save_to_hub(
    repo_id: str,
    **kwargs
) -> None
```
`{ .api }`

Save the model and push it to the HuggingFace Hub.

```python
def push_to_hub(
    repo_id: str,
    **kwargs
) -> None
```
`{ .api }`

Push an existing model to the HuggingFace Hub.

### Evaluation

```python
def evaluate(
    evaluator: SentenceEvaluator,
    output_path: str | None = None
) -> float | dict[str, float]
```
`{ .api }`

Evaluate the model using the provided evaluator.

### Properties

```python
@property
def device() -> torch.device
```
`{ .api }`

Current device of the model.

```python
@property
def tokenizer() -> PreTrainedTokenizer
```
`{ .api }`

Access to the model's tokenizer.

```python
@property
def max_seq_length() -> int
```
`{ .api }`

Maximum sequence length.

## SparseEncoderTrainer

### Constructor

```python
SparseEncoderTrainer(
    model: SparseEncoder | None = None,
    args: SparseEncoderTrainingArguments | None = None,
    train_dataset: Dataset | None = None,
    eval_dataset: Dataset | None = None,
    tokenizer: PreTrainedTokenizer | None = None,
    data_collator: DataCollator | None = None,
    compute_metrics: callable | None = None,
    callbacks: list[TrainerCallback] | None = None,
    optimizers: tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR] = (None, None),
    preprocess_logits_for_metrics: callable | None = None
)
```
`{ .api }`

Trainer for sparse encoder models.

**Parameters**:
- `model`: SparseEncoder model to train
- `args`: Training arguments
- `train_dataset`: Training dataset
- `eval_dataset`: Evaluation dataset
- `tokenizer`: Tokenizer (auto-detected from model)
- `data_collator`: Data collator for batching
- `compute_metrics`: Metrics computation function
- `callbacks`: Training callbacks
- `optimizers`: Custom optimizer and scheduler
- `preprocess_logits_for_metrics`: Logits preprocessing

### Training Methods

```python
def train(
    resume_from_checkpoint: str | bool | None = None,
    trial: dict[str, Any] | None = None,
    ignore_keys_for_eval: list[str] | None = None,
    **kwargs
) -> TrainOutput
```
`{ .api }`

Train the sparse encoder model.

```python
def evaluate(
    eval_dataset: Dataset | None = None,
    ignore_keys: list[str] | None = None,
    metric_key_prefix: str = "eval"
) -> dict[str, float]
```
`{ .api }`

Evaluate model performance.

## SparseEncoderTrainingArguments

```python
class SparseEncoderTrainingArguments(TrainingArguments):
    def __init__(
        self,
        output_dir: str,
        evaluation_strategy: str | IntervalStrategy = "no",
        eval_steps: int | None = None,
        eval_delay: float = 0,
        logging_dir: str | None = None,
        logging_strategy: str | IntervalStrategy = "steps",
        logging_steps: int = 500,
        save_strategy: str | IntervalStrategy = "steps",
        save_steps: int = 500,
        save_total_limit: int | None = None,
        seed: int = 42,
        data_seed: int | None = None,
        jit_mode_eval: bool = False,
        use_ipex: bool = False,
        bf16: bool = False,
        fp16: bool = False,
        fp16_opt_level: str = "O1",
        half_precision_backend: str = "auto",
        bf16_full_eval: bool = False,
        fp16_full_eval: bool = False,
        tf32: bool | None = None,
        local_rank: int = -1,
        ddp_backend: str | None = None,
        tpu_num_cores: int | None = None,
        tpu_metrics_debug: bool = False,
        debug: str | list[DebugOption] = "",
        dataloader_drop_last: bool = False,
        dataloader_num_workers: int = 0,
        past_index: int = -1,
        run_name: str | None = None,
        disable_tqdm: bool | None = None,
        remove_unused_columns: bool = True,
        label_names: list[str] | None = None,
        load_best_model_at_end: bool = False,
        ignore_data_skip: bool = False,
        fsdp: str | list[str] = "",
        fsdp_min_num_params: int = 0,
        fsdp_config: dict[str, Any] | None = None,
        fsdp_transformer_layer_cls_to_wrap: str | None = None,
        deepspeed: str | None = None,
        label_smoothing_factor: float = 0.0,
        optim: str | OptimizerNames = "adamw_torch",
        optim_args: str | None = None,
        adafactor: bool = False,
        group_by_length: bool = False,
        length_column_name: str | None = "length",
        report_to: str | list[str] | None = None,
        ddp_find_unused_parameters: bool | None = None,
        ddp_bucket_cap_mb: int | None = None,
        ddp_broadcast_buffers: bool | None = None,
        dataloader_pin_memory: bool = True,
        skip_memory_metrics: bool = True,
        use_legacy_prediction_loop: bool = False,
        push_to_hub: bool = False,
        resume_from_checkpoint: str | None = None,
        hub_model_id: str | None = None,
        hub_strategy: str | HubStrategy = "every_save",
        hub_token: str | None = None,
        hub_private_repo: bool = False,
        hub_always_push: bool = False,
        gradient_checkpointing: bool = False,
        include_inputs_for_metrics: bool = False,
        auto_find_batch_size: bool = False,
        full_determinism: bool = False,
        torchdynamo: str | None = None,
        ray_scope: str | None = "last",
        ddp_timeout: int = 1800,
        torch_compile: bool = False,
        torch_compile_backend: str | None = None,
        torch_compile_mode: str | None = None,
        dispatch_batches: bool | None = None,
        split_batches: bool | None = None,
        include_tokens_per_second: bool = False,
        **kwargs
    )
```
`{ .api }`

Training arguments for sparse encoder training.

## SparseEncoderModelCardData

```python
class SparseEncoderModelCardData:
    def __init__(
        self,
        language: str | list[str] | None = None,
        license: str | None = None,
        tags: str | list[str] | None = None,
        model_name: str | None = None,
        model_id: str | None = None,
        eval_results: list[EvalResult] | None = None,
        train_datasets: str | list[str] | None = None,
        eval_datasets: str | list[str] | None = None
    )
```
`{ .api }`

Data class for generating model cards for sparse encoder models.

**Parameters**:
- `language`: Language(s) supported
- `license`: Model license
- `tags`: Categorization tags
- `model_name`: Human-readable name
- `model_id`: Model identifier
- `eval_results`: Evaluation results
- `train_datasets`: Training datasets used
- `eval_datasets`: Evaluation datasets used

## Usage Examples

### Basic Sparse Encoding

```python
from sentence_transformers import SparseEncoder

# Load a sparse encoder model
sparse_model = SparseEncoder('naver/splade-cocondenser-ensembledistil')

# Encode sentences to sparse embeddings
sentences = [
    "Machine learning is transforming technology",
    "Artificial intelligence applications are growing",
    "Data science requires statistical knowledge"
]

# Get sparse embeddings
sparse_embeddings = sparse_model.encode(sentences)

# Each embedding is a dictionary with 'indices' and 'values'
for i, embedding in enumerate(sparse_embeddings):
    print(f"Sentence {i}:")
    print(f"  Active dimensions: {len(embedding['indices'])}")
    # Sparsity is the fraction of dimensions that are zero
    print(f"  Sparsity: {1 - len(embedding['indices']) / sparse_model.get_sentence_embedding_dimension():.4f}")
    print(f"  Max value: {max(embedding['values']):.4f}")
    print()
```

### Asymmetric Retrieval

```python
# For retrieval tasks with different query/document processing
queries = [
    "What is machine learning?",
    "How do neural networks work?"
]

documents = [
    "Machine learning is a subset of artificial intelligence that focuses on algorithms",
    "Neural networks are computational models inspired by biological neural networks",
    "Data preprocessing is crucial for machine learning success",
    "Deep learning uses multiple layers to model complex patterns"
]

# Encode queries and documents separately
query_embeddings = sparse_model.encode_queries(queries)
doc_embeddings = sparse_model.encode_corpus(documents)

print("Query embeddings:")
for i, emb in enumerate(query_embeddings):
    print(f"  Query {i}: {len(emb['indices'])} active dimensions")

print("Document embeddings:")
for i, emb in enumerate(doc_embeddings):
    print(f"  Document {i}: {len(emb['indices'])} active dimensions")
```

### Similarity Computation for Sparse Embeddings

```python
import numpy as np

def sparse_dot_product(emb1, emb2):
    """Compute dot product between two sparse embeddings."""
    # Convert to dictionaries for efficient lookup
    dict1 = dict(zip(emb1['indices'], emb1['values']))
    dict2 = dict(zip(emb2['indices'], emb2['values']))

    # Find common indices and compute dot product
    common_indices = set(dict1.keys()) & set(dict2.keys())
    return sum(dict1[idx] * dict2[idx] for idx in common_indices)

def sparse_cosine_similarity(emb1, emb2):
    """Compute cosine similarity between sparse embeddings."""
    dot_product = sparse_dot_product(emb1, emb2)
    norm1 = np.sqrt(sum(v**2 for v in emb1['values']))
    norm2 = np.sqrt(sum(v**2 for v in emb2['values']))
    return dot_product / (norm1 * norm2) if norm1 * norm2 > 0 else 0.0

# Example usage
query_emb = query_embeddings[0]
similarities = []
for doc_emb in doc_embeddings:
    sim = sparse_cosine_similarity(query_emb, doc_emb)
    similarities.append(sim)

print("Similarity scores:")
for i, sim in enumerate(similarities):
    print(f"  Query 0 - Document {i}: {sim:.4f}")
```

### Training a Sparse Encoder

```python
from sentence_transformers import SparseEncoder, SparseEncoderTrainer, SparseEncoderTrainingArguments
from datasets import Dataset

# Create training dataset
train_data = [
    {"query": "python programming", "positive": "Python is a programming language", "negative": "Cats are pets"},
    {"query": "machine learning", "positive": "ML algorithms learn patterns", "negative": "Cooking recipes vary"},
    {"query": "data science", "positive": "Data analysis and statistics", "negative": "Weather forecast"}
]

# Convert to dataset format expected by trainer
def prepare_dataset(data):
    dataset_dict = {"query": [], "positive": [], "negative": []}
    for item in data:
        dataset_dict["query"].append(item["query"])
        dataset_dict["positive"].append(item["positive"])
        dataset_dict["negative"].append(item["negative"])
    return Dataset.from_dict(dataset_dict)

train_dataset = prepare_dataset(train_data)

# Initialize sparse encoder model
model = SparseEncoder('distilbert-base-uncased')

# Training arguments
# Step-based evaluation and load_best_model_at_end are left out here
# because no eval_dataset is passed to the trainer below.
args = SparseEncoderTrainingArguments(
    output_dir='./sparse-encoder-output',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    logging_steps=10,
    save_steps=100,
    save_total_limit=2,
)

# Create trainer
trainer = SparseEncoderTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
)

# Train the model
trainer.train()

# Save trained model
model.save('./my-sparse-encoder')
```

### Advanced Usage - Custom Sparse Architecture

```python
from sentence_transformers.models import Transformer, SparseLinear
from sentence_transformers import SparseEncoder

# Create custom sparse encoder architecture
transformer = Transformer('distilbert-base-uncased')
sparse_linear = SparseLinear(
    transformer.get_word_embedding_dimension(),
    vocab_size=30522,  # BERT vocabulary size
    activation='relu'
)

# Combine modules
sparse_model = SparseEncoder(modules=[transformer, sparse_linear])

# Use the custom model
embeddings = sparse_model.encode(["Custom sparse encoder example"])
```

### Efficiency Analysis

```python
import numpy as np

def analyze_sparsity(embeddings, vocab_size=None):
    """Analyze sparsity patterns in sparse embeddings."""
    if not isinstance(embeddings, list):
        embeddings = [embeddings]

    total_active = []
    total_values = []

    for emb in embeddings:
        active_dims = len(emb['indices'])
        total_active.append(active_dims)
        total_values.extend(emb['values'])

    if vocab_size:
        # Density is the active fraction; sparsity is the fraction of zero dimensions
        avg_density = sum(total_active) / (len(embeddings) * vocab_size)
        print(f"Average sparsity: {1 - avg_density:.6f}")

    print(f"Average active dimensions: {np.mean(total_active):.1f}")
    print(f"Min/Max active dimensions: {min(total_active)}/{max(total_active)}")
    print(f"Average value: {np.mean(total_values):.4f}")
    print(f"Value range: {min(total_values):.4f} to {max(total_values):.4f}")

# Analyze encodings
analyze_sparsity(sparse_embeddings, vocab_size=sparse_model.get_sentence_embedding_dimension())
```

### Model Card and Saving

```python
from sentence_transformers import SparseEncoderModelCardData

# Create model card
model_card_data = SparseEncoderModelCardData(
    language=['en'],
    license='apache-2.0',
    tags=['sentence-transformers', 'sparse-encoder', 'retrieval'],
    model_name='Custom Sparse Encoder',
    train_datasets=['ms-marco'],
    eval_datasets=['beir']
)

# Save with model card
sparse_model.save('./my-sparse-model', model_card_data=model_card_data)

# Push to hub
sparse_model.push_to_hub('my-username/my-sparse-encoder')
```

## Storage and Deployment

### Efficient Storage Format

```python
import numpy as np

def sparse_to_compressed(sparse_embedding):
    """Convert sparse embedding to compressed format."""
    return {
        'indices': np.array(sparse_embedding['indices'], dtype=np.uint32),
        'values': np.array(sparse_embedding['values'], dtype=np.float32)
    }

def compressed_to_sparse(compressed_embedding):
    """Convert compressed format back to sparse embedding."""
    return {
        'indices': compressed_embedding['indices'].tolist(),
        'values': compressed_embedding['values'].tolist()
    }

# Compress embeddings for storage
compressed_embeddings = [sparse_to_compressed(emb) for emb in sparse_embeddings]
```
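
When many embeddings are stored together, one option is to flatten them into three arrays (a CSR-like layout) and write them as a single compressed file. The sketch below assumes the indices/values dictionary format used above. `pack_embeddings` and `unpack_embedding` are hypothetical helpers, and an in-memory buffer stands in for a file on disk.

```python
import io
import numpy as np

def pack_embeddings(embeddings):
    """Hypothetical helper: flatten a non-empty list of sparse embeddings into three arrays."""
    lengths = [len(emb["indices"]) for emb in embeddings]
    # offsets[i]:offsets[i + 1] is the slice that belongs to embedding i
    offsets = np.concatenate([[0], np.cumsum(lengths)]).astype(np.int64)
    indices = np.concatenate([np.asarray(emb["indices"], dtype=np.uint32) for emb in embeddings])
    values = np.concatenate([np.asarray(emb["values"], dtype=np.float32) for emb in embeddings])
    return offsets, indices, values

def unpack_embedding(offsets, indices, values, i):
    """Recover embedding i from the packed arrays."""
    start, end = offsets[i], offsets[i + 1]
    return {"indices": indices[start:end].tolist(), "values": values[start:end].tolist()}

# Toy embeddings standing in for real model output
toy = [
    {"indices": [3, 17], "values": [0.5, 1.25]},
    {"indices": [2, 3, 40], "values": [0.75, 0.5, 2.0]},
]
offsets, indices, values = pack_embeddings(toy)

# Round-trip through a compressed .npz archive held in memory
buffer = io.BytesIO()
np.savez_compressed(buffer, offsets=offsets, indices=indices, values=values)
buffer.seek(0)
loaded = np.load(buffer)
restored = unpack_embedding(loaded["offsets"], loaded["indices"], loaded["values"], 1)
print(restored)
```

Keeping one offsets array instead of one Python dictionary per document avoids per-object overhead and lets a single embedding be read with two array slices.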

### Batch Processing for Large Corpora

```python
def encode_large_corpus(sparse_model, texts, batch_size=1000, log_every=10000):
    """Encode a large corpus in batches, reporting progress periodically."""
    all_embeddings = []

    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        batch_embeddings = sparse_model.encode(
            batch,
            batch_size=32,
            show_progress_bar=True,
            convert_to_numpy=False
        )
        all_embeddings.extend(batch_embeddings)

        # Report progress periodically (a good place to persist all_embeddings as well)
        if (i + batch_size) % log_every == 0:
            print(f"Processed {i + batch_size} documents...")

    return all_embeddings

# Example with large dataset
large_corpus = [f"Document {i} with content" for i in range(50000)]
corpus_embeddings = encode_large_corpus(sparse_model, large_corpus)
```

## Best Practices

1. **Sparsity Control**: Monitor sparsity levels to balance efficiency and quality
2. **Vocabulary Management**: Understand the vocabulary size and active dimensions
3. **Storage Efficiency**: Use compressed formats for large-scale deployment
4. **Retrieval Systems**: Implement efficient sparse similarity computation
5. **Training Data**: Use diverse query-document pairs for robust training
6. **Evaluation**: Test on retrieval benchmarks like BEIR for comprehensive evaluation
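
As one way to approach the efficient similarity computation mentioned in item 4, the sketch below scores documents through an inverted index, so that only documents sharing an active dimension with the query are touched. It assumes the indices/values dictionary format used throughout this page and dot-product scoring. `build_inverted_index` and `search` are hypothetical helpers, not library functions.

```python
from collections import defaultdict

def build_inverted_index(doc_embeddings):
    """Map each active dimension to the (doc_id, value) pairs that use it."""
    index = defaultdict(list)
    for doc_id, emb in enumerate(doc_embeddings):
        for dim, value in zip(emb["indices"], emb["values"]):
            index[dim].append((doc_id, value))
    return index

def search(index, query_embedding, top_k=3):
    """Accumulate dot-product scores over the query's active dimensions only."""
    scores = defaultdict(float)
    for dim, q_value in zip(query_embedding["indices"], query_embedding["values"]):
        for doc_id, d_value in index.get(dim, []):
            scores[doc_id] += q_value * d_value
    # Highest score first; documents with no shared dimension never appear
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]

# Toy embeddings standing in for real model output
docs = [
    {"indices": [1, 5], "values": [1.0, 2.0]},
    {"indices": [5, 9], "values": [0.5, 1.0]},
    {"indices": [7], "values": [3.0]},
]
query = {"indices": [5, 9], "values": [1.0, 2.0]}

index = build_inverted_index(docs)
results = search(index, query)
print(results)  # [(1, 2.5), (0, 2.0)]
```

The cost of a query depends on the length of the posting lists for its active dimensions rather than on the corpus size, which is the same property that makes BM25-style retrieval fast.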