Tessl Tile for pypi/sentence-transformers@5.1.0

# Utilities

The sentence-transformers package provides utility functions for embedding quantization, model optimization and export, hard-negative mining, similarity computation, and training support.

## Model Quantization

### quantize_embeddings

```python
def quantize_embeddings(
    embeddings: Tensor | np.ndarray,
    precision: Literal["float32", "int8", "uint8", "binary", "ubinary"],
    ranges: np.ndarray | None = None,
    calibration_embeddings: np.ndarray | None = None
) -> np.ndarray
```
`{ .api }`

Quantize embeddings to a lower precision to reduce memory and storage requirements and to speed up downstream similarity search.

**Parameters**:
- `embeddings`: Unquantized (e.g. float) embeddings to quantize to a given precision
- `precision`: The precision to convert to ("float32", "int8", "uint8", "binary", "ubinary")
- `ranges`: Ranges for quantization of embeddings. Used for int8 and uint8 quantization, where the ranges are the minimum and maximum values for each dimension. 2D array with shape (2, embedding_dim)
- `calibration_embeddings`: Embeddings used for calibration during quantization. Used for int8 and uint8 quantization to compute the ranges when `ranges` is not given

**Returns**: Quantized embeddings with the specified precision. For "binary" and "ubinary", eight dimensions are packed into each byte, so the output has embedding_dim / 8 columns

**Usage Examples**:

```python
import numpy as np
from sentence_transformers import quantize_embeddings, SentenceTransformer

# Generate sample embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = ["Hello world", "How are you?", "Machine learning is great"]
embeddings = model.encode(sentences)

# Float32 quantization (no change, returns same embeddings)
quantized_embs = quantize_embeddings(embeddings, precision="float32")
print(f"Original size: {embeddings.nbytes} bytes")
print(f"Quantized size: {quantized_embs.nbytes} bytes")

# Int8 quantization with calibration
calibration_data = model.encode(["Sample sentence " + str(i) for i in range(100)])
quantized_int8 = quantize_embeddings(
    embeddings,
    precision="int8",
    calibration_embeddings=calibration_data
)
print(f"Int8 size: {quantized_int8.nbytes} bytes")  # One byte per dimension

# Binary quantization (extreme compression, one bit per dimension)
binary_embs = quantize_embeddings(embeddings, precision="binary")
print(f"Binary shape: {binary_embs.shape}")  # (3, 48) for 384-dimensional embeddings
```

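To make the `ranges` and binary options more concrete, the following dependency-free sketch mimics the idea on plain Python lists. It assumes a linear mapping of each dimension's [min, max] range onto 256 buckets and a sign-based packing of eight dimensions per byte. The helper names (`compute_ranges`, `to_int8`, `to_ubinary`) are illustrative and not part of the library, whose exact rounding may differ.

```python
def compute_ranges(calibration):
    """Return per-dimension (mins, maxs) computed from calibration vectors."""
    dims = range(len(calibration[0]))
    mins = [min(vec[d] for vec in calibration) for d in dims]
    maxs = [max(vec[d] for vec in calibration) for d in dims]
    return mins, maxs


def to_int8(vector, mins, maxs):
    """Map each value linearly from [min, max] onto the int8 range [-128, 127]."""
    quantized = []
    for value, lo, hi in zip(vector, mins, maxs):
        span = hi - lo
        scaled = (value - lo) / span * 255 if span else 0.0
        bucket = int(scaled) - 128
        quantized.append(max(-128, min(127, bucket)))  # clamp out-of-range values
    return quantized


def to_ubinary(vector):
    """Pack the sign of every 8 values into one unsigned byte (first value = highest bit)."""
    packed = []
    for start in range(0, len(vector), 8):
        chunk = vector[start:start + 8]
        byte = 0
        for value in chunk:
            byte = (byte << 1) | (1 if value > 0 else 0)
        byte <<= 8 - len(chunk)  # zero-pad an incomplete final chunk on the right
        packed.append(byte)
    return packed


calibration = [[-1.0, 0.0], [1.0, 2.0]]
mins, maxs = compute_ranges(calibration)
int8_vec = to_int8([0.0, 2.0], mins, maxs)
bits = to_ubinary([0.5, -0.1, 0.3, 0.9, -0.7, -0.2, 0.0, 0.4])
print(int8_vec, bits)
```

The calibration step only fixes the per-dimension minimum and maximum, which is why calibration embeddings should resemble the data that will be quantized later.
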
## Model Export

The export helpers operate on a model that was loaded with the matching backend (`backend="onnx"` or `backend="openvino"`). They write the exported file into a local model directory or push it to the Hugging Face Hub.

### export_optimized_onnx_model

```python
def export_optimized_onnx_model(
    model: SentenceTransformer,
    optimization_config: OptimizationConfig | Literal["O1", "O2", "O3", "O4"],
    model_name_or_path: str,
    push_to_hub: bool = False,
    create_pr: bool = False,
    file_suffix: str | None = None
) -> None
```
`{ .api }`

Optimize the ONNX model of a SentenceTransformer and save or upload the result.

**Parameters**:
- `model`: SentenceTransformer model loaded with the ONNX backend
- `optimization_config`: Optimization level ("O1", "O2", "O3", "O4") or an Optimum `OptimizationConfig`
- `model_name_or_path`: Local model directory or Hub repository ID to save the optimized model to
- `push_to_hub`: Whether to push the exported model to the Hub instead of saving it locally
- `create_pr`: Whether to open a pull request when pushing to the Hub
- `file_suffix`: Suffix appended to the exported file name; derived from the configuration when omitted

### export_dynamic_quantized_onnx_model

```python
def export_dynamic_quantized_onnx_model(
    model: SentenceTransformer,
    quantization_config: QuantizationConfig | Literal["arm64", "avx2", "avx512", "avx512_vnni"],
    model_name_or_path: str,
    push_to_hub: bool = False,
    create_pr: bool = False,
    file_suffix: str | None = None
) -> None
```
`{ .api }`

Export a dynamically quantized (int8) ONNX model.

**Parameters**:
- `model`: SentenceTransformer model loaded with the ONNX backend
- `quantization_config`: Target instruction set ("arm64", "avx2", "avx512", "avx512_vnni") or an Optimum `QuantizationConfig`
- `model_name_or_path`: Local model directory or Hub repository ID to save the quantized model to
- `push_to_hub`, `create_pr`, `file_suffix`: Same meaning as for `export_optimized_onnx_model`

### export_static_quantized_openvino_model

```python
def export_static_quantized_openvino_model(
    model: SentenceTransformer,
    quantization_config: OVQuantizationConfig | dict | None,
    model_name_or_path: str,
    dataset_name: str | None = None,
    dataset_config_name: str | None = None,
    dataset_split: str | None = None,
    column_name: str | None = None,
    push_to_hub: bool = False,
    create_pr: bool = False,
    file_suffix: str = "qint8_quantized"
) -> None
```
`{ .api }`

Export a statically quantized OpenVINO model for Intel hardware.

**Parameters**:
- `model`: SentenceTransformer model loaded with the OpenVINO backend
- `quantization_config`: An `OVQuantizationConfig`, an equivalent dict, or `None` for the default configuration
- `model_name_or_path`: Local model directory or Hub repository ID to save the quantized model to
- `dataset_name`, `dataset_config_name`, `dataset_split`, `column_name`: Hub dataset, configuration, split, and text column used for calibration; a default calibration dataset is used when omitted
- `push_to_hub`, `create_pr`, `file_suffix`: Same meaning as for `export_optimized_onnx_model`

**Usage Examples**:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.backend import (
    export_optimized_onnx_model,
    export_dynamic_quantized_onnx_model,
    export_static_quantized_openvino_model
)

# Requires the optional extras: pip install "sentence-transformers[onnx]" / "[openvino]"

# Load the model with the ONNX backend and save it, so that the target
# directory contains the full model next to the exported files
onnx_model = SentenceTransformer('all-MiniLM-L6-v2', backend="onnx")
onnx_model.save_pretrained("./exported_model")

# Export to optimized ONNX (by default written to ./exported_model/onnx/model_O2.onnx)
export_optimized_onnx_model(
    model=onnx_model,
    optimization_config="O2",
    model_name_or_path="./exported_model"
)

# Export to dynamically quantized ONNX for faster CPU inference
export_dynamic_quantized_onnx_model(
    model=onnx_model,
    quantization_config="avx512_vnni",
    model_name_or_path="./exported_model"
)

# Export to statically quantized OpenVINO for Intel hardware
ov_model = SentenceTransformer('all-MiniLM-L6-v2', backend="openvino")
export_static_quantized_openvino_model(
    model=ov_model,
    quantization_config=None,
    model_name_or_path="./exported_model"
)

# Load the optimized export and use it like any other model
optimized_model = SentenceTransformer(
    "./exported_model",
    backend="onnx",
    model_kwargs={"file_name": "onnx/model_O2.onnx"}
)
embeddings = optimized_model.encode(["Hello world"])
print(f"ONNX embedding shape: {embeddings.shape}")
```

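After an export it is worth checking that the exported model still produces embeddings close to the original ones. The sketch below is one possible check on plain lists of floats; the function name `compare_embeddings` and the thresholds `atol=1e-3` and `min_cosine=0.999` are assumptions for illustration, not values taken from the library.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def compare_embeddings(reference, candidate, atol=1e-3, min_cosine=0.999):
    """Return (ok, worst_abs_diff, worst_cosine) for two lists of vectors."""
    if len(reference) != len(candidate):
        raise ValueError("both models must embed the same number of sentences")
    worst_diff, worst_cos = 0.0, 1.0
    for ref_vec, cand_vec in zip(reference, candidate):
        worst_diff = max(worst_diff, max(abs(r - c) for r, c in zip(ref_vec, cand_vec)))
        worst_cos = min(worst_cos, cosine(ref_vec, cand_vec))
    return worst_diff <= atol and worst_cos >= min_cosine, worst_diff, worst_cos


# Stand-in vectors; in practice these would come from calling encode() on the
# original and on the exported model for the same sentences (via .tolist()).
original = [[0.10, 0.20, 0.30], [0.40, 0.50, 0.60]]
exported = [[0.1001, 0.2000, 0.2999], [0.4000, 0.5002, 0.6000]]
ok, worst_diff, worst_cos = compare_embeddings(original, exported)
print(ok, worst_diff, worst_cos)
```

Quantized exports usually deviate more than merely optimized ones, so the thresholds would need to be tuned per export type.
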
## Training Utilities

### mine_hard_negatives

```python
# Abridged: further keyword options (corpus, cross_encoder, prompts, ...) are omitted
def mine_hard_negatives(
    dataset: Dataset,
    model: SentenceTransformer,
    range_min: int = 0,
    range_max: int | None = None,
    max_score: float | None = None,
    absolute_margin: float | None = None,
    relative_margin: float | None = None,
    num_negatives: int = 3,
    sampling_strategy: Literal["random", "top"] = "top",
    output_format: Literal["triplet", "n-tuple", "n-tuple-scores", "labeled-pair", "labeled-list"] = "triplet",
    batch_size: int = 32,
    use_faiss: bool = False
) -> Dataset
```
`{ .api }`

Mine hard negatives for a dataset of (anchor, positive) pairs: for each anchor, retrieve the most similar candidates that are not its positive. Training on such negatives typically improves contrastive learning.

**Parameters**:
- `dataset`: Dataset with an anchor column and a positive column
- `model`: SentenceTransformer model used for encoding and retrieval
- `range_min`: Number of top-ranked candidates to skip
- `range_max`: Maximum rank to consider when selecting negatives
- `max_score`: Maximum similarity a candidate may have to be accepted as a negative
- `absolute_margin`: A negative must score at least this much below the positive
- `relative_margin`: A negative must score at least this fraction below the positive (0.05 means at most 95% of the positive's similarity)
- `num_negatives`: Number of negatives to mine per (anchor, positive) pair
- `sampling_strategy`: "top" takes the highest-ranked candidates, "random" samples within the allowed range
- `output_format`: Layout of the returned dataset
- `batch_size`: Batch size for encoding
- `use_faiss`: Whether to use FAISS for similarity search

**Returns**: A `Dataset` in the requested format. With "triplet", each row contains the anchor, the positive, and one mined negative

**Usage Examples**:

```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    mine_hard_negatives
)
from sentence_transformers.losses import TripletLoss

model = SentenceTransformer('all-MiniLM-L6-v2')

# Prepare (anchor, positive) pairs; toy data for illustration only,
# a real dataset would contain far more pairs
pairs = Dataset.from_dict({
    "anchor": [
        "Python is a programming language",
        "Machine learning uses algorithms",
        "Cars are vehicles"
    ],
    "positive": [
        "Java is used for software development",
        "Deep learning is a subset of ML",
        "Trucks are large vehicles"
    ]
})

# Mine hard negatives
hard_negatives = mine_hard_negatives(
    dataset=pairs,
    model=model,
    num_negatives=1,
    absolute_margin=0.1,
    sampling_strategy="top",
    output_format="triplet",
    batch_size=32
)

print("Hard negative examples:")
for example in hard_negatives:
    print(f"Anchor: {example['anchor']}")
    print(f"Positive: {example['positive']}")
    print(f"Hard Negative: {example['negative']}")
    print()

# The mined triplets can be used directly as a training dataset
triplet_loss = TripletLoss(model)
trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=hard_negatives,
    loss=triplet_loss
)
trainer.train()
```

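The selection logic behind hard-negative mining can be illustrated without any model. The sketch below ranks candidates by cosine similarity on hand-written 2-D vectors and keeps those that are not the positive and score at least a margin below it. The function `mine` and the toy vectors are hypothetical; the library's own implementation supports many more options.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mine(anchors, positives, corpus, num_negatives=1, absolute_margin=0.0):
    """Pick, per (anchor, positive) pair, the most similar corpus items that are
    not the positive and score at least `absolute_margin` below it."""
    triplets = []
    for (a_text, a_vec), (p_text, p_vec) in zip(anchors, positives):
        pos_score = cosine(a_vec, p_vec)
        candidates = [
            (cosine(a_vec, c_vec), c_text)
            for c_text, c_vec in corpus
            if c_text != p_text
        ]
        # Drop candidates that are too close to the positive (likely false negatives)
        candidates = [c for c in candidates if c[0] <= pos_score - absolute_margin]
        candidates.sort(reverse=True)  # hardest (most similar) first
        for _, c_text in candidates[:num_negatives]:
            triplets.append({"anchor": a_text, "positive": p_text, "negative": c_text})
    return triplets


corpus = [
    ("python tutorial", [1.0, 0.1]),
    ("java tutorial", [0.9, 0.4]),
    ("truck review", [0.0, 1.0]),
]
anchors = [("learn python", [1.0, 0.0])]
positives = [("python tutorial", [1.0, 0.1])]

hard = mine(anchors, positives, corpus, absolute_margin=0.05)
easy = mine(anchors, positives, corpus, absolute_margin=0.5)
print(hard, easy)
```

A small margin keeps the closest non-positive candidate, while a large margin pushes the selection toward easier negatives.
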
## Similarity Functions

The `SimilarityFunction` enum provides standardized similarity computation methods:

```python
from sentence_transformers import SimilarityFunction

class SimilarityFunction(Enum):
    COSINE = "cosine"
    DOT_PRODUCT = "dot"
    DOT = "dot"  # Alias for DOT_PRODUCT
    EUCLIDEAN = "euclidean"
    MANHATTAN = "manhattan"
```
`{ .api }`

**Usage Examples**:

```python
import torch
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer, SimilarityFunction

# Use with SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2', similarity_fn_name=SimilarityFunction.COSINE)

# Manual similarity computation
def compute_similarity(embeddings1, embeddings2, similarity_fn):
    """Compute row-wise similarity between two equally shaped sets of embeddings."""
    if similarity_fn == SimilarityFunction.COSINE:
        return F.cosine_similarity(embeddings1, embeddings2, dim=-1)
    elif similarity_fn == SimilarityFunction.DOT_PRODUCT:
        return torch.sum(embeddings1 * embeddings2, dim=-1)
    elif similarity_fn == SimilarityFunction.EUCLIDEAN:
        # Negative distance, so that larger values mean "more similar"
        return -torch.linalg.norm(embeddings1 - embeddings2, dim=-1)
    elif similarity_fn == SimilarityFunction.MANHATTAN:
        return -torch.sum(torch.abs(embeddings1 - embeddings2), dim=-1)
    raise ValueError(f"Unsupported similarity function: {similarity_fn}")

# Example usage
emb1 = model.encode(["First sentence"], convert_to_tensor=True)
emb2 = model.encode(["Second sentence"], convert_to_tensor=True)

# Iterating over an Enum skips aliases, so DOT is not visited separately
for sim_fn in SimilarityFunction:
    sim_score = compute_similarity(emb1, emb2, sim_fn)
    print(f"{sim_fn.value}: {sim_score.item():.4f}")
```

## Batch Samplers

### DefaultBatchSampler

```python
class DefaultBatchSampler(BatchSampler):
    def __init__(
        self,
        sampler: Sampler[int] | Iterable[int],
        batch_size: int,
        drop_last: bool = False,
        generator: torch.Generator | None = None
    )
```
`{ .api }`

Standard batch sampler for single dataset training. It groups the indices produced by `sampler` into batches, like PyTorch's `BatchSampler`.

### MultiDatasetDefaultBatchSampler

```python
class MultiDatasetDefaultBatchSampler(BatchSampler):
    def __init__(
        self,
        dataset: ConcatDataset,
        batch_samplers: list[BatchSampler],
        generator: torch.Generator | None = None,
        seed: int = 0
    )
```
`{ .api }`

Base class for batch samplers used in multi-dataset training. It combines one batch sampler per dataset; the sampling strategy ("proportional" or "round_robin") is selected through the `multi_dataset_batch_sampler` training argument.

**Parameters**:
- `dataset`: Concatenation of the training datasets
- `batch_samplers`: One batch sampler per dataset
- `generator`: Random generator for reproducibility
- `seed`: Seed for the random generator

**Usage Examples**:

```python
import torch
from datasets import Dataset
from torch.utils.data import SubsetRandomSampler
from sentence_transformers import (
    DefaultBatchSampler,
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments
)
from sentence_transformers.training_args import BatchSamplers, MultiDatasetBatchSamplers

# Single dataset sampler: batches of shuffled dataset indices
dataset = Dataset.from_list([{"text": f"Example {i}"} for i in range(1000)])
generator = torch.Generator().manual_seed(42)
sampler = DefaultBatchSampler(
    SubsetRandomSampler(range(len(dataset)), generator=generator),
    batch_size=32,
    drop_last=True
)

# Multi-dataset training
dataset1 = Dataset.from_list([{"text": f"Dataset1 {i}"} for i in range(500)])
dataset2 = Dataset.from_list([{"text": f"Dataset2 {i}"} for i in range(300)])

# The trainer builds the samplers itself; the strategies are selected
# through the training arguments
args = SentenceTransformerTrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=32,
    batch_sampler=BatchSamplers.BATCH_SAMPLER,
    multi_dataset_batch_sampler=MultiDatasetBatchSamplers.PROPORTIONAL
)

# Use in training
model = SentenceTransformer('all-MiniLM-L6-v2')
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset={"ds1": dataset1, "ds2": dataset2},
    # Samplers are configured automatically from `args` and the datasets
)
```

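The difference between the two multi-dataset strategies can be sketched on lists of ready-made batches. This is a simplified, dependency-free illustration under the assumption that round-robin alternates between datasets until one runs out, and that proportional sampling visits every batch once in random order; the function names are hypothetical and not library API.

```python
import random


def round_robin(batches_per_dataset):
    """Alternate between datasets and stop once any dataset runs out of batches."""
    order = []
    iterators = {name: iter(batches) for name, batches in batches_per_dataset.items()}
    if not iterators:
        return order
    while True:
        for name, iterator in iterators.items():
            batch = next(iterator, None)
            if batch is None:
                return order
            order.append((name, batch))


def proportional(batches_per_dataset, seed=0):
    """Visit every batch exactly once in random order, so larger datasets appear more often."""
    order = [
        (name, batch)
        for name, batches in batches_per_dataset.items()
        for batch in batches
    ]
    random.Random(seed).shuffle(order)
    return order


batches = {
    "ds1": [[0, 1], [2, 3], [4, 5]],  # three batches
    "ds2": [[0, 1]],                  # one batch
}
rr_names = [name for name, _ in round_robin(batches)]
prop_names = [name for name, _ in proportional(batches)]
print(rr_names, prop_names)
```

Round-robin gives every dataset equal weight but may leave part of the larger datasets unused, whereas proportional sampling uses all data and lets larger datasets dominate.
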
## Model Components

The `sentence_transformers.models` module provides modular components for building custom architectures:

### Core Components

```python
from sentence_transformers.models import (
    Transformer,     # BERT, RoBERTa, etc.
    Pooling,         # Mean, max, CLS pooling
    Dense,           # Linear transformation
    Normalize        # L2 normalization
)
```

**Usage Examples**:

```python
import torch.nn as nn
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import Transformer, Pooling, Dense, Normalize

# Build custom model architecture
transformer = Transformer('distilbert-base-uncased', max_seq_length=256)
pooling = Pooling(
    word_embedding_dimension=transformer.get_word_embedding_dimension(),
    pooling_mode='mean'
)
dense = Dense(
    in_features=pooling.get_sentence_embedding_dimension(),
    out_features=256,
    activation_function=nn.Tanh()  # Must be a callable module, not a string
)
normalize = Normalize()

# Combine components
custom_model = SentenceTransformer(modules=[transformer, pooling, dense, normalize])

# Use custom model
embeddings = custom_model.encode(["Custom architecture example"])
print(f"Custom embedding shape: {embeddings.shape}")
```

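The pooling step turns per-token vectors into a single sentence vector. As a rough illustration of the mean, max, and CLS modes mentioned above, here is a minimal sketch on plain lists that ignores padded positions via an attention mask. The function `pool` is hypothetical and is not the library's `Pooling` module.

```python
def pool(token_embeddings, attention_mask, mode="mean"):
    """Collapse per-token vectors into one sentence vector, ignoring padding."""
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if mode == "cls":
        # The first token is used as the sentence representation
        return list(token_embeddings[0])
    if mode == "max":
        return [max(column) for column in zip(*kept)]
    if mode == "mean":
        return [sum(column) / len(kept) for column in zip(*kept)]
    raise ValueError(f"unknown pooling mode: {mode}")


tokens = [[1.0, 4.0], [3.0, 0.0], [9.0, 9.0]]  # the last token is padding
mask = [1, 1, 0]

mean_vec = pool(tokens, mask, "mean")
max_vec = pool(tokens, mask, "max")
cls_vec = pool(tokens, mask, "cls")
print(mean_vec, max_vec, cls_vec)
```

Without the mask, the padding vector would distort both the mean and the max result.
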
### Additional Components

```python
from sentence_transformers.models import (
    CNN,                    # Convolutional layers
    LSTM,                   # LSTM layers
    BoW,                    # Bag of words
    WordEmbeddings,         # Word embeddings layer (Word2Vec, GloVe)
    WordWeights,            # Per-word weighting (e.g. IDF)
    StaticEmbedding,        # Static token embeddings
    WeightedLayerPooling,   # Weighted pooling across layers
    CLIPModel,              # CLIP integration
    Router,                 # Multi-encoder routing
    Dropout,                # Dropout layer
    LayerNorm               # Layer normalization
)
```

## Performance Optimization

### Memory-Efficient Training

```python
import torch.nn as nn
from sentence_transformers import SentenceTransformer

def create_memory_efficient_model(base_model_name, target_dim=256):
    """Create memory-efficient model with reduced dimensions."""
    from sentence_transformers.models import Transformer, Pooling, Dense, Normalize

    transformer = Transformer(base_model_name, max_seq_length=256)
    pooling = Pooling(transformer.get_word_embedding_dimension(), pooling_mode='mean')

    # Add dimension reduction for memory efficiency
    dense = Dense(
        in_features=pooling.get_sentence_embedding_dimension(),
        out_features=target_dim,
        activation_function=nn.Tanh()  # Must be a callable module, not a string
    )
    normalize = Normalize()

    return SentenceTransformer(modules=[transformer, pooling, dense, normalize])

# Create efficient model
efficient_model = create_memory_efficient_model('bert-base-uncased', target_dim=128)
```

### Inference Optimization

```python
from sentence_transformers import SentenceTransformer

def optimize_for_inference(model, sentences, batch_size=64):
    """Encode sentences in fixed-size chunks without tracking gradients."""
    import torch

    model.eval()  # Set to evaluation mode
    embeddings = []

    # encode() already batches and disables gradients internally; the explicit
    # loop mainly bounds how many sentences are handled per call
    with torch.no_grad():
        for i in range(0, len(sentences), batch_size):
            batch = sentences[i:i + batch_size]
            batch_embeddings = model.encode(
                batch,
                batch_size=len(batch),
                show_progress_bar=False,
                convert_to_tensor=False,
                normalize_embeddings=True  # For cosine similarity
            )
            embeddings.extend(batch_embeddings)

    return embeddings  # List with one NumPy vector per sentence

# Optimized inference
model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = [f"Sentence {i}" for i in range(1000)]
fast_embeddings = optimize_for_inference(model, sentences)
```

## Debugging and Logging

### LoggingHandler

```python
# Import path: from sentence_transformers import LoggingHandler
import logging

class LoggingHandler(logging.Handler):
    def emit(self, record: logging.LogRecord) -> None:
        """Emit log record without interfering with tqdm progress bars."""
        ...
```
`{ .api }`

Custom logging handler that writes log records without breaking tqdm progress bars.

**Usage Examples**:

```python
import logging
from sentence_transformers import LoggingHandler

# Set up logging
logging.basicConfig(
    format='%(asctime)s - %(message)s',
    datefmt='%Y-%m-%d %H:%M:%S',
    level=logging.INFO,
    handlers=[LoggingHandler()]
)

logger = logging.getLogger(__name__)

# Use with training
def train_with_logging(model, trainer):
    logger.info("Starting training...")

    trainer.train()

    logger.info("Training completed!")
    logger.info(f"Output directory: {trainer.args.output_dir}")
```

## Data Processing Utilities

### Legacy Dataset Classes (Deprecated)

```python
# Note: These are deprecated in favor of HuggingFace Datasets
from sentence_transformers.datasets import SentencesDataset, ParallelSentencesDataset
from sentence_transformers.readers import InputExample
```

### Modern Data Processing

```python
def create_training_dataset(examples, format_type="triplet"):
    """Create training dataset in various formats."""
    from datasets import Dataset

    if format_type == "triplet":
        # Format: anchor, positive, negative
        formatted_examples = [
            {
                "anchor": ex["anchor"],
                "positive": ex["positive"],
                "negative": ex["negative"]
            }
            for ex in examples
        ]
    elif format_type == "pairs":
        # Format: sentence1, sentence2, label
        formatted_examples = [
            {
                "sentence1": ex["sentence1"],
                "sentence2": ex["sentence2"],
                "label": ex["label"]
            }
            for ex in examples
        ]
    else:
        # Fail early instead of hitting an undefined variable below
        raise ValueError(f"Unknown format_type: {format_type!r}")

    return Dataset.from_list(formatted_examples)

# Example usage
examples = [
    {
        "anchor": "Python programming",
        "positive": "Coding in Python",
        "negative": "Java development"
    }
]

dataset = create_training_dataset(examples, format_type="triplet")
```

## Utility Functions for Analysis

```python
from sentence_transformers import SentenceTransformer

def analyze_model_performance(model, test_sentences):
    """Analyze model performance characteristics."""
    import time
    import numpy as np

    # Encoding speed test
    start_time = time.perf_counter()
    embeddings = model.encode(test_sentences, batch_size=32)
    encoding_time = time.perf_counter() - start_time

    # Embedding analysis
    embedding_dim = embeddings.shape[1]
    embedding_norms = np.linalg.norm(embeddings, axis=1)

    # Similarity analysis: normalize first so the dot product is the cosine similarity
    normalized = embeddings / embedding_norms[:, None]
    similarities = np.dot(normalized, normalized.T)
    pairwise = similarities[np.triu_indices_from(similarities, k=1)]  # Each pair once

    results = {
        "encoding_speed": len(test_sentences) / encoding_time,  # Sentences per second
        "embedding_dimension": embedding_dim,
        "avg_embedding_norm": np.mean(embedding_norms),
        "std_embedding_norm": np.std(embedding_norms),
        "avg_similarity": np.mean(pairwise),
        "similarity_std": np.std(pairwise)
    }

    return results

# Analyze model
model = SentenceTransformer('all-MiniLM-L6-v2')
test_texts = ["Sample sentence " + str(i) for i in range(100)]
performance = analyze_model_performance(model, test_texts)

for metric, value in performance.items():
    print(f"{metric}: {value:.4f}")
```

## Logging and Debugging

### LoggingHandler

Custom logging handler that integrates with tqdm progress bars for clean output during training and inference.

```python { .api }
class LoggingHandler(logging.Handler):
    def __init__(self, level=logging.NOTSET) -> None: ...
    def emit(self, record) -> None: ...
```

**Usage Example**:

```python
import logging
from sentence_transformers import LoggingHandler

# Set up logging with tqdm-compatible handler
logger = logging.getLogger("sentence_transformers")
logger.setLevel(logging.INFO)
logger.addHandler(LoggingHandler())

# Now logging output won't interfere with progress bars
logger.info("Training started")
```

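A common way to keep log lines from breaking a progress bar is to route each formatted record through the bar's own write function (for tqdm, `tqdm.tqdm.write`). The sketch below shows that pattern with a pluggable `write` callable so that it runs with the standard library alone. `WriterHandler` is a hypothetical stand-in, not the library's `LoggingHandler`.

```python
import logging


class WriterHandler(logging.Handler):
    """Hypothetical stand-in: formats each record and hands it to `write`."""

    def __init__(self, write, level=logging.NOTSET):
        super().__init__(level)
        self.write = write  # e.g. tqdm.tqdm.write when tqdm is installed

    def emit(self, record):
        try:
            self.write(self.format(record))
        except Exception:
            self.handleError(record)


captured = []  # Collects output here; a progress-bar-aware writer would print it
demo_logger = logging.getLogger("writer_handler_demo")
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False  # Keep the demo output out of the root logger

handler = WriterHandler(captured.append)
handler.setFormatter(logging.Formatter("%(levelname)s - %(message)s"))
demo_logger.addHandler(handler)

demo_logger.info("Training started")
demo_logger.debug("Hidden, because the logger level is INFO")
print(captured)
```

Passing the writer in as a parameter keeps the handler testable and avoids a hard dependency on a particular progress-bar package.
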
## Batch Sampling (Modern Training)

### DefaultBatchSampler

Default batch sampler used in the SentenceTransformer library, equivalent to PyTorch's BatchSampler with epoch support.

```python { .api }
class DefaultBatchSampler(BatchSampler):
    def __init__(
        self,
        sampler,
        batch_size: int,
        drop_last: bool = False
    ) -> None: ...

    def set_epoch(self, epoch: int) -> None: ...
```

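"Epoch support" usually means that the sampler reshuffles deterministically per epoch. The following sketch shows one way to do that by seeding the shuffle with `seed + epoch`; whether the library derives its seed in exactly this way is an assumption, and `EpochBatchSampler` is a hypothetical class written for illustration.

```python
import random


class EpochBatchSampler:
    """Hypothetical sketch: yields index batches, reshuffled per epoch."""

    def __init__(self, num_samples, batch_size, drop_last=False, seed=0):
        self.num_samples = num_samples
        self.batch_size = batch_size
        self.drop_last = drop_last
        self.seed = seed
        self.epoch = 0

    def set_epoch(self, epoch):
        self.epoch = epoch

    def __iter__(self):
        indices = list(range(self.num_samples))
        # Same seed and epoch always give the same order
        random.Random(self.seed + self.epoch).shuffle(indices)
        for start in range(0, len(indices), self.batch_size):
            batch = indices[start:start + self.batch_size]
            if len(batch) < self.batch_size and self.drop_last:
                break
            yield batch

    def __len__(self):
        full, rest = divmod(self.num_samples, self.batch_size)
        return full if self.drop_last or rest == 0 else full + 1


sampler = EpochBatchSampler(num_samples=10, batch_size=4, drop_last=True)
epoch0 = list(sampler)
sampler.set_epoch(1)
epoch1 = list(sampler)
print(epoch0, epoch1)
```

Calling `set_epoch` before each epoch makes runs reproducible while still varying the batch composition between epochs.
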
### MultiDatasetDefaultBatchSampler

Base class for batch samplers that draw batches from multiple datasets during training. It wraps one batch sampler per dataset.

```python { .api }
class MultiDatasetDefaultBatchSampler(BatchSampler):
    def __init__(
        self,
        dataset,
        batch_samplers: list[BatchSampler],
        generator=None,
        seed: int = 0
    ) -> None: ...

    def set_epoch(self, epoch: int) -> None: ...
```

## Legacy Components (Deprecated)

These components are included for backwards compatibility but are deprecated in favor of the modern training framework.

### Legacy Dataset Classes

```python { .api }
class SentencesDataset:
    """Deprecated: Use SentenceTransformerTrainer instead"""
    def __init__(self, examples: list, model) -> None: ...

class ParallelSentencesDataset:
    """Deprecated: Use SentenceTransformerTrainer instead"""
    def __init__(self, student_model, teacher_model) -> None: ...
```

### Legacy Input Format

```python { .api }
class InputExample:
    """Deprecated: Use standard data formats instead"""
    def __init__(
        self,
        guid: str = "",
        texts: list[str] | None = None,
        label: int | float = 0
    ) -> None: ...
```

**Migration Note**: These legacy components exist for compatibility with the old `model.fit()` training approach. For new projects, use the modern `SentenceTransformerTrainer` class instead.

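When migrating, row-style legacy examples have to become column-style data. The sketch below converts objects with `texts` and `label` fields into a dict of columns; `LegacyExample` is a stand-in with the same fields as `InputExample`, and `to_columns` with its default column names is a hypothetical helper.

```python
from dataclasses import dataclass, field


@dataclass
class LegacyExample:
    """Stand-in with the same fields as the legacy InputExample."""
    texts: list = field(default_factory=list)
    label: float = 0
    guid: str = ""


def to_columns(examples, text_columns=("sentence1", "sentence2"), label_column="label"):
    """Turn row-style examples into a dict of equally long columns."""
    columns = {name: [] for name in text_columns}
    columns[label_column] = []
    for example in examples:
        if len(example.texts) != len(text_columns):
            raise ValueError(
                f"expected {len(text_columns)} texts, got {len(example.texts)}"
            )
        for name, text in zip(text_columns, example.texts):
            columns[name].append(text)
        columns[label_column].append(example.label)
    return columns


legacy = [
    LegacyExample(texts=["A cat sits", "A cat is sitting"], label=0.9),
    LegacyExample(texts=["A cat sits", "Stocks fell"], label=0.1),
]
columns = to_columns(legacy)
print(columns)
```

The resulting dict could then be passed to `datasets.Dataset.from_dict(columns)` to obtain a dataset for the trainer.
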
717
## Best Practices
718

719
1. **Quantization**: Use float16 for balanced performance and quality
720
2. **Export**: Export to ONNX for deployment and cross-platform compatibility  
721
3. **Hard Negatives**: Use hard negative mining to improve contrastive learning
722
4. **Batch Processing**: Process in batches for memory efficiency
723
5. **Caching**: Cache embeddings for repeated use
724
6. **Monitoring**: Use LoggingHandler for training monitoring
725
7. **Profiling**: Profile inference speed and memory usage for optimization
726
8. **Testing**: Test exported models match original model outputs

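As one possible reading of the caching advice, the sketch below keeps embeddings in a dict keyed by a hash of the text and only calls the encoder for unseen sentences. `EmbeddingCache` and `fake_encode` are hypothetical; in practice `encode_fn` would wrap a call such as `model.encode`.

```python
import hashlib


class EmbeddingCache:
    """Hypothetical in-memory cache that encodes each distinct text only once."""

    def __init__(self, encode_fn):
        self.encode_fn = encode_fn  # Callable: list of texts -> list of vectors
        self.store = {}

    @staticmethod
    def key(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def encode(self, sentences):
        # dict.fromkeys removes duplicates while keeping the original order
        missing = [s for s in dict.fromkeys(sentences) if self.key(s) not in self.store]
        if missing:
            for sentence, vector in zip(missing, self.encode_fn(missing)):
                self.store[self.key(sentence)] = vector
        return [self.store[self.key(s)] for s in sentences]


calls = []  # Records which batches actually reached the encoder


def fake_encode(batch):
    """Stand-in encoder: the 'embedding' is just the text length."""
    calls.append(list(batch))
    return [[float(len(text))] for text in batch]


cache = EmbeddingCache(fake_encode)
first = cache.encode(["a", "bb", "a"])
second = cache.encode(["bb", "ccc"])
print(first, second, calls)
```

Only "ccc" reaches the encoder on the second call, since "bb" is served from the cache.
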