# Loss Functions

The sentence-transformers package provides an extensive collection of loss functions designed for different learning objectives and training scenarios. These losses enable contrastive learning, supervised fine-tuning, and specialized training approaches.

## Import Statement

```python
from sentence_transformers.losses import (
    CosineSimilarityLoss,
    MultipleNegativesRankingLoss,
    TripletLoss,
    MatryoshkaLoss,
    # ... other loss functions
)
```

## Core Loss Functions

### CosineSimilarityLoss

```python
class CosineSimilarityLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        loss_fct: torch.nn.Module = torch.nn.MSELoss(),
        cos_score_transformation: torch.nn.Module = torch.nn.Identity()
    )
```
`{ .api }`

Loss function that measures cosine similarity between sentence pairs with target similarity scores.

**Parameters**:
- `model`: SentenceTransformer model
- `loss_fct`: Loss function to apply to cosine similarities (default: MSELoss)
- `cos_score_transformation`: Transformation applied to cosine scores

**Use Case**: Regression on similarity scores, semantic textual similarity tasks
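
The computation behind this loss can be illustrated without any dependencies. The following is a minimal sketch, assuming plain Python lists stand in for sentence embeddings and the default MSE objective with no score transformation; the helper names `cosine_similarity` and `cosine_mse_loss` are hypothetical and not part of the library.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_mse_loss(pairs, labels):
    """Mean squared error between cosine similarities and gold scores."""
    errors = [(cosine_similarity(u, v) - y) ** 2 for (u, v), y in zip(pairs, labels)]
    return sum(errors) / len(errors)

# Toy embeddings: one identical pair, one orthogonal pair
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]

perfect = cosine_mse_loss(pairs, [1.0, 0.0])  # labels agree with the similarities
swapped = cosine_mse_loss(pairs, [0.0, 1.0])  # every prediction is off by 1
```

Here `perfect` is zero because the similarities already match the labels, while `swapped` is large because both pairs are scored the wrong way round.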

### MultipleNegativesRankingLoss

```python
class MultipleNegativesRankingLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        scale: float = 20.0,
        similarity_fct: callable = cos_sim
    )
```
`{ .api }`

Contrastive loss using in-batch negatives. Optimizes for positive pairs while treating other examples in the batch as negatives.

**Parameters**:
- `model`: SentenceTransformer model
- `scale`: Scaling factor for similarities
- `similarity_fct`: Function to compute similarities

**Use Case**: Asymmetric retrieval tasks, contrastive learning with large batches
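
To make the in-batch negatives idea concrete, here is a dependency-free sketch of the objective: for each anchor, the scaled similarities to every positive in the batch are treated as logits, and cross-entropy is taken against the index of the anchor's own positive. The function names are hypothetical, and plain lists stand in for embeddings.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def in_batch_negatives_loss(anchors, positives, scale=20.0):
    """Cross-entropy over scaled similarities; positives[i] is the target for anchors[i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine_similarity(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over all candidates in the batch
        top = max(logits)
        log_z = top + math.log(sum(math.exp(x - top) for x in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)

anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_negatives_loss(anchors, [[1.0, 0.1], [0.1, 1.0]])   # positives match
shuffled = in_batch_negatives_loss(anchors, [[0.1, 1.0], [1.0, 0.1]])  # positives swapped
```

The loss is close to zero when each anchor is most similar to its own positive and grows when another in-batch example is more similar. Larger batches supply more negatives per anchor, which is why this family of losses tends to benefit from them.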

### MultipleNegativesSymmetricRankingLoss

```python
class MultipleNegativesSymmetricRankingLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        scale: float = 20.0,
        similarity_fct: callable = cos_sim
    )
```
`{ .api }`

Symmetric version of MultipleNegativesRankingLoss that optimizes both (A, B) and (B, A) directions.

**Parameters**:
- `model`: SentenceTransformer model
- `scale`: Scaling factor for similarities
- `similarity_fct`: Function to compute similarities

**Use Case**: Symmetric retrieval tasks, bidirectional similarity learning

### TripletLoss

```python
class TripletLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        distance_metric: TripletDistanceMetric = TripletDistanceMetric.EUCLIDEAN,
        triplet_margin: float = 5
    )
```
`{ .api }`

Classic triplet loss with anchor, positive, and negative examples. The loss is zero once the negative is at least `triplet_margin` farther from the anchor than the positive.

**Parameters**:
- `model`: SentenceTransformer model
- `distance_metric`: Distance metric for triplet computation
- `triplet_margin`: Minimum gap required between the anchor-negative and anchor-positive distances

**Enum TripletDistanceMetric**:
- `COSINE`: Cosine distance
- `EUCLIDEAN`: Euclidean distance
- `MANHATTAN`: Manhattan distance

**Use Case**: Learning embeddings with explicit positive/negative relationships
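
The margin behaviour can be checked by hand. The following is a small sketch of the standard triplet formulation with Euclidean distance, `max(d(a, p) - d(a, n) + margin, 0)`; the function names are hypothetical and the vectors are toy values chosen so the distances are whole numbers.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=5.0):
    """Hinge on the gap between anchor-positive and anchor-negative distances."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)

anchor = [0.0, 0.0]
positive = [3.0, 4.0]        # distance 5 from the anchor
far_negative = [0.0, 12.0]   # distance 12 from the anchor
near_negative = [0.0, 6.0]   # distance 6 from the anchor

satisfied = triplet_loss(anchor, positive, far_negative)   # 5 - 12 + 5 < 0, clipped to 0
violated = triplet_loss(anchor, positive, near_negative)   # 5 - 6 + 5 = 4
```

Note that a margin of 5 is sized for unnormalized Euclidean distances. With cosine distance, which is bounded by 2, a much smaller margin is usually appropriate.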

## Advanced Loss Functions

### MatryoshkaLoss

```python
class MatryoshkaLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        loss: torch.nn.Module,
        matryoshka_dims: list[int],
        matryoshka_weights: list[float] | None = None
    )
```
`{ .api }`

Wrapper loss for Matryoshka Representation Learning, enabling models to produce useful embeddings at multiple dimensions.

**Parameters**:
- `model`: SentenceTransformer model
- `loss`: Base loss function to wrap
- `matryoshka_dims`: List of embedding dimensions to optimize
- `matryoshka_weights`: Weights for each dimension (uniform if None)

**Use Case**: Creating models that work well at multiple embedding dimensions
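
Conceptually, the wrapper evaluates the base loss on embeddings truncated to each requested dimension and sums the weighted results. The sketch below illustrates only that idea; it is not the library's implementation (which operates on model outputs and may re-normalize truncated embeddings), and `squared_distance_loss` is a hypothetical stand-in for a real base loss.

```python
def matryoshka_loss(base_loss, batch_a, batch_b, dims, weights=None):
    """Weighted sum of a base loss over embeddings truncated to each dimension."""
    weights = weights or [1.0] * len(dims)
    total = 0.0
    for dim, weight in zip(dims, weights):
        truncated_a = [embedding[:dim] for embedding in batch_a]
        truncated_b = [embedding[:dim] for embedding in batch_b]
        total += weight * base_loss(truncated_a, truncated_b)
    return total

def squared_distance_loss(batch_a, batch_b):
    # Hypothetical base loss: mean squared distance between paired vectors
    per_pair = [sum((x - y) ** 2 for x, y in zip(a, b)) for a, b in zip(batch_a, batch_b)]
    return sum(per_pair) / len(per_pair)

a = [[1.0, 2.0, 3.0, 4.0]]
b = [[1.0, 2.0, 3.0, 6.0]]   # differs from `a` only in the last component

uniform = matryoshka_loss(squared_distance_loss, a, b, dims=[4, 2])
reweighted = matryoshka_loss(squared_distance_loss, a, b, dims=[4, 2], weights=[0.5, 1.0])
```

Because the two vectors differ only in the last component, the 2-dimensional term contributes nothing and the whole loss comes from the full-size term. This shows how each truncation is scored independently.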

### Matryoshka2dLoss

```python
class Matryoshka2dLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        loss: torch.nn.Module,
        matryoshka_dims: list[int],
        n_layers_per_step: int = 1
    )
```
`{ .api }`

2D Matryoshka loss that optimizes across both embedding dimensions and transformer layers.

**Parameters**:
- `model`: SentenceTransformer model
- `loss`: Base loss function
- `matryoshka_dims`: Embedding dimensions to optimize
- `n_layers_per_step`: Number of layers per optimization step

**Use Case**: Early exit capabilities and progressive inference

### MSELoss

```python
class MSELoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer
    )
```
`{ .api }`

Mean squared error between the model's sentence embedding and a target embedding supplied as the label, typically produced by a teacher model.

**Use Case**: Knowledge distillation, extending an existing model to new languages

### MarginMSELoss

```python
class MarginMSELoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer
    )
```
`{ .api }`

Mean squared error between the model's score margin, `sim(query, positive) - sim(query, negative)`, and a gold margin supplied as the label, typically computed by a teacher model.

**Use Case**: Triplet data with continuous similarity scores
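
The margin-based formulation can be sketched as follows, assuming the student's query-positive and query-negative scores and the teacher's margins are already available as plain floats. The function name and the toy scores are illustrative only.

```python
def margin_mse_loss(student_pos_scores, student_neg_scores, teacher_margins):
    """MSE between the student's (positive - negative) margin and the teacher's margin."""
    errors = [
        ((pos - neg) - target) ** 2
        for pos, neg, target in zip(student_pos_scores, student_neg_scores, teacher_margins)
    ]
    return sum(errors) / len(errors)

# First triplet: student margin 3 matches the teacher margin 3.
# Second triplet: student margin 1 falls short of the teacher margin 3.
loss = margin_mse_loss(
    student_pos_scores=[4.0, 2.0],
    student_neg_scores=[1.0, 1.0],
    teacher_margins=[3.0, 3.0],
)
```

Only the difference between the two scores is supervised, so the student is free to use a different absolute score range than the teacher.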

## Specialized Loss Functions

### ContrastiveLoss

```python
class ContrastiveLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        distance_metric: SiameseDistanceMetric = SiameseDistanceMetric.COSINE_DISTANCE,
        margin: float = 0.5,
        size_average: bool = True
    )
```
`{ .api }`

Classic contrastive loss for siamese networks with binary similarity labels. Similar pairs (label 1) are pulled together, and dissimilar pairs (label 0) are pushed apart until their distance exceeds `margin`.

**Parameters**:
- `model`: SentenceTransformer model
- `distance_metric`: Distance metric to use
- `margin`: Distance beyond which dissimilar pairs no longer contribute to the loss
- `size_average`: Whether to average the loss over the batch instead of summing it

**Enum SiameseDistanceMetric**:
- `EUCLIDEAN`: Euclidean distance
- `MANHATTAN`: Manhattan distance
- `COSINE_DISTANCE`: Cosine distance

**Use Case**: Binary similarity classification, siamese networks
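
The effect of the margin on dissimilar pairs is easiest to see per pair. The sketch below uses the standard contrastive formulation, `0.5 * (y * d^2 + (1 - y) * max(margin - d, 0)^2)`, on a precomputed distance `d`; the function name is hypothetical.

```python
def contrastive_loss(distance, label, margin=0.5):
    """Per-pair contrastive loss; label 1 = similar pair, label 0 = dissimilar pair."""
    pull = label * distance ** 2                             # similar pairs: shrink the distance
    push = (1 - label) * max(margin - distance, 0.0) ** 2    # dissimilar pairs: enforce the margin
    return 0.5 * (pull + push)

similar_close = contrastive_loss(0.0, label=1)      # already identical: no loss
similar_far = contrastive_loss(0.5, label=1)        # penalized for being apart
dissimilar_far = contrastive_loss(1.0, label=0)     # beyond the margin: no loss
dissimilar_close = contrastive_loss(0.0, label=0)   # inside the margin: penalized
```

Dissimilar pairs stop contributing once they are farther apart than the margin, which keeps the model from spending capacity on pairs that are already well separated.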

### SoftmaxLoss

```python
class SoftmaxLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        sentence_embedding_dimension: int,
        num_labels: int,
        concatenation_sent_rep: bool = True,
        concatenation_sent_difference: bool = True,
        concatenation_sent_multiplication: bool = False
    )
```
`{ .api }`

Classification loss using softmax over sentence pair representations.

**Parameters**:
- `model`: SentenceTransformer model
- `sentence_embedding_dimension`: Dimension of sentence embeddings
- `num_labels`: Number of classification labels
- `concatenation_sent_rep`: Include individual sentence representations
- `concatenation_sent_difference`: Include element-wise difference
- `concatenation_sent_multiplication`: Include element-wise product

**Use Case**: Natural language inference, text classification

## Batch-Based Triplet Losses

### BatchHardTripletLoss

```python
class BatchHardTripletLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        distance_function: callable = BatchHardTripletLossDistanceFunction.eucledian_distance,
        margin: float = 5
    )
```
`{ .api }`

Batch hard triplet loss that takes single sentences with class labels and, for each anchor, mines the hardest positive (farthest example with the same label) and the hardest negative (closest example with a different label) within the batch.

**Parameters**:
- `model`: SentenceTransformer model
- `distance_function`: Distance function for triplet mining
- `margin`: Triplet margin

**Class BatchHardTripletLossDistanceFunction** (static methods):
- `cosine_distance`: Cosine distance
- `eucledian_distance`: Euclidean distance (the identifier is spelled this way in the library)

**Use Case**: Metric learning with automatic hard negative mining
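
The mining step can be sketched in a few lines. This is an illustrative, dependency-free version that assumes Euclidean distance and skips anchors lacking a positive or a negative in the batch; the library's tensorized implementation may treat such edge cases differently. The function names are hypothetical.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def batch_hard_triplet_loss(embeddings, labels, margin=5.0):
    """For each anchor: farthest same-label example vs. closest different-label example."""
    losses = []
    for i, (anchor, label) in enumerate(zip(embeddings, labels)):
        positives = [euclidean(anchor, e) for j, e in enumerate(embeddings)
                     if j != i and labels[j] == label]
        negatives = [euclidean(anchor, e) for j, e in enumerate(embeddings)
                     if labels[j] != label]
        if not positives or not negatives:
            continue  # no valid triplet for this anchor
        losses.append(max(max(positives) - min(negatives) + margin, 0.0))
    return sum(losses) / len(losses) if losses else 0.0

points = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]]
clustered = batch_hard_triplet_loss(points, labels=[0, 0, 1, 1], margin=1.0)
interleaved = batch_hard_triplet_loss(points, labels=[0, 1, 0, 1], margin=1.0)
```

With well-separated classes (`clustered`) every mined triplet already satisfies the margin. When the labels are interleaved, each anchor's hardest positive is far away and its hardest negative is adjacent, so the loss is large.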

### BatchSemiHardTripletLoss

```python
class BatchSemiHardTripletLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        distance_function: callable = BatchHardTripletLossDistanceFunction.eucledian_distance,
        margin: float = 5
    )
```
`{ .api }`

Batch semi-hard triplet loss that mines semi-hard negatives: negatives that are farther from the anchor than the positive but still within the margin.

**Use Case**: More stable training than hard negative mining

### BatchHardSoftMarginTripletLoss

```python
class BatchHardSoftMarginTripletLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        distance_function: callable = BatchHardTripletLossDistanceFunction.eucledian_distance
    )
```
`{ .api }`

Batch hard triplet loss with soft margin (no explicit margin parameter).

**Use Case**: Triplet learning without manual margin tuning

### BatchAllTripletLoss

```python
class BatchAllTripletLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        distance_function: callable = BatchHardTripletLossDistanceFunction.eucledian_distance,
        margin: float = 5
    )
```
`{ .api }`

Uses all valid triplets in a batch for training.

**Use Case**: Comprehensive triplet learning when computational resources allow

## Contrastive and Tension Losses

### OnlineContrastiveLoss

```python
class OnlineContrastiveLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        distance_metric: SiameseDistanceMetric = SiameseDistanceMetric.COSINE_DISTANCE,
        margin: float = 0.5
    )
```
`{ .api }`

Variant of ContrastiveLoss that selects hard pairs within each batch (positives that are far apart and negatives that are close together) and computes the loss only on those pairs. "Online" refers to this in-batch pair selection, not to streaming data.

**Use Case**: Binary similarity data where ContrastiveLoss is applicable; often a stronger alternative

### ContrastiveTensionLoss

```python
class ContrastiveTensionLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer
    )
```
`{ .api }`

Unsupervised loss from the Contrastive Tension method. Two independently updated copies of the model encode each sentence pair; identical sentences form positive pairs and different sentences form negative pairs. Intended for use with `ContrastiveTensionDataLoader`.

**Use Case**: Unsupervised training from unlabeled sentences

### ContrastiveTensionLossInBatchNegatives

```python
class ContrastiveTensionLossInBatchNegatives(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        scale: float = 20.0,
        similarity_fct: callable = cos_sim
    )
```
`{ .api }`

In-batch version of contrastive tension loss, which uses the other sentences in the batch as negatives.

**Use Case**: Efficient contrastive learning with in-batch negatives

### ContrastiveTensionDataLoader

```python
class ContrastiveTensionDataLoader:
    def __init__(
        self,
        sentences: list,
        batch_size: int,
        pos_neg_ratio: int = 8
    )
```
`{ .api }`

Specialized data loader for contrastive tension training. It builds positive (identical) and negative (different) sentence pairs from a list of unlabeled sentences.

**Parameters**:
- `sentences`: Unlabeled training sentences
- `batch_size`: Batch size
- `pos_neg_ratio`: Controls the mix of pairs; one positive pair is generated for every `pos_neg_ratio` pairs

## Advanced and Specialized Losses

### AnglELoss

```python
class AnglELoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        scale: float = 20.0
    )
```
`{ .api }`

AnglE (Angle-optimized Text Embeddings) loss function. It follows the CoSENT formulation but compares embeddings by their angle difference in complex space instead of by cosine similarity, which is intended to avoid the near-zero gradients of cosine in its saturation zones.

**Use Case**: Semantic textual similarity training on sentence pairs with similarity scores

### CoSENTLoss

```python
class CoSENTLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        scale: float = 20.0,
        similarity_fct: callable = pairwise_cos_sim
    )
```
`{ .api }`

CoSENT (Cosine Sentence) loss. It is a ranking objective over scored sentence pairs: any pair with a higher gold score should receive a higher predicted similarity than any pair with a lower gold score.

**Use Case**: Improved sentence similarity learning
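
The ranking objective can be written compactly as `log(1 + sum(exp(scale * (s_low - s_high))))`, where the sum runs over all pairs of examples in which the first has the higher gold score. The following is a minimal sketch over precomputed similarities; the function name is hypothetical.

```python
import math

def cosent_loss(similarities, labels, scale=20.0):
    """Penalize every case where a lower-labeled pair scores close to or above a higher-labeled one."""
    terms = []
    for s_high, y_high in zip(similarities, labels):
        for s_low, y_low in zip(similarities, labels):
            if y_high > y_low:
                terms.append(scale * (s_low - s_high))
    return math.log(1.0 + sum(math.exp(t) for t in terms))

labels = [1.0, 0.0]                            # first pair should rank above the second
well_ordered = cosent_loss([0.9, 0.1], labels)
mis_ordered = cosent_loss([0.1, 0.9], labels)
```

Only the relative order of the labels matters here, so the gold scores do not need to be calibrated to the cosine range.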

### GISTEmbedLoss

```python
class GISTEmbedLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        guide: SentenceTransformer
    )
```
`{ .api }`

GISTEmbed (Guided In-sample Selection of Training Negatives) loss. It is a contrastive loss with in-batch negatives in which a guide model filters out in-batch candidates that it considers likely false negatives.

**Parameters**:
- `model`: Model to train
- `guide`: Guide model used to select in-batch negatives

**Use Case**: Contrastive training with noisy in-batch negatives, guided by a stronger model

### CachedGISTEmbedLoss

```python
class CachedGISTEmbedLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        guide: SentenceTransformer,
        mini_batch_size: int = 32
    )
```
`{ .api }`

Cached version of GISTEmbedLoss that uses gradient caching: the batch is processed in chunks of `mini_batch_size`, so large batch sizes fit in limited GPU memory.

**Use Case**: Memory-efficient guided contrastive training with large batches

### DenoisingAutoEncoderLoss

```python
class DenoisingAutoEncoderLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        decoder_name_or_path: str | None = None,
        tie_encoder_decoder: bool = True
    )
```
`{ .api }`

Denoising autoencoder loss for self-supervised learning.

**Parameters**:
- `model`: SentenceTransformer encoder
- `decoder_name_or_path`: Decoder model path
- `tie_encoder_decoder`: Whether to tie encoder and decoder weights

**Use Case**: Self-supervised pre-training, unsupervised learning

### MegaBatchMarginLoss

```python
class MegaBatchMarginLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        positive_margin: float = 0.8,
        negative_margin: float = 0.3,
        use_mini_batched_version: bool = True,
        mini_batch_size: int = 50
    )
```
`{ .api }`

Margin-based loss designed for very large batch training. Given a large batch of (anchor, positive) pairs, it finds the hardest in-batch negative for each anchor and applies a margin loss to the resulting triplet.

**Use Case**: Large-scale contrastive learning with massive batches

### DistillKLDivLoss

```python
class DistillKLDivLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        similarity_fct: callable = pairwise_dot_score,
        temperature: float = 1.0
    )
```
`{ .api }`

Knowledge distillation using the KL divergence between the student's and the teacher's probability distributions over query-passage similarity scores. The teacher model is not passed to the loss; its scores are precomputed and supplied as labels.

**Use Case**: Model distillation, compression

### AdaptiveLayerLoss

```python
class AdaptiveLayerLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        loss: torch.nn.Module,
        n_layers_per_step: int = 1
    )
```
`{ .api }`

Wrapper loss that applies the base loss to the final layer and to earlier transformer layers, so that the model still produces useful embeddings when its top layers are removed.

**Use Case**: Layer reduction for faster inference, computational efficiency

## Cached Loss Functions

### CachedMultipleNegativesRankingLoss

```python
class CachedMultipleNegativesRankingLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        scale: float = 20.0,
        similarity_fct: callable = cos_sim,
        mini_batch_size: int = 32
    )
```
`{ .api }`

Memory-efficient version of MultipleNegativesRankingLoss based on gradient caching. The batch is embedded in chunks of `mini_batch_size`, so very large batch sizes (and therefore many in-batch negatives) fit in limited GPU memory, at the cost of some extra computation.

### CachedMultipleNegativesSymmetricRankingLoss

```python
class CachedMultipleNegativesSymmetricRankingLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        scale: float = 20.0,
        similarity_fct: callable = cos_sim,
        mini_batch_size: int = 32
    )
```
`{ .api }`

Cached version of MultipleNegativesSymmetricRankingLoss, using the same gradient caching approach for memory efficiency.

## Usage Examples

### Basic Contrastive Learning

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MultipleNegativesRankingLoss
from datasets import Dataset

# Initialize model and loss
model = SentenceTransformer('distilbert-base-uncased')
loss = MultipleNegativesRankingLoss(model, scale=20.0)

# Prepare data (anchor-positive pairs)
train_data = [
    {"anchor": "The cat sits on the mat", "positive": "A feline rests on a rug"},
    {"anchor": "Python programming language", "positive": "Coding with Python"}
]

train_dataset = Dataset.from_list(train_data)

# Training with contrastive loss
from sentence_transformers import SentenceTransformerTrainer, SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir='./contrastive-training',
    per_device_train_batch_size=64,  # Larger batches work better
    num_train_epochs=3
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss
)

trainer.train()
```

### Triplet Learning

```python
from sentence_transformers.losses import TripletLoss, TripletDistanceMetric

# Reuses `model`, `args`, `Dataset`, and `SentenceTransformerTrainer` from the previous example

# Triplet loss with cosine distance
triplet_loss = TripletLoss(
    model=model,
    distance_metric=TripletDistanceMetric.COSINE,
    triplet_margin=0.5
)

# Prepare triplet data
triplet_data = [
    {
        "anchor": "The cat sits on the mat",
        "positive": "A feline rests on a rug",
        "negative": "Dogs are great pets"
    }
]

triplet_dataset = Dataset.from_list(triplet_data)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=triplet_dataset,
    loss=triplet_loss
)

trainer.train()
```

### Matryoshka Representation Learning

```python
from sentence_transformers.losses import MatryoshkaLoss

# Base loss
base_loss = MultipleNegativesRankingLoss(model)

# Matryoshka loss with multiple dimensions
matryoshka_loss = MatryoshkaLoss(
    model=model,
    loss=base_loss,
    matryoshka_dims=[768, 512, 256, 128, 64],
    matryoshka_weights=[1, 1, 1, 1, 1]  # Equal weights
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=matryoshka_loss
)

trainer.train()

# Test at different dimensions
embeddings_full = model.encode(["Test"], truncate_dim=None)
embeddings_256 = model.encode(["Test"], truncate_dim=256)
embeddings_64 = model.encode(["Test"], truncate_dim=64)
```

### Similarity Regression

```python
from sentence_transformers.losses import CosineSimilarityLoss
import torch.nn as nn

# Cosine similarity loss with different transformations
mse_loss = CosineSimilarityLoss(
    model=model,
    loss_fct=nn.MSELoss(),
    cos_score_transformation=nn.Identity()
)

# For scores in [0, 1] range
sigmoid_loss = CosineSimilarityLoss(
    model=model,
    loss_fct=nn.MSELoss(),
    cos_score_transformation=nn.Sigmoid()
)

# Prepare similarity data
similarity_data = [
    {"sentence1": "The cat sits", "sentence2": "A cat is sitting", "label": 0.9},
    {"sentence1": "Dogs bark", "sentence2": "Cars are fast", "label": 0.1}
]

similarity_dataset = Dataset.from_list(similarity_data)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=similarity_dataset,
    loss=mse_loss
)

trainer.train()
```
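
### Knowledge Distillation

```python
import torch
from sentence_transformers.losses import DistillKLDivLoss

# Teacher model (larger, pre-trained)
teacher_model = SentenceTransformer('all-mpnet-base-v2')

# Student model (smaller)
student_model = SentenceTransformer('distilbert-base-uncased')

# (query, positive, negative) triplets to distill on
distill_dataset = Dataset.from_list([
    {
        "query": "Where does the cat sit?",
        "positive": "The cat sits on the mat",
        "negative": "Dogs are great pets"
    }
])

def add_teacher_scores(batch):
    # The teacher's similarity scores become the labels: [positive, negative]
    queries = teacher_model.encode(batch["query"])
    positives = teacher_model.encode(batch["positive"])
    negatives = teacher_model.encode(batch["negative"])
    pos_scores = teacher_model.similarity_pairwise(queries, positives)
    neg_scores = teacher_model.similarity_pairwise(queries, negatives)
    return {"label": torch.stack([pos_scores, neg_scores], dim=1)}

distill_dataset = distill_dataset.map(add_teacher_scores, batched=True)

# Distillation loss: only the student is passed in
distill_loss = DistillKLDivLoss(model=student_model)

trainer = SentenceTransformerTrainer(
    model=student_model,
    args=args,
    train_dataset=distill_dataset,
    loss=distill_loss
)

trainer.train()
```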

### Multi-Task Learning

```python
from sentence_transformers.losses import SoftmaxLoss

# Combine different losses for multi-task learning
contrastive_loss = MultipleNegativesRankingLoss(model)
classification_loss = SoftmaxLoss(
    model=model,
    sentence_embedding_dimension=768,
    num_labels=3  # For NLI: entailment, contradiction, neutral
)

# Small illustrative NLI dataset: sentence pair plus class label
nli_dataset = Dataset.from_list([
    {"premise": "A man plays guitar", "hypothesis": "A person makes music", "label": 0},
    {"premise": "A man plays guitar", "hypothesis": "Nobody is playing", "label": 1},
    {"premise": "A man plays guitar", "hypothesis": "He is on a stage", "label": 2}
])

# Multi-dataset training: each dataset is paired with the loss under the same key
datasets = {
    "pairs": train_dataset,  # anchor-positive pairs from the first example
    "classification": nli_dataset
}

losses = {
    "pairs": contrastive_loss,
    "classification": classification_loss
}

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=datasets,
    loss=losses
)

trainer.train()
```

### Advanced Batch Mining

```python
from sentence_transformers.losses import BatchHardTripletLoss, BatchHardTripletLossDistanceFunction

# Hard negative mining within batches
batch_hard_loss = BatchHardTripletLoss(
    model=model,
    distance_function=BatchHardTripletLossDistanceFunction.cosine_distance,
    margin=0.2
)

# Use with datasets that have class labels
class_data = [
    {"text": "Python programming", "label": 0},
    {"text": "Coding in Python", "label": 0},
    {"text": "Machine learning", "label": 1},
    {"text": "AI algorithms", "label": 1}
]

class_dataset = Dataset.from_list(class_data)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=class_dataset,
    loss=batch_hard_loss
)

trainer.train()
```

## Best Practices

1. **Loss Selection**: Choose loss functions based on your data format and task
2. **Batch Size**: Use larger batches (64+) for contrastive losses when possible
3. **Scaling**: Adjust scale parameters based on your similarity function
4. **Negative Sampling**: Consider hard negative mining for improved performance
5. **Multi-Task**: Combine different losses for comprehensive training
6. **Progressive Training**: Use Matryoshka or adaptive losses for efficiency
7. **Evaluation**: Monitor performance on validation sets during training
8. **Hyperparameter Tuning**: Experiment with margins, scales, and learning rates