Loss Functions

The sentence-transformers package provides an extensive collection of loss functions designed for different learning objectives and training scenarios. These losses enable contrastive learning, supervised fine-tuning, and specialized training approaches.

Import Statement

from sentence_transformers.losses import (
    CosineSimilarityLoss,
    MultipleNegativesRankingLoss, 
    TripletLoss,
    MatryoshkaLoss,
    # ... other loss functions
)

Core Loss Functions

CosineSimilarityLoss

class CosineSimilarityLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        loss_fct: torch.nn.Module = torch.nn.MSELoss(),
        cos_score_transformation: torch.nn.Module = torch.nn.Identity()
    )

{ .api }

Trains the model so that the cosine similarity between two sentence embeddings matches a target similarity score; loss_fct (MSE by default) is applied between the predicted and gold scores.

Parameters:

  • model: SentenceTransformer model
  • loss_fct: Loss function to apply to cosine similarities (default: MSELoss)
  • cos_score_transformation: Transformation applied to cosine scores

Use Case: Regression on similarity scores, semantic textual similarity tasks

MultipleNegativesRankingLoss

class MultipleNegativesRankingLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        scale: float = 20.0,
        similarity_fct: callable = cos_sim
    )

{ .api }

Contrastive loss using in-batch negatives. Optimizes for positive pairs while treating other examples in the batch as negatives.

Parameters:

  • model: SentenceTransformer model
  • scale: Scaling factor for similarities
  • similarity_fct: Function to compute similarities

Use Case: Asymmetric retrieval tasks, contrastive learning with large batches

MultipleNegativesSymmetricRankingLoss

class MultipleNegativesSymmetricRankingLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        scale: float = 20.0,
        similarity_fct: callable = cos_sim
    )

{ .api }

Symmetric version of MultipleNegativesRankingLoss that optimizes both (A, B) and (B, A) directions.

Parameters:

  • model: SentenceTransformer model
  • scale: Scaling factor for similarities
  • similarity_fct: Function to compute similarities

Use Case: Symmetric retrieval tasks, bidirectional similarity learning

TripletLoss

class TripletLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        distance_metric: TripletDistanceMetric = TripletDistanceMetric.EUCLIDEAN,
        triplet_margin: float = 5
    )

{ .api }

Classic triplet loss with anchor, positive, and negative examples.

Parameters:

  • model: SentenceTransformer model
  • distance_metric: Distance metric for triplet computation
  • triplet_margin: Margin between positive and negative distances

Enum TripletDistanceMetric:

  • COSINE: Cosine distance
  • EUCLIDEAN: Euclidean distance
  • MANHATTAN: Manhattan distance
  • DOT_PRODUCT: Dot product distance

Use Case: Learning embeddings with explicit positive/negative relationships

Advanced Loss Functions

MatryoshkaLoss

class MatryoshkaLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        loss: torch.nn.Module,
        matryoshka_dims: list[int],
        matryoshka_weights: list[float] | None = None
    )

{ .api }

Wrapper loss for Matryoshka Representation Learning, enabling models to produce useful embeddings at multiple dimensions.

Parameters:

  • model: SentenceTransformer model
  • loss: Base loss function to wrap
  • matryoshka_dims: List of embedding dimensions to optimize
  • matryoshka_weights: Weights for each dimension (uniform if None)

Use Case: Creating models that work well at multiple embedding dimensions

Matryoshka2dLoss

class Matryoshka2dLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        loss: torch.nn.Module,
        matryoshka_dims: list[int],
        n_layers_per_step: int = 1
    )

{ .api }

2D Matryoshka loss that optimizes across both embedding dimensions and transformer layers.

Parameters:

  • model: SentenceTransformer model
  • loss: Base loss function
  • matryoshka_dims: Embedding dimensions to optimize
  • n_layers_per_step: Number of layers per optimization step

Use Case: Early exit capabilities and progressive inference
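
A minimal sketch of wrapping a base loss, following the signature above. The model name is illustrative, and the dimension list assumes a 768-dimensional base model:

from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import Matryoshka2dLoss, MultipleNegativesRankingLoss

model = SentenceTransformer('distilbert-base-uncased')
base_loss = MultipleNegativesRankingLoss(model)

# Optimizes truncated embeddings at each listed dimension and, per step,
# also supervises intermediate transformer layers for early exit
loss = Matryoshka2dLoss(
    model=model,
    loss=base_loss,
    matryoshka_dims=[768, 512, 256, 128, 64],
    n_layers_per_step=1
)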

MSELoss

class MSELoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer
    )

{ .api }

Mean Squared Error loss between the computed sentence embedding and a target embedding, e.g. one produced by a teacher model.

Use Case: Knowledge distillation, multilingual model training
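
A hedged distillation sketch: the label column holds target embeddings (here from a teacher model) that the student learns to reproduce. Model and column names are illustrative:

from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MSELoss
from datasets import Dataset

teacher = SentenceTransformer('all-mpnet-base-v2')
student = SentenceTransformer('distilbert-base-uncased')

sentences = ["The cat sits on the mat", "Python is a programming language"]

# The teacher's embeddings serve as regression targets for the student
train_dataset = Dataset.from_dict({
    "sentence": sentences,
    "label": teacher.encode(sentences).tolist()
})

loss = MSELoss(model=student)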

MarginMSELoss

class MarginMSELoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer
    )

{ .api }

Computes MSE between the predicted margin, similarity(query, positive) - similarity(query, negative), and a gold margin, typically produced by a cross-encoder teacher.

Use Case: Triplet data with continuous margin scores, e.g. distillation for retrieval
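
A hedged sketch of the expected data: each label is a gold margin, typically teacher_score(query, positive) - teacher_score(query, negative) from a cross-encoder. Values and column names are illustrative:

from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MarginMSELoss
from datasets import Dataset

model = SentenceTransformer('distilbert-base-uncased')

train_dataset = Dataset.from_list([{
    "query": "What is Python?",
    "positive": "Python is a programming language.",
    "negative": "The python is a large snake.",
    "label": 4.5,  # teacher margin: score(query, positive) - score(query, negative)
}])

loss = MarginMSELoss(model=model)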

Specialized Loss Functions

ContrastiveLoss

class ContrastiveLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        distance_metric: SiameseDistanceMetric = SiameseDistanceMetric.EUCLIDEAN,
        margin: float = 0.5,
        size_average: bool = True
    )

{ .api }

Classic contrastive loss for siamese networks with binary similarity labels.

Parameters:

  • model: SentenceTransformer model
  • distance_metric: Distance metric to use
  • margin: Margin for negative pairs
  • size_average: Whether to average the loss

Enum SiameseDistanceMetric:

  • EUCLIDEAN: Euclidean distance
  • MANHATTAN: Manhattan distance
  • COSINE_DISTANCE: Cosine distance

Use Case: Binary similarity classification, siamese networks
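
A minimal sketch with binary pair labels (1 = similar, 0 = dissimilar). Column names are illustrative, and OnlineContrastiveLoss below accepts the same format:

from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import ContrastiveLoss
from datasets import Dataset

model = SentenceTransformer('distilbert-base-uncased')

pair_dataset = Dataset.from_list([
    {"sentence1": "The cat sits on the mat", "sentence2": "A feline rests on a rug", "label": 1},
    {"sentence1": "The cat sits on the mat", "sentence2": "Cars are fast", "label": 0},
])

# Similar pairs are pulled together; dissimilar pairs are pushed beyond the margin
loss = ContrastiveLoss(model=model, margin=0.5)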

SoftmaxLoss

class SoftmaxLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        sentence_embedding_dimension: int,
        num_labels: int,
        concatenation_sent_rep: bool = True,
        concatenation_sent_difference: bool = True,
        concatenation_sent_multiplication: bool = False
    )

{ .api }

Classification loss using softmax over sentence pair representations.

Parameters:

  • model: SentenceTransformer model
  • sentence_embedding_dimension: Dimension of sentence embeddings
  • num_labels: Number of classification labels
  • concatenation_sent_rep: Include individual sentence representations
  • concatenation_sent_difference: Include element-wise difference
  • concatenation_sent_multiplication: Include element-wise product

Use Case: Natural language inference, text classification

Batch-Based Triplet Losses

BatchHardTripletLoss

class BatchHardTripletLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        distance_function: BatchHardTripletLossDistanceFunction = BatchHardTripletLossDistanceFunction.cosine_distance,
        margin: float = 5
    )

{ .api }

Batch hard triplet loss that mines the hardest positive and negative pairs within each batch.

Parameters:

  • model: SentenceTransformer model
  • distance_function: Distance function for triplet mining
  • margin: Triplet margin

Enum BatchHardTripletLossDistanceFunction:

  • cosine_distance: Cosine distance
  • euclidean_distance: Euclidean distance

Use Case: Metric learning with automatic hard negative mining

BatchSemiHardTripletLoss

class BatchSemiHardTripletLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        distance_function: BatchHardTripletLossDistanceFunction = BatchHardTripletLossDistanceFunction.cosine_distance,
        margin: float = 5
    )

{ .api }

Batch semi-hard triplet loss that mines semi-hard negatives (farther from the anchor than the positive, but still within the margin).

Use Case: More stable training than hard negative mining

BatchHardSoftMarginTripletLoss

class BatchHardSoftMarginTripletLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        distance_function: BatchHardTripletLossDistanceFunction = BatchHardTripletLossDistanceFunction.cosine_distance
    )

{ .api }

Batch hard triplet loss with soft margin (no explicit margin parameter).

Use Case: Triplet learning without manual margin tuning

BatchAllTripletLoss

class BatchAllTripletLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        distance_function: BatchHardTripletLossDistanceFunction = BatchHardTripletLossDistanceFunction.cosine_distance,
        margin: float = 5
    )

{ .api }

Uses all valid triplets in a batch for training.

Use Case: Comprehensive triplet learning when computational resources allow

Contrastive and Tension Losses

OnlineContrastiveLoss

class OnlineContrastiveLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        distance_metric: SiameseDistanceMetric = SiameseDistanceMetric.COSINE_DISTANCE,
        margin: float = 0.5,
        size_average: bool = True
    )

{ .api }

Variant of ContrastiveLoss that computes the loss only on the hard positive and hard negative pairs in each batch, which often trains better than plain ContrastiveLoss.

Use Case: Binary-labeled pair data; generally preferred over ContrastiveLoss

ContrastiveTensionLoss

class ContrastiveTensionLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        scale: float = 20.0,
        similarity_fct: callable = cos_sim
    )

{ .api }

Unsupervised loss from the Contrastive Tension approach: two independent copies of the model are trained so that identical sentences score high and random sentence pairs score low.

Use Case: Unsupervised learning from raw, unlabeled sentences

ContrastiveTensionLossInBatchNegatives

class ContrastiveTensionLossInBatchNegatives(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        scale: float = 20.0,
        similarity_fct: callable = cos_sim
    )

{ .api }

In-batch version of contrastive tension loss.

Use Case: Efficient contrastive learning with in-batch negatives

ContrastiveTensionDataLoader

class ContrastiveTensionDataLoader:
    def __init__(
        self,
        examples: list,
        batch_size: int = 32,
        pos_neg_ratio: int = 4
    )

{ .api }

Specialized data loader for contrastive tension training: it takes raw sentences and generates positive pairs (a sentence paired with itself) and negative pairs (two different sentences).

Parameters:

  • examples: Raw training sentences
  • batch_size: Batch size (should be divisible by pos_neg_ratio)
  • pos_neg_ratio: Ratio of negative to positive pairs in each batch
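
A hedged sketch of constructing the loader from raw sentences; it was designed for the classic model.fit training loop rather than the Trainer API, and in practice expects a large sentence collection:

from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import ContrastiveTensionLoss, ContrastiveTensionDataLoader

model = SentenceTransformer('distilbert-base-uncased')

# A large collection of raw, unlabeled sentences (truncated here for illustration)
sentences = ["First raw sentence.", "Second raw sentence.", "Third raw sentence."]

# batch_size should be divisible by pos_neg_ratio
data_loader = ContrastiveTensionDataLoader(sentences, batch_size=16, pos_neg_ratio=4)
loss = ContrastiveTensionLoss(model)

model.fit(train_objectives=[(data_loader, loss)], epochs=1)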

Advanced and Specialized Losses

AnglELoss

class AnglELoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        angle_w: float = 1.0,
        angle_tau: float = 1.0,
        cosine_w: float = 1.0,
        cosine_tau: float = 1.0,
        ibn_w: float = 1.0,
        pooling_strategy: str = "cls"
    )

{ .api }

AnglE (Angle-optimized Text Embeddings) loss, which optimizes the angle difference between embeddings in complex space to mitigate the vanishing gradients of cosine similarity in its saturation zones.

Use Case: Semantic textual similarity; strong reported results on text embedding benchmarks

CoSENTLoss

class CoSENTLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        scale: float = 20.0,
        similarity_fct: callable = cos_sim
    )

{ .api }

CoSENT (Cosine Sentence) loss: a ranking loss over scored pairs that pushes pairs with higher gold similarity to obtain higher cosine similarity than pairs with lower gold similarity.

Use Case: Pairs with continuous similarity scores; often a stronger alternative to CosineSimilarityLoss
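
A hedged sketch; CoSENTLoss consumes the same pair-plus-score format as CosineSimilarityLoss, so it is typically a drop-in replacement:

from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import CoSENTLoss
from datasets import Dataset

model = SentenceTransformer('distilbert-base-uncased')

similarity_dataset = Dataset.from_list([
    {"sentence1": "The cat sits", "sentence2": "A cat is sitting", "label": 0.9},
    {"sentence1": "Dogs bark", "sentence2": "Cars are fast", "label": 0.1},
])

loss = CoSENTLoss(model=model, scale=20.0)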

GISTEmbedLoss

class GISTEmbedLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        guide: SentenceTransformer
    )

{ .api }

GISTEmbed (Guided In-sample Selection of Training Negatives) loss: an in-batch negatives loss in which a guide model filters out candidate negatives that are likely false negatives.

Parameters:

  • model: Model to train
  • guide: Guide model used to assess in-batch negatives

Use Case: In-batch negative training with cleaner negative sampling
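
A minimal sketch, assuming anchor-positive training data as with MultipleNegativesRankingLoss; both model names are illustrative:

from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import GISTEmbedLoss

model = SentenceTransformer('distilbert-base-uncased')
guide = SentenceTransformer('all-MiniLM-L6-v2')

# The guide scores in-batch candidates and masks out likely false negatives
loss = GISTEmbedLoss(model=model, guide=guide)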

CachedGISTEmbedLoss

class CachedGISTEmbedLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        guide: SentenceTransformer,
        mini_batch_size: int = 32
    )

{ .api }

Gradient-cached version of GIST loss, enabling large effective batch sizes with bounded memory.

Use Case: Memory-efficient GIST training with large batches

DenoisingAutoEncoderLoss

class DenoisingAutoEncoderLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        decoder_name_or_path: str = None,
        tie_encoder_decoder: bool = True
    )

{ .api }

Denoising autoencoder loss for self-supervised learning, as used in TSDAE: the encoder embeds a corrupted sentence and a decoder reconstructs the original.

Parameters:

  • model: SentenceTransformer encoder
  • decoder_name_or_path: Decoder model path
  • tie_encoder_decoder: Whether to tie encoder and decoder weights

Use Case: Self-supervised pre-training, unsupervised learning
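
A hedged TSDAE-style sketch using the package's DenoisingAutoEncoderDataset, which corrupts each sentence on the fly so the decoder can learn to reconstruct the original; the classic model.fit loop is shown:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer
from sentence_transformers.datasets import DenoisingAutoEncoderDataset
from sentence_transformers.losses import DenoisingAutoEncoderLoss

model_name = 'distilbert-base-uncased'
model = SentenceTransformer(model_name)

# Raw, unlabeled sentences; the dataset adds noise (e.g., word deletion) on the fly
sentences = ["The cat sits on the mat", "Python is a programming language"]
train_dataset = DenoisingAutoEncoderDataset(sentences)
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)

loss = DenoisingAutoEncoderLoss(model, decoder_name_or_path=model_name, tie_encoder_decoder=True)
model.fit(train_objectives=[(train_loader, loss)], epochs=1)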

MegaBatchMarginLoss

class MegaBatchMarginLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        scale: float = 1.0,
        similarity_fct: callable = cos_sim
    )

{ .api }

Margin-based loss for very large ("mega") batches: for each positive pair, the hardest negative within the mega-batch is mined.

Use Case: Large-scale contrastive learning with massive batches

DistillKLDivLoss

class DistillKLDivLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        teacher_model: SentenceTransformer
    )

{ .api }

Knowledge distillation using KL divergence between the similarity-score distributions of the student and teacher models.

Use Case: Model distillation, compression

AdaptiveLayerLoss

class AdaptiveLayerLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        loss: torch.nn.Module,
        n_layers_per_step: int = 1
    )

{ .api }

Adaptive loss that progressively uses more transformer layers during training.

Use Case: Progressive training, computational efficiency
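
A minimal sketch of wrapping a base loss, following the signature above; the model name is illustrative:

from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import AdaptiveLayerLoss, MultipleNegativesRankingLoss

model = SentenceTransformer('distilbert-base-uncased')
base_loss = MultipleNegativesRankingLoss(model)

# Also supervises intermediate layers so the model can be truncated at inference time
loss = AdaptiveLayerLoss(model=model, loss=base_loss, n_layers_per_step=1)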

Cached Loss Functions

CachedMultipleNegativesRankingLoss

class CachedMultipleNegativesRankingLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        scale: float = 20.0,
        similarity_fct: callable = cos_sim,
        mini_batch_size: int = 32
    )

{ .api }

Memory-efficient version of MultipleNegativesRankingLoss that uses gradient caching: large effective batch sizes with bounded memory, at some training-speed cost.
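
A hedged sketch: embeddings are computed in mini-batches with gradient caching, so the effective batch (set via the training arguments) can be much larger than GPU memory would otherwise allow. The model name and output directory are illustrative:

from sentence_transformers import SentenceTransformer, SentenceTransformerTrainingArguments
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss

model = SentenceTransformer('distilbert-base-uncased')

# mini_batch_size bounds memory; the in-batch-negatives pool is still the full batch
loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=32)

args = SentenceTransformerTrainingArguments(
    output_dir='./cached-mnrl',
    per_device_train_batch_size=1024  # large batches improve in-batch negative quality
)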

CachedMultipleNegativesSymmetricRankingLoss

class CachedMultipleNegativesSymmetricRankingLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        scale: float = 20.0,
        similarity_fct: callable = cos_sim,
        mini_batch_size: int = 32
    )

{ .api }

Gradient-cached version of MultipleNegativesSymmetricRankingLoss, for large batches with bounded memory.

Usage Examples

Basic Contrastive Learning

from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MultipleNegativesRankingLoss
from datasets import Dataset

# Initialize model and loss (a plain transformers checkpoint; mean pooling is added automatically)
model = SentenceTransformer('distilbert-base-uncased')
loss = MultipleNegativesRankingLoss(model, scale=20.0)

# Prepare data (anchor-positive pairs)
train_data = [
    {"anchor": "The cat sits on the mat", "positive": "A feline rests on a rug"},
    {"anchor": "Python programming language", "positive": "Coding with Python"}
]

train_dataset = Dataset.from_list(train_data)

# Training with contrastive loss
from sentence_transformers import SentenceTransformerTrainer, SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir='./contrastive-training',
    per_device_train_batch_size=64,  # Larger batches work better
    num_train_epochs=3
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss
)

trainer.train()

Triplet Learning

from sentence_transformers.losses import TripletLoss, TripletDistanceMetric

# Triplet loss with cosine distance
triplet_loss = TripletLoss(
    model=model,
    distance_metric=TripletDistanceMetric.COSINE,
    triplet_margin=0.5
)

# Prepare triplet data
triplet_data = [
    {
        "anchor": "The cat sits on the mat",
        "positive": "A feline rests on a rug", 
        "negative": "Dogs are great pets"
    }
]

triplet_dataset = Dataset.from_list(triplet_data)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=triplet_dataset,
    loss=triplet_loss
)

trainer.train()

Matryoshka Representation Learning

from sentence_transformers.losses import MatryoshkaLoss

# Base loss
base_loss = MultipleNegativesRankingLoss(model)

# Matryoshka loss with multiple dimensions
matryoshka_loss = MatryoshkaLoss(
    model=model,
    loss=base_loss,
    matryoshka_dims=[768, 512, 256, 128, 64],
    matryoshka_weights=[1, 1, 1, 1, 1]  # Equal weights
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=matryoshka_loss
)

trainer.train()

# Test at different dimensions
embeddings_full = model.encode(["Test"], truncate_dim=None)
embeddings_256 = model.encode(["Test"], truncate_dim=256)
embeddings_64 = model.encode(["Test"], truncate_dim=64)

Similarity Regression

from sentence_transformers.losses import CosineSimilarityLoss
import torch.nn as nn

# Cosine similarity loss with different transformations
mse_loss = CosineSimilarityLoss(
    model=model,
    loss_fct=nn.MSELoss(),
    cos_score_transformation=nn.Identity()
)

# For scores in [0, 1] range
sigmoid_loss = CosineSimilarityLoss(
    model=model,
    loss_fct=nn.MSELoss(),
    cos_score_transformation=nn.Sigmoid()
)

# Prepare similarity data
similarity_data = [
    {"sentence1": "The cat sits", "sentence2": "A cat is sitting", "label": 0.9},
    {"sentence1": "Dogs bark", "sentence2": "Cars are fast", "label": 0.1}
]

similarity_dataset = Dataset.from_list(similarity_data)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=similarity_dataset,
    loss=mse_loss
)

trainer.train()

Knowledge Distillation

from sentence_transformers.losses import DistillKLDivLoss

# Teacher model (larger, pre-trained)
teacher_model = SentenceTransformer('all-mpnet-base-v2')

# Student model (smaller)
student_model = SentenceTransformer('distilbert-base-uncased')

# Distillation loss
distill_loss = DistillKLDivLoss(
    model=student_model,
    teacher_model=teacher_model
)

trainer = SentenceTransformerTrainer(
    model=student_model,
    args=args,
    train_dataset=train_dataset,
    loss=distill_loss
)

trainer.train()

Multi-Task Learning

from sentence_transformers.losses import SoftmaxLoss

# Combine different losses for multi-task learning
contrastive_loss = MultipleNegativesRankingLoss(model)
classification_loss = SoftmaxLoss(
    model=model,
    sentence_embedding_dimension=768,
    num_labels=3  # For NLI: entailment, contradiction, neutral
)

# Multi-dataset training; assumes nli_dataset is a Dataset with premise/hypothesis/label columns
train_datasets = {
    "similarity": similarity_dataset,
    "classification": nli_dataset
}

losses = {
    "similarity": contrastive_loss,
    "classification": classification_loss
}

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_datasets,
    loss=losses
)

trainer.train()

Advanced Batch Mining

from sentence_transformers.losses import BatchHardTripletLoss, BatchHardTripletLossDistanceFunction

# Hard negative mining within batches
batch_hard_loss = BatchHardTripletLoss(
    model=model,
    distance_function=BatchHardTripletLossDistanceFunction.cosine_distance,
    margin=0.2
)

# Use with datasets that have class labels
class_data = [
    {"text": "Python programming", "label": 0},
    {"text": "Coding in Python", "label": 0},
    {"text": "Machine learning", "label": 1},
    {"text": "AI algorithms", "label": 1}
]

class_dataset = Dataset.from_list(class_data)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=class_dataset,
    loss=batch_hard_loss
)

trainer.train()

Best Practices

  1. Loss Selection: Choose loss functions based on your data format and task
  2. Batch Size: Use larger batches (64+) for contrastive losses when possible
  3. Scaling: Adjust scale parameters based on your similarity function
  4. Negative Sampling: Consider hard negative mining for improved performance
  5. Multi-Task: Combine different losses for comprehensive training
  6. Progressive Training: Use Matryoshka or adaptive losses for efficiency
  7. Evaluation: Monitor performance on validation sets during training
  8. Hyperparameter Tuning: Experiment with margins, scales, and learning rates
