
# Loss Functions

The sentence-transformers package provides an extensive collection of loss functions designed for different learning objectives and training scenarios. These losses enable contrastive learning, supervised fine-tuning, and specialized training approaches.

## Import Statement

```python
from sentence_transformers.losses import (
    CosineSimilarityLoss,
    MultipleNegativesRankingLoss,
    TripletLoss,
    MatryoshkaLoss,
    # ... other loss functions
)
```

## Core Loss Functions

### CosineSimilarityLoss

```python
class CosineSimilarityLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        loss_fct: torch.nn.Module = torch.nn.MSELoss(),
        cos_score_transformation: torch.nn.Module = torch.nn.Identity()
    )
```
`{ .api }`

Loss function that measures cosine similarity between sentence pairs with target similarity scores.

**Parameters**:
- `model`: SentenceTransformer model
- `loss_fct`: Loss function to apply to cosine similarities (default: MSELoss)
- `cos_score_transformation`: Transformation applied to cosine scores

**Use Case**: Regression on similarity scores, semantic textual similarity tasks
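
As a rough illustration of this objective, the sketch below compares the cosine similarity of each embedding pair with its target score using mean squared error. It is plain Python on toy vectors, not the library's implementation, and the helper names `cosine` and `cosine_similarity_mse` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_similarity_mse(pairs, labels):
    """Mean squared error between cosine similarities and target scores."""
    errors = [(cosine(u, v) - y) ** 2 for (u, v), y in zip(pairs, labels)]
    return sum(errors) / len(errors)

# Two toy pairs: identical vectors (target 1.0) and orthogonal vectors (target 0.0).
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
labels = [1.0, 0.0]
loss = cosine_similarity_mse(pairs, labels)
```

Here the targets match the cosine similarities exactly, so the loss is zero; any mismatch between a pair's similarity and its label increases it quadratically.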

### MultipleNegativesRankingLoss

```python
class MultipleNegativesRankingLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        scale: float = 20.0,
        similarity_fct: callable = cos_sim
    )
```
`{ .api }`

Contrastive loss using in-batch negatives. Optimizes for positive pairs while treating other examples in the batch as negatives.

**Parameters**:
- `model`: SentenceTransformer model
- `scale`: Scaling factor for similarities
- `similarity_fct`: Function to compute similarities

**Use Case**: Asymmetric retrieval tasks, contrastive learning with large batches
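
The in-batch negatives idea can be sketched in plain Python: for each anchor, every positive in the batch is scored, and cross-entropy rewards ranking the anchor's own positive first. This is a simplified illustration under that assumption, not the library's code; `cosine` and `in_batch_negatives_loss` are hypothetical helper names.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def in_batch_negatives_loss(anchors, positives, scale=20.0):
    """Average cross-entropy where positives[i] is the correct match for anchors[i]."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        # Scaled similarity of this anchor to every positive in the batch.
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)

anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_negatives_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
swapped = in_batch_negatives_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
```

With correctly aligned pairs the loss is close to zero, while swapping the positives makes it large. Because every other example in the batch acts as a negative, larger batches supply more negatives per anchor.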

### MultipleNegativesSymmetricRankingLoss

```python
class MultipleNegativesSymmetricRankingLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        scale: float = 20.0,
        similarity_fct: callable = cos_sim
    )
```
`{ .api }`

Symmetric version of MultipleNegativesRankingLoss that optimizes both (A, B) and (B, A) directions.

**Parameters**:
- `model`: SentenceTransformer model
- `scale`: Scaling factor for similarities
- `similarity_fct`: Function to compute similarities

**Use Case**: Symmetric retrieval tasks, bidirectional similarity learning

### TripletLoss

```python
class TripletLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        distance_metric: TripletDistanceMetric = TripletDistanceMetric.EUCLIDEAN,
        triplet_margin: float = 5
    )
```
`{ .api }`

Classic triplet loss with anchor, positive, and negative examples.

**Parameters**:
- `model`: SentenceTransformer model
- `distance_metric`: Distance metric for triplet computation
- `triplet_margin`: Minimum amount by which the negative must be farther from the anchor than the positive

**Enum TripletDistanceMetric**:
- `COSINE`: Cosine distance
- `EUCLIDEAN`: Euclidean distance
- `MANHATTAN`: Manhattan distance

**Use Case**: Learning embeddings with explicit positive/negative relationships
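
A minimal sketch of the triplet objective on toy vectors follows: the loss is zero once the negative is at least `margin` farther from the anchor than the positive. It uses plain Python with Euclidean distance, and the helper names `euclidean` and `triplet_loss` are hypothetical rather than library functions.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=5.0, distance=euclidean):
    """max(d(anchor, positive) - d(anchor, negative) + margin, 0)."""
    return max(distance(anchor, positive) - distance(anchor, negative) + margin, 0.0)

anchor, positive, negative = [0.0, 0.0], [1.0, 0.0], [0.0, 3.0]
violated = triplet_loss(anchor, positive, negative, margin=5.0)   # 1 - 3 + 5 = 3
satisfied = triplet_loss(anchor, positive, negative, margin=1.0)  # 1 - 3 + 1 < 0, clipped to 0
```

The same triplet is penalized under a margin of 5 but not under a margin of 1, which is why the margin usually needs to be chosen together with the distance metric.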

## Advanced Loss Functions

### MatryoshkaLoss

```python
class MatryoshkaLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        loss: torch.nn.Module,
        matryoshka_dims: list[int],
        matryoshka_weights: list[float] | None = None
    )
```
`{ .api }`

Wrapper loss for Matryoshka Representation Learning, enabling models to produce useful embeddings at multiple dimensions.

**Parameters**:
- `model`: SentenceTransformer model
- `loss`: Base loss function to wrap
- `matryoshka_dims`: List of embedding dimensions to optimize
- `matryoshka_weights`: Weights for each dimension (uniform if None)

**Use Case**: Creating models that work well at multiple embedding dimensions
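
Conceptually, the wrapper evaluates the base loss on embeddings truncated to each listed dimension and sums the weighted results. The sketch below shows only that idea in plain Python; it leaves out details the library may apply (such as re-normalizing truncated embeddings), and `matryoshka_total` and `toy_loss` are hypothetical names.

```python
def matryoshka_total(base_loss, embeddings, dims, weights=None):
    """Weighted sum of base_loss evaluated on embeddings truncated to each dim."""
    weights = weights or [1.0] * len(dims)
    total = 0.0
    for dim, weight in zip(dims, weights):
        # Keep only the first `dim` components of every embedding.
        truncated = [vec[:dim] for vec in embeddings]
        total += weight * base_loss(truncated)
    return total

def toy_loss(vectors):
    """Toy base loss: squared distance between the first two embeddings."""
    u, v = vectors[0], vectors[1]
    return sum((a - b) ** 2 for a, b in zip(u, v))

embeddings = [[1.0, 2.0, 3.0, 4.0], [1.0, 0.0, 3.0, 0.0]]
total = matryoshka_total(toy_loss, embeddings, dims=[4, 2, 1])
weighted = matryoshka_total(toy_loss, embeddings, dims=[4, 2, 1], weights=[1.0, 0.5, 1.0])
```

Because the shorter prefixes contribute to the loss directly, the leading components are pushed to carry most of the information, which is what makes truncated embeddings usable afterwards.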

### Matryoshka2dLoss

```python
class Matryoshka2dLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        loss: torch.nn.Module,
        matryoshka_dims: list[int],
        n_layers_per_step: int = 1
    )
```
`{ .api }`

2D Matryoshka loss that optimizes across both embedding dimensions and transformer layers.

**Parameters**:
- `model`: SentenceTransformer model
- `loss`: Base loss function
- `matryoshka_dims`: Embedding dimensions to optimize
- `n_layers_per_step`: Number of layers per optimization step

**Use Case**: Early exit capabilities and progressive inference

### MSELoss

```python
class MSELoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer
    )
```
`{ .api }`

Computes the mean squared error between the model's sentence embedding and a target embedding passed as the label, for example an embedding produced by a teacher model.

**Use Case**: Knowledge distillation, extending a monolingual model to new languages

### MarginMSELoss

```python
class MarginMSELoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer
    )
```
`{ .api }`

MSE loss on margins for triplet-like data: compares the predicted margin, sim(query, positive) - sim(query, negative), with a gold margin supplied as the label.

**Use Case**: Triplet data with continuous similarity scores
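
A minimal sketch of a margin-based MSE follows, assuming dot-product similarity and a gold margin per triplet (for instance, one derived from a stronger scoring model). It is illustrative plain Python rather than the library's implementation, and `dot` and `margin_mse` are hypothetical helper names.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def margin_mse(queries, positives, negatives, gold_margins):
    """MSE between predicted margins sim(q, p) - sim(q, n) and gold margins."""
    errors = []
    for q, p, n, gold in zip(queries, positives, negatives, gold_margins):
        predicted = dot(q, p) - dot(q, n)
        errors.append((predicted - gold) ** 2)
    return sum(errors) / len(errors)

# One toy triplet: sim(q, p) = 5, sim(q, n) = 1, so the predicted margin is 4.
loss = margin_mse([[1.0, 2.0]], [[3.0, 1.0]], [[1.0, 0.0]], [3.0])
```

Only the difference between the two similarities is supervised, so the absolute scale of each individual score is left unconstrained.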

## Specialized Loss Functions

### ContrastiveLoss

```python
class ContrastiveLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        distance_metric: SiameseDistanceMetric = SiameseDistanceMetric.COSINE_DISTANCE,
        margin: float = 0.5,
        size_average: bool = True
    )
```
`{ .api }`

Classic contrastive loss for siamese networks with binary similarity labels.

**Parameters**:
- `model`: SentenceTransformer model
- `distance_metric`: Distance metric to use
- `margin`: Distance that negative pairs should reach; negatives farther apart than the margin contribute no loss
- `size_average`: Whether to average the loss over the batch (otherwise it is summed)

**Enum SiameseDistanceMetric**:
- `EUCLIDEAN`: Euclidean distance
- `MANHATTAN`: Manhattan distance
- `COSINE_DISTANCE`: Cosine distance

**Use Case**: Binary similarity classification, siamese networks
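
The classic contrastive formulation can be sketched for a single pair as below: positive pairs (label 1) are pulled together, and negative pairs (label 0) are pushed apart until they reach the margin. This is plain Python working on a precomputed distance; the function name `contrastive_loss` is a hypothetical helper, not the library class.

```python
def contrastive_loss(distance, label, margin=0.5):
    """0.5 * (label * d^2 + (1 - label) * max(margin - d, 0)^2) for one pair."""
    positive_term = label * distance ** 2
    negative_term = (1 - label) * max(margin - distance, 0.0) ** 2
    return 0.5 * (positive_term + negative_term)

close_positive = contrastive_loss(0.2, label=1)  # penalized for any remaining distance
close_negative = contrastive_loss(0.2, label=0)  # penalized for being inside the margin
far_negative = contrastive_loss(0.9, label=0)    # beyond the margin, no penalty
```

Negatives that are already beyond the margin stop contributing, so training effort concentrates on pairs that are still confusable.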

### SoftmaxLoss

```python
class SoftmaxLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        sentence_embedding_dimension: int,
        num_labels: int,
        concatenation_sent_rep: bool = True,
        concatenation_sent_difference: bool = True,
        concatenation_sent_multiplication: bool = False
    )
```
`{ .api }`

Classification loss using softmax over sentence pair representations.

**Parameters**:
- `model`: SentenceTransformer model
- `sentence_embedding_dimension`: Dimension of sentence embeddings
- `num_labels`: Number of classification labels
- `concatenation_sent_rep`: Include individual sentence representations
- `concatenation_sent_difference`: Include element-wise difference
- `concatenation_sent_multiplication`: Include element-wise product

**Use Case**: Natural language inference, text classification
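
The three `concatenation_*` flags determine which features are fed to the classifier. The sketch below builds that feature vector for one pair of toy embeddings `u` and `v`, assuming the difference is taken as an absolute element-wise difference; `pair_features` is a hypothetical helper used only for illustration.

```python
def pair_features(u, v, rep=True, difference=True, multiplication=False):
    """Concatenate the selected pair features into one flat list."""
    features = []
    if rep:
        features += list(u) + list(v)                       # (u, v)
    if difference:
        features += [abs(a - b) for a, b in zip(u, v)]      # |u - v|
    if multiplication:
        features += [a * b for a, b in zip(u, v)]           # u * v
    return features

default_features = pair_features([1, 2], [3, 5])
all_features = pair_features([1, 2], [3, 5], multiplication=True)
```

With the default flags the classifier input has three times the embedding dimension, and enabling the product adds a fourth block.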

## Batch-Based Triplet Losses

### BatchHardTripletLoss

```python
class BatchHardTripletLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        distance_metric: callable = BatchHardTripletLossDistanceFunction.eucledian_distance,
        margin: float = 5
    )
```
`{ .api }`

Batch hard triplet loss that mines the hardest positive and negative pairs within each batch.

**Parameters**:
- `model`: SentenceTransformer model
- `distance_metric`: Distance function for triplet mining
- `margin`: Triplet margin

**Class BatchHardTripletLossDistanceFunction** (static methods):
- `cosine_distance`: Cosine distance
- `eucledian_distance`: Euclidean distance (the name is spelled this way in the library)

**Use Case**: Metric learning with automatic hard negative mining
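
Batch-hard mining can be sketched as follows: for every anchor, take the farthest example with the same label and the closest example with a different label, then apply the triplet margin. This is a simplified plain-Python illustration that skips anchors lacking a positive or a negative; `euclidean` and `batch_hard_triplet_loss` are hypothetical names, not library functions.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def batch_hard_triplet_loss(embeddings, labels, margin=5.0):
    """Mean triplet loss using the hardest positive and negative per anchor."""
    losses = []
    for i, anchor in enumerate(embeddings):
        positives = [euclidean(anchor, e) for j, e in enumerate(embeddings)
                     if j != i and labels[j] == labels[i]]
        negatives = [euclidean(anchor, e) for j, e in enumerate(embeddings)
                     if labels[j] != labels[i]]
        if not positives or not negatives:
            continue  # no valid triplet for this anchor
        # Hardest positive is the farthest; hardest negative is the closest.
        losses.append(max(max(positives) - min(negatives) + margin, 0.0))
    return sum(losses) / len(losses)

embeddings = [[0.0], [1.0], [4.0], [6.0]]
labels = [0, 0, 1, 1]
loose = batch_hard_triplet_loss(embeddings, labels, margin=1.0)
strict = batch_hard_triplet_loss(embeddings, labels, margin=5.0)
```

Mining of this kind only works when each batch contains several examples per label, so the batch sampler matters as much as the loss itself.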

### BatchSemiHardTripletLoss

```python
class BatchSemiHardTripletLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        distance_metric: callable = BatchHardTripletLossDistanceFunction.eucledian_distance,
        margin: float = 5
    )
```
`{ .api }`

Batch semi-hard triplet loss that mines semi-hard negatives: negatives that are farther from the anchor than the positive but still within the margin.

**Use Case**: More stable training than hard negative mining

### BatchHardSoftMarginTripletLoss

```python
class BatchHardSoftMarginTripletLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        distance_metric: callable = BatchHardTripletLossDistanceFunction.eucledian_distance
    )
```
`{ .api }`

Batch hard triplet loss with soft margin (no explicit margin parameter).

**Use Case**: Triplet learning without manual margin tuning

### BatchAllTripletLoss

```python
class BatchAllTripletLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        distance_metric: callable = BatchHardTripletLossDistanceFunction.eucledian_distance,
        margin: float = 5
    )
```
`{ .api }`

Uses all valid triplets in a batch for training.

**Use Case**: Comprehensive triplet learning when computational resources allow

## Contrastive and Tension Losses

### OnlineContrastiveLoss

```python
class OnlineContrastiveLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        distance_metric: SiameseDistanceMetric = SiameseDistanceMetric.COSINE_DISTANCE,
        margin: float = 0.5
    )
```
`{ .api }`

Variant of ContrastiveLoss that selects hard positive pairs (positives that are far apart) and hard negative pairs (negatives that are close together) within each batch and computes the loss only for those pairs.

**Use Case**: Binary similarity labels where most pairs are easy; often a stronger alternative to ContrastiveLoss

### ContrastiveTensionLoss

```python
class ContrastiveTensionLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer
    )
```
`{ .api }`

Unsupervised Contrastive Tension (CT) objective. Two independently updated copies of the encoder are trained so that identical sentences receive a high score and different sentences a low score. It is meant to be used together with `ContrastiveTensionDataLoader`.

**Use Case**: Unsupervised sentence embedding learning from unlabeled sentences

### ContrastiveTensionLossInBatchNegatives

```python
class ContrastiveTensionLossInBatchNegatives(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        scale: float = 20.0,
        similarity_fct: callable = cos_sim
    )
```
`{ .api }`

In-batch version of contrastive tension loss.

**Use Case**: Efficient contrastive learning with in-batch negatives

### ContrastiveTensionDataLoader

```python
class ContrastiveTensionDataLoader:
    def __init__(
        self,
        examples: list,
        batch_size: int = 32,
        pos_neg_ratio: int = 4
    )
```
`{ .api }`

Specialized data loader for contrastive tension training.

**Parameters**:
- `examples`: Training examples
- `batch_size`: Batch size
- `pos_neg_ratio`: Ratio of positives to negatives

## Advanced and Specialized Losses

### AnglELoss

```python
class AnglELoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        scale: float = 20.0
    )
```
`{ .api }`

AnglE (Angle-optimized Text Embeddings) loss. It follows the CoSENT formulation but optimizes the angle difference between embeddings in complex space, which is intended to avoid the vanishing gradients in the saturation zones of the cosine function.

**Use Case**: Sentence pairs with float similarity scores (same input format as CosineSimilarityLoss and CoSENTLoss)

### CoSENTLoss

```python
class CoSENTLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        scale: float = 20.0,
        similarity_fct: callable = pairwise_cos_sim
    )
```
`{ .api }`

CoSENT (Cosine Sentence) loss: a ranking-based objective that pushes pairs with higher target scores to have higher cosine similarity than pairs with lower target scores.

**Use Case**: Sentence similarity learning from pairs with float scores, as an alternative to CosineSimilarityLoss

### GISTEmbedLoss

```python
class GISTEmbedLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        guide: SentenceTransformer
    )
```
`{ .api }`

GISTEmbed (Guided In-sample Selection of Training Negatives) loss. It is a contrastive loss with in-batch negatives in which a guide model is used to exclude candidates that are likely false negatives.

**Parameters**:
- `model`: Model to train
- `guide`: Guide model used to select which in-batch negatives to keep

**Use Case**: Contrastive training with in-batch negatives when false negatives are a concern

### CachedGISTEmbedLoss

```python
class CachedGISTEmbedLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        guide: SentenceTransformer,
        mini_batch_size: int = 32
    )
```
`{ .api }`

Cached version of GISTEmbedLoss that processes each batch in mini-batches of `mini_batch_size`, so large batch sizes fit in limited GPU memory.

**Use Case**: GISTEmbed-style training with large batch sizes under memory constraints

### DenoisingAutoEncoderLoss

```python
class DenoisingAutoEncoderLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        decoder_name_or_path: str | None = None,
        tie_encoder_decoder: bool = True
    )
```
`{ .api }`

Denoising autoencoder loss for self-supervised learning.

**Parameters**:
- `model`: SentenceTransformer encoder
- `decoder_name_or_path`: Decoder model path
- `tie_encoder_decoder`: Whether to tie encoder and decoder weights

**Use Case**: Self-supervised pre-training, unsupervised learning

### MegaBatchMarginLoss

```python
class MegaBatchMarginLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        positive_margin: float = 0.8,
        negative_margin: float = 0.3,
        use_mini_batched_version: bool = True,
        mini_batch_size: int = 50
    )
```
`{ .api }`

Margin-based loss designed for very large batches. For each (anchor, positive) pair it picks the hardest negative among the other positives in the batch and forms a triplet from it.

**Use Case**: Large-scale contrastive learning with massive batches

### DistillKLDivLoss

```python
class DistillKLDivLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        similarity_fct: callable = pairwise_dot_score,
        temperature: float = 1.0
    )
```
`{ .api }`

Knowledge distillation using the KL divergence between the student's similarity-score distribution and teacher scores. The teacher scores are precomputed and passed as labels; the loss does not take a teacher model.

**Use Case**: Model distillation, compression

### AdaptiveLayerLoss

```python
class AdaptiveLayerLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        loss: torch.nn.Module,
        n_layers_per_step: int = 1
    )
```
`{ .api }`

Wrapper loss that also applies the base loss to the outputs of earlier transformer layers, so the model keeps producing useful embeddings when later layers are removed.

**Use Case**: Layer reduction for faster inference, computational efficiency

## Cached Loss Functions

### CachedMultipleNegativesRankingLoss

```python
class CachedMultipleNegativesRankingLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        scale: float = 20.0,
        similarity_fct: callable = cos_sim,
        mini_batch_size: int = 32
    )
```
`{ .api }`

Memory-efficient cached version of MultipleNegativesRankingLoss. Each batch is processed in mini-batches of `mini_batch_size`, which allows much larger batch sizes (and therefore more in-batch negatives) at the cost of extra computation.

### CachedMultipleNegativesSymmetricRankingLoss

```python
class CachedMultipleNegativesSymmetricRankingLoss(torch.nn.Module):
    def __init__(
        self,
        model: SentenceTransformer,
        scale: float = 20.0,
        similarity_fct: callable = cos_sim,
        mini_batch_size: int = 32
    )
```
`{ .api }`

Cached symmetric version for memory efficiency.

## Usage Examples

### Basic Contrastive Learning

```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

# Initialize model and loss
model = SentenceTransformer('distilbert-base-uncased')
loss = MultipleNegativesRankingLoss(model, scale=20.0)

# Prepare data (anchor-positive pairs)
train_data = [
    {"anchor": "The cat sits on the mat", "positive": "A feline rests on a rug"},
    {"anchor": "Python programming language", "positive": "Coding with Python"}
]

train_dataset = Dataset.from_list(train_data)

# Training with contrastive loss
args = SentenceTransformerTrainingArguments(
    output_dir='./contrastive-training',
    per_device_train_batch_size=64,  # Larger batches work better
    num_train_epochs=3
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss
)

trainer.train()
```

### Triplet Learning

```python
from sentence_transformers.losses import TripletLoss, TripletDistanceMetric

# Reuses `model`, `args`, `Dataset`, and `SentenceTransformerTrainer` from the first example

# Triplet loss with cosine distance
triplet_loss = TripletLoss(
    model=model,
    distance_metric=TripletDistanceMetric.COSINE,
    triplet_margin=0.5
)

# Prepare triplet data
triplet_data = [
    {
        "anchor": "The cat sits on the mat",
        "positive": "A feline rests on a rug",
        "negative": "Dogs are great pets"
    }
]

triplet_dataset = Dataset.from_list(triplet_data)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=triplet_dataset,
    loss=triplet_loss
)

trainer.train()
```

### Matryoshka Representation Learning

```python
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

# Reuses `model`, `args`, and `train_dataset` from the first example

# Base loss
base_loss = MultipleNegativesRankingLoss(model)

# Matryoshka loss with multiple dimensions
matryoshka_loss = MatryoshkaLoss(
    model=model,
    loss=base_loss,
    matryoshka_dims=[768, 512, 256, 128, 64],
    matryoshka_weights=[1, 1, 1, 1, 1]  # Equal weights
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=matryoshka_loss
)

trainer.train()

# Test at different dimensions
embeddings_full = model.encode(["Test"], truncate_dim=None)
embeddings_256 = model.encode(["Test"], truncate_dim=256)
embeddings_64 = model.encode(["Test"], truncate_dim=64)
```

### Similarity Regression

```python
from sentence_transformers.losses import CosineSimilarityLoss
import torch.nn as nn

# Cosine similarity loss with different transformations
mse_loss = CosineSimilarityLoss(
    model=model,
    loss_fct=nn.MSELoss(),
    cos_score_transformation=nn.Identity()
)

# For scores in [0, 1] range
sigmoid_loss = CosineSimilarityLoss(
    model=model,
    loss_fct=nn.MSELoss(),
    cos_score_transformation=nn.Sigmoid()
)

# Prepare similarity data
similarity_data = [
    {"sentence1": "The cat sits", "sentence2": "A cat is sitting", "label": 0.9},
    {"sentence1": "Dogs bark", "sentence2": "Cars are fast", "label": 0.1}
]

similarity_dataset = Dataset.from_list(similarity_data)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=similarity_dataset,
    loss=mse_loss
)

trainer.train()
```

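### Knowledge Distillation

```python
import torch
from sentence_transformers.losses import DistillKLDivLoss

# Teacher model (larger, pre-trained) and student model (smaller)
teacher_model = SentenceTransformer('all-mpnet-base-v2')
student_model = SentenceTransformer('distilbert-base-uncased')

# (query, positive, negative) triplets, reusing `triplet_data` from the triplet example
distill_dataset = Dataset.from_list(triplet_data).rename_column("anchor", "query")

def compute_labels(batch):
    # Teacher dot-product scores for (query, positive) and (query, negative)
    q = teacher_model.encode(batch["query"], convert_to_tensor=True)
    p = teacher_model.encode(batch["positive"], convert_to_tensor=True)
    n = teacher_model.encode(batch["negative"], convert_to_tensor=True)
    return {"label": torch.stack([(q * p).sum(-1), (q * n).sum(-1)], dim=1).tolist()}

distill_dataset = distill_dataset.map(compute_labels, batched=True)

# The loss reads the teacher scores from the "label" column
distill_loss = DistillKLDivLoss(model=student_model)

trainer = SentenceTransformerTrainer(
    model=student_model,
    args=args,
    train_dataset=distill_dataset,
    loss=distill_loss
)

trainer.train()
```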

### Multi-Task Learning

```python
from sentence_transformers.losses import MultipleNegativesRankingLoss, SoftmaxLoss

# Combine different losses for multi-task learning
contrastive_loss = MultipleNegativesRankingLoss(model)
classification_loss = SoftmaxLoss(
    model=model,
    sentence_embedding_dimension=768,
    num_labels=3  # For NLI: entailment, contradiction, neutral
)

# Multi-dataset training
# `train_dataset` holds the anchor-positive pairs from the first example.
# `nli_dataset` is not defined in this document; it is assumed to be a Dataset
# of sentence pairs with an integer class label.
datasets = {
    "similarity": train_dataset,
    "classification": nli_dataset
}

# Keys must match the dataset names above
losses = {
    "similarity": contrastive_loss,
    "classification": classification_loss
}

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=datasets,
    loss=losses
)

trainer.train()
```

### Advanced Batch Mining

```python
from sentence_transformers.losses import BatchHardTripletLoss, BatchHardTripletLossDistanceFunction

# Hard negative mining within batches
batch_hard_loss = BatchHardTripletLoss(
    model=model,
    distance_metric=BatchHardTripletLossDistanceFunction.cosine_distance,
    margin=0.2
)

# Use with datasets that have class labels
class_data = [
    {"text": "Python programming", "label": 0},
    {"text": "Coding in Python", "label": 0},
    {"text": "Machine learning", "label": 1},
    {"text": "AI algorithms", "label": 1}
]

class_dataset = Dataset.from_list(class_data)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=class_dataset,
    loss=batch_hard_loss
)

trainer.train()
```

## Best Practices

1. **Loss Selection**: Choose loss functions based on your data format and task
2. **Batch Size**: Use larger batches (64+) for contrastive losses when possible
3. **Scaling**: Adjust scale parameters based on your similarity function
4. **Negative Sampling**: Consider hard negative mining for improved performance
5. **Multi-Task**: Combine different losses for comprehensive training
6. **Progressive Training**: Use Matryoshka or adaptive losses for efficiency
7. **Evaluation**: Monitor performance on validation sets during training
8. **Hyperparameter Tuning**: Experiment with margins, scales, and learning rates