
# Evaluation

The sentence-transformers package provides a comprehensive evaluation framework for measuring model performance across various tasks including semantic similarity, information retrieval, classification, and clustering.

## Import Statement

```python
from sentence_transformers.evaluation import (
    EmbeddingSimilarityEvaluator,
    InformationRetrievalEvaluator,
    BinaryClassificationEvaluator,
    # ... other evaluators
)
```

## Base Evaluator

### SentenceEvaluator

```python
class SentenceEvaluator:
    def __call__(
        self,
        model: SentenceTransformer,
        output_path: str | None = None,
        epoch: int = -1,
        steps: int = -1
    ) -> float | dict[str, float]
```
`{ .api }`

Abstract base class for all sentence transformer evaluators.

**Parameters**:
- `model`: SentenceTransformer model to evaluate
- `output_path`: Directory to save evaluation results
- `epoch`: Current training epoch (for logging)
- `steps`: Current training steps (for logging)

**Returns**: Primary evaluation metric score. In recent releases (v3 and later) the built-in evaluators return a dict mapping metric names to scores, and the key of the primary metric is stored in the evaluator's `primary_metric` attribute.

## Similarity Evaluation

### EmbeddingSimilarityEvaluator

```python
class EmbeddingSimilarityEvaluator(SentenceEvaluator):
    def __init__(
        self,
        sentences1: list[str],
        sentences2: list[str],
        scores: list[float],
        batch_size: int = 16,
        main_similarity: SimilarityFunction | None = None,
        name: str = "",
        show_progress_bar: bool = None,
        write_csv: bool = True
    )
```
`{ .api }`

Evaluates model performance on semantic textual similarity tasks by computing correlation between predicted and gold similarity scores.

**Parameters**:
- `sentences1`: First sentences in pairs
- `sentences2`: Second sentences in pairs
- `scores`: Gold similarity scores (typically -1 to 1 or 0 to 1)
- `batch_size`: Batch size for encoding
- `main_similarity`: Similarity function to use (defaults to model's function)
- `name`: Name for evaluation results
- `show_progress_bar`: Display progress during evaluation
- `write_csv`: Save detailed results to CSV file

**Returns**: Spearman correlation coefficient
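
To make the returned metric concrete, the following is a minimal pure-Python sketch of Spearman correlation: rank both score lists, then take the Pearson correlation of the ranks. It is an illustration only, not the library's implementation (which works on model embeddings). The function names and the sample numbers are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(predicted, gold):
    """Spearman correlation is the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(predicted), average_ranks(gold))


predicted_similarities = [0.82, 0.35, 0.61, 0.10]  # hypothetical cosine similarities
gold_scores = [0.9, 0.4, 0.5, 0.0]                 # hypothetical gold labels
print(spearman(predicted_similarities, gold_scores))
```

Only the ordering of the pairs matters here: the predicted similarities differ from the gold scores in value but rank the four pairs identically, so the correlation is perfect.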

### MSEEvaluator

```python
class MSEEvaluator(SentenceEvaluator):
    def __init__(
        self,
        source_sentences: list[str],
        target_sentences: list[str],
        teacher_model: SentenceTransformer | None = None,
        show_progress_bar: bool = False,
        batch_size: int = 32,
        name: str = "",
        write_csv: bool = True
    )
```
`{ .api }`

Evaluates a student model using the Mean Squared Error between its embeddings of `target_sentences` and the embeddings a teacher model produces for `source_sentences`. It is typically used for knowledge distillation and for extending a model to new languages.

**Returns**: Negative MSE (higher is better)

### MSEEvaluatorFromDataFrame

```python
class MSEEvaluatorFromDataFrame(SentenceEvaluator):
    def __init__(
        self,
        dataframe: list[dict[str, str]],
        teacher_model: SentenceTransformer,
        combinations: list[tuple[str, str]],
        batch_size: int = 8,
        name: str = "",
        write_csv: bool = True
    )
```
`{ .api }`

MSE evaluator for parallel sentences stored as rows keyed by language code.

**Parameters**:
- `dataframe`: List of rows, each a dict mapping a language code to the sentence in that language, e.g. `{"en": "My sentence", "es": "Mi frase"}`
- `teacher_model`: Model that produces the reference embeddings
- `combinations`: Language pairs to evaluate, e.g. `[("en", "es"), ("en", "fr")]`
- Other parameters same as MSEEvaluator

## Classification Evaluation

### BinaryClassificationEvaluator

```python
class BinaryClassificationEvaluator(SentenceEvaluator):
    def __init__(
        self,
        sentences1: list[str],
        sentences2: list[str],
        labels: list[int],
        batch_size: int = 16,
        name: str = "",
        show_progress_bar: bool = None,
        write_csv: bool = True
    )
```
`{ .api }`

Evaluates binary classification performance using cosine similarity as classification score.

**Parameters**:
- `sentences1`: First sentences in pairs
- `sentences2`: Second sentences in pairs
- `labels`: Binary labels (0 or 1)
- `batch_size`: Batch size for encoding
- `name`: Name for evaluation results
- `show_progress_bar`: Display progress bar
- `write_csv`: Save results to CSV

**Returns**: Average Precision (AP) score
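
As an illustration of the returned metric, here is a small pure-Python sketch of Average Precision: pairs are ranked by descending similarity, and the precision at each positive pair's rank is averaged. It is not the library's code, and the function name and the numbers are hypothetical.

```python
def average_precision(similarities, labels):
    """AP for binary labels, ranking pairs by descending similarity."""
    ranked = sorted(zip(similarities, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this positive's rank
    return precision_sum / hits if hits else 0.0


similarities = [0.91, 0.15, 0.48, 0.77]  # hypothetical cosine similarities per pair
labels = [1, 0, 1, 0]                    # 1 = similar pair, 0 = dissimilar pair
print(average_precision(similarities, labels))
```

In this example the positives land at ranks 1 and 3, so the score is the mean of 1/1 and 2/3. No classification threshold is needed, because the metric depends only on how the pairs are ordered.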

### LabelAccuracyEvaluator

```python
class LabelAccuracyEvaluator(SentenceEvaluator):
    def __init__(
        self,
        dataloader: DataLoader,
        name: str = "",
        softmax_model=None,
        write_csv: bool = True
    )
```
`{ .api }`

Evaluates classification accuracy on a labeled dataset. It requires a model trained with a softmax classification head (`SoftmaxLoss`), which is used to predict a label for each example.

**Parameters**:
- `dataloader`: PyTorch `DataLoader` that yields the labeled evaluation examples
- `name`: Name for evaluation
- `softmax_model`: The `SoftmaxLoss` instance whose classifier produces the predictions
- `write_csv`: Save results to CSV

**Returns**: Classification accuracy

## Information Retrieval Evaluation

### InformationRetrievalEvaluator

```python
class InformationRetrievalEvaluator(SentenceEvaluator):
    def __init__(
        self,
        queries: dict[str, str],
        corpus: dict[str, str],
        relevant_docs: dict[str, set[str]],
        corpus_chunk_size: int = 50000,
        mrr_at_k: list[int] = [10],
        ndcg_at_k: list[int] = [10],
        accuracy_at_k: list[int] = [1, 3, 5, 10],
        precision_recall_at_k: list[int] = [1, 3, 5, 10],
        map_at_k: list[int] = [100],
        max_corpus_size: int = None,
        show_progress_bar: bool = None,
        batch_size: int = 32,
        name: str = "",
        write_csv: bool = True
    )
```
`{ .api }`

Comprehensive information retrieval evaluation with multiple metrics.

**Parameters**:
- `queries`: Dictionary mapping query IDs to query texts
- `corpus`: Dictionary mapping document IDs to document texts
- `relevant_docs`: Dictionary mapping query IDs to sets of relevant document IDs
- `corpus_chunk_size`: Size of corpus chunks for processing
- `mrr_at_k`: Ranks for Mean Reciprocal Rank calculation
- `ndcg_at_k`: Ranks for NDCG calculation
- `accuracy_at_k`: Ranks for accuracy calculation
- `precision_recall_at_k`: Ranks for precision/recall calculation
- `map_at_k`: Ranks for Mean Average Precision calculation
- `max_corpus_size`: Maximum corpus size to use
- `show_progress_bar`: Display progress bar
- `batch_size`: Batch size for encoding
- `name`: Name for evaluation
- `write_csv`: Save results to CSV

**Returns**: NDCG@10 score
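
The rank-based metrics listed above can be sketched in a few lines of pure Python. The example below computes MRR@k and NDCG@k with binary relevance for a single query. It illustrates the definitions only and is not the library's implementation; the function names and the ranking are hypothetical.

```python
import math


def mrr_at_k(ranked_doc_ids, relevant_ids, k):
    """Reciprocal rank of the first relevant document within the top k."""
    for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_doc_ids, relevant_ids, k):
    """NDCG with binary relevance: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


ranking = ["d3", "d1", "d4", "d2"]  # hypothetical retrieval order for one query
relevant = {"d1", "d4"}             # relevant document IDs for that query

print(mrr_at_k(ranking, relevant, k=10))
print(ndcg_at_k(ranking, relevant, k=10))
```

An evaluator averages such per-query values over all queries. The first relevant document sits at rank 2 here, so the reciprocal rank is 0.5, while NDCG also rewards the second relevant document at rank 3.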

### RerankingEvaluator

```python
class RerankingEvaluator(SentenceEvaluator):
    def __init__(
        self,
        samples: list[dict],
        mrr_at_k: list[int] = [10],
        ndcg_at_k: list[int] = [10],
        accuracy_at_k: list[int] = [1, 3, 5, 10],
        precision_recall_at_k: list[int] = [1, 3, 5, 10],
        map_at_k: list[int] = [100],
        name: str = "",
        write_csv: bool = True,
        batch_size: int = 512,
        show_progress_bar: bool = None
    )
```
`{ .api }`

Evaluates reranking performance on query-document pairs.

**Parameters**:
- `samples`: List of samples with query, positive, and negative documents
- Other parameters similar to InformationRetrievalEvaluator

**Returns**: MRR@10 score
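
The sketch below shows what a reranking sample could look like and how MRR@k is derived from it. The dictionary keys (`query`, `positive`, `negative`) are an assumption about the sample layout, and a toy token-overlap score stands in for embedding similarity so that the example runs without a model. All names are hypothetical.

```python
def token_overlap(a, b):
    """Toy relevance score: Jaccard overlap of lowercase tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def mean_reciprocal_rank(samples, k=10):
    """Rerank each sample's documents and average 1/rank of the first positive."""
    total = 0.0
    for sample in samples:
        docs = [(doc, True) for doc in sample["positive"]]
        docs += [(doc, False) for doc in sample["negative"]]
        docs.sort(key=lambda item: token_overlap(sample["query"], item[0]), reverse=True)
        for rank, (_, is_positive) in enumerate(docs[:k], start=1):
            if is_positive:
                total += 1.0 / rank
                break
    return total / len(samples)


samples = [
    {
        "query": "how do neural networks learn",
        "positive": ["neural networks learn by gradient descent"],
        "negative": ["weather forecasting uses statistical models"],
    },
    {
        "query": "what is machine learning",
        "positive": ["machine learning is a subset of artificial intelligence"],
        "negative": ["what is learning", "cats sleep a lot"],
    },
]

print(mean_reciprocal_rank(samples))
```

In the second sample the lexically similar negative outranks the positive, so that query contributes 0.5 instead of 1.0. This is the kind of failure a reranking evaluation is meant to expose.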

## Specialized Evaluators

### TripletEvaluator

```python
class TripletEvaluator(SentenceEvaluator):
    def __init__(
        self,
        anchors: list[str],
        positives: list[str],
        negatives: list[str],
        main_distance_function: SimilarityFunction | None = None,
        name: str = "",
        batch_size: int = 16,
        show_progress_bar: bool = None,
        write_csv: bool = True
    )
```
`{ .api }`

Evaluates triplet accuracy: anchor should be closer to positive than negative.

**Parameters**:
- `anchors`: Anchor sentences
- `positives`: Positive sentences
- `negatives`: Negative sentences
- `main_distance_function`: Distance function to use
- `name`: Name for evaluation
- `batch_size`: Batch size for encoding
- `show_progress_bar`: Display progress bar
- `write_csv`: Save results to CSV

**Returns**: Triplet accuracy (percentage of correct triplets)
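
Triplet accuracy is simple enough to sketch directly. The example below uses small hand-written vectors in place of sentence embeddings and counts how often the anchor is more similar to the positive than to the negative under cosine similarity. The vectors and function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_accuracy(anchors, positives, negatives):
    """Fraction of triplets where the anchor is closer to the positive."""
    correct = sum(
        cosine_similarity(a, p) > cosine_similarity(a, n)
        for a, p, n in zip(anchors, positives, negatives)
    )
    return correct / len(anchors)


# Hypothetical 2-dimensional "embeddings"
anchor_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
positive_vecs = [[0.9, 0.1], [0.1, 0.8], [-1.0, 0.2]]
negative_vecs = [[0.0, 1.0], [1.0, 0.0], [0.9, 1.1]]

print(triplet_accuracy(anchor_vecs, positive_vecs, negative_vecs))
```

The third triplet is deliberately wrong (its "positive" points away from the anchor), so two of the three triplets are counted as correct.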

### ParaphraseMiningEvaluator

```python
class ParaphraseMiningEvaluator(SentenceEvaluator):
    def __init__(
        self,
        sentences_map: dict[str, str],
        duplicates_list: set[tuple[str, str]],
        duplicates_dict: dict[str, dict[str, bool]] = None,
        query_chunk_size: int = 5000,
        corpus_chunk_size: int = 100000,
        max_pairs: int = 500000,
        top_k: int = 100,
        name: str = "",
        batch_size: int = 16,
        show_progress_bar: bool = None,
        write_csv: bool = True
    )
```
`{ .api }`

Evaluates paraphrase mining performance by finding duplicate/similar sentences.

**Parameters**:
- `sentences_map`: Dictionary mapping sentence IDs to texts
- `duplicates_list`: Set of sentence ID pairs that are duplicates
- `duplicates_dict`: Alternative format for duplicates
- `query_chunk_size`: Size of query chunks for processing
- `corpus_chunk_size`: Size of corpus chunks
- `max_pairs`: Maximum pairs to evaluate
- `top_k`: Number of top pairs to consider
- `name`: Name for evaluation
- `batch_size`: Batch size for encoding
- `show_progress_bar`: Display progress bar
- `write_csv`: Save results to CSV

**Returns**: Average Precision score

### TranslationEvaluator

```python
class TranslationEvaluator(SentenceEvaluator):
    def __init__(
        self,
        source_sentences: list[str],
        target_sentences: list[str],
        batch_size: int = 16,
        name: str = "",
        show_progress_bar: bool = None,
        write_csv: bool = True
    )
```
`{ .api }`

Evaluates cross-lingual performance on aligned translation pairs: for each source sentence, it checks whether the most similar target embedding belongs to that sentence's own translation, and likewise in the target-to-source direction.

**Parameters**:
- `source_sentences`: Source language sentences
- `target_sentences`: Target language sentences (translations), aligned by index with `source_sentences`
- `batch_size`: Batch size for encoding
- `name`: Name for evaluation
- `show_progress_bar`: Display progress bar
- `write_csv`: Save results to CSV

**Returns**: Mean of the source-to-target and target-to-source matching accuracy
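
A common way to score aligned translations is retrieval accuracy: each source sentence should find its own translation as the nearest target, and vice versa. The following pure-Python sketch shows that computation on hand-written vectors that stand in for embeddings. It is an illustration under that assumption, with hypothetical names and data.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def matching_accuracy(queries, candidates):
    """Fraction of queries whose most similar candidate sits at the same index."""
    correct = 0
    for index, query in enumerate(queries):
        sims = [cosine_similarity(query, candidate) for candidate in candidates]
        if sims.index(max(sims)) == index:
            correct += 1
    return correct / len(queries)


# Hypothetical embeddings; row i of each list belongs to the same sentence pair
source_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
target_vecs = [[0.9, 0.2], [0.6, 0.8], [0.5, 0.9]]

src2trg = matching_accuracy(source_vecs, target_vecs)
trg2src = matching_accuracy(target_vecs, source_vecs)
print(src2trg, trg2src, (src2trg + trg2src) / 2)
```

The two directions can disagree, as they do here, which is why reporting their mean gives a more balanced picture than either direction alone.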

## Advanced Evaluators

### SequentialEvaluator

```python
class SequentialEvaluator(SentenceEvaluator):
    def __init__(
        self,
        evaluators: list[SentenceEvaluator],
        main_score_function: callable = lambda scores: scores[-1]
    )
```
`{ .api }`

Runs multiple evaluators sequentially and combines their results.

**Parameters**:
- `evaluators`: List of evaluators to run
- `main_score_function`: Function that receives the list of primary scores, one per evaluator, and returns the main score (by default, the score of the last evaluator)

**Returns**: Combined evaluation score
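
The role of `main_score_function` can be illustrated with a tiny stand-in that runs a list of zero-argument callables in order and reduces their scores. The helper `run_sequentially` is hypothetical and only mimics the combining step. Its default of returning the last score is believed to mirror the library's default and should be treated as an assumption.

```python
def run_sequentially(evaluators, main_score_function=lambda scores: scores[-1]):
    """Call each evaluator in order and combine the collected scores."""
    scores = [evaluate() for evaluate in evaluators]
    return main_score_function(scores)


# Hypothetical evaluators that each return a fixed primary score
fake_evaluators = [lambda: 0.80, lambda: 0.60, lambda: 0.90]

last_score = run_sequentially(fake_evaluators)
average_score = run_sequentially(fake_evaluators, lambda scores: sum(scores) / len(scores))
worst_score = run_sequentially(fake_evaluators, min)
print(last_score, average_score, worst_score)
```

Averaging treats all tasks equally, while `min` is a conservative choice when no task is allowed to regress. Either only makes sense if the individual scores are on comparable scales.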

### NanoBEIREvaluator

```python
class NanoBEIREvaluator(SentenceEvaluator):
    def __init__(
        self,
        dataset_name: str | None = None,
        dataset_config: str | None = None,
        dataset_revision: str | None = None,
        corpus_chunk_size: int = 50000,
        max_corpus_size: int | None = None,
        **kwargs
    )
```
`{ .api }`

Evaluator for NanoBEIR, a collection of much smaller subsets of the BEIR (Benchmarking IR) information retrieval datasets, intended for quick retrieval evaluation.

**Parameters**:
- `dataset_name`: Name of the NanoBEIR dataset
- `dataset_config`: Dataset configuration
- `dataset_revision`: Dataset revision to use
- `corpus_chunk_size`: Corpus processing chunk size
- `max_corpus_size`: Maximum corpus size to evaluate on
- `**kwargs`: Additional arguments passed to base evaluator

**Returns**: NDCG@10 score on the NanoBEIR task

## Utility Classes

### SimilarityFunction

```python
from sentence_transformers.evaluation import SimilarityFunction

class SimilarityFunction(Enum):
    COSINE = "cosine"
    DOT_PRODUCT = "dot"
    EUCLIDEAN = "euclidean"
    MANHATTAN = "manhattan"
```
`{ .api }`

Enumeration of similarity functions available for evaluation.
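
For reference, the four functions named by the enum can be written out in plain Python. This is a sketch of the underlying math, not the library's tensor-based implementation. Distances are negated here so that a larger value always means "more similar", which is an assumption made for this example. The dictionary keys reuse the enum's string values.

```python
import math


def cosine(u, v):
    """Dot product of the vectors after length normalization."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def dot(u, v):
    """Plain dot product; sensitive to vector length."""
    return sum(a * b for a, b in zip(u, v))


def euclidean(u, v):
    """Negated straight-line distance."""
    return -math.dist(u, v)


def manhattan(u, v):
    """Negated sum of absolute coordinate differences."""
    return -sum(abs(a - b) for a, b in zip(u, v))


SIMILARITY_FUNCTIONS = {
    "cosine": cosine,
    "dot": dot,
    "euclidean": euclidean,
    "manhattan": manhattan,
}

u, v = [1.0, 2.0], [2.0, 4.0]  # hypothetical vectors pointing in the same direction
for name, function in SIMILARITY_FUNCTIONS.items():
    print(name, function(u, v))
```

The two vectors differ only in length, so cosine similarity treats them as identical while the other three functions do not. That difference is the main thing to keep in mind when choosing a function for unnormalized embeddings.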

## Usage Examples

### Basic Similarity Evaluation

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# Load model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Prepare evaluation data
sentences1 = ["The cat sits on the mat", "I love programming"]
sentences2 = ["A feline rests on a rug", "I enjoy coding"]
scores = [0.9, 0.8]  # Similarity scores

# Create evaluator
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=sentences1,
    sentences2=sentences2,
    scores=scores,
    name="dev"
)

# Evaluate model
results = evaluator(model, output_path="./evaluation_results/")
# Recent releases return a dict of metrics; older ones return a single float
correlation = results[evaluator.primary_metric] if isinstance(results, dict) else results
print(f"Spearman correlation: {correlation:.4f}")
```

### Information Retrieval Evaluation

```python
from sentence_transformers.evaluation import InformationRetrievalEvaluator

# Prepare IR evaluation data
queries = {
    "q1": "What is machine learning?",
    "q2": "How do neural networks work?"
}

corpus = {
    "d1": "Machine learning is a subset of artificial intelligence",
    "d2": "Neural networks are computational models inspired by biology",
    "d3": "Weather forecasting uses statistical models",
    "d4": "Deep learning uses multiple layers of neural networks"
}

relevant_docs = {
    "q1": {"d1", "d4"},  # Relevant documents for q1
    "q2": {"d2", "d4"}   # Relevant documents for q2
}

# Create IR evaluator
ir_evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name="test_retrieval"
)

# Evaluate (reuses `model` from the previous example)
ir_results = ir_evaluator(model, output_path="./ir_results/")
# Recent releases return a dict of metrics; older ones return a single float
ir_score = ir_results[ir_evaluator.primary_metric] if isinstance(ir_results, dict) else ir_results
print(f"Primary retrieval metric: {ir_score:.4f}")
```

### Binary Classification Evaluation

```python
from sentence_transformers.evaluation import BinaryClassificationEvaluator

# Prepare binary classification data
sentences1 = [
    "The cat sits on the mat",
    "I love programming",
    "Dogs are great pets",
    "Weather is nice today"
]

sentences2 = [
    "A feline rests on a rug",     # Similar to first
    "Cooking is fun",              # Different from second
    "Cats are wonderful animals",  # Related to third
    "It's sunny outside"           # Similar to fourth
]

labels = [1, 0, 1, 1]  # Binary similarity labels

# Create evaluator
binary_evaluator = BinaryClassificationEvaluator(
    sentences1=sentences1,
    sentences2=sentences2,
    labels=labels,
    name="binary_classification"
)

# Evaluate
binary_results = binary_evaluator(model, output_path="./binary_results/")
# Recent releases return a dict of metrics; older ones return a single float
ap_score = (
    binary_results[binary_evaluator.primary_metric]
    if isinstance(binary_results, dict)
    else binary_results
)
print(f"Average Precision: {ap_score:.4f}")
```

### Triplet Evaluation

```python
from sentence_transformers.evaluation import TripletEvaluator

# Prepare triplet data
anchors = ["The cat sits on the mat", "I love programming"]
positives = ["A feline rests on a rug", "I enjoy coding"]
negatives = ["Dogs are great pets", "Weather is nice"]

# Create triplet evaluator
triplet_evaluator = TripletEvaluator(
    anchors=anchors,
    positives=positives,
    negatives=negatives,
    name="triplet_eval"
)

# Evaluate
triplet_results = triplet_evaluator(model, output_path="./triplet_results/")
# Recent releases return a dict of metrics; older ones return a single float
accuracy = (
    triplet_results[triplet_evaluator.primary_metric]
    if isinstance(triplet_results, dict)
    else triplet_results
)
print(f"Triplet accuracy: {accuracy:.4f}")
```

### Sequential Multi-Task Evaluation

```python
from sentence_transformers.evaluation import SequentialEvaluator

# Create multiple evaluators from the data defined in the previous examples.
# `sentences1`/`sentences2` now hold four pairs (binary example) while `scores`
# holds two values, so the similarity evaluator uses only the first two pairs.
similarity_eval = EmbeddingSimilarityEvaluator(sentences1[:2], sentences2[:2], scores, name="similarity")
binary_eval = BinaryClassificationEvaluator(sentences1, sentences2, labels, name="binary")
triplet_eval = TripletEvaluator(anchors, positives, negatives, name="triplet")

# Combine evaluators
sequential_evaluator = SequentialEvaluator(
    evaluators=[similarity_eval, binary_eval, triplet_eval],
    main_score_function=lambda scores: sum(scores) / len(scores)  # Average score
)

# Run all evaluations
combined = sequential_evaluator(model, output_path="./multi_eval_results/")
# Recent releases return a dict of metrics; older ones return a single float
combined_score = (
    combined[sequential_evaluator.primary_metric] if isinstance(combined, dict) else combined
)
print(f"Combined evaluation score: {combined_score:.4f}")
```

### Training Integration

```python
from sentence_transformers import SentenceTransformerTrainer, SentenceTransformerTrainingArguments

# Assumes `model`, `train_dataset`, `loss`, and the dev_* lists are already defined

# Create evaluator for training
dev_evaluator = EmbeddingSimilarityEvaluator(
    sentences1=dev_sentences1,
    sentences2=dev_sentences2,
    scores=dev_scores,
    name="sts-dev"
)

# Training arguments with evaluation
args = SentenceTransformerTrainingArguments(
    output_dir='./training_with_eval',
    eval_strategy="steps",  # named `evaluation_strategy` in older transformers releases
    eval_steps=100,
    logging_steps=100,
    save_steps=100,
    num_train_epochs=3,
    per_device_train_batch_size=16,
    load_best_model_at_end=True,
    # Evaluator metrics are logged as "eval_" + evaluator name + metric name
    metric_for_best_model="eval_sts-dev_spearman_cosine",
    greater_is_better=True
)

# Pass the evaluator to the trainer, which runs it at every evaluation step
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
    evaluator=dev_evaluator
)

trainer.train()
```

### Custom Evaluator

```python
import csv
import os

from sentence_transformers.evaluation import SentenceEvaluator


class CustomEvaluator(SentenceEvaluator):
    """Custom evaluator for specific task."""

    def __init__(self, test_data, name="custom"):
        super().__init__()
        self.test_data = test_data
        self.name = name

    def __call__(self, model, output_path=None, epoch=-1, steps=-1):
        # Implement custom evaluation logic
        embeddings = model.encode([item['text'] for item in self.test_data])

        # Calculate your custom metric
        custom_score = self.calculate_custom_metric(embeddings)

        # Save results if output_path provided
        if output_path:
            self.save_results(custom_score, output_path, epoch, steps)

        return custom_score

    def calculate_custom_metric(self, embeddings):
        # Implement your metric calculation
        return 0.85  # Placeholder

    def save_results(self, score, output_path, epoch, steps):
        # Save evaluation results, creating the output directory if needed
        os.makedirs(output_path, exist_ok=True)
        csv_file = os.path.join(output_path, f"{self.name}_results.csv")
        with open(csv_file, 'w', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(['epoch', 'steps', 'score'])
            writer.writerow([epoch, steps, score])


# Use custom evaluator
# (`test_data` is assumed to be a list of dicts with a 'text' key)
custom_eval = CustomEvaluator(test_data)
score = custom_eval(model, output_path="./custom_results/")
```

### Batch Evaluation on Multiple Datasets

```python
def evaluate_on_multiple_datasets(model, datasets_config):
    """Evaluate model on multiple datasets."""
    results = {}

    for dataset_name, config in datasets_config.items():
        if config['type'] == 'similarity':
            evaluator = EmbeddingSimilarityEvaluator(
                sentences1=config['sentences1'],
                sentences2=config['sentences2'],
                scores=config['scores'],
                name=dataset_name
            )
        elif config['type'] == 'retrieval':
            evaluator = InformationRetrievalEvaluator(
                queries=config['queries'],
                corpus=config['corpus'],
                relevant_docs=config['relevant_docs'],
                name=dataset_name
            )
        else:
            raise ValueError(f"Unknown dataset type: {config['type']}")

        metrics = evaluator(model, output_path=f"./results/{dataset_name}/")
        # Recent releases return a dict of metrics; older ones return a single float
        score = metrics[evaluator.primary_metric] if isinstance(metrics, dict) else metrics
        results[dataset_name] = score
        print(f"{dataset_name}: {score:.4f}")

    return results

# Configuration for multiple datasets
# (the sts_* and msmarco_* variables are assumed to be loaded beforehand)
datasets_config = {
    "sts_benchmark": {
        "type": "similarity",
        "sentences1": sts_sentences1,
        "sentences2": sts_sentences2,
        "scores": sts_scores
    },
    "msmarco": {
        "type": "retrieval",
        "queries": msmarco_queries,
        "corpus": msmarco_corpus,
        "relevant_docs": msmarco_qrels
    }
}

# Run evaluations
all_results = evaluate_on_multiple_datasets(model, datasets_config)
```

## Best Practices

1. **Evaluation Data**: Use high-quality, diverse evaluation datasets
2. **Multiple Metrics**: Evaluate on multiple tasks and metrics for comprehensive assessment
3. **Statistical Significance**: Use appropriate sample sizes for reliable results
4. **Cross-Validation**: Consider cross-validation for robust evaluation
5. **Domain Matching**: Ensure evaluation data matches your target domain
6. **Baseline Comparison**: Always compare against relevant baselines
7. **Error Analysis**: Analyze failure cases to understand model limitations
8. **Reproducibility**: Save evaluation configurations and random seeds
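
For the points on statistical significance and reproducibility, one possible approach is a seeded bootstrap over per-example outcomes, which gives a rough confidence interval for a metric. The sketch below assumes the metric is a mean over examples (such as accuracy). The helper name and the data are hypothetical.

```python
import random


def bootstrap_confidence_interval(per_example_scores, num_resamples=1000, alpha=0.05, seed=42):
    """Percentile bootstrap interval for the mean of per-example scores."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(per_example_scores)
    means = sorted(
        sum(rng.choices(per_example_scores, k=n)) / n
        for _ in range(num_resamples)
    )
    lower = means[int((alpha / 2) * num_resamples)]
    upper = means[int((1 - alpha / 2) * num_resamples) - 1]
    return lower, upper


# Hypothetical outcomes: 80 correct and 20 incorrect predictions
per_example_correct = [1] * 80 + [0] * 20
low, high = bootstrap_confidence_interval(per_example_correct)
print(f"accuracy = 0.80, 95% interval = [{low:.2f}, {high:.2f}]")
```

If the intervals of two models overlap heavily, the evaluation set is probably too small to separate them, which is a practical way to judge whether the sample size is adequate.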