# Evaluation

The sentence-transformers package provides a comprehensive evaluation framework for measuring model performance across various tasks including semantic similarity, information retrieval, classification, and clustering.

## Import Statement

```python
from sentence_transformers.evaluation import (
    EmbeddingSimilarityEvaluator,
    InformationRetrievalEvaluator,
    BinaryClassificationEvaluator,
    # ... other evaluators
)
```

## Base Evaluator

### SentenceEvaluator

```python
class SentenceEvaluator:
    def __call__(
        self,
        model: SentenceTransformer,
        output_path: str | None = None,
        epoch: int = -1,
        steps: int = -1
    ) -> float | dict[str, float]
```
`{ .api }`

Abstract base class for all sentence transformer evaluators.

**Parameters**:
- `model`: SentenceTransformer model to evaluate
- `output_path`: Directory to save evaluation results
- `epoch`: Current training epoch (for logging); `-1` means evaluation outside of training
- `steps`: Current training steps (for logging); `-1` means evaluation at the end of an epoch

**Returns**: Either a single score (higher is better) or a dictionary of metric names to values. When a dictionary is returned, `evaluator.primary_metric` names the key that holds the main score.
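
Depending on the installed release, calling an evaluator may yield a plain float or a dictionary of metrics. The sketch below shows one way to reduce either form to a single number. `primary_score` is a hypothetical helper, not part of the library, and the metric names in the example are illustrative.

```python
def primary_score(results, primary_metric=None):
    """Return one float from an evaluator result.

    `results` is either a float or a dict of metric names to values;
    `primary_metric` is the key to read when a dict is given.
    """
    if isinstance(results, dict):
        if primary_metric is None or primary_metric not in results:
            raise KeyError(f"primary metric {primary_metric!r} not found in results")
        return float(results[primary_metric])
    return float(results)


# Hypothetical results shaped like the two possible return types
print(primary_score(0.83))
print(primary_score({"dev_spearman_cosine": 0.83, "dev_pearson_cosine": 0.81}, "dev_spearman_cosine"))
```

In real use, the second argument would typically be `evaluator.primary_metric`.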

## Similarity Evaluation

### EmbeddingSimilarityEvaluator

```python
class EmbeddingSimilarityEvaluator(SentenceEvaluator):
    def __init__(
        self,
        sentences1: list[str],
        sentences2: list[str],
        scores: list[float],
        batch_size: int = 16,
        main_similarity: SimilarityFunction | None = None,
        name: str = "",
        show_progress_bar: bool = None,
        write_csv: bool = True
    )
```
`{ .api }`

Evaluates model performance on semantic textual similarity tasks by computing correlation between predicted and gold similarity scores.

**Parameters**:
- `sentences1`: First sentences in pairs
- `sentences2`: Second sentences in pairs
- `scores`: Gold similarity scores (typically -1 to 1 or 0 to 1)
- `batch_size`: Batch size for encoding
- `main_similarity`: Similarity function to use (defaults to model's function)
- `name`: Name for evaluation results
- `show_progress_bar`: Display progress during evaluation
- `write_csv`: Save detailed results to CSV file

**Returns**: Spearman correlation coefficient
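
To make the metric concrete, here is a minimal pure-Python sketch of Spearman correlation: rank both score lists, then take the Pearson correlation of the ranks. It only illustrates the idea and is not the library's implementation; the helper names and the example scores are made up.

```python
def rank(values):
    """Assign 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(rank(x), rank(y))


gold = [0.9, 0.1, 0.5, 0.7]       # illustrative gold scores
predicted = [0.8, 0.2, 0.3, 0.6]  # illustrative cosine similarities
print(spearman(gold, predicted))  # close to 1.0: both lists order the pairs identically
```

Because only the ordering matters, the predicted similarities do not need to be on the same scale as the gold scores.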
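
### MSEEvaluator

```python
class MSEEvaluator(SentenceEvaluator):
    def __init__(
        self,
        source_sentences: list[str],
        target_sentences: list[str],
        teacher_model: SentenceTransformer | None = None,
        show_progress_bar: bool = False,
        batch_size: int = 32,
        name: str = "",
        write_csv: bool = True
    )
```
`{ .api }`

Computes the Mean Squared Error between the teacher model's embeddings of `source_sentences` and the evaluated (student) model's embeddings of `target_sentences`. It is typically used for knowledge distillation, for example when extending a model to new languages.

**Parameters**:
- `source_sentences`: Sentences embedded with the teacher model
- `target_sentences`: Sentences embedded with the model under evaluation
- `teacher_model`: Model that produces the reference embeddings
- `show_progress_bar`: Display progress bar
- `batch_size`: Batch size for encoding
- `name`: Name for evaluation results
- `write_csv`: Save results to CSV

**Returns**: Negative MSE (higher is better)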

### MSEEvaluatorFromDataFrame

```python
class MSEEvaluatorFromDataFrame(SentenceEvaluator):
    def __init__(
        self,
        dataframe: list[dict[str, str]],
        teacher_model: SentenceTransformer,
        combinations: list[tuple[str, str]],
        batch_size: int = 8,
        name: str = "",
        write_csv: bool = True
    )
```
`{ .api }`

MSE evaluator for tabular parallel data. Each row maps a language code to a sentence, and the MSE between teacher and student embeddings is computed for each requested language pair.

**Parameters**:
- `dataframe`: Rows of parallel sentences, each mapping a language code to a sentence, e.g. `{"en": "My sentence", "es": "Mi oración"}`
- `teacher_model`: Model that produces the reference embeddings
- `combinations`: Language pairs to evaluate, given as `(source_language, target_language)` tuples
- Other parameters same as MSEEvaluator

## Classification Evaluation

### BinaryClassificationEvaluator

```python
class BinaryClassificationEvaluator(SentenceEvaluator):
    def __init__(
        self,
        sentences1: list[str],
        sentences2: list[str],
        labels: list[int],
        batch_size: int = 16,
        name: str = "",
        show_progress_bar: bool = None,
        write_csv: bool = True
    )
```
`{ .api }`

Evaluates binary classification performance using cosine similarity as classification score.

**Parameters**:
- `sentences1`: First sentences in pairs
- `sentences2`: Second sentences in pairs
- `labels`: Binary labels (0 or 1)
- `batch_size`: Batch size for encoding
- `name`: Name for evaluation results
- `show_progress_bar`: Display progress bar
- `write_csv`: Save results to CSV

**Returns**: Average Precision (AP) score
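
Average Precision can be illustrated without the library: sort the pairs by similarity, then average the precision measured at each position where a positive pair appears. The following is a minimal sketch under that definition; `average_precision` and the sample values are illustrative, not the evaluator's internals.

```python
def average_precision(scores, labels):
    """Average precision of the ranking induced by `scores` (higher = more likely positive)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for position, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / position  # precision at this positive's rank
    return precision_sum / hits if hits else 0.0


similarities = [0.91, 0.35, 0.62, 0.80]  # illustrative cosine similarities
labels = [1, 0, 1, 1]
print(average_precision(similarities, labels))  # 1.0: every positive outranks the negative
```

A score of 1.0 means that a single similarity threshold separates positives from negatives perfectly.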

### LabelAccuracyEvaluator

```python
class LabelAccuracyEvaluator(SentenceEvaluator):
    def __init__(
        self,
        dataloader: DataLoader,
        name: str = "",
        softmax_model = None,
        write_csv: bool = True
    )
```
`{ .api }`

Evaluates classification accuracy on a labeled dataset. Sentence embeddings are passed through a softmax classification head (for example, one trained with `SoftmaxLoss`), and the predicted label is compared with the gold label.

**Parameters**:
- `dataloader`: DataLoader that yields the labeled evaluation examples
- `name`: Name for evaluation
- `softmax_model`: Classification head that maps sentence embeddings to label logits
- `write_csv`: Save results to CSV

**Returns**: Classification accuracy

## Information Retrieval Evaluation

### InformationRetrievalEvaluator

```python
class InformationRetrievalEvaluator(SentenceEvaluator):
    def __init__(
        self,
        queries: dict[str, str],
        corpus: dict[str, str],
        relevant_docs: dict[str, set[str]],
        corpus_chunk_size: int = 50000,
        mrr_at_k: list[int] = [10],
        ndcg_at_k: list[int] = [10],
        accuracy_at_k: list[int] = [1, 3, 5, 10],
        precision_recall_at_k: list[int] = [1, 3, 5, 10],
        map_at_k: list[int] = [100],
        max_corpus_size: int = None,
        show_progress_bar: bool = None,
        batch_size: int = 32,
        name: str = "",
        write_csv: bool = True
    )
```
`{ .api }`

Comprehensive information retrieval evaluation with multiple metrics.

**Parameters**:
- `queries`: Dictionary mapping query IDs to query texts
- `corpus`: Dictionary mapping document IDs to document texts
- `relevant_docs`: Dictionary mapping query IDs to sets of relevant document IDs
- `corpus_chunk_size`: Size of corpus chunks for processing
- `mrr_at_k`: Ranks for Mean Reciprocal Rank calculation
- `ndcg_at_k`: Ranks for NDCG calculation
- `accuracy_at_k`: Ranks for accuracy calculation
- `precision_recall_at_k`: Ranks for precision/recall calculation
- `map_at_k`: Ranks for Mean Average Precision calculation
- `max_corpus_size`: Maximum corpus size to use
- `show_progress_bar`: Display progress bar
- `batch_size`: Batch size for encoding
- `name`: Name for evaluation
- `write_csv`: Save results to CSV

**Returns**: NDCG@10 score
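
The `*_at_k` metrics are all computed from a ranked list of document IDs per query. As a rough illustration, the sketch below computes MRR@k and NDCG@k for one query with binary relevance. The function names and the sample ranking are assumptions made for this example, not library code.

```python
import math


def mrr_at_k(ranked_ids, relevant_ids, k):
    """Reciprocal rank of the first relevant document within the top k (0.0 if none)."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """NDCG@k with binary relevance: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


ranked = ["d3", "d1", "d4", "d2"]  # hypothetical ranking returned for one query
relevant = {"d1", "d4"}
print(mrr_at_k(ranked, relevant, k=10))   # 0.5: first relevant document is at rank 2
print(ndcg_at_k(ranked, relevant, k=10))  # roughly 0.69
```

Averaging these per-query values over all queries gives the reported dataset-level score.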

### RerankingEvaluator

```python
class RerankingEvaluator(SentenceEvaluator):
    def __init__(
        self,
        samples: list[dict],
        at_k: int = 10,
        name: str = "",
        write_csv: bool = True,
        similarity_fct = cos_sim,
        batch_size: int = 64,
        show_progress_bar: bool = False,
        use_batched_encoding: bool = True,
        mrr_at_k: int | None = None
    )
```
`{ .api }`

Evaluates reranking performance: for each query, the positive and negative documents are ranked by similarity to the query, and MAP, MRR@k, and NDCG@k are computed from that ranking.

**Parameters**:
- `samples`: List of samples, each a dictionary with a `query` string and lists of `positive` and `negative` documents
- `at_k`: Cutoff rank for MRR and NDCG
- `name`: Name for evaluation
- `write_csv`: Save results to CSV
- `similarity_fct`: Function used to score query-document pairs
- `batch_size`: Batch size for encoding
- `show_progress_bar`: Display progress bar
- `use_batched_encoding`: Encode all queries and documents in batches instead of sample by sample
- `mrr_at_k`: Deprecated alias for `at_k`

**Returns**: Mean Average Precision (MAP) score

## Specialized Evaluators

### TripletEvaluator

```python
class TripletEvaluator(SentenceEvaluator):
    def __init__(
        self,
        anchors: list[str],
        positives: list[str],
        negatives: list[str],
        main_distance_function: SimilarityFunction | None = None,
        name: str = "",
        batch_size: int = 16,
        show_progress_bar: bool = None,
        write_csv: bool = True
    )
```
`{ .api }`

Evaluates triplet accuracy: each anchor should be closer to its positive than to its negative.

**Parameters**:
- `anchors`: Anchor sentences
- `positives`: Positive sentences
- `negatives`: Negative sentences
- `main_distance_function`: Distance function to use
- `name`: Name for evaluation
- `batch_size`: Batch size for encoding
- `show_progress_bar`: Display progress bar
- `write_csv`: Save results to CSV

**Returns**: Triplet accuracy (fraction of triplets in which the positive is closer to the anchor than the negative)
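
The accuracy computation itself is short. The sketch below uses cosine distance on toy two-dimensional vectors that stand in for sentence embeddings. All names and values are illustrative assumptions, not the evaluator's code.

```python
import math


def cosine_distance(u, v):
    """1 minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_accuracy(anchors, positives, negatives):
    """Fraction of triplets where the positive is strictly closer to the anchor."""
    correct = sum(
        cosine_distance(a, p) < cosine_distance(a, n)
        for a, p, n in zip(anchors, positives, negatives)
    )
    return correct / len(anchors)


# Toy embeddings: the first triplet is ranked correctly, the second is not
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [1.0, 0.2]]
negatives = [[0.0, 1.0], [0.1, 0.9]]
print(triplet_accuracy(anchors, positives, negatives))  # 0.5
```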

### ParaphraseMiningEvaluator

```python
class ParaphraseMiningEvaluator(SentenceEvaluator):
    def __init__(
        self,
        sentences_map: dict[str, str],
        duplicates_list: set[tuple[str, str]],
        duplicates_dict: dict[str, dict[str, bool]] = None,
        query_chunk_size: int = 5000,
        corpus_chunk_size: int = 100000,
        max_pairs: int = 500000,
        top_k: int = 100,
        name: str = "",
        batch_size: int = 16,
        show_progress_bar: bool = None,
        write_csv: bool = True
    )
```
`{ .api }`

Evaluates paraphrase mining performance by finding duplicate/similar sentences.

**Parameters**:
- `sentences_map`: Dictionary mapping sentence IDs to texts
- `duplicates_list`: Set of sentence ID pairs that are duplicates
- `duplicates_dict`: Alternative format for duplicates
- `query_chunk_size`: Size of query chunks for processing
- `corpus_chunk_size`: Size of corpus chunks
- `max_pairs`: Maximum pairs to evaluate
- `top_k`: Number of top pairs to consider
- `name`: Name for evaluation
- `batch_size`: Batch size for encoding
- `show_progress_bar`: Display progress bar
- `write_csv`: Save results to CSV

**Returns**: Average Precision score

### TranslationEvaluator

```python
class TranslationEvaluator(SentenceEvaluator):
    def __init__(
        self,
        source_sentences: list[str],
        target_sentences: list[str],
        batch_size: int = 16,
        name: str = "",
        show_progress_bar: bool = None,
        write_csv: bool = True
    )
```
`{ .api }`

Evaluates cross-lingual alignment. For each source sentence, it checks whether the translation at the same index is the most similar target sentence, and it repeats the check in the target-to-source direction.

**Parameters**:
- `source_sentences`: Source language sentences
- `target_sentences`: Target language sentences (translations), aligned by index with `source_sentences`
- `batch_size`: Batch size for encoding
- `name`: Name for evaluation
- `show_progress_bar`: Display progress bar
- `write_csv`: Save results to CSV

**Returns**: Mean of the source-to-target and target-to-source accuracies
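
A common way to score aligned translation pairs is retrieval accuracy in both directions: source sentence `i` counts as correct when target sentence `i` is its nearest neighbour, and vice versa. The sketch below assumes a precomputed square similarity matrix; the function name and the matrix values are made up for illustration.

```python
def translation_accuracy(similarity):
    """Bidirectional matching accuracy.

    `similarity[i][j]` is the similarity between source sentence i and
    target sentence j; the correct match for index i is index i.
    """
    n = len(similarity)
    src2trg = sum(
        max(range(n), key=lambda j: similarity[i][j]) == i for i in range(n)
    ) / n
    trg2src = sum(
        max(range(n), key=lambda i: similarity[i][j]) == j for j in range(n)
    ) / n
    return src2trg, trg2src, (src2trg + trg2src) / 2


# Illustrative similarities for three sentence pairs
similarity = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.4],
    [0.7, 0.1, 0.6],  # source 2 is wrongly matched to target 0
]
src2trg, trg2src, mean_accuracy = translation_accuracy(similarity)
print(src2trg, trg2src, mean_accuracy)
```

Here two of three source sentences find their translation, while all three target sentences do, so the two directions can disagree.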

## Advanced Evaluators

### SequentialEvaluator

```python
class SequentialEvaluator(SentenceEvaluator):
    def __init__(
        self,
        evaluators: list[SentenceEvaluator],
        main_score_function = lambda scores: scores[-1]
    )
```
`{ .api }`

Runs multiple evaluators sequentially and combines their results.

**Parameters**:
- `evaluators`: List of evaluators to run
- `main_score_function`: Function that receives the list of per-evaluator main scores and returns the combined score (by default, the score of the last evaluator)

**Returns**: Combined evaluation score
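
Any callable that maps a list of scores to one number can serve as `main_score_function`. The candidates below are a sketch; the function names and the sample scores are hypothetical.

```python
def last_score(scores):
    """Use the final evaluator's score as the main score."""
    return scores[-1]


def mean_score(scores):
    """Unweighted average over all evaluators."""
    return sum(scores) / len(scores)


def weighted_score(weights):
    """Build a combiner that weights evaluators by importance."""
    def combine(scores):
        return sum(w * s for w, s in zip(weights, scores)) / sum(weights)
    return combine


scores = [0.80, 0.90, 0.70]  # illustrative main scores from three evaluators
print(last_score(scores))    # 0.7
print(mean_score(scores))
print(weighted_score([2, 1, 1])(scores))
```

Averaging only makes sense when all evaluators report scores on comparable scales where higher is better.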

### NanoBEIREvaluator

```python
class NanoBEIREvaluator(SentenceEvaluator):
    def __init__(
        self,
        dataset_names: list[str] | None = None,
        mrr_at_k: list[int] = [10],
        ndcg_at_k: list[int] = [10],
        accuracy_at_k: list[int] = [1, 3, 5, 10],
        precision_recall_at_k: list[int] = [1, 3, 5, 10],
        map_at_k: list[int] = [100],
        show_progress_bar: bool = False,
        batch_size: int = 32,
        write_csv: bool = True,
        # ... further optional arguments
    )
```
`{ .api }`

Evaluator for NanoBEIR, a collection of small subsets of the BEIR information retrieval benchmark datasets. It runs an information retrieval evaluation on each selected dataset and aggregates the results.

**Parameters**:
- `dataset_names`: NanoBEIR datasets to evaluate on (defaults to all of them)
- `mrr_at_k`, `ndcg_at_k`, `accuracy_at_k`, `precision_recall_at_k`, `map_at_k`: Ranks for the respective metrics, as in InformationRetrievalEvaluator
- `show_progress_bar`: Display progress bar
- `batch_size`: Batch size for encoding
- `write_csv`: Save results to CSV

**Returns**: NDCG@10 score aggregated (by default, averaged) over the selected NanoBEIR datasets

## Utility Classes

### SimilarityFunction

```python
from sentence_transformers.evaluation import SimilarityFunction

class SimilarityFunction(Enum):
    COSINE = "cosine"
    DOT_PRODUCT = "dot"
    EUCLIDEAN = "euclidean"
    MANHATTAN = "manhattan"
```
`{ .api }`

Enumeration of similarity functions available for evaluation.

## Usage Examples

### Basic Similarity Evaluation

```python
import os

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# Load model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Prepare evaluation data
sentences1 = ["The cat sits on the mat", "I love programming"]
sentences2 = ["A feline rests on a rug", "I enjoy coding"]
scores = [0.9, 0.8]  # Gold similarity scores, one per sentence pair

# Create evaluator
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=sentences1,
    sentences2=sentences2,
    scores=scores,
    name="dev"
)

# Evaluate model (create the output directory before CSV results are written)
os.makedirs("./evaluation_results/", exist_ok=True)
results = evaluator(model, output_path="./evaluation_results/")
# Recent releases return a dict of metrics; primary_metric names the main one
print(f"Spearman correlation: {results[evaluator.primary_metric]:.4f}")
```

### Information Retrieval Evaluation

```python
import os

from sentence_transformers.evaluation import InformationRetrievalEvaluator

# Prepare IR evaluation data
queries = {
    "q1": "What is machine learning?",
    "q2": "How do neural networks work?"
}

corpus = {
    "d1": "Machine learning is a subset of artificial intelligence",
    "d2": "Neural networks are computational models inspired by biology",
    "d3": "Weather forecasting uses statistical models",
    "d4": "Deep learning uses multiple layers of neural networks"
}

relevant_docs = {
    "q1": {"d1", "d4"},  # Relevant documents for q1
    "q2": {"d2", "d4"}   # Relevant documents for q2
}

# Create IR evaluator
ir_evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name="test_retrieval"
)

# Evaluate (reuses `model` from the previous example)
os.makedirs("./ir_results/", exist_ok=True)
ir_results = ir_evaluator(model, output_path="./ir_results/")
# The returned dict holds all metrics; the primary one is NDCG@10
print(f"NDCG@10: {ir_results[ir_evaluator.primary_metric]:.4f}")
```

### Binary Classification Evaluation

```python
import os

from sentence_transformers.evaluation import BinaryClassificationEvaluator

# Prepare binary classification data
sentences1 = [
    "The cat sits on the mat",
    "I love programming",
    "Dogs are great pets",
    "Weather is nice today"
]

sentences2 = [
    "A feline rests on a rug",     # Similar to first
    "Cooking is fun",              # Different from second
    "Cats are wonderful animals",  # Related to third
    "It's sunny outside"           # Similar to fourth
]

labels = [1, 0, 1, 1]  # Binary similarity labels, one per pair

# Create evaluator
binary_evaluator = BinaryClassificationEvaluator(
    sentences1=sentences1,
    sentences2=sentences2,
    labels=labels,
    name="binary_classification"
)

# Evaluate (reuses `model` from the first example)
os.makedirs("./binary_results/", exist_ok=True)
binary_results = binary_evaluator(model, output_path="./binary_results/")
print(f"Average Precision: {binary_results[binary_evaluator.primary_metric]:.4f}")
```

### Triplet Evaluation

```python
import os

from sentence_transformers.evaluation import TripletEvaluator

# Prepare triplet data
anchors = ["The cat sits on the mat", "I love programming"]
positives = ["A feline rests on a rug", "I enjoy coding"]
negatives = ["Dogs are great pets", "Weather is nice"]

# Create triplet evaluator
triplet_evaluator = TripletEvaluator(
    anchors=anchors,
    positives=positives,
    negatives=negatives,
    name="triplet_eval"
)

# Evaluate (reuses `model` from the first example)
os.makedirs("./triplet_results/", exist_ok=True)
triplet_results = triplet_evaluator(model, output_path="./triplet_results/")
print(f"Triplet accuracy: {triplet_results[triplet_evaluator.primary_metric]:.4f}")
```

### Sequential Multi-Task Evaluation

```python
import os

from sentence_transformers.evaluation import SequentialEvaluator

# Create multiple evaluators
# sentences1, sentences2, and labels come from the binary classification example,
# so supply one gold score per pair (illustrative values)
scores = [0.9, 0.1, 0.6, 0.8]
similarity_eval = EmbeddingSimilarityEvaluator(sentences1, sentences2, scores, name="similarity")
binary_eval = BinaryClassificationEvaluator(sentences1, sentences2, labels, name="binary")
triplet_eval = TripletEvaluator(anchors, positives, negatives, name="triplet")

# Combine evaluators
sequential_evaluator = SequentialEvaluator(
    evaluators=[similarity_eval, binary_eval, triplet_eval],
    main_score_function=lambda scores: sum(scores) / len(scores)  # Average score
)

# Run all evaluations; the result dict merges the metrics of every evaluator
os.makedirs("./multi_eval_results/", exist_ok=True)
results = sequential_evaluator(model, output_path="./multi_eval_results/")
print(f"Combined evaluation score: {results[sequential_evaluator.primary_metric]:.4f}")
```

### Training Integration

```python
from sentence_transformers import SentenceTransformerTrainer, SentenceTransformerTrainingArguments
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# Create evaluator for training
# (dev_sentences1, dev_sentences2, and dev_scores are assumed to hold your development set)
dev_evaluator = EmbeddingSimilarityEvaluator(
    sentences1=dev_sentences1,
    sentences2=dev_sentences2,
    scores=dev_scores,
    name="sts-dev"
)

# Training arguments with evaluation
args = SentenceTransformerTrainingArguments(
    output_dir='./training_with_eval',
    eval_strategy="steps",  # named `evaluation_strategy` in older transformers releases
    eval_steps=100,
    logging_steps=100,
    save_steps=100,
    num_train_epochs=3,
    per_device_train_batch_size=16,
    load_best_model_at_end=True,
    # Evaluator metrics are logged as "eval_<evaluator name>_<metric>"
    metric_for_best_model="eval_sts-dev_spearman_cosine",
    greater_is_better=True
)

# Pass the evaluator to the trainer; it then runs at every evaluation step
# (model, train_dataset, and loss are assumed to be defined already)
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
    evaluator=dev_evaluator
)

trainer.train()
```

### Custom Evaluator

```python
import csv
import os

from sentence_transformers.evaluation import SentenceEvaluator


class CustomEvaluator(SentenceEvaluator):
    """Custom evaluator for specific task."""

    def __init__(self, test_data, name="custom"):
        super().__init__()
        self.test_data = test_data
        self.name = name

    def __call__(self, model, output_path=None, epoch=-1, steps=-1):
        # Embed the evaluation texts
        embeddings = model.encode([item['text'] for item in self.test_data])

        # Calculate your custom metric
        custom_score = self.calculate_custom_metric(embeddings)

        # Save results if output_path provided
        if output_path:
            self.save_results(custom_score, output_path, epoch, steps)

        return custom_score

    def calculate_custom_metric(self, embeddings):
        # Implement your metric calculation
        return 0.85  # Placeholder

    def save_results(self, score, output_path, epoch, steps):
        # Append one row per evaluation so earlier results are not overwritten
        os.makedirs(output_path, exist_ok=True)
        csv_file = os.path.join(output_path, f"{self.name}_results.csv")
        write_header = not os.path.isfile(csv_file)
        with open(csv_file, 'a', newline='') as f:
            writer = csv.writer(f)
            if write_header:
                writer.writerow(['epoch', 'steps', 'score'])
            writer.writerow([epoch, steps, score])


# Use custom evaluator (test_data is a list of dicts with a 'text' key)
test_data = [{"text": "The cat sits on the mat"}, {"text": "I love programming"}]
custom_eval = CustomEvaluator(test_data)
score = custom_eval(model, output_path="./custom_results/")
```

### Batch Evaluation on Multiple Datasets

```python
import os

from sentence_transformers.evaluation import (
    EmbeddingSimilarityEvaluator,
    InformationRetrievalEvaluator,
)


def evaluate_on_multiple_datasets(model, datasets_config):
    """Evaluate model on multiple datasets and return the primary score of each."""
    results = {}

    for dataset_name, config in datasets_config.items():
        if config['type'] == 'similarity':
            evaluator = EmbeddingSimilarityEvaluator(
                sentences1=config['sentences1'],
                sentences2=config['sentences2'],
                scores=config['scores'],
                name=dataset_name
            )
        elif config['type'] == 'retrieval':
            evaluator = InformationRetrievalEvaluator(
                queries=config['queries'],
                corpus=config['corpus'],
                relevant_docs=config['relevant_docs'],
                name=dataset_name
            )
        else:
            raise ValueError(f"Unknown dataset type: {config['type']!r}")

        output_path = f"./results/{dataset_name}/"
        os.makedirs(output_path, exist_ok=True)
        metrics = evaluator(model, output_path=output_path)
        # Keep only the primary metric of each evaluator
        score = metrics[evaluator.primary_metric]
        results[dataset_name] = score
        print(f"{dataset_name}: {score:.4f}")

    return results

# Configuration for multiple datasets
# (the sts_* and msmarco_* variables are assumed to be loaded beforehand)
datasets_config = {
    "sts_benchmark": {
        "type": "similarity",
        "sentences1": sts_sentences1,
        "sentences2": sts_sentences2,
        "scores": sts_scores
    },
    "msmarco": {
        "type": "retrieval",
        "queries": msmarco_queries,
        "corpus": msmarco_corpus,
        "relevant_docs": msmarco_qrels
    }
}

# Run evaluations
all_results = evaluate_on_multiple_datasets(model, datasets_config)
```

## Best Practices

1. **Evaluation Data**: Use high-quality, diverse evaluation datasets
2. **Multiple Metrics**: Evaluate on multiple tasks and metrics for comprehensive assessment
3. **Statistical Significance**: Use appropriate sample sizes for reliable results
4. **Cross-Validation**: Consider cross-validation for robust evaluation
5. **Domain Matching**: Ensure evaluation data matches your target domain
6. **Baseline Comparison**: Always compare against relevant baselines
7. **Error Analysis**: Analyze failure cases to understand model limitations
8. **Reproducibility**: Save evaluation configurations and random seeds
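
For the reproducibility point, one lightweight option is to write the seed and the evaluation settings to a JSON file next to the results. The sketch below uses only the standard library. `save_eval_config`, the file name, and the settings shown are hypothetical, and it seeds only Python's `random` module, so NumPy and PyTorch would need their own seeding calls.

```python
import json
import random
import tempfile
from pathlib import Path


def save_eval_config(output_dir, config, seed):
    """Seed Python's RNG and store the seed and settings as JSON."""
    random.seed(seed)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    path = output_dir / "eval_config.json"
    record = {"seed": seed, "config": config}
    path.write_text(json.dumps(record, indent=2, sort_keys=True), encoding="utf-8")
    return path


# Hypothetical settings for one evaluation run
config = {
    "model": "all-MiniLM-L6-v2",
    "evaluator": "EmbeddingSimilarityEvaluator",
    "batch_size": 16,
}
with tempfile.TemporaryDirectory() as tmp:
    saved = save_eval_config(tmp, config, seed=42)
    print(json.loads(saved.read_text(encoding="utf-8"))["seed"])  # 42
```

In practice the output directory would be the same one passed to the evaluator as `output_path`.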