Tessl Tile for pypi/sentence-transformers@5.1.0

# Evaluation

The sentence-transformers package provides an evaluation framework for measuring model performance on tasks including semantic similarity, information retrieval, classification, and clustering.

## Import Statement

```python
from sentence_transformers.evaluation import (
    EmbeddingSimilarityEvaluator,
    InformationRetrievalEvaluator,
    BinaryClassificationEvaluator,
    # ... other evaluators
)
```

## Base Evaluator

### SentenceEvaluator

```python
class SentenceEvaluator:
    def __call__(
        self,
        model: SentenceTransformer,
        output_path: str | None = None,
        epoch: int = -1,
        steps: int = -1
    ) -> float | dict[str, float]
```
`{ .api }`

Base class for all sentence transformer evaluators. Subclasses implement `__call__`.

**Parameters**:
- `model`: SentenceTransformer model to evaluate
- `output_path`: Directory in which to save evaluation results
- `epoch`: Current training epoch (for logging)
- `steps`: Current training steps (for logging)

**Returns**: Either a single score (higher is better) or a dictionary mapping metric names to scores. When a dictionary is returned, the evaluator's `primary_metric` attribute names the key that holds the main score.

## Similarity Evaluation

### EmbeddingSimilarityEvaluator

```python
class EmbeddingSimilarityEvaluator(SentenceEvaluator):
    def __init__(
        self,
        sentences1: list[str],
        sentences2: list[str],
        scores: list[float],
        batch_size: int = 16,
        main_similarity: SimilarityFunction | None = None,
        name: str = "",
        show_progress_bar: bool = None,
        write_csv: bool = True
    )
```
`{ .api }`

Evaluates model performance on semantic textual similarity tasks by computing the correlation between predicted similarities and gold similarity scores.

**Parameters**:
- `sentences1`: First sentences in pairs
- `sentences2`: Second sentences in pairs
- `scores`: Gold similarity scores, one per pair (typically -1 to 1 or 0 to 1)
- `batch_size`: Batch size for encoding
- `main_similarity`: Similarity function to use (defaults to model's function)
- `name`: Name for evaluation results; it is used as a prefix in metric names
- `show_progress_bar`: Display progress during evaluation
- `write_csv`: Save detailed results to CSV file

**Returns**: Dictionary with Pearson and Spearman correlation coefficients (for example `{name}_spearman_cosine`); the Spearman correlation is the primary metric
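
The Spearman correlation compares the ranking of the predicted similarities with the ranking of the gold scores. The following is a minimal, dependency-free sketch of that idea, not the library's implementation; the toy embeddings and the helper names (`cosine`, `ranks`, `pearson`, `spearman`) are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical 2-d "embeddings" standing in for model.encode() output
emb1 = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
emb2 = [[1.0, 0.1], [1.0, 0.0], [1.0, 0.0]]
gold = [0.9, 0.5, 0.1]

predicted = [cosine(a, b) for a, b in zip(emb1, emb2)]
print(f"Spearman: {spearman(predicted, gold):.4f}")
```

Because only ranks matter, gold scores on a 0 to 5 scale yield the same Spearman value as the same scores rescaled to 0 to 1.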

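### MSEEvaluator

```python
class MSEEvaluator(SentenceEvaluator):
    def __init__(
        self,
        source_sentences: list[str],
        target_sentences: list[str],
        teacher_model=None,
        show_progress_bar: bool = False,
        batch_size: int = 32,
        name: str = "",
        write_csv: bool = True
    )
```
`{ .api }`

Computes the Mean Squared Error between the teacher model's embeddings of `source_sentences` and the evaluated (student) model's embeddings of `target_sentences`. This is typically used for knowledge distillation, for example when extending a model to new languages.

**Parameters**:
- `source_sentences`: Sentences embedded with the teacher model
- `target_sentences`: Sentences embedded with the model being evaluated
- `teacher_model`: Model that produces the reference embeddings
- `show_progress_bar`: Display progress bar
- `batch_size`: Batch size for encoding
- `name`: Name for evaluation results
- `write_csv`: Save results to CSV

**Returns**: Negative MSE (higher is better)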
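
Negating the error turns MSE into a "higher is better" score, which is what model selection during training expects. A minimal sketch of the score itself, using plain Python lists as stand-ins for the two sets of vectors being compared; the function name and the toy values are assumptions, and any scaling the library applies is left out.

```python
def negative_mse(reference, predicted):
    """Negative mean squared error over all vector components."""
    total, count = 0.0, 0
    for ref_vec, pred_vec in zip(reference, predicted):
        for r, p in zip(ref_vec, pred_vec):
            total += (r - p) ** 2
            count += 1
    return -total / count

# Hypothetical reference vectors and the vectors being scored against them
reference = [[1.0, 0.0], [0.0, 1.0]]
predicted = [[0.5, 0.0], [0.0, 1.0]]

print(negative_mse(reference, predicted))
```

A perfect match scores 0, and every deviation pushes the score further below zero.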

### MSEEvaluatorFromDataFrame

```python
class MSEEvaluatorFromDataFrame(SentenceEvaluator):
    def __init__(
        self,
        dataframe: list[dict[str, str]],
        teacher_model: SentenceTransformer,
        combinations: list[tuple[str, str]],
        batch_size: int = 8,
        name: str = "",
        write_csv: bool = True
    )
```
`{ .api }`

MSE evaluator for multilingual data: for each language combination, it computes the MSE between the teacher model's embeddings of the source-language sentences and the evaluated model's embeddings of the target-language sentences.

**Parameters**:
- `dataframe`: Rows of parallel sentences, each a dictionary mapping a language key to a sentence
- `teacher_model`: Model that produces the reference embeddings
- `combinations`: Pairs of `(source_language, target_language)` keys to evaluate
- Other parameters same as MSEEvaluator

## Classification Evaluation

### BinaryClassificationEvaluator

```python
class BinaryClassificationEvaluator(SentenceEvaluator):
    def __init__(
        self,
        sentences1: list[str],
        sentences2: list[str],
        labels: list[int],
        batch_size: int = 16,
        name: str = "",
        show_progress_bar: bool = None,
        write_csv: bool = True
    )
```
`{ .api }`

Evaluates binary classification of sentence pairs, using the similarity between the two embeddings as the classification score.

**Parameters**:
- `sentences1`: First sentences in pairs
- `sentences2`: Second sentences in pairs
- `labels`: Binary labels, one per pair (1 for similar, 0 for dissimilar)
- `batch_size`: Batch size for encoding
- `name`: Name for evaluation results
- `show_progress_bar`: Display progress bar
- `write_csv`: Save results to CSV

**Returns**: Dictionary of classification metrics; Average Precision (AP) is the primary metric
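
Average Precision summarizes how well the similarity scores rank positive pairs above negative ones, without committing to a threshold. A minimal sketch under simplifying assumptions: ties are broken by input order, and the function name and example values are illustrative rather than taken from the library.

```python
def average_precision(scores, labels):
    """AP for binary labels, ranking pairs by descending score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precision_sum = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            # Precision among the top `rank` pairs, taken at each positive
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

# Hypothetical similarity scores for three pairs and their gold labels
scores = [0.9, 0.8, 0.7]
labels = [0, 1, 1]

print(f"AP: {average_precision(scores, labels):.4f}")
```

Here the top-ranked pair is a negative, so the two positives are found at ranks 2 and 3 and the score drops below 1.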

### LabelAccuracyEvaluator

```python
class LabelAccuracyEvaluator(SentenceEvaluator):
    def __init__(
        self,
        dataloader: DataLoader,
        name: str = "",
        softmax_model=None,
        write_csv: bool = True
    )
```
`{ .api }`

Evaluates classification accuracy on a labeled dataset. The predictions come from a softmax classification head (`softmax_model`, for example a trained `SoftmaxLoss`) applied on top of the sentence embeddings.

**Parameters**:
- `dataloader`: DataLoader that yields the labeled evaluation examples
- `name`: Name for evaluation
- `softmax_model`: Classification head used to predict labels
- `write_csv`: Save results to CSV

**Returns**: Classification accuracy

## Information Retrieval Evaluation

### InformationRetrievalEvaluator

```python
class InformationRetrievalEvaluator(SentenceEvaluator):
    def __init__(
        self,
        queries: dict[str, str],
        corpus: dict[str, str],
        relevant_docs: dict[str, set[str]],
        corpus_chunk_size: int = 50000,
        mrr_at_k: list[int] = [10],
        ndcg_at_k: list[int] = [10],
        accuracy_at_k: list[int] = [1, 3, 5, 10],
        precision_recall_at_k: list[int] = [1, 3, 5, 10],
        map_at_k: list[int] = [100],
        max_corpus_size: int = None,
        show_progress_bar: bool = None,
        batch_size: int = 32,
        name: str = "",
        write_csv: bool = True
    )
```
`{ .api }`

Embeds the queries and the corpus, retrieves the top-ranked documents for each query, and reports accuracy@k, precision@k, recall@k, MRR@k, NDCG@k, and MAP@k.

**Parameters**:
- `queries`: Dictionary mapping query IDs to query texts
- `corpus`: Dictionary mapping document IDs to document texts
- `relevant_docs`: Dictionary mapping query IDs to sets of relevant document IDs
- `corpus_chunk_size`: Number of corpus documents encoded and scored per chunk
- `mrr_at_k`: Cutoffs k for Mean Reciprocal Rank
- `ndcg_at_k`: Cutoffs k for NDCG
- `accuracy_at_k`: Cutoffs k for accuracy
- `precision_recall_at_k`: Cutoffs k for precision and recall
- `map_at_k`: Cutoffs k for Mean Average Precision
- `max_corpus_size`: Maximum corpus size to use
- `show_progress_bar`: Display progress bar
- `batch_size`: Batch size for encoding
- `name`: Name for evaluation
- `write_csv`: Save results to CSV

**Returns**: Dictionary of retrieval metrics; with the default settings, NDCG@10 is the primary metric
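
To make two of these metrics concrete, the sketch below computes MRR@k and NDCG@k for a single query. It assumes binary relevance and an already-ranked list of document IDs; the function names and the ranking are hypothetical and only illustrate the definitions, not the library's code.

```python
import math

def mrr_at_k(ranked_ids, relevant, k):
    """Reciprocal rank of the first relevant document within the top k."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant, k):
    """NDCG@k with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

# Hypothetical ranking for one query, best match first
ranked = ["d3", "d1", "d4", "d2"]
relevant = {"d1", "d4"}

print(f"MRR@10:  {mrr_at_k(ranked, relevant, 10):.4f}")
print(f"NDCG@10: {ndcg_at_k(ranked, relevant, 10):.4f}")
```

The evaluator averages such per-query values over all queries. MRR only looks at the first relevant hit, while NDCG rewards every relevant document, discounted by its rank.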

### RerankingEvaluator

```python
class RerankingEvaluator(SentenceEvaluator):
    def __init__(
        self,
        samples: list[dict],
        mrr_at_k: list[int] = [10],
        ndcg_at_k: list[int] = [10],
        accuracy_at_k: list[int] = [1, 3, 5, 10],
        precision_recall_at_k: list[int] = [1, 3, 5, 10],
        map_at_k: list[int] = [100],
        name: str = "",
        write_csv: bool = True,
        batch_size: int = 512,
        show_progress_bar: bool = None
    )
```
`{ .api }`

Evaluates reranking performance: for each query, the candidate documents are ranked by similarity to the query, and relevant documents should appear at the top.

**Parameters**:
- `samples`: List of dictionaries with the keys `query`, `positive`, and `negative`, where `positive` and `negative` are lists of documents
- Other parameters similar to InformationRetrievalEvaluator

**Returns**: MRR@10 score

## Specialized Evaluators

### TripletEvaluator

```python
class TripletEvaluator(SentenceEvaluator):
    def __init__(
        self,
        anchors: list[str],
        positives: list[str],
        negatives: list[str],
        main_distance_function: SimilarityFunction | None = None,
        name: str = "",
        batch_size: int = 16,
        show_progress_bar: bool = None,
        write_csv: bool = True
    )
```
`{ .api }`

Evaluates triplet accuracy: each anchor should be closer to its positive than to its negative.

**Parameters**:
- `anchors`: Anchor sentences
- `positives`: Positive sentences, one per anchor
- `negatives`: Negative sentences, one per anchor
- `main_distance_function`: Distance function to use
- `name`: Name for evaluation
- `batch_size`: Batch size for encoding
- `show_progress_bar`: Display progress bar
- `write_csv`: Save results to CSV

**Returns**: Triplet accuracy (fraction of triplets in which the anchor is closer to the positive than to the negative)
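
Triplet accuracy reduces to counting how often the anchor is more similar to the positive than to the negative. A minimal sketch using cosine similarity; the toy 2-d vectors stand in for real embeddings, and the helper names are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def triplet_accuracy(anchors, positives, negatives):
    """Fraction of triplets where sim(anchor, positive) > sim(anchor, negative)."""
    correct = sum(
        1
        for a, p, n in zip(anchors, positives, negatives)
        if cosine(a, p) > cosine(a, n)
    )
    return correct / len(anchors)

# Hypothetical embeddings: the first triplet is ordered correctly, the second is not
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [1.0, 0.2]]
negatives = [[0.0, 1.0], [0.1, 1.0]]

print(f"Triplet accuracy: {triplet_accuracy(anchors, positives, negatives):.2f}")
```

The same comparison works with a distance function by flipping the inequality, since a smaller distance means a closer pair.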

### ParaphraseMiningEvaluator

```python
class ParaphraseMiningEvaluator(SentenceEvaluator):
    def __init__(
        self,
        sentences_map: dict[str, str],
        duplicates_list: set[tuple[str, str]],
        duplicates_dict: dict[str, dict[str, bool]] = None,
        query_chunk_size: int = 5000,
        corpus_chunk_size: int = 100000,
        max_pairs: int = 500000,
        top_k: int = 100,
        name: str = "",
        batch_size: int = 16,
        show_progress_bar: bool = None,
        write_csv: bool = True
    )
```
`{ .api }`

Evaluates paraphrase mining: the model mines the most similar sentence pairs from a collection, and the mined pairs are compared against the known duplicates.

**Parameters**:
- `sentences_map`: Dictionary mapping sentence IDs to texts
- `duplicates_list`: Sentence ID pairs that are duplicates
- `duplicates_dict`: Alternative format for duplicates, a nested dictionary in which `duplicates_dict[id1][id2]` is `True` when the two sentences are duplicates
- `query_chunk_size`: Number of sentences compared per query chunk
- `corpus_chunk_size`: Number of sentences compared against per corpus chunk
- `max_pairs`: Maximum number of mined pairs to keep
- `top_k`: Number of most similar candidates retrieved per sentence
- `name`: Name for evaluation
- `batch_size`: Batch size for encoding
- `show_progress_bar`: Display progress bar
- `write_csv`: Save results to CSV

**Returns**: Average Precision score

### TranslationEvaluator

```python
class TranslationEvaluator(SentenceEvaluator):
    def __init__(
        self,
        source_sentences: list[str],
        target_sentences: list[str],
        batch_size: int = 16,
        name: str = "",
        show_progress_bar: bool = None,
        write_csv: bool = True
    )
```
`{ .api }`

Evaluates cross-lingual performance on parallel sentences: `source_sentences[i]` should be most similar to `target_sentences[i]` among all target sentences, and vice versa.

**Parameters**:
- `source_sentences`: Source language sentences
- `target_sentences`: Target language sentences (translations), in the same order as the sources
- `batch_size`: Batch size for encoding
- `name`: Name for evaluation
- `show_progress_bar`: Display progress bar
- `write_csv`: Save results to CSV

**Returns**: Mean of the source-to-target and target-to-source matching accuracies
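
A common way to score parallel sentences is index-matching accuracy: each source sentence should retrieve its own translation as the nearest target, and the reverse direction is checked as well. The sketch below assumes embeddings are already computed; the vectors and function names are illustrative, not the library's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def matching_accuracy(query_embs, candidate_embs):
    """Fraction of queries whose most similar candidate has the same index."""
    correct = 0
    for i, query in enumerate(query_embs):
        sims = [cosine(query, cand) for cand in candidate_embs]
        best = max(range(len(sims)), key=sims.__getitem__)
        if best == i:
            correct += 1
    return correct / len(query_embs)

def translation_score(source_embs, target_embs):
    """Mean of source-to-target and target-to-source matching accuracy."""
    src2trg = matching_accuracy(source_embs, target_embs)
    trg2src = matching_accuracy(target_embs, source_embs)
    return (src2trg + trg2src) / 2

# Hypothetical embeddings of three parallel sentence pairs
source = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
target = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.2]]

print(f"src2trg: {matching_accuracy(source, target):.4f}")
print(f"trg2src: {matching_accuracy(target, source):.4f}")
print(f"mean:    {translation_score(source, target):.4f}")
```

The two directions can disagree, as they do for the third pair here, which is why both are measured and averaged.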

## Advanced Evaluators

### SequentialEvaluator

```python
class SequentialEvaluator(SentenceEvaluator):
    def __init__(
        self,
        evaluators: list[SentenceEvaluator],
        main_score_function=lambda scores: scores[-1]
    )
```
`{ .api }`

Runs multiple evaluators sequentially and combines their results.

**Parameters**:
- `evaluators`: List of evaluators to run
- `main_score_function`: Function that maps the list of per-evaluator main scores to a single score; by default, the score of the last evaluator is used

**Returns**: Dictionary containing the metrics of every evaluator plus a `sequential_score` entry holding the combined score

### NanoBEIREvaluator

```python
class NanoBEIREvaluator(SentenceEvaluator):
    def __init__(
        self,
        dataset_names: list[str] | None = None,
        mrr_at_k: list[int] = [10],
        ndcg_at_k: list[int] = [10],
        accuracy_at_k: list[int] = [1, 3, 5, 10],
        precision_recall_at_k: list[int] = [1, 3, 5, 10],
        map_at_k: list[int] = [100],
        show_progress_bar: bool = False,
        batch_size: int = 32,
        write_csv: bool = True
        # further options omitted
    )
```
`{ .api }`

Evaluator for NanoBEIR, a collection of small subsets of the BEIR information retrieval benchmark datasets. It runs an information retrieval evaluation on each selected dataset and aggregates the results.

**Parameters**:
- `dataset_names`: NanoBEIR datasets to evaluate on (for example `["QuoraRetrieval", "MSMARCO"]`); all datasets are used if omitted
- `mrr_at_k`, `ndcg_at_k`, `accuracy_at_k`, `precision_recall_at_k`, `map_at_k`: Metric cutoffs, as in InformationRetrievalEvaluator
- `show_progress_bar`: Display progress bar
- `batch_size`: Batch size for encoding
- `write_csv`: Save results to CSV

**Returns**: Dictionary of per-dataset and aggregated metrics; with the default settings, the mean NDCG@10 across the selected datasets is the primary metric

## Utility Classes

### SimilarityFunction

```python
from sentence_transformers.evaluation import SimilarityFunction

class SimilarityFunction(Enum):
    COSINE = "cosine"
    DOT_PRODUCT = "dot"
    EUCLIDEAN = "euclidean"
    MANHATTAN = "manhattan"
```
`{ .api }`

Enumeration of similarity functions available for evaluation.

## Usage Examples

### Basic Similarity Evaluation

```python
import os

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# Load model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Prepare evaluation data
sentences1 = ["The cat sits on the mat", "I love programming"]
sentences2 = ["A feline rests on a rug", "I enjoy coding"]
scores = [0.9, 0.8]  # Gold similarity scores, one per pair

# Create evaluator
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=sentences1,
    sentences2=sentences2,
    scores=scores,
    name="dev"
)

# Evaluate model; the evaluator returns a dictionary of metrics
os.makedirs("./evaluation_results/", exist_ok=True)
results = evaluator(model, output_path="./evaluation_results/")
print(f"Spearman correlation: {results[evaluator.primary_metric]:.4f}")
```

### Information Retrieval Evaluation

```python
import os

from sentence_transformers.evaluation import InformationRetrievalEvaluator

# Prepare IR evaluation data
queries = {
    "q1": "What is machine learning?",
    "q2": "How do neural networks work?"
}

corpus = {
    "d1": "Machine learning is a subset of artificial intelligence",
    "d2": "Neural networks are computational models inspired by biology",
    "d3": "Weather forecasting uses statistical models",
    "d4": "Deep learning uses multiple layers of neural networks"
}

relevant_docs = {
    "q1": {"d1", "d4"},  # Relevant documents for q1
    "q2": {"d2", "d4"}   # Relevant documents for q2
}

# Create IR evaluator
ir_evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name="test_retrieval"
)

# Evaluate (model is the SentenceTransformer loaded in the first example)
os.makedirs("./ir_results/", exist_ok=True)
ir_results = ir_evaluator(model, output_path="./ir_results/")
print(f"NDCG@10: {ir_results[ir_evaluator.primary_metric]:.4f}")
```

### Binary Classification Evaluation

```python
import os

from sentence_transformers.evaluation import BinaryClassificationEvaluator

# Prepare binary classification data
sentences1 = [
    "The cat sits on the mat",
    "I love programming",
    "Dogs are great pets",
    "Weather is nice today"
]

sentences2 = [
    "A feline rests on a rug",     # Similar to first
    "Cooking is fun",              # Different from second
    "Cats are wonderful animals",  # Related to third
    "It's sunny outside"           # Similar to fourth
]

labels = [1, 0, 1, 1]  # Binary similarity labels

# Create evaluator
binary_evaluator = BinaryClassificationEvaluator(
    sentences1=sentences1,
    sentences2=sentences2,
    labels=labels,
    name="binary_classification"
)

# Evaluate (model is the SentenceTransformer loaded in the first example)
os.makedirs("./binary_results/", exist_ok=True)
binary_results = binary_evaluator(model, output_path="./binary_results/")
print(f"Average Precision: {binary_results[binary_evaluator.primary_metric]:.4f}")
```

### Triplet Evaluation

```python
import os

from sentence_transformers.evaluation import TripletEvaluator

# Prepare triplet data
anchors = ["The cat sits on the mat", "I love programming"]
positives = ["A feline rests on a rug", "I enjoy coding"]
negatives = ["Dogs are great pets", "Weather is nice"]

# Create triplet evaluator
triplet_evaluator = TripletEvaluator(
    anchors=anchors,
    positives=positives,
    negatives=negatives,
    name="triplet_eval"
)

# Evaluate (model is the SentenceTransformer loaded in the first example)
os.makedirs("./triplet_results/", exist_ok=True)
triplet_results = triplet_evaluator(model, output_path="./triplet_results/")
print(f"Triplet accuracy: {triplet_results[triplet_evaluator.primary_metric]:.4f}")
```

### Sequential Multi-Task Evaluation

```python
import os

from sentence_transformers.evaluation import SequentialEvaluator

# Illustrative gold scores for the four sentence pairs of the binary example;
# the evaluator requires one score per pair
similarity_scores = [0.9, 0.1, 0.4, 0.8]

# Create multiple evaluators
similarity_eval = EmbeddingSimilarityEvaluator(sentences1, sentences2, similarity_scores, name="similarity")
binary_eval = BinaryClassificationEvaluator(sentences1, sentences2, labels, name="binary")
triplet_eval = TripletEvaluator(anchors, positives, negatives, name="triplet")

# Combine evaluators
sequential_evaluator = SequentialEvaluator(
    evaluators=[similarity_eval, binary_eval, triplet_eval],
    main_score_function=lambda scores: sum(scores) / len(scores)  # Average score
)

# Run all evaluations; the combined score is stored under "sequential_score"
os.makedirs("./multi_eval_results/", exist_ok=True)
all_metrics = sequential_evaluator(model, output_path="./multi_eval_results/")
print(f"Combined evaluation score: {all_metrics['sequential_score']:.4f}")
```

### Training Integration

```python
from sentence_transformers import SentenceTransformerTrainer, SentenceTransformerTrainingArguments

# Create evaluator for training
# (dev_sentences1, dev_sentences2, and dev_scores are assumed to be defined)
dev_evaluator = EmbeddingSimilarityEvaluator(
    sentences1=dev_sentences1,
    sentences2=dev_sentences2,
    scores=dev_scores,
    name="sts-dev"
)

# Training arguments with evaluation
args = SentenceTransformerTrainingArguments(
    output_dir='./training_with_eval',
    eval_strategy="steps",
    eval_steps=100,
    logging_steps=100,
    save_steps=100,
    num_train_epochs=3,
    per_device_train_batch_size=16,
    load_best_model_at_end=True,
    # "eval_" prefix + evaluator name + metric name
    metric_for_best_model="eval_sts-dev_spearman_cosine",
    greater_is_better=True
)

# Pass the evaluator to the trainer; it runs at every evaluation step
# (model, train_dataset, and loss are assumed to be defined)
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
    evaluator=dev_evaluator
)

trainer.train()
```

### Custom Evaluator

```python
import csv
import os

from sentence_transformers.evaluation import SentenceEvaluator

class CustomEvaluator(SentenceEvaluator):
    """Custom evaluator for specific task."""

    def __init__(self, test_data, name="custom"):
        super().__init__()
        self.test_data = test_data
        self.name = name

    def __call__(self, model, output_path=None, epoch=-1, steps=-1):
        # Implement custom evaluation logic
        embeddings = model.encode([item['text'] for item in self.test_data])

        # Calculate your custom metric
        custom_score = self.calculate_custom_metric(embeddings)

        # Save results if output_path provided
        if output_path:
            self.save_results(custom_score, output_path, epoch, steps)

        return custom_score

    def calculate_custom_metric(self, embeddings):
        # Implement your metric calculation
        return 0.85  # Placeholder

    def save_results(self, score, output_path, epoch, steps):
        # Save evaluation results, creating the directory if needed
        os.makedirs(output_path, exist_ok=True)
        csv_file = os.path.join(output_path, f"{self.name}_results.csv")
        with open(csv_file, 'w', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(['epoch', 'steps', 'score'])
            writer.writerow([epoch, steps, score])

# Use custom evaluator (test_data is assumed to be a list of {"text": ...} dictionaries)
custom_eval = CustomEvaluator(test_data)
score = custom_eval(model, output_path="./custom_results/")
```

### Batch Evaluation on Multiple Datasets

```python
import os

def evaluate_on_multiple_datasets(model, datasets_config):
    """Evaluate model on multiple datasets."""
    results = {}

    for dataset_name, config in datasets_config.items():
        if config['type'] == 'similarity':
            evaluator = EmbeddingSimilarityEvaluator(
                sentences1=config['sentences1'],
                sentences2=config['sentences2'],
                scores=config['scores'],
                name=dataset_name
            )
        elif config['type'] == 'retrieval':
            evaluator = InformationRetrievalEvaluator(
                queries=config['queries'],
                corpus=config['corpus'],
                relevant_docs=config['relevant_docs'],
                name=dataset_name
            )
        else:
            raise ValueError(f"Unknown evaluator type: {config['type']}")

        output_dir = f"./results/{dataset_name}/"
        os.makedirs(output_dir, exist_ok=True)

        # Evaluators return a dictionary; keep only the primary metric
        metrics = evaluator(model, output_path=output_dir)
        score = metrics[evaluator.primary_metric]
        results[dataset_name] = score
        print(f"{dataset_name}: {score:.4f}")

    return results

# Configuration for multiple datasets
# (the sts_* and msmarco_* variables are assumed to be loaded beforehand)
datasets_config = {
    "sts_benchmark": {
        "type": "similarity",
        "sentences1": sts_sentences1,
        "sentences2": sts_sentences2,
        "scores": sts_scores
    },
    "msmarco": {
        "type": "retrieval",
        "queries": msmarco_queries,
        "corpus": msmarco_corpus,
        "relevant_docs": msmarco_qrels
    }
}

# Run evaluations
all_results = evaluate_on_multiple_datasets(model, datasets_config)
```

## Best Practices

1. **Evaluation Data**: Use high-quality, diverse evaluation datasets
2. **Multiple Metrics**: Evaluate on multiple tasks and metrics for comprehensive assessment
3. **Statistical Significance**: Use appropriate sample sizes for reliable results
4. **Cross-Validation**: Consider cross-validation for robust evaluation
5. **Domain Matching**: Ensure evaluation data matches your target domain
6. **Baseline Comparison**: Always compare against relevant baselines
7. **Error Analysis**: Analyze failure cases to understand model limitations
8. **Reproducibility**: Save evaluation configurations and random seeds
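
For the points on statistical significance and reproducibility, one option is to report a bootstrap confidence interval alongside the metric and to fix the random seed. The sketch below is a generic percentile bootstrap over per-example scores; the function name, its parameters, and the example data are assumptions and not part of sentence-transformers.

```python
import random

def bootstrap_ci(per_example_scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean score."""
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(per_example_scores)
    means = []
    for _ in range(n_resamples):
        # Resample the evaluation set with replacement
        sample = [per_example_scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Hypothetical per-example outcomes (1 = correct, 0 = incorrect)
correct = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]

low, high = bootstrap_ci(correct)
print(f"accuracy = {sum(correct) / len(correct):.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

With only ten examples the interval is wide, which is the practical argument for larger evaluation sets when comparing models whose scores are close.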
