Tessl Tile for pypi/python-terrier@0.13.0

or run

npx @tessl/cli init

Version

Tile

Overview

Evals

Files

docs

datasets.md evaluation.md index.md indexing.md java.md retrieval.md text-processing.md transformers.md utilities.md

evaluation.mddocs/

0
# Evaluation Framework
1

2
PyTerrier's evaluation framework provides comprehensive tools for conducting information retrieval experiments, including statistical significance testing, parameter tuning, and cross-validation. The framework is designed for rigorous experimental evaluation of retrieval systems.
3

4
## Capabilities
5

6
### Experiment Class
7

8
Core experimental framework for evaluating multiple retrieval systems with statistical significance testing.
9

10
```python { .api }
11
class Experiment:
12
    """
13
    Comprehensive evaluation framework for comparing retrieval systems.
14
    
15
    Parameters:
16
    - retr_systems: List of transformer systems to evaluate
17
    - topics: DataFrame with queries ('qid', 'query' columns)
18
    - qrels: DataFrame with relevance judgments ('qid', 'docno', 'label' columns)
19
    - eval_metrics: List of evaluation metrics to compute
20
    - names: Optional list of system names for results
21
    - perquery: Whether to compute per-query results (default: False)
22
    - dataframe: Whether to return results as DataFrame (default: True)
23
    - batch_size: Batch size for processing queries (default: None)
24
    - filter_by_qrels: Filter results to only qrels topics (default: True)
25
    - filter_by_topics: Filter qrels to only topic qids (default: True)
26
    - baseline: Index of baseline system for significance testing
27
    - test: Statistical test to use ('ttest', 'wilcoxon', 'fisher')
28
    - correction: Multiple testing correction ('bonferroni', 'holm', 'fdr')
29
    - highlight: Highlighting mode for significant results ('bold', 'color')
30
    - round: Number of decimal places for results (default: 4)
31
    - verbose: Enable verbose output (default: False)
32
    - save_dir: Directory to save detailed results
33
    """
34
    def __init__(self, retr_systems: List['Transformer'], topics: pd.DataFrame,
35
                 qrels: pd.DataFrame, eval_metrics: List[str], 
36
                 names: List[str] = None, perquery: bool = False, 
37
                 dataframe: bool = True, batch_size: int = None,
38
                 filter_by_qrels: bool = True, filter_by_topics: bool = True,
39
                 baseline: int = None, test: str = None, correction: str = None,
40
                 highlight: str = None, round: int = 4, verbose: bool = False,
41
                 save_dir: str = None, **kwargs) -> pd.DataFrame: ...
42
```
43

44
**Usage Examples:**
45

46
```python
47
# Basic experiment
48
systems = [bm25_retriever, pl2_retriever, dfr_retriever]
49
topics = dataset.get_topics()
50
qrels = dataset.get_qrels()
51

52
results = pt.Experiment(
53
    systems, 
54
    topics, 
55
    qrels, 
56
    ['map', 'ndcg', 'P_10'],
57
    names=['BM25', 'PL2', 'DFR']
58
)
59
print(results)
60

61
# Experiment with significance testing
62
results = pt.Experiment(
63
    systems,
64
    topics, 
65
    qrels,
66
    ['map', 'ndcg'],
67
    names=['BM25', 'PL2', 'DFR'],
68
    baseline=0,              # Use first system as baseline
69
    test='ttest',           # Paired t-test
70
    correction='bonferroni'  # Multiple testing correction
71
)
72

73
# Per-query results for detailed analysis
74
perquery_results = pt.Experiment(
75
    systems,
76
    topics,
77
    qrels, 
78
    ['map', 'ndcg'],
79
    perquery=True,          # Return per-query scores
80
    save_dir='/path/to/results'  # Save detailed results
81
)
82
```
83

84
### Parameter Tuning Classes
85

86
Advanced parameter optimization classes for systematic hyperparameter tuning.
87

88
```python { .api }
89
class GridSearch:
90
    """
91
    Grid search optimization for finding best parameter combinations.
92
    
93
    Parameters:
94
    - pipeline: Transformer pipeline to optimize
95
    - params: Dictionary mapping parameter names to value lists
96
    - topics: Topics DataFrame for evaluation
97
    - qrels: Qrels DataFrame for evaluation  
98
    - metric: Single metric to optimize (e.g., 'map', 'ndcg')
99
    - verbose: Enable verbose output
100
    - jobs: Number of parallel jobs for evaluation
101
    """
102
    def __init__(self, pipeline: 'Transformer', params: Dict[str, List[Any]],
103
                 topics: pd.DataFrame, qrels: pd.DataFrame, metric: str,
104
                 verbose: bool = False, jobs: int = 1, **kwargs) -> 'Transformer': ...
105

106
class GridScan:
107
    """
108
    Parameter grid scanning for exploring parameter space.
109
    
110
    Parameters:
111
    - pipeline: Transformer pipeline to scan
112
    - params: Dictionary mapping parameter names to value lists  
113
    - topics: Topics DataFrame for evaluation
114
    - qrels: Qrels DataFrame for evaluation
115
    - metrics: List of metrics to compute for each parameter combination
116
    - verbose: Enable verbose output
117
    - jobs: Number of parallel jobs for evaluation
118
    """
119
    def __init__(self, pipeline: 'Transformer', params: Dict[str, List[Any]],
120
                 topics: pd.DataFrame, qrels: pd.DataFrame, metrics: List[str],
121
                 verbose: bool = False, jobs: int = 1, **kwargs) -> pd.DataFrame: ...
122

123
class KFoldGridSearch:
124
    """
125
    K-fold cross-validation grid search for robust parameter optimization.
126
    
127
    Parameters:
128
    - pipeline: Transformer pipeline to optimize
129
    - params: Dictionary mapping parameter names to value lists
130
    - topics_list: List of topic DataFrames for each fold
131
    - qrels: Qrels DataFrame for evaluation
132
    - metric: Single metric to optimize
133
    - verbose: Enable verbose output
134
    - jobs: Number of parallel jobs for evaluation
135
    """
136
    def __init__(self, pipeline: 'Transformer', params: Dict[str, List[Any]],
137
                 topics_list: List[pd.DataFrame], qrels: pd.DataFrame, 
138
                 metric: str, verbose: bool = False, jobs: int = 1, **kwargs) -> 'Transformer': ...
139
```
140

141
**Usage Examples:**
142

143
```python
144
# Grid search for optimal BM25 parameters
145
bm25_pipeline = pt.terrier.Retriever(index_ref, wmodel='BM25')
146
param_grid = {
147
    'k1': [0.9, 1.2, 1.5, 1.8],
148
    'b': [0.3, 0.5, 0.7, 0.9]
149
}
150

151
best_bm25 = pt.GridSearch(
152
    bm25_pipeline,
153
    param_grid,
154
    topics,
155
    qrels,
156
    'map',
157
    verbose=True
158
)
159

160
# Grid scan to explore parameter space
161
scan_results = pt.GridScan(
162
    bm25_pipeline,
163
    param_grid, 
164
    topics,
165
    qrels,
166
    ['map', 'ndcg', 'P_10'],
167
    verbose=True
168
)
169
print(scan_results)
170

171
# K-fold cross-validation for robust optimization
172
from sklearn.model_selection import KFold
173
kf = KFold(n_splits=5, shuffle=True, random_state=42)
174
topics_folds = [topics.iloc[train_idx] for train_idx, _ in kf.split(topics)]
175

176
best_params = pt.KFoldGridSearch(
177
    bm25_pipeline,
178
    param_grid,
179
    topics_folds,
180
    qrels,
181
    'map'
182
)
183
```
184

185
### Single Evaluation Function
186

187
Simple evaluation function for single system assessment.
188

189
```python { .api }
190
class Evaluate:
191
    """
192
    Single result evaluation for computing metrics on retrieval results.
193
    
194
    Parameters:
195
    - res: Results DataFrame with 'qid', 'docno', 'score' columns
196
    - qrels: Qrels DataFrame with 'qid', 'docno', 'label' columns
197
    - metrics: List of evaluation metrics to compute
198
    - perquery: Whether to return per-query results (default: False)
199
    """
200
    def __init__(self, res: pd.DataFrame, qrels: pd.DataFrame, 
201
                 metrics: List[str], perquery: bool = False) -> pd.DataFrame: ...
202
```
203

204
**Usage Example:**
205

206
```python
207
# Evaluate single system results
208
results = retriever.transform(topics)
209
evaluation = pt.Evaluate(results, qrels, ['map', 'ndcg', 'P_10'])
210
print(evaluation)
211

212
# Per-query evaluation  
213
perquery_eval = pt.Evaluate(results, qrels, ['map', 'ndcg'], perquery=True)
214
```
215

216
### Utility Transformers
217

218
Helper transformers for result processing and normalization in evaluation pipelines.
219

220
```python { .api }
221
class PerQueryMaxMinScoreTransformer(Transformer):
222
    """
223
    Per-query min-max score normalization transformer.
224
    
225
    Normalizes document scores within each query to [0,1] range using
226
    min-max normalization per query.
227
    """
228
    def __init__(self): ...
229
```
230

231
**Usage Example:**
232

233
```python
234
# Normalize scores before fusion
235
normalized_pipeline = (
236
    retriever >> 
237
    pt.pipelines.PerQueryMaxMinScoreTransformer() >>
238
    reranker
239
)
240
```
241

242
## Evaluation Metrics
243

244
PyTerrier supports comprehensive evaluation metrics through integration with the `ir-measures` library:
245

246
### Precision-Based Metrics
247
- `P_5`, `P_10`, `P_20`: Precision at rank cutoffs
248
- `R_5`, `R_10`, `R_20`: Recall at rank cutoffs  
249
- `F_5`, `F_10`, `F_20`: F1-score at rank cutoffs
250

251
### Ranking-Based Metrics
252
- `map`: Mean Average Precision
253
- `ndcg`: Normalized Discounted Cumulative Gain
254
- `ndcg_cut_5`, `ndcg_cut_10`, `ndcg_cut_20`: NDCG at rank cutoffs
255
- `mrr`: Mean Reciprocal Rank
256

257
### Advanced Metrics
258
- `bpref`: Binary preference measure
259
- `rbp`: Rank-biased precision
260
- `err`: Expected reciprocal rank
261
- `infap`: Inferred Average Precision
262

263
## Advanced Evaluation Patterns
264

265
### Multi-Dataset Evaluation
266

267
```python
268
# Evaluate across multiple datasets
269
datasets = ['vaswani', 'antique', 'trec-robust-2004']
270
systems = [bm25, pl2, dfr]
271
all_results = []
272

273
for dataset_name in datasets:
274
    dataset = pt.get_dataset(dataset_name)
275
    topics = dataset.get_topics()
276
    qrels = dataset.get_qrels()
277
    
278
    results = pt.Experiment(
279
        systems,
280
        topics, 
281
        qrels,
282
        ['map', 'ndcg'],
283
        names=['BM25', 'PL2', 'DFR']
284
    )
285
    results['dataset'] = dataset_name
286
    all_results.append(results)
287

288
combined = pd.concat(all_results)
289
```
290

291
### Statistical Significance Analysis
292

293
```python
294
# Comprehensive significance testing
295
results = pt.Experiment(
296
    [bm25, improved_system],
297
    topics,
298
    qrels,
299
    ['map', 'ndcg', 'P_10'],
300
    names=['Baseline', 'Improved'],
301
    baseline=0,                    # First system as baseline
302
    test='ttest',                 # Paired t-test
303
    correction='bonferroni',       # Multiple testing correction
304
    highlight='bold',             # Bold significant improvements
305
    perquery=True,                # Enable per-query analysis
306
    save_dir='/path/to/results'   # Save detailed results
307
)
308
```
309

310
### Learning Curve Analysis
311

312
```python
313
# Evaluate with different training set sizes
314
training_sizes = [0.1, 0.2, 0.5, 0.8, 1.0]
315
learning_results = []
316

317
for size in training_sizes:
318
    # Sample training data
319
    train_sample = training_data.sample(frac=size, random_state=42)
320
    
321
    # Train model
322
    trained_model = ltr_model.fit(train_sample)
323
    
324
    # Evaluate
325
    pipeline = retriever >> trained_model
326
    result = pt.Experiment([pipeline], test_topics, test_qrels, ['map'])
327
    result['training_size'] = size
328
    learning_results.append(result)
329

330
learning_curve = pd.concat(learning_results)
331
```
332

333
### Cross-Validation Evaluation
334

335
```python
336
from sklearn.model_selection import KFold
337

338
# K-fold cross-validation for robust evaluation
339
kf = KFold(n_splits=5, shuffle=True, random_state=42)
340
cv_results = []
341

342
for fold, (train_idx, test_idx) in enumerate(kf.split(topics)):
343
    train_topics = topics.iloc[train_idx]
344
    test_topics = topics.iloc[test_idx]
345
    
346
    # Train on fold
347
    trained_system = estimator.fit(train_topics)
348
    
349
    # Evaluate on fold
350
    result = pt.Experiment([trained_system], test_topics, qrels, ['map'])
351
    result['fold'] = fold
352
    cv_results.append(result)
353

354
cv_summary = pd.concat(cv_results).groupby('name').agg({
355
    'map': ['mean', 'std']
356
}).round(4)
357
```
358

359
## Types
360

361
```python { .api }
362
from typing import Dict, List, Any, Union, Optional
363
import pandas as pd
364

365
# Evaluation-specific types
366
SystemList = List['Transformer']  # List of systems to evaluate
367
MetricList = List[str]  # List of evaluation metrics
368
ParameterGrid = Dict[str, List[Any]]  # Parameter search space
369
StatisticalTest = str  # Statistical test name ('ttest', 'wilcoxon', 'fisher')
370
CorrectionMethod = str  # Multiple testing correction method
371
HighlightMode = str  # Result highlighting mode ('bold', 'color')
372
BaselineIndex = int  # Index of baseline system
373
TopicsFold = List[pd.DataFrame]  # List of topic DataFrames for cross-validation
374
```

Version

Tile

Files

evaluation.md.css-3qkkll{font-size:var(--chakra-font-sizes-sm);font-weight:var(--chakra-font-weights-normal);color:var(--chakra-colors-gray-300);}docs/

evaluation.mddocs/