# PyTerrier

A comprehensive Python API for the Terrier information retrieval platform. PyTerrier enables declarative IR experimentation through composable transformer pipelines for indexing, retrieval, and evaluation, chained together with Python operators.

## Package Information

- **Package Name**: python-terrier
- **Language**: Python
- **Installation**: `pip install python-terrier`
## Core Imports

```python
import pyterrier as pt
```

Common imports when working with specific components:

```python
from pyterrier import Transformer, Estimator, Indexer
from pyterrier import Experiment, GridSearch
from pyterrier.terrier import Retriever, IndexFactory
```
## Basic Usage

```python
import pyterrier as pt
import pandas as pd

# Initialize PyTerrier (starts the Java VM)
if not pt.java.started():
    pt.java.init()

# Create a simple retrieval pipeline
bm25 = pt.terrier.Retriever.from_dataset('vaswani', 'terrier_stemmed', wmodel='BM25')

# Perform retrieval
queries = pd.DataFrame([
    {'qid': '1', 'query': 'information retrieval'},
    {'qid': '2', 'query': 'search engines'},
])

results = bm25.transform(queries)
print(results.head())

# Chain transformers using operators: fetch document text,
# then re-score the retrieved documents with PL2
dataset = pt.get_dataset('vaswani')
pipeline = bm25 >> pt.text.get_text(dataset) >> pt.text.scorer(wmodel='PL2')
results = pipeline.transform(queries)

# Run experiments with evaluation
topics = dataset.get_topics()
qrels = dataset.get_qrels()
evaluation = pt.Experiment([bm25], topics, qrels, ['map', 'ndcg'])
print(evaluation)
```
## Architecture

PyTerrier's architecture is built around several key design patterns:

- **Transformer Pipeline Pattern**: All components implement the Transformer interface, enabling composition through operators (`>>`, `+`, `**`, etc.)
- **Dual API Support**: Most components support both DataFrame (`transform()`) and iterator (`transform_iter()`) interfaces
- **Java Integration Layer**: Seamless integration with the Terrier IR platform through comprehensive Java interop
- **Declarative Experimentation**: Built-in experiment framework with statistical significance testing
- **Plugin Architecture**: Extensible through entry points and custom transformer creation
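The pipeline pattern can be sketched in plain Python: the minimal, hypothetical classes below mirror how `>>` composes transformers by feeding one stage's output DataFrame into the next. This is a simplified illustration, not PyTerrier's actual implementation.

```python
import pandas as pd

class MiniTransformer:
    """Minimal sketch of the transformer interface: DataFrame in, DataFrame out."""
    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        raise NotImplementedError

    def __rshift__(self, other):  # the >> composition operator
        return Compose(self, other)

class Compose(MiniTransformer):
    def __init__(self, left, right):
        self.left, self.right = left, right

    def transform(self, df):
        # >> simply feeds the left stage's output into the right stage
        return self.right.transform(self.left.transform(df))

class AddColumn(MiniTransformer):
    """Hypothetical stage that tags each row with a constant column."""
    def __init__(self, name, value):
        self.name, self.value = name, value

    def transform(self, df):
        return df.assign(**{self.name: self.value})

pipeline = AddColumn('a', 1) >> AddColumn('b', 2)
out = pipeline.transform(pd.DataFrame({'qid': ['1']}))
print(out.columns.tolist())  # ['qid', 'a', 'b']
```

The same shape underlies the other operators: each returns a combined transformer whose `transform` merges or unions its operands' outputs.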
## Capabilities

### Core Transformers

Base classes and pipeline operators that form the foundation of PyTerrier's transformer architecture, enabling composable information retrieval pipelines.

```python { .api }
class Transformer:
    def transform(self, topics_or_res: pd.DataFrame) -> pd.DataFrame: ...
    def transform_iter(self, input_iter) -> Iterator: ...
    def __rshift__(self, other): ...   # >> operator for composition
    def __add__(self, other): ...      # + operator for score addition
    def __pow__(self, other): ...      # ** operator for feature union
    def __or__(self, other): ...       # | operator for set union
    def __and__(self, other): ...      # & operator for set intersection

class Estimator(Transformer):
    def fit(self, topics_and_res: pd.DataFrame) -> 'Estimator': ...

class Indexer(Transformer):
    def index(self, iter_dict) -> IndexRef: ...
```

[Core Transformers](./transformers.md)
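A custom stage is written by overriding `transform`. The top-k cut below is a hypothetical example; it is shown as a plain class so it runs standalone, but in PyTerrier you would subclass `pt.Transformer`.

```python
import pandas as pd

class TopK:
    """Keep only the k highest-scoring rows per query (hypothetical example).
    In PyTerrier this class would subclass pt.Transformer."""
    def __init__(self, k: int):
        self.k = k

    def transform(self, res: pd.DataFrame) -> pd.DataFrame:
        # Results frames carry qid/docno/score; cut each query's ranking at k
        out = (res.sort_values(['qid', 'score'], ascending=[True, False])
                  .groupby('qid', group_keys=False)
                  .head(self.k)
                  .reset_index(drop=True))
        # Recompute the rank column, 0-based as in PyTerrier result frames
        out['rank'] = out.groupby('qid').cumcount()
        return out

res = pd.DataFrame({
    'qid':   ['1', '1', '1', '2', '2'],
    'docno': ['d1', 'd2', 'd3', 'd4', 'd5'],
    'score': [0.2, 0.9, 0.5, 0.7, 0.1],
})
cut = TopK(2).transform(res)
print(cut[['qid', 'docno']].values.tolist())
# [['1', 'd2'], ['1', 'd3'], ['2', 'd4'], ['2', 'd5']]
```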
### Retrieval

Retrieval components for searching indexed collections, including various weighting models, feature extraction, and text scoring capabilities.

```python { .api }
class Retriever(Transformer):
    @staticmethod
    def from_dataset(dataset_name: str, variant: str = None, version: str = 'latest', **kwargs) -> 'Retriever': ...
    def __init__(self, index_location: Union[str, Any],
                 controls: Optional[Dict[str, str]] = None,
                 properties: Optional[Dict[str, str]] = None,
                 metadata: List[str] = ["docno"],
                 num_results: Optional[int] = None,
                 wmodel: Optional[Union[str, Callable]] = None,
                 threads: int = 1,
                 verbose: bool = False): ...

class FeaturesRetriever(Transformer):
    def __init__(self, index_location: Union[str, Any], features: List[str],
                 controls: Optional[Dict[str, str]] = None,
                 properties: Optional[Dict[str, str]] = None,
                 threads: int = 1, **kwargs): ...

class TextScorer(Transformer):
    def __init__(self, wmodel: str = 'BM25', background_index: Any = None,
                 takes: str = 'docs', body_attr: str = 'text',
                 verbose: bool = False, **kwargs): ...
```

[Retrieval](./retrieval.md)
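A re-scoring stage such as `TextScorer` consumes a results frame that already carries a text column, rewrites `score`, and re-sorts. Below is a pure-pandas sketch of that contract, with a hypothetical term-count scorer standing in for a real weighting model like BM25:

```python
import pandas as pd

def term_count_score(query: str, text: str) -> float:
    """Hypothetical stand-in for a weighting model: count query-term matches."""
    terms = query.lower().split()
    tokens = text.lower().split()
    return float(sum(tokens.count(t) for t in terms))

def rerank(res: pd.DataFrame) -> pd.DataFrame:
    # Rewrite the score column from the query and document text ...
    out = res.copy()
    out['score'] = [term_count_score(q, t) for q, t in zip(out['query'], out['text'])]
    # ... then re-sort and renumber ranks, as a TextScorer-style stage would
    out = out.sort_values(['qid', 'score'], ascending=[True, False]).reset_index(drop=True)
    out['rank'] = out.groupby('qid').cumcount()
    return out

res = pd.DataFrame({
    'qid': ['1', '1'],
    'query': ['search engines', 'search engines'],
    'docno': ['d1', 'd2'],
    'text': ['web search', 'search engines index search results'],
    'score': [2.0, 1.0],  # first-stage scores, about to be replaced
})
print(rerank(res)['docno'].tolist())  # ['d2', 'd1']
```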
### Indexing

Index creation and management functionality for building searchable collections from various document formats.

```python { .api }
class IndexFactory:
    @staticmethod
    def from_dataset(dataset_name: str) -> IndexRef: ...
    @staticmethod
    def from_trec(path: str, **kwargs) -> IndexRef: ...

class FilesIndexer(Indexer):
    def __init__(self, index_path: str, **kwargs): ...

class TRECCollectionIndexer(Indexer):
    def __init__(self, index_path: str, **kwargs): ...

class DFIndexer(Indexer):
    def __init__(self, index_path: str, **kwargs): ...
```

[Indexing](./indexing.md)
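The `Indexer.index` method consumes an iterable of dictionaries, each with at least a `docno` and a text field. A minimal sketch of building that input from raw strings follows; the field names match PyTerrier's conventions, but no index is actually built here:

```python
def corpus_iter(docs):
    """Yield PyTerrier-style iter-dict records: one dict per document."""
    for i, text in enumerate(docs):
        yield {'docno': f'd{i}', 'text': text}

records = list(corpus_iter(['an example document', 'another document']))
print(records[0])  # {'docno': 'd0', 'text': 'an example document'}
```

With PyTerrier available, such an iterable would typically be passed straight to an indexer, e.g. `indexer.index(corpus_iter(docs))`.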
### Java Integration

Java VM initialization, configuration, and integration with the underlying Terrier platform.

```python { .api }
def init(version: str = None, **kwargs) -> None: ...
def started() -> bool: ...
def configure(**kwargs) -> None: ...
def set_memory_limit(memory: str) -> None: ...
def extend_classpath(paths: List[str]) -> None: ...
def set_property(key: str, value: str) -> None: ...
```

[Java Integration](./java.md)
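VM configuration must happen before `init()`; once the VM is running, most settings are fixed. A configuration sketch using the functions above, with placeholder memory and classpath values:

```python
import pyterrier as pt

if not pt.java.started():
    # Configure the VM before it starts; these settings cannot change afterwards
    pt.java.set_memory_limit('4g')                     # placeholder value
    pt.java.extend_classpath(['/path/to/extra.jar'])   # placeholder path
    pt.java.init()

# Terrier properties can still be set after startup
pt.java.set_property('termpipelines', 'Stopwords,PorterStemmer')
```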
### Datasets

Dataset management for accessing standard IR test collections and creating custom datasets.

```python { .api }
def get_dataset(name: str) -> Dataset: ...
def find_datasets(query: str = None, **kwargs) -> List[str]: ...
def list_datasets() -> List[str]: ...

class Dataset:
    def get_topics(self, variant: str = None) -> pd.DataFrame: ...
    def get_qrels(self, variant: str = None) -> pd.DataFrame: ...
    def get_corpus_iter(self, verbose: bool = True) -> Iterator: ...
```

[Datasets](./datasets.md)
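`get_topics()` and `get_qrels()` return DataFrames with standard column names (`qid`/`query` and `qid`/`docno`/`label`). A pure-pandas sketch of how results are joined against qrels during evaluation, using hand-made frames in those shapes:

```python
import pandas as pd

# Hand-made frames in the shapes get_topics() / get_qrels() return
topics = pd.DataFrame({'qid': ['1'], 'query': ['information retrieval']})
qrels = pd.DataFrame({'qid': ['1'], 'docno': ['d1'], 'label': [1]})

# A results frame as a retriever would produce for those topics;
# d3 is unjudged (it does not appear in the qrels)
results = pd.DataFrame({'qid': ['1', '1'], 'docno': ['d3', 'd1'],
                        'score': [3.1, 2.7]})

# Evaluation joins results to judgments on (qid, docno); unjudged docs get label 0
judged = results.merge(qrels, on=['qid', 'docno'], how='left').fillna({'label': 0})
print(judged['label'].tolist())  # [0.0, 1.0]
```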
### Evaluation Framework

Comprehensive evaluation and parameter tuning framework with statistical significance testing. Note that `Experiment`, `GridSearch`, and `GridScan` are functions, not classes.

```python { .api }
def Experiment(retr_systems: List[Transformer], topics: pd.DataFrame,
               qrels: pd.DataFrame, eval_metrics: List[str], **kwargs) -> pd.DataFrame: ...

def GridSearch(pipeline: Transformer, params: Dict, topics: pd.DataFrame,
               qrels: pd.DataFrame, metric: str, **kwargs): ...

def GridScan(pipeline: Transformer, params: Dict, topics: pd.DataFrame,
             qrels: pd.DataFrame, metrics: List[str], **kwargs): ...
```

[Evaluation Framework](./evaluation.md)
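`Experiment` scores each system's results against the qrels with metrics such as `'map'`. To make the mechanics concrete, here is a hand computation of average precision for one ranking — a simplified illustration of what the metric measures, not how PyTerrier computes it internally:

```python
def average_precision(ranking, relevant):
    """AP: mean of precision@k over the ranks k where a relevant doc appears,
    divided by the total number of relevant documents."""
    hits, precisions = 0, []
    for k, docno in enumerate(ranking, start=1):
        if docno in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

# d2 is relevant at rank 1 (precision 1/1), d1 at rank 3 (precision 2/3)
ap = average_precision(['d2', 'd5', 'd1'], relevant={'d2', 'd1'})
print(ap)  # (1/1 + 2/3) / 2 = 0.8333...
```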
### Text Processing

Text processing utilities including stemming, tokenization, stopword removal, and text transformation.

```python { .api }
class TerrierStemmer(Transformer):
    def __init__(self, stemmer: str = 'porter'): ...

class TerrierTokeniser(Transformer):
    def __init__(self, **kwargs): ...

class TerrierStopwords(Transformer):
    def __init__(self, stopwords: str = 'terrier'): ...
```

[Text Processing](./text-processing.md)
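Stages like `TerrierStopwords` rewrite the `query` column of the frame flowing through the pipeline. A pure-pandas sketch of that behaviour with a tiny, hypothetical stopword list (Terrier's real list and tokenisation rules differ):

```python
import pandas as pd

STOPWORDS = {'the', 'of', 'a'}  # tiny hypothetical list

def remove_stopwords(topics: pd.DataFrame) -> pd.DataFrame:
    # Rewrite each query, dropping tokens found in the stopword list
    out = topics.copy()
    out['query'] = out['query'].map(
        lambda q: ' '.join(t for t in q.split() if t.lower() not in STOPWORDS))
    return out

topics = pd.DataFrame({'qid': ['1'],
                       'query': ['the history of information retrieval']})
print(remove_stopwords(topics)['query'][0])  # 'history information retrieval'
```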
### Utilities

Supporting utilities for DataFrame manipulation, progress tracking, I/O operations, and general helper functions.

```python { .api }
def set_tqdm(tqdm_type: str = None) -> None: ...
def coerce_dataframe(input_data) -> pd.DataFrame: ...
def add_ranks(df: pd.DataFrame, single_query: bool = False) -> pd.DataFrame: ...
def autoopen(filename: str, mode: str = 'r', **kwargs): ...
```

[Utilities](./utilities.md)
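`add_ranks` assigns a per-query `rank` column from descending `score`. A pure-pandas re-implementation of that behaviour for illustration (the real function also handles tie-breaking details not shown here):

```python
import pandas as pd

def add_ranks_sketch(df: pd.DataFrame) -> pd.DataFrame:
    # Rank 0 = highest score within each qid, mirroring PyTerrier result frames
    out = df.sort_values(['qid', 'score'],
                         ascending=[True, False]).reset_index(drop=True)
    out['rank'] = out.groupby('qid').cumcount()
    return out

res = pd.DataFrame({'qid': ['1', '1', '2'], 'docno': ['a', 'b', 'c'],
                    'score': [0.1, 0.9, 0.4]})
ranked = add_ranks_sketch(res)
print(ranked[['docno', 'rank']].values.tolist())  # [['b', 0], ['a', 1], ['c', 0]]
```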
## Types

```python { .api }
# Core type definitions used across PyTerrier
from typing import Dict, List, Any, Iterator, Union, Optional, Callable, Sequence, Literal
import pandas as pd
import numpy.typing as npt

IterDictRecord = Dict[str, Any]
IterDict = Iterator[IterDictRecord]
IndexRef = Any   # Java IndexRef object
Dataset = Any    # Dataset object
TransformerLike = Union['Transformer', Callable[[pd.DataFrame], pd.DataFrame]]
QueryInput = Union[str, Dict[str, str], pd.DataFrame]
WeightingModel = str    # Weighting model identifier (e.g., 'BM25', 'PL2')
MetricList = List[str]  # List of evaluation metrics
```