
# PyTerrier

A comprehensive Python API for the Terrier information retrieval platform. PyTerrier enables declarative IR experimentation through composable transformer pipelines for indexing, retrieval, and evaluation, chained together with Python operators.

## Package Information

- **Package Name**: python-terrier
- **Language**: Python
- **Installation**: `pip install python-terrier`

## Core Imports

```python
import pyterrier as pt
```

Common imports for working with specific components:

```python
from pyterrier import Transformer, Estimator, Indexer
from pyterrier import Experiment, GridSearch
from pyterrier.terrier import Retriever, IndexFactory
```

23

24

## Basic Usage

```python
import pyterrier as pt
import pandas as pd

# Initialize PyTerrier (sets up the Java VM)
if not pt.java.started():
    pt.java.init()

# Create a simple retrieval pipeline
bm25 = pt.terrier.Retriever.from_dataset('vaswani', 'terrier_stemmed', wmodel='BM25')

# Perform retrieval
queries = pd.DataFrame([
    {'qid': '1', 'query': 'information retrieval'},
    {'qid': '2', 'query': 'search engines'}
])

results = bm25.transform(queries)
print(results.head())

# Chain transformers using operators
dataset = pt.get_dataset('vaswani')
text_getter = pt.text.get_text(dataset)
reranker = pt.terrier.Retriever(dataset.get_index(), wmodel='PL2')
pipeline = bm25 >> text_getter >> reranker
results = pipeline.transform(queries)

# Run experiments with evaluation
topics = dataset.get_topics()
qrels = dataset.get_qrels()
evaluation = pt.Experiment([bm25], topics, qrels, ['map', 'ndcg'])
print(evaluation)
```

59

60

## Architecture

PyTerrier's architecture is built around several key design patterns:

- **Transformer Pipeline Pattern**: All components implement the Transformer interface, enabling composition through operators (`>>`, `+`, `**`, etc.)
- **Dual API Support**: Most components support both DataFrame (`transform()`) and iterator (`transform_iter()`) interfaces
- **Java Integration Layer**: Seamless integration with the Terrier IR platform through comprehensive Java interop
- **Declarative Experimentation**: Built-in experiment framework with statistical significance testing
- **Plugin Architecture**: Extensible through entry points and custom transformer creation
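
The pipeline pattern above can be modelled in a few lines of plain Python. This is a simplified sketch of the idea, not PyTerrier's actual implementation — `Pipe` is a hypothetical stand-in for a Transformer operating on lists of dicts:

```python
from typing import Callable, Dict, List

Record = Dict[str, object]

class Pipe:
    """Toy stand-in for a PyTerrier Transformer: wraps a function over records."""
    def __init__(self, fn: Callable[[List[Record]], List[Record]]):
        self.fn = fn

    def transform(self, records: List[Record]) -> List[Record]:
        return self.fn(records)

    def __rshift__(self, other: 'Pipe') -> 'Pipe':
        # `a >> b` feeds a's output into b, mirroring pipeline composition
        return Pipe(lambda recs: other.transform(self.transform(recs)))

# Two toy stages: lowercase the query, then tag the record
lower = Pipe(lambda recs: [{**r, 'query': str(r['query']).lower()} for r in recs])
tag = Pipe(lambda recs: [{**r, 'tagged': True} for r in recs])

pipeline = lower >> tag
out = pipeline.transform([{'qid': '1', 'query': 'Information Retrieval'}])
```

Because every stage exposes the same `transform` contract, composition is associative and pipelines can be built up declaratively before any data flows through them.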

69

70

## Capabilities

### Core Transformers

Base classes and pipeline operators that form the foundation of PyTerrier's transformer architecture, enabling composable information retrieval pipelines.

```python { .api }
class Transformer:
    def transform(self, topics_or_res: pd.DataFrame) -> pd.DataFrame: ...
    def transform_iter(self, input_iter) -> Iterator: ...
    def __rshift__(self, other): ...  # >> operator for composition
    def __add__(self, other): ...     # + operator for score addition
    def __pow__(self, other): ...     # ** operator for feature union
    def __or__(self, other): ...      # | operator for set union
    def __and__(self, other): ...     # & operator for set intersection

class Estimator(Transformer):
    def fit(self, topics_and_res: pd.DataFrame) -> 'Estimator': ...

class Indexer(Transformer):
    def index(self, iter_dict) -> IndexRef: ...
```

[Core Transformers](./transformers.md)
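
Beyond `>>`, the arithmetic operators combine result sets. A toy model of what score addition (`a + b`) means over two runs — a sketch of the intent, not PyTerrier's implementation:

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def add_scores(run_a: List[Dict], run_b: List[Dict]) -> List[Dict]:
    """Sum scores per (qid, docno) pair, mirroring the `+` operator's intent."""
    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    for row in run_a + run_b:
        totals[(row['qid'], row['docno'])] += row['score']
    return [{'qid': q, 'docno': d, 'score': s} for (q, d), s in totals.items()]

combined = add_scores(
    [{'qid': '1', 'docno': 'd1', 'score': 0.4}],
    [{'qid': '1', 'docno': 'd1', 'score': 0.3}],
)
```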

### Retrieval

Retrieval components for searching indexed collections, including various weighting models, feature extraction, and text scoring capabilities.

```python { .api }
class Retriever(Transformer):
    @staticmethod
    def from_dataset(dataset_name: str, variant: str = None, version: str = 'latest', **kwargs) -> 'Retriever': ...
    def __init__(self, index_location: Union[str, Any],
                 controls: Optional[Dict[str, str]] = None,
                 properties: Optional[Dict[str, str]] = None,
                 metadata: List[str] = ["docno"],
                 num_results: Optional[int] = None,
                 wmodel: Optional[Union[str, Callable]] = None,
                 threads: int = 1,
                 verbose: bool = False): ...

class FeaturesRetriever(Transformer):
    def __init__(self, index_location: Union[str, Any], features: List[str],
                 controls: Optional[Dict[str, str]] = None,
                 properties: Optional[Dict[str, str]] = None,
                 threads: int = 1, **kwargs): ...

class TextScorer(Transformer):
    def __init__(self, wmodel: str = 'BM25', background_index: Any = None,
                 takes: str = 'docs', body_attr: str = 'text',
                 verbose: bool = False, **kwargs): ...
```

[Retrieval](./retrieval.md)

### Indexing

Index creation and management functionality for building searchable collections from various document formats.

```python { .api }
class IndexFactory:
    @staticmethod
    def from_dataset(dataset_name: str) -> IndexRef: ...
    @staticmethod
    def from_trec(path: str, **kwargs) -> IndexRef: ...

class FilesIndexer(Indexer):
    def __init__(self, index_path: str, **kwargs): ...

class TRECCollectionIndexer(Indexer):
    def __init__(self, index_path: str, **kwargs): ...

class DFIndexer(Indexer):
    def __init__(self, index_path: str, **kwargs): ...
```

[Indexing](./indexing.md)
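
Conceptually, an Indexer consumes an iterable of `{'docno': ..., 'text': ...}` dicts and returns a reference to an inverted index. A minimal in-memory sketch of that contract (illustrative only — real indexing is performed by Terrier and persisted on disk):

```python
from collections import defaultdict
from typing import Dict, Iterable, List

def toy_index(docs: Iterable[Dict[str, str]]) -> Dict[str, List[str]]:
    """Build a term -> [docno, ...] posting map from an iterable of dicts."""
    postings: Dict[str, List[str]] = defaultdict(list)
    for doc in docs:
        seen = set()
        for term in doc['text'].lower().split():
            if term not in seen:  # record each document once per term
                postings[term].append(doc['docno'])
                seen.add(term)
    return dict(postings)

index = toy_index([
    {'docno': 'd1', 'text': 'terrier information retrieval'},
    {'docno': 'd2', 'text': 'search engines retrieval'},
])
```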

### Java Integration

Java VM initialization, configuration, and integration with the underlying Terrier platform.

```python { .api }
def init(version: str = None, **kwargs) -> None: ...
def started() -> bool: ...
def configure(**kwargs) -> None: ...
def set_memory_limit(memory: str) -> None: ...
def extend_classpath(paths: List[str]) -> None: ...
def set_property(key: str, value: str) -> None: ...
```

[Java Integration](./java.md)

### Datasets

Dataset management for accessing standard IR test collections and creating custom datasets.

```python { .api }
def get_dataset(name: str) -> Dataset: ...
def find_datasets(query: str = None, **kwargs) -> List[str]: ...
def list_datasets() -> List[str]: ...

class Dataset:
    def get_topics(self, variant: str = None) -> pd.DataFrame: ...
    def get_qrels(self, variant: str = None) -> pd.DataFrame: ...
    def get_corpus_iter(self, verbose: bool = True) -> Iterator: ...
```

[Datasets](./datasets.md)

### Evaluation Framework

Comprehensive evaluation and parameter tuning framework with statistical significance testing.

```python { .api }
class Experiment:
    def __init__(self, retr_systems: List[Transformer], topics: pd.DataFrame,
                 qrels: pd.DataFrame, eval_metrics: List[str], **kwargs): ...

class GridSearch:
    def __init__(self, pipeline: Transformer, params: Dict, topics: pd.DataFrame,
                 qrels: pd.DataFrame, metric: str, **kwargs): ...

class GridScan:
    def __init__(self, pipeline: Transformer, params: Dict, topics: pd.DataFrame,
                 qrels: pd.DataFrame, metrics: List[str], **kwargs): ...
```

[Evaluation Framework](./evaluation.md)
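
Under the hood, `Experiment` scores each system's run against the qrels with measures such as MAP. The arithmetic for average precision on a single ranked list looks like this — a hand-rolled sketch for intuition; PyTerrier delegates the real computation to an evaluation library:

```python
from typing import List, Set

def average_precision(ranked_docnos: List[str], relevant: Set[str]) -> float:
    """AP = sum of precision@k at each relevant rank k, divided by |relevant|."""
    hits = 0
    precisions = []
    for k, docno in enumerate(ranked_docnos, start=1):
        if docno in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

# Relevant docs retrieved at ranks 2 and 4: (1/2 + 2/4) / 2 = 0.5
ap = average_precision(['d3', 'd1', 'd7', 'd2'], relevant={'d1', 'd2'})
```

MAP is then the mean of this value over all topics, which is what the `'map'` metric in the Basic Usage example reports per system.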

### Text Processing

Text processing utilities including stemming, tokenization, stopword removal, and text transformation.

```python { .api }
class TerrierStemmer(Transformer):
    def __init__(self, stemmer: str = 'porter'): ...

class TerrierTokeniser(Transformer):
    def __init__(self, **kwargs): ...

class TerrierStopwords(Transformer):
    def __init__(self, stopwords: str = 'terrier'): ...
```

[Text Processing](./text-processing.md)
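
These components follow the same transformer contract: each rewrites the query or document text column. A toy stopword-removal pass shows the shape of the operation (illustrative — Terrier's own tokeniser and stopword list differ):

```python
from typing import Set

STOPWORDS: Set[str] = {'the', 'a', 'of', 'in'}  # tiny stand-in for Terrier's list

def remove_stopwords(query: str, stopwords: Set[str] = STOPWORDS) -> str:
    """Drop stopword tokens, preserving the order of the remaining tokens."""
    return ' '.join(t for t in query.lower().split() if t not in stopwords)

cleaned = remove_stopwords('The History of Information Retrieval')
# -> 'history information retrieval'
```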

### Utilities

Supporting utilities for DataFrame manipulation, progress tracking, I/O operations, and general helper functions.

```python { .api }
def set_tqdm(tqdm_type: str = None) -> None: ...
def coerce_dataframe(input_data) -> pd.DataFrame: ...
def add_ranks(df: pd.DataFrame, single_query: bool = False) -> pd.DataFrame: ...
def autoopen(filename: str, mode: str = 'r', **kwargs): ...
```

[Utilities](./utilities.md)
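
`add_ranks` attaches a per-query `rank` column ordered by descending `score`. Its behaviour can be sketched over plain dict records (the real helper operates on a pandas DataFrame; the 0-based rank here is an assumption of this sketch):

```python
from itertools import groupby
from typing import Dict, List

def add_ranks_sketch(rows: List[Dict]) -> List[Dict]:
    """Sort each qid group by score (descending) and attach a 0-based rank."""
    rows = sorted(rows, key=lambda r: (r['qid'], -r['score']))
    out = []
    for _, group in groupby(rows, key=lambda r: r['qid']):
        for rank, row in enumerate(group):
            out.append({**row, 'rank': rank})
    return out

ranked = add_ranks_sketch([
    {'qid': '1', 'docno': 'd1', 'score': 2.0},
    {'qid': '1', 'docno': 'd2', 'score': 5.0},
])
```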

230

231

## Types

232

233

```python { .api }

234

# Core type definitions used across PyTerrier

235

from typing import Dict, List, Any, Iterator, Union, Optional, Callable, Sequence, Literal

236

import pandas as pd

237

import numpy.typing as npt

238

239

IterDictRecord = Dict[str, Any]

240

IterDict = Iterator[IterDictRecord]

241

IndexRef = Any # Java IndexRef object

242

Dataset = Any # Dataset object

243

TransformerLike = Union['Transformer', Callable[[pd.DataFrame], pd.DataFrame]]

244

QueryInput = Union[str, Dict[str, str], pd.DataFrame]

245

WeightingModel = str # Weighting model identifier (e.g., 'BM25', 'PL2')

246

MetricList = List[str] # List of evaluation metrics

247

```