# Retrieval


PyTerrier's retrieval components search indexed collections and support multiple weighting models, feature extraction, and text scoring. Each component implements the Transformer interface, so it can be composed with other transformers in a pipeline.
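
To make the pipeline idea concrete, here is a minimal sketch in plain Python of how stages that share a `transform` method can be chained with `>>`. `ToyTransformer` and both stages are hypothetical stand-ins written for this illustration; they are not PyTerrier classes and do not reflect its implementation.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


class ToyTransformer:
    """Hypothetical stand-in for a pipeline stage: rows in, rows out."""

    def __init__(self, fn: Callable[[List[Row]], List[Row]]):
        self.fn = fn

    def transform(self, rows: List[Row]) -> List[Row]:
        return self.fn(rows)

    def __rshift__(self, other: "ToyTransformer") -> "ToyTransformer":
        # a >> b: feed the output of a into b
        return ToyTransformer(lambda rows: other.transform(self.transform(rows)))


# Stage 1 lowercases each query; stage 2 attaches one fake result per query.
lowercase = ToyTransformer(
    lambda rows: [{**r, "query": str(r["query"]).lower()} for r in rows]
)
fake_retrieve = ToyTransformer(
    lambda rows: [{**r, "docno": "d1", "score": 1.0} for r in rows]
)

pipeline = lowercase >> fake_retrieve
output = pipeline.transform([{"qid": "1", "query": "Information Retrieval"}])
print(output)
```

Because every stage consumes and produces the same row shape, any stage can be swapped without touching the rest of the pipeline.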


## Capabilities


### Core Retrieval


`Retriever` is the primary retrieval class. It replaces the deprecated `BatchRetrieve` and `TerrierRetrieve` classes and provides access to Terrier's weighting models and retrieval configuration.


```python { .api }
class Retriever(Transformer):
    """
    Main retrieval class supporting various weighting models and configurations.

    Parameters:
    - index_location: Index reference, path, or dataset name
    - controls: Dictionary of Terrier controls/properties (optional)
    - properties: Dictionary of Terrier properties (optional)
    - metadata: List of metadata fields to include in results (default: ["docno"])
    - num_results: Maximum number of results to return (optional)
    - wmodel: Weighting model name or callable (optional)
    - threads: Number of threads for parallel retrieval (default: 1)
    - verbose: Enable verbose output (default: False)
    """
    def __init__(self, index_location: Union[str, Any],
                 controls: Optional[Dict[str, str]] = None,
                 properties: Optional[Dict[str, str]] = None,
                 metadata: List[str] = ["docno"],
                 num_results: Optional[int] = None,
                 wmodel: Optional[Union[str, Callable]] = None,
                 threads: int = 1,
                 verbose: bool = False): ...

    @staticmethod
    def from_dataset(dataset_name: str, variant: str = None, **kwargs) -> 'Retriever': ...
```


**Supported Weighting Models:**

- `BM25`: Okapi BM25 ranking function
- `PL2`: Divergence from Randomness PL2 model
- `TF_IDF`: Classic TF-IDF weighting
- `DPH`: Divergence from Randomness DPH model
- `DFR_BM25`: Divergence from Randomness version of BM25
- `Hiemstra_LM`: Hiemstra Language Model
- `DirichletLM`: Dirichlet Language Model
- `JelinekMercerLM`: Jelinek-Mercer Language Model
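
To illustrate what a weighting model computes, the following sketch scores tokenised documents with one common textbook formulation of Okapi BM25. It is plain Python and does not call Terrier; the function name `bm25_score`, the toy corpus, and the parameter values `k1=1.2` and `b=0.75` are assumptions for this example, and Terrier's own `BM25` implementation may differ in details such as the IDF form.

```python
import math
from collections import Counter
from typing import List


def bm25_score(query: List[str], doc: List[str], corpus: List[List[str]],
               k1: float = 1.2, b: float = 0.75) -> float:
    """Textbook Okapi BM25 for one tokenised document (illustrative only)."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0 or tf[term] == 0:
            continue
        # The "+ 1" inside the log keeps the IDF non-negative for frequent terms.
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        # Length normalisation: longer-than-average documents are penalised.
        norm = tf[term] + k1 * (1.0 - b + b * len(doc) / avg_len)
        score += idf * tf[term] * (k1 + 1.0) / norm
    return score


corpus = [
    "information retrieval with terrier".split(),
    "search engines rank documents".split(),
    "retrieval models score documents for a query".split(),
]
query = "information retrieval".split()
scores = [bm25_score(query, doc, corpus) for doc in corpus]
# Document indices ordered from best to worst score.
ranking = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

The first document matches both query terms and is shorter than average, so it ranks first; the second matches neither term and scores zero.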

**Usage Examples:**


```python
import pandas as pd
import pyterrier as pt

# Create retriever from index path
bm25 = pt.terrier.Retriever('/path/to/index', wmodel='BM25')

# Create retriever from dataset
vaswani_retriever = pt.terrier.Retriever.from_dataset('vaswani', 'terrier_stemmed')

# Configure retrieval parameters
# (index_ref is assumed to be an index reference or path obtained elsewhere)
pl2 = pt.terrier.Retriever(index_ref, wmodel='PL2',
                           controls={'c': '1.0'},
                           num_results=50)

# Include metadata fields
retriever_with_meta = pt.terrier.Retriever(index_ref,
                                           metadata=['docno', 'title', 'url'])

# Perform retrieval
queries = pd.DataFrame([
    {'qid': '1', 'query': 'information retrieval'},
    {'qid': '2', 'query': 'search engines'}
])
results = bm25.transform(queries)
```


### Feature Extraction


`FeaturesRetriever` is a retrieval component specialized for extracting ranking features, which makes it useful for learning-to-rank and feature analysis.


```python { .api }
class FeaturesRetriever(Transformer):
    """
    Feature extraction retriever for learning-to-rank applications.

    Parameters:
    - index_ref: Reference to the index
    - features: List of feature names to extract
    - controls: Dictionary of Terrier controls
    - properties: Dictionary of Terrier properties
    """
    def __init__(self, index_ref: Any, features: List[str],
                 controls: Dict[str, str] = None,
                 properties: Dict[str, str] = None, **kwargs): ...
```


**Common Features:**

- `TF`: Term frequency
- `IDF`: Inverse document frequency
- `QTFN`: Query term frequency normalized
- `WMODEL:BM25`: BM25 weighting model score
- `WMODEL:PL2`: PL2 weighting model score
- `DOCLEN`: Document length
- `QLEN`: Query length
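
As a rough picture of what a per-pair feature vector looks like, the sketch below computes simplified stand-ins for three of the features listed above for one query-document pair. The function `toy_features` and its whitespace tokenisation are assumptions made for illustration; the definitions Terrier actually uses for these features are not specified on this page and may differ.

```python
from collections import Counter
from typing import Dict, List


def toy_features(query: str, text: str) -> Dict[str, float]:
    """Compute simplified stand-ins for a few ranking features."""
    q_terms = query.lower().split()
    d_terms = text.lower().split()
    tf = Counter(d_terms)
    return {
        "TF": float(sum(tf[t] for t in q_terms)),  # query-term occurrences in the document
        "DOCLEN": float(len(d_terms)),             # document length in tokens
        "QLEN": float(len(q_terms)),               # query length in tokens
    }


# A fixed feature order turns the named values into a vector for a learner.
feature_order: List[str] = ["TF", "DOCLEN", "QLEN"]
feats = toy_features("information retrieval",
                     "Information retrieval evaluates retrieval systems")
vector = [feats[name] for name in feature_order]
print(vector)
```

Keeping the feature order fixed matters because a learning-to-rank model identifies features by position, not by name.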

**Usage Example:**


```python
# Extract multiple features for learning-to-rank
features_retriever = pt.terrier.FeaturesRetriever(
    index_ref,
    features=['TF', 'IDF', 'WMODEL:BM25', 'WMODEL:PL2', 'DOCLEN']
)

# Get features for query-document pairs
topics_and_res = pd.DataFrame([
    {'qid': '1', 'query': 'information retrieval', 'docno': 'doc1'},
    {'qid': '1', 'query': 'information retrieval', 'docno': 'doc2'}
])
features = features_retriever.transform(topics_and_res)
```


### Text Scoring


`TextScorer` scores text passages against queries without requiring a pre-built index.


```python { .api }
class TextScorer(Transformer):
    """
    Score text passages against queries using specified weighting models.

    Parameters:
    - wmodel: Weighting model to use for scoring (default: 'BM25')
    - background_index: Optional background index for IDF statistics
    - takes: Specifies input format ('queries' or 'docs')
    - body_attr: Attribute name containing text to score (default: 'text')
    - verbose: Enable verbose output
    """
    def __init__(self, wmodel: str = 'BM25', background_index: Any = None,
                 takes: str = 'docs', body_attr: str = 'text',
                 verbose: bool = False, **kwargs): ...
```


**Usage Example:**


```python
import pandas as pd
import pyterrier as pt

# Score documents against queries
scorer = pt.terrier.TextScorer(wmodel='BM25')

# Input: queries and documents to score
input_df = pd.DataFrame([
    {'qid': '1', 'query': 'machine learning', 'docno': 'doc1',
     'text': 'Machine learning is a subset of artificial intelligence...'},
    {'qid': '1', 'query': 'machine learning', 'docno': 'doc2',
     'text': 'Deep learning uses neural networks for pattern recognition...'}
])

scored_results = scorer.transform(input_df)
```
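
The `background_index` parameter matters because IDF-style statistics depend on the collection they are computed from. The sketch below, in plain Python, compares statistics taken only from the texts being scored with statistics taken from a larger collection. The helpers `idf_table` and `score`, the smoothed IDF formula, and the sample texts are all assumptions for illustration; this is not how `TextScorer` is implemented.

```python
import math
from typing import Dict, List


def idf_table(texts: List[str]) -> Dict[str, float]:
    """Smoothed IDF computed from whatever collection of texts is supplied."""
    n = len(texts)
    df: Dict[str, int] = {}
    for text in texts:
        for term in set(text.lower().split()):
            df[term] = df.get(term, 0) + 1
    return {t: math.log((n + 1) / (c + 1)) + 1.0 for t, c in df.items()}


def score(query: str, text: str, idf: Dict[str, float]) -> float:
    """Sum of tf * idf over query terms; terms without statistics add nothing."""
    terms = text.lower().split()
    return sum(terms.count(t) * idf.get(t, 0.0) for t in query.lower().split())


batch = ["machine learning is a subset of artificial intelligence",
         "deep learning uses neural networks"]
background = batch + ["learning to rank", "online learning", "machine translation"]

# Statistics from the two scored texts alone: "learning" occurs in both.
batch_idf = idf_table(batch)
# Statistics from a larger collection give different term weights.
background_idf = idf_table(background)

batch_score = score("machine learning", batch[0], batch_idf)
background_score = score("machine learning", batch[0], background_idf)
print(batch_score, background_score)
```

With only two texts, every term is either in one document or in both, so the weights carry little information; a representative background collection usually gives more meaningful scores.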


### Query Rewriting


The query rewriting transformers transform or expand queries to improve retrieval effectiveness.


```python { .api }
# Query rewriting transformers from pt.terrier.rewrite
class SequentialDependenceModel(Transformer):
    """Sequential Dependence Model query rewriting."""
    def __init__(self, index_ref: Any, **kwargs): ...

class DependenceModelPrecomputed(Transformer):
    """Precomputed dependence model rewriting."""
    def __init__(self, index_ref: Any, **kwargs): ...

class QueryExpansion(Transformer):
    """Relevance feedback based query expansion."""
    def __init__(self, index_ref: Any, fb_terms: int = 10, fb_docs: int = 3, **kwargs): ...
```
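
As background for the sequential dependence rewriter, the general idea of a sequential dependence model is to combine evidence from single terms with evidence from adjacent term pairs, matched both as exact phrases and as unordered co-occurrences within a small window. The sketch below only enumerates those components for a query; the function `sdm_components` is hypothetical, and the query syntax that the rewriter actually emits is not shown on this page and is not reproduced here.

```python
from typing import Dict, List, Tuple


def sdm_components(query: str) -> Dict[str, List[Tuple[str, ...]]]:
    """Split a query into the three evidence types of a sequential dependence model."""
    terms = query.split()
    pairs = list(zip(terms, terms[1:]))  # adjacent term pairs
    return {
        "unigrams": [(t,) for t in terms],
        "ordered": pairs,    # to be matched as exact phrases
        "unordered": pairs,  # to be matched in any order within a small window
    }


components = sdm_components("information retrieval systems")
print(components["ordered"])
```

A single-term query has no adjacent pairs, so only the unigram component is non-empty in that case.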


**Usage Example:**


```python
# retriever is assumed to be an existing Retriever, such as the bm25 instance created earlier

# Sequential dependence model for phrase matching
sdm = pt.terrier.rewrite.SequentialDependenceModel(index_ref)
sdm_pipeline = sdm >> retriever

# Query expansion with relevance feedback
qe = pt.terrier.rewrite.QueryExpansion(index_ref, fb_terms=20, fb_docs=5)
qe_pipeline = retriever >> qe >> retriever
```
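
The `retriever >> qe >> retriever` pattern is pseudo-relevance feedback: the first pass supplies top-ranked documents, the rewriter picks expansion terms from them, and the second pass runs the expanded query. The sketch below shows that idea in plain Python with a deliberately simple term selector based on raw frequency. The function `expand_query` and its selection rule are assumptions for illustration; Terrier's expansion models weight candidate terms differently.

```python
from collections import Counter
from typing import List


def expand_query(query: str, ranked_texts: List[str],
                 fb_docs: int = 3, fb_terms: int = 10) -> str:
    """Append the most frequent new terms from the top fb_docs documents."""
    original = query.lower().split()
    counts: Counter = Counter()
    for text in ranked_texts[:fb_docs]:
        counts.update(t for t in text.lower().split() if t not in original)
    # Sort by descending count, then alphabetically, so the result is deterministic.
    best = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:fb_terms]
    return " ".join(original + [term for term, _ in best])


# Texts of the documents returned by a first retrieval pass, best first.
ranked = [
    "retrieval models rank documents",
    "ranking documents with retrieval models",
    "unrelated text about cooking",
]
expanded = expand_query("retrieval", ranked, fb_docs=2, fb_terms=2)
print(expanded)
```

Limiting `fb_docs` keeps lower-ranked, likely off-topic documents (the third text here) from contributing expansion terms.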


## Deprecated Components


These components are maintained for backward compatibility but issue deprecation warnings:


```python { .api }
# Deprecated - use Retriever instead
class BatchRetrieve(Transformer): ...
class TerrierRetrieve(Transformer): ...

# Deprecated - use FeaturesRetriever instead
class FeaturesBatchRetrieve(Transformer): ...
```


## Advanced Usage Patterns


### Multi-Stage Retrieval


```python
# Two-stage retrieval with reranking
first_stage = pt.terrier.Retriever(index_ref, wmodel='BM25', num_results=1000)
reranker = pt.terrier.Retriever(index_ref, wmodel='PL2')
pipeline = first_stage >> (reranker % 50)  # Rerank top 1000, return top 50
```
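
The `% 50` in the pipeline above is a rank cutoff: it keeps the top 50 results for each query. A minimal sketch of that operation on plain Python rows follows. The function `rank_cutoff` is hypothetical, and numbering ranks from 0 is an assumption of this sketch, not something stated on this page.

```python
from collections import defaultdict
from typing import Dict, List


def rank_cutoff(rows: List[Dict], k: int) -> List[Dict]:
    """Keep the k highest-scoring rows for each qid and renumber their ranks."""
    by_query: Dict[str, List[Dict]] = defaultdict(list)
    for row in rows:
        by_query[row["qid"]].append(row)
    kept: List[Dict] = []
    for group in by_query.values():
        top = sorted(group, key=lambda r: r["score"], reverse=True)[:k]
        # Assumption: ranks start at 0 for the best result of each query.
        kept.extend({**r, "rank": i} for i, r in enumerate(top))
    return kept


rows = [
    {"qid": "1", "docno": "a", "score": 2.0},
    {"qid": "1", "docno": "b", "score": 5.0},
    {"qid": "1", "docno": "c", "score": 3.0},
    {"qid": "2", "docno": "a", "score": 1.0},
]
top2 = rank_cutoff(rows, 2)
print([(r["qid"], r["docno"], r["rank"]) for r in top2])
```

The cutoff is applied per query, so a query with fewer than `k` results keeps all of them.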


### Feature-Based Retrieval


```python
# Extract features for learning-to-rank
feature_pipeline = (
    pt.terrier.Retriever(index_ref, num_results=100) >>
    pt.terrier.FeaturesRetriever(index_ref, features=['TF', 'IDF', 'WMODEL:BM25'])
)
```


### Score Fusion


```python
# Late fusion of multiple retrieval models
bm25 = pt.terrier.Retriever(index_ref, wmodel='BM25')
pl2 = pt.terrier.Retriever(index_ref, wmodel='PL2')
fused = bm25 + pl2  # Add scores from both models
```
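
Adding scores from two models amounts to matching results on the query and document identifiers and summing the two scores. The sketch below shows this on plain Python rows. The function `sum_scores` is hypothetical, and dropping documents that appear in only one list is an assumption of this sketch; how PyTerrier treats such documents is not covered on this page.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # (qid, docno)


def sum_scores(run_a: List[Dict], run_b: List[Dict]) -> Dict[Key, float]:
    """Add the scores of documents that both runs returned for the same query."""
    scores_b = {(r["qid"], r["docno"]): r["score"] for r in run_b}
    fused: Dict[Key, float] = {}
    for r in run_a:
        key = (r["qid"], r["docno"])
        if key in scores_b:  # assumption: documents missing from one run are dropped
            fused[key] = r["score"] + scores_b[key]
    return fused


bm25_run = [{"qid": "1", "docno": "d1", "score": 2.5},
            {"qid": "1", "docno": "d2", "score": 1.0}]
pl2_run = [{"qid": "1", "docno": "d1", "score": 0.5},
           {"qid": "1", "docno": "d3", "score": 4.0}]
fused_scores = sum_scores(bm25_run, pl2_run)
print(fused_scores)
```

Scores from different weighting models are not on a common scale, so an unweighted sum can be dominated by whichever model produces larger values.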


## Types


```python { .api }
from typing import Dict, List, Any, Union, Optional
import pandas as pd

# Retrieval-specific types
IndexRef = Any  # Java IndexRef object
WeightingModel = str  # Weighting model identifier
Controls = Dict[str, str]  # Terrier control parameters
Properties = Dict[str, str]  # Terrier properties
MetadataFields = List[str]  # Metadata field names
```