# Retrieval

PyTerrier's retrieval components search indexed collections. They support multiple weighting models, feature extraction, and text scoring. Every component implements the Transformer interface, so retrievers compose directly into larger pipelines.

## Capabilities

### Core Retrieval

`Retriever` is the primary retrieval class. It replaces the deprecated `BatchRetrieve` and `TerrierRetrieve` classes and provides access to the available weighting models and retrieval configurations.

```python { .api }
class Retriever(Transformer):
    """
    Main retrieval class supporting various weighting models and configurations.

    Parameters:
    - index_location: Index reference, path, or dataset name
    - controls: Dictionary of Terrier controls/properties (optional)
    - properties: Dictionary of Terrier properties (optional)
    - metadata: List of metadata fields to include in results (default: ["docno"])
    - num_results: Maximum number of results to return (optional)
    - wmodel: Weighting model name or callable (optional)
    - threads: Number of threads for parallel retrieval (default: 1)
    - verbose: Enable verbose output (default: False)
    """
    def __init__(self, index_location: Union[str, Any],
                 controls: Optional[Dict[str, str]] = None,
                 properties: Optional[Dict[str, str]] = None,
                 metadata: List[str] = ["docno"],
                 num_results: Optional[int] = None,
                 wmodel: Optional[Union[str, Callable]] = None,
                 threads: int = 1,
                 verbose: bool = False): ...

    @staticmethod
    def from_dataset(dataset_name: str, variant: Optional[str] = None, **kwargs) -> 'Retriever': ...
```

**Supported Weighting Models:**
- `BM25`: Okapi BM25 ranking function
- `PL2`: Divergence from Randomness PL2 model
- `TF_IDF`: Classic TF-IDF weighting
- `DPH`: Divergence from Randomness DPH model
- `DFR_BM25`: Divergence from Randomness version of BM25
- `Hiemstra_LM`: Hiemstra Language Model
- `DirichletLM`: Dirichlet Language Model
- `JelinekMercerLM`: Jelinek-Mercer Language Model

**Usage Examples:**

```python
import pandas as pd
import pyterrier as pt

# index_ref below is assumed to be an existing index reference or index path

# Create retriever from index path
bm25 = pt.terrier.Retriever('/path/to/index', wmodel='BM25')

# Create retriever from dataset
vaswani_retriever = pt.terrier.Retriever.from_dataset('vaswani', 'terrier_stemmed')

# Configure retrieval parameters
pl2 = pt.terrier.Retriever(index_ref, wmodel='PL2',
                           controls={'c': '1.0'},
                           num_results=50)

# Include metadata fields
retriever_with_meta = pt.terrier.Retriever(index_ref,
                                           metadata=['docno', 'title', 'url'])

# Perform retrieval: one row per query, with 'qid' and 'query' columns
queries = pd.DataFrame([
    {'qid': '1', 'query': 'information retrieval'},
    {'qid': '2', 'query': 'search engines'}
])
results = bm25.transform(queries)
```
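
To give a sense of what a weighting model such as `BM25` computes, the following is a minimal sketch of the textbook Okapi BM25 formula in plain Python. It is an illustration only: the function names, the defaults `k1=1.2` and `b=0.75`, and the IDF variant are assumptions, and Terrier's own implementation may differ in its details.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook BM25 contribution of one query term to one document (illustrative)."""
    if tf == 0:
        return 0.0
    # IDF variant with +1 inside the log so the value never goes negative (an assumption)
    idf = math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Document length normalisation: longer-than-average documents are penalised
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + norm)

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, n_docs, doc_freqs):
    """Sum the per-term contributions over all query terms."""
    return sum(
        bm25_term_score(doc_tf.get(t, 0), doc_len, avg_doc_len, n_docs, doc_freqs.get(t, 0))
        for t in query_terms
    )

# Made-up collection statistics, purely for illustration
score = bm25_score(
    query_terms=['information', 'retrieval'],
    doc_tf={'information': 3, 'retrieval': 1},
    doc_len=120, avg_doc_len=100, n_docs=1000,
    doc_freqs={'information': 200, 'retrieval': 50},
)
```

The sketch shows the three ingredients most weighting models share: term frequency, a rarity weight for the term, and a document length correction.
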
### Feature Extraction

`FeaturesRetriever` is a retrieval component specialized for extracting ranking features, which is useful for learning-to-rank and feature analysis.

```python { .api }
class FeaturesRetriever(Transformer):
    """
    Feature extraction retriever for learning-to-rank applications.

    Parameters:
    - index_ref: Reference to the index
    - features: List of feature names to extract
    - controls: Dictionary of Terrier controls (optional)
    - properties: Dictionary of Terrier properties (optional)
    """
    def __init__(self, index_ref: Any, features: List[str],
                 controls: Optional[Dict[str, str]] = None,
                 properties: Optional[Dict[str, str]] = None, **kwargs): ...
```

**Common Features:**
- `TF`: Term frequency
- `IDF`: Inverse document frequency
- `QTFN`: Query term frequency normalized
- `WMODEL:BM25`: BM25 weighting model score
- `WMODEL:PL2`: PL2 weighting model score
- `DOCLEN`: Document length
- `QLEN`: Query length

**Usage Example:**

```python
# Extract multiple features for learning-to-rank
features_retriever = pt.terrier.FeaturesRetriever(
    index_ref,
    features=['TF', 'IDF', 'WMODEL:BM25', 'WMODEL:PL2', 'DOCLEN']
)

# Get features for query-document pairs
topics_and_res = pd.DataFrame([
    {'qid': '1', 'query': 'information retrieval', 'docno': 'doc1'},
    {'qid': '1', 'query': 'information retrieval', 'docno': 'doc2'}
])
features = features_retriever.transform(topics_and_res)
```
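
For learning-to-rank, the extracted features usually need to be arranged as one feature vector per query-document pair, plus the number of documents belonging to each query. The plain-Python sketch below shows that reshaping. It assumes each output row carries a `features` entry holding one value per requested feature; the helper name `to_ltr_inputs` and the feature values are hypothetical.

```python
from itertools import groupby

def to_ltr_inputs(rows):
    """Turn scored rows into a feature matrix plus per-query group sizes."""
    # A stable sort keeps each query's documents together without reordering them
    ordered = sorted(rows, key=lambda r: r['qid'])
    X = [list(r['features']) for r in ordered]
    group_sizes = [len(list(g)) for _, g in groupby(ordered, key=lambda r: r['qid'])]
    return X, group_sizes

# Hypothetical rows: the feature values are made up for illustration
rows = [
    {'qid': '1', 'docno': 'doc1', 'features': [2.0, 1.3, 7.1]},
    {'qid': '1', 'docno': 'doc2', 'features': [1.0, 1.3, 4.2]},
    {'qid': '2', 'docno': 'doc7', 'features': [3.0, 0.4, 5.5]},
]
X, group_sizes = to_ltr_inputs(rows)
```
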
### Text Scoring

`TextScorer` scores text passages against queries without requiring a pre-built index. The text to score is read from a column of the input DataFrame.

```python { .api }
class TextScorer(Transformer):
    """
    Score text passages against queries using specified weighting models.

    Parameters:
    - wmodel: Weighting model to use for scoring (default: 'BM25')
    - background_index: Optional background index for IDF statistics
    - takes: Specifies input format, 'queries' or 'docs' (default: 'docs')
    - body_attr: Attribute name containing text to score (default: 'text')
    - verbose: Enable verbose output (default: False)
    """
    def __init__(self, wmodel: str = 'BM25', background_index: Any = None,
                 takes: str = 'docs', body_attr: str = 'text',
                 verbose: bool = False, **kwargs): ...
```

**Usage Example:**

```python
# Score documents against queries; body_attr names the column holding the passage text
scorer = pt.terrier.TextScorer(wmodel='BM25', body_attr='text')

# Input: queries and documents to score
input_df = pd.DataFrame([
    {'qid': '1', 'query': 'machine learning', 'docno': 'doc1',
     'text': 'Machine learning is a subset of artificial intelligence...'},
    {'qid': '1', 'query': 'machine learning', 'docno': 'doc2',
     'text': 'Deep learning uses neural networks for pattern recognition...'}
])

scored_results = scorer.transform(input_df)
```
### Query Rewriting

Query rewriting transformers reformulate or expand a query before retrieval to improve effectiveness.

```python { .api }
# Query rewriting transformers from pt.terrier.rewrite
class SequentialDependenceModel(Transformer):
    """Sequential Dependence Model query rewriting."""
    def __init__(self, index_ref: Any, **kwargs): ...

class DependenceModelPrecomputed(Transformer):
    """Precomputed dependence model rewriting."""
    def __init__(self, index_ref: Any, **kwargs): ...

class QueryExpansion(Transformer):
    """Relevance feedback based query expansion."""
    def __init__(self, index_ref: Any, fb_terms: int = 10, fb_docs: int = 3, **kwargs): ...
```

**Usage Example:**

```python
# Base retriever used by both pipelines
retriever = pt.terrier.Retriever(index_ref, wmodel='BM25')

# Sequential dependence model for phrase matching
sdm = pt.terrier.rewrite.SequentialDependenceModel(index_ref)
sdm_pipeline = sdm >> retriever

# Query expansion with relevance feedback:
# the first pass retrieves feedback documents, the second runs the expanded query
qe = pt.terrier.rewrite.QueryExpansion(index_ref, fb_terms=20, fb_docs=5)
qe_pipeline = retriever >> qe >> retriever
```
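
The roles of `fb_docs` and `fb_terms` can be illustrated with a simplified pseudo-relevance feedback sketch: take the top `fb_docs` documents from the first retrieval pass, count their terms, and append the `fb_terms` most frequent new terms to the query. Raw term frequency is used here as a stand-in; the term weighting that Terrier's expansion models apply is more sophisticated, and the function name `expand_query` is hypothetical.

```python
from collections import Counter

def expand_query(query, ranked_doc_texts, fb_docs=3, fb_terms=10):
    """Append the most frequent terms from the top-ranked documents to the query."""
    query_terms = query.lower().split()
    counts = Counter()
    # Only the top fb_docs documents are treated as (pseudo-)relevant
    for text in ranked_doc_texts[:fb_docs]:
        counts.update(t for t in text.lower().split() if t not in query_terms)
    # Highest count first; ties broken alphabetically so the result is deterministic
    ranked_terms = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    expansion = [term for term, _ in ranked_terms[:fb_terms]]
    return ' '.join(query_terms + expansion)

# Texts of first-pass results, best first (made up for illustration)
docs = [
    'search engines rank documents',
    'engines index documents',
    'unrelated text',
]
expanded = expand_query('search', docs, fb_docs=2, fb_terms=2)
```
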
## Deprecated Components

The following components remain available for backward compatibility but issue deprecation warnings when used:

```python { .api }
# Deprecated - use Retriever instead
class BatchRetrieve(Transformer): ...
class TerrierRetrieve(Transformer): ...

# Deprecated - use FeaturesRetriever instead
class FeaturesBatchRetrieve(Transformer): ...
```
## Advanced Usage Patterns

### Multi-Stage Retrieval

```python
# Two-stage retrieval with reranking
first_stage = pt.terrier.Retriever(index_ref, wmodel='BM25', num_results=1000)
reranker = pt.terrier.Retriever(index_ref, wmodel='PL2')
pipeline = first_stage >> (reranker % 50)  # Rerank top 1000, return top 50
```
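
The `% 50` rank cutoff in the pipeline above can be thought of as keeping the highest-scoring rows for each query. The sketch below mimics that behaviour on plain Python dictionaries rather than DataFrames. The function name `rank_cutoff` is hypothetical, and ranking from 0 by descending score is an assumption about how ranks are assigned.

```python
from collections import defaultdict

def rank_cutoff(rows, k):
    """Keep the k highest-scoring rows per query and attach a 0-based rank."""
    by_query = defaultdict(list)
    for row in rows:
        by_query[row['qid']].append(row)
    kept = []
    for group in by_query.values():
        ordered = sorted(group, key=lambda r: r['score'], reverse=True)
        for rank, row in enumerate(ordered[:k]):
            kept.append({**row, 'rank': rank})
    return kept

# Made-up reranker output for two queries
rows = [
    {'qid': 'q1', 'docno': 'd1', 'score': 1.0},
    {'qid': 'q1', 'docno': 'd2', 'score': 3.0},
    {'qid': 'q1', 'docno': 'd3', 'score': 2.0},
    {'qid': 'q2', 'docno': 'd4', 'score': 0.5},
]
top2 = rank_cutoff(rows, 2)
```
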
### Feature-Based Retrieval

```python
# Extract features for learning-to-rank
feature_pipeline = (
    pt.terrier.Retriever(index_ref, num_results=100) >>
    pt.terrier.FeaturesRetriever(index_ref, features=['TF', 'IDF', 'WMODEL:BM25'])
)
```
### Score Fusion

```python
# Late fusion of multiple retrieval models
bm25 = pt.terrier.Retriever(index_ref, wmodel='BM25')
pl2 = pt.terrier.Retriever(index_ref, wmodel='PL2')
fused = bm25 + pl2  # Add scores from both models
```
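
Adding two retrievers combines their scores per query-document pair. The following plain-Python sketch illustrates that idea under one stated assumption: a document missing from one result list contributes a score of zero from that list. The function name `sum_scores` is hypothetical, and the sketch uses dictionaries instead of DataFrames.

```python
def sum_scores(left, right):
    """Sum scores per (qid, docno); a document absent from one list adds 0 (assumption)."""
    fused = {}
    for rows in (left, right):
        for row in rows:
            key = (row['qid'], row['docno'])
            fused[key] = fused.get(key, 0.0) + row['score']
    # Order by query, then by descending fused score
    ordered = sorted(fused.items(), key=lambda kv: (kv[0][0], -kv[1]))
    return [{'qid': q, 'docno': d, 'score': s} for (q, d), s in ordered]

# Made-up results from two retrievers for the same query
bm25_rows = [
    {'qid': 'q1', 'docno': 'd1', 'score': 1.0},
    {'qid': 'q1', 'docno': 'd2', 'score': 2.0},
]
pl2_rows = [
    {'qid': 'q1', 'docno': 'd1', 'score': 0.5},
    {'qid': 'q1', 'docno': 'd3', 'score': 4.0},
]
fused_rows = sum_scores(bm25_rows, pl2_rows)
```

Because different weighting models produce scores on different scales, a plain sum lets the model with the larger scores dominate, so it can be worth normalising or weighting the scores before fusing them.
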
## Types

```python { .api }
from typing import Any, Callable, Dict, List, Optional, Union
import pandas as pd

# Retrieval-specific types
IndexRef = Any  # Java IndexRef object
WeightingModel = str  # Weighting model identifier
Controls = Dict[str, str]  # Terrier control parameters
Properties = Dict[str, str]  # Terrier properties
MetadataFields = List[str]  # Metadata field names
```