Tessl Tile for pypi/python-terrier@0.13.0

# Retrieval

PyTerrier's retrieval components search indexed collections using a choice of weighting models, extract ranking features, and score text against queries. Each component implements the Transformer interface, so it can be composed with other transformers into larger pipelines.

## Capabilities

### Core Retrieval

`Retriever` is the primary retrieval class. It replaces the deprecated `BatchRetrieve` and `TerrierRetrieve` classes and gives access to Terrier's weighting models and retrieval configuration.

```python { .api }
class Retriever(Transformer):
    """
    Main retrieval class supporting various weighting models and configurations.

    Parameters:
    - index_location: Index reference, path, or dataset name
    - controls: Dictionary of Terrier controls (optional)
    - properties: Dictionary of Terrier properties (optional)
    - metadata: List of metadata fields to include in results (default: ["docno"])
    - num_results: Maximum number of results to return (optional)
    - wmodel: Weighting model name or callable (optional)
    - threads: Number of threads for parallel retrieval (default: 1)
    - verbose: Enable verbose output (default: False)
    """
    def __init__(self, index_location: Union[str, Any],
                 controls: Optional[Dict[str, str]] = None,
                 properties: Optional[Dict[str, str]] = None,
                 metadata: List[str] = ["docno"],
                 num_results: Optional[int] = None,
                 wmodel: Optional[Union[str, Callable]] = None,
                 threads: int = 1,
                 verbose: bool = False): ...

    @staticmethod
    def from_dataset(dataset_name: str, variant: Optional[str] = None, **kwargs) -> 'Retriever': ...
```

**Supported Weighting Models:**
- `BM25`: Okapi BM25 ranking function
- `PL2`: Divergence from Randomness PL2 model
- `TF_IDF`: Classic TF-IDF weighting
- `DPH`: Divergence from Randomness DPH model
- `DFR_BM25`: Divergence from Randomness version of BM25
- `Hiemstra_LM`: Hiemstra Language Model
- `DirichletLM`: Dirichlet Language Model
- `JelinekMercerLM`: Jelinek-Mercer Language Model

**Usage Examples:**

```python
import pandas as pd
import pyterrier as pt

# index_ref below is assumed to be an existing Terrier index reference or path

# Create retriever from index path
bm25 = pt.terrier.Retriever('/path/to/index', wmodel='BM25')

# Create retriever from dataset
vaswani_retriever = pt.terrier.Retriever.from_dataset('vaswani', 'terrier_stemmed')

# Configure retrieval parameters
pl2 = pt.terrier.Retriever(index_ref, wmodel='PL2',
                           controls={'c': '1.0'},
                           num_results=50)

# Include metadata fields
retriever_with_meta = pt.terrier.Retriever(index_ref,
                                           metadata=['docno', 'title', 'url'])

# Perform retrieval: one row per query, with 'qid' and 'query' columns
queries = pd.DataFrame([
    {'qid': '1', 'query': 'information retrieval'},
    {'qid': '2', 'query': 'search engines'}
])
results = bm25.transform(queries)
```
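
To make the `BM25` entry in the weighting-model list more concrete, the following is a minimal pure-Python sketch of the textbook Okapi BM25 formula. It is not Terrier's implementation: the function name `bm25_score`, the toy collection, the whitespace tokenization, and the parameter values `k1=1.2` and `b=0.75` are all assumptions made for illustration, and Terrier's scores will differ.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """Score one tokenized document against a tokenized query (textbook BM25)."""
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        if tf[term] == 0:
            continue  # terms absent from the document contribute nothing
        df = doc_freqs.get(term, 0)
        # IDF with +1 inside the log so the weight is never negative
        idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
        # Term-frequency saturation with document-length normalization
        norm = tf[term] + k1 * (1 - b + b * doc_len / avg_doc_len)
        score += idf * tf[term] * (k1 + 1) / norm
    return score

# Hypothetical toy collection, tokenized on whitespace
docs = {
    'doc1': 'information retrieval ranks documents for a query'.split(),
    'doc2': 'search engines index documents'.split(),
    'doc3': 'neural networks learn representations'.split(),
}
doc_freqs = Counter(term for terms in docs.values() for term in set(terms))
avg_doc_len = sum(len(terms) for terms in docs.values()) / len(docs)

query = 'information retrieval'.split()
scores = {docno: bm25_score(query, terms, doc_freqs, len(docs), avg_doc_len)
          for docno, terms in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Only `doc1` contains the query terms, so it is the only document with a non-zero score and it ranks first. In practice the `Retriever` computes collection statistics from the index rather than from an in-memory dictionary.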

### Feature Extraction

`FeaturesRetriever` is a retrieval component specialized for extracting ranking features, for use in learning-to-rank and feature analysis.

```python { .api }
class FeaturesRetriever(Transformer):
    """
    Feature extraction retriever for learning-to-rank applications.

    Parameters:
    - index_ref: Reference to the index
    - features: List of feature names to extract
    - controls: Dictionary of Terrier controls (optional)
    - properties: Dictionary of Terrier properties (optional)
    """
    def __init__(self, index_ref: Any, features: List[str],
                 controls: Optional[Dict[str, str]] = None,
                 properties: Optional[Dict[str, str]] = None, **kwargs): ...
```

**Common Features:**
- `TF`: Term frequency
- `IDF`: Inverse document frequency
- `QTFN`: Query term frequency normalized
- `WMODEL:BM25`: BM25 weighting model score
- `WMODEL:PL2`: PL2 weighting model score
- `DOCLEN`: Document length
- `QLEN`: Query length

**Usage Example:**

```python
import pandas as pd
import pyterrier as pt

# index_ref is assumed to be an existing Terrier index reference
# Extract multiple features for learning-to-rank
features_retriever = pt.terrier.FeaturesRetriever(
    index_ref,
    features=['TF', 'IDF', 'WMODEL:BM25', 'WMODEL:PL2', 'DOCLEN']
)

# Get features for query-document pairs
topics_and_res = pd.DataFrame([
    {'qid': '1', 'query': 'information retrieval', 'docno': 'doc1'},
    {'qid': '1', 'query': 'information retrieval', 'docno': 'doc2'}
])
features = features_retriever.transform(topics_and_res)
```

### Text Scoring

`TextScorer` scores text passages against queries without requiring a pre-built index.

```python { .api }
class TextScorer(Transformer):
    """
    Score text passages against queries using specified weighting models.

    Parameters:
    - wmodel: Weighting model to use for scoring (default: 'BM25')
    - background_index: Optional background index for IDF statistics
    - takes: Specifies input format ('queries' or 'docs')
    - body_attr: Attribute name containing text to score (default: 'text')
    - verbose: Enable verbose output
    """
    def __init__(self, wmodel: str = 'BM25', background_index: Any = None,
                 takes: str = 'docs', body_attr: str = 'text',
                 verbose: bool = False, **kwargs): ...
```

**Usage Example:**

```python  
import pandas as pd
import pyterrier as pt

# Score documents against queries; body_attr is passed explicitly so the
# example does not depend on the default column name
scorer = pt.terrier.TextScorer(wmodel='BM25', body_attr='text')

# Input: queries and documents to score
input_df = pd.DataFrame([
    {'qid': '1', 'query': 'machine learning', 'docno': 'doc1',
     'text': 'Machine learning is a subset of artificial intelligence...'},
    {'qid': '1', 'query': 'machine learning', 'docno': 'doc2',
     'text': 'Deep learning uses neural networks for pattern recognition...'}
])

scored_results = scorer.transform(input_df)
```

### Query Rewriting

Query rewriting transformers reformulate or expand the query before retrieval to improve effectiveness.

```python { .api }
# Query rewriting transformers from pt.terrier.rewrite
class SequentialDependenceModel(Transformer):
    """Sequential Dependence Model query rewriting."""
    def __init__(self, index_ref: Any, **kwargs): ...

class DependenceModelPrecomputed(Transformer):
    """Precomputed dependence model rewriting."""
    def __init__(self, index_ref: Any, **kwargs): ...

class QueryExpansion(Transformer):
    """Relevance feedback based query expansion."""
    def __init__(self, index_ref: Any, fb_terms: int = 10, fb_docs: int = 3, **kwargs): ...
```

**Usage Example:**

```python
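import pyterrier as pt

# index_ref is assumed to be an existing Terrier index reference
retriever = pt.terrier.Retriever(index_ref, wmodel='BM25')

# Sequential dependence model for phrase matching
sdm = pt.terrier.rewrite.SequentialDependenceModel(index_ref)
sdm_pipeline = sdm >> retriever

# Query expansion with relevance feedback: retrieve, expand the query
# from the top-ranked documents, then retrieve again
qe = pt.terrier.rewrite.QueryExpansion(index_ref, fb_terms=20, fb_docs=5)
qe_pipeline = retriever >> qe >> retriever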
```
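
The `retriever >> qe >> retriever` pattern above is a form of pseudo-relevance feedback: the first pass supplies top-ranked documents, the rewriter adds terms drawn from them, and the second pass runs the expanded query. The sketch below illustrates that idea in plain Python with a simple term-frequency heuristic. It is a stand-in, not the expansion model Terrier applies; `expand_query` and the sample texts are hypothetical, and only the meaning of `fb_docs` and `fb_terms` is borrowed from the signature above.

```python
from collections import Counter

def expand_query(query, ranked_doc_texts, fb_docs=3, fb_terms=10):
    """Append the most frequent new terms from the top fb_docs documents."""
    query_terms = query.lower().split()
    counts = Counter()
    # Only the top fb_docs documents from the first pass act as feedback
    for text in ranked_doc_texts[:fb_docs]:
        counts.update(t for t in text.lower().split() if t not in query_terms)
    # Keep at most fb_terms expansion terms, most frequent first
    expansion = [term for term, _ in counts.most_common(fb_terms)]
    return ' '.join(query_terms + expansion)

# Hypothetical first-pass results, already in rank order
top_docs = [
    'search engines rank documents',
    'engines index documents quickly',
    'unrelated text about cooking',
]
expanded = expand_query('search', top_docs, fb_docs=2, fb_terms=2)
```

With `fb_docs=2` the third document is ignored, so the expansion terms come only from the two top-ranked texts. A real expansion model also weights the added terms rather than appending them with equal importance.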

## Deprecated Components

These components are kept for backward compatibility and issue deprecation warnings when used:

```python { .api }
# Deprecated - use Retriever instead
class BatchRetrieve(Transformer): ...
class TerrierRetrieve(Transformer): ...

# Deprecated - use FeaturesRetriever instead
class FeaturesBatchRetrieve(Transformer): ...
```

## Advanced Usage Patterns

### Multi-Stage Retrieval

```python
# Two-stage retrieval with reranking
first_stage = pt.terrier.Retriever(index_ref, wmodel='BM25', num_results=1000)
reranker = pt.terrier.Retriever(index_ref, wmodel='PL2')
pipeline = first_stage >> (reranker % 50)  # Rerank top 1000, return top 50
```
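
The `% 50` in the pipeline above is a rank cutoff: for each query, only the highest-scoring results are kept. As a rough sketch of that behavior on plain Python data (not PyTerrier's implementation; `rank_cutoff`, the row layout, and numbering ranks from 0 are assumptions for illustration):

```python
from collections import defaultdict

def rank_cutoff(results, k):
    """Keep the k highest-scoring rows for each qid and renumber their ranks."""
    by_query = defaultdict(list)
    for row in results:
        by_query[row['qid']].append(row)
    kept = []
    for qid, rows in by_query.items():
        rows = sorted(rows, key=lambda r: r['score'], reverse=True)
        for rank, row in enumerate(rows[:k]):
            kept.append({**row, 'rank': rank})
    return kept

# Hypothetical reranker output for two queries
results = [
    {'qid': '1', 'docno': 'd1', 'score': 2.0},
    {'qid': '1', 'docno': 'd2', 'score': 5.0},
    {'qid': '2', 'docno': 'd3', 'score': 1.0},
    {'qid': '2', 'docno': 'd4', 'score': 0.5},
]
top1 = rank_cutoff(results, 1)
```

The cutoff is applied per query, so `top1` holds one row for each `qid` rather than one row overall.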

### Feature-Based Retrieval

```python  
# Extract features for learning-to-rank
feature_pipeline = (
    pt.terrier.Retriever(index_ref, num_results=100) >>
    pt.terrier.FeaturesRetriever(index_ref, features=['TF', 'IDF', 'WMODEL:BM25'])
)
```

### Score Fusion

```python
# Late fusion of multiple retrieval models
bm25 = pt.terrier.Retriever(index_ref, wmodel='BM25')
pl2 = pt.terrier.Retriever(index_ref, wmodel='PL2')
fused = bm25 + pl2  # Add scores from both models
```
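
Conceptually, `bm25 + pl2` adds the two scores that each model assigns to the same query-document pair and re-sorts by the sum. The sketch below shows that idea on plain Python rows. It is an illustration rather than PyTerrier's code: `sum_fusion` is a hypothetical helper, and dropping documents that appear in only one list is an assumption here, so check the operator documentation for the actual behavior.

```python
def sum_fusion(left, right):
    """Sum scores for (qid, docno) pairs present in both result lists."""
    right_scores = {(r['qid'], r['docno']): r['score'] for r in right}
    fused = []
    for row in left:
        key = (row['qid'], row['docno'])
        if key in right_scores:  # assumption: unmatched documents are dropped
            fused.append({'qid': row['qid'], 'docno': row['docno'],
                          'score': row['score'] + right_scores[key]})
    # Order by query, then by descending fused score
    fused.sort(key=lambda r: (r['qid'], -r['score']))
    return fused

# Hypothetical outputs of two retrievers for the same query
bm25_rows = [
    {'qid': '1', 'docno': 'd1', 'score': 2.0},
    {'qid': '1', 'docno': 'd2', 'score': 1.0},
    {'qid': '1', 'docno': 'd3', 'score': 0.5},
]
pl2_rows = [
    {'qid': '1', 'docno': 'd2', 'score': 3.0},
    {'qid': '1', 'docno': 'd1', 'score': 1.5},
]
fused_rows = sum_fusion(bm25_rows, pl2_rows)
```

Here `d2` overtakes `d1` after fusion because its combined score is higher, even though `d1` ranked first under the first model alone. Scores from different weighting models are not on a common scale, so normalizing them before summing is often worth considering.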

## Types

```python { .api }
from typing import Any, Callable, Dict, List, Optional, Union
import pandas as pd

# Retrieval-specific types
IndexRef = Any  # Java IndexRef object
WeightingModel = str  # Weighting model identifier
Controls = Dict[str, str]  # Terrier control parameters
Properties = Dict[str, str]  # Terrier properties
MetadataFields = List[str]  # Metadata field names
```
