# Core Transformers

PyTerrier's core transformer architecture provides the foundation for building composable information retrieval pipelines. All PyTerrier components inherit from base transformer classes that support operator overloading for intuitive pipeline construction.

## Capabilities

### Base Transformer Class

The fundamental base class that all PyTerrier components inherit from, providing pipeline composition through operator overloading.
```python { .api }
class Transformer:
    """
    Base class for all PyTerrier transformers that process dataframes or iterators.

    Core Methods:
    - transform(topics_or_res): Transform DataFrame input to DataFrame output
    - transform_iter(input_iter): Transform iterator input to iterator output
    - search(query): Convenience method for single query search
    """
    def transform(self, topics_or_res: pd.DataFrame) -> pd.DataFrame: ...
    def transform_iter(self, input_iter: Iterator[Dict[str, Any]]) -> Iterator[Dict[str, Any]]: ...
    def search(self, query: str, qid: str = "1") -> pd.DataFrame: ...
    def compile(self) -> 'Transformer': ...
    def parallel(self, jobs: int = 2, backend: str = 'joblib') -> 'Transformer': ...
    def get_parameter(self, name: str) -> Any: ...
    def set_parameter(self, name: str, value: Any) -> 'Transformer': ...

    # Static methods
    @staticmethod
    def identity() -> 'Transformer': ...
    @staticmethod
    def from_df(df: pd.DataFrame, copy: bool = True) -> 'Transformer': ...

    # Pipeline operators
    def __rshift__(self, other: 'Transformer') -> 'Transformer': ...  # >>
    def __add__(self, other: 'Transformer') -> 'Transformer': ...     # +
    def __pow__(self, other: 'Transformer') -> 'Transformer': ...     # **
    def __or__(self, other: 'Transformer') -> 'Transformer': ...      # |
    def __and__(self, other: 'Transformer') -> 'Transformer': ...     # &
    def __mod__(self, cutoff: int) -> 'Transformer': ...              # %
    def __xor__(self, other: 'Transformer') -> 'Transformer': ...     # ^
    def __mul__(self, factor: float) -> 'Transformer': ...            # *
```
**Usage Examples:**

```python
# Basic pipeline composition
pipeline = retriever >> reranker >> cutoff_transformer

# Score combination
combined = system1 + system2  # Add scores

# Feature union
features = feature_extractor1 ** feature_extractor2

# Set operations
union_results = system1 | system2  # Union of retrieved documents
intersection = system1 & system2   # Intersection of retrieved documents

# Rank cutoff
top10 = retriever % 10  # Keep only the top 10 results

# Result concatenation
concatenated = system1 ^ system2
```
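To see the contract a component must satisfy, here is a minimal, PyTerrier-free sketch: a plain class that follows the `transform` convention by uppercasing the query column. The class name is invented for illustration; a real component would subclass `pt.Transformer` so it participates in operator-based composition.

```python
import pandas as pd

# Illustrative sketch only: mimics the Transformer.transform contract
# without depending on PyTerrier.
class UppercaseQuery:
    def transform(self, topics: pd.DataFrame) -> pd.DataFrame:
        out = topics.copy()  # never mutate the caller's frame
        out["query"] = out["query"].str.upper()
        return out

topics = pd.DataFrame([{"qid": "1", "query": "neural ranking"}])
result = UppercaseQuery().transform(topics)
print(result["query"].iloc[0])  # NEURAL RANKING
```

Note that `transform` copies its input before modifying it; PyTerrier pipelines may reuse the same frame across branches (e.g. with `+` or `**`), so in-place mutation would corrupt sibling branches.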
### Estimator Class

Base class for trainable transformers that can learn from training data.

```python { .api }
class Estimator(Transformer):
    """
    Base class for trainable transformers that learn from training data.

    Parameters:
    - topics_or_res_tr: Training topics (usually with documents)
    - qrels_tr: Training qrels (relevance judgments)
    - topics_or_res_va: Validation topics (usually with documents)
    - qrels_va: Validation qrels (relevance judgments)

    Returns:
    - Trained estimator instance
    """
    def fit(self, topics_or_res_tr: pd.DataFrame, qrels_tr: pd.DataFrame,
            topics_or_res_va: pd.DataFrame, qrels_va: pd.DataFrame) -> 'Estimator': ...
```
**Usage Example:**

```python
# Train a learning-to-rank model
ltr_model = SomeLearnToRankTransformer()
trained_model = ltr_model.fit(training_topics_res, training_qrels,
                              validation_topics_res, validation_qrels)

# Use trained model in pipeline
pipeline = retriever >> trained_model
```
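To make the fit-then-transform contract concrete, here is an illustrative, PyTerrier-free toy: an "estimator" that learns a single score multiplier from training qrels. The class and its learning rule are invented for this example and are not a real PyTerrier API.

```python
import pandas as pd

# Illustrative toy only: learns one parameter (a score multiplier) from
# training data in fit(), then applies it in transform(), mirroring the
# Estimator contract.
class ScoreScaler:
    def __init__(self):
        self.scale = 1.0

    def fit(self, res_tr, qrels_tr, res_va=None, qrels_va=None):
        # Toy learning rule: boost scores if any judged-relevant docs exist.
        judged = res_tr.merge(qrels_tr, on=["qid", "docno"], how="left")
        self.scale = 2.0 if judged["label"].fillna(0).sum() > 0 else 1.0
        return self  # fit returns the trained instance

    def transform(self, res: pd.DataFrame) -> pd.DataFrame:
        out = res.copy()
        out["score"] = out["score"] * self.scale
        return out

res_tr = pd.DataFrame([{"qid": "1", "docno": "d1", "score": 1.0}])
qrels_tr = pd.DataFrame([{"qid": "1", "docno": "d1", "label": 1}])
model = ScoreScaler().fit(res_tr, qrels_tr)
print(model.transform(res_tr)["score"].iloc[0])  # 2.0
```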
### Indexer Class

Base class for components that create searchable indexes from document collections.

```python { .api }
class Indexer(Transformer):
    """
    Base class for indexers that create searchable indexes from document collections.

    Parameters:
    - iter_dict: Iterator over documents with 'docno' and 'text' fields

    Returns:
    - IndexRef object representing the created index
    """
    def index(self, iter_dict: Iterator[Dict[str, Any]]) -> Any: ...
```
**Usage Example:**

```python
# Create an indexer; IterDictIndexer indexes documents supplied as dictionaries
indexer = pt.IterDictIndexer('/path/to/index')

# Index documents
documents = [
    {'docno': 'doc1', 'text': 'This is document 1'},
    {'docno': 'doc2', 'text': 'This is document 2'}
]
index_ref = indexer.index(documents)
```
### Pipeline Operators

Specialized transformer classes that implement pipeline operators for combining multiple transformers.

```python { .api }
class Compose(Transformer):
    """Pipeline composition operator (>>). Chains transformers sequentially."""
    def __init__(self, *transformers: Transformer): ...
    def index(self, iter: Iterator[Dict[str, Any]], batch_size: int = None) -> Any: ...
    def transform_iter(self, inp: Iterator[Dict[str, Any]]) -> Iterator[Dict[str, Any]]: ...
    def fit(self, topics_or_res_tr: pd.DataFrame, qrels_tr: pd.DataFrame,
            topics_or_res_va: pd.DataFrame = None, qrels_va: pd.DataFrame = None) -> None: ...

class RankCutoff(Transformer):
    """Rank cutoff operator (%). Limits results to top-k documents."""
    def __init__(self, k: int = 1000): ...

class FeatureUnion(Transformer):
    """Feature union operator (**). Combines features from multiple transformers."""
    def __init__(self, *transformers: Transformer): ...

class Sum(Transformer):
    """Score addition operator (+). Adds scores from multiple transformers."""
    def __init__(self, left: Transformer, right: Transformer): ...

class SetUnion(Transformer):
    """Set union operator (|). Union of documents from multiple transformers."""
    def __init__(self, left: Transformer, right: Transformer): ...

class SetIntersection(Transformer):
    """Set intersection operator (&). Intersection of documents from multiple transformers."""
    def __init__(self, left: Transformer, right: Transformer): ...

class Concatenate(Transformer):
    """Concatenation operator (^). Concatenates results from multiple transformers."""
    def __init__(self, left: Transformer, right: Transformer): ...

class ScalarProduct(Transformer):
    """Scalar multiplication operator (*). Multiplies scores by a constant factor."""
    def __init__(self, scalar: float): ...
```
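The pattern these classes embody, where operator expressions desugar into composition objects, can be shown with a small PyTerrier-free sketch. All class names here (`Step`, `Chain`, `TopK`, `Double`) are invented for illustration.

```python
import pandas as pd

# Illustrative sketch (not PyTerrier itself) of how operator overloading
# desugars into composition objects like Compose and RankCutoff.
class Step:
    def __rshift__(self, other):  # a >> b builds a Chain
        return Chain(self, other)
    def __mod__(self, k):         # a % k appends a TopK cutoff
        return Chain(self, TopK(k))

class Chain(Step):
    def __init__(self, left, right):
        self.left, self.right = left, right
    def transform(self, df):
        return self.right.transform(self.left.transform(df))

class TopK(Step):
    def __init__(self, k):
        self.k = k
    def transform(self, df):
        return df.sort_values("score", ascending=False).head(self.k)

class Double(Step):
    def transform(self, df):
        out = df.copy()
        out["score"] = out["score"] * 2
        return out

df = pd.DataFrame({"docno": ["a", "b", "c"], "score": [3.0, 1.0, 2.0]})
pipeline = Double() % 2  # double scores, then keep the top 2
print(pipeline.transform(df)["docno"].tolist())  # ['a', 'c']
```

The key design point mirrors PyTerrier: the operators build a lazy pipeline object, and no data flows until `transform` is called on it.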
### Apply Interface

Dynamic transformer creation interface for building custom transformers from functions.

```python { .api }
# Apply interface methods accessed via pt.apply.*
def query(fn: Callable[[Union[pd.Series, Dict[str, Any]]], str], *args, **kwargs) -> Transformer: ...
def doc_score(fn: Union[Callable[[Union[pd.Series, Dict[str, Any]]], float],
                        Callable[[pd.DataFrame], Sequence[float]]],
              *args, batch_size: Optional[int] = None, **kwargs) -> Transformer: ...
def doc_features(fn: Callable[[Union[pd.Series, Dict[str, Any]]], npt.NDArray[Any]],
                 *args, **kwargs) -> Transformer: ...
def indexer(fn: Callable[[Iterator[Dict[str, Any]]], Any], **kwargs) -> Indexer: ...
def rename(columns: Dict[str, str], *args, errors: Literal['raise', 'ignore'] = 'raise', **kwargs) -> Transformer: ...
def generic(fn: Union[Callable[[pd.DataFrame], pd.DataFrame],
                      Callable[[Iterator[Dict[str, Any]]], Iterator[Dict[str, Any]]]],
            *args, batch_size: Optional[int] = None, iter: bool = False, **kwargs) -> Transformer: ...
def by_query(fn: Union[Callable[[pd.DataFrame], pd.DataFrame],
                       Callable[[Iterator[Dict[str, Any]]], Iterator[Dict[str, Any]]]],
             *args, batch_size: Optional[int] = None, iter: bool = False,
             verbose: bool = False, **kwargs) -> Transformer: ...
```
**Usage Examples:**

```python
# Create custom query transformer
query_expander = pt.apply.query(lambda q: q["query"] + " information retrieval")

# Create custom scoring transformer (row-wise)
score_booster = pt.apply.doc_score(lambda row: row["score"] * 2)

# Create custom feature transformer
feature_extractor = pt.apply.doc_features(lambda row: np.array([len(row["text"])]))

# Column renaming transformer
renamer = pt.apply.rename({'old_column': 'new_column'})

# Batch-wise scoring transformer
def batch_scorer(df):
    return df["score"] * 2

batch_score_booster = pt.apply.doc_score(batch_scorer, batch_size=128)
```
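Conceptually, the apply interface lifts a plain function into a transformer object. The following PyTerrier-free sketch shows the idea behind row-wise `pt.apply.doc_score`; the helper `make_doc_scorer` is invented for illustration and is not the actual implementation.

```python
import pandas as pd

# Illustrative sketch of what pt.apply.doc_score does conceptually:
# wrap a per-row scoring function into an object whose transform()
# rewrites the score column.
def make_doc_scorer(fn):
    class _Applied:
        def transform(self, res: pd.DataFrame) -> pd.DataFrame:
            out = res.copy()
            out["score"] = res.apply(fn, axis=1)  # row-wise scoring
            return out
    return _Applied()

booster = make_doc_scorer(lambda row: row["score"] * 2)
res = pd.DataFrame([{"qid": "1", "docno": "d1", "score": 1.5}])
print(booster.transform(res)["score"].iloc[0])  # 3.0
```

This is why the batch-wise variant with `batch_size` exists: a row-wise `apply` call per document is convenient but slow, while a function receiving a whole DataFrame can score documents vectorised.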
## Design Patterns

### Operator Overloading

PyTerrier's operator overloading enables intuitive pipeline construction:

- `>>`: Sequential composition (pipe operator)
- `+`: Score addition for late fusion
- `**`: Feature union for combining features
- `|`: Set union for combining document sets
- `&`: Set intersection for filtering results
- `%`: Rank cutoff for limiting results
- `^`: Result concatenation
- `*`: Score multiplication by a constant factor

### Dual API Support

Most transformers support both DataFrame and iterator interfaces:

- `transform(df)`: Process a pandas DataFrame (preferred for most use cases)
- `transform_iter(iter)`: Process an iterator of dictionaries (memory-efficient for large datasets)
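A toy class (invented for illustration, not a real PyTerrier component) makes the equivalence between the two interfaces concrete: the same logic exposed once over a DataFrame and once over a stream of dictionaries.

```python
import pandas as pd

# Illustrative toy showing the dual API: identical behaviour over a
# DataFrame (materialised) and an iterator of dicts (streamed).
class AddExclamation:
    def transform(self, topics: pd.DataFrame) -> pd.DataFrame:
        out = topics.copy()
        out["query"] = out["query"] + "!"
        return out

    def transform_iter(self, rows):
        for row in rows:  # processes one record at a time
            yield {**row, "query": row["query"] + "!"}

t = AddExclamation()
df_out = t.transform(pd.DataFrame([{"qid": "1", "query": "hello"}]))
iter_out = list(t.transform_iter([{"qid": "1", "query": "hello"}]))
print(df_out["query"].iloc[0], iter_out[0]["query"])  # hello! hello!
```

The iterator form never holds more than one record in memory, which is what makes it suitable for indexing-scale inputs.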
### Parameter Management

Transformers support dynamic parameter access:

- `get_parameter(name)`: Retrieve a parameter value
- `set_parameter(name, value)`: Update a parameter value

This enables parameter tuning and grid search functionality.
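A manual grid search over this interface might look like the following PyTerrier-free sketch; the `Boost` class and the sum-of-scores selection rule are invented for illustration (real tuning would evaluate against qrels).

```python
import pandas as pd

# Illustrative sketch of get_parameter/set_parameter driving a manual
# grid search over one parameter.
class Boost:
    def __init__(self, factor=1.0):
        self.factor = factor
    def get_parameter(self, name):
        return getattr(self, name)
    def set_parameter(self, name, value):
        setattr(self, name, value)
        return self
    def transform(self, res):
        out = res.copy()
        out["score"] = out["score"] * self.factor
        return out

res = pd.DataFrame([{"qid": "1", "docno": "d1", "score": 2.0}])
t = Boost()
# Pick the candidate value that maximises a (toy) objective.
best = max([0.5, 1.0, 2.0],
           key=lambda f: t.set_parameter("factor", f).transform(res)["score"].sum())
print(best)  # 2.0
```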
## Types

```python { .api }
from typing import Dict, List, Any, Iterator, Callable, Union, Optional
import pandas as pd

# Common type aliases
IterDictRecord = Dict[str, Any]
IterDict = Iterator[IterDictRecord]
TransformerLike = Union[Transformer, Callable[[pd.DataFrame], pd.DataFrame]]
```