
# Core Transformers

PyTerrier's core transformer architecture provides the foundation for building composable information retrieval pipelines. All PyTerrier components inherit from base transformer classes that support operator overloading for intuitive pipeline construction.

## Capabilities

### Base Transformer Class

The fundamental base class that all PyTerrier components inherit from, providing pipeline composition through operator overloading.

```python { .api }
class Transformer:
    """
    Base class for all PyTerrier transformers that process dataframes or iterators.

    Core Methods:
    - transform(topics_or_res): Transform DataFrame input to DataFrame output
    - transform_iter(input_iter): Transform iterator input to iterator output
    - search(query): Convenience method for single query search
    """
    def transform(self, topics_or_res: pd.DataFrame) -> pd.DataFrame: ...
    def transform_iter(self, input_iter: Iterator[Dict[str, Any]]) -> Iterator[Dict[str, Any]]: ...
    def search(self, query: str, qid: str = "1") -> pd.DataFrame: ...
    def compile(self) -> 'Transformer': ...
    def parallel(self, jobs: int = 2, backend: str = 'joblib') -> 'Transformer': ...
    def get_parameter(self, name: str) -> Any: ...
    def set_parameter(self, name: str, value: Any) -> 'Transformer': ...

    # Static methods
    @staticmethod
    def identity() -> 'Transformer': ...
    @staticmethod
    def from_df(df: pd.DataFrame, copy: bool = True) -> 'Transformer': ...

    # Pipeline operators
    def __rshift__(self, other: 'Transformer') -> 'Transformer': ...  # >>
    def __add__(self, other: 'Transformer') -> 'Transformer': ...     # +
    def __pow__(self, other: 'Transformer') -> 'Transformer': ...     # **
    def __or__(self, other: 'Transformer') -> 'Transformer': ...      # |
    def __and__(self, other: 'Transformer') -> 'Transformer': ...     # &
    def __mod__(self, cutoff: int) -> 'Transformer': ...              # %
    def __xor__(self, other: 'Transformer') -> 'Transformer': ...     # ^
    def __mul__(self, factor: float) -> 'Transformer': ...            # *
```

**Usage Examples:**

```python
# Basic pipeline composition
pipeline = retriever >> reranker >> cutoff_transformer

# Score combination
combined = system1 + system2  # Add scores

# Feature union
features = feature_extractor1 ** feature_extractor2

# Set operations
union_results = system1 | system2  # Union of retrieved documents
intersection = system1 & system2   # Intersection of retrieved documents

# Rank cutoff
top10 = retriever % 10  # Keep only top 10 results

# Result concatenation
concatenated = system1 ^ system2
```
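
To make the operator semantics concrete, here is a deliberately simplified, hypothetical sketch (plain Python lists of dicts; the class names `ToyTransformer`, `ToyScorer`, etc. are invented for illustration and are not PyTerrier's real classes) of how `>>` and `%` can build composite transformers by returning wrapper objects:

```python
# Hypothetical sketch: operator overloading returning wrapper transformers.
from typing import Any, Dict, List

Row = Dict[str, Any]


class ToyTransformer:
    def transform(self, rows: List[Row]) -> List[Row]:
        return rows

    def __rshift__(self, other: "ToyTransformer") -> "ToyTransformer":
        return ToyCompose(self, other)      # a >> b

    def __mod__(self, cutoff: int) -> "ToyTransformer":
        return ToyRankCutoff(self, cutoff)  # a % k


class ToyCompose(ToyTransformer):
    """Feeds the left transformer's output into the right one."""
    def __init__(self, left: ToyTransformer, right: ToyTransformer) -> None:
        self.left, self.right = left, right

    def transform(self, rows: List[Row]) -> List[Row]:
        return self.right.transform(self.left.transform(rows))


class ToyRankCutoff(ToyTransformer):
    """Keeps only the k highest-scored rows."""
    def __init__(self, inner: ToyTransformer, k: int) -> None:
        self.inner, self.k = inner, k

    def transform(self, rows: List[Row]) -> List[Row]:
        ranked = sorted(self.inner.transform(rows), key=lambda r: -r["score"])
        return ranked[: self.k]


class ToyScorer(ToyTransformer):
    """Scores each document by its text length."""
    def transform(self, rows: List[Row]) -> List[Row]:
        return [dict(r, score=len(r["text"])) for r in rows]


class ToyDoubler(ToyTransformer):
    """Doubles every score."""
    def transform(self, rows: List[Row]) -> List[Row]:
        return [dict(r, score=r["score"] * 2) for r in rows]


pipeline = (ToyScorer() >> ToyDoubler()) % 2  # score, double, keep top 2
results = pipeline.transform([
    {"docno": "d1", "text": "a"},
    {"docno": "d2", "text": "aaa"},
    {"docno": "d3", "text": "aa"},
])
```

The key design point this illustrates: operators construct new transformer objects rather than executing anything, so a pipeline is an inert description until `transform` is called.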

### Estimator Class

Base class for trainable transformers that can learn from training data.

```python { .api }
class Estimator(Transformer):
    """
    Base class for trainable transformers that learn from training data.

    Parameters:
    - topics_or_res_tr: Training topics (usually with documents)
    - qrels_tr: Training qrels (relevance judgments)
    - topics_or_res_va: Validation topics (usually with documents)
    - qrels_va: Validation qrels (relevance judgments)

    Returns:
    - Trained estimator instance
    """
    def fit(self, topics_or_res_tr: pd.DataFrame, qrels_tr: pd.DataFrame,
            topics_or_res_va: pd.DataFrame, qrels_va: pd.DataFrame) -> 'Estimator': ...
```

**Usage Example:**

```python
# Train a learning-to-rank model
ltr_model = SomeLearnToRankTransformer()
trained_model = ltr_model.fit(training_topics_res, training_qrels,
                              validation_topics_res, validation_qrels)

# Use trained model in pipeline
pipeline = retriever >> trained_model
```
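
The fit/transform split can be sketched with a hypothetical minimal estimator (plain Python, invented names, not the real PyTerrier base class): `fit` learns state from training results and judgments, and `transform` applies that state at inference time.

```python
# Hypothetical sketch: fit() learns one weight, transform() applies it.
from typing import Any, Dict, List


class ToyWeightedScorer:
    def __init__(self) -> None:
        self.weight = 1.0

    def fit(self, res_tr: List[Dict[str, Any]],
            qrels_tr: Dict[str, int]) -> "ToyWeightedScorer":
        # "Learn" a weight: the mean score of relevant training documents.
        rel_scores = [r["score"] for r in res_tr
                      if qrels_tr.get(r["docno"], 0) > 0]
        self.weight = sum(rel_scores) / len(rel_scores) if rel_scores else 1.0
        return self

    def transform(self, res: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        return [dict(r, score=r["score"] * self.weight) for r in res]


model = ToyWeightedScorer().fit(
    [{"docno": "d1", "score": 2.0}, {"docno": "d2", "score": 4.0}],
    {"d2": 1},  # only d2 is judged relevant
)
scored = model.transform([{"docno": "d3", "score": 1.5}])
# model.weight == 4.0, so d3's score becomes 6.0
```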

### Indexer Class

Base class for components that create searchable indexes from document collections.

```python { .api }
class Indexer(Transformer):
    """
    Base class for indexers that create searchable indexes from document collections.

    Parameters:
    - iter_dict: Iterator over documents with 'docno' and 'text' fields

    Returns:
    - IndexRef object representing the created index
    """
    def index(self, iter_dict: Iterator[Dict[str, Any]]) -> Any: ...
```

**Usage Example:**

```python
# Create an indexer for iterables of document dicts
indexer = pt.IterDictIndexer('/path/to/index')

# Index documents
documents = [
    {'docno': 'doc1', 'text': 'This is document 1'},
    {'docno': 'doc2', 'text': 'This is document 2'}
]
index_ref = indexer.index(documents)
```
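
The `Indexer` contract (consume an iterator of `{'docno', 'text'}` dicts, return an index reference) can be illustrated with a hypothetical in-memory indexer. `ToyDictIndexer` and its set-based inverted index are invented for illustration; a real PyTerrier index is an on-disk structure, not a dict.

```python
# Hypothetical sketch: build a tiny inverted index from an iter of dicts.
from collections import defaultdict
from typing import Any, Dict, Iterable, Set


class ToyDictIndexer:
    def index(self, iter_dict: Iterable[Dict[str, Any]]) -> Dict[str, Set[str]]:
        postings: Dict[str, Set[str]] = defaultdict(set)
        for doc in iter_dict:
            # Map each lowercased whitespace token to the docs containing it.
            for term in doc["text"].lower().split():
                postings[term].add(doc["docno"])
        return dict(postings)  # the "index reference"


index_ref = ToyDictIndexer().index([
    {"docno": "doc1", "text": "This is document 1"},
    {"docno": "doc2", "text": "This is document 2"},
])
# index_ref["document"] == {"doc1", "doc2"}; index_ref["2"] == {"doc2"}
```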

### Pipeline Operators

Specialized transformer classes that implement pipeline operators for combining multiple transformers.

```python { .api }
class Compose(Transformer):
    """Pipeline composition operator (>>). Chains transformers sequentially."""
    def __init__(self, *transformers: Transformer): ...
    def index(self, iter: Iterator[Dict[str, Any]], batch_size: int = None) -> Any: ...
    def transform_iter(self, inp: Iterator[Dict[str, Any]]) -> Iterator[Dict[str, Any]]: ...
    def fit(self, topics_or_res_tr: pd.DataFrame, qrels_tr: pd.DataFrame,
            topics_or_res_va: pd.DataFrame = None, qrels_va: pd.DataFrame = None) -> None: ...

class RankCutoff(Transformer):
    """Rank cutoff operator (%). Limits results to top-k documents."""
    def __init__(self, k: int = 1000): ...

class FeatureUnion(Transformer):
    """Feature union operator (**). Combines features from multiple transformers."""
    def __init__(self, *transformers: Transformer): ...

class Sum(Transformer):
    """Score addition operator (+). Adds scores from multiple transformers."""
    def __init__(self, left: Transformer, right: Transformer): ...

class SetUnion(Transformer):
    """Set union operator (|). Union of documents from multiple transformers."""
    def __init__(self, left: Transformer, right: Transformer): ...

class SetIntersection(Transformer):
    """Set intersection operator (&). Intersection of documents from multiple transformers."""
    def __init__(self, left: Transformer, right: Transformer): ...

class Concatenate(Transformer):
    """Concatenation operator (^). Concatenates results from multiple transformers."""
    def __init__(self, left: Transformer, right: Transformer): ...

class ScalarProduct(Transformer):
    """Scalar multiplication operator (*). Multiplies scores by a constant factor."""
    def __init__(self, scalar: float): ...
```

### Apply Interface

Dynamic transformer creation interface for building custom transformers from functions.

```python { .api }
# Apply interface methods accessed via pt.apply.*
def query(fn: Callable[[Union[pd.Series, Dict[str, Any]]], str], *args, **kwargs) -> Transformer: ...
def doc_score(fn: Union[Callable[[Union[pd.Series, Dict[str, Any]]], float],
                        Callable[[pd.DataFrame], Sequence[float]]],
              *args, batch_size: Optional[int] = None, **kwargs) -> Transformer: ...
def doc_features(fn: Callable[[Union[pd.Series, Dict[str, Any]]], npt.NDArray[Any]],
                 *args, **kwargs) -> Transformer: ...
def indexer(fn: Callable[[Iterator[Dict[str, Any]]], Any], **kwargs) -> Indexer: ...
def rename(columns: Dict[str, str], *args, errors: Literal['raise', 'ignore'] = 'raise', **kwargs) -> Transformer: ...
def generic(fn: Union[Callable[[pd.DataFrame], pd.DataFrame],
                      Callable[[Iterator[Dict[str, Any]]], Iterator[Dict[str, Any]]]],
            *args, batch_size: Optional[int] = None, iter: bool = False, **kwargs) -> Transformer: ...
def by_query(fn: Union[Callable[[pd.DataFrame], pd.DataFrame],
                       Callable[[Iterator[Dict[str, Any]]], Iterator[Dict[str, Any]]]],
             *args, batch_size: Optional[int] = None, iter: bool = False,
             verbose: bool = False, **kwargs) -> Transformer: ...
```

**Usage Examples:**

```python
# Create custom query transformer
query_expander = pt.apply.query(lambda q: q["query"] + " information retrieval")

# Create custom scoring transformer (row-wise)
score_booster = pt.apply.doc_score(lambda row: row["score"] * 2)

# Create custom feature transformer
feature_extractor = pt.apply.doc_features(lambda row: np.array([len(row["text"])]))

# Column renaming transformer
renamer = pt.apply.rename({'old_column': 'new_column'})

# Batch-wise scoring transformer
def batch_scorer(df):
    return df["score"] * 2

batch_score_booster = pt.apply.doc_score(batch_scorer, batch_size=128)
```

## Design Patterns

### Operator Overloading

PyTerrier's operator overloading enables intuitive pipeline construction:

- `>>`: Sequential composition (pipe operator)
- `+`: Score addition for late fusion
- `**`: Feature union for combining features
- `|`: Set union for combining document sets
- `&`: Set intersection for filtering results
- `%`: Rank cutoff for limiting results
- `^`: Result concatenation
- `*`: Score multiplication by a constant factor

### Dual API Support

Most transformers support both DataFrame and iterator interfaces:

- `transform(df)`: Process a pandas DataFrame (preferred for most use cases)
- `transform_iter(iter)`: Process an iterator of dictionaries (memory-efficient for large datasets)
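
A hypothetical toy transformer (invented `ToyUppercaser`, not part of PyTerrier) shows how the same logic can be exposed through both interfaces:

```python
# Hypothetical sketch: one transformer, DataFrame and iterator entry points.
from typing import Any, Dict, Iterator

import pandas as pd


class ToyUppercaser:
    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        # DataFrame path: vectorized, whole input held in memory.
        out = df.copy()
        out["query"] = out["query"].str.upper()
        return out

    def transform_iter(self, inp: Iterator[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
        # Iterator path: one record at a time, constant memory.
        for row in inp:
            yield dict(row, query=row["query"].upper())


t = ToyUppercaser()
df_out = t.transform(pd.DataFrame([{"qid": "1", "query": "rank fusion"}]))
iter_out = list(t.transform_iter(iter([{"qid": "1", "query": "rank fusion"}])))
# Both paths yield query == "RANK FUSION"
```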

### Parameter Management

Transformers support dynamic parameter access:

- `get_parameter(name)`: Retrieve a parameter value
- `set_parameter(name, value)`: Update a parameter value

This enables parameter tuning and grid search functionality.
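
Why a uniform parameter API enables grid search can be sketched with attribute-based stand-ins (`ToyParamTransformer` and `toy_eval` are invented for illustration; they are not PyTerrier's implementation or its evaluation tools):

```python
# Hypothetical sketch: sweep a parameter via the get/set interface.
from typing import Any


class ToyParamTransformer:
    def __init__(self, k: int = 10) -> None:
        self.k = k

    def get_parameter(self, name: str) -> Any:
        return getattr(self, name)

    def set_parameter(self, name: str, value: Any) -> "ToyParamTransformer":
        setattr(self, name, value)
        return self


def toy_eval(t: ToyParamTransformer) -> float:
    # Stand-in evaluation metric that happens to peak at k == 20.
    return -abs(t.get_parameter("k") - 20)


t = ToyParamTransformer()
# Grid search: try each candidate value, keep the best-scoring setting.
best_k = max([5, 10, 20, 50], key=lambda k: toy_eval(t.set_parameter("k", k)))
# best_k == 20
```

Because the tuner only needs `get_parameter`/`set_parameter`, the same sweep loop works for any transformer without knowing its internals.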

## Types

The apply-interface signatures above also reference `Sequence`, `Literal`, and `npt`, so those imports are included here alongside the aliases.

```python { .api }
from typing import Dict, List, Any, Iterator, Callable, Union, Optional, Sequence, Literal

import numpy.typing as npt
import pandas as pd

# Common type aliases
IterDictRecord = Dict[str, Any]
IterDict = Iterator[IterDictRecord]
TransformerLike = Union[Transformer, Callable[[pd.DataFrame], pd.DataFrame]]
```