# PyTerrier

A comprehensive Python API for the Terrier information retrieval platform. PyTerrier enables declarative IR experimentation through composable transformer pipelines for indexing, retrieval, and evaluation, chained together with Python operators.

## Package Information

- **Package Name**: python-terrier
- **Language**: Python
- **Installation**: `pip install python-terrier`
## Core Imports

```python
import pyterrier as pt
```

Common imports when working with specific components:

```python
from pyterrier import Transformer, Estimator, Indexer
from pyterrier import Experiment, GridSearch
from pyterrier.terrier import Retriever, IndexFactory
```
## Basic Usage

```python
import pyterrier as pt
import pandas as pd

# Initialize PyTerrier (starts the Java VM)
if not pt.java.started():
    pt.java.init()

# Create a simple retrieval pipeline
bm25 = pt.terrier.Retriever.from_dataset('vaswani', 'terrier_stemmed', wmodel='BM25')

# Perform retrieval
queries = pd.DataFrame([
    {'qid': '1', 'query': 'information retrieval'},
    {'qid': '2', 'query': 'search engines'},
])

results = bm25.transform(queries)
print(results.head())

# Chain transformers using operators: fetch document text,
# then re-score the retrieved documents with PL2
dataset = pt.get_dataset('vaswani')
pipeline = bm25 >> pt.text.get_text(dataset) >> pt.text.scorer(wmodel='PL2')
results = pipeline.transform(queries)

# Run experiments with evaluation
topics = dataset.get_topics()
qrels = dataset.get_qrels()
evaluation = pt.Experiment([bm25], topics, qrels, ['map', 'ndcg'])
print(evaluation)
```
## Architecture

PyTerrier's architecture is built around several key design patterns:

- **Transformer Pipeline Pattern**: All components implement the Transformer interface, enabling composition through operators (`>>`, `+`, `**`, etc.)
- **Dual API Support**: Most components support both DataFrame (`transform()`) and iterator (`transform_iter()`) interfaces
- **Java Integration Layer**: Seamless integration with the Terrier IR platform through comprehensive Java interop
- **Declarative Experimentation**: Built-in experiment framework with statistical significance testing
- **Plugin Architecture**: Extensible through entry points and custom transformer creation
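The pipeline pattern can be sketched in plain Python: the minimal, hypothetical classes below mirror how `>>` composes transformers by feeding one stage's output DataFrame into the next. This is a simplified illustration, not PyTerrier's actual implementation.

```python
import pandas as pd

class MiniTransformer:
    """Minimal sketch of the transformer interface: DataFrame in, DataFrame out."""
    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        raise NotImplementedError

    def __rshift__(self, other):  # the >> composition operator
        return Compose(self, other)

class Compose(MiniTransformer):
    def __init__(self, left, right):
        self.left, self.right = left, right

    def transform(self, df):
        # >> simply feeds the left stage's output into the right stage
        return self.right.transform(self.left.transform(df))

class AddColumn(MiniTransformer):
    """Hypothetical stage that tags each row with a constant column."""
    def __init__(self, name, value):
        self.name, self.value = name, value

    def transform(self, df):
        return df.assign(**{self.name: self.value})

pipeline = AddColumn('a', 1) >> AddColumn('b', 2)
out = pipeline.transform(pd.DataFrame({'qid': ['1']}))
print(out.columns.tolist())  # ['qid', 'a', 'b']
```

The same shape underlies the other operators: each returns a combined transformer whose `transform` merges or unions its operands' outputs.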
## Capabilities

### Core Transformers

Base classes and pipeline operators that form the foundation of PyTerrier's transformer architecture, enabling composable information retrieval pipelines.

```python { .api }
class Transformer:
    def transform(self, topics_or_res: pd.DataFrame) -> pd.DataFrame: ...
    def transform_iter(self, input_iter) -> Iterator: ...
    def __rshift__(self, other): ...   # >> operator for composition
    def __add__(self, other): ...      # + operator for score addition
    def __pow__(self, other): ...      # ** operator for feature union
    def __or__(self, other): ...       # | operator for set union
    def __and__(self, other): ...      # & operator for set intersection

class Estimator(Transformer):
    def fit(self, topics_and_res: pd.DataFrame) -> 'Estimator': ...

class Indexer(Transformer):
    def index(self, iter_dict) -> IndexRef: ...
```

[Core Transformers](./transformers.md)
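A custom stage is written by overriding `transform`. The top-k cut below is a hypothetical example; it is shown as a plain class so it runs standalone, but in PyTerrier you would subclass `pt.Transformer`.

```python
import pandas as pd

class TopK:
    """Keep only the k highest-scoring rows per query (hypothetical example).
    In PyTerrier this class would subclass pt.Transformer."""
    def __init__(self, k: int):
        self.k = k

    def transform(self, res: pd.DataFrame) -> pd.DataFrame:
        # Results frames carry qid/docno/score; cut each query's ranking at k
        out = (res.sort_values(['qid', 'score'], ascending=[True, False])
                  .groupby('qid', group_keys=False)
                  .head(self.k)
                  .reset_index(drop=True))
        # Recompute the rank column, 0-based as in PyTerrier result frames
        out['rank'] = out.groupby('qid').cumcount()
        return out

res = pd.DataFrame({
    'qid':   ['1', '1', '1', '2', '2'],
    'docno': ['d1', 'd2', 'd3', 'd4', 'd5'],
    'score': [0.2, 0.9, 0.5, 0.7, 0.1],
})
cut = TopK(2).transform(res)
print(cut[['qid', 'docno']].values.tolist())
# [['1', 'd2'], ['1', 'd3'], ['2', 'd4'], ['2', 'd5']]
```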
### Retrieval

Retrieval components for searching indexed collections, including various weighting models, feature extraction, and text scoring capabilities.

```python { .api }
class Retriever(Transformer):
    @staticmethod
    def from_dataset(dataset_name: str, variant: str = None, version: str = 'latest', **kwargs) -> 'Retriever': ...
    def __init__(self, index_location: Union[str, Any],
                 controls: Optional[Dict[str, str]] = None,
                 properties: Optional[Dict[str, str]] = None,
                 metadata: List[str] = ["docno"],
                 num_results: Optional[int] = None,
                 wmodel: Optional[Union[str, Callable]] = None,
                 threads: int = 1,
                 verbose: bool = False): ...

class FeaturesRetriever(Transformer):
    def __init__(self, index_location: Union[str, Any], features: List[str],
                 controls: Optional[Dict[str, str]] = None,
                 properties: Optional[Dict[str, str]] = None,
                 threads: int = 1, **kwargs): ...

class TextScorer(Transformer):
    def __init__(self, wmodel: str = 'BM25', background_index: Any = None,
                 takes: str = 'docs', body_attr: str = 'text',
                 verbose: bool = False, **kwargs): ...
```

[Retrieval](./retrieval.md)
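A re-scoring stage such as `TextScorer` consumes a results frame that already carries a text column, rewrites `score`, and re-sorts. Below is a pure-pandas sketch of that contract, with a hypothetical term-count scorer standing in for a real weighting model like BM25:

```python
import pandas as pd

def term_count_score(query: str, text: str) -> float:
    """Hypothetical stand-in for a weighting model: count query-term matches."""
    terms = query.lower().split()
    tokens = text.lower().split()
    return float(sum(tokens.count(t) for t in terms))

def rerank(res: pd.DataFrame) -> pd.DataFrame:
    # Rewrite the score column from the query and document text ...
    out = res.copy()
    out['score'] = [term_count_score(q, t) for q, t in zip(out['query'], out['text'])]
    # ... then re-sort and renumber ranks, as a TextScorer-style stage would
    out = out.sort_values(['qid', 'score'], ascending=[True, False]).reset_index(drop=True)
    out['rank'] = out.groupby('qid').cumcount()
    return out

res = pd.DataFrame({
    'qid': ['1', '1'],
    'query': ['search engines', 'search engines'],
    'docno': ['d1', 'd2'],
    'text': ['web search', 'search engines index search results'],
    'score': [2.0, 1.0],  # first-stage scores, about to be replaced
})
print(rerank(res)['docno'].tolist())  # ['d2', 'd1']
```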
### Indexing

Index creation and management functionality for building searchable collections from various document formats.

```python { .api }
class IndexFactory:
    @staticmethod
    def from_dataset(dataset_name: str) -> IndexRef: ...
    @staticmethod
    def from_trec(path: str, **kwargs) -> IndexRef: ...

class FilesIndexer(Indexer):
    def __init__(self, index_path: str, **kwargs): ...

class TRECCollectionIndexer(Indexer):
    def __init__(self, index_path: str, **kwargs): ...

class DFIndexer(Indexer):
    def __init__(self, index_path: str, **kwargs): ...
```

[Indexing](./indexing.md)
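The `Indexer.index` method consumes an iterable of dictionaries, each with at least a `docno` and a text field. A minimal sketch of building that input from raw strings follows; the field names match PyTerrier's conventions, but no index is actually built here:

```python
def corpus_iter(docs):
    """Yield PyTerrier-style iter-dict records: one dict per document."""
    for i, text in enumerate(docs):
        yield {'docno': f'd{i}', 'text': text}

records = list(corpus_iter(['an example document', 'another document']))
print(records[0])  # {'docno': 'd0', 'text': 'an example document'}
```

With PyTerrier available, such an iterable would typically be passed straight to an indexer, e.g. `indexer.index(corpus_iter(docs))`.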
### Java Integration

Java VM initialization, configuration, and integration with the underlying Terrier platform.

```python { .api }
def init(version: str = None, **kwargs) -> None: ...
def started() -> bool: ...
def configure(**kwargs) -> None: ...
def set_memory_limit(memory: str) -> None: ...
def extend_classpath(paths: List[str]) -> None: ...
def set_property(key: str, value: str) -> None: ...
```

[Java Integration](./java.md)
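VM configuration must happen before `init()`; once the VM is running, most settings are fixed. A configuration sketch using the functions above, with placeholder memory and classpath values:

```python
import pyterrier as pt

if not pt.java.started():
    # Configure the VM before it starts; these settings cannot change afterwards
    pt.java.set_memory_limit('4g')                     # placeholder value
    pt.java.extend_classpath(['/path/to/extra.jar'])   # placeholder path
    pt.java.init()

# Terrier properties can still be set after startup
pt.java.set_property('termpipelines', 'Stopwords,PorterStemmer')
```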
### Datasets

Dataset management for accessing standard IR test collections and creating custom datasets.

```python { .api }
def get_dataset(name: str) -> Dataset: ...
def find_datasets(query: str = None, **kwargs) -> List[str]: ...
def list_datasets() -> List[str]: ...

class Dataset:
    def get_topics(self, variant: str = None) -> pd.DataFrame: ...
    def get_qrels(self, variant: str = None) -> pd.DataFrame: ...
    def get_corpus_iter(self, verbose: bool = True) -> Iterator: ...
```

[Datasets](./datasets.md)
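`get_topics()` and `get_qrels()` return DataFrames with standard column names (`qid`/`query` and `qid`/`docno`/`label`). A pure-pandas sketch of how results are joined against qrels during evaluation, using hand-made frames in those shapes:

```python
import pandas as pd

# Hand-made frames in the shapes get_topics() / get_qrels() return
topics = pd.DataFrame({'qid': ['1'], 'query': ['information retrieval']})
qrels = pd.DataFrame({'qid': ['1'], 'docno': ['d1'], 'label': [1]})

# A results frame as a retriever would produce for those topics;
# d3 is unjudged (it does not appear in the qrels)
results = pd.DataFrame({'qid': ['1', '1'], 'docno': ['d3', 'd1'],
                        'score': [3.1, 2.7]})

# Evaluation joins results to judgments on (qid, docno); unjudged docs get label 0
judged = results.merge(qrels, on=['qid', 'docno'], how='left').fillna({'label': 0})
print(judged['label'].tolist())  # [0.0, 1.0]
```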
### Evaluation Framework

Comprehensive evaluation and parameter tuning framework with statistical significance testing. Note that `Experiment`, `GridSearch`, and `GridScan` are functions, not classes.

```python { .api }
def Experiment(retr_systems: List[Transformer], topics: pd.DataFrame,
               qrels: pd.DataFrame, eval_metrics: List[str], **kwargs) -> pd.DataFrame: ...

def GridSearch(pipeline: Transformer, params: Dict, topics: pd.DataFrame,
               qrels: pd.DataFrame, metric: str, **kwargs): ...

def GridScan(pipeline: Transformer, params: Dict, topics: pd.DataFrame,
             qrels: pd.DataFrame, metrics: List[str], **kwargs): ...
```

[Evaluation Framework](./evaluation.md)
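`Experiment` scores each system's results against the qrels with metrics such as `'map'`. To make the mechanics concrete, here is a hand computation of average precision for one ranking — a simplified illustration of what the metric measures, not how PyTerrier computes it internally:

```python
def average_precision(ranking, relevant):
    """AP: mean of precision@k over the ranks k where a relevant doc appears,
    divided by the total number of relevant documents."""
    hits, precisions = 0, []
    for k, docno in enumerate(ranking, start=1):
        if docno in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

# d2 is relevant at rank 1 (precision 1/1), d1 at rank 3 (precision 2/3)
ap = average_precision(['d2', 'd5', 'd1'], relevant={'d2', 'd1'})
print(ap)  # (1/1 + 2/3) / 2 = 0.8333...
```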
### Text Processing

Text processing utilities including stemming, tokenization, stopword removal, and text transformation.

```python { .api }
class TerrierStemmer(Transformer):
    def __init__(self, stemmer: str = 'porter'): ...

class TerrierTokeniser(Transformer):
    def __init__(self, **kwargs): ...

class TerrierStopwords(Transformer):
    def __init__(self, stopwords: str = 'terrier'): ...
```

[Text Processing](./text-processing.md)
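Stages like `TerrierStopwords` rewrite the `query` column of the frame flowing through the pipeline. A pure-pandas sketch of that behaviour with a tiny, hypothetical stopword list (Terrier's real list and tokenisation rules differ):

```python
import pandas as pd

STOPWORDS = {'the', 'of', 'a'}  # tiny hypothetical list

def remove_stopwords(topics: pd.DataFrame) -> pd.DataFrame:
    # Rewrite each query, dropping tokens found in the stopword list
    out = topics.copy()
    out['query'] = out['query'].map(
        lambda q: ' '.join(t for t in q.split() if t.lower() not in STOPWORDS))
    return out

topics = pd.DataFrame({'qid': ['1'],
                       'query': ['the history of information retrieval']})
print(remove_stopwords(topics)['query'][0])  # 'history information retrieval'
```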
### Utilities

Supporting utilities for DataFrame manipulation, progress tracking, I/O operations, and general helper functions.

```python { .api }
def set_tqdm(tqdm_type: str = None) -> None: ...
def coerce_dataframe(input_data) -> pd.DataFrame: ...
def add_ranks(df: pd.DataFrame, single_query: bool = False) -> pd.DataFrame: ...
def autoopen(filename: str, mode: str = 'r', **kwargs): ...
```

[Utilities](./utilities.md)
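`add_ranks` assigns a per-query `rank` column from descending `score`. A pure-pandas re-implementation of that behaviour for illustration (the real function also handles tie-breaking details not shown here):

```python
import pandas as pd

def add_ranks_sketch(df: pd.DataFrame) -> pd.DataFrame:
    # Rank 0 = highest score within each qid, mirroring PyTerrier result frames
    out = df.sort_values(['qid', 'score'],
                         ascending=[True, False]).reset_index(drop=True)
    out['rank'] = out.groupby('qid').cumcount()
    return out

res = pd.DataFrame({'qid': ['1', '1', '2'], 'docno': ['a', 'b', 'c'],
                    'score': [0.1, 0.9, 0.4]})
ranked = add_ranks_sketch(res)
print(ranked[['docno', 'rank']].values.tolist())  # [['b', 0], ['a', 1], ['c', 0]]
```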
## Types

```python { .api }
# Core type definitions used across PyTerrier
from typing import Dict, List, Any, Iterator, Union, Optional, Callable, Sequence, Literal
import pandas as pd
import numpy.typing as npt

IterDictRecord = Dict[str, Any]
IterDict = Iterator[IterDictRecord]
IndexRef = Any   # Java IndexRef object
Dataset = Any    # Dataset object
TransformerLike = Union['Transformer', Callable[[pd.DataFrame], pd.DataFrame]]
QueryInput = Union[str, Dict[str, str], pd.DataFrame]
WeightingModel = str    # Weighting model identifier (e.g., 'BM25', 'PL2')
MetricList = List[str]  # List of evaluation metrics
```