A comprehensive Python API for the Terrier information retrieval platform, enabling declarative experimentation with transformer pipelines for indexing, retrieval, and evaluation tasks.
—
PyTerrier's core transformer architecture provides the foundation for building composable information retrieval pipelines. All PyTerrier components inherit from base transformer classes that support operator overloading for intuitive pipeline construction.
The fundamental base class that all PyTerrier components inherit from, providing pipeline composition through operator overloading.
class Transformer:
    """
    Base class for all PyTerrier transformers that process dataframes or iterators.

    Core Methods:
    - transform(topics_or_res): Transform DataFrame input to DataFrame output
    - transform_iter(input_iter): Transform iterator input to iterator output
    - search(query): Convenience method for single query search
    """

    def transform(self, topics_or_res: pd.DataFrame) -> pd.DataFrame: ...
    def transform_iter(self, input_iter: Iterator[Dict[str, Any]]) -> Iterator[Dict[str, Any]]: ...
    def search(self, query: str, qid: str = "1") -> pd.DataFrame: ...
    def compile(self) -> 'Transformer': ...
    def parallel(self, jobs: int = 2, backend: str = 'joblib') -> 'Transformer': ...
    def get_parameter(self, name: str) -> Any: ...
    def set_parameter(self, name: str, value: Any) -> 'Transformer': ...

    # Static methods
    @staticmethod
    def identity() -> 'Transformer': ...
    @staticmethod
    def from_df(df: pd.DataFrame, copy: bool = True) -> 'Transformer': ...

    # Pipeline operators (each returns a combined Transformer)
    def __rshift__(self, other: 'Transformer') -> 'Transformer': ...  # >>  sequential composition
    def __add__(self, other: 'Transformer') -> 'Transformer': ...     # +   score addition
    def __pow__(self, other: 'Transformer') -> 'Transformer': ...     # **  feature union
    def __or__(self, other: 'Transformer') -> 'Transformer': ...      # |   set union
    def __and__(self, other: 'Transformer') -> 'Transformer': ...     # &   set intersection
    def __mod__(self, cutoff: int) -> 'Transformer': ...              # %   rank cutoff
    def __xor__(self, other: 'Transformer') -> 'Transformer': ...     # ^   concatenation
    def __mul__(self, factor: float) -> 'Transformer': ...            # *   scalar product

# Usage Examples:
# Basic pipeline composition
pipeline = retriever >> reranker >> cutoff_transformer
# Score combination
combined = system1 + system2  # Add scores
# Feature union
features = feature_extractor1 ** feature_extractor2
# Set operations
union_results = system1 | system2  # Union of retrieved documents
intersection = system1 & system2  # Intersection of retrieved documents
# Rank cutoff
top10 = retriever % 10  # Keep only top 10 results
# Result concatenation
concatenated = system1 ^ system2

Base class for trainable transformers that can learn from training data.
class Estimator(Transformer):
    """
    Base class for trainable transformers that learn from training data.

    Parameters:
    - topics_or_res_tr: Training topics (usually with documents)
    - qrels_tr: Training qrels (relevance judgments)
    - topics_or_res_va: Validation topics (usually with documents)
    - qrels_va: Validation qrels (relevance judgments)

    Returns:
    - Trained estimator instance
    """

    def fit(self, topics_or_res_tr: pd.DataFrame, qrels_tr: pd.DataFrame,
            topics_or_res_va: pd.DataFrame, qrels_va: pd.DataFrame) -> 'Estimator': ...

# Usage Example:
# Train a learning-to-rank model
ltr_model = SomeLearnToRankTransformer()
trained_model = ltr_model.fit(training_topics_res, training_qrels,
                              validation_topics_res, validation_qrels)
# Use trained model in pipeline
pipeline = retriever >> trained_model

Base class for components that create searchable indexes from document collections.
class Indexer(Transformer):
    """
    Base class for indexers that create searchable indexes from document collections.

    Parameters:
    - iter_dict: Iterator over documents with 'docno' and 'text' fields

    Returns:
    - IndexRef object representing the created index
    """

    def index(self, iter_dict: Iterator[Dict[str, Any]]) -> Any: ...

# Usage Example:
# Create an indexer
indexer = pt.FilesIndexer('/path/to/index')
# Index documents
documents = [
    {'docno': 'doc1', 'text': 'This is document 1'},
    {'docno': 'doc2', 'text': 'This is document 2'}
]
index_ref = indexer.index(documents)

Specialized transformer classes that implement pipeline operators for combining multiple transformers.
class Compose(Transformer):
    """Pipeline composition operator (>>). Chains transformers sequentially."""

    def __init__(self, *transformers: Transformer): ...
    # NOTE: parameter name `iter` shadows the builtin but is kept — callers may pass it by keyword.
    def index(self, iter: Iterator[Dict[str, Any]], batch_size: Optional[int] = None) -> Any: ...
    def transform_iter(self, inp: Iterator[Dict[str, Any]]) -> Iterator[Dict[str, Any]]: ...
    # Validation data is optional here (unlike Estimator.fit), hence Optional annotations.
    def fit(self, topics_or_res_tr: pd.DataFrame, qrels_tr: pd.DataFrame,
            topics_or_res_va: Optional[pd.DataFrame] = None,
            qrels_va: Optional[pd.DataFrame] = None) -> None: ...
class RankCutoff(Transformer):
    """Rank cutoff operator (%). Limits results to top-k documents."""

    def __init__(self, k: int = 1000): ...
class FeatureUnion(Transformer):
    """Feature union operator (**). Combines features from multiple transformers."""

    def __init__(self, *transformers: Transformer): ...
class Sum(Transformer):
    """Score addition operator (+). Adds scores from multiple transformers."""

    def __init__(self, left: Transformer, right: Transformer): ...
class SetUnion(Transformer):
    """Set union operator (|). Union of documents from multiple transformers."""

    def __init__(self, left: Transformer, right: Transformer): ...
class SetIntersection(Transformer):
    """Set intersection operator (&). Intersection of documents from multiple transformers."""

    def __init__(self, left: Transformer, right: Transformer): ...
class Concatenate(Transformer):
    """Concatenation operator (^). Concatenates results from multiple transformers."""

    def __init__(self, left: Transformer, right: Transformer): ...
class ScalarProduct(Transformer):
    """Scalar multiplication operator (*). Multiplies scores by a constant factor."""

    def __init__(self, scalar: float): ...

# Dynamic transformer creation interface for building custom transformers from functions.
# Apply interface methods accessed via pt.apply.*
def query(fn: Callable[[Union[pd.Series, Dict[str, Any]]], str], *args, **kwargs) -> 'Transformer':
    """Create a transformer that rewrites the query by applying *fn* to each topic row.

    Quoted return annotation matches the forward-reference style used elsewhere
    in this API and avoids eager evaluation of the Transformer name.
    """
    ...
def doc_score(fn: Union[Callable[[Union[pd.Series, Dict[str, Any]]], float],
                        Callable[[pd.DataFrame], Sequence[float]]],
              *args, batch_size: Optional[int] = None, **kwargs) -> 'Transformer':
    """Create a transformer that (re)scores documents with *fn*.

    *fn* is row-wise (row -> float), or batch-wise (DataFrame -> scores)
    when batch_size is given. Return annotation quoted as a forward reference.
    """
    ...
def doc_features(fn: Callable[[Union[pd.Series, Dict[str, Any]]], npt.NDArray[Any]],
                 *args, **kwargs) -> 'Transformer':
    """Create a transformer that computes a feature array for each document row via *fn*.

    Return annotation quoted as a forward reference to Transformer.
    """
    ...
def indexer(fn: Callable[[Iterator[Dict[str, Any]]], Any], **kwargs) -> 'Indexer':
    """Wrap *fn* (an iterator-of-documents consumer) as an Indexer.

    Return annotation quoted as a forward reference to Indexer.
    """
    ...
def rename(columns: Dict[str, str], *args, errors: Literal['raise', 'ignore'] = 'raise', **kwargs) -> 'Transformer':
    """Create a transformer that renames dataframe columns per the *columns* mapping.

    Return annotation quoted as a forward reference to Transformer.
    """
    ...
def generic(fn: Union[Callable[[pd.DataFrame], pd.DataFrame],
                      Callable[[Iterator[Dict[str, Any]]], Iterator[Dict[str, Any]]]],
            *args, batch_size: Optional[int] = None, iter: bool = False, **kwargs) -> 'Transformer':
    """Create a transformer from *fn*: DataFrame -> DataFrame, or iterator -> iterator when iter=True.

    Return annotation quoted as a forward reference to Transformer.
    """
    ...
def by_query(fn: Union[Callable[[pd.DataFrame], pd.DataFrame],
                       Callable[[Iterator[Dict[str, Any]]], Iterator[Dict[str, Any]]]],
             *args, batch_size: Optional[int] = None, iter: bool = False,
             verbose: bool = False, **kwargs) -> 'Transformer':
    """Per-query variant of generic(): applies *fn* one query at a time (same fn contract).

    Return annotation quoted as a forward reference to Transformer.
    """
    ...

# Usage Examples:
# Create custom query transformer
query_expander = pt.apply.query(lambda q: q["query"] + " information retrieval")
# Create custom scoring transformer (row-wise)
score_booster = pt.apply.doc_score(lambda row: row["score"] * 2)
# Create custom feature transformer
feature_extractor = pt.apply.doc_features(lambda row: np.array([len(row["text"])]))
# Column renaming transformer
renamer = pt.apply.rename({'old_column': 'new_column'})
# Batch-wise scoring transformer
def batch_scorer(df):
    return df["score"] * 2
batch_score_booster = pt.apply.doc_score(batch_scorer, batch_size=128)

PyTerrier's operator overloading enables intuitive pipeline construction:
- >>: Sequential composition (pipe operator)
- +: Score addition for late fusion
- **: Feature union for combining features
- |: Set union for combining document sets
- &: Set intersection for filtering results
- %: Rank cutoff for limiting results
- ^: Result concatenation
- *: Score multiplication by constant factor

Most transformers support both DataFrame and iterator interfaces:
- transform(df): Process pandas DataFrame (preferred for most use cases)
- transform_iter(iter): Process iterator of dictionaries (memory efficient for large datasets)

Transformers support dynamic parameter access:
- get_parameter(name): Retrieve parameter value
- set_parameter(name, value): Update parameter value
This enables parameter tuning and grid search functionality.
from typing import Dict, List, Any, Iterator, Callable, Union, Optional

import pandas as pd

# Common type aliases
IterDictRecord = Dict[str, Any]          # one document/result record
IterDict = Iterator[IterDictRecord]      # stream of records
# 'Transformer' is quoted (forward reference, matching the style used in the class
# definitions) so this alias does not require the class to be defined first.
TransformerLike = Union['Transformer', Callable[[pd.DataFrame], pd.DataFrame]]

# Install with Tessl CLI:
npx tessl i tessl/pypi-python-terrier