# Utilities

PyTerrier's utility modules provide supporting functionality for DataFrame manipulation, I/O operations, progress tracking, debugging, and general helper functions that underpin the core information retrieval capabilities.

## Capabilities

### General Utilities (`pyterrier.utils`)

Core utility functions for progress tracking, system information, and general helper functionality.
```python { .api }
def set_tqdm(tqdm_type: Optional[str] = None) -> None:
    """
    Configure the progress bar type for PyTerrier operations.

    Parameters:
    - tqdm_type: Progress bar type ('tqdm', 'notebook', 'auto', or None to disable)
    """

def entry_points(group: str) -> List[Any]:
    """
    Get package entry points for the specified group.

    Parameters:
    - group: Entry point group name

    Returns:
    - List of entry point objects
    """

def is_windows() -> bool:
    """
    Check if running on the Windows operating system.

    Returns:
    - True if Windows, False otherwise
    """

def noop(*args, **kwargs) -> None:
    """
    No-operation function that accepts any arguments and does nothing.
    """

def once() -> Callable:
    """
    Decorator that ensures a function is executed at most once.

    Returns:
    - Decorator function
    """

def get_class_methods(cls: type) -> List[str]:
    """
    Get the list of methods defined by a class itself (not inherited).

    Parameters:
    - cls: Class to inspect

    Returns:
    - List of method names
    """

def pre_invocation_decorator(decorator: Callable) -> Callable:
    """
    Create a decorator that runs before method invocation.

    Parameters:
    - decorator: Decorator function to apply

    Returns:
    - Pre-invocation decorator
    """

def byte_count_to_human_readable(byte_count: int) -> str:
    """
    Convert a byte count to a human-readable string.

    Parameters:
    - byte_count: Number of bytes

    Returns:
    - Human-readable string (e.g., '1.2 GB', '45.3 MB')
    """

def temp_env(key: str, value: str):
    """
    Context manager that temporarily sets an environment variable.

    Parameters:
    - key: Environment variable name
    - value: Temporary value

    Returns:
    - Context manager
    """

class GeneratorLen:
    """
    Wrapper that attaches a known length to a generator.

    Provides len() support for a generator whose length is known in advance.
    """
    def __init__(self, generator: Iterator, length: int = None): ...
    def __len__(self) -> int: ...
    def __iter__(self) -> Iterator: ...
```

**Usage Examples:**

```python
# Configure progress bars
pt.utils.set_tqdm('notebook')  # For Jupyter notebooks
pt.utils.set_tqdm('tqdm')      # For the command line
pt.utils.set_tqdm(None)        # Disable progress bars

# System information
if pt.utils.is_windows():
    print("Running on Windows")

# Temporary environment variable
with pt.utils.temp_env('JAVA_HOME', '/custom/java/path'):
    pt.java.init()  # Uses the custom Java path

# Human-readable byte counts
size_str = pt.utils.byte_count_to_human_readable(1073741824)  # "1.0 GB"

# Generator with length
def doc_generator():
    for i in range(1000):
        yield {'docno': f'doc_{i}', 'text': f'Document {i}'}

gen_with_len = pt.utils.GeneratorLen(doc_generator(), 1000)
print(f"Generator length: {len(gen_with_len)}")
```
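
The `temp_env` and `byte_count_to_human_readable` helpers above can be sketched in a few lines of plain Python. This is an illustrative implementation only; PyTerrier's own code may handle edge cases and formatting differently:

```python
import contextlib
import os

@contextlib.contextmanager
def temp_env(key, value):
    # Save any existing value, set the temporary one, and restore on exit.
    sentinel = object()
    saved = os.environ.get(key, sentinel)
    os.environ[key] = value
    try:
        yield
    finally:
        if saved is sentinel:
            del os.environ[key]
        else:
            os.environ[key] = saved

def byte_count_to_human_readable(byte_count):
    # Walk up the unit ladder until the value drops below 1024.
    for unit in ('B', 'KB', 'MB', 'GB', 'TB'):
        if byte_count < 1024 or unit == 'TB':
            return f'{byte_count:.1f} {unit}'
        byte_count /= 1024
```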

### DataFrame Model Utilities (`pyterrier.model`)

Utilities for manipulating PyTerrier DataFrames and data structures.

```python { .api }
def add_ranks(df: pd.DataFrame, single_query: bool = False) -> pd.DataFrame:
    """
    Add a rank column to a DataFrame based on score values.

    Parameters:
    - df: DataFrame with a score column
    - single_query: Whether the DataFrame contains a single query (default: False)

    Returns:
    - DataFrame with an added 'rank' column
    """

def document_columns(df: pd.DataFrame) -> List[str]:
    """
    Get document-related column names from a DataFrame.

    Parameters:
    - df: DataFrame to analyze

    Returns:
    - List of document-related column names
    """

def query_columns(df: pd.DataFrame, qid: bool = True) -> List[str]:
    """
    Get query-related column names from a DataFrame.

    Parameters:
    - df: DataFrame to analyze
    - qid: Whether to include the 'qid' column (default: True)

    Returns:
    - List of query-related column names
    """

def push_queries(df: pd.DataFrame) -> pd.DataFrame:
    """
    Push query columns (rename query -> query_0, etc.).

    Parameters:
    - df: DataFrame with query columns

    Returns:
    - DataFrame with pushed query columns
    """

def push_queries_dict(input_dict: Dict[str, Any]) -> Dict[str, Any]:
    """
    Dictionary version of push_queries.

    Parameters:
    - input_dict: Dictionary with query fields

    Returns:
    - Dictionary with pushed query fields
    """

def pop_queries(df: pd.DataFrame) -> pd.DataFrame:
    """
    Pop query columns (the reverse of push_queries).

    Parameters:
    - df: DataFrame with pushed query columns

    Returns:
    - DataFrame with restored query columns
    """

def ranked_documents_to_queries(topics_and_res: pd.DataFrame) -> pd.DataFrame:
    """
    Extract the unique queries from a ranked documents DataFrame.

    Parameters:
    - topics_and_res: DataFrame with queries and results

    Returns:
    - DataFrame with unique queries
    """

def coerce_queries_dataframe(query: Union[str, Dict, pd.DataFrame]) -> pd.DataFrame:
    """
    Convert various input types to a standard queries DataFrame.

    Parameters:
    - query: Query in various formats (string, dict, DataFrame)

    Returns:
    - Standard queries DataFrame with 'qid' and 'query' columns
    """

def coerce_dataframe_types(dataframe: pd.DataFrame) -> pd.DataFrame:
    """
    Ensure proper column data types for PyTerrier DataFrames.

    Parameters:
    - dataframe: DataFrame to type-check

    Returns:
    - DataFrame with corrected types
    """

def split_df(df: pd.DataFrame, N: int = None, batch_size: int = None) -> Iterator[pd.DataFrame]:
    """
    Split a DataFrame into chunks for batch processing.

    Parameters:
    - df: DataFrame to split
    - N: Number of chunks (alternative to batch_size)
    - batch_size: Size of each chunk

    Returns:
    - Iterator of DataFrame chunks
    """

# Constants
FIRST_RANK: int = 0  # Starting rank value
STRICT_SORT: bool = False  # Whether to enforce strict sorting
```

**Usage Examples:**

```python
# Add ranks to results
results_with_ranks = pt.model.add_ranks(retrieval_results)

# Get column information
doc_cols = pt.model.document_columns(results)
query_cols = pt.model.query_columns(results)

# Query manipulation
pushed_queries = pt.model.push_queries(topics)
restored_queries = pt.model.pop_queries(pushed_queries)

# Convert various query formats
query_df = pt.model.coerce_queries_dataframe("information retrieval")
query_df = pt.model.coerce_queries_dataframe({'qid': '1', 'query': 'search'})

# Batch processing
for batch in pt.model.split_df(large_dataframe, batch_size=1000):
    processed_batch = some_transformer.transform(batch)
```
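
To make the ranking and batching semantics concrete, here is a minimal pandas sketch of how `add_ranks` and `split_df` behave, assuming ranks are assigned per query by descending score starting at `FIRST_RANK = 0`. This is an illustration, not PyTerrier's actual implementation:

```python
import pandas as pd

def add_ranks(df):
    # Rank within each query by descending score; ranks start at 0.
    df = df.copy()
    df['rank'] = (
        df.groupby('qid')['score']
          .rank(ascending=False, method='first')
          .astype(int) - 1
    )
    return df

def split_df(df, batch_size):
    # Yield successive row slices of at most batch_size rows.
    for start in range(0, len(df), batch_size):
        yield df.iloc[start:start + batch_size]

res = pd.DataFrame({
    'qid':   ['1', '1', '2'],
    'docno': ['d1', 'd2', 'd3'],
    'score': [0.2, 0.9, 0.5],
})
ranked = add_ranks(res)  # d2 gets rank 0 for qid 1, d1 gets rank 1
```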

### I/O Utilities (`pyterrier.io`)

File input/output utilities with support for various formats and compression.

```python { .api }
def autoopen(filename: str, mode: str = 'r', **kwargs):
    """
    Open a file, automatically handling compression based on its extension.

    Parameters:
    - filename: File path (supports .gz, .bz2, .xz compression)
    - mode: File opening mode ('r', 'w', 'rb', 'wb', etc.)
    - **kwargs: Additional arguments for file opening

    Returns:
    - File handle with appropriate compression handling
    """

def finalized_open(path: str, mode: str):
    """
    Context manager for atomic file writing.

    Parameters:
    - path: Target file path
    - mode: File opening mode

    Returns:
    - Context manager for atomic file writing
    """

def find_files(directory: str) -> List[str]:
    """
    Recursively find all files in a directory.

    Parameters:
    - directory: Directory path to search

    Returns:
    - List of file paths
    """

def coerce_dataframe(obj: Any) -> pd.DataFrame:
    """
    Convert various object types to a DataFrame.

    Parameters:
    - obj: Object to convert (dict, list, etc.)

    Returns:
    - Converted DataFrame
    """
```

**Usage Examples:**

```python
# Automatic compression handling
with pt.io.autoopen('data.txt.gz', 'r') as f:
    content = f.read()

with pt.io.autoopen('results.json.bz2', 'w') as f:
    json.dump(data, f)

# Atomic file writing
with pt.io.finalized_open('important_results.txt', 'w') as f:
    f.write("Critical data")  # Only written if no exceptions occur

# File discovery
all_files = pt.io.find_files('/path/to/documents')
text_files = [f for f in all_files if f.endswith('.txt')]

# DataFrame conversion
df = pt.io.coerce_dataframe([{'docno': 'doc1', 'text': 'content'}])
```
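
The atomic-write behaviour of `finalized_open` can be illustrated with a stdlib-only sketch: write to a temporary file in the target directory, then rename it over the destination only if the block completed without error. PyTerrier's real implementation may differ in detail:

```python
import contextlib
import os
import tempfile

@contextlib.contextmanager
def finalized_open(path, mode='w'):
    # Write to a temp file beside the target, then atomically rename
    # over the target only on success; on error, discard the temp file.
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname)
    f = os.fdopen(fd, mode)
    try:
        yield f
        f.close()
        os.replace(tmp, path)  # atomic rename on POSIX and Windows
    except BaseException:
        f.close()
        os.remove(tmp)
        raise
```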

### Debugging Utilities (`pyterrier.debug`)

Debugging and inspection utilities for PyTerrier pipelines.

```python { .api }
def print_columns(by_query: bool = False, message: str = None) -> Transformer:
    """
    Debug transformer that prints DataFrame column information.

    Parameters:
    - by_query: Whether to group output by query (default: False)
    - message: Optional message to print alongside the column information

    Returns:
    - Transformer that prints column info and passes the data through
    """
```

**Usage Example:**

```python
# Debug a pipeline by printing column information at each stage
debug_pipeline = (
    retriever >>
    pt.debug.print_columns(message="After retrieval") >>
    reranker >>
    pt.debug.print_columns(message="After reranking", by_query=True) >>
    (lambda df: df.head(10))  # Final cutoff
)

results = debug_pipeline.transform(topics)
```
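
The pass-through behaviour this transformer relies on is easy to show as a plain function: report the columns, return the frame unchanged. A minimal sketch, not the PyTerrier transformer itself:

```python
import pandas as pd

def print_columns(df, message=None):
    # Passthrough debug step: report the columns, return the frame unchanged.
    if message:
        print(message)
    print(list(df.columns))
    return df

df = pd.DataFrame({'qid': ['1'], 'docno': ['d1'], 'score': [1.0]})
out = print_columns(df, message='After retrieval')
```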

### DataFrame Creation Utilities (`pyterrier.new`)

Utilities for creating standard PyTerrier DataFrames.

```python { .api }
def empty_Q() -> pd.DataFrame:
    """
    Create an empty queries DataFrame with standard columns.

    Returns:
    - Empty DataFrame with 'qid' and 'query' columns
    """

def queries(queries: Union[List[str], Dict[str, str]], qid: str = None, **others) -> pd.DataFrame:
    """
    Create a queries DataFrame from various input formats.

    Parameters:
    - queries: Query data (list of strings, dict mapping qid -> query, etc.)
    - qid: Base qid for auto-generated IDs
    - **others: Additional columns to include

    Returns:
    - Standard queries DataFrame
    """

def empty_R() -> pd.DataFrame:
    """
    Create an empty results DataFrame with standard columns.

    Returns:
    - Empty DataFrame with 'qid', 'docno', 'score', 'rank' columns
    """

def ranked_documents(topics: pd.DataFrame = None, docnos: List[str] = None,
                     scores: List[float] = None, **others) -> pd.DataFrame:
    """
    Create a ranked documents DataFrame.

    Parameters:
    - topics: Topics DataFrame to associate with the documents
    - docnos: List of document IDs
    - scores: List of relevance scores
    - **others: Additional columns to include

    Returns:
    - Standard ranked documents DataFrame
    """
```

**Usage Examples:**

```python
# Create queries DataFrames
topics = pt.new.queries([
    "information retrieval",
    "search engines",
    "natural language processing"
])

topics_with_ids = pt.new.queries({
    'q1': 'machine learning',
    'q2': 'deep learning',
    'q3': 'neural networks'
})

# Create a results DataFrame
results = pt.new.ranked_documents(
    topics=topics,
    docnos=['doc1', 'doc2', 'doc3'],
    scores=[0.95, 0.87, 0.76]
)

# Empty DataFrames for initialization
empty_queries = pt.new.empty_Q()
empty_results = pt.new.empty_R()
```
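
A minimal sketch of how the two common input shapes to `pt.new.queries` (a list of query strings with auto-numbered qids, or a dict mapping qid to query) could be coerced into the standard two-column frame. The function name `make_queries` is hypothetical, and PyTerrier's actual coercion logic is more general:

```python
import pandas as pd

def make_queries(queries):
    # List input: auto-number qids from 1; dict input: keys become qids.
    if isinstance(queries, dict):
        rows = [{'qid': qid, 'query': q} for qid, q in queries.items()]
    else:
        rows = [{'qid': str(i + 1), 'query': q} for i, q in enumerate(queries)]
    return pd.DataFrame(rows, columns=['qid', 'query'])
```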

### Inspection Utilities (`pyterrier.inspect`)

Utilities for inspecting PyTerrier objects and artifacts.

```python { .api }
def artifact_type_format(artifact: Any) -> Tuple[str, str]:
    """
    Get artifact type and format information.

    Parameters:
    - artifact: Artifact object to inspect

    Returns:
    - Tuple of (type_name, format_name)
    """
```

### Learning-to-Rank Utilities (`pyterrier.ltr`)

Utilities for learning-to-rank applications and feature analysis.

```python { .api }
class AblateFeatures(Transformer):
    """
    Feature ablation transformer for analyzing feature importance.

    Systematically removes features to study their impact on ranking performance.
    """
    def __init__(self, features_to_ablate: List[str] = None): ...
```
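
Assuming a results frame with a `features` column of NumPy arrays (the usual PyTerrier feature representation), ablation amounts to zeroing the selected feature indices so a downstream learned model no longer sees those signals. An illustrative sketch, not the `AblateFeatures` implementation:

```python
import numpy as np
import pandas as pd

def ablate_features(df, indices):
    # Zero out the selected feature indices in each document's feature vector.
    df = df.copy()
    df['features'] = df['features'].apply(
        lambda v: np.where(np.isin(np.arange(len(v)), indices), 0.0, v)
    )
    return df

res = pd.DataFrame({
    'qid': ['1'], 'docno': ['d1'],
    'features': [np.array([0.3, 0.7, 0.5])],
})
ablated = ablate_features(res, [1])  # remove the second feature
```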

### Parallel Processing Utilities (`pyterrier.parallel`)

Utilities for parallel processing and distributed execution.

```python { .api }
class PoolParallelTransformer(Transformer):
    """
    Wrapper transformer for parallel execution using process pools.

    Parameters:
    - transformer: Base transformer to parallelize
    - jobs: Number of parallel processes
    - backend: Parallel backend ('joblib', 'ray')
    """
    def __init__(self, transformer: Transformer, jobs: int = 2, backend: str = 'joblib'): ...
```
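
The chunk-and-reassemble pattern behind this transformer can be sketched with the stdlib. A thread pool is used here purely to keep the illustration self-contained; the real transformer parallelizes with process pools via joblib or Ray:

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

def parallel_transform(transform, df, jobs=2, batch_size=2):
    # Split the frame into chunks, apply the transform to each chunk in
    # parallel, then reassemble the results in their original order.
    chunks = [df.iloc[i:i + batch_size] for i in range(0, len(df), batch_size)]
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        parts = list(pool.map(transform, chunks))
    return pd.concat(parts, ignore_index=True)

def double_scores(chunk):
    # A stand-in for a per-chunk transformer.
    chunk = chunk.copy()
    chunk['score'] = chunk['score'] * 2
    return chunk

df = pd.DataFrame({'qid': ['1'] * 4, 'score': [1.0, 2.0, 3.0, 4.0]})
out = parallel_transform(double_scores, df)
```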

## Types

```python { .api }
from typing import Dict, List, Any, Union, Optional, Iterator, Callable, Tuple
import pandas as pd

# Utility types
ProgressBarType = str  # Progress bar implementation name
EnvironmentVariable = Tuple[str, str]  # Environment variable key-value pair
ByteCount = int  # Number of bytes
HumanReadableSize = str  # Human-readable size string (e.g., '1.2 GB')
ColumnNames = List[str]  # List of DataFrame column names
BatchSize = int  # Batch processing size
ChunkCount = int  # Number of chunks for splitting
QueryInput = Union[str, Dict[str, str], pd.DataFrame]  # Various query input formats
DataFrameChunk = pd.DataFrame  # DataFrame chunk for batch processing
ArtifactInfo = Tuple[str, str]  # Artifact type and format information
```