# Text Processing

PyTerrier's text processing components provide text analysis and transformation capabilities, including stemming, tokenization, stopword removal, and text loading utilities integrated with the Terrier platform.

## Capabilities

### Stemming

Text stemming using the stemming algorithms supported by the Terrier platform.

```python { .api }
class TerrierStemmer(Transformer):
    """
    Stemming transformer using Terrier's stemming implementations.

    Parameters:
    - stemmer: Stemmer name to use (default: 'porter')
    - text_attr: Attribute name containing text to stem (default: 'text')
    """
    def __init__(self, stemmer: str = 'porter', text_attr: str = 'text'): ...
```

**Supported Stemmers:**

- `porter`: Porter stemmer (most common)
- `weak_porter`: Weak Porter stemmer
- `snowball`: Snowball stemmer
- `lovins`: Lovins stemmer
- `paice`: Paice/Husk stemmer

**Usage Examples:**

```python
# Basic Porter stemming
porter_stemmer = pt.terrier.TerrierStemmer()

# Apply stemming to query text
stemmed_queries = porter_stemmer.transform(topics)

# Use a different stemmer
snowball_stemmer = pt.terrier.TerrierStemmer(stemmer='snowball')

# Stem a custom text attribute
custom_stemmer = pt.terrier.TerrierStemmer(
    stemmer='porter',
    text_attr='custom_text'
)

# Pipeline integration
pipeline = retriever >> pt.terrier.TerrierStemmer() >> reranker
```
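
To make the row-wise behaviour concrete, here is a minimal pure-Python sketch of what a stemming transformer does to a text attribute. The suffix rules and the helper names (`toy_stem`, `stem_rows`) are illustrative inventions for this sketch, not Terrier's actual Porter implementation, and the rows are plain dicts rather than DataFrame rows:

```python
# Toy sketch: a stemming transformer maps a text field row-by-row.
# The suffix-stripping rules below are illustrative only, NOT the
# real Porter algorithm.

def toy_stem(token: str) -> str:
    """Strip a few common English suffixes (illustrative only)."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when enough of the token remains afterwards.
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

def stem_rows(rows, text_attr="text"):
    """Apply toy_stem to every whitespace token of the given attribute."""
    out = []
    for row in rows:
        tokens = row[text_attr].split()
        out.append({**row, text_attr: " ".join(toy_stem(t) for t in tokens)})
    return out

rows = [{"qid": "1", "text": "running dogs jumped"}]
print(stem_rows(rows))  # text becomes 'runn dog jump'
```

All other columns of each row (such as `qid`) pass through unchanged, which is the contract the real transformer follows as well.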

### Tokenization

Text tokenization for splitting text into tokens using Terrier's tokenizer implementations.

```python { .api }
class TerrierTokeniser(Transformer):
    """
    Tokenization transformer using Terrier's tokenizer implementations.

    Parameters:
    - tokeniser: Tokenizer name or configuration (default: None, meaning Terrier's default tokenizer)
    - text_attr: Attribute name containing text to tokenize (default: 'text')
    - **kwargs: Additional tokenizer configuration options
    """
    def __init__(self, tokeniser: Optional[str] = None, text_attr: str = 'text', **kwargs): ...
```

**Tokenizer Options:**

- Default: Standard English tokenization
- `UTFTokeniser`: UTF-8 aware tokenization
- `EnglishTokeniser`: English-specific tokenization rules
- Custom tokenizer configurations
**Usage Examples:**
75
76
```python
77
# Basic tokenization
78
tokenizer = pt.terrier.TerrierTokeniser()
79
tokenized_text = tokenizer.transform(documents)
80
81
# UTF-8 tokenization for international text
82
utf_tokenizer = pt.terrier.TerrierTokeniser(tokeniser='UTFTokeniser')
83
84
# English-specific tokenization
85
english_tokenizer = pt.terrier.TerrierTokeniser(tokeniser='EnglishTokeniser')
86
87
# Custom tokenizer configuration
88
custom_tokenizer = pt.terrier.TerrierTokeniser(
89
tokeniser='EnglishTokeniser',
90
lowercase=True,
91
numbers=False
92
)
93
```
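
The essence of English-style tokenization can be sketched in a few lines: lowercase the text and treat runs of letters and digits as tokens. This is only an approximation in the spirit of `EnglishTokeniser`, not Terrier's actual implementation; `toy_tokenise` is an invented name:

```python
# Sketch of English-style tokenization: lowercase, then treat every
# run of letters/digits as a token and everything else as a separator.
# Approximation only, not Terrier's real tokenizer.
import re

def toy_tokenise(text: str, lowercase: bool = True) -> list:
    if lowercase:
        text = text.lower()
    # Keep alphanumeric runs; punctuation and whitespace split tokens.
    return re.findall(r"[A-Za-z0-9]+", text)

print(toy_tokenise("Text-Processing, in PyTerrier!"))
# ['text', 'processing', 'in', 'pyterrier']
```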

### Stopword Removal

Stopword filtering using predefined stopword lists or custom stopword sets.

```python { .api }
class TerrierStopwords(Transformer):
    """
    Stopword removal transformer using Terrier's stopword lists.

    Parameters:
    - stopwords: Stopword list name or custom list (default: 'terrier')
    - text_attr: Attribute name containing text to filter (default: 'text')
    """
    def __init__(self, stopwords: Union[str, List[str]] = 'terrier',
                 text_attr: str = 'text'): ...
```

**Predefined Stopword Lists:**

- `terrier`: Default Terrier stopword list
- `smart`: SMART stopword list
- `indri`: Indri stopword list
- `custom`: Use a custom stopword list
**Usage Examples:**
119
120
```python
121
# Basic stopword removal
122
stopword_filter = pt.terrier.TerrierStopwords()
123
filtered_text = stopword_filter.transform(documents)
124
125
# Use SMART stopword list
126
smart_filter = pt.terrier.TerrierStopwords(stopwords='smart')
127
128
# Custom stopword list
129
custom_stopwords = ['the', 'and', 'or', 'but', 'custom_word']
130
custom_filter = pt.terrier.TerrierStopwords(stopwords=custom_stopwords)
131
132
# Filter custom text attribute
133
attr_filter = pt.terrier.TerrierStopwords(
134
stopwords='smart',
135
text_attr='title'
136
)
137
```
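
Conceptually, stopword removal is just membership filtering against a set. A minimal sketch (the tiny stopword set below is illustrative; Terrier's shipped lists are far larger):

```python
# Sketch of stopword filtering: drop tokens that appear in a stopword
# set, case-insensitively. The set here is a tiny illustrative sample.

STOPWORDS = {"the", "and", "or", "but", "a", "of"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    return [t for t in tokens if t.lower() not in stopwords]

print(remove_stopwords(["the", "quick", "fox", "and", "hound"]))
# ['quick', 'fox', 'hound']
```

Using a `set` rather than a list makes each membership test O(1), which matters when filtering large collections.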

### Text Loading

Text loading utilities for reading and processing text from various sources and formats.

```python { .api }
class TerrierTextLoader(Transformer):
    """
    Text loading transformer for extracting text from documents.

    Parameters:
    - text_loader: Text loader implementation to use
    - **kwargs: Additional text loader configuration options
    """
    def __init__(self, text_loader: Optional[str] = None, **kwargs): ...

def terrier_text_loader(text_loader_spec: Optional[str] = None, **kwargs) -> 'TerrierTextLoader':
    """
    Factory function for creating text loaders.

    Parameters:
    - text_loader_spec: Text loader specification string
    - **kwargs: Additional configuration options

    Returns:
    - A configured TerrierTextLoader instance
    """
```

**Text Loader Types:**

- `txt`: Plain text files
- `pdf`: PDF document extraction
- `docx`: Microsoft Word document extraction
- `html`: HTML content extraction
- `xml`: XML content extraction

**Usage Examples:**

```python
# Basic text loading
text_loader = pt.terrier.TerrierTextLoader()

# PDF text extraction
pdf_loader = pt.terrier.terrier_text_loader('pdf')
pdf_text = pdf_loader.transform(pdf_documents)

# HTML content extraction
html_loader = pt.terrier.terrier_text_loader('html')
html_text = html_loader.transform(html_documents)

# Microsoft Word document extraction
docx_loader = pt.terrier.terrier_text_loader('docx')
```
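
The factory pattern behind a function like `terrier_text_loader` can be sketched as a lookup from spec string to loader. Everything below is a stand-in for illustration (only a `'txt'` loader is implemented here; `make_text_loader` and `load_txt` are invented names, not PyTerrier API):

```python
# Sketch of spec-string dispatch: map a loader name to a callable.
# Stand-in implementation; only 'txt' is wired up in this sketch.

def load_txt(raw: bytes) -> str:
    """Decode plain-text bytes to a string."""
    return raw.decode("utf-8")

_LOADERS = {"txt": load_txt}

def make_text_loader(spec: str):
    """Return the loader registered for `spec`, or raise ValueError."""
    try:
        return _LOADERS[spec]
    except KeyError:
        raise ValueError(f"unknown text loader spec: {spec!r}")

loader = make_text_loader("txt")
print(loader(b"hello world"))  # hello world
```

Registering loaders in a dict keeps the factory open for extension: a new format means one new entry, not a chain of `if`/`elif` branches.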

### Text Processing Protocol

Protocol interface for components that support text loading capabilities.

```python { .api }
from typing import Any, Protocol

class HasTextLoader(Protocol):
    """
    Protocol for components that support text loading functionality.
    """
    def get_text_loader(self) -> Any: ...
```
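
A `Protocol` gives structural typing: any class with a matching `get_text_loader` method satisfies `HasTextLoader`, with no inheritance required. A small self-contained demonstration (`MyRetriever` is a hypothetical class, not part of PyTerrier); `runtime_checkable` additionally enables `isinstance` checks based on method presence:

```python
# Demonstrates structural typing with typing.Protocol.
from typing import Any, Protocol, runtime_checkable

@runtime_checkable
class HasTextLoader(Protocol):
    def get_text_loader(self) -> Any: ...

class MyRetriever:
    """Hypothetical component; note it does NOT inherit HasTextLoader."""
    def get_text_loader(self):
        return "loader"

# Satisfies the protocol purely by having the right method.
print(isinstance(MyRetriever(), HasTextLoader))  # True
```

Note that `runtime_checkable` `isinstance` checks only verify that the method exists, not its signature or return type.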

## Text Processing Pipelines

### Complete Text Processing Pipeline

```python
# Comprehensive text processing pipeline
text_pipeline = (
    pt.terrier.TerrierTextLoader() >>                  # Load text content
    pt.terrier.TerrierTokeniser() >>                   # Tokenize text
    pt.terrier.TerrierStopwords(stopwords='smart') >>  # Remove stopwords
    pt.terrier.TerrierStemmer(stemmer='porter')        # Apply stemming
)

processed_documents = text_pipeline.transform(raw_documents)
```

### Query Processing Pipeline

```python
# Query preprocessing pipeline
query_processor = (
    pt.terrier.TerrierTokeniser() >>
    pt.terrier.TerrierStopwords() >>
    pt.terrier.TerrierStemmer()
)

# Apply to queries before retrieval
processed_queries = query_processor.transform(topics)
retrieval_results = retriever.transform(processed_queries)
```

### Document Processing for Indexing

```python
# Document preprocessing for indexing
doc_processor = (
    pt.terrier.TerrierTextLoader() >>
    pt.terrier.TerrierTokeniser(tokeniser='EnglishTokeniser') >>
    pt.terrier.TerrierStopwords(stopwords='terrier')
    # Note: stemming is typically done during indexing, not preprocessing
)

# Process documents before indexing
processed_docs = doc_processor.transform(document_collection)
indexer = pt.DFIndexer('/path/to/index', stemmer='porter')
index_ref = indexer.index(processed_docs)
```

## Advanced Text Processing

### Multi-Field Text Processing

```python
# Process different text fields with different settings
title_processor = pt.terrier.TerrierStemmer(
    stemmer='weak_porter',
    text_attr='title'
)

content_processor = pt.terrier.TerrierStemmer(
    stemmer='porter',
    text_attr='content'
)

# Apply different processing to different fields
processed_titles = title_processor.transform(documents)
processed_content = content_processor.transform(documents)
```

### Language-Specific Processing

```python
# Configure for non-English text
international_tokenizer = pt.terrier.TerrierTokeniser(
    tokeniser='UTFTokeniser'
)

# Custom stopwords for a specific language
spanish_stopwords = ['el', 'la', 'de', 'que', 'y', 'a', 'en', 'un', 'es', 'se']
spanish_filter = pt.terrier.TerrierStopwords(stopwords=spanish_stopwords)

# Language-specific pipeline
spanish_pipeline = (
    international_tokenizer >>
    spanish_filter >>
    pt.terrier.TerrierStemmer(stemmer='snowball')  # Snowball supports multiple languages
)
```

### Custom Text Processing

```python
# Combine with custom transformers
class CustomTextCleaner(pt.Transformer):
    def transform(self, df):
        # Custom cleaning logic: strip punctuation, then lowercase
        df = df.copy()
        df['text'] = df['text'].str.replace(r'[^\w\s]', '', regex=True)
        df['text'] = df['text'].str.lower()
        return df

# Integrated pipeline
custom_pipeline = (
    CustomTextCleaner() >>
    pt.terrier.TerrierTokeniser() >>
    pt.terrier.TerrierStopwords() >>
    pt.terrier.TerrierStemmer()
)
```

### Performance Optimization

```python
# Optimize text processing for large collections
optimized_pipeline = (
    pt.terrier.TerrierTokeniser() >>
    pt.terrier.TerrierStopwords(stopwords='smart') >>
    pt.terrier.TerrierStemmer(stemmer='porter')
).parallel(4)  # Parallel processing across 4 workers

# Batch processing for memory efficiency
batch_size = 1000
for batch in pt.model.split_df(large_document_collection, batch_size=batch_size):
    processed_batch = optimized_pipeline.transform(batch)
    # Process batch results
```
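
The batching idea is independent of PyTerrier and can be sketched with plain lists: yield fixed-size slices so that only one batch is in flight at a time (`split_batches` is an invented name; the real `split_df` does the analogous job for DataFrames):

```python
# Sketch of batch splitting for memory efficiency: yield fixed-size
# slices of a sequence. Illustrative stand-in for DataFrame splitting.

def split_batches(items, batch_size):
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

docs = list(range(10))
print([len(b) for b in split_batches(docs, 4)])  # [4, 4, 2]
```

Because `split_batches` is a generator, the caller never holds more than one batch of results at a time unless it chooses to accumulate them.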

## Integration with Retrieval

### Query-Time Processing

```python
# Process queries at retrieval time
retrieval_pipeline = (
    pt.terrier.TerrierStemmer() >>  # Stem queries
    pt.terrier.Retriever(index_ref, wmodel='BM25')
)

results = retrieval_pipeline.transform(topics)
```
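
The `>>` composition used throughout these pipelines can be illustrated with a minimal stand-in: each stage transforms the data and hands it to the next. `Stage` here is a toy class operating on strings, not PyTerrier's `Transformer` (which operates on DataFrames):

```python
# Toy illustration of `>>` pipeline composition via __rshift__.

class Stage:
    def __init__(self, fn):
        self.fn = fn

    def transform(self, data):
        return self.fn(data)

    def __rshift__(self, other):
        # Compose: run self first, then feed its output to other.
        return Stage(lambda data: other.transform(self.transform(data)))

strip = Stage(lambda s: s.strip())
lower = Stage(lambda s: s.lower())
pipeline = strip >> lower
print(pipeline.transform("  Hello World  "))  # hello world
```

Because `__rshift__` returns another `Stage`, chains of any length (`a >> b >> c`) compose left to right, which is exactly the reading order of the pipelines above.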

### Document-Time Processing

```python
# Process retrieved documents
document_pipeline = (
    pt.terrier.Retriever(index_ref) >>
    pt.text.get_text(dataset) >>    # Get full document text
    pt.terrier.TerrierStemmer() >>  # Process retrieved text
    some_reranker
)
```

## Types

```python { .api }
from typing import Any, Dict, List, Protocol, Union

# Text processing types
StemmerName = str                     # Stemmer algorithm name
TokeniserName = str                   # Tokenizer implementation name
StopwordList = Union[str, List[str]]  # Stopword list specification
TextAttribute = str                   # Column/attribute name containing text
TextLoaderSpec = str                  # Text loader specification
ProcessingConfig = Dict[str, Any]     # Text processing configuration

# Protocol types
class HasTextLoader(Protocol):
    def get_text_loader(self) -> Any: ...
```