# Text Processing

PyTerrier's text processing components provide text analysis and transformation capabilities, including stemming, tokenization, stopword removal, and text loading utilities integrated with the Terrier platform.

## Capabilities

### Stemming

Text stemming using the stemming algorithms supported by the Terrier platform.

```python { .api }
class TerrierStemmer(Transformer):
    """
    Stemming transformer using Terrier's stemming implementations.

    Parameters:
    - stemmer: Stemmer name to use (default: 'porter')
    - text_attr: Attribute name containing text to stem (default: 'text')
    """
    def __init__(self, stemmer: str = 'porter', text_attr: str = 'text'): ...
```

**Supported Stemmers:**

- `porter`: Porter stemmer (most common)
- `weak_porter`: Weak Porter stemmer
- `snowball`: Snowball stemmer
- `lovins`: Lovins stemmer
- `paice`: Paice/Husk stemmer

**Usage Examples:**

```python
# Basic Porter stemming
porter_stemmer = pt.terrier.TerrierStemmer()

# Apply stemming to query text
stemmed_queries = porter_stemmer.transform(topics)

# Use a different stemmer
snowball_stemmer = pt.terrier.TerrierStemmer(stemmer='snowball')

# Stem a custom text attribute
custom_stemmer = pt.terrier.TerrierStemmer(
    stemmer='porter',
    text_attr='custom_text'
)

# Pipeline integration
pipeline = retriever >> pt.terrier.TerrierStemmer() >> reranker
```
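
To make the row-wise behaviour concrete, here is a minimal pure-Python sketch of what a stemming transformer does to a text attribute. The suffix rules and the helper names (`toy_stem`, `stem_rows`) are illustrative inventions for this sketch, not Terrier's actual Porter implementation, and the rows are plain dicts rather than DataFrame rows:

```python
# Toy sketch: a stemming transformer maps a text field row-by-row.
# The suffix-stripping rules below are illustrative only, NOT the
# real Porter algorithm.

def toy_stem(token: str) -> str:
    """Strip a few common English suffixes (illustrative only)."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when enough of the token remains afterwards.
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

def stem_rows(rows, text_attr="text"):
    """Apply toy_stem to every whitespace token of the given attribute."""
    out = []
    for row in rows:
        tokens = row[text_attr].split()
        out.append({**row, text_attr: " ".join(toy_stem(t) for t in tokens)})
    return out

rows = [{"qid": "1", "text": "running dogs jumped"}]
print(stem_rows(rows))  # text becomes 'runn dog jump'
```

All other columns of each row (such as `qid`) pass through unchanged, which is the contract the real transformer follows as well.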

### Tokenization

Text tokenization for splitting text into tokens using Terrier's tokenizer implementations.

```python { .api }
class TerrierTokeniser(Transformer):
    """
    Tokenization transformer using Terrier's tokenizer implementations.

    Parameters:
    - tokeniser: Tokenizer name or configuration (default: None, meaning Terrier's default tokenizer)
    - text_attr: Attribute name containing text to tokenize (default: 'text')
    - **kwargs: Additional tokenizer configuration options
    """
    def __init__(self, tokeniser: Optional[str] = None, text_attr: str = 'text', **kwargs): ...
```

**Tokenizer Options:**

- Default: Standard English tokenization
- `UTFTokeniser`: UTF-8 aware tokenization
- `EnglishTokeniser`: English-specific tokenization rules
- Custom tokenizer configurations
**Usage Examples:**
75
76
```python
77
# Basic tokenization
78
tokenizer = pt.terrier.TerrierTokeniser()
79
tokenized_text = tokenizer.transform(documents)
80
81
# UTF-8 tokenization for international text
82
utf_tokenizer = pt.terrier.TerrierTokeniser(tokeniser='UTFTokeniser')
83
84
# English-specific tokenization
85
english_tokenizer = pt.terrier.TerrierTokeniser(tokeniser='EnglishTokeniser')
86
87
# Custom tokenizer configuration
88
custom_tokenizer = pt.terrier.TerrierTokeniser(
89
tokeniser='EnglishTokeniser',
90
lowercase=True,
91
numbers=False
92
)
93
```
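
The essence of English-style tokenization can be sketched in a few lines: lowercase the text and treat runs of letters and digits as tokens. This is only an approximation in the spirit of `EnglishTokeniser`, not Terrier's actual implementation; `toy_tokenise` is an invented name:

```python
# Sketch of English-style tokenization: lowercase, then treat every
# run of letters/digits as a token and everything else as a separator.
# Approximation only, not Terrier's real tokenizer.
import re

def toy_tokenise(text: str, lowercase: bool = True) -> list:
    if lowercase:
        text = text.lower()
    # Keep alphanumeric runs; punctuation and whitespace split tokens.
    return re.findall(r"[A-Za-z0-9]+", text)

print(toy_tokenise("Text-Processing, in PyTerrier!"))
# ['text', 'processing', 'in', 'pyterrier']
```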

### Stopword Removal

Stopword filtering using predefined stopword lists or custom stopword sets.

```python { .api }
class TerrierStopwords(Transformer):
    """
    Stopword removal transformer using Terrier's stopword lists.

    Parameters:
    - stopwords: Stopword list name or custom list (default: 'terrier')
    - text_attr: Attribute name containing text to filter (default: 'text')
    """
    def __init__(self, stopwords: Union[str, List[str]] = 'terrier',
                 text_attr: str = 'text'): ...
```

**Predefined Stopword Lists:**

- `terrier`: Default Terrier stopword list
- `smart`: SMART stopword list
- `indri`: Indri stopword list
- `custom`: Use a custom stopword list
**Usage Examples:**
119
120
```python
121
# Basic stopword removal
122
stopword_filter = pt.terrier.TerrierStopwords()
123
filtered_text = stopword_filter.transform(documents)
124
125
# Use SMART stopword list
126
smart_filter = pt.terrier.TerrierStopwords(stopwords='smart')
127
128
# Custom stopword list
129
custom_stopwords = ['the', 'and', 'or', 'but', 'custom_word']
130
custom_filter = pt.terrier.TerrierStopwords(stopwords=custom_stopwords)
131
132
# Filter custom text attribute
133
attr_filter = pt.terrier.TerrierStopwords(
134
stopwords='smart',
135
text_attr='title'
136
)
137
```
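
Conceptually, stopword removal is just membership filtering against a set. A minimal sketch (the tiny stopword set below is illustrative; Terrier's shipped lists are far larger):

```python
# Sketch of stopword filtering: drop tokens that appear in a stopword
# set, case-insensitively. The set here is a tiny illustrative sample.

STOPWORDS = {"the", "and", "or", "but", "a", "of"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    return [t for t in tokens if t.lower() not in stopwords]

print(remove_stopwords(["the", "quick", "fox", "and", "hound"]))
# ['quick', 'fox', 'hound']
```

Using a `set` rather than a list makes each membership test O(1), which matters when filtering large collections.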

### Text Loading

Text loading utilities for reading and processing text from various sources and formats.

```python { .api }
class TerrierTextLoader(Transformer):
    """
    Text loading transformer for extracting text from documents.

    Parameters:
    - text_loader: Text loader implementation to use
    - **kwargs: Additional text loader configuration options
    """
    def __init__(self, text_loader: Optional[str] = None, **kwargs): ...

def terrier_text_loader(text_loader_spec: Optional[str] = None, **kwargs) -> 'TerrierTextLoader':
    """
    Factory function for creating text loaders.

    Parameters:
    - text_loader_spec: Text loader specification string
    - **kwargs: Additional configuration options

    Returns:
    - A configured TerrierTextLoader instance
    """
```

**Text Loader Types:**

- `txt`: Plain text files
- `pdf`: PDF document extraction
- `docx`: Microsoft Word document extraction
- `html`: HTML content extraction
- `xml`: XML content extraction

**Usage Examples:**

```python
# Basic text loading
text_loader = pt.terrier.TerrierTextLoader()

# PDF text extraction
pdf_loader = pt.terrier.terrier_text_loader('pdf')
pdf_text = pdf_loader.transform(pdf_documents)

# HTML content extraction
html_loader = pt.terrier.terrier_text_loader('html')
html_text = html_loader.transform(html_documents)

# Microsoft Word document extraction
docx_loader = pt.terrier.terrier_text_loader('docx')
```
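
The factory pattern behind a function like `terrier_text_loader` can be sketched as a lookup from spec string to loader. Everything below is a stand-in for illustration (only a `'txt'` loader is implemented here; `make_text_loader` and `load_txt` are invented names, not PyTerrier API):

```python
# Sketch of spec-string dispatch: map a loader name to a callable.
# Stand-in implementation; only 'txt' is wired up in this sketch.

def load_txt(raw: bytes) -> str:
    """Decode plain-text bytes to a string."""
    return raw.decode("utf-8")

_LOADERS = {"txt": load_txt}

def make_text_loader(spec: str):
    """Return the loader registered for `spec`, or raise ValueError."""
    try:
        return _LOADERS[spec]
    except KeyError:
        raise ValueError(f"unknown text loader spec: {spec!r}")

loader = make_text_loader("txt")
print(loader(b"hello world"))  # hello world
```

Registering loaders in a dict keeps the factory open for extension: a new format means one new entry, not a chain of `if`/`elif` branches.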

### Text Processing Protocol

Protocol interface for components that support text loading capabilities.

```python { .api }
from typing import Any, Protocol

class HasTextLoader(Protocol):
    """
    Protocol for components that support text loading functionality.
    """
    def get_text_loader(self) -> Any: ...
```
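
A `Protocol` gives structural typing: any class with a matching `get_text_loader` method satisfies `HasTextLoader`, with no inheritance required. A small self-contained demonstration (`MyRetriever` is a hypothetical class, not part of PyTerrier); `runtime_checkable` additionally enables `isinstance` checks based on method presence:

```python
# Demonstrates structural typing with typing.Protocol.
from typing import Any, Protocol, runtime_checkable

@runtime_checkable
class HasTextLoader(Protocol):
    def get_text_loader(self) -> Any: ...

class MyRetriever:
    """Hypothetical component; note it does NOT inherit HasTextLoader."""
    def get_text_loader(self):
        return "loader"

# Satisfies the protocol purely by having the right method.
print(isinstance(MyRetriever(), HasTextLoader))  # True
```

Note that `runtime_checkable` `isinstance` checks only verify that the method exists, not its signature or return type.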

## Text Processing Pipelines

### Complete Text Processing Pipeline

```python
# Comprehensive text processing pipeline
text_pipeline = (
    pt.terrier.TerrierTextLoader() >>                  # Load text content
    pt.terrier.TerrierTokeniser() >>                   # Tokenize text
    pt.terrier.TerrierStopwords(stopwords='smart') >>  # Remove stopwords
    pt.terrier.TerrierStemmer(stemmer='porter')        # Apply stemming
)

processed_documents = text_pipeline.transform(raw_documents)
```

### Query Processing Pipeline

```python
# Query preprocessing pipeline
query_processor = (
    pt.terrier.TerrierTokeniser() >>
    pt.terrier.TerrierStopwords() >>
    pt.terrier.TerrierStemmer()
)

# Apply to queries before retrieval
processed_queries = query_processor.transform(topics)
retrieval_results = retriever.transform(processed_queries)
```

### Document Processing for Indexing

```python
# Document preprocessing for indexing
doc_processor = (
    pt.terrier.TerrierTextLoader() >>
    pt.terrier.TerrierTokeniser(tokeniser='EnglishTokeniser') >>
    pt.terrier.TerrierStopwords(stopwords='terrier')
    # Note: stemming is typically done during indexing, not preprocessing
)

# Process documents before indexing
processed_docs = doc_processor.transform(document_collection)
indexer = pt.DFIndexer('/path/to/index', stemmer='porter')
index_ref = indexer.index(processed_docs)
```

## Advanced Text Processing

### Multi-Field Text Processing

```python
# Process different text fields with different settings
title_processor = pt.terrier.TerrierStemmer(
    stemmer='weak_porter',
    text_attr='title'
)

content_processor = pt.terrier.TerrierStemmer(
    stemmer='porter',
    text_attr='content'
)

# Apply different processing to different fields
processed_titles = title_processor.transform(documents)
processed_content = content_processor.transform(documents)
```

### Language-Specific Processing

```python
# Configure for non-English text
international_tokenizer = pt.terrier.TerrierTokeniser(
    tokeniser='UTFTokeniser'
)

# Custom stopwords for a specific language
spanish_stopwords = ['el', 'la', 'de', 'que', 'y', 'a', 'en', 'un', 'es', 'se']
spanish_filter = pt.terrier.TerrierStopwords(stopwords=spanish_stopwords)

# Language-specific pipeline
spanish_pipeline = (
    international_tokenizer >>
    spanish_filter >>
    pt.terrier.TerrierStemmer(stemmer='snowball')  # Snowball supports multiple languages
)
```

### Custom Text Processing

```python
# Combine with custom transformers
class CustomTextCleaner(pt.Transformer):
    def transform(self, df):
        # Custom cleaning logic: strip punctuation, then lowercase
        df = df.copy()
        df['text'] = df['text'].str.replace(r'[^\w\s]', '', regex=True)
        df['text'] = df['text'].str.lower()
        return df

# Integrated pipeline
custom_pipeline = (
    CustomTextCleaner() >>
    pt.terrier.TerrierTokeniser() >>
    pt.terrier.TerrierStopwords() >>
    pt.terrier.TerrierStemmer()
)
```

### Performance Optimization

```python
# Optimize text processing for large collections
optimized_pipeline = (
    pt.terrier.TerrierTokeniser() >>
    pt.terrier.TerrierStopwords(stopwords='smart') >>
    pt.terrier.TerrierStemmer(stemmer='porter')
).parallel(4)  # Parallel processing across 4 workers

# Batch processing for memory efficiency
batch_size = 1000
for batch in pt.model.split_df(large_document_collection, batch_size=batch_size):
    processed_batch = optimized_pipeline.transform(batch)
    # Process batch results
```
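
The batching idea is independent of PyTerrier and can be sketched with plain lists: yield fixed-size slices so that only one batch is in flight at a time (`split_batches` is an invented name; the real `split_df` does the analogous job for DataFrames):

```python
# Sketch of batch splitting for memory efficiency: yield fixed-size
# slices of a sequence. Illustrative stand-in for DataFrame splitting.

def split_batches(items, batch_size):
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

docs = list(range(10))
print([len(b) for b in split_batches(docs, 4)])  # [4, 4, 2]
```

Because `split_batches` is a generator, the caller never holds more than one batch of results at a time unless it chooses to accumulate them.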

## Integration with Retrieval

### Query-Time Processing

```python
# Process queries at retrieval time
retrieval_pipeline = (
    pt.terrier.TerrierStemmer() >>  # Stem queries
    pt.terrier.Retriever(index_ref, wmodel='BM25')
)

results = retrieval_pipeline.transform(topics)
```
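
The `>>` composition used throughout these pipelines can be illustrated with a minimal stand-in: each stage transforms the data and hands it to the next. `Stage` here is a toy class operating on strings, not PyTerrier's `Transformer` (which operates on DataFrames):

```python
# Toy illustration of `>>` pipeline composition via __rshift__.

class Stage:
    def __init__(self, fn):
        self.fn = fn

    def transform(self, data):
        return self.fn(data)

    def __rshift__(self, other):
        # Compose: run self first, then feed its output to other.
        return Stage(lambda data: other.transform(self.transform(data)))

strip = Stage(lambda s: s.strip())
lower = Stage(lambda s: s.lower())
pipeline = strip >> lower
print(pipeline.transform("  Hello World  "))  # hello world
```

Because `__rshift__` returns another `Stage`, chains of any length (`a >> b >> c`) compose left to right, which is exactly the reading order of the pipelines above.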

### Document-Time Processing

```python
# Process retrieved documents
document_pipeline = (
    pt.terrier.Retriever(index_ref) >>
    pt.text.get_text(dataset) >>    # Get full document text
    pt.terrier.TerrierStemmer() >>  # Process retrieved text
    some_reranker
)
```

## Types

```python { .api }
from typing import Any, Dict, List, Protocol, Union

# Text processing types
StemmerName = str                     # Stemmer algorithm name
TokeniserName = str                   # Tokenizer implementation name
StopwordList = Union[str, List[str]]  # Stopword list specification
TextAttribute = str                   # Column/attribute name containing text
TextLoaderSpec = str                  # Text loader specification
ProcessingConfig = Dict[str, Any]     # Text processing configuration

# Protocol types
class HasTextLoader(Protocol):
    def get_text_loader(self) -> Any: ...
```