
# Text Processing

PyTerrier's text processing components provide comprehensive text analysis and transformation capabilities, including stemming, tokenization, stopword removal, and text loading utilities integrated with the Terrier platform.

## Capabilities

### Stemming

Text stemming functionality using various stemming algorithms supported by the Terrier platform.

```python { .api }
class TerrierStemmer(Transformer):
    """
    Stemming transformer using Terrier's stemming implementations.

    Parameters:
    - stemmer: Stemmer name to use (default: 'porter')
    - text_attr: Attribute name containing text to stem (default: 'text')
    """
    def __init__(self, stemmer: str = 'porter', text_attr: str = 'text'): ...
```

**Supported Stemmers:**

- `porter`: Porter stemmer (most common)
- `weak_porter`: Weak Porter stemmer
- `snowball`: Snowball stemmer
- `lovins`: Lovins stemmer
- `paice`: Paice/Husk stemmer

**Usage Examples:**

```python
# Basic Porter stemming
porter_stemmer = pt.terrier.TerrierStemmer()

# Apply stemming to query text
stemmed_queries = porter_stemmer.transform(topics)

# Use a different stemmer
snowball_stemmer = pt.terrier.TerrierStemmer(stemmer='snowball')

# Stem a custom text attribute
custom_stemmer = pt.terrier.TerrierStemmer(
    stemmer='porter',
    text_attr='custom_text'
)

# Pipeline integration
pipeline = retriever >> pt.terrier.TerrierStemmer() >> reranker
```
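To build intuition for what the stemmers above do, here is a deliberately naive, library-free suffix stripper. This is a toy sketch only, far simpler than the real Porter algorithm, and not part of PyTerrier or Terrier:

```python
def toy_stem(token: str) -> str:
    # Toy suffix stripping to show the idea of stemming: mapping
    # inflected forms toward a shared root. Real Porter stemming
    # applies many more context-sensitive rules.
    for suffix in ('ing', 'ed', 's'):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

print([toy_stem(t) for t in ['running', 'jumped', 'cats']])
# → ['runn', 'jump', 'cat']
```

Note how even this crude version conflates `cats` and `cat`, which is the property retrieval systems exploit at both index and query time.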

### Tokenization

Text tokenization functionality for splitting text into tokens using Terrier's tokenization implementations.

```python { .api }
class TerrierTokeniser(Transformer):
    """
    Tokenization transformer using Terrier's tokenizer implementations.

    Parameters:
    - tokeniser: Tokenizer configuration or name
    - text_attr: Attribute name containing text to tokenize (default: 'text')
    - **kwargs: Additional tokenizer configuration options
    """
    def __init__(self, tokeniser: str = None, text_attr: str = 'text', **kwargs): ...
```

**Tokenizer Options:**

- Default: Standard English tokenization
- `UTFTokeniser`: UTF-8 aware tokenization
- `EnglishTokeniser`: English-specific tokenization rules
- Custom tokenizer configurations

**Usage Examples:**

```python
# Basic tokenization
tokenizer = pt.terrier.TerrierTokeniser()
tokenized_text = tokenizer.transform(documents)

# UTF-8 tokenization for international text
utf_tokenizer = pt.terrier.TerrierTokeniser(tokeniser='UTFTokeniser')

# English-specific tokenization
english_tokenizer = pt.terrier.TerrierTokeniser(tokeniser='EnglishTokeniser')

# Custom tokenizer configuration
custom_tokenizer = pt.terrier.TerrierTokeniser(
    tokeniser='EnglishTokeniser',
    lowercase=True,
    numbers=False
)
```
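For intuition, the behaviour of a basic English-style tokeniser can be sketched in plain Python. This is an illustrative approximation only, not Terrier's tokeniser; the `simple_tokenise` function and its `lowercase` flag are hypothetical:

```python
import re

def simple_tokenise(text: str, lowercase: bool = True) -> list:
    # Split on runs of non-alphanumeric characters, roughly mirroring
    # what a basic English tokeniser does with punctuation and hyphens.
    tokens = re.findall(r'[A-Za-z0-9]+', text)
    return [t.lower() for t in tokens] if lowercase else tokens

print(simple_tokenise("PyTerrier's text-processing, v2!"))
# → ['pyterrier', 's', 'text', 'processing', 'v2']
```

Note the possessive `'s` surviving as a standalone token; the UTF and English tokenisers above handle such cases with more refined rules.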

### Stopword Removal

Stopword filtering using various predefined stopword lists or custom stopword sets.

```python { .api }
class TerrierStopwords(Transformer):
    """
    Stopword removal transformer using Terrier's stopword lists.

    Parameters:
    - stopwords: Stopword list name or custom list (default: 'terrier')
    - text_attr: Attribute name containing text to filter (default: 'text')
    """
    def __init__(self, stopwords: Union[str, List[str]] = 'terrier',
                 text_attr: str = 'text'): ...
```

**Predefined Stopword Lists:**

- `terrier`: Default Terrier stopword list
- `smart`: SMART stopword list
- `indri`: Indri stopword list
- `custom`: Use a custom stopword list

**Usage Examples:**

```python
# Basic stopword removal
stopword_filter = pt.terrier.TerrierStopwords()
filtered_text = stopword_filter.transform(documents)

# Use the SMART stopword list
smart_filter = pt.terrier.TerrierStopwords(stopwords='smart')

# Custom stopword list
custom_stopwords = ['the', 'and', 'or', 'but', 'custom_word']
custom_filter = pt.terrier.TerrierStopwords(stopwords=custom_stopwords)

# Filter a custom text attribute
attr_filter = pt.terrier.TerrierStopwords(
    stopwords='smart',
    text_attr='title'
)
```

### Text Loading

Text loading utilities for reading and processing text from various sources and formats.

```python { .api }
class TerrierTextLoader(Transformer):
    """
    Text loading transformer for extracting text from documents.

    Parameters:
    - text_loader: Text loader implementation to use
    - **kwargs: Additional text loader configuration options
    """
    def __init__(self, text_loader: str = None, **kwargs): ...

def terrier_text_loader(text_loader_spec: str = None, **kwargs) -> 'TerrierTextLoader':
    """
    Factory function for creating text loaders.

    Parameters:
    - text_loader_spec: Text loader specification string
    - **kwargs: Additional configuration options

    Returns:
    - Configured TerrierTextLoader instance
    """
```

**Text Loader Types:**

- `txt`: Plain text files
- `pdf`: PDF document extraction
- `docx`: Microsoft Word document extraction
- `html`: HTML content extraction
- `xml`: XML content extraction

**Usage Examples:**

```python
# Basic text loading
text_loader = pt.terrier.TerrierTextLoader()

# PDF text extraction
pdf_loader = pt.terrier.terrier_text_loader('pdf')
pdf_text = pdf_loader.transform(pdf_documents)

# HTML content extraction
html_loader = pt.terrier.terrier_text_loader('html')
html_text = html_loader.transform(html_documents)

# Microsoft Word document extraction
docx_loader = pt.terrier.terrier_text_loader('docx')
```

### Text Processing Protocol

Protocol interface for components that support text loading capabilities.

```python { .api }
from typing import Protocol

class HasTextLoader(Protocol):
    """
    Protocol for components that support text loading functionality.
    """
    def get_text_loader(self) -> Any: ...
```
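Because `HasTextLoader` is a `typing.Protocol`, conformance is structural rather than nominal: any object exposing a matching `get_text_loader` method satisfies it, with no inheritance required. A self-contained sketch (the `PlainTextSource` class is hypothetical, for illustration only):

```python
from typing import Any, Protocol, runtime_checkable

@runtime_checkable
class HasTextLoader(Protocol):
    """Structural interface: a matching get_text_loader method is enough."""
    def get_text_loader(self) -> Any: ...

class PlainTextSource:
    # No explicit subclassing of HasTextLoader is needed.
    def get_text_loader(self) -> Any:
        return lambda path: open(path, encoding='utf-8').read()

source = PlainTextSource()
print(isinstance(source, HasTextLoader))  # → True
```

`@runtime_checkable` makes the `isinstance` check possible; note that it only verifies the method's presence, not its signature or return type.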

## Text Processing Pipelines

### Complete Text Processing Pipeline

```python
# Comprehensive text processing pipeline
text_pipeline = (
    pt.terrier.TerrierTextLoader() >>                  # Load text content
    pt.terrier.TerrierTokeniser() >>                   # Tokenize text
    pt.terrier.TerrierStopwords(stopwords='smart') >>  # Remove stopwords
    pt.terrier.TerrierStemmer(stemmer='porter')        # Apply stemming
)

processed_documents = text_pipeline.transform(raw_documents)
```

### Query Processing Pipeline

```python
# Query preprocessing pipeline
query_processor = (
    pt.terrier.TerrierTokeniser() >>
    pt.terrier.TerrierStopwords() >>
    pt.terrier.TerrierStemmer()
)

# Apply to queries before retrieval
processed_queries = query_processor.transform(topics)
retrieval_results = retriever.transform(processed_queries)
```

### Document Processing for Indexing

```python
# Document preprocessing for indexing
doc_processor = (
    pt.terrier.TerrierTextLoader() >>
    pt.terrier.TerrierTokeniser(tokeniser='EnglishTokeniser') >>
    pt.terrier.TerrierStopwords(stopwords='terrier')
    # Note: stemming is typically done during indexing, not preprocessing
)

# Process documents before indexing
processed_docs = doc_processor.transform(document_collection)
indexer = pt.DFIndexer('/path/to/index', stemmer='porter')
index_ref = indexer.index(processed_docs)
```

## Advanced Text Processing

### Multi-Field Text Processing

```python
# Process different text fields with different settings
title_processor = pt.terrier.TerrierStemmer(
    stemmer='weak_porter',
    text_attr='title'
)

content_processor = pt.terrier.TerrierStemmer(
    stemmer='porter',
    text_attr='content'
)

# Apply different processing to different fields
processed_titles = title_processor.transform(documents)
processed_content = content_processor.transform(documents)
```

### Language-Specific Processing

```python
# Configure for non-English text
international_tokenizer = pt.terrier.TerrierTokeniser(
    tokeniser='UTFTokeniser'
)

# Custom stopwords for a specific language
spanish_stopwords = ['el', 'la', 'de', 'que', 'y', 'a', 'en', 'un', 'es', 'se']
spanish_filter = pt.terrier.TerrierStopwords(stopwords=spanish_stopwords)

# Language-specific pipeline
spanish_pipeline = (
    international_tokenizer >>
    spanish_filter >>
    pt.terrier.TerrierStemmer(stemmer='snowball')  # Snowball supports multiple languages
)
```

### Custom Text Processing

```python
# Combine with custom transformers
class CustomTextCleaner(pt.Transformer):
    def transform(self, df):
        # Custom cleaning logic: strip punctuation, then lowercase
        df = df.copy()
        df['text'] = df['text'].str.replace(r'[^\w\s]', '', regex=True)
        df['text'] = df['text'].str.lower()
        return df

# Integrated pipeline
custom_pipeline = (
    CustomTextCleaner() >>
    pt.terrier.TerrierTokeniser() >>
    pt.terrier.TerrierStopwords() >>
    pt.terrier.TerrierStemmer()
)
```

### Performance Optimization

```python
# Optimize text processing for large collections
optimized_pipeline = (
    pt.terrier.TerrierTokeniser() >>
    pt.terrier.TerrierStopwords(stopwords='smart') >>
    pt.terrier.TerrierStemmer(stemmer='porter')
).parallel(jobs=4)  # Parallel processing

# Batch processing for memory efficiency
batch_size = 1000
for batch in pt.model.split_df(large_document_collection, batch_size=batch_size):
    processed_batch = optimized_pipeline.transform(batch)
    # Process batch results
```
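The batching idiom above can be understood without PyTerrier: splitting a DataFrame into fixed-size row slices is plain pandas. A minimal equivalent sketch (illustrative only, not PyTerrier's `pt.model.split_df` implementation):

```python
import pandas as pd

def split_df(df: pd.DataFrame, batch_size: int):
    # Yield successive row slices of at most batch_size rows each.
    for start in range(0, len(df), batch_size):
        yield df.iloc[start:start + batch_size]

docs = pd.DataFrame({'docno': [str(i) for i in range(10)], 'text': ['...'] * 10})
print([len(b) for b in split_df(docs, 4)])
# → [4, 4, 2]
```

Each slice is a view-like sub-DataFrame, so memory stays bounded by the batch size rather than the full collection.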

## Integration with Retrieval

### Query-Time Processing

```python
# Process queries at retrieval time
retrieval_pipeline = (
    pt.terrier.TerrierStemmer() >>  # Stem queries
    pt.terrier.Retriever(index_ref, wmodel='BM25')
)

results = retrieval_pipeline.transform(topics)
```

### Document-Time Processing

```python
# Process retrieved documents
document_pipeline = (
    pt.terrier.Retriever(index_ref) >>
    pt.text.get_text(dataset) >>    # Get full document text
    pt.terrier.TerrierStemmer() >>  # Process retrieved text
    some_reranker
)
```

## Types

```python { .api }
from typing import Any, Dict, List, Protocol, Union

import pandas as pd

# Text processing types
StemmerName = str                     # Stemmer algorithm name
TokeniserName = str                   # Tokenizer implementation name
StopwordList = Union[str, List[str]]  # Stopword list specification
TextAttribute = str                   # Column/attribute name containing text
TextLoaderSpec = str                  # Text loader specification
ProcessingConfig = Dict[str, Any]     # Text processing configuration

# Protocol types
class HasTextLoader(Protocol):
    def get_text_loader(self) -> Any: ...
```