
# Text Processing

Text tokenization, vocabulary management, and text-to-sequence conversion utilities for natural language processing. These tools handle the transformation of raw text into numerical representations suitable for neural network training.

## Capabilities

### Text Tokenization

The Tokenizer class provides comprehensive text tokenization and vocabulary management with configurable preprocessing, filtering, and encoding options.

```python { .api }
class Tokenizer:
    """
    Text tokenization utility class for vectorizing a text corpus.

    Converts text to sequences of integers or other vectorized representations.
    Maintains an internal vocabulary and word-to-index mappings.
    """

    def __init__(self, num_words=None, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                 lower=True, split=' ', char_level=False, oov_token=None,
                 document_count=0, **kwargs):
        """
        Initialize the tokenizer.

        Parameters:
        - num_words (int, optional): Maximum number of words to keep, based on frequency
        - filters (str): Characters to filter out from texts
        - lower (bool): Whether to convert texts to lowercase
        - split (str): Separator for word splitting
        - char_level (bool): Whether to use character-level tokenization
        - oov_token (str, optional): Token to replace out-of-vocabulary words
        - document_count (int): Count of documents processed (for statistics)
        """

    def fit_on_texts(self, texts):
        """
        Update the internal vocabulary based on a list of texts.

        Parameters:
        - texts (list): List of texts to fit on
        """

    def texts_to_sequences(self, texts):
        """
        Transform each text into a sequence of integers.

        Parameters:
        - texts (list): List of texts to transform

        Returns:
        - list: List of sequences (lists of integers)
        """

    def texts_to_sequences_generator(self, texts):
        """
        Generator version of texts_to_sequences.

        Parameters:
        - texts (list): List of texts to transform

        Yields:
        - list: Sequence (list of integers) for each text
        """

    def sequences_to_texts(self, sequences):
        """
        Transform sequences back into texts.

        Parameters:
        - sequences (list): List of sequences to transform

        Returns:
        - list: List of texts
        """

    def sequences_to_texts_generator(self, sequences):
        """
        Generator version of sequences_to_texts.

        Parameters:
        - sequences (list): List of sequences to transform

        Yields:
        - str: Text for each sequence
        """

    def texts_to_matrix(self, texts, mode='binary'):
        """
        Convert texts to a matrix representation.

        Parameters:
        - texts (list): List of texts to convert
        - mode (str): One of 'binary', 'count', 'tfidf', 'freq'

        Returns:
        - numpy.ndarray: Matrix representation of the texts
        """

    def sequences_to_matrix(self, sequences, mode='binary'):
        """
        Convert sequences to a matrix representation.

        Parameters:
        - sequences (list): List of sequences to convert
        - mode (str): One of 'binary', 'count', 'tfidf', 'freq'

        Returns:
        - numpy.ndarray: Matrix representation of the sequences
        """

    def fit_on_sequences(self, sequences):
        """
        Update the internal vocabulary based on a list of sequences.

        Parameters:
        - sequences (list): List of sequences to fit on
        """

    def get_config(self):
        """
        Return the tokenizer configuration as a dictionary.

        Returns:
        - dict: Configuration dictionary
        """

    def to_json(self, **kwargs):
        """
        Return a JSON string containing the tokenizer configuration.

        Returns:
        - str: JSON string of the tokenizer configuration
        """
```
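The decoding direction is not shown in the usage examples below, so here is a minimal, illustrative sketch (the sample sentences and the `<OOV>` token are assumptions, not from the original docs): after fitting, `sequences_to_texts` maps integer sequences back to lowercased, filter-stripped text.

```python
from keras_preprocessing.text import Tokenizer

# Fit on a tiny illustrative corpus
tokenizer = Tokenizer(oov_token='<OOV>')
tokenizer.fit_on_texts(['the cat sat', 'the dog barked'])

# Encode, then decode back to text (lowercased, punctuation filtered)
sequences = tokenizer.texts_to_sequences(['The cat barked!'])
print(sequences)                                # [[2, 3, 6]]
print(tokenizer.sequences_to_texts(sequences))  # ['the cat barked']
```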

### Text Preprocessing Functions

Utility functions for basic text preprocessing operations.

```python { .api }
def text_to_word_sequence(text, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                          lower=True, split=" "):
    """
    Convert text to a sequence of words (or tokens).

    Parameters:
    - text (str): Input text
    - filters (str): Characters to filter out (punctuation, etc.)
    - lower (bool): Whether to convert to lowercase
    - split (str): Separator for word splitting

    Returns:
    - list: List of words/tokens
    """
```

### Text Encoding Functions

Functions for encoding text using hashing and one-hot techniques.

```python { .api }
def one_hot(text, n, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
            lower=True, split=' '):
    """
    One-hot encode text into a list of word indexes using hashing.

    Parameters:
    - text (str): Input text
    - n (int): Size of the vocabulary (hashing space)
    - filters (str): Characters to filter out
    - lower (bool): Whether to convert to lowercase
    - split (str): Separator for word splitting

    Returns:
    - list: List of integers (word indexes)
    """

def hashing_trick(text, n, hash_function=None,
                  filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                  lower=True, split=' '):
    """
    Convert text to a sequence of indexes in a fixed-size hashing space.

    Parameters:
    - text (str): Input text
    - n (int): Size of the hashing space
    - hash_function (callable, optional): Hash function to use (default: hash())
    - filters (str): Characters to filter out
    - lower (bool): Whether to convert to lowercase
    - split (str): Separator for word splitting

    Returns:
    - list: List of integers (hashed word indexes)
    """
```
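For completeness, a short sketch of `hashing_trick` (the input string is illustrative): passing `hash_function='md5'` makes the indexes stable across Python processes, whereas the default built-in `hash` is salted per run.

```python
from keras_preprocessing.text import hashing_trick

# Indexes start at 1 (0 is reserved); collisions are possible when n is small
indexes = hashing_trick('The quick brown fox', n=100, hash_function='md5')
print(indexes)  # e.g. [54, 21, 63, 88] (one index per word, values depend on the hash)
```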

### Serialization

```python { .api }
def tokenizer_from_json(json_string):
    """
    Parse a JSON tokenizer configuration and return a tokenizer instance.

    Parameters:
    - json_string (str): JSON string containing the tokenizer configuration

    Returns:
    - Tokenizer: Tokenizer instance with the loaded configuration
    """
```
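A minimal save/load round trip, assuming a fitted tokenizer like the one in the usage examples below (the file name is illustrative): `to_json` captures both the configuration and the learned vocabulary, and `tokenizer_from_json` restores an equivalent tokenizer.

```python
import io
from keras_preprocessing.text import Tokenizer, tokenizer_from_json

tokenizer = Tokenizer(num_words=1000, oov_token='<OOV>')
tokenizer.fit_on_texts(['the quick brown fox'])

# Save the configuration and vocabulary to disk
with io.open('tokenizer.json', 'w', encoding='utf-8') as f:
    f.write(tokenizer.to_json())

# Restore an equivalent tokenizer later
with io.open('tokenizer.json', 'r', encoding='utf-8') as f:
    restored = tokenizer_from_json(f.read())

assert restored.texts_to_sequences(['the fox']) == tokenizer.texts_to_sequences(['the fox'])
```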

## Usage Examples

### Basic Tokenization

```python
from keras_preprocessing.text import Tokenizer

# Create and fit tokenizer
tokenizer = Tokenizer(num_words=1000, oov_token='<OOV>')
texts = [
    'The quick brown fox',
    'jumps over the lazy dog',
    'The dog was lazy'
]

tokenizer.fit_on_texts(texts)

# Convert texts to sequences
sequences = tokenizer.texts_to_sequences(texts)
print(sequences)
# [[2, 5, 6, 7], [8, 9, 2, 3, 4], [2, 4, 10, 3]]

# Get word index (the OOV token is assigned index 1)
print(tokenizer.word_index)
# {'<OOV>': 1, 'the': 2, 'lazy': 3, 'dog': 4, 'quick': 5, ...}
```

### Text to Matrix Conversion

```python
# Convert to binary matrix
binary_matrix = tokenizer.texts_to_matrix(texts, mode='binary')
print(binary_matrix.shape)  # (3, 1000)

# Convert to TF-IDF matrix
tfidf_matrix = tokenizer.texts_to_matrix(texts, mode='tfidf')
```
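`sequences_to_matrix` (documented above) produces the same kinds of matrices directly from integer sequences; a brief sketch continuing the example above:

```python
# Reuse the sequences from the basic tokenization example
count_matrix = tokenizer.sequences_to_matrix(sequences, mode='count')
print(count_matrix.shape)  # (3, 1000)
```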

### Simple Text Preprocessing

```python
from keras_preprocessing.text import text_to_word_sequence, one_hot

# Basic word tokenization
words = text_to_word_sequence('Hello, world! How are you?')
print(words)  # ['hello', 'world', 'how', 'are', 'you']

# One-hot encoding with hashing
encoded = one_hot('Hello world', n=1000)
print(encoded)  # e.g. [123, 456] (hash-based word indexes vary by run)
```