# Text Processing

Text tokenization, vocabulary management, and text-to-sequence conversion utilities for natural language processing. These tools handle the transformation of raw text into numerical representations suitable for neural network training.

## Capabilities

### Text Tokenization

The Tokenizer class provides comprehensive text tokenization and vocabulary management with configurable preprocessing, filtering, and encoding options.
```python { .api }
class Tokenizer:
    """
    Text tokenization utility class for vectorizing a text corpus.

    Converts text to sequences of integers or other vectorized representations.
    Maintains an internal vocabulary and word-to-index mappings.
    """

    def __init__(self, num_words=None, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                 lower=True, split=' ', char_level=False, oov_token=None,
                 document_count=0, **kwargs):
        """
        Initialize the tokenizer.

        Parameters:
        - num_words (int, optional): Maximum number of words to keep, based on frequency
        - filters (str): Characters to filter out from texts
        - lower (bool): Whether to convert texts to lowercase
        - split (str): Separator for word splitting
        - char_level (bool): Whether to use character-level tokenization
        - oov_token (str, optional): Token to replace out-of-vocabulary words
        - document_count (int): Count of documents processed (for statistics)
        """

    def fit_on_texts(self, texts):
        """
        Update the internal vocabulary based on a list of texts.

        Parameters:
        - texts (list): List of texts to fit on
        """

    def texts_to_sequences(self, texts):
        """
        Transform each text to a sequence of integers.

        Parameters:
        - texts (list): List of texts to transform

        Returns:
        - list: List of sequences (lists of integers)
        """

    def texts_to_sequences_generator(self, texts):
        """
        Generator version of texts_to_sequences.

        Parameters:
        - texts (list): List of texts to transform

        Yields:
        - list: Sequence (list of integers) for each text
        """

    def sequences_to_texts(self, sequences):
        """
        Transform sequences back to texts.

        Parameters:
        - sequences (list): List of sequences to transform

        Returns:
        - list: List of texts
        """

    def sequences_to_texts_generator(self, sequences):
        """
        Generator version of sequences_to_texts.

        Parameters:
        - sequences (list): List of sequences to transform

        Yields:
        - str: Text for each sequence
        """

    def texts_to_matrix(self, texts, mode='binary'):
        """
        Convert texts to a matrix representation.

        Parameters:
        - texts (list): List of texts to convert
        - mode (str): One of 'binary', 'count', 'tfidf', 'freq'

        Returns:
        - numpy.ndarray: Matrix representation of the texts
        """

    def sequences_to_matrix(self, sequences, mode='binary'):
        """
        Convert sequences to a matrix representation.

        Parameters:
        - sequences (list): List of sequences to convert
        - mode (str): One of 'binary', 'count', 'tfidf', 'freq'

        Returns:
        - numpy.ndarray: Matrix representation of the sequences
        """

    def fit_on_sequences(self, sequences):
        """
        Update the internal vocabulary based on a list of sequences.

        Parameters:
        - sequences (list): List of sequences to fit on
        """

    def get_config(self):
        """
        Return the tokenizer configuration as a dictionary.

        Returns:
        - dict: Configuration dictionary
        """

    def to_json(self, **kwargs):
        """
        Return a JSON string containing the tokenizer configuration.

        Returns:
        - str: JSON string of the tokenizer configuration
        """
```
### Text Preprocessing Functions

Utility functions for basic text preprocessing operations.

```python { .api }
def text_to_word_sequence(text, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                          lower=True, split=" "):
    """
    Convert text to a sequence of words (or tokens).

    Parameters:
    - text (str): Input text
    - filters (str): Characters to filter out (punctuation, etc.)
    - lower (bool): Whether to convert to lowercase
    - split (str): Separator for word splitting

    Returns:
    - list: List of words/tokens
    """
```
### Text Encoding Functions

Functions for encoding text using hashing and one-hot techniques.

```python { .api }
def one_hot(text, n, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
            lower=True, split=' '):
    """
    One-hot encode text into a list of word indexes using hashing.

    Parameters:
    - text (str): Input text
    - n (int): Size of vocabulary (hashing space)
    - filters (str): Characters to filter out
    - lower (bool): Whether to convert to lowercase
    - split (str): Separator for word splitting

    Returns:
    - list: List of integers (word indexes)
    """

def hashing_trick(text, n, hash_function=None,
                  filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                  lower=True, split=' '):
    """
    Convert text to a sequence of indexes in a fixed-size hashing space.

    Parameters:
    - text (str): Input text
    - n (int): Size of the hashing space
    - hash_function (callable or str, optional): Hash function to use; defaults to
      Python's built-in hash, and 'md5' selects a stable MD5-based hash
    - filters (str): Characters to filter out
    - lower (bool): Whether to convert to lowercase
    - split (str): Separator for word splitting

    Returns:
    - list: List of integers (hashed word indexes)
    """
```
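Because Python's built-in hash is randomized between interpreter runs, the 'md5' option is the safer choice when indexes need to be reproducible. A minimal sketch (exact values depend on the hash and the vocabulary size):

```python
from keras_preprocessing.text import hashing_trick

# Map each word to an index in a fixed hashing space of size 1000.
# Indexes fall in the range [1, n); collisions are possible by design.
indexes = hashing_trick('The quick brown fox', n=1000, hash_function='md5')
print(indexes)  # four integers in [1, 1000)
```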
### Serialization

```python { .api }
def tokenizer_from_json(json_string):
    """
    Parse a JSON tokenizer configuration and return a tokenizer instance.

    Parameters:
    - json_string (str): JSON string containing the tokenizer configuration

    Returns:
    - Tokenizer: Tokenizer instance with the loaded configuration
    """
```
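A fitted tokenizer can be persisted and later restored through its JSON configuration. A minimal round-trip sketch:

```python
from keras_preprocessing.text import Tokenizer, tokenizer_from_json

tokenizer = Tokenizer(num_words=1000)
tokenizer.fit_on_texts(['The quick brown fox'])

# Serialize the fitted state to JSON and rebuild an equivalent tokenizer.
json_config = tokenizer.to_json()
restored = tokenizer_from_json(json_config)
print(restored.word_index == tokenizer.word_index)  # True
```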
## Usage Examples

### Basic Tokenization

```python
from keras_preprocessing.text import Tokenizer

# Create and fit tokenizer
tokenizer = Tokenizer(num_words=1000, oov_token='<OOV>')
texts = [
    'The quick brown fox',
    'jumps over the lazy dog',
    'The dog was lazy'
]

tokenizer.fit_on_texts(texts)

# Convert texts to sequences
sequences = tokenizer.texts_to_sequences(texts)
print(sequences)
# [[2, 5, 6, 7], [8, 9, 2, 3, 4], [2, 4, 10, 3]]

# Get word index (the OOV token is always assigned index 1)
print(tokenizer.word_index)
# {'<OOV>': 1, 'the': 2, 'lazy': 3, 'dog': 4, 'quick': 5, ...}
```
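The fitted tokenizer can also decode integer sequences back into normalized (lowercased, filtered) text, which is handy for inspecting model inputs. Continuing the example above:

```python
# Decode the integer sequences back into space-joined text
decoded = tokenizer.sequences_to_texts(sequences)
print(decoded)
# ['the quick brown fox', 'jumps over the lazy dog', 'the dog was lazy']
```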
### Text to Matrix Conversion

```python
# Convert to binary matrix
binary_matrix = tokenizer.texts_to_matrix(texts, mode='binary')
print(binary_matrix.shape)  # (3, 1000)

# Convert to TF-IDF matrix
tfidf_matrix = tokenizer.texts_to_matrix(texts, mode='tfidf')
```
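Already-encoded sequences can be turned into the same kinds of matrices with sequences_to_matrix. A brief continuation using the sequences from the first example:

```python
# 'count' mode stores raw term counts instead of 0/1 indicators
count_matrix = tokenizer.sequences_to_matrix(sequences, mode='count')
print(count_matrix.shape)  # (3, 1000)
```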
### Simple Text Preprocessing

```python
from keras_preprocessing.text import text_to_word_sequence, one_hot

# Basic word tokenization
words = text_to_word_sequence('Hello, world! How are you?')
print(words)  # ['hello', 'world', 'how', 'are', 'you']

# One-hot encoding with hashing
encoded = one_hot('Hello world', n=1000)
print(encoded)  # e.g. [123, 456] (hash-based indexes; exact values vary by run)
```