# Text Processing

Text tokenization, vocabulary management, and text-to-sequence conversion utilities for natural language processing. These tools handle the transformation of raw text into numerical representations suitable for neural network training.

## Capabilities

### Text Tokenization

The Tokenizer class provides comprehensive text tokenization and vocabulary management with configurable preprocessing, filtering, and encoding options.
```python { .api }
class Tokenizer:
    """
    Text tokenization utility class for vectorizing a text corpus.

    Converts text to sequences of integers or other vectorized representations.
    Maintains an internal vocabulary and word-to-index mappings.
    """

    def __init__(self, num_words=None, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                 lower=True, split=' ', char_level=False, oov_token=None,
                 document_count=0, **kwargs):
        """
        Initialize the tokenizer.

        Parameters:
        - num_words (int, optional): Maximum number of words to keep, based on frequency
        - filters (str): Characters to filter out from texts
        - lower (bool): Whether to convert texts to lowercase
        - split (str): Separator for word splitting
        - char_level (bool): Whether to use character-level tokenization
        - oov_token (str, optional): Token to replace out-of-vocabulary words
        - document_count (int): Count of documents processed (for statistics)
        """

    def fit_on_texts(self, texts):
        """
        Update the internal vocabulary based on a list of texts.

        Parameters:
        - texts (list): List of texts to fit on
        """

    def texts_to_sequences(self, texts):
        """
        Transform each text to a sequence of integers.

        Parameters:
        - texts (list): List of texts to transform

        Returns:
        - list: List of sequences (lists of integers)
        """

    def texts_to_sequences_generator(self, texts):
        """
        Generator version of texts_to_sequences.

        Parameters:
        - texts (list): List of texts to transform

        Yields:
        - list: Sequence (list of integers) for each text
        """

    def sequences_to_texts(self, sequences):
        """
        Transform sequences back to texts.

        Parameters:
        - sequences (list): List of sequences to transform

        Returns:
        - list: List of texts
        """

    def sequences_to_texts_generator(self, sequences):
        """
        Generator version of sequences_to_texts.

        Parameters:
        - sequences (list): List of sequences to transform

        Yields:
        - str: Text for each sequence
        """

    def texts_to_matrix(self, texts, mode='binary'):
        """
        Convert texts to a matrix representation.

        Parameters:
        - texts (list): List of texts to convert
        - mode (str): One of 'binary', 'count', 'tfidf', 'freq'

        Returns:
        - numpy.ndarray: Matrix representation of the texts
        """

    def sequences_to_matrix(self, sequences, mode='binary'):
        """
        Convert sequences to a matrix representation.

        Parameters:
        - sequences (list): List of sequences to convert
        - mode (str): One of 'binary', 'count', 'tfidf', 'freq'

        Returns:
        - numpy.ndarray: Matrix representation of the sequences
        """

    def fit_on_sequences(self, sequences):
        """
        Update the internal vocabulary based on a list of sequences.

        Parameters:
        - sequences (list): List of sequences to fit on
        """

    def get_config(self):
        """
        Return the tokenizer configuration as a dictionary.

        Returns:
        - dict: Configuration dictionary
        """

    def to_json(self, **kwargs):
        """
        Return a JSON string containing the tokenizer configuration.

        Returns:
        - str: JSON string of the tokenizer configuration
        """
```
### Text Preprocessing Functions

Utility functions for basic text preprocessing operations.

```python { .api }
def text_to_word_sequence(text, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                          lower=True, split=" "):
    """
    Convert text to a sequence of words (or tokens).

    Parameters:
    - text (str): Input text
    - filters (str): Characters to filter out (punctuation, etc.)
    - lower (bool): Whether to convert to lowercase
    - split (str): Separator for word splitting

    Returns:
    - list: List of words/tokens
    """
```
### Text Encoding Functions

Functions for encoding text using hashing and one-hot techniques.

```python { .api }
def one_hot(text, n, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
            lower=True, split=' '):
    """
    One-hot encode text into a list of word indexes using hashing.

    Parameters:
    - text (str): Input text
    - n (int): Size of vocabulary (hashing space)
    - filters (str): Characters to filter out
    - lower (bool): Whether to convert to lowercase
    - split (str): Separator for word splitting

    Returns:
    - list: List of integers (word indexes)
    """

def hashing_trick(text, n, hash_function=None,
                  filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                  lower=True, split=' '):
    """
    Convert text to a sequence of indexes in a fixed-size hashing space.

    Parameters:
    - text (str): Input text
    - n (int): Size of the hashing space
    - hash_function (callable or str, optional): Hash function to use; defaults to
      Python's built-in hash, and 'md5' selects a stable MD5-based hash
    - filters (str): Characters to filter out
    - lower (bool): Whether to convert to lowercase
    - split (str): Separator for word splitting

    Returns:
    - list: List of integers (hashed word indexes)
    """
```
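Because Python's built-in hash is randomized between interpreter runs, the 'md5' option is the safer choice when indexes need to be reproducible. A minimal sketch (exact values depend on the hash and the vocabulary size):

```python
from keras_preprocessing.text import hashing_trick

# Map each word to an index in a fixed hashing space of size 1000.
# Indexes fall in the range [1, n); collisions are possible by design.
indexes = hashing_trick('The quick brown fox', n=1000, hash_function='md5')
print(indexes)  # four integers in [1, 1000)
```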
### Serialization

```python { .api }
def tokenizer_from_json(json_string):
    """
    Parse a JSON tokenizer configuration and return a tokenizer instance.

    Parameters:
    - json_string (str): JSON string containing the tokenizer configuration

    Returns:
    - Tokenizer: Tokenizer instance with the loaded configuration
    """
```
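A fitted tokenizer can be persisted and later restored through its JSON configuration. A minimal round-trip sketch:

```python
from keras_preprocessing.text import Tokenizer, tokenizer_from_json

tokenizer = Tokenizer(num_words=1000)
tokenizer.fit_on_texts(['The quick brown fox'])

# Serialize the fitted state to JSON and rebuild an equivalent tokenizer.
json_config = tokenizer.to_json()
restored = tokenizer_from_json(json_config)
print(restored.word_index == tokenizer.word_index)  # True
```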
## Usage Examples

### Basic Tokenization

```python
from keras_preprocessing.text import Tokenizer

# Create and fit tokenizer
tokenizer = Tokenizer(num_words=1000, oov_token='<OOV>')
texts = [
    'The quick brown fox',
    'jumps over the lazy dog',
    'The dog was lazy'
]

tokenizer.fit_on_texts(texts)

# Convert texts to sequences
sequences = tokenizer.texts_to_sequences(texts)
print(sequences)
# [[2, 5, 6, 7], [8, 9, 2, 3, 4], [2, 4, 10, 3]]

# Get word index (the OOV token is always assigned index 1)
print(tokenizer.word_index)
# {'<OOV>': 1, 'the': 2, 'lazy': 3, 'dog': 4, 'quick': 5, ...}
```
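The fitted tokenizer can also decode integer sequences back into normalized (lowercased, filtered) text, which is handy for inspecting model inputs. Continuing the example above:

```python
# Decode the integer sequences back into space-joined text
decoded = tokenizer.sequences_to_texts(sequences)
print(decoded)
# ['the quick brown fox', 'jumps over the lazy dog', 'the dog was lazy']
```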
### Text to Matrix Conversion

```python
# Convert to binary matrix
binary_matrix = tokenizer.texts_to_matrix(texts, mode='binary')
print(binary_matrix.shape)  # (3, 1000)

# Convert to TF-IDF matrix
tfidf_matrix = tokenizer.texts_to_matrix(texts, mode='tfidf')
```
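Already-encoded sequences can be turned into the same kinds of matrices with sequences_to_matrix. A brief continuation using the sequences from the first example:

```python
# 'count' mode stores raw term counts instead of 0/1 indicators
count_matrix = tokenizer.sequences_to_matrix(sequences, mode='count')
print(count_matrix.shape)  # (3, 1000)
```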
### Simple Text Preprocessing

```python
from keras_preprocessing.text import text_to_word_sequence, one_hot

# Basic word tokenization
words = text_to_word_sequence('Hello, world! How are you?')
print(words)  # ['hello', 'world', 'how', 'are', 'you']

# One-hot encoding with hashing
encoded = one_hot('Hello world', n=1000)
print(encoded)  # e.g. [123, 456] (hash-based indexes; exact values vary by run)
```