# Sequence Processing

Sequence padding, temporal data generation, and utilities for preparing sequential data for neural networks. These tools handle variable-length sequences and preprocess time series data for recurrent neural networks and other sequence models.

## Capabilities

### Sequence Padding

Utilities for converting variable-length sequences to fixed-length arrays suitable for batch processing in neural networks.

```python { .api }
def pad_sequences(sequences, maxlen=None, dtype='int32', padding='pre',
                  truncating='pre', value=0.):
    """
    Pad sequences to the same length.

    Transforms a list of num_samples sequences (lists of integers) into a 2D
    numpy array of shape (num_samples, num_timesteps). Sequences shorter than
    num_timesteps are padded with value; sequences longer than num_timesteps
    are truncated.

    Parameters:
    - sequences (list): List of lists, where each element is a sequence
    - maxlen (int, optional): Maximum length of all sequences. If None, uses the length of the longest sequence
    - dtype (str): Type of the output sequences ('int32', 'float32', etc.)
    - padding (str): 'pre' or 'post' - pad either before or after each sequence
    - truncating (str): 'pre' or 'post' - remove values from the beginning or the end of sequences longer than maxlen
    - value (float or str): Padding value

    Returns:
    - numpy.ndarray: Array with shape (len(sequences), maxlen)
    """
```
### Time Series Generation

Generator class for creating batches of temporal data from continuous sequences.

```python { .api }
class TimeseriesGenerator:
    """
    Utility class for generating batches of temporal data.

    Creates overlapping sequences from continuous time series data for training
    sequence models. Handles sampling, stride, shuffling, and batch generation.
    """

    def __init__(self, data, targets, length, sampling_rate=1, stride=1,
                 start_index=0, end_index=None, shuffle=False, reverse=False,
                 batch_size=128):
        """
        Initialize timeseries generator.

        Parameters:
        - data (numpy.ndarray): Time series data
        - targets (numpy.ndarray): Target values corresponding to data
        - length (int): Length of input sequences
        - sampling_rate (int): Period between successive individual timesteps
        - stride (int): Period between successive sequences
        - start_index (int): Data points earlier than start_index will not be used
        - end_index (int, optional): Data points later than end_index will not be used
        - shuffle (bool): Whether to shuffle the rows at each epoch
        - reverse (bool): Whether to reverse the temporal order of sequences
        - batch_size (int): Number of timeseries samples in each batch
        """

    def __len__(self):
        """
        Return number of batches in the generator.

        Returns:
        - int: Number of batches
        """

    def __getitem__(self, index):
        """
        Get batch at specified index.

        Parameters:
        - index (int): Batch index

        Returns:
        - tuple: (samples, targets) - batch of sequences and corresponding targets
        """

    def get_config(self):
        """
        Return generator configuration as dictionary.

        Returns:
        - dict: Configuration dictionary
        """

    def to_json(self, **kwargs):
        """
        Return JSON string containing generator configuration.

        Returns:
        - str: JSON string of generator configuration
        """
```
### Skipgram Generation

Utilities for generating skipgram word pairs for word2vec training.

```python { .api }
def skipgrams(sequence, vocabulary_size, window_size=4, negative_samples=1.,
              shuffle=True, categorical=False, sampling_table=None, seed=None):
    """
    Generate skipgram word pairs for word2vec training.

    Creates (word, context) pairs and (word, random_word) negative samples
    from a sequence of word indexes.

    Parameters:
    - sequence (list): Sequence of word indexes
    - vocabulary_size (int): Size of vocabulary
    - window_size (int): Maximum distance between current and predicted word
    - negative_samples (float): Ratio of negative samples to positive samples
    - shuffle (bool): Whether to shuffle word couples before returning
    - categorical (bool): Whether to return categorical labels
    - sampling_table (numpy.ndarray, optional): Probability table for sampling
    - seed (int, optional): Random seed

    Returns:
    - tuple: (couples, labels) where couples is a list of word pairs and labels
      indicates positive (1) or negative (0) samples
    """

def make_sampling_table(size, sampling_factor=1e-5):
    """
    Generate a word rank-based probabilistic sampling table for skipgrams.

    Creates sampling probabilities based on word frequency ranks, used to
    downsample frequent words in skipgram generation.

    Parameters:
    - size (int): Size of vocabulary
    - sampling_factor (float): Factor for downsampling frequent words

    Returns:
    - numpy.ndarray: Sampling probabilities for each word rank
    """
```
### Serialization

```python { .api }
def timeseries_generator_from_json(json_string):
    """
    Parse JSON timeseries generator configuration and return generator instance.

    Parameters:
    - json_string (str): JSON string containing generator configuration

    Returns:
    - TimeseriesGenerator: Generator instance with loaded configuration
    """
```
## Usage Examples

### Basic Sequence Padding

```python
from keras_preprocessing.sequence import pad_sequences

# Variable length sequences
sequences = [
    [1, 2, 3],
    [1, 2, 3, 4, 5],
    [1, 2]
]

# Pad to same length (default: pre-padding with zeros)
padded = pad_sequences(sequences, maxlen=5)
print(padded)
# [[0 0 1 2 3]
#  [1 2 3 4 5]
#  [0 0 0 1 2]]

# Post-padding
padded_post = pad_sequences(sequences, maxlen=5, padding='post')
print(padded_post)
# [[1 2 3 0 0]
#  [1 2 3 4 5]
#  [1 2 0 0 0]]

# Truncation
long_sequences = [[1, 2, 3, 4, 5, 6, 7]]
truncated = pad_sequences(long_sequences, maxlen=5, truncating='post')
print(truncated)  # [[1 2 3 4 5]]
```
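The padding value and output dtype can also be overridden; a small sketch (reusing `sequences` and the import from the block above) with a `-1.0` sentinel and a float dtype:

```python
# Pad with a sentinel value and a float dtype
padded_float = pad_sequences(sequences, maxlen=5, value=-1.0, dtype='float32')
print(padded_float)
# Expected (formatting approximate):
# [[-1. -1.  1.  2.  3.]
#  [ 1.  2.  3.  4.  5.]
#  [-1. -1. -1.  1.  2.]]
```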
### Time Series Data Generation

```python
import numpy as np
from keras_preprocessing.sequence import TimeseriesGenerator

# Create sample time series data
data = np.array([i for i in range(50)])  # [0, 1, 2, ..., 49]
targets = data  # For autoregression, targets can be the same as data

# Create generator for sequences of length 10
generator = TimeseriesGenerator(
    data=data,
    targets=targets,
    length=10,
    batch_size=6,
    sampling_rate=1,
    stride=1
)

print(f"Number of batches: {len(generator)}")  # 7

# Get first batch
batch_x, batch_y = generator[0]
print(f"Batch shape: {batch_x.shape}")  # (6, 10)
print(f"Target shape: {batch_y.shape}")  # (6,)

# First sequence: data[0:10] -> target[10]
print(f"First sequence: {batch_x[0]} -> {batch_y[0]}")
# [0 1 2 3 4 5 6 7 8 9] -> 10
```
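The batch count follows from the window arithmetic: with 50 points and a window length of 10 there are 40 possible (window, next-value) pairs, and at `batch_size=6` that gives 7 batches, the last one partial. A quick sanity check (a sketch assuming `stride=1` and the default start/end indices, reusing `data` and `generator` from above):

```python
import math

n_samples = len(data) - 10       # 40 windows of length 10 when stride=1
print(math.ceil(n_samples / 6))  # 7, matching len(generator)
print(generator[6][0].shape)     # (4, 10) - the final, partial batch
```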
### Skipgram Generation for Word2Vec

```python
from keras_preprocessing.sequence import skipgrams, make_sampling_table

# Sample word sequence
sequence = [1, 2, 3, 4, 5, 2, 6, 7, 8, 9]
vocabulary_size = 10

# Generate skipgrams
couples, labels = skipgrams(
    sequence=sequence,
    vocabulary_size=vocabulary_size,
    window_size=2,
    negative_samples=1.0
)

print(f"Generated {len(couples)} word pairs")
print(f"Positive samples: {sum(labels)}")
print(f"Negative samples: {len(labels) - sum(labels)}")

# Example couples and labels
for i in range(5):
    word, context = couples[i]
    label_type = "positive" if labels[i] == 1 else "negative"
    print(f"({word}, {context}) - {label_type}")

# Create sampling table for frequent word downsampling
sampling_table = make_sampling_table(vocabulary_size)
print(f"Sampling probabilities: {sampling_table[:5]}")
```
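To apply the table, pass it to `skipgrams` via the `sampling_table` argument; `make_sampling_table` assumes word indexes are ordered by frequency (index 1 = most frequent), so position i in the table applies to word index i. A brief sketch reusing the variables above; note that with such a tiny vocabulary and the default `sampling_factor` the probabilities are very small, so most words are skipped and few (possibly zero) pairs come back, since the table is intended for realistically large vocabularies:

```python
# Downsample frequent words while generating pairs; center words with a low
# sampling probability are skipped
couples_ds, labels_ds = skipgrams(
    sequence=sequence,
    vocabulary_size=vocabulary_size,
    window_size=2,
    sampling_table=sampling_table
)
print(f"Pairs kept after downsampling: {len(couples_ds)}")
```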
### Advanced Time Series with Custom Parameters

```python
import numpy as np
from keras_preprocessing.sequence import TimeseriesGenerator

# Multi-feature time series
data = np.random.randn(100, 3)     # 100 timesteps, 3 features
targets = np.random.randn(100, 1)  # Regression targets

# Generator with stride and sampling
generator = TimeseriesGenerator(
    data=data,
    targets=targets,
    length=15,
    sampling_rate=2,  # Use every 2nd timestep within a window
    stride=3,         # Move 3 steps between successive windows
    batch_size=4,
    shuffle=True,
    reverse=False
)

# Get configuration for serialization
config = generator.get_config()
json_config = generator.to_json()
```
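The JSON string can be turned back into an equivalent generator with `timeseries_generator_from_json` (see the Serialization API above). A minimal round-trip sketch continuing from this example; note that the data and targets arrays are embedded in the JSON, so this is only practical for reasonably small arrays:

```python
from keras_preprocessing.sequence import timeseries_generator_from_json

# Rebuild the generator from its serialized configuration
restored = timeseries_generator_from_json(json_config)
print(len(restored) == len(generator))  # True
```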