
# Dataset Loading and Iteration

Pre-built dataset loaders and iterators for common NLP datasets and data formats, designed for seamless integration with neural network training pipelines. Provides standardized access to benchmark datasets and custom data preparation utilities.

## Capabilities

### CNN Sentence Dataset Iterator

Dataset iterator specifically designed for Convolutional Neural Network sentence classification tasks, with configurable preprocessing and batching.

```java { .api }
/**
 * CNN sentence dataset iterator for sentence classification tasks.
 * Provides standardized data preparation for CNN-based text classification.
 */
public class CnnSentenceDataSetIterator {
    // CNN-specific dataset iteration with sentence-level batching and preprocessing
}
```

### Reuters News Groups Dataset

Standardized access to the Reuters News Groups dataset for document classification and text analysis benchmarking.

```java { .api }
/**
 * Reuters News Groups dataset iterator.
 * Provides access to Reuters news articles with category labels.
 */
public class ReutersNewsGroupsDataSetIterator {
    // Iterator for the Reuters dataset with automatic downloading and preprocessing
}

/**
 * Reuters News Groups dataset loader.
 * Handles downloading, extraction, and preparation of the Reuters dataset.
 */
public class ReutersNewsGroupsLoader {
    // Dataset loading utilities for Reuters news groups data
}
```

### Labeled Sentence Providers

Interface and implementations for providing labeled sentences to dataset iterators, with various data source options.

```java { .api }
/**
 * Interface for providing labeled sentences to dataset iterators.
 */
public interface LabeledSentenceProvider {

    /**
     * Get the total number of labeled sentences.
     * @return Total count of available labeled sentences
     */
    int totalNumSentences();

    /**
     * Get all available sentence labels.
     * @return List of all unique labels in the dataset
     */
    List<String> allLabels();

    /**
     * Get the sentence at a specific index.
     * @param index Index of the sentence to retrieve
     * @return Sentence string at the specified index
     */
    String sentenceAt(int index);

    /**
     * Get the label for the sentence at a specific index.
     * @param index Index of the sentence label to retrieve
     * @return Label string for the sentence at the specified index
     */
    String labelAt(int index);
}

/**
 * Collection-based labeled sentence provider.
 * Provides labeled sentences from in-memory collections.
 */
public class CollectionLabeledSentenceProvider implements LabeledSentenceProvider {

    /**
     * Create a provider from sentence and label collections.
     * @param sentences Collection of sentence strings
     * @param labels Collection of corresponding label strings
     */
    public CollectionLabeledSentenceProvider(Collection<String> sentences, Collection<String> labels);

    /**
     * Create a provider from a labeled document collection.
     * @param documents Collection of LabelledDocument instances
     */
    public CollectionLabeledSentenceProvider(Collection<LabelledDocument> documents);
}

/**
 * File-based labeled sentence provider.
 * Reads labeled sentences from the file system in various formats.
 */
public class FileLabeledSentenceProvider implements LabeledSentenceProvider {

    /**
     * Create a provider from a file with a specified format.
     * @param file File containing labeled sentences
     * @param format Format specification for parsing labeled data
     */
    public FileLabeledSentenceProvider(File file, LabeledSentenceFormat format);

    /**
     * Create a provider from a directory with label-based organization.
     * @param directory Directory containing one subdirectory per label
     */
    public FileLabeledSentenceProvider(File directory);
}
```
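As a concrete illustration of the `LabeledSentenceProvider` contract, here is a minimal in-memory implementation. The interface is restated locally so the sketch compiles without deeplearning4j on the classpath, and `ParallelListSentenceProvider` is a hypothetical name for this example, not a class in the library:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Restated locally so the sketch compiles standalone; in real code,
// implement org.deeplearning4j's LabeledSentenceProvider instead.
interface LabeledSentenceProvider {
    int totalNumSentences();
    List<String> allLabels();
    String sentenceAt(int index);
    String labelAt(int index);
}

/** A minimal provider backed by two parallel lists of equal length. */
class ParallelListSentenceProvider implements LabeledSentenceProvider {
    private final List<String> sentences;
    private final List<String> labels;

    ParallelListSentenceProvider(List<String> sentences, List<String> labels) {
        if (sentences.size() != labels.size()) {
            throw new IllegalArgumentException("sentences and labels must be the same length");
        }
        this.sentences = sentences;
        this.labels = labels;
    }

    @Override public int totalNumSentences() { return sentences.size(); }

    @Override public List<String> allLabels() {
        // LinkedHashSet deduplicates while preserving first-seen order
        return new ArrayList<>(new LinkedHashSet<>(labels));
    }

    @Override public String sentenceAt(int index) { return sentences.get(index); }
    @Override public String labelAt(int index) { return labels.get(index); }
}

class ProviderDemo {
    public static void main(String[] args) {
        LabeledSentenceProvider p = new ParallelListSentenceProvider(
                List.of("great movie", "terrible plot", "loved it"),
                List.of("positive", "negative", "positive"));
        System.out.println(p.totalNumSentences()); // 3
        System.out.println(p.allLabels());         // [positive, negative]
    }
}
```

Keeping sentences and labels in parallel lists mirrors the index-based access pattern (`sentenceAt`/`labelAt`) that the interface requires.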

### Label-Aware Data Conversion

Utilities for converting between different labeled data formats and iterator types.

```java { .api }
/**
 * Converter for label-aware data formats.
 * Handles conversion between different labeled data representations.
 */
public class LabelAwareConverter {

    /**
     * Convert a label-aware iterator to the standard format.
     * @param iterator LabelAwareIterator to convert
     * @return Converted data in the standard format
     */
    public static ConvertedData convert(LabelAwareIterator iterator);

    /**
     * Convert a labeled document collection to the provider format.
     * @param documents Collection of LabelledDocument instances
     * @return LabeledSentenceProvider for the documents
     */
    public static LabeledSentenceProvider convert(Collection<LabelledDocument> documents);
}
```

**Usage Examples:**

```java
import org.deeplearning4j.datasets.iterator.ReutersNewsGroupsDataSetIterator;
import org.deeplearning4j.datasets.loader.ReutersNewsGroupsLoader;
import org.deeplearning4j.iterator.*;
import org.deeplearning4j.iterator.provider.*;

// Reuters News Groups dataset usage
ReutersNewsGroupsDataSetIterator reutersIterator = new ReutersNewsGroupsDataSetIterator(
        32,   // batch size
        100,  // truncate length
        true, // train set
        new DefaultTokenizerFactory()
);

while (reutersIterator.hasNext()) {
    DataSet batch = reutersIterator.next();
    // Process batch for training
    System.out.println("Batch size: " + batch.numExamples());
}

// Custom labeled sentence provider from collections
Collection<String> sentences = Arrays.asList(
        "This is a positive example",
        "This is a negative example",
        "Another positive case"
);

Collection<String> labels = Arrays.asList(
        "positive",
        "negative",
        "positive"
);

LabeledSentenceProvider provider = new CollectionLabeledSentenceProvider(sentences, labels);
System.out.println("Total sentences: " + provider.totalNumSentences());
System.out.println("Available labels: " + provider.allLabels());

// Access specific sentences and labels
for (int i = 0; i < provider.totalNumSentences(); i++) {
    String sentence = provider.sentenceAt(i);
    String label = provider.labelAt(i);
    System.out.println("Sentence: " + sentence + " -> Label: " + label);
}

// File-based labeled sentence provider
File labeledDataFile = new File("labeled_data.txt");
FileLabeledSentenceProvider fileProvider = new FileLabeledSentenceProvider(
        labeledDataFile,
        LabeledSentenceFormat.TAB_SEPARATED // or another format
);

// Directory-based provider (subdirectories as labels)
File dataDirectory = new File("data/");
// Expected structure:
// data/
//   positive/
//     file1.txt
//     file2.txt
//   negative/
//     file3.txt
//     file4.txt

FileLabeledSentenceProvider dirProvider = new FileLabeledSentenceProvider(dataDirectory);

// CNN sentence dataset iterator configuration
LabeledSentenceProvider sentenceProvider = new CollectionLabeledSentenceProvider(
        sentences, labels
);

CnnSentenceDataSetIterator cnnIterator = new CnnSentenceDataSetIterator.Builder()
        .sentenceProvider(sentenceProvider)
        .tokenizerFactory(new DefaultTokenizerFactory())
        .maxSentenceLength(100)
        .minibatchSize(32)
        .build();

// Use with neural network training
while (cnnIterator.hasNext()) {
    DataSet batch = cnnIterator.next();
    // Train CNN model with batch
}

// Reuters dataset downloading and preparation
ReutersNewsGroupsLoader loader = new ReutersNewsGroupsLoader();
// loader.downloadAndExtract(); // Downloads Reuters data if not present

// Advanced iterator configuration with custom preprocessing
TokenizerFactory customTokenizer = new DefaultTokenizerFactory();
customTokenizer.setTokenPreProcessor(new CommonPreprocessor());

CnnSentenceDataSetIterator advancedIterator = new CnnSentenceDataSetIterator.Builder()
        .sentenceProvider(provider)
        .tokenizerFactory(customTokenizer)
        .maxSentenceLength(150)
        .minibatchSize(64)
        .useNormalizedWordVectors(true)
        .build();
```

## Dataset Integration Patterns

The dataset loading components support several common patterns:

### Benchmark Dataset Access

- **Automatic downloading**: Datasets are downloaded automatically when first accessed
- **Standardized preprocessing**: Consistent text cleaning and tokenization across datasets
- **Train/test splits**: Pre-defined data splits for reproducible experiments
- **Label encoding**: Automatic conversion of text labels to numerical representations

### Custom Dataset Integration

- **Flexible input formats**: Support for various file formats and directory structures
- **Label discovery**: Automatic label extraction from filenames, directories, or file content
- **Memory efficiency**: Streaming access to large datasets without loading everything into memory
- **Preprocessing pipelines**: Integration with tokenization and text processing components
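The directory-based label discovery pattern can be sketched with plain Java. `LabelDiscovery` is an illustrative helper for this document, not a library class; it assumes the `data/<label>/<file>` layout shown in the usage examples:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Maps each immediate subdirectory of a data root to the files it contains. */
class LabelDiscovery {

    static Map<String, List<File>> discoverLabels(File dataDir) {
        Map<String, List<File>> byLabel = new LinkedHashMap<>();
        File[] subdirs = dataDir.listFiles(File::isDirectory);
        if (subdirs == null) {
            return byLabel; // dataDir is missing or not a directory
        }
        for (File labelDir : subdirs) {
            List<File> files = new ArrayList<>();
            File[] children = labelDir.listFiles(File::isFile);
            if (children != null) {
                for (File f : children) {
                    files.add(f);
                }
            }
            byLabel.put(labelDir.getName(), files); // subdirectory name is the label
        }
        return byLabel;
    }
}
```

The returned map's key set doubles as the discovered label vocabulary, which is how the directory constructor of a file-based provider can infer labels without a separate manifest.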

### Neural Network Integration

- **Batch preparation**: Automatic batching with configurable sizes for efficient training
- **Sequence padding**: Handling variable-length text sequences with padding strategies
- **Label encoding**: One-hot encoding and other label representations for classification
- **Memory management**: Efficient data loading that scales to large datasets
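The sequence-padding and one-hot label-encoding steps above can be illustrated in a few lines of plain Java. These helpers show the idea only; they are not the library's internal implementation:

```java
import java.util.Arrays;
import java.util.List;

/** Illustrates fixed-length padding/truncation and one-hot label encoding. */
class BatchPrep {

    /** Pads tokenIds with padValue up to maxLen, or truncates if longer. */
    static int[] padOrTruncate(int[] tokenIds, int maxLen, int padValue) {
        int[] out = new int[maxLen];
        Arrays.fill(out, padValue);
        System.arraycopy(tokenIds, 0, out, 0, Math.min(tokenIds.length, maxLen));
        return out;
    }

    /** One-hot encodes a label against an ordered label list. */
    static double[] oneHot(String label, List<String> allLabels) {
        double[] out = new double[allLabels.size()];
        int idx = allLabels.indexOf(label);
        if (idx < 0) {
            throw new IllegalArgumentException("Unknown label: " + label);
        }
        out[idx] = 1.0;
        return out;
    }
}
```

Fixed-length padding is what lets variable-length sentences share one rectangular batch tensor, and the ordered label list plays the same role as `allLabels()` on a sentence provider.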

These dataset utilities provide the foundation for training and evaluating NLP models on both standard benchmarks and custom datasets, ensuring consistent data preparation across different model types and training scenarios.