# Dataset Loading and Iteration

Pre-built dataset loaders and iterators for common NLP datasets and data formats, designed to integrate with neural network training pipelines. Provides standardized access to benchmark datasets along with utilities for preparing custom data.

## Capabilities

### CNN Sentence Dataset Iterator

A dataset iterator designed for convolutional neural network (CNN) sentence classification tasks, with configurable preprocessing and batching.

```java { .api }
/**
 * CNN sentence dataset iterator for sentence classification tasks.
 * Provides standardized data preparation for CNN-based text classification.
 */
public class CnnSentenceDataSetIterator {
    // CNN-specific dataset iteration with sentence-level batching and preprocessing
}
```
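
To illustrate the kind of preprocessing such an iterator performs, the sketch below maps a tokenized sentence to a fixed-length array of vocabulary indices, padding with zeros and truncating at a maximum length. `SentenceIndexer` and its vocabulary map are illustrative names, not part of the library API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: how a CNN sentence iterator might map tokens to a
// fixed-length index sequence (zero-padded, truncated beyond maxLength).
public class SentenceIndexer {
    private final Map<String, Integer> vocab;
    private final int maxLength;

    public SentenceIndexer(Map<String, Integer> vocab, int maxLength) {
        this.vocab = vocab;
        this.maxLength = maxLength;
    }

    // Convert a whitespace-tokenized sentence into a fixed-length index array.
    public int[] toIndices(String sentence) {
        String[] tokens = sentence.toLowerCase().split("\\s+");
        int[] indices = new int[maxLength];                // zero-padded by default
        for (int i = 0; i < tokens.length && i < maxLength; i++) {
            indices[i] = vocab.getOrDefault(tokens[i], 0); // 0 = unknown/padding
        }
        return indices;
    }
}
```

In the real iterator the indices would be looked up in a word-vector table to produce a `[maxLength x vectorSize]` input matrix per sentence.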

### Reuters News Groups Dataset

Standardized access to the Reuters News Groups dataset for document classification and text analysis benchmarking.

```java { .api }
/**
 * Reuters News Groups dataset iterator.
 * Provides access to Reuters news articles with category labels.
 */
public class ReutersNewsGroupsDataSetIterator {
    // Iterator for the Reuters dataset with automatic downloading and preprocessing
}

/**
 * Reuters News Groups dataset loader.
 * Handles downloading, extraction, and preparation of the Reuters dataset.
 */
public class ReutersNewsGroupsLoader {
    // Dataset loading utilities for Reuters news groups data
}
```

### Labeled Sentence Providers

Interface and implementations for supplying labeled sentences to dataset iterators from a variety of data sources.

```java { .api }
/**
 * Interface for providing labeled sentences to dataset iterators.
 */
public interface LabeledSentenceProvider {

    /**
     * Get the total number of labeled sentences.
     * @return Total count of available labeled sentences
     */
    int totalNumSentences();

    /**
     * Get all available sentence labels.
     * @return List of all unique labels in the dataset
     */
    List<String> allLabels();

    /**
     * Get the sentence at a specific index.
     * @param index Index of the sentence to retrieve
     * @return Sentence string at the specified index
     */
    String sentenceAt(int index);

    /**
     * Get the label for the sentence at a specific index.
     * @param index Index of the sentence label to retrieve
     * @return Label string for the sentence at the specified index
     */
    String labelAt(int index);
}

/**
 * Collection-based labeled sentence provider.
 * Provides labeled sentences from in-memory collections.
 */
public class CollectionLabeledSentenceProvider implements LabeledSentenceProvider {

    /**
     * Create a provider from sentence and label collections.
     * @param sentences Collection of sentence strings
     * @param labels Collection of corresponding label strings
     */
    public CollectionLabeledSentenceProvider(Collection<String> sentences, Collection<String> labels);

    /**
     * Create a provider from a labeled document collection.
     * @param documents Collection of LabelledDocument instances
     */
    public CollectionLabeledSentenceProvider(Collection<LabelledDocument> documents);
}

/**
 * File-based labeled sentence provider.
 * Reads labeled sentences from the file system in various formats.
 */
public class FileLabeledSentenceProvider implements LabeledSentenceProvider {

    /**
     * Create a provider from a file with a specified format.
     * @param file File containing labeled sentences
     * @param format Format specification for parsing labeled data
     */
    public FileLabeledSentenceProvider(File file, LabeledSentenceFormat format);

    /**
     * Create a provider from a directory organized by label.
     * @param directory Directory containing one subdirectory per label
     */
    public FileLabeledSentenceProvider(File directory);
}
```
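
As a concrete illustration of the contract above, here is a minimal in-memory implementation backed by two parallel lists. `SimpleSentenceProvider` is a hypothetical name used only for this sketch; it is not part of the library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Minimal sketch of a provider satisfying the LabeledSentenceProvider contract,
// backed by parallel in-memory lists of sentences and labels.
public class SimpleSentenceProvider {
    private final List<String> sentences;
    private final List<String> labels;

    public SimpleSentenceProvider(List<String> sentences, List<String> labels) {
        if (sentences.size() != labels.size()) {
            throw new IllegalArgumentException("sentences and labels must have the same size");
        }
        this.sentences = sentences;
        this.labels = labels;
    }

    public int totalNumSentences() {
        return sentences.size();
    }

    // Unique labels, sorted for deterministic ordering.
    public List<String> allLabels() {
        return new ArrayList<>(new TreeSet<>(labels));
    }

    public String sentenceAt(int index) {
        return sentences.get(index);
    }

    public String labelAt(int index) {
        return labels.get(index);
    }
}
```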

### Label-Aware Data Conversion

Utilities for converting between different labeled data formats and iterator types.

```java { .api }
/**
 * Converter for label-aware data formats.
 * Handles conversion between different labeled data representations.
 */
public class LabelAwareConverter {

    /**
     * Convert a label-aware iterator to the standard format.
     * @param iterator LabelAwareIterator to convert
     * @return Converted data in the standard format
     */
    public static ConvertedData convert(LabelAwareIterator iterator);

    /**
     * Convert a labeled document collection to the provider format.
     * @param documents Collection of LabelledDocument instances
     * @return LabeledSentenceProvider for the documents
     */
    public static LabeledSentenceProvider convert(Collection<LabelledDocument> documents);
}
```
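
The document-to-provider conversion above essentially flattens labelled documents into the parallel sentence/label lists a provider consumes. The sketch below shows that step in isolation; `LabelledDoc` and `LabelAwareFlattener` are stand-in names for this illustration, not the library's types.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Sketch of label-aware conversion: flatten (label, text) documents into the
// aligned sentence and label lists expected by a sentence provider.
public class LabelAwareFlattener {
    // Stand-in for the library's labelled document type.
    public record LabelledDoc(String label, String content) {}

    // Returns two aligned lists: index 0 holds sentences, index 1 holds labels.
    public static List<List<String>> flatten(Collection<LabelledDoc> docs) {
        List<String> sentences = new ArrayList<>();
        List<String> labels = new ArrayList<>();
        for (LabelledDoc d : docs) {
            sentences.add(d.content());
            labels.add(d.label());
        }
        return List.of(sentences, labels);
    }
}
```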

**Usage Examples:**

```java
import org.deeplearning4j.datasets.iterator.ReutersNewsGroupsDataSetIterator;
import org.deeplearning4j.datasets.loader.ReutersNewsGroupsLoader;
import org.deeplearning4j.iterator.*;
import org.deeplearning4j.iterator.provider.*;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizer.preprocessor.CommonPreprocessor;
import org.nd4j.linalg.dataset.DataSet;
import java.io.File;
import java.util.*;

// Reuters News Groups dataset usage
ReutersNewsGroupsDataSetIterator reutersIterator = new ReutersNewsGroupsDataSetIterator(
        32,   // batch size
        100,  // truncate length
        true, // train set
        new DefaultTokenizerFactory()
);

while (reutersIterator.hasNext()) {
    DataSet batch = reutersIterator.next();
    // Process batch for training
    System.out.println("Batch size: " + batch.numExamples());
}

// Custom labeled sentence provider from collections
Collection<String> sentences = Arrays.asList(
        "This is a positive example",
        "This is a negative example",
        "Another positive case"
);

Collection<String> labels = Arrays.asList("positive", "negative", "positive");

LabeledSentenceProvider provider = new CollectionLabeledSentenceProvider(sentences, labels);
System.out.println("Total sentences: " + provider.totalNumSentences());
System.out.println("Available labels: " + provider.allLabels());

// Access specific sentences and labels
for (int i = 0; i < provider.totalNumSentences(); i++) {
    String sentence = provider.sentenceAt(i);
    String label = provider.labelAt(i);
    System.out.println("Sentence: " + sentence + " -> Label: " + label);
}

// File-based labeled sentence provider
File labeledDataFile = new File("labeled_data.txt");
FileLabeledSentenceProvider fileProvider = new FileLabeledSentenceProvider(
        labeledDataFile,
        LabeledSentenceFormat.TAB_SEPARATED // or another supported format
);

// Directory-based provider (subdirectories as labels)
File dataDirectory = new File("data/");
// Expected structure:
// data/
//   positive/
//     file1.txt
//     file2.txt
//   negative/
//     file3.txt
//     file4.txt

FileLabeledSentenceProvider dirProvider = new FileLabeledSentenceProvider(dataDirectory);

// CNN sentence dataset iterator configuration.
// Note: depending on the library version, the builder may also require a
// word-vector source before build() succeeds.
LabeledSentenceProvider sentenceProvider = new CollectionLabeledSentenceProvider(sentences, labels);

CnnSentenceDataSetIterator cnnIterator = new CnnSentenceDataSetIterator.Builder()
        .sentenceProvider(sentenceProvider)
        .tokenizerFactory(new DefaultTokenizerFactory())
        .maxSentenceLength(100)
        .minibatchSize(32)
        .build();

// Use with neural network training
while (cnnIterator.hasNext()) {
    DataSet batch = cnnIterator.next();
    // Train CNN model with batch
}

// Reuters dataset downloading and preparation
ReutersNewsGroupsLoader loader = new ReutersNewsGroupsLoader();
// loader.downloadAndExtract(); // Downloads Reuters data if not present

// Advanced iterator configuration with custom preprocessing
TokenizerFactory customTokenizer = new DefaultTokenizerFactory();
customTokenizer.setTokenPreProcessor(new CommonPreprocessor());

CnnSentenceDataSetIterator advancedIterator = new CnnSentenceDataSetIterator.Builder()
        .sentenceProvider(provider)
        .tokenizerFactory(customTokenizer)
        .maxSentenceLength(150)
        .minibatchSize(64)
        .useNormalizedWordVectors(true)
        .build();
```

## Dataset Integration Patterns

The dataset loading components support several common patterns:

### Benchmark Dataset Access
- **Automatic downloading**: Datasets are downloaded automatically on first access
- **Standardized preprocessing**: Consistent text cleaning and tokenization across datasets
- **Train/test splits**: Pre-defined data splits for reproducible experiments
- **Label encoding**: Automatic conversion of text labels to numerical representations

### Custom Dataset Integration
- **Flexible input formats**: Support for various file formats and directory structures
- **Label discovery**: Automatic label extraction from filenames, directories, or file content
- **Memory efficiency**: Streaming access to large datasets without loading everything into memory
- **Preprocessing pipelines**: Integration with tokenization and text processing components

### Neural Network Integration
- **Batch preparation**: Automatic batching with configurable sizes for efficient training
- **Sequence padding**: Handling variable-length text sequences with padding strategies
- **Label encoding**: One-hot encoding and other label representations for classification
- **Memory management**: Efficient data loading that scales to large datasets
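
Sequence padding for a minibatch can be sketched as follows: pad every sequence to the length of the longest one in the batch, capped at a maximum. `BatchPadder` is an illustrative name; real iterators would additionally produce a mask so padded positions are ignored during training.

```java
import java.util.List;

// Sketch of minibatch padding: zero-pad each sequence to the longest length
// in the batch, truncating anything beyond maxLength.
public class BatchPadder {
    public static int[][] pad(List<int[]> sequences, int maxLength) {
        int target = 0;
        for (int[] s : sequences) {
            target = Math.max(target, Math.min(s.length, maxLength));
        }
        int[][] padded = new int[sequences.size()][target]; // zero-filled
        for (int i = 0; i < sequences.size(); i++) {
            int[] s = sequences.get(i);
            for (int j = 0; j < Math.min(s.length, target); j++) {
                padded[i][j] = s[j];
            }
        }
        return padded;
    }
}
```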

These dataset utilities provide the foundation for training and evaluating NLP models on both standard benchmarks and custom datasets, ensuring consistent data preparation across different model types and training scenarios.