
# Dataset Loading and Iteration

Pre-built dataset loaders and iterators for common NLP datasets and data formats, designed for seamless integration with neural network training pipelines. Provides standardized access to benchmark datasets and custom data preparation utilities.

## Capabilities

### CNN Sentence Dataset Iterator

Dataset iterator specifically designed for Convolutional Neural Network sentence classification tasks, with configurable preprocessing and batching.

```java { .api }
/**
 * CNN sentence dataset iterator for sentence classification tasks.
 * Provides standardized data preparation for CNN-based text classification.
 */
public class CnnSentenceDataSetIterator {
    // CNN-specific dataset iteration with sentence-level batching and preprocessing
}
```

### Reuters News Groups Dataset

Standardized access to the Reuters News Groups dataset for document classification and text analysis benchmarking.

```java { .api }
/**
 * Reuters News Groups dataset iterator.
 * Provides access to Reuters news articles with category labels.
 */
public class ReutersNewsGroupsDataSetIterator {
    // Iterator for the Reuters dataset with automatic downloading and preprocessing
}

/**
 * Reuters News Groups dataset loader.
 * Handles downloading, extraction, and preparation of the Reuters dataset.
 */
public class ReutersNewsGroupsLoader {
    // Dataset loading utilities for Reuters news groups data
}
```

### Labeled Sentence Providers

Interface and implementations for providing labeled sentences to dataset iterators, with various data source options.

```java { .api }
/**
 * Interface for providing labeled sentences to dataset iterators.
 */
public interface LabeledSentenceProvider {

    /**
     * Get the total number of labeled sentences.
     * @return Total count of available labeled sentences
     */
    int totalNumSentences();

    /**
     * Get all available sentence labels.
     * @return List of all unique labels in the dataset
     */
    List<String> allLabels();

    /**
     * Get the sentence at a specific index.
     * @param index Index of the sentence to retrieve
     * @return Sentence string at the specified index
     */
    String sentenceAt(int index);

    /**
     * Get the label for the sentence at a specific index.
     * @param index Index of the sentence label to retrieve
     * @return Label string for the sentence at the specified index
     */
    String labelAt(int index);
}

/**
 * Collection-based labeled sentence provider.
 * Provides labeled sentences from in-memory collections.
 */
public class CollectionLabeledSentenceProvider implements LabeledSentenceProvider {

    /**
     * Create a provider from sentence and label collections.
     * @param sentences Collection of sentence strings
     * @param labels Collection of corresponding label strings
     */
    public CollectionLabeledSentenceProvider(Collection<String> sentences, Collection<String> labels);

    /**
     * Create a provider from a labeled document collection.
     * @param documents Collection of LabelledDocument instances
     */
    public CollectionLabeledSentenceProvider(Collection<LabelledDocument> documents);
}

/**
 * File-based labeled sentence provider.
 * Reads labeled sentences from the file system in various formats.
 */
public class FileLabeledSentenceProvider implements LabeledSentenceProvider {

    /**
     * Create a provider from a file with a specified format.
     * @param file File containing labeled sentences
     * @param format Format specification for parsing labeled data
     */
    public FileLabeledSentenceProvider(File file, LabeledSentenceFormat format);

    /**
     * Create a provider from a directory with label-based organization.
     * @param directory Directory containing one subdirectory per label
     */
    public FileLabeledSentenceProvider(File directory);
}
```
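As a concrete illustration of the `LabeledSentenceProvider` contract, here is a minimal in-memory implementation. The interface is restated locally so the sketch compiles without deeplearning4j on the classpath, and `ParallelListSentenceProvider` is a hypothetical name for this example, not a class in the library:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Restated locally so the sketch compiles standalone; in real code,
// implement org.deeplearning4j's LabeledSentenceProvider instead.
interface LabeledSentenceProvider {
    int totalNumSentences();
    List<String> allLabels();
    String sentenceAt(int index);
    String labelAt(int index);
}

/** A minimal provider backed by two parallel lists of equal length. */
class ParallelListSentenceProvider implements LabeledSentenceProvider {
    private final List<String> sentences;
    private final List<String> labels;

    ParallelListSentenceProvider(List<String> sentences, List<String> labels) {
        if (sentences.size() != labels.size()) {
            throw new IllegalArgumentException("sentences and labels must be the same length");
        }
        this.sentences = sentences;
        this.labels = labels;
    }

    @Override public int totalNumSentences() { return sentences.size(); }

    @Override public List<String> allLabels() {
        // LinkedHashSet deduplicates while preserving first-seen order
        return new ArrayList<>(new LinkedHashSet<>(labels));
    }

    @Override public String sentenceAt(int index) { return sentences.get(index); }
    @Override public String labelAt(int index) { return labels.get(index); }
}

class ProviderDemo {
    public static void main(String[] args) {
        LabeledSentenceProvider p = new ParallelListSentenceProvider(
                List.of("great movie", "terrible plot", "loved it"),
                List.of("positive", "negative", "positive"));
        System.out.println(p.totalNumSentences()); // 3
        System.out.println(p.allLabels());         // [positive, negative]
    }
}
```

Keeping sentences and labels in parallel lists mirrors the index-based access pattern (`sentenceAt`/`labelAt`) that the interface requires.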

### Label-Aware Data Conversion

Utilities for converting between different labeled data formats and iterator types.

```java { .api }
/**
 * Converter for label-aware data formats.
 * Handles conversion between different labeled data representations.
 */
public class LabelAwareConverter {

    /**
     * Convert a label-aware iterator to the standard format.
     * @param iterator LabelAwareIterator to convert
     * @return Converted data in the standard format
     */
    public static ConvertedData convert(LabelAwareIterator iterator);

    /**
     * Convert a labeled document collection to the provider format.
     * @param documents Collection of LabelledDocument instances
     * @return LabeledSentenceProvider for the documents
     */
    public static LabeledSentenceProvider convert(Collection<LabelledDocument> documents);
}
```

**Usage Examples:**

```java
import org.deeplearning4j.datasets.iterator.ReutersNewsGroupsDataSetIterator;
import org.deeplearning4j.datasets.loader.ReutersNewsGroupsLoader;
import org.deeplearning4j.iterator.*;
import org.deeplearning4j.iterator.provider.*;

// Reuters News Groups dataset usage
ReutersNewsGroupsDataSetIterator reutersIterator = new ReutersNewsGroupsDataSetIterator(
        32,   // batch size
        100,  // truncate length
        true, // train set
        new DefaultTokenizerFactory()
);

while (reutersIterator.hasNext()) {
    DataSet batch = reutersIterator.next();
    // Process batch for training
    System.out.println("Batch size: " + batch.numExamples());
}

// Custom labeled sentence provider from collections
Collection<String> sentences = Arrays.asList(
        "This is a positive example",
        "This is a negative example",
        "Another positive case"
);

Collection<String> labels = Arrays.asList(
        "positive",
        "negative",
        "positive"
);

LabeledSentenceProvider provider = new CollectionLabeledSentenceProvider(sentences, labels);
System.out.println("Total sentences: " + provider.totalNumSentences());
System.out.println("Available labels: " + provider.allLabels());

// Access specific sentences and labels
for (int i = 0; i < provider.totalNumSentences(); i++) {
    String sentence = provider.sentenceAt(i);
    String label = provider.labelAt(i);
    System.out.println("Sentence: " + sentence + " -> Label: " + label);
}

// File-based labeled sentence provider
File labeledDataFile = new File("labeled_data.txt");
FileLabeledSentenceProvider fileProvider = new FileLabeledSentenceProvider(
        labeledDataFile,
        LabeledSentenceFormat.TAB_SEPARATED // or another format
);

// Directory-based provider (subdirectories as labels)
File dataDirectory = new File("data/");
// Expected structure:
// data/
//   positive/
//     file1.txt
//     file2.txt
//   negative/
//     file3.txt
//     file4.txt

FileLabeledSentenceProvider dirProvider = new FileLabeledSentenceProvider(dataDirectory);

// CNN sentence dataset iterator configuration
LabeledSentenceProvider sentenceProvider = new CollectionLabeledSentenceProvider(
        sentences, labels
);

CnnSentenceDataSetIterator cnnIterator = new CnnSentenceDataSetIterator.Builder()
        .sentenceProvider(sentenceProvider)
        .tokenizerFactory(new DefaultTokenizerFactory())
        .maxSentenceLength(100)
        .minibatchSize(32)
        .build();

// Use with neural network training
while (cnnIterator.hasNext()) {
    DataSet batch = cnnIterator.next();
    // Train CNN model with batch
}

// Reuters dataset downloading and preparation
ReutersNewsGroupsLoader loader = new ReutersNewsGroupsLoader();
// loader.downloadAndExtract(); // Downloads Reuters data if not present

// Advanced iterator configuration with custom preprocessing
TokenizerFactory customTokenizer = new DefaultTokenizerFactory();
customTokenizer.setTokenPreProcessor(new CommonPreprocessor());

CnnSentenceDataSetIterator advancedIterator = new CnnSentenceDataSetIterator.Builder()
        .sentenceProvider(provider)
        .tokenizerFactory(customTokenizer)
        .maxSentenceLength(150)
        .minibatchSize(64)
        .useNormalizedWordVectors(true)
        .build();
```

## Dataset Integration Patterns

The dataset loading components support several common patterns:

### Benchmark Dataset Access

- **Automatic downloading**: Datasets are downloaded automatically when first accessed
- **Standardized preprocessing**: Consistent text cleaning and tokenization across datasets
- **Train/test splits**: Pre-defined data splits for reproducible experiments
- **Label encoding**: Automatic conversion of text labels to numerical representations

### Custom Dataset Integration

- **Flexible input formats**: Support for various file formats and directory structures
- **Label discovery**: Automatic label extraction from filenames, directories, or file content
- **Memory efficiency**: Streaming access to large datasets without loading everything into memory
- **Preprocessing pipelines**: Integration with tokenization and text processing components
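The directory-based label discovery pattern can be sketched with plain Java. `LabelDiscovery` is an illustrative helper for this document, not a library class; it assumes the `data/<label>/<file>` layout shown in the usage examples:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Maps each immediate subdirectory of a data root to the files it contains. */
class LabelDiscovery {

    static Map<String, List<File>> discoverLabels(File dataDir) {
        Map<String, List<File>> byLabel = new LinkedHashMap<>();
        File[] subdirs = dataDir.listFiles(File::isDirectory);
        if (subdirs == null) {
            return byLabel; // dataDir is missing or not a directory
        }
        for (File labelDir : subdirs) {
            List<File> files = new ArrayList<>();
            File[] children = labelDir.listFiles(File::isFile);
            if (children != null) {
                for (File f : children) {
                    files.add(f);
                }
            }
            byLabel.put(labelDir.getName(), files); // subdirectory name is the label
        }
        return byLabel;
    }
}
```

The returned map's key set doubles as the discovered label vocabulary, which is how the directory constructor of a file-based provider can infer labels without a separate manifest.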

### Neural Network Integration

- **Batch preparation**: Automatic batching with configurable sizes for efficient training
- **Sequence padding**: Handling variable-length text sequences with padding strategies
- **Label encoding**: One-hot encoding and other label representations for classification
- **Memory management**: Efficient data loading that scales to large datasets
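The sequence-padding and one-hot label-encoding steps above can be illustrated in a few lines of plain Java. These helpers show the idea only; they are not the library's internal implementation:

```java
import java.util.Arrays;
import java.util.List;

/** Illustrates fixed-length padding/truncation and one-hot label encoding. */
class BatchPrep {

    /** Pads tokenIds with padValue up to maxLen, or truncates if longer. */
    static int[] padOrTruncate(int[] tokenIds, int maxLen, int padValue) {
        int[] out = new int[maxLen];
        Arrays.fill(out, padValue);
        System.arraycopy(tokenIds, 0, out, 0, Math.min(tokenIds.length, maxLen));
        return out;
    }

    /** One-hot encodes a label against an ordered label list. */
    static double[] oneHot(String label, List<String> allLabels) {
        double[] out = new double[allLabels.size()];
        int idx = allLabels.indexOf(label);
        if (idx < 0) {
            throw new IllegalArgumentException("Unknown label: " + label);
        }
        out[idx] = 1.0;
        return out;
    }
}
```

Fixed-length padding is what lets variable-length sentences share one rectangular batch tensor, and the ordered label list plays the same role as `allLabels()` on a sentence provider.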

These dataset utilities provide the foundation for training and evaluating NLP models on both standard benchmarks and custom datasets, ensuring consistent data preparation across different model types and training scenarios.