# Sample Test Data

The test data package provides pre-built datasets and validation utilities for common Flink algorithms and testing scenarios. These datasets are commonly used in Flink examples and benchmarks.

## Algorithm Test Data

### PageRank Data

Test data for PageRank algorithm implementations.

```java { .api }
public class PageRankData {
    public static final int NUM_VERTICES = 5;
    public static final String VERTICES;
    public static final String EDGES;
    public static final String RANKS_AFTER_3_ITERATIONS;
    public static final String RANKS_AFTER_EPSILON_0_0001_CONVERGENCE;
}
```

**Usage Example:**

```java
@Test
public void testPageRank() throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // Use provided test data
    DataSet<String> vertices = env.fromElements(PageRankData.VERTICES.split("\n"));
    DataSet<String> edges = env.fromElements(PageRankData.EDGES.split("\n"));

    // Run PageRank algorithm
    // ... PageRank implementation

    // Validate against expected results.
    // resultPath is a placeholder for the path the job wrote its output to.
    TestBaseUtils.compareResultsByLinesInMemory(
        PageRankData.RANKS_AFTER_3_ITERATIONS,
        resultPath
    );
}
```

**Data Format:**
- **Vertices**: `vertexId` (5 vertices total)
- **Edges**: `sourceVertexId targetVertexId`
- **Results**: `vertexId pageRankValue`
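
As a minimal sketch of how the `sourceVertexId targetVertexId` edge format listed above might be consumed outside Flink, the following plain-Java helper parses such lines into adjacency lists. The class name `EdgeFormatSketch` and the sample string are hypothetical; the sample is not the content of `PageRankData.EDGES`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class EdgeFormatSketch {

    // Parses lines of the form "sourceVertexId targetVertexId" into
    // adjacency lists keyed by source vertex (sorted by vertex ID).
    public static Map<Long, List<Long>> parseEdges(String edges) {
        Map<Long, List<Long>> adjacency = new TreeMap<>();
        for (String line : edges.split("\n")) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] parts = line.trim().split("\\s+");
            long source = Long.parseLong(parts[0]);
            long target = Long.parseLong(parts[1]);
            adjacency.computeIfAbsent(source, k -> new ArrayList<>()).add(target);
        }
        return adjacency;
    }

    public static void main(String[] args) {
        // Hypothetical sample data, not the real PageRankData.EDGES.
        String sample = "1 2\n1 3\n2 3\n3 1";
        System.out.println(parseEdges(sample));
    }
}
```

The out-degree of each vertex, which PageRank needs to distribute rank, is then the size of the corresponding list.
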

### Word Count Data

Test data for WordCount implementations, using German text from Goethe's tragedy *Faust*.

```java { .api }
public class WordCountData {
    public static final String TEXT;
    public static final String COUNTS;
    public static final String STREAMING_COUNTS_AS_TUPLES;
    public static final String COUNTS_AS_TUPLES;
}
```

**Usage Example:**

```java
@Test
public void testWordCount() throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // Use provided German text
    DataSet<String> text = env.fromElements(WordCountData.TEXT.split("\n"));

    // Tokenizer is a user-defined FlatMapFunction<String, Tuple2<String, Integer>>
    // that emits (word, 1) pairs; it is not part of the test data package.
    DataSet<Tuple2<String, Integer>> wordCounts = text
        .flatMap(new Tokenizer())
        .groupBy(0)
        .sum(1);

    List<Tuple2<String, Integer>> result = wordCounts.collect();

    // Compare with expected counts
    TestBaseUtils.compareResultAsTuples(result, WordCountData.COUNTS_AS_TUPLES);
}
```

**Data Content:**
- **TEXT**: German text from Goethe's tragedy *Faust*
- **COUNTS**: Expected word counts, one `word count` pair per line
- **COUNTS_AS_TUPLES**: Expected results as `(word,count)` tuples
- **STREAMING_COUNTS_AS_TUPLES**: Expected streaming results as tuples
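
To illustrate the `word count` line format described above, here is a small plain-Java sketch that counts words and renders them one pair per line. The class name `WordCountFormatSketch` is hypothetical, and the tokenization rule (lower-casing, splitting on anything that is not a Unicode letter or digit) is an assumption; it may differ from the rule used to produce `WordCountData.COUNTS`.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class WordCountFormatSketch {

    // Counts lower-cased words and renders them as "word count" lines,
    // sorted alphabetically by word.
    public static String countWords(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        // Split on runs of characters that are neither letters nor digits.
        // Using Unicode classes keeps German umlauts inside words.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            out.append(entry.getKey()).append(' ').append(entry.getValue()).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Short illustrative input, not the full WordCountData.TEXT.
        System.out.print(countWords("Habe nun, ach! Habe"));
    }
}
```
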

### K-Means Clustering Data

Test data for the K-Means clustering algorithm, with both 2D and 3D datasets.

```java { .api }
public class KMeansData {
    // 3D clustering data
    public static final String DATAPOINTS;
    public static final String INITIAL_CENTERS;
    public static final String CENTERS_AFTER_ONE_STEP;
    public static final String CENTERS_AFTER_ONE_STEP_SINGLE_DIGIT;
    public static final String CENTERS_AFTER_20_ITERATIONS_SINGLE_DIGIT;
    public static final String CENTERS_AFTER_20_ITERATIONS_DOUBLE_DIGIT;

    // 2D clustering data
    public static final String DATAPOINTS_2D;
    public static final String INITIAL_CENTERS_2D;
    public static final String CENTERS_2D_AFTER_SINGLE_ITERATION_DOUBLE_DIGIT;
    public static final String CENTERS_2D_AFTER_20_ITERATIONS_DOUBLE_DIGIT;

    // Validation utility
    public static void checkResultsWithDelta(String expectedResults, List<String> resultLines, double maxDelta);
}
```

**Usage Example:**

```java
@Test
public void testKMeans3D() throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // Use 3D test data
    DataSet<String> points = env.fromElements(KMeansData.DATAPOINTS.split("\n"));
    DataSet<String> centers = env.fromElements(KMeansData.INITIAL_CENTERS.split("\n"));

    // Run K-Means algorithm for 20 iterations
    // ... K-Means implementation producing resultCenters (a DataSet<String>)

    List<String> finalCenters = resultCenters.collect();

    // Validate with delta tolerance for floating point comparison
    KMeansData.checkResultsWithDelta(
        KMeansData.CENTERS_AFTER_20_ITERATIONS_DOUBLE_DIGIT,
        finalCenters,
        0.01
    );
}

@Test
public void testKMeans2D() throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // Use 2D test data
    DataSet<String> points = env.fromElements(KMeansData.DATAPOINTS_2D.split("\n"));
    DataSet<String> centers = env.fromElements(KMeansData.INITIAL_CENTERS_2D.split("\n"));

    // ... K-Means implementation producing resultCenters (a DataSet<String>)

    List<String> result = resultCenters.collect();
    KMeansData.checkResultsWithDelta(
        KMeansData.CENTERS_2D_AFTER_20_ITERATIONS_DOUBLE_DIGIT,
        result,
        0.01
    );
}
```

**Data Format:**
- **3D Points**: `pointId x y z` (100 data points)
- **2D Points**: `pointId x y`
- **Centers**: `centerId x y z` (3D) or `centerId x y` (2D)
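
The idea behind comparing centers with a delta tolerance can be sketched in plain Java as follows. This is not the implementation of `KMeansData.checkResultsWithDelta`; the class name `DeltaCheckSketch`, the whitespace-separated `centerId x y z` layout (taken from the list above), and the requirement that both lists are in the same order are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class DeltaCheckSketch {

    // Returns true when both lists contain the same IDs in the same order and
    // every coordinate differs by at most maxDelta.
    public static boolean matchesWithDelta(List<String> expected, List<String> actual, double maxDelta) {
        if (expected.size() != actual.size()) {
            return false;
        }
        for (int i = 0; i < expected.size(); i++) {
            String[] e = expected.get(i).trim().split("\\s+");
            String[] a = actual.get(i).trim().split("\\s+");
            // The first field is the ID and must match exactly.
            if (e.length != a.length || !e[0].equals(a[0])) {
                return false;
            }
            // The remaining fields are coordinates compared with tolerance.
            for (int j = 1; j < e.length; j++) {
                double diff = Math.abs(Double.parseDouble(e[j]) - Double.parseDouble(a[j]));
                if (diff > maxDelta) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical center lines, not taken from KMeansData.
        List<String> expected = Arrays.asList("1 1.00 2.00 3.00");
        List<String> actual = Arrays.asList("1 1.004 2.0 2.996");
        System.out.println(matchesWithDelta(expected, actual, 0.01));
    }
}
```

A tolerance is needed because the computed coordinates are floating-point values whose last digits can vary with the order of summation, while the expected strings are rounded to a fixed number of digits.
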

## Graph Algorithm Test Data

### Connected Components Data

Test data and validation utilities for the Connected Components algorithm.

```java { .api }
public class ConnectedComponentsData {
    // Generate test vertices
    public static String getEnumeratingVertices(int num);

    // Generate random edges with odd/even pattern
    public static String getRandomOddEvenEdges(int numEdges, int numVertices, long seed);

    // Validate connected components results
    public static void checkOddEvenResult(BufferedReader result) throws IOException;
    public static void checkOddEvenResult(List<Tuple2<Long, Long>> lines) throws IOException;
}
```

**Usage Example:**

```java
@Test
public void testConnectedComponents() throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // Generate test data: 100 vertices, 150 edges, fixed seed for reproducibility
    String vertices = ConnectedComponentsData.getEnumeratingVertices(100);
    String edges = ConnectedComponentsData.getRandomOddEvenEdges(150, 100, 12345L);

    DataSet<String> vertexData = env.fromElements(vertices.split("\n"));
    DataSet<String> edgeData = env.fromElements(edges.split("\n"));

    // Run Connected Components algorithm
    // ... implementation producing result (a DataSet<Tuple2<Long, Long>>)

    List<Tuple2<Long, Long>> components = result.collect();

    // Validate odd/even component structure
    ConnectedComponentsData.checkOddEvenResult(components);
}
```

### Transitive Closure Data

Validation utility for Transitive Closure algorithm results.

```java { .api }
public class TransitiveClosureData {
    // Validate transitive closure results
    public static void checkOddEvenResult(BufferedReader result) throws IOException;
}
```

**Usage Example:**

```java
@Test
public void testTransitiveClosure() throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // Create test edges with an odd/even pattern: edges only connect vertices
    // of the same parity (assumed to be what checkOddEvenResult verifies).
    DataSet<Tuple2<Long, Long>> edges = env.fromElements(
        new Tuple2<>(1L, 3L),
        new Tuple2<>(3L, 5L),
        new Tuple2<>(2L, 4L)
    );

    // Run Transitive Closure algorithm
    // ... implementation (result and outputPath are placeholders)

    // Write results to a single file rather than one file per parallel task
    result.writeAsText(outputPath).setParallelism(1);
    env.execute();

    // Validate closure properties; try-with-resources closes the reader
    // even if validation throws.
    try (BufferedReader reader = new BufferedReader(new FileReader(outputPath))) {
        TransitiveClosureData.checkOddEvenResult(reader);
    }
}
```

### Triangle Enumeration Data

Test data for triangle enumeration in graphs.

```java { .api }
public class EnumTriangleData {
    public static final String EDGES;
    public static final String TRIANGLES_BY_ID;
    public static final String TRIANGLES_BY_DEGREE;
}
```

**Usage Example:**

```java
@Test
public void testTriangleEnumeration() throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // Use provided edge data
    DataSet<String> edges = env.fromElements(EnumTriangleData.EDGES.split("\n"));

    // Run triangle enumeration algorithm
    // ... implementation producing result (a DataSet<String>)

    List<String> triangles = result.collect();

    // Compare with expected triangles (sorted by ID).
    // compareResultAsText takes the collected list first; the
    // compareResultsByLinesInMemory variant expects a result file path instead.
    TestBaseUtils.compareResultAsText(
        triangles,
        EnumTriangleData.TRIANGLES_BY_ID
    );
}
```

**Data Format:**
- **EDGES**: `vertexId1 vertexId2` representing undirected edges
- **TRIANGLES_BY_ID**: Expected triangle results sorted by vertex ID
- **TRIANGLES_BY_DEGREE**: Expected triangle results sorted by vertex degree
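
For small inputs, expected triangles can be cross-checked with a brute-force enumeration. The sketch below assumes the `vertexId1 vertexId2` edge format listed above and reports each triangle once with its vertices in ascending ID order. The class name `TriangleSketch` and the space-separated output format are hypothetical and may not match the layout of `TRIANGLES_BY_ID`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class TriangleSketch {

    // Enumerates triangles (a < b < c) in an undirected graph given as
    // "vertexId1 vertexId2" lines.
    public static List<String> triangles(String edges) {
        Map<Integer, Set<Integer>> neighbors = new TreeMap<>();
        for (String line : edges.split("\n")) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length < 2) {
                continue; // skip blank or malformed lines
            }
            int u = Integer.parseInt(parts[0]);
            int v = Integer.parseInt(parts[1]);
            // Store each undirected edge in both directions.
            neighbors.computeIfAbsent(u, k -> new TreeSet<>()).add(v);
            neighbors.computeIfAbsent(v, k -> new TreeSet<>()).add(u);
        }
        List<String> result = new ArrayList<>();
        for (int a : neighbors.keySet()) {
            for (int b : neighbors.get(a)) {
                if (b <= a) {
                    continue;
                }
                for (int c : neighbors.get(b)) {
                    // a-b and b-c exist; the triangle closes if a-c exists too.
                    if (c > b && neighbors.get(a).contains(c)) {
                        result.add(a + " " + b + " " + c);
                    }
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample, not the contents of EnumTriangleData.EDGES.
        System.out.println(triangles("1 2\n2 3\n1 3\n3 4"));
    }
}
```
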

## Web Analytics Test Data

### Web Log Analysis Data

Test data for web log analysis and web graph algorithms.

```java { .api }
public class WebLogAnalysisData {
    public static final String DOCS;
    public static final String RANKS;
    public static final String VISITS;
    public static final String EXCEPTED_RESULT;
}
```

**Usage Example:**

```java
@Test
public void testWebLogAnalysis() throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // Load web data
    DataSet<String> documents = env.fromElements(WebLogAnalysisData.DOCS.split("\n"));
    DataSet<String> pageRanks = env.fromElements(WebLogAnalysisData.RANKS.split("\n"));
    DataSet<String> visits = env.fromElements(WebLogAnalysisData.VISITS.split("\n"));

    // Run web log analysis
    // ... implementation combining docs, ranks, and visits into result

    List<String> analysis = result.collect();

    // Validate analysis results.
    // compareResultAsText takes the collected list first; the
    // compareResultsByLinesInMemory variant expects a result file path instead.
    TestBaseUtils.compareResultAsText(
        analysis,
        WebLogAnalysisData.EXCEPTED_RESULT
    );
}
```

**Data Format:**
- **DOCS**: `url|content` - Web documents with URL and content
- **RANKS**: `url rank` - Page rank values for URLs
- **VISITS**: `url visitCount` - Visit statistics for URLs
- **EXCEPTED_RESULT**: Expected combined analysis results (the constant name is spelled `EXCEPTED_RESULT`, as in the class listing above)

## Data Usage Patterns

### Loading Test Data

```java
// Split multi-line data into a DataSet
// (TestData.SAMPLE_DATA stands in for any of the String constants above)
DataSet<String> dataSet = env.fromElements(TestData.SAMPLE_DATA.split("\n"));

// Parse structured data. The returns(...) hint is needed because the
// lambda's generic Tuple2 type is erased at compile time.
DataSet<Tuple2<String, Integer>> tuples = dataSet
    .map(line -> {
        String[] parts = line.split(",");
        return new Tuple2<>(parts[0], Integer.parseInt(parts[1]));
    })
    .returns(Types.TUPLE(Types.STRING, Types.INT));
```

### Validation with Delta Tolerance

```java
// For floating-point comparisons
KMeansData.checkResultsWithDelta(expectedResults, actualResults, 0.001);

// For key-value pairs with tolerance
TestBaseUtils.compareKeyValuePairsWithDelta(expected, resultPath, ",", 0.01);
```

### Custom Validation Logic

```java
// Implement custom validation for specific algorithms
// (algorithm is a placeholder for the DataSet<String> under test)
List<String> results = algorithm.collect();
for (String result : results) {
    // Each line must look like "<integer>,<decimal>", for example "3,0.25"
    assertTrue("Result format validation", result.matches("\\d+,\\d+\\.\\d+"));
}
```
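
When a format check like the one above fails, it usually helps to know which lines were malformed. The following self-contained sketch collects the offending lines instead of failing on the first one; it reuses the same regular expression, while the class and method names (`ResultFormatCheck`, `malformedLines`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class ResultFormatCheck {

    // Same pattern as the assertion above: an integer ID, a comma,
    // and a decimal value with at least one digit on each side of the point.
    private static final Pattern LINE = Pattern.compile("\\d+,\\d+\\.\\d+");

    // Returns the lines that do not match the expected "id,value" format.
    public static List<String> malformedLines(List<String> results) {
        List<String> bad = new ArrayList<>();
        for (String line : results) {
            if (!LINE.matcher(line).matches()) {
                bad.add(line);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        // Illustrative input: two well-formed and two malformed lines.
        List<String> lines = Arrays.asList("1,0.25", "2,0.5", "3,7", "x,1.0");
        System.out.println(malformedLines(lines));
    }
}
```

In a test, asserting that the returned list is empty gives a failure message that names every bad line at once.
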

### Generating Random Test Data

```java
// Use ConnectedComponentsData for reproducible random data
String randomEdges = ConnectedComponentsData.getRandomOddEvenEdges(1000, 500, 42L);
DataSet<String> edges = env.fromElements(randomEdges.split("\n"));
```
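
Reproducibility comes from the explicit seed: the same seed yields the same edges on every run. The sketch below shows that idea in plain Java. It is not the implementation of `getRandomOddEvenEdges`; the class `SeededEdgesSketch`, its methods, and the interpretation of the odd/even pattern as "both endpoints of an edge have the same parity" are assumptions for illustration.

```java
import java.util.Random;

public class SeededEdgesSketch {

    // Generates numEdges lines "source target" over vertices 1..numVertices
    // (numVertices >= 1), where both endpoints of an edge share the same parity.
    public static String sameParityEdges(int numEdges, int numVertices, long seed) {
        Random rnd = new Random(seed); // fixed seed gives a repeatable sequence
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < numEdges; i++) {
            int from = rnd.nextInt(numVertices) + 1;
            int to = rnd.nextInt(numVertices) + 1;
            // If the parities differ, move the target to an adjacent vertex,
            // stepping down at the upper bound to stay in range.
            if (from % 2 != to % 2) {
                to = (to == numVertices) ? to - 1 : to + 1;
            }
            sb.append(from).append(' ').append(to).append('\n');
        }
        return sb.toString();
    }

    // Checks that every edge line connects two vertices of the same parity.
    public static boolean allSameParity(String edges) {
        for (String line : edges.split("\n")) {
            if (line.isEmpty()) {
                continue;
            }
            String[] parts = line.split(" ");
            if (Integer.parseInt(parts[0]) % 2 != Integer.parseInt(parts[1]) % 2) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String first = sameParityEdges(5, 10, 42L);
        String second = sameParityEdges(5, 10, 42L);
        System.out.println(first.equals(second)); // same seed, same edges
    }
}
```
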

## Integration with Test Frameworks

The test data classes can be used from tests that extend Flink's test base classes, such as `JavaProgramTestBase`:

```java
public class AlgorithmTest extends JavaProgramTestBase {
    @Override
    protected void testProgram() throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Use any test data class (here: one element per whitespace-separated word)
        DataSet<String> input = env.fromElements(WordCountData.TEXT.split("\\s+"));

        // Run algorithm and validate.
        // processInput and expectedOutput are placeholders for the algorithm
        // under test and its expected result string.
        List<String> result = processInput(input).collect();
        TestBaseUtils.compareResultAsText(result, expectedOutput);
    }
}
```