# Dataset Conversions

The ADAM APIs provide Dataset-based converters that parallel the RDD converters but operate on Spark SQL Datasets, which enables SQL integration and Catalyst query optimization. The converters keep transformations type-safe.

## Capabilities

### Base Dataset Conversion Traits

Foundation traits that define the interface for Dataset-based genomic data conversions.

```java { .api }
/**
 * Base trait for conversions to contig fragment datasets
 * @param <T> Source record type
 * @param <U> Source genomic dataset type
 */
interface ToContigDatasetConversion<T extends Product, U extends GenomicDataset<?, T, U>>
    extends GenomicDatasetConversion<T, U, NucleotideContigFragment, NucleotideContigFragmentRDD> {
    TypeTag<NucleotideContigFragment> xTag = typeTag[NucleotideContigFragment];
}

/**
 * Base trait for conversions to coverage datasets
 */
interface ToCoverageDatasetConversion<T extends Product, U extends GenomicDataset<?, T, U>>
    extends GenomicDatasetConversion<T, U, Coverage, CoverageRDD> {
    TypeTag<Coverage> xTag = typeTag[Coverage];
}

/**
 * Base trait for conversions to feature datasets
 */
interface ToFeatureDatasetConversion<T extends Product, U extends GenomicDataset<?, T, U>>
    extends GenomicDatasetConversion<T, U, Feature, FeatureRDD> {
    TypeTag<Feature> xTag = typeTag[Feature];
}

/**
 * Base trait for conversions to fragment datasets
 */
interface ToFragmentDatasetConversion<T extends Product, U extends GenomicDataset<?, T, U>>
    extends GenomicDatasetConversion<T, U, Fragment, FragmentRDD> {
    TypeTag<Fragment> xTag = typeTag[Fragment];
}

/**
 * Base trait for conversions to alignment record datasets
 */
interface ToAlignmentRecordDatasetConversion<T extends Product, U extends GenomicDataset<?, T, U>>
    extends GenomicDatasetConversion<T, U, AlignmentRecord, AlignmentRecordRDD> {
    TypeTag<AlignmentRecord> xTag = typeTag[AlignmentRecord];
}

/**
 * Base trait for conversions to genotype datasets
 */
interface ToGenotypeDatasetConversion<T extends Product, U extends GenomicDataset<?, T, U>>
    extends GenomicDatasetConversion<T, U, Genotype, GenotypeRDD> {
    TypeTag<Genotype> xTag = typeTag[Genotype];
}

/**
 * Base trait for conversions to variant datasets
 */
interface ToVariantDatasetConversion<T extends Product, U extends GenomicDataset<?, T, U>>
    extends GenomicDatasetConversion<T, U, Variant, VariantRDD> {
    TypeTag<Variant> xTag = typeTag[Variant];
}
```

**Note:** VariantContext dataset conversions are not currently supported in the Dataset converter API. Use the RDD converters for VariantContext transformations.

### Contig Fragment Dataset Converters

Converters whose source is a `NucleotideContigFragmentRDD`. Each `call` takes the source RDD, which supplies the metadata, and a target Dataset, and returns the genomic RDD type that matches the Dataset's record type.

```java { .api }
/**
 * Convert NucleotideContigFragmentRDD with Coverage Dataset to CoverageRDD
 */
class ContigsToCoverageDatasetConverter extends ToCoverageDatasetConversion<NucleotideContigFragment, NucleotideContigFragmentRDD> {
    /**
     * Perform the dataset-based conversion
     * @param v1 Source NucleotideContigFragmentRDD with metadata
     * @param v2 Target Dataset[Coverage] with structured data
     * @return CoverageRDD with combined metadata and data
     */
    CoverageRDD call(NucleotideContigFragmentRDD v1, Dataset<Coverage> v2);
}

/**
 * Convert NucleotideContigFragmentRDD with Feature Dataset to FeatureRDD
 */
class ContigsToFeaturesDatasetConverter extends ToFeatureDatasetConversion<NucleotideContigFragment, NucleotideContigFragmentRDD> {
    FeatureRDD call(NucleotideContigFragmentRDD v1, Dataset<Feature> v2);
}

/**
 * Convert NucleotideContigFragmentRDD with Fragment Dataset to FragmentRDD
 */
class ContigsToFragmentsDatasetConverter extends ToFragmentDatasetConversion<NucleotideContigFragment, NucleotideContigFragmentRDD> {
    FragmentRDD call(NucleotideContigFragmentRDD v1, Dataset<Fragment> v2);
}

/**
 * Convert NucleotideContigFragmentRDD with AlignmentRecord Dataset to AlignmentRecordRDD
 */
class ContigsToAlignmentRecordsDatasetConverter extends ToAlignmentRecordDatasetConversion<NucleotideContigFragment, NucleotideContigFragmentRDD> {
    AlignmentRecordRDD call(NucleotideContigFragmentRDD v1, Dataset<AlignmentRecord> v2);
}

/**
 * Convert NucleotideContigFragmentRDD with Genotype Dataset to GenotypeRDD
 */
class ContigsToGenotypesDatasetConverter extends ToGenotypeDatasetConversion<NucleotideContigFragment, NucleotideContigFragmentRDD> {
    GenotypeRDD call(NucleotideContigFragmentRDD v1, Dataset<Genotype> v2);
}

/**
 * Convert NucleotideContigFragmentRDD with Variant Dataset to VariantRDD
 */
class ContigsToVariantsDatasetConverter extends ToVariantDatasetConversion<NucleotideContigFragment, NucleotideContigFragmentRDD> {
    VariantRDD call(NucleotideContigFragmentRDD v1, Dataset<Variant> v2);
}
```
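
The `call(v1, v2)` contract documented above (metadata from the source, records from the target Dataset) can be modelled in plain Java without Spark or ADAM. The following is only a sketch of that shape: `SketchMetadata`, `SketchGenomicData`, `SketchConversion`, and `ConversionSketch` are hypothetical stand-ins invented for illustration, and a `List` takes the place of a Spark `Dataset`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for genomic metadata (for example, contig names). */
final class SketchMetadata {
    final List<String> contigNames;

    SketchMetadata(List<String> contigNames) {
        this.contigNames = Collections.unmodifiableList(new ArrayList<>(contigNames));
    }
}

/** Hypothetical stand-in for a genomic RDD: metadata plus typed records. */
final class SketchGenomicData<T> {
    final SketchMetadata metadata;
    final List<T> records;

    SketchGenomicData(SketchMetadata metadata, List<T> records) {
        this.metadata = metadata;
        this.records = Collections.unmodifiableList(new ArrayList<>(records));
    }
}

/** Two-argument conversion modelled on call(v1, v2); this is not the ADAM interface. */
interface SketchConversion<T, X> {
    SketchGenomicData<X> call(SketchGenomicData<T> v1, List<X> v2);
}

final class ConversionSketch {
    /** Returns a conversion that keeps v1's metadata and takes its records from v2. */
    static <T, X> SketchConversion<T, X> withSourceMetadata() {
        return (v1, v2) -> new SketchGenomicData<>(v1.metadata, v2);
    }

    public static void main(String[] args) {
        // Source: "reads" carrying the metadata we want to keep.
        SketchGenomicData<String> reads = new SketchGenomicData<>(
            new SketchMetadata(Arrays.asList("chr1", "chr2")),
            Arrays.asList("read-a", "read-b"));
        // Target records of a different type, standing in for a Dataset.
        List<Integer> depths = Arrays.asList(12, 7, 30);

        SketchGenomicData<Integer> coverage =
            ConversionSketch.<String, Integer>withSourceMetadata().call(reads, depths);

        System.out.println(coverage.metadata.contigNames + " / " + coverage.records);
    }
}
```

The design point the sketch tries to capture is that the source argument contributes only its metadata, so the record types of `v1` and `v2` are free to differ.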

### Coverage Dataset Converters

Converters whose source is a `CoverageRDD`. Each pairs the source with a Dataset of another record type and returns the matching genomic RDD.

```java { .api }
/**
 * Convert CoverageRDD with NucleotideContigFragment Dataset to NucleotideContigFragmentRDD
 */
class CoverageToContigsDatasetConverter extends ToContigDatasetConversion<Coverage, CoverageRDD> {
    NucleotideContigFragmentRDD call(CoverageRDD v1, Dataset<NucleotideContigFragment> v2);
}

/**
 * Convert CoverageRDD with Feature Dataset to FeatureRDD
 */
class CoverageToFeaturesDatasetConverter extends ToFeatureDatasetConversion<Coverage, CoverageRDD> {
    FeatureRDD call(CoverageRDD v1, Dataset<Feature> v2);
}

/**
 * Convert CoverageRDD with Fragment Dataset to FragmentRDD
 */
class CoverageToFragmentsDatasetConverter extends ToFragmentDatasetConversion<Coverage, CoverageRDD> {
    FragmentRDD call(CoverageRDD v1, Dataset<Fragment> v2);
}

/**
 * Convert CoverageRDD with AlignmentRecord Dataset to AlignmentRecordRDD
 */
class CoverageToAlignmentRecordsDatasetConverter extends ToAlignmentRecordDatasetConversion<Coverage, CoverageRDD> {
    AlignmentRecordRDD call(CoverageRDD v1, Dataset<AlignmentRecord> v2);
}

/**
 * Convert CoverageRDD with Genotype Dataset to GenotypeRDD
 */
class CoverageToGenotypesDatasetConverter extends ToGenotypeDatasetConversion<Coverage, CoverageRDD> {
    GenotypeRDD call(CoverageRDD v1, Dataset<Genotype> v2);
}

/**
 * Convert CoverageRDD with Variant Dataset to VariantRDD
 */
class CoverageToVariantsDatasetConverter extends ToVariantDatasetConversion<Coverage, CoverageRDD> {
    VariantRDD call(CoverageRDD v1, Dataset<Variant> v2);
}
```

### Feature Dataset Converters

Converters whose source is a `FeatureRDD`. Each pairs the source with a Dataset of another record type and returns the matching genomic RDD.

```java { .api }
/**
 * Convert FeatureRDD with NucleotideContigFragment Dataset to NucleotideContigFragmentRDD
 */
class FeaturesToContigsDatasetConverter extends ToContigDatasetConversion<Feature, FeatureRDD> {
    NucleotideContigFragmentRDD call(FeatureRDD v1, Dataset<NucleotideContigFragment> v2);
}

/**
 * Convert FeatureRDD with Coverage Dataset to CoverageRDD
 */
class FeaturesToCoverageDatasetConverter extends ToCoverageDatasetConversion<Feature, FeatureRDD> {
    CoverageRDD call(FeatureRDD v1, Dataset<Coverage> v2);
}

/**
 * Convert FeatureRDD with Fragment Dataset to FragmentRDD
 */
class FeaturesToFragmentsDatasetConverter extends ToFragmentDatasetConversion<Feature, FeatureRDD> {
    FragmentRDD call(FeatureRDD v1, Dataset<Fragment> v2);
}

/**
 * Convert FeatureRDD with AlignmentRecord Dataset to AlignmentRecordRDD
 */
class FeaturesToAlignmentRecordsDatasetConverter extends ToAlignmentRecordDatasetConversion<Feature, FeatureRDD> {
    AlignmentRecordRDD call(FeatureRDD v1, Dataset<AlignmentRecord> v2);
}

/**
 * Convert FeatureRDD with Genotype Dataset to GenotypeRDD
 */
class FeaturesToGenotypesDatasetConverter extends ToGenotypeDatasetConversion<Feature, FeatureRDD> {
    GenotypeRDD call(FeatureRDD v1, Dataset<Genotype> v2);
}

/**
 * Convert FeatureRDD with Variant Dataset to VariantRDD
 */
class FeaturesToVariantsDatasetConverter extends ToVariantDatasetConversion<Feature, FeatureRDD> {
    VariantRDD call(FeatureRDD v1, Dataset<Variant> v2);
}
```

### Fragment Dataset Converters

Converters whose source is a `FragmentRDD`. Each pairs the source with a Dataset of another record type and returns the matching genomic RDD.

```java { .api }
/**
 * Convert FragmentRDD with NucleotideContigFragment Dataset to NucleotideContigFragmentRDD
 */
class FragmentsToContigsDatasetConverter extends ToContigDatasetConversion<Fragment, FragmentRDD> {
    NucleotideContigFragmentRDD call(FragmentRDD v1, Dataset<NucleotideContigFragment> v2);
}

/**
 * Convert FragmentRDD with Coverage Dataset to CoverageRDD
 */
class FragmentsToCoverageDatasetConverter extends ToCoverageDatasetConversion<Fragment, FragmentRDD> {
    CoverageRDD call(FragmentRDD v1, Dataset<Coverage> v2);
}

/**
 * Convert FragmentRDD with Feature Dataset to FeatureRDD
 */
class FragmentsToFeaturesDatasetConverter extends ToFeatureDatasetConversion<Fragment, FragmentRDD> {
    FeatureRDD call(FragmentRDD v1, Dataset<Feature> v2);
}

/**
 * Convert FragmentRDD with AlignmentRecord Dataset to AlignmentRecordRDD
 */
class FragmentsToAlignmentRecordsDatasetConverter extends ToAlignmentRecordDatasetConversion<Fragment, FragmentRDD> {
    AlignmentRecordRDD call(FragmentRDD v1, Dataset<AlignmentRecord> v2);
}

/**
 * Convert FragmentRDD with Genotype Dataset to GenotypeRDD
 */
class FragmentsToGenotypesDatasetConverter extends ToGenotypeDatasetConversion<Fragment, FragmentRDD> {
    GenotypeRDD call(FragmentRDD v1, Dataset<Genotype> v2);
}

/**
 * Convert FragmentRDD with Variant Dataset to VariantRDD
 */
class FragmentsToVariantsDatasetConverter extends ToVariantDatasetConversion<Fragment, FragmentRDD> {
    VariantRDD call(FragmentRDD v1, Dataset<Variant> v2);
}
```

### Alignment Record Dataset Converters

Converters whose source is an `AlignmentRecordRDD`. Each pairs the source with a Dataset of another record type and returns the matching genomic RDD.

```java { .api }
/**
 * Convert AlignmentRecordRDD with NucleotideContigFragment Dataset to NucleotideContigFragmentRDD
 */
class AlignmentRecordsToContigsDatasetConverter extends ToContigDatasetConversion<AlignmentRecord, AlignmentRecordRDD> {
    NucleotideContigFragmentRDD call(AlignmentRecordRDD v1, Dataset<NucleotideContigFragment> v2);
}

/**
 * Convert AlignmentRecordRDD with Coverage Dataset to CoverageRDD
 */
class AlignmentRecordsToCoverageDatasetConverter extends ToCoverageDatasetConversion<AlignmentRecord, AlignmentRecordRDD> {
    CoverageRDD call(AlignmentRecordRDD v1, Dataset<Coverage> v2);
}

/**
 * Convert AlignmentRecordRDD with Feature Dataset to FeatureRDD
 */
class AlignmentRecordsToFeaturesDatasetConverter extends ToFeatureDatasetConversion<AlignmentRecord, AlignmentRecordRDD> {
    FeatureRDD call(AlignmentRecordRDD v1, Dataset<Feature> v2);
}

/**
 * Convert AlignmentRecordRDD with Fragment Dataset to FragmentRDD
 */
class AlignmentRecordsToFragmentsDatasetConverter extends ToFragmentDatasetConversion<AlignmentRecord, AlignmentRecordRDD> {
    FragmentRDD call(AlignmentRecordRDD v1, Dataset<Fragment> v2);
}

/**
 * Convert AlignmentRecordRDD with Genotype Dataset to GenotypeRDD
 */
class AlignmentRecordsToGenotypesDatasetConverter extends ToGenotypeDatasetConversion<AlignmentRecord, AlignmentRecordRDD> {
    GenotypeRDD call(AlignmentRecordRDD v1, Dataset<Genotype> v2);
}

/**
 * Convert AlignmentRecordRDD with Variant Dataset to VariantRDD
 */
class AlignmentRecordsToVariantsDatasetConverter extends ToVariantDatasetConversion<AlignmentRecord, AlignmentRecordRDD> {
    VariantRDD call(AlignmentRecordRDD v1, Dataset<Variant> v2);
}
```

### Genotype Dataset Converters

Converters whose source is a `GenotypeRDD`. Each pairs the source with a Dataset of another record type and returns the matching genomic RDD.

```java { .api }
/**
 * Convert GenotypeRDD with NucleotideContigFragment Dataset to NucleotideContigFragmentRDD
 */
class GenotypesToContigsDatasetConverter extends ToContigDatasetConversion<Genotype, GenotypeRDD> {
    NucleotideContigFragmentRDD call(GenotypeRDD v1, Dataset<NucleotideContigFragment> v2);
}

/**
 * Convert GenotypeRDD with Coverage Dataset to CoverageRDD
 */
class GenotypesToCoverageDatasetConverter extends ToCoverageDatasetConversion<Genotype, GenotypeRDD> {
    CoverageRDD call(GenotypeRDD v1, Dataset<Coverage> v2);
}

/**
 * Convert GenotypeRDD with Feature Dataset to FeatureRDD
 */
class GenotypesToFeaturesDatasetConverter extends ToFeatureDatasetConversion<Genotype, GenotypeRDD> {
    FeatureRDD call(GenotypeRDD v1, Dataset<Feature> v2);
}

/**
 * Convert GenotypeRDD with Fragment Dataset to FragmentRDD
 */
class GenotypesToFragmentsDatasetConverter extends ToFragmentDatasetConversion<Genotype, GenotypeRDD> {
    FragmentRDD call(GenotypeRDD v1, Dataset<Fragment> v2);
}

/**
 * Convert GenotypeRDD with AlignmentRecord Dataset to AlignmentRecordRDD
 */
class GenotypesToAlignmentRecordsDatasetConverter extends ToAlignmentRecordDatasetConversion<Genotype, GenotypeRDD> {
    AlignmentRecordRDD call(GenotypeRDD v1, Dataset<AlignmentRecord> v2);
}

/**
 * Convert GenotypeRDD with Variant Dataset to VariantRDD
 */
class GenotypesToVariantsDatasetConverter extends ToVariantDatasetConversion<Genotype, GenotypeRDD> {
    VariantRDD call(GenotypeRDD v1, Dataset<Variant> v2);
}
```

### Variant Dataset Converters

Converters whose source is a `VariantRDD`. Each pairs the source with a Dataset of another record type and returns the matching genomic RDD.

```java { .api }
/**
 * Convert VariantRDD with NucleotideContigFragment Dataset to NucleotideContigFragmentRDD
 */
class VariantsToContigsDatasetConverter extends ToContigDatasetConversion<Variant, VariantRDD> {
    NucleotideContigFragmentRDD call(VariantRDD v1, Dataset<NucleotideContigFragment> v2);
}

/**
 * Convert VariantRDD with Coverage Dataset to CoverageRDD
 */
class VariantsToCoverageDatasetConverter extends ToCoverageDatasetConversion<Variant, VariantRDD> {
    CoverageRDD call(VariantRDD v1, Dataset<Coverage> v2);
}

/**
 * Convert VariantRDD with Feature Dataset to FeatureRDD
 */
class VariantsToFeaturesDatasetConverter extends ToFeatureDatasetConversion<Variant, VariantRDD> {
    FeatureRDD call(VariantRDD v1, Dataset<Feature> v2);
}

/**
 * Convert VariantRDD with Fragment Dataset to FragmentRDD
 */
class VariantsToFragmentsDatasetConverter extends ToFragmentDatasetConversion<Variant, VariantRDD> {
    FragmentRDD call(VariantRDD v1, Dataset<Fragment> v2);
}

/**
 * Convert VariantRDD with AlignmentRecord Dataset to AlignmentRecordRDD
 */
class VariantsToAlignmentRecordsDatasetConverter extends ToAlignmentRecordDatasetConversion<Variant, VariantRDD> {
    AlignmentRecordRDD call(VariantRDD v1, Dataset<AlignmentRecord> v2);
}

/**
 * Convert VariantRDD with Genotype Dataset to GenotypeRDD
 */
class VariantsToGenotypesDatasetConverter extends ToGenotypeDatasetConversion<Variant, VariantRDD> {
    GenotypeRDD call(VariantRDD v1, Dataset<Genotype> v2);
}
```
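
Across all seven converter sections, the class names follow one pattern: `<Source>To<Target>DatasetConverter`, with one class for every ordered pair of distinct types. As a quick way to check a name against that pattern, the sketch below enumerates the combinations. `ConverterNames` is a hypothetical helper written for this page, not part of ADAM; the labels are copied from the class names listed above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that rebuilds the converter class names from their naming pattern. */
final class ConverterNames {
    // Type labels exactly as they appear in the converter class names.
    static final String[] LABELS = {
        "Contigs", "Coverage", "Features", "Fragments",
        "AlignmentRecords", "Genotypes", "Variants"
    };

    /** Returns one name per ordered (source, target) pair with source != target. */
    static List<String> all() {
        List<String> names = new ArrayList<>();
        for (String source : LABELS) {
            for (String target : LABELS) {
                if (!source.equals(target)) {
                    names.add(source + "To" + target + "DatasetConverter");
                }
            }
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> names = all();
        names.forEach(System.out::println);
        System.out.println(names.size() + " converter names");
    }
}
```

Seven types with six targets each give 42 names, which matches the six classes listed in each of the seven sections.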

## Usage Examples

**Basic Dataset conversion with SQL integration:**

```java
import org.bdgenomics.adam.api.java.*;
import org.apache.spark.sql.Dataset;

// Load genomic data (jac is assumed to be an existing JavaADAMContext)
AlignmentRecordRDD alignments = jac.loadAlignments("sample.bam");
VariantRDD variants = jac.loadVariants("variants.vcf");

// Convert to Dataset for SQL operations
Dataset<Variant> variantDS = variants.dataset();

// Apply SQL transformations
Dataset<Variant> filteredVariants = variantDS
    .filter("start > 1000000")
    .filter("qual > 30.0");

// Wrap the filtered Dataset as a VariantRDD, taking metadata from the alignments
AlignmentRecordsToVariantsDatasetConverter converter =
    new AlignmentRecordsToVariantsDatasetConverter();
VariantRDD convertedVariants = converter.call(alignments, filteredVariants);
```

**Performance-optimized Dataset operations:**

```java
// Load large genomic datasets (jac is assumed to be an existing JavaADAMContext)
GenotypeRDD genotypes = jac.loadGenotypes("large_cohort.vcf");
FeatureRDD features = jac.loadFeatures("annotations.gtf");

// Convert to Datasets for Catalyst optimization
Dataset<Genotype> genotypeDS = genotypes.dataset();
Dataset<Feature> featureDS = features.dataset();

// Perform complex SQL-based analysis.
// The aliases make "feature.*" and "genotype.variant.qual" resolvable.
// The join yields untyped rows, so the result is converted back to a typed
// Dataset; a bean encoder is assumed to be suitable for Feature here.
Dataset<Feature> annotatedFeatures = featureDS.as("feature")
    .join(genotypeDS.as("genotype"), "contigName")
    .where("genotype.variant.qual > 50")
    .select("feature.*")
    .as(Encoders.bean(Feature.class));

// Convert back with preserved metadata
GenotypesToFeaturesDatasetConverter converter =
    new GenotypesToFeaturesDatasetConverter();
FeatureRDD result = converter.call(genotypes, annotatedFeatures);
```

**Combining RDD and Dataset operations:**

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.count;

// Start with RDD operations for complex logic
AlignmentRecordRDD alignments = jac.loadAlignments("sample.bam");
// jrdd() yields a JavaRDD; check mapq for null before unboxing it
JavaRDD<AlignmentRecord> filteredRDD = alignments.jrdd()
    .filter(read -> read.getReadMapped() && read.getMapq() != null && read.getMapq() > 30);

// Convert to Dataset for SQL operations (spark is assumed to be an existing SparkSession)
Dataset<AlignmentRecord> alignmentDS = spark.createDataset(
    filteredRDD.rdd(), Encoders.bean(AlignmentRecord.class));

// groupBy/agg/select produce untyped rows
Dataset<Row> counts = alignmentDS
    .groupBy("contigName", "start")
    .agg(count("*").as("count"))
    .select(col("contigName"), col("start"), col("count").as("score"));

// A typed Dataset<Coverage> needs an Encoder<Coverage>;
// coverageEncoder is an assumption here and must be supplied by the caller
Dataset<Coverage> coverageDS = counts.as(coverageEncoder);

// Convert back to genomic RDD with metadata
AlignmentRecordsToCoverageDatasetConverter converter =
    new AlignmentRecordsToCoverageDatasetConverter();
CoverageRDD coverage = converter.call(alignments, coverageDS);
```
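
The aggregation step in the last example groups reads by `contigName` and `start` and counts them. To make that logic concrete without a Spark cluster, here is a minimal sketch of the same grouping with Java streams. `SketchRead` and `CoverageCountSketch` are hypothetical names used only for this illustration, and the in-memory list stands in for the Dataset.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Hypothetical minimal read: only the two fields used as grouping keys. */
final class SketchRead {
    final String contigName;
    final long start;

    SketchRead(String contigName, long start) {
        this.contigName = contigName;
        this.start = start;
    }
}

final class CoverageCountSketch {
    /** Counts reads per "contigName:start" key, mirroring groupBy(...).agg(count("*")). */
    static Map<String, Long> countByPosition(List<SketchRead> reads) {
        return reads.stream().collect(Collectors.groupingBy(
            read -> read.contigName + ":" + read.start,
            TreeMap::new,              // sorted keys for stable printing
            Collectors.counting()));
    }

    public static void main(String[] args) {
        List<SketchRead> reads = Arrays.asList(
            new SketchRead("chr1", 100),
            new SketchRead("chr1", 100),
            new SketchRead("chr2", 5));
        System.out.println(countByPosition(reads));
    }
}
```

In the Spark version the same grouping is planned by Catalyst and executed in a distributed fashion; the sketch only shows what the query computes.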

## Key Benefits

- **Catalyst Optimization**: Queries on the Dataset side are planned by Spark SQL's Catalyst optimizer
- **SQL Integration**: Genomic data can be queried with SQL expressions through the Dataset API
- **Type Safety**: Typed Datasets and typed converter signatures keep compile-time type checking
- **Metadata Preservation**: The source genomic RDD's metadata is carried over to the converted result
- **Interoperability**: The converters bridge the RDD and Dataset APIs
- **Performance**: Complex analytical queries can run faster than equivalent RDD operations