# Dataset Conversions

Type-safe conversion system for transforming one genomic dataset type into another using Spark SQL Datasets. The GenomicDatasetConverters module provides a converter class for every ordered pair of the seven genomic dataset types documented below, and each conversion preserves type safety and metadata.

## Capabilities

### Conversion Traits

Base traits defining the conversion interface for each target genomic data type. Each trait fixes the target record and dataset types and leaves the source types `T` and `U` generic.

```scala { .api }
/**
 * Convert to NucleotideContigFragmentRDD from any source genomic dataset.
 */
trait ToContigDatasetConversion[T <: Product, U <: GenomicDataset[_, T, U]]
  extends GenomicDatasetConversion[T, U, NucleotideContigFragment, NucleotideContigFragmentRDD] {
  val xTag: TypeTag[NucleotideContigFragment]
}

/**
 * Convert to CoverageRDD from any source genomic dataset.
 */
trait ToCoverageDatasetConversion[T <: Product, U <: GenomicDataset[_, T, U]]
  extends GenomicDatasetConversion[T, U, Coverage, CoverageRDD] {
  val xTag: TypeTag[Coverage]
}

/**
 * Convert to FeatureRDD from any source genomic dataset.
 */
trait ToFeatureDatasetConversion[T <: Product, U <: GenomicDataset[_, T, U]]
  extends GenomicDatasetConversion[T, U, Feature, FeatureRDD] {
  val xTag: TypeTag[Feature]
}

/**
 * Convert to FragmentRDD from any source genomic dataset.
 */
trait ToFragmentDatasetConversion[T <: Product, U <: GenomicDataset[_, T, U]]
  extends GenomicDatasetConversion[T, U, Fragment, FragmentRDD] {
  val xTag: TypeTag[Fragment]
}

/**
 * Convert to AlignmentRecordRDD from any source genomic dataset.
 */
trait ToAlignmentRecordDatasetConversion[T <: Product, U <: GenomicDataset[_, T, U]]
  extends GenomicDatasetConversion[T, U, AlignmentRecord, AlignmentRecordRDD] {
  val xTag: TypeTag[AlignmentRecord]
}

/**
 * Convert to GenotypeRDD from any source genomic dataset.
 */
trait ToGenotypeDatasetConversion[T <: Product, U <: GenomicDataset[_, T, U]]
  extends GenomicDatasetConversion[T, U, Genotype, GenotypeRDD] {
  val xTag: TypeTag[Genotype]
}

/**
 * Convert to VariantRDD from any source genomic dataset.
 */
trait ToVariantDatasetConversion[T <: Product, U <: GenomicDataset[_, T, U]]
  extends GenomicDatasetConversion[T, U, Variant, VariantRDD] {
  val xTag: TypeTag[Variant]
}
```
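
To make the type bounds concrete, here is a minimal, self-contained sketch of the same pattern that uses plain Scala collections in place of Spark Datasets. `ToyDataset`, `ToyReads`, `ToyFeatures`, `ToyConversion`, `ToToyFeaturesConversion`, and `ReadsToFeatures` are hypothetical stand-ins rather than ADAM classes, and the idea that a converter wraps already-converted records with the source's metadata is an assumption made for illustration.

```scala
// Toy model of the conversion traits; every name here is hypothetical, not ADAM's API.

// F-bounded dataset type: U is the concrete dataset class holding records of type T.
trait ToyDataset[T, U <: ToyDataset[T, U]] {
  def records: Seq[T]
  def sequences: Seq[String] // stand-in for a sequence dictionary
}

final case class ToyReads(records: Seq[String], sequences: Seq[String]) extends ToyDataset[String, ToyReads]

final case class ToyFeatures(records: Seq[Int], sequences: Seq[String]) extends ToyDataset[Int, ToyFeatures]

// General shape: source types (T, U) and target types (X, Y).
trait ToyConversion[T, U <: ToyDataset[T, U], X, Y <: ToyDataset[X, Y]] {
  def call(source: U, converted: Seq[X]): Y
}

// Target-specific trait: pins X and Y, leaves the source types generic.
trait ToToyFeaturesConversion[T, U <: ToyDataset[T, U]] extends ToyConversion[T, U, Int, ToyFeatures]

// Concrete converter: assumed to wrap the converted records with the source's metadata.
object ReadsToFeatures extends ToToyFeaturesConversion[String, ToyReads] {
  def call(source: ToyReads, converted: Seq[Int]): ToyFeatures =
    ToyFeatures(converted, source.sequences)
}

val reads = ToyReads(Seq("ACGT", "GG"), Seq("chr1"))
val features = ReadsToFeatures.call(reads, reads.records.map(_.length))
```

In this sketch, passing the wrong source type, for example `ReadsToFeatures.call(features, Seq(1))`, is rejected by the compiler, which is the kind of guarantee the bounds on `T` and `U` are meant to provide.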

### Contig Dataset Conversions

Convert other genomic data types to nucleotide contig fragments (reference sequences).

```scala { .api }
/**
 * Convert CoverageRDD to NucleotideContigFragmentRDD via Dataset.
 */
class CoverageToContigsDatasetConverter
  extends ToContigDatasetConversion[Coverage, CoverageRDD]

/**
 * Convert FeatureRDD to NucleotideContigFragmentRDD via Dataset.
 */
class FeaturesToContigsDatasetConverter
  extends ToContigDatasetConversion[Feature, FeatureRDD]

/**
 * Convert FragmentRDD to NucleotideContigFragmentRDD via Dataset.
 */
class FragmentsToContigsDatasetConverter
  extends ToContigDatasetConversion[Fragment, FragmentRDD]

/**
 * Convert AlignmentRecordRDD to NucleotideContigFragmentRDD via Dataset.
 */
class AlignmentRecordsToContigsDatasetConverter
  extends ToContigDatasetConversion[AlignmentRecord, AlignmentRecordRDD]

/**
 * Convert GenotypeRDD to NucleotideContigFragmentRDD via Dataset.
 */
class GenotypesToContigsDatasetConverter
  extends ToContigDatasetConversion[Genotype, GenotypeRDD]

/**
 * Convert VariantRDD to NucleotideContigFragmentRDD via Dataset.
 */
class VariantsToContigsDatasetConverter
  extends ToContigDatasetConversion[Variant, VariantRDD]
```

### Coverage Dataset Conversions

Convert other genomic data types to coverage data representing sequencing depth or signal intensity.

```scala { .api }
/**
 * Convert NucleotideContigFragmentRDD to CoverageRDD via Dataset.
 */
class ContigsToCoverageDatasetConverter
  extends ToCoverageDatasetConversion[NucleotideContigFragment, NucleotideContigFragmentRDD]

/**
 * Convert FeatureRDD to CoverageRDD via Dataset.
 */
class FeaturesToCoverageDatasetConverter
  extends ToCoverageDatasetConversion[Feature, FeatureRDD]

/**
 * Convert FragmentRDD to CoverageRDD via Dataset.
 */
class FragmentsToCoverageDatasetConverter
  extends ToCoverageDatasetConversion[Fragment, FragmentRDD]

/**
 * Convert AlignmentRecordRDD to CoverageRDD via Dataset.
 */
class AlignmentRecordsToCoverageDatasetConverter
  extends ToCoverageDatasetConversion[AlignmentRecord, AlignmentRecordRDD]

/**
 * Convert GenotypeRDD to CoverageRDD via Dataset.
 */
class GenotypesToCoverageDatasetConverter
  extends ToCoverageDatasetConversion[Genotype, GenotypeRDD]

/**
 * Convert VariantRDD to CoverageRDD via Dataset.
 */
class VariantsToCoverageDatasetConverter
  extends ToCoverageDatasetConversion[Variant, VariantRDD]
```

### Feature Dataset Conversions

Convert other genomic data types to genomic feature annotations (genes, intervals, etc.).

```scala { .api }
/**
 * Convert NucleotideContigFragmentRDD to FeatureRDD via Dataset.
 */
class ContigsToFeaturesDatasetConverter
  extends ToFeatureDatasetConversion[NucleotideContigFragment, NucleotideContigFragmentRDD]

/**
 * Convert CoverageRDD to FeatureRDD via Dataset.
 */
class CoverageToFeaturesDatasetConverter
  extends ToFeatureDatasetConversion[Coverage, CoverageRDD]

/**
 * Convert FragmentRDD to FeatureRDD via Dataset.
 */
class FragmentsToFeaturesDatasetConverter
  extends ToFeatureDatasetConversion[Fragment, FragmentRDD]

/**
 * Convert AlignmentRecordRDD to FeatureRDD via Dataset.
 */
class AlignmentRecordsToFeaturesDatasetConverter
  extends ToFeatureDatasetConversion[AlignmentRecord, AlignmentRecordRDD]

/**
 * Convert GenotypeRDD to FeatureRDD via Dataset.
 */
class GenotypesToFeaturesDatasetConverter
  extends ToFeatureDatasetConversion[Genotype, GenotypeRDD]

/**
 * Convert VariantRDD to FeatureRDD via Dataset.
 */
class VariantsToFeaturesDatasetConverter
  extends ToFeatureDatasetConversion[Variant, VariantRDD]
```

### Fragment Dataset Conversions

Convert other genomic data types to paired-end sequencing fragments.

```scala { .api }
/**
 * Convert NucleotideContigFragmentRDD to FragmentRDD via Dataset.
 */
class ContigsToFragmentsDatasetConverter
  extends ToFragmentDatasetConversion[NucleotideContigFragment, NucleotideContigFragmentRDD]

/**
 * Convert CoverageRDD to FragmentRDD via Dataset.
 */
class CoverageToFragmentsDatasetConverter
  extends ToFragmentDatasetConversion[Coverage, CoverageRDD]

/**
 * Convert FeatureRDD to FragmentRDD via Dataset.
 */
class FeaturesToFragmentsDatasetConverter
  extends ToFragmentDatasetConversion[Feature, FeatureRDD]

/**
 * Convert AlignmentRecordRDD to FragmentRDD via Dataset.
 */
class AlignmentRecordsToFragmentsDatasetConverter
  extends ToFragmentDatasetConversion[AlignmentRecord, AlignmentRecordRDD]

/**
 * Convert GenotypeRDD to FragmentRDD via Dataset.
 */
class GenotypesToFragmentsDatasetConverter
  extends ToFragmentDatasetConversion[Genotype, GenotypeRDD]

/**
 * Convert VariantRDD to FragmentRDD via Dataset.
 */
class VariantsToFragmentsDatasetConverter
  extends ToFragmentDatasetConversion[Variant, VariantRDD]
```

### Alignment Record Dataset Conversions

Convert other genomic data types to sequence alignment records.

```scala { .api }
/**
 * Convert NucleotideContigFragmentRDD to AlignmentRecordRDD via Dataset.
 */
class ContigsToAlignmentRecordsDatasetConverter
  extends ToAlignmentRecordDatasetConversion[NucleotideContigFragment, NucleotideContigFragmentRDD]

/**
 * Convert CoverageRDD to AlignmentRecordRDD via Dataset.
 */
class CoverageToAlignmentRecordsDatasetConverter
  extends ToAlignmentRecordDatasetConversion[Coverage, CoverageRDD]

/**
 * Convert FeatureRDD to AlignmentRecordRDD via Dataset.
 */
class FeaturesToAlignmentRecordsDatasetConverter
  extends ToAlignmentRecordDatasetConversion[Feature, FeatureRDD]

/**
 * Convert FragmentRDD to AlignmentRecordRDD via Dataset.
 */
class FragmentsToAlignmentRecordsDatasetConverter
  extends ToAlignmentRecordDatasetConversion[Fragment, FragmentRDD]

/**
 * Convert GenotypeRDD to AlignmentRecordRDD via Dataset.
 */
class GenotypesToAlignmentRecordsDatasetConverter
  extends ToAlignmentRecordDatasetConversion[Genotype, GenotypeRDD]

/**
 * Convert VariantRDD to AlignmentRecordRDD via Dataset.
 */
class VariantsToAlignmentRecordsDatasetConverter
  extends ToAlignmentRecordDatasetConversion[Variant, VariantRDD]
```

### Genotype Dataset Conversions

Convert other genomic data types to genotype information from variant calling.

```scala { .api }
/**
 * Convert NucleotideContigFragmentRDD to GenotypeRDD via Dataset.
 */
class ContigsToGenotypesDatasetConverter
  extends ToGenotypeDatasetConversion[NucleotideContigFragment, NucleotideContigFragmentRDD]

/**
 * Convert CoverageRDD to GenotypeRDD via Dataset.
 */
class CoverageToGenotypesDatasetConverter
  extends ToGenotypeDatasetConversion[Coverage, CoverageRDD]

/**
 * Convert FeatureRDD to GenotypeRDD via Dataset.
 */
class FeaturesToGenotypesDatasetConverter
  extends ToGenotypeDatasetConversion[Feature, FeatureRDD]

/**
 * Convert FragmentRDD to GenotypeRDD via Dataset.
 */
class FragmentsToGenotypesDatasetConverter
  extends ToGenotypeDatasetConversion[Fragment, FragmentRDD]

/**
 * Convert AlignmentRecordRDD to GenotypeRDD via Dataset.
 */
class AlignmentRecordsToGenotypesDatasetConverter
  extends ToGenotypeDatasetConversion[AlignmentRecord, AlignmentRecordRDD]

/**
 * Convert VariantRDD to GenotypeRDD via Dataset.
 */
class VariantsToGenotypesDatasetConverter
  extends ToGenotypeDatasetConversion[Variant, VariantRDD]
```

### Variant Dataset Conversions

Convert other genomic data types to genetic variant information.

```scala { .api }
/**
 * Convert NucleotideContigFragmentRDD to VariantRDD via Dataset.
 */
class ContigsToVariantsDatasetConverter
  extends ToVariantDatasetConversion[NucleotideContigFragment, NucleotideContigFragmentRDD]

/**
 * Convert CoverageRDD to VariantRDD via Dataset.
 */
class CoverageToVariantsDatasetConverter
  extends ToVariantDatasetConversion[Coverage, CoverageRDD]

/**
 * Convert FeatureRDD to VariantRDD via Dataset.
 */
class FeaturesToVariantsDatasetConverter
  extends ToVariantDatasetConversion[Feature, FeatureRDD]

/**
 * Convert FragmentRDD to VariantRDD via Dataset.
 */
class FragmentsToVariantsDatasetConverter
  extends ToVariantDatasetConversion[Fragment, FragmentRDD]

/**
 * Convert AlignmentRecordRDD to VariantRDD via Dataset.
 */
class AlignmentRecordsToVariantsDatasetConverter
  extends ToVariantDatasetConversion[AlignmentRecord, AlignmentRecordRDD]

/**
 * Convert GenotypeRDD to VariantRDD via Dataset.
 */
class GenotypesToVariantsDatasetConverter
  extends ToVariantDatasetConversion[Genotype, GenotypeRDD]
```

## Usage Examples

```scala
import org.bdgenomics.adam.api.java.GenomicDatasetConverters._
import org.apache.spark.sql.Dataset

// Assumes `jac` (a JavaADAMContext) and `spark` (a SparkSession) are already in scope,
// along with the ADAM dataset and record types and an implicit Encoder for each record type.

// Convert alignment records to features using Dataset
val alignments: AlignmentRecordRDD = jac.loadAlignments("input.bam")
// toDF() yields an untyped DataFrame, so cast it to get a typed Dataset
val alignmentDataset: Dataset[AlignmentRecord] = alignments.toDF().as[AlignmentRecord]
// Target-typed Dataset passed as the converter's second argument (empty placeholder here)
val emptyFeatureDataset: Dataset[Feature] = spark.emptyDataset[Feature]

val converter = new AlignmentRecordsToFeaturesDatasetConverter()
val features: FeatureRDD = converter.call(alignments, emptyFeatureDataset)

// Convert variants to coverage using Dataset
val variants: VariantRDD = jac.loadVariants("variants.vcf")
val variantDataset: Dataset[Variant] = variants.toDF().as[Variant]
val emptyCoverageDataset: Dataset[Coverage] = spark.emptyDataset[Coverage]

val coverageConverter = new VariantsToCoverageDatasetConverter()
val coverage: CoverageRDD = coverageConverter.call(variants, emptyCoverageDataset)
```

## Type Safety and Metadata Preservation

All dataset converters maintain:
- **Type Safety**: Compile-time guarantees for conversion compatibility
- **Sequence Dictionary**: Reference genome information preserved across conversions
- **Record Group Dictionary**: Sample and library information maintained for alignment-based conversions
- **Processing Commands**: History of transformations tracked in metadata
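
The guarantees listed above can be pictured with a small, self-contained model. `ToyMetadata`, `ToyGenomicData`, and `convertKeepingMetadata` are hypothetical names, and the sketch assumes that a conversion copies the metadata to the result unchanged; ADAM's converters may treat individual fields differently.

```scala
// Hypothetical model of metadata carry-over; not ADAM's actual classes.
final case class ToyMetadata(
  sequences: Vector[String],       // stand-in for the sequence dictionary
  recordGroups: Vector[String],    // stand-in for the record group dictionary
  processingSteps: Vector[String]) // stand-in for the processing command history

final case class ToyGenomicData[A](records: Vector[A], metadata: ToyMetadata)

// Convert each record with `f`; the metadata is copied to the result unchanged (an assumption).
def convertKeepingMetadata[A, B](source: ToyGenomicData[A])(f: A => B): ToyGenomicData[B] =
  ToyGenomicData(source.records.map(f), source.metadata)

val meta = ToyMetadata(Vector("chr1", "chr2"), Vector("sample1-lib1"), Vector("loadAlignments"))
val sourceReads = ToyGenomicData(Vector("ACGT", "GGC"), meta)
val lengths: ToyGenomicData[Int] = convertKeepingMetadata(sourceReads)(_.length)
```

The record type changes from `String` to `Int` and is checked at compile time, while `lengths.metadata` is the same value that was attached to `sourceReads`.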