# Dataset Conversions

A type-safe conversion system for transforming one genomic dataset type into another through Spark SQL Datasets. The GenomicDatasetConverters module provides a converter for every ordered pair of the seven genomic data types listed below, preserving type safety and metadata.

## Capabilities

### Conversion Traits

Base traits defining the conversion interface for each target genomic data type.

```scala { .api }
/**
 * Convert to NucleotideContigFragmentRDD from any source genomic dataset.
 */
trait ToContigDatasetConversion[T <: Product, U <: GenomicDataset[_, T, U]]
    extends GenomicDatasetConversion[T, U, NucleotideContigFragment, NucleotideContigFragmentRDD] {
  val xTag: TypeTag[NucleotideContigFragment]
}

/**
 * Convert to CoverageRDD from any source genomic dataset.
 */
trait ToCoverageDatasetConversion[T <: Product, U <: GenomicDataset[_, T, U]]
    extends GenomicDatasetConversion[T, U, Coverage, CoverageRDD] {
  val xTag: TypeTag[Coverage]
}

/**
 * Convert to FeatureRDD from any source genomic dataset.
 */
trait ToFeatureDatasetConversion[T <: Product, U <: GenomicDataset[_, T, U]]
    extends GenomicDatasetConversion[T, U, Feature, FeatureRDD] {
  val xTag: TypeTag[Feature]
}

/**
 * Convert to FragmentRDD from any source genomic dataset.
 */
trait ToFragmentDatasetConversion[T <: Product, U <: GenomicDataset[_, T, U]]
    extends GenomicDatasetConversion[T, U, Fragment, FragmentRDD] {
  val xTag: TypeTag[Fragment]
}

/**
 * Convert to AlignmentRecordRDD from any source genomic dataset.
 */
trait ToAlignmentRecordDatasetConversion[T <: Product, U <: GenomicDataset[_, T, U]]
    extends GenomicDatasetConversion[T, U, AlignmentRecord, AlignmentRecordRDD] {
  val xTag: TypeTag[AlignmentRecord]
}

/**
 * Convert to GenotypeRDD from any source genomic dataset.
 */
trait ToGenotypeDatasetConversion[T <: Product, U <: GenomicDataset[_, T, U]]
    extends GenomicDatasetConversion[T, U, Genotype, GenotypeRDD] {
  val xTag: TypeTag[Genotype]
}

/**
 * Convert to VariantRDD from any source genomic dataset.
 */
trait ToVariantDatasetConversion[T <: Product, U <: GenomicDataset[_, T, U]]
    extends GenomicDatasetConversion[T, U, Variant, VariantRDD] {
  val xTag: TypeTag[Variant]
}
```
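The traits above share one shape: each fixes the target record and dataset types and leaves the source types `T` and `U` open, with `U` bounded by its own dataset type. The following is a minimal, standard-library-only sketch of that pattern, not ADAM's implementation. Every name in it (`SimpleDataset`, `Conversion`, `ToRegionConversion`, `Read`, `Region`, and so on) is hypothetical, `ClassTag` stands in for `TypeTag` so that no reflection library is needed, and a plain `Seq` stands in for a Spark `Dataset`.

```scala
import scala.reflect.{ ClassTag, classTag }

// Hypothetical record types; these are not ADAM classes.
final case class Read(contig: String, start: Long, end: Long)
final case class Region(contig: String, start: Long, end: Long)

// Simplified F-bounded dataset: U is the concrete dataset type holding records of type T.
trait SimpleDataset[T <: Product, U <: SimpleDataset[T, U]] {
  def records: Seq[T]
}

final case class ReadDataset(records: Seq[Read]) extends SimpleDataset[Read, ReadDataset]
final case class RegionDataset(records: Seq[Region]) extends SimpleDataset[Region, RegionDataset]

// General conversion from a source (T, U) to a target (X, Y).
trait Conversion[T <: Product, U <: SimpleDataset[T, U], X <: Product, Y <: SimpleDataset[X, Y]] {
  val xTag: ClassTag[X]
  def call(source: U, converted: Seq[X]): Y
}

// Target-specific trait: fixes X and Y, leaves the source types open.
trait ToRegionConversion[T <: Product, U <: SimpleDataset[T, U]]
    extends Conversion[T, U, Region, RegionDataset] {
  val xTag: ClassTag[Region] = classTag[Region]
}

// Concrete converter: binds the source types to Read and ReadDataset.
final class ReadsToRegionsConverter extends ToRegionConversion[Read, ReadDataset] {
  def call(source: ReadDataset, converted: Seq[Region]): RegionDataset =
    RegionDataset(converted)
}

object ConversionSketch {
  def main(args: Array[String]): Unit = {
    val reads = ReadDataset(Seq(Read("chr1", 10L, 60L), Read("chr1", 100L, 150L)))
    val regions = reads.records.map(r => Region(r.contig, r.start, r.end))
    val result: RegionDataset = new ReadsToRegionsConverter().call(reads, regions)
    println(result.records.size) // prints 2
  }
}
```

Because the bound on `U` ties it to `T`, a declaration such as `ToRegionConversion[Read, RegionDataset]` would be rejected by the compiler, which is the kind of compile-time check the bounded type parameters are meant to provide.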

### Contig Dataset Conversions

Convert other genomic data types to nucleotide contig fragments (reference sequences).

```scala { .api }
/**
 * Convert CoverageRDD to NucleotideContigFragmentRDD via Dataset.
 */
class CoverageToContigsDatasetConverter
    extends ToContigDatasetConversion[Coverage, CoverageRDD]

/**
 * Convert FeatureRDD to NucleotideContigFragmentRDD via Dataset.
 */
class FeaturesToContigsDatasetConverter
    extends ToContigDatasetConversion[Feature, FeatureRDD]

/**
 * Convert FragmentRDD to NucleotideContigFragmentRDD via Dataset.
 */
class FragmentsToContigsDatasetConverter
    extends ToContigDatasetConversion[Fragment, FragmentRDD]

/**
 * Convert AlignmentRecordRDD to NucleotideContigFragmentRDD via Dataset.
 */
class AlignmentRecordsToContigsDatasetConverter
    extends ToContigDatasetConversion[AlignmentRecord, AlignmentRecordRDD]

/**
 * Convert GenotypeRDD to NucleotideContigFragmentRDD via Dataset.
 */
class GenotypesToContigsDatasetConverter
    extends ToContigDatasetConversion[Genotype, GenotypeRDD]

/**
 * Convert VariantRDD to NucleotideContigFragmentRDD via Dataset.
 */
class VariantsToContigsDatasetConverter
    extends ToContigDatasetConversion[Variant, VariantRDD]
```

### Coverage Dataset Conversions

Convert other genomic data types to coverage data representing sequencing depth or signal intensity.

```scala { .api }
/**
 * Convert NucleotideContigFragmentRDD to CoverageRDD via Dataset.
 */
class ContigsToCoverageDatasetConverter
    extends ToCoverageDatasetConversion[NucleotideContigFragment, NucleotideContigFragmentRDD]

/**
 * Convert FeatureRDD to CoverageRDD via Dataset.
 */
class FeaturesToCoverageDatasetConverter
    extends ToCoverageDatasetConversion[Feature, FeatureRDD]

/**
 * Convert FragmentRDD to CoverageRDD via Dataset.
 */
class FragmentsToCoverageDatasetConverter
    extends ToCoverageDatasetConversion[Fragment, FragmentRDD]

/**
 * Convert AlignmentRecordRDD to CoverageRDD via Dataset.
 */
class AlignmentRecordsToCoverageDatasetConverter
    extends ToCoverageDatasetConversion[AlignmentRecord, AlignmentRecordRDD]

/**
 * Convert GenotypeRDD to CoverageRDD via Dataset.
 */
class GenotypesToCoverageDatasetConverter
    extends ToCoverageDatasetConversion[Genotype, GenotypeRDD]

/**
 * Convert VariantRDD to CoverageRDD via Dataset.
 */
class VariantsToCoverageDatasetConverter
    extends ToCoverageDatasetConversion[Variant, VariantRDD]
```

### Feature Dataset Conversions

Convert other genomic data types to genomic feature annotations (genes, intervals, etc.).

```scala { .api }
/**
 * Convert NucleotideContigFragmentRDD to FeatureRDD via Dataset.
 */
class ContigsToFeaturesDatasetConverter
    extends ToFeatureDatasetConversion[NucleotideContigFragment, NucleotideContigFragmentRDD]

/**
 * Convert CoverageRDD to FeatureRDD via Dataset.
 */
class CoverageToFeaturesDatasetConverter
    extends ToFeatureDatasetConversion[Coverage, CoverageRDD]

/**
 * Convert FragmentRDD to FeatureRDD via Dataset.
 */
class FragmentsToFeaturesDatasetConverter
    extends ToFeatureDatasetConversion[Fragment, FragmentRDD]

/**
 * Convert AlignmentRecordRDD to FeatureRDD via Dataset.
 */
class AlignmentRecordsToFeaturesDatasetConverter
    extends ToFeatureDatasetConversion[AlignmentRecord, AlignmentRecordRDD]

/**
 * Convert GenotypeRDD to FeatureRDD via Dataset.
 */
class GenotypesToFeaturesDatasetConverter
    extends ToFeatureDatasetConversion[Genotype, GenotypeRDD]

/**
 * Convert VariantRDD to FeatureRDD via Dataset.
 */
class VariantsToFeaturesDatasetConverter
    extends ToFeatureDatasetConversion[Variant, VariantRDD]
```

### Fragment Dataset Conversions

Convert other genomic data types to paired-end sequencing fragments.

```scala { .api }
/**
 * Convert NucleotideContigFragmentRDD to FragmentRDD via Dataset.
 */
class ContigsToFragmentsDatasetConverter
    extends ToFragmentDatasetConversion[NucleotideContigFragment, NucleotideContigFragmentRDD]

/**
 * Convert CoverageRDD to FragmentRDD via Dataset.
 */
class CoverageToFragmentsDatasetConverter
    extends ToFragmentDatasetConversion[Coverage, CoverageRDD]

/**
 * Convert FeatureRDD to FragmentRDD via Dataset.
 */
class FeaturesToFragmentsDatasetConverter
    extends ToFragmentDatasetConversion[Feature, FeatureRDD]

/**
 * Convert AlignmentRecordRDD to FragmentRDD via Dataset.
 */
class AlignmentRecordsToFragmentsDatasetConverter
    extends ToFragmentDatasetConversion[AlignmentRecord, AlignmentRecordRDD]

/**
 * Convert GenotypeRDD to FragmentRDD via Dataset.
 */
class GenotypesToFragmentsDatasetConverter
    extends ToFragmentDatasetConversion[Genotype, GenotypeRDD]

/**
 * Convert VariantRDD to FragmentRDD via Dataset.
 */
class VariantsToFragmentsDatasetConverter
    extends ToFragmentDatasetConversion[Variant, VariantRDD]
```

### Alignment Record Dataset Conversions

Convert other genomic data types to sequence alignment records.

```scala { .api }
/**
 * Convert NucleotideContigFragmentRDD to AlignmentRecordRDD via Dataset.
 */
class ContigsToAlignmentRecordsDatasetConverter
    extends ToAlignmentRecordDatasetConversion[NucleotideContigFragment, NucleotideContigFragmentRDD]

/**
 * Convert CoverageRDD to AlignmentRecordRDD via Dataset.
 */
class CoverageToAlignmentRecordsDatasetConverter
    extends ToAlignmentRecordDatasetConversion[Coverage, CoverageRDD]

/**
 * Convert FeatureRDD to AlignmentRecordRDD via Dataset.
 */
class FeaturesToAlignmentRecordsDatasetConverter
    extends ToAlignmentRecordDatasetConversion[Feature, FeatureRDD]

/**
 * Convert FragmentRDD to AlignmentRecordRDD via Dataset.
 */
class FragmentsToAlignmentRecordsDatasetConverter
    extends ToAlignmentRecordDatasetConversion[Fragment, FragmentRDD]

/**
 * Convert GenotypeRDD to AlignmentRecordRDD via Dataset.
 */
class GenotypesToAlignmentRecordsDatasetConverter
    extends ToAlignmentRecordDatasetConversion[Genotype, GenotypeRDD]

/**
 * Convert VariantRDD to AlignmentRecordRDD via Dataset.
 */
class VariantsToAlignmentRecordsDatasetConverter
    extends ToAlignmentRecordDatasetConversion[Variant, VariantRDD]
```

### Genotype Dataset Conversions

Convert other genomic data types to genotype information from variant calling.

```scala { .api }
/**
 * Convert NucleotideContigFragmentRDD to GenotypeRDD via Dataset.
 */
class ContigsToGenotypesDatasetConverter
    extends ToGenotypeDatasetConversion[NucleotideContigFragment, NucleotideContigFragmentRDD]

/**
 * Convert CoverageRDD to GenotypeRDD via Dataset.
 */
class CoverageToGenotypesDatasetConverter
    extends ToGenotypeDatasetConversion[Coverage, CoverageRDD]

/**
 * Convert FeatureRDD to GenotypeRDD via Dataset.
 */
class FeaturesToGenotypesDatasetConverter
    extends ToGenotypeDatasetConversion[Feature, FeatureRDD]

/**
 * Convert FragmentRDD to GenotypeRDD via Dataset.
 */
class FragmentsToGenotypesDatasetConverter
    extends ToGenotypeDatasetConversion[Fragment, FragmentRDD]

/**
 * Convert AlignmentRecordRDD to GenotypeRDD via Dataset.
 */
class AlignmentRecordsToGenotypesDatasetConverter
    extends ToGenotypeDatasetConversion[AlignmentRecord, AlignmentRecordRDD]

/**
 * Convert VariantRDD to GenotypeRDD via Dataset.
 */
class VariantsToGenotypesDatasetConverter
    extends ToGenotypeDatasetConversion[Variant, VariantRDD]
```

### Variant Dataset Conversions

Convert other genomic data types to genetic variant information.

```scala { .api }
/**
 * Convert NucleotideContigFragmentRDD to VariantRDD via Dataset.
 */
class ContigsToVariantsDatasetConverter
    extends ToVariantDatasetConversion[NucleotideContigFragment, NucleotideContigFragmentRDD]

/**
 * Convert CoverageRDD to VariantRDD via Dataset.
 */
class CoverageToVariantsDatasetConverter
    extends ToVariantDatasetConversion[Coverage, CoverageRDD]

/**
 * Convert FeatureRDD to VariantRDD via Dataset.
 */
class FeaturesToVariantsDatasetConverter
    extends ToVariantDatasetConversion[Feature, FeatureRDD]

/**
 * Convert FragmentRDD to VariantRDD via Dataset.
 */
class FragmentsToVariantsDatasetConverter
    extends ToVariantDatasetConversion[Fragment, FragmentRDD]

/**
 * Convert AlignmentRecordRDD to VariantRDD via Dataset.
 */
class AlignmentRecordsToVariantsDatasetConverter
    extends ToVariantDatasetConversion[AlignmentRecord, AlignmentRecordRDD]

/**
 * Convert GenotypeRDD to VariantRDD via Dataset.
 */
class GenotypesToVariantsDatasetConverter
    extends ToVariantDatasetConversion[Genotype, GenotypeRDD]
```

## Usage Examples

```scala
import org.bdgenomics.adam.api.java.GenomicDatasetConverters._
import org.apache.spark.sql.Dataset

// Assumes `jac` (an ADAM context that can load files) and `spark`
// (a SparkSession with suitable encoders) are already in scope.

// Convert alignment records to features using Dataset
val alignments: AlignmentRecordRDD = jac.loadAlignments("input.bam")
// Use the typed `dataset` member; `toDF()` would return an untyped DataFrame.
val alignmentDataset: Dataset[AlignmentRecord] = alignments.dataset
val emptyFeatureDataset: Dataset[Feature] = spark.emptyDataset[Feature]

val converter = new AlignmentRecordsToFeaturesDatasetConverter()
val features: FeatureRDD = converter.call(alignments, emptyFeatureDataset)

// Convert variants to coverage using Dataset
val variants: VariantRDD = jac.loadVariants("variants.vcf")
val variantDataset: Dataset[Variant] = variants.dataset
val emptyCoverageDataset: Dataset[Coverage] = spark.emptyDataset[Coverage]

val coverageConverter = new VariantsToCoverageDatasetConverter()
val coverage: CoverageRDD = coverageConverter.call(variants, emptyCoverageDataset)
```

## Type Safety and Metadata Preservation

All dataset converters maintain:

- **Type Safety**: Compile-time guarantees for conversion compatibility
- **Sequence Dictionary**: Reference genome information preserved across conversions
- **Record Group Dictionary**: Sample and library information maintained for alignment-based conversions
- **Processing Commands**: History of transformations tracked in metadata
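
The metadata guarantees listed above can be illustrated with a small sketch in plain Scala: the records are mapped to the target type while the accompanying metadata is passed through unchanged. This is only an illustration under stated assumptions and not ADAM's implementation. `Metadata`, `Tagged`, `Call`, `Depth`, and `convert` are all hypothetical names, and simple string sequences stand in for the sequence dictionary, record group dictionary, and processing history.

```scala
// Hypothetical stand-ins; none of these are ADAM classes.
final case class Metadata(
  sequenceNames: Seq[String],
  recordGroups: Seq[String],
  processingSteps: Seq[String])

// A collection of records paired with its metadata.
final case class Tagged[T](records: Seq[T], metadata: Metadata)

final case class Call(contig: String, position: Long)
final case class Depth(contig: String, position: Long, count: Double)

object MetadataSketch {
  // Map the records to the target type; pass the metadata through unchanged.
  def convert[A, B](source: Tagged[A])(f: A => B): Tagged[B] =
    Tagged(source.records.map(f), source.metadata)

  def main(args: Array[String]): Unit = {
    val metadata = Metadata(Seq("chr1", "chr2"), Seq("sample1"), Seq("load"))
    val calls = Tagged(Seq(Call("chr1", 42L)), metadata)
    val depths = convert(calls)(c => Depth(c.contig, c.position, 1.0))
    println(depths.metadata == calls.metadata) // prints true
  }
}
```

Keeping the metadata in a separate immutable value makes it straightforward to check that a conversion changed only the records.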