# Dataset Conversions

The ADAM API provides Dataset-based converters that parallel the RDD converter functionality but work with Spark SQL Datasets for better performance and SQL integration. These converters enable type-safe transformations while leveraging Catalyst query optimization.

## Capabilities

### Base Dataset Conversion Traits

Foundation traits that define the interface for Dataset-based genomic data conversions.

```java { .api }
/**
 * Base trait for conversions to contig fragment datasets
 * @param <T> Source record type
 * @param <U> Source genomic dataset type
 */
interface ToContigDatasetConversion<T extends Product, U extends GenomicDataset<?, T, U>>
    extends GenomicDatasetConversion<T, U, NucleotideContigFragment, NucleotideContigFragmentRDD> {
    TypeTag<NucleotideContigFragment> xTag = typeTag[NucleotideContigFragment];
}

/**
 * Base trait for conversions to coverage datasets
 */
interface ToCoverageDatasetConversion<T extends Product, U extends GenomicDataset<?, T, U>>
    extends GenomicDatasetConversion<T, U, Coverage, CoverageRDD> {
    TypeTag<Coverage> xTag = typeTag[Coverage];
}

/**
 * Base trait for conversions to feature datasets
 */
interface ToFeatureDatasetConversion<T extends Product, U extends GenomicDataset<?, T, U>>
    extends GenomicDatasetConversion<T, U, Feature, FeatureRDD> {
    TypeTag<Feature> xTag = typeTag[Feature];
}

/**
 * Base trait for conversions to fragment datasets
 */
interface ToFragmentDatasetConversion<T extends Product, U extends GenomicDataset<?, T, U>>
    extends GenomicDatasetConversion<T, U, Fragment, FragmentRDD> {
    TypeTag<Fragment> xTag = typeTag[Fragment];
}

/**
 * Base trait for conversions to alignment record datasets
 */
interface ToAlignmentRecordDatasetConversion<T extends Product, U extends GenomicDataset<?, T, U>>
    extends GenomicDatasetConversion<T, U, AlignmentRecord, AlignmentRecordRDD> {
    TypeTag<AlignmentRecord> xTag = typeTag[AlignmentRecord];
}

/**
 * Base trait for conversions to genotype datasets
 */
interface ToGenotypeDatasetConversion<T extends Product, U extends GenomicDataset<?, T, U>>
    extends GenomicDatasetConversion<T, U, Genotype, GenotypeRDD> {
    TypeTag<Genotype> xTag = typeTag[Genotype];
}

/**
 * Base trait for conversions to variant datasets
 */
interface ToVariantDatasetConversion<T extends Product, U extends GenomicDataset<?, T, U>>
    extends GenomicDatasetConversion<T, U, Variant, VariantRDD> {
    TypeTag<Variant> xTag = typeTag[Variant];
}
```

**Note:** VariantContext dataset conversions are not currently supported in the Dataset converter API. Use RDD converters for VariantContext transformations.

### Contig Fragment Dataset Converters

Convert nucleotide contig fragments using Dataset operations for better SQL integration.

```java { .api }
/**
 * Convert NucleotideContigFragmentRDD with Coverage Dataset to CoverageRDD
 */
class ContigsToCoverageDatasetConverter extends ToCoverageDatasetConversion<NucleotideContigFragment, NucleotideContigFragmentRDD> {
    /**
     * Perform the dataset-based conversion
     * @param v1 Source NucleotideContigFragmentRDD with metadata
     * @param v2 Target Dataset[Coverage] with structured data
     * @return CoverageRDD with combined metadata and data
     */
    CoverageRDD call(NucleotideContigFragmentRDD v1, Dataset<Coverage> v2);
}

/**
 * Convert NucleotideContigFragmentRDD with Feature Dataset to FeatureRDD
 */
class ContigsToFeaturesDatasetConverter extends ToFeatureDatasetConversion<NucleotideContigFragment, NucleotideContigFragmentRDD> {
    FeatureRDD call(NucleotideContigFragmentRDD v1, Dataset<Feature> v2);
}

/**
 * Convert NucleotideContigFragmentRDD with Fragment Dataset to FragmentRDD
 */
class ContigsToFragmentsDatasetConverter extends ToFragmentDatasetConversion<NucleotideContigFragment, NucleotideContigFragmentRDD> {
    FragmentRDD call(NucleotideContigFragmentRDD v1, Dataset<Fragment> v2);
}

/**
 * Convert NucleotideContigFragmentRDD with AlignmentRecord Dataset to AlignmentRecordRDD
 */
class ContigsToAlignmentRecordsDatasetConverter extends ToAlignmentRecordDatasetConversion<NucleotideContigFragment, NucleotideContigFragmentRDD> {
    AlignmentRecordRDD call(NucleotideContigFragmentRDD v1, Dataset<AlignmentRecord> v2);
}

/**
 * Convert NucleotideContigFragmentRDD with Genotype Dataset to GenotypeRDD
 */
class ContigsToGenotypesDatasetConverter extends ToGenotypeDatasetConversion<NucleotideContigFragment, NucleotideContigFragmentRDD> {
    GenotypeRDD call(NucleotideContigFragmentRDD v1, Dataset<Genotype> v2);
}

/**
 * Convert NucleotideContigFragmentRDD with Variant Dataset to VariantRDD
 */
class ContigsToVariantsDatasetConverter extends ToVariantDatasetConversion<NucleotideContigFragment, NucleotideContigFragmentRDD> {
    VariantRDD call(NucleotideContigFragmentRDD v1, Dataset<Variant> v2);
}
```
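
The `call(v1, v2)` contract described in the comments above (metadata taken from `v1`, records taken from `v2`) can be illustrated without Spark or ADAM. The following is a minimal sketch built on hypothetical stand-in types (`SketchDataset`, `SketchConversion`, `ConversionSketch`), none of which exist in ADAM. It only shows the shape of the pattern and should not be read as the library's implementation.

```java
import java.util.List;

// Hypothetical stand-in for a genomic dataset: records plus the metadata that travels with them.
final class SketchDataset<T> {
    final List<String> sequenceNames; // stand-in for sequence metadata
    final List<T> records;

    SketchDataset(List<String> sequenceNames, List<T> records) {
        this.sequenceNames = sequenceNames;
        this.records = records;
    }
}

// Hypothetical two-argument conversion: v1 supplies metadata, v2 supplies the new records.
interface SketchConversion<T, U> {
    SketchDataset<U> call(SketchDataset<T> v1, List<U> v2);
}

final class ConversionSketch {
    // Builds a converter that keeps v1's metadata and swaps in v2's records.
    static <T, U> SketchConversion<T, U> metadataPreserving() {
        return (v1, v2) -> new SketchDataset<>(v1.sequenceNames, List.copyOf(v2));
    }

    public static void main(String[] args) {
        SketchDataset<String> contigs =
            new SketchDataset<>(List.of("chr1", "chr2"), List.of("ACGT", "TTGA"));
        List<Double> coverage = List.of(12.0, 7.5, 3.0);

        SketchConversion<String, Double> converter = metadataPreserving();
        SketchDataset<Double> result = converter.call(contigs, coverage);

        // The result carries the source's metadata and the new records.
        System.out.println(result.sequenceNames + " " + result.records.size());
    }
}
```

In this sketch the source records are discarded entirely. Only the metadata survives, which is why the converters take the transformed Dataset as a separate second argument.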

### Coverage Dataset Converters

Convert coverage data using Dataset operations for optimized query execution.

```java { .api }
/**
 * Convert CoverageRDD with NucleotideContigFragment Dataset to NucleotideContigFragmentRDD
 */
class CoverageToContigsDatasetConverter extends ToContigDatasetConversion<Coverage, CoverageRDD> {
    NucleotideContigFragmentRDD call(CoverageRDD v1, Dataset<NucleotideContigFragment> v2);
}

/**
 * Convert CoverageRDD with Feature Dataset to FeatureRDD
 */
class CoverageToFeaturesDatasetConverter extends ToFeatureDatasetConversion<Coverage, CoverageRDD> {
    FeatureRDD call(CoverageRDD v1, Dataset<Feature> v2);
}

/**
 * Convert CoverageRDD with Fragment Dataset to FragmentRDD
 */
class CoverageToFragmentsDatasetConverter extends ToFragmentDatasetConversion<Coverage, CoverageRDD> {
    FragmentRDD call(CoverageRDD v1, Dataset<Fragment> v2);
}

/**
 * Convert CoverageRDD with AlignmentRecord Dataset to AlignmentRecordRDD
 */
class CoverageToAlignmentRecordsDatasetConverter extends ToAlignmentRecordDatasetConversion<Coverage, CoverageRDD> {
    AlignmentRecordRDD call(CoverageRDD v1, Dataset<AlignmentRecord> v2);
}

/**
 * Convert CoverageRDD with Genotype Dataset to GenotypeRDD
 */
class CoverageToGenotypesDatasetConverter extends ToGenotypeDatasetConversion<Coverage, CoverageRDD> {
    GenotypeRDD call(CoverageRDD v1, Dataset<Genotype> v2);
}

/**
 * Convert CoverageRDD with Variant Dataset to VariantRDD
 */
class CoverageToVariantsDatasetConverter extends ToVariantDatasetConversion<Coverage, CoverageRDD> {
    VariantRDD call(CoverageRDD v1, Dataset<Variant> v2);
}
```

### Feature Dataset Converters

Convert genomic feature data using Dataset operations for SQL compatibility.

```java { .api }
/**
 * Convert FeatureRDD with NucleotideContigFragment Dataset to NucleotideContigFragmentRDD
 */
class FeaturesToContigsDatasetConverter extends ToContigDatasetConversion<Feature, FeatureRDD> {
    NucleotideContigFragmentRDD call(FeatureRDD v1, Dataset<NucleotideContigFragment> v2);
}

/**
 * Convert FeatureRDD with Coverage Dataset to CoverageRDD
 */
class FeaturesToCoverageDatasetConverter extends ToCoverageDatasetConversion<Feature, FeatureRDD> {
    CoverageRDD call(FeatureRDD v1, Dataset<Coverage> v2);
}

/**
 * Convert FeatureRDD with Fragment Dataset to FragmentRDD
 */
class FeaturesToFragmentsDatasetConverter extends ToFragmentDatasetConversion<Feature, FeatureRDD> {
    FragmentRDD call(FeatureRDD v1, Dataset<Fragment> v2);
}

/**
 * Convert FeatureRDD with AlignmentRecord Dataset to AlignmentRecordRDD
 */
class FeaturesToAlignmentRecordsDatasetConverter extends ToAlignmentRecordDatasetConversion<Feature, FeatureRDD> {
    AlignmentRecordRDD call(FeatureRDD v1, Dataset<AlignmentRecord> v2);
}

/**
 * Convert FeatureRDD with Genotype Dataset to GenotypeRDD
 */
class FeaturesToGenotypesDatasetConverter extends ToGenotypeDatasetConversion<Feature, FeatureRDD> {
    GenotypeRDD call(FeatureRDD v1, Dataset<Genotype> v2);
}

/**
 * Convert FeatureRDD with Variant Dataset to VariantRDD
 */
class FeaturesToVariantsDatasetConverter extends ToVariantDatasetConversion<Feature, FeatureRDD> {
    VariantRDD call(FeatureRDD v1, Dataset<Variant> v2);
}
```

### Fragment Dataset Converters

Convert sequencing fragment data using Dataset operations for enhanced performance.

```java { .api }
/**
 * Convert FragmentRDD with NucleotideContigFragment Dataset to NucleotideContigFragmentRDD
 */
class FragmentsToContigsDatasetConverter extends ToContigDatasetConversion<Fragment, FragmentRDD> {
    NucleotideContigFragmentRDD call(FragmentRDD v1, Dataset<NucleotideContigFragment> v2);
}

/**
 * Convert FragmentRDD with Coverage Dataset to CoverageRDD
 */
class FragmentsToCoverageDatasetConverter extends ToCoverageDatasetConversion<Fragment, FragmentRDD> {
    CoverageRDD call(FragmentRDD v1, Dataset<Coverage> v2);
}

/**
 * Convert FragmentRDD with Feature Dataset to FeatureRDD
 */
class FragmentsToFeaturesDatasetConverter extends ToFeatureDatasetConversion<Fragment, FragmentRDD> {
    FeatureRDD call(FragmentRDD v1, Dataset<Feature> v2);
}

/**
 * Convert FragmentRDD with AlignmentRecord Dataset to AlignmentRecordRDD
 */
class FragmentsToAlignmentRecordsDatasetConverter extends ToAlignmentRecordDatasetConversion<Fragment, FragmentRDD> {
    AlignmentRecordRDD call(FragmentRDD v1, Dataset<AlignmentRecord> v2);
}

/**
 * Convert FragmentRDD with Genotype Dataset to GenotypeRDD
 */
class FragmentsToGenotypesDatasetConverter extends ToGenotypeDatasetConversion<Fragment, FragmentRDD> {
    GenotypeRDD call(FragmentRDD v1, Dataset<Genotype> v2);
}

/**
 * Convert FragmentRDD with Variant Dataset to VariantRDD
 */
class FragmentsToVariantsDatasetConverter extends ToVariantDatasetConversion<Fragment, FragmentRDD> {
    VariantRDD call(FragmentRDD v1, Dataset<Variant> v2);
}
```

### Alignment Record Dataset Converters

Convert alignment record data using Dataset operations with Catalyst optimization.

```java { .api }
/**
 * Convert AlignmentRecordRDD with NucleotideContigFragment Dataset to NucleotideContigFragmentRDD
 */
class AlignmentRecordsToContigsDatasetConverter extends ToContigDatasetConversion<AlignmentRecord, AlignmentRecordRDD> {
    NucleotideContigFragmentRDD call(AlignmentRecordRDD v1, Dataset<NucleotideContigFragment> v2);
}

/**
 * Convert AlignmentRecordRDD with Coverage Dataset to CoverageRDD
 */
class AlignmentRecordsToCoverageDatasetConverter extends ToCoverageDatasetConversion<AlignmentRecord, AlignmentRecordRDD> {
    CoverageRDD call(AlignmentRecordRDD v1, Dataset<Coverage> v2);
}

/**
 * Convert AlignmentRecordRDD with Feature Dataset to FeatureRDD
 */
class AlignmentRecordsToFeaturesDatasetConverter extends ToFeatureDatasetConversion<AlignmentRecord, AlignmentRecordRDD> {
    FeatureRDD call(AlignmentRecordRDD v1, Dataset<Feature> v2);
}

/**
 * Convert AlignmentRecordRDD with Fragment Dataset to FragmentRDD
 */
class AlignmentRecordsToFragmentsDatasetConverter extends ToFragmentDatasetConversion<AlignmentRecord, AlignmentRecordRDD> {
    FragmentRDD call(AlignmentRecordRDD v1, Dataset<Fragment> v2);
}

/**
 * Convert AlignmentRecordRDD with Genotype Dataset to GenotypeRDD
 */
class AlignmentRecordsToGenotypesDatasetConverter extends ToGenotypeDatasetConversion<AlignmentRecord, AlignmentRecordRDD> {
    GenotypeRDD call(AlignmentRecordRDD v1, Dataset<Genotype> v2);
}

/**
 * Convert AlignmentRecordRDD with Variant Dataset to VariantRDD
 */
class AlignmentRecordsToVariantsDatasetConverter extends ToVariantDatasetConversion<AlignmentRecord, AlignmentRecordRDD> {
    VariantRDD call(AlignmentRecordRDD v1, Dataset<Variant> v2);
}
```

### Genotype Dataset Converters

Convert genotype data using Dataset operations for optimized variant analysis.

```java { .api }
/**
 * Convert GenotypeRDD with NucleotideContigFragment Dataset to NucleotideContigFragmentRDD
 */
class GenotypesToContigsDatasetConverter extends ToContigDatasetConversion<Genotype, GenotypeRDD> {
    NucleotideContigFragmentRDD call(GenotypeRDD v1, Dataset<NucleotideContigFragment> v2);
}

/**
 * Convert GenotypeRDD with Coverage Dataset to CoverageRDD
 */
class GenotypesToCoverageDatasetConverter extends ToCoverageDatasetConversion<Genotype, GenotypeRDD> {
    CoverageRDD call(GenotypeRDD v1, Dataset<Coverage> v2);
}

/**
 * Convert GenotypeRDD with Feature Dataset to FeatureRDD
 */
class GenotypesToFeaturesDatasetConverter extends ToFeatureDatasetConversion<Genotype, GenotypeRDD> {
    FeatureRDD call(GenotypeRDD v1, Dataset<Feature> v2);
}

/**
 * Convert GenotypeRDD with Fragment Dataset to FragmentRDD
 */
class GenotypesToFragmentsDatasetConverter extends ToFragmentDatasetConversion<Genotype, GenotypeRDD> {
    FragmentRDD call(GenotypeRDD v1, Dataset<Fragment> v2);
}

/**
 * Convert GenotypeRDD with AlignmentRecord Dataset to AlignmentRecordRDD
 */
class GenotypesToAlignmentRecordsDatasetConverter extends ToAlignmentRecordDatasetConversion<Genotype, GenotypeRDD> {
    AlignmentRecordRDD call(GenotypeRDD v1, Dataset<AlignmentRecord> v2);
}

/**
 * Convert GenotypeRDD with Variant Dataset to VariantRDD
 */
class GenotypesToVariantsDatasetConverter extends ToVariantDatasetConversion<Genotype, GenotypeRDD> {
    VariantRDD call(GenotypeRDD v1, Dataset<Variant> v2);
}
```

### Variant Dataset Converters

Convert variant data using Dataset operations for enhanced genomic analysis workflows.

```java { .api }
/**
 * Convert VariantRDD with NucleotideContigFragment Dataset to NucleotideContigFragmentRDD
 */
class VariantsToContigsDatasetConverter extends ToContigDatasetConversion<Variant, VariantRDD> {
    NucleotideContigFragmentRDD call(VariantRDD v1, Dataset<NucleotideContigFragment> v2);
}

/**
 * Convert VariantRDD with Coverage Dataset to CoverageRDD
 */
class VariantsToCoverageDatasetConverter extends ToCoverageDatasetConversion<Variant, VariantRDD> {
    CoverageRDD call(VariantRDD v1, Dataset<Coverage> v2);
}

/**
 * Convert VariantRDD with Feature Dataset to FeatureRDD
 */
class VariantsToFeaturesDatasetConverter extends ToFeatureDatasetConversion<Variant, VariantRDD> {
    FeatureRDD call(VariantRDD v1, Dataset<Feature> v2);
}

/**
 * Convert VariantRDD with Fragment Dataset to FragmentRDD
 */
class VariantsToFragmentsDatasetConverter extends ToFragmentDatasetConversion<Variant, VariantRDD> {
    FragmentRDD call(VariantRDD v1, Dataset<Fragment> v2);
}

/**
 * Convert VariantRDD with AlignmentRecord Dataset to AlignmentRecordRDD
 */
class VariantsToAlignmentRecordsDatasetConverter extends ToAlignmentRecordDatasetConversion<Variant, VariantRDD> {
    AlignmentRecordRDD call(VariantRDD v1, Dataset<AlignmentRecord> v2);
}

/**
 * Convert VariantRDD with Genotype Dataset to GenotypeRDD
 */
class VariantsToGenotypesDatasetConverter extends ToGenotypeDatasetConversion<Variant, VariantRDD> {
    GenotypeRDD call(VariantRDD v1, Dataset<Genotype> v2);
}
```
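
The converter classes listed in the sections above follow a `<Source>To<Target>DatasetConverter` naming pattern across seven data types. As a small illustration, the hypothetical helper below (not part of ADAM) enumerates the names implied by that pattern. It assumes the type tokens are exactly those used in the class names shown in this document.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only; ADAM does not ship this class.
final class ConverterNames {
    // Type tokens as they appear in the converter class names listed above.
    static final List<String> TYPES = List.of(
        "Contigs", "Coverage", "Features", "Fragments",
        "AlignmentRecords", "Genotypes", "Variants");

    // Name of the converter from one type to another, following the documented pattern.
    static String converterName(String source, String target) {
        return source + "To" + target + "DatasetConverter";
    }

    // All source/target pairs, skipping identity conversions.
    static List<String> allConverterNames() {
        List<String> names = new ArrayList<>();
        for (String source : TYPES) {
            for (String target : TYPES) {
                if (!source.equals(target)) {
                    names.add(converterName(source, target));
                }
            }
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> names = allConverterNames();
        System.out.println(names.size() + " converters, e.g. " + names.get(0));
    }
}
```

With seven types and no identity conversions, the helper yields 7 × 6 = 42 names, which matches the number of converter classes documented above.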

## Usage Examples

**Basic Dataset conversion with SQL integration:**

```java
import org.bdgenomics.adam.api.java.*;
import org.apache.spark.sql.Dataset;

// Assumes `jac` is an already-constructed JavaADAMContext

// Load genomic data
AlignmentRecordRDD alignments = jac.loadAlignments("sample.bam");
VariantRDD variants = jac.loadVariants("variants.vcf");

// Convert to Dataset for SQL operations
Dataset<Variant> variantDS = variants.dataset();

// Apply SQL transformations
Dataset<Variant> filteredVariants = variantDS
    .filter("start > 1000000")
    .filter("qual > 30.0");

// Convert back using Dataset converter
AlignmentRecordsToVariantsDatasetConverter converter =
    new AlignmentRecordsToVariantsDatasetConverter();
VariantRDD convertedVariants = converter.call(alignments, filteredVariants);
```

**Performance-optimized Dataset operations:**

```java
// Load large genomic datasets
GenotypeRDD genotypes = jac.loadGenotypes("large_cohort.vcf");
FeatureRDD features = jac.loadFeatures("annotations.gtf");

// Convert to Datasets for Catalyst optimization
Dataset<Genotype> genotypeDS = genotypes.dataset();
Dataset<Feature> featureDS = features.dataset();

// Perform complex SQL-based analysis.
// The aliases make "feature.*" and "genotype.*" resolvable; join(...) yields an untyped Dataset<Row>.
Dataset<Row> joined = featureDS.as("feature")
    .join(genotypeDS.as("genotype"), "contigName")
    .where("genotype.variant.qual > 50")
    .select("feature.*");

// featureEncoder is an assumed Encoder<Feature> obtained elsewhere (not shown in the original)
Dataset<Feature> annotatedFeatures = joined.as(featureEncoder);

// Convert back with preserved metadata
GenotypesToFeaturesDatasetConverter converter =
    new GenotypesToFeaturesDatasetConverter();
FeatureRDD result = converter.call(genotypes, annotatedFeatures);
```

**Combining RDD and Dataset operations:**

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.count;

// Start with RDD operations for complex logic
AlignmentRecordRDD alignments = jac.loadAlignments("sample.bam");
JavaRDD<AlignmentRecord> filteredRDD = alignments.jrdd()
    .filter(read -> read.getMapq() > 30 && read.getReadMapped());

// Convert to Dataset for SQL operations
// (createDataset expects an RDD, so unwrap the JavaRDD with rdd())
Dataset<AlignmentRecord> alignmentDS = spark.createDataset(
    filteredRDD.rdd(), Encoders.bean(AlignmentRecord.class));

// groupBy/agg/select yields an untyped Dataset<Row>
Dataset<Row> counts = alignmentDS
    .groupBy("contigName", "start")
    .agg(count("*").as("count"))
    .select(col("contigName"), col("start"), col("count").as("score"));

// coverageEncoder is an assumed Encoder<Coverage> obtained elsewhere (not shown in the original)
Dataset<Coverage> coverageDS = counts.as(coverageEncoder);

// Convert back to genomic RDD with metadata
AlignmentRecordsToCoverageDatasetConverter converter =
    new AlignmentRecordsToCoverageDatasetConverter();
CoverageRDD coverage = converter.call(alignments, coverageDS);
```
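
For readers less familiar with Spark SQL, the `groupBy("contigName", "start")` and `count("*")` step above counts how many reads share each start position. A plain-Java analogue, assuming a hypothetical minimal `Read` class and no Spark dependency, might look like the sketch below. It counts reads by start position only, as the snippet above does, and does not compute per-base coverage.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical illustration only; none of these types come from ADAM or Spark.
final class CoverageCountSketch {
    // Minimal read carrying just the two grouping fields used above.
    static final class Read {
        final String contigName;
        final long start;

        Read(String contigName, long start) {
            this.contigName = contigName;
            this.start = start;
        }
    }

    // Counts reads per "contigName:start" key, the in-memory equivalent of a group-by count.
    static Map<String, Long> countByPosition(List<Read> reads) {
        return reads.stream().collect(Collectors.groupingBy(
            r -> r.contigName + ":" + r.start,
            TreeMap::new,
            Collectors.counting()));
    }

    public static void main(String[] args) {
        List<Read> reads = List.of(
            new Read("chr1", 100), new Read("chr1", 100), new Read("chr2", 5));
        System.out.println(countByPosition(reads));
    }
}
```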

## Key Benefits

- **Catalyst Optimization**: Leverages Spark SQL's query optimizer for better performance
- **SQL Integration**: Enables SQL queries on genomic data through Dataset API
- **Type Safety**: Maintains compile-time type checking with structured data
- **Metadata Preservation**: Preserves genomic metadata while enabling SQL operations
- **Interoperability**: Seamlessly bridges RDD and Dataset APIs
- **Performance**: Better performance for complex analytical queries compared to RDD operations