
tessl/maven-org-bdgenomics-adam--adam-apis-2-10

Java/Python API wrappers for ADAM genomics analysis library enabling scalable genomic data processing with Apache Spark


Dataset Conversions

The ADAM APIs module provides Dataset-based converters that mirror the RDD converters but operate on Spark SQL Datasets. Each converter pairs a source genomic RDD, which supplies metadata, with a typed Dataset of the target record type, so transformations can be written against the Dataset API and planned by the Catalyst query optimizer.

Capabilities

Base Dataset Conversion Traits

Foundation traits that define the interface for Dataset-based genomic data conversions.

/**
 * Base trait for conversions to contig fragment datasets
 * @param <T> Source record type
 * @param <U> Source genomic dataset type
 */
interface ToContigDatasetConversion<T extends Product, U extends GenomicDataset<?, T, U>> 
    extends GenomicDatasetConversion<T, U, NucleotideContigFragment, NucleotideContigFragmentRDD> {
    TypeTag<NucleotideContigFragment> xTag(); // Scala: typeTag[NucleotideContigFragment]
}

/**
 * Base trait for conversions to coverage datasets
 */
interface ToCoverageDatasetConversion<T extends Product, U extends GenomicDataset<?, T, U>> 
    extends GenomicDatasetConversion<T, U, Coverage, CoverageRDD> {
    TypeTag<Coverage> xTag(); // Scala: typeTag[Coverage]
}

/**
 * Base trait for conversions to feature datasets
 */
interface ToFeatureDatasetConversion<T extends Product, U extends GenomicDataset<?, T, U>> 
    extends GenomicDatasetConversion<T, U, Feature, FeatureRDD> {
    TypeTag<Feature> xTag(); // Scala: typeTag[Feature]
}

/**
 * Base trait for conversions to fragment datasets
 */
interface ToFragmentDatasetConversion<T extends Product, U extends GenomicDataset<?, T, U>> 
    extends GenomicDatasetConversion<T, U, Fragment, FragmentRDD> {
    TypeTag<Fragment> xTag(); // Scala: typeTag[Fragment]
}

/**
 * Base trait for conversions to alignment record datasets
 */
interface ToAlignmentRecordDatasetConversion<T extends Product, U extends GenomicDataset<?, T, U>> 
    extends GenomicDatasetConversion<T, U, AlignmentRecord, AlignmentRecordRDD> {
    TypeTag<AlignmentRecord> xTag(); // Scala: typeTag[AlignmentRecord]
}

/**
 * Base trait for conversions to genotype datasets
 */
interface ToGenotypeDatasetConversion<T extends Product, U extends GenomicDataset<?, T, U>> 
    extends GenomicDatasetConversion<T, U, Genotype, GenotypeRDD> {
    TypeTag<Genotype> xTag(); // Scala: typeTag[Genotype]
}

/**
 * Base trait for conversions to variant datasets
 */
interface ToVariantDatasetConversion<T extends Product, U extends GenomicDataset<?, T, U>> 
    extends GenomicDatasetConversion<T, U, Variant, VariantRDD> {
    TypeTag<Variant> xTag(); // Scala: typeTag[Variant]
}
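
The traits above all follow one pattern: a four-parameter conversion interface whose target record type and target collection type are pinned by a two-parameter sub-interface. The following is a minimal plain-Java sketch of that pattern only. Every type in it (FeatureRecord, FeatureCollection, ConversionSketch, ToFeatureConversionSketch) is a hypothetical stand-in, not an ADAM or Spark class, and a List takes the place of a Dataset.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins, not ADAM classes: a target record and its collection type.
final class FeatureRecord {
    final String contigName;
    FeatureRecord(String contigName) { this.contigName = contigName; }
}

final class FeatureCollection {
    final List<FeatureRecord> records;
    FeatureCollection(List<FeatureRecord> records) { this.records = records; }
}

// General shape: source record T, source collection U, target record X, target collection Y.
interface ConversionSketch<T, U, X, Y> {
    Y call(U source, List<X> targetRecords);
}

// Specialised shape: the target side is fixed, so implementors only choose T and U.
interface ToFeatureConversionSketch<T, U>
        extends ConversionSketch<T, U, FeatureRecord, FeatureCollection> {
}

final class TraitPatternSketch {
    // A converter whose source collection is modelled as a List<String>.
    static final ToFeatureConversionSketch<String, List<String>> FROM_STRINGS =
            (source, targetRecords) -> new FeatureCollection(targetRecords);

    public static void main(String[] args) {
        FeatureCollection out = FROM_STRINGS.call(
                Arrays.asList("read1", "read2"),
                Arrays.asList(new FeatureRecord("chr1")));
        System.out.println(out.records.size());
    }
}
```

Fixing the target types in the sub-interface is what lets each concrete converter declare only its source types, as the converter classes in the following sections do.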

Note: VariantContext dataset conversions are not currently supported in the Dataset converter API. Use RDD converters for VariantContext transformations.

Contig Fragment Dataset Converters

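Converters whose source (v1) is a NucleotideContigFragmentRDD. The source supplies metadata, the Dataset passed as v2 supplies the target records, and the result is the matching target RDD.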

/**
 * Convert NucleotideContigFragmentRDD with Coverage Dataset to CoverageRDD
 */
class ContigsToCoverageDatasetConverter extends ToCoverageDatasetConversion<NucleotideContigFragment, NucleotideContigFragmentRDD> {
    /**
     * Perform the dataset-based conversion
     * @param v1 Source NucleotideContigFragmentRDD with metadata
     * @param v2 Target Dataset[Coverage] with structured data
     * @return CoverageRDD with combined metadata and data
     */
    CoverageRDD call(NucleotideContigFragmentRDD v1, Dataset<Coverage> v2);
}
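
The call(v1, v2) contract documented above (metadata from v1, records from v2) can be pictured with a small plain-Java model. This is a sketch under stated assumptions: ToyGenomicData and MetadataCarrySketch are hypothetical names, a list of sequence names stands in for whatever metadata the real RDD carries, and a List stands in for a Dataset. It does not show how ADAM implements the conversion.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a genomic RDD: records plus some reference metadata.
final class ToyGenomicData<R> {
    final List<String> sequenceNames; // assumed example of metadata, not ADAM's actual model
    final List<R> records;

    ToyGenomicData(List<String> sequenceNames, List<R> records) {
        this.sequenceNames = Collections.unmodifiableList(sequenceNames);
        this.records = Collections.unmodifiableList(records);
    }
}

final class MetadataCarrySketch {
    // Mirrors the call(v1, v2) shape: metadata comes from v1, records come from v2.
    static <S, T> ToyGenomicData<T> call(ToyGenomicData<S> v1, List<T> v2) {
        return new ToyGenomicData<>(v1.sequenceNames, v2);
    }

    public static void main(String[] args) {
        ToyGenomicData<String> contigs = new ToyGenomicData<>(
                Arrays.asList("chr1", "chr2"), Arrays.asList("ACGT", "TTGA"));
        ToyGenomicData<Double> coverage = call(contigs, Arrays.asList(12.0, 7.5, 3.0));
        System.out.println(coverage.sequenceNames + " " + coverage.records.size());
    }
}
```

Note that the records of v1 are not read at all in this model; only its metadata survives into the result.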

/**
 * Convert NucleotideContigFragmentRDD with Feature Dataset to FeatureRDD
 */
class ContigsToFeaturesDatasetConverter extends ToFeatureDatasetConversion<NucleotideContigFragment, NucleotideContigFragmentRDD> {
    FeatureRDD call(NucleotideContigFragmentRDD v1, Dataset<Feature> v2);
}

/**
 * Convert NucleotideContigFragmentRDD with Fragment Dataset to FragmentRDD
 */
class ContigsToFragmentsDatasetConverter extends ToFragmentDatasetConversion<NucleotideContigFragment, NucleotideContigFragmentRDD> {
    FragmentRDD call(NucleotideContigFragmentRDD v1, Dataset<Fragment> v2);
}

/**
 * Convert NucleotideContigFragmentRDD with AlignmentRecord Dataset to AlignmentRecordRDD
 */
class ContigsToAlignmentRecordsDatasetConverter extends ToAlignmentRecordDatasetConversion<NucleotideContigFragment, NucleotideContigFragmentRDD> {
    AlignmentRecordRDD call(NucleotideContigFragmentRDD v1, Dataset<AlignmentRecord> v2);
}

/**
 * Convert NucleotideContigFragmentRDD with Genotype Dataset to GenotypeRDD
 */
class ContigsToGenotypesDatasetConverter extends ToGenotypeDatasetConversion<NucleotideContigFragment, NucleotideContigFragmentRDD> {
    GenotypeRDD call(NucleotideContigFragmentRDD v1, Dataset<Genotype> v2);
}

/**
 * Convert NucleotideContigFragmentRDD with Variant Dataset to VariantRDD
 */
class ContigsToVariantsDatasetConverter extends ToVariantDatasetConversion<NucleotideContigFragment, NucleotideContigFragmentRDD> {
    VariantRDD call(NucleotideContigFragmentRDD v1, Dataset<Variant> v2);
}

Coverage Dataset Converters

Converters whose source (v1) is a CoverageRDD; v2 supplies the target records as a Dataset.

/**
 * Convert CoverageRDD with NucleotideContigFragment Dataset to NucleotideContigFragmentRDD
 */
class CoverageToContigsDatasetConverter extends ToContigDatasetConversion<Coverage, CoverageRDD> {
    NucleotideContigFragmentRDD call(CoverageRDD v1, Dataset<NucleotideContigFragment> v2);
}

/**
 * Convert CoverageRDD with Feature Dataset to FeatureRDD
 */
class CoverageToFeaturesDatasetConverter extends ToFeatureDatasetConversion<Coverage, CoverageRDD> {
    FeatureRDD call(CoverageRDD v1, Dataset<Feature> v2);
}

/**
 * Convert CoverageRDD with Fragment Dataset to FragmentRDD
 */
class CoverageToFragmentsDatasetConverter extends ToFragmentDatasetConversion<Coverage, CoverageRDD> {
    FragmentRDD call(CoverageRDD v1, Dataset<Fragment> v2);
}

/**
 * Convert CoverageRDD with AlignmentRecord Dataset to AlignmentRecordRDD
 */
class CoverageToAlignmentRecordsDatasetConverter extends ToAlignmentRecordDatasetConversion<Coverage, CoverageRDD> {
    AlignmentRecordRDD call(CoverageRDD v1, Dataset<AlignmentRecord> v2);
}

/**
 * Convert CoverageRDD with Genotype Dataset to GenotypeRDD
 */
class CoverageToGenotypesDatasetConverter extends ToGenotypeDatasetConversion<Coverage, CoverageRDD> {
    GenotypeRDD call(CoverageRDD v1, Dataset<Genotype> v2);
}

/**
 * Convert CoverageRDD with Variant Dataset to VariantRDD
 */
class CoverageToVariantsDatasetConverter extends ToVariantDatasetConversion<Coverage, CoverageRDD> {
    VariantRDD call(CoverageRDD v1, Dataset<Variant> v2);
}

Feature Dataset Converters

Converters whose source (v1) is a FeatureRDD; v2 supplies the target records as a Dataset.

/**
 * Convert FeatureRDD with NucleotideContigFragment Dataset to NucleotideContigFragmentRDD
 */
class FeaturesToContigsDatasetConverter extends ToContigDatasetConversion<Feature, FeatureRDD> {
    NucleotideContigFragmentRDD call(FeatureRDD v1, Dataset<NucleotideContigFragment> v2);
}

/**
 * Convert FeatureRDD with Coverage Dataset to CoverageRDD
 */
class FeaturesToCoverageDatasetConverter extends ToCoverageDatasetConversion<Feature, FeatureRDD> {
    CoverageRDD call(FeatureRDD v1, Dataset<Coverage> v2);
}

/**
 * Convert FeatureRDD with Fragment Dataset to FragmentRDD
 */
class FeaturesToFragmentsDatasetConverter extends ToFragmentDatasetConversion<Feature, FeatureRDD> {
    FragmentRDD call(FeatureRDD v1, Dataset<Fragment> v2);
}

/**
 * Convert FeatureRDD with AlignmentRecord Dataset to AlignmentRecordRDD
 */
class FeaturesToAlignmentRecordsDatasetConverter extends ToAlignmentRecordDatasetConversion<Feature, FeatureRDD> {
    AlignmentRecordRDD call(FeatureRDD v1, Dataset<AlignmentRecord> v2);
}

/**
 * Convert FeatureRDD with Genotype Dataset to GenotypeRDD
 */
class FeaturesToGenotypesDatasetConverter extends ToGenotypeDatasetConversion<Feature, FeatureRDD> {
    GenotypeRDD call(FeatureRDD v1, Dataset<Genotype> v2);
}

/**
 * Convert FeatureRDD with Variant Dataset to VariantRDD
 */
class FeaturesToVariantsDatasetConverter extends ToVariantDatasetConversion<Feature, FeatureRDD> {
    VariantRDD call(FeatureRDD v1, Dataset<Variant> v2);
}

Fragment Dataset Converters

Converters whose source (v1) is a FragmentRDD; v2 supplies the target records as a Dataset.

/**
 * Convert FragmentRDD with NucleotideContigFragment Dataset to NucleotideContigFragmentRDD
 */
class FragmentsToContigsDatasetConverter extends ToContigDatasetConversion<Fragment, FragmentRDD> {
    NucleotideContigFragmentRDD call(FragmentRDD v1, Dataset<NucleotideContigFragment> v2);
}

/**
 * Convert FragmentRDD with Coverage Dataset to CoverageRDD
 */
class FragmentsToCoverageDatasetConverter extends ToCoverageDatasetConversion<Fragment, FragmentRDD> {
    CoverageRDD call(FragmentRDD v1, Dataset<Coverage> v2);
}

/**
 * Convert FragmentRDD with Feature Dataset to FeatureRDD
 */
class FragmentsToFeaturesDatasetConverter extends ToFeatureDatasetConversion<Fragment, FragmentRDD> {
    FeatureRDD call(FragmentRDD v1, Dataset<Feature> v2);
}

/**
 * Convert FragmentRDD with AlignmentRecord Dataset to AlignmentRecordRDD
 */
class FragmentsToAlignmentRecordsDatasetConverter extends ToAlignmentRecordDatasetConversion<Fragment, FragmentRDD> {
    AlignmentRecordRDD call(FragmentRDD v1, Dataset<AlignmentRecord> v2);
}

/**
 * Convert FragmentRDD with Genotype Dataset to GenotypeRDD
 */
class FragmentsToGenotypesDatasetConverter extends ToGenotypeDatasetConversion<Fragment, FragmentRDD> {
    GenotypeRDD call(FragmentRDD v1, Dataset<Genotype> v2);
}

/**
 * Convert FragmentRDD with Variant Dataset to VariantRDD
 */
class FragmentsToVariantsDatasetConverter extends ToVariantDatasetConversion<Fragment, FragmentRDD> {
    VariantRDD call(FragmentRDD v1, Dataset<Variant> v2);
}

Alignment Record Dataset Converters

Converters whose source (v1) is an AlignmentRecordRDD; v2 supplies the target records as a Dataset.

/**
 * Convert AlignmentRecordRDD with NucleotideContigFragment Dataset to NucleotideContigFragmentRDD
 */
class AlignmentRecordsToContigsDatasetConverter extends ToContigDatasetConversion<AlignmentRecord, AlignmentRecordRDD> {
    NucleotideContigFragmentRDD call(AlignmentRecordRDD v1, Dataset<NucleotideContigFragment> v2);
}

/**
 * Convert AlignmentRecordRDD with Coverage Dataset to CoverageRDD
 */
class AlignmentRecordsToCoverageDatasetConverter extends ToCoverageDatasetConversion<AlignmentRecord, AlignmentRecordRDD> {
    CoverageRDD call(AlignmentRecordRDD v1, Dataset<Coverage> v2);
}

/**
 * Convert AlignmentRecordRDD with Feature Dataset to FeatureRDD
 */
class AlignmentRecordsToFeaturesDatasetConverter extends ToFeatureDatasetConversion<AlignmentRecord, AlignmentRecordRDD> {
    FeatureRDD call(AlignmentRecordRDD v1, Dataset<Feature> v2);
}

/**
 * Convert AlignmentRecordRDD with Fragment Dataset to FragmentRDD
 */
class AlignmentRecordsToFragmentsDatasetConverter extends ToFragmentDatasetConversion<AlignmentRecord, AlignmentRecordRDD> {
    FragmentRDD call(AlignmentRecordRDD v1, Dataset<Fragment> v2);
}

/**
 * Convert AlignmentRecordRDD with Genotype Dataset to GenotypeRDD
 */
class AlignmentRecordsToGenotypesDatasetConverter extends ToGenotypeDatasetConversion<AlignmentRecord, AlignmentRecordRDD> {
    GenotypeRDD call(AlignmentRecordRDD v1, Dataset<Genotype> v2);
}

/**
 * Convert AlignmentRecordRDD with Variant Dataset to VariantRDD
 */
class AlignmentRecordsToVariantsDatasetConverter extends ToVariantDatasetConversion<AlignmentRecord, AlignmentRecordRDD> {
    VariantRDD call(AlignmentRecordRDD v1, Dataset<Variant> v2);
}

Genotype Dataset Converters

Converters whose source (v1) is a GenotypeRDD; v2 supplies the target records as a Dataset.

/**
 * Convert GenotypeRDD with NucleotideContigFragment Dataset to NucleotideContigFragmentRDD
 */
class GenotypesToContigsDatasetConverter extends ToContigDatasetConversion<Genotype, GenotypeRDD> {
    NucleotideContigFragmentRDD call(GenotypeRDD v1, Dataset<NucleotideContigFragment> v2);
}

/**
 * Convert GenotypeRDD with Coverage Dataset to CoverageRDD
 */
class GenotypesToCoverageDatasetConverter extends ToCoverageDatasetConversion<Genotype, GenotypeRDD> {
    CoverageRDD call(GenotypeRDD v1, Dataset<Coverage> v2);
}

/**
 * Convert GenotypeRDD with Feature Dataset to FeatureRDD
 */
class GenotypesToFeaturesDatasetConverter extends ToFeatureDatasetConversion<Genotype, GenotypeRDD> {
    FeatureRDD call(GenotypeRDD v1, Dataset<Feature> v2);
}

/**
 * Convert GenotypeRDD with Fragment Dataset to FragmentRDD
 */
class GenotypesToFragmentsDatasetConverter extends ToFragmentDatasetConversion<Genotype, GenotypeRDD> {
    FragmentRDD call(GenotypeRDD v1, Dataset<Fragment> v2);
}

/**
 * Convert GenotypeRDD with AlignmentRecord Dataset to AlignmentRecordRDD
 */
class GenotypesToAlignmentRecordsDatasetConverter extends ToAlignmentRecordDatasetConversion<Genotype, GenotypeRDD> {
    AlignmentRecordRDD call(GenotypeRDD v1, Dataset<AlignmentRecord> v2);
}

/**
 * Convert GenotypeRDD with Variant Dataset to VariantRDD
 */
class GenotypesToVariantsDatasetConverter extends ToVariantDatasetConversion<Genotype, GenotypeRDD> {
    VariantRDD call(GenotypeRDD v1, Dataset<Variant> v2);
}

Variant Dataset Converters

Converters whose source (v1) is a VariantRDD; v2 supplies the target records as a Dataset.

/**
 * Convert VariantRDD with NucleotideContigFragment Dataset to NucleotideContigFragmentRDD
 */
class VariantsToContigsDatasetConverter extends ToContigDatasetConversion<Variant, VariantRDD> {
    NucleotideContigFragmentRDD call(VariantRDD v1, Dataset<NucleotideContigFragment> v2);
}

/**
 * Convert VariantRDD with Coverage Dataset to CoverageRDD
 */
class VariantsToCoverageDatasetConverter extends ToCoverageDatasetConversion<Variant, VariantRDD> {
    CoverageRDD call(VariantRDD v1, Dataset<Coverage> v2);
}

/**
 * Convert VariantRDD with Feature Dataset to FeatureRDD
 */
class VariantsToFeaturesDatasetConverter extends ToFeatureDatasetConversion<Variant, VariantRDD> {
    FeatureRDD call(VariantRDD v1, Dataset<Feature> v2);
}

/**
 * Convert VariantRDD with Fragment Dataset to FragmentRDD
 */
class VariantsToFragmentsDatasetConverter extends ToFragmentDatasetConversion<Variant, VariantRDD> {
    FragmentRDD call(VariantRDD v1, Dataset<Fragment> v2);
}

/**
 * Convert VariantRDD with AlignmentRecord Dataset to AlignmentRecordRDD
 */
class VariantsToAlignmentRecordsDatasetConverter extends ToAlignmentRecordDatasetConversion<Variant, VariantRDD> {
    AlignmentRecordRDD call(VariantRDD v1, Dataset<AlignmentRecord> v2);
}

/**
 * Convert VariantRDD with Genotype Dataset to GenotypeRDD
 */
class VariantsToGenotypesDatasetConverter extends ToGenotypeDatasetConversion<Variant, VariantRDD> {
    GenotypeRDD call(VariantRDD v1, Dataset<Genotype> v2);
}
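
The converter classes listed in the sections above follow a regular naming scheme, <Source>To<Target>DatasetConverter, over seven record types. The sketch below enumerates those names from the labels used in this document. It is plain Java with no ADAM dependency, and ConverterNamesSketch is a hypothetical helper name introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ConverterNamesSketch {
    // Plural type labels exactly as they appear in the class names listed above.
    static final List<String> TYPES = Arrays.asList(
            "Contigs", "Coverage", "Features", "Fragments",
            "AlignmentRecords", "Genotypes", "Variants");

    // Builds <Source>To<Target>DatasetConverter for every ordered pair of distinct types.
    static List<String> converterNames() {
        List<String> names = new ArrayList<>();
        for (String source : TYPES) {
            for (String target : TYPES) {
                if (!source.equals(target)) {
                    names.add(source + "To" + target + "DatasetConverter");
                }
            }
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> names = converterNames();
        System.out.println(names.size() + " converters, e.g. " + names.get(0));
    }
}
```

Seven types with six targets each give the 42 converters documented here; to find the right class, pick the type of the RDD you hold as the source and the type of the Dataset you produced as the target.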

Usage Examples

Basic Dataset conversion with SQL integration:

import org.bdgenomics.adam.api.java.*;
import org.apache.spark.sql.Dataset;

// Load genomic data (jac is a JavaADAMContext created elsewhere)
AlignmentRecordRDD alignments = jac.loadAlignments("sample.bam");
VariantRDD variants = jac.loadVariants("variants.vcf");

// Convert to Dataset for SQL operations
Dataset<Variant> variantDS = variants.dataset();

// Apply SQL transformations
Dataset<Variant> filteredVariants = variantDS
    .filter("start > 1000000")
    .filter("qual > 30.0");

// Wrap the filtered Dataset as a VariantRDD; the alignments passed as v1 supply the metadata
AlignmentRecordsToVariantsDatasetConverter converter = 
    new AlignmentRecordsToVariantsDatasetConverter();
VariantRDD convertedVariants = converter.call(alignments, filteredVariants);

Performance-optimized Dataset operations:

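import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.col;

// Load large genomic datasets (jac is a JavaADAMContext created elsewhere)
GenotypeRDD genotypes = jac.loadGenotypes("large_cohort.vcf");
FeatureRDD features = jac.loadFeatures("annotations.gtf");

// Convert to Datasets for Catalyst optimization
Dataset<Genotype> genotypeDS = genotypes.dataset();
Dataset<Feature> featureDS = features.dataset();

// join() and select() return an untyped Dataset<Row>. Alias both sides so the
// "feature." and "genotype." column prefixes resolve. Field names such as
// variant.qual depend on the schema version in use.
Dataset<Row> annotatedRows = featureDS.as("feature")
    .join(genotypeDS.as("genotype"),
          col("feature.contigName").equalTo(col("genotype.contigName")))
    .where("genotype.variant.qual > 50")
    .select("feature.*");

// Re-type the rows before converting. featureEncoder is a placeholder (not
// defined here) for an Encoder<Feature> matching the schema of featureDS.
Dataset<Feature> annotatedFeatures = annotatedRows.as(featureEncoder);

// Convert back; the genotypes passed as v1 supply the metadata
GenotypesToFeaturesDatasetConverter converter = 
    new GenotypesToFeaturesDatasetConverter();
FeatureRDD result = converter.call(genotypes, annotatedFeatures);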

Combining RDD and Dataset operations:

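import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.count;

// Start with RDD operations for complex logic
AlignmentRecordRDD alignments = jac.loadAlignments("sample.bam");
// Check the mapped flag first: getMapq() returns a nullable Integer
JavaRDD<AlignmentRecord> filteredRDD = alignments.jrdd()
    .filter(read -> read.getReadMapped()
        && read.getMapq() != null && read.getMapq() > 30);

// Convert to Dataset for SQL operations; createDataset expects a Scala RDD,
// hence rdd(). spark is a SparkSession created elsewhere.
Dataset<AlignmentRecord> alignmentDS = spark.createDataset(
    filteredRDD.rdd(), Encoders.bean(AlignmentRecord.class));

// groupBy/agg/select yield an untyped Dataset<Row>
Dataset<Row> coverageRows = alignmentDS
    .groupBy("contigName", "start")
    .agg(count("*").as("count"))
    .select(col("contigName"), col("start"), col("count").as("score"));

// Re-type before converting. coverageEncoder is a placeholder (not defined
// here) for an Encoder<Coverage>; the selected columns must match its fields.
Dataset<Coverage> coverageDS = coverageRows.as(coverageEncoder);

// Convert back to genomic RDD with metadata
AlignmentRecordsToCoverageDatasetConverter converter = 
    new AlignmentRecordsToCoverageDatasetConverter();
CoverageRDD coverage = converter.call(alignments, coverageDS);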
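
To make the aggregation step in the example above concrete, here is a plain-Java sketch of what grouping by (contigName, start) and counting produces. It uses java.util.stream rather than Spark, and Read and CoverageCountSketch are hypothetical names that model only the two grouping fields.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class CoverageCountSketch {
    // Hypothetical minimal read: only the two grouping fields used in the example.
    static final class Read {
        final String contigName;
        final long start;
        Read(String contigName, long start) {
            this.contigName = contigName;
            this.start = start;
        }
    }

    // Counts reads per (contigName, start) key, like groupBy(...).agg(count("*")).
    static Map<String, Long> countByStart(List<Read> reads) {
        return reads.stream().collect(Collectors.groupingBy(
                r -> r.contigName + ":" + r.start, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<Read> reads = Arrays.asList(
                new Read("chr1", 100), new Read("chr1", 100), new Read("chr2", 5));
        System.out.println(countByStart(reads));
    }
}
```

This counts reads that begin at each position, which is not the same as per-base depth: a read also covers the positions after its start, so a depth calculation would need the end coordinate as well.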

Key Benefits

  • Catalyst Optimization: Dataset queries are planned by Spark SQL's Catalyst optimizer
  • SQL Integration: Genomic records can be filtered, joined, and aggregated through the Dataset API
  • Type Safety: Converter inputs and outputs are typed (for example, Dataset<Variant> in, VariantRDD out); string-based SQL expressions are still checked only at runtime
  • Metadata Preservation: The source RDD passed as v1 supplies the genomic metadata for the converted result
  • Interoperability: Bridges the RDD and Dataset APIs
  • Performance: Complex analytical queries can run faster than equivalent RDD code when Catalyst is able to optimize them

Install with Tessl CLI

npx tessl i tessl/maven-org-bdgenomics-adam--adam-apis-2-10
