Core library for distributed genomics data processing built on Apache Spark with support for major genomic file formats
npx @tessl/cli install tessl/maven-org-bdgenomics-adam--adam-core@0.23.0

ADAM Core is a foundational library for distributed genomics data processing built on Apache Spark. It provides high-performance, fault-tolerant data structures and algorithms for genomic sequences, alignments, variants, and features, with support for legacy formats (SAM/BAM/CRAM, VCF, BED/GFF3/GTF, FASTQ, FASTA) and modern columnar storage (Apache Parquet). The library enables scalable genomic data processing across cluster computing environments while maintaining competitive single-node performance.
<dependency>
<groupId>org.bdgenomics.adam</groupId>
<artifactId>adam-core_2.10</artifactId>
<version>0.23.0</version>
</dependency>

import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.rdd.{AlignmentRecordRDD, VariantRDD, FeatureRDD}
import org.apache.spark.SparkContext

import org.bdgenomics.adam.rdd.ADAMContext._
import org.apache.spark.{SparkContext, SparkConf}
// Initialize Spark context
val conf = new SparkConf().setAppName("GenomicsAnalysis")
val sc = new SparkContext(conf)
// Load genomic data (implicit conversion sc -> ADAMContext)
val alignments = sc.loadBam("input.bam")
val variants = sc.loadVcf("variants.vcf")
val features = sc.loadBed("annotations.bed")
// Transform and analyze
val mappedReads = alignments.transform(_.filter(_.getReadMapped))
val coverage = alignments.toCoverage()
// Save results
mappedReads.saveAsParquet("mapped_reads.adam")
coverage.saveAsWig("coverage.wig")

ADAM Core is built around several key architectural components:
Core functionality for loading genomic data from various file formats and saving transformed results. Supports indexed access for efficient region-based queries.
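For example, a region-restricted load against an indexed BAM, using the loaders listed below. This is a sketch: the path, contig name, and coordinates are placeholders, and the .bai index must sit alongside the BAM.

import org.apache.spark.{SparkConf, SparkContext}
import org.bdgenomics.adam.models.ReferenceRegion
import org.bdgenomics.adam.rdd.ADAMContext._

val sc = new SparkContext(new SparkConf().setAppName("RegionQuery"))

// Only reads overlapping chr1:1,000,000-2,000,000 are deserialized
val region = ReferenceRegion("chr1", 1000000L, 2000000L)
val reads = sc.loadIndexedBam("sample.bam", Iterable(region))
println(s"Reads in region: ${reads.rdd.count()}")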
// Main entry point - implicit conversion from SparkContext
implicit def sparkContextToADAMContext(sc: SparkContext): ADAMContext
// Load alignment data
def loadBam(pathName: String,
stringency: ValidationStringency = ValidationStringency.STRICT): AlignmentRecordRDD
def loadIndexedBam(pathName: String, viewRegions: Iterable[ReferenceRegion]): AlignmentRecordRDD
// Load variant data
def loadVcf(pathName: String,
stringency: ValidationStringency = ValidationStringency.STRICT): VariantContextRDD
def loadIndexedVcf(pathName: String, viewRegions: Iterable[ReferenceRegion]): VariantContextRDD
// Load sequence data
def loadFastq(pathName1: String, optPathName2: Option[String] = None,
optRecordGroup: Option[String] = None,
stringency: ValidationStringency = ValidationStringency.STRICT): AlignmentRecordRDD
def loadFasta(pathName: String, maximumLength: Long = 10000L): NucleotideContigFragmentRDD

Distributed RDD implementations for major genomic data types, providing transformations, joins, and analysis operations optimized for genomic workflows.
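These operations compose while preserving genomic metadata. A minimal sketch using the GenomicRDD operations listed below, reusing the SparkContext sc from the quick start (paths are placeholders and the MAPQ threshold is an arbitrary choice):

import org.bdgenomics.adam.rdd.ADAMContext._

val laneA = sc.loadAlignments("laneA.adam")
val laneB = sc.loadAlignments("laneB.adam")

// union merges the records and their metadata (sequence dictionary, record groups)
val merged = laneA.union(laneB)

// transform rewraps the underlying Spark RDD without losing that metadata;
// getMapq can be null for unmapped reads, hence the guard
val highQuality = merged.transform(rdd =>
  rdd.filter(r => r.getReadMapped && r.getMapq != null && r.getMapq > 30))

highQuality.cache()
highQuality.saveAsParquet("merged_highq.adam")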
// Base trait for all genomic RDDs
trait GenomicRDD[T, U <: GenomicRDD[T, U]] {
def transform(tFn: RDD[T] => RDD[T]): U
def union(rdds: U*): U
def saveAsParquet(pathName: String): Unit
def cache(): U
def persist(storageLevel: StorageLevel): U
def unpersist(): U
}
// Main genomic data types (sealed abstract classes from the actual implementation)
sealed abstract class AlignmentRecordRDD extends AvroRecordGroupGenomicRDD[AlignmentRecord, AlignmentRecordProduct, AlignmentRecordRDD]
sealed abstract class VariantRDD extends AvroGenomicRDD[Variant, VariantProduct, VariantRDD]
sealed abstract class GenotypeRDD extends MultisampleAvroGenomicRDD[Genotype, GenotypeProduct, GenotypeRDD]
sealed abstract class FeatureRDD extends AvroGenomicRDD[Feature, FeatureProduct, FeatureRDD]
abstract class CoverageRDD extends GenomicDataset[Coverage, Coverage, CoverageRDD]

Genomic-specific transformations including format conversions, quality score recalibration, duplicate marking, and coverage analysis.
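Chained together, these form a typical preprocessing pipeline. A sketch following the signatures listed below (paths are placeholders; the known-sites file is whatever variant set you recalibrate against):

val reads = sc.loadBam("sample.bam")
val knownSnps = sc.loadVariants("dbsnp.adam")

val processed = reads
  .markDuplicates()                      // flag PCR/optical duplicates
  .recalibrateBaseQualities(knownSnps)   // BQSR, masking known variant sites

// collapse per-base counts into intervals of uniform coverage
val coverage = processed.toCoverage(collapse = true)
coverage.saveAsParquet("coverage.adam")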
// AlignmentRecordRDD transformations
def toCoverage(collapse: Boolean = true): CoverageRDD
def toFragments(): FragmentRDD
def markDuplicates(): AlignmentRecordRDD
def recalibrateBaseQualities(knownSnps: VariantRDD): AlignmentRecordRDD
// VariantRDD transformations
def toGenotypes(): GenotypeRDD
def toVariantContexts(): VariantContextRDD

Comprehensive support for reading and writing genomic file formats, with automatic format detection and validation.
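Because the loaders pick a parser from the file extension, format conversion reduces to a load followed by a save. A sketch using the signatures listed below (file names are placeholders):

// BAM in, Parquet out
val reads = sc.loadAlignments("input.bam")
reads.saveAsParquet("input.alignments.adam")

// VCF round-trip through the variant-context representation
val variants = sc.loadVariants("calls.vcf")
variants.toVariantContexts().saveAsVcf("calls.copy.vcf")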
// Format-agnostic loading
def loadAlignments(pathName: String): AlignmentRecordRDD
def loadVariants(pathName: String): VariantRDD
def loadGenotypes(pathName: String): GenotypeRDD
def loadFeatures(pathName: String): FeatureRDD
// Format-specific saving
def saveAsSam(pathName: String, asType: SAMFormat = SAMFormat.SAM): Unit
def saveAsVcf(pathName: String): Unit
def saveAsBed(pathName: String): Unit

Bioinformatics algorithms including consensus calling, sequence alignment, and variant normalization optimized for distributed processing.
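For instance, a pairwise local alignment via Smith-Waterman. The listing below shows a simplified object interface; the concrete 0.23.0 entry point is a scoring subclass, and the class name, parameter order (sequences, then match/mismatch/insert/delete weights), and the weights themselves are assumptions to verify against org.bdgenomics.adam.algorithms.smithwaterman:

import org.bdgenomics.adam.algorithms.smithwaterman.SmithWatermanConstantGapScoring

val sw = new SmithWatermanConstantGapScoring(
  "TTTACGTACGTTTT", // sequence x (e.g., a reference slice)
  "ACGTACGT",       // sequence y (e.g., a read)
  1.0,              // match weight
  -1.0,             // mismatch weight
  -2.0,             // insertion weight
  -2.0)             // deletion weight

// the alignment is exposed as per-sequence CIGARs and start offsets
println(sw.cigarX)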
// Consensus generation
trait ConsensusGenerator {
def findConsensus(reads: Iterable[AlignmentRecord]): Consensus
}
// Sequence alignment
object SmithWaterman {
def align(reference: String, read: String, scoring: SmithWatermanScoring): Alignment
}

// Genomic coordinates and regions
case class ReferenceRegion(referenceName: String, start: Long, end: Long) {
def contains(pos: ReferencePosition): Boolean
def overlaps(other: ReferenceRegion): Boolean
def width: Long
}
case class ReferencePosition(referenceName: String, pos: Long) extends Ordered[ReferencePosition]
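These coordinate types follow ADAM's 0-based, end-exclusive convention, so a quick sanity check looks like this (values are illustrative):

import org.bdgenomics.adam.models.{ReferencePosition, ReferenceRegion}

val exon = ReferenceRegion("chr1", 100L, 200L)
val snp  = ReferencePosition("chr1", 150L)

exon.contains(snp)                                  // true
exon.overlaps(ReferenceRegion("chr1", 150L, 300L))  // true
exon.width                                          // 100, i.e. end - start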
// Reference genome metadata
class SequenceDictionary {
def records: Seq[SequenceRecord]
def apply(contigName: String): SequenceRecord
}
// Validation stringency levels (mirrors htsjdk.samtools.ValidationStringency)
object ValidationStringency extends Enumeration {
val STRICT, LENIENT, SILENT = Value
}
// Base traits for genomic RDD hierarchy (from actual implementation)
trait AvroGenomicRDD[T, U, V <: AvroGenomicRDD[T, U, V]] extends GenomicRDD[T, V]
trait AvroRecordGroupGenomicRDD[T, U, V <: AvroRecordGroupGenomicRDD[T, U, V]] extends AvroGenomicRDD[T, U, V]
trait MultisampleAvroGenomicRDD[T, U, V <: MultisampleAvroGenomicRDD[T, U, V]] extends AvroGenomicRDD[T, U, V]
trait GenomicDataset[T, U, V <: GenomicDataset[T, U, V]] extends GenomicRDD[T, V]
// Avro record product types
trait AlignmentRecordProduct
trait VariantProduct
trait GenotypeProduct
trait FeatureProduct
// Core Avro data types (from ADAM schemas)
case class AlignmentRecord(/* fields from Avro schema */)
case class Variant(/* fields from Avro schema */)
case class Genotype(/* fields from Avro schema */)
case class Feature(/* fields from Avro schema */)
case class Coverage(/* fields from Avro schema */)
// Storage level from Spark
import org.apache.spark.storage.StorageLevel

ADAM Core uses validation stringency levels to control error handling:
Most loading methods accept an optional ValidationStringency parameter to customize error handling behavior.
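For example, tolerating malformed records in a legacy BAM. A sketch: the path is a placeholder, and in ADAM the stringency type comes from htsjdk (htsjdk.samtools.ValidationStringency).

import htsjdk.samtools.ValidationStringency
import org.bdgenomics.adam.rdd.ADAMContext._

// STRICT (the default) fails the load on a malformed record,
// LENIENT logs it and continues, SILENT skips it quietly
val reads = sc.loadBam("legacy.bam", ValidationStringency.LENIENT)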