
tessl/maven-org-bdgenomics-adam--adam-core

Core library for distributed genomics data processing built on Apache Spark with support for major genomic file formats

ADAM Core

ADAM Core is a foundational library for distributed genomics data processing built on Apache Spark. It provides high-performance, fault-tolerant data structures and algorithms for genomic sequences, alignments, variants, and features, with support for legacy formats (SAM/BAM/CRAM, VCF, BED/GFF3/GTF, FASTQ, FASTA) and modern columnar storage (Apache Parquet). The library enables scalable genomic data processing across cluster computing environments while maintaining competitive single-node performance.

Package Information

  • Package Name: adam-core_2.10
  • Package Type: maven
  • Language: Scala
  • Framework: Apache Spark
  • Installation: Add to Maven pom.xml:
    <dependency>
      <groupId>org.bdgenomics.adam</groupId>
      <artifactId>adam-core_2.10</artifactId>
      <version>0.23.0</version>
    </dependency>
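If your build uses sbt rather than Maven, the equivalent dependency (assuming the same 0.23.0 release and a Scala 2.10 build) would be:

```scala
// sbt build definition (sketch); adam-core is published per Scala binary version
scalaVersion := "2.10.6"
libraryDependencies += "org.bdgenomics.adam" % "adam-core_2.10" % "0.23.0"
```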

Core Imports

import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.rdd.{AlignmentRecordRDD, VariantRDD, FeatureRDD}
import org.apache.spark.SparkContext

Basic Usage

import org.bdgenomics.adam.rdd.ADAMContext._
import org.apache.spark.{SparkContext, SparkConf}

// Initialize Spark context
val conf = new SparkConf().setAppName("GenomicsAnalysis")
val sc = new SparkContext(conf)

// Load genomic data (implicit conversion sc -> ADAMContext)
val alignments = sc.loadBam("input.bam")
val variants = sc.loadVcf("variants.vcf")
val features = sc.loadBed("annotations.bed")

// Transform and analyze
val mappedReads = alignments.transform(_.filter(_.getReadMapped))
val coverage = alignments.toCoverage()

// Save results
mappedReads.saveAsParquet("mapped_reads.adam")
coverage.saveAsWig("coverage.wig")

Architecture

ADAM Core is built around several key architectural components:

  • ADAMContext: Entry point providing loading methods for all genomic file formats, extending SparkContext functionality
  • GenomicRDD Framework: Base distributed data structures with genomic-aware partitioning, transformations, and I/O operations
  • Data Type System: Strongly-typed genomic records based on Avro schemas for serialization efficiency
  • Format Converters: Bidirectional conversion between legacy formats and ADAM's internal representations
  • Schema Projections: Column-store optimizations for accessing only required fields from Parquet data
  • Algorithms Package: Genomic algorithms including consensus generation and sequence alignment

Capabilities

Data Loading and I/O

Core functionality for loading genomic data from various file formats and saving transformed results. Supports indexed access for efficient region-based queries.

// Main entry point - implicit conversion from SparkContext
implicit def sparkContextToADAMContext(sc: SparkContext): ADAMContext

// Load alignment data
def loadBam(pathName: String, 
           stringency: ValidationStringency = ValidationStringency.STRICT): AlignmentRecordRDD
def loadIndexedBam(pathName: String, viewRegions: Iterable[ReferenceRegion]): AlignmentRecordRDD

// Load variant data  
def loadVcf(pathName: String,
           stringency: ValidationStringency = ValidationStringency.STRICT): VariantContextRDD
def loadIndexedVcf(pathName: String, viewRegions: Iterable[ReferenceRegion]): VariantContextRDD

// Load sequence data
def loadFastq(pathName1: String, optPathName2: Option[String] = None, 
             optRecordGroup: Option[String] = None,
             stringency: ValidationStringency = ValidationStringency.STRICT): AlignmentRecordRDD
def loadFasta(pathName: String, maximumLength: Long = 10000L): NucleotideContigFragmentRDD
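The indexed loaders restrict reading to specific genomic regions rather than scanning the whole file. A minimal sketch (assumes an existing SparkContext `sc`, the ADAMContext implicits in scope, and a hypothetical `reads.bam` with its index alongside):

```scala
import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.models.ReferenceRegion

// Read only alignments overlapping chr1:100,000-200,000 (placeholder path)
val region = ReferenceRegion("chr1", 100000L, 200000L)
val regionReads = sc.loadIndexedBam("reads.bam", Iterable(region))
println(regionReads.rdd.count())
```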

Genomic Data Types

Distributed RDD implementations for major genomic data types, providing transformations, joins, and analysis operations optimized for genomic workflows.

// Base trait for all genomic RDDs
trait GenomicRDD[T, U <: GenomicRDD[T, U]] {
  def transform(tFn: RDD[T] => RDD[T]): U
  def union(rdds: U*): U
  def saveAsParquet(pathName: String): Unit
  def cache(): U
  def persist(storageLevel: StorageLevel): U
  def unpersist(): U
}

// Main genomic data types (abstract sealed classes from actual implementation)
sealed abstract class AlignmentRecordRDD extends AvroRecordGroupGenomicRDD[AlignmentRecord, AlignmentRecordProduct, AlignmentRecordRDD]
sealed abstract class VariantRDD extends AvroGenomicRDD[Variant, VariantProduct, VariantRDD]
sealed abstract class GenotypeRDD extends MultisampleAvroGenomicRDD[Genotype, GenotypeProduct, GenotypeRDD]
sealed abstract class FeatureRDD extends AvroGenomicRDD[Feature, FeatureProduct, FeatureRDD]  
abstract class CoverageRDD extends GenomicDataset[Coverage, Coverage, CoverageRDD]

Data Transformations

Genomic-specific transformations including format conversions, quality score recalibration, duplicate marking, and coverage analysis.

// AlignmentRecordRDD transformations
def toCoverage(collapse: Boolean = true): CoverageRDD
def toFragments(): FragmentRDD
def markDuplicates(): AlignmentRecordRDD
def recalibrateBaseQualities(knownSnps: VariantRDD): AlignmentRecordRDD

// VariantRDD transformations  
def toGenotypes(): GenotypeRDD
def toVariantContexts(): VariantContextRDD
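A typical read-processing pipeline chains these transformations. A hedged sketch using the signatures above (input paths are placeholders; assumes `sc` and the ADAMContext implicits are in scope):

```scala
import org.bdgenomics.adam.rdd.ADAMContext._

// Mark PCR duplicates, recalibrate base qualities against known sites,
// then compute per-base coverage (hypothetical input paths)
val reads = sc.loadBam("sample.bam")
val knownSnps = sc.loadVariants("known_sites.vcf")
val processed = reads.markDuplicates().recalibrateBaseQualities(knownSnps)
processed.toCoverage().saveAsParquet("sample.coverage.adam")
```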

File Format Support

Comprehensive support for reading and writing genomic file formats, with automatic format detection and validation.

// Format-agnostic loading
def loadAlignments(pathName: String): AlignmentRecordRDD
def loadVariants(pathName: String): VariantRDD  
def loadGenotypes(pathName: String): GenotypeRDD
def loadFeatures(pathName: String): FeatureRDD

// Format-specific saving
def saveAsSam(pathName: String, asType: SAMFormat = SAMFormat.SAM): Unit
def saveAsVcf(pathName: String): Unit
def saveAsBed(pathName: String): Unit
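Because the format-agnostic loaders detect the input format from the file extension, format conversion is a short round trip. A sketch (paths are placeholders):

```scala
import org.bdgenomics.adam.rdd.ADAMContext._

// Convert a legacy BAM to ADAM's Parquet representation, and back to SAM
val alignments = sc.loadAlignments("input.bam")   // auto-detects BAM
alignments.saveAsParquet("input.alignments.adam") // columnar Parquet
alignments.saveAsSam("roundtrip.sam")             // back to SAM text
```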

Genomic Algorithms

Bioinformatics algorithms including consensus calling, sequence alignment, and variant normalization optimized for distributed processing.

// Consensus generation
trait ConsensusGenerator {
  def findConsensus(reads: Iterable[AlignmentRecord]): Consensus
}

// Sequence alignment
object SmithWaterman {
  def align(reference: String, read: String, scoring: SmithWatermanScoring): Alignment
}
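To make the recurrence behind `SmithWaterman.align` concrete, here is a self-contained, score-only sketch of Smith-Waterman local alignment. The scoring constants are illustrative, not ADAM's `SmithWatermanScoring` defaults, and `LocalAlign` is a standalone name, not ADAM's API:

```scala
// Score-only Smith-Waterman: H(i,j) = max(0, diag + s, up + gap, left + gap)
object LocalAlign {
  def score(ref: String, read: String,
            matchScore: Int = 2, mismatch: Int = -1, gap: Int = -2): Int = {
    val h = Array.ofDim[Int](ref.length + 1, read.length + 1)
    var best = 0
    for (i <- 1 to ref.length; j <- 1 to read.length) {
      val s = if (ref(i - 1) == read(j - 1)) matchScore else mismatch
      h(i)(j) = Seq(0, h(i - 1)(j - 1) + s, h(i - 1)(j) + gap, h(i)(j - 1) + gap).max
      best = best max h(i)(j)
    }
    best // highest-scoring local alignment found anywhere in the matrix
  }
}

println(LocalAlign.score("ACGT", "ACGT")) // 8: four matches at +2 each
```

The zero floor in the recurrence is what makes the alignment local: a poorly matching prefix never penalizes a later high-scoring region.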

Key Data Types

// Genomic coordinates and regions
case class ReferenceRegion(referenceName: String, start: Long, end: Long) {
  def contains(pos: ReferencePosition): Boolean
  def overlaps(other: ReferenceRegion): Boolean  
  def width: Long
}

case class ReferencePosition(referenceName: String, pos: Long) extends Ordered[ReferencePosition]

// Reference genome metadata
class SequenceDictionary {
  def records: Seq[SequenceRecord]
  def apply(contigName: String): SequenceRecord
}

// Validation stringency levels (in ADAM these come from htsjdk, as a Java enum)
// htsjdk.samtools.ValidationStringency: STRICT, LENIENT, SILENT

// Base traits for genomic RDD hierarchy (from actual implementation)
trait AvroGenomicRDD[T, U, V <: AvroGenomicRDD[T, U, V]] extends GenomicRDD[T, V]
trait AvroRecordGroupGenomicRDD[T, U, V <: AvroRecordGroupGenomicRDD[T, U, V]] extends AvroGenomicRDD[T, U, V]
trait MultisampleAvroGenomicRDD[T, U, V <: MultisampleAvroGenomicRDD[T, U, V]] extends AvroGenomicRDD[T, U, V]
trait GenomicDataset[T, U, V <: GenomicDataset[T, U, V]] extends GenomicRDD[T, V]

// Avro record product types
trait AlignmentRecordProduct
trait VariantProduct  
trait GenotypeProduct
trait FeatureProduct

// Core Avro data types (generated from the bdg-formats Avro schemas; shown as sketches)
case class AlignmentRecord(/* fields from Avro schema */)
case class Variant(/* fields from Avro schema */)
case class Genotype(/* fields from Avro schema */)
case class Feature(/* fields from Avro schema */)
case class Coverage(/* fields from Avro schema */)

// Storage level from Spark
import org.apache.spark.storage.StorageLevel
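The interval semantics of `ReferenceRegion` follow the usual 0-based, half-open `[start, end)` convention. A self-contained sketch of that coordinate logic (standalone `Region`/`Pos` types for illustration, not ADAM's classes):

```scala
// Standalone sketch of ReferenceRegion-style logic with half-open intervals
case class Pos(referenceName: String, pos: Long)
case class Region(referenceName: String, start: Long, end: Long) {
  def width: Long = end - start
  def contains(p: Pos): Boolean =
    referenceName == p.referenceName && p.pos >= start && p.pos < end
  def overlaps(o: Region): Boolean =
    referenceName == o.referenceName && start < o.end && o.start < end
}

println(Region("chr1", 100, 200).overlaps(Region("chr1", 150, 250))) // true
```

Note that regions on different reference sequences never overlap, regardless of coordinates.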

Error Handling

ADAM Core uses validation stringency levels to control error handling:

  • STRICT: Fails immediately on any format violations or data inconsistencies
  • LENIENT: Logs warnings for format violations but continues processing
  • SILENT: Ignores format violations and processes data without warnings

Most loading methods accept an optional ValidationStringency parameter to customize error handling behavior.
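For example, to tolerate malformed records in a legacy BAM rather than failing the job, a loader call might look like the following sketch (the path is a placeholder; in ADAM the stringency type comes from htsjdk):

```scala
import org.bdgenomics.adam.rdd.ADAMContext._
import htsjdk.samtools.ValidationStringency

// Log warnings on malformed records instead of aborting the load
val reads = sc.loadBam("legacy.bam", ValidationStringency.LENIENT)
```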
