Core library for distributed genomics data processing built on Apache Spark with support for major genomic file formats
—
Bioinformatics algorithms including consensus calling, sequence alignment, and variant normalization optimized for distributed processing. ADAM Core provides scalable implementations of standard genomic algorithms while maintaining compatibility with established tools like GATK and Picard.
Algorithms for generating consensus sequences from aligned reads or incorporating known variants.
/**
 * Base trait for consensus sequence generation algorithms.
 *
 * A consensus generator reduces a collection of aligned reads to a single
 * [[Consensus]]. Concrete strategies documented below include read-backed
 * calling (ConsensusGeneratorFromReads), known-variant-guided calling
 * (ConsensusGeneratorFromKnowns), and a union of strategies
 * (UnionConsensusGenerator).
 */
trait ConsensusGenerator {
/**
 * Generate a consensus sequence from aligned reads.
 *
 * @param reads Iterable collection of alignment records; presumably reads
 *              overlapping a common genomic region — confirm against caller.
 * @return Consensus object containing the consensus sequence and quality
 *         information.
 */
def findConsensus(reads: Iterable[AlignmentRecord]): Consensus
}
/**
 * Generates consensus sequences from aligned reads using coverage-based
 * calling. Implements majority-rule consensus with configurable thresholds.
 *
 * @param consensusFrequencyThreshold Fraction-of-reads agreement threshold
 *                                    used by the majority-rule caller
 *                                    (default 0.4).
 * @param minReadsAtPosition Minimum read depth at a position before it is
 *                           eligible for consensus calling (default 5).
 */
class ConsensusGeneratorFromReads(
consensusFrequencyThreshold: Double = 0.4,
minReadsAtPosition: Int = 5
) extends ConsensusGenerator {
// Majority-rule implementation of the ConsensusGenerator contract.
def findConsensus(reads: Iterable[AlignmentRecord]): Consensus
}
/**
 * Generates consensus sequences incorporating known variant sites.
 * Uses known variants to guide consensus calling in ambiguous regions.
 *
 * @param variants RDD of known variants used to guide calling.
 * @param minAlleleFraction Allele-fraction threshold applied when deciding
 *                          whether a known site is supported (default 0.25).
 */
class ConsensusGeneratorFromKnowns(
variants: RDD[Variant],
minAlleleFraction: Double = 0.25
) extends ConsensusGenerator {
// Known-variant-guided implementation of the ConsensusGenerator contract.
def findConsensus(reads: Iterable[AlignmentRecord]): Consensus
}
/**
 * Union consensus generator combining multiple consensus strategies.
 * Merges results from multiple consensus generators with conflict resolution.
 *
 * @param generators Ordered list of generators whose results are merged;
 *                   conflict-resolution details are implementation-defined.
 */
class UnionConsensusGenerator(
generators: List[ConsensusGenerator]
) extends ConsensusGenerator {
// Delegates to each wrapped generator and merges their results.
def findConsensus(reads: Iterable[AlignmentRecord]): Consensus
}

Usage Examples:
import org.bdgenomics.adam.algorithms.consensus._
import org.bdgenomics.adam.rdd.ADAMContext._

// Coverage-based consensus over reads loaded from a BAM file.
val readBackedGenerator = new ConsensusGeneratorFromReads(
  consensusFrequencyThreshold = 0.6,
  minReadsAtPosition = 10
)
val alignments = sc.loadBam("sample.bam")
val chr1Reads = alignments.rdd.filter(read => read.getContigName == "chr1")
val consensus = readBackedGenerator.findConsensus(chr1Reads.collect())

// Consensus guided by known variant sites.
val knownVariants = sc.loadVcf("dbsnp.vcf").toVariants().rdd
val knownsGenerator = new ConsensusGeneratorFromKnowns(
  knownVariants,
  minAlleleFraction = 0.3
)

// Combine both strategies into a single generator.
val unionGenerator = new UnionConsensusGenerator(
  List(readBackedGenerator, knownsGenerator)
)
)

Local sequence alignment algorithms for comparing genomic sequences.
/**
 * Smith-Waterman local sequence alignment algorithm.
 * Produces the optimal local alignment between two sequences under the
 * supplied scoring scheme.
 */
object SmithWaterman {
/**
 * Perform local sequence alignment between a reference and a query sequence.
 *
 * @param reference Reference sequence string.
 * @param read Query sequence string to align against the reference.
 * @param scoring Scoring scheme (match/mismatch/gap penalties) that drives
 *                the alignment; see SmithWatermanScoring.
 * @return Alignment object containing the optimal local alignment.
 */
def align(reference: String,
read: String,
scoring: SmithWatermanScoring): Alignment
}
/**
 * Scoring scheme interface for Smith-Waterman sequence alignment.
 * Scores are signed: rewards are positive, penalties negative.
 */
trait SmithWatermanScoring {
/** Score awarded for matching bases. */
def matchScore: Int
/** Penalty applied for mismatching bases. */
def mismatchScore: Int
/** Penalty applied when opening a new gap. */
def gapOpenScore: Int
/** Penalty applied for each additional position a gap is extended. */
def gapExtendScore: Int
}
/**
 * Constant (linear) gap scoring scheme: a single fixed penalty is charged
 * for every gap position, whether opening or extending.
 *
 * @param matchScore Score awarded for matching bases (default 10).
 * @param mismatchScore Penalty applied for mismatching bases (default -2).
 * @param gapScore Penalty applied per gap position; used for both gap open
 *                 and gap extend (default -8).
 */
class SmithWatermanConstantGapScoring(
  val matchScore: Int = 10,
  val mismatchScore: Int = -2,
  gapScore: Int = -8
) extends SmithWatermanScoring {
  // Constant-gap model: opening and extending a gap cost the same.
  def gapOpenScore: Int = gapScore
  def gapExtendScore: Int = gapScore
}
/**
 * Affine gap scoring with distinct open and extend penalties, allowing long
 * gaps to be penalized less per-base than many short gaps.
 *
 * @param matchScore Score awarded for matching bases.
 * @param mismatchScore Penalty applied for mismatching bases.
 * @param gapOpenScore Penalty applied when opening a gap.
 * @param gapExtendScore Penalty applied for each position a gap is extended.
 */
class SmithWatermanGapScoringFromFn(
  val matchScore: Int,
  val mismatchScore: Int,
  val gapOpenScore: Int,
  val gapExtendScore: Int
) extends SmithWatermanScoring

Usage Examples:
import org.bdgenomics.adam.algorithms.smithwaterman._

// Configure a linear (constant) gap scoring scheme.
val constantGapScoring = new SmithWatermanConstantGapScoring(
  matchScore = 10,
  mismatchScore = -2,
  gapScore = -5
)

// Align a short query against a reference fragment.
val reference = "ACGTACGTACGT"
val query = "ACGTACGT"
val alignment = SmithWaterman.align(reference, query, constantGapScoring)

// Affine gap penalties model long gaps more realistically:
// opening is expensive, extending is cheap.
val affineScoring = new SmithWatermanGapScoringFromFn(
  matchScore = 10,
  mismatchScore = -4,
  gapOpenScore = -10,
  gapExtendScore = -1
)
val affineAlignment = SmithWaterman.align(reference, query, affineScoring)

Algorithms for standardizing variant representations and resolving complex variants.
/**
 * Utilities for normalizing variant representations.
 * Implements variant normalization standards so that equivalent variants
 * have a single consistent representation.
 */
object NormalizationUtils {
/**
 * Left-align and trim a variant representation.
 *
 * @param variant Variant to normalize.
 * @param reference Reference sequence context surrounding the variant.
 * @return Normalized (left-aligned, trimmed) variant representation.
 */
def leftAlignAndTrim(variant: Variant, reference: String): Variant
/**
 * Decompose a multi-nucleotide variant (MNV) into primitive representations.
 *
 * @param variant Complex variant to decompose.
 * @return Sequence of primitive variants equivalent to the input.
 */
def decomposeMNV(variant: Variant): Seq[Variant]
/**
 * Normalize an indel representation to canonical form.
 *
 * @param variant Indel variant to normalize.
 * @param reference Reference sequence context surrounding the indel.
 * @return Canonically represented indel.
 */
def normalizeIndel(variant: Variant, reference: String): Variant
}

Usage Examples:
import org.bdgenomics.adam.algorithms.consensus.NormalizationUtils
import org.bdgenomics.adam.rdd.ADAMContext._

// Load the call set and the reference it was called against.
val variants = sc.loadVcf("variants.vcf").toVariants()
val reference = sc.loadFasta("reference.fasta")

// Left-align and trim each variant against 10 bp of flanking reference context.
val normalized = variants.rdd.map { v =>
  val refContext = reference.extract(
    ReferenceRegion(v.getContigName, v.getStart - 10, v.getEnd + 10)
  )
  NormalizationUtils.leftAlignAndTrim(v, refContext)
}

// Split multi-nucleotide variants into primitive records; pass others through.
val decomposed = variants.rdd.flatMap { v =>
  if (v.getAlternateAllele.length > 1) {
    NormalizationUtils.decomposeMNV(v)
  } else {
    Seq(v)
  }
}

Base quality score recalibration algorithms for improving sequencing accuracy.
/**
 * Base quality score recalibration using known variant sites.
 * Implements GATK-style base quality score recalibration.
 *
 * @param alignments AlignmentRecordRDD whose base qualities are recalibrated.
 * @param knownSnps Known variant sites used when building the recalibration
 *                  model (presumably masked as non-errors — confirm with
 *                  implementation).
 * @return AlignmentRecordRDD with recalibrated quality scores.
 */
def recalibrateBaseQualities(alignments: AlignmentRecordRDD,
knownSnps: VariantRDD): AlignmentRecordRDD
/**
 * Build a quality score recalibration model from training data.
 *
 * @param alignments Training alignments used to fit the model.
 * @param knownSites Known variant sites excluded from error modeling.
 * @return Recalibration model that can be applied to new data.
 */
def buildRecalibrationModel(alignments: AlignmentRecordRDD,
knownSites: VariantRDD): RecalibrationModel

Algorithms for identifying and marking duplicate reads in sequencing data.
/**
 * Mark duplicate reads based on alignment coordinates.
 * Implements a Picard-compatible duplicate marking algorithm, so results
 * interoperate with tools that expect Picard-style duplicate flags.
 *
 * @param alignments AlignmentRecordRDD to process.
 * @return AlignmentRecordRDD with duplicate reads flagged (reads are
 *         flagged, not removed).
 */
def markDuplicates(alignments: AlignmentRecordRDD): AlignmentRecordRDD
/**
* Mark duplicate fragments for paired-end data
* @param fragments - FragmentRDD to process
* @return FragmentRDD with duplicate fragments flagged
*/
def markDuplicates(fragments: FragmentRDD): FragmentRDD

Usage Examples:
import org.bdgenomics.adam.rdd.ADAMContext._

val alignments = sc.loadBam("sample.bam")

// Flag duplicate reads without removing them.
val markedDuplicates = alignments.markDuplicates()

// Keep only non-duplicate reads for downstream analysis.
val uniqueReads = markedDuplicates.transform(rdd => rdd.filter(r => !r.getDuplicateRead))

// Recalibrate base qualities against known polymorphic sites.
val knownSites = sc.loadVcf("dbsnp.vcf").toVariants()
val recalibrated = alignments.recalibrateBaseQualities(knownSites)

Local realignment algorithms for improving alignment accuracy around indels.
/**
 * Local realignment around indels to improve alignment accuracy.
 *
 * @param alignments AlignmentRecordRDD to realign.
 * @param knownIndels Optional known indel sites used to guide realignment;
 *                    when None, candidate sites are discovered from the
 *                    alignments themselves — confirm with implementation.
 * @return AlignmentRecordRDD with improved alignments around indels.
 */
def realignIndels(alignments: AlignmentRecordRDD,
knownIndels: Option[VariantRDD] = None): AlignmentRecordRDD
/**
* Identify candidate regions for realignment
* @param alignments - Input alignments
* @return RDD of genomic regions requiring realignment
*/
def findRealignmentCandidates(alignments: AlignmentRecordRDD): RDD[ReferenceRegion]

Algorithms for analyzing sequencing depth and coverage patterns.
/**
 * Calculate coverage depth across genomic positions.
 *
 * @param alignments AlignmentRecordRDD to analyze.
 * @param collapse Whether to merge adjacent positions that share the same
 *                 coverage value into a single record (default true).
 * @return CoverageRDD with per-position (or collapsed-interval) depth.
 */
def calculateCoverage(alignments: AlignmentRecordRDD,
collapse: Boolean = true): CoverageRDD
/**
 * Identify regions with coverage at or above a specified threshold.
 *
 * @param coverage CoverageRDD to analyze.
 * @param threshold Minimum coverage threshold a region must reach.
 * @return FeatureRDD of high-coverage regions.
 */
def findHighCoverageRegions(coverage: CoverageRDD,
threshold: Double): FeatureRDD
/**
 * Calculate coverage statistics across genomic regions.
 *
 * @param coverage Coverage data to summarize.
 * @param regions Regions of interest over which statistics are computed.
 * @return RDD pairing each region's Feature with its CoverageStatistics.
 */
def coverageStatistics(coverage: CoverageRDD,
regions: FeatureRDD): RDD[(Feature, CoverageStatistics)]

Algorithms for analyzing k-mer composition in genomic sequences.
/**
 * Count k-mers observed in sequencing reads.
 *
 * @param alignments Reads whose sequences are decomposed into k-mers.
 * @param kmerLength Length k of the k-mers to count.
 * @return RDD of (k-mer sequence, occurrence count) pairs.
 */
def countKmers(alignments: AlignmentRecordRDD,
kmerLength: Int): RDD[(String, Long)]
/**
 * Find overrepresented k-mers among counted k-mers.
 *
 * @param kmers K-mer counts as produced by countKmers.
 * @param threshold Minimum count for a k-mer to be reported as
 *                  overrepresented.
 * @return RDD of overrepresented (k-mer, count) pairs.
 */
def findOverrepresentedKmers(kmers: RDD[(String, Long)],
threshold: Long): RDD[(String, Long)]

Usage Examples:
// End-to-end k-mer analysis over a BAM file.
val alignments = sc.loadBam("sample.bam")

// Tally every 21-mer observed in the reads.
val kmers = alignments.countKmers(21)

// Pull out the 100 most frequent k-mers, ranked by count.
val topKmers = kmers.top(100)(Ordering.by(pair => pair._2))
val overrepresented = kmers.filter(_._2 > 1000)

ADAM's algorithms maintain compatibility with standard bioinformatics tools such as GATK and Picard.
This ensures ADAM-processed data integrates seamlessly with existing genomics pipelines.
Install with Tessl CLI
npx tessl i tessl/maven-org-bdgenomics-adam--adam-core