
tessl/maven-org-bdgenomics-adam--adam-core

Core library for distributed genomics data processing built on Apache Spark with support for major genomic file formats


Data Transformations

Genomics-specific transformations, including format conversion, base quality score recalibration, duplicate marking, and coverage analysis. These operations are optimized for distributed processing of large genomic datasets while preserving data integrity and genomic coordinate awareness.

Capabilities

Alignment Transformations

Specialized transformations for sequencing read data, including quality-improvement and analysis operations.

/**
 * Convert alignment records to coverage depth information
 * @param collapse - Whether to merge adjacent coverage records with same depth
 * @return CoverageRDD representing sequencing depth across genomic positions
 */
def toCoverage(collapse: Boolean = true): CoverageRDD

/**
 * Group paired-end reads into fragments representing complete DNA molecules
 * @return FragmentRDD containing paired reads as single fragments
 */
def toFragments(): FragmentRDD

/**
 * Mark duplicate reads based on genomic coordinates and orientation
 * Uses Picard's duplicate marking algorithm for compatibility
 * @return AlignmentRecordRDD with duplicate reads flagged
 */
def markDuplicates(): AlignmentRecordRDD

/**
 * Recalibrate base quality scores using known variant sites
 * Implements GATK-style base quality score recalibration
 * @param knownSnps - VariantRDD of known variant sites for recalibration
 * @return AlignmentRecordRDD with recalibrated quality scores
 */
def recalibrateBaseQualities(knownSnps: VariantRDD): AlignmentRecordRDD

/**
 * Realign reads around indels to improve alignment accuracy
 * @return AlignmentRecordRDD with improved alignments around indels
 */
def realignIndels(): AlignmentRecordRDD

/**
 * Sort reads by genomic coordinate for efficient processing
 * @return AlignmentRecordRDD sorted by reference position
 */
def sortReadsByReferencePosition(): AlignmentRecordRDD

/**
 * Count k-mers of specified length across all reads
 * @param kmerLength - Length of k-mers to count
 * @return RDD of k-mer sequences with their occurrence counts
 */
def countKmers(kmerLength: Int): RDD[(String, Long)]

Usage Examples:

import org.bdgenomics.adam.rdd.ADAMContext._

val alignments = sc.loadBam("input.bam")

// Generate coverage from alignments
val coverage = alignments.toCoverage(collapse = true)
coverage.saveAsWig("coverage.wig")

// Mark and remove duplicates
val deduped = alignments
  .markDuplicates()
  .transform(_.filter(!_.getDuplicateRead))

// Quality score recalibration workflow
val knownVariants = sc.loadVcf("known_sites.vcf").toVariants()
val recalibrated = alignments.recalibrateBaseQualities(knownVariants)

// K-mer analysis
val kmers21 = alignments.countKmers(21)
val topKmers = kmers21.top(100)(Ordering.by(_._2))

Variant Transformations

Transformations for variant calling and population genetics workflows.

/**
 * Convert variants to genotype calls for population analysis
 * @return GenotypeRDD containing sample-specific genotype information
 */
def toGenotypes(): GenotypeRDD

/**
 * Convert variants to variant contexts with full VCF metadata
 * @return VariantContextRDD containing variants with header information
 */
def toVariantContexts(): VariantContextRDD

/**
 * Extract variants from genotype calls
 * @return VariantRDD containing unique variant sites
 */
def toVariants(): VariantRDD  // Available on GenotypeRDD and VariantContextRDD

Usage Examples:

// Load and transform variant data
val variantContexts = sc.loadVcf("population.vcf")

// Extract variants for analysis
val variants = variantContexts.toVariants()

// Extract genotypes for population genetics
val genotypes = variantContexts.toGenotypes()

// Filter variants by quality and convert back
val highQualityVariants = variants
  .transform(_.filter(_.getQuality > 30.0))
  .toVariantContexts()

Coverage Analysis

Transformations for analyzing sequencing depth and coverage patterns.

/**
 * Collapse adjacent coverage records with identical depth values
 * @return CoverageRDD with consolidated coverage intervals
 */
def collapse(): CoverageRDD

/**
 * Convert coverage records to genomic features for analysis
 * @return FeatureRDD representing coverage intervals as features
 */
def toFeatures(): FeatureRDD

/**
 * Aggregate coverage across multiple samples at each position
 * @return CoverageRDD with combined coverage information
 */
def aggregateByPosition(): CoverageRDD

Usage Examples:

// Generate and analyze coverage
val alignments = sc.loadBam("sample.bam")
val coverage = alignments.toCoverage()

// Collapse adjacent regions with same coverage
val collapsed = coverage.collapse()

// Find regions with high coverage (>50x)
val highCoverage = coverage
  .transform(_.filter(_.count > 50.0))
  .toFeatures()

highCoverage.saveAsBed("high_coverage_regions.bed")
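
`aggregateByPosition` from the signatures above is not exercised in this example. A minimal sketch of combining per-sample coverage, using only methods documented on this page (file names are illustrative):

```scala
import org.bdgenomics.adam.rdd.ADAMContext._

// Per-sample coverage tracks
val cov1 = sc.loadBam("sample1.bam").toCoverage()
val cov2 = sc.loadBam("sample2.bam").toCoverage()

// Union the tracks, then combine depth at each genomic position
val cohortCoverage = cov1.union(cov2).aggregateByPosition()

// Export for genome-browser inspection
cohortCoverage.saveAsWig("cohort_coverage.wig")
```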

Fragment Operations

Transformations for paired-end sequencing fragment analysis.

/**
 * Convert fragments back to individual alignment records
 * @return AlignmentRecordRDD containing separate read alignments
 */
def toAlignmentRecords(): AlignmentRecordRDD

/**
 * Mark duplicate fragments based on alignment coordinates
 * @return FragmentRDD with duplicate fragments flagged
 */
def markDuplicates(): FragmentRDD

Usage Examples:

// Work with paired-end fragments
val alignments = sc.loadBam("paired_end.bam")
val fragments = alignments.toFragments()

// Mark duplicate fragments
val dedupedFragments = fragments.markDuplicates()

// Convert back to reads for downstream processing
val cleanReads = dedupedFragments.toAlignmentRecords()

Format Conversions

Convert between different genomic data representations and file formats.

/**
 * Save alignment records as SAM/BAM/CRAM format
 * @param pathName - Output file path
 * @param asType - Output format (SAM, BAM, or CRAM)
 * @param asSingleFile - Whether to save as single file or partitioned
 */
def saveAsSam(pathName: String, 
             asType: SAMFormat = SAMFormat.SAM,
             asSingleFile: Boolean = false): Unit

/**
 * Save alignment records as FASTQ format
 * @param pathName - Output file path
 * @param outputOriginalBaseQualities - Whether to output original quality scores
 * @param asSingleFile - Whether to save as single file
 */
def saveAsFastq(pathName: String,
               outputOriginalBaseQualities: Boolean = false,
               asSingleFile: Boolean = false): Unit

/**
 * Save variants as VCF format
 * @param pathName - Output file path
 * @param stringency - Validation stringency for VCF compliance
 */
def saveAsVcf(pathName: String,
             stringency: ValidationStringency = ValidationStringency.STRICT): Unit

/**
 * Save features in various annotation formats
 */
def saveAsBed(pathName: String): Unit        // BED format
def saveAsGtf(pathName: String): Unit        // GTF format  
def saveAsGff3(pathName: String): Unit       // GFF3 format
def saveAsIntervalList(pathName: String): Unit // Picard interval list
def saveAsNarrowPeak(pathName: String): Unit  // ENCODE narrowPeak

/**
 * Save coverage as WIG format for genome browsers
 * @param pathName - Output file path
 */
def saveAsWig(pathName: String): Unit

/**
 * Save reference sequences as FASTA format
 * @param pathName - Output file path
 * @param lineWidth - Number of bases per line
 */
def saveAsFasta(pathName: String, lineWidth: Int = 60): Unit
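
Usage Examples (a hedged sketch following the save signatures above; the file paths and the `SAMFormat` import location are assumptions):

```scala
import org.bdgenomics.adam.rdd.ADAMContext._
import org.seqdoop.hadoop_bam.SAMFormat

val alignments = sc.loadBam("input.bam")

// Write a single merged BAM instead of one shard per partition
alignments.saveAsSam("output.bam", asType = SAMFormat.BAM, asSingleFile = true)

// Export reads as FASTQ, preserving original base qualities where recorded
alignments.saveAsFastq("reads.fastq", outputOriginalBaseQualities = true)

// Export coverage for genome browsers
alignments.toCoverage().saveAsWig("coverage.wig")
```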

Advanced Transformations

Complex genomic analysis operations for specialized workflows.

/**
 * Generic transformation preserving genomic metadata
 * @param tFn - Function to transform the underlying RDD
 * @return Same GenomicRDD type with transformed data
 */
def transform(tFn: RDD[T] => RDD[T]): U

/**
 * Transform to different GenomicRDD type
 * @param tFn - Function to transform RDD to different type
 * @return New GenomicRDD type with appropriate metadata
 */
def transmute[X, Y <: GenomicRDD[X, Y]](tFn: RDD[T] => RDD[X]): Y

/**
 * Union multiple GenomicRDDs of the same type
 * @param rdds - Variable number of GenomicRDDs to union
 * @return Combined GenomicRDD containing all records
 */
def union(rdds: U*): U

Usage Examples:

// Complex transformation pipeline
val alignments = sc.loadBam("input.bam")

val processed = alignments
  .transform(_.filter(_.getReadMapped))           // Filter mapped reads
  .markDuplicates()                               // Mark duplicates
  .transform(_.filter(!_.getDuplicateRead))       // Remove duplicates
  .sortReadsByReferencePosition()                 // Sort by position

// Convert to different format
val coverage = processed.toCoverage()

// Union multiple samples
val sample1 = sc.loadBam("sample1.bam")
val sample2 = sc.loadBam("sample2.bam") 
val sample3 = sc.loadBam("sample3.bam")

val combined = sample1.union(sample2, sample3)

Quality Control Operations

Operations for assessing and improving data quality.

/**
 * Calculate alignment statistics for quality assessment
 * @return RDD of alignment statistics by reference contig
 */
def alignmentStatistics(): RDD[(String, AlignmentStatistics)]

/**
 * Filter reads based on quality metrics
 * @param minMappingQuality - Minimum mapping quality score
 * @param requireProperPair - Whether to require properly paired reads
 * @return AlignmentRecordRDD with filtered reads
 */
def filterByQuality(minMappingQuality: Int, requireProperPair: Boolean = false): AlignmentRecordRDD

/**
 * Calculate insert size distribution for paired-end reads
 * @return RDD of insert sizes with their frequencies
 */
def insertSizeDistribution(): RDD[(Int, Long)]

Usage Examples:

// Quality control workflow
val alignments = sc.loadBam("sample.bam")

// Filter by mapping quality
val highQuality = alignments.filterByQuality(minMappingQuality = 30)

// Calculate statistics
val stats = alignments.alignmentStatistics()
stats.collect().foreach { case (contig, stat) =>
  println(s"$contig: ${stat.mappedReads} mapped reads")
}

// Analyze insert sizes; RDD has no median(), so collect the
// (size, frequency) histogram and compute summary statistics locally
val insertSizes = alignments.insertSizeDistribution()
val histogram = insertSizes.sortByKey().collect()

Transformation Performance Tips

  1. Chain operations before triggering actions to minimize data shuffling
  2. Use broadcast joins for small reference datasets
  3. Persist intermediate results that will be reused multiple times
  4. Apply filters early in transformation chains to reduce data volume
  5. Consider partitioning by genomic region for region-based analyses
  6. Use appropriate storage levels when persisting large datasets
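
The tips above can be sketched as a single pipeline; the storage level and paths are illustrative choices, not requirements:

```scala
import org.apache.spark.storage.StorageLevel
import org.bdgenomics.adam.rdd.ADAMContext._

val alignments = sc.loadBam("cohort.bam")

// Tip 4: filter early so later shuffles move less data
val mapped = alignments.transform(_.filter(_.getReadMapped))

// Tip 1: chain transformations before any action triggers execution
val deduped = mapped.markDuplicates()

// Tips 3 and 6: persist a reused intermediate with an explicit storage level
deduped.rdd.persist(StorageLevel.MEMORY_AND_DISK)

// Two downstream analyses share the persisted result
val coverage = deduped.toCoverage()
val kmers = deduped.countKmers(21)

deduped.rdd.unpersist()
```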

Error Handling in Transformations

All transformations respect the validation stringency settings and handle malformed data according to the specified policy:

  • STRICT: Fail immediately on malformed records
  • LENIENT: Log warnings and skip malformed records
  • SILENT: Skip malformed records without warnings

Use appropriate stringency levels based on data quality and processing requirements.
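
For illustration, stringency is passed wherever a signature accepts it (as in `saveAsVcf` above); `ValidationStringency` comes from htsjdk, and the paths are placeholders:

```scala
import htsjdk.samtools.ValidationStringency
import org.bdgenomics.adam.rdd.ADAMContext._

val variantContexts = sc.loadVcf("noisy_input.vcf")

// LENIENT: log a warning and skip malformed records instead of failing the job
variantContexts.saveAsVcf("cleaned.vcf", stringency = ValidationStringency.LENIENT)
```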

Install with Tessl CLI

npx tessl i tessl/maven-org-bdgenomics-adam--adam-core
