tessl/maven-org-bdgenomics-adam--adam-cli-spark2-2-10

Command line interface for ADAM, a library and command line tool that enables the use of Apache Spark to parallelize genomic data analysis across cluster/cloud computing environments

ADAM CLI

Name: tessl/maven-org-bdgenomics-adam--adam-cli-spark2-2-10
Author: tessl

ADAM CLI is a command-line interface for genomic data analysis using Apache Spark. It provides distributed processing capabilities for various genomic file formats including SAM/BAM/CRAM, BED/GFF3/GTF, VCF, and FASTA/FASTQ, with optimized Parquet columnar storage for improved performance and scalability.

Package Information

Package Name: adam-cli-spark2_2.10
Package Type: maven
Language: Scala (with Java support)
Group ID: org.bdgenomics.adam
Version: 0.23.0
Installation:
```
<dependency>
  <groupId>org.bdgenomics.adam</groupId>
  <artifactId>adam-cli-spark2_2.10</artifactId>
  <version>0.23.0</version>
</dependency>
```
Alternatively, download a precompiled distribution from the project's GitHub releases page.

Core Usage

ADAM CLI is executed through the adam-submit script, which wraps Spark submission:

# Basic command structure
adam-submit [<spark-args> --] <command> [<command-args>]

# Example: Transform BAM to ADAM format
adam-submit transformAlignments input.bam output.adam

# Example with Spark arguments
adam-submit --master local[4] --driver-memory 8g -- transformAlignments input.bam output.adam
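
The `--` separator in the command structure above decides which arguments go to Spark and which go to ADAM. The following Scala sketch models that split for illustration only. `SubmitArgs` and `splitArgs` are hypothetical names, not part of ADAM, and treating every argument as an ADAM argument when no `--` is present is an assumption based on the first example above.

```scala
// Illustrative model of how an adam-submit command line divides at "--".
// SubmitArgs and splitArgs are hypothetical names, not ADAM APIs.
object SubmitArgs {
  /** Returns (sparkArgs, adamArgs). Without "--", all arguments go to ADAM. */
  def splitArgs(args: Seq[String]): (Seq[String], Seq[String]) = {
    val idx = args.indexOf("--")
    if (idx < 0) (Seq.empty, args)
    else (args.take(idx), args.drop(idx + 1))
  }

  def main(argv: Array[String]): Unit = {
    val (spark, adam) = splitArgs(Seq(
      "--master", "local[4]", "--driver-memory", "8g", "--",
      "transformAlignments", "input.bam", "output.adam"))
    println(s"Spark arguments: ${spark.mkString(" ")}")
    println(s"ADAM command:    ${adam.mkString(" ")}")
  }
}
```

In this model, the Spark options precede the separator and the ADAM command name is always the first token after it.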

Architecture

ADAM CLI is organized around several key architectural components:

Command System: Modular command structure with 15 commands organized into three functional groups (genomic data processing, format conversion, and data inspection)
Spark Integration: Built-in Apache Spark integration for distributed processing across clusters
Format Support: Reads and writes the major genomic file formats, with automatic detection of the input format
Parquet Optimization: Columnar storage format for improved query performance and compression
Streaming Processing: Ability to process large datasets that exceed single-node memory capacity

Main Entry Point

object ADAMMain {
  def main(args: Array[String]): Unit
  val defaultCommandGroups: List[CommandGroup]
}

class ADAMMain @Inject() (commandGroups: List[CommandGroup]) extends Logging {
  def apply(args: Array[String]): Unit
}

case class CommandGroup(name: String, commands: List[BDGCommandCompanion])
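
As a rough illustration of how an entry point could resolve a command name against its command groups, here is a dependency-free sketch. `Companion` is a simplified stand-in for `BDGCommandCompanion`, the group names are placeholders taken from this document's section titles, and the lookup and usage logic are assumptions rather than ADAM's actual implementation.

```scala
// Dependency-free stand-ins mirroring the shapes listed above.
// Companion replaces BDGCommandCompanion; group names and lookup logic are assumptions.
object DispatchSketch {
  final case class Companion(commandName: String, commandDescription: String)
  final case class CommandGroup(name: String, commands: List[Companion])

  // A small subset of the commands described in this document.
  val groups: List[CommandGroup] = List(
    CommandGroup("Genomic Data Processing", List(
      Companion("transformAlignments", "Convert SAM/BAM to ADAM format and optionally perform read pre-processing transformations"),
      Companion("countKmers", "Counts the k-mers/q-mers from a read dataset."))),
    CommandGroup("Data Inspection and Analysis", List(
      Companion("flagstat", "Print statistics on reads in an ADAM file (similar to samtools flagstat)"))))

  /** Looks up a command by name across every group. */
  def find(name: String): Option[Companion] =
    groups.flatMap(_.commands).find(_.commandName == name)

  /** Renders a usage listing, one section per group. */
  def usage: String =
    groups.map { g =>
      val lines = g.commands.map(c => f"  ${c.commandName}%-22s ${c.commandDescription}")
      (g.name +: lines).mkString("\n")
    }.mkString("\n\n")

  def main(args: Array[String]): Unit =
    args.headOption.flatMap(find) match {
      case Some(cmd) => println(s"Would run '${cmd.commandName}' with arguments: ${args.drop(1).mkString(" ")}")
      case None      => println(usage)
    }
}
```

The sketch prints the usage listing when the first argument is missing or does not name a known command.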

Capabilities

Genomic Data Processing

Core genomic data analysis operations including k-mer counting, coverage analysis, alignment transformations, and multi-format data processing.

// K-mer analysis
object CountReadKmers extends BDGCommandCompanion {
  val commandName = "countKmers"
  val commandDescription = "Counts the k-mers/q-mers from a read dataset."
}
object CountContigKmers extends BDGCommandCompanion {
  val commandName = "countContigKmers"
  val commandDescription = "Counts the k-mers/q-mers from a read dataset."
}

// Coverage analysis  
object Reads2Coverage extends BDGCommandCompanion {
  val commandName = "reads2coverage"
  val commandDescription = "Calculate the coverage from a given ADAM file"
}

// Data transformations
object TransformAlignments extends BDGCommandCompanion {
  val commandName = "transformAlignments"
  val commandDescription = "Convert SAM/BAM to ADAM format and optionally perform read pre-processing transformations"
}
object TransformFeatures extends BDGCommandCompanion {
  val commandName = "transformFeatures"
  val commandDescription = "Convert a file with sequence features into corresponding ADAM format and vice versa"
}
object TransformGenotypes extends BDGCommandCompanion {
  val commandName = "transformGenotypes"
  val commandDescription = "Convert a file with genotypes into corresponding ADAM format and vice versa"
}
object TransformVariants extends BDGCommandCompanion {
  val commandName = "transformVariants"
  val commandDescription = "Convert a file with variants into corresponding ADAM format and vice versa"
}
object TransformFragments extends BDGCommandCompanion {
  val commandName = "transformFragments"
  val commandDescription = "Convert alignment records into fragment records"
}

// Utilities
object MergeShards extends BDGCommandCompanion {
  val commandName = "mergeShards"
  val commandDescription = "Merges the shards of a file"
}

Format Conversion

Comprehensive format conversion utilities for transforming between various genomic file formats and ADAM's optimized Parquet format.

// FASTA conversions
object Fasta2ADAM extends BDGCommandCompanion {
  val commandName = "fasta2adam"
  val commandDescription = "Converts a text FASTA sequence file into an ADAMNucleotideContig Parquet file which represents assembled sequences"
}
object ADAM2Fasta extends BDGCommandCompanion {
  val commandName = "adam2fasta"
  val commandDescription = "Convert ADAM nucleotide contig fragments to FASTA files"
}

// FASTQ conversions
object ADAM2Fastq extends BDGCommandCompanion {
  val commandName = "adam2fastq"
  val commandDescription = "Convert BAM to FASTQ files"
}

Data Inspection and Analysis

Tools for viewing, analyzing, and generating statistics from genomic datasets, providing samtools-like functionality with distributed processing capabilities.

// Data viewing and filtering
object View extends BDGCommandCompanion {
  val commandName = "view"
  val commandDescription = "View certain reads from an alignment-record file."
}
object PrintADAM extends BDGCommandCompanion {
  val commandName = "print" 
  val commandDescription = "Print an ADAM formatted file"
}

// Statistics and analysis
object FlagStat extends BDGCommandCompanion {
  val commandName = "flagstat"
  val commandDescription = "Print statistics on reads in an ADAM file (similar to samtools flagstat)"
}

Common Types and Patterns

Command Pattern

All ADAM CLI commands follow a consistent architectural pattern:

// Command companion object
trait BDGCommandCompanion {
  val commandName: String
  val commandDescription: String
  def apply(cmdLine: Array[String]): BDGCommand
}

// Command arguments base class
class Args4jBase extends Logging with Serializable {
  @Args4jOption(required = false, name = "-print_metrics", usage = "Print metrics to the log on completion")
  var printMetrics = false
}

// Common argument mixins
trait ParquetArgs {
  @Args4jOption(required = false, name = "-parquet_compression", usage = "Parquet compression codec")
  var compressionCodec: String = "GZIP"
  
  @Args4jOption(required = false, name = "-parquet_block_size", usage = "Parquet block size (default: 128mb)")
  var blockSize: Int = 128 * 1024 * 1024
  
  @Args4jOption(required = false, name = "-parquet_page_size", usage = "Parquet page size (default: 1mb)")
  var pageSize: Int = 1024 * 1024
}

trait ParquetSaveArgs extends ParquetArgs {
  @Args4jOption(required = false, name = "-disable_dictionary", usage = "Disable dictionary encoding")
  var disableDictionaryEncoding = false
}

trait ADAMSaveAnyArgs {
  @Args4jOption(required = false, name = "-single", usage = "Save as single file")
  var asSingleFile = false
  
  @Args4jOption(required = false, name = "-defer", usage = "Defer merging single file")
  var deferMerging = false
  
  @Args4jOption(required = false, name = "-disable_fast_concat", usage = "Disable fast concatenation")
  var disableFastConcat = false
}

// Command execution
abstract class BDGSparkCommand[T <: Args4jBase] extends BDGCommand {
  val companion: BDGCommandCompanion
  def run(sc: SparkContext): Unit
}
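
To make the argument mixins above more concrete, the following sketch shows how the `ADAMSaveAnyArgs` flags could map onto their fields. It uses a hand-written fold instead of args4j so that it runs without dependencies. `SaveArgs` and `parse` are hypothetical names, and treating every unrecognised token as a positional argument is a simplifying assumption.

```scala
// Plain-Scala mirror of the ADAMSaveAnyArgs fields; not the args4j-based original.
object SaveArgsSketch {
  final case class SaveArgs(
    asSingleFile: Boolean = false,      // -single
    deferMerging: Boolean = false,      // -defer
    disableFastConcat: Boolean = false) // -disable_fast_concat

  /** Folds the flags over the defaults; other tokens are kept, in order, as positionals. */
  def parse(tokens: List[String]): (SaveArgs, List[String]) =
    tokens.foldLeft((SaveArgs(), List.empty[String])) {
      case ((a, pos), "-single")              => (a.copy(asSingleFile = true), pos)
      case ((a, pos), "-defer")               => (a.copy(deferMerging = true), pos)
      case ((a, pos), "-disable_fast_concat") => (a.copy(disableFastConcat = true), pos)
      case ((a, pos), other)                  => (a, pos :+ other)
    }

  def main(args: Array[String]): Unit = {
    val (saveArgs, paths) = parse(List("input.bam", "output.sam", "-single", "-defer"))
    println(saveArgs)
    println(paths)
  }
}
```

Every flag defaults to `false`, matching the field initialisers in the traits above.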

Validation Stringency

// Validation levels for input parsing
type ValidationStringency = htsjdk.samtools.ValidationStringency
// Values: STRICT, LENIENT, SILENT
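
The three levels differ in how a malformed record is handled. The sketch below models the semantics documented for htsjdk's `ValidationStringency` (STRICT fails, LENIENT warns and continues, SILENT continues without a warning) with a local stand-in type. `Stringency` and `parseAll` are hypothetical names, and skipping the offending record is an assumption about what continuing means, not a description of any specific ADAM command.

```scala
// Local stand-in for the three validation levels; not the htsjdk enum itself.
object StringencySketch {
  sealed trait Stringency
  case object Strict extends Stringency
  case object Lenient extends Stringency
  case object Silent extends Stringency

  /** Applies `parse` to each record; the stringency decides what a failure does. */
  def parseAll[A](records: Seq[String], stringency: Stringency)(parse: String => A): Seq[A] =
    records.flatMap { r =>
      try Some(parse(r))
      catch {
        case e: Exception =>
          stringency match {
            case Strict => throw e // abort on the first malformed record
            case Lenient =>
              Console.err.println(s"Skipping malformed record '$r': ${e.getMessage}")
              None
            case Silent => None // skip without reporting
          }
      }
    }

  def main(args: Array[String]): Unit = {
    val records = Seq("1", "x", "3")
    println(parseAll(records, Lenient)(_.toInt))
    println(parseAll(records, Silent)(_.toInt))
  }
}
```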

Common Arguments

Most commands support these common arguments:

Input/Output Paths: File system paths for source and destination data
Partitioning: Control over data partitioning for performance optimization
Validation: Stringency levels for input data validation
Storage: Spark storage levels for intermediate data caching
Format Options: Parquet-specific configuration options

Version Information

class About {
  def artifactId(): String
  def buildTimestamp(): String  
  def commit(): String
  def hadoopVersion(): String
  def scalaVersion(): String
  def sparkVersion(): String
  def version(): String
  def isSnapshot(): Boolean
}

Error Handling

ADAM CLI commands signal success or failure through the process exit code and report errors as follows:

Exit Code 0: Successful execution
Exit Code 1: General errors (invalid arguments, file not found, etc.)
Spark Exceptions: Distributed processing errors with full stack traces
Validation Errors: Input data validation failures with detailed reports
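
One way a driver could translate a command's outcome into the exit codes listed above is sketched here. This is an illustration under stated assumptions, not ADAM's implementation: `ExitCodeSketch` and `exitCodeFor` are hypothetical names, and mapping every exception to exit code 1 is a simplification.

```scala
// Hypothetical helper mapping success or failure onto exit codes 0 and 1.
object ExitCodeSketch {
  /** Runs `body`; returns 0 on success, or logs the error and returns 1. */
  def exitCodeFor(body: => Unit): Int =
    try {
      body
      0
    } catch {
      case e: Exception =>
        Console.err.println(s"Command failed: ${e.getClass.getSimpleName}: ${e.getMessage}")
        1
    }

  def main(args: Array[String]): Unit = {
    val ok = exitCodeFor(println("work completed"))
    val failed = exitCodeFor(throw new java.io.FileNotFoundException("input.bam"))
    println(s"ok=$ok failed=$failed")
  }
}
```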

Workspace: tessl
Visibility: Public
Created: 3 months ago
Last updated: 2 months ago
Describes: pkg:maven/org.bdgenomics.adam/adam-cli-spark2_2.10@0.23.x
Publish Source: CLI