Format Conversion

This document covers the ADAM CLI commands that convert between standard genomic file formats (FASTA, FASTQ, SAM/BAM/CRAM) and ADAM's Parquet-based storage format.

FASTA Conversions

FASTA to ADAM

Convert FASTA sequence files to ADAM's Parquet-based nucleotide contig format for improved performance and integration with Spark-based analysis pipelines.

object Fasta2ADAM extends BDGCommandCompanion {
  val commandName = "fasta2adam"
  val commandDescription = "Converts a text FASTA sequence file into an ADAMNucleotideContig Parquet file which represents assembled sequences."
  def apply(cmdLine: Array[String]): Fasta2ADAM
}

class Fasta2ADAMArgs extends Args4jBase with ParquetSaveArgs {
  var fastaFile: String           // Input FASTA file path
  var outputPath: String          // Output ADAM file path
  var verbose: Boolean            // Enhanced debugging information
  var reads: String               // Contig ID mapping for read compatibility  
  var maximumLength: Long         // Maximum fragment length (default: 10,000)
  var partitions: Int             // Number of output partitions
}

Key Features:

  • Sequence Indexing: Automatically creates sequence dictionaries for downstream tools
  • Fragment Control: Splits large sequences into manageable fragments
  • ID Mapping: Maps contig IDs to match existing read datasets
  • Partitioning: Sets the number of output partitions, which determines how many tasks write the output in parallel
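
To make the fragment control feature concrete, the sketch below estimates how many fragments each contig would be split into for a given maximum fragment length. It assumes fragments are cut at fixed intervals, so a contig of length N yields ceil(N / maximum length) fragments. The sample FASTA and the helper name `fragment_counts` are invented for illustration and are not part of ADAM.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<contig> <length> <fragments>" for each FASTA record.
# Assumes fixed-interval splitting, i.e. fragments = ceil(length / max_len).
fragment_counts() {
  local fasta="$1" max_len="$2"
  awk -v max="$max_len" '
    function emit() {
      if (name != "") printf "%s %d %d\n", name, len, int((len + max - 1) / max)
    }
    /^>/ { emit(); name = substr($1, 2); len = 0; next }   # header: start a new contig
    { len += length($0) }                                  # sequence line: add its bases
    END { emit() }
  ' "$fasta"
}

# Tiny made-up FASTA: chrA has 25 bases, chrB has 8.
sample_fasta="$(mktemp)"
printf '>chrA test contig\nACGTACGTAC\nACGTACGTAC\nACGTA\n>chrB\nGGGGCCCC\n' > "$sample_fasta"

fragment_counts "$sample_fasta" 10
```

With a maximum length of 10, the 25-base contig is reported as three fragments and the 8-base contig as one.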

Usage Examples:

# Basic conversion
adam-submit fasta2adam reference.fasta reference.adam

# With verbose output and custom fragment length
adam-submit fasta2adam \
  --verbose \
  --fragment_length 50000 \
  --repartition 100 \
  reference.fasta reference.adam

# Map contig IDs to match read dataset  
adam-submit fasta2adam \
  --reads alignments.adam \
  --verbose \
  reference.fasta reference.adam

ADAM to FASTA

Convert ADAM nucleotide contig data back to standard FASTA format for compatibility with external tools.

object ADAM2Fasta extends BDGCommandCompanion {
  val commandName = "adam2fasta"
  val commandDescription = "Convert ADAM nucleotide contig fragments to FASTA files"
  def apply(cmdLine: Array[String]): ADAM2Fasta
}

class ADAM2FastaArgs extends Args4jBase {
  var inputPath: String           // Input ADAM contig file
  var outputPath: String          // Output FASTA file path
  var lineWidth: Int              // FASTA line width (default: 70)
  var coalesce: Int               // Number of output partitions
  var disableDictionary: Boolean  // Skip sequence dictionary output
}

Usage Examples:

# Basic conversion
adam-submit adam2fasta contigs.adam output.fasta  

# Custom line width and single output file
adam-submit adam2fasta \
  --lineWidth 80 \
  --coalesce 1 \
  contigs.adam reference.fasta
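
The line width setting controls where sequence lines are hard-wrapped in the FASTA output. As a rough illustration outside ADAM, the sketch below wraps a sequence with the standard `fold` utility and then reports the longest sequence line, which is one way to confirm that a file respects a chosen width. The file contents and the `max_line_width` helper are invented for the example.

```shell
#!/usr/bin/env bash
# Hypothetical check: report the longest sequence line in a FASTA file.
max_line_width() {
  awk '!/^>/ { if (length($0) > max) max = length($0) } END { print max + 0 }' "$1"
}

# Build a made-up record: a 200-base sequence wrapped at 80 columns.
wrapped_fasta="$(mktemp)"
{
  echo '>contig1'
  printf 'ACGT%.0s' $(seq 1 50) | fold -w 80
  echo
} > "$wrapped_fasta"

max_line_width "$wrapped_fasta"   # longest sequence line, in bases
```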

FASTQ Conversions

ADAM to FASTQ

Convert ADAM alignment or fragment data to FASTQ format for compatibility with external alignment tools and quality control applications.

object ADAM2Fastq extends BDGCommandCompanion {
  val commandName = "adam2fastq"
  val commandDescription = "Convert ADAM read data to FASTQ files"
  def apply(cmdLine: Array[String]): ADAM2Fastq
}

class ADAM2FastqArgs extends Args4jBase {
  var inputPath: String                      // Input ADAM file
  var outputPath: String                     // Primary FASTQ output  
  var outputPath2: String                    // Secondary FASTQ for paired reads
  var validationStringency: ValidationStringency  // Input validation level
  var repartition: Int                       // Output partitioning
  var persistLevel: String                   // Spark persistence level
  var disableProjection: Boolean             // Disable column projection
  var outputOriginalBaseQualities: Boolean   // Use original quality scores
}

Key Features:

  • Paired-End Support: Writes the two reads of each pair to separate files when a second output path is given
  • Quality Score Options: Choose between recalibrated and original quality scores
  • Validation Control: Configurable stringency for malformed read handling
  • Memory Management: Configurable persistence levels for large datasets

Usage Examples:

# Single-end reads
adam-submit adam2fastq reads.adam output.fastq

# Paired-end reads with separate output files
adam-submit adam2fastq \
  reads.adam \
  output_R1.fastq \
  output_R2.fastq

# Use original base qualities with lenient validation
adam-submit adam2fastq \
  --outputOriginalBaseQualities \
  --validationStringency LENIENT \
  reads.adam output.fastq

# Large dataset: serialized persistence that can spill to disk, with 200 partitions
adam-submit adam2fastq \
  --persistLevel MEMORY_AND_DISK_SER \
  --repartition 200 \
  large_dataset.adam output.fastq
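
After a paired-end export it is worth confirming that both output files hold the same number of records. The sketch below does this with plain shell tools, assuming each output is a single uncompressed file with four lines per record. The sample files and the `fastq_records` helper are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count records in an uncompressed, 4-line-per-record FASTQ file.
fastq_records() {
  awk 'END { print int(NR / 4) }' "$1"
}

# Made-up paired output: two records in each file.
work_dir="$(mktemp -d)"
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nTTGA\n+\nIIII\n' > "$work_dir/output_R1.fastq"
printf '@r1/2\nCCGT\n+\nIIII\n@r2/2\nGGTA\n+\nIIII\n' > "$work_dir/output_R2.fastq"

r1_count="$(fastq_records "$work_dir/output_R1.fastq")"
r2_count="$(fastq_records "$work_dir/output_R2.fastq")"

if [ "$r1_count" -eq "$r2_count" ]; then
  echo "paired files agree: $r1_count records each"
else
  echo "record count mismatch: R1=$r1_count R2=$r2_count" >&2
fi
```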

Multi-Format Fragment Processing

Transform Fragments

Convert various genomic formats (SAM/BAM/CRAM) to ADAM fragment format, which maintains paired-end relationships and insert size information.

object TransformFragments extends BDGCommandCompanion {
  val commandName = "transformFragments"
  val commandDescription = "Convert SAM/BAM/CRAM to ADAM fragments"
  def apply(cmdLine: Array[String]): TransformFragments
}

class TransformFragmentsArgs extends Args4jBase with ADAMSaveAnyArgs with ParquetArgs {
  var inputPath: String           // Input alignment file
  var outputPath: String          // Output fragment file
  var coalesce: Int               // Output partition count
  var forceShuffle: Boolean       // Force data shuffling
  var storageLevel: String        // Spark storage level
}

Fragment Benefits:

  • Insert Size Analysis: Maintains paired-end insert size distributions
  • Quality Metrics: Preserves alignment quality and mapping information
  • Storage Efficiency: Keeps the reads of a fragment together in a single record
  • Downstream Compatibility: Works with ADAM's fragment-based analysis tools

Usage Example:

# Convert BAM to fragments, coalescing to 50 partitions and allowing spill to disk
adam-submit transformFragments \
  --coalesce 50 \
  --storageLevel MEMORY_AND_DISK \
  paired_reads.bam fragments.adam

Format Support Matrix

| Input Format | Output Format | Command | Key Features |
|---|---|---|---|
| FASTA | ADAM Contigs | fasta2adam | Sequence indexing, fragmentation |
| ADAM Contigs | FASTA | adam2fasta | Dictionary generation, line formatting |
| ADAM Reads/Alignments | FASTQ | adam2fastq | Paired-end separation, quality options |
| SAM/BAM/CRAM | ADAM Fragments | transformFragments | Insert size preservation, pairing |

Performance Optimization

Memory Management

# For large datasets, use serialized persistence that spills to disk when memory runs out
--persistLevel MEMORY_AND_DISK_SER

# Control memory usage with partitioning
--repartition 100  # Increase for large files
--coalesce 10      # Decrease for small files
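
One way to pick a partition count is to derive it from the input size. The sketch below is a rule of thumb only: the 128 MiB target per partition and the `suggest_partitions` helper are assumptions made for this example, not values taken from ADAM or Spark defaults.

```shell
#!/usr/bin/env bash
# Hypothetical heuristic: one partition per target_mib of input, never fewer than 1.
# The 128 MiB default is an assumed starting point, not an ADAM setting.
suggest_partitions() {
  local size_bytes="$1" target_mib="${2:-128}"
  local target_bytes=$(( target_mib * 1024 * 1024 ))
  local n=$(( (size_bytes + target_bytes - 1) / target_bytes ))   # round up
  [ "$n" -lt 1 ] && n=1
  echo "$n"
}

# Example: a 10 GiB input at the assumed 128 MiB per partition.
suggest_partitions $(( 10 * 1024 * 1024 * 1024 ))
```

The resulting number could then be supplied to the partitioning option shown above and adjusted after observing task sizes in practice.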

I/O Optimization

# Force data shuffling for balanced partitions
--forceShuffle

# Disable column projection for full schema access
--disableProjection

Validation Control

// Validation stringency levels
ValidationStringency.STRICT   // Fail on any malformed data
ValidationStringency.LENIENT  // Warn on malformed data  
ValidationStringency.SILENT   // Ignore malformed data
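
The practical difference between the three levels can be sketched with a small stand-alone checker. The example below is not ADAM's validation code: it only flags FASTQ records whose sequence and quality strings differ in length, and the `check_fastq` helper and sample file are invented to show how fail, warn, and ignore behave.

```shell
#!/usr/bin/env bash
# Hypothetical checker: flag FASTQ records whose sequence and quality lengths differ.
# It imitates the three stringency levels; it is not ADAM's implementation.
check_fastq() {
  local stringency="$1" fastq="$2"
  awk -v mode="$stringency" '
    NR % 4 == 1 { name = $0 }
    NR % 4 == 2 { seq_len = length($0) }
    NR % 4 == 0 && length($0) != seq_len {
      bad++
      if (mode == "STRICT")  { print "error: malformed record " name > "/dev/stderr"; exit 1 }
      if (mode == "LENIENT") { print "warning: malformed record " name > "/dev/stderr" }
      # SILENT: say nothing and keep going
    }
    END { if (mode != "STRICT" || bad == 0) printf "%d malformed\n", bad }
  ' "$fastq"
}

# Made-up input: one good record and one with a truncated quality string.
bad_fastq="$(mktemp)"
printf '@ok\nACGT\n+\nIIII\n@short_qual\nACGT\n+\nII\n' > "$bad_fastq"

check_fastq SILENT  "$bad_fastq"    # counts the bad record quietly
check_fastq LENIENT "$bad_fastq"    # same count, plus a warning on stderr
check_fastq STRICT  "$bad_fastq" || echo "STRICT run failed as expected"
```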

Integration with External Tools

Sequence Dictionaries

FASTA conversions automatically generate sequence dictionaries compatible with:

  • SAMtools: For reference-based operations
  • GATK: For variant calling pipelines
  • Picard: For data validation and metrics
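
At its core, a sequence dictionary records each contig's name and length. The sketch below derives that information from a FASTA file and prints it as SAM-style `@SQ` header lines, the form these tools read. It is an independent illustration with an invented helper name (`fasta_to_sq_lines`) and sample data, not the code ADAM uses.

```shell
#!/usr/bin/env bash
# Hypothetical helper: emit SAM-style @SQ header lines (name and length) for a FASTA file.
fasta_to_sq_lines() {
  awk '
    function emit() { if (name != "") printf "@SQ\tSN:%s\tLN:%d\n", name, len }
    /^>/ { emit(); name = substr($1, 2); len = 0; next }   # header: start a new contig
    { len += length($0) }                                  # sequence line: add its bases
    END { emit() }
  ' "$1"
}

# Made-up reference with two short contigs.
dict_fasta="$(mktemp)"
printf '>chr1\nACGTACGT\nACGT\n>chr2\nGGCC\n' > "$dict_fasta"

fasta_to_sq_lines "$dict_fasta"
```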

Quality Score Handling

FASTQ conversions support both:

  • Original Quality Scores: As recorded in source files
  • Recalibrated Scores: From ADAM quality score recalibration

File Format Compatibility

All conversions maintain compatibility with standard genomics file format specifications:

  • FASTA: NCBI/EMBL standard format
  • FASTQ: Illumina 1.8+ Phred+33 encoding
  • SAM/BAM: SAM/BAM format specification (hts-specs), as implemented by HTSlib
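
Phred+33 encoding stores each quality score as the ASCII character whose code is the score plus 33. The sketch below decodes a quality string back to numeric scores using only shell built-ins; the `phred33_scores` helper is a name made up for this example.

```shell
#!/usr/bin/env bash
# Decode a Phred+33 quality string into numeric scores (ASCII code minus 33).
phred33_scores() {
  local quals="$1" out="" ch code
  while [ -n "$quals" ]; do
    ch="${quals%"${quals#?}"}"      # first character
    quals="${quals#?}"              # rest of the string
    code="$(printf '%d' "'$ch")"    # ASCII code of the character
    out="$out$(( code - 33 )) "
  done
  echo "${out% }"
}

phred33_scores 'I5!#'   # prints: 40 20 0 2
```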