Data Inspection and Analysis

This document covers ADAM CLI's data inspection capabilities for viewing, filtering, and analyzing genomic datasets. These tools provide samtools-like functionality with distributed processing capabilities.

Data Viewing and Filtering

View Command

The View command provides samtools view-like functionality for filtering and examining genomic alignment data with support for flag-based filtering and format conversion.

object View extends BDGCommandCompanion {
  val commandName = "view"
  val commandDescription = "View certain reads from an alignment-record file."
  def apply(cmdLine: Array[String]): View
}

class ViewArgs extends Args4jBase with ParquetArgs with ADAMSaveAnyArgs {
  var inputPath: String                // Input alignment file
  var outputPath: String               // Output file (optional)
  var outputPathArg: String            // Alternative output specification
  
  // Flag-based filtering (samtools-compatible)
  var matchAllBits: Int                // Keep reads with all of these bits set (-f)
  var mismatchAllBits: Int             // Keep reads with none of these bits set (-F)
  var matchSomeBits: Int               // Keep reads with any of these bits set (-g)
  var mismatchSomeBits: Int            // Keep reads with at least one of these bits unset (-G)
  
  // Output options
  var printCount: Boolean              // Print count only (-c)
}

Flag Filtering Examples:

# View only mapped reads (exclude unmapped, flag 4)
adam-submit view -F 4 alignments.adam

# View only proper pairs (flag 2) that are mapped (exclude flag 4)
adam-submit view -f 2 -F 4 alignments.adam mapped_pairs.adam

# Count unmapped reads
adam-submit view -f 4 -c alignments.adam

# View first read in pair (flag 64), exclude secondary alignments (flag 256)
adam-submit view -f 64 -F 256 alignments.adam first_reads.adam
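The flag arithmetic behind `-f` and `-F` can be checked locally before submitting a Spark job. The sketch below uses a hypothetical helper, `keep_read`, which is not part of ADAM. It assumes samtools-style semantics: a read is kept when every `-f` bit is set and no `-F` bit is set. ADAM's own implementation may differ in detail.

```shell
# Hypothetical helper (not part of ADAM): decide whether a read with a given
# SAM flag would pass a filter, assuming samtools-style semantics.
#   $1 = the read's flag, $2 = required bits (-f), $3 = excluded bits (-F)
keep_read() {
  # Every required bit must be set and no excluded bit may be set.
  if [ $(( $1 & $2 )) -eq "$2" ] && [ $(( $1 & $3 )) -eq 0 ]; then
    echo keep
  else
    echo drop
  fi
}

keep_read 99 2 4    # 99 = 1+2+32+64: proper pair, mapped    -> prints "keep"
keep_read 77 2 4    # 77 = 1+4+8+64: unmapped, no proper pair -> prints "drop"
```

Because the check is plain integer arithmetic, it is a cheap way to confirm a mask before running the filter on a large dataset.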

Common SAM Flags:

  • 1: Read is paired
  • 2: Read is in proper pair
  • 4: Read is unmapped
  • 8: Mate is unmapped
  • 16: Read is on reverse strand
  • 64: First read in pair
  • 128: Second read in pair
  • 256: Secondary alignment
  • 512: Read fails quality checks
  • 1024: PCR/optical duplicate
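Combined flag values are bitwise ORs of the entries above, so a mask is easy to get wrong by one bit. As a minimal sketch, the hypothetical `decode_flag` helper below (not part of ADAM) lists which of the flags in this table are set in a given value. The short labels are illustrative names chosen for this example.

```shell
# Hypothetical helper (not part of ADAM): list the bits from the table above
# that are set in a SAM flag value. Bits outside the table are ignored.
decode_flag() {
  flag=$1
  for entry in 1:paired 2:proper_pair 4:unmapped 8:mate_unmapped \
               16:reverse_strand 64:first_in_pair 128:second_in_pair \
               256:secondary 512:qc_fail 1024:duplicate; do
    bit=${entry%%:*}    # numeric bit value before the colon
    name=${entry#*:}    # label after the colon
    if [ $(( flag & bit )) -ne 0 ]; then
      echo "$bit $name"
    fi
  done
}

decode_flag 1028        # prints "4 unmapped" then "1024 duplicate"
echo $(( 256 | 1024 ))  # mask for secondary + duplicate: prints 1280
```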

Print ADAM Data

Display the contents of ADAM files in human-readable format for data inspection and debugging.

object PrintADAM extends BDGCommandCompanion {
  val commandName = "printAdam"
  val commandDescription = "Print the contents of an ADAM file"
  def apply(cmdLine: Array[String]): PrintADAM
}

class PrintADAMArgs extends Args4jBase with ParquetArgs {
  var inputPath: String               // Input ADAM file to print
  var outputPath: String              // Optional output file
  var pretty: Boolean                 // Pretty-print JSON output
  var records: Int                    // Number of records to print
}

Usage Examples:

# Print first 10 records to console
adam-submit printAdam --records 10 data.adam

# Pretty-print all records to file  
adam-submit printAdam --pretty data.adam output.txt

# Inspect data structure
adam-submit printAdam --records 1 --pretty alignments.adam

Statistical Analysis

FlagStat

Generate comprehensive alignment statistics similar to samtools flagstat, providing essential quality control metrics for sequencing data.

object FlagStat extends BDGCommandCompanion {
  val commandName = "flagstat"
  val commandDescription = "Print statistics about reads in an alignment file"
  def apply(cmdLine: Array[String]): FlagStat
}

class FlagStatArgs extends Args4jBase {
  var inputPath: String               // Input alignment file
  var outputPath: String              // Optional output file for statistics  
  var stringency: String              // Validation stringency
}

Statistics Generated:

  • Total reads processed
  • Mapped reads and mapping percentage
  • Properly paired reads for paired-end data
  • Singleton reads (mate unmapped)
  • Read duplicates (PCR/optical)
  • Secondary and supplementary alignments
  • Quality control failures

Usage Examples:

# Basic flagstat to console
adam-submit flagstat alignments.adam

# Save statistics to file
adam-submit flagstat alignments.adam stats.txt

# Use lenient validation for problematic files
adam-submit flagstat --stringency LENIENT alignments.adam

Sample Output:

71723 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary  
0 + 0 duplicates
69543 + 0 mapped (97.0% : N/A)
71723 + 0 paired in sequencing
35861 + 0 read1
35862 + 0 read2
67432 + 0 properly paired (94.0% : N/A)
69543 + 0 with itself and mate mapped
0 + 0 singletons (0.0% : N/A)
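
The percentages in this report are rounded to one decimal place. If more precision is needed, they can be recomputed from the raw counts. The sketch below assumes the line layout shown in the sample output; `mapped_pct` is a hypothetical helper, not an ADAM command.

```shell
# Hypothetical helper: recompute the mapped percentage from flagstat-style
# text on stdin, assuming the line layout of the sample output.
mapped_pct() {
  awk '
    / in total /  { total = $1 }     # QC-passed read count
    / mapped \(/  { mapped = $1 }    # the "N + M mapped (...)" line only
    END {
      if (total > 0) printf "%.2f\n", 100 * mapped / total
      else print "N/A"
    }
  '
}

mapped_pct <<'EOF'
71723 + 0 in total (QC-passed reads + QC-failed reads)
69543 + 0 mapped (97.0% : N/A)
69543 + 0 with itself and mate mapped
EOF
# prints 96.96
```

The second pattern requires an opening parenthesis after "mapped", so the "with itself and mate mapped" line is not picked up by mistake.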

Quality Control and Validation

Validation Stringency Control

The inspection commands take a validation stringency setting that controls how malformed records are handled:

// Validation levels
ValidationStringency.STRICT   // Fail on any validation errors
ValidationStringency.LENIENT  // Issue warnings for validation errors
ValidationStringency.SILENT   // Ignore validation errors

Usage in Commands:

# Strict validation (default)
adam-submit view --stringency STRICT alignments.adam

# Lenient validation for legacy data
adam-submit flagstat --stringency LENIENT old_alignments.adam

# Silent validation for known problematic files
adam-submit printAdam --stringency SILENT problematic.adam

Performance Considerations

Large Dataset Handling

For very large datasets, consider these optimization strategies:

# Count matching reads without writing an output file
adam-submit view -c alignments.adam

# Limit the number of records printed for a quick look
adam-submit printAdam --records 1000 large_file.adam

# Use appropriate Spark resources
adam-submit --driver-memory 8g --executor-memory 4g -- \
  flagstat huge_alignment.adam

Memory Management

# For memory-intensive operations (Spark options go before the "--" separator)
adam-submit --conf spark.sql.adaptive.enabled=true \
  --conf spark.sql.adaptive.coalescePartitions.enabled=true -- \
  view -f 2 large_alignments.adam filtered.adam

Integration with Analysis Pipelines

Filtering for Downstream Analysis

The View command is commonly used to prepare data subsets:

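# Extract properly paired, mapped, non-duplicate reads for variant calling
#   -f 3     paired (1) and in a proper pair (2)
#   -F 1028  exclude unmapped (4) and duplicate (1024) reads
#   -q 20    minimum mapping quality (not among the ViewArgs fields listed
#            above; confirm your ADAM version supports it)
adam-submit view \
  -f 3 \
  -F 1028 \
  -q 20 \
  input.adam high_quality.adam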

# Extract unmapped reads for assembly
adam-submit view -f 4 input.adam unmapped.adam

# Extract reads from specific chromosome
adam-submit view \
  --regionPredicate "referenceName=chr22" \
  input.adam chr22.adam

Quality Control Workflows

Combine tools for comprehensive QC:

# 1. Get overall statistics
adam-submit flagstat input.adam > qc_stats.txt

# 2. Inspect problematic reads
adam-submit view -f 512 input.adam failed_qc.adam

# 3. Check duplicate rates
adam-submit view -f 1024 -c input.adam
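
A duplicate rate is the duplicate count from step 3 divided by the total read count from step 1. As a minimal sketch, the hypothetical `pct` helper below turns two counts into a percentage. How the counts are captured from the commands is left out, since the exact output format of `view -c` is not shown in this document.

```shell
# Hypothetical helper: percentage of reads in a subset, given two counts
# (for example, a duplicate count and a total read count).
#   $1 = subset count, $2 = total count
pct() {
  awk -v part="$1" -v total="$2" 'BEGIN {
    if (total > 0) printf "%.2f\n", 100 * part / total
    else print "N/A"
  }'
}

pct 1 4    # prints 25.00
pct 5 0    # prints N/A (avoids division by zero)
```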

Data Validation Pipelines

# Validate file integrity
adam-submit printAdam --records 1 --stringency STRICT data.adam

# Generate detailed statistics
adam-submit flagstat --stringency STRICT data.adam stats.txt

# Filter and validate simultaneously  
adam-submit view -F 4 --stringency LENIENT input.adam validated.adam

Output Format Options

Supported Output Formats

The View command supports multiple output formats through the ADAMSaveAnyArgs mixin:

  • ADAM Parquet: Native format for continued ADAM processing
  • SAM/BAM: For external tool compatibility
  • JSON: For programmatic access and debugging
  • Text: Human-readable format for inspection

Format Specification

# Save as BAM for external tools
adam-submit view -f 2 input.adam -o output.bam

# Save as JSON for analysis scripts
adam-submit view --records 100 input.adam -o sample.json

# Save as text for manual inspection
adam-submit view --records 10 input.adam -o sample.txt