Tessl Tile for maven/org.bdgenomics.adam/adam-cli-spark2_2.10@0.23.0

# Format Conversion

This document covers ADAM CLI's format conversion capabilities for transforming between various genomic file formats and ADAM's optimized Parquet storage format.

## FASTA Conversions

### FASTA to ADAM

Convert FASTA sequence files to ADAM's Parquet-based nucleotide contig format for improved performance and integration with Spark-based analysis pipelines.

```scala { .api }
object Fasta2ADAM extends BDGCommandCompanion {
  val commandName = "fasta2adam"
  val commandDescription = "Converts a text FASTA sequence file into an ADAMNucleotideContig Parquet file which represents assembled sequences."
  def apply(cmdLine: Array[String]): Fasta2ADAM
}

class Fasta2ADAMArgs extends Args4jBase with ParquetSaveArgs {
  var fastaFile: String           // Input FASTA file path
  var outputPath: String          // Output ADAM file path
  var verbose: Boolean            // Enhanced debugging information
  var reads: String               // Contig ID mapping for read compatibility
  var maximumLength: Long         // Maximum fragment length (default: 10,000)
  var partitions: Int             // Number of output partitions
}
```

**Key Features:**
- **Sequence Indexing**: Automatically creates sequence dictionaries for downstream tools
- **Fragment Control**: Splits large sequences into manageable fragments
- **ID Mapping**: Maps contig IDs to match existing read datasets
- **Partitioning**: Controls output parallelization for optimal performance

**Usage Examples:**
```bash
# Basic conversion
adam-submit fasta2adam reference.fasta reference.adam

# With verbose output and custom fragment length
adam-submit fasta2adam \
  --verbose \
  --fragment_length 50000 \
  --repartition 100 \
  reference.fasta reference.adam

# Map contig IDs to match read dataset
adam-submit fasta2adam \
  --reads alignments.adam \
  --verbose \
  reference.fasta reference.adam
```
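To get a feel for what the fragment length setting means for output size, the sketch below estimates how many fragments a single contig would produce. It assumes contigs are cut into consecutive chunks of at most the configured length, with a shorter final chunk; `fragment_count` is a hypothetical helper for illustration and is not part of ADAM.

```bash
# Hypothetical helper: estimate the number of fragments for one contig,
# assuming fixed-size chunking with a shorter final fragment.
fragment_count() {
  contig_length=$1
  max_fragment_length=${2:-10000}   # default fragment length documented above
  # Integer ceiling division: (a + b - 1) / b
  echo $(( (contig_length + max_fragment_length - 1) / max_fragment_length ))
}

fragment_count 25000          # 25,000-base contig, default fragment length
fragment_count 25000 50000    # same contig, 50,000-base fragment length
```

Larger fragment lengths mean fewer, bigger records per contig, which is the trade-off the fragment length option exposes.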

### ADAM to FASTA

Convert ADAM nucleotide contig data back to standard FASTA format for compatibility with external tools.

```scala { .api }
object ADAM2Fasta extends BDGCommandCompanion {
  val commandName = "adam2fasta"
  val commandDescription = "Convert ADAM nucleotide contig fragments to FASTA files"
  def apply(cmdLine: Array[String]): ADAM2Fasta
}

class ADAM2FastaArgs extends Args4jBase {
  var inputPath: String           // Input ADAM contig file
  var outputPath: String          // Output FASTA file path
  var lineWidth: Int              // FASTA line width (default: 70)
  var coalesce: Int               // Number of output partitions
  var disableDictionary: Boolean  // Skip sequence dictionary output
}
```

**Usage Examples:**
```bash
# Basic conversion
adam-submit adam2fasta contigs.adam output.fasta

# Custom line width and single output file
adam-submit adam2fasta \
  --lineWidth 80 \
  --coalesce 1 \
  contigs.adam reference.fasta
```
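The line width setting determines where each sequence is hard-wrapped in the FASTA output. As a rough illustration of that wrapping, the sketch below uses the standard `fold` utility rather than ADAM itself; `wrap_sequence` is a hypothetical helper, and the 70-character fallback simply mirrors the default listed above.

```bash
# Hypothetical helper: hard-wrap a sequence at a given width using POSIX fold.
# This only illustrates line wrapping; it is not how ADAM writes FASTA.
wrap_sequence() {
  sequence=$1
  width=${2:-70}   # default line width documented above
  printf '%s\n' "$sequence" | fold -w "$width"
}

# A 10-base sequence wrapped at 4 bases per line gives lines of 4, 4, and 2 bases.
wrap_sequence ACGTACGTAC 4
```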

## FASTQ Conversions

### ADAM to FASTQ

Convert ADAM alignment or fragment data to FASTQ format for compatibility with external alignment tools and quality control applications.

```scala { .api }
object ADAM2Fastq extends BDGCommandCompanion {
  val commandName = "adam2fastq"
  val commandDescription = "Convert ADAM read data to FASTQ files"
  def apply(cmdLine: Array[String]): ADAM2Fastq
}

class ADAM2FastqArgs extends Args4jBase {
  var inputPath: String                      // Input ADAM file
  var outputPath: String                     // Primary FASTQ output
  var outputPath2: String                    // Secondary FASTQ for paired reads
  var validationStringency: ValidationStringency  // Input validation level
  var repartition: Int                       // Output partitioning
  var persistLevel: String                   // Spark persistence level
  var disableProjection: Boolean             // Disable column projection
  var outputOriginalBaseQualities: Boolean   // Use original quality scores
}
```

**Key Features:**
- **Paired-End Support**: Automatic separation of read pairs into separate files
- **Quality Score Options**: Choose between recalibrated and original quality scores
- **Validation Control**: Configurable stringency for malformed read handling
- **Memory Management**: Configurable persistence levels for large datasets

**Usage Examples:**
```bash
# Single-end reads
adam-submit adam2fastq reads.adam output.fastq

# Paired-end reads with separate output files
adam-submit adam2fastq \
  reads.adam \
  output_R1.fastq \
  output_R2.fastq

# Use original base qualities with lenient validation
adam-submit adam2fastq \
  --outputOriginalBaseQualities \
  --validationStringency LENIENT \
  reads.adam output.fastq

# Large dataset: serialized persistence that spills to disk, with more partitions
adam-submit adam2fastq \
  --persistLevel MEMORY_AND_DISK_SER \
  --repartition 200 \
  large_dataset.adam output.fastq
```
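After a paired-end export it can be worth confirming that both mate files hold the same number of records. The sketch below is one way to do that, assuming each output is a single uncompressed FASTQ file with the usual four lines per record; `fastq_record_count` and `paired_counts_match` are hypothetical helpers, not ADAM commands.

```bash
# Hypothetical helper: count records in an uncompressed FASTQ file,
# assuming four lines per record (no wrapped sequence lines).
fastq_record_count() {
  awk 'END { print int(NR / 4) }' "$1"
}

# Hypothetical check: succeeds only if both mate files hold the same number of records.
paired_counts_match() {
  [ "$(fastq_record_count "$1")" -eq "$(fastq_record_count "$2")" ]
}

# Example, using the file names from the paired-end command above:
# paired_counts_match output_R1.fastq output_R2.fastq && echo "mate files are in sync"
```

If the output was written as a directory of part files instead, the same count would need to be taken over the concatenated parts.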

## Multi-Format Fragment Processing

### Transform Fragments

Convert various genomic formats (SAM/BAM/CRAM) to ADAM fragment format, which maintains paired-end relationships and insert size information.

```scala { .api }
object TransformFragments extends BDGCommandCompanion {
  val commandName = "transformFragments"
  val commandDescription = "Convert SAM/BAM/CRAM to ADAM fragments"
  def apply(cmdLine: Array[String]): TransformFragments
}

class TransformFragmentsArgs extends Args4jBase with ADAMSaveAnyArgs with ParquetArgs {
  var inputPath: String           // Input alignment file
  var outputPath: String          // Output fragment file
  var coalesce: Int               // Output partition count
  var forceShuffle: Boolean       // Force data shuffling
  var storageLevel: String        // Spark storage level
}
```

**Fragment Benefits:**
- **Insert Size Analysis**: Maintains paired-end insert size distributions
- **Quality Metrics**: Preserves alignment quality and mapping information
- **Memory Efficiency**: Optimized storage for paired-end data analysis
- **Downstream Compatibility**: Works with ADAM's fragment-based analysis tools

**Usage Example:**
```bash
# Convert BAM to fragments with performance optimization
adam-submit transformFragments \
  --coalesce 50 \
  --storageLevel MEMORY_AND_DISK \
  paired_reads.bam fragments.adam
```

## Format Support Matrix

| Input Format | Output Format | Command | Key Features |
|--------------|---------------|---------|--------------|
| FASTA | ADAM Contigs | `fasta2adam` | Sequence indexing, fragmentation |
| ADAM Contigs | FASTA | `adam2fasta` | Dictionary generation, line formatting |
| ADAM Reads/Alignments | FASTQ | `adam2fastq` | Paired-end separation, quality options |
| SAM/BAM/CRAM | ADAM Fragments | `transformFragments` | Insert size preservation, pairing |

## Performance Optimization

### Memory Management
```bash
# For large datasets, use serialized persistence that spills to disk
--persistLevel MEMORY_AND_DISK_SER

# Control memory usage with partitioning
--repartition 100  # Increase for large files
--coalesce 10      # Decrease for small files
```

### I/O Optimization
```bash
# Force data shuffling for balanced partitions
--forceShuffle

# Disable column projection for full schema access
--disableProjection
```

### Validation Control
```scala { .api }
// Validation stringency levels
ValidationStringency.STRICT   // Fail on any malformed data
ValidationStringency.LENIENT  // Warn on malformed data
ValidationStringency.SILENT   // Ignore malformed data
```

## Integration with External Tools

### Sequence Dictionaries
FASTA conversions automatically generate sequence dictionaries compatible with:
- **SAMtools**: For reference-based operations
- **GATK**: For variant calling pipelines
- **Picard**: For data validation and metrics
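
At its core, a sequence dictionary lists each contig name together with its length. As a minimal sketch of that idea (not the dictionary ADAM writes), the hypothetical `fasta_contig_lengths` helper below derives name and length pairs directly from a FASTA file with `awk`, which can be handy for comparing against what a downstream tool reports.

```bash
# Hypothetical helper: print "name<TAB>length" for each contig in a FASTA file.
# The contig name is taken as the header text up to the first whitespace.
fasta_contig_lengths() {
  awk '
    /^>/ {
      if (name != "") print name "\t" len   # flush the previous contig
      name = substr($1, 2)                  # drop the leading ">"
      len = 0
      next
    }
    { len += length($0) }                   # sequence lines add to the length
    END { if (name != "") print name "\t" len }
  ' "$1"
}

# Example: fasta_contig_lengths reference.fasta
```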

### Quality Score Handling
FASTQ conversions support both:
- **Original Quality Scores**: As recorded in source files
- **Recalibrated Scores**: From ADAM quality score recalibration

### File Format Compatibility
All conversions maintain compatibility with standard genomics file format specifications:
- **FASTA**: NCBI/EMBL standard format
- **FASTQ**: Illumina 1.8+ Phred+33 encoding
- **SAM/BAM**: HTSlib specification compliance
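
In Phred+33 encoding, each quality character's ASCII code minus 33 gives the base quality score. The sketch below shows that arithmetic for a single character; `phred33_score` is a hypothetical helper for spot-checking FASTQ output, not part of ADAM.

```bash
# Hypothetical helper: convert one Phred+33 quality character to its numeric score.
phred33_score() {
  # printf with a leading quote yields the character's numeric code (POSIX behaviour)
  code=$(printf '%d' "'$1")
  echo $(( code - 33 ))
}

phred33_score 'I'   # 'I' is ASCII 73, so the score is 40
phred33_score '!'   # '!' is ASCII 33, the lowest score, 0
```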
