# Format Conversion

This document covers the ADAM CLI's format conversion commands, which transform data between standard genomic file formats and ADAM's Parquet-based storage format.

## FASTA Conversions

### FASTA to ADAM

Convert FASTA sequence files to ADAM's Parquet-based nucleotide contig format for improved performance and integration with Spark-based analysis pipelines.

```scala { .api }
object Fasta2ADAM extends BDGCommandCompanion {
  val commandName = "fasta2adam"
  val commandDescription = "Converts a text FASTA sequence file into an ADAMNucleotideContig Parquet file which represents assembled sequences."
  def apply(cmdLine: Array[String]): Fasta2ADAM
}

class Fasta2ADAMArgs extends Args4jBase with ParquetSaveArgs {
  var fastaFile: String    // Input FASTA file path
  var outputPath: String   // Output ADAM file path
  var verbose: Boolean     // Enhanced debugging information
  var reads: String        // Read dataset whose contig IDs should be matched (--reads)
  var maximumLength: Long  // Maximum fragment length (default: 10,000), set with --fragment_length
  var partitions: Int      // Number of output partitions, set with --repartition
}
```

**Key Features:**
- **Sequence Indexing**: Automatically creates sequence dictionaries for downstream tools
- **Fragment Control**: Splits long sequences into fragments no longer than the configured maximum length
- **ID Mapping**: Maps contig IDs to match those of an existing read dataset
- **Partitioning**: Sets the number of output partitions, which controls write parallelism

**Usage Examples:**
```bash
# Basic conversion
adam-submit fasta2adam reference.fasta reference.adam

# With verbose output and custom fragment length
adam-submit fasta2adam \
  --verbose \
  --fragment_length 50000 \
  --repartition 100 \
  reference.fasta reference.adam

# Map contig IDs to match read dataset
adam-submit fasta2adam \
  --reads alignments.adam \
  --verbose \
  reference.fasta reference.adam
```
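
To make the fragment control option more concrete, the following sketch splits a single sequence into consecutive fragments no longer than a given maximum. It is a plain shell and awk illustration of the idea, not ADAM's implementation, and the function name `split_fragments` is hypothetical.

```bash
# Hypothetical helper: split a sequence into consecutive fragments of at most
# max_len bases and print one "index<TAB>fragment" line per fragment.
split_fragments() {
  printf '%s\n' "$1" | awk -v max="$2" '
    max >= 1 {
      n = 0
      for (i = 1; i <= length($0); i += max)
        printf "%d\t%s\n", n++, substr($0, i, max)
    }'
}

# A 10-base sequence with a maximum fragment length of 4 yields 3 fragments.
split_fragments "ACGTACGTAC" 4
```

Under this scheme, a 25,000-base contig converted with the default maximum of 10,000 would be stored as three fragments.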

### ADAM to FASTA

Convert ADAM nucleotide contig data back to standard FASTA format for compatibility with external tools.

```scala { .api }
object ADAM2Fasta extends BDGCommandCompanion {
  val commandName = "adam2fasta"
  val commandDescription = "Convert ADAM nucleotide contig fragments to FASTA files"
  def apply(cmdLine: Array[String]): ADAM2Fasta
}

class ADAM2FastaArgs extends Args4jBase {
  var inputPath: String           // Input ADAM contig file
  var outputPath: String          // Output FASTA file path
  var lineWidth: Int              // FASTA line width (default: 70)
  var coalesce: Int               // Number of output partitions
  var disableDictionary: Boolean  // Skip sequence dictionary output
}
```

**Usage Examples:**
```bash
# Basic conversion
adam-submit adam2fasta contigs.adam output.fasta

# Custom line width and single output file
adam-submit adam2fasta \
  --lineWidth 80 \
  --coalesce 1 \
  contigs.adam reference.fasta
```
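
The effect of a line width setting can be previewed without running a conversion. The sketch below hard-wraps FASTA sequence lines at a given width and leaves header lines alone. It is an independent illustration (the function name `wrap_fasta` is hypothetical), and it wraps each input line separately, so it assumes each record's sequence arrives on a single line.

```bash
# Hypothetical helper: hard-wrap FASTA sequence lines read from stdin at
# `width` characters, leaving header lines (starting with ">") untouched.
wrap_fasta() {
  awk -v width="$1" '
    BEGIN { if (width < 1) exit 1 }
    /^>/ { print; next }
    { for (i = 1; i <= length($0); i += width) print substr($0, i, width) }'
}

# Wrap a 12-base sequence at 5 characters per line.
printf '>chr1\nACGTACGTACGT\n' | wrap_fasta 5
```

The 12-base record is emitted as lines of 5, 5, and 2 bases beneath its unchanged header.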

## FASTQ Conversions

### ADAM to FASTQ

Convert ADAM alignment or fragment data to FASTQ format for compatibility with external alignment tools and quality control applications.

```scala { .api }
object ADAM2Fastq extends BDGCommandCompanion {
  val commandName = "adam2fastq"
  val commandDescription = "Convert ADAM read data to FASTQ files"
  def apply(cmdLine: Array[String]): ADAM2Fastq
}

class ADAM2FastqArgs extends Args4jBase {
  var inputPath: String                           // Input ADAM file
  var outputPath: String                          // Primary FASTQ output
  var outputPath2: String                         // Secondary FASTQ for paired reads
  var validationStringency: ValidationStringency  // Input validation level
  var repartition: Int                            // Output partitioning
  var persistLevel: String                        // Spark persistence level
  var disableProjection: Boolean                  // Disable column projection
  var outputOriginalBaseQualities: Boolean        // Use original quality scores
}
```

**Key Features:**
- **Paired-End Support**: Writes the two reads of each pair to separate files when a second output path is given
- **Quality Score Options**: Choose between recalibrated and original quality scores
- **Validation Control**: Configurable stringency for malformed read handling
- **Memory Management**: Configurable persistence levels for large datasets

**Usage Examples:**
```bash
# Single-end reads
adam-submit adam2fastq reads.adam output.fastq

# Paired-end reads with separate output files
adam-submit adam2fastq \
  reads.adam \
  output_R1.fastq \
  output_R2.fastq

# Use original base qualities with lenient validation
adam-submit adam2fastq \
  --outputOriginalBaseQualities \
  --validationStringency LENIENT \
  reads.adam output.fastq

# Large dataset: persist serialized data with spill to disk and use more partitions
adam-submit adam2fastq \
  --persistLevel MEMORY_AND_DISK_SER \
  --repartition 200 \
  large_dataset.adam output.fastq
```
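
After a paired-end export, a quick sanity check is to confirm that both FASTQ files hold the same number of records. The sketch below assumes plain, uncompressed FASTQ files in the common four-lines-per-record layout. The helper names `fastq_records` and `pairs_match` are hypothetical, and the demonstration files are created only for this example.

```bash
# Hypothetical helper: print the number of records in a FASTQ file, assuming
# four lines per record. Fails if the line count is not a multiple of four.
fastq_records() {
  awk 'END { if (NR % 4 != 0) exit 1; print NR / 4 }' "$1"
}

# Hypothetical helper: succeed only when both files hold the same record count.
pairs_match() {
  r1=$(fastq_records "$1") && r2=$(fastq_records "$2") && [ "$r1" -eq "$r2" ]
}

# Demonstration on two tiny files written to a temporary directory.
tmp=$(mktemp -d)
printf '@read1/1\nACGT\n+\nIIII\n' > "$tmp/output_R1.fastq"
printf '@read1/2\nTGCA\n+\nIIII\n' > "$tmp/output_R2.fastq"
if pairs_match "$tmp/output_R1.fastq" "$tmp/output_R2.fastq"; then
  echo "paired files are consistent"
fi
rm -rf "$tmp"
```

A mismatch or a truncated file makes `pairs_match` return a non-zero status, which is convenient inside a pipeline script.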

## Multi-Format Fragment Processing

### Transform Fragments

Convert various genomic formats (SAM/BAM/CRAM) to ADAM fragment format, which maintains paired-end relationships and insert size information.

```scala { .api }
object TransformFragments extends BDGCommandCompanion {
  val commandName = "transformFragments"
  val commandDescription = "Convert SAM/BAM/CRAM to ADAM fragments"
  def apply(cmdLine: Array[String]): TransformFragments
}

class TransformFragmentsArgs extends Args4jBase with ADAMSaveAnyArgs with ParquetArgs {
  var inputPath: String      // Input alignment file
  var outputPath: String     // Output fragment file
  var coalesce: Int          // Output partition count
  var forceShuffle: Boolean  // Force data shuffling
  var storageLevel: String   // Spark storage level
}
```

**Fragment Benefits:**
- **Insert Size Analysis**: Maintains paired-end insert size distributions
- **Quality Metrics**: Preserves alignment quality and mapping information
- **Memory Efficiency**: Optimized storage for paired-end data analysis
- **Downstream Compatibility**: Works with ADAM's fragment-based analysis tools

**Usage Example:**
```bash
# Convert BAM to fragments, coalescing to 50 partitions and persisting with MEMORY_AND_DISK
adam-submit transformFragments \
  --coalesce 50 \
  --storageLevel MEMORY_AND_DISK \
  paired_reads.bam fragments.adam
```
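
The grouping idea behind fragments can be illustrated on SAM text: alignment records that share a read name (the first SAM column, QNAME) belong together. The sketch below counts records per read name. It is only an illustration of that grouping, not how `transformFragments` works internally, and the helper name `reads_per_fragment` is hypothetical. The demonstration records are truncated to four columns for brevity, whereas real SAM records have eleven mandatory columns.

```bash
# Hypothetical helper: read SAM text from stdin, skip header lines (starting
# with "@"), and print "read_name<TAB>record_count" for each read name.
reads_per_fragment() {
  awk -F '\t' '
    /^@/ { next }
    { count[$1]++ }
    END { for (name in count) printf "%s\t%d\n", name, count[name] }' | sort
}

# frag1 is a read pair (two records); frag2 is a single read.
printf '@HD\tVN:1.6\nfrag1\t99\tchr1\t100\nfrag1\t147\tchr1\t300\nfrag2\t0\tchr1\t500\n' |
  reads_per_fragment
```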

## Format Support Matrix

| Input Format | Output Format | Command | Key Features |
|--------------|---------------|---------|--------------|
| FASTA | ADAM Contigs | `fasta2adam` | Sequence indexing, fragmentation |
| ADAM Contigs | FASTA | `adam2fasta` | Dictionary generation, line formatting |
| ADAM Reads/Alignments | FASTQ | `adam2fastq` | Paired-end separation, quality options |
| SAM/BAM/CRAM | ADAM Fragments | `transformFragments` | Insert size preservation, pairing |
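
In a wrapper script, the matrix above can be turned into a small lookup that picks a command from the input and output file names. The sketch below is one possible way to do that. The function name `pick_command` and the file extension conventions (`.fa`, `.fasta`, `.fq`, `.fastq`, `.adam`) are assumptions made for illustration.

```bash
# Hypothetical helper: print the ADAM command that matches an input/output
# file pair, following the support matrix. Extension conventions are assumed.
pick_command() {
  case "$1" in
    *.fa|*.fasta)       echo fasta2adam ;;
    *.sam|*.bam|*.cram) echo transformFragments ;;
    *.adam)
      case "$2" in
        *.fa|*.fasta) echo adam2fasta ;;
        *.fq|*.fastq) echo adam2fastq ;;
        *) return 1 ;;
      esac ;;
    *) return 1 ;;
  esac
}

pick_command reference.fasta reference.adam
pick_command reads.adam output.fastq
```

Unsupported combinations return a non-zero status instead of printing a command, so a caller can stop before invoking `adam-submit`.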

## Performance Optimization

### Memory Management
```bash
# For large datasets, use a persistence level that spills to disk
--persistLevel MEMORY_AND_DISK_SER

# Control memory usage with partitioning
--repartition 100  # Increase for large files
--coalesce 10      # Decrease for small files
```
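
One rough way to choose a partition count is to divide the input size by a target amount of data per partition. The sketch below does that arithmetic. The helper name `suggest_partitions` is hypothetical, and the 128 MiB target is an assumption for illustration rather than an ADAM recommendation, so treat the result as a starting point to tune.

```bash
# Hypothetical helper: suggest a partition count as ceil(size / target),
# with both values in bytes and a minimum of one partition.
suggest_partitions() {
  size=$1
  target=$2
  [ "$target" -gt 0 ] || return 1
  count=$(( (size + target - 1) / target ))
  [ "$count" -ge 1 ] || count=1
  echo "$count"
}

# A 10 GiB input with an assumed 128 MiB target per partition.
suggest_partitions $(( 10 * 1024 * 1024 * 1024 )) $(( 128 * 1024 * 1024 ))
```

The printed value could then be passed to `--repartition` for a large input, or to `--coalesce` when it is smaller than the current partition count.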

### I/O Optimization
```bash
# Force data shuffling for balanced partitions
--forceShuffle

# Disable column projection for full schema access
--disableProjection
```

### Validation Control
```scala { .api }
// Validation stringency levels
ValidationStringency.STRICT   // Fail on any malformed data
ValidationStringency.LENIENT  // Warn on malformed data
ValidationStringency.SILENT   // Ignore malformed data
```
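
The three levels differ only in how a malformed record is handled. The sketch below mimics that behavior on a simplified tab-separated input, treating a read as malformed when its sequence and quality strings differ in length. It is an illustration, not ADAM's or htsjdk's validation logic. The helper name `check_reads` is hypothetical, and the choice to skip malformed reads under LENIENT and SILENT is an assumption of this sketch.

```bash
# Hypothetical helper: read "name<TAB>sequence<TAB>qualities" lines from stdin
# and print the names of well-formed reads. A read counts as malformed here
# when its sequence and quality strings differ in length.
check_reads() {
  awk -F '\t' -v level="$1" '
    length($2) != length($3) {
      if (level == "STRICT") { print "error: malformed read " $1 > "/dev/stderr"; exit 1 }
      if (level == "LENIENT") print "warning: malformed read " $1 > "/dev/stderr"
      next   # LENIENT and SILENT both skip the read in this sketch
    }
    { print $1 }'
}

# r2 has four bases but only two quality values.
printf 'r1\tACGT\tIIII\nr2\tACGT\tII\nr3\tAC\tII\n' | check_reads LENIENT
```

With `STRICT` the same input stops at `r2` with a non-zero exit status, while `SILENT` skips it without a message.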

## Integration with External Tools

### Sequence Dictionaries
FASTA conversions automatically generate sequence dictionaries compatible with:
- **SAMtools**: For reference-based operations
- **GATK**: For variant calling pipelines
- **Picard**: For data validation and metrics
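
A sequence dictionary records, at minimum, the name and length of each reference sequence. As a rough illustration of that information, the sketch below reads FASTA text and prints one SAM-style `@SQ` header line per record. It is an independent example rather than ADAM's dictionary writer, and the helper name `fasta_dict` is hypothetical.

```bash
# Hypothetical helper: read FASTA from stdin and print one SAM-style @SQ
# header line (sequence name and length) per record.
fasta_dict() {
  awk '
    function emit() { if (name != "") printf "@SQ\tSN:%s\tLN:%d\n", name, len }
    /^>/ { emit(); name = substr($1, 2); len = 0; next }
    { len += length($0) }
    END { emit() }'
}

# chr1 spans two sequence lines (6 + 2 bases); chr2 has 2 bases.
printf '>chr1 example contig\nACGTAC\nGT\n>chr2\nGG\n' | fasta_dict
```

The name is taken from the header up to the first whitespace, so the description text after `chr1` is dropped.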

### Quality Score Handling
FASTQ conversions support both:
- **Original Quality Scores**: As recorded in source files
- **Recalibrated Scores**: From ADAM quality score recalibration

### File Format Compatibility
All conversions maintain compatibility with standard genomics file format specifications:
- **FASTA**: NCBI/EMBL standard format
- **FASTQ**: Illumina 1.8+ Phred+33 encoding
- **SAM/BAM**: HTSlib specification compliance
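
In Phred+33 encoding, each quality character stores a score as its ASCII code minus 33. The sketch below decodes a quality string into numeric scores, which can help when checking exported FASTQ files by eye. The helper name `phred33_scores` is hypothetical.

```bash
# Hypothetical helper: decode a Phred+33 quality string into numeric scores,
# printed as a space-separated list on one line.
phred33_scores() {
  printf '%s' "$1" | od -An -v -tu1 |
    awk '{ for (i = 1; i <= NF; i++) printf "%s%d", (n++ ? " " : ""), $i - 33 }
         END { print "" }'
}

# "I" is ASCII 73 (score 40), "5" is ASCII 53 (score 20), "!" is ASCII 33 (score 0).
phred33_scores 'II5!'
```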