# File Format Support

Comprehensive support for reading and writing genomic file formats, with automatic format detection and validation. ADAM Core bridges legacy genomic file formats with modern distributed processing capabilities, providing efficient I/O operations for all major genomic data types.

## Capabilities

### Alignment File Formats

Support for sequencing alignment data in standard and modern formats.

```scala { .api }
/**
 * Load alignment data from SAM, BAM, or CRAM files
 * Automatically detects format based on file extension and magic bytes
 * @param pathName - Path to alignment file or directory
 * @param stringency - Validation stringency for format compliance
 * @return AlignmentRecordRDD containing alignment records
 */
def loadBam(pathName: String,
            stringency: ValidationStringency = ValidationStringency.STRICT): AlignmentRecordRDD

/**
 * Save alignment records as SAM/BAM/CRAM format
 * @param pathName - Output path
 * @param asType - Output format specification
 * @param asSingleFile - Whether to merge output into single file
 * @param stringency - Validation stringency for output compliance
 */
def saveAsSam(pathName: String,
              asType: SAMFormat = SAMFormat.SAM,
              asSingleFile: Boolean = false,
              stringency: ValidationStringency = ValidationStringency.STRICT): Unit
```

**Supported Alignment Formats:**
- **SAM**: Sequence Alignment/Map plain text format
- **BAM**: Binary compressed SAM format
- **CRAM**: Reference-based compressed alignment format
- **ADAM Parquet**: Native columnar format with schema evolution support

**Usage Examples:**

```scala
import org.bdgenomics.adam.rdd.ADAMContext._

// `sc` is an existing SparkContext; the import above adds the load methods to it

// Load various alignment formats
val samData = sc.loadBam("alignments.sam")
val bamData = sc.loadBam("alignments.bam")
val cramData = sc.loadBam("alignments.cram")

// Save in different formats
samData.saveAsSam("output.sam", SAMFormat.SAM)
samData.saveAsSam("output.bam", SAMFormat.BAM)
samData.saveAsSam("output.cram", SAMFormat.CRAM)

// Save as ADAM's native format
samData.saveAsParquet("alignments.adam")
```

### Variant File Formats

Support for genetic variant data with full VCF specification compliance.

```scala { .api }
/**
 * Load variant data from VCF files (plain text or compressed)
 * Supports VCF 4.0+ specifications with full header parsing
 * @param pathName - Path to VCF file or directory
 * @param stringency - Validation stringency for VCF compliance
 * @return VariantContextRDD containing variants with metadata
 */
def loadVcf(pathName: String,
            stringency: ValidationStringency = ValidationStringency.STRICT): VariantContextRDD

/**
 * Save variant data as VCF format
 * @param pathName - Output path
 * @param stringency - Validation stringency for VCF compliance
 * @param asSingleFile - Whether to merge output into single file
 */
def saveAsVcf(pathName: String,
              stringency: ValidationStringency = ValidationStringency.STRICT,
              asSingleFile: Boolean = false): Unit
```

**Supported Variant Formats:**
- **VCF**: Variant Call Format (plain text)
- **VCF.gz**: Compressed VCF with bgzip compression
- **BCF**: Binary VCF format (through conversion)
- **ADAM Parquet**: Native columnar variant storage

**Usage Examples:**

```scala
// Load VCF files
val variants = sc.loadVcf("variants.vcf")
val compressedVariants = sc.loadVcf("variants.vcf.gz")

// Work with different variant representations
val variantOnly = variants.toVariants()  // Just variant sites
val genotypes = variants.toGenotypes()   // Genotype calls

// Save in various formats
variants.saveAsVcf("output.vcf")
variantOnly.saveAsParquet("variants.adam")
genotypes.saveAsParquet("genotypes.adam")
```

### Sequence File Formats

Support for raw sequencing data and reference genomes.

```scala { .api }
/**
 * Load FASTQ sequencing data (single-end or paired-end)
 * @param pathName1 - First FASTQ file (or single-end file)
 * @param optPathName2 - Optional second FASTQ file for paired-end
 * @return AlignmentRecordRDD containing unaligned reads
 */
def loadFastq(pathName1: String, optPathName2: Option[String] = None): AlignmentRecordRDD

/**
 * Load interleaved FASTQ where paired reads alternate
 * @param pathName - Path to interleaved FASTQ file
 * @return AlignmentRecordRDD containing paired reads
 */
def loadInterleavedFastq(pathName: String): AlignmentRecordRDD

/**
 * Load reference genome sequences from FASTA files
 * @param pathName - Path to FASTA file
 * @param maximumLength - Maximum length of each loaded fragment; longer sequences are split into multiple fragments
 * @return NucleotideContigFragmentRDD containing reference sequences
 */
def loadFasta(pathName: String, maximumLength: Long = 10000L): NucleotideContigFragmentRDD

/**
 * Save reads as FASTQ format
 * @param pathName - Output path
 * @param outputOriginalBaseQualities - Write original base qualities instead of recalibrated ones
 * @param asSingleFile - Merge output into single file
 */
def saveAsFastq(pathName: String,
                outputOriginalBaseQualities: Boolean = false,
                asSingleFile: Boolean = false): Unit

/**
 * Save reference sequences as FASTA format
 * @param pathName - Output path
 * @param lineWidth - Bases per line in output
 */
def saveAsFasta(pathName: String, lineWidth: Int = 60): Unit
```

**Supported Sequence Formats:**
- **FASTQ**: Raw sequencing reads with quality scores
- **FASTA**: Reference genome sequences
- **2bit**: Compact reference genome format (loading only)

**Usage Examples:**

```scala
// Load sequencing data
val singleEnd = sc.loadFastq("reads.fastq")
val pairedEnd = sc.loadFastq("R1.fastq", Some("R2.fastq"))
val interleaved = sc.loadInterleavedFastq("paired.fastq")

// Load reference genome
val reference = sc.loadFasta("hg38.fasta")

// Save processed reads
val processed = singleEnd.transform(_.filter(_.getReadName.startsWith("good")))
processed.saveAsFastq("filtered_reads.fastq")

// Save reference sequences
reference.saveAsFasta("output_reference.fasta", lineWidth = 80)
```
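
To make the interleaved layout concrete, the following plain-Scala sketch (no ADAM or Spark involved) shows what "paired reads alternate" means: each record spans four lines, and every two consecutive records form one pair. The names `FastqRecord`, `InterleavedFastqSketch`, `parse`, and `pairInterleaved` are hypothetical and exist only for this illustration; this is not how ADAM parses FASTQ internally.

```scala
// Hypothetical illustration only; not ADAM's FASTQ parser.
final case class FastqRecord(name: String, sequence: String, qualities: String)

object InterleavedFastqSketch {
  /** Parse FASTQ text, assuming every record spans exactly four lines. */
  def parse(lines: Seq[String]): List[FastqRecord] =
    lines.grouped(4).collect {
      case Seq(header, sequence, _, qualities) if header.startsWith("@") =>
        FastqRecord(header.drop(1), sequence, qualities)
    }.toList

  /** Pair consecutive records as (first of pair, second of pair). */
  def pairInterleaved(records: Seq[FastqRecord]): List[(FastqRecord, FastqRecord)] =
    records.grouped(2).collect { case Seq(first, second) => (first, second) }.toList
}
```

A trailing record without a mate is dropped by `pairInterleaved` in this sketch; a real loader would more likely report it according to the configured validation stringency.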

### Feature Annotation Formats

Support for genomic feature annotations in multiple standard formats.

```scala { .api }
/**
 * Load genomic features with automatic format detection
 * Supports BED, GFF3, GTF, IntervalList, and NarrowPeak formats
 * @param pathName - Path to feature file
 * @return FeatureRDD containing genomic annotations
 */
def loadFeatures(pathName: String): FeatureRDD

/**
 * Save features as BED format
 * @param pathName - Output path
 */
def saveAsBed(pathName: String): Unit

/**
 * Save features as GTF format (Gene Transfer Format)
 * @param pathName - Output path
 */
def saveAsGtf(pathName: String): Unit

/**
 * Save features as GFF3 format (General Feature Format)
 * @param pathName - Output path
 */
def saveAsGff3(pathName: String): Unit

/**
 * Save features as Picard IntervalList format
 * @param pathName - Output path
 */
def saveAsIntervalList(pathName: String): Unit

/**
 * Save features as ENCODE narrowPeak format
 * @param pathName - Output path
 */
def saveAsNarrowPeak(pathName: String): Unit
```

**Supported Feature Formats:**
- **BED**: Browser Extensible Data format
- **GTF**: Gene Transfer Format for gene annotations
- **GFF3**: General Feature Format version 3
- **IntervalList**: Picard toolkit interval format
- **NarrowPeak**: ENCODE ChIP-seq peak format
- **ADAM Parquet**: Native columnar feature storage

**Usage Examples:**

```scala
// Load various annotation formats
val bedFeatures = sc.loadFeatures("regions.bed")
val geneAnnotations = sc.loadFeatures("genes.gtf")
val gff3Features = sc.loadFeatures("annotations.gff3")

// Convert between formats
bedFeatures.saveAsGtf("converted.gtf")
geneAnnotations.saveAsBed("genes.bed")

// Filter and save
val exons = geneAnnotations.transform(_.filter(_.getFeatureType == "exon"))
exons.saveAsIntervalList("exons.interval_list")
```

### Coverage and Depth Formats

Support for sequencing depth and coverage data visualization formats.

```scala { .api }
/**
 * Save coverage data as WIG format for genome browsers
 * @param pathName - Output path
 */
def saveAsWig(pathName: String): Unit

/**
 * Save coverage data as BigWig format (through conversion)
 * @param pathName - Output path
 * @param sequenceDictionary - Reference sequence information
 */
def saveAsBigWig(pathName: String, sequenceDictionary: SequenceDictionary): Unit
```

**Usage Examples:**

```scala
// Generate and save coverage
val alignments = sc.loadBam("sample.bam")
val coverage = alignments.toCoverage()

// Save for genome browser visualization
coverage.saveAsWig("coverage.wig")

// Convert features to coverage
val features = sc.loadFeatures("peaks.bed")
val featureCoverage = features.toCoverage()
featureCoverage.saveAsWig("peak_coverage.wig")
```

### Format-Agnostic Loading

Automatic format detection for seamless data loading regardless of file format.

```scala { .api }
/**
 * Load alignment data with automatic format detection
 * Detects SAM, BAM, CRAM, or ADAM formats automatically
 * @param pathName - Path to alignment file
 * @return AlignmentRecordRDD
 */
def loadAlignments(pathName: String): AlignmentRecordRDD

/**
 * Load variant data with automatic format detection
 * Detects VCF or ADAM variant formats automatically
 * @param pathName - Path to variant file
 * @return VariantRDD
 */
def loadVariants(pathName: String): VariantRDD

/**
 * Load genotype data with automatic format detection
 * Detects VCF or ADAM genotype formats automatically
 * @param pathName - Path to genotype file
 * @return GenotypeRDD
 */
def loadGenotypes(pathName: String): GenotypeRDD

/**
 * Load feature data with automatic format detection
 * Detects BED, GTF, GFF3, or ADAM feature formats automatically
 * @param pathName - Path to feature file
 * @return FeatureRDD
 */
def loadFeatures(pathName: String): FeatureRDD
```

**Usage Examples:**

```scala
// Load without specifying format
val alignments = sc.loadAlignments("unknown_format_file")  // Auto-detects
val variants = sc.loadVariants("variants_file")            // Auto-detects
val features = sc.loadFeatures("annotations_file")         // Auto-detects

// Particularly useful for processing directories with mixed formats
val mixedAlignments = sc.loadAlignments("alignment_directory/")
```

### ADAM Native Parquet Format

ADAM's high-performance columnar storage format with schema evolution support.

```scala { .api }
/**
 * Load data from ADAM Parquet files with optional projection and filtering
 * @param pathName - Path to Parquet file or directory
 * @param optPredicate - Optional Parquet filter predicate, pushed down to the reader
 * @param optProjection - Optional column projection for efficiency
 * @return RDD of specified type
 */
def loadParquet[T](pathName: String,
                   optPredicate: Option[FilterPredicate] = None,
                   optProjection: Option[Schema] = None): RDD[T]

/**
 * Save any GenomicRDD as ADAM Parquet format
 * @param pathName - Output path
 */
def saveAsParquet(pathName: String): Unit
```

**Usage Examples:**

```scala
import org.apache.parquet.filter2.predicate.FilterApi
import org.bdgenomics.adam.projections.{AlignmentRecordField, Projection}
import org.bdgenomics.formats.avro.AlignmentRecord

// Save as ADAM format
val alignments = sc.loadBam("input.bam")
alignments.saveAsParquet("alignments.adam")

// Load with column projection for efficiency
val projection = Projection(AlignmentRecordField.readName,
                            AlignmentRecordField.sequence,
                            AlignmentRecordField.readMapped)
val projectedAlignments = sc.loadParquet[AlignmentRecord](
  "alignments.adam",
  optProjection = Some(projection)
)

// Load with predicate pushdown; readMapped is a boolean column,
// so it needs booleanColumn and a boxed java.lang.Boolean value
val mappedFilter = FilterApi.eq(FilterApi.booleanColumn("readMapped"), java.lang.Boolean.TRUE)
val mappedReads = sc.loadParquet[AlignmentRecord](
  "alignments.adam",
  optPredicate = Some(mappedFilter)
)
```

### Format Validation and Error Handling

Comprehensive validation and error handling across all supported formats.

```scala { .api }
/**
 * Validation stringency levels for format compliance
 */
object ValidationStringency extends Enumeration {
  /** Fail immediately on any format violations */
  val STRICT = Value

  /** Log warnings for format violations but continue processing */
  val LENIENT = Value

  /** Ignore format violations silently */
  val SILENT = Value
}

/**
 * SAM format output types
 */
object SAMFormat extends Enumeration {
  val SAM = Value   // Plain text SAM
  val BAM = Value   // Binary BAM
  val CRAM = Value  // Reference compressed CRAM
}
```
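
The three stringency levels differ only in what happens when a record violates the format. The sketch below illustrates that policy in plain Scala, under the assumption that the behavior matches the doc comments above (fail, warn, or ignore). `Stringency`, `StringencySketch`, and `handleViolation` are hypothetical names introduced for this example; they are not part of ADAM's API.

```scala
// Hypothetical stand-in for the ValidationStringency levels described above.
object Stringency extends Enumeration {
  val STRICT, LENIENT, SILENT = Value
}

object StringencySketch {
  /**
   * Apply a stringency policy to a single format violation.
   * Returns the warning text that would be logged, if any.
   */
  def handleViolation(message: String, stringency: Stringency.Value): Option[String] =
    stringency match {
      case Stringency.STRICT  => throw new IllegalArgumentException(message) // fail immediately
      case Stringency.LENIENT => Some(s"WARN: $message")                     // warn and continue
      case Stringency.SILENT  => None                                        // ignore silently
    }
}
```

In practice this means `STRICT` is the safer default for pipelines that must not silently lose records, while `LENIENT` is useful when inspecting files from tools that deviate slightly from the specification.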

## File Format Performance Considerations

1. **Use indexed files** (BAM/CRAM with .bai/.crai, VCF with .tbi/.csi) for region queries
2. **Prefer ADAM Parquet format** for repeated analysis, since its columnar layout lets later passes read only the columns and rows they need
3. **Apply column projection** when loading Parquet to read only necessary fields
4. **Use predicate pushdown** to filter data at storage level
5. **Consider compression trade-offs**: CRAM for storage, BAM for processing speed
6. **Split large inputs** into multiple smaller files for better parallelization

## Supported File Extensions

ADAM automatically detects formats based on file extensions:

- **.sam, .bam, .cram** → Alignment data
- **.vcf, .vcf.gz, .vcf.bgz** → Variant data
- **.fastq, .fq, .fastq.gz** → Sequencing reads
- **.fasta, .fa, .fna** → Reference sequences
- **.bed, .bed.gz** → BED features
- **.gtf, .gtf.gz, .gff3, .gff3.gz** → Gene annotations
- **.wig, .wiggle** → Coverage data
- **.adam** → ADAM Parquet format
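
The mapping above can be sketched as a simple suffix lookup. This is a minimal illustration in plain Scala, not ADAM's actual dispatch code: `ExtensionSketch` and `detect` are hypothetical names, and the case-insensitive matching is an assumption of the sketch.

```scala
// Hypothetical illustration of extension-based detection; not ADAM's implementation.
object ExtensionSketch {
  // Suffix -> data category, taken from the list above.
  private val suffixes: Seq[(String, String)] = Seq(
    ".sam" -> "Alignment data", ".bam" -> "Alignment data", ".cram" -> "Alignment data",
    ".vcf" -> "Variant data", ".vcf.gz" -> "Variant data", ".vcf.bgz" -> "Variant data",
    ".fastq" -> "Sequencing reads", ".fq" -> "Sequencing reads", ".fastq.gz" -> "Sequencing reads",
    ".fasta" -> "Reference sequences", ".fa" -> "Reference sequences", ".fna" -> "Reference sequences",
    ".bed" -> "BED features", ".bed.gz" -> "BED features",
    ".gtf" -> "Gene annotations", ".gtf.gz" -> "Gene annotations",
    ".gff3" -> "Gene annotations", ".gff3.gz" -> "Gene annotations",
    ".wig" -> "Coverage data", ".wiggle" -> "Coverage data",
    ".adam" -> "ADAM Parquet format"
  )

  /** Return the data category for a path, or None if no known suffix matches. */
  def detect(path: String): Option[String] = {
    val lower = path.toLowerCase // assumption: extensions are matched case-insensitively
    suffixes.collectFirst { case (suffix, category) if lower.endsWith(suffix) => category }
  }
}
```

For example, `ExtensionSketch.detect("sample.vcf.gz")` yields `Some("Variant data")`, while a path with an unlisted extension yields `None`.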

When file extensions are ambiguous or missing, ADAM examines file content headers for format detection.
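
As a rough illustration of header-based detection, the sketch below checks the first bytes of a file against a few well-known signatures: CRAM files begin with `CRAM`, Parquet files with `PAR1`, gzip and BGZF streams (used by BAM and compressed VCF) with the bytes `0x1f 0x8b`, VCF text with `##fileformat=VCF`, and SAM text commonly with an `@HD` header line. The document does not show ADAM's actual detection logic, so `HeaderSniffSketch` and `sniff` are hypothetical names and the set of checks is an assumption.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical illustration of content-based detection; not ADAM's implementation.
object HeaderSniffSketch {
  private def ascii(s: String): Array[Byte] = s.getBytes(StandardCharsets.US_ASCII)

  private def startsWith(bytes: Array[Byte], prefix: Array[Byte]): Boolean =
    bytes.length >= prefix.length && bytes.take(prefix.length).sameElements(prefix)

  /** Guess a container or format from the leading bytes of a file. */
  def sniff(head: Array[Byte]): Option[String] =
    if (startsWith(head, ascii("CRAM"))) Some("CRAM")
    else if (startsWith(head, ascii("PAR1"))) Some("Parquet")
    else if (startsWith(head, Array(0x1f.toByte, 0x8b.toByte))) Some("gzip/BGZF")
    else if (startsWith(head, ascii("##fileformat=VCF"))) Some("VCF")
    else if (startsWith(head, ascii("@HD"))) Some("SAM")
    else None
}
```

Note that the gzip signature alone cannot distinguish BAM from a compressed VCF or FASTQ; telling them apart requires decompressing the first block and inspecting its contents, which this sketch leaves out.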