Tessl Tile for maven/org.bdgenomics.adam/adam-core_2.10@0.23.0

# File Format Support

ADAM Core reads and writes the major genomic file formats, with automatic format detection and configurable validation. It bridges legacy genomic file formats with modern distributed processing, covering alignments, variants, sequences, features, and coverage.

## Capabilities

### Alignment File Formats

Support for sequencing alignment data in standard and modern formats.

```scala { .api }
/**
 * Load alignment data from SAM, BAM, or CRAM files
 * Automatically detects format based on file extension and magic bytes
 * @param pathName - Path to alignment file or directory
 * @param stringency - Validation stringency for format compliance
 * @return AlignmentRecordRDD containing alignment records
 */
def loadBam(pathName: String,
            stringency: ValidationStringency = ValidationStringency.STRICT): AlignmentRecordRDD

/**
 * Save alignment records as SAM/BAM/CRAM format
 * @param pathName - Output path
 * @param asType - Output format specification
 * @param asSingleFile - Whether to merge output into single file
 * @param stringency - Validation stringency for output compliance
 */
def saveAsSam(pathName: String,
              asType: SAMFormat = SAMFormat.SAM,
              asSingleFile: Boolean = false,
              stringency: ValidationStringency = ValidationStringency.STRICT): Unit
```

**Supported Alignment Formats:**
- **SAM**: Sequence Alignment/Map plain text format
- **BAM**: Binary compressed SAM format
- **CRAM**: Reference-based compressed alignment format
- **ADAM Parquet**: Native columnar format with schema evolution support

**Usage Examples:**

```scala
import org.bdgenomics.adam.rdd.ADAMContext._
import org.seqdoop.hadoop_bam.SAMFormat

// Assumes `sc` is an active SparkContext

// Load various alignment formats
val samData = sc.loadBam("alignments.sam")
val bamData = sc.loadBam("alignments.bam")
val cramData = sc.loadBam("alignments.cram")

// Save in different formats
samData.saveAsSam("output.sam", SAMFormat.SAM)
samData.saveAsSam("output.bam", SAMFormat.BAM)
samData.saveAsSam("output.cram", SAMFormat.CRAM)

// Save as ADAM's native format
samData.saveAsParquet("alignments.adam")
```

### Variant File Formats

Support for genetic variant data in VCF (version 4.0 and later) and ADAM Parquet.

```scala { .api }
/**
 * Load variant data from VCF files (plain text or compressed)
 * Supports VCF 4.0+ specifications with full header parsing
 * @param pathName - Path to VCF file or directory
 * @param stringency - Validation stringency for VCF compliance
 * @return VariantContextRDD containing variants with metadata
 */
def loadVcf(pathName: String,
            stringency: ValidationStringency = ValidationStringency.STRICT): VariantContextRDD

/**
 * Save variant data as VCF format
 * @param pathName - Output path
 * @param stringency - Validation stringency for VCF compliance
 * @param asSingleFile - Whether to merge output into single file
 */
def saveAsVcf(pathName: String,
              stringency: ValidationStringency = ValidationStringency.STRICT,
              asSingleFile: Boolean = false): Unit
```

**Supported Variant Formats:**
- **VCF**: Variant Call Format (plain text)
- **VCF.gz**: Compressed VCF with bgzip compression
- **BCF**: Binary VCF format (through conversion)
- **ADAM Parquet**: Native columnar variant storage

**Usage Examples:**

```scala
// Load VCF files
val variants = sc.loadVcf("variants.vcf")
val compressedVariants = sc.loadVcf("variants.vcf.gz")

// Work with different variant representations
val variantOnly = variants.toVariants()      // Just variant sites
val genotypes = variants.toGenotypes()       // Genotype calls

// Save in various formats
variants.saveAsVcf("output.vcf")
variantOnly.saveAsParquet("variants.adam")
genotypes.saveAsParquet("genotypes.adam")
```

### Sequence File Formats

Support for raw sequencing data and reference genomes.

```scala { .api }
/**
 * Load FASTQ sequencing data (single-end or paired-end)
 * @param pathName1 - First FASTQ file (or single-end file)
 * @param optPathName2 - Optional second FASTQ file for paired-end
 * @return AlignmentRecordRDD containing unaligned reads
 */
def loadFastq(pathName1: String, optPathName2: Option[String] = None): AlignmentRecordRDD

/**
 * Load interleaved FASTQ where paired reads alternate
 * @param pathName - Path to interleaved FASTQ file
 * @return AlignmentRecordRDD containing paired reads
 */
def loadInterleavedFastq(pathName: String): AlignmentRecordRDD

/**
 * Load reference genome sequences from FASTA files
 * @param pathName - Path to FASTA file
 * @param maximumLength - Maximum fragment length; longer sequences are split into fragments of at most this many bases
 * @return NucleotideContigFragmentRDD containing reference sequences
 */
def loadFasta(pathName: String, maximumLength: Long = 10000L): NucleotideContigFragmentRDD

/**
 * Save reads as FASTQ format
 * @param pathName - Output path
 * @param outputOriginalBaseQualities - Use original vs. recalibrated qualities
 * @param asSingleFile - Merge output into single file
 */
def saveAsFastq(pathName: String,
                outputOriginalBaseQualities: Boolean = false,
                asSingleFile: Boolean = false): Unit

/**
 * Save reference sequences as FASTA format
 * @param pathName - Output path
 * @param lineWidth - Bases per line in output
 */
def saveAsFasta(pathName: String, lineWidth: Int = 60): Unit
```
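
The interleaved layout that `loadInterleavedFastq` expects can be pictured with a small parser. This is only a sketch in plain Scala with no ADAM or Spark dependency; `InterleavedFastqSketch`, `FastqRecord`, `parseRecords`, and `pairUp` are hypothetical names, and the code assumes every record occupies exactly four lines. It is not how ADAM itself parses FASTQ.

```scala
object InterleavedFastqSketch {
  // Hypothetical record type for illustration; ADAM uses AlignmentRecord instead.
  final case class FastqRecord(name: String, sequence: String, qualities: String)

  // Parse FASTQ text, assuming each record occupies exactly four lines:
  // "@name", sequence, "+", qualities.
  def parseRecords(lines: Seq[String]): List[FastqRecord] =
    lines.grouped(4).map { chunk =>
      require(chunk.size == 4, "truncated FASTQ record")
      require(chunk(0).startsWith("@"), s"bad header line: ${chunk(0)}")
      require(chunk(2).startsWith("+"), s"bad separator line: ${chunk(2)}")
      require(chunk(1).length == chunk(3).length, "sequence/quality length mismatch")
      FastqRecord(chunk(0).drop(1), chunk(1), chunk(3))
    }.toList

  // In an interleaved file the two mates alternate: first-of-pair, second-of-pair, ...
  def pairUp(records: Seq[FastqRecord]): List[(FastqRecord, FastqRecord)] = {
    require(records.size % 2 == 0, "interleaved FASTQ needs an even number of records")
    records.grouped(2).map(pair => (pair(0), pair(1))).toList
  }
}
```

Calling `pairUp(parseRecords(lines))` on the lines of an interleaved file yields one tuple per read pair, which is the grouping a paired-end load has to recover.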

**Supported Sequence Formats:**
- **FASTQ**: Raw sequencing reads with quality scores
- **FASTA**: Reference genome sequences
- **2bit**: Compact reference genome format (loading only)

**Usage Examples:**

```scala
// Load sequencing data
val singleEnd = sc.loadFastq("reads.fastq")
val pairedEnd = sc.loadFastq("R1.fastq", Some("R2.fastq"))
val interleaved = sc.loadInterleavedFastq("paired.fastq")

// Load reference genome
val reference = sc.loadFasta("hg38.fasta")

// Save processed reads
val processed = singleEnd.transform(_.filter(_.getReadName.startsWith("good")))
processed.saveAsFastq("filtered_reads.fastq")

// Save reference sequences
reference.saveAsFasta("output_reference.fasta", lineWidth = 80)
```
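
To make the `maximumLength` and `lineWidth` parameters concrete, here is a minimal sketch in plain Scala. It assumes `maximumLength` bounds the length of each contig fragment (as the `NucleotideContigFragmentRDD` return type suggests) and that `lineWidth` wraps sequence lines on output. `FastaSketch`, `fragment`, and `render` are hypothetical helpers, not part of ADAM.

```scala
object FastaSketch {
  // Split a contig into consecutive fragments of at most `maximumLength` bases,
  // analogous in spirit to the `maximumLength` parameter of `loadFasta`.
  def fragment(sequence: String, maximumLength: Int): List[String] = {
    require(maximumLength > 0, "maximumLength must be positive")
    sequence.grouped(maximumLength).toList
  }

  // Render one FASTA record, wrapping the sequence at `lineWidth` bases per line,
  // analogous in spirit to the `lineWidth` parameter of `saveAsFasta`.
  def render(name: String, sequence: String, lineWidth: Int = 60): String = {
    require(lineWidth > 0, "lineWidth must be positive")
    (s">$name" +: sequence.grouped(lineWidth).toList).mkString("\n")
  }
}
```

Smaller fragments give more, lighter records to distribute; a wider `lineWidth` only changes the text layout of the output, not the sequence itself.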

### Feature Annotation Formats

Support for genomic feature annotations in multiple standard formats.

```scala { .api }
/**
 * Load genomic features with automatic format detection
 * Supports BED, GFF3, GTF, IntervalList, and NarrowPeak formats
 * @param pathName - Path to feature file
 * @return FeatureRDD containing genomic annotations
 */
def loadFeatures(pathName: String): FeatureRDD

/**
 * Save features as BED format
 * @param pathName - Output path
 */
def saveAsBed(pathName: String): Unit

/**
 * Save features as GTF format (Gene Transfer Format)
 * @param pathName - Output path
 */
def saveAsGtf(pathName: String): Unit

/**
 * Save features as GFF3 format (General Feature Format)
 * @param pathName - Output path
 */
def saveAsGff3(pathName: String): Unit

/**
 * Save features as Picard IntervalList format
 * @param pathName - Output path
 */
def saveAsIntervalList(pathName: String): Unit

/**
 * Save features as ENCODE narrowPeak format
 * @param pathName - Output path
 */
def saveAsNarrowPeak(pathName: String): Unit
```

**Supported Feature Formats:**
- **BED**: Browser Extensible Data format
- **GTF**: Gene Transfer Format for gene annotations
- **GFF3**: General Feature Format version 3
- **IntervalList**: Picard toolkit interval format
- **NarrowPeak**: ENCODE ChIP-seq peak format
- **ADAM Parquet**: Native columnar feature storage

**Usage Examples:**

```scala
// Load various annotation formats
val bedFeatures = sc.loadFeatures("regions.bed")
val geneAnnotations = sc.loadFeatures("genes.gtf")
val gff3Features = sc.loadFeatures("annotations.gff3")

// Convert between formats
bedFeatures.saveAsGtf("converted.gtf")
geneAnnotations.saveAsBed("genes.bed")

// Filter and save
val exons = geneAnnotations.transform(_.filter(_.getFeatureType == "exon"))
exons.saveAsIntervalList("exons.interval_list")
```
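
Converting between feature formats involves more than rewriting columns, because the formats use different coordinate conventions: BED intervals are 0-based with an exclusive end, while GTF intervals are 1-based with an inclusive end. The sketch below illustrates that shift in plain Scala. `BedToGtfSketch`, `Interval`, `parseBed`, and `toGtf` are hypothetical, the `source` and `featureType` defaults are placeholders, and ADAM's own conversion may differ in detail.

```scala
object BedToGtfSketch {
  // Hypothetical minimal feature model; ADAM's Feature record has many more fields.
  final case class Interval(contig: String, start: Long, end: Long, name: String, strand: String)

  // Parse the first six BED columns (chrom, chromStart, chromEnd, name, score, strand).
  // BED coordinates are 0-based with an exclusive end.
  def parseBed(line: String): Interval = {
    val cols = line.split("\t")
    require(cols.length >= 3, s"BED needs at least 3 columns: $line")
    Interval(
      contig = cols(0),
      start = cols(1).toLong,
      end = cols(2).toLong,
      name = if (cols.length > 3) cols(3) else ".",
      strand = if (cols.length > 5) cols(5) else ".")
  }

  // Render as a 9-column GTF line. GTF coordinates are 1-based with an inclusive end,
  // so only the start shifts by one.
  def toGtf(f: Interval, source: String = "sketch", featureType: String = "region"): String =
    Seq(f.contig, source, featureType, (f.start + 1).toString, f.end.toString,
      ".", f.strand, ".", s"""gene_id "${f.name}";""").mkString("\t")
}
```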

### Coverage and Depth Formats

Support for sequencing depth and coverage data visualization formats.

```scala { .api }
/**
 * Save coverage data as WIG format for genome browsers
 * @param pathName - Output path
 */
def saveAsWig(pathName: String): Unit

/**
 * Save coverage data as BigWig format (through conversion)
 * @param pathName - Output path
 * @param sequenceDictionary - Reference sequence information
 */
def saveAsBigWig(pathName: String, sequenceDictionary: SequenceDictionary): Unit
```

**Usage Examples:**

```scala
// Generate and save coverage
val alignments = sc.loadBam("sample.bam")
val coverage = alignments.toCoverage()

// Save for genome browser visualization
coverage.saveAsWig("coverage.wig")

// Convert features to coverage
val features = sc.loadFeatures("peaks.bed")
val featureCoverage = features.toCoverage()
featureCoverage.saveAsWig("peak_coverage.wig")
```
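
Conceptually, turning alignments into coverage means counting how many reads overlap each position. The following is a rough single-machine sketch of that idea using a boundary sweep. It is not ADAM's distributed `toCoverage` implementation, and `CoverageSketch`, `Aligned`, `Depth`, and `coverage` are hypothetical names for a single contig.

```scala
object CoverageSketch {
  // Hypothetical types: a read alignment as a 0-based, half-open interval on one contig.
  final case class Aligned(start: Long, end: Long)
  final case class Depth(start: Long, end: Long, count: Int)

  // Sweep over interval boundaries: +1 where a read starts, -1 where it ends,
  // emitting one Depth run for every stretch with constant, non-zero depth.
  def coverage(reads: Seq[Aligned]): List[Depth] = {
    val deltas = reads
      .flatMap(r => Seq(r.start -> 1, r.end -> -1))
      .groupBy(_._1)
      .map { case (pos, ds) => pos -> ds.map(_._2).sum }
      .filter(_._2 != 0) // positions where the depth does not change are not boundaries
      .toList
      .sortBy(_._1)

    val runs = List.newBuilder[Depth]
    var depth = 0
    var prev = 0L
    for ((pos, delta) <- deltas) {
      if (depth > 0 && pos > prev) runs += Depth(prev, pos, depth)
      depth += delta
      prev = pos
    }
    runs.result()
  }
}
```

Each resulting `Depth` run corresponds to one interval-plus-value entry of the kind a coverage track stores.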

### Format-Agnostic Loading

Automatic format detection for seamless data loading regardless of file format.

```scala { .api }
/**
 * Load alignment data with automatic format detection
 * Detects SAM, BAM, CRAM, or ADAM formats automatically
 * @param pathName - Path to alignment file
 * @return AlignmentRecordRDD
 */
def loadAlignments(pathName: String): AlignmentRecordRDD

/**
 * Load variant data with automatic format detection
 * Detects VCF or ADAM variant formats automatically
 * @param pathName - Path to variant file
 * @return VariantRDD
 */
def loadVariants(pathName: String): VariantRDD

/**
 * Load genotype data with automatic format detection
 * Detects VCF or ADAM genotype formats automatically
 * @param pathName - Path to genotype file
 * @return GenotypeRDD
 */
def loadGenotypes(pathName: String): GenotypeRDD

/**
 * Load feature data with automatic format detection
 * Detects BED, GTF, GFF3, or ADAM feature formats automatically
 * @param pathName - Path to feature file
 * @return FeatureRDD
 */
def loadFeatures(pathName: String): FeatureRDD
```

**Usage Examples:**

```scala
// Load without specifying format
val alignments = sc.loadAlignments("unknown_format_file")  // Auto-detects
val variants = sc.loadVariants("variants_file")            // Auto-detects
val features = sc.loadFeatures("annotations_file")         // Auto-detects

// Particularly useful for processing directories with mixed formats
val mixedAlignments = sc.loadAlignments("alignment_directory/")
```

### ADAM Native Parquet Format

ADAM's high-performance columnar storage format with schema evolution support.

```scala { .api }
/**
 * Load data from ADAM Parquet files with optional projection and filtering
 * @param pathName - Path to Parquet file or directory
 * @param optPredicate - Optional filter predicate, pushed down to the Parquet reader
 * @param optProjection - Optional column projection for efficiency
 * @return RDD of specified type
 */
def loadParquet[T](pathName: String,
                   optPredicate: Option[FilterPredicate] = None,
                   optProjection: Option[Schema] = None): RDD[T]

/**
 * Save any GenomicRDD as ADAM Parquet format
 * @param pathName - Output path
 */
def saveAsParquet(pathName: String): Unit
```

**Usage Examples:**

```scala
import org.apache.parquet.filter2.predicate.FilterApi
import org.bdgenomics.adam.projections.{AlignmentRecordField, Projection}
import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.formats.avro.AlignmentRecord

// Save as ADAM format
val alignments = sc.loadBam("input.bam")
alignments.saveAsParquet("alignments.adam")

// Load with column projection for efficiency
val projection = Projection(AlignmentRecordField.readName,
                            AlignmentRecordField.sequence,
                            AlignmentRecordField.readMapped)
val projectedAlignments = sc.loadParquet[AlignmentRecord](
  "alignments.adam",
  optProjection = Some(projection)
)

// Load with predicate pushdown. readMapped is a boolean field, so it needs
// booleanColumn; FilterApi.eq is qualified because a bare `eq` resolves to AnyRef.eq.
val mappedFilter = FilterApi.eq(FilterApi.booleanColumn("readMapped"), java.lang.Boolean.TRUE)
val mappedReads = sc.loadParquet[AlignmentRecord](
  "alignments.adam",
  optPredicate = Some(mappedFilter)
)
```

### Format Validation and Error Handling

Comprehensive validation and error handling across all supported formats.

```scala { .api }
/**
 * Validation stringency levels for format compliance
 * (shown in simplified form; ADAM uses the Java enum htsjdk.samtools.ValidationStringency)
 */
object ValidationStringency extends Enumeration {
  /** Fail immediately on any format violations */
  val STRICT = Value

  /** Log warnings for format violations but continue processing */
  val LENIENT = Value

  /** Ignore format violations silently */
  val SILENT = Value
}

/**
 * SAM format output types
 * (shown in simplified form; ADAM uses the Java enum org.seqdoop.hadoop_bam.SAMFormat)
 */
object SAMFormat extends Enumeration {
  val SAM = Value   // Plain text SAM
  val BAM = Value   // Binary BAM
  val CRAM = Value  // Reference compressed CRAM
}
```
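
The three stringency levels differ only in what happens when a malformed record is encountered. The toy parser below sketches that behaviour in plain Scala. `StringencySketch`, `Stringency`, and `parseAll` are hypothetical stand-ins rather than ADAM or htsjdk code, and "a line that is not an integer" stands in for a real format violation.

```scala
object StringencySketch {
  // Hypothetical stand-in for the three stringency levels described above.
  sealed trait Stringency
  case object Strict extends Stringency
  case object Lenient extends Stringency
  case object Silent extends Stringency

  // Parse integers from text lines, treating non-numeric lines as "format violations".
  // Returns the parsed values plus any warnings that were recorded.
  def parseAll(lines: Seq[String], stringency: Stringency): (List[Int], List[String]) = {
    val values = List.newBuilder[Int]
    val warnings = List.newBuilder[String]
    lines.foreach { line =>
      scala.util.Try(line.trim.toInt).toOption match {
        case Some(v) => values += v
        case None => stringency match {
          case Strict => throw new IllegalArgumentException(s"Malformed record: $line")
          case Lenient => warnings += s"Skipping malformed record: $line"
          case Silent => () // drop the record without reporting it
        }
      }
    }
    (values.result(), warnings.result())
  }
}
```

With `Strict` the first bad line aborts the whole load, while `Lenient` and `Silent` both keep the well-formed records and differ only in whether the problem is reported.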

## File Format Performance Considerations

1. **Use indexed files** (BAM/CRAM with .bai/.crai, VCF with .tbi/.csi) for region queries
2. **Prefer ADAM Parquet format** for repeated analysis, since it supports column projection and predicate pushdown
3. **Apply column projection** when loading Parquet to read only the fields you need
4. **Use predicate pushdown** to filter data at the storage level
5. **Consider compression trade-offs**: CRAM for storage, BAM for processing speed
6. **Partition large files** into multiple smaller files for better parallelization

## Supported File Extensions

ADAM automatically detects formats based on file extensions:

- **.sam, .bam, .cram** → Alignment data
- **.vcf, .vcf.gz, .vcf.bgz** → Variant data
- **.fastq, .fq, .fastq.gz** → Sequencing reads
- **.fasta, .fa, .fna** → Reference sequences
- **.bed, .bed.gz** → BED features
- **.gtf, .gtf.gz, .gff3, .gff3.gz** → Gene annotations
- **.wig, .wiggle** → Coverage data
- **.adam** → ADAM Parquet format

When file extensions are ambiguous or missing, ADAM examines file content headers for format detection.
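
The extension table above can be read as a simple suffix lookup. The sketch below shows one way to express it in plain Scala. `FormatDetectionSketch` and `detect` are hypothetical, the category labels are illustrative, and the content-based fallback mentioned above is deliberately left out (the function returns `None` instead).

```scala
object FormatDetectionSketch {
  // Extension-to-category table mirroring the list above. Compound suffixes such as
  // ".vcf.gz" are listed explicitly so compressed files map to the right category.
  private val bySuffix: List[(String, String)] = List(
    ".sam" -> "alignments", ".bam" -> "alignments", ".cram" -> "alignments",
    ".vcf.bgz" -> "variants", ".vcf.gz" -> "variants", ".vcf" -> "variants",
    ".fastq.gz" -> "reads", ".fastq" -> "reads", ".fq" -> "reads",
    ".fasta" -> "reference", ".fa" -> "reference", ".fna" -> "reference",
    ".bed.gz" -> "features", ".bed" -> "features",
    ".gtf.gz" -> "features", ".gtf" -> "features",
    ".gff3.gz" -> "features", ".gff3" -> "features",
    ".wiggle" -> "coverage", ".wig" -> "coverage",
    ".adam" -> "parquet")

  // Return the category for the first matching suffix, or None when the extension
  // is unknown (the point at which content inspection would have to take over).
  def detect(pathName: String): Option[String] =
    bySuffix.collectFirst { case (suffix, kind) if pathName.endsWith(suffix) => kind }
}
```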
