# File Format Support

Comprehensive support for reading and writing genomic file formats, with automatic format detection and validation. ADAM Core bridges legacy genomic file formats with modern distributed processing, providing efficient I/O for all major genomic data types.

## Capabilities

### Alignment File Formats

Support for sequencing alignment data in standard and modern formats.

```scala { .api }
/**
 * Load alignment data from SAM, BAM, or CRAM files.
 * Automatically detects the format from the file extension and magic bytes.
 * @param pathName - Path to alignment file or directory
 * @param stringency - Validation stringency for format compliance
 * @return AlignmentRecordRDD containing alignment records
 */
def loadBam(pathName: String,
            stringency: ValidationStringency = ValidationStringency.STRICT): AlignmentRecordRDD

/**
 * Save alignment records in SAM, BAM, or CRAM format.
 * @param pathName - Output path
 * @param asType - Output format specification
 * @param asSingleFile - Whether to merge output into a single file
 * @param stringency - Validation stringency for output compliance
 */
def saveAsSam(pathName: String,
              asType: SAMFormat = SAMFormat.SAM,
              asSingleFile: Boolean = false,
              stringency: ValidationStringency = ValidationStringency.STRICT): Unit
```

**Supported Alignment Formats:**

- **SAM**: Sequence Alignment/Map plain text format
- **BAM**: Binary compressed SAM format
- **CRAM**: Reference-based compressed alignment format
- **ADAM Parquet**: Native columnar format with schema evolution support

**Usage Examples:**

```scala
import org.bdgenomics.adam.rdd.ADAMContext._

// Load various alignment formats
val samData = sc.loadBam("alignments.sam")
val bamData = sc.loadBam("alignments.bam")
val cramData = sc.loadBam("alignments.cram")

// Save in different formats
samData.saveAsSam("output.sam", SAMFormat.SAM)
samData.saveAsSam("output.bam", SAMFormat.BAM)
samData.saveAsSam("output.cram", SAMFormat.CRAM)

// Save in ADAM's native format
samData.saveAsParquet("alignments.adam")
```

### Variant File Formats

Support for genetic variant data with full VCF specification compliance.

```scala { .api }
/**
 * Load variant data from VCF files (plain text or compressed).
 * Supports VCF 4.0+ specifications with full header parsing.
 * @param pathName - Path to VCF file or directory
 * @param stringency - Validation stringency for VCF compliance
 * @return VariantContextRDD containing variants with metadata
 */
def loadVcf(pathName: String,
            stringency: ValidationStringency = ValidationStringency.STRICT): VariantContextRDD

/**
 * Save variant data as VCF.
 * @param pathName - Output path
 * @param stringency - Validation stringency for VCF compliance
 * @param asSingleFile - Whether to merge output into a single file
 */
def saveAsVcf(pathName: String,
              stringency: ValidationStringency = ValidationStringency.STRICT,
              asSingleFile: Boolean = false): Unit
```

**Supported Variant Formats:**

- **VCF**: Variant Call Format (plain text)
- **VCF.gz**: Compressed VCF with bgzip compression
- **BCF**: Binary VCF format (via conversion)
- **ADAM Parquet**: Native columnar variant storage

**Usage Examples:**

```scala
// Load VCF files
val variants = sc.loadVcf("variants.vcf")
val compressedVariants = sc.loadVcf("variants.vcf.gz")

// Work with different variant representations
val variantOnly = variants.toVariants() // Just the variant sites
val genotypes = variants.toGenotypes()  // Genotype calls

// Save in various formats
variants.saveAsVcf("output.vcf")
variantOnly.saveAsParquet("variants.adam")
genotypes.saveAsParquet("genotypes.adam")
```

### Sequence File Formats

Support for raw sequencing data and reference genomes.

```scala { .api }
/**
 * Load FASTQ sequencing data (single-end or paired-end).
 * @param pathName1 - First FASTQ file (or the single-end file)
 * @param optPathName2 - Optional second FASTQ file for paired-end data
 * @return AlignmentRecordRDD containing unaligned reads
 */
def loadFastq(pathName1: String, optPathName2: Option[String] = None): AlignmentRecordRDD

/**
 * Load interleaved FASTQ, where paired reads alternate.
 * @param pathName - Path to interleaved FASTQ file
 * @return AlignmentRecordRDD containing paired reads
 */
def loadInterleavedFastq(pathName: String): AlignmentRecordRDD

/**
 * Load reference genome sequences from FASTA files.
 * @param pathName - Path to FASTA file
 * @param maximumLength - Maximum sequence length to load per record
 * @return NucleotideContigFragmentRDD containing reference sequences
 */
def loadFasta(pathName: String, maximumLength: Long = 10000L): NucleotideContigFragmentRDD

/**
 * Save reads as FASTQ.
 * @param pathName - Output path
 * @param outputOriginalBaseQualities - Use original rather than recalibrated qualities
 * @param asSingleFile - Whether to merge output into a single file
 */
def saveAsFastq(pathName: String,
                outputOriginalBaseQualities: Boolean = false,
                asSingleFile: Boolean = false): Unit

/**
 * Save reference sequences as FASTA.
 * @param pathName - Output path
 * @param lineWidth - Bases per line in the output
 */
def saveAsFasta(pathName: String, lineWidth: Int = 60): Unit
```

**Supported Sequence Formats:**

- **FASTQ**: Raw sequencing reads with quality scores
- **FASTA**: Reference genome sequences
- **2bit**: Compact reference genome format (loading only)

**Usage Examples:**

```scala
// Load sequencing data
val singleEnd = sc.loadFastq("reads.fastq")
val pairedEnd = sc.loadFastq("R1.fastq", Some("R2.fastq"))
val interleaved = sc.loadInterleavedFastq("paired.fastq")

// Load a reference genome
val reference = sc.loadFasta("hg38.fasta")

// Filter and save processed reads
val processed = singleEnd.transform(_.filter(_.getReadName.startsWith("good")))
processed.saveAsFastq("filtered_reads.fastq")

// Save reference sequences
reference.saveAsFasta("output_reference.fasta", lineWidth = 80)
```

### Feature Annotation Formats

Support for genomic feature annotations in multiple standard formats.

```scala { .api }
/**
 * Load genomic features with automatic format detection.
 * Supports BED, GFF3, GTF, IntervalList, and NarrowPeak formats.
 * @param pathName - Path to feature file
 * @return FeatureRDD containing genomic annotations
 */
def loadFeatures(pathName: String): FeatureRDD

/**
 * Save features as BED format.
 * @param pathName - Output path
 */
def saveAsBed(pathName: String): Unit

/**
 * Save features as GTF (Gene Transfer Format).
 * @param pathName - Output path
 */
def saveAsGtf(pathName: String): Unit

/**
 * Save features as GFF3 (General Feature Format, version 3).
 * @param pathName - Output path
 */
def saveAsGff3(pathName: String): Unit

/**
 * Save features as Picard IntervalList format.
 * @param pathName - Output path
 */
def saveAsIntervalList(pathName: String): Unit

/**
 * Save features as ENCODE narrowPeak format.
 * @param pathName - Output path
 */
def saveAsNarrowPeak(pathName: String): Unit
```

**Supported Feature Formats:**

- **BED**: Browser Extensible Data format
- **GTF**: Gene Transfer Format for gene annotations
- **GFF3**: General Feature Format, version 3
- **IntervalList**: Picard toolkit interval format
- **NarrowPeak**: ENCODE ChIP-seq peak format
- **ADAM Parquet**: Native columnar feature storage

**Usage Examples:**

```scala
// Load various annotation formats
val bedFeatures = sc.loadFeatures("regions.bed")
val geneAnnotations = sc.loadFeatures("genes.gtf")
val gff3Features = sc.loadFeatures("annotations.gff3")

// Convert between formats
bedFeatures.saveAsGtf("converted.gtf")
geneAnnotations.saveAsBed("genes.bed")

// Filter and save
val exons = geneAnnotations.transform(_.filter(_.getFeatureType == "exon"))
exons.saveAsIntervalList("exons.interval_list")
```

### Coverage and Depth Formats

Support for sequencing depth and coverage data in genome browser visualization formats.

```scala { .api }
/**
 * Save coverage data as WIG format for genome browsers.
 * @param pathName - Output path
 */
def saveAsWig(pathName: String): Unit

/**
 * Save coverage data as BigWig format (via conversion).
 * @param pathName - Output path
 * @param sequenceDictionary - Reference sequence information
 */
def saveAsBigWig(pathName: String, sequenceDictionary: SequenceDictionary): Unit
```

**Usage Examples:**

```scala
// Generate and save coverage
val alignments = sc.loadBam("sample.bam")
val coverage = alignments.toCoverage()

// Save for genome browser visualization
coverage.saveAsWig("coverage.wig")

// Convert features to coverage
val features = sc.loadFeatures("peaks.bed")
val featureCoverage = features.toCoverage()
featureCoverage.saveAsWig("peak_coverage.wig")
```
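The BigWig path is not exercised above. A hedged sketch follows; the `sequences` accessor on the loaded alignments is an assumption about the GenomicRDD API, so verify it against your ADAM version:

```scala
// BigWig output needs reference sequence metadata. Alignments loaded
// from a BAM carry a sequence dictionary that can be reused here.
val alignments = sc.loadBam("sample.bam")
val coverage = alignments.toCoverage()

// Write browser-ready BigWig using the alignments' own dictionary.
coverage.saveAsBigWig("coverage.bw", alignments.sequences)
```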

### Format-Agnostic Loading

Automatic format detection for seamless data loading regardless of file format.

```scala { .api }
/**
 * Load alignment data with automatic format detection.
 * Detects SAM, BAM, CRAM, or ADAM formats.
 * @param pathName - Path to alignment file
 * @return AlignmentRecordRDD
 */
def loadAlignments(pathName: String): AlignmentRecordRDD

/**
 * Load variant data with automatic format detection.
 * Detects VCF or ADAM variant formats.
 * @param pathName - Path to variant file
 * @return VariantRDD
 */
def loadVariants(pathName: String): VariantRDD

/**
 * Load genotype data with automatic format detection.
 * Detects VCF or ADAM genotype formats.
 * @param pathName - Path to genotype file
 * @return GenotypeRDD
 */
def loadGenotypes(pathName: String): GenotypeRDD

/**
 * Load feature data with automatic format detection.
 * Detects BED, GTF, GFF3, or ADAM feature formats.
 * @param pathName - Path to feature file
 * @return FeatureRDD
 */
def loadFeatures(pathName: String): FeatureRDD
```

**Usage Examples:**

```scala
// Load without specifying the format
val alignments = sc.loadAlignments("unknown_format_file") // Auto-detects
val variants = sc.loadVariants("variants_file")           // Auto-detects
val features = sc.loadFeatures("annotations_file")        // Auto-detects

// Particularly useful for processing directories with mixed formats
val mixedAlignments = sc.loadAlignments("alignment_directory/")
```

### ADAM Native Parquet Format

ADAM's high-performance columnar storage format with schema evolution support.

```scala { .api }
/**
 * Load data from ADAM Parquet files with optional projection and filtering.
 * @param pathName - Path to Parquet file or directory
 * @param optPredicate - Optional predicate for storage-level (pushdown) filtering
 * @param optProjection - Optional column projection for efficiency
 * @return RDD of the specified type
 */
def loadParquet[T](pathName: String,
                   optPredicate: Option[FilterPredicate] = None,
                   optProjection: Option[Schema] = None): RDD[T]

/**
 * Save any GenomicRDD as ADAM Parquet.
 * @param pathName - Output path
 */
def saveAsParquet(pathName: String): Unit
```

**Usage Examples:**

```scala
import org.bdgenomics.adam.projections.{AlignmentRecordField, Projection}
import org.apache.parquet.filter2.predicate.FilterApi
import org.bdgenomics.formats.avro.AlignmentRecord

// Save as ADAM format
val alignments = sc.loadBam("input.bam")
alignments.saveAsParquet("alignments.adam")

// Load with column projection for efficiency
val projection = Projection(AlignmentRecordField.readName,
                            AlignmentRecordField.sequence,
                            AlignmentRecordField.readMapped)
val projectedAlignments = sc.loadParquet[AlignmentRecord](
  "alignments.adam",
  optProjection = Some(projection)
)

// Load with server-side filtering (predicate pushdown).
// readMapped is a boolean column, so use booleanColumn and FilterApi.eq.
val mappedFilter = FilterApi.eq(FilterApi.booleanColumn("readMapped"),
                                java.lang.Boolean.TRUE)
val mappedReads = sc.loadParquet[AlignmentRecord](
  "alignments.adam",
  optPredicate = Some(mappedFilter)
)
```

### Format Validation and Error Handling

Comprehensive validation and error handling across all supported formats.

```scala { .api }
/**
 * Validation stringency levels for format compliance
 */
object ValidationStringency extends Enumeration {
  /** Fail immediately on any format violation */
  val STRICT = Value

  /** Log warnings for format violations but continue processing */
  val LENIENT = Value

  /** Ignore format violations silently */
  val SILENT = Value
}

/**
 * SAM format output types
 */
object SAMFormat extends Enumeration {
  val SAM = Value  // Plain text SAM
  val BAM = Value  // Binary BAM
  val CRAM = Value // Reference-compressed CRAM
}
```
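The stringency levels above apply to any loader that accepts a `stringency` argument. A minimal sketch (the file names are hypothetical):

```scala
import org.bdgenomics.adam.rdd.ADAMContext._

// STRICT (the default) fails fast on the first malformed record.
// LENIENT logs a warning per violation but keeps processing, which is
// often the pragmatic choice for legacy BAMs with nonstandard headers.
val reads = sc.loadBam("legacy_sample.bam",
                       stringency = ValidationStringency.LENIENT)

// SILENT suppresses the warnings entirely; use it only when the
// violations are known and expected.
val variants = sc.loadVcf("noisy_calls.vcf",
                          stringency = ValidationStringency.SILENT)
```

In production pipelines, prefer STRICT so format problems surface at load time rather than as silent data loss downstream.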

## File Format Performance Considerations

1. **Use indexed files** (BAM/CRAM with .bai/.crai, VCF with .tbi/.csi) for efficient region queries
2. **Prefer ADAM Parquet** for repeated analysis; it offers the best scan performance
3. **Apply column projection** when loading Parquet to read only the fields you need
4. **Use predicate pushdown** to filter data at the storage layer
5. **Weigh compression trade-offs**: CRAM for storage, BAM for processing speed
6. **Partition large files** into multiple smaller files for better parallelism
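As a sketch of point 1, ADAM's region-query loader can exploit a BAM index. `loadIndexedBam` and the coordinates below are illustrative assumptions; check the `ADAMContext` of your ADAM version:

```scala
import org.bdgenomics.adam.models.ReferenceRegion
import org.bdgenomics.adam.rdd.ADAMContext._

// With sample.bam.bai alongside sample.bam, only the compressed blocks
// overlapping the requested region are read from disk.
val region = ReferenceRegion("chr20", 1000000L, 2000000L)
val regionReads = sc.loadIndexedBam("sample.bam", region)

// By contrast, a full scan reads the whole file before any filtering.
val allReads = sc.loadBam("sample.bam")
```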

## Supported File Extensions

ADAM automatically detects formats based on file extension:

- **.sam, .bam, .cram** → Alignment data
- **.vcf, .vcf.gz, .vcf.bgz** → Variant data
- **.fastq, .fq, .fastq.gz** → Sequencing reads
- **.fasta, .fa, .fna** → Reference sequences
- **.bed, .bed.gz** → BED features
- **.gtf, .gtf.gz, .gff3, .gff3.gz** → Gene annotations
- **.wig, .wiggle** → Coverage data
- **.adam** → ADAM Parquet format

When a file extension is ambiguous or missing, ADAM examines the file's content headers to detect the format.