
# Data Loading and I/O

Core functionality for loading genomic data from various file formats and saving transformed results. ADAM Core provides a unified interface for accessing genomic data regardless of the underlying storage format, with support for both local files and distributed storage systems.

## Capabilities

### ADAMContext Entry Point

The main entry point for all data loading operations, automatically added to SparkContext via implicit conversion.

```scala { .api }
/**
 * Implicit conversion that adds ADAM data loading methods to SparkContext
 * @param sc - Spark context to extend
 * @return ADAMContext with genomic data loading capabilities
 */
implicit def sparkContextToADAMContext(sc: SparkContext): ADAMContext
```

### Alignment Data Loading

Load aligned and unaligned sequencing reads from SAM, BAM, and CRAM formats.

```scala { .api }
/**
 * Load alignment records from SAM/BAM/CRAM files
 * @param pathName - Path to alignment file or directory of files
 * @param stringency - Validation stringency for format compliance
 * @return AlignmentRecordRDD containing sequencing reads
 */
def loadBam(pathName: String,
            stringency: ValidationStringency = ValidationStringency.STRICT): AlignmentRecordRDD

/**
 * Load alignment records from an indexed BAM/CRAM file for specific genomic regions
 * @param pathName - Path to indexed alignment file
 * @param viewRegions - Genomic regions to query
 * @return AlignmentRecordRDD containing reads overlapping the regions
 */
def loadIndexedBam(pathName: String, viewRegions: Iterable[ReferenceRegion]): AlignmentRecordRDD
```

**Usage Examples:**

```scala
import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.models.ReferenceRegion

// Load entire BAM file
val allReads = sc.loadBam("sample.bam")

// Load with lenient validation for malformed files
val reads = sc.loadBam("sample.bam", ValidationStringency.LENIENT)

// Load a specific region from an indexed BAM; the documented signature
// takes an Iterable of regions, so wrap a single region in Seq
val region = ReferenceRegion("chr1", 1000000, 2000000)
val regionReads = sc.loadIndexedBam("sample.bam", Seq(region))
```

### Variant Data Loading

Load genetic variants and genotype information from VCF files.

```scala { .api }
/**
 * Load variant contexts from VCF files with full metadata
 * @param pathName - Path to VCF file or directory of files
 * @param stringency - Validation stringency for VCF format compliance
 * @return VariantContextRDD containing variants with genotype information
 */
def loadVcf(pathName: String,
            stringency: ValidationStringency = ValidationStringency.STRICT): VariantContextRDD

/**
 * Load variants from an indexed VCF file for specific genomic regions
 * @param pathName - Path to indexed VCF file (with .tbi or .csi index)
 * @param viewRegions - Genomic regions to query
 * @return VariantContextRDD containing variants in the specified regions
 */
def loadIndexedVcf(pathName: String, viewRegions: Iterable[ReferenceRegion]): VariantContextRDD
```

**Usage Examples:**

```scala
// Load complete VCF file
val variants = sc.loadVcf("variants.vcf")

// Load from compressed VCF with strict validation
val compressedVariants = sc.loadVcf("variants.vcf.gz", ValidationStringency.STRICT)

// Load a specific chromosomal region; wrap the single region in Seq
// to match the Iterable[ReferenceRegion] parameter
val chrRegion = ReferenceRegion("chr22", 0, 51304566)
val chr22Variants = sc.loadIndexedVcf("variants.vcf.gz", Seq(chrRegion))
```

### Sequence Data Loading

Load raw sequencing data and reference sequences.

```scala { .api }
/**
 * Load FASTQ sequencing reads (single-end or paired-end)
 * @param pathName1 - Path to first FASTQ file (or single-end file)
 * @param optPathName2 - Optional path to second FASTQ file for paired-end reads
 * @param optRecordGroup - Optional read group identifier for the reads
 * @param stringency - Validation stringency for FASTQ format compliance
 * @return AlignmentRecordRDD containing unaligned sequencing reads
 */
def loadFastq(pathName1: String,
              optPathName2: Option[String],
              optRecordGroup: Option[String] = None,
              stringency: ValidationStringency = ValidationStringency.STRICT): AlignmentRecordRDD

/**
 * Load an interleaved FASTQ file where paired reads are alternately arranged
 * @param pathName - Path to interleaved FASTQ file
 * @return AlignmentRecordRDD containing paired sequencing reads
 */
def loadInterleavedFastq(pathName: String): AlignmentRecordRDD

/**
 * Load reference genome sequences from FASTA files
 * @param pathName - Path to FASTA file
 * @param maximumLength - Maximum fragment length; longer sequences are split into fragments
 * @return NucleotideContigFragmentRDD containing reference sequences
 */
def loadFasta(pathName: String, maximumLength: Long = 10000L): NucleotideContigFragmentRDD
```

**Usage Examples:**

```scala
// Load single-end FASTQ; optPathName2 has no default, so pass None
val singleEndReads = sc.loadFastq("reads.fastq", None)

// Load paired-end FASTQ files
val pairedReads = sc.loadFastq("reads_R1.fastq", Some("reads_R2.fastq"))

// Load with custom read group
val readsWithRG = sc.loadFastq("reads.fastq", None, Some("sample1.rg1"))

// Load interleaved paired FASTQ
val interleavedReads = sc.loadInterleavedFastq("paired.fastq")

// Load reference genome
val reference = sc.loadFasta("hg38.fasta", maximumLength = 50000L)
```

### Feature Data Loading

Load genomic annotations and features from various formats.

```scala { .api }
/**
 * Load genomic features from BED, GFF3, GTF, or other supported formats
 * @param pathName - Path to feature file
 * @return FeatureRDD containing genomic annotations
 */
def loadFeatures(pathName: String): FeatureRDD
```

**Usage Examples:**

```scala
// Load BED file annotations
val bedFeatures = sc.loadFeatures("annotations.bed")

// Load GTF gene annotations
val geneFeatures = sc.loadFeatures("genes.gtf")

// Load GFF3 annotations
val gff3Features = sc.loadFeatures("features.gff3")
```

### Format-Agnostic Loading

Load genomic data without specifying the exact format, with automatic format detection.

```scala { .api }
/**
 * Load alignment records with automatic format detection
 * @param pathName - Path to alignment file (SAM/BAM/CRAM/ADAM)
 * @return AlignmentRecordRDD containing sequencing reads
 */
def loadAlignments(pathName: String): AlignmentRecordRDD

/**
 * Load variants with automatic format detection
 * @param pathName - Path to variant file (VCF/ADAM)
 * @return VariantRDD containing genetic variants
 */
def loadVariants(pathName: String): VariantRDD

/**
 * Load genotypes with automatic format detection
 * @param pathName - Path to genotype file (VCF/ADAM)
 * @return GenotypeRDD containing genotype calls
 */
def loadGenotypes(pathName: String): GenotypeRDD
```
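**Usage Examples:**

A short sketch of the format-agnostic loaders; the file paths below are hypothetical, and the format is inferred from the path:

```scala
// The same call works whether the path points at BAM, CRAM, SAM, or ADAM Parquet
val reads = sc.loadAlignments("sample.bam")
val readsFromParquet = sc.loadAlignments("sample.reads.adam")

// Variants and genotypes follow the same pattern for VCF and ADAM Parquet
val variants = sc.loadVariants("calls.vcf")
val genotypes = sc.loadGenotypes("calls.genotypes.adam")
```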

### Parquet Data Loading

Load data from ADAM's native Parquet+Avro format with optional filtering and projection.

```scala { .api }
/**
 * Load data from Parquet files with optional predicate pushdown and column projection
 * @param pathName - Path to Parquet file or directory
 * @param optPredicate - Optional filter predicate applied at the storage level
 * @param optProjection - Optional schema projection to load only specific fields
 * @return RDD of the specified type T
 */
def loadParquet[T](pathName: String,
                   optPredicate: Option[FilterPredicate] = None,
                   optProjection: Option[Schema] = None): RDD[T]
```

**Usage Examples:**

```scala
import org.bdgenomics.adam.projections.{AlignmentRecordField, Projection}
import org.apache.parquet.filter2.predicate.FilterApi

// Load with column projection for efficiency
val projection = Projection(AlignmentRecordField.readName, AlignmentRecordField.readMapped)
val projectedReads = sc.loadParquet[AlignmentRecord]("reads.adam",
                                                     optProjection = Some(projection))

// Load with predicate filtering; readMapped is a boolean field, so use
// booleanColumn with FilterApi.eq
val mappedFilter = FilterApi.eq(FilterApi.booleanColumn("readMapped"),
                                java.lang.Boolean.TRUE)
val mappedReads = sc.loadParquet[AlignmentRecord]("reads.adam",
                                                  optPredicate = Some(mappedFilter))
```

### Advanced Loading Options

Additional configuration options for specialized loading scenarios.

```scala { .api }
// Validation stringency options
object ValidationStringency extends Enumeration {
  val STRICT = Value  // Fail on format violations
  val LENIENT = Value // Log warnings for violations
  val SILENT = Value  // Ignore format violations
}

// SAM format types for saving
object SAMFormat extends Enumeration {
  val SAM = Value  // Plain text SAM
  val BAM = Value  // Binary BAM
  val CRAM = Value // Compressed CRAM
}
```
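For example, the stringency values trade safety against throughput when loading files of uneven quality (the file names below are hypothetical):

```scala
// STRICT fails fast on the first malformed record, which is the safest default
val trusted = sc.loadBam("validated.bam", ValidationStringency.STRICT)

// LENIENT logs a warning per violation and keeps going; SILENT skips quietly.
// Both are useful for legacy data you cannot regenerate.
val messy = sc.loadBam("legacy_export.bam", ValidationStringency.LENIENT)
```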

## Loading Performance Tips

1. **Use indexed files** for region-based queries to avoid scanning entire files
2. **Apply projection** when loading Parquet data to read only the necessary columns
3. **Use predicate pushdown** to filter data at the storage level
4. **Consider validation stringency** - use LENIENT or SILENT for performance-critical applications with trusted data
5. **Partition large datasets** across multiple files for better parallelization
6. **Cache frequently accessed RDDs** in memory using the `.cache()` method
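Tips 2 and 6 combine naturally: project only the columns the analysis needs, then cache the result before repeated queries. A sketch under the `loadParquet` API above; the field choices are illustrative:

```scala
import org.bdgenomics.adam.projections.{AlignmentRecordField, Projection}

// Read only the fields the downstream analysis touches
val slim = Projection(AlignmentRecordField.contigName, AlignmentRecordField.start)
val positions = sc.loadParquet[AlignmentRecord]("reads.adam", optProjection = Some(slim))

// Cache before issuing multiple actions so later passes reuse the in-memory copy
positions.cache()
```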

## Supported File Formats

- **Alignment Data**: SAM, BAM, CRAM, ADAM Parquet
- **Variant Data**: VCF (plain text and compressed), ADAM Parquet
- **Sequence Data**: FASTQ (single-end, paired-end, interleaved), FASTA
- **Feature Data**: BED, GFF3, GTF, IntervalList, NarrowPeak, ADAM Parquet
- **Reference Data**: FASTA, 2bit format