# Data Loading and I/O

Core functionality for loading genomic data from various file formats and saving transformed results. ADAM Core provides a unified interface for accessing genomic data regardless of the underlying storage format, with support for both local files and distributed storage systems.

## Capabilities
### ADAMContext Entry Point

The main entry point for all data loading operations, automatically added to SparkContext via implicit conversion.

```scala { .api }
/**
 * Implicit conversion that adds ADAM data loading methods to SparkContext
 * @param sc - Spark context to extend
 * @return ADAMContext with genomic data loading capabilities
 */
implicit def sparkContextToADAMContext(sc: SparkContext): ADAMContext
```
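With the implicit conversion in scope, any SparkContext gains the loading methods shown throughout this document. A minimal sketch, assuming a local Spark master and a hypothetical `sample.bam`:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.bdgenomics.adam.rdd.ADAMContext._

val conf = new SparkConf().setAppName("adam-loading").setMaster("local[*]")
val sc = new SparkContext(conf)

// No explicit wrapping needed: the implicit conversion to ADAMContext
// makes loadBam (and the other loaders below) available directly on sc
val reads = sc.loadBam("sample.bam")
```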
### Alignment Data Loading

Load aligned and unaligned sequencing reads from SAM, BAM, and CRAM formats.

```scala { .api }
/**
 * Load alignment records from SAM/BAM/CRAM files
 * @param pathName - Path to alignment file or directory of files
 * @param stringency - Validation stringency for format compliance
 * @return AlignmentRecordRDD containing sequencing reads
 */
def loadBam(pathName: String,
            stringency: ValidationStringency = ValidationStringency.STRICT): AlignmentRecordRDD

/**
 * Load alignment records from an indexed BAM/CRAM file for specific genomic regions
 * @param pathName - Path to indexed alignment file
 * @param viewRegions - Genomic regions to query
 * @return AlignmentRecordRDD containing reads overlapping the regions
 */
def loadIndexedBam(pathName: String, viewRegions: Iterable[ReferenceRegion]): AlignmentRecordRDD
```
**Usage Examples:**

```scala
import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.models.ReferenceRegion

// Load entire BAM file
val allReads = sc.loadBam("sample.bam")

// Load with lenient validation for malformed files
val reads = sc.loadBam("sample.bam", ValidationStringency.LENIENT)

// Load a specific region from an indexed BAM (requires an accompanying index);
// loadIndexedBam takes an Iterable of regions
val region = ReferenceRegion("chr1", 1000000L, 2000000L)
val regionReads = sc.loadIndexedBam("sample.bam", Iterable(region))
```
### Variant Data Loading

Load genetic variants and genotype information from VCF files.

```scala { .api }
/**
 * Load variant contexts from VCF files with full metadata
 * @param pathName - Path to VCF file or directory of files
 * @param stringency - Validation stringency for VCF format compliance
 * @return VariantContextRDD containing variants with genotype information
 */
def loadVcf(pathName: String,
            stringency: ValidationStringency = ValidationStringency.STRICT): VariantContextRDD

/**
 * Load variants from an indexed VCF file for specific genomic regions
 * @param pathName - Path to indexed VCF file (with .tbi or .csi index)
 * @param viewRegions - Genomic regions to query
 * @return VariantContextRDD containing variants in the specified regions
 */
def loadIndexedVcf(pathName: String, viewRegions: Iterable[ReferenceRegion]): VariantContextRDD
```
**Usage Examples:**

```scala
// Load complete VCF file
val variants = sc.loadVcf("variants.vcf")

// Load from compressed VCF with strict validation
val compressedVariants = sc.loadVcf("variants.vcf.gz", ValidationStringency.STRICT)

// Load a specific chromosomal region from an indexed VCF;
// loadIndexedVcf takes an Iterable of regions
val chrRegion = ReferenceRegion("chr22", 0L, 51304566L)
val chr22Variants = sc.loadIndexedVcf("variants.vcf.gz", Iterable(chrRegion))
```
### Sequence Data Loading

Load raw sequencing data and reference sequences.

```scala { .api }
/**
 * Load FASTQ sequencing reads (single-end or paired-end)
 * @param pathName1 - Path to first FASTQ file (or the sole file for single-end reads)
 * @param optPathName2 - Optional path to second FASTQ file for paired-end reads
 * @param optRecordGroup - Optional read group identifier for the reads
 * @param stringency - Validation stringency for FASTQ format compliance
 * @return AlignmentRecordRDD containing unaligned sequencing reads
 */
def loadFastq(pathName1: String,
              optPathName2: Option[String],
              optRecordGroup: Option[String] = None,
              stringency: ValidationStringency = ValidationStringency.STRICT): AlignmentRecordRDD

/**
 * Load an interleaved FASTQ file in which paired reads alternate
 * @param pathName - Path to interleaved FASTQ file
 * @return AlignmentRecordRDD containing paired sequencing reads
 */
def loadInterleavedFastq(pathName: String): AlignmentRecordRDD

/**
 * Load reference genome sequences from FASTA files
 * @param pathName - Path to FASTA file
 * @param maximumLength - Maximum fragment length; longer sequences are split into multiple fragments
 * @return NucleotideContigFragmentRDD containing reference sequences
 */
def loadFasta(pathName: String, maximumLength: Long = 10000L): NucleotideContigFragmentRDD
```
**Usage Examples:**

```scala
// Load single-end FASTQ (no mate file, so pass None for the second path)
val singleEndReads = sc.loadFastq("reads.fastq", None)

// Load paired-end FASTQ files
val pairedReads = sc.loadFastq("reads_R1.fastq", Some("reads_R2.fastq"))

// Load with custom read group
val readsWithRG = sc.loadFastq("reads.fastq", None, Some("sample1.rg1"))

// Load interleaved paired FASTQ
val interleavedReads = sc.loadInterleavedFastq("paired.fastq")

// Load a reference genome, splitting sequences into fragments of at most 50 kbp
val reference = sc.loadFasta("hg38.fasta", maximumLength = 50000L)
```
### Feature Data Loading

Load genomic annotations and features from various formats.

```scala { .api }
/**
 * Load genomic features from BED, GFF3, GTF, or other supported formats
 * @param pathName - Path to feature file
 * @return FeatureRDD containing genomic annotations
 */
def loadFeatures(pathName: String): FeatureRDD
```
**Usage Examples:**

```scala
// Load BED file annotations
val bedFeatures = sc.loadFeatures("annotations.bed")

// Load GTF gene annotations
val geneFeatures = sc.loadFeatures("genes.gtf")

// Load GFF3 annotations
val gff3Features = sc.loadFeatures("features.gff3")
```
### Format-Agnostic Loading

Load genomic data without specifying the exact format, with automatic format detection.

```scala { .api }
/**
 * Load alignment records with automatic format detection
 * @param pathName - Path to alignment file (SAM/BAM/CRAM/ADAM)
 * @return AlignmentRecordRDD containing sequencing reads
 */
def loadAlignments(pathName: String): AlignmentRecordRDD

/**
 * Load variants with automatic format detection
 * @param pathName - Path to variant file (VCF/ADAM)
 * @return VariantRDD containing genetic variants
 */
def loadVariants(pathName: String): VariantRDD

/**
 * Load genotypes with automatic format detection
 * @param pathName - Path to genotype file (VCF/ADAM)
 * @return GenotypeRDD containing genotype calls
 */
def loadGenotypes(pathName: String): GenotypeRDD
```
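Because the format is detected from the path, the same call works whether the data sits in a legacy flat file or an ADAM Parquet directory. A short sketch with hypothetical file names:

```scala
// The same loader reads SAM/BAM/CRAM files and ADAM Parquet directories
val readsFromBam  = sc.loadAlignments("sample.bam")
val readsFromAdam = sc.loadAlignments("sample.reads.adam")

// Variants and genotypes follow the same pattern for VCF and Parquet inputs
val variants  = sc.loadVariants("calls.vcf")
val genotypes = sc.loadGenotypes("calls.genotypes.adam")
```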
### Parquet Data Loading

Load data from ADAM's native Parquet+Avro format with optional filtering and projection.

```scala { .api }
/**
 * Load data from Parquet files with optional predicate pushdown and column projection
 * @param pathName - Path to Parquet file or directory
 * @param optPredicate - Optional filter predicate for storage-side filtering
 * @param optProjection - Optional schema projection to load only specific fields
 * @return RDD of the specified type T
 */
def loadParquet[T](pathName: String,
                   optPredicate: Option[FilterPredicate] = None,
                   optProjection: Option[Schema] = None): RDD[T]
```
**Usage Examples:**

```scala
import org.bdgenomics.adam.projections.{AlignmentRecordField, Projection}
import org.apache.parquet.filter2.predicate.FilterApi

// Load with column projection for efficiency
val projection = Projection(AlignmentRecordField.readName, AlignmentRecordField.readMapped)
val projectedReads = sc.loadParquet[AlignmentRecord]("reads.adam",
  optProjection = Some(projection))

// Load with predicate filtering; readMapped is a boolean column, so use
// FilterApi.booleanColumn and FilterApi.eq (qualified to avoid clashing
// with Scala's AnyRef.eq)
val mappedFilter = FilterApi.eq(FilterApi.booleanColumn("readMapped"),
  java.lang.Boolean.TRUE)
val mappedReads = sc.loadParquet[AlignmentRecord]("reads.adam",
  optPredicate = Some(mappedFilter))
```
### Advanced Loading Options

Additional configuration options for specialized loading scenarios. Both option sets are Java enums drawn from ADAM's dependencies (htsjdk and Hadoop-BAM) rather than Scala enumerations.

```scala { .api }
// Validation stringency levels (htsjdk.samtools.ValidationStringency)
ValidationStringency.STRICT   // Fail on format violations
ValidationStringency.LENIENT  // Log warnings for violations
ValidationStringency.SILENT   // Ignore format violations

// SAM format types for saving (org.seqdoop.hadoop_bam.SAMFormat)
SAMFormat.SAM   // Plain-text SAM
SAMFormat.BAM   // Binary BAM
SAMFormat.CRAM  // Reference-compressed CRAM
```
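Although this section focuses on loading, the SAMFormat values come into play when writing alignments back out. A hedged sketch, assuming ADAM's `saveAsSam` method on AlignmentRecordRDD (parameter names may differ across ADAM versions; check your version's API):

```scala
import org.seqdoop.hadoop_bam.SAMFormat

val reads = sc.loadBam("sample.bam")

// Write one merged BAM file instead of a directory of per-partition shards
reads.saveAsSam("out.bam", asType = Some(SAMFormat.BAM), asSingleFile = true)
```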
## Loading Performance Tips

1. **Use indexed files** for region-based queries to avoid scanning entire files
2. **Apply projection** when loading Parquet data to read only the necessary columns
3. **Use predicate pushdown** to filter data at the storage layer
4. **Choose validation stringency deliberately** - use LENIENT or SILENT for performance-critical applications with trusted data
5. **Partition large datasets** across multiple files for better parallelization
6. **Cache frequently accessed RDDs** in memory using the `.cache()` method
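Several of these tips compose naturally. For instance, an indexed region query keeps the initial scan small, and caching the result avoids re-reading the file on repeated actions (paths and coordinates below are illustrative):

```scala
// Tip 1 + tip 6: indexed region access, then cache for repeated analysis
val region = ReferenceRegion("chr20", 0L, 63025520L)
val reads = sc.loadIndexedBam("sample.bam", Iterable(region))
reads.rdd.cache()

val total = reads.rdd.count()  // first action materializes and caches the reads
```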
## Supported File Formats

- **Alignment Data**: SAM, BAM, CRAM, ADAM Parquet
- **Variant Data**: VCF (plain text and compressed), ADAM Parquet
- **Sequence Data**: FASTQ (single-end, paired-end, interleaved), FASTA
- **Feature Data**: BED, GFF3, GTF, IntervalList, NarrowPeak, ADAM Parquet
- **Reference Data**: FASTA, 2bit format