# Genomic Data Loading

`JavaADAMContext` loads genomic data from a range of file formats. It selects a loader from the file extension and, for most formats, accepts a validation stringency that controls how malformed input is handled.

## Required Imports

```java
import org.apache.spark.api.java.JavaSparkContext;
import org.bdgenomics.adam.api.java.JavaADAMContext;
import org.bdgenomics.adam.rdd.ADAMContext;
import htsjdk.samtools.ValidationStringency;
```

## Capabilities

### JavaADAMContext Class

Main context class providing Java-friendly methods for genomic data operations.

```java { .api }
/**
 * Java-friendly wrapper for ADAM Context providing genomic data loading capabilities
 */
class JavaADAMContext {
    /**
     * Creates a JavaADAMContext wrapping the provided ADAMContext
     * @param ac The ADAMContext to wrap
     */
    JavaADAMContext(ADAMContext ac);

    /**
     * Returns the Java Spark Context associated with this context
     * @return JavaSparkContext for Spark operations
     */
    JavaSparkContext getSparkContext();
}
```

### Alignment Data Loading

Load sequencing alignment data from various formats with automatic format detection.

```java { .api }
/**
 * Load alignment records with automatic format detection
 * Supports: .bam/.cram/.sam (BAM/CRAM/SAM), .fa/.fasta (FASTA),
 *           .fq/.fastq (FASTQ), .ifq (interleaved FASTQ)
 * Falls back to Parquet + Avro for unrecognized extensions
 * @param pathName Path to alignment file(s). Supports globs and directories
 * @return AlignmentRecordRDD containing reads, sequence dictionary, and record groups
 */
AlignmentRecordRDD loadAlignments(String pathName);

/**
 * Load alignment records with validation stringency control
 * @param pathName Path to alignment file(s)
 * @param stringency Validation strictness (LENIENT, SILENT, STRICT)
 * @return AlignmentRecordRDD containing reads, sequence dictionary, and record groups
 */
AlignmentRecordRDD loadAlignments(String pathName, ValidationStringency stringency);
```
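
The extension-based dispatch described for `loadAlignments` can be pictured with a small standalone sketch. `AlignmentFormatSketch` and its `detect` method are hypothetical names invented for illustration; this is not ADAM's implementation, and it ignores compression suffixes such as `.gz`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: a hypothetical helper, not part of ADAM. */
final class AlignmentFormatSketch {
    // Extension-to-format table copied from the loadAlignments description.
    private static final Map<String, String> FORMATS = new LinkedHashMap<>();
    static {
        FORMATS.put(".bam", "BAM");
        FORMATS.put(".cram", "CRAM");
        FORMATS.put(".sam", "SAM");
        FORMATS.put(".fa", "FASTA");
        FORMATS.put(".fasta", "FASTA");
        FORMATS.put(".fq", "FASTQ");
        FORMATS.put(".fastq", "FASTQ");
        FORMATS.put(".ifq", "INTERLEAVED_FASTQ");
    }

    /** Returns the format implied by the extension, or PARQUET as the fallback. */
    static String detect(String pathName) {
        for (Map.Entry<String, String> entry : FORMATS.entrySet()) {
            if (pathName.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return "PARQUET"; // unrecognized extension: assume Parquet + Avro
    }

    public static void main(String[] args) {
        System.out.println(detect("sample.bam"));
        System.out.println(detect("reads.ifq"));
        System.out.println(detect("alignments.adam"));
    }
}
```

The point of the sketch is the fallback: any path whose extension is not in the table is treated as Parquet rather than rejected.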

### Reference Sequence Loading

Load reference genome sequences and create broadcastable reference files.

```java { .api }
/**
 * Load nucleotide contig fragments from reference sequences
 * Supports: .fa/.fasta (FASTA format)
 * Falls back to Parquet + Avro for other extensions
 * @param pathName Path to reference file(s). Supports globs and directories for FASTA
 * @return NucleotideContigFragmentRDD containing reference sequences
 */
NucleotideContigFragmentRDD loadContigFragments(String pathName);

/**
 * Load reference sequences into broadcastable format
 * Supports: .2bit files directly, other formats via loadContigFragments
 * @param pathName Path to reference file (no globs/directories for 2bit)
 * @return ReferenceFile for broadcast operations
 */
ReferenceFile loadReferenceFile(String pathName);

/**
 * Load reference sequences with custom maximum fragment length
 * @param pathName Path to reference file
 * @param maximumLength Maximum fragment length (default 10000L, avoid >1e9)
 * @return ReferenceFile for broadcast operations
 */
ReferenceFile loadReferenceFile(String pathName, Long maximumLength);
```

### Fragment Loading

Load paired-end sequencing fragments from alignment files.

```java { .api }
/**
 * Load fragments from alignment data
 * Supports: .bam/.cram/.sam (BAM/CRAM/SAM), .ifq (interleaved FASTQ)
 * Falls back to Parquet + Avro for other extensions
 * @param pathName Path to fragment file(s). Supports globs and directories
 * @return FragmentRDD containing paired-end fragments
 */
FragmentRDD loadFragments(String pathName);

/**
 * Load fragments with validation stringency control
 * @param pathName Path to fragment file(s)
 * @param stringency Validation strictness for BAM/CRAM/SAM and FASTQ formats
 * @return FragmentRDD containing paired-end fragments
 */
FragmentRDD loadFragments(String pathName, ValidationStringency stringency);
```

### Feature Data Loading

Load genomic annotations and feature data from various annotation formats.

```java { .api }
/**
 * Load genomic features from annotation files
 * Supports: .bed (BED6/12), .gff3 (GFF3), .gtf/.gff (GTF/GFF2),
 *           .narrow[pP]eak (NarrowPeak), .interval_list (IntervalList)
 * Falls back to Parquet + Avro for other extensions
 * @param pathName Path to feature file(s). Supports globs and directories
 * @return FeatureRDD containing genomic annotations
 */
FeatureRDD loadFeatures(String pathName);

/**
 * Load features with validation stringency control
 * @param pathName Path to feature file(s)
 * @param stringency Validation strictness for supported text formats
 * @return FeatureRDD containing genomic annotations
 */
FeatureRDD loadFeatures(String pathName, ValidationStringency stringency);
```

### Coverage Data Loading

Load genomic coverage data, converting features to coverage information.

```java { .api }
/**
 * Load features and convert to coverage data
 * Coverage values are stored in the score field of Feature records
 * Supports same formats as loadFeatures
 * @param pathName Path to coverage file(s). Supports globs and directories
 * @return CoverageRDD containing coverage depth information
 */
CoverageRDD loadCoverage(String pathName);

/**
 * Load coverage data with validation stringency control
 * @param pathName Path to coverage file(s)
 * @param stringency Validation strictness for supported text formats
 * @return CoverageRDD containing coverage depth information
 */
CoverageRDD loadCoverage(String pathName, ValidationStringency stringency);
```

### Variant Data Loading

Load genetic variation data from VCF files or Parquet format.

```java { .api }
/**
 * Load genotype calls from variant files
 * Supports: .vcf/.vcf.gz/.vcf.bgzf/.vcf.bgz (VCF format)
 * Falls back to Parquet + Avro for other extensions
 * @param pathName Path to genotype file(s). Supports globs and directories for VCF
 * @return GenotypeRDD containing sample genotype calls
 */
GenotypeRDD loadGenotypes(String pathName);

/**
 * Load genotypes with validation stringency control
 * @param pathName Path to genotype file(s)
 * @param stringency Validation strictness for VCF format
 * @return GenotypeRDD containing sample genotype calls
 */
GenotypeRDD loadGenotypes(String pathName, ValidationStringency stringency);

/**
 * Load variant records from variant files
 * Supports: .vcf/.vcf.gz/.vcf.bgzf/.vcf.bgz (VCF format)
 * Falls back to Parquet + Avro for other extensions
 * @param pathName Path to variant file(s). Supports globs and directories for VCF
 * @return VariantRDD containing genetic variations
 */
VariantRDD loadVariants(String pathName);

/**
 * Load variants with validation stringency control
 * @param pathName Path to variant file(s)
 * @param stringency Validation strictness for VCF format
 * @return VariantRDD containing genetic variations
 */
VariantRDD loadVariants(String pathName, ValidationStringency stringency);
```

## Usage Examples

**Basic alignment loading:**

```java
// adamContext is an existing ADAMContext for the running Spark application
JavaADAMContext jac = new JavaADAMContext(adamContext);

// Load BAM file with default settings
AlignmentRecordRDD alignments = jac.loadAlignments("sample.bam");

// Load with strict validation
AlignmentRecordRDD strictAlignments = jac.loadAlignments("sample.bam",
    ValidationStringency.STRICT);

// Load compressed FASTQ
AlignmentRecordRDD fastqReads = jac.loadAlignments("reads.fastq.gz");
```

**Reference and feature loading:**

```java
// Load reference genome
NucleotideContigFragmentRDD reference = jac.loadContigFragments("hg38.fa");

// Create broadcastable reference
ReferenceFile refFile = jac.loadReferenceFile("hg38.2bit");

// Load genomic annotations
FeatureRDD genes = jac.loadFeatures("gencode.gtf");
CoverageRDD coverage = jac.loadCoverage("sample.bed");
```

**Variant data loading:**

```java
// Load VCF file
VariantRDD variants = jac.loadVariants("variants.vcf.gz");
GenotypeRDD genotypes = jac.loadGenotypes("variants.vcf.gz");

// With lenient validation for problematic files
VariantRDD lenientVariants = jac.loadVariants("problematic.vcf",
    ValidationStringency.LENIENT);
```

## File Format Support Details

### Compression Support
All text-based formats support Hadoop compression codecs:
- `.gz` (gzip)
- `.bz2` (bzip2)
- Additional codecs as configured in Hadoop environment
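
One way to think about a compressed text file such as `reads.fastq.gz` is that the codec suffix is peeled off before the underlying format is identified. The sketch below shows that idea only; `CompressionSuffixSketch` is a hypothetical helper, and how ADAM or Hadoop actually resolves codecs is not specified in this document.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: a hypothetical helper, not part of ADAM or Hadoop. */
final class CompressionSuffixSketch {
    // The two codec suffixes named in the list above.
    private static final List<String> CODEC_SUFFIXES = Arrays.asList(".gz", ".bz2");

    /** Removes one trailing codec suffix, if present, and returns the remaining path. */
    static String stripCodecSuffix(String pathName) {
        for (String suffix : CODEC_SUFFIXES) {
            if (pathName.endsWith(suffix)) {
                return pathName.substring(0, pathName.length() - suffix.length());
            }
        }
        return pathName; // no known codec suffix: leave the path unchanged
    }

    public static void main(String[] args) {
        System.out.println(stripCodecSuffix("reads.fastq.gz"));
        System.out.println(stripCodecSuffix("sample.bed.bz2"));
        System.out.println(stripCodecSuffix("sample.bam"));
    }
}
```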

### Path Patterns
- **Globs supported**: `*.bam`, `sample*.vcf.gz`
- **Directories supported**: Load all compatible files in directory
- **Single files**: Direct file paths
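
To preview which file names a glob such as `sample*.vcf.gz` would select, the pattern can be tried locally with the JDK's `PathMatcher`. This is a stand-in for experimentation: Hadoop's glob syntax is similar but not guaranteed to be identical, and `GlobSketch` is a hypothetical helper name.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

/** Illustrative only: tests file names against a glob using java.nio, not Hadoop. */
final class GlobSketch {
    /** Returns true when the file name matches the glob pattern. */
    static boolean matches(String glob, String fileName) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(fileName));
    }

    public static void main(String[] args) {
        System.out.println(matches("*.bam", "sample.bam"));
        System.out.println(matches("sample*.vcf.gz", "sample01.vcf.gz"));
        System.out.println(matches("*.bam", "sample.vcf"));
    }
}
```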

### Validation Stringency Options
- **STRICT**: Fail on any format violations
- **LENIENT**: Warn on format issues but continue processing
- **SILENT**: Ignore format violations silently
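
The three levels can be illustrated with a toy line parser. Everything in this sketch is hypothetical: `StringencySketch`, its local `Stringency` enum (a stand-in for `htsjdk.samtools.ValidationStringency`), and the simplified three-column input are assumptions made for the example, not ADAM's parsing code.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: shows how a parser might react to each stringency level. */
final class StringencySketch {
    /** Local stand-in mirroring the three htsjdk constants. */
    enum Stringency { STRICT, LENIENT, SILENT }

    /** Collected warnings, so the LENIENT path can be observed. */
    static final List<String> WARNINGS = new ArrayList<>();

    /** Parses "contig start end"; returns {start, end}, or null if a bad line is skipped. */
    static long[] parseInterval(String line, Stringency stringency) {
        String[] fields = line.trim().split("\\s+");
        try {
            if (fields.length < 3) {
                throw new NumberFormatException("expected at least 3 fields");
            }
            return new long[] { Long.parseLong(fields[1]), Long.parseLong(fields[2]) };
        } catch (NumberFormatException e) {
            switch (stringency) {
                case STRICT:
                    // Fail on the first violation.
                    throw new IllegalArgumentException("Malformed line: " + line, e);
                case LENIENT:
                    // Record a warning and keep going.
                    WARNINGS.add("Skipping malformed line: " + line);
                    return null;
                default:
                    // SILENT: skip without reporting.
                    return null;
            }
        }
    }
}
```

With `STRICT` the first malformed line aborts the parse, whereas `LENIENT` and `SILENT` both skip it and differ only in whether a warning is recorded.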

## Types

```java { .api }
/**
 * Validation stringency enumeration for controlling file format validation
 */
enum ValidationStringency {
    STRICT,   // Fail on any format violations
    LENIENT,  // Warn on format issues but continue processing
    SILENT    // Ignore format violations silently
}
```

## Error Handling

Loading methods may throw exceptions for:
- File not found or inaccessible
- Unsupported file format
- Validation failures (when using STRICT stringency)
- Insufficient memory for large reference files
- Hadoop configuration issues for compressed files
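
Since the concrete exception types are not listed here, one defensive pattern is to wrap each load call and rethrow with the offending path attached. The sketch below assumes the failures surface as unchecked exceptions; `LoadErrorSketch` and `loadWithContext` are hypothetical names, and the loader is passed in as a plain `Function` so the example runs without ADAM on the classpath.

```java
import java.util.function.Function;

/** Illustrative only: adds path context to failures raised by a loader function. */
final class LoadErrorSketch {
    /** Runs the loader and, on failure, rethrows with the path in the message. */
    static <T> T loadWithContext(String pathName, Function<String, T> loader) {
        try {
            return loader.apply(pathName);
        } catch (RuntimeException e) {
            throw new IllegalStateException(
                "Failed to load " + pathName + ": " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        // A stand-in loader that always fails, to show the wrapped message.
        try {
            loadWithContext("missing.bam", path -> {
                throw new IllegalArgumentException("file not found");
            });
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In real use the loader argument would be one of the single-argument load methods, for example `jac::loadAlignments`. Because Spark evaluates lazily, some failures may only surface when an action runs on the returned RDD rather than at load time.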