
# Data Loading and I/O

Core functionality for loading genomic data from various file formats and saving transformed results. ADAM Core provides a unified interface for accessing genomic data regardless of the underlying storage format, with support for both local files and distributed storage systems.

## Capabilities

### ADAMContext Entry Point

The main entry point for all data loading operations, automatically added to SparkContext via implicit conversion.

```scala { .api }
/**
 * Implicit conversion that adds ADAM data loading methods to SparkContext
 * @param sc - Spark context to extend
 * @return ADAMContext with genomic data loading capabilities
 */
implicit def sparkContextToADAMContext(sc: SparkContext): ADAMContext
```

### Alignment Data Loading

Load aligned and unaligned sequencing reads from SAM, BAM, and CRAM formats.

```scala { .api }
/**
 * Load alignment records from SAM/BAM/CRAM files
 * @param pathName - Path to alignment file or directory of files
 * @param stringency - Validation stringency for format compliance
 * @return AlignmentRecordRDD containing sequencing reads
 */
def loadBam(pathName: String,
            stringency: ValidationStringency = ValidationStringency.STRICT): AlignmentRecordRDD

/**
 * Load alignment records from an indexed BAM/CRAM file for specific genomic regions
 * @param pathName - Path to indexed alignment file
 * @param viewRegions - Genomic regions to query
 * @return AlignmentRecordRDD containing reads overlapping the regions
 */
def loadIndexedBam(pathName: String, viewRegions: Iterable[ReferenceRegion]): AlignmentRecordRDD
```

**Usage Examples:**

```scala
import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.models.ReferenceRegion

// Load entire BAM file
val allReads = sc.loadBam("sample.bam")

// Load with lenient validation for malformed files
val reads = sc.loadBam("sample.bam", ValidationStringency.LENIENT)

// Load a specific region from an indexed BAM; the documented signature
// takes an Iterable of regions, so wrap a single region in Seq
val region = ReferenceRegion("chr1", 1000000, 2000000)
val regionReads = sc.loadIndexedBam("sample.bam", Seq(region))
```

### Variant Data Loading

Load genetic variants and genotype information from VCF files.

```scala { .api }
/**
 * Load variant contexts from VCF files with full metadata
 * @param pathName - Path to VCF file or directory of files
 * @param stringency - Validation stringency for VCF format compliance
 * @return VariantContextRDD containing variants with genotype information
 */
def loadVcf(pathName: String,
            stringency: ValidationStringency = ValidationStringency.STRICT): VariantContextRDD

/**
 * Load variants from an indexed VCF file for specific genomic regions
 * @param pathName - Path to indexed VCF file (with .tbi or .csi index)
 * @param viewRegions - Genomic regions to query
 * @return VariantContextRDD containing variants in the specified regions
 */
def loadIndexedVcf(pathName: String, viewRegions: Iterable[ReferenceRegion]): VariantContextRDD
```

**Usage Examples:**

```scala
// Load complete VCF file
val variants = sc.loadVcf("variants.vcf")

// Load from compressed VCF with strict validation
val compressedVariants = sc.loadVcf("variants.vcf.gz", ValidationStringency.STRICT)

// Load a specific chromosomal region; wrap the single region in Seq
// to match the Iterable[ReferenceRegion] parameter
val chrRegion = ReferenceRegion("chr22", 0, 51304566)
val chr22Variants = sc.loadIndexedVcf("variants.vcf.gz", Seq(chrRegion))
```

### Sequence Data Loading

Load raw sequencing data and reference sequences.

```scala { .api }
/**
 * Load FASTQ sequencing reads (single-end or paired-end)
 * @param pathName1 - Path to first FASTQ file (or single-end file)
 * @param optPathName2 - Optional path to second FASTQ file for paired-end reads
 * @param optRecordGroup - Optional read group identifier for the reads
 * @param stringency - Validation stringency for FASTQ format compliance
 * @return AlignmentRecordRDD containing unaligned sequencing reads
 */
def loadFastq(pathName1: String,
              optPathName2: Option[String],
              optRecordGroup: Option[String] = None,
              stringency: ValidationStringency = ValidationStringency.STRICT): AlignmentRecordRDD

/**
 * Load an interleaved FASTQ file where paired reads are alternately arranged
 * @param pathName - Path to interleaved FASTQ file
 * @return AlignmentRecordRDD containing paired sequencing reads
 */
def loadInterleavedFastq(pathName: String): AlignmentRecordRDD

/**
 * Load reference genome sequences from FASTA files
 * @param pathName - Path to FASTA file
 * @param maximumLength - Maximum fragment length; longer sequences are split into fragments
 * @return NucleotideContigFragmentRDD containing reference sequences
 */
def loadFasta(pathName: String, maximumLength: Long = 10000L): NucleotideContigFragmentRDD
```

**Usage Examples:**

```scala
// Load single-end FASTQ; optPathName2 has no default, so pass None
val singleEndReads = sc.loadFastq("reads.fastq", None)

// Load paired-end FASTQ files
val pairedReads = sc.loadFastq("reads_R1.fastq", Some("reads_R2.fastq"))

// Load with custom read group
val readsWithRG = sc.loadFastq("reads.fastq", None, Some("sample1.rg1"))

// Load interleaved paired FASTQ
val interleavedReads = sc.loadInterleavedFastq("paired.fastq")

// Load reference genome
val reference = sc.loadFasta("hg38.fasta", maximumLength = 50000L)
```

### Feature Data Loading

Load genomic annotations and features from various formats.

```scala { .api }
/**
 * Load genomic features from BED, GFF3, GTF, or other supported formats
 * @param pathName - Path to feature file
 * @return FeatureRDD containing genomic annotations
 */
def loadFeatures(pathName: String): FeatureRDD
```

**Usage Examples:**

```scala
// Load BED file annotations
val bedFeatures = sc.loadFeatures("annotations.bed")

// Load GTF gene annotations
val geneFeatures = sc.loadFeatures("genes.gtf")

// Load GFF3 annotations
val gff3Features = sc.loadFeatures("features.gff3")
```

### Format-Agnostic Loading

Load genomic data without specifying the exact format, with automatic format detection.

```scala { .api }
/**
 * Load alignment records with automatic format detection
 * @param pathName - Path to alignment file (SAM/BAM/CRAM/ADAM)
 * @return AlignmentRecordRDD containing sequencing reads
 */
def loadAlignments(pathName: String): AlignmentRecordRDD

/**
 * Load variants with automatic format detection
 * @param pathName - Path to variant file (VCF/ADAM)
 * @return VariantRDD containing genetic variants
 */
def loadVariants(pathName: String): VariantRDD

/**
 * Load genotypes with automatic format detection
 * @param pathName - Path to genotype file (VCF/ADAM)
 * @return GenotypeRDD containing genotype calls
 */
def loadGenotypes(pathName: String): GenotypeRDD
```
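**Usage Examples:**

A short sketch of the format-agnostic loaders; the file paths below are hypothetical, and the format is inferred from the path:

```scala
// The same call works whether the path points at BAM, CRAM, SAM, or ADAM Parquet
val reads = sc.loadAlignments("sample.bam")
val readsFromParquet = sc.loadAlignments("sample.reads.adam")

// Variants and genotypes follow the same pattern for VCF and ADAM Parquet
val variants = sc.loadVariants("calls.vcf")
val genotypes = sc.loadGenotypes("calls.genotypes.adam")
```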

### Parquet Data Loading

Load data from ADAM's native Parquet+Avro format with optional filtering and projection.

```scala { .api }
/**
 * Load data from Parquet files with optional predicate pushdown and column projection
 * @param pathName - Path to Parquet file or directory
 * @param optPredicate - Optional filter predicate applied at the storage level
 * @param optProjection - Optional schema projection to load only specific fields
 * @return RDD of the specified type T
 */
def loadParquet[T](pathName: String,
                   optPredicate: Option[FilterPredicate] = None,
                   optProjection: Option[Schema] = None): RDD[T]
```

**Usage Examples:**

```scala
import org.bdgenomics.adam.projections.{AlignmentRecordField, Projection}
import org.apache.parquet.filter2.predicate.FilterApi

// Load with column projection for efficiency
val projection = Projection(AlignmentRecordField.readName, AlignmentRecordField.readMapped)
val projectedReads = sc.loadParquet[AlignmentRecord]("reads.adam",
                                                     optProjection = Some(projection))

// Load with predicate filtering; readMapped is a boolean field, so use
// booleanColumn with FilterApi.eq
val mappedFilter = FilterApi.eq(FilterApi.booleanColumn("readMapped"),
                                java.lang.Boolean.TRUE)
val mappedReads = sc.loadParquet[AlignmentRecord]("reads.adam",
                                                  optPredicate = Some(mappedFilter))
```

### Advanced Loading Options

Additional configuration options for specialized loading scenarios.

```scala { .api }
// Validation stringency options
object ValidationStringency extends Enumeration {
  val STRICT = Value  // Fail on format violations
  val LENIENT = Value // Log warnings for violations
  val SILENT = Value  // Ignore format violations
}

// SAM format types for saving
object SAMFormat extends Enumeration {
  val SAM = Value  // Plain text SAM
  val BAM = Value  // Binary BAM
  val CRAM = Value // Compressed CRAM
}
```
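For example, the stringency values trade safety against throughput when loading files of uneven quality (the file names below are hypothetical):

```scala
// STRICT fails fast on the first malformed record, which is the safest default
val trusted = sc.loadBam("validated.bam", ValidationStringency.STRICT)

// LENIENT logs a warning per violation and keeps going; SILENT skips quietly.
// Both are useful for legacy data you cannot regenerate.
val messy = sc.loadBam("legacy_export.bam", ValidationStringency.LENIENT)
```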

## Loading Performance Tips

1. **Use indexed files** for region-based queries to avoid scanning entire files
2. **Apply projection** when loading Parquet data to read only the necessary columns
3. **Use predicate pushdown** to filter data at the storage level
4. **Consider validation stringency** - use LENIENT or SILENT for performance-critical applications with trusted data
5. **Partition large datasets** across multiple files for better parallelization
6. **Cache frequently accessed RDDs** in memory using the `.cache()` method
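Tips 2 and 6 combine naturally: project only the columns the analysis needs, then cache the result before repeated queries. A sketch under the `loadParquet` API above; the field choices are illustrative:

```scala
import org.bdgenomics.adam.projections.{AlignmentRecordField, Projection}

// Read only the fields the downstream analysis touches
val slim = Projection(AlignmentRecordField.contigName, AlignmentRecordField.start)
val positions = sc.loadParquet[AlignmentRecord]("reads.adam", optProjection = Some(slim))

// Cache before issuing multiple actions so later passes reuse the in-memory copy
positions.cache()
```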

## Supported File Formats

- **Alignment Data**: SAM, BAM, CRAM, ADAM Parquet
- **Variant Data**: VCF (plain text and compressed), ADAM Parquet
- **Sequence Data**: FASTQ (single-end, paired-end, interleaved), FASTA
- **Feature Data**: BED, GFF3, GTF, IntervalList, NarrowPeak, ADAM Parquet
- **Reference Data**: FASTA, 2bit format