# ADAM Core

ADAM Core is a foundational library for distributed genomics data processing built on Apache Spark. It provides high-performance, fault-tolerant data structures and algorithms for genomic sequences, alignments, variants, and features, with support for legacy formats (SAM/BAM/CRAM, VCF, BED/GFF3/GTF, FASTQ, FASTA) and modern columnar storage (Apache Parquet). The library enables scalable genomic data processing across cluster computing environments while maintaining competitive single-node performance.

## Package Information

- **Package Name**: adam-core_2.10
- **Package Type**: Maven
- **Language**: Scala
- **Framework**: Apache Spark
- **Installation**: Add to Maven pom.xml:

```xml
<dependency>
  <groupId>org.bdgenomics.adam</groupId>
  <artifactId>adam-core_2.10</artifactId>
  <version>0.23.0</version>
</dependency>
```

## Core Imports

```scala
import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.rdd.read.AlignmentRecordRDD
import org.bdgenomics.adam.rdd.variant.VariantRDD
import org.bdgenomics.adam.rdd.feature.FeatureRDD
import org.apache.spark.SparkContext
```

## Basic Usage

```scala
import org.bdgenomics.adam.rdd.ADAMContext._
import org.apache.spark.{SparkContext, SparkConf}

// Initialize the Spark context
val conf = new SparkConf().setAppName("GenomicsAnalysis")
val sc = new SparkContext(conf)

// Load genomic data (the ADAMContext import above provides an
// implicit conversion from SparkContext to ADAMContext)
val alignments = sc.loadBam("input.bam")
val variants = sc.loadVcf("variants.vcf")
val features = sc.loadBed("annotations.bed")

// Transform and analyze
val mappedReads = alignments.transform(_.filter(_.getReadMapped))
val coverage = alignments.toCoverage()

// Save results
mappedReads.saveAsParquet("mapped_reads.adam")
coverage.saveAsWig("coverage.wig")
```

## Architecture

ADAM Core is built around several key architectural components:

- **ADAMContext**: Entry point providing loading methods for all supported genomic file formats, exposed via an implicit conversion from SparkContext
- **GenomicRDD Framework**: Base distributed data structures with genomic-aware partitioning, transformations, and I/O operations
- **Data Type System**: Strongly-typed genomic records based on Avro schemas for serialization efficiency
- **Format Converters**: Bidirectional conversion between legacy formats and ADAM's internal representations
- **Schema Projections**: Column-store optimizations that read only the required fields from Parquet data
- **Algorithms Package**: Genomic algorithms, including consensus generation and sequence alignment

## Capabilities

### Data Loading and I/O

Core functionality for loading genomic data from various file formats and saving transformed results. Supports indexed access for efficient region-based queries.

```scala { .api }
// Main entry point - implicit conversion from SparkContext
implicit def sparkContextToADAMContext(sc: SparkContext): ADAMContext

// Load alignment data
def loadBam(pathName: String,
            stringency: ValidationStringency = ValidationStringency.STRICT): AlignmentRecordRDD
def loadIndexedBam(pathName: String, viewRegions: Iterable[ReferenceRegion]): AlignmentRecordRDD

// Load variant data
def loadVcf(pathName: String,
            stringency: ValidationStringency = ValidationStringency.STRICT): VariantContextRDD
def loadIndexedVcf(pathName: String, viewRegions: Iterable[ReferenceRegion]): VariantContextRDD

// Load sequence data
def loadFastq(pathName1: String,
              optPathName2: Option[String] = None,
              optRecordGroup: Option[String] = None,
              stringency: ValidationStringency = ValidationStringency.STRICT): AlignmentRecordRDD
def loadFasta(pathName: String, maximumLength: Long = 10000L): NucleotideContigFragmentRDD
```

[Data Loading and I/O](./data-loading.md)

### Genomic Data Types

Distributed RDD implementations for the major genomic data types, providing transformations, joins, and analysis operations optimized for genomic workflows.

```scala { .api }
// Base trait for all genomic RDDs
trait GenomicRDD[T, U <: GenomicRDD[T, U]] {
  def transform(tFn: RDD[T] => RDD[T]): U
  def union(rdds: U*): U
  def saveAsParquet(pathName: String): Unit
  def cache(): U
  def persist(storageLevel: StorageLevel): U
  def unpersist(): U
}

// Main genomic data types (abstract sealed classes from the actual implementation)
sealed abstract class AlignmentRecordRDD extends AvroRecordGroupGenomicRDD[AlignmentRecord, AlignmentRecordProduct, AlignmentRecordRDD]
sealed abstract class VariantRDD extends AvroGenomicRDD[Variant, VariantProduct, VariantRDD]
sealed abstract class GenotypeRDD extends MultisampleAvroGenomicRDD[Genotype, GenotypeProduct, GenotypeRDD]
sealed abstract class FeatureRDD extends AvroGenomicRDD[Feature, FeatureProduct, FeatureRDD]
abstract class CoverageRDD extends GenomicDataset[Coverage, Coverage, CoverageRDD]
```

[Genomic Data Types](./data-types.md)

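
The F-bounded type parameter `U <: GenomicRDD[T, U]` is what lets `transform` return the concrete subtype rather than the base trait. The pattern can be illustrated with a plain-Scala sketch (hypothetical `Dataset`/`Alignments` names, a `Seq` standing in for an RDD):

```scala
// F-bounded polymorphism: each subtype names itself as U, so the shared
// transform method returns that subtype, not the base trait.
trait Dataset[T, U <: Dataset[T, U]] {
  def data: Seq[T]
  def replace(newData: Seq[T]): U
  def transform(fn: Seq[T] => Seq[T]): U = replace(fn(data))
}

case class Alignments(data: Seq[String]) extends Dataset[String, Alignments] {
  def replace(newData: Seq[String]): Alignments = Alignments(newData)
}

// transform returns Alignments, so format-specific methods stay available
// after a generic transformation:
val mapped: Alignments = Alignments(Seq("mapped", "unmapped")).transform(_.filter(_ == "mapped"))
```

This is why, in the real API, `alignments.transform(...)` still exposes alignment-specific methods like `toCoverage()`.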
### Data Transformations

Genomic-specific transformations, including format conversions, base quality score recalibration, duplicate marking, and coverage analysis.

```scala { .api }
// AlignmentRecordRDD transformations
def toCoverage(collapse: Boolean = true): CoverageRDD
def toFragments(): FragmentRDD
def markDuplicates(): AlignmentRecordRDD
def recalibrateBaseQualities(knownSnps: VariantRDD): AlignmentRecordRDD

// VariantRDD transformations
def toGenotypes(): GenotypeRDD
def toVariantContexts(): VariantContextRDD
```

[Data Transformations](./transformations.md)

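
To build intuition for the `collapse` flag on `toCoverage`, here is a standalone sketch (plain Scala, no Spark) of merging adjacent equal-depth coverage intervals; `SimpleCoverage` is a hypothetical stand-in for ADAM's `Coverage` record:

```scala
// Hypothetical simplified model of a coverage record: a region plus a depth.
case class SimpleCoverage(referenceName: String, start: Long, end: Long, count: Double)

// Collapse abutting intervals on the same contig that share a depth, the idea
// behind toCoverage(collapse = true). Input must be sorted by position.
def collapse(sorted: Seq[SimpleCoverage]): Seq[SimpleCoverage] =
  sorted.foldLeft(List.empty[SimpleCoverage]) {
    case (prev :: rest, cur)
        if prev.referenceName == cur.referenceName &&
           prev.end == cur.start &&
           prev.count == cur.count =>
      prev.copy(end = cur.end) :: rest
    case (acc, cur) => cur :: acc
  }.reverse

// Adjacent chr1 intervals with equal depth merge into one:
// collapse(Seq(SimpleCoverage("chr1", 0L, 5L, 2.0), SimpleCoverage("chr1", 5L, 9L, 2.0)))
//   == List(SimpleCoverage("chr1", 0L, 9L, 2.0))
```

Collapsing trades per-base resolution for a much smaller dataset when long runs of positions share the same depth.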
### File Format Support

Comprehensive support for reading and writing genomic file formats, with automatic format detection and validation.

```scala { .api }
// Format-agnostic loading
def loadAlignments(pathName: String): AlignmentRecordRDD
def loadVariants(pathName: String): VariantRDD
def loadGenotypes(pathName: String): GenotypeRDD
def loadFeatures(pathName: String): FeatureRDD

// Format-specific saving
def saveAsSam(pathName: String, asType: SAMFormat = SAMFormat.SAM): Unit
def saveAsVcf(pathName: String): Unit
def saveAsBed(pathName: String): Unit
```

[File Format Support](./file-formats.md)

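
The format-agnostic loaders pick a parser from the path. A hedged sketch of the idea (hypothetical helper, not ADAM's actual detection logic, which also handles Parquet directories, compressed files, and interleaved FASTQ):

```scala
// Illustrative extension-based dispatch, as a format-agnostic loader like
// loadAlignments might perform before delegating to a format-specific loader.
def detectAlignmentFormat(pathName: String): String = {
  val lower = pathName.toLowerCase
  if (lower.endsWith(".bam") || lower.endsWith(".sam") || lower.endsWith(".cram"))
    "sam/bam/cram"
  else if (lower.endsWith(".fq") || lower.endsWith(".fastq"))
    "fastq"
  else if (lower.endsWith(".fa") || lower.endsWith(".fasta"))
    "fasta"
  else
    "parquet" // e.g. an .adam directory of Parquet files
}
```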
### Genomic Algorithms

Bioinformatics algorithms, including consensus calling, sequence alignment, and variant normalization, optimized for distributed processing.

```scala { .api }
// Consensus generation
trait ConsensusGenerator {
  def findConsensus(reads: Iterable[AlignmentRecord]): Consensus
}

// Sequence alignment
object SmithWaterman {
  def align(reference: String, read: String, scoring: SmithWatermanScoring): Alignment
}
```

[Genomic Algorithms](./algorithms.md)

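
For intuition, here is a minimal, self-contained Smith-Waterman scoring sketch in plain Scala. It is not ADAM's implementation (which also produces a full alignment with a CIGAR), and the scoring parameters are illustrative defaults:

```scala
// Smith-Waterman local alignment score via dynamic programming: each cell
// holds the best score of any local alignment ending at (i, j), floored at 0.
def smithWatermanScore(reference: String, read: String,
                       matchScore: Int = 2, mismatch: Int = -1, gap: Int = -2): Int = {
  val h = Array.ofDim[Int](reference.length + 1, read.length + 1)
  var best = 0
  for (i <- 1 to reference.length; j <- 1 to read.length) {
    val diag = h(i - 1)(j - 1) +
      (if (reference(i - 1) == read(j - 1)) matchScore else mismatch)
    h(i)(j) = math.max(0, math.max(diag, math.max(h(i - 1)(j) + gap, h(i)(j - 1) + gap)))
    best = math.max(best, h(i)(j))
  }
  best
}
```

With these defaults, a read that exactly matches a substring of the reference scores 2 per base: aligning "ACGT" against "AAACGTAA" scores 8, and the zero floor is what makes the alignment local rather than global.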
## Key Data Types

```scala { .api }
// Genomic coordinates and regions
case class ReferenceRegion(referenceName: String, start: Long, end: Long) {
  def contains(pos: ReferencePosition): Boolean
  def overlaps(other: ReferenceRegion): Boolean
  def width: Long
}

case class ReferencePosition(referenceName: String, pos: Long) extends Ordered[ReferencePosition]

// Reference genome metadata
class SequenceDictionary {
  def records: Seq[SequenceRecord]
  def apply(contigName: String): SequenceRecord
}

// Validation stringency levels (htsjdk.samtools.ValidationStringency, a Java
// enum) with values STRICT, LENIENT, and SILENT

// Base traits for genomic RDD hierarchy (from the actual implementation)
trait AvroGenomicRDD[T, U, V <: AvroGenomicRDD[T, U, V]] extends GenomicRDD[T, V]
trait AvroRecordGroupGenomicRDD[T, U, V <: AvroRecordGroupGenomicRDD[T, U, V]] extends AvroGenomicRDD[T, U, V]
trait MultisampleAvroGenomicRDD[T, U, V <: MultisampleAvroGenomicRDD[T, U, V]] extends AvroGenomicRDD[T, U, V]
trait GenomicDataset[T, U, V <: GenomicDataset[T, U, V]] extends GenomicRDD[T, V]

// Avro record product types
trait AlignmentRecordProduct
trait VariantProduct
trait GenotypeProduct
trait FeatureProduct

// Core Avro data types (from ADAM schemas)
case class AlignmentRecord(/* fields from Avro schema */)
case class Variant(/* fields from Avro schema */)
case class Genotype(/* fields from Avro schema */)
case class Feature(/* fields from Avro schema */)
case class Coverage(/* fields from Avro schema */)

// Storage level from Spark
import org.apache.spark.storage.StorageLevel
```

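
The interval semantics behind `ReferenceRegion` can be pinned down with a small plain-Scala re-implementation. `Region` here is an illustrative stand-in, assuming ADAM's zero-based, half-open `[start, end)` coordinate convention:

```scala
// Half-open genomic interval: start is included, end is excluded.
case class Region(referenceName: String, start: Long, end: Long) {
  def width: Long = end - start
  def contains(name: String, pos: Long): Boolean =
    referenceName == name && pos >= start && pos < end
  def overlaps(other: Region): Boolean =
    referenceName == other.referenceName &&
      start < other.end && other.start < end
}
```

Under half-open semantics, two regions that merely abut (one ends where the other starts) do not overlap, and regions on different contigs never overlap regardless of coordinates.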
## Error Handling

ADAM Core uses validation stringency levels to control error handling:

- **STRICT**: Fails immediately on any format violation or data inconsistency
- **LENIENT**: Logs warnings for format violations but continues processing
- **SILENT**: Ignores format violations and processes the data without warnings

Most loading methods accept an optional `ValidationStringency` parameter to customize error-handling behavior.
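
The three levels amount to a dispatch in a loader's record-validation path. A hedged sketch (hypothetical helper, not an ADAM API; in ADAM the levels come from `htsjdk.samtools.ValidationStringency`):

```scala
sealed trait Stringency
case object Strict extends Stringency
case object Lenient extends Stringency
case object Silent extends Stringency

// Hypothetical validation step: STRICT throws on a malformed record,
// LENIENT logs and drops it, SILENT drops it quietly.
def validateRecord[T](record: T,
                      isValid: T => Boolean,
                      stringency: Stringency,
                      log: String => Unit = s => Console.err.println(s)): Option[T] =
  if (isValid(record)) Some(record)
  else stringency match {
    case Strict  => throw new IllegalArgumentException(s"Malformed record: $record")
    case Lenient => log(s"Skipping malformed record: $record"); None
    case Silent  => None
  }
```

Returning `Option[T]` lets a loader `flatMap` over records so that dropped records simply disappear from the resulting dataset.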