
# ADAM Core

ADAM Core is a foundational library for distributed genomics data processing built on Apache Spark. It provides high-performance, fault-tolerant data structures and algorithms for genomic sequences, alignments, variants, and features, with support for legacy formats (SAM/BAM/CRAM, VCF, BED/GFF3/GTF, FASTQ, FASTA) and modern columnar storage (Apache Parquet). The library enables scalable genomic data processing across cluster computing environments while maintaining competitive single-node performance.

## Package Information

- **Package Name**: adam-core_2.10
- **Package Type**: maven
- **Language**: Scala
- **Framework**: Apache Spark
- **Installation**: Add to Maven pom.xml:

```xml
<dependency>
  <groupId>org.bdgenomics.adam</groupId>
  <artifactId>adam-core_2.10</artifactId>
  <version>0.23.0</version>
</dependency>
```
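
For sbt builds, the equivalent dependency can be declared in `build.sbt`. Note that because the artifact name already carries the `_2.10` Scala-version suffix, the plain `%` operator is used rather than `%%`:

```scala
// build.sbt — equivalent dependency declaration
libraryDependencies += "org.bdgenomics.adam" % "adam-core_2.10" % "0.23.0"
```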

## Core Imports

```scala
import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.rdd.{AlignmentRecordRDD, VariantRDD, FeatureRDD}
import org.apache.spark.SparkContext
```

## Basic Usage

```scala
import org.bdgenomics.adam.rdd.ADAMContext._
import org.apache.spark.{SparkContext, SparkConf}

// Initialize Spark context
val conf = new SparkConf().setAppName("GenomicsAnalysis")
val sc = new SparkContext(conf)

// Load genomic data (implicit conversion sc -> ADAMContext)
val alignments = sc.loadBam("input.bam")
val variants = sc.loadVcf("variants.vcf")
val features = sc.loadBed("annotations.bed")

// Transform and analyze
val mappedReads = alignments.transform(_.filter(_.getReadMapped))
val coverage = alignments.toCoverage()

// Save results
mappedReads.saveAsParquet("mapped_reads.adam")
coverage.saveAsWig("coverage.wig")
```

## Architecture

ADAM Core is built around several key architectural components:

- **ADAMContext**: Entry point providing loading methods for all genomic file formats, extending SparkContext functionality
- **GenomicRDD Framework**: Base distributed data structures with genomic-aware partitioning, transformations, and I/O operations
- **Data Type System**: Strongly-typed genomic records based on Avro schemas for serialization efficiency
- **Format Converters**: Bidirectional conversion between legacy formats and ADAM's internal representations
- **Schema Projections**: Column-store optimizations for accessing only required fields from Parquet data
- **Algorithms Package**: Genomic algorithms including consensus generation and sequence alignment

## Capabilities

### Data Loading and I/O

Core functionality for loading genomic data from various file formats and saving transformed results. Supports indexed access for efficient region-based queries.

```scala { .api }
// Main entry point - implicit conversion from SparkContext
implicit def sparkContextToADAMContext(sc: SparkContext): ADAMContext

// Load alignment data
def loadBam(pathName: String,
            stringency: ValidationStringency = ValidationStringency.STRICT): AlignmentRecordRDD
def loadIndexedBam(pathName: String, viewRegions: Iterable[ReferenceRegion]): AlignmentRecordRDD

// Load variant data
def loadVcf(pathName: String,
            stringency: ValidationStringency = ValidationStringency.STRICT): VariantContextRDD
def loadIndexedVcf(pathName: String, viewRegions: Iterable[ReferenceRegion]): VariantContextRDD

// Load sequence data
def loadFastq(pathName1: String, optPathName2: Option[String] = None,
              optRecordGroup: Option[String] = None,
              stringency: ValidationStringency = ValidationStringency.STRICT): AlignmentRecordRDD
def loadFasta(pathName: String, maximumLength: Long = 10000L): NucleotideContigFragmentRDD
```

[Data Loading and I/O](./data-loading.md)

### Genomic Data Types

Distributed RDD implementations for major genomic data types, providing transformations, joins, and analysis operations optimized for genomic workflows.

```scala { .api }
// Base trait for all genomic RDDs
trait GenomicRDD[T, U <: GenomicRDD[T, U]] {
  def transform(tFn: RDD[T] => RDD[T]): U
  def union(rdds: U*): U
  def saveAsParquet(pathName: String): Unit
  def cache(): U
  def persist(storageLevel: StorageLevel): U
  def unpersist(): U
}

// Main genomic data types (abstract sealed classes from actual implementation)
sealed abstract class AlignmentRecordRDD extends AvroRecordGroupGenomicRDD[AlignmentRecord, AlignmentRecordProduct, AlignmentRecordRDD]
sealed abstract class VariantRDD extends AvroGenomicRDD[Variant, VariantProduct, VariantRDD]
sealed abstract class GenotypeRDD extends MultisampleAvroGenomicRDD[Genotype, GenotypeProduct, GenotypeRDD]
sealed abstract class FeatureRDD extends AvroGenomicRDD[Feature, FeatureProduct, FeatureRDD]
abstract class CoverageRDD extends GenomicDataset[Coverage, Coverage, CoverageRDD]
```

[Genomic Data Types](./data-types.md)
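
The `U <: GenomicRDD[T, U]` bound above is F-bounded polymorphism: it is what lets `transform` return the concrete subtype rather than the base trait. A minimal, self-contained sketch of the pattern (the `*Sketch` names are illustrative, not part of the ADAM API):

```scala
// F-bounded polymorphism: transform returns the concrete subtype U
trait GenomicRDDSketch[T, U <: GenomicRDDSketch[T, U]] {
  def data: Seq[T]
  protected def replaceData(newData: Seq[T]): U
  def transform(fn: Seq[T] => Seq[T]): U = replaceData(fn(data))
}

// A concrete "RDD" of read names; transform yields another ReadsSketch
case class ReadsSketch(data: Seq[String]) extends GenomicRDDSketch[String, ReadsSketch] {
  protected def replaceData(newData: Seq[String]): ReadsSketch = ReadsSketch(newData)
}
```

This is why `alignments.transform(...)` in the Basic Usage example yields an `AlignmentRecordRDD` rather than a bare `GenomicRDD`.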

### Data Transformations

Genomic-specific transformations including format conversions, quality score recalibration, duplicate marking, and coverage analysis.

```scala { .api }
// AlignmentRecordRDD transformations
def toCoverage(collapse: Boolean = true): CoverageRDD
def toFragments(): FragmentRDD
def markDuplicates(): AlignmentRecordRDD
def recalibrateBaseQualities(knownSnps: VariantRDD): AlignmentRecordRDD

// VariantRDD transformations
def toGenotypes(): GenotypeRDD
def toVariantContexts(): VariantContextRDD
```

[Data Transformations](./transformations.md)
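
To make the `toCoverage(collapse = true)` semantics concrete, here is a self-contained sketch of the underlying idea: count read depth per position, then collapse adjacent positions with equal depth into intervals. The names and the in-memory representation are illustrative only; ADAM performs this distributively over an RDD.

```scala
object CoverageSketch {
  // reads are (start, end) half-open intervals on a single contig;
  // returns collapsed (start, end, depth) runs with nonzero depth
  def coverage(reads: Seq[(Int, Int)]): Seq[(Int, Int, Int)] = {
    if (reads.isEmpty) return Seq.empty
    val hi = reads.map(_._2).max
    val depth = new Array[Int](hi)
    for ((s, e) <- reads; p <- s until e) depth(p) += 1
    // collapse adjacent equal-depth positions into runs
    val runs = Seq.newBuilder[(Int, Int, Int)]
    var runStart = 0
    for (i <- 1 to depth.length) {
      if (i == depth.length || depth(i) != depth(runStart)) {
        if (depth(runStart) > 0) runs += ((runStart, i, depth(runStart)))
        runStart = i
      }
    }
    runs.result()
  }
}
```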

### File Format Support

Comprehensive support for reading and writing genomic file formats, with automatic format detection and validation.

```scala { .api }
// Format-agnostic loading
def loadAlignments(pathName: String): AlignmentRecordRDD
def loadVariants(pathName: String): VariantRDD
def loadGenotypes(pathName: String): GenotypeRDD
def loadFeatures(pathName: String): FeatureRDD

// Format-specific saving
def saveAsSam(pathName: String, asType: SAMFormat = SAMFormat.SAM): Unit
def saveAsVcf(pathName: String): Unit
def saveAsBed(pathName: String): Unit
```

[File Format Support](./file-formats.md)

### Genomic Algorithms

Bioinformatics algorithms including consensus calling, sequence alignment, and variant normalization optimized for distributed processing.

```scala { .api }
// Consensus generation
trait ConsensusGenerator {
  def findConsensus(reads: Iterable[AlignmentRecord]): Consensus
}

// Sequence alignment
object SmithWaterman {
  def align(reference: String, read: String, scoring: SmithWatermanScoring): Alignment
}
```

[Genomic Algorithms](./algorithms.md)
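
As a concrete illustration of the local alignment that `SmithWaterman` performs, here is a self-contained, scoring-only sketch. The scoring (+2 match, -1 mismatch, -1 gap) is a fixed illustrative choice; ADAM's implementation takes a configurable `SmithWatermanScoring` and also produces the alignment itself, not just the score.

```scala
object SmithWatermanSketch {
  // Best local alignment score between a reference and a read
  // (match = +2, mismatch = -1, gap = -1)
  def bestLocalScore(reference: String, read: String): Int = {
    val h = Array.ofDim[Int](reference.length + 1, read.length + 1)
    var best = 0
    for (i <- 1 to reference.length; j <- 1 to read.length) {
      val diag = h(i - 1)(j - 1) + (if (reference(i - 1) == read(j - 1)) 2 else -1)
      // a cell never goes below zero: local alignments can restart anywhere
      h(i)(j) = List(0, diag, h(i - 1)(j) - 1, h(i)(j - 1) - 1).max
      best = math.max(best, h(i)(j))
    }
    best
  }
}
```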

## Key Data Types

```scala { .api }
// Genomic coordinates and regions
case class ReferenceRegion(referenceName: String, start: Long, end: Long) {
  def contains(pos: ReferencePosition): Boolean
  def overlaps(other: ReferenceRegion): Boolean
  def width: Long
}

case class ReferencePosition(referenceName: String, pos: Long) extends Ordered[ReferencePosition]

// Reference genome metadata
class SequenceDictionary {
  def records: Seq[SequenceRecord]
  def apply(contigName: String): SequenceRecord
}

// Validation stringency levels
object ValidationStringency extends Enumeration {
  val STRICT, LENIENT, SILENT = Value
}

// Base traits for genomic RDD hierarchy (from actual implementation)
trait AvroGenomicRDD[T, U, V <: AvroGenomicRDD[T, U, V]] extends GenomicRDD[T, V]
trait AvroRecordGroupGenomicRDD[T, U, V <: AvroRecordGroupGenomicRDD[T, U, V]] extends AvroGenomicRDD[T, U, V]
trait MultisampleAvroGenomicRDD[T, U, V <: MultisampleAvroGenomicRDD[T, U, V]] extends AvroGenomicRDD[T, U, V]
trait GenomicDataset[T, U, V <: GenomicDataset[T, U, V]] extends GenomicRDD[T, V]

// Avro record product types
trait AlignmentRecordProduct
trait VariantProduct
trait GenotypeProduct
trait FeatureProduct

// Core Avro data types (from ADAM schemas)
case class AlignmentRecord(/* fields from Avro schema */)
case class Variant(/* fields from Avro schema */)
case class Genotype(/* fields from Avro schema */)
case class Feature(/* fields from Avro schema */)
case class Coverage(/* fields from Avro schema */)

// Storage level from Spark
import org.apache.spark.storage.StorageLevel
```
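
ADAM regions use 0-based, end-exclusive (half-open) coordinates. A self-contained sketch of the interval semantics behind `contains`, `overlaps`, and `width` (the `*Sketch` names are illustrative; the real classes carry additional fields such as strand orientation):

```scala
case class PositionSketch(referenceName: String, pos: Long)

// 0-based, end-exclusive genomic interval
case class RegionSketch(referenceName: String, start: Long, end: Long) {
  def width: Long = end - start
  def contains(p: PositionSketch): Boolean =
    referenceName == p.referenceName && p.pos >= start && p.pos < end
  def overlaps(other: RegionSketch): Boolean =
    referenceName == other.referenceName && start < other.end && other.start < end
}
```

Because the end coordinate is exclusive, `[100, 200)` and `[200, 300)` abut but do not overlap, and position 200 lies outside `[100, 200)`.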

## Error Handling

ADAM Core uses validation stringency levels to control error handling:

- **STRICT**: Fails immediately on any format violations or data inconsistencies
- **LENIENT**: Logs warnings for format violations but continues processing
- **SILENT**: Ignores format violations and processes data without warnings

Most loading methods accept an optional `ValidationStringency` parameter to customize error handling behavior.
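
The three behaviors can be sketched as a self-contained dispatch. This toy `Stringency` type only mirrors the semantics described above; ADAM's loaders take a real `ValidationStringency` value.

```scala
object StringencySketch {
  sealed trait Stringency
  case object Strict extends Stringency
  case object Lenient extends Stringency
  case object Silent extends Stringency

  // Apply one record's parse result under a stringency policy:
  // Strict fails fast, Lenient warns and drops, Silent drops quietly.
  def validate[A](parsed: Either[String, A], stringency: Stringency): Option[A] =
    parsed match {
      case Right(record) => Some(record)
      case Left(err) =>
        stringency match {
          case Strict  => throw new IllegalArgumentException(err)
          case Lenient => Console.err.println(s"warning: $err"); None
          case Silent  => None
        }
    }
}
```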