# ADAM APIs

ADAM APIs provides Java- and Python-friendly wrappers for ADAM (A Distributed Alignment Mapper), a genomics analysis library built on Apache Spark. The module offers wrapper classes and converters that make ADAM's distributed genomic data processing accessible to Java and Python developers.

## Package Information

- **Package Name**: adam-apis_2.10
- **Package Type**: Maven
- **Language**: Scala (with Java API wrappers)
- **Installation**: add the following Maven dependency:

```xml
<dependency>
    <groupId>org.bdgenomics.adam</groupId>
    <artifactId>adam-apis_2.10</artifactId>
    <version>0.23.0</version>
</dependency>
```
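
If the project builds with sbt rather than Maven, an equivalent coordinate would be the following (a sketch; because the `_2.10` Scala binary version is spelled out in the artifact name, the plain `%` operator is used rather than `%%`):

```scala
// Hypothetical sbt equivalent of the Maven dependency above
libraryDependencies += "org.bdgenomics.adam" % "adam-apis_2.10" % "0.23.0"
```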

## Core Imports

```java
import org.bdgenomics.adam.api.java.JavaADAMContext;
import org.bdgenomics.adam.rdd.ADAMContext;
import org.apache.spark.api.java.JavaSparkContext;
import htsjdk.samtools.ValidationStringency;
```

For genomic RDD types:

```java
import org.bdgenomics.adam.rdd.read.AlignmentRecordRDD;
import org.bdgenomics.adam.rdd.contig.NucleotideContigFragmentRDD;
import org.bdgenomics.adam.rdd.fragment.FragmentRDD;
import org.bdgenomics.adam.rdd.feature.FeatureRDD;
import org.bdgenomics.adam.rdd.feature.CoverageRDD;
import org.bdgenomics.adam.rdd.variant.GenotypeRDD;
import org.bdgenomics.adam.rdd.variant.VariantRDD;
import org.bdgenomics.adam.rdd.variant.VariantContextRDD;
import org.bdgenomics.adam.util.ReferenceFile;
```

For RDD/Dataset conversions:

```java
import org.bdgenomics.adam.api.java.*;
```

For Python API support:

```java
import org.bdgenomics.adam.api.python.DataFrameConversionWrapper;
```

## Basic Usage

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.bdgenomics.adam.api.java.JavaADAMContext;
import org.bdgenomics.adam.rdd.ADAMContext;
import org.bdgenomics.adam.rdd.read.AlignmentRecordRDD;

// Create the Spark context
SparkConf conf = new SparkConf().setAppName("ADAM API Example");
JavaSparkContext jsc = new JavaSparkContext(conf);

// Wrap it in an ADAM context
ADAMContext ac = new ADAMContext(jsc.sc());
JavaADAMContext jac = new JavaADAMContext(ac);

// Load alignment data
AlignmentRecordRDD alignments = jac.loadAlignments("sample.bam");
System.out.println("Loaded " + alignments.jrdd().count() + " alignment records");

// Load other genomic data types
jac.loadVariants("variants.vcf");
jac.loadFeatures("annotations.bed");
jac.loadContigFragments("reference.fa");
```

## Architecture

ADAM APIs is built around several key components:

- **JavaADAMContext**: Main entry point providing Java-friendly methods for loading various genomic file formats
- **RDD Converter Classes**: Function2-based converters for transforming between different genomic RDD types
- **Dataset Converter Classes**: Similar converters for Spark SQL Dataset operations
- **Python API Support**: DataFrame conversion wrappers for Python interoperability
- **Type Safety**: Full preservation of genomic data types and metadata through conversions

## Capabilities

### Genomic Data Loading

Core functionality for loading genomic data from various file formats into ADAM's specialized RDD types, with automatic format detection and validation.

```java { .api }
// Main context class
class JavaADAMContext {
    JavaADAMContext(ADAMContext ac);
    JavaSparkContext getSparkContext();

    // Load alignment data (BAM/CRAM/SAM/FASTA/FASTQ)
    AlignmentRecordRDD loadAlignments(String pathName);
    AlignmentRecordRDD loadAlignments(String pathName, ValidationStringency stringency);

    // Load reference sequences
    NucleotideContigFragmentRDD loadContigFragments(String pathName);
    ReferenceFile loadReferenceFile(String pathName);
    ReferenceFile loadReferenceFile(String pathName, Long maximumLength);

    // Load fragments (paired-end sequencing data)
    FragmentRDD loadFragments(String pathName);
    FragmentRDD loadFragments(String pathName, ValidationStringency stringency);

    // Load genomic features (annotations)
    FeatureRDD loadFeatures(String pathName);
    FeatureRDD loadFeatures(String pathName, ValidationStringency stringency);

    // Load coverage data
    CoverageRDD loadCoverage(String pathName);
    CoverageRDD loadCoverage(String pathName, ValidationStringency stringency);

    // Load variant data
    GenotypeRDD loadGenotypes(String pathName);
    GenotypeRDD loadGenotypes(String pathName, ValidationStringency stringency);
    VariantRDD loadVariants(String pathName);
    VariantRDD loadVariants(String pathName, ValidationStringency stringency);
}
```

[Genomic Data Loading](./genomic-data-loading.md)

### RDD Type Conversions

A comprehensive set of converter classes for transforming between different genomic RDD types. Each converter implements the Function2 interface for use in Spark transformations.

```java { .api }
// Base conversion interface
interface SameTypeConversion<T, U extends GenomicRDD<T, U>> extends Function2<U, RDD<T>, U> {
    U call(U v1, RDD<T> v2);
}

// Example converter classes
class ContigsToAlignmentRecordsConverter extends Function2<NucleotideContigFragmentRDD, RDD<AlignmentRecord>, AlignmentRecordRDD>;
class AlignmentRecordsToVariantsConverter extends Function2<AlignmentRecordRDD, RDD<Variant>, VariantRDD>;
class VariantsToGenotypesConverter extends Function2<VariantRDD, RDD<Genotype>, GenotypeRDD>;
```
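
Because each converter is a plain Function2, it can be invoked directly to wrap a transformed RDD back into a typed genomic RDD. A minimal sketch, assuming `variants` was loaded via `JavaADAMContext` and `genotypeRdd` is an `RDD<Genotype>` produced by some upstream Spark transformation (the class and method names here are hypothetical):

```java
import org.apache.spark.rdd.RDD;
import org.bdgenomics.adam.api.java.VariantsToGenotypesConverter;
import org.bdgenomics.adam.rdd.variant.GenotypeRDD;
import org.bdgenomics.adam.rdd.variant.VariantRDD;
import org.bdgenomics.formats.avro.Genotype;

class GenotypeWrapping {
    // Sketch: rewrap a transformed RDD<Genotype> as a GenotypeRDD,
    // carrying sequence metadata over from the source VariantRDD.
    static GenotypeRDD wrapGenotypes(VariantRDD variants, RDD<Genotype> genotypeRdd) {
        VariantsToGenotypesConverter converter = new VariantsToGenotypesConverter();
        // call(source, transformedRdd) follows the Function2 shape shown above
        return converter.call(variants, genotypeRdd);
    }
}
```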

[RDD Conversions](./rdd-conversions.md)

### Dataset Type Conversions

Spark SQL Dataset-based converters that mirror the RDD converters, using Dataset operations for better performance and SQL integration.

```java { .api }
// Base dataset conversion traits
interface ToAlignmentRecordDatasetConversion<T extends Product, U extends GenomicDataset<?, T, U>>
    extends GenomicDatasetConversion<T, U, AlignmentRecord, AlignmentRecordRDD>;

// Example dataset converter classes
class ContigsToAlignmentRecordsDatasetConverter extends ToAlignmentRecordDatasetConversion<NucleotideContigFragment, NucleotideContigFragmentRDD>;
class VariantsToGenotypesDatasetConverter extends ToGenotypeDatasetConversion<Variant, VariantRDD>;
```

[Dataset Conversions](./dataset-conversions.md)

### Python API Support

Wrapper functionality enabling Python integration through DataFrame conversion utilities.

```java { .api }
class DataFrameConversionWrapper implements JFunction<DataFrame, DataFrame> {
    DataFrameConversionWrapper(DataFrame newDf);
    DataFrame call(DataFrame v1);
}
```

[Python Integration](./python-integration.md)

## Supported Genomic Data Types

- **AlignmentRecord**: Read alignments from sequencing data
- **NucleotideContigFragment**: Reference genome sequences
- **Fragment**: Paired-end sequencing fragments
- **Feature**: Genomic annotations and intervals
- **Coverage**: Coverage depth information
- **Genotype**: Sample genotype calls
- **Variant**: Genetic variations
- **VariantContext**: Rich variant information with samples and additional metadata

## Supported File Formats

- **Alignment formats**: BAM, CRAM, SAM, FASTA, FASTQ, interleaved FASTQ (.ifq)
- **Feature formats**: BED6/12, GFF3, GTF/GFF2, NarrowPeak, IntervalList
- **Variant formats**: VCF (including .vcf.gz, .vcf.bgzf, .vcf.bgz)
- **Reference formats**: FASTA, 2bit
- **Universal fallback**: Parquet + Avro for all data types

All formats support standard Hadoop compression codecs (.gz, .bz2) where applicable.
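
Format selection is driven by the file extension, so compressed and specialized inputs go through the same load methods. A short sketch reusing the `jac` context from Basic Usage (file names are hypothetical):

```java
// Each call dispatches on the extension of the supplied path.
VariantRDD variants = jac.loadVariants("calls.vcf.gz");      // gzipped VCF
FeatureRDD peaks = jac.loadFeatures("peaks.narrowPeak");     // NarrowPeak features
AlignmentRecordRDD reads = jac.loadAlignments("reads.ifq");  // interleaved FASTQ
```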

## Types

### Core RDD Types

```java { .api }
// Genomic RDD wrapper types with metadata preservation
interface GenomicRDD<T, U extends GenomicRDD<T, U>> {
    RDD<T> jrdd();
    // Additional metadata methods...
}

class AlignmentRecordRDD extends GenomicRDD<AlignmentRecord, AlignmentRecordRDD> {}
class NucleotideContigFragmentRDD extends GenomicRDD<NucleotideContigFragment, NucleotideContigFragmentRDD> {}
class FragmentRDD extends GenomicRDD<Fragment, FragmentRDD> {}
class FeatureRDD extends GenomicRDD<Feature, FeatureRDD> {}
class CoverageRDD extends GenomicRDD<Coverage, CoverageRDD> {}
class GenotypeRDD extends GenomicRDD<Genotype, GenotypeRDD> {}
class VariantRDD extends GenomicRDD<Variant, VariantRDD> {}
class VariantContextRDD extends GenomicRDD<VariantContext, VariantContextRDD> {}
```

### Validation Stringency

```java { .api }
// HTSJDK validation strictness control
enum ValidationStringency {
    STRICT,  // Fail on any format violations
    LENIENT, // Warn on format issues but continue processing
    SILENT   // Ignore format violations silently
}
```
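
The `ValidationStringency` overloads listed under Genomic Data Loading take this enum directly. A brief sketch, reusing the `jac` context from Basic Usage:

```java
import htsjdk.samtools.ValidationStringency;

// Tolerate malformed records in a legacy BAM: warn and continue
// processing instead of failing the whole load.
AlignmentRecordRDD reads =
    jac.loadAlignments("legacy.bam", ValidationStringency.LENIENT);
```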

### Utility Types

```java { .api }
// Broadcastable reference sequences
class ReferenceFile {
    // Methods for efficient reference lookups across the cluster
}

// Spark integration types
class JavaSparkContext {
    // Standard Spark Java API context
}

class DataFrame {
    // Spark SQL DataFrame for Python integration
}
```