# Genomic Data Loading

`JavaADAMContext` loads genomic data from Java. It selects a parser from the file extension and, for most loaders, lets the caller choose how strictly input is validated.

## Required Imports

```java
import org.bdgenomics.adam.api.java.JavaADAMContext;
import org.bdgenomics.adam.rdd.ADAMContext;
import htsjdk.samtools.ValidationStringency;
```

## Capabilities

### JavaADAMContext Class

Main context class providing Java-friendly methods for genomic data operations.

```java { .api }
/**
 * Java-friendly wrapper for ADAM Context providing genomic data loading capabilities
 */
class JavaADAMContext {
    /**
     * Creates a JavaADAMContext wrapping the provided ADAMContext
     * @param ac The ADAMContext to wrap
     */
    JavaADAMContext(ADAMContext ac);

    /**
     * Returns the Java Spark Context associated with this context
     * @return JavaSparkContext for Spark operations
     */
    JavaSparkContext getSparkContext();
}
```

### Alignment Data Loading

Load sequencing alignment data from various formats with automatic format detection.

```java { .api }
/**
 * Load alignment records with automatic format detection
 * Supports: .bam/.cram/.sam (BAM/CRAM/SAM), .fa/.fasta (FASTA),
 * .fq/.fastq (FASTQ), .ifq (interleaved FASTQ)
 * Falls back to Parquet + Avro for unrecognized extensions
 * @param pathName Path to alignment file(s). Supports globs and directories
 * @return AlignmentRecordRDD containing reads, sequence dictionary, and record groups
 */
AlignmentRecordRDD loadAlignments(String pathName);

/**
 * Load alignment records with validation stringency control
 * @param pathName Path to alignment file(s)
 * @param stringency Validation strictness (LENIENT, SILENT, STRICT)
 * @return AlignmentRecordRDD containing reads, sequence dictionary, and record groups
 */
AlignmentRecordRDD loadAlignments(String pathName, ValidationStringency stringency);
```
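
The extension rules in the Javadoc above can be read as a simple dispatch table. The following is a minimal sketch of that reading, not ADAM's actual implementation: the class `AlignmentFormatSketch`, its `Format` enum, and the choice to strip one `.gz` or `.bz2` suffix before matching are all assumptions made for illustration.

```java
/** Illustrative only: maps a path to one of the alignment formats listed above. */
final class AlignmentFormatSketch {
    enum Format { BAM_CRAM_SAM, FASTA, FASTQ, INTERLEAVED_FASTQ, PARQUET_AVRO }

    /** Drops one trailing .gz or .bz2 suffix so "reads.fastq.gz" is treated like "reads.fastq". */
    static String stripCompression(String pathName) {
        if (pathName.endsWith(".gz")) {
            return pathName.substring(0, pathName.length() - ".gz".length());
        }
        if (pathName.endsWith(".bz2")) {
            return pathName.substring(0, pathName.length() - ".bz2".length());
        }
        return pathName;
    }

    /** True if the string ends with any of the given suffixes. */
    static boolean endsWithAny(String s, String... suffixes) {
        for (String suffix : suffixes) {
            if (s.endsWith(suffix)) {
                return true;
            }
        }
        return false;
    }

    /** Picks a format from the extension; anything unrecognized takes the Parquet + Avro fallback. */
    static Format detect(String pathName) {
        String p = stripCompression(pathName);
        if (endsWithAny(p, ".bam", ".cram", ".sam")) return Format.BAM_CRAM_SAM;
        if (endsWithAny(p, ".fa", ".fasta")) return Format.FASTA;
        if (endsWithAny(p, ".fq", ".fastq")) return Format.FASTQ;
        if (endsWithAny(p, ".ifq")) return Format.INTERLEAVED_FASTQ;
        return Format.PARQUET_AVRO;
    }

    public static void main(String[] args) {
        for (String path : new String[] { "sample.bam", "reads.fastq.gz", "reads.ifq", "reads.adam" }) {
            System.out.println(path + " -> " + detect(path));
        }
    }
}
```

One practical consequence of a fallback rule like this is that a misspelled extension is not rejected up front; it is handed to the Parquet reader instead.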

### Reference Sequence Loading

Load reference genome sequences and create broadcastable reference files.

```java { .api }
/**
 * Load nucleotide contig fragments from reference sequences
 * Supports: .fa/.fasta (FASTA format)
 * Falls back to Parquet + Avro for other extensions
 * @param pathName Path to reference file(s). Supports globs and directories for FASTA
 * @return NucleotideContigFragmentRDD containing reference sequences
 */
NucleotideContigFragmentRDD loadContigFragments(String pathName);

/**
 * Load reference sequences into broadcastable format
 * Supports: .2bit files directly, other formats via loadContigFragments
 * @param pathName Path to reference file (no globs/directories for 2bit)
 * @return ReferenceFile for broadcast operations
 */
ReferenceFile loadReferenceFile(String pathName);

/**
 * Load reference sequences with custom maximum fragment length
 * @param pathName Path to reference file
 * @param maximumLength Maximum fragment length (default 10000L, avoid >1e9)
 * @return ReferenceFile for broadcast operations
 */
ReferenceFile loadReferenceFile(String pathName, Long maximumLength);
```
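
The `maximumLength` parameter caps how long each reference fragment may be. The sketch below shows the arithmetic this implies, assuming a contig is cut into consecutive half-open ranges. `FragmentSplitSketch` and `split` are hypothetical names, and the code is not taken from ADAM.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: cuts a contig of a given length into ranges no longer than maximumLength. */
final class FragmentSplitSketch {
    /** Returns consecutive [start, end) ranges that together cover 0..contigLength. */
    static List<long[]> split(long contigLength, long maximumLength) {
        if (maximumLength <= 0) {
            throw new IllegalArgumentException("maximumLength must be positive");
        }
        List<long[]> ranges = new ArrayList<>();
        for (long start = 0; start < contigLength; start += maximumLength) {
            long end = Math.min(start + maximumLength, contigLength);
            ranges.add(new long[] { start, end });
        }
        return ranges;
    }

    public static void main(String[] args) {
        // A 25,000-base contig with the documented default of 10,000 bases per fragment.
        for (long[] range : split(25_000L, 10_000L)) {
            System.out.println("[" + range[0] + ", " + range[1] + ")");
        }
    }
}
```

Under this reading, a smaller `maximumLength` produces more, shorter fragments, and a very large one concentrates a whole contig in a single record, which may be why the Javadoc warns against values above 1e9.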

### Fragment Loading

Load paired-end sequencing fragments from alignment files.

```java { .api }
/**
 * Load fragments from alignment data
 * Supports: .bam/.cram/.sam (BAM/CRAM/SAM), .ifq (interleaved FASTQ)
 * Falls back to Parquet + Avro for other extensions
 * @param pathName Path to fragment file(s). Supports globs and directories
 * @return FragmentRDD containing paired-end fragments
 */
FragmentRDD loadFragments(String pathName);

/**
 * Load fragments with validation stringency control
 * @param pathName Path to fragment file(s)
 * @param stringency Validation strictness for BAM/CRAM/SAM and FASTQ formats
 * @return FragmentRDD containing paired-end fragments
 */
FragmentRDD loadFragments(String pathName, ValidationStringency stringency);
```

### Feature Data Loading

Load genomic annotations and feature data from various annotation formats.

```java { .api }
/**
 * Load genomic features from annotation files
 * Supports: .bed (BED6/12), .gff3 (GFF3), .gtf/.gff (GTF/GFF2),
 * .narrow[pP]eak (NarrowPeak), .interval_list (IntervalList)
 * Falls back to Parquet + Avro for other extensions
 * @param pathName Path to feature file(s). Supports globs and directories
 * @return FeatureRDD containing genomic annotations
 */
FeatureRDD loadFeatures(String pathName);

/**
 * Load features with validation stringency control
 * @param pathName Path to feature file(s)
 * @param stringency Validation strictness for supported text formats
 * @return FeatureRDD containing genomic annotations
 */
FeatureRDD loadFeatures(String pathName, ValidationStringency stringency);
```
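
The `.narrow[pP]eak` entry above uses a character class, so it covers two spellings of the extension. A small sketch of that pattern as a Java regular expression follows; `NarrowPeakSuffixSketch` is a hypothetical helper, and whether ADAM matches the suffix exactly this way is an assumption.

```java
import java.util.regex.Pattern;

/** Illustrative only: tests a path against the ".narrow[pP]eak" suffix as written in the Javadoc. */
final class NarrowPeakSuffixSketch {
    // [pP] accepts a lower-case or upper-case "p"; every other character must match exactly.
    private static final Pattern NARROW_PEAK = Pattern.compile(".*\\.narrow[pP]eak$");

    static boolean isNarrowPeak(String pathName) {
        return NARROW_PEAK.matcher(pathName).matches();
    }

    public static void main(String[] args) {
        for (String path : new String[] { "peaks.narrowPeak", "peaks.narrowpeak", "peaks.NarrowPeak" }) {
            System.out.println(path + " -> " + isNarrowPeak(path));
        }
    }
}
```

Read literally, the pattern accepts `peaks.narrowPeak` and `peaks.narrowpeak` but not `peaks.NarrowPeak`, so files with other capitalizations would take the Parquet + Avro fallback.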

### Coverage Data Loading

Load genomic coverage data, converting features to coverage information.

```java { .api }
/**
 * Load features and convert to coverage data
 * Coverage values are stored in the score field of Feature records
 * Supports same formats as loadFeatures
 * @param pathName Path to coverage file(s). Supports globs and directories
 * @return CoverageRDD containing coverage depth information
 */
CoverageRDD loadCoverage(String pathName);

/**
 * Load coverage data with validation stringency control
 * @param pathName Path to coverage file(s)
 * @param stringency Validation strictness for supported text formats
 * @return CoverageRDD containing coverage depth information
 */
CoverageRDD loadCoverage(String pathName, ValidationStringency stringency);
```
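
Because coverage values are read from the score field, the input file needs a usable score for every record. As a rough illustration, the sketch below takes one tab-separated BED line and reads its fifth column (the BED score column) as the coverage value. `CoverageFromBedSketch` is a hypothetical class, and the exact mapping ADAM applies to each feature format is an assumption here.

```java
/** Illustrative only: a region plus a coverage value taken from the BED score column. */
final class CoverageFromBedSketch {
    final String contig;
    final long start;
    final long end;
    final double count;

    CoverageFromBedSketch(String contig, long start, long end, double count) {
        this.contig = contig;
        this.start = start;
        this.end = end;
        this.count = count;
    }

    /** Parses chrom, chromStart, chromEnd and score (columns 1, 2, 3 and 5) from one BED line. */
    static CoverageFromBedSketch parse(String bedLine) {
        String[] fields = bedLine.split("\t");
        if (fields.length < 5) {
            throw new IllegalArgumentException("Expected at least 5 BED columns, got " + fields.length);
        }
        return new CoverageFromBedSketch(
                fields[0],
                Long.parseLong(fields[1]),
                Long.parseLong(fields[2]),
                Double.parseDouble(fields[4]));
    }

    public static void main(String[] args) {
        CoverageFromBedSketch c = parse("chr1\t100\t200\tregion1\t12.5");
        System.out.println(c.contig + ":" + c.start + "-" + c.end + " coverage=" + c.count);
    }
}
```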

### Variant Data Loading

Load genetic variation data from VCF files or Parquet format.

```java { .api }
/**
 * Load genotype calls from variant files
 * Supports: .vcf/.vcf.gz/.vcf.bgzf/.vcf.bgz (VCF format)
 * Falls back to Parquet + Avro for other extensions
 * @param pathName Path to genotype file(s). Supports globs and directories for VCF
 * @return GenotypeRDD containing sample genotype calls
 */
GenotypeRDD loadGenotypes(String pathName);

/**
 * Load genotypes with validation stringency control
 * @param pathName Path to genotype file(s)
 * @param stringency Validation strictness for VCF format
 * @return GenotypeRDD containing sample genotype calls
 */
GenotypeRDD loadGenotypes(String pathName, ValidationStringency stringency);

/**
 * Load variant records from variant files
 * Supports: .vcf/.vcf.gz/.vcf.bgzf/.vcf.bgz (VCF format)
 * Falls back to Parquet + Avro for other extensions
 * @param pathName Path to variant file(s). Supports globs and directories for VCF
 * @return VariantRDD containing genetic variations
 */
VariantRDD loadVariants(String pathName);

/**
 * Load variants with validation stringency control
 * @param pathName Path to variant file(s)
 * @param stringency Validation strictness for VCF format
 * @return VariantRDD containing genetic variations
 */
VariantRDD loadVariants(String pathName, ValidationStringency stringency);
```

## Usage Examples

**Basic alignment loading:**

```java
// adamContext is an existing ADAMContext; creating it is not covered in this document
JavaADAMContext jac = new JavaADAMContext(adamContext);

// Load BAM file with default settings
AlignmentRecordRDD alignments = jac.loadAlignments("sample.bam");

// Load with strict validation
AlignmentRecordRDD strictAlignments = jac.loadAlignments("sample.bam",
    ValidationStringency.STRICT);

// Load compressed FASTQ
AlignmentRecordRDD fastqReads = jac.loadAlignments("reads.fastq.gz");
```

**Reference and feature loading:**

```java
// Load reference genome
NucleotideContigFragmentRDD reference = jac.loadContigFragments("hg38.fa");

// Create broadcastable reference
ReferenceFile refFile = jac.loadReferenceFile("hg38.2bit");

// Load genomic annotations
FeatureRDD genes = jac.loadFeatures("gencode.gtf");
CoverageRDD coverage = jac.loadCoverage("sample.bed");
```

**Variant data loading:**

```java
// Load VCF file
VariantRDD variants = jac.loadVariants("variants.vcf.gz");
GenotypeRDD genotypes = jac.loadGenotypes("variants.vcf.gz");

// With lenient validation for problematic files
VariantRDD lenientVariants = jac.loadVariants("problematic.vcf",
    ValidationStringency.LENIENT);
```

## File Format Support Details

### Compression Support
All text-based formats support Hadoop compression codecs:
- `.gz` (gzip)
- `.bz2` (bzip2)
- Additional codecs as configured in the Hadoop environment

### Path Patterns
- **Globs supported**: `*.bam`, `sample*.vcf.gz`
- **Directories supported**: Load all compatible files in directory
- **Single files**: Direct file paths
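
To check which file names a glob such as `*.bam` would select before handing it to a loader, the patterns can be tried locally. The sketch below uses the JDK's own `PathMatcher`. This is an approximation: ADAM resolves paths through Hadoop, whose glob rules may differ in detail, and `GlobSketch` is a hypothetical helper.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

/** Illustrative only: tests file names against a glob using the JDK's matcher, not Hadoop's. */
final class GlobSketch {
    static boolean matches(String glob, String fileName) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(fileName));
    }

    public static void main(String[] args) {
        System.out.println(matches("*.bam", "sample.bam"));
        System.out.println(matches("sample*.vcf.gz", "sample1.vcf.gz"));
        System.out.println(matches("*.bam", "sample.sam"));
    }
}
```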

### Validation Stringency Options
- **STRICT**: Fail on any format violations
- **LENIENT**: Warn on format issues but continue processing
- **SILENT**: Ignore format violations silently

## Types

```java { .api }
/**
 * Validation stringency enumeration for controlling file format validation
 */
enum ValidationStringency {
    STRICT,   // Fail on any format violations
    LENIENT,  // Warn on format issues but continue processing
    SILENT    // Ignore format violations silently
}
```
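
The three levels differ only in what happens when a record is malformed. The toy validator below makes that concrete. It is a sketch under stated assumptions: the local `Stringency` enum stands in for `htsjdk.samtools.ValidationStringency` so the example compiles on its own, the "at least three tab-separated columns" rule is invented for illustration, and dropping malformed records under `LENIENT` and `SILENT` is one plausible reading of "continue processing".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: applies STRICT / LENIENT / SILENT handling to a toy record check. */
final class StringencySketch {
    // Local stand-in for htsjdk.samtools.ValidationStringency.
    enum Stringency { STRICT, LENIENT, SILENT }

    /** Keeps well-formed lines; malformed lines fail, warn, or vanish depending on stringency. */
    static List<String> keepValid(List<String> lines, Stringency stringency, List<String> warnings) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            // Toy rule: a record needs at least three tab-separated columns.
            if (line.split("\t").length >= 3) {
                kept.add(line);
                continue;
            }
            switch (stringency) {
                case STRICT:
                    throw new IllegalArgumentException("Malformed record: " + line);
                case LENIENT:
                    warnings.add("Skipping malformed record: " + line);
                    break;
                case SILENT:
                    break;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("chr1\t100\t200", "not-a-record");
        List<String> warnings = new ArrayList<>();
        System.out.println(keepValid(lines, Stringency.LENIENT, warnings) + " warnings=" + warnings.size());
    }
}
```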

## Error Handling

Loading methods may throw exceptions for:
- File not found or inaccessible
- Unsupported file format
- Validation failures (when using STRICT stringency)
- Insufficient memory for large reference files
- Hadoop configuration issues for compressed files
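
One way to handle validation failures is to attempt a strict load first and retry leniently only if it fails. The sketch below shows that control flow with a hypothetical helper, `loadStrictThenLenient`. The exception types ADAM throws are not listed here, so catching `RuntimeException` is an assumption, and the local `Stringency` enum again stands in for the htsjdk one. Because Spark evaluates lazily, some failures may surface only when the data is first used, so the loader passed in should perform whatever work needs checking.

```java
import java.util.function.Function;

/** Illustrative only: try a STRICT load, fall back to LENIENT if it fails. */
final class LoadFallbackSketch {
    // Local stand-in for htsjdk.samtools.ValidationStringency.
    enum Stringency { STRICT, LENIENT, SILENT }

    static <T> T loadStrictThenLenient(Function<Stringency, T> loader) {
        try {
            return loader.apply(Stringency.STRICT);
        } catch (RuntimeException strictFailure) {
            System.err.println("STRICT load failed (" + strictFailure.getMessage()
                    + "); retrying with LENIENT");
            return loader.apply(Stringency.LENIENT);
        }
    }

    public static void main(String[] args) {
        // Stand-in loader that rejects STRICT, mimicking a file with format violations.
        String result = loadStrictThenLenient(stringency -> {
            if (stringency == Stringency.STRICT) {
                throw new IllegalStateException("format violation");
            }
            return "loaded with " + stringency;
        });
        System.out.println(result);
    }
}
```

With the real API, the loader would be a lambda that calls a method such as `loadVariants` with the corresponding `ValidationStringency` value.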