# Genomic Data Processing

This document covers ADAM CLI's core genomic data processing capabilities, including alignment transformations, feature processing, variant analysis, k-mer counting, and coverage analysis.

## K-mer Analysis

### Count Read K-mers

Analyzes k-mer frequencies in read sequences for quality control and genomic analysis.

```scala { .api }
object CountReadKmers extends BDGCommandCompanion {
  val commandName = "countKmers"
  val commandDescription = "Counts the k-mers/q-mers from a read dataset."
  def apply(cmdLine: Array[String]): CountReadKmers
}

class CountReadKmersArgs extends Args4jBase with ParquetArgs {
  var inputPath: String       // ADAM read file to analyze
  var outputPath: String      // Output location for k-mer counts
  var kmerLength: Int         // Length of k-mers
  var printHistogram: Boolean // Print histogram of counts
  var repartition: Int        // Number of partitions for the output
}
```

**Usage Example:**

```bash
adam-submit countKmers \
  input.adam output_kmers.adam 21 \
  --print_histogram
```
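
Conceptually, `countKmers` tallies every length-`k` window across the read sequences. A minimal single-machine sketch of that logic in plain Scala (illustrative only; the object and method names below are not part of the ADAM API, and ADAM performs this tally distributed over Spark):

```scala
// Minimal, single-machine sketch of k-mer counting. Illustrative only;
// ADAM performs the same tally distributed over Spark.
object KmerSketch {
  // Count every substring of length k across the given sequences.
  def countKmers(sequences: Seq[String], k: Int): Map[String, Long] =
    sequences
      .flatMap(_.sliding(k).filter(_.length == k)) // all length-k windows
      .groupBy(identity)                           // group identical k-mers
      .map { case (kmer, occurrences) => kmer -> occurrences.size.toLong }

  def main(args: Array[String]): Unit = {
    val counts = countKmers(Seq("ACGTACGT"), 4)
    // "ACGT" occurs at offsets 0 and 4
    println(counts("ACGT")) // 2
  }
}
```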

### Count Contig K-mers

Analyzes k-mer frequencies in assembled contig sequences.

```scala { .api }
object CountContigKmers extends BDGCommandCompanion {
  val commandName = "countContigKmers"
  val commandDescription = "Counts the k-mers/q-mers from a read dataset."
  def apply(cmdLine: Array[String]): CountContigKmers
}

class CountContigKmersArgs extends Args4jBase with ParquetArgs {
  var inputPath: String       // ADAM or FASTA file
  var outputPath: String      // Output location for k-mer counts
  var kmerLength: Int         // Length of k-mers
  var printHistogram: Boolean // Print histogram of counts
}
```
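
A usage example, by analogy with `countKmers` above (argument order and flag name are assumed to mirror `CountContigKmersArgs`; verify against your ADAM version):

```bash
# Count 21-mers in assembled contigs (invocation assumed from the
# command name and argument fields shown above).
adam-submit countContigKmers \
  contigs.fa contig_kmers.adam 21 \
  --print_histogram
```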

## Alignment Processing

### Transform Alignments

Comprehensive alignment processing with format conversion, quality score recalibration, duplicate marking, and local realignment.

```scala { .api }
object TransformAlignments extends BDGCommandCompanion {
  val commandName = "transformAlignments"
  val commandDescription = "Convert SAM/BAM to ADAM format and optionally perform read pre-processing transformations"
  def apply(cmdLine: Array[String]): TransformAlignments
}

class TransformAlignmentsArgs extends Args4jBase with ADAMSaveAnyArgs with ParquetArgs {
  // Input/Output
  var inputPath: String
  var outputPath: String

  // Filtering and projection
  var limitProjection: Boolean
  var useAlignedReadPredicate: Boolean
  var regionPredicate: String

  // Sorting options
  var sortReads: Boolean
  var sortLexicographically: Boolean

  // Quality processing
  var markDuplicates: Boolean
  var recalibrateBaseQualities: Boolean
  var locallyRealign: Boolean
  var realignAroundIndels: Boolean

  // Trimming and binning
  var trim: Boolean
  var qualityScoreBin: Int

  // Performance tuning
  var coalesce: Int
  var forceShuffle: Boolean
  var storageLevel: String
}
```

**Key Processing Options:**

- **Mark Duplicates**: Identify and flag PCR/optical duplicates
- **Base Quality Recalibration**: Adjust base quality scores using known variants
- **Local Realignment**: Realign reads around indels for improved accuracy
- **Quality Score Binning**: Reduce quality score precision to save storage space
- **Read Trimming**: Remove low-quality bases from read ends

**Usage Examples:**

```bash
# Basic format conversion
adam-submit transformAlignments input.bam output.adam

# Full preprocessing pipeline
adam-submit transformAlignments \
  --markDuplicates \
  --recalibrateBaseQualities \
  --locallyRealign \
  --sortReads \
  input.bam output.adam

# With region filtering
adam-submit transformAlignments \
  --regionPredicate "referenceName=chr1 AND start>=1000000 AND end<=2000000" \
  input.bam output.adam
```
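
Of the options above, quality score binning is easy to picture: each Phred score is mapped to the midpoint of a fixed-width bin, so runs of similar scores compress well at a small cost in precision. A plain-Scala sketch of the idea (illustrative only; this is not ADAM's exact binning scheme, and the names are hypothetical):

```scala
// Illustrative quality-score binning: map each Phred score to the
// midpoint of a fixed-width bin so nearby scores collapse to one value.
object QualityBinSketch {
  def binQuality(q: Int, binWidth: Int): Int = {
    val bin = q / binWidth           // which bin the score falls into
    bin * binWidth + binWidth / 2    // midpoint of that bin
  }

  def main(args: Array[String]): Unit = {
    val quals = Seq(37, 38, 39, 12, 13)
    println(quals.map(binQuality(_, 10))) // List(35, 35, 35, 15, 15)
  }
}
```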

## Feature Processing

### Transform Features

Process genomic features from BED, GFF3, GTF, and other annotation formats.

```scala { .api }
object TransformFeatures extends BDGCommandCompanion {
  val commandName = "transformFeatures"
  val commandDescription = "Convert a file with sequence features into corresponding ADAM format"
  def apply(cmdLine: Array[String]): TransformFeatures
}

class TransformFeaturesArgs extends Args4jBase with ADAMSaveAnyArgs with ParquetArgs {
  var inputPath: String
  var outputPath: String
  var sortFeatures: Boolean
  var sortLexicographically: Boolean
  var coalesce: Int
  var forceShuffle: Boolean
}
```

**Usage Example:**

```bash
adam-submit transformFeatures \
  --sortFeatures \
  annotations.gtf features.adam
```

## Variant Processing

### Transform Variants

Process variant data from VCF files with sorting and validation options.

```scala { .api }
object TransformVariants extends BDGCommandCompanion {
  val commandName = "transformVariants"
  val commandDescription = "Convert a VCF file into corresponding ADAM format"
  def apply(cmdLine: Array[String]): TransformVariants
}

class TransformVariantsArgs extends Args4jBase with ADAMSaveAnyArgs with ParquetArgs {
  var inputPath: String
  var outputPath: String
  var coalesce: Int
  var forceShuffle: Boolean
  var sort: Boolean
  var sortLexicographically: Boolean
  var stringency: String
}
```

**Usage Example:**

```bash
adam-submit transformVariants \
  --sort \
  --stringency LENIENT \
  variants.vcf variants.adam
```

### Transform Genotypes

Process genotype data with filtering and quality control options.

```scala { .api }
object TransformGenotypes extends BDGCommandCompanion {
  val commandName = "transformGenotypes"
  val commandDescription = "Convert a VCF file into corresponding ADAM format"
  def apply(cmdLine: Array[String]): TransformGenotypes
}

class TransformGenotypesArgs extends Args4jBase with ADAMSaveAnyArgs with ParquetArgs {
  var inputPath: String
  var outputPath: String
  var coalesce: Int
  var forceShuffle: Boolean
  var sort: Boolean
  var sortLexicographically: Boolean
}
```
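
A usage example, by analogy with `transformVariants` above (the flag shown is assumed to mirror the fields in `TransformGenotypesArgs`; verify against your ADAM version):

```bash
# Convert genotypes from VCF to ADAM format, sorted (flag assumed
# from the argument fields shown above).
adam-submit transformGenotypes \
  --sort \
  genotypes.vcf genotypes.adam
```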

## Fragment Processing

### Transform Fragments

Process paired-end read fragments with insert size analysis and quality filtering.

```scala { .api }
object TransformFragments extends BDGCommandCompanion {
  val commandName = "transformFragments"
  val commandDescription = "Convert SAM/BAM/CRAM to ADAM fragments"
  def apply(cmdLine: Array[String]): TransformFragments
}

class TransformFragmentsArgs extends Args4jBase with ADAMSaveAnyArgs with ParquetArgs {
  var inputPath: String
  var outputPath: String
  var coalesce: Int
  var forceShuffle: Boolean
  var storageLevel: String
}
```
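
A usage example, by analogy with `transformAlignments` above (invocation assumed from the command name and argument fields shown; verify against your ADAM version):

```bash
# Convert paired-end reads to ADAM fragments (invocation assumed
# from the command signature shown above).
adam-submit transformFragments \
  input.bam fragments.adam
```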

## Coverage Analysis

### Reads to Coverage

Generate coverage depth information from aligned reads.

```scala { .api }
object Reads2Coverage extends BDGCommandCompanion {
  val commandName = "reads2coverage"
  val commandDescription = "Calculate the coverage from a given ADAM file"
  def apply(cmdLine: Array[String]): Reads2Coverage
}

class Reads2CoverageArgs extends Args4jBase with ParquetArgs {
  var inputPath: String
  var outputPath: String
  var collapse: Boolean
  var onlyCountUniqueReads: Boolean
  var coalesce: Int
  var forceShuffle: Boolean
}
```

**Usage Example:**

```bash
adam-submit reads2coverage \
  --onlyCountUniqueReads \
  --collapse \
  alignments.adam coverage.adam
```
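
The `--collapse` option merges neighboring positions with identical depth into intervals. A plain-Scala sketch of that collapsing step (illustrative only; the names are hypothetical, and ADAM computes the actual coverage from alignments over Spark):

```scala
// Illustrative collapse of per-base depths into (start, end, depth)
// intervals, as neighboring positions with equal coverage are merged.
// Positions are 0-based; interval ends are exclusive.
object CoverageCollapseSketch {
  def collapse(depths: Seq[Int]): Seq[(Int, Int, Int)] = {
    if (depths.isEmpty) return Seq.empty
    val runs = scala.collection.mutable.ArrayBuffer.empty[(Int, Int, Int)]
    var start = 0
    for (i <- 1 until depths.length) {
      if (depths(i) != depths(start)) {   // depth changed: close the run
        runs += ((start, i, depths(start)))
        start = i
      }
    }
    runs += ((start, depths.length, depths(start))) // close the final run
    runs.toSeq
  }

  def main(args: Array[String]): Unit = {
    println(collapse(Seq(1, 1, 2, 2, 2, 1)))
    // List((0,2,1), (2,5,2), (5,6,1))
  }
}
```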

## Data Management

### Merge Shards

Combine multiple data shards into consolidated files for improved query performance.

```scala { .api }
object MergeShards extends BDGCommandCompanion {
  val commandName = "mergeShards"
  val commandDescription = "Merge multiple shards of genomic data"
  def apply(cmdLine: Array[String]): MergeShards
}

class MergeShardsArgs extends Args4jBase with ParquetArgs {
  var inputPath: String
  var outputPath: String
  var coalesce: Int
  var sortOrder: String
}
```

**Usage Example:**

```bash
adam-submit mergeShards \
  --sortOrder coordinate \
  sharded_data/ merged_output.adam
```

## Performance Considerations

### Memory Management

- Use `--storageLevel` to control Spark caching strategy
- Configure `--coalesce` to optimize output file count
- Set appropriate driver and executor memory via Spark arguments
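
For instance, with standard `spark-submit` memory flags (`adam-submit` is commonly invoked with Spark arguments ahead of the ADAM command, separated by `--`; the exact pass-through syntax may vary by ADAM version):

```bash
# Standard Spark memory settings, forwarded to spark-submit
# (pass-through syntax assumed; check your adam-submit wrapper).
adam-submit \
  --driver-memory 8g \
  --executor-memory 16g \
  -- transformAlignments input.bam output.adam
```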

### Cluster Scaling

- Partition data appropriately for cluster size
- Use `--forceShuffle` when data skew is detected
- Monitor the Spark UI for bottlenecks and resource utilization

### Data Locality

- Co-locate input data with compute resources when possible
- Use HDFS or object storage for distributed deployments
- Consider data compression vs. processing speed trade-offs