# Genomic Data Processing

This document covers ADAM CLI's core genomic data processing capabilities, including alignment transformations, feature processing, variant analysis, k-mer counting, and coverage analysis.

## K-mer Analysis

### Count Read K-mers

Analyzes k-mer frequencies in read sequences for quality control and genomic analysis.
```scala { .api }
object CountReadKmers extends BDGCommandCompanion {
  val commandName = "countKmers"
  val commandDescription = "Counts the k-mers/q-mers from a read dataset."
  def apply(cmdLine: Array[String]): CountReadKmers
}

class CountReadKmersArgs extends Args4jBase with ParquetArgs {
  var inputPath: String
  var outputPath: String
  var kmerLength: Int
  var printHistogram: Boolean
  var repartition: Int
}
```
**Usage Example:**
```bash
adam-submit countKmers \
  input.adam output_kmers.adam 21 \
  --print_histogram
```
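The core operation behind `countKmers` is easy to illustrate outside of Spark. The following is a minimal single-machine Python sketch of the idea, not ADAM's distributed implementation: slide a window of length k across each read sequence and tally every substring.

```python
from collections import Counter

def count_kmers(reads, k):
    """Tally every length-k substring across a collection of read sequences."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

# Two short reads; each contains the 3-mers ACG, CGT, GTA, TAC once.
print(count_kmers(["ACGTAC", "GTACGT"], 3))
```

A histogram such as the one `--print_histogram` produces is then just a tally of these counts themselves (how many distinct k-mers occur once, twice, and so on).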
### Count Contig K-mers

Analyzes k-mer frequencies in assembled contig sequences.

```scala { .api }
object CountContigKmers extends BDGCommandCompanion {
  val commandName = "countContigKmers"
  val commandDescription = "Counts the k-mers/q-mers from a read dataset."
  def apply(cmdLine: Array[String]): CountContigKmers
}

class CountContigKmersArgs extends Args4jBase with ParquetArgs {
  var inputPath: String       // ADAM or FASTA file
  var outputPath: String      // Output location for k-mer counts
  var kmerLength: Int         // Length of k-mers
  var printHistogram: Boolean // Print histogram of counts
}
```
## Alignment Processing

### Transform Alignments

Comprehensive alignment processing with format conversion, quality score recalibration, duplicate marking, and local realignment.
```scala { .api }
object TransformAlignments extends BDGCommandCompanion {
  val commandName = "transformAlignments"
  val commandDescription = "Convert SAM/BAM to ADAM format and optionally perform read pre-processing transformations"
  def apply(cmdLine: Array[String]): TransformAlignments
}

class TransformAlignmentsArgs extends Args4jBase with ADAMSaveAnyArgs with ParquetArgs {
  // Input/Output
  var inputPath: String
  var outputPath: String

  // Filtering and projection
  var limitProjection: Boolean
  var useAlignedReadPredicate: Boolean
  var regionPredicate: String

  // Sorting options
  var sortReads: Boolean
  var sortLexicographically: Boolean

  // Quality processing
  var markDuplicates: Boolean
  var recalibrateBaseQualities: Boolean
  var locallyRealign: Boolean
  var realignAroundIndels: Boolean

  // Trimming and binning
  var trim: Boolean
  var qualityScoreBin: Int

  // Performance tuning
  var coalesce: Int
  var forceShuffle: Boolean
  var storageLevel: String
}
```
**Key Processing Options:**

- **Mark Duplicates**: Identify and flag PCR/optical duplicates
- **Base Quality Recalibration**: Adjust base quality scores using known variants
- **Local Realignment**: Realign reads around indels for improved accuracy
- **Quality Score Binning**: Reduce quality score precision to save storage space
- **Read Trimming**: Remove low-quality bases from read ends
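To make the quality score binning option concrete: collapsing many distinct Phred values into a few representative values shrinks the quality-string alphabet, which compresses far better in Parquet. The sketch below is a hypothetical Python illustration of one binning scheme (mapping each score to its bin's midpoint); ADAM's actual bin boundaries may differ.

```python
def bin_quality(scores, bin_size):
    """Map each Phred score to the midpoint of its bin, shrinking the alphabet."""
    return [(q // bin_size) * bin_size + bin_size // 2 for q in scores]

# Scores 37 and 38 collapse to the same value, so the string compresses better.
print(bin_quality([37, 38, 12, 2], 10))  # [35, 35, 15, 5]
```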
**Usage Examples:**
```bash
# Basic format conversion
adam-submit transformAlignments input.bam output.adam

# Full preprocessing pipeline
adam-submit transformAlignments \
  --markDuplicates \
  --recalibrateBaseQualities \
  --locallyRealign \
  --sortReads \
  input.bam output.adam

# With region filtering
adam-submit transformAlignments \
  --regionPredicate "referenceName=chr1 AND start>=1000000 AND end<=2000000" \
  input.bam output.adam
```
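The duplicate-marking step can be understood as a grouping problem: reads whose alignments share the same position key are duplicate candidates, and only the best-scoring one escapes the flag. The Python sketch below is a deliberately simplified, single-machine illustration of that idea; real implementations (including ADAM's) key on unclipped 5' positions and handle read pairs and orientation.

```python
def mark_duplicates(reads):
    """Flag all but the highest-scoring read at each (contig, start, strand) key.

    reads: dicts with "name", "contig", "start", "strand", and "score" fields.
    Returns the set of read names flagged as duplicates.
    """
    best, duplicates = {}, set()
    for read in reads:
        key = (read["contig"], read["start"], read["strand"])
        incumbent = best.get(key)
        if incumbent is None:
            best[key] = read
        elif read["score"] > incumbent["score"]:
            duplicates.add(incumbent["name"])
            best[key] = read
        else:
            duplicates.add(read["name"])
    return duplicates

reads = [
    {"name": "r1", "contig": "chr1", "start": 100, "strand": "+", "score": 60},
    {"name": "r2", "contig": "chr1", "start": 100, "strand": "+", "score": 45},
    {"name": "r3", "contig": "chr1", "start": 200, "strand": "-", "score": 50},
]
print(sorted(mark_duplicates(reads)))  # ['r2']
```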
## Feature Processing

### Transform Features

Process genomic features from BED, GFF3, GTF, and other annotation formats.
```scala { .api }
object TransformFeatures extends BDGCommandCompanion {
  val commandName = "transformFeatures"
  val commandDescription = "Convert a file with sequence features into corresponding ADAM format"
  def apply(cmdLine: Array[String]): TransformFeatures
}

class TransformFeaturesArgs extends Args4jBase with ADAMSaveAnyArgs with ParquetArgs {
  var inputPath: String
  var outputPath: String
  var sortFeatures: Boolean
  var sortLexicographically: Boolean
  var coalesce: Int
  var forceShuffle: Boolean
}
```
**Usage Example:**
```bash
adam-submit transformFeatures \
  --sortFeatures \
  annotations.gtf features.adam
```
## Variant Processing

### Transform Variants

Process variant data from VCF files with sorting and validation options.
```scala { .api }
object TransformVariants extends BDGCommandCompanion {
  val commandName = "transformVariants"
  val commandDescription = "Convert a VCF file into corresponding ADAM format"
  def apply(cmdLine: Array[String]): TransformVariants
}

class TransformVariantsArgs extends Args4jBase with ADAMSaveAnyArgs with ParquetArgs {
  var inputPath: String
  var outputPath: String
  var coalesce: Int
  var forceShuffle: Boolean
  var sort: Boolean
  var sortLexicographically: Boolean
  var stringency: String
}
```
**Usage Example:**
```bash
adam-submit transformVariants \
  --sort \
  --stringency LENIENT \
  variants.vcf variants.adam
```
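The two sort flags differ in how contigs are ordered: sorting by the contig's position in the sequence dictionary versus sorting by contig name as a string, and the two orders can disagree (for example, "chr10" precedes "chr2" lexicographically). A small Python illustration of the distinction follows; the positions and the index mapping are invented for the example.

```python
# Hypothetical sequence-dictionary order for three contigs.
contig_index = {"chr1": 0, "chr2": 1, "chr10": 2}

variants = [("chr10", 500), ("chr2", 100), ("chr1", 900)]

# Sort by contig index, then position (natural genomic order).
by_index = sorted(variants, key=lambda v: (contig_index[v[0]], v[1]))

# Sort by contig name as a string, then position (lexicographic order).
by_name = sorted(variants, key=lambda v: (v[0], v[1]))

print(by_index)  # [('chr1', 900), ('chr2', 100), ('chr10', 500)]
print(by_name)   # [('chr1', 900), ('chr10', 500), ('chr2', 100)]
```

Lexicographic order matches how many downstream tools compare contig names, while index order matches the reference's own contig ordering; pick whichever the consumer of the output expects.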
### Transform Genotypes

Process genotype data with filtering and quality control options.
```scala { .api }
object TransformGenotypes extends BDGCommandCompanion {
  val commandName = "transformGenotypes"
  val commandDescription = "Convert a VCF file into corresponding ADAM format"
  def apply(cmdLine: Array[String]): TransformGenotypes
}

class TransformGenotypesArgs extends Args4jBase with ADAMSaveAnyArgs with ParquetArgs {
  var inputPath: String
  var outputPath: String
  var coalesce: Int
  var forceShuffle: Boolean
  var sort: Boolean
  var sortLexicographically: Boolean
}
```
## Fragment Processing

### Transform Fragments

Process paired-end read fragments with insert size analysis and quality filtering.
```scala { .api }
object TransformFragments extends BDGCommandCompanion {
  val commandName = "transformFragments"
  val commandDescription = "Convert SAM/BAM/CRAM to ADAM fragments"
  def apply(cmdLine: Array[String]): TransformFragments
}

class TransformFragmentsArgs extends Args4jBase with ADAMSaveAnyArgs with ParquetArgs {
  var inputPath: String
  var outputPath: String
  var coalesce: Int
  var forceShuffle: Boolean
  var storageLevel: String
}
```
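A fragment groups the reads sequenced from one piece of DNA, so insert size falls out of the outermost alignment coordinates of a read pair. The sketch below is a simplified Python illustration of that calculation (not ADAM's API); it assumes both mates map to the same contig in the expected orientation, which a real pipeline must verify.

```python
def insert_sizes(pairs):
    """Outer distance for each properly paired fragment.

    pairs: (read1_start, read2_end) coordinates on the same contig,
    with read1 the leftmost mate. Returns one distance per pair.
    """
    return [end - start for start, end in pairs]

sizes = insert_sizes([(10_000, 10_350), (20_100, 20_480)])
print(sizes)                    # [350, 380]
print(sum(sizes) / len(sizes))  # 365.0 (mean insert size)
```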
## Coverage Analysis

### Reads to Coverage

Generate coverage depth information from aligned reads.
```scala { .api }
object Reads2Coverage extends BDGCommandCompanion {
  val commandName = "reads2coverage"
  val commandDescription = "Calculate the coverage from a given ADAM file"
  def apply(cmdLine: Array[String]): Reads2Coverage
}

class Reads2CoverageArgs extends Args4jBase with ParquetArgs {
  var inputPath: String
  var outputPath: String
  var collapse: Boolean
  var onlyCountUniqueReads: Boolean
  var coalesce: Int
  var forceShuffle: Boolean
}
```
**Usage Example:**
```bash
adam-submit reads2coverage \
  --onlyCountUniqueReads \
  --collapse \
  alignments.adam coverage.adam
```
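Per-base coverage is simply the number of reads overlapping each reference position, and collapsing then merges adjacent positions with equal depth into intervals, which is what makes the `--collapse` output compact. The following is a minimal single-machine Python sketch of both steps (half-open `[start, end)` read intervals on one contig), not ADAM's distributed implementation.

```python
def coverage(reads, length):
    """Depth at each position of a contig, from half-open [start, end) reads."""
    depth = [0] * length
    for start, end in reads:
        for pos in range(start, end):
            depth[pos] += 1
    return depth

def collapse(depth):
    """Merge runs of equal depth into (start, end, depth) intervals."""
    intervals, run_start = [], 0
    for pos in range(1, len(depth) + 1):
        if pos == len(depth) or depth[pos] != depth[run_start]:
            intervals.append((run_start, pos, depth[run_start]))
            run_start = pos
    return intervals

depth = coverage([(0, 4), (2, 6)], 8)
print(depth)            # [1, 1, 2, 2, 1, 1, 0, 0]
print(collapse(depth))  # [(0, 2, 1), (2, 4, 2), (4, 6, 1), (6, 8, 0)]
```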
## Data Management

### Merge Shards

Combine multiple data shards into consolidated files for improved query performance.
```scala { .api }
object MergeShards extends BDGCommandCompanion {
  val commandName = "mergeShards"
  val commandDescription = "Merge multiple shards of genomic data"
  def apply(cmdLine: Array[String]): MergeShards
}

class MergeShardsArgs extends Args4jBase with ParquetArgs {
  var inputPath: String
  var outputPath: String
  var coalesce: Int
  var sortOrder: String
}
```
**Usage Example:**
```bash
adam-submit mergeShards \
  --sortOrder coordinate \
  sharded_data/ merged_output.adam
```
## Performance Considerations

### Memory Management
- Use `--storageLevel` to control Spark caching strategy
- Configure `--coalesce` to optimize output file count
- Set appropriate driver and executor memory via Spark arguments

### Cluster Scaling
- Partition data appropriately for cluster size
- Use `--forceShuffle` when data skew is detected
- Monitor Spark UI for bottlenecks and resource utilization

### Data Locality
- Co-locate input data with compute resources when possible
- Use HDFS or object storage for distributed deployments
- Consider data compression vs. processing speed trade-offs