# Genomic Data Processing

This document covers ADAM CLI's core genomic data processing capabilities, including alignment transformations, feature processing, variant analysis, k-mer counting, and coverage analysis.

## K-mer Analysis

### Count Read K-mers

Analyzes k-mer frequencies in read sequences for quality control and genomic analysis.
```scala { .api }
object CountReadKmers extends BDGCommandCompanion {
  val commandName = "countKmers"
  val commandDescription = "Counts the k-mers/q-mers from a read dataset."
  def apply(cmdLine: Array[String]): CountReadKmers
}

class CountReadKmersArgs extends Args4jBase with ParquetArgs {
  var inputPath: String
  var outputPath: String
  var kmerLength: Int
  var printHistogram: Boolean
  var repartition: Int
}
```
**Usage Example:**
```bash
adam-submit countKmers \
  input.adam output_kmers.adam 21 \
  --print_histogram
```
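The core operation behind `countKmers` is easy to illustrate outside of Spark. The following is a minimal single-machine Python sketch of the idea, not ADAM's distributed implementation: slide a window of length k across each read sequence and tally every substring.

```python
from collections import Counter

def count_kmers(reads, k):
    """Tally every length-k substring across a collection of read sequences."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

# Two short reads; each contains the 3-mers ACG, CGT, GTA, TAC once.
print(count_kmers(["ACGTAC", "GTACGT"], 3))
```

A histogram such as the one `--print_histogram` produces is then just a tally of these counts themselves (how many distinct k-mers occur once, twice, and so on).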
### Count Contig K-mers

Analyzes k-mer frequencies in assembled contig sequences.

```scala { .api }
object CountContigKmers extends BDGCommandCompanion {
  val commandName = "countContigKmers"
  val commandDescription = "Counts the k-mers/q-mers from a read dataset."
  def apply(cmdLine: Array[String]): CountContigKmers
}

class CountContigKmersArgs extends Args4jBase with ParquetArgs {
  var inputPath: String       // ADAM or FASTA file
  var outputPath: String      // Output location for k-mer counts
  var kmerLength: Int         // Length of k-mers
  var printHistogram: Boolean // Print histogram of counts
}
```
## Alignment Processing

### Transform Alignments

Comprehensive alignment processing with format conversion, quality score recalibration, duplicate marking, and local realignment.
```scala { .api }
object TransformAlignments extends BDGCommandCompanion {
  val commandName = "transformAlignments"
  val commandDescription = "Convert SAM/BAM to ADAM format and optionally perform read pre-processing transformations"
  def apply(cmdLine: Array[String]): TransformAlignments
}

class TransformAlignmentsArgs extends Args4jBase with ADAMSaveAnyArgs with ParquetArgs {
  // Input/Output
  var inputPath: String
  var outputPath: String

  // Filtering and projection
  var limitProjection: Boolean
  var useAlignedReadPredicate: Boolean
  var regionPredicate: String

  // Sorting options
  var sortReads: Boolean
  var sortLexicographically: Boolean

  // Quality processing
  var markDuplicates: Boolean
  var recalibrateBaseQualities: Boolean
  var locallyRealign: Boolean
  var realignAroundIndels: Boolean

  // Trimming and binning
  var trim: Boolean
  var qualityScoreBin: Int

  // Performance tuning
  var coalesce: Int
  var forceShuffle: Boolean
  var storageLevel: String
}
```
**Key Processing Options:**

- **Mark Duplicates**: Identify and flag PCR/optical duplicates
- **Base Quality Recalibration**: Adjust base quality scores using known variants
- **Local Realignment**: Realign reads around indels for improved accuracy
- **Quality Score Binning**: Reduce quality score precision to save storage space
- **Read Trimming**: Remove low-quality bases from read ends
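To make the quality score binning option concrete: collapsing many distinct Phred values into a few representative values shrinks the quality-string alphabet, which compresses far better in Parquet. The sketch below is a hypothetical Python illustration of one binning scheme (mapping each score to its bin's midpoint); ADAM's actual bin boundaries may differ.

```python
def bin_quality(scores, bin_size):
    """Map each Phred score to the midpoint of its bin, shrinking the alphabet."""
    return [(q // bin_size) * bin_size + bin_size // 2 for q in scores]

# Scores 37 and 38 collapse to the same value, so the string compresses better.
print(bin_quality([37, 38, 12, 2], 10))  # [35, 35, 15, 5]
```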
**Usage Examples:**
```bash
# Basic format conversion
adam-submit transformAlignments input.bam output.adam

# Full preprocessing pipeline
adam-submit transformAlignments \
  --markDuplicates \
  --recalibrateBaseQualities \
  --locallyRealign \
  --sortReads \
  input.bam output.adam

# With region filtering
adam-submit transformAlignments \
  --regionPredicate "referenceName=chr1 AND start>=1000000 AND end<=2000000" \
  input.bam output.adam
```
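The duplicate-marking step can be understood as a grouping problem: reads whose alignments share the same position key are duplicate candidates, and only the best-scoring one escapes the flag. The Python sketch below is a deliberately simplified, single-machine illustration of that idea; real implementations (including ADAM's) key on unclipped 5' positions and handle read pairs and orientation.

```python
def mark_duplicates(reads):
    """Flag all but the highest-scoring read at each (contig, start, strand) key.

    reads: dicts with "name", "contig", "start", "strand", and "score" fields.
    Returns the set of read names flagged as duplicates.
    """
    best, duplicates = {}, set()
    for read in reads:
        key = (read["contig"], read["start"], read["strand"])
        incumbent = best.get(key)
        if incumbent is None:
            best[key] = read
        elif read["score"] > incumbent["score"]:
            duplicates.add(incumbent["name"])
            best[key] = read
        else:
            duplicates.add(read["name"])
    return duplicates

reads = [
    {"name": "r1", "contig": "chr1", "start": 100, "strand": "+", "score": 60},
    {"name": "r2", "contig": "chr1", "start": 100, "strand": "+", "score": 45},
    {"name": "r3", "contig": "chr1", "start": 200, "strand": "-", "score": 50},
]
print(sorted(mark_duplicates(reads)))  # ['r2']
```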
## Feature Processing

### Transform Features

Process genomic features from BED, GFF3, GTF, and other annotation formats.
```scala { .api }
object TransformFeatures extends BDGCommandCompanion {
  val commandName = "transformFeatures"
  val commandDescription = "Convert a file with sequence features into corresponding ADAM format"
  def apply(cmdLine: Array[String]): TransformFeatures
}

class TransformFeaturesArgs extends Args4jBase with ADAMSaveAnyArgs with ParquetArgs {
  var inputPath: String
  var outputPath: String
  var sortFeatures: Boolean
  var sortLexicographically: Boolean
  var coalesce: Int
  var forceShuffle: Boolean
}
```
**Usage Example:**
```bash
adam-submit transformFeatures \
  --sortFeatures \
  annotations.gtf features.adam
```
## Variant Processing

### Transform Variants

Process variant data from VCF files with sorting and validation options.
```scala { .api }
object TransformVariants extends BDGCommandCompanion {
  val commandName = "transformVariants"
  val commandDescription = "Convert a VCF file into corresponding ADAM format"
  def apply(cmdLine: Array[String]): TransformVariants
}

class TransformVariantsArgs extends Args4jBase with ADAMSaveAnyArgs with ParquetArgs {
  var inputPath: String
  var outputPath: String
  var coalesce: Int
  var forceShuffle: Boolean
  var sort: Boolean
  var sortLexicographically: Boolean
  var stringency: String
}
```
**Usage Example:**
```bash
adam-submit transformVariants \
  --sort \
  --stringency LENIENT \
  variants.vcf variants.adam
```
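The two sort flags differ in how contigs are ordered: sorting by the contig's position in the sequence dictionary versus sorting by contig name as a string, and the two orders can disagree (for example, "chr10" precedes "chr2" lexicographically). A small Python illustration of the distinction follows; the positions and the index mapping are invented for the example.

```python
# Hypothetical sequence-dictionary order for three contigs.
contig_index = {"chr1": 0, "chr2": 1, "chr10": 2}

variants = [("chr10", 500), ("chr2", 100), ("chr1", 900)]

# Sort by contig index, then position (natural genomic order).
by_index = sorted(variants, key=lambda v: (contig_index[v[0]], v[1]))

# Sort by contig name as a string, then position (lexicographic order).
by_name = sorted(variants, key=lambda v: (v[0], v[1]))

print(by_index)  # [('chr1', 900), ('chr2', 100), ('chr10', 500)]
print(by_name)   # [('chr1', 900), ('chr10', 500), ('chr2', 100)]
```

Lexicographic order matches how many downstream tools compare contig names, while index order matches the reference's own contig ordering; pick whichever the consumer of the output expects.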
### Transform Genotypes

Process genotype data with filtering and quality control options.
```scala { .api }
object TransformGenotypes extends BDGCommandCompanion {
  val commandName = "transformGenotypes"
  val commandDescription = "Convert a VCF file into corresponding ADAM format"
  def apply(cmdLine: Array[String]): TransformGenotypes
}

class TransformGenotypesArgs extends Args4jBase with ADAMSaveAnyArgs with ParquetArgs {
  var inputPath: String
  var outputPath: String
  var coalesce: Int
  var forceShuffle: Boolean
  var sort: Boolean
  var sortLexicographically: Boolean
}
```
## Fragment Processing

### Transform Fragments

Process paired-end read fragments with insert size analysis and quality filtering.
```scala { .api }
object TransformFragments extends BDGCommandCompanion {
  val commandName = "transformFragments"
  val commandDescription = "Convert SAM/BAM/CRAM to ADAM fragments"
  def apply(cmdLine: Array[String]): TransformFragments
}

class TransformFragmentsArgs extends Args4jBase with ADAMSaveAnyArgs with ParquetArgs {
  var inputPath: String
  var outputPath: String
  var coalesce: Int
  var forceShuffle: Boolean
  var storageLevel: String
}
```
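A fragment groups the reads sequenced from one piece of DNA, so insert size falls out of the outermost alignment coordinates of a read pair. The sketch below is a simplified Python illustration of that calculation (not ADAM's API); it assumes both mates map to the same contig in the expected orientation, which a real pipeline must verify.

```python
def insert_sizes(pairs):
    """Outer distance for each properly paired fragment.

    pairs: (read1_start, read2_end) coordinates on the same contig,
    with read1 the leftmost mate. Returns one distance per pair.
    """
    return [end - start for start, end in pairs]

sizes = insert_sizes([(10_000, 10_350), (20_100, 20_480)])
print(sizes)                    # [350, 380]
print(sum(sizes) / len(sizes))  # 365.0 (mean insert size)
```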
## Coverage Analysis

### Reads to Coverage

Generate coverage depth information from aligned reads.
```scala { .api }
object Reads2Coverage extends BDGCommandCompanion {
  val commandName = "reads2coverage"
  val commandDescription = "Calculate the coverage from a given ADAM file"
  def apply(cmdLine: Array[String]): Reads2Coverage
}

class Reads2CoverageArgs extends Args4jBase with ParquetArgs {
  var inputPath: String
  var outputPath: String
  var collapse: Boolean
  var onlyCountUniqueReads: Boolean
  var coalesce: Int
  var forceShuffle: Boolean
}
```
**Usage Example:**
```bash
adam-submit reads2coverage \
  --onlyCountUniqueReads \
  --collapse \
  alignments.adam coverage.adam
```
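Per-base coverage is simply the number of reads overlapping each reference position, and collapsing then merges adjacent positions with equal depth into intervals, which is what makes the `--collapse` output compact. The following is a minimal single-machine Python sketch of both steps (half-open `[start, end)` read intervals on one contig), not ADAM's distributed implementation.

```python
def coverage(reads, length):
    """Depth at each position of a contig, from half-open [start, end) reads."""
    depth = [0] * length
    for start, end in reads:
        for pos in range(start, end):
            depth[pos] += 1
    return depth

def collapse(depth):
    """Merge runs of equal depth into (start, end, depth) intervals."""
    intervals, run_start = [], 0
    for pos in range(1, len(depth) + 1):
        if pos == len(depth) or depth[pos] != depth[run_start]:
            intervals.append((run_start, pos, depth[run_start]))
            run_start = pos
    return intervals

depth = coverage([(0, 4), (2, 6)], 8)
print(depth)            # [1, 1, 2, 2, 1, 1, 0, 0]
print(collapse(depth))  # [(0, 2, 1), (2, 4, 2), (4, 6, 1), (6, 8, 0)]
```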
## Data Management

### Merge Shards

Combine multiple data shards into consolidated files for improved query performance.
```scala { .api }
object MergeShards extends BDGCommandCompanion {
  val commandName = "mergeShards"
  val commandDescription = "Merge multiple shards of genomic data"
  def apply(cmdLine: Array[String]): MergeShards
}

class MergeShardsArgs extends Args4jBase with ParquetArgs {
  var inputPath: String
  var outputPath: String
  var coalesce: Int
  var sortOrder: String
}
```
**Usage Example:**
```bash
adam-submit mergeShards \
  --sortOrder coordinate \
  sharded_data/ merged_output.adam
```
## Performance Considerations

### Memory Management
- Use `--storageLevel` to control Spark caching strategy
- Configure `--coalesce` to optimize output file count
- Set appropriate driver and executor memory via Spark arguments

### Cluster Scaling
- Partition data appropriately for cluster size
- Use `--forceShuffle` when data skew is detected
- Monitor Spark UI for bottlenecks and resource utilization

### Data Locality
- Co-locate input data with compute resources when possible
- Use HDFS or object storage for distributed deployments
- Consider data compression vs. processing speed trade-offs