# Data Inspection and Analysis

This document covers the ADAM CLI's data inspection capabilities for viewing, filtering, and analyzing genomic datasets. These tools offer samtools-like functionality and run as distributed Spark jobs.

## Data Viewing and Filtering

### View Command

The View command provides samtools view-like functionality for filtering and examining genomic alignment data with support for flag-based filtering and format conversion.

```scala { .api }
object View extends BDGCommandCompanion {
  val commandName = "view"
  val commandDescription = "View certain reads from an alignment-record file."
  def apply(cmdLine: Array[String]): View
}

class ViewArgs extends Args4jBase with ParquetArgs with ADAMSaveAnyArgs {
  var inputPath: String       // Input alignment file
  var outputPath: String      // Output file (optional)
  var outputPathArg: String   // Alternative output specification

  // Flag-based filtering (samtools-compatible)
  var matchAllBits: Int       // Keep reads with all of these bits set (-f)
  var mismatchAllBits: Int    // Keep reads with none of these bits set (-F)
  var matchSomeBits: Int      // Keep reads with at least one of these bits set (-g)
  var mismatchSomeBits: Int   // Keep reads with at least one of these bits unset (-G)

  // Output options
  var printCount: Boolean     // Print count only (-c)
}
```

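The four flag options can be read as bitwise tests on each read's SAM flag. The sketch below illustrates that logic in plain shell, assuming the semantics suggested by the field names (which mirror samtools `-f` and `-F`). The function `keep_read` is hypothetical and is not part of ADAM.

```bash
# Hypothetical helper: decide whether a read with SAM flag $1 passes the filters.
# Arguments: flag, -f mask, -F mask, -g mask, -G mask (a mask of 0 means "not set").
# Prints "keep" or "drop". The body runs in a subshell so variables stay local.
keep_read() (
  flag=$1
  match_all=${2:-0}
  mismatch_all=${3:-0}
  match_some=${4:-0}
  mismatch_some=${5:-0}

  if [ $(( flag & match_all )) -ne "$match_all" ]; then
    echo drop   # -f: every bit in the mask must be set
  elif [ $(( flag & mismatch_all )) -ne 0 ]; then
    echo drop   # -F: no bit in the mask may be set
  elif [ "$match_some" -ne 0 ] && [ $(( flag & match_some )) -eq 0 ]; then
    echo drop   # -g: at least one bit in the mask must be set
  elif [ "$mismatch_some" -ne 0 ] && [ $(( flag & mismatch_some )) -eq "$mismatch_some" ]; then
    echo drop   # -G: at least one bit in the mask must be unset
  else
    echo keep
  fi
)

keep_read 99 2 4   # proper pair and mapped, prints: keep
keep_read 77 2 4   # unmapped and not a proper pair, prints: drop
```
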
**Flag Filtering Examples:**

```bash
# View only mapped reads (exclude unmapped, flag 4)
adam-submit view -F 4 alignments.adam

# View only proper pairs (flag 2) that are mapped (exclude flag 4)
adam-submit view -f 2 -F 4 alignments.adam mapped_pairs.adam

# Count unmapped reads
adam-submit view -f 4 -c alignments.adam

# View first read in pair (flag 64), exclude secondary alignments (flag 256)
adam-submit view -f 64 -F 256 alignments.adam first_reads.adam
```

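Each numeric argument in these examples is a bit mask, so a single value can carry several conditions. As a minimal sketch, the hypothetical helper `decode_flag` splits a SAM flag into named bits using shell arithmetic. The labels are illustrative; the bit values follow the SAM specification, and `32` (mate on reverse strand) and `2048` (supplementary alignment) are taken from that specification rather than from this document's flag list.

```bash
# Hypothetical helper: print the names of the bits set in a SAM flag value.
# The body runs in a subshell so the loop variables stay local.
decode_flag() (
  flag=$1
  names=""
  for entry in 1:paired 2:proper_pair 4:unmapped 8:mate_unmapped \
               16:reverse 32:mate_reverse 64:first_in_pair 128:second_in_pair \
               256:secondary 512:qc_fail 1024:duplicate 2048:supplementary; do
    bit=${entry%%:*}    # numeric value before the colon
    name=${entry#*:}    # label after the colon
    if [ $(( flag & bit )) -ne 0 ]; then
      names="${names:+$names }$name"
    fi
  done
  echo "$names"
)

decode_flag 83     # prints: paired proper_pair reverse first_in_pair
decode_flag 1028   # prints: unmapped duplicate
```
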
**Common SAM Flags:**
- `1`: Read is paired
- `2`: Read is in proper pair
- `4`: Read is unmapped
- `8`: Mate is unmapped
- `16`: Read is on reverse strand
- `64`: First read in pair
- `128`: Second read in pair
- `256`: Secondary alignment
- `512`: Read fails quality checks
- `1024`: PCR/optical duplicate

### Print ADAM Data

Display the contents of ADAM files in human-readable format for data inspection and debugging.

```scala { .api }
object PrintADAM extends BDGCommandCompanion {
  val commandName = "printAdam"
  val commandDescription = "Print the contents of an ADAM file"
  def apply(cmdLine: Array[String]): PrintADAM
}

class PrintADAMArgs extends Args4jBase with ParquetArgs {
  var inputPath: String    // Input ADAM file to print
  var outputPath: String   // Optional output file
  var pretty: Boolean      // Pretty-print JSON output
  var records: Int         // Number of records to print
}
```

**Usage Examples:**
```bash
# Print first 10 records to console
adam-submit printAdam --records 10 data.adam

# Pretty-print all records to file
adam-submit printAdam --pretty data.adam output.txt

# Inspect data structure
adam-submit printAdam --records 1 --pretty alignments.adam
```

## Statistical Analysis

### FlagStat

Generate comprehensive alignment statistics similar to samtools flagstat, providing essential quality control metrics for sequencing data.

```scala { .api }
object FlagStat extends BDGCommandCompanion {
  val commandName = "flagstat"
  val commandDescription = "Print statistics about reads in an alignment file"
  def apply(cmdLine: Array[String]): FlagStat
}

class FlagStatArgs extends Args4jBase {
  var inputPath: String    // Input alignment file
  var outputPath: String   // Optional output file for statistics
  var stringency: String   // Validation stringency
}
```

**Statistics Generated:**
- Total reads processed
- Mapped reads and mapping percentage
- Properly paired reads for paired-end data
- Singleton reads (mate unmapped)
- Read duplicates (PCR/optical)
- Secondary and supplementary alignments
- Quality control failures

**Usage Examples:**
```bash
# Basic flagstat to console
adam-submit flagstat alignments.adam

# Save statistics to file
adam-submit flagstat alignments.adam stats.txt

# Use lenient validation for problematic files
adam-submit flagstat --stringency LENIENT alignments.adam
```

**Sample Output:**
```
71723 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
69543 + 0 mapped (97.0% : N/A)
71723 + 0 paired in sequencing
35861 + 0 read1
35862 + 0 read2
67432 + 0 properly paired (94.0% : N/A)
69543 + 0 with itself and mate mapped
0 + 0 singletons (0.0% : N/A)
```

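The percentages in this report are simple ratios of the QC-passed counts. The sketch below recomputes the mapping rate from a few of the sample lines with standard shell tools. The function names `passed_count` and `percent` are illustrative, and the parsing assumes the `N + M label` layout shown in the sample output.

```bash
# A few lines copied from the sample report.
report='71723 + 0 in total (QC-passed reads + QC-failed reads)
69543 + 0 mapped (97.0% : N/A)
67432 + 0 properly paired (94.0% : N/A)'

# Print the QC-passed count (first field) of the first line containing label $1.
passed_count() {
  printf '%s\n' "$report" | awk -v label="$1" 'index($0, label) { print $1; exit }'
}

# Print 100 * $1 / $2 rounded to one decimal place.
percent() {
  awk -v n="$1" -v d="$2" 'BEGIN { printf "%.1f\n", 100 * n / d }'
}

total=$(passed_count "in total")
mapped=$(passed_count " mapped (")
percent "$mapped" "$total"   # prints: 97.0
```

With real data, the same functions could read a saved report by setting `report=$(cat stats.txt)` first.
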
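## Quality Control and Validation

### Validation Stringency Control

Validation stringency controls how a command reacts to malformed records. Among the argument classes listed in this document, only `FlagStatArgs` declares a `stringency` field, so confirm that the other commands accept the option in your ADAM version. The three levels are:
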
```scala { .api }
// Validation levels
ValidationStringency.STRICT    // Fail on any validation errors
ValidationStringency.LENIENT   // Issue warnings for validation errors
ValidationStringency.SILENT    // Ignore validation errors
```

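To make the three levels concrete, here is a small shell sketch of how a caller might react to a malformed record under each setting. It only illustrates the behaviour described in the comments above; `handle_invalid` is a hypothetical function, not ADAM code.

```bash
# Hypothetical handler: react to a validation problem ($2) according to stringency ($1).
# Returns non-zero only when processing should stop.
handle_invalid() {
  case "$1" in
    STRICT)  echo "error: $2" >&2; return 1 ;;     # fail on the first problem
    LENIENT) echo "warning: $2" >&2; return 0 ;;   # report it and carry on
    SILENT)  return 0 ;;                           # carry on without reporting
    *)       echo "unknown stringency: $1" >&2; return 2 ;;
  esac
}

# Under LENIENT the warning goes to stderr and processing continues.
handle_invalid LENIENT "read has no mate" && echo "continuing"
```
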
**Usage in Commands:**
```bash
# Strict validation (default)
adam-submit view --stringency STRICT alignments.adam

# Lenient validation for legacy data
adam-submit flagstat --stringency LENIENT old_alignments.adam

# Silent validation for known problematic files
adam-submit printAdam --stringency SILENT problematic.adam
```

## Performance Considerations

### Large Dataset Handling

For very large datasets, consider these optimization strategies:

```bash
# Count matching reads without writing any output
adam-submit view -c alignments.adam

# Print only the first 1000 records for a quick look
adam-submit printAdam --records 1000 large_file.adam

# Use appropriate Spark resources (Spark options go before the `--` separator)
adam-submit --driver-memory 8g --executor-memory 4g -- \
  flagstat huge_alignment.adam
```

### Memory Management

```bash
# For memory-intensive operations, pass Spark settings before the `--` separator
adam-submit --conf spark.sql.adaptive.enabled=true \
  --conf spark.sql.adaptive.coalescePartitions.enabled=true \
  -- \
  view -f 2 large_alignments.adam filtered.adam
```

## Integration with Analysis Pipelines

### Filtering for Downstream Analysis

The View command is commonly used to prepare data subsets:

```bash
# Extract high-quality mapped pairs for variant calling:
#   -f 3     paired (1) and in a proper pair (2)
#   -F 1280  exclude secondary alignments (256) and duplicates (1024)
#   -q 20    minimum mapping quality
adam-submit view \
  -f 3 \
  -F 1280 \
  -q 20 \
  input.adam high_quality.adam

# Extract unmapped reads for assembly
adam-submit view -f 4 input.adam unmapped.adam

# Extract reads from a specific chromosome
adam-submit view \
  --regionPredicate "referenceName=chr22" \
  input.adam chr22.adam
```

### Quality Control Workflows

Combine tools for comprehensive QC:

```bash
# 1. Get overall statistics
adam-submit flagstat input.adam > qc_stats.txt

# 2. Extract reads that failed QC (flag 512) for inspection
adam-submit view -f 512 input.adam failed_qc.adam

# 3. Count duplicate reads (flag 1024) to check the duplicate rate
adam-submit view -f 1024 -c input.adam
```

### Data Validation Pipelines

```bash
# Validate file integrity
adam-submit printAdam --records 1 --stringency STRICT data.adam

# Generate detailed statistics
adam-submit flagstat --stringency STRICT data.adam stats.txt

# Filter and validate simultaneously
adam-submit view -F 4 --stringency LENIENT input.adam validated.adam
```

## Output Format Options

### Supported Output Formats

The View command supports multiple output formats through the `ADAMSaveAnyArgs` mixin:

- **ADAM Parquet**: Native format for continued ADAM processing
- **SAM/BAM**: For external tool compatibility
- **JSON**: For programmatic access and debugging
- **Text**: Human-readable format for inspection

### Format Specification

```bash
# Save as BAM for external tools
adam-submit view -f 2 input.adam -o output.bam

# Save as JSON for analysis scripts
adam-submit view --records 100 input.adam -o sample.json

# Save as text for manual inspection
adam-submit view --records 10 input.adam -o sample.txt
```
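
In these examples the output format follows from the file extension of the output path. The sketch below captures that convention as a shell `case` statement. It is an assumption drawn from the examples in this section, and the function `output_format` is hypothetical rather than ADAM's actual dispatch logic.

```bash
# Hypothetical helper: guess the output format from a path's extension.
output_format() {
  case "$1" in
    *.bam)  echo "BAM" ;;
    *.sam)  echo "SAM" ;;
    *.json) echo "JSON" ;;
    *.txt)  echo "Text" ;;
    *)      echo "ADAM Parquet" ;;   # default, e.g. *.adam
  esac
}

output_format output.bam      # prints: BAM
output_format filtered.adam   # prints: ADAM Parquet
```

A lookup like this can serve as a quick sanity check in a wrapper script before a long-running job is submitted.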