# Data Inspection and Analysis

This document covers ADAM CLI's data inspection capabilities for viewing, filtering, and analyzing genomic datasets. These tools provide samtools-like functionality with distributed processing capabilities.

## Data Viewing and Filtering

### View Command

The View command provides `samtools view`-like functionality for filtering and examining genomic alignment data, with support for flag-based filtering and format conversion.

```scala { .api }
object View extends BDGCommandCompanion {
  val commandName = "view"
  val commandDescription = "View certain reads from an alignment-record file."
  def apply(cmdLine: Array[String]): View
}

class ViewArgs extends Args4jBase with ParquetArgs with ADAMSaveAnyArgs {
  var inputPath: String        // Input alignment file
  var outputPath: String       // Output file (optional)
  var outputPathArg: String    // Alternative output specification

  // Flag-based filtering (samtools-compatible)
  var matchAllBits: Int        // Keep reads with all of these bits set (-f)
  var mismatchAllBits: Int     // Keep reads with none of these bits set (-F)
  var matchSomeBits: Int       // Keep reads with any of these bits set (-g)
  var mismatchSomeBits: Int    // Keep reads with at least one of these bits unset (-G)

  // Output options
  var printCount: Boolean      // Print count only (-c)
}
```

**Flag Filtering Examples:**

```bash
# View only mapped reads (exclude unmapped, flag 4)
adam-submit view -F 4 alignments.adam

# View only proper pairs (flag 2) that are mapped (exclude flag 4)
adam-submit view -f 2 -F 4 alignments.adam mapped_pairs.adam

# Count unmapped reads
adam-submit view -f 4 -c alignments.adam

# View first read in pair (flag 64), exclude secondary alignments (flag 256)
adam-submit view -f 64 -F 256 alignments.adam first_reads.adam
```

**Common SAM Flags:**
- `1`: Read is paired
- `2`: Read is in proper pair
- `4`: Read is unmapped
- `8`: Mate is unmapped
- `16`: Read is on reverse strand
- `64`: First read in pair
- `128`: Second read in pair
- `256`: Secondary alignment
- `512`: Read fails quality checks
- `1024`: PCR/optical duplicate
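
A SAM flag value is the bitwise OR of the individual flags listed above, so a filter can be reasoned about with plain integer arithmetic. The sketch below is illustrative only: `flag_has` and `passes_filter` are hypothetical helper names, and the filter rule assumes samtools-style semantics (a read is kept when every `-f` bit is set and no `-F` bit is set), which is worth confirming against your ADAM version.

```bash
# Hypothetical helpers for reasoning about SAM flag values; not part of ADAM.

# flag_has FLAG BIT: succeeds when BIT is set in FLAG.
flag_has() {
  [ $(( $1 & $2 )) -ne 0 ]
}

# passes_filter FLAG REQUIRE EXCLUDE: succeeds when all REQUIRE bits are set
# and no EXCLUDE bit is set (assumed samtools-style -f / -F semantics).
passes_filter() {
  [ $(( $1 & $2 )) -eq "$2" ] && [ $(( $1 & $3 )) -eq 0 ]
}

# 99 = 1 (paired) + 2 (proper pair) + 32 (mate on reverse strand) + 64 (first in pair)
if flag_has 99 64; then echo "99 is a first-in-pair read"; fi

# Would a read with flag 99 survive `-f 2 -F 4`?
if passes_filter 99 2 4; then echo "kept"; else echo "dropped"; fi
```

Under these assumptions a read with flag 99 is kept, while a read with flag 77 (paired, unmapped, mate unmapped, first in pair) is dropped because bit 2 is not set and bit 4 is.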

### Print ADAM Data

Display the contents of ADAM files in human-readable format for data inspection and debugging.

```scala { .api }
object PrintADAM extends BDGCommandCompanion {
  val commandName = "printAdam"
  val commandDescription = "Print the contents of an ADAM file"
  def apply(cmdLine: Array[String]): PrintADAM
}

class PrintADAMArgs extends Args4jBase with ParquetArgs {
  var inputPath: String    // Input ADAM file to print
  var outputPath: String   // Optional output file
  var pretty: Boolean      // Pretty-print JSON output
  var records: Int         // Number of records to print
}
```

**Usage Examples:**
```bash
# Print first 10 records to console
adam-submit printAdam --records 10 data.adam

# Pretty-print all records to file
adam-submit printAdam --pretty data.adam output.txt

# Inspect data structure
adam-submit printAdam --records 1 --pretty alignments.adam
```

## Statistical Analysis

### FlagStat

Generate comprehensive alignment statistics similar to `samtools flagstat`, providing essential quality control metrics for sequencing data.

```scala { .api }
object FlagStat extends BDGCommandCompanion {
  val commandName = "flagstat"
  val commandDescription = "Print statistics about reads in an alignment file"
  def apply(cmdLine: Array[String]): FlagStat
}

class FlagStatArgs extends Args4jBase {
  var inputPath: String    // Input alignment file
  var outputPath: String   // Optional output file for statistics
  var stringency: String   // Validation stringency
}
```

**Statistics Generated:**
- Total reads processed
- Mapped reads and mapping percentage
- Properly paired reads for paired-end data
- Singleton reads (mate unmapped)
- Read duplicates (PCR/optical)
- Secondary and supplementary alignments
- Quality control failures

**Usage Examples:**
```bash
# Basic flagstat to console
adam-submit flagstat alignments.adam

# Save statistics to file
adam-submit flagstat alignments.adam stats.txt

# Use lenient validation for problematic files
adam-submit flagstat --stringency LENIENT alignments.adam
```

**Sample Output:**
```
71723 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
69543 + 0 mapped (97.0% : N/A)
71723 + 0 paired in sequencing
35861 + 0 read1
35862 + 0 read2
67432 + 0 properly paired (94.0% : N/A)
69543 + 0 with itself and mate mapped
0 + 0 singletons (0.0% : N/A)
```
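
The percentages in the report are derived from the counts on the `in total` line. As a rough sketch, the function below recomputes the mapped percentage from a saved report. `mapped_pct` is a hypothetical helper, not an ADAM command, and it assumes the report uses the line layout shown in the sample output.

```bash
# Hypothetical helper, not part of ADAM: recompute the QC-passed mapped
# percentage from flagstat-style text on standard input.
mapped_pct() {
  awk '
    / in total /  { total = $1 }    # first field is the QC-passed count
    / mapped \(/  { mapped = $1 }   # matches "mapped (", not "mate mapped"
    END {
      if (total > 0) printf "%.2f\n", 100 * mapped / total
    }
  '
}

# Usage with two lines taken from the sample report
printf '%s\n' \
  '71723 + 0 in total (QC-passed reads + QC-failed reads)' \
  '69543 + 0 mapped (97.0% : N/A)' | mapped_pct
```

For the sample counts this prints `96.96`, which agrees with the `97.0%` shown in the report once rounded to one decimal place.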

## Quality Control and Validation

### Validation Stringency Control

All inspection tools support configurable validation stringency for handling problematic data:

```scala { .api }
// Validation levels
ValidationStringency.STRICT   // Fail on any validation errors
ValidationStringency.LENIENT  // Issue warnings for validation errors
ValidationStringency.SILENT   // Ignore validation errors
```

**Usage in Commands:**
```bash
# Strict validation (default)
adam-submit view --stringency STRICT alignments.adam

# Lenient validation for legacy data
adam-submit flagstat --stringency LENIENT old_alignments.adam

# Silent validation for known problematic files
adam-submit printAdam --stringency SILENT problematic.adam
```

## Performance Considerations

### Large Dataset Handling

For very large datasets, consider these optimization strategies:

```bash
# Count matching reads without writing any records
adam-submit view -c alignments.adam

# Limit the number of records printed for a quick look
adam-submit printAdam --records 1000 large_file.adam

# Use appropriate Spark resources (Spark options go before the `--` separator)
adam-submit --driver-memory 8g --executor-memory 4g -- \
  flagstat huge_alignment.adam
```

### Memory Management

```bash
# For memory-intensive operations
adam-submit --conf spark.sql.adaptive.enabled=true \
  --conf spark.sql.adaptive.coalescePartitions.enabled=true \
  -- \
  view -f 2 large_alignments.adam filtered.adam
```

## Integration with Analysis Pipelines

### Filtering for Downstream Analysis

The View command is commonly used to prepare data subsets:

```bash
# Extract high-quality mapped pairs for variant calling:
#   -f 3     keep reads that are paired (1) and in a proper pair (2)
#   -F 1028  drop duplicates (1024) and unmapped reads (4)
#   -q 20    minimum mapping quality
adam-submit view \
  -f 3 \
  -F 1028 \
  -q 20 \
  input.adam high_quality.adam

# Extract unmapped reads for assembly
adam-submit view -f 4 input.adam unmapped.adam

# Extract reads from a specific chromosome
adam-submit view \
  --regionPredicate "referenceName=chr22" \
  input.adam chr22.adam
```
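
Values passed to `-f` and `-F` are sums of individual SAM flag bits, so it can help to build them from names instead of adding them by hand. A minimal sketch, assuming a hypothetical `sam_mask` helper (the flag names are labels chosen for this example, not ADAM options):

```bash
# Hypothetical helper, not part of ADAM: OR together named SAM flag bits.
sam_mask() {
  mask=0
  for name in "$@"; do
    case "$name" in
      paired)        bit=1 ;;
      proper_pair)   bit=2 ;;
      unmapped)      bit=4 ;;
      mate_unmapped) bit=8 ;;
      reverse)       bit=16 ;;
      first)         bit=64 ;;
      second)        bit=128 ;;
      secondary)     bit=256 ;;
      qc_fail)       bit=512 ;;
      duplicate)     bit=1024 ;;
      *) echo "unknown flag name: $name" >&2; return 1 ;;
    esac
    mask=$(( mask | bit ))
  done
  echo "$mask"
}

sam_mask paired proper_pair     # prints 3
sam_mask duplicate secondary    # prints 1280
sam_mask duplicate unmapped     # prints 1028
```

The result can be substituted into a command, for example `-F "$(sam_mask duplicate secondary)"`.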

### Quality Control Workflows

Combine tools for comprehensive QC:

```bash
# 1. Get overall statistics
adam-submit flagstat input.adam > qc_stats.txt

# 2. Inspect problematic reads
adam-submit view -f 512 input.adam failed_qc.adam

# 3. Count duplicates (divide by the flagstat total to get the duplicate rate)
adam-submit view -f 1024 -c input.adam
```
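
The duplicate count from step 3 becomes a rate once it is divided by the total read count reported by `flagstat`. A small sketch, assuming a hypothetical `pct` helper and made-up counts used purely for illustration:

```bash
# Hypothetical helper, not part of ADAM: print PART as a percentage of TOTAL.
pct() {
  awk -v part="$1" -v total="$2" 'BEGIN {
    if (total <= 0) { print "total must be positive" > "/dev/stderr"; exit 1 }
    printf "%.2f\n", 100 * part / total
  }'
}

# Illustrative numbers only: 250 duplicates out of 10000 reads
pct 250 10000    # prints 2.50
```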

### Data Validation Pipelines

```bash
# Spot-check that the file can be read under strict validation
adam-submit printAdam --records 1 --stringency STRICT data.adam

# Generate detailed statistics
adam-submit flagstat --stringency STRICT data.adam stats.txt

# Filter and validate simultaneously
adam-submit view -F 4 --stringency LENIENT input.adam validated.adam
```

## Output Format Options

### Supported Output Formats

The View command supports multiple output formats through the `ADAMSaveAnyArgs` mixin:

- **ADAM Parquet**: Native format for continued ADAM processing
- **SAM/BAM**: For external tool compatibility
- **JSON**: For programmatic access and debugging
- **Text**: Human-readable format for inspection

### Format Specification

```bash
# Save as BAM for external tools
adam-submit view -f 2 input.adam -o output.bam

# Save as JSON for analysis scripts
adam-submit view --records 100 input.adam -o sample.json

# Save as text for manual inspection
adam-submit view --records 10 input.adam -o sample.txt
```