# Format Conversion

This document covers the ADAM CLI commands that convert between common genomic file formats and ADAM's Parquet-based storage format.

## FASTA Conversions

### FASTA to ADAM

Convert FASTA sequence files to ADAM's Parquet-based nucleotide contig format for improved performance and integration with Spark-based analysis pipelines.

```scala { .api }
object Fasta2ADAM extends BDGCommandCompanion {
  val commandName = "fasta2adam"
  val commandDescription = "Converts a text FASTA sequence file into an ADAMNucleotideContig Parquet file which represents assembled sequences."
  def apply(cmdLine: Array[String]): Fasta2ADAM
}

class Fasta2ADAMArgs extends Args4jBase with ParquetSaveArgs {
  var fastaFile: String    // Input FASTA file path
  var outputPath: String   // Output ADAM file path
  var verbose: Boolean     // Enhanced debugging information
  var reads: String        // Contig ID mapping for read compatibility
  var maximumLength: Long  // Maximum fragment length (default: 10,000)
  var partitions: Int      // Number of output partitions
}
```

**Key Features:**
- **Sequence Indexing**: Automatically creates sequence dictionaries for downstream tools
- **Fragment Control**: Splits large sequences into manageable fragments
- **ID Mapping**: Maps contig IDs to match existing read datasets
- **Partitioning**: Controls output parallelization for optimal performance
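
To get a feel for what fragment control does to a reference, the sketch below estimates how many fragments each contig would yield for a given maximum fragment length. It assumes contigs are cut into consecutive pieces of at most that length, so the count is the ceiling of the contig length divided by the maximum. `fasta_fragment_plan` is a hypothetical helper, not part of ADAM, and relies only on `awk`.

```bash
# Hypothetical helper (not part of ADAM): print "<contig> <length> <fragments>"
# for every record in a FASTA file, assuming fixed-size consecutive fragments.
fasta_fragment_plan() {
  fasta=$1
  max_len=${2:-10000}   # mirrors the documented default of 10,000
  awk -v max="$max_len" '
    function emit() {
      # ceiling division: fragments needed to cover the whole contig
      if (name != "") print name, len, int((len + max - 1) / max)
    }
    /^>/ { emit(); name = substr($1, 2); len = 0; next }   # header starts a new contig
    { len += length($0) }                                   # sequence line adds to the length
    END { emit() }
  ' "$fasta"
}
```

For example, `fasta_fragment_plan reference.fasta 50000` would list each contig in `reference.fasta` with its length and the estimated fragment count at a 50,000-base maximum.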

**Usage Examples:**
```bash
# Basic conversion
adam-submit fasta2adam reference.fasta reference.adam

# With verbose output and custom fragment length
adam-submit fasta2adam \
  --verbose \
  --fragment_length 50000 \
  --repartition 100 \
  reference.fasta reference.adam

# Map contig IDs to match read dataset
adam-submit fasta2adam \
  --reads alignments.adam \
  --verbose \
  reference.fasta reference.adam
```

### ADAM to FASTA

Convert ADAM nucleotide contig data back to standard FASTA format for compatibility with external tools.

```scala { .api }
object ADAM2Fasta extends BDGCommandCompanion {
  val commandName = "adam2fasta"
  val commandDescription = "Convert ADAM nucleotide contig fragments to FASTA files"
  def apply(cmdLine: Array[String]): ADAM2Fasta
}

class ADAM2FastaArgs extends Args4jBase {
  var inputPath: String           // Input ADAM contig file
  var outputPath: String          // Output FASTA file path
  var lineWidth: Int              // FASTA line width (default: 70)
  var coalesce: Int               // Number of output partitions
  var disableDictionary: Boolean  // Skip sequence dictionary output
}
```

**Usage Examples:**
```bash
# Basic conversion
adam-submit adam2fasta contigs.adam output.fasta

# Custom line width and single output file
adam-submit adam2fasta \
  --lineWidth 80 \
  --coalesce 1 \
  contigs.adam reference.fasta
```
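
The line width setting only changes how sequence text is wrapped, not the sequence itself. As an illustration, the sketch below re-wraps an existing FASTA stream to a chosen width with plain `awk`. `wrap_fasta` is a hypothetical helper, not an ADAM command, and it holds one full sequence in memory at a time, so it suits small files better than whole genomes.

```bash
# Hypothetical helper (not part of ADAM): re-wrap FASTA sequence lines read
# from stdin to a fixed width, leaving header lines untouched.
wrap_fasta() {
  width=${1:-70}   # same default as the documented line width
  awk -v w="$width" '
    function emit(   i, n) {
      n = length(seq)
      for (i = 1; i <= n; i += w) print substr(seq, i, w)
      seq = ""
    }
    /^>/ { emit(); print; next }   # flush the previous record, then print the header
    { seq = seq $0 }               # join sequence lines before re-wrapping
    END { emit() }
  '
}
```

A typical call would be `wrap_fasta 80 < output.fasta > rewrapped.fasta`.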

## FASTQ Conversions

### ADAM to FASTQ

Convert ADAM alignment or fragment data to FASTQ format for compatibility with external alignment tools and quality control applications.

```scala { .api }
object ADAM2Fastq extends BDGCommandCompanion {
  val commandName = "adam2fastq"
  val commandDescription = "Convert ADAM read data to FASTQ files"
  def apply(cmdLine: Array[String]): ADAM2Fastq
}

class ADAM2FastqArgs extends Args4jBase {
  var inputPath: String                           // Input ADAM file
  var outputPath: String                          // Primary FASTQ output
  var outputPath2: String                         // Secondary FASTQ for paired reads
  var validationStringency: ValidationStringency  // Input validation level
  var repartition: Int                            // Output partitioning
  var persistLevel: String                        // Spark persistence level
  var disableProjection: Boolean                  // Disable column projection
  var outputOriginalBaseQualities: Boolean        // Use original quality scores
}
```

**Key Features:**
- **Paired-End Support**: Automatic separation of read pairs into separate files
- **Quality Score Options**: Choose between recalibrated and original quality scores
- **Validation Control**: Configurable stringency for malformed read handling
- **Memory Management**: Configurable persistence levels for large datasets

114

115

**Usage Examples:**

116

```bash

117

# Single-end reads

118

adam-submit adam2fastq reads.adam output.fastq

119

120

# Paired-end reads with separate output files

121

adam-submit adam2fastq \

122

reads.adam \

123

output_R1.fastq \

124

output_R2.fastq

125

126

# Use original base qualities with lenient validation

127

adam-submit adam2fastq \

128

--outputOriginalBaseQualities \

129

--validationStringency LENIENT \

130

reads.adam output.fastq

131

132

# High-memory processing with custom persistence

133

adam-submit adam2fastq \

134

--persistLevel MEMORY_AND_DISK_SER \

135

--repartition 200 \

136

large_dataset.adam output.fastq

137

```
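
After a paired-end export it can be worth confirming that the two FASTQ files still line up read for read, since most aligners expect that. The sketch below compares read names across the two files. Both helpers are hypothetical (not part of ADAM), assume four-line FASTQ records, and strip a trailing `/1` or `/2` from names if one is present.

```bash
# Hypothetical helpers (not part of ADAM); both assume four-line FASTQ records.

# Print one read name per record, without the leading "@" or a trailing /1 or /2.
fastq_names() {
  awk 'NR % 4 == 1 { name = substr($1, 2); sub(/\/[12]$/, "", name); print name }' "$1"
}

# Succeed only if both files list the same read names in the same order.
fastq_pairs_in_sync() {
  [ "$(fastq_names "$1")" = "$(fastq_names "$2")" ]
}
```

For example: `fastq_pairs_in_sync output_R1.fastq output_R2.fastq && echo "pairs are in sync"`.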

## Multi-Format Fragment Processing

### Transform Fragments

Convert various genomic formats (SAM/BAM/CRAM) to ADAM fragment format, which maintains paired-end relationships and insert size information.

```scala { .api }
object TransformFragments extends BDGCommandCompanion {
  val commandName = "transformFragments"
  val commandDescription = "Convert SAM/BAM/CRAM to ADAM fragments"
  def apply(cmdLine: Array[String]): TransformFragments
}

class TransformFragmentsArgs extends Args4jBase with ADAMSaveAnyArgs with ParquetArgs {
  var inputPath: String      // Input alignment file
  var outputPath: String     // Output fragment file
  var coalesce: Int          // Output partition count
  var forceShuffle: Boolean  // Force data shuffling
  var storageLevel: String   // Spark storage level
}
```

**Fragment Benefits:**
- **Insert Size Analysis**: Maintains paired-end insert size distributions
- **Quality Metrics**: Preserves alignment quality and mapping information
- **Memory Efficiency**: Optimized storage for paired-end data analysis
- **Downstream Compatibility**: Works with ADAM's fragment-based analysis tools

**Usage Example:**
```bash
# Convert BAM to fragments with performance optimization
adam-submit transformFragments \
  --coalesce 50 \
  --storageLevel MEMORY_AND_DISK \
  paired_reads.bam fragments.adam
```

## Format Support Matrix

| Input Format | Output Format | Command | Key Features |
|--------------|---------------|---------|--------------|
| FASTA | ADAM Contigs | `fasta2adam` | Sequence indexing, fragmentation |
| ADAM Contigs | FASTA | `adam2fasta` | Dictionary generation, line formatting |
| ADAM Reads/Alignments | FASTQ | `adam2fastq` | Paired-end separation, quality options |
| SAM/BAM/CRAM | ADAM Fragments | `transformFragments` | Insert size preservation, pairing |

## Performance Optimization

### Memory Management
```bash
# For large datasets, use serialized persistence that spills to disk
--persistLevel MEMORY_AND_DISK_SER

# Control memory usage with partitioning
--repartition 100  # Increase for large files
--coalesce 10      # Decrease for small files
```
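
As a rough way to pick a partition count, one common rule of thumb is to aim for partitions of about 128 MB each. That target is an assumption borrowed from typical Spark and HDFS defaults, not something ADAM prescribes, and `suggest_partitions` below is a hypothetical helper that only does the arithmetic.

```bash
# Hypothetical helper (not part of ADAM): suggest a partition count for an
# input of the given size in bytes, targeting ~128 MB per partition by default.
suggest_partitions() {
  size_bytes=$1
  target_mb=${2:-128}
  target_bytes=$(( target_mb * 1024 * 1024 ))
  # ceiling division, with a floor of one partition
  n=$(( (size_bytes + target_bytes - 1) / target_bytes ))
  if [ "$n" -lt 1 ]; then n=1; fi
  echo "$n"
}
```

For an existing local file, the size can be supplied with `suggest_partitions "$(wc -c < reads.bam)"`; treat the result as a starting point and adjust after observing the actual job.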

### I/O Optimization
```bash
# Force data shuffling for balanced partitions
--forceShuffle

# Disable column projection for full schema access
--disableProjection
```

### Validation Control
```scala { .api }
// Validation stringency levels
ValidationStringency.STRICT   // Fail on any malformed data
ValidationStringency.LENIENT  // Warn on malformed data
ValidationStringency.SILENT   // Ignore malformed data
```
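
To make the difference between the three levels concrete, the sketch below applies them to one simple kind of malformed record: a FASTQ entry whose sequence and quality strings differ in length. `check_fastq` is a hypothetical stand-in written in `awk`, not ADAM's or htsjdk's validator, and it assumes four-line FASTQ records on stdin.

```bash
# Hypothetical checker (not part of ADAM or htsjdk): reads four-line FASTQ
# records from stdin and applies one of the three stringency levels to
# records whose sequence and quality strings differ in length.
check_fastq() {
  stringency=${1:-STRICT}
  awk -v mode="$stringency" '
    NR % 4 == 2 { seq = $0 }
    NR % 4 == 0 && length($0) != length(seq) {
      msg = "record " (NR / 4) ": sequence and quality lengths differ"
      if (mode == "STRICT")  { print "error: " msg > "/dev/stderr"; exit 1 }   # fail fast
      if (mode == "LENIENT") { print "warning: " msg > "/dev/stderr" }          # warn, keep going
      # SILENT: ignore the problem without reporting it
    }
  '
}
```

Running `check_fastq LENIENT < output.fastq` prints a warning per bad record and exits successfully, while `check_fastq STRICT` stops at the first bad record with a non-zero exit status.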

## Integration with External Tools

### Sequence Dictionaries
FASTA conversions automatically generate sequence dictionaries compatible with:
- **SAMtools**: For reference-based operations
- **GATK**: For variant calling pipelines
- **Picard**: For data validation and metrics

### Quality Score Handling
FASTQ conversions support both:
- **Original Quality Scores**: As recorded in source files
- **Recalibrated Scores**: From ADAM quality score recalibration

### File Format Compatibility
All conversions maintain compatibility with standard genomics file format specifications:
- **FASTA**: NCBI/EMBL standard format
- **FASTQ**: Illumina 1.8+ Phred+33 encoding
- **SAM/BAM**: HTSlib specification compliance
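
For readers unfamiliar with the Phred+33 encoding listed above, each quality character stores a score as its ASCII code minus 33. The sketch below decodes a quality string that way using standard POSIX tools; `phred33_scores` is a hypothetical helper, not part of ADAM.

```bash
# Hypothetical helper (not part of ADAM): decode a Phred+33 quality string
# into space-separated integer scores (score = ASCII code - 33).
phred33_scores() {
  printf '%s' "$1" |
    od -An -tu1 |          # dump each byte as an unsigned decimal
    tr -s ' \n' '\n' |     # put one number on each line
    awk 'NF { printf "%s%d", sep, $1 - 33; sep = " " } END { print "" }'
}
```

For example, `phred33_scores 'I#5!'` decodes the four characters `I`, `#`, `5`, and `!` (ASCII 73, 35, 53, and 33) to the scores 40, 2, 20, and 0.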