Tessl Tile for maven/org.bdgenomics.adam/adam-core_2.10@0.23.0

or run

npx @tessl/cli init

Version

Tile

Overview

Evals

Files

docs

algorithms.md data-loading.md data-types.md file-formats.md index.md transformations.md

algorithms.mddocs/

0
# Genomic Algorithms
1

2
Bioinformatics algorithms including consensus calling, sequence alignment, and variant normalization optimized for distributed processing. ADAM Core provides scalable implementations of standard genomic algorithms while maintaining compatibility with established tools like GATK and Picard.
3

4
## Capabilities
5

6
### Consensus Generation
7

8
Algorithms for generating consensus sequences from aligned reads or incorporating known variants.
9

10
```scala { .api }
11
/**
12
 * Base trait for consensus sequence generation algorithms
13
 * Provides interface for generating consensus from genomic data
14
 */
15
trait ConsensusGenerator {
16
  /**
17
   * Generate consensus sequence from aligned reads
18
   * @param reads - Iterable collection of alignment records
19
   * @return Consensus object containing consensus sequence and quality information
20
   */
21
  def findConsensus(reads: Iterable[AlignmentRecord]): Consensus
22
}
23

24
/**
25
 * Generate consensus sequences from aligned reads using coverage-based calling
26
 * Implements majority-rule consensus with configurable thresholds
27
 */
28
class ConsensusGeneratorFromReads(
29
  consensusFrequencyThreshold: Double = 0.4,
30
  minReadsAtPosition: Int = 5
31
) extends ConsensusGenerator {
32
  
33
  def findConsensus(reads: Iterable[AlignmentRecord]): Consensus
34
}
35

36
/**
37
 * Generate consensus incorporating known variant sites
38
 * Uses known variants to guide consensus calling in ambiguous regions
39
 */
40
class ConsensusGeneratorFromKnowns(
41
  variants: RDD[Variant],
42
  minAlleleFraction: Double = 0.25
43
) extends ConsensusGenerator {
44
  
45
  def findConsensus(reads: Iterable[AlignmentRecord]): Consensus
46
}
47

48
/**
49
 * Union consensus generator combining multiple consensus strategies
50
 * Merges results from multiple consensus generators with conflict resolution
51
 */
52
class UnionConsensusGenerator(
53
  generators: List[ConsensusGenerator]
54
) extends ConsensusGenerator {
55
  
56
  def findConsensus(reads: Iterable[AlignmentRecord]): Consensus
57
}
58
```
59

60
**Usage Examples:**
61

62
```scala
63
import org.bdgenomics.adam.algorithms.consensus._
64
import org.bdgenomics.adam.rdd.ADAMContext._
65

66
// Basic consensus from reads
67
val generator = new ConsensusGeneratorFromReads(
68
  consensusFrequencyThreshold = 0.6,
69
  minReadsAtPosition = 10
70
)
71

72
val alignments = sc.loadBam("sample.bam")
73
val readsAtPosition = alignments.rdd.filter(_.getContigName == "chr1")
74
val consensus = generator.findConsensus(readsAtPosition.collect())
75

76
// Consensus incorporating known variants
77
val knownVariants = sc.loadVcf("dbsnp.vcf").toVariants().rdd
78
val variantGenerator = new ConsensusGeneratorFromKnowns(
79
  knownVariants,
80
  minAlleleFraction = 0.3
81
)
82

83
// Combined consensus strategy
84
val unionGenerator = new UnionConsensusGenerator(
85
  List(generator, variantGenerator)
86
)
87
```
88

89
### Sequence Alignment
90

91
Local sequence alignment algorithms for comparing genomic sequences.
92

93
```scala { .api }
94
/**
95
 * Smith-Waterman local sequence alignment algorithm
96
 * Provides optimal local alignment between two sequences
97
 */
98
object SmithWaterman {
99
  /**
100
   * Perform local sequence alignment between reference and query sequences
101
   * @param reference - Reference sequence string
102
   * @param read - Query sequence string  
103
   * @param scoring - Scoring scheme for alignment
104
   * @return Alignment object containing optimal local alignment
105
   */
106
  def align(reference: String, 
107
           read: String, 
108
           scoring: SmithWatermanScoring): Alignment
109
}
110

111
/**
112
 * Scoring scheme interface for sequence alignment
113
 */
114
trait SmithWatermanScoring {
115
  /** Score for matching bases */
116
  def matchScore: Int
117
  
118
  /** Score for mismatching bases */
119
  def mismatchScore: Int
120
  
121
  /** Score for opening a gap */
122
  def gapOpenScore: Int
123
  
124
  /** Score for extending a gap */
125
  def gapExtendScore: Int
126
}
127

128
/**
129
 * Constant gap scoring scheme with fixed penalties
130
 * @param matchScore - Score for matches
131
 * @param mismatchScore - Penalty for mismatches  
132
 * @param gapScore - Penalty for gaps
133
 */
134
class SmithWatermanConstantGapScoring(
135
  matchScore: Int = 10,
136
  mismatchScore: Int = -2,
137
  gapScore: Int = -8
138
) extends SmithWatermanScoring {
139
  
140
  def gapOpenScore: Int = gapScore
141
  def gapExtendScore: Int = gapScore
142
}
143

144
/**
145
 * Affine gap scoring with different open and extend penalties
146
 * @param matchScore - Score for matches
147
 * @param mismatchScore - Penalty for mismatches
148
 * @param gapOpenScore - Penalty for opening gaps
149
 * @param gapExtendScore - Penalty for extending gaps
150
 */
151
class SmithWatermanGapScoringFromFn(
152
  matchScore: Int,
153
  mismatchScore: Int, 
154
  gapOpenScore: Int,
155
  gapExtendScore: Int
156
) extends SmithWatermanScoring
157
```
158

159
**Usage Examples:**
160

161
```scala
162
import org.bdgenomics.adam.algorithms.smithwaterman._
163

164
// Set up scoring scheme
165
val scoring = new SmithWatermanConstantGapScoring(
166
  matchScore = 10,
167
  mismatchScore = -2,
168
  gapScore = -5
169
)
170

171
// Perform alignment
172
val reference = "ACGTACGTACGT"
173
val query = "ACGTACGT"
174
val alignment = SmithWaterman.align(reference, query, scoring)
175

176
// Affine gap scoring for more realistic penalties
177
val affineScoring = new SmithWatermanGapScoringFromFn(
178
  matchScore = 10,
179
  mismatchScore = -4,
180
  gapOpenScore = -10,
181
  gapExtendScore = -1
182
)
183

184
val affineAlignment = SmithWaterman.align(reference, query, affineScoring)
185
```
186

187
### Variant Normalization
188

189
Algorithms for standardizing variant representations and resolving complex variants.
190

191
```scala { .api }
192
/**
193
 * Utilities for normalizing variant representations
194
 * Implements variant normalization standards for consistent representation
195
 */
196
object NormalizationUtils {
197
  /**
198
   * Left-align and trim variant representation
199
   * @param variant - Variant to normalize
200
   * @param reference - Reference sequence context
201
   * @return Normalized variant representation
202
   */
203
  def leftAlignAndTrim(variant: Variant, reference: String): Variant
204
  
205
  /**
206
   * Decompose multi-nucleotide variants into primitive representations
207
   * @param variant - Complex variant to decompose
208
   * @return Sequence of primitive variants
209
   */
210
  def decomposeMNV(variant: Variant): Seq[Variant]
211
  
212
  /**
213
   * Normalize indel representation to canonical form
214
   * @param variant - Indel variant to normalize
215
   * @param reference - Reference sequence context
216
   * @return Canonically represented indel
217
   */
218
  def normalizeIndel(variant: Variant, reference: String): Variant
219
}
220
```
221

222
**Usage Examples:**
223

224
```scala
225
import org.bdgenomics.adam.algorithms.consensus.NormalizationUtils
226
import org.bdgenomics.adam.rdd.ADAMContext._
227

228
// Load variants and reference
229
val variants = sc.loadVcf("variants.vcf").toVariants()
230
val reference = sc.loadFasta("reference.fasta")
231

232
// Normalize variants
233
val normalized = variants.rdd.map { variant =>
234
  val refContext = reference.extract(
235
    ReferenceRegion(variant.getContigName, variant.getStart - 10, variant.getEnd + 10)
236
  )
237
  
238
  NormalizationUtils.leftAlignAndTrim(variant, refContext)
239
}
240

241
// Decompose complex variants
242
val decomposed = variants.rdd.flatMap { variant =>
243
  if (variant.getAlternateAllele.length > 1) {
244
    NormalizationUtils.decomposeMNV(variant)
245
  } else {
246
    Seq(variant)
247
  }
248
}
249
```
250

251
### Quality Score Recalibration
252

253
Base quality score recalibration algorithms for improving sequencing accuracy.
254

255
```scala { .api }
256
/**
257
 * Base quality score recalibration using known variant sites
258
 * Implements GATK-style base quality score recalibration
259
 * @param alignments - AlignmentRecordRDD to recalibrate
260
 * @param knownSnps - Known variant sites for recalibration modeling
261
 * @return AlignmentRecordRDD with recalibrated quality scores
262
 */
263
def recalibrateBaseQualities(alignments: AlignmentRecordRDD, 
264
                           knownSnps: VariantRDD): AlignmentRecordRDD
265

266
/**
267
 * Build quality score recalibration model
268
 * @param alignments - Training alignments
269
 * @param knownSites - Known variant sites
270
 * @return Recalibration model for applying to new data
271
 */
272
def buildRecalibrationModel(alignments: AlignmentRecordRDD,
273
                          knownSites: VariantRDD): RecalibrationModel
274
```
275

276
### Duplicate Marking
277

278
Algorithms for identifying and marking duplicate reads in sequencing data.
279

280
```scala { .api }
281
/**
282
 * Mark duplicate reads based on alignment coordinates
283
 * Implements Picard-compatible duplicate marking algorithm
284
 * @param alignments - AlignmentRecordRDD to process
285
 * @return AlignmentRecordRDD with duplicate reads flagged
286
 */
287
def markDuplicates(alignments: AlignmentRecordRDD): AlignmentRecordRDD
288

289
/**
290
 * Mark duplicate fragments for paired-end data
291
 * @param fragments - FragmentRDD to process  
292
 * @return FragmentRDD with duplicate fragments flagged
293
 */
294
def markDuplicates(fragments: FragmentRDD): FragmentRDD
295
```
296

297
**Usage Examples:**
298

299
```scala
300
import org.bdgenomics.adam.rdd.ADAMContext._
301

302
val alignments = sc.loadBam("sample.bam")
303

304
// Mark duplicates
305
val markedDuplicates = alignments.markDuplicates()
306

307
// Remove duplicates for analysis
308
val uniqueReads = markedDuplicates.transform(_.filter(!_.getDuplicateRead))
309

310
// Quality recalibration
311
val knownSites = sc.loadVcf("dbsnp.vcf").toVariants()
312
val recalibrated = alignments.recalibrateBaseQualities(knownSites)
313
```
314

315
### Indel Realignment
316

317
Local realignment algorithms for improving alignment accuracy around indels.
318

319
```scala { .api }
320
/**
321
 * Local realignment around indels to improve alignment accuracy
322
 * @param alignments - AlignmentRecordRDD to realign
323
 * @param knownIndels - Optional known indel sites for guidance
324
 * @return AlignmentRecordRDD with improved alignments around indels
325
 */
326
def realignIndels(alignments: AlignmentRecordRDD, 
327
                 knownIndels: Option[VariantRDD] = None): AlignmentRecordRDD
328

329
/**
330
 * Identify candidate regions for realignment
331
 * @param alignments - Input alignments
332
 * @return RDD of genomic regions requiring realignment
333
 */
334
def findRealignmentCandidates(alignments: AlignmentRecordRDD): RDD[ReferenceRegion]
335
```
336

337
### Coverage Analysis Algorithms
338

339
Algorithms for analyzing sequencing depth and coverage patterns.
340

341
```scala { .api }
342
/**
343
 * Calculate coverage depth across genomic positions
344
 * @param alignments - AlignmentRecordRDD to analyze
345
 * @param collapse - Whether to merge adjacent positions with same coverage
346
 * @return CoverageRDD with depth information
347
 */
348
def calculateCoverage(alignments: AlignmentRecordRDD, 
349
                     collapse: Boolean = true): CoverageRDD
350

351
/**
352
 * Identify regions with coverage above specified threshold
353
 * @param coverage - CoverageRDD to analyze
354
 * @param threshold - Minimum coverage threshold
355
 * @return FeatureRDD of high-coverage regions
356
 */
357
def findHighCoverageRegions(coverage: CoverageRDD, 
358
                          threshold: Double): FeatureRDD
359

360
/**
361
 * Calculate coverage statistics across genomic regions
362
 * @param coverage - Coverage data
363
 * @param regions - Regions of interest
364
 * @return RDD of coverage statistics per region
365
 */
366
def coverageStatistics(coverage: CoverageRDD, 
367
                      regions: FeatureRDD): RDD[(Feature, CoverageStatistics)]
368
```
369

370
### K-mer Analysis
371

372
Algorithms for analyzing k-mer composition in genomic sequences.
373

374
```scala { .api }
375
/**
376
 * Count k-mers in sequencing reads
377
 * @param alignments - Reads to analyze
378
 * @param kmerLength - Length of k-mers to count
379
 * @return RDD of k-mer sequences with occurrence counts
380
 */
381
def countKmers(alignments: AlignmentRecordRDD, 
382
              kmerLength: Int): RDD[(String, Long)]
383

384
/**
385
 * Find overrepresented k-mers in reads
386
 * @param kmers - K-mer counts from countKmers
387
 * @param threshold - Minimum count threshold for overrepresentation
388
 * @return RDD of overrepresented k-mers
389
 */
390
def findOverrepresentedKmers(kmers: RDD[(String, Long)], 
391
                           threshold: Long): RDD[(String, Long)]
392
```
393

394
**Usage Examples:**
395

396
```scala
397
// K-mer analysis workflow
398
val alignments = sc.loadBam("sample.bam")
399

400
// Count 21-mers
401
val kmers = alignments.countKmers(21)
402

403
// Find most frequent k-mers
404
val topKmers = kmers.top(100)(Ordering.by(_._2))
405

406
// Identify overrepresented sequences
407
val overrepresented = kmers.filter(_._2 > 1000)
408
```
409

410
## Algorithm Performance Considerations
411

412
1. **Use appropriate partitioning** - Algorithms work best with data partitioned by genomic region
413
2. **Consider memory requirements** - Some algorithms require substantial memory for intermediate data structures
414
3. **Leverage broadcast variables** - Use broadcast for small reference datasets accessed by all partitions
415
4. **Pipeline operations** - Chain multiple algorithms to minimize data shuffling
416
5. **Persist intermediate results** - Cache RDDs that will be reused in iterative algorithms
417
6. **Tune parallelism** - Adjust partition count based on cluster size and data volume
418

419
## Algorithm Compatibility
420

421
ADAM's algorithms maintain compatibility with standard bioinformatics tools:
422

423
- **Duplicate marking**: Compatible with Picard MarkDuplicates
424
- **Quality recalibration**: Compatible with GATK BaseRecalibrator  
425
- **Variant normalization**: Follows VCF specification standards
426
- **Consensus calling**: Compatible with samtools consensus methods
427

428
This ensures ADAM-processed data integrates seamlessly with existing genomics pipelines.

Version

Tile

Files

algorithms.md.css-3qkkll{font-size:var(--chakra-font-sizes-sm);font-weight:var(--chakra-font-weights-normal);color:var(--chakra-colors-gray-300);}docs/

algorithms.mddocs/