0
# Consensus Gene Prediction
1
2
EvidenceModeler-like consensus gene prediction system that combines multiple annotation sources using protein and transcript evidence alignment, configurable source weights, and structural validation to generate high-quality consensus gene models.
3
4
## Capabilities
5
6
### Main Consensus Generation
7
8
Generate consensus gene predictions from multiple annotation sources with evidence-based scoring.
9
10
```python { .api }
11
def generate_consensus(fasta, genes, proteins, transcripts, weights, out, debug=False, minscore=False, repeats=False, repeat_overlap=90, tiebreakers="calculated", min_exon=3, min_intron=11, max_intron=-1, max_exon=-1, evidence_derived_models=[], num_processes=None, utrs=True, min_utr_length=10, max_utr_length=2000, log=sys.stderr.write):
12
"""
13
Generate consensus predictions from multiple annotation sources.
14
15
Parameters:
16
- fasta (str): Path to genome FASTA file
17
- genes (list): List of paths to gene model files in GFF3 format
18
- proteins (list): List of paths to protein alignment files in GFF3 format
19
- transcripts (list): List of paths to transcript alignment files in GFF3 format
20
- weights (dict): User-supplied source weights {source: weight}
21
- out (str): Path to output consensus GFF3 file
22
- debug (bool): Enable debug output
23
- minscore (int|bool): Minimum score to retain gene model, or False for auto
24
- repeats (str|bool): Path to repeat annotations, or False
25
- repeat_overlap (int): Percent overlap with repeats to filter (default: 90)
26
- tiebreakers (str): Tiebreaker method ("calculated" or other)
27
- min_exon (int): Minimum exon length (default: 3)
28
- min_intron (int): Minimum intron length (default: 11)
29
- max_intron (int): Maximum intron length (-1 for no limit)
30
- max_exon (int): Maximum exon length (-1 for no limit)
31
- evidence_derived_models (list): List of evidence-derived model sources
32
- num_processes (int|None): Number of parallel processes
33
- utrs (bool): Enable UTR extension (default: True)
34
- min_utr_length (int): Minimum UTR length (default: 10)
35
- max_utr_length (int): Maximum UTR length (default: 2000)
36
- log (function): Logging function
37
38
Returns:
39
None
40
"""
41
```
42
43
### Annotation Edit Distance
44
45
Calculate Annotation Edit Distance (AED) between gene models for similarity assessment.
46
47
```python { .api }
48
def getAED(query, reference):
49
"""
50
Calculate Annotation Edit Distance between two gene models.
51
52
Parameters:
53
- query (dict): Query gene model data structure
54
- reference (dict): Reference gene model data structure
55
56
Returns:
57
float: AED score (0.0 = perfect match, 1.0 = no overlap)
58
"""
59
```
60
61
### Evidence-Based Scoring
62
63
Score gene models based on overlap with protein and transcript evidence.
64
65
```python { .api }
66
def score_by_evidence(locus, weights={}, derived=[]):
67
"""
68
Score gene models based on evidence overlap.
69
70
Parameters:
71
- locus (list): List of gene models at a genomic locus
72
- weights (dict): User-supplied source weights
73
- derived (list): Derived weights from evidence analysis
74
75
Returns:
76
dict: Scored gene models with evidence metrics
77
"""
78
```
79
80
### Model Selection
81
82
Select the best gene model from a locus based on scoring criteria.
83
84
```python { .api }
85
def best_model_default(locus_name, contig, strand, locus, debug=False, weights={}, order={}, min_exon=3, min_intron=10, max_intron=-1, max_exon=-1, evidence_derived_models=[]):
86
"""
87
Select best gene model from locus using default scoring algorithm.
88
89
Parameters:
90
- locus_name (str): Name identifier for the locus
91
- contig (str): Contig/chromosome name
92
- strand (str): Strand orientation ("+" or "-")
93
- locus (list): List of gene models at genomic locus
94
- debug (bool): Enable debug output
95
- weights (dict): User-supplied source weights
96
- order (dict): Derived weights from evidence analysis
97
- min_exon (int): Minimum exon length requirement
98
- min_intron (int): Minimum intron length requirement
99
- max_intron (int): Maximum intron length (-1 for no limit)
100
- max_exon (int): Maximum exon length requirement (-1 for no limit)
101
- evidence_derived_models (list): List of evidence-derived model sources
102
103
Returns:
104
dict: Best scoring gene model or None if no suitable model found
105
"""
106
```
107
108
### Data Structure Conversion
109
110
Convert between different data structures used in consensus prediction.
111
112
```python { .api }
113
def gff2interlap(infile, fasta, inter=False):
114
"""
115
Convert GFF3 file to InterLap data structure for overlap analysis.
116
117
Parameters:
118
- infile (str): Path to GFF3 file
119
- fasta (str): Path to genome FASTA file
120
- inter (InterLap|bool): Existing InterLap object to extend, or False
121
122
Returns:
123
InterLap: InterLap object with genomic intervals
124
"""
125
126
def bed2interlap(bedfile, inter=False):
127
"""
128
Convert BED file to InterLap data structure.
129
130
Parameters:
131
- bedfile (str): Path to BED format file
132
- inter (InterLap|bool): Existing InterLap object to extend, or False
133
134
Returns:
135
InterLap: InterLap object with genomic intervals
136
"""
137
138
def safe_extract_coordinates(coords):
139
"""
140
Safely extract coordinates from gene model data structure.
141
142
Parameters:
143
- coords (list): Coordinate data structure
144
145
Returns:
146
list: Validated coordinate tuples
147
"""
148
```
149
150
### Gene Clustering and Locus Identification
151
152
Identify and cluster overlapping gene models into loci for consensus prediction.
153
154
```python { .api }
155
def cluster_interlap(obj):
156
"""
157
Cluster overlapping gene models using InterLap data structure.
158
159
Parameters:
160
- obj (InterLap): InterLap object containing gene intervals
161
162
Returns:
163
list: List of clustered loci with overlapping gene models
164
"""
165
166
def get_loci(annot_dict):
167
"""
168
Group gene models into genomic loci based on overlap.
169
170
Parameters:
171
- annot_dict (dict): Annotation dictionary
172
173
Returns:
174
dict: Dictionary of loci with overlapping gene models
175
"""
176
177
def sub_cluster(obj):
178
"""
179
Perform sub-clustering within loci for complex overlaps.
180
181
Parameters:
182
- obj (InterLap): InterLap object with clustered intervals
183
184
Returns:
185
list: List of refined sub-clusters
186
"""
187
188
def contained(a, b):
189
"""
190
Check if interval a is contained within interval b.
191
192
Parameters:
193
- a (tuple): Interval coordinates (start, end)
194
- b (tuple): Interval coordinates (start, end)
195
196
Returns:
197
bool: True if a is contained in b
198
"""
199
200
def get_overlap(a, b):
201
"""
202
Calculate overlap between two genomic intervals.
203
204
Parameters:
205
- a (tuple): First interval (start, end)
206
- b (tuple): Second interval (start, end)
207
208
Returns:
209
int: Number of overlapping base pairs
210
"""
211
```
212
213
### UTR Processing
214
215
Advanced UTR identification and extension algorithms.
216
217
```python { .api }
218
def check_intron_compatibility(model_coords, transcript_coords, strand):
219
"""
220
Check if model introns are compatible with transcript evidence.
221
222
Parameters:
223
- model_coords (list): Model exon coordinates
224
- transcript_coords (list): Transcript exon coordinates
225
- strand (str): Strand orientation ("+" or "-")
226
227
Returns:
228
bool: True if introns are compatible
229
"""
230
231
def select_best_utrs(utr_exons_list, strand, min_length=10, max_length=2000):
232
"""
233
Select optimal UTR exons from multiple evidence sources.
234
235
Parameters:
236
- utr_exons_list (list): List of UTR exon coordinate sets
237
- strand (str): Strand orientation ("+" or "-")
238
- min_length (int): Minimum UTR length
239
- max_length (int): Maximum UTR length
240
241
Returns:
242
list: Selected UTR exon coordinates
243
"""
244
245
def extend_utrs(gene_model, transcript_evidence, strand, min_length=10, max_length=2000):
246
"""
247
Extend gene model UTRs using transcript evidence.
248
249
Parameters:
250
- gene_model (dict): Gene model data structure
251
- transcript_evidence (list): List of transcript alignments
252
- strand (str): Strand orientation ("+" or "-")
253
- min_length (int): Minimum UTR extension length
254
- max_length (int): Maximum UTR extension length
255
256
Returns:
257
dict: Gene model with extended UTRs
258
"""
259
```
260
261
### Repeat Filtering and Quality Control
262
263
Filter gene models based on repeat content and structural quality.
264
265
```python { .api }
266
def filter_models_repeats(fasta, repeats, gene_models, filter_threshold=90, log=False):
267
"""
268
Filter gene models overlapping repetitive regions.
269
270
Parameters:
271
- fasta (str): Path to genome FASTA file
272
- repeats (str): Path to repeat annotations (BED format)
273
- gene_models (dict): Gene model dictionary
274
- filter_threshold (int): Percent overlap threshold for filtering
275
- log (function|bool): Logging function or False
276
277
Returns:
278
dict: Filtered gene models
279
"""
280
281
def reasonable_model(coords, min_protein=30, min_exon=3, min_intron=10, max_intron=-1, max_exon=-1):
282
"""
283
Validate gene model structure against biological constraints.
284
285
Parameters:
286
- coords (dict): Gene coordinates {'CDS': [...], 'protein': [...]}
287
- min_protein (int): Minimum protein length (amino acids)
288
- min_exon (int): Minimum exon length (nucleotides)
289
- min_intron (int): Minimum intron length (nucleotides)
290
- max_intron (int): Maximum intron length (-1 for no limit)
291
- max_exon (int): Maximum exon length (-1 for no limit)
292
293
Returns:
294
bool: True if model passes structural validation
295
"""
296
```
297
298
### Evidence Integration
299
300
Load and integrate protein and transcript evidence into consensus prediction.
301
302
```python { .api }
303
def gffevidence2dict(file, Evi):
304
"""
305
Parse evidence alignments from GFF3 into evidence dictionary.
306
307
Parameters:
308
- file (str): Path to evidence GFF3 file
309
- Evi (dict): Existing evidence dictionary to update
310
311
Returns:
312
dict: Updated evidence dictionary
313
"""
314
315
def add_evidence(loci, evidence, source="proteins"):
316
"""
317
Add evidence data to gene model loci.
318
319
Parameters:
320
- loci (dict): Dictionary of gene model loci
321
- evidence (dict): Evidence alignment dictionary
322
- source (str): Evidence type ("proteins" or "transcripts")
323
324
Returns:
325
dict: Loci with integrated evidence information
326
"""
327
328
def parse_data(genome, gene, protein, transcript, log=sys.stderr.write):
329
"""
330
Parse all input data files for consensus prediction.
331
332
Parameters:
333
- genome (str): Path to genome FASTA file
334
- gene (list): List of gene model file paths
335
- protein (list): List of protein alignment file paths
336
- transcript (list): List of transcript alignment file paths
337
- log (function): Logging function
338
339
Returns:
340
tuple: (genome_dict, gene_dict, evidence_dict)
341
"""
342
343
def filter4evidence(data, n_genes=3, n_evidence=2):
344
"""
345
Filter loci requiring minimum gene and evidence counts.
346
347
Parameters:
348
- data (dict): Loci data dictionary
349
- n_genes (int): Minimum number of gene models required
350
- n_evidence (int): Minimum number of evidence alignments required
351
352
Returns:
353
dict: Filtered loci meeting evidence requirements
354
"""
355
```
356
357
### Scoring and Ranking Algorithms
358
359
Advanced scoring functions for consensus model selection.
360
361
```python { .api }
362
def score_aggregator(locus, weights, evidence_weights, min_protein=30):
363
"""
364
Aggregate multiple scoring metrics for final model ranking.
365
366
Parameters:
367
- locus (list): List of gene models at locus
368
- weights (dict): User-supplied source weights
369
- evidence_weights (dict): Evidence-derived weights
370
- min_protein (int): Minimum protein length requirement
371
372
Returns:
373
list: Ranked gene models with aggregated scores
374
"""
375
376
def cluster_by_aed(locus, score=0.005):
377
"""
378
Cluster highly similar models using AED scores.
379
380
Parameters:
381
- locus (list): List of gene models
382
- score (float): AED threshold for clustering
383
384
Returns:
385
list: Clustered gene models
386
"""
387
388
def map_coords(g_coords, e_coords):
389
"""
390
Map gene coordinates to evidence coordinates.
391
392
Parameters:
393
- g_coords (list): Gene model coordinates
394
- e_coords (list): Evidence alignment coordinates
395
396
Returns:
397
dict: Coordinate mapping information
398
"""
399
400
def score_evidence(g_coords, e_coords, weight=2):
401
"""
402
Score gene model based on evidence overlap.
403
404
Parameters:
405
- g_coords (list): Gene model coordinates
406
- e_coords (list): Evidence coordinates
407
- weight (float): Evidence weight factor
408
409
Returns:
410
float: Evidence-based score
411
"""
412
```
413
414
### Weight Calculation and Source Ranking
415
416
Calculate source weights and rankings based on evidence support.
417
418
```python { .api }
419
def calculate_source_order(data):
420
"""
421
Calculate source ranking based on evidence support.
422
423
Parameters:
424
- data (dict): Gene model and evidence data
425
426
Returns:
427
dict: Source rankings and weights
428
"""
429
430
def order_sources(locus):
431
"""
432
Order prediction sources by quality metrics.
433
434
Parameters:
435
- locus (list): List of gene models at locus
436
437
Returns:
438
dict: Ordered source rankings
439
"""
440
441
def src_scaling_factor(obj):
442
"""
443
Calculate scaling factors for different prediction sources.
444
445
Parameters:
446
- obj (dict): Source performance data
447
448
Returns:
449
dict: Source-specific scaling factors
450
"""
451
452
def de_novo_distance(locus):
453
"""
454
Calculate distance metrics for de novo source weights.
455
456
Parameters:
457
- locus (list): List of gene models
458
459
Returns:
460
dict: Distance-based weight metrics
461
"""
462
463
def calculate_gene_distance(locus):
464
"""
465
Calculate structural distance between gene models.
466
467
Parameters:
468
- locus (list): List of gene models at locus
469
470
Returns:
471
dict: Inter-model distance metrics
472
"""
473
```
474
475
### Algorithm Configuration
476
477
Configure consensus prediction algorithm parameters.
478
479
```python { .api }
480
def auto_score_threshold(weights, order, user_weight=6):
481
"""
482
Automatically determine optimal score threshold.
483
484
Parameters:
485
- weights (dict): User-supplied weights
486
- order (dict): Source order rankings
487
- user_weight (int): Weight for user preferences
488
489
Returns:
490
float: Calculated score threshold
491
"""
492
493
def ensure_unique_names(genes):
494
"""
495
Ensure all gene models have unique identifiers.
496
497
Parameters:
498
- genes (dict): Gene model dictionary
499
500
Returns:
501
dict: Gene models with unique names
502
"""
503
504
def fasta_length(fasta):
505
"""
506
Get sequence lengths from FASTA file.
507
508
Parameters:
509
- fasta (str): Path to FASTA file
510
511
Returns:
512
dict: Dictionary mapping sequence names to lengths
513
"""
514
```
515
516
### Output and Refinement
517
518
Refine consensus predictions and generate output files.
519
520
```python { .api }
521
def refine_cluster(locus, derived=[]):
522
"""
523
Refine gene model cluster using derived weights.
524
525
Parameters:
526
- locus (list): List of gene models
527
- derived (list): Derived weight parameters
528
529
Returns:
530
list: Refined gene model cluster
531
"""
532
533
def solve_sub_loci(result):
534
"""
535
Resolve complex multi-gene loci into individual predictions.
536
537
Parameters:
538
- result (dict): Clustered locus data
539
540
Returns:
541
dict: Resolved individual gene predictions
542
"""
543
544
def gff_writer(input, output):
545
"""
546
Write consensus predictions to GFF3 format.
547
548
Parameters:
549
- input (dict): Consensus gene predictions
550
- output (str): Output GFF3 file path
551
552
Returns:
553
None
554
"""
555
```
556
557
### CLI Interface
558
559
Command-line interface for consensus prediction.
560
561
```python { .api }
562
def consensus(args):
563
"""
564
Main consensus prediction CLI entry point.
565
566
Parameters:
567
- args (argparse.Namespace): Command line arguments
568
569
Returns:
570
None
571
"""
572
- max_exon (int): Maximum exon length (-1 for no limit)
573
- evidence_derived_models (list): Evidence-derived model sources
574
575
Returns:
576
dict: Best gene model with score information
577
"""
578
```
579
580
### Repeat Filtering
581
582
Filter gene models that significantly overlap with repetitive elements.
583
584
```python { .api }
585
def filter_models_repeats(fasta, repeats, gene_models, filter_threshold=90, log=False):
586
"""
587
Filter gene models overlapping with repetitive elements.
588
589
Parameters:
590
- fasta (str): Path to genome FASTA file
591
- repeats (str): Path to repeat annotations (BED or GFF3)
592
- gene_models (dict): Dictionary of gene models to filter
593
- filter_threshold (int): Percent overlap threshold for filtering
594
- log (function|bool): Logging function or boolean
595
596
Returns:
597
dict: Filtered gene models with repeats removed
598
"""
599
```
600
601
### Structural Validation
602
603
Validate gene model structure against biological constraints.
604
605
```python { .api }
606
def reasonable_model(coords, min_protein=30, min_exon=3, min_intron=10, max_intron=-1, max_exon=-1):
607
"""
608
Validate gene model structure against biological constraints.
609
610
Parameters:
611
- coords (dict): Gene model coordinate information
612
- min_protein (int): Minimum protein length in amino acids
613
- min_exon (int): Minimum exon length in nucleotides
614
- min_intron (int): Minimum intron length in nucleotides
615
- max_intron (int): Maximum intron length (-1 for no limit)
616
- max_exon (int): Maximum exon length (-1 for no limit)
617
618
Returns:
619
bool: True if model passes validation, False otherwise
620
"""
621
```
622
623
## Usage Examples
624
625
### Basic Consensus Prediction
626
627
```python
628
from gfftk.consensus import generate_consensus
629
630
# Define input files
631
gene_predictions = ["augustus.gff3", "genemark.gff3", "snap.gff3"]
632
protein_alignments = ["uniprot.gff3", "pfam.gff3"]
633
transcript_alignments = ["rnaseq.gff3"]
634
635
# User-defined source weights
636
weights = {
637
"augustus": 3,
638
"genemark": 2,
639
"snap": 1
640
}
641
642
# Generate consensus
643
generate_consensus(
644
fasta="genome.fasta",
645
genes=gene_predictions,
646
proteins=protein_alignments,
647
transcripts=transcript_alignments,
648
weights=weights,
649
output="consensus.gff3",
650
num_processes=4,
651
min_protein=50,
652
repeat_overlap=80
653
)
654
```
655
656
### AED Calculation
657
658
```python
659
from gfftk.consensus import getAED
660
from gfftk.gff import gff2dict
661
662
# Load two annotations for comparison
663
reference = gff2dict("reference.gff3", "genome.fasta")
664
query = gff2dict("prediction.gff3", "genome.fasta")
665
666
# Calculate AED for overlapping genes
667
for gene_id in reference:
668
if gene_id in query:
669
aed = getAED(query[gene_id], reference[gene_id])
670
print(f"Gene {gene_id}: AED = {aed:.3f}")
671
```
672
673
### Custom Scoring Pipeline
674
675
```python
676
from gfftk.consensus import score_by_evidence, best_model_default
677
from gfftk.gff import gff2dict
678
679
# Load gene models and evidence
680
genes = gff2dict("predictions.gff3", "genome.fasta")
681
proteins = gff2dict("protein_alignments.gff3", "genome.fasta")
682
683
# Group genes by locus (simplified example)
684
loci = {}
685
for gene_id, gene_data in genes.items():
686
locus_key = f"{gene_data['contig']}:{gene_data['location'][0]}-{gene_data['location'][1]}"
687
if locus_key not in loci:
688
loci[locus_key] = []
689
loci[locus_key].append({gene_id: gene_data})
690
691
# Score each locus
692
weights = {"augustus": 2, "genemark": 1}
693
consensus_models = {}
694
695
for locus_key, locus_models in loci.items():
696
# Score models by evidence
697
scored = score_by_evidence(locus_models, weights)
698
699
# Select best model
700
best = best_model_default(scored, weights, [], min_protein=30)
701
702
if best:
703
consensus_models.update(best)
704
705
print(f"Selected {len(consensus_models)} consensus gene models")
706
```
707
708
### Repeat Filtering
709
710
```python
711
from gfftk.consensus import filter_models_repeats
712
from gfftk.gff import gff2dict
713
714
# Load gene models
715
models = gff2dict("gene_models.gff3", "genome.fasta")
716
717
# Filter models overlapping repeats
718
filtered_models = filter_models_repeats(
719
fasta="genome.fasta",
720
repeats="repeats.bed",
721
gene_models=models,
722
filter_threshold=75, # Remove models with >75% repeat overlap
723
log=print
724
)
725
726
print(f"Retained {len(filtered_models)}/{len(models)} models after repeat filtering")
727
```
728
729
### Structural Validation
730
731
```python
732
from gfftk.consensus import reasonable_model
733
from gfftk.gff import gff2dict
734
735
# Load gene models
736
models = gff2dict("gene_models.gff3", "genome.fasta")
737
738
# Validate each model
739
valid_models = {}
740
for gene_id, gene_data in models.items():
741
coords = {
742
'CDS': gene_data['CDS'],
743
'protein': gene_data['protein'],
744
'mRNA': gene_data['mRNA']
745
}
746
747
# Apply structural constraints
748
if reasonable_model(
749
coords,
750
min_protein=50, # At least 50 amino acids
751
min_exon=15, # At least 15 bp exons
752
min_intron=20, # At least 20 bp introns
753
max_intron=50000 # Max 50kb introns
754
):
755
valid_models[gene_id] = gene_data
756
757
print(f"Validated {len(valid_models)}/{len(models)} gene models")
758
```
759
760
## Types
761
762
```python { .api }
763
# Gene model locus structure
764
Locus = list[dict] # List of gene models at same genomic location
765
766
# Source weights dictionary
767
SourceWeights = dict[str, float] # {source_name: weight_value}
768
769
# Evidence overlap metrics
770
EvidenceMetrics = {
771
"protein_overlap": float, # Percent overlap with protein evidence
772
"transcript_overlap": float, # Percent overlap with transcript evidence
773
"aed_score": float, # Annotation Edit Distance
774
"combined_score": float # Final combined score
775
}
776
777
# Scored gene model
778
ScoredModel = {
779
"gene_data": dict, # Original gene annotation data
780
"score": float, # Final score
781
"evidence": EvidenceMetrics # Evidence support metrics
782
}
783
784
# Consensus parameters
785
ConsensusParams = {
786
"min_protein": int, # Minimum protein length
787
"min_exon": int, # Minimum exon length
788
"max_exon": int, # Maximum exon length
789
"min_intron": int, # Minimum intron length
790
"max_intron": int, # Maximum intron length
791
"repeat_overlap": int, # Repeat overlap threshold
792
"minscore": int, # Minimum model score
793
"num_processes": int # Number of parallel processes
794
}
795
796
# AED score
797
AEDScore = float # 0.0 (perfect) to 1.0 (no overlap)
798
799
# Genomic coordinates
800
GenomicCoordinates = list[tuple[int, int]]
801
802
# Repeat annotation format
803
RepeatAnnotation = {
804
"contig": str,
805
"start": int,
806
"end": int,
807
"type": str
808
}
809
```