or run

npx @tessl/cli init
Log in

Version

Tile

Overview

Evals

Files

Files

docs

cli-commands.mdcomparison.mdconsensus.mdformat-conversion.mdgenbank-tbl.mdgff-processing.mdindex.mdsequence-operations.mdutilities.md

consensus.mddocs/

0

# Consensus Gene Prediction

1

2

EvidenceModeler-like consensus gene prediction system that combines multiple annotation sources using protein and transcript evidence alignment, configurable source weights, and structural validation to generate high-quality consensus gene models.

3

4

## Capabilities

5

6

### Main Consensus Generation

7

8

Generate consensus gene predictions from multiple annotation sources with evidence-based scoring.

9

10

```python { .api }

11

def generate_consensus(fasta, genes, proteins, transcripts, weights, out, debug=False, minscore=False, repeats=False, repeat_overlap=90, tiebreakers="calculated", min_exon=3, min_intron=11, max_intron=-1, max_exon=-1, evidence_derived_models=[], num_processes=None, utrs=True, min_utr_length=10, max_utr_length=2000, log=sys.stderr.write):

12

"""

13

Generate consensus predictions from multiple annotation sources.

14

15

Parameters:

16

- fasta (str): Path to genome FASTA file

17

- genes (list): List of paths to gene model files in GFF3 format

18

- proteins (list): List of paths to protein alignment files in GFF3 format

19

- transcripts (list): List of paths to transcript alignment files in GFF3 format

20

- weights (dict): User-supplied source weights {source: weight}

21

- out (str): Path to output consensus GFF3 file

22

- debug (bool): Enable debug output

23

- minscore (int|bool): Minimum score to retain gene model, or False for auto

24

- repeats (str|bool): Path to repeat annotations, or False

25

- repeat_overlap (int): Percent overlap with repeats to filter (default: 90)

26

- tiebreakers (str): Tiebreaker method ("calculated" or other)

27

- min_exon (int): Minimum exon length (default: 3)

28

- min_intron (int): Minimum intron length (default: 11)

29

- max_intron (int): Maximum intron length (-1 for no limit)

30

- max_exon (int): Maximum exon length (-1 for no limit)

31

- evidence_derived_models (list): List of evidence-derived model sources

32

- num_processes (int|None): Number of parallel processes

33

- utrs (bool): Enable UTR extension (default: True)

34

- min_utr_length (int): Minimum UTR length (default: 10)

35

- max_utr_length (int): Maximum UTR length (default: 2000)

36

- log (function): Logging function

37

38

Returns:

39

None

40

"""

41

```

42

43

### Annotation Edit Distance

44

45

Calculate Annotation Edit Distance (AED) between gene models for similarity assessment.

46

47

```python { .api }

48

def getAED(query, reference):

49

"""

50

Calculate Annotation Edit Distance between two gene models.

51

52

Parameters:

53

- query (dict): Query gene model data structure

54

- reference (dict): Reference gene model data structure

55

56

Returns:

57

float: AED score (0.0 = perfect match, 1.0 = no overlap)

58

"""

59

```

60

61

### Evidence-Based Scoring

62

63

Score gene models based on overlap with protein and transcript evidence.

64

65

```python { .api }

66

def score_by_evidence(locus, weights={}, derived=[]):

67

"""

68

Score gene models based on evidence overlap.

69

70

Parameters:

71

- locus (list): List of gene models at a genomic locus

72

- weights (dict): User-supplied source weights

73

- derived (list): Derived weights from evidence analysis

74

75

Returns:

76

dict: Scored gene models with evidence metrics

77

"""

78

```

79

80

### Model Selection

81

82

Select the best gene model from a locus based on scoring criteria.

83

84

```python { .api }

85

def best_model_default(locus_name, contig, strand, locus, debug=False, weights={}, order={}, min_exon=3, min_intron=10, max_intron=-1, max_exon=-1, evidence_derived_models=[]):

86

"""

87

Select best gene model from locus using default scoring algorithm.

88

89

Parameters:

90

- locus_name (str): Name identifier for the locus

91

- contig (str): Contig/chromosome name

92

- strand (str): Strand orientation ("+" or "-")

93

- locus (list): List of gene models at genomic locus

94

- debug (bool): Enable debug output

95

- weights (dict): User-supplied source weights

96

- order (dict): Derived weights from evidence analysis

97

- min_exon (int): Minimum exon length requirement

98

- min_intron (int): Minimum intron length requirement

99

- max_intron (int): Maximum intron length (-1 for no limit)

100

- max_exon (int): Maximum exon length requirement (-1 for no limit)

101

- evidence_derived_models (list): List of evidence-derived model sources

102

103

Returns:

104

dict: Best scoring gene model or None if no suitable model found

105

"""

106

```

107

108

### Data Structure Conversion

109

110

Convert between different data structures used in consensus prediction.

111

112

```python { .api }

113

def gff2interlap(infile, fasta, inter=False):

114

"""

115

Convert GFF3 file to InterLap data structure for overlap analysis.

116

117

Parameters:

118

- infile (str): Path to GFF3 file

119

- fasta (str): Path to genome FASTA file

120

- inter (InterLap|bool): Existing InterLap object to extend, or False

121

122

Returns:

123

InterLap: InterLap object with genomic intervals

124

"""

125

126

def bed2interlap(bedfile, inter=False):

127

"""

128

Convert BED file to InterLap data structure.

129

130

Parameters:

131

- bedfile (str): Path to BED format file

132

- inter (InterLap|bool): Existing InterLap object to extend, or False

133

134

Returns:

135

InterLap: InterLap object with genomic intervals

136

"""

137

138

def safe_extract_coordinates(coords):

139

"""

140

Safely extract coordinates from gene model data structure.

141

142

Parameters:

143

- coords (list): Coordinate data structure

144

145

Returns:

146

list: Validated coordinate tuples

147

"""

148

```

149

150

### Gene Clustering and Locus Identification

151

152

Identify and cluster overlapping gene models into loci for consensus prediction.

153

154

```python { .api }

155

def cluster_interlap(obj):

156

"""

157

Cluster overlapping gene models using InterLap data structure.

158

159

Parameters:

160

- obj (InterLap): InterLap object containing gene intervals

161

162

Returns:

163

list: List of clustered loci with overlapping gene models

164

"""

165

166

def get_loci(annot_dict):

167

"""

168

Group gene models into genomic loci based on overlap.

169

170

Parameters:

171

- annot_dict (dict): Annotation dictionary

172

173

Returns:

174

dict: Dictionary of loci with overlapping gene models

175

"""

176

177

def sub_cluster(obj):

178

"""

179

Perform sub-clustering within loci for complex overlaps.

180

181

Parameters:

182

- obj (InterLap): InterLap object with clustered intervals

183

184

Returns:

185

list: List of refined sub-clusters

186

"""

187

188

def contained(a, b):

189

"""

190

Check if interval a is contained within interval b.

191

192

Parameters:

193

- a (tuple): Interval coordinates (start, end)

194

- b (tuple): Interval coordinates (start, end)

195

196

Returns:

197

bool: True if a is contained in b

198

"""

199

200

def get_overlap(a, b):

201

"""

202

Calculate overlap between two genomic intervals.

203

204

Parameters:

205

- a (tuple): First interval (start, end)

206

- b (tuple): Second interval (start, end)

207

208

Returns:

209

int: Number of overlapping base pairs

210

"""

211

```

212

213

### UTR Processing

214

215

Advanced UTR identification and extension algorithms.

216

217

```python { .api }

218

def check_intron_compatibility(model_coords, transcript_coords, strand):

219

"""

220

Check if model introns are compatible with transcript evidence.

221

222

Parameters:

223

- model_coords (list): Model exon coordinates

224

- transcript_coords (list): Transcript exon coordinates

225

- strand (str): Strand orientation ("+" or "-")

226

227

Returns:

228

bool: True if introns are compatible

229

"""

230

231

def select_best_utrs(utr_exons_list, strand, min_length=10, max_length=2000):

232

"""

233

Select optimal UTR exons from multiple evidence sources.

234

235

Parameters:

236

- utr_exons_list (list): List of UTR exon coordinate sets

237

- strand (str): Strand orientation ("+" or "-")

238

- min_length (int): Minimum UTR length

239

- max_length (int): Maximum UTR length

240

241

Returns:

242

list: Selected UTR exon coordinates

243

"""

244

245

def extend_utrs(gene_model, transcript_evidence, strand, min_length=10, max_length=2000):

246

"""

247

Extend gene model UTRs using transcript evidence.

248

249

Parameters:

250

- gene_model (dict): Gene model data structure

251

- transcript_evidence (list): List of transcript alignments

252

- strand (str): Strand orientation ("+" or "-")

253

- min_length (int): Minimum UTR extension length

254

- max_length (int): Maximum UTR extension length

255

256

Returns:

257

dict: Gene model with extended UTRs

258

"""

259

```

260

261

### Repeat Filtering and Quality Control

262

263

Filter gene models based on repeat content and structural quality.

264

265

```python { .api }

266

def filter_models_repeats(fasta, repeats, gene_models, filter_threshold=90, log=False):

267

"""

268

Filter gene models overlapping repetitive regions.

269

270

Parameters:

271

- fasta (str): Path to genome FASTA file

272

- repeats (str): Path to repeat annotations (BED format)

273

- gene_models (dict): Gene model dictionary

274

- filter_threshold (int): Percent overlap threshold for filtering

275

- log (function|bool): Logging function or False

276

277

Returns:

278

dict: Filtered gene models

279

"""

280

281

def reasonable_model(coords, min_protein=30, min_exon=3, min_intron=10, max_intron=-1, max_exon=-1):

282

"""

283

Validate gene model structure against biological constraints.

284

285

Parameters:

286

- coords (dict): Gene coordinates {'CDS': [...], 'protein': [...]}

287

- min_protein (int): Minimum protein length (amino acids)

288

- min_exon (int): Minimum exon length (nucleotides)

289

- min_intron (int): Minimum intron length (nucleotides)

290

- max_intron (int): Maximum intron length (-1 for no limit)

291

- max_exon (int): Maximum exon length (-1 for no limit)

292

293

Returns:

294

bool: True if model passes structural validation

295

"""

296

```

297

298

### Evidence Integration

299

300

Load and integrate protein and transcript evidence into consensus prediction.

301

302

```python { .api }

303

def gffevidence2dict(file, Evi):

304

"""

305

Parse evidence alignments from GFF3 into evidence dictionary.

306

307

Parameters:

308

- file (str): Path to evidence GFF3 file

309

- Evi (dict): Existing evidence dictionary to update

310

311

Returns:

312

dict: Updated evidence dictionary

313

"""

314

315

def add_evidence(loci, evidence, source="proteins"):

316

"""

317

Add evidence data to gene model loci.

318

319

Parameters:

320

- loci (dict): Dictionary of gene model loci

321

- evidence (dict): Evidence alignment dictionary

322

- source (str): Evidence type ("proteins" or "transcripts")

323

324

Returns:

325

dict: Loci with integrated evidence information

326

"""

327

328

def parse_data(genome, gene, protein, transcript, log=sys.stderr.write):

329

"""

330

Parse all input data files for consensus prediction.

331

332

Parameters:

333

- genome (str): Path to genome FASTA file

334

- gene (list): List of gene model file paths

335

- protein (list): List of protein alignment file paths

336

- transcript (list): List of transcript alignment file paths

337

- log (function): Logging function

338

339

Returns:

340

tuple: (genome_dict, gene_dict, evidence_dict)

341

"""

342

343

def filter4evidence(data, n_genes=3, n_evidence=2):

344

"""

345

Filter loci requiring minimum gene and evidence counts.

346

347

Parameters:

348

- data (dict): Loci data dictionary

349

- n_genes (int): Minimum number of gene models required

350

- n_evidence (int): Minimum number of evidence alignments required

351

352

Returns:

353

dict: Filtered loci meeting evidence requirements

354

"""

355

```

356

357

### Scoring and Ranking Algorithms

358

359

Advanced scoring functions for consensus model selection.

360

361

```python { .api }

362

def score_aggregator(locus, weights, evidence_weights, min_protein=30):

363

"""

364

Aggregate multiple scoring metrics for final model ranking.

365

366

Parameters:

367

- locus (list): List of gene models at locus

368

- weights (dict): User-supplied source weights

369

- evidence_weights (dict): Evidence-derived weights

370

- min_protein (int): Minimum protein length requirement

371

372

Returns:

373

list: Ranked gene models with aggregated scores

374

"""

375

376

def cluster_by_aed(locus, score=0.005):

377

"""

378

Cluster highly similar models using AED scores.

379

380

Parameters:

381

- locus (list): List of gene models

382

- score (float): AED threshold for clustering

383

384

Returns:

385

list: Clustered gene models

386

"""

387

388

def map_coords(g_coords, e_coords):

389

"""

390

Map gene coordinates to evidence coordinates.

391

392

Parameters:

393

- g_coords (list): Gene model coordinates

394

- e_coords (list): Evidence alignment coordinates

395

396

Returns:

397

dict: Coordinate mapping information

398

"""

399

400

def score_evidence(g_coords, e_coords, weight=2):

401

"""

402

Score gene model based on evidence overlap.

403

404

Parameters:

405

- g_coords (list): Gene model coordinates

406

- e_coords (list): Evidence coordinates

407

- weight (float): Evidence weight factor

408

409

Returns:

410

float: Evidence-based score

411

"""

412

```

413

414

### Weight Calculation and Source Ranking

415

416

Calculate source weights and rankings based on evidence support.

417

418

```python { .api }

419

def calculate_source_order(data):

420

"""

421

Calculate source ranking based on evidence support.

422

423

Parameters:

424

- data (dict): Gene model and evidence data

425

426

Returns:

427

dict: Source rankings and weights

428

"""

429

430

def order_sources(locus):

431

"""

432

Order prediction sources by quality metrics.

433

434

Parameters:

435

- locus (list): List of gene models at locus

436

437

Returns:

438

dict: Ordered source rankings

439

"""

440

441

def src_scaling_factor(obj):

442

"""

443

Calculate scaling factors for different prediction sources.

444

445

Parameters:

446

- obj (dict): Source performance data

447

448

Returns:

449

dict: Source-specific scaling factors

450

"""

451

452

def de_novo_distance(locus):

453

"""

454

Calculate distance metrics for de novo source weights.

455

456

Parameters:

457

- locus (list): List of gene models

458

459

Returns:

460

dict: Distance-based weight metrics

461

"""

462

463

def calculate_gene_distance(locus):

464

"""

465

Calculate structural distance between gene models.

466

467

Parameters:

468

- locus (list): List of gene models at locus

469

470

Returns:

471

dict: Inter-model distance metrics

472

"""

473

```

474

475

### Algorithm Configuration

476

477

Configure consensus prediction algorithm parameters.

478

479

```python { .api }

480

def auto_score_threshold(weights, order, user_weight=6):

481

"""

482

Automatically determine optimal score threshold.

483

484

Parameters:

485

- weights (dict): User-supplied weights

486

- order (dict): Source order rankings

487

- user_weight (int): Weight for user preferences

488

489

Returns:

490

float: Calculated score threshold

491

"""

492

493

def ensure_unique_names(genes):

494

"""

495

Ensure all gene models have unique identifiers.

496

497

Parameters:

498

- genes (dict): Gene model dictionary

499

500

Returns:

501

dict: Gene models with unique names

502

"""

503

504

def fasta_length(fasta):

505

"""

506

Get sequence lengths from FASTA file.

507

508

Parameters:

509

- fasta (str): Path to FASTA file

510

511

Returns:

512

dict: Dictionary mapping sequence names to lengths

513

"""

514

```

515

516

### Output and Refinement

517

518

Refine consensus predictions and generate output files.

519

520

```python { .api }

521

def refine_cluster(locus, derived=[]):

522

"""

523

Refine gene model cluster using derived weights.

524

525

Parameters:

526

- locus (list): List of gene models

527

- derived (list): Derived weight parameters

528

529

Returns:

530

list: Refined gene model cluster

531

"""

532

533

def solve_sub_loci(result):

534

"""

535

Resolve complex multi-gene loci into individual predictions.

536

537

Parameters:

538

- result (dict): Clustered locus data

539

540

Returns:

541

dict: Resolved individual gene predictions

542

"""

543

544

def gff_writer(input, output):

545

"""

546

Write consensus predictions to GFF3 format.

547

548

Parameters:

549

- input (dict): Consensus gene predictions

550

- output (str): Output GFF3 file path

551

552

Returns:

553

None

554

"""

555

```

556

557

### CLI Interface

558

559

Command-line interface for consensus prediction.

560

561

```python { .api }

562

def consensus(args):

563

"""

564

Main consensus prediction CLI entry point.

565

566

Parameters:

567

- args (argparse.Namespace): Command line arguments

568

569

Returns:

570

None

571

"""

572

- max_exon (int): Maximum exon length (-1 for no limit)

573

- evidence_derived_models (list): Evidence-derived model sources

574

575

Returns:

576

dict: Best gene model with score information

577

"""

578

```

579

580

### Repeat Filtering

581

582

Filter gene models that significantly overlap with repetitive elements.

583

584

```python { .api }

585

def filter_models_repeats(fasta, repeats, gene_models, filter_threshold=90, log=False):

586

"""

587

Filter gene models overlapping with repetitive elements.

588

589

Parameters:

590

- fasta (str): Path to genome FASTA file

591

- repeats (str): Path to repeat annotations (BED or GFF3)

592

- gene_models (dict): Dictionary of gene models to filter

593

- filter_threshold (int): Percent overlap threshold for filtering

594

- log (function|bool): Logging function or boolean

595

596

Returns:

597

dict: Filtered gene models with repeats removed

598

"""

599

```

600

601

### Structural Validation

602

603

Validate gene model structure against biological constraints.

604

605

```python { .api }

606

def reasonable_model(coords, min_protein=30, min_exon=3, min_intron=10, max_intron=-1, max_exon=-1):

607

"""

608

Validate gene model structure against biological constraints.

609

610

Parameters:

611

- coords (dict): Gene model coordinate information

612

- min_protein (int): Minimum protein length in amino acids

613

- min_exon (int): Minimum exon length in nucleotides

614

- min_intron (int): Minimum intron length in nucleotides

615

- max_intron (int): Maximum intron length (-1 for no limit)

616

- max_exon (int): Maximum exon length (-1 for no limit)

617

618

Returns:

619

bool: True if model passes validation, False otherwise

620

"""

621

```

622

623

## Usage Examples

624

625

### Basic Consensus Prediction

626

627

```python

628

from gfftk.consensus import generate_consensus

629

630

# Define input files

631

gene_predictions = ["augustus.gff3", "genemark.gff3", "snap.gff3"]

632

protein_alignments = ["uniprot.gff3", "pfam.gff3"]

633

transcript_alignments = ["rnaseq.gff3"]

634

635

# User-defined source weights

636

weights = {

637

"augustus": 3,

638

"genemark": 2,

639

"snap": 1

640

}

641

642

# Generate consensus

643

generate_consensus(

644

fasta="genome.fasta",

645

genes=gene_predictions,

646

proteins=protein_alignments,

647

transcripts=transcript_alignments,

648

weights=weights,

649

output="consensus.gff3",

650

num_processes=4,

651

min_protein=50,

652

repeat_overlap=80

653

)

654

```

655

656

### AED Calculation

657

658

```python

659

from gfftk.consensus import getAED

660

from gfftk.gff import gff2dict

661

662

# Load two annotations for comparison

663

reference = gff2dict("reference.gff3", "genome.fasta")

664

query = gff2dict("prediction.gff3", "genome.fasta")

665

666

# Calculate AED for overlapping genes

667

for gene_id in reference:

668

if gene_id in query:

669

aed = getAED(query[gene_id], reference[gene_id])

670

print(f"Gene {gene_id}: AED = {aed:.3f}")

671

```

672

673

### Custom Scoring Pipeline

674

675

```python

676

from gfftk.consensus import score_by_evidence, best_model_default

677

from gfftk.gff import gff2dict

678

679

# Load gene models and evidence

680

genes = gff2dict("predictions.gff3", "genome.fasta")

681

proteins = gff2dict("protein_alignments.gff3", "genome.fasta")

682

683

# Group genes by locus (simplified example)

684

loci = {}

685

for gene_id, gene_data in genes.items():

686

locus_key = f"{gene_data['contig']}:{gene_data['location'][0]}-{gene_data['location'][1]}"

687

if locus_key not in loci:

688

loci[locus_key] = []

689

loci[locus_key].append({gene_id: gene_data})

690

691

# Score each locus

692

weights = {"augustus": 2, "genemark": 1}

693

consensus_models = {}

694

695

for locus_key, locus_models in loci.items():

696

# Score models by evidence

697

scored = score_by_evidence(locus_models, weights)

698

699

# Select best model

700

best = best_model_default(scored, weights, [], min_protein=30)

701

702

if best:

703

consensus_models.update(best)

704

705

print(f"Selected {len(consensus_models)} consensus gene models")

706

```

707

708

### Repeat Filtering

709

710

```python

711

from gfftk.consensus import filter_models_repeats

712

from gfftk.gff import gff2dict

713

714

# Load gene models

715

models = gff2dict("gene_models.gff3", "genome.fasta")

716

717

# Filter models overlapping repeats

718

filtered_models = filter_models_repeats(

719

fasta="genome.fasta",

720

repeats="repeats.bed",

721

gene_models=models,

722

filter_threshold=75, # Remove models with >75% repeat overlap

723

log=print

724

)

725

726

print(f"Retained {len(filtered_models)}/{len(models)} models after repeat filtering")

727

```

728

729

### Structural Validation

730

731

```python

732

from gfftk.consensus import reasonable_model

733

from gfftk.gff import gff2dict

734

735

# Load gene models

736

models = gff2dict("gene_models.gff3", "genome.fasta")

737

738

# Validate each model

739

valid_models = {}

740

for gene_id, gene_data in models.items():

741

coords = {

742

'CDS': gene_data['CDS'],

743

'protein': gene_data['protein'],

744

'mRNA': gene_data['mRNA']

745

}

746

747

# Apply structural constraints

748

if reasonable_model(

749

coords,

750

min_protein=50, # At least 50 amino acids

751

min_exon=15, # At least 15 bp exons

752

min_intron=20, # At least 20 bp introns

753

max_intron=50000 # Max 50kb introns

754

):

755

valid_models[gene_id] = gene_data

756

757

print(f"Validated {len(valid_models)}/{len(models)} gene models")

758

```

759

760

## Types

761

762

```python { .api }

763

# Gene model locus structure

764

Locus = list[dict] # List of gene models at same genomic location

765

766

# Source weights dictionary

767

SourceWeights = dict[str, float] # {source_name: weight_value}

768

769

# Evidence overlap metrics

770

EvidenceMetrics = {

771

"protein_overlap": float, # Percent overlap with protein evidence

772

"transcript_overlap": float, # Percent overlap with transcript evidence

773

"aed_score": float, # Annotation Edit Distance

774

"combined_score": float # Final combined score

775

}

776

777

# Scored gene model

778

ScoredModel = {

779

"gene_data": dict, # Original gene annotation data

780

"score": float, # Final score

781

"evidence": EvidenceMetrics # Evidence support metrics

782

}

783

784

# Consensus parameters

785

ConsensusParams = {

786

"min_protein": int, # Minimum protein length

787

"min_exon": int, # Minimum exon length

788

"max_exon": int, # Maximum exon length

789

"min_intron": int, # Minimum intron length

790

"max_intron": int, # Maximum intron length

791

"repeat_overlap": int, # Repeat overlap threshold

792

"minscore": int, # Minimum model score

793

"num_processes": int # Number of parallel processes

794

}

795

796

# AED score

797

AEDScore = float # 0.0 (perfect) to 1.0 (no overlap)

798

799

# Genomic coordinates

800

GenomicCoordinates = list[tuple[int, int]]

801

802

# Repeat annotation format

803

RepeatAnnotation = {

804

"contig": str,

805

"start": int,

806

"end": int,

807

"type": str

808

}

809

```