Comprehensive Python toolkit for working with genome annotation files in GFF3, GTF, and TBL formats with format conversion and analysis capabilities
npx @tessl/cli install tessl/pypi-gfftk@25.6.00
# GFFtk

A Python toolkit for working with genome annotation files in GFF3, GTF, and TBL formats. GFFtk converts between genomic file formats (including GenBank), extracts protein and transcript sequences from annotations, and filters genomic features with regex patterns.

## Package Information

- **Package Name**: gfftk
- **Language**: Python
- **Installation**: `pip install gfftk`
- **Command Line Tool**: `gfftk` (after installation)

## Core Imports

```python
import gfftk
```

Common imports for format conversion and parsing:

```python
from gfftk.gff import gff2dict, dict2gff3, gtf2dict, dict2gtf
from gfftk.convert import gff2proteins, gff2tbl, tbl2gff3
from gfftk.genbank import tbl2dict, dict2tbl
from gfftk.fasta import fasta2dict, translate, FASTA
```

## Basic Usage

```python
from gfftk.gff import gff2dict
from gfftk.convert import gff2proteins
from gfftk.fasta import fasta2dict

# Load genome sequence and annotation
genome = fasta2dict("genome.fasta")
annotation = gff2dict("annotation.gff3", "genome.fasta")

# Convert GFF3 to protein FASTA
gff2proteins("annotation.gff3", "genome.fasta", output="proteins.faa")

# Access annotation data programmatically
for gene_id, gene_data in annotation.items():
    print(f"Gene: {gene_id}")
    print(f"Location: {gene_data['contig']}:{gene_data['location'][0]}-{gene_data['location'][1]}")
    print(f"Products: {gene_data['product']}")
```

## Architecture

GFFtk is built around a central annotation dictionary format that enables seamless conversion between different genomic annotation formats:

- **Central Data Structure**: All formats (GFF3, GTF, TBL, GenBank) are converted to a unified annotation dictionary
- **Format Parsers**: Specialized parsers for each input format handle format-specific quirks
- **Format Writers**: Dedicated writers output the annotation dictionary to different formats
- **Validation Layer**: Built-in validation ensures data integrity during conversions
- **CLI Interface**: Command-line tools provide batch processing capabilities
- **Filtering System**: Flexible regex-based filtering for feature selection

This design allows for reliable conversion between formats while preserving all annotation information and relationships.
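
The parser, dictionary, and writer arrangement described above can be illustrated with a toy sketch. The functions `parse_gff3_genes` and `write_gtf_genes` are hypothetical names invented for this example and are not part of GFFtk; they handle only `gene` rows and a handful of keys, whereas the real parsers and writers cover full gene models.

```python
def parse_gff3_genes(text):
    """Parse only the 'gene' rows of GFF3 text into a simplified dictionary.

    Hypothetical helper for illustration; not GFFtk's parser.
    """
    genes = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue
        contig, source, _type, start, end, _score, strand, _phase, attrs = cols
        # GFF3 attributes are semicolon-separated key=value pairs
        attr = dict(pair.split("=", 1) for pair in attrs.split(";") if "=" in pair)
        genes[attr["ID"]] = {
            "contig": contig,
            "source": source,
            "location": (int(start), int(end)),
            "strand": strand,
        }
    return genes


def write_gtf_genes(genes):
    """Render the simplified dictionary as GTF-style gene rows.

    Hypothetical helper for illustration; not GFFtk's writer.
    """
    rows = []
    for gene_id, gene in genes.items():
        start, end = gene["location"]
        rows.append("\t".join([
            gene["contig"], gene["source"], "gene", str(start), str(end),
            ".", gene["strand"], ".", f'gene_id "{gene_id}";',
        ]))
    return "\n".join(rows)


gff3_text = "##gff-version 3\nchr1\ttoy\tgene\t100\t900\t.\t+\t.\tID=gene1;Name=abc\n"
annotation = parse_gff3_genes(gff3_text)
print(write_gtf_genes(annotation))
```

Because every writer reads the same dictionary, supporting a new output format in this arrangement means adding one writer rather than one converter per pair of formats.
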

## Capabilities

### Format Conversion

Convert between GFF3, GTF, TBL, GenBank, and FASTA formats with full feature preservation and validation. Supports protein and transcript sequence extraction with customizable genetic code tables.

```python { .api }
def gff2tbl(gff, fasta, output=False, table=1, debug=False, grep=[], grepv=[]):
    """Convert GFF3 to TBL format"""

def tbl2gff3(tbl, fasta, output=False, table=1, grep=[], grepv=[]):
    """Convert TBL to GFF3 format"""

def gff2proteins(gff, fasta, output=False, table=1, strip_stop=False, debug=False, grep=[], grepv=[]):
    """Convert GFF3 to protein FASTA sequences"""
```

[Format Conversion](./format-conversion.md)

### GFF3 and GTF Parsing

Parse and validate GFF3 and GTF files with support for multiple annotation sources and formats. Handles complex gene models with alternative splicing and provides robust error checking.

```python { .api }
def gff2dict(gff, fasta, table=1, debug=False):
    """Parse GFF3 to annotation dictionary"""

def gtf2dict(gtf, fasta, table=1, debug=False):
    """Parse GTF to annotation dictionary"""

def dict2gff3(infile, output=False, debug=False, source=False, newline=False):
    """Write annotation dictionary to GFF3 format"""
```

[GFF3 and GTF Processing](./gff-processing.md)

### Sequence Operations

Extract and manipulate genomic sequences with support for coordinate-based extraction, translation using multiple genetic codes, and reverse complement operations.

```python { .api }
class FASTA:
    def __init__(self, fasta_file): ...
    def get_seq(self, contig): ...

def fasta2dict(fasta, full_header=False):
    """Convert FASTA file to dictionary"""

def translate(dna, strand, phase, table=1):
    """Translate DNA sequence to protein"""
```

[Sequence Operations](./sequence-operations.md)
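
To make the translation and reverse complement steps concrete, here is a minimal standalone sketch using only the standard genetic code (NCBI table 1). `translate_sketch` and `reverse_complement` are hypothetical names for this example, not GFFtk functions. The sketch assumes that `phase` means the number of leading bases to skip and that minus-strand input is reverse complemented before translation; GFFtk's `translate` may treat these details differently.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons ordered by T, C, A, G
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate_sketch(dna, strand="+", phase=0):
    """Translate DNA to protein; unknown codons become 'X'.

    Assumption: `phase` is the number of leading bases to skip.
    """
    seq = dna.upper()
    if strand == "-":
        seq = reverse_complement(seq)
    seq = seq[phase:]
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3)
    )


print(translate_sketch("ATGGCCTAA"))       # plus strand
print(translate_sketch("TTAGGCCAT", "-"))  # same gene on the minus strand
```

Supporting other genetic codes in this sketch would only require swapping the amino acid string for the corresponding NCBI table.
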

### Consensus Gene Prediction

EvidenceModeler-like consensus gene prediction that combines multiple annotation sources using protein and transcript evidence, with configurable scoring weights and structural validation.

```python { .api }
def generate_consensus(fasta, genes, proteins, transcripts, weights, output, ...):
    """Generate consensus predictions from multiple sources"""

def getAED(query, reference):
    """Calculate Annotation Edit Distance"""

def score_by_evidence(locus, weights={}, derived=[]):
    """Score models by evidence overlap"""
```

[Consensus Prediction](./consensus.md)
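
Annotation Edit Distance (AED) is commonly defined at the base level as `1 - (sensitivity + specificity) / 2`, where sensitivity is the fraction of reference bases covered by the query and specificity is the fraction of query bases covered by the reference. The sketch below implements that common definition on lists of exon intervals. `aed_sketch` is a hypothetical name, the 1-based inclusive coordinates are an assumption, and GFFtk's `getAED` may differ in its inputs and edge-case handling.

```python
def covered_bases(exons):
    """Return the set of positions covered by (start, end) intervals.

    Assumption: coordinates are 1-based and inclusive.
    """
    bases = set()
    for start, end in exons:
        bases.update(range(start, end + 1))
    return bases


def aed_sketch(query_exons, reference_exons):
    """Base-level AED: 0.0 means identical coverage, 1.0 means no overlap."""
    query = covered_bases(query_exons)
    reference = covered_bases(reference_exons)
    if not query or not reference:
        return 1.0  # nothing to compare, treat as maximally distant
    overlap = len(query & reference)
    sensitivity = overlap / len(reference)
    specificity = overlap / len(query)
    return 1 - (sensitivity + specificity) / 2


print(aed_sketch([(1, 100)], [(1, 100)]))    # identical models
print(aed_sketch([(1, 100)], [(51, 150)]))   # half-overlapping models
print(aed_sketch([(1, 100)], [(201, 300)]))  # disjoint models
```

Using sets keeps the sketch short; for long genes an interval-arithmetic version would avoid materializing every position.
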

### Annotation Comparison

Compare two genome annotations to identify differences, calculate similarity metrics, and generate detailed comparison reports with feature-level analysis.

```python { .api }
def compareAnnotations(old, new, fasta, output=False):
    """Compare two GFF3 annotations"""

def pairwiseAED(query, reference):
    """Calculate pairwise AED scores"""

def gff2interlap(input, fasta):
    """Convert GFF3 to InterLap structure for overlap analysis"""
```

[Annotation Comparison](./comparison.md)

### GenBank and TBL Format Handling

Complete support for NCBI GenBank and TBL annotation formats with bidirectional conversion, validation, and NCBI submission integration.

```python { .api }
def tbl2dict(inputfile, fasta, annotation=False, table=1, debug=False):
    """Convert NCBI TBL format to annotation dictionary"""

def dict2tbl(annots, seqs, outfile, table=1, debug=False):
    """Convert annotation dictionary to NCBI TBL format"""

def dict2gbff(annots, seqs, outfile, organism=None, circular=False):
    """Convert annotation dictionary to GenBank format"""

def table2asn(sbt, tbl, fasta, out, organism, strain, table=1):
    """Run NCBI table2asn for GenBank submission"""
```

[GenBank and TBL Formats](./genbank-tbl.md)

### Command Line Interface

Direct command-line access to all GFFtk functions with comprehensive parameter support and batch processing capabilities.

```python { .api }
def convert(args):
    """CLI interface for format conversion operations"""

def consensus(args):
    """CLI interface for consensus gene prediction"""

def compare(args):
    """CLI interface for annotation comparison"""

def stats(args):
    """CLI interface for statistics calculation"""
```

[Command Line Tools](./cli-commands.md)

### File Utilities and Validation

Comprehensive file handling utilities with support for compressed formats, data validation, and annotation statistics calculation.

```python { .api }
def zopen(filename, mode="r", buff=1024*1024, external=True):
    """Open files with automatic compression support"""

def annotation_stats(Genes):
    """Calculate comprehensive annotation statistics"""

def filter_annotations(annotations, grep=None, grepv=None):
    """Filter annotations using regex patterns"""
```

[Utilities and Validation](./utilities.md)
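
As a rough illustration of regex-based keep and drop filtering, the sketch below applies patterns to a toy annotation dictionary. Everything here is an assumption made for the example: `filter_sketch` is a hypothetical name, patterns are given as `(key, regex)` tuples, every `grep` pattern must match for a gene to be kept, and any `grepv` match drops it. The pattern syntax and matching rules of GFFtk's `filter_annotations` are not shown on this page and may differ.

```python
import re


def _matches(gene, key, pattern):
    """True if the regex matches the value (or any list item) stored at key."""
    value = gene.get(key, "")
    values = value if isinstance(value, list) else [value]
    return any(re.search(pattern, str(item)) for item in values)


def filter_sketch(annotations, grep=None, grepv=None):
    """Keep genes matching all grep patterns and none of the grepv patterns.

    Hypothetical helper; patterns are (key, regex) tuples in this sketch.
    """
    grep = grep or []
    grepv = grepv or []
    kept = {}
    for gene_id, gene in annotations.items():
        if not all(_matches(gene, key, pat) for key, pat in grep):
            continue
        if any(_matches(gene, key, pat) for key, pat in grepv):
            continue
        kept[gene_id] = gene
    return kept


toy = {
    "g1": {"contig": "chr1", "product": ["kinase"]},
    "g2": {"contig": "chr1", "product": ["hypothetical protein"]},
    "g3": {"contig": "chr2", "product": ["transporter"]},
}
on_chr1 = filter_sketch(toy, grep=[("contig", r"^chr1$")])
named = filter_sketch(toy, grepv=[("product", r"hypothetical")])
print(sorted(on_chr1), sorted(named))
```

Defaulting `grep` and `grepv` to `None` rather than `[]` avoids sharing one mutable default list across calls.
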

## Types

```python { .api }
# Central annotation dictionary format
AnnotationDict = dict[str, dict]

# Gene annotation structure
GeneAnnotation = {
    "name": str,                          # Gene name/identifier
    "type": list[str],                    # Feature types per transcript
    "transcript": list[str],              # Full transcript sequences
    "cds_transcript": list[str],          # CDS-only sequences
    "protein": list[str],                 # Protein translations
    "5UTR": list[list[tuple[int, int]]],  # 5' UTR coordinates
    "3UTR": list[list[tuple[int, int]]],  # 3' UTR coordinates
    "codon_start": list[int],             # Translation start phase
    "ids": list[str],                     # Transcript IDs
    "CDS": list[list[tuple[int, int]]],   # CDS coordinates
    "mRNA": list[list[tuple[int, int]]],  # mRNA coordinates
    "strand": str,                        # Strand ("+"/"-")
    "gene_synonym": list[str],            # Gene synonyms
    "location": tuple[int, int],          # Gene coordinates
    "contig": str,                        # Contig/chromosome
    "product": list[str],                 # Product descriptions
    "source": str,                        # Annotation source
    "phase": list[str],                   # CDS phase info
    "db_xref": list[list[str]],           # Database cross-refs
    "go_terms": list[list[str]],          # GO terms
    "EC_number": list[list[str]],         # EC numbers
    "note": list[list[str]],              # Notes
    "partialStart": list[bool],           # Partial start flags
    "partialStop": list[bool],            # Partial stop flags
    "pseudo": bool,                       # Pseudogene flag
}
```
```
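
The per-transcript fields in this structure are lists, so a single gene record can carry several isoforms. The sketch below builds a hand-written record with a subset of the keys and walks two of those lists together. The values are invented for illustration, `cds_lengths` is a hypothetical helper, and the sketch assumes that the lists are index-aligned by transcript and that coordinates are 1-based and inclusive.

```python
# Hand-written toy record using a subset of the GeneAnnotation keys
gene = {
    "name": "abc1",
    "type": ["mRNA", "mRNA"],
    "ids": ["gene1-T1", "gene1-T2"],
    "CDS": [
        [(100, 200), (300, 450)],  # first transcript
        [(100, 200), (350, 450)],  # second transcript
    ],
    "strand": "+",
    "location": (100, 450),
    "contig": "chr1",
    "product": ["kinase", "kinase"],
    "pseudo": False,
}


def cds_lengths(gene):
    """Map each transcript ID to its total CDS length in bases.

    Assumptions: "ids" and "CDS" are index-aligned, and each
    (start, end) pair is 1-based and inclusive.
    """
    return {
        transcript_id: sum(end - start + 1 for start, end in segments)
        for transcript_id, segments in zip(gene["ids"], gene["CDS"])
    }


for transcript_id, length in cds_lengths(gene).items():
    print(f"{transcript_id}: {length} bp of CDS")
```
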