Tessl Tile for pypi/gfftk@25.6.0

# GFFtk

GFFtk is a Python toolkit for working with genome annotation files in GFF3, GTF, and TBL formats. It converts between annotation formats (including GenBank), extracts protein and transcript sequences from annotations, and filters genomic features with regular-expression patterns.

## Package Information

- **Package Name**: gfftk
- **Language**: Python
- **Installation**: `pip install gfftk`
- **Command Line Tool**: `gfftk` (after installation)

## Core Imports

```python
import gfftk
```

Common imports for format conversion and parsing:

```python
from gfftk.gff import gff2dict, dict2gff3, gtf2dict, dict2gtf
from gfftk.convert import gff2proteins, gff2tbl, tbl2gff3
from gfftk.genbank import tbl2dict, dict2tbl
from gfftk.fasta import fasta2dict, translate, FASTA
```

## Basic Usage

```python
from gfftk.gff import gff2dict
from gfftk.convert import gff2proteins
from gfftk.fasta import fasta2dict

# Load genome sequence and annotation
genome = fasta2dict("genome.fasta")
annotation = gff2dict("annotation.gff3", "genome.fasta")

# Convert GFF3 to protein FASTA
gff2proteins("annotation.gff3", "genome.fasta", output="proteins.faa")

# Access annotation data programmatically
for gene_id, gene_data in annotation.items():
    print(f"Gene: {gene_id}")
    print(f"Location: {gene_data['contig']}:{gene_data['location'][0]}-{gene_data['location'][1]}")
    print(f"Products: {gene_data['product']}")
```

## Architecture

GFFtk is built around a central annotation dictionary. Every supported format is parsed into this dictionary and written back out from it, so converting between two formats is a parse step followed by a write step:

- **Central Data Structure**: All formats (GFF3, GTF, TBL, GenBank) are converted to a unified annotation dictionary
- **Format Parsers**: Specialized parsers for each input format handle format-specific quirks
- **Format Writers**: Dedicated writers output the annotation dictionary to different formats
- **Validation Layer**: Built-in validation checks data integrity during conversions
- **CLI Interface**: Command-line tools provide batch processing capabilities
- **Filtering System**: Flexible regex-based filtering for feature selection

Because every conversion passes through the same dictionary, annotation information and parent-child relationships are carried from one format to another without format-specific conversion paths.
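
The hub-and-spoke idea above can be illustrated with a small, self-contained sketch. It does not call GFFtk: the `record` is hand-written to resemble the annotation dictionary, and `to_gff3_lines` is a hypothetical writer invented for this example, not the library's `dict2gff3`. For brevity it reuses the gene span for the mRNA row.

```python
# Illustrative only: a hand-built record shaped like the annotation dictionary.
record = {
    "gene1": {
        "contig": "chr1",
        "source": "example",
        "strand": "+",
        "location": (100, 900),
        "ids": ["gene1-T1"],
        "mRNA": [[(100, 400), (600, 900)]],
    }
}


def to_gff3_lines(annotation):
    """Hypothetical writer: emit gene, mRNA, and exon rows as GFF3 text."""
    lines = ["##gff-version 3"]
    for gene_id, gene in annotation.items():
        start, end = gene["location"]
        prefix = [gene["contig"], gene["source"]]
        suffix = [".", gene["strand"], "."]

        def row(ftype, s, e, attrs):
            return "\t".join(prefix + [ftype, str(s), str(e)] + suffix + [attrs])

        lines.append(row("gene", start, end, f"ID={gene_id}"))
        for tid, exons in zip(gene["ids"], gene["mRNA"]):
            # Simplification: the mRNA row reuses the gene coordinates.
            lines.append(row("mRNA", start, end, f"ID={tid};Parent={gene_id}"))
            for i, (s, e) in enumerate(exons, 1):
                lines.append(row("exon", s, e, f"ID={tid}.exon{i};Parent={tid}"))
    return lines


gff3_lines = to_gff3_lines(record)
print("\n".join(gff3_lines))
```

A second writer for another format would consume the same `record` unchanged, which is the point of routing every format through one structure.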

## Capabilities

### Format Conversion

Convert between GFF3, GTF, TBL, and GenBank annotation formats with validation, and export protein or transcript sequences as FASTA. The `table` parameter selects the genetic code table used for translation (default `1`).

```python { .api }
def gff2tbl(gff, fasta, output=False, table=1, debug=False, grep=[], grepv=[]):
    """Convert GFF3 to TBL format"""

def tbl2gff3(tbl, fasta, output=False, table=1, grep=[], grepv=[]):
    """Convert TBL to GFF3 format"""

def gff2proteins(gff, fasta, output=False, table=1, strip_stop=False, debug=False, grep=[], grepv=[]):
    """Convert GFF3 to protein FASTA sequences"""
```

[Format Conversion](./format-conversion.md)
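
As a rough picture of what protein export involves, the sketch below writes one FASTA record per transcript from a hand-built dictionary. It is not GFFtk's implementation: `proteins_to_fasta` and its `width` argument are hypothetical, and the idea that `strip_stop` removes a trailing `*` is an assumption based on the parameter name.

```python
import io

# Hand-built stand-in for the annotation dictionary (assumed layout).
annotation = {
    "gene1": {"ids": ["gene1-T1", "gene1-T2"], "protein": ["MKT*", "MKTLLV*"]},
}


def proteins_to_fasta(annotation, handle, strip_stop=False, width=60):
    """Hypothetical helper: write one FASTA record per transcript."""
    for gene_id, gene in annotation.items():
        for tid, prot in zip(gene["ids"], gene["protein"]):
            if strip_stop:
                # Assumed meaning of strip_stop: drop the terminal stop symbol.
                prot = prot.rstrip("*")
            handle.write(f">{tid} {gene_id}\n")
            # Wrap the sequence at a fixed line width.
            for i in range(0, len(prot), width):
                handle.write(prot[i:i + width] + "\n")


buffer = io.StringIO()
proteins_to_fasta(annotation, buffer, strip_stop=True)
fasta_text = buffer.getvalue()
print(fasta_text, end="")
```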

### GFF3 and GTF Parsing

Parse and validate GFF3 and GTF files from multiple annotation sources. Gene models with alternative splicing are supported, and the parsers check for errors as they read.

```python { .api }
def gff2dict(gff, fasta, table=1, debug=False):
    """Parse GFF3 to annotation dictionary"""

def gtf2dict(gtf, fasta, table=1, debug=False):
    """Parse GTF to annotation dictionary"""

def dict2gff3(infile, output=False, debug=False, source=False, newline=False):
    """Write annotation dictionary to GFF3 format"""
```

[GFF3 and GTF Processing](./gff-processing.md)
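
For readers unfamiliar with the input format, the following sketch shows the first step any GFF3 parser has to perform: splitting a feature line into its nine tab-separated columns and decoding the attribute column. `parse_gff3_line` is a hypothetical helper written for this page and says nothing about how `gff2dict` is implemented.

```python
from urllib.parse import unquote


def parse_gff3_line(line):
    """Split one GFF3 feature line into its nine columns (hypothetical helper)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    attributes = {}
    for pair in cols[8].split(";"):
        if pair:
            key, _, value = pair.partition("=")
            # GFF3 percent-encodes reserved characters; multiple values are comma-separated.
            attributes[key] = [unquote(v) for v in value.split(",")]
    return {
        "contig": cols[0], "source": cols[1], "type": cols[2],
        "start": int(cols[3]), "end": int(cols[4]), "strand": cols[6],
        "phase": cols[7], "attributes": attributes,
    }


feature = parse_gff3_line(
    "chr1\texample\tCDS\t100\t400\t.\t+\t0\tID=cds1;Parent=gene1-T1,gene1-T2"
)
print(feature["type"], feature["attributes"]["Parent"])
```

A CDS with two `Parent` values, as in this line, is one way alternative transcripts sharing an exon appear in GFF3.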

### Sequence Operations

Extract and manipulate genomic sequences: coordinate-based extraction, translation with a selectable genetic code, and reverse complementation.

```python { .api }
class FASTA:
    def __init__(self, fasta_file): ...
    def get_seq(self, contig): ...

def fasta2dict(fasta, full_header=False):
    """Convert FASTA file to dictionary"""

def translate(dna, strand, phase, table=1):
    """Translate DNA sequence to protein"""
```

[Sequence Operations](./sequence-operations.md)
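
Translation with strand and phase can be sketched without GFFtk as follows. This is a minimal stand-in, not the library's `translate`: it supports only NCBI table 1 (the standard code), and it assumes that `phase` means the number of leading bases to skip, as in the GFF3 specification.

```python
from itertools import product

# NCBI translation table 1 (standard code), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(dna):
    """Return the reverse complement of an unambiguous DNA string."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate_cds(dna, strand="+", phase=0):
    """Hypothetical stand-in for a translate function: standard code only."""
    seq = dna.upper() if strand == "+" else reverse_complement(dna)
    seq = seq[phase:]  # assumed meaning of phase: bases to skip before codon 1
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    codons = (seq[i:i + 3] for i in range(0, usable, 3))
    # Codons containing ambiguous bases translate to "X".
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


print(translate_cds("ATGAAATAG"))
print(translate_cds("CTATTTCAT", strand="-"))
```

Both calls describe the same coding sequence, once on each strand, so they yield the same protein string.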

### Consensus Gene Prediction

Build consensus gene models, in the style of EvidenceModeler, by combining multiple annotation sources with protein and transcript evidence. Scoring weights are configurable and candidate models are structurally validated.

```python { .api }
def generate_consensus(fasta, genes, proteins, transcripts, weights, output, ...):
    """Generate consensus predictions from multiple sources"""

def getAED(query, reference):
    """Calculate Annotation Edit Distance"""

def score_by_evidence(locus, weights={}, derived=[]):
    """Score models by evidence overlap"""
```

[Consensus Prediction](./consensus.md)
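
Annotation Edit Distance is commonly defined at the base level as AED = 1 - (SN + SP) / 2, where SN is the fraction of reference bases covered by the query and SP is the fraction of query bases covered by the reference. The sketch below implements that textbook definition on exon coordinate lists. Whether `getAED` computes exactly this, and what argument types it expects, is an assumption; `annotation_edit_distance` is a hypothetical name.

```python
def covered_bases(exons):
    """Return the set of 1-based positions covered by (start, end) intervals."""
    bases = set()
    for start, end in exons:
        bases.update(range(start, end + 1))
    return bases


def annotation_edit_distance(query_exons, reference_exons):
    """AED = 1 - (sensitivity + specificity) / 2, computed at the base level."""
    query = covered_bases(query_exons)
    reference = covered_bases(reference_exons)
    if not query or not reference:
        return 1.0  # nothing to compare: treat as maximally different
    shared = len(query & reference)
    sensitivity = shared / len(reference)
    specificity = shared / len(query)
    return 1 - (sensitivity + specificity) / 2


identical = annotation_edit_distance([(1, 100)], [(1, 100)])
half = annotation_edit_distance([(1, 100)], [(51, 150)])
disjoint = annotation_edit_distance([(1, 100)], [(201, 300)])
print(identical, half, disjoint)
```

An AED of 0 means the two models cover exactly the same bases, and 1 means they share none. Using a set of positions keeps the sketch short; an interval-based computation would scale better for long genes.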

### Annotation Comparison

Compare two genome annotations to identify differences, calculate similarity metrics, and generate comparison reports at the feature level.

```python { .api }
def compareAnnotations(old, new, fasta, output=False):
    """Compare two GFF3 annotations"""

def pairwiseAED(query, reference):
    """Calculate pairwise AED scores"""

def gff2interlap(input, fasta):
    """Convert GFF3 to InterLap structure for overlap analysis"""
```

[Annotation Comparison](./comparison.md)
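
The first step of any annotation comparison is deciding which old and new gene models occupy the same locus. The sketch below pairs genes whose coordinates overlap on the same contig and strand, using two hand-built dictionaries. `overlapping_genes` is hypothetical and deliberately naive: it checks every pair, whereas an interval structure such as InterLap avoids the quadratic scan.

```python
def overlapping_genes(old, new):
    """Pair gene IDs whose coordinates overlap on the same contig and strand."""
    pairs = []
    for old_id, o in old.items():
        for new_id, n in new.items():
            if o["contig"] != n["contig"] or o["strand"] != n["strand"]:
                continue
            # Two closed intervals overlap when each starts before the other ends.
            if o["location"][0] <= n["location"][1] and n["location"][0] <= o["location"][1]:
                pairs.append((old_id, new_id))
    return pairs


old = {
    "A": {"contig": "chr1", "strand": "+", "location": (100, 500)},
    "B": {"contig": "chr1", "strand": "+", "location": (900, 1200)},
}
new = {
    "A2": {"contig": "chr1", "strand": "+", "location": (450, 800)},
    "C": {"contig": "chr2", "strand": "+", "location": (100, 500)},
}

pairs = overlapping_genes(old, new)
removed = sorted(set(old) - {o for o, _ in pairs})
added = sorted(set(new) - {n for _, n in pairs})
print(pairs, removed, added)
```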

### GenBank and TBL Format Handling

Support for the NCBI GenBank and TBL annotation formats, with conversion in both directions, validation, and a wrapper around NCBI `table2asn` for preparing submissions.

```python { .api }
def tbl2dict(inputfile, fasta, annotation=False, table=1, debug=False):
    """Convert NCBI TBL format to annotation dictionary"""

def dict2tbl(annots, seqs, outfile, table=1, debug=False):
    """Convert annotation dictionary to NCBI TBL format"""

def dict2gbff(annots, seqs, outfile, organism=None, circular=False):
    """Convert annotation dictionary to GenBank format"""

def table2asn(sbt, tbl, fasta, out, organism, strain, table=1):
    """Run NCBI table2asn for GenBank submission"""
```

[GenBank and TBL Formats](./genbank-tbl.md)
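
The TBL layout itself is simple enough to show in a few lines. The sketch below formats one gene and its CDS as NCBI feature-table text: a `>Feature` header, tab-separated start and stop with the feature key on the first interval, and qualifiers indented by three tabs. `gene_to_tbl` is a hypothetical helper, not `dict2tbl`, and it omits qualifiers such as `protein_id` that a real submission needs.

```python
def gene_to_tbl(contig, locus_tag, strand, cds, product):
    """Hypothetical helper: format one gene and its CDS as feature-table lines."""
    intervals = sorted(cds)
    if strand == "-":
        # Minus-strand features are written 5'->3', so each start exceeds its stop.
        intervals = [(end, start) for start, end in reversed(intervals)]
    gene_start, gene_stop = intervals[0][0], intervals[-1][1]
    lines = [
        f">Feature {contig}",
        f"{gene_start}\t{gene_stop}\tgene",
        f"\t\t\tlocus_tag\t{locus_tag}",
    ]
    for i, (start, stop) in enumerate(intervals):
        # Only the first interval of a multi-interval feature carries the key.
        lines.append(f"{start}\t{stop}\tCDS" if i == 0 else f"{start}\t{stop}")
    lines.append(f"\t\t\tproduct\t{product}")
    return lines


tbl = gene_to_tbl("contig1", "GENE_0001", "-", [(100, 400), (600, 900)], "hypothetical protein")
print("\n".join(tbl))
```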

### Command Line Interface

Command-line access to GFFtk's conversion, consensus, comparison, and statistics functions, suitable for batch processing.

```python { .api }
def convert(args):
    """CLI interface for format conversion operations"""

def consensus(args):
    """CLI interface for consensus gene prediction"""

def compare(args):
    """CLI interface for annotation comparison"""

def stats(args):
    """CLI interface for statistics calculation"""
```

[Command Line Tools](./cli-commands.md)

### File Utilities and Validation

File handling utilities with support for compressed input, data validation, and calculation of annotation statistics.

```python { .api }
def zopen(filename, mode="r", buff=1024*1024, external=True):
    """Open files with automatic compression support"""

def annotation_stats(Genes):
    """Calculate comprehensive annotation statistics"""

def filter_annotations(annotations, grep=None, grepv=None):
    """Filter annotations using regex patterns"""
```

[Utilities and Validation](./utilities.md)
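
The `grep` and `grepv` names suggest the semantics of `grep` and `grep -v`: keep records that match, drop records that match. The sketch below applies that idea to product descriptions in a hand-built dictionary. It is an assumption-laden illustration: `filter_by_product` is hypothetical, and the pattern syntax that `filter_annotations` actually accepts is not shown on this page.

```python
import re


def filter_by_product(annotation, grep=None, grepv=None):
    """Hypothetical filter: keep genes by regex on their product descriptions."""
    keep = {}
    for gene_id, gene in annotation.items():
        text = " ".join(gene.get("product", []))
        # grep: at least one pattern must match for the gene to be kept.
        if grep and not any(re.search(p, text, re.IGNORECASE) for p in grep):
            continue
        # grepv: any matching pattern excludes the gene.
        if grepv and any(re.search(p, text, re.IGNORECASE) for p in grepv):
            continue
        keep[gene_id] = gene
    return keep


genes = {
    "g1": {"product": ["protein kinase"]},
    "g2": {"product": ["hypothetical protein"]},
    "g3": {"product": ["histidine kinase"]},
}

kinases = filter_by_product(genes, grep=["kinase"])
named = filter_by_product(genes, grepv=["hypothetical"])
narrow = filter_by_product(genes, grep=["kinase"], grepv=["^histidine"])
print(sorted(kinases), sorted(named), sorted(narrow))
```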

## Types

```python { .api }
# Central annotation dictionary format
AnnotationDict = dict[str, dict]

# Gene annotation structure
GeneAnnotation = {
    "name": str,                    # Gene name/identifier
    "type": list[str],              # Feature types per transcript
    "transcript": list[str],        # Full transcript sequences
    "cds_transcript": list[str],    # CDS-only sequences
    "protein": list[str],           # Protein translations
    "5UTR": list[list[tuple[int, int]]],  # 5' UTR coordinates
    "3UTR": list[list[tuple[int, int]]],  # 3' UTR coordinates
    "codon_start": list[int],       # Translation start phase
    "ids": list[str],               # Transcript IDs
    "CDS": list[list[tuple[int, int]]],   # CDS coordinates
    "mRNA": list[list[tuple[int, int]]],  # mRNA coordinates
    "strand": str,                  # Strand ("+"/"-")
    "gene_synonym": list[str],      # Gene synonyms
    "location": tuple[int, int],    # Gene coordinates
    "contig": str,                  # Contig/chromosome
    "product": list[str],           # Product descriptions
    "source": str,                  # Annotation source
    "phase": list[str],             # CDS phase info
    "db_xref": list[list[str]],     # Database cross-refs
    "go_terms": list[list[str]],    # GO terms
    "EC_number": list[list[str]],   # EC numbers
    "note": list[list[str]],        # Notes
    "partialStart": list[bool],     # Partial start flags
    "partialStop": list[bool],      # Partial stop flags
    "pseudo": bool,                 # Pseudogene flag
}
```
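
In the structure above, the per-transcript fields are parallel lists: index `i` of `ids`, `type`, `CDS`, and `product` all describe the same transcript. The sketch below walks those lists together on a hand-built, partial record. `transcript_summaries` is a hypothetical helper, and the length calculation assumes 1-based inclusive coordinates, as in GFF3.

```python
# Hand-built, partial record following the GeneAnnotation layout.
gene = {
    "name": "gene1",
    "contig": "chr1",
    "strand": "+",
    "location": (100, 900),
    "ids": ["gene1-T1", "gene1-T2"],
    "type": ["mRNA", "mRNA"],
    "CDS": [[(100, 400), (600, 900)], [(100, 400)]],
    "product": ["protein kinase", "protein kinase"],
}


def transcript_summaries(gene):
    """Hypothetical helper: one summary row per transcript of a gene."""
    rows = []
    for tid, ftype, cds, product in zip(
        gene["ids"], gene["type"], gene["CDS"], gene["product"]
    ):
        # Assumes 1-based inclusive coordinates, so length is end - start + 1.
        cds_length = sum(end - start + 1 for start, end in cds)
        rows.append({"id": tid, "type": ftype, "cds_length": cds_length, "product": product})
    return rows


summaries = transcript_summaries(gene)
for row in summaries:
    print(row)
```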
