# GFFtk

GFFtk is a Python toolkit for working with genome annotation files in GFF3, GTF, and TBL formats. It converts between annotation formats (including GenBank), extracts protein and transcript sequences from annotations, and filters genomic features using regular expressions.

## Package Information

- **Package Name**: gfftk
- **Language**: Python
- **Installation**: `pip install gfftk`
- **Command Line Tool**: `gfftk` (after installation)

## Core Imports

```python
import gfftk
```

Common imports for format conversion and parsing:

```python
from gfftk.gff import gff2dict, dict2gff3, gtf2dict, dict2gtf
from gfftk.convert import gff2proteins, gff2tbl, tbl2gff3
from gfftk.genbank import tbl2dict, dict2tbl
from gfftk.fasta import fasta2dict, translate, FASTA
```

## Basic Usage

```python
from gfftk.gff import gff2dict
from gfftk.convert import gff2proteins
from gfftk.fasta import fasta2dict

# Load genome sequence and annotation
genome = fasta2dict("genome.fasta")
annotation = gff2dict("annotation.gff3", "genome.fasta")

# Convert GFF3 to protein FASTA
gff2proteins("annotation.gff3", "genome.fasta", output="proteins.faa")

# Access annotation data programmatically
for gene_id, gene_data in annotation.items():
    print(f"Gene: {gene_id}")
    print(f"Location: {gene_data['contig']}:{gene_data['location'][0]}-{gene_data['location'][1]}")
    print(f"Products: {gene_data['product']}")
```

## Architecture

GFFtk is built around a central annotation dictionary format that enables seamless conversion between different genomic annotation formats:

- **Central Data Structure**: All formats (GFF3, GTF, TBL, GenBank) are converted to a unified annotation dictionary
- **Format Parsers**: Specialized parsers for each input format handle format-specific quirks
- **Format Writers**: Dedicated writers output the annotation dictionary to different formats
- **Validation Layer**: Built-in validation ensures data integrity during conversions
- **CLI Interface**: Command-line tools provide batch processing capabilities
- **Filtering System**: Flexible regex-based filtering for feature selection

This design allows for reliable conversion between formats while preserving all annotation information and relationships.
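
As a rough illustration of the hub-and-spoke design described above, the sketch below parses a few GFF3 gene lines into a gene-keyed dictionary and writes them back out. It uses only the standard library. `parse_gff3_lines` and `write_gff3_lines` are hypothetical stand-ins rather than GFFtk functions, and the record layout is a simplified subset of the annotation dictionary.

```python
# Illustrative only: a toy parser/writer pair around a shared dictionary.
# parse_gff3_lines and write_gff3_lines are hypothetical names, not GFFtk APIs.

def parse_gff3_lines(lines):
    """Collect gene features into {gene_id: record} (simplified record)."""
    genes = {}
    for line in lines:
        # Skip blank lines, directives, and comments
        if not line.strip() or line.startswith("#"):
            continue
        contig, source, ftype, start, end, _score, strand, _phase, attrs = (
            line.rstrip("\n").split("\t")
        )
        if ftype != "gene":
            continue
        attributes = dict(pair.split("=", 1) for pair in attrs.split(";") if pair)
        genes[attributes["ID"]] = {
            "contig": contig,
            "source": source,
            "location": (int(start), int(end)),
            "strand": strand,
        }
    return genes


def write_gff3_lines(genes):
    """Serialize the simplified dictionary back to GFF3 gene lines."""
    out = ["##gff-version 3"]
    for gene_id, record in genes.items():
        start, end = record["location"]
        out.append("\t".join([
            record["contig"], record["source"], "gene",
            str(start), str(end), ".", record["strand"], ".", f"ID={gene_id}",
        ]))
    return out


gff3 = [
    "##gff-version 3",
    "chr1\texample\tgene\t100\t900\t.\t+\t.\tID=gene1",
    "chr1\texample\tgene\t1500\t2400\t.\t-\t.\tID=gene2",
]
genes = parse_gff3_lines(gff3)
round_trip = write_gff3_lines(genes)
```

Because every parser targets the same dictionary and every writer reads from it, adding a new format in a design like this means writing one parser and one writer instead of a converter for each pair of formats.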

## Capabilities

### Format Conversion

Convert between GFF3, GTF, TBL, GenBank, and FASTA formats with full feature preservation and validation. Supports protein and transcript sequence extraction with customizable genetic code tables.

```python { .api }
def gff2tbl(gff, fasta, output=False, table=1, debug=False, grep=[], grepv=[]):
    """Convert GFF3 to TBL format"""

def tbl2gff3(tbl, fasta, output=False, table=1, grep=[], grepv=[]):
    """Convert TBL to GFF3 format"""

def gff2proteins(gff, fasta, output=False, table=1, strip_stop=False, debug=False, grep=[], grepv=[]):
    """Convert GFF3 to protein FASTA sequences"""
```

[Format Conversion](./format-conversion.md)

### GFF3 and GTF Parsing

Parse and validate GFF3 and GTF files with support for multiple annotation sources and formats. Handles complex gene models with alternative splicing and provides robust error checking.

```python { .api }
def gff2dict(gff, fasta, table=1, debug=False):
    """Parse GFF3 to annotation dictionary"""

def gtf2dict(gtf, fasta, table=1, debug=False):
    """Parse GTF to annotation dictionary"""

def dict2gff3(infile, output=False, debug=False, source=False, newline=False):
    """Write annotation dictionary to GFF3 format"""
```

[GFF3 and GTF Processing](./gff-processing.md)

### Sequence Operations

Extract and manipulate genomic sequences with support for coordinate-based extraction, translation using multiple genetic codes, and reverse complement operations.

```python { .api }
class FASTA:
    def __init__(self, fasta_file): ...
    def get_seq(self, contig): ...

def fasta2dict(fasta, full_header=False):
    """Convert FASTA file to dictionary"""

def translate(dna, strand, phase, table=1):
    """Translate DNA sequence to protein"""
```

[Sequence Operations](./sequence-operations.md)
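
To make the roles of `strand` and `phase` concrete, here is a minimal, standard-library sketch of strand- and phase-aware translation using only the standard genetic code (NCBI table 1). `translate_sketch` is a hypothetical helper, not GFFtk's `translate`. It assumes that minus-strand input is reverse-complemented first and that `phase` is the number of leading bases to skip; GFFtk's own handling may differ.

```python
# Illustrative sketch of strand- and phase-aware translation (standard code only).
# translate_sketch is a hypothetical stand-in, not GFFtk's translate().

BASES = "TCAG"
# Amino acids for NCBI table 1, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Complement each base, then reverse the sequence."""
    return dna.translate(COMPLEMENT)[::-1]


def translate_sketch(dna, strand, phase):
    """Translate DNA to protein; unknown codons become 'X'."""
    seq = reverse_complement(dna) if strand == "-" else dna
    seq = seq[phase:].upper()  # assumption: phase = leading bases to skip
    protein = []
    # Step through complete codons only; trailing 1-2 bases are ignored
    for i in range(0, len(seq) - 2, 3):
        protein.append(CODON_TABLE.get(seq[i:i + 3], "X"))
    return "".join(protein)


forward = translate_sketch("ATGGCCTAA", "+", 0)
reverse = translate_sketch("TTAGGCCAT", "-", 0)
shifted = translate_sketch("GATGGCCTAA", "+", 1)
```

All three calls describe the same coding sequence (`ATG GCC TAA`), given on the forward strand, on the reverse strand, and with one leading base to skip.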

### Consensus Gene Prediction

EvidenceModeler-like consensus gene prediction that combines multiple annotation sources using protein and transcript evidence, with configurable scoring weights and structural validation.

```python { .api }
def generate_consensus(fasta, genes, proteins, transcripts, weights, output, ...):
    """Generate consensus predictions from multiple sources"""

def getAED(query, reference):
    """Calculate Annotation Edit Distance"""

def score_by_evidence(locus, weights={}, derived=[]):
    """Score models by evidence overlap"""
```

[Consensus Prediction](./consensus.md)
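
Annotation Edit Distance is commonly defined as `AED = 1 - (SN + SP) / 2`, where SN is the fraction of reference bases covered by the query and SP is the fraction of query bases covered by the reference. The sketch below computes that quantity for lists of inclusive exon coordinates. `aed_sketch` is a hypothetical helper; which features and coordinates GFFtk's `getAED` actually compares is not specified here, so treat this as an illustration of the metric, not of the library's implementation.

```python
# Illustrative sketch of Annotation Edit Distance (AED) on exon coordinates.
# aed_sketch is a hypothetical helper; GFFtk's getAED may differ in detail.

def covered_bases(intervals):
    """Return the set of positions covered by inclusive (start, end) intervals."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end + 1))
    return bases


def aed_sketch(query, reference):
    """0.0 means identical base coverage; 1.0 means no shared bases."""
    q, r = covered_bases(query), covered_bases(reference)
    if not q or not r:
        return 1.0
    shared = len(q & r)
    sensitivity = shared / len(r)  # fraction of reference bases recovered
    specificity = shared / len(q)  # fraction of query bases supported
    return 1.0 - (sensitivity + specificity) / 2.0


identical = aed_sketch([(1, 100)], [(1, 100)])
disjoint = aed_sketch([(1, 100)], [(201, 300)])
half = aed_sketch([(1, 100)], [(51, 150)])
```

The set-based approach is easy to read but stores one entry per base; interval arithmetic would be more appropriate for long features.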

### Annotation Comparison

Compare two genome annotations to identify differences, calculate similarity metrics, and generate detailed comparison reports with feature-level analysis.

```python { .api }
def compareAnnotations(old, new, fasta, output=False):
    """Compare two GFF3 annotations"""

def pairwiseAED(query, reference):
    """Calculate pairwise AED scores"""

def gff2interlap(input, fasta):
    """Convert GFF3 to InterLap structure for overlap analysis"""
```

[Annotation Comparison](./comparison.md)

### GenBank and TBL Format Handling

Complete support for NCBI GenBank and TBL annotation formats with bidirectional conversion, validation, and NCBI submission integration.

```python { .api }
def tbl2dict(inputfile, fasta, annotation=False, table=1, debug=False):
    """Convert NCBI TBL format to annotation dictionary"""

def dict2tbl(annots, seqs, outfile, table=1, debug=False):
    """Convert annotation dictionary to NCBI TBL format"""

def dict2gbff(annots, seqs, outfile, organism=None, circular=False):
    """Convert annotation dictionary to GenBank format"""

def table2asn(sbt, tbl, fasta, out, organism, strain, table=1):
    """Run NCBI table2asn for GenBank submission"""
```

[GenBank and TBL Formats](./genbank-tbl.md)
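
For readers unfamiliar with the TBL layout, the sketch below emits a minimal NCBI feature table for one gene: a `>Feature` header, tab-separated `start stop feature` lines, continuation intervals without a feature key, and qualifiers indented by three tabs. Minus-strand intervals are written with start greater than stop. `tbl_lines_sketch` is a hypothetical helper, not `dict2tbl`, and it omits qualifiers and partial-feature markers that a real submission would need.

```python
# Illustrative sketch of the NCBI 5-column feature table (TBL) layout.
# tbl_lines_sketch is hypothetical; it is not GFFtk's dict2tbl.

def tbl_lines_sketch(contig, locus_tag, strand, exons, product):
    """Return TBL lines for one gene with a single mRNA."""
    # List intervals in 5'->3' order; on the minus strand each pair is flipped
    ordered = sorted(exons, reverse=(strand == "-"))
    if strand == "-":
        ordered = [(end, start) for start, end in ordered]
    gene_start, gene_stop = ordered[0][0], ordered[-1][1]

    lines = [
        f">Feature {contig}",
        f"{gene_start}\t{gene_stop}\tgene",
        f"\t\t\tlocus_tag\t{locus_tag}",
    ]
    for index, (start, stop) in enumerate(ordered):
        # Only the first interval of a feature carries the feature key
        suffix = "\tmRNA" if index == 0 else ""
        lines.append(f"{start}\t{stop}{suffix}")
    lines.append(f"\t\t\tproduct\t{product}")
    return lines


plus = tbl_lines_sketch("chr1", "G1", "+", [(100, 400), (600, 900)], "hypothetical protein")
minus = tbl_lines_sketch("chr1", "G2", "-", [(100, 400), (600, 900)], "hypothetical protein")
```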

### Command Line Interface

Direct command-line access to all GFFtk functions with comprehensive parameter support and batch processing capabilities.

```python { .api }
def convert(args):
    """CLI interface for format conversion operations"""

def consensus(args):
    """CLI interface for consensus gene prediction"""

def compare(args):
    """CLI interface for annotation comparison"""

def stats(args):
    """CLI interface for statistics calculation"""
```

[Command Line Tools](./cli-commands.md)

### File Utilities and Validation

Comprehensive file handling utilities with support for compressed formats, data validation, and annotation statistics calculation.

```python { .api }
def zopen(filename, mode="r", buff=1024*1024, external=True):
    """Open files with automatic compression support"""

def annotation_stats(Genes):
    """Calculate comprehensive annotation statistics"""

def filter_annotations(annotations, grep=None, grepv=None):
    """Filter annotations using regex patterns"""
```

[Utilities and Validation](./utilities.md)
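
The `grep`/`grepv` pair suggests keep-if-matching and drop-if-matching semantics, as with `grep` and `grep -v`. The sketch below shows one way such a filter could work over an annotation dictionary. It assumes each pattern is written as `key:regex`; that pattern syntax, and the helper `filter_sketch` itself, are assumptions for illustration and may not match what `filter_annotations` accepts.

```python
# Illustrative sketch of grep/grepv-style filtering over an annotation dictionary.
# filter_sketch and the "key:regex" pattern syntax are assumptions, not GFFtk's API.
import re


def filter_sketch(annotations, grep=None, grepv=None):
    """Keep records matching every grep pattern and no grepv pattern."""

    def matches(record, pattern):
        key, regex = pattern.split(":", 1)
        value = record.get(key, "")
        values = value if isinstance(value, list) else [value]
        return any(re.search(regex, str(v)) for v in values)

    kept = {}
    for gene_id, record in annotations.items():
        if grep and not all(matches(record, p) for p in grep):
            continue
        if grepv and any(matches(record, p) for p in grepv):
            continue
        kept[gene_id] = record
    return kept


annotations = {
    "g1": {"product": ["protein kinase"], "contig": "chr1"},
    "g2": {"product": ["hypothetical protein"], "contig": "chr1"},
    "g3": {"product": ["serine kinase"], "contig": "chr2"},
}
kinases = filter_sketch(annotations, grep=["product:kinase"])
named = filter_sketch(annotations, grepv=["product:hypothetical"])
chr1_kinases = filter_sketch(annotations, grep=["product:kinase"], grepv=["contig:^chr2$"])
```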

## Types

```python { .api }
# Central annotation dictionary format
AnnotationDict = dict[str, dict]

# Gene annotation structure
GeneAnnotation = {
    "name": str,                          # Gene name/identifier
    "type": list[str],                    # Feature types per transcript
    "transcript": list[str],              # Full transcript sequences
    "cds_transcript": list[str],          # CDS-only sequences
    "protein": list[str],                 # Protein translations
    "5UTR": list[list[tuple[int, int]]],  # 5' UTR coordinates
    "3UTR": list[list[tuple[int, int]]],  # 3' UTR coordinates
    "codon_start": list[int],             # Translation start phase
    "ids": list[str],                     # Transcript IDs
    "CDS": list[list[tuple[int, int]]],   # CDS coordinates
    "mRNA": list[list[tuple[int, int]]],  # mRNA coordinates
    "strand": str,                        # Strand ("+"/"-")
    "gene_synonym": list[str],            # Gene synonyms
    "location": tuple[int, int],          # Gene coordinates
    "contig": str,                        # Contig/chromosome
    "product": list[str],                 # Product descriptions
    "source": str,                        # Annotation source
    "phase": list[str],                   # CDS phase info
    "db_xref": list[list[str]],           # Database cross-refs
    "go_terms": list[list[str]],          # GO terms
    "EC_number": list[list[str]],         # EC numbers
    "note": list[list[str]],              # Notes
    "partialStart": list[bool],           # Partial start flags
    "partialStop": list[bool],            # Partial stop flags
    "pseudo": bool,                       # Pseudogene flag
}
```
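
Several fields above hold one entry per transcript, so they can be read in parallel. The sketch below builds a hand-written record with two transcripts and sums the CDS length of each. It assumes that the per-transcript lists share the same order as `ids` and that coordinates are 1-based and inclusive, as in GFF3. The record values and the helper `cds_lengths` are invented for illustration.

```python
# Illustrative record using a subset of the fields above; values are made up.
gene = {
    "name": "geneA",
    "contig": "chr1",
    "strand": "+",
    "location": (100, 900),
    "ids": ["geneA-T1", "geneA-T2"],
    "type": ["mRNA", "mRNA"],
    "CDS": [[(100, 400), (600, 900)], [(100, 400), (700, 900)]],
    "product": ["hypothetical protein", "hypothetical protein"],
}


def cds_lengths(record):
    """Map each transcript ID to its total CDS length in bases."""
    lengths = {}
    # Assumption: record["ids"] and record["CDS"] are index-aligned
    for transcript_id, segments in zip(record["ids"], record["CDS"]):
        lengths[transcript_id] = sum(end - start + 1 for start, end in segments)
    return lengths


lengths = cds_lengths(gene)
```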