# GenBank and TBL Format Handling

Support for the NCBI GenBank and TBL annotation formats, including bidirectional conversion, validation, and integration with NCBI table2asn for GenBank record generation. These functions provide the core functionality for working with NCBI-compliant annotation files.

## Capabilities

### TBL Format Parsing

Parse NCBI TBL annotation files into the gfftk annotation dictionary format, with support for multiple transcript isoforms and complex gene structures.

```python { .api }
def tbl2dict(inputfile, fasta, annotation=False, table=1, debug=False):
    """
    Convert NCBI TBL format to annotation dictionary.

    Parses NCBI TBL files, which contain gene models in the tab-delimited
    format used for GenBank submission. Handles multiple transcript isoforms
    per gene, partial features, and all annotation qualifiers.

    Parameters:
    - inputfile (str|io.BytesIO): Path to TBL file or file-like object
    - fasta (str): Path to corresponding genome FASTA file
    - annotation (dict|bool): Existing annotation dictionary to update, or False
    - table (int): Genetic code table (1=standard, 11=bacterial)
    - debug (bool): Enable debug output

    Returns:
    dict: Annotation dictionary with gene models
    """
```
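
To make the input format concrete, here is a minimal sketch of reading a feature table by hand. It is not gfftk's implementation: `parse_tbl_text` and the sample record are hypothetical. It covers only the basic TBL layout, namely a `>Feature` header, tab-separated coordinate lines (with `<` or `>` marking partial ends), and qualifier lines indented by three tabs.

```python
def parse_tbl_text(text):
    """Parse a small subset of NCBI TBL into feature dicts (illustrative only)."""
    features = []
    contig = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith(">Feature"):
            # Header line: ">Feature <SeqId> [table name]"
            contig = line.split()[1]
            continue
        cols = line.split("\t")
        if cols[0]:
            # Coordinate line: start, stop, and (on the first interval) a feature key
            start, stop = cols[0], cols[1]
            interval = {
                "start": int(start.lstrip("<>")),
                "stop": int(stop.lstrip("<>")),
                "partial": start[0] in "<>" or stop[0] in "<>",
            }
            if len(cols) > 2 and cols[2]:
                features.append({"contig": contig, "type": cols[2],
                                 "intervals": [interval], "qualifiers": {}})
            else:
                # Continuation interval of the previous feature (e.g. the next exon)
                features[-1]["intervals"].append(interval)
        elif len(cols) >= 5:
            # Qualifier line: three empty columns, then key and value.
            # Value-less qualifiers are ignored in this sketch.
            features[-1]["qualifiers"].setdefault(cols[3], []).append(cols[4])
    return features


# Hypothetical two-feature record used only for demonstration
sample = (
    ">Feature contig_1\n"
    "<1\t300\tgene\n"
    "\t\t\tlocus_tag\tABC_0001\n"
    "<1\t100\tCDS\n"
    "200\t300\n"
    "\t\t\tproduct\thypothetical protein\n"
)
for feat in parse_tbl_text(sample):
    print(feat["type"], feat["intervals"], feat["qualifiers"])
```

`tbl2dict` additionally groups features into gene models and uses the FASTA file and genetic code table, which this sketch does not attempt.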

### TBL Format Writing

Convert an annotation dictionary to NCBI TBL format with proper formatting and validation for GenBank submission compatibility.

```python { .api }
def dict2tbl(annots, seqs, outfile, table=1, debug=False):
    """
    Convert annotation dictionary to NCBI TBL format.

    Writes annotations in NCBI TBL format suitable for GenBank submission
    via table2asn. Handles complex gene structures, multiple isoforms,
    and all annotation qualifiers with proper formatting.

    Parameters:
    - annots (dict): Annotation dictionary
    - seqs (dict): Sequence dictionary from FASTA
    - outfile (str): Output TBL file path
    - table (int): Genetic code table (1=standard, 11=bacterial)
    - debug (bool): Enable debug output

    Returns:
    None
    """
```
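
As a rough illustration of the layout a TBL writer has to produce, the sketch below renders a single feature. `format_tbl_feature` is a hypothetical helper, not part of gfftk. It assumes 1-based coordinates and follows the TBL convention that minus-strand intervals are written with the start greater than the stop, in transcription order.

```python
def format_tbl_feature(feature_type, intervals, strand, qualifiers):
    """Render one feature as TBL lines (illustrative helper, not gfftk's writer).

    intervals: list of (start, end) tuples, 1-based, with start <= end.
    """
    # Put intervals in transcription order; minus-strand features are
    # listed from the highest coordinate down.
    ordered = sorted(intervals, reverse=(strand == "-"))
    lines = []
    for n, (start, end) in enumerate(ordered):
        # Swap start and stop on the minus strand
        left, right = (end, start) if strand == "-" else (start, end)
        # Only the first interval carries the feature key
        key = f"\t{feature_type}" if n == 0 else ""
        lines.append(f"{left}\t{right}{key}")
    for name, value in qualifiers.items():
        # Qualifiers are indented by three tabs
        lines.append(f"\t\t\t{name}\t{value}")
    return "\n".join(lines)


print(">Feature contig_1")
print(format_tbl_feature("CDS", [(1, 100), (200, 300)], "-",
                         {"product": "hypothetical protein"}))
```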

### GenBank Format Generation

Generate GenBank format files directly from an annotation dictionary, with organism metadata and formatting options.

```python { .api }
def dict2gbff(annots, seqs, outfile, organism=None, circular=False, lowercase=False):
    """
    Convert annotation dictionary to GenBank format.

    Generates GenBank flat file format (.gbff) with complete annotation
    information, sequence data, and proper GenBank formatting. Includes
    organism metadata and circular DNA support.

    Parameters:
    - annots (dict): Annotation dictionary
    - seqs (dict): Sequence dictionary from FASTA
    - outfile (str): Output GenBank file path
    - organism (str|None): Organism name for ORGANISM field
    - circular (bool): Mark sequences as circular DNA
    - lowercase (bool): Output sequence in lowercase

    Returns:
    None
    """
```
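
GenBank flat files describe feature positions with location strings such as `complement(join(1..100,200..300))`. The following sketch shows how such a string can be assembled from exon coordinates. `genbank_location` is a hypothetical helper written for this example, and how `dict2gbff` builds locations internally is not specified here.

```python
def genbank_location(intervals, strand, partial_start=False, partial_end=False):
    """Build an INSDC-style location string (illustrative helper).

    intervals: list of (start, end) tuples, 1-based, with start <= end.
    partial_start / partial_end refer to the lowest and highest coordinate.
    """
    ordered = sorted(intervals)
    last = len(ordered) - 1
    parts = []
    for n, (start, end) in enumerate(ordered):
        # "<" and ">" flag ends that extend beyond the known sequence
        lo = f"<{start}" if partial_start and n == 0 else str(start)
        hi = f">{end}" if partial_end and n == last else str(end)
        parts.append(f"{lo}..{hi}")
    # Multi-interval features are wrapped in join(...)
    location = parts[0] if len(parts) == 1 else "join(" + ",".join(parts) + ")"
    # Minus-strand features are wrapped in complement(...)
    return f"complement({location})" if strand == "-" else location


print(genbank_location([(200, 300), (1, 100)], "-"))
print(genbank_location([(5, 50)], "+", partial_start=True))
```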

### NCBI table2asn Integration

Interface with NCBI's table2asn tool for generating GenBank submission files from TBL and FASTA inputs.

```python { .api }
def table2asn(sbt, tbl, fasta, out, organism, strain, table=1):
    """
    Run NCBI table2asn to generate GenBank files.

    Executes the NCBI table2asn tool to convert TBL annotation files and
    FASTA sequences into GenBank submission format. Requires table2asn
    to be installed and available in PATH.

    Parameters:
    - sbt (str): Path to submission template (.sbt) file
    - tbl (str): Path to TBL annotation file
    - fasta (str): Path to genome FASTA file
    - out (str): Output directory path
    - organism (str): Organism name
    - strain (str): Strain identifier
    - table (int): Genetic code table (1=standard, 11=bacterial)

    Returns:
    None
    """
```
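
For orientation, this is roughly what an equivalent direct call to the command-line tool could look like. The sketch only assembles and prints the command. `build_table2asn_cmd` is a hypothetical helper, the choice of flags is an assumption, and the wrapper above may invoke table2asn differently (for example, it takes an output directory, whereas this sketch names a single output file).

```python
import shlex


def build_table2asn_cmd(sbt, tbl, fasta, out_sqn, organism, strain, table=1):
    """Assemble a table2asn command line (illustrative; gfftk may use other flags)."""
    # Source qualifiers are passed as bracketed key=value pairs
    source = f"[organism={organism}] [strain={strain}] [gcode={table}]"
    return [
        "table2asn",
        "-t", sbt,       # submission template
        "-i", fasta,     # nucleotide FASTA
        "-f", tbl,       # five-column feature table
        "-o", out_sqn,   # ASN.1 (.sqn) output file
        "-j", source,    # source qualifiers applied to the records
        "-V", "b",       # also write a GenBank flat file
    ]


cmd = build_table2asn_cmd("template.sbt", "submission.tbl", "genome.fasta",
                          "genome.sqn", "Escherichia coli", "K-12", table=11)
# Print the command instead of running it; pass `cmd` to subprocess.run(...)
# once table2asn is installed and the input files exist.
print(shlex.join(cmd))
```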

### Submission Template Generation

Generate the NCBI submission template files required for table2asn processing.

```python { .api }
def sbt_writer(out):
    """
    Generate NCBI submission template (.sbt) file.

    Creates a basic submission template file required by table2asn
    for GenBank submission processing. Template contains minimal
    required metadata fields.

    Parameters:
    - out (str): Output path for .sbt file

    Returns:
    None
    """
```

### Coordinate Manipulation

Utilities for working with genomic coordinates in TBL format annotations.

```python { .api }
def fetch_coords(v, i=0, feature="gene"):
    """
    Extract genomic coordinates from annotation data.

    Parses coordinate information from various annotation formats
    and returns standardized coordinate tuples. Handles partial
    features and strand information.

    Parameters:
    - v (list): Coordinate data structure
    - i (int): Index for transcript/feature selection
    - feature (str): Feature type ("gene", "mRNA", "CDS")

    Returns:
    tuple: (start, end) coordinates
    """

def duplicate_coords(cds):
    """
    Identify duplicate CDS coordinates.

    Scans CDS coordinate lists to identify duplicate exons
    or coordinate ranges that may indicate annotation errors
    or alternative splicing variants.

    Parameters:
    - cds (list): List of CDS coordinate tuples

    Returns:
    list: Indices of duplicate coordinate sets
    """

def drop_alt_coords(info, idxs):
    """
    Remove alternative coordinate sets from annotation.

    Removes specified coordinate sets from annotation data
    structure, typically used to clean up alternative
    splicing variants or duplicate annotations.

    Parameters:
    - info (dict): Annotation information dictionary
    - idxs (list): Indices of coordinate sets to remove

    Returns:
    dict: Updated annotation dictionary
    """
```
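
The idea behind detecting and dropping duplicate coordinate sets can be sketched as follows. This is an illustration under assumptions, not gfftk's code: the helper names are hypothetical, and the input is assumed to be a list with one entry per transcript, each entry being a list of `(start, end)` tuples. The structures gfftk actually passes to `duplicate_coords` and `drop_alt_coords` may differ.

```python
def find_duplicate_coord_sets(cds):
    """Return indices of coordinate sets that repeat an earlier one (illustrative).

    cds: list with one entry per transcript, each a list of (start, end) tuples.
    """
    seen = set()
    duplicates = []
    for idx, coords in enumerate(cds):
        # Sort so that the same exons in a different order still match
        key = tuple(sorted(coords))
        if key in seen:
            duplicates.append(idx)
        else:
            seen.add(key)
    return duplicates


def drop_indices(items, idxs):
    """Return a copy of items without the positions listed in idxs."""
    skip = set(idxs)
    return [item for n, item in enumerate(items) if n not in skip]


# Hypothetical CDS sets for three transcripts; the third repeats the first
cds_sets = [
    [(1, 100), (200, 300)],
    [(1, 100), (250, 300)],
    [(200, 300), (1, 100)],
]
dups = find_duplicate_coord_sets(cds_sets)
print(dups)
print(drop_indices(cds_sets, dups))
```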

### UTR Processing

Specialized functions for UTR (untranslated region) identification and processing.

```python { .api }
def findUTRs(cds, mrna, strand):
    """
    Identify UTR regions from CDS and mRNA coordinates.

    Calculates 5' and 3' UTR regions by comparing CDS coordinates
    with mRNA boundaries. Handles strand orientation and returns
    coordinate tuples for UTR regions.

    Parameters:
    - cds (list): List of CDS coordinate tuples
    - mrna (list): List of mRNA coordinate tuples
    - strand (str): Strand orientation ("+"/"-")

    Returns:
    tuple: (five_utr_coords, three_utr_coords) as coordinate lists
    """
```
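
The comparison described in the docstring can be sketched in a few lines. The function below is a hypothetical stand-in, not gfftk's `findUTRs`. It assumes 1-based inclusive `(start, end)` tuples with `start <= end`, and its handling of edge cases may differ from the library's.

```python
def sketch_find_utrs(cds, mrna, strand):
    """Derive 5' and 3' UTR intervals (illustrative, not gfftk's findUTRs).

    cds, mrna: lists of (start, end) tuples, 1-based inclusive, start <= end.
    """
    cds_min = min(start for start, _ in cds)
    cds_max = max(end for _, end in cds)
    upstream, downstream = [], []
    for start, end in sorted(mrna):
        if start < cds_min:
            # Exon portion to the left of the coding region
            upstream.append((start, min(end, cds_min - 1)))
        if end > cds_max:
            # Exon portion to the right of the coding region
            downstream.append((max(start, cds_max + 1), end))
    # On the minus strand the genomic right side is the transcript's 5' end
    if strand == "-":
        return downstream, upstream
    return upstream, downstream


exons = [(1, 150), (200, 400)]
coding = [(50, 150), (200, 300)]
print(sketch_find_utrs(coding, exons, "+"))
print(sketch_find_utrs(coding, exons, "-"))
```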

### GO Term Processing

Handle Gene Ontology term formatting for GenBank submissions.

```python { .api }
def reformatGO(term, goDict={}):
    """
    Reformat GO terms for GenBank submission.

    Converts GO terms to proper format for GenBank annotation
    files, handling term descriptions and maintaining consistency
    with NCBI requirements.

    Parameters:
    - term (str): GO term identifier (e.g., "GO:0008150")
    - goDict (dict): GO term dictionary for lookups

    Returns:
    str: Reformatted GO term description
    """
```

## Usage Examples

### Converting TBL to Annotation Dictionary

```python
from gfftk.genbank import tbl2dict
from gfftk.fasta import fasta2dict

# Load sequences and parse TBL file
sequences = fasta2dict("genome.fasta")
annotations = tbl2dict("annotation.tbl", "genome.fasta")

# Access parsed data
for gene_id, gene_data in annotations.items():
    print(f"Gene: {gene_id}")
    print(f"Products: {gene_data['product']}")
    print(f"Location: {gene_data['location']}")
```

### Generating GenBank Files

```python
from gfftk.genbank import dict2gbff, dict2tbl
from gfftk.fasta import fasta2dict
from gfftk.gff import gff2dict

# Parse GFF3 annotation
sequences = fasta2dict("genome.fasta")
annotations = gff2dict("annotation.gff3", "genome.fasta")

# Generate GenBank format
dict2gbff(
    annotations,
    sequences,
    "output.gbff",
    organism="Escherichia coli",
    circular=True
)

# Generate TBL format for NCBI submission
dict2tbl(annotations, sequences, "annotation.tbl")
```

### NCBI Submission Workflow

```python
from gfftk.genbank import sbt_writer, table2asn, dict2tbl
from gfftk.fasta import fasta2dict
from gfftk.gff import gff2dict

# Prepare annotation data
sequences = fasta2dict("genome.fasta")
annotations = gff2dict("annotation.gff3", "genome.fasta")

# Generate TBL file
dict2tbl(annotations, sequences, "submission.tbl")

# Create submission template
sbt_writer("template.sbt")

# Run table2asn (requires table2asn installation)
table2asn(
    "template.sbt",
    "submission.tbl",
    "genome.fasta",
    "output_dir",
    "Escherichia coli",
    "K-12",
    table=11  # Bacterial genetic code
)
```