Comprehensive Python toolkit for working with genome annotation files in GFF3, GTF, and TBL formats with format conversion and analysis capabilities
Comprehensive support for NCBI GenBank and TBL annotation formats including bidirectional conversion, validation, and integration with NCBI table2asn for GenBank record generation. These functions provide the core functionality for working with NCBI-compliant annotation files.
Parse NCBI TBL annotation files into the gfftk annotation dictionary format with support for multiple transcript isoforms and complex gene structures.
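
For orientation, the TBL layout itself can be illustrated with a small standalone reader. This is a minimal sketch, not gfftk's parser: `parse_tbl_text` and the sample record are hypothetical, and only the basic layout is handled (a `>Feature` header, tab-separated interval lines that may carry `<` or `>` partial markers, and qualifier lines indented by three tabs).

```python
def parse_tbl_text(text):
    """Parse a minimal NCBI feature table into a list of feature dicts (illustrative only)."""
    features = []
    seqid = None
    for raw in text.splitlines():
        if not raw.strip():
            continue
        if raw.startswith(">Feature"):
            # Header line: ">Feature <seqid>"
            seqid = raw.split()[1]
            continue
        cols = raw.split("\t")
        if cols[0]:
            # Interval line: start, end, and (on the first interval) a feature key
            start, end = cols[0], cols[1]
            interval = {
                "start": int(start.lstrip("<>")),
                "end": int(end.lstrip("<>")),
                "partial": start.startswith(("<", ">")) or end.startswith(("<", ">")),
            }
            if len(cols) > 2 and cols[2]:
                features.append(
                    {"seqid": seqid, "type": cols[2], "intervals": [interval], "qualifiers": {}}
                )
            else:
                # No feature key: an additional interval of the previous feature
                features[-1]["intervals"].append(interval)
        elif len(cols) >= 5:
            # Qualifier line: three empty columns, then key and value
            features[-1]["qualifiers"].setdefault(cols[3], []).append(cols[4])
    return features


# Hypothetical two-feature record used only for this illustration
sample = (
    ">Feature contig1\n"
    "<1\t300\tgene\n"
    "\t\t\tlocus_tag\tABC_0001\n"
    "<1\t100\tCDS\n"
    "200\t300\n"
    "\t\t\tproduct\thypothetical protein\n"
)
features = parse_tbl_text(sample)
for feat in features:
    print(feat["type"], feat["intervals"], feat["qualifiers"])
```

The real `tbl2dict` additionally needs the genome FASTA and returns gfftk's gene-keyed annotation dictionary rather than a flat feature list.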
def tbl2dict(inputfile, fasta, annotation=False, table=1, debug=False):
    """
    Convert NCBI TBL format to annotation dictionary.

    Parses NCBI TBL files, which contain gene models in the tab-delimited
    format used for GenBank submission. Handles multiple transcript isoforms
    per gene, partial features, and all annotation qualifiers.

    Parameters:
    - inputfile (str|io.BytesIO): Path to TBL file or file-like object
    - fasta (str): Path to corresponding genome FASTA file
    - annotation (dict|bool): Existing annotation dictionary to update, or False
    - table (int): Genetic code table (1=standard, 11=bacterial)
    - debug (bool): Enable debug output

    Returns:
    dict: Annotation dictionary with gene models
    """

Convert annotation dictionary to NCBI TBL format with proper formatting and validation for GenBank submission compatibility.
def dict2tbl(annots, seqs, outfile, table=1, debug=False):
    """
    Convert annotation dictionary to NCBI TBL format.

    Writes annotations in NCBI TBL format suitable for GenBank submission
    via table2asn. Handles complex gene structures, multiple isoforms,
    and all annotation qualifiers with proper formatting.

    Parameters:
    - annots (dict): Annotation dictionary
    - seqs (dict): Sequence dictionary from FASTA
    - outfile (str): Output TBL file path
    - table (int): Genetic code table (1=standard, 11=bacterial)
    - debug (bool): Enable debug output

    Returns:
    None
    """

Generate GenBank format files directly from annotation dictionary with organism metadata and formatting options.
def dict2gbff(annots, seqs, outfile, organism=None, circular=False, lowercase=False):
    """
    Convert annotation dictionary to GenBank format.

    Generates GenBank flat file format (.gbff) with complete annotation
    information, sequence data, and proper GenBank formatting. Includes
    organism metadata and circular DNA support.

    Parameters:
    - annots (dict): Annotation dictionary
    - seqs (dict): Sequence dictionary from FASTA
    - outfile (str): Output GenBank file path
    - organism (str|None): Organism name for ORGANISM field
    - circular (bool): Mark sequences as circular DNA
    - lowercase (bool): Output sequence in lowercase

    Returns:
    None
    """

Interface with NCBI's table2asn tool for generating GenBank submission files from TBL and FASTA inputs.
def table2asn(sbt, tbl, fasta, out, organism, strain, table=1):
    """
    Run NCBI table2asn to generate GenBank files.

    Executes NCBI table2asn tool to convert TBL annotation files and
    FASTA sequences into GenBank submission format. Requires table2asn
    to be installed and available in PATH.

    Parameters:
    - sbt (str): Path to submission template (.sbt) file
    - tbl (str): Path to TBL annotation file
    - fasta (str): Path to genome FASTA file
    - out (str): Output directory path
    - organism (str): Organism name
    - strain (str): Strain identifier
    - table (int): Genetic code table (1=standard, 11=bacterial)

    Returns:
    None
    """

Generate NCBI submission template files required for table2asn processing.
def sbt_writer(out):
    """
    Generate NCBI submission template (.sbt) file.

    Creates a basic submission template file required by table2asn
    for GenBank submission processing. Template contains minimal
    required metadata fields.

    Parameters:
    - out (str): Output path for .sbt file

    Returns:
    None
    """

Utilities for working with genomic coordinates in TBL format annotations.
def fetch_coords(v, i=0, feature="gene"):
    """
    Extract genomic coordinates from annotation data.

    Parses coordinate information from various annotation formats
    and returns standardized coordinate tuples. Handles partial
    features and strand information.

    Parameters:
    - v (list): Coordinate data structure
    - i (int): Index for transcript/feature selection
    - feature (str): Feature type ("gene", "mRNA", "CDS")

    Returns:
    tuple: (start, end) coordinates
    """
def duplicate_coords(cds):
    """
    Identify duplicate CDS coordinates.

    Scans CDS coordinate lists to identify duplicate exons
    or coordinate ranges that may indicate annotation errors
    or alternative splicing variants.

    Parameters:
    - cds (list): List of CDS coordinate tuples

    Returns:
    list: Indices of duplicate coordinate sets
    """
def drop_alt_coords(info, idxs):
    """
    Remove alternative coordinate sets from annotation.

    Removes specified coordinate sets from annotation data
    structure, typically used to clean up alternative
    splicing variants or duplicate annotations.

    Parameters:
    - info (dict): Annotation information dictionary
    - idxs (list): Indices of coordinate sets to remove

    Returns:
    dict: Updated annotation dictionary
    """

Specialized functions for UTR (Untranslated Region) identification and processing.
def findUTRs(cds, mrna, strand):
    """
    Identify UTR regions from CDS and mRNA coordinates.

    Calculates 5' and 3' UTR regions by comparing CDS coordinates
    with mRNA boundaries. Handles strand orientation and returns
    coordinate tuples for UTR regions.

    Parameters:
    - cds (list): List of CDS coordinate tuples
    - mrna (list): List of mRNA coordinate tuples
    - strand (str): Strand orientation ("+"/"-")

    Returns:
    tuple: (five_utr_coords, three_utr_coords) as coordinate lists
    """

Handle Gene Ontology term formatting for GenBank submissions.
def reformatGO(term, goDict={}):
    """
    Reformat GO terms for GenBank submission.

    Converts GO terms to proper format for GenBank annotation
    files, handling term descriptions and maintaining consistency
    with NCBI requirements.

    Parameters:
    - term (str): GO term identifier (e.g., "GO:0008150")
    - goDict (dict): GO term dictionary for lookups

    Returns:
    str: Reformatted GO term description
    """

from gfftk.genbank import tbl2dict
from gfftk.fasta import fasta2dict
# Load sequences and parse TBL file
sequences = fasta2dict("genome.fasta")
annotations = tbl2dict("annotation.tbl", "genome.fasta")
# Access parsed data
for gene_id, gene_data in annotations.items():
    print(f"Gene: {gene_id}")
    print(f"Products: {gene_data['product']}")
    print(f"Location: {gene_data['location']}")

from gfftk.genbank import dict2gbff, dict2tbl
from gfftk.fasta import fasta2dict
from gfftk.gff import gff2dict
# Parse GFF3 annotation
sequences = fasta2dict("genome.fasta")
annotations = gff2dict("annotation.gff3", "genome.fasta")
# Generate GenBank format
dict2gbff(
    annotations,
    sequences,
    "output.gbff",
    organism="Escherichia coli",
    circular=True
)
# Generate TBL format for NCBI submission
dict2tbl(annotations, sequences, "annotation.tbl")

from gfftk.genbank import sbt_writer, table2asn, dict2tbl
from gfftk.fasta import fasta2dict
from gfftk.gff import gff2dict
# Prepare annotation data
sequences = fasta2dict("genome.fasta")
annotations = gff2dict("annotation.gff3", "genome.fasta")
# Generate TBL file
dict2tbl(annotations, sequences, "submission.tbl")
# Create submission template
sbt_writer("template.sbt")
# Run table2asn (requires table2asn installation)
table2asn(
    "template.sbt",
    "submission.tbl",
    "genome.fasta",
    "output_dir",
    "Escherichia coli",
    "K-12",
    table=11  # Bacterial genetic code
)

Install with Tessl CLI
npx tessl i tessl/pypi-gfftk