tessl/pypi-gfftk

Comprehensive Python toolkit for working with genome annotation files in GFF3, GTF, and TBL formats with format conversion and analysis capabilities

GenBank and TBL Format Handling

Name: tessl/pypi-gfftk
Author: tessl

Support for the NCBI GenBank and TBL annotation formats, covering bidirectional TBL conversion, validation, GenBank flat file output, and integration with NCBI table2asn for GenBank record generation. The functions below are the entry points for working with NCBI-compliant annotation files.

Capabilities

TBL Format Parsing

Parse NCBI TBL annotation files into the gfftk annotation dictionary format with support for multiple transcript isoforms and complex gene structures.

def tbl2dict(inputfile, fasta, annotation=False, table=1, debug=False):
    """
    Convert NCBI TBL format to annotation dictionary.

    Parses NCBI TBL files, the tab-delimited feature table format used for
    GenBank submissions. Handles multiple transcript isoforms per gene,
    partial features, and annotation qualifiers.

    Parameters:
    - inputfile (str|io.BytesIO): Path to TBL file or file-like object
    - fasta (str): Path to corresponding genome FASTA file
    - annotation (dict|bool): Existing annotation dictionary to update, or False
    - table (int): Genetic code table (1=standard, 11=bacterial)
    - debug (bool): Enable debug output

    Returns:
    dict: Annotation dictionary with gene models
    """

TBL Format Writing

Convert an annotation dictionary to NCBI TBL format, formatted and validated for GenBank submission.

def dict2tbl(annots, seqs, outfile, table=1, debug=False):
    """
    Convert annotation dictionary to NCBI TBL format.

    Writes annotations in NCBI TBL format suitable for GenBank submission
    via table2asn. Handles complex gene structures, multiple isoforms,
    and all annotation qualifiers with proper formatting.

    Parameters:
    - annots (dict): Annotation dictionary
    - seqs (dict): Sequence dictionary from FASTA
    - outfile (str): Output TBL file path
    - table (int): Genetic code table (1=standard, 11=bacterial)
    - debug (bool): Enable debug output

    Returns:
    None
    """

GenBank Format Generation

Generate GenBank format files directly from an annotation dictionary, with organism metadata and formatting options.

def dict2gbff(annots, seqs, outfile, organism=None, circular=False, lowercase=False):
    """
    Convert annotation dictionary to GenBank format.

    Generates GenBank flat file format (.gbff) with complete annotation
    information, sequence data, and proper GenBank formatting. Includes
    organism metadata and circular DNA support.

    Parameters:
    - annots (dict): Annotation dictionary
    - seqs (dict): Sequence dictionary from FASTA
    - outfile (str): Output GenBank file path
    - organism (str|None): Organism name for ORGANISM field
    - circular (bool): Mark sequences as circular DNA
    - lowercase (bool): Output sequence in lowercase

    Returns:
    None
    """

NCBI table2asn Integration

Interface with NCBI's table2asn tool for generating GenBank submission files from TBL and FASTA inputs.

def table2asn(sbt, tbl, fasta, out, organism, strain, table=1):
    """
    Run NCBI table2asn to generate GenBank files.

    Executes NCBI table2asn tool to convert TBL annotation files and
    FASTA sequences into GenBank submission format. Requires table2asn
    to be installed and available in PATH.

    Parameters:
    - sbt (str): Path to submission template (.sbt) file
    - tbl (str): Path to TBL annotation file
    - fasta (str): Path to genome FASTA file
    - out (str): Output directory path
    - organism (str): Organism name
    - strain (str): Strain identifier
    - table (int): Genetic code table (1=standard, 11=bacterial)

    Returns:
    None
    """

Submission Template Generation

Generate NCBI submission template files required for table2asn processing.

def sbt_writer(out):
    """
    Generate NCBI submission template (.sbt) file.

    Creates a basic submission template file required by table2asn
    for GenBank submission processing. Template contains minimal
    required metadata fields.

    Parameters:
    - out (str): Output path for .sbt file

    Returns:
    None
    """

Coordinate Manipulation

Utilities for working with genomic coordinates in TBL format annotations.

def fetch_coords(v, i=0, feature="gene"):
    """
    Extract genomic coordinates from annotation data.

    Parses coordinate information from various annotation formats
    and returns standardized coordinate tuples. Handles partial
    features and strand information.

    Parameters:
    - v (list): Coordinate data structure
    - i (int): Index for transcript/feature selection
    - feature (str): Feature type ("gene", "mRNA", "CDS")

    Returns:
    tuple: (start, end) coordinates
    """

def duplicate_coords(cds):
    """
    Identify duplicate CDS coordinates.

    Scans CDS coordinate lists to identify duplicate exons
    or coordinate ranges that may indicate annotation errors
    or alternative splicing variants.

    Parameters:
    - cds (list): List of CDS coordinate tuples

    Returns:
    list: Indices of duplicate coordinate sets
    """

def drop_alt_coords(info, idxs):
    """
    Remove alternative coordinate sets from annotation.

    Removes specified coordinate sets from annotation data
    structure, typically used to clean up alternative
    splicing variants or duplicate annotations.

    Parameters:
    - info (dict): Annotation information dictionary
    - idxs (list): Indices of coordinate sets to remove

    Returns:
    dict: Updated annotation dictionary
    """

UTR Processing

Specialized functions for UTR (Untranslated Region) identification and processing.

def findUTRs(cds, mrna, strand):
    """
    Identify UTR regions from CDS and mRNA coordinates.

    Calculates 5' and 3' UTR regions by comparing CDS coordinates
    with mRNA boundaries. Handles strand orientation and returns
    coordinate tuples for UTR regions.

    Parameters:
    - cds (list): List of CDS coordinate tuples
    - mrna (list): List of mRNA coordinate tuples
    - strand (str): Strand orientation ("+"/"-")

    Returns:
    tuple: (five_utr_coords, three_utr_coords) as coordinate lists
    """

GO Term Processing

Handle Gene Ontology term formatting for GenBank submissions.

def reformatGO(term, goDict={}):
    """
    Reformat GO terms for GenBank submission.

    Converts GO terms to proper format for GenBank annotation
    files, handling term descriptions and maintaining consistency
    with NCBI requirements.

    Parameters:
    - term (str): GO term identifier (e.g., "GO:0008150")
    - goDict (dict): GO term dictionary for lookups

    Returns:
    str: Reformatted GO term description
    """

Usage Examples

Converting TBL to Annotation Dictionary

from gfftk.genbank import tbl2dict

# Parse the TBL file; tbl2dict takes the genome FASTA as a path
annotations = tbl2dict("annotation.tbl", "genome.fasta")

# Access parsed data
for gene_id, gene_data in annotations.items():
    print(f"Gene: {gene_id}")
    print(f"Products: {gene_data['product']}")
    print(f"Location: {gene_data['location']}")

Generating GenBank Files

from gfftk.genbank import dict2gbff, dict2tbl
from gfftk.fasta import fasta2dict
from gfftk.gff import gff2dict

# Parse GFF3 annotation
sequences = fasta2dict("genome.fasta")
annotations = gff2dict("annotation.gff3", "genome.fasta")

# Generate GenBank format
dict2gbff(
    annotations,
    sequences,
    "output.gbff",
    organism="Escherichia coli",
    circular=True
)

# Generate TBL format for NCBI submission
dict2tbl(annotations, sequences, "annotation.tbl")

NCBI Submission Workflow

from gfftk.genbank import sbt_writer, table2asn, dict2tbl
from gfftk.fasta import fasta2dict
from gfftk.gff import gff2dict

# Prepare annotation data
sequences = fasta2dict("genome.fasta")
annotations = gff2dict("annotation.gff3", "genome.fasta")

# Generate TBL file
dict2tbl(annotations, sequences, "submission.tbl")

# Create submission template
sbt_writer("template.sbt")

# Run table2asn (requires table2asn installation)
table2asn(
    "template.sbt",
    "submission.tbl",
    "genome.fasta",
    "output_dir",
    "Escherichia coli",
    "K-12",
    table=11  # Bacterial genetic code
)
