tessl/pypi-ncbi-genome-download

Download genome files from the NCBI FTP server.

Workspace: tessl
Visibility: Public
Describes: pkg:pypi/ncbi-genome-download@0.3.x

To install, run

npx @tessl/cli install tessl/pypi-ncbi-genome-download@0.3.0


NCBI Genome Download

A Python command-line tool and library for downloading genome files from the NCBI FTP servers for bacteria, fungi, viruses, and the other taxonomic groups listed under Supported Options. Downloads can be filtered by taxonomic group, assembly level, RefSeq category, genus, strain, species taxonomy ID, taxonomy ID, and assembly accession. Parallel downloads and multiple output formats are supported.

Package Information

  • Package Name: ncbi-genome-download
  • Language: Python
  • Installation: pip install ncbi-genome-download

Core Imports

import ncbi_genome_download

For programmatic access:

from ncbi_genome_download import download, NgdConfig, SUPPORTED_TAXONOMIC_GROUPS

For advanced usage:

from ncbi_genome_download import args_download, argument_parser, config_download

Basic Usage

Command Line

# Download all bacterial genomes in GenBank format
ncbi-genome-download bacteria

# Download complete genomes in FASTA format for specific genera
ncbi-genome-download bacteria --assembly-levels complete --formats fasta --genera "Escherichia,Salmonella"

# Short alias is also available
ngd archaea --formats fasta

Programmatic Interface

from ncbi_genome_download import NgdConfig, config_download, download

# Basic download; parameters that are not passed keep their defaults
retcode = download(
    groups=['bacteria'],
    file_formats=['genbank'],
    assembly_levels=['complete']
)

# Advanced configuration using NgdConfig
config = NgdConfig()
config.groups = ['bacteria', 'archaea']
config.file_formats = ['fasta', 'genbank']
config.assembly_levels = ['complete', 'chromosome']
config.genera = ['Escherichia', 'Bacillus']
config.output = '/path/to/output'
config.parallel = 4

# Run the download with the configured object; returns 0 on success
retcode = config_download(config)

Architecture

The ncbi-genome-download package is split into modules, each with a distinct responsibility:

  • Core Module (core.py): Contains the main download logic, file processing, and worker functions
  • Configuration Module (config.py): Manages all configuration options, validation, and default values
  • Summary Module (summary.py): Handles parsing of NCBI assembly summary files
  • Jobs Module (jobs.py): Defines download job data structures for parallel processing
  • Metadata Module (metadata.py): Tracks and exports metadata about downloaded files
  • Command Line Interface (__main__.py): Provides the CLI entry point with argument parsing

This separation lets the same download, filtering, and parallel-processing logic serve both one-off command-line runs and programmatic workflows.

Capabilities

Main Download Function

Downloads genome files from NCBI FTP servers with flexible filtering and configuration options.

def download(**kwargs):
    """
    Download data from NCBI using parameters passed as kwargs.
    
    Parameters:
    - groups: list or str, taxonomic groups to download (default: 'all')
    - section: str, NCBI section ('refseq' or 'genbank', default: 'refseq')
    - file_formats: list or str, formats to download (default: 'genbank')
    - assembly_levels: list or str, assembly levels (default: 'all')
    - genera: list or str, genera filter (default: [])
    - strains: list or str, strains filter (default: [])
    - species_taxids: list or str, species taxonomy IDs (default: [])
    - taxids: list or str, taxonomy IDs (default: [])
    - assembly_accessions: list or str, assembly accessions (default: [])
    - refseq_categories: list or str, RefSeq categories (default: 'all')
    - output: str, output directory (default: current directory)
    - parallel: int, number of parallel downloads (default: 1)
    - dry_run: bool, only show what would be downloaded (default: False)
    - progress_bar: bool, show progress bar (default: False)
    - metadata_table: str, path to save metadata table (default: None)
    - human_readable: bool, create human-readable directory structure (default: False)
    - flat_output: bool, dump files without subdirectories (default: False)
    - uri: str, NCBI base URI (default: 'https://ftp.ncbi.nih.gov/genomes')
    - use_cache: bool, use cached summary files (default: False)
    - fuzzy_genus: bool, use fuzzy search for genus names (default: False)
    - fuzzy_accessions: bool, use fuzzy search for accessions (default: False)
    - type_materials: list or str, relation to type material (default: 'any')
    
    Returns:
    int: Success code (0 for success, non-zero for error)
    """
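
Many of the parameters above are documented as "list or str". The command-line example earlier passes genera as a comma-separated string, so the sketch below assumes string values follow that same convention. `as_list` is a hypothetical caller-side helper for normalizing your own inputs before building keyword arguments; it is not part of the package and does not claim to reproduce its internal parsing.

```python
def as_list(value):
    """Normalize a list or comma-separated string into a list of strings.

    Hypothetical caller-side helper; not part of ncbi-genome-download.
    """
    if value is None:
        return []
    if isinstance(value, str):
        items = value.split(",")
    else:
        items = list(value)
    # Drop surrounding whitespace and empty entries such as a trailing comma.
    return [item.strip() for item in items if item.strip()]


print(as_list("Escherichia,Salmonella"))
print(as_list(["complete", "chromosome"]))
```

Normalizing on the caller side keeps logging and validation code simple, since every filter can then be treated as a list regardless of how it was supplied.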

Arguments-based Download Function

Downloads using parsed command-line arguments or similar namespace objects.

def args_download(args):
    """
    Download data from NCBI using parameters from argparse Namespace.
    
    Parameters:
    - args: argparse.Namespace, parsed command-line arguments
    
    Returns:
    int: Success code (0 for success, non-zero for error)
    """

Argument Parser Creation

Creates the command-line argument parser for the tool.

def argument_parser(version=None):
    """
    Create the argument parser for ncbi-genome-download.
    
    Parameters:
    - version: str, optional version string for --version flag
    
    Returns:
    argparse.ArgumentParser: Configured argument parser
    """

Configuration-based Download Function

Lower-level download function that takes a configuration object directly.

def config_download(config):
    """
    Run the actual download from NCBI with parameters in a config object.
    
    Parameters:
    - config: NgdConfig, configuration object with download settings
    
    Returns:
    int: Success code (0 for success, non-zero for error)
    """

Configuration Management

Complete configuration object for fine-grained control over download parameters.

class NgdConfig:
    """Configuration object for ncbi-genome-download."""
    
    def __init__(self):
        """Set up a config object with all default values."""
    
    @property
    def available_groups(self):
        """
        Get available taxonomic groups for current section.
        
        Returns:
        list: Available taxonomic groups based on current section
        """
    
    @classmethod
    def from_kwargs(cls, **kwargs):
        """
        Initialise configuration from kwargs.
        
        Parameters:
        - **kwargs: Configuration parameters as keyword arguments
        
        Returns:
        NgdConfig: Configured instance
        """
    
    @classmethod
    def from_namespace(cls, namespace):
        """
        Initialise from argparse Namespace object.
        
        Parameters:
        - namespace: argparse.Namespace, parsed arguments
        
        Returns:
        NgdConfig: Configured instance
        """
    
    @classmethod
    def get_default(cls, category):
        """
        Get the default value of a given category.
        
        Parameters:
        - category: str, configuration category name
        
        Returns:
        Default value for the category
        """
    
    @classmethod
    def get_choices(cls, category):
        """
        Get all available options for a category.
        
        Parameters:  
        - category: str, configuration category name
        
        Returns:
        list: Available choices for the category
        """
    
    @classmethod
    def get_fileending(cls, file_format):
        """
        Get the file extension for a given file format.
        
        Parameters:
        - file_format: str, file format name
        
        Returns:
        str: File extension pattern for the format
        """
    
    @classmethod  
    def get_refseq_category_string(cls, category):
        """
        Get the NCBI string representation for a RefSeq category.
        
        Parameters:
        - category: str, refseq category name
        
        Returns:
        str: NCBI string for the category
        """
    
    def is_compatible_assembly_accession(self, acc):
        """
        Check if assembly accession matches configured filters.
        
        Parameters:
        - acc: str, NCBI assembly accession
        
        Returns:
        bool: True if accession matches filter
        """
    
    def is_compatible_assembly_level(self, ncbi_assembly_level):
        """
        Check if assembly level matches configured filters.
        
        Parameters:
        - ncbi_assembly_level: str, NCBI assembly level string
        
        Returns:
        bool: True if assembly level matches filter
        """
    
    def is_compatible_refseq_category(self, category):
        """
        Check if RefSeq category matches configured filters.
        
        Parameters:
        - category: str, RefSeq category
        
        Returns:
        bool: True if category matches filter
        """

Supported Options

Taxonomic Groups

SUPPORTED_TAXONOMIC_GROUPS = [
    'archaea',
    'bacteria', 
    'fungi',
    'invertebrate',
    'metagenomes',
    'plant',
    'protozoa',
    'vertebrate_mammalian',
    'vertebrate_other',
    'viral'
]

GENBANK_EXCLUSIVE = [
    'metagenomes'
]
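
The two lists suggest how the groups available for a section could be derived. The sketch below assumes that `GENBANK_EXCLUSIVE` names groups that exist only in the `genbank` section, so they are dropped for `refseq`. The lists are local copies of the values shown above and `groups_for_section` is a hypothetical helper; in real code the `NgdConfig.available_groups` property is the place to ask.

```python
# Local copies of the documented values, so the sketch runs without the package.
SUPPORTED_TAXONOMIC_GROUPS = [
    "archaea", "bacteria", "fungi", "invertebrate", "metagenomes",
    "plant", "protozoa", "vertebrate_mammalian", "vertebrate_other", "viral",
]
GENBANK_EXCLUSIVE = ["metagenomes"]


def groups_for_section(section):
    """Return the groups assumed to be available for 'refseq' or 'genbank'."""
    if section == "genbank":
        return list(SUPPORTED_TAXONOMIC_GROUPS)
    if section == "refseq":
        # Assumption: GenBank-exclusive groups are simply absent from RefSeq.
        return [g for g in SUPPORTED_TAXONOMIC_GROUPS if g not in GENBANK_EXCLUSIVE]
    raise ValueError(f"Unknown section: {section!r}")


print(groups_for_section("refseq"))
```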

File Formats

Available file formats for download:

  • genbank - GenBank flat file format (.gbff.gz)
  • fasta - FASTA nucleotide sequences (.fna.gz)
  • rm - RepeatMasker output (.rm.out.gz)
  • features - Feature table (.txt.gz)
  • gff - Generic Feature Format (.gff.gz)
  • protein-fasta - Protein FASTA sequences (.faa.gz)
  • genpept - GenPept protein sequences (.gpff.gz)
  • wgs - WGS master GenBank record (.gbff.gz)
  • cds-fasta - CDS FASTA from genomic (.fna.gz)
  • rna-fna - RNA FASTA sequences (.fna.gz)
  • rna-fasta - RNA FASTA from genomic (.fna.gz)
  • assembly-report - Assembly report (.txt)
  • assembly-stats - Assembly statistics (.txt)
  • translated-cds - Translated CDS sequences (.faa.gz)
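
Several formats in this list share an extension, so the extension alone does not identify which format a downloaded file belongs to. The sketch below makes that visible using a local copy of the list above. `FORMAT_EXTENSIONS` and `formats_with_extension` are hypothetical names for illustration, and the file names are made up; `NgdConfig.get_fileending` is the package's own lookup.

```python
# Local copy of the documented format-to-extension list.
FORMAT_EXTENSIONS = {
    "genbank": ".gbff.gz",
    "fasta": ".fna.gz",
    "rm": ".rm.out.gz",
    "features": ".txt.gz",
    "gff": ".gff.gz",
    "protein-fasta": ".faa.gz",
    "genpept": ".gpff.gz",
    "wgs": ".gbff.gz",
    "cds-fasta": ".fna.gz",
    "rna-fna": ".fna.gz",
    "rna-fasta": ".fna.gz",
    "assembly-report": ".txt",
    "assembly-stats": ".txt",
    "translated-cds": ".faa.gz",
}


def formats_with_extension(filename):
    """Return, sorted by name, every format whose extension matches filename."""
    return sorted(
        fmt for fmt, ext in FORMAT_EXTENSIONS.items() if filename.endswith(ext)
    )


print(formats_with_extension("sample.fna.gz"))  # four candidate formats
print(formats_with_extension("sample.gff.gz"))  # a single candidate
```

When post-processing a download directory, it is therefore safer to track the requested format alongside each file than to infer it from the extension.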

Assembly Levels

Available assembly levels:

  • complete - Complete Genome
  • chromosome - Chromosome
  • scaffold - Scaffold
  • contig - Contig
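
Each option name corresponds to a longer assembly-level label. A minimal sketch of how a selection could be matched against such a label follows, assuming the labels are exactly the strings listed above and that `all` (the documented default) matches everything. `level_matches` is a hypothetical stand-in for illustration, not the package's `is_compatible_assembly_level` implementation.

```python
# Option name -> label, copied from the list above.
ASSEMBLY_LEVEL_NAMES = {
    "complete": "Complete Genome",
    "chromosome": "Chromosome",
    "scaffold": "Scaffold",
    "contig": "Contig",
}


def level_matches(selected_levels, ncbi_assembly_level):
    """Return True if the label is covered by the selected option names."""
    if "all" in selected_levels:
        return True
    # Unknown option names raise KeyError rather than silently matching nothing.
    wanted = {ASSEMBLY_LEVEL_NAMES[level] for level in selected_levels}
    return ncbi_assembly_level in wanted


print(level_matches(["complete", "chromosome"], "Complete Genome"))
print(level_matches(["complete"], "Scaffold"))
```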

RefSeq Categories

Available RefSeq categories:

  • reference - Reference genome
  • representative - Representative genome
  • na - Not applicable/available

Error Handling

The functions return integer exit codes:

  • 0 - Success
  • 1 - General error (for example, no downloads matched the filters)
  • 75 - Temporary failure (network/connection issues)
  • -2 - Validation error (invalid arguments)

Common exceptions:

  • ValueError - Raised for invalid configuration options or unsupported values
  • requests.exceptions.ConnectionError - Network connectivity issues
  • OSError - File system errors (permissions, disk space)
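
Because exit code 75 signals a temporary failure, a caller may want to retry only in that case and treat every other non-zero code as final. The sketch below shows one way to do that. `download_with_retry`, its `attempts` and `delay_seconds` parameters, and the `flaky_download` stand-in are all hypothetical; the wrapper accepts any callable that returns the exit codes listed above.

```python
import time

EXIT_TEMPFAIL = 75  # temporary failure, per the exit-code list above


def download_with_retry(download_fn, attempts=3, delay_seconds=5.0, **kwargs):
    """Call download_fn(**kwargs), retrying only while it returns 75."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    retcode = EXIT_TEMPFAIL
    for attempt in range(1, attempts + 1):
        retcode = download_fn(**kwargs)
        if retcode != EXIT_TEMPFAIL:
            return retcode  # success or a non-retryable error
        if attempt < attempts:
            time.sleep(delay_seconds)
    return retcode


# Stand-in for a real download function: fails once, then succeeds.
_results = iter([75, 0])


def flaky_download(**kwargs):
    return next(_results)


print(download_with_retry(flaky_download, delay_seconds=0, groups=["archaea"]))
```

In real use, the package's `download` function would be passed as `download_fn`, with the usual keyword arguments forwarded unchanged.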

Usage Examples

Download Specific Organisms

from ncbi_genome_download import download

# Download all E. coli complete genomes
download(
    groups=['bacteria'],
    genera=['Escherichia coli'],
    assembly_levels=['complete'],
    file_formats=['fasta', 'genbank']
)

Parallel Downloads with Progress

from ncbi_genome_download import download

# Download with 4 parallel processes and progress bar
download(
    groups=['archaea'],
    assembly_levels=['complete'],
    parallel=4,
    progress_bar=True,
    output='/data/genomes'
)

Dry Run to Preview Downloads

from ncbi_genome_download import download

# See what would be downloaded without actually downloading
download(
    groups=['viral'],
    assembly_levels=['complete'],
    dry_run=True
)

Save Metadata

from ncbi_genome_download import download

# Download and save metadata table
download(
    groups=['bacteria'],
    genera=['Bacillus'],
    metadata_table='bacillus_metadata.tsv'
)

Contributed Scripts

gimme_taxa.py

A utility script for querying the NCBI taxonomy database to find taxonomy IDs for use with ncbi-genome-download. Requires the ete3 toolkit.

Installation:

pip install ete3

Basic Usage:

# Find all descendant taxa for Escherichia (taxid 561)
python gimme_taxa.py -o ~/mytaxafile.txt 561

# Use taxon name instead of ID
python gimme_taxa.py -o all_descendent_taxids.txt Escherichia

# Multiple taxids and/or names
python gimme_taxa.py -o all_descendent_taxids.txt 561,Methanobrevibacter

Key Features:

  • Query by taxonomy ID or scientific name
  • Returns all child taxa of specified parent taxa
  • Writes output in format suitable for ncbi-genome-download
  • Creates local SQLite database for fast queries
  • Supports database updates with --update flag
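
To feed the script's output into the programmatic interface, the IDs need to be read back from the file. The sketch below assumes the file holds one numeric taxonomy ID per line, which the description above does not specify; if the real file carries extra columns or a header, the parsing needs to change. `read_taxids` is a hypothetical helper, and the temporary file only stands in for the script's output.

```python
import tempfile
from pathlib import Path


def read_taxids(path):
    """Read taxonomy IDs from a file, assuming one numeric ID per line."""
    taxids = []
    for line in Path(path).expanduser().read_text().splitlines():
        entry = line.strip()
        if not entry:
            continue  # skip blank lines
        if not entry.isdigit():
            raise ValueError(f"Not a numeric taxonomy ID: {entry!r}")
        taxids.append(entry)
    return taxids


# Demonstration with a temporary file in place of real script output.
with tempfile.TemporaryDirectory() as tmp:
    taxa_file = Path(tmp) / "mytaxafile.txt"
    taxa_file.write_text("561\n562\n\n")
    print(",".join(read_taxids(taxa_file)))
```

The resulting list (or the comma-joined string) could then be passed as the `taxids` parameter of `download`, which is documented above as accepting a list or a string.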