Download genome files from the NCBI FTP server.
npx @tessl/cli install tessl/pypi-ncbi-genome-download@0.3.0

A Python command-line tool and library for downloading bacterial, fungal, and viral genome files from the NCBI FTP servers. Provides flexible filtering options including taxonomic groups, assembly levels, RefSeq categories, genera, species, and taxonomy IDs, with support for parallel downloads and multiple output formats.
pip install ncbi-genome-download

import ncbi_genome_download

For programmatic access:
from ncbi_genome_download import download, NgdConfig, SUPPORTED_TAXONOMIC_GROUPS

For advanced usage:
from ncbi_genome_download import args_download, argument_parser, config_download

# Download all bacterial genomes in GenBank format
ncbi-genome-download bacteria
# Download complete genomes in FASTA format for specific genera
ncbi-genome-download bacteria --assembly-levels complete --formats fasta --genera "Escherichia,Salmonella"
# Short alias is also available
ngd archaea --formats fasta

from ncbi_genome_download import download, NgdConfig
# Basic download using default parameters
retcode = download(
    groups=['bacteria'],
    file_formats=['genbank'],
    assembly_levels=['complete']
)
# Advanced configuration using NgdConfig
config = NgdConfig()
config.groups = ['bacteria', 'archaea']
config.file_formats = ['fasta', 'genbank']
config.assembly_levels = ['complete', 'chromosome']
config.genera = ['Escherichia', 'Bacillus']
config.output = '/path/to/output'
config.parallel = 4
from ncbi_genome_download import config_download
retcode = config_download(config)

The ncbi-genome-download package is designed with a modular architecture that separates concerns across several key components:
- core.py: Contains the main download logic, file processing, and worker functions
- config.py: Manages all configuration options, validation, and default values
- summary.py: Handles parsing of NCBI assembly summary files
- jobs.py: Defines download job data structures for parallel processing
- metadata.py: Tracks and exports metadata about downloaded files
- __main__.py: Provides the CLI entry point with argument parsing

This design enables flexible usage patterns, from simple command-line operations to complex programmatic workflows, with robust parallel downloading capabilities and comprehensive filtering options.
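
To make the division of labour between the components more concrete, the following is a minimal sketch of the general flow they describe: parse an assembly summary, filter its entries, and turn the survivors into download jobs. It is not the package's actual code. The `DownloadJob` record, the helper names `parse_summary`, `select_entries`, and `make_jobs`, the sample accessions, and the `example.invalid` URLs are all assumptions made for illustration; only a few columns of the tab-separated summary format are modelled.

```python
import csv
import io
from collections import namedtuple

# Hypothetical job record; the real structure in jobs.py may differ.
DownloadJob = namedtuple("DownloadJob", ["accession", "organism", "url"])

# Tiny in-memory stand-in for an assembly summary file (tab-separated).
# Accessions and URLs are made-up placeholders, not real NCBI records.
SAMPLE_SUMMARY = (
    "assembly_accession\torganism_name\tassembly_level\tftp_path\n"
    "GCF_000000001.1\tEscherichia coli\tComplete Genome\t"
    "https://example.invalid/GCF_000000001.1\n"
    "GCF_000000002.1\tEscherichia coli\tScaffold\t"
    "https://example.invalid/GCF_000000002.1\n"
    "GCF_000000003.1\tBacillus subtilis\tComplete Genome\t"
    "https://example.invalid/GCF_000000003.1\n"
)


def parse_summary(text):
    """Parse tab-separated summary text into a list of dicts keyed by column."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def select_entries(entries, genera=(), assembly_levels=()):
    """Keep entries matching every non-empty filter (empty filter = no restriction)."""
    selected = []
    for entry in entries:
        if assembly_levels and entry["assembly_level"] not in assembly_levels:
            continue
        # Genus filtering is modelled as a prefix match on the organism name.
        if genera and not any(entry["organism_name"].startswith(g) for g in genera):
            continue
        selected.append(entry)
    return selected


def make_jobs(entries):
    """Turn selected summary entries into job records a worker pool could consume."""
    return [
        DownloadJob(e["assembly_accession"], e["organism_name"], e["ftp_path"])
        for e in entries
    ]


if __name__ == "__main__":
    entries = parse_summary(SAMPLE_SUMMARY)
    wanted = select_entries(
        entries, genera=["Escherichia"], assembly_levels=["Complete Genome"]
    )
    for job in make_jobs(wanted):
        print(job.accession, job.organism)
```

Keeping parsing, filtering, and job creation as separate steps is what allows the filtered job list to be handed to one worker or several without changing the selection logic.
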
Downloads genome files from NCBI FTP servers with flexible filtering and configuration options.
def download(**kwargs):
    """
    Download data from NCBI using parameters passed as kwargs.

    Parameters:
    - groups: list or str, taxonomic groups to download (default: 'all')
    - section: str, NCBI section ('refseq' or 'genbank', default: 'refseq')
    - file_formats: list or str, formats to download (default: 'genbank')
    - assembly_levels: list or str, assembly levels (default: 'all')
    - genera: list or str, genera filter (default: [])
    - strains: list or str, strains filter (default: [])
    - species_taxids: list or str, species taxonomy IDs (default: [])
    - taxids: list or str, taxonomy IDs (default: [])
    - assembly_accessions: list or str, assembly accessions (default: [])
    - refseq_categories: list or str, RefSeq categories (default: 'all')
    - output: str, output directory (default: current directory)
    - parallel: int, number of parallel downloads (default: 1)
    - dry_run: bool, only show what would be downloaded (default: False)
    - progress_bar: bool, show progress bar (default: False)
    - metadata_table: str, path to save metadata table (default: None)
    - human_readable: bool, create human-readable directory structure (default: False)
    - flat_output: bool, dump files without subdirectories (default: False)
    - uri: str, NCBI base URI (default: 'https://ftp.ncbi.nih.gov/genomes')
    - use_cache: bool, use cached summary files (default: False)
    - fuzzy_genus: bool, use fuzzy search for genus names (default: False)
    - fuzzy_accessions: bool, use fuzzy search for accessions (default: False)
    - type_materials: list or str, relation to type material (default: 'any')

    Returns:
    int: Success code (0 for success, non-zero for error)
    """

Downloads using parsed command-line arguments or similar namespace objects.
def args_download(args):
    """
    Download data from NCBI using parameters from argparse Namespace.

    Parameters:
    - args: argparse.Namespace, parsed command-line arguments

    Returns:
    int: Success code (0 for success, non-zero for error)
    """

Creates the command-line argument parser for the tool.
def argument_parser(version=None):
    """
    Create the argument parser for ncbi-genome-download.

    Parameters:
    - version: str, optional version string for --version flag

    Returns:
    argparse.ArgumentParser: Configured argument parser
    """

Lower-level download function that takes a configuration object directly.
def config_download(config):
    """
    Run the actual download from NCBI with parameters in a config object.

    Parameters:
    - config: NgdConfig, configuration object with download settings

    Returns:
    int: Success code (0 for success, non-zero for error)
    """

Complete configuration object for fine-grained control over download parameters.
class NgdConfig:
    """Configuration object for ncbi-genome-download."""

    def __init__(self):
        """Set up a config object with all default values."""

    @property
    def available_groups(self):
        """
        Get available taxonomic groups for current section.

        Returns:
        list: Available taxonomic groups based on current section
        """

    @classmethod
    def from_kwargs(cls, **kwargs):
        """
        Initialise configuration from kwargs.

        Parameters:
        - **kwargs: Configuration parameters as keyword arguments

        Returns:
        NgdConfig: Configured instance
        """

    @classmethod
    def from_namespace(cls, namespace):
        """
        Initialise from argparser Namespace object.

        Parameters:
        - namespace: argparse.Namespace, parsed arguments

        Returns:
        NgdConfig: Configured instance
        """

    @classmethod
    def get_default(cls, category):
        """
        Get the default value of a given category.

        Parameters:
        - category: str, configuration category name

        Returns:
        Default value for the category
        """

    @classmethod
    def get_choices(cls, category):
        """
        Get all available options for a category.

        Parameters:
        - category: str, configuration category name

        Returns:
        list: Available choices for the category
        """

    @classmethod
    def get_fileending(cls, file_format):
        """
        Get the file extension for a given file format.

        Parameters:
        - file_format: str, file format name

        Returns:
        str: File extension pattern for the format
        """

    @classmethod
    def get_refseq_category_string(cls, category):
        """
        Get the NCBI string representation for a RefSeq category.

        Parameters:
        - category: str, refseq category name

        Returns:
        str: NCBI string for the category
        """

    def is_compatible_assembly_accession(self, acc):
        """
        Check if assembly accession matches configured filters.

        Parameters:
        - acc: str, NCBI assembly accession

        Returns:
        bool: True if accession matches filter
        """

    def is_compatible_assembly_level(self, ncbi_assembly_level):
        """
        Check if assembly level matches configured filters.

        Parameters:
        - ncbi_assembly_level: str, NCBI assembly level string

        Returns:
        bool: True if assembly level matches filter
        """

    def is_compatible_refseq_category(self, category):
        """
        Check if RefSeq category matches configured filters.

        Parameters:
        - category: str, RefSeq category

        Returns:
        bool: True if category matches filter
        """

SUPPORTED_TAXONOMIC_GROUPS = [
    'archaea',
    'bacteria',
    'fungi',
    'invertebrate',
    'metagenomes',
    'plant',
    'protozoa',
    'vertebrate_mammalian',
    'vertebrate_other',
    'viral'
]

GENBANK_EXCLUSIVE = [
    'metagenomes'
]

Available file formats for download:
- genbank: GenBank flat file format (.gbff.gz)
- fasta: FASTA nucleotide sequences (.fna.gz)
- rm: RepeatMasker output (.rm.out.gz)
- features: Feature table (.txt.gz)
- gff: Generic Feature Format (.gff.gz)
- protein-fasta: Protein FASTA sequences (.faa.gz)
- genpept: GenPept protein sequences (.gpff.gz)
- wgs: WGS master GenBank record (.gbff.gz)
- cds-fasta: CDS FASTA from genomic (.fna.gz)
- rna-fna: RNA FASTA sequences (.fna.gz)
- rna-fasta: RNA FASTA from genomic (.fna.gz)
- assembly-report: Assembly report (.txt)
- assembly-stats: Assembly statistics (.txt)
- translated-cds: Translated CDS sequences (.faa.gz)

Available assembly levels:
- complete: Complete Genome
- chromosome: Chromosome
- scaffold: Scaffold
- contig: Contig

Available RefSeq categories:
- reference: Reference genome
- representative: Representative genome
- na: Not applicable/available

The functions return integer exit codes:
- 0: Success
- 1: General error (no matching downloads, invalid parameters)
- 75: Temporary failure (network/connection issues)
- -2: Validation error (invalid arguments)

Common exceptions:
- ValueError: Raised for invalid configuration options or unsupported values
- requests.exceptions.ConnectionError: Network connectivity issues
- OSError: File system errors (permissions, disk space)

from ncbi_genome_download import download
# Download all E. coli complete genomes
download(
    groups=['bacteria'],
    genera=['Escherichia coli'],
    assembly_levels=['complete'],
    file_formats=['fasta', 'genbank']
)

from ncbi_genome_download import download
# Download with 4 parallel processes and progress bar
download(
    groups=['archaea'],
    assembly_levels=['complete'],
    parallel=4,
    progress_bar=True,
    output='/data/genomes'
)

from ncbi_genome_download import download
# See what would be downloaded without actually downloading
download(
    groups=['viral'],
    assembly_levels=['complete'],
    dry_run=True
)

from ncbi_genome_download import download
# Download and save metadata table
download(
    groups=['bacteria'],
    genera=['Bacillus'],
    metadata_table='bacillus_metadata.tsv'
)

A utility script for querying the NCBI taxonomy database to find taxonomy IDs for use with ncbi-genome-download. Requires the ete3 toolkit.
Installation:
pip install ete3

Basic Usage:
# Find all descendant taxa for Escherichia (taxid 561)
python gimme_taxa.py -o ~/mytaxafile.txt 561
# Use taxon name instead of ID
python gimme_taxa.py -o all_descendent_taxids.txt Escherichia
# Multiple taxids and/or names
python gimme_taxa.py -o all_descendent_taxids.txt 561,Methanobrevibacter

Key Features:
--update flag
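
As a rough illustration of how taxonomy IDs collected this way might be passed to the `taxids` parameter of `download()`, and how the documented exit codes could be interpreted, here is a minimal sketch. It assumes the taxid file holds one ID per line, which may not match the script's actual output layout. The helpers `read_taxids` and `download_taxids`, the retry-on-75 policy, and the `EXIT_CODE_MEANINGS` table are hypothetical and not part of the package. The download function is passed in as an argument so the sketch can be tried without network access.

```python
# Meanings of the integer return codes as listed in this document.
EXIT_CODE_MEANINGS = {
    0: "success",
    1: "general error",
    75: "temporary failure",
    -2: "validation error",
}


def read_taxids(path):
    """Read taxonomy IDs from a text file, assuming one ID per line.

    Blank lines and lines starting with '#' are skipped. The one-ID-per-line
    layout is an assumption made for this sketch.
    """
    taxids = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            taxids.append(line)
    return taxids


def download_taxids(download_fn, taxids, max_attempts=3, **kwargs):
    """Call download_fn with the given taxids, retrying on temporary failure.

    download_fn is expected to accept keyword arguments and return an integer
    exit code, like the download() function described in this document.
    Returns a (code, meaning) tuple for the last attempt.
    """
    retcode = 75
    for _ in range(max_attempts):
        retcode = download_fn(taxids=taxids, **kwargs)
        if retcode != 75:  # only a temporary failure is worth retrying
            break
    return retcode, EXIT_CODE_MEANINGS.get(retcode, "unknown")
```

With the package installed, a call might look like `download_taxids(download, read_taxids("mytaxafile.txt"), groups=["bacteria"], dry_run=True)`, where `download` is imported from `ncbi_genome_download`; using `dry_run=True` first shows what would be fetched before committing to a large transfer.
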