# NCBI Genome Download

A Python command-line tool and library for downloading bacterial, fungal, and viral genome files from the NCBI FTP servers. It supports filtering by taxonomic group, assembly level, RefSeq category, genus, species, and taxonomy ID, and offers parallel downloads and multiple output formats.

## Package Information

- **Package Name**: ncbi-genome-download
- **Language**: Python
- **Installation**: `pip install ncbi-genome-download`

## Core Imports

```python
import ncbi_genome_download
```

For programmatic access:

```python
from ncbi_genome_download import download, NgdConfig, SUPPORTED_TAXONOMIC_GROUPS
```

For advanced usage:

```python
from ncbi_genome_download import args_download, argument_parser, config_download
```

## Basic Usage

### Command Line

```bash
# Download all bacterial genomes in GenBank format
ncbi-genome-download bacteria

# Download complete genomes in FASTA format for specific genera
ncbi-genome-download bacteria --assembly-levels complete --formats fasta --genera "Escherichia,Salmonella"

# Short alias is also available
ngd archaea --formats fasta
```
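
When the CLI is driven from a Python script, the argument list can be assembled programmatically. The sketch below is a minimal, hypothetical helper (`build_ngd_command` is not part of the package) that uses only the flags shown above. It assumes `--formats` and `--assembly-levels` accept comma-separated values the same way `--genera` does in the example.

```python
import shlex


def build_ngd_command(group, formats=None, assembly_levels=None, genera=None):
    """Assemble an ncbi-genome-download argument list (hypothetical helper)."""
    # The taxonomic group is the positional argument, as in the examples above.
    cmd = ["ncbi-genome-download", group]
    if assembly_levels:
        cmd += ["--assembly-levels", ",".join(assembly_levels)]
    if formats:
        cmd += ["--formats", ",".join(formats)]
    if genera:
        # Assumption: list-valued options take comma-separated values.
        cmd += ["--genera", ",".join(genera)]
    return cmd


cmd = build_ngd_command(
    "bacteria",
    formats=["fasta"],
    assembly_levels=["complete"],
    genera=["Escherichia", "Salmonella"],
)
# Render the list as a single shell-safe string for logging.
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd)`; building a list rather than a string avoids shell-quoting problems with names that contain spaces.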

### Programmatic Interface

```python
from ncbi_genome_download import download, NgdConfig, config_download

# Basic download; parameters that are not passed keep their defaults
retcode = download(
    groups=['bacteria'],
    file_formats=['genbank'],
    assembly_levels=['complete']
)

# Advanced configuration using NgdConfig
config = NgdConfig()
config.groups = ['bacteria', 'archaea']
config.file_formats = ['fasta', 'genbank']
config.assembly_levels = ['complete', 'chromosome']
config.genera = ['Escherichia', 'Bacillus']
config.output = '/path/to/output'
config.parallel = 4

# Run the download with the prepared configuration object
retcode = config_download(config)
```

## Architecture

The ncbi-genome-download package is split into modules, each with a single responsibility:

- **Core Module** (`core.py`): Contains the main download logic, file processing, and worker functions
- **Configuration Module** (`config.py`): Manages all configuration options, validation, and default values
- **Summary Module** (`summary.py`): Handles parsing of NCBI assembly summary files
- **Jobs Module** (`jobs.py`): Defines download job data structures for parallel processing
- **Metadata Module** (`metadata.py`): Tracks and exports metadata about downloaded files
- **Command Line Interface** (`__main__.py`): Provides the CLI entry point with argument parsing

This separation supports both simple command-line use and programmatic workflows, including parallel downloads and filtering.
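
The interplay between job records and parallel workers can be pictured with a small stand-alone sketch. Everything in it is hypothetical: the `DownloadJob` fields, the `fake_worker` function, and the `example.org` URLs are placeholders, and the package's own `jobs.py` and `core.py` may be organised differently. A thread pool is used here only for brevity.

```python
from collections import namedtuple
from concurrent.futures import ThreadPoolExecutor

# Hypothetical job record; the real fields in jobs.py may differ.
DownloadJob = namedtuple("DownloadJob", ["url", "local_path"])


def fake_worker(job):
    """Stand-in for a download worker: only reports what it would fetch."""
    return "{} -> {}".format(job.url, job.local_path)


def run_jobs(jobs, parallel=1):
    """Run all jobs on a pool of `parallel` workers, preserving job order."""
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        return list(pool.map(fake_worker, jobs))


jobs = [
    DownloadJob("https://example.org/a.gbff.gz", "out/a.gbff.gz"),
    DownloadJob("https://example.org/b.gbff.gz", "out/b.gbff.gz"),
]
results = run_jobs(jobs, parallel=2)
for line in results:
    print(line)
```

Keeping each job as a small immutable record means the worker needs no shared state, which is what makes handing jobs to a pool straightforward.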

## Capabilities

### Main Download Function

Downloads genome files from NCBI FTP servers with flexible filtering and configuration options.

```python { .api }
def download(**kwargs):
    """
    Download data from NCBI using parameters passed as kwargs.

    Parameters:
    - groups: list or str, taxonomic groups to download (default: 'all')
    - section: str, NCBI section ('refseq' or 'genbank', default: 'refseq')
    - file_formats: list or str, formats to download (default: 'genbank')
    - assembly_levels: list or str, assembly levels (default: 'all')
    - genera: list or str, genera filter (default: [])
    - strains: list or str, strains filter (default: [])
    - species_taxids: list or str, species taxonomy IDs (default: [])
    - taxids: list or str, taxonomy IDs (default: [])
    - assembly_accessions: list or str, assembly accessions (default: [])
    - refseq_categories: list or str, RefSeq categories (default: 'all')
    - output: str, output directory (default: current directory)
    - parallel: int, number of parallel downloads (default: 1)
    - dry_run: bool, only show what would be downloaded (default: False)
    - progress_bar: bool, show progress bar (default: False)
    - metadata_table: str, path to save metadata table (default: None)
    - human_readable: bool, create human-readable directory structure (default: False)
    - flat_output: bool, dump files without subdirectories (default: False)
    - uri: str, NCBI base URI (default: 'https://ftp.ncbi.nih.gov/genomes')
    - use_cache: bool, use cached summary files (default: False)
    - fuzzy_genus: bool, use fuzzy search for genus names (default: False)
    - fuzzy_accessions: bool, use fuzzy search for accessions (default: False)
    - type_materials: list or str, relation to type material (default: 'any')

    Returns:
    int: Success code (0 for success, non-zero for error)
    """
```
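
Several parameters above are documented as "list or str". As an illustration of what that can mean in practice, the sketch below normalises either form into a list, treating a string as comma-separated in the same way the command-line example passes `"Escherichia,Salmonella"`. The helper `as_list` is hypothetical and is not claimed to match the library's internal handling.

```python
def as_list(value):
    """Normalise a 'list or str' option into a list of strings."""
    if isinstance(value, str):
        # Split on commas and drop empty pieces and surrounding whitespace.
        return [item.strip() for item in value.split(",") if item.strip()]
    return list(value)


print(as_list("Escherichia,Salmonella"))
print(as_list(["bacteria", "archaea"]))
```

Normalising once at the boundary lets the rest of the calling code assume a list, whichever form the caller supplied.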

### Arguments-based Download Function

Downloads using parsed command-line arguments or similar namespace objects.

```python { .api }
def args_download(args):
    """
    Download data from NCBI using parameters from argparse Namespace.

    Parameters:
    - args: argparse.Namespace, parsed command-line arguments

    Returns:
    int: Success code (0 for success, non-zero for error)
    """
```

### Argument Parser Creation

Creates the command-line argument parser for the tool.

```python { .api }
def argument_parser(version=None):
    """
    Create the argument parser for ncbi-genome-download.

    Parameters:
    - version: str, optional version string for --version flag

    Returns:
    argparse.ArgumentParser: Configured argument parser
    """
```

### Configuration-based Download Function

Lower-level download function that takes a configuration object directly.

```python { .api }
def config_download(config):
    """
    Run the actual download from NCBI with parameters in a config object.

    Parameters:
    - config: NgdConfig, configuration object with download settings

    Returns:
    int: Success code (0 for success, non-zero for error)
    """
```

### Configuration Management

Complete configuration object for fine-grained control over download parameters.

```python { .api }
class NgdConfig:
    """Configuration object for ncbi-genome-download."""

    def __init__(self):
        """Set up a config object with all default values."""

    @property
    def available_groups(self):
        """
        Get available taxonomic groups for current section.

        Returns:
        list: Available taxonomic groups based on current section
        """

    @classmethod
    def from_kwargs(cls, **kwargs):
        """
        Initialise configuration from kwargs.

        Parameters:
        - **kwargs: Configuration parameters as keyword arguments

        Returns:
        NgdConfig: Configured instance
        """

    @classmethod
    def from_namespace(cls, namespace):
        """
        Initialise from argparser Namespace object.

        Parameters:
        - namespace: argparse.Namespace, parsed arguments

        Returns:
        NgdConfig: Configured instance
        """

    @classmethod
    def get_default(cls, category):
        """
        Get the default value of a given category.

        Parameters:
        - category: str, configuration category name

        Returns:
        Default value for the category
        """

    @classmethod
    def get_choices(cls, category):
        """
        Get all available options for a category.

        Parameters:
        - category: str, configuration category name

        Returns:
        list: Available choices for the category
        """

    @classmethod
    def get_fileending(cls, file_format):
        """
        Get the file extension for a given file format.

        Parameters:
        - file_format: str, file format name

        Returns:
        str: File extension pattern for the format
        """

    @classmethod
    def get_refseq_category_string(cls, category):
        """
        Get the NCBI string representation for a RefSeq category.

        Parameters:
        - category: str, refseq category name

        Returns:
        str: NCBI string for the category
        """

    def is_compatible_assembly_accession(self, acc):
        """
        Check if assembly accession matches configured filters.

        Parameters:
        - acc: str, NCBI assembly accession

        Returns:
        bool: True if accession matches filter
        """

    def is_compatible_assembly_level(self, ncbi_assembly_level):
        """
        Check if assembly level matches configured filters.

        Parameters:
        - ncbi_assembly_level: str, NCBI assembly level string

        Returns:
        bool: True if assembly level matches filter
        """

    def is_compatible_refseq_category(self, category):
        """
        Check if RefSeq category matches configured filters.

        Parameters:
        - category: str, RefSeq category

        Returns:
        bool: True if category matches filter
        """
```

## Supported Options

### Taxonomic Groups

```python { .api }
SUPPORTED_TAXONOMIC_GROUPS = [
    'archaea',
    'bacteria',
    'fungi',
    'invertebrate',
    'metagenomes',
    'plant',
    'protozoa',
    'vertebrate_mammalian',
    'vertebrate_other',
    'viral'
]

GENBANK_EXCLUSIVE = [
    'metagenomes'
]
```
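
The two lists above suggest how the `available_groups` property can depend on the selected section. The sketch below is an assumption inferred from the name `GENBANK_EXCLUSIVE`: groups in that list are offered only for the `genbank` section and hidden for `refseq`. The function `groups_for_section` is hypothetical, and the constants are copied here so the block runs on its own.

```python
SUPPORTED_TAXONOMIC_GROUPS = [
    'archaea', 'bacteria', 'fungi', 'invertebrate', 'metagenomes',
    'plant', 'protozoa', 'vertebrate_mammalian', 'vertebrate_other', 'viral',
]
GENBANK_EXCLUSIVE = ['metagenomes']


def groups_for_section(section):
    """Return the groups assumed to be available for an NCBI section."""
    if section == 'genbank':
        return list(SUPPORTED_TAXONOMIC_GROUPS)
    if section == 'refseq':
        # Assumption: GenBank-exclusive groups are not offered for RefSeq.
        return [g for g in SUPPORTED_TAXONOMIC_GROUPS
                if g not in GENBANK_EXCLUSIVE]
    raise ValueError("unknown section: {!r}".format(section))


print(groups_for_section('refseq'))
```

In real code, reading `NgdConfig().available_groups` after setting `section` is the documented way to get this list.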

### File Formats

Available file formats for download:

- `genbank` - GenBank flat file format (.gbff.gz)
- `fasta` - FASTA nucleotide sequences (.fna.gz)
- `rm` - RepeatMasker output (.rm.out.gz)
- `features` - Feature table (.txt.gz)
- `gff` - Generic Feature Format (.gff.gz)
- `protein-fasta` - Protein FASTA sequences (.faa.gz)
- `genpept` - GenPept protein sequences (.gpff.gz)
- `wgs` - WGS master GenBank record (.gbff.gz)
- `cds-fasta` - CDS FASTA from genomic (.fna.gz)
- `rna-fna` - RNA FASTA sequences (.fna.gz)
- `rna-fasta` - RNA FASTA from genomic (.fna.gz)
- `assembly-report` - Assembly report (.txt)
- `assembly-stats` - Assembly statistics (.txt)
- `translated-cds` - Translated CDS sequences (.faa.gz)
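
Several formats in the list share an extension, so the extension alone does not identify the format of a downloaded file. The sketch below restates the list as a dictionary and shows the ambiguity; `FORMAT_SUFFIXES` and `formats_for_filename` are hypothetical names, and the suffixes are only the short ones listed above. `NgdConfig.get_fileending` remains the documented way to obtain the library's own pattern for a format.

```python
# Short suffixes copied from the list above (not the library's full patterns).
FORMAT_SUFFIXES = {
    "genbank": ".gbff.gz",
    "fasta": ".fna.gz",
    "rm": ".rm.out.gz",
    "features": ".txt.gz",
    "gff": ".gff.gz",
    "protein-fasta": ".faa.gz",
    "genpept": ".gpff.gz",
    "wgs": ".gbff.gz",
    "cds-fasta": ".fna.gz",
    "rna-fna": ".fna.gz",
    "rna-fasta": ".fna.gz",
    "assembly-report": ".txt",
    "assembly-stats": ".txt",
    "translated-cds": ".faa.gz",
}


def formats_for_filename(filename):
    """Return every format name whose listed suffix matches the filename."""
    return sorted(
        name for name, suffix in FORMAT_SUFFIXES.items()
        if filename.endswith(suffix)
    )


print(formats_for_filename("example_genomic.fna.gz"))
print(formats_for_filename("example.gpff.gz"))
```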

### Assembly Levels

Available assembly levels:

- `complete` - Complete Genome
- `chromosome` - Chromosome
- `scaffold` - Scaffold
- `contig` - Contig
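
Each option name above pairs with the label NCBI uses for that level, which is what a check such as `is_compatible_assembly_level` has to compare against. A minimal sketch of that comparison follows, assuming the labels are exactly the ones listed and that `'all'` accepts every level; `level_matches` is a hypothetical stand-in, not the library's method.

```python
# Option name -> NCBI label, copied from the list above.
ASSEMBLY_LEVEL_LABELS = {
    "complete": "Complete Genome",
    "chromosome": "Chromosome",
    "scaffold": "Scaffold",
    "contig": "Contig",
}


def level_matches(ncbi_assembly_level, wanted_levels):
    """Check whether an NCBI level label is among the wanted option names."""
    if "all" in wanted_levels:
        # Assumption: 'all' accepts any of the known labels.
        return ncbi_assembly_level in ASSEMBLY_LEVEL_LABELS.values()
    labels = {ASSEMBLY_LEVEL_LABELS[name] for name in wanted_levels}
    return ncbi_assembly_level in labels


print(level_matches("Complete Genome", ["complete", "chromosome"]))
print(level_matches("Contig", ["complete", "chromosome"]))
```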

### RefSeq Categories

Available RefSeq categories:

- `reference` - Reference genome
- `representative` - Representative genome
- `na` - Not applicable/available

## Error Handling

The download functions (`download`, `args_download`, `config_download`) return integer exit codes:

- `0` - Success
- `1` - General error (no matching downloads, invalid parameters)
- `75` - Temporary failure (network/connection issues)
- `-2` - Validation error (invalid arguments)

Common exceptions:

- `ValueError` - Raised for invalid configuration options or unsupported values
- `requests.exceptions.ConnectionError` - Network connectivity issues
- `OSError` - File system errors (permissions, disk space)
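
Because `75` signals a temporary failure, a caller may want to retry on that code and give up on the others. The sketch below shows one way to do so. `describe_exit_code` and `run_with_retry` are hypothetical helpers, and `run` stands for any zero-argument callable that returns one of the codes above, for example `lambda: download(groups=['bacteria'])`. A scripted sequence of codes replaces real downloads so the block runs without network access.

```python
import time

# Meanings copied from the list above.
EXIT_CODE_MEANINGS = {
    0: "success",
    1: "general error",
    75: "temporary failure",
    -2: "validation error",
}


def describe_exit_code(code):
    """Translate an exit code into a short human-readable description."""
    return EXIT_CODE_MEANINGS.get(code, "unknown exit code {}".format(code))


def run_with_retry(run, attempts=3, delay=0.0):
    """Call `run` until it returns a code other than 75, up to `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    code = 75
    for attempt in range(attempts):
        code = run()
        if code != 75:
            break
        if attempt < attempts - 1:
            time.sleep(delay)  # wait before the next try
    return code


# Simulated run: two temporary failures followed by success.
outcomes = iter([75, 75, 0])
final = run_with_retry(lambda: next(outcomes), attempts=3)
print(describe_exit_code(final))
```

Retrying only on `75` keeps configuration mistakes (`1`, `-2`) from being repeated pointlessly.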

## Usage Examples

### Download Specific Organisms

```python
from ncbi_genome_download import download

# Download all E. coli complete genomes.
# The genera filter is a string match on the organism name,
# so a full species name can be used here.
download(
    groups=['bacteria'],
    genera=['Escherichia coli'],
    assembly_levels=['complete'],
    file_formats=['fasta', 'genbank']
)
```

### Parallel Downloads with Progress

```python
from ncbi_genome_download import download

# Download with 4 parallel processes and progress bar
download(
    groups=['archaea'],
    assembly_levels=['complete'],
    parallel=4,
    progress_bar=True,
    output='/data/genomes'
)
```

### Dry Run to Preview Downloads

```python
from ncbi_genome_download import download

# See what would be downloaded without actually downloading
download(
    groups=['viral'],
    assembly_levels=['complete'],
    dry_run=True
)
```

### Save Metadata

```python
from ncbi_genome_download import download

# Download and save metadata table
download(
    groups=['bacteria'],
    genera=['Bacillus'],
    metadata_table='bacillus_metadata.tsv'
)
```
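
Once a metadata table has been written, it can be loaded with the standard library. The sketch below assumes the table is tab-separated with a header row, as the `.tsv` name suggests. `read_metadata_table` is a hypothetical helper, and the column names in the demo file are placeholders, not the columns the tool actually writes.

```python
import csv
import os
import tempfile


def read_metadata_table(path):
    """Load a tab-separated table with a header row as a list of dicts."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


# Demo with a stand-in file; the real table has its own column names.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample_metadata.tsv")
    with open(sample, "w", newline="") as handle:
        handle.write("accession\tfilename\n")
        handle.write("ACC_1\tfile_1.gbff.gz\n")
    rows = read_metadata_table(sample)

for row in rows:
    print(row["accession"], row["filename"])
```

Inspecting `rows[0].keys()` on a real table is a quick way to see which columns are available before relying on any of them.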

## Contributed Scripts

### gimme_taxa.py

A utility script for querying the NCBI taxonomy database to find taxonomy IDs for use with ncbi-genome-download. Requires the `ete3` toolkit.

**Installation:**
```bash
pip install ete3
```

**Basic Usage:**
```bash
# Find all descendant taxa for Escherichia (taxid 561)
python gimme_taxa.py -o ~/mytaxafile.txt 561

# Use taxon name instead of ID
python gimme_taxa.py -o all_descendent_taxids.txt Escherichia

# Multiple taxids and/or names
python gimme_taxa.py -o all_descendent_taxids.txt 561,Methanobrevibacter
```

**Key Features:**
- Query by taxonomy ID or scientific name
- Returns all child taxa of specified parent taxa
- Writes output in format suitable for ncbi-genome-download
- Creates local SQLite database for fast queries
- Supports database updates with `--update` flag
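
To feed the script's output into the programmatic interface, the taxonomy IDs need to be read back into Python. The sketch below assumes the output file holds one taxid per line; that layout is an assumption and should be checked against the file the script actually writes. `load_taxids` is a hypothetical helper, and the demo writes its own small file instead of calling the script.

```python
import os
import tempfile


def load_taxids(path):
    """Read taxids from a text file, assuming one taxid per line."""
    with open(path) as handle:
        # Skip blank lines and strip surrounding whitespace.
        return [line.strip() for line in handle if line.strip()]


# Demo with a stand-in file in the assumed one-per-line layout.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "taxids.txt")
    with open(sample, "w") as handle:
        handle.write("561\n562\n\n")
    taxids = load_taxids(sample)

print(",".join(taxids))
```

Since `taxids` is documented as "list or str", the resulting list could be passed as `download(groups=['bacteria'], taxids=taxids)`.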