Tessl Tile for pypi/ncbi-genome-download@0.3.0

# NCBI Genome Download

A Python command-line tool and library for downloading genome files from the NCBI FTP servers, covering bacterial, fungal, viral, and the other taxonomic groups listed under Supported Options. It offers filtering by taxonomic group, assembly level, RefSeq category, genus, species, and taxonomy ID, and supports parallel downloads and multiple output formats.

## Package Information

- **Package Name**: ncbi-genome-download
- **Language**: Python
- **Installation**: `pip install ncbi-genome-download`

## Core Imports

```python
import ncbi_genome_download
```

For programmatic access:

```python
from ncbi_genome_download import download, NgdConfig, SUPPORTED_TAXONOMIC_GROUPS
```

For advanced usage:

```python
from ncbi_genome_download import args_download, argument_parser, config_download
```

## Basic Usage

### Command Line

```bash
# Download all bacterial genomes in GenBank format
ncbi-genome-download bacteria

# Download complete genomes in FASTA format for specific genera
ncbi-genome-download bacteria --assembly-levels complete --formats fasta --genera "Escherichia,Salmonella"

# Short alias is also available
ngd archaea --formats fasta
```

### Programmatic Interface

```python
from ncbi_genome_download import NgdConfig, config_download, download

# Basic download; any parameter not passed keeps its default value
retcode = download(
    groups=['bacteria'],
    file_formats=['genbank'],
    assembly_levels=['complete']
)

# Advanced configuration using NgdConfig
config = NgdConfig()
config.groups = ['bacteria', 'archaea']
config.file_formats = ['fasta', 'genbank']
config.assembly_levels = ['complete', 'chromosome']
config.genera = ['Escherichia', 'Bacillus']
config.output = '/path/to/output'
config.parallel = 4

# Run the download with the prepared configuration object
retcode = config_download(config)
```

## Architecture

The ncbi-genome-download package is split into the following modules:

- **Core Module** (`core.py`): Contains the main download logic, file processing, and worker functions
- **Configuration Module** (`config.py`): Manages all configuration options, validation, and default values
- **Summary Module** (`summary.py`): Handles parsing of NCBI assembly summary files
- **Jobs Module** (`jobs.py`): Defines download job data structures for parallel processing
- **Metadata Module** (`metadata.py`): Tracks and exports metadata about downloaded files
- **Command Line Interface** (`__main__.py`): Provides the CLI entry point with argument parsing

This split supports both simple command-line use and programmatic workflows, including parallel downloads and the filtering options described below.

## Capabilities

### Main Download Function

Downloads genome files from NCBI FTP servers with flexible filtering and configuration options.

```python { .api }
def download(**kwargs):
    """
    Download data from NCBI using parameters passed as kwargs.

    Parameters:
    - groups: list or str, taxonomic groups to download (default: 'all')
    - section: str, NCBI section ('refseq' or 'genbank', default: 'refseq')
    - file_formats: list or str, formats to download (default: 'genbank')
    - assembly_levels: list or str, assembly levels (default: 'all')
    - genera: list or str, genera filter (default: [])
    - strains: list or str, strains filter (default: [])
    - species_taxids: list or str, species taxonomy IDs (default: [])
    - taxids: list or str, taxonomy IDs (default: [])
    - assembly_accessions: list or str, assembly accessions (default: [])
    - refseq_categories: list or str, RefSeq categories (default: 'all')
    - output: str, output directory (default: current directory)
    - parallel: int, number of parallel downloads (default: 1)
    - dry_run: bool, only show what would be downloaded (default: False)
    - progress_bar: bool, show progress bar (default: False)
    - metadata_table: str, path to save metadata table (default: None)
    - human_readable: bool, create human-readable directory structure (default: False)
    - flat_output: bool, dump files without subdirectories (default: False)
    - uri: str, NCBI base URI (default: 'https://ftp.ncbi.nih.gov/genomes')
    - use_cache: bool, use cached summary files (default: False)
    - fuzzy_genus: bool, use fuzzy search for genus names (default: False)
    - fuzzy_accessions: bool, use fuzzy search for accessions (default: False)
    - type_materials: list or str, relation to type material (default: 'any')

    Returns:
    int: Success code (0 for success, non-zero for error)
    """
```
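
As a minimal sketch of how these keyword arguments fit together, the helper below collects a few of them in a dictionary and only imports the library when the download is actually run. The helper names `build_download_kwargs` and `run_download` are illustrative and not part of the package; only the keyword names are taken from the parameter list above.

```python
def build_download_kwargs(groups, genera=None, file_formats="fasta",
                          assembly_levels="complete", dry_run=True):
    """Collect documented download() keyword arguments in one dict.

    Hypothetical helper: only the keyword names come from the
    parameter list above.
    """
    kwargs = {
        "groups": groups,
        "file_formats": file_formats,
        "assembly_levels": assembly_levels,
        "dry_run": dry_run,
    }
    if genera:
        # Leave the filter unset when empty so the library default applies.
        kwargs["genera"] = genera
    return kwargs


def run_download(kwargs):
    """Pass the collected kwargs to download() and return its exit code."""
    # Imported here so the helper above works without the package installed.
    from ncbi_genome_download import download
    return download(**kwargs)


preview = build_download_kwargs(["bacteria"], genera=["Escherichia"])
print(preview)
```

Defaulting `dry_run` to `True` in the helper is a design choice of this sketch, so that a first call only previews the selection; pass `dry_run=False` before handing the result to `run_download`.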

### Arguments-based Download Function

Downloads using parsed command-line arguments or similar namespace objects.

```python { .api }
def args_download(args):
    """
    Download data from NCBI using parameters from argparse Namespace.

    Parameters:
    - args: argparse.Namespace, parsed command-line arguments

    Returns:
    int: Success code (0 for success, non-zero for error)
    """
```

### Argument Parser Creation

Creates the command-line argument parser for the tool.

```python { .api }
def argument_parser(version=None):
    """
    Create the argument parser for ncbi-genome-download.

    Parameters:
    - version: str, optional version string for --version flag

    Returns:
    argparse.ArgumentParser: Configured argument parser
    """
```
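
The two functions above can be combined to drive the tool from Python with the same arguments the command line accepts. The sketch below is one possible way to do that: `build_argv` and `run_from_argv` are hypothetical helpers, and only the flags already shown under Basic Usage (`--assembly-levels`, `--formats`, `--genera`) are used. Joining several values with commas follows the `--genera` example and is assumed to hold for the other two flags as well.

```python
def build_argv(group, assembly_levels=None, formats=None, genera=None):
    """Build an argument list using the CLI flags shown under Basic Usage."""
    argv = [group]
    if assembly_levels:
        argv += ["--assembly-levels", ",".join(assembly_levels)]
    if formats:
        argv += ["--formats", ",".join(formats)]
    if genera:
        argv += ["--genera", ",".join(genera)]
    return argv


def run_from_argv(argv):
    """Parse argv with the package's parser and start the download."""
    # Imported lazily so build_argv() stays usable without the package.
    from ncbi_genome_download import args_download, argument_parser
    parser = argument_parser()
    args = parser.parse_args(argv)
    return args_download(args)


argv = build_argv("bacteria", assembly_levels=["complete"],
                  formats=["fasta"], genera=["Escherichia", "Salmonella"])
print(argv)
```

Calling `run_from_argv(argv)` would then start a real download, so it is left out of the sketch.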

### Configuration-based Download Function

Lower-level download function that takes a configuration object directly.

```python { .api }
def config_download(config):
    """
    Run the actual download from NCBI with parameters in a config object.

    Parameters:
    - config: NgdConfig, configuration object with download settings

    Returns:
    int: Success code (0 for success, non-zero for error)
    """
```

### Configuration Management

Complete configuration object for fine-grained control over download parameters.

```python { .api }
class NgdConfig:
    """Configuration object for ncbi-genome-download."""

    def __init__(self):
        """Set up a config object with all default values."""

    @property
    def available_groups(self):
        """
        Get available taxonomic groups for current section.

        Returns:
        list: Available taxonomic groups based on current section
        """

    @classmethod
    def from_kwargs(cls, **kwargs):
        """
        Initialise configuration from kwargs.

        Parameters:
        - **kwargs: Configuration parameters as keyword arguments

        Returns:
        NgdConfig: Configured instance
        """

    @classmethod
    def from_namespace(cls, namespace):
        """
        Initialise from an argparse Namespace object.

        Parameters:
        - namespace: argparse.Namespace, parsed arguments

        Returns:
        NgdConfig: Configured instance
        """

    @classmethod
    def get_default(cls, category):
        """
        Get the default value of a given category.

        Parameters:
        - category: str, configuration category name

        Returns:
        Default value for the category
        """

    @classmethod
    def get_choices(cls, category):
        """
        Get all available options for a category.

        Parameters:
        - category: str, configuration category name

        Returns:
        list: Available choices for the category
        """

    @classmethod
    def get_fileending(cls, file_format):
        """
        Get the file extension for a given file format.

        Parameters:
        - file_format: str, file format name

        Returns:
        str: File extension pattern for the format
        """

    @classmethod
    def get_refseq_category_string(cls, category):
        """
        Get the NCBI string representation for a RefSeq category.

        Parameters:
        - category: str, refseq category name

        Returns:
        str: NCBI string for the category
        """

    def is_compatible_assembly_accession(self, acc):
        """
        Check if assembly accession matches configured filters.

        Parameters:
        - acc: str, NCBI assembly accession

        Returns:
        bool: True if accession matches filter
        """

    def is_compatible_assembly_level(self, ncbi_assembly_level):
        """
        Check if assembly level matches configured filters.

        Parameters:
        - ncbi_assembly_level: str, NCBI assembly level string

        Returns:
        bool: True if assembly level matches filter
        """

    def is_compatible_refseq_category(self, category):
        """
        Check if RefSeq category matches configured filters.

        Parameters:
        - category: str, RefSeq category

        Returns:
        bool: True if category matches filter
        """
```
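
To illustrate how the `get_default` and `get_choices` classmethods might be used together, the sketch below gathers both values for a list of categories. `summarize_options` is a hypothetical helper, and `StubConfig` is a stand-in that only mimics the two classmethods so the example runs without the package; its values for `section` are taken from the parameter list of `download` above.

```python
def summarize_options(config_cls, categories):
    """Return {category: (default, choices)} for a config class.

    config_cls is expected to offer get_default() and get_choices()
    classmethods, as listed above for NgdConfig.
    """
    summary = {}
    for category in categories:
        summary[category] = (
            config_cls.get_default(category),
            config_cls.get_choices(category),
        )
    return summary


class StubConfig:
    """Stand-in so the sketch runs without the package installed."""
    _options = {"section": ("refseq", ["refseq", "genbank"])}

    @classmethod
    def get_default(cls, category):
        return cls._options[category][0]

    @classmethod
    def get_choices(cls, category):
        return cls._options[category][1]


print(summarize_options(StubConfig, ["section"]))
```

With the package installed, passing `NgdConfig` instead of `StubConfig` should work the same way, assuming `'section'` is accepted as a category name.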

## Supported Options

### Taxonomic Groups

```python { .api }
SUPPORTED_TAXONOMIC_GROUPS = [
    'archaea',
    'bacteria',
    'fungi',
    'invertebrate',
    'metagenomes',
    'plant',
    'protozoa',
    'vertebrate_mammalian',
    'vertebrate_other',
    'viral'
]

GENBANK_EXCLUSIVE = [
    'metagenomes'
]
```
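
The two lists suggest how the groups available for a section could be derived. The sketch below assumes that groups in `GENBANK_EXCLUSIVE` are simply dropped for the `refseq` section; `groups_for_section` is a hypothetical function, and the package's own `available_groups` property remains the authoritative source.

```python
SUPPORTED_TAXONOMIC_GROUPS = [
    'archaea', 'bacteria', 'fungi', 'invertebrate', 'metagenomes',
    'plant', 'protozoa', 'vertebrate_mammalian', 'vertebrate_other', 'viral',
]
GENBANK_EXCLUSIVE = ['metagenomes']


def groups_for_section(section):
    """Return the groups assumed to be available for 'refseq' or 'genbank'."""
    if section == 'genbank':
        return list(SUPPORTED_TAXONOMIC_GROUPS)
    # Assumption: GenBank-only groups are filtered out for RefSeq.
    return [g for g in SUPPORTED_TAXONOMIC_GROUPS if g not in GENBANK_EXCLUSIVE]


print(groups_for_section('refseq'))
```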

### File Formats

Available file formats for download:

- `genbank` - GenBank flat file format (.gbff.gz)
- `fasta` - FASTA nucleotide sequences (.fna.gz)
- `rm` - RepeatMasker output (.rm.out.gz)
- `features` - Feature table (.txt.gz)
- `gff` - Generic Feature Format (.gff.gz)
- `protein-fasta` - Protein FASTA sequences (.faa.gz)
- `genpept` - GenPept protein sequences (.gpff.gz)
- `wgs` - WGS master GenBank record (.gbff.gz)
- `cds-fasta` - CDS FASTA from genomic (.fna.gz)
- `rna-fna` - RNA FASTA sequences (.fna.gz)
- `rna-fasta` - RNA FASTA from genomic (.fna.gz)
- `assembly-report` - Assembly report (.txt)
- `assembly-stats` - Assembly statistics (.txt)
- `translated-cds` - Translated CDS sequences (.faa.gz)
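
The list above can be turned into a small lookup table, for example to decide which downloaded files need decompressing. This is a sketch built only from the extensions listed here; the names `FORMAT_EXTENSIONS`, `is_compressed`, and `uncompressed_formats` are illustrative, and the package's own lookup is `NgdConfig.get_fileending`.

```python
# Extensions copied from the list above; not read from the package itself.
FORMAT_EXTENSIONS = {
    "genbank": ".gbff.gz",
    "fasta": ".fna.gz",
    "rm": ".rm.out.gz",
    "features": ".txt.gz",
    "gff": ".gff.gz",
    "protein-fasta": ".faa.gz",
    "genpept": ".gpff.gz",
    "wgs": ".gbff.gz",
    "cds-fasta": ".fna.gz",
    "rna-fna": ".fna.gz",
    "rna-fasta": ".fna.gz",
    "assembly-report": ".txt",
    "assembly-stats": ".txt",
    "translated-cds": ".faa.gz",
}


def is_compressed(file_format):
    """Return True if files of this format are listed as gzip-compressed."""
    return FORMAT_EXTENSIONS[file_format].endswith(".gz")


def uncompressed_formats():
    """Return the format names whose files are listed without .gz."""
    return sorted(name for name in FORMAT_EXTENSIONS if not is_compressed(name))


print(uncompressed_formats())
```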

### Assembly Levels

Available assembly levels:

- `complete` - Complete Genome
- `chromosome` - Chromosome
- `scaffold` - Scaffold
- `contig` - Contig
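
Each short name maps to the label NCBI uses for that level. The sketch below shows how a check in the spirit of `is_compatible_assembly_level` could work with this mapping; `ASSEMBLY_LEVELS` and `level_matches` are hypothetical names, and the package's actual implementation may differ.

```python
# Short option name -> NCBI label, copied from the list above.
ASSEMBLY_LEVELS = {
    "complete": "Complete Genome",
    "chromosome": "Chromosome",
    "scaffold": "Scaffold",
    "contig": "Contig",
}


def level_matches(selected_levels, ncbi_assembly_level):
    """Return True if the NCBI label belongs to one of the selected short names."""
    wanted = {ASSEMBLY_LEVELS[name] for name in selected_levels}
    return ncbi_assembly_level in wanted


print(level_matches(["complete", "chromosome"], "Chromosome"))
```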

### RefSeq Categories

Available RefSeq categories:

- `reference` - Reference genome
- `representative` - Representative genome
- `na` - Not applicable/available

## Error Handling

The download functions return integer exit codes:

- `0` - Success
- `1` - General error (no matching downloads, invalid parameters)
- `75` - Temporary failure (network/connection issues)
- `-2` - Validation error (invalid arguments)

Common exceptions:

- `ValueError` - Raised for invalid configuration options or unsupported values
- `requests.exceptions.ConnectionError` - Network connectivity issues
- `OSError` - File system errors (permissions, disk space)
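
A caller might translate these exit codes into messages and decide whether a retry makes sense. The following is a minimal sketch based only on the codes listed above; `EXIT_CODES`, `describe_exit_code`, and `should_retry` are hypothetical names, and treating only `75` as retryable is an assumption of this example.

```python
# Exit codes and meanings as listed above.
EXIT_CODES = {
    0: "success",
    1: "general error (no matching downloads, invalid parameters)",
    75: "temporary failure (network/connection issues)",
    -2: "validation error (invalid arguments)",
}


def describe_exit_code(code):
    """Return a readable description for a download exit code."""
    return EXIT_CODES.get(code, "unknown exit code {}".format(code))


def should_retry(code):
    """Assumption: only the temporary-failure code is worth retrying."""
    return code == 75


for code in (0, 75, -2):
    print(code, describe_exit_code(code), should_retry(code))
```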

## Usage Examples

### Download Specific Organisms

```python
from ncbi_genome_download import download

# Download all E. coli complete genomes.
# The genera filter is matched against the organism name reported by NCBI,
# which is why a species name can be used here.
download(
    groups=['bacteria'],
    genera=['Escherichia coli'],
    assembly_levels=['complete'],
    file_formats=['fasta', 'genbank']
)
```

### Parallel Downloads with Progress

```python
from ncbi_genome_download import download

# Download with 4 parallel processes and progress bar
download(
    groups=['archaea'],
    assembly_levels=['complete'],
    parallel=4,
    progress_bar=True,
    output='/data/genomes'
)
```

### Dry Run to Preview Downloads

```python
from ncbi_genome_download import download

# See what would be downloaded without actually downloading
download(
    groups=['viral'],
    assembly_levels=['complete'],
    dry_run=True
)
```

### Save Metadata

```python
from ncbi_genome_download import download

# Download and save metadata table
download(
    groups=['bacteria'],
    genera=['Bacillus'],
    metadata_table='bacillus_metadata.tsv'
)
```

## Contributed Scripts

### gimme_taxa.py

A utility script for querying the NCBI taxonomy database to find taxonomy IDs for use with ncbi-genome-download. Requires the `ete3` toolkit.

**Installation:**
```bash
pip install ete3
```

**Basic Usage:**
```bash
# Find all descendant taxa for Escherichia (taxid 561)
python gimme_taxa.py -o ~/mytaxafile.txt 561

# Use taxon name instead of ID
python gimme_taxa.py -o all_descendent_taxids.txt Escherichia

# Multiple taxids and/or names
python gimme_taxa.py -o all_descendent_taxids.txt 561,Methanobrevibacter
```

**Key Features:**
- Query by taxonomy ID or scientific name
- Returns all child taxa of specified parent taxa
- Writes output in format suitable for ncbi-genome-download
- Creates local SQLite database for fast queries
- Supports database updates with `--update` flag
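
Once a taxid file exists, it can be read back in Python and handed to `download` through its `taxids` parameter. The sketch below assumes the file holds one numeric taxid per line; that layout is an assumption of this example, so adapt the parsing if the script's output differs. `read_taxids` is a hypothetical helper, and the temporary file with sample values only stands in for a real output file.

```python
import os
import tempfile


def read_taxids(path):
    """Read numeric taxids, one per line, skipping blank or non-numeric lines."""
    taxids = []
    with open(path) as handle:
        for line in handle:
            token = line.strip()
            if token.isdigit():
                taxids.append(token)
    return taxids


# Demo with a temporary file standing in for a real gimme_taxa.py output file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("561\n562\n\n")
    demo_path = tmp.name

demo_taxids = read_taxids(demo_path)
os.remove(demo_path)
print(",".join(demo_taxids))
```

The resulting list, or the comma-joined string, could then be passed as `download(groups=['bacteria'], taxids=demo_taxids)`, since the parameter list above documents `taxids` as accepting a list or a string.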
