
tessl/pypi-ncbi-genome-download

Download genome files from the NCBI FTP server.

- Workspace: tessl
- Visibility: Public
- Describes: pypipkg:pypi/ncbi-genome-download@0.3.x

To install, run

npx @tessl/cli install tessl/pypi-ncbi-genome-download@0.3.0

# NCBI Genome Download

A Python command-line tool and library for downloading bacterial, fungal, and viral genome files from the NCBI FTP servers. It filters by taxonomic group, assembly level, RefSeq category, genus, species, and taxonomy ID, and supports parallel downloads and multiple output formats.

## Package Information

- **Package Name**: ncbi-genome-download
- **Language**: Python
- **Installation**: `pip install ncbi-genome-download`

## Core Imports

```python
import ncbi_genome_download
```

For programmatic access:

```python
from ncbi_genome_download import download, NgdConfig, SUPPORTED_TAXONOMIC_GROUPS
```

For advanced usage:

```python
from ncbi_genome_download import args_download, argument_parser, config_download
```

## Basic Usage

### Command Line

```bash
# Download all bacterial RefSeq genomes in GenBank format (the defaults)
ncbi-genome-download bacteria

# Download complete genomes in FASTA format for specific genera
ncbi-genome-download bacteria --assembly-levels complete --formats fasta --genera "Escherichia,Salmonella"

# Short alias is also available
ngd archaea --formats fasta
```
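When the tool is driven from a script, the same invocation can be assembled as an argument list. The following is a minimal sketch: `build_ngd_command` is a hypothetical helper, not part of the package, and it uses only the flags shown above. It also assumes that `--formats` and `--assembly-levels` accept comma-separated values the way `--genera` does in the example.

```python
import shlex


def build_ngd_command(group, formats=None, assembly_levels=None, genera=None):
    """Assemble an argv list for the ncbi-genome-download CLI.

    Hypothetical helper: it only knows the flags used in the examples above.
    """
    cmd = ["ncbi-genome-download", group]
    if assembly_levels:
        cmd += ["--assembly-levels", ",".join(assembly_levels)]
    if formats:
        cmd += ["--formats", ",".join(formats)]
    if genera:
        # Assumption: multiple genera are passed as one comma-separated value.
        cmd += ["--genera", ",".join(genera)]
    return cmd


cmd = build_ngd_command(
    "bacteria",
    formats=["fasta"],
    assembly_levels=["complete"],
    genera=["Escherichia", "Salmonella"],
)

# Print a shell-safe rendering of the command for logging or copy-paste.
print(shlex.join(cmd))
```

The list can be handed to `subprocess.run(cmd, check=True)`. Passing a list rather than a shell string avoids quoting problems when a genus or species name contains spaces.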

### Programmatic Interface

```python
from ncbi_genome_download import NgdConfig, config_download, download

# Basic download with keyword arguments
retcode = download(
    groups=['bacteria'],
    file_formats=['genbank'],
    assembly_levels=['complete']
)

# Advanced configuration using NgdConfig
config = NgdConfig()
config.groups = ['bacteria', 'archaea']
config.file_formats = ['fasta', 'genbank']
config.assembly_levels = ['complete', 'chromosome']
config.genera = ['Escherichia', 'Bacillus']
config.output = '/path/to/output'
config.parallel = 4

# Run the download described by the config object
retcode = config_download(config)
```

## Architecture

The ncbi-genome-download package is split into modules with separate responsibilities:

- **Core Module** (`core.py`): Contains the main download logic, file processing, and worker functions
- **Configuration Module** (`config.py`): Manages all configuration options, validation, and default values
- **Summary Module** (`summary.py`): Handles parsing of NCBI assembly summary files
- **Jobs Module** (`jobs.py`): Defines download job data structures for parallel processing
- **Metadata Module** (`metadata.py`): Tracks and exports metadata about downloaded files
- **Command Line Interface** (`__main__.py`): Provides the CLI entry point with argument parsing

This separation supports both simple command-line use and programmatic workflows, with parallel downloads and filtering available in either.

## Capabilities

### Main Download Function

Downloads genome files from NCBI FTP servers with flexible filtering and configuration options.

```python { .api }
def download(**kwargs):
    """
    Download data from NCBI using parameters passed as kwargs.

    Parameters:
    - groups: list or str, taxonomic groups to download (default: 'all')
    - section: str, NCBI section ('refseq' or 'genbank', default: 'refseq')
    - file_formats: list or str, formats to download (default: 'genbank')
    - assembly_levels: list or str, assembly levels (default: 'all')
    - genera: list or str, genera filter (default: [])
    - strains: list or str, strains filter (default: [])
    - species_taxids: list or str, species taxonomy IDs (default: [])
    - taxids: list or str, taxonomy IDs (default: [])
    - assembly_accessions: list or str, assembly accessions (default: [])
    - refseq_categories: list or str, RefSeq categories (default: 'all')
    - output: str, output directory (default: current directory)
    - parallel: int, number of parallel downloads (default: 1)
    - dry_run: bool, only show what would be downloaded (default: False)
    - progress_bar: bool, show progress bar (default: False)
    - metadata_table: str, path to save metadata table (default: None)
    - human_readable: bool, create human-readable directory structure (default: False)
    - flat_output: bool, dump files without subdirectories (default: False)
    - uri: str, NCBI base URI (default: 'https://ftp.ncbi.nih.gov/genomes')
    - use_cache: bool, use cached summary files (default: False)
    - fuzzy_genus: bool, use fuzzy search for genus names (default: False)
    - fuzzy_accessions: bool, use fuzzy search for accessions (default: False)
    - type_materials: list or str, relation to type material (default: 'any')

    Returns:
        int: Success code (0 for success, non-zero for error)
    """
```
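Because `dry_run` and the integer return value are both part of the signature above, a dry run can act as a cheap check before a real download. The sketch below illustrates that pattern; `preview_then_download` is a hypothetical wrapper, and the function is passed in as `download_fn` so the example runs without the package or network access. It assumes a dry run returns a non-zero code when nothing matches.

```python
def preview_then_download(download_fn, **kwargs):
    """Run a dry run first and only download if the dry run succeeds.

    `download_fn` is expected to behave like `ncbi_genome_download.download`:
    it takes keyword arguments and returns 0 on success.
    """
    options = dict(kwargs)
    options.pop("dry_run", None)  # this wrapper controls dry_run itself

    preview_code = download_fn(dry_run=True, **options)
    if preview_code != 0:
        # The preview failed (for example, no matches); skip the real run.
        return preview_code
    return download_fn(dry_run=False, **options)


# Stand-in for the real function, used only to demonstrate the call order.
calls = []


def fake_download(**kwargs):
    calls.append(kwargs["dry_run"])
    return 0


result = preview_then_download(
    fake_download, groups=["viral"], assembly_levels=["complete"]
)
print(result, calls)
```

In real use, pass `ncbi_genome_download.download` as `download_fn` instead of the stand-in.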

### Arguments-based Download Function

Downloads using parsed command-line arguments or similar namespace objects.

```python { .api }
def args_download(args):
    """
    Download data from NCBI using parameters from argparse Namespace.

    Parameters:
    - args: argparse.Namespace, parsed command-line arguments

    Returns:
        int: Success code (0 for success, non-zero for error)
    """
```

### Argument Parser Creation

Creates the command-line argument parser for the tool.

```python { .api }
def argument_parser(version=None):
    """
    Create the argument parser for ncbi-genome-download.

    Parameters:
    - version: str, optional version string for --version flag

    Returns:
        argparse.ArgumentParser: Configured argument parser
    """
```

### Configuration-based Download Function

Lower-level download function that takes a configuration object directly.

```python { .api }
def config_download(config):
    """
    Run the actual download from NCBI with parameters in a config object.

    Parameters:
    - config: NgdConfig, configuration object with download settings

    Returns:
        int: Success code (0 for success, non-zero for error)
    """
```

### Configuration Management

Complete configuration object for fine-grained control over download parameters.

```python { .api }
class NgdConfig:
    """Configuration object for ncbi-genome-download."""

    def __init__(self):
        """Set up a config object with all default values."""

    @property
    def available_groups(self):
        """
        Get available taxonomic groups for current section.

        Returns:
            list: Available taxonomic groups based on current section
        """

    @classmethod
    def from_kwargs(cls, **kwargs):
        """
        Initialise configuration from kwargs.

        Parameters:
        - **kwargs: Configuration parameters as keyword arguments

        Returns:
            NgdConfig: Configured instance
        """

    @classmethod
    def from_namespace(cls, namespace):
        """
        Initialise from argparse Namespace object.

        Parameters:
        - namespace: argparse.Namespace, parsed arguments

        Returns:
            NgdConfig: Configured instance
        """

    @classmethod
    def get_default(cls, category):
        """
        Get the default value of a given category.

        Parameters:
        - category: str, configuration category name

        Returns:
            Default value for the category
        """

    @classmethod
    def get_choices(cls, category):
        """
        Get all available options for a category.

        Parameters:
        - category: str, configuration category name

        Returns:
            list: Available choices for the category
        """

    @classmethod
    def get_fileending(cls, file_format):
        """
        Get the file extension for a given file format.

        Parameters:
        - file_format: str, file format name

        Returns:
            str: File extension pattern for the format
        """

    @classmethod
    def get_refseq_category_string(cls, category):
        """
        Get the NCBI string representation for a RefSeq category.

        Parameters:
        - category: str, refseq category name

        Returns:
            str: NCBI string for the category
        """

    def is_compatible_assembly_accession(self, acc):
        """
        Check if assembly accession matches configured filters.

        Parameters:
        - acc: str, NCBI assembly accession

        Returns:
            bool: True if accession matches filter
        """

    def is_compatible_assembly_level(self, ncbi_assembly_level):
        """
        Check if assembly level matches configured filters.

        Parameters:
        - ncbi_assembly_level: str, NCBI assembly level string

        Returns:
            bool: True if assembly level matches filter
        """

    def is_compatible_refseq_category(self, category):
        """
        Check if RefSeq category matches configured filters.

        Parameters:
        - category: str, RefSeq category

        Returns:
            bool: True if category matches filter
        """
```

## Supported Options

### Taxonomic Groups

```python { .api }
SUPPORTED_TAXONOMIC_GROUPS = [
    'archaea',
    'bacteria',
    'fungi',
    'invertebrate',
    'metagenomes',
    'plant',
    'protozoa',
    'vertebrate_mammalian',
    'vertebrate_other',
    'viral'
]

GENBANK_EXCLUSIVE = [
    'metagenomes'
]
```
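The name `GENBANK_EXCLUSIVE` suggests that these groups exist only in the `genbank` section, which would explain why `available_groups` depends on the current section. The sketch below shows that rule under this assumption. `groups_for_section` is a hypothetical function, and the constants are copied from the block above rather than imported, so this is not the library's implementation.

```python
SUPPORTED_TAXONOMIC_GROUPS = [
    "archaea", "bacteria", "fungi", "invertebrate", "metagenomes",
    "plant", "protozoa", "vertebrate_mammalian", "vertebrate_other", "viral",
]
GENBANK_EXCLUSIVE = ["metagenomes"]


def groups_for_section(section):
    """Return the groups assumed to be available in an NCBI section."""
    if section == "genbank":
        return list(SUPPORTED_TAXONOMIC_GROUPS)
    if section == "refseq":
        # Assumption: GenBank-exclusive groups are simply absent from RefSeq.
        return [g for g in SUPPORTED_TAXONOMIC_GROUPS if g not in GENBANK_EXCLUSIVE]
    raise ValueError(f"unknown section: {section!r}")


refseq_groups = groups_for_section("refseq")
genbank_groups = groups_for_section("genbank")
print(sorted(set(genbank_groups) - set(refseq_groups)))
```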

### File Formats

Available file formats for download:

- `genbank` - GenBank flat file format (.gbff.gz)
- `fasta` - FASTA nucleotide sequences (.fna.gz)
- `rm` - RepeatMasker output (.rm.out.gz)
- `features` - Feature table (.txt.gz)
- `gff` - Generic Feature Format (.gff.gz)
- `protein-fasta` - Protein FASTA sequences (.faa.gz)
- `genpept` - GenPept protein sequences (.gpff.gz)
- `wgs` - WGS master GenBank record (.gbff.gz)
- `cds-fasta` - CDS FASTA from genomic (.fna.gz)
- `rna-fna` - RNA FASTA sequences (.fna.gz)
- `rna-fasta` - RNA FASTA from genomic (.fna.gz)
- `assembly-report` - Assembly report (.txt)
- `assembly-stats` - Assembly statistics (.txt)
- `translated-cds` - Translated CDS sequences (.faa.gz)
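
Several formats in this list share an extension, so the short extension alone does not identify the format of a downloaded file. The sketch below groups the formats by extension to make the overlaps explicit. `FORMAT_EXTENSIONS` is transcribed from the list above for illustration; it is not the library's internal table, which may use longer suffixes.

```python
from collections import defaultdict

# Transcribed from the list above; illustrative only.
FORMAT_EXTENSIONS = {
    "genbank": ".gbff.gz",
    "fasta": ".fna.gz",
    "rm": ".rm.out.gz",
    "features": ".txt.gz",
    "gff": ".gff.gz",
    "protein-fasta": ".faa.gz",
    "genpept": ".gpff.gz",
    "wgs": ".gbff.gz",
    "cds-fasta": ".fna.gz",
    "rna-fna": ".fna.gz",
    "rna-fasta": ".fna.gz",
    "assembly-report": ".txt",
    "assembly-stats": ".txt",
    "translated-cds": ".faa.gz",
}


def formats_by_extension(table):
    """Group format names by the extension they map to."""
    grouped = defaultdict(list)
    for file_format, extension in table.items():
        grouped[extension].append(file_format)
    return dict(grouped)


grouped = formats_by_extension(FORMAT_EXTENSIONS)
# Keep only extensions that more than one format maps to.
ambiguous = {ext: fmts for ext, fmts in grouped.items() if len(fmts) > 1}
for ext, fmts in ambiguous.items():
    print(ext, fmts)
```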

### Assembly Levels

Available assembly levels:

- `complete` - Complete Genome
- `chromosome` - Chromosome
- `scaffold` - Scaffold
- `contig` - Contig
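
Each short option name corresponds to a longer NCBI level string. The sketch below shows how a filter in the spirit of `is_compatible_assembly_level` could use that mapping, treating `'all'` (the documented default) as a wildcard. `level_matches` and `ASSEMBLY_LEVEL_NAMES` are hypothetical names, and the library's own logic may differ.

```python
# Short option name -> NCBI level string, transcribed from the list above.
ASSEMBLY_LEVEL_NAMES = {
    "complete": "Complete Genome",
    "chromosome": "Chromosome",
    "scaffold": "Scaffold",
    "contig": "Contig",
}


def level_matches(ncbi_assembly_level, wanted_levels):
    """Return True if an NCBI level string is selected by the short names."""
    if "all" in wanted_levels:
        return True  # assumption: 'all' disables the filter
    wanted = {ASSEMBLY_LEVEL_NAMES[level] for level in wanted_levels}
    return ncbi_assembly_level in wanted


print(level_matches("Complete Genome", ["complete", "chromosome"]))
print(level_matches("Scaffold", ["complete"]))
```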

### RefSeq Categories

Available RefSeq categories:

- `reference` - Reference genome
- `representative` - Representative genome
- `na` - Not applicable/available

## Error Handling

The download functions return integer exit codes:

- `0` - Success
- `1` - General error (no matching downloads, invalid parameters)
- `75` - Temporary failure (network/connection issues)
- `-2` - Validation error (invalid arguments)

Common exceptions:

- `ValueError` - Raised for invalid configuration options or unsupported values
- `requests.exceptions.ConnectionError` - Network connectivity issues
- `OSError` - File system errors (permissions, disk space)
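
Since exit code `75` marks a temporary failure, a caller may want to retry only in that case and give up immediately on the other codes. The sketch below is one way to do that. `download_with_retry`, `attempts`, and `delay_seconds` are hypothetical names, and the download function is injected so the example runs without the package or a network connection.

```python
import time

TEMPORARY_FAILURE = 75  # exit code listed above for network/connection issues


def download_with_retry(download_fn, attempts=3, delay_seconds=0.0, **kwargs):
    """Call download_fn until it returns something other than 75.

    Returns the last exit code. Non-temporary codes are returned immediately.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    code = TEMPORARY_FAILURE
    for attempt in range(1, attempts + 1):
        code = download_fn(**kwargs)
        if code != TEMPORARY_FAILURE:
            return code
        if attempt < attempts:
            # Linear backoff: wait a little longer after each failed attempt.
            time.sleep(delay_seconds * attempt)
    return code


# Stand-in that fails twice with a temporary error, then succeeds.
attempt_log = []


def flaky_download(**kwargs):
    attempt_log.append(kwargs)
    return TEMPORARY_FAILURE if len(attempt_log) < 3 else 0


final_code = download_with_retry(flaky_download, attempts=5, groups=["archaea"])
print(final_code, len(attempt_log))
```

In real use, `download_fn` would be `ncbi_genome_download.download` with a non-zero `delay_seconds`.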

## Usage Examples

### Download Specific Organisms

```python
from ncbi_genome_download import download

# Download all E. coli complete genomes
download(
    groups=['bacteria'],
    genera=['Escherichia coli'],
    assembly_levels=['complete'],
    file_formats=['fasta', 'genbank']
)
```

### Parallel Downloads with Progress

```python
from ncbi_genome_download import download

# Download with 4 parallel processes and progress bar
download(
    groups=['archaea'],
    assembly_levels=['complete'],
    parallel=4,
    progress_bar=True,
    output='/data/genomes'
)
```

### Dry Run to Preview Downloads

```python
from ncbi_genome_download import download

# See what would be downloaded without actually downloading
download(
    groups=['viral'],
    assembly_levels=['complete'],
    dry_run=True
)
```

### Save Metadata

```python
from ncbi_genome_download import download

# Download and save metadata table
download(
    groups=['bacteria'],
    genera=['Bacillus'],
    metadata_table='bacillus_metadata.tsv'
)
```
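Once the table has been written, it can be read back with the standard library. This sketch assumes the file is tab-separated with a header row, as the `.tsv` name suggests; that layout is not confirmed here. `load_metadata` is a hypothetical helper, and the column names in the demo file are placeholders, not the tool's actual schema.

```python
import csv
import os
import tempfile


def load_metadata(path):
    """Read a tab-separated table with a header row into a list of dicts."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


# Demo on a stand-in file; the columns below are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_metadata.tsv")
    with open(demo_path, "w", newline="", encoding="utf-8") as handle:
        handle.write("accession\tlocal_filename\n")
        handle.write("EXAMPLE_0001\t./example.gbff.gz\n")
    rows = load_metadata(demo_path)

for row in rows:
    print(row["accession"], row["local_filename"])
```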

## Contributed Scripts

### gimme_taxa.py

A utility script for querying the NCBI taxonomy database to find taxonomy IDs for use with ncbi-genome-download. Requires the `ete3` toolkit.

**Installation:**

```bash
pip install ete3
```

**Basic Usage:**

```bash
# Find all descendant taxa for Escherichia (taxid 561)
python gimme_taxa.py -o ~/mytaxafile.txt 561

# Use taxon name instead of ID
python gimme_taxa.py -o all_descendent_taxids.txt Escherichia

# Multiple taxids and/or names
python gimme_taxa.py -o all_descendent_taxids.txt 561,Methanobrevibacter
```

**Key Features:**

- Queries by taxonomy ID or scientific name
- Returns all child taxa of the specified parent taxa
- Writes output in a format suitable for ncbi-genome-download
- Creates a local SQLite database for fast queries
- Supports database updates with the `--update` flag