# NCBI Taxonomy Integration

Integration with the NCBI Taxonomy database for taxonomic annotation, lineage retrieval, species tree construction, and taxonomic analysis. ETE3 queries a local copy of the taxonomy database and returns the results as dictionaries, lists, and trees.

## Capabilities

### NCBITaxa Class

Main interface for accessing and working with NCBI Taxonomy data.

```python { .api }
class NCBITaxa:
    """
    Interface to NCBI Taxonomy database with local caching and tree integration.
    """

    def __init__(self, dbfile=None, taxdump_file=None, update=True):
        """
        Initialize NCBI Taxonomy database interface.

        Parameters:
        - dbfile (str): Path to local taxonomy database file.
                        If None, uses default location (~/.etetoolkit/taxa.sqlite)
        - taxdump_file (str): Path to custom taxdump file for database initialization
        - update (bool): Whether to automatically update database if outdated
        """
```

### Database Management

Manage the local taxonomy database and its updates.

```python { .api }
def update_taxonomy_database(self):
    """
    Update local NCBI taxonomy database with latest data.
    Downloads and processes current NCBI taxonomy dump files.
    """

def get_topology(self, taxids, intermediate_nodes=False, rank_limit=None, annotate=True):
    """
    Build taxonomic tree from list of taxonomic IDs.

    Parameters:
    - taxids (list): List of NCBI taxonomic IDs
    - intermediate_nodes (bool): Include intermediate taxonomic nodes
    - rank_limit (str): Limit tree to specific taxonomic rank
    - annotate (bool): Annotate nodes with taxonomic information

    Returns:
    Tree: Taxonomic tree with specified taxa
    """
```

### Taxonomic ID Translation

Convert between taxonomic names and NCBI taxonomic IDs.

```python { .api }
def get_name_translator(self, names):
    """
    Translate organism names to NCBI taxonomic IDs.

    Parameters:
    - names (list): List of organism names to translate

    Returns:
    dict: Mapping from each matched name to a list of taxonomic IDs
          (names that are not found are omitted)
    """

def get_taxid_translator(self, taxids):
    """
    Translate NCBI taxonomic IDs to organism names.

    Parameters:
    - taxids (list): List of taxonomic IDs to translate

    Returns:
    dict: Mapping from taxonomic IDs to names
    """

def translate_to_names(self, taxids):
    """
    Convert taxonomic IDs to scientific names.

    Parameters:
    - taxids (list): List of taxonomic IDs

    Returns:
    list: List of corresponding scientific names
    """

def get_fuzzy_name_translation(self, name, sim=0.9):
    """
    Fuzzy matching of a single organism name to a taxonomic ID.

    Parameters:
    - name (str): Organism name (may contain typos/variations)
    - sim (float): Similarity threshold (0.0-1.0)

    Returns:
    tuple: (taxid, matched name, match score) for the best match
    """
```

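Since `get_name_translator` maps each name to a list of taxids and leaves out names it cannot find, callers typically need a small resolution step before using the result. The sketch below shows one way to do that. `resolve_names` is a hypothetical helper, not part of ETE3, and it assumes only that the object passed in exposes `get_name_translator` with the return shape described above.

```python
def resolve_names(ncbi, names):
    """Split names into resolved, ambiguous, and missing groups.

    Assumption: `ncbi` is an NCBITaxa-like object whose
    get_name_translator(names) returns {name: [taxid, ...]}.
    """
    hits = ncbi.get_name_translator(list(names))
    resolved, ambiguous, missing = {}, {}, []
    for name in names:
        taxids = hits.get(name, [])
        if not taxids:
            missing.append(name)            # name absent from the result
        elif len(taxids) == 1:
            resolved[name] = taxids[0]      # exactly one match
        else:
            ambiguous[name] = list(taxids)  # several matches, caller decides
    return resolved, ambiguous, missing
```

With a real database the call would look like `resolve_names(NCBITaxa(), species_names)`. Keeping ambiguous names separate avoids silently picking the first taxid when a name is shared by several taxa.
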
### Taxonomic Hierarchy and Lineages

Retrieve taxonomic classifications and hierarchical relationships.

```python { .api }
def get_lineage(self, taxid):
    """
    Get complete taxonomic lineage for a taxonomic ID.

    Parameters:
    - taxid (int): NCBI taxonomic ID

    Returns:
    list: List of taxonomic IDs from root to target taxon
    """

def get_rank(self, taxids):
    """
    Get taxonomic ranks for taxonomic IDs.

    Parameters:
    - taxids (list): List of taxonomic IDs

    Returns:
    dict: Mapping from taxonomic IDs to their ranks
    """

def get_common_names(self, taxids):
    """
    Get common names for taxonomic IDs.

    Parameters:
    - taxids (list): List of taxonomic IDs

    Returns:
    dict: Mapping from taxonomic IDs to common names
    """

def get_descendant_taxa(self, parent, collapse_subspecies=False, rank_limit=None):
    """
    Get all descendant taxa for a parent taxonomic ID.

    Parameters:
    - parent (int): Parent taxonomic ID
    - collapse_subspecies (bool): Exclude subspecies level taxa
    - rank_limit (str): Only include taxa at or above specified rank

    Returns:
    list: List of descendant taxonomic IDs
    """
```

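The lineage, rank, and name lookups are often combined to produce a readable classification for a single taxon. The following is a minimal sketch of that combination. `lineage_by_rank` and its `skip` parameter are hypothetical names introduced for illustration, and the helper assumes an object exposing `get_lineage`, `get_rank`, and `get_taxid_translator` with the return types documented above.

```python
def lineage_by_rank(ncbi, taxid, skip=("no rank",)):
    """Return {rank: scientific name} for the lineage of `taxid`, root first.

    Assumption: `ncbi` is an NCBITaxa-like object. Ranks listed in `skip`
    are dropped; if a rank occurs more than once, the entry closest to
    the target taxon wins.
    """
    lineage = ncbi.get_lineage(taxid)            # taxids from root to target
    ranks = ncbi.get_rank(lineage)               # {taxid: rank}
    names = ncbi.get_taxid_translator(lineage)   # {taxid: name}
    result = {}
    for tid in lineage:
        rank = ranks.get(tid)
        if rank is None or rank in skip:
            continue
        result[rank] = names.get(tid, str(tid))  # fall back to the taxid
    return result
```

A call such as `lineage_by_rank(NCBITaxa(), 9606)` would then give a dictionary keyed by rank, which is easier to tabulate than the raw list of taxids.
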
### Tree Annotation

Annotate phylogenetic trees with taxonomic information.

```python { .api }
def annotate_tree(self, tree, taxid_attr="name", tax2name=None, tax2track=None):
    """
    Annotate tree nodes with taxonomic information. The tree is modified in place.

    Parameters:
    - tree (Tree): Tree to annotate
    - taxid_attr (str): Node attribute containing the taxonomic ID
    - tax2name (dict): Custom mapping from taxids to names
    - tax2track (dict): Custom mapping from taxids to lineage tracks

    Returns:
    tuple: (tax2name, tax2lineages, tax2rank) lookup dictionaries
    """
```

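`annotate_tree` writes its results onto the nodes of the tree it is given, so the annotations are read back from the nodes afterwards. The sketch below collects a few of those attributes into plain dictionaries. `summarize_annotations` is a hypothetical helper, and the default field names (`taxid`, `sci_name`, `rank`) are assumed to be the attributes set during annotation; any attribute that is missing is reported as `None`.

```python
def summarize_annotations(nodes, fields=("taxid", "sci_name", "rank")):
    """Collect selected annotation attributes from already-annotated nodes.

    Assumption: each node is an object with an optional `name` attribute
    plus whatever attributes annotation has added to it.
    """
    rows = []
    for node in nodes:
        row = {"name": getattr(node, "name", "")}
        for field in fields:
            row[field] = getattr(node, field, None)  # None if not annotated
        rows.append(row)
    return rows
```

After annotating a tree, this could be called as `summarize_annotations(tree.traverse())` to get one row per node.
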
## Taxonomic Analysis Functions

### Species Tree Construction

```python { .api }
def get_broken_branches(self, tree, species_attr="species"):
    """
    Identify branches that break species monophyly.

    Parameters:
    - tree (Tree): Input phylogenetic tree
    - species_attr (str): Node attribute containing species information

    Returns:
    list: List of branches breaking monophyly
    """

def annotate_tree_with_taxa(self, tree, taxid_attr="name", tax2name=None):
    """
    Add taxonomic annotations to all tree nodes.

    Parameters:
    - tree (Tree): Tree to annotate
    - taxid_attr (str): Attribute containing taxonomic identifiers
    - tax2name (dict): Custom taxonomic ID to name mapping

    Returns:
    Tree: Tree with taxonomic annotations added
    """
```

## Usage Examples

### Basic Taxonomy Operations

```python
from ete3 import NCBITaxa

# Initialize NCBI taxonomy
ncbi = NCBITaxa()

# Translate names to taxonomic IDs (each name maps to a list of taxids)
name2taxid = ncbi.get_name_translator(['Homo sapiens', 'Pan troglodytes', 'Gorilla gorilla'])
print(f"Human taxid: {name2taxid['Homo sapiens'][0]}")

# Translate taxonomic IDs to names
taxid2name = ncbi.get_taxid_translator([9606, 9598, 9593])
print(f"Taxid 9606: {taxid2name[9606]}")

# Get taxonomic lineage
lineage = ncbi.get_lineage(9606)  # Human
print(f"Human lineage: {lineage}")

# Get ranks for lineage
ranks = ncbi.get_rank(lineage)
for taxid in lineage:
    print(f"{taxid}: {ranks[taxid]}")
```

### Building Taxonomic Trees

```python
from ete3 import NCBITaxa

ncbi = NCBITaxa()

# Create taxonomic tree from species list
species_names = ['Homo sapiens', 'Pan troglodytes', 'Gorilla gorilla', 'Macaca mulatta']
name2taxid = ncbi.get_name_translator(species_names)
# Each name maps to a list of taxids; take the first match
taxids = [name2taxid[name][0] for name in species_names]

# Build taxonomic tree
tree = ncbi.get_topology(taxids)
print(tree.get_ascii())

# Include intermediate nodes for complete taxonomy
full_tree = ncbi.get_topology(taxids, intermediate_nodes=True)
print(full_tree.get_ascii())
```

### Tree Annotation

```python
from ete3 import PhyloTree, NCBITaxa

# Create phylogenetic tree
tree = PhyloTree("(9606:1,(9598:0.5,9593:0.5):0.5);")  # Using taxids as names

# Initialize NCBI taxonomy
ncbi = NCBITaxa()

# Annotate the tree in place; the return value is a tuple of lookup
# dictionaries (tax2name, tax2lineages, tax2rank), not the tree
tax2name, tax2lineages, tax2rank = ncbi.annotate_tree(tree, taxid_attr="name")

# Access taxonomic information
for node in tree.traverse():
    if hasattr(node, 'sci_name'):
        print(f"Node {node.name}: {node.sci_name} ({node.rank})")
```

### Fuzzy Name Matching

```python
from ete3 import NCBITaxa

ncbi = NCBITaxa()

# Handle names with potential typos or variations.
# get_fuzzy_name_translation() takes one name per call and returns
# (taxid, matched_name, score); taxid is None when nothing matches.
fuzzy_names = ['Homo sapian', 'chimpanzee', 'gorill']

for name in fuzzy_names:
    taxid, matched_name, score = ncbi.get_fuzzy_name_translation(name, sim=0.8)
    if taxid is None:
        print(f"'{name}' -> no match")
    else:
        print(f"'{name}' -> {taxid} ({matched_name})")
```

### Advanced Taxonomic Analysis

```python
from ete3 import NCBITaxa, PhyloTree

ncbi = NCBITaxa()

# Get all primates (the translator returns a list of taxids per name)
primate_taxid = ncbi.get_name_translator(['Primates'])['Primates'][0]
primate_descendants = ncbi.get_descendant_taxa(primate_taxid, rank_limit='species')

# Create comprehensive primate tree
primate_tree = ncbi.get_topology(primate_descendants[:50])  # Limit for example

# Analyze taxonomic ranks
ranks = ncbi.get_rank(primate_descendants[:20])
rank_counts = {}
for taxid, rank in ranks.items():
    rank_counts[rank] = rank_counts.get(rank, 0) + 1

print(f"Taxonomic rank distribution: {rank_counts}")
```

### Database Updates and Management

```python
from ete3 import NCBITaxa

# Update local taxonomy database (run periodically)
ncbi = NCBITaxa()
# ncbi.update_taxonomy_database()  # Uncomment to actually update

# Use custom database file
ncbi_custom = NCBITaxa(dbfile="/path/to/custom/taxa.sqlite")

# Check database version/status
# Access internal database methods if needed for maintenance
```

### Integration with Phylogenetic Analysis

```python
from ete3 import PhyloTree, NCBITaxa

# Gene tree with species information
gene_tree = PhyloTree("(human_gene1:0.1,(chimp_gene1:0.05,gorilla_gene1:0.05):0.02);")

# Set up species naming
gene_tree.set_species_naming_function(lambda x: x.split('_')[0])

# Get NCBI taxonomy for comparison
ncbi = NCBITaxa()
species_names = ['human', 'chimp', 'gorilla']
name_mapping = {'human': 'Homo sapiens', 'chimp': 'Pan troglodytes', 'gorilla': 'Gorilla gorilla'}
full_names = [name_mapping[sp] for sp in species_names]
# Each name maps to a list of taxids; take the first match
name2taxid = ncbi.get_name_translator(full_names)
taxids = [name2taxid[name][0] for name in full_names]

# Create species tree from NCBI
species_tree = ncbi.get_topology(taxids)

# Compare gene tree topology with species tree
# (This would involve reconciliation analysis)
print("Gene tree topology:")
print(gene_tree.get_ascii())
print("Species tree topology:")
print(species_tree.get_ascii())
```