# Database Queries and Annotations

Scanpy's queries module provides tools for enriching single-cell analysis with external database information. This includes querying Biomart for gene annotations, performing gene enrichment analysis, and retrieving genomic coordinates and mitochondrial gene lists.

## Capabilities

### Biomart Gene Annotations

Query Ensembl Biomart database for comprehensive gene information.

```python { .api }
def biomart_annotations(org, attrs, values=None, use_cache=True, **kwargs):
    """
    Retrieve gene annotations from Ensembl Biomart.

    Parameters:
    - org (str): Organism name (e.g., 'hsapiens', 'mmusculus')
    - attrs (list): List of attributes to retrieve from Biomart
    - values (list, optional): List of gene identifiers to query
    - use_cache (bool): Use cached results if available
    - **kwargs: Additional Biomart query parameters

    Returns:
    DataFrame: Gene annotations with requested attributes
    """
```

### Gene Enrichment Analysis

Perform functional enrichment analysis using g:Profiler.

```python { .api }
def enrich(
    gene_list,
    organism='hsapiens',
    sources=None,
    background=None,
    domain_scope='annotated',
    significance_threshold_method='g_SCS',
    user_threshold=0.05,
    ordered_query=False,
    measure_underrepresentation=False,
    evcodes=False,
    combined=True,
    **kwargs,
):
    """
    Gene enrichment analysis using g:Profiler API.

    Parameters:
    - gene_list (list): List of gene identifiers for enrichment
    - organism (str): Organism code ('hsapiens', 'mmusculus', etc.)
    - sources (list, optional): Databases to query (GO:BP, GO:MF, GO:CC, KEGG, etc.)
    - background (list, optional): Background gene set
    - domain_scope (str): Statistical domain scope
    - significance_threshold_method (str): Multiple testing correction method
    - user_threshold (float): Significance threshold
    - ordered_query (bool): Whether gene list is ordered by importance
    - measure_underrepresentation (bool): Test for underrepresentation
    - evcodes (bool): Include GO evidence codes
    - combined (bool): Use combined g:SCS threshold
    - **kwargs: Additional g:Profiler parameters

    Returns:
    DataFrame: Enrichment results with p-values, terms, and descriptions
    """
```

### Genomic Coordinates

Retrieve genomic coordinates for genes from Ensembl.

```python { .api }
def gene_coordinates(gene_list, org='hsapiens', gene_symbols=True, **kwargs):
    """
    Get genomic coordinates for a list of genes.

    Parameters:
    - gene_list (list): List of gene identifiers
    - org (str): Organism ('hsapiens', 'mmusculus', etc.)
    - gene_symbols (bool): Whether input are gene symbols (vs Ensembl IDs)
    - **kwargs: Additional query parameters

    Returns:
    DataFrame: Genomic coordinates with chromosome, start, end positions
    """
```

### Mitochondrial Gene Lists

Retrieve lists of mitochondrial genes for QC filtering.

```python { .api }
def mitochondrial_genes(org='hsapiens', gene_symbols=True, **kwargs):
    """
    Get list of mitochondrial genes for given organism.

    Parameters:
    - org (str): Organism ('hsapiens', 'mmusculus', etc.)
    - gene_symbols (bool): Return gene symbols (vs Ensembl IDs)
    - **kwargs: Additional query parameters

    Returns:
    list: List of mitochondrial gene identifiers
    """
```

## Usage Examples

### Basic Gene Annotation

```python
import scanpy as sc
import pandas as pd

# Get basic gene information
gene_list = ['CD3D', 'CD4', 'CD8A', 'IL2', 'IFNG']
annotations = sc.queries.biomart_annotations(
    org='hsapiens',
    attrs=['ensembl_gene_id', 'external_gene_name', 'description', 'gene_biotype'],
    values=gene_list
)

print(annotations.head())
```

### Gene Enrichment Analysis

```python
# Get marker genes from clustering
# (assumes `adata` is an AnnData object with cluster labels in adata.obs['leiden'])
sc.tl.rank_genes_groups(adata, 'leiden', method='wilcoxon')

# Extract top marker genes for cluster 0
marker_genes = sc.get.rank_genes_groups_df(adata, group='0').head(100)['names'].tolist()

# Perform enrichment analysis ('REAC' is the g:Profiler source ID for Reactome)
enrichment_results = sc.queries.enrich(
    gene_list=marker_genes,
    organism='hsapiens',
    sources=['GO:BP', 'GO:MF', 'KEGG', 'REAC']
)

# Show top enriched terms
top_terms = enrichment_results.head(10)
print(top_terms[['native', 'name', 'p_value', 'term_size']])
```

### Working with Genomic Coordinates

```python
# Get coordinates for genes of interest
genes_of_interest = ['TP53', 'MYC', 'EGFR', 'VEGFA']
coordinates = sc.queries.gene_coordinates(
    gene_list=genes_of_interest,
    org='hsapiens'
)

print(coordinates)

# Use coordinates for downstream analysis
for _, gene_info in coordinates.iterrows():
    print(
        f"{gene_info['external_gene_name']}: "
        f"{gene_info['chromosome_name']}:"
        f"{gene_info['start_position']}-{gene_info['end_position']}"
    )
```

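Ensembl reports chromosome names without a `chr` prefix, so coordinates often need reformatting before they are passed to tools that expect UCSC-style regions. The following is a minimal sketch, assuming each row exposes the `chromosome_name`, `start_position` and `end_position` columns used above. `format_region` is a hypothetical helper, not part of Scanpy, and the numbers in the example are placeholders rather than real annotations.

```python
from typing import Mapping


def format_region(row: Mapping[str, object], chrom_prefix: str = "chr") -> str:
    """Build a 'chrN:start-end' region string from one coordinate record."""
    chrom = str(row["chromosome_name"])
    start = int(row["start_position"])
    end = int(row["end_position"])
    if start > end:
        raise ValueError(f"start ({start}) is greater than end ({end})")
    return f"{chrom_prefix}{chrom}:{start}-{end}"


# Placeholder record, not a real gene annotation
example = {"chromosome_name": "17", "start_position": 1000, "end_position": 2500}
print(format_region(example))  # chr17:1000-2500
```

Because the helper only needs key-based access, it should work equally on a plain dictionary or on a row yielded by `DataFrame.iterrows()`.
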
### Quality Control with Mitochondrial Genes

```python
# Get mitochondrial genes for human
mt_genes = sc.queries.mitochondrial_genes(org='hsapiens')
print(f"Found {len(mt_genes)} mitochondrial genes")

# Mark mitochondrial genes in AnnData object
adata.var['mt'] = adata.var_names.isin(mt_genes)

# Calculate QC metrics; qc_vars=['mt'] is required to produce pct_counts_mt
sc.pp.calculate_qc_metrics(
    adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True
)

# Filter cells with high mitochondrial content (copy so the result is not a view)
adata = adata[adata.obs['pct_counts_mt'] < 20, :].copy()
```

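When the Biomart service cannot be reached, a widely used fallback is to flag mitochondrial genes by symbol prefix. The sketch below assumes the usual naming conventions (`MT-` for human symbols, `mt-` for mouse symbols); `MT_PREFIXES` and `flag_mitochondrial` are hypothetical names introduced for illustration, and other organisms would need their own prefixes checked before use.

```python
from typing import Iterable, List

# Assumed symbol prefixes; verify the nomenclature before adding other organisms.
MT_PREFIXES = {"hsapiens": "MT-", "mmusculus": "mt-"}


def flag_mitochondrial(gene_symbols: Iterable[str], org: str = "hsapiens") -> List[bool]:
    """Return one boolean per symbol: True if it carries the mitochondrial prefix."""
    try:
        prefix = MT_PREFIXES[org]
    except KeyError:
        raise ValueError(f"No mitochondrial prefix known for organism {org!r}") from None
    return [symbol.startswith(prefix) for symbol in gene_symbols]


# MTOR is a nuclear gene; it is not flagged because the prefix includes the hyphen.
print(flag_mitochondrial(["MT-CO1", "CD4", "MTOR"]))  # [True, False, False]
```

The resulting list has one entry per gene, so it could be assigned to `adata.var['mt']` in place of the database-backed lookup.
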
### Cross-Species Analysis

```python
# Human-mouse gene mapping
human_genes = ['CD3D', 'CD4', 'CD8A']
mouse_genes = ['Cd3d', 'Cd4', 'Cd8a']

# Get human annotations
human_annotations = sc.queries.biomart_annotations(
    org='hsapiens',
    attrs=['ensembl_gene_id', 'external_gene_name', 'mmusculus_homolog_ensembl_gene'],
    values=human_genes
)

# Get mouse annotations
mouse_annotations = sc.queries.biomart_annotations(
    org='mmusculus',
    attrs=['ensembl_gene_id', 'external_gene_name', 'hsapiens_homolog_ensembl_gene'],
    values=mouse_genes
)

print("Human-mouse homolog mapping:")
print(human_annotations[['external_gene_name', 'mmusculus_homolog_ensembl_gene']])
```

### Comprehensive Gene Annotation Pipeline

```python
def annotate_gene_list(gene_list, organism='hsapiens'):
    """Comprehensive gene annotation pipeline."""

    # Get basic annotations
    basic_info = sc.queries.biomart_annotations(
        org=organism,
        attrs=[
            'ensembl_gene_id',
            'external_gene_name',
            'description',
            'gene_biotype',
            'chromosome_name',
            'start_position',
            'end_position'
        ],
        values=gene_list
    )

    # Perform enrichment analysis
    enrichment = sc.queries.enrich(
        gene_list=gene_list,
        organism=organism,
        sources=['GO:BP', 'GO:MF', 'GO:CC', 'KEGG']
    )

    # Get mitochondrial gene status
    mt_genes = sc.queries.mitochondrial_genes(org=organism)
    basic_info['is_mitochondrial'] = basic_info['external_gene_name'].isin(mt_genes)

    return {
        'annotations': basic_info,
        'enrichment': enrichment,
        'mt_genes': mt_genes
    }

# Use the pipeline
results = annotate_gene_list(['CD3D', 'CD4', 'CD8A', 'IL2', 'IFNG'])
```

### Integration with Marker Gene Analysis

```python
# Complete workflow: clustering -> marker genes -> annotation -> enrichment
def analyze_cluster_markers(adata, cluster_key='leiden', n_genes=50):
    """Analyze cluster markers with comprehensive annotation."""

    # Find marker genes
    sc.tl.rank_genes_groups(adata, cluster_key, method='wilcoxon')

    results = {}

    for cluster in adata.obs[cluster_key].unique():
        # Get top marker genes
        marker_df = sc.get.rank_genes_groups_df(adata, group=cluster)
        top_markers = marker_df.head(n_genes)['names'].tolist()

        # Annotate markers
        annotations = sc.queries.biomart_annotations(
            org='hsapiens',
            attrs=['external_gene_name', 'description', 'gene_biotype'],
            values=top_markers
        )

        # Enrichment analysis
        enrichment = sc.queries.enrich(
            gene_list=top_markers,
            organism='hsapiens',
            sources=['GO:BP', 'KEGG'],
            user_threshold=0.01
        )

        results[cluster] = {
            'markers': top_markers,
            'annotations': annotations,
            'enrichment': enrichment.head(10) if not enrichment.empty else None
        }

    return results

# Run comprehensive analysis
cluster_analysis = analyze_cluster_markers(adata, n_genes=30)

# Print results for each cluster
for cluster, data in cluster_analysis.items():
    print(f"\nCluster {cluster}:")
    print(f"Top markers: {', '.join(data['markers'][:10])}")

    if data['enrichment'] is not None:
        print("Top enriched pathways:")
        for _, pathway in data['enrichment'].head(3).iterrows():
            print(f"  - {pathway['name']} (p={pathway['p_value']:.2e})")
```

## Best Practices

### Query Optimization

1. **Caching**: Use cached results when possible to avoid repeated API calls
2. **Batch Queries**: Query multiple genes at once rather than individually
3. **Attribute Selection**: Only request necessary attributes to reduce response size
4. **Rate Limiting**: Be mindful of API rate limits for large queries

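The caching and batching points above can be combined in a small wrapper. This is a sketch under stated assumptions rather than part of Scanpy: `fetch` stands for any hypothetical callable that takes a list of gene identifiers and returns a dictionary keyed by identifier, and `fake_fetch` is a stand-in that avoids a network call.

```python
from typing import Callable, Dict, Iterable, List


def chunked(items: List[str], size: int) -> Iterable[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


class CachedBatchQuery:
    """Wrap a batch lookup so each identifier is requested at most once."""

    def __init__(self, fetch: Callable[[List[str]], Dict[str, dict]], batch_size: int = 200):
        self._fetch = fetch          # hypothetical: list of IDs -> {id: record}
        self._batch_size = batch_size
        self._cache: Dict[str, dict] = {}
        self.calls = 0               # number of fetch calls made so far

    def get(self, genes: List[str]) -> Dict[str, dict]:
        # Deduplicate while keeping order, then skip identifiers already cached.
        missing = [g for g in dict.fromkeys(genes) if g not in self._cache]
        for batch in chunked(missing, self._batch_size):
            self._cache.update(self._fetch(batch))
            self.calls += 1
        # Identifiers the service did not return are absent from the result.
        return {g: self._cache[g] for g in genes if g in self._cache}


def fake_fetch(batch: List[str]) -> Dict[str, dict]:
    """Stand-in for a remote query; returns a placeholder record per gene."""
    return {gene: {"symbol": gene} for gene in batch}


query = CachedBatchQuery(fake_fetch, batch_size=2)
query.get(["CD3D", "CD4", "CD8A"])   # two batches: [CD3D, CD4] and [CD8A]
query.get(["CD4", "IL2"])            # only IL2 is new, so one more call
print(query.calls)  # 3
```

A smaller `batch_size` spreads a large request over more calls, which can be paired with a pause between batches when a service enforces rate limits.
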
### Error Handling

```python
import time


def robust_query(gene_list, max_retries=3, delay=1):
    """Query with retry logic and error handling."""
    for attempt in range(max_retries):
        try:
            return sc.queries.biomart_annotations(
                org='hsapiens',
                attrs=['external_gene_name', 'description'],
                values=gene_list
            )
        except Exception as e:
            if attempt < max_retries - 1:
                print(f"Query failed (attempt {attempt + 1}): {e}")
                time.sleep(delay * (2 ** attempt))  # Exponential backoff
            else:
                raise

# Use robust querying
annotations = robust_query(['CD3D', 'CD4', 'CD8A'])
```

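The same retry pattern can be written independently of any particular query so that it is reusable and testable without waiting. The following is a sketch, not a Scanpy utility: `retry` is a hypothetical helper, and the injectable `sleep` argument is an assumption made so the backoff schedule can be inspected.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry(
    func: Callable[[], T],
    max_retries: int = 3,
    delay: float = 1.0,
    exceptions: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `func` until it succeeds, doubling the wait after each failure."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    for attempt in range(max_retries):
        try:
            return func()
        except exceptions:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            sleep(delay * 2 ** attempt)  # waits of delay, 2*delay, 4*delay, ...
    raise AssertionError("unreachable")


# Demonstration with a function that fails twice before succeeding.
attempts = {"count": 0}
waits = []


def flaky() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


print(retry(flaky, exceptions=(ConnectionError,), sleep=waits.append))  # ok
print(waits)  # [1.0, 2.0]
```

Restricting `exceptions` to connection-type errors avoids retrying failures that will not resolve on their own, such as a misspelled attribute name.
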
### Data Integration

```python
# Integrate query results with AnnData object
def add_gene_annotations(adata, organism='hsapiens'):
    """Add gene annotations to AnnData var."""

    gene_list = adata.var_names.tolist()

    # Get annotations
    annotations = sc.queries.biomart_annotations(
        org=organism,
        attrs=['external_gene_name', 'description', 'gene_biotype', 'chromosome_name'],
        values=gene_list
    )

    # Index by gene symbol, keeping the first row when a symbol appears more than once
    annotations = annotations.drop_duplicates(subset='external_gene_name')
    annotations = annotations.set_index('external_gene_name')

    # Add to adata.var; reindex yields NaN for genes missing from the query result,
    # whereas .loc would raise a KeyError for them
    for col in ['description', 'gene_biotype', 'chromosome_name']:
        if col in annotations.columns:
            aligned = annotations[col].reindex(adata.var_names).fillna('Unknown')
            adata.var[col] = aligned.to_numpy()

    # Mark mitochondrial genes
    mt_genes = sc.queries.mitochondrial_genes(org=organism)
    adata.var['mitochondrial'] = adata.var_names.isin(mt_genes)

    return adata

# Add annotations to your data
adata = add_gene_annotations(adata)
```

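The alignment step is where integration most often goes wrong: a query can return several rows for one symbol and no row at all for another. The sketch below reproduces that logic with plain dictionaries so the behaviour is easy to see. `align_annotations` is a hypothetical helper, and the records are made-up placeholders.

```python
from typing import Dict, Iterable, List, Mapping, Optional


def align_annotations(
    var_names: Iterable[str],
    rows: Iterable[Mapping[str, Optional[str]]],
    key: str = "external_gene_name",
    column: str = "gene_biotype",
    default: str = "Unknown",
) -> List[str]:
    """Return one value per entry of `var_names`, in the same order."""
    lookup: Dict[str, Optional[str]] = {}
    for row in rows:
        # setdefault keeps the first record seen for a duplicated symbol.
        lookup.setdefault(row[key], row.get(column))
    aligned = []
    for name in var_names:
        value = lookup.get(name)
        # Missing genes and empty values both fall back to the default.
        aligned.append(default if value is None else value)
    return aligned


# Placeholder records: one duplicated symbol and one symbol without a value.
rows = [
    {"external_gene_name": "GENE_A", "gene_biotype": "protein_coding"},
    {"external_gene_name": "GENE_A", "gene_biotype": "lncRNA"},
    {"external_gene_name": "GENE_B", "gene_biotype": None},
]
print(align_annotations(["GENE_A", "GENE_C", "GENE_B"], rows))
# ['protein_coding', 'Unknown', 'Unknown']
```

The output always has the same length and order as `var_names`, which is the property that makes it safe to assign as a new column.
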
## Supported Organisms

Common organism codes for queries:

- `'hsapiens'` - Human
- `'mmusculus'` - Mouse
- `'drerio'` - Zebrafish
- `'dmelanogaster'` - Fruit fly
- `'celegans'` - C. elegans
- `'rnorvegicus'` - Rat
- `'scerevisiae'` - Yeast

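These codes correspond to Ensembl Biomart gene datasets, whose names are formed by appending `_gene_ensembl` to the organism code (for example `hsapiens_gene_ensembl`). The sketch below shows a small lookup built from the list above; `COMMON_NAMES`, `organism_code` and `ensembl_dataset` are hypothetical helpers introduced for illustration.

```python
# Common names mapped to the organism codes listed above.
COMMON_NAMES = {
    "human": "hsapiens",
    "mouse": "mmusculus",
    "zebrafish": "drerio",
    "fruit fly": "dmelanogaster",
    "c. elegans": "celegans",
    "rat": "rnorvegicus",
    "yeast": "scerevisiae",
}


def organism_code(name: str) -> str:
    """Accept either a common name or an organism code and return the code."""
    key = name.strip().lower()
    if key in COMMON_NAMES:
        return COMMON_NAMES[key]
    if key in COMMON_NAMES.values():
        return key
    raise ValueError(f"Unknown organism: {name!r}")


def ensembl_dataset(org: str) -> str:
    """Return the Ensembl Biomart gene dataset name for an organism code."""
    return f"{org}_gene_ensembl"


print(ensembl_dataset(organism_code("Mouse")))  # mmusculus_gene_ensembl
```
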
## External Dependencies

The queries module requires internet connectivity and may depend on:

- `pybiomart` - For Ensembl Biomart queries
- `gprofiler-official` - For enrichment analysis
- `requests` - For API calls
- `pandas` - For data handling

Install additional dependencies if needed:

```bash
pip install pybiomart gprofiler-official requests
```