Tessl Tile for pypi/scanpy@1.11.0

# Database Queries and Annotations

Scanpy's queries module (`sc.queries`) enriches single-cell analyses with information from external databases: gene annotations from Ensembl Biomart, functional enrichment via g:Profiler, genomic coordinates, and mitochondrial gene lists.

## Capabilities

### Biomart Gene Annotations

Query the Ensembl Biomart database for gene-level attributes. The query returns the requested attributes for every gene in the organism's dataset; restrict the result to specific genes afterwards with pandas.

```python { .api }
def biomart_annotations(org, attrs, *, host='www.ensembl.org', use_cache=False):
    """
    Retrieve gene annotations from Ensembl Biomart.

    Parameters:
    - org (str): Organism to query (e.g., 'hsapiens', 'mmusculus')
    - attrs (iterable of str): Biomart attributes to retrieve
    - host (str): Ensembl host to query
    - use_cache (bool): Whether pybiomart should use a cache for requests

    Returns:
    DataFrame: One column per requested attribute
    """
```

### Gene Enrichment Analysis

Perform functional enrichment analysis with g:Profiler. `enrich` is a thin wrapper around `GProfiler.profile` from `gprofiler-official`; g:Profiler options are passed through `gprofiler_kwargs`. It also accepts an `AnnData` object plus a group name, in which case it tests the differentially expressed genes stored by `rank_genes_groups`.

```python { .api }
def enrich(container, *, org='hsapiens', gprofiler_kwargs=MappingProxyType({})):
    """
    Gene enrichment analysis using the g:Profiler API.

    Parameters:
    - container (iterable of str, or mapping of str to iterable of str): Gene
      identifiers to test, or named gene sets that are queried together
    - org (str): Organism code ('hsapiens', 'mmusculus', etc.)
    - gprofiler_kwargs (mapping): Keyword arguments forwarded to
      GProfiler.profile, for example:
      - sources (list): Databases to query (GO:BP, GO:MF, GO:CC, KEGG, REAC, etc.)
      - background (list): Background gene set
      - domain_scope (str): Statistical domain scope
      - significance_threshold_method (str): Multiple testing correction method
      - user_threshold (float): Significance threshold
      - ordered (bool): Whether the gene list is ordered by importance
      - measure_underrepresentation (bool): Test for underrepresentation
      - no_evidences (bool): Omit GO evidence codes from the result

    Returns:
    DataFrame: Enrichment results with p-values, terms, and descriptions
    """
```

### Genomic Coordinates

Retrieve the genomic coordinates of a single gene from Ensembl Biomart.

```python { .api }
def gene_coordinates(org, gene_name, *, gene_attr='external_gene_name', chr_exclude=(), host='www.ensembl.org', use_cache=False):
    """
    Get genomic coordinates for one gene.

    Parameters:
    - org (str): Organism ('hsapiens', 'mmusculus', etc.)
    - gene_name (str): Identifier of the gene to look up
    - gene_attr (str): Biomart attribute that gene_name refers to
      ('external_gene_name' for symbols, 'ensembl_gene_id' for Ensembl IDs)
    - chr_exclude (iterable of str): Chromosome names to drop from the result
    - host (str): Ensembl host to query
    - use_cache (bool): Whether pybiomart should use a cache for requests

    Returns:
    DataFrame: Columns chromosome_name, start_position, end_position
    """
```

### Mitochondrial Gene Lists

Retrieve identifiers of mitochondrial genes for QC filtering.

```python { .api }
def mitochondrial_genes(org, *, attrname='external_gene_name', host='www.ensembl.org', use_cache=False, chromosome='MT'):
    """
    Get identifiers of mitochondrial genes for a given organism.

    Parameters:
    - org (str): Organism ('hsapiens', 'mmusculus', etc.)
    - attrname (str): Biomart attribute to return
      ('external_gene_name' for symbols, 'ensembl_gene_id' for Ensembl IDs)
    - host (str): Ensembl host to query
    - use_cache (bool): Whether pybiomart should use a cache for requests
    - chromosome (str): Name of the mitochondrial chromosome in the dataset

    Returns:
    DataFrame: Single column, named after attrname, of mitochondrial gene identifiers
    """
```
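
When a Biomart lookup is not possible (for example, when working offline), a common alternative is to flag mitochondrial genes by symbol prefix. The following is a minimal sketch, assuming human gene symbols that start with `MT-`; `flag_mito_by_prefix` is a hypothetical helper for illustration, not part of scanpy.

```python
def flag_mito_by_prefix(gene_symbols, prefix="MT-"):
    """Return one boolean per symbol: True if the symbol starts with the prefix."""
    return [symbol.startswith(prefix) for symbol in gene_symbols]


# 'MTOR' is not mitochondrial; the trailing hyphen in the prefix excludes it
symbols = ["MT-CO1", "CD3D", "MT-ND1", "MTOR"]
mito_flags = flag_mito_by_prefix(symbols)
print(mito_flags)  # [True, False, True, False]
```

The prefix is an assumption about the naming convention of the dataset at hand, so it should be checked against the gene names before use.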

## Usage Examples

### Basic Gene Annotation

```python
import scanpy as sc

# Fetch the attributes for all human genes, then keep the genes of interest
gene_list = ['CD3D', 'CD4', 'CD8A', 'IL2', 'IFNG']
annotations = sc.queries.biomart_annotations(
    'hsapiens',
    ['ensembl_gene_id', 'external_gene_name', 'description', 'gene_biotype'],
)
annotations = annotations[annotations['external_gene_name'].isin(gene_list)]

print(annotations.head())
```

### Gene Enrichment Analysis

```python
# Rank marker genes per cluster (assumes `adata` already has a 'leiden' clustering)
sc.tl.rank_genes_groups(adata, 'leiden', method='wilcoxon')

# Extract the top 100 marker genes for cluster 0
marker_genes = sc.get.rank_genes_groups_df(adata, group='0').head(100)['names'].tolist()

# Perform enrichment analysis; g:Profiler options go through gprofiler_kwargs
enrichment_results = sc.queries.enrich(
    marker_genes,
    org='hsapiens',
    gprofiler_kwargs={'sources': ['GO:BP', 'GO:MF', 'KEGG', 'REAC']},
)

# Show top enriched terms
top_terms = enrichment_results.head(10)
print(top_terms[['native', 'name', 'p_value', 'term_size']])
```

### Working with Genomic Coordinates

```python
import pandas as pd

# gene_coordinates looks up one gene per call, so query each gene in turn
genes_of_interest = ['TP53', 'MYC', 'EGFR', 'VEGFA']
frames = []
for gene in genes_of_interest:
    gene_df = sc.queries.gene_coordinates('hsapiens', gene)
    frames.append(gene_df.assign(gene=gene))  # remember which gene each row belongs to
coordinates = pd.concat(frames, ignore_index=True)

print(coordinates)

# Use coordinates for downstream analysis
for _, gene_info in coordinates.iterrows():
    print(f"{gene_info['gene']}: {gene_info['chromosome_name']}:{gene_info['start_position']}-{gene_info['end_position']}")
```

### Quality Control with Mitochondrial Genes

```python
# Get mitochondrial gene symbols for human (returned as a one-column DataFrame)
mt_genes = sc.queries.mitochondrial_genes('hsapiens')['external_gene_name']
print(f"Found {len(mt_genes)} mitochondrial genes")

# Mark mitochondrial genes in the AnnData object
adata.var['mt'] = adata.var_names.isin(mt_genes)

# Calculate QC metrics; qc_vars=['mt'] is what produces obs['pct_counts_mt']
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)

# Keep cells with less than 20% mitochondrial counts
adata = adata[adata.obs['pct_counts_mt'] < 20, :].copy()
```

### Cross-Species Analysis

```python
# Human-mouse gene mapping
human_genes = ['CD3D', 'CD4', 'CD8A']
mouse_genes = ['Cd3d', 'Cd4', 'Cd8a']

# Get human annotations together with mouse homolog IDs
human_annotations = sc.queries.biomart_annotations(
    'hsapiens',
    ['ensembl_gene_id', 'external_gene_name', 'mmusculus_homolog_ensembl_gene'],
)
human_annotations = human_annotations[human_annotations['external_gene_name'].isin(human_genes)]

# Get mouse annotations together with human homolog IDs
mouse_annotations = sc.queries.biomart_annotations(
    'mmusculus',
    ['ensembl_gene_id', 'external_gene_name', 'hsapiens_homolog_ensembl_gene'],
)
mouse_annotations = mouse_annotations[mouse_annotations['external_gene_name'].isin(mouse_genes)]

print("Human-mouse homolog mapping:")
print(human_annotations[['external_gene_name', 'mmusculus_homolog_ensembl_gene']])
```

### Comprehensive Gene Annotation Pipeline

```python
def annotate_gene_list(gene_list, organism='hsapiens'):
    """Comprehensive gene annotation pipeline."""

    # Get basic annotations and keep only the requested genes
    basic_info = sc.queries.biomart_annotations(
        organism,
        [
            'ensembl_gene_id',
            'external_gene_name',
            'description',
            'gene_biotype',
            'chromosome_name',
            'start_position',
            'end_position',
        ],
    )
    basic_info = basic_info[basic_info['external_gene_name'].isin(gene_list)].copy()

    # Perform enrichment analysis
    enrichment = sc.queries.enrich(
        gene_list,
        org=organism,
        gprofiler_kwargs={'sources': ['GO:BP', 'GO:MF', 'GO:CC', 'KEGG']},
    )

    # Get mitochondrial gene status
    mt_genes = sc.queries.mitochondrial_genes(organism)['external_gene_name'].tolist()
    basic_info['is_mitochondrial'] = basic_info['external_gene_name'].isin(mt_genes)

    return {
        'annotations': basic_info,
        'enrichment': enrichment,
        'mt_genes': mt_genes
    }

# Use the pipeline
results = annotate_gene_list(['CD3D', 'CD4', 'CD8A', 'IL2', 'IFNG'])
```

### Integration with Marker Gene Analysis

```python
# Complete workflow: clustering -> marker genes -> annotation -> enrichment
def analyze_cluster_markers(adata, cluster_key='leiden', n_genes=50):
    """Analyze cluster markers with comprehensive annotation."""

    # Find marker genes
    sc.tl.rank_genes_groups(adata, cluster_key, method='wilcoxon')

    # Fetch annotations once; they are filtered per cluster below
    all_annotations = sc.queries.biomart_annotations(
        'hsapiens',
        ['external_gene_name', 'description', 'gene_biotype'],
    )

    results = {}

    for cluster in adata.obs[cluster_key].unique():
        # Get top marker genes
        marker_df = sc.get.rank_genes_groups_df(adata, group=cluster)
        top_markers = marker_df.head(n_genes)['names'].tolist()

        # Annotate markers
        annotations = all_annotations[all_annotations['external_gene_name'].isin(top_markers)]

        # Enrichment analysis
        enrichment = sc.queries.enrich(
            top_markers,
            org='hsapiens',
            gprofiler_kwargs={'sources': ['GO:BP', 'KEGG'], 'user_threshold': 0.01},
        )

        results[cluster] = {
            'markers': top_markers,
            'annotations': annotations,
            'enrichment': enrichment.head(10) if not enrichment.empty else None
        }

    return results

# Run comprehensive analysis
cluster_analysis = analyze_cluster_markers(adata, n_genes=30)

# Print results for each cluster
for cluster, data in cluster_analysis.items():
    print(f"\nCluster {cluster}:")
    print(f"Top markers: {', '.join(data['markers'][:10])}")

    if data['enrichment'] is not None:
        print("Top enriched pathways:")
        for _, pathway in data['enrichment'].head(3).iterrows():
            print(f"  - {pathway['name']} (p={pathway['p_value']:.2e})")
```

## Best Practices

### Query Optimization

1. **Caching**: Pass `use_cache=True` to Biomart queries to avoid repeated API calls
2. **Batch Queries**: Fetch annotations once and filter them locally rather than querying gene by gene
3. **Attribute Selection**: Only request necessary attributes to reduce response size
4. **Rate Limiting**: Be mindful of API rate limits for large queries
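
The caching and batching advice above can be sketched as follows. This is a minimal illustration rather than scanpy's own mechanism: `fetch_annotations` is a hypothetical stand-in for a network query that returns simulated records, and `functools.lru_cache` keeps its result in memory so that repeated lookups reuse a single fetch.

```python
from functools import lru_cache

CALLS = {"count": 0}  # counts how often the simulated remote query runs


@lru_cache(maxsize=None)
def fetch_annotations(org, attrs):
    """Hypothetical stand-in for a remote annotation query.

    `attrs` must be a tuple so that the arguments are hashable for lru_cache.
    """
    CALLS["count"] += 1
    # Simulated response: one record per gene
    return (
        {"external_gene_name": "CD3D", "gene_biotype": "protein_coding"},
        {"external_gene_name": "CD4", "gene_biotype": "protein_coding"},
        {"external_gene_name": "MT-CO1", "gene_biotype": "protein_coding"},
    )


def annotate(genes, org="hsapiens", attrs=("external_gene_name", "gene_biotype")):
    """Fetch once (cached), then filter locally for the requested genes."""
    wanted = set(genes)
    return [row for row in fetch_annotations(org, attrs) if row["external_gene_name"] in wanted]


first = annotate(["CD3D"])
second = annotate(["CD4", "MT-CO1"])
print(len(first), len(second), CALLS["count"])  # 1 2 1
```

Both lookups are served by one simulated fetch, which is the effect the caching and batching items aim for with real queries.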

### Error Handling

```python
import time

def robust_query(gene_list, max_retries=3, delay=1):
    """Query with retry logic and error handling."""
    for attempt in range(max_retries):
        try:
            annotations = sc.queries.biomart_annotations(
                'hsapiens',
                ['external_gene_name', 'description'],
            )
            # Keep only the requested genes
            return annotations[annotations['external_gene_name'].isin(gene_list)]
        except Exception as e:
            if attempt < max_retries - 1:
                print(f"Query failed (attempt {attempt + 1}): {e}")
                time.sleep(delay * (2 ** attempt))  # Exponential backoff
            else:
                raise

# Use robust querying
annotations = robust_query(['CD3D', 'CD4', 'CD8A'])
```

### Data Integration

```python
# Integrate query results with AnnData object
def add_gene_annotations(adata, organism='hsapiens'):
    """Add gene annotations to AnnData var (assumes var_names are gene symbols)."""

    # Get annotations for all genes of the organism
    annotations = sc.queries.biomart_annotations(
        organism,
        ['external_gene_name', 'description', 'gene_biotype', 'chromosome_name'],
    )

    # Index by gene symbol, keep the first record per symbol, and align to adata.var
    annotations = (
        annotations.drop_duplicates('external_gene_name')
        .set_index('external_gene_name')
        .reindex(adata.var_names)
    )

    # Add to adata.var; genes without a Biomart record get 'Unknown'
    for col in ['description', 'gene_biotype', 'chromosome_name']:
        adata.var[col] = annotations[col].fillna('Unknown').to_numpy()

    # Mark mitochondrial genes
    mt_genes = sc.queries.mitochondrial_genes(organism)['external_gene_name']
    adata.var['mitochondrial'] = adata.var_names.isin(mt_genes)

    return adata

# Add annotations to your data
adata = add_gene_annotations(adata)
```

## Supported Organisms

Common organism codes for queries:
- `'hsapiens'` - Human
- `'mmusculus'` - Mouse
- `'drerio'` - Zebrafish
- `'dmelanogaster'` - Fruit fly
- `'celegans'` - C. elegans
- `'rnorvegicus'` - Rat
- `'scerevisiae'` - Yeast
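
Ensembl Biomart names its gene datasets after these codes, following the pattern `<org>_gene_ensembl` (for example `hsapiens_gene_ensembl`). The sketch below normalizes a code and builds the dataset name; `COMMON_ORGANISMS` and `dataset_name` are hypothetical helpers for illustration, and codes outside this short list can be valid too.

```python
COMMON_ORGANISMS = {
    "hsapiens": "Human",
    "mmusculus": "Mouse",
    "drerio": "Zebrafish",
    "dmelanogaster": "Fruit fly",
    "celegans": "C. elegans",
    "rnorvegicus": "Rat",
    "scerevisiae": "Yeast",
}


def dataset_name(org):
    """Build the Ensembl gene dataset name for an organism code."""
    org = org.strip().lower()
    if not org.isalnum():
        raise ValueError(f"Unexpected organism code: {org!r}")
    return f"{org}_gene_ensembl"


for code in ("hsapiens", " Mmusculus "):
    label = COMMON_ORGANISMS.get(code.strip().lower(), "not in the list above")
    print(f"{dataset_name(code)} ({label})")
```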

## External Dependencies

The queries module requires internet connectivity and relies on the following packages:
- `pybiomart` - For Ensembl Biomart queries
- `gprofiler-official` - For enrichment analysis
- `requests` - For API calls
- `pandas` - For data handling

Install additional dependencies if needed:
```bash
pip install pybiomart gprofiler-official requests
```
