
# Database Queries and Annotations


Scanpy's queries module provides tools for enriching single-cell analysis with external database information. This includes querying Biomart for gene annotations, performing gene enrichment analysis, and retrieving genomic coordinates and mitochondrial gene lists.


## Capabilities


### Biomart Gene Annotations


Query the Ensembl Biomart database for comprehensive gene information.

```python { .api }
def biomart_annotations(org, attrs, *, host='www.ensembl.org', use_cache=False):
    """
    Retrieve gene annotations from Ensembl Biomart.

    Parameters:
    - org (str): Organism name (e.g., 'hsapiens', 'mmusculus')
    - attrs (list): List of attributes to retrieve from Biomart
    - host (str): Biomart host to query
    - use_cache (bool): Whether pybiomart should cache requests

    Returns:
    DataFrame: One row per gene in the organism's dataset, with the requested attributes as columns
    """
```

### Gene Enrichment Analysis


Perform functional enrichment analysis using g:Profiler.

```python { .api }
def enrich(container, *, org='hsapiens', gprofiler_kwargs=MappingProxyType({})):
    """
    Gene enrichment analysis using the g:Profiler API.

    Parameters:
    - container (list or dict): Gene identifiers to test; a dict of named lists runs one query per name
    - org (str): Organism code ('hsapiens', 'mmusculus', etc.)
    - gprofiler_kwargs (dict): Keyword arguments passed on to GProfiler.profile(), such as:
        - sources (list): Databases to query (GO:BP, GO:MF, GO:CC, KEGG, REAC, etc.)
        - background (list): Background gene set
        - domain_scope (str): Statistical domain scope
        - significance_threshold_method (str): Multiple testing correction method
        - user_threshold (float): Significance threshold
        - ordered (bool): Whether gene list is ordered by importance
        - measure_underrepresentation (bool): Test for underrepresentation
        - no_evidences (bool): Set to False to include evidence codes
        - combined (bool): Combine the results of multiple queries into one table

    Returns:
    DataFrame: Enrichment results with p-values, terms, and descriptions
    """
```

### Genomic Coordinates


Retrieve genomic coordinates for genes from Ensembl.

```python { .api }
def gene_coordinates(org, gene_name, *, gene_attr='external_gene_name', chr_exclude=(), host='www.ensembl.org', use_cache=False):
    """
    Get genomic coordinates for a gene.

    Parameters:
    - org (str): Organism ('hsapiens', 'mmusculus', etc.)
    - gene_name (str): Gene to look up
    - gene_attr (str): Biomart attribute that gene_name is matched against
      ('external_gene_name' for symbols, 'ensembl_gene_id' for Ensembl IDs)
    - chr_exclude (iterable of str): Chromosome names to drop from the result
    - host (str): Biomart host to query
    - use_cache (bool): Whether pybiomart should cache requests

    Returns:
    DataFrame: Genomic coordinates with chromosome_name, start_position, end_position
    """
```

### Mitochondrial Gene Lists


Retrieve lists of mitochondrial genes for QC filtering.

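```python { .api }
def mitochondrial_genes(org, *, attrname='external_gene_name', host='www.ensembl.org', use_cache=False, chromosome='MT'):
    """
    Get mitochondrial genes for a given organism.

    Parameters:
    - org (str): Organism ('hsapiens', 'mmusculus', etc.)
    - attrname (str): Biomart attribute to return
      ('external_gene_name' for symbols, 'ensembl_gene_id' for Ensembl IDs)
    - host (str): Biomart host to query
    - use_cache (bool): Whether pybiomart should cache requests
    - chromosome (str): Name of the mitochondrial chromosome in the organism's annotation

    Returns:
    DataFrame: Mitochondrial gene identifiers in a single column named after attrname
    """
```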

## Usage Examples


### Basic Gene Annotation

```python
import scanpy as sc
import pandas as pd

# Get basic gene information; biomart_annotations returns every gene in the dataset
gene_list = ['CD3D', 'CD4', 'CD8A', 'IL2', 'IFNG']
annotations = sc.queries.biomart_annotations(
    org='hsapiens',
    attrs=['ensembl_gene_id', 'external_gene_name', 'description', 'gene_biotype'],
)

# Keep only the genes of interest
annotations = annotations[annotations['external_gene_name'].isin(gene_list)]

print(annotations.head())
```

### Gene Enrichment Analysis

```python
# Get marker genes from clustering
sc.tl.rank_genes_groups(adata, 'leiden', method='wilcoxon')

# Extract top marker genes for cluster 0
marker_genes = sc.get.rank_genes_groups_df(adata, group='0').head(100)['names'].tolist()

# Perform enrichment analysis; g:Profiler options are passed via gprofiler_kwargs
enrichment_results = sc.queries.enrich(
    marker_genes,
    org='hsapiens',
    gprofiler_kwargs={'sources': ['GO:BP', 'GO:MF', 'KEGG', 'REAC']},
)

# Show top enriched terms
top_terms = enrichment_results.head(10)
print(top_terms[['native', 'name', 'p_value', 'term_size']])
```

### Working with Genomic Coordinates

```python
import pandas as pd

# Get coordinates for genes of interest
genes_of_interest = ['TP53', 'MYC', 'EGFR', 'VEGFA']

# gene_coordinates looks up one gene per call, so query each gene and label its rows
coordinates = pd.concat(
    [
        sc.queries.gene_coordinates('hsapiens', gene).assign(external_gene_name=gene)
        for gene in genes_of_interest
    ],
    ignore_index=True,
)

print(coordinates)

# Use coordinates for downstream analysis
for _, gene_info in coordinates.iterrows():
    print(f"{gene_info['external_gene_name']}: {gene_info['chromosome_name']}:{gene_info['start_position']}-{gene_info['end_position']}")
```

### Quality Control with Mitochondrial Genes

```python
# Get mitochondrial genes for human (returned as a one-column DataFrame)
mt_genes = sc.queries.mitochondrial_genes('hsapiens')['external_gene_name']
print(f"Found {len(mt_genes)} mitochondrial genes")

# Mark mitochondrial genes in AnnData object
adata.var['mt'] = adata.var_names.isin(mt_genes)

# Calculate QC metrics; qc_vars=['mt'] is what produces the pct_counts_mt column
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)

# Filter cells with high mitochondrial content (copy to get a standalone object, not a view)
adata = adata[adata.obs['pct_counts_mt'] < 20, :].copy()
```
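
When Biomart is unreachable, a common offline fallback is to flag mitochondrial genes by symbol prefix instead of querying for them. The sketch below assumes human-style symbols that start with `MT-` (mouse symbols typically start with `mt-`). The helper name `flag_mito_by_prefix` is hypothetical and not part of Scanpy.

```python
def flag_mito_by_prefix(gene_names, prefix="MT-"):
    """Return one boolean per gene name: True if the name starts with `prefix`."""
    return [name.startswith(prefix) for name in gene_names]


# Example symbols chosen for illustration only
genes = ["MT-CO1", "CD3D", "MT-ND1", "ACTB"]
flags = flag_mito_by_prefix(genes)
print(flags)
```

The resulting flags can be assigned to `adata.var['mt']` in place of the Biomart lookup, at the cost of relying on a naming convention rather than the annotation itself.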

### Cross-Species Analysis

```python
# Human-mouse gene mapping
human_genes = ['CD3D', 'CD4', 'CD8A']
mouse_genes = ['Cd3d', 'Cd4', 'Cd8a']

# Get human annotations, then keep the genes of interest
human_annotations = sc.queries.biomart_annotations(
    org='hsapiens',
    attrs=['ensembl_gene_id', 'external_gene_name', 'mmusculus_homolog_ensembl_gene'],
)
human_annotations = human_annotations[human_annotations['external_gene_name'].isin(human_genes)]

# Get mouse annotations, then keep the genes of interest
mouse_annotations = sc.queries.biomart_annotations(
    org='mmusculus',
    attrs=['ensembl_gene_id', 'external_gene_name', 'hsapiens_homolog_ensembl_gene'],
)
mouse_annotations = mouse_annotations[mouse_annotations['external_gene_name'].isin(mouse_genes)]

print("Human-mouse homolog mapping:")
print(human_annotations[['external_gene_name', 'mmusculus_homolog_ensembl_gene']])
```

### Comprehensive Gene Annotation Pipeline

```python
def annotate_gene_list(gene_list, organism='hsapiens'):
    """Comprehensive gene annotation pipeline."""

    # Get basic annotations for the whole dataset, then keep the requested genes
    basic_info = sc.queries.biomart_annotations(
        org=organism,
        attrs=[
            'ensembl_gene_id',
            'external_gene_name',
            'description',
            'gene_biotype',
            'chromosome_name',
            'start_position',
            'end_position',
        ],
    )
    basic_info = basic_info[basic_info['external_gene_name'].isin(gene_list)].copy()

    # Perform enrichment analysis
    enrichment = sc.queries.enrich(
        gene_list,
        org=organism,
        gprofiler_kwargs={'sources': ['GO:BP', 'GO:MF', 'GO:CC', 'KEGG']},
    )

    # Get mitochondrial gene status
    mt_genes = sc.queries.mitochondrial_genes(organism)['external_gene_name']
    basic_info['is_mitochondrial'] = basic_info['external_gene_name'].isin(mt_genes)

    return {
        'annotations': basic_info,
        'enrichment': enrichment,
        'mt_genes': mt_genes,
    }

# Use the pipeline
results = annotate_gene_list(['CD3D', 'CD4', 'CD8A', 'IL2', 'IFNG'])
```

### Integration with Marker Gene Analysis

```python
# Complete workflow: clustering -> marker genes -> annotation -> enrichment
def analyze_cluster_markers(adata, cluster_key='leiden', n_genes=50):
    """Analyze cluster markers with comprehensive annotation."""

    # Find marker genes
    sc.tl.rank_genes_groups(adata, cluster_key, method='wilcoxon')

    # Fetch annotations once; the query returns every gene in the dataset
    all_annotations = sc.queries.biomart_annotations(
        org='hsapiens',
        attrs=['external_gene_name', 'description', 'gene_biotype'],
    )

    results = {}

    for cluster in adata.obs[cluster_key].unique():
        # Get top marker genes
        marker_df = sc.get.rank_genes_groups_df(adata, group=cluster)
        top_markers = marker_df.head(n_genes)['names'].tolist()

        # Annotate markers
        annotations = all_annotations[all_annotations['external_gene_name'].isin(top_markers)]

        # Enrichment analysis
        enrichment = sc.queries.enrich(
            top_markers,
            org='hsapiens',
            gprofiler_kwargs={'sources': ['GO:BP', 'KEGG'], 'user_threshold': 0.01},
        )

        results[cluster] = {
            'markers': top_markers,
            'annotations': annotations,
            'enrichment': enrichment.head(10) if not enrichment.empty else None,
        }

    return results

# Run comprehensive analysis
cluster_analysis = analyze_cluster_markers(adata, n_genes=30)

# Print results for each cluster
for cluster, data in cluster_analysis.items():
    print(f"\nCluster {cluster}:")
    print(f"Top markers: {', '.join(data['markers'][:10])}")

    if data['enrichment'] is not None:
        print("Top enriched pathways:")
        for _, pathway in data['enrichment'].head(3).iterrows():
            print(f"  - {pathway['name']} (p={pathway['p_value']:.2e})")
```

## Best Practices


### Query Optimization

1. **Caching**: Pass `use_cache=True` to reuse cached Biomart responses and avoid repeated API calls
2. **Batch Queries**: Query multiple genes at once rather than individually
3. **Attribute Selection**: Only request necessary attributes to reduce response size
4. **Rate Limiting**: Be mindful of API rate limits for large queries
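
The batching and caching advice above can be sketched without touching the network. In this minimal example, `fetch` is a hypothetical stand-in for whatever function actually sends a query, and `chunked` and `CachedQuery` are illustrative helper names rather than Scanpy APIs.

```python
def chunked(items, size):
    """Yield consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


class CachedQuery:
    """Wrap a batch query function and remember per-gene results."""

    def __init__(self, fetch):
        self._fetch = fetch  # hypothetical callable: list of genes -> dict of gene -> result
        self._cache = {}
        self.calls = 0       # how many times the wrapped function was invoked

    def lookup(self, genes, batch_size=100):
        # Only genes that have not been seen before are sent to the service
        missing = [g for g in dict.fromkeys(genes) if g not in self._cache]
        for batch in chunked(missing, batch_size):
            self._cache.update(self._fetch(batch))
            self.calls += 1
        return {g: self._cache.get(g) for g in genes}


# Stand-in for a remote call, used only to demonstrate the wrapper
def fake_fetch(batch):
    return {gene: gene.lower() for gene in batch}


query = CachedQuery(fake_fetch)
first = query.lookup(["CD3D", "CD4", "CD8A"], batch_size=2)  # sent as two batches
second = query.lookup(["CD4", "IL2"], batch_size=2)          # only IL2 is fetched
print(first, second, query.calls)
```

Keeping the cache keyed by gene means a later request that overlaps an earlier one only pays for the genes it has not seen, which is the main saving when many clusters share markers.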

### Error Handling

```python
import time

def robust_query(gene_list, max_retries=3, delay=1):
    """Query with retry logic and error handling."""
    for attempt in range(max_retries):
        try:
            annotations = sc.queries.biomart_annotations(
                org='hsapiens',
                attrs=['external_gene_name', 'description'],
            )
            # Keep only the requested genes
            return annotations[annotations['external_gene_name'].isin(gene_list)]
        except Exception as e:
            if attempt < max_retries - 1:
                print(f"Query failed (attempt {attempt + 1}): {e}")
                time.sleep(delay * (2 ** attempt))  # Exponential backoff
            else:
                raise

# Use robust querying
annotations = robust_query(['CD3D', 'CD4', 'CD8A'])
```

### Data Integration

```python
# Integrate query results with AnnData object
def add_gene_annotations(adata, organism='hsapiens'):
    """Add gene annotations to AnnData var."""

    # Get annotations
    annotations = sc.queries.biomart_annotations(
        org=organism,
        attrs=['external_gene_name', 'description', 'gene_biotype', 'chromosome_name'],
    )

    # Index by gene symbol; drop repeated symbols so the index is unique
    annotations = annotations.drop_duplicates('external_gene_name').set_index('external_gene_name')

    # Add to adata.var; reindex tolerates genes that Biomart does not know
    for col in ['description', 'gene_biotype', 'chromosome_name']:
        if col in annotations.columns:
            adata.var[col] = annotations[col].reindex(adata.var_names).fillna('Unknown').to_numpy()

    # Mark mitochondrial genes
    mt_genes = sc.queries.mitochondrial_genes(organism)['external_gene_name']
    adata.var['mitochondrial'] = adata.var_names.isin(mt_genes)

    return adata

# Add annotations to your data
adata = add_gene_annotations(adata)
```

## Supported Organisms

Common organism codes for queries:

- `'hsapiens'` - Human
- `'mmusculus'` - Mouse
- `'drerio'` - Zebrafish
- `'dmelanogaster'` - Fruit fly
- `'celegans'` - C. elegans
- `'rnorvegicus'` - Rat
- `'scerevisiae'` - Yeast
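
These codes follow Ensembl's convention of first initial plus species name, and Ensembl Biomart gene datasets are named by appending `_gene_ensembl` to the code. A minimal sketch of that mapping, limited to the organisms listed above, follows. `ORGANISMS` and `dataset_name` are hypothetical names used for illustration only.

```python
# Organism codes from the list above, mapped to their common names
ORGANISMS = {
    "hsapiens": "Human",
    "mmusculus": "Mouse",
    "drerio": "Zebrafish",
    "dmelanogaster": "Fruit fly",
    "celegans": "C. elegans",
    "rnorvegicus": "Rat",
    "scerevisiae": "Yeast",
}


def dataset_name(org):
    """Return the Ensembl Biomart gene dataset name for a listed organism code."""
    if org not in ORGANISMS:
        known = ", ".join(sorted(ORGANISMS))
        raise ValueError(f"Unknown organism code {org!r}; expected one of: {known}")
    return f"{org}_gene_ensembl"


print(dataset_name("hsapiens"))  # hsapiens_gene_ensembl
```

Validating the code up front gives a readable error for inputs such as `'human'` before any network request is made. Organisms outside this short list are rejected by the sketch even though Ensembl may support them.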

## External Dependencies

The queries module requires internet connectivity and may depend on:

- `pybiomart` - For Ensembl queries
- `gprofiler-official` - For enrichment analysis
- `requests` - For API calls
- `pandas` - For data handling

Install additional dependencies if needed:

```bash
pip install pybiomart gprofiler-official requests
```
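
Before running queries, it can help to check which optional packages are importable. The sketch below uses only the standard library. The import names are assumptions (`pybiomart` for Biomart access and `gprofiler` for the `gprofiler-official` package), and `missing_modules` is a hypothetical helper.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Assumed import names; adjust if your environment uses different packages
missing = missing_modules(["pybiomart", "gprofiler", "requests", "pandas"])
if missing:
    print("Missing optional dependencies:", ", ".join(missing))
else:
    print("All optional dependencies are importable")
```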