# Scanpy

Scanpy is a scalable Python toolkit for analyzing single-cell gene expression data, including datasets of more than one million cells. Developed jointly with anndata, it covers the full workflow: preprocessing, visualization, clustering, trajectory inference, and differential expression testing. It integrates with the scientific Python ecosystem and provides algorithms for dimensionality reduction, neighborhood graph construction, clustering, and pseudotime analysis of single-cell RNA sequencing and other single-cell omics data.

## Package Information

- **Package Name**: scanpy
- **Language**: Python
- **Installation**: `pip install scanpy`

## Core Imports

```python
import scanpy as sc
```

Common additional imports for working with scanpy:

```python
import scanpy as sc
import anndata as ad
import pandas as pd
import numpy as np
```

## Basic Usage

```python
import scanpy as sc
import pandas as pd

# Settings
sc.settings.verbosity = 3  # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.settings.set_figure_params(dpi=80, facecolor='white')

# Load data (10x Genomics format)
adata = sc.read_10x_mtx(
    'data/filtered_gene_bc_matrices/hg19/',  # the directory with the .mtx file
    var_names='gene_symbols',                # use gene symbols for gene names (variable names)
    cache=True                               # write a cache file for faster subsequent reading
)

# Basic preprocessing
sc.pp.filter_cells(adata, min_genes=200)  # filter out cells expressing < 200 genes
sc.pp.filter_genes(adata, min_cells=3)    # filter out genes expressed in < 3 cells

# Calculate QC metrics
adata.var['mt'] = adata.var_names.str.startswith('MT-')  # flag mitochondrial genes
# qc_vars=['mt'] adds per-cell mitochondrial metrics such as pct_counts_mt
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)

# Normalization and log transformation
sc.pp.normalize_total(adata, target_sum=1e4)  # normalize every cell to 10,000 total counts
sc.pp.log1p(adata)                            # logarithmize the data

# Find highly variable genes
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
sc.pl.highly_variable_genes(adata)

# Principal component analysis
sc.pp.pca(adata, svd_solver='arpack')
sc.pl.pca_variance_ratio(adata, log=True, n_pcs=50)  # variance explained by the top 50 PCs

# Compute neighborhood graph
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)

# UMAP embedding
sc.tl.umap(adata)
sc.pl.umap(adata)

# Leiden clustering
sc.tl.leiden(adata, resolution=0.5)
sc.pl.umap(adata, color=['leiden'])
```

## Architecture

Scanpy is built around the AnnData (Annotated Data) format, which stores large single-cell datasets efficiently:

- **AnnData Object**: Central data structure holding the expression matrix, cell and gene metadata, and analysis results
- **Modular Design**: Separate modules for preprocessing (`pp`), analysis tools (`tl`), and plotting (`pl`)
- **Integration**: Works with the scientific Python ecosystem (NumPy, pandas, matplotlib, seaborn)
- **Scalability**: Memory-efficient algorithms designed for datasets with millions of cells
- **Extensibility**: Third-party tools and methods are exposed through the `external` module

## Capabilities

### Data Input/Output

Read and write single-cell data in formats such as 10x Genomics, H5AD, Loom, and CSV. Both local files and remote data access are supported.

```python { .api }
# Top-level functions, e.g. scanpy.read_10x_mtx
def read(filename, **kwargs):
    """Read file and return AnnData object."""

def read_10x_h5(filename, **kwargs):
    """Read 10x Genomics HDF5 file."""

def read_10x_mtx(path, **kwargs):
    """Read 10x Genomics MTX format."""

def read_visium(path, **kwargs):
    """Read 10x Visium spatial transcriptomics data."""

def write(filename, adata, **kwargs):
    """Write AnnData object to file."""
```

[Data I/O](./data-io.md)

### Preprocessing

Preprocessing functions for quality control, filtering, normalization, scaling, feature selection, and dimensionality reduction. These steps prepare raw single-cell data for downstream analysis.

```python { .api }
# Available under scanpy.pp
def filter_cells(adata, **kwargs):
    """Filter cells based on quality metrics."""

def filter_genes(adata, **kwargs):
    """Filter genes based on expression criteria."""

def normalize_total(adata, **kwargs):
    """Normalize counts per cell."""

def log1p(adata, **kwargs):
    """Logarithmize the data matrix."""

def highly_variable_genes(adata, **kwargs):
    """Identify highly variable genes."""

def pca(adata, **kwargs):
    """Principal component analysis."""

def neighbors(adata, **kwargs):
    """Compute neighborhood graph."""
```

[Preprocessing](./preprocessing.md)

### Analysis Tools

Analysis methods for embedding, clustering, trajectory inference, and differential expression testing.

```python { .api }
# Available under scanpy.tl
def umap(adata, **kwargs):
    """UMAP embedding."""

def tsne(adata, **kwargs):
    """t-SNE embedding."""

def leiden(adata, **kwargs):
    """Leiden clustering."""

def louvain(adata, **kwargs):
    """Louvain clustering."""

def rank_genes_groups(adata, **kwargs):
    """Rank genes for characterizing groups."""

def dpt(adata, **kwargs):
    """Diffusion pseudotime analysis."""

def paga(adata, **kwargs):
    """Partition-based graph abstraction."""
```

[Analysis Tools](./analysis-tools.md)

### Visualization

Plotting functions for single-cell data, including embedding and scatter plots, heatmaps, violin plots, and trajectory plots.

```python { .api }
# Available under scanpy.pl
def umap(adata, **kwargs):
    """Plot UMAP embedding."""

def scatter(adata, **kwargs):
    """Scatter plot of observations."""

def violin(adata, **kwargs):
    """Violin plot of gene expression."""

def heatmap(adata, **kwargs):
    """Heatmap of gene expression."""

def rank_genes_groups(adata, **kwargs):
    """Plot ranking of genes."""

def paga(adata, **kwargs):
    """Plot PAGA graph."""
```

[Visualization](./visualization.md)

### Built-in Datasets

Standard single-cell datasets for testing, benchmarking, and teaching, available in raw or processed form depending on the dataset.

```python { .api }
# Available under scanpy.datasets
def pbmc3k():
    """3k PBMCs from 10x Genomics."""

def pbmc68k_reduced():
    """68k PBMCs, reduced for computational efficiency."""

def paul15():
    """Hematopoietic stem and progenitor cell dataset."""

def moignard15():
    """Blood development dataset."""
```

[Datasets](./datasets.md)

### External Tool Integration

Wrappers for popular external single-cell analysis tools, exposed through a unified interface. Each wrapper requires the corresponding third-party package to be installed.

```python { .api }
# Available under scanpy.external.tl
def phate(adata, **kwargs):
    """PHATE dimensionality reduction."""

def palantir(adata, **kwargs):
    """Palantir trajectory inference."""

# Available under scanpy.external.pp
def harmony_integrate(adata, **kwargs):
    """Harmony batch correction."""

def magic(adata, **kwargs):
    """MAGIC imputation."""
```

[External Tools](./external-tools.md)

### Spatial Transcriptomics

Functions for spatial transcriptomics data: reading 10x Visium output, plotting expression in spatial coordinates, and computing spatial autocorrelation statistics.

```python { .api }
# scanpy.read_visium
def read_visium(path, **kwargs):
    """Read 10x Visium data."""

# scanpy.pl.spatial
def spatial(adata, **kwargs):
    """Plot spatial transcriptomics data."""

# scanpy.metrics.morans_i
def morans_i(adata, **kwargs):
    """Moran's I spatial autocorrelation."""

# scanpy.metrics.gearys_c
def gearys_c(adata, **kwargs):
    """Geary's C spatial autocorrelation."""
```
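
To illustrate what Moran's I measures, here is the statistic written out in plain NumPy for a single feature on a small graph. This is a conceptual sketch, not scanpy's implementation; the helper name, the dense weight matrix, and the six-spot chain graph are assumptions made for the example:

```python
import numpy as np

def morans_i_dense(weights: np.ndarray, values: np.ndarray) -> float:
    """Moran's I for one feature on a weighted graph (dense weights, illustrative only)."""
    n = values.shape[0]
    centered = values - values.mean()
    numerator = centered @ weights @ centered  # sum_ij w_ij * (x_i - mean) * (x_j - mean)
    denominator = (centered ** 2).sum()
    return float(n / weights.sum() * numerator / denominator)

# Chain graph over 6 spots: each spot is connected to its immediate neighbours.
n_spots = 6
weights = np.zeros((n_spots, n_spots))
for i in range(n_spots - 1):
    weights[i, i + 1] = weights[i + 1, i] = 1.0

smooth = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])       # neighbours have similar values
alternating = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])  # neighbours have opposite values

i_smooth = morans_i_dense(weights, smooth)
i_alternating = morans_i_dense(weights, alternating)
print(i_smooth)       # about 0.6: positive spatial autocorrelation
print(i_alternating)  # about -1.0: checkerboard-like pattern
```

Values near 1 indicate that neighbouring spots carry similar expression, values near 0 indicate no spatial structure, and negative values indicate that neighbours tend to differ.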

[Spatial Analysis](./spatial-analysis.md)

### Utilities and Settings

Configuration, logging, and helper functions for extracting data from AnnData objects and managing analysis workflows.

```python { .api }
# Settings and configuration (scanpy.settings)
settings: ScanpyConfig

# Data extraction utilities (scanpy.get)
def obs_df(adata, **kwargs):
    """Extract observation dataframe."""

def var_df(adata, **kwargs):
    """Extract variable dataframe."""

# Logging functions (scanpy.logging)
def print_versions():
    """Print version information."""
```

[Utilities](./utilities.md)

### Database Queries and Annotations

Biomart queries and gene annotation tools that enrich single-cell analysis with information from external databases. These functions need network access and optional query packages.

```python { .api }
# Available under scanpy.queries
def biomart_annotations(org, attrs, **kwargs):
    """Query biomart for gene annotations."""

def enrich(container, *, org='hsapiens', **kwargs):
    """Gene enrichment analysis using g:Profiler."""

def gene_coordinates(org, gene_name, **kwargs):
    """Get genomic coordinates for a gene."""

def mitochondrial_genes(org, **kwargs):
    """Get mitochondrial gene list."""
```

[Database Queries](./queries.md)

## Core Types

```python { .api }
# Core data types (from anndata)
class AnnData:
    """Annotated data matrix."""
    def __init__(self, X, obs=None, var=None, **kwargs): ...

# Scanpy-specific types
class Neighbors:
    """Neighbors computation and storage."""
    def __init__(self, adata, **kwargs): ...

class Verbosity:
    """Logging verbosity levels."""

# Settings configuration
class ScanpyConfig:
    """Global scanpy settings."""
    verbosity: int
    n_jobs: int

    def set_figure_params(self, **kwargs): ...
```