Scanpy is a scalable Python toolkit for analyzing single-cell gene expression data, with support for preprocessing, visualization, clustering, trajectory inference, and differential expression testing.
Scanpy provides a collection of standard single-cell datasets for testing, benchmarking, and educational purposes. These datasets include both processed and raw versions of popular single-cell studies, making it easy to get started with analysis or reproduce published results.
Peripheral blood mononuclear cell (PBMC) datasets from 10x Genomics, widely used for benchmarking and tutorials.
def pbmc3k():
    """
    3k PBMCs from 10x Genomics.
    Returns:
        AnnData: 2700 cells with 32738 genes
    """

def pbmc3k_processed():
    """
    Processed 3k PBMCs with cluster annotations.
    Returns:
        AnnData: Preprocessed 3k PBMCs with clustering results
    """

def pbmc68k_reduced():
    """
    68k PBMCs from 10x Genomics, reduced for computational efficiency.
    Returns:
        AnnData: Subsampled and preprocessed 68k PBMCs
    """

Datasets from studies of cell development and differentiation.
def paul15():
    """
    Hematopoietic stem and progenitor cell dataset from Paul et al. 2015.
    Single-cell RNA-seq of 2730 cells from mouse bone marrow capturing
    hematopoietic differentiation. Includes the cluster labels from the
    original study in obs['paul15_clusters'] and a root cell index in
    uns['iroot']; no diffusion map is pre-computed.
    Returns:
        AnnData: 2730 cells with 3451 genes and cluster annotations
    """

def moignard15():
    """
    Blood development dataset from Moignard et al. 2015.
    Single-cell qPCR data of 3934 cells with 42 genes during early
    blood development in mouse embryos.
    Returns:
        AnnData: 3934 cells with 42 genes and developmental stage annotations
    """

def krumsiek11():
    """
    Krumsiek et al. 2011 hematopoiesis model dataset.
    Simulated gene expression data based on a Boolean network model
    of hematopoietic differentiation.
    Returns:
        AnnData: Simulated cells with continuous gene expression values
    """

Datasets from studies of human disease.
def burczynski06():
    """
    Burczynski et al. 2006 inflammatory bowel disease dataset.
    Bulk microarray data from peripheral blood mononuclear cells of
    Crohn's disease patients, ulcerative colitis patients, and healthy controls.
    Returns:
        AnnData: 127 samples with 22283 genes and disease annotations
    """

Artificial datasets for testing and method development.
def blobs(n_observations=640, n_variables=11, n_centers=5, cluster_std=1.0, random_state=0):
    """
    Generate synthetic single-cell-like data with Gaussian blobs.
    Parameters:
    - n_observations (int): Number of cells to generate
    - n_variables (int): Number of genes
    - n_centers (int): Number of cluster centers
    - cluster_std (float): Standard deviation of clusters
    - random_state (int): Random seed
    Returns:
        AnnData: Synthetic dataset with cluster labels in obs['blobs']
    """

def toggleswitch():
    """
    Toggle switch synthetic dataset.
    Simulated data representing a bistable gene regulatory network
    (toggle switch) with two stable states. The loader takes no arguments.
    Returns:
        AnnData: Synthetic toggle switch data
    """

Spatial transcriptomics datasets for testing spatial analysis methods.
def visium_sge(sample_id='V1_Breast_Cancer_Block_A_Section_1'):
    """
    10x Visium spatial gene expression dataset.
    Visium spatial transcriptomics data from various tissue samples
    including breast cancer, mouse brain, and other tissues.
    Parameters:
    - sample_id (str): Sample identifier to load
    Returns:
        AnnData: Spatial transcriptomics data with coordinates
    """

Access datasets from external repositories and databases.
def ebi_expression_atlas(accession, filter_boring=False):
    """
    Load dataset from EBI Single Cell Expression Atlas.
    Parameters:
    - accession (str): EBI Expression Atlas accession number
    - filter_boring (bool): Drop uninformative obs columns, such as those
      with a single value or a distinct value for every cell
    Returns:
        AnnData: Dataset from EBI Expression Atlas
    """

import scanpy as sc
# Load 3k PBMCs for tutorial
adata = sc.datasets.pbmc3k()
adata.var_names_make_unique()
# Load processed PBMCs with annotations
adata = sc.datasets.pbmc3k_processed()
print(adata.obs.columns) # see available annotations
# Load developmental dataset
adata = sc.datasets.paul15()
print(adata.obs['paul15_clusters'].value_counts())

# Load Visium spatial data
adata = sc.datasets.visium_sge()
# Spatial coordinates are in adata.obsm['spatial']
print(adata.obsm['spatial'].shape)
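# Compute QC metrics so that 'total_counts' exists in adata.obs
sc.pp.calculate_qc_metrics(adata, inplace=True)
# Basic spatial plot
sc.pl.spatial(adata, color='total_counts')

# Generate synthetic clustered data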
adata = sc.datasets.blobs(n_observations=1000, n_variables=50, n_centers=8)
# Basic analysis
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.pl.umap(adata, color='blobs')
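
To make the `blobs` parameters used above concrete, here is a minimal standard-library sketch of how Gaussian blob data can be generated. It is an illustration only, not scanpy's implementation: the function name `make_blobs_sketch`, the `center_range` argument, and the round-robin assignment of observations to centers are all assumptions of this sketch, and scanpy's own generator may differ in such details.

```python
import random


def make_blobs_sketch(n_observations=640, n_variables=11, n_centers=5,
                      cluster_std=1.0, center_range=(-10.0, 10.0), seed=0):
    """Return (X, labels): X is a list of rows, labels the center index per row.

    Hypothetical helper for illustration; not part of scanpy.
    """
    rng = random.Random(seed)
    low, high = center_range
    # Draw one center per cluster, uniformly inside the given range.
    centers = [[rng.uniform(low, high) for _ in range(n_variables)]
               for _ in range(n_centers)]
    X, labels = [], []
    for i in range(n_observations):
        k = i % n_centers  # assign observations to centers in round-robin order
        # Each value is the center coordinate plus isotropic Gaussian noise.
        X.append([rng.gauss(mu, cluster_std) for mu in centers[k]])
        labels.append(k)
    return X, labels


X, labels = make_blobs_sketch(n_observations=1000, n_variables=50, n_centers=8)
print(len(X), len(X[0]), sorted(set(labels)))
```

The resulting matrix has one row per simulated cell and one column per simulated gene, and `labels` plays the role of the `blobs` column that the plot above colors by.
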
# Toggle switch data for trajectory analysis
adata = sc.datasets.toggleswitch()
sc.pp.neighbors(adata)
sc.tl.diffmap(adata)
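# Color by diffusion pseudotime, computed from the root cell in adata.uns['iroot']
sc.tl.dpt(adata)
sc.pl.diffmap(adata, color='dpt_pseudotime')

# Paul et al. hematopoiesis data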
adata = sc.datasets.paul15()
# No embedding is pre-computed, so build the neighbor graph and diffusion map first
sc.pp.neighbors(adata)
sc.tl.diffmap(adata)
sc.pl.diffmap(adata, color='paul15_clusters', components=['1,2', '1,3'])
# Moignard et al. early development
adata = sc.datasets.moignard15()
sc.pl.scatter(adata, x='Gata1', y='Gata2', color='exp_groups')

# Crohn's disease dataset
adata = sc.datasets.burczynski06()
# Basic differential expression
sc.tl.rank_genes_groups(adata, 'groups', method='wilcoxon')
sc.pl.rank_genes_groups(adata, n_genes=10)

# Load data from EBI Expression Atlas
# (requires internet connection)
adata = sc.datasets.ebi_expression_atlas('E-MTAB-5061')
print(f"Loaded {adata.n_obs} cells and {adata.n_vars} genes")

These datasets provide excellent starting points for learning scanpy, testing new methods, and reproducing published analyses. Each dataset comes with appropriate metadata and annotations for its specific use case.
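
Because the annotations differ from one dataset to the next, it can help to summarize what a freshly loaded object contains before analyzing it. The following is a minimal sketch of such a helper. The name `summarize` is hypothetical, and the demo uses a stand-in object with made-up values so that it runs without scanpy or a download; the helper relies only on the `n_obs`, `n_vars`, `obs.columns`, and `obsm` attributes that AnnData objects expose.

```python
from types import SimpleNamespace


def summarize(adata):
    """Collect basic facts about an AnnData-like object into a dict.

    Hypothetical helper for illustration; not part of scanpy.
    """
    return {
        "n_obs": adata.n_obs,                      # number of cells or samples
        "n_vars": adata.n_vars,                    # number of genes
        "obs_columns": list(adata.obs.columns),    # per-cell annotations
        "obsm_keys": list(adata.obsm.keys()),      # embeddings, spatial coordinates
    }


# Stand-in object with illustrative values (an assumption of this sketch);
# in practice, pass the result of a loader such as sc.datasets.blobs().
demo = SimpleNamespace(
    n_obs=640,
    n_vars=11,
    obs=SimpleNamespace(columns=["blobs"]),
    obsm={},
)
summary = summarize(demo)
print(summary)
```

With scanpy installed, the same function can be called on a real dataset, for example `summarize(sc.datasets.paul15())`.
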