tessl/pypi-scanpy

Comprehensive toolkit for analyzing single-cell gene expression data with scalable Python implementation supporting preprocessing, visualization, clustering, trajectory inference, and differential expression testing.

Built-in Datasets

Name: tessl/pypi-scanpy
Author: tessl

Scanpy provides a collection of standard single-cell datasets for testing, benchmarking, and educational purposes. These datasets include both processed and raw versions of popular single-cell studies, making it easy to get started with analysis or reproduce published results.

Capabilities

PBMC Datasets

Peripheral blood mononuclear cell (PBMC) datasets from 10x Genomics, widely used for benchmarking and tutorials.

def pbmc3k():
    """
    3k PBMCs from 10x Genomics.
    
    Returns:
    AnnData: 2700 PBMCs with 32738 genes (raw counts)
    """

def pbmc3k_processed():
    """
    Processed 3k PBMCs with cluster annotations.
    
    Returns:
    AnnData: Preprocessed 3k PBMCs with clustering results
    """

def pbmc68k_reduced():
    """
    68k PBMCs from 10x Genomics, reduced for computational efficiency.
    
    Returns:
    AnnData: Subsampled and preprocessed 68k PBMCs
    """

Developmental Biology

Datasets from studies of cell development and differentiation.

def paul15():
    """
    Hematopoietic stem and progenitor cell dataset from Paul et al. 2015.
    
    Single-cell RNA-seq of 2730 cells from mouse bone marrow capturing
    hematopoietic differentiation. Ships with the authors' cluster labels
    in obs['paul15_clusters'] and a root cell index in uns['iroot'];
    embeddings such as a diffusion map must be computed after loading.
    
    Returns:
    AnnData: 2730 cells with 3451 genes and cluster annotations
    """

def moignard15():
    """
    Blood development dataset from Moignard et al. 2015.
    
    Single-cell qPCR data of 3934 cells with 42 genes during early
    blood development in mouse embryos.
    
    Returns:
    AnnData: 3934 cells with 42 genes and developmental stage annotations
    """

def krumsiek11():
    """
    Krumsiek et al. 2011 hematopoiesis model dataset.
    
    Simulated gene expression data based on Boolean network model
    of hematopoietic differentiation.
    
    Returns:
    AnnData: Simulated cells with continuous gene expression values
    """

Disease Studies

Datasets from studies of human disease.

def burczynski06():
    """
    Burczynski et al. 2006 inflammatory bowel disease dataset.
    
    Bulk microarray data from peripheral blood mononuclear cells of
    Crohn's disease patients, ulcerative colitis patients, and healthy
    controls.
    
    Returns:
    AnnData: 127 samples with 22283 probes and condition labels in obs['groups']
    """

Synthetic and Test Data

Artificial datasets for testing and method development.

def blobs(n_variables=11, n_centers=5, cluster_std=1.0, n_observations=640, random_state=0):
    """
    Generate synthetic single-cell-like data with Gaussian blobs.
    
    Parameters:
    - n_variables (int): Number of genes (dimension of the space)
    - n_centers (int): Number of cluster centers
    - cluster_std (float): Standard deviation of clusters
    - n_observations (int): Number of cells to generate
    - random_state (int): Random seed
    
    Returns:
    AnnData: Synthetic dataset with cluster labels
    """

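To make the blobs parameters concrete, the following standard-library sketch generates Gaussian blobs in the same spirit. It is an illustration only, not scanpy's implementation: the function name make_blobs_sketch, the center_range argument, and the round-robin cluster assignment are assumptions made for this example.

```python
import random


def make_blobs_sketch(n_observations=640, n_variables=11, n_centers=5,
                      cluster_std=1.0, center_range=(-10.0, 10.0),
                      random_state=0):
    """Return (X, labels): X is a list of rows, labels a list of str."""
    rng = random.Random(random_state)  # seeded, so output is reproducible
    low, high = center_range
    # One center per cluster, drawn uniformly inside the (hypothetical) range
    centers = [[rng.uniform(low, high) for _ in range(n_variables)]
               for _ in range(n_centers)]
    X, labels = [], []
    for i in range(n_observations):
        k = i % n_centers  # assumption: cells are assigned round-robin
        # Each "gene" value is the cluster center plus Gaussian noise
        X.append([rng.gauss(mu, cluster_std) for mu in centers[k]])
        labels.append(str(k))
    return X, labels


X, labels = make_blobs_sketch(n_observations=1000, n_variables=50, n_centers=8)
print(len(X), len(X[0]), sorted(set(labels)))
```

A larger cluster_std makes the clusters overlap more, which is a quick way to stress-test a clustering method before running it on real data.
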
def toggleswitch():
    """
    Toggle switch synthetic dataset.
    
    Simulated data representing a bistable gene regulatory network
    (toggle switch) with two stable states. The function takes no
    parameters; a root cell index is stored in uns['iroot'].
    
    Returns:
    AnnData: Synthetic toggle switch data
    """

Spatial Transcriptomics

Spatial transcriptomics datasets for testing spatial analysis methods.

def visium_sge(sample_id='V1_Breast_Cancer_Block_A_Section_1', include_hires_tiff=False):
    """
    10x Visium spatial gene expression dataset.
    
    Visium spatial transcriptomics data from various tissue samples
    including breast cancer, mouse brain, and other tissues.
    
    Parameters:
    - sample_id (str): Sample identifier to load
    - include_hires_tiff (bool): Also download the high-resolution tissue image
    
    Returns:
    AnnData: Spatial transcriptomics data with coordinates
    """

External Data Access

Access datasets from external repositories and databases.

def ebi_expression_atlas(accession, filter_boring=False):
    """
    Load dataset from EBI Single Cell Expression Atlas.
    
    Parameters:
    - accession (str): EBI Expression Atlas accession number
    - filter_boring (bool): Drop obs columns that carry no information
      (a single value, or a distinct value for every cell)
    
    Returns:
    AnnData: Dataset from EBI Expression Atlas
    """

Usage Examples

Loading Standard Datasets

import scanpy as sc

# Load 3k PBMCs for tutorial
adata = sc.datasets.pbmc3k()
adata.var_names_make_unique()  # de-duplicate gene symbols in place

# Load processed PBMCs with annotations
adata = sc.datasets.pbmc3k_processed()
print(adata.obs.columns)  # see available annotations

# Load developmental dataset
adata = sc.datasets.paul15()
print(adata.obs['paul15_clusters'].value_counts())
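
When exploring an unfamiliar dataset it can help to collect its layout in one place. The helper below is a small sketch, not part of scanpy: summarize is a hypothetical name, and it relies only on attributes that AnnData objects expose (n_obs, n_vars, obs.columns, obsm, uns). A stand-in object is used so the block runs without downloading anything.

```python
from types import SimpleNamespace


def summarize(adata):
    """Collect the basic layout of an AnnData-like object into a dict."""
    return {
        "n_obs": adata.n_obs,                      # number of cells
        "n_vars": adata.n_vars,                    # number of genes
        "obs_columns": list(adata.obs.columns),    # per-cell annotations
        "obsm_keys": list(adata.obsm.keys()),      # embeddings, coordinates
        "uns_keys": list(adata.uns.keys()),        # unstructured metadata
    }


# Stand-in object (an assumption for this demo) mimicking the attributes above
fake = SimpleNamespace(
    n_obs=3,
    n_vars=2,
    obs=SimpleNamespace(columns=["cluster"]),
    obsm={"X_pca": None},
    uns={"iroot": 0},
)
print(summarize(fake))
```

With scanpy installed, the same helper can be called on a real dataset, for example summarize(sc.datasets.pbmc68k_reduced()).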

Working with Spatial Data

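import scanpy as sc

# Load Visium spatial data
adata = sc.datasets.visium_sge()

# Spatial coordinates are in adata.obsm['spatial']
print(adata.obsm['spatial'].shape)

# total_counts is not stored in the download, so compute QC metrics first
sc.pp.calculate_qc_metrics(adata, inplace=True)

# Basic spatial plot
sc.pl.spatial(adata, color='total_counts')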

Synthetic Data for Testing

import scanpy as sc

# Generate synthetic clustered data
adata = sc.datasets.blobs(n_observations=1000, n_variables=50, n_centers=8)

# Basic analysis
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.pl.umap(adata, color='blobs')

# Toggle switch data for trajectory analysis
adata = sc.datasets.toggleswitch()
sc.pp.neighbors(adata)
sc.tl.diffmap(adata)
# Color by diffusion pseudotime, rooted at the cell index in adata.uns['iroot']
sc.tl.dpt(adata)
sc.pl.diffmap(adata, color='dpt_pseudotime')
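
The bistability behind a toggle switch can be illustrated with a textbook mutual-repression model, in which each of two genes represses the other. The sketch below integrates that model with a simple Euler scheme. It is not the simulation that produced scanpy's bundled data; the function name simulate_toggle and the parameter values (alpha, n, dt, steps) are assumptions chosen for illustration.

```python
def simulate_toggle(u0, v0, alpha=4.0, n=2, dt=0.01, steps=5000):
    """Euler-integrate du/dt = alpha/(1+v^n) - u and dv/dt = alpha/(1+u^n) - v."""
    u, v = u0, v0
    for _ in range(steps):
        du = alpha / (1.0 + v ** n) - u  # v represses production of u
        dv = alpha / (1.0 + u ** n) - v  # u represses production of v
        u, v = u + dt * du, v + dt * dv
    return u, v


# Two slightly different starting points settle into opposite stable states
print(simulate_toggle(2.0, 1.0))
print(simulate_toggle(1.0, 2.0))
```

Whichever gene starts ahead ends up dominant, so a population of simulated cells splits into two branches, which is the structure trajectory methods are expected to recover.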

Developmental Biology Analysis

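import scanpy as sc

# Paul et al. hematopoiesis data
adata = sc.datasets.paul15()

# No embedding is bundled: compute a diffusion map, then color it
# by the authors' cluster labels
sc.pp.neighbors(adata)
sc.tl.diffmap(adata)
sc.pl.diffmap(adata, color='paul15_clusters', components=['1,2', '1,3'])

# Moignard et al. early development; stage labels are in obs['exp_groups']
# (gene names are assumed to be in adata.var_names; check before plotting)
adata = sc.datasets.moignard15()
sc.pl.scatter(adata, x='Gata1', y='Tal1', color='exp_groups')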

Disease Study Data

import scanpy as sc

# Burczynski et al. 2006 dataset
adata = sc.datasets.burczynski06()

# Basic differential expression between the condition labels in obs['groups']
sc.tl.rank_genes_groups(adata, 'groups', method='wilcoxon')
sc.pl.rank_genes_groups(adata, n_genes=10)

External Data Access

import scanpy as sc

# Load data from EBI Expression Atlas
# (requires internet connection)
adata = sc.datasets.ebi_expression_atlas('E-MTAB-5061')
print(f"Loaded {adata.n_obs} cells and {adata.n_vars} genes")

Dataset Information

PBMC3k Details

Source: 10x Genomics
Cells: 2,700 (after filtering)
Genes: 32,738 (before filtering)
Technology: 10x Chromium
Species: Human
Tissue: Peripheral blood

Paul15 Details

Source: Paul et al., Cell 2015
Cells: 2,730
Genes: 3,451
Technology: MARS-seq
Species: Mouse
Tissue: Bone marrow
Features: Cluster annotations (paul15_clusters), root cell index (iroot)

Moignard15 Details

Source: Moignard et al., Nature Biotechnology 2015
Cells: 3,934
Genes: 42 (targeted panel)
Technology: Single-cell qPCR
Species: Mouse
Tissue: Embryonic blood
Features: Developmental stage annotations

Visium SGE Details

Source: 10x Genomics
Technology: Visium spatial gene expression
Species: Human/Mouse (sample dependent)
Features: Spatial coordinates, histological images
Samples: Multiple tissue types available
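
The dimensions quoted on this page can double as a sanity check after loading. The sketch below stores them in a lookup table and compares them with an observed shape. The names EXPECTED_SHAPES and check_shape are hypothetical, and the numbers are only those stated in this document, so a mismatch may simply reflect a different scanpy version or upstream data release.

```python
# (cells, genes) as quoted in this document
EXPECTED_SHAPES = {
    "pbmc3k": (2700, 32738),
    "paul15": (2730, 3451),
    "moignard15": (3934, 42),
    "burczynski06": (127, 22283),
}


def check_shape(name, n_obs, n_vars):
    """Return True/False for a documented dataset, None if none is documented."""
    expected = EXPECTED_SHAPES.get(name)
    if expected is None:
        return None
    return expected == (n_obs, n_vars)


print(check_shape("paul15", 2730, 3451))
print(check_shape("visium_sge", 1, 1))
```

After loading a dataset, the check would be called as check_shape("pbmc3k", adata.n_obs, adata.n_vars).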

These datasets provide excellent starting points for learning scanpy, testing new methods, and reproducing published analyses. Each dataset comes with appropriate metadata and annotations for its specific use case.
