# Built-in Datasets

Scanpy ships a collection of standard single-cell datasets for testing, benchmarking, and teaching. The collection includes both raw and processed versions of popular single-cell studies, which makes it easy to get started with analysis or to reproduce published results.

## Capabilities

### PBMC Datasets

Peripheral blood mononuclear cell (PBMC) datasets from 10x Genomics, widely used for benchmarking and tutorials.

```python { .api }
def pbmc3k():
    """
    3k PBMCs from 10x Genomics.

    Returns:
        AnnData: 2700 PBMCs with 32738 genes
    """

def pbmc3k_processed():
    """
    Processed 3k PBMCs with cluster annotations.

    Returns:
        AnnData: Preprocessed 3k PBMCs with clustering results
    """

def pbmc68k_reduced():
    """
    68k PBMCs from 10x Genomics, reduced for computational efficiency.

    Returns:
        AnnData: Subsampled and preprocessed 68k PBMCs
    """
```

### Developmental Biology

Datasets from studies of cell development and differentiation.

```python { .api }
def paul15():
    """
    Hematopoietic stem and progenitor cell dataset from Paul et al. 2015.

    Single-cell RNA-seq of 2730 cells from mouse bone marrow capturing
    hematopoietic differentiation. Ships with the cluster labels from the
    original publication (obs['paul15_clusters']) and a root cell index
    (uns['iroot']); diffusion maps must be computed by the caller.

    Returns:
        AnnData: 2730 cells with 3451 genes and cluster annotations
    """

def moignard15():
    """
    Blood development dataset from Moignard et al. 2015.

    Single-cell qPCR data of 3934 cells with 42 genes during early
    blood development in mouse embryos.

    Returns:
        AnnData: 3934 cells with 42 genes and experimental group
        annotations (obs['exp_groups'])
    """

def krumsiek11():
    """
    Krumsiek et al. 2011 hematopoiesis model dataset.

    Simulated gene expression data based on a Boolean network model
    of hematopoietic differentiation.

    Returns:
        AnnData: Simulated cells with continuous gene expression values
    """
```

### Disease Studies

Datasets from studies of human disease.

```python { .api }
def burczynski06():
    """
    Burczynski et al. 2006 Crohn's disease and ulcerative colitis dataset.

    Bulk microarray data from peripheral blood mononuclear cells of
    Crohn's disease patients, ulcerative colitis patients, and healthy
    controls.

    Returns:
        AnnData: 127 samples with 22283 genes and disease annotations
        (obs['groups'])
    """
```

### Synthetic and Test Data

Artificial datasets for testing and method development.

```python { .api }
def blobs(n_observations=640, n_variables=11, n_centers=5, cluster_std=1.0, random_state=0):
    """
    Generate synthetic single-cell-like data with Gaussian blobs.

    Parameters:
    - n_observations (int): Number of cells to generate
    - n_variables (int): Number of genes
    - n_centers (int): Number of cluster centers
    - cluster_std (float): Standard deviation of clusters
    - random_state (int): Random seed

    Returns:
        AnnData: Synthetic dataset with cluster labels in obs['blobs']
    """

def toggleswitch():
    """
    Toggle switch synthetic dataset.

    Simulated data representing a bistable gene regulatory network
    (toggle switch) with two stable states. The function takes no
    parameters.

    Returns:
        AnnData: Synthetic toggle switch data
    """
```
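
To make the `blobs` parameters concrete, the following is a minimal pure-Python sketch of Gaussian blob generation. It is not scanpy's implementation: `make_blobs_sketch` and its `center_range` argument are hypothetical names chosen for illustration, and the round-robin cluster assignment is an assumption made to keep the example short.

```python
import random


def make_blobs_sketch(n_observations=640, n_variables=11, n_centers=5,
                      cluster_std=1.0, center_range=(-10.0, 10.0),
                      random_state=0):
    """Return (X, labels): Gaussian points scattered around random centers.

    Hypothetical helper for illustration only; not part of scanpy.
    """
    rng = random.Random(random_state)
    low, high = center_range
    # Draw one center per cluster, uniformly inside the given range.
    centers = [[rng.uniform(low, high) for _ in range(n_variables)]
               for _ in range(n_centers)]
    X, labels = [], []
    for i in range(n_observations):
        label = i % n_centers  # assign observations to clusters round-robin
        # Each coordinate is the cluster center plus Gaussian noise.
        X.append([rng.gauss(mu, cluster_std) for mu in centers[label]])
        labels.append(label)
    return X, labels


X, labels = make_blobs_sketch(n_observations=100, n_variables=3, n_centers=4)
print(len(X), len(X[0]), sorted(set(labels)))  # 100 3 [0, 1, 2, 3]
```

A larger `cluster_std` makes the clusters overlap more, which is useful when checking how a clustering method degrades on harder inputs.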

### Spatial Transcriptomics

Spatial transcriptomics datasets for testing spatial analysis methods.

```python { .api }
def visium_sge(sample_id='V1_Breast_Cancer_Block_A_Section_1'):
    """
    10x Visium spatial gene expression dataset.

    Visium spatial transcriptomics data from various tissue samples
    including breast cancer, mouse brain, and other tissues.

    Parameters:
    - sample_id (str): Sample identifier to load

    Returns:
        AnnData: Spatial transcriptomics data with coordinates
    """
```

### External Data Access

Access datasets from external repositories and databases.

```python { .api }
def ebi_expression_atlas(accession, filter_boring=False):
    """
    Load dataset from EBI Single Cell Expression Atlas.

    Parameters:
    - accession (str): EBI Expression Atlas accession number
    - filter_boring (bool): Remove uninformative obs columns, such as
      labels with a single value or a distinct value for every cell

    Returns:
        AnnData: Dataset from EBI Expression Atlas
    """
```

## Usage Examples

### Loading Standard Datasets

```python
import scanpy as sc

# Load 3k PBMCs for tutorial
adata = sc.datasets.pbmc3k()
adata.var_names_make_unique()  # gene symbols can repeat; make them unique

# Load processed PBMCs with annotations
adata = sc.datasets.pbmc3k_processed()
print(adata.obs.columns)  # see available annotations

# Load developmental dataset
adata = sc.datasets.paul15()
print(adata.obs['paul15_clusters'].value_counts())
```
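
When switching between several datasets, a one-line summary can be handy. The sketch below assumes only that the object exposes `n_obs`, `n_vars`, and `obs.columns`, as AnnData does; `describe` is a hypothetical helper, and the stand-in object exists so the example runs without downloading anything.

```python
from types import SimpleNamespace


def describe(adata, name="dataset"):
    """Return a one-line summary for any AnnData-like object.

    Hypothetical helper for illustration only; not part of scanpy.
    """
    obs_cols = ", ".join(map(str, adata.obs.columns)) or "none"
    return (f"{name}: {adata.n_obs} cells x {adata.n_vars} genes; "
            f"obs columns: {obs_cols}")


# Stand-in object mimicking the shape of pbmc3k (no download needed).
fake = SimpleNamespace(n_obs=2700, n_vars=32738,
                       obs=SimpleNamespace(columns=[]))
print(describe(fake, "pbmc3k"))
```

With scanpy installed, the same helper could be called as `describe(sc.datasets.pbmc3k_processed(), "pbmc3k_processed")`.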

### Working with Spatial Data

```python
import scanpy as sc

# Load Visium spatial data
adata = sc.datasets.visium_sge()

# Spatial coordinates are in adata.obsm['spatial']
print(adata.obsm['spatial'].shape)

# Compute per-spot QC metrics, which adds 'total_counts' to adata.obs
sc.pp.calculate_qc_metrics(adata, inplace=True)

# Basic spatial plot
sc.pl.spatial(adata, color='total_counts')
```

### Synthetic Data for Testing

```python
import scanpy as sc

# Generate synthetic clustered data
adata = sc.datasets.blobs(n_observations=1000, n_variables=50, n_centers=8)

# Basic analysis
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.pl.umap(adata, color='blobs')

# Toggle switch data for trajectory analysis
adata = sc.datasets.toggleswitch()
sc.pp.neighbors(adata)
sc.tl.diffmap(adata)
# The dataset has no obs annotations, so color by the first simulated gene
sc.pl.diffmap(adata, color=adata.var_names[0])
```
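
To illustrate what "bistable" means for a toggle switch, here is a minimal sketch of two mutually repressing genes integrated with Euler steps. This is a textbook-style model under assumed parameters (`alpha`, `hill`, `dt`, `n_steps` are illustrative choices), not the simulation scanpy uses to produce `toggleswitch()`.

```python
def simulate_toggle(x0, y0, alpha=4.0, hill=2, dt=0.01, n_steps=5000):
    """Euler-integrate two mutually repressing genes; return final (x, y).

    Hypothetical model for illustration only; not scanpy's simulator.
    """
    x, y = x0, y0
    for _ in range(n_steps):
        # Production of each gene is repressed by the other; decay is linear.
        dx = alpha / (1.0 + y ** hill) - x
        dy = alpha / (1.0 + x ** hill) - y
        x, y = x + dt * dx, y + dt * dy
    return x, y


# A small initial advantage decides which stable state the system reaches.
high_x = simulate_toggle(1.5, 1.0)
high_y = simulate_toggle(1.0, 1.5)
print(high_x, high_y)
```

Starting with slightly more of one gene drives the system to the state where that gene is high (about 3.7) and the other is low (about 0.27); swapping the initial values gives the mirror-image state.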

### Developmental Biology Analysis

```python
import scanpy as sc

# Paul et al. hematopoiesis data
adata = sc.datasets.paul15()

# The diffusion map is not precomputed: build the neighbor graph first
sc.pp.neighbors(adata)
sc.tl.diffmap(adata)
sc.pl.diffmap(adata, color='paul15_clusters', components=['1,2', '1,3'])

# Moignard et al. early development
adata = sc.datasets.moignard15()
# x and y must be gene names present in adata.var_names
sc.pl.scatter(adata, x='Gata1', y='Gata2', color='exp_groups')
```

### Disease Study Data

```python
import scanpy as sc

# Crohn's disease and ulcerative colitis dataset
adata = sc.datasets.burczynski06()

# Basic differential expression between the groups in adata.obs['groups']
sc.tl.rank_genes_groups(adata, 'groups', method='wilcoxon')
sc.pl.rank_genes_groups(adata, n_genes=10)
```

### External Data Access

```python
import scanpy as sc

# Load data from EBI Expression Atlas
# (requires internet connection)
adata = sc.datasets.ebi_expression_atlas('E-MTAB-5061')
print(f"Loaded {adata.n_obs} cells and {adata.n_vars} genes")
```
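
Because this loader depends on a network download, transient failures are possible. One way to handle them is a small retry wrapper, sketched below. `load_with_retry` is a hypothetical helper, and catching `OSError` is an assumption about which exceptions a failed download raises (it covers `ConnectionError`, `TimeoutError`, and `urllib.error.URLError`); the stand-in loader lets the example run offline.

```python
import time


def load_with_retry(loader, *args, retries=3, delay=1.0, **kwargs):
    """Call loader(*args, **kwargs), retrying on OSError up to `retries` times.

    Hypothetical helper for illustration only; not part of scanpy.
    """
    for attempt in range(1, retries + 1):
        try:
            return loader(*args, **kwargs)
        except OSError:
            if attempt == retries:
                raise  # out of attempts: surface the original error
            time.sleep(delay * attempt)  # linear back-off between attempts


# Stand-in loader that fails twice, then succeeds (no network needed).
calls = {"n": 0}


def flaky_loader(accession):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated network failure")
    return f"loaded {accession}"


result = load_with_retry(flaky_loader, "E-MTAB-5061", delay=0.0)
print(result, calls["n"])  # loaded E-MTAB-5061 3
```

With scanpy installed, the wrapper could be used as `adata = load_with_retry(sc.datasets.ebi_expression_atlas, 'E-MTAB-5061')`.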

## Dataset Information

### PBMC3k Details
- **Source**: 10x Genomics
- **Cells**: 2,700 (after filtering)
- **Genes**: 32,738 (before filtering)
- **Technology**: 10x Chromium
- **Species**: Human
- **Tissue**: Peripheral blood

### Paul15 Details
- **Source**: Paul et al., Cell 2015
- **Cells**: 2,730
- **Genes**: 3,451
- **Technology**: MARS-seq
- **Species**: Mouse
- **Tissue**: Bone marrow
- **Features**: Cluster annotations (`paul15_clusters`), root cell index (`uns['iroot']`)

### Moignard15 Details
- **Source**: Moignard et al., Nature Biotechnology 2015
- **Cells**: 3,934
- **Genes**: 42 (targeted panel)
- **Technology**: Single-cell qPCR
- **Species**: Mouse
- **Tissue**: Embryonic blood
- **Features**: Experimental group annotations (`exp_groups`)

### Visium SGE Details
- **Source**: 10x Genomics
- **Technology**: Visium spatial gene expression
- **Species**: Human/Mouse (sample dependent)
- **Features**: Spatial coordinates, histological images
- **Samples**: Multiple tissue types available

These datasets are convenient starting points for learning scanpy, testing new methods, and reproducing published analyses. Each one carries the metadata and annotations relevant to its use case.