# Built-in Datasets

Scanpy provides a collection of standard single-cell datasets for testing, benchmarking, and educational purposes. These datasets include both processed and raw versions of popular single-cell studies, making it easy to get started with analysis or reproduce published results.

## Capabilities

### PBMC Datasets

Peripheral blood mononuclear cell (PBMC) datasets from 10x Genomics, widely used for benchmarking and tutorials.

```python { .api }
def pbmc3k():
    """
    3k PBMCs from 10x Genomics.

    Returns:
        AnnData: 2700 PBMCs with 32738 genes
    """

def pbmc3k_processed():
    """
    Processed 3k PBMCs with cluster annotations.

    Returns:
        AnnData: Preprocessed 3k PBMCs with clustering results
    """

def pbmc68k_reduced():
    """
    68k PBMCs from 10x Genomics, reduced for computational efficiency.

    Returns:
        AnnData: Subsampled and preprocessed 68k PBMCs
    """
```

### Developmental Biology

Datasets from studies of cell development and differentiation.

```python { .api }
def paul15():
    """
    Hematopoietic stem and progenitor cell dataset from Paul et al. 2015.

    Single-cell RNA-seq of 2730 cells from mouse bone marrow capturing
    hematopoietic differentiation. Includes the published cluster
    annotations in obs['paul15_clusters'].

    Returns:
        AnnData: 2730 cells with 3451 genes and cluster annotations
    """

def moignard15():
    """
    Blood development dataset from Moignard et al. 2015.

    Single-cell qPCR data of 3934 cells with 42 genes during early
    blood development in mouse embryos.

    Returns:
        AnnData: 3934 cells with 42 genes and developmental stage annotations
    """

def krumsiek11():
    """
    Krumsiek et al. 2011 hematopoiesis model dataset.

    Simulated gene expression data based on a Boolean network model
    of hematopoietic differentiation.

    Returns:
        AnnData: Simulated cells with continuous gene expression values
    """
```
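To make the phrase "Boolean network model" concrete, here is a minimal sketch of a synchronous Boolean network update. The three genes `A`, `B`, `C` and their rules are hypothetical and chosen only for illustration; they are not the rules of the Krumsiek et al. model, and `krumsiek11()` itself returns continuous values rather than on/off states.

```python
def step(state, rules):
    """Apply every rule to the current state at once (synchronous update)."""
    return {gene: rule(state) for gene, rule in rules.items()}

# Hypothetical toy rules: A and B repress each other, C is switched on by A.
rules = {
    "A": lambda s: s["A"] and not s["B"],
    "B": lambda s: s["B"] and not s["A"],
    "C": lambda s: s["A"],
}

state = {"A": True, "B": False, "C": False}
for _ in range(3):
    state = step(state, rules)

print(state)  # {'A': True, 'B': False, 'C': True}
```

The final state is a fixed point: applying `step` again leaves it unchanged, which is how a Boolean model represents a stable cell fate.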

### Disease Studies

Datasets from studies of human disease.

```python { .api }
def burczynski06():
    """
    Burczynski et al. 2006 Crohn's disease and ulcerative colitis dataset.

    Bulk microarray data from peripheral blood mononuclear cells of
    Crohn's disease patients, ulcerative colitis patients, and healthy
    controls.

    Returns:
        AnnData: 127 samples with 22283 genes and disease annotations
    """
```

### Synthetic and Test Data

Artificial datasets for testing and method development.

```python { .api }
def blobs(n_observations=640, n_variables=11, n_centers=5, cluster_std=1.0, center_box=(-10.0, 10.0), random_state=0):
    """
    Generate synthetic single-cell-like data with Gaussian blobs.

    Parameters:
    - n_observations (int): Number of cells to generate
    - n_variables (int): Number of genes
    - n_centers (int): Number of cluster centers
    - cluster_std (float): Standard deviation of clusters
    - center_box (tuple): Bounding box for cluster centers
    - random_state (int): Random seed

    Returns:
        AnnData: Synthetic dataset with cluster labels
    """

def toggleswitch():
    """
    Toggle switch synthetic dataset.

    Pre-simulated data representing a bistable gene regulatory network
    (toggle switch) with two stable states. The function takes no
    arguments.

    Returns:
        AnnData: Synthetic toggle switch data
    """
```
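As a rough illustration of what "Gaussian blobs" means, the sketch below generates the same kind of data with only the standard library. The helper name `make_blobs_sketch` is hypothetical, its parameters simply mirror the signature documented above, and it is not Scanpy's implementation: it assumes centers are drawn uniformly from `center_box` and points are drawn from an isotropic Gaussian around their center.

```python
import random

def make_blobs_sketch(n_observations=640, n_variables=11, n_centers=5,
                      cluster_std=1.0, center_box=(-10.0, 10.0), random_state=0):
    """Return (X, labels): X is a list of points, labels are center indices."""
    rng = random.Random(random_state)
    low, high = center_box
    # Draw each cluster center uniformly inside the bounding box.
    centers = [[rng.uniform(low, high) for _ in range(n_variables)]
               for _ in range(n_centers)]
    X, labels = [], []
    for i in range(n_observations):
        k = i % n_centers  # spread observations evenly over the centers
        # Sample each coordinate from a Gaussian around the chosen center.
        X.append([rng.gauss(mu, cluster_std) for mu in centers[k]])
        labels.append(k)
    return X, labels

X, labels = make_blobs_sketch(n_observations=10, n_variables=3, n_centers=2)
print(len(X), len(X[0]), sorted(set(labels)))  # 10 3 [0, 1]
```

In the AnnData returned by `blobs()`, the matrix plays the role of `X` here and the cluster labels are stored as an observation annotation.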

### Spatial Transcriptomics

Spatial transcriptomics datasets for testing spatial analysis methods.

```python { .api }
def visium_sge(sample_id='V1_Breast_Cancer_Block_A_Section_1'):
    """
    10x Visium spatial gene expression dataset.

    Visium spatial transcriptomics data from various tissue samples
    including breast cancer, mouse brain, and other tissues.

    Parameters:
    - sample_id (str): Sample identifier to load

    Returns:
        AnnData: Spatial transcriptomics data with coordinates
    """
```

### External Data Access

Access datasets from external repositories and databases.

```python { .api }
def ebi_expression_atlas(accession, *, filter_boring=False):
    """
    Load dataset from EBI Single Cell Expression Atlas.

    Parameters:
    - accession (str): EBI Expression Atlas accession number
    - filter_boring (bool): Remove uninformative obs labels, such as
      those with a single value or a distinct value for every cell

    Returns:
        AnnData: Dataset from EBI Expression Atlas
    """
```

## Usage Examples

### Loading Standard Datasets

```python
import scanpy as sc

# Load 3k PBMCs for tutorial
adata = sc.datasets.pbmc3k()
adata.var_names_make_unique()  # deduplicate repeated gene symbols

# Load processed PBMCs with annotations
adata = sc.datasets.pbmc3k_processed()
print(adata.obs.columns)  # see available annotations

# Load developmental dataset
adata = sc.datasets.paul15()
print(adata.obs['paul15_clusters'].value_counts())
```
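Gene symbols in 10x matrices can repeat, which is why tutorials typically deduplicate variable names right after loading; AnnData provides `var_names_make_unique()` for this. The sketch below shows the idea under a simplifying assumption: the first occurrence of a name is kept and later duplicates get `-1`, `-2`, and so on. The helper `make_unique_sketch` is hypothetical and, unlike the library, does not guard against a suffixed name colliding with an existing one.

```python
def make_unique_sketch(names, join="-"):
    """Keep the first occurrence; suffix later duplicates with -1, -2, ..."""
    counts = {}
    unique = []
    for name in names:
        n = counts.get(name, 0)  # how many times this name was seen so far
        unique.append(name if n == 0 else f"{name}{join}{n}")
        counts[name] = n + 1
    return unique

print(make_unique_sketch(["CD3E", "MS4A1", "CD3E", "CD3E"]))
# ['CD3E', 'MS4A1', 'CD3E-1', 'CD3E-2']
```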

### Working with Spatial Data

```python
import scanpy as sc

# Load Visium spatial data
adata = sc.datasets.visium_sge()

# Spatial coordinates are in adata.obsm['spatial']
print(adata.obsm['spatial'].shape)

# Compute per-spot QC metrics; this adds obs['total_counts']
sc.pp.calculate_qc_metrics(adata, inplace=True)

# Basic spatial plot
sc.pl.spatial(adata, color='total_counts')
```

### Synthetic Data for Testing

```python
import scanpy as sc

# Generate synthetic clustered data
adata = sc.datasets.blobs(n_observations=1000, n_variables=50, n_centers=8)

# Basic analysis
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.pl.umap(adata, color='blobs')

# Toggle switch data for trajectory analysis
adata = sc.datasets.toggleswitch()
sc.pp.neighbors(adata)
sc.tl.diffmap(adata)
sc.tl.dpt(adata)  # pseudotime from the root cell stored in adata.uns['iroot']
sc.pl.diffmap(adata, color='dpt_pseudotime')
```
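The toggle switch data is described as bistable. A minimal sketch of why two stable states arise is shown below, using a standard symmetric mutual-repression model integrated with the Euler method. The equations, the function name `simulate_toggle`, and the parameter values (`alpha`, `hill`, `dt`, `steps`) are illustrative assumptions, not the model Scanpy used to simulate the dataset.

```python
def simulate_toggle(x0, y0, alpha=4.0, hill=2, dt=0.01, steps=5000):
    """Euler-integrate two mutually repressing genes x and y."""
    x, y = x0, y0
    for _ in range(steps):
        dx = alpha / (1.0 + y ** hill) - x  # y represses x; x decays linearly
        dy = alpha / (1.0 + x ** hill) - y  # x represses y; y decays linearly
        x, y = x + dt * dx, y + dt * dy
    return x, y

# Two starting points on opposite sides of the diagonal x == y
print(simulate_toggle(2.0, 1.0))  # settles with x high, y low
print(simulate_toggle(1.0, 2.0))  # settles with y high, x low
```

With these parameters the two end states are mirror images of each other, roughly (3.73, 0.27) and (0.27, 3.73): whichever gene starts ahead ends up dominant.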

### Developmental Biology Analysis

```python
import scanpy as sc

# Paul et al. hematopoiesis data
adata = sc.datasets.paul15()

# Compute the neighbor graph and diffusion map before plotting
sc.pp.neighbors(adata)
sc.tl.diffmap(adata)
sc.pl.diffmap(adata, color='paul15_clusters', components=['1,2', '1,3'])

# Moignard et al. early development
adata = sc.datasets.moignard15()
# Check adata.var_names for the genes available in the 42-gene panel
sc.pl.scatter(adata, x='Gata1', y='Gata2', color='exp_groups')
```

### Disease Study Data

```python
import scanpy as sc

# Crohn's disease dataset
adata = sc.datasets.burczynski06()

# Basic differential expression between the sample groups in obs['groups']
sc.tl.rank_genes_groups(adata, 'groups', method='wilcoxon')
sc.pl.rank_genes_groups(adata, n_genes=10)
```

### External Data Access

```python
import scanpy as sc

# Load data from EBI Expression Atlas
# (requires internet connection)
adata = sc.datasets.ebi_expression_atlas('E-MTAB-5061')
print(f"Loaded {adata.n_obs} cells and {adata.n_vars} genes")
```

## Dataset Information

### PBMC3k Details

- **Source**: 10x Genomics
- **Cells**: 2,700 (after filtering)
- **Genes**: 32,738 (before filtering)
- **Technology**: 10x Chromium
- **Species**: Human
- **Tissue**: Peripheral blood

### Paul15 Details

- **Source**: Paul et al., Cell 2015
- **Cells**: 2,730
- **Genes**: 3,451
- **Technology**: MARS-seq
- **Species**: Mouse
- **Tissue**: Bone marrow
- **Features**: Cluster annotations (`paul15_clusters`)

### Moignard15 Details

- **Source**: Moignard et al., Nature 2015
- **Cells**: 3,934
- **Genes**: 42 (targeted panel)
- **Technology**: Single-cell qPCR
- **Species**: Mouse
- **Tissue**: Embryonic blood
- **Features**: Developmental stage annotations

### Visium SGE Details

- **Source**: 10x Genomics
- **Technology**: Visium spatial gene expression
- **Species**: Human/Mouse (sample dependent)
- **Features**: Spatial coordinates, histological images
- **Samples**: Multiple tissue types available

These datasets provide good starting points for learning Scanpy, testing new methods, and reproducing published analyses. Each dataset comes with metadata and annotations suited to its use case.
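Because downloads and library versions can change, it can help to confirm that a loaded dataset has the dimensions listed above. The sketch below is one way to do that; the names `DOCUMENTED_SHAPES` and `matches_documented_shape` are hypothetical, and a stand-in object replaces the real AnnData so the example runs offline.

```python
from types import SimpleNamespace

# (n_obs, n_vars) as listed in the dataset details above
DOCUMENTED_SHAPES = {
    "pbmc3k": (2700, 32738),
    "paul15": (2730, 3451),
    "moignard15": (3934, 42),
}

def matches_documented_shape(name, adata):
    """Return True if adata's (n_obs, n_vars) equals the documented shape."""
    return (adata.n_obs, adata.n_vars) == DOCUMENTED_SHAPES[name]

# Stand-in with the same attributes as AnnData; a real check would pass
# the result of sc.datasets.paul15() instead.
fake_paul15 = SimpleNamespace(n_obs=2730, n_vars=3451)
print(matches_documented_shape("paul15", fake_paul15))  # True
```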