# Utilities and Settings

Scanpy provides utility functions, configuration options, and helper tools for managing analysis workflows, extracting data, and configuring the analysis environment.

## Capabilities

### Global Settings and Configuration

Configure scanpy's behavior and matplotlib plotting parameters.

```python { .api }

# Global settings object
settings: ScanpyConfig

class ScanpyConfig:
    """Global scanpy configuration object."""

    # Core settings
    verbosity: int = 1              # Logging verbosity level (0-5)
    n_jobs: int = 1                 # Number of parallel jobs (-1 for all cores)

    # Data settings
    max_memory: str = '2G'          # Maximum memory for operations
    n_pcs: int = 50                 # Default number of PCs

    # Figure settings
    figdir: str = './figures/'      # Default figure output directory
    file_format_figs: str = 'pdf'   # Default figure format
    dpi: int = 80                   # Default DPI for figures
    dpi_save: int = 150             # DPI for saved figures
    transparent: bool = False       # Transparent backgrounds

    # Cache settings
    cache_compression: str = 'lzf'  # Compression for cached files

    def set_figure_params(self, dpi=80, dpi_save=150, transparent=False, fontsize=14, color_map='viridis', format='pdf', facecolor='white', **kwargs):
        """
        Set matplotlib figure parameters.

        Parameters:
        - dpi (int): Resolution for display
        - dpi_save (int): Resolution for saved figures
        - transparent (bool): Transparent background
        - fontsize (int): Base font size
        - color_map (str): Default colormap
        - format (str): Default save format
        - facecolor (str): Figure background color
        - **kwargs: Additional matplotlib rcParams
        """
```

### Data Extraction Utilities

Extract and manipulate data from AnnData objects.

```python { .api }

def obs_df(adata, keys=None, obsm_keys=None, layer=None, gene_symbols=None, use_raw=False):
    """
    Extract observation metadata as a pandas DataFrame.

    Parameters:
    - adata (AnnData): Annotated data object
    - keys (list, optional): Keys from obs to include
    - obsm_keys (list, optional): Keys from obsm to include
    - layer (str, optional): Layer to extract data from
    - gene_symbols (str, optional): Gene symbols key
    - use_raw (bool): Use raw data

    Returns:
    DataFrame: Observation data with requested keys
    """

def var_df(adata, keys=None, varm_keys=None, layer=None):
    """
    Extract variable metadata as a pandas DataFrame.

    Parameters:
    - adata (AnnData): Annotated data object
    - keys (list, optional): Keys from var to include
    - varm_keys (list, optional): Keys from varm to include
    - layer (str, optional): Layer to extract data from

    Returns:
    DataFrame: Variable data with requested keys
    """

def rank_genes_groups_df(adata, group=None, key='rank_genes_groups', pval_cutoff=None, log2fc_min=None, log2fc_max=None, gene_symbols=None):
    """
    Extract ranked genes results as a pandas DataFrame.

    Parameters:
    - adata (AnnData): Annotated data object
    - group (str, optional): Specific group to extract
    - key (str): Key for ranked genes results
    - pval_cutoff (float, optional): P-value cutoff
    - log2fc_min (float, optional): Minimum log2 fold change
    - log2fc_max (float, optional): Maximum log2 fold change
    - gene_symbols (str, optional): Gene symbols key

    Returns:
    DataFrame: Ranked genes with statistics
    """

def aggregate(adata, by, func='mean', layer=None, obsm=None, varm=None):
    """
    Aggregate observations by a grouping variable.

    Parameters:
    - adata (AnnData): Annotated data object
    - by (str): Key in obs for grouping
    - func (str or callable): Aggregation function
    - layer (str, optional): Layer to aggregate
    - obsm (str, optional): Obsm key to aggregate
    - varm (str, optional): Varm key to aggregate

    Returns:
    AnnData: Aggregated data object
    """
```

### Internal Data Access Utilities

Low-level utilities for accessing AnnData representations.

```python { .api }

def _get_obs_rep(adata, use_rep=None, n_pcs=None, use_raw=False, layer=None, obsm=None, obsp=None):
    """
    Get the observation representation for analysis.

    Parameters:
    - adata (AnnData): Annotated data object
    - use_rep (str, optional): Representation key in obsm
    - n_pcs (int, optional): Number of PCs if using PCA
    - use_raw (bool): Use raw data
    - layer (str, optional): Layer to use
    - obsm (str, optional): Obsm key
    - obsp (str, optional): Obsp key

    Returns:
    array: Data representation
    """

def _set_obs_rep(adata, X_new, use_rep=None, n_pcs=None, layer=None, obsm=None):
    """
    Set an observation representation in AnnData.

    Parameters:
    - adata (AnnData): Annotated data object
    - X_new (array): New data representation
    - use_rep (str, optional): Representation key
    - n_pcs (int, optional): Number of PCs
    - layer (str, optional): Layer key
    - obsm (str, optional): Obsm key
    """

def _check_mask(adata, mask_var, mask_obs=None):
    """
    Validate and process a mask for subsetting.

    Parameters:
    - adata (AnnData): Annotated data object
    - mask_var (array or str): Variable mask
    - mask_obs (array or str, optional): Observation mask

    Returns:
    tuple: Processed masks
    """
```
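
The idea behind `_check_mask` is that a mask may arrive either as a boolean sequence or as the name of a boolean column in `adata.var`/`adata.obs`. A simplified stand-in sketches that resolution step; `resolve_mask` and `columns` are illustrative names, not scanpy's actual implementation:

```python
def resolve_mask(mask, columns):
    """Resolve a mask given as a boolean sequence or a column name.

    `columns` stands in for adata.var / adata.obs: a mapping of named
    boolean columns.
    """
    if isinstance(mask, str):
        mask = columns[mask]  # look the named column up
    return [bool(x) for x in mask]

cols = {"highly_variable": [True, False, True]}
print(resolve_mask("highly_variable", cols))  # [True, False, True]
print(resolve_mask([1, 0, 1], cols))          # [True, False, True]
```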

### Logging and Verbosity

Control logging output and verbosity levels.

```python { .api }

def print_versions():
    """
    Print version information for scanpy and dependencies.

    Returns:
    None: Prints version information to stdout
    """

# Logging levels
CRITICAL: int = 50
ERROR: int = 40
WARNING: int = 30
INFO: int = 20
HINT: int = 15  # Custom level between INFO and DEBUG
DEBUG: int = 10

# Verbosity levels
class Verbosity:
    """Verbosity level enumeration."""
    error: int = 0
    warn: int = 1
    info: int = 2
    hint: int = 3
    debug: int = 4
    trace: int = 5
```
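
These verbosity ranks imply a simple gating rule: a message is emitted only when its rank is at or below the configured verbosity (so hints appear at verbosity 3 and above). A plain-Python sketch of that rule, copying the values above; `should_log` is illustrative, not a scanpy function:

```python
# Verbosity ranks as documented above (illustrative copy, not an import).
VERBOSITY = {"error": 0, "warn": 1, "info": 2, "hint": 3, "debug": 4, "trace": 5}

def should_log(level_name, verbosity):
    """Return True if a message at `level_name` is emitted at `verbosity`."""
    return VERBOSITY[level_name] <= verbosity

print(should_log("hint", 3))   # True: hints shown at verbosity 3
print(should_log("debug", 3))  # False: debug needs verbosity 4+
```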

### Memory and Performance Utilities

Tools for managing memory usage and performance.

```python { .api }

def memory_usage():
    """
    Get current memory usage.

    Returns:
    str: Memory usage information
    """

def check_versions():
    """
    Check versions of key dependencies.

    Returns:
    None: Prints warnings for version issues
    """
```

### File and Path Utilities

Utilities for working with files and paths.

```python { .api }

def _check_datasetdir_exists():
    """Check if the dataset directory exists."""

def _get_filename_from_key(key):
    """Generate a filename from a key."""

def _doc_params(**kwds):
    """Decorator for parameter documentation."""
```
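
`_doc_params` is internal, but the general docstring-templating pattern it represents can be sketched as follows. This is an illustration of the pattern, not scanpy's actual code; `doc_params` and `normalize` are hypothetical names:

```python
def doc_params(**kwds):
    """Return a decorator that fills {placeholders} in a docstring."""
    def decorator(func):
        func.__doc__ = func.__doc__.format(**kwds)
        return func
    return decorator

@doc_params(adata_doc="adata: Annotated data object.")
def normalize(adata):
    """Normalize counts.

    {adata_doc}
    """
    return adata

print("Annotated data object." in normalize.__doc__)  # True
```

Sharing parameter descriptions this way keeps repeated docstring fragments consistent across many functions.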

### Plotting Configuration

Configure matplotlib and plotting behavior.

```python { .api }

def set_figure_params(scanpy=True, dpi=80, dpi_save=150, transparent=False, fontsize=14, color_map='viridis', format='pdf', facecolor='white', **kwargs):
    """
    Set global figure parameters for matplotlib.

    Parameters:
    - scanpy (bool): Use scanpy-specific settings
    - dpi (int): Display resolution
    - dpi_save (int): Save resolution
    - transparent (bool): Transparent background
    - fontsize (int): Base font size
    - color_map (str): Default colormap
    - format (str): Default save format
    - facecolor (str): Figure background color
    - **kwargs: Additional rcParams
    """

def reset_rcParams():
    """Reset matplotlib rcParams to defaults."""
```

### Constants and Enumerations

Important constants used throughout scanpy.

```python { .api }

# Default number of PCs
N_PCS: int = 50

# Default number of diffusion components
N_DCS: int = 15

# File format constants
FIGDIR_DEFAULT: str = './figures/'
FORMAT_DEFAULT: str = 'pdf'

# Cache settings
CACHE_DEFAULT: str = './cache/'
```

## Usage Examples

### Configuring Scanpy Settings

```python

import scanpy as sc

# Set verbosity level
sc.settings.verbosity = 3  # hint level

# Configure parallel processing
sc.settings.n_jobs = -1  # use all available cores

# Set figure parameters
sc.settings.set_figure_params(
    dpi=100,
    dpi_save=300,
    fontsize=12,
    color_map='plasma',
    format='png',
    transparent=True,
)

# Set output directory
sc.settings.figdir = './my_figures/'

# Check current settings
print(f"Verbosity: {sc.settings.verbosity}")
print(f"N jobs: {sc.settings.n_jobs}")
print(f"Figure dir: {sc.settings.figdir}")
```

### Data Extraction and Analysis

```python

# Extract observation data with specific columns
obs_data = sc.get.obs_df(adata, keys=['total_counts', 'n_genes', 'leiden'])
print(obs_data.head())

# Get ranked genes as a DataFrame
marker_genes = sc.get.rank_genes_groups_df(adata, group='0')
top_genes = marker_genes.head(20)

# Extract variable information
var_data = sc.get.var_df(adata, keys=['highly_variable', 'dispersions'])

# Aggregate data by clusters
adata_agg = sc.get.aggregate(adata, by='leiden', func='mean')
print(f"Aggregated to {adata_agg.n_obs} pseudo-bulk samples")
```

### Working with Different Data Representations

```python

# Get the PCA representation
X_pca = sc.get._get_obs_rep(adata, use_rep='X_pca', n_pcs=30)
print(f"PCA shape: {X_pca.shape}")

# Get the UMAP representation
X_umap = sc.get._get_obs_rep(adata, use_rep='X_umap')
print(f"UMAP shape: {X_umap.shape}")

# Get the raw data representation
X_raw = sc.get._get_obs_rep(adata, use_raw=True)
print(f"Raw data shape: {X_raw.shape}")
```

### Environment and Version Information

```python

# Print comprehensive version information
sc.logging.print_versions()

# Check for version compatibility issues
sc._utils.check_versions()

# Print memory usage
print(f"Current memory usage: {sc._utils.memory_usage()}")
```

### Advanced Configuration

```python

# Custom matplotlib configuration
sc.pl.set_rcParams_scanpy(fontsize=10, color_map='viridis')

# Reset to defaults
sc.pl.set_rcParams_defaults()

# Fine-grained matplotlib control
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['axes.grid'] = True
plt.rcParams['grid.alpha'] = 0.3

# Apply a custom color palette
import seaborn as sns
custom_palette = sns.color_palette("husl", 8)
sc.pl.palettes.default_20 = custom_palette
```

### Performance Optimization

```python

# Configure for large datasets
sc.settings.max_memory = '16G'  # Set memory limit
sc.settings.n_jobs = 8          # Limit parallel jobs
sc.settings.verbosity = 1       # Reduce logging overhead

# Enable caching for repeated operations
sc.settings.cachedir = '/tmp/scanpy_cache/'

# Use chunked operations for large matrices
sc.pp.scale(adata, chunked=True, chunk_size=1000)
```

### Custom Analysis Workflows

```python

def run_standard_analysis(adata, resolution=0.5, n_pcs=50):
    """Custom analysis function using scanpy utilities."""

    # Configure for this analysis
    original_verbosity = sc.settings.verbosity
    sc.settings.verbosity = 2

    try:
        # Preprocessing
        sc.pp.filter_cells(adata, min_genes=200)
        sc.pp.filter_genes(adata, min_cells=3)
        sc.pp.normalize_total(adata, target_sum=1e4)
        sc.pp.log1p(adata)

        # Analysis
        sc.pp.highly_variable_genes(adata)
        adata.raw = adata
        adata = adata[:, adata.var.highly_variable]
        sc.pp.scale(adata)
        sc.pp.pca(adata, n_comps=n_pcs)
        sc.pp.neighbors(adata)
        sc.tl.umap(adata)
        sc.tl.leiden(adata, resolution=resolution)

        # Extract results
        results = {
            'clusters': sc.get.obs_df(adata, keys=['leiden']),
            'embedding': sc.get._get_obs_rep(adata, use_rep='X_umap'),
            'n_clusters': len(adata.obs['leiden'].unique()),
        }

        return adata, results

    finally:
        # Restore original settings
        sc.settings.verbosity = original_verbosity

# Run the analysis
adata_processed, analysis_results = run_standard_analysis(adata)
print(f"Found {analysis_results['n_clusters']} clusters")
```

### Debugging and Troubleshooting

```python

import time

import numpy as np
import scipy.sparse

# Enable debug logging
sc.settings.verbosity = 4  # debug level

# Check data integrity
def check_adata_integrity(adata):
    """Check an AnnData object for common issues."""
    is_sparse = scipy.sparse.issparse(adata.X)
    values = adata.X.data if is_sparse else adata.X
    print(f"Shape: {adata.shape}")
    print(f"Data type: {adata.X.dtype}")
    print(f"Sparse: {is_sparse}")
    print(f"NaN values: {np.isnan(values).sum()}")
    print(f"Negative values: {(values < 0).sum()}")

    # Check for common issues
    if adata.obs.index.duplicated().any():
        print("WARNING: Duplicate observation names found")
    if adata.var.index.duplicated().any():
        print("WARNING: Duplicate variable names found")

check_adata_integrity(adata)

# Memory profiling for large operations
start_time = time.time()
start_memory = sc._utils.memory_usage()

# Your analysis here
sc.pp.neighbors(adata, n_neighbors=15)

end_time = time.time()
end_memory = sc._utils.memory_usage()

print(f"Operation took {end_time - start_time:.2f} seconds")
print(f"Memory before: {start_memory}")
print(f"Memory after: {end_memory}")
```

## Configuration Files

### Setting up scanpy configuration

```python

# Create a configuration file (~/.scanpy/config.yaml)
import os
import yaml

config_dir = os.path.expanduser('~/.scanpy')
os.makedirs(config_dir, exist_ok=True)

config = {
    'verbosity': 2,
    'n_jobs': -1,
    'figdir': './figures/',
    'file_format_figs': 'pdf',
    'dpi_save': 300,
    'transparent': True,
}

with open(os.path.join(config_dir, 'config.yaml'), 'w') as f:
    yaml.dump(config, f)
```
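
If your scanpy version does not load such a file automatically, the loaded values can be applied to `sc.settings` at the start of a script. A minimal sketch, shown with a stand-in object so it is self-contained; `apply_config` is a hypothetical helper, not a scanpy API:

```python
from types import SimpleNamespace

def apply_config(settings, config):
    """Copy key/value pairs from a config mapping onto a settings object
    (e.g. a dict loaded with yaml.safe_load, applied to sc.settings)."""
    for key, value in config.items():
        setattr(settings, key, value)

# Demonstration with a stand-in for sc.settings:
settings = SimpleNamespace(verbosity=1, n_jobs=1)
apply_config(settings, {'verbosity': 2, 'n_jobs': -1})
print(settings.verbosity, settings.n_jobs)  # 2 -1
```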

## Best Practices

### Settings Management

1. **Consistent Configuration**: Set global parameters at the start of an analysis
2. **Resource Management**: Configure `n_jobs` and `max_memory` based on your system
3. **Reproducibility**: Set random seeds and document the settings used
4. **Output Management**: Organize figure output with descriptive directories

### Performance Tips

1. **Memory Efficiency**: Use appropriate data types and sparse matrices
2. **Parallel Processing**: Enable multiprocessing for CPU-intensive operations
3. **Chunked Operations**: Use chunked processing for very large datasets
4. **Caching**: Enable caching for repeated computations

### Debugging

1. **Logging Levels**: Use appropriate verbosity for development vs. production
2. **Data Validation**: Check data integrity before analysis
3. **Version Tracking**: Document software versions for reproducibility
4. **Error Handling**: Implement proper error handling in custom workflows
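
The chunked-operations tip above boils down to iterating a large matrix in row blocks instead of materializing it all at once. A minimal sketch of that iteration pattern, using plain tuples rather than actual matrix data; `iter_chunks` is an illustrative helper, not a scanpy function:

```python
def iter_chunks(n_rows, chunk_size):
    """Yield (start, end) row ranges covering n_rows in blocks of chunk_size."""
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)

# Each (start, end) pair would index one block, e.g. adata.X[start:end].
print(list(iter_chunks(10, 4)))  # [(0, 4), (4, 8), (8, 10)]
```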