# Utilities and Settings

Scanpy provides various utility functions, configuration options, and helper tools for managing analysis workflows, extracting data, and configuring the analysis environment.

## Capabilities

### Global Settings and Configuration

Configure scanpy's behavior and matplotlib plotting parameters.

```python { .api }
# Global settings object
settings: ScanpyConfig

class ScanpyConfig:
    """Global scanpy configuration object."""

    # Core settings
    verbosity: int = 1              # Logging verbosity level (0-5)
    n_jobs: int = 1                 # Number of parallel jobs (-1 for all cores)

    # Data settings
    max_memory: str = '2G'          # Maximum memory for operations
    n_pcs: int = 50                 # Default number of PCs

    # Figure settings
    figdir: str = './figures/'      # Default figure output directory
    file_format_figs: str = 'pdf'   # Default figure format
    dpi: int = 80                   # Default DPI for figures
    dpi_save: int = 150             # DPI for saved figures
    transparent: bool = False       # Transparent backgrounds

    # Cache settings
    cache_compression: str = 'lzf'  # Compression for cached files

    def set_figure_params(self, dpi=80, dpi_save=150, transparent=False, fontsize=14, color_map='viridis', format='pdf', facecolor='white', **kwargs):
        """
        Set matplotlib figure parameters.

        Parameters:
        - dpi (int): Resolution for display
        - dpi_save (int): Resolution for saved figures
        - transparent (bool): Transparent background
        - fontsize (int): Base font size
        - color_map (str): Default colormap
        - format (str): Default save format
        - facecolor (str): Figure background color
        - **kwargs: Additional matplotlib rcParams
        """
```

### Data Extraction Utilities

Extract and manipulate data from AnnData objects.

```python { .api }
def obs_df(adata, keys=None, obsm_keys=None, layer=None, gene_symbols=None, use_raw=False):
    """
    Extract observation metadata as a pandas DataFrame.

    Parameters:
    - adata (AnnData): Annotated data object
    - keys (list, optional): Keys from obs to include
    - obsm_keys (list, optional): Keys from obsm to include
    - layer (str, optional): Layer to extract data from
    - gene_symbols (str, optional): Gene symbols key
    - use_raw (bool): Use raw data

    Returns:
    DataFrame: Observation data with requested keys
    """

def var_df(adata, keys=None, varm_keys=None, layer=None):
    """
    Extract variable metadata as a pandas DataFrame.

    Parameters:
    - adata (AnnData): Annotated data object
    - keys (list, optional): Keys from var to include
    - varm_keys (list, optional): Keys from varm to include
    - layer (str, optional): Layer to extract data from

    Returns:
    DataFrame: Variable data with requested keys
    """

def rank_genes_groups_df(adata, group=None, key='rank_genes_groups', pval_cutoff=None, log2fc_min=None, log2fc_max=None, gene_symbols=None):
    """
    Extract ranked genes results as a pandas DataFrame.

    Parameters:
    - adata (AnnData): Annotated data object
    - group (str, optional): Specific group to extract
    - key (str): Key for ranked genes results
    - pval_cutoff (float, optional): P-value cutoff
    - log2fc_min (float, optional): Minimum log2 fold change
    - log2fc_max (float, optional): Maximum log2 fold change
    - gene_symbols (str, optional): Gene symbols key

    Returns:
    DataFrame: Ranked genes with statistics
    """

def aggregate(adata, by, func='mean', layer=None, obsm=None, varm=None):
    """
    Aggregate observations by a grouping variable.

    Parameters:
    - adata (AnnData): Annotated data object
    - by (str): Key in obs for grouping
    - func (str or callable): Aggregation function
    - layer (str, optional): Layer to aggregate
    - obsm (str, optional): Obsm key to aggregate
    - varm (str, optional): Varm key to aggregate

    Returns:
    AnnData: Aggregated data object
    """
```
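
The grouping semantics of `aggregate` can be illustrated without scanpy. The sketch below (a hypothetical `aggregate_mean` helper, not part of the API) computes per-group means the way `func='mean'` does conceptually, with one output row per group:

```python
from collections import defaultdict

def aggregate_mean(values, groups):
    """Group-wise mean, mirroring the idea behind aggregate(func='mean').

    values: per-cell measurements (floats)
    groups: parallel list of group labels (e.g. cluster ids)
    Returns {group: mean of that group's values}.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for v, g in zip(values, groups):
        sums[g] += v
        counts[g] += 1
    return {g: sums[g] / counts[g] for g in sums}

# Four "cells" in two "clusters" collapse to two pseudo-bulk values
means = aggregate_mean([1.0, 3.0, 10.0, 20.0], ['a', 'a', 'b', 'b'])
```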

### Internal Data Access Utilities

Low-level utilities for accessing AnnData representations. These are private helpers (note the leading underscore) and may change between releases.

```python { .api }
def _get_obs_rep(adata, use_rep=None, n_pcs=None, use_raw=False, layer=None, obsm=None, obsp=None):
    """
    Get observation representation for analysis.

    Parameters:
    - adata (AnnData): Annotated data object
    - use_rep (str, optional): Representation key in obsm
    - n_pcs (int, optional): Number of PCs if using PCA
    - use_raw (bool): Use raw data
    - layer (str, optional): Layer to use
    - obsm (str, optional): Obsm key
    - obsp (str, optional): Obsp key

    Returns:
    array: Data representation
    """

def _set_obs_rep(adata, X_new, use_rep=None, n_pcs=None, layer=None, obsm=None):
    """
    Set observation representation in AnnData.

    Parameters:
    - adata (AnnData): Annotated data object
    - X_new (array): New data representation
    - use_rep (str, optional): Representation key
    - n_pcs (int, optional): Number of PCs
    - layer (str, optional): Layer key
    - obsm (str, optional): Obsm key
    """

def _check_mask(adata, mask_var, mask_obs=None):
    """
    Validate and process a mask for subsetting.

    Parameters:
    - adata (AnnData): Annotated data object
    - mask_var (array or str): Variable mask
    - mask_obs (array or str, optional): Observation mask

    Returns:
    tuple: Processed masks
    """
```
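
The mask handling that `_check_mask` performs can be sketched in plain Python. `resolve_mask` below is a simplified, hypothetical re-implementation of the idea (accept either a boolean array or the name of a boolean column, then validate length and dtype), not scanpy's actual code:

```python
def resolve_mask(mask, columns, n):
    """Resolve a mask that is either a boolean list of length n or the
    name of a boolean column in `columns` (a dict of name -> list)."""
    if isinstance(mask, str):
        if mask not in columns:
            raise KeyError(f"mask key {mask!r} not found")
        mask = columns[mask]
    if len(mask) != n:
        raise ValueError(f"mask length {len(mask)} does not match {n}")
    if not all(isinstance(v, bool) for v in mask):
        raise TypeError("mask entries must be boolean")
    return mask

# A string key is looked up in the metadata table, then validated
cols = {'highly_variable': [True, False, True]}
picked = resolve_mask('highly_variable', cols, 3)
```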

### Logging and Verbosity

Control logging output and verbosity levels.

```python { .api }
def print_versions():
    """
    Print version information for scanpy and dependencies.

    Returns:
    None: Prints version information to stdout
    """

# Logging levels
CRITICAL: int = 50
ERROR: int = 40
WARNING: int = 30
INFO: int = 20
HINT: int = 15  # Custom level between INFO and DEBUG
DEBUG: int = 10

# Verbosity levels
class Verbosity:
    """Verbosity level enumeration."""
    error: int = 0
    warn: int = 1
    info: int = 2
    hint: int = 3
    debug: int = 4
    trace: int = 5
```
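
To make the relationship between the two scales concrete, here is a sketch of how verbosity levels might map to logging thresholds. The exact mapping is illustrative (an assumption, not the library's table), but it matches the intent above: higher verbosity reveals lower-priority records:

```python
# Numeric logging levels as in the block above
CRITICAL, ERROR, WARNING, INFO, HINT, DEBUG = 50, 40, 30, 20, 15, 10

# Hypothetical verbosity -> visibility threshold mapping
VERBOSITY_TO_LEVEL = {
    0: ERROR,    # only errors
    1: WARNING,  # + warnings
    2: INFO,     # + info messages
    3: HINT,     # + hints
    4: DEBUG,    # + debug output
}

def visible(verbosity, record_level):
    """A record is shown if its level meets the threshold for this verbosity."""
    return record_level >= VERBOSITY_TO_LEVEL[min(verbosity, 4)]
```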

### Memory and Performance Utilities

Tools for managing memory usage and performance.

```python { .api }
def memory_usage():
    """
    Get current memory usage.

    Returns:
    str: Memory usage information
    """

def check_versions():
    """
    Check versions of key dependencies.

    Returns:
    None: Prints warnings for version issues
    """
```

### File and Path Utilities

Utilities for working with files and paths.

```python { .api }
def _check_datasetdir_exists():
    """Check if the dataset directory exists."""

def _get_filename_from_key(key):
    """Generate a filename from a dataset key."""

def _doc_params(**kwds):
    """Decorator that interpolates shared parameter documentation into docstrings."""
```

### Plotting Configuration

Configure matplotlib and plotting behavior.

```python { .api }
def set_figure_params(scanpy=True, dpi=80, dpi_save=150, transparent=False, fontsize=14, color_map='viridis', format='pdf', facecolor='white', **kwargs):
    """
    Set global figure parameters for matplotlib.

    Parameters:
    - scanpy (bool): Use scanpy-specific settings
    - dpi (int): Display resolution
    - dpi_save (int): Save resolution
    - transparent (bool): Transparent background
    - fontsize (int): Base font size
    - color_map (str): Default colormap
    - format (str): Default save format
    - facecolor (str): Figure background color
    - **kwargs: Additional rcParams
    """

def reset_rcParams():
    """Reset matplotlib rcParams to defaults."""
```

### Constants and Enumerations

Important constants used throughout scanpy.

```python { .api }
# Default number of PCs
N_PCS: int = 50

# Default number of diffusion components
N_DCS: int = 15

# Figure output defaults
FIGDIR_DEFAULT: str = './figures/'
FORMAT_DEFAULT: str = 'pdf'

# Cache settings
CACHE_DEFAULT: str = './cache/'
```

## Usage Examples

### Configuring Scanpy Settings

```python
import scanpy as sc

# Set verbosity level
sc.settings.verbosity = 3  # hint level

# Configure parallel processing
sc.settings.n_jobs = -1  # use all available cores

# Set figure parameters
sc.settings.set_figure_params(
    dpi=100,
    dpi_save=300,
    fontsize=12,
    color_map='plasma',
    format='png',
    transparent=True
)

# Set output directory
sc.settings.figdir = './my_figures/'

# Check current settings
print(f"Verbosity: {sc.settings.verbosity}")
print(f"N jobs: {sc.settings.n_jobs}")
print(f"Figure dir: {sc.settings.figdir}")
```

### Data Extraction and Analysis

```python
# Extract observation data with specific columns
obs_data = sc.get.obs_df(adata, keys=['total_counts', 'n_genes', 'leiden'])
print(obs_data.head())

# Get ranked genes as a DataFrame
marker_genes = sc.get.rank_genes_groups_df(adata, group='0')
top_genes = marker_genes.head(20)

# Extract variable information
var_data = sc.get.var_df(adata, keys=['highly_variable', 'dispersions'])

# Aggregate data by clusters
adata_agg = sc.get.aggregate(adata, by='leiden', func='mean')
print(f"Aggregated to {adata_agg.n_obs} pseudo-bulk samples")
```

### Working with Different Data Representations

```python
# Get PCA representation
X_pca = sc.get._get_obs_rep(adata, use_rep='X_pca', n_pcs=30)
print(f"PCA shape: {X_pca.shape}")

# Get UMAP representation
X_umap = sc.get._get_obs_rep(adata, use_rep='X_umap')
print(f"UMAP shape: {X_umap.shape}")

# Get raw data representation
X_raw = sc.get._get_obs_rep(adata, use_raw=True)
print(f"Raw data shape: {X_raw.shape}")
```

### Environment and Version Information

```python
# Print comprehensive version information
sc.logging.print_versions()

# Check for version compatibility issues
sc._utils.check_versions()

# Print memory usage
print(f"Current memory usage: {sc._utils.memory_usage()}")
```

### Advanced Configuration

```python
# Custom matplotlib configuration
sc.pl.set_rcParams_scanpy(fontsize=10, color_map='viridis')

# Reset to defaults
sc.pl.set_rcParams_defaults()

# Fine-grained matplotlib control
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['axes.grid'] = True
plt.rcParams['grid.alpha'] = 0.3

# Apply custom color palette
import seaborn as sns
custom_palette = sns.color_palette("husl", 8)
sc.pl.palettes.default_20 = custom_palette
```

### Performance Optimization

```python
# Configure for large datasets
sc.settings.max_memory = '16G'  # Set memory limit
sc.settings.n_jobs = 8          # Limit parallel jobs
sc.settings.verbosity = 1       # Reduce logging overhead

# Enable caching for repeated operations
sc.settings.cachedir = '/tmp/scanpy_cache/'

# Use chunked operations for large matrices
sc.pp.scale(adata, chunked=True, chunk_size=1000)
```
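
Chunked processing simply walks the observation axis in fixed-size row blocks so only one block is in memory at a time. A minimal sketch of the iteration pattern (the `iter_chunks` helper is illustrative, not a scanpy function):

```python
def iter_chunks(n_obs, chunk_size):
    """Yield (start, end) row ranges covering n_obs rows, the way chunked
    routines like scale(chunked=True, chunk_size=...) walk a matrix."""
    for start in range(0, n_obs, chunk_size):
        yield start, min(start + chunk_size, n_obs)

# 2500 cells in chunks of 1000: the last chunk is a partial block
chunks = list(iter_chunks(2500, 1000))
```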

### Custom Analysis Workflows

```python
def run_standard_analysis(adata, resolution=0.5, n_pcs=50):
    """Custom analysis function using scanpy utilities."""

    # Configure for this analysis
    original_verbosity = sc.settings.verbosity
    sc.settings.verbosity = 2

    try:
        # Preprocessing
        sc.pp.filter_cells(adata, min_genes=200)
        sc.pp.filter_genes(adata, min_cells=3)
        sc.pp.normalize_total(adata, target_sum=1e4)
        sc.pp.log1p(adata)

        # Analysis
        sc.pp.highly_variable_genes(adata)
        adata.raw = adata
        adata = adata[:, adata.var.highly_variable].copy()  # copy, so we don't scale a view
        sc.pp.scale(adata)
        sc.pp.pca(adata, n_comps=n_pcs)
        sc.pp.neighbors(adata)
        sc.tl.umap(adata)
        sc.tl.leiden(adata, resolution=resolution)

        # Extract results
        results = {
            'clusters': sc.get.obs_df(adata, keys=['leiden']),
            'embedding': sc.get._get_obs_rep(adata, use_rep='X_umap'),
            'n_clusters': len(adata.obs['leiden'].unique())
        }

        return adata, results

    finally:
        # Restore original settings
        sc.settings.verbosity = original_verbosity

# Run analysis
adata_processed, analysis_results = run_standard_analysis(adata)
print(f"Found {analysis_results['n_clusters']} clusters")
```

### Debugging and Troubleshooting

```python
import numpy as np
import scipy.sparse

# Enable debug logging
sc.settings.verbosity = 4  # debug level

# Check data integrity
def check_adata_integrity(adata):
    """Check AnnData object for common issues."""
    print(f"Shape: {adata.shape}")
    print(f"Data type: {adata.X.dtype}")
    print(f"Sparse: {scipy.sparse.issparse(adata.X)}")
    print(f"NaN values: {np.isnan(adata.X.data).sum() if scipy.sparse.issparse(adata.X) else np.isnan(adata.X).sum()}")
    print(f"Negative values: {(adata.X.data < 0).sum() if scipy.sparse.issparse(adata.X) else (adata.X < 0).sum()}")

    # Check for common issues
    if adata.obs.index.duplicated().any():
        print("WARNING: Duplicate observation names found")
    if adata.var.index.duplicated().any():
        print("WARNING: Duplicate variable names found")

check_adata_integrity(adata)

# Memory profiling for large operations
import time
start_time = time.time()
start_memory = sc._utils.memory_usage()

# Your analysis here
sc.pp.neighbors(adata, n_neighbors=15)

end_time = time.time()
end_memory = sc._utils.memory_usage()

print(f"Operation took {end_time - start_time:.2f} seconds")
print(f"Memory before: {start_memory}")
print(f"Memory after: {end_memory}")
```

## Configuration Files

### Setting up scanpy configuration

```python
# Create configuration file (~/.scanpy/config.yaml)
import os
import yaml

config_dir = os.path.expanduser('~/.scanpy')
os.makedirs(config_dir, exist_ok=True)

config = {
    'verbosity': 2,
    'n_jobs': -1,
    'figdir': './figures/',
    'file_format_figs': 'pdf',
    'dpi_save': 300,
    'transparent': True
}

with open(os.path.join(config_dir, 'config.yaml'), 'w') as f:
    yaml.dump(config, f)
```
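
Note that scanpy does not load such a file automatically; a startup snippet has to read it back and apply the values to `sc.settings`. The sketch below does this with the standard library only (so it runs without PyYAML); `load_flat_yaml` is a hypothetical helper that handles just the flat scalar file written above, not a general YAML parser:

```python
def load_flat_yaml(text):
    """Minimal parser for a flat 'key: value' config (scalars only)."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        key, _, value = line.partition(':')
        key, value = key.strip(), value.strip()
        if value.lower() in ('true', 'false'):
            config[key] = value.lower() == 'true'
            continue
        try:
            config[key] = int(value)
        except ValueError:
            config[key] = value.strip('\'"')
    return config

cfg = load_flat_yaml("verbosity: 2\nn_jobs: -1\ntransparent: true\nfigdir: ./figures/")
# At startup, apply the values, e.g.:
# for k, v in cfg.items():
#     setattr(sc.settings, k, v)
```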

## Best Practices

### Settings Management

1. **Consistent Configuration**: Set global parameters at the start of analysis
2. **Resource Management**: Configure `n_jobs` and `max_memory` based on available system resources
3. **Reproducibility**: Set random seeds and document the settings used
4. **Output Management**: Organize figure output with descriptive directories
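
A common way to follow these points without leaking temporary overrides is a small context manager that restores settings on exit. This is a generic Python pattern rather than a scanpy API; `override_attrs` and the `_Settings` stand-in are illustrative names:

```python
from contextlib import contextmanager

@contextmanager
def override_attrs(obj, **overrides):
    """Temporarily set attributes (e.g. verbosity on a settings object),
    restoring the originals even if the block raises."""
    saved = {k: getattr(obj, k) for k in overrides}
    try:
        for k, v in overrides.items():
            setattr(obj, k, v)
        yield obj
    finally:
        for k, v in saved.items():
            setattr(obj, k, v)

class _Settings:  # stand-in for a global settings object
    verbosity = 1

s = _Settings()
with override_attrs(s, verbosity=3):
    inside = s.verbosity  # raised for this block only
after = s.verbosity       # restored automatically
```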

### Performance Tips

1. **Memory Efficiency**: Use appropriate data types and sparse matrices
2. **Parallel Processing**: Enable multiprocessing for CPU-intensive operations
3. **Chunked Operations**: Use chunked processing for very large datasets
4. **Caching**: Enable caching for repeated computations

### Debugging

1. **Logging Levels**: Use appropriate verbosity for development vs. production
2. **Data Validation**: Check data integrity before analysis
3. **Version Tracking**: Document software versions for reproducibility
4. **Error Handling**: Implement proper error handling in custom workflows