# TileDB-SOMA

A Python implementation of the SOMA (Stack of Matrices, Annotated) API using TileDB Embedded for efficient storage and retrieval of single-cell data. TileDB-SOMA provides scalable data structures for storing and querying larger-than-memory datasets on both local and cloud storage, with specialized support for single-cell biology workflows.

## Package Information

- **Package Name**: tiledbsoma
- **Language**: Python
- **Installation**: `pip install tiledbsoma`
- **Version**: 1.17.1

## Core Imports

```python
import tiledbsoma
```

Common patterns for data structures:

```python
from tiledbsoma import (
    Collection, DataFrame, SparseNDArray, DenseNDArray,
    Experiment, Measurement, open
)
```

For I/O operations:

```python
import tiledbsoma.io as soma_io
```

## Basic Usage

```python
import tiledbsoma
import numpy as np
import pyarrow as pa

# Create a DataFrame with single-cell observations
schema = pa.schema([
    ("soma_joinid", pa.int64()),
    ("cell_type", pa.string()),
    ("tissue", pa.string()),
    ("donor_id", pa.string())
])

# Create and write data; the domain must cover the soma_joinid values to be written
with tiledbsoma.DataFrame.create("obs.soma", schema=schema, domain=[(0, 3)]) as obs_df:
    data = pa.table({
        "soma_joinid": [0, 1, 2, 3],
        "cell_type": ["T-cell", "B-cell", "Neuron", "Astrocyte"],
        "tissue": ["blood", "blood", "brain", "brain"],
        "donor_id": ["D1", "D1", "D2", "D2"]
    })
    obs_df.write(data)

# Read data back
with tiledbsoma.open("obs.soma") as obs_df:
    data = obs_df.read().concat()
    print(data.to_pandas())

# Create a sparse matrix for gene expression data
with tiledbsoma.SparseNDArray.create(
    "X.soma",
    type=pa.float32(),
    shape=(1000, 2000)  # 1000 cells, 2000 genes
) as X_array:
    # Write COO-format data as a single Arrow table with dimension
    # columns (soma_dim_0, soma_dim_1) and a data column (soma_data)
    coo = pa.table({
        "soma_dim_0": [0, 0, 1, 1, 2],        # cell indices
        "soma_dim_1": [5, 100, 5, 200, 300],  # gene indices
        "soma_data": pa.array([1.5, 2.3, 0.8, 3.1, 1.2], type=pa.float32()),  # expression values
    })
    X_array.write(coo)
```

## Architecture

TileDB-SOMA follows a hierarchical object model designed for single-cell data analysis:

- **Collections**: String-keyed containers that can hold any SOMA object type
- **Arrays**: Multi-dimensional arrays (sparse/dense) for numerical data with TileDB storage
- **DataFrames**: Tabular data with Arrow schemas, requiring a `soma_joinid` column
- **Experiments**: Specialized collections representing annotated measurement matrices
- **Measurements**: Collections grouping observations with measurements on annotated variables

The library uses Apache Arrow for in-memory data representation and TileDB for persistent storage, enabling efficient operations on larger-than-memory datasets with support for cloud storage backends.

## Capabilities

### Core Data Structures

Fundamental SOMA data types including Collections for hierarchical organization, DataFrames for tabular data, and sparse/dense N-dimensional arrays for numerical data storage.

```python { .api }
class Collection:
    @classmethod
    def create(cls, uri, *, platform_config=None, context=None, tiledb_timestamp=None): ...
    def add_new_collection(self, key, **kwargs): ...
    def add_new_dataframe(self, key, **kwargs): ...

class DataFrame:
    @classmethod
    def create(cls, uri, *, schema, domain=None, platform_config=None, context=None, tiledb_timestamp=None): ...
    def read(self, coords=(), value_filter=None, column_names=None, result_order=None, batch_size=None, partitions=None, platform_config=None): ...
    def write(self, values, platform_config=None): ...

class SparseNDArray:
    @classmethod
    def create(cls, uri, *, type, shape, platform_config=None, context=None, tiledb_timestamp=None): ...
    def read(self, coords=(), result_order=None, batch_size=None, partitions=None, platform_config=None): ...
    def write(self, values, platform_config=None): ...

class DenseNDArray:
    @classmethod
    def create(cls, uri, *, type, shape, platform_config=None, context=None, tiledb_timestamp=None): ...
    def read(self, coords=(), result_order=None, batch_size=None, partitions=None, platform_config=None): ...
    def write(self, coords, values, platform_config=None): ...
```

[Core Data Structures](./core-data-structures.md)

### Single-Cell Biology Support

Specialized data structures for single-cell analysis including Experiments for annotated measurement matrices and Measurements for grouping observations with variables.

```python { .api }
class Experiment(Collection):
    obs: DataFrame       # Primary annotations on observations
    ms: Collection       # Named measurements collection
    spatial: Collection  # Spatial scenes collection
    def axis_query(self, measurement_name, *, obs_query=None, var_query=None): ...

class Measurement(Collection):
    var: DataFrame                  # Variable annotations
    X: Collection[SparseNDArray]    # Feature values matrices
    obsm: Collection[DenseNDArray]  # Dense observation annotations
    obsp: Collection[SparseNDArray] # Sparse pairwise observation annotations
```

[Single-Cell Biology](./single-cell-biology.md)

### Spatial Data Support

Experimental spatial data structures for storing and analyzing spatial single-cell data, including geometry dataframes, point clouds, multiscale images, and spatial scenes.

```python { .api }
class GeometryDataFrame(DataFrame):
    @classmethod
    def create(cls, uri, *, schema, coordinate_space=("x", "y"), domain=None, platform_config=None, context=None, tiledb_timestamp=None): ...

class PointCloudDataFrame(DataFrame):
    @classmethod
    def create(cls, uri, *, schema, coordinate_space=("x", "y"), domain=None, platform_config=None, context=None, tiledb_timestamp=None): ...

class Scene(Collection):
    img: Collection   # Image collection
    obsl: Collection  # Observation location collection
    varl: Collection  # Variable location collection
```

[Spatial Data](./spatial-data.md)

### Data I/O Operations

Comprehensive ingestion and outgestion functions for converting between SOMA format and popular single-cell data formats such as AnnData objects and H5AD files.

```python { .api }
def from_anndata(experiment_uri, anndata, measurement_name, *, obs_id_name="obs_id", var_id_name="var_id", X_layer_name="data", raw_X_layer_name="data", ingest_mode="write", registration_mapping=None, uns_keys=None, context=None, platform_config=None, additional_metadata=None): ...

def to_anndata(experiment, measurement_name, *, X_layer_name="data", extra_X_layer_names=None, obs_id_name=None, var_id_name=None, uns_keys=None): ...

def from_h5ad(experiment_uri, input_path, measurement_name, *, ...): ...
```

[Data I/O](./data-io.md)

### Registration System

ID mapping utilities for multi-file append-mode ingestion, supporting soma_joinid remapping and string-to-integer label mapping across multiple input files.

```python { .api }
class AxisAmbientLabelMapping:
    def __init__(self, *, field_name: str, joinid_map: pd.DataFrame, enum_values: dict):
        """
        Tracks mapping of input data ID-column names to SOMA join IDs.

        Parameters:
        - field_name: str, name of the ID column
        - joinid_map: pd.DataFrame, mapping from ID to soma_joinid
        - enum_values: dict, categorical type mappings
        """

class ExperimentAmbientLabelMapping:
    obs: AxisAmbientLabelMapping            # Observation ID mappings
    var: dict[str, AxisAmbientLabelMapping] # Variable ID mappings per measurement

class AxisIDMapping:
    def __init__(self, id_map: dict[int, int]):
        """
        Offset-to-joinid mappings for individual input files.

        Parameters:
        - id_map: dict, mapping from input offsets to SOMA join IDs
        """

class ExperimentIDMapping:
    obs: AxisIDMapping            # Observation ID mapping
    var: dict[str, AxisIDMapping] # Variable ID mappings per measurement

def get_dataframe_values(df: DataFrame, *, ids: npt.NDArray[np.int64], col_name: str):
    """Get values from DataFrame for specified IDs and column"""
```

### Query and Indexing

Query builders and indexing utilities for efficient data retrieval from SOMA objects, including experiment axis queries and integer indexing.

```python { .api }
class ExperimentAxisQuery:
    def obs(self, *, column_names=None, batch_size=None, partitions=None, platform_config=None): ...
    def var(self, *, column_names=None, batch_size=None, partitions=None, platform_config=None): ...
    def X(self, layer_name, *, batch_size=None, partitions=None, platform_config=None): ...
    def to_anndata(self, *, X_layer_name=None, column_names=None, obsm_layers=None, varm_layers=None, obsp_layers=None, varp_layers=None): ...

class IntIndexer:
    def __init__(self, data, *, context=None): ...
    def get_indexer(self, target): ...
```

[Query and Indexing](./query-indexing.md)

### Query Filtering

Advanced query condition system for attribute filtering with support for complex Boolean expressions and membership operations.

```python { .api }
class QueryCondition:
    def __init__(self, expression: str):
        """
        Create a query condition for filtering SOMA objects.

        Parameters:
        - expression: str, Boolean expression using TileDB query syntax

        Supports:
        - Comparison operators: <, >, <=, >=, ==, !=
        - Boolean operators: and, or, &, |
        - Membership operator: in
        - Attribute casting: attr("column_name")
        - Value casting: val(value)
        """

    def init_query_condition(self, schema, query_attrs):
        """Initialize the query condition with schema and attributes"""
```

### Configuration and Options

Configuration classes for TileDB context management and platform-specific options for creating and writing SOMA objects.

```python { .api }
class SOMATileDBContext:
    def __init__(self, tiledb_config=None, timestamp=None, threadpool=None): ...

class TileDBCreateOptions:
    def __init__(self, **kwargs): ...

class TileDBWriteOptions:
    def __init__(self, **kwargs): ...
```

[Configuration](./configuration.md)

## Coordinate System Types

```python { .api }
class CoordinateSpace:
    """Defines coordinate space for spatial data"""

class AffineTransform:
    """Affine coordinate transformation"""

class IdentityTransform:
    """Identity coordinate transformation"""

class ScaleTransform:
    """Scale coordinate transformation"""

class UniformScaleTransform:
    """Uniform scale coordinate transformation"""
```

## Core Constants

```python { .api }
SOMA_JOINID: str = "soma_joinid"  # Required DataFrame column name
```

## Exception Types

```python { .api }
class SOMAError(Exception):
    """Base exception class for all SOMA-specific errors"""

class DoesNotExistError(SOMAError):
    """Raised when the requested SOMA object does not exist"""

class AlreadyExistsError(SOMAError):
    """Raised when attempting to create an object that already exists"""

class NotCreateableError(SOMAError):
    """Raised when an object cannot be created"""
```

## Utility Functions

```python { .api }
def open(uri, mode="r", *, soma_type=None, context=None, tiledb_timestamp=None):
    """Opens any SOMA object at URI"""

def get_implementation() -> str:
    """Returns implementation name ('python-tiledb')"""

def get_implementation_version() -> str:
    """Returns package version"""

def show_package_versions() -> None:
    """Prints version information for all dependencies"""
```

## Statistics and Logging

```python { .api }
def tiledbsoma_stats_json() -> str:
    """Return TileDB-SOMA statistics as JSON string"""

def tiledbsoma_stats_as_py() -> list:
    """Return TileDB-SOMA statistics as Python objects"""

def tiledbsoma_stats_enable() -> None:
    """Enable TileDB statistics collection"""

def tiledbsoma_stats_disable() -> None:
    """Disable TileDB statistics collection"""

def tiledbsoma_stats_reset() -> None:
    """Reset TileDB statistics"""

def tiledbsoma_stats_dump() -> None:
    """Dump TileDB statistics to stdout"""
```

## Logging Configuration

```python { .api }
import tiledbsoma.logging

def warning() -> None:
    """Set logging level to WARNING"""

def info() -> None:
    """Set logging level to INFO with progress indicators"""

def debug() -> None:
    """Set logging level to DEBUG with detailed progress"""

def log_io_same(message: str) -> None:
    """Log message to both INFO and DEBUG levels"""

def log_io(info_message: str | None, debug_message: str) -> None:
    """Log different messages at INFO and DEBUG levels"""
```