# Core Data Structures

The fundamental SOMA data types that provide the building blocks for storing and organizing scientific data. These include Collections for hierarchical organization, DataFrames for tabular data, and sparse/dense N-dimensional arrays for numerical data storage.

## Capabilities

### Collection

A string-keyed container that can hold any SOMA object type, enabling hierarchical organization of data. Collections provide the foundation for complex data structures and can contain other collections, dataframes, or arrays.

```python { .api }
class Collection:
    @classmethod
    def create(cls, uri, *, platform_config=None, context=None, tiledb_timestamp=None):
        """
        Create a new Collection.

        Parameters:
        - uri: str, URI for the collection
        - platform_config: TileDB-specific configuration options
        - context: TileDB context for the operation
        - tiledb_timestamp: Timestamp for temporal queries

        Returns:
            Collection instance
        """

    def add_new_collection(self, key, **kwargs):
        """
        Add a new sub-collection.

        Parameters:
        - key: str, name for the new collection
        - **kwargs: Additional arguments passed to Collection.create()

        Returns:
            Collection instance
        """

    def add_new_dataframe(self, key, **kwargs):
        """
        Add a new DataFrame to the collection.

        Parameters:
        - key: str, name for the new dataframe
        - **kwargs: Additional arguments passed to DataFrame.create()

        Returns:
            DataFrame instance
        """

    def add_new_dense_ndarray(self, key, **kwargs):
        """
        Add a new DenseNDArray to the collection.

        Parameters:
        - key: str, name for the new array
        - **kwargs: Additional arguments passed to DenseNDArray.create()

        Returns:
            DenseNDArray instance
        """

    def add_new_sparse_ndarray(self, key, **kwargs):
        """
        Add a new SparseNDArray to the collection.

        Parameters:
        - key: str, name for the new array
        - **kwargs: Additional arguments passed to SparseNDArray.create()

        Returns:
            SparseNDArray instance
        """

    def members(self):
        """
        Get collection member names and types.

        Returns:
            dict: Mapping of member names to their SOMA types
        """

    def __getitem__(self, key):
        """
        Access collection members by key.

        Parameters:
        - key: str, member name

        Returns:
            SOMA object at the specified key
        """

    def keys(self):
        """
        Get collection member names.

        Returns:
            Iterator of member names
        """
```

#### Usage Example

```python
import tiledbsoma
import pyarrow as pa

# Minimal schema for the observations dataframe (soma_joinid is required)
obs_schema = pa.schema([
    ("soma_joinid", pa.int64()),
    ("cell_type", pa.string())
])

# Create a root collection
with tiledbsoma.Collection.create("my_experiment.soma") as collection:
    # Add sub-collections for organization
    collection.add_new_collection("raw_data")
    collection.add_new_collection("processed_data")

    # Add data structures
    collection.add_new_dataframe("observations", schema=obs_schema)
    collection.add_new_sparse_ndarray("expression_matrix", type=pa.float32(), shape=(1000, 2000))

# Access collection members
with tiledbsoma.open("my_experiment.soma") as collection:
    obs_df = collection["observations"]
    expr_matrix = collection["expression_matrix"]
```

### DataFrame

A multi-column table with a user-defined Arrow schema. All DataFrames must contain a `soma_joinid` column of type int64, which serves as the primary index for joining with other data structures.

```python { .api }
class DataFrame:
    @classmethod
    def create(cls, uri, *, schema, domain=None, platform_config=None, context=None, tiledb_timestamp=None):
        """
        Create a new DataFrame.

        Parameters:
        - uri: str, URI for the dataframe
        - schema: pyarrow.Schema, column schema including soma_joinid
        - domain: list of tuples, domain bounds for each dimension (optional)
        - platform_config: TileDB-specific configuration options
        - context: TileDB context for the operation
        - tiledb_timestamp: Timestamp for temporal queries

        Returns:
            DataFrame instance
        """

    def read(self, coords=(), value_filter=None, column_names=None, result_order=None, batch_size=None, partitions=None, platform_config=None):
        """
        Read data from the DataFrame.

        Parameters:
        - coords: tuple, coordinate selection for soma_joinid
        - value_filter: str, filter expression for attribute values
        - column_names: list of str, specific columns to read
        - result_order: ResultOrder, result ordering preference
        - batch_size: int, number of rows per batch
        - partitions: Partitions object for parallel reading
        - platform_config: TileDB-specific configuration options

        Returns:
            Iterator of Arrow tables
        """

    def write(self, values, platform_config=None):
        """
        Write data to the DataFrame.

        Parameters:
        - values: pyarrow.Table, data to write
        - platform_config: TileDB-specific configuration options
        """

    def keys(self):
        """
        Get column names.

        Returns:
            list of str: Column names
        """

    def count(self):
        """
        Get the number of rows in the DataFrame.

        Returns:
            int: Number of rows
        """

    def domain(self):
        """
        Get the domain bounds for each dimension.

        Returns:
            tuple: Domain bounds (min, max) for soma_joinid
        """

    def tiledbsoma_upgrade_domain(self, newdomain, check_only=False):
        """
        Upgrade the domain bounds.

        Parameters:
        - newdomain: tuple, new domain bounds
        - check_only: bool, if True, only check if upgrade is possible

        Returns:
            bool: True if upgrade was successful or is possible
        """

    def tiledbsoma_resize_soma_joinid_shape(self, newshape, check_only=False):
        """
        Resize the soma_joinid dimension shape.

        Parameters:
        - newshape: int, new maximum soma_joinid value
        - check_only: bool, if True, only check if resize is possible

        Returns:
            bool: True if resize was successful or is possible
        """

    @property
    def schema(self):
        """
        Get the Arrow schema.

        Returns:
            pyarrow.Schema: The dataframe schema
        """

    def maxdomain(self):
        """
        Get the maximum domain bounds.

        Returns:
            tuple: Maximum domain bounds for each dimension
        """

    def index_column_names(self):
        """
        Get the names of index columns.

        Returns:
            tuple of str: Index column names
        """

    def get_enumeration_values(self, enum_name):
        """
        Get enumeration values for a categorical column.

        Parameters:
        - enum_name: str, name of the enumeration

        Returns:
            list: Enumeration values
        """

    def extend_enumeration_values(self, enum_name, new_values):
        """
        Extend enumeration with new values.

        Parameters:
        - enum_name: str, name of the enumeration
        - new_values: list, new values to add
        """

    def tiledbsoma_has_upgraded_domain(self):
        """
        Check if domain has been upgraded.

        Returns:
            bool: True if domain has been upgraded
        """

    def tiledbsoma_upgrade_soma_joinid_shape(self, newshape, check_only=False):
        """
        Upgrade soma_joinid dimension shape.

        Parameters:
        - newshape: int, new shape for soma_joinid dimension
        - check_only: bool, if True, only check if upgrade is possible

        Returns:
            bool or None: Result of upgrade operation
        """

    def change_domain(self, newdomain, check_only=False):
        """
        Change the domain configuration.

        Parameters:
        - newdomain: tuple, new domain bounds
        - check_only: bool, if True, only check if change is possible
        """
```

#### Usage Example

```python
import tiledbsoma
import pyarrow as pa

# Define schema with required soma_joinid column
schema = pa.schema([
    ("soma_joinid", pa.int64()),
    ("cell_type", pa.string()),
    ("tissue", pa.string()),
    ("donor_id", pa.string()),
    ("total_counts", pa.int32())
])

# Create and write data
with tiledbsoma.DataFrame.create("cell_metadata.soma", schema=schema) as df:
    # Build the table against the same schema so total_counts stays int32
    data = pa.Table.from_pydict({
        "soma_joinid": [0, 1, 2, 3, 4],
        "cell_type": ["T-cell", "B-cell", "Neuron", "Astrocyte", "Hepatocyte"],
        "tissue": ["blood", "blood", "brain", "brain", "liver"],
        "donor_id": ["D1", "D1", "D2", "D2", "D3"],
        "total_counts": [1500, 2000, 800, 1200, 1800]
    }, schema=schema)
    df.write(data)

# Read with filtering
with tiledbsoma.open("cell_metadata.soma") as df:
    # Filter for brain tissue cells
    brain_cells = df.read(
        value_filter="tissue == 'brain'",
        column_names=["soma_joinid", "cell_type", "total_counts"]
    ).concat()
    print(brain_cells.to_pandas())
```

### SparseNDArray

A sparse N-dimensional array with offset (0-based) integer indexing. Dimensions are named `soma_dim_0`, `soma_dim_1`, etc., and stored values are named `soma_data`. Sparse arrays only store non-zero values, making them memory-efficient for data with many zeros.

```python { .api }
class SparseNDArray:
    @classmethod
    def create(cls, uri, *, type, shape, platform_config=None, context=None, tiledb_timestamp=None):
        """
        Create a new SparseNDArray.

        Parameters:
        - uri: str, URI for the array
        - type: pyarrow data type for stored values
        - shape: tuple of int, array dimensions
        - platform_config: TileDB-specific configuration options
        - context: TileDB context for the operation
        - tiledb_timestamp: Timestamp for temporal queries

        Returns:
            SparseNDArray instance
        """

    def read(self, coords=(), result_order=None, batch_size=None, partitions=None, platform_config=None):
        """
        Read data from the sparse array.

        Parameters:
        - coords: tuple of slices/arrays, coordinate selection for each dimension
        - result_order: ResultOrder, result ordering preference
        - batch_size: int, number of elements per batch
        - partitions: Partitions object for parallel reading
        - platform_config: TileDB-specific configuration options

        Returns:
            SparseNDArrayRead iterator
        """

    def write(self, values, platform_config=None):
        """
        Write sparse data to the array.

        Parameters:
        - values: tuple of (coordinates_table, values_table)
          - coordinates_table: pyarrow.Table with soma_dim_* columns
          - values_table: pyarrow.Table with soma_data column
        - platform_config: TileDB-specific configuration options
        """

    @property
    def shape(self):
        """
        Get array dimensions.

        Returns:
            tuple of int: Array shape
        """

    @property
    def nnz(self):
        """
        Get number of non-zero elements.

        Returns:
            int: Number of stored (non-zero) elements
        """

    @property
    def schema(self):
        """
        Get the Arrow schema for coordinates and values.

        Returns:
            pyarrow.Schema: Schema for the array data
        """
```

#### Usage Example

```python
import tiledbsoma
import pyarrow as pa
import numpy as np

# Create a sparse 2D array for gene expression (cells x genes)
with tiledbsoma.SparseNDArray.create(
    "expression_matrix.soma",
    type=pa.float32(),
    shape=(1000, 2000)  # 1000 cells, 2000 genes
) as sparse_array:

    # Generate sparse data (only non-zero expression values)
    np.random.seed(42)
    n_nonzero = 5000
    cell_ids = np.random.randint(0, 1000, n_nonzero)
    gene_ids = np.random.randint(0, 2000, n_nonzero)
    # Cast to float32 to match the array's value type
    expression_values = np.random.exponential(2.0, n_nonzero).astype(np.float32)

    # Prepare coordinate and value tables
    coordinates = pa.table({
        "soma_dim_0": cell_ids,  # cell dimension
        "soma_dim_1": gene_ids   # gene dimension
    })
    values = pa.table({
        "soma_data": expression_values
    })

    # Write sparse data
    sparse_array.write((coordinates, values))

# Read sparse data back
with tiledbsoma.open("expression_matrix.soma") as sparse_array:
    print(f"Array shape: {sparse_array.shape}")
    print(f"Non-zero elements: {sparse_array.nnz}")

    # Read a subset of data; SOMA coordinate slices are inclusive on both
    # ends, so slice(0, 99) selects the first 100 cells
    reader = sparse_array.read(coords=(slice(0, 99), slice(None)))
    for table in reader.tables():
        # Each batch is an Arrow table with soma_dim_0, soma_dim_1, soma_data
        print(f"Batch: {len(table)} non-zero values")
```

### DenseNDArray

A dense N-dimensional array with offset (0-based) integer indexing. Like sparse arrays, dimensions are named `soma_dim_0`, `soma_dim_1`, etc., and values are named `soma_data`. Dense arrays store values for all coordinate positions, making them suitable for data without sparsity.

```python { .api }
class DenseNDArray:
    @classmethod
    def create(cls, uri, *, type, shape, platform_config=None, context=None, tiledb_timestamp=None):
        """
        Create a new DenseNDArray.

        Parameters:
        - uri: str, URI for the array
        - type: pyarrow data type for stored values
        - shape: tuple of int, array dimensions
        - platform_config: TileDB-specific configuration options
        - context: TileDB context for the operation
        - tiledb_timestamp: Timestamp for temporal queries

        Returns:
            DenseNDArray instance
        """

    def read(self, coords=(), result_order=None, batch_size=None, partitions=None, platform_config=None):
        """
        Read data from the dense array.

        Parameters:
        - coords: tuple of slices/arrays, coordinate selection for each dimension
        - result_order: ResultOrder, result ordering preference
        - batch_size: int, number of elements per batch
        - partitions: Partitions object for parallel reading
        - platform_config: TileDB-specific configuration options

        Returns:
            pyarrow.Tensor with requested data
        """

    def write(self, coords, values, platform_config=None):
        """
        Write dense data to the array.

        Parameters:
        - coords: tuple of slices, coordinate region to write
        - values: numpy array or Arrow tensor with data to write
        - platform_config: TileDB-specific configuration options
        """

    @property
    def shape(self):
        """
        Get array dimensions.

        Returns:
            tuple of int: Array shape
        """

    @property
    def schema(self):
        """
        Get the Arrow schema for the array.

        Returns:
            pyarrow.Schema: Schema for the array data
        """
```

#### Usage Example

```python
import tiledbsoma
import pyarrow as pa
import numpy as np

# Create a dense 2D array for embedding coordinates
with tiledbsoma.DenseNDArray.create(
    "cell_embeddings.soma",
    type=pa.float64(),
    shape=(1000, 50)  # 1000 cells, 50 embedding dimensions
) as dense_array:

    # Generate embedding data (PCA coordinates)
    np.random.seed(42)
    embeddings = np.random.normal(0, 1, (1000, 50))

    # Write all data at once, as an Arrow tensor
    dense_array.write(
        coords=(slice(None), slice(None)),  # Write entire array
        values=pa.Tensor.from_numpy(embeddings)
    )

# Read dense data back
with tiledbsoma.open("cell_embeddings.soma") as dense_array:
    print(f"Array shape: {dense_array.shape}")

    # Read a subset; SOMA slices are inclusive on both ends, so this
    # selects the first 10 cells and the first 5 dimensions
    subset = dense_array.read(coords=(slice(0, 9), slice(0, 4)))
    print("First 10 cells, first 5 PCA dimensions:")
    print(subset.to_numpy())

    # Dense reads take contiguous slices; read cells 100-199
    block = dense_array.read(coords=(slice(100, 199), slice(None)))
    print(f"Block embedding shape: {block.to_numpy().shape}")
```

## Factory Function

```python { .api }
def open(uri, mode="r", *, soma_type=None, context=None, tiledb_timestamp=None):
    """
    Open any SOMA object at the specified URI.

    Parameters:
    - uri: str, URI of the SOMA object to open
    - mode: str, access mode ("r" for read, "w" for write)
    - soma_type: str, expected SOMA type (optional, auto-detected if not provided)
    - context: TileDB context for the operation
    - tiledb_timestamp: Timestamp for temporal queries

    Returns:
        SOMA object of the appropriate type (Collection, DataFrame, etc.)
    """
```