Tessl Tile for pypi/hdmf@4.1.0

# Query System

HDMF's query system provides querying and filtering for datasets and containers, with reference resolution and lazy data access. It enables data exploration and analysis without loading entire datasets into memory.

## Capabilities

### Dataset Query Interface

Interface for querying HDF5-like datasets with lazy loading and efficient data access.

```python { .api }
class HDMFDataset:
    """
    Dataset query interface providing querying capabilities for HDF5-like datasets.

    Enables efficient data access with lazy loading, slicing, and filtering
    without requiring full dataset loading into memory.
    """

    def __init__(self, dataset, io, **kwargs):
        """
        Initialize HDMF dataset wrapper.

        Args:
            dataset: Underlying dataset object (e.g., h5py.Dataset)
            io: I/O backend for data access
            **kwargs: Additional dataset properties
        """

    def __getitem__(self, key):
        """
        Get data slice from dataset with advanced indexing support.

        Args:
            key: Index, slice, or advanced indexing specification

        Returns:
            Data slice from the dataset

        Examples:
            dataset[0:100]           # Simple slice
            dataset[:, [0, 5, 10]]   # Column selection
            dataset[mask]            # Boolean indexing
        """

    def __setitem__(self, key, value):
        """
        Set data slice in dataset.

        Args:
            key: Index or slice specification
            value: Data to set
        """

    def append(self, data):
        """
        Append data to dataset (if resizable).

        Args:
            data: Data to append
        """

    def query(self, condition: str, **kwargs):
        """
        Query dataset with condition string.

        Args:
            condition: Query condition string
            **kwargs: Additional query parameters

        Returns:
            Filtered data matching the condition
        """

    def where(self, condition):
        """
        Find indices where condition is True.

        Args:
            condition: Boolean condition or callable

        Returns:
            Array of indices where condition is satisfied
        """

    @property
    def shape(self) -> tuple:
        """Shape of the dataset."""

    @property
    def dtype(self):
        """Data type of the dataset."""

    @property
    def size(self) -> int:
        """Total number of elements."""

    @property
    def ndim(self) -> int:
        """Number of dimensions."""
```
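
To make the lazy-access idea behind this interface concrete, the following is a minimal sketch that does not use HDMF at all. `CountingBackend` and `LazyDataset` are hypothetical names introduced only for illustration; the sketch assumes nothing more than that the wrapped object supports NumPy-style indexing, and it counts how many elements each selection actually reads.

```python
import numpy as np


class CountingBackend:
    """Hypothetical stand-in for an on-disk dataset; counts elements read."""

    def __init__(self, array):
        self._array = np.asarray(array)
        self.shape = self._array.shape
        self.dtype = self._array.dtype
        self.elements_read = 0

    def __getitem__(self, key):
        out = self._array[key]
        self.elements_read += int(np.size(out))
        return out


class LazyDataset:
    """Hypothetical wrapper that forwards indexing to the backend on demand."""

    def __init__(self, backend):
        self._backend = backend

    def __getitem__(self, key):
        # Only the requested selection is read from the backend
        return self._backend[key]

    @property
    def shape(self):
        return self._backend.shape

    @property
    def ndim(self):
        return len(self._backend.shape)

    @property
    def size(self):
        return int(np.prod(self._backend.shape))


backend = CountingBackend(np.arange(1_000_000).reshape(10_000, 100))
dataset = LazyDataset(backend)

first_rows = dataset[0:100]        # reads 100 x 100 elements
columns = dataset[:, [0, 5, 10]]   # reads 10_000 x 3 elements
```

Shape, size, and dimensionality come from metadata alone, so inspecting them reads no data; only the two indexing operations touch the backend.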

### Reference Resolution

System for resolving references between containers and builders in the data hierarchy.

```python { .api }
class ReferenceResolver:
    """
    Abstract base class for resolving references between containers/builders.

    Provides the interface for resolving object references, region references,
    and other cross-references within HDMF data structures.
    """

    def __init__(self, **kwargs):
        """Initialize reference resolver."""

    def get_object(self, ref) -> object:
        """
        Get object from reference.

        Args:
            ref: Reference to resolve

        Returns:
            Referenced object
        """

    def get_region(self, ref) -> tuple:
        """
        Get region from region reference.

        Args:
            ref: Region reference to resolve

        Returns:
            Tuple of (object, selection)
        """

class BuilderResolver(ReferenceResolver):
    """
    Reference resolver for Builder objects.

    Resolves references between builders during the build process,
    enabling cross-references in storage representations.
    """

    def __init__(self, builder_map: dict, **kwargs):
        """
        Initialize builder resolver.

        Args:
            builder_map: Dictionary mapping objects to builders
        """

    def get_object(self, ref):
        """
        Get builder from reference.

        Args:
            ref: Reference to builder

        Returns:
            Builder object
        """

class ContainerResolver(ReferenceResolver):
    """
    Reference resolver for Container objects.

    Resolves references between containers in the constructed object hierarchy,
    enabling navigation and cross-references in the in-memory representation.
    """

    def __init__(self, type_map: 'TypeMap', container: 'Container', **kwargs):
        """
        Initialize container resolver.

        Args:
            type_map: Type mapping for container resolution
            container: Root container for resolution context
        """

    def get_object(self, ref):
        """
        Get container from reference.

        Args:
            ref: Reference to container

        Returns:
            Container object
        """

    def get_region(self, ref):
        """
        Get region from container reference.

        Args:
            ref: Region reference

        Returns:
            Tuple of (container, selection)
        """
```
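
The resolver pattern described above can be sketched in plain Python. This is an illustration under stated assumptions, not HDMF's implementation: `SketchResolver` and `DictResolver` are hypothetical classes, references are modeled as dictionary keys, and a region reference is modeled as an `(object_ref, selection)` pair.

```python
from abc import ABC, abstractmethod


class SketchResolver(ABC):
    """Hypothetical base class mirroring the get_object/get_region interface."""

    @abstractmethod
    def get_object(self, ref):
        """Return the object that `ref` points to."""

    def get_region(self, ref):
        # A region reference is modeled here as an (object_ref, selection) pair
        object_ref, selection = ref
        return self.get_object(object_ref), selection


class DictResolver(SketchResolver):
    """Hypothetical resolver backed by a plain dict of reference -> object."""

    def __init__(self, objects):
        self._objects = dict(objects)

    def get_object(self, ref):
        try:
            return self._objects[ref]
        except KeyError:
            raise KeyError(f"unresolved reference: {ref!r}") from None


neurons = ["neuron_000", "neuron_001", "neuron_002", "neuron_003"]
resolver = DictResolver({"neurons": neurons})

# Resolve a region: the target object plus the selection to apply to it
target, selection = resolver.get_region(("neurons", slice(0, 4, 2)))
pyramidal = target[selection]
```

Keeping `get_region` in the base class means a concrete resolver only has to define how a single reference maps to an object.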

### Query Utilities

Utility functions and classes for advanced querying and data filtering.

```python { .api }
def query_dataset(dataset: HDMFDataset, query_str: str, **kwargs):
    """
    Query dataset using query string syntax.

    Args:
        dataset: Dataset to query
        query_str: Query string with conditions
        **kwargs: Additional query parameters

    Returns:
        Query results

    Examples:
        query_dataset(data, "column > 5 AND column < 10")
        query_dataset(data, "name LIKE 'neuron_*'")
    """

def filter_data(data, condition_func, **kwargs):
    """
    Filter data using condition function.

    Args:
        data: Data to filter
        condition_func: Function returning boolean mask
        **kwargs: Additional filtering options

    Returns:
        Filtered data
    """

class QueryResult:
    """
    Result object for query operations with lazy evaluation.

    Provides access to query results with efficient memory usage
    and support for chaining additional operations.
    """

    def __init__(self, source_dataset, indices, **kwargs):
        """
        Initialize query result.

        Args:
            source_dataset: Source dataset
            indices: Selected indices
        """

    def to_array(self):
        """
        Convert query result to numpy array.

        Returns:
            NumPy array with query results
        """

    def __getitem__(self, key):
        """Access subset of query results."""

    def __len__(self) -> int:
        """Number of results."""

    def __iter__(self):
        """Iterate over results."""
```
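
One way to picture a result object with lazy evaluation and chaining is to store only the matching indices and read from the source on access. The sketch below is an assumption-laden illustration, not the library's code: `LazyQueryResult`, its `refine` method, and the `where` helper are hypothetical names, and the source is an in-memory NumPy array.

```python
import numpy as np


class LazyQueryResult:
    """Hypothetical result object: stores indices, reads data on access."""

    def __init__(self, source, indices):
        self._source = source
        self._indices = np.asarray(indices, dtype=np.intp)

    def __len__(self):
        return int(self._indices.size)

    def __getitem__(self, key):
        # Translate positions within the result into source indices
        return self._source[self._indices[key]]

    def __iter__(self):
        for idx in self._indices:
            yield self._source[idx]

    def to_array(self):
        return np.asarray(self._source[self._indices])

    def refine(self, predicate):
        """Chain a further filter; returns a new LazyQueryResult."""
        keep = np.array(
            [bool(predicate(self._source[i])) for i in self._indices], dtype=bool
        )
        return LazyQueryResult(self._source, self._indices[keep])


def where(source, predicate):
    """Hypothetical helper: wrap the indices of elements satisfying `predicate`.

    For simplicity the predicate is evaluated on the full array here.
    """
    mask = predicate(np.asarray(source))
    return LazyQueryResult(source, np.flatnonzero(mask))


values = np.array([1, 7, 3, 9, 8, 2, 6])
result = where(values, lambda v: v > 5)          # keeps indices 1, 3, 4, 6
even_only = result.refine(lambda v: v % 2 == 0)  # keeps indices 4, 6
```

Because each refinement produces a new index array rather than a copy of the data, chained filters stay cheap until `to_array()` is called.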

## Usage Examples

### Basic Dataset Querying

```python
from hdmf.backends.hdf5 import HDF5IO
from hdmf.query import HDMFDataset
import numpy as np

# Open HDF5 file with data
with HDF5IO('experiment.h5', mode='r') as io:
    container = io.read()

    # Get dataset as HDMFDataset for querying
    neural_data = container.neural_data.data  # This is an HDMFDataset

    # Basic slicing operations
    first_1000_samples = neural_data[0:1000, :]
    specific_channels = neural_data[:, [0, 5, 10, 15]]
    time_window = neural_data[5000:10000, :]

    print(f"Dataset shape: {neural_data.shape}")
    print(f"First 1000 samples shape: {first_1000_samples.shape}")
    print(f"Selected channels shape: {specific_channels.shape}")

# Advanced indexing with boolean masks
with HDF5IO('experiment.h5', mode='r') as io:
    container = io.read()
    voltage_data = container.voltage_traces.data

    # Create boolean mask for high-activity periods
    # Note: voltage_data[:] reads the full dataset into memory
    mean_activity = np.mean(voltage_data[:], axis=1)
    high_activity_mask = mean_activity > np.percentile(mean_activity, 95)

    # Extract high activity periods
    high_activity_data = voltage_data[high_activity_mask, :]
    print(f"High activity periods: {high_activity_data.shape}")
```

### Querying Dynamic Tables

```python
from hdmf.common import DynamicTable
import numpy as np

# Create sample table
subjects_table = DynamicTable(
    name='subjects',
    description='Subject information'
)

subjects_table.add_column('subject_id', 'Subject ID')
subjects_table.add_column('age', 'Age in months', dtype='int')
subjects_table.add_column('weight', 'Weight in grams', dtype='float')
subjects_table.add_column('genotype', 'Genotype')

# Add sample data
for i in range(50):
    subjects_table.add_row(
        subject_id=f'subject_{i:03d}',
        age=np.random.randint(3, 24),
        weight=np.random.normal(25.0, 3.0),
        genotype=np.random.choice(['WT', 'KO'])
    )

# Equality query on a single column using the table's which() method
ko_subjects = subjects_table.which(genotype='KO')
print(f"KO subjects: {len(ko_subjects)}")

# Comparison queries: which() is assumed here to match one column by equality
# only, so range conditions are expressed on the table's DataFrame view
df = subjects_table.to_dataframe()

adult_subjects = df[df['age'] > 12]
print(f"Adult subjects: {len(adult_subjects)}")

heavy_subjects = df[df['weight'] > 27.0]
print(f"Heavy subjects: {len(heavy_subjects)}")

# Complex queries combining conditions
adult_ko = df[(df['age'] > 12) & (df['genotype'] == 'KO')]
print(f"Adult KO subjects: {len(adult_ko)}")
```

### Reference Resolution

```python
from hdmf.query import ContainerResolver
from hdmf.common import DynamicTable, DynamicTableRegion, get_type_map

# Create referenced data structure
neurons_table = DynamicTable(name='neurons', description='Neuron data')
neurons_table.add_column('neuron_id', 'Neuron ID')
neurons_table.add_column('cell_type', 'Cell type')

# Add neurons
for i in range(20):
    neurons_table.add_row(
        neuron_id=f'neuron_{i:03d}',
        cell_type='pyramidal' if i % 2 == 0 else 'interneuron'
    )

# Create table region referring to subset
pyramidal_region = DynamicTableRegion(
    name='pyramidal_neurons',
    data=[i for i in range(0, 20, 2)],  # Even indices (pyramidal cells)
    description='Pyramidal neurons only',
    table=neurons_table
)

# Create analysis table using references
analysis_table = DynamicTable(name='analysis', description='Analysis results')
analysis_table.add_column('neuron_group', 'Group of neurons')
analysis_table.add_column('avg_firing_rate', 'Average firing rate', dtype='float')

analysis_table.add_row(
    neuron_group=pyramidal_region,
    avg_firing_rate=15.3
)

# Resolve references using ContainerResolver
type_map = get_type_map()
resolver = ContainerResolver(type_map, neurons_table)

# Access referenced data through resolver
referenced_neurons = analysis_table.get_column('neuron_group').data[0]
resolved_neurons = resolver.get_object(referenced_neurons)

print(f"Referenced neurons: {len(referenced_neurons)} neurons")
print(f"First referenced neuron: {resolved_neurons[0]}")
```

### Advanced Data Filtering

```python
from hdmf.backends.hdf5 import HDF5IO
import numpy as np

# Load time series data
with HDF5IO('timeseries.h5', mode='r') as io:
    container = io.read()
    neural_data = container.neural_data.data

    # Define filtering conditions
    def window_variance(data_slice):
        """Mean across-channel variance of one window, as a single float."""
        return float(np.mean(np.var(data_slice, axis=1)))

    def specific_frequency_condition(data_slice, target_freq=40.0, sampling_rate=1000.0):
        """Return True if power near target_freq is in the top 5% of the window's spectrum."""
        # Simple frequency detection using FFT along the time axis
        power = np.abs(np.fft.fft(data_slice, axis=0))
        freqs = np.fft.fftfreq(data_slice.shape[0], 1 / sampling_rate)

        # Locate the frequency bin closest to the target frequency
        target_idx = np.argmin(np.abs(freqs - target_freq))
        power_at_target = power[target_idx, :]

        return bool(np.mean(power_at_target) > np.percentile(power, 95))

    # Apply filters with a sliding window (50% overlap)
    window_size = 1000  # 1 second windows at 1kHz
    windows = []
    variances = []
    freq_periods = []

    for start_idx in range(0, len(neural_data) - window_size + 1, window_size // 2):
        window_data = neural_data[start_idx:start_idx + window_size, :]
        windows.append((start_idx, start_idx + window_size))
        variances.append(window_variance(window_data))

        if specific_frequency_condition(window_data):
            freq_periods.append(windows[-1])

    # Keep windows whose variance exceeds the 90th percentile over all windows
    high_var_periods = []
    if variances:
        cutoff = np.percentile(variances, 90)
        high_var_periods = [w for w, v in zip(windows, variances) if v > cutoff]

    print(f"High variance periods: {len(high_var_periods)}")
    print(f"Target frequency periods: {len(freq_periods)}")

    # Extract filtered data
    if high_var_periods:
        first_high_var = neural_data[high_var_periods[0][0]:high_var_periods[0][1], :]
        print(f"First high variance period shape: {first_high_var.shape}")
```

### Efficient Large Dataset Queries

```python
from hdmf.backends.hdf5 import HDF5IO
import numpy as np

def query_large_dataset_efficiently(file_path: str, query_condition, chunk_size: int = 10000):
    """
    Efficiently query large datasets using chunked processing.

    Args:
        file_path: Path to HDF5 file
        query_condition: Function that returns boolean mask
        chunk_size: Size of data chunks to process

    Returns:
        List of matching data indices
    """

    matching_indices = []

    with HDF5IO(file_path, mode='r') as io:
        container = io.read()
        dataset = container.large_dataset.data

        total_samples = dataset.shape[0]

        # Process dataset in chunks
        for start_idx in range(0, total_samples, chunk_size):
            end_idx = min(start_idx + chunk_size, total_samples)

            # Load chunk
            chunk_data = dataset[start_idx:end_idx, :]

            # Apply condition to chunk
            chunk_mask = query_condition(chunk_data)

            # Convert local indices to global indices
            local_matches = np.where(chunk_mask)[0]
            global_matches = local_matches + start_idx

            matching_indices.extend(global_matches.tolist())

            print(f"Processed {end_idx}/{total_samples} samples, "
                  f"found {len(local_matches)} matches in chunk")

    return matching_indices

# Example usage
def find_outliers(data_chunk, threshold=3.0):
    """Find rows with any value more than `threshold` standard deviations from the mean.

    The mean and standard deviation are computed per chunk, not over the whole dataset.
    """
    std = np.std(data_chunk, axis=0)
    std = np.where(std == 0, 1.0, std)  # avoid division by zero for constant columns
    z_scores = np.abs((data_chunk - np.mean(data_chunk, axis=0)) / std)
    return np.any(z_scores > threshold, axis=1)

outlier_indices = query_large_dataset_efficiently(
    'large_experiment.h5',
    find_outliers,
    chunk_size=5000
)

print(f"Found {len(outlier_indices)} outlier samples")
```

### Cross-Container Queries

```python
from hdmf.common import DynamicTable, DynamicTableRegion
from hdmf.query import ContainerResolver
import numpy as np

def cross_table_analysis(subjects_table, sessions_table, results_table):
    """
    Perform analysis across multiple related tables.

    Args:
        subjects_table: Table with subject information
        sessions_table: Table with session information
        results_table: Table with analysis results
    """

    # Find high-performing subjects
    high_performance_threshold = 0.85
    high_performers = []

    for i in range(len(results_table)):
        if results_table[i]['performance_score'] > high_performance_threshold:
            high_performers.append(i)

    # Get subject IDs for high performers
    high_performer_subjects = []
    for result_idx in high_performers:
        session_ref = results_table[result_idx]['session']
        # Resolve session reference
        session_info = session_ref.table[session_ref.data[0]]
        subject_id = session_info['subject_id']
        high_performer_subjects.append(subject_id)

    # Analyze subject characteristics
    subject_ages = []
    subject_genotypes = []

    for subject_id in high_performer_subjects:
        # Find subject in subjects table
        subject_indices = subjects_table.which(subject_id=subject_id)
        # Use len() so the check works for both lists and NumPy arrays
        if len(subject_indices) > 0:
            subject_info = subjects_table[subject_indices[0]]
            subject_ages.append(subject_info['age'])
            subject_genotypes.append(subject_info['genotype'])

    # Summary statistics (NaN when no subject matched)
    avg_age = np.mean(subject_ages) if subject_ages else float('nan')
    genotype_counts = {}
    for genotype in subject_genotypes:
        genotype_counts[genotype] = genotype_counts.get(genotype, 0) + 1

    print(f"High performers: {len(high_performers)} sessions")
    print(f"Average age: {avg_age:.1f} months")
    print(f"Genotype distribution: {genotype_counts}")

    return {
        'high_performer_indices': high_performers,
        'subject_ages': subject_ages,
        'genotype_distribution': genotype_counts
    }

# Example usage would require setting up the related tables
# with proper cross-references between subjects, sessions, and results
```

### Query Result Caching and Optimization

```python
from hdmf.backends.hdf5 import HDF5IO
from hdmf.query import HDMFDataset
import numpy as np
from functools import lru_cache

class CachedQueryDataset:
    """Dataset wrapper with query result caching for better performance."""

    def __init__(self, dataset: HDMFDataset, cache_size: int = 128):
        self.dataset = dataset
        self.cache_size = cache_size

        # Create cached query method
        self._cached_query = lru_cache(maxsize=cache_size)(self._query_impl)

    def _query_impl(self, query_hash: str, *args, **kwargs):
        """Internal query implementation for caching."""
        # This would contain the actual query logic
        # Hash is used as cache key; all kwargs values must be hashable
        return self.dataset.query(*args, **kwargs)

    def query_with_cache(self, condition: str, **kwargs):
        """Query with result caching based on condition string."""
        # Create hash of query parameters for caching
        query_params = f"{condition}_{str(sorted(kwargs.items()))}"
        query_hash = str(hash(query_params))

        return self._cached_query(query_hash, condition, **kwargs)

    def clear_cache(self):
        """Clear query result cache."""
        self._cached_query.cache_clear()

    def cache_info(self):
        """Get cache statistics."""
        return self._cached_query.cache_info()

# Usage example
with HDF5IO('experiment.h5', mode='r') as io:
    container = io.read()

    # Wrap dataset with caching
    cached_dataset = CachedQueryDataset(container.neural_data.data)

    # Repeated queries will be cached
    result1 = cached_dataset.query_with_cache("value > 0.5")
    result2 = cached_dataset.query_with_cache("value > 0.5")  # From cache

    print(f"Cache info: {cached_dataset.cache_info()}")
```
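
The caching behavior can be checked without an HDF5 file by substituting a stub for the dataset. The sketch below is illustrative only: `CountingDataset` and `CachedQueries` are hypothetical classes, and the stub's `query(threshold)` signature is an assumption chosen to keep the example small. It relies on `functools.lru_cache` keying directly on the hashable call arguments, so no separate hash string is needed.

```python
from functools import lru_cache


class CountingDataset:
    """Hypothetical stand-in for a dataset with an expensive query() method."""

    def __init__(self, values):
        self.values = list(values)
        self.calls = 0

    def query(self, threshold):
        self.calls += 1
        return tuple(v for v in self.values if v > threshold)


class CachedQueries:
    """Caches query results keyed by the (hashable) query arguments."""

    def __init__(self, dataset, cache_size=128):
        self.dataset = dataset
        # Wrapping the bound method gives each wrapper instance its own cache
        self._cached = lru_cache(maxsize=cache_size)(dataset.query)

    def query(self, threshold):
        return self._cached(threshold)

    def cache_info(self):
        return self._cached.cache_info()


stub = CountingDataset([0.1, 0.6, 0.9, 0.4])
cached = CachedQueries(stub)

result1 = cached.query(0.5)
result2 = cached.query(0.5)   # served from the cache
result3 = cached.query(0.8)   # different argument, computed again
info = cached.cache_info()
```

Cached results go stale if the underlying data changes, so a cache like this is best suited to files opened read-only.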
