# Query System

HDMF's query system provides querying and filtering for datasets and containers, along with reference resolution and lazy data access. It lets you explore and analyze data without loading entire datasets into memory.

## Capabilities

### Dataset Query Interface

Interface for querying HDF5-like datasets with lazy loading and efficient data access.

```python { .api }
class HDMFDataset:
    """
    Dataset query interface providing querying capabilities for HDF5-like datasets.

    Enables efficient data access with lazy loading, slicing, and filtering
    without requiring full dataset loading into memory.
    """

    def __init__(self, dataset, io, **kwargs):
        """
        Initialize HDMF dataset wrapper.

        Args:
            dataset: Underlying dataset object (e.g., h5py.Dataset)
            io: I/O backend for data access
            **kwargs: Additional dataset properties
        """

    def __getitem__(self, key):
        """
        Get data slice from dataset with advanced indexing support.

        Args:
            key: Index, slice, or advanced indexing specification

        Returns:
            Data slice from the dataset

        Examples:
            dataset[0:100]  # Simple slice
            dataset[:, [0, 5, 10]]  # Column selection
            dataset[mask]  # Boolean indexing
        """

    def __setitem__(self, key, value):
        """
        Set data slice in dataset.

        Args:
            key: Index or slice specification
            value: Data to set
        """

    def append(self, data):
        """
        Append data to dataset (if resizable).

        Args:
            data: Data to append
        """

    def query(self, condition: str, **kwargs):
        """
        Query dataset with condition string.

        Args:
            condition: Query condition string
            **kwargs: Additional query parameters

        Returns:
            Filtered data matching the condition
        """

    def where(self, condition):
        """
        Find indices where condition is True.

        Args:
            condition: Boolean condition or callable

        Returns:
            Array of indices where condition is satisfied
        """

    @property
    def shape(self) -> tuple:
        """Shape of the dataset."""

    @property
    def dtype(self):
        """Data type of the dataset."""

    @property
    def size(self) -> int:
        """Total number of elements."""

    @property
    def ndim(self) -> int:
        """Number of dimensions."""
```
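
To make the lazy-loading idea concrete, here is a minimal, library-independent sketch. `CountingStore` and `LazyRows` are hypothetical names invented for this illustration and are not part of HDMF. The sketch only shows the general pattern: `__getitem__` forwards the requested slice to the backing store, so only those rows are read.

```python
class CountingStore:
    """Hypothetical backing store that counts how many rows it has read."""

    def __init__(self, n_rows):
        self.n_rows = n_rows
        self.rows_read = 0

    def read(self, start, stop):
        # Stand-in for a disk read: builds only the requested rows.
        self.rows_read += stop - start
        return [[row, row * 2] for row in range(start, stop)]


class LazyRows:
    """Hypothetical wrapper that reads nothing until it is indexed."""

    def __init__(self, store):
        self._store = store

    @property
    def shape(self):
        # Shape is metadata, so it is available without reading any rows.
        return (self._store.n_rows, 2)

    def __getitem__(self, key):
        if not isinstance(key, slice):
            raise TypeError("this sketch supports slices only")
        start, stop, step = key.indices(self._store.n_rows)
        if step < 1:
            raise ValueError("this sketch supports positive steps only")
        stop = max(start, stop)
        rows = self._store.read(start, stop)  # fetch only the requested span
        return rows[::step]


store = CountingStore(n_rows=1_000_000)
data = LazyRows(store)
chunk = data[10:13]
print(chunk, store.rows_read)
```

In this toy setup, slicing three rows out of a million touches exactly three rows of the store, while `shape` is answered from metadata alone.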

### Reference Resolution

System for resolving references between containers and builders in the data hierarchy.

```python { .api }
class ReferenceResolver:
    """
    Abstract base class for resolving references between containers/builders.

    Provides the interface for resolving object references, region references,
    and other cross-references within HDMF data structures.
    """

    def __init__(self, **kwargs):
        """Initialize reference resolver."""

    def get_object(self, ref) -> object:
        """
        Get object from reference.

        Args:
            ref: Reference to resolve

        Returns:
            Referenced object
        """

    def get_region(self, ref) -> tuple:
        """
        Get region from region reference.

        Args:
            ref: Region reference to resolve

        Returns:
            Tuple of (object, selection)
        """

class BuilderResolver(ReferenceResolver):
    """
    Reference resolver for Builder objects.

    Resolves references between builders during the build process,
    enabling cross-references in storage representations.
    """

    def __init__(self, builder_map: dict, **kwargs):
        """
        Initialize builder resolver.

        Args:
            builder_map: Dictionary mapping objects to builders
        """

    def get_object(self, ref):
        """
        Get builder from reference.

        Args:
            ref: Reference to builder

        Returns:
            Builder object
        """

class ContainerResolver(ReferenceResolver):
    """
    Reference resolver for Container objects.

    Resolves references between containers in the constructed object hierarchy,
    enabling navigation and cross-references in the in-memory representation.
    """

    def __init__(self, type_map: 'TypeMap', container: 'Container', **kwargs):
        """
        Initialize container resolver.

        Args:
            type_map: Type mapping for container resolution
            container: Root container for resolution context
        """

    def get_object(self, ref):
        """
        Get container from reference.

        Args:
            ref: Reference to container

        Returns:
            Container object
        """

    def get_region(self, ref):
        """
        Get region from container reference.

        Args:
            ref: Region reference

        Returns:
            Tuple of (container, selection)
        """
```
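
The resolver pattern described above can be sketched without HDMF at all. In the following illustration, `ToyResolver` and `DictResolver` are hypothetical classes, and the reference format (a plain key, or a `(key, selection)` pair for regions) is an assumption made for the example, not HDMF's actual reference representation.

```python
from abc import ABC, abstractmethod


class ToyResolver(ABC):
    """Hypothetical base class: turns a reference into the object it names."""

    @abstractmethod
    def get_object(self, ref):
        """Return the object that `ref` points to."""

    def get_region(self, ref):
        # Assumption for this sketch: a region reference is (key, selection).
        key, selection = ref
        return self.get_object(key), selection


class DictResolver(ToyResolver):
    """Hypothetical resolver backed by a plain dictionary."""

    def __init__(self, objects):
        self._objects = dict(objects)

    def get_object(self, ref):
        try:
            return self._objects[ref]
        except KeyError:
            raise KeyError(f"unresolved reference: {ref!r}") from None


resolver = DictResolver({"neurons": ["n0", "n1", "n2", "n3"]})
target, selection = resolver.get_region(("neurons", slice(1, 3)))
print(target[selection])
```

Returning `(object, selection)` rather than the selected data keeps the resolver cheap: the caller decides when, and whether, to apply the selection.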

### Query Utilities

Utility functions and classes for advanced querying and data filtering.

```python { .api }
def query_dataset(dataset: HDMFDataset, query_str: str, **kwargs):
    """
    Query dataset using query string syntax.

    Args:
        dataset: Dataset to query
        query_str: Query string with conditions
        **kwargs: Additional query parameters

    Returns:
        Query results

    Examples:
        query_dataset(data, "column > 5 AND column < 10")
        query_dataset(data, "name LIKE 'neuron_*'")
    """

def filter_data(data, condition_func, **kwargs):
    """
    Filter data using condition function.

    Args:
        data: Data to filter
        condition_func: Function returning boolean mask
        **kwargs: Additional filtering options

    Returns:
        Filtered data
    """

class QueryResult:
    """
    Result object for query operations with lazy evaluation.

    Provides access to query results with efficient memory usage
    and support for chaining additional operations.
    """

    def __init__(self, source_dataset, indices, **kwargs):
        """
        Initialize query result.

        Args:
            source_dataset: Source dataset
            indices: Selected indices
        """

    def to_array(self):
        """
        Convert query result to numpy array.

        Returns:
            NumPy array with query results
        """

    def __getitem__(self, key):
        """Access subset of query results."""

    def __len__(self) -> int:
        """Number of results."""

    def __iter__(self):
        """Iterate over results."""
```
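
One way to read "result object with lazy evaluation" is a view that stores only the matching indices and looks values up on demand. The sketch below illustrates that reading with plain Python lists. `LazyResult` and `where` are hypothetical names for this example and are not the implementation of `QueryResult`.

```python
class LazyResult:
    """Hypothetical result view: keeps indices, reads values only on access."""

    def __init__(self, source, indices):
        self._source = source
        self._indices = list(indices)

    def __len__(self):
        return len(self._indices)

    def __iter__(self):
        for index in self._indices:
            yield self._source[index]

    def __getitem__(self, key):
        if isinstance(key, slice):
            # Narrowing a result yields another view, so operations can chain.
            return LazyResult(self._source, self._indices[key])
        return self._source[self._indices[key]]

    def to_list(self):
        # Materialize the selected values only when explicitly requested.
        return list(self)


def where(source, predicate):
    """Scan the source once and keep the indices that satisfy the predicate."""
    matches = (i for i, value in enumerate(source) if predicate(value))
    return LazyResult(source, matches)


values = [3, 8, 1, 9, 7]
result = where(values, lambda v: v > 5)
tail = result[1:]
print(len(result), result[0], tail.to_list())
```

Note that the scan in `where` still reads every value once; what stays lazy is the result, which holds indices rather than copies of the data.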

## Usage Examples

### Basic Dataset Querying

```python
from hdmf.backends.hdf5 import HDF5IO
from hdmf.query import HDMFDataset
import numpy as np

# Open HDF5 file with data
with HDF5IO('experiment.h5', mode='r') as io:
    container = io.read()

    # Get dataset as HDMFDataset for querying
    neural_data = container.neural_data.data  # This is an HDMFDataset

    # Basic slicing operations (the file must stay open while data is read lazily)
    first_1000_samples = neural_data[0:1000, :]
    specific_channels = neural_data[:, [0, 5, 10, 15]]
    time_window = neural_data[5000:10000, :]

    print(f"Dataset shape: {neural_data.shape}")
    print(f"First 1000 samples shape: {first_1000_samples.shape}")
    print(f"Selected channels shape: {specific_channels.shape}")

# Advanced indexing with boolean masks
with HDF5IO('experiment.h5', mode='r') as io:
    container = io.read()
    voltage_data = container.voltage_traces.data

    # Create boolean mask for high-activity periods
    # Note: voltage_data[:] reads the full dataset into memory
    mean_activity = np.mean(voltage_data[:], axis=1)
    high_activity_mask = mean_activity > np.percentile(mean_activity, 95)

    # Extract high activity periods
    high_activity_data = voltage_data[high_activity_mask, :]
    print(f"High activity periods: {high_activity_data.shape}")
```
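
The boolean-mask example above reads `voltage_data[:]`, which loads the full dataset at once. When that is too large, the per-row means can be accumulated chunk by chunk instead. The helper below is a sketch under the assumption that the data object exposes `shape` and supports `data[start:stop]` row slicing, as a NumPy array or an `h5py.Dataset` does; `row_means_chunked` is a hypothetical name.

```python
import numpy as np


def row_means_chunked(data, chunk_rows=1024):
    """Compute the mean of every row while holding one chunk in memory."""
    n_rows = data.shape[0]
    means = np.empty(n_rows, dtype=float)
    for start in range(0, n_rows, chunk_rows):
        stop = min(start + chunk_rows, n_rows)
        # Only rows start..stop-1 are read for this step.
        means[start:stop] = np.mean(data[start:stop], axis=1)
    return means


# Demonstration on an in-memory array standing in for an on-disk dataset.
demo = np.arange(20.0).reshape(10, 2)
means = row_means_chunked(demo, chunk_rows=3)
mask = means > np.percentile(means, 50)
print(means, mask.sum())
```

The resulting `means` array has one value per row, so it is usually small enough to keep in memory even when the dataset itself is not.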

### Querying Dynamic Tables

```python
from hdmf.common import DynamicTable
import numpy as np

# Create sample table
subjects_table = DynamicTable(
    name='subjects',
    description='Subject information'
)

subjects_table.add_column('subject_id', 'Subject ID')
subjects_table.add_column('age', 'Age in months', dtype='int')
subjects_table.add_column('weight', 'Weight in grams', dtype='float')
subjects_table.add_column('genotype', 'Genotype')

# Add sample data
for i in range(50):
    subjects_table.add_row(
        subject_id=f'subject_{i:03d}',
        age=np.random.randint(3, 24),
        weight=np.random.normal(25.0, 3.0),
        genotype=np.random.choice(['WT', 'KO'])
    )

# Query using table methods
adult_subjects = subjects_table.which(age__gt=12)
print(f"Adult subjects: {len(adult_subjects)}")

heavy_subjects = subjects_table.which(weight__gt=27.0)
print(f"Heavy subjects: {len(heavy_subjects)}")

ko_subjects = subjects_table.which(genotype='KO')
print(f"KO subjects: {len(ko_subjects)}")

# Complex queries combining conditions
adult_ko = []
for idx in range(len(subjects_table)):
    row = subjects_table[idx]
    if row['age'] > 12 and row['genotype'] == 'KO':
        adult_ko.append(idx)

print(f"Adult KO subjects: {len(adult_ko)}")
```
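
The loop above tests each row in turn. An alternative is to compute one set of matching row indices per condition and intersect the sets. The sketch below uses plain lists as stand-ins for table columns; `indices_where` is a hypothetical helper and the sample values are made up for illustration.

```python
def indices_where(column, predicate):
    """Return the set of row indices whose value satisfies the predicate."""
    return {i for i, value in enumerate(column) if predicate(value)}


# Plain lists standing in for the 'age' and 'genotype' columns.
ages = [4, 15, 20, 9, 13]
genotypes = ["KO", "WT", "KO", "KO", "KO"]

adult = indices_where(ages, lambda age: age > 12)
ko = indices_where(genotypes, lambda genotype: genotype == "KO")

# Set intersection implements AND; union (|) would implement OR.
adult_ko = sorted(adult & ko)
print(adult_ko)
```

Keeping each condition as its own index set makes it easy to reuse a condition across several combined queries.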

### Reference Resolution

```python
from hdmf.query import ContainerResolver
from hdmf.common import DynamicTable, DynamicTableRegion, get_type_map

# Create referenced data structure
neurons_table = DynamicTable(name='neurons', description='Neuron data')
neurons_table.add_column('neuron_id', 'Neuron ID')
neurons_table.add_column('cell_type', 'Cell type')

# Add neurons
for i in range(20):
    neurons_table.add_row(
        neuron_id=f'neuron_{i:03d}',
        cell_type='pyramidal' if i % 2 == 0 else 'interneuron'
    )

# Create table region referring to subset
pyramidal_region = DynamicTableRegion(
    name='pyramidal_neurons',
    data=list(range(0, 20, 2)),  # Even indices (pyramidal cells)
    description='Pyramidal neurons only',
    table=neurons_table
)

# Create analysis table using references
analysis_table = DynamicTable(name='analysis', description='Analysis results')
analysis_table.add_column('neuron_group', 'Group of neurons')
analysis_table.add_column('avg_firing_rate', 'Average firing rate', dtype='float')

analysis_table.add_row(
    neuron_group=pyramidal_region,
    avg_firing_rate=15.3
)

# Resolve references using ContainerResolver
type_map = get_type_map()
resolver = ContainerResolver(type_map, neurons_table)

# Access referenced data through resolver
referenced_neurons = analysis_table.get_column('neuron_group').data[0]
resolved_neurons = resolver.get_object(referenced_neurons)

print(f"Referenced neurons: {len(referenced_neurons)} neurons")
print(f"First referenced neuron: {resolved_neurons[0]}")
```

### Advanced Data Filtering

```python
from hdmf.backends.hdf5 import HDF5IO
import numpy as np

# Define per-window measures
def window_variance(data_slice):
    """Mean across-channel variance of one window, as a single scalar."""
    return float(np.mean(np.var(data_slice, axis=1)))

def specific_frequency_condition(data_slice, target_freq=40.0, sampling_rate=1000.0):
    """Check whether a window has a spectral peak near the target frequency."""
    # Simple frequency detection using an FFT along the time axis
    spectrum = np.abs(np.fft.fft(data_slice, axis=0))
    freqs = np.fft.fftfreq(data_slice.shape[0], 1 / sampling_rate)

    # Compare power near the target frequency with the whole spectrum
    target_idx = np.argmin(np.abs(freqs - target_freq))
    power_at_target = spectrum[target_idx, :]

    return bool(np.mean(power_at_target) > np.percentile(spectrum, 95))

# Apply filters with sliding window
window_size = 1000  # 1 second windows at 1 kHz
windows = []
variances = []
freq_periods = []

# Load time series data; the file must stay open while windows are read
with HDF5IO('timeseries.h5', mode='r') as io:
    container = io.read()
    timestamps = container.timestamps.data
    neural_data = container.neural_data.data

    for start_idx in range(0, len(neural_data) - window_size + 1, window_size // 2):
        window_data = neural_data[start_idx:start_idx + window_size, :]
        period = (start_idx, start_idx + window_size)

        windows.append(period)
        variances.append(window_variance(window_data))

        if specific_frequency_condition(window_data):
            freq_periods.append(period)

    # Keep windows whose variance exceeds the 90th percentile over all windows
    high_var_periods = []
    if variances:
        threshold = np.percentile(variances, 90)
        high_var_periods = [w for w, v in zip(windows, variances) if v > threshold]

    print(f"High variance periods: {len(high_var_periods)}")
    print(f"Target frequency periods: {len(freq_periods)}")

    # Extract filtered data
    if high_var_periods:
        start, stop = high_var_periods[0]
        first_high_var = neural_data[start:stop, :]
        print(f"First high variance period shape: {first_high_var.shape}")
```
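
The filtering loop above steps through the data in half-overlapping windows. The window arithmetic is easy to get wrong at the end of the data, so it can help to isolate it in a small generator. `sliding_windows` below is a hypothetical helper written for this illustration; it yields only full-length windows.

```python
def sliding_windows(n_samples, window_size, step):
    """Yield (start, stop) bounds of every full window over n_samples."""
    if window_size <= 0 or step <= 0:
        raise ValueError("window_size and step must be positive")
    # The last valid start is n_samples - window_size, hence the + 1.
    for start in range(0, n_samples - window_size + 1, step):
        yield start, start + window_size


bounds = list(sliding_windows(n_samples=10, window_size=4, step=2))
print(bounds)
```

With 10 samples, a window of 4, and a step of 2, the final window ends exactly at sample 10. If the data is shorter than one window, the generator yields nothing.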

### Efficient Large Dataset Queries

```python
from hdmf.backends.hdf5 import HDF5IO
import numpy as np

def query_large_dataset_efficiently(file_path: str, query_condition, chunk_size: int = 10000):
    """
    Efficiently query large datasets using chunked processing.

    Args:
        file_path: Path to HDF5 file
        query_condition: Function that returns boolean mask
        chunk_size: Size of data chunks to process

    Returns:
        List of matching data indices
    """

    matching_indices = []

    with HDF5IO(file_path, mode='r') as io:
        container = io.read()
        dataset = container.large_dataset.data

        total_samples = dataset.shape[0]

        # Process dataset in chunks
        for start_idx in range(0, total_samples, chunk_size):
            end_idx = min(start_idx + chunk_size, total_samples)

            # Load chunk
            chunk_data = dataset[start_idx:end_idx, :]

            # Apply condition to chunk
            chunk_mask = query_condition(chunk_data)

            # Convert local indices to global indices
            local_matches = np.where(chunk_mask)[0]
            global_matches = local_matches + start_idx

            # Store plain Python ints rather than NumPy scalars
            matching_indices.extend(global_matches.tolist())

            print(f"Processed {end_idx}/{total_samples} samples, "
                  f"found {len(local_matches)} matches in chunk")

    return matching_indices

# Example usage
def find_outliers(data_chunk, threshold=3.0):
    """Flag rows with any value more than `threshold` standard deviations from the chunk mean."""
    std = np.std(data_chunk, axis=0)
    std = np.where(std == 0, 1.0, std)  # avoid division by zero for constant columns
    z_scores = np.abs((data_chunk - np.mean(data_chunk, axis=0)) / std)
    return np.any(z_scores > threshold, axis=1)

outlier_indices = query_large_dataset_efficiently(
    'large_experiment.h5',
    find_outliers,
    chunk_size=5000
)

print(f"Found {len(outlier_indices)} outlier samples")
```
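
`find_outliers` above standardizes each chunk with that chunk's own mean and standard deviation, so the same value can be flagged in one chunk and not in another. If outliers should be judged against the whole dataset, one option is a two-pass approach: accumulate global statistics with Welford's online algorithm, then re-scan the chunks. The sketch below works on one-dimensional chunks of plain floats; `RunningStats` and `global_outliers` are hypothetical names, and the chunks must be re-iterable (for example a list of lists).

```python
import math


class RunningStats:
    """Welford's online algorithm for the mean and population variance."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def std(self):
        return math.sqrt(self._m2 / self.count) if self.count else 0.0


def global_outliers(chunks, threshold=3.0):
    """Return global indices of values beyond `threshold` global std devs."""
    stats = RunningStats()
    for chunk in chunks:  # pass 1: accumulate statistics
        for value in chunk:
            stats.update(value)
    if stats.std == 0:
        return []
    outliers, offset = [], 0
    for chunk in chunks:  # pass 2: flag values using the global statistics
        for i, value in enumerate(chunk):
            if abs(value - stats.mean) / stats.std > threshold:
                outliers.append(offset + i)
        offset += len(chunk)
    return outliers


chunks = [[1.0] * 10, [1.0] * 9 + [100.0]]
outliers = global_outliers(chunks)
print(outliers)
```

The cost is a second read of the data. Whether that is acceptable depends on how expensive chunk reads are compared with the risk of chunk-local thresholds.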

### Cross-Container Queries

```python
from hdmf.common import DynamicTable, DynamicTableRegion
from hdmf.query import ContainerResolver
import numpy as np

def cross_table_analysis(subjects_table, sessions_table, results_table):
    """
    Perform analysis across multiple related tables.

    Args:
        subjects_table: Table with subject information
        sessions_table: Table with session information
        results_table: Table with analysis results
    """

    # Find high-performing subjects
    high_performance_threshold = 0.85
    high_performers = []

    for i in range(len(results_table)):
        if results_table[i]['performance_score'] > high_performance_threshold:
            high_performers.append(i)

    # Get subject IDs for high performers
    high_performer_subjects = []
    for result_idx in high_performers:
        session_ref = results_table[result_idx]['session']
        # Resolve session reference
        session_info = session_ref.table[session_ref.data[0]]
        subject_id = session_info['subject_id']
        high_performer_subjects.append(subject_id)

    # Analyze subject characteristics
    subject_ages = []
    subject_genotypes = []

    for subject_id in high_performer_subjects:
        # Find subject in subjects table
        subject_indices = subjects_table.which(subject_id=subject_id)
        # Use len() so the check also works when the result is an array
        if len(subject_indices) > 0:
            subject_info = subjects_table[subject_indices[0]]
            subject_ages.append(subject_info['age'])
            subject_genotypes.append(subject_info['genotype'])

    # Summary statistics (NaN when no subject matched)
    avg_age = float(np.mean(subject_ages)) if subject_ages else float('nan')
    genotype_counts = {}
    for genotype in subject_genotypes:
        genotype_counts[genotype] = genotype_counts.get(genotype, 0) + 1

    print(f"High performers: {len(high_performers)} sessions")
    print(f"Average age: {avg_age:.1f} months")
    print(f"Genotype distribution: {genotype_counts}")

    return {
        'high_performer_indices': high_performers,
        'subject_ages': subject_ages,
        'genotype_distribution': genotype_counts
    }

# Example usage would require setting up the related tables
# with proper cross-references between subjects, sessions, and results
```
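
The loop above calls `subjects_table.which(...)` once per subject ID, which rescans the table for every lookup. When many lookups hit the same table, building a dictionary from key to row index once is a common alternative. The sketch below uses plain lists as stand-ins for table columns; `build_index` and the sample values are hypothetical.

```python
def build_index(keys):
    """Map each key to the row index of its first occurrence."""
    index = {}
    for row, key in enumerate(keys):
        index.setdefault(key, row)  # keep the first row if a key repeats
    return index


# Plain lists standing in for the 'subject_id' and 'age' columns.
subject_ids = ["s0", "s1", "s2"]
ages = [5, 14, 20]
index = build_index(subject_ids)

# Subject IDs gathered from sessions; "s9" has no row in the subjects table.
session_subjects = ["s2", "s0", "s2", "s9"]
matched_ages = [ages[index[s]] for s in session_subjects if s in index]
print(matched_ages)
```

Each lookup is then a constant-time dictionary access, and unknown IDs are skipped explicitly instead of relying on an empty query result.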

### Query Result Caching and Optimization

```python
from hdmf.backends.hdf5 import HDF5IO
from hdmf.query import HDMFDataset
import numpy as np
from functools import lru_cache

class CachedQueryDataset:
    """Dataset wrapper with query result caching for better performance."""

    def __init__(self, dataset: HDMFDataset, cache_size: int = 128):
        self.dataset = dataset
        self.cache_size = cache_size

        # Create cached query method
        self._cached_query = lru_cache(maxsize=cache_size)(self._query_impl)

    def _query_impl(self, query_hash: str, *args, **kwargs):
        """Internal query implementation for caching."""
        # This would contain the actual query logic.
        # lru_cache keys on all arguments, so every kwarg value must be hashable.
        return self.dataset.query(*args, **kwargs)

    def query_with_cache(self, condition: str, **kwargs):
        """Query with result caching based on condition string."""
        # Create hash of query parameters for caching
        query_params = f"{condition}_{str(sorted(kwargs.items()))}"
        query_hash = str(hash(query_params))

        return self._cached_query(query_hash, condition, **kwargs)

    def clear_cache(self):
        """Clear query result cache."""
        self._cached_query.cache_clear()

    def cache_info(self):
        """Get cache statistics."""
        return self._cached_query.cache_info()

# Usage example
with HDF5IO('experiment.h5', mode='r') as io:
    container = io.read()

    # Wrap dataset with caching
    cached_dataset = CachedQueryDataset(container.neural_data.data)

    # Repeated queries will be cached
    result1 = cached_dataset.query_with_cache("value > 0.5")
    result2 = cached_dataset.query_with_cache("value > 0.5")  # From cache

    print(f"Cache info: {cached_dataset.cache_info()}")
```
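
The wrapper above passes a precomputed hash alongside the query arguments. Since `functools.lru_cache` already builds its cache key from every positional and keyword argument, a plain cached function behaves the same way, as the standalone sketch below shows. `run_query` is a hypothetical stand-in for an expensive query and simply records how often it actually runs.

```python
from functools import lru_cache

calls = []  # records each time the underlying "query" really executes


@lru_cache(maxsize=128)
def run_query(condition, limit=None):
    """Hypothetical expensive query; arguments must be hashable."""
    calls.append(condition)
    return f"results for {condition!r} (limit={limit})"


first = run_query("value > 0.5", limit=10)
second = run_query("value > 0.5", limit=10)  # served from the cache
other = run_query("value > 0.9", limit=10)   # different arguments: cache miss

info = run_query.cache_info()
print(info.hits, info.misses, len(calls))
```

A cached result is only as fresh as the data behind it, so the cache should be cleared (`run_query.cache_clear()`) whenever the underlying dataset changes or its file is closed.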