# Query System

HDMF's query system provides querying and filtering for datasets and containers, with reference resolution and advanced data access patterns. It enables efficient data exploration and analysis without loading entire datasets into memory.

## Capabilities

### Dataset Query Interface

Interface for querying HDF5-like datasets with lazy loading and efficient data access.

```python { .api }
class HDMFDataset:
    """
    Dataset query interface providing querying capabilities for HDF5-like datasets.

    Enables efficient data access with lazy loading, slicing, and filtering
    without requiring full dataset loading into memory.
    """

    def __init__(self, dataset, io, **kwargs):
        """
        Initialize HDMF dataset wrapper.

        Args:
            dataset: Underlying dataset object (e.g., h5py.Dataset)
            io: I/O backend for data access
            **kwargs: Additional dataset properties
        """

    def __getitem__(self, key):
        """
        Get data slice from dataset with advanced indexing support.

        Args:
            key: Index, slice, or advanced indexing specification

        Returns:
            Data slice from the dataset

        Examples:
            dataset[0:100]          # Simple slice
            dataset[:, [0, 5, 10]]  # Column selection
            dataset[mask]           # Boolean indexing
        """

    def __setitem__(self, key, value):
        """
        Set data slice in dataset.

        Args:
            key: Index or slice specification
            value: Data to set
        """

    def append(self, data):
        """
        Append data to dataset (if resizable).

        Args:
            data: Data to append
        """

    def query(self, condition: str, **kwargs):
        """
        Query dataset with condition string.

        Args:
            condition: Query condition string
            **kwargs: Additional query parameters

        Returns:
            Filtered data matching the condition
        """

    def where(self, condition):
        """
        Find indices where condition is True.

        Args:
            condition: Boolean condition or callable

        Returns:
            Array of indices where condition is satisfied
        """

    @property
    def shape(self) -> tuple:
        """Shape of the dataset."""

    @property
    def dtype(self):
        """Data type of the dataset."""

    @property
    def size(self) -> int:
        """Total number of elements."""

    @property
    def ndim(self) -> int:
        """Number of dimensions."""
```
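
The `where` and `query` methods are not exercised in the usage examples below, so here is a minimal hedged sketch of how they are intended to compose with slicing. The file name and container layout are borrowed from the examples later in this document, and the condition string grammar is an assumption based on the `query_dataset` examples:

```python
from hdmf.backends.hdf5 import HDF5IO

with HDF5IO('experiment.h5', mode='r') as io:
    container = io.read()
    ds = container.neural_data.data  # HDMFDataset wrapper (per this doc)

    print(ds.shape, ds.dtype, ds.size, ds.ndim)

    # Callable condition: indices of rows whose first channel is positive
    idx = ds.where(lambda x: x[:, 0] > 0.0)
    positive_rows = ds[idx]

    # String condition (grammar assumed from the query_dataset examples below)
    filtered = ds.query("value > 0.5")
```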

### Reference Resolution

System for resolving references between containers and builders in the data hierarchy.

```python { .api }
class ReferenceResolver:
    """
    Abstract base class for resolving references between containers/builders.

    Provides the interface for resolving object references, region references,
    and other cross-references within HDMF data structures.
    """

    def __init__(self, **kwargs):
        """Initialize reference resolver."""

    def get_object(self, ref) -> object:
        """
        Get object from reference.

        Args:
            ref: Reference to resolve

        Returns:
            Referenced object
        """

    def get_region(self, ref) -> tuple:
        """
        Get region from region reference.

        Args:
            ref: Region reference to resolve

        Returns:
            Tuple of (object, selection)
        """

class BuilderResolver(ReferenceResolver):
    """
    Reference resolver for Builder objects.

    Resolves references between builders during the build process,
    enabling cross-references in storage representations.
    """

    def __init__(self, builder_map: dict, **kwargs):
        """
        Initialize builder resolver.

        Args:
            builder_map: Dictionary mapping objects to builders
        """

    def get_object(self, ref):
        """
        Get builder from reference.

        Args:
            ref: Reference to builder

        Returns:
            Builder object
        """

class ContainerResolver(ReferenceResolver):
    """
    Reference resolver for Container objects.

    Resolves references between containers in the constructed object hierarchy,
    enabling navigation and cross-references in the in-memory representation.
    """

    def __init__(self, type_map: 'TypeMap', container: 'Container', **kwargs):
        """
        Initialize container resolver.

        Args:
            type_map: Type mapping for container resolution
            container: Root container for resolution context
        """

    def get_object(self, ref):
        """
        Get container from reference.

        Args:
            ref: Reference to container

        Returns:
            Container object
        """

    def get_region(self, ref):
        """
        Get region from container reference.

        Args:
            ref: Region reference

        Returns:
            Tuple of (container, selection)
        """
```
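
A brief hedged sketch of the resolver interface before the fuller example later in this document. The constructor arguments follow the API above; the reference value itself would come from a read file or builder map, so the resolution calls are shown as comments with a placeholder `ref`:

```python
from hdmf.common import DynamicTable, get_type_map
from hdmf.query import ContainerResolver

# Any in-memory container can serve as the resolution root
root = DynamicTable(name='demo', description='Resolver demo target')
resolver = ContainerResolver(get_type_map(), root)

# Given a reference `ref` obtained from a read file (placeholder):
# obj = resolver.get_object(ref)              # follow an object reference
# obj, selection = resolver.get_region(ref)   # region -> (container, selection)
```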

### Query Utilities

Utility functions and classes for advanced querying and data filtering.

```python { .api }
def query_dataset(dataset: HDMFDataset, query_str: str, **kwargs):
    """
    Query dataset using query string syntax.

    Args:
        dataset: Dataset to query
        query_str: Query string with conditions
        **kwargs: Additional query parameters

    Returns:
        Query results

    Examples:
        query_dataset(data, "column > 5 AND column < 10")
        query_dataset(data, "name LIKE 'neuron_*'")
    """

def filter_data(data, condition_func, **kwargs):
    """
    Filter data using condition function.

    Args:
        data: Data to filter
        condition_func: Function returning boolean mask
        **kwargs: Additional filtering options

    Returns:
        Filtered data
    """

class QueryResult:
    """
    Result object for query operations with lazy evaluation.

    Provides access to query results with efficient memory usage
    and support for chaining additional operations.
    """

    def __init__(self, source_dataset, indices, **kwargs):
        """
        Initialize query result.

        Args:
            source_dataset: Source dataset
            indices: Selected indices
        """

    def to_array(self):
        """
        Convert query result to numpy array.

        Returns:
            NumPy array with query results
        """

    def __getitem__(self, key):
        """Access subset of query results."""

    def __len__(self) -> int:
        """Number of results."""

    def __iter__(self):
        """Iterate over results."""
```
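
`filter_data` and `QueryResult` do not appear in the longer examples below, so here is a minimal hedged sketch of how they compose. Whether `filter_data` returns a plain array or a `QueryResult` is not pinned down above, so the sketch handles either:

```python
import numpy as np
from hdmf.query import filter_data, QueryResult

data = np.random.normal(0.0, 1.0, size=(1000, 4))

# Condition function returning a boolean mask over rows
result = filter_data(data, lambda d: d[:, 0] > 1.0)

# QueryResult supports len(), slicing, and iteration with lazy evaluation
print(len(result))
as_array = result.to_array() if isinstance(result, QueryResult) else np.asarray(result)
```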

## Usage Examples

### Basic Dataset Querying

```python
from hdmf.backends.hdf5 import HDF5IO
from hdmf.query import HDMFDataset
import numpy as np

# Open HDF5 file with data
with HDF5IO('experiment.h5', mode='r') as io:
    container = io.read()

    # Get dataset as HDMFDataset for querying
    neural_data = container.neural_data.data  # This is an HDMFDataset

    # Basic slicing operations
    first_1000_samples = neural_data[0:1000, :]
    specific_channels = neural_data[:, [0, 5, 10, 15]]
    time_window = neural_data[5000:10000, :]

    print(f"Dataset shape: {neural_data.shape}")
    print(f"First 1000 samples shape: {first_1000_samples.shape}")
    print(f"Selected channels shape: {specific_channels.shape}")

# Advanced indexing with boolean masks
with HDF5IO('experiment.h5', mode='r') as io:
    container = io.read()
    voltage_data = container.voltage_traces.data

    # Create boolean mask for high-activity periods
    mean_activity = np.mean(voltage_data[:], axis=1)
    high_activity_mask = mean_activity > np.percentile(mean_activity, 95)

    # Extract high activity periods
    high_activity_data = voltage_data[high_activity_mask, :]
    print(f"High activity periods: {high_activity_data.shape}")
```

### Querying Dynamic Tables

```python
from hdmf.common import DynamicTable
import numpy as np

# Create sample table
subjects_table = DynamicTable(
    name='subjects',
    description='Subject information'
)

subjects_table.add_column('subject_id', 'Subject ID')
subjects_table.add_column('age', 'Age in months', dtype='int')
subjects_table.add_column('weight', 'Weight in grams', dtype='float')
subjects_table.add_column('genotype', 'Genotype')

# Add sample data
for i in range(50):
    subjects_table.add_row(
        subject_id=f'subject_{i:03d}',
        age=np.random.randint(3, 24),
        weight=np.random.normal(25.0, 3.0),
        genotype=np.random.choice(['WT', 'KO'])
    )

# Query using table methods
adult_subjects = subjects_table.which(age__gt=12)
print(f"Adult subjects: {len(adult_subjects)}")

heavy_subjects = subjects_table.which(weight__gt=27.0)
print(f"Heavy subjects: {len(heavy_subjects)}")

ko_subjects = subjects_table.which(genotype='KO')
print(f"KO subjects: {len(ko_subjects)}")

# Complex queries combining conditions
adult_ko = []
for idx in range(len(subjects_table)):
    row = subjects_table[idx]
    if row['age'] > 12 and row['genotype'] == 'KO':
        adult_ko.append(idx)

print(f"Adult KO subjects: {len(adult_ko)}")
```
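
The string-based `query_dataset` utility offers an alternative route to the same filtering. A hedged sketch, reusing `subjects_table` from above; treating a table column as a queryable dataset and the condition grammar itself are assumptions based on the API examples earlier in this document:

```python
from hdmf.query import query_dataset

# Hypothetical: string conditions over the 'age' column's underlying data
adults = query_dataset(subjects_table['age'], "value > 12")
print(f"Rows matching string query: {len(adults)}")
```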

### Reference Resolution

```python
from hdmf.query import ContainerResolver
from hdmf.common import DynamicTable, DynamicTableRegion, get_type_map

# Create referenced data structure
neurons_table = DynamicTable(name='neurons', description='Neuron data')
neurons_table.add_column('neuron_id', 'Neuron ID')
neurons_table.add_column('cell_type', 'Cell type')

# Add neurons
for i in range(20):
    neurons_table.add_row(
        neuron_id=f'neuron_{i:03d}',
        cell_type='pyramidal' if i % 2 == 0 else 'interneuron'
    )

# Create table region referring to subset
pyramidal_region = DynamicTableRegion(
    name='pyramidal_neurons',
    data=[i for i in range(0, 20, 2)],  # Even indices (pyramidal cells)
    description='Pyramidal neurons only',
    table=neurons_table
)

# Create analysis table using references
analysis_table = DynamicTable(name='analysis', description='Analysis results')
analysis_table.add_column('neuron_group', 'Group of neurons')
analysis_table.add_column('avg_firing_rate', 'Average firing rate', dtype='float')

analysis_table.add_row(
    neuron_group=pyramidal_region,
    avg_firing_rate=15.3
)

# Resolve references using ContainerResolver
type_map = get_type_map()
resolver = ContainerResolver(type_map, neurons_table)

# Access referenced data through resolver
referenced_neurons = analysis_table.get_column('neuron_group').data[0]
resolved_neurons = resolver.get_object(referenced_neurons)

print(f"Referenced neurons: {len(referenced_neurons)} neurons")
print(f"First referenced neuron: {resolved_neurons[0]}")
```

### Advanced Data Filtering

```python
from hdmf.backends.hdf5 import HDF5IO
import numpy as np


def high_variance_condition(data_slice, var_threshold):
    """Window qualifies if its mean cross-channel variance exceeds the threshold."""
    return np.mean(np.var(data_slice, axis=1)) > var_threshold


def specific_frequency_condition(data_slice, target_freq=40.0, sampling_rate=1000.0):
    """Find periods with specific frequency content."""
    # Simple frequency detection using FFT
    fft_result = np.fft.fft(data_slice, axis=0)
    freqs = np.fft.fftfreq(data_slice.shape[0], 1 / sampling_rate)

    # Check for peak near target frequency
    target_idx = np.argmin(np.abs(freqs - target_freq))
    power_at_target = np.abs(fft_result[target_idx, :])

    return np.mean(power_at_target) > np.percentile(power_at_target, 95)


# Load time series data and filter it with a sliding window
with HDF5IO('timeseries.h5', mode='r') as io:
    container = io.read()
    timestamps = container.timestamps.data
    neural_data = container.neural_data.data

    # Precompute a global variance threshold (90th percentile over the recording)
    var_threshold = np.percentile(np.var(neural_data[:], axis=1), 90)

    window_size = 1000  # 1 second windows at 1 kHz
    high_var_periods = []
    freq_periods = []

    for start_idx in range(0, len(neural_data) - window_size, window_size // 2):
        window_data = neural_data[start_idx:start_idx + window_size, :]

        if high_variance_condition(window_data, var_threshold):
            high_var_periods.append((start_idx, start_idx + window_size))

        if specific_frequency_condition(window_data):
            freq_periods.append((start_idx, start_idx + window_size))

    print(f"High variance periods: {len(high_var_periods)}")
    print(f"Target frequency periods: {len(freq_periods)}")

    # Extract filtered data
    if high_var_periods:
        first_high_var = neural_data[high_var_periods[0][0]:high_var_periods[0][1], :]
        print(f"First high variance period shape: {first_high_var.shape}")
```

### Efficient Large Dataset Queries

```python
from hdmf.backends.hdf5 import HDF5IO
import numpy as np


def query_large_dataset_efficiently(file_path: str, query_condition, chunk_size: int = 10000):
    """
    Efficiently query large datasets using chunked processing.

    Args:
        file_path: Path to HDF5 file
        query_condition: Function that returns boolean mask
        chunk_size: Size of data chunks to process

    Returns:
        List of matching data indices
    """
    matching_indices = []

    with HDF5IO(file_path, mode='r') as io:
        container = io.read()
        dataset = container.large_dataset.data

        total_samples = dataset.shape[0]

        # Process dataset in chunks
        for start_idx in range(0, total_samples, chunk_size):
            end_idx = min(start_idx + chunk_size, total_samples)

            # Load chunk
            chunk_data = dataset[start_idx:end_idx, :]

            # Apply condition to chunk
            chunk_mask = query_condition(chunk_data)

            # Convert local indices to global indices
            local_matches = np.where(chunk_mask)[0]
            global_matches = local_matches + start_idx

            matching_indices.extend(global_matches)

            print(f"Processed {end_idx}/{total_samples} samples, "
                  f"found {len(local_matches)} matches in chunk")

    return matching_indices


# Example usage
def find_outliers(data_chunk, threshold=3.0):
    """Find data points that are outliers (>3 standard deviations)."""
    z_scores = np.abs((data_chunk - np.mean(data_chunk, axis=0)) / np.std(data_chunk, axis=0))
    return np.any(z_scores > threshold, axis=1)


outlier_indices = query_large_dataset_efficiently(
    'large_experiment.h5',
    find_outliers,
    chunk_size=5000
)

print(f"Found {len(outlier_indices)} outlier samples")
```

### Cross-Container Queries

```python
from hdmf.common import DynamicTable, DynamicTableRegion
import numpy as np


def cross_table_analysis(subjects_table, sessions_table, results_table):
    """
    Perform analysis across multiple related tables.

    Args:
        subjects_table: Table with subject information
        sessions_table: Table with session information
        results_table: Table with analysis results
    """
    # Find high-performing subjects
    high_performance_threshold = 0.85
    high_performers = []

    for i in range(len(results_table)):
        if results_table[i]['performance_score'] > high_performance_threshold:
            high_performers.append(i)

    # Get subject IDs for high performers
    high_performer_subjects = []
    for result_idx in high_performers:
        session_ref = results_table[result_idx]['session']
        # Resolve session reference (a one-element region into sessions_table)
        session_info = session_ref.table[session_ref.data[0]]
        subject_id = session_info['subject_id']
        high_performer_subjects.append(subject_id)

    # Analyze subject characteristics
    subject_ages = []
    subject_genotypes = []

    for subject_id in high_performer_subjects:
        # Find subject in subjects table
        subject_indices = subjects_table.which(subject_id=subject_id)
        if subject_indices:
            subject_info = subjects_table[subject_indices[0]]
            subject_ages.append(subject_info['age'])
            subject_genotypes.append(subject_info['genotype'])

    # Summary statistics
    avg_age = np.mean(subject_ages)
    genotype_counts = {}
    for genotype in subject_genotypes:
        genotype_counts[genotype] = genotype_counts.get(genotype, 0) + 1

    print(f"High performers: {len(high_performers)} sessions")
    print(f"Average age: {avg_age:.1f} months")
    print(f"Genotype distribution: {genotype_counts}")

    return {
        'high_performer_indices': high_performers,
        'subject_ages': subject_ages,
        'genotype_distribution': genotype_counts
    }


# Example usage would require setting up the related tables
# with proper cross-references between subjects, sessions, and results
```
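
As a hedged sketch of that setup, assuming the column and region APIs used elsewhere in this document (table names and the single-row contents are illustrative only):

```python
subjects = DynamicTable(name='subjects', description='Subject information')
subjects.add_column('subject_id', 'Subject ID')
subjects.add_column('age', 'Age in months', dtype='int')
subjects.add_column('genotype', 'Genotype')
subjects.add_row(subject_id='subject_000', age=14, genotype='WT')

sessions = DynamicTable(name='sessions', description='Recording sessions')
sessions.add_column('subject_id', 'Subject for this session')
sessions.add_row(subject_id='subject_000')

results = DynamicTable(name='results', description='Analysis results')
results.add_column('performance_score', 'Behavioral score', dtype='float')
results.add_column('session', 'Session reference')
results.add_row(
    performance_score=0.91,
    session=DynamicTableRegion(name='session', data=[0],
                               description='Session for this result',
                               table=sessions)
)

summary = cross_table_analysis(subjects, sessions, results)
```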

### Query Result Caching and Optimization

```python
from hdmf.backends.hdf5 import HDF5IO
from hdmf.query import HDMFDataset
from functools import lru_cache


class CachedQueryDataset:
    """Dataset wrapper with query result caching for better performance."""

    def __init__(self, dataset: HDMFDataset, cache_size: int = 128):
        self.dataset = dataset
        self.cache_size = cache_size

        # Create cached query method
        self._cached_query = lru_cache(maxsize=cache_size)(self._query_impl)

    def _query_impl(self, query_hash: str, *args, **kwargs):
        """Internal query implementation for caching."""
        # The hash serves as the cache key; the query itself is delegated
        return self.dataset.query(*args, **kwargs)

    def query_with_cache(self, condition: str, **kwargs):
        """Query with result caching based on condition string."""
        # Create hash of query parameters for caching
        query_params = f"{condition}_{str(sorted(kwargs.items()))}"
        query_hash = str(hash(query_params))

        return self._cached_query(query_hash, condition, **kwargs)

    def clear_cache(self):
        """Clear query result cache."""
        self._cached_query.cache_clear()

    def cache_info(self):
        """Get cache statistics."""
        return self._cached_query.cache_info()


# Usage example
with HDF5IO('experiment.h5', mode='r') as io:
    container = io.read()

    # Wrap dataset with caching
    cached_dataset = CachedQueryDataset(container.neural_data.data)

    # Repeated queries will be cached
    result1 = cached_dataset.query_with_cache("value > 0.5")
    result2 = cached_dataset.query_with_cache("value > 0.5")  # From cache

    print(f"Cache info: {cached_dataset.cache_info()}")
```