
tessl/pypi-h5netcdf

tessl install tessl/pypi-h5netcdf@1.6.0

netCDF4 file access via h5py with hierarchical and legacy APIs for scientific computing

Agent Success: 69% — agent success rate when using this tile

Improvement: 0.83x — agent success rate when using this tile compared to baseline

Baseline: 83% — agent success rate without this tile

evals/scenario-4/task.md

Climate Data Storage Optimizer

Overview

You need to create a tool that processes and stores large climate datasets efficiently. The tool should create an optimized netCDF4 file with proper compression and chunking strategies to balance storage efficiency with data access performance.

Requirements

Implement a Python program climate_storage.py that creates a netCDF4 file containing multidimensional climate data with optimized storage settings.

File Structure

Create a file climate_data.nc with the following structure:

  1. Dimensions:

    • time: unlimited dimension for temporal data
    • latitude: 180 grid points
    • longitude: 360 grid points
    • level: 10 atmospheric levels
  2. Variables:

    • temperature (time, level, latitude, longitude): 4D temperature data in Kelvin

      • Should use compression to reduce file size
      • Should use chunking optimized for per-time-step access (reading all spatial data for a single time step)
      • Should use a shuffle filter to improve compression
    • pressure (time, latitude, longitude): 3D pressure data in Pascals

      • Should use moderate compression
      • Should use chunking optimized for spatial analysis (reading data across time for specific locations)
    • station_id (latitude, longitude): 2D station identifier data

      • Should use chunking but no compression (for fast random access)
  3. Attributes:

    • Set global attribute title to "Optimized Climate Dataset"
    • Set global attribute compression_info to "Using gzip with shuffle filter"
    • Set variable attribute units for temperature to "K"
    • Set variable attribute units for pressure to "Pa"

Implementation Details

  • Use gzip compression with level 4 for the temperature variable
  • Use gzip compression with level 2 for the pressure variable
  • Configure chunks for temperature as: (1, 2, 45, 90) - optimized for reading single time steps
  • Configure chunks for pressure as: (10, 30, 60) - optimized for spatial analysis across multiple time steps
  • Configure chunks for station_id as: (45, 90) - balanced access pattern
  • Enable the shuffle filter for temperature and pressure variables
  • Initialize temperature with sample data: use values of 273.15 (Kelvin) for all points
  • Initialize pressure with sample data: use values of 101325.0 (Pascals) for all points
  • Initialize station_id with sample data: use sequential integers starting from 1

Dependencies { .dependencies }

h5netcdf { .dependency }

Provides netCDF4 file access via h5py with support for compression and chunking.
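The tests below use the legacy API, which mirrors the netCDF4-python interface. A minimal round-trip sketch (the file name `demo.nc` is arbitrary):

```python
import h5netcdf.legacyapi as netCDF4

# Write a tiny dataset through the netCDF4-compatible legacy API
with netCDF4.Dataset("demo.nc", "w") as ds:
    ds.createDimension("x", 3)
    v = ds.createVariable("v", "f4", ("x",))
    v[:] = [1.0, 2.0, 3.0]

# Read it back the same way
with netCDF4.Dataset("demo.nc", "r") as ds:
    print(ds.variables["v"][:].tolist())  # [1.0, 2.0, 3.0]
```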

Test Cases

Test 1: Verify File Creation and Structure @test

Test file: test_climate_storage.py

import h5netcdf.legacyapi as netCDF4

def test_file_structure():
    """Verify the file is created with correct structure."""
    with netCDF4.Dataset('climate_data.nc', 'r') as f:
        # Check dimensions
        assert 'time' in f.dimensions
        assert 'latitude' in f.dimensions
        assert 'longitude' in f.dimensions
        assert 'level' in f.dimensions
        assert f.dimensions['latitude'].size == 180
        assert f.dimensions['longitude'].size == 360
        assert f.dimensions['level'].size == 10
        assert f.dimensions['time'].isunlimited()

        # Check variables exist
        assert 'temperature' in f.variables
        assert 'pressure' in f.variables
        assert 'station_id' in f.variables

Test 2: Verify Compression Settings @test

Test file: test_climate_storage.py

import h5netcdf.legacyapi as netCDF4

def test_compression():
    """Verify compression settings are applied correctly."""
    with netCDF4.Dataset('climate_data.nc', 'r') as f:
        temp_var = f.variables['temperature']
        press_var = f.variables['pressure']
        station_var = f.variables['station_id']

        # Check temperature compression
        assert temp_var.filters()['complevel'] == 4
        assert temp_var.filters()['shuffle'] == True

        # Check pressure compression
        assert press_var.filters()['complevel'] == 2
        assert press_var.filters()['shuffle'] == True

        # Check station_id has no compression
        assert station_var.filters()['complevel'] == 0

Test 3: Verify Chunking Configuration @test

Test file: test_climate_storage.py

import h5netcdf.legacyapi as netCDF4

def test_chunking():
    """Verify chunk sizes are configured correctly."""
    with netCDF4.Dataset('climate_data.nc', 'r') as f:
        temp_var = f.variables['temperature']
        press_var = f.variables['pressure']
        station_var = f.variables['station_id']

        # Check chunk sizes
        assert temp_var.chunking() == [1, 2, 45, 90]
        assert press_var.chunking() == [10, 30, 60]
        assert station_var.chunking() == [45, 90]

Test 4: Verify Attributes and Initial Data @test

Test file: test_climate_storage.py

import h5netcdf.legacyapi as netCDF4
import numpy as np

def test_attributes_and_data():
    """Verify attributes are set and initial data is written."""
    with netCDF4.Dataset('climate_data.nc', 'r') as f:
        # Check global attributes
        assert f.getncattr('title') == "Optimized Climate Dataset"
        assert f.getncattr('compression_info') == "Using gzip with shuffle filter"

        # Check variable attributes
        assert f.variables['temperature'].getncattr('units') == "K"
        assert f.variables['pressure'].getncattr('units') == "Pa"

        # Check initial data (at least one value to confirm data was written)
        temp_data = f.variables['temperature'][0, 0, 0, 0]
        assert np.isclose(temp_data, 273.15)

        press_data = f.variables['pressure'][0, 0, 0]
        assert np.isclose(press_data, 101325.0)

        station_data = f.variables['station_id'][0, 0]
        assert station_data == 1

Deliverables

  1. climate_storage.py - Main implementation file that creates the optimized netCDF4 file
  2. test_climate_storage.py - Test file with all test cases
  3. climate_data.nc - The generated netCDF4 file (created when running the program)

Notes

  • The program should handle the unlimited time dimension by initializing it with at least 1 time step
  • Focus on correctly configuring compression and chunking parameters as specified
  • All test cases must pass when executed

Version

Workspace: tessl
Visibility: Public
Describes: pkg:pypi/h5netcdf@1.6.x
tile.json