Workspace: tessl
Visibility: Public
Describes: pkg:pypi/deeplake@4.3.x

tessl/pypi-deeplake

tessl install tessl/pypi-deeplake@4.3.0

Database for AI powered by a storage format optimized for deep-learning applications.

Agent Success

Agent success rate when using this tile

75%

Improvement

Agent success rate improvement when using this tile compared to baseline

1.6x

Baseline

Agent success rate without this tile

47%

evals/scenario-4/task.md

Image Dataset Query Interface

Build a query interface for an image dataset that supports filtering, aggregation, and similarity search operations.

Background

You are building a system to query a dataset containing images with metadata (labels, descriptions, embeddings). The dataset supports advanced query capabilities that you need to expose through a simple Python interface.

Requirements

Create a Python module query_interface.py that implements the following functionality:

Filter by Label

Implement a function filter_by_label(dataset_path: str, label: str) -> dict that:

  • Queries the dataset to find all images with a specific label
  • Returns a dictionary with two keys: count (the number of matching images) and sample_ids (a list of the matching images' IDs)
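One possible shape for this function is sketched below. It assumes the deeplake 4.x calls deeplake.open_read_only and Dataset.query, that a query of the form SELECT * WHERE label = '...' is accepted, and that slicing the id column of the result yields an array-like of integers. All of these should be checked against the deeplake documentation. The helper build_label_query is hypothetical, and it rejects labels containing a single quote because the escaping rules are not covered by this task.

```python
def build_label_query(label: str) -> str:
    """Build a filter query for one label (hypothetical helper)."""
    # Assumption: string literals in the query language use single quotes.
    # Escaping is out of scope here, so quoted labels are rejected outright.
    if "'" in label:
        raise ValueError("labels containing a single quote are not supported")
    return f"SELECT * WHERE label = '{label}'"


def filter_by_label(dataset_path: str, label: str) -> dict:
    """Return the count and IDs of rows whose label matches."""
    # Imported lazily so build_label_query can be used without deeplake.
    import deeplake

    # Assumption: deeplake 4.x provides open_read_only() and Dataset.query().
    ds = deeplake.open_read_only(dataset_path)
    view = ds.query(build_label_query(label))

    # Assumption: slicing the column returns an array-like of integer IDs.
    sample_ids = [int(i) for i in view["id"][:]]
    return {"count": len(sample_ids), "sample_ids": sample_ids}
```

Keeping the query string in its own helper makes the filter condition easy to inspect and test without opening a dataset.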

Aggregate Label Statistics

Implement a function get_label_statistics(dataset_path: str) -> list[dict] that:

  • Computes statistics grouped by label
  • Returns a list of dictionaries, each containing label and count keys
  • Results should be sorted by count in descending order
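The expected output shape can be illustrated without a dataset. The sketch below assumes the label column has already been read into a plain Python list and does the grouping client-side. The name label_statistics and the alphabetical tie-break are assumptions, since the task does not say how labels with equal counts should be ordered.

```python
from collections import Counter


def label_statistics(labels: list[str]) -> list[dict]:
    """Count rows per label and sort by count, largest first."""
    counts = Counter(labels)
    # Sort by descending count. Ties fall back to the label name so the
    # output is deterministic (an assumption, the task is silent on ties).
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [{"label": label, "count": count} for label, count in ordered]


print(label_statistics(["cat", "dog", "cat", "bird", "dog", "cat"]))
# [{'label': 'cat', 'count': 3}, {'label': 'dog', 'count': 2}, {'label': 'bird', 'count': 1}]
```

A reference like this is also useful for checking the result of a grouped query run by the dataset's own query engine.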

Find Similar Images

Implement a function find_similar_images(dataset_path: str, query_embedding: list[float], top_k: int = 5) -> list[int] that:

  • Finds the top K most similar images based on embedding similarity
  • Uses cosine similarity as the distance metric
  • Returns a list of image IDs ordered by similarity (most similar first)
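The ranking rule can be written out as a small pure-Python reference, which may help when validating whatever the dataset's query engine returns. The sketch assumes the id and embedding columns are already in memory as lists. The names cosine_similarity and rank_by_similarity are hypothetical, and treating a zero-length vector as similarity 0.0 is an assumption.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Assumption: a zero vector is treated as having no similarity.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    ids: list[int],
    embeddings: list[list[float]],
    query_embedding: list[float],
    top_k: int = 5,
) -> list[int]:
    """Return up to top_k IDs, most similar to the query first."""
    scored = [
        (cosine_similarity(vector, query_embedding), image_id)
        for image_id, vector in zip(ids, embeddings)
    ]
    # Sort on the score only; the stable sort keeps input order for ties.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [image_id for _, image_id in scored[:top_k]]


print(rank_by_similarity([10, 11, 12], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [1.0, 0.0], top_k=2))
# [10, 12]
```

Because cosine similarity is a similarity rather than a distance, larger scores rank first, which is why the sort is descending.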

Combined Filter Query

Implement a function filter_by_multiple_conditions(dataset_path: str, min_id: int, label: str) -> int that:

  • Counts images where ID is greater than min_id AND label equals the specified label
  • Returns the count as an integer
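The comparison on ID is strict, so a row whose ID equals min_id does not match. A minimal in-memory sketch of the predicate follows, assuming rows are available as dictionaries with id and label keys. The name count_matching is hypothetical.

```python
def count_matching(rows: list[dict], min_id: int, label: str) -> int:
    """Count rows with id strictly greater than min_id and the given label."""
    return sum(
        1 for row in rows
        if row["id"] > min_id and row["label"] == label
    )


rows = [
    {"id": 100, "label": "dog"},  # excluded: id is not greater than 100
    {"id": 101, "label": "dog"},
    {"id": 102, "label": "cat"},  # excluded: wrong label
    {"id": 103, "label": "dog"},
]
print(count_matching(rows, min_id=100, label="dog"))
# 2
```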

Implementation Notes

  • The dataset at dataset_path contains columns: id, label, description, embedding
  • The embedding column contains vector embeddings for similarity search
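  • All functions should perform dataset operations efficiently, for example by letting the dataset's query engine do the filtering and ranking rather than loading every row into Python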
  • Error handling for invalid paths or missing data is not required for this exercise

@generates

API

def filter_by_label(dataset_path: str, label: str) -> dict:
    """
    Filter images by a specific label.

    Args:
        dataset_path: Path to the dataset
        label: Label to filter by

    Returns:
        Dictionary with 'count' and 'sample_ids' keys
    """
    pass

def get_label_statistics(dataset_path: str) -> list[dict]:
    """
    Get aggregated statistics grouped by label.

    Args:
        dataset_path: Path to the dataset

    Returns:
        List of dicts with 'label' and 'count', sorted by count descending
    """
    pass

def find_similar_images(dataset_path: str, query_embedding: list[float], top_k: int = 5) -> list[int]:
    """
    Find top K most similar images using cosine similarity.

    Args:
        dataset_path: Path to the dataset
        query_embedding: Query vector for similarity search
        top_k: Number of similar images to return

    Returns:
        List of image IDs ordered by similarity, most similar first
    """
    pass

def filter_by_multiple_conditions(dataset_path: str, min_id: int, label: str) -> int:
    """
    Count images matching multiple conditions.

    Args:
        dataset_path: Path to the dataset
        min_id: Exclusive lower bound on ID; only IDs greater than this match
        label: Label to filter by

    Returns:
        Count of matching images
    """
    pass

Tests

  • filter_by_label returns correct count and IDs for images with label "cat" @test
  • get_label_statistics returns aggregated counts grouped by label in descending order @test
  • find_similar_images returns the 3 most similar image IDs using cosine similarity @test
  • filter_by_multiple_conditions correctly counts images where id > 100 AND label = "dog" @test

Dependencies { .dependencies }

deeplake { .dependency }

Provides dataset query capabilities.