Workspace: tessl
Visibility: Public
Describes: pkg:pypi/kedro@1.1.x

tessl/pypi-kedro

tessl install tessl/pypi-kedro@1.1.0

Kedro helps you build production-ready data and analytics pipelines

Agent Success: 98% (agent success rate when using this tile)
Improvement: 1.32x (success rate improvement over the baseline)
Baseline: 74% (agent success rate without this tile)

evals/scenario-9/task.md

Pipeline Resume Analyzer

Build a tool that analyzes failed Kedro pipeline runs and provides recommendations for resuming execution from the optimal points.

Problem Description

When data pipelines fail mid-execution, it's wasteful to re-run the entire pipeline from the beginning. Your task is to create a Python module that analyzes a failed pipeline execution and determines the optimal nodes from which to resume execution, minimizing redundant computation while ensuring data consistency.

The tool should:

  1. Accept a Kedro pipeline and catalog as inputs
  2. Simulate a pipeline execution failure at a specific node
  3. Determine which nodes need to be re-run based on:
    • Which datasets persist to disk and thus survive the failure (a minimal persistence check is sketched after this list)
    • The dependency structure of the pipeline
  4. Output a list of node names that should be used as starting points for resumption
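
For step 3, one way to decide whether a dataset survives the failure is to check whether its catalog entry is a MemoryDataset; anything else is assumed to be written to disk. A minimal sketch, assuming the dict-style DataCatalog access (name in catalog, catalog.get(name)) available in recent Kedro releases:

from kedro.io import DataCatalog, MemoryDataset

def is_persisted(catalog: DataCatalog, dataset_name: str) -> bool:
    """Return True if the dataset would survive a crashed run."""
    # Datasets not registered in the catalog are materialized as
    # MemoryDatasets at run time, so they are lost on failure.
    if dataset_name not in catalog:
        return False
    # Anything that is not a MemoryDataset is assumed to persist.
    return not isinstance(catalog.get(dataset_name), MemoryDataset)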

Requirements

Input Parameters

Your module should accept:

  • A Kedro Pipeline object containing multiple nodes with dependencies
  • A Kedro DataCatalog object defining datasets
  • The name of a node where the failure occurred

Output

Return a list of node names (strings) that represent the minimum set of nodes from which the pipeline should resume execution.

Test Cases

Test Case 1: Simple Linear Pipeline

  • Given a pipeline with nodes A → B → C → D where Node A produces dataset "data_a" (memory dataset), Node B consumes "data_a" and produces "data_b" (CSV dataset - persisted), Node C consumes "data_b" and produces "data_c" (memory dataset), and Node D consumes "data_c" and produces "data_d" (CSV dataset - persisted), when the pipeline fails at node D, the resume analyzer should recommend starting from node C because "data_b" is persisted but "data_c" is not. @test
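
A fixture for this scenario might be wired up as below. The node functions are stubs, since the analyzer only inspects pipeline structure and never executes it; the file paths are illustrative, and CSVDataset is assumed to come from the separate kedro-datasets package:

from kedro.io import DataCatalog, MemoryDataset
from kedro.pipeline import Pipeline, node
from kedro_datasets.pandas import CSVDataset

def make_a(): ...
def make_b(data_a): ...
def make_c(data_b): ...
def make_d(data_c): ...

pipeline = Pipeline([
    node(make_a, inputs=None, outputs="data_a", name="A"),
    node(make_b, inputs="data_a", outputs="data_b", name="B"),
    node(make_c, inputs="data_b", outputs="data_c", name="C"),
    node(make_d, inputs="data_c", outputs="data_d", name="D"),
])

catalog = DataCatalog({
    "data_a": MemoryDataset(),                    # lost on failure
    "data_b": CSVDataset(filepath="data_b.csv"),  # persisted
    "data_c": MemoryDataset(),                    # lost on failure
    "data_d": CSVDataset(filepath="data_d.csv"),  # persisted
})

assert analyze_resume_points(pipeline, catalog, "D") == ["C"]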

Test Case 2: Pipeline with Branching

  • Given a pipeline where Node A produces "data_a" (Parquet dataset - persisted), Nodes B and C both consume "data_a", Node B produces "data_b" (memory dataset), Node C produces "data_c" (JSON dataset - persisted), and Node D consumes both "data_b" and "data_c" to produce "data_d", when the pipeline fails at node D, the resume analyzer should recommend starting from node B only, because "data_a" and "data_c" are persisted but "data_b" is not. @test
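
The branching scenario differs only in its wiring; note that node D takes a list of inputs. Same illustrative conventions as above, with ParquetDataset and JSONDataset assumed to come from kedro-datasets:

from kedro.io import DataCatalog, MemoryDataset
from kedro.pipeline import Pipeline, node
from kedro_datasets.json import JSONDataset
from kedro_datasets.pandas import ParquetDataset

def make_a(): ...
def make_b(data_a): ...
def make_c(data_a): ...
def make_d(data_b, data_c): ...

pipeline = Pipeline([
    node(make_a, inputs=None, outputs="data_a", name="A"),
    node(make_b, inputs="data_a", outputs="data_b", name="B"),
    node(make_c, inputs="data_a", outputs="data_c", name="C"),
    node(make_d, inputs=["data_b", "data_c"], outputs="data_d", name="D"),
])

catalog = DataCatalog({
    "data_a": ParquetDataset(filepath="data_a.parquet"),  # persisted
    "data_b": MemoryDataset(),                            # lost on failure
    "data_c": JSONDataset(filepath="data_c.json"),        # persisted
    "data_d": MemoryDataset(),
})

assert analyze_resume_points(pipeline, catalog, "D") == ["B"]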

Test Case 3: All Datasets Persisted

  • Given a pipeline with nodes A → B → C where Node A produces "data_a" (CSV dataset - persisted), Node B consumes "data_a" and produces "data_b" (Parquet dataset - persisted), and Node C consumes "data_b" and produces "data_c" (JSON dataset - persisted), when the pipeline fails at node C, the resume analyzer should recommend starting from node C itself since all required inputs are persisted. @test

Test Case 4: Multiple Memory Datasets

  • Given a pipeline with nodes A → B → C → D → E where all datasets are memory datasets (none persisted), when the pipeline fails at node E, the resume analyzer should recommend starting from node A (the root) since no intermediate data is available. @test

Implementation

@generates

API

from kedro.pipeline import Pipeline
from kedro.io import DataCatalog

def analyze_resume_points(
    pipeline: Pipeline,
    catalog: DataCatalog,
    failed_node_name: str
) -> list[str]:
    """
    Analyze a failed pipeline and determine optimal resume points.

    Args:
        pipeline: The Kedro pipeline that failed
        catalog: The data catalog containing dataset definitions
        failed_node_name: Name of the node where execution failed

    Returns:
        List of node names from which execution should resume

    Raises:
        ValueError: If failed_node_name is not in the pipeline
    """
    pass
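
One way the function could be implemented, sketched below rather than prescribed: walk upstream from the failed node, re-queuing the producer of every input that did not survive the failure, then report the upstream frontier of that must-run set. The sketch assumes Kedro's documented Pipeline and Node properties (pipeline.nodes, node.name, node.inputs, node.outputs) plus the dict-style catalog access used earlier, and it treats every non-MemoryDataset as persisted, which is sufficient for the test cases here:

from kedro.io import DataCatalog, MemoryDataset
from kedro.pipeline import Pipeline

def analyze_resume_points(
    pipeline: Pipeline,
    catalog: DataCatalog,
    failed_node_name: str,
) -> list[str]:
    nodes_by_name = {n.name: n for n in pipeline.nodes}
    if failed_node_name not in nodes_by_name:
        raise ValueError(f"Node {failed_node_name!r} is not in the pipeline")

    # Map each dataset name to the node that produces it.
    producers = {out: n for n in pipeline.nodes for out in n.outputs}

    def survives(dataset_name: str) -> bool:
        # Unregistered datasets become MemoryDatasets at run time, so
        # they are lost; any registered non-memory dataset is assumed
        # to have been written to disk before the failure.
        if dataset_name not in catalog:
            return False
        return not isinstance(catalog.get(dataset_name), MemoryDataset)

    # Walk upstream from the failed node: every input that did not
    # survive must be recomputed, so its producer joins the set.
    must_run: set[str] = set()
    stack = [nodes_by_name[failed_node_name]]
    while stack:
        current = stack.pop()
        if current.name in must_run:
            continue
        must_run.add(current.name)
        for inp in current.inputs:
            if not survives(inp) and inp in producers:
                stack.append(producers[inp])

    # Resume points are the must-run nodes with no must-run producer
    # upstream of them (the frontier of the set). Sorted only to make
    # the output deterministic.
    return sorted(
        name
        for name in must_run
        if not any(
            inp in producers and producers[inp].name in must_run
            for inp in nodes_by_name[name].inputs
        )
    )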

Dependencies { .dependencies }

kedro { .dependency }

Provides the pipeline and data catalog framework for analyzing pipeline execution failures and determining optimal resume points.