
Workspace: tessl
Visibility: Public
Describes: pypipkg:pypi/kedro@1.1.x (tile.json)

tessl/pypi-kedro

tessl install tessl/pypi-kedro@1.1.0

Kedro helps you build production-ready data and analytics pipelines

Agent Success: 98% (agent success rate when using this tile)
Improvement: 1.32x (success rate improvement over the baseline)
Baseline: 74% (agent success rate without this tile)

evals/scenario-16/task.md

Pipeline Failure Recovery Tool

Build a tool that analyzes a failed data pipeline and suggests which pipeline stages should be re-run to resume execution efficiently.

Overview

When a data processing pipeline fails partway through execution, some outputs may have been successfully saved to disk while others were not. Rather than re-running the entire pipeline from the beginning, it's more efficient to identify which stages can be skipped (because their outputs already exist) and which stages need to be re-run.

Your tool should analyze the pipeline structure, identify which outputs exist on disk, determine resumption points, and suggest which stages to re-run.

Requirements

Pipeline Structure Analysis

The pipeline consists of processing stages (nodes) that consume inputs and produce outputs. The pipeline forms a directed acyclic graph (DAG) where dependencies are determined by which datasets are consumed and produced by each node.
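Concretely, the edges of this DAG can be derived by mapping each dataset to the node that produces it. A minimal sketch, assuming nodes are encoded as plain (name, inputs, outputs) tuples rather than Kedro node objects:

```python
# Hypothetical plain-tuple encoding: (node name, consumed datasets, produced datasets)
nodes = [
    ("A", [], ["data_a"]),
    ("B", ["data_a"], ["data_b"]),
    ("C", ["data_b"], ["data_c"]),
]

# Map each dataset to the node that produces it.
producer = {ds: name for name, _ins, outs in nodes for ds in outs}

# Edge (u, v) exists whenever v consumes a dataset that u produces.
edges = [(producer[ds], name) for name, ins, _outs in nodes
         for ds in ins if ds in producer]
# edges == [("A", "B"), ("B", "C")]
```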

Failure Detection and Recovery

When a pipeline fails:

  • Some nodes may have completed successfully and saved their outputs to persistent storage
  • Other nodes may not have run at all, or may have failed before saving outputs
  • Some datasets may be temporary (in-memory only) and don't persist across runs

Your tool should:

  1. Accept a pipeline definition with nodes, inputs, and outputs
  2. Accept information about which datasets currently exist in persistent storage
  3. Determine which nodes need to be re-run to complete the pipeline
  4. Suggest the minimal set of resumption points (starting nodes) needed
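The first two requirements fix what the tool must accept but not a concrete format; one minimal shape (hypothetical, not prescribed by this spec) is a dict of nodes plus a set of persisted dataset names:

```python
# Hypothetical input shape covering requirements 1 and 2.
pipeline_def = {
    "nodes": [
        {"name": "A", "inputs": [], "outputs": ["data_a"]},
        {"name": "B", "inputs": ["data_a"], "outputs": ["data_b"]},
        {"name": "C", "inputs": ["data_b"], "outputs": ["data_c"]},
    ]
}
existing_datasets = {"data_a"}  # datasets currently present in persistent storage
```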

Resumption Point Selection

The resumption points should be selected such that:

  • Every node that needs to re-run is either a resumption point itself or downstream of one
  • The number of resumption points is minimized
  • If a node's persistent outputs already exist, it can be skipped (along with all its upstream dependencies)
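These three criteria reduce to a simple rule: among the nodes that must re-run, the resumption points are exactly those with no upstream producer that also re-runs, since every other re-run node is then reachable downstream from one of them. A sketch, assuming the re-run set and a hypothetical parents map have already been computed:

```python
def resumption_points(to_rerun, parents):
    """Pick the minimal starting nodes from the set of nodes that must re-run.

    to_rerun: set of node names that need to execute again.
    parents:  map of node name -> set of node names producing its inputs.
    """
    return sorted(n for n in to_rerun
                  if not (parents.get(n, set()) & to_rerun))

# Shape of Test Case 3: node A is skipped (its output exists),
# so B and C start the two independent branches.
parents = {"B": {"A"}, "D": {"C"}, "E": {"B", "D"}}
print(resumption_points({"B", "C", "D", "E"}, parents))  # ["B", "C"]
```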

Test Cases

Test Case 1: Simple linear pipeline with mid-point failure

Given a pipeline with three sequential nodes:

  • Node A produces dataset "data_a"
  • Node B consumes "data_a" and produces "data_b"
  • Node C consumes "data_b" and produces "data_c"

If "data_a" exists but "data_b" and "data_c" do not exist, the tool should suggest resuming from Node B. @test

Test Case 2: Pipeline with parallel branches

Given a pipeline with parallel processing:

  • Node A produces "data_a"
  • Node B consumes "data_a" and produces "data_b"
  • Node C consumes "data_a" and produces "data_c"
  • Node D consumes both "data_b" and "data_c" and produces "data_d"

If "data_a" and "data_b" exist, but "data_c" and "data_d" do not exist, the tool should suggest resuming from Node C only (not Node B). @test

Test Case 3: Multiple independent resumption points

Given a pipeline:

  • Node A produces "data_a"
  • Node B consumes "data_a" and produces "data_b"
  • Node C produces "data_c" (independent, no inputs)
  • Node D consumes "data_c" and produces "data_d"
  • Node E consumes both "data_b" and "data_d" and produces "data_e"

If "data_a" exists but all other outputs are missing, the tool should suggest resuming from both Node B and Node C. @test

Implementation

@generates

API

from typing import List, Set

def suggest_resume_nodes(
    pipeline,
    existing_datasets: Set[str]
) -> List[str]:
    """
    Suggest which nodes to use as resumption points after a pipeline failure.

    Analyzes the pipeline structure and existing datasets to determine the minimal
    set of nodes that should be re-run to complete the pipeline. Uses breadth-first
    search to find nodes whose outputs don't exist, then identifies resumption points.

    Args:
        pipeline: A Kedro Pipeline object defining the DAG of processing stages
        existing_datasets: Set of dataset names that currently exist in persistent storage

    Returns:
        List of node names that should be used as resumption points, sorted alphabetically
    """
    pass
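The stub above can be fleshed out with the breadth-first approach its docstring describes. The sketch below is one possible reference implementation, not the generated one: it duck-types the pipeline (any object with a `.nodes` list whose elements expose `.name`, `.inputs`, and `.outputs`, which Kedro's `Pipeline` and `Node` provide) and uses lightweight stand-ins so it runs without Kedro installed:

```python
from collections import deque, namedtuple
from typing import List, Set

# Lightweight stand-ins: Kedro's Node exposes .name/.inputs/.outputs and
# Pipeline exposes .nodes, so suggest_resume_nodes duck-types against either.
Node = namedtuple("Node", ["name", "inputs", "outputs"])
Pipeline = namedtuple("Pipeline", ["nodes"])

def suggest_resume_nodes(pipeline, existing_datasets: Set[str]) -> List[str]:
    producer = {}    # dataset name -> node producing it
    consumers = {}   # dataset name -> nodes consuming it
    for node in pipeline.nodes:
        for ds in node.outputs:
            producer[ds] = node
        for ds in node.inputs:
            consumers.setdefault(ds, []).append(node)

    # Seed: any node with at least one missing output must re-run.
    to_rerun = {n.name: n for n in pipeline.nodes
                if any(ds not in existing_datasets for ds in n.outputs)}

    # Breadth-first downstream closure: every consumer of a
    # re-run node's outputs must also re-run.
    queue = deque(to_rerun.values())
    while queue:
        node = queue.popleft()
        for ds in node.outputs:
            for child in consumers.get(ds, []):
                if child.name not in to_rerun:
                    to_rerun[child.name] = child
                    queue.append(child)

    # Resumption points: re-run nodes with no re-run upstream producer.
    return sorted(name for name, node in to_rerun.items()
                  if not any(ds in producer and producer[ds].name in to_rerun
                             for ds in node.inputs))

# Test Case 2: "data_a" and "data_b" exist, so only branch C re-runs.
pipe = Pipeline([
    Node("A", (), ("data_a",)),
    Node("B", ("data_a",), ("data_b",)),
    Node("C", ("data_a",), ("data_c",)),
    Node("D", ("data_b", "data_c"), ("data_d",)),
])
print(suggest_resume_nodes(pipe, {"data_a", "data_b"}))  # ["C"]
```

Note that D also re-runs (its "data_d" is missing), but it is downstream of C and therefore not reported as a resumption point, matching Test Case 2.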

Dependencies { .dependencies }

kedro { .dependency }

Provides data pipeline construction and execution capabilities, including the ability to create nodes and pipelines with automatic dependency resolution.

@satisfied-by