
Workspace: tessl
Visibility: Public
Describes: pypipkg:pypi/kedro@1.1.x (tile.json)

tessl/pypi-kedro

tessl install tessl/pypi-kedro@1.1.0

Kedro helps you build production-ready data and analytics pipelines

Agent Success: 98% (agent success rate when using this tile)
Improvement: 1.32x (success rate improvement over the baseline)
Baseline: 74% (agent success rate without this tile)

evals/scenario-16/task.md

Pipeline Failure Recovery Tool

Build a tool that analyzes a failed data pipeline and suggests which pipeline stages should be re-run to resume execution efficiently.

Overview

When a data processing pipeline fails partway through execution, some outputs may have been successfully saved to disk while others were not. Rather than re-running the entire pipeline from the beginning, it's more efficient to identify which stages can be skipped (because their outputs already exist) and which stages need to be re-run.

Your tool should analyze the pipeline structure, identify which outputs exist on disk, determine resumption points, and suggest which stages to re-run.

Requirements

Pipeline Structure Analysis

The pipeline consists of processing stages (nodes) that consume inputs and produce outputs. The pipeline forms a directed acyclic graph (DAG) where dependencies are determined by which datasets are consumed and produced by each node.
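Concretely, the edges of this DAG can be derived by mapping each dataset to the node that produces it. A minimal sketch, assuming nodes are encoded as plain (name, inputs, outputs) tuples rather than Kedro node objects:

```python
# Hypothetical plain-tuple encoding: (node name, consumed datasets, produced datasets)
nodes = [
    ("A", [], ["data_a"]),
    ("B", ["data_a"], ["data_b"]),
    ("C", ["data_b"], ["data_c"]),
]

# Map each dataset to the node that produces it.
producer = {ds: name for name, _ins, outs in nodes for ds in outs}

# Edge (u, v) exists whenever v consumes a dataset that u produces.
edges = [(producer[ds], name) for name, ins, _outs in nodes
         for ds in ins if ds in producer]
# edges == [("A", "B"), ("B", "C")]
```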

Failure Detection and Recovery

When a pipeline fails:

  • Some nodes may have completed successfully and saved their outputs to persistent storage
  • Other nodes may not have run at all, or may have failed before saving outputs
  • Some datasets may be temporary (in-memory only) and don't persist across runs

Your tool should:

  1. Accept a pipeline definition with nodes, inputs, and outputs
  2. Accept information about which datasets currently exist in persistent storage
  3. Determine which nodes need to be re-run to complete the pipeline
  4. Suggest the minimal set of resumption points (starting nodes) needed
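The first two requirements fix what the tool must accept but not a concrete format; one minimal shape (hypothetical, not prescribed by this spec) is a dict of nodes plus a set of persisted dataset names:

```python
# Hypothetical input shape covering requirements 1 and 2.
pipeline_def = {
    "nodes": [
        {"name": "A", "inputs": [], "outputs": ["data_a"]},
        {"name": "B", "inputs": ["data_a"], "outputs": ["data_b"]},
        {"name": "C", "inputs": ["data_b"], "outputs": ["data_c"]},
    ]
}
existing_datasets = {"data_a"}  # datasets currently present in persistent storage
```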

Resumption Point Selection

The resumption points should be selected such that:

  • Every node that needs to re-run is either a resumption point itself or downstream of one
  • The number of resumption points is minimized
  • If a node's persistent outputs already exist, it can be skipped (along with all its upstream dependencies)
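These three criteria reduce to a simple rule: among the nodes that must re-run, the resumption points are exactly those with no upstream producer that also re-runs, since every other re-run node is then reachable downstream from one of them. A sketch, assuming the re-run set and a hypothetical parents map have already been computed:

```python
def resumption_points(to_rerun, parents):
    """Pick the minimal starting nodes from the set of nodes that must re-run.

    to_rerun: set of node names that need to execute again.
    parents:  map of node name -> set of node names producing its inputs.
    """
    return sorted(n for n in to_rerun
                  if not (parents.get(n, set()) & to_rerun))

# Shape of Test Case 3: node A is skipped (its output exists),
# so B and C start the two independent branches.
parents = {"B": {"A"}, "D": {"C"}, "E": {"B", "D"}}
print(resumption_points({"B", "C", "D", "E"}, parents))  # ["B", "C"]
```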

Test Cases

Test Case 1: Simple linear pipeline with mid-point failure

Given a pipeline with three sequential nodes:

  • Node A produces dataset "data_a"
  • Node B consumes "data_a" and produces "data_b"
  • Node C consumes "data_b" and produces "data_c"

If "data_a" exists but "data_b" and "data_c" do not exist, the tool should suggest resuming from Node B. @test

Test Case 2: Pipeline with parallel branches

Given a pipeline with parallel processing:

  • Node A produces "data_a"
  • Node B consumes "data_a" and produces "data_b"
  • Node C consumes "data_a" and produces "data_c"
  • Node D consumes both "data_b" and "data_c" and produces "data_d"

If "data_a" and "data_b" exist, but "data_c" and "data_d" do not exist, the tool should suggest resuming from Node C only (not Node B). @test

Test Case 3: Multiple independent resumption points

Given a pipeline:

  • Node A produces "data_a"
  • Node B consumes "data_a" and produces "data_b"
  • Node C produces "data_c" (independent, no inputs)
  • Node D consumes "data_c" and produces "data_d"
  • Node E consumes both "data_b" and "data_d" and produces "data_e"

If "data_a" exists but all other outputs are missing, the tool should suggest resuming from both Node B and Node C. @test

Implementation

@generates

API

from typing import List, Set

def suggest_resume_nodes(
    pipeline,
    existing_datasets: Set[str]
) -> List[str]:
    """
    Suggest which nodes to use as resumption points after a pipeline failure.

    Analyzes the pipeline structure and existing datasets to determine the minimal
    set of nodes that should be re-run to complete the pipeline. Uses breadth-first
    search to find nodes whose outputs don't exist, then identifies resumption points.

    Args:
        pipeline: A Kedro Pipeline object defining the DAG of processing stages
        existing_datasets: Set of dataset names that currently exist in persistent storage

    Returns:
        List of node names that should be used as resumption points, sorted alphabetically
    """
    pass
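The stub above can be fleshed out with the breadth-first approach its docstring describes. The sketch below is one possible reference implementation, not the generated one: it duck-types the pipeline (any object with a `.nodes` list whose elements expose `.name`, `.inputs`, and `.outputs`, which Kedro's `Pipeline` and `Node` provide) and uses lightweight stand-ins so it runs without Kedro installed:

```python
from collections import deque, namedtuple
from typing import List, Set

# Lightweight stand-ins: Kedro's Node exposes .name/.inputs/.outputs and
# Pipeline exposes .nodes, so suggest_resume_nodes duck-types against either.
Node = namedtuple("Node", ["name", "inputs", "outputs"])
Pipeline = namedtuple("Pipeline", ["nodes"])

def suggest_resume_nodes(pipeline, existing_datasets: Set[str]) -> List[str]:
    producer = {}    # dataset name -> node producing it
    consumers = {}   # dataset name -> nodes consuming it
    for node in pipeline.nodes:
        for ds in node.outputs:
            producer[ds] = node
        for ds in node.inputs:
            consumers.setdefault(ds, []).append(node)

    # Seed: any node with at least one missing output must re-run.
    to_rerun = {n.name: n for n in pipeline.nodes
                if any(ds not in existing_datasets for ds in n.outputs)}

    # Breadth-first downstream closure: every consumer of a
    # re-run node's outputs must also re-run.
    queue = deque(to_rerun.values())
    while queue:
        node = queue.popleft()
        for ds in node.outputs:
            for child in consumers.get(ds, []):
                if child.name not in to_rerun:
                    to_rerun[child.name] = child
                    queue.append(child)

    # Resumption points: re-run nodes with no re-run upstream producer.
    return sorted(name for name, node in to_rerun.items()
                  if not any(ds in producer and producer[ds].name in to_rerun
                             for ds in node.inputs))

# Test Case 2: "data_a" and "data_b" exist, so only branch C re-runs.
pipe = Pipeline([
    Node("A", (), ("data_a",)),
    Node("B", ("data_a",), ("data_b",)),
    Node("C", ("data_a",), ("data_c",)),
    Node("D", ("data_b", "data_c"), ("data_d",)),
])
print(suggest_resume_nodes(pipe, {"data_a", "data_b"}))  # ["C"]
```

Note that D also re-runs (its "data_d" is missing), but it is downstream of C and therefore not reported as a resumption point, matching Test Case 2.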

Dependencies { .dependencies }

kedro { .dependency }

Provides data pipeline construction and execution capabilities, including the ability to create nodes and pipelines with automatic dependency resolution.

@satisfied-by