tessl install tessl/pypi-kedro@1.1.0

Kedro helps you build production-ready data and analytics pipelines.
Agent Success: 98% (agent success rate when using this tile)
Improvement: 1.32x (agent success rate improvement when using this tile compared to baseline)
Baseline: 74% (agent success rate without this tile)
Build a tool that analyzes a failed data pipeline and suggests which pipeline stages should be re-run to resume execution efficiently.
When a data processing pipeline fails partway through execution, some outputs may have been successfully saved to disk while others were not. Rather than re-running the entire pipeline from the beginning, it's more efficient to identify which stages can be skipped (because their outputs already exist) and which stages need to be re-run.
Your tool should analyze the pipeline structure, identify which outputs exist on disk, determine resumption points, and suggest which stages to re-run.
The pipeline consists of processing stages (nodes) that consume inputs and produce outputs. The pipeline forms a directed acyclic graph (DAG) where dependencies are determined by which datasets are consumed and produced by each node.
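To make the DAG concrete: dependency edges can be derived purely by matching each node's produced datasets against the datasets other nodes consume. A minimal sketch, using an illustrative dict representation rather than Kedro's actual Pipeline API:

```python
# Each stage declares the datasets it consumes and produces; edges follow
# from matching one node's outputs to another node's inputs.
# (Illustrative structure, not Kedro's API.)
nodes = {
    "Node A": {"inputs": set(), "outputs": {"data_a"}},
    "Node B": {"inputs": {"data_a"}, "outputs": {"data_b"}},
    "Node C": {"inputs": {"data_b"}, "outputs": {"data_c"}},
}

# Map each dataset to the node that produces it.
producer = {ds: name for name, n in nodes.items() for ds in n["outputs"]}

# An edge (u, v) means v consumes a dataset that u produces.
edges = {
    (producer[ds], name)
    for name, n in nodes.items()
    for ds in n["inputs"]
    if ds in producer
}
print(sorted(edges))  # → [('Node A', 'Node B'), ('Node B', 'Node C')]
```

Datasets with no producer (raw inputs to the pipeline) simply contribute no edge, which is why the `if ds in producer` guard is there.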
When a pipeline fails, some node outputs have already been persisted while downstream outputs are missing.

Your tool should identify the nodes whose outputs are missing and propose resumption points.

The resumption points should be selected such that re-running the pipeline from them regenerates every missing output, while nodes whose outputs already exist are not re-run unnecessarily.
Given a pipeline with three sequential nodes:
If "data_a" exists but "data_b" and "data_c" do not exist, the tool should suggest resuming from Node B. @test
Given a pipeline with parallel processing:
If "data_a" and "data_b" exist, but "data_c" and "data_d" do not exist, the tool should suggest resuming from Node C only (not Node B). @test
Given a pipeline:
If "data_a" exists but all other outputs are missing, the tool should suggest resuming from both Node B and Node C. @test
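The three scenarios above can be satisfied by a short sketch of the resume logic. This uses a plain frozen dataclass as a stand-in for Kedro's node objects (the `Node` class and list-of-nodes representation are illustrative assumptions, not Kedro's actual interface):

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Minimal stand-in for a pipeline stage (not Kedro's Node class)."""
    name: str
    inputs: frozenset
    outputs: frozenset


def suggest_resume_nodes(nodes, existing_datasets):
    """Return resumption-point node names, sorted alphabetically."""
    # Map each dataset to the node that produces it (free inputs have none).
    producer = {ds: n for n in nodes for ds in n.outputs}
    # Seed with nodes that have at least one missing output, then walk
    # downstream (BFS): every consumer of an impacted node must also re-run.
    impacted = {n for n in nodes if not n.outputs <= existing_datasets}
    queue = deque(impacted)
    while queue:
        current = queue.popleft()
        for n in nodes:
            if n not in impacted and n.inputs & current.outputs:
                impacted.add(n)
                queue.append(n)
    # A resumption point is an impacted node none of whose inputs are
    # produced by another impacted node, so it can run immediately.
    return sorted(
        n.name for n in impacted
        if not any(producer.get(ds) in impacted for ds in n.inputs)
    )


# Sequential example: A -> B -> C, with only "data_a" persisted.
chain = [
    Node("Node A", frozenset(), frozenset({"data_a"})),
    Node("Node B", frozenset({"data_a"}), frozenset({"data_b"})),
    Node("Node C", frozenset({"data_b"}), frozenset({"data_c"})),
]
print(suggest_resume_nodes(chain, {"data_a"}))  # → ['Node B']
```

Re-running from Node B regenerates "data_b" and then "data_c", while Node A (whose output exists) is skipped, matching the first scenario.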
@generates
from typing import List, Set

def suggest_resume_nodes(
    pipeline,
    existing_datasets: Set[str],
) -> List[str]:
    """
    Suggest which nodes to use as resumption points after a pipeline failure.

    Analyzes the pipeline structure and existing datasets to determine the minimal
    set of nodes that should be re-run to complete the pipeline. Uses breadth-first
    search to find nodes whose outputs don't exist, then identifies resumption points.

    Args:
        pipeline: A Kedro Pipeline object defining the DAG of processing stages
        existing_datasets: Set of dataset names that currently exist in persistent storage

    Returns:
        List of node names that should be used as resumption points, sorted alphabetically
    """
    pass

Provides data pipeline construction and execution capabilities, including the ability to create nodes and pipelines with automatic dependency resolution.
@satisfied-by