Kedro helps you build production-ready data and analytics pipelines
Build a tool that analyzes failed Kedro pipeline runs and provides recommendations for resuming execution from the optimal points.
When data pipelines fail mid-execution, it's wasteful to re-run the entire pipeline from the beginning. Your task is to create a Python module that analyzes a failed pipeline execution and determines the optimal nodes from which to resume execution, minimizing redundant computation while ensuring data consistency.
Your module should accept:
- a Pipeline object containing multiple nodes with dependencies
- a DataCatalog object defining the datasets
- the name of the node where execution failed

It should return a list of node names (strings) representing the minimum set of nodes from which the pipeline should resume execution.
@generates
from kedro.pipeline import Pipeline
from kedro.io import DataCatalog


def analyze_resume_points(
    pipeline: Pipeline,
    catalog: DataCatalog,
    failed_node_name: str,
) -> list[str]:
    """
    Analyze a failed pipeline and determine optimal resume points.

    Args:
        pipeline: The Kedro pipeline that failed
        catalog: The data catalog containing dataset definitions
        failed_node_name: Name of the node where execution failed

    Returns:
        List of node names from which execution should resume

    Raises:
        ValueError: If failed_node_name is not in the pipeline
    """
    pass

Provides the pipeline and data catalog framework for analyzing pipeline execution failures and determining optimal resume points.
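One way to think about the problem: the nodes that must be re-run are the failed node plus everything downstream of it, and the optimal resume points are the roots of that sub-graph (re-run nodes with no predecessor that also needs re-running). The sketch below illustrates this logic using a hypothetical lightweight `Node` stand-in (name, inputs, outputs) instead of Kedro's real `Pipeline`/`Node` classes, so it runs standalone; it is one possible approach under those assumptions, not the required implementation.

```python
from collections import defaultdict, deque


class Node:
    """Hypothetical stand-in for a Kedro node: a name plus dataset I/O."""

    def __init__(self, name, inputs, outputs):
        self.name, self.inputs, self.outputs = name, inputs, outputs


def analyze_resume_points(nodes, failed_node_name):
    """Return the minimal set of node names from which to resume.

    Re-run set = failed node + all downstream nodes; resume points are
    the members of that set with no predecessor inside the set.
    """
    by_name = {n.name: n for n in nodes}
    if failed_node_name not in by_name:
        raise ValueError(f"Node {failed_node_name!r} is not in the pipeline")

    # Map each dataset to the node that produces it.
    producer = {out: n.name for n in nodes for out in n.outputs}

    # Build successor edges: producing node -> consuming node.
    successors = defaultdict(set)
    for n in nodes:
        for inp in n.inputs:
            if inp in producer:
                successors[producer[inp]].add(n.name)

    # Walk downstream from the failed node to collect everything to re-run.
    to_rerun = {failed_node_name}
    queue = deque([failed_node_name])
    while queue:
        for succ in successors[queue.popleft()]:
            if succ not in to_rerun:
                to_rerun.add(succ)
                queue.append(succ)

    def preds(name):
        # Upstream nodes feeding this node's inputs.
        return {producer[i] for i in by_name[name].inputs if i in producer}

    # Keep only re-run nodes whose predecessors all lie outside the re-run set.
    return sorted(n for n in to_rerun if not (preds(n) & to_rerun))
```

For a linear chain a -> b -> c with an independent node d, a failure at b marks {b, c} for re-running, and b is the single resume point since its predecessor a completed successfully.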
Install with Tessl CLI
npx tessl i tessl/pypi-kedro