tessl/pypi-kedro

A comprehensive Python framework for production-ready data science and analytics pipeline development

Workspace: tessl
Visibility: Public
Describes: pkg:pypi/kedro@1.0.x

To install, run

npx @tessl/cli install tessl/pypi-kedro@1.0.0


Kedro

Kedro is a comprehensive Python framework for building production-ready data science and analytics pipelines. It applies software engineering best practices to help you create reproducible, maintainable, and modular data engineering and data science pipelines through uniform project templates, a data abstraction layer, configuration management, and pipeline assembly tools.

Package Information

  • Package Name: kedro
  • Language: Python
  • Installation: pip install kedro
  • Requires Python: >=3.9

Core Imports

import kedro

Common patterns for working with Kedro components:

# Configuration management
from kedro.config import AbstractConfigLoader, OmegaConfigLoader

# Data catalog and datasets
from kedro.io import DataCatalog, AbstractDataset, MemoryDataset

# Pipeline construction  
from kedro.pipeline import Pipeline, Node, pipeline, node

# Pipeline execution
from kedro.runner import SequentialRunner, ParallelRunner, ThreadRunner

# Framework components
from kedro.framework.context import KedroContext
from kedro.framework.session import KedroSession
from kedro.framework.project import configure_project, pipelines, settings

Basic Usage

from kedro.pipeline import pipeline, node
from kedro.io import DataCatalog, MemoryDataset
from kedro.runner import SequentialRunner

# Define a simple processing function
def process_data(input_data):
    """Process input data and return results."""
    return [x * 2 for x in input_data]

# Create a pipeline node
processing_node = node(
    func=process_data,
    inputs="raw_data",
    outputs="processed_data",
    name="process_data_node"
)

# Create a pipeline from nodes
data_pipeline = pipeline([processing_node])

# Set up a data catalog
catalog = DataCatalog({
    "raw_data": MemoryDataset([1, 2, 3, 4, 5]),
    "processed_data": MemoryDataset()
})

# Run the pipeline
runner = SequentialRunner()
runner.run(data_pipeline, catalog)

# Access results
results = catalog.load("processed_data")
print(results)  # [2, 4, 6, 8, 10]

Architecture

Kedro follows a modular architecture built around key abstractions:

  • DataCatalog: Central registry managing all datasets with consistent load/save interfaces
  • Pipeline: Directed acyclic graph (DAG) of processing nodes with automatic dependency resolution
  • Node: Individual computation units that transform inputs to outputs via Python functions
  • Runner: Execution engines supporting sequential, parallel, and threaded processing strategies
  • KedroContext: Project context providing configuration, catalog access, and environment management
  • KedroSession: Session management for project lifecycle and execution environment

This design enables scalable data workflows that follow software engineering principles, supporting everything from local development to production deployment across different compute environments.

Capabilities

Configuration Management

Flexible configuration loading supporting multiple formats (YAML, JSON) with environment-specific overrides, parameter management, and extensible loader implementations.

class AbstractConfigLoader:
    def load_and_merge_dir_config(self, config_path, env=None, **kwargs): ...
    def get(self, *patterns, **kwargs): ...

class OmegaConfigLoader(AbstractConfigLoader):
    def __init__(self, conf_source, base_env="base", default_run_env="local", **kwargs): ...
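
A minimal usage sketch, assuming a standard Kedro conf/ layout with base/ and local/ environments holding files such as catalog.yml and parameters.yml (the path here is illustrative):

from kedro.config import OmegaConfigLoader

# Point the loader at the project's conf/ directory (illustrative path)
config_loader = OmegaConfigLoader(conf_source="conf")

# Keys correspond to config patterns; "catalog" and "parameters" are built in
catalog_config = config_loader["catalog"]
parameters = config_loader["parameters"]
print(parameters)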

See docs/configuration.md

Data Catalog and Dataset Management

Comprehensive data abstraction layer providing consistent interfaces for various data sources, versioning support, lazy loading, and catalog-based dataset management.

class DataCatalog:
    def load(self, name): ...
    def save(self, name, data): ...
    def list(self): ...
    def exists(self, name): ...
    def add(self, data_set_name, data_set, replace=False): ...

class AbstractDataset:
    def load(self): ...
    def save(self, data): ...
    def exists(self): ...
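
A short sketch exercising the DataCatalog interface stubbed above; the dataset names are illustrative:

from kedro.io import DataCatalog, MemoryDataset

catalog = DataCatalog({"raw_data": MemoryDataset([1, 2, 3])})

# Register an additional dataset after construction
catalog.add("model_input", MemoryDataset())

print(catalog.list())              # names of all registered datasets
print(catalog.exists("raw_data"))  # True: the MemoryDataset was preloaded

catalog.save("model_input", {"features": [1, 2, 3]})
print(catalog.load("model_input"))  # {'features': [1, 2, 3]}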

See docs/data-catalog.md

Pipeline Construction

Pipeline definition capabilities including node creation, dependency management, pipeline composition, filtering, and transformation operations.

class Pipeline:
    def filter(self, tags=None, from_nodes=None, to_nodes=None, **kwargs): ...
    def tag(self, tags): ...
    def __add__(self, other): ...
    def __or__(self, other): ...

class Node:
    def __init__(self, func, inputs, outputs, name=None, tags=None): ...

def node(func, inputs, outputs, name=None, tags=None): ...
def pipeline(pipe, inputs=None, outputs=None, parameters=None, tags=None): ...
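
A sketch of composing and filtering pipelines with the API above; the function and dataset names are illustrative:

from kedro.pipeline import node, pipeline

def clean(raw):
    """Drop missing values."""
    return [x for x in raw if x is not None]

def total(cleaned):
    """Aggregate the cleaned values."""
    return sum(cleaned)

prep = pipeline([
    node(clean, inputs="raw_data", outputs="clean_data", name="clean", tags="prep"),
])
report = pipeline([
    node(total, inputs="clean_data", outputs="total_value", name="total", tags="report"),
])

# __add__ combines pipelines; dependencies resolve through shared dataset names
full = prep + report

# Keep only the nodes tagged "prep"
prep_only = full.filter(tags=["prep"])
print([n.name for n in full.nodes])  # ['clean', 'total']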

See docs/pipeline-construction.md

Pipeline Execution

Multiple execution strategies for running pipelines, including sequential, parallel (multiprocessing), and threaded execution, with support for partial runs and custom data loading.

class AbstractRunner:
    def run(self, pipeline, catalog, hook_manager=None, session_id=None): ...
    def run_only_missing(self, pipeline, catalog, hook_manager=None, session_id=None): ...

class SequentialRunner(AbstractRunner): ...
class ParallelRunner(AbstractRunner): ...  
class ThreadRunner(AbstractRunner): ...
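
A sketch of choosing a runner, reusing the pipeline and catalog from Basic Usage. ThreadRunner suits IO-bound nodes; ParallelRunner uses multiprocessing, so it needs picklable nodes and datasets and is typically guarded by if __name__ == "__main__" on platforms that spawn processes:

from kedro.runner import SequentialRunner, ThreadRunner

# Run independent nodes concurrently in up to four threads
runner = ThreadRunner(max_workers=4)
runner.run(data_pipeline, catalog)

# Equivalent single-threaded run
SequentialRunner().run(data_pipeline, catalog)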

See docs/pipeline-execution.md

Project Context and Session Management

Project lifecycle management including context creation, session handling, configuration access, and environment setup for Kedro applications.

class KedroContext:
    def run(self, pipeline_name=None, tags=None, runner=None, **kwargs): ...
    @property
    def catalog(self): ...
    @property
    def config_loader(self): ...

class KedroSession:
    @classmethod
    def create(cls, project_path=None, save_on_close=True, **kwargs): ...
    def load_context(self): ...
    def run(self, pipeline_name=None, tags=None, runner=None, **kwargs): ...
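
A sketch of running a project through a session; it assumes you are inside an existing Kedro project, and the "." path and pipeline name are placeholders:

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

bootstrap_project(".")  # read pyproject.toml and configure the project package
with KedroSession.create(project_path=".") as session:
    context = session.load_context()  # exposes context.catalog, context.config_loader
    session.run(pipeline_name="__default__")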

See docs/context-session.md

CLI and Project Management

Command-line interface for project creation, pipeline execution, and project management, with an extensible plugin system and project discovery utilities.

def main(): ...
def configure_project(package_name): ...
def find_pipelines(raise_errors=False): ...
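
A sketch of programmatic project setup and pipeline discovery; "my_package" is a placeholder for your project's package name:

from kedro.framework.project import configure_project, find_pipelines

configure_project("my_package")  # wire up settings and the pipeline registry

# Collect Pipeline objects defined under my_package.pipelines.*
discovered = find_pipelines()
print(list(discovered))  # pipeline names, e.g. ["__default__", ...]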

See docs/cli-project.md

Hook System and Extensions

Plugin architecture enabling custom behavior injection at various lifecycle points including node execution, pipeline runs, and catalog operations.

def hook_impl(func): ...
def _create_hook_manager(): ...
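
A minimal hook sketch; hook specs accept more arguments than shown here, and the underlying pluggy machinery lets an implementation declare only the ones it needs:

from kedro.framework.hooks import hook_impl

class LoggingHooks:
    @hook_impl
    def before_node_run(self, node):
        # Called just before each node executes
        print(f"Running node: {node.name}")

# In a project, register the hook in settings.py (illustrative):
# HOOKS = (LoggingHooks(),)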

See docs/hooks.md

IPython and Jupyter Integration

Interactive development support with magic commands for reloading projects and debugging nodes, plus seamless integration with Jupyter notebooks and IPython environments.

def load_ipython_extension(ipython): ...
def reload_kedro(path=None, env=None, runtime_params=None, **kwargs): ...
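
A sketch of the interactive workflow; the magics below run inside IPython or Jupyter, not as plain Python:

# In [1]: %load_ext kedro.ipython
#         -> binds catalog, context, pipelines, and session in the namespace
# In [2]: %reload_kedro --env=local
#         -> re-bootstraps the project after code or config changes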

See docs/ipython-integration.md