# Kedro

Kedro is a Python framework for building production-ready data science and analytics pipelines. It applies software engineering best practices to help you create reproducible, maintainable, and modular data engineering and data science pipelines through uniform project templates, a data abstraction layer, configuration management, and pipeline assembly tools.

## Package Information

- **Package Name**: kedro
- **Language**: Python
- **Installation**: `pip install kedro`
- **Requires Python**: >=3.9

## Core Imports

```python
import kedro
```

Common patterns for working with Kedro components:

```python
# Configuration management
from kedro.config import AbstractConfigLoader, OmegaConfigLoader

# Data catalog and datasets
from kedro.io import DataCatalog, AbstractDataset, MemoryDataset

# Pipeline construction
from kedro.pipeline import Pipeline, Node, pipeline, node

# Pipeline execution
from kedro.runner import SequentialRunner, ParallelRunner, ThreadRunner

# Framework components
from kedro.framework.context import KedroContext
from kedro.framework.session import KedroSession
from kedro.framework.project import configure_project, pipelines, settings
```

## Basic Usage

```python
from kedro.pipeline import pipeline, node
from kedro.io import DataCatalog, MemoryDataset
from kedro.runner import SequentialRunner

# Define a simple processing function
def process_data(input_data):
    """Process input data and return results."""
    return [x * 2 for x in input_data]

# Create a pipeline node
processing_node = node(
    func=process_data,
    inputs="raw_data",
    outputs="processed_data",
    name="process_data_node",
)

# Create a pipeline from nodes
data_pipeline = pipeline([processing_node])

# Set up a data catalog
catalog = DataCatalog({
    "raw_data": MemoryDataset([1, 2, 3, 4, 5]),
    "processed_data": MemoryDataset(),
})

# Run the pipeline
runner = SequentialRunner()
runner.run(data_pipeline, catalog)

# Access results from the catalog; run() itself only returns
# outputs that are not registered in the catalog
results = catalog.load("processed_data")
print(results)  # [2, 4, 6, 8, 10]
```

## Architecture

Kedro follows a modular architecture built around key abstractions:

- **DataCatalog**: Central registry managing all datasets with consistent load/save interfaces
- **Pipeline**: Directed acyclic graph (DAG) of processing nodes with automatic dependency resolution
- **Node**: Individual computation units that transform inputs to outputs via Python functions
- **Runner**: Execution engines supporting sequential, parallel, and threaded processing strategies
- **KedroContext**: Project context providing configuration, catalog access, and environment management
- **KedroSession**: Session management for the project lifecycle and execution environment

This design enables scalable data workflows that follow software engineering principles, supporting everything from local development to production deployment across different compute environments.

## Capabilities

### Configuration Management

Flexible configuration loading supporting multiple formats (YAML, JSON) with environment-specific overrides, parameter management, and extensible loader implementations.

```python { .api }
class AbstractConfigLoader:
    def load_and_merge_dir_config(self, config_path, env=None, **kwargs): ...
    def get(self, *patterns, **kwargs): ...

class OmegaConfigLoader(AbstractConfigLoader):
    def __init__(self, conf_source, base_env="base", default_run_env="local", **kwargs): ...
```

[Configuration Management](./configuration.md)

### Data Catalog and Dataset Management

Comprehensive data abstraction layer providing consistent interfaces for various data sources, versioning support, lazy loading, and catalog-based dataset management.

```python { .api }
class DataCatalog:
    def load(self, name): ...
    def save(self, name, data): ...
    def list(self): ...
    def exists(self, name): ...
    def add(self, data_set_name, data_set, replace=False): ...

class AbstractDataset:
    def load(self): ...
    def save(self, data): ...
    def exists(self): ...
```

[Data Catalog and Datasets](./data-catalog.md)

### Pipeline Construction

Pipeline definition capabilities including node creation, dependency management, pipeline composition, filtering, and transformation operations.

```python { .api }
class Pipeline:
    def filter(self, tags=None, from_nodes=None, to_nodes=None, **kwargs): ...
    def tag(self, tags): ...
    def __add__(self, other): ...
    def __or__(self, other): ...

class Node:
    def __init__(self, func, inputs, outputs, name=None, tags=None): ...

def node(func, inputs, outputs, name=None, tags=None): ...
def pipeline(pipe, inputs=None, outputs=None, parameters=None, tags=None): ...
```

[Pipeline Construction](./pipeline-construction.md)

### Pipeline Execution

Multiple execution strategies for running pipelines including sequential, parallel (multiprocessing), and threaded execution, with support for partial runs and custom data loading.

```python { .api }
class AbstractRunner:
    def run(self, pipeline, catalog, hook_manager=None, session_id=None): ...
    def run_only_missing(self, pipeline, catalog, hook_manager=None, session_id=None): ...

class SequentialRunner(AbstractRunner): ...
class ParallelRunner(AbstractRunner): ...
class ThreadRunner(AbstractRunner): ...
```

[Pipeline Execution](./pipeline-execution.md)

### Project Context and Session Management

Project lifecycle management including context creation, session handling, configuration access, and environment setup for Kedro applications.

```python { .api }
class KedroContext:
    def run(self, pipeline_name=None, tags=None, runner=None, **kwargs): ...
    @property
    def catalog(self): ...
    @property
    def config_loader(self): ...

class KedroSession:
    @classmethod
    def create(cls, project_path=None, save_on_close=True, **kwargs): ...
    def load_context(self): ...
    def run(self, pipeline_name=None, tags=None, runner=None, **kwargs): ...
```

[Context and Session Management](./context-session.md)

### CLI and Project Management

Command-line interface for project creation, pipeline execution, and project management, with an extensible plugin system and project discovery utilities.

```python { .api }
def main(): ...
def configure_project(package_name): ...
def find_pipelines(raise_errors=False): ...
```

[CLI and Project Management](./cli-project.md)

### Hook System and Extensions

Plugin architecture enabling custom behavior injection at various lifecycle points, including node execution, pipeline runs, and catalog operations.

```python { .api }
def hook_impl(func): ...
def _create_hook_manager(): ...
```

[Hook System](./hooks.md)

### IPython and Jupyter Integration

Interactive development support with magic commands for reloading projects, debugging nodes, and seamless integration with Jupyter notebooks and IPython environments.

```python { .api }
def load_ipython_extension(ipython): ...
def reload_kedro(path=None, env=None, runtime_params=None, **kwargs): ...
```

[IPython Integration](./ipython-integration.md)