A cloud-native data pipeline orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability.
npx @tessl/cli install tessl/pypi-dagster@1.11.8
# Dagster Knowledge Tile

## Overview

Dagster is a cloud-native data pipeline orchestrator with integrated lineage, observability, and a declarative programming model. It applies software engineering best practices such as testing, typed configuration, and version control to building, testing, and deploying data pipelines, enabling teams to run reliable, scalable data platforms with rich metadata and powerful automation.

## Package Information

**Package:** `dagster`
**Version:** 1.11.8
**API Elements:** 700+ public API elements
**Functional Areas:** 26 major areas

## Core Imports

```python
import dagster

from dagster import (
    # Core decorators
    asset, multi_asset, op, job, graph, sensor, schedule, resource,

    # Definition classes
    Definitions, AssetsDefinition, OpDefinition, JobDefinition,

    # Configuration
    Config, ConfigurableResource, Field, Shape,

    # Execution
    materialize, execute_job, build_op_context, build_asset_context,

    # Types
    In, Out, AssetIn, AssetOut, DagsterType,

    # Events and Results
    AssetMaterialization, AssetObservation, MaterializeResult,

    # Storage
    IOManager, fs_io_manager, mem_io_manager,

    # Partitions
    DailyPartitionsDefinition, HourlyPartitionsDefinition, StaticPartitionsDefinition,

    # Error handling
    DagsterError, Failure, RetryRequested
)
```

## Architecture Overview

Dagster follows a modular architecture organized around these core concepts:

### 1. Software-Defined Assets (SDAs)
The primary abstraction for data artifacts. Assets represent data products that exist or should exist.

```python { .api }
@asset
def my_asset() -> MaterializeResult:
    """Define a software-defined asset."""
    # Asset computation logic
    return MaterializeResult()
```

### 2. Operations and Jobs
Fundamental computational units and their orchestration.

```python { .api }
@op
def my_op() -> str:
    """Define an operation."""
    return "result"

@job
def my_job():
    """Define a job."""
    my_op()
```

### 3. Resources
External dependencies and services injected into computations.

```python { .api }
@resource
def my_resource():
    """Define a resource."""
    return "resource_value"
```

### 4. Schedules and Sensors
Automation triggers for pipeline execution.

```python { .api }
@schedule(cron_schedule="0 0 * * *", job=my_job)
def daily_schedule():
    """Define a daily schedule."""
    return {}  # empty run config for each scheduled run
```

## Key Capabilities

### Asset Management
- **Asset Definition**: `@asset` decorator and `AssetsDefinition` class
- **Multi-Asset Support**: `@multi_asset` for defining multiple related assets
- **Asset Dependencies**: Automatic dependency inference and explicit specification
- **Asset Checks**: Data quality validation with `@asset_check`
- **Asset Lineage**: Automatic lineage tracking and visualization

See: [Core Definitions](./core-definitions.md)
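
As a quick sketch of the asset-check workflow: the following defines a small asset and a quality check against it. The `orders` asset and the non-empty rule are illustrative, not part of the Dagster API.

```python
from dagster import asset, asset_check, AssetCheckResult

@asset
def orders() -> list:
    """Illustrative asset holding order records."""
    return [{"id": 1, "amount": 100}, {"id": 2, "amount": 250}]

@asset_check(asset=orders)
def orders_not_empty(orders: list) -> AssetCheckResult:
    """Data quality check: fail if the asset materialized no records."""
    return AssetCheckResult(passed=len(orders) > 0)
```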

### Configuration System
- **Type-Safe Config**: Pydantic-based configuration with `Config` class
- **Configurable Resources**: `ConfigurableResource` for dependency injection
- **Environment Variables**: `EnvVar` for environment-based configuration
- **Schema Validation**: Comprehensive validation with helpful error messages

See: [Configuration System](./configuration.md)
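
A minimal sketch of the typed-configuration pattern; `IngestConfig`, `WarehouseResource`, and the `WAREHOUSE_URL` variable are hypothetical names chosen for this example.

```python
from dagster import Config, ConfigurableResource, Definitions, EnvVar, asset

class IngestConfig(Config):
    """Run-level config, validated by Pydantic."""
    limit: int = 100

class WarehouseResource(ConfigurableResource):
    """Hypothetical resource wrapping a connection string."""
    connection_url: str

@asset
def ingested_rows(config: IngestConfig, warehouse: WarehouseResource) -> list:
    # Config and resources are type-checked and injected at runtime
    return [f"row {i} via {warehouse.connection_url}" for i in range(config.limit)]

defs = Definitions(
    assets=[ingested_rows],
    resources={"warehouse": WarehouseResource(connection_url=EnvVar("WAREHOUSE_URL"))},
)
```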

### Execution Engine
- **Multiple Executors**: In-process, multiprocess, and custom execution
- **Rich Contexts**: Execution contexts with metadata, logging, and resources
- **Asset Materialization**: `materialize()` function for asset execution
- **Job Execution**: `execute_job()` for operation-based workflows

See: [Execution and Contexts](./execution-contexts.md)
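
Because `materialize()` runs assets in process and returns a result object, it doubles as a test harness. A small sketch (the asset names are illustrative):

```python
from dagster import asset, materialize

@asset
def numbers() -> list:
    return [1, 2, 3]

@asset
def total(numbers: list) -> int:
    return sum(numbers)

# In-process execution returns an ExecuteInProcessResult for assertions
result = materialize([numbers, total])
assert result.success
assert result.output_for_node("total") == 6
```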

### Storage and I/O
- **I/O Managers**: Pluggable storage backends with `IOManager` interface
- **Built-in I/O**: Filesystem, memory, and universal path I/O managers
- **Custom I/O**: Configurable I/O managers for any storage system
- **Asset Value Loading**: Efficient loading of materialized asset values

See: [Storage and I/O](./storage-io.md)
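
A sketch of a custom `IOManager` that pickles asset values to a local directory; the class name and path scheme are illustrative, not a built-in.

```python
import os
import pickle

from dagster import IOManager, io_manager

class LocalPickleIOManager(IOManager):
    """Illustrative I/O manager persisting outputs as pickle files."""

    def __init__(self, base_dir: str):
        self.base_dir = base_dir

    def _path(self, context) -> str:
        # One file per asset key
        return os.path.join(self.base_dir, *context.asset_key.path)

    def handle_output(self, context, obj) -> None:
        os.makedirs(self.base_dir, exist_ok=True)
        with open(self._path(context), "wb") as f:
            pickle.dump(obj, f)

    def load_input(self, context):
        with open(self._path(context), "rb") as f:
            return pickle.load(f)

@io_manager
def local_pickle_io_manager():
    return LocalPickleIOManager(base_dir="/tmp/dagster_storage")
```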

### Events and Metadata
- **Rich Metadata**: Comprehensive metadata system with typed values
- **Asset Events**: Materialization and observation events
- **Custom Events**: User-defined events for pipeline observability
- **Table Metadata**: Specialized metadata for tabular data

See: [Events and Metadata](./events-metadata.md)
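
A sketch of attaching typed metadata to a materialization via `MaterializeResult`; the row count and markdown preview are placeholder values.

```python
from dagster import MaterializeResult, MetadataValue, asset

@asset
def summarized_table() -> MaterializeResult:
    row_count = 42  # placeholder value
    return MaterializeResult(
        metadata={
            "row_count": MetadataValue.int(row_count),
            "preview": MetadataValue.md("| id |\n|----|\n| 1 |"),
        }
    )
```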

### Automation
- **Asset Sensors**: Event-driven execution based on asset changes
- **Schedule System**: Time-based execution with cron expressions
- **Auto-Materialization**: Declarative policies for automatic execution
- **Run Status Sensors**: Sensors for pipeline failure handling

See: [Sensors and Schedules](./sensors-schedules.md)
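
A sketch of an event-driven trigger; `ingest_job` and the marker-file path are illustrative.

```python
import os

from dagster import RunRequest, SkipReason, job, op, sensor

@op
def ingest():
    """Placeholder ingestion step."""

@job
def ingest_job():
    ingest()

@sensor(job=ingest_job)
def new_file_sensor():
    """Request a run whenever a marker file appears."""
    path = "/tmp/new_data.marker"  # illustrative location
    if os.path.exists(path):
        # run_key de-duplicates requests for the same event
        return RunRequest(run_key=path)
    return SkipReason("no new file")
```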

### Partitioning
- **Time Partitions**: Daily, hourly, weekly, monthly partitions
- **Static Partitions**: Fixed sets of partitions
- **Dynamic Partitions**: Runtime-defined partitions
- **Multi-Dimensional**: Complex partitioning schemes

See: [Partitions System](./partitions.md)
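
A sketch of a daily-partitioned asset; the start date and event payload are placeholders.

```python
from dagster import AssetExecutionContext, DailyPartitionsDefinition, asset

daily = DailyPartitionsDefinition(start_date="2024-01-01")

@asset(partitions_def=daily)
def daily_events(context: AssetExecutionContext) -> list:
    """One materialization per calendar day."""
    day = context.partition_key  # e.g. "2024-01-01"
    return [f"event for {day}"]
```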

### Error Handling
- **Structured Errors**: Comprehensive error hierarchy
- **Retry Policies**: Configurable retry strategies
- **Failure Events**: Rich failure information and debugging
- **Graceful Degradation**: Partial execution and recovery

See: [Error Handling](./error-handling.md)
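
A sketch combining `RetryRequested` and `Failure`; `fetch_remote_data` is a stand-in for any fallible external call.

```python
from dagster import Failure, RetryRequested, op

def fetch_remote_data():
    """Stand-in for a fallible external call."""
    raise TimeoutError

@op
def flaky_op():
    try:
        result = fetch_remote_data()
    except TimeoutError as e:
        # Ask Dagster to re-run this op with a delay between attempts
        raise RetryRequested(max_retries=3, seconds_to_wait=10) from e
    if result is None:
        # Structured failure with a searchable description
        raise Failure(description="remote source returned no data")
    return result
```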

## Basic Usage Example

```python { .api }
import pandas as pd
from dagster import asset, materialize, Definitions

@asset
def raw_data() -> pd.DataFrame:
    """Load raw data from source."""
    return pd.DataFrame({"id": [1, 2, 3], "value": [10, 20, 30]})

@asset
def processed_data(raw_data: pd.DataFrame) -> pd.DataFrame:
    """Process raw data."""
    return raw_data.assign(processed_value=raw_data["value"] * 2)

@asset
def analysis_result(processed_data: pd.DataFrame) -> dict:
    """Generate analysis from processed data."""
    return {
        "total_records": len(processed_data),
        "average_value": processed_data["processed_value"].mean(),
    }

# Define the complete set of definitions
defs = Definitions(
    assets=[raw_data, processed_data, analysis_result]
)

# Materialize assets
if __name__ == "__main__":
    result = materialize([raw_data, processed_data, analysis_result])
    print(f"Materialized {len(result.get_asset_materialization_events())} assets")
```

## Advanced Features

### Components System (Beta)
Modular, reusable components for complex data platforms:

```python { .api }
from dagster import Definitions, load_assets_from_modules

# Load assets modularly; my_assets_module and database_resource are
# placeholders for your own module and resource
defs = Definitions(
    assets=load_assets_from_modules([my_assets_module]),
    resources={"database": database_resource},
)
```

### Pipes Integration
Execute external processes with full Dagster integration:

```python { .api }
from dagster import AssetExecutionContext, PipesSubprocessClient, asset

@asset
def external_asset(
    context: AssetExecutionContext,
    pipes_subprocess_client: PipesSubprocessClient,
):
    """Run an external Python script with Dagster context."""
    return pipes_subprocess_client.run(
        command=["python", "external_script.py"],
        context=context,
    ).get_results()
```

### Declarative Automation
Sophisticated automation policies:

```python { .api }
from dagster import AutoMaterializePolicy, AutoMaterializeRule, asset

@asset(
    auto_materialize_policy=AutoMaterializePolicy.eager()
    .with_rules(AutoMaterializeRule.materialize_on_parent_updated())
)
def auto_asset(upstream_asset):
    """Automatically materialized asset."""
    return process(upstream_asset)  # process() is a placeholder transformation
```
243
244
## Integration Capabilities
245
246
Dagster integrates with the entire modern data stack:
247
248
- **Data Warehouses**: Snowflake, BigQuery, Redshift, PostgreSQL
249
- **Data Lakes**: S3, GCS, Azure, Delta Lake, Iceberg
250
- **ML Platforms**: MLflow, Weights & Biases, SageMaker
251
- **Orchestration**: Kubernetes, Docker, cloud platforms
252
- **Observability**: Slack, email, PagerDuty, custom webhooks
253
- **Version Control**: Git-based deployments and CI/CD integration
254
255
## Getting Started

1. **Install Dagster**: `pip install dagster dagster-webserver`
2. **Define Assets**: Create Python functions decorated with `@asset`
3. **Create Definitions**: Bundle assets, resources, and schedules in `Definitions`
4. **Run Dagster**: Use `dagster dev` for local development
5. **Deploy**: Use Dagster Cloud or self-hosted deployment options
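
As a sketch of steps 2 through 4, a single hypothetical `definitions.py` is enough to start the local UI:

```python
# definitions.py
from dagster import Definitions, asset

@asset
def hello() -> str:
    return "world"

defs = Definitions(assets=[hello])
```

Then `dagster dev -f definitions.py` serves the UI locally.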

## Documentation Structure

This Knowledge Tile provides comprehensive API documentation organized by functional area:

- **[Core Definitions](./core-definitions.md)** - Assets, operations, jobs, and repositories
- **[Configuration System](./configuration.md)** - Type-safe configuration and resources
- **[Execution and Contexts](./execution-contexts.md)** - Runtime system and execution contexts
- **[Storage and I/O](./storage-io.md)** - Data persistence and I/O management
- **[Events and Metadata](./events-metadata.md)** - Event system and metadata framework
- **[Sensors and Schedules](./sensors-schedules.md)** - Automation and triggering systems
- **[Partitions System](./partitions.md)** - Data partitioning and time-based execution
- **[Error Handling](./error-handling.md)** - Error hierarchy and failure management
Each section provides complete API documentation, including function signatures, type definitions, usage examples, and cross-references for working with the Dagster framework.