# Dagster Knowledge Tile

## Overview

Dagster is a cloud-native data pipeline orchestrator with integrated lineage, observability, and a declarative programming model. It provides a framework for building, testing, and deploying data pipelines with software engineering best practices, enabling teams to build reliable, scalable data platforms with rich metadata and powerful automation.

## Package Information

**Package:** `dagster`

**Version:** 1.11.8

**API Elements:** 700+ public API elements

**Functional Areas:** 26 major areas

## Core Imports

```python
import dagster
from dagster import (
    # Core decorators
    asset, multi_asset, op, job, graph, sensor, schedule, resource,

    # Definition classes
    Definitions, AssetsDefinition, OpDefinition, JobDefinition,

    # Configuration
    Config, ConfigurableResource, Field, Shape,

    # Execution
    materialize, execute_job, build_op_context, build_asset_context,

    # Types
    In, Out, AssetIn, AssetOut, DagsterType,

    # Events and Results
    AssetMaterialization, AssetObservation, MaterializeResult,

    # Storage
    IOManager, fs_io_manager, mem_io_manager,

    # Partitions
    DailyPartitionsDefinition, HourlyPartitionsDefinition, StaticPartitionsDefinition,

    # Error handling
    DagsterError, Failure, RetryRequested,
)
```

## Architecture Overview

Dagster follows a modular architecture organized around these core concepts:

### 1. Software-Defined Assets (SDAs)

The primary abstraction for data artifacts. Assets represent data products that exist or should exist.

```python { .api }
@asset
def my_asset() -> MaterializeResult:
    """Define a software-defined asset."""
    # Asset computation logic
    return MaterializeResult()
```

### 2. Operations and Jobs

Fundamental computational units and their orchestration.

```python { .api }
@op
def my_op() -> str:
    """Define an operation."""
    return "result"


@job
def my_job():
    """Define a job."""
    my_op()
```

### 3. Resources

External dependencies and services injected into computations.

```python { .api }
@resource
def my_resource():
    """Define a resource."""
    return "resource_value"
```

### 4. Schedules and Sensors

Automation triggers for pipeline execution.

```python { .api }
@schedule(cron_schedule="0 0 * * *", job=my_job)
def daily_schedule():
    """Define a daily schedule."""
    return {}
```

## Key Capabilities

### Asset Management

- **Asset Definition**: `@asset` decorator and `AssetsDefinition` class
- **Multi-Asset Support**: `@multi_asset` for defining multiple related assets
- **Asset Dependencies**: Automatic dependency inference and explicit specification
- **Asset Checks**: Data quality validation with `@asset_check`
- **Asset Lineage**: Automatic lineage tracking and visualization

See: [Core Definitions](./core-definitions.md)

### Configuration System

- **Type-Safe Config**: Pydantic-based configuration with `Config` class
- **Configurable Resources**: `ConfigurableResource` for dependency injection
- **Environment Variables**: `EnvVar` for environment-based configuration
- **Schema Validation**: Comprehensive validation with helpful error messages

See: [Configuration System](./configuration.md)

### Execution Engine

- **Multiple Executors**: In-process, multiprocess, and custom execution
- **Rich Contexts**: Execution contexts with metadata, logging, and resources
- **Asset Materialization**: `materialize()` function for asset execution
- **Job Execution**: `execute_job()` for operation-based workflows

See: [Execution and Contexts](./execution-contexts.md)

### Storage and I/O

- **I/O Managers**: Pluggable storage backends with the `IOManager` interface
- **Built-in I/O**: Filesystem, memory, and universal path I/O managers
- **Custom I/O**: Configurable I/O managers for any storage system
- **Asset Value Loading**: Efficient loading of materialized asset values

See: [Storage and I/O](./storage-io.md)

### Events and Metadata

- **Rich Metadata**: Comprehensive metadata system with typed values
- **Asset Events**: Materialization and observation events
- **Custom Events**: User-defined events for pipeline observability
- **Table Metadata**: Specialized metadata for tabular data

See: [Events and Metadata](./events-metadata.md)

### Automation

- **Asset Sensors**: Event-driven execution based on asset changes
- **Schedule System**: Time-based execution with cron expressions
- **Auto-Materialization**: Declarative policies for automatic execution
- **Run Status Sensors**: Sensors for pipeline failure handling

See: [Sensors and Schedules](./sensors-schedules.md)

### Partitioning

- **Time Partitions**: Daily, hourly, weekly, and monthly partitions
- **Static Partitions**: Fixed sets of partitions
- **Dynamic Partitions**: Runtime-defined partitions
- **Multi-Dimensional**: Complex partitioning schemes

See: [Partitions System](./partitions.md)

### Error Handling

- **Structured Errors**: Comprehensive error hierarchy
- **Retry Policies**: Configurable retry strategies
- **Failure Events**: Rich failure information and debugging
- **Graceful Degradation**: Partial execution and recovery

See: [Error Handling](./error-handling.md)

## Basic Usage Example

```python { .api }
import pandas as pd
from dagster import asset, materialize, Definitions

@asset
def raw_data() -> pd.DataFrame:
    """Load raw data from source."""
    return pd.DataFrame({"id": [1, 2, 3], "value": [10, 20, 30]})

@asset
def processed_data(raw_data: pd.DataFrame) -> pd.DataFrame:
    """Process raw data."""
    return raw_data.assign(processed_value=raw_data["value"] * 2)

@asset
def analysis_result(processed_data: pd.DataFrame) -> dict:
    """Generate analysis from processed data."""
    return {
        "total_records": len(processed_data),
        "average_value": processed_data["processed_value"].mean(),
    }

# Define the complete set of definitions
defs = Definitions(
    assets=[raw_data, processed_data, analysis_result]
)

# Materialize assets
if __name__ == "__main__":
    result = materialize([raw_data, processed_data, analysis_result])
    print(f"Materialized {len(result.get_asset_materialization_events())} assets")
```

## Advanced Features

### Components System (Beta)

Modular, reusable components for complex data platforms:

```python { .api }
from dagster import Definitions, load_assets_from_modules

# Use module loading for modularity; my_assets_module and
# database_resource are placeholders for your own code
defs = Definitions(
    assets=load_assets_from_modules([my_assets_module]),
    resources={"database": database_resource}
)
```

### Pipes Integration

Execute external processes with full Dagster integration:

```python { .api }
from dagster import AssetExecutionContext, PipesSubprocessClient, asset

@asset
def external_asset(
    context: AssetExecutionContext,
    pipes_subprocess_client: PipesSubprocessClient,
):
    """Run an external Python script with the Dagster context forwarded to it."""
    return pipes_subprocess_client.run(
        command=["python", "external_script.py"],
        context=context,
    ).get_results()
```

### Declarative Automation

Sophisticated automation policies:

```python { .api }
from dagster import AutoMaterializePolicy, AutoMaterializeRule, asset

@asset(
    auto_materialize_policy=AutoMaterializePolicy.eager()
    .with_rules(AutoMaterializeRule.materialize_on_parent_updated())
)
def auto_asset(upstream_asset):
    """Automatically materialized asset; process() is a placeholder for your logic."""
    return process(upstream_asset)
```

## Integration Capabilities

Dagster integrates with the entire modern data stack:

- **Data Warehouses**: Snowflake, BigQuery, Redshift, PostgreSQL
- **Data Lakes**: S3, GCS, Azure, Delta Lake, Iceberg
- **ML Platforms**: MLflow, Weights & Biases, SageMaker
- **Orchestration**: Kubernetes, Docker, cloud platforms
- **Observability**: Slack, email, PagerDuty, custom webhooks
- **Version Control**: Git-based deployments and CI/CD integration

## Getting Started

1. **Install Dagster**: `pip install dagster dagster-webserver`
2. **Define Assets**: Create Python functions decorated with `@asset`
3. **Create Definitions**: Bundle assets, resources, and schedules in `Definitions`
4. **Run Dagster**: Use `dagster dev` for local development
5. **Deploy**: Use Dagster Cloud or self-hosted deployment options

## Documentation Structure

This Knowledge Tile provides comprehensive API documentation organized by functional area:

- **[Core Definitions](./core-definitions.md)** - Assets, operations, jobs, and repositories
- **[Configuration System](./configuration.md)** - Type-safe configuration and resources
- **[Execution and Contexts](./execution-contexts.md)** - Runtime system and execution contexts
- **[Storage and I/O](./storage-io.md)** - Data persistence and I/O management
- **[Events and Metadata](./events-metadata.md)** - Event system and metadata framework
- **[Sensors and Schedules](./sensors-schedules.md)** - Automation and triggering systems
- **[Partitions System](./partitions.md)** - Data partitioning and time-based execution
- **[Error Handling](./error-handling.md)** - Error hierarchy and failure management

Each section provides complete API documentation with function signatures, type definitions, usage examples, and cross-references to enable comprehensive understanding and usage of the Dagster framework.