
tessl/pypi-dagster

A cloud-native data pipeline orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability.

Workspace: tessl
Visibility: Public
Describes: pkg:pypi/dagster@1.11.x

To install, run

npx @tessl/cli install tessl/pypi-dagster@1.11.0


Dagster Knowledge Tile

Overview

Dagster is a cloud-native data pipeline orchestrator with integrated lineage, observability, and a declarative programming model. It provides a framework for building, testing, and deploying data pipelines with software engineering best practices, enabling teams to build reliable, scalable data platforms with rich metadata and powerful automation.

Package Information

Package: dagster
Version: 1.11.8
API Elements: 700+ public API elements
Functional Areas: 26 major areas

Core Imports

import dagster
from dagster import (
    # Core decorators
    asset, multi_asset, op, job, graph, sensor, schedule, resource,
    
    # Definition classes
    Definitions, AssetsDefinition, OpDefinition, JobDefinition,
    
    # Configuration
    Config, ConfigurableResource, Field, Shape,
    
    # Execution
    materialize, execute_job, build_op_context, build_asset_context,
    
    # Types
    In, Out, AssetIn, AssetOut, DagsterType,
    
    # Events and Results
    AssetMaterialization, AssetObservation, MaterializeResult,
    
    # Storage
    IOManager, fs_io_manager, mem_io_manager,
    
    # Partitions
    DailyPartitionsDefinition, HourlyPartitionsDefinition, StaticPartitionsDefinition,
    
    # Error handling
    DagsterError, Failure, RetryRequested
)

Architecture Overview

Dagster follows a modular architecture organized around these core concepts:

1. Software-Defined Assets (SDAs)

The primary abstraction for data artifacts. Assets represent data products that exist or should exist.

@asset
def my_asset() -> MaterializeResult:
    """Define a software-defined asset."""
    # Asset computation logic
    return MaterializeResult()

2. Operations and Jobs

Fundamental computational units and their orchestration.

@op
def my_op() -> str:
    """Define an operation."""
    return "result"

@job
def my_job():
    """Define a job."""
    my_op()

3. Resources

External dependencies and services injected into computations.

@resource
def my_resource():
    """Define a resource."""
    return "resource_value"

4. Schedules and Sensors

Automation triggers for pipeline execution.

@schedule(cron_schedule="0 0 * * *", job=my_job)
def daily_schedule():
    """Define a daily schedule."""
    return {}

Key Capabilities

Asset Management

  • Asset Definition: @asset decorator and AssetsDefinition class
  • Multi-Asset Support: @multi_asset for defining multiple related assets
  • Asset Dependencies: Automatic dependency inference and explicit specification
  • Asset Checks: Data quality validation with @asset_check
  • Asset Lineage: Automatic lineage tracking and visualization

See: Core Definitions
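
As an illustration of asset checks, here is a minimal sketch; the orders asset and the check name are hypothetical, not part of the package:

from dagster import AssetCheckResult, asset, asset_check

@asset
def orders() -> list:
    """Hypothetical upstream asset producing order records."""
    return [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 25.5}]

@asset_check(asset=orders)
def orders_has_rows(orders: list) -> AssetCheckResult:
    """Data quality check: the orders asset must contain at least one record."""
    return AssetCheckResult(passed=len(orders) > 0, metadata={"row_count": len(orders)})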

Configuration System

  • Type-Safe Config: Pydantic-based configuration with Config class
  • Configurable Resources: ConfigurableResource for dependency injection
  • Environment Variables: EnvVar for environment-based configuration
  • Schema Validation: Comprehensive validation with helpful error messages

See: Configuration System
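
A minimal sketch of the configuration model; the names QueryConfig, ApiClient, and the API_TOKEN variable are illustrative assumptions:

from dagster import Config, ConfigurableResource, Definitions, EnvVar, asset

class QueryConfig(Config):
    """Type-safe run configuration validated by Pydantic."""
    limit: int = 100

class ApiClient(ConfigurableResource):
    """Configurable resource; the token is resolved from an environment variable at launch."""
    base_url: str
    token: str

@asset
def fetched_records(config: QueryConfig, api_client: ApiClient) -> list:
    """Combines run config with an injected resource."""
    return [f"{api_client.base_url}?limit={config.limit}"]

defs = Definitions(
    assets=[fetched_records],
    resources={"api_client": ApiClient(base_url="https://example.com/api", token=EnvVar("API_TOKEN"))},
)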

Execution Engine

  • Multiple Executors: In-process, multiprocess, and custom execution
  • Rich Contexts: Execution contexts with metadata, logging, and resources
  • Asset Materialization: materialize() function for asset execution
  • Job Execution: execute_job() for operation-based workflows

See: Execution and Contexts
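
A short sketch of in-process execution and direct invocation for tests; the greeting asset is illustrative:

from dagster import asset, build_asset_context, materialize

@asset
def greeting(context) -> str:
    """The `context` parameter exposes logging, metadata, and resources."""
    context.log.info("computing greeting")
    return "hello"

# Ad-hoc, in-process materialization (e.g. from a script or test)
result = materialize([greeting])
assert result.success

# Direct invocation with a built context, useful in unit tests
assert greeting(build_asset_context()) == "hello"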

Storage and I/O

  • I/O Managers: Pluggable storage backends with IOManager interface
  • Built-in I/O: Filesystem, memory, and universal path I/O managers
  • Custom I/O: Configurable I/O managers for any storage system
  • Asset Value Loading: Efficient loading of materialized asset values

See: Storage and I/O
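
A sketch of a custom I/O manager built on ConfigurableIOManager; the class name, pickle format, and base directory are illustrative choices, not part of the package:

import os
import pickle

from dagster import ConfigurableIOManager, InputContext, OutputContext, asset, materialize

class LocalPickleIOManager(ConfigurableIOManager):
    """Illustrative I/O manager that pickles asset outputs under a base directory."""

    base_dir: str

    def _path(self, context) -> str:
        return os.path.join(self.base_dir, "_".join(context.asset_key.path))

    def handle_output(self, context: OutputContext, obj) -> None:
        os.makedirs(self.base_dir, exist_ok=True)
        with open(self._path(context), "wb") as f:
            pickle.dump(obj, f)

    def load_input(self, context: InputContext):
        with open(self._path(context), "rb") as f:
            return pickle.load(f)

@asset
def numbers() -> list:
    return [1, 2, 3]

result = materialize(
    [numbers],
    resources={"io_manager": LocalPickleIOManager(base_dir="/tmp/dagster_pickle_demo")},
)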

Events and Metadata

  • Rich Metadata: Comprehensive metadata system with typed values
  • Asset Events: Materialization and observation events
  • Custom Events: User-defined events for pipeline observability
  • Table Metadata: Specialized metadata for tabular data

See: Events and Metadata
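
A sketch of attaching typed metadata to a materialization; the asset and metadata keys are illustrative:

from dagster import MaterializeResult, MetadataValue, asset

@asset
def summary_report() -> MaterializeResult:
    """Attach typed metadata to the asset's materialization event."""
    totals = {"EU": 42, "US": 17}
    return MaterializeResult(
        metadata={
            "row_count": len(totals),
            "preview": MetadataValue.json(totals),
        }
    )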

Automation

  • Asset Sensors: Event-driven execution based on asset changes
  • Schedule System: Time-based execution with cron expressions
  • Auto-Materialization: Declarative policies for automatic execution
  • Run Status Sensors: Sensors for pipeline failure handling

See: Sensors and Schedules
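
A sketch of an event-driven sensor; the trigger-file path, op, and job names are hypothetical:

import os

from dagster import RunRequest, SkipReason, job, op, sensor

@op
def ingest_file():
    """Placeholder work kicked off by the sensor."""

@job
def ingest_job():
    ingest_file()

@sensor(job=ingest_job)
def new_file_sensor(context):
    """Requests a run whenever a trigger file appears; otherwise explains why it skipped."""
    path = "/tmp/trigger_file"
    if not os.path.exists(path):
        return SkipReason("no trigger file present")
    return RunRequest(run_key=str(os.path.getmtime(path)))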

Partitioning

  • Time Partitions: Daily, hourly, weekly, monthly partitions
  • Static Partitions: Fixed sets of partitions
  • Dynamic Partitions: Runtime-defined partitions
  • Multi-Dimensional: Complex partitioning schemes

See: Partitions System
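
A sketch of a daily-partitioned asset; the asset name and start date are illustrative:

from dagster import AssetExecutionContext, DailyPartitionsDefinition, asset

daily_partitions = DailyPartitionsDefinition(start_date="2024-01-01")

@asset(partitions_def=daily_partitions)
def daily_events(context: AssetExecutionContext) -> list:
    """Each materialization covers one day, identified by the partition key."""
    day = context.partition_key  # e.g. "2024-01-05"
    return [f"event for {day}"]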

Error Handling

  • Structured Errors: Comprehensive error hierarchy
  • Retry Policies: Configurable retry strategies
  • Failure Events: Rich failure information and debugging
  • Graceful Degradation: Partial execution and recovery

See: Error Handling
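
A sketch of the retry and failure primitives; the random failures stand in for real external calls:

import random

from dagster import Failure, RetryPolicy, RetryRequested, op

@op(retry_policy=RetryPolicy(max_retries=3, delay=5))
def policy_retried_op() -> str:
    """Any exception here is retried by the configured policy (3 attempts, 5 s apart)."""
    if random.random() < 0.5:  # stand-in for an unreliable external call
        raise ConnectionError("transient failure")
    return "ok"

@op
def manually_controlled_op() -> str:
    """RetryRequested asks for a retry explicitly; Failure fails with structured metadata."""
    outcome = random.random()
    if outcome < 0.3:
        raise RetryRequested(max_retries=2, seconds_to_wait=1)
    if outcome < 0.4:
        raise Failure(description="unrecoverable input", metadata={"outcome": outcome})
    return "ok"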

Basic Usage Example

import pandas as pd
from dagster import asset, materialize, Definitions

@asset
def raw_data() -> pd.DataFrame:
    """Load raw data from source."""
    return pd.DataFrame({"id": [1, 2, 3], "value": [10, 20, 30]})

@asset
def processed_data(raw_data: pd.DataFrame) -> pd.DataFrame:
    """Process raw data."""
    return raw_data.assign(processed_value=raw_data["value"] * 2)

@asset  
def analysis_result(processed_data: pd.DataFrame) -> dict:
    """Generate analysis from processed data."""
    return {
        "total_records": len(processed_data),
        "average_value": processed_data["processed_value"].mean()
    }

# Define the complete set of definitions
defs = Definitions(
    assets=[raw_data, processed_data, analysis_result]
)

# Materialize assets
if __name__ == "__main__":
    result = materialize([raw_data, processed_data, analysis_result])
    print(f"Materialized {len(result.asset_materializations)} assets")

Advanced Features

Components System (Beta)

Modular, reusable components for complex data platforms:

from dagster import Definitions, load_assets_from_modules

# `my_assets_module` stands in for a module containing @asset definitions,
# and `database_resource` for a resource defined elsewhere in the project.
defs = Definitions(
    assets=load_assets_from_modules([my_assets_module]),
    resources={"database": database_resource}
)

Pipes Integration

Execute external processes with full Dagster integration:

from dagster import AssetExecutionContext, PipesSubprocessClient, asset

@asset
def external_asset(
    context: AssetExecutionContext,
    pipes_subprocess_client: PipesSubprocessClient,
):
    """Run an external Python script, forwarding the Dagster context via Pipes."""
    return pipes_subprocess_client.run(
        command=["python", "external_script.py"],
        context=context,
    ).get_materialize_result()

Declarative Automation

Sophisticated automation policies:

from dagster import AutoMaterializePolicy, AutoMaterializeRule, asset

@asset
def upstream_asset() -> int:
    """Placeholder upstream dependency."""
    return 1

@asset(
    auto_materialize_policy=AutoMaterializePolicy.eager()
    .with_rules(AutoMaterializeRule.materialize_on_parent_updated())
)
def auto_asset(upstream_asset: int) -> int:
    """Materialized automatically whenever its parent updates."""
    return upstream_asset * 2

Integration Capabilities

Dagster integrates with the entire modern data stack:

  • Data Warehouses: Snowflake, BigQuery, Redshift, PostgreSQL
  • Data Lakes: S3, GCS, Azure, Delta Lake, Iceberg
  • ML Platforms: MLflow, Weights & Biases, SageMaker
  • Orchestration: Kubernetes, Docker, cloud platforms
  • Observability: Slack, email, PagerDuty, custom webhooks
  • Version Control: Git-based deployments and CI/CD integration

Getting Started

  1. Install Dagster: pip install dagster dagster-webserver
  2. Define Assets: Create Python functions decorated with @asset
  3. Create Definitions: Bundle assets, resources, and schedules in Definitions
  4. Run Dagster: Use dagster dev for local development
  5. Deploy: Use Dagster Cloud or self-hosted deployment options
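
As a minimal sketch of steps 2 through 4, assuming a single-file project (file and asset names are illustrative):

# definitions.py — loadable with `dagster dev -f definitions.py`
from dagster import Definitions, ScheduleDefinition, asset, define_asset_job

@asset
def hello_asset() -> str:
    return "hello"

daily_refresh = ScheduleDefinition(
    job=define_asset_job("refresh_job", selection=[hello_asset]),
    cron_schedule="0 6 * * *",
)

defs = Definitions(assets=[hello_asset], schedules=[daily_refresh])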

Documentation Structure

This Knowledge Tile provides comprehensive API documentation organized by functional area:

  • Core Definitions
  • Configuration System
  • Execution and Contexts
  • Storage and I/O
  • Events and Metadata
  • Sensors and Schedules
  • Partitions System
  • Error Handling

Each section provides complete API documentation with function signatures, type definitions, usage examples, and cross-references to the rest of the Dagster framework.