etl-pipeline-design

This skill should be used when the user asks to "design an ETL pipeline", "build data ingestion", "set up data orchestration", "troubleshoot pipeline issues", "optimize data workflows", or mentions ELT, medallion architecture, batch vs streaming, or data transformation patterns.

Install with Tessl CLI

npx tessl i github:back1ply/LLM-Skills --skill etl-pipeline-design

ETL Pipeline Design Skill

Source: Distilled from Understanding ETL (Updated Edition) by Matt Palmer, O'Reilly Media, August 2025. ISBN: 979-8-341-66508-8

Applicable when designing data ingestion from APIs, databases, or streaming sources; building transformation pipelines (batch or streaming); selecting orchestration tools or patterns; troubleshooting pipeline failures or data quality issues; or optimizing pipelines for efficiency and scale.


Core Concepts

ETL vs ELT

  • ETL: Transform data before loading (legacy, when storage was expensive)
  • ELT: Load all data first, transform downstream (modern approach — "storage is cheap")
  • Modern pipelines are typically ELT, but the term "ETL" persists

Medallion Architecture

Stage data in three quality layers:

| Layer | Purpose | Example |
| --- | --- | --- |
| Bronze | Raw, unfiltered data directly from sources | API responses, raw logs |
| Silver | Cleaned, filtered, enriched | Removed duplicates, renamed columns |
| Gold | Stakeholder-ready, often aggregated | Reporting tables, ML features |
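As a rough sketch of the flow between layers, using plain Python records in place of real table storage (field names and values are illustrative):

```python
# Bronze: raw, as ingested — duplicates and source-style keys included.
bronze = [
    {"ord_id": 1, "amt": "10.5", "day": "2025-01-01"},
    {"ord_id": 1, "amt": "10.5", "day": "2025-01-01"},  # duplicate
    {"ord_id": 2, "amt": "4.0", "day": "2025-01-01"},
]

# Silver: deduplicate on the key, rename columns, cast types.
seen, silver = set(), []
for r in bronze:
    if r["ord_id"] in seen:
        continue
    seen.add(r["ord_id"])
    silver.append({"order_id": r["ord_id"], "amount": float(r["amt"]), "date": r["day"]})

# Gold: aggregate to a stakeholder-ready daily total.
gold = {}
for r in silver:
    gold[r["date"]] = gold.get(r["date"], 0.0) + r["amount"]

print(gold)  # {'2025-01-01': 14.5}
```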

1. Data Ingestion

Source Evaluation Checklist

For every data source, answer:

| Question | Why It Matters |
| --- | --- |
| Who will use this data? | Aligns incentives, prioritizes work |
| How will it be used? | Guides downstream decisions |
| Is it bounded or unbounded? | Determines batch vs. streaming |
| What's the minimum update frequency? | Sets hard limits on freshness |
| What's the expected volume? | Informs storage/compute choices |
| What's the format? (JSON, CSV, API, DB) | Dictates processing requirements |
| What's the quality? | Determines transformation needs |

Destination Considerations

  • OLAP (BigQuery, Redshift, Snowflake, Databricks SQL): For analytics, column-oriented
  • OLTP (Postgres, MySQL): For transactional apps, row-oriented
  • Lakehouse (Delta Lake, Iceberg, Hudi): Combines lake + warehouse benefits

Batch vs. Streaming Decision

Is the data bounded (finite)?
├── YES → Batch processing
└── NO (continuous/unbounded) →
    ├── Latency requirement < 1 second? → True streaming (Flink, Kafka Streams)
    └── Latency 100ms-minutes acceptable? → Micro-batch (Spark Structured Streaming)
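The decision tree can be captured as a small helper function (a sketch; the tool names in comments come from the tree above):

```python
def processing_mode(bounded: bool, latency_seconds: float) -> str:
    """Encode the batch-vs-streaming decision tree."""
    if bounded:
        return "batch"                # finite data → batch processing
    if latency_seconds < 1:
        return "true streaming"       # e.g. Flink, Kafka Streams
    return "micro-batch"              # e.g. Spark Structured Streaming
```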

Streaming Methods:

  • Fixed windows: Data batched in fixed time intervals
  • Sliding windows: Overlapping time intervals
  • Sessions: Dynamic windows based on activity gaps
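The first two methods can be sketched as window-assignment functions over numeric event-time timestamps (function names and parameters are illustrative; sessions need stateful gap tracking and are omitted):

```python
def fixed_window(ts: float, size: float) -> float:
    # One non-overlapping bucket per event, keyed by window start.
    return ts - (ts % size)

def sliding_windows(ts: float, size: float, step: float) -> list[float]:
    # Overlapping buckets: every window whose [start, start + size) covers ts.
    starts, start = [], fixed_window(ts, step)
    while start > ts - size:
        starts.append(start)
        start -= step
    return sorted(s for s in starts if s >= 0)
```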

Choosing Ingestion Solutions

| Type | Examples | Pros | Cons |
| --- | --- | --- | --- |
| Legacy declarative | Talend, Pentaho | Robust connectors | Outdated, not MDS-aligned |
| Modern declarative | Fivetran, Airbyte, Stitch | Low maintenance, connectors | Vendor lock-in, cost |
| Native/platform | Lakeflow Connect, Glue | Integrated, managed | Platform-specific |
| Imperative | Custom scripts, Singer taps | Full control | High build/maintain cost |
| Hybrid | Mix of the above | Flexibility | Complexity |

Recommendation: Use declarative for common sources, imperative for edge cases.


2. Data Transformation

Transformation Patterns

| Pattern | Description | Example |
| --- | --- | --- |
| Enrichment | Add data from other sources | Join order codes → readable names |
| Joining | Combine datasets on common keys | Sales + Users → add country |
| Filtering | Select only needed records | `WHERE date >= '2025-01-01'` |
| Structuring | Convert formats | JSON → tabular Parquet |
| Conversion | Change data types | String → datetime |
| Aggregation | Summarize data | Daily totals from hourly data |
| Anonymization | Mask PII | Hash emails |
| Splitting | Break columns apart | email → prefix + domain |
| Deduplication | Remove duplicates | Keep earliest by UUID |
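A few of these patterns applied to one toy record (field names and values are illustrative):

```python
import hashlib
from datetime import datetime

record = {"email": "ada@example.com", "order_date": "2025-01-02", "price": "19.99"}

# Conversion: string → datetime / float
order_date = datetime.strptime(record["order_date"], "%Y-%m-%d")
price = float(record["price"])

# Splitting: email → prefix + domain
prefix, domain = record["email"].split("@", 1)

# Anonymization: hash the email so the stored value is no longer PII
email_hash = hashlib.sha256(record["email"].encode()).hexdigest()

# Filtering: keep only records from 2025 onward
keep = order_date >= datetime(2025, 1, 1)
```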

Update Patterns

| Pattern | When to Use | SQL Concept |
| --- | --- | --- |
| Overwrite | Small datasets, simple refreshes | `TRUNCATE` + `INSERT` |
| Insert | Append-only data (logs, transactions) | `INSERT INTO` |
| Upsert | CDC, deduplication, SCD | `MERGE` |
| Delete | Soft (`status = 'deleted'`) or hard (remove row) | `UPDATE` or `DELETE` |

UPSERT Example (Databricks)

```sql
MERGE INTO people10m
USING people10mupdates
ON people10m.id = people10mupdates.id
WHEN MATCHED THEN UPDATE SET
  firstName = people10mupdates.firstName,
  lastName  = people10mupdates.lastName
WHEN NOT MATCHED THEN INSERT (id, firstName, lastName)
  VALUES (people10mupdates.id, people10mupdates.firstName, people10mupdates.lastName);
```

Best Practices

  • Staging: Always stage intermediate data for recoverability
  • Idempotency: Running a pipeline twice should produce the same result
  • Incrementality: Process only new/changed data when possible
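One common way to get idempotency and incrementality together is partition overwrite: each run fully rewrites the partitions it touches, so a rerun produces the same result. A toy sketch, with an in-memory dict standing in for partitioned table storage:

```python
table: dict[str, list[dict]] = {}

def load_partition(day: str, rows: list[dict]) -> None:
    # Overwrite the whole partition, never append → rerun-safe.
    table[day] = list(rows)

load_partition("2025-01-01", [{"id": 1}, {"id": 2}])
load_partition("2025-01-01", [{"id": 1}, {"id": 2}])  # rerun: identical result
```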

3. Data Orchestration

What Orchestrators Do

  • Manage dependencies between tasks (DAGs)
  • Schedule and trigger pipelines
  • Handle retries, errors, and alerts
  • Provide monitoring and lineage
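The dependency-management part reduces to topological ordering of a DAG, which Python's standard-library `graphlib` can illustrate (task names are illustrative):

```python
from graphlib import TopologicalSorter

# Map each task to its predecessors; the sorter yields a valid run order.
dag = {"extract": [], "transform": ["extract"], "load": ["transform"]}

order = list(TopologicalSorter(dag).static_order())
print(order)  # ['extract', 'transform', 'load']
```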

Orchestrator Selection Criteria

| Criterion | Questions to Ask |
| --- | --- |
| Scalability | Can it handle 10x the DAGs/tasks? |
| Reusability | Can I create reusable components? |
| Connections | Native integrations with my stack? |
| Support | Active community or paid support? |
| Observability | Can I see failures, lineage, logs? |

Orchestrator Options

| Tool | Type | Best For |
| --- | --- | --- |
| Airflow | Open source | Mature, widely adopted, many connectors |
| Dagster | Open source | Modern, asset-based, good DX |
| Prefect | Open source | Python-native, flexible |
| Lakeflow Jobs | Platform | Databricks-native, integrated |
| dbt | SQL orchestrator | Warehouse transformations |

Design Patterns

| Pattern | Description |
| --- | --- |
| Backfills | Build pipelines that can recreate historical data |
| Event-driven | Trigger on data arrival, not just schedules |
| Conditional logic | Branch based on conditions (if/else) |
| Concurrency | Fan out parallel tasks for performance |
| Parameterized | Accept variables for flexibility |
| Decomposition | Break into micro-DAGs for isolation |

4. Troubleshooting & Observability

Key Metrics

| Metric | What It Measures |
| --- | --- |
| Freshness | Time since last update |
| Volume | Row counts, data sizes |
| Quality | Uniqueness, completeness, validity |

Observability Methods

| Method | Purpose |
| --- | --- |
| Logging | Capture execution details |
| Lineage | Track data flow (column-level is ideal) |
| Anomaly detection | Catch unexpected data patterns |
| Data diffs | See how code changes affect data |
| Assertions | Validate constraints (e.g. price > 0) |
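Assertions in this sense are often just explicit checks run against each batch before it is published. A sketch, with illustrative rules:

```python
def check(rows: list[dict]) -> list[str]:
    """Return a list of failed data-quality constraints (empty = all passed)."""
    failures = []
    ids = [r["id"] for r in rows]
    if len(ids) != len(set(ids)):
        failures.append("uniqueness: duplicate ids")
    if any(r["price"] <= 0 for r in rows):
        failures.append("validity: price must be > 0")
    if not rows:
        failures.append("volume: empty batch")
    return failures

good = [{"id": 1, "price": 9.5}, {"id": 2, "price": 12.0}]
```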

Error Handling

| Technique | Description |
| --- | --- |
| Retry logic | Automatic retries with backoff |
| Conditional handling | Different paths for different errors |
| Pipeline decomposition | Isolate failures |
| Graceful degradation | Partial functionality on failure |
| Alerting | Notify the team (avoid alert fatigue) |
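Retry logic with exponential backoff can be sketched as a small wrapper (delays shortened for the demo; the flaky task is illustrative):

```python
import time

def with_retries(fn, attempts=3, base_delay=0.01):
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise                         # out of retries: surface the error
            time.sleep(base_delay * 2 ** i)   # exponential backoff

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

result = with_retries(flaky)  # succeeds on the third attempt
```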

Incident Metrics

  • N: Number of incidents
  • TTD: Time to detection
  • TTR: Time to resolution
  • Downtime: N × (TTD + TTR)
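In code, the downtime formula is direct (numbers illustrative, in hours):

```python
n_incidents = 4
ttd, ttr = 0.5, 1.5  # time to detection / time to resolution per incident

downtime = n_incidents * (ttd + ttr)  # 8.0 hours
```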

5. Efficiency & Scalability

Resource Allocation

| Concept | Description |
| --- | --- |
| Spot instances | Cheaper but interruptible |
| On-demand | Reliable but costly |
| Pooling | Pre-warm clusters for faster starts |
| Autoscaling | Adjust resources to the workload |
| Serverless | Pay per use (BigQuery, Databricks SQL) |

Optimization Techniques

| Technique | Benefit |
| --- | --- |
| Incremental processing | Only process new/changed data |
| Columnar storage | Faster analytics (Parquet, Delta) |
| Partitioning | Reduce data scanned |
| Materialization | Tables vs. views trade-offs |
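Partition pruning is what makes partitioning pay off: filter on the partition key before reading any data. A toy sketch with an in-memory partition map standing in for a partitioned table:

```python
partitions = {
    "2025-01-01": [{"id": 1}],
    "2025-01-02": [{"id": 2}],
    "2025-01-03": [{"id": 3}],
}

def read(date_from: str):
    # Prune partitions by key first, then read only the survivors.
    scanned = [d for d in partitions if d >= date_from]
    rows = [r for d in scanned for r in partitions[d]]
    return scanned, rows

scanned, rows = read("2025-01-02")  # only 2 of 3 partitions are touched
```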

Scaling Formula

Horizontal scaling = More machines/nodes
Vertical scaling = Bigger machines/nodes

Decision Framework

When designing a pipeline:

  1. Understand the source → Use source evaluation checklist
  2. Determine frequency → Batch vs. streaming based on bounds + requirements
  3. Choose ingestion → Declarative for common, imperative for custom
  4. Design transformations → Apply appropriate patterns
  5. Plan updates → Overwrite, insert, or upsert based on use case
  6. Orchestrate → Select tool, apply design patterns
  7. Observe → Implement logging, lineage, assertions
  8. Optimize → Incrementality, partitioning, right-sizing resources

Anti-Patterns to Avoid

| Anti-Pattern | Why It's Bad |
| --- | --- |
| Mixing orchestration with transformation | The orchestrator should trigger work, not execute it |
| No staging layer | Hard to recover from failures |
| Non-idempotent pipelines | Reruns cause duplicates or errors |
| Ignoring lineage | Impossible to debug data issues |
| Alert fatigue | Too many alerts = ignored alerts |
| GUI-only tooling | No version control, hard to collaborate |

Reference Files

For detailed patterns and extended guidance, consult:

  • patterns/transformation-patterns.md — Detailed transformation pattern catalog with examples
  • patterns/orchestration-patterns.md — Advanced orchestration design patterns and DAG strategies
  • troubleshooting/observability-guide.md — Comprehensive observability setup, monitoring, and incident response
  • checklists/evaluation-checklists.md — Pipeline evaluation checklists for design reviews and production readiness
Repository: github.com/back1ply/LLM-Skills