This skill should be used when the user asks to "design an ETL pipeline", "build data ingestion", "set up data orchestration", "troubleshoot pipeline issues", "optimize data workflows", or mentions ELT, medallion architecture, batch vs streaming, or data transformation patterns.
Install with Tessl CLI:
npx tessl i github:back1ply/LLM-Skills --skill etl-pipeline-design
Source: Distilled from Understanding ETL (Updated Edition) by Matt Palmer, O'Reilly Media, August 2025. ISBN: 979-8-341-66508-8
Applicable when designing data ingestion from APIs, databases, or streaming sources; building transformation pipelines (batch or streaming); selecting orchestration tools or patterns; troubleshooting pipeline failures or data quality issues; or optimizing pipelines for efficiency and scale.
Stage data in three quality layers:
| Layer | Purpose | Example |
|---|---|---|
| Bronze | Raw, unfiltered data directly from sources | API responses, raw logs |
| Silver | Cleaned, filtered, enriched | Removed duplicates, renamed columns |
| Gold | Stakeholder-ready, often aggregated | Reporting tables, ML features |
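The three layers can be sketched with plain Python structures (the field names and rows below are illustrative, not from the book):

```python
# Minimal medallion sketch: bronze (raw) -> silver (cleaned) -> gold (aggregated).

bronze = [  # raw API responses, kept exactly as received
    {"order_id": "A1", "amt": "10.50", "ts": "2025-01-02"},
    {"order_id": "A1", "amt": "10.50", "ts": "2025-01-02"},  # duplicate
    {"order_id": "B2", "amt": "4.00", "ts": "2025-01-02"},
]

def to_silver(rows):
    """Clean: drop duplicates, rename columns, convert types."""
    seen, out = set(), []
    for r in rows:
        if r["order_id"] in seen:
            continue
        seen.add(r["order_id"])
        out.append({"order_id": r["order_id"], "amount": float(r["amt"]), "date": r["ts"]})
    return out

def to_gold(rows):
    """Aggregate: stakeholder-ready daily totals."""
    totals = {}
    for r in rows:
        totals[r["date"]] = totals.get(r["date"], 0.0) + r["amount"]
    return totals

silver = to_silver(bronze)
gold = to_gold(silver)
```

Each layer is derived from the one below it, so a silver or gold table can always be rebuilt from bronze.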
For every data source, answer:
| Question | Why It Matters |
|---|---|
| Who will use this data? | Aligns incentives, prioritizes work |
| How will it be used? | Guides downstream decisions |
| Is it bounded or unbounded? | Determines batch vs. streaming |
| What's the minimum update frequency? | Sets hard limits on freshness |
| What's the expected volume? | Informs storage/compute choices |
| What's the format? (JSON, CSV, API, DB) | Dictates processing requirements |
| What's the quality? | Determines transformation needs |
Is the data bounded (finite)?
├── YES → Batch processing
└── NO (continuous/unbounded) →
    ├── Latency requirement < 1 second? → True streaming (Flink, Kafka Streams)
    └── Latency of 100 ms to minutes acceptable? → Micro-batch (Spark Structured Streaming)

Ingestion methods:
| Type | Examples | Pros | Cons |
|---|---|---|---|
| Legacy declarative | Talend, Pentaho | Robust connectors | Outdated; not aligned with the modern data stack (MDS) |
| Modern declarative | Fivetran, Airbyte, Stitch | Low maintenance, connectors | Vendor lock-in, cost |
| Native/Platform | Lakeflow Connect, Glue | Integrated, managed | Platform-specific |
| Imperative | Custom scripts, Singer taps | Full control | High build/maintain cost |
| Hybrid | Mix of above | Flexibility | Complexity |
Recommendation: Use declarative for common sources, imperative for edge cases.
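For an edge-case source with no connector, imperative ingestion can be a small script that lands raw records untouched into the bronze layer. A minimal sketch (the `fetch` callable stands in for a real API client; the paths and names are illustrative):

```python
import json
import tempfile
from pathlib import Path

def ingest(fetch, landing_dir: Path, source_name: str, batch_id: str) -> Path:
    """Imperative ingestion: pull records and land them raw (bronze), unmodified."""
    records = fetch()  # in a real pipeline: a paginated API call, DB query, etc.
    out = landing_dir / source_name / f"{batch_id}.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(records))
    return out

# Demo with a stubbed fetch instead of a live API
fake_fetch = lambda: [{"id": 1, "status": "shipped"}]
landing = Path(tempfile.mkdtemp())
path = ingest(fake_fetch, landing, "orders_api", "2025-01-02")
```

Keeping transformation out of this step preserves a faithful bronze copy to reprocess later.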
Common transformation patterns:
| Pattern | Description | Example |
|---|---|---|
| Enrichment | Add data from other sources | Join order codes → readable names |
| Joining | Combine datasets on common keys | Sales + Users → add country |
| Filtering | Select only needed records | WHERE date >= '2025-01-01' |
| Structuring | Convert formats | JSON → tabular Parquet |
| Conversion | Change data types | String → datetime |
| Aggregation | Summarize data | Daily totals from hourly data |
| Anonymization | Mask PII | Hash emails |
| Splitting | Break columns apart | email → prefix + domain |
| Deduplication | Remove duplicates | Keep earliest by UUID |
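Several of these patterns compose naturally on a single record; a sketch with illustrative field names:

```python
import hashlib
from datetime import datetime

row = {"email": "ada@example.com", "signup": "2025-01-03", "price": "19.99"}

# Conversion: strings -> typed values
clean = {
    "signup": datetime.strptime(row["signup"], "%Y-%m-%d").date(),
    "price": float(row["price"]),
}

# Splitting: email -> prefix + domain
local, domain = row["email"].split("@", 1)
clean["email_domain"] = domain

# Anonymization: store a hash of the PII, not the PII itself
clean["email_hash"] = hashlib.sha256(row["email"].encode()).hexdigest()

# Filtering happens at the set level
rows = [clean]
recent = [r for r in rows if r["signup"] >= datetime(2025, 1, 1).date()]
```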
Patterns for loading data into target tables:
| Pattern | When to Use | SQL Concept |
|---|---|---|
| Overwrite | Small datasets, simple refreshes | TRUNCATE + INSERT |
| Insert | Append-only data (logs, transactions) | INSERT INTO |
| Upsert | CDC, deduplication, SCD | MERGE |
| Delete | Soft (status='deleted') or hard (remove row) | UPDATE or DELETE |
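Upsert semantics can be demonstrated end to end with Python's stdlib sqlite3 driver; SQLite spells MERGE as `INSERT ... ON CONFLICT` (the table and data here are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, firstName TEXT, lastName TEXT)")
con.execute("INSERT INTO people VALUES (1, 'Ada', 'Lovelace')")

updates = [(1, "Ada", "King"), (2, "Grace", "Hopper")]

# Upsert: update when the key matches, insert when it does not
con.executemany(
    """INSERT INTO people (id, firstName, lastName) VALUES (?, ?, ?)
       ON CONFLICT(id) DO UPDATE SET
         firstName = excluded.firstName,
         lastName  = excluded.lastName""",
    updates,
)
rows = con.execute("SELECT id, firstName, lastName FROM people ORDER BY id").fetchall()
```

Because the upsert is keyed, rerunning it with the same updates leaves the table unchanged, which keeps the load idempotent.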
```sql
MERGE INTO people10m
USING people10mupdates
ON people10m.id = people10mupdates.id
WHEN MATCHED THEN UPDATE SET
  firstName = people10mupdates.firstName,
  lastName = people10mupdates.lastName
WHEN NOT MATCHED THEN INSERT (id, firstName, lastName)
  VALUES (people10mupdates.id, people10mupdates.firstName, people10mupdates.lastName)
```

When evaluating an orchestration tool, ask:
| Criterion | Questions to Ask |
|---|---|
| Scalability | Can it handle 10x DAGs/tasks? |
| Reusability | Can I create reusable components? |
| Connections | Native integrations with my stack? |
| Support | Active community or paid support? |
| Observability | Can I see failures, lineage, logs? |
Common orchestration tools:
| Tool | Type | Best For |
|---|---|---|
| Airflow | Open source | Mature, widely adopted, many connectors |
| Dagster | Open source | Modern, asset-based, good DX |
| Prefect | Open source | Python-native, flexible |
| Lakeflow Jobs | Platform | Databricks-native, integrated |
| dbt | SQL orchestrator | Warehouse transformations |
| Pattern | Description |
|---|---|
| Backfills | Build pipelines that can recreate historical data |
| Event-driven | Trigger on data arrival, not just schedules |
| Conditional logic | Branch based on conditions (if/else) |
| Concurrency | Fan out parallel tasks for performance |
| Parameterized | Accept variables for flexibility |
| Decomposition | Break into micro-DAGs for isolation |
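The backfill and parameterization patterns work together: a task that accepts its run date as a parameter can be replayed over any historical range. A framework-free sketch (names are illustrative; real orchestrators like Airflow pass the run date into the task for you):

```python
from datetime import date, timedelta

def run_partition(run_date: date) -> str:
    """A parameterized task: processes exactly one day's partition."""
    return f"processed {run_date.isoformat()}"

def backfill(start: date, end: date) -> list[str]:
    """Backfill: replay the same parameterized task over a historical range."""
    results = []
    d = start
    while d <= end:
        results.append(run_partition(d))
        d += timedelta(days=1)
    return results
```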
Monitor these pipeline metrics:
| Metric | What It Measures |
|---|---|
| Freshness | Time since last update |
| Volume | Row counts, data sizes |
| Quality | Uniqueness, completeness, validity |
Observability methods:
| Method | Purpose |
|---|---|
| Logging | Capture execution details |
| Lineage | Track data flow (column-level ideal) |
| Anomaly detection | Catch unexpected data patterns |
| Data diffs | See what code changes affect data |
| Assertions | Validate constraints (price > 0) |
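Assertions like "price > 0" can run as a small check function before data is promoted to the next layer. A sketch with illustrative field names:

```python
def check(rows: list[dict]) -> list[str]:
    """Assertion-style checks: uniqueness, completeness, validity."""
    failures = []
    ids = [r["id"] for r in rows]
    if len(ids) != len(set(ids)):
        failures.append("uniqueness: duplicate ids")
    if any(r.get("email") in (None, "") for r in rows):
        failures.append("completeness: missing email")
    if any(r["price"] <= 0 for r in rows):
        failures.append("validity: price must be > 0")
    return failures
```

Returning the failures (rather than raising on the first) lets a pipeline log or alert on all broken constraints at once.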
Error-handling techniques:
| Technique | Description |
|---|---|
| Retry logic | Automatic retries with backoff |
| Conditional handling | Different paths for different errors |
| Pipeline decomposition | Isolate failures |
| Graceful degradation | Partial functionality on failure |
| Alerting | Notify team (avoid alert fatigue) |
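Retry with backoff is straightforward to hand-roll when the orchestrator doesn't provide it. A sketch (the delay values are illustrative; `sleep` is injectable so the backoff can be tested without waiting):

```python
import time

def retry(fn, attempts: int = 3, base_delay: float = 0.01, sleep=time.sleep):
    """Retry with exponential backoff; re-raise after the final attempt."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            sleep(base_delay * 2 ** i)  # delays double: 0.01s, 0.02s, 0.04s, ...
```

Re-raising on the last attempt keeps the failure visible to the orchestrator instead of silently swallowing it.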
Compute and cost options:
| Concept | Description |
|---|---|
| Spot instances | Cheaper but interruptible |
| On-demand | Reliable but costly |
| Pooling | Pre-warm clusters for faster starts |
| Autoscaling | Adjust resources to workload |
| Serverless | Pay per use (BigQuery, Databricks SQL) |
Optimization techniques:
| Technique | Benefit |
|---|---|
| Incremental processing | Only process new/changed data |
| Columnar storage | Faster analytics (Parquet, Delta) |
| Partitioning | Reduce data scanned |
| Materialization | Tables vs. views trade-offs |
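Incremental processing is usually implemented with a high-watermark: persist the largest timestamp seen so far and process only rows beyond it. A minimal sketch (field names are illustrative):

```python
def incremental_load(rows: list[dict], watermark: str) -> tuple[list[dict], str]:
    """Process only rows newer than the stored high-watermark.

    Returns the new rows plus the advanced watermark to persist for the next run.
    """
    new = [r for r in rows if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in new), default=watermark)
    return new, new_watermark
```

ISO-8601 timestamps compare correctly as strings, which is why the watermark can be kept as text.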
Horizontal scaling = more machines/nodes
Vertical scaling = bigger machines/nodes

When designing a pipeline, avoid these anti-patterns:
| Anti-Pattern | Why It's Bad |
|---|---|
| Mixing orchestration with transformation | Orchestrator should trigger, not execute |
| No staging layer | Hard to recover from failures |
| Non-idempotent pipelines | Reruns cause duplicates or errors |
| Ignoring lineage | Impossible to debug data issues |
| Alert fatigue | Too many alerts = ignored alerts |
| GUI-only tooling | No version control, hard to collaborate |
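The idempotency anti-pattern is worth a concrete contrast: an append-style write duplicates data on rerun, while an overwrite keyed by partition does not. A sketch using a dict as a stand-in for a partitioned table:

```python
def idempotent_write(store: dict, partition: str, rows: list) -> None:
    """Idempotent load: overwrite the whole partition, so reruns converge."""
    store[partition] = list(rows)

store = {}
idempotent_write(store, "2025-01-02", [{"id": 1}])
idempotent_write(store, "2025-01-02", [{"id": 1}])  # rerun: same final state
```

A blind `store[partition].extend(rows)` in the same scenario would leave two copies after the rerun.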
For detailed patterns and extended guidance, consult:
- patterns/transformation-patterns.md — Detailed transformation pattern catalog with examples
- patterns/orchestration-patterns.md — Advanced orchestration design patterns and DAG strategies
- troubleshooting/observability-guide.md — Comprehensive observability setup, monitoring, and incident response
- checklists/evaluation-checklists.md — Pipeline evaluation checklists for design reviews and production readiness