Skill Evaluation Tool - Developer Guide

This document covers conventions and patterns for working on the skill-eval CLI tool.

Architecture Overview

src/skill_eval/
├── cli.py       # Typer CLI commands (run, grade, report, review)
├── models.py    # Data models: Scenario, SkillSet, load_scenario()
├── runner.py    # Execution: Runner class, RunResult, RunTask
├── grader.py    # Auto-grading with Claude CLI
├── reporter.py  # Report generation from grades
├── selector.py  # Interactive TUI selectors for runs/scenarios
└── logging.py   # Loguru configuration with context support

Data flow: cli.py loads scenarios via models.py, executes them via runner.py, grades the results via grader.py, and generates reports via reporter.py.
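A minimal sketch of that flow (call signatures are assumptions; grader.py and reporter.py entry points are not shown in this guide, so the commented helper names are placeholders, not the real API):

from skill_eval.models import load_scenario   # models.py
from skill_eval.runner import Runner          # runner.py

scenario = load_scenario(scenario_path)       # assumed signature
result = Runner().run_scenario(scenario)      # assumed signature
# grade = grade_result(result)                # grader.py (assumed helper name)
# report = build_report(grade)                # reporter.py (assumed helper name)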

CLI Framework: Typer

We use Typer for the CLI.
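A minimal skeleton of how the app is wired up (a sketch assuming the conventional Typer layout in cli.py):

import typer

app = typer.Typer()

@app.command()
def run() -> None:
    """Placeholder body; the real commands are listed in the architecture overview."""
    typer.echo("running")

if __name__ == "__main__":
    app()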

Output

Use the appropriate output method based on context:

User-facing CLI output (command results, prompts): Use typer.echo()

typer.echo(f"Run directory: {run_dir}")
typer.echo("Error: file not found", err=True)

Progress logging (during execution): Use logger from logging.py

from skill_eval.logging import logger

logger.info("Starting scenario")
logger.debug("Tool called: Read")
logger.warning("Timeout reached")
logger.success("Completed")

# With context (for parallel runs)
ctx_logger = logger.bind(scenario="my-scenario", skill_set="with-skill")
ctx_logger.info("Starting")  # Shows: [T0/my-scenario/with-skill] Starting

Never use print() for output.

Adding Commands

New commands go in cli.py:

@app.command()
def mycommand(
    arg: str = typer.Argument(..., help="Required argument"),
    flag: bool = typer.Option(False, "--flag", "-f", help="Optional flag"),
) -> None:
    """Command description shown in --help."""
    typer.echo(f"Running with {arg}")

Data Models

Dataclasses (models.py, runner.py)

When modifying CLI commands that work with scenarios or skill sets, check if the underlying dataclasses need updates:

In models.py:

  • Grade - grading result (success, score, tool_usage, notes, etc.)
  • SkillSet - skills, mcp_servers, allowed_tools
  • Scenario - name, path, prompt, skill_sets, description

In runner.py:

  • RunResult - scenario results with output, success, tools_used, skills_invoked, etc.
  • RunTask - task definition for parallel execution (scenario, skill_set, run_dir)

Use dataclasses.asdict() to convert dataclasses to dicts for YAML serialization.
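For example, writing a result to YAML might look like this (the result and path variables are illustrative):

import dataclasses
import yaml

# Convert the dataclass to a plain dict, then dump it as YAML.
data = dataclasses.asdict(result)
with open(run_dir / "metadata.yaml", "w") as f:
    yaml.safe_dump(data, f, sort_keys=False)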

When modifying grading:

  1. Update Grade dataclass in models.py if adding new fields
  2. Update GRADING_PROMPT_TEMPLATE in grader.py if changing what Claude evaluates
  3. Update parse_grade_response() to extract new fields into Grade
  4. Update reporter.py to display new fields
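A hedged sketch of steps 1 and 3 for a hypothetical new field (JSON parsing is an assumption; the real response format handled by parse_grade_response() may differ, and existing Grade fields are simplified here):

from dataclasses import dataclass
import json

@dataclass
class Grade:
    success: bool
    score: float
    rubric_notes: str = ""   # hypothetical new field; other existing fields elided

def parse_grade_response(response_text: str) -> Grade:
    # Assumes the grader's reply is JSON; adjust to the real format.
    parsed = json.loads(response_text)
    return Grade(
        success=parsed["success"],
        score=parsed["score"],
        rubric_notes=parsed.get("rubric_notes", ""),
    )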

Modifying Features - Checklist

Adding a new CLI subcommand

  1. Add command function in cli.py with @app.command()
  2. Use typer.echo() for all output
  3. Add tests in tests/
  4. Update README.md usage section
  5. Run uv run ty check src/

Adding a new field to skill-sets.yaml

  1. Update SkillSet dataclass in models.py
  2. Update load_scenario() to parse the new field
  3. Update Runner.run_scenario() if it affects execution
  4. Add tests for the new field
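A sketch of steps 1-2 with a hypothetical timeout_seconds field (field types and the entry dict are assumptions, not the real parsing code):

from dataclasses import dataclass, field

@dataclass
class SkillSet:
    skills: list[str] = field(default_factory=list)
    mcp_servers: list[str] = field(default_factory=list)
    allowed_tools: list[str] = field(default_factory=list)
    timeout_seconds: int | None = None   # hypothetical new field

# In load_scenario(), parse the new key with a default when it is absent:
skill_set = SkillSet(
    skills=entry.get("skills", []),
    mcp_servers=entry.get("mcp_servers", []),
    allowed_tools=entry.get("allowed_tools", []),
    timeout_seconds=entry.get("timeout_seconds"),
)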

Adding new run output/metadata fields

  1. Update RunResult dataclass in runner.py
  2. Update _parse_json_output() to extract new data from Claude's output
  3. Update metadata.yaml writing in run_scenario()
  4. Update grader if it should evaluate the new field
  5. Update reporter if it should display the new field
  6. Add tests

Modifying grading criteria

  1. Update GRADING_PROMPT_TEMPLATE in grader.py
  2. Update parse_grade_response() to handle new fields
  3. Update init_grades_file() for manual grading template
  4. Update reporter.py to show new fields in reports
  5. Add tests

Parallel Execution

We use concurrent.futures.ThreadPoolExecutor for parallel runs:

from concurrent.futures import ThreadPoolExecutor, as_completed

# fn is the per-task callable and tasks is the list of RunTask objects (placeholders here)
with ThreadPoolExecutor(max_workers=max_workers) as executor:
    future_to_task = {executor.submit(fn, task): task for task in tasks}
    for future in as_completed(future_to_task):
        result = future.result()  # re-raises any exception from the worker

See Runner.run_parallel() in runner.py for the full implementation.

Type Checking with ty

Run the ty type checker before committing:

uv run ty check src/

Fix any type errors. Common issues:

  • Mixed dict types need explicit annotations
  • Optional fields need | None types
  • Use list[str] not List[str] (Python 3.11+)
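For example (annotations are illustrative):

# Explicit annotation for a dict with mixed value types
metadata: dict[str, str | int] = {"scenario": "my-scenario", "runs": 3}

# Optional field spelled with | None, and built-in generics instead of typing.List
description: str | None = None
tools_used: list[str] = []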

Testing

Tests live in tests/ and use pytest.

Running Tests

uv run pytest                    # all tests
uv run pytest tests/test_cli.py  # specific file
uv run pytest -k "test_grade"    # by name pattern

Test Requirements

Every new feature needs tests. This includes:

  • New CLI commands
  • New options/flags on existing commands
  • Changes to data models
  • New grading/reporting logic
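A minimal CLI test sketch using Typer's built-in test runner (the assertions are illustrative):

from typer.testing import CliRunner

from skill_eval.cli import app

runner = CliRunner()

def test_report_help():
    # Invoke the command in-process and check the exit code and output.
    result = runner.invoke(app, ["report", "--help"])
    assert result.exit_code == 0
    assert "report" in result.output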

Dependencies

Key dependencies in pyproject.toml:

  • typer - CLI framework
  • pyyaml - YAML parsing
  • claude-code-transcripts - HTML transcript generation
  • loguru - Logging with context support
  • textual - TUI for interactive selection

Dev dependencies:

  • pytest - testing
  • ty - type checking
