dbt-labs/dbt-agent-skills

A curated collection of Agent Skills for working with dbt, to help AI agents understand and execute dbt workflows more effectively.

Skill Evaluation Tool - Developer Guide

This document covers conventions and patterns for working on the skill-eval CLI tool.

Architecture Overview

src/skill_eval/
├── cli.py       # Typer CLI commands (run, grade, report, review)
├── models.py    # Data models: Scenario, SkillSet, load_scenario()
├── runner.py    # Execution: Runner class, RunResult, RunTask
├── grader.py    # Auto-grading with Claude CLI
├── reporter.py  # Report generation from grades
├── selector.py  # Interactive TUI selectors for runs/scenarios
└── logging.py   # Loguru configuration with context support

Data flow: cli.py loads scenarios via models.py, executes them via runner.py, grades the results via grader.py, and generates reports via reporter.py.
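
A minimal sketch of that flow, assuming hypothetical top-level helpers for grading and reporting (the real entry points in grader.py and reporter.py may be named differently):

from skill_eval.models import load_scenario
from skill_eval.runner import Runner
from skill_eval.grader import grade_run       # hypothetical name
from skill_eval.reporter import build_report  # hypothetical name

scenario = load_scenario("evals/scenarios/my-scenario")  # illustrative path
result = Runner().run_scenario(scenario)  # Runner construction args are an assumption
grade = grade_run(result)
report = build_report([grade])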

CLI Framework: Typer

We use Typer for the CLI.

Output

Use the appropriate output method based on context:

User-facing CLI output (command results, prompts): Use typer.echo()

typer.echo(f"Run directory: {run_dir}")
typer.echo("Error: file not found", err=True)

Progress logging (during execution): Use logger from logging.py

from skill_eval.logging import logger

logger.info("Starting scenario")
logger.debug("Tool called: Read")
logger.warning("Timeout reached")
logger.success("Completed")

# With context (for parallel runs)
ctx_logger = logger.bind(scenario="my-scenario", skill_set="with-skill")
ctx_logger.info("Starting")  # Shows: [T0/my-scenario/with-skill] Starting

Never use print() for output.

Adding Commands

New commands go in cli.py:

@app.command()
def mycommand(
    arg: str = typer.Argument(..., help="Required argument"),
    flag: bool = typer.Option(False, "--flag", "-f", help="Optional flag"),
) -> None:
    """Command description shown in --help."""
    typer.echo(f"Running with {arg}")

Data Models

Dataclasses (models.py, runner.py)

When modifying CLI commands that work with scenarios or skill sets, check if the underlying dataclasses need updates:

In models.py:

  • Grade - grading result (success, score, tool_usage, notes, etc.)
  • SkillSet - skills, mcp_servers, allowed_tools
  • Scenario - name, path, prompt, skill_sets, description

In runner.py:

  • RunResult - scenario results with output, success, tools_used, skills_invoked, etc.
  • RunTask - task definition for parallel execution (scenario, skill_set, run_dir)

Use dataclasses.asdict() to convert dataclasses to dicts for YAML serialization.
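
For example, serializing a skill set to YAML might look like this (the constructor call assumes the three fields listed above and no other required arguments; check the real dataclass before copying):

import dataclasses
import yaml

from skill_eval.models import SkillSet

# Illustrative values only
skill_set = SkillSet(skills=["git-workflow"], mcp_servers=[], allowed_tools=["Read", "Bash"])

with open("skill-set.yaml", "w") as f:
    yaml.safe_dump(dataclasses.asdict(skill_set), f)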

When modifying grading:

  1. Update Grade dataclass in models.py if adding new fields
  2. Update GRADING_PROMPT_TEMPLATE in grader.py if changing what Claude evaluates
  3. Update parse_grade_response() to extract new fields into Grade
  4. Update reporter.py to display new fields
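
As a sketch of steps 1 and 3, adding a hypothetical confidence field might look like this (the existing Grade fields and the parse_grade_response() signature are assumptions; the real function may take raw response text rather than a parsed dict):

from dataclasses import dataclass

# models.py -- step 1: add the field with a default so existing grades still load
@dataclass
class Grade:
    success: bool
    score: float
    notes: str = ""
    confidence: float | None = None  # hypothetical new field (other fields omitted)

# grader.py -- step 3: extract it from Claude's response
def parse_grade_response(parsed: dict) -> Grade:
    return Grade(
        success=parsed["success"],
        score=parsed["score"],
        notes=parsed.get("notes", ""),
        confidence=parsed.get("confidence"),
    )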

Modifying Features - Checklist

Adding a new CLI subcommand

  1. Add command function in cli.py with @app.command()
  2. Use typer.echo() for all output
  3. Add tests in tests/
  4. Update README.md usage section
  5. Run uv run ty check src/
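
For step 3, Typer commands can be tested with typer.testing.CliRunner; a minimal sketch for the mycommand example above, assuming the Typer app object in cli.py is named app:

from typer.testing import CliRunner

from skill_eval.cli import app

runner = CliRunner()

def test_mycommand_echoes_argument():
    result = runner.invoke(app, ["mycommand", "value"])
    assert result.exit_code == 0
    assert "Running with value" in result.output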

Adding a new field to skill-sets.yaml

  1. Update SkillSet dataclass in models.py
  2. Update load_scenario() to parse the new field
  3. Update Runner.run_scenario() if it affects execution
  4. Add tests for the new field
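
A sketch of steps 1 and 2, using a hypothetical env_vars field (field types and the parsing helper are illustrative; load_scenario()'s actual logic may differ):

from dataclasses import dataclass, field

# models.py -- step 1: add the field with a default so existing YAML files still load
@dataclass
class SkillSet:
    skills: list[str]
    mcp_servers: list[str]
    allowed_tools: list[str]
    env_vars: dict[str, str] = field(default_factory=dict)  # hypothetical new field

# models.py -- step 2: in load_scenario(), read it from the parsed YAML mapping
# (build_skill_set() and `raw` are stand-ins for whatever the real code uses)
def build_skill_set(raw: dict) -> SkillSet:
    return SkillSet(
        skills=raw.get("skills", []),
        mcp_servers=raw.get("mcp_servers", []),
        allowed_tools=raw.get("allowed_tools", []),
        env_vars=raw.get("env_vars", {}),
    )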

Adding new run output/metadata fields

  1. Update RunResult dataclass in runner.py
  2. Update _parse_json_output() to extract new data from Claude's output
  3. Update metadata.yaml writing in run_scenario()
  4. Update grader if it should evaluate the new field
  5. Update reporter if it should display the new field
  6. Add tests

Modifying grading criteria

  1. Update GRADING_PROMPT_TEMPLATE in grader.py
  2. Update parse_grade_response() to handle new fields
  3. Update init_grades_file() for manual grading template
  4. Update reporter.py to show new fields in reports
  5. Add tests

Parallel Execution

We use concurrent.futures.ThreadPoolExecutor for parallel runs:

from concurrent.futures import ThreadPoolExecutor, as_completed

with ThreadPoolExecutor(max_workers=max_workers) as executor:
    future_to_task = {executor.submit(fn, task): task for task in tasks}
    for future in as_completed(future_to_task):
        result = future.result()

See Runner.run_parallel() in runner.py for the full implementation.
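
Note that an exception raised inside a task only surfaces when future.result() is called. A hedged sketch of handling that with the bound logger from the Output section (the bind values assume string identifiers on RunTask; the actual error handling in Runner.run_parallel() may differ):

from concurrent.futures import ThreadPoolExecutor, as_completed

from skill_eval.logging import logger

with ThreadPoolExecutor(max_workers=max_workers) as executor:
    future_to_task = {executor.submit(fn, task): task for task in tasks}
    for future in as_completed(future_to_task):
        task = future_to_task[future]
        ctx_logger = logger.bind(scenario=task.scenario, skill_set=task.skill_set)
        try:
            result = future.result()  # re-raises any exception from the worker thread
            ctx_logger.success("Completed")
        except Exception:
            ctx_logger.exception("Task failed")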

Type Checking with ty

Run the ty type checker before committing:

uv run ty check src/

Fix any type errors. Common issues:

  • Mixed dict types need explicit annotations
  • Optional fields need | None types
  • Use list[str] not List[str] (Python 3.11+)
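
Illustrative fixes for each (the names are made up for the example):

# Mixed dict types: annotate explicitly instead of relying on inference
metadata: dict[str, str | int] = {"scenario": "my-scenario", "runs": 3}

# Optional fields: use | None with a default
error: str | None = None

# Built-in generics, not typing.List
tools_used: list[str] = []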

Testing

Tests live in tests/ and use pytest.

Running Tests

uv run pytest                    # all tests
uv run pytest tests/test_cli.py  # specific file
uv run pytest -k "test_grade"    # by name pattern

Test Requirements

Every new feature needs tests. This includes:

  • New CLI commands
  • New options/flags on existing commands
  • Changes to data models
  • New grading/reporting logic

Dependencies

Key dependencies in pyproject.toml:

  • typer - CLI framework
  • pyyaml - YAML parsing
  • claude-code-transcripts - HTML transcript generation
  • loguru - Logging with context support
  • textual - TUI for interactive selection

Dev dependencies:

  • pytest - testing
  • ty - type checking
