dbt-labs/dbt-agent-skills

A curated collection of Agent Skills for working with dbt, to help AI agents understand and execute dbt workflows more effectively.

Skill Evaluation Tool - Developer Guide

This document covers conventions and patterns for working on the skill-eval CLI tool.

Architecture Overview

src/skill_eval/
├── cli.py       # Typer CLI commands (run, grade, report, review)
├── models.py    # Data models: Scenario, SkillSet, load_scenario()
├── runner.py    # Execution: Runner class, RunResult, RunTask
├── grader.py    # Auto-grading with Claude CLI
├── reporter.py  # Report generation from grades
├── selector.py  # Interactive TUI selectors for runs/scenarios
└── logging.py   # Loguru configuration with context support

Data flow: cli.py loads scenarios via models.py, executes them via runner.py, grades the results via grader.py, and generates reports via reporter.py.
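
A minimal sketch of that flow, assuming hypothetical top-level helpers for grading and reporting (the real entry points in grader.py and reporter.py may be named differently):

from skill_eval.models import load_scenario
from skill_eval.runner import Runner
from skill_eval.grader import grade_run       # hypothetical name
from skill_eval.reporter import build_report  # hypothetical name

scenario = load_scenario("evals/scenarios/my-scenario")  # illustrative path
result = Runner().run_scenario(scenario)  # Runner construction args are an assumption
grade = grade_run(result)
report = build_report([grade])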

CLI Framework: Typer

We use Typer for the CLI.

Output

Use the appropriate output method based on context:

User-facing CLI output (command results, prompts): Use typer.echo()

typer.echo(f"Run directory: {run_dir}")
typer.echo("Error: file not found", err=True)

Progress logging (during execution): Use logger from logging.py

from skill_eval.logging import logger

logger.info("Starting scenario")
logger.debug("Tool called: Read")
logger.warning("Timeout reached")
logger.success("Completed")

# With context (for parallel runs)
ctx_logger = logger.bind(scenario="my-scenario", skill_set="with-skill")
ctx_logger.info("Starting")  # Shows: [T0/my-scenario/with-skill] Starting

Never use print() for output.

Adding Commands

New commands go in cli.py:

@app.command()
def mycommand(
    arg: str = typer.Argument(..., help="Required argument"),
    flag: bool = typer.Option(False, "--flag", "-f", help="Optional flag"),
) -> None:
    """Command description shown in --help."""
    typer.echo(f"Running with {arg}")

Data Models

Dataclasses (models.py, runner.py)

When modifying CLI commands that work with scenarios or skill sets, check if the underlying dataclasses need updates:

In models.py:

  • Grade - grading result (success, score, tool_usage, notes, etc.)
  • SkillSet - skills, mcp_servers, allowed_tools
  • Scenario - name, path, prompt, skill_sets, description

In runner.py:

  • RunResult - scenario results with output, success, tools_used, skills_invoked, etc.
  • RunTask - task definition for parallel execution (scenario, skill_set, run_dir)

Use dataclasses.asdict() to convert dataclasses to dicts for YAML serialization.
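
For example, serializing a skill set to YAML might look like this (the constructor call assumes the three fields listed above and no other required arguments; check the real dataclass before copying):

import dataclasses
import yaml

from skill_eval.models import SkillSet

# Illustrative values only
skill_set = SkillSet(skills=["git-workflow"], mcp_servers=[], allowed_tools=["Read", "Bash"])

with open("skill-set.yaml", "w") as f:
    yaml.safe_dump(dataclasses.asdict(skill_set), f)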

When modifying grading:

  1. Update Grade dataclass in models.py if adding new fields
  2. Update GRADING_PROMPT_TEMPLATE in grader.py if changing what Claude evaluates
  3. Update parse_grade_response() to extract new fields into Grade
  4. Update reporter.py to display new fields
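
As a sketch of steps 1 and 3, adding a hypothetical confidence field might look like this (the existing Grade fields and the parse_grade_response() signature are assumptions; the real function may take raw response text rather than a parsed dict):

from dataclasses import dataclass

# models.py -- step 1: add the field with a default so existing grades still load
@dataclass
class Grade:
    success: bool
    score: float
    notes: str = ""
    confidence: float | None = None  # hypothetical new field (other fields omitted)

# grader.py -- step 3: extract it from Claude's response
def parse_grade_response(parsed: dict) -> Grade:
    return Grade(
        success=parsed["success"],
        score=parsed["score"],
        notes=parsed.get("notes", ""),
        confidence=parsed.get("confidence"),
    )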

Modifying Features - Checklist

Adding a new CLI subcommand

  1. Add command function in cli.py with @app.command()
  2. Use typer.echo() for all output
  3. Add tests in tests/
  4. Update README.md usage section
  5. Run uv run ty check src/
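
For step 3, Typer commands can be tested with typer.testing.CliRunner; a minimal sketch for the mycommand example above, assuming the Typer app object in cli.py is named app:

from typer.testing import CliRunner

from skill_eval.cli import app

runner = CliRunner()

def test_mycommand_echoes_argument():
    result = runner.invoke(app, ["mycommand", "value"])
    assert result.exit_code == 0
    assert "Running with value" in result.output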

Adding a new field to skill-sets.yaml

  1. Update SkillSet dataclass in models.py
  2. Update load_scenario() to parse the new field
  3. Update Runner.run_scenario() if it affects execution
  4. Add tests for the new field
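
A sketch of steps 1 and 2, using a hypothetical env_vars field (field types and the parsing helper are illustrative; load_scenario()'s actual logic may differ):

from dataclasses import dataclass, field

# models.py -- step 1: add the field with a default so existing YAML files still load
@dataclass
class SkillSet:
    skills: list[str]
    mcp_servers: list[str]
    allowed_tools: list[str]
    env_vars: dict[str, str] = field(default_factory=dict)  # hypothetical new field

# models.py -- step 2: in load_scenario(), read it from the parsed YAML mapping
# (build_skill_set() and `raw` are stand-ins for whatever the real code uses)
def build_skill_set(raw: dict) -> SkillSet:
    return SkillSet(
        skills=raw.get("skills", []),
        mcp_servers=raw.get("mcp_servers", []),
        allowed_tools=raw.get("allowed_tools", []),
        env_vars=raw.get("env_vars", {}),
    )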

Adding new run output/metadata fields

  1. Update RunResult dataclass in runner.py
  2. Update _parse_json_output() to extract new data from Claude's output
  3. Update metadata.yaml writing in run_scenario()
  4. Update grader if it should evaluate the new field
  5. Update reporter if it should display the new field
  6. Add tests

Modifying grading criteria

  1. Update GRADING_PROMPT_TEMPLATE in grader.py
  2. Update parse_grade_response() to handle new fields
  3. Update init_grades_file() for manual grading template
  4. Update reporter.py to show new fields in reports
  5. Add tests

Parallel Execution

We use concurrent.futures.ThreadPoolExecutor for parallel runs:

from concurrent.futures import ThreadPoolExecutor, as_completed

with ThreadPoolExecutor(max_workers=max_workers) as executor:
    future_to_task = {executor.submit(fn, task): task for task in tasks}
    for future in as_completed(future_to_task):
        result = future.result()

See Runner.run_parallel() in runner.py for the full implementation.
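
Note that an exception raised inside a task only surfaces when future.result() is called. A hedged sketch of handling that with the bound logger from the Output section (the bind values assume string identifiers on RunTask; the actual error handling in Runner.run_parallel() may differ):

from concurrent.futures import ThreadPoolExecutor, as_completed

from skill_eval.logging import logger

with ThreadPoolExecutor(max_workers=max_workers) as executor:
    future_to_task = {executor.submit(fn, task): task for task in tasks}
    for future in as_completed(future_to_task):
        task = future_to_task[future]
        ctx_logger = logger.bind(scenario=task.scenario, skill_set=task.skill_set)
        try:
            result = future.result()  # re-raises any exception from the worker thread
            ctx_logger.success("Completed")
        except Exception:
            ctx_logger.exception("Task failed")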

Type Checking with ty

Run the ty type checker before committing:

uv run ty check src/

Fix any type errors. Common issues:

  • Mixed dict types need explicit annotations
  • Optional fields need | None types
  • Use list[str] not List[str] (Python 3.11+)
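
Illustrative fixes for each (the names are made up for the example):

# Mixed dict types: annotate explicitly instead of relying on inference
metadata: dict[str, str | int] = {"scenario": "my-scenario", "runs": 3}

# Optional fields: use | None with a default
error: str | None = None

# Built-in generics, not typing.List
tools_used: list[str] = []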

Testing

Tests live in tests/ and use pytest.

Running Tests

uv run pytest                    # all tests
uv run pytest tests/test_cli.py  # specific file
uv run pytest -k "test_grade"    # by name pattern

Test Requirements

Every new feature needs tests. This includes:

  • New CLI commands
  • New options/flags on existing commands
  • Changes to data models
  • New grading/reporting logic

Dependencies

Key dependencies in pyproject.toml:

  • typer - CLI framework
  • pyyaml - YAML parsing
  • claude-code-transcripts - HTML transcript generation
  • loguru - Logging with context support
  • textual - TUI for interactive selection

Dev dependencies:

  • pytest - testing
  • ty - type checking
