Generate eval scenarios from repo commits, configure multi-agent runs, execute baseline + with-context evals, and compare results — the full setup pipeline before improvement begins
Overall score: 90%
Criterion: Does it follow best practices? (validation for skill structure)
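The setup pipeline named at the top breaks into four stages: generate scenarios from commits, run a baseline eval, run a with-context eval, and compare. As a minimal sketch of how those stages compose — every type, function name, and score below is an illustrative assumption, not Tessl's actual API; the real runner would dispatch agents and grade their output against the source commits:

```typescript
// Minimal sketch under assumed names — NOT Tessl's real API.
// Stubs stand in for the scenario generator and the agent eval runner.

interface Scenario {
  id: string;
  prompt: string; // a task reconstructed from a real commit in the repo
}

interface RunResult {
  scenarioId: string;
  score: number; // 0..1, graded against what the commit actually changed
}

// Stage 1 (stub): derive eval scenarios from recent commits.
async function generateScenarios(repo: string, count: number): Promise<Scenario[]> {
  return Array.from({ length: count }, (_, i) => ({
    id: `${repo}#${i}`,
    prompt: `Reimplement the change made in commit ${i}`,
  }));
}

// Stages 2-3 (stub): run every scenario across N agents, optionally with a
// context file (e.g. CLAUDE.md) injected into each agent's prompt.
async function runEval(
  scenarios: Scenario[],
  opts: { agents: number; contextFile?: string },
): Promise<RunResult[]> {
  return scenarios.map((s) => ({
    scenarioId: s.id,
    score: opts.contextFile ? 0.9 : 0.7, // placeholder scores for the sketch
  }));
}

// Stage 4: compare baseline vs with-context scores per scenario.
function compare(baseline: RunResult[], withContext: RunResult[]): void {
  const base = new Map(baseline.map((r) => [r.scenarioId, r.score] as const));
  for (const r of withContext) {
    const b = base.get(r.scenarioId) ?? 0;
    const d = r.score - b;
    console.log(
      `${r.scenarioId}: ${b.toFixed(2)} -> ${r.score.toFixed(2)} (${d >= 0 ? "+" : ""}${d.toFixed(2)})`,
    );
  }
}

async function main(): Promise<void> {
  const scenarios = await generateScenarios("acme/backend", 5);
  const baseline = await runEval(scenarios, { agents: 3 });
  const withContext = await runEval(scenarios, { agents: 3, contextFile: "CLAUDE.md" });
  compare(baseline, withContext);
}

main();
```

Running both evals over the same scenario set is what makes the final comparison meaningful: the only variable that changes between the two runs is whether the context file is present.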
The user says: "I want to understand if my CLAUDE.md is actually helping my AI agent. I have a monorepo at acme/backend on GitHub with TypeScript and Node.js code. I've never run any evals before — can you walk me through it?"
Walk the user through setting up and running a Tessl codebase eval on their repository from start to finish.
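Once both runs finish, the user's question ("is my CLAUDE.md actually helping?") reduces to comparing the two score distributions. A tiny sketch of that summary step, using made-up numbers and the same hypothetical RunResult shape as above:

```typescript
// Made-up scores for illustration; RunResult matches the earlier sketch.
interface RunResult {
  scenarioId: string;
  score: number;
}

function meanScore(results: RunResult[]): number {
  return results.reduce((sum, r) => sum + r.score, 0) / results.length;
}

const baseline: RunResult[] = [
  { scenarioId: "s1", score: 0.62 },
  { scenarioId: "s2", score: 0.75 },
];
const withContext: RunResult[] = [
  { scenarioId: "s1", score: 0.81 },
  { scenarioId: "s2", score: 0.88 },
];

const uplift = meanScore(withContext) - meanScore(baseline);
console.log(`Mean uplift with CLAUDE.md: +${(uplift * 100).toFixed(0)} points`);
// Positive uplift across many scenarios is evidence the context file helps;
// near-zero or negative uplift means it needs rework before re-running.
```

In a real walkthrough you would also inspect per-scenario deltas rather than just the mean, since a few regressions can hide inside an overall improvement.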