Collect and normalize agent logs, discover installed verifiers, and dispatch LLM judges to evaluate adherence. Produces per-session verdicts and aggregated reports.
How to extract instructions from source material and build verifier files with checklists.
Extract every instruction that directs an agent to do or not do something specific:
Coverage: extract everything. A typical skill yields 5-20+ instructions. A CLAUDE.md or rules file can yield more. If you have fewer than 5 from a non-trivial source, re-read — you're likely missing tool/method choices, process requirements, or prohibitions.
Do not invent rules not present in the source text.
For skill sources: read the SKILL.md and all its linked reference files (references/, scripts/). Instructions are typically explicit workflow steps, tool preferences, and constraints.
For docs and rules files (CLAUDE.md, rules): these often contain project-wide conventions. Extract each rule as a separate instruction. Rules files tend to be denser — many short rules rather than long workflows.
For user requests: the user tells you directly what they care about. Create verifiers from their description. Use "type": "user" in sources.
For each instruction found, create a verifier JSON file with the instruction, context, and sources filled in, but checklist: [] (empty array).
This gives the user a review point: they can edit, remove, or add instructions before any checklists are written.
File naming: use a short kebab-case slug from the instruction. Place in the target tile's verifiers/ directory.
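As a sketch, the slug can be derived mechanically from the instruction text (the helper name and the five-word cap are illustrative, not part of the spec):

```python
import re

def slug_from_instruction(instruction: str, max_words: int = 5) -> str:
    """Derive a short kebab-case filename slug from an instruction string."""
    # Keep only lowercase alphanumeric runs, join the first few with hyphens
    words = re.findall(r"[a-z0-9]+", instruction.lower())
    return "-".join(words[:max_words]) + ".json"

# slug_from_instruction("Use pdfplumber for text and table extraction")
# -> "use-pdfplumber-for-text-and.json"
```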
```json
{
  "instruction": "Use pdfplumber for text and table extraction",
  "relevant_when": "When extracting text or tables from PDFs",
  "context": "The skill recommends pdfplumber over alternatives like PyPDF2 or tabula-py for its consistent API and better table extraction.",
  "sources": [
    {
      "type": "file",
      "filename": "skills/pdf/SKILL.md",
      "tile": "anthropics/pdf@1.0.0",
      "line_no": 42
    }
  ],
  "checklist": []
}
```

Present the list to the user and get confirmation before Phase 2.
For each instruction file, decompose into checklist items. Each item is a binary pass/fail check.
Split into multiple items when the sub-behaviors can pass or fail independently. Keep as one item when the parts always stand or fall together, so separate checks would be redundant.
Each checklist item's rule field must be binary — a judge needs to answer yes/no:
| Avoid (needs interpretation) | Use instead (binary) |
|---|---|
| "Properly handles errors" | "Uses try/catch around external API calls" |
| "Follows the import style" | "Local import paths use .ts extension, not .js" |
| "Good commit messages" | "Commit message contains more than 5 words" |
| "Creative layout" | "Uses at least ONE of: asymmetric grid, overlapping elements, rotated content" |
Each instruction should produce 1-5 checklist items (typically 1-3). If you have more than 5, consider whether some items test the same thing.
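For example, the "Use pytest, not unittest" rule from the table above might decompose into two independently checkable items (the item names and context wording here are illustrative):

```json
{
  "instruction": "Use pytest framework, not unittest",
  "relevant_when": "When creating new test files or adding new test functions",
  "context": "The project standardizes on pytest; unittest-style TestCase classes are not accepted.",
  "checklist": [
    {
      "name": "pytest-style-tests",
      "rule": "New test functions are plain pytest functions, not unittest.TestCase methods",
      "relevant_when": "When creating new test files or adding new test functions"
    },
    {
      "name": "no-unittest-import",
      "rule": "New test files do not import the unittest module",
      "relevant_when": "When creating new test files"
    }
  ]
}
```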
instruction — State rules positively and specifically. Instead of "Don't use unittest", write "Use pytest framework, not unittest". Include the concrete tool, library, command, or pattern.
relevant_when — Describe the decision point where the agent could either follow or violate this rule. Not the broad activity that might contain the decision, but the specific moment the choice arises.
Think: "In what specific situation could the agent either follow or violate this rule?" That situation is your relevant_when.
| Rule | Too broad (matches too many sessions) | Better (scoped to the decision point) |
|---|---|---|
| Use pytest not unittest | "When writing tests" | "When creating new test files or adding new test functions" |
| Use pytest.raises for exceptions | "When writing Python tests" | "When writing test assertions that verify exceptions are raised" |
| Put tests in __tests__/ | "When writing tests" | "When creating new test files (not editing existing ones)" |
| Use @pytest.mark.parametrize | "When writing tests with similar assertions" | "When writing tests that repeat the same assertion logic across different input values" |
| Run lint after changes | "When modifying files" | "When agent has finished modifying .py files and is wrapping up" |
The pattern: "writing tests" is an activity. "Creating a new test file" is a decision point. The rule about test file location only matters at the moment a new file is created — not every time any test is touched.
A session might involve code that happens to have certain properties (async functions, database connections, HTTP handlers), but the session's work may be unrelated to the rule. Scope by the kind of work the rule governs, not by surface-level properties that happen to be present.
Bad: "relevant_when": "When writing code that uses async functions" — matches a session refactoring error messages in an async module, even though async patterns aren't the focus
Better: "relevant_when": "When writing new async functions or converting sync code to async" — scoped to when the agent is making concurrency decisions
Bad: "relevant_when": "When writing tests for code that has side effects" — matches almost any integration test
Better: "relevant_when": "When writing tests that need to isolate external service calls (HTTP, database, filesystem)" — scoped to when mocking decisions actually arise
When a rule can be satisfied by different-but-equivalent approaches, phrase the rule to accept them. Otherwise the judge will fail correct implementations that use a different formulation.
Bad: "rule": "Agent uses pytest.fixture for test setup" — rejects equivalent setup using @pytest.fixture as a decorator vs conftest.py fixtures
Better: "rule": "Agent uses pytest fixtures (via @pytest.fixture decorator or conftest.py) for shared test setup, rather than setUp/tearDown methods or inline setup"
Bad: "rule": "Agent uses context managers for file handling" — rejects equivalent try/finally patterns
Better: "rule": "Agent ensures files are properly closed after use, via context managers (with statement) or equivalent resource cleanup"
context — 2-3 sentences giving the judge enough background to understand the rule without reading the full source. Include the rule's rationale and any alternatives it rules out, as in the pdfplumber example above.
checklist rule — The specific behavior to check. Should match the granularity of what a judge can observe in a session transcript. Don't check things that would be invisible (e.g., runtime behavior, performance characteristics).
checklist relevant_when — Should be equal or narrower than the instruction-level relevant_when — never broader. Think of it as a funnel:

- Instruction-level relevant_when asks: "Does this session involve the general area?"
- Checklist-level relevant_when asks: "Did the agent face this specific decision?"

Example for "Follow TDD workflow":

- Instruction level: "relevant_when": "Agent is implementing a new feature or fixing a bug"
- Checklist item: "relevant_when": "Agent creates or modifies implementation files in src/"
- Checklist item: "relevant_when": "Agent creates new test files"

For skill sources, consider adding a verifier specifically for activation — was the skill read/loaded by the agent? This only makes sense for skills (not docs or rules, which are always loaded via CLAUDE.md).
```json
{
  "instruction": "Activate the frontend-design skill when building UI",
  "relevant_when": "Agent is building or modifying frontend UI components",
  "context": "The frontend-design skill contains project-specific UI guidelines. It should be loaded via the skill tool when doing UI work, not just followed by coincidence.",
  "sources": [
    {
      "type": "file",
      "filename": "skills/frontend-design/SKILL.md",
      "tile": "anthropics/frontend-design@1.2.0"
    }
  ],
  "checklist": [
    {
      "name": "skill-activated",
      "rule": "Agent reads or activates the frontend-design skill (via skill tool or reading SKILL.md) before writing UI code",
      "relevant_when": "Agent writes or modifies frontend UI components"
    }
  ]
}
```

For docs/rules sources, skip activation verifiers — these files are loaded automatically and activation isn't meaningful.
By default, write verifier files into the source skill's verifiers/ directory (e.g. skills/my-skill/verifiers/). If the user specifies a different tile with --tile <path>, write there instead.
If the target tile doesn't exist yet, create it with tessl tile new first.
When verifiers are embedded inside a skill directory (skills/my-skill/verifiers/), omit the sources field entirely. The source is implicitly the skill the verifier lives inside — including an explicit sources array leads to it drifting out of sync with the actual location.
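A minimal sketch of an embedded verifier, with no sources field (the rule text here is illustrative, reusing the pdfplumber example from above):

```json
{
  "instruction": "Use pdfplumber for text and table extraction",
  "relevant_when": "When extracting text or tables from PDFs",
  "context": "The skill recommends pdfplumber over PyPDF2 and tabula-py for its consistent API and better table extraction.",
  "checklist": [
    {
      "name": "uses-pdfplumber",
      "rule": "Agent uses pdfplumber (not PyPDF2 or tabula-py) for PDF text or table extraction",
      "relevant_when": "When writing code that extracts text or tables from PDFs"
    }
  ]
}
```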
Include sources only when:

- The verifiers live in a standalone tile (a top-level verifiers/ not inside a skill)
- The source is a user request ("type": "user")
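The conventions above can be sanity-checked mechanically. A minimal sketch, assuming a verifier is "embedded" whenever its path passes through a skills/ directory (that heuristic, and the function name, are assumptions, not part of the spec):

```python
import json
from pathlib import Path

# Keys every verifier file must carry, per the examples in this document
REQUIRED_KEYS = {"instruction", "relevant_when", "context", "checklist"}

def validate_verifier(path: Path) -> list[str]:
    """Return a list of problems found in one verifier JSON file."""
    data = json.loads(path.read_text())
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - data.keys())]
    # Assumption: a verifiers/ dir under skills/ means the verifier is embedded
    embedded = "skills" in path.parts
    if embedded and "sources" in data:
        problems.append("embedded verifier should omit sources")
    if not embedded and "sources" not in data:
        problems.append("standalone verifier should declare sources")
    return problems
```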