Analyze agent sessions against verifier checklists, detect friction points, and create structured verifiers from skills and docs. Produces per-session verdicts and aggregated quality reports.
One JSON file per session, written to verdicts/<agent>/<session>.verdict.json:
```json
{
  "session_file": "normalized/claude-code/session-abc.jsonl",
  "agent": "claude-code",
  "instructions": [
    {
      "file": "use-tailwind-for-styling.json",
      "instruction": "Use Tailwind CSS for all styling",
      "tile": "anthropics/frontend-design",
      "relevant": true,
      "checks": [
        {
          "name": "tailwind-classes-used",
          "applicable": true,
          "passed": true,
          "confidence": "high",
          "evidence": "Turn 12: wrote className='flex items-center gap-4' in Card.tsx"
        },
        {
          "name": "no-inline-styles",
          "applicable": true,
          "passed": false,
          "confidence": "high",
          "evidence": "Turn 18: used style={{ marginTop: 8 }} in Header.tsx"
        }
      ]
    },
    {
      "file": "run-tests-after-changes.json",
      "instruction": "Run tests after making code changes",
      "tile": "amyh/project-rules",
      "relevant": true,
      "checks": [
        {
          "name": "tests-run-after-edit",
          "applicable": true,
          "passed": true,
          "confidence": "medium",
          "evidence": "Turn 25: ran 'bun run test' after editing api.ts at turn 22"
        }
      ]
    }
  ],
  "_meta": {
    "model": "claude-haiku-4-5-20251001",
    "started_at": "2026-03-11T14:30:00Z",
    "completed_at": "2026-03-11T14:30:04Z",
    "duration_ms": 4200,
    "input_tokens": 12500,
    "output_tokens": 1800,
    "token_source": "api",
    "transcript_chars": 44059,
    "checks_count": 5
  }
}
```

Top-level fields:

| Field | Type | Description |
|---|---|---|
| `session_file` | string | Path to normalized session JSONL (relative to analysis dir) |
| `agent` | string | Agent name: `claude-code`, `codex`, `gemini`, `cursor-ide`, `cursor-agent` |
| `instructions` | array | One entry per instruction evaluated |
| `_meta` | object | Cost and timing metadata from the judge |

Fields of each `instructions` entry:

| Field | Type | Description |
|---|---|---|
| `file` | string | Verifier filename (e.g. `use-tailwind-for-styling.json`) |
| `instruction` | string | The instruction text |
| `tile` | string | Source tile identifier |
| `relevant` | bool | Whether this instruction is relevant to the session |
| `checks` | array | One entry per checklist item (empty when `relevant` is false) |

Fields of each `checks` entry:

| Field | Type | Description |
|---|---|---|
| `name` | string | Checklist item name (from verifier JSON) |
| `applicable` | bool | Whether this check's `relevant_when` applies to the session |
| `passed` | bool \| null | Whether the agent followed the rule. null when `applicable` is false |
| `confidence` | "high" \| "medium" \| "low" | How clear the evidence is |
| `evidence` | string | Short sentence citing specific turns |

Metadata fields (`_meta`):

| Field | Type | Description |
|---|---|---|
| `model` | string | Model ID used for judging |
| `started_at` | string | ISO timestamp when judge started |
| `completed_at` | string | ISO timestamp when judge completed |
| `duration_ms` | int | Wall-clock time in milliseconds |
| `input_tokens` | int \| null | Input tokens consumed |
| `output_tokens` | int \| null | Output tokens consumed |
| `token_source` | string | "api", "estimated", or "unavailable" |
| `transcript_chars` | int | Size of session transcript in characters |
| `checks_count` | int | Total checklist items evaluated |
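
The field descriptions above imply a few structural constraints: `passed` is null when `applicable` is false, `checks` is empty when `relevant` is false, and `confidence` takes one of three values. A minimal sketch of checking them on a parsed verdict might look like the following. The function name `validate_verdict` and the wording of the messages are hypothetical and not part of the tool itself.

```python
from typing import Any

# The three confidence values listed in the checks table.
CONFIDENCE_LEVELS = {"high", "medium", "low"}


def validate_verdict(verdict: dict[str, Any]) -> list[str]:
    """Return human-readable problems; an empty list means none were found.

    Hypothetical helper: it only checks the constraints stated in the
    field tables, not the full schema.
    """
    problems: list[str] = []
    for instr in verdict.get("instructions", []):
        label = instr.get("file", "<unknown>")
        checks = instr.get("checks", [])
        # checks is expected to be empty when the instruction is not relevant.
        if not instr.get("relevant", False) and checks:
            problems.append(f"{label}: checks should be empty when relevant is false")
        for check in checks:
            name = f"{label}/{check.get('name', '<unnamed>')}"
            # passed is expected to be null (None) when the check is not applicable.
            if not check.get("applicable", False) and check.get("passed") is not None:
                problems.append(f"{name}: passed must be null when applicable is false")
            if check.get("confidence") not in CONFIDENCE_LEVELS:
                problems.append(f"{name}: unexpected confidence {check.get('confidence')!r}")
    return problems
```

To use it, load a `.verdict.json` file with `json.load` and pass the resulting dict to `validate_verdict`; a non-empty result lists the entries worth inspecting by hand.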

- `passed` MUST be null when `applicable` is false
- `checks` SHOULD be empty `[]` when `relevant` is false
- `confidence` should be "low" when the transcript was truncated or ambiguous

`verdicts-aggregate.json` rolls up across sessions:
```json
{
  "timestamp": "2026-03-11T14:35:00Z",
  "sessions_count": 15,
  "tiles": {
    "anthropics/frontend-design": {
      "instructions": {
        "use-tailwind-for-styling.json": {
          "instruction": "Use Tailwind CSS for all styling",
          "checks": {
            "tailwind-classes-used": {
              "applicable_count": 10,
              "passed_count": 9,
              "pass_rate": 0.90,
              "confidence_breakdown": { "high": 8, "medium": 2, "low": 0 }
            },
            "no-inline-styles": {
              "applicable_count": 10,
              "passed_count": 8,
              "pass_rate": 0.80,
              "confidence_breakdown": { "high": 7, "medium": 2, "low": 1 }
            }
          }
        }
      },
      "overall_pass_rate": 0.82
    }
  },
  "cost": {
    "total_input_tokens": 187500,
    "total_output_tokens": 27000,
    "estimated_cost_usd": 0.12
  }
}
```
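
The per-check numbers in the aggregate can be reproduced from individual verdicts. The sketch below is one plausible way to do that rollup, not the tool's actual implementation. It assumes that `pass_rate` is `passed_count / applicable_count` and that `confidence_breakdown` counts only applicable checks; both assumptions are consistent with the example (9/10 = 0.90, and each breakdown sums to 10). The function name `aggregate_checks` is hypothetical. `overall_pass_rate` and `cost` are left out because the example alone does not determine how they are computed.

```python
from typing import Any


def aggregate_checks(verdicts: list[dict[str, Any]]) -> dict[str, Any]:
    """Roll up per-session verdicts into per-check counts and pass rates.

    Hypothetical sketch: only applicable checks are counted (assumption).
    """
    tiles: dict[str, Any] = {}
    for verdict in verdicts:
        for instr in verdict.get("instructions", []):
            for check in instr.get("checks", []):
                if not check.get("applicable"):
                    continue  # non-applicable checks carry passed = null
                tile = tiles.setdefault(instr["tile"], {"instructions": {}})
                entry = tile["instructions"].setdefault(
                    instr["file"],
                    {"instruction": instr.get("instruction", ""), "checks": {}},
                )
                stats = entry["checks"].setdefault(
                    check["name"],
                    {
                        "applicable_count": 0,
                        "passed_count": 0,
                        "pass_rate": 0.0,
                        "confidence_breakdown": {"high": 0, "medium": 0, "low": 0},
                    },
                )
                stats["applicable_count"] += 1
                if check.get("passed") is True:
                    stats["passed_count"] += 1
                conf = check.get("confidence")
                if conf in stats["confidence_breakdown"]:
                    stats["confidence_breakdown"][conf] += 1
                # Assumed definition: passed checks over applicable checks.
                stats["pass_rate"] = round(
                    stats["passed_count"] / stats["applicable_count"], 2
                )
    return {"sessions_count": len(verdicts), "tiles": tiles}
```

Feeding it the parsed contents of every `.verdict.json` file yields a structure shaped like the `tiles` section of the aggregate, which can then be compared against `verdicts-aggregate.json` as a sanity check.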