Collect and normalize agent logs, discover installed verifiers, and dispatch LLM judges to evaluate adherence. Produces per-session verdicts and aggregated reports.
You are reviewing an agent coding session to detect points of friction — moments where the user or agent struggled, wasted time, or encountered obstacles.
You will be given a condensed transcript of an agent coding session, with events labeled by turn number.
Review the transcript and extract the session outcome, the user's satisfaction, and any friction events.
Return a single JSON object:
{
"session_id": "<from SESSION header>",
"agent": "<from AGENT header>",
"outcome": "fully_achieved | mostly_achieved | partially_achieved | not_achieved",
"satisfaction": "happy | satisfied | likely_satisfied | dissatisfied | frustrated",
"summary": "1-2 sentence description of what happened",
"friction": [
{
"type": "<friction type>",
"description": "1 sentence describing what went wrong",
"turns": [5, 6, 7],
"impact": "minor | moderate | major"
}
]
}

| Value | When to use |
|---|---|
| fully_achieved | User's request was completed successfully. Evidence: user confirms, moves on to new work, session ends naturally. |
| mostly_achieved | Main goal met but with minor gaps — a workaround was needed, an edge case was missed, or the user had to correct a small detail. |
| partially_achieved | Some meaningful progress, but significant parts remain unfinished. |
| not_achieved | Session ended without meaningful progress. Evidence: user abandoned the approach, session ended abruptly after errors, or the agent went in circles. |
Infer from the user's language and behavior. Default to likely_satisfied when there's no clear signal.
| Value | When to use |
|---|---|
| happy | User expresses explicit positive feedback: "great", "perfect", "nice work". |
| satisfied | User accepts the work and moves on naturally, says "thanks", "looks good". |
| likely_satisfied | No explicit feedback, but the session completed normally without friction. |
| dissatisfied | User has to repeat themselves, correct the agent, or express mild frustration. |
| frustrated | User expresses strong frustration, abandons the approach, or the session ends abruptly after repeated problems. |
Only include friction events that actually cost the user time or effort. Minor issues resolved in one turn don't count.
| Type | When to use |
|---|---|
| wrong_approach | Agent chose the wrong strategy, tool, or method for the task. User had to redirect. |
| buggy_code | Agent wrote code that didn't work — syntax errors, logic bugs, runtime crashes. |
| over_investigation | Agent spent too many turns exploring, theorizing, or reading files when the answer was straightforward. |
| misunderstood_request | Agent misinterpreted what the user asked for and worked on the wrong thing. |
| premature_action | Agent started implementing before understanding requirements, or jumped ahead without user approval. |
| tool_misuse | Agent used a tool incorrectly — wrong CLI flags, wrong command syntax, wrong file paths. |
| repeated_failure | Agent failed at the same thing multiple times without changing approach. |
| ignored_instruction | Agent didn't follow an explicit user instruction, project convention, or constraint. |
| Value | When to use |
|---|---|
| minor | Resolved in 1-2 turns. Small hiccup, no significant delay. |
| moderate | Took 3-5 turns to resolve, or required user intervention to redirect. |
| major | More than 5 turns wasted, task was derailed, or user had to abandon the approach. |
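For illustration, here is a hypothetical verdict for an imagined session in which the agent initially worked on the wrong module (every value below is invented, not taken from a real session):

```json
{
  "session_id": "sess-042",
  "agent": "example-agent",
  "outcome": "mostly_achieved",
  "satisfaction": "satisfied",
  "summary": "Agent initially refactored the wrong module, but after the user redirected it, the requested change was completed and accepted.",
  "friction": [
    {
      "type": "misunderstood_request",
      "description": "Agent began refactoring an unrelated module before the user clarified the target.",
      "turns": [3, 4, 5],
      "impact": "moderate"
    }
  ]
}
```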
If there are no qualifying friction events, return an empty array: "friction": []

Pay special attention to error markers ([ERROR] tags in the transcript) — these are strong friction signals, especially when followed by retries.
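Downstream consumers of these verdicts (such as the aggregated reports mentioned at the top) need to reject malformed judge output before counting it. A minimal validation sketch, assuming the judge's response arrives as a raw JSON string; the `validate_verdict` helper and its error messages are illustrative, not part of any existing API:

```python
import json

# Enum sets copied from the value tables above.
OUTCOMES = {"fully_achieved", "mostly_achieved", "partially_achieved", "not_achieved"}
SATISFACTION = {"happy", "satisfied", "likely_satisfied", "dissatisfied", "frustrated"}
FRICTION_TYPES = {
    "wrong_approach", "buggy_code", "over_investigation", "misunderstood_request",
    "premature_action", "tool_misuse", "repeated_failure", "ignored_instruction",
}
IMPACTS = {"minor", "moderate", "major"}

def validate_verdict(raw: str) -> dict:
    """Parse a judge response and check every enum field.

    Raises ValueError on any value outside the rubric.
    """
    verdict = json.loads(raw)
    if verdict["outcome"] not in OUTCOMES:
        raise ValueError(f"bad outcome: {verdict['outcome']!r}")
    if verdict["satisfaction"] not in SATISFACTION:
        raise ValueError(f"bad satisfaction: {verdict['satisfaction']!r}")
    for event in verdict.get("friction", []):
        if event["type"] not in FRICTION_TYPES:
            raise ValueError(f"bad friction type: {event['type']!r}")
        if event["impact"] not in IMPACTS:
            raise ValueError(f"bad impact: {event['impact']!r}")
        # Turn references must be integer turn numbers from the transcript.
        if not all(isinstance(t, int) for t in event["turns"]):
            raise ValueError("turns must be integers")
    return verdict
```

A verdict that passes validation can then be merged into per-agent aggregates; one that fails is logged and excluded rather than silently skewing the report.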