try-tessl/agent-quality

Analyze agent sessions against verifier checklists, detect friction points, and create structured verifiers from skills and docs. Produces per-session verdicts and aggregated quality reports.

2.93x

Quality

86%

Does it follow best practices?

Impact

97%

2.93x

Average score across 3 eval scenarios

Securityby

Passed

No known issues

Pipeline Reference

Name: try-tessl/agent-quality
Rating: 88.64999999999999 (1 reviews)
Author: try-tessl

Read this when debugging or understanding how the analysis pipeline works internally.

Prerequisites

At least one tile with a verifiers/ directory installed in .tessl/tiles/ or ~/.tessl/tiles/
Agent logs to analyze (from Claude Code, Codex, Gemini, or Cursor)
claude CLI installed and authenticated (judges are dispatched via claude -p --model haiku)

If no verifiers are found, suggest the user create some:

No verifiers found in any installed tiles. Let's create some — which skills or rules would you like to check agent behavior against?

Options:

A specific skill — pick an installed tile and I'll extract verifiers from its SKILL.md

Project rules — extract from your CLAUDE.md, AGENTS.md, or .cursor/rules/

Something specific — tell me what you want to track and I'll create verifiers for it

To create verifiers, I'll use the create-verifiers skill included in this tile.

Then activate the create-verifiers skill to walk the user through verifier creation. New verifier tiles should be created outside .tessl/ (e.g. in tiles/) and installed as a local file source:

tessl tile new --name <workspace>/my-verifiers --path tiles/my-verifiers --workspace <workspace>
# ... create verifiers with the create-verifiers skill ...
tessl install file:tiles/my-verifiers --watch-local

How the Pipeline Works

Collects raw logs from coding agents on your machine
Normalizes them to a common format (with automatic secret redaction — see Security below)
Discovers verifiers in installed tiles (searches all verifiers/ directories — root, skill subdirectories, anywhere in the tile)
Prepares session transcripts for LLM review
Dispatches haiku judges via claude -p to evaluate each session against each verifier
Merges verdicts into an aggregate summary

With --friction, steps 5-6 also run friction reviewers in parallel:

Dispatches separate haiku calls to detect friction (errors, backtracking, user frustration)
Merges friction results and synthesizes them with verifier data
Classifies each friction event by its relationship to installed tiles:
- preventable — skill covers this, agent didn't follow
- introduced — skill instructions caused the friction
- adjacent — in the skill's domain, not covered by verifiers
- unrelated — general agent/environment issues

Security

Secret Redaction

All raw log content is passed through redact_secrets() in normalize_logs.py before any further processing. This strips API keys, bearer tokens, AWS credentials, Stripe keys, and other common secret patterns. Prepared transcripts and judge inputs only ever see redacted content.

Indirect Prompt Injection

Session transcripts contain untrusted content — tool outputs, web page text, user messages, and other data from prior agent sessions. This content is passed to LLM judges for evaluation, creating a theoretical indirect prompt injection surface. Mitigations:

Judge framing — review_session.py wraps transcripts in <transcript> tags and explicitly instructs the judge that the content is data to evaluate, not instructions to follow, and to ignore any embedded instructions or prompt overrides.
Structured output — judges must return a specific JSON verdict schema. There is no mechanism for a judge to take actions, write files, or execute code.
No tools — judges run via claude -p with no tool access. Even if an injection influenced the judge's reasoning, it cannot escalate beyond the verdict output.
Limited blast radius — the worst case is a biased verdict (a check incorrectly marked as passed/failed). Verdicts are reviewed by the orchestrating agent and presented to the user, so anomalies are visible.

Scripts

Pipeline Orchestrator

run_pipeline.py — run the full pipeline in a single command (calls all scripts below)

Collection & Normalization

collect_logs.py — copy raw logs from agent home directories
normalize_logs.py — convert raw logs to NormalizedEvent JSONL

Search & Discovery

search_sessions.py — grep prepared transcripts for keywords/patterns to find relevant sessions
discover_verifiers.py — scan installed tiles for verifiers (searches all subdirectories)

Evaluation

prepare_sessions.py — condense normalized logs into text transcripts (supports --sessions filter)
review_session.py — review a single session via claude -p --model haiku (called by dispatch_judges.py)
extract_checklist.py — extract checklist rules from verifier JSONs for judge prompts
dispatch_judges.py — send transcripts + checklists to haiku judges
merge_verdicts.py — aggregate verdict files into summary

Friction (parallel pipeline, enabled with --friction)

dispatch_friction.py — dispatch friction reviewers in parallel
merge_friction.py — aggregate friction reviews

Analysis (tile-level)

analyze_trends.py — trend analysis from verdict cache (recent sessions, timeseries, recent vs prior)
compare_runs.py — compare current run against previous run
synthesize_findings.py — correlate verifier + friction findings per tile

Data Storage

All analysis data is stored in ~/.tessl/session-analyses/<project-slug>/:

~/.tessl/session-analyses/-Users-amy-dev-myproject/
├── raw/                          # Collected raw logs
│   ├── claude-code/
│   ├── codex/
│   ├── gemini/
│   ├── cursor-ide/
│   └── cursor-agent/
├── normalized/                   # NormalizedEvent JSONL
│   └── <agent>/
├── runs/
│   └── <timestamp>/             # Each analysis run
│       ├── manifest.json         # What was analyzed (tiles, verifiers, agents, sessions)
│       ├── prepared/             # Condensed transcripts
│       │   └── <agent>/
│       ├── verdicts/             # Per-session judge verdicts
│       │   └── <tile>/           # Namespaced by tile
│       │       └── <agent>/
│       ├── verdicts-aggregate.json
│       ├── friction/             # Per-session friction reviews (with --friction)
│       │   └── <agent>/
│       ├── friction-summary.json  # Aggregated friction data
│       └── synthesis.json         # Combined verifier + friction findings
└── latest -> runs/<timestamp>/   # Symlink to most recent run

The project slug uses the same dash-encoding as tessl's agent-logs cache (e.g. /Users/amy/dev/myproject -> -Users-amy-dev-myproject).

Multi-Path Analyses

When --project-dir is passed multiple paths (e.g. worktrees or separate checkouts), each path gets its own analysis directory with independent raw/, normalized/, and runs/<timestamp>/ trees. The merge scripts (merge_verdicts.py, merge_friction.py) accept multiple --dir values and aggregate verdicts from all of them. The aggregated output (verdicts-aggregate.json, friction-summary.json, synthesis.json) is written to the primary (first) path's run directory.

try-tessl/agent-quality