Build a custom browser-based annotation interface tailored to your data for reviewing LLM traces and collecting structured feedback. Use when you need to build an annotation tool, review traces, or collect human labels.
Build an HTML page that loads traces from a data source (JSON/CSV file), displays one trace at a time with Pass/Fail buttons, a free-text notes field, and Next/Previous navigation, and saves labels to a local file (CSV/SQLite/JSON). Then customize it to the domain using the guidelines below.
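The core of that page is a small piece of state: which trace is showing, and what labels have been recorded. A minimal sketch of that state handling, with rendering and persistence left to the page (the trace shape and function names here are illustrative, not a fixed API):

```javascript
// Minimal annotation-state sketch. Wire a render step to your DOM and
// feed export() to a download link or a tiny local save endpoint.
function createAnnotator(traces) {
  let index = 0;
  const labels = {}; // trace id -> { verdict, notes }

  return {
    current: () => traces[index],
    label(verdict, notes = "") {
      labels[traces[index].id] = { verdict, notes };
    },
    // Clamp navigation at both ends instead of wrapping.
    next() { index = Math.min(index + 1, traces.length - 1); },
    prev() { index = Math.max(index - 1, 0); },
    // Serialize labels for saving to a local JSON file.
    export: () => JSON.stringify(labels, null, 2),
  };
}
```

Pass/Fail buttons then reduce to `annotator.label("pass")` or `annotator.label("fail", notesField.value)` followed by a re-render.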
Format all data in the most human-readable representation for the domain. Emails should look like emails. Code should have syntax highlighting. Markdown should be rendered. Tables should be tables. JSON should be pretty-printed and collapsible behind a <details> toggle.

Annotate at the trace level. The reviewer judges the whole trace, not individual spans.
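As one example of domain-appropriate rendering, a JSON payload can be pretty-printed inside a collapsible <details> element (the helper names and summary label here are illustrative):

```javascript
// Escape the few characters that matter inside HTML text content.
function escapeHtml(s) {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Pretty-print an object and wrap it in a collapsible <details> block.
function renderJson(obj, summary = "JSON payload") {
  const pretty = escapeHtml(JSON.stringify(obj, null, 2));
  return `<details><summary>${summary}</summary><pre>${pretty}</pre></details>`;
}
```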
Once you have established failure categories from error analysis, you can add predefined failure-mode tags as clickable checkboxes, dropdowns, or picklists so reviewers can select from known categories in addition to writing notes. Don't add these in the initial build.
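When that stage arrives, the tags can be rendered from a plain list so categories stay easy to update (the category names below are placeholders, not a recommended taxonomy):

```javascript
// Placeholder failure modes; replace with categories from your error analysis.
const FAILURE_MODES = ["hallucination", "wrong tone", "missed instruction"];

// Render one labeled checkbox per failure mode.
function renderTagCheckboxes(modes = FAILURE_MODES) {
  return modes
    .map((m) => `<label><input type="checkbox" name="tag" value="${m}"> ${m}</label>`)
    .join("\n");
}
```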
Arrow keys = Navigate traces
1 = Pass, 2 = Fail
D = Defer, U = Undo last action
Cmd+S = Save, Cmd+Enter = Save and next

Build the app to accept traces from any source (JSON/CSV file). Keep sampling logic outside the app in a separate script. Start with random sampling.
Reference panel: Toggle-able panel showing ground truth, expected answers, or rubric definitions alongside the trace.
Filtering: Filter traces by metadata dimensions relevant to the product (channel, user type, pipeline version).
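Metadata filtering can be a simple predicate over whatever dimensions your traces carry (the `metadata`, `channel`, and `userType` field names below are assumptions about the trace shape):

```javascript
// Keep only traces whose metadata matches every criterion exactly.
function filterTraces(traces, criteria) {
  return traces.filter((t) =>
    Object.entries(criteria).every(([key, value]) => t.metadata?.[key] === value)
  );
}
```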
Clustering: Group traces by metadata or semantic similarity. Show representative traces per cluster with drill-down.
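For the metadata case, grouping with a representative per cluster is a few lines; a semantic-similarity version would replace the key function with an embedding-based assignment. This sketch picks the first member of each group as its representative:

```javascript
// Group traces by a key function; expose size and a representative per group.
function clusterBy(traces, keyFn) {
  const groups = new Map();
  for (const t of traces) {
    const k = keyFn(t);
    if (!groups.has(k)) groups.set(k, []);
    groups.get(k).push(t);
  }
  return [...groups.entries()].map(([key, members]) => ({
    key,
    size: members.length,
    representative: members[0], // drill-down can list all members
  }));
}
```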
After building the interface, verify it with Playwright.
Visual review: Take screenshots of the interface with representative trace data loaded. Review each screenshot for:
Functional test: Write a Playwright script that performs a full annotation workflow: