Collect and normalize agent logs, discover installed verifiers, and dispatch LLM judges to evaluate adherence. Produces per-session verdicts and aggregated reports.
How to extract instructions from source material and build verifier files with checklists.
Extract every instruction that directs an agent to do or not do something specific:
Coverage: extract everything. A typical skill yields 5-20+ instructions. A CLAUDE.md or rules file can yield more. If you have fewer than 5 from a non-trivial source, re-read — you're likely missing tool/method choices, process requirements, or prohibitions.
Do not invent rules not present in the source text.
For skill sources: read the SKILL.md and all its linked reference files (references/, scripts/). Instructions are typically explicit workflow steps, tool preferences, and constraints.
For docs and rules files (CLAUDE.md, rules): these often contain project-wide conventions. Extract each rule as a separate instruction. Rules files tend to be denser — many short rules rather than long workflows.
For user requests: the user tells you directly what they care about. Create verifiers from their description. Use "type": "user" in sources.
For each instruction found, create a verifier JSON file with the instruction, context, and sources filled in, but checklist: [] (empty array).
This gives the user a review point: they can edit, remove, or add instructions before any checklists are written.
File naming: use a short kebab-case slug from the instruction. Place in the target tile's verifiers/ directory.
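As a sketch, the slug can be derived mechanically from the instruction text (the helper name and the five-word cap are illustrative, not part of the spec):

```python
import re

def slug_from_instruction(instruction: str, max_words: int = 5) -> str:
    """Derive a short kebab-case filename slug from an instruction string."""
    # Keep only lowercase alphanumeric runs, join the first few with hyphens
    words = re.findall(r"[a-z0-9]+", instruction.lower())
    return "-".join(words[:max_words]) + ".json"

# slug_from_instruction("Use pdfplumber for text and table extraction")
# -> "use-pdfplumber-for-text-and.json"
```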
```json
{
  "instruction": "Use pdfplumber for text and table extraction",
  "relevant_when": "When extracting text or tables from PDFs",
  "context": "The skill recommends pdfplumber over alternatives like PyPDF2 or tabula-py for its consistent API and better table extraction.",
  "sources": [
    {
      "type": "file",
      "filename": "skills/pdf/SKILL.md",
      "tile": "anthropics/pdf@1.0.0",
      "line_no": 42
    }
  ],
  "checklist": []
}
```

Present the list to the user and get confirmation before Phase 2.
For each instruction file, decompose into checklist items. Each item is a binary pass/fail check.
Split into multiple items when the sub-behaviors can pass or fail independently. Keep as one item when the parts always stand or fall together, so separate checks would be redundant.
Each checklist item's rule field must be binary — a judge needs to answer yes/no:
| Avoid (needs interpretation) | Use instead (binary) |
|---|---|
| "Properly handles errors" | "Uses try/catch around external API calls" |
| "Follows the import style" | "Local import paths use .ts extension, not .js" |
| "Good commit messages" | "Commit message contains more than 5 words" |
| "Creative layout" | "Uses at least ONE of: asymmetric grid, overlapping elements, rotated content" |
Each instruction should produce 1-5 checklist items (typically 1-3). If you have more than 5, consider whether some items test the same thing.
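For example, the "Use pytest, not unittest" rule from the table above might decompose into two independently checkable items (the item names and context wording here are illustrative):

```json
{
  "instruction": "Use pytest framework, not unittest",
  "relevant_when": "When creating new test files or adding new test functions",
  "context": "The project standardizes on pytest; unittest-style TestCase classes are not accepted.",
  "checklist": [
    {
      "name": "pytest-style-tests",
      "rule": "New test functions are plain pytest functions, not unittest.TestCase methods",
      "relevant_when": "When creating new test files or adding new test functions"
    },
    {
      "name": "no-unittest-import",
      "rule": "New test files do not import the unittest module",
      "relevant_when": "When creating new test files"
    }
  ]
}
```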
instruction — State rules positively and specifically. Instead of "Don't use unittest", write "Use pytest framework, not unittest". Include the concrete tool, library, command, or pattern.
relevant_when — Describe the decision point where the agent could either follow or violate this rule. Not the broad activity that might contain the decision, but the specific moment the choice arises.
Think: "In what specific situation could the agent either follow or violate this rule?" That situation is your relevant_when.
| Rule | Too broad (matches too many sessions) | Better (scoped to the decision point) |
|---|---|---|
| Use pytest not unittest | "When writing tests" | "When creating new test files or adding new test functions" |
| Use pytest.raises for exceptions | "When writing Python tests" | "When writing test assertions that verify exceptions are raised" |
| Put tests in __tests__/ | "When writing tests" | "When creating new test files (not editing existing ones)" |
| Use @pytest.mark.parametrize | "When writing tests with similar assertions" | "When writing tests that repeat the same assertion logic across different input values" |
| Run lint after changes | "When modifying files" | "When agent has finished modifying .py files and is wrapping up" |
The pattern: "writing tests" is an activity. "Creating a new test file" is a decision point. The rule about test file location only matters at the moment a new file is created — not every time any test is touched.
A session might involve code that happens to have certain properties (async functions, database connections, HTTP handlers), but the session's work may be unrelated to the rule. Scope by the kind of work the rule governs, not by surface-level properties that happen to be present.
Bad: "relevant_when": "When writing code that uses async functions" — matches a session refactoring error messages in an async module, even though async patterns aren't the focus
Better: "relevant_when": "When writing new async functions or converting sync code to async" — scoped to when the agent is making concurrency decisions
Bad: "relevant_when": "When writing tests for code that has side effects" — matches almost any integration test
Better: "relevant_when": "When writing tests that need to isolate external service calls (HTTP, database, filesystem)" — scoped to when mocking decisions actually arise
When a rule can be satisfied by different-but-equivalent approaches, phrase the rule to accept them. Otherwise the judge will fail correct implementations that use a different formulation.
Bad: "rule": "Agent uses pytest.fixture for test setup" — rejects equivalent setup using @pytest.fixture as a decorator vs conftest.py fixtures
Better: "rule": "Agent uses pytest fixtures (via @pytest.fixture decorator or conftest.py) for shared test setup, rather than setUp/tearDown methods or inline setup"
Bad: "rule": "Agent uses context managers for file handling" — rejects equivalent try/finally patterns
Better: "rule": "Agent ensures files are properly closed after use, via context managers (with statement) or equivalent resource cleanup"
context — 2-3 sentences giving the judge enough background to understand the rule without reading the full source. Include the rule's rationale and any alternatives it rules out, as in the pdfplumber example above.
checklist rule — The specific behavior to check. Should match the granularity of what a judge can observe in a session transcript. Don't check things that would be invisible (e.g., runtime behavior, performance characteristics).
checklist relevant_when — Should be equal or narrower than the instruction-level relevant_when — never broader. Think of it as a funnel:

- Instruction-level relevant_when asks: "Does this session involve the general area?"
- Checklist-level relevant_when asks: "Did the agent face this specific decision?"

Example for "Follow TDD workflow":

- Instruction level: "relevant_when": "Agent is implementing a new feature or fixing a bug"
- Checklist item: "relevant_when": "Agent creates or modifies implementation files in src/"
- Checklist item: "relevant_when": "Agent creates new test files"

For skill sources, consider adding a verifier specifically for activation — was the skill read/loaded by the agent? This only makes sense for skills (not docs or rules, which are always loaded via CLAUDE.md).
```json
{
  "instruction": "Activate the frontend-design skill when building UI",
  "relevant_when": "Agent is building or modifying frontend UI components",
  "context": "The frontend-design skill contains project-specific UI guidelines. It should be loaded via the skill tool when doing UI work, not just followed by coincidence.",
  "sources": [
    {
      "type": "file",
      "filename": "skills/frontend-design/SKILL.md",
      "tile": "anthropics/frontend-design@1.2.0"
    }
  ],
  "checklist": [
    {
      "name": "skill-activated",
      "rule": "Agent reads or activates the frontend-design skill (via skill tool or reading SKILL.md) before writing UI code",
      "relevant_when": "Agent writes or modifies frontend UI components"
    }
  ]
}
```

For docs/rules sources, skip activation verifiers — these files are loaded automatically and activation isn't meaningful.
By default, write verifier files into the source skill's verifiers/ directory (e.g. skills/my-skill/verifiers/). If the user specifies a different tile with --tile <path>, write there instead.
If the target tile doesn't exist yet, create it with tessl tile new first.
When verifiers are embedded inside a skill directory (skills/my-skill/verifiers/), omit the sources field entirely. The source is implicitly the skill the verifier lives inside — including an explicit sources array leads to it drifting out of sync with the actual location.
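A minimal sketch of an embedded verifier, with no sources field (the rule text here is illustrative, reusing the pdfplumber example from above):

```json
{
  "instruction": "Use pdfplumber for text and table extraction",
  "relevant_when": "When extracting text or tables from PDFs",
  "context": "The skill recommends pdfplumber over PyPDF2 and tabula-py for its consistent API and better table extraction.",
  "checklist": [
    {
      "name": "uses-pdfplumber",
      "rule": "Agent uses pdfplumber (not PyPDF2 or tabula-py) for PDF text or table extraction",
      "relevant_when": "When writing code that extracts text or tables from PDFs"
    }
  ]
}
```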
Include sources only when:

- The verifiers live in a standalone tile (a top-level verifiers/ not inside a skill)
- The source is a user request ("type": "user")
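The conventions above can be sanity-checked mechanically. A minimal sketch, assuming a verifier is "embedded" whenever its path passes through a skills/ directory (that heuristic, and the function name, are assumptions, not part of the spec):

```python
import json
from pathlib import Path

# Keys every verifier file must carry, per the examples in this document
REQUIRED_KEYS = {"instruction", "relevant_when", "context", "checklist"}

def validate_verifier(path: Path) -> list[str]:
    """Return a list of problems found in one verifier JSON file."""
    data = json.loads(path.read_text())
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - data.keys())]
    # Assumption: a verifiers/ dir under skills/ means the verifier is embedded
    embedded = "skills" in path.parts
    if embedded and "sources" in data:
        problems.append("embedded verifier should omit sources")
    if not embedded and "sources" not in data:
        problems.append("standalone verifier should declare sources")
    return problems
```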