
judge

Launch a meta-judge then a judge sub-agent to evaluate results produced in the current conversation

33

Quality

28%

Does it follow best practices?

Impact

No eval scenarios have been run

Security by Snyk

Risky

Do not use without reviewing

Fix and improve this skill with Tessl

tessl review fix ./plugins/sadd/skills/judge/SKILL.md

Quality

Content

39%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

The skill has a well-structured four-phase workflow with clear sequencing and validation checkpoints, which is its strongest aspect. However, it is excessively verbose—explaining concepts like context isolation, bias mitigation, and evidence-based evaluation that Claude already understands—consuming significant token budget. The monolithic structure and reliance on undefined external references (agent instructions, CLAUDE_PLUGIN_ROOT) weaken both progressive disclosure and actionability.

Suggestions

Cut the <context> block, scoring interpretation table, notes section, and most of the 'Important Guidelines' list—these explain concepts Claude already knows and consume ~40% of the token budget unnecessarily.

Extract the scoring interpretation table and guidelines into a separate REFERENCE.md file, keeping SKILL.md focused on the four-phase workflow.

Replace placeholder-heavy prompt templates with a concrete, minimal example showing actual values for one realistic evaluation scenario (e.g., evaluating a Python script).

Define or link to the referenced 'agent instructions' and explain what CLAUDE_PLUGIN_ROOT resolves to, since these are critical dependencies that are currently undefined.

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Conciseness | Extremely verbose at ~150+ lines. Extensively explains concepts Claude already understands (what context isolation is, what evidence-based means, bias types). The scoring interpretation table, extensive guidelines list, and notes section add significant token overhead. Much of the 'context' section and 'important guidelines' restate obvious evaluation principles. | 1 / 3 |
| Actionability | Provides structured prompt templates and dispatch instructions with placeholder variables, which is somewhat concrete. However, the prompts are templates with placeholders rather than fully executable examples, the meta-judge and judge agent instructions reference external 'agent instructions' that aren't provided, and the Task tool dispatch format is pseudocode-like rather than exact API calls. | 2 / 3 |
| Workflow Clarity | The four-phase workflow (Context Extraction → Meta-Judge → Judge → Process Results) is clearly sequenced with explicit dependencies ('Wait for the meta-judge to complete before proceeding'). Phase 4 includes validation checkpoints with specific checks (score range validation, contradiction detection) and a feedback loop for re-evaluation if validation fails. | 3 / 3 |
| Progressive Disclosure | All content is in a single monolithic file with no references to supporting files, despite the complexity warranting separation. The scoring interpretation table, guidelines, and notes could be in separate reference files. No bundle files are provided, and the skill references 'agent instructions' and CLAUDE_PLUGIN_ROOT without providing or linking to them. | 1 / 3 |
| Total | | 7 / 12 |

Passed
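
The Phase 4 checks named in the Workflow Clarity row (score range validation and contradiction detection) could be sketched roughly as follows. The review does not show the skill's actual report format, so the `JudgeReport` shape, the 1 to 5 score range, and the pass threshold of 3.0 are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class JudgeReport:
    """Hypothetical shape of a judge sub-agent's output (not taken from the skill)."""
    criterion_scores: dict[str, float]
    overall_score: float
    verdict: str  # assumed to be either "pass" or "fail"


def validate_report(report, min_score=1.0, max_score=5.0, pass_threshold=3.0):
    """Return a list of problems; an empty list means the report looks consistent.

    The score range and threshold defaults are assumptions, not the skill's values.
    """
    problems = []

    # Score range validation: every score must lie inside the allowed range.
    for name, score in report.criterion_scores.items():
        if not (min_score <= score <= max_score):
            problems.append(f"{name}: score {score} is outside [{min_score}, {max_score}]")
    if not (min_score <= report.overall_score <= max_score):
        problems.append(f"overall score {report.overall_score} is outside the allowed range")

    # Contradiction detection: the verdict must agree with the overall score.
    if report.verdict == "pass" and report.overall_score < pass_threshold:
        problems.append("verdict 'pass' contradicts an overall score below the threshold")
    if report.verdict == "fail" and report.overall_score >= pass_threshold:
        problems.append("verdict 'fail' contradicts an overall score at or above the threshold")

    return problems


if __name__ == "__main__":
    suspicious = JudgeReport({"correctness": 7}, overall_score=2.0, verdict="pass")
    for problem in validate_report(suspicious):
        print(problem)
```

A non-empty result would be the trigger for the re-evaluation feedback loop the review mentions; how the skill actually re-dispatches the judge is not shown here.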

Description

17%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description is too jargon-heavy and vague to serve as an effective skill selector. It lacks natural trigger terms users would use, provides no explicit 'when to use' guidance, and doesn't clearly describe what concrete actions are performed or what outputs are produced. The internal architecture terms ('meta-judge', 'sub-agent') are implementation details rather than user-facing capability descriptions.

Suggestions

Add a 'Use when...' clause with natural trigger terms like 'evaluate', 'review quality', 'score output', 'assess results', or 'grade response'.

Replace implementation jargon ('meta-judge', 'sub-agent') with user-facing descriptions of what the skill actually does, e.g., 'Evaluates the quality of outputs using structured rubrics and multi-step assessment'.

Specify what types of results are evaluated and what the evaluation output looks like (e.g., scores, feedback, pass/fail judgments).
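
As a rough illustration of the suggestions above, a description can be checked mechanically for a 'Use when' clause and for the trigger terms the review lists. This is only a sketch: `check_description` is a hypothetical helper, not part of Tessl, and the revised description is an invented example assembled from the reviewer's own wording.

```python
# Trigger terms quoted in the review's first suggestion.
TRIGGER_TERMS = (
    "evaluate",
    "review quality",
    "score output",
    "assess results",
    "grade response",
)


def check_description(description):
    """Report whether a description has a 'Use when' clause and which terms it contains.

    Hypothetical helper: simple substring matching, not Tessl's scoring logic.
    """
    text = description.lower()
    return {
        "has_use_when": "use when" in text,
        "trigger_terms": [term for term in TRIGGER_TERMS if term in text],
    }


# The skill's current description, as shown at the top of this page.
current = (
    "Launch a meta-judge then a judge sub-agent to evaluate results "
    "produced in the current conversation"
)

# An illustrative rewrite (an assumption, not the maintainers' text).
revised = (
    "Evaluates the quality of outputs using structured rubrics and multi-step "
    "assessment. Use when the user asks to evaluate, score output, or assess results."
)

if __name__ == "__main__":
    print(check_description(current))
    print(check_description(revised))
```

Substring matching is deliberately crude; it only shows that the current description lacks the 'Use when' clause while a rewrite along the suggested lines would include it.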

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Specificity | Names a domain (evaluation/judging) and some actions ('launch a meta-judge', 'judge sub-agent', 'evaluate results'), but the actions are not concretely specified: what kind of results? What does evaluation entail? The terms 'meta-judge' and 'sub-agent' are somewhat specific but lack detail about concrete outputs or operations. | 2 / 3 |
| Completeness | Provides a weak 'what' (launch judges to evaluate results) but has no explicit 'when' clause or trigger guidance. The absence of a 'Use when...' clause caps this at 2 per the rubric, and the 'what' is also vague enough to warrant a 1. | 1 / 3 |
| Trigger Term Quality | Uses technical jargon like 'meta-judge', 'sub-agent', and 'evaluate results' which are not terms a user would naturally say. Missing natural trigger terms like 'review', 'score', 'assess quality', 'grade', or 'check output'. | 1 / 3 |
| Distinctiveness Conflict Risk | The mention of 'meta-judge' and 'judge sub-agent' provides some distinctiveness from generic evaluation skills, but 'evaluate results produced in the current conversation' is broad enough to potentially overlap with any quality-checking or review skill. | 2 / 3 |
| Total | | 6 / 12 |

Passed

Validation

90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 10 / 11 passed

Validation for skill structure

| Criteria | Description | Result |
| --- | --- | --- |
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| Total | | 10 / 11 |

Passed
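
A check like `frontmatter_unknown_keys` could look roughly like the sketch below. The page does not say which key triggered the warning or which keys the spec allows, so the `ALLOWED_KEYS` set and the `argument-hint` key in the sample are assumptions for illustration. The parser is a deliberately small line-based one that reads only top-level keys and is not a full YAML implementation.

```python
# Assumed set of recognised top-level keys; the real spec may allow more.
ALLOWED_KEYS = {"name", "description", "metadata"}


def top_level_keys(skill_md):
    """Return the top-level keys of the frontmatter block at the start of the text."""
    lines = skill_md.splitlines()
    if not lines or lines[0].strip() != "---":
        return []  # no frontmatter block present

    keys = []
    for line in lines[1:]:
        if line.strip() == "---":
            break  # closing delimiter ends the frontmatter
        # Top-level keys start in column 0; skip nested lines, comments, list items.
        if line and not line[0].isspace() and not line.startswith(("#", "-")) and ":" in line:
            keys.append(line.split(":", 1)[0].strip())
    return keys


def unknown_keys(skill_md, allowed=ALLOWED_KEYS):
    """Return frontmatter keys that are not in the allowed set."""
    return [key for key in top_level_keys(skill_md) if key not in allowed]


# Hypothetical SKILL.md content; "argument-hint" is an invented example key.
SAMPLE = """---
name: judge
description: Launch a meta-judge then a judge sub-agent to evaluate results produced in the current conversation
argument-hint: optional focus area
---
# Body
"""

if __name__ == "__main__":
    print(unknown_keys(SAMPLE))
```

Following the warning's advice, a flagged key would either be removed or nested under `metadata`, at which point it is no longer a top-level key and this check stops reporting it.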

Repository
NeoLabHQ/context-engineering-kit
Reviewed
