judge

Launch a meta-judge then a judge sub-agent to evaluate results produced in the current conversation


Quality: 32%
Does it follow best practices?

Impact: Pending
No eval scenarios have been run

Security (by Snyk): Risky
Do not use without reviewing

Optimize this skill with Tessl

npx tessl skill review --optimize ./plugins/sadd/skills/judge/SKILL.md

Quality

Discovery: 17%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description uses internal/technical jargon ('meta-judge', 'sub-agent') that would not match natural user queries, and lacks an explicit 'Use when...' clause to guide skill selection. While it hints at a specific evaluation workflow, the vague scope ('results produced in the current conversation') and absence of concrete trigger terms make it difficult for Claude to reliably select this skill at the right time.

Suggestions

Add an explicit 'Use when...' clause with natural trigger terms like 'evaluate output', 'score results', 'judge quality', 'assess response', or 'grade answers'.

Specify what types of results are evaluated and what the evaluation produces (e.g., 'Scores outputs against rubrics and produces structured feedback with ratings').

Replace or supplement jargon like 'meta-judge' and 'sub-agent' with plain-language descriptions of the workflow so users' natural language queries can match.
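Taken together, the suggestions point toward a description like the following. This is an illustrative sketch, not the skill's actual frontmatter; the wording and the `name` value are assumptions.

```yaml
# Hypothetical revision of the skill's frontmatter (wording is illustrative,
# not taken from the actual SKILL.md).
name: judge
description: >
  Evaluate and score outputs produced earlier in the conversation. Builds an
  evaluation rubric, then grades the results against it and returns structured
  feedback with per-criterion ratings. Use when the user asks to evaluate
  output, score results, judge quality, assess a response, or grade an answer.
```

A description in this shape names the domain, states what the evaluation produces, and ends with a "Use when..." clause built from terms users actually type.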

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Specificity | Names the domain (evaluation/judging) and some actions ('launch a meta-judge', 'judge sub-agent', 'evaluate results'), but the actions are not comprehensively described: what kind of results? What does evaluation entail? The terms 'meta-judge' and 'sub-agent' are somewhat specific but lack concrete detail about what they produce. | 2 / 3 |
| Completeness | Provides a partial 'what' (launch judges to evaluate results) but has no explicit 'when' clause or trigger guidance. Per the rubric, a missing 'Use when...' clause caps completeness at 2, and the 'what' itself is also weak, so this scores a 1. | 1 / 3 |
| Trigger Term Quality | Uses technical jargon like 'meta-judge', 'sub-agent', and 'current conversation' that users would rarely naturally say. Missing natural trigger terms a user might use such as 'evaluate', 'score', 'grade', 'assess quality', 'review output', or 'judge results'. | 1 / 3 |
| Distinctiveness / Conflict Risk | The concept of a 'meta-judge' and 'judge sub-agent' is somewhat distinctive, but 'evaluate results produced in the current conversation' is broad enough to overlap with any evaluation, review, or quality-checking skill. | 2 / 3 |
| Total | | 6 / 12 (Passed) |

Implementation: 47%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

The skill defines a clear four-phase evaluation pipeline with good workflow sequencing and validation checkpoints. However, it is significantly over-verbose, repeating concepts like context isolation and evidence-based scoring multiple times across sections. The prompt templates provide structure but rely on placeholders and external agent instructions that aren't included, reducing actionability.

Suggestions

Cut the <task> and <context> blocks entirely: they explain the pattern to Claude rather than instructing it. Move any unique information into the workflow steps.

Remove the 'Important Guidelines' section or reduce it to 3-4 items that aren't already stated in the workflow phases (e.g., items 4, 6, and 8 are already covered in Phase 1 and Phase 4).

Provide a concrete, filled-in example of at least one prompt template (e.g., a sample meta-judge prompt for evaluating a Python module) so the skill is more actionable.

Split the scoring interpretation table and guidelines into a separate reference file to reduce the main skill's token footprint.
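The last suggestion can be sketched as a SKILL.md excerpt that applies progressive disclosure. The file name `references/scoring-rubric.md` is hypothetical; the point is that the main skill keeps a one-line pointer instead of the full table and guidelines.

```markdown
<!-- Hypothetical SKILL.md excerpt; the referenced file name is illustrative. -->
## Scoring

Score each dimension 0-3 and sum the results. For the score interpretation
table and the full scoring guidelines, read references/scoring-rubric.md
before dispatching the judge agent.
```

This keeps the main skill's token footprint small while still making the detailed rubric available when the agent actually needs it.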

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Conciseness | Extremely verbose for what it does. The skill explains concepts Claude already understands (what context isolation is, what evidence-based means, what bias mitigation is). The scoring interpretation table, extensive guidelines list, and notes section add significant token overhead. Much of the content restates the same principles multiple times (e.g., 'context isolation' is explained in the task, context, Phase 1, and guidelines sections). | 1 / 3 |
| Actionability | Provides structured prompt templates and dispatch instructions with Task tool usage, which is somewhat concrete. However, the prompts contain placeholder variables without clear examples of what filled-in values look like, and the meta-judge/judge agent instructions reference external 'agent instructions' that aren't provided. The workflow is more of a framework than copy-paste executable guidance. | 2 / 3 |
| Workflow Clarity | The four-phase workflow is clearly sequenced (Context Extraction → Meta-Judge → Judge Agent → Process Results) with explicit validation steps in Phase 4 (check score ranges, verify justifications, confirm calculations, check contradictions). Phase 2 explicitly states to wait before proceeding. The error recovery path ('If validation fails') is documented. | 3 / 3 |
| Progressive Disclosure | The content is a monolithic wall of text with no references to external files despite being complex enough to warrant splitting (e.g., prompt templates, scoring rubric, and guidelines could be separate files). No bundle files are provided, and the skill doesn't reference any supporting documents. The internal structure uses headers but everything is inline. | 2 / 3 |
| Total | | 8 / 12 (Passed) |

Validation: 90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 10 / 11 Passed

Validation for skill structure

| Criteria | Description | Result |
| --- | --- | --- |
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| Total | | 10 / 11 (Passed) |
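The frontmatter_unknown_keys warning is typically resolved by moving unrecognized top-level keys under a `metadata` block, as the message itself suggests. A sketch, where the `version` and `author` keys are invented for illustration (the report does not say which keys triggered the warning):

```yaml
# Before: hypothetical unknown top-level keys trigger the warning.
# name: judge
# description: ...
# version: 1.0.0
# author: example

# After: unrecognized keys moved under metadata; name and description stay top-level.
name: judge
description: ...
metadata:
  version: 1.0.0
  author: example
```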

Repository: NeoLabHQ/context-engineering-kit (Reviewed)
