
he-eval-report

Generate closure-grade HE eval and drift proof for one execution slice. Use when Linear, milestone, or source-prompt closure needs validation evidence.

50

Quality

55%

Does it follow best practices?

Impact

No eval scenarios have been run

Security by Snyk

Passed

No known issues

Optimize this skill with Tessl

npx tessl skill review --optimize ./Plugins/harness-engineering/skills/he-eval-report/SKILL.md

Quality

Content

35%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This skill is a comprehensive but overly verbose specification for generating eval reports. It has good structural elements (concrete validation commands, clear output paths, explicit failure handling) but suffers from excessive policy language, redundant constraints, and abstract procedural steps that obscure the core workflow. The content would benefit significantly from aggressive trimming and offloading detailed policy to referenced contracts rather than restating them inline.

Suggestions

Reduce the procedure to 5-6 concrete steps focused on what to do, moving policy checks (first-principles, XP, gate-selection, domain-model checks) to a referenced contract rather than listing them inline.

Add a concrete example of a minimal eval report output (even abbreviated) so Claude can see the expected format rather than relying entirely on an external template reference.

Consolidate Evidence Requirements, Safety Boundaries, and Gotchas into a single concise 'Constraints' section, removing redundancy with statements already in the Procedure.

Move the extensive References section into a separate index file and keep only the 3-4 most critical references inline, since the current 15+ references create cognitive overload.

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Conciseness | The skill is extremely verbose (~200+ lines) with extensive repetition and over-specification. It explains numerous concepts, contracts, and policies that Claude could infer or that could be offloaded to referenced files. Many sections (Evidence Requirements, Safety Boundaries, Gotchas) repeat constraints already stated elsewhere in the document. The sheer volume of cross-references and procedural detail far exceeds what's needed for a skill that essentially writes an eval report. | 1 / 3 |
| Actionability | The skill provides concrete validation commands (step-by-step python3 scripts with paths) and a specific output file naming convention, which is good. However, much of the procedure is abstract policy language ('Apply first-principles, XP, gate-selection, plugin-hook capability checks') rather than executable steps. The actual report-writing process relies heavily on external templates and contracts without showing concrete examples of what the output looks like. | 2 / 3 |
| Workflow Clarity | The 11-step procedure is sequenced and includes validation gates and a fail-fast policy, which is positive. However, the steps mix high-level policy checks with concrete actions, making the actual workflow hard to follow. Some steps are conditional and vaguely scoped ('only when they are relevant to closure'). The validation section has explicit commands and pass/fail recording, but the overall flow from start to finished report is obscured by the density of cross-cutting concerns. | 2 / 3 |
| Progressive Disclosure | The References section provides extensive one-level-deep links to contracts, templates, schemas, and taxonomies, which is good structure. However, without bundle files to verify these references exist, and given that the main body is itself a monolithic wall of dense text that could benefit from splitting (e.g., Evidence Requirements, Safety Boundaries, and Validation could be separate reference docs), the disclosure is only partially effective. The body tries to be both overview and comprehensive reference simultaneously. | 2 / 3 |

Total: 7 / 12 (Passed)

Description

75%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description has good structural completeness with explicit 'what' and 'when' clauses, and its niche focus makes it distinctive. However, it relies heavily on domain-specific jargon that may not match natural user language, and the specific actions being performed are not clearly articulated beyond the opaque terms 'HE eval' and 'drift proof'.

Suggestions

Expand jargon terms like 'HE eval' and 'drift proof' into plain language so Claude can better match user requests (e.g., 'Generate human evaluation assessments and consistency/drift analysis reports').

Add natural language trigger variations that users might actually say, such as 'validate milestone completion', 'check for quality drift', or 'generate evaluation evidence'.

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Specificity | It names a domain ('HE eval and drift proof') and an action ('Generate'), but the terms 'closure-grade', 'execution slice', and 'drift proof' are jargon-heavy and don't clearly describe concrete, understandable actions. It's not as vague as 'helps with documents' but falls short of listing multiple specific concrete actions. | 2 / 3 |
| Completeness | It explicitly answers both 'what' (generate closure-grade HE eval and drift proof for one execution slice) and 'when' (Use when Linear, milestone, or source-prompt closure needs validation evidence), with a clear 'Use when...' clause containing explicit triggers. | 3 / 3 |
| Trigger Term Quality | It includes some potentially useful trigger terms like 'Linear', 'milestone', 'closure', and 'validation evidence', but terms like 'HE eval', 'drift proof', and 'execution slice' are highly specialized jargon that users are unlikely to naturally say. Missing common variations or more accessible synonyms. | 2 / 3 |
| Distinctiveness Conflict Risk | The description is highly specialized with niche terminology ('closure-grade HE eval', 'drift proof', 'execution slice') that is unlikely to conflict with other skills. Its narrow focus makes it clearly distinguishable. | 3 / 3 |

Total: 10 / 12 (Passed)
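
The Content and Description totals are plain sums of the per-dimension scores, each out of 3. A minimal Python sketch of that tally follows. The names `CONTENT`, `DESCRIPTION`, and `tally` are illustrative helpers and not part of Tessl's tooling. The percentage shown beside each section heading is deliberately not derived here, because the page does not state how it is computed.

```python
# Per-dimension scores copied from the Content and Description rubrics above.
# CONTENT, DESCRIPTION, and tally are hypothetical names used for illustration.
CONTENT = {
    "Conciseness": 1,
    "Actionability": 2,
    "Workflow Clarity": 2,
    "Progressive Disclosure": 2,
}
DESCRIPTION = {
    "Specificity": 2,
    "Completeness": 3,
    "Trigger Term Quality": 2,
    "Distinctiveness Conflict Risk": 3,
}


def tally(scores, max_per_dimension=3):
    """Return (total, maximum) for a rubric scored out of max_per_dimension."""
    return sum(scores.values()), max_per_dimension * len(scores)


for name, scores in (("Content", CONTENT), ("Description", DESCRIPTION)):
    total, maximum = tally(scores)
    print(f"{name}: {total} / {maximum}")
```

Run as written, this prints `Content: 7 / 12` and `Description: 10 / 12`, matching the totals reported for the two rubrics.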

Validation

90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 10 / 11 Passed

Validation for skill structure

| Criteria | Description | Result |
| --- | --- | --- |
| metadata_version | 'metadata.version' is missing | Warning |

Total: 10 / 11 (Passed)
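
The single warning is that `metadata.version` is missing. As a rough illustration of what such a check might involve, the sketch below assumes the SKILL.md frontmatter has already been parsed into a nested dictionary. The function name, the sample frontmatter keys, and the version string `0.1.0` are all assumptions made for illustration; this is not Tessl's validator, and the skill's real frontmatter may differ.

```python
def has_metadata_version(frontmatter):
    """Return True if the parsed frontmatter carries a non-empty metadata.version.

    Hypothetical check for illustration; not Tessl's actual validation logic.
    """
    metadata = frontmatter.get("metadata")
    return isinstance(metadata, dict) and bool(metadata.get("version"))


# Assumed shape of the current frontmatter: no metadata block at all.
current = {
    "name": "he-eval-report",
    "description": (
        "Generate closure-grade HE eval and drift proof for one execution slice. "
        "Use when Linear, milestone, or source-prompt closure needs validation evidence."
    ),
}

# Same frontmatter with a metadata.version added (placeholder value).
fixed = {**current, "metadata": {"version": "0.1.0"}}

print(has_metadata_version(current))  # False
print(has_metadata_version(fixed))    # True
```

Under these assumptions, adding a `metadata` mapping with a `version` key to the frontmatter is the kind of change that would address the warning.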

Repository
jscraik/Agent-Skills
Reviewed
