running-tests

running tests at various levels from smoke tests to full suite to randomized tests

71

1.75x
Quality: 55% (Does it follow best practices?)

Impact: 100%, 1.75x (average score across 3 eval scenarios)

Security by Snyk: Passed. No known issues.

Optimize this skill with Tessl

npx tessl skill review --optimize ./.claude/skills/running-tests/SKILL.md

Quality

Discovery

32%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description provides a basic sense of the skill's domain (test execution at multiple levels) but lacks specificity about concrete actions, tools, or frameworks involved. It is missing an explicit 'Use when...' clause, which significantly hurts completeness and makes it harder for Claude to know when to select this skill. The trigger terms are reasonable but incomplete.

Suggestions

Add an explicit 'Use when...' clause, e.g., 'Use when the user asks to run tests, execute a test suite, or verify code changes with smoke, integration, or randomized testing.'

List more concrete actions and details, such as specific test frameworks supported, how test levels are selected, or what outputs are produced (e.g., 'Executes test suites using pytest/jest, supports smoke tests for quick validation, full regression suites, and randomized/fuzz testing').

Include additional natural trigger terms users might say, such as 'unit tests', 'regression tests', 'test runner', 'run my tests', or 'check if tests pass'.
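
The suggestions above can be read as simple checks on the description text. The following is a minimal sketch of such a check, not Tessl's actual rubric: the function name `lint_description`, the `TRIGGER_TERMS` list, and the `revised` description are all hypothetical and only illustrate what a description with a 'Use when...' clause might look like.

```python
import re

# Hypothetical trigger terms, copied from the suggestion above.
# This is an illustrative heuristic, not the rubric Tessl applies.
TRIGGER_TERMS = (
    "unit tests",
    "regression tests",
    "test runner",
    "run my tests",
    "check if tests pass",
)


def lint_description(description):
    """Report whether a skill description has a 'Use when' clause and
    which of the assumed trigger terms it mentions."""
    text = description.lower()
    return {
        "has_use_when": re.search(r"\buse when\b", text) is not None,
        "trigger_terms_found": [t for t in TRIGGER_TERMS if t in text],
    }


# The description as currently published for the skill.
current = (
    "running tests at various levels from smoke tests to full suite "
    "to randomized tests"
)

# A hypothetical rewrite that follows the suggestions; wording is assumed.
revised = (
    "Runs stellar-core tests at several levels, from smoke tests to the "
    "full suite to randomized tests. Use when the user asks to run tests, "
    "execute a test suite, or check if tests pass after a code change."
)

print(lint_description(current))
print(lint_description(revised))
```

A substring check like this only catches missing phrases. It says nothing about whether the description is specific enough, which still needs a human or model reviewer.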

Dimension | Reasoning | Score
--- | --- | ---
Specificity | Names the domain (testing) and lists some actions (smoke tests, full suite, randomized tests), but lacks concrete details about what 'running tests' entails: no mention of specific frameworks, commands, or outputs. | 2 / 3
Completeness | Describes what it does (running tests at various levels) but has no explicit 'Use when...' clause or equivalent trigger guidance, which per the rubric caps completeness at 2. The 'what' itself is also fairly weak and incomplete, placing this at 1. | 1 / 3
Trigger Term Quality | Includes some relevant keywords like 'smoke tests', 'full suite', and 'randomized tests' that users might say, but misses common variations like 'unit tests', 'integration tests', 'test runner', 'run tests', or specific framework names. | 2 / 3
Distinctiveness Conflict Risk | The mention of specific test levels (smoke, full suite, randomized) provides some distinctiveness, but 'running tests' is broad enough to overlap with other testing-related skills like test writing, CI/CD, or debugging. | 2 / 3
Total | | 7 / 12

Passed

Implementation

77%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This is a strong, highly actionable skill with excellent workflow clarity and concrete, executable commands throughout. Its main weakness is length — at ~300 lines it could benefit from splitting reference material (test tag catalog, sanitizer configurations) into separate files. The content is well-organized but includes some sections that explain concepts Claude already knows (common failure patterns, what sanitizers catch).

Suggestions

Extract the 'Example Test Names by Area' catalog into a separate reference file (e.g., TEST_TAGS.md) and link to it, reducing the main skill's token footprint.

Trim the 'Interpreting Failures' and 'Common Failure Patterns' sections — Claude already knows what assertion failures, segfaults, and timeouts mean; focus only on project-specific diagnostic steps.
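
The extraction suggested above can be done mechanically. The sketch below moves one headed section out of a Markdown document and leaves a link in its place. It assumes the skill uses `#`-style headings and that the section title matches exactly. The helper names `extract_section` and `_headings` and the sample document are hypothetical; only the section title and `TEST_TAGS.md` come from the suggestion.

```python
import re

FENCE = "`" * 3  # Markdown code-fence marker
HEADING = re.compile(r"^(#+)\s+(.*?)\s*$")


def _headings(lines):
    """Yield (index, level, title) for headings outside fenced code blocks,
    so that '# comment' lines in shell snippets are not treated as headings."""
    in_fence = False
    for i, line in enumerate(lines):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        m = None if in_fence else HEADING.match(line)
        if m:
            yield i, len(m.group(1)), m.group(2)


def extract_section(markdown, title, ref_file):
    """Return (trimmed_markdown, section_text).

    The section runs from its heading to the next heading of the same or a
    higher level. In the trimmed document the heading is kept and the body
    is replaced by a link to ref_file.
    """
    lines = markdown.splitlines()
    heads = list(_headings(lines))
    match = next(((i, lvl) for i, lvl, t in heads if t == title), None)
    if match is None:
        return markdown, ""
    start, level = match
    end = next((i for i, lvl, _ in heads if i > start and lvl <= level), len(lines))
    pointer = [lines[start], "", f"See [{ref_file}]({ref_file}).", ""]
    trimmed = "\n".join(lines[:start] + pointer + lines[end:]) + "\n"
    section = "\n".join(lines[start:end]).rstrip() + "\n"
    return trimmed, section


# Hypothetical stand-in for the skill file; the real SKILL.md is not shown here.
skill = "\n".join([
    "# Running Tests",
    "",
    "## Example Test Names by Area",
    "",
    FENCE,
    "# a shell comment, not a heading",
    FENCE,
    "",
    "## Choosing the Right Test Level",
    "",
    "Start small.",
])

trimmed, section = extract_section(skill, "Example Test Names by Area", "TEST_TAGS.md")
print(trimmed)
```

The returned `section` string would be written to the reference file. Keeping the heading in the main document preserves the table of contents while moving the bulk of the tokens out.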

Dimension | Reasoning | Score
--- | --- | ---
Conciseness | The skill is mostly efficient with good use of concrete commands, but includes some unnecessary content like explaining common failure patterns Claude already knows (assertion failures, segfaults, timeouts), and the 'Interpreting Failures' section is somewhat generic. The example test names by area section is valuable domain knowledge but is quite lengthy. | 2 / 3
Actionability | Excellent actionability throughout: every test level has fully executable, copy-paste-ready commands with specific flags. The tag patterns, configure commands, and environment variables are all concrete and specific to the stellar-core project. | 3 / 3
Workflow Clarity | The multi-level test progression is clearly sequenced with explicit stop-on-failure semantics. Validation checkpoints are built into the workflow (--abort flag, baseline checks with fixed seeds), and there's a clear feedback loop for failures (identify → capture → analyze → locate). The 'Choosing the Right Test Level' section provides good decision guidance. | 3 / 3
Progressive Disclosure | The content is well-structured with clear headers and logical sections, but it's a long monolithic document (~300 lines) with no references to external files. The example test names section and some of the sanitizer configurations could be split into separate reference files to keep the main skill leaner. | 2 / 3
Total | | 10 / 12

Passed
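
The stop-on-failure progression credited under Workflow Clarity can be sketched as a small runner that executes test levels in order and halts at the first failing one. This is an illustration under stated assumptions, not the skill's own tooling: `run_levels` is a hypothetical helper, and the commands are placeholders that only exit with a fixed status rather than real stellar-core invocations.

```python
import subprocess
import sys


def run_levels(levels):
    """Run (name, argv) pairs in order and stop at the first non-zero exit.

    Returns (names_run, failed_name); failed_name is None if every level passed.
    """
    ran = []
    for name, argv in levels:
        ran.append(name)
        if subprocess.run(argv).returncode != 0:
            return ran, name
    return ran, None


# Placeholder commands: each stands in for a real test invocation and simply
# exits with a chosen status. Substitute the project's actual commands.
passing = [sys.executable, "-c", "raise SystemExit(0)"]
failing = [sys.executable, "-c", "raise SystemExit(1)"]

levels = [
    ("smoke", passing),
    ("full suite", failing),
    ("randomized", passing),
]

ran, failed = run_levels(levels)
print("ran:", ran)
print("failed:", failed)
```

Because the second level fails, the randomized level never starts, which keeps the expensive levels from running on a build that is already known to be broken.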

Validation

100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 11 / 11 passed

Validation for skill structure

No warnings or errors.

Repository
stellar/stellar-core
Reviewed
