This skill should be used when the user asks to "write LLMObs tests", "add tests for LLM Observability", "test an LLMObs plugin", "llmobs test", "llmobs spec", "test llm observability", "assertLlmObsSpanEvent", "useLlmObs", "getEvents", "MOCK_STRING", "MOCK_NOT_NULLISH", "MOCK_NUMBER", "MOCK_OBJECT", "VCR cassette", "record cassette", "replay cassette", "vcr proxy", "llmobs cassette", "test chat completions", "test streaming", "test embeddings", "test agent runs", "test orchestration", "test workflow", "llmobs span event", "LLMObs test strategy", "LlmObsCategory test", "LLM_CLIENT test", "MULTI_PROVIDER test", "ORCHESTRATION test", "INFRASTRUCTURE test", "span kind llm test", "span kind workflow test", "inputMessages", "outputMessages", "token metrics", "llmobs span validation", "cassette not generated", "re-record cassette", "127.0.0.1:9126", or needs to write, modify, or debug tests for any LLMObs plugin in dd-trace-js.
56
Quality: 62% (Does it follow best practices?)
Impact: not measured. No eval scenarios have been run.
Advisory: Suggest reviewing before use.
Optimize this skill with Tessl
npx tessl skill review --optimize ./.agents/skills/llmobs-testing/SKILL.md
Quality
Discovery
89%
Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This description excels at trigger term coverage and distinctiveness, providing an exhaustive list of phrases that would help Claude correctly select this skill for LLMObs testing tasks in dd-trace-js. However, it reads more like a keyword index than a capability description—it's heavy on 'when to use' triggers but light on explaining what concrete actions the skill enables (e.g., generating test scaffolding, configuring VCR cassettes, validating span events). The description would benefit from a brief opening sentence summarizing specific capabilities before the trigger list.
Suggestions
Add a concise opening sentence listing specific concrete actions, e.g., 'Generates, modifies, and debugs LLMObs plugin tests in dd-trace-js, including setting up VCR cassettes, writing span event assertions, and configuring mock values for test validation.'
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | The description mentions a domain (LLMObs tests in dd-trace-js) and some actions ('write, modify, or debug tests for any LLMObs plugin'), but the bulk of the description is a long list of trigger terms rather than concrete capability descriptions. It doesn't clearly list specific actions like 'generates test files', 'sets up VCR cassettes', or 'validates span events'. | 2 / 3 |
| Completeness | The description explicitly answers 'when' with a detailed 'Use when' clause listing many trigger phrases, and answers 'what' with 'write, modify, or debug tests for any LLMObs plugin in dd-trace-js.' The opening 'This skill should be used when...' structure clearly addresses both dimensions. | 3 / 3 |
| Trigger Term Quality | The description provides extensive coverage of natural trigger terms users would say, including exact phrases ('write LLMObs tests', 'add tests for LLM Observability'), specific API terms ('assertLlmObsSpanEvent', 'MOCK_STRING'), tool-specific terms ('VCR cassette', 'record cassette'), and test scenario terms ('test chat completions', 'test streaming'). This is thorough coverage of how users would naturally phrase requests. | 3 / 3 |
| Distinctiveness Conflict Risk | The description is highly specific to LLMObs plugin testing within dd-trace-js, a very narrow niche. The trigger terms are domain-specific (LLMObs, VCR cassettes, span events, dd-trace-js) and unlikely to conflict with general testing or other observability skills. | 3 / 3 |
| Total | | 11 / 12 Passed |
Implementation
35%
Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
The skill covers a complex testing domain comprehensively but suffers from significant redundancy — the category-determines-strategy message is hammered home at least 5 times in different sections. It lacks executable code examples despite being a testing skill where copy-paste-ready test templates would be highly valuable. The structure attempts progressive disclosure with reference files but undermines it by repeating reference content inline.
Suggestions
Eliminate redundant category strategy explanations — state the category→strategy mapping once in a concise table, then reference category-strategies.md for details
Add at least one complete, executable test example (e.g., a minimal LLM_CLIENT test with imports, setup, assertion) instead of describing the flow abstractly
Remove explanatory text Claude already knows (e.g., 'VCR records real API calls and replays them in tests for deterministic testing without external dependencies' — just say 'VCR replays recorded API responses')
Add a validation/troubleshooting checklist inline (e.g., 'If cassette not generated: check proxy is running at 127.0.0.1:9126, verify API key is set') rather than deferring all troubleshooting to references
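The inline troubleshooting checklist suggested above could also be expressed as a small preflight script. The following is a minimal sketch, not part of dd-trace-js: the function names `isPortOpen`, `collectProblems`, and `cassettePreflight` are hypothetical, and `apiKeyVar` is a placeholder for whichever environment variable the provider SDK under test reads. Only the proxy address 127.0.0.1:9126 comes from the skill description.

```javascript
const net = require('node:net')

// Resolve to true if something accepts TCP connections on host:port.
function isPortOpen (host, port, timeoutMs = 1000) {
  return new Promise((resolve) => {
    const socket = net.connect({ host, port })
    const finish = (open) => {
      socket.destroy()
      resolve(open)
    }
    socket.setTimeout(timeoutMs)
    socket.once('connect', () => finish(true))
    socket.once('timeout', () => finish(false))
    socket.once('error', () => finish(false))
  })
}

// Pure helper: turn the observed state into a list of readable problems.
// An empty list means the checklist passed.
function collectProblems ({ proxyReachable, host, port, apiKeyVar, env }) {
  const problems = []
  if (!proxyReachable) {
    problems.push(`VCR proxy is not reachable at ${host}:${port}`)
  }
  if (apiKeyVar && !env[apiKeyVar]) {
    problems.push(`${apiKeyVar} is not set`)
  }
  return problems
}

// Hypothetical entry point: probe the proxy, then report every problem at once.
async function cassettePreflight ({
  host = '127.0.0.1',
  port = 9126,
  apiKeyVar, // assumption: supplied by the caller, e.g. the provider's key variable
  env = process.env
} = {}) {
  const proxyReachable = await isPortOpen(host, port)
  return collectProblems({ proxyReachable, host, port, apiKeyVar, env })
}
```

Calling `cassettePreflight({ apiKeyVar: 'SOME_PROVIDER_API_KEY' })` before a recording run would return an array of messages to print. Collecting all problems rather than throwing on the first keeps the output usable as a checklist.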
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill is extremely verbose and repetitive. The same information about categories (LLM_CLIENT uses VCR, ORCHESTRATION doesn't) is repeated at least 5 times across different sections. The 'Purpose', 'When to Use', and 'Key Principles' sections largely duplicate the critical warning at the top. Many sections explain concepts Claude would already understand (what VCR does, what a proxy is, what deterministic testing means). | 1 / 3 |
| Actionability | The skill provides some concrete guidance (matcher names, field names, file locations, proxy URL) but lacks executable code examples. The 'Basic test flow' is described abstractly rather than shown as runnable code. The standard imports are listed as plain text rather than a proper import statement. Almost all concrete implementation is deferred to reference files that aren't provided. | 2 / 3 |
| Workflow Clarity | The test flow steps are listed (initialize → call → get events → assert) but lack validation checkpoints. There's no guidance on what to do when tests fail, no feedback loops for cassette recording issues, and no explicit verification steps. The VCR recording process mentions steps but defers details to a reference file. For a workflow involving destructive operations like cassette recording, the missing validation caps this at 2. | 2 / 3 |
| Progressive Disclosure | The skill references four well-organized reference files with clear descriptions, which is good structure. However, since no bundle files are provided, we can't verify these references exist. More importantly, the main SKILL.md contains too much inline content that repeats across sections rather than being concise overview content pointing to references. The category strategy information is repeated extensively inline when it should be summarized briefly and deferred to category-strategies.md. | 2 / 3 |
| Total | | 7 / 12 Passed |
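To make the "initialize → call → get events → assert" flow and the missing validation checkpoint concrete, here is a self-contained sketch. Everything in it is a hypothetical stand-in: `ANY_STRING`, `ANY_NUMBER`, `assertMatches`, `createFakeClient`, and the event field layout are illustrative only and are not the dd-trace-js helpers (`useLlmObs`, `getEvents`, `assertLlmObsSpanEvent`, `MOCK_*`), whose real signatures and semantics may differ.

```javascript
const { strictEqual, ok } = require('node:assert')

// Placeholders meaning "any value of this type". Hypothetical stand-ins,
// not the dd-trace-js MOCK_* matchers.
const ANY_STRING = Symbol('ANY_STRING')
const ANY_NUMBER = Symbol('ANY_NUMBER')

// Recursively compare an actual value against an expected shape.
// Keys present in `actual` but absent from `expected` are ignored.
function assertMatches (actual, expected, path = 'event') {
  if (expected === ANY_STRING) {
    strictEqual(typeof actual, 'string', `${path} should be a string`)
  } else if (expected === ANY_NUMBER) {
    strictEqual(typeof actual, 'number', `${path} should be a number`)
  } else if (expected !== null && typeof expected === 'object') {
    ok(actual !== null && typeof actual === 'object', `${path} should be an object`)
    for (const key of Object.keys(expected)) {
      assertMatches(actual[key], expected[key], `${path}.${key}`)
    }
  } else {
    strictEqual(actual, expected, `${path} mismatch`)
  }
}

// Steps 1-2 (initialize, call): a fake client that records one event per call.
// The event shape is an assumption made for this sketch.
function createFakeClient () {
  const events = []
  return {
    chat (prompt) {
      events.push({
        kind: 'llm',
        inputMessages: [{ role: 'user', content: prompt }],
        outputMessages: [{ role: 'assistant', content: 'hi there' }],
        metrics: { inputTokens: 3, outputTokens: 2 }
      })
    },
    getEvents: () => events
  }
}

// Steps 3-4 (get events, assert), with a checkpoint before the detailed check.
function runFlow () {
  const client = createFakeClient()
  client.chat('hello')
  const events = client.getEvents()
  // Checkpoint: fail early with a clear message if nothing was captured.
  strictEqual(events.length, 1, 'expected exactly one span event')
  assertMatches(events[0], {
    kind: 'llm',
    inputMessages: [{ role: 'user', content: 'hello' }],
    outputMessages: [{ role: 'assistant', content: ANY_STRING }],
    metrics: { inputTokens: ANY_NUMBER, outputTokens: ANY_NUMBER }
  })
  return events.length
}
```

The count check before the shape check is the kind of validation checkpoint the review asks for: an empty event list fails with a specific message instead of a confusing error on an undefined field.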
Validation
100%
Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 11 / 11 Passed
Validation for skill structure
No warnings or errors.
50aa025