llmobs-testing

This skill should be used when the user asks to "write LLMObs tests", "add tests for LLM Observability", "test an LLMObs plugin", "llmobs test", "llmobs spec", "test llm observability", "assertLlmObsSpanEvent", "useLlmObs", "getEvents", "MOCK_STRING", "MOCK_NOT_NULLISH", "MOCK_NUMBER", "MOCK_OBJECT", "VCR cassette", "record cassette", "replay cassette", "vcr proxy", "llmobs cassette", "test chat completions", "test streaming", "test embeddings", "test agent runs", "test orchestration", "test workflow", "llmobs span event", "LLMObs test strategy", "LlmObsCategory test", "LLM_CLIENT test", "MULTI_PROVIDER test", "ORCHESTRATION test", "INFRASTRUCTURE test", "span kind llm test", "span kind workflow test", "inputMessages", "outputMessages", "token metrics", "llmobs span validation", "cassette not generated", "re-record cassette", "127.0.0.1:9126", or needs to write, modify, or debug tests for any LLMObs plugin in dd-trace-js.

Quality

62%

Does it follow best practices?

Impact

—

No eval scenarios have been run

Security

Advisory

Suggest reviewing before use

Optimize this skill with Tessl

npx tessl skill review --optimize ./.agents/skills/llmobs-testing/SKILL.md

Discovery

89%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

This description excels at trigger-term coverage and distinctiveness, providing an exhaustive list of phrases that would help Claude correctly select this skill for LLMObs testing tasks in dd-trace-js. However, it reads more like a keyword index than a capability description: it is heavy on 'when to use' triggers but light on the concrete actions the skill enables (e.g., generating test scaffolding, configuring VCR cassettes, validating span events). A brief opening sentence summarizing specific capabilities before the trigger list would help.

Suggestions

Add a concise opening sentence listing specific concrete actions, e.g., 'Generates, modifies, and debugs LLMObs plugin tests in dd-trace-js, including setting up VCR cassettes, writing span event assertions, and configuring mock values for test validation.'

Dimension	Reasoning	Score
Specificity	The description mentions a domain (LLMObs tests in dd-trace-js) and some actions ('write, modify, or debug tests for any LLMObs plugin'), but the bulk of the description is a long list of trigger terms rather than concrete capability descriptions. It doesn't clearly list specific actions like 'generates test files', 'sets up VCR cassettes', or 'validates span events'.	2 / 3
Completeness	The description explicitly answers 'when' with a detailed 'Use when' clause listing many trigger phrases, and answers 'what' with 'write, modify, or debug tests for any LLMObs plugin in dd-trace-js.' The opening 'This skill should be used when...' structure clearly addresses both dimensions.	3 / 3
Trigger Term Quality	The description provides extensive coverage of natural trigger terms users would say, including exact phrases ('write LLMObs tests', 'add tests for LLM Observability'), specific API terms ('assertLlmObsSpanEvent', 'MOCK_STRING'), tool-specific terms ('VCR cassette', 'record cassette'), and test scenario terms ('test chat completions', 'test streaming'). This is thorough coverage of how users would naturally phrase requests.	3 / 3
Distinctiveness Conflict Risk	The description is highly specific to LLMObs plugin testing within dd-trace-js, a very narrow niche. The trigger terms are domain-specific (LLMObs, VCR cassettes, span events, dd-trace-js) and unlikely to conflict with general testing or other observability skills.	3 / 3
	Total	11 / 12 Passed

Implementation

35%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

The skill covers a complex testing domain comprehensively but suffers from significant redundancy: the message that the category determines the test strategy is repeated at least 5 times across different sections. It lacks executable code examples, even though copy-paste-ready test templates would be highly valuable in a testing skill. The structure attempts progressive disclosure through reference files but undermines it by repeating reference content inline.

Suggestions

Eliminate redundant category strategy explanations: state the category-to-strategy mapping once in a concise table, then reference category-strategies.md for details

Add at least one complete, executable test example (e.g., a minimal LLM_CLIENT test with imports, setup, assertion) instead of describing the flow abstractly

Remove explanatory text Claude already knows (e.g., shorten 'VCR records real API calls and replays them in tests for deterministic testing without external dependencies' to 'VCR replays recorded API responses')

Add a validation/troubleshooting checklist inline (e.g., 'If cassette not generated: check proxy is running at 127.0.0.1:9126, verify API key is set') rather than deferring all troubleshooting to references
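As an illustration of the inline troubleshooting checklist suggested above, here is a minimal sketch of the two checks (proxy reachable, API key set) as a small script. Both function names are hypothetical and are not part of dd-trace-js. The sketch also assumes that the proxy at 127.0.0.1:9126 answers plain HTTP and that Node 18+ is available for the global `fetch`.

```javascript
// Hypothetical helper, not part of dd-trace-js.
// Turns the two checks from the suggestion into a list of actionable hints.
function diagnoseCassetteSetup ({ proxyReachable, apiKey }) {
  const hints = []
  if (!proxyReachable) {
    hints.push('VCR proxy is not reachable at 127.0.0.1:9126; start it before recording.')
  }
  if (!apiKey) {
    hints.push('Provider API key is not set; verify it before recording a cassette.')
  }
  return hints
}

// Hypothetical probe: any HTTP response counts as "reachable",
// while a network error or a 1-second timeout counts as "not reachable".
// Assumes the proxy speaks plain HTTP on the given URL.
async function isProxyReachable (url = 'http://127.0.0.1:9126') {
  try {
    await fetch(url, { signal: AbortSignal.timeout(1000) })
    return true
  } catch {
    return false
  }
}
```

A caller might combine the two as `diagnoseCassetteSetup({ proxyReachable: await isProxyReachable(), apiKey: process.env.OPENAI_API_KEY })`, where `OPENAI_API_KEY` is only an example variable name; an empty result means neither of the two common causes applies.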

Dimension	Reasoning	Score
Conciseness	The skill is extremely verbose and repetitive. The same information about categories (LLM_CLIENT uses VCR, ORCHESTRATION doesn't) is repeated at least 5 times across different sections. The 'Purpose', 'When to Use', and 'Key Principles' sections largely duplicate the critical warning at the top. Many sections explain concepts Claude would already understand (what VCR does, what a proxy is, what deterministic testing means).	1 / 3
Actionability	The skill provides some concrete guidance (matcher names, field names, file locations, proxy URL) but lacks executable code examples. The 'Basic test flow' is described abstractly rather than shown as runnable code. The standard imports are listed as plain text rather than a proper import statement. Almost all concrete implementation is deferred to reference files that aren't provided.	2 / 3
Workflow Clarity	The test flow steps are listed (initialize → call → get events → assert) but lack validation checkpoints. There's no guidance on what to do when tests fail, no feedback loops for cassette recording issues, and no explicit verification steps. The VCR recording process mentions steps but defers details to a reference file. For a workflow involving destructive operations like cassette recording, the missing validation caps this at 2.	2 / 3
Progressive Disclosure	The skill references four well-organized reference files with clear descriptions, which is good structure. However, since no bundle files are provided, we can't verify these references exist. More importantly, the main SKILL.md contains too much inline content that repeats across sections rather than being concise overview content pointing to references. The category strategy information is repeated extensively inline when it should be summarized briefly and deferred to category-strategies.md.	2 / 3
	Total	7 / 12 Passed
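The Actionability and Workflow Clarity rows both ask for a runnable version of the "get events, then assert" step. The sketch below shows the general idea of asserting a span event against placeholder matchers. It is self-contained and uses stand-ins: the `MOCK_*` symbols and `diffSpanEvent` here are hypothetical reimplementations for illustration, not the real dd-trace-js test utilities (`assertLlmObsSpanEvent`, `getEvents`), whose signatures are not shown in this review. The `metrics` field names are likewise assumed.

```javascript
// Stand-in sentinels. The real MOCK_* matchers live in the dd-trace-js
// test utilities and may behave differently.
const MOCK_STRING = Symbol('MOCK_STRING')
const MOCK_NUMBER = Symbol('MOCK_NUMBER')
const MOCK_NOT_NULLISH = Symbol('MOCK_NOT_NULLISH')

// Hypothetical comparison: walks `expected` and returns the dotted paths
// where `actual` does not match. Sentinels match by type, not by value.
function diffSpanEvent (actual, expected, path = '') {
  const mismatches = []
  for (const [key, want] of Object.entries(expected)) {
    const got = actual?.[key]
    const here = path ? `${path}.${key}` : key
    if (want === MOCK_STRING) {
      if (typeof got !== 'string') mismatches.push(here)
    } else if (want === MOCK_NUMBER) {
      if (typeof got !== 'number') mismatches.push(here)
    } else if (want === MOCK_NOT_NULLISH) {
      if (got == null) mismatches.push(here)
    } else if (want !== null && typeof want === 'object') {
      // Nested objects and arrays are compared recursively.
      mismatches.push(...diffSpanEvent(got, want, here))
    } else if (got !== want) {
      mismatches.push(here)
    }
  }
  return mismatches
}

// Example span event, shaped loosely after the fields named in the skill
// description (inputMessages, outputMessages, token metrics).
const event = {
  spanKind: 'llm',
  inputMessages: [{ role: 'user', content: 'Hello' }],
  outputMessages: [{ role: 'assistant', content: 'Hi there!' }],
  metrics: { input_tokens: 3, output_tokens: 4 }
}

// Non-deterministic values (model output, token counts) use sentinels.
const expected = {
  spanKind: 'llm',
  inputMessages: [{ role: 'user', content: 'Hello' }],
  outputMessages: [{ role: 'assistant', content: MOCK_STRING }],
  metrics: { input_tokens: MOCK_NUMBER, output_tokens: MOCK_NUMBER }
}

const mismatches = diffSpanEvent(event, expected)
if (mismatches.length > 0) {
  throw new Error(`Span event mismatch at: ${mismatches.join(', ')}`)
}
```

Reporting the mismatched paths, rather than a bare pass/fail, gives the kind of validation checkpoint the Workflow Clarity row finds missing: a failing test names the field to inspect before re-recording anything.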

Validation

100%

Warnings & errors only

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation — 11 / 11 Passed

Validation for skill structure

No warnings or errors.

Repository: DataDog/dd-trace-js
Commit: 50aa025

Reviewed: about 1 month ago
