Execute Langfuse secondary workflow: Evaluation, scoring, and datasets. Use when implementing LLM evaluation, adding user feedback, or setting up automated quality scoring and experiment datasets. Trigger with phrases like "langfuse evaluation", "langfuse scoring", "rate llm outputs", "langfuse feedback", "langfuse datasets", "langfuse experiments".
Score: 66

Quality: 81% (Does it follow best practices?)
Impact: — (No eval scenarios have been run)
Validation: Passed (No known issues)
Discovery
89%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This is a well-structured skill description that clearly communicates its scope within the Langfuse ecosystem and provides explicit trigger guidance. Its main weakness is that the capability descriptions could be more concrete — listing specific operations rather than broad categories. The explicit trigger phrases and 'Use when' clause are strong points that make skill selection reliable.
Suggestions
Add more concrete action verbs to the capability list, e.g., 'Create scoring configs, annotate traces with scores, upload dataset items, run evaluation experiments' instead of the more abstract 'evaluation, scoring, and datasets'.
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Names the domain (Langfuse evaluation/scoring/datasets) and some actions (evaluation, scoring, user feedback, automated quality scoring, experiment datasets), but the actions are somewhat high-level rather than listing multiple concrete operations like 'create scoring configs, annotate traces, upload dataset items, run experiments'. | 2 / 3 |
| Completeness | Clearly answers both 'what' (evaluation, scoring, datasets for Langfuse) and 'when' (explicit 'Use when' clause with scenarios plus a 'Trigger with phrases like' section listing specific trigger terms). | 3 / 3 |
| Trigger Term Quality | Explicitly lists natural trigger phrases including 'langfuse evaluation', 'langfuse scoring', 'rate llm outputs', 'langfuse feedback', 'langfuse datasets', 'langfuse experiments' — these cover a good range of terms users would naturally say when needing this skill. | 3 / 3 |
| Distinctiveness / Conflict Risk | The description is clearly scoped to Langfuse's secondary workflow (evaluation/scoring/datasets), distinguishing it from general LLM tooling or even other Langfuse workflows (e.g., tracing). The 'langfuse' prefix on trigger terms and the specific mention of 'secondary workflow' make it unlikely to conflict with other skills. | 3 / 3 |
| Total | | 11 / 12 (Passed) |
Implementation
72%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This is a solid, actionable skill with excellent executable code examples covering the full Langfuse evaluation workflow. Its main weaknesses are the lack of inline validation checkpoints between steps (e.g., verifying dataset population before running experiments) and some scope creep with the prompt management section that isn't directly tied to the evaluation/scoring focus. The progressive disclosure and external references are well-handled.
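For concreteness, here is a minimal sketch of the kind of experiment loop such a workflow covers, assuming a v2-style Langfuse JavaScript/TypeScript SDK; the dataset name, run name, the `exact_match` metric, and the `runMyApp` helper are placeholders, not part of the reviewed skill.

```typescript
import { Langfuse } from "langfuse";

const langfuse = new Langfuse(); // reads LANGFUSE_* environment variables

// Hypothetical application under test; replace with the real call.
async function runMyApp(input: unknown): Promise<string> {
  return `echo: ${JSON.stringify(input)}`;
}

// Run every item of a dataset through the app, link the traces to a named
// run, and attach a simple placeholder metric as a numeric score.
async function runExperiment(datasetName: string, runName: string): Promise<void> {
  const dataset = await langfuse.getDataset(datasetName);

  for (const item of dataset.items) {
    const trace = langfuse.trace({ name: "experiment-run", input: item.input });
    const output = await runMyApp(item.input);
    trace.update({ output });

    // Linking the trace to the dataset item groups results under this run.
    await item.link(trace, runName);

    langfuse.score({
      traceId: trace.id,
      name: "exact_match",
      value: output === item.expectedOutput ? 1 : 0,
    });
  }

  await langfuse.flushAsync(); // make sure all events are delivered before exiting
}
```

Linking each trace to its dataset item is what groups the results under a single named run in the Langfuse UI, which is where the experiment comparison happens.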
Suggestions
Add explicit validation checkpoints between steps, e.g., verify dataset items exist before running experiments: `const items = await langfuse.api.datasetItems.list({ datasetName: '...' }); console.log(items.length + ' items ready');` (a fuller sketch follows these suggestions)
Consider moving the Prompt Management section (Step 3) to a separate skill or making it a brief reference, since it's tangential to the evaluation/scoring/datasets focus described in the skill's purpose
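A sketch of that checkpoint, expanding the one-liner above into a reusable guard; it assumes the `langfuse.api.datasetItems.list` call from the suggestion and treats the response shape loosely since it differs across SDK versions. The dataset name is a placeholder.

```typescript
import { Langfuse } from "langfuse";

const langfuse = new Langfuse();

// Pre-flight checkpoint: fail fast when the dataset is empty so an
// experiment is never started against missing data.
async function assertDatasetReady(datasetName: string): Promise<number> {
  // Response shape varies between SDK versions, so treat it loosely here.
  const res: any = await langfuse.api.datasetItems.list({ datasetName });
  const items = Array.isArray(res) ? res : res?.data ?? [];

  if (items.length === 0) {
    throw new Error(`Dataset "${datasetName}" has no items; populate it before running experiments.`);
  }

  console.log(`${items.length} items ready in dataset "${datasetName}"`);
  return items.length;
}

// Usage (dataset name is a placeholder):
// await assertDatasetReady("qa-regression-set");
```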
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill is mostly efficient with executable code examples, but includes some unnecessary elements like the user feedback API endpoint pattern and prompt management section that expand scope beyond the core evaluation/scoring workflow. Some comments in code are helpful but others are redundant. | 2 / 3 |
| Actionability | Every step provides fully executable TypeScript code with concrete examples covering all three score types, dataset creation, experiment running, and LLM-as-a-Judge patterns (a minimal scoring sketch follows this table). The code is copy-paste ready with realistic variable names and complete function signatures. | 3 / 3 |
| Workflow Clarity | Steps are clearly sequenced from scoring to datasets to experiments, but there are no explicit validation checkpoints between steps. For example, there's no verification that scores were successfully created before proceeding, no check that dataset items were populated before running experiments, and the error handling table is reactive rather than integrated into the workflow. | 2 / 3 |
| Progressive Disclosure | The skill is well-structured with clear sections, a concise overview, a useful error handling table, and well-signaled references to external docs and related skills (langfuse-core-workflow-a, langfuse-common-errors, langfuse-ci-integration). Content is appropriately organized without being monolithic. | 3 / 3 |
| Total | | 10 / 12 (Passed) |
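As a reference for the three score types mentioned under Actionability, here is a minimal sketch assuming a recent Langfuse JS SDK that accepts a `dataType` field (older versions infer the type from the value); the trace ID and score names are placeholders.

```typescript
import { Langfuse } from "langfuse";

const langfuse = new Langfuse();

// Attach one score of each data type to an existing trace.
// `dataType` is assumed to be supported by the SDK version in use.
async function recordScores(traceId: string): Promise<void> {
  // Numeric score, e.g. a relevance rating on a 0-1 scale.
  langfuse.score({ traceId, name: "relevance", value: 0.85, dataType: "NUMERIC" });

  // Categorical score, e.g. a manually assigned quality label.
  langfuse.score({ traceId, name: "quality_label", value: "acceptable", dataType: "CATEGORICAL" });

  // Boolean score, e.g. explicit user feedback (1 = thumbs up, 0 = thumbs down).
  langfuse.score({ traceId, name: "user_feedback", value: 1, dataType: "BOOLEAN", comment: "thumbs up" });

  await langfuse.flushAsync(); // flush before the process exits
}
```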
Validation
81%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 9 / 11 Passed
Validation for skill structure
| Criteria | Description | Result |
|---|---|---|
| allowed_tools_field | 'allowed-tools' contains unusual tool name(s) | Warning |
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| Total | 9 / 11 | Passed |