Execute Langfuse secondary workflow: Evaluation, scoring, and datasets. Use when implementing LLM evaluation, adding user feedback, or setting up automated quality scoring and experiment datasets. Trigger with phrases like "langfuse evaluation", "langfuse scoring", "rate llm outputs", "langfuse feedback", "langfuse datasets", "langfuse experiments".
Score: 66

Quality: 81% (Does it follow best practices?)
Impact: — (No eval scenarios have been run)
Validation: Passed (No known issues)
Discovery
89%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This is a well-structured skill description that clearly communicates its scope within the Langfuse ecosystem and provides explicit trigger guidance. Its main weakness is that the capability descriptions could be more concrete — listing specific operations rather than broad categories. The explicit trigger phrases and 'Use when' clause are strong points that make skill selection reliable.
Suggestions
Add more concrete action verbs to the capability list, e.g., 'Create scoring configs, annotate traces with scores, upload dataset items, run evaluation experiments' instead of the more abstract 'evaluation, scoring, and datasets'.
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Names the domain (Langfuse evaluation/scoring/datasets) and some actions (evaluation, scoring, user feedback, automated quality scoring, experiment datasets), but the actions are somewhat high-level rather than listing multiple concrete operations like 'create scoring configs, annotate traces, upload dataset items, run experiments'. | 2 / 3 |
| Completeness | Clearly answers both 'what' (evaluation, scoring, datasets for Langfuse) and 'when' (explicit 'Use when' clause with scenarios plus a 'Trigger with phrases like' section listing specific trigger terms). | 3 / 3 |
| Trigger Term Quality | Explicitly lists natural trigger phrases including 'langfuse evaluation', 'langfuse scoring', 'rate llm outputs', 'langfuse feedback', 'langfuse datasets', 'langfuse experiments' — these cover a good range of terms users would naturally say when needing this skill. | 3 / 3 |
| Distinctiveness / Conflict Risk | The description is clearly scoped to Langfuse's secondary workflow (evaluation/scoring/datasets), distinguishing it from general LLM tooling or even other Langfuse workflows (e.g., tracing). The 'langfuse' prefix on trigger terms and the specific mention of 'secondary workflow' make it unlikely to conflict with other skills. | 3 / 3 |
| Total | | 11 / 12 (Passed) |
Implementation
72%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This is a solid, actionable skill with excellent executable code examples covering the full Langfuse evaluation workflow. Its main weaknesses are the lack of inline validation checkpoints between steps (e.g., verifying dataset population before running experiments) and some scope creep with the prompt management section that isn't directly tied to the evaluation/scoring focus. The progressive disclosure and external references are well-handled.
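For concreteness, here is a minimal sketch of the kind of experiment loop such a workflow covers, assuming a v2-style Langfuse JavaScript/TypeScript SDK; the dataset name, run name, the `exact_match` metric, and the `runMyApp` helper are placeholders, not part of the reviewed skill.

```typescript
import { Langfuse } from "langfuse";

const langfuse = new Langfuse(); // reads LANGFUSE_* environment variables

// Hypothetical application under test; replace with the real call.
async function runMyApp(input: unknown): Promise<string> {
  return `echo: ${JSON.stringify(input)}`;
}

// Run every item of a dataset through the app, link the traces to a named
// run, and attach a simple placeholder metric as a numeric score.
async function runExperiment(datasetName: string, runName: string): Promise<void> {
  const dataset = await langfuse.getDataset(datasetName);

  for (const item of dataset.items) {
    const trace = langfuse.trace({ name: "experiment-run", input: item.input });
    const output = await runMyApp(item.input);
    trace.update({ output });

    // Linking the trace to the dataset item groups results under this run.
    await item.link(trace, runName);

    langfuse.score({
      traceId: trace.id,
      name: "exact_match",
      value: output === item.expectedOutput ? 1 : 0,
    });
  }

  await langfuse.flushAsync(); // make sure all events are delivered before exiting
}
```

Linking each trace to its dataset item is what groups the results under a single named run in the Langfuse UI, which is where the experiment comparison happens.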
Suggestions
Add explicit validation checkpoints between steps, e.g., verify dataset items exist before running experiments: `const items = await langfuse.api.datasetItems.list({ datasetName: '...' }); console.log(items.length + ' items ready');` (a fuller sketch follows these suggestions)
Consider moving the Prompt Management section (Step 3) to a separate skill or making it a brief reference, since it's tangential to the evaluation/scoring/datasets focus described in the skill's purpose
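A sketch of that checkpoint, expanding the one-liner above into a reusable guard; it assumes the `langfuse.api.datasetItems.list` call from the suggestion and treats the response shape loosely since it differs across SDK versions. The dataset name is a placeholder.

```typescript
import { Langfuse } from "langfuse";

const langfuse = new Langfuse();

// Pre-flight checkpoint: fail fast when the dataset is empty so an
// experiment is never started against missing data.
async function assertDatasetReady(datasetName: string): Promise<number> {
  // Response shape varies between SDK versions, so treat it loosely here.
  const res: any = await langfuse.api.datasetItems.list({ datasetName });
  const items = Array.isArray(res) ? res : res?.data ?? [];

  if (items.length === 0) {
    throw new Error(`Dataset "${datasetName}" has no items; populate it before running experiments.`);
  }

  console.log(`${items.length} items ready in dataset "${datasetName}"`);
  return items.length;
}

// Usage (dataset name is a placeholder):
// await assertDatasetReady("qa-regression-set");
```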
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill is mostly efficient with executable code examples, but includes some unnecessary elements like the user feedback API endpoint pattern and prompt management section that expand scope beyond the core evaluation/scoring workflow. Some comments in code are helpful but others are redundant. | 2 / 3 |
| Actionability | Every step provides fully executable TypeScript code with concrete examples covering all three score types, dataset creation, experiment running, and LLM-as-a-Judge patterns (a minimal scoring sketch follows this table). The code is copy-paste ready with realistic variable names and complete function signatures. | 3 / 3 |
| Workflow Clarity | Steps are clearly sequenced from scoring to datasets to experiments, but there are no explicit validation checkpoints between steps. For example, there's no verification that scores were successfully created before proceeding, no check that dataset items were populated before running experiments, and the error handling table is reactive rather than integrated into the workflow. | 2 / 3 |
| Progressive Disclosure | The skill is well-structured with clear sections, a concise overview, a useful error handling table, and well-signaled references to external docs and related skills (langfuse-core-workflow-a, langfuse-common-errors, langfuse-ci-integration). Content is appropriately organized without being monolithic. | 3 / 3 |
| Total | | 10 / 12 (Passed) |
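As a reference for the three score types mentioned under Actionability, here is a minimal sketch assuming a recent Langfuse JS SDK that accepts a `dataType` field (older versions infer the type from the value); the trace ID and score names are placeholders.

```typescript
import { Langfuse } from "langfuse";

const langfuse = new Langfuse();

// Attach one score of each data type to an existing trace.
// `dataType` is assumed to be supported by the SDK version in use.
async function recordScores(traceId: string): Promise<void> {
  // Numeric score, e.g. a relevance rating on a 0-1 scale.
  langfuse.score({ traceId, name: "relevance", value: 0.85, dataType: "NUMERIC" });

  // Categorical score, e.g. a manually assigned quality label.
  langfuse.score({ traceId, name: "quality_label", value: "acceptable", dataType: "CATEGORICAL" });

  // Boolean score, e.g. explicit user feedback (1 = thumbs up, 0 = thumbs down).
  langfuse.score({ traceId, name: "user_feedback", value: 1, dataType: "BOOLEAN", comment: "thumbs up" });

  await langfuse.flushAsync(); // flush before the process exits
}
```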
Validation
81%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 9 / 11 Passed
Validation for skill structure
| Criteria | Description | Result |
|---|---|---|
| allowed_tools_field | 'allowed-tools' contains unusual tool name(s) | Warning |
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| Total | 9 / 11 | Passed |