
langfuse-core-workflow-b

Execute Langfuse secondary workflow: Evaluation, scoring, and datasets. Use when implementing LLM evaluation, adding user feedback, or setting up automated quality scoring and experiment datasets. Trigger with phrases like "langfuse evaluation", "langfuse scoring", "rate llm outputs", "langfuse feedback", "langfuse datasets", "langfuse experiments".


Quality: 81% (Does it follow best practices?)

Impact: No eval scenarios have been run

Security (by Snyk): Passed, no known issues


Quality

Discovery: 89%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

This is a well-structured skill description that clearly communicates its scope within the Langfuse ecosystem and provides explicit trigger guidance. Its main weakness is that the capability descriptions could be more concrete — listing specific operations rather than broad categories. The explicit trigger phrases and 'Use when' clause are strong points that make skill selection reliable.

Suggestions

Add more concrete action verbs to the capability list, e.g., 'Create scoring configs, annotate traces with scores, upload dataset items, run evaluation experiments' instead of the more abstract 'evaluation, scoring, and datasets'.
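As a minimal sketch of what those concrete operations might look like in code (the payload shapes below are illustrative assumptions, not the Langfuse SDK's exact types; with the TypeScript SDK they would back calls such as `langfuse.score(...)` and `langfuse.createDatasetItem(...)`):

```typescript
// Illustrative payload builders for the concrete operations the suggestion
// names: annotating traces with scores and uploading dataset items. The
// interfaces here are assumptions for illustration only.

interface ScorePayload {
  traceId: string;
  name: string;
  value: number;
  comment?: string;
}

interface DatasetItemPayload {
  datasetName: string;
  input: unknown;
  expectedOutput?: unknown;
}

// Annotate a trace with a score, rejecting obviously invalid input early.
function buildScore(traceId: string, name: string, value: number): ScorePayload {
  if (!traceId || !name) {
    throw new Error("traceId and name are required for a score");
  }
  return { traceId, name, value };
}

// Describe a dataset item for a later experiment run.
function buildDatasetItem(
  datasetName: string,
  input: unknown,
  expectedOutput?: unknown
): DatasetItemPayload {
  if (!datasetName) {
    throw new Error("datasetName is required for a dataset item");
  }
  return { datasetName, input, expectedOutput };
}

const score = buildScore("trace-123", "helpfulness", 0.9);
const item = buildDatasetItem("qa-regression", { question: "What is Langfuse?" });
```

Naming the operations at this level of granularity in the description gives an agent concrete verbs to match against.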

Dimension scores:

- Specificity (2/3): Names the domain (Langfuse evaluation/scoring/datasets) and some actions (evaluation, scoring, user feedback, automated quality scoring, experiment datasets), but the actions are somewhat high-level rather than listing multiple concrete operations like 'create scoring configs, annotate traces, upload dataset items, run experiments'.

- Completeness (3/3): Clearly answers both 'what' (evaluation, scoring, datasets for Langfuse) and 'when' (explicit 'Use when' clause with scenarios plus a 'Trigger with phrases like' section listing specific trigger terms).

- Trigger Term Quality (3/3): Explicitly lists natural trigger phrases including 'langfuse evaluation', 'langfuse scoring', 'rate llm outputs', 'langfuse feedback', 'langfuse datasets', 'langfuse experiments'; these cover a good range of terms users would naturally say when needing this skill.

- Distinctiveness / Conflict Risk (3/3): The description is clearly scoped to Langfuse's secondary workflow (evaluation/scoring/datasets), distinguishing it from general LLM tooling and even other Langfuse workflows (e.g., tracing). The 'langfuse' prefix on trigger terms and the specific mention of 'secondary workflow' make it unlikely to conflict with other skills.

Total: 11 / 12

Passed

Implementation: 72%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This is a solid, actionable skill with excellent executable code examples covering the full Langfuse evaluation workflow. Its main weaknesses are the lack of inline validation checkpoints between steps (e.g., verifying dataset population before running experiments) and some scope creep with the prompt management section that isn't directly tied to the evaluation/scoring focus. The progressive disclosure and external references are well-handled.

Suggestions

- Add explicit validation checkpoints between steps, e.g., verify dataset items exist before running experiments: `const items = await langfuse.api.datasetItems.list({ datasetName: '...' }); console.log(items.length + ' items ready');`

- Consider moving the Prompt Management section (Step 3) to a separate skill or making it a brief reference, since it is tangential to the evaluation/scoring/datasets focus described in the skill's purpose.
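The first suggestion can be sketched as a small reusable guard. The items array would come from the `langfuse.api.datasetItems.list` call quoted in the suggestion; the guard itself is plain TypeScript and makes no assumptions about the SDK:

```typescript
// Validation checkpoint between workflow steps: fail fast if a dataset is
// empty before an experiment run, rather than discovering it mid-experiment.
function assertItemsReady<T>(items: T[], datasetName: string): number {
  if (items.length === 0) {
    throw new Error(
      `Dataset "${datasetName}" has no items; populate it before running experiments`
    );
  }
  console.log(`${items.length} items ready in "${datasetName}"`);
  return items.length;
}

const count = assertItemsReady(
  [{ input: "q1" }, { input: "q2" }],
  "qa-regression"
);
```

Placing a guard like this between the dataset-population step and the experiment step turns the reactive error-handling table into a proactive checkpoint.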

Dimension scores:

- Conciseness (2/3): The skill is mostly efficient with executable code examples, but includes some unnecessary elements, like the user feedback API endpoint pattern and the prompt management section, that expand scope beyond the core evaluation/scoring workflow. Some comments in the code are helpful, but others are redundant.

- Actionability (3/3): Every step provides fully executable TypeScript code with concrete examples covering all three score types, dataset creation, experiment running, and LLM-as-a-Judge patterns. The code is copy-paste ready with realistic variable names and complete function signatures.

- Workflow Clarity (2/3): Steps are clearly sequenced from scoring to datasets to experiments, but there are no explicit validation checkpoints between steps. For example, there is no verification that scores were successfully created before proceeding, no check that dataset items were populated before running experiments, and the error handling table is reactive rather than integrated into the workflow.

- Progressive Disclosure (3/3): The skill is well-structured with clear sections, a concise overview, a useful error handling table, and well-signaled references to external docs and related skills (langfuse-core-workflow-a, langfuse-common-errors, langfuse-ci-integration). Content is appropriately organized without being monolithic.

Total: 10 / 12

Passed

Validation: 81%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 9 / 11 checks passed

Validation for skill structure

Criteria results:

- allowed_tools_field (Warning): 'allowed-tools' contains unusual tool name(s).

- frontmatter_unknown_keys (Warning): Unknown frontmatter key(s) found; consider removing or moving to metadata.

Total: 9 / 11

Passed
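Both warnings concern the skill's YAML frontmatter. A hedged sketch of the kind of fix they point at (field values here are illustrative, not copied from this repository; only the field names `name`, `description`, `allowed-tools`, and `metadata` follow the SKILL.md convention):

```yaml
---
name: langfuse-core-workflow-b
description: >
  Execute Langfuse secondary workflow: evaluation, scoring, and datasets.
# allowed_tools_field: list only recognized tool names here;
# unrecognized entries trigger the warning above.
allowed-tools: Read, Write, Bash
# frontmatter_unknown_keys: custom top-level keys trigger the warning;
# move them under 'metadata' instead, as the warning text suggests.
metadata:
  category: observability
---
```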

Repository: jeremylongshore/claude-code-plugins-plus-skills (Reviewed)
