langfuse-core-workflow-b

Execute Langfuse secondary workflow: Evaluation, scoring, and datasets. Use when implementing LLM evaluation, adding user feedback, or setting up automated quality scoring and experiment datasets. Trigger with phrases like "langfuse evaluation", "langfuse scoring", "rate llm outputs", "langfuse feedback", "langfuse datasets", "langfuse experiments".

66

Quality: 81% (Does it follow best practices?)

Impact: No eval scenarios have been run

Security (by Snyk): Advisory. Suggest reviewing before use.

Quality

Content

72%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This is a well-structured, highly actionable skill with complete executable TypeScript examples covering the full evaluation workflow. Its main weaknesses are the lack of validation checkpoints between steps (e.g., verifying dataset creation before running experiments) and some minor verbosity in comments. The progressive disclosure and cross-referencing to other skills are well done.

Suggestions

Add validation checkpoints between steps, e.g., after Step 4 verify the dataset exists before proceeding to Step 5's experiment run, and after Step 5 verify scores appear in the UI.

Trim obvious inline comments (e.g., '// Thumbs up/down', '// Optional: score a specific generation') to improve conciseness.
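The first suggestion above (verify the dataset exists before the experiment run) could look roughly like the sketch below. It is a minimal illustration, not the skill's own code: `DatasetLike`, `checkDataset`, and `requireDataset` are hypothetical names, and the dataset loader is injected so the checkpoint does not depend on any particular SDK version.

```typescript
// Hypothetical minimal shape of a fetched dataset; a real SDK object
// will usually carry more fields than these two.
interface DatasetLike {
  name: string;
  items: unknown[];
}

interface CheckResult {
  ok: boolean;
  reason: string;
}

// Pure check: the dataset must exist and hold at least `minItems` items.
function checkDataset(
  dataset: DatasetLike | null | undefined,
  minItems: number = 1,
): CheckResult {
  if (!dataset) {
    return { ok: false, reason: "dataset not found" };
  }
  const count = dataset.items.length;
  if (count < minItems) {
    return {
      ok: false,
      reason: `dataset "${dataset.name}" has ${count} item(s), expected at least ${minItems}`,
    };
  }
  return { ok: true, reason: `dataset "${dataset.name}" has ${count} item(s)` };
}

// Checkpoint wrapper: load the dataset through an injected function and
// throw if the check fails, so the experiment step never starts.
async function requireDataset(
  loadDataset: (name: string) => Promise<DatasetLike | null>,
  name: string,
  minItems: number = 1,
): Promise<DatasetLike> {
  let dataset: DatasetLike | null = null;
  try {
    dataset = await loadDataset(name);
  } catch (err) {
    // Assumption: a failed lookup is treated the same as a missing dataset.
    dataset = null;
  }
  const result = checkDataset(dataset, minItems);
  if (!result.ok || dataset === null) {
    throw new Error(`Checkpoint failed before experiment run: ${result.reason}`);
  }
  return dataset;
}
```

In a real workflow the loader would wrap whatever dataset lookup the installed Langfuse SDK provides; the exact method name depends on the SDK version and is not assumed here. A similar checkpoint after the experiment step could confirm that scores were recorded before the run is reported as complete.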

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Conciseness | The skill is mostly efficient with executable code examples, but includes some unnecessary verbosity like inline comments explaining obvious things ('Optional: score a specific generation', 'Thumbs up/down') and the user feedback endpoint example is somewhat tangential. The error handling table is useful but some entries are obvious. | 2 / 3 |
| Actionability | Every step provides fully executable TypeScript code with concrete examples: scoring traces, collecting feedback, fetching prompts, creating datasets, running experiments, and LLM-as-a-Judge evaluation. All code is copy-paste ready with realistic values and proper imports. | 3 / 3 |
| Workflow Clarity | Steps are clearly sequenced (1-6) and logically ordered, but there are no explicit validation checkpoints or feedback loops. For a workflow involving dataset creation and experiment execution, there should be verification steps (e.g., confirm dataset was created before running experiments, validate scores appeared in UI). | 2 / 3 |
| Progressive Disclosure | The skill is well-structured with a clear overview, sequential steps, an error handling table, and external resource links. It references related skills (langfuse-core-workflow-a, langfuse-common-errors, langfuse-ci-integration) for navigation. Content is appropriately scoped without being monolithic. | 3 / 3 |
| Total | | 10 / 12 (Passed) |

Description

89%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

This is a well-structured skill description that clearly identifies its niche (Langfuse evaluation and scoring workflows), provides explicit trigger guidance, and distinguishes itself from other skills. The main weakness is that the capabilities could be stated more concretely, listing actual operations rather than broad categories like 'evaluation' and 'scoring'.

Suggestions

Add more concrete action verbs describing specific operations, e.g., 'Create evaluation datasets, attach scores to traces, configure automated evaluators, collect user feedback on LLM outputs.'

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Specificity | Names the domain (Langfuse evaluation/scoring/datasets) and some actions (evaluation, scoring, user feedback, automated quality scoring, experiment datasets), but the actions are somewhat high-level and not fully concrete (e.g., doesn't specify specific operations like 'create dataset items', 'run evaluators', 'attach scores to traces'). | 2 / 3 |
| Completeness | Clearly answers both 'what' (evaluation, scoring, datasets for Langfuse) and 'when' (explicit 'Use when' clause with scenarios plus a 'Trigger with phrases like' section listing specific trigger terms). | 3 / 3 |
| Trigger Term Quality | Includes a good set of natural trigger terms: 'langfuse evaluation', 'langfuse scoring', 'rate llm outputs', 'langfuse feedback', 'langfuse datasets', 'langfuse experiments'. These cover multiple natural phrasings a user might use. | 3 / 3 |
| Distinctiveness / Conflict Risk | The description is clearly scoped to Langfuse's secondary workflow (evaluation/scoring/datasets), distinguishing it from general LLM tools or even other Langfuse workflows (e.g., tracing). The 'langfuse' prefix on most trigger terms reduces conflict risk significantly. | 3 / 3 |
| Total | | 11 / 12 (Passed) |

Validation

81%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 9 / 11 Passed

Validation for skill structure

| Criteria | Description | Result |
| --- | --- | --- |
| allowed_tools_field | 'allowed-tools' contains unusual tool name(s) | Warning |
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| Total | | 9 / 11 (Passed) |
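The frontmatter_unknown_keys warning recommends removing unknown keys or moving them under metadata. A minimal sketch of that move follows, assuming the frontmatter has already been parsed into a plain object. The allow-list `KNOWN_KEYS` is an illustrative assumption, not the validator's actual list, and the function names are hypothetical.

```typescript
// ASSUMPTION: illustrative allow-list only. Check the skill spec used by
// your validator for the keys it actually recognizes.
const KNOWN_KEYS: string[] = ["name", "description", "license", "allowed-tools", "metadata"];

type Frontmatter = Record<string, unknown>;

function isKnownKey(key: string): boolean {
  return KNOWN_KEYS.indexOf(key) !== -1;
}

// List top-level keys that are not in the allow-list.
function findUnknownKeys(frontmatter: Frontmatter): string[] {
  return Object.keys(frontmatter).filter((key) => !isKnownKey(key));
}

// Return a copy of the frontmatter with unknown top-level keys moved
// under `metadata`. The input object is not mutated.
function moveUnknownKeysToMetadata(frontmatter: Frontmatter): Frontmatter {
  const existing = frontmatter["metadata"];
  const metadata: Record<string, unknown> =
    typeof existing === "object" && existing !== null && !Array.isArray(existing)
      ? { ...(existing as Record<string, unknown>) }
      : {};

  const result: Frontmatter = {};
  for (const key of Object.keys(frontmatter)) {
    if (key === "metadata") continue; // merged back in below
    if (isKnownKey(key)) {
      result[key] = frontmatter[key];
    } else {
      metadata[key] = frontmatter[key];
    }
  }
  if (Object.keys(metadata).length > 0) {
    result["metadata"] = metadata;
  }
  return result;
}
```

Running `findUnknownKeys` on the result of `moveUnknownKeysToMetadata` returns an empty list under this allow-list, which is the condition the warning asks for. Whether a given key belongs in metadata or should be dropped is a judgment call for the skill's maintainer.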

Repository: jeremylongshore/claude-code-plugins-plus-skills (Reviewed)
