
langfuse-core-workflow-b

Execute Langfuse secondary workflow: Evaluation, scoring, and datasets. Use when implementing LLM evaluation, adding user feedback, or setting up automated quality scoring and experiment datasets. Trigger with phrases like "langfuse evaluation", "langfuse scoring", "rate llm outputs", "langfuse feedback", "langfuse datasets", "langfuse experiments".

Score: 83

Quality: 81% (Does it follow best practices?)

Impact: Pending (No eval scenarios have been run)

Security by Snyk: Passed (No known issues)


Quality

Discovery: 89%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

This is a well-structured skill description that excels in completeness and distinctiveness, with explicit 'Use when' and 'Trigger with' clauses and Langfuse-specific terminology that minimizes conflict risk. The main weakness is that the capability descriptions could be more concrete—listing specific actions like creating scores, annotating traces, or building dataset items rather than staying at the category level.

Suggestions

Increase specificity by listing concrete actions such as 'create scoring configs, annotate traces with scores, build evaluation datasets, run A/B experiments' instead of broad categories like 'evaluation, scoring, and datasets'.
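To make the suggested concrete actions tangible, here is a minimal sketch of the payload a "annotate traces with scores" action would build. The field names mirror the score shape used by the Langfuse JS SDK (`traceId`, `name`, `value`, `dataType`), but the types and helper below are illustrative only and are not taken from the skill itself; the payload is constructed locally rather than sent to the API.

```typescript
// Illustrative score payload shape, modeled on the Langfuse JS SDK's
// score object. This is a local sketch, not an API call.
type ScoreDataType = "NUMERIC" | "CATEGORICAL" | "BOOLEAN";

interface ScorePayload {
  traceId: string;
  name: string;
  value: number | string;
  dataType: ScoreDataType;
  comment?: string;
}

// "Annotate traces with scores": build one payload per concrete action.
function buildNumericScore(
  traceId: string,
  name: string,
  value: number
): ScorePayload {
  return { traceId, name, value, dataType: "NUMERIC" };
}

const score = buildNumericScore("trace-123", "helpfulness", 0.9);
// With a configured client this payload would be passed to a score-creation
// call such as langfuse.score(...).
```

Listing actions at this level of granularity ("create a NUMERIC score named 'helpfulness' on a trace") is the kind of specificity the suggestion asks for.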

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Specificity | Names the domain (Langfuse evaluation/scoring/datasets) and some actions (evaluation, scoring, user feedback, automated quality scoring, experiment datasets), but the actions are somewhat high-level rather than listing multiple concrete operations like 'create scoring rubrics, annotate traces, build dataset items, run experiments'. | 2 / 3 |
| Completeness | Clearly answers both 'what' (evaluation, scoring, datasets for Langfuse) and 'when' (explicit 'Use when' clause with scenarios plus a 'Trigger with phrases like' section providing concrete trigger terms). | 3 / 3 |
| Trigger Term Quality | Includes a well-curated set of natural trigger terms: 'langfuse evaluation', 'langfuse scoring', 'rate llm outputs', 'langfuse feedback', 'langfuse datasets', 'langfuse experiments'. These cover natural variations a user would say and are explicitly listed. | 3 / 3 |
| Distinctiveness (Conflict Risk) | Highly distinctive due to the specific 'Langfuse secondary workflow' framing and Langfuse-specific trigger terms. The description clearly carves out a niche for Langfuse evaluation/scoring/datasets, distinguishing it from general LLM evaluation tools or other Langfuse workflows (e.g., tracing). | 3 / 3 |
| Total | | 11 / 12 |

Passed

Implementation: 72%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This is a solid, actionable skill with excellent executable code examples covering the full evaluation workflow. Its main weakness is the lack of explicit validation checkpoints between steps (e.g., verifying scores appear in the UI before proceeding to datasets, confirming dataset items exist before running experiments). The content could also be slightly more concise by trimming some of the less essential examples like the full feedback endpoint.

Suggestions

Add validation checkpoints between key steps, e.g., 'Verify scores appear in Langfuse UI before proceeding' after Step 1, and 'Confirm dataset items exist via langfuse.api.datasetItems.list()' after Step 4.

Consider trimming Step 2 (user feedback endpoint) to a shorter example since the full Express route handler adds bulk without teaching Langfuse-specific concepts beyond what Step 1 already covers.
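A validation checkpoint of the kind suggested above can be expressed as a small guard that runs before an experiment starts. The sketch below is hypothetical: the `DatasetItem` shape loosely mirrors what a dataset listing call (such as the `langfuse.api.datasetItems.list()` mentioned in the suggestion) would return, but the guard itself is plain TypeScript with no API dependency.

```typescript
// Hypothetical pre-experiment checkpoint: fail fast if the dataset is empty
// instead of running an experiment against zero items.
interface DatasetItem {
  id: string;
  input: unknown;
  expectedOutput?: unknown;
}

function assertDatasetReady(
  datasetName: string,
  items: DatasetItem[]
): DatasetItem[] {
  if (items.length === 0) {
    throw new Error(
      `Dataset "${datasetName}" has no items; populate it before running experiments.`
    );
  }
  return items;
}

// Items here are hard-coded for illustration; in practice they would come
// from a dataset listing call.
const items = assertDatasetReady("qa-regression", [
  {
    id: "item-1",
    input: "What is Langfuse?",
    expectedOutput: "An LLM observability platform.",
  },
]);
```

Inserting a guard like this between the dataset-population step and the experiment step is exactly the feedback loop the Workflow Clarity row finds missing.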

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Conciseness | The skill is mostly efficient with executable code examples, but some sections like the user feedback endpoint and prompt management feel like they could be tightened. The error handling table is a nice compact format, but overall the skill is quite long (~180 lines of content) with some redundancy (e.g., re-importing LangfuseClient in Step 5). | 2 / 3 |
| Actionability | Every step includes fully executable TypeScript code with concrete examples — score creation with all three data types, dataset population with real test cases, experiment runner with evaluator functions, and LLM-as-a-Judge patterns. Code is copy-paste ready with realistic values. | 3 / 3 |
| Workflow Clarity | Steps are clearly sequenced (1-6) and logically ordered, but there are no explicit validation checkpoints or feedback loops. For operations like dataset creation and experiment running, there's no guidance on verifying scores appeared correctly, validating dataset items were created, or handling partial experiment failures before proceeding. | 2 / 3 |
| Progressive Disclosure | The skill has a clear overview, well-structured steps, a compact error handling table, and appropriately references external resources (docs links) and related skills (langfuse-core-workflow-a, langfuse-common-errors, langfuse-ci-integration) without nesting references more than one level deep. | 3 / 3 |
| Total | | 10 / 12 |

Passed
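The evaluator functions credited in the Actionability row can be sketched as pure functions that compare an experiment output against the expected output and return a score record. The names below (`exactMatchEvaluator`, the `EvalResult` shape) are illustrative, not taken from the skill; this is one minimal pattern, not the skill's actual implementation.

```typescript
// Hypothetical evaluator: a pure function scoring an output against the
// expected answer, suitable for use inside an experiment runner.
interface EvalResult {
  name: string;
  value: number;
  comment: string;
}

function exactMatchEvaluator(output: string, expected: string): EvalResult {
  const match = output.trim().toLowerCase() === expected.trim().toLowerCase();
  return {
    name: "exact-match",
    value: match ? 1 : 0,
    comment: match
      ? "Output matches expected answer."
      : "Output differs from expected answer.",
  };
}

const result = exactMatchEvaluator("Paris ", "paris");
// result.value === 1
```

Because the evaluator is pure, it can be unit-tested independently of any Langfuse API calls, which also makes the validation checkpoints recommended above easier to add.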

Validation: 81%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 9 / 11 Passed

Validation for skill structure

| Criteria | Description | Result |
| --- | --- | --- |
| allowed_tools_field | 'allowed-tools' contains unusual tool name(s) | Warning |
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| Total | | 9 / 11 |

Passed

Repository: jeremylongshore/claude-code-plugins-plus-skills (Reviewed)
