output-quality-rubrics

Defining what "good" looks like for AI outputs — accuracy, relevance, helpfulness.

Output Quality Rubrics

Without a rubric, quality evaluation is subjective and inconsistent. A rubric defines what "good" means in concrete, measurable terms — so different evaluators reach the same conclusions.

Core Quality Dimensions

  • Accuracy: Is the information correct? Are claims verifiable? Are there hallucinations?
  • Relevance: Does the output address what the user actually asked? Is everything included necessary?
  • Completeness: Does the output cover everything needed? Are there gaps?
  • Helpfulness: Can the user actually use this output to accomplish their goal?
  • Clarity: Is the output easy to understand? Is it well-structured?
  • Tone appropriateness: Does the output match the expected tone for the context?
  • Safety: Is the output free from harmful, biased, or inappropriate content?

Building a Rubric

For each dimension, define a scale.

Example: Accuracy (1-5):

  • 5: All claims are verifiable and correct. No hallucinations.
  • 4: Minor inaccuracies that don't affect usefulness. No hallucinations.
  • 3: Some inaccuracies that could mislead if not caught. No dangerous hallucinations.
  • 2: Significant inaccuracies. User would need to verify most claims.
  • 1: Major hallucinations or factually wrong information presented confidently.
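
One way to keep a scale like this usable by both human evaluators and tooling is to store it as data. The following is a minimal sketch, not a prescribed format: the `Scale` class and its `describe` method are hypothetical names introduced here, and the descriptors are copied from the accuracy example above.

```python
from dataclasses import dataclass


@dataclass
class Scale:
    """A scoring scale for one rubric dimension (hypothetical structure)."""
    dimension: str
    levels: dict[int, str]  # score -> descriptor shown to evaluators

    def describe(self, score: int) -> str:
        """Return the descriptor for a score, rejecting scores off the scale."""
        if score not in self.levels:
            valid = sorted(self.levels)
            raise ValueError(f"{score} is not on the {self.dimension} scale {valid}")
        return self.levels[score]


ACCURACY = Scale(
    dimension="accuracy",
    levels={
        5: "All claims are verifiable and correct. No hallucinations.",
        4: "Minor inaccuracies that don't affect usefulness. No hallucinations.",
        3: "Some inaccuracies that could mislead if not caught. No dangerous hallucinations.",
        2: "Significant inaccuracies. User would need to verify most claims.",
        1: "Major hallucinations or factually wrong information presented confidently.",
    },
)

print(ACCURACY.describe(4))
```

Rejecting off-scale scores at entry time is a design choice: it catches typos such as a 0 or a 6 before they reach any aggregate.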

Weighting Dimensions

Not all dimensions matter equally for every use case:

  • A medical AI weights accuracy and safety highest
  • A creative writing AI weights helpfulness and tone highest
  • A coding AI weights accuracy and completeness highest
  • A customer service AI weights tone and helpfulness highest

Define weights when creating the rubric. Make the priorities explicit.
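
Explicit weights can be combined with per-dimension scores as a weighted average. The sketch below assumes 1-5 scores on every dimension. The function name `weighted_score` and the weight values are illustrative assumptions, chosen only to reflect the medical example's emphasis on accuracy and safety.

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of dimension scores; weights need not sum to 1."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[dim] * w for dim, w in weights.items()) / total_weight


# Hypothetical weights for a medical assistant: accuracy and safety dominate.
MEDICAL_WEIGHTS = {
    "accuracy": 7,
    "safety": 7,
    "completeness": 2,
    "helpfulness": 2,
    "clarity": 2,
}

# Hypothetical scores one evaluator gave a single output.
scores = {"accuracy": 4, "safety": 5, "completeness": 3, "helpfulness": 4, "clarity": 4}

print(weighted_score(scores, MEDICAL_WEIGHTS))  # 4.25
```

A single weighted number can hide a failing dimension, so it may be worth reporting per-dimension scores alongside it, or enforcing a minimum on the highest-weighted dimensions.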

Rubric Calibration

A rubric is only useful if evaluators use it consistently:

  • Anchor examples: Provide sample outputs at each score level
  • Calibration sessions: Have multiple evaluators score the same outputs and discuss disagreements
  • Inter-rater reliability: Measure agreement between evaluators (for example, with Cohen's kappa) and refine the rubric until agreement reaches a target level set in advance
  • Edge case guidance: Document how to score ambiguous cases
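
Agreement between two evaluators can be quantified with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. This is a minimal sketch using one common statistic, not the only option. The ratings are made-up illustration data, and this unweighted form treats a 4-versus-5 disagreement the same as a 1-versus-5 one, so a weighted kappa may suit ordinal scales better.

```python
from collections import Counter


def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given identical scores.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal rate, summed over scores.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical score throughout
    return (observed - expected) / (1 - expected)


# Made-up scores from two evaluators on the same eight outputs.
rater_a = [5, 4, 3, 4, 2, 5, 3, 1]
rater_b = [5, 4, 3, 3, 2, 4, 3, 1]

print(round(cohens_kappa(rater_a, rater_b), 2))  # 0.68
```

Here the evaluators agree on 6 of 8 items (0.75 raw agreement), but kappa is lower because some of that agreement would occur by chance. The items where they differ are natural candidates for discussion in a calibration session.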

Design Artefacts

  • Scoring rubric with dimension definitions and scales
  • Anchor examples at each score level
  • Dimension weighting specifications per use case
  • Calibration session protocols
  • Scoring templates and checklists
Repository
Owl-Listener/ai-design-skills