failure-taxonomy

Classifying AI failures — hallucination, refusal, irrelevance, tone mismatch, latency.

Failure Taxonomy

Name: failure-taxonomy
Rating: 28.8 (1 review)
Author: Owl-Listener

Not all AI failures are the same. A hallucination is different from a refusal, which is different from a tone mismatch. A failure taxonomy classifies failure types so teams can track, prioritise, and address them systematically.

Failure Categories

Content Failures:

Hallucination: The AI presents false information as fact
Inaccuracy: The AI gets details wrong (dates, numbers, names)
Incompleteness: The AI misses important information
Irrelevance: The AI's response doesn't address the user's actual question
Contradiction: The AI contradicts itself within or across responses

Behavioral Failures:

Inappropriate refusal: The AI refuses a reasonable request
Missing refusal: The AI fulfils a request it should have declined
Tone mismatch: The AI's tone is wrong for the context
Persona break: The AI drops out of its defined persona
Over-generation: The AI produces far more than needed

Technical Failures:

Latency: Response takes too long
Truncation: Response is cut off
Format errors: Output is in the wrong format or structure
Tool failures: The AI attempts to use a tool and fails
Context loss: The AI loses track of conversation history

Safety Failures:

Harmful content: The AI generates content that could cause harm
Privacy violation: The AI reveals sensitive information
Bias manifestation: The AI's output shows bias against a group
Manipulation: The AI's output could be used to deceive or manipulate
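
One way to make the categories above machine-readable is a plain mapping from category to failure types. The following is a minimal sketch, not part of the skill itself: the names Category, TAXONOMY, and category_of, and the snake_case identifiers for each failure type, are assumptions chosen for illustration.

```python
from enum import Enum


class Category(Enum):
    CONTENT = "content"
    BEHAVIORAL = "behavioral"
    TECHNICAL = "technical"
    SAFETY = "safety"


# Failure types grouped by category, mirroring the lists in this section.
# The snake_case identifiers are illustrative, not defined by the skill.
TAXONOMY = {
    Category.CONTENT: [
        "hallucination", "inaccuracy", "incompleteness",
        "irrelevance", "contradiction",
    ],
    Category.BEHAVIORAL: [
        "inappropriate_refusal", "missing_refusal", "tone_mismatch",
        "persona_break", "over_generation",
    ],
    Category.TECHNICAL: [
        "latency", "truncation", "format_error",
        "tool_failure", "context_loss",
    ],
    Category.SAFETY: [
        "harmful_content", "privacy_violation",
        "bias_manifestation", "manipulation",
    ],
}


def category_of(failure_type: str) -> Category:
    """Return the category a failure type belongs to, or raise ValueError."""
    for category, types in TAXONOMY.items():
        if failure_type in types:
            return category
    raise ValueError(f"unknown failure type: {failure_type!r}")
```

Rejecting unknown types, rather than filing them under a catch-all, keeps the log limited to the taxonomy's own vocabulary. A team might equally prefer an explicit "other" bucket.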

Severity Levels

Critical: Causes harm or creates a serious trust violation. Requires an immediate fix.
High: Significantly degrades user experience or task success. Fix within days.
Medium: Noticeable quality issue that users can work around. Fix within weeks.
Low: Minor quality issue. Track and batch with other fixes.
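
The four levels above could be encoded as an ordered enum so that severities can be compared and sorted. This is a rough sketch under stated assumptions: the three yes/no questions in classify_severity are a hypothetical reading of the level descriptions, not a rubric the skill defines, and the response text is copied from the list rather than turned into concrete deadlines.

```python
from enum import IntEnum


class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


# Expected response per level, taken from the descriptions in this section.
RESPONSE = {
    Severity.CRITICAL: "immediate fix",
    Severity.HIGH: "fix within days",
    Severity.MEDIUM: "fix within weeks",
    Severity.LOW: "track and batch with other fixes",
}


def classify_severity(harm_or_trust_violation: bool,
                      significant_degradation: bool,
                      noticeable: bool) -> Severity:
    """Pick the highest level whose condition holds (assumed rubric)."""
    if harm_or_trust_violation:
        return Severity.CRITICAL
    if significant_degradation:
        return Severity.HIGH
    if noticeable:
        return Severity.MEDIUM
    return Severity.LOW
```

Checking the conditions from most to least severe means a failure that both causes harm and is merely noticeable is still classed as Critical.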

Using the Taxonomy

Logging: Classify every detected failure by type and severity
Trending: Track failure type frequency over time
Prioritisation: Address highest-severity, highest-frequency failures first
Root cause analysis: Group failures by type to identify systemic causes
Prevention: Use failure patterns to inform guardrail design and prompt improvements
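
The logging, trending, and prioritisation steps above can be sketched in a few lines. Everything here is illustrative: FailureRecord, trend, and prioritise are hypothetical names, the 1 to 4 severity scale is an assumption, and the log entries are made-up sample data. "Highest-severity, highest-frequency first" is interpreted as sorting by worst severity seen, with frequency as the tie-breaker; a weighted score would be another reasonable reading.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FailureRecord:
    day: date
    failure_type: str
    severity: int  # assumed scale: 1 = low ... 4 = critical


def trend(records):
    """Count failures per (day, failure_type) for trend tracking."""
    return Counter((r.day, r.failure_type) for r in records)


def prioritise(records):
    """Order failure types by worst severity seen, then by frequency."""
    freq = Counter(r.failure_type for r in records)
    worst = {}
    for r in records:
        worst[r.failure_type] = max(worst.get(r.failure_type, 0), r.severity)
    return sorted(freq, key=lambda t: (worst[t], freq[t]), reverse=True)


# Made-up sample log, for illustration only.
log = [
    FailureRecord(date(2024, 1, 1), "hallucination", 3),
    FailureRecord(date(2024, 1, 1), "latency", 2),
    FailureRecord(date(2024, 1, 2), "hallucination", 3),
    FailureRecord(date(2024, 1, 2), "privacy_violation", 4),
    FailureRecord(date(2024, 1, 2), "latency", 2),
    FailureRecord(date(2024, 1, 3), "latency", 2),
]

print(prioritise(log))
# prints ['privacy_violation', 'hallucination', 'latency']
```

In this sample the single privacy violation outranks the more frequent latency failures because severity is compared first.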

Design Artefacts

Failure taxonomy reference document
Failure logging templates
Severity classification rubric
Failure trend dashboards
Root cause analysis protocols

Repository: Owl-Listener/ai-design-skills
Path: claude-plugin/evaluation/skills/failure-taxonomy/SKILL.md
Commit: f41b650

Last updated: about 10 hours ago
First committed: 3 months ago

