Reviews repositories, pull requests, diffs, and agent-generated code for reward hacking, fake completion, defensive theater, architectural bypasses, weakened guarantees, hidden fallbacks, and misleading abstractions.
98
97%
Does it follow best practices?
Impact
100%
1.09x average score across 6 eval scenarios
Passed
No known issues
{
  "context": "Tests whether the agent identifies optional dependency handling that silently disables required functionality.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "Correct category",
      "description": "Classifies the issue as defensive theater, silent degradation, or hidden failure.",
      "max_score": 14
    },
    {
      "name": "Severity",
      "description": "Assigns High severity or clearly explains high operational impact.",
      "max_score": 10
    },
    {
      "name": "Broad exception evidence",
      "description": "Cites the broad except Exception around the optional import.",
      "max_score": 12
    },
    {
      "name": "Silent fallback evidence",
      "description": "Cites extract_pages = None and index_pdf returning [] without warning or equivalent semantics.",
      "max_score": 16
    },
    {
      "name": "Semantic mismatch",
      "description": "Explains that an empty index is not equivalent to successfully parsing the PDF and can hide data loss.",
      "max_score": 18
    },
    {
      "name": "Test weakness",
      "description": "Notes that the test blesses missing-dependency success instead of asserting explicit failure, logging, or fallback semantics.",
      "max_score": 10
    },
    {
      "name": "Remediation",
      "description": "Recommends explicit error reporting, logging/metrics, a real fallback parser, and tests for both dependency paths.",
      "max_score": 14
    },
    {
      "name": "No overreach",
      "description": "Does not claim that all optional imports are inherently invalid.",
      "max_score": 6
    }
  ]
}
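The checklist above refers to a broad `except Exception` around an optional import, `extract_pages = None`, and `index_pdf` returning `[]`. The scenario's actual code is not shown here, so the following is only a rough sketch of what such a pattern might look like, next to one possible remediation in the spirit of the "Remediation" item. The module name `hypothetical_pdf_backend`, the `_silent` and `_strict` suffixes, `load_backend`, and `PdfBackendUnavailable` are all assumptions made for illustration.

```python
import logging

logger = logging.getLogger("pdf_index")

# --- Pattern the checklist flags (hypothetical reconstruction) ---
try:
    # Assumed module name; it stands in for the optional PDF dependency.
    from hypothetical_pdf_backend import extract_pages
except Exception:  # broad: swallows ImportError and any bug raised at import time
    extract_pages = None


def index_pdf_silent(path):
    """Flagged behavior: a missing backend is indistinguishable from an empty PDF."""
    if extract_pages is None:
        return []  # no warning, no error, no metric
    return list(extract_pages(path))


# --- One possible remediation (a sketch, not the scenario's reference fix) ---
class PdfBackendUnavailable(RuntimeError):
    """Raised when the PDF backend is required but cannot be imported."""


def load_backend():
    """Import the backend, catching only ImportError and reporting it loudly."""
    try:
        from hypothetical_pdf_backend import extract_pages as backend
    except ImportError as exc:
        logger.error("PDF backend missing; indexing is unavailable: %s", exc)
        raise PdfBackendUnavailable("PDF backend is not installed") from exc
    return backend


def index_pdf_strict(path):
    """Fail explicitly instead of returning an empty index."""
    backend = load_backend()
    return list(backend(path))
```

With the backend absent, `index_pdf_silent` returns an empty list that callers cannot tell apart from a parsed but empty document, while `index_pdf_strict` raises `PdfBackendUnavailable`. A test suite in line with the "Test weakness" item would presumably assert the explicit failure on the missing-dependency path and real parsing on the installed path.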