Agent-native E2E runtime with verifiable safety. 13 MCP tools including alethia_propose_tests (agent generates tests from a URL), alethia_assert_safety (proves destructive actions are blocked), and the expect block: NLP primitive unique to Alethia. Zero-IPC, ~45x faster than Playwright, signed evidence packs. Works with Claude Code, Cursor, Cline.
95
94%
Does it follow best practices?
Impact
97%
2.77xAverage score across 5 eval scenarios
Advisory
Suggest reviewing before use
{
"context": "Tests whether the agent follows the correct three-step Alethia workflow: calling alethia_status first to check runtime health, then alethia_compile to preview the Action IR, then alethia_tell to execute. Also checks that the agent knows to inspect PlanRun response fields and includes a name parameter in alethia_tell.",
"type": "weighted_checklist",
"checklist": [
{
"name": "alethia_status first",
"description": "The first tool call in test_script.json is alethia_status (before alethia_compile and alethia_tell)",
"max_score": 15
},
{
"name": "alethia_compile second",
"description": "alethia_compile appears as a step before alethia_tell in test_script.json",
"max_score": 15
},
{
"name": "alethia_tell present",
"description": "alethia_tell appears as a step in test_script.json",
"max_score": 10
},
{
"name": "name parameter in alethia_tell",
"description": "The alethia_tell step in test_script.json includes a 'name' parameter",
"max_score": 10
},
{
"name": "run.ok check documented",
"description": "test_plan.md or test_script.json mentions checking run.ok from the PlanRun response",
"max_score": 10
},
{
"name": "stepRuns inspection documented",
"description": "test_plan.md mentions inspecting stepRuns (per-step results) from the PlanRun response",
"max_score": 10
},
{
"name": "policyAudits documented",
"description": "test_plan.md mentions reviewing policyAudits from the PlanRun response",
"max_score": 10
},
{
"name": "kill switch check documented",
"description": "test_plan.md or test_script.json describes verifying the kill switch is inactive as part of the status check",
"max_score": 10
},
{
"name": "compile purpose documented",
"description": "test_plan.md explains that alethia_compile is used to catch NLP issues before execution (not just as an optional step)",
"max_score": 10
}
]
}