Python module to run and analyze benchmarks with high precision and statistical rigor
Pending
Does it follow best practices?
Impact
No eval scenarios have been run