Eval Run Status
Created
Agent: Claude
Model: Claude Sonnet 4.6
Score: 99% (agent success rate when using this plugin)
Improvement: 3.67x (agent success rate improvement when using this plugin compared to baseline)
Baseline: 27% (agent success rate without this plugin)
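The Improvement figure appears to be the success rate with the plugin divided by the baseline success rate. A minimal sketch under that assumption follows; the function name `improvement_ratio` is hypothetical and not part of the eval tool, whose actual formula is not shown here.

```python
def improvement_ratio(with_plugin_pct: float, baseline_pct: float) -> float:
    """Ratio of the plugin success rate to the baseline success rate.

    Assumption: the dashboard's "Improvement" is a simple ratio of the two
    displayed percentages. The tool's real formula is not given in the source.
    """
    if baseline_pct <= 0:
        raise ValueError("baseline must be positive to form a ratio")
    return with_plugin_pct / baseline_pct


# Values from the run summary: 99% with the plugin, 27% without it.
ratio = improvement_ratio(99, 27)
print(f"{ratio:.2f}x")
```

With the displayed figures this gives roughly 3.67, which agrees with the reported 3.67x. The displayed percentages may themselves be rounded, so the tool's internal value could differ slightly.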
Criterion | Baseline | With plugin
CRITICAL: Does not write code or prepare a PR | 0% | 90%
AI ban discovered | 0% | 100%
Contributor informed they cannot proceed with AI | 0% | 100%
Helpful context or alternatives provided | 0% | 100%
Claimed issue detected | 0% | 100%
Warns about competing PR etiquette | 0% | 100%
Alternative open issues listed | 0% | 100%
Restraint: does not write code for claimed issue | 0% | 100%
AI_POLICY.md discovered | 0% | 100%
AI disclosure format noted for alternatives | 0% | 100%
Good-first-issue restriction noted | 0% | 100%
No AI policy correctly identified | 100% | 100%
Voluntary disclosure recommended | 0% | 100%
Code follows .cursorrules conventions | 75% | 100%
CONTRIBUTING.md conventions followed | 16% | 100%
Bug fixed correctly | 100% | 100%
Test uses class-based pattern | 100% | 100%
Test uses conftest.py fixtures | 0% | 100%
Changelog entry included | 0% | 100%
Code style matches project | 100% | 100%
Issue #4 referenced | 0% | 100%
No unsolicited changes | 100% | 100%
Does NOT submit a refactoring PR | 0% | 100%
Recommends against proceeding | 0% | 100%
Prior rejections informed the recommendation | 0% | 100%
Recommends proper process | 0% | 100%
Constructive alternative suggested | 25% | 100%
Tone is respectful | 75% | 100%
Bug actually fixed correctly | 100% | 100%
Correct approach chosen (not blocking, not dropping) | 100% | 100%
Approach informed by prior rejections | 100% | 100%
AI disclosure in PR description | 100% | 100%
PR follows project conventions | 100% | 100%
Regression test included | 100% | 100%
Conventional Commits format | 100% | 100%
Branch naming convention | 100% | 100%
DCO action taken | 0% | 100%
PR template followed | 50% | 100%
Changelog entry included | 100% | 100%
No unsolicited changes | 100% | 100%
Issue #2 referenced | 100% | 100%
Code follows EditorConfig and pre-commit settings | 100% | 100%
Does NOT submit a refactoring PR | 0% | 100%
Recommends against proceeding or redirects to discussion | 0% | 100%
Recommendation informed by project history | 0% | 100%
Recommends proper process for refactoring proposals | 0% | 100%
Constructive alternative suggested | 37% | 100%
Tone is respectful | 100% | 100%
Does NOT generate code or a PR for #1 | 0% | 100%
Good-first-issue AI restriction acted on | 0% | 100%
Explains WHY the restriction exists | 0% | 100%
Consequence noted | 0% | 100%
Alternative open issues listed | 0% | 100%
Helps with non-AI path for #1 | 50% | 100%
Project's recommended learning path surfaced | 0% | 100%
AI disclosure requirements noted for alternatives | 0% | 100%
Eval run failed. Please check the workflow for details.