Name: coding-agent-helpers/skeptic-verifier
Rating: 84.9 (1 review)
Author: coding-agent-helpers


coding-agent-helpers/skeptic-verifier

Use when the user wants an adversarial double-check of a code or config change. Run the strongest checks available, try to break the claim, look for edge cases and hidden regressions, and return PASS, PARTIAL, or FAIL with evidence. Good triggers include "poke holes in this", "stress test this change", "double check this fix", and "try to break it".
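
As an illustration of the report shape the evaluation criteria on this page check for (Claim, Attempts, Evidence, Verdict, Remaining Risks), here is a minimal Python sketch. The `VerifierReport` class, its field names, and the `problems` helper are hypothetical and not part of the skill itself. The rule that Remaining Risks is expected for any non-PASS verdict is an assumption, since the page only says that section is conditional.

```python
from dataclasses import dataclass, field

# The three verdict values named in the skill description.
VALID_VERDICTS = {"PASS", "PARTIAL", "FAIL"}


@dataclass
class VerifierReport:
    """Hypothetical container for a skeptic-verifier report."""

    claim: str                 # one-sentence statement of what is being verified
    attempts: list[str]        # falsification attempts that were tried
    evidence: list[str]        # execution output or other concrete evidence
    verdict: str               # PASS, PARTIAL, or FAIL
    remaining_risks: list[str] = field(default_factory=list)

    def problems(self) -> list[str]:
        """Return a list of format problems; an empty list means well-formed."""
        issues = []
        if not self.claim.strip():
            issues.append("claim is empty")
        if not self.attempts:
            issues.append("no attempts listed")
        if not self.evidence:
            issues.append("no evidence listed")
        if self.verdict not in VALID_VERDICTS:
            issues.append(f"invalid verdict: {self.verdict!r}")
        # Assumption: a non-PASS verdict should explain what is still at risk.
        if self.verdict in {"PARTIAL", "FAIL"} and not self.remaining_risks:
            issues.append("remaining risks missing for non-PASS verdict")
        return issues


if __name__ == "__main__":
    report = VerifierReport(
        claim="The pagination fix returns the right page count for exact multiples.",
        attempts=["Called the utility with total=20, page_size=10"],
        evidence=["Static review only; the test suite could not be run"],
        verdict="PARTIAL",
    )
    for issue in report.problems():
        print(issue)
```

A check like this only validates the format of a report. It says nothing about whether the verdict is actually supported by the evidence, which is what most of the criteria below grade.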

1.30x

Quality

94%

Does it follow best practices?

Impact

81%

1.30x

Average score across 8 eval scenarios

Security

Passed

No known issues

Evaluation results

88%

49%

Verify the Off-by-One Fix in the Pagination Utility

Output format compliance

Criteria

Without context

With context

Claim section present

100%

One-sentence claim

100%

Attempts section present

100%

Evidence section present

100%

Verdict section present

100%

Valid verdict value

100%

Remaining Risks conditional

100%

Functional proof attempted

50%

28%

Attempts are falsification-oriented

85%

Covers exact-multiple boundary

100%

63%

Double-Check the Input Sanitization Fix

Falsification bias in verification plan

Criteria

Without context

With context

Falsification-framed attempts

78%

85%

No confirmation bias

90%

70%

Addresses incomplete edge cases

25%

33%

Addresses stale test assumptions

90%

Shortest path to disproof

60%

80%

Functional proof attempted

66%

Evidence contains execution output

70%

40%

Valid verdict

37%

Correct output format

12%

25%

Claim graded not code quality

83%

100%

96%

18%

Stress Test the Email Deduplication Fix

Failure mode identification

Criteria

Without context

With context

Identifies race condition risk

100%

Addresses environment-specific breakage

100%

Addresses edge cases

100%

Addresses stale assumptions

80%

100%

Happy-path behavior examined

50%

Failure modes framed adversarially

80%

100%

Valid verdict assigned

100%

Verdict matches evidence

100%

Correct output sections

25%

100%

Claim graded not style

100%

41%

Verify the Browser Session Persistence Fix

Static review acknowledgment

Criteria

Without context

With context

Acknowledges static review

100%

Verdict is not PASS

100%

Environment limitation cited

100%

Remaining Risks section present

60%

100%

BroadcastChannel compatibility noted

100%

Race condition identified

100%

Correct output sections

37%

100%

Valid verdict value

100%

One-sentence claim

100%

Token leak risk considered

100%

85%

33%

Verify the Database Migration Script Safety Claim

PARTIAL verdict for environment-blocked verification

Criteria

Without context

With context

Verdict is PARTIAL not PASS

77%

100%

Environment block cited

35%

Static review acknowledged

70%

100%

Remaining Risks present

100%

Identifies batching gap

40%

Lock risk examined

100%

Idempotency claim checked

100%

Valid verdict value

100%

Correct output sections

16%

100%

One-sentence claim

100%

71%

Verify the Retry Logic Fix

Grade the claim not code quality

Criteria

Without context

With context

No code quality criticism in verdict

100%

Claim-based verdict

91%

100%

Functional proof executed

100%

Backoff calculation verified

100%

Verdict matches actual behavior

80%

100%

Attempts are falsification-oriented

40%

Valid verdict value

Correct output format

12%

One-sentence claim

Evidence contains execution output

83%

100%

69%

-9%

Try to Break the Phone Number Normalization Fix

FAIL verdict with counterexample

Criteria

Without context

With context

FAIL verdict issued

71%

Specific counterexample documented

100%

Counterexample found by running code

83%

41%

Shortest path used

80%

90%

International formats tested

100%

Correct output sections

25%

One-sentence claim

Verdict not based on code quality

100%

87%

Remaining Risks or gaps described

87%

Falsification-oriented attempts

90%

70%

83%

18%

Poke Holes in the Configuration Validator

Functional proof over code inspection

Criteria

Without context

With context

Code actually executed

62%

81%

Adversarial inputs tested

100%

Python bool edge case tested

100%

Boundary values tested

100%

Evidence is concrete

70%

80%

Correct output sections

25%

37%

Valid verdict value

100%

Verdict grounded in test results

87%

100%

One-sentence claim

25%

Functional proof preferred over inspection

62%

87%

Evaluated: 10 days ago
Agent: Claude Code
Model: Claude Sonnet 4.6
