coding-agent-helpers/skeptic-verifier

Use when the user wants an adversarial double-check of a code or config change. Run the strongest checks available, try to break the claim, look for edge cases and hidden regressions, and return PASS, PARTIAL, or FAIL with evidence. Good triggers include "poke holes in this", "stress test this change", "double check this fix", and "try to break it".
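The description names three verdict values, and the evaluation criteria further down this page refer to Claim, Attempts, Evidence, Verdict, and Remaining Risks sections. The sketch below shows one way such a report could be checked mechanically. It is an illustration only: the heading syntax (a section name on its own line, optionally followed by a colon), the helper names `parse_sections` and `check_report`, and the sample report are all assumptions, not the skill's actual format or the grader used for these evals.

```python
REQUIRED_SECTIONS = ("Claim", "Attempts", "Evidence", "Verdict")
VALID_VERDICTS = {"PASS", "PARTIAL", "FAIL"}


def parse_sections(report: str) -> dict[str, list[str]]:
    """Group non-empty lines under the most recent recognised heading.

    Assumption: a heading is a line containing only the section name,
    optionally followed by a colon.
    """
    known = set(REQUIRED_SECTIONS) | {"Remaining Risks"}
    sections: dict[str, list[str]] = {}
    current = None
    for raw in report.splitlines():
        line = raw.strip()
        name = line.rstrip(":")
        if name in known:
            current = name
            sections[current] = []
        elif line and current is not None:
            sections[current].append(line)
    return sections


def check_report(report: str) -> dict[str, bool]:
    """Return one boolean per structural check."""
    sections = parse_sections(report)
    results = {
        f"{name} section present": name in sections
        for name in REQUIRED_SECTIONS
    }
    verdict_lines = sections.get("Verdict", [])
    # The first line under "Verdict" must be exactly one of the three values.
    results["Valid verdict value"] = (
        bool(verdict_lines) and verdict_lines[0] in VALID_VERDICTS
    )
    return results


# Hypothetical report, written only to exercise the checker.
SAMPLE_REPORT = """Claim:
The fix returns the right page count for exact multiples of the page size.
Attempts:
- Called the helper with total=20 and size=10.
Evidence:
- Output omitted in this sketch.
Verdict:
PARTIAL
Remaining Risks:
- Empty input was not exercised.
"""

if __name__ == "__main__":
    for check, ok in check_report(SAMPLE_REPORT).items():
        print(f"{check}: {ok}")
```

A report whose verdict line reads something other than PASS, PARTIAL, or FAIL fails the last check even when every section is present.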

84 · 1.30x

Quality: 94% (Does it follow best practices?)

Impact: 81% · 1.30x (average score across 8 eval scenarios)

Security by Snyk: Passed (no known issues)

Evaluation results

88% · 49%

Verify the Off-by-One Fix in the Pagination Utility

Output format compliance

| Criteria | Without context | With context |
| --- | --- | --- |
| Claim section present | 0% | 100% |
| One-sentence claim | 0% | 100% |
| Attempts section present | 0% | 100% |
| Evidence section present | 0% | 100% |
| Verdict section present | 0% | 100% |
| Valid verdict value | 0% | 100% |
| Remaining Risks conditional | 100% | 100% |
| Functional proof attempted | 50% | 28% |
| Attempts are falsification-oriented | 85% | 85% |
| Covers exact-multiple boundary | 100% | 100% |
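The "exact-multiple boundary" criterion in this scenario refers to a classic pagination pitfall: the page count when the item total divides evenly by the page size. The page does not show the pagination utility itself, so both functions below are hypothetical stand-ins. The sketch only illustrates how a falsification-oriented probe around that boundary might look.

```python
import math


def page_count_buggy(total_items: int, page_size: int) -> int:
    # Hypothetical bug: always adds a page, even when the division is exact.
    return total_items // page_size + 1


def page_count(total_items: int, page_size: int) -> int:
    # Hypothetical fix: integer ceiling division.
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    if total_items < 0:
        raise ValueError("total_items must be non-negative")
    return (total_items + page_size - 1) // page_size


def find_counterexamples(fn, page_size: int = 10, max_total: int = 30):
    """Compare fn against an independent oracle for totals 0..max_total.

    Returns (total, actual, expected) for every mismatch.
    """
    failures = []
    for total in range(max_total + 1):
        expected = math.ceil(total / page_size)
        actual = fn(total, page_size)
        if actual != expected:
            failures.append((total, actual, expected))
    return failures


if __name__ == "__main__":
    print("buggy:", find_counterexamples(page_count_buggy))
    print("fixed:", find_counterexamples(page_count))
```

Sweeping every total rather than a handful of typical values is what surfaces the failures at 0, 10, 20, and 30 for the buggy variant, which a happy-path test with a total such as 25 would miss.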

63% · 4%

Double-Check the Input Sanitization Fix

Falsification bias in verification plan

| Criteria | Without context | With context |
| --- | --- | --- |
| Falsification-framed attempts | 78% | 85% |
| No confirmation bias | 90% | 70% |
| Addresses incomplete edge cases | 25% | 33% |
| Addresses stale test assumptions | 90% | 90% |
| Shortest path to disproof | 60% | 80% |
| Functional proof attempted | 66% | 66% |
| Evidence contains execution output | 70% | 40% |
| Valid verdict | 0% | 37% |
| Correct output format | 12% | 25% |
| Claim graded not code quality | 83% | 100% |

96% · 18%

Stress Test the Email Deduplication Fix

Failure mode identification

| Criteria | Without context | With context |
| --- | --- | --- |
| Identifies race condition risk | 100% | 100% |
| Addresses environment-specific breakage | 100% | 100% |
| Addresses edge cases | 100% | 100% |
| Addresses stale assumptions | 80% | 100% |
| Happy-path behavior examined | 50% | 50% |
| Failure modes framed adversarially | 80% | 100% |
| Valid verdict assigned | 0% | 100% |
| Verdict matches evidence | 100% | 100% |
| Correct output sections | 25% | 100% |
| Claim graded not style | 100% | 100% |

100% · 41%

Verify the Browser Session Persistence Fix

Static review acknowledgment

| Criteria | Without context | With context |
| --- | --- | --- |
| Acknowledges static review | 100% | 100% |
| Verdict is not PASS | 100% | 100% |
| Environment limitation cited | 0% | 100% |
| Remaining Risks section present | 60% | 100% |
| BroadcastChannel compatibility noted | 100% | 100% |
| Race condition identified | 100% | 100% |
| Correct output sections | 37% | 100% |
| Valid verdict value | 0% | 100% |
| One-sentence claim | 0% | 100% |
| Token leak risk considered | 0% | 100% |

85% · 33%

Verify the Database Migration Script Safety Claim

PARTIAL verdict for environment-blocked verification

| Criteria | Without context | With context |
| --- | --- | --- |
| Verdict is PARTIAL not PASS | 77% | 100% |
| Environment block cited | 0% | 35% |
| Static review acknowledged | 70% | 100% |
| Remaining Risks present | 100% | 100% |
| Identifies batching gap | 40% | 40% |
| Lock risk examined | 100% | 100% |
| Idempotency claim checked | 100% | 100% |
| Valid verdict value | 0% | 100% |
| Correct output sections | 16% | 100% |
| One-sentence claim | 0% | 100% |

71% · 5%

Verify the Retry Logic Fix

Grade the claim not code quality

| Criteria | Without context | With context |
| --- | --- | --- |
| No code quality criticism in verdict | 100% | 100% |
| Claim-based verdict | 91% | 100% |
| Functional proof executed | 100% | 100% |
| Backoff calculation verified | 100% | 100% |
| Verdict matches actual behavior | 80% | 100% |
| Attempts are falsification-oriented | 40% | 40% |
| Valid verdict value | 0% | 0% |
| Correct output format | 0% | 12% |
| One-sentence claim | 0% | 0% |
| Evidence contains execution output | 83% | 100% |

69% · -9%

Try to Break the Phone Number Normalization Fix

FAIL verdict with counterexample

| Criteria | Without context | With context |
| --- | --- | --- |
| FAIL verdict issued | 71% | 71% |
| Specific counterexample documented | 100% | 100% |
| Counterexample found by running code | 83% | 41% |
| Shortest path used | 80% | 90% |
| International formats tested | 100% | 100% |
| Correct output sections | 25% | 0% |
| One-sentence claim | 0% | 0% |
| Verdict not based on code quality | 100% | 87% |
| Remaining Risks or gaps described | 87% | 87% |
| Falsification-oriented attempts | 90% | 70% |

83% · 18%

Poke Holes in the Configuration Validator

Functional proof over code inspection

| Criteria | Without context | With context |
| --- | --- | --- |
| Code actually executed | 62% | 81% |
| Adversarial inputs tested | 100% | 100% |
| Python bool edge case tested | 100% | 100% |
| Boundary values tested | 100% | 100% |
| Evidence is concrete | 70% | 80% |
| Correct output sections | 25% | 37% |
| Valid verdict value | 0% | 100% |
| Verdict grounded in test results | 87% | 100% |
| One-sentence claim | 0% | 25% |
| Functional proof preferred over inspection | 62% | 87% |
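The page does not show the configuration validator or say exactly what the "Python bool edge case" is. A likely reading is the well-known fact that `bool` is a subclass of `int` in Python, so a plain `isinstance(value, int)` check accepts `True`. The sketch below runs adversarial and boundary probes against two hypothetical validators; the port-range rule and every name here are assumptions chosen for illustration.

```python
def validate_port_naive(value: object) -> bool:
    # Accepts True: bool is a subclass of int, and True == 1 is in range.
    return isinstance(value, int) and 1 <= value <= 65535


def validate_port(value: object) -> bool:
    # Reject bool explicitly before the integer range check.
    if isinstance(value, bool) or not isinstance(value, int):
        return False
    return 1 <= value <= 65535


# Adversarial and boundary probes: bools, both edges of the range,
# one step outside each edge, and non-integer types.
PROBES = [True, False, 0, 1, 65535, 65536, -1, "80", 80.0, None]


def disagreements(probes):
    """Return the probes on which the two validators give different answers."""
    return [p for p in probes if validate_port_naive(p) != validate_port(p)]


if __name__ == "__main__":
    for probe in PROBES:
        print(repr(probe), validate_port_naive(probe), validate_port(probe))
    print("disagree on:", disagreements(PROBES))
```

Executing the probes, rather than reading the validator, is what exposes the `True` case, which matches the scenario's emphasis on functional proof over code inspection.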

Evaluated
Agent: Claude Code
Model: Claude Sonnet 4.6
