evilissimo/implementation-integrity-review

Reviews repositories, pull requests, diffs, and agent-generated code for reward hacking, fake completion, defensive theater, architectural bypasses, weakened guarantees, hidden fallbacks, and misleading abstractions.

Quality: 97% (Does it follow best practices?)

Impact: 100%, 1.09x (average score across 6 eval scenarios)

Security: Passed, no known issues

Scoring Model

Name: evilissimo/implementation-integrity-review
Rating: 98.5 (1 review)
Author: evilissimo

Use this model to rank implementation-integrity findings. Severity measures the impact of the integrity failure. Confidence measures how strongly the evidence supports the claim.

Severity

Critical

Use Critical when the change creates false confidence around a core contract or safety property.

Examples:

Tests or CI are manipulated so broken production behavior appears verified.
Required security, authorization, tenancy, data-loss, payment, or compliance guarantees are bypassed.
A production path returns success while the requested work does not happen.
Operators lose the ability to detect a high-impact failure.
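To make the third example concrete, here is a minimal Python sketch of a production path that reports success without doing the work, next to an honest counterpart. Every name in it (`Mailer`, `send_receipt_fake`, `send_receipt`) is hypothetical and chosen only for illustration; it is not taken from any reviewed codebase.

```python
from dataclasses import dataclass, field


@dataclass
class Mailer:
    """Hypothetical stand-in for a real delivery client."""
    sent: list = field(default_factory=list)

    def deliver(self, to: str, body: str) -> None:
        # Record the delivery so the side effect can be observed.
        self.sent.append((to, body))


def send_receipt_fake(mailer: Mailer, to: str, body: str) -> dict:
    # Integrity failure: reports success but never calls the mailer.
    return {"status": "sent"}


def send_receipt(mailer: Mailer, to: str, body: str) -> dict:
    # Honest version: success is reported only after the work happened.
    mailer.deliver(to, body)
    return {"status": "sent"}
```

Both functions return the same payload, so a test that asserts only on the return value passes for either one. A reviewer would look for an assertion on the side effect itself, here the contents of `mailer.sent`.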

High

Use High when the implementation materially breaks promised behavior or hides important failure, but the blast radius is narrower than Critical.

Examples:

Validation, persistence, external calls, or side effects are weakened for a user-facing workflow.
Errors are swallowed and callers receive misleading success or empty data.
A fallback silently changes semantics in production.
Architecture bypass skips shared policy, transaction, or observability layers.
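The swallowed-error example above might look like the following Python sketch. The names `UpstreamError`, `fetch_orders`, `list_orders_swallowing`, and `list_orders` are hypothetical, and the upstream call is hard-coded to fail so the difference is visible.

```python
class UpstreamError(Exception):
    """Hypothetical error raised when the backing service fails."""


def fetch_orders(user_id: str) -> list:
    # Hypothetical upstream call that always fails in this sketch.
    raise UpstreamError(f"orders service unavailable for {user_id}")


def list_orders_swallowing(user_id: str) -> list:
    # Integrity failure: the caller cannot tell "no orders" from "lookup failed".
    try:
        return fetch_orders(user_id)
    except Exception:
        return []


def list_orders(user_id: str) -> list:
    # Honest version: the failure propagates so callers and operators see it.
    return fetch_orders(user_id)
```

With the swallowing variant, a user-facing page would show an empty order history during an outage, which is the misleading empty data the example describes.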

Medium

Use Medium when the issue creates meaningful maintenance or correctness risk but does not currently prove broad user-facing failure.

Examples:

A fake provider abstraction or compatibility layer invites incorrect future use.
Resiliency code advertises protection it does not provide for a limited path.
Tests are weaker than they should be, but other coverage still guards the main contract.
A boundary bypass duplicates domain logic without immediate high-impact harm.
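As one possible illustration of resiliency code that advertises protection it does not provide, the Python sketch below contrasts a helper whose name and parameter promise retries with one that actually retries. Both helpers and their signatures are hypothetical.

```python
def call_with_retries_fake(fn, attempts: int = 3):
    # Integrity concern: the name and parameter promise retries,
    # but fn runs once and any failure is turned into None.
    try:
        return fn()
    except Exception:
        return None


def call_with_retries(fn, attempts: int = 3):
    # Honest version: tries up to `attempts` times, then re-raises the last error.
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return fn()
        except Exception as exc:
            last_error = exc
    raise last_error
```

If the misleading helper guards only a limited path, it fits the Medium description; the same pattern on a core workflow could warrant a higher severity.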

Low

Use Low for localized integrity concerns with limited impact and clear containment.

Examples:

A misleading status, log, or health detail could confuse operators but does not hide a major failure.
A minor fallback path is under-specified but not on the primary workflow.
A narrow abstraction is unnecessary but not yet harmful.

Do not report purely stylistic, formatting, naming, or generic lint concerns as implementation-integrity findings.

Confidence

High

Use High confidence when the code, tests, and contract all line up:

the promised behavior is clear
the violating code path is reachable
evidence includes exact references
no legitimate pattern explains the code

Medium

Use Medium confidence when the evidence is strong but one part needs verification, such as runtime reachability, deployment configuration, or an external dependency behavior.

Low

Use Low confidence for plausible leads that need confirmation and still carry enough risk to mention. Avoid Low-confidence findings unless the potential impact is High or Critical.
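The confidence rules above can be sketched as a small Python helper. This is one reading, not a definitive implementation: the function names are hypothetical, and treating "one part needs verification" as exactly one unmet High-confidence check is an assumption made for this sketch.

```python
from enum import IntEnum


class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


class Confidence(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def assign_confidence(contract_clear: bool, path_reachable: bool,
                      exact_references: bool, no_legitimate_pattern: bool) -> Confidence:
    checks = (contract_clear, path_reachable, exact_references, no_legitimate_pattern)
    unmet = sum(1 for ok in checks if not ok)
    if unmet == 0:
        return Confidence.HIGH
    if unmet == 1:
        # Assumption: exactly one unverified part maps to Medium.
        return Confidence.MEDIUM
    return Confidence.LOW


def worth_mentioning(severity: Severity, confidence: Confidence) -> bool:
    # Low-confidence leads are kept only when potential impact is High or Critical.
    if confidence is Confidence.LOW:
        return severity >= Severity.HIGH
    return True
```

For example, a lead with a clear contract and exact references, but unverified reachability, would come out as Medium confidence under these assumptions.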

Reporting Threshold

Report a finding when:

The contract or expectation is identifiable.
The implementation materially violates or misrepresents that contract.
The finding can cite exact code or test evidence.
The integrity-first rule would not classify it as mere style or cleanup.

If those conditions are not met, either omit the issue or list it as a review limit/open question rather than a finding.
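A minimal Python sketch of this threshold follows. The `Candidate` fields and the `triage` function are hypothetical names. The text does not say when to omit an issue versus list it as an open question, so the split used here (style-only issues are omitted, everything else that falls short becomes an open question) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    contract_identifiable: bool
    materially_violates: bool
    has_exact_evidence: bool
    is_style_only: bool  # what the integrity-first rule would call style or cleanup


def triage(c: Candidate) -> str:
    if c.is_style_only:
        # Assumption: style or cleanup items are omitted rather than listed.
        return "omit"
    if c.contract_identifiable and c.materially_violates and c.has_exact_evidence:
        return "finding"
    # Assumption: anything else is surfaced as a review limit or open question.
    return "open_question"
```

A candidate with a clear contract and a material violation but no exact code or test reference would be triaged as `"open_question"` until the evidence can be cited.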

Ordering

Order findings by:

1. higher severity
2. higher confidence
3. broader blast radius
4. stronger evidence of false confidence

Within the same severity, lead with issues that mislead users, operators, tests, or reviewers before issues that mainly create future maintenance risk.
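The ordering can be expressed as a composite sort key, as in the Python sketch below. The `Finding` fields are hypothetical: `blast_radius` and `false_confidence` are assumed to be small integer scores where larger means broader or stronger, and `misleads` marks issues that mislead users, operators, tests, or reviewers. Applying `misleads` as the final tie-breaker is an assumption; the text only says it applies within the same severity.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


class Confidence(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class Finding:
    title: str
    severity: Severity
    confidence: Confidence
    blast_radius: int = 0      # assumed score: larger means broader
    false_confidence: int = 0  # assumed score: larger means stronger evidence
    misleads: bool = False     # True if it misleads users, operators, tests, or reviewers


def order_findings(findings: list) -> list:
    # Negate the scores so that higher values sort first;
    # `not misleads` puts misleading issues ahead on a full tie.
    return sorted(
        findings,
        key=lambda f: (-f.severity, -f.confidence, -f.blast_radius,
                       -f.false_confidence, not f.misleads),
    )
```

Because `sorted` is stable, findings that tie on every key keep the order in which they were discovered.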
