Verifies that a refactoring or transformation preserved observable behavior, using before/after execution comparison, differential testing, or I/O capture. Use it after a refactoring, after an automated code transformation, before merging a structural PR, or whenever the claim is that two code versions do the same thing.
Install with the Tessl CLI:

```
npx tessl i github:santosomar/general-secure-coding-agent-skills --skill behavior-preservation-checker97
```
"It's just a refactor" is a claim. This skill checks the claim: does the new code produce the same observable behavior as the old code on the inputs that matter?
| Approach | Checks | Cost | Confidence |
|---|---|---|---|
| Run the existing tests | Whatever the tests assert | Free | As good as your test suite — often not very |
| Differential testing | Old and new produce same output on random/prod inputs | Low | High where you can enumerate inputs |
| Golden-master / snapshot | Output matches a recorded baseline byte-for-byte | Low | Very high for serialized output; brittle |
| Side-effect capture | Same DB writes, same HTTP calls, same log lines | Medium | Catches effects tests usually miss |
| Property-based equivalence | ∀x. old(x) == new(x) over generated inputs | Medium | High for pure functions |
| Formal equivalence proof | Proven equal by construction | High | Absolute — → semantic-equivalence-verifier |
Use the cheapest one that gives you the confidence you need. Differential testing covers 90% of cases.
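The property-based row from the table can be sketched without pulling in a property-testing library: generate random inputs yourself and assert `old(x) == new(x)` on each. The functions below are hypothetical stand-ins for a pure function before and after a refactor; in a real project you would use a library like Hypothesis instead of hand-rolled generation.

```python
# Minimal property-style equivalence check: for randomly generated inputs x,
# assert old(x) == new(x). old_slugify / new_slugify are hypothetical
# stand-ins for the two versions of a pure function under comparison.
import random
import re

def old_slugify(s: str) -> str:
    # original: collapse runs of non-[a-z0-9] into single dashes
    return re.sub(r"[^a-z0-9]+", "-", s.lower()).strip("-")

def new_slugify(s: str) -> str:
    # refactored version under test: same behavior via split/join
    kept = "".join(c if c.isascii() and c.isalnum() else " " for c in s.lower())
    return "-".join(kept.split())

def check_equivalence(trials: int = 1000) -> None:
    rng = random.Random(0)  # fixed seed so a counterexample is reproducible
    alphabet = "aAzZ09 _-!,\u00e9\u00df"  # letters, digits, separators, non-ASCII
    for _ in range(trials):
        s = "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 20)))
        assert old_slugify(s) == new_slugify(s), f"diverged on {s!r}"

check_equivalence()
```

The fixed seed is deliberate: when a divergence appears, you want the same failing input on every rerun.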
```
for each input in <sample>:
    old_out = old_version(input)
    new_out = new_version(input)
    if old_out != new_out:
        REPORT divergence
```

Where `<sample>` comes from: a snapshot of production inputs, random generation, or the existing test fixtures.
Decide before comparing — not after you see a diff:
| Observable | Must match? |
|---|---|
| Return value | Yes — by definition |
| Exception type + message | Type yes; message… usually yes but debatable |
| Side effects (DB, files, network) | Yes — this is where refactors silently break |
| Side-effect order | Depends — was order specified, or incidental? |
| Log output | Usually no — logs are diagnostics, not contract |
| Timing / performance | Usually no — unless that's the contract |
| Iteration order | Depends — was it dict (unordered pre-3.7) or list (ordered)? |
| Float precision | Equal within ε, not bit-exact — define ε upfront |
Write down the equivalence relation. "Same return value, same DB writes (order-insensitive), ignore logs, floats within 1e-9."
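That written-down relation translates directly into a comparator. A sketch, assuming a hypothetical `Result` container holding one run's observables (`value`, `db_writes`, `logs` are illustrative names, not anything the skill prescribes):

```python
# The equivalence relation as code: same return value (floats within 1e-9),
# same DB writes ignoring order, logs captured but never compared.
from dataclasses import dataclass, field

EPSILON = 1e-9  # defined upfront, before any diff is seen

@dataclass
class Result:
    value: float
    db_writes: list = field(default_factory=list)  # e.g. list of row tuples
    logs: list = field(default_factory=list)       # diagnostics, not contract

def equivalent(old: Result, new: Result) -> bool:
    if abs(old.value - new.value) > EPSILON:
        return False
    # order-insensitive comparison of side effects
    return sorted(old.db_writes) == sorted(new.db_writes)

# logs differ, writes reordered, values within epsilon: still equivalent
a = Result(10.0, [("orders", 1), ("audit", 1)], logs=["old impl"])
b = Result(10.0 + 1e-12, [("audit", 1), ("orders", 1)], logs=[])
assert equivalent(a, b)
```

Encoding the relation once keeps the comparison honest: you can't quietly loosen it after seeing a diff.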
Change: Refactored compute_tax(order) — was a 60-line method, now calls three helpers.
Setup: Both versions available — old commit checked out in a sibling worktree.
```python
# differential_test.py
from old.tax import compute_tax as old_compute
from new.tax import compute_tax as new_compute

def test_equivalence(sample_orders):  # 500 orders from prod snapshot
    for order in sample_orders:
        old = old_compute(order)
        new = new_compute(order)
        assert abs(old - new) < 0.001, f"diverged on {order.id}: {old} != {new}"
```

Run: 498 match. 2 diverge:
- old=12.50, new=12.49. Off by a cent.
- old=0.00, new=0.00. Wait — match? Rerun: old actually raised KeyError (the test swallowed it); new returned 0.00.

Findings: the one-cent case is a rounding change introduced by the refactor, a genuine regression. The KeyError case is a latent bug the refactor surfaced, not a regression: the old code crashed on that order, the new code returns a value.
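The KeyError case shows why comparing raw return values isn't enough: an exception on one side is an observable and must not vanish into a default. A sketch of an outcome wrapper (function names are illustrative, not from the skill):

```python
# Compare outcomes, not just return values: an exception is an observable.
# run() captures either the return value or the exception type.
def run(fn, *args):
    try:
        return ("ok", fn(*args))
    except Exception as e:
        return ("raised", type(e).__name__)

def old_fn(d):                      # hypothetical old version
    return d["price"]

def new_fn(d):                      # hypothetical refactored version
    return d.get("price", 0.0)

assert run(old_fn, {"price": 5}) == run(new_fn, {"price": 5})
# the divergence a naive comparison would miss: old raises, new returns 0.0
assert run(old_fn, {}) == ("raised", "KeyError")
assert run(new_fn, {}) == ("ok", 0.0)
```

Whether that divergence is a regression or a latent-bug fix is a judgment call, but the wrapper guarantees it gets reported instead of swallowed.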
For non-pure functions, return value isn't enough. Capture effects:
```python
with capture_sql() as old_queries:
    old_fn(x)
with capture_sql() as new_queries:
    new_fn(x)
assert normalize(old_queries) == normalize(new_queries)
```

`normalize` = sort if order doesn't matter, strip timestamps, etc. — per your equivalence relation.
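`capture_sql` is whatever your stack provides (a test-double connection, a driver hook); `normalize` is yours to write. A sketch of one, assuming queries arrive as strings, timestamps are volatile, and write order was deemed incidental:

```python
# normalize: strip volatile parts, collapse whitespace, sort.
# Sorting is only valid because the equivalence relation declared
# side-effect order incidental for this code path.
import re

TS = re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}")

def normalize(queries):
    cleaned = [TS.sub("<ts>", re.sub(r"\s+", " ", q.strip())) for q in queries]
    return sorted(cleaned)

old_q = ["INSERT INTO audit VALUES ('2024-01-05 10:00:01')",
         "UPDATE orders  SET total=5"]
new_q = ["UPDATE orders SET total=5",
         "INSERT INTO audit VALUES ('2024-01-06 09:30:22')"]
assert normalize(old_q) == normalize(new_q)
```

If order *was* part of the contract, drop the `sorted` and let reordering fail the comparison.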
```
## Equivalence relation
<return values | side effects | what's compared, what's ignored>

## Sample
<N> inputs from <source>

## Result
<N-k> equivalent
<k> divergent:
  input=<summary> old=<val> new=<val>
  <verdict: regression | latent-bug-fix | incidental | needs-review>

## Confidence
<high | medium | low — based on sample coverage>
```