Compares the runtime behavior of two or more versions of the same code by running them on identical inputs and diffing outputs, side effects, and errors. Use when validating a refactor, port, or optimization; when the user asks if two implementations behave the same; or when investigating a suspected regression across versions.
Install with Tessl CLI
npx tessl i github:santosomar/general-secure-coding-agent-skills --skill multi-version-behavior-comparator100
Quality
100%
Does it follow best practices?
Impact
Pending
No eval scenarios have been run
Differential testing, generalized: run N versions on the same inputs, and any disagreement is a bug somewhere. You don't need an oracle — the versions are each other's oracle.
Compared to → behavior-preservation-checker (yes/no answer, two versions): this handles N versions, and focuses on characterizing divergence, not just detecting it.
| Observable | How to capture | When it matters |
|---|---|---|
| Return value | Direct comparison (after normalization) | Always |
| Exceptions / errors | Type + message (message is often unstable — compare type) | Always |
| Side effects (I/O, DB) | Mock + record, or capture at the boundary | If the code does I/O |
| stdout/stderr | Capture streams | If output is the interface |
| Mutation of inputs | Deep-copy inputs before call, compare after | If inputs are mutable |
| Timing | Measure, but wide tolerance — don't flag unless 10×+ different | Performance regressions only |
| Resource usage | Memory high-water, FD count | Leak hunting |
Pick which observables matter before running. Diffing everything produces noise.
Differential testing is only as good as the inputs. Sources, in order of effort:
Two outputs can be "the same" without being byte-identical:
| Spurious diff | Normalize by |
|---|---|
Dict key order ({a:1, b:2} vs {b:2, a:1}) | Compare as dicts, not as strings/JSON |
Float rounding (0.1 + 0.2) | abs(a - b) < ε, or round to N digits |
| Timestamps, UUIDs in output | Regex-replace with placeholder before diff |
| Whitespace / formatting | Parse then compare ASTs, or normalize whitespace |
| Order of unordered collections | Sort before comparing (or use set equality) |
| Error message wording | Compare exception type, not .args |
Setup: Three implementations of urlparse — stdlib, a hand-rolled one from legacy code, and a port to Go (via subprocess).
Harness:
import subprocess, json, urllib.parse
from hypothesis import given, strategies as st
from legacy import parse_url as legacy_parse
def normalize(result: dict) -> dict:
# Port returns "" for missing, stdlib returns None. Unify.
return {k: (v if v else None) for k, v in result.items()}
def run_go(url: str) -> dict:
out = subprocess.run(["./go_urlparse"], input=url, capture_output=True, text=True)
return json.loads(out.stdout)
@given(st.text(alphabet=st.characters(blacklist_categories=["Cs"]), max_size=200))
def test_three_way(url):
try:
std = urllib.parse.urlparse(url)._asdict()
except Exception as e:
std = {"__error__": type(e).__name__}
try:
leg = legacy_parse(url)
except Exception as e:
leg = {"__error__": type(e).__name__}
go = run_go(url)
std_n, leg_n, go_n = normalize(std), normalize(leg), normalize(go)
if not (std_n == leg_n == go_n):
# Majority vote: 2-of-3 agree → the odd one out is probably wrong
if std_n == leg_n: suspect = "go"
elif std_n == go_n: suspect = "legacy"
elif leg_n == go_n: suspect = "stdlib" # unlikely but possible
else: suspect = "all disagree"
print(f"DIVERGENCE on {url!r}: suspect={suspect}")
print(f" stdlib: {std_n}\n legacy: {leg_n}\n go: {go_n}")Findings after 10k inputs:
| Input | stdlib | legacy | go | Suspect |
|---|---|---|---|---|
"http://[::1]:80/x" | host=::1 | host=[::1] | host=::1 | legacy |
"a://b?c=%" | query=c=% | query=c=% | ValueError | go |
"//foo" | netloc=foo | path=//foo | netloc=foo | legacy |
Legacy has IPv6 bracket handling wrong. Go is too strict on percent-decoding. Stdlib agrees with 2-of-3 every time — it's the reference.
With 3+ versions, disagreement is localizable: if 2 agree and 1 differs, the 1 is probably wrong. Not always — maybe the 2 share a bug. But it's a strong prior. Rank bugs by (minority size, input simplicity).
{"a":1,"b":2} ≠ {"b":2,"a":1} as strings, = as dicts. Parse then compare.None, empty, or malformed input — that's where bugs live.## Versions compared
A: <description/path/commit>
B: ...
C: ...
## Observables
<what was diffed — return value, exceptions, side effects>
## Normalization
<what was canonicalized before comparison>
## Input space
<how inputs were generated — boundary, random, mined>
<N inputs tested>
## Divergences
| Input | A | B | C | Suspect | Cluster |
| ----- | - | - | - | ------- | ------- |
## Clustered root causes
1. <cause>: <N> divergences, suspect=<version>
Example input: <minimal reproducer>47d56bb
If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.