CtrlK
BlogDocsLog inGet started
Tessl Logo

multi-version-behavior-comparator

Compares the runtime behavior of two or more versions of the same code by running them on identical inputs and diffing outputs, side effects, and errors. Use when validating a refactor, port, or optimization; when the user asks if two implementations behave the same; or when investigating a suspected regression across versions.

Install with Tessl CLI

npx tessl i github:santosomar/general-secure-coding-agent-skills --skill multi-version-behavior-comparator
What are skills?

100

Quality

100%

Does it follow best practices?

Impact

Pending

No eval scenarios have been run

SKILL.md
Review
Evals

Multi-Version Behavior Comparator

Differential testing, generalized: run N versions on the same inputs, and any disagreement is a bug somewhere. You don't need an oracle — the versions are each other's oracle.

Compared to → behavior-preservation-checker (yes/no answer, two versions): this handles N versions, and focuses on characterizing divergence, not just detecting it.

What counts as "behavior"

ObservableHow to captureWhen it matters
Return valueDirect comparison (after normalization)Always
Exceptions / errorsType + message (message is often unstable — compare type)Always
Side effects (I/O, DB)Mock + record, or capture at the boundaryIf the code does I/O
stdout/stderrCapture streamsIf output is the interface
Mutation of inputsDeep-copy inputs before call, compare afterIf inputs are mutable
TimingMeasure, but wide tolerance — don't flag unless 10×+ differentPerformance regressions only
Resource usageMemory high-water, FD countLeak hunting

Pick which observables matter before running. Diffing everything produces noise.

Input generation — where the coverage comes from

Differential testing is only as good as the inputs. Sources, in order of effort:

  1. Existing test inputs. If there's a test suite, extract the inputs. They're already interesting.
  2. Boundary sweep. Empty, single element, max size, zero, negative, None/null, unicode, very long strings. The usual suspects.
  3. Random/property-based. Hypothesis, QuickCheck. Great at finding the 17-element list where version A and B disagree.
  4. Coverage-guided. Use coverage of version A (or B) to guide input generation toward unexercised branches. AFL-style but for diffing.
  5. Mined from production. Logs, traces, recorded requests. Real-world distribution.

Normalization — avoiding false positives

Two outputs can be "the same" without being byte-identical:

Spurious diffNormalize by
Dict key order ({a:1, b:2} vs {b:2, a:1})Compare as dicts, not as strings/JSON
Float rounding (0.1 + 0.2)abs(a - b) < ε, or round to N digits
Timestamps, UUIDs in outputRegex-replace with placeholder before diff
Whitespace / formattingParse then compare ASTs, or normalize whitespace
Order of unordered collectionsSort before comparing (or use set equality)
Error message wordingCompare exception type, not .args

Worked example — 3-way comparison

Setup: Three implementations of urlparse — stdlib, a hand-rolled one from legacy code, and a port to Go (via subprocess).

Harness:

import subprocess, json, urllib.parse
from hypothesis import given, strategies as st
from legacy import parse_url as legacy_parse

def normalize(result: dict) -> dict:
    # Port returns "" for missing, stdlib returns None. Unify.
    return {k: (v if v else None) for k, v in result.items()}

def run_go(url: str) -> dict:
    out = subprocess.run(["./go_urlparse"], input=url, capture_output=True, text=True)
    return json.loads(out.stdout)

@given(st.text(alphabet=st.characters(blacklist_categories=["Cs"]), max_size=200))
def test_three_way(url):
    try:
        std = urllib.parse.urlparse(url)._asdict()
    except Exception as e:
        std = {"__error__": type(e).__name__}
    try:
        leg = legacy_parse(url)
    except Exception as e:
        leg = {"__error__": type(e).__name__}
    go = run_go(url)

    std_n, leg_n, go_n = normalize(std), normalize(leg), normalize(go)

    if not (std_n == leg_n == go_n):
        # Majority vote: 2-of-3 agree → the odd one out is probably wrong
        if std_n == leg_n: suspect = "go"
        elif std_n == go_n: suspect = "legacy"
        elif leg_n == go_n: suspect = "stdlib"  # unlikely but possible
        else: suspect = "all disagree"
        print(f"DIVERGENCE on {url!r}: suspect={suspect}")
        print(f"  stdlib: {std_n}\n  legacy: {leg_n}\n  go:     {go_n}")

Findings after 10k inputs:

InputstdliblegacygoSuspect
"http://[::1]:80/x"host=::1host=[::1]host=::1legacy
"a://b?c=%"query=c=%query=c=%ValueErrorgo
"//foo"netloc=foopath=//foonetloc=foolegacy

Legacy has IPv6 bracket handling wrong. Go is too strict on percent-decoding. Stdlib agrees with 2-of-3 every time — it's the reference.

Majority voting

With 3+ versions, disagreement is localizable: if 2 agree and 1 differs, the 1 is probably wrong. Not always — maybe the 2 share a bug. But it's a strong prior. Rank bugs by (minority size, input simplicity).

Do not

  • Do not diff string representations of structured data. {"a":1,"b":2}{"b":2,"a":1} as strings, = as dicts. Parse then compare.
  • Do not treat every timing difference as a regression. 2× slower on one input is noise. 100× slower on all inputs is a regression.
  • Do not stop at the first divergence. Collect all divergences, then cluster by root cause. Ten failing inputs might be one bug.
  • Do not forget to test the error paths. Both versions handle valid inputs fine and diverge on None, empty, or malformed input — that's where bugs live.

Output format

## Versions compared
A: <description/path/commit>
B: ...
C: ...

## Observables
<what was diffed — return value, exceptions, side effects>

## Normalization
<what was canonicalized before comparison>

## Input space
<how inputs were generated — boundary, random, mined>
<N inputs tested>

## Divergences
| Input | A | B | C | Suspect | Cluster |
| ----- | - | - | - | ------- | ------- |

## Clustered root causes
1. <cause>: <N> divergences, suspect=<version>
   Example input: <minimal reproducer>
Repository
santosomar/general-secure-coding-agent-skills
Last updated
Created

Is this your skill?

If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.