evilissimo/software-design

Use before implementing or refactoring software. Contains two skills: (1) Modular Software Design — for designing module boundaries, APIs, layers, abstractions, services, repositories, adapters, or architecture, helping reduce total system complexity by creating deep modules, hiding implementation knowledge, avoiding leakage and pass-through APIs, comparing alternative designs, documenting interfaces before coding, and critiquing existing architecture; and (2) Software Testing — for writing unit tests, integration tests, or end-to-end tests, creating mocks/stubs/fakes, designing a testing strategy, doing TDD, reviewing test quality, fixing flaky tests, or refactoring test suites, generating risk-focused test plans, picking appropriate test levels, choosing between mocks/fakes/real dependencies, and applying Arrange-Act-Assert patterns with concrete examples.

name: software-testing
description:: Use when writing unit tests, integration tests, or end-to-end tests, creating mocks/stubs/fakes, designing a testing strategy, doing TDD, reviewing test quality, fixing flaky tests, or refactoring test suites. Generates risk-focused test plans, picks appropriate test levels, chooses between mocks/fakes/real dependencies, and applies Arrange-Act-Assert patterns with concrete examples.

Software Testing Strategy

Name: evilissimo/software-design
Rating: 93.2 (1 review)
Author: evilissimo

Use this skill whenever you are designing, writing, extending, or reviewing tests. The goal is to produce the smallest suite that gives high confidence about the most likely and most costly failures, not to maximize coverage numbers.

Core Objective

Design tests that:

Reduce the most important risks
Are cheap enough to run
Are clear enough to maintain
Are stable enough to survive refactoring

Principles

Risk first. Test the behaviours and failure modes that matter most.
Observable behaviour. Tests should act like clients of public seams.
Cheap confidence. Pick the lowest level that can expose the risk.
Stable boundaries. Doubles support seams; they should not mirror internals.
Diagnostic coverage. Use coverage to find blind spots, not as a target.
Diminishing returns. Stop when another test adds little new information.

Concrete Examples

Example: Behaviour-Focused Unit Test (Python/pytest)

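from decimal import Decimal

# Customer, Basket, and price are the domain code under test.
def test_discount_switches_at_loyalty_threshold():
    # Arrange: one customer just below the 1000-point threshold, one exactly on it
    customer_999 = Customer(points=999)
    customer_1000 = Customer(points=1000)
    basket = Basket(total=Decimal("100.00"))

    # Act & Assert: full price below the threshold, discounted price at it
    assert price(customer_999, basket) == Decimal("100.00")
    assert price(customer_1000, basket) == Decimal("95.00")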

Why this works: Tests the business rule (threshold boundary) through public outputs. No mocks of internal collaborators. Would fail if the discount logic regresses.

Example: Boundary Doubles

Use references/test-doubles.md for fake/mock examples and selection rules.
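
The referenced file is not reproduced here. As a rough illustration of a boundary double, the sketch below substitutes an in-memory fake for a remote payment gateway. `FakePaymentGateway`, `Checkout`, and `PaymentDeclined` are hypothetical names invented for this example, not part of the skill.

```python
from dataclasses import dataclass, field
from decimal import Decimal


class PaymentDeclined(Exception):
    """Raised by a gateway when a charge is refused."""


@dataclass
class FakePaymentGateway:
    """In-memory stand-in for a hypothetical remote payment gateway."""
    declined_cards: set = field(default_factory=set)
    charges: list = field(default_factory=list)

    def charge(self, card: str, amount: Decimal) -> None:
        if card in self.declined_cards:
            raise PaymentDeclined(card)
        self.charges.append((card, amount))


class Checkout:
    """Code under test: it depends only on the gateway's charge() seam."""

    def __init__(self, gateway):
        self._gateway = gateway

    def pay(self, card: str, amount: Decimal) -> str:
        try:
            self._gateway.charge(card, amount)
        except PaymentDeclined:
            return "declined"
        return "paid"


def test_checkout_reports_declined_card():
    # Arrange
    gateway = FakePaymentGateway(declined_cards={"card-bad"})
    checkout = Checkout(gateway)

    # Act
    outcome = checkout.pay("card-bad", Decimal("10.00"))

    # Assert: observable outcome plus the state of the boundary
    assert outcome == "declined"
    assert gateway.charges == []


def test_checkout_charges_accepted_card():
    gateway = FakePaymentGateway()
    checkout = Checkout(gateway)

    assert checkout.pay("card-ok", Decimal("10.00")) == "paid"
    assert gateway.charges == [("card-ok", Decimal("10.00"))]
```

The fake replaces a true boundary (the network call) while the assertions stay on outcomes, so the tests do not depend on how `Checkout` is organised internally.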

Example: Integration Test with Real Database

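from decimal import Decimal

# OrderRepository, Order, and LineItem are the code under test; db_session is
# assumed to be a pytest fixture that provides a session on a real test database.
def test_repository_persists_order_with_line_items(db_session):
    # Arrange
    repo = OrderRepository(db_session)
    order = Order(customer_id="cust-1", items=[
        LineItem(sku="SKU-123", qty=2, price=Decimal("10.00"))
    ])

    # Act
    repo.save(order)

    # Assert: read back through the repository, not through the ORM internals
    loaded = repo.get_by_id(order.id)
    assert len(loaded.items) == 1
    assert loaded.items[0].sku == "SKU-123"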

Why this works: Tests the real persistence boundary (ORM mapping, query, serialization) rather than mocking the database.

Testing Workflows

Before Writing New Code

1. Extract the behavioural brief. List public entry points, expected outputs, invalid inputs, and failure modes.
2. Partition the input space. Identify normal cases, boundaries, forbidden values, and semantically distinct categories. Note algebraic or stateful invariants as candidates for property-based tests.
3. Choose the lowest effective test level for each risk.
   - Domain rules, calculations, parsing, state transitions → fast unit-style tests
   - Persistence mapping, ORM queries, adapters, serialization → narrow integration tests
   - Critical user journeys, cross-system wiring → few end-to-end tests
4. Design the oracle and observability plan. Decide what each test will observe. If outcomes are hard to inspect, improve observability first. Do not crack open private methods.
5. Write the initial suite skeleton. Main behaviour examples, edge cases, property-based tests where examples are too narrow, and integration/contract tests for real boundaries.
6. Use coverage diagnostically. After implementation, use coverage to find forgotten behaviour, not to chase percentages.
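
The partitioning step above can be sketched as a small table of cases, one row per partition or boundary. The `discount_rate` function and the 1000-point threshold are assumptions chosen to mirror the earlier discount example. In a pytest suite the same table would usually be fed to `pytest.mark.parametrize`; a plain loop is used here only to keep the sketch free of dependencies.

```python
from decimal import Decimal

LOYALTY_THRESHOLD = 1000  # assumed threshold, mirroring the earlier discount example


def discount_rate(points: int) -> Decimal:
    """Hypothetical rule under test: 5% off at or above the threshold."""
    if points < 0:
        raise ValueError("points must be non-negative")
    return Decimal("0.05") if points >= LOYALTY_THRESHOLD else Decimal("0")


# One row per partition or boundary: (description, points, expected rate).
CASES = [
    ("new customer", 0, Decimal("0")),
    ("just below threshold", 999, Decimal("0")),
    ("exactly at threshold", 1000, Decimal("0.05")),
    ("well above threshold", 50_000, Decimal("0.05")),
]


def test_discount_rate_partitions():
    for description, points, expected in CASES:
        # The description doubles as the failure message.
        assert discount_rate(points) == expected, description


def test_negative_points_are_rejected():
    # Forbidden value: the partition that must fail loudly.
    try:
        discount_rate(-1)
    except ValueError:
        return
    raise AssertionError("expected ValueError for negative points")
```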

Adding Tests to Existing Code

Extend the suite through the same public seam production uses.
Prefer tests that check stable behaviour end-to-end within the module.
For legacy or tightly coupled code, start with coarser safety nets (integration/end-to-end), refactor to separate business logic from orchestration, then add unit tests where they add faster, cheaper confidence.
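
A minimal sketch of that progression for legacy code, assuming a hypothetical report routine: a coarse test pins the current end-to-end output first, then the business logic (`summarise`) is separated from the I/O orchestration (`run_report`) and gets its own fast unit test. All names here are illustrative.

```python
import io


def summarise(lines):
    """Pure business logic extracted from the legacy routine."""
    amounts = [int(line) for line in lines if line.strip()]
    return {"count": len(amounts), "total": sum(amounts)}


def run_report(source, sink):
    """Thin orchestration: read input, delegate, write output."""
    summary = summarise(source.read().splitlines())
    sink.write(f"{summary['count']} orders, total {summary['total']}\n")


def test_report_end_to_end_safety_net():
    # Coarse safety net written before refactoring: pins the visible output.
    source = io.StringIO("10\n20\n\n5\n")
    sink = io.StringIO()

    run_report(source, sink)

    assert sink.getvalue() == "3 orders, total 35\n"


def test_summarise_ignores_blank_lines():
    # Cheaper unit test, possible only after the logic was separated from I/O.
    assert summarise(["10", "", "  ", "5"]) == {"count": 2, "total": 15}
```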

Refactoring with Tests

Refactor in small slices and keep tests aimed at observable behaviour.
If many tests fail after harmless refactoring, the suite is overspecified.
Move assertions outward to observable results, reduce mocked interaction checks, and group internal collaborators inside the unit under test instead of replacing them with doubles.
When a private method contains enough complexity to test directly, extract it as its own abstraction.
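
To make the contrast concrete, here is a rough sketch of an overspecified interaction check next to a behaviour check. `Invoice`, `TaxCalculator`, and the 20% rate are hypothetical and chosen only for illustration.

```python
from unittest.mock import Mock


class TaxCalculator:
    def tax_for(self, net_cents: int) -> int:
        return net_cents * 20 // 100  # assumed flat 20% rate


class Invoice:
    def __init__(self, calculator=None):
        self._calculator = calculator if calculator is not None else TaxCalculator()

    def gross(self, net_cents: int) -> int:
        return net_cents + self._calculator.tax_for(net_cents)


def test_gross_overspecified():
    # Brittle: pins how Invoice talks to an internal collaborator.
    # Inlining the tax rule or renaming tax_for breaks this test
    # even though the invoice total is unchanged.
    calculator = Mock()
    calculator.tax_for.return_value = 20

    Invoice(calculator).gross(100)

    calculator.tax_for.assert_called_once_with(100)


def test_gross_includes_tax():
    # Stable: asserts only the observable result through the public seam.
    assert Invoice().gross(100) == 120
```

Both tests pass today. Only the second one keeps passing if the tax calculation is later folded into `Invoice` or moved elsewhere.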

Debugging and Adding Regression Coverage

Reproduce the bug with the highest-level failing test that captures the real symptom.
Add lower-level focused tests near the actual fault if they help localise the issue or protect boundaries.
Pin not just the exact failing example but also neighbouring boundaries and invariants.
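
As an illustration, suppose a hypothetical bug report said that 20 items at 10 per page produced 3 pages. The sketch below pins the reported case, its neighbours on both sides, and an invariant that any correct page count should satisfy. `page_count` is an invented example function.

```python
def page_count(total_items: int, page_size: int) -> int:
    """Number of pages needed to show total_items (ceiling division)."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return (total_items + page_size - 1) // page_size


def test_reported_case_exact_multiple():
    # The exact example from the (hypothetical) bug report.
    assert page_count(20, 10) == 2


def test_neighbouring_boundaries():
    # Values on both sides of the failing example, plus the empty case.
    assert page_count(19, 10) == 2
    assert page_count(21, 10) == 3
    assert page_count(1, 10) == 1
    assert page_count(0, 10) == 0


def test_pages_are_sufficient_and_minimal():
    # Invariant: the pages hold every item, and one page fewer would not.
    for size in range(1, 15):
        for total in range(1, 200):
            pages = page_count(total, size)
            assert pages * size >= total > (pages - 1) * size
```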

Decision Rules

Choosing Test Levels

Level	Use When
Unit-style	Domain rules, calculations, branching logic, state transitions, parsing, input combinations
Narrow integration	Translation across boundaries: repository queries, ORM mappings, serialization, framework wiring, adapters
Contract	Two deployable components evolve independently; risk is interface mismatch
End-to-end	Critical user journeys, cross-system smoke checks; keep sparse and fast

When to Mock

Use references/test-doubles.md for double selection rules. Short version: substitute true boundaries and owned adapters; avoid mocking internal choreography.

When to Use Property-Based Tests

Use references/property-based-testing.md when example lists are unwieldy and behaviour is better expressed as an invariant.
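
The referenced file is not reproduced here. As a rough sketch of the idea, the example below checks two invariants of a run-length encoder over many generated inputs. It uses a hand-rolled generator with a seeded `random.Random` rather than a property-based testing library, so it has no shrinking; a dedicated library such as Hypothesis would normally be the better tool. The encoder itself is a made-up example.

```python
import random
from itertools import groupby


def rle_encode(text: str) -> list:
    """Collapse runs of equal characters into (char, count) pairs."""
    return [(char, len(list(run))) for char, run in groupby(text)]


def rle_decode(pairs: list) -> str:
    return "".join(char * count for char, count in pairs)


def random_text(rng: random.Random, max_len: int = 30) -> str:
    # A tiny alphabet makes long runs likely, which is where bugs would hide.
    return "".join(rng.choice("ab ") for _ in range(rng.randint(0, max_len)))


def test_decode_inverts_encode():
    rng = random.Random(1234)  # fixed seed keeps the test deterministic
    for _ in range(500):
        text = random_text(rng)
        assert rle_decode(rle_encode(text)) == text, repr(text)


def test_runs_are_maximal():
    rng = random.Random(5678)
    for _ in range(500):
        pairs = rle_encode(random_text(rng))
        assert all(count >= 1 for _, count in pairs)
        # Adjacent runs never share a character, otherwise they should have merged.
        assert all(left[0] != right[0] for left, right in zip(pairs, pairs[1:]))
```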

When to Test Through Public APIs

Default to public APIs or public seams. Tests are just another client.

Testing internal behaviour is justified only after extracting a meaningful abstraction that now has its own public contract. Do not make private members public solely for tests.
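
A small sketch of that extraction, under invented names: SKU normalisation that might once have lived in a private helper of `Catalogue` becomes a `Sku` value object with its own public contract, so it can be tested directly without exposing anything private.

```python
class Sku:
    """Extracted abstraction: normalisation now has a public contract of its own."""

    def __init__(self, raw: str):
        cleaned = raw.strip().upper()
        if not cleaned.startswith("SKU-") or not cleaned[4:].isdigit():
            raise ValueError(f"invalid SKU: {raw!r}")
        self.value = cleaned


class Catalogue:
    def __init__(self):
        self._stock = {}

    def add(self, raw_sku: str, qty: int) -> None:
        sku = Sku(raw_sku).value
        self._stock[sku] = self._stock.get(sku, 0) + qty

    def quantity(self, raw_sku: str) -> int:
        return self._stock.get(Sku(raw_sku).value, 0)


def test_sku_is_normalised():
    # Direct test of the extracted abstraction through its public contract.
    assert Sku("  sku-123 ").value == "SKU-123"


def test_sku_rejects_malformed_input():
    for raw in ["", "SKU-", "123", "SKU-12A"]:
        try:
            Sku(raw)
        except ValueError:
            continue
        raise AssertionError(f"expected ValueError for {raw!r}")


def test_catalogue_merges_equivalent_skus():
    # The catalogue is still tested only through add() and quantity().
    catalogue = Catalogue()
    catalogue.add("sku-1", 2)
    catalogue.add("SKU-1 ", 3)

    assert catalogue.quantity("SKU-1") == 5
```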

When to Stop Adding Tests

Stop when:

Remaining uncovered code is trivial or unreachable
Additional tests duplicate behaviour already asserted at the same level
Framework glue is already protected elsewhere
The remaining risk is low relative to the cost of writing and maintaining another test

Continue when:

Coverage reveals unexercised meaningful logic (branches guarding business rules, error handling, null/empty handling, loop/threshold logic)
Mutation analysis shows that small, plausible code changes would go undetected (surviving mutants)
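
A hand-written illustration of a surviving mutant, assuming a hypothetical eligibility rule. The mutant changes `>=` to `>`; a suite that only checks values far from the boundary cannot tell the two apart, while one that pins the boundary can. Real mutation tools generate such variants automatically; here the mutant is written out by hand.

```python
def is_eligible(points: int) -> bool:
    return points >= 1000


def is_eligible_mutant(points: int) -> bool:
    return points > 1000  # mutated comparison: >= became >


def weak_suite_passes(fn) -> bool:
    """Checks only values far from the boundary."""
    return fn(5000) is True and fn(0) is False


def strong_suite_passes(fn) -> bool:
    """Adds the two values on either side of the boundary."""
    return weak_suite_passes(fn) and fn(1000) is True and fn(999) is False


# The weak suite accepts both versions, so the mutant survives.
assert weak_suite_passes(is_eligible)
assert weak_suite_passes(is_eligible_mutant)

# The strong suite accepts the original and rejects (kills) the mutant.
assert strong_suite_passes(is_eligible)
assert not strong_suite_passes(is_eligible_mutant)
```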

Maintainable Test Structure

Use Arrange-Act-Assert with short, well-named tests describing the scenario in domain terms.
Use fixtures sparingly, builders often, factories where they clarify domain intent.
Keep assertions precise. Prefer one coherent behaviour per test.
Design test data so values explain the case rather than merely satisfying types.
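
One possible shape for a builder, sketched with invented names and an assumed shipping rule: `an_order` supplies unremarkable defaults so each test states only the values that explain its case.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Order:
    customer_id: str = "cust-1"
    country: str = "DE"
    total_cents: int = 1000
    express: bool = False


def an_order(**overrides) -> Order:
    """Builder: sensible defaults, overridden only where the test cares."""
    return replace(Order(), **overrides)


def shipping_cents(order: Order) -> int:
    """Assumed rule: free standard shipping from 50.00, express always charged."""
    if order.total_cents >= 5000 and not order.express:
        return 0
    return 990 if order.express else 490


def test_order_just_below_free_shipping_limit_pays_standard_rate():
    # Arrange: only the total matters for this case
    order = an_order(total_cents=4999)

    # Act
    cost = shipping_cents(order)

    # Assert
    assert cost == 490


def test_order_at_free_shipping_limit_ships_free():
    assert shipping_cents(an_order(total_cents=5000)) == 0


def test_express_is_charged_even_above_the_limit():
    assert shipping_cents(an_order(total_cents=5000, express=True)) == 990
```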

Review Checklist

When Designing Tests

What are the top failure modes: wrong result, boundary handling, persistence mismatch, broken protocol, concurrency, workflow breakage?
Which behaviours are most important to protect?
What is the cheapest test level that can expose each risk with a strong oracle?
What are the key boundaries, invalid inputs, and invariant properties?
Do I need a real dependency, fake, stub, spy, or mock, and why?
Is the behaviour observable through a public seam?
If internals change but behaviour stays the same, should the test still pass?
If this test fails, will it diagnose a real problem quickly?

When Reviewing an Existing Suite

Are business-critical paths and boundaries protected, or just the happy path?
Do coverage reports reveal meaningful blind spots around decisions and error handling?
Which tests fail after harmless refactoring? (brittleness candidates)
Which tests are flaky, slow, or nondeterministic?
Are there mystery guests, large fixtures, or hidden data files?
Are there many mocks of internal collaborators or third-party types?
Are there enough narrow integration/contract tests at boundaries?
Do tests reveal awkward design: implicit dependencies, oversized classes, hidden coupling?
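
For the flaky and nondeterministic tests question in the checklist above, one common remedy is to make time an explicit dependency. The sketch below assumes a hypothetical `Session` class that receives its clock as a parameter, so the test controls "now" instead of sleeping or reading the system clock.

```python
from datetime import datetime, timedelta, timezone


class Session:
    def __init__(self, started_at: datetime, ttl: timedelta, clock):
        self._started_at = started_at
        self._ttl = ttl
        self._clock = clock  # callable returning the current time

    def is_expired(self) -> bool:
        return self._clock() - self._started_at >= self._ttl


def test_session_expires_exactly_at_ttl():
    # Arrange: a controllable clock replaces datetime.now()
    start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    now = {"value": start}
    session = Session(start, timedelta(minutes=30), clock=lambda: now["value"])

    # Act & Assert: one second before the limit, then exactly on it
    now["value"] = start + timedelta(minutes=29, seconds=59)
    assert not session.is_expired()

    now["value"] = start + timedelta(minutes=30)
    assert session.is_expired()
```

In production code the clock argument could be something like `lambda: datetime.now(timezone.utc)`; the test never needs real time to pass.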

Common Failure Modes

See references/test-smells.md for detailed symptoms and fixes for:

Overspecified tests and over-mocking
Weak assertions under high coverage
Flaky or slow tests
Mystery guests and obscure setup
Duplicated or over-abstracted test logic
Untestable design

Detailed Guidance

See references/ for deeper material:

references/coverage-criteria.md — statement, branch, condition, path, input-space, and mutation coverage
references/test-doubles.md — dummies, fakes, stubs, spies, mocks, and when to use each
references/property-based-testing.md — invariants, generators, shrinking, and common pitfalls
references/test-smells.md — brittleness, overspecification, weak oracles, mystery guests

See reusable formats: