evilissimo/software-design

Use before implementing or refactoring software. Contains two skills: (1) Modular Software Design — for designing module boundaries, APIs, layers, abstractions, services, repositories, adapters, or architecture, helping reduce total system complexity by creating deep modules, hiding implementation knowledge, avoiding leakage and pass-through APIs, comparing alternative designs, documenting interfaces before coding, and critiquing existing architecture; and (2) Software Testing — for writing unit tests, integration tests, or end-to-end tests, creating mocks/stubs/fakes, designing a testing strategy, doing TDD, reviewing test quality, fixing flaky tests, or refactoring test suites, generating risk-focused test plans, picking appropriate test levels, choosing between mocks/fakes/real dependencies, and applying Arrange-Act-Assert patterns with concrete examples.

skills/software-testing/SKILL.md

name: software-testing
description: Use when writing unit tests, integration tests, or end-to-end tests, creating mocks/stubs/fakes, designing a testing strategy, doing TDD, reviewing test quality, fixing flaky tests, or refactoring test suites. Generates risk-focused test plans, picks appropriate test levels, chooses between mocks/fakes/real dependencies, and applies Arrange-Act-Assert patterns with concrete examples.

Software Testing Strategy

Use this skill whenever you are designing, writing, extending, or reviewing tests. The goal is to produce the smallest suite that gives high confidence about the most likely and most costly failures, not to maximize coverage numbers.

Core Objective

Design tests that:

  • Reduce the most important risks
  • Are cheap enough to run
  • Are clear enough to maintain
  • Are stable enough to survive refactoring

Principles

  1. Start from behaviour and risk, not from code shape. Derive tests from requirements, input space, expected outputs, and likely failure modes before looking at coverage reports.

  2. Treat coverage as feedback, not as the goal. Coverage reveals blind spots but does not prove tests are meaningful. Use it to ask "what behaviour did we forget?" not "how do we force this number up?"

  3. Prefer observable behaviour over implementation detail. Verify returned values, externally visible state, emitted events, persisted effects, or calls to out-of-process dependencies. Do not assert private methods or internal choreography.

  4. Put most tests at the cheapest level that can expose the risk. Use low-level tests for combinatorial behaviour and boundaries. Use higher-level tests to validate wiring, contracts, and critical journeys.

  5. Use doubles to control boundaries, not to duplicate object graphs. Prefer fakes over heavy mocks. Use mocks only when the outgoing interaction itself is the behaviour you care about.

  6. Make behaviour easier to observe instead of giving tests special privileges. If tests are hard to write because outcomes are hidden, redesign the code to expose results through stable outputs, domain objects, events, or adapters (see the sketch after this list).

  7. Let tests support refactoring by testing seams and contracts. Small units, explicit dependencies, stable public interfaces, and clear boundaries make tests less fragile.

  8. Stop adding tests when the next test adds little new information. Once major paths, boundaries, invariants, and interfaces are covered, marginal confidence gain drops sharply.
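
A minimal sketch of principle 6, assuming a hypothetical CsvImporter: rather than letting tests reach into private state, the outcome is returned as a value object through the public API.

from dataclasses import dataclass

@dataclass(frozen=True)
class ImportResult:
    imported: int
    skipped: int

class CsvImporter:
    def run(self, rows):
        valid = [r for r in rows if r.strip()]
        # Returning a value object makes the outcome observable through a
        # stable public output; no test-only accessors are needed.
        return ImportResult(imported=len(valid), skipped=len(rows) - len(valid))

def test_importer_skips_blank_rows():
    result = CsvImporter().run(["a", "", "b"])
    assert result == ImportResult(imported=2, skipped=1)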

Concrete Examples

Example: Behaviour-Focused Unit Test (Python/pytest)

from decimal import Decimal

# Customer, Basket, and price are the domain code under test.
def test_discount_switches_at_loyalty_threshold():
    # Arrange
    customer_999 = Customer(points=999)
    customer_1000 = Customer(points=1000)
    basket = Basket(total=Decimal("100.00"))

    # Act & Assert: the discount switches at exactly 1000 points
    assert price(customer_999, basket) == Decimal("100.00")
    assert price(customer_1000, basket) == Decimal("95.00")

Why this works: Tests the business rule (threshold boundary) through public outputs. No mocks of internal collaborators. Would fail if the discount logic regresses.

Example: Fake vs Mock at a Boundary

# Using a fake (preferred for behaviour tests)
class FakePaymentGateway:
    def __init__(self):
        self.captured = []
    
    def capture(self, amount, currency):
        self.captured.append((amount, currency))
        return PaymentResult(success=True, transaction_id="fake-123")

# OrderService is the hypothetical code under test; order "ord-1" is
# presumed to total 49.99 USD.
def test_order_completion_charges_correct_amount():
    gateway = FakePaymentGateway()
    service = OrderService(gateway)

    service.complete_order(order_id="ord-1")

    assert gateway.captured == [(Decimal("49.99"), "USD")]

# Using a mock (only when the interaction itself is the behaviour)
class MockEmailGateway:
    def __init__(self):
        self.sent = []
    
    def send_welcome(self, email):
        self.sent.append(email)

def test_registration_sends_welcome_email():
    gateway = MockEmailGateway()
    service = UserService(gateway)
    
    service.register("alice@example.com")
    
    assert "alice@example.com" in gateway.sent

Rule of thumb: Use fakes when you care about outcomes; use mocks only when you care that a specific side effect happened.

Example: Integration Test with Real Database

def test_repository_persists_order_with_line_items(db_session):
    # Arrange
    repo = OrderRepository(db_session)
    order = Order(customer_id="cust-1", items=[
        LineItem(sku="SKU-123", qty=2, price=Decimal("10.00"))
    ])
    
    # Act
    repo.save(order)
    
    # Assert
    loaded = repo.get_by_id(order.id)
    assert len(loaded.items) == 1
    assert loaded.items[0].sku == "SKU-123"

Why this works: Tests the real persistence boundary (ORM mapping, query, serialization) rather than mocking the database.

Testing Workflows

Before Writing New Code

  1. Extract the behavioural brief. List public entry points, expected outputs, invalid inputs, and failure modes.
  2. Partition the input space. Identify normal cases, boundaries, forbidden values, and semantically distinct categories. Note algebraic or stateful invariants as candidates for property-based tests.
  3. Choose the lowest effective test level for each risk.
    • Domain rules, calculations, parsing, state transitions → fast unit-style tests
    • Persistence mapping, ORM queries, adapters, serialization → narrow integration tests
    • Critical user journeys, cross-system wiring → few end-to-end tests
  4. Design the oracle and observability plan. Decide what each test will observe. If outcomes are hard to inspect, improve observability first. Do not crack open private methods.
  5. Write the initial suite skeleton. Main behaviour examples, edge cases, property-based tests where examples are too narrow, and integration/contract tests for real boundaries (see the skeleton sketch after this list).
  6. Use coverage diagnostically. After implementation, use coverage to find forgotten behaviour, not to chase percentages.
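
One hedged sketch of what such a skeleton might look like before implementation; the names are illustrative, not prescribed by this skill:

class TestPriceCalculation:
    # Main behaviour examples
    def test_standard_customer_pays_full_price(self): ...
    def test_loyal_customer_gets_discount(self): ...

    # Edge cases and boundaries
    def test_zero_total_basket(self): ...
    def test_points_exactly_at_loyalty_threshold(self): ...

    # Failure modes
    def test_negative_total_is_rejected(self): ...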

Adding Tests to Existing Code

  • Extend the suite through the same public seam production uses.
  • Prefer tests that check stable behaviour end-to-end within the module.
  • For legacy or tightly coupled code, start with coarser safety nets (integration/end-to-end), refactor to separate business logic from orchestration, then add unit tests where they add faster, cheaper confidence.
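
For the legacy case, a coarse characterisation test can pin current behaviour before any refactoring begins. A hedged sketch, where generate_invoice is the hypothetical legacy entry point:

def test_invoice_generation_characterisation():
    # Arrange: a representative input shaped like production data
    order = {"customer": "cust-1", "lines": [{"sku": "SKU-123", "qty": 2}]}

    # Act: go through the same public seam production uses
    invoice = generate_invoice(order)

    # Assert: pin the current observable output, even if imperfect, so the
    # refactoring cannot change it silently
    assert invoice["customer"] == "cust-1"
    assert invoice["line_count"] == 1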

Refactoring with Tests

  • Refactor in small slices and keep tests aimed at observable behaviour.
  • If many tests fail after harmless refactoring, the suite is overspecified.
  • Move assertions outward, reduce mocked interaction checks, and regroup internal collaborators.
  • When a private method contains enough complexity to test directly, extract it as its own abstraction.
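
A before/after sketch of that last point, with hypothetical names:

# Before: parsing complexity hidden in a private method, testable only
# indirectly through OrderImporter's workflow.
class OrderImporter:
    def _parse_line(self, line): ...

# After: the logic has its own public contract and can be tested directly.
class LineParser:
    def parse(self, line):
        sku, qty = line.split(",")
        return {"sku": sku.strip(), "qty": int(qty)}

def test_parser_trims_whitespace_and_converts_quantity():
    assert LineParser().parse("SKU-123, 2") == {"sku": "SKU-123", "qty": 2}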

Debugging and Adding Regression Coverage

  1. Reproduce the bug with the highest-level failing test that captures the real symptom.
  2. Add lower-level focused tests near the actual fault if they help localise the issue or protect boundaries.
  3. Pin not just the exact failing example but also neighbouring boundaries and invariants.
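
A hedged illustration of step 3, where split_shipment is a hypothetical function that failed at quantity zero:

import pytest

@pytest.mark.parametrize("qty, expected_batches", [
    (0, 0),   # the reported bug
    (1, 1),   # smallest valid neighbour
    (10, 1),  # exactly one full batch
    (11, 2),  # just over the batch boundary
])
def test_shipment_splitting_around_batch_boundary(qty, expected_batches):
    assert len(split_shipment(qty, batch_size=10)) == expected_batches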

Decision Rules

Choosing Test Levels

| Level              | Use when                                                                                                    |
|--------------------|-------------------------------------------------------------------------------------------------------------|
| Unit-style         | Domain rules, calculations, branching logic, state transitions, parsing, input combinations                  |
| Narrow integration | Translation across boundaries: repository queries, ORM mappings, serialization, framework wiring, adapters   |
| Contract           | Two deployable components evolve independently; risk is interface mismatch                                   |
| End-to-end         | Critical user journeys, cross-system smoke checks; keep sparse and fast                                      |
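
For the contract row, one common shape is a shared test class that runs the same assertions against every implementation of an interface, so a fake and the real adapter cannot drift apart. A minimal sketch with hypothetical key-value stores:

class KeyValueStoreContract:
    def make_store(self):
        raise NotImplementedError

    def test_get_returns_what_was_put(self):
        store = self.make_store()
        store.put("k", "v")
        assert store.get("k") == "v"

    def test_get_missing_key_returns_none(self):
        store = self.make_store()
        assert store.get("absent") is None

class InMemoryStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

# pytest collects only Test*-named classes, so the contract base runs only
# through its subclasses; a TestRedisStore subclass would run the same
# assertions against the real adapter.
class TestInMemoryStore(KeyValueStoreContract):
    def make_store(self):
        return InMemoryStore()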

When to Mock

Do mock at true boundaries: out-of-process, slow, nondeterministic, shared, costly, or unsafe dependencies.

Do not mock internal collaborators that are merely part of the unit's implementation. If several small classes together realise one cohesive behaviour, test them together.

Preferences:

  • Fakes > heavy mocks (realistic behaviour without brittleness)
  • Spies/state inspection > strict interaction verification (when behaviour can be observed after the fact)
  • Mocks only when the outgoing interaction itself is the behaviour (e.g., sending email, publishing message)

Architecture rule: Mock or fake types you own. Wrap third-party libraries behind your own adapter before substituting.
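
A minimal sketch of that rule, assuming a hypothetical third-party payment SDK: the adapter owns the interface, so tests substitute gateway implementations (such as the FakePaymentGateway above) and never mock the SDK itself.

class StripePaymentGateway:
    def __init__(self, client):
        self._client = client  # the third-party SDK client (hypothetical API)

    def capture(self, amount, currency):
        # The only place that knows the SDK's shape: translate our domain
        # call into whatever the vendor client expects.
        response = self._client.charge(amount=str(amount), currency=currency)
        return PaymentResult(success=response.ok, transaction_id=response.id)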

When to Use Property-Based Tests

Use when example lists become unwieldy and behaviour is better expressed as an invariant: ordering preservation, idempotence, reversibility, monotonicity, conservation, parser round-trips, commutativity, state machine properties.
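
For instance, a round-trip invariant expressed with the hypothesis library; a trivial UTF-8 codec stands in here for the real code under test:

from hypothesis import given, strategies as st

def encode(value: str) -> bytes:
    return value.encode("utf-8")

def decode(data: bytes) -> str:
    return data.decode("utf-8")

@given(st.text())
def test_encode_decode_round_trip(value):
    # One property covers the input space an example list would only sample.
    assert decode(encode(value)) == value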

Avoid when the property is too weak, too broad, or secretly re-implements the algorithm under test.

When to Test Through Public APIs

Default to public APIs or public seams. Tests are just another client.

Testing internal behaviour is justified only after extracting a meaningful abstraction that now has its own public contract. Do not make private members public solely for tests.

When to Stop Adding Tests

Stop when:

  • Remaining uncovered code is trivial or unreachable
  • Additional tests duplicate behaviour already asserted at the same level
  • Framework glue is already protected elsewhere
  • Risk is low

Continue when:

  • Coverage reveals unexercised meaningful logic (branches guarding business rules, error handling, null/empty handling, loop/threshold logic)
  • Mutation analysis reveals that small, plausible changes to the code would survive the suite undetected

Maintainable Test Structure

  • Use Arrange-Act-Assert with short, well-named tests describing the scenario in domain terms.
  • Use fixtures sparingly, builders often, factories where they clarify domain intent (see the builder sketch after this list).
  • Keep assertions precise. Prefer one coherent behaviour per test.
  • Design test data so values explain the case rather than merely satisfying types.
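
A minimal builder sketch, reusing the hypothetical Order/LineItem shapes from the integration example and assuming Order.total sums its line items:

from decimal import Decimal

class OrderBuilder:
    def __init__(self):
        # Safe defaults; each test overrides only the values that explain it.
        self._customer_id = "cust-1"
        self._items = [LineItem(sku="SKU-123", qty=1, price=Decimal("10.00"))]

    def with_items(self, *items):
        self._items = list(items)
        return self

    def build(self):
        return Order(customer_id=self._customer_id, items=self._items)

def test_empty_order_has_zero_total():
    # The overridden value (no items) is the whole point of the case.
    order = OrderBuilder().with_items().build()
    assert order.total == Decimal("0.00")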

Review Checklist

When Designing Tests

  • What are the top failure modes: wrong result, boundary handling, persistence mismatch, broken protocol, concurrency, workflow breakage?
  • Which behaviours are most important to protect?
  • What is the cheapest test level that can expose each risk with a strong oracle?
  • What are the key boundaries, invalid inputs, and invariant properties?
  • Do I need a real dependency, fake, stub, spy, or mock, and why?
  • Is the behaviour observable through a public seam?
  • If internals change but behaviour stays the same, should the test still pass?
  • If this test fails, will it diagnose a real problem quickly?

When Reviewing an Existing Suite

  • Are business-critical paths and boundaries protected, or just the happy path?
  • Do coverage reports reveal meaningful blind spots around decisions and error handling?
  • Which tests fail after harmless refactoring? (brittleness candidates)
  • Which tests are flaky, slow, or nondeterministic?
  • Are there mystery guests (setup or data a test depends on but does not show), large fixtures, or hidden data files?
  • Are there many mocks of internal collaborators or third-party types?
  • Are there enough narrow integration/contract tests at boundaries?
  • Do tests reveal awkward design: implicit dependencies, oversized classes, hidden coupling?

Common Failure Modes

See references/test-smells.md for detailed symptoms and fixes for:

  • Overspecified tests and over-mocking
  • Weak assertions under high coverage
  • Flaky or slow tests
  • Mystery guests and obscure setup
  • Duplicated or over-abstracted test logic
  • Untestable design

Detailed Guidance

See references/ for deeper material and templates/ for reusable formats.
