Implements, evaluates, and deploys multi-armed bandit algorithms — including Thompson Sampling, UCB, epsilon-greedy, LinUCB, EXP3, and contextual bandits. Covers algorithm selection, experiment harnesses, offline evaluation (IPS, Doubly Robust), infrastructure patterns, and correctness verification. Use when the user asks about multi-armed bandits, exploration-exploitation tradeoffs, adaptive experiments, A/B testing alternatives, online optimization, bandit-based recommendation or personalization systems, or contextual bandits.
Framework-agnostic patterns for project structure, testing, reward pipelines, model serving, configuration, monitoring, and safety. Language-specific notes are collected at the end.
A convergent pattern emerges across Python, TypeScript, and Go bandit libraries:
- Strategy implementations live in an algorithms/ directory.
- A single entry point (a MAB class, Bandit struct, or SimpleBandit constructor) dispatches to strategy implementations.
- Tests sit next to what they exercise: Go places *_test.go beside source; Python uses a tests/ directory; TypeScript uses __tests__/.

Notable implementations:
| Library | Language | Organization |
|---|---|---|
| MABWiser | Python | Flat module, private _ucb1.py modules |
| contextualbandits | Python | By lifecycle: online/, offpolicy/, evaluation/ |
| SMPyBandits | Python | By abstraction: Policies/, Arms/, Environment/ |
| SimpleBandit | TypeScript | Zero deps, <700 lines, JSON serialization-first |
| Stitch Fix mab | Go | One file per strategy, interface-based composition |
Five testing patterns are needed; none is sufficient alone.
Seeded determinism (golden tests): a fixed seed produces deterministic arm selections; assert exact sequences. This is the dominant pattern, and every serious framework supports seeds.
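A minimal golden-value sketch, assuming a hypothetical EpsilonGreedy class with an injectable, seedable RNG:

```ruby
require "minitest/autorun"

# Hypothetical bandit; any class exposing select_arm/update with an
# injectable RNG works the same way.
class TestSeededSelection < Minitest::Test
  def test_golden_arm_sequence
    bandit = EpsilonGreedy.new(n_arms: 3, epsilon: 0.1, rng: Random.new(42))
    selections = Array.new(10) do
      arm = bandit.select_arm
      bandit.update(arm, reward: 0.0)
      arm
    end
    # Golden values: recorded once from a trusted run, then frozen.
    assert_equal [2, 0, 1, 1, 0, 2, 1, 0, 0, 1], selections
  end
end
```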
Statistical convergence: run N trials and assert that the best arm is selected more than X% of the time after convergence; UCB1 regret should grow as O(log n). These tests are inherently slower and need tolerance.
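A statistical check in the same Minitest style (method shown without the surrounding class), again assuming the hypothetical EpsilonGreedy; note that the environment is seeded but the algorithm is not:

```ruby
def test_converges_to_best_arm
  env_rng = Random.new(123)                           # seed the environment...
  bandit = EpsilonGreedy.new(n_arms: 3, epsilon: 0.1) # ...not the algorithm
  true_rates = [0.2, 0.5, 0.8]                        # arm 2 is best

  picks = Array.new(20_000) do
    arm = bandit.select_arm
    bandit.update(arm, reward: env_rng.rand < true_rates[arm] ? 1.0 : 0.0)
    arm
  end

  # Tolerance, not exactness: epsilon = 0.1 implies roughly 93% best-arm
  # play at convergence (0.9 + 0.1/3), so 0.85 leaves statistical slack.
  assert_operator picks.last(5_000).count(2) / 5_000.0, :>, 0.85
end
```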
Invariant checks: deterministic properties that must always hold, for example: selection probabilities sum to 1.0, pull counts never decrease, and the selected arm index stays within range.
Stubbed reward sources: reward sources should be interfaces so they can be stubbed. Examples: Stitch Fix uses ContextualRewardStub and HTTPRewardStub; Vizier mocks suggestion services.
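A sketch of the idea in Ruby duck-typing terms; the class name here is illustrative, not one of the real stubs above:

```ruby
# Anything responding to #params can be swapped in for a real reward
# source during tests.
class FixedRewardStub
  def initialize(params)
    @params = params
  end

  def params
    @params # canned per-arm posterior parameters, no network call
  end
end

stub = FixedRewardStub.new([{ alpha: 9, beta: 1 }, { alpha: 1, beta: 9 }])
stub.params.first[:alpha] # => 9
```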
| Layer | Time Budget | When It Runs | What It Covers |
|---|---|---|---|
| L1 Deterministic | <10s | Every commit | Seeded RNG + golden values |
| L2 Property-based | <60s | Every commit | Hypothesis/fast-check invariants |
| L3 Statistical | <5min | Pre-merge | Exploration rate verification with tolerance |
| L4 Regret regression | <30min | Nightly/release | Bounds + baseline comparison |
From "To Seed or Not to Seed" (ICST 2022): Seed the environment for reproducibility. Do not seed the algorithm -- verify it works across random initializations. Use pytest.approx or equivalent with statistical tolerance.
Stateful property-based testing: Python uses Hypothesis's RuleBasedStateMachine with @invariant; PyBandits is the only bandit library using this pattern. TypeScript uses fast-check's fc.commands() and fc.modelRun().

Algorithm-specific pitfalls to assert in tests (see the UCB1 sketch after this list):

- Epsilon-Greedy: exploration must draw uniformly over all arms, including the current best.
- UCB1: use ln(t), not log10(t), in the confidence bound.
- Thompson Sampling: initialize priors as Beta(1, 1), not Beta(0, 0).
- EXP3: rescale rewards into [0, 1] before the exponentially weighted update.
- LinUCB: initialize A as the regularized identity (I * lambda); use solve(), not inv(), for numerical stability.
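To make the UCB1 item concrete, a minimal sketch (in Ruby, Math.log is the natural log):

```ruby
# UCB1 upper confidence bound: mean_i + sqrt(2 * ln(t) / n_i).
# Math.log is ln; using Math.log10 here would be a silent bug.
def ucb1_scores(means, counts, t)
  means.each_with_index.map do |mean, i|
    next Float::INFINITY if counts[i].zero? # force each arm to be played once
    mean + Math.sqrt(2.0 * Math.log(t) / counts[i])
  end
end

ucb1_scores([0.5, 0.7], [10, 5], 15)
# => arm 1 gets the larger exploration bonus (fewer pulls)
```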
The gap between research (pull arm, get reward) and production (serve, track, join, delay, update) is the hardest infrastructure problem.

1. Streaming Join (Udemy pattern)
Events flow through Kafka into Spark Structured Streaming, which performs funnel joins and sessionization, then updates the model and pushes weights to a Redis cache. Tracking IDs stitch events across time.
2. Reward Microservice (Stitch Fix pattern)
The allocation engine queries a RewardSource microservice that returns distribution parameters. The strategy computes selection probabilities from those parameters. Reward computation is fully decoupled from arm selection.
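A duck-typed sketch of that decoupling; the class names, endpoint, and beta_sample helper are illustrative, not the real Stitch Fix interfaces:

```ruby
require "json"
require "net/http"
require "uri"

# Returns per-arm distribution parameters; knows nothing about selection.
class HttpRewardSource
  def initialize(url)
    @uri = URI(url)
  end

  def params
    # e.g. [{"alpha": 12, "beta": 3}, {"alpha": 4, "beta": 9}]
    JSON.parse(Net::HTTP.get(@uri), symbolize_names: true)
  end
end

# Computes selection from parameters; knows nothing about reward joins.
class ThompsonStrategy
  def select_arm(params)
    samples = params.map { |p| beta_sample(p[:alpha], p[:beta]) }
    samples.index(samples.max)
  end

  private

  # Beta(a, b) draw for integer a, b: the a-th smallest of a + b - 1
  # uniforms is Beta(a, b) distributed (order-statistics trick).
  def beta_sample(a, b)
    Array.new(a + b - 1) { rand }.sort[a - 1]
  end
end

source = HttpRewardSource.new("http://rewards.local/arms") # hypothetical endpoint
arm = ThompsonStrategy.new.select_arm(source.params)
```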
3. Batch Update (Optimizely, Kameleoon pattern)
Hourly aggregation updates posteriors and pushes new weights. Simpler infrastructure, higher latency.
| Pattern | Latency | When to Use | Examples |
|---|---|---|---|
| Embedded | <1ms | Most bandit use cases | MABWiser in Python, SimpleBandit in browser, Stitch Fix in Go |
| Cache-backed | 1-5ms | High-traffic, pre-computable | Udemy (Redis) |
| Service-based | 10-100ms (REST), 2-25ms (gRPC) | Complex optimization, shared service | Vizier, Ax |
| Kubernetes-native | Varies | Cloud-native with traffic management | KServe InferenceService with canary rollouts |
Bandits are simple enough that embedded serving is usually the right default. Move to cache-backed or service-based only when you need shared state or centralized management.
When users must see consistent results (no flickering on refresh, same user always gets the same arm), use deterministic hash-based assignment — even with stochastic algorithms like Thompson Sampling, Softmax, or EXP3. The algorithm computes probabilities, but the final assignment is deterministic:
The assignment flow:

1. Query the RewardSource for each arm's distribution parameters.
2. The Strategy turns those parameters into a probability vector (e.g., Thompson Sampling posteriors → probability vector).
3. Hash the assignment unit deterministically: SHA1(experiment_id + unit_string) mod 10000.
4. Walk the cumulative probability vector to pick the arm.

```ruby
require "digest"

def deterministic_assign(user_id, experiment_id, arm_probabilities)
  # Hash the user+experiment pair to a number in [0, 1)
  hash_input = "#{experiment_id}:#{user_id}"
  hash_value = Digest::SHA1.hexdigest(hash_input).to_i(16) % 10000 / 10000.0

  # Walk the probability vector to assign an arm
  cumulative = 0.0
  arm_probabilities.each_with_index do |prob, arm|
    cumulative += prob
    return arm if hash_value < cumulative
  end
  arm_probabilities.length - 1 # fallback to the last arm (float rounding)
end
```

Same input always produces the same output. This enables debugging, consistency across requests, and horizontal scaling without coordination. The key insight: stochastic algorithms and deterministic serving are not in conflict — the algorithm explores across the population (different users hash to different arms), while each individual user gets a stable experience.
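Usage: identical inputs always map to the same arm, so retries and page refreshes are stable with no stored state:

```ruby
probs = [0.1, 0.3, 0.6] # from the strategy, e.g. Thompson posteriors
deterministic_assign("user-42", "exp-7", probs) # => same arm ...
deterministic_assign("user-42", "exp-7", probs) # ... on every call
```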
Constructor parameters (most common): pass hyperparameters at construction time. Simple, explicit, testable. Preferred for libraries.
Declarative config (Vizier, SMPyBandits): YAML or dict with search space definitions. Useful when non-engineers configure experiments.
Feature flags (Optimizely, Kameleoon): platform-level configuration, dynamic without code deployment. Appropriate for SaaS experimentation platforms.
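The constructor-parameter style in one line, assuming a hypothetical ThompsonSampling class:

```ruby
# Everything explicit at construction time: simple to read, trivial to test.
bandit = ThompsonSampling.new(n_arms: 4, prior_alpha: 1, prior_beta: 1,
                              rng: Random.new(42))
```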
Reward models should update independently of the allocation engine (as in Stitch Fix). No restart needed when reward signals change. This requires the decoupled architecture described in the reward pipeline section.
Automated supervisor checks are critical. Udemy checks every 5 minutes and recovers autonomously 99 out of 100 times. Combine with SLO monitoring and anomaly detection on reward distributions.
These are non-negotiable in production.
Default arm fallback: when the reward service is unavailable, serve a safe default. Never fail open with random exploration.
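A sketch of the fallback path; reward_client and log_warning are hypothetical, the shape is what matters:

```ruby
DEFAULT_ARM = 0 # a known-safe arm, chosen ahead of time

def select_arm_safe(bandit, reward_client)
  bandit.select_arm(reward_client.params) # params call may raise on outage
rescue StandardError => e
  log_warning("Reward source unavailable (#{e.class}); serving default arm")
  DEFAULT_ARM # never fall back to random exploration
end
```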
Exploration caps: limit exploration to X% of traffic. Yahoo uses bucketing; Twitter caps at 1%; Spotify pre-filters to top-100 candidates before exploring.
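A sketch of a hash-gated cap, assuming a hypothetical bandit that exposes explore_arm (stochastic) and best_arm (current greedy choice); hashing keeps the cap deterministic per user:

```ruby
require "digest"

EXPLORE_CAP = 0.01 # e.g. the 1% cap in the Twitter example

def select_with_cap(bandit, user_id)
  bucket = Digest::SHA1.hexdigest(user_id).to_i(16) % 10_000
  bucket < (EXPLORE_CAP * 10_000).round ? bandit.explore_arm : bandit.best_arm
end
```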
SLO checks: monitor arm performance against minimum thresholds. Pull underperforming arms automatically. Implement a minimum reward threshold per arm — if an arm's observed reward rate drops below the threshold for a sustained window, remove it from the candidate set or force-assign it to the default arm:
```ruby
def check_arm_slo(arm, min_reward_rate: 0.01, window: 1000)
  return true if @counts[arm] < window # not enough data yet

  # @values[arm] is the arm's observed mean reward
  if @values[arm] < min_reward_rate
    disable_arm(arm)
    log_warning("Arm #{arm} below SLO: #{@values[arm]} < #{min_reward_rate}")
    return false
  end
  true
end
```

SLO checks should run on every update() call, not just periodically — catching a degraded arm early prevents wasting traffic.
Abort mechanisms: budget-doubling schemes interrupt if cumulative reward falls below a "ruin" threshold.
Null arm handling: redistribute probability mass away from disabled or removed arms. Probabilities must still sum to 1.0.
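A minimal renormalization sketch; the function name is illustrative:

```ruby
# Zero out disabled arms, then rescale so probabilities sum to 1.0.
def redistribute(probs, disabled)
  live = probs.each_with_index.map { |p, i| disabled.include?(i) ? 0.0 : p }
  total = live.sum
  raise "all arms disabled; serve the default arm instead" if total.zero?
  live.map { |p| p / total }
end

redistribute([0.5, 0.3, 0.2], [1]) # => roughly [0.714, 0.0, 0.286]
```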
Language-specific notes:

- Python: the select_arm() / update() interface is the convention; pytest.approx for statistical tolerance; coverage.py for coverage.
- TypeScript: JSON serialization first (toJSON() / fromJSON()); critical for browser persistence.
- Go: go test (built-in); table-driven tests are idiomatic; interface-based composition (RewardSource, Strategy, Sampler); single package with one file per strategy.