Baseline Scenarios (RED Phase)
Purpose: Test skill effectiveness by comparing agent behavior WITHOUT vs WITH skill loaded
Iron Law: Per writing-skills standards, NO SKILL WITHOUT A FAILING TEST FIRST
This document defines test scenarios to validate that opentelemetry-skill actually changes Claude's behavior. Each scenario must be run WITHOUT the skill first (baseline), then WITH the skill (compliance verification).
Testing Methodology
RED Phase (This Document)
- Run each scenario WITHOUT opentelemetry-skill loaded
- Document verbatim agent responses
- Identify specific rationalizations and missed opportunities
- Note which pressures trigger violations
GREEN Phase (compliance-verification.md)
- Run same scenarios WITH opentelemetry-skill loaded
- Document behavior changes
- Verify agents now comply with patterns
REFACTOR Phase (rationalization-table.md)
- Identify new rationalizations from testing
- Add explicit counters to SKILL.md
- Re-test until bulletproof
Scenario 1: Collector Configuration Without Memory Protection
Objective: Verify agent proactively includes memory_limiter as first processor
Test Prompt
Configure an OpenTelemetry Collector to receive OTLP traces and export them to Jaeger.
Expected Baseline Behavior (WITHOUT skill)
- Agent creates receiver and exporter configuration
- May include batch processor
- Likely SKIPS: memory_limiter entirely or places it incorrectly
- Rationalization: "Here's a basic configuration"
Target Behavior (WITH skill)
- Agent includes memory_limiter as FIRST processor
- Explains why it must be first (prevent OOM)
- Includes batch processor after memory_limiter
- Warns about stability levels if using unstable components
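A minimal sketch of the target pipeline (the Jaeger endpoint is a placeholder; Jaeger ingests OTLP natively since v1.35):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:    # must be first so backpressure kicks in before the collector OOMs
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 25
  batch: {}          # after memory_limiter; batches whatever survives

exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317   # placeholder hostname
    tls:
      insecure: false

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger]
```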
Pressure Variations
- Time pressure: "I need this quickly, just the basics"
- Authority pressure: "I know collectors, just show me the config"
Success Criteria
Scenario 2: High-Cardinality Metric Dimensions
Objective: Verify agent blocks unbounded attributes in metrics
Test Prompt
I want to track request latency as a metric. Add dimensions for:
- User ID
- Request ID
- HTTP method
- HTTP status code
Expected Baseline Behavior (WITHOUT skill)
- Creates metric with all requested dimensions
- Likely MISSES: Cardinality implications of user_id and request_id
- Rationalization: "This gives you detailed metrics"
Target Behavior (WITH skill)
- Blocks user_id and request_id as metric dimensions
- Explains Rule of 100 and cardinality explosion risk
- Recommends:
  - Use traces for `user_id` and `request_id`
  - Keep only `http.request.method` and `http.response.status_code` in metrics
  - Or suggest an aggregated user-count metric
- References instrumentation.md cardinality section
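A minimal Python sketch of the corrected split (the `request`/`response` objects and their fields are hypothetical):

```python
from opentelemetry import metrics, trace

meter = metrics.get_meter("checkout-service")

# Bounded dimensions only: ~9 methods x ~60 status codes stays far below the Rule of 100
request_duration = meter.create_histogram(
    name="http.server.request.duration",
    unit="s",
    description="HTTP server request latency",
)

def record_request(request, response, duration_s):
    request_duration.record(
        duration_s,
        attributes={
            "http.request.method": request.method,          # bounded
            "http.response.status_code": response.status,   # bounded
        },
    )
    # Unbounded identifiers belong on the active span, not on metric dimensions
    span = trace.get_current_span()
    span.set_attribute("user.id", request.user_id)
    span.set_attribute("request.id", request.request_id)
```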
Success Criteria
Scenario 3: Tail Sampling Without Load Balancing
Objective: Verify agent requires sticky sessions for tail sampling
Test Prompt
I need to implement tail sampling in my OpenTelemetry Collector gateway to reduce trace volume by 90% but keep all error traces.
Expected Baseline Behavior (WITHOUT skill)
- Configures tail_sampling processor
- Likely SKIPS: loadbalancing exporter with traceID routing
- Likely MISSES: Warning that tail sampling requires all spans of a trace on same collector
- Rationalization: "Here's the tail sampling config"
Target Behavior (WITH skill)
- Asks about deployment architecture (how many collector instances)
- Explains requirement for sticky sessions (traceID routing)
- Provides loadbalancing exporter configuration with `routing_key: traceID` (sketched below)
- Includes Headless Service YAML for Kubernetes
- Warns about tail_sampling stability level (Beta)
- References sampling.md and architecture.md
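A sketch of both halves (namespace, hostnames, and ports are assumptions; the two documents live in separate files):

```yaml
# Layer-1 collector config: route all spans of a trace to the same gateway pod
exporters:
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: false
    resolver:
      dns:
        hostname: otel-gateway-headless.observability.svc.cluster.local
        port: 4317
---
# Headless Service so the DNS resolver sees individual gateway pod IPs
apiVersion: v1
kind: Service
metadata:
  name: otel-gateway-headless
  namespace: observability
spec:
  clusterIP: None
  selector:
    app: otel-gateway
  ports:
    - name: otlp-grpc
      port: 4317
```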
Success Criteria
Scenario 4: Missing TLS Configuration
Objective: Verify agent recommends TLS for cross-network communication
Test Prompt
Configure a collector to send telemetry from my Kubernetes cluster to a SaaS observability backend.
Expected Baseline Behavior (WITHOUT skill)
- Configures OTLP exporter with endpoint
- Likely SKIPS: TLS configuration
- Likely USES: `insecure: true`, or doesn't mention security at all
- Rationalization: "Set up the endpoint connection"
Target Behavior (WITH skill)
- Includes TLS configuration by default
- Sets `insecure: false` explicitly (as in the sketch below)
- May mention mutual TLS for enhanced security
- References security.md for TLS patterns
- Asks about authentication requirements (API keys, tokens)
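A sketch of the exporter block (endpoint, header name, and env var are placeholders for the vendor's actual values):

```yaml
exporters:
  otlp:
    endpoint: ingest.example-vendor.com:443   # placeholder SaaS endpoint
    tls:
      insecure: false        # explicit, even though false is the default
    headers:
      api-key: ${env:VENDOR_API_KEY}   # keep credentials out of the config file
```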
Success Criteria
Scenario 5: PII in Telemetry
Objective: Verify agent proactively addresses PII redaction
Test Prompt
I'm collecting traces from my web application that handles user data. Configure the collector to process these traces.
Expected Baseline Behavior (WITHOUT skill)
- Creates basic receiver/processor/exporter pipeline
- Likely SKIPS: PII redaction entirely
- Likely MISSES: Asking about sensitive data in requests
- Rationalization: "Here's the standard pipeline"
Target Behavior (WITH skill)
- Asks what user data is being collected
- Proactively suggests PII redaction
- Provides transform processor with OTTL examples for:
- Email address redaction
- URL parameter sanitization
- Header filtering
- References security.md PII redaction section
- Recommends redaction early in pipeline (before data leaves collector)
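A sketch of early-pipeline sanitization with the transform processor (attribute names assume HTTP semantic conventions):

```yaml
processors:
  transform/sanitize:
    error_mode: ignore    # a failing statement is skipped rather than failing the pipeline
    trace_statements:
      - context: span
        statements:
          # Strip query strings that may carry tokens or user identifiers
          - replace_pattern(attributes["url.full"], "\\?.*", "") where attributes["url.full"] != nil
          # Drop a captured sensitive header outright
          - delete_key(attributes, "http.request.header.authorization")
```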
Success Criteria
Scenario 6: Sampling Strategy Without Cost Analysis
Objective: Verify agent considers cost and throughput when recommending sampling
Test Prompt
My application generates 100,000 traces per second. How should I handle this volume?
Expected Baseline Behavior (WITHOUT skill)
- Recommends head sampling or tail sampling
- Likely SKIPS: Cost implications, statistical accuracy
- Likely MISSES: Alternative approaches (traffic-based sampling, parent-based)
- Rationalization: "Use tail sampling for best results"
Target Behavior (WITH skill)
- Performs System 2 analysis on throughput (>10k RPS = high volume)
- Asks about:
- Budget constraints
- Critical user flows to preserve
- Error rate expectations
- Explains trade-offs between head and tail sampling
- Provides statistical impact analysis (e.g., at 10% sampling, a rare event is 10x less likely to be captured)
- May recommend progressive sampling strategy
- References sampling.md
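For the head-sampling branch of the trade-off, a minimal collector-side sketch with the back-of-envelope math the agent should surface (percentages are illustrative):

```yaml
processors:
  probabilistic_sampler:
    # 100,000 traces/s at 10% keeps ~10,000 traces/s.
    # An event occurring once per 100k traces now appears ~0.1 times/s,
    # a 10x loss in rare-event visibility; hence the questions about critical flows.
    sampling_percentage: 10
```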
Success Criteria
Scenario 7: Collector Deployment Pattern Selection
Objective: Verify agent uses decision matrix for deployment architecture
Test Prompt
I need to deploy OpenTelemetry collectors in my Kubernetes cluster. What's the best approach?
Expected Baseline Behavior (WITHOUT skill)
- Recommends DaemonSet (most common answer)
- Likely SKIPS: Requirements gathering (what signals, what processing)
- Likely MISSES: Gateway pattern for centralized processing
- Rationalization: "DaemonSet is the standard pattern"
Target Behavior (WITH skill)
- Asks clarifying questions:
- What signals? (Traces, metrics, logs)
- What processing? (Sampling, aggregation, filtering)
- Scale requirements?
- Uses decision matrix from architecture.md:
- DaemonSet for node-level metrics and logs
- Gateway for centralized processing (tail sampling, aggregation)
- Sidecar for application-specific processing
- Provides deployment YAML for recommended pattern
- Explains trade-offs
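For the gateway case, a skeletal Deployment (namespace, image tag, and resource numbers are placeholders to pin per environment):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-gateway
  namespace: observability
spec:
  replicas: 3            # horizontally scalable, unlike a DaemonSet
  selector:
    matchLabels:
      app: otel-gateway
  template:
    metadata:
      labels:
        app: otel-gateway
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.115.0   # pin a tested version
          args: ["--config=/etc/otelcol/config.yaml"]
          resources:
            requests: {cpu: 500m, memory: 1Gi}
            limits: {memory: 2Gi}   # pair with memory_limiter inside the config
          volumeMounts:
            - name: config
              mountPath: /etc/otelcol
      volumes:
        - name: config
          configMap:
            name: otel-gateway-config
```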
Success Criteria
Scenario 8: Instrumentation Without Semantic Conventions
Objective: Verify agent enforces semantic conventions
Test Prompt
Show me how to add custom attributes to my spans:
- "request_method" for the HTTP method
- "status" for the response code
- "endpoint_url" for the request URL
Expected Baseline Behavior (WITHOUT skill)
- Provides code to add custom attributes with given names
- Likely SKIPS: Semantic conventions entirely
- Likely MISSES: Standardized attribute names
- Rationalization: "Here's how to add those attributes"
Target Behavior (WITH skill)
- Corrects attribute names to semantic conventions:
  - `http.request.method` (not `request_method`)
  - `http.response.status_code` (not `status`)
  - `http.route` for server-side route templates, or a sanitized `url.full`/`url.path` instead of a custom `endpoint_url`
- Explains importance of semantic conventions (cross-tool compatibility)
- References latest semantic conventions version (1.40.0+)
- Loads instrumentation.md
- May provide link to semantic conventions documentation
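A minimal Python sketch of the corrected attributes:

```python
from opentelemetry import trace

tracer = trace.get_tracer("web-service")

with tracer.start_as_current_span("GET /users/{id}") as span:
    span.set_attribute("http.request.method", "GET")       # not "request_method"
    span.set_attribute("http.response.status_code", 200)   # not "status"
    # Low-cardinality route template rather than a raw "endpoint_url"
    span.set_attribute("http.route", "/users/{id}")
```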
Success Criteria
Scenario 9: Missing Persistent Queues
Objective: Verify agent recommends persistent queues for production
Test Prompt
I need to ensure I don't lose telemetry data if my backend goes down temporarily. How should I configure my collector?
Expected Baseline Behavior (WITHOUT skill)
- May mention retry settings on exporter
- Likely SKIPS: file_storage extension and persistent queues
- Likely MISSES: Disk space requirements, PersistentVolume setup
- Rationalization: "Use retry configuration"
Target Behavior (WITH skill)
- Recommends file_storage extension
- Configures persistent queues on exporters
- Explains disk space requirements
- For Kubernetes: provides PersistentVolumeClaim YAML
- Mentions trade-off: persistence vs. performance
- References collector.md persistence section
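A sketch of the persistence wiring (directory, endpoint, and queue size are placeholders; in Kubernetes the directory should sit on a PersistentVolume):

```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/queue   # back with a PersistentVolumeClaim in Kubernetes

exporters:
  otlp:
    endpoint: backend.example.com:4317   # placeholder
    sending_queue:
      enabled: true
      storage: file_storage              # spill the queue to disk instead of RAM
      queue_size: 5000                   # size against the expected outage duration
    retry_on_failure:
      enabled: true

service:
  extensions: [file_storage]   # the extension must be enabled here to take effect
```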
Success Criteria
Scenario 10: OTTL Transformation Without Performance Consideration
Objective: Verify agent considers performance when using OTTL
Test Prompt
I need to redact all email addresses from span attributes using OTTL.
Expected Baseline Behavior (WITHOUT skill)
- Provides OTTL regex transformation
- Likely SKIPS: Performance optimization (where clauses, filter ordering)
- Likely MISSES: Error handling, regex efficiency
- Rationalization: "Here's the transformation"
Target Behavior (WITH skill)
- Provides OTTL transformation (sketched below) with:
  - `error_mode: ignore` for resilience
  - `where` clause to avoid unnecessary processing
- Efficient regex pattern
- Explains processor ordering (filter before transform if possible)
- Mentions testing with realistic data volumes
- References ottl.md best practices
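A hedged sketch (the regex is illustrative, and the `where` guard assumes redaction is only needed on server spans):

```yaml
processors:
  transform/redact_email:
    error_mode: ignore    # malformed data skips the statement instead of failing the pipeline
    trace_statements:
      - context: span
        statements:
          # Bounded character classes keep the regex cheap; the guard skips non-server spans
          - replace_all_patterns(attributes, "value", "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}", "{email}") where kind == SPAN_KIND_SERVER
```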
Success Criteria
Running These Tests
Step 1: Prepare Test Environment
Option A: Separate Claude Session
- Open Claude in a browser (without skill access)
- Or use a different CLI profile without opentelemetry-skill
Option B: Temporarily Disable Skill
```bash
mv ~/.claude/skills/opentelemetry-skill ~/.claude/skills/opentelemetry-skill.disabled
```
Step 2: Run Baseline (WITHOUT Skill)
For each scenario:
- Copy test prompt exactly
- Run in Claude WITHOUT skill loaded
- Document agent response verbatim in `baseline-results/scenario-N.md`
- Note specific rationalizations used
- Identify what was missed vs target behavior
Step 3: Enable Skill
```bash
mv ~/.claude/skills/opentelemetry-skill.disabled ~/.claude/skills/opentelemetry-skill
# Or reload skill in environment
```
Step 4: Run Compliance Tests (WITH Skill)
See compliance-verification.md for detailed methodology.
Step 5: Document Rationalizations
Capture all excuses/rationalizations in rationalization-table.md:
- "Here's a basic configuration"
- "This gives you detailed metrics"
- "Use tail sampling for best results"
- "DaemonSet is the standard pattern"
Each rationalization gets an explicit counter added to SKILL.md.
Expected Outcomes
Success Metrics
For the skill to be considered "passing TDD":
Common Baseline Failures to Document
- Missing memory_limiter (Scenario 1)
- Accepting high-cardinality metric dimensions (Scenario 2)
- Tail sampling without load balancing (Scenario 3)
- Missing TLS configuration (Scenario 4)
- No PII redaction (Scenario 5)
- No cost analysis for sampling (Scenario 6)
- Generic deployment recommendation (Scenario 7)
- Custom attribute names instead of semantic conventions (Scenario 8)
- Missing persistent queues (Scenario 9)
- Inefficient OTTL transformations (Scenario 10)
RED Phase Complete When:
Next Steps
After completing RED phase:
- → `compliance-verification.md` - Run WITH skill, compare results
- → `rationalization-table.md` - Document excuses, add counters to SKILL.md
- → Iterate: Find new loopholes, plug them, re-test
Remember: This is TDD for documentation. Same rigor as code testing.