# Compliance Verification (GREEN Phase)

**Purpose:** Verify that opentelemetry-skill changes agent behavior per TDD methodology.

**Prerequisite:** baseline-scenarios.md must be completed first (RED phase).

This document defines the GREEN phase of TDD testing: running the same scenarios WITH the skill loaded and verifying that behavior changes.
## Testing Workflow

### Prerequisites

- ✅ RED phase complete (baseline-scenarios.md scenarios run WITHOUT the skill)
- ✅ Baseline results documented in the baseline-results/ directory
- ✅ Skill loaded in the Claude environment
### GREEN Phase Process

For each scenario from baseline-scenarios.md:

1. Load opentelemetry-skill in the Claude environment
2. Run the exact same prompt as the baseline
3. Document the agent response in compliance-results/scenario-N.md
4. Compare to the baseline: what changed?
5. Verify the success criteria from the baseline scenario
## Comparison Template

For each scenario, document:

### Scenario N: [Name]

**Baseline Behavior (WITHOUT skill):**

- [What the agent did/said]
- [What was missed]
- [Rationalizations used]

**Compliance Behavior (WITH skill):**

- [What the agent did/said]
- [What improved]
- [Skill content referenced]

**Behavior Change:**

- ✅ Improved: [Specific improvements]
- ⚠️ Partial: [Partially addressed]
- ❌ Unchanged: [Still missing]

**Success Criteria Status:**

**Evidence of Skill Usage:**

**New Rationalizations Discovered:**

- [Any new excuses/workarounds to add to the rationalization table]
## Scenario 1: Collector Configuration Without Memory Protection

### Expected Improvements

**Baseline → Compliance Changes:**
- Agent NOW includes memory_limiter as first processor
- Agent explains why processor ordering matters
- Agent provides production-ready defaults (80% limit, 20% spike)
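For reference, a compliant response should resemble this minimal sketch (receiver and exporter names are placeholders; the defaults match the skill's 80%/20% guidance):

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80        # hard limit at 80% of available memory
    spike_limit_percentage: 20  # headroom reserved for sudden bursts
  batch: {}

service:
  pipelines:
    traces:
      receivers: [otlp]
      # memory_limiter must be FIRST so it can apply backpressure
      # before any other processor allocates memory
      processors: [memory_limiter, batch]
      exporters: [otlp]
```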
### Success Criteria Verification

### Evidence Checklist

Look for the agent:
- Mentioning "memory_limiter must be first"
- Providing default configuration (limit_percentage: 80)
- Explaining OOM prevention
- Referencing Core Principles from SKILL.md
## Scenario 2: High-Cardinality Metric Dimensions

### Expected Improvements

**Baseline → Compliance Changes:**
- Agent NOW blocks user_id and request_id in metrics
- Agent explains Rule of 100 and cardinality explosion
- Agent suggests alternative approaches (traces, aggregation)
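Beyond fixing the instrumentation itself, the agent might also propose a collector-side safety net that strips unbounded dimensions before export (a sketch; the attribute names come from this scenario):

```yaml
processors:
  # Drop unbounded dimensions before they reach the metrics backend;
  # user_id and request_id belong on spans, not on metric labels
  attributes/drop_high_cardinality:
    actions:
      - key: user_id
        action: delete
      - key: request_id
        action: delete
```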
### Success Criteria Verification

### Evidence Checklist

Look for the agent:
- Mentioning "Rule of 100"
- Explaining cardinality explosion risk
- Referencing instrumentation.md
- Suggesting traces for high-cardinality data
## Scenario 3: Tail Sampling Without Load Balancing

### Expected Improvements

**Baseline → Compliance Changes:**
- Agent NOW requires loadbalancing exporter for tail sampling
- Agent explains sticky session requirement
- Agent provides traceID routing configuration
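The first-tier configuration the agent should produce looks roughly like this sketch (the resolver hostname is a placeholder for the sampling-tier collector service):

```yaml
exporters:
  # Route by trace ID so every span of a trace reaches the SAME
  # sampling-tier collector; tail_sampling needs to see whole traces
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: false
    resolver:
      dns:
        hostname: otel-sampling.observability.svc.cluster.local  # placeholder
```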
### Success Criteria Verification

### Evidence Checklist

Look for the agent:
- Mentioning "loadbalancing exporter"
- Explaining "routing_key: traceID"
- Warning "all spans of a trace must reach same collector"
- Referencing sampling.md and architecture.md
## Scenario 4: Missing TLS Configuration

### Expected Improvements

**Baseline → Compliance Changes:**
- Agent NOW includes TLS by default
- Agent sets insecure: false
- Agent mentions authentication requirements
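A compliant exporter block might look like the following sketch (endpoint, certificate path, and the bearer-token auth header are illustrative assumptions):

```yaml
exporters:
  otlp:
    endpoint: collector.example.com:4317  # placeholder endpoint
    tls:
      insecure: false                 # never ship with insecure: true
      ca_file: /etc/otel/certs/ca.pem # example path to the trusted CA
    headers:
      # Example auth; the token is pulled from the environment
      Authorization: "Bearer ${env:OTEL_EXPORTER_TOKEN}"
```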
### Success Criteria Verification

## Scenario 5: PII in Telemetry

### Expected Improvements

**Baseline → Compliance Changes:**
- Agent NOW proactively asks about sensitive data
- Agent recommends PII redaction with OTTL
- Agent provides specific redaction patterns
- Agent explains processor placement
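An OTTL redaction sketch of the kind the agent should offer (the patterns and attribute keys are examples only and must be tailored to the data actually observed):

```yaml
processors:
  transform/redact_pii:
    error_mode: ignore
    trace_statements:
      - context: span
        statements:
          # Strip query-string tokens from recorded URLs
          - replace_pattern(attributes["http.url"], "token=[^&]+", "token=REDACTED")
          # Blank out email addresses captured as attributes
          - replace_pattern(attributes["user.email"], "^.*$", "REDACTED")
```

Placement matters: redaction should run after memory_limiter but before batching and export, so PII never leaves the collector.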
### Success Criteria Verification

## Scenario 6: Sampling Strategy Without Cost Analysis

### Expected Improvements

**Baseline → Compliance Changes:**
- Agent NOW performs System 2 analysis on throughput
- Agent asks about budget and requirements
- Agent explains sampling trade-offs
- Agent provides statistical analysis
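The expected back-of-envelope analysis can be illustrated with hypothetical numbers: at 5,000 spans/s and roughly 1 KB per span, ingest is about 432 GB/day; sampling at 10% while keeping all errors cuts that to roughly 43 GB/day plus the error traffic. A policy sketch (all values illustrative):

```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      # Always keep error traces
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Sample everything else at 10%
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```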
### Success Criteria Verification

## Scenario 7: Collector Deployment Pattern Selection

### Expected Improvements

**Baseline → Compliance Changes:**
- Agent NOW asks about signals and processing needs
- Agent uses deployment decision matrix
- Agent explains rationale for recommendation
### Success Criteria Verification

## Scenario 8: Instrumentation Without Semantic Conventions

### Expected Improvements

**Baseline → Compliance Changes:**
- Agent NOW corrects to semantic conventions
- Agent explains importance of standards
- Agent references specification
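A small collector-side sketch of the correction the agent should make, using semantic-convention resource keys rather than ad-hoc names (the values are placeholders):

```yaml
processors:
  resource:
    attributes:
      # Semantic-convention keys, not ad-hoc names like "serviceName"
      - key: service.name
        value: checkout-service   # placeholder
        action: upsert
      - key: service.version
        value: "1.4.2"            # placeholder
        action: upsert
```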
### Success Criteria Verification

## Scenario 9: Missing Persistent Queues

### Expected Improvements

**Baseline → Compliance Changes:**
- Agent NOW recommends file_storage extension
- Agent configures persistent queues
- Agent explains disk requirements
- Agent provides Kubernetes volume config
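A minimal sketch of the expected configuration (the storage directory and endpoint are placeholders; in Kubernetes the directory must be backed by a persistent volume):

```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/storage  # must be a persistent volume in k8s

exporters:
  otlp:
    endpoint: backend.example.com:4317   # placeholder
    sending_queue:
      enabled: true
      storage: file_storage  # queue survives collector restarts and backend outages

service:
  extensions: [file_storage]
```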
### Success Criteria Verification

## Scenario 10: OTTL Transformation Without Performance Consideration

### Expected Improvements

**Baseline → Compliance Changes:**
- Agent NOW includes error_mode
- Agent uses where clauses
- Agent mentions performance implications
- Agent recommends testing
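A sketch combining both expectations (the attribute names and route are hypothetical):

```yaml
processors:
  transform/enrich:
    # ignore = log the error and continue; propagate = fail the statement
    # and drop the data, so the choice has reliability implications
    error_mode: ignore
    trace_statements:
      - context: span
        statements:
          # The where clause limits evaluation to matching spans only,
          # avoiding per-span overhead on unrelated traffic
          - set(attributes["tier"], "critical") where attributes["http.route"] == "/checkout"
```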
### Success Criteria Verification

## Overall Compliance Assessment

### Passing Criteria

The skill is considered to be passing the GREEN phase when:

**Quantitative:**

**Qualitative:**

### Failure Modes
If scenarios fail (no behavior change):

**Diagnosis:**

- Check the skill description - does it match the trigger conditions?
- Check the "When to Use" section - is it clear enough?
- Check the content organization - is the pattern findable?
- Check keyword coverage - would a search find it?

**Remediation:**

- Enhance the frontmatter description and keywords
- Reorganize content for scannability
- Add explicit counter-rationalizations
- Re-test in the REFACTOR phase
## Documentation Requirements

### For Each Scenario

Create file: compliance-results/scenario-N-[name].md

Required sections:
- Full agent response (verbatim or screenshot)
- Comparison to baseline (what changed)
- Success criteria checklist
- Evidence of skill usage
- New rationalizations discovered
- PASS/PARTIAL/FAIL verdict
### Summary Report

Create file: compliance-results/SUMMARY.md

Include:
- Overview: N/10 scenarios passed
- Success criteria: N% met overall
- Key improvements observed
- Remaining gaps
- Rationalizations to address in REFACTOR phase
## GREEN Phase Complete When:

## Next Steps

After the GREEN phase:

- → rationalization-table.md - Update with findings
- → REFACTOR phase - Add counters to SKILL.md for new rationalizations
- → Re-test scenarios that failed or only partially passed
- → Iterate until 10/10 scenarios pass
This is iterative: First pass may only get 7/10 scenarios passing. That's expected. The goal is continuous improvement through the RED-GREEN-REFACTOR cycle.