Expert OpenTelemetry guidance for collector configuration, pipeline design, and production telemetry instrumentation. Use when configuring collectors, designing pipelines, instrumenting applications, implementing sampling, managing cardinality, securing telemetry, writing OTTL transformations, or setting up AI coding agent observability (Claude Code, Codex, Gemini CLI, GitHub Copilot).
Use these defaults:
Stability over Features: Check otelcol-contrib stability (Alpha/Beta/Stable); warn before production use of non-stable components.
Convention over Configuration: Prefer OpenTelemetry Semantic Conventions over custom attribute names.
Protocol Unification: Default to OTLP gRPC (port 4317); use OTLP HTTP (port 4318) when gRPC is unavailable due to agent, proxy, browser, or backend constraints.
Deterministic Routing Keys: For load-balancing exporters, use stable, deterministic routing keys — traceID for tail-sampling stickiness, tenant_id or cluster for tenant/shard routing. Normalize non-string attributes before routing.
Safety First: Prioritize collector stability (memory limiters, persistent queues, backpressure) over data completeness. Dropping data is preferable to crashing the collector.
Cardinality Awareness: High-cardinality attributes (>100 unique values) must NOT be metric dimensions — use traces or logs instead.
Security by Default: Redact PII, enable TLS for cross-network communication, and authenticate all collector endpoints.
Cross-Field Consistency: Treat collector reviews as systems reviews, not YAML linting. Compare processor order, memory limits, replica strategy, routing, metric temporality state, queue storage medium, PDB/HPA settings, and OTTL attribute types together before calling a config "safe".
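As a rough illustration of the Deterministic Routing Keys default, the fragment below sketches a first-tier collector that routes spans by trace ID to a downstream sampling tier. The Service hostname and the plaintext in-cluster connection are assumptions made for this sketch, and the otlp receiver and memory_limiter processor are assumed to be defined elsewhere in the same file.

```yaml
exporters:
  loadbalancing:
    # Every span of a trace goes to the same downstream collector.
    routing_key: traceID
    protocol:
      otlp:
        tls:
          # Assumption: plaintext inside the cluster. Configure
          # certificates instead when traffic crosses networks.
          insecure: true
    resolver:
      dns:
        # Hypothetical headless Service (clusterIP: None) in front of
        # the sampling tier; DNS then returns one record per pod.
        hostname: otelcol-sampling-headless.observability.svc.cluster.local

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter]
      exporters: [loadbalancing]
```

The headless Service matters because a regular ClusterIP resolves to a single virtual IP, which would leave the exporter with only one backend to balance across.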
Before generating config/code, confirm these. If unknown, ask first:
file_storage + persistent queues. → collector.md
When user requests match these patterns, include these points explicitly:
memory_limiter; keep it first in each pipeline processors list; explain OOM-prevention rationale.
user_id: refuse; explain time-series explosion risk; suggest traces and bounded metric dimensions.
loadbalancing with routing_key: traceID, Headless Service (clusterIP: None), error+10% policies, Beta stability caution.
CLAUDE_CODE_ENABLE_TELEMETRY=1, OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE=cumulative, ~/.claude/settings.json persistence; OTEL_LOG_USER_PROMPTS/OTEL_LOG_TOOL_DETAILS default to false, so warn against enabling them in shared/production environments without PII controls; avoid session.id as a metric dimension.
When the user provides an existing collector config, Helm values file, or Kubernetes manifest, audit it for internal contradictions before proposing edits.
Always compare:
limit_mib must stay below the container memory limit with headroom for the Go runtime and internal buffers.
tail_sampling, spanmetrics, and servicegraph require sticky routing when replicas can exceed 1.
hostPort is usually a DaemonSet/node-local choice, not a horizontally scaled gateway Deployment default.
replicaCount, HPA minReplicas, PodDisruptionBudget, and rolling update settings as one unit.
http.response.status_code.
deltatocumulative / cumulativetodelta are stateful conversions; verify source temporality, backend expectations, restart tolerance, and any replica/routing assumptions before enabling them.
file_storage needs local locking-safe storage; ReadWriteMany / EFS / NFS-style volumes are not safe defaults.
Load detailed reference documentation only when the user's request matches a trigger. This keeps context lean.
| Trigger keywords | Load | Key topics |
|---|---|---|
| Kubernetes, Helm, values.yaml, audit, review, DaemonSet, Sidecar, Gateway, Scaling, Load Balancing | architecture.md | DaemonSet vs Gateway vs Sidecar, Target Allocator, HPA, rollout consistency |
| Pipeline, Receiver, Processor, Exporter, Queue, Batch, Memory, Extensions, existing config | collector.md | Processor ordering, memory_limiter, file_storage, config audit heuristics, temporality/state audits, stability levels |
| SDK, Instrumentation, Spans, Attributes, Semantic Conventions, Cardinality | instrumentation.md | Auto vs manual, SemConv, cardinality Rule of 100 |
| Sampling, Cost, Volume, Head Sampling, Tail Sampling, Probabilistic | sampling.md | Head/tail sampling, sticky sessions, sampling math |
| Security, PII, GDPR, Redaction, TLS, Authentication, Credentials | security.md | PII redaction, mTLS, RBAC, extension exposure risks |
| Monitor the collector, Health, Alerts, Self-monitoring, Collector metrics | monitoring.md | otelcol_* metrics, dashboards, alert rules |
| Lambda, Azure Functions, GCP Functions, Serverless, FaaS, Mobile, Browser | platforms.md | FaaS patterns, Lambda extension layer, client-side apps |
| OTTL, Transform, Transformation, Modify, Filter attributes, Parse, Extract | ottl.md | OTTL syntax, context types, built-in functions, error handling |
| Connector, spanmetrics, servicegraph, routing connector, failover connector | connectors.md | R.E.D. metrics, service graph, routing, failover, stickiness |
| Claude Code, Codex, Gemini CLI, Copilot, AI agent, coding agent, MCP | ai-agents.md | Agent OTel support matrix, unified collector config, GenAI SemConv |
| playbook, production playbook, blog, 2025 blog, 2026 blog, real world | playbooks.md | Production patterns from opentelemetry.io blogs |
| anti-pattern, common mistake, what to avoid, pitfall | anti-patterns.md | Full annotated anti-pattern catalogue: pipeline, metrics, Kubernetes, AI agents, OTTL |
Use these defaults unless the user specifies otherwise. This is a copy-paste-ready starting point:
```yaml
extensions:
  health_check:
    endpoint: "0.0.0.0:13133"
  file_storage/queue:
    directory: /var/lib/otelcol/queue
    timeout: 10s
    compaction:
      on_start: true
      on_rebound: false

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 20
  batch:
    timeout: 10s
    send_batch_size: 1024

exporters:
  otlp:
    endpoint: "your-backend:4317"
    sending_queue:
      enabled: true
      storage: file_storage/queue
      num_consumers: 4
      queue_size: 1024
    retry_on_failure:
      enabled: true
      initial_interval: 1s
      max_interval: 30s
      max_elapsed_time: 300s
  # otlphttp:  # HTTP exporter; use when backend requires HTTP
  #   endpoint: "https://your-backend:4318"
  #   sending_queue: { enabled: true, storage: file_storage/queue }

service:
  extensions: [health_check, file_storage/queue]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
```
Key defaults:
memory_limiter must be first in every processor chain.
batch reduces exporter network calls.
file_storage preserves queues across restarts only when the collector returns to the same host/volume. In Kubernetes, back /var/lib/otelcol/queue with a ReadWriteOnce block-backed PVC rather than RWX/network storage.
In shared networks, bind health_check to localhost rather than the 0.0.0.0 shown in the starting point above.
Always include validation checkpoints when delivering configurations.
```bash
# Syntax and structural validation (local binary)
otelcol validate --config config.yaml

# Container-based dry-run (no outbound traffic)
docker run --rm -v "$(pwd)/config.yaml:/etc/otelcol/config.yaml" \
  otel/opentelemetry-collector-contrib:latest \
  validate --config /etc/otelcol/config.yaml
```
```bash
# Health endpoint: returns 200 when collector is ready
curl -sf http://localhost:13133/ && echo "healthy"

# Tail logs for pipeline errors and dropped data
kubectl logs -l app=otelcol -f | grep -E "error|dropped|refused|timeout"

# Collector self-metrics (Prometheus scrape)
curl -s http://localhost:8888/metrics | grep -E "otelcol_processor_dropped|otelcol_exporter_send_failed"
```
| Symptom | Likely cause | Fix |
|---|---|---|
| Collector exits at start | Config parse error | Run otelcol validate; check indentation and quoted strings |
| memory_limiter: data dropped in logs | Memory limit hit | Increase limit_percentage, reduce send_batch_size, or add upstream sampling |
| exporter queue is full | Backend unreachable or slow | Verify endpoint reachability; increase queue_size; check retry_on_failure settings |
| pipeline drops data on restart | No persistent queue | Add file_storage extension and set storage: file_storage/queue in exporter sending_queue |
| OTTL statement silently skipped | Type mismatch or nil value | Add error_mode: ignore; guard with where attributes["key"] != nil; use Int() / String() converters |
| Tail sampling misses spans | Spans split across collector instances | Use loadbalancing exporter with routing_key: traceID upstream |
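For the OTTL row in the table above, a guarded statement might look like the following sketch. The processor name transform/normalize_status is hypothetical, and the example assumes the status code sometimes arrives as a string on span attributes; adapt the context and attribute to the data actually observed.

```yaml
processors:
  transform/normalize_status:   # hypothetical processor name
    # Log statement errors and keep processing instead of failing the batch.
    error_mode: ignore
    trace_statements:
      - context: span
        statements:
          # The where clause skips spans that lack the attribute, so Int()
          # only runs when there is a value to convert.
          - set(attributes["http.response.status_code"], Int(attributes["http.response.status_code"])) where attributes["http.response.status_code"] != nil
```

The processor only takes effect once it is listed in a pipeline's processors list, after memory_limiter.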
The most critical patterns are listed here. See anti-patterns.md for the full annotated catalog.
❌ Placing memory_limiter anywhere except first in the processor chain
❌ Using high-cardinality attributes (user_id, trace_id) as metric dimensions
❌ Exposing pprof (1777), zpages (55679) on 0.0.0.0 in production
❌ Using tail_sampling without sticky session load balancing (loadbalancing exporter)
❌ Omitting batch processor (causes excessive network calls)
❌ Calling a config "fine" because it parses, without checking memory limits, sticky routing, exporter durability, and rollout settings together
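To make the extension-exposure anti-pattern concrete, the fragment below sketches debug and health extensions bound to the loopback interface. It assumes a setup where nothing outside the host or pod needs to reach these ports; a Kubernetes liveness probe, for example, would need health_check to stay reachable on the pod IP.

```yaml
extensions:
  pprof:
    endpoint: "localhost:1777"    # profiling stays local to the host/pod
  zpages:
    endpoint: "localhost:55679"   # debug pages stay local to the host/pod
  health_check:
    # Assumption: health is checked from the same host or pod.
    endpoint: "localhost:13133"

service:
  extensions: [pprof, zpages, health_check]
```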
SKILL.md focused on routing logic, guardrails, and production defaults rather than inline release tracking.