Expert OpenTelemetry guidance for collector configuration, pipeline design, and production telemetry instrumentation. Use when configuring collectors, designing pipelines, instrumenting applications, implementing sampling, managing cardinality, securing telemetry, writing OTTL transformations, or setting up AI coding agent observability (Claude Code, Codex, Gemini CLI, GitHub Copilot).
99
100%
Does it follow best practices?
Impact
98%
8.16xAverage score across 4 eval scenarios
Passed
No known issues
This reference catalogs known anti-patterns across collector configuration, metrics design, Kubernetes deployments, and AI agent instrumentation. The top 6 most critical patterns are also kept inline in SKILL.md; all patterns are listed here with explanations and mitigations.
memory_limiter not first in the processor chainPlacing any other processor before memory_limiter allows unchecked memory growth during bursts or back-pressure events, which can silently corrupt in-flight data before the limiter can shed load.
Fix: memory_limiter must always be the first processor in every pipeline.
batch processorWithout batching, each span/metric/log is exported individually, producing thousands of small network calls per second and dramatically reducing throughput.
Fix: Add batch immediately after memory_limiter in every pipeline.
0.0.0.0 in productionpprof (port 1777) and zpages (port 55679) expose internal profiling and pipeline state. Binding them to all interfaces in a shared network allows any workload to extract sensitive runtime data or profile the collector under load.
Fix: Bind debug extensions to localhost only, or omit them entirely in production.
tail_sampling without sticky session load balancingtail_sampling keeps per-trace state in memory. When multiple collector replicas compete for the same traffic without a loadbalancing exporter routing upstream, spans from the same trace are split across instances and sampling decisions diverge.
Fix: Place a loadbalancing exporter (with routing_key: traceID) upstream of all tail_sampling replicas.
Attributes such as user_id, trace_id, request_id, and session_id are unbounded and will cause metric cardinality explosions, OOM kills in the collector, and cost blowups in the backend.
Fix: Use traces or structured logs for high-cardinality data; keep metric dimensions below 100 unique values per attribute (Rule of 100).
A config that passes otelcol validate can still be operationally broken: memory limits may exceed pod limits, stateful processors may run without sticky routing, queues may be backed by unsafe filesystems, and rollout settings may be mutually contradictory.
Fix: Treat every collector review as a systems review, not a YAML lint. Compare memory limits, replica strategy, routing, metric temporality state, queue storage medium, and PDB/HPA settings together.
hostPort on a horizontally scaled gateway DeploymenthostPort reserves a port on every node the pod lands on. This is appropriate for DaemonSets where exactly one pod runs per node, but counter-productive and fragile on a Deployment that can schedule multiple replicas anywhere.
Fix: Use a ClusterIP or LoadBalancer Service instead. Reserve hostPort for DaemonSet node-agent patterns with an explicit justification.
memory_limiter.limit_mib exceeding the pod/container memory limitWhen the collector's limit_mib is higher than the container's resources.limits.memory, the Go runtime can allocate beyond the cgroup limit before the limiter fires, causing the container to be OOM-killed with no graceful shedding.
Fix: Set limit_mib to roughly 80% of the container memory limit to leave headroom for the Go runtime and internal buffers.
file_storage queues with RWX/network filesystemsfile_storage uses bbolt, which requires byte-level file locking via flock(2). EFS, NFS, and CephFS (in network modes) do not provide reliable POSIX locks, leading to database corruption on multi-writer or failover scenarios.
Fix: Back file_storage with a ReadWriteOnce block-backed PVC (for example, gp3 on EKS). Never use ReadWriteMany or NFS-class storage.
deltatocumulative / cumulativetodelta without checking source temporality and backend expectationsThese processors are stateful: they accumulate or differentiate metric state in memory. Restarting the collector resets that state, producing a discontinuous series. Running multiple replicas without sticky routing splits the state across instances.
Fix: Confirm source SDK temporality, the backend's expected temporality, restart tolerance, and replica routing before enabling temporality conversion processors. Add justification comments in the config.
Some backends (VictoriaMetrics, OpenTSDB) silently drop delta-temporality metrics or misinterpret them as cumulative. The error is silent and data appears to disappear.
Fix: Check the backend's temporality contract. Use deltatocumulative in the collector if the SDK emits delta and the backend expects cumulative — but see the stateful conversion warning above.
prompt.id or session.id as metric dimensionsThese identifiers are unbounded (a new value per agent invocation). Using them as metric label values creates cardinality explosions that degrade the metrics backend and increase cost without insight.
Fix: Use prompt.id and session.id as span/log attributes only, never as metric dimensions.
captureContent / OTEL_LOG_USER_PROMPTS without PII controlsThese settings log raw prompt and completion text. In shared or production environments this can expose user PII, secrets embedded in prompts, and proprietary code to the telemetry backend.
Fix: Keep content capture disabled by default. If required for debugging, enable only in isolated dev/staging environments with appropriate data governance controls.
Claude Code and Codex-exec do not emit traces — they emit metrics and logs/events only. Configuring a traces pipeline as the sole signal path will silently discard all their telemetry.
Fix: Check the AI agent support matrix in compatibility.md before designing the collector pipeline. Always include a metrics and logs pipeline alongside any traces pipeline.
gen_ai.token.type handling to only input/output valuesThe GenAI semantic conventions include additional token type values (for example cache_read, cache_creation). Hard-coding only input/output silently drops cost attribution for cached tokens.
Fix: Use a default/catch-all branch for unknown gen_ai.token.type values rather than strict equality checks.
GenAI and AI agent SemConv proposals move quickly. Implementing attributes from draft proposals means breaking changes when the spec stabilizes.
Fix: Check the stability status in compatibility.md before using new attributes in production instrumentation.
OTTL is strictly typed. Setting an attribute to a boolean and then matching it with IsMatch(..., "true") will silently fail. Using deprecated keys like http.status_code instead of http.response.status_code produces gaps when backends apply semantic convention normalization.
Fix: Use Int(), String(), Bool() converters explicitly in OTTL statements. Use current SemConv keys and validate with otelcol validate.
docs
evals
cardinality-protection
claude-code-telemetry
collector-memory-limiter
scenario-1
scenario-2
scenario-3
scenario-4
tail-sampling-setup
references