Expert OpenTelemetry guidance for collector configuration, pipeline design, and production telemetry instrumentation. Use when configuring collectors, designing pipelines, instrumenting applications, implementing sampling, managing cardinality, securing telemetry, writing OTTL transformations, or setting up AI coding agent observability (Claude Code, Codex, Gemini CLI, GitHub Copilot).
"Who watches the watchers?" Meta-monitoring is the practice of observing the observability pipeline itself. A failing collector can silently drop data, creating blind spots in production systems. This reference provides comprehensive guidance on collector self-monitoring, health checks, and alerting patterns.
Scenario: Your collector is silently dropping 50% of traces due to memory pressure.
Impact: Dashboards and alerts built on that telemetry develop blind spots, and the loss goes unnoticed until an incident.
Solution: Monitor the collector's internal metrics to detect issues before data loss becomes critical.
| Category | Metrics | Purpose |
|---|---|---|
| Throughput | Accepted vs sent spans/metrics/logs | Data flow verification |
| Data Loss | Refused, dropped, failed exports | Detect backpressure and failures |
| Resources | CPU, memory, disk usage | Prevent OOM kills |
| Queue Health | Queue size vs capacity | Predict saturation |
| Export Performance | Export latency, retry count | Backend health |
The collector exposes internal metrics on port 8888 (default).
service:
telemetry:
logs:
level: info # Options: debug, info, warn, error
metrics:
level: detailed # Options: none, basic, normal, detailed
address: "0.0.0.0:8888" # Prometheus scrape endpointPrometheus scrape config:
scrape_configs:
- job_name: 'otel-collector'
static_configs:
- targets: ['otel-collector:8888']
relabel_configs:
- source_labels: [__address__]
target_label: instance
Kubernetes ServiceMonitor (with Prometheus Operator):
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: otel-collector
namespace: observability
spec:
selector:
matchLabels:
app: otel-collector
endpoints:
- port: metrics
interval: 30s
path: /metrics
The OpenTelemetry Collector Internal Telemetry documentation now recommends using OTLP periodic readers for self-observability instead of (or in addition to) the Prometheus scrape pattern. This enables the collector to push its own metrics directly to a backend without requiring an external scraper.
| Pattern | Pros | Cons |
|---|---|---|
| Prometheus scrape (pull) | Simple, compatible with existing Prometheus stacks | Requires external scraper; pull model means delayed detection |
| OTLP push (recommended) | No external scraper; metrics arrive as fast as flush interval; unified with app telemetry in same backend | Requires OTLP-capable metrics backend |
Use service.telemetry.metrics.readers to configure an OTLP exporter for self-metrics:
service:
telemetry:
logs:
level: info
metrics:
level: detailed # none | basic | normal | detailed
readers:
- pull:
exporter:
prometheus:
host: "0.0.0.0"
port: 8888 # keep Prometheus endpoint for existing dashboards
- periodic:
interval: 60000 # push interval in milliseconds (60s)
timeout: 30000 # export timeout in milliseconds
exporter:
otlp:
protocol: grpc
endpoint: "http://otlp-backend:4317"
headers:
authorization: "Bearer ${env:SELF_TELEMETRY_TOKEN:-}"This configuration simultaneously:
:8888 for existing Prometheus scrapers (backward compatible)For environments where the same collector handles both application telemetry and self-telemetry, you can forward self-metrics through an OTLP loopback:
service:
telemetry:
metrics:
level: detailed
readers:
- periodic:
interval: 30000
exporter:
otlp:
protocol: grpc
endpoint: "localhost:4317" # loopback to own OTLP receiver
pipelines:
metrics:
receivers: [otlp] # receives both app and self metrics
processors: [memory_limiter, batch]
exporters: [otlp/backend]
⚠️ Loopback cardinality warning: Self-metrics include collector-internal labels (receiver, processor, exporter names). These are low-cardinality and safe to forward, but ensure your backend can handle the additional time series.
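If the extra series matter for backend cost, one option is to trim the noisier internal metrics before they leave the loopback pipeline. A minimal sketch using the filter processor; the name pattern and pipeline wiring here are illustrative assumptions, not part of the configuration above:

```yaml
processors:
  # Drop internal metrics that are never charted; keep the core otelcol_* series.
  filter/self-metrics:
    error_mode: ignore
    metrics:
      metric:
        - 'IsMatch(name, "otelcol_processor_batch_.*")'  # example pattern only

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, filter/self-metrics, batch]
      exporters: [otlp/backend]
```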
Add resource attributes to self-telemetry for multi-collector environments:
service:
telemetry:
resource:
service.name: "otel-gateway"
service.instance.id: "${env:POD_NAME}"
k8s.namespace.name: "${env:NAMESPACE}"
k8s.node.name: "${env:NODE_NAME}"
metrics:
level: detailed
readers:
- periodic:
interval: 60000
exporter:
otlp:
protocol: grpc
endpoint: "http://central-backend:4317"| Scenario | Recommendation |
|---|---|
| Existing Prometheus/Grafana stack | Keep pull (:8888) + optionally add OTLP push |
| Backend supports OTLP metrics natively | Use OTLP push exclusively (simpler ops; see the sketch after this table) |
| Air-gapped or strict network policy environments | Use pull model (no outbound from collector required) |
| Multi-collector fleet with centralized backend | OTLP push with service.instance.id for per-collector attribution |
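For the "backend supports OTLP metrics natively" row, a minimal push-only sketch with the Prometheus pull endpoint omitted; the endpoint is a placeholder:

```yaml
service:
  telemetry:
    metrics:
      level: normal
      readers:
        - periodic:
            interval: 60000        # push interval in milliseconds
            exporter:
              otlp:
                protocol: grpc
                endpoint: "http://otlp-backend:4317"
```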
All collector metrics follow the pattern:
otelcol_{component}_{signal}_{metric}
Examples:
- otelcol_receiver_accepted_spans
- otelcol_processor_dropped_metric_points
- otelcol_exporter_send_failed_log_records
Metrics:
# Traces
otelcol_receiver_accepted_spans
otelcol_exporter_sent_spans
# Metrics
otelcol_receiver_accepted_metric_points
otelcol_exporter_sent_metric_points
# Logs
otelcol_receiver_accepted_log_records
otelcol_exporter_sent_log_records
Query (spans/second):
rate(otelcol_receiver_accepted_spans[1m])
rate(otelcol_exporter_sent_spans[1m])
What to look for: accepted and sent rates should track each other closely; a persistent gap means data is being dropped or held in queues inside the collector.
Metrics:
# Refused by receiver (backpressure)
otelcol_receiver_refused_spans
# Dropped by processor (filter/sampling)
otelcol_processor_dropped_spans
# Failed exports
otelcol_exporter_send_failed_spans
Query (dropped percentage):
100 * (
rate(otelcol_processor_dropped_spans[1m]) +
rate(otelcol_exporter_send_failed_spans[1m])
) / rate(otelcol_receiver_accepted_spans[1m])
Alert threshold: > 1% data loss
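The alerting rules later in this page fire on any failed export; if you prefer to alert on the 1% threshold directly, here is a sketch of a percentage-based Prometheus rule (rule name, window, and severity are illustrative):

```yaml
groups:
  - name: otel-collector-data-loss
    rules:
      - alert: OTelCollectorDataLossPercent
        expr: |
          100 * (
            rate(otelcol_processor_dropped_spans[5m]) +
            rate(otelcol_exporter_send_failed_spans[5m])
          ) / rate(otelcol_receiver_accepted_spans[5m]) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Collector span loss above 1%"
```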
Metrics:
# Memory
otelcol_process_memory_rss # Resident Set Size (actual RAM usage)
# CPU
rate(otelcol_process_cpu_seconds_total[1m])
# Go runtime allocated bytes
otelcol_process_runtime_total_alloc_bytes
Query (memory usage percentage):
100 * otelcol_process_memory_rss / node_memory_MemTotal_bytes
Alert threshold: > 80% of the memory limit (the alerting rules below use the container limit rather than node memory)
Metrics:
# Current queue size
otelcol_exporter_queue_size
# Max queue capacity
otelcol_exporter_queue_capacity
Query (queue saturation percentage):
100 * otelcol_exporter_queue_size / otelcol_exporter_queue_capacity
Alert threshold: > 80% full
Metrics:
# Failed sends
otelcol_exporter_send_failed_metric_points
# Enqueue failures (data dropped because the queue was full)
otelcol_exporter_enqueue_failed_spans
Query (export failure rate):
rate(otelcol_exporter_send_failed_spans[1m])
Alert threshold: > 0 (any failures)
The health_check extension provides HTTP endpoints for Kubernetes probes.
Configuration:
extensions:
health_check:
endpoint: "0.0.0.0:13133"
tls:
ca_file: ""
cert_file: ""
key_file: ""
path: "/"
check_collector_pipeline:
enabled: true
interval: "5m"
exporter_failure_threshold: 5
service:
extensions: [health_check]
Note: the check_collector_pipeline option has known limitations on recent collector releases; verify its behavior against your collector version before relying on it.
Kubernetes liveness/readiness probes:
apiVersion: apps/v1
kind: Deployment
metadata:
name: otel-collector
spec:
template:
spec:
containers:
- name: otel-collector
ports:
- containerPort: 13133
name: health
livenessProbe:
httpGet:
path: /
port: 13133
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /
port: 13133
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 3
Healthy:
HTTP/1.1 200 OK
Content-Type: application/json
{
"status": "Server available",
"uptime": "3h45m12s",
"uptime_ns": 13512000000000
}
Unhealthy (exporter failing):
HTTP/1.1 503 Service Unavailable
Content-Type: application/json
{
"status": "Server unavailable",
"error": "exporters: otlp: failed to send data: connection refused"
}
The community-standard dashboards for OpenTelemetry Collector monitoring:
Repository: monitoringartist/opentelemetry-collector-monitoring
Grafana Dashboard ID: 15983
Import:
# Via Grafana UI
Dashboard → Import → ID: 15983
# Via ConfigMap
kubectl create configmap otel-collector-dashboard \
--from-file=dashboard.json \
-n observability
Query:
sum(rate(otelcol_receiver_accepted_spans[1m])) by (receiver)
sum(rate(otelcol_exporter_sent_spans[1m])) by (exporter)
Visualization: Line graph
Purpose: Verify data is flowing through the pipeline
Query:
sum(rate(otelcol_processor_dropped_spans[1m])) by (processor)
sum(rate(otelcol_exporter_send_failed_spans[1m])) by (exporter)
Visualization: Stacked area chart
Purpose: Identify sources of data loss
Query:
otelcol_process_memory_rss / 1024 / 1024 # Convert to MB
Visualization: Gauge + Line graph
Purpose: Detect memory leaks or approaching OOM
Query:
100 * otelcol_exporter_queue_size / otelcol_exporter_queue_capacity
Visualization: Gauge (thresholds: 50% yellow, 80% red)
Purpose: Predict queue overflow
Query:
histogram_quantile(0.99, rate(otelcol_exporter_send_latency_bucket[1m]))
Visualization: Heatmap
Purpose: Detect backend slowness
Prometheus alerting rules:
groups:
- name: otel-collector
interval: 30s
rules:
# Data loss alert
- alert: OTelCollectorDataLoss
expr: |
rate(otelcol_exporter_send_failed_spans[1m]) > 0
for: 5m
labels:
severity: critical
component: otel-collector
annotations:
summary: "OpenTelemetry Collector is dropping data"
description: "Collector {{ $labels.instance }} has failed to export {{ $value }} spans/second for 5 minutes."
# High memory usage
- alert: OTelCollectorHighMemory
expr: |
100 * otelcol_process_memory_rss /
(container_spec_memory_limit_bytes{pod=~"otel-collector.*"} > 0) > 80
for: 10m
labels:
severity: warning
component: otel-collector
annotations:
summary: "OpenTelemetry Collector memory usage high"
description: "Collector {{ $labels.instance }} is using {{ $value }}% of its memory limit."
# Queue saturation
- alert: OTelCollectorQueueFull
expr: |
100 * otelcol_exporter_queue_size / otelcol_exporter_queue_capacity > 80
for: 5m
labels:
severity: warning
component: otel-collector
annotations:
summary: "OpenTelemetry Collector queue is filling up"
description: "Collector {{ $labels.instance }} queue is {{ $value }}% full. Risk of data loss."
# Collector down
- alert: OTelCollectorDown
expr: |
up{job="otel-collector"} == 0
for: 2m
labels:
severity: critical
component: otel-collector
annotations:
summary: "OpenTelemetry Collector is down"
description: "Collector {{ $labels.instance }} has been down for 2 minutes."
# Receiver backpressure
- alert: OTelCollectorBackpressure
expr: |
rate(otelcol_receiver_refused_spans[1m]) > 0
for: 5m
labels:
severity: warning
component: otel-collector
annotations:
summary: "OpenTelemetry Collector is applying backpressure"
description: "Collector {{ $labels.instance }} is refusing {{ $value }} spans/second due to memory limits."
# Export latency
- alert: OTelCollectorHighExportLatency
expr: |
histogram_quantile(0.99,
rate(otelcol_exporter_send_latency_bucket[1m])
) > 5
for: 10m
labels:
severity: warning
component: otel-collector
annotations:
summary: "OpenTelemetry Collector export latency high"
description: "Collector {{ $labels.instance }} p99 export latency is {{ $value }}s. Backend may be slow."route:
receiver: default
group_by: ['alertname', 'component']
group_wait: 10s
group_interval: 10s
repeat_interval: 12h
routes:
- match:
component: otel-collector
severity: critical
receiver: pagerduty
continue: true
- match:
component: otel-collector
severity: warning
receiver: slack
receivers:
- name: pagerduty
pagerduty_configs:
- service_key: '<pagerduty_key>'
- name: slack
slack_configs:
- api_url: '<slack_webhook>'
channel: '#observability-alerts'
Configure the collector to send its own telemetry to a backend:
exporters:
otlp/self:
endpoint: backend.example.com:4317
tls:
insecure: false
service:
telemetry:
metrics:
level: detailed
address: "0.0.0.0:8888"
# Export collector metrics via OTLP
readers:
- periodic:
interval: 30000 # milliseconds (30s)
exporter:
otlp:
protocol: grpc
endpoint: backend.example.com:4317
Enable tracing for the collector itself:
service:
telemetry:
traces:
processors:
- batch:
timeout: 10s
exporters:
- otlp:
endpoint: backend.example.com:4317
Use case: Debug collector performance issues by tracing its internal operations.
Symptoms: Upstream services fail to send data, HTTP 503 errors
Check:
rate(otelcol_receiver_refused_spans[1m])
Causes:
- Memory limiter engaged and the collector is applying backpressure (check otelcol_process_memory_rss); see the sketch below
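When the memory limiter is the cause, the collector refuses new data by design rather than being OOM-killed. A sketch of the relevant settings; values are illustrative, not recommendations:

```yaml
processors:
  memory_limiter:
    check_interval: 1s          # how often memory usage is sampled
    limit_percentage: 80        # hard limit as a share of available memory
    spike_limit_percentage: 25  # soft limit = hard limit - spike; refusals begin above the soft limit
```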
Symptoms: Collector accepts data, but backend shows no traces
Check:
rate(otelcol_exporter_sent_spans[1m])
rate(otelcol_exporter_send_failed_spans[1m])
Causes:
- Backend unreachable, slow, or rejecting data; exports back up in the queue (check otelcol_exporter_queue_size); see the retry/queue sketch below
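When the backend is flaky rather than down, what happens next is governed by the exporter's queue and retry settings. A sketch with illustrative values (the endpoint is a placeholder):

```yaml
exporters:
  otlp:
    endpoint: backend.example.com:4317
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000          # batches buffered while the backend is unavailable
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s    # give up (counted as a failed send) after 5 minutes
```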
Symptoms: Collector OOMKilled, restarts frequently
Check:
otelcol_process_memory_rss
otelcol_exporter_queue_size
Causes:
- Exporter queue sized larger than memory allows (queue_size > memory allows); memory_limiter missing or set too high; see the sizing sketch below
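To stay inside the container limit, budget the memory limiter and exporter queue against that limit. A sketch for a hypothetical 2Gi container; all numbers are illustrative:

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1600         # ~80% of a 2Gi container limit
    spike_limit_mib: 400

exporters:
  otlp:
    endpoint: backend.example.com:4317
    sending_queue:
      queue_size: 2000      # queued batches consume memory; size them within the budget above
```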
Symptoms: Traces missing spans, incorrect sampling decisions
Check:
- Load-balancing exporter uses routing_key: traceID so every span of a trace reaches the same collector (see the sketch below)
- Tail sampling decision_wait is long enough for trace completion
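A sketch of the two-tier layout those checks assume: a load-balancing tier routes by trace ID so every span of a trace lands on the same tail-sampling collector. The resolver, service name, and sampling policy are illustrative assumptions:

```yaml
# Tier 1: load-balancing collector
exporters:
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        hostname: otel-sampling-headless.observability.svc.cluster.local

# Tier 2: tail-sampling collector
processors:
  tail_sampling:
    decision_wait: 30s        # must exceed the longest expected trace duration
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
```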
Checklist:
✅ Expose metrics on port 8888 for Prometheus scraping
✅ Or use OTLP push via service.telemetry.metrics.readers for unified self-telemetry
✅ Monitor throughput: Accepted vs sent spans/metrics/logs
✅ Alert on data loss: otelcol_exporter_send_failed_spans > 0
✅ Track memory usage: Set alerts at 80% of limit
✅ Watch queue saturation: Alert when > 80% full — indicates downstream bottleneck
✅ Use health checks: Configure Kubernetes liveness/readiness probes
✅ Deploy dashboards: Use monitoringartist/opentelemetry-collector-monitoring
✅ Self-observe: Send collector metrics to the same backend with service.instance.id
Meta-monitoring is not optional—it's the safety net that prevents silent data loss in production observability systems.