OpenTelemetry Collector Meta-Monitoring

Overview

"Who watches the watchers?" Meta-monitoring is the practice of observing the observability pipeline itself. A failing collector can silently drop data, creating blind spots in production systems. This reference provides comprehensive guidance on collector self-monitoring, health checks, and alerting patterns.

Table of Contents

  1. Why Meta-Monitoring?
  2. Collector Telemetry
  3. OTLP Self-Telemetry
  4. Critical Metrics
  5. Health Checks
  6. Dashboards
  7. Alert Rules
  8. Advanced Patterns
  9. Troubleshooting Checklist

Why Meta-Monitoring?

The Silent Failure Problem

Scenario: Your collector is silently dropping 50% of traces due to memory pressure.

Impact:

  • ❌ Missing spans in distributed traces
  • ❌ Incorrect latency percentiles (p95, p99)
  • ❌ Undetected errors
  • ❌ False confidence in system health

Solution: Monitor the collector's internal metrics to detect issues before data loss becomes critical.

What to Monitor

| Category | Metrics | Purpose |
| --- | --- | --- |
| Throughput | Accepted vs sent spans/metrics/logs | Data flow verification |
| Data Loss | Refused, dropped, failed exports | Detect backpressure and failures |
| Resources | CPU, memory, disk usage | Prevent OOM kills |
| Queue Health | Queue size vs capacity | Predict saturation |
| Export Performance | Export latency, retry count | Backend health |

Collector Telemetry

The collector exposes internal metrics on port 8888 (default).

Enabling Telemetry

service:
  telemetry:
    logs:
      level: info  # Options: debug, info, warn, error
    
    metrics:
      level: detailed  # Options: none, basic, normal, detailed
      address: "0.0.0.0:8888"  # Prometheus scrape endpoint

Scraping Collector Metrics

Prometheus scrape config:

scrape_configs:
  - job_name: 'otel-collector'
    static_configs:
      - targets: ['otel-collector:8888']
    
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance

Kubernetes ServiceMonitor (with Prometheus Operator):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: otel-collector
  namespace: observability
spec:
  selector:
    matchLabels:
      app: otel-collector
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics
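The ServiceMonitor above assumes a Service that carries the app: otel-collector label and exposes the telemetry port under the name metrics. A minimal sketch of such a Service (the name and namespace are illustrative):

apiVersion: v1
kind: Service
metadata:
  name: otel-collector
  namespace: observability
  labels:
    app: otel-collector
spec:
  selector:
    app: otel-collector
  ports:
  - name: metrics
    port: 8888
    targetPort: 8888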

OTLP Self-Telemetry

The OpenTelemetry Collector Internal Telemetry documentation now recommends using OTLP periodic readers for self-observability instead of (or in addition to) the Prometheus scrape pattern. This enables the collector to push its own metrics directly to a backend without requiring an external scraper.

Why OTLP for Self-Telemetry?

| Pattern | Pros | Cons |
| --- | --- | --- |
| Prometheus scrape (pull) | Simple, compatible with existing Prometheus stacks | Requires an external scraper; pull model means delayed detection |
| OTLP push (recommended) | No external scraper; metrics arrive as fast as the flush interval; unified with app telemetry in the same backend | Requires an OTLP-capable metrics backend |

Configuration: OTLP Periodic Reader

Use service.telemetry.metrics.readers to configure an OTLP exporter for self-metrics:

service:
  telemetry:
    logs:
      level: info
    metrics:
      level: detailed          # none | basic | normal | detailed
      readers:
        - pull:
            exporter:
              prometheus:
                host: "0.0.0.0"
                port: 8888     # keep Prometheus endpoint for existing dashboards

        - periodic:
            interval: 60000    # push interval in milliseconds (60s)
            timeout: 30000     # export timeout in milliseconds
            exporter:
              otlp:
                protocol: grpc
                endpoint: "http://otlp-backend:4317"
                headers:
                  authorization: "Bearer ${env:SELF_TELEMETRY_TOKEN:-}"

This configuration simultaneously:

  1. Exposes metrics on :8888 for existing Prometheus scrapers (backward compatible)
  2. Pushes metrics every 60s via OTLP gRPC to a backend

Self-Telemetry Pipeline Pattern

For environments where the same collector handles both application telemetry and self-telemetry, you can forward self-metrics through an OTLP loopback:

service:
  telemetry:
    metrics:
      level: detailed
      readers:
        - periodic:
            interval: 30000
            exporter:
              otlp:
                protocol: grpc
                endpoint: "localhost:4317"   # loopback to own OTLP receiver

  pipelines:
    metrics:
      receivers: [otlp]                      # receives both app and self metrics
      processors: [memory_limiter, batch]
      exporters: [otlp/backend]

⚠️ Loopback cardinality warning: Self-metrics include collector-internal labels (receiver, processor, exporter names). These are low-cardinality and safe to forward, but ensure your backend can handle the additional time series.

Resource Attributes for Self-Telemetry

Add resource attributes to self-telemetry for multi-collector environments:

service:
  telemetry:
    resource:
      service.name: "otel-gateway"
      service.instance.id: "${env:POD_NAME}"
      k8s.namespace.name: "${env:NAMESPACE}"
      k8s.node.name: "${env:NODE_NAME}"
    metrics:
      level: detailed
      readers:
        - periodic:
            interval: 60000
            exporter:
              otlp:
                protocol: grpc
                endpoint: "http://central-backend:4317"

When to Use OTLP vs Prometheus for Self-Telemetry

| Scenario | Recommendation |
| --- | --- |
| Existing Prometheus/Grafana stack | Keep pull (:8888) and optionally add OTLP push |
| Backend supports OTLP metrics natively | Use OTLP push exclusively (simpler ops) |
| Air-gapped or strict network policy environments | Use the pull model (no outbound connection from the collector required) |
| Multi-collector fleet with centralized backend | OTLP push with service.instance.id for per-collector attribution |

Critical Metrics

Metric Naming Convention

Most collector metrics follow the pattern:

otelcol_{component}_{action}_{signal}

Examples:

  • otelcol_receiver_accepted_spans
  • otelcol_processor_dropped_metric_points
  • otelcol_exporter_send_failed_log_records

Golden Signals for Collectors

1. Throughput (Data Flow)

Metrics:

# Traces
otelcol_receiver_accepted_spans
otelcol_exporter_sent_spans

# Metrics
otelcol_receiver_accepted_metric_points
otelcol_exporter_sent_metric_points

# Logs
otelcol_receiver_accepted_log_records
otelcol_exporter_sent_log_records

Query (spans/second):

rate(otelcol_receiver_accepted_spans[1m])
rate(otelcol_exporter_sent_spans[1m])

What to look for:

  • ✅ Accepted ≈ Sent → Healthy pipeline
  • ⚠️ Accepted > Sent → Backpressure, queue filling
  • ❌ Accepted >> Sent → Data loss
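The accepted-versus-sent comparison can also be expressed as a single ratio (a sketch; if you run multiple receivers or exporters, keep the sum() aggregations so the series align):

sum(rate(otelcol_exporter_sent_spans[5m]))
  /
sum(rate(otelcol_receiver_accepted_spans[5m]))

Values near 1 indicate a healthy pipeline; a sustained value well below 1 means data is queuing or being dropped.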

2. Data Loss (Errors)

Metrics:

# Refused by receiver (backpressure)
otelcol_receiver_refused_spans

# Dropped by processor (filter/sampling)
otelcol_processor_dropped_spans

# Failed exports
otelcol_exporter_send_failed_spans

Query (dropped percentage):

100 * (
  rate(otelcol_processor_dropped_spans[1m]) +
  rate(otelcol_exporter_send_failed_spans[1m])
) / rate(otelcol_receiver_accepted_spans[1m])

Alert threshold: > 1% data loss

3. Resource Usage

Metrics:

# Memory
otelcol_process_memory_rss  # Resident Set Size (actual RAM usage)

# CPU
rate(otelcol_process_cpu_seconds_total[1m])

# Heap allocations (Go runtime)
otelcol_process_runtime_total_alloc_bytes

Query (memory usage as a percentage of node memory):

100 * otelcol_process_memory_rss / node_memory_MemTotal_bytes

Alert threshold: > 80% of the container memory limit (the alert rules below use container_spec_memory_limit_bytes for this)

4. Queue Health

Metrics:

# Current queue size
otelcol_exporter_queue_size

# Max queue capacity
otelcol_exporter_queue_capacity

Query (queue saturation percentage):

100 * otelcol_exporter_queue_size / otelcol_exporter_queue_capacity

Alert threshold: > 80% full

5. Export Performance

Metrics:

# Failed exports
otelcol_exporter_send_failed_spans
otelcol_exporter_send_failed_metric_points

# Enqueue failures (queue full)
otelcol_exporter_enqueue_failed_spans

Query (export failure rate):

rate(otelcol_exporter_send_failed_spans[1m])

Alert threshold: > 0 (any failures)


Health Checks

Health Check Extension

The health_check extension provides HTTP endpoints for Kubernetes probes.

Configuration:

extensions:
  health_check:
    endpoint: "0.0.0.0:13133"
    tls:
      ca_file: ""
      cert_file: ""
      key_file: ""
    path: "/"
    check_collector_pipeline:
      enabled: true
      interval: "5m"
      exporter_failure_threshold: 5

service:
  extensions: [health_check]
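To verify the endpoint manually (assuming the collector is running on the same host with the configuration above):

curl -i http://localhost:13133/

A 200 response means the collector considers itself healthy; example response bodies are shown in the Health Check Responses section below.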

Kubernetes Probes

apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
spec:
  template:
    spec:
      containers:
      - name: otel-collector
        ports:
        - containerPort: 13133
          name: health
        
        livenessProbe:
          httpGet:
            path: /
            port: 13133
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        
        readinessProbe:
          httpGet:
            path: /
            port: 13133
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3

Health Check Responses

Healthy:

HTTP/1.1 200 OK
Content-Type: application/json

{
  "status": "Server available",
  "uptime": "3h45m12s",
  "uptime_ns": 13512000000000
}

Unhealthy (exporter failing):

HTTP/1.1 503 Service Unavailable
Content-Type: application/json

{
  "status": "Server unavailable",
  "error": "exporters: otlp: failed to send data: connection refused"
}

Dashboards

monitoringartist Dashboards

The community-standard dashboards for OpenTelemetry Collector monitoring:

Repository: monitoringartist/opentelemetry-collector-monitoring

Grafana Dashboard ID: 15983

Import:

# Via Grafana UI
Dashboard → Import → ID: 15983

# Via ConfigMap
kubectl create configmap otel-collector-dashboard \
  --from-file=dashboard.json \
  -n observability
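If dashboards are loaded via the Grafana dashboard sidecar (as in kube-prometheus-stack deployments), the ConfigMap typically also needs the sidecar's discovery label; the exact label is configurable, but the common default is shown here:

kubectl label configmap otel-collector-dashboard \
  grafana_dashboard="1" \
  -n observability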

Key Dashboard Panels

1. Throughput Panel

Query:

sum(rate(otelcol_receiver_accepted_spans[1m])) by (receiver)
sum(rate(otelcol_exporter_sent_spans[1m])) by (exporter)

Visualization: Line graph
Purpose: Verify data is flowing through the pipeline

2. Data Loss Panel

Query:

sum(rate(otelcol_processor_dropped_spans[1m])) by (processor)
sum(rate(otelcol_exporter_send_failed_spans[1m])) by (exporter)

Visualization: Stacked area chart
Purpose: Identify sources of data loss

3. Memory Usage Panel

Query:

otelcol_process_memory_rss / 1024 / 1024  # Convert to MB

Visualization: Gauge + Line graph
Purpose: Detect memory leaks or approaching OOM

4. Queue Saturation Panel

Query:

100 * otelcol_exporter_queue_size / otelcol_exporter_queue_capacity

Visualization: Gauge (thresholds: 50% yellow, 80% red)
Purpose: Predict queue overflow

5. Export Latency Panel

Query:

histogram_quantile(0.99, rate(otelcol_exporter_send_latency_bucket[1m]))

Visualization: Heatmap
Purpose: Detect backend slowness


Alert Rules

Prometheus Alerting Rules

groups:
  - name: otel-collector
    interval: 30s
    rules:
      # Data loss alert
      - alert: OTelCollectorDataLoss
        expr: |
          rate(otelcol_exporter_send_failed_spans[1m]) > 0
        for: 5m
        labels:
          severity: critical
          component: otel-collector
        annotations:
          summary: "OpenTelemetry Collector is dropping data"
          description: "Collector {{ $labels.instance }} has failed to export {{ $value }} spans/second for 5 minutes."
      
      # High memory usage
      - alert: OTelCollectorHighMemory
        expr: |
          100 * otelcol_process_memory_rss / 
          (container_spec_memory_limit_bytes{pod=~"otel-collector.*"} > 0) > 80
        for: 10m
        labels:
          severity: warning
          component: otel-collector
        annotations:
          summary: "OpenTelemetry Collector memory usage high"
          description: "Collector {{ $labels.instance }} is using {{ $value }}% of its memory limit."
      
      # Queue saturation
      - alert: OTelCollectorQueueFull
        expr: |
          100 * otelcol_exporter_queue_size / otelcol_exporter_queue_capacity > 80
        for: 5m
        labels:
          severity: warning
          component: otel-collector
        annotations:
          summary: "OpenTelemetry Collector queue is filling up"
          description: "Collector {{ $labels.instance }} queue is {{ $value }}% full. Risk of data loss."
      
      # Collector down
      - alert: OTelCollectorDown
        expr: |
          up{job="otel-collector"} == 0
        for: 2m
        labels:
          severity: critical
          component: otel-collector
        annotations:
          summary: "OpenTelemetry Collector is down"
          description: "Collector {{ $labels.instance }} has been down for 2 minutes."
      
      # Receiver backpressure
      - alert: OTelCollectorBackpressure
        expr: |
          rate(otelcol_receiver_refused_spans[1m]) > 0
        for: 5m
        labels:
          severity: warning
          component: otel-collector
        annotations:
          summary: "OpenTelemetry Collector is applying backpressure"
          description: "Collector {{ $labels.instance }} is refusing {{ $value }} spans/second due to memory limits."
      
      # Export latency
      - alert: OTelCollectorHighExportLatency
        expr: |
          histogram_quantile(0.99, 
            rate(otelcol_exporter_send_latency_bucket[1m])
          ) > 5
        for: 10m
        labels:
          severity: warning
          component: otel-collector
        annotations:
          summary: "OpenTelemetry Collector export latency high"
          description: "Collector {{ $labels.instance }} p99 export latency is {{ $value }}s. Backend may be slow."

Alert Routing (Alertmanager)

route:
  receiver: default
  group_by: ['alertname', 'component']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  
  routes:
    - match:
        component: otel-collector
        severity: critical
      receiver: pagerduty
      continue: true
    
    - match:
        component: otel-collector
        severity: warning
      receiver: slack

receivers:
  - name: pagerduty
    pagerduty_configs:
      - service_key: '<pagerduty_key>'
  
  - name: slack
    slack_configs:
      - api_url: '<slack_webhook>'
        channel: '#observability-alerts'

Advanced Patterns

Self-Observability Pipeline

Configure the collector to push its own telemetry to a backend. Self-telemetry exporters are configured inline under service.telemetry; they do not reuse the exporters defined for the data pipelines:

service:
  telemetry:
    metrics:
      level: detailed
      readers:
        # Keep the Prometheus pull endpoint for existing scrapers
        - pull:
            exporter:
              prometheus:
                host: "0.0.0.0"
                port: 8888

        # Export collector metrics via OTLP
        - periodic:
            interval: 30000            # push interval in milliseconds (30s)
            exporter:
              otlp:
                protocol: grpc
                endpoint: backend.example.com:4317

Distributed Tracing of the Collector

Enable tracing for the collector itself:

service:
  telemetry:
    traces:
      processors:
        - batch:
            exporter:
              otlp:
                protocol: grpc
                endpoint: backend.example.com:4317

Use case: Debug collector performance issues by tracing its internal operations.


Troubleshooting Checklist

Collector Not Accepting Data

Symptoms: Upstream services fail to send data, HTTP 503 errors

Check:

rate(otelcol_receiver_refused_spans[1m])

Causes:

  • Memory limiter triggered (check otelcol_process_memory_rss; a memory_limiter sketch follows this list)
  • Receiver port not exposed (check Kubernetes Service)
  • Network policy blocking traffic
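If the memory limiter is the culprit, confirm it is configured and sized sensibly. A minimal memory_limiter sketch (the limits and pipeline names are illustrative; tune limit_mib to roughly 80% of the container memory limit):

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1600        # hard limit, ~80% of a 2 GiB container
    spike_limit_mib: 400   # headroom for short bursts

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]   # memory_limiter should run first
      exporters: [otlp/backend]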

Data Not Reaching Backend

Symptoms: Collector accepts data, but backend shows no traces

Check:

rate(otelcol_exporter_sent_spans[1m])
rate(otelcol_exporter_send_failed_spans[1m])

Causes:

  • Exporter misconfigured (wrong endpoint, no TLS)
  • Backend down (check export failures)
  • Queue full (check otelcol_exporter_queue_size)
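To narrow down which of these applies, one option is to temporarily add the debug exporter alongside the real one so payloads are printed to the collector's stdout (a sketch; remove it once the issue is found):

exporters:
  debug:
    verbosity: detailed        # prints full telemetry content to stdout

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/backend, debug]   # fan out to both exporters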

High Memory Usage

Symptoms: Collector OOMKilled, restarts frequently

Check:

otelcol_process_memory_rss
otelcol_exporter_queue_size

Causes:

  • No memory_limiter processor
  • Queue too large (queue_size set higher than available memory allows)
  • High throughput without batching

Incomplete Traces (Tail Sampling)

Symptoms: Traces missing spans, incorrect sampling decisions

Check:

  • Verify the loadbalancing exporter is using routing_key: traceID (see the sketch after this list)
  • Check Headless Service is returning pod IPs, not VIP
  • Verify decision_wait is long enough for trace completion
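A minimal loadbalancing exporter sketch for the first check above (the headless Service hostname is illustrative; the DNS resolver must point at a headless Service so each tail-sampling pod receives a consistent share of trace IDs):

exporters:
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true       # adjust TLS for your environment
    resolver:
      dns:
        hostname: otel-sampler-headless.observability.svc.cluster.local
        port: 4317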

Summary

  • ✅ Expose metrics on port 8888 for Prometheus scraping, or use OTLP push via service.telemetry.metrics.readers for unified self-telemetry
  • ✅ Monitor throughput: accepted vs sent spans/metrics/logs
  • ✅ Alert on data loss: otelcol_exporter_send_failed_spans > 0
  • ✅ Track memory usage: set alerts at 80% of the limit
  • ✅ Watch queue saturation: alert when > 80% full (indicates a downstream bottleneck)
  • ✅ Use health checks: configure Kubernetes liveness/readiness probes
  • ✅ Deploy dashboards: use monitoringartist/opentelemetry-collector-monitoring
  • ✅ Self-observe: send collector metrics to the same backend with service.instance.id

Meta-monitoring is not optional—it's the safety net that prevents silent data loss in production observability systems.
