coralogix/opentelemetry-skills

OpenTelemetry Collector deployment, instrumentation (Java/Python/Node.js/.NET/Go), and OTTL pipeline transforms for Coralogix — coralogix exporter config, Helm chart selection, Kubernetes topology, ECS/EKS/GKE deployments, SDK setup, APM transactions, and OTTL cardinality/PII/routing.

skills/opentelemetry/opentelemetry-collector/references/ops-troubleshooting.md

Troubleshooting: symptom → root cause

Maps recurring user symptoms to the actual root cause and the first action to take. Most issues come down to pipeline placement, exporter config, or an upstream connectivity/IAM problem masquerading as a config issue.

Contents

  • Operating rule: confirm this is a collector problem
  • Symptom to root-cause table
  • Debugging workflow
  • Collector self-telemetry metrics
  • Debug exporter isolation
  • High-signal collector logs
  • Scope checks and key facts

Operating rule: is the problem even the OTel Collector?

Before editing pipelines, confirm the symptom actually belongs to the collector. "No data reaching Coralogix" is often a DNS / TLS / proxy / domain / API-key problem — not a pipeline problem. If the user has added a transform processor to "force data through" and it made no difference, the root cause is upstream. Say so clearly before continuing.

Symptom → Root cause table

| Symptom | Most likely root cause | First action |
| --- | --- | --- |
| No data of any kind reaching Coralogix | `endpoint:` used instead of `domain:`; full URL used instead of bare hostname; unbracketed `$CORALOGIX_PRIVATE_KEY` failing to expand in exporter v0.76+; or a UI hostname used as `domain:` instead of the data-ingestion hostname | Verify `domain:` is a bare `<region>.coralogix.com` hostname (see core for the regions and UI-vs-ingestion rules). Use `private_key: "${env:CORALOGIX_PRIVATE_KEY}"` (bracketed). Check egress/TLS/proxy to `*.coralogix.com:443`. |
| Collector starts, pipelines load, but exporter logs show `rpc error: code = ResourceExhausted` or `code = PermissionDenied` | Rate limit / quota / wrong API key scope | Check ingress quota on the Coralogix side; verify the `PRIVATE_KEY` is a Send-Your-Data key (see core for key categories). |
| Resource Catalog / Infrastructure Explorer is empty despite K8s deployment running fine | Missing `kubernetesResources` preset; or using the default `coralogix` exporter instead of `coralogix/resource_catalog`; or missing `x-coralogix-ingress: metadata-as-otlp-logs/v1` header | Enable `opentelemetry-cluster-collector.presets.kubernetesResources.enabled: true`. Ensure the dedicated `coralogix/resource_catalog` exporter has the ingress header. See preset-kubernetes.md. |
| APM shows traces but no error rates / p99 latencies | `spanmetrics` configured on the gateway (after `tail_sampling`) instead of the agent, so metrics only see sampled spans | Move the `spanmetrics` connector to the agent, upstream of any sampling. The `transactions` processor must also be upstream of `spanmetrics`. |
| Traces are incomplete across services: some spans per trace present, others missing | Tail sampling running on daemonset agents (each agent only sees its node's spans); or multiple gateway replicas without a consistent-hashing load balancer | Put `tail_sampling` on a central gateway. Feed it via the `loadbalancing` exporter with routing by `trace_id` so all spans for one trace land on the same gateway replica. |
| spanMetrics values look double-counted | Same `spanmetrics` connector running on both agent and gateway, so metrics are emitted twice | Enable `spanmetrics` on the agent only. Remove any `spanmetrics` connector from gateway/cluster-collector. |
| `k8sattributes` fires full extraction in 3 collectors, API server at high QPS | Triple-duplication anti-pattern: same full `extract.metadata` block in daemonset, cluster-collector, and gateway | Keep `passthrough: true` on daemonset agents. Run full extraction in one role only (usually the gateway). |
| Collector refuses to start: `cannot start pipelines: failed to start "resourcedetection/resource_catalog" processor: can't get K8s Instance Metadata; node name is empty` | Resource Catalog detector running on a pod that is not part of the cluster-collector deployment (e.g. daemonset agent where node-name injection fails) | Keep `resourcedetection/resource_catalog` on the cluster-collector only. Ensure the `K8S_NODE_NAME` env is set from `spec.nodeName` on the cluster-collector pod spec. |
| `otelcol_processor_tail_sampling_sampling_trace_dropped_too_early_total` climbing | Not enough gateway replicas / `num_traces` buffer too small / `decision_wait` too short for the trace duration | Scale gateway replicas. Increase `num_traces` (trace buffer) and `decision_wait` (how long to hold a trace before deciding). Check that spans for one trace actually reach the same replica. |
| Duplicate metric series or partial resets | Single-writer principle violated: multiple collectors emitting the same metric with different resource attributes | Add `resourcedetection` (env/ec2/system/host) to the metrics pipeline and deduplicate the `resource.attributes` set. Each unique series should have exactly one writer. |
| Exporter logs `INVALID_ARGUMENT` on send, but with no upstream body/data error | OTTL problem (transform/filter): wrong context, missing nil guard | See the OTTL skill's Error Decoder. `INVALID_ARGUMENT` here is a pipeline processing error before export. |
| `Exporting failed. Dropping data.` with `context deadline exceeded` | Coralogix ingress slow, or client-side backpressure / timeout too aggressive | Increase `coralogix.timeout` (default 5s) to 30–120s. Add `sending_queue.enabled: true` + `storage: file_storage` for durable buffering. Investigate ingress latency on the Coralogix side. |
| `opentelemetry-cluster-collector` not reporting `cx.agent.type: cluster-collector` | Values override broke the `resource/metadata` processor; or operator wiped the default `service.pipelines` | Restore default `service.pipelines` from chart defaults. Prefer the `extraProcessors` hook over wholesale pipeline replacement. See setup-kubernetes.md "Don't override pipelines wholesale." |
| APM gateways consume massive memory, span drops in logs | Gateway sized for steady state; spike from burst traffic + `tail_sampling` buffer (`num_traces`) inflation | Size gateway memory for: `num_traces` × avg spans/trace × avg span size. Reduce `decision_wait` if traces are short-lived. |
| Windows collector CPU spikes during peak IIS traffic | Not the collector: Windows Defender scanning rotated IIS log files | Open Resource Monitor → Disk. If `msmpeng.exe` is the top consumer on the IIS log path, exclude that path from Defender real-time scan. See setup-windows-standalone.md. |
| UI-generated Linux host config produces traces without language icons | Host wizard strips `telemetry.sdk.*` resource attributes | Workaround: a `transform` processor re-adds them on the traces pipeline until the wizard fix lands. See setup-linux-standalone.md. |
| `memory_limiter` fires constantly, memory doesn't seem to drop after GC | Go heap vs kernel page cache disconnect: `kubectl top` shows container RSS including page cache from hostPath `/var/log/pods` mounts | Check Go `HeapAlloc` via the collector's self-telemetry. If Go `Sys` is significantly below RSS, the delta is kernel page cache (not a leak). See ops-memory-performance.md. |
| ECS Fargate sidecar: logs lost when app container crashes | App is `essential: true`, sidecar is `essential: false`, so ECS kills the task including the sidecar before buffered logs drain | Add `healthCheck` to the sidecar + `dependsOn: [{containerName: otel-collector, condition: HEALTHY}]` on the app. Use the CDOT image (includes the `/healthcheck` binary). See setup-ecs-fargate.md. |
| ECS EC2 daemonset: every log row attributed to the collector's container | `ecs` detector in `resourcedetection` stamps the collector's own container ID onto all records | Remove `ecs` from `resourcedetection.detectors` on daemonset-mode collectors. Use the `ecsattributes/container-logs` CDOT processor for per-container attribution. See setup-ecs-ec2.md. |
| Fleet Manager says chart version applied, but user's values.yaml changes don't appear | `presets.fleetManagement.supervisor.enabled: true`, so the Fleet Manager UI overrides values.yaml config at runtime | Explain the precedence model: with supervisor enabled, config comes from Fleet Manager, not the Helm values. Edit in the UI or disable supervisor. See preset-fleet-management.md. |
| Collector v0.142+ crashes with `CrashLoopBackOff` on older otel-integration chart versions | Upstream collector contains a breaking change not yet absorbed by the chart | Pin the collector image version (chart `image.tag`) or bump the whole chart to a matching release. Don't mix a new-image-tag override against an old chart. |
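
The exporter-related rows above (bare `domain`, bracketed key, longer timeout, durable queue) can be pictured with a minimal sketch. This is an illustrative fragment, not the chart's actual defaults: the storage directory, `application_name`, `subsystem_name`, and the 30s timeout are placeholder assumptions, `<region>` must be replaced with a real ingestion region, and the `otlp` receiver and `memory_limiter` / `batch` processors are assumed to be defined elsewhere in the config.

```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/file_storage      # assumed path; must exist and be writable

exporters:
  coralogix:
    domain: "<region>.coralogix.com"              # bare ingestion hostname: no scheme, no path, not endpoint:
    private_key: "${env:CORALOGIX_PRIVATE_KEY}"   # bracketed env syntax
    application_name: "my-app"                    # placeholder
    subsystem_name: "my-subsystem"                # placeholder
    timeout: 30s                                  # raised from the 5s default named in the table
    sending_queue:
      enabled: true
      storage: file_storage                       # persist the queue through the extension above

service:
  extensions: [file_storage]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [coralogix]
```

If data still does not arrive with a config shaped like this, the remaining suspects are the ones the first row lists: egress, TLS, proxy, or the key itself.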

Debugging workflow

When the report is just "it's broken," work through these steps in order before editing pipelines:

1. Enable the collector's self-telemetry

If the collector isn't already emitting its own metrics, turn it on temporarily:

```yaml
service:
  telemetry:
    metrics:
      level: detailed
      readers:
        - periodic:
            interval: 30000                        # milliseconds (30 s)
            exporter:
              otlp:
                protocol: grpc
                endpoint: "http://localhost:4317"  # assumes an OTLP gRPC receiver listening locally
    logs:
      level: info
```

Metrics to read first:

| Metric | Tells you |
| --- | --- |
| `otelcol_receiver_accepted_*` vs `_refused_*` | whether data is entering at all |
| `otelcol_processor_batch_batch_send_size` | batch sizing; near-empty batches mean low volume or upstream drop |
| `otelcol_processor_memory_limiter_*` | back-pressure events |
| `otelcol_exporter_send_failed_*` | export failures by signal |
| `otelcol_exporter_queue_capacity` / `_queue_size` | `sending_queue` depth |
| `otelcol_processor_tail_sampling_sampling_trace_dropped_too_early_total` | gateway under-sized / wrong routing |
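
If pushing self-telemetry over OTLP is inconvenient during a quick diagnosis, one alternative is a Prometheus pull reader, so the `otelcol_*` metrics can be scraped or fetched directly from the collector. The fragment below is a sketch under assumptions: port 8888 is the collector's conventional self-metrics port, and binding to `0.0.0.0` is a choice made here for illustration (use `localhost` if only local access is needed).

```yaml
service:
  telemetry:
    metrics:
      level: detailed
      readers:
        - pull:
            exporter:
              prometheus:
                host: "0.0.0.0"   # assumption: reachable from outside the pod/host
                port: 8888        # conventional self-metrics port
```

With this in place, the metrics in the table can be read from the `/metrics` path on that port without any extra pipeline.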

2. Add a debug exporter, mirror the problematic signal

```yaml
exporters:
  debug:
    verbosity: detailed
    sampling_initial: 5        # log the first 5 messages each second
    sampling_thereafter: 200   # then log every 200th message

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [coralogix, debug]    # mirror
```

If debug sees the data but coralogix doesn't, the problem is the exporter / connectivity. If neither sees it, the problem is upstream (receiver, source app, network).

In production, enable verbosity: detailed only for short diagnostic windows; it is expensive, especially on Windows.

3. Read the collector logs

Typical high-signal log lines:

  • `Exporting failed. Dropping data.` + `rpc error: code = X`: exporter connectivity / auth / quota. Cross-check against domain, private_key, egress.
  • `INVALID_ARGUMENT`: OTTL problem in transform/filter. Switch to the OTTL skill.
  • `can't get K8s Instance Metadata; node name is empty`: resourcedetection/resource_catalog on the wrong collector type (see table above).
  • `cannot start pipelines: failed to start "X" processor`: processor on the wrong collector role, or missing env/RBAC.
  • `denied by autogke-no-write-mode-hostpath`: GKE Autopilot Warden, see setup-kubernetes.md.
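
For the "node name is empty" case, the node name is normally injected through the Kubernetes downward API. The fragment below sketches what that looks like on a raw pod spec; the container name is hypothetical, and when deploying through the Helm chart the variable may already be set by the chart rather than by hand.

```yaml
# Pod spec fragment for the cluster-collector (sketch)
spec:
  containers:
    - name: opentelemetry-collector   # hypothetical container name
      env:
        - name: K8S_NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName   # downward API: name of the node the pod runs on
```

If the variable is present but the error persists, check that the processor is not also running on a collector role where it does not belong.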

4. Quickly check whether the issue belongs to a different skill or reference

  • OTTL issues → OTTL skill.
  • Fleet Management config precedence → preset-fleet-management.md.
  • Memory / page cache → ops-memory-performance.md.

5. Only now, touch the pipeline

When you do edit pipelines, change one thing at a time and watch the self-telemetry. Batching two changes in a single apply wastes the diagnostic budget.

Key Facts

  • Most "no data" problems are connectivity, not pipeline. Check domain/key/egress first.
  • coralogix domain field is a bare hostname. Not a URL, not endpoint:.
  • Enable self-telemetry before editing anything. otelcol_* metrics are the highest-signal diagnostic.
  • Mirror suspect signals through a debug exporter to isolate where data drops.
  • single-writer violations cause duplicate metrics and partial resets. Deduplicate resource attributes across collectors.
  • If memory_limiter fires chronically, look upstream — cardinality/volume — before sizing up the collector.
  • APM missing errors = spanmetrics placement. Move to agent, upstream of sampling.
  • Incomplete cross-service traces = tail_sampling placement + loadbalancing routing. Central gateway, consistent_hashing by trace_id.
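
The last two facts describe a placement pattern that is easier to see as config. The two-document sketch below is illustrative only: the gateway hostname, the plaintext in-cluster TLS setting, the sampling policies, and the `decision_wait` / `num_traces` values are assumptions, and the `otlp` receiver, `coralogix` exporter, `memory_limiter`, and `batch` are assumed to be defined elsewhere. The `transactions` processor mentioned in the table is omitted and would sit upstream of `spanmetrics`.

```yaml
# Agent (daemonset): spanmetrics before any sampling, then route by trace ID
connectors:
  spanmetrics: {}                 # default settings

exporters:
  loadbalancing:
    routing_key: traceID          # all spans of one trace reach the same gateway replica
    protocol:
      otlp:
        tls:
          insecure: true          # assumption: plaintext inside the cluster
    resolver:
      dns:
        hostname: otel-gateway-headless.observability.svc.cluster.local  # hypothetical headless service
        port: 4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [spanmetrics, loadbalancing]
    metrics/spanmetrics:
      receivers: [spanmetrics]
      processors: [memory_limiter, batch]
      exporters: [coralogix]
---
# Gateway: the only place tail_sampling runs; no spanmetrics here
processors:
  tail_sampling:
    decision_wait: 30s            # illustrative; must exceed typical trace duration
    num_traces: 100000            # illustrative buffer size
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: sample-the-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [coralogix]
```

Keeping span metrics on the agent means they are computed from unsampled spans, while routing by trace ID lets each gateway replica make sampling decisions on complete traces.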
