Expert OpenTelemetry guidance for collector configuration, pipeline design, and production telemetry instrumentation. Use when configuring collectors, designing pipelines, instrumenting applications, implementing sampling, managing cardinality, securing telemetry, writing OTTL transformations, or setting up AI coding agent observability (Claude Code, Codex, Gemini CLI, GitHub Copilot).
The OpenTelemetry Collector is a vendor-agnostic telemetry pipeline that receives, processes, and exports observability data. This reference provides deep technical guidance on pipeline anatomy, processor ordering, memory management, and production stability patterns.
A collector pipeline consists of three stages, with an optional fourth component — Connectors — that bridge two pipelines:
```
Receivers → Processors → Exporters
                ↓
           [Connector]  ← acts as Exporter on source pipeline
                ↓
           [Connector]  ← acts as Receiver on destination pipeline
                ↓
Receivers → Processors → Exporters
```

For simple single-pipeline flows:

```
Receivers → Processors → Exporters
```

Receivers are the entry points for data. They listen on network endpoints or pull data from sources.
Common receivers:

- `otlp`: Receives OTLP (gRPC/HTTP) - use this by default
- `prometheus`: Scrapes Prometheus metrics
- `jaeger`: Receives Jaeger traces (legacy)
- `zipkin`: Receives Zipkin traces (legacy)
- `filelog`: Reads log files from disk
- `hostmetrics`: Collects host-level metrics (CPU, memory, disk)
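For illustration, a minimal `otlp` receiver listening on the conventional default ports:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"  # standard OTLP/gRPC port
      http:
        endpoint: "0.0.0.0:4318"  # standard OTLP/HTTP port
```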
Processors transform, filter, enrich, or drop data. They execute in order.

Critical processors:

- `memory_limiter`: Must be first - prevents OOM
- `batch`: Should be near the end - reduces network calls
- `k8sattributes`: Enriches with K8s metadata
- `transform`: Applies OTTL transformations
- `filter`: Drops spans/metrics based on conditions
- `tail_sampling`: Intelligent sampling decisions (stateful). ⚠️ `tail_sampling` and `spanmetrics` require sticky routing (routing all spans of a trace to the same collector instance); pair with the `loadbalancing` exporter using deterministic routing keys (e.g., traceID) to preserve stickiness.
- `attributes`: Adds/removes/hashes attributes
- `resource`: Modifies resource attributes
Exporters send data to backends.

Common exporters:

- `otlp`: Exports to OTLP-compatible backends
- `prometheus`: Exposes metrics for Prometheus scraping
- `jaeger`: Exports to Jaeger (legacy)
- `loadbalancing`: Routes to multiple backends with consistent hashing. ⚠️ `routing_key` must be a stable, deterministic string (e.g., traceID, tenant_id, cluster); convert non-string routing attributes to normalized strings before hashing to avoid shard churn and ensure even load distribution.
- `logging`: Outputs to stdout (debug only)
- `file`: Writes to disk (debug only)
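A minimal sketch of the `loadbalancing` exporter with trace-ID stickiness; the DNS hostname is a placeholder for your gateway headless service:

```yaml
exporters:
  loadbalancing:
    routing_key: traceID  # all spans of a trace hash to the same backend
    protocol:
      otlp:
        tls:
          insecure: false
    resolver:
      dns:
        hostname: otel-gateways.observability.svc.cluster.local  # placeholder
        port: 4317
```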
Connectors bridge two pipelines by acting simultaneously as an exporter on the source pipeline and a receiver on the destination pipeline. They enable cross-pipeline signal routing and aggregation (e.g., generating metrics from traces) without external tools.

Key connectors:

- `spanmetrics`: Generates R.E.D. metrics (Rate, Errors, Duration) from trace spans — Beta
- `servicegraph`: Builds service dependency graph metrics from traces — Beta
- `routing`: Routes signals to different pipelines based on attribute values — Alpha
- `failover`: Automatic failover between pipelines on errors — Alpha
- `count`: Counts signals as metrics — Alpha
- `signaltometrics`: Converts any signal to metrics via OTTL expressions — Alpha

See connectors.md for full configuration examples and patterns.
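To make the exporter/receiver duality concrete, a sketch wiring the `spanmetrics` connector from a traces pipeline into a metrics pipeline (the bucket boundaries are illustrative):

```yaml
connectors:
  spanmetrics:
    histogram:
      explicit:
        buckets: [100ms, 1s, 10s]  # illustrative latency buckets

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [spanmetrics]  # connector acts as exporter here
    metrics:
      receivers: [spanmetrics]  # ...and as receiver here
      processors: [batch]
      exporters: [otlp]
```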
```yaml
service:
  pipelines:
    traces:
      receivers: [otlp, jaeger]
      processors: [memory_limiter, batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch]
      exporters: [otlp]
    logs:
      receivers: [otlp, filelog]
      processors: [memory_limiter, k8sattributes, batch]
      exporters: [otlp]
```

Key Rule: Each pipeline type (traces, metrics, logs) is independent.
The opentelemetry-collector (core) contains stable, vendor-neutral components:
- Receivers: `otlp`, `prometheus`
- Processors: `batch`, `memory_limiter`
- Exporters: `otlp`, `logging`

Stability: Production-ready
The opentelemetry-collector-contrib contains extended components:
- Processors: `tail_sampling`, `transform`, `k8sattributes`
- Exporters: `awsxray`, `googlecloud`, `azuremonitor`
- Receivers: `filelog`, `kafkareceiver`, `sqlquery`

Stability: Varies (Alpha/Beta/Stable)
⚠️ Always verify component stability before production use (see the stability reference below).
For production, use the OpenTelemetry Collector Builder (OCB) to create lean binaries:
```yaml
# builder-config.yaml
dist:
  name: otelcol-custom
  description: Custom OpenTelemetry Collector
  output_path: ./dist

receivers:
  - gomod: go.opentelemetry.io/collector/receiver/otlpreceiver v0.100.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/filelogreceiver v0.100.0

processors:
  - gomod: go.opentelemetry.io/collector/processor/batchprocessor v0.100.0
  - gomod: go.opentelemetry.io/collector/processor/memorylimiterprocessor v0.100.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/processor/k8sattributesprocessor v0.100.0

exporters:
  - gomod: go.opentelemetry.io/collector/exporter/otlpexporter v0.100.0
```

Benefits:
✅ Smaller binary size (50-100 MB vs 500+ MB)
✅ Reduced attack surface
✅ Only include components you actually use
Starting with Collector v0.146.0 (released February 2025), the minimum required Go version is Go 1.25. This is a breaking change for any custom collector builds or OCB-based distributions compiled with an older Go toolchain. Upgrade your Go toolchain to 1.25+ before building or upgrading to v0.146.0+.
Reference: Collector v0.146.0 release notes
cmd/builder — New init Subcommand (Experimental)

The cmd/builder (OCB) tool introduced an experimental `init` subcommand in v0.146.0 that scaffolds a new custom collector project with a starter builder-config.yaml and directory structure:

```bash
ocb init --output-path ./my-collector
```

This is useful for bootstrapping a new custom distribution without manually writing configuration from scratch.
The order of processors in the pipeline is not arbitrary. Incorrect ordering leads to OOM kills, wasted CPU, and data integrity issues.
| Order | Processor | Function | Criticality | Rationale |
|---|---|---|---|---|
| 1 | memory_limiter | Prevents OOM | Critical | Must be first. If placed later, data has already consumed heap space before the limiter checks. Placing it first enables backpressure to receivers. |
| 2 | extensions (auth) | Validates access | High | Reject unauthorized traffic immediately, before spending CPU on processing. |
| 3 | sampling (head) | Reduces volume | High | If using probabilistic sampling, do it early. Dropping 90% of traces saves CPU on subsequent processors. |
| 4 | k8sattributes | Enriches metadata | Medium | Adds context (Pod, Namespace, Node) needed for filtering and routing in later steps. Requires RBAC permissions. |
| 5 | transform / filter | Modifies/drops data | Medium | Apply OTTL transformations to scrub, rename, or drop specific spans/metrics. |
| 6 | redaction / attributes | Sanitizes PII | Critical (Compliance) | Must happen before batching or exporting to ensure sensitive data never leaves the collector. |
| 7 | batch | Optimizes network | High | Compresses data into chunks. Must be near the end. If placed before filtering, the batcher processes data that is eventually discarded. |
```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 20

  k8sattributes:
    auth_type: "serviceAccount"
    extract:
      metadata:
        - k8s.pod.name
        - k8s.namespace.name
        - k8s.node.name

  filter:
    traces:
      span:
        - 'attributes["url.path"] == "/health"'  # Drop health checks

  attributes:
    actions:
      - key: credit_card
        action: delete  # PII redaction

  batch:
    timeout: 10s
    send_batch_size: 1024

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, k8sattributes, filter, attributes, batch]
      exporters: [otlp]
```

The OpenTelemetry Collector Contrib repository contains extended components and curated example configurations. Always verify component stability and pin to released versions.
⚠️ Check stability before production use: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/VERSIONING.md
Each component directory contains a README with configuration examples, a stability badge, and usage guidance.
| Component | Purpose | Stability | Link |
|---|---|---|---|
| transformprocessor | Apply OTTL transformations | Stable | Docs |
| filterprocessor | Drop spans/metrics based on conditions | Stable | Docs |
| k8sattributesprocessor | Enrich with Kubernetes metadata | Beta | Docs |
| tailsamplingprocessor | Intelligent sampling decisions | Beta | Docs |
| filelogreceiver | Read logs from disk | Beta | Docs |
| pprofreceiver | Receive pprof-formatted profiles | Alpha | Docs |
| loadbalancingexporter | Route to multiple backends with consistent hashing | Beta | Docs |
| resourcedetectionprocessor | Detect and attach resource attributes (cloud, host, K8s) | Beta | Docs |
| prometheusremotewriteexporter | Export metrics via Prometheus Remote Write | Beta | Docs |
> ⚠️ Prometheus Remote Write — InstrumentationScope attributes not exported: The `prometheusremotewriteexporter` does not include `otel.scope.name`/`otel.scope.version` as Prometheus labels by default (#45266). If downstream consumers need to distinguish metrics by instrumentation scope, use the native `otlp` exporter instead, or enrich the metric resource/data-point attributes before export using a `transform` processor.

> ⚠️ Profiles signal is public Alpha: OpenTelemetry Profiles entered public Alpha in Collector v0.148.0+. Use it for evaluation and early integration work, not critical production commitments yet. The current practical path is the `pprofreceiver` plus normal collector enrichment/transform processors (for example, `k8sattributes` and OTTL), and you should verify that your backend can ingest OTLP Profiles before standardizing on it.
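For the Prometheus Remote Write caveat above, a sketch of the `transform` processor workaround: copy the scope identity onto data-point attributes so it survives as labels. The attribute names `otel_scope_name`/`otel_scope_version` are illustrative choices, not a standard:

```yaml
processors:
  transform:
    metric_statements:
      - context: datapoint
        statements:
          # copy instrumentation scope identity into data-point attributes
          - set(attributes["otel_scope_name"], instrumentation_scope.name)
          - set(attributes["otel_scope_version"], instrumentation_scope.version)
```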
Main examples directory: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/examples
Browse the examples/ directory for curated collector configurations. Common use cases include:
| Use Case | Example Type | Description |
|---|---|---|
| Gateway with tail sampling | Gateway deployment | Stateful sampling across traces, requires consistent routing (e.g., via loadbalancing exporter) |
| Kubernetes node agents | Agent/DaemonSet | Lightweight per-node collectors, hostmetrics, log collection |
| Log collection from files | Filelog receiver | Parse and enrich logs from disk, multiline support |
| K8s metadata enrichment | k8sattributes processor | Add pod/namespace/node attributes to telemetry |
| Basic debugging | Logging exporter | Output telemetry to stdout for troubleshooting |
Best Practice: Pin to released tags (e.g., v0.100.0+) matching your collector version instead of using main branch. This ensures production stability and avoids unexpected breaking changes.
Run `otelcol-contrib validate --config config.yaml` to catch deprecated/invalid settings before deployment.

Common anti-patterns:

❌ Batch before filter: Wastes memory batching data that will be dropped
❌ Memory limiter not first: Limiter checks after data is already in memory
❌ Redaction after export: Sensitive data has already left the collector
❌ Sampling after enrichment: Wasted CPU adding attributes to dropped spans
The memory_limiter is the single most important processor for collector stability.
- When usage exceeds `limit - spike_limit`, the collector stops accepting new data (applies backpressure)
- When usage exceeds `limit`, the collector forces garbage collection and drops data

```yaml
processors:
  memory_limiter:
    check_interval: 1s          # How often to check (1s recommended)
    limit_mib: 1800             # Hard limit in MiB
    spike_limit_mib: 300        # Buffer for spikes (typically 15-20% of limit)
    limit_percentage: 80        # Alternative: percentage of total memory
    spike_limit_percentage: 20  # Alternative: spike as percentage
```

For containerized deployments:
Example:
```yaml
# Kubernetes container spec
resources:
  limits:
    memory: 2Gi  # 2048 MiB
```

```yaml
# Collector config
processors:
  memory_limiter:
    limit_mib: 1800       # 2048 - 248 (reserve)
    spike_limit_mib: 360  # 20% buffer
    check_interval: 1s
```

Or, percentage-based:

```yaml
processors:
  memory_limiter:
    limit_percentage: 80        # Use 80% of total memory
    spike_limit_percentage: 20  # 20% buffer for bursts
    check_interval: 1s
```

Why 80%?: Leaves headroom for Go runtime overhead, internal buffers, and allocation spikes.
When the limiter triggers:
- Receivers reject new data with `RESOURCE_EXHAUSTED` (HTTP 503)

Key Point: Backpressure is not data loss—it's intelligent rate limiting.
For systems >10k RPS:
- Reduce `check_interval` to 500ms for faster reaction
- Increase `spike_limit_percentage` to 25% to handle bursts
- Monitor the `otelcol_processor_refused_spans` metric
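Put together, a sketch of those high-throughput settings:

```yaml
processors:
  memory_limiter:
    check_interval: 500ms       # react faster at high ingest rates
    limit_percentage: 80
    spike_limit_percentage: 25  # larger buffer for bursts
```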
By default, exporters use in-memory queues. If the backend is down and the queue fills, data is dropped.

```
Backend outage → Exporter queue fills → New data is dropped
```

The `file_storage` extension persists queue data to disk (Write-Ahead Log).
```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/file_storage
    timeout: 1s
    compaction:
      on_start: true     # Clean up on startup
      on_rebound: false  # ⚠️ Keep false — bbolt v1.4.3 nil pointer crash risk (otelcol-contrib#46489)
      directory: /tmp/otel_compaction
      max_transaction_size: 65536  # 64 KiB chunks

exporters:
  otlp:
    endpoint: backend.example.com:4317
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000       # Max batches (not spans)
      storage: file_storage  # Reference to extension
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 5m

service:
  extensions: [file_storage]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
```

Storage directory: `/var/lib/otelcol/file_storage`

The disk space required depends on:

- Ingest rate (spans/sec)
- Average span size (KB)
- Expected outage duration to buffer (seconds)
Formula:
```
Disk Space (GB) = (Spans/sec × Span Size KB × Downtime Seconds) / 1,000,000
```

Worked example: at 10,000 spans/sec with ~1.4 KB spans, buffering a 1-hour outage needs roughly 10,000 × 1.4 × 3,600 / 1,000,000 ≈ 50 GB, which matches the 50Gi PVC below.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: otel-gateway-storage
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi  # Size for 1-hour buffer at 10k RPS
  storageClassName: gp3  # AWS: gp3, GCP: pd-ssd
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-gateway
spec:
  template:
    spec:
      containers:
        - name: otel-collector
          volumeMounts:
            - name: storage
              mountPath: /var/lib/otelcol
      volumes:
        - name: storage
          persistentVolumeClaim:
            claimName: otel-gateway-storage
```

Watch these metrics:
- `otelcol_exporter_queue_size`: Current queue depth
- `otelcol_exporter_queue_capacity`: Max queue size
- `otelcol_exporter_send_failed_spans`: Failed exports (triggers disk writes)

Alert: `queue_size / queue_capacity > 0.8` → Backend is struggling
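One way to encode that alert, sketched as a Prometheus alerting rule (the group name, alert name, and 5m window are illustrative):

```yaml
groups:
  - name: otel-collector
    rules:
      - alert: OtelExporterQueueNearFull
        # queue depth relative to capacity, per exporter label
        expr: otelcol_exporter_queue_size / otelcol_exporter_queue_capacity > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Collector exporter queue >80% full; backend may be struggling"
```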
otelcol_exporter_send_failed — Error Detail Attributes (v0.146.0+)

When telemetry level is set to Detailed, the `otelcol_exporter_send_failed` metrics now include two additional attributes:
- `error.type`: The class of error (e.g., network timeout, HTTP 5xx, connection refused), useful for routing alerts to the right on-call team.
- `error.permanent`: Boolean indicating whether the failure is permanent (no retry will succeed) vs transient (retry may recover).

This allows operators to distinguish transient export failures (backend temporarily unavailable — safe to retry) from permanent failures (data format or auth errors — retrying wastes resources).
```yaml
service:
  telemetry:
    metrics:
      level: Detailed
```

⚠️ Disk space is not unlimited: The collector does not enforce a hard cap on disk usage in older versions. You must monitor disk usage yourself (e.g., `df -h /var/lib/otelcol`).

The file_storage extension uses bbolt (go.etcd.io/bbolt) as its storage engine. bbolt relies on mmap() for memory-mapped I/O and POSIX flock() for exclusive file locking. These kernel-level primitives have strict filesystem requirements that are NOT met by network or distributed filesystems. Using an incompatible filesystem can result in silent data corruption, crashes (SIGBUS/SIGSEGV), or split-brain locking failures with no error messages.
| Filesystem | Type | mmap Support | flock Support | Verdict |
|---|---|---|---|---|
| ext4 / xfs | Local | Full | Full | ✅ Supported |
| AWS EBS (gp3/io2) | Block device | Full | Full | ✅ Supported |
| GCP Persistent Disk | Block device | Full | Full | ✅ Supported |
| Azure Managed Disk | Block device | Full | Full | ✅ Supported |
| AWS EFS | NFS v4.1 | Partial | Advisory only | ❌ NOT Supported — risk of silent corruption |
| NFS v3/v4 | Network | Partial | Advisory only | ❌ NOT Supported — flock is advisory, not mandatory |
| SMB/CIFS | Network | Partial | No | ❌ NOT Supported |
| GlusterFS | Distributed | Partial | Varies | ❌ NOT Supported |
| CephFS | Distributed | Partial | Varies | ⚠️ Not recommended |
Known issue: nil pointer crash during on_rebound compaction (bbolt v1.4.3; affects file_storage users with compaction.on_rebound: true). Details below.

⚠️ accessModes: ReadWriteMany (RWX) volumes almost always imply a network filesystem and MUST NOT be used with file_storage. ReadWriteOnce (RWO) backed by a block device (EBS gp3, GCP pd-ssd, Azure Managed Disk) is the only supported configuration.
```yaml
spec:
  accessModes:
    - ReadWriteOnce        # RWX (ReadWriteMany) is NOT safe — implies NFS/EFS
  storageClassName: gp3    # AWS EBS gp3; use pd-ssd (GCP) or managed-premium (Azure)
  resources:
    requests:
      storage: 50Gi
```

Linux kernel versions 5.10–5.15 with ext4 fast-commit enabled can corrupt bbolt databases. Fixes were backported to 5.10.94+, 5.15.17+, 5.15.27+, and are included in 5.17+. If you are running a kernel in this range, verify your kernel patch level or disable fast-commit (`tune2fs -O ^fast_commit /dev/...`). See the bbolt README Known Issues section.
bbolt v1.4.3 has a known nil pointer panic when a database reopen fails during on_rebound compaction. This manifests as a collector crash (not a graceful shutdown) when the storage file becomes transiently unavailable at compaction time.
Mitigation: Do not set compaction.on_rebound: true in file_storage until this is resolved upstream. Use on_start: true only:
```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/file_storage
    timeout: 1s
    compaction:
      on_start: true     # ✅ Safe
      on_rebound: false  # ⚠️ Avoid with bbolt v1.4.3 — risk of nil pointer crash
      directory: /tmp/otel_compaction
```

Track upstream fix: otelcol-contrib#46489
Upgrading to bbolt v1.4.3 does not relax the mmap/flock filesystem requirements above. Keep using local block-backed volumes (ext4/xfs, EBS, PD, Managed Disk) and avoid NFS/EFS/SMB/CephFS for file_storage, even on the latest bbolt release.
The bbolt maintainers are tracking a security fix release request in bbolt#1187. Until a patched bbolt line is published and adopted by the collector distribution you run:
- Restrict the file_storage directory permissions to the collector user (0700) and avoid hostPath sharing

When the upstream patch lands, prefer collector builds that vendor the fixed bbolt version and remove temporary exception handling only after validation in staging.
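A Kubernetes sketch of that permissions guidance, assuming the collector runs as a dedicated non-root user (the UID/GID values are illustrative):

```yaml
# Pod-level security context: keep the file_storage volume private to the collector
securityContext:
  runAsNonRoot: true
  runAsUser: 10001
  runAsGroup: 10001
  fsGroup: 10001  # volume files owned by the collector's group
```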
The OTel resiliency model has three tiers:
- Tier 1: in-memory exporter queues (`sending_queue`)
- Tier 2: disk-backed persistent queues (`file_storage`)
- Tier 3: an external durability buffer (Kafka)

Kafka as a durability layer is the standard pattern for cross-AZ, cross-region, or high-throughput deployments where disk-based WAL is insufficient.
| Scenario | Recommended Tier | Reason |
|---|---|---|
| Single-region, short outages (<1h) | file_storage (Tier 2) | Simpler, lower ops overhead |
| Cross-AZ or cross-region hops | Kafka (Tier 3) | Survives collector crashes, node failures |
| Multi-datacenter fan-in | Kafka (Tier 3) | Decouples producer and consumer tiers |
| Throughput >50k spans/sec | Kafka (Tier 3) | Disk I/O limits on single-node WAL |
| Compliance / long retention (>24h) | Kafka (Tier 3) | Configurable topic retention |
```
[App] → [OTel Agent] → [Kafka Topic: otel.traces] → [OTel Gateway] → [Backend]
```

This decouples the ingest tier (agents) from the processing tier (gateways), enabling independent scaling and fault isolation.
```yaml
exporters:
  kafka:
    brokers:
      - kafka-broker-1.example.com:9092
      - kafka-broker-2.example.com:9092
    topic: otel.traces    # dedicated topic per signal type
    encoding: otlp_proto  # use OTLP binary encoding (recommended)
    producer:
      compression: snappy          # good balance of speed and ratio
      required_acks: wait_for_all  # durability: all ISR replicas must ack
      max_message_bytes: 1000000   # 1 MB max message size
    auth:
      sasl:
        username: ${env:KAFKA_USERNAME}
        password: ${env:KAFKA_PASSWORD}
        mechanism: SCRAM-SHA-512
      tls:
        insecure: false
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_elapsed_time: 5m
    sending_queue:
      enabled: true
      queue_size: 5000

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [kafka]
```

```yaml
receivers:
  kafka:
    brokers:
      - kafka-broker-1.example.com:9092
      - kafka-broker-2.example.com:9092
    topic: otel.traces
    group_id: otel-gateway-consumer-group  # enables consumer group parallelism
    encoding: otlp_proto
    auth:
      sasl:
        username: ${env:KAFKA_USERNAME}
        password: ${env:KAFKA_PASSWORD}
        mechanism: SCRAM-SHA-512
      tls:
        insecure: false
    initial_offset: latest  # or "earliest" for replay

exporters:
  otlp:
    endpoint: backend.example.com:4317
    sending_queue:
      enabled: true
      storage: file_storage  # Tier 2 as backup within the gateway
      queue_size: 10000

service:
  extensions: [file_storage]
  pipelines:
    traces:
      receivers: [kafka]
      processors: [memory_limiter, k8sattributes, tail_sampling, batch]
      exporters: [otlp]
```
```bash
# Create topics with appropriate retention and replication.
# 12 partitions scale with gateway replicas; replication factor 3 for HA;
# retention.ms=86400000 keeps 24 hours of data.
kafka-topics.sh --create \
  --bootstrap-server kafka:9092 \
  --topic otel.traces \
  --partitions 12 \
  --replication-factor 3 \
  --config retention.ms=86400000 \
  --config compression.type=snappy

kafka-topics.sh --create --bootstrap-server kafka:9092 \
  --topic otel.metrics --partitions 6 --replication-factor 3

kafka-topics.sh --create --bootstrap-server kafka:9092 \
  --topic otel.logs --partitions 12 --replication-factor 3
```

Each Kafka partition is consumed by one gateway replica at a time. Scale partitions to match your gateway replica count:
```
Partitions ≥ Max Gateway Replicas
```

Example: 3 gateway replicas → at least 3 partitions per topic.
Always use encoding: otlp_proto (binary OTLP) rather than otlp_json for production. JSON encoding is 3-5× larger and significantly slower to parse.
| Component | Stability |
|---|---|
| `kafkaexporter` | Beta |
| `kafkareceiver` | Beta |
The batch processor is critical for reducing network overhead.
Without batching: every span or metric can trigger its own export request.
With batching (batch size = 100): a single request carries 100 items, cutting network calls by roughly 100×.
```yaml
processors:
  batch:
    timeout: 10s               # Max wait time before sending
    send_batch_size: 1024      # Max items per batch
    send_batch_max_size: 2048  # Hard limit (emergency flush)
```

| Parameter | Low Latency (Real-time) | High Throughput (Batch) |
|---|---|---|
| `timeout` | 1s | 30s |
| `send_batch_size` | 256 | 4096 |
| `send_batch_max_size` | 512 | 8192 |
✅ Start with defaults: timeout: 10s, send_batch_size: 1024
✅ Monitor backend response times and adjust
✅ Always place batch near the end of the processor chain
> 📋 Emerging specification — max batch size for push metrics exporters: The OpenTelemetry specification has an active proposal (#4852) to introduce a standardized `max_batch_size` configuration at the metrics exporter level (OTLP push exporters), independently of the `batch` processor. When stabilized, this will allow backends to enforce per-export request size limits without requiring a shared pipeline-level batch processor. Until then, use `send_batch_max_size` in the `batch` processor or `max_size_items` in the exporter's `sending_queue.batch` (v0.147.0+) to cap request sizes.
Extensions provide capabilities outside the pipeline:
| Extension | Purpose | Port | Security Risk |
|---|---|---|---|
health_check | Readiness/liveness probes | 13133 | Low (bind to localhost) |
pprof | CPU/memory profiling | 1777 | High (exposes internal state) |
zpages | Live debugging UI | 55679 | High (exposes traces in-flight) |
file_storage | Persistent queues | N/A | Low (disk I/O only) |
output_paths Configuration

The debug exporter (replacement for the deprecated logging exporter) now supports an output_paths configuration option, allowing output to be directed to one or more file paths in addition to stdout. This is useful for capturing debug output to a file without redirecting the entire collector process:
```yaml
exporters:
  debug:
    verbosity: detailed
    output_paths:
      - stdout
      - /tmp/otelcol-debug.log
```

```yaml
extensions:
  health_check:
    endpoint: "localhost:13133"  # Bind to localhost in shared networks
  pprof:
    endpoint: "localhost:1777"   # Never bind to 0.0.0.0 in production
  file_storage:
    directory: /var/lib/otelcol/file_storage

service:
  extensions: [health_check, file_storage]
```

⚠️ Never expose pprof or zpages on 0.0.0.0 in production:
- `pprof` exposes heap dumps and can trigger CPU profiling (DoS risk)
- `zpages` exposes live trace data (may contain PII)

Best practice:
- Bind to `localhost:PORT` and use `kubectl port-forward` for debugging
- Use a Kubernetes NetworkPolicy to block external access (see the sketch below)
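A hedged NetworkPolicy sketch implementing that guidance, admitting only same-namespace traffic to the OTLP ports (the pod labels are illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: otel-collector-ingress
spec:
  podSelector:
    matchLabels:
      app: otel-collector  # illustrative label
  policyTypes: [Ingress]
  ingress:
    - from:
        - podSelector: {}  # pods in the same namespace only
      ports:
        - port: 4317  # OTLP/gRPC
        - port: 4318  # OTLP/HTTP
```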
Modern collector deployments use several configuration features that simplify operations and improve security.

Split large configurations across multiple files and merge them at startup:
```bash
# Merge base config with environment-specific overrides
otelcol --config=file:base.yaml --config=file:env-prod.yaml

# Or use glob patterns
otelcol --config=file:/etc/otelcol/base.yaml --config=file:/etc/otelcol/conf.d/*.yaml
```

Merge rules: Later files override earlier ones for scalar values; maps are deep-merged.
Use case: Separate base pipeline config from per-environment exporter endpoints, credentials, or sampling rates.
```yaml
# base.yaml — pipeline structure (shared across all environments)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"

processors:
  memory_limiter:
    limit_percentage: 80
    spike_limit_percentage: 20
    check_interval: 1s
  batch:
    timeout: 10s
    send_batch_size: 1024
```

```yaml
# env-prod.yaml — production-specific overrides
exporters:
  otlp:
    endpoint: prod-backend.example.com:4317
    tls:
      insecure: false

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
```

Use the `${env:VAR:-default}` syntax to provide fallback values when environment variables are not set:
```yaml
exporters:
  otlp:
    endpoint: ${env:OTLP_ENDPOINT:-localhost:4317}  # fallback to localhost
    headers:
      authorization: "Bearer ${env:OTLP_TOKEN:-}"   # empty string if unset

processors:
  memory_limiter:
    limit_mib: ${env:MEMORY_LIMIT_MIB:-1800}
    spike_limit_mib: ${env:MEMORY_SPIKE_MIB:-360}
    check_interval: 1s
```

⚠️ Use `${env:VAR}` (not `$VAR` or `${VAR}`) — the `env:` prefix is required in all collector versions v0.84.0+. The legacy `$VAR` syntax is deprecated. The `:-default` fallback syntax (e.g., `${env:VAR:-default}`) is supported since v0.84.0.
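A sketch of wiring those variables in a Kubernetes container spec (the Secret name and key are placeholders):

```yaml
env:
  - name: OTLP_ENDPOINT
    value: "backend.example.com:4317"
  - name: OTLP_TOKEN
    valueFrom:
      secretKeyRef:
        name: otlp-credentials  # placeholder Secret
        key: token
  - name: MEMORY_LIMIT_MIB
    value: "1800"
```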
Exporter-Level Batching (sending_queue: batch:)

In v0.147.0+, the exporter's sending_queue supports an inline batch sub-configuration that controls how items are batched before being placed in the queue — separate from the batch processor:
```yaml
exporters:
  otlp:
    endpoint: backend.example.com:4317
    sending_queue:
      enabled: true
      storage: file_storage
      queue_size: 5000
      batch:
        flush_timeout: 200ms  # max wait before sending
        min_size_items: 100   # target batch size (items)
        max_size_items: 500   # hard limit per send
```

This is useful when you want per-exporter batching behavior without adding a shared batch processor (e.g., different backends have different optimal batch sizes).
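For example, a sketch giving two backends different batch profiles (endpoints and sizes are illustrative):

```yaml
exporters:
  otlp/traces:
    endpoint: traces-backend.example.com:4317  # placeholder
    sending_queue:
      enabled: true
      batch:
        flush_timeout: 100ms  # latency-sensitive backend: small, fast batches
        min_size_items: 50
        max_size_items: 200
  otlp/metrics:
    endpoint: metrics-backend.example.com:4317  # placeholder
    sending_queue:
      enabled: true
      batch:
        flush_timeout: 1s     # throughput-oriented backend: larger batches
        min_size_items: 500
        max_size_items: 2000
```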
The otelcol binary provides built-in commands for debugging and validation:
```bash
# List all available components in the current binary
otelcol components

# Validate a configuration file (catch syntax/semantic errors before deploy)
otelcol validate --config=file:config.yaml

# Print the effective merged configuration (useful for debugging multi-file merges)
otelcol print-config --config=file:base.yaml --config=file:env-prod.yaml
```

Best Practice: Always run `otelcol validate` in CI before deploying configuration changes to production.
```yaml
# In Kubernetes init container or pre-deploy step
initContainers:
  - name: validate-config
    image: otel/opentelemetry-collector-contrib:0.147.0
    command: ["otelcol-contrib", "validate", "--config=/etc/otelcol/config.yaml"]
    volumeMounts:
      - name: config
        mountPath: /etc/otelcol
```

```yaml
extensions:
  health_check:
    endpoint: "localhost:13133"
  file_storage:
    directory: /var/lib/otelcol/file_storage

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"

processors:
  memory_limiter:
    limit_percentage: 80
    spike_limit_percentage: 20
    check_interval: 1s
  batch:
    timeout: 10s
    send_batch_size: 1024

exporters:
  otlp:
    endpoint: backend.example.com:4317
    sending_queue:
      enabled: true
      storage: file_storage
      queue_size: 5000
    retry_on_failure:
      enabled: true

service:
  extensions: [health_check, file_storage]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
```

For high-throughput deployments (e.g., >10k RPS), tune more aggressively:

```yaml
processors:
  memory_limiter:
    check_interval: 500ms       # Faster checks
    limit_percentage: 80
    spike_limit_percentage: 25  # Larger buffer
  batch:
    timeout: 30s                # Longer batching
    send_batch_size: 4096       # Larger batches
  filter:
    traces:
      span:
        - 'attributes["url.path"] == "/health"'
        - 'attributes["url.path"] == "/metrics"'

exporters:
  otlp:
    endpoint: backend.example.com:4317
    sending_queue:
      enabled: true
      storage: file_storage
      queue_size: 10000      # Larger queue
      num_consumers: 20      # More parallel exports
    retry_on_failure:
      enabled: true
      max_elapsed_time: 10m  # Longer retry window
```

✅ Always use memory_limiter as the first processor
✅ Always use batch processor near the end of the chain
✅ Enable file_storage for production to prevent data loss
✅ Use Kafka (Tier 3) for cross-AZ/cross-region durability at scale
✅ Use Connectors for span-to-metrics and cross-pipeline routing (see connectors.md)
✅ Use multi-file config merging to separate base config from environment overrides
✅ Use ${env:VAR:-default} syntax for environment variable defaults
✅ Run otelcol validate in CI before deploying configuration changes
✅ Check component stability levels before production use
✅ Use OCB to build custom, lean collector binaries
✅ Monitor otelcol_exporter_send_failed_spans for data loss
✅ Never expose pprof or zpages on 0.0.0.0
The collector is not just a forwarder—it's a high-performance data processing pipeline that requires careful configuration for production resilience.