Expert guidance for configuring and deploying the OpenTelemetry Collector. Use when setting up a Collector pipeline, configuring receivers, exporters, or processors, deploying a Collector to Kubernetes or Docker, or forwarding telemetry to Dash0. Triggers on requests involving collector, pipeline, OTLP receiver, exporter, or Dash0 collector setup.
For every span, make three decisions: what to name it, which kind to assign, and when to set its status to error. Use the tables and rules below to make each decision. Wrong span names break operation dashboards, wrong span kinds break service maps, and wrong status codes break error tracking.
The span hygiene rules in this file are aligned with the Instrumentation Score specification — a vendor-neutral scoring system that quantifies how well a service follows OpenTelemetry best practices. Each hygiene rule references the corresponding Instrumentation Score rule ID.
Span names MUST be low-cardinality. The number of unique span names in a system must be bounded and small.
Name spans with the `{verb} {object}` pattern.

| Anti-Pattern (high cardinality) | Correct (low cardinality) | Fix |
|---|---|---|
| `GET /api/users/12345` | `GET /api/users/:id` | Use route template, not actual path |
| `SELECT * FROM orders WHERE id=99` | `SELECT orders` | Use table name, not full query |
| `process_payment_for_user_jane` | `process payment` | User identity is an attribute |
| `send_invoice_#98765` | `send invoice` | Invoice number is an attribute |
| `validation_failed` | `validate user_input` | Name the operation, not the outcome |
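As a minimal sketch of the last two table rows (the helper name `invoiceSpanOptions` and the `invoice.id` attribute key are illustrative, not from any SDK), keep the identity out of the span name and attach it as an attribute instead:

```typescript
// Illustrative helper: the span name stays constant while the identity
// travels as an attribute. 'invoice.id' is an assumed attribute key.
function invoiceSpanOptions(invoiceNumber: string) {
  return {
    name: 'send invoice',                        // low-cardinality {verb} {object}
    attributes: { 'invoice.id': invoiceNumber }, // high-cardinality value as attribute
  };
}

// usage: tracer.startActiveSpan(opts.name, { attributes: opts.attributes }, ...)
console.log(invoiceSpanOptions('INV-98765').name); // always 'send invoice'
```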
URL paths with embedded identifiers (/api/users/12345, /orders/550e8400-e29b-41d4-a716-446655440000) cause cardinality explosion in span names and attributes, and may leak internal IDs.
Replace dynamic path segments with placeholders before attaching the path to a span.
```javascript
// BAD: high-cardinality path with embedded IDs
span.setAttribute('url.path', '/api/users/12345/orders/550e8400-e29b-41d4-a716-446655440000');

// GOOD: parameterized path
function parameterizePath(path) {
  return path
    .replace(/\/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi, '/{uuid}')
    .replace(/\/\d+/g, '/{id}');
}

span.setAttribute('url.path', parameterizePath(req.path));
// Result: '/api/users/{id}/orders/{uuid}'
```

Many HTTP instrumentation libraries already parameterize routes (e.g., Express `req.route.path` yields `/api/users/:id`).
Use the framework-provided route template when available; fall back to regex replacement only when the framework does not expose one.
| Signal | Format | Example |
|---|---|---|
| HTTP server | {method} {http.route} | GET /api/users/:id |
| HTTP client | {method} {url.template} or {method} | POST /checkout |
| Database | {db.operation.name} {db.collection.name} | SELECT orders |
| RPC | {rpc.service}/{rpc.method} | UserService/GetUser |
| Messaging | {operation} {destination} | publish shop.orders |
Preferred name sources by signal:

- HTTP: prefer `http.route` (server) or `url.template` (client). If unavailable, use just the method, but go to great lengths to make `http.route` available on the server and `url.template` on the client.
- Database: prefer, in order, `db.query.summary` > `{operation} {collection}` > `{collection}` > `{db.system.name}`.
- If no low-cardinality name is available (e.g., the system is `_OTHER`), use the protocol name alone (e.g., `HTTP`).

Each span has exactly one kind. Choose based on the communication pattern, not the technology.
| Kind | Use When | Examples |
|---|---|---|
| SERVER | Handling an inbound synchronous request | Incoming HTTP request, incoming gRPC call |
| CLIENT | Making an outbound synchronous request | HTTP call, database query, outbound RPC |
| PRODUCER | Initiating an asynchronous operation | Publishing a message to a queue or topic |
| CONSUMER | Processing an asynchronous operation | Processing a message from a queue |
| INTERNAL | Internal operation with no remote parent/child | In-memory computation, internal function call |
Do not default to INTERNAL for everything.
Calls towards databases are CLIENT.
HTTP handlers are SERVER.
Only use INTERNAL for operations that genuinely have no remote counterpart.

Do not use CLIENT for message publishing.
Publishing to a queue is asynchronous — use PRODUCER.
CLIENT implies the caller waits for a response.

Do not use SERVER for message processing. Processing a queued message is CONSUMER, not SERVER, because the producer isn't waiting.

For messaging spans, choose the kind by operation:

| Operation | Span Kind |
|---|---|
| create | PRODUCER |
| send | PRODUCER (or CLIENT if waiting for ack) |
| receive | CLIENT |
| process | CONSUMER |
| settle | CLIENT |
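The table above can be sketched as a lookup. The function name and the `waitsForAck` flag are illustrative, not an SDK API:

```typescript
// Illustrative mapping from messaging operation to span kind, following the
// table above. 'send' flips to CLIENT only when the publisher waits for an ack.
type MessagingOp = 'create' | 'send' | 'receive' | 'process' | 'settle';
type KindName = 'PRODUCER' | 'CONSUMER' | 'CLIENT';

function kindForMessagingOp(op: MessagingOp, waitsForAck = false): KindName {
  switch (op) {
    case 'create':  return 'PRODUCER';
    case 'send':    return waitsForAck ? 'CLIENT' : 'PRODUCER';
    case 'receive': return 'CLIENT';
    case 'process': return 'CONSUMER';
    case 'settle':  return 'CLIENT';
  }
}

console.log(kindForMessagingOp('process')); // CONSUMER
```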
Leave span status UNSET by default.
Only set it to ERROR when the operation genuinely failed.
Set the span status to OK only on confirmed success; if unsure, leave it UNSET.
The rules differ by span kind — this is the most commonly misunderstood convention:
HTTP client spans (`SpanKind.CLIENT`):

| HTTP Status | Span Status | Rationale |
|---|---|---|
| 1xx, 2xx, 3xx | UNSET | Request succeeded |
| 4xx | ERROR | Client's request failed |
| 5xx | ERROR | Server error = client failure |
| No response | ERROR | Connection/timeout failure |
HTTP server spans (`SpanKind.SERVER`):

| HTTP Status | Span Status | Rationale |
|---|---|---|
| 1xx, 2xx, 3xx | UNSET | Request handled successfully |
| 4xx | UNSET | Server responded correctly to a bad request |
| 5xx | ERROR | Server failed to handle the request |
| No response | ERROR | Server-side failure |
The critical distinction: a 400 Bad Request on a server span is NOT an error — the server did its job. The same 400 on the corresponding client span IS an error — the client's request failed.
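The two tables above can be collapsed into one decision function. This is a sketch, not an SDK API; the names `spanStatusFor`, `Kind`, and `Status` are illustrative:

```typescript
// Map an HTTP status code to a span status, per span kind.
// undefined httpStatus models "no response" (connection/timeout failure).
type Kind = 'CLIENT' | 'SERVER';
type Status = 'UNSET' | 'ERROR';

function spanStatusFor(kind: Kind, httpStatus: number | undefined): Status {
  if (httpStatus === undefined) return 'ERROR'; // no response: failure on both sides
  if (httpStatus >= 500) return 'ERROR';        // server failed; also a client failure
  if (httpStatus >= 400) {
    // 4xx: the client's request failed, but the server did its job
    return kind === 'CLIENT' ? 'ERROR' : 'UNSET';
  }
  return 'UNSET'; // 1xx-3xx: success on both sides
}

console.log(spanStatusFor('SERVER', 400)); // UNSET
console.log(spanStatusFor('CLIENT', 400)); // ERROR
```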
Set the status to OK when the application logic has confirmed that the operation succeeded — for example, after validating a response, completing a transaction, or receiving an explicit acknowledgement.
OK signals to the backend that the operation was verified as successful, not merely that no error was caught.
Do not set OK speculatively; leave the status UNSET if the code does not explicitly confirm success.
Do not set ERROR for errors that were retried and ultimately succeeded, or for errors that were intentionally handled.
When setting ERROR, include a status message that describes the failure.
The message should contain the error class and a short explanation — enough to understand the failure without opening the full trace.
Do not include stack traces in the status message; record those in a log record with `exception.stacktrace` instead.

```javascript
// BAD: error status without a message
span.setStatus({ code: SpanStatusCode.ERROR });

// BAD: generic message with no diagnostic value
span.setStatus({ code: SpanStatusCode.ERROR, message: 'something went wrong' });

// GOOD: specific message with error class and context
span.setStatus({
  code: SpanStatusCode.ERROR,
  message: `TimeoutError: upstream payment service did not respond within 5s`,
});
```

Verify this rule in integration tests — see testing trace data.
Do not set ERROR for handled or retried errors: only set span status to ERROR when the failure is final.
Do not set it for errors that were retried and ultimately succeeded, or for errors that were intentionally handled (e.g., a fallback path that produces a valid result).
When retrying inside a span, record each failed attempt as a span event and set ERROR only after all retries are exhausted.
```javascript
async function fetchWithRetry(url, maxRetries) {
  return tracer.startActiveSpan('http.fetch_with_retry', async (span) => {
    let lastError;
    for (let attempt = 1; attempt <= maxRetries; attempt++) {
      try {
        const response = await fetch(url);
        span.end(); // success: end the span and leave the status UNSET
        return response;
      } catch (error) {
        lastError = error;
        span.addEvent('retry', { attempt, error: error.message });
      }
    }
    // All retries exhausted — now set ERROR.
    span.setStatus({ code: SpanStatusCode.ERROR, message: lastError.message });
    span.end();
    throw lastError;
  });
}
```

See the language-specific SDK rules for idiomatic examples in each language.
Record exception details as a structured log record, not as a span event.
The Span Event API is being deprecated in favour of log-based events.
Emit the exception as a log record within the active span context so that it carries trace_id and span_id automatically according to the logs guidance.
Use a log message that describes the failed operation — not a generic label like "exception" or "error".
The exception.* attributes carry the exception details; the message provides the operational context that makes the log record useful when scanning a log stream.
```javascript
import { trace, context } from '@opentelemetry/api';

// BAD: uses the deprecated Span Event API
span.recordException(error);

// BAD: generic message with no operational context
logger.error('exception', { 'exception.type': error.name, ... });

// GOOD: descriptive message with exception attributes and trace correlation
const spanContext = trace.getSpan(context.active())?.spanContext();
logger.error('order.charge.failed', {
  'trace_id': spanContext?.traceId,
  'span_id': spanContext?.spanId,
  'exception.type': error.name,
  'exception.message': error.message,
  'exception.stacktrace': error.stack,
});
```

Include trace_id and span_id so the exception log record can be correlated with the span that produced it; see trace correlation for the getTraceContext() helper pattern.
Set exception.type, exception.message, and exception.stacktrace as log record attributes.
Serialize the stack trace as a single string — see exception stack traces for formatting rules.
OpenTelemetry SDKs batch telemetry before exporting. If the process exits before the batch is flushed, buffered spans are lost — including data from the request that caused the crash. Every application must ensure providers are shut down or flushed before process exit.
Abrupt termination (SIGKILL, OOM kill, segfault) bypasses all shutdown hooks — no in-process mitigation exists.
See the graceful shutdown guidance in the language-specific SDK rules for the idiomatic shutdown pattern in each runtime.
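A minimal, runtime-agnostic sketch of the flush-before-exit wiring. It assumes only that the provider exposes a `shutdown()` method (as the OpenTelemetry SDK tracer providers do); the signal list, exit codes, and helper name are assumptions:

```typescript
// Sketch: flush buffered telemetry before the process exits.
// Works for any provider exposing shutdown(), e.g. NodeTracerProvider.
type Flushable = { shutdown(): Promise<void> };
type ProcessLike = {
  on(signal: string, handler: () => void): void;
  exit(code: number): void;
};

function registerShutdownHooks(provider: Flushable, proc: ProcessLike): void {
  const flushAndExit = (code: number) => {
    // shutdown() flushes pending batches, then stops the exporters.
    provider.shutdown().finally(() => proc.exit(code));
  };
  proc.on('SIGTERM', () => flushAndExit(0));
  proc.on('SIGINT', () => flushAndExit(0));
}

// usage: registerShutdownHooks(tracerProvider, process);
```

Remember that this only covers cooperative shutdown; SIGKILL and OOM kills still bypass it, as noted above.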
Auto-instrumentation libraries set protocol-level attributes (http.request.method, db.operation.name, url.path, etc.) automatically.
These are necessary but not sufficient — they describe how the system communicates, not what business operation is being performed.
Add domain-specific attributes to make traces actionable for debugging and business analysis.
Add attributes that answer the question: "when investigating this span during an incident, what context would I need?"
| Domain | Attribute examples | Why |
|---|---|---|
| E-commerce | order.id, cart.item_count, payment.method | Identify the affected order without searching logs |
| Auth | user.id, user.role, auth.method | Narrow down which users are impacted |
| Messaging | message.type, queue.depth (at publish time) | Understand the workload shape |
| Multi-tenant | tenant.id, tenant.plan | Isolate tenant-specific issues |
| Feature flags | feature_flag.key, feature_flag.variant | Correlate regressions with flag changes |
Set attributes at span creation time or as soon as the values are available.
Prefer setAttribute calls over constructor options when the value is computed after the span starts.
```javascript
await tracer.startActiveSpan('process order', async (span) => {
  span.setAttribute('order.id', order.id);
  span.setAttribute('order.total', order.total);
  span.setAttribute('payment.method', order.paymentMethod);
  try {
    await chargePayment(order);
    span.setStatus({ code: SpanStatusCode.OK });
  } catch (error) {
    span.setStatus({
      code: SpanStatusCode.ERROR,
      message: `${error.name}: ${error.message}`,
    });
    throw error;
  } finally {
    span.end();
  }
});
```

Auto-instrumentation creates spans you do not control directly (e.g., the SERVER span for an HTTP request).
To enrich these spans with business context, retrieve the active span and add attributes to it.
```javascript
import { trace } from '@opentelemetry/api';

app.post('/api/orders', async (req, res) => {
  const span = trace.getActiveSpan();
  span?.setAttribute('order.id', req.body.orderId);
  span?.setAttribute('tenant.id', req.headers['x-tenant-id']);
  // ... handler logic
});
```

Use this pattern at the earliest point in the request handler where the business context is known. Do not wrap auto-instrumented spans in a second manual span just to add attributes — that creates unnecessary nesting.
Follow the Attribute Registry conventions:

- Use dot-delimited, lowercase names: `order.id`, not `orderId` or `order_id`.
- Reuse registry attributes where one exists (e.g., `enduser.id` instead of `user.id`).
- Namespace custom attributes to avoid collisions (e.g., `com.acme.order.priority`).

Span attributes tolerate higher cardinality than metric attributes, but unbounded values may still cause problems at the backend (indexing costs, slow queries, harder to create metrics in the observability pipeline).
| Safe on spans | Avoid on spans |
|---|---|
| `order.id`, `user.id`, `tenant.id` | `request.body` (arbitrary size) |
| `feature_flag.key` (bounded set) | `url.full` with query params (unbounded) |
| `payment.method` (bounded set) | Serialized objects or arrays |
See attribute placement for how cardinality tolerance differs across signals.
A CLIENT or PRODUCER root span indicates missing instrumentation or lost trace context.
Root spans must describe what the service is doing (the unit of work), not what it is calling (outgoing requests).
Root spans should have kind SERVER, CONSUMER, or INTERNAL.
If you see CLIENT or PRODUCER root spans, the most likely cause is a headless operation that lacks a manual root span:
Cron jobs, scheduled tasks, CLI commands, and background workers have no inbound HTTP request to trigger a SERVER span via auto-instrumentation.
Without a manual root span, the first auto-instrumented outbound call (e.g., a database query or HTTP request) becomes a CLIENT root span, which misrepresents what the service is doing.
Create a manual SERVER root span that wraps the entire operation.
This ensures the outbound calls appear as children of a meaningful root span rather than as disconnected CLIENT or PRODUCER root spans.
```javascript
// BAD: no root span — the database query becomes a CLIENT root span
async function processDaily() {
  await db.query('SELECT * FROM orders WHERE status = $1', ['pending']);
}

// GOOD: manual root span wraps the headless operation
async function processDaily() {
  await tracer.startActiveSpan('process daily orders', { kind: SpanKind.SERVER }, async (span) => {
    try {
      await db.query('SELECT * FROM orders WHERE status = $1', ['pending']);
    } finally {
      span.end();
    }
  });
}
```

Every span of kind CLIENT (database query, HTTP call, RPC request) or PRODUCER (messaging, job queues) must be a child of a SERVER, CONSUMER, or INTERNAL span.
A CLIENT or PRODUCER span without such a parent means the trace has no record of why the outgoing call was made — it captures the call but not the work that triggered it.
Auto-instrumentation for web frameworks and messaging libraries creates SERVER or CONSUMER spans automatically.
CLIENT or PRODUCER spans for outbound calls made inside those handlers are already children — no action needed.
The problem arises in code that runs outside a request handler: cron jobs, background workers, startup tasks, and CLI commands.
These have no auto-instrumented parent, so the first outbound call becomes a parentless CLIENT or PRODUCER root span.
Wrap such operations in a manual SERVER or INTERNAL span — see headless operations for the pattern.
Verify this rule in integration tests — see testing trace data.
Every span with a parent_span_id must have a corresponding parent span in the trace.
Orphan spans indicate broken context propagation or instrumentation gaps, and result in fragmented, misleading trace views.
Common causes are broken context propagation across async or service boundaries, and parent spans that ended without being exported or were sampled out.
Verify this rule in integration tests — see testing trace data.
Keep INTERNAL spans under 10 per service within a single trace.
Excessive internal spans signal over-instrumentation and clutter trace views, making it harder to identify actual bottlenecks.
If you exceed this limit, consider:

- Replacing fine-grained INTERNAL spans with span events or log records.
- Collapsing per-item spans into a single batch span that carries a `batch.size` attribute.

Verify this rule in integration tests — see testing trace data.
A trace must not contain more than 20 spans with a duration under 5 milliseconds. Exceeding this threshold indicates spans created in tight loops or over-instrumented internal code. These inflate trace storage without adding observability value.
Instead of creating a span per iteration, create a single span for the batch operation:
```javascript
// BAD: 1,000 spans under 1ms each
items.forEach(item => {
  tracer.startActiveSpan('process.item', span => {
    process(item);
    span.end();
  });
});

// GOOD: single span with context
tracer.startActiveSpan('process.batch', span => {
  span.setAttribute('batch.size', items.length);
  items.forEach(process);
  span.end();
});
```

Verify this rule in integration tests — see testing trace data.
Use the AlwaysOn sampler in application SDKs.
This is the default in every OpenTelemetry SDK — do not change it.
Do not configure TraceIdRatioBased, ParentBased, or any other sampler that drops spans in the application.
SDK-side sampling makes irreversible decisions at the head of the trace, before the outcome of the request is known.
A trace that looked unremarkable at the start may turn out to contain an error, a latency spike, or a rarely exercised code path — all of which are lost if the SDK decided not to sample.
Defer all sampling decisions to the Collector or observability pipeline, where they can be changed centrally without redeploying applications.
The Collector can apply head sampling with the probabilisticsamplerprocessor, or tail sampling with the tailsamplingprocessor, using criteria that are impossible to evaluate at request start (error status, latency, attribute values).
```shell
# BAD: head sampling in the SDK — loses traces before their outcome is known
export OTEL_TRACES_SAMPLER=traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.1

# GOOD: default AlwaysOn sampler — sampling happens in the Collector
# (no OTEL_TRACES_SAMPLER needed; AlwaysOn is the default)
```

Accurate RED metrics (request rate, error rate, duration) cannot be computed from sampled traces. If you sample in the Collector, materialize metrics from spans before the sampling step — see connectors and sampling.
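Collector-side tail sampling can be sketched as follows. The policy names and thresholds are illustrative; check the tailsamplingprocessor documentation for the full schema:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow
        type: latency
        latency: {threshold_ms: 1000}
      - name: baseline
        type: probabilistic
        probabilistic: {sampling_percentage: 10}
```

A trace kept by any policy survives, so errors and slow traces are retained while routine traffic is sampled down.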
Treat trace shape — spans, their parent-child relations, kinds, status codes, and attributes — as a functional requirement. Use an in-memory span exporter in integration tests to capture spans produced by the code under test and assert the rules from this file.
```typescript
import { SpanKind } from '@opentelemetry/api';
import { InMemorySpanExporter, SimpleSpanProcessor } from '@opentelemetry/sdk-trace-base';
import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node';

const exporter = new InMemorySpanExporter();
const provider = new NodeTracerProvider();
provider.addSpanProcessor(new SimpleSpanProcessor(exporter));
provider.register();

// Call between tests to reset captured spans
function resetSpans() {
  exporter.reset();
}

// Helper: get all finished spans
function getSpans() {
  return exporter.getFinishedSpans();
}
```

Assert that every CLIENT or PRODUCER span has a parent.
A failure means code is making outbound calls without a wrapping SERVER or INTERNAL span — see headless operations.
```typescript
function assertNoParentlessOutboundSpans() {
  const parentless = getSpans().filter(
    (s) => (s.kind === SpanKind.CLIENT || s.kind === SpanKind.PRODUCER) && !s.parentSpanId,
  );
  if (parentless.length > 0) {
    const names = parentless.map((s) => s.name).join(', ');
    throw new Error(`CLIENT/PRODUCER root spans detected: ${names}`);
  }
}
```

Assert that every span with a parentSpanId has a corresponding parent in the trace.
A failure indicates broken context propagation or a parent span that ended or was sampled out.
```typescript
function assertNoOrphanSpans() {
  const spans = getSpans();
  const spanIds = new Set(spans.map((s) => s.spanContext().spanId));
  const orphans = spans.filter((s) => s.parentSpanId && !spanIds.has(s.parentSpanId));
  if (orphans.length > 0) {
    const names = orphans.map((s) => s.name).join(', ');
    throw new Error(`Orphan spans detected (parent not found): ${names}`);
  }
}
```

Assert that no trace contains more than 10 INTERNAL spans from a single service.
A failure signals over-instrumentation — replace fine-grained spans with batch spans or log records.
```typescript
function assertInternalSpanLimit(maxPerTrace = 10) {
  const spans = getSpans();
  const counts = new Map<string, number>();
  for (const s of spans) {
    if (s.kind === SpanKind.INTERNAL) {
      const traceId = s.spanContext().traceId;
      counts.set(traceId, (counts.get(traceId) ?? 0) + 1);
    }
  }
  for (const [traceId, count] of counts) {
    if (count > maxPerTrace) {
      throw new Error(
        `Trace ${traceId} has ${count} INTERNAL spans (limit: ${maxPerTrace})`,
      );
    }
  }
}
```

Assert that no trace contains more than 20 spans with a duration under 5 milliseconds. A failure indicates spans created in tight loops — replace with a single batch span.
```typescript
function assertShortDurationSpanLimit(maxPerTrace = 20, thresholdMs = 5) {
  const spans = getSpans();
  const counts = new Map<string, number>();
  for (const s of spans) {
    // endTime/startTime are [seconds, nanoseconds] hrtime tuples
    const durationMs = (s.endTime[0] - s.startTime[0]) * 1000 + (s.endTime[1] - s.startTime[1]) / 1e6;
    if (durationMs < thresholdMs) {
      const traceId = s.spanContext().traceId;
      counts.set(traceId, (counts.get(traceId) ?? 0) + 1);
    }
  }
  for (const [traceId, count] of counts) {
    if (count > maxPerTrace) {
      throw new Error(
        `Trace ${traceId} has ${count} spans under ${thresholdMs} ms (limit: ${maxPerTrace})`,
      );
    }
  }
}
```

Assert that every span with status ERROR includes a non-empty status message.
A failure means an error was recorded without enough context to diagnose the failure from the trace alone.
```typescript
function assertErrorSpansHaveMessages() {
  const missing = getSpans().filter(
    (s) => s.status.code === 2 /* ERROR */ && !s.status.message?.trim(),
  );
  if (missing.length > 0) {
    const names = missing.map((s) => s.name).join(', ');
    throw new Error(`ERROR spans without a status message: ${names}`);
  }
}
```

Call these assertions after each integration test that exercises a code path producing traces. If any assertion fails, investigate the code structure to find the most appropriate place to fix the instrumentation.
```typescript
afterEach(() => {
  assertNoParentlessOutboundSpans();
  assertNoOrphanSpans();
  assertInternalSpanLimit();
  assertShortDurationSpanLimit();
  assertErrorSpansHaveMessages();
  resetSpans();
});
```