
o11y-dev/opentelemetry-skill

Expert OpenTelemetry guidance for collector configuration, pipeline design, and production telemetry instrumentation. Use when configuring collectors, designing pipelines, instrumenting applications, implementing sampling, managing cardinality, securing telemetry, writing OTTL transformations, or setting up AI coding agent observability (Claude Code, Codex, Gemini CLI, GitHub Copilot).


OpenTelemetry Production Playbooks

Overview

This reference is a routing-friendly playbook index for OpenTelemetry blog content that is relevant to this skill. The goal is not to retell one company story in detail. The goal is to help the skill map user questions to the most relevant upstream operating patterns, then load the right deep-dive references from this repository.

The 2025 Developer Experience SIG survey explicitly called out the need for better production examples, debugging guidance, and more concrete deployment guidance. This document turns that need into a scalable maintenance model for future blog routing.

Use this document when a user asks for:

  • a real-world deployment pattern
  • a production rollout model for a platform team
  • a blog-derived example instead of a purely theoretical recommendation
  • a recent opentelemetry.io article that is relevant to a practical OTel task
  • a generic playbook that should remain reusable as more blogs are added

Table of Contents

  1. How to Use This Reference
  2. Playbook Routing Format
  3. Relevant 2025-2026 Blogs for This Skill
  4. Generic Playbook Patterns
  5. Common Failure Modes

How to Use This Reference

These playbooks are not meant to be copied verbatim. They should be used to answer questions like:

  • "Which upstream blog should I route to for this production question?"
  • "What real-world pattern covers self-service onboarding on Kubernetes?"
  • "What blog is most relevant for Lambda, logs, sampling, or naming?"
  • "Which reference docs should the skill load after matching a blog-derived pattern?"

For each routed playbook, load deeper reference material from this repository as needed:

  • architecture.md for deployment models, multi-cluster, and scaling
  • collector.md for collector pipeline structure and operational mechanics
  • connectors.md for routing and cross-pipeline patterns
  • instrumentation.md for SDKs, auto-instrumentation, naming, and signal semantics
  • monitoring.md for health, failure visibility, and alerting
  • platforms.md for Kubernetes, FaaS, browser, and platform patterns
  • sampling.md for head, tail, and probability-based sampling
  • security.md for TLS, auth, and exposure boundaries

Playbook Routing Format

As more OpenTelemetry.io blog posts are integrated, keep each playbook entry in this shape:

  1. Source: the blog post URL and title
  2. Routing signals: the kinds of user questions or keywords that should load it
  3. Playbook theme: the reusable operational or instrumentation pattern
  4. Why it matters: what skill behavior or production decision it informs
  5. Load next: which local references should be loaded after routing
  6. Caveats: limitations, maturity notes, or operational trade-offs

This structure keeps the skill generic. It routes by user intent and technical problem space, not by a specific company name.
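The six-part entry shape above can be sketched as a small record type. This is only an illustration: `PlaybookEntry`, `validate`, and the placeholder URL string are hypothetical names introduced here, and the reference names are taken from the "load deeper reference material" list earlier in this document.

```python
from dataclasses import dataclass

# Local reference docs named in this document (without the .md suffix).
KNOWN_REFERENCES = {
    "architecture", "collector", "connectors", "instrumentation",
    "monitoring", "platforms", "sampling", "security",
}


@dataclass(frozen=True)
class PlaybookEntry:
    """Hypothetical record mirroring the six-part routing format."""
    source_title: str
    source_url: str
    routing_signals: tuple[str, ...]
    theme: str
    why_it_matters: str
    load_next: tuple[str, ...]
    caveats: tuple[str, ...] = ()


def validate(entry: PlaybookEntry) -> list[str]:
    """Return human-readable problems; an empty list means the entry is usable."""
    problems = []
    if not entry.routing_signals:
        problems.append("no routing signals")
    if not entry.load_next:
        problems.append("no local references to load next")
    unknown = sorted(set(entry.load_next) - KNOWN_REFERENCES)
    if unknown:
        problems.append("unknown references: " + ", ".join(unknown))
    return problems


lambda_entry = PlaybookEntry(
    source_title="Observing Lambdas using the OpenTelemetry Collector Extension Layer",
    source_url="TODO: post URL",  # placeholder; the URL is not given in this document
    routing_signals=("Lambda", "serverless", "extension layer",
                     "decouple processor", "delayed export"),
    theme="Decoupled export for ephemeral runtimes",
    why_it_matters="Covers ephemeral runtime constraints and decoupled export patterns",
    load_next=("platforms", "collector", "monitoring"),
)
```

Checking `load_next` against the known reference names is one way to catch an entry that points at a document the repository does not contain.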


Relevant 2025-2026 Blogs for This Skill

These are the 2025 and early-2026 opentelemetry.io blog posts most relevant to this skill today. The list is intentionally topic-driven and open-ended so future entries can be added without restructuring the document.

| Blog | Primary routing signals | Why it matters for the skill | Load next |
| --- | --- | --- | --- |
| Kubernetes annotation-based discovery for the OpenTelemetry Collector | receiver_creator, annotation-based discovery, Kubernetes self-service scraping, pod annotations | Strong playbook for self-service Collector onboarding with platform safety rails | collector, platforms |
| Observing Lambdas using the OpenTelemetry Collector Extension Layer | Lambda, serverless, extension layer, decouple processor, delayed export | Covers ephemeral runtime constraints and decoupled export patterns | platforms, collector, monitoring |
| Exposing OTel Collector in Kubernetes with Gateway API & mTLS | Gateway API, mTLS, external OTLP ingress, multi-cluster collector, hybrid cloud | Practical security and ingress pattern for centralized collector deployments | security, architecture, collector |
| How Mastodon Runs OpenTelemetry Collectors in Production | small team, one collector per namespace, OpenTelemetry Operator, Argo CD, tail sampling, vendor-neutral observability | Strong operating model for keeping collector deployments simple, declarative, and reliable while preserving backend choice at production scale | architecture, collector, monitoring |
| OpenTelemetry Profiles Enters Public Alpha | profiles, profiling, OTLP Profiles, eBPF profiler, pprof receiver, profile correlation | Good routing target when users ask how continuous profiling fits into OpenTelemetry, especially around collector support and cross-signal correlation | collector, platforms, monitoring |
| Demystifying Automatic Instrumentation: How the Magic Actually Works | auto-instrumentation, zero-code, bytecode instrumentation, eBPF, runtime hooks | Helps the skill explain which automatic instrumentation mechanism fits a runtime | instrumentation, platforms |
| OpenTelemetry Logging and You | logs, events, Logs API, log bridges, signal correlation | Useful when users ask how logs relate to traces and metrics in OTel's model | instrumentation, collector |
| How to Name Your Spans | span naming, low cardinality, semantic conventions, business spans | Good routing target for custom instrumentation and naming guidance | instrumentation |
| How to Name Your Span Attributes | attribute naming, semantic conventions, custom attributes, reserved namespaces | Helps the skill answer detailed questions about attribute design and stability | instrumentation |
| How to Name Your Metrics | metric naming, units, metric cardinality, service.name, semantic conventions | Important for metric schema hygiene and cross-service aggregation advice | instrumentation, monitoring |
| OpenTelemetry Sampling update | consistent sampling, TraceState, probability sampling, W3C TraceContext | Strong route for advanced sampling questions beyond basic head vs tail framing | sampling |
| The Declarative configuration journey: Why it took 5 years to ignore health check endpoints in tracing | declarative config, config file, health check exclusion, Java agent config | Good route for questions about portable config, rule-based routing, and YAML-first OTel setup | instrumentation, sampling |
| OTTL contexts just got easier with context inference | OTTL, transform processor, context inference, Collector transforms | Useful when users need simpler transform-processor guidance and want to avoid manual context selection mistakes | collector, connectors |
| Announcing Support for Complex Attribute Types in OTel | complex attributes, maps, heterogeneous arrays, structured telemetry | Helps the skill answer when complex payloads belong in attributes and when flat attributes remain the better design | instrumentation |
| Announcing the Beta Release of OpenTelemetry Go Auto-Instrumentation using eBPF | Go auto-instrumentation, eBPF, runtime hooks, zero-code Go | Adds a concrete runtime-specific route for Go users beyond generic auto-instrumentation explanations | instrumentation, platforms |
| Alibaba, Datadog, and Quesma Join Forces on Go Compile-Time Instrumentation | Go compile-time instrumentation, toolexec, zero-code Go, build-time instrumentation | Good route when users compare compile-time instrumentation with eBPF or manual Go instrumentation | instrumentation |
| Announcing the RPC Semantic Conventions stabilization project | RPC semantic conventions, gRPC telemetry, convention migration, stabilization | Useful for questions about RPC naming, migration windows, and convention stability expectations | instrumentation |
| Contributing the Unroll Processor to the OpenTelemetry Collector Contrib | unroll processor, bundled logs, record expansion, transform vs purpose-built processor | Adds a routing path for log-pipeline questions where bundled payload expansion should not be forced into OTTL transforms | collector, monitoring |
| How Mastodon Runs OpenTelemetry Collectors in Production | small team, Operator-managed collectors, one collector per namespace, Datadog connector, tail sampling in production | Strong production routing example for keeping collector architecture simple, using the OpenTelemetry Operator for lifecycle, and controlling volume with aggressive error-first sampling | architecture, collector, sampling |
| OpenTelemetry Profiles Enters Public Alpha | profiles, continuous profiling, eBPF profiler, pprof receiver, profile signal | Useful when users ask about bringing profiling into an OTel pipeline; it sets the right expectation that Profiles are practical to evaluate but still Alpha for critical production workloads | collector, platforms |

Routing notes for future maintenance

  • Prefer adding new entries to this list rather than creating one-off narrative sections.
  • Route by technical intent such as sampling, logs, serverless, ingress, naming, or self-service onboarding.
  • Keep source links stable and place company-specific stories in Reference Links unless their patterns become broadly generic and reusable.

Generic Playbook Patterns

These patterns are intentionally generic so the skill can scale as more blogs are added.

Route by problem, not by company

The skill should match on the user's technical goal, such as Lambda export, secure collector ingress, or naming guidance, not on a company name from a blog post.
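A minimal sketch of intent-based routing, assuming simple case-insensitive substring matching against routing signals. The `ROUTES` table is a hand-copied subset of the blog table in this document, and `route` is a hypothetical helper, not how the skill is actually implemented.

```python
# Subset of the routing table: blog title -> routing signals and local references.
ROUTES = {
    "Observing Lambdas using the OpenTelemetry Collector Extension Layer": {
        "signals": ("lambda", "serverless", "extension layer",
                    "decouple processor", "delayed export"),
        "load_next": ("platforms", "collector", "monitoring"),
    },
    "Exposing OTel Collector in Kubernetes with Gateway API & mTLS": {
        "signals": ("gateway api", "mtls", "external otlp ingress",
                    "multi-cluster collector", "hybrid cloud"),
        "load_next": ("security", "architecture", "collector"),
    },
    "How to Name Your Spans": {
        "signals": ("span naming", "low cardinality",
                    "semantic conventions", "business spans"),
        "load_next": ("instrumentation",),
    },
}


def route(question: str, routes: dict = ROUTES) -> list[str]:
    """Return matching blog titles, best match first.

    A title matches when at least one routing signal appears in the question.
    Company names are never consulted, so they cannot influence the result.
    """
    text = question.lower()
    scored = []
    for title, entry in routes.items():
        hits = sum(1 for signal in entry["signals"] if signal in text)
        if hits:
            scored.append((hits, title))
    # More signal hits rank first; ties are broken alphabetically for stability.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored]
```

A question that mentions only an organization and no technical signal returns an empty list, which is the intended behavior: the caller should ask for the underlying technical goal instead.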

Prefer self-service with safety rails

Good playbooks let application teams opt in through narrow, well-defined interfaces while the platform retains the right guardrails.

Keep varying context out of names

For spans, attributes, and metrics, prefer low-cardinality names and put varying context in attributes or resource metadata.
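As an illustration of this rule for an HTTP server span, the sketch below builds the span name from the method and route template and moves the varying values into attributes. `describe_request` and the `app.` attribute prefix are hypothetical; `http.request.method` and `http.route` follow the HTTP semantic conventions, which should be checked for the version in use.

```python
def describe_request(
    method: str, route_template: str, path_params: dict[str, str]
) -> tuple[str, dict[str, str]]:
    """Return a low-cardinality span name plus attributes for the varying parts."""
    verb = method.upper()
    # The name uses the route template, never the concrete path,
    # so every request to the same endpoint shares one span name.
    name = f"{verb} {route_template}"
    attributes = {
        "http.request.method": verb,
        "http.route": route_template,
    }
    # Concrete identifiers go into attributes under a custom (assumed) namespace.
    for key, value in path_params.items():
        attributes[f"app.{key}"] = value
    return name, attributes


name, attributes = describe_request("get", "/users/{user_id}", {"user_id": "42"})
# name is "GET /users/{user_id}"; the concrete ID lives in attributes["app.user_id"].
```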

Treat external collector ingress as a security boundary

If telemetry crosses clusters, networks, or trust domains, route to patterns that include explicit authentication, encryption, and ownership boundaries.
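One way to make this boundary reviewable is a small check over the parsed receiver settings. The sketch below assumes the settings arrive as a plain dictionary and that the key names (`tls.cert_file`, `tls.key_file`, `tls.client_ca_file`, `auth.authenticator`) match the Collector's server configuration; verify them against the Collector version in use. `ingress_findings` and its messages are hypothetical.

```python
def ingress_findings(otlp_server: dict) -> list[str]:
    """List trust-model gaps in one OTLP receiver protocol block."""
    findings = []
    tls = otlp_server.get("tls") or {}
    auth = otlp_server.get("auth") or {}

    # Encryption: the server needs its own certificate and key.
    if not (tls.get("cert_file") and tls.get("key_file")):
        findings.append("no server TLS certificate configured")

    # Authentication: either mTLS via a client CA, or an authenticator extension.
    if not (tls.get("client_ca_file") or auth.get("authenticator")):
        findings.append("no client authentication configured")

    return findings


exposed = {"endpoint": "0.0.0.0:4317"}
hardened = {
    "endpoint": "0.0.0.0:4317",
    "tls": {
        "cert_file": "/certs/tls.crt",       # illustrative paths
        "key_file": "/certs/tls.key",
        "client_ca_file": "/certs/ca.crt",
    },
}
```

Here `ingress_findings(exposed)` reports both gaps, while `ingress_findings(hardened)` returns an empty list. Ownership boundaries still need a human decision; a check like this only covers encryption and authentication.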

Adapt the topology to the runtime

Ephemeral runtimes like Lambda need different collector and export patterns than long-running Kubernetes workloads.

Choose auto-instrumentation mechanisms deliberately

"Auto-instrumentation" is not a single implementation strategy. The right mechanism depends on runtime behavior, deployment model, and operational constraints.

Prefer declarative and portable configuration where possible

As OTel setups grow, YAML-first or schema-driven configuration becomes easier to review, reuse, and scale than scattered ad hoc flags.

Always connect a playbook to deeper docs

A blog route should be the front door. The implementation details should still come from the local references in this repository.


Common Failure Modes

❌ Routing on company names instead of technical intent

This makes the skill brittle and limits reuse as more upstream blogs are added.

❌ Treating all auto-instrumentation as the same thing

Different runtimes use different mechanisms with different trade-offs.

❌ Putting dynamic context into span or metric names

That breaks aggregation, increases cardinality, and makes dashboards harder to reuse.

❌ Exposing collectors without a clear trust model

External OTLP ingress should be treated as a security-sensitive boundary.

❌ Blocking ephemeral runtimes on exporter completion

Serverless systems need export paths that respect execution and billing limits.
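The idea of decoupling the handler from export can be sketched in-process with a queue and a worker thread. This is only an analogue of the pattern: in Lambda the decoupling is handled by the Collector extension layer, because a function's own process may be frozen once the handler returns. `DeferredExporter` and `handler` are hypothetical names.

```python
import queue
import threading


class DeferredExporter:
    """Accept telemetry immediately and export it on a background thread."""

    def __init__(self, export_fn):
        self._export_fn = export_fn
        self._pending = queue.Queue()
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def submit(self, batch) -> None:
        # Enqueueing is cheap, so the caller is not held up by the backend.
        self._pending.put(batch)

    def flush(self) -> None:
        # Block until everything submitted so far has been exported,
        # for example during a shutdown phase.
        self._pending.join()

    def _run(self) -> None:
        while True:
            batch = self._pending.get()
            try:
                self._export_fn(batch)
            finally:
                self._pending.task_done()


def handler(event: dict, exporter: DeferredExporter) -> dict:
    """Return the response without waiting for the export to finish."""
    exporter.submit({"span": "handle", "request_id": event["request_id"]})
    return {"status": "ok"}
```

The handler's latency now depends only on the enqueue step, and `flush` gives the runtime a single place to drain pending telemetry before shutdown.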

❌ Letting configuration sprawl across ad hoc flags and one-off tweaks

As environments scale, declarative and shared configuration becomes more maintainable.

❌ Answering advanced sampling questions with only basic head-vs-tail advice

Some user questions require consistent probability sampling and TraceState-aware explanations.
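A minimal sketch of the consistent probability sampling decision, assuming the scheme described in the OpenTelemetry probability sampling specification: 56 bits of randomness taken from the end of the trace ID are compared with a rejection threshold, and the threshold is carried in TraceState as a hex `th` value with trailing zeros removed. The function names are hypothetical, and the constants should be verified against the current specification.

```python
RANDOMNESS_BITS = 56
MAX_THRESHOLD = 1 << RANDOMNESS_BITS


def rejection_threshold(probability: float) -> int:
    """Threshold T such that a trace is kept when its randomness R >= T."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return round((1.0 - probability) * MAX_THRESHOLD)


def randomness_from_trace_id(trace_id_hex: str) -> int:
    """Use the last 14 hex digits (56 bits) of a 32-digit trace ID as R."""
    if len(trace_id_hex) != 32:
        raise ValueError("trace ID must be 32 hex digits")
    return int(trace_id_hex[-14:], 16)


def should_sample(trace_id_hex: str, probability: float) -> bool:
    return randomness_from_trace_id(trace_id_hex) >= rejection_threshold(probability)


def encode_threshold(threshold: int) -> str:
    """Encode T as the TraceState `th` value: 14 hex digits, trailing zeros dropped."""
    return f"{threshold:014x}".rstrip("0") or "0"
```

Because every sampler derives its decision from the same randomness value, a trace kept at a lower probability is also kept by any sampler configured with a higher probability. That is the consistency property that basic head-vs-tail advice does not cover.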


Reference Links



Summary

✅ Keep production playbooks generic, reusable, and routing-friendly
✅ Use an expandable 2025-2026 blog routing scan instead of centering the document on one org
✅ Route by technical problem space such as serverless, ingress, logs, metrics, naming, transforms, and sampling
✅ Treat blog posts as entry points and local references as the detailed implementation guides
⚠️ Avoid coupling the skill to company-specific narratives when the same pattern can be expressed generically
⚠️ Keep expanding this index as new upstream blog posts become relevant to the skill
