
nitinjain999/platform-skills

Production-grade platform engineering handbook — Kubernetes, Terraform, Flux CD, GitHub Actions, AWS, and more.

examples/observability/README.md

Status: Stable

Observability Examples

Production-ready alerting rules using the RED method (Request rate, Error rate, Duration).

Examples

| Example | Tool | Description |
|---|---|---|
| prometheus-alerts/orders-service.yaml | Prometheus | RED method alerting rules for a production service |

Quick Start

# Install promtool (included with Prometheus binary)
# https://prometheus.io/download/

# Validate alerting rules syntax and expressions
cd prometheus-alerts
promtool check rules orders-service.yaml

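# Test alert firing logic. `promtool test rules` takes a unit-test file that
# lists orders-service.yaml under rule_files, not the rules file itself
# (orders-service.test.yaml is a placeholder name, not a file confirmed in this repo)
promtool test rules orders-service.test.yaml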

What the Alerts Cover

| Alert | Threshold | Severity | When it fires |
|---|---|---|---|
| HighErrorRate | > 5% errors over 5m | critical | Error rate exceeds SLO budget |
| HighLatencyP99 | > 1s p99 over 5m | warning | Tail latency degraded |
| LowRequestRate | < 0.1 req/s over 10m | warning | Service may be down or receiving no traffic |
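
To make the thresholds concrete, here is a minimal Python sketch that applies the three conditions from the table to values that have already been computed. The function name `evaluate_red_alerts` and its parameters are hypothetical. In practice Prometheus evaluates these conditions from the rule file, and the sketch assumes the 5m and 10m windows were applied before the values reach it.

```python
from typing import List


def evaluate_red_alerts(
    error_ratio: float, p99_seconds: float, requests_per_second: float
) -> List[str]:
    """Return the names of the alerts whose threshold is breached.

    Thresholds mirror the table above. The inputs are assumed to be
    averaged over the windows the table states (hypothetical helper).
    """
    firing = []
    if error_ratio > 0.05:  # more than 5% of requests failed
        firing.append("HighErrorRate")
    if p99_seconds > 1.0:  # p99 latency above 1 second
        firing.append("HighLatencyP99")
    if requests_per_second < 0.1:  # almost no traffic
        firing.append("LowRequestRate")
    return firing


print(evaluate_red_alerts(error_ratio=0.08, p99_seconds=0.4, requests_per_second=12.0))
# prints ['HighErrorRate']
```

Note that the comparisons are strict, so a value sitting exactly on a threshold does not trigger the alert.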

RED Method Applied

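# Request rate: is traffic flowing? (summed across all label combinations)
expr: sum(rate(http_requests_total{job="orders-service"}[5m]))

# Error rate: are requests succeeding?
# Both sides are aggregated with sum() so the label sets match. Without it,
# each 5xx series is divided only by itself and the ratio is always 1.
expr: |
  sum(rate(http_requests_total{job="orders-service",status=~"5.."}[5m]))
  / sum(rate(http_requests_total{job="orders-service"}[5m]))

# Duration (p99): how slow are requests? (buckets aggregated by le for a service-wide quantile)
expr: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{job="orders-service"}[5m])))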
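
For readers who want to see roughly what these expressions compute, the following Python sketch reproduces the arithmetic on hand-written sample data. It is a simplification under stated assumptions: `per_second_rate` ignores counter resets and the edge extrapolation that the real `rate()` performs, and `estimate_quantile` follows the linear-interpolation approach that `histogram_quantile` uses for classic histograms. Both function names and all sample numbers are made up for illustration.

```python
import math
from typing import List, Tuple


def per_second_rate(first_value: float, last_value: float, window_seconds: float) -> float:
    """Average per-second increase of a counter across a window.

    Simplified: no counter-reset handling, no extrapolation at window edges.
    """
    return (last_value - first_value) / window_seconds


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    `buckets` holds (upper_bound, cumulative_count) pairs sorted by
    upper_bound and ending with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0  # lowest bucket is assumed to start at 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


window = 300  # seconds, matching the [5m] range in the expressions
request_rate = per_second_rate(12_000, 15_000, window)
error_rate = per_second_rate(100, 400, window)
print(request_rate)               # prints 10.0
print(error_rate / request_rate)  # prints 0.1

# Cumulative bucket counts, as exposed by a _bucket metric with an le label.
sample_buckets = [(0.1, 900), (0.5, 980), (1.0, 995), (math.inf, 1000)]
p99 = estimate_quantile(0.99, sample_buckets)  # lands between 0.5s and 1s
```

Because the estimate is interpolated inside a bucket, its accuracy depends on how closely the bucket boundaries bracket the latency threshold you alert on.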

Instrument Your Service

To use these alerts, expose the following Prometheus metrics from your service:

// Node.js — prom-client
import { Counter, Histogram } from "prom-client";

const httpRequests = new Counter({
  name: "http_requests_total",
  help: "Total HTTP requests",
  labelNames: ["method", "path", "status"],
});

const httpDuration = new Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request duration",
  labelNames: ["method", "path"],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5],
});

# Python — prometheus-client
from prometheus_client import Counter, Histogram

http_requests = Counter("http_requests_total", "Total HTTP requests", ["method", "path", "status"])
http_duration = Histogram("http_request_duration_seconds", "HTTP request duration", ["method", "path"])

Integrate with Grafana

# Prometheus scrape config
scrape_configs:
  - job_name: orders-service
    static_configs:
      - targets: ["orders-service:9090"]

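Load prometheus-alerts/orders-service.yaml into Prometheus through rule_files (Prometheus evaluates the alerting rules and forwards firing alerts to Alertmanager), and point your Grafana datasource at Prometheus to visualise these metrics.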
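
Before building dashboards, it can help to confirm that Prometheus returns data for the service. The sketch below builds a URL for the Prometheus instant-query endpoint (`/api/v1/query`) and parses a response body. The base URL `http://prometheus:9090`, the helper names, and the sample payload are assumptions for illustration. No network call is made here.

```python
import json
from typing import Dict, Tuple
from urllib.parse import urlencode

Labels = Tuple[Tuple[str, str], ...]


def build_instant_query_url(base_url: str, expr: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


def parse_instant_vector(payload: str) -> Dict[Labels, float]:
    """Map each series' labels to its sample value from a query response."""
    body = json.loads(payload)
    if body.get("status") != "success":
        raise ValueError(f"query failed: {body.get('error', 'unknown error')}")
    values = {}
    for series in body["data"]["result"]:
        labels = tuple(sorted(series["metric"].items()))
        _timestamp, value = series["value"]  # the value arrives as a string
        values[labels] = float(value)
    return values


# Hypothetical Prometheus address; substitute your own.
url = build_instant_query_url(
    "http://prometheus:9090",
    'sum by (job) (rate(http_requests_total{job="orders-service"}[5m]))',
)

# Hand-written example of the response shape for a vector result.
sample_payload = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"job": "orders-service"}, "value": [1700000000, "12.5"]}
        ],
    },
})
print(parse_instant_vector(sample_payload))
# prints {(('job', 'orders-service'),): 12.5}
```

An empty result for the request-rate query usually means the scrape target is not being reached or the job label does not match.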

Checklist

  • Service exposes /metrics on a dedicated port (not the main API port)
  • Metrics follow the naming convention <namespace>_<metric>_<unit>, with _total on counters and base units such as _seconds or _bytes
  • All metrics have job and env labels for alert routing
  • Each alert's for duration is long enough to avoid flapping (5m for critical, 10m for warning)
  • Alertmanager routes critical alerts to PagerDuty, warnings to Slack
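
The effect of the `for` duration on flapping can be illustrated with a small replay of rule evaluations. This is a sketch of the pending and firing behaviour, not Prometheus code: the function `alert_states`, the 60-second evaluation interval, and the sample sequences are assumptions chosen for the example.

```python
from typing import List, Optional


def alert_states(condition_by_eval: List[bool], eval_interval_s: int, for_s: int) -> List[str]:
    """Replay rule evaluations and return the alert state after each one.

    An alert is "pending" once its condition holds and becomes "firing"
    only after the condition has held continuously for at least `for_s`.
    """
    states = []
    active_since: Optional[int] = None
    for index, holds in enumerate(condition_by_eval):
        now = index * eval_interval_s
        if not holds:
            active_since = None  # any gap resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states


# A two-minute blip never pages anyone when the for duration is 5m.
print(alert_states([True, True, False, True], eval_interval_s=60, for_s=300))
# prints ['pending', 'pending', 'inactive', 'pending']

# A sustained breach fires once it has lasted the full 5 minutes.
print(alert_states([True] * 6, eval_interval_s=60, for_s=300))
# prints ['pending', 'pending', 'pending', 'pending', 'pending', 'firing']
```

A longer `for` duration trades detection speed for fewer false pages, which is why the checklist suggests a shorter value for critical alerts than for warnings.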

See Also

  • references/observability.md — structured logging, Prometheus metrics, OpenTelemetry tracing, Grafana dashboards, k6 load testing, capacity planning
  • /platform-skills:observability — instrument services, build dashboards, write alerts, run load tests, plan capacity
