Configures monitoring systems, implements structured logging pipelines, builds Prometheus/Grafana stacks and dashboards, defines alerting rules, and instruments distributed tracing. Also conducts load testing, performs application profiling, and plans infrastructure capacity. Use when setting up application monitoring, adding observability to services, debugging production issues with logs/metrics/traces, running load tests with k6 or Artillery, profiling CPU/memory bottlenecks, or forecasting capacity needs.
97
100%
Does it follow best practices?
Impact
95%
1.17x
Average score across 6 eval scenarios
Passed
No known issues
Structured logging and Prometheus metrics instrumentation
Uses pino for logging: 0%, 100%
JSON structured fields: 60%, 100%
Request ID correlation: 100%, 100%
Sensitive field redaction: 0%, 100%
No sensitive data logged: 100%, 100%
Correct metric types: 80%, 100%
Metric naming convention: 50%, 50%
Business metrics present: 100%, 100%
Health check endpoint: 100%, 100%
Metrics endpoint: 100%, 100%
Consistent event naming: 0%, 100%
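The redaction, request ID, and event naming criteria above can be illustrated with a small dependency-free sketch. It is not the skill's implementation and does not use pino (pino provides its own `redact` option for this purpose). The key list, the `[REDACTED]` marker, and the `logLine` helper are all assumptions made for illustration.

```javascript
// Hypothetical list of field names that must never reach the logs.
const SENSITIVE_KEYS = new Set(["password", "token", "authorization"]);

// Recursively copy a value, replacing sensitive fields with a fixed marker.
function redact(value) {
  if (Array.isArray(value)) return value.map(redact);
  if (value !== null && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value).map(([key, inner]) =>
        SENSITIVE_KEYS.has(key.toLowerCase())
          ? [key, "[REDACTED]"]
          : [key, redact(inner)]
      )
    );
  }
  return value;
}

// Build one JSON log line with a dotted event name and a request ID
// so that all lines from the same request can be correlated.
function logLine(event, requestId, fields) {
  return JSON.stringify({
    ...redact(fields),
    level: "info",
    time: new Date().toISOString(),
    event,
    requestId,
  });
}

const line = logLine("user.login", "req-123", {
  user: { id: 7, password: "hunter2" },
});
console.log(line);
```

Placing `event` and `requestId` after the spread means caller-supplied fields cannot overwrite them, which keeps correlation reliable.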
Prometheus alerting rules and Grafana dashboard design
Severity classification: 100%, 100%
Threshold-based alerting: 90%, 100%
Alert 'for' duration: 100%, 100%
Runbook URL annotations: 100%, 100%
Summary and description annotations: 100%, 100%
Alertmanager severity routing: 100%, 100%
RED method in dashboard: 100%, 100%
USE method in dashboard: 100%, 100%
predict_linear for capacity: 100%, 100%
Capacity forecast horizon: 100%, 100%
Business metrics in dashboard: 100%, 100%
No noisy alert patterns: 100%, 100%
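For readers unfamiliar with the `predict_linear` criterion above: PromQL's `predict_linear()` fits a line to recent samples by simple linear regression and extrapolates it forward. The sketch below shows that idea in plain JavaScript. It is a simplification (it extrapolates from the last sample rather than from the query evaluation time), and the sample data and the four-hour horizon are assumptions chosen for illustration.

```javascript
// Least-squares linear extrapolation, similar in spirit to predict_linear().
// `samples` is a hypothetical array of { t: seconds, v: value } points.
function predictLinear(samples, secondsAhead) {
  if (samples.length < 2) {
    throw new Error("need at least two samples to fit a line");
  }
  const n = samples.length;
  const meanT = samples.reduce((sum, s) => sum + s.t, 0) / n;
  const meanV = samples.reduce((sum, s) => sum + s.v, 0) / n;

  let covariance = 0;
  let variance = 0;
  for (const s of samples) {
    covariance += (s.t - meanT) * (s.v - meanV);
    variance += (s.t - meanT) ** 2;
  }
  if (variance === 0) {
    throw new Error("samples must span more than one timestamp");
  }

  const slope = covariance / variance;
  const intercept = meanV - slope * meanT;

  // Extrapolate from the timestamp of the last sample.
  const lastT = samples[n - 1].t;
  return intercept + slope * (lastT + secondsAhead);
}

// Assumed data: disk usage growing by 1 GiB per hour.
const diskUsedGiB = [
  { t: 0, v: 70 },
  { t: 3600, v: 71 },
  { t: 7200, v: 72 },
  { t: 10800, v: 73 },
];
const forecast = predictLinear(diskUsedGiB, 4 * 3600);
console.log(`forecast in 4h: ${forecast.toFixed(1)} GiB`);
```

An alert built on this idea fires when the forecast crosses a capacity limit within the chosen horizon, rather than when the limit is already reached.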
k6 load testing with staged profiles and performance thresholds
Uses k6: 100%, 100%
Staged load profile: 100%, 100%
p95 latency threshold: 100%, 100%
p99 latency threshold: 0%, 100%
Error rate threshold: 100%, 100%
Multi-step user journey: 100%, 100%
Think time between steps: 100%, 100%
Custom business metrics: 100%, 100%
Response validation: 100%, 100%
Test plan explains thresholds: 100%, 100%
OpenTelemetry distributed tracing instrumentation
NodeSDK import: 100%, 100%
OTLPTraceExporter used: 100%, 100%
Jaeger OTLP URL: 50%, 50%
Service name resource attribute: 87%, 100%
Auto-instrumentations enabled: 100%, 100%
Manual span creation: 100%, 100%
Span attributes set: 100%, 100%
Exception recording: 100%, 100%
SpanStatusCode set: 100%, 100%
Context extraction from incoming requests: 100%, 100%
Context injection into outgoing requests: 100%, 100%
span.end() always called: 100%, 100%
Application profiling with CPU and memory diagnostics
clinic.js CPU profiling: 100%, 100%
clinic bubbleprof mentioned: 50%, 100%
performance.mark used: 100%, 100%
performance.measure used: 100%, 100%
PerformanceObserver used: 100%, 100%
process.memoryUsage called: 100%, 100%
v8.writeHeapSnapshot used: 100%, 100%
Profiling workflow documented: 100%, 100%
Symptom-to-tool mapping: 100%, 100%
No GUI-only tools prescribed: 100%, 80%
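Several of the profiling criteria above refer to Node.js built-ins. As a rough sketch of how `performance.mark`, `performance.measure`, and `process.memoryUsage` fit together, the example below times a synchronous function and samples heap usage around it. The `timeSync` and `heapUsedMiB` helpers and the workload are hypothetical, and the sketch assumes a recent Node.js version where `performance` is a global and `performance.measure` returns the measure entry. It omits `PerformanceObserver` to stay synchronous.

```javascript
// Time a synchronous function using User Timing marks and a measure.
function timeSync(label, fn) {
  performance.mark(`${label}-start`);
  const result = fn();
  performance.mark(`${label}-end`);
  const measure = performance.measure(label, `${label}-start`, `${label}-end`);
  return { result, durationMs: measure.duration };
}

// Current V8 heap usage in MiB, taken from process.memoryUsage().
function heapUsedMiB() {
  return process.memoryUsage().heapUsed / (1024 * 1024);
}

const heapBefore = heapUsedMiB();
const { result, durationMs } = timeSync("build-array", () => {
  // Hypothetical workload: allocate and fill an array.
  const values = [];
  for (let i = 0; i < 1e5; i++) values.push(i * 2);
  return values.length;
});
const heapAfter = heapUsedMiB();

console.log(`built ${result} items in ${durationMs.toFixed(2)} ms`);
console.log(`heap change: ${(heapAfter - heapBefore).toFixed(2)} MiB`);
```

The heap delta is only indicative, since garbage collection can run at any point between the two samples.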
Capacity planning with resource forecasting and performance budgets
CPU buffer 30%: 100%, 100%
Memory buffer 20%: 0%, 100%
Connection buffer 25%: 12%, 100%
Storage buffer 40%: 0%, 100%
predict_linear for memory: 100%, 100%
predict_linear for disk: 100%, 100%
Performance budget apiP95: 0%, 100%
Performance budget apiP99: 16%, 100%
Performance budget error rate: 100%, 100%
Scale-up trigger at 80% CPU: 0%, 25%
Planning trigger at 70% CPU: 50%, 50%
Scale-down trigger defined: 0%, 0%
Instance count calculation: 75%, 100%
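The buffer percentages and CPU triggers named in the criteria above could be combined in an instance count calculation along the following lines. This is a sketch under assumptions: it interprets each buffer as the fraction of per-instance capacity held back as headroom, which the listing does not confirm, and the function names and example figures are hypothetical. No scale-down threshold is included because the listing does not state one.

```javascript
// Buffer fractions taken from the criteria names; treating them as
// reserved headroom per instance is an assumption of this sketch.
const BUFFERS = { cpu: 0.30, memory: 0.20, connections: 0.25, storage: 0.40 };

// Instances needed so that peak demand fits within usable capacity.
function requiredInstances(peakDemand, perInstanceCapacity, buffer) {
  const usable = perInstanceCapacity * (1 - buffer);
  return Math.ceil(peakDemand / usable);
}

// Map CPU utilization (0..1) to an action using the 80% and 70% triggers.
function cpuAction(utilization) {
  if (utilization >= 0.80) return "scale-up";
  if (utilization >= 0.70) return "plan";
  return "ok";
}

// Assumed example: 12 cores of peak demand on 4-core instances.
const instances = requiredInstances(12, 4, BUFFERS.cpu);
console.log(`instances needed: ${instances}`);
console.log(`action at 72% CPU: ${cpuAction(0.72)}`);
```

With a 30% buffer each 4-core instance contributes about 2.8 usable cores, so the calculation rounds 12 / 2.8 up to the next whole instance.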
5b76101