Service metrics, RED metrics (Rate, Errors, Duration), and runtime-specific telemetry for .NET, Java, Node.js, Python, PHP, and Go applications.
Monitor application service performance, health, and runtime-specific metrics using DQL.
Monitor service Rate, Errors, Duration using metrics-based timeseries queries.
Key Metrics:
- dt.service.request.response_time - Response time (microseconds)
- dt.service.request.count - Request count
- dt.service.request.failure_count - Failed request count
Quick Example:
```
timeseries {
  p95 = percentile(dt.service.request.response_time, 95),
  total_requests = sum(dt.service.request.count),
  failures = sum(dt.service.request.failure_count)
}, by: {dt.service.name}
| fieldsAdd p95_ms = p95[] / 1000, error_rate_pct = (failures[] * 100.0) / total_requests[]
```

→ For detailed queries: See references/service-metrics.md
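Note the unit handling in the final fieldsAdd: dt.service.request.response_time is reported in microseconds, so dividing by 1000 yields milliseconds, and multiplying failures by 100.0 forces floating-point division for the error rate. The same arithmetic, sketched in Python with hypothetical sample values:

```python
# Illustrative only: mirrors the query's fieldsAdd arithmetic with made-up numbers.
p95_us = 245_000          # hypothetical p95 response time in microseconds
total_requests = 12_000   # hypothetical sum(dt.service.request.count)
failures = 180            # hypothetical sum(dt.service.request.failure_count)

p95_ms = p95_us / 1000                                 # microseconds -> milliseconds
error_rate_pct = (failures * 100.0) / total_requests   # 100.0 forces float division

print(p95_ms)           # 245.0
print(error_rate_pct)   # 1.5
```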
Span-based queries for complex scenarios requiring flexible filtering and custom aggregations.
Quick Example:
```
fetch spans, from: now() - 1h
| filter request.is_root_span == true
| fieldsAdd meets_sla = if(request.is_failed == false AND duration < 3s, 1, else: 0)
| summarize total = count(), sla_compliant = sum(meets_sla), by: {dt.service.name}
| fieldsAdd sla_compliance_pct = (sla_compliant * 100.0) / total
```

→ For detailed queries: See references/service-metrics.md
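The per-span SLA flag and the final compliance percentage can be reproduced step by step; a minimal Python sketch with hypothetical span records (the field names and 3-second threshold mirror the query above):

```python
# Hypothetical span records mirroring request.is_failed and duration (seconds).
spans = [
    {"failed": False, "duration_s": 1.2},
    {"failed": False, "duration_s": 2.8},
    {"failed": False, "duration_s": 4.5},   # too slow: violates the 3 s SLA
    {"failed": True,  "duration_s": 0.9},   # failed request: violates the SLA
]

# meets_sla = if(request.is_failed == false AND duration < 3s, 1, else: 0)
sla_compliant = sum(1 for s in spans if not s["failed"] and s["duration_s"] < 3)
total = len(spans)
sla_compliance_pct = (sla_compliant * 100.0) / total

print(sla_compliance_pct)  # 50.0
```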
Monitor message-based service communication (queues, topics).
Key Metrics:
- dt.service.messaging.publish.count - Messages sent to queues or topics
- dt.service.messaging.receive.count - Messages received from queues or topics
- dt.service.messaging.process.count - Messages successfully processed
- dt.service.messaging.process.failure_count - Messages that failed processing
Quick Example:
```
timeseries {
  published = sum(dt.service.messaging.publish.count),
  received = sum(dt.service.messaging.receive.count),
  processed = sum(dt.service.messaging.process.count),
  failed = sum(dt.service.messaging.process.failure_count)
}, by: {dt.service.name}
```

→ For detailed queries: See references/service-metrics.md
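From these four counters you can derive health indicators such as a processing failure rate and an unprocessed backlog; a Python sketch with hypothetical counter values (the derived-metric definitions are illustrative, not part of the skill):

```python
# Hypothetical counter values for one service over the query window.
published = 10_000   # sum(dt.service.messaging.publish.count)
received = 9_800     # sum(dt.service.messaging.receive.count)
processed = 9_500    # sum(dt.service.messaging.process.count)
failed = 300         # sum(dt.service.messaging.process.failure_count)

# Share of received messages whose processing failed (guard against zero).
process_failure_pct = (failed * 100.0) / received if received else 0.0
# Messages received but neither processed nor failed yet.
backlog = received - processed - failed

print(round(process_failure_pct, 2))  # 3.06
print(backlog)                        # 0
```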
Monitor service mesh ingress performance and overhead.
Key Metrics:
- dt.service.request.service_mesh.response_time - Mesh response time (microseconds)
- dt.service.request.service_mesh.count - Mesh request count
- dt.service.request.service_mesh.failure_count - Mesh failure count
Quick Example:
```
timeseries {
  direct_p95 = percentile(dt.service.request.response_time, 95),
  mesh_p95 = percentile(dt.service.request.service_mesh.response_time, 95)
}, by: {dt.service.name}
| fieldsAdd mesh_overhead_ms = (mesh_p95[] - direct_p95[]) / 1000
```

→ For detailed queries: See references/service-metrics.md
Technology-specific runtime performance and resource usage metrics.
Java/JVM - references/java.md
Node.js - references/nodejs.md
.NET CLR - references/dotnet.md
Python - references/python.md
PHP - references/php.md
Go - references/go.md
✅ Use for: service performance (RED metrics), SLA analysis, messaging metrics, service mesh monitoring, and runtime-specific telemetry queries
Map user questions to capabilities:
| User Request | Use Capability | Key Files |
|---|---|---|
| "service performance", "response time", "error rate" | Service Performance (RED) | service-metrics.md |
| "SLA tracking", "health scoring" | Advanced Service Analysis | service-metrics.md |
| "service mesh", "Istio", "Linkerd", "mesh overhead" | Service Mesh Monitoring | service-metrics.md |
| "messaging", "queue", "topic", "publish", "consumer" | Service Messaging Metrics | service-metrics.md |
| "JVM GC", "Java memory", "heap" | Runtime-Specific (Java) | java.md |
| "Node.js event loop", "V8 heap" | Runtime-Specific (Node.js) | nodejs.md |
| ".NET CLR", "GC generation" | Runtime-Specific (.NET) | dotnet.md |
| "Python GC", "thread count" | Runtime-Specific (Python) | python.md |
| "OPcache", "PHP GC" | Runtime-Specific (PHP) | php.md |
| "goroutines", "Go GC", "scheduler" | Runtime-Specific (Go) | go.md |
1. Metrics-based (timeseries)

```
timeseries <metric> = <aggregation>(<metric_name>), by: {dimensions}
```

2. Span-based (fetch spans)

```
fetch spans | filter request.is_root_span == true | fieldsAdd ... | summarize ...
```

3. Comparison queries
- `append` for baseline comparisons
- `shift: -15m` for time-shifted baselines
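The shift: -15m baseline pattern boils down to comparing the current window against the same series 15 minutes earlier; the deviation math, sketched in Python with hypothetical p95 values:

```python
# Hypothetical p95 values: current window vs. the shift: -15m baseline window.
current_p95_us = 280_000    # p95 response time now, in microseconds
baseline_p95_us = 200_000   # p95 response time 15 minutes ago

# Percent deviation of the current window from the shifted baseline.
deviation_pct = ((current_p95_us - baseline_p95_us) * 100.0) / baseline_p95_us

print(deviation_pct)  # 40.0
```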
Always include a time range and entity dimensions (dt.service.name, k8s.workload.name, etc.).

When referencing runtime-specific content, load only the reference file for the runtime in question.

Performance investigation:
1. Check response time (RED metrics)
2. Check error rate (RED metrics)
3. Check traffic patterns (RED metrics)
4. If runtime-specific issues are suspected → load the runtime-specific reference

SLA compliance tracking:
1. Define SLA criteria (e.g., < 3s response time AND < 1% error rate)
2. Use span-based query for custom SLA logic
3. Calculate compliance percentage
4. Filter to non-compliant services

Service mesh analysis:
1. Check mesh response time
2. Compare mesh vs direct performance
3. Calculate mesh overhead
4. Analyze mesh failure rates

Core Service Monitoring:
- references/service-metrics.md

Runtime-Specific Monitoring:
- references/java.md
- references/nodejs.md
- references/dotnet.md
- references/python.md
- references/php.md
- references/go.md