Service performance monitoring with RED metrics (Rate, Errors, Duration) and runtime-specific telemetry for Java, .NET, Node.js, Python, PHP, and Go. Use when analyzing service health, SLA compliance, or runtime issues. Trigger: "service response time", "error rate", "throughput", "SLA compliance", "service mesh overhead", "JVM GC", "Java heap", "Node.js event loop", ".NET CLR", "Python threads", "PHP OPcache", "Go goroutines", "service performance", "p95 latency", "request failures", "database response time by name". Do NOT use for explaining existing queries, product documentation questions, infrastructure metrics (use dt-obs-hosts), log analysis (use dt-obs-logs), or distributed tracing workflows (use dt-obs-tracing).
Monitor application service performance, health, and runtime-specific metrics using DQL.
Monitor service Rate, Errors, Duration using metrics-based timeseries queries.
Key Metrics:
dt.service.request.response_time - Response time (microseconds)
dt.service.request.count - Request count
dt.service.request.failure_count - Failed request count
Common Use Cases:
Quick Example:
timeseries {
p95 = percentile(dt.service.request.response_time, 95),
total_requests = sum(dt.service.request.count),
failures = sum(dt.service.request.failure_count)
}, by: {dt.service.name}
| fieldsAdd p95_ms = p95[] / 1000, error_rate_pct = (failures[] * 100.0) / total_requests[]
→ For detailed queries: See references/service-metrics.md
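The `fieldsAdd` step in the quick example applies two element-wise calculations to each time bucket: microseconds to milliseconds, and failures as a percentage of total requests. The following is a minimal Python sketch of that arithmetic only. The helper name `red_summary` and the sample values are hypothetical, and returning `None` for buckets with no requests is an assumption of this sketch, not documented DQL behavior.

```python
from typing import Optional


def red_summary(p95_us: list[float], total: list[float], failures: list[float]):
    """Mirror the fieldsAdd arithmetic for one service, bucket by bucket."""
    # Response time is reported in microseconds; divide by 1000 for milliseconds.
    p95_ms = [v / 1000 for v in p95_us]
    # Error rate in percent; buckets with no requests yield None (sketch assumption).
    error_rate_pct: list[Optional[float]] = [
        (f * 100.0) / t if t else None for f, t in zip(failures, total)
    ]
    return p95_ms, error_rate_pct


# Hypothetical sample data: three time buckets for one service.
p95_ms, error_rate_pct = red_summary(
    p95_us=[250_000, 1_200_000, 480_000],
    total=[200, 400, 0],
    failures=[2, 10, 0],
)
print(p95_ms)          # [250.0, 1200.0, 480.0]
print(error_rate_pct)  # [1.0, 2.5, None]
```

Guarding the division matters when a service receives no traffic in a bucket, since the percentage is undefined there.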
Span-based queries for complex scenarios requiring flexible filtering and custom aggregations.
Use Cases:
Quick Example:
fetch spans, from: now() - 1h | filter request.is_root_span == true
| fieldsAdd meets_sla = if(request.is_failed == false AND duration < 3s, 1, else: 0)
| summarize total = count(), sla_compliant = sum(meets_sla), by: {dt.service.name}
| fieldsAdd sla_compliance_pct = (sla_compliant * 100.0) / total
→ For detailed queries: See references/service-metrics.md
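The compliance logic of the span query can be traced in plain Python: keep root spans, mark each as compliant when it did not fail and finished in under 3 seconds, then compute the compliant share per service. This is a sketch under stated assumptions. The dictionary keys (`service`, `is_root_span`, `is_failed`, `duration_s`) are simplified stand-ins for the span fields, and the sample spans are invented for illustration.

```python
from collections import defaultdict

SLA_MAX_DURATION_S = 3.0  # same 3 s cutoff as the query


def sla_compliance(spans):
    """Return {service: compliance percentage}, counting root spans only."""
    total = defaultdict(int)
    compliant = defaultdict(int)
    for span in spans:
        if not span["is_root_span"]:
            continue  # mirrors: filter request.is_root_span == true
        service = span["service"]
        total[service] += 1
        # mirrors: if(request.is_failed == false AND duration < 3s, 1, else: 0)
        if not span["is_failed"] and span["duration_s"] < SLA_MAX_DURATION_S:
            compliant[service] += 1
    return {s: (compliant[s] * 100.0) / total[s] for s in total}


# Hypothetical spans for two services.
spans = [
    {"service": "checkout", "is_root_span": True, "is_failed": False, "duration_s": 0.4},
    {"service": "checkout", "is_root_span": True, "is_failed": True, "duration_s": 0.2},
    {"service": "checkout", "is_root_span": True, "is_failed": False, "duration_s": 3.5},
    {"service": "checkout", "is_root_span": True, "is_failed": False, "duration_s": 1.1},
    {"service": "checkout", "is_root_span": False, "is_failed": False, "duration_s": 0.1},
    {"service": "search", "is_root_span": True, "is_failed": False, "duration_s": 0.9},
]
result = sla_compliance(spans)
print(result)  # {'checkout': 50.0, 'search': 100.0}
```

A fast request that failed and a successful request that took too long both count against compliance, which is why the condition combines the two checks with AND.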
Monitor message-based service communication (queues, topics).
Key Metrics:
dt.service.messaging.publish.count - Messages sent to queues or topics
dt.service.messaging.receive.count - Messages received from queues or topics
dt.service.messaging.process.count - Messages successfully processed
dt.service.messaging.process.failure_count - Messages that failed processing
Use Cases:
Quick Example:
timeseries {
published = sum(dt.service.messaging.publish.count),
received = sum(dt.service.messaging.receive.count),
processed = sum(dt.service.messaging.process.count),
failed = sum(dt.service.messaging.process.failure_count)
}, by: {dt.service.name}
→ For detailed queries: See references/service-metrics.md
Monitor service mesh ingress performance and overhead.
Key Metrics:
dt.service.request.service_mesh.response_time - Mesh response time (microseconds)
dt.service.request.service_mesh.count - Mesh request count
dt.service.request.service_mesh.failure_count - Mesh failure count
Use Cases:
Quick Example:
timeseries {
direct_p95 = percentile(dt.service.request.response_time, 95),
mesh_p95 = percentile(dt.service.request.service_mesh.response_time, 95)
}, by: {dt.service.name}
| fieldsAdd mesh_overhead_ms = (mesh_p95[] - direct_p95[]) / 1000
→ For detailed queries: See references/service-metrics.md
Technology-specific runtime performance and resource usage metrics.
Java/JVM - references/java.md
Node.js - references/nodejs.md
.NET CLR - references/dotnet.md
Python - references/python.md
PHP - references/php.md
Go - references/go.md
✅ Use for:
❌ Don't use for:
Product documentation or configuration questions (use ask-dynatrace-docs)

When a user asks for analysis (threshold checks, anomaly detection, performance comparisons), proceed immediately with sensible defaults. Do not ask the user for parameter values you can reasonably assume.
Why this matters: analysis tools (e.g., static-threshold-analyzer) require specific
inputs like threshold values and service scope. The user expects results, not a
parameter interview. Pick reasonable defaults, state them clearly in the response,
and let the user refine.
Default values when not specified:
| Parameter | Default | Rationale |
|---|---|---|
| Response time threshold | 1000 ms (= 1,000,000 µs in the metric's base unit) | Common SLA boundary |
| Service scope | All services | Show the most relevant violations |
| Timeframe | From the request, or last 30 min for threshold checks, 2h for general analysis | Matches typical operational windows |
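The defaults table can be read as a small resolution rule: take whatever the user supplied, fall back to the table value otherwise, and convert the threshold to the metric's base unit. The sketch below is illustrative only. `AnalysisParams`, `resolve_params`, and the analysis type labels `"threshold_check"` and `"general"` are hypothetical names, not part of any Dynatrace tool.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_THRESHOLD_MS = 1000  # common SLA boundary from the table
DEFAULT_TIMEFRAME = {"threshold_check": "30m", "general": "2h"}


@dataclass
class AnalysisParams:
    threshold_us: int   # response time metrics use microseconds
    service_scope: str
    timeframe: str


def resolve_params(
    analysis_type: str,
    threshold_ms: Optional[int] = None,
    service: Optional[str] = None,
    timeframe: Optional[str] = None,
) -> AnalysisParams:
    """Fill unspecified parameters with the documented defaults."""
    ms = threshold_ms if threshold_ms is not None else DEFAULT_THRESHOLD_MS
    return AnalysisParams(
        threshold_us=ms * 1000,  # 1000 ms -> 1,000,000 µs
        service_scope=service or "all services",
        timeframe=timeframe or DEFAULT_TIMEFRAME.get(analysis_type, "2h"),
    )


defaults = resolve_params("threshold_check")
custom = resolve_params("general", threshold_ms=500, timeframe="24h")
print(defaults)  # AnalysisParams(threshold_us=1000000, service_scope='all services', timeframe='30m')
print(custom)    # AnalysisParams(threshold_us=500000, service_scope='all services', timeframe='24h')
```

Whatever values are resolved this way should still be stated in the response so the user can refine them.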
Example: threshold violation request
1. Use create-dql to build a timeseries query for avg(dt.service.request.response_time) grouped by dt.smartscape.service
2. Run static-threshold-analyzer with threshold = 1000000 (µs), alertCondition = ABOVE
3. Use get-entity-name to resolve the returned service IDs to names

Reading user phrasing: Phrases like "the fixed threshold", "a threshold", or "the limit" name the type of analysis (a static threshold check), not a specific number the user expects you to already know. "Fixed" distinguishes a static cutoff from a dynamic or seasonal baseline. When you see these phrases, apply the 1000 ms default from the table above and present results. The user can then refine if the default doesn't match their intent.
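To make the idea of a static ABOVE check concrete, here is a plain-Python stand-in that counts how many time buckets per service exceed a fixed cutoff. It is a sketch of the concept, not the actual algorithm of static-threshold-analyzer. The function name `violations_above`, the service names, and the sample values are hypothetical.

```python
def violations_above(series_by_service, threshold_us):
    """Return {service: violating bucket count} for services with at least one violation."""
    result = {}
    for service, values in series_by_service.items():
        # Skip empty buckets (None); count values strictly above the cutoff.
        count = sum(1 for v in values if v is not None and v > threshold_us)
        if count:
            result[service] = count
    return result


# Hypothetical average response times in microseconds, one value per bucket.
series = {
    "checkout": [850_000, 1_250_000, 1_400_000],
    "search": [120_000, None, 300_000],
}
violations = violations_above(series, threshold_us=1_000_000)
print(violations)  # {'checkout': 2}
```

The threshold is passed in microseconds because that is the metric's base unit, so the 1000 ms default becomes 1,000,000.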
This skill covers service performance metrics and runtime monitoring only. If the
user asks a product documentation or configuration question (e.g., "How do I add custom
sensors?", "How do I configure service detection?"), use ask-dynatrace-docs instead —
this skill does not contain configuration how-tos.
Map user questions to capabilities:
| User Request | Use Capability | Key Files |
|---|---|---|
| "service performance", "response time", "error rate" | Service Performance (RED) | service-metrics.md |
| "SLA tracking", "health scoring" | Advanced Service Analysis | service-metrics.md |
| "service mesh", "Istio", "Linkerd", "mesh overhead" | Service Mesh Monitoring | service-metrics.md |
| "messaging", "queue", "topic", "publish", "consumer" | Service Messaging Metrics | service-metrics.md |
| "JVM GC", "Java memory", "heap" | Runtime-Specific (Java) | java.md |
| "Node.js event loop", "V8 heap" | Runtime-Specific (Node.js) | nodejs.md |
| ".NET CLR", "GC generation" | Runtime-Specific (.NET) | dotnet.md |
| "Python GC", "thread count" | Runtime-Specific (Python) | python.md |
| "OPcache", "PHP GC" | Runtime-Specific (PHP) | php.md |
| "goroutines", "Go GC", "scheduler" | Runtime-Specific (Go) | go.md |
1. Metrics-based (timeseries)
timeseries <metric> = <aggregation>(<metric_name>), by: {dimensions}
2. Span-based (fetch spans)
fetch spans | filter request.is_root_span == true | fieldsAdd ... | summarize ...
3. Comparison queries
append for baseline comparison
shift: -15m for time-shifted baselines
Always include:
dt.service.name, k8s.workload.name, etc.)
When referencing runtime-specific content:
1. Check response time (RED metrics)
2. Check error rate (RED metrics)
3. Check traffic patterns (RED metrics)
4. If runtime-specific issues suspected → Load runtime-specific reference

1. Define SLA criteria (e.g., < 3s response time AND < 1% error rate)
2. Use span-based query for custom SLA logic
3. Calculate compliance percentage
4. Filter non-compliant services

1. Check mesh response time
2. Compare mesh vs direct performance
3. Calculate mesh overhead
4. Analyze mesh failure rates

| Problem | Cause | Solution |
|---|---|---|
| Response time values look too large | Metric is in microseconds | Divide by 1000 to convert to milliseconds |
| No data for service mesh metrics | Service mesh not configured | Verify mesh sidecar injection is enabled |
| Runtime metrics missing | Wrong technology or no OneAgent | Confirm the runtime is supported and OneAgent is active |
| dt.smartscape.service returns SmartscapeId, not name | Need entity name resolution | Use getNodeName(dt.smartscape.service) |
| Error rate always zero | Using wrong failure metric | Use dt.service.request.failure_count, not custom fields |
Core Service Monitoring:
Runtime-Specific Monitoring: