Distributed traces, spans, service dependencies, and request flow analysis. Use when investigating span-level details, failures, performance bottlenecks, or trace correlation. Trigger: "trace analysis", "slow requests", "failed spans", "service dependencies", "distributed trace", "span details", "HTTP status codes in traces", "database query spans", "messaging spans", "gRPC calls", "Lambda cold starts", "trace ID lookup", "exception analysis", "correlate logs and traces", "request attributes". Do NOT use for explaining existing queries, product documentation or configuration questions, service-level RED metrics (use dt-obs-services), log searching (use dt-obs-logs), or problem analysis (use dt-obs-problems).
Distributed traces in Dynatrace consist of spans - building blocks representing units of work. With Traces in Grail, every span is accessible via DQL with full-text searchability on all attributes. This skill covers trace fundamentals, common analysis patterns, and span-type specific queries.
Spans represent logical units of work in distributed traces:
Span kinds:

- span.kind: server - Incoming call to a service
- span.kind: client - Outgoing call from a service
- span.kind: consumer - Incoming message consumption call to a service
- span.kind: producer - Outgoing message production call from a service
- span.kind: internal - Internal operation within a service

Root spans: A request root span (request.is_root_span == true) represents an incoming call to a service. Use this to analyze end-to-end request performance.

Essential attributes for trace analysis:
| Attribute | Description |
|---|---|
| trace.id | Unique trace identifier |
| span.id | Unique span identifier |
| span.parent_id | Parent span ID (null for root spans) |
| request.is_root_span | Boolean, true for request entry points |
| request.is_failed | Boolean, true if request failed |
| duration | Span duration in nanoseconds |
| span.timing.cpu | Overall CPU time of the span (stable) |
| span.timing.cpu_self | CPU time excluding child spans (stable) |
| dt.smartscape.service | Service Smartscape node ID |
| dt.service.name | Dynatrace service name derived from service detection rules. It is equal to the Smartscape service node name. |
| endpoint.name | Endpoint/route name |

Spans reference services via Smartscape node IDs and the detected service name dt.service.name which is also present on every span.
```dql
fetch spans
| summarize spans=count(), by: { dt.smartscape.service, dt.service.name }
```

Node functions:

- getNodeName(dt.smartscape.service) - Adds dt.smartscape.service.name field with the human-readable service name
- getNodeField(dt.smartscape.service, "attribute_name") - Access specific node attributes

📖 Learn more: See Entity Lookups for advanced entity selectors, infrastructure correlation, and hardware analysis.

One span can represent multiple real operations due to:
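- Aggregation: several operations folded into one span (aggregation.count)
- Sampling: only a fraction of spans is kept (reflected in sampling.threshold)
- Read sampling: the query reads only a fraction of stored spans, via the samplingRatio parameter

When to extrapolate: Always extrapolate when counting actual operations (not just spans). Use the multiplicity factor:
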
```dql
fetch spans
| fieldsAdd sampling.probability = (power(2, 56) - coalesce(sampling.threshold, 0)) * power(2, -56)
| fieldsAdd sampling.multiplicity = 1 / sampling.probability
| fieldsAdd multiplicity = coalesce(sampling.multiplicity, 1)
    * coalesce(aggregation.count, 1)
    * dt.system.sampling_ratio
| summarize operation_count = sum(multiplicity)
```

📖 Learn more: See Sampling and Extrapolation for detailed formulas and examples.

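The arithmetic behind the multiplicity factor can be checked by hand. The following is a minimal Python sketch of the same formula, not a Dynatrace API: the function name `span_multiplicity` and all input values are hypothetical, and `None` stands in for a missing field the way `coalesce` does in the query.

```python
def span_multiplicity(sampling_threshold, aggregation_count, read_sampling_ratio):
    """Number of real operations that one stored span stands for."""
    # None mirrors a missing field, like coalesce(field, default) in the query.
    threshold = 0 if sampling_threshold is None else sampling_threshold
    # Probability that the span was kept: (2^56 - threshold) / 2^56.
    probability = (2**56 - threshold) * 2**-56
    sampling_multiplicity = 1 / probability
    count = 1 if aggregation_count is None else aggregation_count
    return sampling_multiplicity * count * read_sampling_ratio


# No sampling, no aggregation, full read: one span is one operation.
unsampled = span_multiplicity(None, None, 1)

# Hypothetical span: threshold 2^55 (kept with probability 0.5),
# aggregation.count = 3, read sampling ratio = 10.
half_sampled = span_multiplicity(2**55, 3, 10)

print(unsampled)     # 1.0
print(half_sampled)  # 60.0
```

In the second case a single stored span accounts for 2 × 3 × 10 = 60 operations, which is why a plain `count()` can undercount by a wide margin.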
Fetch spans and explore by type:
```dql
fetch spans | limit 1
```

Explore spans by function and type:

```dql
fetch spans
| summarize count(), by: { span.kind, code.namespace, code.function }
```

List request root spans (incoming service calls):

```dql
fetch spans
| filter request.is_root_span == true
| fields trace.id, span.id, start_time, response_time = duration, endpoint.name
| limit 100
```

Analyze service performance with error rates:

```dql
fetch spans
| filter request.is_root_span == true
| summarize
    total_requests = count(),
    failed_requests = countIf(request.is_failed == true),
    avg_duration = avg(duration),
    p95_duration = percentile(duration, 95),
    by: {dt.service.name}
| fieldsAdd error_rate = (failed_requests * 100.0) / total_requests
| sort error_rate desc
```

Find all spans in a specific trace:

```dql
fetch spans
| filter trace.id == toUid("abc123def456")
| fields span.name, duration, dt.service.name
```

Calculate percentiles by endpoint:

```dql
fetch spans
| filter request.is_root_span == true
| summarize {
    requests=count(),
    avg_duration=avg(duration),
    p95=percentile(duration, 95),
    p99=percentile(duration, 99)
  }, by: { endpoint.name }
| sort p99 desc
```

💡 Best practice: Use percentiles (p95, p99) over averages for performance insights.

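To see why the average can mislead, here is a small Python sketch with made-up response times. It uses a simple nearest-rank percentile, which is an assumption for illustration only; DQL's `percentile()` may use a different estimation method and return slightly different values.

```python
def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# 100 hypothetical response times in ms: 94 fast requests and 6 slow ones.
durations_ms = [20] * 94 + [2000] * 6

mean_ms = sum(durations_ms) / len(durations_ms)
p50_ms = nearest_rank_percentile(durations_ms, 50)
p95_ms = nearest_rank_percentile(durations_ms, 95)

print(mean_ms)  # 138.8
print(p50_ms)   # 20
print(p95_ms)   # 2000
```

The mean of 138.8 ms matches no request that actually happened, while p50 and p95 together show that most requests are fast and a small tail is very slow.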
Find requests exceeding a threshold:
```dql
fetch spans, from:now() - 2h
| filter request.is_root_span == true
| filter duration > 5s
| fields trace.id, span.name, dt.service.name, duration
| sort duration desc
| limit 50
```

Bucket durations into 10 ms bins, keeping one exemplar trace per bin:

```dql
fetch spans, from:now() - 24h
| filter http.route == "/api/v1/storage/findByISBN"
| summarize {
    spans=count(),
    trace=takeAny(record(start_time, trace.id))
  }, by: { bin(duration, 10ms) }
| fields `bin(duration, 10ms)`, spans, trace.id=trace[trace.id], start_time=trace[start_time]
```

Extract response time as timeseries:

```dql
fetch spans, from:now() - 24h
| filter request.is_root_span == true
| makeTimeseries {
    requests=count(),
    avg_duration=avg(duration),
    p95=percentile(duration, 95),
    p99=percentile(duration, 99)
  }, by: { endpoint.name }
```

📖 Learn more: See Performance Analysis for advanced patterns and timeseries techniques.

Summarize failures by service:
```dql
fetch spans
| filter request.is_root_span == true
| summarize
    total = count(),
    failed = countIf(request.is_failed == true),
    by: { dt.service.name }
| fieldsAdd failure_rate = (failed * 100.0) / total
| sort failure_rate desc
```

Breakdown by failure detection reason:

```dql
fetch spans
| filter request.is_failed == true and isNotNull(dt.failure_detection.results)
| expand dt.failure_detection.results
| summarize count(), by: { dt.failure_detection.results[reason] }
```

Failure reasons:

- http_code - HTTP response code triggered failure
- grpc_code - gRPC status code triggered failure
- exception - Exception caused failure
- span_status - Span status indicated failure
- custom_rule - Custom failure detection rule matched

Find failures by HTTP status code:

```dql
fetch spans
| filter request.is_failed == true
| filter iAny(dt.failure_detection.results[][reason] == "http_code")
| summarize count(), by: { http.response.status_code, endpoint.name }
| sort `count()` desc
```

List recent failures with details:

```dql
fetch spans
| filter request.is_root_span == true and request.is_failed == true
| fields
    start_time,
    trace.id,
    endpoint.name,
    http.response.status_code,
    duration
| sort start_time desc
| limit 100
```

📖 Learn more: See Failure Detection for exception analysis and custom rule investigation.

Analyze service communication patterns:
```dql
fetch spans, from:now() - 1h
| filter isNotNull(server.address)
| fieldsAdd
    remote_side = server.address
| summarize
    call_count = count(),
    avg_duration = avg(duration),
    by: {dt.service.name, remote_side}
| sort call_count desc
```

Identify external API dependencies:

```dql
fetch spans
| filter span.kind == "client" and isNotNull(http.request.method)
| summarize
    calls = count(),
    avg_latency = avg(duration),
    p99_latency = percentile(duration, 99),
    by: { dt.service.name, server.address, server.port }
| sort calls desc
```

Aggregate all spans in a trace to understand full request flow:

```dql
fetch spans, from:now() - 30m
| summarize {
    spans = count(),
    client_spans = countIf(span.kind == "client"),
    // Endpoints involved in the trace
    endpoints = toString(arrayRemoveNulls(collectDistinct(endpoint.name))),
    // Extract the first request root in the trace
    trace_root = takeMin(record(
        root_detection_helper = coalesce(
          if(request.is_root_span, 1),
          if(isNull(span.parent_id), 2),
          3),
        start_time, endpoint.name, duration
      ))
  }, by: { trace.id }
| fieldsFlatten trace_root
| fieldsRemove trace_root.root_detection_helper, trace_root
| fields
    start_time = trace_root.start_time,
    endpoint = trace_root.endpoint.name,
    response_time = trace_root.duration,
    spans,
    client_spans,
    endpoints,
    trace.id
| sort start_time
| limit 100
```

Root detection strategy: Use takeMin(record(...)) with a detection helper to reliably find the root request. The helper ranks candidate spans in this order:

1. Spans with request.is_root_span == true (helper value 1)
2. Spans without a parent, isNull(span.parent_id) (helper value 2)
3. Any other span (helper value 3)

Find traces spanning multiple services:

```dql
fetch spans, from:now() - 1h
| summarize {
    services = collectDistinct(dt.service.name),
    trace_root = takeMin(record(root_detection_helper = coalesce(if(request.is_root_span, 1), 2), endpoint.name))
  }, by: { trace.id }
| fieldsAdd service_count = arraySize(services)
| filter service_count > 1
| fields endpoint = trace_root[endpoint.name], service_count, services = toString(services), trace.id
| sort service_count desc
| limit 50
```

Access custom request attributes captured by OneAgent on request root spans:

```dql
fetch spans
| filter request.is_root_span == true
| filter isNotNull(request_attribute.PaidAmount)
| makeTimeseries sum(request_attribute.PaidAmount)
```

Field patterns: request_attribute.<name>, captured_attribute.<name> (always arrays)

→ Request Attributes: full patterns for request attributes, captured attributes, and request ID aggregation

| Span Type | Detection | Key Fields | Reference |
|---|---|---|---|
| HTTP server (incoming) | span.kind == "server" and isNotNull(http.request.method) | http.route, http.request.method, http.response.status_code | http-spans.md |
| HTTP client (outgoing) | span.kind == "client" and isNotNull(http.request.method) | server.address, server.port | http-spans.md |
| Database | span.kind == "client" and isNotNull(db.system) | db.system, db.namespace, db.statement | database-spans.md |
| Messaging | isNotNull(messaging.system) | messaging.system, messaging.destination.name, messaging.operation.type | messaging-spans.md |
| RPC / gRPC | isNotNull(rpc.system) | rpc.system, rpc.service, rpc.method, rpc.grpc.status_code | rpc-spans.md |
| Serverless / FaaS | isNotNull(faas.name) and span.kind == "server" | faas.name, faas.trigger.type, cloud.provider | serverless-spans.md |
⚠️ Database spans: Can be aggregated (one span = multiple calls). Always use aggregation.count extrapolation for accurate operation counts.
📖 Detailed patterns per span type: See the reference files above.
Exceptions are stored as span.events within spans:
```dql
fetch spans
| filter iAny(span.events[][span_event.name] == "exception")
| expand span.events
| fieldsFlatten span.events, fields: { exception.type }
| summarize {
    count(),
    trace=takeAny(record(start_time, trace.id))
  }, by: { exception.type }
| fields exception.type, `count()`, trace.id=trace[trace.id], start_time=trace[start_time]
```

💡 Tip: Use iAny() to check conditions within span event arrays.

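As a rough mental model, iAny() over an array behaves like Python's `any()` over a list: a span matches when at least one of its events satisfies the condition. The sketch below is only an analogy on hypothetical data, not how DQL is implemented. Its second half assumes that `expand` emits one record per array element, so other events on a matching span would land in a group with an empty exception.type.

```python
from collections import Counter

# Hypothetical spans; field names follow the query above.
spans = [
    {"span.id": "a1", "span.events": [
        {"span_event.name": "exception", "exception.type": "TimeoutError"},
        {"span_event.name": "retry"},
    ]},
    {"span.id": "b2", "span.events": [{"span_event.name": "retry"}]},
    {"span.id": "c3", "span.events": None},
]


def has_exception_event(span):
    """True if at least one event on the span is an exception event."""
    events = span.get("span.events") or []
    return any(event.get("span_event.name") == "exception" for event in events)


# Analogue of: filter iAny(span.events[][span_event.name] == "exception")
matching = [span for span in spans if has_exception_event(span)]
matching_ids = [span["span.id"] for span in matching]

# Analogue of: expand span.events, then count by exception.type.
exception_types = Counter(
    event.get("exception.type")
    for span in matching
    for event in span["span.events"]
)

print(matching_ids)           # ['a1']
print(dict(exception_types))  # {'TimeoutError': 1, None: 1}
```

If that assumption holds, adding a filter on the event name after `expand` would keep non-exception events out of the summary.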
→ Logs Correlation: joining logs and traces, filtering traces by log content

→ Network Analysis: client IPs, DNS resolution, subnet analysis

| Area | Rule |
|---|---|
| Filtering | Apply request.is_root_span == true and endpoint filters first |
| Sampling | Use samplingRatio (e.g., 100 = read 1%) for performance |
| Percentiles | Use p95/p99 over averages for performance analysis |
| Root spans | Use request.is_root_span == true for end-to-end analysis |
| Trace grouping | Group by trace.id for complete trace metrics |
| Request grouping | Group by request.id for OneAgent-only request metrics |
| Extrapolation | Always apply multiplicity for accurate operation counts |
| Exemplars | Use takeAny(record(start_time, trace.id)) to enable UI drilldown |
| Problem | Cause | Solution |
|---|---|---|
| Duration values seem wrong (too large) | duration is in nanoseconds, not milliseconds | Divide by 1000000 or compare with 5s (DQL duration literal) |
| Span counts don't match expected request volume | Sampling or aggregation not accounted for | Use multiplicity extrapolation — see Sampling and Extrapolation reference |
| getNodeName(dt.smartscape.service) returns null | Service not yet resolved or OneAgent not monitoring | Verify OneAgent monitors the service; entity resolution may have a short delay |
| request.is_root_span filter returns nothing | Querying OpenTelemetry-only traces without OneAgent | Use isNull(span.parent_id) as fallback for root span detection |
| trace.id filter returns no results | Trace ID not converted to UID format | Use filter trace.id == toUid("abc123...") for string-based trace IDs |
| Database span counts are too low | Database spans are aggregated (one span = N calls) | Always use aggregation.count extrapolation for database operation counts |
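
The unit mix-up in the first row is easy to reproduce outside DQL. This short Python sketch converts a hypothetical raw duration from nanoseconds and applies the same 5 second threshold used in the slow-request query. The constant and function names are illustrative only.

```python
NANOS_PER_MILLI = 1_000_000
NANOS_PER_SECOND = 1_000_000_000


def nanos_to_millis(duration_ns):
    """Convert a span duration from nanoseconds to milliseconds."""
    return duration_ns / NANOS_PER_MILLI


raw_duration_ns = 5_250_000_000  # hypothetical span duration

duration_ms = nanos_to_millis(raw_duration_ns)
exceeds_5s = raw_duration_ns > 5 * NANOS_PER_SECOND

print(duration_ms)  # 5250.0
print(exceeds_5s)   # True
```

Read as milliseconds, the raw value would look like roughly 60 days instead of 5.25 seconds, which is the "too large" symptom described in the table.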