dt-obs-hosts

Host and process metrics including CPU, memory, disk, network, containers, and process-level telemetry. Use when analyzing infrastructure health, resource utilization, process consumption, or host discovery. Also use when building timeseries queries for host metrics that feed into analytical workflows like anomaly detection, forecasting, or seasonality analysis. Trigger: "show hosts", "CPU usage", "memory utilization", "disk space", "high CPU", "host with most free disk", "top hosts by CPU", "top processes by memory", "Linux hosts in AWS", "what databases are running", "infrastructure costs by cost center", "hosts running EOL Java", "container monitoring", "listening ports", "process resource consumption", "CPU forecast", "memory anomaly", "host seasonality". Do NOT use for explaining existing queries, product documentation questions, Kubernetes pod/workload queries (use dt-obs-kubernetes), AWS cloud resource inventory (use dt-obs-aws), or service-level metrics (use dt-obs-services).

Quality

92%

Does it follow best practices?

Impact

—

No eval scenarios have been run

Securityby

Passed

No known issues

Infrastructure Hosts Skill

Monitor and manage host and process infrastructure including CPU, memory, disk, network, and technology inventory.

When to Use This Skill

Use this skill when the user needs to:

Inventory: "Show me all Linux hosts in AWS us-east-1"
Monitor: "What hosts have high CPU usage?"
Troubleshoot: "Which processes are consuming the most memory?"
Discover: "What databases are running in production?"
Plan: "Track Kubernetes version distribution for upgrade planning"
Cost: "Calculate infrastructure costs by cost center"
Security: "Find all processes listening on port 22"
Compliance: "Identify hosts running EOL Java versions"
Quality: "Check data completeness for AWS hosts"
Optimize: "Find rightsizing candidates based on utilization"

Cross-source join required: If the query must combine host data with logs or other telemetry sources (e.g. "show logs from Linux hosts with their IP addresses") → also read dt-dql-essentials/references/smartscape-topology-navigation.md before writing the query.

Core Concepts

Entities

HOST - Physical or virtual machines (cloud or on-premise)
PROCESS - Running processes and process groups
CONTAINER - Kubernetes containers
NETWORK_INTERFACE - Host network interfaces
DISK - Host disk volumes

Metrics Categories

Host Metrics - dt.host.cpu.*, dt.host.memory.*, dt.host.disk.*, dt.host.net.*
Process Metrics - dt.process.cpu.*, dt.process.memory.*, dt.process.io.*, dt.process.network.*
Inventory - OS type, cloud provider, technology stack, versions
Cost - dt.cost.costcenter, dt.cost.product
Quality - Metadata completeness, version compliance

Alert Thresholds

CPU/Memory/Disk: 80% warning, 90% critical
Network: >70% high, >85% saturated
Disk Latency: >20ms bottleneck
Network Errors: Drop rate >1%, error rate >0.1%
Swap: >30% warning, >50% critical

Key Workflows

1. Host Discovery and Classification

Discover hosts, classify by OS/cloud, inventory resources.

smartscapeNodes "HOST"
| fieldsAdd os.type, cloud.provider, host.logical.cpu.cores, host.physical.memory
| summarize host_count = count(), by: {os.type, cloud.provider}
| sort host_count desc

OS Types: LINUX, WINDOWS, AIX, SOLARIS, ZOS

→ For cloud-specific attributes, see references/inventory-discovery.md

2. Resource Utilization Monitoring

Monitor CPU, memory, disk, network across hosts.

timeseries {
  cpu = avg(dt.host.cpu.usage),
  memory = avg(dt.host.memory.usage),
  disk = avg(dt.host.disk.used.percent)
}, by: {dt.smartscape.host}
| fieldsAdd host_name = getNodeName(dt.smartscape.host)
| filter arrayAvg(cpu) > 80 or arrayAvg(memory) > 80
| sort arrayAvg(cpu) desc

High utilization threshold: 80% warning, 90% critical

Key CPU Metrics:

dt.host.cpu.usage — Total CPU utilization (0-100%)
dt.host.cpu.idle — CPU idle time (inverse of usage; useful for anomaly detection)
dt.host.cpu.user — CPU time in user mode
dt.host.cpu.system — CPU time in kernel mode
dt.host.cpu.iowait — CPU waiting for I/O (Linux only)

→ For detailed CPU analysis, see references/host-metrics.md
→ For memory breakdown, see references/host-metrics.md

Disk Free Space — Find Hosts with Most/Least Free Disk

timeseries disk_used_pct = avg(dt.host.disk.used.percent), by: {dt.smartscape.host}
| fieldsAdd host_name = getNodeName(dt.smartscape.host)
| fieldsAdd avg_disk_used = arrayAvg(disk_used_pct),
    free_pct = 100 - arrayAvg(disk_used_pct)
| sort free_pct desc
| limit 10

3. Process Resource Analysis

Identify top resource consumers at process level.

timeseries {
  cpu = avg(dt.process.cpu.usage),
  memory = avg(dt.process.memory.usage)
}, by: {dt.smartscape.process}
| fieldsAdd process_name = getNodeName(dt.smartscape.process)
| filter arrayAvg(cpu) > 50
| sort arrayAvg(cpu) desc
| limit 20

→ For process I/O analysis, see references/process-monitoring.md
→ For process network metrics, see references/process-monitoring.md

4. Technology Stack Inventory

Discover and track software technologies and versions.

smartscapeNodes "PROCESS"
| fieldsAdd process.software_technologies
| expand tech = process.software_technologies
| fieldsAdd tech_type = tech[type], tech_version = tech[version]
| summarize process_count = count(), by: {tech_type, tech_version}
| sort process_count desc

Common Technologies: Java, Node.js, Python, .NET, databases, web servers, messaging systems

→ For version compliance checks, see references/inventory-discovery.md

5. Service Discovery via Ports

Map listening ports to services for security and inventory.

smartscapeNodes "PROCESS"
| fieldsAdd process.listen_ports, dt.process_group.detected_name
| filter isNotNull(process.listen_ports) and arraySize(process.listen_ports) > 0
| expand listen_port = process.listen_ports
| summarize process_count = count(), by: {listen_port, dt.process_group.detected_name}
| sort toLong(listen_port) asc
| limit 50

Well-known ports: 80 (HTTP), 443 (HTTPS), 22 (SSH), 3306 (MySQL), 5432 (PostgreSQL)

→ For comprehensive port mapping, see references/inventory-discovery.md

6. Container and Kubernetes Monitoring

Track container distribution and K8s workload types.

smartscapeNodes "CONTAINER"
| fieldsAdd k8s.cluster.name, k8s.namespace.name, k8s.workload.kind
| summarize container_count = count(), by: {k8s.cluster.name, k8s.workload.kind}
| sort k8s.cluster.name, container_count desc

Workload Types: deployment, daemonset, statefulset, job, cronjob

Note: Container image names/versions NOT available in smartscape.

→ For K8s version tracking, see references/container-monitoring.md
→ For container lifecycle, see references/container-monitoring.md

7. Cost Attribution and Chargeback

Calculate infrastructure costs by cost center.

smartscapeNodes "HOST"
| fieldsAdd dt.cost.costcenter, host.logical.cpu.cores, host.physical.memory
| filter isNotNull(dt.cost.costcenter)
| fieldsAdd memory_gb = toDouble(host.physical.memory) / 1024 / 1024 / 1024
| summarize 
    host_count = count(),
    total_cores = sum(toLong(host.logical.cpu.cores)),
    total_memory_gb = sum(memory_gb),
    by: {dt.cost.costcenter}
| sort total_cores desc

→ For product-level cost tracking, see references/inventory-discovery.md

8. Infrastructure Health Correlation

Correlate host and process metrics for cross-layer analysis.

timeseries {
  host_cpu = avg(dt.host.cpu.usage),
  host_memory = avg(dt.host.memory.usage),
  process_cpu = avg(dt.process.cpu.usage)
}, by: {dt.smartscape.host, dt.smartscape.process}
| fieldsAdd
    host_name = getNodeName(dt.smartscape.host),
    process_name = getNodeName(dt.smartscape.process)
| filter arrayAvg(host_cpu) > 70
| sort arrayAvg(host_cpu) desc

Health scoring: Critical if any resource >90%, warning if >80%

→ For multi-resource saturation detection, see references/host-metrics.md

Response Construction

When the user asks for data retrieval or a DQL query (e.g., "show me top hosts by CPU"), include the DQL query in the response alongside the results. Users want to see and reuse the query — it is the deliverable, not just a means to get results.

When the user asks for analysis (anomaly detection, forecasting, seasonality), the analysis results are the deliverable. Focus on presenting findings clearly:

Prioritize metric-level findings over data collection artifacts. If an analysis tool reports data gaps alongside actual anomalies, lead with the metric behavior the user asked about and mention gaps only as supplementary context.
Include host names (not just IDs) using getNodeName(dt.smartscape.host) or the get-entity-name tool.
State the timeframe analyzed and the tools/parameters used.

Analytical Workflows

Host metric queries often serve as inputs to analytical tools (anomaly detection, forecasting, seasonality analysis). This skill helps construct the right DQL query; the actual analysis is performed by dedicated tools.

Anomaly Detection and Pattern Analysis

When users ask about "unusual behavior", "anomalies", "spikes", or "sudden changes" in host metrics, the workflow is:

Construct the timeseries query using this skill's patterns
Pass it to the appropriate analysis tool (anomaly detector, novelty detection)

Choosing between detectors:

adaptive-anomaly-detector — use when the user asks about magnitude: "spikes", "abrupt changes", "values that went above normal", "sudden jumps". It answers "did this metric cross an unexpected threshold?" and reports alert durations and peak values.
timeseries-novelty-detection — use when the user asks about behavioral change: "unusual patterns", "something changed", "trends", "new behavior". It answers "did the shape of the signal change?" without implying a specific threshold was crossed.

Response format for anomaly results: Include both the host name (resolved via getNodeName(dt.smartscape.host) or get-entity-name) and the host entity ID alongside timestamps and values. Entity IDs alone are opaque to users; names alone prevent follow-up queries.

Novelty type selection rule: When using novelty detection, set analysisNoveltyType to only [SPIKE, CHANGE_IN_VALUES, TREND_IN_VALUES] by default. EXCLUDE GAP_WITH_MISSING_VALUES and CHANGE_IN_MISSING_VALUES unless the user explicitly asks about data gaps or monitoring coverage. Data gaps are infrastructure issues, not metric behavior anomalies — reporting them when the user asks about CPU or memory patterns is incorrect.

Queries for analysis tools should use simple timeseries format with a single aggregated metric and appropriate time range:

timeseries avg(dt.host.cpu.idle), by: {dt.smartscape.host}

timeseries avg(dt.host.memory.usage), by: {dt.smartscape.host}

Avoid adding filters or field transformations that reduce the data — the analysis tools work best with complete timeseries data.

Forecasting

When users ask to "predict", "forecast", or "estimate future" host metrics:

Construct the timeseries query with sufficient historical data (e.g., 7d for short-term, 30d for longer predictions)
Pass to the forecasting tool with the desired forecast horizon

The forecast horizon (how far ahead to predict) and the historical window (how much past data the model trains on) are independent. A request like "forecast the next 2 hours" sets the horizon to 2h — it says nothing about the lookback. Always use at least 7 days of historical data regardless of how short the forecast horizon is. Too few training data points cause the forecast model to fail and fall back to raw historical values.

timeseries avg(dt.host.cpu.usage), by: {dt.smartscape.host}

Seasonality Detection

When users ask about "seasonality", "weekly patterns", or "recurring behavior":

Use a longer time range (at least 14d for weekly, 30d+ for monthly)
Pass to the seasonal baseline anomaly detector

Response format for seasonal analysis: When presenting results, include:

Whether seasonal anomalies were detected (yes/no)
The analysis timeframe and parameters used
For each affected host: host name (not just ID), timestamps of violations, violation counts, baseline values vs actual values, and upper/lower bounds
Organize results by host if multiple hosts are involved

Scope Boundary — Service-Level vs Host-Level Metrics

This skill covers host and process infrastructure metrics only. If the user asks about service-level metrics (request rate, response time, error rate, service calls per minute, throughput), use dt-obs-services instead — even when the question involves forecasting or anomaly detection of those metrics.

Redirect these to dt-obs-services: "service calls per minute", "request rate", "response time by service", "error rate by endpoint", "service throughput forecast".

Common Query Patterns

Pattern 1: Smartscape Discovery

Use smartscapeNodes to discover and classify entities.

smartscapeNodes "HOST"
| fieldsAdd <attributes>
| filter <conditions>
| summarize <aggregations>

Pattern 2: Timeseries Performance

Use timeseries to analyze metrics over time.

timeseries metric = avg(dt.host.<metric>), by: {dt.smartscape.host}
| fieldsAdd <calculations>
| filter <thresholds>

Pattern 3: Cross-Layer Correlation

Correlate host and process metrics.

timeseries {
  host_cpu = avg(dt.host.cpu.usage),
  process_cpu = avg(dt.process.cpu.usage)
}, by: {dt.smartscape.host, dt.smartscape.process}

Pattern 4: Entity Enrichment with Lookup

Enrich data with entity attributes. After lookup, reference fields with lookup. prefix.

timeseries cpu = avg(dt.host.cpu.usage), by: {dt.smartscape.host}
| lookup [
    smartscapeNodes HOST
    | fields id, cpuCores, memoryTotal
  ], sourceField:dt.smartscape.host, lookupField:id
| fieldsAdd cores = lookup.cpuCores, mem_gb = lookup.memoryTotal / 1024 / 1024 / 1024

Tags and Metadata

Important Notes

Generic tags field is NOT populated in smartscape queries
Use specific tag fields: tags:azure[*], tags:environment
Use custom metadata: host.custom.metadata[*]

Available Tags

Azure Tags: tags:azure[dt_owner_team], tags:azure[dt_cloudcost_capability]
Environment: tags:environment
Custom Metadata: host.custom.metadata[OperatorVersion], host.custom.metadata[Cluster]
Cost: dt.cost.costcenter, dt.cost.product

→ For complete tag reference, see references/inventory-discovery.md

Cloud-Specific Attributes

AWS

cloud.provider == "aws"
aws.region, aws.availability_zone, aws.account.id
aws.resource.id, aws.resource.name
aws.state (running, stopped, terminated)

Azure

cloud.provider == "azure"
azure.location, azure.subscription, azure.resource.group
azure.status, azure.provisioning_state
azure.resource.sku.name (VM size)

Kubernetes

k8s.cluster.name, k8s.cluster.uid
k8s.namespace.name, k8s.node.name, k8s.pod.name
k8s.workload.name, k8s.workload.kind

→ For multi-cloud analysis, see references/inventory-discovery.md

Best Practices

Use percentiles (p95, p99) for latency; max() for limits; avg() for trends
Set multi-level thresholds (warning 80%, critical 90%)
Filter early in the pipeline; limit results with | limit N
Aggregate before enrichment (lookup)
Use getNodeName(dt.smartscape.host) for human-readable host names; getNodeName(dt.smartscape.process) for processes
Convert bytes to GB: / 1024 / 1024 / 1024; round with round(value, decimals: 1)

Time windows: Real-time: 5-15 min | Trends: 1-7 days | Capacity planning: 30-90 days

Limitations

dt.host.cpu.iowait available on Linux only
Generic tags field NOT populated in smartscape (use specific tag namespaces)
Container image names NOT available in smartscape

Troubleshooting

Problem	Cause	Solution
No hosts returned from `smartscapeNodes "HOST"`	Missing time range or OneAgent not deployed	Verify OneAgent is installed; add a time range to the query
`tags` field always empty	Generic `tags` not populated in smartscape	Use specific tag namespaces: `tags:azure[*]`, `tags:environment`, `dt.cost.costcenter`
Memory values in bytes are unreadable	Raw metric unit is bytes	Divide by `1024 / 1024 / 1024` and use `round(value, decimals: 1)`
`dt.host.cpu.iowait` returns no data	Metric is Linux-only	Check `os.type`; iowait is unavailable on Windows, AIX, Solaris
Container image names missing	Not available in smartscape	Use `k8s.object` parsing for image details; see dt-obs-kubernetes skill
`process.software_technologies` is empty	Process not monitored by deep injection	Verify OneAgent deep monitoring is enabled for the process group

When to Load References

This skill uses progressive disclosure. Start here for 80% of use cases. Load reference files for detailed specifications when needed.

Load host-metrics.md when:

Analyzing CPU component breakdown (user, system, iowait, steal)
Investigating memory pressure and swap usage
Troubleshooting disk I/O latency
Diagnosing network packet drops or errors

Load process-monitoring.md when:

Analyzing process-level I/O patterns
Investigating TCP connection quality
Detecting resource exhaustion (file descriptors, threads)
Tracking GC suspension time

Load container-monitoring.md when:

Analyzing container lifecycle and churn
Tracking Kubernetes version distribution
Managing OneAgent operator versions
Planning K8s cluster upgrades

Load inventory-discovery.md when:

Performing security audits via port discovery
Implementing cost attribution and chargeback
Validating data quality and metadata completeness
Managing multi-cloud infrastructure

References

host-metrics.md - Detailed host CPU, memory, disk, and network monitoring
process-monitoring.md - Process-level CPU, memory, I/O, and network analysis
container-monitoring.md - Container inventory, Kubernetes versions, and operator management
inventory-discovery.md - Host/process discovery, technology inventory, cost attribution, and data quality

Repository: Dynatrace/dynatrace-for-ai
Commit: 9b399fa

Last updated: 8 days ago
Created: 8 days ago

Is this your skill?

If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.