Comprehensive toolkit for generating best practice PromQL (Prometheus Query Language) queries following current standards and conventions. Use this skill when creating new PromQL queries, implementing monitoring and alerting rules, or building observability dashboards.
Complete reference of Prometheus Query Language functions organized by category.
Aggregation operators combine multiple time series into fewer time series.
Syntax: <operator> [without|by (<label_list>)] (<instant_vector>)
Calculates sum of values across time series.
# Sum all HTTP requests
sum(http_requests_total)
# Sum by job and endpoint
sum by (job, endpoint) (http_requests_total)
# Sum without instance label
sum without (instance) (http_requests_total)
Use for: Totaling metrics across instances, calculating aggregate throughput.
Calculates average of values across time series.
# Average CPU usage across all instances
avg(cpu_usage_percent)
# Average by environment
avg by (environment) (cpu_usage_percent)
Use for: Average resource usage, typical response times.
Returns maximum or minimum value across time series.
# Maximum memory usage across instances
max(memory_usage_bytes)
# Minimum available disk space by node
min by (node) (disk_available_bytes)
Use for: Peak resource usage, bottleneck identification.
Counts the number of time series.
# Count of running instances
count(up == 1)
# Count of instances by version
count by (version) (app_version_info)
Use for: Counting instances, availability calculations.
Counts time series with the same value.
# Count how many instances have each version
count_values("version", app_version)
Use for: Distribution analysis, version tracking.
Returns k largest or smallest time series by value.
# Top 5 endpoints by request count
topk(5, rate(http_requests_total[5m]))
# Bottom 3 instances by available memory
bottomk(3, node_memory_available_bytes)
Use for: Identifying highest/lowest consumers, troubleshooting hotspots.
Calculates φ-quantile (0 ≤ φ ≤ 1) across dimensions.
# 95th percentile of response times
quantile(0.95, response_time_seconds)
# 50th percentile (median) by service
quantile by (service) (0.5, response_time_seconds)
Use for: Percentile calculations across simple metrics (not histograms).
Calculates standard deviation or variance.
# Standard deviation of response times
stddev(response_time_seconds)
Use for: Measuring variability, detecting anomalies.
Functions for working with counter metrics (cumulative values that only increase).
Calculates per-second average rate of increase over a time range.
# Requests per second over last 5 minutes
rate(http_requests_total[5m])
# Bytes sent per second
rate(bytes_sent_total[1m])
How it works:
- Computes the average per-second rate of increase over the time range
- Automatically compensates for counter resets
Best practices:
- Only use with counters (metrics with a _total, _count, _sum, or _bucket suffix)
- Use a range of at least 4x the scrape interval, typically [1m] to [5m]
When to use: For graphing trends, alerting on sustained rates, calculating throughput.
Calculates instant rate based on the last two data points.
# Instant rate of HTTP requests
irate(http_requests_total[5m])
# Real-time throughput (sensitive to spikes)
irate(bytes_processed_total[2m])
How it works:
- Uses only the last two samples in the range
- More volatile than rate()
Best practices:
- Keep the range short, typically [2m] to [5m]
- Expect more noise than rate(); shows spikes
When to use: For alerting on spike detection, real-time dashboards showing immediate changes.
Rate vs irate:
- rate(): Average over time range, smooth
- irate(): Instant based on last 2 points, volatile
- Prefer rate() for graphs and most alerts; reserve irate() for fast-moving dashboards
Native Histogram Support (Prometheus 3.3+): irate() and idelta() now work with native histograms, enabling instant rate calculations on histogram data.
# Instant rate on native histogram (Prometheus 3.3+)
irate(http_request_duration_seconds[5m])
Calculates total increase over a time range.
# Total requests in the last hour
increase(http_requests_total[1h])
# Total bytes sent in the last day
increase(bytes_sent_total[24h])
How it works:
- Equivalent to rate(v) * range_in_seconds
- Handles counter resets; results may be non-integer due to extrapolation
Best practices:
- Only use with counters
- Prefer rate() for graphing; use increase() for human-readable totals
When to use: Calculating totals for billing, capacity planning, SLO calculations.
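Because increase() is defined as rate() times the range in seconds, the two are interchangeable for a fixed window. A minimal sketch (metric name illustrative):

```promql
# Equivalent for a 5-minute window (300 seconds):
increase(http_requests_total[5m])
rate(http_requests_total[5m]) * 300
```

This is why graphs of increase() and rate() over the same range have identical shapes and differ only by a constant factor.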
Counts the number of counter resets within a time range.
# Number of times counter reset in last hour
resets(http_requests_total[1h])
When to use: Detecting application restarts, investigating metric inconsistencies.
Functions for extracting time components and working with timestamps.
Returns current evaluation timestamp as seconds since Unix epoch.
# Current timestamp
time()
# Time since metric was last seen (in seconds)
time() - max(metric_timestamp)
Use for: Calculating age of data, time-based math.
Returns timestamp of each sample in the instant vector.
# Get timestamp of last scrape
timestamp(up)
# Time since last successful backup
time() - timestamp(last_backup_success)
Use for: Checking staleness, calculating time since event.
Extract time components from Unix timestamp.
# Current year
year()
# Current month (1-12)
month()
# Current day of month (1-31)
day_of_month()
# Current day of week (0=Sunday, 6=Saturday)
day_of_week()
# Extract from specific timestamp
year(timestamp(last_backup))
Use for: Time-based filtering, business hour alerting.
Extract hour (0-23) or minute (0-59) from timestamp.
# Current hour
hour()
# Current minute
minute()
# Check if within business hours (9 AM - 5 PM)
hour() >= 9 and hour() < 17
Use for: Time-of-day alerting, business hour filtering.
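Combining a time filter with a real metric needs explicit vector matching, because hour() carries no labels. A sketch (metric name and threshold illustrative); note that hour() works in UTC:

```promql
# Alert on error rate only during business hours (09:00-16:59 UTC).
# `on()` lets the labelless hour() series match every left-hand series.
rate(http_errors_total[5m]) > 0.5
and on() (hour() >= 9)
and on() (hour() < 17)
```

Set operators like and allow this many-to-one match without group_left, which is why `and on()` is the standard idiom for time-of-day gating.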
Returns number of days in the month of the timestamp.
# Days in current month
days_in_month()
# Days in month of specific timestamp
days_in_month(timestamp(metric))
Use for: Calendar calculations, month-end processing.
These functions are available in Prometheus 3.5+ behind the --enable-feature=promql-experimental-functions flag.
Returns the timestamp when the maximum value occurred in the range.
# When did CPU usage peak in the last hour?
ts_of_max_over_time(cpu_usage_percent[1h])
# Find when error spike happened
ts_of_max_over_time(rate(errors_total[5m])[1h:1m])
Use for: Incident investigation, finding when peaks occurred.
Returns the timestamp when the minimum value occurred in the range.
# When was memory usage lowest?
ts_of_min_over_time(memory_available_bytes[1h])
# Find when throughput dropped
ts_of_min_over_time(rate(requests_total[5m])[1h:1m])
Use for: Finding performance troughs, capacity planning.
Returns the timestamp of the last sample in the range.
# When was this metric last scraped?
ts_of_last_over_time(up[10m])
# Check data freshness
time() - ts_of_last_over_time(metric[1h])
Use for: Detecting stale data, monitoring scrape health.
Returns the first (oldest) value in the time range.
Requires Feature Flag: Must enable with
--enable-feature=promql-experimental-functions
# Get the first value in a range
first_over_time(metric[1h])
# Compare current vs initial value
metric - first_over_time(metric[1h])
# Calculate change over time window
last_over_time(metric[1h]) - first_over_time(metric[1h])
Use for: Baseline comparisons, detecting drift, calculating change over time.
Returns the timestamp of the first sample in the range.
Requires Feature Flag: Must enable with
--enable-feature=promql-experimental-functions
# When did this time series start?
ts_of_first_over_time(metric[24h])
# How long has this metric existed?
time() - ts_of_first_over_time(metric[7d])
Use for: Tracking when metrics first appeared, calculating metric age.
Calculates the median absolute deviation of all float samples in the specified interval.
Requires Feature Flag: Must enable with
--enable-feature=promql-experimental-functions
# Median absolute deviation of CPU usage over 1 hour
mad_over_time(cpu_usage_percent[1h])
# Detect anomalies: values far from median
metric > avg_over_time(metric[1h]) + 3 * mad_over_time(metric[1h])
Use for: Anomaly detection, measuring variability robustly (less sensitive to outliers than stddev).
Returns vector elements sorted by the values of the given labels in ascending order.
Requires Feature Flag: Must enable with
--enable-feature=promql-experimental-functions
# Sort by service name
sort_by_label(up, "service")
# Sort by multiple labels
sort_by_label(http_requests_total, "job", "instance")
How it works:
- Sorts by the string values of the given labels, in order, ascending
- Only affects the ordering of instant query results
Use for: Organizing query results for display, dashboard ordering.
Same as sort_by_label, but sorts in descending order.
Requires Feature Flag: Must enable with
--enable-feature=promql-experimental-functions
# Sort by service name (descending)
sort_by_label_desc(up, "service")
Use for: Reverse alphabetical ordering of results.
Mathematical operations on metric values.
Returns absolute value.
# Absolute value of temperature difference
abs(current_temp - target_temp)
Rounds up or down to nearest integer.
# Round up CPU count
ceil(cpu_count_fractional)
# Round down memory in GB
floor(memory_bytes / 1024 / 1024 / 1024)
Rounds to nearest integer or specified precision.
# Round to nearest integer
round(cpu_usage_percent)
# Round to nearest 0.1
round(response_time_seconds, 0.1)
# Round to nearest 10
round(request_count, 10)
Calculates square root.
# Standard deviation calculation
sqrt(avg(metric^2) - avg(metric)^2)
Exponential and logarithmic functions.
# Natural exponential
exp(log_scale_metric)
# Natural logarithm
ln(exponential_metric)
# Base-2 logarithm
log2(power_of_two_metric)
# Base-10 logarithm
log10(large_number_metric)
Limits values to a range.
# Clamp between 0 and 100
clamp(metric, 0, 100)
# Cap at maximum
clamp_max(metric, 100)
# Ensure minimum
clamp_min(metric, 0)
Use for: Normalizing values, preventing display overflow.
Returns sign of value: 1 for positive, 0 for zero, -1 for negative.
# Get sign of temperature delta
sgn(current_temp - target_temp)
Native histograms are now stable in Prometheus 3.x. These functions work with native histogram data.
For native histograms, the syntax is simpler - no _bucket suffix or le label needed:
# Native histogram quantile (simpler syntax)
histogram_quantile(0.95,
sum by (job) (rate(http_request_duration_seconds[5m]))
)
# Compare with classic histogram (requires _bucket and le)
histogram_quantile(0.95,
sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
)
Extracts the count of observations from a native histogram.
# Rate of observations per second
histogram_count(rate(http_request_duration_seconds[5m]))
# Total observations in time window
histogram_count(increase(http_request_duration_seconds[1h]))
Use for: Getting request counts from native histogram metrics.
Extracts the sum of observations from a native histogram.
# Sum of all observation values
histogram_sum(rate(http_request_duration_seconds[5m]))
# Average value from native histogram
histogram_sum(rate(http_request_duration_seconds[5m]))
/
histogram_count(rate(http_request_duration_seconds[5m]))
Use for: Calculating averages, total latency.
Calculates the fraction of observations between two values in a native histogram.
# Fraction of requests under 100ms
histogram_fraction(0, 0.1, rate(http_request_duration_seconds[5m]))
# Percentage of requests between 100ms and 500ms
histogram_fraction(0.1, 0.5, rate(http_request_duration_seconds[5m])) * 100
# SLO compliance: percentage under threshold
histogram_fraction(0, 0.2, rate(http_request_duration_seconds[5m])) >= 0.95
Use for: SLO compliance calculations, distribution analysis.
Calculates the estimated standard deviation of observations in a native histogram.
# Standard deviation of request durations
histogram_stddev(rate(http_request_duration_seconds[5m]))
How it works:
- Estimates the standard deviation from the native histogram's bucket boundaries
Use for: Understanding variability in metrics, anomaly detection.
Calculates the estimated standard variance of observations in a native histogram.
# Standard variance of request durations
histogram_stdvar(rate(http_request_duration_seconds[5m]))
# Compare variance across services
histogram_stdvar(sum by (service) (rate(http_request_duration_seconds[5m])))
How it works:
- The square of histogram_stddev (variance = stddev²)
Use for: Statistical analysis, comparing variability across dimensions.
Calculates average from a native histogram (shorthand for sum/count).
# Average request duration
histogram_avg(rate(http_request_duration_seconds[5m]))
Use for: Quick average calculations.
This section documents important changes in Prometheus 3.0 (released November 2024) that affect PromQL queries.
Range Selectors Now Left-Open
- Range selectors now exclude the sample exactly at the left boundary, so rate(metric[5m]) where the 5-minute-ago sample falls exactly on the boundary may behave differently.
holt_winters Renamed to double_exponential_smoothing
- The renamed function is experimental and requires --enable-feature=promql-experimental-functions.
Regex . Now Matches All Characters
- The . regex pattern now matches all characters including newlines.
UTF-8 Metric and Label Names
- Quoted syntax supports arbitrary UTF-8 names: {"metric.name.with" = "value"}
Native Histograms Stable
New Experimental Time Functions (require --enable-feature=promql-experimental-functions)
- first_over_time() - Returns the first value in a range (Prometheus 3.7+)
- ts_of_first_over_time() - Timestamp of first sample (Prometheus 3.7+)
- ts_of_max_over_time() - When maximum occurred (Prometheus 3.5+)
- ts_of_min_over_time() - When minimum occurred (Prometheus 3.5+)
- ts_of_last_over_time() - Timestamp of last sample (Prometheus 3.5+)
Functions for working with classic histogram and summary metrics.
Calculates φ-quantile (0 ≤ φ ≤ 1) from histogram buckets.
# 95th percentile of request duration
histogram_quantile(0.95,
sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
# 50th percentile (median) by service
histogram_quantile(0.5,
sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
)
# 99th percentile with job label preserved
histogram_quantile(0.99,
sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
)
Critical requirements:
- Input must include the le label (bucket upper bound)
- Apply rate() or irate() to the bucket counters
Best practices:
- Aggregate with sum before calling histogram_quantile()
- Always preserve the le label in aggregation: sum by (le) or sum by (job, le)
- Apply rate() inside the aggregation
- Use a sufficient range for rate() (typically [5m])
Common mistakes:
- histogram_quantile(0.95, rate(metric_bucket[5m])) - Missing aggregation
- histogram_quantile(0.95, sum(metric_bucket)) - Missing rate() and le label
- histogram_quantile(0.95, sum by (le) (rate(metric_bucket[5m]))) - Correct
When to use: Calculating latency percentiles, response time SLOs.
Extracts total count or sum of observations from histogram.
# Total number of requests (from histogram)
histogram_count(http_request_duration_seconds)
# Total duration of all requests
histogram_sum(http_request_duration_seconds)
# Average request duration
histogram_sum(http_request_duration_seconds)
/
histogram_count(http_request_duration_seconds)
Note: For classic histograms, use _count and _sum suffixes instead:
http_request_duration_seconds_count
http_request_duration_seconds_sum
Calculates fraction of observations between two values.
# Fraction of requests faster than 100ms
histogram_fraction(0, 0.1, http_request_duration_seconds)
# Percentage of requests between 100ms and 500ms
histogram_fraction(0.1, 0.5, http_request_duration_seconds) * 100
Use for: Calculating SLO compliance, analyzing distribution.
Functions that operate on range vectors (time series over a duration).
Calculate statistics over a time range.
# Average value over last 5 minutes
avg_over_time(cpu_usage_percent[5m])
# Maximum value over last hour
max_over_time(memory_usage_bytes[1h])
# Minimum value over last 10 minutes
min_over_time(disk_available_bytes[10m])
# Sum of values over time range
sum_over_time(event_counter[1h])
# Count of samples in time range
count_over_time(metric[5m])
# Standard deviation over time
stddev_over_time(response_time[5m])
# Variance over time
stdvar_over_time(response_time[5m])
# Quantile over time
quantile_over_time(0.95, response_time[5m])
# Returns 1 if any sample exists in the range
present_over_time(metric[5m])
# Changes (count of value changes)
changes(metric[5m])
Best practices:
- Use with gauges; for counters use rate() instead
- Match the range to the signal: [5m], [1h], [1d]
Use cases:
- avg_over_time(): Smoothing noisy gauges
- max_over_time() / min_over_time(): Peak/trough detection
- changes(): Detecting flapping or instability
Calculates per-second derivative using linear regression.
# Rate of change of queue length
deriv(queue_length[5m])
Use for: Predicting trends, detecting gradual changes.
Predicts value at future time using linear regression.
# Predict disk usage in 4 hours
predict_linear(disk_usage_bytes[1h], 4*3600)
# Predict when disk will be full
(disk_capacity_bytes - disk_usage_bytes)
/
deriv(disk_usage_bytes[1h])
Use for: Capacity forecasting, preemptive alerting.
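A common alerting sketch built on predict_linear fires when the projection goes negative within the forecast window (metric name and window illustrative):

```promql
# Alert when free disk space is projected to hit zero within 4 hours,
# based on the trend observed over the last hour
predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600) < 0
```

Alerting on the projection rather than the current value gives operators lead time before the resource is actually exhausted.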
Calculates smoothed value using double exponential smoothing (Holt Linear method).
Prometheus 3.0 Breaking Change: This function was renamed from holt_winters to double_exponential_smoothing in Prometheus 3.0. The old name holt_winters no longer works.
Requires Feature Flag: Must enable with
--enable-feature=promql-experimental-functions
# Smooth and forecast metric (Prometheus 3.0+)
double_exponential_smoothing(metric[1h], 0.5, 0.5)
# For Prometheus 2.x, use the old name:
# holt_winters(metric[1h], 0.5, 0.5)
Parameters:
- sf (smoothing factor): weight given to recent samples versus older ones (0 < sf < 1)
- tf (trend factor): weight given to the trend in the data (0 < tf < 1)
Important Notes:
- Despite the historical holt_winters name, this implements Holt's linear (double exponential) method, without a seasonal component
Use for: Anomaly detection, trend forecasting.
Functions for modifying labels on time series.
Replaces label value using regex. Syntax:
label_replace(v, dst_label, replacement, src_label, regex)
# Extract hostname from instance (remove port)
# Input: instance="server-1:9090" → Output: hostname="server-1"
label_replace(
up,
"hostname", # destination label name
"$1", # replacement ($1 = first capture group)
"instance", # source label
"(.+):\\d+" # regex (capture everything before :port)
)
# Extract region from instance FQDN
# Input: instance="web-1.us-east-1.example.com:9090"
# Output: region="us-east-1"
label_replace(
metric,
"region",
"$1",
"instance",
"[^.]+\\.([^.]+)\\..*"
)
# Create environment label from job name
# Input: job="api-production" → Output: env="production"
label_replace(
metric,
"env",
"$1",
"job",
".*-(.*)"
)
# Copy label to new name (rename)
label_replace(
metric,
"service", # new label name
"$1",
"job", # original label
"(.*)" # match everything
)
# Add static prefix/suffix to label
label_replace(
metric,
"full_name",
"prefix-$1-suffix",
"name",
"(.*)"
)
# Non-matching regex: the series passes through unchanged
label_replace(
metric,
"extracted",
"$1",
"optional_label",
"pattern-(.*)" # dst label is only set when the regex matches
)
Syntax notes:
- $1, $2, etc. refer to regex capture groups
- The regex must match the entire source label value (fully anchored)
- If the regex does not match, the time series is returned unmodified
Use for: Creating new labels, extracting parts of label values, renaming labels.
Joins multiple label values with a separator. Syntax:
label_join(v, dst_label, separator, src_label1, src_label2, ...)
# Combine job and instance into single label
# Input: job="api", instance="server-1" → Output: job_instance="api:server-1"
label_join(
metric,
"job_instance", # destination label name
":", # separator
"job", # first source label
"instance" # second source label
)
# Create full path from multiple labels
# Input: namespace="prod", service="api", pod="api-xyz"
# Output: full_path="prod/api/api-xyz"
label_join(
metric,
"full_path",
"/",
"namespace",
"service",
"pod"
)
# Create unique identifier
label_join(
metric,
"uid",
"-",
"cluster",
"namespace",
"pod"
)
# Join with empty separator (concatenate)
label_join(
metric,
"combined",
"",
"prefix",
"name"
)
Use for: Combining labels for grouping, creating unique identifiers, display purposes.
The info() function (experimental in Prometheus 3.x) enriches metrics with labels from info metrics like target_info.
Requires Feature Flag: Must enable with
--enable-feature=promql-experimental-functions
Syntax: info(v instant-vector, [data-label-selector instant-vector])
# Enrich metrics with target_info labels
info(
rate(http_requests_total[5m]),
{k8s_cluster_name=~".+"}
)
# Without data-label-selector (adds all data labels from matching info metrics)
info(rate(http_requests_total[5m]))
# Equivalent using raw join (works in all Prometheus versions)
rate(http_requests_total[5m])
* on (job, instance) group_left (k8s_cluster_name, k8s_namespace_name)
target_info
How it works:
- For each series in v, adds data labels from all info series with matching identifying labels (job, instance)
Current Limitations:
- Only the target_info metric is currently supported
Use for: Adding resource attributes from OpenTelemetry, enriching metrics with metadata, simplifying group_left joins with info metrics.
Miscellaneous utility functions.
Returns 1-element vector if input is empty, otherwise returns empty.
# Alert if metric is missing
absent(up{job="critical-service"})
# Alert if no instances are up
absent(up{job="api"} == 1)
Use for: Alerting on missing metrics or time series.
Returns 1 if no samples exist in the time range.
# Alert if no data for 10 minutes
absent_over_time(metric[10m])
Use for: Detecting data gaps, scrape failures.
Converts single-element instant vector to scalar.
# Convert vector to scalar for math
scalar(sum(up{job="api"}))
# Use in calculations
metric * scalar(sum(scaling_factor))
Warning: Returns NaN if input has 0 or >1 elements.
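One way to guard against the empty-input case is to supply a default with or vector(0). A sketch, with scaling_factor illustrative:

```promql
# Falls back to 0 (instead of NaN) when the inner selector matches nothing;
# works because both sides of `or` carry no labels after sum()
metric * scalar(sum(scaling_factor) or vector(0))
```

Pick the default deliberately: a fallback of 0 zeroes out the product, which may or may not be the behavior you want when the series is absent.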
Converts scalar to single-element instant vector.
# Convert number to vector
vector(123)
# Current timestamp as vector
vector(time())
Use for: Combining scalars with vector operations.
Sorts instant vector by value.
# Sort ascending
sort(http_requests_total)
# Sort descending
sort_desc(http_requests_total)
Use for: Display ordering (topk/bottomk are usually better).
Returns constant 1 for each time series, removing all values.
# Get all time series without values
group(metric)
Use for: Existence checks, label discovery.
Functions can be chained to build complex queries:
# Multi-stage aggregation
topk(10,
sum by (endpoint) (
rate(http_requests_total{job="api"}[5m])
)
)
# Nested time-based calculations
max_over_time(
rate(metric[5m])[1h:1m]
)
# Complex ratio with aggregations
(
sum by (job) (rate(http_errors_total[5m]))
/
sum by (job) (rate(http_requests_total[5m]))
) * 100
Range Vector Size: Larger ranges process more data
- [5m] is fast and usually sufficient
- [1h] or larger can be expensive
Cardinality: Functions on high-cardinality metrics are expensive
Subqueries: Can be very expensive
Regex: Slower than exact matches
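A sketch of the matcher trade-off (label values illustrative):

```promql
# Exact matcher - fastest
http_requests_total{job="api"}
# Regex matcher - avoid when an exact match would do
http_requests_total{job=~"api"}
# Regex is worth the cost when you genuinely need a pattern
http_requests_total{job=~"api-(prod|staging)"}
```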
- Use = instead of =~ when possible
For Counters (metrics with _total, _count, _sum, _bucket):
- rate() for per-second rates
- irate() for spike detection
- increase() for totals over a window
- resets() for restart detection
For Gauges (memory, temperature, queue depth):
- avg_over_time() for smoothing noisy values
- max_over_time() / min_over_time() for peaks and troughs
For Histograms (_bucket suffix with le label):
- histogram_quantile() for percentiles
- _sum / _count for averages
- _count with rate() for observation rates
For Summaries (pre-calculated quantiles):
- _sum / _count for averages
Install with Tessl CLI:
npx tessl i pantheon-ai/promql-generator