Complete PromQL toolkit with generation and validation capabilities
Comprehensive guide to writing efficient, correct, and maintainable PromQL queries.
### Counters

**What they are:** Metrics that only increase (or reset to zero). Examples: `http_requests_total`, `errors_count`.

**Best practices:**

- Use `rate()` or `increase()` with counters:
  - `rate()` for per-second rates: `rate(http_requests_total[5m])`
  - `increase()` for total increase: `increase(http_requests_total[1h])`
- Don't use `rate()` or `increase()` without a range vector.
- Naming convention: counters typically end with `_total`, `_count`, `_sum`, or `_bucket`.
**Examples:**

```promql
# Good: Calculate requests per second
rate(http_requests_total{job="api"}[5m])

# Good: Total requests in last hour
increase(http_requests_total{job="api"}[1h])

# Bad: Raw counter value
http_requests_total{job="api"}
```

### Gauges

**What they are:** Metrics that can go up and down. Examples: `memory_usage_bytes`, `temperature_celsius`.
**Best practices:**

- Use `avg_over_time()`, `max_over_time()`, `min_over_time()` for time windows.
- Use `delta()` for change over time (but this is uncommon).
- Don't use `rate()`, `irate()`, or `increase()` on gauges.

**Examples:**
```promql
# Good: Current memory usage
node_memory_usage_bytes{instance="prod-1"}

# Good: Average over time
avg_over_time(node_memory_usage_bytes{instance="prod-1"}[5m])

# Good: Maximum in last hour
max_over_time(node_cpu_percent{instance="prod-1"}[1h])

# Bad: Rate on gauge
rate(memory_usage_bytes[5m])
```

### Histograms

**What they are:** Multiple time series representing bucketed observations. Metric names end with `_bucket`, `_sum`, and `_count`.
**Best practices:**

- Use `histogram_quantile()` to calculate quantiles.
- Always keep the `le` label in the `by()` clause.
- Use `rate()` on bucket metrics.

**Examples:**
```promql
# Good: Calculate 95th percentile latency
histogram_quantile(0.95,
  sum by (job, le) (
    rate(http_request_duration_seconds_bucket{job="api"}[5m])
  )
)

# Good: Calculate average from histogram
rate(http_request_duration_seconds_sum{job="api"}[5m])
/
rate(http_request_duration_seconds_count{job="api"}[5m])

# Bad: Missing rate()
histogram_quantile(0.95, sum by (le) (http_request_duration_seconds_bucket))

# Bad: Missing 'le' in aggregation
histogram_quantile(0.95, sum by (job) (rate(http_request_duration_seconds_bucket[5m])))
```

### Summaries

**What they are:** Pre-calculated quantiles with `_sum` and `_count`. Includes labels like `quantile="0.95"`.
**Best practices:**

- Use `_sum` and `_count` to calculate averages.
- Never average pre-calculated quantiles across series; the result is mathematically invalid.

**Examples:**
```promql
# Good: Calculate average from summary
rate(http_request_duration_seconds_sum[5m])
/
rate(http_request_duration_seconds_count[5m])

# Bad: Averaging quantiles (mathematically invalid!)
avg(http_request_duration_seconds{quantile="0.95"})
```

### Filter with Specific Labels

**Why:** Reduces cardinality, improves query performance, and makes intent clear.
```promql
# Bad: No filters
http_requests_total

# Good: Specific filters
http_requests_total{job="api-service", environment="production"}

# Good: Multiple filters for precision
http_requests_total{
  job="api-service",
  environment="production",
  datacenter="us-east-1",
  instance="prod-api-1"
}
```

### Prefer Exact Matches over Regex

**Why:** Exact matches are faster (index lookups) than regex (pattern matching).
```promql
# Bad: Regex for exact match
http_requests_total{status=~"200"}

# Good: Exact match
http_requests_total{status="200"}

# Regex is fine when you need it:
http_requests_total{status=~"2[0-9]{2}"}  # All 2xx status codes
```

### Use Regex Alternation Instead of Repeated Queries

```promql
# Bad: Multiple OR queries
sum(http_requests_total{path="/api/users"})
or
sum(http_requests_total{path="/api/products"})
or
sum(http_requests_total{path="/api/orders"})

# Good: Single regex with alternation
sum by (path) (
  http_requests_total{path=~"/api/(users|products|orders)"}
)

# Good: Negative regex for exclusions
http_requests_total{path!~"/health|/metrics"}
```

### Label Matching Operators

- `=` : Equal to
- `!=` : Not equal to
- `=~` : Regex match (fully anchored: `=~"api"` matches only the exact string `api`, not `api-server`)
- `!~` : Regex does not match

### Use Explicit by() or without() Clauses

**Why:** Makes output labels explicit and prevents confusion.
```promql
# Unclear: What labels will remain?
sum(rate(http_requests_total[5m]))

# Clear: Group by these labels
sum by (job, instance) (rate(http_requests_total[5m]))

# Clear: Remove only these labels
sum without (pod, container) (rate(http_requests_total[5m]))
```

### Prefer without() for High-Cardinality Labels

**Why:** More maintainable when you want to keep many labels.
```promql
# Verbose: List all labels to keep
sum by (job, instance, environment, datacenter, region, cluster, zone) (metric)

# Better: Drop only the high-cardinality labels
sum without (pod, container, node) (metric)
```

### Aggregation Operators

- `sum`: Total across series
- `avg`: Average value
- `min`: Minimum value
- `max`: Maximum value
- `count`: Count of series
- `stddev`: Standard deviation
- `stdvar`: Standard variance
- `topk(N, ...)`: Top N series
- `bottomk(N, ...)`: Bottom N series
- `quantile(φ, ...)`: φ-quantile (0 ≤ φ ≤ 1)

```promql
# Sum request rate per service
sum by (service) (rate(http_requests_total[5m]))

# Average CPU across all cores per node
avg by (instance) (rate(node_cpu_seconds_total[5m]))

# Top 10 pods by memory usage
topk(10, container_memory_usage_bytes)

# Count running instances per job
count by (job) (up == 1)
```

### Choosing Range Vector Durations

**Rule of thumb:** Use at least 4x your scrape interval.
With the common 15s scrape interval, that means a minimum `rate()` range of `[1m]`, preferably `[2m]`.

```promql
# Bad: Too short (less than 4x scrape interval)
rate(http_requests_total[30s])

# Good: At least 2 minutes
rate(http_requests_total[2m])

# Common: 5 minutes (good balance of responsiveness and stability)
rate(http_requests_total[5m])

# Longer ranges: More stable, less sensitive to spikes
rate(http_requests_total[15m])
```

### rate() vs irate()

- `irate()`: Instant rate; uses only the last two samples in the range. Keep ranges short, `[2m]` to `[5m]` typically.
- `rate()`: Average rate over the entire range; prefer it for alerting and dashboards.
```promql
# Good: irate with short range
irate(http_requests_total[2m])

# Good: rate for longer range
rate(http_requests_total[5m])

# Bad: irate with long range (only uses last 2 samples anyway!)
irate(http_requests_total[1h])
```

### Subqueries

**Syntax:** `query[range:resolution]`
Use sparingly: Subqueries can be very expensive.
```promql
# Calculate max rate over 30 minutes with 1-minute resolution
max_over_time(
  rate(http_requests_total[5m])[30m:1m]
)

# Bad: Excessive range
max_over_time(
  rate(http_requests_total[5m])[95d:1m]
)  # Processes millions of samples!

# Better: Use recording rules for long ranges
```

### Query Optimization

#### Filter Early

```promql
# Good: Filter before expensive operations
sum(rate(http_requests_total{job="api", status="200"}[5m]))

# Bad: Filter after aggregation (processes more data)
sum(rate(http_requests_total[5m])) and {job="api", status="200"}
```

#### Limit Result Cardinality

```promql
# Instead of processing all series:
sum by (pod) (rate(container_cpu_usage[5m]))

# Limit to top 10 in query:
topk(10, sum by (pod) (rate(container_cpu_usage[5m])))
```

#### Cache Expensive Queries

See Recording Rules section below.
#### Prefer Exact Matches

```promql
# Slower: Regex match
{label=~"value"}

# Faster: Exact match
{label="value"}
```

#### Avoid Repeated Computation

```promql
# Bad: Same rate calculated twice
rate(metric[5m]) / rate(metric[5m] offset 1h)

# Can't be optimized in PromQL directly, but use recording rules:
# - record: metric:rate5m
#   expr: rate(metric[5m])
# Then:
metric:rate5m / (metric:rate5m offset 1h)
```

### Recording Rules

**Purpose:** Pre-compute frequently-used or expensive queries.
**Benefits:** Faster dashboards and alert evaluation, lower query load on Prometheus, and consistent results across consumers.

**When to use:** Queries that run frequently (dashboards, alerts) or that aggregate over many series or long ranges.

**Naming convention:**

`level:metric:operations`

**Examples:**

- `job:http_requests:rate5m`
- `instance:node_cpu:rate1m`
- `job_instance:request_latency_seconds:mean5m`

**Configuration example:**
```yaml
groups:
  - name: example_recording_rules
    interval: 30s
    rules:
      # Basic rate recording
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # Error rate recording
      - record: job:http_requests:error_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))

      # Average latency recording
      - record: job:http_request_latency_seconds:mean5m
        expr: |
          sum by (job) (rate(http_request_duration_seconds_sum[5m]))
          /
          sum by (job) (rate(http_request_duration_seconds_count[5m]))
```

### histogram_quantile() Checklist

```promql
# Calculate quantile
histogram_quantile(0.95,
  sum by (le, job) (
    rate(http_request_duration_seconds_bucket{job="api"}[5m])
  )
)

# Always include 'le' in aggregation
sum by (job, le) (...)  # ✅ Correct
sum by (job) (...)      # ❌ Wrong - missing 'le'

# Use rate() on bucket metrics
rate(http_request_duration_seconds_bucket[5m])  # ✅ Correct
http_request_duration_seconds_bucket            # ❌ Wrong - missing rate()

# Calculate average latency
rate(http_request_duration_seconds_sum[5m])
/
rate(http_request_duration_seconds_count[5m])
```

### Native Histograms

Native histograms are a newer histogram format introduced in Prometheus 2.40 and made stable in 3.0. They offer significant storage and query efficiency improvements over classic histograms.
| Classic Histograms | Native Histograms |
|---|---|
| Separate `_bucket`, `_sum`, `_count` time series | Single time series containing all data |
| Fixed bucket boundaries defined at instrumentation | Dynamic bucket resolution |
| Requires `_bucket` suffix in queries | No `_bucket` suffix needed |
| Always need `le` label in aggregation | No `le` label manipulation |
```promql
# Classic histogram (old way)
histogram_quantile(0.9, sum by (job, le) (rate(http_request_duration_seconds_bucket[10m])))

# Native histogram (simpler - no _bucket suffix, no 'le' label needed)
histogram_quantile(0.9, sum by (job) (rate(http_request_duration_seconds[10m])))
```

Prometheus provides special functions for native histograms:
```promql
# Calculate average from native histogram
histogram_avg(rate(http_request_duration_seconds[5m]))

# Calculate standard deviation
histogram_stddev(rate(http_request_duration_seconds[5m]))

# Calculate standard variance
histogram_stdvar(rate(http_request_duration_seconds[5m]))

# Get observation count
histogram_count(rate(http_request_duration_seconds[5m]))

# Get sum of observations
histogram_sum(rate(http_request_duration_seconds[5m]))

# Get fraction of observations in a range
histogram_fraction(0.1, 0.5, rate(http_request_duration_seconds[5m]))
```

**Still use `rate()` with native histograms** - the histogram functions work with rate-aggregated data:
```promql
# ✅ Correct
histogram_avg(rate(http_request_duration_seconds[5m]))

# ❌ Wrong - missing rate()
histogram_avg(http_request_duration_seconds)
```

**Simpler aggregation** - no need for the `le` label in the `by()` clause:
```promql
# Classic histogram - need 'le'
histogram_quantile(0.95, sum by (job, le) (rate(metric_bucket[5m])))

# Native histogram - no 'le' needed
histogram_quantile(0.95, sum by (job) (rate(metric[5m])))
```

**Enable native histograms in Prometheus** - requires the feature flag at startup:

```shell
# Start Prometheus with native histogram ingestion enabled
prometheus --enable-feature=native-histograms
```

**Check if metrics are native or classic** - query the metric directly to see its format in the response.
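A quick way to tell the two formats apart from PromQL itself (a sketch; the metric name is illustrative): a classic histogram still exposes separate `_bucket` series, while a native histogram responds to the histogram functions on its bare name.

```promql
# Returns series only if the metric is a classic histogram
count({__name__="http_request_duration_seconds_bucket"})

# Should return a result only if the metric is a native histogram
histogram_count(rate(http_request_duration_seconds[5m]))
```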
### Alerting Best Practices

#### Keep Alert Expressions Simple

```yaml
# Bad: Complex alert expression
alert: HighErrorRate
expr: |
  (
    sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
    /
    sum by (job) (rate(http_requests_total[5m]))
  ) > 0.05

# Better: Use recording rule, simple alert
# Recording rule:
- record: job:http_requests:error_rate5m
  expr: ...
# Alert:
alert: HighErrorRate
expr: job:http_requests:error_rate5m > 0.05
```

#### Use the for Clause to Avoid Flapping

```yaml
- alert: HighMemoryUsage
  expr: node_memory_usage_percent > 90
  for: 5m  # Must be true for 5 minutes
  annotations:
    summary: "High memory usage on {{ $labels.instance }}"
```

#### Alert on Sudden Rate Changes

```promql
# Alert if request rate drops suddenly
(
  rate(http_requests_total[5m])
  /
  rate(http_requests_total[5m] offset 1h)
) < 0.5  # Less than 50% of rate 1 hour ago
```

### Common Calculation Patterns

#### Error Rate as a Percentage
```promql
(
  sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
) * 100
```

#### Success Rate as a Percentage
```promql
(
  sum(rate(http_requests_total{status=~"2.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
) * 100
```

#### Memory Usage Percentage
```promql
(
  node_memory_usage_bytes
  /
  node_memory_total_bytes
) * 100
```

#### Compare Current Value to 1 Day Ago
```promql
rate(http_requests_total[5m])
/
rate(http_requests_total[5m] offset 1d)
```

#### Alert if Current Rate Exceeds 2x the Max Rate in the Last Hour
```promql
rate(metric[5m])
>
max_over_time(rate(metric[5m])[1h:]) * 2
```

#### Detect Missing Metrics
```promql
# Alert if metric disappears
absent(up{job="critical-service"})

# Alert if metric was present but now gone
absent_over_time(up{job="critical-service"}[5m])
```

#### Join Labels from an Info Metric
```promql
# Add labels from info metric to other metrics
rate(http_requests_total[5m])
* on (job, instance) group_left (version, commit)
service_info
```

### Quick Reference

| Pattern | Use Case | Example |
|---|---|---|
| `rate(counter[5m])` | Per-second rate of counter | `rate(http_requests_total[5m])` |
| `increase(counter[1h])` | Total increase in counter | `increase(requests_total[1h])` |
| `gauge` | Current value | `node_memory_usage_bytes` |
| `avg_over_time(gauge[5m])` | Average gauge over time | `avg_over_time(cpu_percent[5m])` |
| `histogram_quantile(0.95, ...)` | Calculate percentile | See histogram section |
| `sum by (label) (...)` | Aggregate by labels | `sum by (job) (rate(metric[5m]))` |
| `topk(N, ...)` | Top N series | `topk(10, metric)` |
| `absent(metric)` | Check if metric missing | `absent(up{job="api"})` |
| `metric offset 1h` | Historical comparison | `rate(metric[5m] offset 1h)` |
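Several of the patterns above compose naturally. As one illustrative sketch (the metric name, `job` value, and threshold are assumptions, not taken from a specific system), an availability-style expression combining label filtering, `rate()`, aggregation, and a ratio:

```promql
# Fire when 5-minute availability of the api job drops below 99.9%
(
  sum by (job) (rate(http_requests_total{job="api", status=~"2.."}[5m]))
  /
  sum by (job) (rate(http_requests_total{job="api"}[5m]))
) * 100 < 99.9
```

As the alerting section suggests, the ratio itself is a good candidate for a recording rule, leaving the alert expression as a simple threshold comparison.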