Comprehensive toolkit for validating, optimizing, and understanding Prometheus Query Language (PromQL) queries. Use this skill when working with PromQL queries to check syntax, detect anti-patterns, identify optimization opportunities, and interactively plan queries with users.
Overall
score
93%
Does it follow best practices?
Validation for skill structure
Comprehensive guide to common mistakes, anti-patterns, and pitfalls in PromQL queries.
Problem: Querying metrics without any label filters matches all time series, causing high cardinality.
# ❌ BAD: No filters - could match thousands of series
http_requests_total
# ❌ BAD: Empty label matcher
http_requests_total{}
# ✅ GOOD: Specific label filters
http_requests_total{job="api-service", environment="production"}Impact:
Detection: Look for metric names without {...} selectors or with empty {}.
Problem: Querying on labels with thousands of unique values.
# ❌ BAD: User ID has millions of unique values
http_requests_total{user_id="12345"}
# ✅ GOOD: Use low-cardinality labels
http_requests_total{job="api", endpoint="/users"}High-cardinality labels to avoid:
Low-cardinality labels (safe to use):
Problem: Overly broad regex patterns match too many series.
# ❌ BAD: Matches everything
http_requests_total{path=~".*"}
# ❌ BAD: Very broad pattern
http_requests_total{instance=~".*-prod-.*"}
# ✅ GOOD: Specific pattern with other filters
http_requests_total{
job="api",
instance=~"api-prod-[0-9]+",
datacenter="us-east-1"
}Problem: Using counter metrics without rate() or increase() gives meaningless values.
# ❌ BAD: Raw counter value (always increasing, not useful)
http_requests_total{job="api"}
# ❌ BAD: Aggregating raw counters
sum(http_requests_total)
# ✅ GOOD: Use rate() for per-second rate
rate(http_requests_total{job="api"}[5m])
# ✅ GOOD: Use increase() for total increase
increase(http_requests_total{job="api"}[1h])Why it's wrong: Counters only increase (or reset). The raw value shows total count since process start, not current rate.
Problem: rate(), irate(), and increase() assume monotonically increasing values. Gauges go up and down.
# ❌ BAD: rate() on gauge (memory can go up or down)
rate(node_memory_usage_bytes[5m])
# ❌ BAD: irate() on gauge
irate(cpu_temperature_celsius[5m])
# ✅ GOOD: Use gauge directly
node_memory_usage_bytes
# ✅ GOOD: Or use avg_over_time for smoothing
avg_over_time(node_memory_usage_bytes[5m])
# ✅ GOOD: Use delta() if you need change over time
delta(cpu_temperature_celsius[5m])How to identify:
_total, _count, _sum, _bucket_bytes, _usage, _percent, _celsiusProblem: rate(), irate(), increase() require a time range.
# ❌ BAD: Missing range vector
rate(http_requests_total)
# ❌ BAD: Missing range vector
increase(requests_total{job="api"})
# ✅ GOOD: Include time range
rate(http_requests_total[5m])
# ✅ GOOD: Include range for increase
increase(requests_total{job="api"}[1h])Error message: "parse error: expected type range vector in call to function, got instant vector"
Problem: Subqueries over very long time ranges process millions of samples.
# ❌ BAD: 95-day subquery (extremely slow, may timeout)
max_over_time(rate(http_requests_total[5m])[95d:1m])
# ❌ BAD: Long range with high resolution
avg_over_time(metric[30d:10s])
# ✅ GOOD: Reasonable time range
max_over_time(rate(http_requests_total[5m])[7d:5m])
# ✅ BETTER: Use recording rules for long-term analysis
# Create recording rule:
# - record: :http_requests:rate5m
# expr: rate(http_requests_total[5m])
# Then query:
max_over_time(:http_requests:rate5m[30d:1h])Impact:
Problem: Using regex (=~) when exact match (=) would work.
# ❌ BAD: Regex for exact match (slower)
http_requests_total{status=~"200"}
# ❌ BAD: Regex that's actually exact
http_requests_total{job=~"api-service"}
# ✅ GOOD: Exact match (faster index lookup)
http_requests_total{status="200"}
# ✅ GOOD: Exact match
http_requests_total{job="api-service"}Performance difference: Exact matches can be 5-10x faster due to index lookups vs pattern matching.
When regex IS appropriate:
# ✅ GOOD: Multiple alternatives
http_requests_total{status=~"200|201|204"}
# ✅ GOOD: Pattern matching
http_requests_total{path=~"/api/v[0-9]+/.*"}
# ✅ GOOD: Exclusions
http_requests_total{path!~"/health|/metrics"}Problem: Running expensive queries repeatedly in multiple dashboards/alerts.
# ❌ BAD: Complex query used in 10 dashboards, evaluated 100 times/minute
sum by (job, instance) (
rate(http_request_duration_seconds_sum{job="api"}[5m])
) /
sum by (job, instance) (
rate(http_request_duration_seconds_count{job="api"}[5m])
)
# ✅ GOOD: Create recording rule (evaluated once per cycle)
# prometheus.yml:
# - record: job_instance:http_request_duration_seconds:mean5m
# expr: |
# sum by (job, instance) (
# rate(http_request_duration_seconds_sum{job="api"}[5m])
# ) /
# sum by (job, instance) (
# rate(http_request_duration_seconds_count{job="api"}[5m])
# )
# Then use pre-computed metric:
job_instance:http_request_duration_seconds:mean5mWhen to use recording rules:
Problem: Filtering after expensive aggregation processes unnecessary data.
# ❌ BAD: Aggregates all jobs first, then filters
sum(rate(http_requests_total[5m])) and {job="api"}
# ❌ BAD: Processes all data before filtering
sum by (job) (rate(http_requests_total[5m])) == {job="api"}
# ✅ GOOD: Filter first, then aggregate
sum(rate(http_requests_total{job="api"}[5m]))
# ✅ GOOD: Specific filters reduce data early
sum by (path) (rate(http_requests_total{job="api", status="200"}[5m]))Performance impact: 10-100x slower when filtering after aggregation.
Problem: Averaging quantiles across instances is mathematically invalid.
# ❌ BAD: Mathematically incorrect!
avg(http_request_duration_seconds{quantile="0.95"})
# ❌ BAD: Summing quantiles is also wrong
sum(response_time_seconds{quantile="0.99"})
# ✅ GOOD: Calculate quantile from histogram buckets
histogram_quantile(0.95,
sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
# ✅ GOOD: Calculate average from _sum and _count
rate(http_request_duration_seconds_sum[5m])
/
rate(http_request_duration_seconds_count[5m])Why it's wrong: Quantiles are non-additive. The average of two 95th percentiles is NOT the overall 95th percentile.
Solution: Use histograms instead of summaries when you need aggregation.
Problem: Dividing metrics with different label sets gives unexpected results.
# ❌ BAD: Labels don't match (no instance on right side)
rate(http_requests_total{job="api", instance="host1"}[5m])
/
rate(http_requests_total{job="api"}[5m])
# Result: No data (label mismatch)
# ✅ GOOD: Ensure both sides have same label filters
rate(http_requests_total{job="api", status="500"}[5m])
/
rate(http_requests_total{job="api"}[5m])
# ✅ GOOD: Use aggregation to match label dimensions
sum(rate(http_requests_total{status="500"}[5m]))
/
sum(rate(http_requests_total[5m]))Problem: Using offset incorrectly or misunderstanding its placement.
# ✅ CORRECT: offset after range vector selector
http_requests_total[5m] offset 1h
# ✅ CORRECT: offset with instant vector
http_requests_total offset 1h
# ✅ CORRECT: offset inside rate() function
rate(http_requests_total[5m] offset 1h)
# ❌ BAD: offset between metric name and range (invalid syntax)
http_requests_total offset 1h [5m]Note: The offset modifier shifts the time range back by the specified duration. It comes AFTER the selector (including range vector bracket if present).
Problem: Using multiple OR operations instead of regex alternation.
# ❌ BAD: Multiple queries combined with OR
http_requests_total{job="api"}
or
http_requests_total{job="web"}
or
http_requests_total{job="worker"}
# ✅ GOOD: Single regex with alternatives
http_requests_total{job=~"api|web|worker"}
# ✅ GOOD: With aggregation
sum by (job) (rate(http_requests_total{job=~"api|web|worker"}[5m]))Performance: Single regex is 3-5x faster than multiple ORs.
Problem: Unclear what labels remain after aggregation.
# ❌ BAD: What labels are in the result?
sum(rate(http_requests_total[5m]))
# ❌ BAD: Unclear aggregation
avg(node_memory_usage_bytes)
# ✅ GOOD: Explicit grouping
sum by (job, instance) (rate(http_requests_total[5m]))
# ✅ GOOD: Explicit label dropping
sum without (pod, container) (rate(http_requests_total[5m]))Problem: Order of operations affects results for ratios.
# ❌ BAD: Sum first, then divide (wrong denominator)
sum(rate(http_request_duration_seconds_sum[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))
# This gives overall average, not average per instance
# ✅ GOOD: Divide first, then aggregate (if you want per-instance avg)
sum(
rate(http_request_duration_seconds_sum[5m])
/
rate(http_request_duration_seconds_count[5m])
)
# Note: Both may be valid depending on your goal!
# Be explicit about what you're calculating.Problem: irate() only uses the last two samples, making longer ranges wasteful.
# ❌ BAD: irate over 1 hour (only uses last 2 samples!)
irate(http_requests_total[1h])
# ❌ BAD: irate over 10 minutes (still only 2 samples)
irate(http_requests_total[10m])
# ✅ GOOD: Use rate() for longer ranges
rate(http_requests_total[1h])
# ✅ GOOD: Use irate() with short range
irate(http_requests_total[2m])When to use irate():
When to use rate():
Problem: rate() range shorter than 4x scrape interval gives inaccurate results.
# ❌ BAD: 30s range with 15s scrape interval (only 2 samples)
rate(http_requests_total[30s])
# ❌ BAD: 1m range might not have enough samples
rate(http_requests_total[1m])
# ✅ GOOD: At least 4x scrape interval (for 15s scrape: 1m minimum)
rate(http_requests_total[2m])
# ✅ GOOD: 5m is a common, safe choice
rate(http_requests_total[5m])Rule: rate_range >= 4 * scrape_interval
Problem: histogram_quantile needs rate() on bucket metrics.
# ❌ BAD: Missing rate() (uses raw bucket counts)
histogram_quantile(0.95,
sum by (le) (http_request_duration_seconds_bucket)
)
# ✅ GOOD: Include rate()
histogram_quantile(0.95,
sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)Problem: histogram_quantile requires the 'le' (less than or equal) label.
# ❌ BAD: Missing 'le' in aggregation
histogram_quantile(0.95,
sum by (job) (rate(http_request_duration_seconds_bucket[5m]))
)
# ✅ GOOD: Include 'le' in by() clause
histogram_quantile(0.95,
sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
)
# ✅ GOOD: Remove other labels but keep 'le'
histogram_quantile(0.95,
sum without (instance, pod) (rate(http_request_duration_seconds_bucket[5m]))
)Problem: Summary quantiles cannot be aggregated across instances.
# ❌ BAD: Cannot meaningfully aggregate summary quantiles
avg(http_request_duration_seconds{quantile="0.95"})
# ✅ SOLUTION: Use histograms instead of summaries
# Histograms allow aggregation:
histogram_quantile(0.95,
sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)When to use each:
Problem: Applying the same function twice or unnecessary nesting.
# ❌ BAD: Double rate (doesn't make sense)
rate(rate(http_requests_total[5m])[10m])
# ❌ BAD: Unnecessary nesting
avg(avg_over_time(metric[5m]))
# ✅ GOOD: Single function
rate(http_requests_total[5m])
# ✅ GOOD: Single aggregation
avg_over_time(metric[5m])Problem: One-to-many joins require group_left or group_right.
# ❌ BAD: Many-to-one join without group_left
rate(http_requests_total[5m])
* on (job, instance)
service_info
# Error: "multiple matches for labels"
# ✅ GOOD: Use group_left to include labels from right side
rate(http_requests_total[5m])
* on (job, instance) group_left (version, commit)
service_infoBefore running your query, check:
Install with Tessl CLI
npx tessl i pantheon-ai/promql-validator