Complete PromQL toolkit with generation and validation capabilities
Comprehensive guide to writing efficient, maintainable, and correct PromQL queries.
Problem: Querying metrics without label filters can match thousands or millions of time series, causing performance issues and timeouts.
# ❌ Bad: No filtering, matches all time series
rate(http_requests_total[5m])
# ✅ Good: Specific filtering
rate(http_requests_total{job="api-server", environment="production"}[5m])

Best practices:
- Always include a job label filter
- Add environment or cluster for multi-environment setups
- Use instance for single-instance queries
- Filter on endpoint, method, status_code as needed

Problem: Regex matching (=~) is significantly slower than exact matching (=).
# ❌ Bad: Unnecessary regex for exact match
http_requests_total{status_code=~"200"}
# ✅ Good: Exact match is faster
http_requests_total{status_code="200"}
# ✅ Good: Regex when truly needed
http_requests_total{status_code=~"2.."} # All 2xx codes
http_requests_total{instance=~"prod-.*"} # Pattern matching

When regex is appropriate:
- Pattern matching across instances: instance=~"prod-.*"
- Alternation over a few exact values: status_code=~"200|201|202"
- Character classes: status_code=~"5.."

Optimization tips:
- Keep regex prefixes literal, as in =~"^prod-.*" (Prometheus anchors label matchers on both ends, so the ^ is redundant but harmless)

Problem: Labels with many unique values create a massive number of time series.
# ❌ Bad: user_id creates one series per user (high cardinality)
sum by (user_id) (rate(requests_total[5m]))
# ✅ Good: Aggregate without high-cardinality labels
sum(rate(requests_total[5m]))
# ✅ Good: Use low-cardinality labels
sum by (service, environment) (rate(requests_total[5m]))

High-cardinality labels to avoid in aggregations:
- user_id, session_id, request_id
- Raw IP addresses, full URLs, or full paths
- Timestamps or other unbounded values

Solutions:
- Aggregate the label away, or drop it with without()
- Replace unbounded values with bounded ones (e.g., path_pattern instead of full_url)

Problem: Counter metrics always increase; raw values are not useful for analysis.
# ❌ Bad: Raw counter value is not meaningful
http_requests_total
# ✅ Good: Calculate rate (requests per second)
rate(http_requests_total[5m])
# ✅ Good: Calculate total increase over period
increase(http_requests_total[1h])

Counter identification:
- _total suffix (e.g., requests_total, errors_total)
- _count (e.g., http_requests_count)
- _sum (e.g., request_duration_seconds_sum)
- _bucket (e.g., request_duration_seconds_bucket)

Problem: Gauge metrics represent current state, not cumulative values.
# ❌ Bad: rate() on gauge doesn't make sense
rate(memory_usage_bytes[5m])
# ✅ Good: Use gauge value directly
memory_usage_bytes
# ✅ Good: Use *_over_time functions for analysis
avg_over_time(memory_usage_bytes[5m])
max_over_time(memory_usage_bytes[1h])

Gauge examples:
- memory_usage_bytes
- cpu_temperature_celsius
- queue_length
- active_connections

Problem: histogram_quantile() requires proper aggregation and the le label.
# ❌ Bad: Missing aggregation
histogram_quantile(0.95, rate(request_duration_seconds_bucket[5m]))
# ❌ Bad: Missing le label in aggregation
histogram_quantile(0.95, sum(rate(request_duration_seconds_bucket[5m])))
# ❌ Bad: Missing rate() on buckets
histogram_quantile(0.95, sum by (le) (request_duration_seconds_bucket))
# ✅ Good: Correct usage
histogram_quantile(0.95,
sum by (le) (rate(request_duration_seconds_bucket[5m]))
)
# ✅ Good: Preserving additional labels
histogram_quantile(0.95,
sum by (service, le) (rate(request_duration_seconds_bucket[5m]))
)

Requirements for histogram_quantile():
- Apply rate() or irate() to the bucket counters
- Aggregate with sum
- Preserve the le label in the aggregation

Problem: Averaging quantiles is mathematically invalid and produces incorrect results.
# ❌ Bad: Averaging quantiles is wrong
avg(request_duration_seconds{quantile="0.95"})
# ✅ Good: Use _sum and _count to calculate average
sum(rate(request_duration_seconds_sum[5m]))
/
sum(rate(request_duration_seconds_count[5m]))
# ✅ Good: If you need quantiles, use histogram
histogram_quantile(0.95,
sum by (le) (rate(request_duration_seconds_bucket[5m]))
)

by(): keeps only the specified labels, removes all others.
without(): removes the specified labels, keeps all others.
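To see numerically why the averaging-quantiles pitfall above produces wrong answers, here is a small Python sketch. The bucket bounds and counts are invented, and the interpolation is a simplified version of histogram_quantile()'s linear scheme (the real function also handles the +Inf bucket and other edge cases):

```python
def bucket_quantile(q, buckets):
    """Simplified histogram_quantile(): linear interpolation inside buckets.
    buckets: sorted list of (upper_bound, cumulative_count) pairs."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for upper, cum in buckets:
        if cum >= rank:
            # interpolate where `rank` falls inside this bucket
            frac = (rank - prev_count) / (cum - prev_count)
            return prev_bound + (upper - prev_bound) * frac
        prev_bound, prev_count = upper, cum
    return buckets[-1][0]

# Two instances with very different latency distributions (seconds):
fast = [(0.1, 90), (0.5, 99), (1.0, 100)]
slow = [(0.1, 10), (0.5, 50), (1.0, 100)]

avg_of_p95s = (bucket_quantile(0.95, fast) + bucket_quantile(0.95, slow)) / 2

# Pooling the buckets first (what sum by (le) does) gives the true p95:
pooled = [(0.1, 100), (0.5, 149), (1.0, 200)]
true_p95 = bucket_quantile(0.95, pooled)

print(round(avg_of_p95s, 3), round(true_p95, 3))  # ≈ 0.636 vs ≈ 0.902
```

The averaged value badly understates the pooled 95th percentile, because the slow instance dominates the true tail.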
# Use by() when you know exactly what labels you want to keep
sum by (service, environment) (rate(requests_total[5m]))
# Use without() when you want to remove specific labels
sum without (instance, pod) (rate(requests_total[5m]))

When to use each:
- by(): when you know exactly which labels the result should keep
- without(): when you want to drop noisy labels (instance, pod) and keep everything else
Always aggregate before calling histogram_quantile():
# ❌ Bad: Trying to aggregate after quantile calculation
sum(
histogram_quantile(0.95, rate(request_duration_seconds_bucket[5m]))
)
# ✅ Good: Aggregate first, then calculate quantile
histogram_quantile(0.95,
sum by (le) (rate(request_duration_seconds_bucket[5m]))
)
# ✅ Good: Aggregate with grouping
histogram_quantile(0.95,
sum by (service, le) (rate(request_duration_seconds_bucket[5m]))
)

Choose the right aggregation for your use case:
# sum: For counting, totaling
sum(up{job="api"}) # Total number of instances
# avg: For average values
avg(cpu_usage_percent) # Average CPU across instances
# max/min: For identifying extremes
max(memory_usage_bytes) # Instance with highest memory use
# count: For counting series
count(up{job="api"} == 1) # Number of healthy instances
# topk/bottomk: For top/bottom N
topk(10, rate(requests_total[5m])) # Top 10 by request rate
# quantile: For percentiles across simple metrics
quantile(0.95, response_time_seconds) # 95th percentile

The number of time series a query touches matters most for query performance.
# Check cardinality of a metric
count(metric_name)
# Check cardinality by label
count by (label_name) (metric_name)
# Identify high-cardinality metrics
topk(10, count by (__name__) ({__name__=~".+"}))

Strategies to reduce cardinality:
- Drop or rewrite high-cardinality labels at scrape time (metric_relabel_configs)
- Aggregate away unneeded labels with recording rules
- Avoid unbounded label values (IDs, timestamps, full URLs)
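As a back-of-the-envelope check before adding a label, the worst case is that the series count multiplies by the label's distinct-value count. A quick Python sketch (the label names and counts are purely illustrative):

```python
from math import prod

def worst_case_series(label_cardinalities):
    """Upper bound on time series for one metric: product of distinct
    values per label (reality is lower if not all combinations occur)."""
    return prod(label_cardinalities.values(), start=1)

ok = worst_case_series({"service": 20, "environment": 3, "status_code": 8})
bad = worst_case_series({"service": 20, "environment": 3, "user_id": 1_000_000})
print(ok, bad)  # 480 vs 60000000 — one label, five orders of magnitude
```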
Larger time ranges process more data and run slower.
# ❌ Slow: Very large range for rate
rate(requests_total[1h])
# ✅ Fast: Appropriate range for rate
rate(requests_total[5m])
# For recording rules: Pre-compute common ranges
# Then use the recorded metric instead
job:requests:rate5m # Recorded metric

Time range guidelines:
- Use [1m] to [5m] for real-time monitoring
- Use [1h] to [1d] only when the question genuinely needs it
- Pre-compute with a recording rule at [5m] if the query runs frequently

Subqueries can multiply query cost dramatically.
# ❌ Expensive: Subquery over long range
max_over_time(rate(metric[5m])[7d:1h])
# ✅ Better: Use recording rule
max_over_time(job:metric:rate5m[7d])
# ✅ Better: Reduce range if possible
max_over_time(rate(metric[5m])[1d:1h])

Subquery cost ≈ (range_duration / resolution) × base_query_cost
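Plugging numbers into that formula shows why the 7-day subquery above is expensive (evaluation counts only; actual cost also depends on the inner query):

```python
DAY, HOUR = 86_400, 3_600

def inner_evaluations(range_seconds, resolution_seconds):
    # a subquery re-evaluates the inner expression once per resolution step
    return range_seconds // resolution_seconds

print(inner_evaluations(7 * DAY, HOUR))  # [7d:1h] -> 168 inner evaluations
print(inner_evaluations(1 * DAY, HOUR))  # [1d:1h] -> 24 inner evaluations
# a recording rule replaces all of this with a simple read of stored samples
```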
Recording rules pre-compute expensive queries.
# Recording rule configuration
groups:
- name: request_rates
interval: 30s
rules:
# Pre-compute expensive aggregation
- record: job:http_requests:rate5m
expr: sum by (job) (rate(http_requests_total[5m]))
# Pre-compute complex quantile
- record: job:http_latency:p95
expr: |
histogram_quantile(0.95,
sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
)

Use recording rules when:
- The same expensive expression is evaluated repeatedly (dashboards, alerts)
- Queries span long ranges or many series
- Other rules or SLO calculations need consistent, pre-aggregated inputs
Too short: noisy, sensitive to scrape jitter.
Too long: hides important spikes, slow to react.
# Real-time monitoring: 1-5 minutes
rate(requests_total[2m])
rate(requests_total[5m])
# Trend analysis: 15 minutes to 1 hour
rate(requests_total[15m])
rate(requests_total[1h])
# Historical analysis: Hours to days
rate(requests_total[6h])
rate(requests_total[1d])

Guidelines:
- Use a range of at least 4x the scrape interval; with 15-30s scrapes, [1m] is the bare minimum and [2m] is safer
- [5m] works well for most cases

# rate(): Average over time range, smooth
rate(requests_total[5m])
# irate(): Instant based on last 2 points, volatile
irate(requests_total[5m])

When to use irate():
- Volatile, fast-moving counters on high-resolution graphs
- Seeing the most recent change rather than the trend

When to use rate():
- Alerting and SLO calculations, where smoothing avoids false positives
- Dashboards and trend analysis over longer ranges
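The difference can be sketched in Python over a toy sample series (timestamps in seconds, monotonically increasing counter values). Note that the real rate() also corrects for counter resets and extrapolates to the window boundaries, which this sketch omits:

```python
def approx_rate(samples):
    """Average per-second increase across the whole window (like rate())."""
    (t0, v0), (tn, vn) = samples[0], samples[-1]
    return (vn - v0) / (tn - t0)

def approx_irate(samples):
    """Per-second increase over only the last two samples (like irate())."""
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    return (v1 - v0) / (t1 - t0)

# Steady traffic, then a burst in the final scrape interval:
samples = [(0, 100), (15, 130), (30, 160), (45, 400)]
print(approx_rate(samples))   # 300/45 ≈ 6.67 req/s — burst smoothed out
print(approx_irate(samples))  # 240/15 = 16.0 req/s — burst fully visible
```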
Format: level:metric:operations
# level: Aggregation level (job, service, cluster)
# metric: Base metric name
# operations: Functions applied (rate5m, p95, sum)
rules:
# Good examples
- record: job:http_requests:rate5m
expr: sum by (job) (rate(http_requests_total[5m]))
- record: job_endpoint:http_latency:p95
expr: |
histogram_quantile(0.95,
sum by (job, endpoint, le) (rate(http_request_duration_seconds_bucket[5m]))
)
- record: cluster:cpu_usage:ratio
expr: |
sum(rate(node_cpu_seconds_total{mode!="idle"}[5m]))
/
sum(rate(node_cpu_seconds_total[5m]))

# Instead of running this expensive query repeatedly:
# histogram_quantile(0.95, sum by (le) (rate(latency_bucket[5m])))
# Create a recording rule:
- record: :http_request_duration:p95
expr: |
histogram_quantile(0.95,
sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
# Then use the recorded metric:
# :http_request_duration:p95

Build complex metrics in stages:
# Layer 1: Basic rates
- record: instance:requests:rate5m
expr: rate(http_requests_total[5m])
# Layer 2: Job-level aggregation
- record: job:requests:rate5m
expr: sum by (job) (instance:requests:rate5m)
# Layer 3: Derived metrics
- record: job:error_ratio:rate5m
expr: |
sum by (job) (instance:requests:rate5m{status_code=~"5.."})
/
job:requests:rate5m

An alert fires for each series its expression returns; use comparison operators so the expression returns results only while the condition holds.
# ✅ Good: Boolean expression
(
sum(rate(errors_total[5m]))
/
sum(rate(requests_total[5m]))
) > 0.05
# ✅ Good: Explicit comparison
http_requests_rate < 10
# ✅ Good: Complex boolean
(cpu_usage > 80) and (memory_usage > 90)

Use a for duration for stability

Avoid alerting on transient spikes.
# Alert only after condition persists for 10 minutes
- alert: HighErrorRate
expr: |
(
sum(rate(errors_total[5m]))
/
sum(rate(requests_total[5m]))
) > 0.05
for: 10m
annotations:
summary: "Error rate above 5% for 10+ minutes"

for duration guidelines:
- Urgent, fast-moving conditions: 5m
- Most alerts: 10m to 15m
- Slow-burn capacity or trend alerts: 30m+
- Hard-down emergencies: 0m (no for)

# ✅ Good: Include labels that identify the problem
sum by (service, environment) (
rate(errors_total[5m])
) > 100
# Alerts will show which service and environment# ❌ Bad: Too generic
absent(up)
# ✅ Good: Specific service
absent(up{job="critical-service"})
# ✅ Good: With timeout
absent_over_time(up{job="critical-service"}[10m])

Use multi-line formatting for readability:
# ✅ Good: Multi-line with indentation
histogram_quantile(0.95,
sum by (service, le) (
rate(http_request_duration_seconds_bucket{
environment="production",
job="api-server"
}[5m])
)
)
# ❌ Bad: Single line, hard to read
histogram_quantile(0.95, sum by (service, le) (rate(http_request_duration_seconds_bucket{environment="production", job="api-server"}[5m])))

Comment recording rules to document their purpose and consumers:

rules:
# Calculate p95 latency for all API endpoints
# Used by: API dashboard, SLO calculations, latency alerts
- record: api:http_latency:p95
expr: |
histogram_quantile(0.95,
sum by (endpoint, le) (
rate(http_request_duration_seconds_bucket{job="api"}[5m])
)
)

# ✅ Good: Clear purpose from name
- record: api:error_rate:ratio5m
- record: db:query_duration:p99
- record: cluster:memory_usage:bytes
# ❌ Bad: Unclear names
- record: metric1
- record: temp_calc
- record: x

# ❌ Anti-pattern
rate(http_requests_total[5m])
# ✅ Fix
rate(http_requests_total{job="api-server", environment="prod"}[5m])

# ❌ Anti-pattern
metric{label=~"value"}
# ✅ Fix
metric{label="value"}

# ❌ Anti-pattern
rate(memory_usage_bytes[5m])
# ✅ Fix
avg_over_time(memory_usage_bytes[5m])

# ❌ Anti-pattern
http_requests_total
# ✅ Fix
rate(http_requests_total[5m])

# ❌ Anti-pattern
avg(http_duration{quantile="0.95"})
# ✅ Fix
histogram_quantile(0.95,
sum by (le) (rate(http_duration_bucket[5m]))
)

# ❌ Anti-pattern
histogram_quantile(0.95, rate(latency_bucket[5m]))
# ✅ Fix
histogram_quantile(0.95,
sum by (le) (rate(latency_bucket[5m]))
)

# ❌ Anti-pattern
sum by (user_id) (requests) # millions of series
# ✅ Fix
sum(requests) # single series
# Or use low-cardinality labels
sum by (service) (requests)

Check cardinality:
count(your_query)

Verify the result makes sense:
- Compare against a simpler query or a known baseline
- Check the units (per-second rates vs. raw totals)

Test edge cases:
# Test with different ranges
rate(metric[1m])
rate(metric[5m])
rate(metric[1h])
# Verify results are reasonable

# Verify metric exists
count(metric_name) > 0
# Check for gaps
absent_over_time(metric_name[10m])

Before deploying a PromQL query, verify:
- Label filters are present (at least job)
- Exact matching (=) instead of regex when possible
- rate() for counters
- *_over_time() for gauges
- histogram_quantile() with sum by (le) for histograms
- An appropriate time range (typically [5m])
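Parts of this checklist can be automated before deployment. Below is a toy linter in Python; the regex heuristics are illustrative only and nowhere near a real PromQL parser:

```python
import re

def lint_query(query: str) -> list[str]:
    """Flag common PromQL smells from the checklist; heuristic, not a parser."""
    warnings = []
    if "{" not in query:
        warnings.append("no label filters (add at least job=...)")
    # regex matcher whose pattern has no metacharacters -> use = instead
    for _, pattern in re.findall(r'(\w+)=~"([^"]*)"', query):
        if not re.search(r"[.*+?|\[\]()\\]", pattern):
            warnings.append(f'=~"{pattern}" could be an exact match (=)')
    if "_total" in query and "rate(" not in query and "increase(" not in query:
        warnings.append("counter used without rate()/increase()")
    return warnings

print(lint_query("http_requests_total"))
print(lint_query('http_requests_total{status_code=~"200"}'))
print(lint_query('rate(http_requests_total{job="api", status_code=~"5.."}[5m])'))
```

The last query passes cleanly; the first two each trigger the corresponding checklist items.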