Comprehensive toolkit for generating best practice PromQL (Prometheus Query Language) queries following current standards and conventions. Use this skill when creating new PromQL queries, implementing monitoring and alerting rules, or building observability dashboards.
Comprehensive guide to writing efficient, maintainable, and correct PromQL queries.
Problem: Querying metrics without label filters can match thousands or millions of time series, causing performance issues and timeouts.
```promql
# ❌ Bad: No filtering, matches all time series
rate(http_requests_total[5m])

# ✅ Good: Specific filtering
rate(http_requests_total{job="api-server", environment="production"}[5m])
```

Best practices:

- Always include a `job` label filter
- Add `environment` or `cluster` for multi-environment setups
- Use `instance` for single-instance queries
- Filter on `endpoint`, `method`, or `status_code` as needed

Problem: Regex matching (`=~`) is significantly slower than exact matching (`=`).
```promql
# ❌ Bad: Unnecessary regex for an exact match
http_requests_total{status_code=~"200"}

# ✅ Good: Exact match is faster
http_requests_total{status_code="200"}

# ✅ Good: Regex when truly needed
http_requests_total{status_code=~"2.."}   # All 2xx codes
http_requests_total{instance=~"prod-.*"}  # Pattern matching
```

When regex is appropriate:

- Matching a family of instances: `instance=~"prod-.*"`
- Matching a small set of exact values: `status_code=~"200|201|202"`
- Matching a class of status codes: `status_code=~"5.."`

Optimization tips:

- Keep patterns as specific as possible, e.g. `=~"^prod-.*"` rather than a broad `.*`

Problem: Labels with many unique values create a massive number of time series.
```promql
# ❌ Bad: user_id creates one series per user (high cardinality)
sum by (user_id) (rate(requests_total[5m]))

# ✅ Good: Aggregate without high-cardinality labels
sum(rate(requests_total[5m]))

# ✅ Good: Use low-cardinality labels
sum by (service, environment) (rate(requests_total[5m]))
```

High-cardinality labels to avoid in aggregations:

- `user_id`, `session_id`, `request_id`, `trace_id`
- Full URLs or paths, IP addresses, timestamps

Solutions:

- Drop high-cardinality labels with `sum(...)` or `without()`
- Replace unbounded values with bounded alternatives (e.g. `path_pattern` instead of `full_url`)

Problem: Counter metrics always increase; raw values are not useful for analysis.
```promql
# ❌ Bad: Raw counter value is not meaningful
http_requests_total

# ✅ Good: Calculate rate (requests per second)
rate(http_requests_total[5m])

# ✅ Good: Calculate total increase over a period
increase(http_requests_total[1h])
```

Counters are identifiable by common suffixes:

- `_total` (e.g., `requests_total`, `errors_total`)
- `_count` (e.g., `http_requests_count`)
- `_sum` (e.g., `request_duration_seconds_sum`)
- `_bucket` (e.g., `request_duration_seconds_bucket`)

Problem: Gauge metrics represent current state, not cumulative values.
```promql
# ❌ Bad: rate() on a gauge doesn't make sense
rate(memory_usage_bytes[5m])

# ✅ Good: Use the gauge value directly
memory_usage_bytes

# ✅ Good: Use *_over_time functions for analysis
avg_over_time(memory_usage_bytes[5m])
max_over_time(memory_usage_bytes[1h])
```

Gauge examples:

- `memory_usage_bytes`
- `cpu_temperature_celsius`
- `queue_length`
- `active_connections`

Problem: histogram_quantile() requires proper aggregation and the `le` label.
```promql
# ❌ Bad: Missing aggregation
histogram_quantile(0.95, rate(request_duration_seconds_bucket[5m]))

# ❌ Bad: Missing le label in aggregation
histogram_quantile(0.95, sum(rate(request_duration_seconds_bucket[5m])))

# ❌ Bad: Missing rate() on buckets
histogram_quantile(0.95, sum by (le) (request_duration_seconds_bucket))

# ✅ Good: Correct usage
histogram_quantile(0.95,
  sum by (le) (rate(request_duration_seconds_bucket[5m]))
)

# ✅ Good: Preserving additional labels
histogram_quantile(0.95,
  sum by (service, le) (rate(request_duration_seconds_bucket[5m]))
)
```

Requirements for histogram_quantile():

- Apply `rate()` or `irate()` to the bucket counters
- Aggregate with `sum`
- Keep the `le` label in the aggregation

Problem: Averaging quantiles is mathematically invalid and produces incorrect results.
```promql
# ❌ Bad: Averaging quantiles is wrong
avg(request_duration_seconds{quantile="0.95"})

# ✅ Good: Use _sum and _count to calculate the average
sum(rate(request_duration_seconds_sum[5m]))
/
sum(rate(request_duration_seconds_count[5m]))

# ✅ Good: If you need quantiles, use a histogram
histogram_quantile(0.95,
  sum by (le) (rate(request_duration_seconds_bucket[5m]))
)
```

`by()`: keeps only the specified labels, removes all others.
`without()`: removes the specified labels, keeps all others.
```promql
# Use by() when you know exactly which labels you want to keep
sum by (service, environment) (rate(requests_total[5m]))

# Use without() when you want to remove specific labels
sum without (instance, pod) (rate(requests_total[5m]))
```

When to use each:

- `by()`: when the output label set is known and fixed (dashboards, recording rules)
- `without()`: when dropping known noisy labels (e.g. `instance`, `pod`) while preserving everything else
Always aggregate before calling histogram_quantile():

```promql
# ❌ Bad: Trying to aggregate after the quantile calculation
sum(
  histogram_quantile(0.95, rate(request_duration_seconds_bucket[5m]))
)

# ✅ Good: Aggregate first, then calculate the quantile
histogram_quantile(0.95,
  sum by (le) (rate(request_duration_seconds_bucket[5m]))
)

# ✅ Good: Aggregate with grouping
histogram_quantile(0.95,
  sum by (service, le) (rate(request_duration_seconds_bucket[5m]))
)
```

Choose the right aggregation for your use case:
```promql
# sum: for counting, totaling
sum(up{job="api"})                    # Total number of instances

# avg: for average values
avg(cpu_usage_percent)                # Average CPU across instances

# max/min: for identifying extremes
max(memory_usage_bytes)               # Instance with highest memory use

# count: for counting series
count(up{job="api"} == 1)             # Number of healthy instances

# topk/bottomk: for top/bottom N
topk(10, rate(requests_total[5m]))    # Top 10 by request rate

# quantile: for percentiles across simple metrics
quantile(0.95, response_time_seconds) # 95th percentile
```

The number of time series matters most for query performance.
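As a back-of-the-envelope model, the series count for a metric is roughly the product of the distinct values of each label (assuming labels vary independently). The label counts below are hypothetical, chosen only to illustrate the explosion:

```python
# Rough cardinality estimate: series ~= product of distinct label values.
# All label counts here are hypothetical examples.
from math import prod

label_values = {
    "service": 20,
    "environment": 3,
    "status_code": 8,
}
low_cardinality = prod(label_values.values())

# Adding a user_id label with 50,000 distinct values multiplies the count.
with_user_id = low_cardinality * 50_000

print(low_cardinality)  # 480 series: manageable
print(with_user_id)     # 24000000 series: far too many
```

This is why a single unbounded label can dominate total storage and query cost even when every other label is well behaved.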
```promql
# Check the cardinality of a metric
count(metric_name)

# Check cardinality by label
count by (label_name) (metric_name)

# Identify high-cardinality metrics
topk(10, count by (__name__) ({__name__=~".+"}))
```

Strategies to reduce cardinality:

- Drop or relabel high-cardinality labels at scrape time
- Pre-aggregate with recording rules
- Bucket unbounded values into bounded patterns

Larger time ranges process more data and run slower.
```promql
# ❌ Slow: Very large range for rate
rate(requests_total[1h])

# ✅ Fast: Appropriate range for rate
rate(requests_total[5m])

# For recording rules: pre-compute common ranges,
# then use the recorded metric instead
job:requests:rate5m  # Recorded metric
```

Time range guidelines:

- Use `[1m]` to `[5m]` for real-time monitoring
- Use `[1h]` to `[1d]` for historical views when needed
- Record a rule for `[5m]` expressions if they are queried frequently

Subqueries can exponentially increase query cost.
```promql
# ❌ Expensive: Subquery over a long range
max_over_time(rate(metric[5m])[7d:1h])

# ✅ Better: Use a recording rule
max_over_time(job:metric:rate5m[7d])

# ✅ Better: Reduce the range if possible
max_over_time(rate(metric[5m])[1d:1h])
```

Subquery cost = range_duration / resolution × base_query_cost
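Plugging numbers into that formula shows why long subqueries hurt: the inner expression is re-evaluated once per resolution step across the whole range. Illustrative arithmetic only:

```python
# Subquery cost model: evaluations = range_duration / resolution,
# total cost ~= evaluations * base_query_cost.

HOUR = 3600

def subquery_evaluations(range_seconds: int, resolution_seconds: int) -> int:
    # Each resolution step re-evaluates the inner expression once.
    return range_seconds // resolution_seconds

week = subquery_evaluations(7 * 24 * HOUR, 1 * HOUR)  # [7d:1h]
day = subquery_evaluations(1 * 24 * HOUR, 1 * HOUR)   # [1d:1h]

print(week)  # 168 inner evaluations
print(day)   # 24 inner evaluations, 7x cheaper
```

Shrinking the range from `[7d:1h]` to `[1d:1h]` cuts the evaluation count from 168 to 24, which is exactly the ratio of the two ranges.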
Recording rules pre-compute expensive queries.

```yaml
# Recording rule configuration
groups:
  - name: request_rates
    interval: 30s
    rules:
      # Pre-compute an expensive aggregation
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # Pre-compute a complex quantile
      - record: job:http_latency:p95
        expr: |
          histogram_quantile(0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )
```

Use recording rules when:

- The same expensive query runs repeatedly (dashboards, alerts)
- Queries need long ranges or subqueries
- Results are shared across teams or panels

Too short: noisy, sensitive to scraping jitter.
Too long: hides important spikes, slow to react.
```promql
# Real-time monitoring: 1-5 minutes
rate(requests_total[2m])
rate(requests_total[5m])

# Trend analysis: 15 minutes to 1 hour
rate(requests_total[15m])
rate(requests_total[1h])

# Historical analysis: hours to days
rate(requests_total[6h])
rate(requests_total[1d])
```

Guidelines:

- Use a range covering several scrape intervals (`[1m]` to `[2m]` at minimum)
- `[5m]` works well for most cases

```promql
# rate(): average over the time range, smooth
rate(requests_total[5m])

# irate(): instant rate based on the last 2 points, volatile
irate(requests_total[5m])
```

When to use irate():

- High-resolution graphs of fast-moving counters, where spikes matter

When to use rate():

- Alerting, where stable and smooth values are preferred
- Dashboards and trend analysis
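The behavioral difference can be simulated with a few counter samples. This is a simplified sketch with made-up sample values; real Prometheus also extrapolates to the window boundaries and handles counter resets, both of which are ignored here:

```python
# Simplified simulation of rate() vs irate() over one [5m] window.
# Sample values are hypothetical; extrapolation and counter resets
# are intentionally not modeled.

samples = [  # (timestamp_seconds, counter_value)
    (0, 1000),
    (60, 1060),
    (120, 1120),
    (180, 1180),
    (240, 1840),  # burst in the last minute
]

def simple_rate(points):
    # Average per-second increase across the whole window: smooth.
    (t0, v0), (tn, vn) = points[0], points[-1]
    return (vn - v0) / (tn - t0)

def simple_irate(points):
    # Per-second increase between the last two points: volatile.
    (t0, v0), (t1, v1) = points[-2], points[-1]
    return (v1 - v0) / (t1 - t0)

print(simple_rate(samples))   # 3.5 req/s: the burst is averaged out
print(simple_irate(samples))  # 11.0 req/s: the burst dominates
```

The same burst yields 3.5 req/s under the averaged view and 11.0 req/s under the last-two-points view, which is why irate() suits spike-hunting graphs and rate() suits stable alerting.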
Format: `level:metric:operations`

```yaml
# level:      aggregation level (job, service, cluster)
# metric:     base metric name
# operations: functions applied (rate5m, p95, sum)

rules:
  # Good examples
  - record: job:http_requests:rate5m
    expr: sum by (job) (rate(http_requests_total[5m]))

  - record: job_endpoint:http_latency:p95
    expr: |
      histogram_quantile(0.95,
        sum by (job, endpoint, le) (rate(http_request_duration_seconds_bucket[5m]))
      )

  - record: cluster:cpu_usage:ratio
    expr: |
      sum(rate(node_cpu_seconds_total{mode!="idle"}[5m]))
      /
      sum(rate(node_cpu_seconds_total[5m]))
```

```yaml
# Instead of running this expensive query repeatedly:
#   histogram_quantile(0.95, sum by (le) (rate(latency_bucket[5m])))

# Create a recording rule:
- record: :http_request_duration:p95
  expr: |
    histogram_quantile(0.95,
      sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
    )

# Then use the recorded metric:
#   :http_request_duration:p95
```

Build complex metrics in stages:
```yaml
# Layer 1: Basic rates
- record: instance:requests:rate5m
  expr: rate(http_requests_total[5m])

# Layer 2: Job-level aggregation
- record: job:requests:rate5m
  expr: sum by (job) (instance:requests:rate5m)

# Layer 3: Derived metrics
- record: job:error_ratio:rate5m
  expr: |
    sum by (job) (instance:requests:rate5m{status_code=~"5.."})
    /
    job:requests:rate5m
```

Alert expressions should use comparison operators so they return series only while the condition holds.
```promql
# ✅ Good: Boolean expression
(
  sum(rate(errors_total[5m]))
  /
  sum(rate(requests_total[5m]))
) > 0.05

# ✅ Good: Explicit comparison
http_requests_rate < 10

# ✅ Good: Complex boolean
(cpu_usage > 80) and (memory_usage > 90)
```

Use a `for` duration for stability: avoid alerting on transient spikes.
```yaml
# Alert only after the condition persists for 10 minutes
- alert: HighErrorRate
  expr: |
    (
      sum(rate(errors_total[5m]))
      /
      sum(rate(requests_total[5m]))
    ) > 0.05
  for: 10m
  annotations:
    summary: "Error rate above 5% for 10+ minutes"
```

`for` duration guidelines:

- Fast-reacting critical alerts: `5m`
- Most alerts: `10m` to `15m`
- Slow-moving capacity or trend alerts: `30m+`
- Hard-down conditions that need immediate paging: `0m` (no `for`)

```promql
# ✅ Good: Include labels that identify the problem
sum by (service, environment) (
  rate(errors_total[5m])
) > 100
# Alerts will show which service and environment
```

```promql
# ❌ Bad: Too generic
absent(up)

# ✅ Good: Specific service
absent(up{job="critical-service"})

# ✅ Good: With a timeout
absent_over_time(up{job="critical-service"}[10m])
```

Use multi-line formatting for readability:
```promql
# ✅ Good: Multi-line with indentation
histogram_quantile(0.95,
  sum by (service, le) (
    rate(http_request_duration_seconds_bucket{
      environment="production",
      job="api-server"
    }[5m])
  )
)

# ❌ Bad: Single line, hard to read
histogram_quantile(0.95, sum by (service, le) (rate(http_request_duration_seconds_bucket{environment="production", job="api-server"}[5m])))
```

```yaml
rules:
  # Calculate p95 latency for all API endpoints
  # Used by: API dashboard, SLO calculations, latency alerts
  - record: api:http_latency:p95
    expr: |
      histogram_quantile(0.95,
        sum by (endpoint, le) (
          rate(http_request_duration_seconds_bucket{job="api"}[5m])
        )
      )
```

```yaml
# ✅ Good: Clear purpose from the name
- record: api:error_rate:ratio5m
- record: db:query_duration:p99
- record: cluster:memory_usage:bytes

# ❌ Bad: Unclear names
- record: metric1
- record: temp_calc
- record: x
```

```promql
# ❌ Anti-pattern: no label filters
rate(http_requests_total[5m])
# ✅ Fix
rate(http_requests_total{job="api-server", environment="prod"}[5m])

# ❌ Anti-pattern: regex for an exact value
metric{label=~"value"}
# ✅ Fix
metric{label="value"}

# ❌ Anti-pattern: rate() on a gauge
rate(memory_usage_bytes[5m])
# ✅ Fix
avg_over_time(memory_usage_bytes[5m])

# ❌ Anti-pattern: raw counter value
http_requests_total
# ✅ Fix
rate(http_requests_total[5m])

# ❌ Anti-pattern: averaging quantiles
avg(http_duration{quantile="0.95"})
# ✅ Fix
histogram_quantile(0.95,
  sum by (le) (rate(http_duration_bucket[5m]))
)

# ❌ Anti-pattern: missing sum by (le)
histogram_quantile(0.95, rate(latency_bucket[5m]))
# ✅ Fix
histogram_quantile(0.95,
  sum by (le) (rate(latency_bucket[5m]))
)

# ❌ Anti-pattern: high-cardinality grouping
sum by (user_id) (requests)  # millions of series
# ✅ Fix
sum(requests)                # single series
# Or use low-cardinality labels
sum by (service) (requests)
```

Check cardinality:
```promql
count(your_query)
```

Verify the result makes sense. Test edge cases:

```promql
# Test with different ranges
rate(metric[1m])
rate(metric[5m])
rate(metric[1h])
# Verify the results are reasonable
```

```promql
# Verify the metric exists
count(metric_name) > 0

# Check for gaps
absent_over_time(metric_name[10m])
```

Before deploying a PromQL query, verify:

- Label filters are present (at least `job`)
- Exact matching (`=`) is used instead of regex when possible
- `rate()` is applied to counters
- `*_over_time()` is used for gauges
- `histogram_quantile()` is combined with `sum by (le)` for histograms
- The time range is appropriate (e.g. `[5m]`)

Install with Tessl CLI:

```shell
npx tessl i pantheon-ai/promql-generator@0.1.1
```