A DevOps team supports a multi-tenant SaaS platform. Their Grafana dashboards are timing out because several expensive PromQL queries run on every panel refresh. The head of infrastructure has determined that three queries are the main culprits — they are all computed in multiple dashboard panels and run over large time ranges. The team wants to pre-compute these queries as Prometheus recording rules so they can be reused across dashboards without re-evaluating the full expression each time.

Additionally, a teammate proposed the following query for a "per-user request breakdown" panel:

sum by (user_id) (rate(http_requests_total[5m]))

The senior engineer flagged this as potentially dangerous to the Prometheus instance but hasn't explained why. The team wants the recording rules written up, and also wants an explanation of the issue with the proposed query and what a safer alternative looks like.

The three queries to pre-compute:

The per-job request rate over 5 minutes: sum by (job) (rate(http_requests_total[5m]))
The per-job 5xx error rate over 5 minutes: sum by (job) (rate(http_requests_total{status_code=~"5.."}[5m]))
The per-job P95 latency: histogram_quantile(0.95, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))

Output Specification

Produce two files:

recording_rules.yaml — Prometheus recording rule YAML for the three queries above, using correct recording rule naming conventions.
cardinality_analysis.md — A short Markdown document (1-2 paragraphs) explaining the problem with the sum by (user_id) query and providing a safer alternative query.

pantheon-ai/promql-toolkit

task.md.css-3qkkll{font-size:var(--chakra-font-sizes-sm);font-weight:var(--chakra-font-weights-normal);color:var(--chakra-colors-gray-300);}generator/evals/scenario-3/

Create Recording Rules to Speed Up Slow Dashboard Queries

Problem/Feature Description

Output Specification

task.mdgenerator/evals/scenario-3/