Complete PromQL toolkit with generation and validation capabilities
CRITICAL: Always engage the user in collaborative planning before generating any query. Never skip the planning phase.
Validate every generated query with devops-skills:promql-validator. Display structured results (syntax, best practices, explanation). Fix any issues and re-validate until all checks pass.

Ask vs. Infer: If the user's request already clearly specifies goal, use case, and context, acknowledge those details instead of re-asking. Only ask for missing or ambiguous information.
Always consult the relevant reference file before writing code.
| Scenario | Reference File |
|---|---|
| Histogram queries | references/metric_types.md (Histogram section) |
| Error/latency patterns | references/promql_patterns.md (RED section) |
| Resource monitoring | references/promql_patterns.md (USE section) |
| Optimization / anti-patterns | references/best_practices.md |
| Specific functions | references/promql_functions.md |
Use the right function for the metric type:

- Counters: `rate()` / `increase()`
- Gauges: `*_over_time()` or direct use
- Histograms: `histogram_quantile()`
- Use `by()`/`without()` on all aggregations
- Recording rule naming: `level:metric:operations`

```promql
# Request rate (counter)
sum(rate(http_requests_total{job="api-server"}[5m])) by (endpoint)

# Error rate ratio
sum(rate(http_requests_total{job="api-server", status_code=~"5.."}[5m]))
  / sum(rate(http_requests_total{job="api-server"}[5m]))

# P95 latency (classic histogram)
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket{job="api-server"}[5m]))
)

# P95 latency (native histogram, Prometheus 3.x+)
histogram_quantile(0.95,
  sum by (job) (rate(http_request_duration_seconds[5m]))
)

# Availability
(count(up{job="api-server"} == 1) / count(up{job="api-server"})) * 100

# Burn rate (99.9% SLO, 1h window)
(
  sum(rate(http_requests_total{job="api", status_code=~"5.."}[1h]))
    / sum(rate(http_requests_total{job="api"}[1h]))
) / 0.001

# Multi-window burn-rate alert (page: 2% budget in 1h, burn rate 14.4)
(
  sum(rate(http_requests_total{job="api", status_code=~"5.."}[1h]))
    / sum(rate(http_requests_total{job="api"}[1h]))
) > 14.4 * 0.001
and
(
  sum(rate(http_requests_total{job="api", status_code=~"5.."}[5m]))
    / sum(rate(http_requests_total{job="api"}[5m]))
) > 14.4 * 0.001
```

For complete SLO patterns, native histogram functions (histogram_count, histogram_sum, histogram_fraction), subqueries, offset/@ modifiers, vector matching, and Kubernetes patterns, see the assets/ files.
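The 14.4 factor above is not arbitrary: it follows from the budget-consumption target. A minimal sketch of the arithmetic, assuming the common multi-window setup (30-day SLO window, page when 2% of the error budget would be consumed within 1 hour); the numbers are illustrative, not part of the validator:

```python
# Sketch: deriving a burn-rate alert factor. Assumes a 30-day SLO
# window and the "2% of budget in 1h" paging condition; adjust for
# your own SLO policy.

def burn_rate_factor(slo_window_hours: float,
                     budget_fraction: float,
                     alert_window_hours: float) -> float:
    """Burn rate needed to consume `budget_fraction` of the error
    budget within `alert_window_hours` of a `slo_window_hours` SLO."""
    return budget_fraction * slo_window_hours / alert_window_hours

# 2% of a 30-day (720h) budget in 1h -> 14.4
print(burn_rate_factor(30 * 24, 0.02, 1))

# For a 99.9% SLO, the alert threshold on the error ratio is
# factor * allowed error rate, i.e. 14.4 * 0.001 as in the query above.
```

The same function gives the usual slower ticket-level window: 5% of budget in 6 hours yields a factor of 6.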
After generating, invoke devops-skills:promql-validator and display results in this format:
## PromQL Validation Results
### Syntax Check
- Status: ✅ VALID / ⚠️ WARNING / ❌ ERROR
- Issues: [list any syntax errors]
### Best Practices Check
- Status: ✅ OPTIMIZED / ⚠️ CAN BE IMPROVED / ❌ HAS ISSUES
- Issues: [list problems found]
- Suggestions: [list optimizations]
### Query Explanation
- What it measures: [plain English]
- Output labels: [label list or "None (scalar)"]
- Expected result structure: [instant vector / scalar / etc.]

Fix all issues and re-validate until clean.
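Before handing a query to the full validator, a lightweight local pre-check can catch the most common syntax slips (unbalanced delimiters, unclosed strings). The helper below is a hypothetical sketch, not part of devops-skills:promql-validator, and it is not a PromQL parser: a query can pass it and still be invalid.

```python
# Hypothetical pre-check for obvious PromQL syntax slips. It only
# balances (), {}, [] and quotes; real validation still requires the
# Prometheus parser. Escape sequences inside strings are not handled.

def precheck(query: str) -> list[str]:
    issues: list[str] = []
    pairs = {')': '(', '}': '{', ']': '['}
    stack: list[str] = []
    in_string = None  # quote char if currently inside a string literal
    for ch in query:
        if in_string:
            if ch == in_string:
                in_string = None
        elif ch in ('"', "'"):
            in_string = ch
        elif ch in '({[':
            stack.append(ch)
        elif ch in ')}]':
            if not stack or stack.pop() != pairs[ch]:
                issues.append(f"unbalanced '{ch}'")
    if in_string:
        issues.append("unclosed string literal")
    issues.extend(f"unclosed '{ch}'" for ch in stack)
    return issues

print(precheck('sum(rate(http_requests_total{job="api"}[5m])) by (endpoint)'))  # []
print(precheck('sum(rate(http_requests_total[5m])'))  # ["unclosed '('"]
```

Anything this flags is guaranteed broken; an empty result only means the query deserves the full syntax check.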
```yaml
# Alerting rule with for clause
alert: HighErrorRate
expr: |
  (
    sum(rate(http_requests_total{status_code=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m]))
  ) > 0.05
for: 10m

# Recording rule (naming: level:metric:operations)
- record: job:http_requests:rate5m
  expr: sum by (job) (rate(http_requests_total[5m]))
```

## Common Mistakes

**rate() on a gauge metric.** rate() computes the per-second rate of increase and assumes monotonically increasing counters. Applied to a gauge, it produces nonsensical results because gauges can decrease.
- Wrong: `rate(node_memory_MemFree_bytes[5m])` — memory is a gauge
- Right: `node_memory_MemFree_bytes` (direct use) or `delta(node_memory_MemFree_bytes[5m])` for change over time

**avg() across quantile labels.** Summary metrics expose pre-computed `{quantile="0.95"}` labels that cannot be re-aggregated across instances. Using avg() on them produces statistically meaningless results.
- Wrong: `avg(http_request_duration_seconds{quantile="0.95"})`
- Right: `histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))` — use the histogram type with histogram_quantile()

**by() without filtering first.** Grouping by high-cardinality labels such as `user_id`, `request_id`, or `pod` without label filters produces thousands of series, overwhelming dashboards and recording rules.
- Wrong: `sum by (user_id) (rate(http_requests_total[5m]))`
- Right: `sum by (job, status_code) (rate(http_requests_total{job="api"}[5m]))` — filter and aggregate on stable, low-cardinality labels

**increase() for alerting thresholds.** increase() is extrapolated and can return non-integer values on sparse counters. For alerting, rate() produces stable per-second thresholds that are scrape-interval independent.
- Wrong: `increase(http_requests_total{status=~"5.."}[5m]) > 10`
- Right: `rate(http_requests_total{status=~"5.."}[5m]) > 0.033` (2 errors/minute ≈ 0.033/second)

**Missing for clause on alert rules.** Alerts without `for` fire immediately on a single evaluation, causing false positives from transient spikes. The `for` clause requires the condition to hold for a sustained period.
- Wrong: `alert: HighErrorRate` with `expr: error_rate > 0.05` and no `for`
- Right: add `for: 5m` to require the condition to hold for 5 minutes before firing

Documentation lookup: resolve the library `prometheus`, then fetch docs with the relevant topic. Search query template: `"Prometheus PromQL [function/operator] documentation [version] examples"`

## Troubleshooting

| Symptom | Likely Cause | Fix |
|---|---|---|
| Empty results | Wrong label filters or metric not scraped | Check up{job="..."}, verify label values |
| Too many series | High cardinality | Add label filters, aggregate, use recording rules |
| Wrong values | Wrong function for metric type | rate() on counters; direct or *_over_time() on gauges |
| Slow queries | Large range vectors or missing filters | Narrow time range, add filters, use recording rules |
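The "wrong function for metric type" row above can be made concrete with a toy simulation. The sketch below uses a simplified rate, (last − first) / elapsed, which ignores Prometheus's extrapolation and counter-reset handling; the sample values are made up for illustration:

```python
# Sketch of why rate() is counter-only, using a simplified per-second
# rate over a window. NOT Prometheus's actual algorithm (no
# extrapolation, no counter-reset detection).

def simple_rate(samples: list[tuple[float, float]]) -> float:
    """(timestamp_seconds, value) pairs -> per-second rate of increase."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

# Counter: monotonically increasing, so the rate is meaningful.
counter = [(0, 100), (15, 130), (30, 160), (45, 190), (60, 220)]
print(simple_rate(counter))  # 2.0 requests/second

# Gauge: free-memory-style values fluctuate; the "rate" reflects only
# the two endpoint samples and hides the wild swings in between.
gauge = [(0, 8e9), (15, 2e9), (30, 9e9), (45, 1e9), (60, 8e9)]
print(simple_rate(gauge))  # 0.0 -- looks flat despite the swings
```

This is why the fix for gauges is direct use (or `*_over_time()` / `delta()`), never `rate()`.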
Internal:
- `for` clauses
- `level:metric:operations` naming