promql-validator

Name: promql-validator
Author: Leo-Atienza

Comprehensive toolkit for validating, optimizing, and understanding Prometheus Query Language (PromQL) queries. Use this skill when working with PromQL queries to check syntax, detect anti-patterns, identify optimization opportunities, and interactively plan queries with users.

Install

mkdir -p .claude/skills/promql-validator && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/16307" && unzip -o skill.zip -d .claude/skills/promql-validator && rm skill.zip

Installs to .claude/skills/promql-validator

About this skill

How This Skill Works

This skill performs multi-level validation and provides interactive query planning:

Syntax Validation: Checks for syntactically correct PromQL expressions
Semantic Validation: Ensures queries make logical sense (e.g., rate() on counters, not gauges)
Anti-Pattern Detection: Identifies common mistakes and inefficient patterns
Optimization Suggestions: Recommends performance improvements
Query Explanation: Translates PromQL to plain English
Interactive Planning: Helps users clarify intent and refine queries

Workflow

When a user provides a PromQL query, follow this workflow:

Step 1: Validate Syntax

Run the syntax validation script to check for basic correctness:

python3 .claude/skills/promql-validator/scripts/validate_syntax.py "<query>"

The script will check for:

Valid metric names and label matchers
Correct operator usage
Proper function syntax
Valid time durations and ranges
Balanced brackets and quotes
Correct use of modifiers (offset, @)
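As an illustration of the "balanced brackets and quotes" check, here is a minimal sketch. It is not the code in validate_syntax.py; the function name check_balanced is hypothetical, and the sketch ignores PromQL comments. It tracks open brackets on a stack and skips over string contents, treating backtick strings as raw (no escapes).

```python
def check_balanced(query: str) -> list[str]:
    """Return a list of bracket/quote problems found in a PromQL string."""
    closers = {")": "(", "]": "[", "}": "{"}
    stack = []      # (bracket, position) for every bracket still open
    problems = []
    quote = None    # quote character we are currently inside, if any
    i = 0
    while i < len(query):
        ch = query[i]
        if quote:
            if ch == "\\" and quote != "`":
                i += 1              # skip the escaped character
            elif ch == quote:
                quote = None        # string closed
        elif ch in "\"'`":
            quote = ch
        elif ch in "([{":
            stack.append((ch, i))
        elif ch in closers:
            if not stack or stack[-1][0] != closers[ch]:
                problems.append(f"unexpected '{ch}' at position {i}")
            else:
                stack.pop()
        i += 1
    if quote:
        problems.append(f"unterminated {quote} string")
    for ch, pos in stack:
        problems.append(f"unclosed '{ch}' opened at position {pos}")
    return problems
```

An empty list means the query passed this one check; it says nothing about the other items in the list above.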

Step 2: Check Best Practices

Run the best practices checker to detect anti-patterns and optimization opportunities:

python3 .claude/skills/promql-validator/scripts/check_best_practices.py "<query>"

The script will identify:

High cardinality queries without label filters
Inefficient regex matchers that could be exact matches
Missing rate()/increase() on counter metrics
rate() used on gauge metrics
Averaging pre-calculated quantiles
Subqueries with excessive time ranges
irate() over long time ranges
Opportunities to add more specific label filters
Complex queries that should use recording rules
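When Steps 1 and 2 are driven from another program rather than typed into a shell, passing the query as a separate argument avoids shell-quoting problems with the braces and quotes that PromQL uses. The sketch below assumes only what the commands above show: each script takes the query as its single positional argument. The helper names build_commands and run_checks are hypothetical, and the meaning of the scripts' exit codes and output is not documented here, so the sketch just captures them.

```python
import subprocess
import sys
from pathlib import Path

SKILL_DIR = Path(".claude/skills/promql-validator")
SCRIPTS = ("validate_syntax.py", "check_best_practices.py")

def build_commands(query: str, python: str = sys.executable) -> list[list[str]]:
    """Build one argv list per script; the query is never passed through a shell."""
    return [[python, str(SKILL_DIR / "scripts" / name), query] for name in SCRIPTS]

def run_checks(query: str) -> list[tuple[str, int, str]]:
    """Run each script and collect (script name, return code, stdout)."""
    results = []
    for cmd in build_commands(query):
        done = subprocess.run(cmd, capture_output=True, text=True)
        results.append((Path(cmd[1]).name, done.returncode, done.stdout))
    return results
```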

Step 3: Explain the Query

Parse and explain what the query does in plain English:

What metrics are being queried
What type of metrics they are (counter, gauge, histogram, summary)
What functions are applied and why
What the query calculates
What labels will be in the output
What the expected result structure looks like

Required Output Details (always include these explicitly):

**Output Labels**: [list labels that will be in the result, or "None (fully aggregated to a single series with no labels)"]
**Expected Result Structure**: [instant vector / range vector / scalar] with [N series / single value]

Example:

**Output Labels**: job, instance
**Expected Result Structure**: Instant vector with one series per job/instance combination

Step 4: Interactive Query Planning (Phase 1 - STOP AND WAIT)

Ask the user clarifying questions to verify the query matches their intent:

Understand the Goal: "What are you trying to monitor or measure?"
- Request rate, error rate, latency, resource usage, etc.
Verify Metric Type: "Is this a counter (always increasing), gauge (can go up/down), histogram, or summary?"
- This affects which functions to use
Clarify Time Range: "What time window do you need?"
- Instant value, rate over time, historical analysis
Confirm Aggregation: "Do you need to aggregate data across labels? If so, which labels?"
- by (job), by (instance), without (pod), etc.
Check Output Intent: "Are you using this for alerting, dashboarding, or ad-hoc analysis?"
- Affects optimization priorities

IMPORTANT: Two-Phase Dialogue

After presenting Steps 1-4 results (Syntax, Best Practices, Query Explanation, and Intent Questions):

⏸️ STOP HERE AND WAIT FOR USER RESPONSE

Do NOT proceed to Steps 5-7 until the user answers the clarifying questions. This ensures the subsequent recommendations are tailored to the user's actual intent.

Step 5: Compare Intent vs Implementation (Phase 2 - After User Response)

Only proceed to this step after the user has answered the clarifying questions from Step 4.

After understanding the user's intent:

Explain what the current query actually does
Highlight any mismatches between intent and implementation
Suggest corrections if the query doesn't match the goal
Offer alternative approaches if applicable

When relevant, mention known limitations:

Note when metric type detection is heuristic-based (e.g., "The script inferred this is a gauge based on the _bytes suffix. Please confirm if this is correct.")
Acknowledge when high-cardinality warnings might be false positives (e.g., "This warning may not apply if you're using a recording rule or know your cardinality is low.")

Step 6: Offer Optimizations

Based on validation results:

Suggest more efficient query patterns
Recommend recording rules for complex/repeated queries
Propose better label matchers to reduce cardinality
Advise on appropriate time ranges

Reference Examples: When suggesting corrections, cite relevant examples using this format:

As shown in `examples/bad_queries.promql` (lines 91-97):
❌ BAD: `avg(http_request_duration_seconds{quantile="0.95"})`
✅ GOOD: Use histogram_quantile() with histogram buckets

Citation sources:

examples/good_queries.promql - for well-formed patterns
examples/optimization_examples.promql - for before/after comparisons
examples/bad_queries.promql - for showing what to avoid
docs/best_practices.md - for detailed explanations
docs/anti_patterns.md - for anti-pattern deep dives

Citation Format: file_path (lines X-Y) with the relevant code snippet quoted

Step 7: Let User Plan/Refine

Give the user control:

Ask if they want to modify the query
Offer to help rewrite it for better performance
Provide multiple alternatives if applicable
Explain trade-offs between different approaches

Key Validation Rules

Syntax Rules

Metric Names: Must match [a-zA-Z_:][a-zA-Z0-9_:]* or use UTF-8 quoting syntax (Prometheus 3.0+):
- Quoted form: {"my.metric.with.dots"}
- Using name label: {__name__="my.metric.with.dots"}
Label Matchers: = (equal), != (not equal), =~ (regex match), !~ (regex not match)
Time Durations: one or more [0-9]+(ms|s|m|h|d|w|y) components, largest unit first - e.g., 5m, 1h, 7d, 1h30m
Range Vectors: metric_name[duration] - e.g., http_requests_total[5m]
Offset Modifier: offset <duration> - e.g., metric_name offset 5m
@ Modifier: @ <timestamp> or @ start() / @ end()
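The metric-name and duration patterns above translate directly into regular expressions. The following is a small sketch with hypothetical helper names; the duration pattern accepts concatenated components such as 1h30m but, as a simplification, does not enforce that units appear from largest to smallest.

```python
import re

# Classic (non-quoted) metric names, as given in the rule above.
METRIC_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")
# One or more <number><unit> components, e.g. 5m or 1h30m.
DURATION = re.compile(r"(?:[0-9]+(?:ms|s|m|h|d|w|y))+")

def is_valid_metric_name(name: str) -> bool:
    return METRIC_NAME.fullmatch(name) is not None

def is_valid_duration(text: str) -> bool:
    return DURATION.fullmatch(text) is not None
```

A name that fails is_valid_metric_name, such as my.metric.with.dots, is not necessarily invalid: it needs the quoted UTF-8 form listed above.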

Semantic Rules

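rate() and irate(): Should only be used with counter metrics (by convention, metrics ending in _total, _count, _sum, or _bucket; the suffix is a naming hint, not a guarantee)
Counters: Should typically use rate() or increase(), not raw values
Gauges: Should not use rate() or increase(); use delta() or deriv() to measure how a gauge changes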
Histograms: Use histogram_quantile() with le label and rate() on _bucket metrics
Summaries: Don't average quantiles; calculate from _sum and _count
Aggregations: Use by() or without() to control output labels
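The counter rule can only be checked heuristically from the query text, because PromQL itself carries no metric type. One way to sketch it, assuming the suffix convention above, is shown here; the names looks_like_counter and suspicious_rate_calls are hypothetical, and a flagged call is a prompt to confirm the metric type with the user, not proof of an error.

```python
import re

COUNTER_SUFFIXES = ("_total", "_count", "_sum", "_bucket")

# rate/irate/increase applied directly to a selector (name followed by { or [).
RATE_CALL = re.compile(
    r"\b(rate|irate|increase)\(\s*([a-zA-Z_:][a-zA-Z0-9_:]*)(?=\s*[\[{])"
)

def looks_like_counter(metric: str) -> bool:
    """Suffix-based guess only; real metric types come from metadata."""
    return metric.endswith(COUNTER_SUFFIXES)

def suspicious_rate_calls(query: str) -> list[tuple[str, str]]:
    """Return (function, metric) pairs where the metric does not look like a counter."""
    return [(fn, metric) for fn, metric in RATE_CALL.findall(query)
            if not looks_like_counter(metric)]
```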

Performance Rules

Cardinality: Always use specific label matchers to reduce series count
Regex: Use = instead of =~ when possible for exact matches
Rate Range: Should be at least 4x the scrape interval (e.g., [1m] for a 15s scrape interval, [2m] for a 30s interval)
irate(): Best for short ranges (<5m); use rate() for longer periods
Subqueries: Avoid excessive time ranges that process millions of samples
Recording Rules: Use for complex queries accessed frequently
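The 4x rule for rate ranges is simple arithmetic on the scrape interval. A minimal sketch, with a hypothetical helper name and the factor exposed as a parameter:

```python
def min_rate_window(scrape_interval_seconds: int, factor: int = 4) -> str:
    """Smallest recommended rate() range for a given scrape interval."""
    seconds = scrape_interval_seconds * factor
    if seconds % 60 == 0:
        return f"{seconds // 60}m"   # whole minutes read better in a query
    return f"{seconds}s"
```

For a target scraped every 30 seconds this gives 2m, so rate(http_requests_total[2m]) is the shortest window that still follows the rule.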

Anti-Patterns to Detect

High Cardinality Issues

❌ Bad: http_requests_total{}

Matches all time series without filtering

✅ Good: http_requests_total{job="api", instance="prod-1"}

Specific label filters reduce cardinality
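The empty-matcher case shown above is easy to spot textually. This is a sketch under a narrow assumption: it only catches selectors written with literal empty braces. Finding bare metric names with no braces at all would need a real parser to tell them apart from function names and keywords. The helper name is hypothetical.

```python
import re

# A metric name immediately followed by empty braces, e.g. http_requests_total{}
EMPTY_SELECTOR = re.compile(r"([a-zA-Z_:][a-zA-Z0-9_:]*)\s*\{\s*\}")

def unfiltered_selectors(query: str) -> list[str]:
    """Return metric names that are selected with an empty label matcher set."""
    return EMPTY_SELECTOR.findall(query)
```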

Regex Overuse

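❌ Bad: http_requests_total{status=~"200"}

A regex with no metacharacters selects the same series as an exact match, so =~ adds overhead without adding anything

✅ Good: http_requests_total{status="200"}

Exact match is clearer and can be faster; keep =~ for real patterns such as status=~"2.."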
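A checker should flag only those regex matchers that an exact match can replace without changing the result, that is, patterns containing no regex metacharacters. The sketch below shows one way to do that; the function name is hypothetical and it handles double-quoted label values only.

```python
import re

# label, operator, and the double-quoted pattern of each regex matcher
REGEX_MATCHER = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)\s*(=~|!~)\s*"((?:[^"\\]|\\.)*)"')
REGEX_META = set(".*+?()[]{}|^$\\")

def replaceable_regex_matchers(query: str) -> list[str]:
    """Suggest exact matchers for regex matchers whose pattern is a plain literal."""
    hits = []
    for label, op, pattern in REGEX_MATCHER.findall(query):
        if not (set(pattern) & REGEX_META):
            exact_op = "=" if op == "=~" else "!="
            hits.append(f'{label}{op}"{pattern}" -> {label}{exact_op}"{pattern}"')
    return hits
```

A pattern such as "2.." is left alone because it matches a whole class of values and has no exact-match equivalent.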

Missing rate() on Counters

❌ Bad: http_requests_total

Raw counter values only grow (and reset to zero on restart), so the absolute number is rarely useful

✅ Good: rate(http_requests_total[5m])

Rate shows requests per second

rate() on Gauges

❌ Bad: rate(memory_usage_bytes[5m])

Gauges measure current state, not cumulative values

✅ Good: memory_usage_bytes

Use gauge value directly or with avg_over_time()

Averaging Quantiles

❌ Bad: avg(http_request_duration_seconds{quantile="0.95"})

Mathematically invalid to average pre-calculated quantiles

✅ Good: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

Calculate quantile from histogram buckets
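Detecting this anti-pattern from text alone is approximate. As a rough sketch, a query can be flagged when avg appears before a selector that filters on the quantile label; the helper name is hypothetical and a match should be treated as a hint to review the query.

```python
import re

# "avg" as a whole word, later followed by a selector containing quantile=
AVG_OF_QUANTILE = re.compile(r"\bavg\b.*\{[^}]*\bquantile\s*=")

def averages_quantiles(query: str) -> bool:
    """True if the query appears to average a pre-calculated summary quantile."""
    return AVG_OF_QUANTILE.search(query) is not None
```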

Excessive Subquery Ranges

❌ Bad: rate(metric[5m])[90d:1m]

Processes millions of samples, very slow

✅ Good: Use recording rules or limit range to necessary duration
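To see why the subquery above is expensive, count its evaluation steps: the range divided by the resolution, per series. The sketch below does that arithmetic with hypothetical helper names. It gives a rough step count only; the engine's step alignment can shift the exact number slightly.

```python
import re

UNIT_MS = {"ms": 1, "s": 1_000, "m": 60_000, "h": 3_600_000,
           "d": 86_400_000, "w": 604_800_000, "y": 31_536_000_000}

def duration_ms(text: str) -> int:
    """Convert a PromQL duration such as 90d or 1h30m to milliseconds."""
    parts = re.findall(r"([0-9]+)(ms|s|m|h|d|w|y)", text)
    return sum(int(n) * UNIT_MS[unit] for n, unit in parts)

def subquery_steps(range_text: str, step_text: str) -> int:
    """Approximate number of inner evaluations for [range:step], per series."""
    return duration_ms(range_text) // duration_ms(step_text)
```

For [90d:1m] this comes to 129,600 evaluations of the inner rate() for every matching series, and each evaluation reads a 5m window of raw samples.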

irate() Over Long Ranges

❌ Bad: irate(metric[1h])

irate() only uses the last two samples in the range, so most of a 1h window is ignored

✅ Good: rate(metric[1h]) or irate(metric[5m])

Use rate() for longer ranges or reduce irate() range
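A simple textual check for this case extracts the range given to irate() and compares it with a threshold. The sketch assumes a 5m limit, in line with the examples above, and uses hypothetical names; it only handles irate() applied directly to a selector, not to a subquery.

```python
import re

UNIT_MS = {"ms": 1, "s": 1_000, "m": 60_000, "h": 3_600_000,
           "d": 86_400_000, "w": 604_800_000, "y": 31_536_000_000}

# irate( <selector> [<range>] ), capturing the range text
IRATE_RANGE = re.compile(r"\birate\([^\[\]]*\[([0-9a-z]+)\]")

def range_ms(text: str) -> int:
    parts = re.findall(r"([0-9]+)(ms|s|m|h|d|w|y)", text)
    return sum(int(n) * UNIT_MS[unit] for n, unit in parts)

def long_irate_ranges(query: str, limit_ms: int = 300_000) -> list[str]:
    """Return the irate() ranges in the query that exceed the limit (default 5m)."""
    return [r for r in IRATE_RANGE.findall(query) if range_ms(r) > limit_ms]
```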

Mixed Metric Types

❌ Bad: avg(http_request_duration_seconds{quantile="0.95"}) / rate(node_memory_usage_bytes[1h]) + sum(http_requests_total)

Combines summary quantiles, gauge metrics, and counters in arithmetic
Produces meaningless results

✅ Good: Keep each metric type in separate, purpose-specific queries:

Latency: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
Memory: node_memory_usage_bytes{instance="prod-1"}
Request rate: rate(http_requests_total{job="api"}[5m])

Output Format

Provide validation results in this structure:

## PromQL Validation Results

### Syntax Check
- Status: ✅ VALID / ⚠️ WARNING / ❌ ERROR
- Issues: [list any syntax errors with line/position]

### Semantic Check
- Status: ✅ VALID / ⚠️ WARNING / ❌ ERROR
- Issues: [list any logical problems]

### Performance Analysis
- Status: ✅ OPTIMIZED / ⚠️ CAN BE IMPROVED / ❌ INEFFICIENT
- Issues: [list optimization opportunities]
- Suggestions: [specific improvements]

### Query Explanation
Your query: `<query>`

This query does:
- [Plain English explanation]
- Metrics: [list metrics and their types]
- Functions: 

---

*Content truncated.*

License: MIT
