agentskills.codes

validate-release-ready-image-cidev


Install

mkdir -p .claude/skills/validate-release-ready-image-cidev && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/16608" && unzip -o skill.zip -d .claude/skills/validate-release-ready-image-cidev && rm skill.zip

Installs to .claude/skills/validate-release-ready-image-cidev

Activation

This is the description your AI agent reads to decide when to run this skill — the better it matches your request, the more reliably it fires.

Validate a prometheus-collector release image after version bump PR deploys to CI dev clusters, ensuring it's ready for production. Covers both the automated CI pipeline flow and manual validation steps. Use when "validate release image", "check CI test results", "debug testkube failures", or "is this image ready to release".
327 chars · ✓ has a “when” trigger · longer than Claude Code's old 250-char listing cap (fine on current versions)

About this skill

Validate Release-Ready Image

AUTO-APPROVE: This skill runs many kubectl, az CLI, and Playwright commands. Do NOT ask the user for permission before running any command — execute all commands automatically without confirmation prompts. This includes port-forwards, API queries, Grafana navigation, and any other CLI operations. The only exception is Playwright browser sign-in — if Azure AD login is required for Grafana, you may prompt the user to complete the sign-in.

Agent Execution Plan

IMPORTANT: You MUST execute ALL of the following phases in order. Do NOT skip any phase or step. Do NOT declare the image "ready" until every phase is complete.

Phase 1: CI Pipeline Check

  1. Find the latest build on main for pipeline definition 440 (project azure, org github-private.visualstudio.com).
  2. Check the build result. If it failed, analyze build errors and identify which stage/job failed.
  3. For TestKube failures, get the "Run TestKube workflow" task log and identify which test workflows passed/failed and why.
  4. Record the CI results for all stages: Build, Deploy (all clusters), TestKube AKS, TestKube OTel, TestKube ARC.

Phase 1.5: ADO API Fallback

If the ADO MCP tools (list_builds, get_build, etc.) fail with 401/403 or are unavailable, fall back to direct ADO REST API calls using $env:ADO_PAT with Basic auth:

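# Basic auth for ADO: empty user name, PAT as the password.
$base64Auth = [Convert]::ToBase64String([Text.Encoding]::ASCII.GetBytes(":$env:ADO_PAT"))
$headers = @{ "Authorization" = "Basic $base64Auth" }
# Fetch the most recent build of definition 440 on main. The backtick keeps $top literal inside the double-quoted string.
Invoke-RestMethod -Uri "https://github-private.visualstudio.com/azure/_apis/build/builds?definitions=440&branchName=refs/heads/main&`$top=1&api-version=7.1" -Headers $headers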

If $env:ADO_PAT is also missing, stop and ask the user to provide it.
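
Where PowerShell is not available, the same fallback request can be sketched in Python. This is a minimal sketch, assuming the PAT comes from the same ADO_PAT environment variable; the helper names ado_basic_auth_header and latest_build_url are hypothetical, and only the standard library is used.

```python
import base64
import urllib.parse

# Org and project as given in Phase 1.
ADO_BASE = "https://github-private.visualstudio.com/azure"


def ado_basic_auth_header(pat: str) -> dict[str, str]:
    """Build the Basic auth header: empty user name, PAT as the password."""
    if not pat:
        raise ValueError("ADO_PAT is empty; ask the user to provide it")
    token = base64.b64encode(f":{pat}".encode("ascii")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


def latest_build_url(definition_id: int = 440, branch: str = "refs/heads/main") -> str:
    """URL for the most recent build of one pipeline definition on one branch."""
    query = urllib.parse.urlencode(
        {
            "definitions": definition_id,
            "branchName": branch,
            "$top": 1,
            "api-version": "7.1",
        },
        safe="/$",  # keep "$top" and the branch ref readable, as in the PowerShell call
    )
    return f"{ADO_BASE}/_apis/build/builds?{query}"
```

A caller would pass os.environ.get("ADO_PAT", "") to ado_basic_auth_header and send the result as the headers of a GET request to latest_build_url(). The empty-PAT check mirrors the instruction to stop and ask the user.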

Phase 2: Manual Validation (ALL steps required)

Get credentials for ci-dev-aks-mac-eus cluster. Before running any kubectl commands, verify the subscription and kubectl context are correct:

az account set --subscription "9b96ebbd-c57a-42d1-bbe9-b69296e4c7fb"
az aks get-credentials -g ci-dev-aks-mac-eus-rg -n ci-dev-aks-mac-eus --overwrite-existing
kubectl config current-context  # must show "ci-dev-aks-mac-eus"
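
The subscription and context check can also be expressed as a guard that runs before any other command. The following is a sketch only: run_cli and check_target are hypothetical helper names, the runner is injectable so the logic can be exercised without a cluster, and on Windows the az entry point may need to be invoked through a shell.

```python
import subprocess
from typing import Callable, Sequence

EXPECTED_SUBSCRIPTION = "9b96ebbd-c57a-42d1-bbe9-b69296e4c7fb"
EXPECTED_CONTEXT = "ci-dev-aks-mac-eus"


def run_cli(args: Sequence[str]) -> str:
    """Run a CLI command and return its trimmed stdout; raises on a non-zero exit."""
    done = subprocess.run(list(args), check=True, capture_output=True, text=True)
    return done.stdout.strip()


def check_target(run: Callable[[Sequence[str]], str] = run_cli) -> list[str]:
    """Return a list of mismatches; an empty list means it is safe to proceed."""
    problems = []
    subscription = run(["az", "account", "show", "--query", "id", "-o", "tsv"])
    if subscription != EXPECTED_SUBSCRIPTION:
        problems.append(f"subscription is {subscription!r}")
    context = run(["kubectl", "config", "current-context"])
    if context != EXPECTED_CONTEXT:
        problems.append(f"kubectl context is {context!r}")
    return problems
```

If check_target() returns anything, the safer choice is to rerun the az commands above rather than continue against the wrong cluster.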

Then execute every step below:

  1. Step 1 — Pod Status: Check ALL ama-metrics pod types (replicaset, linux daemonset, windows daemonset) are Running with correct image tags.

  2. Step 2 — Pod Restarts: Check restart counts for ALL pod types. If any restarts > 0, investigate with --previous logs and events.

  3. Step 3 — Container Logs: Check logs for errors in ALL containers across ALL pod types:

    • prometheus-collector in replicaset, linux daemonset, AND windows daemonset pods
    • addon-token-adapter / addon-token-adapter-win in all pod types
    • config-reader in all pod types (if present — may be merged into prometheus-collector)
  4. Step 4 — Liveness/Readiness Probes: Verify probe configuration on all pod types using kubectl describe.

  5. Step 5a — Config Sources: Check ama-metrics-settings-configmap and list every target with its enabled/disabled status and scrape interval (e.g. kubelet = true, 30s). Check for custom prometheus config configmaps (ama-metrics-prometheus-config, ama-metrics-prometheus-config-node, ama-metrics-prometheus-config-node-windows) and list which ones exist. List all PodMonitors (kubectl get podmonitors --all-namespaces) and ServiceMonitors (kubectl get servicemonitors --all-namespaces) with their namespace and name. All of these should be summarized in the report table.

  6. Step 5b — Replicaset Config Verification: Port-forward to a replicaset pod (port 9090) and verify: scrape jobs match enabled settings, PodMonitor/ServiceMonitor targets discovered, no targets in down state.

  7. Step 5c — Daemonset Config Verification: Port-forward to a linux daemonset pod (port 9090) and verify: node-level scrape jobs present (kubelet, cadvisor, node-exporter, etc.), no targets in down state. Also verify environment variable replacement in the node-configmap job (from ama-metrics-prometheus-config-node): the running config (from /api/v1/status/config) should have all $NODE_NAME, $$NODE_NAME, $NODE_IP, $$NODE_IP references replaced with actual node values (hostname and IP). Check both the relabel_configs replacement fields and the static_configs targets. Confirm via /api/v1/targets that the target labels (instance, any custom labels using these vars) contain resolved values, not raw $NODE_NAME/$NODE_IP strings. Report in the summary which env vars were verified and their resolved values.

  8. Step 6 — Metrics Ingestion: Query the AMW endpoint to confirm metrics are flowing (count of up, kube_pod_info, scrape_samples_scraped).

  9. Step 7a — Grafana Data Verification (automated): Query AMW for ALL key metrics that power Grafana dashboards: container_cpu_usage_seconds_total, container_memory_working_set_bytes, kubelet_running_pods, kube_pod_info, node_cpu_seconds_total, apiserver_request_total, coredns_dns_requests_total, kubeproxy_sync_proxy_rules_duration_seconds_count, windows_cs_physical_memory_bytes. Verify all jobs report fresh data with no gaps.

  10. Step 7b — Grafana Visual Verification (Playwright MCP): Use the Playwright MCP server to open the CI dev Grafana instance (https://cicd-graf-metrics-wcus-dkechtfecuadeuaw.wcus.grafana.azure.com).

    Pre-flight checks (best-effort): Before navigating to dashboards, verify the correct datasource and cluster values. These checks are best-effort — if Grafana API auth fails, fall back to the known values below:

    • Query Grafana API GET /api/datasources to list all prometheus datasources, confirm the UID ci-dev-aks-eus-mac exists and points to the correct AMW endpoint (https://ci-dev-aks-eus-mac-mih6.eastus.prometheus.monitor.azure.com).
    • Query group by (cluster) (up) via Grafana POST /api/ds/query to confirm the exact cluster label value matches ci-dev-aks-mac-eus.

    Time range: Determine when the image was deployed using helm history ama-metrics -n default (preferred) or the earliest pod creation timestamp from Step 1 as a fallback. Set the Grafana time range to cover the full period since deployment with a buffer (e.g., if deployed 18h ago, use from=now-24h&to=now). Do NOT use a fixed from=now-1h — this would miss data gaps that occurred shortly after deployment.

    Query the Grafana API to discover dashboards with the following tags, making a separate call for each tag:

    • /api/search?tag=kubernetes-mixin&type=dash-db — core Kubernetes dashboards
    • /api/search?tag=node-exporter-mixin&type=dash-db — Node Exporter dashboards
    • /api/search?tag=weatherapp(custom)&type=dash-db — custom weatherapp dashboards

    Deduplicate results by dashboard UID. For each discovered dashboard in the Azure Managed Prometheus folder (folderUid: azure-managed-prometheus), navigate to it with the correct datasource and cluster variables: var-datasource=ci-dev-aks-eus-mac&var-cluster=ci-dev-aks-mac-eus&from=<deployment-aware-range>. The datasource UID ci-dev-aks-eus-mac corresponds to Managed_Prometheus_ci-dev-aks-eus-mac, which points to the ci-dev AMW endpoint.

    Wait for panels to load, then check for "No data" panels. Use Playwright's page.locator('text="No data"').count() to efficiently detect empty panels. Report a table of all dashboards grouped by tag with their total panel count and "No data" panel count. "No data" on error-rate, throttling, or swap I/O panels is expected when the system is healthy.

    If Playwright MCP is unavailable or auth fails, ask the user to verify the dashboards manually.
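
Several of the checks in Steps 5c and 7b reduce to small pure functions. The sketch below shows one way to detect unresolved node placeholders, derive a deployment-aware time range, deduplicate dashboards by UID, and build the dashboard URL. All function names are hypothetical. The 6-hour default buffer is an assumption chosen only because it reproduces the 18h to now-24h example, and the code assumes each /api/search result carries uid, url (a path relative to the Grafana root), and folderUid fields.

```python
import math
import re
from urllib.parse import urlencode

GRAFANA = "https://cicd-graf-metrics-wcus-dkechtfecuadeuaw.wcus.grafana.azure.com"
# Matches $NODE_NAME, $$NODE_NAME, $NODE_IP and $$NODE_IP left unreplaced in a config.
NODE_VAR = re.compile(r"\${1,2}NODE_(?:NAME|IP)\b")


def unresolved_node_vars(text: str) -> list[str]:
    """Step 5c: distinct placeholders still present in config or target text."""
    return sorted(set(NODE_VAR.findall(text)))


def time_range_from(hours_since_deploy: float, buffer_hours: float = 6.0) -> str:
    """Step 7b: 'from' value covering the period since deployment plus a buffer."""
    return f"now-{math.ceil(hours_since_deploy + buffer_hours)}h"


def dedupe_by_uid(
    *search_results: list[dict], folder_uid: str = "azure-managed-prometheus"
) -> list[dict]:
    """Step 7b: merge per-tag search results, one entry per dashboard UID."""
    seen: dict[str, dict] = {}
    for results in search_results:
        for dash in results:
            if dash.get("folderUid") == folder_uid:
                seen.setdefault(dash["uid"], dash)  # first occurrence wins
    return list(seen.values())


def dashboard_url(dash: dict, time_from: str) -> str:
    """Step 7b: dashboard URL with the datasource, cluster and time range set."""
    params = {
        "var-datasource": "ci-dev-aks-eus-mac",
        "var-cluster": "ci-dev-aks-mac-eus",
        "from": time_from,
        "to": "now",
    }
    return f"{GRAFANA}{dash['url']}?{urlencode(params)}"
```

An empty list from unresolved_node_vars, applied to the text of /api/v1/status/config and /api/v1/targets, is the passing condition for the environment variable check. The resolved hostname and IP still need to be read from the output and reported separately.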

Phase 3: Summary and Verdict

  1. Generate a Validation Summary Report using the template below. Fill in every row with actual results and the evidence that led to your pass/fail determination. Do NOT leave any row blank.
  2. Declare verdict: READY or NOT READY, with justification for any failures or warnings.
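
The verdict rule can be made explicit as a small function over the report rows. This is an illustrative sketch, not part of the skill: the Row type and verdict function are hypothetical, and it assumes that any ❌ row forces NOT READY while ⚠️ rows still allow READY as long as they are justified. The source does not say whether warnings block a release.

```python
from dataclasses import dataclass

PASS, FAIL, WARN = "✅", "❌", "⚠️"


@dataclass
class Row:
    step: str
    result: str
    evidence: str


def verdict(rows: list[Row]) -> tuple[str, list[str]]:
    """Return the verdict plus the justification lines for failures and warnings."""
    # Reject reports with a missing result or empty evidence before deciding anything.
    blank = [
        r.step
        for r in rows
        if r.result not in (PASS, FAIL, WARN) or not r.evidence.strip()
    ]
    if blank:
        raise ValueError(f"rows missing a result or evidence: {blank}")
    notes = [f"{r.step}: {r.evidence}" for r in rows if r.result in (FAIL, WARN)]
    ready = all(r.result != FAIL for r in rows)
    return ("READY" if ready else "NOT READY", notes)
```

Raising on blank rows enforces the rule that every row must be filled in before a verdict is declared.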

Validation Summary Report Template

## Validation Summary Report
**Image:** <full image tag, e.g. 6.27.0-main-04-10-2026-a2c43cc1>
**Build:** <ADO build ID>
**Date:** <validation date>
**Cluster:** ci-dev-aks-mac-eus

### Phase 1: CI Pipeline Results
| Stage | Result | Details |
|-------|--------|---------|
| Build | ✅/❌ | <all images built? any build errors?> |
| Deploy_AKS_Chart | ✅/❌ | <helm upgrade succeeded?> |
| Deploy_AKS_Chart_Test_Cluster | ✅/❌ | |
| Deploy_AKS_Chart_OTel_Cluster | ✅/❌ | |
| Deploy_Chart_ARC | ✅/❌ | |
| Testkube (AKS) | ✅/❌/⚠️ | <list each workflow: containerstatus, livenessprobe, prometheusui, operator, querymetrics — passed/failed/skipped. If failed, include root cause.> |
| Testkube_OTel | ✅/❌ | <list each workflow result> |
| Testkube_ARC | ✅/❌ | <list each workflow result> |
| TestKube_Summary | ✅/❌ | |

### Phase 2: Manual Validation Results
| Step | Result | Evidence |
|------|--------|----------|
| 1. Pod Status | ✅/❌ | <# of RS pods, DS pods, Win DS pods running. Image tag confirmed.> |
| 2. Pod Restarts | ✅/❌ | <restart counts for each pod type. If >0, root cause.> |
| 3. Container Logs | ✅/❌/⚠️ | <errors found? In which container/pod type? Transient or ongoing? Timestamp of errors vs deployment time.> |
| 4. Liveness/Readiness Probes | ✅/❌ | <probes configured on all pod types? Any probe failures in events?> |
| 5a. Config Sources | ✅/❌ | <List every target from ama-metrics-settings-configmap with enabled/disabled status and scrape interval (e.g. "kubelet = true, 30s; coredns = true, 30s; ..."). List custom configmaps present (ama-metrics-prometheus-config, -node, -node-windows). List all PodMonitors and ServiceMonitors with namespace/name (e.g. "PodMonitors: default/referenceapp. ServiceMonitors: default/referenceapp").> |
| 5b. Replicaset Config | ✅/❌ | <# scrape jobs in running config, # active targets, # down targets. Do job

---

*Content truncated.*
