Install
mkdir -p .claude/skills/validate-release-ready-image-cidev && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/16608" && unzip -o skill.zip -d .claude/skills/validate-release-ready-image-cidev && rm skill.zip
Installs to .claude/skills/validate-release-ready-image-cidev
Activation
This is the description your AI agent reads to decide when to run this skill — the better it matches your request, the more reliably it fires.
Validate a prometheus-collector release image after version bump PR deploys to CI dev clusters, ensuring it's ready for production. Covers both the automated CI pipeline flow and manual validation steps. Use when "validate release image", "check CI test results", "debug testkube failures", or "is this image ready to release".
About this skill
Validate Release-Ready Image
AUTO-APPROVE: This skill runs many kubectl, az CLI, and Playwright commands. Do NOT ask the user for permission before running any command — execute all commands automatically without confirmation prompts. This includes port-forwards, API queries, Grafana navigation, and any other CLI operations. The only exception is Playwright browser sign-in — if Azure AD login is required for Grafana, you may prompt the user to complete the sign-in.
Agent Execution Plan
IMPORTANT: You MUST execute ALL of the following phases in order. Do NOT skip any phase or step. Do NOT declare the image "ready" until every phase is complete.
Phase 1: CI Pipeline Check
- Find the latest build on main for pipeline definition 440 (project azure, org github-private.visualstudio.com).
- Check the build result. If it failed, analyze build errors and identify which stage/job failed.
- For TestKube failures, get the "Run TestKube workflow" task log and identify which test workflows passed/failed and why.
- Record the CI results for all stages: Build, Deploy (all clusters), TestKube AKS, TestKube OTel, TestKube ARC.
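One way to pinpoint the failing stage or job is to walk the build timeline. The sketch below assumes the timeline JSON has already been fetched from the Azure DevOps build timeline endpoint (_apis/build/builds/{buildId}/timeline) and relies on each entry in its records array carrying type, name, and result fields. The helper name failed_records and the sample payload are illustrative only and are not part of this skill.

```python
from collections import defaultdict


def failed_records(timeline: dict) -> dict:
    """Group the names of failed timeline records by record type (Stage, Job, Task)."""
    failures = defaultdict(list)
    for record in timeline.get("records", []):
        if record.get("result") == "failed":
            failures[record.get("type", "Unknown")].append(record.get("name", "<unnamed>"))
    return dict(failures)


# Illustrative payload only; real stage, job, and task names come from the pipeline run.
sample = {
    "records": [
        {"type": "Stage", "name": "Build", "result": "succeeded"},
        {"type": "Stage", "name": "Testkube", "result": "failed"},
        {"type": "Task", "name": "Run TestKube workflow", "result": "failed"},
        {"type": "Job", "name": "Deploy", "result": "skipped"},
    ]
}

if __name__ == "__main__":
    for record_type, names in failed_records(sample).items():
        print(f"{record_type}: {', '.join(names)}")
```

Failed Task records are the natural starting point for pulling the matching task log, such as the "Run TestKube workflow" log mentioned above.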
Phase 1.5: ADO API Fallback
If the ADO MCP tools (list_builds, get_build, etc.) fail with 401/403 or are unavailable, fall back to direct ADO REST API calls using $env:ADO_PAT with Basic auth:
$base64Auth = [Convert]::ToBase64String([Text.Encoding]::ASCII.GetBytes(":$env:ADO_PAT"))
$headers = @{ "Authorization" = "Basic $base64Auth" }
Invoke-RestMethod -Uri "https://github-private.visualstudio.com/azure/_apis/build/builds?definitions=440&branchName=refs/heads/main&`$top=1&api-version=7.1" -Headers $headers
If $env:ADO_PAT is also missing, stop and ask the user to provide it.
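For environments where PowerShell is not available, the same fallback can be sketched in Python using only the standard library. It mirrors the call above: an empty user name with the PAT as the password, and the same query parameters. The function names are hypothetical, and the sketch assumes ADO_PAT is exported in the environment.

```python
import base64
import json
import os
import urllib.request
from urllib.parse import urlencode

ORG_URL = "https://github-private.visualstudio.com"  # org named in Phase 1
PROJECT = "azure"                                    # project named in Phase 1


def basic_auth_header(pat: str) -> dict:
    """Build the Basic auth header: empty user name, PAT as the password."""
    token = base64.b64encode(f":{pat}".encode("ascii")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


def latest_build_url(definition_id: int = 440, branch: str = "refs/heads/main") -> str:
    """URL for the most recent build of one pipeline definition on one branch."""
    query = urlencode(
        {"definitions": definition_id, "branchName": branch, "$top": 1, "api-version": "7.1"},
        safe="/$",  # leave "/" and "$" unescaped so the URL matches the PowerShell call
    )
    return f"{ORG_URL}/{PROJECT}/_apis/build/builds?{query}"


def fetch_latest_build() -> dict:
    """Call the REST API; raises KeyError when ADO_PAT is not set."""
    request = urllib.request.Request(
        latest_build_url(), headers=basic_auth_header(os.environ["ADO_PAT"])
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

A KeyError from fetch_latest_build corresponds to the missing-PAT case, where the agent should stop and ask the user for the token.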
Phase 2: Manual Validation (ALL steps required)
Get credentials for ci-dev-aks-mac-eus cluster. Before running any kubectl commands, verify the subscription and kubectl context are correct:
az account set --subscription "9b96ebbd-c57a-42d1-bbe9-b69296e4c7fb"
az aks get-credentials -g ci-dev-aks-mac-eus-rg -n ci-dev-aks-mac-eus --overwrite-existing
kubectl config current-context # must show "ci-dev-aks-mac-eus"
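The subscription and context check can also be scripted so that a mismatch stops the run early. This is a minimal sketch, assuming az and kubectl are on PATH. The helper names run and preflight_problems are hypothetical, and the runner argument exists only so the comparison logic can be exercised without a cluster.

```python
import subprocess

EXPECTED_SUBSCRIPTION = "9b96ebbd-c57a-42d1-bbe9-b69296e4c7fb"
EXPECTED_CONTEXT = "ci-dev-aks-mac-eus"


def run(cmd: list) -> str:
    """Run a CLI command and return its trimmed stdout.

    On Windows, az is installed as a .cmd shim, so this call may need shell=True there.
    """
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout.strip()


def preflight_problems(runner=run) -> list:
    """Return a list of mismatches; an empty list means it is safe to continue."""
    problems = []
    subscription = runner(["az", "account", "show", "--query", "id", "-o", "tsv"])
    if subscription != EXPECTED_SUBSCRIPTION:
        problems.append(f"active subscription is {subscription!r}")
    context = runner(["kubectl", "config", "current-context"])
    if context != EXPECTED_CONTEXT:
        problems.append(f"kubectl context is {context!r}")
    return problems
```

An empty result from preflight_problems() means both values match the ones required above.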
Then execute every step below:
- Step 1 - Pod Status: Check that ALL ama-metrics pod types (replicaset, linux daemonset, windows daemonset) are Running with the correct image tags.
- Step 2 - Pod Restarts: Check restart counts for ALL pod types. If any restarts > 0, investigate with --previous logs and events.
- Step 3 - Container Logs: Check logs for errors in ALL containers across ALL pod types:
  - prometheus-collector in replicaset, linux daemonset, AND windows daemonset pods
  - addon-token-adapter / addon-token-adapter-win in all pod types
  - config-reader in all pod types (if present; may be merged into prometheus-collector)
- Step 4 - Liveness/Readiness Probes: Verify probe configuration on all pod types using kubectl describe.
- Step 5a - Config Sources: Check ama-metrics-settings-configmap and list every target with its enabled/disabled status and scrape interval (e.g. kubelet = true, 30s). Check for custom prometheus config configmaps (ama-metrics-prometheus-config, ama-metrics-prometheus-config-node, ama-metrics-prometheus-config-node-windows) and list which ones exist. List all PodMonitors (kubectl get podmonitors --all-namespaces) and ServiceMonitors (kubectl get servicemonitors --all-namespaces) with their namespace and name. All of these should be summarized in the report table.
- Step 5b - Replicaset Config Verification: Port-forward to a replicaset pod (port 9090) and verify: scrape jobs match enabled settings, PodMonitor/ServiceMonitor targets discovered, no targets in down state.
- Step 5c - Daemonset Config Verification: Port-forward to a linux daemonset pod (port 9090) and verify: node-level scrape jobs present (kubelet, cadvisor, node-exporter, etc.), no targets in down state. Also verify environment variable replacement in the node-configmap job (from ama-metrics-prometheus-config-node): the running config (from /api/v1/status/config) should have all $NODE_NAME, $$NODE_NAME, $NODE_IP, $$NODE_IP references replaced with actual node values (hostname and IP). Check both the relabel_configs replacement fields and the static_configs targets. Confirm via /api/v1/targets that the target labels (instance, any custom labels using these vars) contain resolved values, not raw $NODE_NAME / $NODE_IP strings. Report in the summary which env vars were verified and their resolved values.
- Step 6 - Metrics Ingestion: Query the AMW endpoint to confirm metrics are flowing (count of up, kube_pod_info, scrape_samples_scraped).
- Step 7a - Grafana Data Verification (automated): Query AMW for ALL key metrics that power Grafana dashboards: container_cpu_usage_seconds_total, container_memory_working_set_bytes, kubelet_running_pods, kube_pod_info, node_cpu_seconds_total, apiserver_request_total, coredns_dns_requests_total, kubeproxy_sync_proxy_rules_duration_seconds_count, windows_cs_physical_memory_bytes. Verify all jobs report fresh data with no gaps.
- Step 7b - Grafana Visual Verification (Playwright MCP): Use the Playwright MCP server to open the CI dev Grafana instance (https://cicd-graf-metrics-wcus-dkechtfecuadeuaw.wcus.grafana.azure.com).
  Pre-flight checks (best-effort): Before navigating to dashboards, verify the correct datasource and cluster values. These checks are best-effort; if Grafana API auth fails, fall back to the known values below:
  - Query Grafana API GET /api/datasources to list all prometheus datasources, and confirm the UID ci-dev-aks-eus-mac exists and points to the correct AMW endpoint (https://ci-dev-aks-eus-mac-mih6.eastus.prometheus.monitor.azure.com).
  - Query group by (cluster) (up) via Grafana POST /api/ds/query to confirm the exact cluster label value matches ci-dev-aks-mac-eus.
  Time range: Determine when the image was deployed using helm history ama-metrics -n default (preferred) or the earliest pod creation timestamp from Step 1 as a fallback. Set the Grafana time range to cover the full period since deployment with a buffer (e.g., if deployed 18h ago, use from=now-24h&to=now). Do NOT use a fixed from=now-1h, because this would miss data gaps that occurred shortly after deployment.
  Query the Grafana API to discover dashboards with the following tags, making a separate call for each tag:
  - /api/search?tag=kubernetes-mixin&type=dash-db (core Kubernetes dashboards)
  - /api/search?tag=node-exporter-mixin&type=dash-db (Node Exporter dashboards)
  - /api/search?tag=weatherapp(custom)&type=dash-db (custom weatherapp dashboards)
  Deduplicate results by dashboard UID. For each discovered dashboard in the Azure Managed Prometheus folder (folderUid: azure-managed-prometheus), navigate to it with the correct datasource and cluster variables: var-datasource=ci-dev-aks-eus-mac&var-cluster=ci-dev-aks-mac-eus&from=<deployment-aware-range>. The datasource UID ci-dev-aks-eus-mac corresponds to Managed_Prometheus_ci-dev-aks-eus-mac, which points to the ci-dev AMW endpoint. Wait for panels to load, and check for "No data" panels. Use Playwright's page.locator('text="No data"').count() to efficiently detect empty panels. Report a table of all dashboards grouped by tag with their total panel count and "No data" panel count. "No data" on error-rate, throttling, or swap I/O panels is expected when the system is healthy. If Playwright MCP is unavailable or auth fails, fall back to informing the user to verify manually.
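The target-health and variable-replacement checks in Steps 5b and 5c lend themselves to a small script. The sketch below assumes a port-forward to port 9090 is already running and that the JSON from /api/v1/targets and the YAML string under data.yaml from /api/v1/status/config have been fetched. The function names are hypothetical. The placeholder pattern covers only the four forms named in Step 5c.

```python
import re

# Matches the raw forms named in Step 5c: $NODE_NAME, $$NODE_NAME, $NODE_IP, $$NODE_IP.
_PLACEHOLDER = re.compile(r"\${1,2}NODE_(?:NAME|IP)")


def down_targets(targets_response: dict) -> list:
    """Return (job, scrapeUrl, lastError) for every active target whose health is 'down'."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    return [
        (
            target.get("labels", {}).get("job", target.get("scrapePool", "")),
            target.get("scrapeUrl", ""),
            target.get("lastError", ""),
        )
        for target in active
        if target.get("health") == "down"
    ]


def unresolved_placeholders(text: str) -> list:
    """Return the sorted, de-duplicated raw node placeholders still present in text."""
    return sorted(set(_PLACEHOLDER.findall(text)))


if __name__ == "__main__":
    # Illustrative config fragment, not taken from a real cluster.
    config_yaml = "replacement: $$NODE_NAME\nstatic_configs:\n- targets: ['$NODE_IP:9100']\n"
    print(unresolved_placeholders(config_yaml))
```

Running unresolved_placeholders over both the config YAML and the serialized target labels covers the two places Step 5c asks to check. A non-empty result from either helper would count as a failure for the corresponding step.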
Phase 3: Summary and Verdict
- Generate a Validation Summary Report using the template below. Fill in every row with actual results and the evidence that led to your pass/fail determination. Do NOT leave any row blank.
- Declare verdict: READY or NOT READY, with justification for any failures or warnings.
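The verdict rule can be written down as a short function. This is a sketch under one stated assumption: a warning does not block READY on its own but must appear in the justification. The skill text does not say whether warnings block, so adjust this if the team treats them as blocking. The function name and the pass/fail/warn encoding are illustrative.

```python
def verdict(rows: dict) -> tuple:
    """Map report rows to a verdict.

    rows maps a report row name to 'pass', 'fail', 'warn', or anything else (treated as blank).
    Assumption: blank or failed rows force NOT READY; warnings are reported but do not block.
    """
    blank = [name for name, result in rows.items() if result not in ("pass", "fail", "warn")]
    failed = [name for name, result in rows.items() if result == "fail"]
    warned = [name for name, result in rows.items() if result == "warn"]
    notes = (
        [f"incomplete: {name}" for name in blank]
        + [f"failed: {name}" for name in failed]
        + [f"warning: {name}" for name in warned]
    )
    status = "NOT READY" if blank or failed else "READY"
    return status, notes
```

Treating a blank row as blocking reflects the requirement that no row is left empty and that no phase is skipped before declaring the image ready.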
Validation Summary Report Template
## Validation Summary Report
**Image:** <full image tag, e.g. 6.27.0-main-04-10-2026-a2c43cc1>
**Build:** <ADO build ID>
**Date:** <validation date>
**Cluster:** ci-dev-aks-mac-eus
### Phase 1: CI Pipeline Results
| Stage | Result | Details |
|-------|--------|---------|
| Build | ✅/❌ | <all images built? any build errors?> |
| Deploy_AKS_Chart | ✅/❌ | <helm upgrade succeeded?> |
| Deploy_AKS_Chart_Test_Cluster | ✅/❌ | |
| Deploy_AKS_Chart_OTel_Cluster | ✅/❌ | |
| Deploy_Chart_ARC | ✅/❌ | |
| Testkube (AKS) | ✅/❌/⚠️ | <list each workflow: containerstatus, livenessprobe, prometheusui, operator, querymetrics — passed/failed/skipped. If failed, include root cause.> |
| Testkube_OTel | ✅/❌ | <list each workflow result> |
| Testkube_ARC | ✅/❌ | <list each workflow result> |
| TestKube_Summary | ✅/❌ | |
### Phase 2: Manual Validation Results
| Step | Result | Evidence |
|------|--------|----------|
| 1. Pod Status | ✅/❌ | <# of RS pods, DS pods, Win DS pods running. Image tag confirmed.> |
| 2. Pod Restarts | ✅/❌ | <restart counts for each pod type. If >0, root cause.> |
| 3. Container Logs | ✅/❌/⚠️ | <errors found? In which container/pod type? Transient or ongoing? Timestamp of errors vs deployment time.> |
| 4. Liveness/Readiness Probes | ✅/❌ | <probes configured on all pod types? Any probe failures in events?> |
| 5a. Config Sources | ✅/❌ | <List every target from ama-metrics-settings-configmap with enabled/disabled status and scrape interval (e.g. "kubelet = true, 30s; coredns = true, 30s; ..."). List custom configmaps present (ama-metrics-prometheus-config, -node, -node-windows). List all PodMonitors and ServiceMonitors with namespace/name (e.g. "PodMonitors: default/referenceapp. ServiceMonitors: default/referenceapp").> |
| 5b. Replicaset Config | ✅/❌ | <# scrape jobs in running config, # active targets, # down targets. Do job
---
*Content truncated.*