agentskills.codes

homelab-investigator


Install

mkdir -p .claude/skills/homelab-investigator && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/14750" && unzip -o skill.zip -d .claude/skills/homelab-investigator && rm skill.zip

Installs to .claude/skills/homelab-investigator

Activation

This is the description your AI agent reads to decide when to run this skill — the better it matches your request, the more reliably it fires.

Investigate homelab infrastructure health using VictoriaMetrics and VictoriaLogs. Use for node health, pod issues, storage, network, and security investigation.

About this skill

Homelab Investigator

Investigate homelab infrastructure health by querying VictoriaMetrics (metrics) and VictoriaLogs (logs). Read-only analysis -- query, correlate, classify, recommend.

Principles

  • Investigation-only: query and analyze, never modify infrastructure
  • Evidence-based: every conclusion backed by query results
  • Correlate across subsystems: node issues cause pod issues; network issues cause ingress errors; storage issues cause mount stalls
  • Challenge your own conclusions: when you think you've found the root cause, ask "what else could cause these symptoms?" and look for evidence
  • Human decides: present findings with confidence levels; the engineer makes the call

Services & Data

| Service | URL | API |
|---|---|---|
| VictoriaMetrics | https://victoriametrics.matthew-stratton.me | PromQL via /api/v1/query |
| VictoriaLogs | https://victorialogs.matthew-stratton.me | LogsQL via /select/logsql/query |
| Grafana | https://grafana.matthew-stratton.me | Dashboards (visual verification) |

Data retention: VM 30 days, VL 7 days.
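The endpoints and retention windows above can be sketched as small URL builders. This is a minimal illustration, not how obs-query is implemented: the function names and the `limit` default are assumptions, while the base URLs, paths, and retention values come from the table and the retention note.

```python
from urllib.parse import urlencode

# Base URLs and API paths come from the services table above.
VM_BASE = "https://victoriametrics.matthew-stratton.me"
VL_BASE = "https://victorialogs.matthew-stratton.me"

# Retention in days, as stated above (VM 30 days, VL 7 days).
RETENTION_DAYS = {"vm": 30, "vl": 7}


def vm_query_url(promql):
    """Build an instant-query URL for VictoriaMetrics (hypothetical helper)."""
    return VM_BASE + "/api/v1/query?" + urlencode({"query": promql})


def vl_query_url(logsql, limit=100):
    """Build a LogsQL query URL for VictoriaLogs (hypothetical helper)."""
    params = {"query": logsql, "limit": limit}
    return VL_BASE + "/select/logsql/query?" + urlencode(params)


def within_retention(source, days_back):
    """Return True if data from `days_back` days ago should still exist."""
    return days_back <= RETENTION_DAYS[source]
```

Checking `within_retention("vl", 10)` before querying logs avoids mistaking an empty result for "nothing happened" when the data has simply aged out.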

Status:

!python3 .claude/skills/homelab-investigator/obs-query health 2>/dev/null | python3 -c "import sys,json; d=json.load(sys.stdin); nodes=list(d['cpu_pct'].keys()); print('VM up -- %d nodes: %s' % (len(nodes), ', '.join(nodes)))" || echo "VictoriaMetrics not reachable"
!curl -sf https://victorialogs.matthew-stratton.me/health >/dev/null 2>&1 && echo "VictoriaLogs up" || echo "VictoriaLogs not reachable"

References:

  • Metrics catalog + query patterns: docs/appendix/victoriametrics-queries.md
  • Query tool recipes: .claude/skills/homelab-investigator/query-recipes.md
  • Known failure patterns: .claude/skills/homelab-investigator/known-patterns.md
  • Stack architecture: docs/06-observability.md

Hardware Context

  • 2 RPi 4B nodes (4-core ARM, 8GB RAM): k3-m1 (control-plane), k3-n1 (worker)
  • MikroTik RB5009 router: 192.168.1.1
  • Synology DS720+ NAS: 192.168.1.200 (NFS storage, iSCSI)

Query Tool

All queries go through .claude/skills/homelab-investigator/obs-query <command> [args...]. Outputs JSON lines (one JSON object per line).
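Because the tool emits JSON lines, its output can be consumed one object at a time. The sketch below assumes only that each non-empty line is a JSON object; the sample payload is illustrative (the `cpu_pct` key appears in the status check above, the values are made up).

```python
import json


def parse_json_lines(text):
    """Yield one parsed object per non-empty line of JSON-lines output."""
    for line in text.splitlines():
        line = line.strip()
        if line:
            yield json.loads(line)


# Illustrative sample only; real obs-query output may carry more fields.
sample = '{"cpu_pct": {"k3-m1": 12.5, "k3-n1": 30.1}}\n'
records = list(parse_json_lines(sample))
nodes = sorted(records[0]["cpu_pct"])
```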

Key Commands by Phase

| Phase | Commands |
|---|---|
| Health check | health, node-health, pod-health |
| Node diagnostics | cpu, memory, disk, temperature, network |
| Infrastructure | router, nas, nas-storage |
| Kubernetes | pod-restarts, resource-pressure, deployments, node-conditions |
| Logs | ingress-status, ingress-errors, firewall-drops, modsecurity, search-logs |

See query-recipes.md for detailed usage and interpretation of each command.

Key Metrics

| Metric | What It Tells You |
|---|---|
| cpu_usage_idle{cpu="cpu-total"} | Node CPU idle %. Compute usage as 100 - idle. |
| mem_used_percent | Node memory usage %. |
| disk_used_percent{path="/"} | Root filesystem usage %. |
| temp_temp | CPU/SoC temperature in Celsius. |
| diskio_io_await | Disk IO latency in ms. High = SD card degradation. |
| snmp_mikrotik_cpu_load | Router CPU %. Should be near 0 for home use. |
| snmp_synology_disk_disk_status | NAS disk health. 1=Normal. |
| snmp_synology_raid_raid_status | NAS RAID health. 1=Normal, 11=Degraded. |
| kube_pod_status_phase | Pod phase (Running, Pending, Failed, etc.). |
| kube_pod_container_status_restarts_total | Container restart counter. |
| kubernetes_pod_container_memory_working_set_bytes | Per-container memory usage. |
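Two of the metrics above need a small transformation before they are useful: CPU usage is derived from idle, and the restart counter is only meaningful as a change over a window. A sketch, where the constant and function names are assumptions and the expressions are ordinary PromQL built from the metric names in the table:

```python
# CPU usage as "100 - idle", per the table above.
CPU_USED_PROMQL = '100 - cpu_usage_idle{cpu="cpu-total"}'

# Restarts over the last hour, derived from the cumulative counter.
RESTARTS_1H_PROMQL = "increase(kube_pod_container_status_restarts_total[1h])"


def cpu_used_pct(idle_pct):
    """Convert an idle percentage into a usage percentage."""
    return 100.0 - idle_pct
```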

Investigation Workflow

Phase 0: Parse Input

Route by user request:

| Input | Action |
|---|---|
| "Is everything healthy?" / no args | Phase 1: health snapshot |
| Node name (k3-m1, k3-n1) | Phase 2: node diagnostics |
| Pod/namespace name | Phase 2: pod diagnostics |
| "router" / "nas" / "storage" | Phase 2: infrastructure |
| "ingress" / "firewall" / "security" | Phase 2: log analysis |
| Vague problem description | Phase 1 first, then narrow |
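The routing table can be read as a keyword dispatch. The sketch below is one possible interpretation, not the skill's actual logic: the return labels and the `namespaces` parameter (a caller-supplied set of known pod or namespace names) are assumptions.

```python
import re

NODES = {"k3-m1", "k3-n1"}
INFRA_WORDS = {"router", "nas", "storage"}
LOG_WORDS = {"ingress", "firewall", "security"}


def route(request, namespaces=frozenset()):
    """Map a free-text request to an investigation phase label."""
    words = set(re.findall(r"[a-z0-9-]+", request.lower()))
    if not words:
        return "phase1:health"  # no args
    if words & NODES:
        return "phase2:node"
    if words & set(namespaces):
        return "phase2:pod"
    if words & INFRA_WORDS:
        return "phase2:infrastructure"
    if words & LOG_WORDS:
        return "phase2:logs"
    if "healthy" in words:
        return "phase1:health"
    return "phase1-then-narrow"  # vague description
```

Specific targets are checked before the generic "healthy" keyword so that a request naming a node or subsystem goes straight to its Phase 2 branch.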

Phase 1: Health Snapshot

Run obs-query health. Scan for red flags:

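| Metric | Healthy | Warning | Critical |
|---|---|---|---|
| CPU % | <80 | 80-95 | >95 |
| Memory % | <85 | 85-95 | >95 |
| Disk % | <80 | 80-90 | >90 |
| Temperature C | <70 | 70-80 | >80 |
| Router CPU | <50 | 50-80 | >80 |
| NAS disk_status | 1 | -- | != 1 |
| NAS raid_status | 1 | 2 (repairing) | 11 (degraded) |
| Pod restarts | 0 in 1h | 1-5 in 1h | >5 in 1h |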

If everything is green, say so. If red flags exist, proceed to Phase 2 with the most critical issue.
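The threshold table translates directly into a classifier. In this sketch the metric keys and function names are assumptions; the boundaries follow the table, with the upper bound of each warning range still counted as a warning. RAID codes the table does not list are reported as "unknown" rather than guessed.

```python
# (warning_at, critical_above) per metric, taken from the table above.
THRESHOLDS = {
    "cpu_pct": (80, 95),
    "memory_pct": (85, 95),
    "disk_pct": (80, 90),
    "temperature_c": (70, 80),
    "router_cpu_pct": (50, 80),
}
RAID_STATUS = {1: "healthy", 2: "warning", 11: "critical"}
SEVERITY = {"healthy": 0, "unknown": 1, "warning": 2, "critical": 3}


def classify(metric, value):
    """Classify a percentage or temperature reading."""
    warn, crit = THRESHOLDS[metric]
    if value > crit:
        return "critical"
    return "warning" if value >= warn else "healthy"


def classify_restarts(count_1h):
    """Classify pod restarts observed in the last hour."""
    if count_1h > 5:
        return "critical"
    return "warning" if count_1h >= 1 else "healthy"


def classify_disk_status(code):
    """NAS disk_status: anything other than 1 is critical."""
    return "healthy" if code == 1 else "critical"


def classify_raid(code):
    """NAS raid_status: unlisted codes are reported as unknown."""
    return RAID_STATUS.get(code, "unknown")


def most_critical(results):
    """Return the (name, status) pair with the highest severity."""
    return max(results.items(), key=lambda item: SEVERITY[item[1]])
```

`most_critical` picks the issue to carry into Phase 2 when several red flags appear at once.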

Phase 2: Identify Anomaly

Narrow to the specific subsystem:

  • Node issue: node-health <node>, then cpu, memory, disk, temperature, network for detail
  • Pod issue: pod-health <namespace>, pod-restarts <namespace>, resource-pressure <namespace>
  • Storage issue: nas, nas-storage, disk <node>
  • Network/ingress: ingress-status, ingress-errors, network
  • Security: firewall-drops, modsecurity, ingress-errors
  • Infrastructure: router, nas, deployments, node-conditions
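The subsystem list above can be kept as data and expanded into command lines. This is a sketch: the dictionary keys and `build_commands` are hypothetical, and arguments are only filled in where the list shows a `<node>` or `<namespace>` placeholder.

```python
OBS_QUERY = ".claude/skills/homelab-investigator/obs-query"

# Command templates per subsystem, copied from the list above.
PHASE2_COMMANDS = {
    "node": ["node-health {node}", "cpu", "memory", "disk",
             "temperature", "network"],
    "pod": ["pod-health {namespace}", "pod-restarts {namespace}",
            "resource-pressure {namespace}"],
    "storage": ["nas", "nas-storage", "disk {node}"],
    "network": ["ingress-status", "ingress-errors", "network"],
    "security": ["firewall-drops", "modsecurity", "ingress-errors"],
    "infrastructure": ["router", "nas", "deployments", "node-conditions"],
}


def build_commands(subsystem, **targets):
    """Expand templates into argv lists; raises KeyError if a target is missing."""
    return [
        [OBS_QUERY] + template.format(**targets).split()
        for template in PHASE2_COMMANDS[subsystem]
    ]
```

For example, `build_commands("pod", namespace="monitoring")` yields three argv lists that could be passed to `subprocess.run` in order (the namespace name here is only an example).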

Phase 3: Drill Down & Correlate

Cross-reference metrics with logs. Look for related symptoms:

  • High CPU + pod restarts = possible OOM or crash loop
  • High disk IO + processes_blocked = NFS mount stall or SD card degradation
  • Ingress 5xx + deployment unhealthy = backend down
  • High temperature + CPU spike = thermal throttling
  • Node not Ready + multiple pod failures = node-level issue
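The symptom pairs above work as simple rules: when every symptom in a pair is present, the paired diagnosis becomes a hypothesis to test. A minimal sketch, assuming hypothetical symptom flag names that an earlier step would set from query results:

```python
# (required symptoms, hypothesis), mirroring the list above.
RULES = [
    ({"high_cpu", "pod_restarts"}, "possible OOM or crash loop"),
    ({"high_disk_io", "processes_blocked"},
     "NFS mount stall or SD card degradation"),
    ({"ingress_5xx", "deployment_unhealthy"}, "backend down"),
    ({"high_temperature", "cpu_spike"}, "thermal throttling"),
    ({"node_not_ready", "multiple_pod_failures"}, "node-level issue"),
]


def correlate(symptoms):
    """Return every hypothesis whose required symptoms are all present."""
    observed = set(symptoms)
    return [hypothesis for required, hypothesis in RULES
            if required <= observed]
```

A match is a lead, not a conclusion: each hypothesis still needs supporting evidence from logs before it is reported as a root cause.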

Phase 4: Recommend

Summarize findings and propose remediation. Reference known-patterns.md for recognized failure signatures.

Output Format

## Investigation: [scope]

**Status**: [healthy | warning | critical]

### Findings
- [finding 1 with evidence]
- [finding 2 with evidence]

### Root Cause (if identified)
[Traced failure chain: symptom -> cause -> underlying issue]
**Confidence**: [high | medium | low]

### Recommendation
[Specific remediation steps]
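The template above can be filled in programmatically. The following sketch assumes a hypothetical `render_report` helper; it emits the root-cause section only when a cause was identified, matching the "(if identified)" note.

```python
def render_report(scope, status, findings, recommendation,
                  root_cause=None, confidence=None):
    """Render an investigation report in the format shown above."""
    lines = [
        f"## Investigation: {scope}",
        "",
        f"**Status**: {status}",
        "",
        "### Findings",
    ]
    lines += [f"- {finding}" for finding in findings]
    if root_cause:
        # Root cause is optional; include it with a confidence level.
        lines += ["", "### Root Cause", root_cause,
                  f"**Confidence**: {confidence}"]
    lines += ["", "### Recommendation", recommendation]
    return "\n".join(lines)
```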

Knowledge Capture

  • Read known-patterns.md at the start of investigations.
  • When you discover a stable failure pattern (observed across multiple incidents), append it to known-patterns.md.
  • Don't capture session-specific findings -- that's conversation context.

Complementary Tools

  • scripts/homelab-diagnose.py (via just diagnose): SSH-based diagnostics for when the observability stack is down. Gathers raw system state directly from nodes.
  • Grafana dashboards: Visual investigation with historical trends. Use for time-correlated analysis.
  • Both complement obs-query: homelab-diagnose.py works without the observability stack, Grafana provides visual context, and obs-query provides programmatic querying for systematic investigation.
