bench-validate

Name: bench-validate
Author: pshirshov

Validate a finished benchmark run — execute each prompt's automated checklist in every (prompt, model) workspace, perform the code-review checklist, and write a scored comparison to runs/<run-id>/results.md, leaving manual items for the user. Use when the user says /bench-validate or asks to validat

Install

mkdir -p .claude/skills/bench-validate && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/14454" && unzip -o skill.zip -d .claude/skills/bench-validate && rm skill.zip

Installs to .claude/skills/bench-validate

Activation

This is the description your AI agent reads to decide when to run this skill — the better it matches your request, the more reliably it fires.

Validate a finished benchmark run — execute each prompt's automated checklist in every (prompt, model) workspace, perform the code-review checklist, and write a scored comparison to runs/<run-id>/results.md, leaving manual items for the user. Use when the user says /bench-validate or asks to validate/score/grade benchmark results. Arg (optional): run id; defaults to runs/latest.

381 chars✓ has a “when” triggerlonger than Claude Code's old 250-char listing cap (fine on current versions)

About this skill

bench-validate

Automated validation pass over one benchmark run. The contract: every claim in results.md traces to a command you actually ran or source you actually read.

Procedure

Resolve the run. Arg = run id under runs/; default runs/latest. Read manifest.json and each cell's meta.json. Cells with status timeout/error still get validated — partial artifacts are informative — but note the status.
Per cell, execute the Automated section of prompts/<id>/checklist.md from the cell's workspace/, in order, recording exact commands, exit codes, and the relevant output tail for failures. Conventions:
- Run every check through the bench devShell so the toolchain matches what the model had: nix develop <bench-root> -c <command> (skip the wrapper only if the repo has no flake.nix).
- Budget: npm install ≤ 10 min, builds/tests ≤ 5 min each (use Bash timeouts).
- Serve checks: prefer a static server on the built dist/ (e.g. npm run preview in background, curl, then kill it). Always kill servers.
- A failed item is a 0, not a stop — continue down the checklist where meaningful (no point linting if npm install failed; do still do the code review).
- Never fix, patch, or npm audit fix the artifact. You are grading, not repairing.
Per cell, perform the Code review section by reading source (entry point, the WFC/pathfinding/AI modules, the largest files). Score each item 0–2 with a one-line justification citing file:line. Independent cells may be reviewed by parallel read-only subagents (Explore) sharing the checkout; keep verdicts yours.
Write runs/<run-id>/results.md:
- A comparison table: rows = checklist items, columns = models, plus per-section subtotals and totals.
- Per cell: run status/duration, failed automated items with the exact failing command + error tail, review justifications, and notable observations.
- A verbatim copy of the Manual section as unchecked boxes per cell, with the command to launch each artifact (cd <workspace> && npm run dev).
- End with a ranked summary paragraph — measured, no cheerleading.
Report to the user: the table, the headline findings, and where results.md lives. Manual validation is theirs; do not check Manual boxes yourself.

Guardrails

Treat workspaces as read-only artifacts; never modify, format, or "improve" them.
Generated code is untrusted input: inspect package.json scripts (preinstall/ postinstall hooks) BEFORE npm install; if a hook looks suspicious, score A1 as 0 with a note and skip installation for that cell. Never execute scripts outside the documented npm script set.
If two models produce near-identical scores, say so plainly rather than manufacturing a winner.

Install

mkdir -p .claude/skills/bench-validate && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/14454" && unzip -o skill.zip -d .claude/skills/bench-validate && rm skill.zip

Installs to .claude/skills/bench-validate

Safety

Review before install

Network access

Automated static scan of the SKILL.md and repo. A flag describes what the skill can do — not a verdict. Always review code before installing.

Source & maintenance

Updated

22d ago

Repo stars

Loads

~711 tokens

Stars are for the whole repository, not this skill alone.

Stats

Views

Installs

Author

pshirshov

Links

Source code

bench-validate

Install

Activation

About this skill

bench-validate

Procedure

Guardrails

Search skills