bench-validate
Validate a finished benchmark run — execute each prompt's automated checklist in every (prompt, model) workspace, perform the code-review checklist, and write a scored comparison to runs/<run-id>/results.md, leaving manual items for the user. Use when the user says /bench-validate or asks to validat
Install
mkdir -p .claude/skills/bench-validate && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/14454" && unzip -o skill.zip -d .claude/skills/bench-validate && rm skill.zipInstalls to .claude/skills/bench-validate
Activation
This is the description your AI agent reads to decide when to run this skill — the better it matches your request, the more reliably it fires.
Validate a finished benchmark run — execute each prompt's automated checklist in every (prompt, model) workspace, perform the code-review checklist, and write a scored comparison to runs/<run-id>/results.md, leaving manual items for the user. Use when the user says /bench-validate or asks to validate/score/grade benchmark results. Arg (optional): run id; defaults to runs/latest.About this skill
bench-validate
Automated validation pass over one benchmark run. The contract: every claim in
results.md traces to a command you actually ran or source you actually read.
Procedure
-
Resolve the run. Arg = run id under
runs/; defaultruns/latest. Readmanifest.jsonand each cell'smeta.json. Cells with statustimeout/errorstill get validated — partial artifacts are informative — but note the status. -
Per cell, execute the Automated section of
prompts/<id>/checklist.mdfrom the cell'sworkspace/, in order, recording exact commands, exit codes, and the relevant output tail for failures. Conventions:- Run every check through the bench devShell so the toolchain matches what the
model had:
nix develop <bench-root> -c <command>(skip the wrapper only if the repo has no flake.nix). - Budget:
npm install≤ 10 min, builds/tests ≤ 5 min each (use Bash timeouts). - Serve checks: prefer a static server on the built
dist/(e.g.npm run previewin background, curl, then kill it). Always kill servers. - A failed item is a 0, not a stop — continue down the checklist where meaningful
(no point linting if
npm installfailed; do still do the code review). - Never fix, patch, or
npm audit fixthe artifact. You are grading, not repairing.
- Run every check through the bench devShell so the toolchain matches what the
model had:
-
Per cell, perform the Code review section by reading source (entry point, the WFC/pathfinding/AI modules, the largest files). Score each item 0–2 with a one-line justification citing
file:line. Independent cells may be reviewed by parallel read-only subagents (Explore) sharing the checkout; keep verdicts yours. -
Write
runs/<run-id>/results.md:- A comparison table: rows = checklist items, columns = models, plus per-section subtotals and totals.
- Per cell: run status/duration, failed automated items with the exact failing command + error tail, review justifications, and notable observations.
- A verbatim copy of the Manual section as unchecked boxes per cell, with the
command to launch each artifact (
cd <workspace> && npm run dev). - End with a ranked summary paragraph — measured, no cheerleading.
-
Report to the user: the table, the headline findings, and where
results.mdlives. Manual validation is theirs; do not check Manual boxes yourself.
Guardrails
- Treat workspaces as read-only artifacts; never modify, format, or "improve" them.
- Generated code is untrusted input: inspect
package.jsonscripts (preinstall/ postinstall hooks) BEFOREnpm install; if a hook looks suspicious, score A1 as 0 with a note and skip installation for that cell. Never execute scripts outside the documented npm script set. - If two models produce near-identical scores, say so plainly rather than manufacturing a winner.