gale-v2-refactor
Install
mkdir -p .claude/skills/gale-v2-refactor && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/13250" && unzip -o skill.zip -d .claude/skills/gale-v2-refactor && rm skill.zip
Installs to .claude/skills/gale-v2-refactor
Activation
This is the description your AI agent reads to decide when to run this skill — the better it matches your request, the more reliably it fires.
Use when working on the gale v2 consolidation/refactor: folding the accumulated CPU+GPU DG optimization work into user-friendly building blocks behind src/sim/'s Sim/Device abstraction while defending the measured performance edge. Carries the working-discipline rules distilled from the optimization sessions (the mistakes that got corrected repeatedly), the measurement gotchas, and the v2 design rails. Pair with the gale-gpu-perf skill for the profiling gate. Read docs/gale-v2-refactor.md for the full design + benchmark ledger.
About this skill
gale v2 refactor — discipline, tools, direction
This is a serious refactor: consolidate the perf work into clean building blocks without losing
the numbers. The design + the benchmark baselines live in docs/gale-v2-refactor.md — read it first.
This skill exists mostly to stop the recurring mistakes. Take the working-discipline section literally.
Working discipline — the mistakes that kept happening (don't repeat them)
These are real corrections from the optimization sessions. Each cost time. Internalize the rule, not just the example.
- Iteration counts are language-independent. NEVER cross-compare them across implementations unless the ALGORITHM is identical. A correct Chebyshev-MG-PCG converges in the same iterations in CUDA or cuda-oxide; if a C++ prototype gives 16 and the Rust gives 24, that is not a language/cuda-oxide effect: it's a different algorithm (here: h-MG vs p-MG) or a bug. Comparing C++-iters to Rust-iters and blaming the backend wasted a whole analysis. Compare ms/step (same language, same hardware) for perf; compare iters only within one algorithm.
- A prototype is a "faithful model" only if it runs the SAME algorithm; a coincidental scalar match does not prove it. A C++ h-multigrid prototype was called a faithful model of the production p-multigrid because their Jacobi iteration counts happened to match (29≈28); its 45% win was projected onto production, which actually got ~6%. One matching number ≠ same algorithm. Before projecting a prototype's win, confirm the prototype's structure matches production.
- Never say a kernel is "at the floor" / "nothing left" without an ncu number showing the bound and what's moving the bytes. Claiming op_cheby was "at the bandwidth floor" was wrong: it was BW-bound on the smoother fields (b/dinv/dvec), not the matvec, and mixed precision was a ~1.5× lever. "It works" ≠ done. "It works + here are the profile numbers (DRAM%, occupancy, the bytes)" = done. If you're about to write "we're near optimal," profile it instead.
- Measure; don't project. And measure WALL-CLOCK, not API/launch counts. Don't hedge with "this is probably ~X%." Run it. And note: a while-graph A/B that changed API counts showed no wall-clock win; API counts are not performance. An ncu --graph-profiling graph number on a while-graph kernel is a measurement artifact, not a measurement (see the ncu gotcha below).
- A/B in the PRODUCTION path (the device-resident while-graph), not standalone. The arith matvec was bit-perfect in the standalone op_profile but broke the Neumann pressure solve (46× slower) only in the resident path: the arithmetic-neighbour opt had silently dropped the per-face BC metadata. A standalone win can be a production regression. Validate where it ships.
- For GPU perf work, "C++" means CUDA/C++ (GPU). When asked to prototype "in C++", prototype on the GPU. The CPU is only the cheap way to answer an algorithm/iteration-count question (e.g. "does FP32 smoothing hold the iteration count?"), not a perf prototype.
- Push to the measured answer; but surface genuine forks honestly with the data. Don't stop short or checkpoint to avoid work. Do surface a real architecture fork (e.g. "the win needs h-multigrid, a separate solver") with the measurement that motivates it. That's not hedging, it's the user's call to make. The test: are you stopping because you're unsure (then measure) or because there's a genuine large speculative investment with a real decision (then present the data + recommend)?
- Be honest about what a commit bundles and what you left alone. poisson.rs carried prior uncommitted work mixed in-file; say so. Don't sweep unrelated WIP into a feature commit silently.
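The first two rules can be enforced mechanically instead of by memory. The Rust sketch below is illustrative only: `Algorithm`, `SolveRun`, and `iter_delta` are hypothetical names, not gale types. It refuses to produce an iteration-count delta unless both runs declare the same algorithm.

```rust
/// Hypothetical tag for the solver structure behind a run (not a gale type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Algorithm {
    HMultigrid,
    PMultigrid,
}

/// Hypothetical record of one measured solve.
#[derive(Debug, Clone, Copy)]
pub struct SolveRun {
    pub algorithm: Algorithm,
    pub iters: u32,
    pub ms_per_step: f64,
}

/// Iteration delta (b - a), or None when the two runs use different
/// algorithms and the comparison would be meaningless.
pub fn iter_delta(a: &SolveRun, b: &SolveRun) -> Option<i64> {
    if a.algorithm != b.algorithm {
        return None;
    }
    Some(i64::from(b.iters) - i64::from(a.iters))
}
```

With this shape, the 16-vs-24 comparison from the first rule returns `None` because one run is h-MG and the other p-MG, which forces the question "is this the same algorithm?" before anyone blames the backend.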
Build & measurement gotchas (these bite every session)
- cargo oxide build --arch sm_70 for the Titan V, OR CUDA_OXIDE_TARGET=sm_70 when running the bare binary. cuda-oxide embeds kernels as NVVM-IR and JIT-links the cubin at runtime for CUDA_OXIDE_TARGET (defaults to sm_120). A plain cargo oxide build + running target/release/<bin> directly → DriverError(209) "no kernel image available": that's the wrong-arch cubin, not a perms/CUPTI problem. cargo oxide run auto-detects and injects sm_70; the bare binary (what profilers run) does not.
- ncu CANNOT profile kernels inside conditional (while/if) CUDA graphs; its per-kernel numbers are artifacts. To profile a solve kernel: take it out of the graph (RVP_NOGRAPH=1 / GpuResidentVe::with_while_graph(false), or op_profile) and ncu the plain launch. Use nsys --cuda-graph-trace=node for the in-situ which-kernel-dominates breakdown (timing only, no HW counters). See the gale-gpu-perf skill for the full gate.
- Run cargo oxide, not cargo-oxide (env/cache differ; a stale backend is used silently).
- The mandatory gate before calling GPU work done (from gale-gpu-perf): RULE ZERO (no per-step host↔dev sync; nsys memcpy/sync counts must NOT scale with steps×stages), the kernel's bound via ncu, before/after ms reported as a number, correctness regression still green.
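The RULE ZERO part of the gate can be checked by running the same case at two step counts and comparing the nsys memcpy/sync totals. A minimal Rust sketch, assuming the counts have already been read off the nsys summary (`SyncSample` and `scales_with_steps` are hypothetical names; parsing nsys output is out of scope here):

```rust
/// Hypothetical sample: host<->device memcpy + sync API calls that nsys
/// reported for a run of `steps` time steps.
#[derive(Debug, Clone, Copy)]
pub struct SyncSample {
    pub steps: u64,
    pub sync_calls: u64,
}

/// True when the sync count grows as the step count grows, i.e. a RULE ZERO
/// violation. A fixed setup/teardown cost (same count in both runs) passes.
pub fn scales_with_steps(short: SyncSample, long: SyncSample) -> bool {
    assert!(long.steps > short.steps, "second run must take more steps");
    long.sync_calls > short.sync_calls
}
```

This is deliberately strict: any growth at all between the short and the long run is treated as a violation, since a fully resident loop should add no host-side work per step.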
v2 design rails (the direction — full detail in docs/gale-v2-refactor.md)
- Three residency levels. L0 CPU (host loop, oracle). L1 GPU host-orchestrated (device-resident state, host calls step() per step: a prototype; any per-step readback violates RULE ZERO). L2 GPU fully device-resident (whole step a captured graph, whole loop a while-graph, no host per-step work: the production bar for sweeps). GpuDualSplitting is L1 via StateIntegrator; GpuResidentVe/Ns is L2 and cannot be a per-step StateIntegrator.
- Sim-as-spec, Device-as-compiler. Sim is a declarative physics spec; Device::plan(spec) picks the backend + the highest residency level the spec supports. Add a ResidentIntegrator seam (run_resident(state, nsteps, io_every, writers)) for L2 alongside the StateIntegrator seam for L1.
- Dual-impl building blocks. Each Term/StageHook/Updater/Writer carries a CPU impl + optionally a device-resident (kernel) impl. Closure-only blocks cap a sim at L1; giving them kernel twins is what lets a sim reach L2.
- gpu cargo feature (default-on). Gates the optional gale-gpu dep + every Cuda arm. Without it: CPU-only, no cuda-oxide. With it: the user builds via cuda-oxide (like nvcc). No silent CPU fallback for Device::Cuda.
- The Device enum exists but is currently dormant (never read by Simulation::run); the real seam today is StateIntegrator, and GPU is reached by manually constructing GpuDualSplitting. The refactor makes Device the compiler that resolves this.
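The "highest residency level the spec supports" rule can be sketched in a few lines of Rust. Everything here is an assumption made for illustration: the real `Device::plan` signature is not given in this skill, and `Residency`, `Block`, and `SimSpec` are hypothetical stand-ins for the types described in docs/gale-v2-refactor.md.

```rust
/// Residency levels from the rails above, ordered from least to most
/// device-resident.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Residency {
    L0Cpu,
    L1HostOrchestrated,
    L2DeviceResident,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Device {
    Cpu,
    Cuda,
}

/// Hypothetical stand-in for a Term/StageHook/Updater/Writer.
pub struct Block {
    pub name: &'static str,
    /// Every block has a CPU impl; the kernel twin is optional.
    pub has_kernel_twin: bool,
}

/// Hypothetical declarative spec: just the list of building blocks.
pub struct SimSpec {
    pub blocks: Vec<Block>,
}

impl Device {
    /// Pick the highest residency level the spec supports on this device.
    pub fn plan(self, spec: &SimSpec) -> Residency {
        match self {
            Device::Cpu => Residency::L0Cpu,
            // No silent CPU fallback: Cuda never plans below L1.
            Device::Cuda => {
                if spec.blocks.iter().all(|b| b.has_kernel_twin) {
                    Residency::L2DeviceResident
                } else {
                    // A single closure-only block caps the sim at L1.
                    Residency::L1HostOrchestrated
                }
            }
        }
    }
}
```

The point of the sketch is the cap: one block without a kernel twin is enough to drop the whole sim from L2 to L1, which is why dual-impl building blocks are listed as a rail and not as an optimization.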
Order of operations (don't skip step 1)
1. Freeze the §4 baselines into benches/ first (the ms/step·ms/kernel·ms/iter triplet + the RULE-ZERO check). Refactoring without a regression baseline is how the perf edge silently dies.
2. Feature-flag the GPU.
3. Device::plan + integrator factory (L1, already exists).
4. The ResidentIntegrator seam (L2, where the ~1.5× lives).
5. Dual-impl the hooks/updaters.
6. examples/benches/tests reorg.
7. Autotuning hooks (swappable scalar type + launch configs).
Each step ends green on the regression triplet. A >~5% move without an explanation is a bug.
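The "green on the regression triplet" check can be expressed as a small comparison against the frozen baseline. This is a sketch under stated assumptions: `Triplet` and `moved` are hypothetical names, and the tolerance is passed in by the caller (0.05 for the ~5% threshold above).

```rust
/// The regression triplet named in step 1.
#[derive(Debug, Clone, Copy)]
pub struct Triplet {
    pub ms_per_step: f64,
    pub ms_per_kernel: f64,
    pub ms_per_iter: f64,
}

/// Relative change of `now` against `base` (0.05 == 5%).
fn relative_move(base: f64, now: f64) -> f64 {
    (now - base).abs() / base
}

/// Names of the metrics that moved by more than `tol`, in either direction.
/// An unexplained speed-up is flagged too: it may mean work was dropped.
pub fn moved(base: &Triplet, now: &Triplet, tol: f64) -> Vec<&'static str> {
    let checks = [
        ("ms/step", base.ms_per_step, now.ms_per_step),
        ("ms/kernel", base.ms_per_kernel, now.ms_per_kernel),
        ("ms/iter", base.ms_per_iter, now.ms_per_iter),
    ];
    checks
        .iter()
        .filter(|c| relative_move(c.1, c.2) > tol)
        .map(|c| c.0)
        .collect()
}
```

An empty result means the step is green; a non-empty one names exactly which number needs an explanation before the refactor continues.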
Magic-number levers worth autotuning (don't hand-tune per GPU)
block-size / elements-per-block (biggest knob, hardware-specific), mixed-precision toggle (FP32 vs FP64 smoother), coarse-iteration count + Krylov tolerance, the hp hierarchy balance, graph-vs-no-graph. Make the scalar type and launch configs swappable so an autotuner has something to turn. The algorithm (hp-MG + Chebyshev + restructured matvec + residency) is portable and good; new-hardware work is mostly autotuning + two measured investigations (FP64 tensor-core contractions on A100+, mixed-precision iterative refinement for FP64-weak parts).
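One way to give an autotuner "something to turn" is to make the scalar type a generic parameter and the launch knobs plain data. The Rust sketch below is an assumption about shape only: `SmootherScalar`, `LaunchConfig`, `candidates`, and `pick_best` are hypothetical names, and the `measure` closure stands in for a real wall-clock timing of the production path.

```rust
/// Hypothetical trait that makes the smoother scalar type swappable.
pub trait SmootherScalar: Copy {
    const NAME: &'static str;
}
impl SmootherScalar for f32 {
    const NAME: &'static str = "fp32";
}
impl SmootherScalar for f64 {
    const NAME: &'static str = "fp64";
}

/// Bytes needed to store one smoother field of `n` values in scalar `T`.
pub fn field_bytes<T: SmootherScalar>(n: usize) -> usize {
    n * std::mem::size_of::<T>()
}

/// Hypothetical launch configuration: the two hardware-specific knobs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct LaunchConfig {
    pub block_size: u32,
    pub elems_per_block: u32,
}

/// Cartesian product of the candidate knob values.
pub fn candidates(block_sizes: &[u32], elems_per_block: &[u32]) -> Vec<LaunchConfig> {
    let mut out = Vec::new();
    for &block_size in block_sizes {
        for &elems in elems_per_block {
            out.push(LaunchConfig { block_size, elems_per_block: elems });
        }
    }
    out
}

/// Return the candidate with the lowest measured ms/step.
/// `measure` must actually run the kernel; NaN results are skipped.
pub fn pick_best<F>(cands: &[LaunchConfig], mut measure: F) -> Option<LaunchConfig>
where
    F: FnMut(&LaunchConfig) -> f64,
{
    let mut best: Option<(LaunchConfig, f64)> = None;
    for c in cands {
        let ms = measure(c);
        if ms.is_nan() {
            continue;
        }
        match best {
            Some((_, best_ms)) if best_ms <= ms => {}
            _ => best = Some((*c, ms)),
        }
    }
    best.map(|(c, _)| c)
}
```

`field_bytes` makes the mixed-precision lever visible: an FP32 smoother field takes half the bytes of an FP64 one, which is what matters for a bandwidth-bound smoother. The selection itself stays a measurement, in line with the "measure; don't project" rule.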