ai-tune-from-scratch

Tune typical shapes from scratch (baseline .co kernels) for >=150 iterations each. Covers shapes far from the reference that need full optimization. Non-stop sweep until all typical shapes are tuned. Use when the user asks to "tune from scratch", "full tune", or "ai-tune-from-scratch".

Install

mkdir -p .claude/skills/ai-tune-from-scratch && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/14567" && unzip -o skill.zip -d .claude/skills/ai-tune-from-scratch && rm skill.zip

Installs to .claude/skills/ai-tune-from-scratch


About this skill

AI-Tune from Scratch: Deep Tuning for Typical Shapes

Tune typical representative shapes from scratch using baseline .co kernels from choreo/benchmark/performance/gemm_sp/. Each shape gets at least 150 optimization iterations. This produces the convergence curves and deep optimization data the paper needs.

All work stays on the current branch. Each shape's artifacts are stored separately. No experiment branches.

CroqTuner Harness Integration

This skill is managed by the CroqTuner FSM harness. Before doing anything:

  1. Set the state file env var (MUST do this before ANY FSM script call):
    export CROQTUNER_STATE_FILE=".claude/skills/fsm-engine/state/loop-state_from_scratch.json"
    
  2. Read the CroqTuner router: .claude/skills/fsm-engine/SKILL.md
  3. Read identity constraints: .claude/skills/fsm-engine/protocol/identity.md
  4. Check FSM state: read $CROQTUNER_STATE_FILE (if it exists, resume from there)

CRITICAL: from-scratch uses loop-state_from_scratch.json, NOT loop-state.json. This allows from-scratch and from-current-best to run concurrently in separate agent sessions without conflicting.

The CroqTuner harness provides:

  • FSM state tracking — you always know which step to execute next
  • Guard flag enforcement — pre/post validation scripts block skipped steps
  • Idea dedup log — append-only JSONL prevents repeated ideas
  • Compaction-safe resume — structured summary survives context loss
  • Completion promise — mechanical check prevents premature exit
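The idea dedup log can be pictured with a minimal sketch. The harness's real schema is not shown in this skill, so the `idea` and `iter` field names, the normalization rule, and the function names below are assumptions for illustration only.

```python
import json
import os


def normalize(idea: str) -> str:
    """Collapse case and whitespace so trivially reworded ideas compare equal."""
    return " ".join(idea.lower().split())


def load_seen(log_path: str) -> set:
    """Read every idea already recorded in the append-only JSONL log."""
    seen = set()
    if os.path.exists(log_path):
        with open(log_path, encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if line:
                    # Hypothetical schema: one JSON object per line with an "idea" field.
                    seen.add(normalize(json.loads(line)["idea"]))
    return seen


def record_idea(log_path: str, idea: str, iteration: int) -> bool:
    """Append the idea if it is new; return False when it is a repeat."""
    if normalize(idea) in load_seen(log_path):
        return False
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps({"iter": iteration, "idea": idea}) + "\n")
    return True
```

Because the file is only ever appended to, a resumed session can rebuild the set of tried ideas from the log alone.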

For each iteration, follow the CroqTuner step sequence:

pre-step-check.sh → execute step → update guard flags → post-step-check.sh → state-transition.sh

(All scripts automatically use $CROQTUNER_STATE_FILE when set.)

Initialize the FSM for this mode:

export CROQTUNER_STATE_FILE=".claude/skills/fsm-engine/state/loop-state_from_scratch.json"
bash .claude/skills/fsm-engine/scripts/state-transition.sh INIT \
    shape_key=<KEY> dtype=<DTYPE> mode=from_scratch max_iteration=150 shape='[M,N,K]'
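
When many shapes are initialized in sequence, the INIT call above could be assembled programmatically. The sketch below only builds the argument list and environment and executes nothing. It assumes that `shape_key` carries the `_fs` suffix used elsewhere in this skill; the helper names are hypothetical.

```python
import shlex

STATE_FILE = ".claude/skills/fsm-engine/state/loop-state_from_scratch.json"
TRANSITION = ".claude/skills/fsm-engine/scripts/state-transition.sh"


def init_command(dtype: str, m: int, n: int, k: int, max_iteration: int = 150) -> list:
    """Build the argv for the FSM INIT transition of one from-scratch shape."""
    key = f"{dtype}_{m}x{n}x{k}_fs"  # assumption: shape_key includes the _fs suffix
    return [
        "bash", TRANSITION, "INIT",
        f"shape_key={key}",
        f"dtype={dtype}",
        "mode=from_scratch",
        f"max_iteration={max_iteration}",
        f"shape=[{m},{n},{k}]",
    ]


def init_env(base_env: dict) -> dict:
    """Return a copy of the environment with the from-scratch state file set."""
    env = dict(base_env)
    env["CROQTUNER_STATE_FILE"] = STATE_FILE
    return env


if __name__ == "__main__":
    # Print a copy-pasteable command line; nothing is executed here.
    print(shlex.join(init_command("f16", 768, 768, 768)))
```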

Overview

The reference kernels (f16 iter143, e4m3 iter068) were deeply tuned for one shape (4096,8192,8192). Shapes far from this reference — small squares, very large squares, extreme aspect ratios — may need fundamentally different tile configurations, pipeline depths, or warp specialization topologies. These shapes start from baseline .co kernels and undergo full 150+ iteration tuning.

The agent IS AWARE of the reference best kernel's optimizations (hoisted metadata, split TMA, 3-stage pipeline, etc.) and may try to port those ideas to the new shape, but must start from the baseline .co and discover the right combination for each shape independently.

Typical Shapes (from manifest.json)

These are representative shapes far from the reference (4096,8192,8192) across ALL 7 scenarios that need from-scratch tuning. The full list is in manifest.json under typical_shapes_for_scratch.

| #  | Shape (M,N,K)     | Source scenario     | Position | Why it needs from-scratch                 |
|----|-------------------|---------------------|----------|-------------------------------------------|
| 1  | 768×768×768       | square              | ~1/4     | Small, non-power-of-2 transition          |
| 2  | 12288×12288×12288 | square              | ~3/4     | Very large, extreme DRAM pressure         |
| 3  | 512×8192×8192     | sweep_m             | ~1/4     | Small M, wave quantization (0% from-best) |
| 4  | 16384×8192×8192   | sweep_m             | ~3/4     | Large M, high CTA count                   |
| 5  | 4096×1024×8192    | sweep_n             | ~1/4     | Small N                                   |
| 6  | 4096×32768×8192   | sweep_n             | ~3/4     | Extreme N, huge output tile count         |
| 7  | 4096×8192×512     | sweep_k             | ~1/4     | Small K, few pipeline stages              |
| 8  | 4096×8192×32768   | sweep_k             | ~3/4     | Deep K loop, different pipeline depth     |
| 9  | 1024×1024×4096    | sweep_mn (fixK4096) | ~1/4     | Small MN, medium K                        |
| 10 | 12288×12288×4096  | sweep_mn (fixK4096) | ~3/4     | Large MN, medium K                        |
| 11 | 1024×8192×1024    | sweep_mk (fixN8192) | ~1/4     | Small MK, large N                         |
| 12 | 12288×8192×12288  | sweep_mk (fixN8192) | ~3/4     | Large MK, large N                         |
| 13 | 4096×1024×1024    | sweep_nk (fixM4096) | ~1/4     | Small NK                                  |
| 14 | 12288×4096×4096   | sweep_nk (fixNK4096)| ~3/4     | Large M, medium NK                        |

These 14 shapes cover all 7 scenarios with 2 shapes per scenario, one at the ~1/4 and one at the ~3/4 position of each sweep range (all in the far region). The agent should also check tuning/state.json for any shapes that ai-tune-from-current-best marked as needing from-scratch tuning (shapes where adaptation failed or hit a local minimum early).
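
The "wave quantization" noted for shape 3 can be made concrete with a small calculation. This is a rough sketch under stated assumptions: one CTA per output tile, one CTA resident per SM per wave, and a tile size and SM count supplied by the caller (the target GPU is not named in this skill, so no hardware values are implied).

```python
import math


def wave_efficiency(m: int, n: int, tile_m: int, tile_n: int, num_sms: int) -> float:
    """Fraction of SM slots doing useful work across all waves.

    Assumes one CTA per (tile_m x tile_n) output tile and one CTA per SM per wave.
    """
    ctas = math.ceil(m / tile_m) * math.ceil(n / tile_n)
    waves = math.ceil(ctas / num_sms)
    return ctas / (waves * num_sms)


if __name__ == "__main__":
    # Hypothetical inputs: a 64x256 tile and a 132-SM device.
    print(wave_efficiency(512, 8192, tile_m=64, tile_n=256, num_sms=132))
```

A value well below 1.0 means the last wave leaves many SMs idle, which is one reason a tile configuration tuned for the reference shape may not transfer to a small-M shape.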

Pre-flight

  1. Read CroqTuner harness: .claude/skills/fsm-engine/SKILL.md — the FSM router.
  2. Read manifest: kernels/manifest.json — baseline kernels, build config, typical shapes.
  3. Read tuning state: tuning/state.json — which shapes are already done.
  4. Read choreo program: /home/albert/workspace/choreo/.claude/program.md — full loop protocol. ALL rules apply without relaxation.
  5. Read choreo-syntax skill: MUST read before editing any .co file.
  6. Read reference kernel READMEs: Read the optimization history from README_gemm_sp_f16_aitune_2026-03-25.md and README_e4m3_aitune_2026-03-21.md in choreo/benchmark/performance/gemm_sp/ to understand what optimizations worked at the reference shape.

Per-Shape Storage Layout

IMPORTANT: From-scratch uses the _fs suffix to isolate artifacts from from-current-best. The key for from-scratch is <dtype>_<M>x<N>x<K>_fs (e.g. f16_768x768x768_fs). This ensures from-scratch and from-current-best results coexist cleanly for paper comparison.
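
The key convention above can be sketched as a pair of helpers. The function names and the regular expression are illustrative, not part of the harness; the sketch assumes keys without the `_fs` suffix belong to from-current-best.

```python
import re

# <dtype>_<M>x<N>x<K> with an optional _fs suffix, e.g. f16_768x768x768_fs
KEY_RE = re.compile(r"^(?P<dtype>[a-z0-9]+)_(?P<m>\d+)x(?P<n>\d+)x(?P<k>\d+)(?P<fs>_fs)?$")


def make_key(dtype: str, m: int, n: int, k: int, from_scratch: bool = True) -> str:
    """Build a shape key; from-scratch keys carry the _fs suffix."""
    suffix = "_fs" if from_scratch else ""
    return f"{dtype}_{m}x{n}x{k}{suffix}"


def parse_key(key: str):
    """Return (dtype, (M, N, K), is_from_scratch); raise ValueError on bad keys."""
    match = KEY_RE.match(key)
    if match is None:
        raise ValueError(f"not a shape key: {key!r}")
    shape = (int(match["m"]), int(match["n"]), int(match["k"]))
    return match["dtype"], shape, match["fs"] is not None
```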

tuning/
├── state.json
├── logs/<dtype>_<M>x<N>x<K>_fs/results.tsv
├── srcs/<dtype>_<M>x<N>x<K>_fs/
│   ├── baseline.co                         ← copy of baseline .co (starting point)
│   ├── iter001_<tag>.co (or .cu)
│   └── ...
├── perf/<dtype>_<M>x<N>x<K>_fs/
│   ├── timing_iter000_baseline.txt
│   ├── ncu_iter001.txt (or .ncu-rep)
│   └── ...
└── checkpoints/<dtype>_<M>x<N>x<K>_fs.json
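
For scripts that need these locations, the tree above maps to a small path helper. This is a sketch; the function name and dictionary labels are hypothetical, and only the directory and file names come from the layout itself.

```python
from pathlib import PurePosixPath


def shape_paths(key: str, root: str = "tuning") -> dict:
    """Map a shape key (e.g. f16_768x768x768_fs) to its artifact locations."""
    base = PurePosixPath(root)
    return {
        "state": base / "state.json",
        "results": base / "logs" / key / "results.tsv",
        "srcs": base / "srcs" / key,
        "baseline": base / "srcs" / key / "baseline.co",
        "perf": base / "perf" / key,
        "checkpoint": base / "checkpoints" / f"{key}.json",
    }
```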

Baseline Selection

Each dtype has ONE canonical baseline .co file — the simplest correct kernel with no warp-specialization, no prepack, no pipeline stages:

  • f16: gemm_sp_f16.co (WARP_M=64, WARP_N=256, TILE_K=64, WARP_K=32, swiz64/128)
  • e4m3: gemm_sp_e4m3.co (WARP_M=64, WARP_N=256, TILE_K=64, WARP_K=64, swiz32/64)

These are the canonical starting points from choreo/benchmark/performance/gemm_sp/. The from-scratch skill starts here and discovers ALL optimizations (warp-spec, prepack, pipeline stages, TMA metadata, etc.) independently through profiling.

Copy the baseline to the shape's src directory (note the _fs suffix on KEY):

CHOREO_GEMM_SP="/home/albert/workspace/choreo/benchmark/performance/gemm_sp"
KEY="${DTYPE}_${M}x${N}x${K}_fs"
cp "$CHOREO_GEMM_SP/gemm_sp_${DTYPE}.co" "tuning/srcs/$KEY/baseline.co"

Before compiling, edit the #define MATMUL_DEFAULT_M/N/K in the copied baseline to match the target shape, OR pass them via -D flags to choreo.
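
The in-file edit could be automated along these lines. This sketch assumes each define appears exactly once on its own line in the form `#define MATMUL_DEFAULT_M <value>`; the actual layout of the baseline .co files is not shown here, so verify against the file before relying on it.

```python
import re


def retarget_defines(source: str, m: int, n: int, k: int) -> str:
    """Rewrite the MATMUL_DEFAULT_{M,N,K} values; fail if any define is missing."""
    for name, value in (("M", m), ("N", n), ("K", k)):
        # Capture everything up to the value so spacing is preserved.
        pattern = re.compile(
            rf"^([ \t]*#define[ \t]+MATMUL_DEFAULT_{name}[ \t]+)\S+", re.MULTILINE
        )
        source, count = pattern.subn(rf"\g<1>{value}", source)
        if count != 1:
            raise ValueError(
                f"expected exactly one MATMUL_DEFAULT_{name} define, found {count}"
            )
    return source
```

Raising on a missing or duplicated define avoids silently compiling the baseline for the wrong shape.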

Sweep Loop (NON-STOP)

for dtype in requested_dtypes:
    # Primary: shapes from manifest.typical_shapes_for_scratch[dtype] (14 per dtype)
    typical_shapes = manifest.typical_shapes_for_scratch[dtype]

    # Secondary: ALL shapes in state.json with region="far" for this dtype
    for key, info in state.json.shapes:
        if key.startswith(dtype) and info.region == "far":
            shape = parse_shape_from_key(key)  # e.g. "f16_256x256x256" -> (256,256,256)
            if shape not in typical_shapes:
                typical_shapes.append(shape)

    # Also add any shapes of this dtype flagged as "needs_scratch" by from-current-best
    for key, info in state.json.shapes:
        if key.startswith(dtype) and info.status == "needs_scratch" and info.shape not in typical_shapes:
            typical_shapes.append(info.shape)

    for shape in typical_shapes:
        key = f"{dtype}_{shape.M}x{shape.N}x{shape.K}_fs"
        entry = state.json.shapes.get(key)  # None if this shape was never started

        # Skip ONLY if done AND current_iter >= 150
        if entry and entry.mode == "from_scratch" and entry.status == "done" and entry.current_iter >= 150:
            continue

        # Resume if partially done. IMPORTANT: status == "done" with
        # current_iter < 150 means the shape was closed prematurely.
        # REOPEN it and continue from current_iter.
        if entry and entry.mode == "from_scratch" and entry.current_iter > 0:
            resume_from_checkpoint(key)
        else:
            # Otherwise: start fresh from the baseline
            run_scratch_loop(dtype, shape, max_iter=150)
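
The skip, resume, and start-fresh rules in the loop can be isolated into one small function. This is a sketch: it assumes each state entry is a dictionary with the `status`, `mode`, and `current_iter` fields used in the pseudocode, and the function name is hypothetical.

```python
MAX_ITER = 150


def decide(entry, max_iter: int = MAX_ITER) -> str:
    """Return 'skip', 'resume', or 'start' for one shape's state entry (or None)."""
    if not entry or entry.get("mode") != "from_scratch":
        return "start"
    current = entry.get("current_iter", 0)
    if entry.get("status") == "done" and current >= max_iter:
        return "skip"
    if current > 0:
        # Also covers entries marked done before reaching max_iter: reopen them.
        return "resume"
    return "start"
```

Checking the iteration count before trusting `status` is what keeps a prematurely closed shape from being skipped.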

DO NOT STOP between shapes. After finishing one shape, immediately proceed to the next. The primary sweep covers 14 typical shapes per dtype (28 total for f16+e4m3), then continues to remaining far shapes.

ANTI-EARLY-STOP (CRITICAL, READ THIS): There are ~260 shapes still pending. Completing 1, 2, 5, or even 20 shapes is NOT a reason to stop. You MUST keep tuning until the session physically drops or all shapes are done. Do NOT:

  • Output a "progress report" and wait for user input
  • Say "I've completed N shapes, shall I continue?"
  • Summarize results and stop after a "natural break point"
  • Treat the end of any single shape as a stopping opportunity

The user has explicitly requested non-stop operation. The ONLY valid stop conditions are:

  1. All shapes have status: "done" in tuning/state.json
  2. Session/connection drops (crash-safe resume handles this)

Note: If a systemic GPU failure occurs that cannot be remediated, prompt the user interactively to choose a solution: (a) force-kill the process (use sudo; ask the user for the password); (b) wait for the GPU to go idle (ask the user how long to wait); (c) stop tuning.

Per-Shape Tuning Loop

Step 0: Setup

Similar to ai-tune-from-current-best Step 0, but uses _fs suffix and copies baseline .co instead of reference .cu.

DTYPE=<f16|e4m3>
M=<M>  N=<N>  K=<K>
KEY="${DTYPE}_${M}x${N}x${K}_fs"
SHAPE_DIR_LOGS="tuning/logs/$KEY"
SHAPE_DIR_SRCS="tuning/srcs/$KEY"
SHAPE_DIR_PERF="tuning/perf/$KEY"

mkdir -p "$SHAPE_DIR_LOGS" "$SHAPE_DIR_SRCS" "$SHAPE_DIR_PERF"

Compile the baseline for target shape:

CH

---

*Content truncated.*
