
rebuild-leaderboard

Re-run all models on the LatamBoard leaderboard from scratch after data loss. Identifies which model configs exist, runs each one with the full latam_board task suite on the cluster, and publishes results to HuggingFace after each model so progress is never lost.

Install

mkdir -p .claude/skills/rebuild-leaderboard && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/13796" && unzip -o skill.zip -d .claude/skills/rebuild-leaderboard && rm skill.zip

Installs to .claude/skills/rebuild-leaderboard

About this skill

Rebuild Leaderboard from Scratch

Use when raw benchmark outputs are gone (e.g., cluster wipe) and you need to re-evaluate all models that appear on latamboard.surus.lat.

Cluster: ssh cluster.surus.ddns.net — 4× RTX 3090 (24 GB each)


How this works

The HuggingFace dataset (mauroibz/leaderboard-results) still has the old processed scores. You are not wiping them: you re-run each model, get fresh scores, and merge those into the dataset with merge_and_publish. Publish after every model, not at the end, so that if a run crashes you keep everything already published.
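
The publish-after-every-model rule can be sketched as a small driver loop. This is a minimal sketch, not part of the repository: rebuild, run_eval, and publish are hypothetical names, and the two callables stand in for benchy eval and merge_and_publish. How the run ID is obtained is left to the caller, since the only thing known here is that benchy eval prints it at the start of a run.

```python
from typing import Callable, Iterable, List, Tuple


def rebuild(
    configs: Iterable[str],
    run_eval: Callable[[str], str],   # hypothetical: runs one model, returns its run_id
    publish: Callable[[str], None],   # hypothetical: wraps merge_and_publish for one run_id
) -> Tuple[List[str], List[str]]:
    """Evaluate and publish one model at a time; skip failures and keep going."""
    published: List[str] = []
    failed: List[str] = []
    for config in configs:
        try:
            run_id = run_eval(config)
            publish(run_id)           # publish immediately so a later crash loses nothing
            published.append(config)
        except Exception as exc:      # broad on purpose: one bad model must not stop the rest
            print(f"skipping {config}: {exc}")
            failed.append(config)
    return published, failed
```

The returned failed list is the set of models to come back to once the rest of the rebuild is done.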


Phase 1 — Prep (local or cluster)

1a. Confirm HF_TOKEN is set

# -q keeps the token value out of the terminal and shell history
grep -q 'HF_TOKEN=' .env && echo "HF_TOKEN is set" || echo "MISSING: add HF_TOKEN=hf_... to .env"

1b. Check the dataset name in config

grep "results:" configs/config.yaml
# Should be:  results: "mauroibz/leaderboard-results"
# If it says  results: "LatamBoard/leaderboard-results"  → fix it first

1c. See what's currently on the leaderboard

python3 -c "
import json
with open('outputs/publish/summaries/all_model_summaries.json') as f:
    data = json.load(f)
print(f'{len(data)} models on HF:')
for k in sorted(data): print(' ', k)
" 2>/dev/null || echo "(no local summaries — that's fine, they're on HF)"

Phase 2 — Models with existing configs (run these first)

These models have a config in configs/models/ and can be run immediately. All use --tasks latam_board regardless of what the config file says, to ensure consistent coverage (spanish + portuguese + translation + structured_extraction).

The run loop

Run one model at a time — each takes a full GPU card and finishes in 30–90 minutes depending on model size.

# Template — repeat for each model below
benchy eval --config configs/models/<CONFIG>.yaml --tasks latam_board
# note the run_id printed at the start, then:
python -m src.leaderboard.merge_and_publish --run-id <RUN_ID>

Model list

Config file | Model | GPU config
zephyr-7b-beta.yaml | HuggingFaceH4/zephyr-7b-beta | single card
llama3.1.yaml | meta-llama/Llama-3.1-8B-Instruct | single card
llama3.2.yaml | meta-llama/Llama-3.2-3B-Instruct | single card
ministral8b.yaml | mistralai/Ministral-8B-Instruct-2410 | single card
DeepSeek-R1-Distill-Qwen-7B.yaml | deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | single card
DeepSeek-R1-Distill-Llama-8B.yaml | deepseek-ai/DeepSeek-R1-Distill-Llama-8B | single card
Yi-1.5-6B-Chat.yaml | 01-ai/Yi-1.5-6B-Chat | single card
Yi-1.5-9B-Chat.yaml | 01-ai/Yi-1.5-9B-Chat | single card
Hermes-3-Llama-3.1-8B.yaml | NousResearch/Hermes-3-Llama-3.1-8B | single card
gemma3n2.yaml | google/gemma-3n-E2B-it | single card
gemma3n4.yaml | google/gemma-3n-E4B-it | single card
hormoz8b.yaml | Hormoz-8B | single card
phi4mini.yaml | microsoft/Phi-4-mini-instruct | single card
aya8b.yaml | CohereLabs/aya-expanse-8b | single card
qwen34b.yaml | Qwen/Qwen3-4B-Instruct | single card

With 4× 3090s you can run up to 4 models in parallel by pinning each to a different card:

# Parallel example — pin each run to a specific GPU
CUDA_VISIBLE_DEVICES=0 benchy eval --config configs/models/zephyr-7b-beta.yaml --tasks latam_board &
CUDA_VISIBLE_DEVICES=1 benchy eval --config configs/models/llama3.1.yaml --tasks latam_board &
CUDA_VISIBLE_DEVICES=2 benchy eval --config configs/models/ministral8b.yaml --tasks latam_board &
CUDA_VISIBLE_DEVICES=3 benchy eval --config configs/models/DeepSeek-R1-Distill-Qwen-7B.yaml --tasks latam_board &
wait
# Then publish all four:
python -m src.leaderboard.merge_and_publish --run-id <ID1>
python -m src.leaderboard.merge_and_publish --run-id <ID2>
python -m src.leaderboard.merge_and_publish --run-id <ID3>
python -m src.leaderboard.merge_and_publish --run-id <ID4>

Note: Two-card configs (vllm_two_card) need adjacent GPUs. Check the config's vllm.provider_config before parallelising — don't put two two-card models on the same pair.
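
With more models than cards, the pinning pattern above has to be repeated in batches. The following is a rough sketch of that grouping, assuming every config in the list is single-card (two-card configs would need separate handling). plan_batches is a hypothetical helper, not something the repository provides.

```python
from typing import List, Tuple


def plan_batches(configs: List[str], n_gpus: int = 4) -> List[List[Tuple[int, str]]]:
    """Group configs into batches of at most n_gpus, pairing each with a GPU index."""
    batches = []
    for start in range(0, len(configs), n_gpus):
        chunk = configs[start:start + n_gpus]
        batches.append(list(enumerate(chunk)))   # (gpu_index, config) pairs
    return batches


if __name__ == "__main__":
    # Assumption: all of these are single-card configs.
    single_card = ["zephyr-7b-beta.yaml", "llama3.1.yaml", "llama3.2.yaml",
                   "ministral8b.yaml", "DeepSeek-R1-Distill-Qwen-7B.yaml"]
    for batch in plan_batches(single_card):
        for gpu, config in batch:
            print(f"CUDA_VISIBLE_DEVICES={gpu} benchy eval "
                  f"--config configs/models/{config} --tasks latam_board &")
        print("wait")
```

The script only prints the commands, so the plan can be reviewed before anything runs. Publishing after each batch, rather than after all batches, keeps the loss from a crash to at most one batch.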


Phase 3 — Models without configs (need new configs)

These models are on the leaderboard but have no matching config in configs/models/. Create a config for each using the configure-model skill, then run the same way as Phase 2.

Model on leaderboard | Likely HF path | Notes
Qwen3-4B-Instruct-2507 | Qwen/Qwen3-4B-Instruct-2507 | newer Qwen3 variant

To create a missing config:

# Minimal config template — save as configs/models/<name>.yaml
cat > configs/models/MyModel.yaml << 'EOF'
model:
  name: "org/ModelName"   # exact HuggingFace repo path

vllm:
  provider_config: "vllm_single_card"
  overrides: {}

tasks:
  - "latam_board"
EOF

Then validate before running:

benchy validate --config configs/models/MyModel.yaml
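
If several configs are missing, the template above can be generated instead of typed by hand. This is a minimal sketch that reproduces the same fields and nothing more. render_model_config is a hypothetical helper, and the HF path passed in still needs to be confirmed, since the table only lists it as likely.

```python
def render_model_config(hf_path: str, provider_config: str = "vllm_single_card") -> str:
    """Return the minimal model config shown above as YAML text."""
    return (
        "model:\n"
        f'  name: "{hf_path}"   # exact HuggingFace repo path\n'
        "\n"
        "vllm:\n"
        f'  provider_config: "{provider_config}"\n'
        "  overrides: {}\n"
        "\n"
        "tasks:\n"
        '  - "latam_board"\n'
    )


if __name__ == "__main__":
    # Prints the config; redirect it to configs/models/<name>.yaml, then run benchy validate.
    print(render_model_config("Qwen/Qwen3-4B-Instruct-2507"), end="")
```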

Phase 4 — Verify

After all models are published:

# Count models now in the dataset
python3 -c "
from huggingface_hub import hf_hub_download
import json
p = hf_hub_download('mauroibz/leaderboard-results', 'leaderboard_table.json', repo_type='dataset')
data = json.load(open(p))
print(f'{len(data)} models on HF:')
for row in data: print(' ', row.get('full_model_name', row.get('model_name')))
"

Open latamboard.surus.lat and hard-refresh — all models should appear with scores. The leaderboard fetches live from HF so no redeploy is needed.
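
Counting rows shows how many models are present, not which ones are missing. A small comparison against the expected list closes that gap. This sketch reuses the full_model_name / model_name lookup from the check above. missing_models is a hypothetical helper, and it assumes the names in leaderboard_table.json match the HF paths exactly, which may not hold for every entry.

```python
from typing import Dict, List


def missing_models(expected: List[str], rows: List[Dict]) -> List[str]:
    """Return the expected model names that do not appear in the leaderboard rows."""
    present = {row.get("full_model_name", row.get("model_name")) for row in rows}
    return [name for name in expected if name not in present]
```

Pass it the list of HF paths you re-ran and the data list loaded from leaderboard_table.json; an empty result means every expected model was found.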


Tracking progress

The fastest way to see where you are mid-rebuild:

# Which runs have completed?
ls outputs/benchmark_outputs/ | sort

# Which are already published to HF?
ls outputs/publish/summaries/*_summary.json 2>/dev/null | wc -l

If a run fails

Don't stop. Skip the failed model, keep going, come back to it:

# Check what went wrong
python3 -m json.tool outputs/benchmark_outputs/<run_id>/<model>/run_outcome.json | grep -E '"status"|"reason"'

# Common fixes:
# - OOM: switch to two-card config or reduce batch size
# - API error: check HF_TOKEN, model access permissions
# - Task not found: model config may need --tasks override

Failed models won't overwrite good scores: merge_and_publish only adds or updates entries and never deletes existing ones on HF.
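
When several runs have finished, checking each run_outcome.json by hand gets tedious. The sketch below gathers them in one pass. It assumes the outputs/benchmark_outputs/<run_id>/<model>/run_outcome.json layout used above and that status and reason are top-level keys in that file, which the grep suggests but does not confirm. summarize_outcomes is a hypothetical helper.

```python
import json
from pathlib import Path
from typing import List, Tuple


def summarize_outcomes(root: str = "outputs/benchmark_outputs") -> List[Tuple[str, str, str]]:
    """Collect (run directory, status, reason) from every run_outcome.json under root."""
    results = []
    for path in sorted(Path(root).glob("*/*/run_outcome.json")):
        outcome = json.loads(path.read_text())
        # Assumption: "status" and "reason" are top-level keys; missing keys show as "None".
        results.append((str(path.parent), str(outcome.get("status")), str(outcome.get("reason"))))
    return results


if __name__ == "__main__":
    for run_dir, status, reason in summarize_outcomes():
        print(f"{status:<12} {run_dir}  {reason}")
```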
