rebuild-leaderboard

Name: rebuild-leaderboard
Author: surus-lat

Re-run all models on the LatamBoard leaderboard from scratch after data loss. Identifies which model configs exist, runs each one with the full latam_board task suite on the cluster, and publishes results to HuggingFace after each model so progress is never lost.

Install

mkdir -p .claude/skills/rebuild-leaderboard && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/13796" && unzip -o skill.zip -d .claude/skills/rebuild-leaderboard && rm skill.zip

Installs to .claude/skills/rebuild-leaderboard

Activation

This is the description your AI agent reads to decide when to run this skill — the better it matches your request, the more reliably it fires.

Re-run all models on the LatamBoard leaderboard from scratch after data loss. Identifies which model configs exist, runs each one with the full latam_board task suite on the cluster, and publishes results to HuggingFace after each model so progress is never lost.

263 charsno explicit “when” triggerlonger than Claude Code's old 250-char listing cap (fine on current versions)

About this skill

Rebuild Leaderboard from Scratch

Use when raw benchmark outputs are gone (e.g., cluster wipe) and you need to re-evaluate all models that appear on latamboard.surus.lat.

Cluster: ssh cluster.surus.ddns.net — 4× RTX 3090 (24 GB each)

How this works

The HuggingFace dataset (mauroibz/leaderboard-results) still has the old processed scores. You're not replacing them — you're re-running each model, getting fresh scores, and merging them back with merge_and_publish. Publish after every model, not at the end. If a run crashes, you keep everything already published.

Phase 1 — Prep (local or cluster)

1a. Confirm HF_TOKEN is set

grep HF_TOKEN .env || echo "MISSING — add HF_TOKEN=hf_... to .env"

1b. Check the dataset name in config

grep "results:" configs/config.yaml
# Should be:  results: "mauroibz/leaderboard-results"
# If it says  results: "LatamBoard/leaderboard-results"  → fix it first

1c. See what's currently on the leaderboard

python3 -c "
import json
with open('outputs/publish/summaries/all_model_summaries.json') as f:
    data = json.load(f)
print(f'{len(data)} models on HF:')
for k in sorted(data): print(' ', k)
" 2>/dev/null || echo "(no local summaries — that's fine, they're on HF)"

Phase 2 — Models with existing configs (run these first)

These models have a config in configs/models/ and can be run immediately. All use --tasks latam_board regardless of what the config file says, to ensure consistent coverage (spanish + portuguese + translation + structured_extraction).

The run loop

Run one model at a time — each takes a full GPU card and finishes in 30–90 minutes depending on model size.

# Template — repeat for each model below
benchy eval --config configs/models/<CONFIG>.yaml --tasks latam_board
# note the run_id printed at the start, then:
python -m src.leaderboard.merge_and_publish --run-id <RUN_ID>

Model list

Config file	Model	GPU config
`zephyr-7b-beta.yaml`	HuggingFaceH4/zephyr-7b-beta	single card
`llama3.1.yaml`	meta-llama/Llama-3.1-8B-Instruct	single card
`llama3.2.yaml`	meta-llama/Llama-3.2-3B-Instruct	single card
`ministral8b.yaml`	mistralai/Ministral-8B-Instruct-2410	single card
`DeepSeek-R1-Distill-Qwen-7B.yaml`	deepseek-ai/DeepSeek-R1-Distill-Qwen-7B	single card
`DeepSeek-R1-Distill-Llama-8B.yaml`	deepseek-ai/DeepSeek-R1-Distill-Llama-8B	single card
`Yi-1.5-6B-Chat.yaml`	01-ai/Yi-1.5-6B-Chat	single card
`Yi-1.5-9B-Chat.yaml`	01-ai/Yi-1.5-9B-Chat	single card
`Hermes-3-Llama-3.1-8B.yaml`	NousResearch/Hermes-3-Llama-3.1-8B	single card
`gemma3n2.yaml`	google/gemma-3n-E2B-it	single card
`gemma3n4.yaml`	google/gemma-3n-E4B-it	single card
`hormoz8b.yaml`	Hormoz-8B	single card
`phi4mini.yaml`	microsoft/Phi-4-mini-instruct	single card
`aya8b.yaml`	CohereLabs/aya-expanse-8b	single card
`qwen34b.yaml`	Qwen/Qwen3-4B-Instruct	single card

With 4× 3090s you can run up to 4 models in parallel by pinning each to a different card:

# Parallel example — pin each run to a specific GPU
CUDA_VISIBLE_DEVICES=0 benchy eval --config configs/models/zephyr-7b-beta.yaml --tasks latam_board &
CUDA_VISIBLE_DEVICES=1 benchy eval --config configs/models/llama3.1.yaml --tasks latam_board &
CUDA_VISIBLE_DEVICES=2 benchy eval --config configs/models/ministral8b.yaml --tasks latam_board &
CUDA_VISIBLE_DEVICES=3 benchy eval --config configs/models/DeepSeek-R1-Distill-Qwen-7B.yaml --tasks latam_board &
wait
# Then publish all four:
python -m src.leaderboard.merge_and_publish --run-id <ID1>
python -m src.leaderboard.merge_and_publish --run-id <ID2>
python -m src.leaderboard.merge_and_publish --run-id <ID3>
python -m src.leaderboard.merge_and_publish --run-id <ID4>

Note: Two-card configs (vllm_two_card) need adjacent GPUs. Check the config's vllm.provider_config before parallelising — don't put two two-card models on the same pair.

Phase 3 — Models without configs (need new configs)

These models are on the leaderboard but have no matching config in configs/models/. Create a config for each using the configure-model skill, then run the same way as Phase 2.

Model on leaderboard	Likely HF path	Notes
Qwen3-4B-Instruct-2507	Qwen/Qwen3-4B-Instruct-2507	newer Qwen3 variant

To create a missing config:

# Minimal config template — save as configs/models/<name>.yaml
cat > configs/models/MyModel.yaml << 'EOF'
model:
  name: "org/ModelName"   # exact HuggingFace repo path

vllm:
  provider_config: "vllm_single_card"
  overrides: {}

tasks:
  - "latam_board"
EOF

Then validate before running:

benchy validate --config configs/models/MyModel.yaml

Phase 4 — Verify

After all models are published:

# Count models now in the dataset
python3 -c "
from huggingface_hub import hf_hub_download
import json
p = hf_hub_download('mauroibz/leaderboard-results', 'leaderboard_table.json', repo_type='dataset')
data = json.load(open(p))
print(f'{len(data)} models on HF:')
for row in data: print(' ', row.get('full_model_name', row.get('model_name')))
"

Open latamboard.surus.lat and hard-refresh — all models should appear with scores. The leaderboard fetches live from HF so no redeploy is needed.

Tracking progress

The fastest way to see where you are mid-rebuild:

# Which runs have completed?
ls outputs/benchmark_outputs/ | sort

# Which are already published to HF?
ls outputs/publish/summaries/*_summary.json 2>/dev/null | wc -l

If a run fails

Don't stop. Skip the failed model, keep going, come back to it:

# Check what went wrong
cat outputs/benchmark_outputs/<run_id>/<model>/run_outcome.json | python3 -m json.tool | grep -E '"status"|"reason"'

# Common fixes:
# - OOM: switch to two-card config or reduce batch size
# - API error: check HF_TOKEN, model access permissions
# - Task not found: model config may need --tasks override

Failed models won't overwrite good scores — merge_and_publish only adds/ updates, never deletes existing HF entries.

More by surus-lat

View all by surus-lat →

evaluate

surus-lat

Run benchy evaluations against models or systems. Covers the canonical smoke→full workflow, config selection, task filtering, exit policies, and reading run_outcome.json. Use when asked to evaluate, benchmark, or run benchy against a model or system config.

synthesize-data

surus-lat

Generate synthetic benchmark examples from a task spec. Invoked from setup-data when there is no data, or directly when the user asks to generate examples. Produces a JSONL file in .data/<benchmark-name>/.

Install

mkdir -p .claude/skills/rebuild-leaderboard && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/13796" && unzip -o skill.zip -d .claude/skills/rebuild-leaderboard && rm skill.zip

Installs to .claude/skills/rebuild-leaderboard

Safety

Review before install

Runs shell / code
Reads credentials

Automated static scan of the SKILL.md and repo. A flag describes what the skill can do — not a verdict. Always review code before installing.

Source & maintenance

Updated

11d ago

Repo stars

Loads

~1,560 tokens

Stars are for the whole repository, not this skill alone.

Stats

Views

Installs

Author

surus-lat

3 skills published

Links

Source code

rebuild-leaderboard

Install

Activation

About this skill

Rebuild Leaderboard from Scratch

How this works

Phase 1 — Prep (local or cluster)

Phase 2 — Models with existing configs (run these first)

The run loop

Model list

Phase 3 — Models without configs (need new configs)

Phase 4 — Verify

Tracking progress

If a run fails

More by surus-lat

evaluate

synthesize-data

Search skills