agent-job-info
Summarize DeepResearch experiment job status across scheduler/process
Install
mkdir -p .claude/skills/agent-job-info && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/13476" && unzip -o skill.zip -d .claude/skills/agent-job-info && rm skill.zip
Installs to .claude/skills/agent-job-info
Activation
This is the description your AI agent reads to decide when to run this skill — the better it matches your request, the more reliably it fires.
Summarize DeepResearch experiment job status across scheduler/process
About this skill
Agent Job Info
Use this skill to answer "what is happening with my experiment jobs?" It is a read-only status aggregation workflow. Do not launch jobs, patch scripts, change remote lifecycle state, or mark a run healthy from inferred evidence.
Inputs
Resolve what the user provided:
- sandbox: LUMI, Snellius, Brev, AutoDL, RunPod, local, or all configured backends when omitted.
- workspace: explicit path, current repo workspace, or workspace/<project>.
- experiment: experiment name, Slurm job name/id, W&B run/group, launcher, output directory, or latest active/recent run when omitted.
If the sandbox is omitted, inspect all locally configured DeepResearch backends independently. If workspace or experiment identity remains ambiguous after checking current context, ask for the missing value instead of guessing.
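The resolution rules above can be sketched as a small helper. This is only an illustration: `resolve_inputs`, `ResolvedInputs`, and the parameters `configured`, `context_workspace`, and `latest_run` are hypothetical names, and how the current context is actually discovered is left out.

```python
from dataclasses import dataclass, field
from typing import Optional

# Sandbox names taken from the Inputs list above.
KNOWN_SANDBOXES = ("LUMI", "Snellius", "Brev", "AutoDL", "RunPod", "local")


@dataclass
class ResolvedInputs:
    sandboxes: tuple                 # backends to inspect independently
    workspace: Optional[str]
    experiment: Optional[str]
    questions: list = field(default_factory=list)  # values to ask the user for


def resolve_inputs(sandbox=None, workspace=None, experiment=None, *,
                   configured=KNOWN_SANDBOXES, context_workspace=None,
                   latest_run=None):
    """Resolve user inputs; collect questions instead of guessing."""
    questions = []
    if sandbox is None:
        # Omitted sandbox: inspect every locally configured backend.
        sandboxes = tuple(configured)
    else:
        sandboxes = tuple(s for s in KNOWN_SANDBOXES if s.lower() == sandbox.lower())
        if not sandboxes:
            questions.append(f"Unknown sandbox {sandbox!r}; which backend did you mean?")
    workspace = workspace or context_workspace
    if workspace is None:
        questions.append("Which workspace should be inspected?")
    experiment = experiment or latest_run
    if experiment is None:
        questions.append("Which experiment, job, or W&B run should be inspected?")
    return ResolvedInputs(sandboxes, workspace, experiment, questions)
```

A non-empty `questions` list means the skill should ask the user before running any status check.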
Evidence Sources
Prefer concrete local and remote artifacts in this order:
- User-provided job IDs, full W&B URLs (https://wandb.ai/<entity>/<project>/runs/<run_id>), experiment names, or paths.
- Workspace run-state files, monitor state files, progress CSVs, timing JSON, and launch manifests.
- Workspace CSV ledger at <paths.ledger_csv> (read from the SUE live config scale_up_outputs/<exp_dir>/config/runtime.yaml), filtering to rows whose experiment_version_id matches runtime.yaml:experiment_version.current when the column is present. Use wandb-csv when the ledger needs interpretation or update-status analysis.
- W&B run state, latest step, latest metric history/summary, run config, tags, and run URL when credentials are available.
- Scheduler or process status from the selected sandbox: Slurm squeue/sacct for LUMI/Snellius, direct process/tmux checks for Brev/AutoDL, and RunPod API or controller logs for RunPod.
- Durable logs, stdout/stderr, progress CSV rows, checkpoint directories, metrics JSON/CSV, output artifacts, and report files.
Use sue-cluster-info (LUMI/Snellius quota and job limits) and the active backend's private config to resolve backend access and quota/status details. Do not
print secrets, private hostnames, account/allocation IDs, storage roots, tokens,
or passwords. Do not use local Slurm, local Python, local W&B state, or local
paths as substitutes for a selected remote sandbox.
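The ledger filtering rule in the evidence list can be sketched as follows. This is a minimal illustration that assumes the ledger is a plain CSV and that the current version string has already been read from runtime.yaml. The function name and every column other than experiment_version_id are hypothetical.

```python
import csv
import io


def filter_ledger_rows(ledger_text, current_version):
    """Keep rows for the current experiment version when the column exists."""
    reader = csv.DictReader(io.StringIO(ledger_text))
    rows = list(reader)
    if "experiment_version_id" not in (reader.fieldnames or []):
        # Column absent: the version filter does not apply, keep all rows.
        return rows
    return [row for row in rows if row["experiment_version_id"] == current_version]


# Hypothetical ledger content, for illustration only.
example = (
    "run_id,experiment_version_id,status\n"
    "a,v1,running\n"
    "b,v2,completed\n"
)
current_rows = filter_ledger_rows(example, "v2")
```

Passing the ledger as text keeps the sketch independent of where the file lives. On a remote sandbox the text would come from that sandbox, not from a local path.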
Status Checks
For each candidate job or experiment, determine:
- scheduler/process state: pending, running, completed, failed, canceled, blocked, unknown, or not_applicable
- experiment identity: job name/id, launcher, method, variant/config, seed, backend, W&B project/group/run id, and ledger row key
- progress: latest step/epoch, examples/images/records processed, latest update time, current speed, and whether progress is stale
- checkpoints: latest checkpoint path, latest checkpoint step, final checkpoint or model artifact, and whether paths exist and are non-empty
- metrics: FID, BPD, loss, training speed, and other real metrics already computed from real samples; never invent placeholder values
- W&B: run state, latest step, metric summary/history, tags, and URL resolution
- ledger: CSV row presence, status, W&B join fields, checkpoint/output path, metrics, timestamp, and backend
- logs/artifacts: first relevant error line, terminal state, output files, and missing artifacts
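The "whether progress is stale" check in the list above could look roughly like this. The 30-minute threshold and the function names are assumptions. The source does not define when progress counts as stale, so the limit should come from project policy.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_update: datetime, now: datetime,
             max_age: timedelta = timedelta(minutes=30)) -> bool:
    """Return True when the newest progress record is older than max_age."""
    return (now - last_update) > max_age


def progress_cell(step: int, last_update: datetime, now: datetime,
                  max_age: timedelta = timedelta(minutes=30)) -> str:
    """Format the progress field with an explicit stale/not stale flag."""
    flag = "stale" if is_stale(last_update, now, max_age) else "not stale"
    return f"step {step}, last update {last_update.isoformat()}, {flag}"
```

Both timestamps should be timezone-aware and taken from the same source as the progress record, for example the last row of a progress CSV.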
If a source cannot be checked, report that source as blocked with the exact
command, path, key, or API error. Continue checking independent sources only
when their evidence does not depend on the failed source. Never replace a
failed source with a weaker inferred answer.
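One way to honor the blocked-source rule is to record each source separately and skip checks that depend on a failed one. The sketch below is an assumption about structure, not the skill's actual implementation. `SourceResult`, `check_sources`, and the `depends_on` mapping are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceResult:
    name: str
    status: str                    # "ok" or "blocked"
    value: object = None
    blocker: Optional[str] = None  # exact error, path, or key that failed


def check_sources(checks, depends_on=None):
    """Run each check in order; never substitute a value for a failed source."""
    depends_on = depends_on or {}
    results = {}
    for name, check in checks.items():
        failed = [dep for dep in depends_on.get(name, [])
                  if dep not in results or results[dep].status != "ok"]
        if failed:
            # Evidence depends on a source that could not be read: skip it.
            results[name] = SourceResult(
                name, "blocked", blocker=f"depends on blocked source: {failed[0]}")
            continue
        try:
            results[name] = SourceResult(name, "ok", value=check())
        except Exception as exc:
            results[name] = SourceResult(
                name, "blocked", blocker=f"{type(exc).__name__}: {exc}")
    return results
```

Independent sources still run after a failure, and each blocked entry carries the exact cause for the report.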
Backend Notes
- For LUMI, verify the required jump/proxy SSH route before remote scheduler queries. Use the selected sandbox's scheduler and log paths.
- For Snellius, use the configured Snellius SSH route and Slurm/account policy; do not apply LUMI partition or jump-route assumptions.
- For Brev, do not use Slurm. Use direct SSH/process/tmux status checks, and use nvidia-brev-gpu-query only for instance/GPU availability questions.
- For AutoDL, resolve the active machine from ignored sandbox config and use direct remote checks. Use COS only for file transfer questions, not for status.
- For RunPod or other paid compute, do not start new paid resources. If a status check confirms idle paid compute, follow the project paid-compute policy and report the exact evidence and action.
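The backend split above can be sketched as a lookup plus Slurm command construction. The function names and returned labels are hypothetical, and the "local" sandbox is left out because the source does not say how it is checked. The sketch only builds argument lists. Running them over the configured SSH route is not shown.

```python
SLURM_BACKENDS = {"lumi", "snellius"}
PROCESS_BACKENDS = {"brev", "autodl"}


def status_mechanism(backend: str) -> str:
    """Map a sandbox name to the kind of status check it needs."""
    key = backend.lower()
    if key in SLURM_BACKENDS:
        return "slurm"
    if key in PROCESS_BACKENDS:
        return "process"          # direct SSH/process/tmux checks, never Slurm
    if key == "runpod":
        return "runpod_api_or_controller_logs"
    raise ValueError(f"no status mechanism defined for backend {backend!r}")


def slurm_queries(job_id: str) -> list:
    """Build read-only squeue and sacct argument lists for one job id."""
    return [
        # %i job id, %T state, %M time used, %l time limit
        ["squeue", f"--jobs={job_id}", "--noheader", "--format=%i|%T|%M|%l"],
        ["sacct", f"--jobs={job_id}", "--noheader", "--parsable2",
         "--format=JobID,State,Elapsed,Timelimit"],
    ]
```

Both commands only read scheduler state, which matches the read-only contract of this skill.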
Report Contract
Return a compact table. For scale monitor or Scale-RAE status requests, prefer
this row shape:

Stage | Split | Slurm State | Jobs R/P/C/F | Progress | Rate | ETA | Last Update | Artifact / Log

Precede or follow the table with a one-line Overall: summary. Jobs R/P/C/F
means running, pending, completed, and failed counts.
Use MISSING, STALE, or BLOCKED in the relevant cell when a scheduler
query, progress CSV, state file, or log path cannot be read; include the exact
path/key/cause rather than inferring health from another source.
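A minimal sketch of how one such row might be assembled. The helper names are hypothetical, and the `MARKER: cause` cell layout is an assumption about how to attach the exact cause to a marker.

```python
from collections import Counter

COLUMNS = ["Stage", "Split", "Slurm State", "Jobs R/P/C/F", "Progress",
           "Rate", "ETA", "Last Update", "Artifact / Log"]
MARKERS = {"MISSING", "STALE", "BLOCKED"}


def jobs_cell(states) -> str:
    """Count running/pending/completed/failed job states as R/P/C/F."""
    counts = Counter(state.upper() for state in states)
    return "/".join(str(counts[key])
                    for key in ("RUNNING", "PENDING", "COMPLETED", "FAILED"))


def marker_cell(marker: str, cause: str) -> str:
    """Build a cell such as 'MISSING: <exact path or error>'."""
    if marker not in MARKERS:
        raise ValueError(f"unknown marker {marker!r}")
    return f"{marker}: {cause}"


def format_row(values: dict) -> str:
    """Join cells in column order; a missing column raises KeyError."""
    return " | ".join(str(values[column]) for column in COLUMNS)
```

Raising on a missing column, instead of filling a default, forces every unreadable source to be reported explicitly with its cause.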
Include raw private backend details only in operator_private_notes. If the
result may be shared with a student, add a
student_safe_report that redacts backend identity, private W&B entity/account
URLs, hosts, usernames, account IDs, storage roots, SSH aliases, queue details,
and Slurm layout.
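A student_safe_report could be derived from the operator text with a redaction pass like the one below. This is a rough sketch. The function name and placeholder format are hypothetical, and it only removes values it is explicitly given plus W&B URLs, so a real report still needs a manual check.

```python
import re


def student_safe(text: str, private_values: dict) -> str:
    """Replace known private strings and W&B URLs with neutral placeholders."""
    # Redact full W&B URLs first so entity and project names do not leak.
    text = re.sub(r"https://wandb\.ai/\S+", "<wandb_url>", text)
    # Replace longer values first so substrings do not break longer matches.
    for label, value in sorted(private_values.items(),
                               key=lambda item: -len(item[1])):
        if value:  # skip empty strings, which would match everywhere
            text = text.replace(value, f"<{label}>")
    return text
```

Here `private_values` maps a label such as "host" or "account" to the private string taken from the backend config, which itself must never be printed.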
| field | value |
|---|---|
| workspace | resolved workspace path or blocker |
| sandbox | backend checked, or all configured backends |
| experiment | experiment/job/group/variant identity |
| scheduler_state | scheduler/process state, job id, elapsed/time limit, and exact blocker |
| progress | latest step/epoch/unit count, speed, last update, stale/not stale |
| checkpoint | latest/final checkpoint path, step, exists/non-empty status |
| metrics | real metric names/values such as FID/BPD/loss/speed and source |
| wandb | project/group/run id, run state, latest step, URL status; raw URL only in operator notes if private |
| ledger | CSV path, row status, join fields, metrics, timestamp |
| logs | log paths inspected, terminal error or no error found |
| outputs | metrics/artifact/output paths verified |
| status | healthy, running, completed, failed, blocked, unknown, or mixed |
| next_action | monitor, wait, inspect blocker, retry failed only, run report job, stop idle paid compute, or handoff |
Use healthy only when scheduler/process state, W&B, ledger, checkpoint/output,
and required metrics agree. Use blocked when a required source cannot be read
or queried. Use unknown only when the user did not provide enough identity and
it cannot be discovered from workspace context.
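The three rules above can be expressed as a small decision function. This sketch is an assumption-heavy simplification: the names are hypothetical, per-source values are reduced to "ok", "blocked", or "disagrees", and agreement is treated as sufficient for healthy. The function returns None when none of the three rules decides, leaving running, completed, failed, or mixed to the caller.

```python
REQUIRED = ("scheduler", "wandb", "ledger", "checkpoint", "metrics")


def overall_status(sources: dict, identity_resolved: bool = True):
    """Apply the unknown, blocked, and healthy rules in that order."""
    if not identity_resolved:
        # Not enough identity from the user or the workspace context.
        return "unknown"
    if any(sources.get(name, "blocked") == "blocked" for name in REQUIRED):
        # A required source could not be read or queried.
        return "blocked"
    if all(sources[name] == "ok" for name in REQUIRED):
        return "healthy"
    return None  # sources readable but not in agreement
```

A source missing from `sources` counts as blocked, so an unchecked source can never contribute to a healthy result.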
Handoffs
- Use sue-fullrun for production launch orchestration or retries.
- Use sue-dryrun when a dry run must be executed or repaired.
- Use sue-scripts-writing when launcher/monitor/run-state surfaces are missing or broken.
- Use sue-cluster-info for LUMI/Snellius quota and job-limit overview, or the active backend's private config, for backend quota/allocation/storage/active-job inventory that is not tied to a specific experiment.
- Use wandb-csv when the user asks to update, repair, or compare the workspace CSV result ledger.