
agent-job-info

Summarize DeepResearch experiment job status across scheduler/process

Install

mkdir -p .claude/skills/agent-job-info && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/13476" && unzip -o skill.zip -d .claude/skills/agent-job-info && rm skill.zip

Installs to .claude/skills/agent-job-info

About this skill

Agent Job Info

Use this skill to answer "what is happening with my experiment jobs?" It is a read-only status aggregation workflow. Do not launch jobs, patch scripts, change remote lifecycle state, or mark a run healthy from inferred evidence.

Inputs

Resolve what the user provided:

  • sandbox: LUMI, Snellius, Brev, AutoDL, RunPod, local, or all configured backends when omitted.
  • workspace: explicit path, current repo workspace, or workspace/<project>.
  • experiment: experiment name, Slurm job name/id, W&B run/group, launcher, output directory, or latest active/recent run when omitted.

If the sandbox is omitted, inspect all locally configured DeepResearch backends independently. If workspace or experiment identity remains ambiguous after checking current context, ask for the missing value instead of guessing.
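
The resolution rules above can be sketched as follows. This is a minimal illustration, not the skill's implementation: `resolve_inputs`, `ResolvedInputs`, the lowercase backend names, and the `"latest"` sentinel are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Backend names from the Inputs list; the lowercase spelling is an assumption.
KNOWN_SANDBOXES = ("lumi", "snellius", "brev", "autodl", "runpod", "local")


@dataclass
class ResolvedInputs:
    sandboxes: List[str]            # backends to inspect independently
    workspace: Optional[str]        # None when it must be asked for
    experiment: str
    questions: List[str] = field(default_factory=list)  # values to ask the user for


def resolve_inputs(sandbox=None, workspace=None, experiment=None,
                   configured=KNOWN_SANDBOXES, context_workspace=None):
    """Resolve user inputs, asking instead of guessing when a value is missing."""
    questions = []

    if sandbox is None:
        # Omitted sandbox: inspect every configured backend independently.
        sandboxes = list(configured)
    elif sandbox.lower() in configured:
        sandboxes = [sandbox.lower()]
    else:
        sandboxes = []
        questions.append("sandbox")

    # Fall back to the workspace found in the current context, if any.
    resolved_workspace = workspace or context_workspace
    if resolved_workspace is None:
        questions.append("workspace")

    # Omitted experiment means "latest active/recent run" (hypothetical sentinel).
    resolved_experiment = experiment or "latest"

    return ResolvedInputs(sandboxes, resolved_workspace, resolved_experiment, questions)
```

A non-empty `questions` list is the signal to stop and ask the user rather than proceed on a guess.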

Evidence Sources

Prefer concrete local and remote artifacts in this order:

  1. User-provided job IDs, full W&B URLs (https://wandb.ai/<entity>/<project>/runs/<run_id>), experiment names, or paths.
  2. Workspace run-state files, monitor state files, progress CSVs, timing JSON, and launch manifests.
  3. Workspace CSV ledger at <paths.ledger_csv> (read from the SUE live config scale_up_outputs/<exp_dir>/config/runtime.yaml), filtering to rows whose experiment_version_id matches runtime.yaml:experiment_version.current when the column is present. Use wandb-csv when the ledger needs interpretation or update-status analysis.
  4. W&B run state, latest step, latest metric history/summary, run config, tags, and run URL when credentials are available.
  5. Scheduler or process status from the selected sandbox: Slurm squeue/sacct for LUMI/Snellius, direct process/tmux checks for Brev/AutoDL, and RunPod API or controller logs for RunPod.
  6. Durable logs, stdout/stderr, progress CSV rows, checkpoint directories, metrics JSON/CSV, output artifacts, and report files.
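
As a hedged illustration of item 1, a full W&B run URL of the shape shown there can be split into its join fields with the standard library. The function name `parse_wandb_run_url` is hypothetical, and anything that does not match the documented shape is rejected rather than guessed at.

```python
from typing import Dict, Optional
from urllib.parse import urlsplit


def parse_wandb_run_url(url: str) -> Optional[Dict[str, str]]:
    """Return entity/project/run_id for https://wandb.ai/<entity>/<project>/runs/<run_id>."""
    parts = urlsplit(url)
    if parts.scheme != "https" or parts.netloc != "wandb.ai":
        return None
    segments = [s for s in parts.path.split("/") if s]
    # Expect exactly: <entity>, <project>, "runs", <run_id>.
    if len(segments) != 4 or segments[2] != "runs":
        return None
    entity, project, _, run_id = segments
    return {"entity": entity, "project": project, "run_id": run_id}
```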

Use sue-cluster-info (LUMI/Snellius quota and job limits) and the active backend's private config to resolve backend access and quota/status details. Do not print secrets, private hostnames, account/allocation IDs, storage roots, tokens, or passwords. Do not use local Slurm, local Python, local W&B state, or local paths as substitutes for a selected remote sandbox.
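
The ledger filter from item 3 might look like the sketch below. It assumes the caller has already read `experiment_version.current` from runtime.yaml and passes it in as `current_version`; the function name `filter_ledger_rows` is hypothetical, and the ledger is passed as text to keep the example self-contained.

```python
import csv
import io
from typing import Dict, List


def filter_ledger_rows(csv_text: str, current_version: str) -> List[Dict[str, str]]:
    """Keep rows for the current experiment version when the column is present."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    # Older ledgers may lack the column; in that case return every row unfiltered.
    if "experiment_version_id" not in (reader.fieldnames or []):
        return rows
    return [row for row in rows if row["experiment_version_id"] == current_version]
```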

Status Checks

For each candidate job or experiment, determine:

  • scheduler/process state: pending, running, completed, failed, canceled, blocked, unknown, or not_applicable
  • experiment identity: job name/id, launcher, method, variant/config, seed, backend, W&B project/group/run id, and ledger row key
  • progress: latest step/epoch, examples/images/records processed, latest update time, current speed, and whether progress is stale
  • checkpoints: latest checkpoint path, latest checkpoint step, final checkpoint or model artifact, and whether paths exist and are non-empty
  • metrics: FID, BPD, loss, training speed, and other real metrics already computed from real samples; never invent placeholder values
  • W&B: run state, latest step, metric summary/history, tags, and URL resolution
  • ledger: CSV row presence, status, W&B join fields, checkpoint/output path, metrics, timestamp, and backend
  • logs/artifacts: first relevant error line, terminal state, output files, and missing artifacts
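
For the Slurm backends, raw scheduler states need to be folded into the state vocabulary listed above. The sketch below is one possible mapping, not the skill's own: the names are hypothetical, and treating TIMEOUT, NODE_FAIL, and OUT_OF_MEMORY as `failed` is an assumption.

```python
# Maps Slurm job states to the skill's state vocabulary.
# Folding TIMEOUT, NODE_FAIL, and OUT_OF_MEMORY into "failed" is an assumption.
SLURM_TO_SKILL_STATE = {
    "PENDING": "pending",
    "RUNNING": "running",
    "COMPLETING": "running",
    "COMPLETED": "completed",
    "FAILED": "failed",
    "TIMEOUT": "failed",
    "NODE_FAIL": "failed",
    "OUT_OF_MEMORY": "failed",
    "CANCELLED": "canceled",
}


def normalize_slurm_state(raw: str) -> str:
    """Normalize a raw squeue/sacct state string; unrecognized input stays 'unknown'."""
    # sacct can report e.g. "CANCELLED by 1234"; keep only the first token.
    tokens = raw.strip().upper().split()
    if not tokens:
        return "unknown"
    # A trailing "+" marks a truncated state name in sacct output.
    return SLURM_TO_SKILL_STATE.get(tokens[0].rstrip("+"), "unknown")
```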

If a source cannot be checked, report that source as blocked with the exact command, path, key, or API error. Continue checking independent sources only when their evidence does not depend on the failed source. Never replace a failed source with a weaker inferred answer.
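
One way to honor this rule is to wrap every source probe so that a failure is recorded as `blocked` together with the exact target and error, instead of being swallowed or replaced. This is a minimal sketch; `SourceResult` and `check_source` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class SourceResult:
    source: str
    state: str                     # "ok" or "blocked"
    value: Any = None
    detail: Optional[str] = None   # exact target and error when blocked


def check_source(source: str, target: str, probe: Callable[[], Any]) -> SourceResult:
    """Run one probe; on failure report the exact target and error, never a guess."""
    try:
        return SourceResult(source, "ok", value=probe())
    except Exception as exc:
        detail = f"{target}: {type(exc).__name__}: {exc}"
        return SourceResult(source, "blocked", detail=detail)
```

Each source gets its own call, so one blocked probe does not prevent the independent ones from running.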

Backend Notes

  • For LUMI, verify the required jump/proxy SSH route before remote scheduler queries. Use the selected sandbox's scheduler and log paths.
  • For Snellius, use the configured Snellius SSH route and Slurm/account policy; do not apply LUMI partition or jump-route assumptions.
  • For Brev, do not use Slurm. Use direct SSH/process/tmux status checks, and use nvidia-brev-gpu-query only for instance/GPU availability questions.
  • For AutoDL, resolve the active machine from the ignored sandbox config and use direct remote checks. Use COS only for file transfer questions, not for status.
  • For RunPod or other paid compute, do not start new paid resources. If a status check confirms idle paid compute, follow the project paid-compute policy and report the exact evidence and action.
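
The per-backend rules can be summarized as a small dispatch table. This is a sketch with hypothetical names and labels; `local` is left out on purpose because the notes above do not specify a mechanism for it.

```python
# Which status mechanism applies to which backend, per the notes above.
STATUS_MECHANISM = {
    "lumi": "slurm",          # squeue/sacct, after verifying the jump/proxy SSH route
    "snellius": "slurm",      # squeue/sacct over the configured Snellius SSH route
    "brev": "process",        # direct SSH/process/tmux checks, never Slurm
    "autodl": "process",      # direct remote checks on the active machine
    "runpod": "runpod_api",   # RunPod API or controller logs
}


def status_mechanism(backend: str) -> str:
    """Look up the mechanism for a backend; fail loudly rather than assume one."""
    try:
        return STATUS_MECHANISM[backend.lower()]
    except KeyError:
        raise ValueError(f"no status mechanism configured for backend {backend!r}") from None
```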

Report Contract

Return a compact table. For scale monitor or Scale-RAE status requests, prefer this row shape: Stage | Split | Slurm State | Jobs R/P/C/F | Progress | Rate | ETA | Last Update | Artifact / Log, preceded or followed by a one-line Overall: summary. Jobs R/P/C/F means running, pending, completed, and failed counts. Use MISSING, STALE, or BLOCKED in the relevant cell when a scheduler query, progress CSV, state file, or log path cannot be read; include the exact path/key/cause rather than inferring health from another source.
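
A minimal sketch of building one such row follows. The helper names are hypothetical; the column names and the MISSING/STALE/BLOCKED markers come from the row shape above. A column the caller forgot raises `KeyError` instead of being silently filled in.

```python
COLUMNS = [
    "Stage", "Split", "Slurm State", "Jobs R/P/C/F", "Progress",
    "Rate", "ETA", "Last Update", "Artifact / Log",
]
MARKERS = ("MISSING", "STALE", "BLOCKED")


def jobs_cell(running: int, pending: int, completed: int, failed: int) -> str:
    """Render the R/P/C/F counts in the documented order."""
    return f"{running}/{pending}/{completed}/{failed}"


def marker_cell(marker: str, cause: str) -> str:
    """Render an unreadable cell with its exact path/key/cause."""
    if marker not in MARKERS:
        raise ValueError(f"unknown marker: {marker!r}")
    return f"{marker} ({cause})"


def format_row(cells: dict) -> str:
    """Join cells in column order; a missing column raises KeyError."""
    return " | ".join(str(cells[name]) for name in COLUMNS)
```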

Include raw private backend details only in operator_private_notes. If the result may be shared with a student, add a student_safe_report that redacts backend identity, private W&B entity/account URLs, hosts, usernames, account IDs, storage roots, SSH aliases, queue details, and Slurm layout.

field | value
workspace | resolved workspace path or blocker
sandbox | backend checked, or all configured backends
experiment | experiment/job/group/variant identity
scheduler_state | scheduler/process state, job id, elapsed/time limit, and exact blocker
progress | latest step/epoch/unit count, speed, last update, stale/not stale
checkpoint | latest/final checkpoint path, step, exists/non-empty status
metrics | real metric names/values such as FID/BPD/loss/speed and source
wandb | project/group/run id, run state, latest step, URL status; raw URL only in operator notes if private
ledger | CSV path, row status, join fields, metrics, timestamp
logs | log paths inspected, terminal error or no error found
outputs | metrics/artifact/output paths verified
status | healthy, running, completed, failed, blocked, unknown, or mixed
next_action | monitor, wait, inspect blocker, retry failed only, run report job, stop idle paid compute, or handoff
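
For the student_safe_report, one simple approach is to replace every private value the operator already knows (hosts, usernames, account IDs, storage roots, SSH aliases) with a placeholder. The sketch below assumes such a list is available; `redact` and the `[REDACTED]` placeholder are hypothetical.

```python
from typing import Iterable


def redact(text: str, private_values: Iterable[str], placeholder: str = "[REDACTED]") -> str:
    """Replace each known private value in text with a placeholder."""
    # Replace longer values first so a short value never splits a longer one.
    for value in sorted(set(private_values), key=len, reverse=True):
        if value:
            text = text.replace(value, placeholder)
    return text
```

This only removes values it is told about, so the operator-maintained list has to be complete for the report to be safe to share.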

Use healthy only when scheduler/process state, W&B, ledger, checkpoint/output, and required metrics agree. Use blocked when a required source cannot be read or queried. Use unknown only when the user did not provide enough identity and it cannot be discovered from workspace context.
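
These rules can be sketched as a small decision function. It assumes each required source reports `True` (agrees), `False` (disagrees), or `None` (could not be read); the names are hypothetical, and mapping disagreement to `mixed` is an assumption rather than something stated above.

```python
from typing import Dict, Optional

# Sources that must agree before a run may be called healthy.
REQUIRED_SOURCES = ("scheduler", "wandb", "ledger", "checkpoint", "metrics")


def derive_status(identity_known: bool, agrees: Dict[str, Optional[bool]]) -> str:
    """Derive the overall status without inferring health from partial evidence."""
    if not identity_known:
        return "unknown"
    verdicts = [agrees.get(name) for name in REQUIRED_SOURCES]
    # A source that is absent or unreadable blocks the verdict.
    if any(verdict is None for verdict in verdicts):
        return "blocked"
    if all(verdicts):
        return "healthy"
    # Assumption: readable sources that disagree are reported as "mixed".
    return "mixed"
```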

Handoffs

  • Use sue-fullrun for production launch orchestration or retries.
  • Use sue-dryrun when a dry run must be executed or repaired.
  • Use sue-scripts-writing when launcher/monitor/run-state surfaces are missing or broken.
  • For backend quota, allocation, storage, or active-job inventory that is not tied to a specific experiment, use sue-cluster-info (LUMI/Snellius quota and job-limit overview) or the active backend's private config.
  • Use wandb-csv when the user asks to update, repair, or compare the workspace CSV result ledger.
