agent-job-info

Name: agent-job-info
Author: dongzhuoyao

Summarize DeepResearch experiment job status across scheduler/process

Install

mkdir -p .claude/skills/agent-job-info && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/13476" && unzip -o skill.zip -d .claude/skills/agent-job-info && rm skill.zip

Installs to .claude/skills/agent-job-info

About this skill

Agent Job Info

Use this skill to answer "what is happening with my experiment jobs?" It is a read-only status aggregation workflow. Do not launch jobs, patch scripts, change remote lifecycle state, or mark a run healthy from inferred evidence.

Inputs

Resolve what the user provided:

sandbox: LUMI, Snellius, Brev, AutoDL, RunPod, local, or all configured backends when omitted.
workspace: explicit path, current repo workspace, or workspace/<project>.
experiment: experiment name, Slurm job name/id, W&B run/group, launcher, output directory, or latest active/recent run when omitted.

If the sandbox is omitted, inspect all locally configured DeepResearch backends independently. If workspace or experiment identity remains ambiguous after checking current context, ask for the missing value instead of guessing.
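The resolution rules above can be sketched as a small helper. This is a minimal illustration, not part of the skill itself: `resolve_inputs`, `ResolvedInputs`, and the `context_*` parameters are hypothetical names, and the backend list is copied from the Inputs section.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Backend names copied from the Inputs list above.
KNOWN_SANDBOXES = ["LUMI", "Snellius", "Brev", "AutoDL", "RunPod", "local"]


@dataclass
class ResolvedInputs:
    sandboxes: List[str]            # backends to inspect independently
    workspace: Optional[str]
    experiment: Optional[str]
    questions: List[str] = field(default_factory=list)  # ask instead of guessing


def resolve_inputs(sandbox=None, workspace=None, experiment=None,
                   configured=None, context_workspace=None,
                   context_experiment=None):
    """Hypothetical helper: apply the defaulting rules without guessing."""
    configured = list(configured) if configured is not None else list(KNOWN_SANDBOXES)
    resolved = ResolvedInputs(
        # An omitted sandbox means every configured backend.
        sandboxes=[sandbox] if sandbox else configured,
        # Fall back to values discovered from the current context, if any.
        workspace=workspace or context_workspace,
        experiment=experiment or context_experiment,
    )
    if resolved.workspace is None:
        resolved.questions.append("Which workspace should be inspected?")
    if resolved.experiment is None:
        resolved.questions.append("Which experiment, job, or run should be inspected?")
    return resolved
```

A non-empty `questions` list signals that the agent should stop and ask the user before running any status check.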

Evidence Sources

Prefer concrete local and remote artifacts in this order:

1. User-provided job IDs, full W&B URLs (https://wandb.ai/<entity>/<project>/runs/<run_id>), experiment names, or paths.
2. Workspace run-state files, monitor state files, progress CSVs, timing JSON, and launch manifests.
3. Workspace CSV ledger at <paths.ledger_csv> (read from the SUE live config scale_up_outputs/<exp_dir>/config/runtime.yaml), filtering to rows whose experiment_version_id matches runtime.yaml:experiment_version.current when the column is present. Use wandb-csv when the ledger needs interpretation or update-status analysis.
4. W&B run state, latest step, latest metric history/summary, run config, tags, and run URL when credentials are available.
5. Scheduler or process status from the selected sandbox: Slurm squeue/sacct for LUMI/Snellius, direct process/tmux checks for Brev/AutoDL, and RunPod API or controller logs for RunPod.
6. Durable logs, stdout/stderr, progress CSV rows, checkpoint directories, metrics JSON/CSV, output artifacts, and report files.
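The ledger filter described in the list above (keep only rows for the current experiment version, and only when the column exists) might look like the following sketch. Only the `experiment_version_id` column name comes from the text; the function name is hypothetical, and the current version is assumed to have already been read from runtime.yaml.

```python
import csv
import io


def filter_ledger_rows(ledger_text, current_version):
    """Keep rows for the current experiment version when the column exists."""
    reader = csv.DictReader(io.StringIO(ledger_text))
    rows = list(reader)
    # The filter applies only when the ledger has the column at all.
    if "experiment_version_id" not in (reader.fieldnames or []):
        return rows
    return [row for row in rows if row["experiment_version_id"] == current_version]
```

Returning every row when the column is absent mirrors the "when the column is present" condition; it does not try to infer a version from other fields.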

Use sue-cluster-info (LUMI/Snellius quota and job limits) and the active backend's private config to resolve backend access and quota/status details. Do not print secrets, private hostnames, account/allocation IDs, storage roots, tokens, or passwords. Do not use local Slurm, local Python, local W&B state, or local paths as substitutes for a selected remote sandbox.

Status Checks

For each candidate job or experiment, determine:

scheduler/process state: pending, running, completed, failed, canceled, blocked, unknown, or not_applicable
experiment identity: job name/id, launcher, method, variant/config, seed, backend, W&B project/group/run id, and ledger row key
progress: latest step/epoch, examples/images/records processed, latest update time, current speed, and whether progress is stale
checkpoints: latest checkpoint path, latest checkpoint step, final checkpoint or model artifact, and whether paths exist and are non-empty
metrics: FID, BPD, loss, training speed, and other real metrics already computed from real samples; never invent placeholder values
W&B: run state, latest step, metric summary/history, tags, and URL resolution
ledger: CSV row presence, status, W&B join fields, checkpoint/output path, metrics, timestamp, and backend
logs/artifacts: first relevant error line, terminal state, output files, and missing artifacts
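For the Slurm backends, the scheduler state in the list above can be derived from `sacct` output. The sketch below assumes the text was produced by `sacct --parsable2 --noheader --format=JobID,JobName,State`, which yields pipe-separated lines. Grouping TIMEOUT, OUT_OF_MEMORY, and NODE_FAIL under `failed` is an assumption of this example, not something the skill specifies.

```python
# Mapping from raw Slurm states to the scheduler/process states listed above.
# Treating TIMEOUT, OUT_OF_MEMORY, and NODE_FAIL as "failed" is an assumption.
SLURM_TO_STATUS = {
    "PENDING": "pending",
    "RUNNING": "running",
    "COMPLETED": "completed",
    "FAILED": "failed",
    "TIMEOUT": "failed",
    "OUT_OF_MEMORY": "failed",
    "NODE_FAIL": "failed",
    "CANCELLED": "canceled",
}


def normalize_slurm_state(raw):
    """Map a raw Slurm state string to the skill's state vocabulary."""
    # sacct can report e.g. "CANCELLED by 12345"; keep only the first token.
    tokens = raw.strip().upper().split()
    if not tokens:
        return "unknown"
    return SLURM_TO_STATUS.get(tokens[0].rstrip("+"), "unknown")


def parse_sacct(text):
    """Parse pipe-separated `JobID|JobName|State` lines into dicts."""
    jobs = []
    for line in text.splitlines():
        parts = line.strip().split("|")
        if len(parts) != 3:
            continue  # skip blank or malformed lines rather than guess
        job_id, name, state = parts
        jobs.append({"job_id": job_id, "name": name,
                     "state": normalize_slurm_state(state)})
    return jobs
```

States that are not in the mapping come back as `unknown` so that an unfamiliar scheduler value is surfaced rather than silently coerced.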

If a source cannot be checked, report that source as blocked with the exact command, path, key, or API error. Continue checking independent sources only when their evidence does not depend on the failed source. Never replace a failed source with a weaker inferred answer.
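One way to honor the rule above is to wrap every evidence probe so that a failure is recorded as `blocked` together with the exact target and error, instead of being replaced by a guess. `check_source` and its result keys are hypothetical names used only for illustration.

```python
def check_source(name, target, probe):
    """Run one evidence probe; on failure, report the source as blocked.

    `target` is the exact command, path, key, or API call being checked,
    so the report can quote it verbatim.
    """
    try:
        return {"source": name, "target": target, "status": "ok", "value": probe()}
    except Exception as exc:  # broad on purpose: any failure becomes a blocker
        return {"source": name, "target": target, "status": "blocked",
                "blocker": f"{type(exc).__name__}: {exc}"}
```

Callers would then skip any check that depends on a blocked source and carry the `blocker` string into the final report unchanged.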

Backend Notes

For LUMI, verify the required jump/proxy SSH route before remote scheduler queries. Use the selected sandbox's scheduler and log paths.
For Snellius, use the configured Snellius SSH route and Slurm/account policy; do not apply LUMI partition or jump-route assumptions.
For Brev, do not use Slurm. Use direct SSH/process/tmux status and nvidia-brev-gpu-query only for instance/GPU availability questions.
For AutoDL, resolve the active machine from ignored sandbox config and use direct remote checks. Prefer COS only for file transfer questions, not status.
For RunPod or other paid compute, do not start new paid resources. If a status check confirms idle paid compute, follow the project paid-compute policy and report the exact evidence and action.
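The backend notes above amount to a dispatch from sandbox to status-check method. A minimal sketch follows; the method labels (`slurm`, `process`, `runpod_api`) are names invented for this example, and `local` is left out because the notes do not say how it should be checked.

```python
# How each backend's job state is checked, following the notes above.
STATUS_METHOD = {
    "LUMI": "slurm",         # after verifying the jump/proxy SSH route
    "Snellius": "slurm",     # with Snellius-specific route and policy
    "Brev": "process",       # direct SSH/process/tmux, never Slurm
    "AutoDL": "process",     # direct remote checks on the active machine
    "RunPod": "runpod_api",  # RunPod API or controller logs; start nothing new
}


def status_method(sandbox):
    """Return the status-check method for a sandbox, or fail loudly."""
    try:
        return STATUS_METHOD[sandbox]
    except KeyError:
        raise ValueError(f"no status method defined for sandbox {sandbox!r}") from None
```

Raising on an unmapped sandbox keeps one backend's assumptions from being applied to another by default.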

Report Contract

Return a compact table. For scale monitor or Scale-RAE status requests, prefer this row shape: Stage | Split | Slurm State | Jobs R/P/C/F | Progress | Rate | ETA | Last Update | Artifact / Log, preceded or followed by a one-line Overall: summary. Jobs R/P/C/F means running, pending, completed, and failed counts. Use MISSING, STALE, or BLOCKED in the relevant cell when a scheduler query, progress CSV, state file, or log path cannot be read; include the exact path/key/cause rather than inferring health from another source.
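The row shape above can be assembled as in the following sketch. The column names come from the text; the helper names are hypothetical, and the input is assumed to be job states already normalized to lowercase strings such as `running` or `failed`.

```python
from collections import Counter

COLUMNS = ["Stage", "Split", "Slurm State", "Jobs R/P/C/F", "Progress",
           "Rate", "ETA", "Last Update", "Artifact / Log"]


def jobs_rpcf(states):
    """Count running/pending/completed/failed jobs as an 'R/P/C/F' cell."""
    counts = Counter(states)
    return "/".join(str(counts[key])
                    for key in ("running", "pending", "completed", "failed"))


def format_row(cells):
    """Join one row in column order; every column must be supplied."""
    return " | ".join(str(cells[col]) for col in COLUMNS)
```

Because `format_row` requires every column, a cell that could not be read has to be filled explicitly with MISSING, STALE, or BLOCKED plus its cause, rather than being dropped.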

Include raw private backend details only in operator_private_notes. If the result may be shared with a student, add a student_safe_report that redacts backend identity, private W&B entity/account URLs, hosts, usernames, account IDs, storage roots, SSH aliases, queue details, and Slurm layout.
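A pattern-based redaction pass for the student_safe_report could start from the sketch below. It is illustrative only: the three patterns cover W&B URLs, `user@host` targets, and multi-component absolute paths, and they are assumptions of this example. A real implementation would also need to redact known private values (account IDs, SSH aliases, queue names) taken from the backend config.

```python
import re

# Order matters: URLs are replaced first so their slashes are not
# later mistaken for filesystem paths.
REDACTIONS = [
    (re.compile(r"https://wandb\.ai/\S+"), "[wandb-url]"),
    (re.compile(r"\b[\w.-]+@[\w.-]+\b"), "[ssh-target]"),
    # Absolute paths with at least one directory component.
    (re.compile(r"(?<![\w.])/(?:[\w.-]+/)+[\w.-]*"), "[path]"),
]


def redact(text):
    """Apply each redaction pattern in order and return the cleaned text."""
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text
```

Relative paths such as workspace/<project> are left untouched by these patterns, since only absolute paths are likely to expose storage roots.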

field	value
workspace	resolved workspace path or blocker
sandbox	backend checked, or all configured backends
experiment	experiment/job/group/variant identity
scheduler_state	scheduler/process state, job id, elapsed/time limit, and exact blocker
progress	latest step/epoch/unit count, speed, last update, stale/not stale
checkpoint	latest/final checkpoint path, step, exists/non-empty status
metrics	real metric names/values such as FID/BPD/loss/speed and source
wandb	project/group/run id, run state, latest step, URL status; raw URL only in operator notes if private
ledger	CSV path, row status, join fields, metrics, timestamp
logs	log paths inspected, terminal error or `no error found`
outputs	metrics/artifact/output paths verified
status	healthy, running, completed, failed, blocked, unknown, or mixed
next_action	monitor, wait, inspect blocker, retry failed only, run report job, stop idle paid compute, or handoff

Use healthy only when scheduler/process state, W&B, ledger, checkpoint/output, and required metrics agree. Use blocked when a required source cannot be read or queried. Use unknown only when the user did not provide enough identity and it cannot be discovered from workspace context.
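The status rules above can be read as a small decision function. In this sketch each required source reports a verdict of `ok`, `failed`, or `blocked`; that verdict vocabulary, the precedence of `blocked` over `failed`, and the use of `mixed` for disagreement are assumptions made for illustration, since the text does not spell them out.

```python
REQUIRED_SOURCES = ("scheduler", "wandb", "ledger", "checkpoint", "metrics")


def overall_status(verdicts, identity_known=True):
    """Combine per-source verdicts ('ok', 'failed', 'blocked') into one status."""
    if not identity_known:
        return "unknown"  # not enough identity to pick a job at all
    # A required source with no verdict counts as blocked, never as ok.
    values = [verdicts.get(name, "blocked") for name in REQUIRED_SOURCES]
    if "blocked" in values:
        return "blocked"
    if all(value == "ok" for value in values):
        return "healthy"
    if all(value == "failed" for value in values):
        return "failed"
    return "mixed"
```

The key property is that `healthy` is reachable only when every required source was actually read and agrees; a missing verdict can never produce it.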

Handoffs

Use sue-fullrun for production launch orchestration or retries.
Use sue-dryrun when a dry run must be executed or repaired.
Use sue-scripts-writing when launcher/monitor/run-state surfaces are missing or broken.
Use sue-cluster-info (LUMI/Snellius quota and job-limit overview) or the active backend's private config for backend quota, allocation, storage, or active-job inventory that is not tied to a specific experiment.
Use wandb-csv when the user asks to update, repair, or compare the workspace CSV result ledger.
