nsys-capture
Wrap an arbitrary action (a bench run, a single curl, an N-second sleep) with `nsys profile`, then immediately export the .nsys-rep to SQLite so downstream skills can SQL-query it without reopening the binary trace.
Install
mkdir -p .claude/skills/nsys-capture && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/15270" && unzip -o skill.zip -d .claude/skills/nsys-capture && rm skill.zipInstalls to .claude/skills/nsys-capture
Activation
This is the description your AI agent reads to decide when to run this skill — the better it matches your request, the more reliably it fires.
Wrap an arbitrary action (a bench run, a single curl, an N-second sleep) with `nsys profile`, then immediately export the .nsys-rep to SQLite so downstream skills can SQL-query it without reopening the binary trace.About this skill
nsys-capture
WHEN
Concrete conditions:
- Bench gap is unexplained.
bench_summary.jsonshows X vs Y differ ≥20%,server-log-miningreturned no config-shaped culprit (no cap, no KV pressure, no retract). The next layer of evidence is per-kernel timing → this skill. - Kernel-level hypothesis. Agent has a specific guess like "autotune is producing
a different tile size" or "cudagraph isn't actually replaying". Both need timeline
inspection. Call this skill against the candidate config to get raw data, then
call
nsys-timeline-sqlto extract. - Patch verification. Before applying a source patch, capture a baseline. After,
capture an identical workload. Diff via
nsys-timeline-sql. Never trust an e2e bench alone to attribute a fix to the patched kernel.
Do not call when:
- The gap is small (<10%); nsys overhead itself is 5–15%, signal-to-noise too low.
- You don't yet have a hypothesis about which window matters. Sample-profiling 30s of mixed warmup + steady state gives garbage. Always run the target_cmd long enough that ≥80% of profile time is in steady state.
- Multiple users are heavy on the same GPU. CUPTI events bleed across processes
for cudagraph-related metrics. Use
gpu_idto pin.
WHY (the failure mode this prevents)
Three concrete past mistakes:
- Profiled too short (2026-06-05). Captured 5s of a sglang start → measured
profile dominated by graph capture, not steady-state inference. Wrong conclusion
about which kernel was hot. Skill rule: minimum target_cmd duration 20s of
steady state, asserted via
bench_summary.jsonwall_s.mean. - Lost the .nsys-rep (2026-06-07). Profile worked, then user deleted the ~80 MB file before agent re-queried. Had to re-run the entire bench. Skill rule: always export sqlite immediately (within the same skill call). The sqlite copy is what downstream uses; the .nsys-rep can be deleted to save disk.
- Forgot to kill the nsys daemon (2026-06-04). Bench finished but
nsys profilewas wrappingsleep 600. The .nsys-rep file was 0 bytes until SIGINT. Skill rule: send SIGINT afterduration_s, wait up to 60s for flush, then confirm file size > 1 MB before returning success.
HOW
python .github/skills/nsys-capture/impl/run_capture.py \
--target-cmd 'python results/4way_bench/scripts/run_bench_4way.py http://127.0.0.1:30000 sglang_cutlass /tmp/x' \
--duration-s 60 \
--gpu-id 1 \
--out-dir results/<exp>/nsys/<config>/ \
--extra-nsys "--cuda-graph-trace=node"
Internal sequence:
- Resolve nsys binary — search
which nsys, fall back to known path/home/t-chendili/cuda/12.6/bin/nsys. Write the version into output. - Launch
nsys profilewith these defaults:-t cuda,nvtx,osrt(CUDA + user NVTX ranges + OS runtime)-s none(no sampling — cuts overhead by 5x)--cuda-memory-usage=true(catches host-side allocator behavior)--force-overwrite=true -o <out_dir>/profile- target = the user's
target_cmd(wrapped viash -c) CUDA_VISIBLE_DEVICES=<gpu_id>in the launched env
- Wait up to
duration_s + 30sfortarget_cmdto exit naturally. If still running, SIGINT the nsys wrapper, wait 60s for flush. - Verify
profile.nsys-repexists and is >1 MB. - Export sqlite immediately:
nsys export --type sqlite --force-overwrite=true --output profile.sqlite profile.nsys-rep. - Write
nsys_capture.json.
OUTPUT CONTRACT — nsys_capture.json
{
"schema_version": 0,
"ok": true,
"captured_at": "2026-06-09T04:00:00Z",
"nsys_binary": "/home/t-chendili/cuda/12.6/bin/nsys",
"nsys_version": "2024.5.1",
"gpu_id": 1,
"target_cmd": "python ... /tmp/x",
"target_exit_code": 0,
"target_duration_s": 38.4,
"files": {
"nsys_rep": "results/.../profile.nsys-rep",
"nsys_rep_size_mb": 87.2,
"sqlite": "results/.../profile.sqlite",
"sqlite_size_mb": 91.6
},
"warnings": [
/* e.g. "target_cmd exited before duration_s — profile may be short" */
]
}
On failure:
{ "schema_version": 0, "ok": false, "error": "<one sentence>",
"stderr_tail": "<last 1KB of nsys stderr>" }
WHICH METRIC HELPS WHICH PROBLEM
This skill itself produces no analysis metrics — it produces the data.
The metrics-to-problem mapping lives in the next skill: nsys-timeline-sql.
What this skill does enable for downstream:
If extra_nsys includes... | Downstream gets... | Useful for |
|---|---|---|
| (default: no extras) | per-kernel timestamps + per-CUDA-API timestamps | Most cases. GPU idle gap, top kernels, launch counts. |
--cuda-graph-trace=node | per-kernel rows inside cudagraph replays, not folded | Diagnosing whether cudagraph is hiding the real bottleneck (sglang R_short case) |
--gpu-metrics-device=all | SM occupancy, mem throughput sampled at 100 Hz | "Why is this kernel slow?" — but adds 5–10% overhead, only enable when targeted |
-t cuda,nvtx,osrt,nccl | NCCL collectives (allreduce, alltoall) timing | Multi-GPU / TP > 1 only |
-t cuda,nvtx,osrt,cublas,cusparse,cudnn | library-level annotations | When you suspect cuBLAS dispatch overhead |
Default is intentionally minimal — extras cost overhead and disk. The agent should add them only when the question requires them.
METHODOLOGY — predict-then-verify
Before capturing, the agent must write down (in plan.md or call context):
"I expect to see in the profile: <kernel X> taking >Y% of GPU time, OR <N> cudaLaunchKernel calls per second, OR <stream Z> idle for >Wms gaps."
After nsys-timeline-sql returns, compare. If the profile shows none of those:
- Either the hypothesis is wrong (don't backfill a new story to fit the data).
- Or the capture missed the relevant window (run again with longer duration / different target_cmd phase).
EXTENSION
Two escape hatches when the default capture isn't enough:
- Custom nsys args via
--extra-nsys— anythingnsys profile --helpaccepts. Examples:--sample=cpu,--cudabacktrace=true,--capture-range=cudaProfilerApi. - Capture multiple windows — call this skill N times with N target_cmds that
each
sleep Kbefore different phases (warmup vs steady vs teardown). The skill is intentionally stateless to enable this.
FAILURE MODES
| Mode | Detection | Mitigation |
|---|---|---|
nsys not on PATH and fallback path missing | first which nsys + stat fallback | {"ok": false, "error": "nsys not found; install or set NSYS_PATH"} |
.nsys-rep is 0 bytes after duration | post-flush size check | retry once with longer flush wait (90s); if still 0, fail |
| sqlite export fails (corrupt trace) | exit code of nsys export | keep .nsys-rep, fail loudly so user can debug manually |
| target_cmd crashes during capture | non-zero exit | record target_exit_code, attempt flush+export anyway (partial profile may still be useful), set ok: true with warning |
| Disk full during capture | check df pre-flight, OSError mid-flight | abort cleanly, fail with "disk full at <path>" |
| Another nsys profile already running on host | pgrep nsys returns >0 | warn, proceed (concurrent profiling works but doubles overhead) |
| CUDA_VISIBLE_DEVICES already set by target_cmd | env var conflict detection | warn, ours wins (skill overrides) |
ROADMAP
- v1 — auto-detect the "steady state" window inside the profile (skip first 10%
of timeline as warmup) and emit
steady_state_start_nsinto capture metadata, so downstream queries default to that window. - v1 — compress .nsys-rep to .nsys-rep.zst after sqlite export (typically 3-5× smaller, since sqlite has the per-event data anyway).
- v2 — multi-rank capture (one .nsys-rep per TP rank, merged into one sqlite).
REFERENCES
- nsys deep-dive (what each report contains):
docs/2026-06-08/nsys_deep_dive_and_proton.md - Past nsys-based validation:
docs/2026-06-08/nsys_2x2_validation_and_nsys_usage.md - Capability audit (what nsys can / can't see):
docs/2026-06-08/agent_profiling_capability_audit.mdPart B