
Install

mkdir -p .claude/skills/build-and-test-mindlab-research && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/14659" && unzip -o skill.zip -d .claude/skills/build-and-test-mindlab-research && rm skill.zip

Installs to .claude/skills/build-and-test-mindlab-research

Activation

This is the description your AI agent reads to decide when to run this skill — the better it matches your request, the more reliably it fires.

Developer environment setup, CI/CD workflows, and CI failure debugging for Megatron-LM. Covers container-based development, uv package management, linting, running tests, CI failure investigation, and common pitfalls. Use when onboarding, setting up a dev environment, troubleshooting build issues, or investigating CI failures.

About this skill

Developer Guide

This guide covers the recommended development workflow for Megatron-LM. The core principle: build and develop inside containers — the CI container ships the correct CUDA toolkit, PyTorch build, and pre-compiled native extensions (TransformerEngine, DeepEP, …) that cannot be reproduced on a bare host.


Why Containers

Megatron-LM depends on CUDA, NCCL, PyTorch with GPU support, TransformerEngine, and optional components like ModelOpt and DeepEP. Installing these on a bare host is fragile and hard to reproduce. The project ships Dockerfiles that pin every dependency.

Use the container as your development environment. This guarantees:

  • Identical CUDA / NCCL / cuDNN versions across all developers and CI.
  • uv.lock resolves the same way locally and in CI.
  • GPU-dependent operations (training, testing) work out of the box.

Option A — Use a Pre-built Image (fastest)

Images are tagged by PR number and commit SHA:

# Pull the latest image built for the current PR branch:
PR_NUMBER=$(git rev-parse --abbrev-ref HEAD | grep -oP '(?<=pull-request/)\d+')
docker pull 766267172432.dkr.ecr.us-east-1.amazonaws.com/megatron-lm:${PR_NUMBER}

# Or pull the image built from main:
docker pull 766267172432.dkr.ecr.us-east-1.amazonaws.com/megatron-lm:main
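
The tag selection above can also be expressed as a small helper. This is a minimal sketch, not project code: the function name is hypothetical, and falling back to the main tag when the branch is not a pull-request/<number> branch is an assumption. Unlike the grep lookbehind, it only accepts branch names that match the pattern exactly.

```python
import re

# Registry path copied from the pull commands above.
REGISTRY = "766267172432.dkr.ecr.us-east-1.amazonaws.com/megatron-lm"


def image_ref_for_branch(branch: str) -> str:
    """Return the image reference for a branch (hypothetical helper).

    Branches named pull-request/<number> map to the PR-number tag.
    Anything else falls back to the main tag, which is an assumption
    of this sketch rather than documented behaviour.
    """
    match = re.fullmatch(r"pull-request/(\d+)", branch)
    tag = match.group(1) if match else "main"
    return f"{REGISTRY}:{tag}"


if __name__ == "__main__":
    print(image_ref_for_branch("pull-request/1234"))
    print(image_ref_for_branch("my-feature"))
```

The returned string can be passed directly to docker pull.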

Option B — Build from Scratch

# dev image (default)
docker build \
  --build-arg FROM_IMAGE_NAME=$(cat docker/.ngc_version.dev) \
  --build-arg IMAGE_TYPE=dev \
  -f docker/Dockerfile.ci.dev \
  -t megatron-lm:local .

# lts image
docker build \
  --build-arg FROM_IMAGE_NAME=$(cat docker/.ngc_version.lts) \
  --build-arg IMAGE_TYPE=lts \
  -f docker/Dockerfile.ci.dev \
  -t megatron-lm:local-lts .

In CI, the PR label container::lts selects the lts image variant; without that label, the dev variant is used.

Running Work Inside the Container

# Mount the repo at /workspace; quoting keeps paths with spaces intact
docker run --rm --gpus all \
  -v "$(pwd)":/workspace \
  -w /workspace \
  megatron-lm:local \
  bash -c "<your command>"

Dependency Management

Dependencies are declared in pyproject.toml. The venv lives at /opt/venv inside the container (already on PATH).

All uv operations must be run inside the container. Never run uv sync / uv pip install on the host.

uv Dependency Groups

| Group | Purpose |
|---|---|
| training | Runtime training extras |
| dev | Full dev environment (TransformerEngine, ModelOpt, …) |
| lts | LTS-safe subset (no ModelOpt) |
| test | pytest, coverage, nemo-run |
| linting | ruff, black, isort, pylint |
| build | Cython, pybind11, nvidia-mathdx |

Install commands (inside the container):

# Full dev + test environment
uv sync --locked --group dev --group test

# Linting only
uv sync --locked --only-group linting

# LTS environment
uv sync --locked --group lts --group test

Several dependencies are sourced directly from git (TransformerEngine, nemo-run, FlashMLA, Emerging-Optimizers, nvidia-resiliency-ext). uv.lock pins their exact revisions; regenerate it with uv lock whenever pyproject.toml changes.


Linting

Run before opening a PR:

# Check mode (no changes applied)
BASE_REF=main CHECK_ONLY=true SKIP_DOCS=false bash tools/autoformat.sh

Tools invoked: black, isort, pylint, ruff, mypy.


Running Tests

Test Layout

tests/
├── unit_tests/          # pytest, 1 node × 8 GPUs, torch.distributed runner
├── functional_tests/    # end-to-end shell + training scripts
└── test_utils/
    ├── recipes/
    │   ├── h100/        # YAML recipes for H100 jobs
    │   └── gb200/       # YAML recipes for GB200 jobs
    └── python_scripts/  # helpers (recipe_parser, golden-value download, …)

How Tests Execute

All tests run on a single DGX H100 node (8 GPUs). The GitHub Actions runner invokes launch_nemo_run_workload.py, which uses nemo-run to launch a DockerExecutor container. The repo is bind-mounted at /opt/megatron-lm; training data is mounted at /mnt/artifacts.

Unit tests are dispatched through torch.distributed.run:

  • Ranks 0 and 3 are tee-d to stdout; all other ranks write only to log files.
  • Per-rank log files land at {assets_dir}/logs/1/ and are uploaded as a GitHub artifact after the run.

Functional tests are driven by tests/functional_tests/shell_test_utils/run_ci_test.sh. Only rank 0 runs the pytest validation step; training output from all ranks is uploaded as an artifact.

Flaky-failure auto-retry: launch_nemo_run_workload.py retries up to 3 times for known transient patterns (NCCL timeout, ECC error, segfault, HuggingFace connectivity, …) before declaring a genuine failure.
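
The retry behaviour can be pictured with the following sketch. It is not the code in launch_nemo_run_workload.py: the pattern list, the function names, and the reading of "up to 3 times" as three attempts in total are all assumptions made for illustration.

```python
import re
from typing import Callable, Tuple

# Hypothetical patterns; the real list lives in launch_nemo_run_workload.py.
TRANSIENT_PATTERNS = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (r"nccl.*timeout", r"ecc error", r"segmentation fault")
]


def is_transient(log_text: str) -> bool:
    """True if the log matches any known transient failure pattern."""
    return any(pattern.search(log_text) for pattern in TRANSIENT_PATTERNS)


def run_with_retries(
    run_once: Callable[[], Tuple[bool, str]], max_attempts: int = 3
) -> Tuple[bool, int]:
    """Run a workload, retrying only while failures look transient.

    run_once returns (passed, log_text). The result is (passed, attempts used).
    """
    for attempt in range(1, max_attempts + 1):
        passed, log_text = run_once()
        if passed:
            return True, attempt
        if not is_transient(log_text):
            # Genuine failure: report it immediately instead of retrying.
            return False, attempt
    return False, max_attempts
```

The point of the design is that a genuine failure, such as an assertion error, is reported after the first attempt, so retries never hide a real regression.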

Recipe YAML Structure

Recipes live in tests/test_utils/recipes/ and are parsed by tests/test_utils/python_scripts/recipe_parser.py. The parser expands each file's products block, as a Cartesian product of the listed values, into individual workload specs:

type: basic
format_version: 1
spec:
  name: "{test_case}_{environment}_{platforms}_{tag}"
  model: gpt
  nodes: 1
  gpus: 8
  platforms: dgx_h100
  time_limit: 1800
  script_setup: |
    ...
  script: |-
    bash tests/unit_tests/run_ci_test.sh \
      --tag {tag} \
      --environment {environment} \
      --bucket "tests/unit_tests/models/**/*.py" \
      --log-dir {assets_dir}/logs/1/
products:
  - test_case: [my_test]
    environment: [dev, lts]
    tag: [latest, legacy]
    scope: [mr-github]
    n_repeat: [1]
    time_limit: [1800]

Key runtime placeholders: {assets_dir}, {artifacts_dir}, {test_case}, {environment}, {platforms}, {tag}, {n_repeat}.
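
To make the expansion concrete, here is a minimal sketch of a Cartesian expansion over the products block shown above. It is not the implementation in recipe_parser.py; expand_products and render_name are hypothetical names, and filling {platforms} from the spec section is an assumption.

```python
from itertools import product
from typing import Any, Dict, List


def expand_products(products: List[Dict[str, List[Any]]]) -> List[Dict[str, Any]]:
    """Expand each products entry into one dict per value combination."""
    workloads = []
    for block in products:
        keys = list(block)
        for values in product(*(block[key] for key in keys)):
            workloads.append(dict(zip(keys, values)))
    return workloads


def render_name(template: str, spec_fields: Dict[str, Any], workload: Dict[str, Any]) -> str:
    """Fill the name template; workload values override spec-level fields."""
    return template.format(**{**spec_fields, **workload})


# The products block from the recipe above, as plain Python data.
products = [{
    "test_case": ["my_test"],
    "environment": ["dev", "lts"],
    "tag": ["latest", "legacy"],
    "scope": ["mr-github"],
    "n_repeat": [1],
    "time_limit": [1800],
}]

workloads = expand_products(products)
names = [
    render_name("{test_case}_{environment}_{platforms}_{tag}", {"platforms": "dgx_h100"}, w)
    for w in workloads
]
```

With two environments and two tags, this block yields four workloads, one per (environment, tag) pair.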

Adding a Unit Test

  1. Create tests/unit_tests/<category>/test_<name>.py.

  2. Use fixtures from tests/unit_tests/conftest.py.

  3. Apply markers as needed:

    • @pytest.mark.internal: skipped on legacy tag
    • @pytest.mark.flaky: skipped in lts environment
    • @pytest.mark.experimental: latest tag only
  4. Verify the test runs locally inside the container:

    pytest -xvs tests/unit_tests/<category>/test_<name>.py
    
  5. If the test needs a dedicated CI bucket, add an entry to tests/test_utils/recipes/h100/unit-tests.yaml.
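
The marker rules in step 3 can be summarised as a small decision function. This is only a sketch of the rules as listed in this guide, not the selection logic the test runner actually uses; should_run is a hypothetical name.

```python
from typing import Iterable


def should_run(markers: Iterable[str], tag: str, environment: str) -> bool:
    """Decide whether a test runs, following the marker rules listed above."""
    marker_set = set(markers)
    if "internal" in marker_set and tag == "legacy":
        return False  # internal tests are skipped on the legacy tag
    if "flaky" in marker_set and environment == "lts":
        return False  # flaky tests are skipped in the lts environment
    if "experimental" in marker_set and tag != "latest":
        return False  # experimental tests run on the latest tag only
    return True
```

For example, a test marked flaky would run for (tag=latest, environment=dev) but not for (tag=latest, environment=lts).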

Adding a Functional / Integration Test

  1. Create tests/functional_tests/test_cases/<model>/<test_name>/.

  2. Write the shell test script; use {assets_dir} for all output paths.

  3. Add a YAML recipe under tests/test_utils/recipes/h100/ (and gb200/ if needed). Required fields: scope, environment, platforms, n_repeat, time_limit.

  4. Push the PR and add the label "Run functional tests" to trigger a full run.

  5. After a successful run, download golden values:

    python tests/test_utils/python_scripts/download_golden_values.py \
      --source github --pipeline-id <run-id>
    
  6. Commit the downloaded golden values.

CI Test Scope Labels

| PR label | Scope | Behaviour |
|---|---|---|
| (none) | mr-github-slim | Lightweight subset, fast feedback |
| Run tests | mr-github | Full suite, lightweight mode (4 steps, no golden compare) |
| Run functional tests | mr-github | Full suite, 100-step training + golden compare, n_repeat=5 |
| container::lts | (any) | Use the LTS base image instead of dev |
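
The label table can be read as a mapping from a PR's label set to CI settings. The sketch below illustrates that mapping only; the real decision is made by the workflow, not by this code. The function name and dictionary keys are hypothetical, giving "Run functional tests" precedence over "Run tests" is an assumption, and None marks values the table does not state.

```python
from typing import Any, Dict, Iterable


def resolve_ci_settings(labels: Iterable[str]) -> Dict[str, Any]:
    """Map PR labels to scope, golden comparison, n_repeat and image variant."""
    label_set = set(labels)
    image = "lts" if "container::lts" in label_set else "dev"

    if "Run functional tests" in label_set:
        # Full suite with 100-step training and golden-value comparison.
        return {"scope": "mr-github", "golden_compare": True, "n_repeat": 5, "image": image}
    if "Run tests" in label_set:
        # Full suite in lightweight mode (4 steps, no golden compare).
        return {"scope": "mr-github", "golden_compare": False, "n_repeat": None, "image": image}
    # No test label: lightweight subset for fast feedback.
    return {"scope": "mr-github-slim", "golden_compare": None, "n_repeat": None, "image": image}
```

Note that container::lts is orthogonal to the test labels: it changes only the image variant and leaves the scope untouched.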

CI Pipeline

The main workflow is .github/workflows/cicd-main.yml. It triggers on pushes to branches matching pull-request/[0-9]+ and deploy-release/*, on merge groups, on a daily schedule, and on manual dispatch.

Pipeline Structure

is-not-external-contributor
  └─ pre-flight
       └─ configure          # determines scope, container tag, n_repeat
            ├─ linting
            ├─ cicd-container-build
            │    ├─ cicd-parse-unit-tests → cicd-unit-tests-latest
            │    ├─ cicd-parse-integration-tests-h100 → cicd-integration-tests-latest-h100
            │    └─ cicd-parse-integration-tests-gb200 → cicd-integration-tests-latest-gb200 (maintainers only)
            └─ Nemo_CICD_Test  # final pass/fail gate

Images are pushed to:

  • AWS ECR: 766267172432.dkr.ecr.us-east-1.amazonaws.com/…
  • GCP Artifact Registry: us-east4-docker.pkg.dev/nv-projdgxchipp-20260113193621/megatron-lm/…

CI Failure Investigation

CI branches follow the pattern pull-request/<number>.

Locating the PR from a CI Branch

# Extract PR number from the current branch
PR_NUMBER=$(git rev-parse --abbrev-ref HEAD | grep -oP '(?<=pull-request/)\d+')

# Fetch the PR metadata (title, labels, author, base branch)
gh pr view "$PR_NUMBER" --repo NVIDIA/Megatron-LM

# Show the changeset for that PR
gh pr diff "$PR_NUMBER" --repo NVIDIA/Megatron-LM

# List the files changed in the PR
gh pr view "$PR_NUMBER" --repo NVIDIA/Megatron-LM --json files --jq '.files[].path'

If the branch name contains a non-numeric suffix (e.g. pull-request/my-branch), search by branch name instead:

gh pr list --repo NVIDIA/Megatron-LM --head "pull-request/my-branch"

Reading CI Job Logs

# List recent workflow runs for the PR
gh run list --repo NVIDIA/Megatron-LM --branch "pull-request/$PR_NUMBER"

# Show summary of a specific run
gh run view <run-id> --repo NVIDIA/Megatron-LM

# Print the logs of failed steps only (runner output, i.e. stdout of ranks 0 and 3)
gh run view <run-id> --repo NVIDIA/Megatron-LM --log-failed

Full per-rank logs are not in the runner stdout. They are uploaded as GitHub artifacts named logs-<test_case>-<run_id>-<uuid>.

If the runner output does not show a clear error, download the full artifact and crawl all rank logs:

# 1. Find the artifact name for the failing run (via the REST API)
gh api "repos/NVIDIA/Megatron-LM/actions/runs/<run-id>/artifacts" \
  --jq '.artifacts[].name'

# 2. Download the artifact zip
gh run download <run-id> --repo NVIDIA/Megatron-LM \
  --name "logs-<artifact-name>" -D ./ci-logs

# 3. Locate which rank logs contain errors (file list only, no content yet)
grep -r -l "ERROR\|Traceback\|

---

*Content truncated.*
