build-and-test

Name: build-and-test
Author: MindLab-Research


Install

mkdir -p .claude/skills/build-and-test-mindlab-research && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/14659" && unzip -o skill.zip -d .claude/skills/build-and-test-mindlab-research && rm skill.zip

Installs to .claude/skills/build-and-test-mindlab-research

Activation

This is the description your AI agent reads to decide when to run this skill — the better it matches your request, the more reliably it fires.

Developer environment setup, CI/CD workflows, and CI failure debugging for Megatron-LM. Covers container-based development, uv package management, linting, running tests, CI failure investigation, and common pitfalls. Use when onboarding, setting up a dev environment, troubleshooting build issues, or investigating CI failures.


About this skill

Developer Guide

This guide covers the recommended development workflow for Megatron-LM. The core principle: build and develop inside containers — the CI container ships the correct CUDA toolkit, PyTorch build, and pre-compiled native extensions (TransformerEngine, DeepEP, …) that cannot be reproduced on a bare host.

Why Containers

Megatron-LM depends on CUDA, NCCL, PyTorch with GPU support, TransformerEngine, and optional components like ModelOpt and DeepEP. Installing these on a bare host is fragile and hard to reproduce. The project ships Dockerfiles that pin every dependency.

Use the container as your development environment. This guarantees:

- Identical CUDA / NCCL / cuDNN versions across all developers and CI.
- uv.lock resolves the same way locally and in CI.
- GPU-dependent operations (training, testing) work out of the box.

Option A — Use a Pre-built Image (fastest)

Images are tagged by PR number and commit SHA:

# Pull the latest image built for the current PR branch:
PR_NUMBER=$(git rev-parse --abbrev-ref HEAD | grep -oP '(?<=pull-request/)\d+')
docker pull 766267172432.dkr.ecr.us-east-1.amazonaws.com/megatron-lm:${PR_NUMBER}

# Or pull the image built from main:
docker pull 766267172432.dkr.ecr.us-east-1.amazonaws.com/megatron-lm:main

Option B — Build from Scratch

# dev image (default)
docker build \
  --build-arg FROM_IMAGE_NAME=$(cat docker/.ngc_version.dev) \
  --build-arg IMAGE_TYPE=dev \
  -f docker/Dockerfile.ci.dev \
  -t megatron-lm:local .

# lts image
docker build \
  --build-arg FROM_IMAGE_NAME=$(cat docker/.ngc_version.lts) \
  --build-arg IMAGE_TYPE=lts \
  -f docker/Dockerfile.ci.dev \
  -t megatron-lm:local-lts .

Which image variant is used is controlled by the PR label container::lts; absent that label, dev is used.

Running Work Inside the Container

docker run --rm --gpus all \
  -v "$(pwd)":/workspace \
  -w /workspace \
  megatron-lm:local \
  bash -c "<your command>"
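
When the container is driven from a Python script, the same invocation can be assembled as an argument list, which avoids an extra layer of shell quoting around the command string. This is a minimal sketch: the helper name `container_argv` and its defaults are illustrative and not part of the repository; the flags are the ones from the command above.

```python
import os
import shlex


def container_argv(command, image="megatron-lm:local", repo_dir=None):
    """Build the `docker run` argument list shown above (hypothetical helper)."""
    repo_dir = os.path.abspath(repo_dir or os.getcwd())
    return [
        "docker", "run", "--rm", "--gpus", "all",
        "-v", f"{repo_dir}:/workspace",  # bind-mount the checkout
        "-w", "/workspace",              # run from the repo root
        image,
        "bash", "-c", command,
    ]


argv = container_argv(
    "uv sync --locked --group dev --group test",
    repo_dir="/tmp/megatron-lm",  # example path, purely illustrative
)
print(shlex.join(argv))  # printable form; pass `argv` itself to subprocess.run
```

Passing the list to `subprocess.run(argv, check=True)` would launch the container; the sketch only prints the command so it can be inspected first.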

Dependency Management

Dependencies are declared in pyproject.toml. The venv lives at /opt/venv inside the container (already on PATH).

All uv operations must be run inside the container. Never run uv sync / uv pip install on the host.

uv Dependency Groups

Group	Purpose
`training`	Runtime training extras
`dev`	Full dev environment (TransformerEngine, ModelOpt, …)
`lts`	LTS-safe subset (no ModelOpt)
`test`	pytest, coverage, nemo-run
`linting`	ruff, black, isort, pylint
`build`	Cython, pybind11, nvidia-mathdx

Install commands (inside the container):

# Full dev + test environment
uv sync --locked --group dev --group test

# Linting only
uv sync --locked --only-group linting

# LTS environment
uv sync --locked --group lts --group test

Several dependencies are sourced directly from git (TransformerEngine, nemo-run, FlashMLA, Emerging-Optimizers, nvidia-resiliency-ext). The locked uv.lock file pins exact revisions; update it with uv lock when changing pyproject.toml.

Linting

Run before opening a PR:

# Check mode (no changes applied)
BASE_REF=main CHECK_ONLY=true SKIP_DOCS=false bash tools/autoformat.sh

Tools invoked: black, isort, pylint, ruff, mypy.

Running Tests

Test Layout

tests/
├── unit_tests/          # pytest, 1 node × 8 GPUs, torch.distributed runner
├── functional_tests/    # end-to-end shell + training scripts
└── test_utils/
    ├── recipes/
    │   ├── h100/        # YAML recipes for H100 jobs
    │   └── gb200/       # YAML recipes for GB200 jobs
    └── python_scripts/  # helpers (recipe_parser, golden-value download, …)

How Tests Execute

All tests run on a single DGX H100 node (8 GPUs). The GitHub Actions runner invokes launch_nemo_run_workload.py, which uses nemo-run to launch a DockerExecutor container. The repo is bind-mounted at /opt/megatron-lm; training data is mounted at /mnt/artifacts.

Unit tests are dispatched through torch.distributed.run:

- Ranks 0 and 3 are tee-d to stdout; all other ranks write only to log files.
- Per-rank log files land at {assets_dir}/logs/1/ and are uploaded as a GitHub artifact after the run.

Functional tests are driven by tests/functional_tests/shell_test_utils/run_ci_test.sh. Only rank 0 runs the pytest validation step; training output from all ranks is uploaded as an artifact.

Flaky-failure auto-retry: launch_nemo_run_workload.py retries up to 3 times for known transient patterns (NCCL timeout, ECC error, segfault, HuggingFace connectivity, …) before declaring a genuine failure.
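
The retry behaviour described above can be pictured roughly as follows. This is a sketch under stated assumptions, not the code in launch_nemo_run_workload.py: the pattern strings are illustrative stand-ins for the real list, "up to 3 times" is read as three attempts in total, and the function names are hypothetical.

```python
import re

# Illustrative patterns only; the real list lives in launch_nemo_run_workload.py.
TRANSIENT_PATTERNS = [r"NCCL.*timeout", r"ECC error", r"Segmentation fault"]


def is_transient(log_text, patterns=TRANSIENT_PATTERNS):
    """True if the log matches any known transient-failure pattern."""
    return any(re.search(p, log_text, re.IGNORECASE) for p in patterns)


def run_with_retries(run_once, max_attempts=3):
    """run_once() -> (exit_code, log_text). Returns (exit_code, attempts_used)."""
    for attempt in range(1, max_attempts + 1):
        exit_code, log_text = run_once()
        if exit_code == 0:
            return 0, attempt
        if not is_transient(log_text):
            return exit_code, attempt  # genuine failure: do not retry
    return exit_code, max_attempts     # still failing after all attempts


# A fake workload that hits a transient error once, then succeeds.
outcomes = iter([(1, "NCCL watchdog timeout"), (0, "ok")])
print(run_with_retries(lambda: next(outcomes)))  # (0, 2)
```

The point of the split is that a failure which matches no transient pattern is reported immediately, so a real regression is not hidden behind retries.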

Recipe YAML Structure

Recipes live in tests/test_utils/recipes/ and are parsed by tests/test_utils/python_scripts/recipe_parser.py. The parser expands each file's products block into individual workload specs, one per combination of the listed values (their Cartesian product):

type: basic
format_version: 1
spec:
  name: "{test_case}_{environment}_{platforms}_{tag}"
  model: gpt
  nodes: 1
  gpus: 8
  platforms: dgx_h100
  time_limit: 1800
  script_setup: |
    ...
  script: |-
    bash tests/unit_tests/run_ci_test.sh \
      --tag {tag} \
      --environment {environment} \
      --bucket "tests/unit_tests/models/**/*.py" \
      --log-dir {assets_dir}/logs/1/
products:
  - test_case: [my_test]
    environment: [dev, lts]
    tag: [latest, legacy]
    scope: [mr-github]
    n_repeat: [1]
    time_limit: [1800]

Key runtime placeholders: {assets_dir}, {artifacts_dir}, {test_case}, {environment}, {platforms}, {tag}, {n_repeat}.
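
To make the expansion concrete, here is a small sketch of how a products block could be turned into workload specs. It is an approximation for illustration, not the logic of recipe_parser.py: the function name `expand_products` is hypothetical, only the name template is formatted, and runtime placeholders such as {assets_dir} are left out because they are resolved later.

```python
import itertools


def expand_products(spec, products):
    """Expand each products entry into one workload dict per combination."""
    workloads = []
    for entry in products:
        keys = list(entry)
        for values in itertools.product(*(entry[k] for k in keys)):
            params = {**spec, **dict(zip(keys, values))}
            params["name"] = spec["name"].format(**params)
            workloads.append(params)
    return workloads


spec = {
    "name": "{test_case}_{environment}_{platforms}_{tag}",
    "platforms": "dgx_h100",
}
products = [
    {
        "test_case": ["my_test"],
        "environment": ["dev", "lts"],
        "tag": ["latest", "legacy"],
    }
]

for workload in expand_products(spec, products):
    print(workload["name"])
```

With the values from the recipe above, two environments times two tags yield four workloads, starting with my_test_dev_dgx_h100_latest.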

Adding a Unit Test

1. Create tests/unit_tests/<category>/test_<name>.py.
2. Use fixtures from tests/unit_tests/conftest.py.
3. Apply markers as needed:
   - @pytest.mark.internal: skipped on the legacy tag
   - @pytest.mark.flaky: skipped in the lts environment
   - @pytest.mark.experimental: runs on the latest tag only

Verify the test runs locally inside the container:

pytest -xvs tests/unit_tests/<category>/test_<name>.py

If the test needs a dedicated CI bucket, add an entry to tests/test_utils/recipes/h100/unit-tests.yaml.
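
The marker rules listed above can be read as a small decision function. The sketch below only restates those three rules; how the project actually enforces them (conftest hooks, `-m` expressions, or otherwise) is not shown in this guide, and the name `should_run` is hypothetical.

```python
def should_run(markers, tag, environment):
    """Return True if a test with these markers runs for the given tag/environment."""
    if "internal" in markers and tag == "legacy":
        return False  # internal tests are skipped on the legacy tag
    if "flaky" in markers and environment == "lts":
        return False  # flaky tests are skipped in the lts environment
    if "experimental" in markers and tag != "latest":
        return False  # experimental tests run on the latest tag only
    return True


for tag in ("latest", "legacy"):
    for environment in ("dev", "lts"):
        print(tag, environment, should_run({"flaky"}, tag, environment))
```

Reading the rules this way makes it easier to predict in which of the recipe's tag and environment combinations a newly marked test will actually execute.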

Adding a Functional / Integration Test

1. Create tests/functional_tests/test_cases/<model>/<test_name>/.
2. Write the shell test script; use {assets_dir} for all output paths.
3. Add a YAML recipe under tests/test_utils/recipes/h100/ (and gb200/ if needed). Required fields: scope, environment, platforms, n_repeat, time_limit.
4. Push the PR and add the label "Run functional tests" to trigger a full run.

After a successful run, download golden values:

python tests/test_utils/python_scripts/download_golden_values.py \
  --source github --pipeline-id <run-id>

Commit the downloaded golden values.

CI Test Scope Labels

PR label	Scope	Behaviour
(none)	`mr-github-slim`	Lightweight subset, fast feedback
`Run tests`	`mr-github`	Full suite, lightweight mode (4 steps, no golden compare)
`Run functional tests`	`mr-github`	Full suite, 100-step training + golden compare, n_repeat=5
`container::lts`	(any)	Use the LTS base image instead of dev
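
The table can be summarised as a mapping from PR labels to CI settings. The sketch below is an interpretation of the table, not the workflow's actual configure step: the function name and returned keys are hypothetical, and two points are assumptions the table does not state (the functional-test label wins when both test labels are present, and an unlabelled PR does no golden-value comparison).

```python
def configure(labels):
    """Map PR labels to CI settings, following the table above."""
    labels = set(labels)
    functional = "Run functional tests" in labels
    if functional or "Run tests" in labels:
        scope = "mr-github"
    else:
        scope = "mr-github-slim"
    return {
        "scope": scope,
        # Assumption: only the functional-test label compares golden values.
        "golden_compare": functional,
        "image_type": "lts" if "container::lts" in labels else "dev",
    }


print(configure([]))
print(configure(["Run tests", "container::lts"]))
print(configure(["Run functional tests"]))
```

Note that container::lts is independent of the scope labels: it changes only the base image, so it can be combined with either test label.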

CI Pipeline

The main workflow is .github/workflows/cicd-main.yml. It triggers on pushes to branches matching pull-request/[0-9]+ and deploy-release/*, on merge groups, on a daily schedule, and on manual dispatch.

Pipeline Structure

is-not-external-contributor
  └─ pre-flight
       └─ configure          # determines scope, container tag, n_repeat
            ├─ linting
            ├─ cicd-container-build
            │    ├─ cicd-parse-unit-tests → cicd-unit-tests-latest
            │    ├─ cicd-parse-integration-tests-h100 → cicd-integration-tests-latest-h100
            │    └─ cicd-parse-integration-tests-gb200 → cicd-integration-tests-latest-gb200 (maintainers only)
            └─ Nemo_CICD_Test  # final pass/fail gate

Images are pushed to:

AWS ECR: 766267172432.dkr.ecr.us-east-1.amazonaws.com/…
GCP Artifact Registry: us-east4-docker.pkg.dev/nv-projdgxchipp-20260113193621/megatron-lm/…

CI Failure Investigation

CI branches for pull requests normally follow the pattern pull-request/<number>.

Locating the PR from a CI Branch

# Extract PR number from the current branch
PR_NUMBER=$(git rev-parse --abbrev-ref HEAD | grep -oP '(?<=pull-request/)\d+')

# Fetch the PR metadata (title, labels, author, base branch)
gh pr view "$PR_NUMBER" --repo NVIDIA/Megatron-LM

# Show the changeset for that PR
gh pr diff "$PR_NUMBER" --repo NVIDIA/Megatron-LM

# List the files changed in the PR
gh pr view "$PR_NUMBER" --repo NVIDIA/Megatron-LM --json files --jq '.files[].path'

If the branch name contains a non-numeric suffix (e.g. pull-request/my-branch), search by branch name instead:

gh pr list --repo NVIDIA/Megatron-LM --head "pull-request/my-branch"
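
The choice between the two lookups above can be sketched as a small function that returns the gh command to run. The function name `pr_lookup_argv` is hypothetical; the gh subcommands and flags are the ones shown above. Unlike the grep expression, this sketch requires the whole suffix to be numeric before treating it as a PR number.

```python
import re

REPO = "NVIDIA/Megatron-LM"


def pr_lookup_argv(branch):
    """Return the gh command that finds the PR for a CI branch."""
    match = re.fullmatch(r"pull-request/(\d+)", branch)
    if match:
        # Numeric suffix: the suffix is the PR number.
        return ["gh", "pr", "view", match.group(1), "--repo", REPO]
    # Anything else: search for a PR whose head is this branch.
    return ["gh", "pr", "list", "--repo", REPO, "--head", branch]


print(pr_lookup_argv("pull-request/1234"))
print(pr_lookup_argv("pull-request/my-branch"))
```

The returned list can be handed to `subprocess.run`; 1234 is a placeholder number used only for the demonstration.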

Reading CI Job Logs

# List recent workflow runs for the PR
gh run list --repo NVIDIA/Megatron-LM --branch "pull-request/$PR_NUMBER"

# Show summary of a specific run
gh run view <run-id> --repo NVIDIA/Megatron-LM

# Print the logs of failed steps (runner output: stdout of ranks 0 and 3 only)
gh run view <run-id> --repo NVIDIA/Megatron-LM --log-failed

Full per-rank logs are not in the runner stdout. They are uploaded as GitHub artifacts named logs-<test_case>-<run_id>-<uuid>.

If the runner output does not show a clear error, download the full artifact and crawl all rank logs:

# 1. Find the artifact name for the failing run
#    (gh run view has no artifacts JSON field, so query the REST API)
gh api repos/NVIDIA/Megatron-LM/actions/runs/<run-id>/artifacts \
  --jq '.artifacts[].name'

# 2. Download the artifact (gh extracts it into the target directory)
gh run download <run-id> --repo NVIDIA/Megatron-LM \
  --name "logs-<artifact-name>" -D ./ci-logs

# 3. Locate which rank logs contain errors (file list only, no content yet)
grep -r -l "ERROR\|Traceback\|
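
The file-listing step can also be sketched in Python, which is convenient when the results feed further analysis. Only the two markers visible in the command above (ERROR and Traceback) are used, since the rest of the original pattern is cut off; the function name and the demo file names are hypothetical.

```python
import pathlib
import re
import tempfile

ERROR_MARKERS = re.compile(r"ERROR|Traceback")


def files_with_errors(log_dir):
    """Return sorted paths of files under log_dir that contain an error marker."""
    hits = []
    for path in sorted(pathlib.Path(log_dir).rglob("*")):
        if path.is_file() and ERROR_MARKERS.search(path.read_text(errors="replace")):
            hits.append(path)
    return hits


# Demo on throwaway files; in practice, point it at ./ci-logs.
with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    (root / "rank0.log").write_text("step 1 ok\n")
    (root / "rank5.log").write_text("Traceback (most recent call last):\n")
    print([p.name for p in files_with_errors(root)])  # ['rank5.log']
```

Listing file names first, as the shell step does, keeps the output short; the matching files can then be opened one at a time.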

---

*Content truncated.*


Safety

Review before install

Runs shell / code

Automated static scan of the SKILL.md and repo. A flag describes what the skill can do — not a verdict. Always review code before installing.

Source & maintenance

Updated

3mo ago

Loads

~3,334 tokens
