build-and-test
Install
mkdir -p .claude/skills/build-and-test-mindlab-research && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/14659" && unzip -o skill.zip -d .claude/skills/build-and-test-mindlab-research && rm skill.zip
Installs to .claude/skills/build-and-test-mindlab-research
Developer environment setup, CI/CD workflows, and CI failure debugging for Megatron-LM. Covers container-based development, uv package management, linting, running tests, CI failure investigation, and common pitfalls. Use when onboarding, setting up a dev environment, troubleshooting build issues, or investigating CI failures.
About this skill
Developer Guide
This guide covers the recommended development workflow for Megatron-LM. The core principle: build and develop inside containers — the CI container ships the correct CUDA toolkit, PyTorch build, and pre-compiled native extensions (TransformerEngine, DeepEP, …) that cannot be reproduced on a bare host.
Why Containers
Megatron-LM depends on CUDA, NCCL, PyTorch with GPU support, TransformerEngine, and optional components like ModelOpt and DeepEP. Installing these on a bare host is fragile and hard to reproduce. The project ships Dockerfiles that pin every dependency.
Use the container as your development environment. This guarantees:
- Identical CUDA / NCCL / cuDNN versions across all developers and CI.
- uv.lock resolves the same way locally and in CI.
- GPU-dependent operations (training, testing) work out of the box.
Option A — Use a Pre-built Image (fastest)
Images are tagged by PR number and commit SHA:
# Pull the latest image built for the current PR branch:
PR_NUMBER=$(git rev-parse --abbrev-ref HEAD | grep -oP '(?<=pull-request/)\d+')
docker pull 766267172432.dkr.ecr.us-east-1.amazonaws.com/megatron-lm:${PR_NUMBER}
# Or pull the image built from main:
docker pull 766267172432.dkr.ecr.us-east-1.amazonaws.com/megatron-lm:main
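The branch-to-tag mapping used above can also be sketched in Python for use in helper scripts. This is a minimal sketch, not repository code: the function name `image_ref_for_branch` is hypothetical, and falling back to the `main` tag for non-PR branches is an assumption.

```python
import re

# Registry path taken from the pull commands above.
REGISTRY = "766267172432.dkr.ecr.us-east-1.amazonaws.com/megatron-lm"


def image_ref_for_branch(branch: str) -> str:
    """Return the image reference for a branch name.

    Assumption: any branch that is not `pull-request/<number>` falls back
    to the image tagged `main`.
    """
    match = re.fullmatch(r"pull-request/(\d+)", branch)
    tag = match.group(1) if match else "main"
    return f"{REGISTRY}:{tag}"
```

Unlike the `grep -oP` one-liner, this never yields an empty tag, so a later `docker pull` cannot be handed a reference ending in a bare colon.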
Option B — Build from Scratch
# dev image (default)
docker build \
--build-arg FROM_IMAGE_NAME=$(cat docker/.ngc_version.dev) \
--build-arg IMAGE_TYPE=dev \
-f docker/Dockerfile.ci.dev \
-t megatron-lm:local .
# lts image
docker build \
--build-arg FROM_IMAGE_NAME=$(cat docker/.ngc_version.lts) \
--build-arg IMAGE_TYPE=lts \
-f docker/Dockerfile.ci.dev \
-t megatron-lm:local-lts .
In CI, the image variant is selected by the PR label container::lts;
without that label, the dev image is used.
Running Work Inside the Container
docker run --rm --gpus all \
-v $(pwd):/workspace \
-w /workspace \
megatron-lm:local \
bash -c "<your command>"
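When the container is launched from a script, building the argument list explicitly avoids shell-quoting problems in the command string. The sketch below mirrors the `docker run` invocation above; the helper name `docker_run_argv` and its defaults are assumptions for illustration.

```python
import os
from typing import List, Optional


def docker_run_argv(
    command: str,
    image: str = "megatron-lm:local",
    repo_dir: Optional[str] = None,
) -> List[str]:
    """Build the argv for running `command` inside the dev container."""
    # Default to the current directory, like $(pwd) in the shell version.
    repo_dir = os.path.abspath(repo_dir or os.getcwd())
    return [
        "docker", "run", "--rm", "--gpus", "all",
        "-v", f"{repo_dir}:/workspace",  # bind-mount the checkout
        "-w", "/workspace",              # run from the mounted repo root
        image,
        "bash", "-c", command,
    ]
```

The resulting list can be passed to `subprocess.run(argv, check=True)` on a host with Docker and the NVIDIA container runtime installed.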
Dependency Management
Dependencies are declared in pyproject.toml. The venv lives at /opt/venv
inside the container (already on PATH).
All uv operations must be run inside the container. Never run uv sync / uv pip install on the host.
uv Dependency Groups
| Group | Purpose |
|---|---|
| training | Runtime training extras |
| dev | Full dev environment (TransformerEngine, ModelOpt, …) |
| lts | LTS-safe subset (no ModelOpt) |
| test | pytest, coverage, nemo-run |
| linting | ruff, black, isort, pylint |
| build | Cython, pybind11, nvidia-mathdx |
Install commands (inside the container):
# Full dev + test environment
uv sync --locked --group dev --group test
# Linting only
uv sync --locked --only-group linting
# LTS environment
uv sync --locked --group lts --group test
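The install commands above differ only in which groups they name and whether `--group` or `--only-group` is used. A small sketch of that pattern follows; the helper `uv_sync_argv` is hypothetical, and the group names are the ones listed in the table above.

```python
from typing import Iterable, List

# Group names as listed in the dependency-group table.
KNOWN_GROUPS = {"training", "dev", "lts", "test", "linting", "build"}


def uv_sync_argv(groups: Iterable[str], only: bool = False) -> List[str]:
    """Build a `uv sync --locked` command line for the given groups."""
    flag = "--only-group" if only else "--group"
    argv = ["uv", "sync", "--locked"]
    for group in groups:
        if group not in KNOWN_GROUPS:
            raise ValueError(f"unknown dependency group: {group}")
        argv += [flag, group]
    return argv
```

For example, `uv_sync_argv(["linting"], only=True)` reproduces the linting-only command.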
Several dependencies are sourced directly from git (TransformerEngine, nemo-run,
FlashMLA, Emerging-Optimizers, nvidia-resiliency-ext). The locked uv.lock file
pins exact revisions; update it with uv lock when changing pyproject.toml.
Linting
Run before opening a PR:
# Check mode (no changes applied)
BASE_REF=main CHECK_ONLY=true SKIP_DOCS=false bash tools/autoformat.sh
Tools invoked: black, isort, pylint, ruff, mypy.
Running Tests
Test Layout
tests/
├── unit_tests/            # pytest, 1 node × 8 GPUs, torch.distributed runner
├── functional_tests/      # end-to-end shell + training scripts
└── test_utils/
    ├── recipes/
    │   ├── h100/          # YAML recipes for H100 jobs
    │   └── gb200/         # YAML recipes for GB200 jobs
    └── python_scripts/    # helpers (recipe_parser, golden-value download, …)
How Tests Execute
All tests run on a single DGX H100 node (8 GPUs). The GitHub Actions runner
invokes launch_nemo_run_workload.py, which uses nemo-run to launch a
DockerExecutor container. The repo is bind-mounted at /opt/megatron-lm;
training data is mounted at /mnt/artifacts.
Unit tests are dispatched through torch.distributed.run:
- Ranks 0 and 3 are tee-d to stdout; all other ranks write only to log files.
- Per-rank log files land at {assets_dir}/logs/1/ and are uploaded as a GitHub artifact after the run.
Functional tests are driven by
tests/functional_tests/shell_test_utils/run_ci_test.sh. Only rank 0 runs the
pytest validation step; training output from all ranks is uploaded as an artifact.
Flaky-failure auto-retry: launch_nemo_run_workload.py retries up to
3 times for known transient patterns (NCCL timeout, ECC error, segfault,
HuggingFace connectivity, …) before declaring a genuine failure.
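The retry behaviour described above can be sketched as a loop that re-runs a job only when its output matches a known transient pattern. This is not the code in launch_nemo_run_workload.py: the regular expressions are hypothetical stand-ins, and treating "up to 3 times" as three attempts in total is an assumption.

```python
import re
from typing import Callable, Tuple

# Hypothetical patterns for illustration only; the real list is defined
# in launch_nemo_run_workload.py.
TRANSIENT_PATTERNS = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (r"nccl.*timeout", r"ecc error", r"segmentation fault")
]


def is_transient(log_text: str) -> bool:
    """True if the log matches any known transient-failure pattern."""
    return any(p.search(log_text) for p in TRANSIENT_PATTERNS)


def run_with_retries(
    run_once: Callable[[], Tuple[int, str]], max_attempts: int = 3
) -> Tuple[int, int]:
    """Run a job, retrying on transient failures.

    `run_once` returns (exit_code, log_text). Returns (exit_code, attempts).
    """
    exit_code = 1
    for attempt in range(1, max_attempts + 1):
        exit_code, log_text = run_once()
        if exit_code == 0 or not is_transient(log_text):
            # Success, or a genuine failure that should not be retried.
            return exit_code, attempt
    return exit_code, max_attempts
```

A practical consequence for debugging: a job that fails after several attempts failed transiently each time, while a job that fails on the first attempt hit an error outside the transient list.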
Recipe YAML Structure
Recipes live in tests/test_utils/recipes/ and are parsed by
tests/test_utils/python_scripts/recipe_parser.py. The parser expands each file's
products block as a Cartesian product into individual workload specs:
type: basic
format_version: 1
spec:
  name: "{test_case}_{environment}_{platforms}_{tag}"
  model: gpt
  nodes: 1
  gpus: 8
  platforms: dgx_h100
  time_limit: 1800
  script_setup: |
    ...
  script: |-
    bash tests/unit_tests/run_ci_test.sh \
      --tag {tag} \
      --environment {environment} \
      --bucket "tests/unit_tests/models/**/*.py" \
      --log-dir {assets_dir}/logs/1/
products:
  - test_case: [my_test]
    environment: [dev, lts]
    tag: [latest, legacy]
    scope: [mr-github]
    n_repeat: [1]
    time_limit: [1800]
Key runtime placeholders: {assets_dir}, {artifacts_dir}, {test_case},
{environment}, {platforms}, {tag}, {n_repeat}.
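To make the expansion concrete, the sketch below takes the products entry from the example recipe and enumerates every combination with `itertools.product`. It illustrates the idea only and is not the logic of recipe_parser.py; the function name `expand_products` is hypothetical, and `platforms` is passed in separately because the example defines it under `spec`.

```python
from itertools import product
from typing import Dict, List


def expand_products(block: Dict[str, list]) -> List[dict]:
    """Expand a dict of value lists into one dict per combination."""
    keys = list(block)
    return [dict(zip(keys, combo)) for combo in product(*(block[k] for k in keys))]


block = {
    "test_case": ["my_test"],
    "environment": ["dev", "lts"],
    "tag": ["latest", "legacy"],
    "scope": ["mr-github"],
    "n_repeat": [1],
    "time_limit": [1800],
}
workloads = expand_products(block)

# Fill the name template from the example recipe; unused keys are ignored.
name_template = "{test_case}_{environment}_{platforms}_{tag}"
names = [name_template.format(platforms="dgx_h100", **w) for w in workloads]
```

Two environments times two tags yields four workloads, from `my_test_dev_dgx_h100_latest` through `my_test_lts_dgx_h100_legacy`.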
Adding a Unit Test
- Create tests/unit_tests/<category>/test_<name>.py.
- Use fixtures from tests/unit_tests/conftest.py.
- Apply markers as needed:
  - @pytest.mark.internal: skipped on the legacy tag
  - @pytest.mark.flaky: skipped in the lts environment
  - @pytest.mark.experimental: latest tag only
- Verify the test runs locally inside the container:
  pytest -xvs tests/unit_tests/<category>/test_<name>.py
- If the test needs a dedicated CI bucket, add an entry to tests/test_utils/recipes/h100/unit-tests.yaml.
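The marker rules listed above determine whether a test runs for a given tag and environment. The sketch below restates them as a plain function so the combinations are easy to check; it is not how the repository's conftest or CI scripts implement selection, and the name `should_skip` is hypothetical.

```python
from typing import Iterable


def should_skip(markers: Iterable[str], tag: str, environment: str) -> bool:
    """Apply the documented marker rules for one (tag, environment) pair."""
    markers = set(markers)
    if "internal" in markers and tag == "legacy":
        return True   # internal tests are skipped on the legacy tag
    if "flaky" in markers and environment == "lts":
        return True   # flaky tests are skipped in the lts environment
    if "experimental" in markers and tag != "latest":
        return True   # experimental tests run on the latest tag only
    return False
```

For instance, a test marked `experimental` runs under `tag=latest, environment=dev` but is skipped under `tag=legacy`.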
Adding a Functional / Integration Test
- Create tests/functional_tests/test_cases/<model>/<test_name>/.
- Write the shell test script; use {assets_dir} for all output paths.
- Add a YAML recipe under tests/test_utils/recipes/h100/ (and gb200/ if needed). Required fields: scope, environment, platform, n_repeat, time_limit.
- Push the PR and add the label "Run functional tests" to trigger a full run.
- After a successful run, download golden values:
  python tests/test_utils/python_scripts/download_golden_values.py \
    --source github --pipeline-id <run-id>
- Commit the downloaded golden values.
CI Test Scope Labels
| PR label | Scope | Behaviour |
|---|---|---|
| (none) | mr-github-slim | Lightweight subset, fast feedback |
| Run tests | mr-github | Full suite, lightweight mode (4 steps, no golden compare) |
| Run functional tests | mr-github | Full suite, 100-step training + golden compare, n_repeat=5 |
| container::lts | (any) | Use the LTS base image instead of dev |
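The table can be read as a function from the set of PR labels to a test configuration. The sketch below encodes that reading; the function name `ci_config`, the `mode` strings, and the precedence of "Run functional tests" over "Run tests" when both labels are present are assumptions, not documented behaviour.

```python
from typing import Iterable


def ci_config(labels: Iterable[str]) -> dict:
    """Derive scope, mode, and image variant from PR labels."""
    labels = set(labels)
    # container::lts is independent of the scope labels.
    config = {"image": "lts" if "container::lts" in labels else "dev"}
    if "Run functional tests" in labels:
        # 100-step training with golden-value comparison.
        config.update(scope="mr-github", mode="full", n_repeat=5)
    elif "Run tests" in labels:
        # 4 steps, no golden-value comparison.
        config.update(scope="mr-github", mode="lightweight", n_repeat=None)
    else:
        config.update(scope="mr-github-slim", mode="slim", n_repeat=None)
    return config
```

`n_repeat` is left as `None` where the table does not state a value.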
CI Pipeline
The main workflow is .github/workflows/cicd-main.yml. It triggers on pushes
to branches matching pull-request/[0-9]+ and deploy-release/*, on merge
groups, on a daily schedule, and on manual dispatch.
Pipeline Structure
is-not-external-contributor
└─ pre-flight
   └─ configure                  # determines scope, container tag, n_repeat
      ├─ linting
      ├─ cicd-container-build
      │  ├─ cicd-parse-unit-tests → cicd-unit-tests-latest
      │  ├─ cicd-parse-integration-tests-h100 → cicd-integration-tests-latest-h100
      │  └─ cicd-parse-integration-tests-gb200 → cicd-integration-tests-latest-gb200 (maintainers only)
      └─ Nemo_CICD_Test          # final pass/fail gate
Images are pushed to:
- AWS ECR: 766267172432.dkr.ecr.us-east-1.amazonaws.com/…
- GCP Artifact Registry: us-east4-docker.pkg.dev/nv-projdgxchipp-20260113193621/megatron-lm/…
CI Failure Investigation
CI branches normally follow the pattern pull-request/<number>.
Locating the PR from a CI Branch
# Extract PR number from the current branch
PR_NUMBER=$(git rev-parse --abbrev-ref HEAD | grep -oP '(?<=pull-request/)\d+')
# Fetch the PR metadata (title, labels, author, base branch)
gh pr view "$PR_NUMBER" --repo NVIDIA/Megatron-LM
# Show the changeset for that PR
gh pr diff "$PR_NUMBER" --repo NVIDIA/Megatron-LM
# List the files changed in the PR
gh pr view "$PR_NUMBER" --repo NVIDIA/Megatron-LM --json files --jq '.files[].path'
If the branch name contains a non-numeric suffix (e.g. pull-request/my-branch),
search by branch name instead:
gh pr list --repo NVIDIA/Megatron-LM --head "pull-request/my-branch"
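The two lookup paths above (numeric PR branch versus arbitrary branch name) can be folded into one helper that picks the right `gh` command. This is a sketch under the assumption that the `gh` CLI is installed and authenticated; the function name `pr_lookup_argv` is hypothetical.

```python
import re
from typing import List

REPO = "NVIDIA/Megatron-LM"


def pr_lookup_argv(branch: str) -> List[str]:
    """Return the gh command that locates the PR for a CI branch."""
    match = re.fullmatch(r"pull-request/(\d+)", branch)
    if match:
        # Numeric suffix: the PR number is known, so view it directly.
        return ["gh", "pr", "view", match.group(1), "--repo", REPO]
    # Otherwise search open PRs by head branch name.
    return ["gh", "pr", "list", "--repo", REPO, "--head", branch]
```

The returned list can be passed to `subprocess.run(argv, check=True)`.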
Reading CI Job Logs
# List recent workflow runs for the PR
gh run list --repo NVIDIA/Megatron-LM --branch "pull-request/$PR_NUMBER"
# Show summary of a specific run
gh run view <run-id> --repo NVIDIA/Megatron-LM
# Print the logs of failed steps only (runner stdout: ranks 0 and 3)
gh run view <run-id> --repo NVIDIA/Megatron-LM --log-failed
Full per-rank logs are not in the runner stdout. They are uploaded as
GitHub artifacts named logs-<test_case>-<run_id>-<uuid>.
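Once such an artifact has been downloaded and extracted locally, the per-rank files can be scanned to find which ranks logged an error before opening any of them. The sketch below assumes the logs sit in a local directory such as ./ci-logs; the marker strings and the function name `files_with_errors` are illustrative choices, not a fixed list used by CI.

```python
from pathlib import Path
from typing import List

# Illustrative markers; extend as needed for the failure being investigated.
ERROR_MARKERS = ("ERROR", "Traceback")


def files_with_errors(log_dir: str) -> List[Path]:
    """Return the log files under log_dir that contain any error marker."""
    hits = []
    for path in sorted(Path(log_dir).rglob("*")):
        if not path.is_file():
            continue
        # Rank logs may contain stray bytes; never fail on decoding.
        text = path.read_text(errors="replace")
        if any(marker in text for marker in ERROR_MARKERS):
            hits.append(path)
    return hits
```

Listing the files first keeps the output short when only one or two ranks out of eight actually failed.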
If the runner output does not show a clear error, download the full artifact and crawl all rank logs:
# 1. Find the artifact name for the failing run
gh api "repos/NVIDIA/Megatron-LM/actions/runs/<run-id>/artifacts" \
--jq '.artifacts[].name'
# 2. Download the artifact (gh extracts it into the target directory)
gh run download <run-id> --repo NVIDIA/Megatron-LM \
--name "logs-<artifact-name>" -D ./ci-logs
# 3. Locate which rank logs contain errors (file list only, no content yet)
grep -r -l "ERROR\|Traceback\|
---
*Content truncated.*