ci: remove nick-fields/retry wrapper and add shared sinfo-based GPU partition selection by sbryngelson · Pull Request #1299 · MFlowCode/MFC

sbryngelson · 2026-03-10T03:45:48Z

Summary

Replace nick-fields/retry JS action with plain run: steps for Frontier builds (test.yml) and bench builds (bench.yml). The JS action wrapper was getting SIGKILL'd on Frontier login nodes after the build completed successfully, causing false build failures. Retry logic is handled by retry_build() in .github/scripts/retry-build.sh, which all cluster build.sh scripts already call.
Unified CI scripts — replace per-cluster submit/test/bench scripts with parameterized versions:
- submit-slurm-job.sh: single submit+monitor script for all clusters (replaces phoenix/submit-job.sh, phoenix/submit.sh, frontier/submit.sh). Cluster config (account, QOS, partitions, time limits) selected via case block. Idempotent stale-job cancellation now applies to all clusters.
- common/test.sh: unified test script with conditional build, cluster-aware GPU detection, thread counts, RDMA, and sharding.
- common/bench.sh: unified bench script with conditional build, TMPDIR management (Phoenix-only), and cluster-aware bench flags.
Shared select-gpu-partition.sh script for sinfo-based GPU partition selection, used by both test and benchmark jobs. GPU partition priority: gpu-rtx6000 → gpu-l40s → gpu-v100 → gpu-h200 → gpu-h100 → gpu-a100.
Parallel benchmark jobs require 2 idle/mix nodes (GPU_PARTITION_MIN_NODES=2) before selecting a partition, since PR and master benchmark jobs run concurrently.
Exclude dead GPU node atl1-1-03-002-29-0 (persistent cuInit error 999).
Delete dead code: run-tests-with-retry.sh (never called).

Workflow simplification

test.yml self job: 5 conditional steps → 2 (Build + Test)
test.yml case-opt job: 5 conditional steps → 3
11 per-cluster scripts deleted, 3 unified scripts added

Test plan

Trigger CI on a PR and verify Frontier build steps pass without false failures
Verify Phoenix test jobs land on an available GPU partition via sinfo selection
Verify Phoenix benchmark jobs (PR + master) land on the same partition with 2 available nodes
Verify all cluster/device/interface combinations produce correct SLURM submissions

…nodes The JS action wrapper gets SIGKILL'd on Frontier login nodes under memory pressure, falsely failing the Build step even when build.sh succeeds. retry_build() inside build.sh already handles 2-attempt retry with rm -rf build between attempts. Also move gpu-v100 to last in Phoenix GPU partition priority so SLURM prefers newer GPU nodes (a100/h100/l40s/h200) over the aging V100s that have had recurring driver issues.

Extract partition selection into select-gpu-partition.sh so both test jobs (submit-job.sh) and benchmark jobs (run_parallel_benchmarks.sh) use the same sinfo-based logic with a consistent priority order: gpu-rtx6000 -> gpu-l40s -> gpu-v100 -> gpu-h200 -> gpu-h100 -> gpu-a100 Tests now dynamically pick the best available partition rather than submitting to a static multi-partition list, matching the benchmark approach. Bench still exports BENCH_GPU_PARTITION so PR and master land on the same GPU type for fair comparisons.

Copilot

Pull request overview

This PR updates CI/SLURM automation to centralize Phoenix GPU partition selection logic and simplify the non-Phoenix build step in GitHub Actions.

Changes:

Replace the nick-fields/retry wrapper in test.yml with a direct run step + timeout-minutes.
Introduce a reusable .github/scripts/select-gpu-partition.sh and use it from both Phoenix benchmark and test submission paths.
Simplify .github/scripts/retry-build.sh by removing support for a post-build validation hook.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
`.github/workflows/test.yml`	Removes external retry wrapper around non-Phoenix build step.
`.github/workflows/phoenix/submit-job.sh`	Uses shared GPU partition selection for non-benchmark Phoenix submissions.
`.github/scripts/select-gpu-partition.sh`	New shared helper to pick an available Phoenix GPU partition via `sinfo`.
`.github/scripts/run_parallel_benchmarks.sh`	Refactors inline partition selection into the shared helper script.
`.github/scripts/retry-build.sh`	Removes RETRY_VALIDATE_CMD-based post-build validation behavior from retry loop.

.github/scripts/retry-build.sh

 # Provides retry_build(): 2-attempt loop.
 # On failure of attempt 1, nukes the entire build directory before attempt 2.
-# Set RETRY_VALIDATE_CMD to run a post-build validation; failure triggers a retry.
 # Usage: source .github/scripts/retry-build.sh
 #        retry_build ./mfc.sh build -j 8 --gpu acc

 retry_build() {
-    local validate_cmd="${RETRY_VALIDATE_CMD:-}"
    local max_attempts=2
    local attempt=1
    while [ $attempt -le $max_attempts ]; do
        echo "Build attempt $attempt of $max_attempts..."
        if "$@"; then
-            if [ -n "$validate_cmd" ]; then
-                if ! eval "$validate_cmd"; then
-                    echo "Post-build validation failed on attempt $attempt."
-                    if [ $attempt -lt $max_attempts ]; then
-                        echo "  Nuking build directory before retry..."
-                        rm -rf build 2>/dev/null || true
-                        sleep 5
-                        attempt=$((attempt + 1))
-                        continue
-                    else
-                        echo "Validation still failing after $max_attempts attempts."
-                        return 1
-                    fi
-                fi
-            fi
            echo "Build succeeded on attempt $attempt."
            return 0


.github/workflows/phoenix/submit-job.sh

 if [ "$job_type" = "bench" ]; then
-    bench_partition="${BENCH_GPU_PARTITION:-gpu-rtx6000}"
-    echo "Submitting bench GPU job to partition: $bench_partition (BENCH_GPU_PARTITION=${BENCH_GPU_PARTITION:-<unset, using default>})"
-    sbatch_gpu_opts="\
-#SBATCH -p $bench_partition
-#SBATCH --ntasks-per-node=4       # Number of cores per node required
-#SBATCH -G2\
-"
+    # BENCH_GPU_PARTITION is pre-selected by run_parallel_benchmarks.sh so both
+    # PR and master jobs land on the same GPU type for a fair comparison.
+    gpu_partition="${BENCH_GPU_PARTITION:-gpu-rtx6000}"
+    echo "Submitting bench GPU job to partition: $gpu_partition (BENCH_GPU_PARTITION=${BENCH_GPU_PARTITION:-<unset, using default>})"
    sbatch_time="#SBATCH -t 04:00:00"
 else
-    sbatch_gpu_opts="\
-#SBATCH -p gpu-v100,gpu-a100,gpu-h100,gpu-l40s,gpu-h200
+    source "$(dirname "${BASH_SOURCE[0]}")/../../scripts/select-gpu-partition.sh"
+    gpu_partition="$SELECTED_GPU_PARTITION"
+    sbatch_time="#SBATCH -t 03:00:00"


.github/scripts/retry-build.sh

coderabbitai

🧹 Nitpick comments (1)

.github/workflows/phoenix/submit-job.sh (1)
44-44: Consider using a more robust path resolution.

The relative path ../../scripts/select-gpu-partition.sh works correctly but is fragile if the script hierarchy changes. Consider extracting the repository root and using an absolute path pattern:
♻️ Optional: Use repo-root-relative path
 else
-    source "$(dirname "${BASH_SOURCE[0]}")/../../scripts/select-gpu-partition.sh"
+    _repo_root="$(cd "$(dirname "${BASH_SOURCE[0]}")/../../.." && pwd)"
+    source "${_repo_root}/.github/scripts/select-gpu-partition.sh"
     gpu_partition="$SELECTED_GPU_PARTITION"
     sbatch_time="#SBATCH -t 03:00:00"
 fi

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7d42e334-ff80-4016-adf4-47b1c6150d4a

📥 Commits

Reviewing files that changed from the base of the PR and between edff972 and 24185f8.

📒 Files selected for processing (5)

.github/scripts/retry-build.sh
.github/scripts/run_parallel_benchmarks.sh
.github/scripts/select-gpu-partition.sh
.github/workflows/phoenix/submit-job.sh
.github/workflows/test.yml

💤 Files with no reviewable changes (1)

.github/scripts/retry-build.sh

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Free Tier Details

You are on the Bugbot Free tier. On this plan, Bugbot will review limited PRs each billing cycle.

To receive Bugbot reviews on all of your PRs, visit the Cursor dashboard to activate Pro and start your 14-day free trial.

^{Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

.github/scripts/retry-build.sh

 #        retry_build ./mfc.sh build -j 8 --gpu acc

 retry_build() {
-    local validate_cmd="${RETRY_VALIDATE_CMD:-}"


Make bench jobs use sinfo-based GPU partition selection (via select-gpu-partition.sh) as a baseline, then override with BENCH_GPU_PARTITION only when run_parallel_benchmarks.sh has pre-selected a partition for PR/master consistency. Previously bench jobs fell back to a hardcoded gpu-rtx6000 when BENCH_GPU_PARTITION was unset. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…selection For parallel benchmarks (PR + master), both jobs need a GPU node concurrently, so require at least 2 idle/mix nodes before selecting a partition. Add GPU_PARTITION_MIN_NODES parameter to select-gpu-partition.sh (defaults to 1 for single-job test use). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

phoenix/test.sh relies on RETRY_VALIDATE_CMD to smoke-test the freshly built syscheck binary and trigger a rebuild on failure, catching architecture mismatches (SIGILL) from binaries compiled on a different compute node. Mistakenly removed in the previous commit as 'unused'. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

.github/scripts/select-gpu-partition.sh

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replace per-cluster submit/test/bench scripts with unified versions: - submit-slurm-job.sh: single parameterized submit+monitor script for all clusters (replaces phoenix/submit-job.sh, phoenix/submit.sh, frontier/submit.sh). Cluster config (account, QOS, partitions, time limits) is selected via a case block. Idempotent stale-job cancellation now applies to all clusters, not just Phoenix. - common/test.sh: unified test script with conditional build (skips if build/ exists from Frontier login-node build), cluster-aware GPU detection, thread counts, RDMA, and sharding. - common/bench.sh: unified bench script with conditional build, TMPDIR management (Phoenix-only), and cluster-aware bench flags. Also removes nick-fields/retry from bench.yml (frontier build.sh already uses retry_build internally) and deletes dead code (run-tests-with-retry.sh). test.yml self job: 5 conditional steps -> 2 steps (Build + Test). test.yml case-opt job: 5 conditional steps -> 3 steps. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Remove no-op 'rm -rf build' inside 'if [ ! -d build ]' guard in common/test.sh and common/bench.sh. - Default gpu_partition to 'batch' before dynamic selection to prevent unbound variable error if a new cluster is added. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

codecov · 2026-03-10T09:14:23Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 44.95%. Comparing base (edff972) to head (fde5fc4).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #1299   +/-   ##
=======================================
  Coverage   44.94%   44.95%           
=======================================
  Files          70       70           
  Lines       20504    20504           
  Branches     1946     1946           
=======================================
+ Hits         9216     9217    +1     
  Misses      10166    10166           
+ Partials     1122     1121    -1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

submit_and_monitor_bench.sh cd's into master/ before calling submit-slurm-job.sh, which reads the bench script via cat. Since master branch doesn't have common/bench.sh yet, the cat fails. Fix by resolving the bench script path from the PR tree (absolute path) so it works regardless of cwd. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions · 2026-03-10T14:14:55Z

Claude Code Review

Head SHA: 61b84cfdf1a75a4811f4577af409f8f6abf18182
Files changed: 20 (387 additions, 461 deletions — net reduction)

Changed files:

.github/scripts/submit-slurm-job.sh (new — unified SLURM submit+monitor)
.github/workflows/common/test.sh (new — unified test script)
.github/workflows/common/bench.sh (new — unified bench script)
.github/scripts/select-gpu-partition.sh (new — sinfo-based GPU selection)
.github/workflows/test.yml, bench.yml (workflow simplification)
11 per-cluster scripts deleted

Summary

Replaces nick-fields/retry JS wrapper (SIGKILL'd on Frontier login nodes) with plain run: steps; retry logic stays inside SLURM jobs via retry_build().
Consolidates 11 per-cluster scripts into 3 unified parameterized scripts; cluster-specific config lives in a case block in submit-slurm-job.sh.
Extracts GPU partition selection into a reusable select-gpu-partition.sh; GPU_PARTITION_MIN_NODES=2 ensures PR and master bench jobs land on the same partition.
Idempotent stale-job cancellation now covers all clusters uniformly.
Net code reduction of ~25% with improved consistency.

Findings

.github/workflows/common/bench.sh:50 — dropped -j $n_ranks from GPU bench command (possible regression)

Old frontier/bench.sh:

./mfc.sh bench --mem 4 -j $n_ranks -o "$job_slug.yaml" -- -c $job_cluster $device_opts -n $n_ranks

New common/bench.sh:

./mfc.sh bench --mem 4 -o "$job_slug.yaml" -- -c $bench_cluster $device_opts -n $n_ranks

The -j $n_ranks flag is gone in the GPU branch. If this controls harness-level parallelism for the benchmark runner, removing it changes behavior. Worth confirming — if -n already handles this, a short comment would clarify intent.

.github/scripts/select-gpu-partition.sh:20 — sinfo node-state regex misses modifier suffixes

_idle=$(sinfo -p "$_part" --noheader -o "%t" 2>/dev/null | grep -cE "^(idle|mix)" || true)

SLURM can emit states with modifiers like idle~ (power-saving) or idle+ (not-responding). These are not matched by ^(idle|mix). This is inherited from the old run_parallel_benchmarks.sh logic, not a regression introduced here — noting it as an improvement opportunity.

.github/workflows/common/bench.sh:15 — TMPDIR collision risk on Phoenix

currentdir=$tmpbuild/run-$(( RANDOM % 9000 ))

The random suffix space (0–8999) is small. Concurrent jobs on the same node can collide. Using $$ (PID) or mktemp -d would eliminate this. Pre-existing behavior from old phoenix scripts, but the unification is a good chance to fix it.

.github/scripts/submit-slurm-job.sh:155 — sbatch heredoc expands script contents in outer shell

submit_output=$(sbatch <<EOT
...
$sbatch_script_contents
EOT
)

Using an unquoted EOT delimiter causes the outer shell to expand $(), backticks, and ${var} in the embedded script before sbatch sees it. This is intentional here (to interpolate $job_slug, $job_device, etc. from the outer scope), but it means any future script sourced via this mechanism must not contain unexpanded ${} that differ from the outer shell's env. A comment explaining this design choice would help contributors avoid subtle bugs.

Minor Improvement Opportunities

submit-slurm-job.sh:60 — A new cluster added to the outer case "$cluster" block must also be added to the inner device case blocks or sbatch_device_opts will be unset (fails with -u). A comment noting this coupling would help.
bench.yml:115 — The bench Setup & Build step lost retry without explanation. Adding # retry handled inside build.sh via retry_build() would make the intent clear.
common/test.sh:27 — validate_cmd="" then RETRY_VALIDATE_CMD="$validate_cmd" for non-Phoenix paths. Since retry-build.sh reads ${RETRY_VALIDATE_CMD:-}, an empty assignment and unset behave the same today, but this is fragile. Prefer unset RETRY_VALIDATE_CMD for the non-Phoenix branch.

Overall this is a well-motivated refactor — the unified scripts are easier to reason about than 11 near-duplicate per-cluster files. The dropped -j bench flag is the only item I would want a clear answer on before merging.

RTX 6000 nodes can't finish the full test suite within the 3-hour SLURM wall time. Use gpu-l40s as the new fallback. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions · 2026-03-10T19:52:39Z

Claude Code Review

Head SHA: 1c4bd73c02dc2ed894d649c12f932bacd09e059f

Files changed: 20

Changed files

.github/scripts/retry-build.sh (comment/doc update)
.github/scripts/run-tests-with-retry.sh (deleted)
.github/scripts/run_parallel_benchmarks.sh (refactored to use shared script)
.github/scripts/select-gpu-partition.sh (new)
.github/scripts/submit-slurm-job.sh (new, replaces 3 per-cluster scripts)
.github/scripts/submit_and_monitor_bench.sh
.github/workflows/bench.yml
.github/workflows/common/bench.sh (new)
.github/workflows/common/test.sh (new)
11 deleted per-cluster scripts (frontier/{bench,submit,test}.sh, frontier_amd symlinks, phoenix/{bench,submit-job,submit,test}.sh)
.github/workflows/test.yml

Summary

Removes the nick-fields/retry JS wrapper that was getting SIGKILL'd on Frontier login nodes post-build, replacing it with plain run: steps — the underlying retry logic is provided by retry_build() in retry-build.sh.
Consolidates 11 per-cluster SLURM scripts into 3 unified scripts: submit-slurm-job.sh, common/test.sh, common/bench.sh.
Adds shared select-gpu-partition.sh for sinfo-based partition selection, used by both test and bench jobs.
Excludes the dead GPU node atl1-1-03-002-29-0 (cuInit 999) in all Phoenix GPU jobs.
Reduces CI workflow steps from 5 → 2 for the self job, 5 → 3 for the case-opt job.

Findings

[Medium] `common/test.sh`: Phoenix GPU tests may rebuild without GPU flags

File: .github/workflows/common/test.sh, line 536

Old phoenix/test.sh passed build_opts (containing --gpu acc) to the test command:

# phoenix/test.sh (deleted)
./mfc.sh test -v --max-attempts 3 -a -j $n_test_threads $device_opts ${build_opts:---no-gpu} -- -c phoenix

New common/test.sh does not pass the build interface flag to the test run:

# common/test.sh line 536
./mfc.sh test -v --max-attempts 3 -a -j $n_test_threads $rdma_opts $device_opts $shard_opts -- -c $job_cluster

build_opts (set at line 475 from gpu-opts.sh) is used for the dry-run build at line 496, but is not forwarded to the live test command. If ./mfc.sh test triggers an incremental rebuild, it would do so without --gpu acc / --gpu omp, potentially creating CPU-only binaries. Please verify that the cached CMake configuration prevents recompilation, or explicitly pass $build_opts (or an equivalent flag like --no-build) here.

[Minor] GPU partition priority change may shift benchmark baselines

File: .github/scripts/select-gpu-partition.sh, line 115

New priority: gpu-l40s gpu-h200 gpu-h100 gpu-a100 gpu-v100

Old run_parallel_benchmarks.sh priority: gpu-rtx6000 gpu-l40s gpu-v100 gpu-h200 gpu-h100 gpu-a100

The gpu-v100 moved from position 3 to last. Bench jobs that previously landed on v100 for reproducibility will now prefer h200/h100/a100. This is benign if intentional, but benchmark baselines in the historical record will appear as regressions/improvements when the partition switches. Consider adding a note to the PR/changelog that bench baselines will need to be re-established after this change.

[Minor] `common/bench.sh` drops `-j` flag from `./mfc.sh bench`

File: .github/workflows/common/bench.sh, lines 451–453

Old frontier/bench.sh:

./mfc.sh bench --mem 4 -j $n_ranks -o "$job_slug.yaml" -- -c $job_cluster ...

New:

./mfc.sh bench --mem 4 -o "$job_slug.yaml" -- -c $bench_cluster $device_opts -n $n_ranks

The -j $n_ranks flag was removed from ./mfc.sh bench. If this flag controls build parallelism for the bench step (vs. the separate n_jobs for the build step above), this may serialize benchmark compilation. Intentional?

[Nit] `bench.yml` Setup & Build: no retry for Frontier login-node build failures

File: .github/workflows/bench.yml, lines 389–397

The nick-fields/retry with on_retry_command: rm -rf pr/build master/build has been removed. The PR explanation is correct (SIGKILL after successful build caused false failures). However, transient login-node build failures on Frontier now fail the job outright with no retry. This is an acceptable trade-off given the described failure mode.

Overall this is a well-motivated cleanup that reduces duplication and eliminates a known false-failure mode. The main item to verify before merging is the missing build_opts in the Phoenix test command (#1 above).

github-actions · 2026-03-10T19:53:44Z

Claude Code Review

Head SHA: 1c4bd73c02dc2ed894d649c12f932bacd09e059f

Files changed: 20

File	+/-
`.github/scripts/submit-slurm-job.sh`	+200 (new)
`.github/workflows/common/test.sh`	+70 (new)
`.github/workflows/common/bench.sh`	+54 (new)
`.github/scripts/select-gpu-partition.sh`	+34 (new)
`.github/scripts/retry-build.sh`	+5/-2
`.github/workflows/test.yml`	+7/-34
`.github/workflows/bench.yml`	+9/-15
11 deleted per-cluster scripts	-849

Summary

Replaces nick-fields/retry JS action (SIGKILL'd on Frontier login nodes after successful builds) with plain run: steps, eliminating false CI failures.
Consolidates 11 per-cluster scripts into 3 unified scripts (submit-slurm-job.sh, common/test.sh, common/bench.sh), significantly reducing duplication.
Extracts sinfo-based GPU partition selection into a shared select-gpu-partition.sh, with a GPU_PARTITION_MIN_NODES=2 guard so parallel benchmark jobs land on the same partition.
Hardcodes exclusion of broken node atl1-1-03-002-29-0 (persistent cuInit error 999).
Deletes dead code (run-tests-with-retry.sh).

Findings

[Medium] bench.yml: Frontier login-node builds now have no retry

.github/workflows/bench.yml, new "Setup & Build" step removes nick-fields/retry with max_attempts: 2 and on_retry_command: rm -rf pr/build master/build. If a Frontier bench login-node build hits a transient failure (not a SIGKILL on the Actions runner side), there is no retry. The PR states that retry_build() inside cluster build.sh scripts handles retries — please confirm all build.sh scripts referenced by matrix.build_script in bench.yml internally call retry_build(), otherwise a transient build failure will fail the entire bench job with no recovery.

[Minor] common/bench.sh lines 424-427: Unquoted variables in TMPDIR setup

tmpbuild=/storage/project/r-sbryngelson3-0/sbryngelson3/mytmp_build
currentdir=$tmpbuild/run-$(( RANDOM % 9000 ))
mkdir -p $tmpbuild
mkdir -p $currentdir

$tmpbuild and $currentdir should be quoted (mkdir -p "$tmpbuild" / mkdir -p "$currentdir"). The trap 'rm -rf "$currentdir"' on line 427 is correctly quoted, making this inconsistent. Low practical risk given the fixed path, but worth fixing.

[Minor] select-gpu-partition.sh line 115: Partition priority change may affect benchmark baseline consistency

_GPU_PARTITION_PRIORITY="gpu-l40s gpu-h200 gpu-h100 gpu-a100 gpu-v100"

RTX 6000 is excluded entirely (comment: "too slow for the test suite time limit"). The old run_parallel_benchmarks.sh preferred gpu-rtx6000 first for benchmarks ("gives the most consistent baselines"). This shared script is now used for both tests and benchmarks. If historical benchmark baselines were collected on rtx6000, the shift to l40s/h-series may affect PR-vs-master comparisons. Confirm this is acceptable for bench jobs (4-hour time limit is more generous than the 3-hour test limit).

[Minor] common/bench.sh lines 451/453: -j parallelism flag dropped from ./mfc.sh bench

Old phoenix/bench.sh passed -j $n_jobs (CPU cores) before -- to ./mfc.sh bench. The new unified script omits this. If ./mfc.sh bench uses -j for build or execution parallelism, Phoenix bench jobs may run slower. The old frontier/bench.sh used -j $n_ranks (MPI ranks, not CPU cores — apparent copy-paste bug), so dropping it for Frontier is likely correct. Worth verifying the Phoenix case is intentional.

Overall a clean, well-motivated CI refactor. The no-retry concern for Frontier bench login-node builds is the main thing to confirm before merge.

The dry-run build uses build_opts but the live test command didn't. CMake caches the config, but passing it explicitly is safer. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions · 2026-03-10T19:57:53Z

Claude Code Review

Head SHA: b51f214469a1296bcfd2ef39678317454a67b06e

Files changed: 20

.github/scripts/retry-build.sh
.github/scripts/run-tests-with-retry.sh (deleted)
.github/scripts/run_parallel_benchmarks.sh
.github/scripts/select-gpu-partition.sh (new)
.github/scripts/submit-slurm-job.sh (new)
.github/scripts/submit_and_monitor_bench.sh
.github/workflows/bench.yml
.github/workflows/common/bench.sh (new)
.github/workflows/common/test.sh (new)
.github/workflows/frontier/{bench,submit,test}.sh (deleted)
.github/workflows/frontier_amd/{bench,submit,test}.sh (deleted, were symlinks)
.github/workflows/phoenix/{bench,submit-job,submit,test}.sh (deleted)
.github/workflows/test.yml

Summary:

Removes nick-fields/retry JS action (SIGKILL culprit on Frontier) and replaces with plain run: steps backed by retry_build() inside SLURM jobs.
Consolidates 11 per-cluster scripts into 3 unified scripts (submit-slurm-job.sh, common/test.sh, common/bench.sh) + shared select-gpu-partition.sh.
Phoenix build/test/bench flow now runs entirely inside SLURM (no split submit+monitor steps).
GPU partition selection extracted to a reusable script; gpu-rtx6000 excluded; parallel bench jobs require 2 idle/mix nodes (GPU_PARTITION_MIN_NODES=2).
Dead node atl1-1-03-002-29-0 excluded via #SBATCH --exclude= in the unified submit script.

Findings

1. Missing -j flag in unified bench command (behavioral regression risk)

.github/workflows/common/bench.sh, lines 41–45:

if [ "$job_device" = "gpu" ]; then
    ./mfc.sh bench --mem 4 -o "$job_slug.yaml" -- -c $bench_cluster $device_opts -n $n_ranks
else
    ./mfc.sh bench --mem 1 -o "$job_slug.yaml" -- -c $bench_cluster $device_opts -n $n_ranks
fi

Both old scripts passed a -j flag: phoenix/bench.sh used -j $n_jobs (capped at 64) and frontier/bench.sh used -j $n_ranks. The new unified script drops -j entirely. If mfc.sh bench defaults to nproc this is fine on small nodes but could be unexpectedly aggressive on Frontier's 192-core GNR nodes. Worth verifying the default or restoring an explicit -j $n_jobs.

2. No retry for login-node build in bench workflow

.github/workflows/bench.yml, lines 105–119:
The old nick-fields/retry wrapper included on_retry_command: rm -rf pr/build master/build. The new plain run: step has no retry. For Frontier (which pre-builds on a login node), a transient build failure now fails CI with no second chance. This is intentional given the SIGKILL issue, but it means flaky Frontier login-node builds will show as real failures. Consider wrapping with retry_build or documenting that Frontier re-runs are expected to self-heal via the job's own build logic when build/ is absent.

3. Inconsistency in GPU partition priority between test and bench contexts

.github/scripts/select-gpu-partition.sh, line 14:

_GPU_PARTITION_PRIORITY="gpu-l40s gpu-h200 gpu-h100 gpu-a100 gpu-v100"

The old run_parallel_benchmarks.sh explicitly preferred gpu-rtx6000 first for benchmark consistency ("most nodes, most consistent baselines"). The new unified script excludes gpu-rtx6000 entirely (comment says "too slow for the test suite time limit"). Using the same script for both test and bench jobs means bench jobs may now land on L40S or H200 nodes — faster, but potentially less stable baselines compared to historical RTX 6000 runs. If benchmark trend data is used for regression detection, this partition change will create a discontinuity. Consider whether bench-specific partition preferences should be passed via GPU_PARTITION_MIN_NODES / a separate priority list.

4. select-gpu-partition.sh is sourced with source but not marked executable

.github/scripts/select-gpu-partition.sh is created without an execute bit (no chmod +x in the diff, and the file itself has no shebang-based #!/bin/bash protection since it's always sourced). This is fine as long as callers always use source/. , which they do — just confirm it's not accidentally called with bash anywhere.

Minor / no action needed

RANDOM % 9000 in common/bench.sh (vs old % 900) — more unique tmp dirs, harmless improvement.
The frontier_amd stubs (symlinks → deleted) are correctly replaced by cluster dispatch in submit-slurm-job.sh.
The new submit-slurm-job.sh heredoc uses #SBATCH -o ${output_file} (space after -o); old scripts used #SBATCH -o$output_file — both are valid SBATCH syntax.

Overall this is a clean, well-motivated simplification. The main item to verify before merge is whether dropping -j from the bench command is intentional.

Under set -e, 'wait $pid' returning non-zero aborts the script before the exit code is captured, leaving the second parallel job unmonitored. Use 'wait $pid || exit=$?' so both jobs are always waited on. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- select-gpu-partition.sh: priority comment said 'smaller/older' but list is now L40S/H200/H100/A100/V100 - submit-slurm-job.sh: 'Idempotent' → 'Rerun-safe' (it always submits a new job) - bench.sh: n_jobs only used for build, not bench Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

For non-Phoenix GPU jobs, both device_opts and build_opts resolved to the same --gpu flag. Let build_opts carry it; device_opts is only for cluster-specific runtime flags like -g (Phoenix GPU IDs). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions · 2026-03-10T20:03:56Z

Claude Code Review

Head SHA: 698bd2e7f642c4f2fa6ffb24523109ea5b6e558e

Files changed: 20

.github/scripts/retry-build.sh (modified)
.github/scripts/run-tests-with-retry.sh (deleted)
.github/scripts/run_parallel_benchmarks.sh (modified)
.github/scripts/select-gpu-partition.sh (new)
.github/scripts/submit-slurm-job.sh (new)
.github/scripts/submit_and_monitor_bench.sh (modified)
.github/workflows/bench.yml (modified)
.github/workflows/common/bench.sh (new)
.github/workflows/common/test.sh (new)
.github/workflows/frontier/bench.sh (deleted)
.github/workflows/frontier/submit.sh (deleted)
.github/workflows/frontier/test.sh (deleted)
.github/workflows/frontier_amd/{bench,submit,test}.sh (deleted symlinks)
.github/workflows/phoenix/{bench,submit-job,submit,test}.sh (deleted)
.github/workflows/test.yml (modified)

Summary

Replaces nick-fields/retry JS action (SIGKILL'd on Frontier login nodes) with plain run: steps; retry is now handled entirely inside retry_build() in shell scripts.
Consolidates 11 per-cluster scripts into 3 unified scripts (submit-slurm-job.sh, common/test.sh, common/bench.sh), reducing duplication significantly.
Adds select-gpu-partition.sh for sinfo-based GPU partition selection, shared between test and bench jobs; requires 2 idle/mix nodes for bench to keep PR+master jobs on the same partition.
Excludes dead node atl1-1-03-002-29-0 (persistent cuInit 999) from Phoenix GPU sbatch options.
Deletes dead code run-tests-with-retry.sh (was never called).

Findings

1. common/bench.sh:482-484 — -j (parallel-jobs) flag dropped from ./mfc.sh bench

Old phoenix/bench.sh:

./mfc.sh bench --mem 4 -j $n_jobs -o "$job_slug.yaml" -- -c phoenix-bench $device_opts -n $n_ranks

Old frontier/bench.sh (GPU path):

./mfc.sh bench --mem 4 -j $n_ranks -o "$job_slug.yaml" -- -c $job_cluster $device_opts -n $n_ranks

New common/bench.sh:

./mfc.sh bench --mem 4 -o "$job_slug.yaml" -- -c $bench_cluster $device_opts -n $n_ranks

The -j flag controlling build/bench worker parallelism is absent in the unified script. If ./mfc.sh bench has a built-in default, this may be fine, but it is a behavioural change that could affect benchmark throughput or reproducibility. Worth confirming the default is acceptable (or re-adding -j $n_jobs).

2. select-gpu-partition.sh:135,142 — Comment says "smaller/older first" but gpu-v100 is last

# Priority order prefers smaller/older nodes to leave modern GPUs free
_GPU_PARTITION_PRIORITY="gpu-l40s gpu-h200 gpu-h100 gpu-a100 gpu-v100"

gpu-v100 (older/smaller than h100/h200/a100) is at the end, contradicting the stated intent. If the aim is to leave modern GPUs free for production, gpu-v100 should be earlier in the list (e.g., gpu-l40s gpu-v100 gpu-h200 gpu-h100 gpu-a100). This also diverges from the old bench ordering (rtx6000 → l40s → v100 → h200 → h100 → a100). Minor but could affect which GPU is chosen for benchmarks.

3. submit-slurm-job.sh:326-354 — Unquoted heredoc with script content expansion

sbatch_script_contents=$(cat "$script_path")
...
submit_output=$(sbatch <<EOT
...
$sbatch_script_contents
EOT
)

The unquoted EOT delimiter means bash expands $sbatch_script_contents during heredoc evaluation. Any $(...) or backtick constructs in the bench/test scripts execute in the submission shell. This is a pre-existing pattern (identical to the old frontier/submit.sh), so it is low risk for trusted internal scripts, but it is worth being aware of if scripts ever include command substitutions meant to run inside the SLURM job.

4. test.yml:237-252 — Frontier login-node build loses GitHub-Actions-level retry

The old step used nick-fields/retry with on_retry_command: rm -rf build. The new step is a plain run: with no retry wrapper. The PR notes that build.sh scripts call retry_build() internally — confirmed by the codebase — so this is intentional and correct. Just noting it for reviewers who might question why the external retry was dropped without replacement.

Overall: Clean, well-scoped CI infrastructure simplification. The -j omission in common/bench.sh is the only change that could silently degrade benchmark performance. The GPU partition priority order is worth a second look for correctness.

github-actions · 2026-03-10T20:06:55Z

Claude Code Review

Head SHA: 6cba9352e97be4ec28f2d2ca294a4143787acab8

Files changed: 20

.github/scripts/retry-build.sh (updated)
.github/scripts/run-tests-with-retry.sh (deleted)
.github/scripts/run_parallel_benchmarks.sh (updated)
.github/scripts/select-gpu-partition.sh (new)
.github/scripts/submit-slurm-job.sh (new)
.github/scripts/submit_and_monitor_bench.sh (updated)
.github/workflows/bench.yml (updated)
.github/workflows/common/bench.sh (new)
.github/workflows/common/test.sh (new)
.github/workflows/test.yml (updated)
10 cluster-specific scripts deleted (frontier/, frontier_amd/, phoenix/*)

Summary:

Removes nick-fields/retry JS action wrapper to fix SIGKILL false failures on Frontier login nodes; retry logic moves entirely into shell (retry_build())
Consolidates 11 per-cluster CI scripts into 3 unified parameterized scripts (submit-slurm-job.sh, common/test.sh, common/bench.sh)
Adds shared select-gpu-partition.sh for sinfo-based GPU partition selection with configurable minimum node threshold
Fixes wait $pid || exit=$? pattern in both bench.yml and run_parallel_benchmarks.sh to avoid set -e aborting on the first job failure and orphaning the second
Clean reduction: 466 deletions, 399 additions; test.yml self job drops from 5 conditional steps to 2

Findings

1. Benchmark partition selection is a semantic regression (.github/scripts/select-gpu-partition.sh, .github/scripts/run_parallel_benchmarks.sh)

The old run_parallel_benchmarks.sh preferred gpu-rtx6000 first for benchmarks (most nodes, most consistent baselines). The new select-gpu-partition.sh excludes gpu-rtx6000 entirely and starts with gpu-l40s. For existing benchmark baseline comparisons between master and PR this means results will be measured on a different (newer, faster) GPU type going forward. This changes the reference environment for historical benchmark regression detection. Intentional or oversight?

2. Missing -j flag in common/bench.sh (.github/workflows/common/bench.sh lines 42-46)

Old phoenix/bench.sh:

./mfc.sh bench  -j  -o ".yaml" -- -c phoenix-bench  -n

New unified bench.sh:

./mfc.sh bench --mem 4 -o "$job_slug.yaml" -- -c $bench_cluster $device_opts -n $n_ranks

The -j $n_jobs is dropped. If mfc.sh bench uses -j for case-level parallelism (parallel benchmark runs), this silently serializes benchmarks and inflates wall time. The CPU bench path is particularly affected since the old Phoenix CPU bench explicitly used -j $n_jobs. The old Frontier bench passed -j $n_ranks (which was likely a bug — ranks ≠ job threads), so unifying to no -j may be intentional; worth confirming for the CPU Phoenix path.

3. bench.yml build step has no retry at Actions level (.github/workflows/bench.yml lines 108-119)

The removed nick-fields/retry@v3 provided a GitHub Actions-level retry (with on_retry_command: rm -rf pr/build master/build). The new plain run: step has no equivalent. The PR author states that retry_build() inside build.sh covers this — but that only works if the bench build_script (e.g., bash .github/workflows/$cluster/build.sh ...) internally calls retry_build(). If any build.sh script on a supported cluster does NOT call retry_build(), transient build failures on that cluster will now fail the entire bench job without recovery.

4. Minor: select-gpu-partition.sh uses unquoted variable in for-loop (.github/scripts/select-gpu-partition.sh line 20)

for _part in $_GPU_PARTITION_PRIORITY; do

With set -euo pipefail (not set in this file), word splitting on $_GPU_PARTITION_PRIORITY is intentional for space-separated partition names, but the file does not set set -euo pipefail at all (unlike sibling scripts). Adding set -euo pipefail at the top of this sourced script would be consistent and catch errors in the sinfo call path.

Improvement Opportunities

submit-slurm-job.sh line 152: The heredoc submits $sbatch_script_contents via an unquoted <<EOT. This is intentional for preamble variable substitution but relies on shell variable-expansion semantics (expanded value of $sbatch_script_contents is not further expanded). A comment explaining this design contract would help future maintainers avoid accidentally introducing <<'EOT'.
submit_and_monitor_bench.sh job_slug derivation: The output YAML name (bench-$device-$interface.yaml) now depends on basename common/bench.sh → bench, which matches. But the comment block referencing bench-$device-$interface at line 30 of submit_and_monitor_bench.sh is still correct only because the naming stays consistent. If bench.sh were ever renamed, the slug and the post-job file lookup would silently diverge.

Overall this is a solid CI refactor that eliminates real false-failure modes and meaningfully reduces script duplication. The three findings above (especially #1 and #2) are worth confirming before merge.

github-actions · 2026-03-10T20:09:48Z

Claude Code Review

Head SHA: 6cba9352e97be4ec28f2d2ca294a4143787acab8

Files changed: 20

.github/scripts/retry-build.sh
.github/scripts/run-tests-with-retry.sh           (deleted)
.github/scripts/run_parallel_benchmarks.sh
.github/scripts/select-gpu-partition.sh           (new)
.github/scripts/submit-slurm-job.sh               (new)
.github/scripts/submit_and_monitor_bench.sh
.github/workflows/bench.yml
.github/workflows/common/bench.sh                 (new)
.github/workflows/common/test.sh                  (new)
.github/workflows/frontier/{bench,submit,test}.sh (deleted)
.github/workflows/frontier_amd/{bench,submit,test}.sh (deleted)
.github/workflows/phoenix/{bench,submit-job,submit,test}.sh (deleted)
.github/workflows/test.yml

Summary

Replaces the nick-fields/retry JS action (SIGKILL'd on Frontier login nodes post-build) with plain run: steps; retry logic stays in retry_build() inside the build scripts.
Unifies 11 per-cluster scripts into 3 parameterized ones: submit-slurm-job.sh, common/test.sh, common/bench.sh.
Adds sinfo-based GPU partition selection (select-gpu-partition.sh), with GPU_PARTITION_MIN_NODES=2 for parallel bench jobs.
Fixes run_parallel_benchmarks.sh orphan-job bug: wait $pid || exit=$? pattern prevents set -e from aborting before both PR and master jobs complete.
Deletes dead code (run-tests-with-retry.sh) and dead cluster script symlinks.

Findings

1. common/bench.sh — bench execution missing -j flag (possible regression)

Both old per-cluster bench scripts passed a job-parallelism flag to ./mfc.sh bench:

frontier/bench.sh: ./mfc.sh bench --mem 4 **-j $n_ranks** -o ...
phoenix/bench.sh: ./mfc.sh bench $bench_opts **-j $n_jobs** -o ...

The new common/bench.sh:44-48 omits -j entirely:

./mfc.sh bench --mem 4 -o "$job_slug.yaml" -- -c $bench_cluster $device_opts -n $n_ranks

If -j controls MPI ranks for the bench build step (distinct from -n which controls the run), dropping it changes behaviour. If it's redundant with -n, this is fine — worth confirming intentional.

2. select-gpu-partition.sh:16 — Benchmark GPU baseline silently changes

_GPU_PARTITION_PRIORITY="gpu-l40s gpu-h200 gpu-h100 gpu-a100 gpu-v100"

Old run_parallel_benchmarks.sh explicitly preferred gpu-rtx6000 first for benchmark consistency ("most nodes, gives most consistent baselines"). The new shared select-gpu-partition.sh excludes gpu-rtx6000 entirely (with the justification that it's too slow for the test time limit). However, bench jobs have a 4-hour time limit so RTX 6000 is not a bottleneck for them.

Using gpu-l40s as the new default will silently shift all future benchmark baselines. If there are stored master-branch baselines on gpu-rtx6000, the first post-merge benchmark run will show apparent performance changes that are GPU-change artefacts, not code changes. Consider either: (a) noting this explicitly in the PR for any stored baseline consumers, or (b) keeping a separate _BENCH_GPU_PARTITION_PRIORITY list in run_parallel_benchmarks.sh that preserves the old order for bench consistency.

3. submit_and_monitor_bench.sh:23 — master-branch race on first post-merge bench run

PR_BENCH_SCRIPT="$(cd "${SCRIPT_DIR}/../workflows/common" && pwd)/bench.sh"

The comment acknowledges that master may not have common/bench.sh yet. When a bench job triggers immediately after merge (comparing new master vs a PR that itself predates this change), the master/ checkout will lack common/bench.sh and the master bench job will fail. The workaround mentioned ("use the PR tree's script for both jobs") does protect against this, because PR_BENCH_SCRIPT is always resolved from the PR's checkout. Verified: this is fine as written — just noting the transient window is real.

4. bench.yml — stale parallel build artifacts not cleaned on rerun

Old nick-fields/retry had:

on_retry_command: rm -rf pr/build master/build

The new plain run: step (bench.yml:113-125) has no equivalent cleanup. retry_build() inside each cluster's build script cleans up ./build (relative to CWD, i.e. pr/build or master/build) on its second attempt, so single-build retries are covered. However, if the Setup & Build step itself is re-run (e.g. GitHub Actions job re-run), pr/build and master/build at the workspace root are NOT cleaned before resubmission, which could cause cross-run contamination for bench builds. Low risk since bench jobs typically run on fresh runners, but worth tracking.

5. Minor: run_parallel_benchmarks.sh:28 — VAR=val source idiom is correct but non-obvious

GPU_PARTITION_MIN_NODES=2 source "${SCRIPT_DIR}/select-gpu-partition.sh"

In bash, VAR=val source file.sh works because source/`." is a special built-in and variable assignments to special built-ins affect the current shell environment. This is correct. A slightly more readable alternative is two lines:

GPU_PARTITION_MIN_NODES=2
source "${SCRIPT_DIR}/select-gpu-partition.sh"

(since select-gpu-partition.sh unsets _GPU_PARTITION_MIN_NODES but not GPU_PARTITION_MIN_NODES, the caller would need to unset it manually if that matters.) Not a bug.

What looks good

wait $pid || exit=$? fix in run_parallel_benchmarks.sh is correct and important — the old wait; exit=$? pattern would abort on set -e before waiting for the second job.
Idempotent job cancellation (sacct state check + scancel before resubmit) is correctly carried forward into the unified submit-slurm-job.sh.
select-gpu-partition.sh correctly unsets its internal _-prefixed variables after use to avoid polluting the caller's environment.
Removing run-tests-with-retry.sh is correct — the PR description confirms it was never called.

With clean:false, old SLURM job epilogs can write to the .out file after our stale-job check completes. The monitor tail then picks up this stale output (including errors from dead nodes) and reports it as if it came from the new job. Removing the .out file before submission ensures a clean output stream. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions · 2026-03-10T22:25:38Z

Claude Code Review

Head SHA: 4915884ab6f783177c32f629e7c6d6eebd4cfe44
Files changed: 20

Changed files

.github/scripts/retry-build.sh
.github/scripts/run-tests-with-retry.sh (deleted)
.github/scripts/run_parallel_benchmarks.sh
.github/scripts/select-gpu-partition.sh (new)
.github/scripts/submit-slurm-job.sh (new)
.github/scripts/submit_and_monitor_bench.sh
.github/workflows/bench.yml
.github/workflows/common/bench.sh (new)
.github/workflows/common/test.sh (new)
.github/workflows/frontier/{bench,submit,test}.sh (deleted)
.github/workflows/frontier_amd/{bench,submit,test}.sh (deleted)
.github/workflows/phoenix/{bench,submit-job,submit,test}.sh (deleted)
.github/workflows/test.yml

Summary

Replaces the nick-fields/retry JS action with plain run: steps, eliminating false build failures caused by SIGKILL on Frontier login nodes after a successful build.
Collapses 11 per-cluster scripts into 3 unified scripts (submit-slurm-job.sh, common/test.sh, common/bench.sh), reducing duplication significantly (+404 / -466).
Adds select-gpu-partition.sh for sinfo-based GPU partition selection shared between test and bench jobs; bench jobs require 2 idle nodes (GPU_PARTITION_MIN_NODES=2) for PR/master consistency.
Correctly fixes the set -e / wait interaction in run_parallel_benchmarks.sh that would have orphaned the second benchmark job on first failure.
Excludes the dead GPU node (atl1-1-03-002-29-0) and deletes the never-called run-tests-with-retry.sh.

Findings

1. Missing -j $n_jobs in common/bench.sh (potential regression)

Both old phoenix/bench.sh and frontier/bench.sh passed -j $n_jobs to ./mfc.sh bench. The new unified script at .github/workflows/common/bench.sh lines 44–47 omits this flag:

# old (phoenix/bench.sh):
./mfc.sh bench $bench_opts -j $n_jobs -o "$job_slug.yaml" -- -c phoenix-bench $device_opts -n $n_ranks

# new (common/bench.sh line 46):
./mfc.sh bench --mem 4 -o "$job_slug.yaml" -- -c $bench_cluster $device_opts -n $n_ranks

If ./mfc.sh bench uses -j to control parallelism (e.g., build-level jobs during bench setup), dropping it may serialize builds. Worth confirming this is intentional and that retry_build (which does pass -j $n_jobs) already covers the build path.

2. No retry for login-node build in bench.yml Setup & Build step

The old step used nick-fields/retry@v3 with on_retry_command: rm -rf pr/build master/build. The new step removes retry entirely. A single transient build failure will now fail the bench job. The retry_build() function handles retries inside SLURM jobs for Phoenix, but the login-node pre-build for Frontier (Setup & Build step in bench.yml) is no longer retried. This may be acceptable given the motivation (dropping the action that SIGKILL'd), but a brief comment documenting the intentional removal would help future maintainers.

3. Unexplained sleep 10 in common/bench.sh (lines 50–53)

# --- Phoenix cleanup (trap EXIT handles rm -rf "$currentdir") ---
if [ "$job_cluster" = "phoenix" ]; then
    sleep 10
    unset TMPDIR
fi

Inherited from phoenix/bench.sh. The trap 'rm -rf "$currentdir" || true' EXIT already handles cleanup; the sleep 10 purpose is unclear (NFS flush? post-bench I/O settle?). Consider adding a comment or removing if no longer needed.

4. select-gpu-partition.sh partition priority change (minor / documented)

Old run_parallel_benchmarks.sh order: gpu-rtx6000 → gpu-l40s → gpu-v100 → gpu-h200 → gpu-h100 → gpu-a100
New select-gpu-partition.sh (line 16): gpu-l40s → gpu-h200 → gpu-h100 → gpu-a100 → gpu-v100 (rtx6000 excluded)

The PR description and script comment document this as intentional ("RTX 6000 too slow for the test suite time limit"). Flagging only so reviewers are aware the bench GPU pool changed alongside the test pool.

No other issues found

The unified submit script is well-structured. Idempotent stale-job cancellation, SIGHUP suppression, and set -euo pipefail are correctly applied. The variable-scoping pattern for select-gpu-partition.sh (prefixed _ vars + unset at end) is a good practice for sourced scripts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions · 2026-03-11T03:16:48Z

Claude Code Review

Head SHA: e1e0f424f53b20f88d3bb3435b200de4d9c0d32e

Files changed: 21

.github/scripts/submit-slurm-job.sh (new, +205)
.github/workflows/common/test.sh (new, +71)
.github/workflows/common/bench.sh (new, +54)
.github/scripts/select-gpu-partition.sh (new, +35)
.github/scripts/run-tests-with-retry.sh (deleted)
.github/workflows/frontier/submit.sh (deleted, -105)
.github/workflows/phoenix/submit-job.sh (deleted, -123)
.github/workflows/phoenix/submit.sh (deleted, -34), test.sh (deleted, -22), bench.sh (deleted, -32)
.github/workflows/test.yml, bench.yml, and 9 other scripts modified

Summary

Removes nick-fields/retry JS action (root cause: SIGKILL on Frontier login nodes after successful build) and replaces with plain run: steps for both test and bench workflows.
Consolidates 11 per-cluster scripts into 3 unified scripts: submit-slurm-job.sh, common/test.sh, common/bench.sh. Logic for Phoenix vs Frontier is now in case blocks rather than separate files.
Adds select-gpu-partition.sh for sinfo-based GPU partition selection; bench jobs require GPU_PARTITION_MIN_NODES=2 so PR and master bench can land on the same partition concurrently.
Excludes persistent-fault Phoenix GPU node atl1-1-03-002-29-0.
Dead code deletion: run-tests-with-retry.sh (unreferenced), 3 frontier_amd symlinks, old per-cluster submit scripts.

Findings

1. bench.yml — lost build retry (intentional, but worth confirming scope)
The old nick-fields/retry step ran up to 2 attempts with on_retry_command: rm -rf pr/build master/build. The new plain run: step has a single attempt with no cleanup on failure. The PR description explains this is intentional (JS wrapper SIGKILL root cause), and Phoenix bench builds now happen inside SLURM via retry_build(). But for any cluster whose bench build runs on the GHA login-node step (non-empty matrix.build_script), transient failures are now terminal. Low risk if only Phoenix has a non-empty build_script (which builds inside SLURM anyway), but worth a quick confirmation.

2. common/test.sh line 29, common/bench.sh line 34 — redundant || exit 1
Both files set set -euo pipefail at the top. The || exit 1 after retry_build ... is redundant; a non-zero exit from retry_build will already abort the script. Not a bug, minor style noise only.

3. common/bench.sh line 10 — nproc called twice
n_jobs=$(( $(nproc) > 64 ? 64 : $(nproc) )) forks two nproc subshells. Cosmetic; not a correctness issue.

4. select-gpu-partition.sh — gpu-rtx6000 excluded from priority list
The old run_parallel_benchmarks.sh specifically preferred gpu-rtx6000 first for bench consistency ("has the most nodes and gives the most consistent baselines"). The new script excludes it entirely (comment: "too slow for the test suite time limit"). The fallback is now gpu-l40s. This is a deliberate change, but means bench baselines for existing time-series comparisons may shift. Already noted in the PR description; flagging for visibility.

5. Improvement noted — run_parallel_benchmarks.sh wait exit-code capture
The change from wait "$pr_pid"; pr_exit=$? to wait "$pr_pid" || pr_exit=$? is a correct fix: without ||, set -e would abort the script on the first failing wait, orphaning the second background job. Well-commented in the diff.

Overall

Clean, well-motivated refactor with no blocking issues found. The logic is correctly unified and idempotency/stale-job cancellation now applies to all clusters. The key behavioral tradeoff (single-attempt bench build on GHA runner) appears intentional and acceptable given the SIGKILL root cause. Items 2–3 are style-only; item 1 deserves a quick confirmation that no non-Phoenix cluster runs the bench build on the GHA runner step.

…ilation) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions · 2026-03-11T03:38:06Z

Claude Code Review

Head SHA: 97ecc2353a4a7aef5c1d3ba70042efffa2ea8a25

Files changed: 22

Key files:

.github/scripts/submit-slurm-job.sh (new, +205)
.github/workflows/common/test.sh (new, +71)
.github/workflows/common/bench.sh (new, +54)
.github/scripts/select-gpu-partition.sh (new, +35)
.github/workflows/test.yml (+7/-34)
.github/workflows/bench.yml (+13/-15)
11 deleted per-cluster scripts
.github/scripts/run-tests-with-retry.sh (deleted)

Summary

Replaces nick-fields/retry JS action with plain run: steps in test.yml and bench.yml, fixing false SIGKILL failures on Frontier login nodes.
Consolidates 11 per-cluster scripts into 3 unified scripts (submit-slurm-job.sh, common/test.sh, common/bench.sh), a meaningful reduction in duplication.
Adds shared select-gpu-partition.sh with sinfo-based GPU partition selection; GPU_PARTITION_MIN_NODES=2 ensures both parallel bench jobs land on the same partition.
Stale-job idempotency (cancel + resubmit) now covers all clusters uniformly.
Removes never-called run-tests-with-retry.sh (dead code).

Findings

1. common/bench.sh:540,542 — mfc.sh bench missing -j flag (functional regression risk)

The old per-cluster bench scripts passed -j $n_ranks (Frontier GPU) or -j $n_jobs (Phoenix). The unified script omits -j entirely:

# New (common/bench.sh:540)
./mfc.sh bench --mem 4 -o "$job_slug.yaml" -- -c $bench_cluster $device_opts -n $n_ranks
# Old (frontier/bench.sh, GPU)
./mfc.sh bench --mem 4 -j $n_ranks -o "$job_slug.yaml" -- -c $job_cluster $device_opts -n $n_ranks

If -j to mfc.sh bench controls internal benchmark parallelism (number of cases run concurrently), dropping it will silently change benchmark throughput. Confirm the default is acceptable or add -j $n_jobs back.

2. select-gpu-partition.sh:195 — RTX 6000 excluded from benchmark partition priority

The new priority list (gpu-l40s gpu-h200 gpu-h100 gpu-a100 gpu-v100) excludes gpu-rtx6000, which was the preferred partition for benchmarks in the old run_parallel_benchmarks.sh (first choice, also the fallback). This is intentional per the PR description (RTX 6000 too slow), but it means any historical benchmark baselines were on RTX 6000 and comparisons against new results on L40S/H200 will reflect both code changes and hardware differences. Worth noting in the PR or in a comment in the script that baselines established before this change are not directly comparable.

3. common/bench.sh:510-516 — TMPDIR collision range expanded but still racy

# Old (phoenix/bench.sh): RANDOM % 900  → 0–899
# New (common/bench.sh): RANDOM % 9000  → 0–8999
currentdir=$tmpbuild/run-$(( RANDOM % 9000 ))

The range increase reduces collision probability, but two concurrent jobs (PR + master) could still collide. Consider using mktemp -d $tmpbuild/run-XXXXXX for a guaranteed-unique temp directory and remove the mkdir -p $currentdir line.

4. submit-slurm-job.sh:395 — SLURM job script has set -e but embeds arbitrary script contents

set -e
set -x
...
$sbatch_script_contents

$sbatch_script_contents is injected verbatim from the sourced script. If any embedded script has a non-zero-exiting subshell expression that's not part of a conditional, set -e may abort the job prematurely. This is the same pattern as the old scripts so the risk is not new, but worth noting for future scripts added to this pipeline.

5. test.yml:1079 — Non-Phoenix clusters now also go through submit-slurm-job.sh

Previously, Frontier used a cluster-specific submit.sh that invoked run_monitored_slurm_job.sh. Now all clusters use the unified submit-slurm-job.sh. The new script delegates monitoring to the same run_monitored_slurm_job.sh at line 426. This is correct — just confirming the Frontier monitoring path is preserved.

Minor observations

run_case_optimization.sh:112 changes -j "$(nproc)" to -j 8 — reasonable cap, consistent with test scripts.
run_parallel_benchmarks.sh:143 uses GPU_PARTITION_MIN_NODES=2 source ... — this inline assignment before source works correctly in bash (built-in inherits the env var).
select-gpu-partition.sh cleans up its internal variables with unset — good practice for a sourced script.

Overall this is a clean and well-motivated consolidation. The main item to verify before merge is whether the mfc.sh bench -j flag omission is intentional.

🤖 Generated with Claude Code

GitHub Actions runners have different CPU microarchitectures. MFC compiles with -march=native, so cached binaries from one runner can contain instructions illegal on another. Adding the GCC-detected -march target to the cache key ensures each ISA gets its own cache. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions · 2026-03-11T12:09:22Z

Claude Code Review

Head SHA: 17934983abb33b9d50cd2545890c553a1a36184d
Files changed: 22 (11 deleted, 3 new unified scripts, 8 modified)

Changed files

.github/scripts/prebuild-case-optimization.sh
.github/scripts/retry-build.sh
.github/scripts/run-tests-with-retry.sh (deleted)
.github/scripts/run_case_optimization.sh
.github/scripts/run_parallel_benchmarks.sh
.github/scripts/select-gpu-partition.sh (new)
.github/scripts/submit-slurm-job.sh (new)
.github/scripts/submit_and_monitor_bench.sh
.github/workflows/bench.yml
.github/workflows/common/bench.sh (new)
.github/workflows/common/test.sh (new)

Summary

Removes the nick-fields/retry JS action wrapper that was sending SIGKILL to Frontier login-node processes after a successful build; retry logic now lives entirely in shell (retry_build()).
Consolidates 11 per-cluster scripts into 3 unified scripts (submit-slurm-job.sh, common/test.sh, common/bench.sh), reducing duplication significantly.
Introduces select-gpu-partition.sh for sinfo-based partition selection; the GPU_PARTITION_MIN_NODES=2 guard for bench jobs correctly ensures both PR and master bench runs land on the same partition type.
Adds CPU-architecture hash to the GitHub Actions cache key in test.yml to prevent SIGILL from cache hits across runners with different ISAs.
Fixes wait $pid || exit=$? pattern in run_parallel_benchmarks.sh and bench.yml so a failure in one background job no longer orphans the other under set -e.

Findings

1. gpu-rtx6000 silently dropped from bench partition pool — select-gpu-partition.sh, line 195

_GPU_PARTITION_PRIORITY="gpu-l40s gpu-h200 gpu-h100 gpu-a100 gpu-v100"
_GPU_PARTITION_FALLBACK="gpu-l40s"

The old code favoured gpu-rtx6000 first for benchmarks (most nodes, most consistent baselines). The new script excludes it entirely with a test-focused comment ("too slow for the test suite time limit"), but the same script is now used for bench partition selection too (via run_parallel_benchmarks.sh line 143). If rtx6000 should remain available for bench jobs, the priority list or the GPU_PARTITION_MIN_NODES path needs a bench-specific override. Intentional or oversight?

2. ./mfc.sh bench loses -j flag — common/bench.sh, lines 540–542

./mfc.sh bench --mem 4 -o "$job_slug.yaml" -- -c $bench_cluster $device_opts -n $n_ranks

Old phoenix/bench.sh passed -j $n_jobs to mfc.sh bench. The new script drops it. If mfc.sh bench defaults to -j 1, benchmark suites that previously ran concurrently will now run serially (slower wall time, possibly exceeding the 4-hour limit). If the intent is serial-for-reproducibility, a comment explaining that choice would help avoid confusion.

3. bench.yml outer retry removed without inner retry for the parallel build — bench.yml, lines 474–486
The old nick-fields/retry wrapper re-ran both pr+master builds on failure. The new plain run: block has no retry. Each individual build script internally calls retry_build (2 attempts), but a transient failure that hits both builds simultaneously (e.g., NFS blip) would now fail the step permanently. This matches the stated intent (SIGKILL was the problem, not the retry count), but it is a deliberate reduction in resilience.

4. Empty RETRY_VALIDATE_CMD exported into environment — common/test.sh, lines 583–585

validate_cmd=""
...
RETRY_VALIDATE_CMD="$validate_cmd" \
    retry_build ./mfc.sh test -v --dry-run -j 8 $build_opts || exit 1

For non-Phoenix clusters, validate_cmd is left empty and exported. retry-build.sh handles RETRY_VALIDATE_CMD="" safely (local validate_cmd="${RETRY_VALIDATE_CMD:-}"), so this is benign — but a short comment ("empty on non-Phoenix — no syscheck needed") would clarify intent.

Minor nits

submit-slurm-job.sh line 250: sbatch_script_contents is assigned before SCRIPT_DIR but used much later; reordering would be cleaner.
common/bench.sh line 511: mkdir -p $tmpbuild / mkdir -p $currentdir should quote the variables to guard against word-splitting if the path ever contains spaces.
The frontier_amd symlinks (bench.sh, submit.sh, test.sh) are deleted, which is correct since submit-slurm-job.sh dispatches via cluster argument — worth confirming no caller still references those paths.

Overall this is a well-motivated consolidation. The partition-priority question (#1) and the -j removal (#2) are the items most worth double-checking before merge.

…le speedup) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions · 2026-03-11T12:48:02Z

Claude Code Review

Head SHA: dbe6a5e

Files changed: 22

File	+/-
`.github/scripts/submit-slurm-job.sh`	+205 (new)
`.github/workflows/common/test.sh`	+71 (new)
`.github/workflows/common/bench.sh`	+54 (new)
`.github/scripts/select-gpu-partition.sh`	+35 (new)
`.github/workflows/test.yml`	+7/-40
`.github/workflows/bench.yml`	+13/-15
`.github/scripts/run_parallel_benchmarks.sh`	+8/-23
(15 files deleted or updated)

Summary

Replaces nick-fields/retry JS action with plain run: steps in bench.yml and test.yml, fixing SIGKILL false failures on Frontier login nodes.
Unifies per-cluster SLURM submit/test/bench scripts into three shared scripts (submit-slurm-job.sh, common/test.sh, common/bench.sh), deleting 11 per-cluster files.
Introduces select-gpu-partition.sh for sinfo-based GPU partition selection; excludes slow RTX 6000 nodes from selection; requires ≥2 idle nodes for parallel bench jobs.
Hard-codes exclusion of broken node atl1-1-03-002-29-0 in the unified submit script.
Stale-job idempotency (cancel + resubmit) now covers all clusters, not just Phoenix.

Findings

1. Missing -j flag in common/bench.sh benchmark invocation (behavioral change)
File: .github/workflows/common/bench.sh, lines 539–542

Old phoenix/bench.sh passed -j $n_jobs (≤64 parallel jobs) and old frontier/bench.sh passed -j $n_ranks. The new unified bench.sh drops the -j flag entirely:

./mfc.sh bench --mem 4 -o "$job_slug.yaml" -- -c $bench_cluster $device_opts -n $n_ranks

If ./mfc.sh bench defaults to serial execution when -j is absent, benchmark wall time could increase significantly. Please confirm this is intentional or add back a sensible default (e.g. -j $(nproc) or -j $n_jobs).

2. Removal of GitHub Actions build cache for non-cluster runners (potential CI slowdown)
File: .github/workflows/test.yml, around line 1053 (deleted block)

The Restore Build Cache step (using actions/cache@v4) was removed from the self job. If this job runs on standard GitHub runners (Ubuntu/macOS), every run will now perform a cold build. If the removal is intentional (e.g., to avoid cache poisoning), a brief comment in the YML would help future reviewers understand why.

3. bench.yml — no retry at the GitHub Actions level for parallel builds
File: .github/workflows/bench.yml, lines 474–486

The nick-fields/retry wrapper with on_retry_command: rm -rf pr/build master/build has been removed. The PR states that retry_build() in retry-build.sh covers this for SLURM-submitted jobs. However, the Setup & Build step in bench.yml calls ${{ matrix.build_script }} directly in the GitHub Actions runner (not in SLURM). If this step fails, there is now no retry. Please confirm matrix.build_script always delegates to a script that internally calls retry_build().

4. EOT heredoc delimiter collision risk in submit-slurm-job.sh
File: .github/scripts/submit-slurm-job.sh, line 384–411

$sbatch_script_contents is inlined into an unquoted heredoc delimited by EOT. If any sourced script (common/test.sh, common/bench.sh, run_case_optimization.sh, etc.) ever contains a line with exactly EOT as the only content, the heredoc terminates early and the sbatch submission silently submits a truncated script. None of the current scripts trigger this, but it is a latent fragility worth noting (same pattern as the deleted per-cluster scripts).

5. Minor: $(nproc) evaluated twice in common/bench.sh
File: .github/workflows/common/bench.sh, line 507

n_jobs=$(( $(nproc) > 64 ? 64 : $(nproc) ))

$(nproc) forks a subprocess twice. Trivial but clean to fix:

_nproc=$(nproc); n_jobs=$(( _nproc > 64 ? 64 : _nproc ))

Overall this is a solid simplification. The unified scripts remove significant duplication and the GPU partition selection logic is well-commented. The main item to double-check before merge is finding #1 (the dropped -j flag on bench).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions · 2026-03-11T13:06:07Z

Claude Code Review

Head SHA: fde5fc4

Files changed: 22

Key files:

.github/scripts/submit-slurm-job.sh (new, +207)
.github/workflows/common/test.sh (new, +71)
.github/workflows/common/bench.sh (new, +54)
.github/scripts/select-gpu-partition.sh (new, +35)
.github/workflows/test.yml (-40/+7)
.github/workflows/bench.yml (-15/+13)
11 per-cluster scripts deleted

Summary

Replaces nick-fields/retry JS action with plain run: steps, fixing false SIGKILL failures on Frontier login nodes.
Consolidates 11 per-cluster submit/test/bench scripts into 3 unified parameterized scripts (submit-slurm-job.sh, common/test.sh, common/bench.sh).
Extracts select-gpu-partition.sh for shared sinfo-based GPU partition selection with configurable GPU_PARTITION_MIN_NODES.
Parallel bench jobs now require 2 idle/mix nodes before selecting a partition to ensure both PR and master land on the same GPU type.
Drops dead run-tests-with-retry.sh.

Findings

1. Potential job ID extraction fragility — submit-slurm-job.sh L416

job_id=$(echo "$submit_output" | grep -oE '[0-9]+')

If sbatch emits a warning containing digits before the Submitted batch job line (e.g., a memory limit advisory), this grabs the wrong number. Safer:

job_id=$(echo "$submit_output" | grep -oE 'Submitted batch job [0-9]+' | grep -oE '[0-9]+')

The old per-cluster scripts had the same issue, but it's worth fixing in the unified script since it's now the single point of failure for all clusters.

2. TMPDIR collision risk — common/bench.sh L514

currentdir=$tmpbuild/run-$(( RANDOM % 9000 ))

$RANDOM % 9000 gives a 1-in-9000 collision probability per concurrent job. The old phoenix/bench.sh used % 900 (10× worse), so this is an improvement, but mktemp -d "$tmpbuild/run-XXXXXX" would be collision-free:

currentdir=$(mktemp -d "$tmpbuild/run-XXXXXX")

3. Missing -j $n_jobs in bench command — common/bench.sh L542–544

./mfc.sh bench --mem 4 -o "$job_slug.yaml" -- -c $bench_cluster $device_opts -n $n_ranks

The old phoenix/bench.sh passed -j $n_jobs to ./mfc.sh bench. If that flag controls benchmark-level parallelism (not just build parallelism), Phoenix benchmarks may be slower or behave differently. The old frontier/bench.sh also passed -j $n_ranks. If this was intentionally dropped, a brief comment explaining why would help.

4. Stale PR description
The PR description lists GPU partition priority as gpu-rtx6000 → gpu-l40s → ..., but select-gpu-partition.sh L188–196 explicitly excludes gpu-rtx6000 (it's too slow for the test time limit). The code is correct; the description is just stale and should be updated.

5. set -euo pipefail + unquoted $build_opts — common/bench.sh L530

retry_build ./mfc.sh build -j $n_jobs $build_opts || exit 1

With set -euo pipefail, an unset $build_opts would fail with "unbound variable". bench-preamble.sh should always set it, but defensive quoting ${build_opts:-} would make this robust if preamble logic changes.

Minor / Non-blocking

test.yml removal of the Restore Build Cache step (L1051–1055) is correct — actions/cache doesn't help on HPC self-hosted runners. Clean removal.
The wait $pid || exit=$? pattern in bench.yml (L482–483) correctly avoids set -e aborting on the first failure and orphaning the second background job. Good fix.
select-gpu-partition.sh cleans up private _-prefixed variables via unset at L215 — nice hygiene for a sourced script.
The id_file idempotency logic in submit-slurm-job.sh (stale job cancellation) correctly carries over from the old submit-job.sh.

Overall this is a solid cleanup that meaningfully reduces CI surface area. The findings above are mostly hardening opportunities, not blockers.

- File-level gcov coverage cache maps test UUIDs to exercised .fpp source files (gzip JSON, committed to repo) - --only-changes flag prunes tests by intersecting PR-changed files against coverage cache; conservative fallbacks for missing cache/coverage - --build-coverage-cache flag + 3-phase parallel cache builder (prepare, run, gcov collect) - New rebuild-cache CI job on Phoenix via SLURM when cases.py or Fortran dependency graph changes - Dep-change detection greps PR/push diffs for added use/include statements - 53 unit tests cover core coverage logic - Rebased onto PR MFlowCode#1299 unified CI architecture (submit-slurm-job.sh, common/test.sh) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Spencer Bryngelson added 3 commits March 9, 2026 23:36

ci: remove unused RETRY_VALIDATE_CMD from retry_build

2fb93db

Copilot AI review requested due to automatic review settings March 10, 2026 03:45