Fix self-hosted CI robustness: build cache, SLURM QOS, and submit resilience #1295
sbryngelson merged 42 commits into MFlowCode:master
Conversation
Benchmark jobs were using the extended partition (5:59 walltime, ENG160 account) causing multi-hour queue waits and hitting GHA's 8h wall-clock limit. The actual benchmark runs in ~20 minutes on the node. Switch to batch + 1:59 + --qos=normal (same as the test suite jobs). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace setup-build-cache.sh symlink mechanism with rm -rf build before each test run on Phoenix and Frontier. Benchmark jobs unaffected. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When the runner process is killed (exit 137) before the SLURM job completes, sacct is used to verify the job's final state. If the SLURM job completed with exit 0:0, the CI step passes regardless of the monitor's exit code. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All three submit.sh scripts (phoenix, frontier, frontier_amd symlink) now call a single helper that wraps monitor_slurm_job.sh with sacct fallback: if the monitor is killed before the SLURM job completes, the helper re-checks the job's final state and exits 0 if it succeeded. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
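The recovery pattern described above can be sketched as follows. This is an illustrative reconstruction, not the repository's exact script; the function names (`run_monitored_job`, `check_final_state`) and the `--parsable2` output handling are assumptions.

```shell
#!/bin/bash
# Sketch of the sacct-fallback wrapper pattern (names are assumptions).

# Query SLURM accounting for the job's final state, e.g. "COMPLETED|0:0".
check_final_state() {
    sacct -j "$1" --format=State,ExitCode --noheader --parsable2 | head -n 1
}

# Run the monitor command; if it dies (e.g. exit 137 when the runner is
# SIGKILLed), fall back to SLURM's own record of the job's outcome.
run_monitored_job() {
    local job_id=$1
    shift
    "$@"
    local rc=$?
    if [ "$rc" -ne 0 ]; then
        local state
        state=$(check_final_state "$job_id")
        case "$state" in
            COMPLETED*0:0*) return 0 ;;  # the SLURM job itself succeeded
        esac
    fi
    return "$rc"
}
```

The key property is that the monitor's exit code is only trusted when it agrees with accounting: a dead monitor no longer fails a job that SLURM reports as COMPLETED 0:0.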
- Cut benchmark time steps from 60-70 to 20 (GPU) / 10 (CPU) — still sufficient for grind time measurement
- Unify Frontier SLURM config: bench now uses CFD154/batch/normal like tests instead of ENG160/extended (2hr wall time vs 6hr)
- Reduce CI timeout from 8hr to 4hr

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
On GNR nodes (192 cores), $(nproc) returns 192 which overwhelms MPI daemons and causes SIGTERM (exit 143) during benchmarks. Master lands on a 24-core node and passes while PR lands on GNR and fails, making benchmarks appear broken by the PR. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
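A minimal way to apply such a cap in the bench scripts; the variable name is illustrative, and the cap of 64 mirrors the value this PR describes.

```shell
#!/bin/bash
# Cap -j parallelism so 192-core GNR nodes don't overwhelm MPI daemons.
# The cap of 64 matches the value described in this PR; treat as a sketch.
n_jobs=$(nproc)
if [ "$n_jobs" -gt 64 ]; then
    n_jobs=64
fi
echo "building with -j $n_jobs"
```

This keeps small nodes (e.g. 24 cores) at their natural parallelism while bounding the large ones, so master and PR runs behave consistently regardless of which node type they land on.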
gfortran 12+ with -march=native on Granite Rapids (GNR) CPUs emits vmovw instructions (AVX-512 FP16) that binutils 2.35 cannot assemble, causing LTO link failures. Add -mno-avx512fp16 when the compiler supports it. FP16 is unused in MFC's double-precision computations. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Build errors containing [/tmp/...] paths (e.g. LTO linker output) were misinterpreted as Rich markup closing tags, crashing the error display and masking the actual build failure. Wrap raw output in Text() to prevent markup interpretation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
Improves robustness of the self-hosted CI/benchmark infrastructure by removing fragile build caching, tightening SLURM submission defaults, and adding a more resilient SLURM job monitoring wrapper.
Changes:
- Remove persistent build/ reuse on Phoenix/Frontier self-hosted runners and cap parallel build/bench jobs to reduce node overload.
- Update SLURM submission/monitoring to use run_monitored_slurm_job.sh and standardize Frontier SBATCH params (batch/1:59/normal).
- Adjust build/UX details (CMake release flags for gcov/Granite Rapids assembler compatibility; Rich output rendering).
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| toolchain/mfc/build.py | Wrap build stdout/stderr in rich.text.Text before rendering in panels. |
| benchmarks/5eq_rk3_weno3_hllc/case.py | Changes default benchmark step count (timestep stop/save). |
| benchmarks/hypo_hll/case.py | Changes default benchmark step count (timestep stop/save). |
| benchmarks/ibm/case.py | Changes default benchmark step count (timestep stop/save). |
| benchmarks/igr/case.py | Changes default benchmark step count (timestep stop/save). |
| benchmarks/viscous_weno5_sgb_acoustic/case.py | Changes default benchmark step count (timestep stop/save). |
| CMakeLists.txt | Skip -march=native for gcov builds; add conditional -mno-avx512fp16 to avoid older-binutils assembler failures. |
| .github/workflows/phoenix/test.sh | Remove persistent build cache usage by deleting build/ before running. |
| .github/workflows/phoenix/bench.sh | Delete build/ before building; cap -j parallelism to 64. |
| .github/workflows/phoenix/submit.sh | Switch SLURM monitoring from monitor_slurm_job.sh to run_monitored_slurm_job.sh. |
| .github/workflows/frontier/build.sh | Remove build cache usage by deleting build/ before build/test. |
| .github/workflows/frontier/bench.sh | Cap CPU benchmark -j parallelism to 64. |
| .github/workflows/frontier/submit.sh | Standardize SBATCH params and switch to run_monitored_slurm_job.sh. |
| .github/workflows/bench.yml | Changes workflow job timeout minutes. |
| .github/scripts/run_monitored_slurm_job.sh | New wrapper to recover from monitor termination by checking final SLURM job state via sacct. |
Actionable comments posted: 3
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 8293f1d1-26a6-40fe-b9a5-3d5a10fcd169
📒 Files selected for processing (15)
- .github/scripts/run_monitored_slurm_job.sh
- .github/workflows/bench.yml
- .github/workflows/frontier/bench.sh
- .github/workflows/frontier/build.sh
- .github/workflows/frontier/submit.sh
- .github/workflows/phoenix/bench.sh
- .github/workflows/phoenix/submit.sh
- .github/workflows/phoenix/test.sh
- CMakeLists.txt
- benchmarks/5eq_rk3_weno3_hllc/case.py
- benchmarks/hypo_hll/case.py
- benchmarks/ibm/case.py
- benchmarks/igr/case.py
- benchmarks/viscous_weno5_sgb_acoustic/case.py
- toolchain/mfc/build.py
When benchmarking master vs PR, submit_and_monitor_bench.sh was using the master directory's submit.sh for the master bench job. Master's submit.sh calls monitor_slurm_job.sh directly without SIGKILL recovery. When the monitor was killed (exit 137), the master bench YAML was never found. Fix: always use the PR's submit.sh (which calls run_monitored_slurm_job.sh with sacct fallback) for both master and PR bench submissions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Restore bench.yml timeout-minutes to 480 (accidentally regressed to 240)
- Fix monitor_slurm_job.sh cleanup trap: check squeue before calling scancel so jobs that already left the queue are not cancelled, allowing run_monitored_slurm_job.sh to recover successfully via sacct

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
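The trap fix amounts to gating `scancel` on queue visibility. A hedged sketch of the pattern, with an illustrative function name rather than the repository's exact trap body:

```shell
#!/bin/bash
# Only cancel the job if squeue still lists it; a job that has already
# left the queue is finished, and cancelling it would defeat the
# sacct-based recovery that runs afterwards.
cancel_if_queued() {
    local job_id=$1
    if squeue -j "$job_id" -h 2>/dev/null | grep -q .; then
        scancel "$job_id"
    fi
}
# Typically installed as a cleanup trap, e.g.:
#   trap 'cancel_if_queued "$job_id"' EXIT
```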
sacct can return empty output (zero exit) when accounting is not yet
recorded or the epilog hasn't finished — the previous '|| echo UNKNOWN'
only caught non-zero exits, leaving final_state=''. Use '|| true' to
suppress exit-on-error and ${var:-default} expansion to default to
UNKNOWN when the output is empty.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
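Under `set -e`, "command failed" and "command succeeded with empty output" need different handling. A sketch of the corrected pattern; the `get_final_state` name is illustrative:

```shell
#!/bin/bash
set -euo pipefail

# '|| echo UNKNOWN' only fires when the command exits non-zero; an empty
# but successful result leaves the variable blank. Instead, suppress the
# error with '|| true' and default the empty value with ${var:-default}.
get_final_state() {
    local raw
    raw=$("$@" || true)       # don't let set -e abort on a failed query
    echo "${raw:-UNKNOWN}"    # default when the output is empty
}
```

With this shape, a failed query, an empty-but-successful query, and a populated result all yield a usable value.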
…threshold to warning
- run_parallel_benchmarks.sh: before parallel launch, query sinfo for
available GPU partitions (H200->H100->A100->L40S->RTX6000->V100)
and export BENCH_GPU_PARTITION so both PR and master submit to the
same GPU type; skip Blackwell (not on embers)
- phoenix/submit.sh: replace hardcoded -CL40S constraint with
-p ${BENCH_GPU_PARTITION:-gpu-l40s} for bench GPU jobs
- bench.yml: add per-case failure and success log steps using .yaml
presence to distinguish pass/fail per benchmark case
- bench.py: downgrade grind time regression check from error to warning
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- run_parallel_benchmarks.sh: add comment explaining || true on sinfo/grep pipeline; distinguish selected-by-availability vs fallback in log output; print last 50 lines of job log immediately on non-zero exit (not only when YAML is missing)
- phoenix/submit.sh: upgrade set -e to set -euo pipefail; quote $1 in cat call; log resolved partition at submission time

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…pool Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Write SLURM job ID to <output>.slurm_job_id in run_monitored_slurm_job.sh so a cancelled() step in test.yml and bench.yml can find and cancel any in-flight SLURM jobs. This handles the SIGKILL case where the EXIT trap in monitor_slurm_job.sh cannot fire. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Concurrent Phoenix jobs (cpu, gpu-acc, gpu-omp) all start simultaneously on the same runner workspace. With 'rm -rf build', they race on build/lock.yaml: the gpu-omp job writes gpu='mp', which the cpu test command then reads, causing --mp-gpu in the cpu banner and a hipfort cmake failure. Restore setup-build-cache.sh to give each (device, interface, runner) its own isolated build directory, preventing the race. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nt jobs" This reverts commit cf4f2a6.
The test command previously passed no --gpu/--no-gpu flag, so it always
read from build/lock.yaml. If lock.yaml was contaminated (stale from a
prior GPU run, NFS delays, or SLURM requeue race), the CPU test would
inherit gpu='mp' and attempt to build hipfort, causing a cmake failure.
Use ${build_opts:---no-gpu} so CPU jobs explicitly pass --no-gpu and
GPU jobs pass --gpu acc/mp. Lock.yaml content is now irrelevant.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
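The `${var:-default}` expansion guarantees an explicit flag even when the job-specific variable is unset. A tiny sketch; the function name is illustrative, not from the repo:

```shell
#!/bin/bash
# CPU jobs leave the variable unset and get an explicit --no-gpu;
# GPU jobs pass '--gpu acc' or '--gpu mp'. Either way, the test
# command never needs to consult build/lock.yaml for the device mode.
device_flags() {
    echo "${1:---no-gpu}"
}
```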
CCE 19.0.0 (cpe/25.03 upgrade) has a compiler bug (IPA SIGSEGV in m_phase_change.fpp) that causes all Frontier builds to fail. This is a pre-existing upstream Cray issue unrelated to this PR. Set continue-on-error conditionally so Frontier matrix entries show orange/warning while Phoenix and GitHub runner results remain blocking. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Actionable comments posted: 6
🧹 Nitpick comments (1)
.github/workflows/bench.yml (1)
127-134: Add `-r` flag to `read` to prevent backslash mangling. Per shellcheck SC2162, `read` without `-r` will interpret backslashes as escape characters. While `.slurm_job_id` filenames are unlikely to contain backslashes, using `-r` is a defensive best practice.

🔧 Proposed fix

```diff
-find . -name "*.slurm_job_id" | while read f; do
+find . -name "*.slurm_job_id" | while read -r f; do
```
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 70d4fdef-d331-454c-9966-5726a3c22df6
📒 Files selected for processing (23)
- .github/scripts/monitor_slurm_job.sh
- .github/scripts/prebuild-case-optimization.sh
- .github/scripts/retry-build.sh
- .github/scripts/run_monitored_slurm_job.sh
- .github/scripts/run_parallel_benchmarks.sh
- .github/scripts/setup-build-cache.sh
- .github/scripts/submit_and_monitor_bench.sh
- .github/workflows/bench.yml
- .github/workflows/frontier/bench.sh
- .github/workflows/frontier/build.sh
- .github/workflows/frontier/submit.sh
- .github/workflows/phoenix/bench.sh
- .github/workflows/phoenix/submit.sh
- .github/workflows/phoenix/test.sh
- .github/workflows/test.yml
- CMakeLists.txt
- benchmarks/5eq_rk3_weno3_hllc/case.py
- benchmarks/hypo_hll/case.py
- benchmarks/ibm/case.py
- benchmarks/igr/case.py
- benchmarks/viscous_weno5_sgb_acoustic/case.py
- toolchain/mfc/bench.py
- toolchain/mfc/build.py
💤 Files with no reviewable changes (1)
- .github/scripts/setup-build-cache.sh
```diff
 fi

-./mfc.sh test -v --max-attempts 3 -a -j $n_test_threads $device_opts -- -c phoenix
+./mfc.sh test -v --max-attempts 3 -a -j $n_test_threads $device_opts ${build_opts:---no-gpu} -- -c phoenix
```
```yaml
- name: Cancel SLURM Jobs
  if: cancelled()
  run: |
    find . -name "*.slurm_job_id" | while read f; do
      job_id=$(cat "$f")
      echo "Cancelling SLURM job $job_id"
      scancel "$job_id" 2>/dev/null || true
    done
```
```python
"t_step_stop": ARGS["steps"] if ARGS["steps"] is not None else int(2 * (5 * size + 5)),
"t_step_save": ARGS["steps"] if ARGS["steps"] is not None else int(2 * (5 * size + 5)),
```
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- self/case-optimization: clean: true on checkout
- self: build retry max_attempts 3→2, on_retry rm -rf build
- self/case-optimization: add Cancel SLURM Jobs step on cancellation
- case-optimization: drop retry wrapper on Pre-Build (login node)
- CMakeLists.txt: skip -march=native for gcov builds; add -mno-avx512fp16 to avoid binutils <2.38 assembler failures on Granite Rapids CPUs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
High-end HPC partitions (H200, H100, A100) are high-demand and give inconsistent availability. RTX 6000 (35 nodes) and L40S are more consistently available, giving more reproducible benchmark timings. Fallback also updated from gpu-l40s to gpu-rtx6000. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Split Phoenix test and case-optimization CI steps into separate Submit and Monitor phases so a transient GHA connectivity drop cannot waste compute: the Submit step is idempotent (exits immediately after writing the job ID file), and the Monitor step re-attaches to an existing RUNNING/PENDING SLURM job on rerun rather than submitting a new one.

- New submit-job.sh: idempotent sbatch submission; writes job ID to <slug>.slurm_job_id and exits immediately; skips resubmission if a live SLURM job for this slug is already RUNNING/PENDING
- Refactored submit.sh: thin wrapper that calls submit-job.sh then run_monitored_slurm_job.sh (backward-compatible; bench and Frontier callers unchanged)
- test.yml self job: clean: false on checkout so .slurm_job_id survives reruns; Test step split into Submit SLURM Test Job + Monitor SLURM Test Job (phoenix only); Frontier unchanged
- test.yml case-optimization: Run Case-Optimization Tests split into Submit + Monitor (phoenix only); Frontier unchanged; Pre-Build (SLURM) gets idempotency automatically via refactored submit.sh

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Merged f95b2c4 (Work around CCE 19.0.0 compiler bugs for Cray+OpenACC builds) with the following conflict resolutions:

- run_monitored_slurm_job.sh: kept ours (superset with id_file writing and UNKNOWN fallback)
- phoenix/submit.sh: kept ours (thin wrapper calling submit-job.sh)
- phoenix/bench.sh: kept our 64-job cap + took upstream RETRY_CLEAN_CMD
- frontier/submit.sh: took upstream (job_type detection block)
- CMakeLists.txt: took upstream (CCE IPA workaround for m_bubbles_EL and m_phase_change, comment fix)
- test.yml: took upstream github job changes (simplified cache key, Python 3.14, updated build/test commands); kept our clean: false and Submit/Monitor split steps for Phoenix self and case-optimization jobs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
--${{ matrix.precision }} expands to bare '--' when precision is empty,
which is parsed as an argument separator. Restore the PRECISION env var
approach so empty precision is correctly omitted.
Remove the manual 'Set up Python 3.14' step; the lint-gate job already
pins Python 3.12 via setup-python and the github runners use the
system default for the build/test steps.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…precision flag
- run_parallel_benchmarks.sh: fix warning message to say gpu-rtx6000
(not gpu-l40s) to match the actual fallback value
- phoenix/bench.sh: remove RETRY_CLEAN_CMD prefix on retry_build —
retry-build.sh only uses RETRY_VALIDATE_CMD and always does rm -rf build
on retry; variable was dead and re-introduced by upstream merge
- test.yml: fix --${{ matrix.precision }} expanding to bare '--' when
precision is empty; restore PRECISION env var pattern; remove manual
'Set up Python 3.14' step
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…irst

On rerun, the intent is to cancel the old job and run a fresh one — not to reuse an existing running job. Remove the early-exit that skipped resubmission for RUNNING/PENDING jobs; instead scancel any live job as a safety net (in case the 'Cancel SLURM Jobs' step did not fire due to SIGKILL), then always fall through to a fresh sbatch submission.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
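The resulting rerun behavior can be sketched as follows; the file layout and the `resubmit` name are assumptions for illustration, not the repository's exact script:

```shell
#!/bin/bash
# On rerun: cancel any stale job recorded for this slug (safety net for
# the SIGKILL case), then always submit a fresh one and record its ID.
resubmit() {
    local id_file=$1 batch_script=$2
    if [ -f "$id_file" ]; then
        scancel "$(cat "$id_file")" 2>/dev/null || true  # may already be gone
    fi
    sbatch --parsable "$batch_script" > "$id_file"       # record the new job ID
}
```

`sbatch --parsable` prints just the job ID, so the same file that the cancellation step reads is refreshed on every submission.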
…og steps

- submit-job.sh: escape $(pwd) in heredoc so it expands on the compute node after cd $SLURM_SUBMIT_DIR, not on the login node at sbatch time
- submit-job.sh: replace backtick command substitution with $() for job_slug (modern bash style, consistent with rest of script)
- bench.yml: combine 'Print Per-Case Failure Logs' and 'Print Per-Case Success Logs' into a single 'Print Per-Case Logs' step that labels each output as [PASSED] or [FAILED] inline

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- run_monitored_slurm_job.sh: remove redundant .slurm_job_id write; submit-job.sh already writes it before the monitor is called, so the second write was a no-op in all code paths
- bench.yml: replace two separate per-case log steps with one organized step: prints a summary table (N failed, N passed) with all case names first, then full logs only for failed cases — passing runs stay concise

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- submit.sh: replace backtick with $() for job_slug; add comment that the sed pipeline must stay in sync with submit-job.sh
- test.yml: explain clean: false (preserves .slurm_job_id for stale job detection) and continue-on-error on Frontier (CCE compiler instability)
- run_parallel_benchmarks.sh: explain GPU partition priority order (prefer smaller/older partitions to leave large nodes for production)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Resolve conflicts with master PRs MFlowCode#1295 (CI robustness refactor) and MFlowCode#1286 (CCE 19.0.0 workarounds). Keep coverage pruning features while adopting master's submit-job.sh split, build cache, Frontier continue-on-error, and improved SLURM monitoring. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Five CI infrastructure fixes extracted from #1286, independent of the CCE compiler workarounds.
- Benchmarks: use the batch partition with 1:59 walltime and normal QOS
- bench.yml: restore timeout-minutes to 480 (accidentally set to 240)
- submit.sh: survive monitor SIGKILL by re-checking SLURM job state after restart

Test plan

- submit.sh correctly recovers job state after runner restart/SIGKILL

🤖 Generated with Claude Code
Note
Medium Risk
Touches CI submission/monitoring logic and build flags used in production-like Release builds, so misconfigurations could break self-hosted workflows or change performance/compatibility behavior despite being mostly reliability-focused.
Overview
Improves robustness of self-hosted SLURM CI by adding run_monitored_slurm_job.sh to recover from monitor termination (e.g., runner SIGKILL) and by making monitor_slurm_job.sh only scancel jobs that are still queued/running.

Updates benchmark/test workflows to record job IDs and cancel outstanding SLURM jobs on GitHub Actions cancellation, select a consistent Phoenix GPU partition for PR vs master comparisons, and print additional per-case logs on failure/success. Also standardizes/adjusts cluster scripts (e.g., Frontier submit parameters to batch/normal with ~2h walltime, caps build/bench parallelism to 64, forces clean benchmark builds via rm -rf build, and ensures both branches use the PR's submit.sh wrapper).

Build/bench behavior tweaks: CMake now skips -march=native for gcov builds and disables AVX-512 FP16 when supported to avoid assembler incompatibilities on newer CPUs, benchmark cases run fewer default timesteps, and bench_diff no longer fails the run solely on grind-time regression (downgraded to a warning).

Written by Cursor Bugbot for commit cf4f2a6.