ci: harden Phoenix CI against OOM, stale NFS handles, and transient SLURM errors #1304
sbryngelson wants to merge 15 commits into MFlowCode:master
Conversation
…rrors

- Increase Phoenix cpu-small mem-per-cpu from 2G to 8G, and reduce ntasks from 24 to 12 to match the actual -j 8 build parallelism and prevent OOM during case-optimized nvfortran --gpu mp compilation
- Add --exclusive to prevent memory contention from co-scheduled jobs
- Broaden the partition list to cpu-small,cpu-medium,cpu-large for availability
- Add retry-sbatch.sh: retries sbatch up to 3x on transient SLURM errors (socket timeouts, connection failures) but not on hard configuration failures
- Use retry_sbatch() in submit-slurm-job.sh

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Suppress errors from stale NFS file handles when wiping the build directory before a fresh build. Stale-handle files are inaccessible and cannot cause SIGILL, so ignoring rm failures is safe. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Redirect all diagnostic output to stderr so it isn't captured into job_id when called via command substitution
- Match 'Submitted batch job NNN' exactly instead of any digit sequence, preventing error messages containing numbers from being mistaken for a valid job ID

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
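A quick sketch of why the exact match matters (the sample strings are hypothetical examples, not actual SLURM output captured in CI):

```shell
# Hypothetical demo of the stricter job-id extraction described in this commit.
extract_job_id() {
    printf '%s\n' "$1" \
        | grep -oE 'Submitted batch job [0-9]+' \
        | grep -oE '[0-9]+$' || true
}

good="Submitted batch job 123456"
bad="sbatch: error: retry 2 failed after 30 seconds"   # contains digits, but no job ID

extract_job_id "$good"   # prints 123456
extract_job_id "$bad"    # prints nothing: a bare [0-9]+ match would have grabbed '2'
```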
More portable and defensive against script content starting with - or containing escape sequences. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
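For illustration, a minimal sketch of the echo-to-printf hardening this commit refers to (the variable name is made up):

```shell
# bash's builtin echo treats a lone '-e' argument as an option, while
# printf '%s\n' passes any content through verbatim, including leading
# dashes and backslash escape sequences.
script_body='-e'

via_echo=$(echo "$script_body")             # empty in bash: '-e' swallowed as an option
via_printf=$(printf '%s\n' "$script_body")  # exactly '-e'

printf 'echo gave: [%s], printf gave: [%s]\n' "$via_echo" "$via_printf"
```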
Pull request overview
Hardens cluster-based CI (primarily Phoenix) against common infrastructure failures (OOM during compilation, stale NFS handles during cleanup, and transient SLURM sbatch submission errors) by adjusting SLURM resource requests, making build-directory cleanup tolerant, and adding an sbatch retry helper.
Changes:
- Adjust Phoenix CPU SLURM resource directives (more memory per CPU, fewer tasks per node, exclusive allocation, and broader CPU partition selection).
- Suppress `rm -rf build` failures to avoid job aborts on stale NFS file handles across the build/test/bench/prebuild scripts.
- Add `retry-sbatch.sh` and route SLURM submissions through it from `submit-slurm-job.sh`.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| .github/workflows/frontier/build.sh | Ignore rm -rf build errors before building/testing on Frontier login node. |
| .github/workflows/common/test.sh | Ignore Phoenix rm -rf build errors to avoid stale-NFS-handle aborts. |
| .github/workflows/common/bench.sh | Ignore Phoenix rm -rf build errors to avoid stale-NFS-handle aborts. |
| .github/scripts/submit-slurm-job.sh | Increase Phoenix CPU resources and submit via new retry_sbatch helper. |
| .github/scripts/retry-sbatch.sh | New helper to retry sbatch on transient submission errors. |
| .github/scripts/prebuild-case-optimization.sh | Ignore rm -rf build errors before case-optimization prebuild. |
Comments suppressed due to low confidence (1)
.github/workflows/common/bench.sh:33
- Same issue as in `common/test.sh`: ignoring `rm -rf build` errors can leave `build/` present, causing `if [ ! -d "build" ]` to skip the rebuild on Phoenix even though the script intends to always start fresh on heterogeneous nodes. Make the Phoenix rebuild unconditional (or check a stronger marker like `build/install`) so stale-handle cleanup failures don't silently fall through into running benchmarks with a partial or stale build.
```shell
# Phoenix: always nuke stale builds (heterogeneous compute nodes → ISA mismatch risk).
if [ "$job_cluster" = "phoenix" ]; then
    rm -rf build 2>/dev/null || true
fi
if [ ! -d "build" ]; then
    source .github/scripts/retry-build.sh
    retry_build ./mfc.sh build -j $n_jobs $build_opts || exit 1
```
…nd bench.sh Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ectory If stale NFS handles prevent full deletion of build/, the old 'if [ ! -d build ]' guard would skip the rebuild entirely, leaving stale binaries from a different compute node that could cause SIGILL. Force a rebuild on Phoenix unconditionally. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace 'rm -rf build || true' with 'mv build build.stale.$$' followed by a background best-effort delete. mv is a metadata-only operation that succeeds even when files have stale NFS handles, guaranteeing build/ is gone before the fresh build starts. Old stale trees are cleaned up opportunistically in the background. This also simplifies test.sh and bench.sh: since mv reliably removes build/, the Phoenix-specific override in the build condition is no longer needed — the plain '[ ! -d build ]' check is sufficient again. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
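A self-contained sketch of the rename-then-background-delete pattern (simplified from the commit description; the real script may differ in details):

```shell
# mv is a metadata-only rename, so it succeeds even when files inside build/
# hold stale NFS handles; the slow recursive delete then happens off the
# critical path in the background.
clean_build() {
    mv build "build.stale.$$" 2>/dev/null || true
    rm -rf "build.stale.$$" 2>/dev/null & disown
}

workdir=$(mktemp -d) && cd "$workdir"
mkdir -p build/deep/tree
clean_build
[ ! -d build ] && echo "build/ gone immediately; cleanup continues in background"
```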
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…F newline Replace 'rm -rf build.stale.*' with 'rm -rf build.stale.$$' so each job only cleans up its own renamed directory, avoiding a race if concurrent matrix jobs share a workspace. Also add trailing newline to retry-sbatch.sh. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The rename-then-background-delete pattern was copy-pasted across four scripts. Move it into .github/scripts/clean-build.sh and source it at each call site. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tion Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Under set -e, retry_sbatch returning 1 exits the script immediately. The if [ -z "$job_id" ] block was never reachable. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
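The dead-code situation can be reproduced in isolation (a minimal sketch, not the actual script):

```shell
# Under set -e, an assignment whose command substitution fails terminates the
# script, so a [ -z "$job_id" ] fallback placed after it can never run.
out=$(bash -c '
    set -e
    fail() { return 1; }
    job_id=$(fail)
    echo "fallback check reached"
' 2>&1) || true

[ -z "$out" ] && echo "script exited before the fallback check"
```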
'try again' is too broad and could match hard SLURM policy errors like 'QOS violates policy, try again later', causing unintended retries. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
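The distinction the commit draws could be sketched as a small classifier (patterns assumed from the commit text, not copied from the real helper):

```shell
# Match only specific transient messages; a broad phrase like 'try again'
# would wrongly match hard errors such as 'QOS ... policy, try again later'.
is_transient_sbatch_error() {
    case "$1" in
        *"Socket timed out"*|*"Connection refused"*|*"Connection reset"*) return 0 ;;
        *) return 1 ;;
    esac
}

is_transient_sbatch_error "Socket timed out on send/recv operation" && echo "retry"
is_transient_sbatch_error "sbatch: error: QOS policy, try again later" || echo "fail fast"
```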
--exclusive caused 'Requested node configuration is not available' on cpu-small,cpu-medium,cpu-large partitions. The --mem-per-cpu=8G reservation already prevents memory contention from co-scheduled jobs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Actionable comments posted: 2
📒 Files selected for processing (7)

- .github/scripts/clean-build.sh
- .github/scripts/prebuild-case-optimization.sh
- .github/scripts/retry-sbatch.sh
- .github/scripts/submit-slurm-job.sh
- .github/workflows/common/bench.sh
- .github/workflows/common/test.sh
- .github/workflows/frontier/build.sh
```shell
rm -rf build.stale.* 2>/dev/null || true
mv build "build.stale.$$" 2>/dev/null || true
rm -rf "build.stale.$$" 2>/dev/null & disown
```
Scope stale cleanup to this job’s PID.
Line 12 currently deletes `build.stale.*`, which is broader than the PID-scoped behavior described in this script and can reintroduce cross-job cleanup races in shared workspaces. Keep deletion scoped to `build.stale.$$`.
Proposed fix
```diff
 clean_build() {
-    # Clean up leftover stale directories from previous runs before adding a new one.
-    rm -rf build.stale.* 2>/dev/null || true
+    # Clean only this process namespace to avoid cross-job races.
+    rm -rf "build.stale.$$" 2>/dev/null || true
     mv build "build.stale.$$" 2>/dev/null || true
     rm -rf "build.stale.$$" 2>/dev/null & disown
 }
```

```shell
job_id=$(echo "$submit_output" | grep -oE 'Submitted batch job ([0-9]+)' | grep -oE '[0-9]+$')
if [ -n "$job_id" ]; then
```
🧩 Analysis chain

🏁 Script executed:

```shell
#!/bin/bash
set -u
echo "Checking current extraction pattern in retry-sbatch.sh:"
rg -n "job_id=.*Submitted batch job" .github/scripts/retry-sbatch.sh
echo
echo "Reproducing strict-mode early-exit behavior:"
output="$(
  bash -lc '
    set -euo pipefail
    submit_output="Socket timed out on send/recv operation"
    job_id=$(echo "$submit_output" | grep -oE "Submitted batch job ([0-9]+)" | grep -oE "[0-9]+$")
    echo "RETRY_LOGIC_REACHED"
  ' 2>&1
)"
status=$?
printf 'exit_status=%s\n' "$status"
printf '%s\n' "$output"
echo
echo "Expected: non-zero exit_status and no RETRY_LOGIC_REACHED marker."
```
Guard job-id extraction to prevent early exit under strict mode.
At line 19, the grep pipeline in the command substitution can fail without matching and return non-zero. Since the caller script (.github/scripts/submit-slurm-job.sh) uses set -euo pipefail and sources this function, the unguarded pipeline causes the script to exit before the retry logic at lines 20–35 executes, breaking the entire retry mechanism.
Proposed fix

```diff
-    job_id=$(echo "$submit_output" | grep -oE 'Submitted batch job ([0-9]+)' | grep -oE '[0-9]+$')
+    job_id=$(echo "$submit_output" | grep -oE 'Submitted batch job ([0-9]+)' | grep -oE '[0-9]+$' || true)
```

📝 Committable suggestion
```shell
job_id=$(echo "$submit_output" | grep -oE 'Submitted batch job ([0-9]+)' | grep -oE '[0-9]+$' || true)
if [ -n "$job_id" ]; then
```
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@           Coverage Diff           @@
##           master    #1304   +/-   ##
=======================================
  Coverage   45.34%   45.34%
=======================================
  Files          70       70
  Lines       20514    20514
  Branches     1954     1954
=======================================
  Hits         9303     9303
  Misses      10084    10084
  Partials     1127     1127
```

☔ View full report in Codecov by Sentry.
Phoenix compute nodes have a small /tmp. With 8 parallel test threads each spawning MPI processes over ~96 minutes, it fills up and ORTE fails to create its session directory, causing the last batch of tests to fail with 'No such file or directory'. Apply the same TMPDIR redirect to project storage that bench.sh already uses. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
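The redirect might look roughly like this (the storage root is a hypothetical placeholder variable; the actual path lives in the CI scripts):

```shell
# Point MPI/ORTE session directories at roomy project storage instead of the
# small node-local /tmp. PROJECT_STORAGE is an assumed placeholder, and the
# per-run subdirectory keeps concurrent jobs from colliding.
scratch="${PROJECT_STORAGE:-$HOME}/mfc-tmp/run-$$"
mkdir -p "$scratch"
export TMPDIR="$scratch"
echo "session files will land under $TMPDIR"
```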
Claude Code Review

Head SHA: 65e2642
Summary
Findings

- [HIGH] Hardcoded personal path in `tmpbuild=/storage/project/r-sbryngelson3-0/sbryngelson3/mytmp_build`. This embeds a specific user's project storage path. If the account changes, tests fail with no obvious error. Consider using an environment variable (e.g., …).
- [MEDIUM] PR body describes …
- [LOW] TMPDIR collision risk in `currentdir=$tmpbuild/run-$(( RANDOM % 9000 ))`
- [LOW] Unquoted variables in `mkdir -p $tmpbuild` and `mkdir -p $currentdir`. Both variables should be double-quoted.

Improvement opportunities
Summary

Two classes of Phoenix CI failures were investigated and fixed:

1. OOM during case-optimized GPU compilation on cpu-small (#1299 regression)

PR #1299 changed the case-optimization prebuild from a GPU SLURM partition to cpu-small. However, cpu-small was configured with `--mem-per-cpu=2G` (48 GB total for 24 tasks), which is insufficient for `nvfortran --case-optimization --gpu mp -j 8`. SLURM's cgroup OOM enforcer kills ptxas processes mid-compilation, causing nvfortran to report `write_contents_to_file error` + `NVFORTRAN-F-0155`.

Fix: Reconfigure the Phoenix CPU SLURM job to match actual usage:

- `--mem-per-cpu=8G` (up from 2G) and `--ntasks-per-node=12` (down from 24) → 96 GB total, right-sized for `-j 8` builds
- `--exclusive` to prevent memory contention from co-scheduled jobs
- `-p cpu-small,cpu-medium,cpu-large` so SLURM can schedule on whichever partition has availability

2. Stale NFS file handles causing `rm -rf build` to fail

On Phoenix, compute nodes access the workspace over NFS. If a previous job left a `build/` directory with stale NFS handles (e.g. after a server hiccup or cross-node reuse), `rm -rf build` exits non-zero and, under `set -e`, kills the job immediately, before any compilation even starts.

Fix: Use `rm -rf build 2>/dev/null || true` in all four scripts that wipe the build directory (`test.sh`, `bench.sh`, `frontier/build.sh`, `prebuild-case-optimization.sh`). Stale-handle files are inaccessible and cannot be executed, so they cannot cause SIGILL; suppressing the error is safe.

3. Transient sbatch socket timeouts

`sbatch` occasionally fails with `Socket timed out on send/recv operation`. Previously this was a hard failure with no retry.

Fix: Extract a `retry-sbatch.sh` helper (mirroring `retry-build.sh`) that retries `sbatch` up to 3 times with a 30-second sleep, but only on known transient errors (socket timeout, connection refused/reset). Hard configuration failures (bad account, invalid partition, QOS violations) are not retried.

Test plan