
ci: harden Phoenix CI against OOM, stale NFS handles, and transient SLURM errors#1304

Open
sbryngelson wants to merge 15 commits into MFlowCode:master from sbryngelson:frontier-runner

Conversation

@sbryngelson
Member


Summary

Three classes of Phoenix CI failures were investigated and fixed:

1. OOM during case-optimized GPU compilation on cpu-small (#1299 regression)

PR #1299 changed the case-optimization prebuild from a GPU SLURM partition to cpu-small. However, cpu-small was configured with --mem-per-cpu=2G (48 GB total for 24 tasks), which is insufficient for nvfortran --case-optimization --gpu mp -j 8. SLURM's cgroup OOM enforcer kills ptxas processes mid-compilation, causing nvfortran to report write_contents_to_file error + NVFORTRAN-F-0155.

Fix: Reconfigure the Phoenix CPU SLURM job to match actual usage:

  • --mem-per-cpu=8G (up from 2G), --ntasks-per-node=12 (down from 24) → 96 GB total, right-sized for -j 8 builds
  • --exclusive to prevent memory contention from co-scheduled jobs
  • -p cpu-small,cpu-medium,cpu-large so SLURM can schedule on whichever partition has availability
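
Taken together, the adjusted directives might look like the following sketch. The partition, task, and memory values come from the bullets above; the job-name and time-limit lines are illustrative placeholders, not the actual contents of submit-slurm-job.sh. Note that a later commit in this PR drops --exclusive after it caused "Requested node configuration is not available" errors, so treat this as the state at the time of this description.

```shell
#!/bin/bash
#SBATCH -J mfc-prebuild                      # placeholder job name
#SBATCH -p cpu-small,cpu-medium,cpu-large    # schedule wherever there is availability
#SBATCH --ntasks-per-node=12                 # down from 24
#SBATCH --mem-per-cpu=8G                     # up from 2G; 12 x 8G = 96 GB total
#SBATCH --exclusive                          # no co-scheduled jobs competing for memory
#SBATCH -t 04:00:00                          # placeholder time limit
```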

2. Stale NFS file handles causing rm -rf build to fail

On Phoenix, compute nodes access the workspace over NFS. If a previous job left a build/ directory with stale NFS handles (e.g. after a server hiccup or cross-node reuse), rm -rf build exits non-zero and kills the job immediately under set -e — before any compilation even starts.

Fix: Use rm -rf build 2>/dev/null || true in all four scripts that wipe the build directory (test.sh, bench.sh, frontier/build.sh, prebuild-case-optimization.sh). Stale-handle files are inaccessible and cannot be executed, so they cannot cause SIGILL; suppressing the error is safe.

3. Transient sbatch socket timeouts

sbatch occasionally fails with Socket timed out on send/recv operation. Previously this was a hard failure with no retry.

Fix: Extract a retry-sbatch.sh helper (mirroring retry-build.sh) that retries sbatch up to 3 times with a 30-second sleep, but only on known transient errors (socket timeout, connection refused/reset). Hard configuration failures (bad account, invalid partition, QOS violations) are not retried.
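
A minimal sketch of such a helper, assuming the function name retry_sbatch and a RETRY_SBATCH_DELAY override for testing (the real retry-sbatch.sh may differ in detail):

```shell
#!/bin/bash
# Hypothetical sketch; function and variable names are assumptions, not the
# exact contents of .github/scripts/retry-sbatch.sh.
retry_sbatch() {
    local attempt=1 max_attempts=3
    local delay="${RETRY_SBATCH_DELAY:-30}"   # seconds between attempts; overridable
    local output status
    while [ "$attempt" -le "$max_attempts" ]; do
        # '&& ... || ...' keeps 'set -e' callers alive when sbatch fails.
        output=$(sbatch "$@" 2>&1) && status=0 || status=$?
        if [ "$status" -eq 0 ]; then
            printf '%s\n' "$output"           # stdout carries only the success line
            return 0
        fi
        # Retry only on known transient submission errors.
        if printf '%s' "$output" | grep -qiE 'socket timed out|connection (refused|reset)'; then
            echo "sbatch transient failure ($attempt/$max_attempts): $output" >&2
            sleep "$delay"
            attempt=$((attempt + 1))
        else
            printf '%s\n' "$output" >&2       # hard failure: do not retry
            return "$status"
        fi
    done
    echo "sbatch failed after $max_attempts attempts" >&2
    return 1
}
```

Routing all diagnostics to stderr matters because callers capture the function's stdout via command substitution to obtain the job ID.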

Test plan

  • Phoenix gpu-acc, gpu-omp, cpu CI jobs pass without OOM or stale-handle failures
  • Case optimization prebuild on cpu partition completes successfully
  • sbatch retry fires on transient errors and passes through hard failures immediately

sbryngelson and others added 2 commits March 12, 2026 19:30
…rrors

- Increase Phoenix cpu-small mem-per-cpu from 2G to 8G, reduce ntasks
  from 24 to 12 to match actual -j 8 build parallelism and prevent OOM
  during case-optimized nvfortran --gpu mp compilation
- Add --exclusive to prevent memory contention from co-scheduled jobs
- Broaden partition to cpu-small,cpu-medium,cpu-large for availability
- Add retry-sbatch.sh: retries sbatch up to 3x on transient SLURM errors
  (socket timeout, connection failures) but not on hard config failures
- Use retry_sbatch() in submit-slurm-job.sh

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Suppress errors from stale NFS file handles when wiping the build
directory before a fresh build. Stale-handle files are inaccessible
and cannot cause SIGILL, so ignoring rm failures is safe.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@sbryngelson sbryngelson marked this pull request as ready for review March 12, 2026 23:41
Copilot AI review requested due to automatic review settings March 12, 2026 23:41
sbryngelson and others added 2 commits March 12, 2026 19:42
- Redirect all diagnostic output to stderr so it isn't captured into
  job_id when called via command substitution
- Match 'Submitted batch job NNN' exactly instead of any digit sequence,
  preventing error messages containing numbers from being mistaken for
  a valid job ID

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
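For illustration, a strict extraction along the lines this commit describes might read as follows (the helper name is assumed; the trailing '|| true' guard keeps set -euo pipefail callers from exiting when there is no match):

```shell
#!/bin/bash
set -euo pipefail

extract_job_id() {
    # Accept only the canonical success line, then take the trailing digits,
    # so error messages containing numbers are never mistaken for a job ID.
    printf '%s\n' "$1" | grep -oE 'Submitted batch job [0-9]+' | grep -oE '[0-9]+$' || true
}
```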
More portable and defensive against script content starting with -
or containing escape sequences.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copilot AI left a comment
Pull request overview

Hardens cluster-based CI (primarily Phoenix) against common infrastructure failures (OOM during compilation, stale NFS handles during cleanup, and transient SLURM sbatch submission errors) by adjusting SLURM resource requests, making build-directory cleanup tolerant, and adding an sbatch retry helper.

Changes:

  • Adjust Phoenix CPU SLURM resource directives (more memory per CPU, fewer tasks per node, exclusive allocation, and broader CPU partition selection).
  • Suppress rm -rf build failures to avoid job aborts on stale NFS file handles across build/test/bench/prebuild scripts.
  • Add retry-sbatch.sh and route SLURM submissions through it from submit-slurm-job.sh.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Summary per file:
  • .github/workflows/frontier/build.sh: Ignore rm -rf build errors before building/testing on Frontier login node.
  • .github/workflows/common/test.sh: Ignore Phoenix rm -rf build errors to avoid stale-NFS-handle aborts.
  • .github/workflows/common/bench.sh: Ignore Phoenix rm -rf build errors to avoid stale-NFS-handle aborts.
  • .github/scripts/submit-slurm-job.sh: Increase Phoenix CPU resources and submit via new retry_sbatch helper.
  • .github/scripts/retry-sbatch.sh: New helper to retry sbatch on transient submission errors.
  • .github/scripts/prebuild-case-optimization.sh: Ignore rm -rf build errors before case-optimization prebuild.
Comments suppressed due to low confidence (1)

.github/workflows/common/bench.sh:33

  • Same issue as in common/test.sh: ignoring rm -rf build errors can leave build/ present, causing if [ ! -d "build" ] to skip the rebuild on Phoenix even though the script intends to always start fresh on heterogeneous nodes. Make the Phoenix rebuild unconditional (or check a stronger marker like build/install) so stale-handle cleanup failures don’t silently fall through into running benchmarks with a partial/stale build.
# Phoenix: always nuke stale builds (heterogeneous compute nodes → ISA mismatch risk).
if [ "$job_cluster" = "phoenix" ]; then
    rm -rf build 2>/dev/null || true
fi

if [ ! -d "build" ]; then
    source .github/scripts/retry-build.sh
    retry_build ./mfc.sh build -j $n_jobs $build_opts || exit 1

sbryngelson and others added 9 commits March 12, 2026 19:46
…nd bench.sh

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ectory

If stale NFS handles prevent full deletion of build/, the old
'if [ ! -d build ]' guard would skip the rebuild entirely, leaving
stale binaries from a different compute node that could cause SIGILL.
Force a rebuild on Phoenix unconditionally.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace 'rm -rf build || true' with 'mv build build.stale.$$' followed
by a background best-effort delete. mv is a metadata-only operation that
succeeds even when files have stale NFS handles, guaranteeing build/ is
gone before the fresh build starts. Old stale trees are cleaned up
opportunistically in the background.

This also simplifies test.sh and bench.sh: since mv reliably removes
build/, the Phoenix-specific override in the build condition is no longer
needed — the plain '[ ! -d build ]' check is sufficient again.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
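As a sketch, the rename-then-background-delete pattern this commit describes might read as follows (the real helper lives in .github/scripts/clean-build.sh; exact contents may differ):

```shell
#!/bin/bash
clean_build() {
    # Remove this job's leftover stale tree from an earlier run, if any.
    rm -rf "build.stale.$$" 2>/dev/null || true
    # mv is metadata-only within a filesystem, so it succeeds even when
    # files under build/ have stale NFS handles (ESTALE).
    mv build "build.stale.$$" 2>/dev/null || true
    # Best-effort delete in the background; a failure here cannot block the build.
    rm -rf "build.stale.$$" 2>/dev/null & disown
}
```

Because the mv completes (or no-ops) before the function returns, the plain '[ ! -d build ]' check downstream is reliable again.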
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…F newline

Replace 'rm -rf build.stale.*' with 'rm -rf build.stale.$$' so each job
only cleans up its own renamed directory, avoiding a race if concurrent
matrix jobs share a workspace. Also add trailing newline to retry-sbatch.sh.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The rename-then-background-delete pattern was copy-pasted across four
scripts. Move it into .github/scripts/clean-build.sh and source it at
each call site.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tion

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Under set -e, retry_sbatch returning 1 exits the script immediately.
The if [ -z "$job_id" ] block was never reachable.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
'try again' is too broad and could match hard SLURM policy errors like
'QOS violates policy, try again later', causing unintended retries.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 13, 2026
@MFlowCode MFlowCode deleted a comment from coderabbitai bot Mar 13, 2026
--exclusive caused 'Requested node configuration is not available' on
cpu-small,cpu-medium,cpu-large partitions. The --mem-per-cpu=8G
reservation already prevents memory contention from co-scheduled jobs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 64ef40ab-0e3e-4b06-ab83-216300f44047

📥 Commits

Reviewing files that changed from the base of the PR and between 598f5a5 and defdd7c.

📒 Files selected for processing (7)
  • .github/scripts/clean-build.sh
  • .github/scripts/prebuild-case-optimization.sh
  • .github/scripts/retry-sbatch.sh
  • .github/scripts/submit-slurm-job.sh
  • .github/workflows/common/bench.sh
  • .github/workflows/common/test.sh
  • .github/workflows/frontier/build.sh

Comment on lines +12 to +14
rm -rf build.stale.* 2>/dev/null || true
mv build "build.stale.$$" 2>/dev/null || true
rm -rf "build.stale.$$" 2>/dev/null & disown

⚠️ Potential issue | 🟠 Major

Scope stale cleanup to this job’s PID.

Line 12 currently deletes build.stale.*, which is broader than the PID-scoped behavior described in this script and can reintroduce cross-job cleanup races in shared workspaces. Keep deletion scoped to build.stale.$$.

Proposed fix
 clean_build() {
-    # Clean up leftover stale directories from previous runs before adding a new one.
-    rm -rf build.stale.* 2>/dev/null || true
+    # Clean only this process namespace to avoid cross-job races.
+    rm -rf "build.stale.$$" 2>/dev/null || true
     mv build "build.stale.$$" 2>/dev/null || true
     rm -rf "build.stale.$$" 2>/dev/null & disown
 }

Comment on lines +19 to +20
job_id=$(echo "$submit_output" | grep -oE 'Submitted batch job ([0-9]+)' | grep -oE '[0-9]+$')
if [ -n "$job_id" ]; then

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -u

echo "Checking current extraction pattern in retry-sbatch.sh:"
rg -n "job_id=.*Submitted batch job" .github/scripts/retry-sbatch.sh

echo
echo "Reproducing strict-mode early-exit behavior:"
output="$(
  bash -lc '
    set -euo pipefail
    submit_output="Socket timed out on send/recv operation"
    job_id=$(echo "$submit_output" | grep -oE "Submitted batch job ([0-9]+)" | grep -oE "[0-9]+$")
    echo "RETRY_LOGIC_REACHED"
  ' 2>&1
)"
status=$?
printf 'exit_status=%s\n' "$status"
printf '%s\n' "$output"

echo
echo "Expected: non-zero exit_status and no RETRY_LOGIC_REACHED marker."

Repository: MFlowCode/MFC

Length of output: 444


🏁 Script executed:

cat -n .github/scripts/retry-sbatch.sh

Repository: MFlowCode/MFC

Length of output: 1834


🏁 Script executed:

rg -n "retry.sbatch|retry_sbatch" --type sh --type bash

Repository: MFlowCode/MFC

Length of output: 85


🏁 Script executed:

rg -n "retry.sbatch|retry_sbatch"

Repository: MFlowCode/MFC

Length of output: 39


🏁 Script executed:

rg -n "retry-sbatch"

Repository: MFlowCode/MFC

Length of output: 39


🏁 Script executed:

git ls-files | head -20
find .github -type f -name "*.sh" | head -20

Repository: MFlowCode/MFC

Length of output: 1190


🏁 Script executed:

rg -l "retry-sbatch|retry_sbatch" .github/

Repository: MFlowCode/MFC

Length of output: 124


🏁 Script executed:

cat -n .github/scripts/submit-slurm-job.sh

Repository: MFlowCode/MFC

Length of output: 6880


Guard job-id extraction to prevent early exit under strict mode.

At line 19, the grep pipeline in the command substitution can fail without matching and return non-zero. Since the caller script (.github/scripts/submit-slurm-job.sh) uses set -euo pipefail and sources this function, the unguarded pipeline causes the script to exit before the retry logic at lines 20–35 executes, breaking the entire retry mechanism.

Proposed fix
-        job_id=$(echo "$submit_output" | grep -oE 'Submitted batch job ([0-9]+)' | grep -oE '[0-9]+$')
+        job_id=$(echo "$submit_output" | grep -oE 'Submitted batch job ([0-9]+)' | grep -oE '[0-9]+$' || true)

@codecov

codecov bot commented Mar 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 45.34%. Comparing base (598f5a5) to head (65e2642).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1304   +/-   ##
=======================================
  Coverage   45.34%   45.34%           
=======================================
  Files          70       70           
  Lines       20514    20514           
  Branches     1954     1954           
=======================================
  Hits         9303     9303           
  Misses      10084    10084           
  Partials     1127     1127           

Phoenix compute nodes have a small /tmp. With 8 parallel test
threads each spawning MPI processes over ~96 minutes, it fills up
and ORTE fails to create its session directory, causing the last
batch of tests to fail with 'No such file or directory'. Apply the
same TMPDIR redirect to project storage that bench.sh already uses.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
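A hedged sketch of that redirect follows; MFC_TMP_BASE is an assumed placeholder standing in for the project-storage path that test.sh actually uses, and mktemp is substituted for whatever naming scheme the script employs:

```shell
#!/bin/bash
# Assumed placeholder: MFC_TMP_BASE stands in for the project-storage root;
# this is a sketch, not the script's actual contents.
tmpbase="${MFC_TMP_BASE:-$PWD/mytmp_build}"
mkdir -p "$tmpbase"
# mktemp -d creates the run directory atomically, so concurrent CI jobs
# sharing the same base cannot collide.
currentdir=$(mktemp -d "$tmpbase/run-XXXXXX")
export TMPDIR="$currentdir"   # MPI/ORTE session directories land here, not /tmp
```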
@github-actions

Claude Code Review

Head SHA: 65e2642
Files changed: 7

  • .github/scripts/clean-build.sh (new)
  • .github/scripts/retry-sbatch.sh (new)
  • .github/scripts/submit-slurm-job.sh
  • .github/scripts/prebuild-case-optimization.sh
  • .github/workflows/common/bench.sh
  • .github/workflows/common/test.sh
  • .github/workflows/frontier/build.sh

Summary

  • Replaces rm -rf build with a mv-then-background-rm pattern (clean_build) to handle stale NFS ESTALE errors gracefully.
  • Adds retry_sbatch() helper that retries sbatch up to 3 times only for known transient errors (socket timeout, connection refused/reset).
  • Increases Phoenix CPU SLURM memory (2G→8G per CPU, 24→12 tasks/node) to prevent OOM during nvfortran --case-optimization.
  • Adds a Phoenix-specific TMPDIR redirect in test.sh to avoid filling /tmp under parallel MPI test loads.

Findings

[HIGH] Hardcoded personal path in test.sh lines 15–16

tmpbuild=/storage/project/r-sbryngelson3-0/sbryngelson3/mytmp_build

This embeds a specific user's project storage path. If the account changes, tests fail with no obvious error. Consider using an environment variable (e.g., $MFC_TMPDIR or $SCRATCH) defined in the cluster's CI environment/secrets, with a fallback that fails loudly if unset.

[MEDIUM] PR body describes --exclusive but it is absent from the diff (submit-slurm-job.sh line 88)
The PR description states --exclusive was added to prevent memory contention. It does not appear in the SBATCH options in the diff. Either the body is stale, or the --exclusive flag was dropped intentionally — the description should be updated to match the actual change, or the flag should be added.

[LOW] TMPDIR collision risk in test.sh line 17

currentdir=$tmpbuild/run-$(( RANDOM % 9000 ))

$RANDOM % 9000 gives only 9000 distinct values. Concurrent CI jobs on the same node sharing $tmpbuild could collide. mktemp -d "$tmpbuild/run-XXXXXX" is atomic and collision-free.

[LOW] Unquoted variable in test.sh line 18–19

mkdir -p $tmpbuild
mkdir -p $currentdir

Both variables should be double-quoted ("$tmpbuild", "$currentdir") to guard against word-splitting if the path ever contains spaces.


Improvement opportunities

  • clean-build.sh line 12: The glob build.stale.* cleanup at the top of clean_build could interfere with a sibling job's background rm -rf if two jobs share a workspace directory. Scoping the stale sentinel to the job name or runner ID (e.g., build.stale.$GITHUB_RUN_ID.*) would make concurrency safer.
  • retry-sbatch.sh: The transient-error regex is case-insensitive (-i) and broad. Adding sbatch version-pinned strings or testing against known Phoenix error messages would reduce the chance of a mis-classification where a hard failure is retried.
