Fix self-hosted CI robustness: build cache, SLURM QOS, and submit resilience #1295

Merged
sbryngelson merged 42 commits into MFlowCode:master from sbryngelson:fix/ci-robustness
Mar 9, 2026
Conversation


@sbryngelson sbryngelson commented Mar 6, 2026

Summary

Five CI infrastructure fixes extracted from #1286, independent of the CCE compiler workarounds.

  • Remove persistent build cache for self-hosted test runners (stale cache caused spurious build failures)
  • Remove build cache from benchmark jobs on Phoenix and Frontier (benchmarks should always build clean)
  • Fix Frontier benchmark SLURM: use batch partition with 1:59 walltime and normal QOS
  • Fix bench.yml: restore timeout-minutes to 480 (accidentally set to 240)
  • Fix submit.sh to survive monitor SIGKILL by re-checking SLURM job state after restart

Test plan

  • Frontier benchmark CI passes without stale-cache failures
  • Frontier test suite SLURM submissions succeed with corrected QOS/partition
  • submit.sh correctly recovers job state after runner restart/SIGKILL

🤖 Generated with Claude Code


Note

Medium Risk
Touches CI submission/monitoring logic and build flags used in production-like Release builds. Although the changes are mostly reliability-focused, a misconfiguration could break self-hosted workflows or alter performance or compatibility behavior.

Overview
Improves robustness of self-hosted SLURM CI by adding run_monitored_slurm_job.sh to recover from monitor termination (e.g., runner SIGKILL) and by making monitor_slurm_job.sh only scancel jobs that are still queued/running.

Updates benchmark/test workflows to record job IDs and cancel outstanding SLURM jobs on GitHub Actions cancellation, select a consistent Phoenix GPU partition for PR-vs-master comparisons, and print additional per-case logs on failure/success. Also standardizes cluster scripts: Frontier submit parameters move to batch/normal with ~2h walltime, build/bench parallelism is capped at 64, benchmark builds are forced clean via rm -rf build, and both branches use the PR's submit.sh wrapper.

Build/bench behavior tweaks: CMake now skips -march=native for gcov builds and, when the compiler supports -mno-avx512fp16, disables AVX-512 FP16 code generation to avoid assembler incompatibilities on newer CPUs; benchmark cases run fewer default timesteps; and bench_diff no longer fails the run solely on a grind-time regression (downgraded to a warning).

Written by Cursor Bugbot for commit cf4f2a6.

Spencer Bryngelson and others added 11 commits March 6, 2026 14:35
Benchmark jobs were using the extended partition (5:59 walltime, ENG160
account) causing multi-hour queue waits and hitting GHA's 8h wall-clock
limit. The actual benchmark runs in ~20 minutes on the node. Switch to
batch + 1:59 + --qos=normal (same as the test suite jobs).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
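As a sketch, the switch described above corresponds to SBATCH directives along these lines (the CFD154 account comes from a later commit message in this PR; the exact header layout is illustrative):

```shell
#!/bin/bash
# Illustrative SBATCH header for the settings described above
#SBATCH --partition=batch     # was: extended
#SBATCH --time=1:59:00        # was: 5:59:00 (multi-hour queue waits)
#SBATCH --qos=normal
#SBATCH --account=CFD154      # was: ENG160
```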
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace setup-build-cache.sh symlink mechanism with rm -rf build
before each test run on Phoenix and Frontier. Benchmark jobs unaffected.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When the runner process is killed (exit 137) before the SLURM job
completes, sacct is used to verify the job's final state. If the
SLURM job completed with exit 0:0, the CI step passes regardless of
the monitor's exit code.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All three submit.sh scripts (phoenix, frontier, frontier_amd symlink) now
call a single helper that wraps monitor_slurm_job.sh with sacct fallback:
if the monitor is killed before the SLURM job completes, the helper
re-checks the job's final state and exits 0 if it succeeded.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
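A minimal sketch of that helper's control flow, with function names and sacct field handling assumed rather than taken from the actual script (monitor_slurm_job stands in for .github/scripts/monitor_slurm_job.sh):

```shell
#!/usr/bin/env bash
# Sketch only: recover the SLURM job's final state when the monitor dies.
set -euo pipefail

# A job "succeeded" only if sacct reports a clean COMPLETED 0:0 finish.
job_succeeded() {
    [ "$1" = "COMPLETED" ] && [ "$2" = "0:0" ]
}

monitor_with_fallback() {
    local job_id="$1"
    if monitor_slurm_job "$job_id"; then
        return 0   # normal path: the monitor saw the job finish
    fi
    # Monitor was killed (e.g. runner SIGKILL, exit 137): ask sacct for
    # the job's final state before declaring the CI step failed.
    local line state exit_code
    line=$(sacct -n -X -j "$job_id" -o State,ExitCode 2>/dev/null | head -n1 || true)
    read -r state exit_code <<<"${line:-UNKNOWN 1:1}"
    job_succeeded "$state" "$exit_code"
}
```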
- Cut benchmark time steps from 60-70 to 20 (GPU) / 10 (CPU) — still
  sufficient for grind time measurement
- Unify Frontier SLURM config: bench now uses CFD154/batch/normal like
  tests instead of ENG160/extended (2hr wall time vs 6hr)
- Reduce CI timeout from 8hr to 4hr

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
On GNR nodes (192 cores), $(nproc) returns 192, which overwhelms
the MPI daemons and causes SIGTERM (exit 143) during benchmarks.
Master lands on a 24-core node and passes while the PR lands on GNR
and fails, making benchmarks appear broken by the PR. Capping
build/bench parallelism at 64 avoids this.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
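The cap itself (64, per the review summary) reduces to a few lines; the variable name here is illustrative:

```shell
# Cap build/bench parallelism so 192-core GNR nodes don't overwhelm
# the MPI daemons; 64 is the cap cited in the review summary.
n_jobs=$(nproc 2>/dev/null || echo 8)   # fall back to 8 if nproc is absent
if [ "$n_jobs" -gt 64 ]; then
    n_jobs=64
fi
echo "using -j $n_jobs"
```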
gfortran 12+ with -march=native on Granite Rapids (GNR) CPUs emits
vmovw instructions (AVX-512 FP16) that binutils 2.35 cannot assemble,
causing LTO link failures. Add -mno-avx512fp16 when the compiler
supports it. FP16 is unused in MFC's double-precision computations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Build errors containing [/tmp/...] paths (e.g. LTO linker output) were
misinterpreted as Rich markup closing tags, crashing the error display
and masking the actual build failure. Wrap raw output in Text() to
prevent markup interpretation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sbryngelson sbryngelson marked this pull request as ready for review March 6, 2026 19:57
Copilot AI review requested due to automatic review settings March 6, 2026 19:57
Contributor

Copilot AI left a comment

Pull request overview

Improves robustness of the self-hosted CI/benchmark infrastructure by removing fragile build caching, tightening SLURM submission defaults, and adding a more resilient SLURM job monitoring wrapper.

Changes:

  • Remove persistent build/ reuse on Phoenix/Frontier self-hosted runners and cap parallel build/bench jobs to reduce node overload.
  • Update SLURM submission/monitoring to use run_monitored_slurm_job.sh and standardize Frontier SBATCH params (batch/1:59/normal).
  • Adjust build/UX details (CMake release flags for gcov/Granite Rapids assembler compatibility; Rich output rendering).

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 8 comments.

Summary per file:

  • toolchain/mfc/build.py: Wrap build stdout/stderr in rich.text.Text before rendering in panels.
  • benchmarks/5eq_rk3_weno3_hllc/case.py: Change default benchmark step count (timestep stop/save).
  • benchmarks/hypo_hll/case.py: Change default benchmark step count (timestep stop/save).
  • benchmarks/ibm/case.py: Change default benchmark step count (timestep stop/save).
  • benchmarks/igr/case.py: Change default benchmark step count (timestep stop/save).
  • benchmarks/viscous_weno5_sgb_acoustic/case.py: Change default benchmark step count (timestep stop/save).
  • CMakeLists.txt: Skip -march=native for gcov builds; add conditional -mno-avx512fp16 to avoid older-binutils assembler failures.
  • .github/workflows/phoenix/test.sh: Remove persistent build cache usage by deleting build/ before running.
  • .github/workflows/phoenix/bench.sh: Delete build/ before building; cap -j parallelism to 64.
  • .github/workflows/phoenix/submit.sh: Switch SLURM monitoring from monitor_slurm_job.sh to run_monitored_slurm_job.sh.
  • .github/workflows/frontier/build.sh: Remove build cache usage by deleting build/ before build/test.
  • .github/workflows/frontier/bench.sh: Cap CPU benchmark -j parallelism to 64.
  • .github/workflows/frontier/submit.sh: Standardize SBATCH params and switch to run_monitored_slurm_job.sh.
  • .github/workflows/bench.yml: Change workflow job timeout minutes.
  • .github/scripts/run_monitored_slurm_job.sh: New wrapper to recover from monitor termination by checking final SLURM job state via sacct.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3


📥 Commits

Reviewing files that changed from the base of the PR and between d4336a1 and 438627e.

📒 Files selected for processing (15)
  • .github/scripts/run_monitored_slurm_job.sh
  • .github/workflows/bench.yml
  • .github/workflows/frontier/bench.sh
  • .github/workflows/frontier/build.sh
  • .github/workflows/frontier/submit.sh
  • .github/workflows/phoenix/bench.sh
  • .github/workflows/phoenix/submit.sh
  • .github/workflows/phoenix/test.sh
  • CMakeLists.txt
  • benchmarks/5eq_rk3_weno3_hllc/case.py
  • benchmarks/hypo_hll/case.py
  • benchmarks/ibm/case.py
  • benchmarks/igr/case.py
  • benchmarks/viscous_weno5_sgb_acoustic/case.py
  • toolchain/mfc/build.py

Spencer Bryngelson and others added 3 commits March 6, 2026 15:07
When benchmarking master vs PR, submit_and_monitor_bench.sh was using the
master directory's submit.sh for the master bench job. Master's submit.sh
calls monitor_slurm_job.sh directly without SIGKILL recovery. When the
monitor was killed (exit 137), the master bench YAML was never found.

Fix: always use the PR's submit.sh (which calls run_monitored_slurm_job.sh
with sacct fallback) for both master and PR bench submissions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

sbryngelson and others added 10 commits March 7, 2026 03:59
- Restore bench.yml timeout-minutes to 480 (accidentally regressed to 240)
- Fix monitor_slurm_job.sh cleanup trap: check squeue before calling scancel
  so jobs that already left the queue are not cancelled, allowing
  run_monitored_slurm_job.sh to recover successfully via sacct

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
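The squeue-before-scancel guard might look like this sketch (helper names are hypothetical; an absent squeue is treated as "job no longer queued" so the sketch also runs off-cluster):

```shell
#!/usr/bin/env bash
# Sketch: only cancel a job that SLURM still reports as queued/running.
job_still_active() {
    # If squeue is unavailable (e.g. not on a cluster), report inactive.
    command -v squeue >/dev/null 2>&1 || return 1
    [ -n "$(squeue -h -j "$1" 2>/dev/null)" ]   # -h: suppress header
}

cleanup() {
    if job_still_active "$JOB_ID"; then
        scancel "$JOB_ID" 2>/dev/null || true
    fi
}

JOB_ID=12345   # placeholder job ID
trap cleanup EXIT
```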
sacct can return empty output (zero exit) when accounting is not yet
recorded or the epilog hasn't finished — the previous '|| echo UNKNOWN'
only caught non-zero exits, leaving final_state=''. Use '|| true' to
suppress exit-on-error and ${var:-default} expansion to default to
UNKNOWN when the output is empty.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
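The failure mode is easy to reproduce in plain bash; here true stands in for an sacct call that exits 0 without having recorded a state yet:

```shell
# '|| echo UNKNOWN' only fires on a non-zero exit, so a successful call
# with empty output leaves the variable empty:
broken=$(true || echo UNKNOWN)        # true exits 0, so || never runs

# The fix: suppress a failing exit with '|| true', then default the
# *value* with ${var:-UNKNOWN} so empty output also maps to UNKNOWN:
raw=$(true || true)
final_state=${raw:-UNKNOWN}
echo "broken='$broken' final_state='$final_state'"
```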
…threshold to warning

- run_parallel_benchmarks.sh: before parallel launch, query sinfo for
  available GPU partitions (H200->H100->A100->L40S->RTX6000->V100)
  and export BENCH_GPU_PARTITION so both PR and master submit to the
  same GPU type; skip Blackwell (not on embers)
- phoenix/submit.sh: replace hardcoded -CL40S constraint with
  -p ${BENCH_GPU_PARTITION:-gpu-l40s} for bench GPU jobs
- bench.yml: add per-case failure and success log steps using .yaml
  presence to distinguish pass/fail per benchmark case
- bench.py: downgrade grind time regression check from error to warning

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
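The partition-priority idea reduces to "first match between a priority list and what sinfo reports"; a sketch with the sinfo result mocked as a plain string (partition names follow the commit message above; the mocked availability is made up):

```shell
# Priority order from the commit message; in the real script the
# available set comes from sinfo, mocked here so the sketch runs anywhere.
priority="gpu-h200 gpu-h100 gpu-a100 gpu-l40s gpu-rtx6000 gpu-v100"
available="gpu-a100 gpu-v100"          # pretend sinfo reported these

BENCH_GPU_PARTITION=""
for p in $priority; do
    case " $available " in
        *" $p "*) BENCH_GPU_PARTITION=$p; break ;;
    esac
done
# Fall back (gpu-l40s in this commit; a later commit changes it) when
# nothing in the priority list is available.
echo "selected: ${BENCH_GPU_PARTITION:-gpu-l40s}"
```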
- run_parallel_benchmarks.sh: add comment explaining || true on
  sinfo/grep pipeline; distinguish selected-by-availability vs
  fallback in log output; print last 50 lines of job log immediately
  on non-zero exit (not only when YAML is missing)
- phoenix/submit.sh: upgrade set -e to set -euo pipefail; quote $1
  in cat call; log resolved partition at submission time

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…pool

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Write SLURM job ID to <output>.slurm_job_id in run_monitored_slurm_job.sh
so a cancelled() step in test.yml and bench.yml can find and cancel any
in-flight SLURM jobs. This handles the SIGKILL case where the EXIT trap
in monitor_slurm_job.sh cannot fire.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Concurrent Phoenix jobs (cpu, gpu-acc, gpu-omp) all start simultaneously
on the same runner workspace. With 'rm -rf build', they race on
build/lock.yaml: the gpu-omp job writes gpu='mp', which the cpu test
command then reads, causing --mp-gpu in the cpu banner and a hipfort
cmake failure.

Restore setup-build-cache.sh to give each (device, interface, runner)
its own isolated build directory, preventing the race.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The test command previously passed no --gpu/--no-gpu flag, so it always
read from build/lock.yaml. If lock.yaml was contaminated (stale from a
prior GPU run, NFS delays, or SLURM requeue race), the CPU test would
inherit gpu='mp' and attempt to build hipfort, causing a cmake failure.

Use ${build_opts:---no-gpu} so CPU jobs explicitly pass --no-gpu and
GPU jobs pass --gpu acc/mp. Lock.yaml content is now irrelevant.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
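The fix leans on bash's :- default expansion; a minimal demonstration:

```shell
# When build_opts is unset or empty, ${build_opts:---no-gpu} expands to
# --no-gpu, so CPU jobs never depend on build/lock.yaml contents.
unset build_opts
cpu_flags=${build_opts:---no-gpu}

build_opts="--gpu acc"
gpu_flags=${build_opts:---no-gpu}

echo "cpu job gets: $cpu_flags"
echo "gpu job gets: $gpu_flags"
```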
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 8, 2026
CCE 19.0.0 (cpe/25.03 upgrade) has a compiler bug (IPA SIGSEGV in
m_phase_change.fpp) that causes all Frontier builds to fail. This is
a pre-existing upstream Cray issue unrelated to this PR.

Set continue-on-error conditionally so Frontier matrix entries show
orange/warning while Phoenix and GitHub runner results remain blocking.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

🧹 Nitpick comments (1)
.github/workflows/bench.yml (1)

127-134: Add -r flag to read to prevent backslash mangling.

Per shellcheck SC2162, read without -r will interpret backslashes as escape characters. While .slurm_job_id filenames are unlikely to contain backslashes, using -r is a defensive best practice.

🔧 Proposed fix
-        find . -name "*.slurm_job_id" | while read f; do
+        find . -name "*.slurm_job_id" | while read -r f; do
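A runnable miniature of that loop with read -r applied and scancel replaced by echo (the file names and job IDs are fabricated):

```shell
#!/usr/bin/env bash
# Demonstrate the cancel loop on fake .slurm_job_id files.
demo=$(mktemp -d)
echo 12345 > "$demo/a.slurm_job_id"
echo 67890 > "$demo/b.slurm_job_id"

cancelled=""
# Process substitution (instead of a pipe) keeps $cancelled in this shell;
# -r stops read from treating backslashes as escapes (shellcheck SC2162).
while read -r f; do
    job_id=$(cat "$f")
    echo "Cancelling SLURM job $job_id"   # the real step calls: scancel "$job_id"
    cancelled="$cancelled $job_id"
done < <(find "$demo" -name "*.slurm_job_id" | sort)
rm -rf "$demo"
```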


📥 Commits

Reviewing files that changed from the base of the PR and between d4336a1 and aa21620.

📒 Files selected for processing (23)
  • .github/scripts/monitor_slurm_job.sh
  • .github/scripts/prebuild-case-optimization.sh
  • .github/scripts/retry-build.sh
  • .github/scripts/run_monitored_slurm_job.sh
  • .github/scripts/run_parallel_benchmarks.sh
  • .github/scripts/setup-build-cache.sh
  • .github/scripts/submit_and_monitor_bench.sh
  • .github/workflows/bench.yml
  • .github/workflows/frontier/bench.sh
  • .github/workflows/frontier/build.sh
  • .github/workflows/frontier/submit.sh
  • .github/workflows/phoenix/bench.sh
  • .github/workflows/phoenix/submit.sh
  • .github/workflows/phoenix/test.sh
  • .github/workflows/test.yml
  • CMakeLists.txt
  • benchmarks/5eq_rk3_weno3_hllc/case.py
  • benchmarks/hypo_hll/case.py
  • benchmarks/ibm/case.py
  • benchmarks/igr/case.py
  • benchmarks/viscous_weno5_sgb_acoustic/case.py
  • toolchain/mfc/bench.py
  • toolchain/mfc/build.py
💤 Files with no reviewable changes (1)
  • .github/scripts/setup-build-cache.sh


sbryngelson pushed a commit to sbryngelson/MFC that referenced this pull request Mar 8, 2026
- self/case-optimization: clean: true on checkout
- self: build retry max_attempts 3→2, on_retry rm -rf build
- self/case-optimization: add Cancel SLURM Jobs step on cancellation
- case-optimization: drop retry wrapper on Pre-Build (login node)
- CMakeLists.txt: skip -march=native for gcov builds; add
  -mno-avx512fp16 to avoid binutils <2.38 assembler failures on
  Granite Rapids CPUs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
High-end HPC partitions (H200, H100, A100) are high-demand and give
inconsistent availability. RTX 6000 (35 nodes) and L40S are more
consistently available, giving more reproducible benchmark timings.
Fallback also updated from gpu-l40s to gpu-rtx6000.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Split Phoenix test and case-optimization CI steps into separate Submit
and Monitor phases so a transient GHA connectivity drop cannot waste
compute: the Submit step is idempotent (exits immediately after writing
the job ID file), and the Monitor step re-attaches to an existing RUNNING/
PENDING SLURM job on rerun rather than submitting a new one.

- New submit-job.sh: idempotent sbatch submission; writes job ID to
  <slug>.slurm_job_id and exits immediately; skips resubmission if a
  live SLURM job for this slug is already RUNNING/PENDING
- Refactored submit.sh: thin wrapper that calls submit-job.sh then
  run_monitored_slurm_job.sh (backward-compatible; bench and Frontier
  callers unchanged)
- test.yml self job: clean: false on checkout so .slurm_job_id survives
  reruns; Test step split into Submit SLURM Test Job + Monitor SLURM
  Test Job (phoenix only); Frontier unchanged
- test.yml case-optimization: Run Case-Optimization Tests split into
  Submit + Monitor (phoenix only); Frontier unchanged; Pre-Build (SLURM)
  gets idempotency automatically via refactored submit.sh

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Merged f95b2c4 (Work around CCE 19.0.0 compiler bugs for Cray+OpenACC
builds) with the following conflict resolutions:

- run_monitored_slurm_job.sh: kept ours (superset with id_file writing
  and UNKNOWN fallback)
- phoenix/submit.sh: kept ours (thin wrapper calling submit-job.sh)
- phoenix/bench.sh: kept our 64-job cap + took upstream RETRY_CLEAN_CMD
- frontier/submit.sh: took upstream (job_type detection block)
- CMakeLists.txt: took upstream (CCE IPA workaround for m_bubbles_EL
  and m_phase_change, comment fix)
- test.yml: took upstream github job changes (simplified cache key,
  Python 3.14, updated build/test commands); kept our clean: false and
  Submit/Monitor split steps for Phoenix self and case-optimization jobs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

sbryngelson and others added 2 commits March 8, 2026 23:16
--${{ matrix.precision }} expands to bare '--' when precision is empty,
which is parsed as an argument separator. Restore the PRECISION env var
approach so empty precision is correctly omitted.

Remove the manual 'Set up Python 3.14' step; the lint-gate job already
pins Python 3.12 via setup-python and the github runners use the
system default for the build/test steps.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…precision flag

- run_parallel_benchmarks.sh: fix warning message to say gpu-rtx6000
  (not gpu-l40s) to match the actual fallback value
- phoenix/bench.sh: remove RETRY_CLEAN_CMD prefix on retry_build —
  retry-build.sh only uses RETRY_VALIDATE_CMD and always does rm -rf build
  on retry; variable was dead and re-introduced by upstream merge
- test.yml: fix --${{ matrix.precision }} expanding to bare '--' when
  precision is empty; restore PRECISION env var pattern; remove manual
  'Set up Python 3.14' step

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…irst

On rerun, the intent is to cancel the old job and run a fresh one —
not to reuse an existing running job. Remove the early-exit that skipped
resubmission for RUNNING/PENDING jobs; instead scancel any live job as a
safety net (in case the 'Cancel SLURM Jobs' step did not fire due to
SIGKILL), then always fall through to a fresh sbatch submission.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…og steps

- submit-job.sh: escape $(pwd) in heredoc so it expands on the compute
  node after cd $SLURM_SUBMIT_DIR, not on the login node at sbatch time
- submit-job.sh: replace backtick command substitution with $() for
  job_slug (modern bash style, consistent with rest of script)
- bench.yml: combine 'Print Per-Case Failure Logs' and 'Print Per-Case
  Success Logs' into a single 'Print Per-Case Logs' step that labels
  each output as [PASSED] or [FAILED] inline

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
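The heredoc point generalizes: an unescaped $(pwd) expands while the batch script is being generated, whereas \$(pwd) survives into the generated script and expands only when that script later runs. A self-contained demonstration:

```shell
#!/usr/bin/env bash
# Generate a tiny "batch script" in a temp dir, then run it there.
workdir=$(mktemp -d)
gen_dir=$(pwd)

cat > "$workdir/job.sh" <<EOF
echo unescaped: $(pwd)
echo escaped: \$(pwd)
EOF
# The generated file now contains the literal generation-time path on
# the first line and the untouched text '$(pwd)' on the second.

cd "$workdir"
output=$(bash job.sh)
cd "$gen_dir"
rm -rf "$workdir"
```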

- run_monitored_slurm_job.sh: remove redundant .slurm_job_id write;
  submit-job.sh already writes it before the monitor is called, so the
  second write was a no-op in all code paths
- bench.yml: replace two separate per-case log steps with one organized
  step: prints a summary table (N failed, N passed) with all case names
  first, then full logs only for failed cases — passing runs stay concise

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- submit.sh: replace backtick with $() for job_slug; add comment that
  the sed pipeline must stay in sync with submit-job.sh
- test.yml: explain clean: false (preserves .slurm_job_id for stale job
  detection) and continue-on-error on Frontier (CCE compiler instability)
- run_parallel_benchmarks.sh: explain GPU partition priority order
  (prefer smaller/older partitions to leave large nodes for production)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@sbryngelson sbryngelson merged commit edff972 into MFlowCode:master Mar 9, 2026
55 checks passed
sbryngelson added a commit to sbryngelson/MFC that referenced this pull request Mar 11, 2026
Resolve conflicts with master PRs MFlowCode#1295 (CI robustness refactor) and
MFlowCode#1286 (CCE 19.0.0 workarounds). Keep coverage pruning features while
adopting master's submit-job.sh split, build cache, Frontier
continue-on-error, and improved SLURM monitoring.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>