ci: remove nick-fields/retry wrapper and add shared sinfo-based GPU partition selection #1299
Conversation
…nodes The JS action wrapper gets SIGKILL'd on Frontier login nodes under memory pressure, falsely failing the Build step even when build.sh succeeds. retry_build() inside build.sh already handles 2-attempt retry with rm -rf build between attempts. Also move gpu-v100 to last in Phoenix GPU partition priority so SLURM prefers newer GPU nodes (a100/h100/l40s/h200) over the aging V100s that have had recurring driver issues.
Extract partition selection into select-gpu-partition.sh so both test jobs (submit-job.sh) and benchmark jobs (run_parallel_benchmarks.sh) use the same sinfo-based logic with a consistent priority order: gpu-rtx6000 -> gpu-l40s -> gpu-v100 -> gpu-h200 -> gpu-h100 -> gpu-a100 Tests now dynamically pick the best available partition rather than submitting to a static multi-partition list, matching the benchmark approach. Bench still exports BENCH_GPU_PARTITION so PR and master land on the same GPU type for fair comparisons.
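A minimal sketch of that selection loop (the names `SELECTED_GPU_PARTITION` and `GPU_PARTITION_MIN_NODES` follow the PR; the `sinfo` function below is a local mock standing in for the real SLURM command so the sketch runs anywhere):

```shell
#!/usr/bin/env bash
# Mock sinfo for illustration only: pretend only gpu-l40s has idle/mix nodes.
sinfo() { if [ "$2" = "gpu-l40s" ]; then printf 'idle\nmix\n'; else printf 'alloc\n'; fi; }

# Priority order from the PR description.
priority="gpu-rtx6000 gpu-l40s gpu-v100 gpu-h200 gpu-h100 gpu-a100"
min_nodes="${GPU_PARTITION_MIN_NODES:-1}"   # benchmarks request 2

SELECTED_GPU_PARTITION=""
for part in $priority; do
    # Count nodes whose state starts with idle or mix in this partition.
    n=$(sinfo -p "$part" --noheader -o "%t" | grep -cE '^(idle|mix)' || true)
    if [ "$n" -ge "$min_nodes" ]; then
        SELECTED_GPU_PARTITION="$part"
        break
    fi
done
echo "selected: ${SELECTED_GPU_PARTITION:-none}"
```

With the mock above, `gpu-rtx6000` reports zero idle/mix nodes, so the loop falls through to `gpu-l40s`.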
Pull request overview
This PR updates CI/SLURM automation to centralize Phoenix GPU partition selection logic and simplify the non-Phoenix build step in GitHub Actions.
Changes:
- Replace the `nick-fields/retry` wrapper in `test.yml` with a direct `run` step + `timeout-minutes`.
- Introduce a reusable `.github/scripts/select-gpu-partition.sh` and use it from both Phoenix benchmark and test submission paths.
- Simplify `.github/scripts/retry-build.sh` by removing support for a post-build validation hook.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `.github/workflows/test.yml` | Removes external retry wrapper around non-Phoenix build step. |
| `.github/workflows/phoenix/submit-job.sh` | Uses shared GPU partition selection for non-benchmark Phoenix submissions. |
| `.github/scripts/select-gpu-partition.sh` | New shared helper to pick an available Phoenix GPU partition via `sinfo`. |
| `.github/scripts/run_parallel_benchmarks.sh` | Refactors inline partition selection into the shared helper script. |
| `.github/scripts/retry-build.sh` | Removes `RETRY_VALIDATE_CMD`-based post-build validation behavior from retry loop. |
```sh
# Provides retry_build(): 2-attempt loop.
# On failure of attempt 1, nukes the entire build directory before attempt 2.
# Set RETRY_VALIDATE_CMD to run a post-build validation; failure triggers a retry.
# Usage: source .github/scripts/retry-build.sh
#        retry_build ./mfc.sh build -j 8 --gpu acc

retry_build() {
    local validate_cmd="${RETRY_VALIDATE_CMD:-}"
    local max_attempts=2
    local attempt=1
    while [ $attempt -le $max_attempts ]; do
        echo "Build attempt $attempt of $max_attempts..."
        if "$@"; then
            if [ -n "$validate_cmd" ]; then
                if ! eval "$validate_cmd"; then
                    echo "Post-build validation failed on attempt $attempt."
                    if [ $attempt -lt $max_attempts ]; then
                        echo "  Nuking build directory before retry..."
                        rm -rf build 2>/dev/null || true
                        sleep 5
                        attempt=$((attempt + 1))
                        continue
                    else
                        echo "Validation still failing after $max_attempts attempts."
                        return 1
                    fi
                fi
            fi
            echo "Build succeeded on attempt $attempt."
            return 0
        fi
        # (continuation reconstructed; the diff view truncates here: handle a
        # failure of the build command itself the same way as failed validation)
        echo "Build failed on attempt $attempt."
        if [ $attempt -lt $max_attempts ]; then
            rm -rf build 2>/dev/null || true
            sleep 5
        fi
        attempt=$((attempt + 1))
    done
    return 1
}
```
```diff
 if [ "$job_type" = "bench" ]; then
-    bench_partition="${BENCH_GPU_PARTITION:-gpu-rtx6000}"
-    echo "Submitting bench GPU job to partition: $bench_partition (BENCH_GPU_PARTITION=${BENCH_GPU_PARTITION:-<unset, using default>})"
-    sbatch_gpu_opts="\
-#SBATCH -p $bench_partition
-#SBATCH --ntasks-per-node=4    # Number of cores per node required
-#SBATCH -G2\
-"
+    # BENCH_GPU_PARTITION is pre-selected by run_parallel_benchmarks.sh so both
+    # PR and master jobs land on the same GPU type for a fair comparison.
+    gpu_partition="${BENCH_GPU_PARTITION:-gpu-rtx6000}"
+    echo "Submitting bench GPU job to partition: $gpu_partition (BENCH_GPU_PARTITION=${BENCH_GPU_PARTITION:-<unset, using default>})"
     sbatch_time="#SBATCH -t 04:00:00"
 else
-    sbatch_gpu_opts="\
-#SBATCH -p gpu-v100,gpu-a100,gpu-h100,gpu-l40s,gpu-h200
+    source "$(dirname "${BASH_SOURCE[0]}")/../../scripts/select-gpu-partition.sh"
+    gpu_partition="$SELECTED_GPU_PARTITION"
     sbatch_time="#SBATCH -t 03:00:00"
```
🧹 Nitpick comments (1)
.github/workflows/phoenix/submit-job.sh (1)
**44-44:** Consider using a more robust path resolution.

The relative path `../../scripts/select-gpu-partition.sh` works correctly but is fragile if the script hierarchy changes. Consider extracting the repository root and using an absolute path pattern:

♻️ Optional: Use repo-root-relative path

```diff
 else
-    source "$(dirname "${BASH_SOURCE[0]}")/../../scripts/select-gpu-partition.sh"
+    _repo_root="$(cd "$(dirname "${BASH_SOURCE[0]}")/../../.." && pwd)"
+    source "${_repo_root}/.github/scripts/select-gpu-partition.sh"
     gpu_partition="$SELECTED_GPU_PARTITION"
     sbatch_time="#SBATCH -t 03:00:00"
 fi
```
📒 Files selected for processing (5)
- `.github/scripts/retry-build.sh`
- `.github/scripts/run_parallel_benchmarks.sh`
- `.github/scripts/select-gpu-partition.sh`
- `.github/workflows/phoenix/submit-job.sh`
- `.github/workflows/test.yml`
💤 Files with no reviewable changes (1)
- .github/scripts/retry-build.sh
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Make bench jobs use sinfo-based GPU partition selection (via select-gpu-partition.sh) as a baseline, then override with BENCH_GPU_PARTITION only when run_parallel_benchmarks.sh has pre-selected a partition for PR/master consistency. Previously bench jobs fell back to a hardcoded gpu-rtx6000 when BENCH_GPU_PARTITION was unset. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…selection For parallel benchmarks (PR + master), both jobs need a GPU node concurrently, so require at least 2 idle/mix nodes before selecting a partition. Add GPU_PARTITION_MIN_NODES parameter to select-gpu-partition.sh (defaults to 1 for single-job test use). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
phoenix/test.sh relies on RETRY_VALIDATE_CMD to smoke-test the freshly built syscheck binary and trigger a rebuild on failure, catching architecture mismatches (SIGILL) from binaries compiled on a different compute node. Mistakenly removed in the previous commit as 'unused'. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace per-cluster submit/test/bench scripts with unified versions: - submit-slurm-job.sh: single parameterized submit+monitor script for all clusters (replaces phoenix/submit-job.sh, phoenix/submit.sh, frontier/submit.sh). Cluster config (account, QOS, partitions, time limits) is selected via a case block. Idempotent stale-job cancellation now applies to all clusters, not just Phoenix. - common/test.sh: unified test script with conditional build (skips if build/ exists from Frontier login-node build), cluster-aware GPU detection, thread counts, RDMA, and sharding. - common/bench.sh: unified bench script with conditional build, TMPDIR management (Phoenix-only), and cluster-aware bench flags. Also removes nick-fields/retry from bench.yml (frontier build.sh already uses retry_build internally) and deletes dead code (run-tests-with-retry.sh). test.yml self job: 5 conditional steps -> 2 steps (Build + Test). test.yml case-opt job: 5 conditional steps -> 3 steps. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
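The per-cluster case block described above can be pictured roughly like this (a sketch: the account, QOS, and time values are placeholders, not the repository's real site configuration):

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the cluster-config case block in submit-slurm-job.sh.
cluster="${CLUSTER:-phoenix}"

case "$cluster" in
    phoenix)
        account="placeholder-account"   # placeholder, not the real account
        qos="placeholder-qos"           # placeholder
        gpu_time="03:00:00"
        ;;
    frontier)
        account="placeholder-account"
        qos="placeholder-qos"
        gpu_time="02:00:00"             # placeholder
        ;;
    *)
        echo "unknown cluster: $cluster" >&2
        exit 1
        ;;
esac

echo "cluster=$cluster qos=$qos time=$gpu_time"
```

The point of the case block is that adding a cluster means adding one branch rather than three new scripts.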
- Remove no-op 'rm -rf build' inside 'if [ ! -d build ]' guard in common/test.sh and common/bench.sh. - Default gpu_partition to 'batch' before dynamic selection to prevent unbound variable error if a new cluster is added. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

```
@@           Coverage Diff            @@
##           master    #1299    +/-   ##
========================================
  Coverage   44.94%   44.95%
========================================
  Files          70       70
  Lines       20504    20504
  Branches     1946     1946
========================================
+ Hits         9216     9217      +1
  Misses      10166    10166
+ Partials     1122     1121      -1
```

☔ View full report in Codecov by Sentry.
submit_and_monitor_bench.sh cd's into master/ before calling submit-slurm-job.sh, which reads the bench script via cat. Since master branch doesn't have common/bench.sh yet, the cat fails. Fix by resolving the bench script path from the PR tree (absolute path) so it works regardless of cwd. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Claude Code Review

Head SHA:
Changed files:
Summary
Findings
Old:

```sh
./mfc.sh bench --mem 4 -j $n_ranks -o "$job_slug.yaml" -- -c $job_cluster $device_opts -n $n_ranks
```

New:

```sh
./mfc.sh bench --mem 4 -o "$job_slug.yaml" -- -c $bench_cluster $device_opts -n $n_ranks
```

The `-j` parallel-jobs flag is dropped in the new command.
```sh
_idle=$(sinfo -p "$_part" --noheader -o "%t" 2>/dev/null | grep -cE "^(idle|mix)" || true)
```

SLURM can emit states with modifiers (e.g. `idle~` for a powered-down node), which this prefix regex also counts as available.
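The difference between the prefix match and an exact-state match is easy to see on a synthetic state list (`idle~` is `sinfo`'s notation for a powered-down node; the list below is made up for illustration):

```shell
#!/usr/bin/env bash
# Synthetic sinfo -o "%t" output: one node state per line.
states='idle
idle~
mix
alloc'

loose=$(printf '%s\n' "$states" | grep -cE '^(idle|mix)' || true)    # prefix match
strict=$(printf '%s\n' "$states" | grep -cE '^(idle|mix)$' || true)  # bare states only

echo "loose=$loose strict=$strict"
```

The loose pattern counts the powered-down `idle~` node as available; the anchored pattern does not.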
```sh
currentdir=$tmpbuild/run-$(( RANDOM % 9000 ))
```

The random suffix space (0–8999) is small, so concurrent jobs on the same node can collide. Using `mktemp -d` would guarantee a unique directory.
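A collision-free variant using `mktemp -d` (the scratch root here is a throwaway temp directory, not the real `/storage` path):

```shell
#!/usr/bin/env bash
tmpbuild=$(mktemp -d)                        # stand-in for the shared scratch root
dir_a=$(mktemp -d "$tmpbuild/run-XXXXXX")    # unique even under concurrent jobs
dir_b=$(mktemp -d "$tmpbuild/run-XXXXXX")    # a second "job" never collides
[ "$dir_a" != "$dir_b" ] && echo "unique"
rm -rf "$tmpbuild"
```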
```sh
submit_output=$(sbatch <<EOT
...
$sbatch_script_contents
EOT
)
```

Using an unquoted `EOT` delimiter means the embedded script is subject to shell expansion at submission time, and a literal `EOT` line inside `$sbatch_script_contents` would terminate the heredoc early.

Minor Improvement Opportunities

Overall this is a well-motivated refactor — the unified scripts are easier to reason about than 11 near-duplicate per-cluster files. The dropped `-j` flag is the main thing to confirm before merge.
RTX 6000 nodes can't finish the full test suite within the 3-hour SLURM wall time. Use gpu-l40s as the new fallback. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Claude Code Review

Head SHA:
Files changed: 20

Changed files
Summary
Findings

[Medium]
Claude Code Review

Head SHA:
Files changed: 20
Summary
Findings

[Medium]
[Minor]

```sh
tmpbuild=/storage/project/r-sbryngelson3-0/sbryngelson3/mytmp_build
currentdir=$tmpbuild/run-$(( RANDOM % 9000 ))
mkdir -p $tmpbuild
mkdir -p $currentdir
```

[Minor]

```sh
_GPU_PARTITION_PRIORITY="gpu-l40s gpu-h200 gpu-h100 gpu-a100 gpu-v100"
```

RTX 6000 is excluded entirely (comment: "too slow for the test suite time limit"). The old …

[Minor] Old …

Overall a clean, well-motivated CI refactor. The no-retry concern for Frontier bench login-node builds is the main thing to confirm before merge.
The dry-run build uses build_opts but the live test command didn't. CMake caches the config, but passing it explicitly is safer. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Claude Code Review

Head SHA:
Files changed: 20
Summary:
Findings

1. Missing `-j` flag in the unified bench command

```sh
if [ "$job_device" = "gpu" ]; then
    ./mfc.sh bench --mem 4 -o "$job_slug.yaml" -- -c $bench_cluster $device_opts -n $n_ranks
else
    ./mfc.sh bench --mem 1 -o "$job_slug.yaml" -- -c $bench_cluster $device_opts -n $n_ranks
fi
```

Both old scripts passed a `-j` flag here.

2. No retry for login-node build in bench workflow
3. Inconsistency in GPU partition priority between test and bench contexts
```sh
_GPU_PARTITION_PRIORITY="gpu-l40s gpu-h200 gpu-h100 gpu-a100 gpu-v100"
```

The old …

4. …
Minor / no action needed
Overall this is a clean, well-motivated simplification. The main item to verify before merge is whether dropping …
Under set -e, 'wait $pid' returning non-zero aborts the script before the exit code is captured, leaving the second parallel job unmonitored. Use 'wait $pid || exit=$?' so both jobs are always waited on. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
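The pattern from this commit can be demonstrated with two background jobs (the subshells stand in for the real sbatch monitor processes):

```shell
#!/usr/bin/env bash
set -e
( exit 3 ) & pid1=$!   # stand-in for a monitor that fails
( exit 0 ) & pid2=$!   # stand-in for a monitor that succeeds

exit1=0; exit2=0
# A bare 'wait $pid1' would abort the script here under set -e;
# '|| var=$?' captures the status and lets the second wait still run.
wait "$pid1" || exit1=$?
wait "$pid2" || exit2=$?
echo "exit1=$exit1 exit2=$exit2"
```

Both exit codes are captured, so the script can report which of the two parallel jobs failed instead of silently abandoning the second one.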
- select-gpu-partition.sh: priority comment said 'smaller/older' but list is now L40S/H200/H100/A100/V100 - submit-slurm-job.sh: 'Idempotent' → 'Rerun-safe' (it always submits a new job) - bench.sh: n_jobs only used for build, not bench Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
For non-Phoenix GPU jobs, both device_opts and build_opts resolved to the same --gpu flag. Let build_opts carry it; device_opts is only for cluster-specific runtime flags like -g (Phoenix GPU IDs). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Claude Code Review

Head SHA:
Files changed: 20
Summary
Findings

1.

Old (phoenix):

```sh
./mfc.sh bench --mem 4 -j $n_jobs -o "$job_slug.yaml" -- -c phoenix-bench $device_opts -n $n_ranks
```

Old (frontier):

```sh
./mfc.sh bench --mem 4 -j $n_ranks -o "$job_slug.yaml" -- -c $job_cluster $device_opts -n $n_ranks
```

New:

```sh
./mfc.sh bench --mem 4 -o "$job_slug.yaml" -- -c $bench_cluster $device_opts -n $n_ranks
```

The `-j` flag is dropped.

2. …
3.

```sh
sbatch_script_contents=$(cat "$script_path")
...
submit_output=$(sbatch <<EOT
...
$sbatch_script_contents
EOT
)
```

The unquoted heredoc delimiter means the embedded script is expanded by the shell at submission time, and a literal `EOT` line inside the contents would end the heredoc early.

4. The old step used …

Overall: Clean, well-scoped CI infrastructure simplification.
Claude Code Review

Head SHA:
Files changed: 20
Summary:
Findings

1. Benchmark partition selection is a semantic regression (…). The old …

2. Missing `-j` flag

Old:

```sh
./mfc.sh bench -j … -o "….yaml" -- -c phoenix-bench … -n …
```

New unified bench.sh:

```sh
./mfc.sh bench --mem 4 -o "$job_slug.yaml" -- -c $bench_cluster $device_opts -n $n_ranks
```

The …

3. The removed …

4. Minor:

```sh
for _part in $_GPU_PARTITION_PRIORITY; do
```

With …

Improvement Opportunities
Overall this is a solid CI refactor that eliminates real false-failure modes and meaningfully reduces script duplication. The three findings above (especially #1 and #2) are worth confirming before merge.
Claude Code Review

Head SHA:
Files changed: 20

Summary
Findings

1. Both old per-cluster bench scripts passed a job-parallelism flag to `./mfc.sh bench`. The new command:

```sh
./mfc.sh bench --mem 4 -o "$job_slug.yaml" -- -c $bench_cluster $device_opts -n $n_ranks
```

If …

2. Old … Using …

3.

```sh
PR_BENCH_SCRIPT="$(cd "${SCRIPT_DIR}/../workflows/common" && pwd)/bench.sh"
```

The comment acknowledges that master may not have …

4. Old:

```yaml
on_retry_command: rm -rf pr/build master/build
```

The new plain …

5. Minor:

```sh
GPU_PARTITION_MIN_NODES=2 source "${SCRIPT_DIR}/select-gpu-partition.sh"
```

In bash,

```sh
GPU_PARTITION_MIN_NODES=2
source "${SCRIPT_DIR}/select-gpu-partition.sh"
```

(since …)

What looks good
With clean:false, old SLURM job epilogs can write to the .out file after our stale-job check completes. The monitor tail then picks up this stale output (including errors from dead nodes) and reports it as if it came from the new job. Removing the .out file before submission ensures a clean output stream. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
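The effect described in this commit is easy to reproduce with a plain file (`job-1234.out` is a hypothetical name; the real one comes from the job's `-o` pattern):

```shell
#!/usr/bin/env bash
out="job-1234.out"                     # hypothetical output-file name
echo "stale epilog error" > "$out"     # leftover written by an old job's epilog
rm -f "$out"                           # remove before sbatch, as the commit does
echo "new job output" >> "$out"        # what the new job's monitor tail sees
stale=$(grep -c "stale" "$out" || true)
echo "stale lines visible to monitor: $stale"
rm -f "$out"
```

Without the `rm -f` before submission, the monitor would tail the old contents along with the new job's output.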
Claude Code Review

Head SHA:
Changed files
Summary
Findings

1. Missing `-j` flag. Both old scripts passed one:

```sh
# old (phoenix/bench.sh):
./mfc.sh bench $bench_opts -j $n_jobs -o "$job_slug.yaml" -- -c phoenix-bench $device_opts -n $n_ranks

# new (common/bench.sh line 46):
./mfc.sh bench --mem 4 -o "$job_slug.yaml" -- -c $bench_cluster $device_opts -n $n_ranks
```

If …

2. No retry for login-node build in … The old step used …

3. Unexplained `sleep 10` / `unset TMPDIR`:

```sh
# --- Phoenix cleanup (trap EXIT handles rm -rf "$currentdir") ---
if [ "$job_cluster" = "phoenix" ]; then
    sleep 10
    unset TMPDIR
fi
```

Inherited from …

4. Old … The PR description and script comment document this as intentional ("RTX 6000 too slow for the test suite time limit"). Flagging only so reviewers are aware the bench GPU pool changed alongside the test pool.

No other issues found

The unified submit script is well-structured. Idempotent stale-job cancellation, SIGHUP suppression, and …
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Claude Code Review

Head SHA:
Files changed: 21
Summary
Findings

1. …
2. …
3. …
4. …
5. …

Improvement noted — …

Overall

Clean, well-motivated refactor with no blocking issues found. The logic is correctly unified and idempotency/stale-job cancellation now applies to all clusters. The key behavioral tradeoff (single-attempt bench build on GHA runner) appears intentional and acceptable given the SIGKILL root cause. Items 2–3 are style-only; item 1 deserves a quick confirmation that no non-Phoenix cluster runs the bench build on the GHA runner step.
…ilation) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Claude Code Review

Head SHA:
Files changed: 22
Key files:
Summary
Findings

1. The old per-cluster bench scripts passed a job-parallelism flag:

```sh
# New (common/bench.sh:540)
./mfc.sh bench --mem 4 -o "$job_slug.yaml" -- -c $bench_cluster $device_opts -n $n_ranks

# Old (frontier/bench.sh, GPU)
./mfc.sh bench --mem 4 -j $n_ranks -o "$job_slug.yaml" -- -c $job_cluster $device_opts -n $n_ranks
```

If …

2. The new priority list (…) …

3.

```sh
# Old (phoenix/bench.sh): RANDOM % 900 → 0–899
# New (common/bench.sh): RANDOM % 9000 → 0–8999
currentdir=$tmpbuild/run-$(( RANDOM % 9000 ))
```

The range increase reduces collision probability, but two concurrent jobs (PR + master) could still collide. Consider using `mktemp -d`.

4.

```sh
set -e
set -x
...
$sbatch_script_contents
```

5. Previously, Frontier used a cluster-specific …

Minor observations

Overall this is a clean and well-motivated consolidation. The main item to verify before merge is whether the …

🤖 Generated with Claude Code
GitHub Actions runners have different CPU microarchitectures. MFC compiles with -march=native, so cached binaries from one runner can contain instructions illegal on another. Adding the GCC-detected -march target to the cache key ensures each ISA gets its own cache. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
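One way to derive such an ISA-specific key component, sketched with a fallback for machines where the GCC probe is unavailable (the `mfc-build-` prefix is illustrative, not the workflow's actual cache-key format):

```shell
#!/usr/bin/env bash
# Ask GCC what -march=native resolves to on this runner; -Q --help=target
# prints the effective target options, one per line ("-march=   skylake").
march=$(gcc -march=native -Q --help=target 2>/dev/null \
        | awk '$1 == "-march=" { print $2; exit }')
march="${march:-$(uname -m)}"        # fallback when gcc/probe is unavailable
cache_key="mfc-build-${march}"
echo "$cache_key"
```

Two runners with different microarchitectures then produce different keys, so a binary built with `-march=native` on one is never restored onto the other.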
Claude Code Review

Head SHA:
Changed files
Summary
Findings

1. The old code favoured …

2.

```sh
./mfc.sh bench --mem 4 -o "$job_slug.yaml" -- -c $bench_cluster $device_opts -n $n_ranks
```

Old …

3. …

4. Empty

```sh
validate_cmd=""
...
RETRY_VALIDATE_CMD="$validate_cmd" \
retry_build ./mfc.sh test -v --dry-run -j 8 $build_opts || exit 1
```

For non-Phoenix clusters, …

Minor nits

Overall this is a well-motivated consolidation. The partition-priority question (#1) and the …
…le speedup) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Claude Code Review

Head SHA: dbe6a5e
Files changed: 22
Summary
Findings

1. Missing `-j` flag

Old …

```sh
./mfc.sh bench --mem 4 -o "$job_slug.yaml" -- -c $bench_cluster $device_opts -n $n_ranks
```

If …

2. Removal of GitHub Actions build cache for non-cluster runners (potential CI slowdown)

The …

3. The …

4. EOT heredoc delimiter collision risk in …
5. Minor:

```sh
n_jobs=$(( $(nproc) > 64 ? 64 : $(nproc) ))
```

calls `nproc` twice; once is enough:

```sh
_nproc=$(nproc); n_jobs=$(( _nproc > 64 ? 64 : _nproc ))
```

Overall this is a solid simplification. The unified scripts remove significant duplication and the GPU partition selection logic is well-commented. The main item to double-check before merge is finding #1 (the dropped `-j` flag).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Claude Code Review

Head SHA: fde5fc4
Files changed: 22
Key files:
Summary
Findings

1. Potential job ID extraction fragility —

```sh
job_id=$(echo "$submit_output" | grep -oE '[0-9]+')
```

If the `sbatch` output ever contains other digits, this can capture the wrong value. A stricter form:

```sh
job_id=$(echo "$submit_output" | grep -oE 'Submitted batch job [0-9]+' | grep -oE '[0-9]+')
```

The old per-cluster scripts had the same issue, but it's worth fixing in the unified script since it's now the single point of failure for all clusters.

2. TMPDIR collision risk —

```sh
currentdir=$tmpbuild/run-$(( RANDOM % 9000 ))
```

could instead use:

```sh
currentdir=$(mktemp -d "$tmpbuild/run-XXXXXX")
```

3. Missing `-j` flag —

```sh
./mfc.sh bench --mem 4 -o "$job_slug.yaml" -- -c $bench_cluster $device_opts -n $n_ranks
```

The old …

4. Stale PR description …

5.

```sh
retry_build ./mfc.sh build -j $n_jobs $build_opts || exit 1
```

With …

Minor / Non-blocking
Overall this is a solid cleanup that meaningfully reduces CI surface area. The findings above are mostly hardening opportunities, not blockers.
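The hardened extraction from finding 1 can be checked locally against a typical sbatch response line (the sample text below is illustrative; `sbatch --clusters` appends an "on cluster" suffix like this):

```shell
#!/usr/bin/env bash
submit_output="Submitted batch job 123456 on cluster phoenix"  # sample output
# Anchor on sbatch's fixed phrase first, then pull out only those digits.
job_id=$(echo "$submit_output" \
         | grep -oE 'Submitted batch job [0-9]+' \
         | grep -oE '[0-9]+')
echo "job_id=$job_id"
```

A bare `grep -oE '[0-9]+'` on the same line would also emit any other numeric runs in the message, one per line.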
- File-level gcov coverage cache maps test UUIDs to exercised .fpp source files (gzip JSON, committed to repo) - --only-changes flag prunes tests by intersecting PR-changed files against coverage cache; conservative fallbacks for missing cache/coverage - --build-coverage-cache flag + 3-phase parallel cache builder (prepare, run, gcov collect) - New rebuild-cache CI job on Phoenix via SLURM when cases.py or Fortran dependency graph changes - Dep-change detection greps PR/push diffs for added use/include statements - 53 unit tests cover core coverage logic - Rebased onto PR MFlowCode#1299 unified CI architecture (submit-slurm-job.sh, common/test.sh) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
- Replace the `nick-fields/retry` JS action with plain `run:` steps for Frontier builds (`test.yml`) and bench builds (`bench.yml`). The JS action wrapper was getting SIGKILL'd on Frontier login nodes after the build completed successfully, causing false build failures. Retry logic is handled by `retry_build()` in `.github/scripts/retry-build.sh`, which all cluster `build.sh` scripts already call.
- `submit-slurm-job.sh`: single submit+monitor script for all clusters (replaces `phoenix/submit-job.sh`, `phoenix/submit.sh`, `frontier/submit.sh`). Cluster config (account, QOS, partitions, time limits) selected via case block. Idempotent stale-job cancellation now applies to all clusters.
- `common/test.sh`: unified test script with conditional build, cluster-aware GPU detection, thread counts, RDMA, and sharding.
- `common/bench.sh`: unified bench script with conditional build, TMPDIR management (Phoenix-only), and cluster-aware bench flags.
- New `select-gpu-partition.sh` script for sinfo-based GPU partition selection, used by both test and benchmark jobs. GPU partition priority: `gpu-rtx6000 → gpu-l40s → gpu-v100 → gpu-h200 → gpu-h100 → gpu-a100`.
- Benchmarks require at least 2 idle/mix nodes (`GPU_PARTITION_MIN_NODES=2`) before selecting a partition, since PR and master benchmark jobs run concurrently.
- Node `atl1-1-03-002-29-0` (persistent `cuInit` error 999).
- `run-tests-with-retry.sh` (never called).

Workflow simplification

- `test.yml` self job: 5 conditional steps → 2 (Build + Test)
- `test.yml` case-opt job: 5 conditional steps → 3

Test plan