Frontier Benchmarking (#453) #881
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #881      +/-   ##
==========================================
+ Coverage   44.03%   44.15%   +0.11%
==========================================
  Files          68       68
  Lines       18395    18347      -48
  Branches     2227     2227
==========================================
  Hits         8101     8101
+ Misses       8991     8943      -48
  Partials     1303     1303
|
|
Reduced the job duration to 3 hrs to see whether it would yield the same error regardless of duration. |
|
I did |
|
This benchmark test will never pass in its current state because the Frontier benchmarking files do not exist on the master branch, hence this error (from https://github.com/MFlowCode/MFC/actions/runs/15826502985/job/44607985758?pr=881):
(cd pr && bash .github/workflows/frontier/submit-bench.sh .github/workflows/frontier/bench.sh gpu) &
(cd master && bash .github/workflows/frontier/submit-bench.sh .github/workflows/frontier/bench.sh gpu) &
wait %1 && wait %2
shell: /usr/bin/bash -e {0}
env:
ACTIONS_RUNNER_FORCE_ACTIONS_NODE_VERSION: node16
ACTIONS_ALLOW_USE_UNSECURE_NODE_VERSION: true
bash: .github/workflows/frontier/submit-bench.sh: No such file or directory
Submitted batch job 3531713
Once it looks like everything is working as well as one can expect, we can merge in the minimal files ( |
|
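The failure above comes from the master checkout lacking the submit script. A minimal sketch of a pre-flight check that would surface this before any job is submitted; the function name `have_bench_scripts` is hypothetical and is not part of the workflow in this PR, and only the script path is taken from the log:

```shell
# Hypothetical pre-flight check, not part of the workflow in this PR.
SCRIPT=".github/workflows/frontier/submit-bench.sh"   # path taken from the CI log

# Succeeds only if every checkout passed as an argument contains $SCRIPT.
have_bench_scripts() {
    local dir missing=0
    for dir in "$@"; do
        if [ ! -f "$dir/$SCRIPT" ]; then
            echo "missing: $dir/$SCRIPT" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Possible use before the submit step:
#   have_bench_scripts pr master || exit 1
```

With the checkouts from the log, this would report the master copy as missing while the pr copy passes.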
Alright, either I or someone else has to test it manually: clone master and the PR, add the bash files to each, then benchmark on Frontier as a SLURM/interactive job to make sure nothing gets corrupted in the process. |
|
I verified that this works on my end. The IBM case still gives NaNs though... |
Thanks much. I wonder what the deal is with the IBM case. Any specific error messages or such? If the issue persists, we can just exclude that case somehow. Also, I guess NaNs won't fail the test, as can be seen on my recent PR #895, where I assigned null to the IBM grind/exec values. Edit: lmk if you suspect anything that might have caused that. |
|
Well, the NaN issue was supposed to be fixed by #892 but it appears that that's not the case |
|
status? |
|
@sbryngelson done on my end tbh and nothing to add |
what's going on here? |
Any ideas @anandrdbz ? |
|
I'll look into it. Last time I checked, 2D_ibm was working; perhaps there were multiple issues causing NaNs. |
|
I just ran 2D_ibm and 2D_ibm_multiphase to completion on an interactive node, @wilfonba. Is there another example case file that's failing?
It's the IBM case in the benchmarking cases (what this PR is about) |
|
Not sure when this was done, but the ibm case file in benchmarks does not actually have ib = T. In fact, it's just running a single-fluid hypoelastic case. |
|
Anyway, the reason this particular case fails has nothing to do with IBM, since ib is not set. I think the reason is that the problem size on Frontier is larger than on Phoenix because it uses 8 GPUs, while the time step is hardcoded. I ran the same case file on a single GCD on Frontier and it worked. I also reduced dt by a factor of 2 on 8 ranks, and that also runs. But I guess there's not much point debugging this, since the case file needs an overhaul to include an actual IBM case. |
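One rough way to read this observation, under assumptions that are not confirmed anywhere in this thread: the case is 3D, the physical domain is fixed, and the total cell count grows with the rank count evenly in each direction. Here $c$ stands for the fastest wave speed and $\Delta x$ for the cell width; both symbols are introduced only for this sketch.

```latex
\mathrm{CFL} = \frac{c\,\Delta t}{\Delta x},
\qquad
N_{\text{cells}} \to 8\,N_{\text{cells}}
\;\Rightarrow\;
\Delta x \to \frac{\Delta x}{2}
\;\Rightarrow\;
\mathrm{CFL} \to 2\,\mathrm{CFL},
\qquad
\Delta t \to \frac{\Delta t}{2}
\;\Rightarrow\;
\mathrm{CFL} \to \mathrm{CFL}.
```

If those assumptions hold, going from 1 to 8 ranks with a hardcoded dt doubles the CFL number, and halving dt restores the single-GCD value, which would be consistent with both runs reported above.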
|
Waiting for CI to run, then will merge.
Co-authored-by: mohdsaid497566 <mohdsaid497566@gmail.com>
Co-authored-by: Spencer Bryngelson <sbryngelson@gmail.com>
Co-authored-by: Spencer Bryngelson <shb@gatech.edu>
Co-authored-by: wilfonba <bwilfong3@gatech.edu>
Description
Added one GPU benchmarking case that submits SLURM jobs on Frontier, duplicating the Phoenix implementation. (#453)
Manual Benchmarking
Cloning
Copying Bash Scripts into master
Submit Benchmark Jobs
Process Benchmark Results
once the slurm jobs are done
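The commands for the steps listed above did not survive in this page, so the following is only a sketch of how they might look, not the author's actual commands. The submit invocation and the script paths come from the CI log in this conversation. Everything else is an assumption: the `run` dry-run helper, the use of GitHub's `pull/<number>/head` ref to fetch the PR, the upstream repository URL, and the guess that `submit-bench.sh` and `bench.sh` are the only files that need copying. The results-processing step is left out because its command is not shown here.

```shell
#!/usr/bin/env bash
# Sketch of the manual benchmarking steps; prints commands unless DRY_RUN=0.
set -e

REPO_URL="https://github.com/MFlowCode/MFC"   # assumed upstream repository
PR_NUMBER=881                                 # this PR
FRONTIER_DIR=".github/workflows/frontier"     # path taken from the CI log
DRY_RUN="${DRY_RUN:-1}"

# Echo the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Cloning: one checkout of master, one of the PR head.
clone_both() {
    run git clone "$REPO_URL" master
    run git clone "$REPO_URL" pr
    run git -C pr fetch origin "pull/$PR_NUMBER/head:pr-$PR_NUMBER"
    run git -C pr checkout "pr-$PR_NUMBER"
}

# Copying Bash scripts into master, which does not have them yet.
copy_scripts() {
    run mkdir -p "master/$FRONTIER_DIR"
    for f in submit-bench.sh bench.sh; do   # assumed to be the only files needed
        run cp "pr/$FRONTIER_DIR/$f" "master/$FRONTIER_DIR/$f"
    done
}

# Submit benchmark jobs: same invocation as the CI step, for one checkout.
submit_one() {
    run bash -c "cd $1 && bash $FRONTIER_DIR/submit-bench.sh $FRONTIER_DIR/bench.sh gpu"
}

main() {
    clone_both
    copy_scripts
    submit_one pr &
    pr_pid=$!
    submit_one master &
    master_pid=$!
    wait "$pr_pid" && wait "$master_pid"
}

main
```

By default the script only prints what it would do; setting `DRY_RUN=0` on a Frontier login node would execute the commands for real.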