Strategy for running MFC out-of-core on NVIDIA Grace-Hopper using Unified Memory#972
sbryngelson merged 27 commits into MFlowCode:master
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #972      +/-   ##
==========================================
+ Coverage   40.91%   40.93%   +0.01%
==========================================
  Files          70       70
  Lines       20288    20288
  Branches     2517     2517
==========================================
+ Hits         8301     8305       +4
+ Misses      10450    10447       -3
+ Partials     1537     1536       -1
e47036b to 8fef22d
wilfonba left a comment:
Approve to run benchmark
wilfonba left a comment:
Approve to run benchmark
User description
This PR builds on top of the work done in #9, and aims to bring to the MFC master branch the zero-copy out-of-core approach that relies on cudaMallocManaged and pinned CPU memory allocations. This strategy works around some issues with unified memory and will be cleaned up as soon as these are resolved.
The use of cudaMallocManaged allows the use of 2 MB pages for the GPU allocations, which leads to fewer TLB misses and improves performance compared to the 64 KB pages of malloc when configured without huge pages. The use of pinned host allocations allows locking some buffers in host memory and accessing them directly from GPU code via NVLink-C2C at peak host memory bandwidth.
To ensure that MPI communications follow the fast GPUDirect paths for unified memory as well, we use OpenACC capture on the send and receive buffers to switch to separate memory for these buffers, i.e. to allocate them using cudaMalloc.
It is also important to note that we implement a series of rearranged timestep updates for the Runge-Kutta schemes that substantially improve the locality, and hence the performance, of the out-of-core approach. All of the above are crucial for good performance.
The out-of-core implementation is highly configurable, allowing control of the memory placement of certain arrays through the following case file parameters:
nv_uvm_out_of_core: Enable/disable the out-of-core approach. This parameter essentially controls the placement of q_cons_ts(2), which can be either on the GPU via cudaMallocManaged or on the CPU via cudaMallocHost.
nv_uvm_igr_temps_on_gpu: Set the number of IGR temporaries to keep in GPU memory. The rest stay in CPU memory and are accessed directly from there.
nv_uvm_pref_gpu: Enable/disable the @:PREFER_GPU macro, which implements some explicit CUDA memory hints for improving performance. These can be summarized as follows: (i) set the preferred location to GPU to resist migrations, (ii) set accessed-by CPU to prefer direct mappings over faulting, and (iii) prefetch to GPU to populate memory pages on the GPU efficiently before first touch.
This PR will also:
3D_IGR_TaylorGreenVortex_nvidia.
fastmath option to improve performance of math-heavy GPU kernels.
I used the 3D_IGR_TaylorGreenVortex_nvidia test case on the ALPS supercomputer.
The code was tested with NVHPC 25.1 as well as the latest NVHPC nightly build.
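To make the page-size argument in the description concrete, here is a minimal arithmetic sketch of how many pages are needed to back the same buffer with 64 KB pages versus 2 MB pages. The 1 GiB buffer size and the helper name pages_needed are illustrative assumptions, not values or code from this PR; the point is only that larger pages mean fewer translations for the TLB to cover.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def pages_needed(buffer_bytes: int, page_bytes: int) -> int:
    """Number of pages required to back a buffer (ceiling division)."""
    return -(-buffer_bytes // page_bytes)


# Hypothetical 1 GiB buffer; the size is an assumption for illustration only.
buffer_bytes = 1 * GIB

small_pages = pages_needed(buffer_bytes, 64 * KIB)  # 64 KB pages
large_pages = pages_needed(buffer_bytes, 2 * MIB)   # 2 MB pages

print(f"64 KB pages: {small_pages}")
print(f"2 MB pages:  {large_pages}")
print(f"reduction:   {small_pages // large_pages}x fewer pages to translate")
```

For any buffer that is a multiple of 2 MB, the 2 MB page size needs 32 times fewer page translations than the 64 KB page size, which is consistent with the reduced TLB pressure described above.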
PR Type
Enhancement
Description
Implement out-of-core strategy for NVIDIA Grace-Hopper using Unified Memory
Allow controlling memory placement of certain arrays
Introduce pinned memory pools for CPU-side allocations
Modify time-stepping algorithm for improved locality in out-of-core updates and unified memory compatibility
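As a rough illustration of the memory placement controls listed above, the following sketch shows how the three case file parameters named in the user description might appear in a case file. Only the parameter names come from this PR; the values, the "T"/"F" encoding of logicals, and the variable name uvm_settings are assumptions made for the example, and a real case file would contain many more entries.

```python
import json

# Hypothetical excerpt of a case file. Only the three nv_uvm_* names are
# taken from the PR description; values and the "T"/"F" encoding are assumed.
uvm_settings = {
    # Enable the out-of-core approach (controls placement of q_cons_ts(2)).
    "nv_uvm_out_of_core": "T",
    # Number of IGR temporaries kept in GPU memory; the rest stay on the CPU.
    "nv_uvm_igr_temps_on_gpu": 3,
    # Enable the @:PREFER_GPU macro and its CUDA memory hints.
    "nv_uvm_pref_gpu": "T",
}

print(json.dumps(uvm_settings, indent=4))
```

Lowering nv_uvm_igr_temps_on_gpu in such a configuration would trade GPU memory footprint for more traffic to host memory, so the right value depends on the problem size and the available GPU memory.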
File Walkthrough
5 files
Add PREFER_GPU macro for memory placement
Conditional MPI buffer allocation for unified memory
Apply GPU preference to grid variables
Implement pinned memory pools for IGR temporaries
Add out-of-core time stepping with pinned memory
1 file
Add Taylor-Green vortex test case configuration
1 file
Add home directory path helper method
5 files
Add GPU and CPU binding script for Santis
Add NVIDIA Nsight profiling wrapper script
Add Santis supercomputer job template with UVM settings
Update NVHPC compiler flags for unified memory
Add Santis module configuration