Bug fix for probe_wrt on GPUs, where the acceleration and center of mass are written. Previously this produced NaNs and performed poorly because the subroutine had not been ported to GPUs.
Fixes #(issue) [optional]
Type of change
Please delete options that are not relevant.
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Something else
Scope
This PR comprises a set of related changes with a common goal
If you cannot check the above box, please split your PR into multiple PRs that each have a common goal.
How Has This Been Tested?
Please describe the tests that you ran to verify your changes.
Provide instructions so we can reproduce.
Please also list any relevant details for your test configuration
Test A
Test B
Test Configuration:
What computers and compilers did you use to test this:
Checklist
I have added comments for the new code
I added Doxygen docstrings to the new code
I have made corresponding changes to the documentation (docs/)
I have added regression tests to the test suite so that people can verify in the future that the feature is behaving as expected
I have added example cases in examples/ that demonstrate my new feature performing as expected.
They run to completion and demonstrate "interesting physics"
I ran ./mfc.sh format before committing my code
New and existing tests pass locally with my changes, including with GPU capability enabled (both NVIDIA hardware with NVHPC compilers and AMD hardware with CRAY compilers) and disabled
This PR does not introduce any repeated code (it follows the DRY principle)
I cannot think of a way to condense this code and reduce any introduced additional line count
If your code changes any code source files (anything in src/simulation)
To make sure the code is performing as expected on GPU devices, I have:
Checked that the code compiles using NVHPC compilers
Checked that the code compiles using CRAY compilers
Ran the code on either V100, A100, or H100 GPUs and ensured the new feature performed as expected (the GPU results match the CPU results)
Ran the code on MI200+ GPUs and ensured the new feature performed as expected (the GPU results match the CPU results)
Enclosed the new feature via nvtx ranges so that it can be identified in profiles
Ran an Nsight Systems profile using ./mfc.sh run XXXX --gpu -t simulation --nsys, and have attached the output file (.nsys-rep) and plain text results to this PR
Ran a Rocprof Systems profile using ./mfc.sh run XXXX --gpu -t simulation --rsys --hip-trace, and have attached the output file and plain text results to this PR.
Ran my code using various numbers of different GPUs (1, 2, and 8, for example) in parallel and made sure that the results scale similarly to what happens if you run without the new code/feature
PR Type
Enhancement
Description
Add GPU support for probe write functionality
Implement GPU memory management for finite difference coefficients
Convert array operations to GPU-compatible loops
Add atomic operations for thread-safe center of mass calculations
Diagram Walkthrough
```mermaid
flowchart LR
    A["CPU-only probe writes"] --> B["GPU memory allocation"]
    B --> C["GPU parallel loops"]
    C --> D["Atomic operations"]
    D --> E["GPU-accelerated probe writes"]
```
The acceleration component calculation contains significant code duplication across the three coordinate directions (x, y, z). The nested loops and finite difference calculations are nearly identical with only variable names changing. This violates the DRY principle and makes maintenance difficult.
    $:GPU_PARALLEL_LOOP(collapse=3)
    do l = 0, p
        do k = 0, n
            do j = 0, m
                q_sf(j, k, l) = (11._wp*q_prim_vf0(momxb)%sf(j, k, l) &
                                 - 18._wp*q_prim_vf1(momxb)%sf(j, k, l) &
                                 + 9._wp*q_prim_vf2(momxb)%sf(j, k, l) &
                                 - 2._wp*q_prim_vf3(momxb)%sf(j, k, l))/(6._wp*dt)
            end do
        end do
    end do

    if (n == 0) then
        $:GPU_PARALLEL_LOOP(collapse=4)
        do l = 0, p
            do k = 0, n
                do j = 0, m
                    do r = -fd_number, fd_number
                        q_sf(j, k, l) = q_sf(j, k, l) &
                                        + q_prim_vf0(momxb)%sf(j, k, l)*fd_coeff_x(r, j)* &
                                        q_prim_vf0(momxb)%sf(r + j, k, l)
                    end do
                end do
            end do
        end do
    elseif (p == 0) then
        $:GPU_PARALLEL_LOOP(collapse=4)
        do l = 0, p
            do k = 0, n
                do j = 0, m
                    do r = -fd_number, fd_number
                        q_sf(j, k, l) = q_sf(j, k, l) &
                                        + q_prim_vf0(momxb)%sf(j, k, l)*fd_coeff_x(r, j)* &
                                        q_prim_vf0(momxb)%sf(r + j, k, l) &
                                        + q_prim_vf0(momxb + 1)%sf(j, k, l)*fd_coeff_y(r, k)* &
                                        q_prim_vf0(momxb)%sf(j, r + k, l)
                    end do
                end do
            end do
        end do
    else
        if (grid_geometry == 3) then
            $:GPU_PARALLEL_LOOP(collapse=4)
            do l = 0, p
                do k = 0, n
                    do j = 0, m
                        do r = -fd_number, fd_number
                            q_sf(j, k, l) = q_sf(j, k, l) &
                                            + q_prim_vf0(momxb)%sf(j, k, l)*fd_coeff_x(r, j)* &
                                            q_prim_vf0(momxb)%sf(r + j, k, l) &
                                            + q_prim_vf0(momxb + 1)%sf(j, k, l)*fd_coeff_y(r, k)* &
                                            q_prim_vf0(momxb)%sf(j, r + k, l) &
                                            + q_prim_vf0(momxe)%sf(j, k, l)*fd_coeff_z(r, l)* &
                                            q_prim_vf0(momxb)%sf(j, k, r + l)/y_cc(k)
                        end do
                    end do
                end do
            end do
        else
            $:GPU_PARALLEL_LOOP(collapse=4)
            do l = 0, p
                do k = 0, n
                    do j = 0, m
                        do r = -fd_number, fd_number
                            q_sf(j, k, l) = q_sf(j, k, l) &
                                            + q_prim_vf0(momxb)%sf(j, k, l)*fd_coeff_x(r, j)* &
                                            q_prim_vf0(momxb)%sf(r + j, k, l) &
                                            + q_prim_vf0(momxb + 1)%sf(j, k, l)*fd_coeff_y(r, k)* &
                                            q_prim_vf0(momxb)%sf(j, r + k, l) &
                                            + q_prim_vf0(momxe)%sf(j, k, l)*fd_coeff_z(r, l)* &
                                            q_prim_vf0(momxb)%sf(j, k, r + l)
                        end do
                    end do
                end do
            end do
        end if
    end if
! Computing the acceleration component in the y-coordinate direction
elseif (i == 2) then
    $:GPU_PARALLEL_LOOP(collapse=3)
    do l = 0, p
        do k = 0, n
            do j = 0, m
                q_sf(j, k, l) = (11._wp*q_prim_vf0(momxb + 1)%sf(j, k, l) &
                                 - 18._wp*q_prim_vf1(momxb + 1)%sf(j, k, l) &
                                 + 9._wp*q_prim_vf2(momxb + 1)%sf(j, k, l) &
                                 - 2._wp*q_prim_vf3(momxb + 1)%sf(j, k, l))/(6._wp*dt)
            end do
        end do
    end do

    if (p == 0) then
        $:GPU_PARALLEL_LOOP(collapse=4)
        do l = 0, p
            do k = 0, n
                do j = 0, m
                    do r = -fd_number, fd_number
                        q_sf(j, k, l) = q_sf(j, k, l) &
                                        + q_prim_vf0(momxb)%sf(j, k, l)*fd_coeff_x(r, j)* &
                                        q_prim_vf0(momxb + 1)%sf(r + j, k, l) &
                                        + q_prim_vf0(momxb + 1)%sf(j, k, l)*fd_coeff_y(r, k)* &
                                        q_prim_vf0(momxb + 1)%sf(j, r + k, l)
                    end do
                end do
            end do
        end do
    else
        if (grid_geometry == 3) then
            $:GPU_PARALLEL_LOOP(collapse=4)
            do l = 0, p
                do k = 0, n
                    do j = 0, m
                        do r = -fd_number, fd_number
                            q_sf(j, k, l) = q_sf(j, k, l) &
                                            + q_prim_vf0(momxb)%sf(j, k, l)*fd_coeff_x(r, j)* &
                                            q_prim_vf0(momxb + 1)%sf(r + j, k, l) &
                                            + q_prim_vf0(momxb + 1)%sf(j, k, l)*fd_coeff_y(r, k)* &
                                            q_prim_vf0(momxb + 1)%sf(j, r + k, l) &
                                            + q_prim_vf0(momxe)%sf(j, k, l)*fd_coeff_z(r, l)* &
                                            q_prim_vf0(momxb + 1)%sf(j, k, r + l)/y_cc(k) &
                                            - (q_prim_vf0(momxe)%sf(j, k, l)**2._wp)/y_cc(k)
                        end do
                    end do
                end do
            end do
        else
            $:GPU_PARALLEL_LOOP(collapse=4)
            do l = 0, p
                do k = 0, n
                    do j = 0, m
                        do r = -fd_number, fd_number
                            q_sf(j, k, l) = q_sf(j, k, l) &
                                            + q_prim_vf0(momxb)%sf(j, k, l)*fd_coeff_x(r, j)* &
                                            q_prim_vf0(momxb + 1)%sf(r + j, k, l) &
                                            + q_prim_vf0(momxb + 1)%sf(j, k, l)*fd_coeff_y(r, k)* &
                                            q_prim_vf0(momxb + 1)%sf(j, r + k, l) &
                                            + q_prim_vf0(momxe)%sf(j, k, l)*fd_coeff_z(r, l)* &
                                            q_prim_vf0(momxb + 1)%sf(j, k, r + l)
                        end do
                    end do
                end do
            end do
        end if
    end if
! Computing the acceleration component in the z-coordinate direction
else
    $:GPU_PARALLEL_LOOP(collapse=3)
    do l = 0, p
        do k = 0, n
            do j = 0, m
                q_sf(j, k, l) = (11._wp*q_prim_vf0(momxe)%sf(j, k, l) &
                                 - 18._wp*q_prim_vf1(momxe)%sf(j, k, l) &
                                 + 9._wp*q_prim_vf2(momxe)%sf(j, k, l) &
                                 - 2._wp*q_prim_vf3(momxe)%sf(j, k, l))/(6._wp*dt)
            end do
        end do
    end do

    if (grid_geometry == 3) then
        $:GPU_PARALLEL_LOOP(collapse=4)
        do l = 0, p
            do k = 0, n
                do j = 0, m
                    do r = -fd_number, fd_number
                        q_sf(j, k, l) = q_sf(j, k, l) &
                                        + q_prim_vf0(momxb)%sf(j, k, l)*fd_coeff_x(r, j)* &
                                        q_prim_vf0(momxe)%sf(r + j, k, l) &
                                        + q_prim_vf0(momxb + 1)%sf(j, k, l)*fd_coeff_y(r, k)* &
                                        q_prim_vf0(momxe)%sf(j, r + k, l) &
                                        + q_prim_vf0(momxe)%sf(j, k, l)*fd_coeff_z(r, l)* &
                                        q_prim_vf0(momxe)%sf(j, k, r + l)/y_cc(k) &
                                        + (q_prim_vf0(momxe)%sf(j, k, l)* &
                                        q_prim_vf0(momxb + 1)%sf(j, k, l))/y_cc(k)
                    end do
                end do
            end do
        end do
    else
        $:GPU_PARALLEL_LOOP(collapse=4)
        do l = 0, p
            do k = 0, n
                do j = 0, m
                    do r = -fd_number, fd_number
                        q_sf(j, k, l) = q_sf(j, k, l) &
                                        + q_prim_vf0(momxb)%sf(j, k, l)*fd_coeff_x(r, j)* &
                                        q_prim_vf0(momxe)%sf(r + j, k, l) &
                                        + q_prim_vf0(momxb + 1)%sf(j, k, l)*fd_coeff_y(r, k)* &
                                        q_prim_vf0(momxe)%sf(j, r + k, l) &
                                        + q_prim_vf0(momxe)%sf(j, k, l)*fd_coeff_z(r, l)* &
                                        q_prim_vf0(momxe)%sf(j, k, r + l)
                    end do
                end do
            end do
        end do
    end if
end if
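All of the time-derivative loops above apply the same four-point backward-difference stencil, (11 u^n - 18 u^{n-1} + 9 u^{n-2} - 2 u^{n-3})/(6 dt). As a quick sanity check (a standalone Python sketch, not MFC code), this stencil reproduces the derivative exactly for polynomials up to degree three, consistent with a third-order-accurate scheme:

```python
def backward_diff(u, t, dt):
    """Four-point backward difference in time, as in the acceleration probe:
    (11*u^n - 18*u^{n-1} + 9*u^{n-2} - 2*u^{n-3}) / (6*dt)."""
    return (11*u(t) - 18*u(t - dt) + 9*u(t - 2*dt) - 2*u(t - 3*dt)) / (6*dt)

dt = 0.1
# Exact on quadratics and cubics: d/dt t**2 = 2t, d/dt t**3 = 3t**2 at t = 1
d_quadratic = backward_diff(lambda t: t**2, 1.0, dt)  # ~2.0
d_cubic = backward_diff(lambda t: t**3, 1.0, dt)      # ~3.0
```

This also shows why NaNs in any of q_prim_vf1 through q_prim_vf3 (i.e., uninitialized time-history buffers on the device) would immediately poison the probe output.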
The code uses hardcoded momentum indices like momxb, momxe instead of the original mom_idx%beg, mom_idx%end pattern. This change should be verified to ensure these variables are properly defined and accessible in the GPU context.
The center of mass calculation uses atomic updates for accumulating values in parallel loops. While necessary for correctness, this could create performance bottlenecks on GPUs due to serialization of memory access. Consider using reduction operations instead.
                $:GPU_ATOMIC(atomic='update')
                c_m(i, 1) = c_m(i, 1) + q_vf(i)%sf(j, k, l)*dV
                ! x-location weighted
                $:GPU_ATOMIC(atomic='update')
                c_m(i, 2) = c_m(i, 2) + q_vf(i)%sf(j, k, l)*dV*x_cc(j)
                ! Volume fraction
                $:GPU_ATOMIC(atomic='update')
                c_m(i, 5) = c_m(i, 5) + q_vf(i + advxb - 1)%sf(j, k, l)*dV
            end do
        end do
    end do
end do
elseif (p == 0) then ! 2D simulation
    $:GPU_PARALLEL_LOOP(collapse=3,private='[dV]')
    do l = 0, p ! Loop over grid
        do k = 0, n
            do j = 0, m
                $:GPU_LOOP(parallelism='[seq]')
                do i = 1, num_fluids ! Loop over individual fluids
                    dV = dx(j)*dy(k)
                    ! Mass
                    $:GPU_ATOMIC(atomic='update')
                    c_m(i, 1) = c_m(i, 1) + q_vf(i)%sf(j, k, l)*dV
                    ! x-location weighted
                    $:GPU_ATOMIC(atomic='update')
                    c_m(i, 2) = c_m(i, 2) + q_vf(i)%sf(j, k, l)*dV*x_cc(j)
                    ! y-location weighted
                    $:GPU_ATOMIC(atomic='update')
                    c_m(i, 3) = c_m(i, 3) + q_vf(i)%sf(j, k, l)*dV*y_cc(k)
                    ! Volume fraction
                    $:GPU_ATOMIC(atomic='update')
                    c_m(i, 5) = c_m(i, 5) + q_vf(i + advxb - 1)%sf(j, k, l)*dV
                end do
            end do
        end do
    end do
else ! 3D simulation
    $:GPU_PARALLEL_LOOP(collapse=3,private='[dV]')
    do l = 0, p ! Loop over grid
        do k = 0, n
            do j = 0, m
                $:GPU_LOOP(parallelism='[seq]')
                do i = 1, num_fluids ! Loop over individual fluids
                    dV = dx(j)*dy(k)*dz(l)
                    ! Mass
                    $:GPU_ATOMIC(atomic='update')
                    c_m(i, 1) = c_m(i, 1) + q_vf(i)%sf(j, k, l)*dV
                    ! x-location weighted
                    $:GPU_ATOMIC(atomic='update')
                    c_m(i, 2) = c_m(i, 2) + q_vf(i)%sf(j, k, l)*dV*x_cc(j)
                    ! y-location weighted
                    $:GPU_ATOMIC(atomic='update')
                    c_m(i, 3) = c_m(i, 3) + q_vf(i)%sf(j, k, l)*dV*y_cc(k)
                    ! z-location weighted
                    $:GPU_ATOMIC(atomic='update')
                    c_m(i, 4) = c_m(i, 4) + q_vf(i)%sf(j, k, l)*dV*z_cc(l)
                    ! Volume fraction
                    $:GPU_ATOMIC(atomic='update')
                    c_m(i, 5) = c_m(i, 5) + q_vf(i + advxb - 1)%sf(j, k, l)*dV
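The reduction alternative suggested above can be illustrated outside Fortran. In this hypothetical Python sketch (invented names, not MFC code), the atomic pattern updates one shared accumulator per cell, which serializes on a GPU, while the reduction pattern forms independent per-cell products combined once at the end; both yield the same totals:

```python
import numpy as np

def com_atomic_style(q, dx, x_cc):
    # Mirrors the GPU_ATOMIC pattern: every cell updates shared accumulators.
    # On a GPU these per-cell updates serialize, which is the bottleneck.
    c_m = np.zeros(2)
    for j in range(q.size):
        dV = dx[j]
        c_m[0] += q[j]*dV            # mass
        c_m[1] += q[j]*dV*x_cc[j]    # x-location weighted mass
    return c_m

def com_reduction_style(q, dx, x_cc):
    # Reduction pattern: elementwise products reduced once at the end,
    # roughly what a reduction clause lets the compiler generate.
    return np.array([np.sum(q*dx), np.sum(q*dx*x_cc)])

rng = np.random.default_rng(0)
q, dx, x_cc = rng.random(64), rng.random(64), rng.random(64)
atomic_result = com_atomic_style(q, dx, x_cc)
reduced_result = com_reduction_style(q, dx, x_cc)
```

A caveat for the real code: a reduction clause needs one named scalar per reduced quantity, so the c_m(i, 1:5) rows would have to be split into scalars (or reduced per fluid), which is likely why atomics were the simpler first port.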
The array access q_prim_vf0(momxb)%sf(r + j, k, l) can cause out-of-bounds memory access when r + j exceeds array bounds. Add boundary checks to prevent accessing invalid memory locations, which could cause crashes or incorrect results.
$:GPU_PARALLEL_LOOP(collapse=4)
do l = 0, p
do k = 0, n
do j = 0, m
do r = -fd_number, fd_number
-                q_sf(j, k, l) = q_sf(j, k, l) &
-                    + q_prim_vf0(momxb)%sf(j, k, l)*fd_coeff_x(r, j)* &
-                    q_prim_vf0(momxb)%sf(r + j, k, l)
+                if (r + j >= 0 .and. r + j <= m) then
+                    q_sf(j, k, l) = q_sf(j, k, l) &
+                        + q_prim_vf0(momxb)%sf(j, k, l)*fd_coeff_x(r, j)* &
+                        q_prim_vf0(momxb)%sf(r + j, k, l)
+                end if
end do
end do
end do
end do
Suggestion importance[1-10]: 9
Why: The suggestion correctly identifies a potential out-of-bounds memory access in q_prim_vf0(momxb)%sf(r + j, k, l), which is a critical bug that could lead to incorrect results or crashes.
High
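Whether this guard is actually needed depends on whether %sf carries ghost/halo cells beyond the 0..m interior at probe time. If it does not, the guarded accumulation behaves like this small language-agnostic sketch (Python, hypothetical names, not MFC code):

```python
def guarded_stencil(u, coeff, fd_number, m):
    """Apply a finite-difference stencil to u[0..m], skipping neighbor
    indices r + j that fall outside the valid range (the suggested guard)."""
    out = [0.0]*(m + 1)
    for j in range(m + 1):
        for r in range(-fd_number, fd_number + 1):
            if 0 <= r + j <= m:        # boundary guard from the suggestion
                out[j] += coeff[r + fd_number]*u[r + j]
    return out

# Centered first-derivative stencil [-1/2, 0, 1/2] on u(x) = x, grid 0..4:
# interior slope is 1, but the guard silently truncates the stencil at the edges.
u = [float(j) for j in range(5)]
d = guarded_stencil(u, [-0.5, 0.0, 0.5], 1, 4)
```

Note the trade-off the sketch exposes: skipping out-of-range neighbors changes the stencil (and its accuracy) at the boundary, so one-sided fd_coeff entries or properly filled ghost cells are usually preferable to a bare guard.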
Fix data dependencies in parallel loops
The cascading assignments create data dependencies that prevent proper GPU parallelization. The assignments should be done in separate loops or use temporary variables to avoid race conditions and ensure correct execution order.
+! Use separate loops to avoid data dependencies
$:GPU_PARALLEL_LOOP(collapse=4)
do i = 1, sys_size
do l = 0, p
do k = 0, n
do j = 0, m
-                    q_prim_ts(3)%vf(i)%sf(j, k, l) = q_prim_ts(2)%vf(i)%sf(j, k, l)
-                    q_prim_ts(2)%vf(i)%sf(j, k, l) = q_prim_ts(1)%vf(i)%sf(j, k, l)
-                    q_prim_ts(1)%vf(i)%sf(j, k, l) = q_prim_ts(0)%vf(i)%sf(j, k, l)
-                    q_prim_ts(0)%vf(i)%sf(j, k, l) = q_prim_vf(i)%sf(j, k, l)
+                    temp_val = q_prim_ts(2)%vf(i)%sf(j, k, l)
+                    q_prim_ts(3)%vf(i)%sf(j, k, l) = temp_val
end do
end do
end do
end do
+! Continue with separate loops for other assignments
Suggestion importance[1-10]: 9
Why: The suggestion correctly identifies a read-after-write data dependency within a parallel loop, which creates a race condition and will lead to incorrect results.
File Walkthrough
m_checker.fpp (src/simulation/m_checker.fpp): prohibit probe writes with IGR
m_derived_variables.fpp (src/simulation/m_derived_variables.fpp): GPU support for derived variables computation
m_time_steppers.fpp (src/simulation/m_time_steppers.fpp): GPU-compatible time step cycling