Improve SGEMM, DGEMM, CGEMM and ZGEMM kernels for RISC-V ZVL128B and ZVL256B #5561
k-yeung wants to merge 6 commits into OpenMathLib:develop
Conversation
These kernels use the same algorithm as the original ones, but improve on them in several respects:
- Common code is shared between the two kernel types and architectures, reducing the overall lines of code.
- Instead of using auto-generated code with each vector operation repeated multiple times (necessitated in part by the vector types not being storable in arrays), macros and forced inlining are used to achieve the same effect directly, for better clarity and a major reduction in lines of code.
- The tiling sizes can be modified within a certain range by changing the unroll parameters.
- The RVV extension does not require the number of elements processed by a vector operation to be a power of two, so the tails of the matrix involving vector operations can be dealt with in a single operation rather than in decreasing powers of two, thereby improving performance. The tcopy operations need to be modified to take this into account.
…6B architectures
This applies the changes made to the SGEMM/DGEMM kernels to their complex CGEMM/ZGEMM equivalents.
The original kernels have been deleted with this commit.
…sizes for RISC-V
The unroll factors of the TRMM kernels are currently set to those of the equivalent GEMM kernels. As we are not dealing with the TRMM kernels for now, I have added extra xTRMM_UNROLL_(M|N) parameters to allow them to be set independently of the xGEMM_UNROLL_(M|N) parameters. If a new TRMM parameter is not defined, it defaults back to the original behaviour of using the GEMM parameters.
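A minimal sketch of that fallback, shown for the D (double) case using the parameter names from the commit message; the exact spelling in the tree's param headers may differ:

```c
/* Hypothetical names following the commit message: if a dedicated TRMM
 * unroll is not defined, fall back to the corresponding GEMM unroll. */
#ifndef DTRMM_UNROLL_M
#define DTRMM_UNROLL_M DGEMM_UNROLL_M
#endif
#ifndef DTRMM_UNROLL_N
#define DTRMM_UNROLL_N DGEMM_UNROLL_N
#endif
```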
…tecture
Testing has shown that 4x8 or 4x4 performs better than the original 8x8 tiling size for this kernel. As we do not wish to perturb the behaviour of the DTRMM kernel at this point, the DTRMM tiling size is explicitly set to the original 8x8.
If DYNAMIC_ARCH is enabled, the various xGEMM_UNROLL_(M|N) macros expand to 'gotoblas-><parameter>', which prevents their use in preprocessor conditionals, as the value is only known at runtime. To get around this, we use the xGEMM_UNROLL_(M|N)_DEFAULT macros instead, which should expand to a compile-time constant. As the tiling factors used in the tcopy/ncopy functions are also selected at compile time, this does not reduce functionality: changing the tiling at runtime without changing tcopy/ncopy would produce incorrect results anyway.
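A sketch of the distinction, with macro names taken from the description above (the real definitions live in the OpenBLAS parameter headers and may be spelled differently):

```c
/* With DYNAMIC_ARCH, DGEMM_UNROLL_M expands to something like
 * gotoblas->dgemm_unroll_m, a value only known at runtime, so
 *     #if DGEMM_UNROLL_M == 4
 * cannot be evaluated by the preprocessor.  The copy routines therefore
 * key off the compile-time default instead: */
#if DGEMM_UNROLL_M_DEFAULT == 4
/* emit packing code for a 4-row micro-tile */
#else
/* emit packing code for an 8-row micro-tile */
#endif
```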
@k-yeung Can you post some performance timings for this patch versus before?

I have attached a spreadsheet with some benchmarks I did multiplying matrices of various sizes on a Banana Pi BPI-F3 board. The matrix sizes are chosen to reflect common matrix sizes used in LLMs.

I wrote the following test harness to run the tests (GitHub doesn't seem to accept .cpp attachments, so I gzipped it):

We've seen some really bad caching effects for the BananaPi and GEMM. This seems to help very large GEMMs for that particular platform. I have doubts about whether new platforms that are coming out with better caching and out-of-order execution will benefit as much. Plus it makes reading the code challenging. Also, what about debugging if it is macros? I'm currently working on SHGEMM (3X faster) and SBGEMM (1.5X faster). I would love to eventually make these same types of changes to those routines if it is warranted.

You could check your test program into the kernels directory just so others can use it.

This is concerning... DGEMM - 2048x2048 - up to a 75% slowdown?!?

Could these changes be put into a KERNEL.x60 (BananaPi)? Then we'll see how well that works and whether a huge disruptive change (packing change, strip mining (which is good), file renaming, etc.) like this is a good thing.
That was for the 4x4 DGEMM tiling size - I settled on 4x8 in the patchset. DGEMM was the one kernel where I had to change the tiling size in order to see any speed-up - I wonder if that has something to do with the board's cache configuration? Though if so, I would expect ZGEMM to show the same behaviour.
It might be more appropriate to have it in the benchmark directory, though?

... and I'd prefer to leave the previous kernels available instead of deleting them - there could be platforms where they are still advantageous |
I agree. We shouldn't be making such a huge change decision based on a BananaPi.

Could you post timings for smaller (less than 4K in each dimension) GEMMs? I'd like it if it doesn't slow these down.

I have an idea to speed up low values of M (less than 16) without changing the packing and blocking.

These patches improve the performance of the kernels by making use of variable-length vectors in the RISC-V vector extensions.
OpenBLAS generally assumes that SIMD instructions work only on power-of-two numbers of elements, so, for example, when dealing with a tail of length 7, it is expected that we will do a vector operation with 4 elements, then with 2, and finally with 1. On RISC-V this is wasteful, as it could be done in just one operation instead of three.
OpenBLAS reorders the matrix layout before it reaches the kernel in order to optimise cache usage, so this packing step needs to be modified first. As only the LHS matrix is read into vector registers, only the tcopy operation needs to be modified to present the tail in one chunk instead of breaking it down into powers of 2.
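A schematic of that packing change, not the actual gemm_tcopy code (the real routines also handle transposition, multiple data types and the ZVL-specific unrolls); it only shows the tail being emitted as one panel of arbitrary width instead of panels of 4, 2 and 1 rows:

```c
/* Pack an m x k column-major block of A into row panels of `unroll` rows,
 * emitting the leftover rows as a single panel of (m % unroll) rows. */
static void pack_a_panels(long m, long k, const float *a, long lda,
                          float *packed, long unroll)
{
    long i = 0;
    for (; i + unroll <= m; i += unroll)        /* full panels */
        for (long j = 0; j < k; j++)
            for (long ii = 0; ii < unroll; ii++)
                *packed++ = a[(i + ii) + j * lda];

    long tail = m - i;                          /* anything from 1 to unroll-1 */
    if (tail > 0)                               /* one panel, not 4 + 2 + 1    */
        for (long j = 0; j < k; j++)
            for (long ii = 0; ii < tail; ii++)
                *packed++ = a[(i + ii) + j * lda];
}
```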
In the kernels, we can then simply set the vector length on each instruction (which is presented as the last argument in the intrinsics) to the tile size for most of the computation and the remainder in the tail.
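As a minimal illustration of that vl mechanism (not the PR's kernel code; intrinsic names follow the current `__riscv_`-prefixed RVV intrinsics), here is an AXPY-style loop where a tail of, say, 7 elements is handled by a single pass with vl == 7 instead of passes of 4, 2 and 1:

```c
#include <stddef.h>
#include <riscv_vector.h>

/* y += alpha * x over n floats; vsetvl clamps the request to VLMAX, so the
 * final partial pass processes exactly the remaining elements in one go. */
void saxpy_rvv(size_t n, float alpha, const float *x, float *y)
{
    while (n > 0) {
        size_t vl = __riscv_vsetvl_e32m1(n);             /* vl = min(n, VLMAX) */
        vfloat32m1_t vx = __riscv_vle32_v_f32m1(x, vl);  /* load vl elements   */
        vfloat32m1_t vy = __riscv_vle32_v_f32m1(y, vl);
        vy = __riscv_vfmacc_vf_f32m1(vy, alpha, vx, vl); /* vy += alpha * vx   */
        __riscv_vse32_v_f32m1(y, vy, vl);                /* store vl elements  */
        x += vl;
        y += vl;
        n -= vl;
    }
}
```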
There is a complication in that vector types like vfloat32m1_t cannot be placed into arrays by the compiler (presumably because they represent actual registers, not areas of memory), so if we want the same vector operation done N times to N different vectors, it needs to be written out N times instead of using a for loop. This was originally done using a Python code generator to autogenerate some very repetitive code, but for ease of experimentation I moved to using macros instead.
A RISCV_REPEAT macro is used to repeat a vector instruction N times. However, as no loops are allowed, this would necessitate at least one conditional jump (if we use a switch on N with reverse-ordered fallthrough cases), which causes a significant performance penalty in such a tight loop. Instead, I ensured that N is a compile-time constant at the call site and force-inlined the helper, relying on the compiler's constant propagation and dead-code elimination to remove any extraneous calls.
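One way to read that approach, as an illustrative sketch rather than the actual RISCV_REPEAT macro (the accumulator, pointer and macro names here are hypothetical): each accumulator is a separately named variable, the repetition is a reverse-ordered fallthrough switch, and when the count is a literal constant at a force-inlined call site, constant propagation folds the switch away, leaving exactly N straight-line vfmacc instructions.

```c
#include <riscv_vector.h>

#define FORCE_INLINE static inline __attribute__((always_inline))

/* One FMA on the i-th accumulator: acc<i> += a_ptr[i] * b_vec. */
#define FMACC(i) \
    acc##i = __riscv_vfmacc_vf_f32m1(acc##i, a_ptr[i], b_vec, vl)

/* Apply FMACC to the first n accumulators, with n known at compile time. */
#define REPEAT_FMACC_UP_TO_4(n)               \
    switch (n) {                              \
    case 4: FMACC(3); /* falls through */     \
    case 3: FMACC(2); /* falls through */     \
    case 2: FMACC(1); /* falls through */     \
    case 1: FMACC(0);                         \
    }

/* Example expansion site: the literal 4 lets constant propagation and
 * dead-code elimination reduce the switch to four consecutive vfmacc ops. */
FORCE_INLINE vfloat32m1_t micro_step(const float *a_ptr, vfloat32m1_t b_vec,
                                     vfloat32m1_t acc0, vfloat32m1_t acc1,
                                     vfloat32m1_t acc2, vfloat32m1_t acc3,
                                     size_t vl)
{
    REPEAT_FMACC_UP_TO_4(4);
    return acc0;   /* a real kernel would keep all four accumulators */
}
```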
With this approach, the number of lines of code in the kernels is dramatically decreased, and I believe the result is much more readable. I further reduced the amount of code by factoring out the code common to SGEMM & DGEMM, CGEMM & ZGEMM, and gemm_tcopy_* and zgemm_tcopy_* into separate files and including it as required.