
Improve SGEMM, DGEMM, CGEMM and ZGEMM kernels for RISC-V ZVL128B and ZVL256B #5561

Open
k-yeung wants to merge 6 commits into OpenMathLib:develop from riseproject-dev:riscv-rvv_vlv

Conversation

@k-yeung commented Dec 5, 2025

These patches improve the performance of the kernels by making use of variable-length vectors in the RISC-V vector extensions.

OpenBLAS generally assumes that SIMD instructions work only on power-of-two numbers of elements, so when dealing with a tail of length 7, for example, it is expected that we will do a vector operation on 4 elements, then 2, then 1. On RISC-V this is wasteful, as the same tail could be handled in a single operation instead of three.
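
As a concrete illustration (a minimal sketch using the standard RVV v1.0 intrinsics, not code from this patch), `vsetvl` lets a tail of 7 floats be processed in one pass:

```c
#include <riscv_vector.h>

/* Sketch: scale a tail of n floats (e.g. n == 7) in a single vector
 * operation.  vsetvl clamps n to what one LMUL=1 register group holds,
 * so for small tails vl == n and no 4 + 2 + 1 splitting is needed. */
static void scale_tail(float *x, size_t n, float alpha) {
    size_t vl = __riscv_vsetvl_e32m1(n);
    vfloat32m1_t vx = __riscv_vle32_v_f32m1(x, vl);
    vx = __riscv_vfmul_vf_f32m1(vx, alpha, vl);
    __riscv_vse32_v_f32m1(x, vx, vl);
}
```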

OpenBLAS reorders the matrix layout before it reaches the kernel in order to optimise cache usage, so the packing routines need to be modified first. As only the LHS matrix is read into vector registers, only the tcopy operation needs to be modified to present the tail in one chunk instead of breaking it down into powers of 2.
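
For illustration only (a hedged sketch, not the patch's gemm_tcopy code; the names are hypothetical), the packed tail is emitted in one piece:

```c
/* Sketch: pack the `rest` leftover rows of each column as one block of
 * height `rest`, rather than as sub-blocks of height 4, 2 and 1, so the
 * kernel can consume the tail with a single vector length. */
static void pack_tail(const float *a, long lda, long cols, long rest, float *b) {
    for (long j = 0; j < cols; j++)
        for (long i = 0; i < rest; i++)
            *b++ = a[j * lda + i];   /* column-major source, contiguous destination */
}
```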

In the kernels, we can then simply set the vector length on each instruction (passed as the last argument of each intrinsic) to the tile size for most of the computation, and to the remainder in the tail.
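
In sketch form (hedged; the function and variable names are illustrative, assuming the __riscv_ intrinsic API), the same code path serves both the full tile and the tail because only `vl` changes:

```c
#include <riscv_vector.h>

/* Sketch: one rank-1 update step of the micro-kernel.  `vl` is either
 * the full M-unroll or the tail length; every intrinsic takes it as its
 * last argument, so no separate tail kernel is required. */
static inline vfloat32m1_t fma_step(const float *a_packed, float b_val,
                                    vfloat32m1_t acc, size_t vl) {
    vfloat32m1_t va = __riscv_vle32_v_f32m1(a_packed, vl); /* load vl elements of A */
    return __riscv_vfmacc_vf_f32m1(acc, b_val, va, vl);    /* acc += b_val * va     */
}
```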

There is a complication in that vector types like vfloat32m1_t cannot be placed into arrays by the compiler (presumably because they represent actual registers, not areas of memory), so if we want the same vector operation done N times to N different vectors, it needs to be written out N times instead of using a for loop. This was originally done using a Python code generator to autogenerate some very repetitive code, but for ease of experimentation I moved to using macros instead.

A RISCV_REPEAT macro is used to repeat a vector instruction N times. However, as no loops are allowed, a naive implementation would need at least one conditional jump (e.g. a switch on N with reverse-ordered fallthrough cases), which causes a significant performance penalty in such a tight loop. Instead, I ensure that N is a compile-time constant at the call site and force-inline the repeated code, relying on the compiler's constant propagation and dead-code elimination to remove the extraneous copies.
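
One possible shape of the trick (a hedged sketch; only the RISCV_REPEAT name comes from the patch, the rest is illustrative): with N a compile-time constant, the guarded copies below constant-fold into straight-line code with no branches.

```c
#include <riscv_vector.h>

/* Sketch: expand a statement up to 8 times; copies beyond N are guarded
 * by constant conditions and eliminated by the optimiser, so no runtime
 * switch or jump survives in the inner loop. */
#define RISCV_REPEAT(N, STMT) do { \
    if ((N) > 0) { STMT(0); }      \
    if ((N) > 1) { STMT(1); }      \
    if ((N) > 2) { STMT(2); }      \
    if ((N) > 3) { STMT(3); }      \
    if ((N) > 4) { STMT(4); }      \
    if ((N) > 5) { STMT(5); }      \
    if ((N) > 6) { STMT(6); }      \
    if ((N) > 7) { STMT(7); }      \
} while (0)

/* acc0..accN must be distinct vfloat32m1_t variables, since vector types
 * cannot live in arrays; the token-pasting macro below names them: */
#define FMA_COL(i) acc##i = __riscv_vfmacc_vf_f32m1(acc##i, b[i], va, vl)
/* RISCV_REPEAT(4, FMA_COL);   expands to four straight-line FMAs */
```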

With this approach, the number of lines of code in the kernels is dramatically reduced, and I believe the result is much more readable. I reduced the line count further by factoring the code common to SGEMM & DGEMM, CGEMM & ZGEMM, and gemm_tcopy_* & zgemm_tcopy_* into separate files that are included as required.

Kwok Cheung Yeung added 6 commits December 5, 2025 20:21
These kernels use the same algorithm as the original ones, but improve on them
in several respects:

- Common code is shared between the two kernel types and architectures,
reducing the overall lines of code.

- Instead of using auto-generated code with each vector operation repeated
multiple times (necessitated in part due to the vector types not being
storable in arrays), macros and forced inlining are used to achieve the
same effect directly for better clarity and a major reduction in lines of
code.

- The tiling sizes can be modified within a certain range by modifying the
unroll parameters.

- The RVV extension does not require the number of elements processed by a
vector operation to be a power of two, so the tails of the matrix involving
vector operations can be dealt with in a single operation rather than in
decreasing powers of two, thereby improving performance.  The tcopy operations
need to be modified to take this into account.
…6B architectures

This applies the changes made to the SGEMM/DGEMM kernels to their complex
CGEMM/ZGEMM equivalents.
The original kernels have been deleted with this commit.
…sizes for RISC-V

The unroll factors of the TRMM kernels are currently set to those of the
equivalent GEMM kernels.  As we are not dealing with the TRMM kernels for now,
I have added extra xTRMM_UNROLL_(M|N) parameters to allow them to be set
independently of the xGEMM_UNROLL_(M|N) parameters.

If the new TRMM parameter is not defined, then it defaults back to the
original behaviour of using the GEMM parameters.
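
The fallback presumably follows the usual parameter-default pattern; a hedged sketch (macro names follow the description above, not the exact patch text):

```c
/* Sketch: if a target's param.h does not define the new TRMM-specific
 * unroll factors, fall back to the GEMM ones, i.e. the old behaviour. */
#ifndef DTRMM_UNROLL_M
#define DTRMM_UNROLL_M DGEMM_UNROLL_M
#endif
#ifndef DTRMM_UNROLL_N
#define DTRMM_UNROLL_N DGEMM_UNROLL_N
#endif
```
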
…tecture

Testing has shown that 4x8 or 4x4 performs better than the original 8x8 tiling
size for this kernel.

As we do not wish to perturb the behaviour of the DTRMM kernel at this point,
the DTRMM tiling size is explicitly set to the original 8x8.

If DYNAMIC_ARCH is enabled, then the various xGEMM_UNROLL_(M|N) macros
expand to 'gotoblas-><parameter>', which prevents their use in preprocessor
conditionals as the value is only known at runtime.

To get around this, we use the xGEMM_UNROLL_(M|N)_DEFAULT macros instead,
which should expand to compile-time constants.  As the tiling factors
used in the tcopy/ncopy functions are also selected by these macros at
compile time, this does not reduce functionality: changing the tiling at
runtime without also changing tcopy/ncopy would produce incorrect results
anyway.
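
In sketch form (illustrative, not the exact patch): under DYNAMIC_ARCH the unroll macro is a runtime lookup and cannot drive a preprocessor conditional, while the _DEFAULT variant can.

```c
/* Sketch of the issue and the workaround.  With DYNAMIC_ARCH, something
 * like "#define DGEMM_UNROLL_M gotoblas-><parameter>" resolves only at
 * runtime, so it is unusable in an #if.  The compile-time default is
 * what the kernel and tcopy/ncopy must agree on anyway: */
#if DGEMM_UNROLL_M_DEFAULT == 4
  /* 4-row tiling path */
#else
  /* 8-row tiling path */
#endif
```
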
@k-yeung marked this pull request as draft December 5, 2025 21:33
@k-yeung closed this Dec 5, 2025
@ChipKerchner
Contributor

@k-yeung Can you post some performance timings for this patch versus before?

@k-yeung reopened this Feb 13, 2026
@k-yeung
Author

k-yeung commented Feb 13, 2026

I have attached a spreadsheet with some benchmarks I ran, multiplying matrices of various sizes on a Banana Pi BPI-F3 board. The matrix sizes were chosen to reflect common matrix sizes used in LLMs.

LLM benchmark.xlsx

@k-yeung marked this pull request as ready for review February 13, 2026 20:11
@k-yeung
Author

k-yeung commented Feb 13, 2026

I wrote the following test harness to run the tests (GitHub doesn't seem to accept .cpp attachments, so I gzipped it):

test_cblas_gemm.cpp.gz

@ChipKerchner
Contributor

ChipKerchner commented Feb 13, 2026

We've seen some really bad caching effects for the BananaPi and GEMM. This seems to help very large GEMMs for that particular platform. I have doubts about whether new platforms coming out with better caching and out-of-order execution will benefit as much. Plus it makes reading the code challenging. Also, what about debugging if it is all macros?

I'm currently working on SHGEMM (3X faster) and SBGEMM (1.5X faster). I would love to eventually make these same types of changes to those routines if it is warranted.

@ChipKerchner
Contributor

ChipKerchner commented Feb 13, 2026

You could check your test program into the kernels directory just so others can use it.

@ChipKerchner
Contributor

This is concerning...

DGEMM - 2048x2048 - up to a 75% slowdown?!?

@ChipKerchner
Contributor

ChipKerchner commented Feb 13, 2026

Could these changes be put into a KERNEL.x60 (BananaPi)? Then we'll see how well that works and whether a huge, disruptive change (packing change, strip mining (which is good), file renaming, etc.) like this is a good thing.

@k-yeung
Author

k-yeung commented Feb 13, 2026

This is concerning...

DGEMM - 2048x2048 - up to a 75% slowdown?!?

That was for the 4x4 DGEMM tiling size - I settled for 4x8 in the patchset. DGEMM was the one kernel that I had to change the tiling size for in order to see any speed-up - I wonder if that has something to do with the board cache configuration? Though if so, I would expect ZGEMM to show the same behaviour.

@martin-frbg
Collaborator

You could check your test program into the kernels directory just so others can use it.

It might be more appropriate to have it in the benchmark directory, though?

@martin-frbg
Collaborator

... and I'd prefer to leave the previous kernels available instead of deleting them - there could be platforms where they are still advantageous

@ChipKerchner
Contributor

... and I'd prefer to leave the previous kernels available instead of deleting them - there could be platforms where they are still advantageous

I agree. We shouldn't be making such a huge change decision based on a BananaPi.

@ChipKerchner
Contributor

Could you post timings for smaller GEMMs (less than 4K in each dimension)? I'd like it if it doesn't slow these down.

@ChipKerchner
Contributor

I have an idea to speed up low values of M (less than 16) without changing the packing and blocking.
