@createthis createthis commented Dec 12, 2025

Just messing around, trying to track down this correctness issue without buying an HGX B200.

  • Experimental FP8 KV Cache. Not wired to anything.

  • Vendor the FlashMLA MLA decode sm100 kernel, but disable it due to compilation issues.

  • Add various glue-code unit tests in an (unsuccessful) attempt to track down the correctness issue.

  • Add LLAMA_INDEXER_FP8_TC=1 FP8 tensor core MMA Lightning Indexer Kernel for sm_120a.
    This is a home-grown kernel that uses cute::tl_mma::GemmTensorOp to do real FP8 MMA, just like the vendored tilelang kernel, but it works with the production inference shapes (D=128, H=64).
    The tilelang kernel is essentially a single GEMM followed by an epilogue; ours is 8 GEMMs (N=8) plus a fused reduction (a simplified structural sketch follows this list).
    Unlike the real tilelang kernel, however, this one does not use TMA.

    Profiling:

    [PROFILE_FP8_GATHER] TILELANG_INDEXER D=128 H=64 Tc=1 kv=163840 avg_ms=0.021 over 50 calls
    [PROFILE_FP8_TC_HGRP] TILELANG_INDEXER D=128 H=64 Tc=1 kv=163840 avg_ms=0.029 over 50 calls
    [PROFILE] IDX_TILE CUDA D=128 H=64 Tc=1 kv=163840 avg_ms=0.075 over 50 calls
    [PROFILE] SPARSE_TOPK_RADIX2_VLLM N=163840 T=1 k=2048 avg_ms=0.192 over 50 calls
    [PROFILE] SPARSE_MLA_DECODE D=576 Hq=128 Hkv=1 Dv=512 Nkv=163840 K=2048 avg_ms=1.860 over 50 calls
    
  • Fixed the FP8 K Indexer Cache first-prompt degenerate-generation bug and wired it into llama-model.cpp: 6.42 tok/s enabled vs. 5.46 tok/s disabled for the same prompt.
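
To make the "8 GEMMs + fused reduction" structure concrete, here is a minimal reference sketch for the production shapes (D=128, H=64). It is plain fp32 arithmetic with one thread per KV position, not the actual cute::tl_mma::GemmTensorOp tensor-core kernel; the per-head-group q_scale, the per-token k_scale, and the simple sum over heads are illustrative assumptions, not the PR's exact math.

// Naive reference sketch (not the real kernel): one thread per KV position,
// 8 head-group "GEMMs" of 8 heads each, then a fused reduction over groups.
// q_scale (per head-group) and k_scale (per KV token) are illustrative names.
#include <cuda_fp8.h>

__global__ void indexer_score_ref(const __nv_fp8_e4m3 * q,       // [H=64, D=128]
                                  const __nv_fp8_e4m3 * k,       // [n_kv, D=128]
                                  const float         * q_scale, // [8] per head-group (assumed)
                                  const float         * k_scale, // [n_kv] per KV token (assumed)
                                  float               * score,   // [n_kv]
                                  int                    n_kv) {
    const int D = 128, H = 64, HGRP = 8;
    const int kv = blockIdx.x * blockDim.x + threadIdx.x;
    if (kv >= n_kv) return;

    float acc = 0.0f;
    for (int g = 0; g < H / HGRP; ++g) {       // the 8 "GEMMs"
        float grp = 0.0f;
        for (int h = g * HGRP; h < (g + 1) * HGRP; ++h) {
            float dot = 0.0f;
            for (int d = 0; d < D; ++d) {
                dot += float(q[h * D + d]) * float(k[kv * D + d]);
            }
            grp += dot;
        }
        acc += grp * q_scale[g];               // fused reduction over head-groups
    }
    score[kv] = acc * k_scale[kv];
}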

Hmm. Switching to the built-in llama.cpp web UI, I'm seeing signs of degenerate generation as early as the first prompt, regardless of which indexer kernel is used and regardless of whether the cache is enabled.

Another interesting thing I just noticed: the vLLM top-k kernel works if I prompt it from Open WebUI, but if I prompt it from the built-in web UI, llama.cpp crashes.

degenerate generation after 2k of context.
- test-sparse-kv-partition
- test-sparse-kv-windowing
…cache.

- The WMMA HGRP kernel now has a per-(token, head-group) Q scale to prevent FP8 saturation, mirroring vLLM’s per-token FP8 quantization of Q (a minimal sketch follows this note).
- Host-side heuristics (the q_rms-based q_scale proxy and the K RMS proxy) have been removed or replaced with identity (1.0) scales, avoiding double scaling and better matching vLLM’s design, where scaling is handled in the FP8 quantization pipeline rather than as extra GGML multipliers.
- New tests directly exercise the critical FP8 indexer paths.

Each change targets a specific discrepancy or bug:

- Missing UE8M0 in K quant (see the UE8M0 sketch below).
- Q saturation in the WMMA fused kernel.
- Extra heuristic scales that are no longer appropriate.
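
As a concrete illustration of the per-(token, head-group) Q scaling, here is a minimal host-side sketch, assuming fp32 inputs and the E4M3 maximum finite value of 448; the function name and layout are hypothetical, not the PR's API.

// Hypothetical helper: take the absolute max over one head-group's Q values
// for a token and map it onto the FP8 E4M3 range so quantization cannot
// saturate. Layout and names are illustrative.
#include <algorithm>
#include <cmath>

static float q_group_scale(const float * q_grp, int n /* HGRP * D values */) {
    float amax = 0.0f;
    for (int i = 0; i < n; ++i) {
        amax = std::max(amax, std::fabs(q_grp[i]));
    }
    const float FP8_E4M3_MAX = 448.0f;
    return amax > 0.0f ? amax / FP8_E4M3_MAX : 1.0f;   // avoid a zero scale
}

// Quantize with q_fp8[i] = fp8_e4m3(q_grp[i] / scale) and multiply the
// accumulated dot products by the same scale in the kernel epilogue.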
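And a minimal sketch of a UE8M0 scale for the K quant, assuming the usual definition of UE8M0 as a single biased-exponent byte encoding a power-of-two scale 2^(e - 127); rounding up keeps the quantized block below the E4M3 maximum. Names are illustrative only.

// Encode a power-of-two (UE8M0) scale from a block's absolute max, rounding
// up so no element saturates when quantized to FP8 E4M3.
#include <algorithm>
#include <cmath>
#include <cstdint>

static uint8_t ue8m0_scale(float amax) {
    const float FP8_E4M3_MAX = 448.0f;
    float raw = amax > 0.0f ? amax / FP8_E4M3_MAX : 1.0f;
    int e = (int) std::ceil(std::log2(raw)) + 127;      // biased exponent, rounded up
    return (uint8_t) std::min(std::max(e, 0), 254);     // 255 is reserved (NaN)
}

static float ue8m0_to_float(uint8_t e) {
    return std::ldexp(1.0f, (int) e - 127);             // decode: 2^(e - 127)
}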
seem to help the degenerate generation situation though. In fact, it might make it a little worse. 5.2 is very obstinate and refuses to continue work without making these changes, so consider going back and reverting them after it is done with things that matter more.

contain our FP8 tensor core mma attempts.
…el for sm_120a.

- Wire the FP8 sidecar into the actual DS3.2 sparse attention path in src/llama-model.cpp