@createthis createthis commented Dec 12, 2025

Just messing around, trying to track down this correctness issue without buying an HGX B200.

  • Experimental FP8 KV Cache. Not wired to anything.

  • Vendor the FlashMLA MLA decode sm100 kernel, but disable it due to compilation issues.

  • Add various glue-code unit tests in an (unsuccessful) attempt to track down the correctness issue.

  • Add LLAMA_INDEXER_FP8_TC=1 FP8 tensor core MMA Lightning Indexer Kernel for sm_120a.
    This is a home-grown kernel that uses cute::tl_mma::GemmTensorOp to do real FP8 MMA, just like the vendored tilelang kernel, but it works with the production inference shapes (D=128, H=64).
    The tilelang kernel is essentially a single GEMM followed by an epilogue; ours is 8 GEMMs (N=8) plus a fused reduction (a simplified structural sketch follows this list).
    Unlike the real tilelang kernel, however, this one does not use TMA.

    Profiling:

    [PROFILE_FP8_GATHER] TILELANG_INDEXER D=128 H=64 Tc=1 kv=163840 avg_ms=0.021 over 50 calls
    [PROFILE_FP8_TC_HGRP] TILELANG_INDEXER D=128 H=64 Tc=1 kv=163840 avg_ms=0.029 over 50 calls
    [PROFILE] IDX_TILE CUDA D=128 H=64 Tc=1 kv=163840 avg_ms=0.075 over 50 calls
    [PROFILE] SPARSE_TOPK_RADIX2_VLLM N=163840 T=1 k=2048 avg_ms=0.192 over 50 calls
    [PROFILE] SPARSE_MLA_DECODE D=576 Hq=128 Hkv=1 Dv=512 Nkv=163840 K=2048 avg_ms=1.860 over 50 calls
    
  • Fixed the FP8 K Indexer Cache first-prompt degenerate-generation bug and wired it into llama-model.cpp: 6.42 tok/s enabled vs. 5.46 tok/s disabled for the same prompt.
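
To make the "8 GEMMs + fused reduction" structure concrete, here is a minimal reference sketch for the production shapes (D=128, H=64). It is plain fp32 arithmetic with one thread per KV position, not the actual cute::tl_mma::GemmTensorOp tensor-core kernel; the per-head-group q_scale, the per-token k_scale, and the simple sum over heads are illustrative assumptions, not the PR's exact math.

// Naive reference sketch (not the real kernel): one thread per KV position,
// 8 head-group "GEMMs" of 8 heads each, then a fused reduction over groups.
// q_scale (per head-group) and k_scale (per KV token) are illustrative names.
#include <cuda_fp8.h>

__global__ void indexer_score_ref(const __nv_fp8_e4m3 * q,       // [H=64, D=128]
                                  const __nv_fp8_e4m3 * k,       // [n_kv, D=128]
                                  const float         * q_scale, // [8] per head-group (assumed)
                                  const float         * k_scale, // [n_kv] per KV token (assumed)
                                  float               * score,   // [n_kv]
                                  int                    n_kv) {
    const int D = 128, H = 64, HGRP = 8;
    const int kv = blockIdx.x * blockDim.x + threadIdx.x;
    if (kv >= n_kv) return;

    float acc = 0.0f;
    for (int g = 0; g < H / HGRP; ++g) {       // the 8 "GEMMs"
        float grp = 0.0f;
        for (int h = g * HGRP; h < (g + 1) * HGRP; ++h) {
            float dot = 0.0f;
            for (int d = 0; d < D; ++d) {
                dot += float(q[h * D + d]) * float(k[kv * D + d]);
            }
            grp += dot;
        }
        acc += grp * q_scale[g];               // fused reduction over head-groups
    }
    score[kv] = acc * k_scale[kv];
}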

Hmm. Switching to the built-in llama.cpp web UI, I'm seeing signs of degenerate generation as early as the first prompt, regardless of which indexer kernel is used and regardless of whether the cache is enabled.

Another interesting thing I just noticed: the vLLM top-k kernel works if I prompt it from Open WebUI, but if I prompt it from the built-in web UI, llama.cpp crashes.

degenerate generation after 2k of context.
- test-sparse-kv-partition
- test-sparse-kv-windowing
…cache.

- The WMMA HGRP kernel now has a per-(token, head-group) Q scale to prevent FP8 saturation, mirroring vLLM’s per-token FP8 quantization of Q (a minimal sketch follows this note).
- Host-side heuristics (the q_rms-based q_scale proxy and the K RMS proxy) have been removed or replaced with identity (1.0) scales, avoiding double scaling and better matching vLLM’s design, where scaling is handled in the FP8 quantization pipeline rather than as extra GGML multipliers.
- New tests directly exercise the critical FP8 indexer paths.

Each change targets a specific discrepancy or bug:

- Missing UE8M0 in K quant (see the UE8M0 sketch below).
- Q saturation in the WMMA fused kernel.
- Extra heuristic scales that are no longer appropriate.
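
As a concrete illustration of the per-(token, head-group) Q scaling, here is a minimal host-side sketch, assuming fp32 inputs and the E4M3 maximum finite value of 448; the function name and layout are hypothetical, not the PR's API.

// Hypothetical helper: take the absolute max over one head-group's Q values
// for a token and map it onto the FP8 E4M3 range so quantization cannot
// saturate. Layout and names are illustrative.
#include <algorithm>
#include <cmath>

static float q_group_scale(const float * q_grp, int n /* HGRP * D values */) {
    float amax = 0.0f;
    for (int i = 0; i < n; ++i) {
        amax = std::max(amax, std::fabs(q_grp[i]));
    }
    const float FP8_E4M3_MAX = 448.0f;
    return amax > 0.0f ? amax / FP8_E4M3_MAX : 1.0f;   // avoid a zero scale
}

// Quantize with q_fp8[i] = fp8_e4m3(q_grp[i] / scale) and multiply the
// accumulated dot products by the same scale in the kernel epilogue.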
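And a minimal sketch of a UE8M0 scale for the K quant, assuming the usual definition of UE8M0 as a single biased-exponent byte encoding a power-of-two scale 2^(e - 127); rounding up keeps the quantized block below the E4M3 maximum. Names are illustrative only.

// Encode a power-of-two (UE8M0) scale from a block's absolute max, rounding
// up so no element saturates when quantized to FP8 E4M3.
#include <algorithm>
#include <cmath>
#include <cstdint>

static uint8_t ue8m0_scale(float amax) {
    const float FP8_E4M3_MAX = 448.0f;
    float raw = amax > 0.0f ? amax / FP8_E4M3_MAX : 1.0f;
    int e = (int) std::ceil(std::log2(raw)) + 127;      // biased exponent, rounded up
    return (uint8_t) std::min(std::max(e, 0), 254);     // 255 is reserved (NaN)
}

static float ue8m0_to_float(uint8_t e) {
    return std::ldexp(1.0f, (int) e - 127);             // decode: 2^(e - 127)
}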
seem to help the degenerate generation situation though. In fact, it might make it a little worse. 5.2 is very obstinate and refuses to continue work without making these changes, so consider going back and reverting them after it is done with things that matter more.

contain our FP8 tensor core mma attempts.
…el for sm_120a.

- Wire the FP8 sidecar into the actual DS3.2 sparse attention path in src/llama-model.cpp