forked from ggml-org/llama.cpp
Draft: Deepseek v3 2 exp vendor flashmla mla decode and fp8 kv cache #30
createthis wants to merge 33 commits into deepseek_v3_2_exp from deepseek_v3_2_exp_vendor_flashmla_mla_decode_and_fp8_kv_cache
Conversation
…vendor_flashmla_mla_decode
- llama_kv_cache_fp8 exposes the DS‑MLA blob
- DeepSeek V3.2 MLA path now passes kv_blob to sparse MLA
- Tests updated for new function signature
- CUDA path prepared to accept the blob pointer
- GGML_OP_SPARSE_MLA_DECODE now forwards the blob
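For orientation, here is a minimal sketch of that plumbing. Everything below (the struct layout, names such as kv_blob_fp8, and the stub signatures) is an assumption for illustration; the real llama_kv_cache_fp8 and GGML_OP_SPARSE_MLA_DECODE interfaces live in this branch and will differ. The only point is that the cache hands out one opaque blob (data plus per-token scales) and the sparse MLA decode op receives it untouched:

```cuda
// Hypothetical sketch: not the actual interfaces in this branch.
#include <cstdint>

// Opaque DS-MLA sidecar blob: packed FP8 KV payload plus per-token scale bytes.
struct kv_blob_fp8 {
    const uint8_t * data;     // packed FP8 (E4M3) KV entries
    const uint8_t * scales;   // one UE8M0 scale byte per token
    int64_t         n_tokens;
    int64_t         d_kv;     // per-token width in elements
};

// Cache side: expose the blob as a single handle.
struct kv_cache_fp8_stub {
    kv_blob_fp8 blob;
    const kv_blob_fp8 * get_blob() const { return &blob; }
};

// Model side: the MLA graph builder only forwards the handle; the CUDA kernel
// behind the sparse MLA decode op is what interprets data + scales together.
static void sparse_mla_decode_stub(const float * q, const int32_t * topk,
                                   const kv_blob_fp8 * kv_blob, float * out) {
    (void) q; (void) topk; (void) kv_blob; (void) out; // placeholder body
}
```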
…build since it is sm_100 and we only have sm_120a. We'll have to use a different approach to track down our math errors.
…degenerate generation after 2k of context.
- test-sparse-kv-partition
- test-sparse-kv-windowing
…cache.
- WMMA HGRP kernel now has a per-(token, head-group) Q scale to prevent FP8 saturation, mirroring vLLM’s per-token FP8 quantization of Q.
- Host-side heuristics (q_rms-based q_scale proxy and K RMS proxy) have been removed or replaced with ones, avoiding double scaling and better matching vLLM’s design, where scaling is handled in the FP8 quantization pipeline, not as extra GGML multipliers.
- New tests directly exercise the critical FP8 indexer paths.

Each change targets a specific discrepancy or bug:
- Missing UE8M0 in K quant.
- Q saturation in the WMMA fused kernel.
- Extra heuristic scales that are no longer appropriate.
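As a reference for the quantization pattern being matched here (not the code in this PR; the names below are illustrative), per-token FP8 E4M3 quantization with a UE8M0 scale boils down to: take the row amax, round the required scale up to a power of two, quantize, and store only the exponent byte, with no extra host-side multipliers:

```cuda
// Illustrative sketch of per-token FP8 E4M3 quantization with a UE8M0
// (power-of-two, exponent-only) scale. Not the kernel code in this PR.
#include <cuda_fp8.h>
#include <math.h>
#include <stdint.h>

__host__ __device__ void quantize_row_fp8_ue8m0(const float * src, __nv_fp8_e4m3 * dst,
                                                uint8_t * scale_out, int d) {
    const float FP8_E4M3_MAX = 448.0f;

    // per-token absolute maximum
    float amax = 0.0f;
    for (int i = 0; i < d; ++i) {
        amax = fmaxf(amax, fabsf(src[i]));
    }

    // UE8M0 scale is 2^e: round up so the largest value still fits in E4M3
    // (clamping of e is omitted for brevity)
    const int   e         = (amax > 0.0f) ? (int) ceilf(log2f(amax / FP8_E4M3_MAX)) : 0;
    const float inv_scale = exp2f((float) -e);

    // quantize the whole row with the single per-token scale
    for (int i = 0; i < d; ++i) {
        dst[i] = __nv_fp8_e4m3(src[i] * inv_scale);
    }

    // dequantization later multiplies by 2^e; store e as a biased (bias 127) byte
    *scale_out = (uint8_t) (e + 127);
}
```

The per-(token, head-group) Q scale in the WMMA kernel is the same idea applied per head group inside the kernel, which is what keeps Q from saturating without reintroducing host-side multipliers.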
…seem to help the degenerate generation situation though. In fact, it might make it a little worse. 5.2 is very obstinate and refuses to continue work without making these changes though, so consider going back and reverting them after it is done with things that matter more.
…contain our FP8 tensor core mma attempts.
…tensor core mma, if the Gods cooperate.
…el for sm_120a. This is a home-grown kernel that uses cute::tl_mma::GemmTensorOp to do real FP8 MMA, just like the vendored tilelang kernel. However, this one works with production shapes for inference (D=128, H=64). The tilelang kernel is more like a single GEMM followed by an epilogue, whereas ours is 8 GEMMs (N=8) plus a fused reduction. Also unlike the real tilelang kernel, this one does not use TMA.
- Wire the FP8 sidecar into the actual DS3.2 sparse attention path in src/llama-model.cpp
Just messing around, trying to track down this correctness issue without buying an HGX B200.
- Experimental FP8 KV Cache. Not wired to anything.
- Vendor FlashMLA mla decode sm100 kernel, but disable due to compilation issues.
- Add various code glue unit tests in an (unsuccessful) attempt to track down the correctness issue.
- Add LLAMA_INDEXER_FP8_TC=1 FP8 tensor core MMA Lightning Indexer Kernel for sm_120a. This is a home-grown kernel that uses cute::tl_mma::GemmTensorOp to do real FP8 MMA, just like the vendored tilelang kernel. However, this one works with production shapes for inference (D=128, H=64). The tilelang kernel is more like a single GEMM followed by an epilogue, whereas ours is 8 GEMMs (N=8) plus a fused reduction. Also unlike the real tilelang kernel, this one does not use TMA. (A naive reference of the score math is sketched after this list.)
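To make the shape of that computation concrete, here is a naive scalar reference (no FP8, no tensor cores) of the indexer score as I understand it from the published DeepSeek V3.2 lightning indexer: a weighted, ReLU'd per-head dot product reduced across heads into one score per key. Names and layout are illustrative; the actual kernel tiles the head loop into 8 FP8 GEMMs and fuses the reduction into the epilogue.

```cuda
// Naive reference of the lightning-indexer score for one query token.
// Illustrative only: the real kernel does this with 8 FP8 MMAs + a fused reduction.
#include <math.h>

#define IDX_D 128  // indexer head dimension (D)
#define IDX_H 64   // number of indexer heads (H)

__global__ void indexer_scores_ref(const float * __restrict__ q,   // [IDX_H * IDX_D] one query token
                                   const float * __restrict__ k,   // [n_keys * IDX_D] keys
                                   const float * __restrict__ w,   // [IDX_H] per-head weights
                                   float * __restrict__ scores,    // [n_keys] output
                                   int n_keys) {
    const int s = blockIdx.x * blockDim.x + threadIdx.x;  // key index
    if (s >= n_keys) {
        return;
    }

    float score = 0.0f;
    for (int h = 0; h < IDX_H; ++h) {            // the FP8 kernel tiles this loop into GEMMs
        float dot = 0.0f;
        for (int d = 0; d < IDX_D; ++d) {
            dot += q[h * IDX_D + d] * k[s * IDX_D + d];
        }
        score += w[h] * fmaxf(dot, 0.0f);        // ReLU, then weighted reduction over heads
    }
    scores[s] = score;                           // fused into the GEMM epilogue in the real kernel
}
```

A reference like this is mainly useful for checking the FP8 path against an exact result on small shapes; the LLAMA_INDEXER_FP8_TC=1 gate mentioned above is what selects the tensor-core kernel.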
Profiling:
Fixed the FP8 K Indexer Cache first-prompt degenerate generation bug and wired it into llama-model.cpp: 6.42 tok/s enabled vs. 5.46 tok/s disabled for the same prompt.
Hmm. Switching to the built-in llama.cpp web UI, I'm seeing signs of degenerate generation as early as the first prompt, regardless of the indexer kernel used and regardless of cache or no cache.
Another interesting thing I just noticed: the vLLM top-k kernel works if I prompt it from Open WebUI, but if I prompt it from the built-in web UI, llama.cpp crashes.