@createthis commented Dec 15, 2025

This is a branch forked from the deepseek_v3_2_exp branch at commit 86f7d8 (October 28th, 2025), the last commit before I started exploring CUDA kernels.

Decode starts out at about 6.16 tok/s on my hardware, and I see 17-20 tok/s prefill. Honestly, my best CUDA branch performs no better than that. I'm sure better performance is possible on my hardware, but this branch is a win because it is very simple and non-invasive with respect to GGML.

Falls into degenerate generation after about 2k tokens, just like my CUDA branches.

NOTE: Setting LLAMA_SPARSE_TOPK=256 causes the degenerate generation to happen at a context of 500-600 tokens instead of 2k. LLAMA_SPARSE_TOPK defaults to 2048 internally. I'm not sure if this is a useful test or if the model simply can't work with fewer than 2048 selected tokens.
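
For reference, the environment-variable override works roughly like this. Only the LLAMA_SPARSE_TOPK name and the 2048 default come from the text above; the helper name and the validation are illustrative assumptions, not the branch's actual code:

```cpp
#include <cstdlib>
#include <cstdint>

// Minimal sketch: read the sparse top-k override from the environment,
// falling back to the internal default of 2048 mentioned in this PR.
static int32_t sparse_topk_from_env() {
    int32_t topk = 2048; // internal default per the note above
    if (const char * s = std::getenv("LLAMA_SPARSE_TOPK")) {
        const long v = std::strtol(s, nullptr, 10);
        if (v > 0) {
            topk = (int32_t) v; // e.g. LLAMA_SPARSE_TOPK=256
        }
    }
    return topk;
}
```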

- Moved the sparse-indexer logic into src/llama-sparse-indexer.cpp where it belongs.
- Added idx_compute_scores_tile() to encapsulate the score pipeline: K·Q → ReLU → per-head weight → reduction → K-scale (see the sketch after this list).
- Return score values in addition to indices, and include a summary with the threshold.
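
To make the encapsulated pipeline concrete, here is a minimal plain-C++ sketch of what idx_compute_scores_tile() computes over one tile. The function name comes from this branch; the signature, tensor layouts, and parameter names (head_w, k_scale, etc.) are assumptions for illustration, not necessarily the actual implementation:

```cpp
#include <cstddef>
#include <algorithm>

// Sketch of the score pipeline named above:
//   K·Q -> ReLU -> per-head weight -> reduction over heads -> K-scale
// Assumed layouts (illustrative):
//   q:       [n_q_tokens][n_heads][head_dim]  indexer query projections
//   k:       [n_k_tokens][head_dim]           indexer keys (shared across heads)
//   head_w:  [n_q_tokens][n_heads]            per-head weights
//   k_scale: [n_k_tokens]                     per-key scale
//   scores:  [n_q_tokens][n_k_tokens]         output
static void idx_compute_scores_tile(
        const float * q, const float * k,
        const float * head_w, const float * k_scale,
        int n_q_tokens, int n_k_tokens, int n_heads, int head_dim,
        float * scores) {
    for (int iq = 0; iq < n_q_tokens; ++iq) {
        for (int ik = 0; ik < n_k_tokens; ++ik) {
            float acc = 0.0f;
            const float * kv = k + (size_t) ik * head_dim;
            for (int h = 0; h < n_heads; ++h) {
                // K·Q for this head
                const float * qh = q + ((size_t) iq * n_heads + h) * head_dim;
                float dot = 0.0f;
                for (int d = 0; d < head_dim; ++d) {
                    dot += qh[d] * kv[d];
                }
                // ReLU, per-head weight, then reduce (sum) over heads
                acc += head_w[(size_t) iq * n_heads + h] * std::max(dot, 0.0f);
            }
            // apply the per-key K-scale last
            scores[(size_t) iq * n_k_tokens + ik] = acc * k_scale[ik];
        }
    }
}
```

The top-k selection (LLAMA_SPARSE_TOPK) would then presumably pick, per query row, the highest-scoring key positions from this tile's output, returning the score values alongside the indices as described above.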