@createthis commented Dec 15, 2025

This is a branch forked from the deepseek_v3_2_exp branch at commit 86f7d8 (October 28th, 2025), the last commit before I started exploring CUDA kernels.

Decode starts out at about 6.16 tok/s on my hardware, and I see 17-20 tok/s prefill. Honestly, my best CUDA branch performs no better than that. I'm sure better performance is possible on my hardware, but this branch is a win because it is very simple and non-invasive with respect to GGML.

Falls into degenerate generation after about 2k tokens, just like my CUDA branches.

NOTE: Setting LLAMA_SPARSE_TOPK=256 causes the degenerate generation to happen at a context of 500-600 tokens instead of 2k. LLAMA_SPARSE_TOPK defaults to 2048 internally. I'm not sure if this is a useful test or if the model simply can't work with fewer than 2048 selected tokens.
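
For reference, the environment-variable override works roughly like this. Only the LLAMA_SPARSE_TOPK name and the 2048 default come from the text above; the helper name and the validation are illustrative assumptions, not the branch's actual code:

```cpp
#include <cstdlib>
#include <cstdint>

// Minimal sketch: read the sparse top-k override from the environment,
// falling back to the internal default of 2048 mentioned in this PR.
static int32_t sparse_topk_from_env() {
    int32_t topk = 2048; // internal default per the note above
    if (const char * s = std::getenv("LLAMA_SPARSE_TOPK")) {
        const long v = std::strtol(s, nullptr, 10);
        if (v > 0) {
            topk = (int32_t) v; // e.g. LLAMA_SPARSE_TOPK=256
        }
    }
    return topk;
}
```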

- Moved the sparse-indexer logic into src/llama-sparse-indexer.cpp where it belongs.
- Added idx_compute_scores_tile() to encapsulate the score pipeline: K·Q → ReLU → per-head weight → reduction → K-scale (see the sketch after this list).
- Return score values in addition to indices, and include a summary with the threshold.
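
To make the encapsulated pipeline concrete, here is a minimal plain-C++ sketch of what idx_compute_scores_tile() computes over one tile. The function name comes from this branch; the signature, tensor layouts, and parameter names (head_w, k_scale, etc.) are assumptions for illustration, not necessarily the actual implementation:

```cpp
#include <cstddef>
#include <algorithm>

// Sketch of the score pipeline named above:
//   K·Q -> ReLU -> per-head weight -> reduction over heads -> K-scale
// Assumed layouts (illustrative):
//   q:       [n_q_tokens][n_heads][head_dim]  indexer query projections
//   k:       [n_k_tokens][head_dim]           indexer keys (shared across heads)
//   head_w:  [n_q_tokens][n_heads]            per-head weights
//   k_scale: [n_k_tokens]                     per-key scale
//   scores:  [n_q_tokens][n_k_tokens]         output
static void idx_compute_scores_tile(
        const float * q, const float * k,
        const float * head_w, const float * k_scale,
        int n_q_tokens, int n_k_tokens, int n_heads, int head_dim,
        float * scores) {
    for (int iq = 0; iq < n_q_tokens; ++iq) {
        for (int ik = 0; ik < n_k_tokens; ++ik) {
            float acc = 0.0f;
            const float * kv = k + (size_t) ik * head_dim;
            for (int h = 0; h < n_heads; ++h) {
                // K·Q for this head
                const float * qh = q + ((size_t) iq * n_heads + h) * head_dim;
                float dot = 0.0f;
                for (int d = 0; d < head_dim; ++d) {
                    dot += qh[d] * kv[d];
                }
                // ReLU, per-head weight, then reduce (sum) over heads
                acc += head_w[(size_t) iq * n_heads + h] * std::max(dot, 0.0f);
            }
            // apply the per-key K-scale last
            scores[(size_t) iq * n_k_tokens + ik] = acc * k_scale[ik];
        }
    }
}
```

The top-k selection (LLAMA_SPARSE_TOPK) would then presumably pick, per query row, the highest-scoring key positions from this tile's output, returning the score values alongside the indices as described above.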