
Conversation

createthis (Owner) commented Oct 1, 2025

Don't merge. WIP.

When switching to this branch from deepseek_v3_2_exp_simple, you need to run:

git submodule update --init --recursive

Then recompile the project (I am assuming you have a single Blackwell 6000 Pro GPU and 768 GB of system RAM):

rm -Rf build && cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DCMAKE_CUDA_ARCHITECTURES="120"
cmake --build build --config Release

You can add LLAMA_SPARSE_PROF=1 to get performance profiling of the kernels.
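For reference, the [PROFILE] lines further down come from timing gates around the kernel launches. Here is a minimal sketch of that pattern using standard cudaEvent timing; the helper names and the averaging behaviour are illustrative, not the branch's actual code:

    #include <cstdlib>
    #include <cuda_runtime.h>

    // Only profile when LLAMA_SPARSE_PROF is set to a non-empty, non-zero value.
    static bool sparse_prof_enabled() {
        const char * v = getenv("LLAMA_SPARSE_PROF");
        return v && *v && *v != '0';
    }

    // Time a single kernel launch with CUDA events (hypothetical helper).
    template <typename Launch>
    static float time_launch_ms(Launch launch, cudaStream_t stream) {
        cudaEvent_t beg, end;
        cudaEventCreate(&beg);
        cudaEventCreate(&end);
        cudaEventRecord(beg, stream);
        launch(); // enqueue the kernel on `stream`
        cudaEventRecord(end, stream);
        cudaEventSynchronize(end);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, beg, end);
        cudaEventDestroy(beg);
        cudaEventDestroy(end);
        return ms;
    }

Callers would accumulate these samples and emit an avg_ms line once enough calls have been observed, which is what produces the "avg_ms=... over 50 calls" output below.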

There are a lot of different kernels in this branch gated by env vars. This is probably the most interesting config, as it uses the vendored vLLM top-k kernel:

LLAMA_FP8_INDEXER_CACHE=1 LLAMA_SPARSE_TOPK_VLLM=1 ./build/bin/llama-server \
    --model  /data2/DeepSeek-V3.2-Exp-GGUF/q4_k_m/DeepSeek-V3.2-Exp-Q4_K_M-00001-of-00009.gguf \
    --alias DeepSeek-V3.2-Exp:671b-q4_k_m \
    --no-webui \
    --numa numactl \
    --threads 32 \
    --ctx-size 163840 \
    --n-gpu-layers 62 \
    -ot "blk\.(3|4|5|6|7|8|9)\.ffn_.*=CUDA0" \
    -ot exps=CPU \
    -ub 4096 -b 4096 \
    --seed 3407 \
    --temp 0.6 \
    --top-p 0.95 \
    --min-p 0.1 \
    --log-colors on \
    --flash-attn on \
    --host 0.0.0.0 \
    --prio 2 \
    --jinja \
    --port 11434

3.96 tok/s

[PROFILE_FP8_GATHER] TILELANG_INDEXER D=128 H=64 Tc=1 kv=163840 avg_ms=0.019 over 50 calls
[PROFILE_WMMA_HGRP_ONLY] TILELANG_INDEXER D=128 H=64 Tc=1 kv=163840 avg_ms=0.173 over 50 calls
[PROFILE] IDX_TILE CUDA D=128 H=64 Tc=1 kv=163840 avg_ms=0.205 over 50 calls
[PROFILE] SPARSE_TOPK_RADIX2_VLLM N=163840 T=1 k=2048 avg_ms=0.192 over 50 calls
[PROFILE] SPARSE_MLA_DECODE D=576 Hq=128 Hkv=1 Dv=512 Nkv=163840 K=2048 avg_ms=1.805 over 50 calls
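To read those lines: each decode step scores every cached KV position with the lightweight indexer, keeps the top 2048 positions by score, and runs MLA attention only over those. A conceptual host-side sketch of that pipeline follows; the function names and signatures are entirely hypothetical (the real entry points live in the CUDA backend and src/llama-sparse-topk.cpp), but the shapes mirror the profile output above:

    // Hypothetical entry points; names and signatures are illustrative only.
    void indexer_logits   (float * scores, const void * q_idx, const void * k_fp8, int n_kv);
    void topk_select      (int * indices, const float * scores, int n_kv, int k);
    void sparse_mla_decode(float * out, const void * q, const void * kv_cache,
                           const int * indices, int k);

    // One decode step: 163840 cached positions, k = 2048, 128 query heads,
    // 576-dim q/k, 512-dim values, matching the profile lines above.
    void sparse_decode_step(float * out, const void * q, const void * q_idx,
                            const void * k_fp8, const void * kv_cache,
                            float * scores, int * indices) {
        const int n_kv = 163840, k = 2048;
        indexer_logits(scores, q_idx, k_fp8, n_kv);       // PROFILE_FP8_GATHER / IDX_TILE
        topk_select(indices, scores, n_kv, k);            // SPARSE_TOPK_RADIX2_VLLM
        sparse_mla_decode(out, q, kv_cache, indices, k);  // SPARSE_MLA_DECODE
    }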

createthis self-assigned this Oct 1, 2025
github-actions bot added the python label Oct 1, 2025
A comment from @nicoboss was marked as off-topic.

Commit notes:

- Removed the forced CPU backend assignment of kvaware_indices:
  - src/llama-sparse-topk.cpp: deleted the block that moved the result to backend_cpu. Now it stays where it's produced.
  - src/llama-model.cpp: removed both instances of ggml_backend_sched_set_tensor_backend(sched, kvaware_indices, backend_cpu) so we don't bounce indices to host in the MLA and MHA sparse paths.
- Gate the debug-only float32 cast of indices:
  - src/llama-sparse-topk.cpp: only cast to F32 and log the f32 indices when LLAMA_SPARSE_DEBUG is set. This cuts extra nodes/copies in normal runs.
- Increase the default Top-K token tile size (env handling sketched after this list):
  - src/llama-sparse-topk.cpp: default TILE_T raised from 32 to 128, still overridable via LLAMA_SPARSE_TOPK_TILE_T.
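A minimal sketch of the env handling described above. The variable names match the commit notes (LLAMA_SPARSE_DEBUG, LLAMA_SPARSE_TOPK_TILE_T), but the helper names and surrounding code are illustrative, not the actual src/llama-sparse-topk.cpp implementation:

    #include <cstdlib>
    #include <cstring>

    // Debug logging (and the extra F32 cast of the index tensor) only happens
    // when LLAMA_SPARSE_DEBUG is set to a non-empty, non-zero value.
    static bool sparse_debug_enabled() {
        const char * v = getenv("LLAMA_SPARSE_DEBUG");
        return v && *v && strcmp(v, "0") != 0;
    }

    // Token tile size for the Top-K pass: default 128, overridable via
    // LLAMA_SPARSE_TOPK_TILE_T (falls back to the default on bad input).
    static int sparse_topk_tile_t() {
        const char * v = getenv("LLAMA_SPARSE_TOPK_TILE_T");
        if (v && *v) {
            const int t = atoi(v);
            if (t > 0) {
                return t;
            }
        }
        return 128;
    }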
Excerpts from later commit messages (truncated; FP8 quantization sketched after this list):

- "…branches, so we avoid the extra backend hop to CPU after apply_sparse_attention_kvaware"
- "…public or protected in order to have an external method call."
- "…gymnastics so we can feed fp8 indexer data to the WMMA HGRP kernel."
- "Inner loop now reads FP8 K codes instead of F32 K"
- "The launch currently passes null FP8 pointers"
- "…ernel by quantizing from F32 K inside ggml_cuda_indexer_logits_fused_device when WMMA is used and DeepGEMM is not"
- "…end-to-end at the GGML/llama-level. The fused indexer op is still not consuming the sidecar in CUDA (that's the next step), but all the plumbing is there"
- "…gather kernel took 0.17 ms. This one takes 0.019 ms. A clear win."
- "…in merge commit 184076. This brings that code in. However, there is a problem: Radix Sort is turned off because GPT 5.1 thinks we will never have a tile row count high enough to use it. I believe this points to an architectural issue on our end, because I know Radix Sort is a critical performance feature of this kernel. I'm investigating."
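Since several of those notes are about feeding FP8 K codes into the indexer, here is a minimal sketch of converting an F32 K row to FP8 e4m3 codes with CUDA's cuda_fp8.h types (CUDA 11.8+). It only illustrates the data conversion; the branch quantizes inside ggml_cuda_indexer_logits_fused_device, and the kernel name and layout below are assumptions:

    #include <cuda_fp8.h>

    // Hypothetical helper: quantize one row of F32 K to FP8 e4m3 codes.
    // One thread per element; `n` is the row length (e.g. the indexer head dim).
    __global__ void quantize_k_row_to_fp8(const float * __restrict__ k_f32,
                                          __nv_fp8_e4m3 * __restrict__ k_fp8,
                                          int n) {
        const int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            // The __nv_fp8_e4m3 float constructor handles rounding/saturation.
            k_fp8[i] = __nv_fp8_e4m3(k_f32[i]);
        }
    }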
