
Conversation

createthis (Owner) commented Oct 1, 2025

Don't merge. WIP.

When switching to this branch from deepseek_v3_2_exp_simple, you need to run:

git submodule update --init --recursive

Then recompile the project (I am assuming you have a single Blackwell 6000 Pro GPU and 768 GB of system RAM):

rm -Rf build && cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DCMAKE_CUDA_ARCHITECTURES="120"
cmake --build build --config Release

You can add LLAMA_SPARSE_PROF=1 to get performance profiling of the kernels.
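For reference, the [PROFILE] lines further down come from timing gates around the kernel launches. Here is a minimal sketch of that pattern using standard cudaEvent timing; the helper names and the averaging behaviour are illustrative, not the branch's actual code:

    #include <cstdlib>
    #include <cuda_runtime.h>

    // Only profile when LLAMA_SPARSE_PROF is set to a non-empty, non-zero value.
    static bool sparse_prof_enabled() {
        const char * v = getenv("LLAMA_SPARSE_PROF");
        return v && *v && *v != '0';
    }

    // Time a single kernel launch with CUDA events (hypothetical helper).
    template <typename Launch>
    static float time_launch_ms(Launch launch, cudaStream_t stream) {
        cudaEvent_t beg, end;
        cudaEventCreate(&beg);
        cudaEventCreate(&end);
        cudaEventRecord(beg, stream);
        launch(); // enqueue the kernel on `stream`
        cudaEventRecord(end, stream);
        cudaEventSynchronize(end);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, beg, end);
        cudaEventDestroy(beg);
        cudaEventDestroy(end);
        return ms;
    }

Callers would accumulate these samples and emit an avg_ms line once enough calls have been observed, which is what produces the "avg_ms=... over 50 calls" output below.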

There are a lot of different kernels in this branch gated by env vars. This is probably the most interesting config, as it uses the vendored vLLM top-k kernel:

LLAMA_FP8_INDEXER_CACHE=1 LLAMA_SPARSE_TOPK_VLLM=1 ./build/bin/llama-server \
    --model  /data2/DeepSeek-V3.2-Exp-GGUF/q4_k_m/DeepSeek-V3.2-Exp-Q4_K_M-00001-of-00009.gguf \
    --alias DeepSeek-V3.2-Exp:671b-q4_k_m \
    --no-webui \
    --numa numactl \
    --threads 32 \
    --ctx-size 163840 \
    --n-gpu-layers 62 \
    -ot "blk\.(3|4|5|6|7|8|9)\.ffn_.*=CUDA0" \
    -ot exps=CPU \
    -ub 4096 -b 4096 \
    --seed 3407 \
    --temp 0.6 \
    --top-p 0.95 \
    --min-p 0.1 \
    --log-colors on \
    --flash-attn on \
    --host 0.0.0.0 \
    --prio 2 \
    --jinja \
    --port 11434

3.96 tok/s

[PROFILE_FP8_GATHER] TILELANG_INDEXER D=128 H=64 Tc=1 kv=163840 avg_ms=0.019 over 50 calls
[PROFILE_WMMA_HGRP_ONLY] TILELANG_INDEXER D=128 H=64 Tc=1 kv=163840 avg_ms=0.173 over 50 calls
[PROFILE] IDX_TILE CUDA D=128 H=64 Tc=1 kv=163840 avg_ms=0.205 over 50 calls
[PROFILE] SPARSE_TOPK_RADIX2_VLLM N=163840 T=1 k=2048 avg_ms=0.192 over 50 calls
[PROFILE] SPARSE_MLA_DECODE D=576 Hq=128 Hkv=1 Dv=512 Nkv=163840 K=2048 avg_ms=1.805 over 50 calls
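To read those lines: each decode step scores every cached KV position with the lightweight indexer, keeps the top 2048 positions by score, and runs MLA attention only over those. A conceptual host-side sketch of that pipeline follows; the function names and signatures are entirely hypothetical (the real entry points live in the CUDA backend and src/llama-sparse-topk.cpp), but the shapes mirror the profile output above:

    // Hypothetical entry points; names and signatures are illustrative only.
    void indexer_logits   (float * scores, const void * q_idx, const void * k_fp8, int n_kv);
    void topk_select      (int * indices, const float * scores, int n_kv, int k);
    void sparse_mla_decode(float * out, const void * q, const void * kv_cache,
                           const int * indices, int k);

    // One decode step: 163840 cached positions, k = 2048, 128 query heads,
    // 576-dim q/k, 512-dim values, matching the profile lines above.
    void sparse_decode_step(float * out, const void * q, const void * q_idx,
                            const void * k_fp8, const void * kv_cache,
                            float * scores, int * indices) {
        const int n_kv = 163840, k = 2048;
        indexer_logits(scores, q_idx, k_fp8, n_kv);       // PROFILE_FP8_GATHER / IDX_TILE
        topk_select(indices, scores, n_kv, k);            // SPARSE_TOPK_RADIX2_VLLM
        sparse_mla_decode(out, q, kv_cache, indices, k);  // SPARSE_MLA_DECODE
    }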

createthis self-assigned this Oct 1, 2025
github-actions bot added the python label Oct 1, 2025
A comment from @nicoboss was marked as off-topic.

Commit notes:

- Removed the forced CPU backend assignment of kvaware_indices:
  - src/llama-sparse-topk.cpp: deleted the block that moved the result to backend_cpu. Now it stays where it's produced.
  - src/llama-model.cpp: removed both instances of ggml_backend_sched_set_tensor_backend(sched, kvaware_indices, backend_cpu) so we don't bounce indices to host in the MLA and MHA sparse paths.
- Gate the debug-only float32 cast of indices:
  - src/llama-sparse-topk.cpp: only cast to F32 and log the f32 indices when LLAMA_SPARSE_DEBUG is set. This cuts extra nodes/copies in normal runs.
- Increase the default Top-K token tile size (env handling sketched after this list):
  - src/llama-sparse-topk.cpp: default TILE_T raised from 32 to 128, still overridable via LLAMA_SPARSE_TOPK_TILE_T.
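A minimal sketch of the env handling described above. The variable names match the commit notes (LLAMA_SPARSE_DEBUG, LLAMA_SPARSE_TOPK_TILE_T), but the helper names and surrounding code are illustrative, not the actual src/llama-sparse-topk.cpp implementation:

    #include <cstdlib>
    #include <cstring>

    // Debug logging (and the extra F32 cast of the index tensor) only happens
    // when LLAMA_SPARSE_DEBUG is set to a non-empty, non-zero value.
    static bool sparse_debug_enabled() {
        const char * v = getenv("LLAMA_SPARSE_DEBUG");
        return v && *v && strcmp(v, "0") != 0;
    }

    // Token tile size for the Top-K pass: default 128, overridable via
    // LLAMA_SPARSE_TOPK_TILE_T (falls back to the default on bad input).
    static int sparse_topk_tile_t() {
        const char * v = getenv("LLAMA_SPARSE_TOPK_TILE_T");
        if (v && *v) {
            const int t = atoi(v);
            if (t > 0) {
                return t;
            }
        }
        return 128;
    }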
Excerpts from later commit messages (truncated; FP8 quantization sketched after this list):

- "…branches, so we avoid the extra backend hop to CPU after apply_sparse_attention_kvaware"
- "…public or protected in order to have an external method call."
- "…gymnastics so we can feed fp8 indexer data to the WMMA HGRP kernel."
- "Inner loop now reads FP8 K codes instead of F32 K"
- "The launch currently passes null FP8 pointers"
- "…ernel by quantizing from F32 K inside ggml_cuda_indexer_logits_fused_device when WMMA is used and DeepGEMM is not"
- "…end-to-end at the GGML/llama-level. The fused indexer op is still not consuming the sidecar in CUDA (that's the next step), but all the plumbing is there"
- "…gather kernel took 0.17 ms. This one takes 0.019 ms. A clear win."
- "…in merge commit 184076. This brings that code in. However, there is a problem: Radix Sort is turned off because GPT 5.1 thinks we will never have a tile row count high enough to use it. I believe this points to an architectural issue on our end, because I know Radix Sort is a critical performance feature of this kernel. I'm investigating."
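Since several of those notes are about feeding FP8 K codes into the indexer, here is a minimal sketch of converting an F32 K row to FP8 e4m3 codes with CUDA's cuda_fp8.h types (CUDA 11.8+). It only illustrates the data conversion; the branch quantizes inside ggml_cuda_indexer_logits_fused_device, and the kernel name and layout below are assumptions:

    #include <cuda_fp8.h>

    // Hypothetical helper: quantize one row of F32 K to FP8 e4m3 codes.
    // One thread per element; `n` is the row length (e.g. the indexer head dim).
    __global__ void quantize_k_row_to_fp8(const float * __restrict__ k_f32,
                                          __nv_fp8_e4m3 * __restrict__ k_fp8,
                                          int n) {
        const int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            // The __nv_fp8_e4m3 float constructor handles rounding/saturation.
            k_fp8[i] = __nv_fp8_e4m3(k_f32[i]);
        }
    }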
