Draft
Changes from all commits
241 commits
a41142f
DeepSeek-V3.2-Exp - bump transformers requirement, add hash
createthis Oct 1, 2025
2d84dae
Add ATTN_INDEXER_K_NORM
createthis Oct 2, 2025
b0e8bfe
Use star-unpack syntax to keep it DRY
createthis Oct 2, 2025
3eb7b9b
Add ATTN_INDEXER_WEIGHTS_PROJ
createthis Oct 2, 2025
46461d5
Add ATTN_INDEXER_WK
createthis Oct 2, 2025
d13ca93
Add ATTN_INDEXER_WQ_B
createthis Oct 2, 2025
fb92591
update comments for consistency
createthis Oct 2, 2025
6b3af47
Implement DeepSeek V3.2 by copying V3 implementation. This doesn't work,
createthis Oct 3, 2025
2e2706a
Update convert_hf_to_gguf_update.py with DeepSeek V3.2-Exp and run it.
createthis Oct 5, 2025
906631c
Cargo cult llm_build_deepseek2 function for v3.2
createthis Oct 5, 2025
5433895
Rename python class for V3.2
createthis Oct 5, 2025
d9a5360
Add deepseek-v3.2 to vocab
createthis Oct 5, 2025
2191a14
Add sparse attention tensors for DeepSeek V3.2-Exp.
createthis Oct 6, 2025
8a096bf
Add tensor mapping to python side for DeepSeek V3.2-Exp.
createthis Oct 6, 2025
ffc2b43
Load attn_indexer_k_norm.bias
createthis Oct 6, 2025
bb1f4e4
Add sparse attention indexer, authored by DeepSeek V3.1-Terminus.
createthis Oct 6, 2025
35e6eee
Remove log message.
createthis Oct 7, 2025
7c5fc96
Add audit notes for ds 3.2-exp sparse attention code.
createthis Oct 8, 2025
ee31322
Add link to vllm equivalent of Indexer key normalization (k_norm).
createthis Oct 8, 2025
019ddcf
conceptual sparse attention for ds 3.2-exp
createthis Oct 8, 2025
cdd6a59
Add some rough notes about where the equivalent vllm code can be found.
createthis Oct 8, 2025
8689d0b
Fix assertion issue
createthis Oct 11, 2025
64be358
Fix assertion failure.
createthis Oct 12, 2025
912dbe4
Naive implementation of Deepseek Sparse Attention (DSA) by 3.1-Terminus,
createthis Oct 12, 2025
9877f25
Fix Qcur compile errors.
createthis Oct 12, 2025
65e973e
Move Deepseek v3.2 sparse attention code into its own file.
createthis Oct 12, 2025
c489a9c
Attempt to fix compilation issues.
createthis Oct 12, 2025
2bceafb
Attempt to fix compilation errors
createthis Oct 12, 2025
1f8bbc7
Attempt to fix core dump
createthis Oct 12, 2025
3288664
Add sparse attention unit test
createthis Oct 13, 2025
61a6692
Fix include paths and use C++17
createthis Oct 13, 2025
d772923
Raise test memory context from 16MB to 64MB
createthis Oct 13, 2025
f0ca639
Attempt to fix core dump
createthis Oct 13, 2025
0c692af
fix compile errors
createthis Oct 13, 2025
064eb54
Let GGML handle memory allocation.
createthis Oct 13, 2025
f66a616
New File Structure:
createthis Oct 13, 2025
dff12e4
Fix compilation error
createthis Oct 13, 2025
b15f3d8
Attempt to fix compilation errors.
createthis Oct 13, 2025
9b5519a
Add fflush after each line of output so I can pipe output to CLI tools
createthis Oct 15, 2025
9f6d140
Fix macOS warnings
createthis Oct 15, 2025
4801a5b
Create a real llama_model and populate it with data programmatically.
createthis Oct 15, 2025
8cb77bc
Fix test_compute_token_importance()
createthis Oct 15, 2025
e1c58a2
Tests passing
createthis Oct 15, 2025
2afd5e4
Update llama-sparse-mla-fwd.cpp. Tests still passing.
createthis Oct 16, 2025
2298d1e
Move assertions from llama-sparse-mla-fwd.cpp to test-sparse-attn.cpp
createthis Oct 16, 2025
91e889e
Had V3.1-Terminus create a unit test reproducing the failure we're
createthis Oct 16, 2025
45aee27
V3.1-Terminus's changes so that the tests pass.
createthis Oct 16, 2025
6ebf635
Fix some warnings.
createthis Oct 16, 2025
e0528cf
Fix more warnings.
createthis Oct 16, 2025
be322eb
Fix two errors showing up in the tests.
createthis Oct 17, 2025
9530a6d
Attempt to fix assertion
createthis Oct 17, 2025
d570a74
Move sparse attention before regular attention.
createthis Oct 17, 2025
901d78e
Fix bug in k_indexer reshaping logic.
createthis Oct 17, 2025
66f7928
Incorrect tensor shapes in test code: The test files were creating att…
createthis Oct 17, 2025
586fdb7
Add some logging
createthis Oct 17, 2025
79bb818
Add more logging.
createthis Oct 17, 2025
843548a
more logging
createthis Oct 17, 2025
3a00118
Add a test for 4096 tokens. This is taking 20gb of memory. I clearly
createthis Oct 17, 2025
e1f87a5
Output memory usage
createthis Oct 18, 2025
4db7453
Changed from computing full T×T matrix to computing only the last row
createthis Oct 18, 2025
4102b16
Replaced the index-tensor write with a zero-copy view of the last token.
createthis Oct 18, 2025
8ba17cb
This commit successfully reproduces the error the last commit solved.
createthis Oct 18, 2025
5f545d2
Removed this by accident earlier. Bring it back for now. I'm not sure
createthis Oct 18, 2025
569506b
Update test-sparse-attn-noalloc to reproduce the current assertion
createthis Oct 18, 2025
bd7c4bc
Update the code so that the test passes.
createthis Oct 18, 2025
9a9dd10
Reproduce current inference assertion via unit test. This is a sparse
createthis Oct 18, 2025
69d48a0
- src/llama-model.cpp (DeepSeek 3.2 runtime integration)
createthis Oct 18, 2025
8c81708
Update unit test to repro current prompt failure.
createthis Oct 18, 2025
51a716c
- In the DeepSeek V3.2 model path (llm_build_deepseek3_2 in
createthis Oct 18, 2025
baba670
Another attempt to fix the prompt issue. Low confidence on this one
createthis Oct 18, 2025
88c5429
Revert "Reproduce current inference assertion via unit test. This is …
createthis Oct 18, 2025
3d896b8
- Rewrote indexer to be per-query and batched:
createthis Oct 19, 2025
347eb48
Attempt to fix "not enough space in the context's memory pool" error.
createthis Oct 19, 2025
4259fe6
Bump memory allocation and add MLA memory usage logging.
createthis Oct 19, 2025
d53ed9a
Fix warnings
createthis Oct 19, 2025
b5044ba
Fix assertion after prompt
createthis Oct 19, 2025
14a3f78
Keep the sparse indices and KV-cache copy ops on CPU so we avoid
createthis Oct 19, 2025
1217251
Pin everything to CPU
createthis Oct 19, 2025
8c23d3b
The no_alloc test was failing because ggml_reshape_4d requires the
createthis Oct 19, 2025
6f7d144
implemented a KV-aware top‑k path and wired it into the DeepSeek V3.2
createthis Oct 19, 2025
f4fdc50
Fix warnings
createthis Oct 19, 2025
752bb5b
Attempt to fix GGML_ASSERT(ggml_can_mul_mat(a, b)) failed
createthis Oct 19, 2025
76948f4
Fix GGML_ASSERT(ggml_can_mul_mat(a, b)) failed
createthis Oct 19, 2025
45152db
Fix OOM issue.
createthis Oct 19, 2025
b85524c
Attempt to resolve OOM issue.
createthis Oct 19, 2025
f58fb3f
Attempt to fix OOM
createthis Oct 19, 2025
b90e72e
Attempt to fix ggml-cuda/binbcast.cu:249: GGML_ASSERT(nb10 %
createthis Oct 19, 2025
adf6aca
2nd Attempt to fix ggml-cuda/binbcast.cu:249: GGML_ASSERT(nb10 %
createthis Oct 19, 2025
c48d187
Add logging to help track issue
createthis Oct 19, 2025
5e48e1e
Fix compile issue
createthis Oct 19, 2025
09b6baa
Arguing with sentient rocks.
createthis Oct 19, 2025
23651b0
Add fflush
createthis Oct 19, 2025
1d8ba61
Attempt to fix ggml/src/ggml-cuda/binbcast.cu:249: GGML_ASSERT(nb10 %
createthis Oct 19, 2025
d21243c
Add more logging
createthis Oct 19, 2025
99b11c2
More logging
createthis Oct 19, 2025
dcb9f75
Attempt to fix ggml/src/ggml-cuda/binbcast.cu:276: GGML_ASSERT(nb10 %
createthis Oct 19, 2025
0114af3
Dial back the logging now that we're no longer hitting assertions.
createthis Oct 19, 2025
dde8f76
Remove if (T >= 1024) { block. It was added to fix an OOM issue,
createthis Oct 20, 2025
efc5802
1) Added a per-layer Indexer K cache (Kindexer) to KV cache
createthis Oct 20, 2025
49f7e4f
fixed the immediate “not enough space in the context’s memory pool”
createthis Oct 20, 2025
9ddffbe
rewrote the indexer KV-aware top‑k to a memory-safe, tiled per-head
createthis Oct 20, 2025
60250c8
increased the graph metadata headroom for DeepSeek V3.2 in
createthis Oct 20, 2025
d1f3c24
This gets it inferring again.
createthis Oct 20, 2025
fb82813
Add some assertions in an attempt to pinpoint problem areas.
createthis Oct 20, 2025
b4a932b
Add newlines to a couple logs so they don't look strange.
createthis Oct 20, 2025
7f7b595
Fix warnings
createthis Oct 20, 2025
4b5f1bc
Remove unused test
createthis Oct 20, 2025
7bdd095
Fix whitespace issues.
createthis Oct 20, 2025
21c83c3
Fix trailing whitespace.
createthis Oct 20, 2025
12f2307
Add logging and assertions to help pinpoint degenerate generation
createthis Oct 20, 2025
88278f4
Some layer logging
createthis Oct 20, 2025
c0affb6
- I plumbed dense kq_scale into the sparse attention KV-aware path:
createthis Oct 20, 2025
c5c3b26
1) Sparse KV-aware attention now accepts and uses the dense kq_scale
createthis Oct 20, 2025
72a5065
- Adjusted mask broadcasting to use 3D reshape/repeat in a ggml-friendly
createthis Oct 20, 2025
61eedf1
Replace mask block with one that:
createthis Oct 20, 2025
6a536e4
Correct GPT 5's typo (allegedly)
createthis Oct 20, 2025
10f747b
Bump context allocation again. This gets it inferring again.
createthis Oct 21, 2025
c655019
DeepSeek V3.2 indexer dims: derive from tensors to avoid config drift
createthis Oct 21, 2025
1cfd0e3
Replace compute_token_importance with KV-aware lightning indexer. I
createthis Oct 21, 2025
492c382
Indexer weight scaling. See
createthis Oct 21, 2025
c60922a
Add comment reference
createthis Oct 21, 2025
bb1f27e
Add RoPE to indexer Q and K (on rope_head_dim only)
createthis Oct 22, 2025
3139cce
Minimal rotate_activation approximation
createthis Oct 22, 2025
301d4cd
Missing head_dim scaling in indexer scoring
createthis Oct 22, 2025
18bb076
Logging
createthis Oct 22, 2025
638ed68
Fix warnings
createthis Oct 22, 2025
bfac611
Fix segfault
createthis Oct 22, 2025
927c5a3
Logging
createthis Oct 22, 2025
ffe9f0f
Add an fflush
createthis Oct 22, 2025
3e36397
Logging
createthis Oct 22, 2025
3d700d3
Rotate-activation approximation disabled
createthis Oct 22, 2025
244cf67
Revert "Rotate-activation approximation disabled"
createthis Oct 22, 2025
3bb2b57
Add cb logging
createthis Oct 22, 2025
f7b2146
Remove unused function. It's confusing the sentient rocks.
createthis Oct 22, 2025
4a015ad
logging cb
createthis Oct 22, 2025
302519d
Remove unused select_topk_tokens()
createthis Oct 22, 2025
998981f
Remove unused apply_sparse_attention()
createthis Oct 22, 2025
e31b340
Add logging
createthis Oct 22, 2025
802deef
Applied the learned q-norm in the indexer path
createthis Oct 22, 2025
6f3cec2
Implemented the q_indexer RMS coupling to approximate q_scale
createthis Oct 22, 2025
89dc615
Implemented proper LayerNorm for K-indexer in compute_indexer_triplet,
createthis Oct 23, 2025
6eb8198
Logging
createthis Oct 23, 2025
c7ed256
Wired the diagnostic scalars into the graph so they’ll appear in
createthis Oct 23, 2025
19c0dc6
Add cb logging for:
createthis Oct 23, 2025
744c967
New cb logs:
createthis Oct 23, 2025
d4ec252
Wire idxkv_topk_indices_k_T into the graph so that it shows up in cb log
createthis Oct 23, 2025
4d69cbe
Add idxkv_topk_indices_k_T_f32 to cb log
createthis Oct 23, 2025
e81d41e
Identified issue where top-k is still running on CUDA, so pass sched
createthis Oct 23, 2025
c6a9fa6
Pass backend_cpu through the topk code.
createthis Oct 23, 2025
da575dd
Log backend name for topk
createthis Oct 23, 2025
b0592fa
cb logging for idxkv_logits_sample_pre_relu
createthis Oct 23, 2025
6953849
more cb logging
createthis Oct 23, 2025
49508f6
Try to fix assertion
createthis Oct 23, 2025
6a538c7
cb logging for scores_for_topk
createthis Oct 23, 2025
3a9eef5
- In src/llama-sparse-topk.cpp, right before calling ggml_top_k, we now
createthis Oct 23, 2025
a2d2649
Trying to fix the segfault that started with commit da575d. Don't return
createthis Oct 23, 2025
a3a0ba1
Fix inference (still degenerate generation though)
createthis Oct 23, 2025
69fa590
Attempt to cb log idxkv_topk_indices_k_T
createthis Oct 23, 2025
e13b63f
Again
createthis Oct 23, 2025
4db5c59
Revert "Again"
createthis Oct 23, 2025
7363553
Revert "Attempt to cb log idxkv_topk_indices_k_T"
createthis Oct 23, 2025
7ab0cb8
Add cb logging for idxkv_topk_indices_sample_f32
createthis Oct 23, 2025
1b9ee7e
Revert "Add cb logging for idxkv_topk_indices_sample_f32"
createthis Oct 23, 2025
6072d82
cb log idxkv_topk and idxkv_argsort
createthis Oct 24, 2025
9aafb11
cb logging for kq_mask and mask2d
createthis Oct 24, 2025
223d5de
Add logits_h cb logging
createthis Oct 24, 2025
13fefcb
cb log for idxkv_mask_tc. Also disable the eval-callback sum nan exit
createthis Oct 24, 2025
22a5d06
cb logging logits_sample_host
createthis Oct 24, 2025
f849039
Fix sampler nan's
createthis Oct 24, 2025
8b7b0a4
cb logging for mla_scores_post_mask_sample
createthis Oct 24, 2025
557be06
cb logging kvaware_sparse_attn_out_sample
createthis Oct 24, 2025
8712c0d
Try to fix crash
createthis Oct 24, 2025
1c44686
Attempt to fix crash
createthis Oct 24, 2025
ddeafa8
Revert "Attempt to fix crash"
createthis Oct 24, 2025
da5730d
Revert "Try to fix crash"
createthis Oct 24, 2025
94c91e2
Revert "cb logging kvaware_sparse_attn_out_sample"
createthis Oct 24, 2025
3797a6c
add cb logging to fwd mla for mla_scores_post_mask
createthis Oct 24, 2025
9ee230f
Add cb logging for mla_weights_sample
createthis Oct 24, 2025
979aeb7
add indexer_q_sample and indexer_weights_sample cb logging
createthis Oct 24, 2025
782c946
Add cb logging for indexer_k_cache_head
createthis Oct 24, 2025
1d5a878
More cb logs for indexer
createthis Oct 24, 2025
c761a8f
arguing with sentient rocks
createthis Oct 24, 2025
cc8b7ef
- Ignored the micro-window n_kv argument for the indexer path and
createthis Oct 24, 2025
e53b2ed
Fix crash (hopefully)
createthis Oct 24, 2025
5ef5a53
Another fix
createthis Oct 24, 2025
9bbd46e
Add clamp and get_k_full/get_v_full (for future use, apparently)
createthis Oct 24, 2025
d38859a
Fix warning
createthis Oct 25, 2025
3d862a2
Hide sparse attention logging behind env var LLAMA_SPARSE_DEBUG for
createthis Oct 25, 2025
5194bfa
Don't modify get_k_indexer. Create get_k_indexer_full instead.
createthis Oct 25, 2025
73b35db
Fix crash (hopefully)
createthis Oct 25, 2025
472a41d
Switch to using get_k_indexer_full
createthis Oct 25, 2025
be3ef9e
Fix crash
createthis Oct 25, 2025
04bb17f
Fix crash
createthis Oct 25, 2025
0500521
Build and use full kq_mask for sparse attention.
createthis Oct 25, 2025
aec054d
Merge branch 'deepseek_v3_2_exp' of github.com:createthis/llama.cpp i…
createthis Oct 25, 2025
ba4780c
Trying to track down why we aren't always getting the full width
createthis Oct 25, 2025
16a9d43
Fix compile error
createthis Oct 25, 2025
568c5e3
Attempt to fix crash
createthis Oct 25, 2025
6097468
Remove rotate_activation.
createthis Oct 25, 2025
ef9b177
Keep validity masking (causal window) in the indexer/top-k path, but
createthis Oct 26, 2025
48ba041
Bump LLAMA_SPARSE_TOPK default to 2048 to be inline with vllm and
createthis Oct 26, 2025
d2ac39e
set_input_kq_mask_full_2d: restore ALiBi in the full-width mask
createthis Oct 26, 2025
a37bf05
Q-scale proxy is active: idx_weights = W_proj(cur) × 1/sqrt(H) ×
createthis Oct 26, 2025
76f576d
added a safe K-scale proxy multiply in the top-k selector, after
createthis Oct 26, 2025
84ae1f1
K-scale was accidentally placed in a dbg block. Move out.
createthis Oct 26, 2025
1ebd7e7
- I replaced the fragile mask slicing in the top-k selector.
createthis Oct 26, 2025
d7870c3
clamp scores_tc immediately after adding the mask in the top-k selector
createthis Oct 26, 2025
f4cd950
top_k is now limited by the currently available KV, not just cache
createthis Oct 27, 2025
a85fad5
- In src/llama-sparse-mla-fwd.cpp, I removed the conditional “VIEW when
createthis Oct 27, 2025
0555126
top_k clamp changes
createthis Oct 27, 2025
b284bd1
Fix top-k clamp. Sparse attention generation is working!
createthis Oct 27, 2025
6fb54c1
comment out or hide all debug prints behind LLAMA_SPARSE_DEBUG
createthis Oct 27, 2025
edc23f9
Streaming per-head accumulation to avoid [N_kv, H, Tc] temporaries
createthis Oct 27, 2025
9e9a84a
Revert last change as it was objectively worse.
createthis Oct 27, 2025
06bd370
- Keep Top-K indices on device:
createthis Oct 27, 2025
7866fd5
kept the sparse attention output tensor “cur” on device in the sparse
createthis Oct 27, 2025
b96f5fb
WIP radix top-k
createthis Oct 27, 2025
d3e4a6a
Ported radix top-k selection with thresholding and tail refinement
createthis Oct 28, 2025
100535b
Integrate radix top-k
createthis Oct 28, 2025
7780061
Guard printf's with dbg
createthis Oct 28, 2025
90f0e17
Add a repro test for the context > 50k issue. Also attempt to fix it.
createthis Oct 28, 2025
8dd82d8
Add another potential repro test
createthis Oct 28, 2025
2d3cb41
Add some logging
createthis Oct 28, 2025
3d547a6
fix compile errors
createthis Oct 28, 2025
ca658a8
more logging
createthis Oct 28, 2025
2d810d5
Add an include to fix compile issue
createthis Oct 28, 2025
1ab31d4
more logging
createthis Oct 28, 2025
4de2e22
Add logging to ggml to track this down.
createthis Oct 28, 2025
e1634c7
Helping sentient rocks put their changes where they intended.
createthis Oct 28, 2025
6c048aa
Add fflush
createthis Oct 28, 2025
42024cd
Try to get logging working
createthis Oct 28, 2025
60f71ac
More logging
createthis Oct 28, 2025
07f84a3
label cur
createthis Oct 28, 2025
5983712
Add another tensor name as I argue with sentient rocks
createthis Oct 28, 2025
f2cef1a
Attempt to fix the problem, remove unused tests
createthis Oct 29, 2025
f2896cf
Oops, fix makefile
createthis Oct 29, 2025
86f7d87
Remove unnecessary prints now that problem is solved.
createthis Oct 29, 2025
8c96ada
Add a few files to ignore
createthis Dec 16, 2025
465c0c2
Refactor per-tile scoring into a single helper in
createthis Dec 16, 2025
f4c653c
Add unit test for sparse mla decode.
createthis Dec 17, 2025
d255b85
Improve radix debug logging. Use a mutex to keep it coherent. Prints
createthis Dec 20, 2025
9 changes: 9 additions & 0 deletions .gitignore
@@ -42,12 +42,21 @@ lcov-report/

# Build Artifacts

llama.pc
DartConfiguration.tcl
CMakeCache.txt
tags
.build/
build*
release
debug
!build-info.cmake
tests/CTestTestfile.cmake
tests/cmake_install.cmake
**/*.cmake
tests/*.cmake
**/Makefile
**/CMakeFiles
!build-info.cpp.in
!build-info.sh
!build.zig
190 changes: 190 additions & 0 deletions convert_hf_to_gguf.py
@@ -852,6 +852,9 @@ def get_vocab_base_pre(self, tokenizer) -> str:
if chkhsh == "b3f499bb4255f8ca19fccd664443283318f2fd2414d5e0b040fbdd0cc195d6c5":
# ref: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
res = "deepseek-r1-qwen"
if chkhsh == "877081d19cf6996e2c4ff0e1236341e9b7bde288f5311a56a937f0afbbb3aeb5":
# ref: https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp
res = "deepseek-v3.2"
if chkhsh == "ccc2ef013c104be7bae2965776d611e1d7a8a2a9c547dd93a682c9a9fc80352e":
# ref: https://huggingface.co/Xenova/gpt-4o
res = "gpt-4o"
@@ -6503,6 +6506,193 @@ def prepare_tensors(self):
raise ValueError(f"Unprocessed experts: {experts}")


@ModelBase.register(
"DeepseekV32ForCausalLM",
)
class DeepseekV3_2Model(TextModel):
model_arch = gguf.MODEL_ARCH.DEEPSEEK3_2

def set_vocab(self):
try:
self._set_vocab_gpt2()
return
except Exception:
pass

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(self.dir_model, trust_remote_code=True)
tokpre = self.get_vocab_base_pre(tokenizer)

if tokpre == "kimi-k2":
# Build merges list using the approach similar to HunYuanMoE
merges = []
vocab = {}
mergeable_ranks = tokenizer.model._mergeable_ranks
for token, rank in mergeable_ranks.items():
vocab[QwenModel.token_bytes_to_string(token)] = rank
if len(token) == 1:
continue
merged = QwenModel.bpe(mergeable_ranks, token, max_rank=rank)
if len(merged) == 2:
merges.append(' '.join(map(QwenModel.token_bytes_to_string, merged)))

# Build token list
vocab_size = self.hparams["vocab_size"]
special_tokens = tokenizer.special_tokens
reverse_vocab = {id_ : encoded_tok for encoded_tok, id_ in {**vocab, **special_tokens}.items()}
tokens: list[str] = []
toktypes: list[int] = []

for i in range(vocab_size):
if i not in reverse_vocab:
tokens.append(f"[PAD{i}]")
toktypes.append(gguf.TokenType.UNUSED)
else:
token = reverse_vocab[i]
tokens.append(token)
if i in special_tokens.values():
toktypes.append(gguf.TokenType.CONTROL)
else:
toktypes.append(gguf.TokenType.NORMAL)

self.gguf_writer.add_tokenizer_model("gpt2")
self.gguf_writer.add_tokenizer_pre(tokpre)
self.gguf_writer.add_token_list(tokens)
self.gguf_writer.add_token_types(toktypes)
self.gguf_writer.add_token_merges(merges)

special_vocab = gguf.SpecialVocab(self.dir_model, load_merges=False)
special_vocab.add_to_gguf(self.gguf_writer)
else:
raise NotImplementedError(f"Deepseek pre-tokenizer {tokpre!r} is not supported yet!")

def set_gguf_parameters(self):

# note: deepseek2 using MLA converts into MQA (ie: GQA with 1 group)
self.hparams["num_key_value_heads"] = 1

super().set_gguf_parameters()
hparams = self.hparams

self.gguf_writer.add_leading_dense_block_count(hparams["first_k_dense_replace"])
self.gguf_writer.add_vocab_size(hparams["vocab_size"])
if "q_lora_rank" in hparams and hparams["q_lora_rank"] is not None:
self.gguf_writer.add_q_lora_rank(hparams["q_lora_rank"])
self.gguf_writer.add_kv_lora_rank(hparams["kv_lora_rank"])

# note: deepseek2 using MLA converts into MQA with larger heads, then decompresses to MHA
self.gguf_writer.add_key_length(hparams["kv_lora_rank"] + hparams["qk_rope_head_dim"])
self.gguf_writer.add_value_length(hparams["kv_lora_rank"])
self.gguf_writer.add_key_length_mla(hparams["qk_nope_head_dim"] + hparams["qk_rope_head_dim"])
self.gguf_writer.add_value_length_mla(hparams["v_head_dim"])

self.gguf_writer.add_expert_feed_forward_length(hparams["moe_intermediate_size"])
self.gguf_writer.add_expert_count(hparams["n_routed_experts"])
self.gguf_writer.add_expert_shared_count(hparams["n_shared_experts"])
self.gguf_writer.add_expert_weights_scale(hparams["routed_scaling_factor"])
self.gguf_writer.add_expert_weights_norm(hparams["norm_topk_prob"])

if hparams["scoring_func"] == "sigmoid":
self.gguf_writer.add_expert_gating_func(gguf.ExpertGatingFuncType.SIGMOID)
elif hparams["scoring_func"] == "softmax":
self.gguf_writer.add_expert_gating_func(gguf.ExpertGatingFuncType.SOFTMAX)
else:
raise ValueError(f"Unsupported scoring_func value: {hparams['scoring_func']}")

self.gguf_writer.add_rope_dimension_count(hparams["qk_rope_head_dim"])

rope_scaling = self.hparams.get("rope_scaling") or {}
if rope_scaling.get("rope_type", rope_scaling.get("type")) == "yarn" and "factor" in rope_scaling:
self.gguf_writer.add_rope_scaling_type(gguf.RopeScalingType.YARN)
self.gguf_writer.add_rope_scaling_factor(rope_scaling["factor"])
self.gguf_writer.add_rope_scaling_orig_ctx_len(rope_scaling["original_max_position_embeddings"])
self.gguf_writer.add_rope_scaling_yarn_log_mul(0.1 * rope_scaling["mscale_all_dim"])

_experts: list[dict[str, Tensor]] | None = None

def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
# skip vision tensors and remove "language_model." for Kimi-VL
if "vision_tower" in name or "multi_modal_projector" in name:
return []

if name.startswith("language_model."):
name = name.replace("language_model.", "")

# rename e_score_correction_bias tensors
if name.endswith("e_score_correction_bias"):
name = name.replace("e_score_correction_bias", "e_score_correction.bias")

# skip Multi-Token Prediction (MTP) layers
block_count = self.hparams["num_hidden_layers"]
match = re.match(r"model.layers.(\d+)", name)
if match and int(match.group(1)) >= block_count:
return []

# process the experts separately
if name.find("mlp.experts") != -1:
n_experts = self.hparams["n_routed_experts"]
assert bid is not None

if self._experts is None:
self._experts = [{} for _ in range(self.block_count)]

self._experts[bid][name] = data_torch

if len(self._experts[bid]) >= n_experts * 3:
tensors: list[tuple[str, Tensor]] = []

# merge the experts into a single 3d tensor
for w_name in ["down_proj", "gate_proj", "up_proj"]:
datas: list[Tensor] = []

for xid in range(n_experts):
ename = f"model.layers.{bid}.mlp.experts.{xid}.{w_name}.weight"
datas.append(self._experts[bid][ename])
del self._experts[bid][ename]

data_torch = torch.stack(datas, dim=0)

merged_name = f"model.layers.{bid}.mlp.experts.{w_name}.weight"

new_name = self.map_tensor_name(merged_name)

tensors.append((new_name, data_torch))
return tensors
else:
return []

# note: MLA with the absorption optimization, needs these two split and k_b_proj transposed
if name.endswith("kv_b_proj.weight"):
name_kb = name.replace("kv_b_proj", "k_b_proj")
name_vb = name.replace("kv_b_proj", "v_b_proj")

n_head_kv = self.hparams["num_key_value_heads"]
v_head_dim = self.hparams["v_head_dim"]
qk_nope_head_dim = self.hparams["qk_nope_head_dim"]

assert data_torch.shape[0] == n_head_kv * (v_head_dim + qk_nope_head_dim)

kv_b = data_torch.view(n_head_kv, v_head_dim + qk_nope_head_dim, data_torch.shape[-1])
k_b, v_b = torch.split(kv_b, [qk_nope_head_dim, v_head_dim], dim=1)
k_b = k_b.transpose(1, 2)

return [
(self.map_tensor_name(name_kb), k_b),
(self.map_tensor_name(name_vb), v_b)
]

return [(self.map_tensor_name(name), data_torch)]

def prepare_tensors(self):
super().prepare_tensors()

if self._experts is not None:
# flatten `list[dict[str, Tensor]]` into `list[str]`
experts = [k for d in self._experts for k in d.keys()]
if len(experts) > 0:
raise ValueError(f"Unprocessed experts: {experts}")


@ModelBase.register(
"DeepseekV2ForCausalLM",
"DeepseekV3ForCausalLM",
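For context on the kv_b_proj handling above: the split and transpose in modify_tensors is shape bookkeeping for the MLA absorption path. A minimal standalone sketch of the same steps (not part of the diff; the dimensions below are made-up placeholders, not the real DeepSeek V3.2 config values):

import torch

# hypothetical dims for illustration only; the real values come from the model's config.json
n_head_kv = 1
qk_nope_head_dim = 128
v_head_dim = 128
kv_lora_rank = 512

# kv_b_proj.weight is stored as [n_head_kv * (qk_nope_head_dim + v_head_dim), kv_lora_rank]
kv_b_proj = torch.randn(n_head_kv * (qk_nope_head_dim + v_head_dim), kv_lora_rank)

# same steps as modify_tensors: view per head, split into k_b / v_b, transpose k_b
kv_b = kv_b_proj.view(n_head_kv, v_head_dim + qk_nope_head_dim, kv_b_proj.shape[-1])
k_b, v_b = torch.split(kv_b, [qk_nope_head_dim, v_head_dim], dim=1)
k_b = k_b.transpose(1, 2)

assert k_b.shape == (n_head_kv, kv_lora_rank, qk_nope_head_dim)
assert v_b.shape == (n_head_kv, v_head_dim, kv_lora_rank)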
1 change: 1 addition & 0 deletions convert_hf_to_gguf_update.py
@@ -127,6 +127,7 @@ class TOKENIZER_TYPE(IntEnum):
{"name": "megrez", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/Infinigence/Megrez-3B-Instruct"},
{"name": "deepseek-v3", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/deepseek-ai/DeepSeek-V3"},
{"name": "deepseek-r1-qwen", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"},
{"name": "deepseek-v3.2", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp"},
{"name": "gpt-4o", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/Xenova/gpt-4o", },
{"name": "superbpe", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/UW/OLMo2-8B-SuperBPE-t180k", },
{"name": "trillion", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/trillionlabs/Trillion-7B-preview", },
8 changes: 4 additions & 4 deletions examples/eval-callback/eval-callback.cpp
@@ -105,10 +105,10 @@ static void ggml_print_tensor(uint8_t * data, ggml_type type, const int64_t * ne
}

// TODO: make this abort configurable/optional?
if (std::isnan(sum)) {
LOG_ERR("encountered NaN - aborting\n");
exit(0);
}
//if (std::isnan(sum)) {
// LOG_ERR("encountered NaN - aborting\n");
// exit(0);
//}
}

/**
1 change: 1 addition & 0 deletions ggml/src/ggml-cuda/binbcast.cu
@@ -1,6 +1,7 @@
#include "binbcast.cuh"
#include <cstdint>
#include <utility>
#include <cstdio>

static __device__ __forceinline__ float op_repeat(const float a, const float b) {
return b;
19 changes: 19 additions & 0 deletions ggml/src/ggml.c
@@ -23,6 +23,8 @@
#include <assert.h>
#include <errno.h>
#include <time.h>
#include <inttypes.h>

#include <math.h>
#include <stdlib.h>
#include <string.h>
@@ -1920,6 +1922,11 @@ static struct ggml_tensor * ggml_add_impl(
bool inplace) {
GGML_ASSERT(ggml_can_repeat(b, a));

// Ensure RHS has CUDA-friendly stride alignment for broadcast add
if (ggml_type_size(b->type) > 0 && (b->nb[1] % ggml_type_size(b->type)) != 0) {
b = ggml_cont(ctx, b);
}

struct ggml_tensor * result = inplace ? ggml_view_tensor(ctx, a) : ggml_dup_tensor(ctx, a);

result->op = GGML_OP_ADD;
@@ -3394,6 +3401,12 @@ struct ggml_tensor * ggml_reshape_2d(
int64_t ne0,
int64_t ne1) {
GGML_ASSERT(ggml_is_contiguous(a));
/*
printf("ggml_reshape_2d: a=[%5" PRId64 ", %5" PRId64 "], ne0=%5" PRId64 ", ne1=%5" PRId64 "\n",
a->ne[0], a->ne[1],
ne0, ne1);
fflush(stdout);
*/
GGML_ASSERT(ggml_nelements(a) == ne0*ne1);

const int64_t ne[2] = { ne0, ne1 };
@@ -3413,6 +3426,12 @@ struct ggml_tensor * ggml_reshape_3d(
int64_t ne1,
int64_t ne2) {
GGML_ASSERT(ggml_is_contiguous(a));
/*
printf("ggml_reshape_3d: a=[%5" PRId64 ", %5" PRId64 ", %5" PRId64 "], ne0=%5" PRId64 ", ne1=%5" PRId64 ", ne2=%5" PRId64 "\n",
a->ne[0], a->ne[1], a->ne[2],
ne0, ne1, ne2);
fflush(stdout);
*/
GGML_ASSERT(ggml_nelements(a) == ne0*ne1*ne2);

const int64_t ne[3] = { ne0, ne1, ne2 };
22 changes: 22 additions & 0 deletions gguf-py/gguf/constants.py
@@ -372,6 +372,7 @@ class MODEL_ARCH(IntEnum):
ARCTIC = auto()
DEEPSEEK = auto()
DEEPSEEK2 = auto()
DEEPSEEK3_2 = auto()
CHATGLM = auto()
GLM4 = auto()
GLM4_MOE = auto()
@@ -460,6 +461,10 @@ class MODEL_TENSOR(IntEnum):
FFN_EXP_PROBS_B = auto()
ATTN_Q_NORM = auto()
ATTN_K_NORM = auto()
ATTN_INDEXER_K_NORM = auto() # deepseek3_2
ATTN_INDEXER_WEIGHTS_PROJ = auto() # deepseek3_2
ATTN_INDEXER_WK = auto() # deepseek3_2
ATTN_INDEXER_WQ_B = auto() # deepseek3_2
LAYER_OUT_NORM = auto()
PER_LAYER_TOKEN_EMBD = auto() # gemma3n
PER_LAYER_MODEL_PROJ = auto() # gemma3n
@@ -712,6 +717,7 @@ class MODEL_TENSOR(IntEnum):
MODEL_ARCH.ARCTIC: "arctic",
MODEL_ARCH.DEEPSEEK: "deepseek",
MODEL_ARCH.DEEPSEEK2: "deepseek2",
MODEL_ARCH.DEEPSEEK3_2: "deepseek3_2",
MODEL_ARCH.CHATGLM: "chatglm",
MODEL_ARCH.GLM4: "glm4",
MODEL_ARCH.GLM4_MOE: "glm4moe",
@@ -779,6 +785,10 @@ class MODEL_TENSOR(IntEnum):
MODEL_TENSOR.ATTN_SINKS: "blk.{bid}.attn_sinks",
MODEL_TENSOR.ATTN_Q_NORM: "blk.{bid}.attn_q_norm",
MODEL_TENSOR.ATTN_K_NORM: "blk.{bid}.attn_k_norm",
MODEL_TENSOR.ATTN_INDEXER_K_NORM: "blk.{bid}.attn_indexer_k_norm",
MODEL_TENSOR.ATTN_INDEXER_WEIGHTS_PROJ: "blk.{bid}.attn_indexer_weights_proj",
MODEL_TENSOR.ATTN_INDEXER_WK: "blk.{bid}.attn_indexer_wk",
MODEL_TENSOR.ATTN_INDEXER_WQ_B: "blk.{bid}.attn_indexer_wq_b",
MODEL_TENSOR.ATTN_OUT_NORM: "blk.{bid}.attn_output_norm",
MODEL_TENSOR.ATTN_POST_NORM: "blk.{bid}.post_attention_norm",
MODEL_TENSOR.FFN_GATE_INP: "blk.{bid}.ffn_gate_inp",
@@ -2746,6 +2756,14 @@ class MODEL_TENSOR(IntEnum):
# TODO
}

MODEL_TENSORS[MODEL_ARCH.DEEPSEEK3_2] = [
*MODEL_TENSORS[MODEL_ARCH.DEEPSEEK2],
MODEL_TENSOR.ATTN_INDEXER_K_NORM,
MODEL_TENSOR.ATTN_INDEXER_WEIGHTS_PROJ,
MODEL_TENSOR.ATTN_INDEXER_WK,
MODEL_TENSOR.ATTN_INDEXER_WQ_B,
]

# tensors that will not be serialized
MODEL_TENSOR_SKIP: dict[MODEL_ARCH, list[MODEL_TENSOR]] = {
MODEL_ARCH.LLAMA: [
@@ -2788,6 +2806,10 @@ class MODEL_TENSOR(IntEnum):
MODEL_TENSOR.ROPE_FREQS,
MODEL_TENSOR.ATTN_ROT_EMBD,
],
MODEL_ARCH.DEEPSEEK3_2: [
MODEL_TENSOR.ROPE_FREQS,
MODEL_TENSOR.ATTN_ROT_EMBD,
],
MODEL_ARCH.CHATGLM: [
MODEL_TENSOR.ROPE_FREQS,
],