World First! TheTom Pulls Off TurboQuant w/MTP (and it Works!)
A world first! TurboQuant + MTP support from the same llama.cpp! What a game changer!
The significance of this cannot be stressed enough - imagine not only getting multi-token prediction (MTP) on your LLM, which almost doubles the speed, but also getting TurboQuant KV compression, allowing you to run VERY large contexts on minimal hardware!
Again we want to hugely thank TheTom for this specialized fork.
Background Supports.
- The supports are always the same: you need the latest cmake, the latest nvcc, and the latest NVIDIA drivers. Simply go over to the StudentLLM and work through it; the only difference is that you come back here for a different configuration and an MTP-enabled model.
- Because it references TheTom's TurboQuant fork of llama.cpp, it will automatically enable MTP, which he added to his fork last week!
- We had no idea this had occurred until Tom personally messaged me about it!

A Hot Config
- Please note you MUST use an MTP-enabled MoE-type model. We have gotten stunning results from Carnice-Qwen3.6, so do support this guy's work. Buy him a coffee!

wget https://huggingface.co/mudler/Carnice-Qwen3.6-MoE-35B-A3B-APEX-MTP-GGUF/resolve/main/Carnice-Qwen3.6-MoE-35B-A3B-APEX-MTP-I-Balanced.gguf

/usr/bin/llama-server --jinja \
-m /home/c/models/Carnice-Qwen3.6-MoE-35B-A3B-APEX-MTP-I-Balanced.gguf \
--host 192.168.1.3 \
--n-gpu-layers -1 \
--spec-type draft-mtp \
--spec-draft-n-max 2 \
--n-cpu-moe 30 \
--chat-template-kwargs '{"preserve_thinking":true}' \
-c 252144 \
--flash-attn 1 \
--context-shift \
--repeat-penalty 1.12 \
--cache-type-k turbo3 \
--cache-type-v turbo4

The results were shocking. We had it go over a large 38K multi-level asteroid game. We were used to our localLLM taking a good half hour to produce output. No. Nuts! It was done in a few minutes.
Some INSANE ingest/production Numbers. Check this out.

- This is not 'high-end' gear - this is just a 4080 on a Ryzen 9 12-core. 128 GB RAM / 16 GB VRAM.
- Please note you will need to adjust your --n-cpu-moe 30 to something larger or smaller. If you are doing small contexts, load your GPU to around 90%. I use this configuration to utilize about 12GB of my 16GB GPU. It allows for very large contexts because of kv_cache compression and still gives whopping fast speeds, like FAST.
- If I were doing super-large contexts, I could adjust that value UP and load most of the model to the CPU. It is a speed/size trade-off.
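
Since --n-cpu-moe is the one value that keeps getting adjusted, it can help to wrap the launch command in a small shell function. This is only a minimal sketch: the function name start_llama and the LLAMA_SERVER and MODEL variables are made up for illustration and are not part of llama.cpp or TheTom's fork. The flags themselves are copied from the config above, with --context-shift left out of the base set so it can be passed as an extra.

```shell
#!/bin/sh
# Hypothetical wrapper: start_llama, LLAMA_SERVER and MODEL are illustrative
# names, not part of llama.cpp. The flags are the ones used in this post.
LLAMA_SERVER="${LLAMA_SERVER:-/usr/bin/llama-server}"
MODEL="${MODEL:-/home/c/models/Carnice-Qwen3.6-MoE-35B-A3B-APEX-MTP-I-Balanced.gguf}"

# Usage: start_llama <n-cpu-moe> [extra llama-server flags...]
start_llama() {
    n_cpu_moe="${1:-30}"          # default matches the config above
    [ "$#" -gt 0 ] && shift       # everything after the first argument is passed through
    "$LLAMA_SERVER" --jinja \
        -m "$MODEL" \
        --host 192.168.1.3 \
        --n-gpu-layers -1 \
        --spec-type draft-mtp \
        --spec-draft-n-max 2 \
        --n-cpu-moe "$n_cpu_moe" \
        --chat-template-kwargs '{"preserve_thinking":true}' \
        -c 252144 \
        --flash-attn 1 \
        --repeat-penalty 1.12 \
        --cache-type-k turbo3 \
        --cache-type-v turbo4 \
        "$@"
}
```

With this in place, start_llama 30 --context-shift should launch the same configuration as above, and a larger or smaller first argument moves more or fewer MoE layers to the CPU.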

Next we ran benchloop on the whole setup...

benchloop run --endpoint http://192.168.1.3:8080 --provider openai_compat --model Carnice-Qwen3.6-MoE-35B-A3B-APEX-MTP-I-Balanced.gguf

We modified our run configuration so it used 15.7GB of the 16 GB card. The idea is the benchmark will probably run inside the last 1 GB, and it was CRUSHING through it. Our benchmark config uses --n-cpu-moe 24:

/usr/bin/llama-server --jinja \
-m /home/c/models/Carnice-Qwen3.6-MoE-35B-A3B-APEX-MTP-I-Balanced.gguf \
--host 192.168.1.3 \
--n-gpu-layers -1 \
--spec-type draft-mtp \
--spec-draft-n-max 2 \
--n-cpu-moe 24 \
--chat-template-kwargs '{"preserve_thinking":true}' \
-c 252144 \
--flash-attn 1 \
--context-shift \
--repeat-penalty 1.12 \
--cache-type-k turbo3 \
--cache-type-v turbo4

It was just sailing through the benchmark, and we were excited to see what results were about to pour in. We had never really seen ingest above about 150, and never saw generation much above 27 tokens/s. We effectively saw it doubled.
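
GPU memory figures like the ones quoted for these runs can be read off nvidia-smi while the server is loaded. The helper below is a rough sketch, and vram_report is a made-up name. It assumes its input is the "used, total" line in MiB that nvidia-smi prints with the --query-gpu flags shown in the comment.

```shell
#!/bin/sh
# Hypothetical helper: vram_report is an illustrative name.
# It reads lines of "used, total" (MiB) on stdin, the format printed by:
#   nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits
vram_report() {
    # Split on the comma plus any spaces; skip lines with a zero total.
    awk -F', *' '$2 > 0 { printf "%d / %d MiB (%.1f%%)\n", $1, $2, ($1 * 100) / $2 }'
}

# Typical use on the GPU box (one output line per GPU):
#   nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits | vram_report
```

Re-running it after each change to --n-cpu-moe shows how close the card is to full before a benchmark starts.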

It gave VERY good results. For some reason dataextract either gets a 15/15 or a 1 with these models, which is something we still need to tweak. It should be noted that we should probably disable our MCP agents before benchmarking, as they add unnecessary overhead.

We noticed that we gave it almost no cache, and at one point it was failing with an error of:

W slot update_slots: id 0 | task 32013 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cp

We backed off our model and followed recommendations by adjusting the --n-cpu-moe 26 and adding --swa-full. Our next run configuration is as follows, which loaded the GPU to 14.5GB out of 16GB.

/usr/bin/llama-server --jinja \
-m /home/c/models/Carnice-Qwen3.6-MoE-35B-A3B-APEX-MTP-I-Balanced.gguf \
--host 192.168.1.3 \
--n-gpu-layers -1 \
--spec-type draft-mtp \
--spec-draft-n-max 2 \
--n-cpu-moe 29 \
--chat-template-kwargs '{"preserve_thinking":true}' \
-c 252144 \
--swa-full \
--flash-attn 1 \
--repeat-penalty 1.12 \
--cache-type-k turbo3 \
--cache-type-v turbo4

It should be noted that our tokens/s dropped from 58 to about 45-ish on various things, and we are currently studying this for better results. We stopped at this point because, irrespective of the bench, the results are a game changer!
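
To put the speed figures side by side, here is a rough sketch of the arithmetic. The helper names speedup and pct_change are made up for illustration, and it assumes that the 27, 58 and 45 figures all refer to generation tokens/s, which are approximate numbers to begin with.

```shell
#!/bin/sh
# Hypothetical helpers for comparing tokens/s figures; names are illustrative.

# speedup <old> <new>: how many times faster <new> is than <old>
speedup() {
    awk -v old="$1" -v new="$2" 'BEGIN { printf "%.2fx\n", new / old }'
}

# pct_change <old> <new>: signed percentage change going from <old> to <new>
pct_change() {
    awk -v old="$1" -v new="$2" 'BEGIN { printf "%+.1f%%\n", (new - old) * 100 / old }'
}
```

Under those assumptions, speedup 27 58 prints 2.15x, pct_change 58 45 prints -22.4%, and speedup 27 45 prints 1.67x, so even the slower --swa-full run stays well ahead of the old ceiling.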
Conclusion
Even though it seemed that benchloop didn't really give this configuration a very high rating, we were extremely sold on how well it performed from our own anecdotal observations. It is not just a workhorse now, it is an endurance workhorse. We were very used to having our houseLLM take up to 1.5 hours to produce anything significant, and it was like the dishes: you set it, come back, and it has produced something useful. Now it literally hot-plates your code in seconds. It is a real and bona fide alternative to most production-level models outside the largest of code shops.

