MTP (Multi-Token Prediction) just merged into mainline llama.cpp at b9190. I promised u/WarthogConfident4039 a Qwen3.6 benchmarking round. Three configs, tested at real coding-agent context lengths (not just 512 tokens). The main finding surprised me.
TL;DR: 35B Q4_K_XL, no MTP, --fit-target 1536**, 131k context. That's the config.** 56 tok/s generation, 1,584 tok/s prompt processing at 128k context. MTP doesn't help at 128k — both converge to the same speed. Skip the complexity. The 27B IQ3 is worth considering if 56k context is enough for you (or if you have a 12 GB card where the 35B won't fit).
The Configs
Config
27B IQ3+MTP (A)
35B Q4_K_XL+MTP (B)
35B Q8_0+MTP (C)
Model
Qwen3.6-27B MTP-UD-IQ3_XXS
Qwen3.6-35B-A3B MTP-UD-Q4_K_XL
Qwen3.6-35B-A3B MTP-Q8_0
Size
12.45 GB
~22 GB
~36 GB
Source
GazTrab
havenoammo
Grafted
GPU fit
Fully on GPU (66/66)
Partial offload
Heavy offload
All tests on: RTX 5080 16GB, Ryzen 9 9950X, 128GB RAM, llama.cpp b9204 (mainline).
Common MTP flags: -np 1 --fit on -fa on -t 20 --no-mmap --jinja -ctk q8_0 -ctv q8_0 --spec-type draft-mtp --spec-draft-n-max 2
Results
Speed — The MTP Surprise
With MTP (mtp-bench, 9 prompt types)
Metric
27B IQ3
35B Q4_K_XL
35B Q8_0
Avg tok/s
73
74
46
Peak tok/s
83 (code)
86 (translation)
51
MTP accept
74.4%
79.5%
80.1%
--fit-target
0
1536
1536
The surprise: 35B is FASTER without MTP
35B Q4_K_XL config
--fit-target
MTP?
Avg tok/s
VRAM used
Best (no MTP)
0
No
97
15,815 MiB
Same VRAM budget
1536
No
86
14,269 MiB
MTP enabled
1536
Yes
74
14,623 MiB
MTP is 23% slower for the 35B MoE on 16GB. Why?
- MTP requires
--fit-target 1536 to reserve ~1.5 GB for the MTP compute buffer
- That 1.5 GB pushes ~3 more MoE expert layers from GPU to CPU
- CPU-bound expert layers are the bottleneck for MoE inference
- MTP's multi-token speculation (~79% acceptance) doesn't compensate for the slower per-step speed
For the 27B, MTP helps because the model fits entirely on GPU (12.45 GB) — --fit-target 0 works with and without MTP, so there's no VRAM penalty. The 27B goes from ~56 tok/s (no MTP, older builds) to 73 tok/s with MTP.
Rule of thumb: MTP helps when your model fits on GPU. It hurts when the MTP compute buffer forces more layers to CPU.
Speed at Coding-Agent Context Lengths (the real test)
Everyone runs coding agents at 128k. Here's what actually happens as you fill the context window. Tested with synthetic prompts (Python classes, architecture docs, error stack traces — varied enough to prevent tokenizer compression), prompt cache disabled, 35B Q4_K_XL with --fit-target 1536:
Context
PP (no MTP)
PP (MTP)
TG (no MTP)
TG (MTP)
~8k
1,855 tok/s
1,712 tok/s
73 tok/s
79 tok/s
~32k
1,810 tok/s
1,674 tok/s
74 tok/s
70 tok/s
~64k
1,723 tok/s
1,583 tok/s
67 tok/s
76 tok/s
~128k
1,584 tok/s
1,437 tok/s
56 tok/s
56 tok/s
8k/32k TG measured in a separate run from 64k/128k — expect ~5-10% variance between rows from measurement noise.
At 128k context, MTP and no-MTP converge to the same TG speed (~56 tok/s). The KV cache fills VRAM at long context regardless of MTP, so the offload split ends up identical. MTP's multi-token speculation is offset by its compute overhead.
PP degrades gracefully: 1,855 → 1,584 tok/s from 8k to 128k (~15% decline). A 128k prompt processes in ~81 seconds.
The "97 tok/s" only exists at short context with --fit-target 0. At 64k+, --fit-target 0 OOMs because there's no headroom for KV cache growth. You must use --fit-target 1536 for long-context work, which brings speed down to ~73 tok/s at short context and ~56 tok/s at 128k.
Bottom line for coding agents: expect ~56 tok/s TG and ~1,500 tok/s PP at 128k context on 16GB. MTP is a wash — doesn't help or hurt at full context.
VRAM Usage
Config
VRAM used
VRAM free
Notes
A (27B IQ3+MTP)
14,803 MiB
1,039 MiB
Fully on GPU, fit-target 0
B (35B Q4_K_XL+MTP)
14,623 MiB
1,219 MiB
Partial offload, fit-target 1536
B (35B Q4_K_XL, no MTP)
15,815 MiB
27 MiB
Maximum GPU layers, fit-target 0
C (35B Q8_0+MTP)
14,567 MiB
1,275 MiB
Heavy offload, fit-target 1536
Context Limits (push to OOM)
Limit
27B IQ3
35B Q4_K_XL
35B Q8_0
Max ctx (q8_0 KV)
56k
131k+
131k+
Max ctx (q4_0 KV)
110k
131k+
131k+
Speed at max ctx
80.5 / 57.2
56
45
This is the biggest differentiator. The 35B MoE handles 131k context easily because its hybrid architecture (Gated DeltaNet + Attention) only has ~10 full-attention layers that need KV cache. The remaining SSM layers use a tiny recurrent state. The 27B dense model has KV on every layer, so it maxes out at 56k with q8_0 KV.
Tip for 27B users: switching from -ctk q8_0 -ctv q8_0 to -ctk q4_0 -ctv q4_0 extends your max context from 56k → 110k. Quality cost is minimal: q4_0 KV at 56k scores 218/220 CodeNeedle vs 220/220 with q8_0 KV (q4_0 at regular context: 219/220 — so most of the 2-line drop is from q4_0 itself, not the longer context).
The OOM at higher contexts is the MTP compute buffer (529 MiB fixed allocation), not the KV cache itself. This is a llama.cpp implementation detail that may improve in future versions.
Quality — CodeNeedle (positional recall)
11 functions from Python's http.server, ~50k char corpus, testing exact line-level recall:
Metric
27B IQ3
35B Q4_K_XL
35B Q8_0
Pass
11/11
11/11
11/11
Lines matched
220/220
217/220
216/220
Hallucinations
0
1
1
The 27B IQ3 has a perfect score — every line exact, zero hallucinations. The 35B models are close but not quite there. Interesting that Q8_0 doesn't beat Q4_K_XL here.
Quality — GSM8K (grade school math, 100 cases)
Metric
27B IQ3
35B Q4_K_XL
35B Q8_0
Accuracy
89%
91%
90%
CI (95%, excl. truncated)
[86.9%, 97.1%]
[84.9%, 95.8%]
[85.8%, 96.5%]
Truncated
5
1
3
Wall time
106 min
67 min
114 min
All three overlap in confidence intervals — the quality difference is negligible. But the 35B Q4_K_XL is 37% faster to evaluate (67 vs 106 min) with fewer truncations.
Note: AIME2025 was also tested on the 27B — 50% overall but 100% on non-truncated cases*. Every failure was context exhaustion at 32k, not wrong reasoning. The 35B MoE with 131k context would likely score higher.*
Ubatch PP Trick (coder543, May 18)
u/coder543 discovered that increasing -ub from 512→8192 gives 5.5x prompt processing speedup for --n-cpu-moe partially offloaded models. I tested this on the 35B:
Result: doesn't apply with --fit on**.** The -ub 2048+ OOMs because --fit on already maximizes VRAM for model layers — no headroom for larger batch buffers. If you use --n-cpu-moe manual offload instead, the trick works. But --fit on is simpler and handles the split automatically.
Concurrency (-np sweep)
Tested -np 1/2/4 on 10 GSM8K cases:
-np
27B tok/s
27B throughput
35B tok/s
35B throughput
1
83.3
0.6 cases/min
70.7
0.8 cases/min
2
57.7
1.3 cases/min
49.7
1.1 cases/min
4
10.0 (CPU overflow)
0.6 cases/min
28
failed
-np 2 doubles batch throughput at 30% slower per-request speed. -np 4 pushes layers to CPU — 27B drops to 10 tok/s, 35B partially fails. Use -np 1 for interactive chat, -np 2 for batch evaluation.
MTP Reference (for 27B / fully-on-GPU setups)
MTP is worth it when the model fits entirely on GPU (no offload penalty). For the 27B IQ3 on 12GB: 73 tok/s with MTP vs ~56 without. For the 35B on 16GB: skip it (see speed table above).
If you do use MTP:
--spec-type draft-mtp — not mtp. Mainline renamed it.
-np 1 — b9204 defaults to 4 slots which pushes layers to CPU.
--spec-draft-n-max 2 beats 3 (lower acceptance at 3 = slower overall).
--fit-target 1536 for partial-offload models. --fit-target 0 for fully-on-GPU.
- At 128k context, MTP gives no speedup — KV cache dominates VRAM regardless.
Other notes:
- Hadamard KV rotation (
-khad) is enabled by default since b8607 — no flag needed.
-np 2 doubles batch throughput at 30% slower per-request. Good for eval, bad for interactive.
Recommendation
The Config (just copy this)
./llama-server \
-m Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf \
-c 131072 -np 1 --fit on --fit-target 1536 \
-fa on -t 20 --no-mmap --jinja \
-ctk q8_0 -ctv q8_0
No MTP. No special flags. --fit-target 1536 is the key — it reserves VRAM headroom so the KV cache doesn't OOM at 128k. Load it, leave it running, point your coding agent at localhost:8080/v1/chat/completions.
What you get: 56 tok/s generation at 128k context. 1,584 tok/s prompt processing (81s to ingest 128k tokens). 131k max context. GSM8K 91%. Stable.
Why no MTP? At 128k context both MTP and no-MTP give the same 56 tok/s — the KV cache dominates VRAM either way. MTP adds 5 gotchas for zero benefit. Skip the complexity.
GGUF: havenoammo/Qwen3.6-35B-A3B-MTP-GGUF (the MTP GGUF works fine without --spec-type draft-mtp — it just ignores the extra tensors).
27B GGUF: GazTrab/Qwen3.6-27B-MTP-UD-IQ3_XXS-GGUF
Other VRAM budgets (community data, not tested by us)
Everything above was tested on our RTX 5080 16GB. These estimates for other GPUs are from community reports:
VRAM
Model
Speed
Source
8 GB
35B MoE Q2_K_XL+MTP
~50 tok/s (est.)
u/Still-Notice8155 (GTX 1070,
-fit off --n-cpu-moe 32)
12 GB
35B MoE Q4_K_XL+MTP
~73-80 tok/s
u/janvitos (RTX 4070 Super 12GB)
16 GB
35B Q4_K_XL
56 tok/s @ 128k
This post (RTX 5080)
24 GB
35B Q4_K_XL (no MTP)
~90+ tok/s (est.)
Model is ~22 GB, fits fully on GPU with headroom for KV
The 27B IQ3+MTP needs the MTP head grafted — graft-mtp.py in the repo.
Why not the others?
27B IQ3 — We tested it on our 16GB card where it fits fully on GPU (12.45 GB model). Perfect CodeNeedle (220/220), 73 tok/s with MTP (GGUF). But it caps at 56k context (110k with q4_0 KV). If your coding agent needs 128k, it's out. Better suited for 12 GB cards where the 35B won't fit.
35B Q8_0 — 38% slower (46 tok/s with MTP), negligible quality gain (GSM8K 90% vs 91%, overlapping CIs). Not worth the VRAM on 16 GB.
Credits
This post exists because of the community:
- am17an — original MTP implementation (PR #22673), merged mainline b9190
- havenoammo — MTP GGUF variants + graft script
- u/janvitos — 80 tok/s MTP config on 12GB (635 upvotes), documented the flags
- u/coder543 — ubatch PP trick for
--n-cpu-moe (May 18)
- u/OsmanthusBloom — earlier ubatch discovery
- u/Still-Notice8155 — GTX 1070 8GB MTP benchmarks proving it works everywhere
- u/raketenkater — run-time-repack, defrag-thold, -khad flags documentation
- u/moflinCASIO — 4060 Ti 16GB reference benchmarks
- u/WarthogConfident4039 — requested this benchmarking round
- ggerganov — llama-eval, MTP mainline merge
- u/simracerman — pushed for PP speed benchmarks ("your typical coding agent dumps 10k tokens")
- u/danielhanchen (Unsloth) — Dynamic quantization formula behind UD-Q4_K_XL
- u/alexziskind1 — CodeNeedle positional recall benchmark
What's Next
vLLM vs llama.cpp head-to-head. vLLM >= 0.19.0 supports MTP natively with PagedAttention (dynamic KV allocation — no fixed compute buffer eating VRAM). Could make MTP actually faster for partial-offload models. Stay tuned.
EDIT: u/Look_0ver_There — corrected 24 GB VRAM table (Q8_0 is 36 GB, doesn't fit)
EDIT 2: u/FusionX correctly points out that --fit-target 1536 is too conservative for headless setups. My machine runs a desktop compositor + terminal that eats ~1 GB VRAM before the model loads. If you're running headless, --fit-target 128 keeps more expert layers on GPU. FusionX reports 70-80 tok/s at 131k context on the same GPU with this setting. I'll re-benchmark with a lower fit-target and update. The recommended config is adjust --fit-target down if you're headless.
EDIT 3: Hey thanks everyone for commenting, and for the ones who really skeptical of the results because the post was AI generated. u/the__storm u/Special_Animal2049 kevin_1994 I really appreciate your criticisms, and I should have been more upfront about this. So to remedy this I have posted the scripts that produced these results and the raw data themselves, you can find them here: https://github.com/gaztrabisme/llm-server/tree/main/docs/dev
EDIT 4: u/OsmanthusBloom caught that the community VRAM table incorrectly listed the 27B dense model for the 8 GB and 12 GB rows. Both sources actually ran the 35B MoE with CPU offload.
--- TOP COMMENTS ---
Is that 4283 tokens of text to say : "if my model doesn't fit in my vram, speed is bad" ?
Is this AI generated? Your tk/s is way too low. Secondly, assuming we're in headless mode, you DO NOT need to reserve 1536MB VRAM. KV cache is already accounted for fitting heuristics in llama. Set it as low as you can without going OOM. 128MB works for me.
I have the same setup except a much shittier DDR4 RAM. At 131k ctx size, I can reach 70 tk/s with the same GPU, model and llama params. In fact, with --fit-target=128MB, the speed is 80 tk/s. And 65 tk/s at 256k ctx size.
And there's still plenty of room for improvement to eek out even more performance.
As for 27b, instead of IQ3, I suggest using the IQ4 quant as described here.
Also, I do agree that MTP is largely useless for 16GB VRAM.
Models
Gemini 3.5 Flash ranks #1 on Automation Bench (from Zapier), beating every other frontier model at a much lower cost
Read more Read less--- TOP COMMENTS --- This disagrees with my agenda. Downvoted
I'm curious why the Artificial Analysis benchmark run was so expensive for 3.5 Flash. It's cost-to-performance seems really variable?
Products
Google is officially replacing Vertex AI with the new "Gemini Enterprise Agent Platform"
Read more Read lessJust wanted to share an important Update for AI & Cloud Learners
Google is shifting from a traditional AI platform toward a complete Agentic AI ecosystem focused on autonomous AI agents and enterprise workflows.
Key highlights:
This marks a major shift in Google Cloud’s AI strategy toward Agentic AI and enterprise automation.
If you are currently learning or working with Vertex AI, it’s important to start exploring the Gemini Enterprise Agent Platform moving forward.
Have seen that, GCP ACE exam is going to revamped absed on this Gemini Enterprise Rebranding.
--- TOP COMMENTS --- Fantastic. That'll be the fourth rebrand in 2 years.
The shift to agents is real but nobody's talking about the governance nightmare. You've got autonomous systems making decisions in prod and most teams have zero visibility into why they're doing what they're doing. That's the actual blocker for enterprise adoption right now, not the platform itself.
OpenAI Guaranteed Compute
Read more Read lessOpenAI recently announced it is guaranteeing compute capacity for companies that sign 1-3 year deals.
https://openai.com/business/guaranteed-capacity/
What struck me as interesting is they’re willing to give companies discounts in exchange for term. In a normal industry that isn’t unusual; however, the model companies often talk about compute demand as if it’s effectively limitless and stating the obvious… companies don’t typically give discounts if they’re supply constrained.
So… my question is do you think OpenAI has overbuilt capacity (originally geared at consumer) and is now trying to backfill with enterprise? Do you think this is a play at stealing customers from Anthropic because the Anthropic is/was compute constrained? Both? Neither? Good or Bad strategy from OpenAI?
--- TOP COMMENTS --- I don’t believe they over built capacity, I believe they are looking for cash.
They are trying to improve their books as much as possible before they go public. Saying they have billions in commitments from large enterprise customers is a much better story than we believe people will pay hundreds monthly for hour next model with even better image generation capabilities.
OpenAI has orders for compute going 4 years ahead of time, on cards 3 generations ahead. They know compute will be in shortage, so they are providing compute stability in a market that will be plagued by compute shortages. This is why like a year ago everyone was talking how OpenAI bankruptcy is imminent, because they ordered compute very early on. Now, they are benefiting from that future stability they secured.
Companies
Anthropic-SpaceX deal seems much larger than previously reported
Read more Read lessI was reading SpaceX's prospectus which just dropped. Seems like it has some additional info about the Anthropic-xAI deal on p. 13. Anthropic is paying SpaceX 1.25B/mo for some unspecified amount of capacity between Colossus 1 and 2. Colossus 1 we've previously known about, Colossus 2 seems new. Well, this seems like a much bigger deal than was originally reported 2 weeks ago? 1.25B/mo is 15B/year, which is almost half of Anthropic's ARR even after it exploded in Q1 this year.
Also seems like Anthropic is likely paying a pretty hefty premium for this compute. Based on Colossus 1 GPU counts and going off of Nebius pricing, Colossus 1 should rent for about 6.4B/year, and that's on-demand pricing from a provider to a rando, a proper long term contract should be a lot cheaper. A couple weeks ago it seems like people were guessing the deal was around 3-5B/year for Colossus 1, which seems about right. Imo, they're probably getting a smaller chunk of Colossus 2 because
Which means Anthropic is likely paying a hefty premium for this deal. Probably shouldn't surprising given how axed they clearly are for compute, this is well reported.
That amount of money would also explain why Musk would do a 180 on Anthropic so quickly...
--- TOP COMMENTS --- It’s nonsense
Anthropic has 30bn in annual revenue and can’t pay 15bn a year as you say
The deal has a 90 day cancellation period and may and June is discounted
The only thing the banks and VCs and Elon care about at present time is maximising the IPO price
Don’t believe a word is my advice
Elon bought too many servers and GPUs and they are depreciating in price every day
Anthropic is also IPOing this year and may have the same VCs and banks involved…
Question everything and read the small print on everything
Could be wallpaper - sign a contract with if lated costs to push up prices and later agree to reduce prices. Win win for both - spaceX gets higher revenue to book - later when anthropic wants to ipo spaceX return the favor by reducing their costs.
Anthropic is officially set to be profitable as of Q2 2026
Read more Read less500 Million in Profit.
https://www.wsj.com/tech/ai/mind-blowing-growth-is-about-to-propel-anthropic-into-its-first-profitable-quarter-7edbf2f4
--- TOP COMMENTS --- No wonder Google is investing $40bn more into it at $330bn valuation. New era Berkshire.
r/technology posters on suicide watch. Now, what will the rehearsed line they will use be to dismiss LLMs?
Mark Zuckerberg’s Meta kicks off major bloodbath with 8,000 layoffs (about 10% of its workforce) as AI roils tech giant
Read more Read lessThe companywide purge is taking place in three massive waves, as employees across the world are notified in emails at 4 a.m. local time in their respective regions.
Singapore staffers were the first to receive the doomsday emails.
--- TOP COMMENTS --- "roils"...
This isn't a disruption. The company isn't upset by this.
This is business as usual now. It is a BENEFIT of AI adoption from the company's POV. This is a MAJOR WIN they're touting to investors.
There will probably be an ongoing 10-20% reduction in headcount per year from now on at every major organization in the world.
wtf are they even building that requires $200B for AI at meta?
Infrastructure
"AWS secures rare Mac Studios while ordinary Apple customers remain completely locked out"
Read more Read lesshttps://www.techradar.com/pro/you-cant-buy-them-for-your-home-or-office-but-aws-just-snapped-up-a-host-of-apples-most-highly-desired-m3-ultra-macs
Let them eat cloud!
--- TOP COMMENTS --- "You will own nothing and be happy" model is cancer for everything in modern economy.
Are these really used in data center setting?
Developer Tools
Google just killed the editor in Antigravity V2. Are we really supposed to be "Agent Managers" now?
Read more Read lessHappened today... here is the short story:
With the smell of fresh coffee on my desk, I watched the IDE update finish today, eager to check out a feature branch, knock out a PR review, and get back to work.
The window loaded. The editor-centric workflow I’ve used for years was gone.
Instead, I was staring at a standalone "Agent Manager" desktop app.
Am I the only one who thinks this is a massive step backward for actual engineering?
Problems I see with this:
Worse, the biggest lie in this new "Agent Manager" era is that AI can write good code on its own.
My take: It can't.
Second point: How was I supposed to review the code for my colleague?
--- TOP COMMENTS --- you can download Antigravity IDE, Which is a separate product now
"The window loaded. The editor-centric workflow I’ve used for years was gone."
Google Antigravity, was initially released November 18, 2025 - so 184 days ago ?
LM Studio finally added support for MTP Speculative Decoding
Read more Read lesshttps://preview.redd.it/1uuzjm0ll72h1.png?width=923&format=png&auto=webp&s=1af7d7594be1e08ff7ad6797e2bc53e9410769a3
update to 0.4.14 Build 2 (Beta) and make sure your llama.cpp engine is 2.15.0
https://preview.redd.it/x0vdwjb3n72h1.png?width=742&format=png&auto=webp&s=6367de44208004d2f50194d78a542c46b040dceb
you also must select "Manually choose model load parameters" and enable MTP in those before loading the model it is NOT on by default
--- TOP COMMENTS --- Here's my informal benchmarks using Unsloth's Qwen3.6-35B-A3B MTP UD-Q6_K_ML quant on a Windows 11 computer (AMD 3900x [12c/24t], 128GB of DDR4-3400, and NVidia 2060 Super 8GB) with 8192 context:
An optimized llama-server smokes LM Studio. Even the pre-built llama-cpp binaries on Github smoke LM Studio. It's enough speed-up to turn a barely-usable model into a productive daily driver.
I was using LM Studio until last month. It was my daily tool for running LLMs. But once I tried llama.cpp out of curiosity, I couldn’t go back to using LM Studio. The difference in optimizations and available flags is huge, IMHO.
Please write a prompt to minimize sycophancy, taking sides, flattering, echo-chamber, "yes-man", assumptions, and improve objectivity, brutal honesty, neutrality, and real-world verity.
Read more Read lessIt is well known that LLMs can over acknowledge, agree, flatter, and please its subscriber or primary user. This can result in the disservice to the user when they only receive agreements rather than being appropriately challenged. This is particularly notable when LLMs are used for quasi-counseling or analyzing discussions between two people.
As such, please help me write a prompt to instruct any LLM to cut it out! No sycophancy, taking sides, flattering, echo-chamber, "yes-man", assumptions, and improve objectivity, brutal honesty, neutrality, and real-world verity.
Thank you.
--- TOP COMMENTS --- Someone criticized Marc Andersen's prompt but didn't just say it was bad. They gave reasons. They came up with a better prompt. I took that prompt and told ChatGPT to make it better. I then told ChatGPT to figure out the failures and improve the prompt in a loop of 10 iterations. It came up with this and the problems it found were minor.
I used this at work. Previously, the LLM struggled to consistently name the root cause of a performance problem. Now, it states the root cause every time.
## Primary GoalUse structure only when it improves comprehension, navigation, or execution.Avoid:- filler- rhetorical padding- performative sophistication- unnecessary abstraction- repetitive framing- verbosity without informational gain- excessive caveats that do not materially affect conclusions## ActionabilityWhen appropriate:- translate analysis into concrete next steps- prioritize recommendations- identify dependencies and blockers- clarify implementation risks- distinguish explanation from recommendation and speculationFor irreversible or high-cost actions:- increase verification rigor- surface major uncertainties earlier- prefer conservative recommendations when evidence is weak## High-Impact DomainsApply elevated care for:- legal- medical- financial- security- operational- procedural- numerical claimsVerify fragile or high-consequence facts when accuracy materially matters.## ConfidenceExpress confidence only when decision-relevant.Tie confidence to specific claims or conclusions.Use:- high- moderate- low- unknown## Default Style- Be concise unless additional depth materially improves outcomes.- Preserve necessary nuance without unnecessary expansion.- Compress aggressively when additional detail adds little practical value.I’d avoid telling it to be “brutally honest,” because that can just make it perform harshness.
A better prompt is something like: “Separate facts, inferences, and guesses. For each claim, say what evidence would change your mind. Give me the strongest opposing interpretation before your recommendation.” That usually reduces the yes-man behavior without turning the model into a debate bro.
Hardware
RTX 5080 16GB: Qwen3.6 35B MoE at 128k context — 56 tok/s, and why MTP doesn't help
Read more Read lessMTP (Multi-Token Prediction) just merged into mainline llama.cpp at b9190. I promised u/WarthogConfident4039 a Qwen3.6 benchmarking round. Three configs, tested at real coding-agent context lengths (not just 512 tokens). The main finding surprised me.
TL;DR: 35B Q4_K_XL, no MTP,
--fit-target 1536**, 131k context. That's the config.** 56 tok/s generation, 1,584 tok/s prompt processing at 128k context. MTP doesn't help at 128k — both converge to the same speed. Skip the complexity. The 27B IQ3 is worth considering if 56k context is enough for you (or if you have a 12 GB card where the 35B won't fit).The Configs
Config 27B IQ3+MTP (A) 35B Q4_K_XL+MTP (B) 35B Q8_0+MTP (C) Model Qwen3.6-27B MTP-UD-IQ3_XXS Qwen3.6-35B-A3B MTP-UD-Q4_K_XL Qwen3.6-35B-A3B MTP-Q8_0 Size 12.45 GB ~22 GB ~36 GB Source GazTrab havenoammo Grafted GPU fit Fully on GPU (66/66) Partial offload Heavy offloadAll tests on: RTX 5080 16GB, Ryzen 9 9950X, 128GB RAM, llama.cpp b9204 (mainline).
Common MTP flags:
-np 1 --fit on -fa on -t 20 --no-mmap --jinja -ctk q8_0 -ctv q8_0 --spec-type draft-mtp --spec-draft-n-max 2Results
Speed — The MTP Surprise
With MTP (mtp-bench, 9 prompt types)
Metric 27B IQ3 35B Q4_K_XL 35B Q8_0 Avg tok/s 73 74 46 Peak tok/s 83 (code) 86 (translation) 51 MTP accept 74.4% 79.5% 80.1% --fit-target 0 1536 1536The surprise: 35B is FASTER without MTP
35B Q4_K_XL config --fit-target MTP? Avg tok/s VRAM used Best (no MTP) 0 No 97 15,815 MiB Same VRAM budget 1536 No 86 14,269 MiB MTP enabled 1536 Yes 74 14,623 MiBMTP is 23% slower for the 35B MoE on 16GB. Why?
--fit-target 1536to reserve ~1.5 GB for the MTP compute bufferFor the 27B, MTP helps because the model fits entirely on GPU (12.45 GB) —
--fit-target 0works with and without MTP, so there's no VRAM penalty. The 27B goes from ~56 tok/s (no MTP, older builds) to 73 tok/s with MTP.Rule of thumb: MTP helps when your model fits on GPU. It hurts when the MTP compute buffer forces more layers to CPU.
Speed at Coding-Agent Context Lengths (the real test)
Everyone runs coding agents at 128k. Here's what actually happens as you fill the context window. Tested with synthetic prompts (Python classes, architecture docs, error stack traces — varied enough to prevent tokenizer compression), prompt cache disabled, 35B Q4_K_XL with
Context PP (no MTP) PP (MTP) TG (no MTP) TG (MTP) ~8k 1,855 tok/s 1,712 tok/s 73 tok/s 79 tok/s ~32k 1,810 tok/s 1,674 tok/s 74 tok/s 70 tok/s ~64k 1,723 tok/s 1,583 tok/s 67 tok/s 76 tok/s ~128k 1,584 tok/s 1,437 tok/s 56 tok/s 56 tok/s--fit-target 1536:8k/32k TG measured in a separate run from 64k/128k — expect ~5-10% variance between rows from measurement noise.
At 128k context, MTP and no-MTP converge to the same TG speed (~56 tok/s). The KV cache fills VRAM at long context regardless of MTP, so the offload split ends up identical. MTP's multi-token speculation is offset by its compute overhead.
PP degrades gracefully: 1,855 → 1,584 tok/s from 8k to 128k (~15% decline). A 128k prompt processes in ~81 seconds.
The "97 tok/s" only exists at short context with
--fit-target 0. At 64k+,--fit-target 0OOMs because there's no headroom for KV cache growth. You must use--fit-target 1536for long-context work, which brings speed down to ~73 tok/s at short context and ~56 tok/s at 128k.Bottom line for coding agents: expect ~56 tok/s TG and ~1,500 tok/s PP at 128k context on 16GB. MTP is a wash — doesn't help or hurt at full context.
VRAM Usage
Config VRAM used VRAM free Notes A (27B IQ3+MTP) 14,803 MiB 1,039 MiB Fully on GPU, fit-target 0 B (35B Q4_K_XL+MTP) 14,623 MiB 1,219 MiB Partial offload, fit-target 1536 B (35B Q4_K_XL, no MTP) 15,815 MiB 27 MiB Maximum GPU layers, fit-target 0 C (35B Q8_0+MTP) 14,567 MiB 1,275 MiB Heavy offload, fit-target 1536Context Limits (push to OOM)
Limit 27B IQ3 35B Q4_K_XL 35B Q8_0 Max ctx (q8_0 KV) 56k 131k+ 131k+ Max ctx (q4_0 KV) 110k 131k+ 131k+ Speed at max ctx 80.5 / 57.2 56 45This is the biggest differentiator. The 35B MoE handles 131k context easily because its hybrid architecture (Gated DeltaNet + Attention) only has ~10 full-attention layers that need KV cache. The remaining SSM layers use a tiny recurrent state. The 27B dense model has KV on every layer, so it maxes out at 56k with q8_0 KV.
Tip for 27B users: switching from
-ctk q8_0 -ctv q8_0to-ctk q4_0 -ctv q4_0extends your max context from 56k → 110k. Quality cost is minimal: q4_0 KV at 56k scores 218/220 CodeNeedle vs 220/220 with q8_0 KV (q4_0 at regular context: 219/220 — so most of the 2-line drop is from q4_0 itself, not the longer context).The OOM at higher contexts is the MTP compute buffer (529 MiB fixed allocation), not the KV cache itself. This is a llama.cpp implementation detail that may improve in future versions.
Quality — CodeNeedle (positional recall)
11 functions from Python's http.server, ~50k char corpus, testing exact line-level recall:
Metric 27B IQ3 35B Q4_K_XL 35B Q8_0 Pass 11/11 11/11 11/11 Lines matched 220/220 217/220 216/220 Hallucinations 0 1 1The 27B IQ3 has a perfect score — every line exact, zero hallucinations. The 35B models are close but not quite there. Interesting that Q8_0 doesn't beat Q4_K_XL here.
Quality — GSM8K (grade school math, 100 cases)
Metric 27B IQ3 35B Q4_K_XL 35B Q8_0 Accuracy 89% 91% 90% CI (95%, excl. truncated) [86.9%, 97.1%] [84.9%, 95.8%] [85.8%, 96.5%] Truncated 5 1 3 Wall time 106 min 67 min 114 minAll three overlap in confidence intervals — the quality difference is negligible. But the 35B Q4_K_XL is 37% faster to evaluate (67 vs 106 min) with fewer truncations.
Note: AIME2025 was also tested on the 27B — 50% overall but 100% on non-truncated cases*. Every failure was context exhaustion at 32k, not wrong reasoning. The 35B MoE with 131k context would likely score higher.*
Ubatch PP Trick (coder543, May 18)
u/coder543 discovered that increasing
-ubfrom 512→8192 gives 5.5x prompt processing speedup for--n-cpu-moepartially offloaded models. I tested this on the 35B:Result: doesn't apply with
--fit on**.** The-ub 2048+OOMs because--fit onalready maximizes VRAM for model layers — no headroom for larger batch buffers. If you use--n-cpu-moemanual offload instead, the trick works. But--fit onis simpler and handles the split automatically.Concurrency (-np sweep)
Tested
-np 27B tok/s 27B throughput 35B tok/s 35B throughput 1 83.3 0.6 cases/min 70.7 0.8 cases/min 2 57.7 1.3 cases/min 49.7 1.1 cases/min 4 10.0 (CPU overflow) 0.6 cases/min 28 failed-np 1/2/4on 10 GSM8K cases:-np 2doubles batch throughput at 30% slower per-request speed.-np 4pushes layers to CPU — 27B drops to 10 tok/s, 35B partially fails. Use-np 1for interactive chat,-np 2for batch evaluation.MTP Reference (for 27B / fully-on-GPU setups)
MTP is worth it when the model fits entirely on GPU (no offload penalty). For the 27B IQ3 on 12GB: 73 tok/s with MTP vs ~56 without. For the 35B on 16GB: skip it (see speed table above).
If you do use MTP:
--spec-type draft-mtp— notmtp. Mainline renamed it.-np 1— b9204 defaults to 4 slots which pushes layers to CPU.--spec-draft-n-max 2beats 3 (lower acceptance at 3 = slower overall).--fit-target 1536for partial-offload models.--fit-target 0for fully-on-GPU.Other notes:
-khad) is enabled by default since b8607 — no flag needed.-np 2doubles batch throughput at 30% slower per-request. Good for eval, bad for interactive.Recommendation
The Config (just copy this)
No MTP. No special flags.
--fit-target 1536is the key — it reserves VRAM headroom so the KV cache doesn't OOM at 128k. Load it, leave it running, point your coding agent atlocalhost:8080/v1/chat/completions.What you get: 56 tok/s generation at 128k context. 1,584 tok/s prompt processing (81s to ingest 128k tokens). 131k max context. GSM8K 91%. Stable.
Why no MTP? At 128k context both MTP and no-MTP give the same 56 tok/s — the KV cache dominates VRAM either way. MTP adds 5 gotchas for zero benefit. Skip the complexity.
GGUF: havenoammo/Qwen3.6-35B-A3B-MTP-GGUF (the MTP GGUF works fine without
--spec-type draft-mtp— it just ignores the extra tensors).27B GGUF: GazTrab/Qwen3.6-27B-MTP-UD-IQ3_XXS-GGUF
Other VRAM budgets (community data, not tested by us)
Everything above was tested on our RTX 5080 16GB. These estimates for other GPUs are from community reports:
VRAM Model Speed Source 8 GB 35B MoE Q2_K_XL+MTP ~50 tok/s (est.) u/Still-Notice8155 (GTX 1070,-fit off --n-cpu-moe 32) 12 GB 35B MoE Q4_K_XL+MTP ~73-80 tok/s u/janvitos (RTX 4070 Super 12GB) 16 GB 35B Q4_K_XL 56 tok/s @ 128k This post (RTX 5080) 24 GB 35B Q4_K_XL (no MTP) ~90+ tok/s (est.) Model is ~22 GB, fits fully on GPU with headroom for KVThe 27B IQ3+MTP needs the MTP head grafted —
graft-mtp.pyin the repo.Why not the others?
27B IQ3 — We tested it on our 16GB card where it fits fully on GPU (12.45 GB model). Perfect CodeNeedle (220/220), 73 tok/s with MTP (GGUF). But it caps at 56k context (110k with q4_0 KV). If your coding agent needs 128k, it's out. Better suited for 12 GB cards where the 35B won't fit.
35B Q8_0 — 38% slower (46 tok/s with MTP), negligible quality gain (GSM8K 90% vs 91%, overlapping CIs). Not worth the VRAM on 16 GB.
Credits
This post exists because of the community:
--n-cpu-moe(May 18)What's Next
vLLM vs llama.cpp head-to-head. vLLM >= 0.19.0 supports MTP natively with PagedAttention (dynamic KV allocation — no fixed compute buffer eating VRAM). Could make MTP actually faster for partial-offload models. Stay tuned.
EDIT: u/Look_0ver_There — corrected 24 GB VRAM table (Q8_0 is 36 GB, doesn't fit)
EDIT 2: u/FusionX correctly points out that --fit-target 1536 is too conservative for headless setups. My machine runs a desktop compositor + terminal that eats ~1 GB VRAM before the model loads. If you're running headless, --fit-target 128 keeps more expert layers on GPU. FusionX reports 70-80 tok/s at 131k context on the same GPU with this setting. I'll re-benchmark with a lower fit-target and update. The recommended config is adjust --fit-target down if you're headless.
EDIT 3: Hey thanks everyone for commenting, and for the ones who really skeptical of the results because the post was AI generated. u/the__storm u/Special_Animal2049 kevin_1994 I really appreciate your criticisms, and I should have been more upfront about this. So to remedy this I have posted the scripts that produced these results and the raw data themselves, you can find them here: https://github.com/gaztrabisme/llm-server/tree/main/docs/dev
EDIT 4: u/OsmanthusBloom caught that the community VRAM table incorrectly listed the 27B dense model for the 8 GB and 12 GB rows. Both sources actually ran the 35B MoE with CPU offload.
--- TOP COMMENTS --- Is that 4283 tokens of text to say : "if my model doesn't fit in my vram, speed is bad" ?
Is this AI generated? Your tk/s is way too low. Secondly, assuming we're in headless mode, you DO NOT need to reserve 1536MB VRAM. KV cache is already accounted for fitting heuristics in llama. Set it as low as you can without going OOM. 128MB works for me.
I have the same setup except a much shittier DDR4 RAM. At 131k ctx size, I can reach 70 tk/s with the same GPU, model and llama params. In fact, with
--fit-target=128MB, the speed is 80 tk/s. And 65 tk/s at 256k ctx size.And there's still plenty of room for improvement to eek out even more performance.
As for 27b, instead of IQ3, I suggest using the IQ4 quant as described here.
Also, I do agree that MTP is largely useless for 16GB VRAM.
Applications
Claude is improving my RV rental business but working me to death 😅
Read more Read lessLong story short but long. I own an RV rental business. I used to be a Mechanical Engineer but got tired of the office/government life and started renting my personal RV on the side 9 years ago. That turned into a small fleet of Winnebagos I rent out of Los Angeles so I quit my job to do this full time out of a random ass whim.
I have 20 units that have never, ever failed a single customer. I send all 20 to Burning Man every year and they all come back with no issues whatsoever. If you've never been, the alkaline dust kills everything, including your soul if you don't prepare well enough.
I have however neglected my gig as of late. Everything is more expensive, too many variables to keep up with and two months ago I just decided to finally sit down and see if this is even worth continuing with.
I have major ADHD so I started looking for any AI apps that help you organize your brainfarted life and ran into Claude.
I don't know if I just fell into an endless dopamine trap but here I am, redesigning the interior of one of our units. I've sourced cabinet quality plywood for cheap, done precision cuts to substitute old particle board. I've always hated to paint but I got clowned into spray painting to a decent AF level. I used Claude to help me make interior design decisions as well as help me with our website, ads, tool decisions, etc.
I'm probably wasting my time here cause I could just sell this unit and get a newer one, but the overall picture I've gotten... The ease of learning new skills, understanding roles I typically sub out so I can at least make sure I'm hiring the right people. The sudden engagement I've gotten into my own little gig...
I am dead tired from this rollercoaster ride my brain has gone down into but I have to admit... This fucking Skynet shit is helping me focus and make it easy to complete tasks I've neglected forever.
Skynet is coming or I guess it's here already and I'm not sure that's entirely a bad thing, a worse thing, a worserererer thing or an actual positive addition to one's life. Possibly a mix of both but fuck I haven't been this locked in for anything else other than the hobby that keeps my brain gears greased (2000 🪂 skydives and counting).
--- TOP COMMENTS --- ADHD + AI is genuinely dangerous. can't beat procrastination, you just go to having 14 urgent projects instead of 1
Can you tell your workflow?
I’m renovating a house and discovered AI are bad when it comes to “spatial” projects (at least Claude and Gemini). They can’t really design an interior space and struggle with custom made furniture.
Research
OAI researcher on Erdos problem: “This is the biggest deal in the history of AI so far. And it will look like a small deal at the end of the year.” (Buckle up)
Read more Read lessLink to tweet:
https://x.com/Houda_nait/status/2057240025725894663?s=20
Link to Erdos problem:
https://openai.com/index/model-disproves-discrete-geometry-conjecture/
https://x.com/OpenAI/status/2057176201782075690?s=20
--- TOP COMMENTS --- The scientist version of "I worked on it for years and he just...tweeted it out."
A bit inaccurate, because the problem is not solved, but the lower bound is improved (and many thought the old lower bound was the truth)
OpenAI claims a general-purpose reasoning model found a counterexample to Erdos's unit-distance bound [D]
Read more Read lessOpenAI posted a math result today claiming that one of its general-purpose reasoning models found a construction disproving the conjectured n^{1+O(1/log log n)} upper bound in Erdős’s planar unit-distance problem.
Announcement:
https://openai.com/index/model-disproves-discrete-geometry-conjecture/
Proof PDF:
https://cdn.openai.com/pdf/74c24085-19b0-4534-9c90-465b8e29ad73/unit-distance-proof.pdf
Abridged reasoning writeup:
https://cdn.openai.com/pdf/1625eff6-5ac1-40d8-b1db-5d5cf925de8b/unit-distance-cot.pdf
The mathematical claim, as I understand it, is that there are finite planar point sets with more than n^{1+δ} unit distances for some fixed δ > 0 and infinitely many n. That would rule out the expected near-linear upper bound, though it does not determine the true asymptotic growth rate.
What seems especially relevant for this subreddit is the process claim: OpenAI says the solution was produced by a general-purpose reasoning model, then checked by an AI grading pipeline and reviewed/reworked by mathematicians. The proof PDF also includes the original prompt given to the model, but not the full experimental details: no model name, sampling setup, number of attempts, compute budget, hidden system prompt, or full grading pipeline.
Curious how people here read this as an ML result. Is this best viewed as evidence of frontier models doing genuine autonomous research, or as a cherry-picked but still important sample from a large search process? What kind of disclosure would you want before treating this as a reproducible AI-for-math milestone?
--- TOP COMMENTS --- Why would you trust anything OpenAI claims about its capabilities given how wrong they were before until there’s details?
Ai Safety
⚠️ Glendale College AI skipped dozens of names at graduation
Read more Read lessOn May 15, an AI text-to-speech system deployed at Glendale Community College's graduation ceremony malfunctioned and skipped dozens of graduates' names. College President Tiffany Hernandez publicly confirmed from the stage that artificial intelligence managed the process, calling the incident a "good lesson," which resulted in mass protests from the audience.
The text displayed on the screen did not match the individuals on stage, and the audio system completely shut down in many instances. The administration initially refused to pause the ceremony, telling students their names would not be announced again. Due to strong audience dissatisfaction, management reversed the decision within minutes, brought the students back on stage, and had a live announcer read their details. According to a March 2026 study by the Pew Research Center, 50% of the US population remains skeptical about integrating such technologies into daily systems.
The incident highlights the operational risks of integrating automated technologies in educational institutions. Similar protests occurred this May at other US universities, including the University of Arizona and the University of Central Florida, where graduates openly criticized speakers advocating for artificial intelligence adaptation in the job market.
Source:https://futurism.com/artificial-intelligence/ai-name-reader-flops-college-graduation
--- TOP COMMENTS --- this is exactly why you don't use experimental tech for something as important as graduation ceremony. these students worked years for this moment and some ai system decides to skip their names?
the fact they initially refused to fix it makes it even worse - took audience getting angry before they bothered bringing students back on stage. maybe stick to human announcers for events that actually matter to people
This is exactly the kind of thing people mean when they say AI should have a human fallback.
Skipping names at graduation isn’t some tiny bug lol, that’s a once in a lifetime moment for people and the system just completely fell apart.
Opinion And Analysis
I think we’re reaching the limit of brute-force context stuffing
Read more Read lessThe more I work with coding agents, the more it feels like raw context injection scales badly.
Issue with huge prompts:
What seems more promising is persistent structured memory like
Feels like the industry is slowly rediscovering that retrieval quality matters more than sheer context size.
Curious if others are seeing the same thing in production workflows.
--- TOP COMMENTS --- I’ve been enjoying shorter, higher level prompts, and using prompts injection. Rather than stuffing everything in a single massive “do the thing” prompt
It’s easy for the agent to do its job when it’s provided with exactly the information it needs wherever it’s looking in your codebase
I think we have what we do now because it works better than you’d expect at first, but the illusion breaks when you start hitting the limits during a long session.
I am think the future is a mix between a stateless context engine that can maximize token sweet spot efficiencies and pull in maximum possible context you need for the turn for the intent and attention trajectory. It would also be great to be able for context to be inspectable and tweakable. So you could ensure some context is getting the attention it needs rather than the llm treating the whole shifting blob equally.
Attention shaping adaptors delta-mem style are also pretty cool for a fuzzy non context window way of getting more correct answers/attention trajectory (tool use chains) for your context and project.
The best system I am imagining is a mixture between both of these but I don’t see anyone else working on them much.