I benchmarked Qwen 3.6, Qwen 3.5, and 5 other models across 5 agent frameworks on Apple Silicon — here's the full compatibility matrix
**Hardware:** Apple M3 Ultra, 256GB unified memory
**Frameworks tested:** Hermes Agent (64K stars), PydanticAI, LangChain, smolagents (HuggingFace), OpenClaude/Anthropic SDK
**Models tested:** Qwen 3.6 35B (brand new), Qwen 3.5 35B, Qwopus 27B, Qwen 3.5 27B, Llama 3.3 70B, DeepSeek-R1 32B, Gemma 4 26B
# The Agent Compatibility Matrix
This is the part I wish existed before I started. Each cell = pass rate across structured tool calling tests (single tool, multi-tool selection, multi-turn, streaming, stress test, many-tools injection, no-leak check).
|Model|Hermes|PydanticAI|LangChain|smolagents|OpenClaude|**Speed**|
|:-|:-|:-|:-|:-|:-|:-|
|**Qwen 3.6 35B** (4bit)|100%|100%|93%|100%|100%|**100 tok/s**|
|**Qwen 3.5 35B** (8bit)|100%|100%|100%|100%|100%|**83 tok/s**|
|**Qwopus 27B** (4bit)|100%|100%|100%|100%|100%|38 tok/s|
|**Qwen 3.5 27B** (4bit)|100%|100%|100%|—|—|38 tok/s|
|**Gemma 4 26B** (4bit)|100%|67%|—|100%|80%|\~40 tok/s|
|**DeepSeek-R1 32B** (4bit)|55%|50%|—|100%|40%|\~30 tok/s|
|**Llama 3.3 70B** (4bit)|45%|67%|67%|100%|—|\~20 tok/s|
**Key takeaway:** The Qwen family dominates tool calling — every Qwen model is at or near 100% on every framework it was tested with. Non-Qwen models are a coin flip depending on which framework you use.
# Speed Benchmarks (decode tok/s, same hardware)
|Model|RAM|Speed|Tool Calling|Best For|
|:-|:-|:-|:-|:-|
|Qwen3.5-4B (4bit)|2.4 GB|**168 tok/s**|100%|16GB MacBook, fast iteration|
|GPT-OSS 20B (mxfp4)|12 GB|**127 tok/s**|80%|Speed + decent quality|
|Qwen3.5-9B (4bit)|5.1 GB|**108 tok/s**|100%|Sweet spot for most Macs|
|**Qwen 3.6 35B** (4bit)|\~20 GB|**100 tok/s**|100%|NEW — 256 experts, 262K ctx|
|Qwen3.5-35B (8bit)|37 GB|**83 tok/s**|100%|Best quality-per-token|
|Qwen3.5-122B (mxfp4)|65 GB|**57 tok/s**|100%|Frontier-level, 96GB+ Mac|
For reference, Ollama gets \~41 tok/s on Qwen3.5-9B on the same machine, versus 108 tok/s here — roughly 2.6x faster on that model.
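If you want to sanity-check numbers like these, "steady-state decode" just means counting tokens after the first one (which absorbs prompt-processing latency) over the time they took to arrive. A rough helper of my own, not code from the repo:

```python
def decode_tok_s(token_times):
    """Steady-state decode rate in tokens/second.

    token_times: arrival timestamps (seconds) of each streamed token.
    The first token is dropped because its latency includes prompt
    processing, which would drag the decode number down.
    """
    if len(token_times) < 3:
        raise ValueError("need at least 3 token timestamps")
    steady = token_times[1:]
    elapsed = steady[-1] - steady[0]
    return (len(steady) - 1) / elapsed

# 101 tokens arriving 10 ms apart after a slow 300 ms first token
times = [0.0] + [0.3 + 0.01 * i for i in range(101)]
print(round(decode_tok_s(times)))  # 100
```

Including the first token in that example would report \~77 tok/s instead of 100, which is why first-token latency is excluded.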
# Model Quality Baselines (HumanEval + tinyMMLU)
Speed isn't everything — here's how the models do on code generation and knowledge:
|Model|HumanEval (10)|MMLU (10)|Tool Calling|MHI Score|
|:-|:-|:-|:-|:-|
|**Qwopus 27B**|80%|90%|100%|**92**|
|**Qwen 3.5 27B**|40%|100%|100%|**82**|
|**Qwen 3.5 35B** (8bit)|60%|40%|100%|**76**|
|**Qwen 3.6 35B** (4bit)|20%|30%|100%|**62**|
|**Llama 3.3 70B**|50%|90%|varies|**56-83**|
|**DeepSeek-R1 32B**|30%|100%|varies|**49-79**|
MHI = Model-Harness Index: 50% tool calling + 30% HumanEval + 20% MMLU. Measures "how well does this model work as an agent backend."
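For anyone who wants to recompute the index, it's just the weighted sum described above (my own one-liner, not the repo's `mhi_eval.py`):

```python
def mhi(tool_calling, humaneval, mmlu):
    """Model-Harness Index: 50% tool calling + 30% HumanEval + 20% MMLU.
    All inputs are percentages (0-100)."""
    return 0.5 * tool_calling + 0.3 * humaneval + 0.2 * mmlu

print(mhi(100, 80, 90))  # Qwopus 27B -> 92.0
```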
**Qwen 3.6 note:** The low HumanEval/MMLU is likely a 4-bit quantization artifact on a day-0 release — it came out days ago. Tool calling is flawless though — if you just need an agent backend, it's the fastest option at 100 tok/s with 100% on four of the five frameworks.
# Interesting Findings
1. **Qwen 3.6 is blazing fast** — 100 tok/s on a 35B MoE with 256 experts and 262K context. Only 3B active params means it fits in \~20GB.
2. **smolagents is the most forgiving framework** — even DeepSeek-R1 and Llama 3.3 hit 100% with smolagents because it uses text-based code generation instead of structured function calling. If your model sucks at FC, try smolagents.
3. **Hermes Agent is the hardest test** — 62 tools injected, multi-turn chains, streaming. Models that pass Hermes pass everything.
4. **8-bit > 4-bit for quality** — Qwen 3.5 35B at 8-bit scores 60% HumanEval vs the 4-bit version's lower scores. If you have the RAM, 8-bit is worth it.
5. **Don't use DeepSeek-R1 for tool calling** — it's a reasoning model, not an agent model. 40-55% tool calling rate across frameworks. Great for math though.
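To make finding #2 concrete: the core trick in smolagents' CodeAgent is that the model writes plain code that calls tools directly, so there's no JSON schema for a weak function-caller to get wrong. A toy illustration of the idea — not smolagents' actual implementation, which parses more robustly and sandboxes execution:

```python
import re

def run_code_action(model_output, tools):
    """Toy version of the code-agent idea: extract a fenced code block
    from the model's text and execute it with the tools in scope.

    tools: dict mapping tool names to callables the snippet may use.
    Real frameworks sandbox this; exec() here is for illustration only.
    """
    match = re.search(r"```(?:python)?\n(.*?)```", model_output, re.S)
    if not match:
        raise ValueError("no code block in model output")
    namespace = dict(tools)
    exec(match.group(1), namespace)
    return namespace.get("result")

# A model that can't emit valid JSON tool calls can still do this:
fence = "`" * 3
output = f"Sure, here you go:\n{fence}python\nresult = add(2, 3)\n{fence}"
print(run_code_action(output, {"add": lambda a, b: a + b}))  # 5
```

Writing `add(2, 3)` is squarely in-distribution for any code-trained model, which is plausibly why even DeepSeek-R1 and Llama 3.3 go from coin-flip to 100% under this scheme.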
# How I Tested
All tests use the same methodology:
* **Tool calling:** 7-11 API tests per harness — single tool, tool choice, multi-turn with tool results, streaming tool calls, many-tools injection (62 tools for Hermes), stress test (5 rapid calls checking for tag leaks), no-tool-needed (model should answer directly)
* **Framework-specific:** Each framework's own test suite (PydanticAI structured output, LangChain with\_structured\_output, smolagents CodeAgent + ToolCallingAgent)
* **HumanEval:** 10 tasks via completions endpoint, temp=0
* **MMLU:** 10 tinyMMLU questions via completions endpoint
* **Speed:** Measured at steady-state decode, not first-token
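The no-leak check is the simplest of these to illustrate: scan the final user-facing answer for chat-template or tool-call markup that should have been consumed by the parser. The marker list below is illustrative — it is not the repo's actual list:

```python
# Illustrative leak markers: chat-template / tool-call tags that should
# never appear verbatim in the assistant's plain-text answer.
LEAK_MARKERS = ("<tool_call>", "</tool_call>", "<|im_start|>", "<|im_end|>")

def has_tag_leak(text):
    """True if raw template markup leaked into user-facing text."""
    return any(marker in text for marker in LEAK_MARKERS)

print(has_tag_leak("The weather in Berlin is 18 C."))  # False
print(has_tag_leak('<tool_call>{"name": "get_weather"}</tool_call>'))  # True
```

Leaks like this tend to show up under the stress test (rapid sequential calls), which is why it checks specifically for them.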
The server is [Rapid-MLX](https://github.com/raullenchai/Rapid-MLX) — an OpenAI-compatible inference server built on Apple's MLX framework. All test code is open source in the repo under `vllm_mlx/agents/testing.py` and `scripts/mhi_eval.py` if you want to reproduce.
# TL;DR
If you're running agents on Apple Silicon:
* **Best overall:** Qwopus 27B (MHI 92, works with everything)
* **Fastest with perfect compatibility:** Qwen 3.6 35B at 100 tok/s
* **Best quality-per-token:** Qwen 3.5 35B 8-bit (60% HumanEval, 100% tools)
* **Budget pick:** Qwen3.5-4B at 168 tok/s on a 16GB MacBook Air
* **Avoid for agents:** DeepSeek-R1, Llama 3.3 (unless you use smolagents)
Happy to answer questions or run additional models if there's interest.
--- TOP COMMENTS ---
Unfair comparisons of mixed quants tbh
---
The smolagents finding is the most useful part of this. Text-based code generation as a proxy for structured tool calling means you can use almost any model as an agent backend, even the ones that fail at JSON function calling.
DeepSeek-R1's 100% on smolagents vs 40-55% elsewhere tells the whole story. If you're building with a model that struggles at FC, smolagents is the workaround...