June 20, 2026 · 7 min read · ← All posts

Gemma 4 31B QAT quietly becomes the best local Gemma planner we have tested

We ran google/gemma-4-31B-it-qat-w4a16-ct through the same frozen WebBrain first-tool-call harness we use for local planner comparisons. The result is small on branding and big in practice: the QAT build improves over the older Gemma 4 31B int4 result, narrowly edges Qwen 3.6 27B on strict first-action quality, and does it with sub-second median latency.

What we ran

The endpoint was a local vLLM server exposing:

model id: gemma-4-31b
root: google/gemma-4-31B-it-qat-w4a16-ct
max context reported by vLLM: 65536

We used the same frozen legacy planner harness as the latest local-model post:

node test/llm/run-llamacpp.mjs \
  --base http://127.0.0.1:8000 \
  --model gemma-4-31b \
  --tag 2026-06-20-vllm-legacy \
  --concurrency 1 \
  --timeout 180000 \
  --no-save-request \
  --freeze test/llm/freeze/baseline-2026-05-23.json \
  --chat-template-compat alternating

That means 100 single-turn browser-agent prompts, the May 23 Claude Sonnet 4.6 WebBrain system prompt and 41-tool schema, legacy text-call compatibility, no native OpenAI tools field, and one active request at a time.

The result files are in:

test/llm/results/2026-06-20-vllm-legacy_chrome_gemma-4-31b_frozen

The result

MetricGemma 4 31B QAT w4a16
Completed cases100/100
Transport errors0
Parsed calls95/100
Exact first-call match19/100
Tool-name match37/100
Average latency0.67s
Median latency0.55s
p95 latency1.07s
Slowest case5.01s
Observed GPU memory~30.1 GB used on RTX 5090

This is the best Gemma-family local planner result we have saved so far.

The older Gemma 4 31B int4 AutoRound run was already useful, but the QAT build improves every first-action metric that matters here:

ModelParsed callsExactNameMedianp95
Intel Gemma 4 31B int4 AutoRound88/10014%34%0.63s1.46s
Gemma 4 31B QAT w4a1695/10019%37%0.55s1.07s

The improvement is not just a formatting cleanup. The QAT model reduced no-tool answers from 12 to 5, raised exact first-call matches by 5 points, and slightly reduced the over-cautious get_accessibility_tree habit from 45 calls to 41. It still inspects pages often, but it is less likely to stall out before making a usable first move.

The Qwen 3.6 27B comparison

This is the part that makes the run interesting beyond the Gemma family.

ModelParsed callsExactNameMedian latency
Qwen 3.6 27B92/10018%37%10.2s
Gemma 4 31B QAT w4a1695/10019%37%0.55s

So yes: in this strict first-action replay, Gemma 4 31B QAT narrowly beats Qwen 3.6 27B on quality. It is not a blowout. It is +3 parsed calls, +1 exact point, and a tied tool-name score. But the latency difference is not narrow at all: 0.55s median versus 10.2s.

That gives it a very different profile from the usual "larger local model is smarter but slower" tradeoff. Here, the QAT Gemma run is both slightly better and dramatically faster in this vLLM setup.

The naming is also refreshingly literal. This feels like the kind of improvement people might have called "Gemma 4.1 31B" if it had shipped as a conventional checkpoint refresh. Google named the mechanism instead: QAT. That is probably more honest. The model is not claiming to be a new generation; it is a quantization-aware serving-oriented variant that behaves like a real planner upgrade.

The analogy to the Qwen 3.5 to Qwen 3.6 jump is useful, but with one caveat: we do not have a saved Qwen 3.5 27B run in this WebBrain result set. So this is not a measured Qwen-style before/after. It is a behavioral analogy: the same base family suddenly feels more useful for agent routing because the deployed variant changed the practical quality-speed frontier.

Sonnet 4.6 as the reference

The strict ideal-first-call replay is useful, but the older WebBrain benchmark used a different reference: treat Claude Sonnet 4.6 as the truth and ask how often each model picked the same first tool.

Here is that view updated with the newer runs. The original May rows come from the published Sonnet export. The newer rows were scored against the same 100 Sonnet tool-name picks from that export, so this table is about first-tool alignment, not argument equality.

#ModelMatch allMatch when Sonnet tooledTool-call rateValid-name rateMedian
refClaude Sonnet 4.6100%100%92%92%2.8s
1Gemma 4 31B QAT w4a1677.0%78.3%95%95%0.55s
2Qwen 3.6 27B77.0%77.2%92%92%10.2s
3MiniMax M2.777.0%76.1%88%88%3.1s
4Intel Gemma 4 31B int4 AutoRound74.0%72.8%88%87%0.63s
5Qwen 3.5 4B73.0%71.7%82%82%5.5s
6Gemma 4 26B-A4B71.0%70.7%87%87%1.4s
7Qwen 3.6 35B-A3B70.0%70.7%90%90%10.3s
8Qwen 3.5 9B70.0%69.6%90%89%0.91s
9Gemma 4 E4B68.0%68.5%87%87%4.5s
10Nemotron Omni 30B67.0%68.5%93%93%2.6s
11DiffusionGemma 26B-A4B67.0%64.1%79%79%0.35s
12Gemma 4 E2B63.0%60.9%76%75%4.0s
13Gemma 4 12B Coder Fable5 Composer 2.561.0%62.0%94%94%1.9s
14Cohere North-Mini-Code 1.059.0%58.7%93%93%3.2s
15Browser-Use Qwen 30B-A3B Q443.0%45.7%93%93%0.48s
16LFM 2.540.0%38.0%83%83%5.9s
17Qwen 3.5 0.8B37.0%34.8%90%90%0.45s
18Qwen 3.5 2B36.0%34.8%89%89%0.78s
19VibeThinker 3B BF1633.0%32.6%84%81%5.0s
20Molmo2 8B8.0%0.0%2%1%1.7s

This is the strongest version of the QAT result. In the Sonnet-reference table, Gemma 4 31B QAT ties the best all-case score from the earlier benchmark and takes the tiebreaker on the 92 prompts where Sonnet actually called a tool. It is only a small quality lead, but it comes with a huge latency gap: 0.55s median for QAT versus 10.2s for Qwen 3.6 27B.

The newer small and specialty runs also make more sense in this lens. DiffusionGemma remains a speed story, not a top Sonnet-alignment story. Gemma 4 12B Coder and North are highly parseable but less Sonnet-like in first-tool choice. VibeThinker stays near the bottom, which matches its own model-card warning about tool-calling and autonomous-agent use. Molmo2 8B is effectively not a text tool-calling planner in this harness.

Vision probe

The vision probe also passed.

Against the Google password verification screenshot, Gemma 4 31B QAT correctly identified:

Timing was also solid:

Vision probe metricValue
TTFT1.31s
Total time3.50s
Prompt tokens595
Completion tokens152

One small miss: the model listed "Blockers: None" even though an authentication prompt is obviously a blocker for continuing. For WebBrain, that is the kind of detail a follow-up prompt tweak can usually tighten. The important part is that the endpoint accepted image input and produced a structured caption usable by the planner.

What this changes

For the current WebBrain local planner table, Gemma 4 31B QAT now belongs above the older Gemma 31B int4 run and just above Qwen 3.6 27B on strict first-action quality.

The broader leaderboard still has stronger action routers. MiniMax M2.7, Sonnet 4.6, and Qwen 3.6 35B-A3B still beat it on exact or name-only matching in the saved runs. But Gemma 4 31B QAT has the best balance we have seen from a local Gemma: high parseability, competitive first-action selection, working vision, and latency that keeps the browser loop feeling interactive.

The clean takeaway: this is not just a smaller file or a faster serving trick. In this harness, QAT made Gemma 4 31B a better browser-agent planner.

Written by Emre Sokullu. WebBrain is MIT-licensed and open on GitHub.