June 19, 2026 · 9 min read · ← All posts

DiffusionGemma hits 0.35s median in the WebBrain local planner bench

We pulled four new local candidates into the WebBrain bench: Gemma 4 12B Coder Fable5 Composer 2.5, Cohere North-Mini-Code 1.0, DiffusionGemma-26B-A4B-it, and VibeThinker-3B. The practical result: Gemma and North both completed the frozen legacy first-tool-call run at Q4, DiffusionGemma could not use the normal llama.cpp server path but did complete the harness through vLLM at very high speed, and VibeThinker matched its own caveat: it is not trained for tool-calling or autonomous agents. DiffusionGemma is promising, but not yet something I would put in WebBrain's default path.

What we ran

This was not a leaderboard run. It was a local-serving reality check: can these models sit behind WebBrain as a first-action browser planner?

For the comparable runs, we used the frozen first-tool-call harness:

node test/llm/run-llamacpp.mjs \
  --freeze test/llm/freeze/baseline-2026-05-23.json \
  --chat-template-compat alternating \
  --concurrency 1

That means:

The strict numbers below score only the first model action. Exact means tool name and args match the expected first call. Name means the tool name matches but args differ. Parsed calls measures format reliability, not correctness.

Results

ModelServing pathParsed callsExactNameMedianp95Observed VRAMStatus
Gemma 4 12B Coder Fable5 Composer 2.5Q4_K_M GGUF, llama.cpp, 32k ctx94/1009%26%1.9s2.8s~8.8-9.0 GBClean text run
Cohere North-Mini-Code 1.0Q4_K_M GGUF, llama.cpp b9714, 32k ctx93/1009%24%3.2s4.0s~19.5 GBClean run with parser workaround
DiffusionGemma-26B-A4B-itNVFP4, vLLM, 32k ctx79/10010%26%0.35s0.65s~32.0 GB reservedVery fast; vision probe works
VibeThinker-3BBF16, vLLM, 64k ctx, 8 server seqs84/1002%19%5.0s15.1s~26.8 GB reservedScored, but not recommended

Two VRAM caveats:

Gemma 4 12B Coder

Model: yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF

Local file:

G:\llama\models\gemma4-12b-coder\gemma4-coding-Q4_K_M.gguf

Gemma was the cleanest llama.cpp run in this batch.

MetricValue
Completed cases100/100
Transport errors0
Parsed calls94/100
Exact first-call match9/100
Tool-name match26/100
Average latency2.34s
Median latency1.91s
p95 latency2.81s
Slowest case40.5s
Observed VRAM~8.8-9.0 GB

The good part is format reliability. Under the frozen legacy path, Gemma emitted parseable calls in 94 cases without transport errors. It was also quick: most requests landed around two seconds.

The weakness is first-action selection. It called get_accessibility_tree 57 times out of 100. That is often safe behavior in a real browser session, but this benchmark asks for the expected first action. When the user explicitly asks to go somewhere, search something, or open a known page, "inspect the current blank tab" is usually too conservative.

We also ran test/vision-probe.mjs against the same server. The probe failed with:

image input is not supported - hint: if this is unexpected, you may need to provide the mmproj

So this particular GGUF is a text planner candidate only in the current setup.

North Mini Code

Model: CohereLabs/North-Mini-Code-1.0, Q4 GGUF from bartowski/North-Mini-Code-1.0-GGUF

Local file:

G:\llama\models\north-mini-code-1.0\North-Mini-Code-1.0-Q4_K_M.gguf

North loaded successfully in the staged current llama.cpp build:

G:\llama\llama-b9714-cuda13\llama-server.exe

But the first smoke test hit a serving-layer problem. With the native template, North generated a sensible action:

[
  {
    "tool_call_id": "0",
    "tool_name": "navigate",
    "parameters": {
      "url": "about:addons"
    }
  }
]

llama.cpp's OpenAI endpoint rejected it before returning a normal response:

The model produced output that does not match the expected peg-native format

For the comparable run, we restarted the server with --skip-chat-parsing and added a narrow raw-action extractor for North/Cohere-style JSON in the runner. The prompt remained the same frozen legacy prompt, and no structured OpenAI tools were sent.

MetricValue
Completed cases100/100
Transport errors0
Parsed calls93/100
Exact first-call match9/100
Tool-name match24/100
Average latency3.30s
Median latency3.19s
p95 latency4.02s
Slowest case5.16s
Observed VRAM~19.5 GB

North was slower and much heavier than Gemma, but more stable than the first template failure suggested. Like Gemma, it strongly preferred get_accessibility_tree: 59 of its 93 parsed calls used that tool. It had 7 no-tool answers.

The vision probe failed the same way as Gemma:

image input is not supported - hint: if this is unexpected, you may need to provide the mmproj

So North is also text-only in this local run.

DiffusionGemma

Model: google/diffusiongemma-26B-A4B-it, vLLM endpoint rooted at nvidia/diffusiongemma-26B-A4B-it-NVFP4. We also tested the Q4 GGUF from unsloth/diffusiongemma-26B-A4B-it-GGUF for local llama.cpp compatibility.

Local file:

G:\llama\models\diffusiongemma-26b-a4b-it\diffusiongemma-26B-A4B-it-Q4_K_M.gguf

DiffusionGemma was the headline surprise.

The expectation going in was simple: if the model can preserve roughly the same action quality as Gemma 4 26B-A4B while running much faster, it changes the local-agent tradeoff. That is mostly what happened. It did not beat the old Gemma 26B run on strict action quality, but it got close enough to be interesting and was dramatically faster once the vLLM server was warm.

MetricValue
Completed cases100/100
Transport errors0
Parsed calls79/100
Exact first-call match10/100
Tool-name match26/100
Average latency0.38s
Median latency0.35s
p95 latency0.65s
Slowest case0.88s
Observed VRAM~32.0 GB reserved by vLLM

The first one-case smoke took 8.8s and picked about:extensions instead of Firefox's expected about:addons. After that warmup, the full 100-case run finished in 37.8s wall time, with most cases between 200ms and 650ms. That is the flashy part: it behaves like a large local model, but the warmed vLLM path felt closer to a lightweight router.

Quality was more mixed. Compared with the earlier Gemma 4 26B-A4B saved run, DiffusionGemma had:

The caveat is important: DiffusionGemma does not yet feel reliable enough for WebBrain's default planner. In interactive use it can sometimes behave as if an action has already happened when it has not actually been taken yet. Our working hypothesis is that the diffusion generation path is part of that failure mode: it is less stepwise-deterministic than a normal autoregressive planner, and browser automation punishes that kind of state drift quickly. The speed is real; the action-state calibration still needs work.

The other surprise is vision. Unlike the Gemma 4 12B Coder and North Q4 GGUFs, the vLLM DiffusionGemma endpoint accepted the vision-probe.mjs screenshot and produced a useful structured caption of the Google password-error page. It correctly identified the page purpose, the visible text, the focused password input, and the "Enter a password" error state. That makes it more than a text planner curiosity.

There is still a serving split to be aware of. Standard llama-server and llama-cli did not load the Q4 GGUF in three builds we checked:

All failed with the same architecture error:

unknown model architecture: 'diffusion-gemma'

That is expected right now. DiffusionGemma is not a normal autoregressive model, and the GGUF notes point to the DiffusionGemma llama.cpp PR and the dedicated llama-diffusion-cli runner. The Hugging Face discussion for the Unsloth GGUF also notes that this path is CLI-only for now, with no llama-server support yet.

So we built the PR branch locally:

G:\llama-diffusiongemma\build\bin\llama-diffusion-cli.exe

Smoke command:

llama-diffusion-cli.exe \
  -m G:\llama\models\diffusiongemma-26b-a4b-it\diffusiongemma-26B-A4B-it-Q4_K_M.gguf \
  -p "Explain promises in JavaScript in one sentence." \
  -n 64 \
  -ngl 99 \
  --diffusion-steps 24

That worked. It produced a coherent one-sentence answer and peaked around 21.3 GB total 5090 memory in the smoke run. The CLI reported about 1.3s total generation time for the 256-token canvas, using entropy-bound early stopping.

But the WebBrain score above comes from vLLM, not from the GGUF CLI path. For local llama.cpp users, this is still not drop-in. For vLLM users with enough VRAM, it is suddenly one of the most interesting candidates in the batch.

VibeThinker

Model: WeiboAI/VibeThinker-3B

We tried VibeThinker two ways.

First, we pulled a Q4_K_M GGUF and ran it through llama.cpp. Those numbers are discarded. The Q4 run showed repetitive answers, never-ending responses, high latency, and weak tool-call behavior. For a 3B model, this quantization may simply be too lossy for this use case.

Then we reran the BF16 model through vLLM on port 8000. That gave a cleaner data point:

MetricValue
Completed cases100/100
Transport errors0
Parsed calls84/100
Exact first-call match2/100
Tool-name match19/100
Average latency6.20s
Median latency4.97s
p95 latency15.13s
Slowest case24.4s
Observed VRAM~26.8 GB reserved by vLLM

This cleaner run still supports the upstream warning. The VibeThinker model card says it was not trained on tool-calling or agent-based programming data and does not recommend it for function calling, API orchestration, or autonomous coding agents. It recommends competitive-programming-style tasks instead.

That showed up here. VibeThinker emitted calls most of the time, but the first-action routing was weak, no-tool answers were common, and the model often selected generic reading/inspection behavior where the harness expected a concrete browser action.

So the fair interpretation is narrow: VibeThinker may still be interesting for reasoning or competitive-programming prompts, but these WebBrain planner results should not be used as evidence for or against its intended use case.

What I would keep testing

The practical local candidates from this batch are DiffusionGemma on vLLM, Gemma 4 12B Coder, and North Mini Code.

Gemma is lighter and faster in the llama.cpp lane. North is heavier, slightly slower, and needed a parser workaround, but once raw actions were allowed through it completed the full run cleanly. Both have the same exact-match score, and both overuse get_accessibility_tree.

That suggests the next useful experiment is prompt-side, not model-side:

DiffusionGemma is the speed track. It is the most unusual model in the batch, and the vLLM run showed why people are excited about it: large-model behavior with sub-second warmed routing latency. The remaining question is whether prompt changes can reduce its no-tool rate without losing that speed.

VibeThinker should stay out of the browser-agent planner table unless a tool-trained variant appears.

Comparison with earlier planner runs

For context, here is the same strict first-tool comparison across the saved runs we have tested before. This is not the same scoring lens as the May benchmark post, which compared models against consensus and Sonnet. This table replays each saved result against the current expected/NNN.json ideal first call: exact is name plus args, name is tool name only.

ModelParsed callsExactNameMedian latency
MiniMax M2.788/10023%36%3.0s
Claude Sonnet 4.692/10019%41%2.8s
Qwen 3.6 35B-A3B90/10018%38%10.3s
Qwen 3.6 27B92/10018%37%10.2s
Nemotron Omni 30B93/10016%36%2.5s
Qwen 3.5 9B90/10015%35%0.91s
Gemma 4 E4B87/10014%35%4.5s
Intel Gemma 4 31B int488/10014%34%0.63s
Gemma 4 26B-A4B87/10013%30%1.4s
Browser-Use Qwen 30B-A3B Q493/10012%35%0.47s
Qwen 3.5 4B82/10012%33%5.4s
Gemma 4 E2B76/10012%31%3.8s
DiffusionGemma 26B-A4B vLLM79/10010%26%0.35s
Gemma 4 12B Coder Q494/1009%26%1.9s
North Mini Code Q493/1009%24%3.2s
Qwen 3.5 0.8B90/1007%15%0.44s
LFM 2.583/1004%23%5.9s
Qwen 3.5 2B89/1004%7%0.78s
VibeThinker 3B BF1684/1002%19%5.0s

The more useful comparison is by class.

Mid dense: Gemma 4 12B Coder vs Qwen 3.5 9B

Gemma 4 12B Coder is closest to Qwen 3.5 9B in spirit: mid-sized dense-ish local planners that should fit comfortably below the huge-model tier.

ModelParsed callsExactNameMedian latency
Qwen 3.5 9B90/10015%35%0.91s
Gemma 4 12B Coder Q494/1009%26%1.9s

Gemma wins on parseability, but Qwen 3.5 9B is still the better first-action router in this saved strict replay. Gemma's weakness is not format. It is action choice: too many safe-but-generic get_accessibility_tree calls where Qwen more often picks the expected tool directly.

Mini class: VibeThinker 3B vs Gemma 4 E4B vs Qwen 3.5 4B

The small-model comparison is where VibeThinker has to prove it can transfer reasoning into browser-agent tool use. Against the earlier mini-class runs, it does not.

ModelParsed callsExactNameMedian latency
Gemma 4 E4B87/10014%35%4.5s
Qwen 3.5 4B82/10012%33%5.4s
VibeThinker 3B BF1684/1002%19%5.0s

This makes the VibeThinker conclusion clearer. Even with BF16/vLLM instead of the bad Q4 GGUF run, it trails both Gemma 4 E4B and Qwen 3.5 4B by a wide margin on exact and name-only matching. That lines up with the model-card warning: VibeThinker may be a competitive-programming/reasoning model, but it is not a browser-agent planner model.

Large MoE class: North vs Qwen 3.6 35B-A3B vs Gemma 4 26B

North belongs in the large local tier, where the question is whether a heavier model gives better routing than the older large MoE candidates.

ModelParsed callsExactNameMedian latency
Qwen 3.6 35B-A3B90/10018%38%10.3s
Gemma 4 26B-A4B87/10013%30%1.4s
DiffusionGemma 26B-A4B vLLM79/10010%26%0.35s
North Mini Code Q493/1009%24%3.2s

North is operationally promising because it loaded, fit, and produced parseable actions after the --skip-chat-parsing workaround. But it does not beat the older large local models on first-action quality. Qwen 3.6 35B-A3B remains the large-tier action-selection reference. Gemma 4 26B-A4B still has better strict routing quality than DiffusionGemma in this saved replay, but DiffusionGemma is the speed outlier: 0.35s median versus 1.4s for Gemma 26B and 10.3s for Qwen 35B.

That Qwen latency should not be read as the final word on the model. At roughly ten seconds it is much slower than Gemma 4 26B's one-second-ish run and DiffusionGemma's sub-second warmed vLLM path, but I still think there is room to improve Qwen 3.6 35B-A3B with a better vLLM configuration. The quality signal is strong enough that the serving stack deserves more tuning before we treat the latency gap as permanent.

The new models did not beat the earlier leaders on strict first-action accuracy. Their stronger signal is operational: DiffusionGemma is genuinely fast through vLLM and vision-capable, Gemma 4 12B Coder is light and parseable, and North Mini Code is heavy but also parseable once its native action format is allowed through. All three need prompt work before they can challenge the older Qwen, MiniMax, Sonnet, or Nemotron runs on action selection.

Written by Emre Sokullu. WebBrain is MIT-licensed and open on GitHub.