Blog

Engineering notes

Short write-ups on design decisions, failure modes, and benchmarks from building an open-source AI browser agent.

July 23, 2026 · 10 min read

Inkling makes the American open-weight comeback genuinely multimodal

Inkling produced 96 valid calls and chose the ideal tool 45 times, while reaching 73% Sonnet alignment. Maximum reasoning did not help—but working image and audio input make this 975B American open-weight release unusually complete.

July 22, 2026 · 8 min read

Nanbeige 4.2 3B punches above its size, but Qwen 3.5 9B still wins our frozen planner test

Nanbeige 4.2 3B produced 90 parsed calls, reached 67% Sonnet alignment, and delivered a 1.49s single-request median. That is impressive for 4B total parameters, but below Qwen 3.5 9B and Gemma 4 E4B in our first-action test.

July 21, 2026 · 9 min read

Poolside Laguna M.1 reaches 73% in WebBrain, but 225B does not win

Laguna M.1 returned 92 valid tool calls and reached 73% Sonnet alignment with a high-reasoning request. The larger model is much more dispatchable than Laguna S, but exact and ideal tool selection barely improve.

July 21, 2026 · 9 min read

Poolside Laguna S 2.1 reaches 71% with a high-reasoning request, but still trails Hy3

Laguna S 2.1 is much cleaner than Laguna XS and extraordinarily cheap on OpenRouter. A high-reasoning request lifted Sonnet alignment from 65% to 71%, but reduced parsed tool calls from 86 to 78 and still did not catch Hy3 or MiniMax M3.

July 14, 2026 · 6 min read

GLM-5.2 is not WebBrain's new planner reference yet

GLM-5.2 completed four 100-case WebBrain planner suites with zero transport errors after resumed runs. The frozen row lands at 69% Sonnet alignment, 21 exact first calls, and 36 ideal tool-name matches.

July 14, 2026 · 5 min read

WebMCP: Websites as Tools for AI Agents, and Why We're Excited

A new browser standard lets websites hand AI agents a menu of actions instead of forcing them to squint at screenshots. One GitHub issue sparked a conversation that could change how WebBrain works with the web.

July 10, 2026 · 6 min read

Poolside Laguna XS is fast and free, but not a WebBrain planner win

Laguna XS completed WebBrain's frozen planner suite with 92 parsed calls, 89 valid tool names, 65% Sonnet alignment, and a 1.30s median latency. The endpoint is quick; the planner quality is not competitive.

July 9, 2026 · 8 min read

ThinkingCap Qwen 3.6 27B is a serious WebBrain planner candidate

ThinkingCap lands at 77% Sonnet alignment, 91 parsed tool calls, 19 exact first actions, and 2.25s median latency. It looks like a real local planner candidate, but not a clean replacement for the Qwen baselines.

July 8, 2026 · 6 min read

Nex-N2-mini is cheap and fast, but not a WebBrain planner win

Nex-N2-mini completed WebBrain's frozen planner suite with 93 parsed tool calls, a 2.23s median latency, and about $0.045 reported OpenRouter cost, but only 65% Sonnet alignment.

July 8, 2026 · 6 min read

Nemotron 3 Ultra is huge, free, and not a WebBrain planner win

Nemotron 3 Ultra completed WebBrain's frozen planner benchmark through OpenRouter's free endpoint, but 81 parsed calls, 65% Sonnet alignment, and a 40.6s p95 latency keep it out of the planner shortlist.

July 8, 2026 · 7 min read

Agents-A1 in WebBrain's frozen planner benchmark

Agents-A1 beats Qwen 3.6 35B-A3B on Sonnet alignment and local latency in WebBrain's frozen planner harness, but it does not beat Qwen on exact or ideal first-call matching.

July 7, 2026 · 3 min read

Five things you shouldn't do with WebBrain

We briefly interrupt the benchmark series for a video-sized warning label: WebBrain can automate a browser, but consent, legality, and basic human decency still apply.

July 7, 2026 · 6 min read

Tencent Hy3 is a great OpenRouter planner, but text-only for now

Tencent Hy3 reached 95/100 parsed tool calls, 20/100 exact first-action matches, and 73% Sonnet alignment. It is a strong OpenRouter planner row, with multimodality as the missing piece.

July 2, 2026 · 4 min read

WebBrain now has an Ollama launch handoff

WebBrain can now be configured from an Ollama launch command: choose a local model, open the WebBrain handoff page, confirm the browser prompt, and the extension switches to Ollama automatically. It is not in upstream Ollama yet, but you can try it from the branch today.

July 2, 2026 · 6 min read

Qwen 3.7 Plus is a serious OpenRouter rival to MiniMax M3

Qwen 3.7 Plus reached 75/100 Sonnet alignment, 95/100 parsed native tool calls, and the best ideal tool-name score in the top hosted slice of WebBrain's frozen planner table.

July 2, 2026 · 6 min read

Qwen 3.6 27B NVFP4 is much faster, but not a clean planner upgrade

Qwen 3.6 27B NVFP4 gives WebBrain a big local latency win: 96/100 parsed calls, 1.8s median latency, and a top-10 Sonnet-reference result. The planner quality story is more mixed.

June 26, 2026 · 6 min read

Ornith-1.0-35B in WebBrain's frozen planner benchmark

Ornith-1.0-35B is a strong local planner and narrowly beats Qwen 3.6 35B on WebBrain's all-case Sonnet alignment. It does not beat Gemma 4 31B QAT in this browser-agent harness.

June 26, 2026 · 5 min read

Tiny raw LFM 2.5 checkpoints in WebBrain's frozen planner benchmark

LFM 2.5 230M and 350M both completed the frozen WebBrain planner run without transport errors. The raw tiny checkpoints are not good browser planners yet, but their failure shape is useful fine-tuning data.

June 21, 2026 · 6 min read

MiniMax M3 and WebBrain Cloud 1.0 enter the frozen WebBrain planner benchmark

MiniMax M3 landed at 75% Sonnet first-tool alignment, below the older MiniMax M2.7 run. WebBrain Cloud 1.0 reached 73% alignment with a much lower run cost, but higher latency in this test path.

June 20, 2026 · 6 min read

Gemma 4 12B QAT is fast enough for WebBrain, but Qwen 3.5 9B still routes a little better

Gemma 4 12B QAT lands as a very fast local WebBrain planner: 92/100 parsed calls, 14% exact, 33% tool-name match, 0.43s median per request, and a working but imperfect vision probe.

June 20, 2026 · 7 min read

Gemma 4 31B QAT quietly becomes the best local Gemma planner we have tested

Gemma 4 31B QAT is not branded like a new generation, but in WebBrain's local planner bench it behaves like a meaningful upgrade: 95/100 parsed calls, 19% exact, 37% tool-name match, and 0.55s median latency.

June 19, 2026 · 9 min read

DiffusionGemma hits 0.35s median in the WebBrain local planner bench

Gemma 4 12B Coder, North Mini Code, and DiffusionGemma all completed the frozen legacy tool-call bench through different serving paths. DiffusionGemma was the speed surprise under vLLM and also handled the vision probe, but its diffusion-style generation still needs more WebBrain-specific reliability work. VibeThinker confirmed its own model-card warning: it is not a browser-agent tool-calling model.

June 19, 2026 · 5 min read

WebBrain Cloud is live, and these are the local models we are benchmarking next

WebBrain Cloud is live in the latest main branch, which means you can try WebBrain without a local LLM or API access. It is request-limited, but useful for a first look. We are also lining up the next local-model benchmark set by hardware band: 4-12GB, 12-24GB, and 24-64GB VRAM.

June 12, 2026 · 8 min read

Molmo2-8B is truly open. Our current serving path could not give it a fair browser-tool test.

Molmo2-8B deserves praise for being open source in the meaningful sense: weights, data, recipe, and no closed-VLM distillation dependency. We could not run a fair native OpenAI-tools comparison through the current LM Studio Molmo path: structured tools failed at prompt rendering, and the fallback text-call run produced only 2 parsed tool calls.

June 2, 2026 · 9 min read

How WebBrain keeps a local AI agent from getting hijacked by the page it's reading

When an AI agent can act on a page as the logged-in user, the page itself becomes an attacker. We walk through WebBrain's layered defense — untrusted-content quarantine, a language-agnostic permission gate, UI-first actions — and share what our adversarial tests revealed: big models resist injection on their own, small local models need the guardrails, and the guardrails are what flip a confused model from relaying an attacker's instruction to flagging it.

May 31, 2026 · 7 min read

Liquid LFM 2.5-8B-A1B on browser tool calling: where the new on-device MoE fits

Liquid AI's just-shipped LFM 2.5-8B-A1B (8.3B total / 1.5B active sparse MoE) joins our browser-agent benchmark. Scored against Claude Sonnet 4.6, it beats both small Qwens and comes within reach of Gemma 4-E2B, but for the small-model distillation slot, plain Apache 2.0 still wins over LFM's $10M-revenue-cliff license. Also: a tool-schema drift bug we caught and the freeze fix we shipped.

May 23, 2026 · 8 min read

13 LLMs, 100 browser tasks, two baselines: which model actually picks the right tool?

We benchmarked 13 local and API models on 100 real browser-agent tool-calling tasks against both consensus voting and Claude Sonnet 4.6. The consensus winner (Qwen 3.6-35B-A3B at 94%) isn't the Sonnet-match winner (Qwen 3.6-27B at 77%) — and that gap tells you something useful about what "correct" means for tool calling.

May 19, 2026 · 4 min read

Pruning Gemma 4 26B-A4B for small GPUs: Turkish-first, language-agnostic MoE surgery

Router hooks + expert activation telemetry + surgical long-tail removal + brief LoRA heal. Early run: 128→101 experts/layer, 26B→21B params, ~11 GB at 4-bit GGUF, with solid Turkish fluency and code performance.

May 7, 2026 · 6 min read

Round 4: Qwen 3.5-9B-int4 punches above its weight — when 9B int4 beats 308B IQ3_S on affordance

A 9B int4 Qwen 3.5 on vLLM classifies the email-chip dropdown explicitly — something the 308B MiMo V2.5 at IQ3_S and the larger Qwen 3.6-35B-A3B both missed. Cheapest image tokens after Gemma. The catch: it loses the red-border visual cue and joins the "Unknowns: None" club. Suddenly the most interesting option for ≤8 GB VRAM. Plus an updated routing-policy table by VRAM bracket.

May 7, 2026 · 7 min read

Round 3: Xiaomi MiMo V2.5 enters the vision shootout — and joins the "Unknowns: None" club

The empirical follow-up to last week's MiMo speculation post. Same probe, same Google sign-in screen, same prompt. MiMo at IQ3_S nails OCR and state extraction in the Qwen 3.6 tier — but joins Nemotron and Gemma in writing "Unknowns: None" instead of flagging the red-border ambiguity. Token cost ties the most expensive bucket; latency is in a different class. Plus a probe upgrade so big reasoning models stop tripping the default fetch headers timeout.

April 30, 2026 · 6 min read

Xiaomi MiMo V2.5 Pro vs "V2.5 Flash": should WebBrain add both?

Research notes on Xiaomi's newly released MiMo V2.5 series and why multimodal Pro+Flash-style routing may outperform text-only stacks for browser-agent workloads, while Qwen 3.6 still leads value on many pure-text tasks.

April 29, 2026 · 8 min read

Round 2: Nemotron Omni 30B vs Qwen 3.6. Do cheaper image tokens beat calibrated uncertainty?

A second round of vision-model benchmarking for browser agents. NVIDIA's Nemotron Omni 30B-A3B-Reasoning is 17% cheaper per image and classifies inputs better than Qwen 3.6-35B-A3B — but loses on calibrated uncertainty, and is English-only. Plus a head-to-head with the dense Qwen 3.6-27B that explains why MoE is the right architecture for self-hosted vision.

April 21, 2026 · 7 min read

Four vision models, one screenshot: which one is actually worth running locally for a browser agent?

We fed the same Google sign-in page through Gemma 4-E2B, Gemma 4-31B, Qwen3.5-27B, and Qwen3.6-35B-A3B using the exact system prompt WebBrain's vision sub-call ships with. The spread on OCR accuracy, latency, and token cost is wider than you'd expect — and one model quietly changed our mind about which architecture to reach for.