<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: space-invaders</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/space-invaders.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-12-09T20:19:21+00:00</updated><author><name>Simon Willison</name></author><entry><title>mistralai/mistral-vibe</title><link href="https://simonwillison.net/2025/Dec/9/mistral-vibe/#atom-tag" rel="alternate"/><published>2025-12-09T20:19:21+00:00</published><updated>2025-12-09T20:19:21+00:00</updated><id>https://simonwillison.net/2025/Dec/9/mistral-vibe/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/mistralai/mistral-vibe"&gt;mistralai/mistral-vibe&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Here's the Apache 2.0 licensed source code for Mistral's new "Vibe" CLI coding agent, &lt;a href="https://mistral.ai/news/devstral-2-vibe-cli"&gt;released today&lt;/a&gt; alongside Devstral 2.&lt;/p&gt;
&lt;p&gt;It's a neat implementation of the now-standard terminal coding agent pattern, built in Python on top of Pydantic and Rich/Textual (here are &lt;a href="https://github.com/mistralai/mistral-vibe/blob/v1.0.4/pyproject.toml#L29-L46"&gt;the dependencies&lt;/a&gt;). &lt;a href="https://github.com/google-gemini/gemini-cli"&gt;Gemini CLI&lt;/a&gt; is TypeScript, Claude Code is closed source (TypeScript, now &lt;a href="https://simonwillison.net/2025/Dec/2/anthropic-acquires-bun/"&gt;on top of Bun&lt;/a&gt;), and OpenAI's &lt;a href="https://github.com/openai/codex"&gt;Codex CLI&lt;/a&gt; is Rust. &lt;a href="https://github.com/OpenHands/OpenHands"&gt;OpenHands&lt;/a&gt; is the other major Python coding agent I know of, but I'm likely missing some others. (UPDATE: &lt;a href="https://github.com/MoonshotAI/kimi-cli"&gt;Kimi CLI&lt;/a&gt; is another open source Apache 2 Python one.)&lt;/p&gt;
&lt;p&gt;The Vibe source code is pleasant to read and the crucial prompts are neatly extracted out into Markdown files. Some key places to look:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/mistralai/mistral-vibe/blob/v1.0.4/vibe/core/prompts/cli.md"&gt;core/prompts/cli.md&lt;/a&gt; is the main system prompt ("You are operating as and within Mistral Vibe, a CLI coding-agent built by Mistral AI...")&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/mistralai/mistral-vibe/blob/v1.0.4/vibe/core/prompts/compact.md"&gt;core/prompts/compact.md&lt;/a&gt; is the prompt used to generate compacted summaries of conversations ("Create a comprehensive summary of our entire conversation that will serve as complete context for continuing this work...")&lt;/li&gt;
&lt;li&gt;Each of the core tools has its own prompt file:&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/mistralai/mistral-vibe/blob/v1.0.4/vibe/core/tools/builtins/prompts/bash.md"&gt;.../prompts/bash.md&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/mistralai/mistral-vibe/blob/v1.0.4/vibe/core/tools/builtins/prompts/grep.md"&gt;.../prompts/grep.md&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/mistralai/mistral-vibe/blob/v1.0.4/vibe/core/tools/builtins/prompts/read_file.md"&gt;.../prompts/read_file.md&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/mistralai/mistral-vibe/blob/v1.0.4/vibe/core/tools/builtins/prompts/write_file.md"&gt;.../prompts/write_file.md&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/mistralai/mistral-vibe/blob/v1.0.4/vibe/core/tools/builtins/prompts/search_replace.md"&gt;.../prompts/search_replace.md&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/mistralai/mistral-vibe/blob/v1.0.4/vibe/core/tools/builtins/prompts/todo.md"&gt;.../prompts/todo.md&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The Python implementations of those tools &lt;a href="https://github.com/mistralai/mistral-vibe/tree/v1.0.4/vibe/core/tools/builtins"&gt;can be found here&lt;/a&gt;.&lt;/p&gt;
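The general shape of that pattern is easy to sketch in miniature. This is not Vibe's actual code, just an illustration under my own assumptions: each tool pairs a Python callable with a Markdown description that gets assembled into the system prompt, and a dispatcher routes the model's tool calls to the right implementation.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Minimal sketch of the terminal coding agent tool pattern.
# Not Vibe's actual implementation - names and structure are invented.

@dataclass
class Tool:
    name: str
    prompt: str  # Markdown description injected into the system prompt
    run: Callable[..., str]

def read_file(path: str) -> str:
    with open(path) as f:
        return f.read()

def grep(pattern: str, text: str) -> str:
    """Return the lines of text matching a regex pattern."""
    return "\n".join(line for line in text.splitlines() if re.search(pattern, line))

TOOLS = {
    t.name: t
    for t in [
        Tool("read_file", "Read a file from disk given its path.", read_file),
        Tool("grep", "Return lines of `text` matching `pattern`.", grep),
    ]
}

def dispatch(name: str, **kwargs) -> str:
    """Route a model-issued tool call to its implementation."""
    return TOOLS[name].run(**kwargs)

def system_prompt() -> str:
    """Assemble the per-tool Markdown descriptions into one prompt section."""
    return "\n".join(f"## {t.name}\n{t.prompt}" for t in TOOLS.values())
```

The interesting design decision in Vibe is keeping those per-tool prompts in standalone Markdown files rather than inline strings, which makes them easy to read and diff.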
&lt;p&gt;I tried it out and had it build me a Space Invaders game using three.js with the following prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;make me a space invaders game as HTML with three.js loaded from a CDN&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;img alt="Animated screenshot demo of Mistral Vibe running in a terminal. The text reads: I've created a Space Invaders game using HTML and Three. js loaded from a CDN. The game is now available in the file space_invaders.html in your current directory. Here's how to play: 1. Open the space_invaders.html file in a web browser 2. Use the left and right arrow keys to move your player (green rectangle) 3. Press the spacebar to shoot at the invaders (red rectangles) 4. Try to get the highest score before the invaders reach you or hit you with their bullets The game features: © Player movement with arrow keys © Shooting mechanics with spacebar © Enemy invaders that move back and forth © Collision detection « Score tracking * Game over screen © Increasing difficulty Writing file (64s esc to interrupt) »» auto-approve on (shift-tab to toggle) - 7% of 100k tokens" src="https://static.simonwillison.net/static/2025/vibe.gif" /&gt;&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://github.com/simonw/space-invaders-by-llms/blob/main/mistral-vibe-devstral-2/index.html"&gt;the source code&lt;/a&gt; and &lt;a href="https://space-invaders.simonwillison.net/mistral-vibe-devstral-2/"&gt;the live game&lt;/a&gt; (hosted in my new &lt;a href="https://github.com/simonw/space-invaders-by-llms"&gt;space-invaders-by-llms&lt;/a&gt; repo). It did OK.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-engineering"&gt;prompt-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/textual"&gt;textual&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mistral"&gt;mistral&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pydantic"&gt;pydantic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vibe-coding"&gt;vibe-coding&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/system-prompts"&gt;system-prompts&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/space-invaders"&gt;space-invaders&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="ai"/><category term="prompt-engineering"/><category term="generative-ai"/><category term="llms"/><category term="textual"/><category term="ai-assisted-programming"/><category term="mistral"/><category term="pydantic"/><category term="vibe-coding"/><category term="coding-agents"/><category term="system-prompts"/><category term="space-invaders"/></entry><entry><title>Using Codex CLI with gpt-oss:120b on an NVIDIA DGX Spark via Tailscale</title><link href="https://simonwillison.net/2025/Nov/7/codex-tailscale-spark/#atom-tag" rel="alternate"/><published>2025-11-07T07:23:12+00:00</published><updated>2025-11-07T07:23:12+00:00</updated><id>https://simonwillison.net/2025/Nov/7/codex-tailscale-spark/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://til.simonwillison.net/llms/codex-spark-gpt-oss"&gt;Using Codex CLI with gpt-oss:120b on an NVIDIA DGX Spark via Tailscale&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Inspired by a &lt;a href="https://www.youtube.com/watch?v=qy4ci7AoF9Y&amp;amp;lc=UgzaGdLX8TAuQ9ugx1Z4AaABAg"&gt;YouTube comment&lt;/a&gt;, I wrote up how I run OpenAI's Codex CLI coding agent against the gpt-oss:120b model running in Ollama on my &lt;a href="https://simonwillison.net/2025/Oct/14/nvidia-dgx-spark/"&gt;NVIDIA DGX Spark&lt;/a&gt; via a Tailscale network.&lt;/p&gt;
&lt;p&gt;It takes a little bit of work to configure but the result is I can now use Codex CLI on my laptop anywhere in the world against a self-hosted model.&lt;/p&gt;
&lt;p&gt;I used it to build &lt;a href="https://static.simonwillison.net/static/2025/gpt-oss-120b-invaders.html"&gt;this space invaders clone&lt;/a&gt;.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tailscale"&gt;tailscale&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/til"&gt;til&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nvidia"&gt;nvidia&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/space-invaders"&gt;space-invaders&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/codex-cli"&gt;codex-cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nvidia-spark"&gt;nvidia-spark&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="tailscale"/><category term="til"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="nvidia"/><category term="coding-agents"/><category term="space-invaders"/><category term="codex-cli"/><category term="nvidia-spark"/></entry><entry><title>OpenAI's new open weight (Apache 2) models are really good</title><link href="https://simonwillison.net/2025/Aug/5/gpt-oss/#atom-tag" rel="alternate"/><published>2025-08-05T20:33:13+00:00</published><updated>2025-08-05T20:33:13+00:00</updated><id>https://simonwillison.net/2025/Aug/5/gpt-oss/#atom-tag</id><summary type="html">
    &lt;p&gt;The long promised &lt;a href="https://openai.com/index/introducing-gpt-oss/"&gt;OpenAI open weight models are here&lt;/a&gt;, and they are &lt;em&gt;very&lt;/em&gt; impressive. They're available under proper open source licenses - Apache 2.0 - and come in two sizes, 120B and 20B.&lt;/p&gt;
&lt;p&gt;OpenAI's own benchmarks are eyebrow-raising - emphasis mine:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The &lt;strong&gt;gpt-oss-120b&lt;/strong&gt; model achieves &lt;strong&gt;near-parity with OpenAI o4-mini&lt;/strong&gt; on core reasoning benchmarks, while running efficiently on a single 80 GB GPU. The &lt;strong&gt;gpt-oss-20b&lt;/strong&gt; model delivers &lt;strong&gt;similar results to OpenAI o3‑mini&lt;/strong&gt; on common benchmarks and can run on edge devices with just 16 GB of memory, making it ideal for on-device use cases, local inference, or rapid iteration without costly infrastructure.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;o4-mini and o3-mini are &lt;em&gt;really good&lt;/em&gt; proprietary models - I was not expecting the open weights releases to be anywhere near that class, especially given their small sizes. That gpt-oss-20b model should run quite comfortably on a Mac laptop with 32GB of RAM.&lt;/p&gt;
&lt;p&gt;Both models are mixture-of-experts:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;gpt-oss-120b activates 5.1B parameters per token, while gpt-oss-20b activates 3.6B. The models have 117b and 21b total parameters respectively.&lt;/p&gt;
&lt;/blockquote&gt;
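The arithmetic on those figures makes the sparsity concrete, using only the numbers quoted above:

```python
# Active-parameter fraction per token for the two gpt-oss MoE models,
# using the figures quoted from OpenAI (in billions of parameters).
models = {
    "gpt-oss-120b": (5.1, 117),
    "gpt-oss-20b": (3.6, 21),
}
for name, (active, total) in models.items():
    print(f"{name}: {active / total:.1%} of parameters active per token")
# gpt-oss-120b: 4.4% of parameters active per token
# gpt-oss-20b: 17.1% of parameters active per token
```

So the larger model is considerably sparser: each token touches under 5% of its weights, versus about 17% for the 20b.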
&lt;p&gt;Something that surprised me even more about the benchmarks was the scores for general knowledge based challenges. I can just about believe they managed to train a strong reasoning model that fits in 20B parameters, but these models score highly on benchmarks like "GPQA Diamond (without tools) PhD-level science questions" too:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;o3 — 83.3%&lt;/li&gt;
&lt;li&gt;o4-mini — 81.4%&lt;/li&gt;
&lt;li&gt;gpt-oss-120b — 80.1%&lt;/li&gt;
&lt;li&gt;o3-mini — 77%&lt;/li&gt;
&lt;li&gt;gpt-oss-20b — 71.5%&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A lot of these benchmarks are edging towards saturated.&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#running-gpt-oss-20b-on-my-mac-with-lm-studio"&gt;Running gpt-oss-20b on my Mac with LM Studio&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#pelican-on-reasoning-low"&gt;Pelican on reasoning=low&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#pelican-on-reasoning-medium"&gt;Pelican on reasoning=medium&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#pelican-on-reasoning-high"&gt;Pelican on reasoning=high&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#space-invaders-with-gpt-oss-20b"&gt;Space invaders with gpt-oss-20b&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#trying-gpt-oss-120b-via-api-providers"&gt;Trying gpt-oss-120b via API providers&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#llama-cpp-is-coming-very-shortly"&gt;llama.cpp is coming very shortly&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#gpt-oss-20b-in-ollama"&gt;gpt-oss:20b in Ollama&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#the-model-card"&gt;Training details from the model card&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#openai-harmony-a-new-format-for-prompt-templates"&gt;OpenAI Harmony, a new format for prompt templates&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#the-open-question-for-me-how-good-is-tool-calling-"&gt;The open question for me: how good is tool calling?&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#china"&gt;Competing with the Chinese open models&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="running-gpt-oss-20b-on-my-mac-with-lm-studio"&gt;Running gpt-oss-20b on my Mac with LM Studio&lt;/h4&gt;
&lt;p&gt;There are already a bunch of different ways to run these models - OpenAI partnered with numerous organizations in advance of the release.&lt;/p&gt;
&lt;p&gt;I decided to start with &lt;a href="https://lmstudio.ai/"&gt;LM Studio&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I had to update to the most recent version of the app, then install the new model from &lt;a href="https://lmstudio.ai/models/openai/gpt-oss-20b"&gt;their openai/gpt-oss-20b&lt;/a&gt; page.&lt;/p&gt;
&lt;p&gt;First impressions: this is a &lt;em&gt;really good&lt;/em&gt; model, and it somehow runs using just 11.72GB of my system RAM.&lt;/p&gt;
&lt;p&gt;The model supports three reasoning efforts: low, medium and high. LM Studio makes those available via a dropdown.&lt;/p&gt;
&lt;p&gt;Let's try "Generate an SVG of a pelican riding a bicycle":&lt;/p&gt;
&lt;h4 id="pelican-on-reasoning-low"&gt;Pelican on reasoning=low&lt;/h4&gt;
&lt;p&gt;I started &lt;a href="https://gist.github.com/simonw/b71394cc85fe0f048e376392e41586da"&gt;with low&lt;/a&gt;. It thought for 0.07 seconds and then output this (at 39 tokens a second):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-20-low.png" alt="" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Except... it output invalid SVG. One of the path elements looked like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;!-- Frame --&amp;gt;
&amp;lt;path d="
    M150,250          &amp;lt;!-- rear wheel center --&amp;gt;
    L300,120          &amp;lt;!-- top tube to front --&amp;gt;
    L450,250          &amp;lt;!-- chain stays back to front --&amp;gt;
    L300,350          &amp;lt;!-- seat stays down --&amp;gt;
    Z"
    fill="#e0e0e0" stroke="#555" stroke-width="4"/&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;But you can't put comments inside attributes like that. I fixed this to get the above image.&lt;/p&gt;
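The fix is mechanical, since XML comments are never legal inside attribute values. Here's a blunt one-liner approach of my own (not anything the model produced) that strips every comment from the document; note it also removes legitimate comments between elements:

```python
import re

def strip_svg_comments(svg: str) -> str:
    """Remove all <!-- ... --> comments, including the invalid
    ones emitted inside attribute values."""
    return re.sub(r"<!--.*?-->", "", svg, flags=re.DOTALL)

broken = '''<path d="
    M150,250          <!-- rear wheel center -->
    L300,120          <!-- top tube to front -->
    Z"
    fill="#e0e0e0" stroke="#555" stroke-width="4"/>'''

fixed = strip_svg_comments(broken)
```

After that the path data parses as a plain sequence of commands and coordinates.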
&lt;h4 id="pelican-on-reasoning-medium"&gt;Pelican on reasoning=medium&lt;/h4&gt;
&lt;p&gt;I tried again &lt;a href="https://gist.github.com/simonw/642e9e371387fc59a5aad25dcd41b0cc"&gt;on medium&lt;/a&gt;. This time it thought for 4.44 seconds and gave me this, at 55 tokens/second:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-20-medium.png" alt="" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4 id="pelican-on-reasoning-high"&gt;Pelican on reasoning=high&lt;/h4&gt;
&lt;p&gt;And then &lt;a href="https://gist.github.com/simonw/164a3f1b4b1a971637af279a0ac49be3"&gt;on high&lt;/a&gt;. My first attempt here failed because it hit the default LM Studio 4096 context limit - so I reloaded the model and set the new limit to the maximum possible.&lt;/p&gt;
&lt;p&gt;It jumped up to using 17.35GB RAM (I imagine it might use a lot more with a longer prompt) and after thinking for 5 minutes and 50 seconds (!) gave me this:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-20-high.png" alt="" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;It thought &lt;em&gt;really hard&lt;/em&gt; to get there. Here's a snippet of &lt;a href="https://gist.github.com/simonw/164a3f1b4b1a971637af279a0ac49be3"&gt;that transcript&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;But maybe we want seat below body bottom? Actually seat should be between body and pedals, not above body bottom. Wait, typical bicycle: seat post goes from bottom bracket upward to the seat. The seat sits on top of the seat tube (vertical). So the seat is above bottom bracket. In our representation, the seat is at y=250-255. Body center at 245, body bottom at 265. So body bottom is below seat. That means the pelican's body is below the seat? Actually no: body bottom extends to 265; seat top is at 250; so seat sits above body bottom by about 15 units. That seems plausible because seat should be slightly above body, but body bottom is at 265 &amp;gt; seat top 255. [...]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I've seen this from other local reasoning models too: they can really overthink this kind of problem! I don't know if there's any pelican SVG that's worth waiting nearly six minutes for.&lt;/p&gt;
&lt;h4 id="space-invaders-with-gpt-oss-20b"&gt;Space invaders with gpt-oss-20b&lt;/h4&gt;
&lt;p&gt;Given how long high took I switched back to medium for my next experiment:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Write an HTML and JavaScript page implementing space invaders&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It &lt;a href="https://gist.github.com/simonw/63d7d8c43ae2ac93c214325bd6d607e4"&gt;thought for 10.78 seconds&lt;/a&gt; and produced this:&lt;/p&gt;

&lt;div style="max-width: 100%; margin-bottom: 0.4em"&gt;
    &lt;video controls="controls" preload="none" aria-label="Space Invaders" poster="https://static.simonwillison.net/static/2025/space-invaders-gpt-20.jpg" loop="loop" style="width: 100%; height: auto;" muted="muted"&gt;
        &lt;source src="https://static.simonwillison.net/static/2025/space-invaders-gpt-20.mp4" type="video/mp4" /&gt;
    &lt;/video&gt;
&lt;/div&gt;

&lt;p&gt;You can &lt;a href="https://tools.simonwillison.net/space-invaders-gpt-oss-20b-mxfp4-medium"&gt;play that here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It's not the best I've seen - I was more impressed &lt;a href="https://simonwillison.net/2025/Jul/29/space-invaders/"&gt;by GLM 4.5 Air&lt;/a&gt; - but it's very competent for a model that only uses 12GB of my RAM (GLM 4.5 Air used 47GB).&lt;/p&gt;
&lt;h4 id="trying-gpt-oss-120b-via-api-providers"&gt;Trying gpt-oss-120b via API providers&lt;/h4&gt;
&lt;p&gt;I don't quite have the resources on my laptop to run the larger model. Thankfully it's already being hosted by a number of different API providers.&lt;/p&gt;
&lt;p&gt;OpenRouter already &lt;a href="https://openrouter.ai/openai/gpt-oss-120b/providers"&gt;lists three&lt;/a&gt; - Fireworks, Groq and Cerebras. (Update: now also Parasail and Baseten.)&lt;/p&gt;
&lt;p&gt;Cerebras is &lt;em&gt;fast&lt;/em&gt;, so I decided to try them first.&lt;/p&gt;
&lt;p&gt;I installed the &lt;a href="https://github.com/irthomasthomas/llm-cerebras"&gt;llm-cerebras&lt;/a&gt; plugin and ran the &lt;code&gt;refresh&lt;/code&gt; command to ensure it had their latest models:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install -U llm-cerebras jsonschema
llm cerebras refresh&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;(Installing jsonschema worked around a warning message.)&lt;/p&gt;
&lt;p&gt;Output:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Refreshed 10 Cerebras models:
  - cerebras-deepseek-r1-distill-llama-70b
  - cerebras-gpt-oss-120b
  - cerebras-llama-3.3-70b
  - cerebras-llama-4-maverick-17b-128e-instruct
  - cerebras-llama-4-scout-17b-16e-instruct
  - cerebras-llama3.1-8b
  - cerebras-qwen-3-235b-a22b-instruct-2507
  - cerebras-qwen-3-235b-a22b-thinking-2507
  - cerebras-qwen-3-32b
  - cerebras-qwen-3-coder-480b
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m cerebras-gpt-oss-120b \
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Generate an SVG of a pelican riding a bicycle&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Cerebras runs the new model at between two and four thousand tokens per second!&lt;/p&gt;
&lt;p&gt;To my surprise this one &lt;a href="https://gist.github.com/simonw/4c685f19f1a93b68eacb627125e36be4"&gt;had the same comments-in-attributes bug&lt;/a&gt; that we saw with oss-20b earlier. I fixed those and got this pelican:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-120-cerebras.jpg" alt="Yellow and not great pelican, quite a good bicycle if a bit sketchy." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;That bug appears intermittently - I've not seen it on some of my other runs of the same prompt.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt; plugin also provides access to the models, balanced across the underlying providers. You can use that like so:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-openrouter
llm keys &lt;span class="pl-c1"&gt;set&lt;/span&gt; openrouter
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Paste API key here&lt;/span&gt;
llm -m openrouter/openai/gpt-oss-120b &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Say hi&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h4 id="llama-cpp-is-coming-very-shortly"&gt;llama.cpp is coming very shortly&lt;/h4&gt;
&lt;p&gt;The &lt;code&gt;llama.cpp&lt;/code&gt; &lt;a href="https://github.com/ggml-org/llama.cpp/pull/15091"&gt;pull request for gpt-oss&lt;/a&gt; was landed less than an hour ago. It's worth browsing through the code - a &lt;em&gt;lot&lt;/em&gt; of work went into supporting this new model, spanning 48 commits to 83 different files. Hopefully this will land in the &lt;a href="https://formulae.brew.sh/formula/llama.cpp"&gt;llama.cpp Homebrew package&lt;/a&gt; within the next day or so, which should provide a convenient way to run the model via &lt;code&gt;llama-server&lt;/code&gt; and friends.&lt;/p&gt;
&lt;h4 id="gpt-oss-20b-in-ollama"&gt;gpt-oss:20b in Ollama&lt;/h4&gt;
&lt;p&gt;Ollama &lt;a href="https://ollama.com/library/gpt-oss"&gt;also have gpt-oss&lt;/a&gt;, requiring an update to their app.&lt;/p&gt;
&lt;p&gt;I fetched that 14GB model like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;ollama pull gpt-oss:20b&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Now I can use it with the new Ollama native app, or access it from &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-ollama
llm -m gpt-oss:20b &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Hi&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This also appears to use around 13.26GB of system memory while running a prompt.&lt;/p&gt;
&lt;p&gt;Ollama also launched &lt;a href="https://ollama.com/turbo"&gt;Ollama Turbo&lt;/a&gt; today, offering the two OpenAI models as a paid hosted service:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;Turbo is a new way to run open models using datacenter-grade hardware. Many new models are too large to fit on widely available GPUs, or run very slowly. Ollama Turbo provides a way to run these models fast while using Ollama's App, CLI, and API. &lt;/p&gt;&lt;/blockquote&gt;
&lt;h4 id="the-model-card"&gt;Training details from the model card&lt;/h4&gt;
&lt;p&gt;Here are some interesting notes about how the models were trained from &lt;a href="https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7637/oai_gpt-oss_model_card.pdf"&gt;the model card&lt;/a&gt; (PDF):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Data&lt;/strong&gt;: We train the models on a text-only dataset with trillions of tokens, with a focus on STEM, coding, and general knowledge. To improve the safety of the model, we filtered the data for harmful content in pre-training, especially around hazardous biosecurity knowledge, by reusing the CBRN pre-training filters from GPT-4o. Our model has a knowledge cutoff of June 2024.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Training&lt;/strong&gt;: The gpt-oss models trained on NVIDIA H100 GPUs using the PyTorch framework with expert-optimized Triton kernels. The training run for gpt-oss-120b required 2.1 million H100-hours to complete, with gpt-oss-20b needing almost 10x fewer. [...]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Thunder Compute's article &lt;a href="https://www.thundercompute.com/blog/nvidia-h100-pricing"&gt;NVIDIA H100 Pricing (August 2025): Cheapest On-Demand Cloud GPU Rates&lt;/a&gt; lists prices from around $2/hour to $11/hour, which would indicate a training cost for the 120b model of between $4.2m and $23.1m, and for the 20b of between $420,000 and $2.3m.&lt;/p&gt;
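That estimate is simple arithmetic over the model card's 2.1 million H100-hours and the quoted price range, treating "almost 10x fewer" for the 20b as roughly a tenth:

```python
# Rough training-cost estimate from the model card's H100-hour figures
# and the $2-$11/hour on-demand price range quoted above.
# Assumes "almost 10x fewer" for gpt-oss-20b means ~210,000 hours.
h100_hours_120b = 2_100_000
h100_hours_20b = h100_hours_120b // 10

for label, hours in [("gpt-oss-120b", h100_hours_120b), ("gpt-oss-20b", h100_hours_20b)]:
    low, high = hours * 2, hours * 11
    print(f"{label}: ${low:,} to ${high:,}")
# gpt-oss-120b: $4,200,000 to $23,100,000
# gpt-oss-20b: $420,000 to $2,310,000
```

Those figures cover GPU rental only - actual training costs would also include experiments, failed runs, and staff time.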
&lt;blockquote&gt;
&lt;p&gt;After pre-training, we post-train the models using similar CoT RL techniques as OpenAI o3. This procedure teaches the models how to reason and solve problems using CoT and teaches the model how to use tools. Because of the similar RL techniques, these models have a personality similar to models served in our first-party products like ChatGPT. Our training dataset consists of a wide range of problems from coding, math, science, and more.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The models have additional special training to help them use web browser and Python (Jupyter notebook) tools more effectively:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;During post-training, we also teach the models to use different agentic tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A browsing tool, that allows the model to call search and open functions to interact with
the web. This aids factuality and allows the models to fetch info beyond their knowledge
cutoff.&lt;/li&gt;
&lt;li&gt;A python tool, which allows the model to run code in a stateful Jupyter notebook environment.&lt;/li&gt;
&lt;li&gt;Arbitrary developer functions, where one can specify function schemas in a &lt;code&gt;Developer&lt;/code&gt;
message similar to the OpenAI API. The definition of function is done within our harmony
format.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;There's a corresponding &lt;a href="https://github.com/openai/gpt-oss?tab=readme-ov-file#python"&gt;section about Python tool usage&lt;/a&gt; in the &lt;code&gt;openai/gpt-oss&lt;/code&gt; repository README.&lt;/p&gt;


&lt;h4 id="openai-harmony-a-new-format-for-prompt-templates"&gt;OpenAI Harmony, a new format for prompt templates&lt;/h4&gt;
&lt;p&gt;One of the gnarliest parts of implementing harnesses for LLMs is handling the prompt template format.&lt;/p&gt;
&lt;p&gt;Modern prompts are complicated beasts. They need to model user vs. assistant conversation turns, tool calls, reasoning traces, and an increasing number of other complex patterns.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/openai/harmony"&gt;openai/harmony&lt;/a&gt; is a brand new open source project from OpenAI (again, Apache 2) which implements a new response format that was created for the &lt;code&gt;gpt-oss&lt;/code&gt; models. It's clearly inspired by their new-ish &lt;a href="https://openai.com/index/new-tools-for-building-agents/"&gt;Responses API&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The format is described in the new &lt;a href="https://cookbook.openai.com/articles/openai-harmony"&gt;OpenAI Harmony Response Format&lt;/a&gt; cookbook document. It introduces some concepts that I've not seen in open weight models before:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;system&lt;/code&gt;, &lt;code&gt;developer&lt;/code&gt;, &lt;code&gt;user&lt;/code&gt;, &lt;code&gt;assistant&lt;/code&gt; and &lt;code&gt;tool&lt;/code&gt; roles - many other models only use user and assistant, and sometimes system and tool.&lt;/li&gt;
&lt;li&gt;Three different channels for output: &lt;code&gt;final&lt;/code&gt;, &lt;code&gt;analysis&lt;/code&gt; and &lt;code&gt;commentary&lt;/code&gt;. Only the &lt;code&gt;final&lt;/code&gt; channel is intended to be visible to users by default. &lt;code&gt;analysis&lt;/code&gt; is for chain of thought and &lt;code&gt;commentary&lt;/code&gt; is sometimes used for tools.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That channels concept has been present in ChatGPT for a few months, starting with the release of o3.&lt;/p&gt;
&lt;p&gt;The details of the new tokens used by Harmony caught my eye:&lt;/p&gt;
&lt;center&gt;
&lt;table&gt;
  &lt;tbody&gt;&lt;tr&gt;
    &lt;th&gt;Token&lt;/th&gt;
    &lt;th&gt;Purpose&lt;/th&gt;
    &lt;th&gt;ID&lt;/th&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&amp;lt;|start|&amp;gt;&lt;/td&gt;
    &lt;td&gt;Start of message header&lt;/td&gt;
    &lt;td&gt;200006&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&amp;lt;|end|&amp;gt;&lt;/td&gt;
    &lt;td&gt;End of message&lt;/td&gt;
    &lt;td&gt;200007&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&amp;lt;|message|&amp;gt;&lt;/td&gt;
    &lt;td&gt;Start of message content&lt;/td&gt;
    &lt;td&gt;200008&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&amp;lt;|channel|&amp;gt;&lt;/td&gt;
    &lt;td&gt;Start of channel info&lt;/td&gt;
    &lt;td&gt;200005&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&amp;lt;|constrain|&amp;gt;&lt;/td&gt;
    &lt;td&gt;Data type for tool call&lt;/td&gt;
    &lt;td&gt;200003&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&amp;lt;|return|&amp;gt;&lt;/td&gt;
    &lt;td&gt;Stop after response&lt;/td&gt;
    &lt;td&gt;200002&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&amp;lt;|call|&amp;gt;&lt;/td&gt;
    &lt;td&gt;Call a tool&lt;/td&gt;
    &lt;td&gt;200012&lt;/td&gt;
  &lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/center&gt;
&lt;p&gt;Those token IDs are particularly important. They are part of a new token vocabulary called &lt;code&gt;o200k_harmony&lt;/code&gt;, which landed in OpenAI's tiktoken tokenizer library &lt;a href="https://github.com/openai/tiktoken/commit/3591ff175d6a80efbe4fcc7f0e219ddd4b8c52f1"&gt;this morning&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In the past I've seen models get confused by special tokens - try pasting &lt;code&gt;&amp;lt;|end|&amp;gt;&lt;/code&gt; into a model and see what happens.&lt;/p&gt;
&lt;p&gt;Having these special instruction tokens formally map to dedicated token IDs should hopefully be a whole lot more robust!&lt;/p&gt;
&lt;p&gt;The Harmony repo itself includes a Rust library and a Python library (wrapping that Rust library) for working with the new format in a much more ergonomic way.&lt;/p&gt;
&lt;p&gt;I tried one of their demos using &lt;code&gt;uv run&lt;/code&gt; to turn it into a shell one-liner:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;uv run --python 3.12 --with openai-harmony python -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;from openai_harmony import *&lt;/span&gt;
&lt;span class="pl-s"&gt;from openai_harmony import DeveloperContent&lt;/span&gt;
&lt;span class="pl-s"&gt;enc = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)&lt;/span&gt;
&lt;span class="pl-s"&gt;convo = Conversation.from_messages([&lt;/span&gt;
&lt;span class="pl-s"&gt;    Message.from_role_and_content(&lt;/span&gt;
&lt;span class="pl-s"&gt;        Role.SYSTEM,&lt;/span&gt;
&lt;span class="pl-s"&gt;        SystemContent.new(),&lt;/span&gt;
&lt;span class="pl-s"&gt;    ),&lt;/span&gt;
&lt;span class="pl-s"&gt;    Message.from_role_and_content(&lt;/span&gt;
&lt;span class="pl-s"&gt;        Role.DEVELOPER,&lt;/span&gt;
&lt;span class="pl-s"&gt;        DeveloperContent.new().with_instructions("Talk like a pirate!")&lt;/span&gt;
&lt;span class="pl-s"&gt;    ),&lt;/span&gt;
&lt;span class="pl-s"&gt;    Message.from_role_and_content(Role.USER, "Arrr, how be you?"),&lt;/span&gt;
&lt;span class="pl-s"&gt;])&lt;/span&gt;
&lt;span class="pl-s"&gt;tokens = enc.render_conversation_for_completion(convo, Role.ASSISTANT)&lt;/span&gt;
&lt;span class="pl-s"&gt;print(tokens)&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Which outputs:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;[200006, 17360, 200008, 3575, 553, 17554, 162016, 11, 261, 4410, 6439, 2359, 22203, 656, 7788, 17527, 558, 87447, 100594, 25, 220, 1323, 19, 12, 3218, 279, 30377, 289, 25, 14093, 279, 2, 13888, 18403, 25, 8450, 11, 49159, 11, 1721, 13, 21030, 2804, 413, 7360, 395, 1753, 3176, 13, 200007, 200006, 77944, 200008, 2, 68406, 279, 37992, 1299, 261, 96063, 0, 200007, 200006, 1428, 200008, 8977, 81, 11, 1495, 413, 481, 30, 200007, 200006, 173781]&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Note those token IDs like &lt;code&gt;200006&lt;/code&gt; corresponding to the special tokens listed above.&lt;/p&gt;
&lt;h4 id="the-open-question-for-me-how-good-is-tool-calling-"&gt;The open question for me: how good is tool calling?&lt;/h4&gt;
&lt;p&gt;There's one aspect of these models that I haven't explored in detail yet: &lt;strong&gt;tool calling&lt;/strong&gt;. How these work is clearly a big part of the new Harmony format, but the packages I'm using myself (around my own &lt;a href="https://simonwillison.net/2025/May/27/llm-tools/"&gt;LLM tool calling&lt;/a&gt; support) need various tweaks and fixes to start working with that new mechanism.&lt;/p&gt;
&lt;p&gt;Tool calling currently represents my biggest disappointment with local models that I've run on my own machine. I've been able to get them to perform simple single calls, but the state of the art these days is wildly more ambitious than that.&lt;/p&gt;
&lt;p&gt;Systems like Claude Code can make dozens if not hundreds of tool calls over the course of a single session, each one adding more context and information to a single conversation with an underlying model.&lt;/p&gt;
&lt;p&gt;My experience to date has been that local models are unable to handle these lengthy conversations. I'm not sure if that's inherent to the limitations of my own machine, or if it's something that the right model architecture and training could overcome.&lt;/p&gt;
&lt;p&gt;OpenAI make big claims about the tool calling capabilities of these new models. I'm looking forward to seeing how well they perform in practice.&lt;/p&gt;

&lt;h4 id="china"&gt;Competing with the Chinese open models&lt;/h4&gt;

&lt;p&gt;I've been writing a &lt;em&gt;lot&lt;/em&gt; about the &lt;a href="https://simonwillison.net/tags/ai-in-china/"&gt;flurry of excellent open weight models&lt;/a&gt; released by Chinese AI labs over the past few months - all of them very capable and most of them under Apache 2 or MIT licenses.&lt;/p&gt;

&lt;p&gt;Just last week &lt;a href="https://simonwillison.net/2025/Jul/30/chinese-models/"&gt;I said&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Something that has become undeniable this month is that the best available open weight models now come from the Chinese AI labs.&lt;/p&gt;
&lt;p&gt;I continue to have a lot of love for Mistral, Gemma and Llama but my feeling is that Qwen, Moonshot and Z.ai have positively smoked them over the course of July. [...]&lt;/p&gt;
&lt;p&gt;I can't help but wonder if part of the reason for the delay in release of OpenAI's open weights model comes from a desire to be notably better than this truly impressive lineup of Chinese models.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;With the release of the gpt-oss models that statement no longer holds true. I'm waiting for the dust to settle and the independent benchmarks (that are more credible than my ridiculous pelicans) to roll out, but I think it's likely that OpenAI now offer the best available open weights models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: Independent evaluations are beginning to roll in. Here's &lt;a href="https://x.com/artificialanlys/status/1952887733803991070"&gt;Artificial Analysis&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;gpt-oss-120b is the most intelligent American open weights model, comes behind DeepSeek R1 and Qwen3 235B in intelligence but offers efficiency benefits [...]&lt;/p&gt;
&lt;p&gt;While the larger gpt-oss-120b does not come in above DeepSeek R1 0528’s score of 59 or Qwen3 235B 2507’s score of 64, it is notable that it is significantly smaller in both total and active parameters than both of those models.&lt;/p&gt;&lt;/blockquote&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-tool-use"&gt;llm-tool-use&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cerebras"&gt;cerebras&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/space-invaders"&gt;space-invaders&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-oss"&gt;gpt-oss&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="open-source"/><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="llm-tool-use"/><category term="cerebras"/><category term="ollama"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/><category term="lm-studio"/><category term="space-invaders"/><category term="gpt-oss"/></entry><entry><title>XBai o4</title><link href="https://simonwillison.net/2025/Aug/3/xbai-o4/#atom-tag" rel="alternate"/><published>2025-08-03T22:21:17+00:00</published><updated>2025-08-03T22:21:17+00:00</updated><id>https://simonwillison.net/2025/Aug/3/xbai-o4/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/MetaStoneTec/XBai-o4"&gt;XBai o4&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Yet &lt;em&gt;another&lt;/em&gt; open source (Apache 2.0) LLM from a Chinese AI lab. This model card claims:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;XBai o4&lt;/strong&gt; excels in complex reasoning capabilities and has now completely surpassed OpenAI-o3-mini in Medium mode.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is a 32.8 billion parameter model released by MetaStone AI, a new-to-me lab that released their first model in March - &lt;a href="https://huggingface.co/MetaStoneTec/MetaStone-L1-7B"&gt;MetaStone-L1-7B&lt;/a&gt;, then followed that with MetaStone-S1 &lt;a href="https://huggingface.co/MetaStoneTec/MetaStone-S1-1.5B"&gt;1.5B&lt;/a&gt;, &lt;a href="https://huggingface.co/MetaStoneTec/MetaStone-S1-7B"&gt;7B&lt;/a&gt; and &lt;a href="https://huggingface.co/MetaStoneTec/MetaStone-S1-32B"&gt;32B&lt;/a&gt; in July and now XBai o4 in August.&lt;/p&gt;
&lt;p&gt;The MetaStone-S1 models were accompanied with a paper, &lt;a href="https://arxiv.org/abs/2507.01951"&gt;Test-Time Scaling with Reflective Generative Model&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;There is &lt;em&gt;very&lt;/em&gt; little information available on the English-language web about MetaStone AI. Their paper shows a relationship with USTC, &lt;a href="https://en.wikipedia.org/wiki/University_of_Science_and_Technology_of_China"&gt;University of Science and Technology of China&lt;/a&gt; in Hefei. One of their researchers &lt;a href="https://x.com/WangMagic_/status/1951690465222217872"&gt;confirmed on Twitter&lt;/a&gt; that their CEO is from &lt;a href="https://en.wikipedia.org/wiki/Kuaishou"&gt;KWAI&lt;/a&gt;, which led me to &lt;a href="https://www.qbitai.com/2024/07/168071.html"&gt;this Chinese language article&lt;/a&gt; from July last year about Li Yan, formerly of KWAI and now the founder of Wen Xiaobai and &lt;a href="https://x.com/simonw/status/1951694450369208361"&gt;evidently&lt;/a&gt; &lt;a href="https://x.com/WangMagic_/status/1951694611191324929"&gt;now&lt;/a&gt; the CEO of MetaStone. &lt;a href="https://www.wenxiaobai.com"&gt;www.wenxiaobai.com&lt;/a&gt; is listed as the "official website" linked to from &lt;a href="https://github.com/MetaStone-AI/XBai-o4"&gt;the XBai-o4 README&lt;/a&gt; on GitHub.&lt;/p&gt;
&lt;p&gt;Ivan Fioravanti &lt;a href="https://huggingface.co/mlx-community/models?search=xbai-o4"&gt;got it working under MLX&lt;/a&gt; in 4bit, 5bit, 6bit, 8bit and 4bit-DWQ sizes. I tried his &lt;a href="https://huggingface.co/mlx-community/XBai-o4-6bit"&gt;6bit one&lt;/a&gt; (a 24.81GB download) in LM Studio and had it "Generate an SVG of a pelican riding a bicycle". It thought for 39.49 seconds (&lt;a href="https://gist.github.com/simonw/78182fc3409e36f8d22217992967b9d6#response"&gt;thinking trace here&lt;/a&gt;) and gave me this:&lt;/p&gt;
&lt;p&gt;&lt;img alt="The pelican is blue and looks a little like a sad elephant. The bicycle looks more like a flat motorbike. The pelican has a blue tail and orange legs." src="https://static.simonwillison.net/static/2025/xbai-o4-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;I also had it "Write an HTML and JavaScript page implementing space invaders", which you can &lt;a href="https://tools.simonwillison.net/space-invaders-xbai-o4-6bit-mlx"&gt;try out here&lt;/a&gt;. It's not great - the invaders don't fire bullets (it listed that as a "potential improvement") and I don't think it's winnable due to timing issues, but it's an interesting addition to &lt;a href="https://simonwillison.net/tags/space-invaders/"&gt;my collection&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;LM Studio reported a constant memory usage of 25.54GB and according to Activity Monitor used 87.5% of my GPU and around 39% of a CPU core.&lt;/p&gt;
&lt;p&gt;It did a &lt;em&gt;lot&lt;/em&gt; of work in the thinking trace - which took 10m35s! - including writing the first draft of much of the code that ended up in the finished result. Here's an extract:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Let me think of variables:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;let enemyDirection = 1; // 1 for right, -1 for left&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;In the game loop:
&lt;code&gt;enemies.forEach(enemy =&amp;gt; {
    enemy.x += enemyDirection * enemy.dx;
});&lt;/code&gt;
Then check if any enemy is out of bounds. If so, reverse direction and move all enemies down.&lt;/p&gt;
&lt;p&gt;But how to check if any enemy is at the edge?&lt;/p&gt;
&lt;p&gt;Maybe find the max x of all enemies and see if it exceeds &lt;code&gt;canvas.width - enemy.width&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Alternatively, track the leftmost and rightmost positions of enemies. Maybe a bit complex.&lt;/p&gt;
&lt;p&gt;For simplicity, perhaps check if any enemy's &lt;code&gt;x + width &amp;gt; canvas.width or x &amp;lt; 0&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
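&lt;p&gt;The bounce step the model is reasoning about above can be sketched in a few lines (Python here rather than JavaScript, with made-up enemy dicts, canvas width and speeds):&lt;/p&gt;

```python
# Toy version of the invader edge-bounce step discussed above; the
# enemy dicts, canvas width and speeds are illustrative values only.
CANVAS_WIDTH = 800
ENEMY_WIDTH = 40

def step_enemies(enemies, direction, dx=5, drop=20):
    """Move the formation horizontally; if any enemy would leave the
    canvas, reverse direction and move every enemy down instead."""
    at_edge = any(
        e["x"] + dx * direction < 0
        or e["x"] + ENEMY_WIDTH + dx * direction > CANVAS_WIDTH
        for e in enemies
    )
    if at_edge:
        for e in enemies:
            e["y"] += drop
        return -direction
    for e in enemies:
        e["x"] += dx * direction
    return direction

enemies = [{"x": 760, "y": 0}, {"x": 700, "y": 0}]
direction = step_enemies(enemies, 1)  # rightmost enemy would cross 800, so bounce
print(direction, enemies[0]["y"])     # -1 20
```

&lt;p&gt;Tracking the max x, as the trace considers, is equivalent; checking every enemy just avoids maintaining leftmost and rightmost positions separately.&lt;/p&gt;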
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/457d55a5c426e0bbea9d47aee225c391"&gt;the full transcript&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;My initial impressions are that I'm not as impressed with this model for running on my own laptop as I was with &lt;a href="https://simonwillison.net/2025/Jul/31/qwen3-coder-flash/"&gt;Qwen3-Coder-30B-A3B-Instruct&lt;/a&gt; or &lt;a href="https://simonwillison.net/2025/Jul/29/space-invaders/"&gt;GLM-4.5 Air&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;But... how extraordinary is it that &lt;em&gt;another&lt;/em&gt; Chinese AI lab has been able to produce a competitive model, this time with far less fanfare than we've seen from Qwen and Moonshot AI and Z.ai.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://x.com/ivanfioravanti/status/1951643205985816807"&gt;@ivanfioravanti&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/space-invaders"&gt;space-invaders&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ivan-fioravanti"&gt;ivan-fioravanti&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="mlx"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/><category term="lm-studio"/><category term="ai-in-china"/><category term="space-invaders"/><category term="ivan-fioravanti"/></entry><entry><title>Trying out Qwen3 Coder Flash using LM Studio and Open WebUI and LLM</title><link href="https://simonwillison.net/2025/Jul/31/qwen3-coder-flash/#atom-tag" rel="alternate"/><published>2025-07-31T19:45:36+00:00</published><updated>2025-07-31T19:45:36+00:00</updated><id>https://simonwillison.net/2025/Jul/31/qwen3-coder-flash/#atom-tag</id><summary type="html">
    &lt;p&gt;Qwen just released &lt;a href="https://simonwillison.net/2025/Jul/30/chinese-models/"&gt;their sixth model&lt;/a&gt;(!) this July, called &lt;a href="https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct"&gt;Qwen3-Coder-30B-A3B-Instruct&lt;/a&gt; - listed as Qwen3-Coder-Flash in their &lt;a href="https://chat.qwen.ai/"&gt;chat.qwen.ai&lt;/a&gt; interface.&lt;/p&gt;
&lt;p&gt;It's 30.5B total parameters with 3.3B active at any one time. This means it will fit on a 64GB Mac - and even a 32GB Mac if you quantize it - and can run &lt;em&gt;really&lt;/em&gt; fast thanks to that smaller set of active parameters.&lt;/p&gt;
&lt;p&gt;It's a non-thinking model that is specially trained for coding tasks.&lt;/p&gt;
&lt;p&gt;This is an exciting combination of properties: optimized for coding performance and speed and small enough to run on a mid-tier developer laptop.&lt;/p&gt;
&lt;h4 id="trying-it-out-with-lm-studio-and-open-webui"&gt;Trying it out with LM Studio and Open WebUI&lt;/h4&gt;
&lt;p&gt;I like running models like this using Apple's MLX framework. I ran GLM-4.5 Air the other day &lt;a href="https://simonwillison.net/2025/Jul/29/space-invaders/#how-i-ran-the-model"&gt;using the mlx-lm Python library directly&lt;/a&gt;, but this time I decided to try out the combination of &lt;a href="https://lmstudio.ai/"&gt;LM Studio&lt;/a&gt; and &lt;a href="https://openwebui.com/"&gt;Open WebUI&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;(LM Studio has a decent interface built in, but I like the Open WebUI one slightly more.)&lt;/p&gt;
&lt;p&gt;I installed the model by clicking the "Use model in LM Studio" button on LM Studio's &lt;a href="https://lmstudio.ai/models/qwen/qwen3-coder-30b"&gt;qwen/qwen3-coder-30b&lt;/a&gt; page. It gave me a bunch of options:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/lm-studio-qwen3-coder-30b.jpg" alt="Screenshot of a model download menu for &amp;quot;qwen/qwen3-coder-30b,&amp;quot; a 30B MoE coding model from Alibaba Qwen using the mlx-llm engine. The section &amp;quot;Download Options&amp;quot; shows different choices with file sizes. Options include: GGUF Qwen3 Coder 30B A3B Instruct Q3_K_L (14.58 GB), Q4_K_M (18.63 GB), Q6_K (25.10 GB), Q8_0 (32.48 GB). MLX versions are also available: 4bit (17.19 GB, selected), 6bit (24.82 GB, marked as Downloaded), 8bit (32.46 GB)." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I chose the 6bit MLX model, which is a 24.82GB download. Other options include 4bit (17.19GB) and 8bit (32.46GB). The download sizes are roughly the same as the amount of RAM required to run the model - picking that 24GB one leaves 40GB free on my 64GB machine for other applications.&lt;/p&gt;
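&lt;p&gt;Those sizes follow a rough rule of thumb: total parameters multiplied by the quantized bit width, divided by 8, gives a lower bound in bytes. A quick sanity check (in decimal GB; the real files come out a little larger because some layers are kept at higher precision):&lt;/p&gt;

```python
def quantized_size_gb(params_billion, bits):
    """Rough lower bound on file/RAM size: every weight stored at
    the quantized bit width, measured in decimal gigabytes."""
    return params_billion * 1e9 * bits / 8 / 1e9

# Qwen3-Coder-30B-A3B-Instruct: 30.5B total parameters
for bits in (4, 6, 8):
    print(f"{bits}bit: ~{quantized_size_gb(30.5, bits):.1f} GB")
# Roughly 15.2, 22.9 and 30.5 GB, against actual downloads of
# 17.19, 24.82 and 32.46 GB.
```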
&lt;p&gt;Then I opened the developer settings in LM Studio (the green folder icon) and turned on "Enable CORS" so I could access it from a separate Open WebUI instance.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/lm-studio-cors.jpg" alt="Screenshot of LM Studio application showing runtime settings. The status is &amp;quot;Running&amp;quot; with a toggle switch enabled. A settings dropdown is open with options including: &amp;quot;Server Port 1234&amp;quot;, &amp;quot;Enable CORS&amp;quot; (enabled), &amp;quot;Serve on Local Network&amp;quot; (disabled)" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Now I switched over to Open WebUI. I installed and ran it using &lt;a href="https://github.com/astral-sh/uv"&gt;uv&lt;/a&gt; like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;uvx --python 3.11 open-webui serve&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then navigated to &lt;code&gt;http://localhost:8080/&lt;/code&gt; to access the interface. I opened their settings and configured a new "Connection" to LM Studio:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/openweb-ui-settings.jpg" alt="Screenshot of Open WebUI settings showing the Edit Connection window. URL is set to http://localhost:1234/v1 and Prefix ID is set to lm." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;That needs a base URL of &lt;code&gt;http://localhost:1234/v1&lt;/code&gt; and a key of anything you like. I also set the optional prefix to &lt;code&gt;lm&lt;/code&gt; just in case my Ollama installation - which Open WebUI detects automatically - ended up with any duplicate model names.&lt;/p&gt;
&lt;p&gt;Having done all of that, I could select any of my LM Studio models in the Open WebUI interface and start running prompts.&lt;/p&gt;
&lt;p&gt;A neat feature of Open WebUI is that it includes an automatic preview panel, which kicks in for fenced code blocks that include SVG or HTML:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/openweb-ui-pelican.jpg" alt="The Open WebUI app with a sidebar and then a panel with the model and my Generate an SVG of a pelican riding a bicycle prompt, then its response, then another side panel with the rendered SVG. It isn't a great image - the bicycle is a bit mangled - but the pelican does at least have a big triangular orange beak." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/c167f14bc3d86ec1976f286d3e05fda5"&gt;the exported transcript&lt;/a&gt; for "Generate an SVG of a pelican riding a bicycle". It ran at almost 60 tokens a second!&lt;/p&gt;
&lt;h4 id="implementing-space-invaders"&gt;Implementing Space Invaders&lt;/h4&gt;
&lt;p&gt;I tried my other recent &lt;a href="https://simonwillison.net/tags/space-invaders/"&gt;simple benchmark prompt&lt;/a&gt; as well:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Write an HTML and JavaScript page implementing space invaders&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I like this one because it's a very short prompt that acts as shorthand for quite a complex set of features. There's likely plenty of material in the training data to help the model achieve that goal but it's still interesting to see if they manage to spit out something that works first time.&lt;/p&gt;
&lt;p&gt;The first version it gave me worked out of the box, but was a little too hard - the enemy bullets move so fast that it's almost impossible to avoid them:&lt;/p&gt;
&lt;div style="max-width: 100%; margin-bottom: 0.4em"&gt;
    &lt;video controls="controls" preload="none" aria-label="Space Invaders" poster="https://static.simonwillison.net/static/2025/space-invaders-6bit-mlx-Qwen3-Coder-30B-A3B-Instruct.jpg" loop="loop" style="width: 100%; height: auto;" muted="muted"&gt;
        &lt;source src="https://static.simonwillison.net/static/2025/space-invaders-6bit-mlx-Qwen3-Coder-30B-A3B-Instruct.mp4" type="video/mp4" /&gt;
    &lt;/video&gt;
&lt;/div&gt;
&lt;p&gt;You can &lt;a href="https://tools.simonwillison.net/space-invaders-6bit-mlx-Qwen3-Coder-30B-A3B-Instruct"&gt;try that out here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I tried a follow-up prompt of "Make the enemy bullets a little slower". A system like Claude Artifacts or Claude Code implements tool calls for modifying files in place, but the Open WebUI system I was using didn't have a default equivalent which means the model had to output the full file a second time.&lt;/p&gt;
&lt;p&gt;It did that, and slowed down the bullets, but it made a bunch of other changes as well, &lt;a href="https://gist.github.com/simonw/ee4704feb37c6b16edd677d32fd69693/revisions#diff-544640de4897069f24e7988199bd5c08addfc5aa2196cbf2a0d164308bff1db0"&gt;shown in this diff&lt;/a&gt;. I'm not too surprised by this - asking a 25GB local model to output a lengthy file with just a single change is quite a stretch.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/b7115990525b104a6dd95f7d694ae6c3"&gt;the exported transcript&lt;/a&gt; for those two prompts.&lt;/p&gt;
&lt;h4 id="running-lm-studio-models-with-mlx-lm"&gt;Running LM Studio models with mlx-lm&lt;/h4&gt;
&lt;p&gt;LM Studio stores its models in the &lt;code&gt;~/.cache/lm-studio/models&lt;/code&gt; directory. This means you can use the &lt;a href="https://github.com/ml-explore/mlx-lm"&gt;mlx-lm&lt;/a&gt; Python library to run prompts through the same model like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;uv run --isolated --with mlx-lm mlx_lm.generate \
  --model &lt;span class="pl-k"&gt;~&lt;/span&gt;/.cache/lm-studio/models/lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-6bit \
  --prompt &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Write an HTML and JavaScript page implementing space invaders&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \
  -m 8192 --top-k 20 --top-p 0.8 --temp 0.7&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Be aware that this will load a duplicate copy of the model into memory so you may want to quit LM Studio before running this command!&lt;/p&gt;
&lt;h4 id="accessing-the-model-via-my-llm-tool"&gt;Accessing the model via my LLM tool&lt;/h4&gt;
&lt;p&gt;My &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; project provides a command-line tool and Python library for accessing large language models.&lt;/p&gt;
&lt;p&gt;Since LM Studio offers an OpenAI-compatible API, you can &lt;a href="https://llm.datasette.io/en/stable/other-models.html#openai-compatible-models"&gt;configure LLM&lt;/a&gt; to access models through that API by creating or editing the &lt;code&gt;~/Library/Application\ Support/io.datasette.llm/extra-openai-models.yaml&lt;/code&gt; file:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;zed &lt;span class="pl-k"&gt;~&lt;/span&gt;/Library/Application&lt;span class="pl-cce"&gt;\ &lt;/span&gt;Support/io.datasette.llm/extra-openai-models.yaml&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I added the following YAML configuration:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;- &lt;span class="pl-ent"&gt;model_id&lt;/span&gt;: &lt;span class="pl-s"&gt;qwen3-coder-30b&lt;/span&gt;
  &lt;span class="pl-ent"&gt;model_name&lt;/span&gt;: &lt;span class="pl-s"&gt;qwen/qwen3-coder-30b&lt;/span&gt;
  &lt;span class="pl-ent"&gt;api_base&lt;/span&gt;: &lt;span class="pl-s"&gt;http://localhost:1234/v1&lt;/span&gt;
  &lt;span class="pl-ent"&gt;supports_tools&lt;/span&gt;: &lt;span class="pl-c1"&gt;true&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Provided LM Studio is running I can execute prompts from my terminal like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m qwen3-coder-30b &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;A joke about a pelican and a cheesecake&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;blockquote&gt;
&lt;p&gt;Why did the pelican refuse to eat the cheesecake?&lt;/p&gt;
&lt;p&gt;Because it had a &lt;em&gt;beak&lt;/em&gt; for dessert! 🥧🦜&lt;/p&gt;
&lt;p&gt;(Or if you prefer: Because it was afraid of getting &lt;em&gt;beak&lt;/em&gt;-sick from all that creamy goodness!)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;(25GB clearly isn't enough space for a functional sense of humor.)&lt;/p&gt;
&lt;p&gt;More interestingly though, we can start exercising the Qwen model's support for &lt;a href="https://simonwillison.net/2025/May/27/llm-tools/"&gt;tool calling&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m qwen3-coder-30b \
  -T llm_version -T llm_time --td \
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;tell the time then show the version&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here we are enabling LLM's two default tools - one for telling the time and one for seeing the version of LLM that's currently installed. The &lt;code&gt;--td&lt;/code&gt; flag stands for &lt;code&gt;--tools-debug&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The output looks like this, debug output included:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Tool call: llm_time({})
  {
    "utc_time": "2025-07-31 19:20:29 UTC",
    "utc_time_iso": "2025-07-31T19:20:29.498635+00:00",
    "local_timezone": "PDT",
    "local_time": "2025-07-31 12:20:29",
    "timezone_offset": "UTC-7:00",
    "is_dst": true
  }

Tool call: llm_version({})
  0.26

The current time is:
- Local Time (PDT): 2025-07-31 12:20:29
- UTC Time: 2025-07-31 19:20:29

The installed version of the LLM is 0.26.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Pretty good! It managed two tool calls from a single prompt.&lt;/p&gt;
&lt;p&gt;Sadly I couldn't get it to work with some of my more complex plugins such as &lt;a href="https://github.com/simonw/llm-tools-sqlite"&gt;llm-tools-sqlite&lt;/a&gt;. I'm trying to figure out if that's a bug in the model, the LM Studio layer or my own code for running tool prompts against OpenAI-compatible endpoints.&lt;/p&gt;
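&lt;p&gt;When debugging cases like that it helps to remember what the client-side loop actually does with an OpenAI-compatible tool call response. Here's a minimal sketch, with stand-in tool functions echoing the defaults above - this is &lt;em&gt;not&lt;/em&gt; LLM's actual implementation:&lt;/p&gt;

```python
import json

# Stand-in local tools mirroring the llm_time / llm_version defaults above
def llm_version():
    return "0.26"

def llm_time():
    return {"utc_time": "2025-07-31 19:20:29 UTC"}

TOOLS = {"llm_version": llm_version, "llm_time": llm_time}

def dispatch(tool_calls):
    """Execute each tool call from an OpenAI-style assistant message and
    build the role="tool" messages sent back with the next request."""
    results = []
    for call in tool_calls:
        fn = TOOLS[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"] or "{}")
        output = fn(**args)
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": output if isinstance(output, str) else json.dumps(output),
        })
    return results

# The "tell the time then show the version" prompt yields two calls at once:
calls = [
    {"id": "1", "function": {"name": "llm_time", "arguments": "{}"}},
    {"id": "2", "function": {"name": "llm_version", "arguments": "{}"}},
]
print(dispatch(calls))
```

&lt;p&gt;A failure anywhere in that round trip - malformed arguments JSON from the model, a mismatched &lt;code&gt;tool_call_id&lt;/code&gt;, or a serving layer that mangles the tool messages - looks much the same from the outside, which is what makes these bugs hard to attribute to the model, the server or the client.&lt;/p&gt;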
&lt;h4 id="the-month-of-qwen"&gt;The month of Qwen&lt;/h4&gt;
&lt;p&gt;July has absolutely been the month of Qwen. The models they have released this month are outstanding, packing some extremely useful capabilities even into models I can run in 25GB of RAM or less on my own laptop.&lt;/p&gt;
&lt;p&gt;If you're looking for a competent coding model you can run locally Qwen3-Coder-30B-A3B is a very solid choice.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/space-invaders"&gt;space-invaders&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="llm"/><category term="uv"/><category term="qwen"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="lm-studio"/><category term="ai-in-china"/><category term="space-invaders"/></entry><entry><title>Qwen3-30B-A3B-Thinking-2507</title><link href="https://simonwillison.net/2025/Jul/30/qwen3-30b-a3b-thinking-2507/#atom-tag" rel="alternate"/><published>2025-07-30T15:36:54+00:00</published><updated>2025-07-30T15:36:54+00:00</updated><id>https://simonwillison.net/2025/Jul/30/qwen3-30b-a3b-thinking-2507/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507"&gt;Qwen3-30B-A3B-Thinking-2507&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Yesterday was &lt;a href="https://simonwillison.net/2025/Jul/29/qwen3-30b-a3b-instruct-2507/"&gt;Qwen3-30B-A3B-Instruct-2507&lt;/a&gt;. Qwen are clearly committed to their new split between reasoning and non-reasoning models (a reversal from Qwen 3 in April), because today they released the new reasoning partner to yesterday's model: &lt;strong&gt;Qwen3-30B-A3B-Thinking-2507&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;I'm surprised at how poorly this reasoning mode performs at "Generate an SVG of a pelican riding a bicycle" compared to its non-reasoning partner. The &lt;a href="https://gist.github.com/simonw/b523c029152f646ce4efb3c4dd5e1d01#reasoning"&gt;reasoning trace&lt;/a&gt; appears to carefully consider each component and how it should be positioned... and then &lt;a href="https://gist.github.com/simonw/b523c029152f646ce4efb3c4dd5e1d01#response"&gt;the final result&lt;/a&gt; looks like this:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A line with two dots, over a rhombus, with two circles and a pelican that looks like a grey snowman. They are not arranged in a sensible layout." src="https://static.simonwillison.net/static/2025/qwen3-30b-a3b-thinking-2507.png" /&gt;&lt;/p&gt;
&lt;p&gt;I ran this using &lt;a href="https://chat.qwen.ai/?model=Qwen3-30B-A3B-2507"&gt;chat.qwen.ai/?model=Qwen3-30B-A3B-2507&lt;/a&gt; with the "reasoning" option selected.&lt;/p&gt;
&lt;p&gt;I also tried the "Write an HTML and JavaScript page implementing space invaders" prompt I &lt;a href="https://simonwillison.net/2025/Jul/29/qwen3-30b-a3b-instruct-2507/#space-invaders"&gt;ran against the non-reasoning model&lt;/a&gt;. It did a better job in that &lt;a href="https://tools.simonwillison.net/space-invaders-qwen3-30b-a3b-thinking-2507"&gt;the game works&lt;/a&gt;:&lt;/p&gt;
&lt;div style="max-width: 100%; margin-bottom: 0.4em"&gt;
    &lt;video controls="controls" preload="none" aria-label="Space Invaders" poster="https://static.simonwillison.net/static/2025/qwen3-30b-a3b-thinking-2507-space-invaders.jpg" loop="loop" style="width: 100%; height: auto;" muted="muted"&gt;
        &lt;source src="https://static.simonwillison.net/static/2025/qwen3-30b-a3b-thinking-2507-space-invaders.mp4" type="video/mp4" /&gt;
    &lt;/video&gt;
&lt;/div&gt;

&lt;p&gt;It's not as playable as the one &lt;a href="https://simonwillison.net/2025/Jul/29/space-invaders/"&gt;I got from GLM-4.5 Air&lt;/a&gt; though - the invaders fire their bullets infrequently enough that the game isn't very challenging.&lt;/p&gt;
&lt;p&gt;This model is part of a flurry of releases from Qwen over the past nine days. Here's my coverage of each of those:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2025/Jul/22/qwen3-235b-a22b-instruct-2507/"&gt;Qwen3-235B-A22B-Instruct-2507&lt;/a&gt; - 21st July&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2025/Jul/22/qwen3-coder/"&gt;Qwen3-Coder-480B-A35B-Instruct&lt;/a&gt; - 22nd July&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2025/Jul/25/qwen3-235b-a22b-thinking-2507/"&gt;Qwen3-235B-A22B-Thinking-2507&lt;/a&gt; - 25th July&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2025/Jul/29/qwen3-30b-a3b-instruct-2507/"&gt;Qwen3-30B-A3B-Instruct-2507&lt;/a&gt; - 29th July&lt;/li&gt;
&lt;li&gt;Qwen3-30B-A3B-Thinking-2507 - today&lt;/li&gt;
&lt;/ul&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://x.com/Alibaba_Qwen/status/1950570969036361799"&gt;@Alibaba_Qwen&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/space-invaders"&gt;space-invaders&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="qwen"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/><category term="ai-in-china"/><category term="space-invaders"/></entry><entry><title>Qwen3-30B-A3B-Instruct-2507</title><link href="https://simonwillison.net/2025/Jul/29/qwen3-30b-a3b-instruct-2507/#atom-tag" rel="alternate"/><published>2025-07-29T18:57:33+00:00</published><updated>2025-07-29T18:57:33+00:00</updated><id>https://simonwillison.net/2025/Jul/29/qwen3-30b-a3b-instruct-2507/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507"&gt;Qwen3-30B-A3B-Instruct-2507&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
New model update from Qwen, improving on their previous &lt;a href="https://simonwillison.net/2025/Apr/29/qwen-3/"&gt;Qwen3-30B-A3B release&lt;/a&gt; from late April. In &lt;a href="https://x.com/Alibaba_Qwen/status/1950227114793586867"&gt;their tweet&lt;/a&gt; they said:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Smarter, faster, and local deployment-friendly.&lt;/p&gt;
&lt;p&gt;✨ Key Enhancements:&lt;br&gt;
✅ Enhanced reasoning, coding, and math skills&lt;br&gt;
✅ Broader multilingual knowledge&lt;br&gt;
✅ Improved long-context understanding (up to 256K tokens)&lt;br&gt;
✅ Better alignment with user intent and open-ended tasks&lt;br&gt;
✅ No more &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; blocks — now operating exclusively in non-thinking mode&lt;br&gt;&lt;/p&gt;
&lt;p&gt;🔧 With 3B activated parameters, it's approaching the performance of GPT-4o and Qwen3-235B-A22B Non-Thinking&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I tried &lt;a href="https://chat.qwen.ai/?model=Qwen3-30B-A3B-2507"&gt;the chat.qwen.ai&lt;/a&gt; hosted model with "Generate an SVG of a pelican riding a bicycle" and &lt;a href="https://gist.github.com/simonw/a498d4b2df887d079a9e338f8c4e5006"&gt;got this&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="This one is cute: blue sky, green grass, the sun is shining. The bicycle is a red block with wheels that looks more like a toy car. The pelican doesn't look like a pelican and has a quirky smile printed on its beak." src="https://static.simonwillison.net/static/2025/Qwen3-30B-A3B-2507.png" /&gt;&lt;/p&gt;
&lt;p&gt;I particularly enjoyed this detail from the SVG source code:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;!-- Bonus: Pelican's smile --&amp;gt;
&amp;lt;path d="M245,145 Q250,150 255,145" fill="none" stroke="#d4a037" stroke-width="2"/&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I went looking for quantized versions that could fit on my Mac and found &lt;a href="https://huggingface.co/lmstudio-community/Qwen3-30B-A3B-Instruct-2507-MLX-8bit"&gt;lmstudio-community/Qwen3-30B-A3B-Instruct-2507-MLX-8bit&lt;/a&gt; from &lt;a href="https://lmstudio.ai/"&gt;LM Studio&lt;/a&gt;. Getting that up and running was a 32.46GB download and it appears to use just over 30GB of RAM.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://gist.github.com/simonw/d608dc37cb7871f12caf8fbc0657fcad"&gt;pelican I got from that one&lt;/a&gt; wasn't as good:&lt;/p&gt;
&lt;p&gt;&lt;img alt="It looks more like a tall yellow hen chick riding a segway" src="https://static.simonwillison.net/static/2025/qwen3-30b-a3b-instruct-2507-mlx.png" /&gt;&lt;/p&gt;
&lt;p id="space-invaders"&gt;I then tried that local model on the "Write an HTML and JavaScript page implementing space invaders" task &lt;a href="https://simonwillison.net/2025/Jul/29/space-invaders/"&gt;that I ran against GLM-4.5 Air&lt;/a&gt;. The output &lt;a href="https://gist.github.com/simonw/965111fd6fac320b7eec50710c1761db"&gt;looked promising&lt;/a&gt;, in particular it seemed to be putting more effort into the design of the invaders (GLM-4.5 Air just used rectangles):&lt;/p&gt;

&lt;pre&gt;&lt;span class="pl-c"&gt;// Draw enemy ship&lt;/span&gt;
&lt;span class="pl-s1"&gt;ctx&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;fillStyle&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;color&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;

&lt;span class="pl-c"&gt;// Ship body&lt;/span&gt;
&lt;span class="pl-s1"&gt;ctx&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;fillRect&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;x&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;y&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;width&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;height&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;

&lt;span class="pl-c"&gt;// Enemy eyes&lt;/span&gt;
&lt;span class="pl-s1"&gt;ctx&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;fillStyle&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;'#fff'&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-s1"&gt;ctx&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;fillRect&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;x&lt;/span&gt; &lt;span class="pl-c1"&gt;+&lt;/span&gt; &lt;span class="pl-c1"&gt;6&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;y&lt;/span&gt; &lt;span class="pl-c1"&gt;+&lt;/span&gt; &lt;span class="pl-c1"&gt;5&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;4&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;4&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-s1"&gt;ctx&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;fillRect&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;x&lt;/span&gt; &lt;span class="pl-c1"&gt;+&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;width&lt;/span&gt; &lt;span class="pl-c1"&gt;-&lt;/span&gt; &lt;span class="pl-c1"&gt;10&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;y&lt;/span&gt; &lt;span class="pl-c1"&gt;+&lt;/span&gt; &lt;span class="pl-c1"&gt;5&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;4&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;4&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;

&lt;span class="pl-c"&gt;// Enemy antennae&lt;/span&gt;
&lt;span class="pl-s1"&gt;ctx&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;fillStyle&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;'#f00'&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;type&lt;/span&gt; &lt;span class="pl-c1"&gt;===&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
    &lt;span class="pl-c"&gt;// Basic enemy&lt;/span&gt;
    &lt;span class="pl-s1"&gt;ctx&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;fillRect&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;x&lt;/span&gt; &lt;span class="pl-c1"&gt;+&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;width&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;2&lt;/span&gt; &lt;span class="pl-c1"&gt;-&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;y&lt;/span&gt; &lt;span class="pl-c1"&gt;-&lt;/span&gt; &lt;span class="pl-c1"&gt;5&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;2&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;5&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-kos"&gt;}&lt;/span&gt; &lt;span class="pl-k"&gt;else&lt;/span&gt; &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;type&lt;/span&gt; &lt;span class="pl-c1"&gt;===&lt;/span&gt; &lt;span class="pl-c1"&gt;2&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
    &lt;span class="pl-c"&gt;// Fast enemy&lt;/span&gt;
    &lt;span class="pl-s1"&gt;ctx&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;fillRect&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;x&lt;/span&gt; &lt;span class="pl-c1"&gt;+&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;width&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;4&lt;/span&gt; &lt;span class="pl-c1"&gt;-&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;y&lt;/span&gt; &lt;span class="pl-c1"&gt;-&lt;/span&gt; &lt;span class="pl-c1"&gt;5&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;2&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;5&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-s1"&gt;ctx&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;fillRect&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;x&lt;/span&gt; &lt;span class="pl-c1"&gt;+&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-c1"&gt;3&lt;/span&gt; &lt;span class="pl-c1"&gt;*&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;width&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;4&lt;/span&gt; &lt;span class="pl-c1"&gt;-&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;y&lt;/span&gt; &lt;span class="pl-c1"&gt;-&lt;/span&gt; &lt;span class="pl-c1"&gt;5&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;2&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;5&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-kos"&gt;}&lt;/span&gt; &lt;span class="pl-k"&gt;else&lt;/span&gt; &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;type&lt;/span&gt; &lt;span class="pl-c1"&gt;===&lt;/span&gt; &lt;span class="pl-c1"&gt;3&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
    &lt;span class="pl-c"&gt;// Armored enemy&lt;/span&gt;
    &lt;span class="pl-s1"&gt;ctx&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;fillRect&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;x&lt;/span&gt; &lt;span class="pl-c1"&gt;+&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;width&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;2&lt;/span&gt; &lt;span class="pl-c1"&gt;-&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;y&lt;/span&gt; &lt;span class="pl-c1"&gt;-&lt;/span&gt; &lt;span class="pl-c1"&gt;8&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;2&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;8&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-s1"&gt;ctx&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;fillStyle&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;'#0f0'&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-s1"&gt;ctx&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;fillRect&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;x&lt;/span&gt; &lt;span class="pl-c1"&gt;+&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;width&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;2&lt;/span&gt; &lt;span class="pl-c1"&gt;-&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;y&lt;/span&gt; &lt;span class="pl-c1"&gt;-&lt;/span&gt; &lt;span class="pl-c1"&gt;6&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;2&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;3&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/pre&gt;

&lt;p&gt;But &lt;a href="https://static.simonwillison.net/static/2025/qwen3-30b-a3b-instruct-2507-mlx-space-invaders.html"&gt;the resulting code&lt;/a&gt; didn't actually work:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Black screen - a row of good looking space invaders advances across the screen for a moment... and then the entire screen goes blank." src="https://static.simonwillison.net/static/2025/qwen3-30b-a3b-instruct-2507-mlx-space-invaders.gif" /&gt;&lt;/p&gt;
&lt;p&gt;That same prompt against the unquantized Qwen-hosted model produced &lt;a href="https://gist.github.com/simonw/b61d161a8a969e4558c812a64dadbb45"&gt;a different result&lt;/a&gt; which sadly also resulted in an &lt;a href="https://static.simonwillison.net/static/2025/Qwen3-30B-A3B-2507-space-invaders.html"&gt;unplayable game&lt;/a&gt; - this time because everything moved too fast.&lt;/p&gt;
&lt;p&gt;This new Qwen model is a non-reasoning model, whereas GLM-4.5 and GLM-4.5 Air are both reasoners. It looks like at this scale the "reasoning" may make a material difference in terms of getting code that works out of the box.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/space-invaders"&gt;space-invaders&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="qwen"/><category term="mlx"/><category term="llm-reasoning"/><category term="llm-release"/><category term="lm-studio"/><category term="ai-in-china"/><category term="space-invaders"/></entry><entry><title>My 2.5 year old laptop can write Space Invaders in JavaScript now, using GLM-4.5 Air and MLX</title><link href="https://simonwillison.net/2025/Jul/29/space-invaders/#atom-tag" rel="alternate"/><published>2025-07-29T13:02:39+00:00</published><updated>2025-07-29T13:02:39+00:00</updated><id>https://simonwillison.net/2025/Jul/29/space-invaders/#atom-tag</id><summary type="html">
    &lt;p&gt;I wrote about the new &lt;a href="https://simonwillison.net/2025/Jul/28/glm-45/"&gt;GLM-4.5&lt;/a&gt; model family yesterday - new open weight (MIT licensed) models from &lt;a href="https://z.ai/"&gt;Z.ai&lt;/a&gt; in China which their benchmarks claim score highly in coding even against models such as Claude Sonnet 4.&lt;/p&gt;
&lt;p&gt;The models are pretty big - the smaller GLM-4.5 Air model is still 106 billion total parameters, which &lt;a href="https://huggingface.co/zai-org/GLM-4.5-Air"&gt;is 205.78GB&lt;/a&gt; on Hugging Face.&lt;/p&gt;
&lt;p&gt;Ivan Fioravanti &lt;a href="https://x.com/ivanfioravanti/status/1949911755028910557"&gt;built&lt;/a&gt; this &lt;a href="https://huggingface.co/mlx-community/GLM-4.5-Air-3bit"&gt;44GB 3bit quantized version for MLX&lt;/a&gt;, specifically sized so people with 64GB machines could have a chance of running it. I tried it out... and it works &lt;em&gt;extremely well&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;I fed it the following prompt:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;&lt;code&gt;Write an HTML and JavaScript page implementing space invaders&lt;/code&gt;&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;And it churned away for a while and produced &lt;a href="https://tools.simonwillison.net/space-invaders-GLM-4.5-Air-3bit"&gt;the following&lt;/a&gt;:&lt;/p&gt;

&lt;div style="max-width: 100%; margin-bottom: 0.4em"&gt;
    &lt;video controls="controls" preload="none" aria-label="Space Invaders" poster="https://static.simonwillison.net/static/2025/space-invaders.jpg" loop="loop" style="width: 100%; height: auto;" muted="muted"&gt;
        &lt;source src="https://static.simonwillison.net/static/2025/space-invaders.mp4" type="video/mp4" /&gt;
    &lt;/video&gt;
&lt;/div&gt;

&lt;p&gt;Clearly this isn't a particularly novel example, but I still think it's noteworthy that a model running on my 2.5 year old laptop (a 64GB MacBook Pro M2) is able to produce code like this - especially code that worked first time with no further edits needed.&lt;/p&gt;

&lt;h4 id="how-i-ran-the-model"&gt;How I ran the model&lt;/h4&gt;

&lt;p&gt;I had to run it using the current &lt;code&gt;main&lt;/code&gt; branch of the &lt;a href="https://github.com/ml-explore/mlx-lm"&gt;mlx-lm&lt;/a&gt; library (to ensure I had &lt;a href="https://github.com/ml-explore/mlx-lm/commit/489e63376b963ac02b3b7223f778dbecc164716b"&gt;this commit&lt;/a&gt; adding &lt;code&gt;glm4_moe&lt;/code&gt; support). I ran that using &lt;a href="https://github.com/astral-sh/uv"&gt;uv&lt;/a&gt; like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;uv run \
  --with &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;https://github.com/ml-explore/mlx-lm/archive/489e63376b963ac02b3b7223f778dbecc164716b.zip&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
  python&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then in that Python interpreter I used the standard recipe for running MLX models:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;mlx_lm&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;load&lt;/span&gt;, &lt;span class="pl-s1"&gt;generate&lt;/span&gt;
&lt;span class="pl-s1"&gt;model&lt;/span&gt;, &lt;span class="pl-s1"&gt;tokenizer&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;load&lt;/span&gt;(&lt;span class="pl-s"&gt;"mlx-community/GLM-4.5-Air-3bit"&lt;/span&gt;)&lt;/pre&gt;
&lt;p&gt;That downloaded 44GB of model weights to my  &lt;code&gt;~/.cache/huggingface/hub/models--mlx-community--GLM-4.5-Air-3bit&lt;/code&gt; folder.&lt;/p&gt;
&lt;p&gt;Then:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-s1"&gt;prompt&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"Write an HTML and JavaScript page implementing space invaders"&lt;/span&gt;
&lt;span class="pl-s1"&gt;messages&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; [{&lt;span class="pl-s"&gt;"role"&lt;/span&gt;: &lt;span class="pl-s"&gt;"user"&lt;/span&gt;, &lt;span class="pl-s"&gt;"content"&lt;/span&gt;: &lt;span class="pl-s1"&gt;prompt&lt;/span&gt;}]
&lt;span class="pl-s1"&gt;prompt&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;tokenizer&lt;/span&gt;.&lt;span class="pl-c1"&gt;apply_chat_template&lt;/span&gt;(
    &lt;span class="pl-s1"&gt;messages&lt;/span&gt;,
    &lt;span class="pl-s1"&gt;add_generation_prompt&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;
)
&lt;span class="pl-s1"&gt;response&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;generate&lt;/span&gt;(
    &lt;span class="pl-s1"&gt;model&lt;/span&gt;, &lt;span class="pl-s1"&gt;tokenizer&lt;/span&gt;,
    &lt;span class="pl-s1"&gt;prompt&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s1"&gt;prompt&lt;/span&gt;,
    &lt;span class="pl-s1"&gt;verbose&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;,
    &lt;span class="pl-s1"&gt;max_tokens&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;8192&lt;/span&gt;
)&lt;/pre&gt;
&lt;p&gt;The response started like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt;The user wants me to create a Space Invaders game using HTML, CSS, and JavaScript. I need to create a complete, functional game with the following features:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Player spaceship that can move left/right and shoot&lt;/li&gt;
&lt;li&gt;Enemy invaders that move in formation and shoot back&lt;/li&gt;
&lt;li&gt;Score tracking&lt;/li&gt;
&lt;li&gt;Lives/health system&lt;/li&gt;
&lt;li&gt;Game over conditions [...]&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
&lt;p&gt;Followed by the HTML and this debugging output:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Prompt: 14 tokens, 14.095 tokens-per-sec
Generation: 4193 tokens, 25.564 tokens-per-sec
Peak memory: 47.687 GB
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can see &lt;a href="https://gist.github.com/simonw/9f515c8e32fb791549aeb88304550893#file-space_invaders-txt-L61"&gt;the full transcript here&lt;/a&gt;, or view &lt;a href="https://github.com/simonw/tools/blob/9e04fd9895fae1aa9ac78b8e62d2833831fe0544/space-invaders-GLM-4.5-Air-3bit.html"&gt;the source on GitHub&lt;/a&gt;, or &lt;a href="https://tools.simonwillison.net/space-invaders-GLM-4.5-Air-3bit"&gt;try it out in your browser&lt;/a&gt;.&lt;/p&gt;
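&lt;p&gt;To try a result like this in a browser, the HTML needs to be pulled out of the response string first. Here's a minimal sketch of one way to do that - the helper name and fallback behaviour are my own assumptions, and it presumes the model wrapped its code in a fenced html block (it may emit bare HTML instead):&lt;/p&gt;

```python
import re

# "```" built as "`" * 3 to avoid nesting literal fences in this snippet
FENCE = "`" * 3

def extract_html(response: str) -> str:
    # Look for a fenced html block in the model output; fall back to
    # returning the whole response if the model emitted bare HTML.
    match = re.search(FENCE + r"html\n(.*?)" + FENCE, response, re.DOTALL)
    return match.group(1) if match else response

# Assuming `response` is the string returned by generate() above:
# with open("space-invaders.html", "w") as f:
#     f.write(extract_html(response))
```

&lt;p&gt;The transcript linked above shows the exact format the model actually used, so adjust accordingly.&lt;/p&gt;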

&lt;h4 id="pelican"&gt;A pelican for good measure&lt;/h4&gt;

&lt;p&gt;I ran &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;my pelican benchmark&lt;/a&gt; against the full sized models &lt;a href="https://simonwillison.net/2025/Jul/28/glm-45/"&gt;yesterday&lt;/a&gt;, but I couldn't resist trying it against this smaller 3bit model. Here's what I got for &lt;code&gt;"Generate an SVG of a pelican riding a bicycle"&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/glm-4.5-air-3b-pelican.png" alt="Blue background, pelican looks like a cloud with an orange bike, bicycle is recognizable as a bicycle if not quite the right geometry." /&gt;&lt;/p&gt;

&lt;p&gt;Here's the &lt;a href="https://gist.github.com/simonw/fe428f7cead72ad754f965a81117f5df"&gt;transcript for that&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In both cases the model used around 48GB of RAM at peak, leaving me with just 16GB for everything else - I had to quit quite a few apps in order to get the model to run but the speed was pretty good once it got going.&lt;/p&gt;

&lt;h4 id="local-coding-models"&gt;Local coding models are really good now&lt;/h4&gt;

&lt;p&gt;It's interesting how almost every model released in 2025 has specifically targeted coding. That focus has clearly been paying off: these coding models are getting &lt;em&gt;really good&lt;/em&gt; now.&lt;/p&gt;

&lt;p&gt;Two years ago when I &lt;a href="https://simonwillison.net/2023/Mar/11/llama/"&gt;first tried LLaMA&lt;/a&gt; I never &lt;em&gt;dreamed&lt;/em&gt; that the same laptop I was using then would one day be able to run models with capabilities as strong as what I'm seeing from GLM 4.5 Air - and Mistral 3.2 Small, and Gemma 3, and Qwen 3, and a host of other high quality models that have emerged over the past six months.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/space-invaders"&gt;space-invaders&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ivan-fioravanti"&gt;ivan-fioravanti&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/glm"&gt;glm&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="python"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="ai-assisted-programming"/><category term="uv"/><category term="mlx"/><category term="pelican-riding-a-bicycle"/><category term="ai-in-china"/><category term="space-invaders"/><category term="ivan-fioravanti"/><category term="glm"/></entry></feed>