<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: lm-studio</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/lm-studio.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2026-04-16T17:16:52+00:00</updated><author><name>Simon Willison</name></author><entry><title>Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7</title><link href="https://simonwillison.net/2026/Apr/16/qwen-beats-opus/#atom-tag" rel="alternate"/><published>2026-04-16T17:16:52+00:00</published><updated>2026-04-16T17:16:52+00:00</updated><id>https://simonwillison.net/2026/Apr/16/qwen-beats-opus/#atom-tag</id><summary type="html">
    &lt;p&gt;For anyone who has been (inadvisably) taking my &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;pelican riding a bicycle benchmark&lt;/a&gt; seriously as a robust way to test models, here are pelicans from this morning's two big model releases - &lt;a href="https://qwen.ai/blog?id=qwen3.6-35b-a3b"&gt;Qwen3.6-35B-A3B from Alibaba&lt;/a&gt; and &lt;a href="https://www.anthropic.com/news/claude-opus-4-7"&gt;Claude Opus 4.7 from Anthropic&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here's the Qwen 3.6 pelican, generated using &lt;a href="https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/blob/main/Qwen3.6-35B-A3B-UD-Q4_K_S.gguf"&gt;this 20.9GB Qwen3.6-35B-A3B-UD-Q4_K_S.gguf&lt;/a&gt; quantized model by Unsloth, running on my MacBook Pro M5 via &lt;a href="https://lmstudio.ai/"&gt;LM Studio&lt;/a&gt; (and the &lt;a href="https://github.com/agustif/llm-lmstudio"&gt;llm-lmstudio&lt;/a&gt; plugin) - &lt;a href="https://gist.github.com/simonw/4389d355d8e162bc6e4547da214f7dd2"&gt;transcript here&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/Qwen3.6-35B-A3B-UD-Q4_K_S-pelican.png" alt="The bicycle frame is the correct shape. There are clouds in the sky. The pelican has a dorky looking pouch. A caption on the ground reads Pelican on a Bicycle!" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And here's one I got from Anthropic's &lt;a href="https://www.anthropic.com/news/claude-opus-4-7"&gt;brand new Claude Opus 4.7&lt;/a&gt; (&lt;a href="https://gist.github.com/simonw/afcb19addf3f38eb1996e1ebe749c118"&gt;transcript&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/opus-4.7-pelican.png" alt="The bicycle frame is entirely the wrong shape. No clouds, a yellow sun. The pelican is looking behind itself, and has a less pronounced pouch than I would like." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I'm giving this one to Qwen 3.6. Opus managed to mess up the bicycle frame!&lt;/p&gt;
&lt;p&gt;I tried Opus a second time passing &lt;code&gt;thinking_level: max&lt;/code&gt;. It didn't do much better (&lt;a href="https://gist.github.com/simonw/7566e04a81accfb9affda83451c0f363"&gt;transcript&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/opus-4.7-pelican-max.png" alt="The bicycle frame is entirely the wrong shape but in a different way. Lines are more bold. Pelican looks a bit more like a pelican." style="max-width: 100%;" /&gt;&lt;/p&gt;

&lt;h4 id="i-dont-think-qwen-are-cheating"&gt;I don't think Qwen are cheating&lt;/h4&gt;
&lt;p&gt;A lot of people are &lt;a href="https://simonwillison.net/2025/Nov/13/training-for-pelicans-riding-bicycles/"&gt;convinced that the labs train for my stupid benchmark&lt;/a&gt;. I don't think they do, but honestly this result did give me a little glint of suspicion. So I'm burning one of my secret backup tests - here's what I got from Qwen3.6-35B-A3B and Opus 4.7 for "Generate an SVG of a flamingo riding a unicycle":&lt;/p&gt;

&lt;div style="display: flex; gap: 4px;"&gt;
  &lt;figure style="flex: 1; text-align: center; margin: 0;"&gt;
    &lt;figcaption style="margin-bottom: 1em"&gt;Qwen3.6-35B-A3B&lt;br /&gt;(&lt;a href="https://gist.github.com/simonw/f1d1ff01c34dda5fdedf684cfc430d92"&gt;transcript&lt;/a&gt;)&lt;/figcaption&gt;
    &lt;img src="https://static.simonwillison.net/static/2026/qwen-flamingo.png" alt="The unicycle spokes are a too long. The pelican has sunglasses, a bowtie and appears to be smoking a cigarette. It has two heart emoji surrounding the caption Flamingo on a Unicycle. It has a lot of charisma." style="max-width: 100%; height: auto;" /&gt;
  &lt;/figure&gt;
  &lt;figure style="flex: 1; text-align: center; margin: 0;"&gt;
    &lt;figcaption style="margin-bottom: 1em"&gt;Opus 4.7&lt;br /&gt;(&lt;a href="https://gist.github.com/simonw/35121ad5dcf23bf860397a103ae88d50"&gt;transcript&lt;/a&gt;)&lt;/figcaption&gt;
    &lt;img src="https://static.simonwillison.net/static/2026/opus-flamingo.png" alt="The unicycle has a black wheel. The flamingo is a competent if slightly dull vector illustration of a flamingo. It has no flair." style="max-width: 100%; height: auto;" /&gt;
  &lt;/figure&gt;
&lt;/div&gt;


&lt;p&gt;I'm giving this one to Qwen too, partly for the excellent &lt;code&gt;&amp;lt;!-- Sunglasses on flamingo! --&amp;gt;&lt;/code&gt; SVG comment.&lt;/p&gt;

&lt;h4 id="what-can-we-learn-from-this-"&gt;What can we learn from this?&lt;/h4&gt;
&lt;p&gt;The pelican benchmark has always been meant as a joke - it's mainly a statement on how obtuse and absurd the task of comparing these models is.&lt;/p&gt;
&lt;p&gt;The weird thing about that joke is that, for the most part, there has been a direct correlation between the quality of the pelicans produced and the general usefulness of the models. Those &lt;a href="https://simonwillison.net/2024/Oct/25/pelicans-on-a-bicycle/"&gt;first pelicans from October 2024&lt;/a&gt; were junk. The &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;more recent entries&lt;/a&gt; have generally been much, much better - to the point that Gemini 3.1 Pro produces &lt;a href="https://simonwillison.net/2026/Feb/19/gemini-31-pro/"&gt;illustrations you could actually use somewhere&lt;/a&gt;, provided you had a pressing need to illustrate a pelican riding a bicycle.&lt;/p&gt;
&lt;p&gt;Today, even that loose connection to utility has been broken. I have enormous respect for Qwen, but I very much doubt that a 21GB quantized version of their latest model is more powerful or useful than Anthropic's latest proprietary release.&lt;/p&gt;
&lt;p&gt;If the thing you need is an SVG illustration of a pelican riding a bicycle though, right now Qwen3.6-35B-A3B running on a laptop is a better bet than Opus 4.7!&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="anthropic"/><category term="claude"/><category term="qwen"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="lm-studio"/></entry><entry><title>Gemma 4: Byte for byte, the most capable open models</title><link href="https://simonwillison.net/2026/Apr/2/gemma-4/#atom-tag" rel="alternate"/><published>2026-04-02T18:28:54+00:00</published><updated>2026-04-02T18:28:54+00:00</updated><id>https://simonwillison.net/2026/Apr/2/gemma-4/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/"&gt;Gemma 4: Byte for byte, the most capable open models&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Four new vision-capable Apache 2.0 licensed reasoning LLMs from Google DeepMind, sized at 2B, 4B, 31B, plus a 26B-A4B Mixture-of-Experts.&lt;/p&gt;
&lt;p&gt;Google emphasize "unprecedented level of intelligence-per-parameter", providing yet more evidence that creating small useful models is one of the hottest areas of research right now.&lt;/p&gt;
&lt;p&gt;They actually label the two smaller models as E2B and E4B for "Effective" parameter size. The system card explains:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The smaller models incorporate Per-Layer Embeddings (PLE) to maximize parameter efficiency in on-device deployments. Rather than adding more layers or parameters to the model, PLE gives each decoder layer its own small embedding for every token. These embedding tables are large but are only used for quick lookups, which is why the effective parameter count is much smaller than the total.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I don't entirely understand that, but apparently that's what the "E" in E2B means!&lt;/p&gt;
&lt;p&gt;One particularly exciting feature of these models is that they are multi-modal beyond just images:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Vision and audio&lt;/strong&gt;: All models natively process video and images, supporting variable resolutions, and excelling at visual tasks like OCR and chart understanding. Additionally, the E2B and E4B models feature native audio input for speech recognition and understanding.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I've not figured out a way to run audio input locally - I don't think that feature is in LM Studio or Ollama yet.&lt;/p&gt;
&lt;p&gt;I tried them out using the GGUFs for &lt;a href="https://lmstudio.ai/models/gemma-4"&gt;LM Studio&lt;/a&gt;. The 2B (4.41GB), 4B (6.33GB) and 26B-A4B (17.99GB) models all worked perfectly, but the 31B (19.89GB) model was broken and spat out &lt;code&gt;"---\n"&lt;/code&gt; in a loop for every prompt I tried.&lt;/p&gt;
&lt;p&gt;The succession of &lt;a href="https://gist.github.com/simonw/12ae4711288637a722fd6bd4b4b56bdb"&gt;pelican quality&lt;/a&gt; from 2B to 4B to 26B-A4B is notable:&lt;/p&gt;
&lt;p&gt;E2B:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Two blue circles on a brown rectangle and a weird mess of orange blob and yellow triangle for the pelican" src="https://static.simonwillison.net/static/2026/gemma-4-2b-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;E4B:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Two black wheels joined by a sort of grey surfboard, the pelican is semicircles and a blue blob floating above it" src="https://static.simonwillison.net/static/2026/gemma-4-4b-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;26B-A4B:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Bicycle has the right pieces although the frame is wonky. Pelican is genuinely good, has a big triangle beak and a nice curved neck and is clearly a bird that is sitting on the bicycle" src="https://static.simonwillison.net/static/2026/gemma-4-26b-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;(This one actually had an SVG error - "error on line 18 at column 88: Attribute x1 redefined" - but after &lt;a href="https://gist.github.com/simonw/12ae4711288637a722fd6bd4b4b56bdb?permalink_comment_id=6074105#gistcomment-6074105"&gt;fixing that&lt;/a&gt; I got probably the best pelican I've seen yet from a model that runs on my laptop.)&lt;/p&gt;
&lt;p&gt;Google are providing API access to the two larger Gemma models via their &lt;a href="https://aistudio.google.com/prompts/new_chat?model=gemma-4-31b-it"&gt;AI Studio&lt;/a&gt;. I added support to &lt;a href="https://github.com/simonw/llm-gemini"&gt;llm-gemini&lt;/a&gt; and then &lt;a href="https://gist.github.com/simonw/f9f9e9c34c7cc0ef5325a2876413e51e"&gt;ran a pelican&lt;/a&gt; through the 31B model using that:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m gemini/gemma-4-31b-it 'Generate an SVG of a pelican riding a bicycle'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Pretty good, though it is missing the front part of the bicycle frame:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Motion blur lines, a mostly great bicycle albeit missing the front part of the frame. Pelican is decent. " src="https://static.simonwillison.net/static/2026/gemma-4-31b-pelican.png" /&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemma"&gt;gemma&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="vision-llms"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="gemma"/><category term="llm-release"/><category term="lm-studio"/></entry><entry><title>Olmo 3 is a fully open LLM</title><link href="https://simonwillison.net/2025/Nov/22/olmo-3/#atom-tag" rel="alternate"/><published>2025-11-22T23:59:46+00:00</published><updated>2025-11-22T23:59:46+00:00</updated><id>https://simonwillison.net/2025/Nov/22/olmo-3/#atom-tag</id><summary type="html">
    &lt;p&gt;Olmo is the LLM series from Ai2 - the &lt;a href="https://allenai.org/"&gt;Allen institute for AI&lt;/a&gt;. Unlike most open weight models these are notable for including the full training data, training process and checkpoints along with those releases.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://allenai.org/blog/olmo3"&gt;new Olmo 3&lt;/a&gt; claims to be "the best fully open 32B-scale thinking model" and has a strong focus on interpretability:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;At its center is &lt;strong&gt;Olmo 3-Think (32B)&lt;/strong&gt;, the best fully open 32B-scale thinking model that for the first time lets you inspect intermediate reasoning traces and trace those behaviors back to the data and training decisions that produced them.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;They've released four 7B models - Olmo 3-Base, Olmo 3-Instruct, Olmo 3-Think and Olmo 3-RL Zero, plus 32B variants of the 3-Think and 3-Base models.&lt;/p&gt;
&lt;p&gt;Having full access to the training data is really useful. Here's how they describe that:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Olmo 3 is pretrained on &lt;strong&gt;Dolma 3&lt;/strong&gt;, a new ~9.3-trillion-token corpus drawn from web pages, science PDFs processed with &lt;a href="https://olmocr.allenai.org/"&gt;olmOCR&lt;/a&gt;, codebases, math problems and solutions, and encyclopedic text. From this pool, we construct &lt;strong&gt;Dolma 3 Mix&lt;/strong&gt;, a 5.9-trillion-token (~6T) pretraining mix with a higher proportion of coding and mathematical data than earlier Dolma releases, plus much stronger decontamination via extensive deduplication, quality filtering, and careful control over data mixing. We follow established web standards in collecting training data and don't collect from sites that explicitly disallow it, including paywalled content.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;They also highlight that they are training on fewer tokens than their competition:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[...] it's the strongest fully open thinking model we're aware of, narrowing the gap to the best open-weight models of similar scale – such as Qwen 3 32B – while training on roughly 6x fewer tokens.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;If you're continuing to hold out hope for a model trained entirely on licensed data this one sadly won't fit the bill - a lot of that data still comes from a crawl of the web.&lt;/p&gt;
&lt;p&gt;I tried out the 32B Think model and the 7B Instruct model &lt;a href="https://lmstudio.ai/models/olmo3"&gt;using LM Studio&lt;/a&gt;. The 7B model is a 4.16GB download, the 32B one is 18.14GB.&lt;/p&gt;
&lt;p&gt;The 32B model is absolutely an over-thinker! I asked it to "Generate an SVG of a pelican riding a bicycle" and it thought for &lt;em&gt;14 minutes 43 seconds&lt;/em&gt;, outputting 8,437 tokens total most of which was &lt;a href="https://gist.github.com/simonw/2ae9d5ed71de9608b7955eea9671306f"&gt;this epic thinking trace&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I don't usually quote the full SVG in these write-ups, but in this case it's short enough that I think it's worth sharing. The SVG comments give a great impression of what it was trying to do - it has a Bicycle, Bike frame, Pelican, Left and Right wings and even "Feet on pedals".&lt;/p&gt;
&lt;div class="highlight highlight-text-xml-svg"&gt;&lt;pre&gt;&amp;lt;&lt;span class="pl-ent"&gt;svg&lt;/span&gt; &lt;span class="pl-e"&gt;width&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;200&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;height&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;200&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;viewBox&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;0 0 100 100&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;
  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Bicycle &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  &amp;lt;&lt;span class="pl-ent"&gt;circle&lt;/span&gt; &lt;span class="pl-e"&gt;cx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;30&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;cy&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;60&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;r&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;15&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;black&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;none&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;circle&lt;/span&gt; &lt;span class="pl-e"&gt;cx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;70&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;cy&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;60&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;r&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;15&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;black&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;none&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  
  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Bike frame &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  &amp;lt;&lt;span class="pl-ent"&gt;rect&lt;/span&gt; &lt;span class="pl-e"&gt;x&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;35&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;25&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;width&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;30&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;height&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;10&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;saddlebrown&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;line&lt;/span&gt; &lt;span class="pl-e"&gt;x1&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;35&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y1&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;40&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;x2&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;30&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y2&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;60&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;black&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke-width&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;3&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;line&lt;/span&gt; &lt;span class="pl-e"&gt;x1&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;65&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y1&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;40&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;x2&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;70&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y2&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;60&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;black&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke-width&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;3&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  
  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Pelican &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  &amp;lt;&lt;span class="pl-ent"&gt;ellipse&lt;/span&gt; &lt;span class="pl-e"&gt;cx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;55&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;cy&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;65&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;rx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;20&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;ry&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;15&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;white&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;polygon&lt;/span&gt; &lt;span class="pl-e"&gt;points&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;52 50,57 35,62 50&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;black&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt; &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Head/beak &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  &amp;lt;&lt;span class="pl-ent"&gt;circle&lt;/span&gt; &lt;span class="pl-e"&gt;cx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;55&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;cy&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;45&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;r&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;2&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;white&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;circle&lt;/span&gt; &lt;span class="pl-e"&gt;cx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;60&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;cy&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;45&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;r&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;2&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;white&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  
  &amp;lt;&lt;span class="pl-ent"&gt;polygon&lt;/span&gt; &lt;span class="pl-e"&gt;points&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;45 60,50 70,55 60&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;lightgrey&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt; &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Left wing &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  &amp;lt;&lt;span class="pl-ent"&gt;polygon&lt;/span&gt; &lt;span class="pl-e"&gt;points&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;65 60,70 70,55 60&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;lightgrey&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt; &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Right wing &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  
  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Feet on pedals &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  &amp;lt;&lt;span class="pl-ent"&gt;polygon&lt;/span&gt; &lt;span class="pl-e"&gt;points&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;25 75,30 85,35 75&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;black&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;polygon&lt;/span&gt; &lt;span class="pl-e"&gt;points&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;75 75,70 85,65 75&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;black&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
&amp;lt;/&lt;span class="pl-ent"&gt;svg&lt;/span&gt;&amp;gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Rendered it looks like this:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/olmo3-32b-pelican.jpg" alt="Two circles, each with a triangle sticking out from the bottom. They have bars leading up to a brown box. Overlapping them is a black triangle with white circles for eyes and two grey triangles that are probably meant to be wings. It is not recognizable as a pelican or a bicycle." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I tested OLMo 2 32B 4bit &lt;a href="https://simonwillison.net/2025/Mar/16/olmo2/"&gt;back in March&lt;/a&gt; and got something that, while pleasingly abstract, didn't come close to resembling a pelican or a bicycle:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/olmo2-pelican.jpg" alt="Blue and black wiggly lines looking more like a circuit diagram than a pelican riding a bicycle" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;To be fair 32B models generally don't do great with this. Here's Qwen 3 32B's attempt (I ran that just now &lt;a href="https://openrouter.ai/chat?models=qwen/qwen3-32b"&gt;using OpenRouter&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/qwen3-32b-pelican.png" alt="The bicycle is two black circles joined by two lines, with a weird rectangular saddle perched on top The pelican is a blue oval, a white circles with a yellow triangle in it and a weird eye shaped oval overlapping the blue one." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4 id="olmotrace"&gt;OlmoTrace&lt;/h4&gt;
&lt;p&gt;I was particularly keen on trying out the ability to "inspect intermediate reasoning traces". Here's how that's described later in the announcement:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A core goal of Olmo 3 is not just to &lt;em&gt;open&lt;/em&gt; the model flow, but to make it &lt;em&gt;actionable&lt;/em&gt; for people who want to understand and improve model behavior. Olmo 3 integrates with &lt;a href="https://allenai.org/blog/olmotrace"&gt;&lt;strong&gt;OlmoTrace&lt;/strong&gt;&lt;/a&gt;, our tool for tracing model outputs back to training data in real time.&lt;/p&gt;
&lt;p&gt;For example, in the Ai2 Playground, you can ask Olmo 3-Think (32B) to answer a general-knowledge question, then use OlmoTrace to inspect where and how the model may have learned to generate parts of its response. This closes the gap between training data and model behavior: you can see not only what the model is doing, but why---and adjust data or training decisions accordingly.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;You can access OlmoTrace via &lt;a href="https://playground.allenai.org/"&gt;playground.allenai.org&lt;/a&gt;, by first running a prompt and then clicking the "Show OlmoTrace" button below the output.&lt;/p&gt;
&lt;p&gt;I tried that on "Generate a conference bio for Simon Willison" (an ego-prompt I use to see how much the models have picked up about me from their training data) and got back a result that looked like this:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/olmotrace.jpg" alt="Screenshot of AI2 Playground interface with Olmo 3 32B Think model selected. User prompt reads &amp;quot;Generate a conference bio for Simon Willison&amp;quot;. AI response shows expanded &amp;quot;Thoughts&amp;quot; section: &amp;quot;Okay, the user wants me to generate a conference bio for Simon Willison. Let me start by recalling who Simon is. He's a well-known developer, entrepreneur, and author in the tech community. I remember he's the co-founder of co:here and now works at Anthropic. He's famous for his work with SQLite, especially the datasette project. Also, he writes a blog called SimonWillison.net which is quite popular. First, I need to structure a typical conference bio. These usually include his current role, past positions, notable projects, and maybe some achievements or recognitions. I should highlight his key contributions like datasette, which allows easy exploration of SQLite databases via a web interface. That's a big deal in data journalism and open data circles.&amp;quot; Right panel shows &amp;quot;OlmoTrace&amp;quot; feature described as &amp;quot;Documents from the training data that have exact text matches with the model response. Powered by infini-gram&amp;quot;. First document excerpt discusses technology and innovation, with highlighted match text &amp;quot;societal implications of technology, emphasizing the&amp;quot; shown in bold, surrounded by text about responsibility and merging innovation with intellect. Second document excerpt about Matt Hall has highlighted match &amp;quot;is a software engineer and entrepreneur based in&amp;quot; shown in bold, describing someone in New York City who co-founded a PFP collection and works at Google Creative Lab. Note indicates &amp;quot;Document repeated 2 times in result&amp;quot; with &amp;quot;View all repeated documents&amp;quot; link." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;It thinks I co-founded co:here and work at Anthropic, both of which are incorrect - but that's not uncommon with LLMs, I frequently see them suggest that I'm the CTO of GitHub and other such inaccuracies.&lt;/p&gt;
&lt;p&gt;I found the OlmoTrace panel on the right disappointing. None of the training documents it highlighted looked relevant - it appears to be looking for phrase matches (powered by &lt;a href="https://infini-gram.io/"&gt;Ai2's infini-gram&lt;/a&gt;) but the documents it found had nothing to do with me at all.&lt;/p&gt;
&lt;h4 id="can-open-training-data-address-concerns-of-backdoors-"&gt;Can open training data address concerns of backdoors?&lt;/h4&gt;
&lt;p&gt;Ai2 claim that Olmo 3 is "the best fully open 32B-scale thinking model", which I think holds up provided you define "fully open" as including open training data. There's not a great deal of competition in that space though - Ai2 compare themselves to &lt;a href="https://marin.community/"&gt;Stanford's Marin&lt;/a&gt; and &lt;a href="https://www.swiss-ai.org/apertus"&gt;Swiss AI's Apertus&lt;/a&gt;, neither of which I'd heard about before.&lt;/p&gt;
&lt;p&gt;A big disadvantage of other open weight models is that it's impossible to audit their training data. Anthropic published a paper last month showing that &lt;a href="https://www.anthropic.com/research/small-samples-poison"&gt;a small number of samples can poison LLMs of any size&lt;/a&gt; - it can take just "250 poisoned documents" to add a backdoor to a large model that triggers undesired behavior based on a short carefully crafted prompt.&lt;/p&gt;

&lt;p&gt;This makes fully open training data an even bigger deal.&lt;/p&gt;

&lt;p&gt;Ai2 researcher Nathan Lambert included this note about the importance of transparent training data in &lt;a href="https://www.interconnects.ai/p/olmo-3-americas-truly-open-reasoning"&gt;his detailed post about the release&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;In particular, we're excited about the future of RL Zero research on Olmo 3 precisely because everything is open. Researchers can study the interaction between the reasoning traces we include at midtraining and the downstream model behavior (qualitative and quantitative).&lt;/p&gt;

&lt;p&gt;This helps answer questions that have plagued RLVR results on Qwen models, hinting at forms of data contamination particularly on math and reasoning benchmarks (see Shao, Rulin, et al. "Spurious rewards: Rethinking training signals in rlvr." &lt;a href="https://arxiv.org/abs/2506.10947"&gt;arXiv preprint arXiv:2506.10947&lt;/a&gt; (2025). or Wu, Mingqi, et al. "Reasoning or memorization? unreliable results of reinforcement learning due to data contamination." &lt;a href="https://arxiv.org/abs/2507.10532"&gt;arXiv preprint arXiv:2507.10532&lt;/a&gt; (2025).)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I hope we see more competition in this space, including further models in the Olmo series. The improvements from Olmo 1 (in &lt;a href="https://simonwillison.net/2024/Feb/2/olmos/"&gt;February 2024&lt;/a&gt;) and Olmo 2 (in &lt;a href="https://simonwillison.net/2025/Mar/16/olmo2/"&gt;March 2025&lt;/a&gt;) have been significant. I'm hoping that trend continues!&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/interpretability"&gt;interpretability&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai2"&gt;ai2&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nathan-lambert"&gt;nathan-lambert&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/olmo"&gt;olmo&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="interpretability"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="ai2"/><category term="ai-ethics"/><category term="llm-release"/><category term="lm-studio"/><category term="nathan-lambert"/><category term="olmo"/></entry><entry><title>NVIDIA DGX Spark: great hardware, early days for the ecosystem</title><link href="https://simonwillison.net/2025/Oct/14/nvidia-dgx-spark/#atom-tag" rel="alternate"/><published>2025-10-14T23:36:21+00:00</published><updated>2025-10-14T23:36:21+00:00</updated><id>https://simonwillison.net/2025/Oct/14/nvidia-dgx-spark/#atom-tag</id><summary type="html">
    &lt;p&gt;NVIDIA sent me a preview unit of their new &lt;a href="https://www.nvidia.com/en-us/products/workstations/dgx-spark/"&gt;DGX Spark&lt;/a&gt; desktop "AI supercomputer". I've never had hardware to review before! You can consider this my first ever sponsored post if you like, but they did not pay me any cash and aside from an embargo date they did not request (nor would I grant) any editorial input into what I write about the device.&lt;/p&gt;
&lt;p&gt;The device retails for around $4,000. They officially go on sale tomorrow.&lt;/p&gt;
&lt;p&gt;First impressions are that this is a snazzy little computer. It's similar in size to a Mac mini, but with an exciting textured surface that feels refreshingly different and a little bit &lt;a href="https://www.indiewire.com/awards/industry/devs-cinematography-rob-hardy-alex-garland-1234583396/"&gt;science fiction&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/nvidia-spark.jpg" alt="A rectangular small computer, sitting horizontally on a box. It is about the width of a Mac Mini. It has a NVIDIA logo on  a reflective handle portion, then textured silver metal front, then another reflective handle at the other end. It's pretty and a bit weird looking. It sits on the box it came in, which has NVIDIA DGX Spark written on it in white text on green." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;There is a &lt;em&gt;very&lt;/em&gt; powerful machine tucked into that little box. Here are the specs, which I had Claude Code figure out for me by &lt;a href="https://gist.github.com/simonw/021651a14e6c5bf9876c9c4244ed6c2d"&gt;poking around on the device itself&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Hardware Specifications&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Architecture: aarch64 (ARM64)&lt;/li&gt;
&lt;li&gt;CPU: 20 cores
&lt;ul&gt;
&lt;li&gt;10x Cortex-X925 (performance cores)&lt;/li&gt;
&lt;li&gt;10x Cortex-A725 (efficiency cores)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;RAM: 119 GB total (112 GB available) - &lt;em&gt;I’m not sure why Claude reported it differently here, the machine is listed as 128GB - it looks like a &lt;a href="https://news.ycombinator.com/item?id=45586776#45588329"&gt;128GB == 119GiB thing&lt;/a&gt; because Claude &lt;a href="https://gist.github.com/simonw/021651a14e6c5bf9876c9c4244ed6c2d#file-nvidia-claude-code-txt-L41"&gt;used free -h&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;Storage: 3.7 TB (6% used, 3.3 TB available)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;GPU Specifications&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Model: NVIDIA GB10 (Blackwell architecture)&lt;/li&gt;
&lt;li&gt;Compute Capability: sm_121 (12.1)&lt;/li&gt;
&lt;li&gt;Memory: 119.68 GB&lt;/li&gt;
&lt;li&gt;Multi-processor Count: 48 streaming multiprocessors&lt;/li&gt;
&lt;li&gt;Architecture: Blackwell&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Short version: this is an ARM64 device with 128GB of memory that's available to both the GPU and the 20 CPU cores at the same time, strapped onto a 4TB NVMe SSD.&lt;/p&gt;
&lt;p&gt;The Spark is firmly targeted at “AI researchers”. It’s designed for both training and running models.&lt;/p&gt;
&lt;h4 id="the-tricky-bit-cuda-on-arm64"&gt;The tricky bit: CUDA on ARM64&lt;/h4&gt;
&lt;p&gt;Until now almost all of my own model running experiments have taken place on a Mac. This has gotten far less painful over the past year and a half thanks to the amazing work of the &lt;a href="https://simonwillison.net/tags/mlx/"&gt;MLX&lt;/a&gt; team and community, but it's still left me deeply frustrated at my lack of access to the NVIDIA CUDA ecosystem. I've lost count of the number of libraries and tutorials which expect you to be able to use Hugging Face Transformers or PyTorch with CUDA, and leave you high and dry if you don't have an NVIDIA GPU to run things on.&lt;/p&gt;
&lt;p&gt;Armed (ha) with my new NVIDIA GPU I was excited to dive into this world that had long eluded me... only to find that there was another assumption baked in to much of this software: x86 architecture for the rest of the machine.&lt;/p&gt;
&lt;p&gt;This resulted in all kinds of unexpected new traps for me to navigate. I eventually managed to get a PyTorch 2.7 wheel for CUDA on ARM, but failed to do so for 2.8. I'm not confident there because the wheel itself is unavailable but I'm finding navigating the PyTorch ARM ecosystem pretty confusing.&lt;/p&gt;
&lt;p&gt;NVIDIA are trying to make this easier, with mixed success. A lot of my initial challenges got easier when I found their &lt;a href="https://docs.nvidia.com/dgx/dgx-spark/nvidia-container-runtime-for-docker.html"&gt;official Docker container&lt;/a&gt;, so now I'm figuring out how best to use Docker with GPUs. Here's the current incantation that's been working for me:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;docker run -it --gpus=all \
  -v /usr/local/cuda:/usr/local/cuda:ro \
  nvcr.io/nvidia/cuda:13.0.1-devel-ubuntu24.04 \
  bash&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I have not yet got my head around the difference between CUDA 12 and 13. 13 appears to be very new, and a lot of the existing tutorials and libraries appear to expect 12.&lt;/p&gt;
&lt;h4 id="the-missing-documentation-isn-t-missing-any-more"&gt;The missing documentation isn't missing any more&lt;/h4&gt;
&lt;p&gt;When I first received this machine around a month ago there was very little in the way of documentation to help get me started. This meant climbing the steep NVIDIA+CUDA learning curve mostly on my own.&lt;/p&gt;
&lt;p&gt;This has changed &lt;em&gt;substantially&lt;/em&gt; in just the last week. NVIDIA now have extensive guides for getting things working on the Spark and they are a huge breath of fresh air - exactly the information I needed when I started exploring this hardware.&lt;/p&gt;
&lt;p&gt;Here's the &lt;a href="https://developer.nvidia.com/topics/ai/dgx-spark"&gt;getting started guide&lt;/a&gt;, details on the &lt;a href="https://build.nvidia.com/spark/dgx-dashboard/instructions"&gt;DGX dashboard web app&lt;/a&gt;, and the essential collection of &lt;a href="https://build.nvidia.com/spark"&gt;playbooks&lt;/a&gt;. There's still a lot I haven't tried yet just in this official set of guides.&lt;/p&gt;
&lt;h4 id="claude-code-for-everything"&gt;Claude Code for everything&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://www.claude.com/product/claude-code"&gt;Claude Code&lt;/a&gt; was an absolute lifesaver for me while I was trying to figure out how best to use this device. My Ubuntu skills were a little rusty, and I also needed to figure out CUDA drivers and Docker incantations and how to install the right versions of PyTorch. Claude 4.5 Sonnet is &lt;em&gt;much better than me&lt;/em&gt; at all of these things.&lt;/p&gt;
&lt;p&gt;Since many of my experiments took place in disposable Docker containers I had no qualms at all about running it in YOLO mode:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;IS_SANDBOX=1 claude --dangerously-skip-permissions&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;IS_SANDBOX=1&lt;/code&gt; environment variable stops Claude from complaining about running as root.&lt;/p&gt;

&lt;details&gt;&lt;summary style="font-style: italic"&gt;Before I found out about IS_SANDBOX&lt;/summary&gt;

&lt;p&gt;&lt;br /&gt;&lt;em&gt;I was &lt;a href="https://twitter.com/lawrencecchen/status/1978255934938886409"&gt;tipped off&lt;/a&gt; about IS_SANDBOX after I published this article. Here's my original workaround:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Claude understandably won't let you do this as root, even in a Docker container, so I found myself using the following incantation in a fresh &lt;code&gt;nvcr.io/nvidia/cuda:13.0.1-devel-ubuntu24.04&lt;/code&gt; instance pretty often:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;apt-get update &lt;span class="pl-k"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get install -y sudo
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; pick the first free UID &amp;gt;=1000&lt;/span&gt;
U=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;for i &lt;span class="pl-k"&gt;in&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;seq 1000 65000&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;do&lt;/span&gt; &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-k"&gt;!&lt;/span&gt; getent passwd &lt;span class="pl-smi"&gt;$i&lt;/span&gt; &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt;/dev/null&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;then&lt;/span&gt; &lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-smi"&gt;$i&lt;/span&gt;&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-c1"&gt;break&lt;/span&gt;&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;fi&lt;/span&gt;&lt;span class="pl-k"&gt;;&lt;/span&gt; done&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Chosen UID: &lt;span class="pl-smi"&gt;$U&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; same for a GID&lt;/span&gt;
G=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;for i &lt;span class="pl-k"&gt;in&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;seq 1000 65000&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;do&lt;/span&gt; &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-k"&gt;!&lt;/span&gt; getent group &lt;span class="pl-smi"&gt;$i&lt;/span&gt; &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt;/dev/null&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;then&lt;/span&gt; &lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-smi"&gt;$i&lt;/span&gt;&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-c1"&gt;break&lt;/span&gt;&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;fi&lt;/span&gt;&lt;span class="pl-k"&gt;;&lt;/span&gt; done&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Chosen GID: &lt;span class="pl-smi"&gt;$G&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; create user+group&lt;/span&gt;
groupadd -g &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-smi"&gt;$G&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; devgrp
useradd -m -u &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-smi"&gt;$U&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; -g &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-smi"&gt;$G&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; -s /bin/bash dev
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; enable password-less sudo:&lt;/span&gt;
&lt;span class="pl-c1"&gt;printf&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;dev ALL=(ALL) NOPASSWD:ALL\n&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; /etc/sudoers.d/90-dev-nopasswd
chmod 0440 /etc/sudoers.d/90-dev-nopasswd
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Install npm&lt;/span&gt;
DEBIAN_FRONTEND=noninteractive TZ=Etc/UTC apt-get install -y npm
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Install Claude&lt;/span&gt;
npm install -g @anthropic-ai/claude-code&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then switch to the &lt;code&gt;dev&lt;/code&gt; user and run Claude for the first time:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;su - dev
claude --dangerously-skip-permissions&lt;/pre&gt;&lt;/div&gt;

&lt;/details&gt;&lt;br /&gt;

&lt;p&gt;This will provide a URL which you can visit to authenticate with your Anthropic account, confirming by copying back a token and pasting it into the terminal.&lt;/p&gt;
&lt;p&gt;Docker tip: you can create a snapshot of the current image (with Claude installed) by running &lt;code&gt;docker ps&lt;/code&gt; to get the container ID and then:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;docker commit --pause=false &lt;span class="pl-k"&gt;&amp;lt;&lt;/span&gt;container_id&lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; cc:snapshot&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then later you can start a similar container using:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;docker run -it \
  --gpus=all \
  -v /usr/local/cuda:/usr/local/cuda:ro \
  cc:snapshot bash&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here's an example of the kinds of prompts I've been running in Claude Code inside the container:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;I want to run https://huggingface.co/unsloth/Qwen3-4B-GGUF using llama.cpp - figure out how to get llama cpp working on this machine  such that it runs with the GPU, then install it in this directory and get that model to work to serve a prompt. Goal is to get this  command to run: llama-cli -hf unsloth/Qwen3-4B-GGUF -p "I believe the meaning of life is" -n 128 -no-cnv&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That one worked flawlessly - Claude checked out the &lt;code&gt;llama.cpp&lt;/code&gt; repo, compiled it for me and iterated on it until it could run that model on the GPU. Here's a &lt;a href="https://gist.github.com/simonw/3e7d28d9ed222d842f729bfca46d6673"&gt;full transcript&lt;/a&gt;, converted from Claude's &lt;code&gt;.jsonl&lt;/code&gt; log format to Markdown using a script I &lt;a href="https://github.com/simonw/tools/blob/main/python/claude_to_markdown.py"&gt;vibe coded just now&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I later told it:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Write out a markdown file with detailed notes on what you did. Start with the shortest form of notes on how to get a successful build, then add a full account of everything you tried, what went wrong and how you fixed it.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Which produced &lt;a href="https://gist.github.com/simonw/0942d96f616b9e328568ab27d911c8ed"&gt;this handy set of notes&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="tailscale-was-made-for-this"&gt;Tailscale was made for this&lt;/h4&gt;
&lt;p&gt;Having a machine like this on my local network is neat, but what's even neater is being able to access it from anywhere else in the world, from both my phone and my laptop.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://tailscale.com/"&gt;Tailscale&lt;/a&gt; is &lt;em&gt;perfect&lt;/em&gt; for this. I installed it on the Spark (using the &lt;a href="https://tailscale.com/kb/1031/install-linux"&gt;Ubuntu instructions here&lt;/a&gt;), signed in with my SSO account (via Google)... and the Spark showed up in the "Network Devices" panel on my laptop and phone instantly.&lt;/p&gt;
&lt;p&gt;I can SSH in from my laptop or using the &lt;a href="https://termius.com/free-ssh-client-for-iphone"&gt;Termius iPhone app&lt;/a&gt; on my phone. I've also been running tools like &lt;a href="https://openwebui.com/"&gt;Open WebUI&lt;/a&gt; which give me a mobile-friendly web interface for interacting with LLMs on the Spark.&lt;/p&gt;
&lt;h4 id="here-comes-the-ecosystem"&gt;Here comes the ecosystem&lt;/h4&gt;
&lt;p&gt;The embargo on these devices dropped yesterday afternoon, and it turns out a whole bunch of relevant projects have had similar preview access to myself. This is &lt;em&gt;fantastic news&lt;/em&gt; as many of the things I've been trying to figure out myself suddenly got a whole lot easier.&lt;/p&gt;
&lt;p&gt;Four particularly notable examples:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Ollama &lt;a href="https://ollama.com/blog/nvidia-spark"&gt;works out of the box&lt;/a&gt;. They actually had a build that worked a few weeks ago, and were the first success I had running an LLM on the machine.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;llama.cpp&lt;/code&gt; creator Georgi Gerganov just published  &lt;a href="https://github.com/ggml-org/llama.cpp/discussions/16578"&gt;extensive benchmark results&lt;/a&gt; from running &lt;code&gt;llama.cpp&lt;/code&gt; on a Spark. He's getting ~3,600 tokens/second to read the prompt and ~59 tokens/second to generate a response with the MXFP4 version of GPT-OSS 20B and ~817 tokens/second to read and ~18 tokens/second to generate for GLM-4.5-Air-GGUF.&lt;/li&gt;
&lt;li&gt;LM Studio now have &lt;a href="https://lmstudio.ai/blog/dgx-spark"&gt;a build for the Spark&lt;/a&gt;. I haven't tried this one yet as I'm currently using my machine exclusively via SSH.&lt;/li&gt;
&lt;li&gt;vLLM - one of the most popular engines for serving production LLMs - had &lt;a href="https://x.com/eqhylxx/status/1977928690945360049"&gt;early access&lt;/a&gt; and there's now an official &lt;a href="https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm?version=25.09-py3"&gt;NVIDIA vLLM NGC Container&lt;/a&gt; for running their stack.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here's &lt;a href="https://docs.unsloth.ai/new/fine-tuning-llms-with-nvidia-dgx-spark-and-unsloth"&gt;a tutorial from Unsloth&lt;/a&gt; on fine-tuning gpt-oss-20b on the Spark.&lt;/p&gt;
&lt;h4 id="should-you-get-one-"&gt;Should you get one?&lt;/h4&gt;
&lt;p&gt;It's a bit too early for me to provide a confident recommendation concerning this machine. As indicated above, I've had a tough time figuring out how best to put it to use, largely through my own inexperience with CUDA, ARM64 and Ubuntu GPU machines in general.&lt;/p&gt;
&lt;p&gt;The ecosystem improvements in just the past 24 hours have been very reassuring though. I expect it will be clear within a few weeks how well supported this machine is going to be.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/hardware"&gt;hardware&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/docker"&gt;docker&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tailscale"&gt;tailscale&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nvidia"&gt;nvidia&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama-cpp"&gt;llama-cpp&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-code"&gt;claude-code&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/disclosures"&gt;disclosures&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nvidia-spark"&gt;nvidia-spark&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="hardware"/><category term="ai"/><category term="docker"/><category term="tailscale"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="nvidia"/><category term="ollama"/><category term="llama-cpp"/><category term="coding-agents"/><category term="claude-code"/><category term="lm-studio"/><category term="disclosures"/><category term="nvidia-spark"/></entry><entry><title>TIL: Running a gpt-oss eval suite against LM Studio on a Mac</title><link href="https://simonwillison.net/2025/Aug/17/gpt-oss-eval-suite/#atom-tag" rel="alternate"/><published>2025-08-17T03:46:21+00:00</published><updated>2025-08-17T03:46:21+00:00</updated><id>https://simonwillison.net/2025/Aug/17/gpt-oss-eval-suite/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://til.simonwillison.net/llms/gpt-oss-evals"&gt;TIL: Running a gpt-oss eval suite against LM Studio on a Mac&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
The other day &lt;a href="https://simonwillison.net/2025/Aug/15/inconsistent-performance/#update"&gt;I learned&lt;/a&gt; that OpenAI published a set of evals as part of their gpt-oss model release, described in their cookbook on &lt;a href="https://cookbook.openai.com/articles/gpt-oss/verifying-implementations"&gt;Verifying gpt-oss implementations&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I decided to try and run that eval suite on my own MacBook Pro, against &lt;code&gt;gpt-oss-20b&lt;/code&gt; running inside of LM Studio.&lt;/p&gt;
&lt;p&gt;TLDR: once I had the model running inside LM Studio with a longer than default context limit, the following incantation ran an eval suite in around 3.5 hours:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;mkdir /tmp/aime25_openai
OPENAI_API_KEY=x \
  uv run --python 3.13 --with 'gpt-oss[eval]' \
  python -m gpt_oss.evals \
  --base-url http://localhost:1234/v1 \
  --eval aime25 \
  --sampler chat_completions \
  --model openai/gpt-oss-20b \
  --reasoning-effort low \
  --n-threads 2
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;My &lt;a href="https://til.simonwillison.net/llms/gpt-oss-evals"&gt;new TIL&lt;/a&gt; breaks that command down in detail and walks through the underlying eval - AIME 2025, which asks 30 questions (8 times each) that are defined using the following format:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;{"question": "Find the sum of all integer bases $b&amp;gt;9$ for which $17_{b}$ is a divisor of $97_{b}$.", "answer": "70"}&lt;/code&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/til"&gt;til&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/evals"&gt;evals&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-oss"&gt;gpt-oss&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="ai"/><category term="til"/><category term="openai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="evals"/><category term="uv"/><category term="lm-studio"/><category term="gpt-oss"/></entry><entry><title>Introducing Gemma 3 270M: The compact model for hyper-efficient AI</title><link href="https://simonwillison.net/2025/Aug/14/gemma-3-270m/#atom-tag" rel="alternate"/><published>2025-08-14T17:22:36+00:00</published><updated>2025-08-14T17:22:36+00:00</updated><id>https://simonwillison.net/2025/Aug/14/gemma-3-270m/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://developers.googleblog.com/en/introducing-gemma-3-270m/"&gt;Introducing Gemma 3 270M: The compact model for hyper-efficient AI&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
New from Google:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Gemma 3 270M, a compact, 270-million parameter model designed from the ground up for task-specific fine-tuning with strong instruction-following and text structuring capabilities already trained in.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This model is &lt;em&gt;tiny&lt;/em&gt;. The version I tried was &lt;a href="https://lmstudio.ai/models/google/gemma-3-270m"&gt;the LM Studio GGUF one&lt;/a&gt;, a 241MB download.&lt;/p&gt;
&lt;p&gt;It works! You can say "hi" to it and ask it very basic questions like "What is the capital of France".&lt;/p&gt;
&lt;p&gt;I tried "Generate an SVG of a pelican riding a bicycle" &lt;a href="https://gist.github.com/simonw/25e7b7afd6a63a2f15db48b3a51ec9bc"&gt;about a dozen times&lt;/a&gt; and didn't once get back an SVG that was more than just a blank square... but at one point it did decide to write me this poem instead, which was nice:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;+-----------------------+
|   Pelican Riding Bike |
+-----------------------+
|  This is the cat!  |
|  He's got big wings and a happy tail.  |
|  He loves to ride his bike!  |
+-----------------------+
|   Bike lights are shining bright.  |
|   He's got a shiny top, too!  |
|   He's ready for adventure!  |
+-----------------------+
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That's not really the point though. The Gemma 3 team make it very clear that the goal of this model is to support fine-tuning: a model this tiny is never going to be useful for general purpose LLM tasks, but given the right fine-tuning data it should be able to specialize for all sorts of things:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In engineering, success is defined by efficiency, not just raw power. You wouldn't use a sledgehammer to hang a picture frame. The same principle applies to building with AI.&lt;/p&gt;
&lt;p&gt;Gemma 3 270M embodies this "right tool for the job" philosophy. It's a high-quality foundation model that follows instructions well out of the box, and its true power is unlocked through fine-tuning. Once specialized, it can execute tasks like text classification and data extraction with remarkable accuracy, speed, and cost-effectiveness. By starting with a compact, capable model, you can build production systems that are lean, fast, and dramatically cheaper to operate.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's their tutorial on &lt;a href="https://ai.google.dev/gemma/docs/core/huggingface_text_full_finetune"&gt;Full Model Fine-Tune using Hugging Face Transformers&lt;/a&gt;, which I have not yet attempted to follow.&lt;/p&gt;
&lt;p&gt;I imagine this model will be particularly fun to play with directly in a browser using &lt;a href="https://huggingface.co/docs/transformers.js/en/index"&gt;transformers.js&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: It is! Here's &lt;a href="https://huggingface.co/spaces/webml-community/bedtime-story-generator"&gt;a bedtime story generator&lt;/a&gt; using Transformers.js (requires WebGPU, so Chrome-like browsers only). Here's &lt;a href="https://huggingface.co/spaces/webml-community/bedtime-story-generator/tree/main"&gt;the source code&lt;/a&gt; for that demo.

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=44902148"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemma"&gt;gemma&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="gemini"/><category term="pelican-riding-a-bicycle"/><category term="gemma"/><category term="llm-release"/><category term="lm-studio"/></entry><entry><title>Qwen3-4B-Thinking: "This is art - pelicans don't ride bikes!"</title><link href="https://simonwillison.net/2025/Aug/10/qwen3-4b/#atom-tag" rel="alternate"/><published>2025-08-10T23:59:12+00:00</published><updated>2025-08-10T23:59:12+00:00</updated><id>https://simonwillison.net/2025/Aug/10/qwen3-4b/#atom-tag</id><summary type="html">
    &lt;p&gt;I've fallen a few days behind keeping up with Qwen. They released two new 4B models last week:  &lt;a href="https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507"&gt;Qwen3-4B-Instruct-2507&lt;/a&gt; and its thinking equivalent &lt;a href="https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507"&gt;Qwen3-4B-Thinking-2507&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;These are relatively tiny models that punch &lt;em&gt;way&lt;/em&gt; above their weight. I’ve been running the 8bit GGUF varieties via LM Studio (here's &lt;a href="https://huggingface.co/lmstudio-community/Qwen3-4B-Instruct-2507-GGUF"&gt;Instruct&lt;/a&gt;, here's &lt;a href="https://huggingface.co/lmstudio-community/Qwen3-4B-Thinking-2507-GGUF"&gt;Thinking&lt;/a&gt;) - both of them are 4GB downloads that use around 4.3GB of my M2 MacBook Pro’s system RAM while running. Both are way more capable than I would expect from such small files.&lt;/p&gt;
&lt;p&gt;Qwen3-4B-Thinking is the first model I've tried which called out the absurdity of being asked to draw &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;a pelican riding a bicycle&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/cd41769de4791b98ba43a17c6ff32ee1"&gt;the full transcript&lt;/a&gt; - some highlights from the thinking trace:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Okay, the user asked for an SVG of a pelican riding a bicycle. That's... oddly specific and completely unrealistic.&lt;/p&gt;
&lt;p&gt;First reaction: Pelicans don't ride bicycles. They're birds that fly or swim, not human-like vehicles. This feels like a mix-up - maybe they meant "a cyclist" but typed "pelican"? Or perhaps it's a joke?&lt;/p&gt;
&lt;p&gt;Hmm, user might be:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A kid playing with words (pelican + bicycle = silly combo)&lt;/li&gt;
&lt;li&gt;Someone testing if I'll generate impossible images&lt;/li&gt;
&lt;li&gt;Actually wanting a pelican &lt;em&gt;on&lt;/em&gt; a bike for art project? But that still doesn't make sense biologically&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;Checks SVG capabilities&lt;/em&gt;
SVG is for vector graphics, but this request violates physics and biology. If I tried to draw it literally:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Pelican would be too big for a bike frame&lt;/li&gt;
&lt;li&gt;Bike wheels would be tiny compared to bird wings&lt;/li&gt;
&lt;li&gt;No realistic way to show "riding" since pelicans don't have legs for pedaling&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;[…]&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Self-check&lt;/em&gt;: Am I being too pedantic? Nah - if someone asks for impossible things, it's better to gently correct than make fake art that could confuse them. […]&lt;/p&gt;
&lt;p&gt;I appreciate your creative request! 🦅🚴‍♂️ However, &lt;strong&gt;pelicans don't ride bicycles&lt;/strong&gt; (they're birds that fly/swim, not humans with bike skills). Generating this as a literal SVG would be biologically impossible and visually confusing.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;"This request violates physics and biology" - Qwen3-4B-Thinking saying what no other model ever thought to express.&lt;/p&gt;
&lt;p&gt;I also really enjoyed the "Am I being too pedantic? Nah" line. This model is sassy!&lt;/p&gt;
&lt;p&gt;It continued on to spit out two SVGS: the first titled "Most Likely: A Human Cyclist" and the second "Artistic Interpretation: Pelican &lt;em&gt;On&lt;/em&gt; a Bicycle (For Fun!)".&lt;/p&gt;
&lt;p&gt;They were both terrible. Here's the human one:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/qwen3-4b-thinking-human.png" alt="It's a green horizontal line, three abstract black shapes, a brown circle and the word Person" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And here's the supposed pelican:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/qwen3-4b-thinking-pelican.png" alt="A blue circle, with a small black shape that might be a wing and a black shape that could maybe be the head? It's awful. Red text overlaps it that says This is art - pelicans don't ride bikes! - there is no attempt at a bicycle." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I like Qwen's decision to include the clarifying annotation "This is art - pelicans don't ride bikes!":&lt;/p&gt;
&lt;p&gt;I tried the Qwen3-4B-Instruct non-thinking model too. It &lt;a href="https://gist.github.com/simonw/ad927a3849d0aece043afc97559be4bf"&gt;answered much faster&lt;/a&gt; (no time spent questioning my choice of task with its thinking tokens) and gave me this:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/qwen3-4b-instruct-2507-pelican.png" alt="A bunch of shaps. Pelican Riding a Bike! transposed on top. The yellow and orange bits might be a pelican I guess. The bicycle has two wheels overlapping too close and a single bar in the wrong direction." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;4B is such an interesting model size. These models should run on almost anything and, at least on my M2 MacBook, they run &lt;em&gt;fast&lt;/em&gt;. I'm getting 50+ tokens per second and they're using just less than 4.5GB of RAM while running.&lt;/p&gt;
&lt;p&gt;The question is always how useful such a tiny model can be. Clearly it's not great for SVG pelican illustrations!&lt;/p&gt;

&lt;p&gt;I did get a useful result out of the &lt;code&gt;-Thinking&lt;/code&gt; variant for a &lt;code&gt;jq&lt;/code&gt; expression I needed. I prompted:&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;queries[0].rows is an array of objects each with a markdown key - write a jq bash one liner to output a raw string if that markdown concatenated together with double newlines between each&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;It thought &lt;a href="https://gist.github.com/simonw/3f76749aa710f4a2d6405ebcf5b00ac4"&gt;for 3 minutes 13 seconds&lt;/a&gt; before spitting out a recipe that did roughly what I wanted:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;jq -r '.queries[0].rows[] | .markdown' | tr '\n' '\n\n'&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;I'm not sure that was worth waiting three minutes for though!&lt;/p&gt;

&lt;p&gt;These models have a 262,144 token context - wildly impressive, &lt;em&gt;if it works&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;So I tried another experiment: I used the Instruct model to summarize &lt;a href="https://news.ycombinator.com/item?id=44851557"&gt;this Hacker News conversation about GPT-5&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I did this with the &lt;a href="https://github.com/agustif/llm-lmstudio"&gt;llm-lmstudio&lt;/a&gt; plugin for LLM combined with my &lt;a href="https://til.simonwillison.net/llms/claude-hacker-news-themes"&gt;hn-summary.sh script&lt;/a&gt;, which meant I could run the experiment like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;hn-summary.sh 44851557 -m qwen3-4b-instruct-2507
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I believe this is 15,785 tokens - so nothing close to the 262,144 maximum but still an interesting test of a 4GB local model.&lt;/p&gt;
&lt;p&gt;The good news is Qwen spat out a genuinely useful summary of the conversation! You can &lt;a href="https://gist.github.com/simonw/4c5a1912f73e0d68b456b18000a76f0d#response"&gt;read that here&lt;/a&gt; - it's the best I've seen yet from a model running on my laptop, though honestly I've not tried many other recent models in this way.&lt;/p&gt;
&lt;p&gt;The bad news... it took almost five minutes to process and return the result!&lt;/p&gt;
&lt;p&gt;As a loose calculation, if the model can output 50 tokens/second maybe there's a similar speed for processing incoming input.. in which case 15785 / 50 = 315 seconds which is 5m15s.&lt;/p&gt;
&lt;p&gt;Hosted models can crunch through 15,000 tokens of input in just a few seconds. I guess this is one of the more material limitations of running models on Apple silicon as opposed to dedicated GPUs.&lt;/p&gt;
&lt;p&gt;I think I'm going to spend some more time with these models. They're fun, they have personality and I'm confident there are classes of useful problems they will prove capable at despite their small size. Their ability at summarization should make them a good fit for local RAG, and I've not started exploring their tool calling abilities yet.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="qwen"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="lm-studio"/><category term="ai-in-china"/></entry><entry><title>OpenAI's new open weight (Apache 2) models are really good</title><link href="https://simonwillison.net/2025/Aug/5/gpt-oss/#atom-tag" rel="alternate"/><published>2025-08-05T20:33:13+00:00</published><updated>2025-08-05T20:33:13+00:00</updated><id>https://simonwillison.net/2025/Aug/5/gpt-oss/#atom-tag</id><summary type="html">
    &lt;p&gt;The long promised &lt;a href="https://openai.com/index/introducing-gpt-oss/"&gt;OpenAI open weight models are here&lt;/a&gt;, and they are &lt;em&gt;very&lt;/em&gt; impressive. They're available under proper open source licenses - Apache 2.0 - and come in two sizes, 120B and 20B.&lt;/p&gt;
&lt;p&gt;OpenAI's own benchmarks are eyebrow-raising - emphasis mine:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The &lt;strong&gt;gpt-oss-120b&lt;/strong&gt; model achieves &lt;strong&gt;near-parity with OpenAI o4-mini&lt;/strong&gt; on core reasoning benchmarks, while running efficiently on a single 80 GB GPU. The &lt;strong&gt;gpt-oss-20b&lt;/strong&gt; model delivers &lt;strong&gt;similar results to OpenAI o3‑mini&lt;/strong&gt; on common benchmarks and can run on edge devices with just 16 GB of memory, making it ideal for on-device use cases, local inference, or rapid iteration without costly infrastructure.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;o4-mini and o3-mini are &lt;em&gt;really good&lt;/em&gt; proprietary models - I was not expecting the open weights releases to be anywhere near that class, especially given their small sizes. That gpt-oss-20b model should run quite comfortably on a Mac laptop with 32GB of RAM.&lt;/p&gt;
&lt;p&gt;Both models are mixture-of-experts:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;gpt-oss-120b activates 5.1B parameters per token, while gpt-oss-20b activates 3.6B. The models have 117b and 21b total parameters respectively.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Something that surprised me even more about the benchmarks was the scores for general knowledge based challenges. I can just about believe they managed to train a strong reasoning model that fits in 20B parameters, but these models score highly on benchmarks like "GPQA Diamond (without tools) PhD-level science questions" too:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;o3 — 83.3%&lt;/li&gt;
&lt;li&gt;o4-mini — 81.4%&lt;/li&gt;
&lt;li&gt;gpt-oss-120b — 80.1%&lt;/li&gt;
&lt;li&gt;o3-mini — 77%&lt;/li&gt;
&lt;li&gt;gpt-oss-20b — 71.5%&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A lot of these benchmarks are edging towards saturated.&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#running-gpt-oss-20b-on-my-mac-with-lm-studio"&gt;Running gpt-oss-20b on my Mac with LM Studio&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#pelican-on-reasoning-low"&gt;Pelican on reasoning=low&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#pelican-on-reasoning-medium"&gt;Pelican on reasoning=medium&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#pelican-on-reasoning-high"&gt;Pelican on reasoning=high&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#space-invaders-with-gpt-oss-20b"&gt;Space invaders with gpt-oss-20b&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#trying-gpt-oss-120b-via-api-providers"&gt;Trying gpt-oss-120b via API providers&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#llama-cpp-is-coming-very-shortly"&gt;llama.cpp is coming very shortly&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#gpt-oss-20b-in-ollama"&gt;gpt-oss:20b in Ollama&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#the-model-card"&gt;Training details from the model card&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#openai-harmony-a-new-format-for-prompt-templates"&gt;OpenAI Harmony, a new format for prompt templates&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#the-open-question-for-me-how-good-is-tool-calling-"&gt;The open question for me: how good is tool calling?&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#china"&gt;Competing with the Chinese open models&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="running-gpt-oss-20b-on-my-mac-with-lm-studio"&gt;Running gpt-oss-20b on my Mac with LM Studio&lt;/h4&gt;
&lt;p&gt;There are already a bunch of different ways to run these models - OpenAI partnered with numerous organizations in advance of the release.&lt;/p&gt;
&lt;p&gt;I decided to start with &lt;a href="https://lmstudio.ai/"&gt;LM Studio&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I had to update to the most recent version of the app, then install the new model from &lt;a href="https://lmstudio.ai/models/openai/gpt-oss-20b"&gt;their openai/gpt-oss-20b&lt;/a&gt; page.&lt;/p&gt;
&lt;p&gt;First impressions: this is a &lt;em&gt;really good&lt;/em&gt; model, and it somehow runs using just 11.72GB of my system RAM.&lt;/p&gt;
&lt;p&gt;The model supports three reasoning efforts: low, medium and high. LM Studio makes those available via a dropdown.&lt;/p&gt;
&lt;p&gt;Let's try "Generate an SVG of a pelican riding a bicycle":&lt;/p&gt;
&lt;h4 id="pelican-on-reasoning-low"&gt;Pelican on reasoning=low&lt;/h4&gt;
&lt;p&gt;I started &lt;a href="https://gist.github.com/simonw/b71394cc85fe0f048e376392e41586da"&gt;with low&lt;/a&gt;. It thought for 0.07 seconds and then output this (at 39 tokens a second):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-20-low.png" alt="" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Except... it output invalid SVG. One of the path elements looked like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;!-- Frame --&amp;gt;
&amp;lt;path d="
    M150,250          &amp;lt;!-- rear wheel center --&amp;gt;
    L300,120          &amp;lt;!-- top tube to front --&amp;gt;
    L450,250          &amp;lt;!-- chain stays back to front --&amp;gt;
    L300,350          &amp;lt;!-- seat stays down --&amp;gt;
    Z"
    fill="#e0e0e0" stroke="#555" stroke-width="4"/&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;But you can't put comments inside attributes like that. I fixed this to get the above image.&lt;/p&gt;
&lt;h4 id="pelican-on-reasoning-medium"&gt;Pelican on reasoning=medium&lt;/h4&gt;
&lt;p&gt;I tried again &lt;a href="https://gist.github.com/simonw/642e9e371387fc59a5aad25dcd41b0cc"&gt;on medium&lt;/a&gt;. This time it thought for 4.44 seconds and gave me this, at 55 tokens/second:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-20-medium.png" alt="" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4 id="pelican-on-reasoning-high"&gt;Pelican on reasoning=high&lt;/h4&gt;
&lt;p&gt;And then &lt;a href="https://gist.github.com/simonw/164a3f1b4b1a971637af279a0ac49be3"&gt;on high&lt;/a&gt;. My first attempt here failed because it hit the default LM Studio 4096 context limit - so I reloaded the model and set the new limit to the maximum possible.&lt;/p&gt;
&lt;p&gt;It jumped up to using 17.35GB RAM (I imagine it might use a lot more with a longer prompt) and after thinking for 5 minutes and 50 seconds (!) gave me this:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-20-high.png" alt="" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;It thought &lt;em&gt;really hard&lt;/em&gt; to get there. Here's a snippet of &lt;a href="https://gist.github.com/simonw/164a3f1b4b1a971637af279a0ac49be3 "&gt;that transcript&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;But maybe we want seat below body bottom? Actually seat should be between body and pedals, not above body bottom. Wait, typical bicycle: seat post goes from bottom bracket upward to the seat. The seat sits on top of the seat tube (vertical). So the seat is above bottom bracket. In our representation, the seat is at y=250-255. Body center at 245, body bottom at 265. So body bottom is below seat. That means the pelican's body is below the seat? Actually no: body bottom extends to 265; seat top is at 250; so seat sits above body bottom by about 15 units. That seems plausible because seat should be slightly above body, but body bottom is at 265 &amp;gt; seat top 255. [...]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I've seen this from other local reasoning models too: they can really overthink this kind of problem! I don't know if there's any pelican SVG that's worth waiting nearly six minutes for.&lt;/p&gt;
&lt;h4 id="space-invaders-with-gpt-oss-20b"&gt;Space invaders with gpt-oss-20b&lt;/h4&gt;
&lt;p&gt;Given how long high took I switched back to medium for my next experiment:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Write an HTML and JavaScript page implementing space invaders&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It &lt;a href="https://gist.github.com/simonw/63d7d8c43ae2ac93c214325bd6d607e4"&gt;thought for 10.78 seconds&lt;/a&gt; and produced this:&lt;/p&gt;

&lt;div style="max-width: 100%; margin-bottom: 0.4em"&gt;
    &lt;video controls="controls" preload="none" aria-label="Space Invaders" poster="https://static.simonwillison.net/static/2025/space-invaders-gpt-20.jpg" loop="loop" style="width: 100%; height: auto;" muted="muted"&gt;
        &lt;source src="https://static.simonwillison.net/static/2025/space-invaders-gpt-20.mp4" type="video/mp4" /&gt;
    &lt;/video&gt;
&lt;/div&gt;

&lt;p&gt;You can &lt;a href="https://tools.simonwillison.net/space-invaders-gpt-oss-20b-mxfp4-medium"&gt;play that here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It's not the best I've seen - I was more impressed &lt;a href="https://simonwillison.net/2025/Jul/29/space-invaders/"&gt;by GLM 4.5 Air&lt;/a&gt; - but it's very competent for a model that only uses 12GB of my RAM (GLM 4.5 Air used 47GB).&lt;/p&gt;
&lt;h4 id="trying-gpt-oss-120b-via-api-providers"&gt;Trying gpt-oss-120b via API providers&lt;/h4&gt;
&lt;p&gt;I don't quite have the resources on my laptop to run the larger model. Thankfully it's already being hosted by a number of different API providers.&lt;/p&gt;
&lt;p&gt;OpenRouter already &lt;a href="https://openrouter.ai/openai/gpt-oss-120b/providers"&gt;lists three&lt;/a&gt; - Fireworks, Groq and Cerebras. (Update: now also Parasail and Baseten.)&lt;/p&gt;
&lt;p&gt;Cerebras is &lt;em&gt;fast&lt;/em&gt;, so I decided to try them first.&lt;/p&gt;
&lt;p&gt;I installed the &lt;a href="https://github.com/irthomasthomas/llm-cerebras"&gt;llm-cerebras&lt;/a&gt; plugin and ran the &lt;code&gt;refresh&lt;/code&gt; command to ensure it had their latest models:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install -U llm-cerebras jsonschema
llm cerebras refresh&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;(Installing jsonschema worked around a warning message.)&lt;/p&gt;
&lt;p&gt;Output:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Refreshed 10 Cerebras models:
  - cerebras-deepseek-r1-distill-llama-70b
  - cerebras-gpt-oss-120b
  - cerebras-llama-3.3-70b
  - cerebras-llama-4-maverick-17b-128e-instruct
  - cerebras-llama-4-scout-17b-16e-instruct
  - cerebras-llama3.1-8b
  - cerebras-qwen-3-235b-a22b-instruct-2507
  - cerebras-qwen-3-235b-a22b-thinking-2507
  - cerebras-qwen-3-32b
  - cerebras-qwen-3-coder-480b
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m cerebras-gpt-oss-120b \
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Generate an SVG of a pelican riding a bicycle&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Cerebras runs the new model at between 2 and 4 thousands tokens per second!&lt;/p&gt;
&lt;p&gt;To my surprise this one &lt;a href="https://gist.github.com/simonw/4c685f19f1a93b68eacb627125e36be4"&gt;had the same comments-in-attributes bug&lt;/a&gt; that we saw with oss-20b earlier. I fixed those and got this pelican:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-120-cerebras.jpg" alt="Yellow and not great pelican, quite a good bicycle if a bit sketchy." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;That bug appears intermittently - I've not seen it on some of my other runs of the same prompt.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt; plugin also provides access to the models, balanced across the underlying providers. You can use that like so:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-openrouter
llm keys &lt;span class="pl-c1"&gt;set&lt;/span&gt; openrouter
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Paste API key here&lt;/span&gt;
llm -m openrouter/openai/gpt-oss-120b &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Say hi&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h4 id="llama-cpp-is-coming-very-shortly"&gt;llama.cpp is coming very shortly&lt;/h4&gt;
&lt;p&gt;The &lt;code&gt;llama.cpp&lt;/code&gt; &lt;a href="https://github.com/ggml-org/llama.cpp/pull/15091"&gt;pull request for gpt-oss&lt;/a&gt; was landed less than an hour ago. It's worth browsing through the coded - a &lt;em&gt;lot&lt;/em&gt; of work went into supporting this new model, spanning 48 commits to 83 different files. Hopefully this will land in the &lt;a href="https://formulae.brew.sh/formula/llama.cpp"&gt;llama.cpp Homebrew package&lt;/a&gt; within the next day or so, which should provide a convenient way to run the model via &lt;code&gt;llama-server&lt;/code&gt; and friends.&lt;/p&gt;
&lt;h4 id="gpt-oss-20b-in-ollama"&gt;gpt-oss:20b in Ollama&lt;/h4&gt;
&lt;p&gt;Ollama &lt;a href="https://ollama.com/library/gpt-oss"&gt;also have gpt-oss&lt;/a&gt;, requiring an update to their app.&lt;/p&gt;
&lt;p&gt;I fetched that 14GB model like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;ollama pull gpt-oss:20b&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Now I can use it with the new Ollama native app, or access it from &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-ollama
llm -m gpt-oss:20b &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Hi&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This also appears to use around 13.26GB of system memory while running a prompt.&lt;/p&gt;
&lt;p&gt;Ollama also launched &lt;a href="https://ollama.com/turbo"&gt;Ollama Turbo&lt;/a&gt; today, offering the two OpenAI models as a paid hosted service:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;Turbo is a new way to run open models using datacenter-grade hardware. Many new models are too large to fit on widely available GPUs, or run very slowly. Ollama Turbo provides a way to run these models fast while using Ollama's App, CLI, and API. &lt;/p&gt;&lt;/blockquote&gt;
&lt;h4 id="the-model-card"&gt;Training details from the model card&lt;/h4&gt;
&lt;p&gt;Here are some interesting notes about how the models were trained from &lt;a href="https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7637/oai_gpt-oss_model_card.pdf"&gt;the model card&lt;/a&gt; (PDF):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Data&lt;/strong&gt;: We train the models on a text-only dataset with trillions of tokens, with a focus on STEM, coding, and general knowledge. To improve the safety of the model, we filtered the data for harmful content in pre-training, especially around hazardous biosecurity knowledge, by reusing the CBRN pre-training filters from GPT-4o. Our model has a knowledge cutoff of June 2024.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Training&lt;/strong&gt;: The gpt-oss models trained on NVIDIA H100 GPUs using the PyTorch framework with expert-optimized Triton kernels. The training run for gpt-oss-120b required 2.1 million H100-hours to complete, with gpt-oss-20b needing almost 10x fewer. [...]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Thunder Compute's article &lt;a href="https://www.thundercompute.com/blog/nvidia-h100-pricing"&gt;NVIDIA H100 Pricing (August 2025): Cheapest On-Demand Cloud GPU Rates&lt;/a&gt; lists prices from around $2/hour to $11/hour, which would indicate a training cost of the 120b model between $4.2m and $23.1m and the 20b between $420,000 and $2.3m.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;After pre-training, we post-train the models using similar CoT RL techniques as OpenAI o3. This procedure teaches the models how to reason and solve problems using CoT and teaches the model how to use tools. Because of the similar RL techniques, these models have a personality similar to models served in our first-party products like ChatGPT. Our training dataset consists of a wide range of problems from coding, math, science, and more.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The models have additional special training to help them use web browser and Python (Jupyter notebook) tools more effectively:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;During post-training, we also teach the models to use different agentic tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A browsing tool, that allows the model to call search and open functions to interact with
the web. This aids factuality and allows the models to fetch info beyond their knowledge
cutoff.&lt;/li&gt;
&lt;li&gt;A python tool, which allows the model to run code in a stateful Jupyter notebook environment.&lt;/li&gt;
&lt;li&gt;Arbitrary developer functions, where one can specify function schemas in a &lt;code&gt;Developer&lt;/code&gt;
message similar to the OpenAI API. The definition of function is done within our harmony
format.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;There's a corresponding &lt;a href="https://github.com/openai/gpt-oss?tab=readme-ov-file#python"&gt;section about Python tool usage&lt;/a&gt; in the &lt;code&gt;openai/gpt-oss&lt;/code&gt; repository README.&lt;/p&gt;


&lt;h4 id="openai-harmony-a-new-format-for-prompt-templates"&gt;OpenAI Harmony, a new format for prompt templates&lt;/h4&gt;
&lt;p&gt;One of the gnarliest parts of implementing harnesses for LLMs is handling the prompt template format.&lt;/p&gt;
&lt;p&gt;Modern prompts are complicated beasts. They need to model user v.s. assistant conversation turns, and tool calls, and reasoning traces and an increasing number of other complex patterns.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/openai/harmony"&gt;openai/harmony&lt;/a&gt; is a brand new open source project from OpenAI (again, Apache 2) which implements a new response format that was created for the &lt;code&gt;gpt-oss&lt;/code&gt; models. It's clearly inspired by their new-ish &lt;a href="https://openai.com/index/new-tools-for-building-agents/"&gt;Responses API&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The format is described in the new &lt;a href="https://cookbook.openai.com/articles/openai-harmony"&gt;OpenAI Harmony Response Format&lt;/a&gt; cookbook document. It introduces some concepts that I've not seen in open weight models before:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;system&lt;/code&gt;, &lt;code&gt;developer&lt;/code&gt;, &lt;code&gt;user&lt;/code&gt;, &lt;code&gt;assistant&lt;/code&gt; and &lt;code&gt;tool&lt;/code&gt; roles - many other models only use user and assistant, and sometimes system and tool.&lt;/li&gt;
&lt;li&gt;Three different channels for output: &lt;code&gt;final&lt;/code&gt;, &lt;code&gt;analysis&lt;/code&gt; and &lt;code&gt;commentary&lt;/code&gt;. Only the &lt;code&gt;final&lt;/code&gt; channel is default intended to be visible to users. &lt;code&gt;analysis&lt;/code&gt; is for chain of thought and &lt;code&gt;commentary&lt;/code&gt; is sometimes used for tools.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That channels concept has been present in ChatGPT for a few months, starting with the release of o3.&lt;/p&gt;
&lt;p&gt;The details of the new tokens used by Harmony caught my eye:&lt;/p&gt;
&lt;center&gt;
&lt;table&gt;
  &lt;tbody&gt;&lt;tr&gt;
    &lt;th&gt;Token&lt;/th&gt;
    &lt;th&gt;Purpose&lt;/th&gt;
    &lt;th&gt;ID&lt;/th&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&amp;lt;|start|&amp;gt;&lt;/td&gt;
    &lt;td&gt;Start of message header&lt;/td&gt;
    &lt;td&gt;200006&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&amp;lt;|end|&amp;gt;&lt;/td&gt;
    &lt;td&gt;End of message&lt;/td&gt;
    &lt;td&gt;200007&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&amp;lt;|message|&amp;gt;&lt;/td&gt;
    &lt;td&gt;Start of message content&lt;/td&gt;
    &lt;td&gt;200008&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&amp;lt;|channel|&amp;gt;&lt;/td&gt;
    &lt;td&gt;Start of channel info&lt;/td&gt;
    &lt;td&gt;200005&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&amp;lt;|constrain|&amp;gt;&lt;/td&gt;
    &lt;td&gt;Data type for tool call&lt;/td&gt;
    &lt;td&gt;200003&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&amp;lt;|return|&amp;gt;&lt;/td&gt;
    &lt;td&gt;Stop after response&lt;/td&gt;
    &lt;td&gt;200002&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&amp;lt;|call|&amp;gt;&lt;/td&gt;
    &lt;td&gt;Call a tool&lt;/td&gt;
    &lt;td&gt;200012&lt;/td&gt;
  &lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/center&gt;
&lt;p&gt;Those token IDs are particularly important. They are part of a new token vocabulary called &lt;code&gt;o200k_harmony&lt;/code&gt;, which landed in OpenAI's tiktoken tokenizer library &lt;a href="https://github.com/openai/tiktoken/commit/3591ff175d6a80efbe4fcc7f0e219ddd4b8c52f1"&gt;this morning&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In the past I've seen models get confused by special tokens - try pasting &lt;code&gt;&amp;lt;|end|&amp;gt;&lt;/code&gt; into a model and see what happens.&lt;/p&gt;
&lt;p&gt;Having these special instruction tokens formally map to dedicated token IDs should hopefully be a whole lot more robust!&lt;/p&gt;
&lt;p&gt;The Harmony repo itself includes a Rust library and a Python library (wrapping that Rust library) for working with the new format in a much more ergonomic way.&lt;/p&gt;
&lt;p&gt;I tried one of their demos using &lt;code&gt;uv run&lt;/code&gt; to turn it into a shell one-liner:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;uv run --python 3.12 --with openai-harmony python -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;from openai_harmony import *&lt;/span&gt;
&lt;span class="pl-s"&gt;from openai_harmony import DeveloperContent&lt;/span&gt;
&lt;span class="pl-s"&gt;enc = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)&lt;/span&gt;
&lt;span class="pl-s"&gt;convo = Conversation.from_messages([&lt;/span&gt;
&lt;span class="pl-s"&gt;    Message.from_role_and_content(&lt;/span&gt;
&lt;span class="pl-s"&gt;        Role.SYSTEM,&lt;/span&gt;
&lt;span class="pl-s"&gt;        SystemContent.new(),&lt;/span&gt;
&lt;span class="pl-s"&gt;    ),&lt;/span&gt;
&lt;span class="pl-s"&gt;    Message.from_role_and_content(&lt;/span&gt;
&lt;span class="pl-s"&gt;        Role.DEVELOPER,&lt;/span&gt;
&lt;span class="pl-s"&gt;        DeveloperContent.new().with_instructions("Talk like a pirate!")&lt;/span&gt;
&lt;span class="pl-s"&gt;    ),&lt;/span&gt;
&lt;span class="pl-s"&gt;    Message.from_role_and_content(Role.USER, "Arrr, how be you?"),&lt;/span&gt;
&lt;span class="pl-s"&gt;])&lt;/span&gt;
&lt;span class="pl-s"&gt;tokens = enc.render_conversation_for_completion(convo, Role.ASSISTANT)&lt;/span&gt;
&lt;span class="pl-s"&gt;print(tokens)&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Which outputs:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;[200006, 17360, 200008, 3575, 553, 17554, 162016, 11, 261, 4410, 6439, 2359, 22203, 656, 7788, 17527, 558, 87447, 100594, 25, 220, 1323, 19, 12, 3218, 279, 30377, 289, 25, 14093, 279, 2, 13888, 18403, 25, 8450, 11, 49159, 11, 1721, 13, 21030, 2804, 413, 7360, 395, 1753, 3176, 13, 200007, 200006, 77944, 200008, 2, 68406, 279, 37992, 1299, 261, 96063, 0, 200007, 200006, 1428, 200008, 8977, 81, 11, 1495, 413, 481, 30, 200007, 200006, 173781]&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Note those token IDs like &lt;code&gt;200006&lt;/code&gt; corresponding to the special tokens listed above.&lt;/p&gt;
&lt;h4 id="the-open-question-for-me-how-good-is-tool-calling-"&gt;The open question for me: how good is tool calling?&lt;/h4&gt;
&lt;p&gt;There's one aspect of these models that I haven't explored in detail yet: &lt;strong&gt;tool calling&lt;/strong&gt;. How these work is clearly a big part of the new Harmony format, but the packages I'm using myself (around my own &lt;a href="https://simonwillison.net/2025/May/27/llm-tools/"&gt;LLM tool calling&lt;/a&gt; support) need various tweaks and fixes to start working with that new mechanism.&lt;/p&gt;
&lt;p&gt;Tool calling currently represents my biggest disappointment with local models that I've run on my own machine. I've been able to get them to perform simple single calls, but the state of the art these days is wildly more ambitious than that.&lt;/p&gt;
&lt;p&gt;Systems like Claude Code can make dozens if not hundreds of tool calls over the course of a single session, each one adding more context and information to a single conversation with an underlying model.&lt;/p&gt;
&lt;p&gt;My experience to date has been that local models are unable to handle these lengthy conversations. I'm not sure if that's inherent to the limitations of my own machine, or if it's something that the right model architecture and training could overcome.&lt;/p&gt;
&lt;p&gt;OpenAI make big claims about the tool calling capabilities of these new models. I'm looking forward to seeing how well they perform in practice.&lt;/p&gt;

&lt;h4 id="china"&gt;Competing with the Chinese open models&lt;/h4&gt;

&lt;p&gt;I've been writing a &lt;em&gt;lot&lt;/em&gt; about the &lt;a href="https://simonwillison.net/tags/ai-in-china/"&gt;flurry of excellent open weight models&lt;/a&gt; released by Chinese AI labs over the past few months - all of them very capable and most of them under Apache 2 or MIT licenses.&lt;/p&gt;

&lt;p&gt;Just last week &lt;a href="https://simonwillison.net/2025/Jul/30/chinese-models/"&gt;I said&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Something that has become undeniable this month is that the best available open weight models now come from the Chinese AI labs.&lt;/p&gt;
&lt;p&gt;I continue to have a lot of love for Mistral, Gemma and Llama but my feeling is that Qwen, Moonshot and Z.ai have positively smoked them over the course of July. [...]&lt;/p&gt;
&lt;p&gt;I can't help but wonder if part of the reason for the delay in release of OpenAI's open weights model comes from a desire to be notably better than this truly impressive lineup of Chinese models.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;With the release of the gpt-oss models that statement no longer holds true. I'm waiting for the dust to settle and the independent benchmarks (that are more credible than my ridiculous pelicans) to roll out, but I think it's likely that OpenAI now offer the best available open weights models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: Independent evaluations are beginning to roll in. Here's &lt;a href="https://x.com/artificialanlys/status/1952887733803991070"&gt;Artificial Analysis&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;gpt-oss-120b is the most intelligent American open weights model, comes behind DeepSeek R1 and Qwen3 235B in intelligence but offers efficiency benefits [...]&lt;/p&gt;
&lt;p&gt;While the larger gpt-oss-120b does not come in above DeepSeek R1 0528’s score of 59 or Qwen3 235B 2507s score of 64, it is notable that it is significantly smaller in both total and active parameters than both of those models.&lt;/p&gt;&lt;/blockquote&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-tool-use"&gt;llm-tool-use&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cerebras"&gt;cerebras&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/space-invaders"&gt;space-invaders&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-oss"&gt;gpt-oss&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="open-source"/><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="llm-tool-use"/><category term="cerebras"/><category term="ollama"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/><category term="lm-studio"/><category term="space-invaders"/><category term="gpt-oss"/></entry><entry><title>XBai o4</title><link href="https://simonwillison.net/2025/Aug/3/xbai-o4/#atom-tag" rel="alternate"/><published>2025-08-03T22:21:17+00:00</published><updated>2025-08-03T22:21:17+00:00</updated><id>https://simonwillison.net/2025/Aug/3/xbai-o4/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/MetaStoneTec/XBai-o4"&gt;XBai o4&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Yet &lt;em&gt;another&lt;/em&gt; open source (Apache 2.0) LLM from a Chinese AI lab. This model card claims:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;XBai o4&lt;/strong&gt; excels in complex reasoning capabilities and has now completely surpassed OpenAI-o3-mini in Medium mode.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This a 32.8 billion parameter model released by MetaStone AI, a new-to-me lab who released their first model in March - &lt;a href="https://huggingface.co/MetaStoneTec/MetaStone-L1-7B"&gt;MetaStone-L1-7B&lt;/a&gt;, then followed that with MetaStone-S1 &lt;a href="https://huggingface.co/MetaStoneTec/MetaStone-S1-1.5B"&gt;1.5B&lt;/a&gt;, &lt;a href="https://huggingface.co/MetaStoneTec/MetaStone-S1-7B"&gt;7B&lt;/a&gt; and &lt;a href="https://huggingface.co/MetaStoneTec/MetaStone-S1-32B"&gt;32B&lt;/a&gt; in July and now XBai o4 in August.&lt;/p&gt;
&lt;p&gt;The MetaStone-S1 models were accompanied with a paper, &lt;a href="https://arxiv.org/abs/2507.01951"&gt;Test-Time Scaling with Reflective Generative Model&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;There is &lt;em&gt;very&lt;/em&gt; little information available on the English-language web about MetaStone AI. Their paper shows a relationship with USTC, &lt;a href="https://en.wikipedia.org/wiki/University_of_Science_and_Technology_of_China"&gt;University of Science and Technology of China&lt;/a&gt; in Hefei. One of their researchers &lt;a href="https://x.com/WangMagic_/status/1951690465222217872"&gt;confirmed on Twitter&lt;/a&gt; that their CEO is from &lt;a href="https://en.wikipedia.org/wiki/Kuaishou"&gt;KWAI&lt;/a&gt; which lead me to &lt;a href="https://www.qbitai.com/2024/07/168071.html"&gt;this Chinese language article&lt;/a&gt; from July last year about Li Yan, formerly of KWAI and now the founder of Wen Xiaobai and &lt;a href="https://x.com/simonw/status/1951694450369208361"&gt;evidently&lt;/a&gt; &lt;a href="https://x.com/WangMagic_/status/1951694611191324929"&gt;now&lt;/a&gt; the CEO of MetaStone. &lt;a href="https://www.wenxiaobai.com"&gt;www.wenxiaobai.com&lt;/a&gt; is listed as the "official website" linked to from &lt;a href="https://github.com/MetaStone-AI/XBai-o4"&gt;the XBai-o4 README&lt;/a&gt; on GitHub.&lt;/p&gt;
&lt;p&gt;Ivan Fioravanti &lt;a href="https://huggingface.co/mlx-community/models?search=xbai-o4"&gt;got it working under MLX&lt;/a&gt; in 4bit, 5bit, 6bit, 8bit and 4bit-DWQ sizes. I tried his &lt;a href="https://huggingface.co/mlx-community/XBai-o4-6bit"&gt;6bit one&lt;/a&gt; (a 24.81GB download) in LM Studio and had it "Generate an SVG of a pelican riding a bicycle". It thought for 39.49 seconds (&lt;a href="https://gist.github.com/simonw/78182fc3409e36f8d22217992967b9d6#response"&gt;thinking trace here&lt;/a&gt;) and gave me this:&lt;/p&gt;
&lt;p&gt;&lt;img alt="The pelican is blue and looks a little like a sad elephant. The bicycle looks more like a flat motorbike. The pelican has a blue tail and orange legs." src="https://static.simonwillison.net/static/2025/xbai-o4-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;I also had it "Write an HTML and JavaScript page implementing space invaders", which you can &lt;a href="https://tools.simonwillison.net/space-invaders-xbai-o4-6bit-mlx"&gt;try out here&lt;/a&gt;. It's not great - the invaders don't fire bullets (it listed that as a "potential improvement") and I don't think it's winnable due to timing issues, but it's an interesting addition to &lt;a href="https://simonwillison.net/tags/space-invaders/"&gt;my collection&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;LM Studio reported a constant memory usage of 25.54GB and according to Activity Monitor used 87.5% of my GPU and around 39% of a CPU core.&lt;/p&gt;
&lt;p&gt;It did a &lt;em&gt;lot&lt;/em&gt; of work in the thinking trace - which took 10m35s! - including writing the first draft of much of the code that ended up in the finished result. Here's an extract:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Let me think of variables:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;let enemyDirection = 1; // 1 for right, -1 for left&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;In the game loop:
&lt;code&gt;enemies.forEach(enemy =&amp;gt; {&amp;gt;
    enemy.x += enemyDirection * enemy.dx;&amp;gt;
});&lt;/code&gt;
Then check if any enemy is out of bounds. If so, reverse direction and move all enemies down.&lt;/p&gt;
&lt;p&gt;But how to check if any enemy is at the edge?&lt;/p&gt;
&lt;p&gt;Maybe find the max x of all enemies and see if it exceeds &lt;code&gt;canvas.width - enemy.width&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Alternatively, track the leftmost and rightmost positions of enemies. Maybe a bit complex.&lt;/p&gt;
&lt;p&gt;For simplicity, perhaps check if any enemy's &lt;code&gt;x + width &amp;gt; canvas.width or x &amp;lt; 0&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/457d55a5c426e0bbea9d47aee225c391"&gt;the full transcript&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;My initial impressions are that I'm not as impressed with this model for running on my own laptop as I was with &lt;a href="https://simonwillison.net/2025/Jul/31/qwen3-coder-flash/"&gt;Qwen3-Coder-30B-A3B-Instruct&lt;/a&gt; or &lt;a href="https://simonwillison.net/2025/Jul/29/space-invaders/"&gt;GLM-4.5 Air&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;But... how extraordinary is it that &lt;em&gt;another&lt;/em&gt; Chinese AI lab has been able to produce a competitive model, this time with far less fanfare than we've seen from Qwen and Moonshot AI and Z.ai.

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://x.com/ivanfioravanti/status/1951643205985816807"&gt;@ivanfioravanti&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/space-invaders"&gt;space-invaders&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ivan-fioravanti"&gt;ivan-fioravanti&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="mlx"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/><category term="lm-studio"/><category term="ai-in-china"/><category term="space-invaders"/><category term="ivan-fioravanti"/></entry><entry><title>Trying out Qwen3 Coder Flash using LM Studio and Open WebUI and LLM</title><link href="https://simonwillison.net/2025/Jul/31/qwen3-coder-flash/#atom-tag" rel="alternate"/><published>2025-07-31T19:45:36+00:00</published><updated>2025-07-31T19:45:36+00:00</updated><id>https://simonwillison.net/2025/Jul/31/qwen3-coder-flash/#atom-tag</id><summary type="html">
    &lt;p&gt;Qwen just released &lt;a href="https://simonwillison.net/2025/Jul/30/chinese-models/"&gt;their sixth model&lt;/a&gt;(!) of this July called &lt;a href="https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct"&gt;Qwen3-Coder-30B-A3B-Instruct&lt;/a&gt; - listed as Qwen3-Coder-Flash in their &lt;a href="https://chat.qwen.ai/"&gt;chat.qwen.ai&lt;/a&gt; interface.&lt;/p&gt;
&lt;p&gt;It's 30.5B total parameters with 3.3B active at any one time. This means it will fit on a 64GB Mac - and even a 32GB Mac if you quantize it - and can run &lt;em&gt;really&lt;/em&gt; fast thanks to that smaller set of active parameters.&lt;/p&gt;
&lt;p&gt;It's a non-thinking model that is specially trained for coding tasks.&lt;/p&gt;
&lt;p&gt;This is an exciting combination of properties: optimized for coding performance and speed and small enough to run on a mid-tier developer laptop.&lt;/p&gt;
&lt;h4 id="trying-it-out-with-lm-studio-and-open-webui"&gt;Trying it out with LM Studio and Open WebUI&lt;/h4&gt;
&lt;p&gt;I like running models like this using Apple's MLX framework. I ran GLM-4.5 Air the other day &lt;a href="https://simonwillison.net/2025/Jul/29/space-invaders/#how-i-ran-the-model"&gt;using the mlx-lm Python library directly&lt;/a&gt;, but this time I decided to try out the combination of &lt;a href="https://lmstudio.ai/"&gt;LM Studio&lt;/a&gt; and &lt;a href="https://openwebui.com/"&gt;Open WebUI&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;(LM Studio has a decent interface built in, but I like the Open WebUI one slightly more.)&lt;/p&gt;
&lt;p&gt;I installed the model  by clicking the "Use model in LM Studio" button on LM Studio's &lt;a href="https://lmstudio.ai/models/qwen/qwen3-coder-30b"&gt;qwen/qwen3-coder-30b&lt;/a&gt; page. It gave me a bunch of options:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/lm-studio-qwen3-coder-30b.jpg" alt="Screenshot of a model download menu for &amp;quot;qwen/qwen3-coder-30b,&amp;quot; a 30B MoE coding model from Alibaba Qwen using the mlx-llm engine. The section &amp;quot;Download Options&amp;quot; shows different choices with file sizes. Options include: GGUF Qwen3 Coder 30B A3B Instruct Q3_K_L (14.58 GB), Q4_K_M (18.63 GB), Q6_K (25.10 GB), Q8_0 (32.48 GB). MLX versions are also available: 4bit (17.19 GB, selected), 6bit (24.82 GB, marked as Downloaded), 8bit (32.46 GB)." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I chose the 6bit MLX model, which is a 24.82GB download. Other options include 4bit (17.19GB) and 8bit (32.46GB). The download sizes are roughly the same as the amount of RAM required to run the model - picking that 24GB one leaves 40GB free on my 64GB machine for other applications.&lt;/p&gt;
&lt;p&gt;Then I opened the developer settings in LM Studio (the green folder icon) and turned on "Enable CORS" so I could access it from a separate Open WebUI instance.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/lm-studio-cors.jpg" alt="Screenshot of LM Studio application showing runtime settings. The status is &amp;quot;Running&amp;quot; with a toggle switch enabled. A settings dropdown is open with options including: &amp;quot;Server Port 1234&amp;quot;, &amp;quot;Enable CORS&amp;quot; (enabled), &amp;quot;Serve on Local Network&amp;quot; (disabled)" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Now I switched over to Open WebUI. I installed and ran it using &lt;a href="https://github.com/astral-sh/uv"&gt;uv&lt;/a&gt; like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;uvx --python 3.11 open-webui serve&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then navigated to &lt;code&gt;http://localhost:8080/&lt;/code&gt; to access the interface. I opened their settings and configured a new "Connection" to LM Studio:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/openweb-ui-settings.jpg" alt="Screenshot of Open WebUI settings showing the Edit Connection window. URL is set to http://localhost:1234/v1 and Prefix ID is set to lm." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;That needs a base URL of &lt;code&gt;http://localhost:1234/v1&lt;/code&gt; and a key of anything you like. I also set the optional prefix to &lt;code&gt;lm&lt;/code&gt; just in case my Ollama installation - which Open WebUI detects automatically - ended up with any duplicate model names.&lt;/p&gt;
&lt;p&gt;Having done all of that, I could select any of my LM Studio models in the Open WebUI interface and start running prompts.&lt;/p&gt;
&lt;p&gt;A neat feature of Open WebUI is that it includes an automatic preview panel, which kicks in for fenced code blocks that include SVG or HTML:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/openweb-ui-pelican.jpg" alt="The Open WebUI app with a sidebar and then a panel with the model and my Generate an SVG of a pelican riding a bicycle prompt, then its response, then another side panel with the rendered SVG. It isn't a great image - the bicycle is a bit mangled - but the pelican does at least have a big triangular orange beak." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/c167f14bc3d86ec1976f286d3e05fda5"&gt;the exported transcript&lt;/a&gt; for "Generate an SVG of a pelican riding a bicycle". It ran at almost 60 tokens a second!&lt;/p&gt;
&lt;h4 id="implementing-space-invaders"&gt;Implementing Space Invaders&lt;/h4&gt;
&lt;p&gt;I tried my other recent &lt;a href="https://simonwillison.net/tags/space-invaders/"&gt;simple benchmark prompt&lt;/a&gt; as well:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Write an HTML and JavaScript page implementing space invaders&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I like this one because it's a very short prompt that acts as shorthand for quite a complex set of features. There's likely plenty of material in the training data to help the model achieve that goal but it's still interesting to see if they manage to spit out something that works first time.&lt;/p&gt;
&lt;p&gt;The first version it gave me worked out of the box, but was a little too hard - the enemy bullets move so fast that it's almost impossible to avoid them:&lt;/p&gt;
&lt;div style="max-width: 100%; margin-bottom: 0.4em"&gt;
    &lt;video controls="controls" preload="none" aria-label="Space Invaders" poster="https://static.simonwillison.net/static/2025/space-invaders-6bit-mlx-Qwen3-Coder-30B-A3B-Instruct.jpg" loop="loop" style="width: 100%; height: auto;" muted="muted"&gt;
        &lt;source src="https://static.simonwillison.net/static/2025/space-invaders-6bit-mlx-Qwen3-Coder-30B-A3B-Instruct.mp4" type="video/mp4" /&gt;
    &lt;/video&gt;
&lt;/div&gt;
&lt;p&gt;You can &lt;a href="https://tools.simonwillison.net/space-invaders-6bit-mlx-Qwen3-Coder-30B-A3B-Instruct"&gt;try that out here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I tried a follow-up prompt of "Make the enemy bullets a little slower". A system like Claude Artifacts or Claude Code implements tool calls for modifying files in place, but the Open WebUI system I was using didn't have a default equivalent which means the model had to output the full file a second time.&lt;/p&gt;
&lt;p&gt;It did that, and slowed down the bullets, but it made a bunch of other changes as well, &lt;a href="https://gist.github.com/simonw/ee4704feb37c6b16edd677d32fd69693/revisions#diff-544640de4897069f24e7988199bd5c08addfc5aa2196cbf2a0d164308bff1db0"&gt;shown in this diff&lt;/a&gt;. I'm not too surprised by this - asking a 25GB local model to output a lengthy file with just a single change is quite a stretch.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/b7115990525b104a6dd95f7d694ae6c3"&gt;the exported transcript&lt;/a&gt; for those two prompts.&lt;/p&gt;
&lt;h4 id="running-lm-studio-models-with-mlx-lm"&gt;Running LM Studio models with mlx-lm&lt;/h4&gt;
&lt;p&gt;LM Studio stores its models in the &lt;code&gt;~/.cache/lm-studio/models&lt;/code&gt; directory. This means you can use the &lt;a href="https://github.com/ml-explore/mlx-lm"&gt;mlx-lm&lt;/a&gt; Python library to run prompts through the same model like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;uv run --isolated --with mlx-lm mlx_lm.generate \
  --model &lt;span class="pl-k"&gt;~&lt;/span&gt;/.cache/lm-studio/models/lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-6bit \
  --prompt &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Write an HTML and JavaScript page implementing space invaders&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \
  -m 8192 --top-k 20 --top-p 0.8 --temp 0.7&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Be aware that this will load a duplicate copy of the model into memory so you may want to quit LM Studio before running this command!&lt;/p&gt;
&lt;h4 id="accessing-the-model-via-my-llm-tool"&gt;Accessing the model via my LLM tool&lt;/h4&gt;
&lt;p&gt;My &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; project provides a command-line tool and Python library for accessing large language models.&lt;/p&gt;
&lt;p&gt;Since LM Studio offers an OpenAI-compatible API, you can &lt;a href="https://llm.datasette.io/en/stable/other-models.html#openai-compatible-models"&gt;configure LLM&lt;/a&gt; to access models through that API by creating or editing the &lt;code&gt;~/Library/Application\ Support/io.datasette.llm/extra-openai-models.yaml&lt;/code&gt; file:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;zed &lt;span class="pl-k"&gt;~&lt;/span&gt;/Library/Application&lt;span class="pl-cce"&gt;\ &lt;/span&gt;Support/io.datasette.llm/extra-openai-models.yaml&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I added the following YAML configuration:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;- &lt;span class="pl-ent"&gt;model_id&lt;/span&gt;: &lt;span class="pl-s"&gt;qwen3-coder-30b&lt;/span&gt;
  &lt;span class="pl-ent"&gt;model_name&lt;/span&gt;: &lt;span class="pl-s"&gt;qwen/qwen3-coder-30b&lt;/span&gt;
  &lt;span class="pl-ent"&gt;api_base&lt;/span&gt;: &lt;span class="pl-s"&gt;http://localhost:1234/v1&lt;/span&gt;
  &lt;span class="pl-ent"&gt;supports_tools&lt;/span&gt;: &lt;span class="pl-c1"&gt;true&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Provided LM Studio is running I can execute prompts from my terminal like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m qwen3-coder-30b &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;A joke about a pelican and a cheesecake&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;blockquote&gt;
&lt;p&gt;Why did the pelican refuse to eat the cheesecake?&lt;/p&gt;
&lt;p&gt;Because it had a &lt;em&gt;beak&lt;/em&gt; for dessert! 🥧🦜&lt;/p&gt;
&lt;p&gt;(Or if you prefer: Because it was afraid of getting &lt;em&gt;beak&lt;/em&gt;-sick from all that creamy goodness!)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;(25GB clearly isn't enough space for a functional sense of humor.)&lt;/p&gt;
&lt;p&gt;More interestingly though, we can start exercising the Qwen model's support for &lt;a href="https://simonwillison.net/2025/May/27/llm-tools/"&gt;tool calling&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m qwen3-coder-30b \
  -T llm_version -T llm_time --td \
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;tell the time then show the version&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here we are enabling LLM's two default tools - one for telling the time and one for seeing the version of LLM that's currently installed. The &lt;code&gt;--td&lt;/code&gt; flag stands for &lt;code&gt;--tools-debug&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The output looks like this, debug output included:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Tool call: llm_time({})
  {
    "utc_time": "2025-07-31 19:20:29 UTC",
    "utc_time_iso": "2025-07-31T19:20:29.498635+00:00",
    "local_timezone": "PDT",
    "local_time": "2025-07-31 12:20:29",
    "timezone_offset": "UTC-7:00",
    "is_dst": true
  }

Tool call: llm_version({})
  0.26

The current time is:
- Local Time (PDT): 2025-07-31 12:20:29
- UTC Time: 2025-07-31 19:20:29

The installed version of the LLM is 0.26.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Pretty good! It managed two tool calls from a single prompt.&lt;/p&gt;
&lt;p&gt;Sadly I couldn't get it to work with some of my more complex plugins such as &lt;a href="https://github.com/simonw/llm-tools-sqlite"&gt;llm-tools-sqlite&lt;/a&gt;. I'm trying to figure out if that's a bug in the model, the LM Studio layer or my own code for running tool prompts against OpenAI-compatible endpoints.&lt;/p&gt;
&lt;h4 id="the-month-of-qwen"&gt;The month of Qwen&lt;/h4&gt;
&lt;p&gt;July has absolutely been the month of Qwen. The models they have released this month are outstanding, packing some extremely useful capabilities even into models I can run in 25GB of RAM or less on my own laptop.&lt;/p&gt;
&lt;p&gt;If you're looking for a competent coding model you can run locally Qwen3-Coder-30B-A3B is a very solid choice.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/space-invaders"&gt;space-invaders&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="llm"/><category term="uv"/><category term="qwen"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="lm-studio"/><category term="ai-in-china"/><category term="space-invaders"/></entry><entry><title>Qwen3-30B-A3B-Instruct-2507</title><link href="https://simonwillison.net/2025/Jul/29/qwen3-30b-a3b-instruct-2507/#atom-tag" rel="alternate"/><published>2025-07-29T18:57:33+00:00</published><updated>2025-07-29T18:57:33+00:00</updated><id>https://simonwillison.net/2025/Jul/29/qwen3-30b-a3b-instruct-2507/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507"&gt;Qwen3-30B-A3B-Instruct-2507&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
New model update from Qwen, improving on their previous &lt;a href="https://simonwillison.net/2025/Apr/29/qwen-3/"&gt;Qwen3-30B-A3B release&lt;/a&gt; from late April. In &lt;a href="https://x.com/Alibaba_Qwen/status/1950227114793586867"&gt;their tweet&lt;/a&gt; they said:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Smarter, faster, and local deployment-friendly.&lt;/p&gt;
&lt;p&gt;✨ Key Enhancements:&lt;br&gt;
✅ Enhanced reasoning, coding, and math skills&lt;br&gt;
✅ Broader multilingual knowledge&lt;br&gt;
✅ Improved long-context understanding (up to 256K tokens)&lt;br&gt;
✅ Better alignment with user intent and open-ended tasks&lt;br&gt;
✅ No more &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; blocks — now operating exclusively in non-thinking mode&lt;br&gt;&lt;/p&gt;
&lt;p&gt;🔧 With 3B activated parameters, it's approaching the performance of GPT-4o and Qwen3-235B-A22B Non-Thinking&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I tried &lt;a href="https://chat.qwen.ai/?model=Qwen3-30B-A3B-2507"&gt;the chat.qwen.ai&lt;/a&gt; hosted model with "Generate an SVG of a pelican riding a bicycle" and &lt;a href="https://gist.github.com/simonw/a498d4b2df887d079a9e338f8c4e5006"&gt;got this&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="This one is cute: blue sky, green grass, the sun is shining. The bicycle is a red block with wheels that looks more like a toy car. The pelican doesn't look like a pelican and has a quirky smile printed on its beak." src="https://static.simonwillison.net/static/2025/Qwen3-30B-A3B-2507.png" /&gt;&lt;/p&gt;
&lt;p&gt;I particularly enjoyed this detail from the SVG source code:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;!-- Bonus: Pelican's smile --&amp;gt;
&amp;lt;path d="M245,145 Q250,150 255,145" fill="none" stroke="#d4a037" stroke-width="2"/&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I went looking for quantized versions that could fit on my Mac and found &lt;a href="https://huggingface.co/lmstudio-community/Qwen3-30B-A3B-Instruct-2507-MLX-8bit"&gt;lmstudio-community/Qwen3-30B-A3B-Instruct-2507-MLX-8bit&lt;/a&gt; from &lt;a href="https://lmstudio.ai/"&gt;LM Studio&lt;/a&gt;. Getting that up and running was a 32.46GB download and it appears to use just over 30GB of RAM.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://gist.github.com/simonw/d608dc37cb7871f12caf8fbc0657fcad"&gt;pelican I got from that one&lt;/a&gt; wasn't as good:&lt;/p&gt;
&lt;p&gt;&lt;img alt="It looks more like a tall yellow hen chick riding a segway" src="https://static.simonwillison.net/static/2025/qwen3-30b-a3b-instruct-2507-mlx.png" /&gt;&lt;/p&gt;
&lt;p id="space-invaders"&gt;I then tried that local model on the "Write an HTML and JavaScript page implementing space invaders" task &lt;a href="https://simonwillison.net/2025/Jul/29/space-invaders/"&gt;that I ran against GLM-4.5 Air&lt;/a&gt;. The output &lt;a href="https://gist.github.com/simonw/965111fd6fac320b7eec50710c1761db"&gt;looked promising&lt;/a&gt;, in particular it seemed to be putting more effort into the design of the invaders (GLM-4.5 Air just used rectangles):&lt;/p&gt;

&lt;pre&gt;&lt;span class="pl-c"&gt;// Draw enemy ship&lt;/span&gt;
&lt;span class="pl-s1"&gt;ctx&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;fillStyle&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;color&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;

&lt;span class="pl-c"&gt;// Ship body&lt;/span&gt;
&lt;span class="pl-s1"&gt;ctx&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;fillRect&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;x&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;y&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;width&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;height&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;

&lt;span class="pl-c"&gt;// Enemy eyes&lt;/span&gt;
&lt;span class="pl-s1"&gt;ctx&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;fillStyle&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;'#fff'&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-s1"&gt;ctx&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;fillRect&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;x&lt;/span&gt; &lt;span class="pl-c1"&gt;+&lt;/span&gt; &lt;span class="pl-c1"&gt;6&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;y&lt;/span&gt; &lt;span class="pl-c1"&gt;+&lt;/span&gt; &lt;span class="pl-c1"&gt;5&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;4&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;4&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-s1"&gt;ctx&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;fillRect&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;x&lt;/span&gt; &lt;span class="pl-c1"&gt;+&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;width&lt;/span&gt; &lt;span class="pl-c1"&gt;-&lt;/span&gt; &lt;span class="pl-c1"&gt;10&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;y&lt;/span&gt; &lt;span class="pl-c1"&gt;+&lt;/span&gt; &lt;span class="pl-c1"&gt;5&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;4&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;4&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;

&lt;span class="pl-c"&gt;// Enemy antennae&lt;/span&gt;
&lt;span class="pl-s1"&gt;ctx&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;fillStyle&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;'#f00'&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;type&lt;/span&gt; &lt;span class="pl-c1"&gt;===&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
    &lt;span class="pl-c"&gt;// Basic enemy&lt;/span&gt;
    &lt;span class="pl-s1"&gt;ctx&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;fillRect&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;x&lt;/span&gt; &lt;span class="pl-c1"&gt;+&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;width&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;2&lt;/span&gt; &lt;span class="pl-c1"&gt;-&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;y&lt;/span&gt; &lt;span class="pl-c1"&gt;-&lt;/span&gt; &lt;span class="pl-c1"&gt;5&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;2&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;5&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-kos"&gt;}&lt;/span&gt; &lt;span class="pl-k"&gt;else&lt;/span&gt; &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;type&lt;/span&gt; &lt;span class="pl-c1"&gt;===&lt;/span&gt; &lt;span class="pl-c1"&gt;2&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
    &lt;span class="pl-c"&gt;// Fast enemy&lt;/span&gt;
    &lt;span class="pl-s1"&gt;ctx&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;fillRect&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;x&lt;/span&gt; &lt;span class="pl-c1"&gt;+&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;width&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;4&lt;/span&gt; &lt;span class="pl-c1"&gt;-&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;y&lt;/span&gt; &lt;span class="pl-c1"&gt;-&lt;/span&gt; &lt;span class="pl-c1"&gt;5&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;2&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;5&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-s1"&gt;ctx&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;fillRect&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;x&lt;/span&gt; &lt;span class="pl-c1"&gt;+&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-c1"&gt;3&lt;/span&gt; &lt;span class="pl-c1"&gt;*&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;width&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;4&lt;/span&gt; &lt;span class="pl-c1"&gt;-&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;y&lt;/span&gt; &lt;span class="pl-c1"&gt;-&lt;/span&gt; &lt;span class="pl-c1"&gt;5&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;2&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;5&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-kos"&gt;}&lt;/span&gt; &lt;span class="pl-k"&gt;else&lt;/span&gt; &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;type&lt;/span&gt; &lt;span class="pl-c1"&gt;===&lt;/span&gt; &lt;span class="pl-c1"&gt;3&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
    &lt;span class="pl-c"&gt;// Armored enemy&lt;/span&gt;
    &lt;span class="pl-s1"&gt;ctx&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;fillRect&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;x&lt;/span&gt; &lt;span class="pl-c1"&gt;+&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;width&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;2&lt;/span&gt; &lt;span class="pl-c1"&gt;-&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;y&lt;/span&gt; &lt;span class="pl-c1"&gt;-&lt;/span&gt; &lt;span class="pl-c1"&gt;8&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;2&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;8&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-s1"&gt;ctx&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;fillStyle&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;'#0f0'&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-s1"&gt;ctx&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;fillRect&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;x&lt;/span&gt; &lt;span class="pl-c1"&gt;+&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;width&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;2&lt;/span&gt; &lt;span class="pl-c1"&gt;-&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;y&lt;/span&gt; &lt;span class="pl-c1"&gt;-&lt;/span&gt; &lt;span class="pl-c1"&gt;6&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;2&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;3&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/pre&gt;

&lt;p&gt;But &lt;a href="https://static.simonwillison.net/static/2025/qwen3-30b-a3b-instruct-2507-mlx-space-invaders.html"&gt;the resulting code&lt;/a&gt; didn't actually work:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Black screen - a row of good looking space invaders advances across the screen for a moment... and then the entire screen goes blank." src="https://static.simonwillison.net/static/2025/qwen3-30b-a3b-instruct-2507-mlx-space-invaders.gif" /&gt;&lt;/p&gt;
&lt;p&gt;That same prompt against the unquantized Qwen-hosted model produced &lt;a href="https://gist.github.com/simonw/b61d161a8a969e4558c812a64dadbb45"&gt;a different result&lt;/a&gt; which sadly also resulted in an &lt;a href="https://static.simonwillison.net/static/2025/Qwen3-30B-A3B-2507-space-invaders.html"&gt;unplayable game&lt;/a&gt; - this time because everything moved too fast.&lt;/p&gt;
&lt;p&gt;This new Qwen model is a non-reasoning model, whereas GLM-4.5 and GLM-4.5 Air are both reasoners. It looks like at this scale the "reasoning" may make a material difference in terms of getting code that works out of the box.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/space-invaders"&gt;space-invaders&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="qwen"/><category term="mlx"/><category term="llm-reasoning"/><category term="llm-release"/><category term="lm-studio"/><category term="ai-in-china"/><category term="space-invaders"/></entry><entry><title>How to run an LLM on your laptop</title><link href="https://simonwillison.net/2025/Jul/18/how-to-run-an-llm-on-your-laptop/#atom-tag" rel="alternate"/><published>2025-07-18T15:33:27+00:00</published><updated>2025-07-18T15:33:27+00:00</updated><id>https://simonwillison.net/2025/Jul/18/how-to-run-an-llm-on-your-laptop/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.technologyreview.com/2025/07/17/1120391/how-to-run-an-llm-on-your-laptop/"&gt;How to run an LLM on your laptop&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I talked to Grace Huckins for this piece from MIT Technology Review on running local models. Apparently she enjoyed my dystopian backup plan!&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Simon Willison has a plan for the end of the world. It’s a USB stick, onto which he has loaded a couple of his favorite open-weight LLMs—models that have been shared publicly by their creators and that can, in principle, be downloaded and run with local hardware. If human civilization should ever collapse, Willison plans to use all the knowledge encoded in their billions of parameters for help. “It’s like having a weird, condensed, faulty version of Wikipedia, so I can help reboot society with the help of my little USB stick,” he says.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The article suggests &lt;a href="https://ollama.com/"&gt;Ollama&lt;/a&gt; or &lt;a href="https://lmstudio.ai/"&gt;LM Studio&lt;/a&gt; for laptops, and new-to-me &lt;a href="https://apps.apple.com/us/app/llm-farm/id6461209867"&gt;LLM Farm&lt;/a&gt; for the iPhone:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;My beat-up iPhone 12 was able to run Meta’s Llama 3.2 1B using an app called LLM Farm. It’s not a particularly good model—it very quickly goes off into bizarre tangents and hallucinates constantly—but trying to coax something so chaotic toward usability can be entertaining.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;Update 19th July 20205&lt;/strong&gt;: Evan Hahn compared the size of &lt;a href="https://evanhahn.com/local-llms-versus-offline-wikipedia/"&gt;various offline LLMs to different Wikipedia exports&lt;/a&gt;. Full English Wikipedia without images, revision history or talk pages is 13.82GB, smaller than Mistral Small 3.2 (15GB) but larger than Qwen 3 14B and Gemma 3n.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/wikipedia"&gt;wikipedia&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/press-quotes"&gt;press-quotes&lt;/a&gt;&lt;/p&gt;



</summary><category term="wikipedia"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="ollama"/><category term="lm-studio"/><category term="press-quotes"/></entry><entry><title>LM Studio is free for use at work</title><link href="https://simonwillison.net/2025/Jul/8/lm-studio-is-free-for-use-at-work/#atom-tag" rel="alternate"/><published>2025-07-08T20:37:06+00:00</published><updated>2025-07-08T20:37:06+00:00</updated><id>https://simonwillison.net/2025/Jul/8/lm-studio-is-free-for-use-at-work/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://lmstudio.ai/blog/free-for-work"&gt;LM Studio is free for use at work&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
A notable policy change for &lt;a href="https://lmstudio.ai/"&gt;LM Studio&lt;/a&gt;. Their excellent macOS app (and Linux and Windows, but I've only tried it on Mac) was previously free for personal use but required a license for commercial purposes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Until now, the LM Studio app terms stated that for use at a company or organization, you should get in touch with us and get separate commercial license. This requirement is now removed.&lt;/p&gt;
&lt;p&gt;Starting today, there's no need to fill a form or contact us. You and your team can just use LM Studio at work!&lt;/p&gt;
&lt;/blockquote&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/macos"&gt;macos&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;&lt;/p&gt;



</summary><category term="macos"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="lm-studio"/></entry><entry><title>model.yaml</title><link href="https://simonwillison.net/2025/Jun/21/model-yaml/#atom-tag" rel="alternate"/><published>2025-06-21T17:15:21+00:00</published><updated>2025-06-21T17:15:21+00:00</updated><id>https://simonwillison.net/2025/Jun/21/model-yaml/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://modelyaml.org/"&gt;model.yaml&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
From their &lt;a href="https://github.com/modelyaml/modelyaml"&gt;GitHub repo&lt;/a&gt; it looks like this effort quietly launched a couple of months ago, driven by the &lt;a href="https://lmstudio.ai/"&gt;LM Studio&lt;/a&gt; team. Their goal is to specify an "open standard for defining crossplatform, composable AI models".&lt;/p&gt;
&lt;p&gt;A model can be defined using a YAML file that &lt;a href="https://lmstudio.ai/models/mistralai/mistral-small-3.2"&gt;looks like this&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-ent"&gt;model&lt;/span&gt;: &lt;span class="pl-s"&gt;mistralai/mistral-small-3.2&lt;/span&gt;
&lt;span class="pl-ent"&gt;base&lt;/span&gt;:
  - &lt;span class="pl-ent"&gt;key&lt;/span&gt;: &lt;span class="pl-s"&gt;lmstudio-community/mistral-small-3.2-24b-instruct-2506-gguf&lt;/span&gt;
    &lt;span class="pl-ent"&gt;sources&lt;/span&gt;:
      - &lt;span class="pl-ent"&gt;type&lt;/span&gt;: &lt;span class="pl-s"&gt;huggingface&lt;/span&gt;
        &lt;span class="pl-ent"&gt;user&lt;/span&gt;: &lt;span class="pl-s"&gt;lmstudio-community&lt;/span&gt;
        &lt;span class="pl-ent"&gt;repo&lt;/span&gt;: &lt;span class="pl-s"&gt;Mistral-Small-3.2-24B-Instruct-2506-GGUF&lt;/span&gt;
&lt;span class="pl-ent"&gt;metadataOverrides&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;domain&lt;/span&gt;: &lt;span class="pl-s"&gt;llm&lt;/span&gt;
  &lt;span class="pl-ent"&gt;architectures&lt;/span&gt;:
    - &lt;span class="pl-s"&gt;mistral&lt;/span&gt;
  &lt;span class="pl-ent"&gt;compatibilityTypes&lt;/span&gt;:
    - &lt;span class="pl-s"&gt;gguf&lt;/span&gt;
  &lt;span class="pl-ent"&gt;paramsStrings&lt;/span&gt;:
    - &lt;span class="pl-c1"&gt;24B&lt;/span&gt;
  &lt;span class="pl-ent"&gt;minMemoryUsageBytes&lt;/span&gt;: &lt;span class="pl-c1"&gt;14300000000&lt;/span&gt;
  &lt;span class="pl-ent"&gt;contextLengths&lt;/span&gt;:
    - &lt;span class="pl-c1"&gt;4096&lt;/span&gt;
  &lt;span class="pl-ent"&gt;vision&lt;/span&gt;: &lt;span class="pl-c1"&gt;true&lt;/span&gt;&lt;/pre&gt;

&lt;p&gt;This should be enough information for an LLM serving engine - such as LM Studio - to understand where to get the model weights (here that's &lt;a href="https://huggingface.co/lmstudio-community/Mistral-Small-3.2-24B-Instruct-2506-GGUF"&gt;lmstudio-community/Mistral-Small-3.2-24B-Instruct-2506-GGUF&lt;/a&gt; on Hugging Face, but it leaves space for alternative providers) plus various other configuration options and important metadata about the capabilities of the model.&lt;/p&gt;
&lt;p&gt;I like this concept a lot. I've actually been considering something similar for my LLM tool - my idea was to use Markdown with a YAML frontmatter block - but now that there's an early-stage standard for it I may well build on top of this work instead.&lt;/p&gt;
&lt;p&gt;I couldn't find any evidence that anyone outside of LM Studio is using this yet, so it's effectively a one-vendor standard for the moment. All of the models in their &lt;a href="https://lmstudio.ai/models"&gt;Model Catalog&lt;/a&gt; are defined using model.yaml.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/standards"&gt;standards&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/yaml"&gt;yaml&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;&lt;/p&gt;



</summary><category term="standards"/><category term="yaml"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="lm-studio"/></entry><entry><title>Mistral-Small 3.2</title><link href="https://simonwillison.net/2025/Jun/20/mistral-small-32/#atom-tag" rel="alternate"/><published>2025-06-20T19:12:42+00:00</published><updated>2025-06-20T19:12:42+00:00</updated><id>https://simonwillison.net/2025/Jun/20/mistral-small-32/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506"&gt;Mistral-Small 3.2&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Released on Hugging Face a couple of hours ago, so far there aren't any quantizations to run it on a Mac but I'm sure those will emerge pretty quickly.&lt;/p&gt;
&lt;p&gt;This is a minor bump to Mistral Small 3.1, one of my favorite local models. I've been running Small 3.1 &lt;a href="https://ollama.com/library/mistral-small3.1/tags"&gt;via Ollama&lt;/a&gt; where it's a 15GB download - these 24 billion parameter models are a great balance between capabilities and not using up all of the available RAM on my laptop. I expect Ollama will add 3.2 imminently.&lt;/p&gt;
&lt;p&gt;According to Mistral:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Small-3.2 improves in the following categories:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Instruction following&lt;/strong&gt;: Small-3.2 is better at following precise instructions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Repetition errors&lt;/strong&gt;: Small-3.2 produces less infinite generations or repetitive answers&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Function calling&lt;/strong&gt;: Small-3.2's function calling template is more robust (see &lt;a href="https://github.com/mistralai/mistral-common/blob/535b4d0a0fc94674ea17db6cf8dc2079b81cbcfa/src/mistral_common/tokens/tokenizers/instruct.py#L778"&gt;here&lt;/a&gt; and &lt;a href="https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506#function-calling"&gt;examples&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Interestingly they recommend running it with a temperature of 0.15 - many models recommend a default of 0.7. They also provide a &lt;a href="https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506/blob/main/SYSTEM_PROMPT.txt"&gt;suggested system prompt&lt;/a&gt; which includes a note that "Your knowledge base was last updated on 2023-10-01".&lt;/p&gt;
&lt;p&gt;It's not currently available via Mistral's API, or through any of the third-party LLM hosting vendors that I've checked, so I've not been able to run a prompt through the model myself yet.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update:&lt;/strong&gt; I downloaded one of first GGUF quantizations to show up on Hugging Face, &lt;a href="https://huggingface.co/gabriellarson/Mistral-Small-3.2-24B-Instruct-2506-GGUF"&gt;gabriellarson/Mistral-Small-3.2-24B-Instruct-2506-GGUF&lt;/a&gt; by Gabriel Larson. I ran it using Ollama and &lt;a href="https://github.com/taketwo/llm-ollama"&gt;llm-ollama&lt;/a&gt; like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama pull hf.co/gabriellarson/Mistral-Small-3.2-24B-Instruct-2506-GGUF:Q4_K_M
llm install llm-ollama
llm -m hf.co/gabriellarson/Mistral-Small-3.2-24B-Instruct-2506-GGUF:Q4_K_M \
  "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This one is pretty good for a 15GB model!&lt;/p&gt;
&lt;p&gt;&lt;img alt="See description below" src="https://static.simonwillison.net/static/2025/mistral-3.2-pelican.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/87ade7285a1d51d7073f73559baf41b2"&gt;the full transcript&lt;/a&gt; - it doesn't quite work in the actual image but I was delighted to see the model attempt to add this detail:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;!-- Basket with fish --&amp;gt;
&amp;lt;rect x="250" y="190" width="25" height="15" rx="5" fill="#FFA500"/&amp;gt;
&amp;lt;circle cx="260" cy="200" r="3" fill="#FF4500"/&amp;gt; &amp;lt;!-- Fish --&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's what you get if you isolate just that part of the generated SVG:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Large blue rectangle image, in the middle is a very small rounded corner orange rectangle &amp;quot;basket&amp;quot; containing an even smaller red circle that represents the fish." src="https://static.simonwillison.net/static/2025/fish-basket.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;I had Mistral Small 3.2 describe the full image to me, since it's a vision-LLM:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m hf.co/gabriellarson/Mistral-Small-3.2-24B-Instruct-2506-GGUF:Q4_K_M \
  'describe image' \
  -a https://static.simonwillison.net/static/2025/mistral-3.2-pelican.jpg
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And it gave me the following:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The image depicts a cartoonish illustration of a duck that is being lifted off the ground by a hook. The duck appears to be in mid-air, with its wings spread slightly as if it's surprised or reacting to being picked up. The hook is attached to an arm or a mechanism and seems to be connected to a vehicle below—perhaps a truck or a platform with wheels. The background of the image is light blue, indicating an outdoor setting. Overall, the scene is whimsical and playful, possibly suggesting a humorous or unusual situation where the duck is being transported in this manner.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;Update 2&lt;/strong&gt;: It's now available as &lt;a href="https://ollama.com/library/mistral-small3.2"&gt;an official Ollama model&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama pull mistral-small3.2
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;LM Studio has a community quantization too: &lt;a href="https://huggingface.co/lmstudio-community/Mistral-Small-3.2-24B-Instruct-2506-GGUF"&gt;lmstudio-community/Mistral-Small-3.2-24B-Instruct-2506-GGUF&lt;/a&gt;.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/hugging-face"&gt;hugging-face&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mistral"&gt;mistral&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-tool-use"&gt;llm-tool-use&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="hugging-face"/><category term="mistral"/><category term="vision-llms"/><category term="llm-tool-use"/><category term="ollama"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="lm-studio"/></entry><entry><title>Gemma 3 QAT Models</title><link href="https://simonwillison.net/2025/Apr/19/gemma-3-qat-models/#atom-tag" rel="alternate"/><published>2025-04-19T17:20:50+00:00</published><updated>2025-04-19T17:20:50+00:00</updated><id>https://simonwillison.net/2025/Apr/19/gemma-3-qat-models/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://developers.googleblog.com/en/gemma-3-quantized-aware-trained-state-of-the-art-ai-to-consumer-gpus/"&gt;Gemma 3 QAT Models&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Interesting release from Google, as a follow-up &lt;a href="https://simonwillison.net/2025/Mar/12/gemma-3/"&gt;to Gemma 3&lt;/a&gt; from last month:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;To make Gemma 3 even more accessible, we are announcing new versions optimized with Quantization-Aware Training (QAT) that dramatically reduces memory requirements while maintaining high quality. This enables you to run powerful models like Gemma 3 27B locally on consumer-grade GPUs like the NVIDIA RTX 3090.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I wasn't previously aware of Quantization-Aware Training but it turns out to be quite an established pattern now, supported in both &lt;a href="https://www.tensorflow.org/model_optimization/guide/quantization/training"&gt;Tensorflow&lt;/a&gt; and &lt;a href="https://pytorch.org/blog/quantization-aware-training/"&gt;PyTorch&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Google report model size drops from BF16 to int4 for the following models:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemma 3 27B: 54GB to 14.1GB&lt;/li&gt;
&lt;li&gt;Gemma 3 12B: 24GB to 6.6GB&lt;/li&gt;
&lt;li&gt;Gemma 3 4B: 8GB to 2.6GB&lt;/li&gt;
&lt;li&gt;Gemma 3 1B: 2GB to 0.5GB&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;They partnered with Ollama, LM Studio, MLX (here's &lt;a href="https://huggingface.co/collections/mlx-community/gemma-3-qat-68002674cd5afc6f9022a0ae"&gt;their collection&lt;/a&gt;) and llama.cpp for this release - I'd love to see more AI labs following their example.&lt;/p&gt;
&lt;p&gt;The Ollama model version picker currently hides them behind "View all" option, so here are the direct links:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://ollama.com/library/gemma3:1b-it-qat"&gt;gemma3:1b-it-qat&lt;/a&gt; - 1GB&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ollama.com/library/gemma3:4b-it-qat"&gt;gemma3:4b-it-qat&lt;/a&gt; - 4GB&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ollama.com/library/gemma3:12b-it-qat"&gt;gemma3:12b-it-qat&lt;/a&gt; - 8.9GB&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ollama.com/library/gemma3:27b-it-qat"&gt;gemma3:27b-it-qat&lt;/a&gt; - 18GB&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I fetched that largest model with:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama pull gemma3:27b-it-qat
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And now I'm trying it out with &lt;a href="https://github.com/taketwo/llm-ollama"&gt;llm-ollama&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m gemma3:27b-it-qat "impress me with some physics"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I got &lt;a href="https://gist.github.com/simonw/5b699ba6b7c05e2d071910e238823ff4"&gt;a pretty great response&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: Having spent a while putting it through its paces via &lt;a href="https://simonwillison.net/2024/Dec/27/open-webui/"&gt;Open WebUI&lt;/a&gt; and &lt;a href="https://tailscale.com/"&gt;Tailscale&lt;/a&gt; to access my laptop from my phone I think this may be my new favorite general-purpose local model. Ollama appears to use 22GB of RAM while the model is running, which leaves plenty on my 64GB machine for other applications.&lt;/p&gt;
&lt;p&gt;I've also tried it via &lt;a href="https://github.com/simonw/llm-mlx"&gt;llm-mlx&lt;/a&gt; like this (downloading 16GB):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-mlx
llm mlx download-model mlx-community/gemma-3-27b-it-qat-4bit
llm chat -m mlx-community/gemma-3-27b-it-qat-4bit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It feels a little faster with MLX and uses 15GB of memory according to Activity Monitor.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tailscale"&gt;tailscale&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemma"&gt;gemma&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="ai"/><category term="tailscale"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="mlx"/><category term="ollama"/><category term="gemma"/><category term="llm-release"/><category term="lm-studio"/></entry><entry><title>QwQ-32B: Embracing the Power of Reinforcement Learning</title><link href="https://simonwillison.net/2025/Mar/5/qwq-32b/#atom-tag" rel="alternate"/><published>2025-03-05T21:10:28+00:00</published><updated>2025-03-05T21:10:28+00:00</updated><id>https://simonwillison.net/2025/Mar/5/qwq-32b/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://qwenlm.github.io/blog/qwq-32b/"&gt;QwQ-32B: Embracing the Power of Reinforcement Learning&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
New Apache 2 licensed reasoning model from Qwen:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We are excited to introduce QwQ-32B, a model with 32 billion parameters that achieves performance comparable to DeepSeek-R1, which boasts 671 billion parameters (with 37 billion activated). This remarkable outcome underscores the effectiveness of RL when applied to robust foundation models pretrained on extensive world knowledge.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I had a lot of fun &lt;a href="https://simonwillison.net/2024/Nov/27/qwq/"&gt;trying out&lt;/a&gt; their previous QwQ reasoning model last November. I demonstrated this new QwQ in &lt;a href="https://simonwillison.net/2025/Mar/8/nicar-llms/#llms.027.jpeg"&gt;my talk at NICAR&lt;/a&gt; about recent LLM developments. Here's &lt;a href="https://gist.github.com/simonw/46cd83701868d364f4cfb1340f0f7fa5"&gt;the example I ran&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;LM Studio just &lt;a href="https://huggingface.co/lmstudio-community/QwQ-32B-GGUF/tree/main"&gt;released GGUFs&lt;/a&gt; ranging in size from 17.2 to 34.8 GB. MLX have compatible weights published in &lt;a href="https://huggingface.co/mlx-community/QwQ-32B-3bit"&gt;3bit&lt;/a&gt;, &lt;a href="https://huggingface.co/mlx-community/QwQ-32B-4bit"&gt;4bit&lt;/a&gt;, &lt;a href="https://huggingface.co/mlx-community/QwQ-32B-6bit"&gt;6bit&lt;/a&gt; and &lt;a href="https://huggingface.co/mlx-community/QwQ-32B-8bit"&gt;8bit&lt;/a&gt;. Ollama &lt;a href="https://ollama.com/library/qwq"&gt;has the new qwq&lt;/a&gt; too - it looks like they've renamed the previous November release &lt;a href="https://ollama.com/library/qwq:32b-preview-q8_0"&gt;qwq:32b-preview&lt;/a&gt;.

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://twitter.com/alibaba_qwen/status/1897361654763151544"&gt;@alibaba_qwen&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="open-source"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="qwen"/><category term="mlx"/><category term="ollama"/><category term="llm-reasoning"/><category term="llm-release"/><category term="lm-studio"/><category term="ai-in-china"/></entry><entry><title>olmOCR</title><link href="https://simonwillison.net/2025/Feb/26/olmocr/#atom-tag" rel="alternate"/><published>2025-02-26T02:04:03+00:00</published><updated>2025-02-26T02:04:03+00:00</updated><id>https://simonwillison.net/2025/Feb/26/olmocr/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://olmocr.allenai.org/"&gt;olmOCR&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
New from &lt;a href="https://allenai.org/"&gt;Ai2&lt;/a&gt; - olmOCR is "an open-source tool designed for high-throughput conversion of PDFs and other documents into plain text while preserving natural reading order".&lt;/p&gt;
&lt;p&gt;At its core is &lt;a href="https://huggingface.co/allenai/olmOCR-7B-0225-preview"&gt;allenai/olmOCR-7B-0225-preview&lt;/a&gt;, a Qwen2-VL-7B-Instruct variant trained on ~250,000 pages of diverse PDF content (both scanned and text-based) that were labelled using GPT-4o and made available as the &lt;a href="https://huggingface.co/datasets/allenai/olmOCR-mix-0225"&gt;olmOCR-mix-0225 dataset&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/allenai/olmocr"&gt;olmocr&lt;/a&gt; Python library can run the model on any "recent NVIDIA GPU".  I haven't managed to run it on my own Mac yet - there are &lt;a href="https://huggingface.co/lmstudio-community/olmOCR-7B-0225-preview-GGUF"&gt;GGUFs out there&lt;/a&gt; but it's not clear to me how to run vision prompts through them - but Ai2 offer &lt;a href="https://olmocr.allenai.org/"&gt;an online demo&lt;/a&gt; which can handle up to ten pages for free.&lt;/p&gt;
&lt;p&gt;Given the right hardware this looks like a very inexpensive way to run large scale document conversion projects:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We carefully optimized our inference pipeline for large-scale batch processing using SGLang, enabling olmOCR to convert one million PDF pages for just $190 - about 1/32nd the cost of using GPT-4o APIs.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The most interesting idea from &lt;a href="https://olmocr.allenai.org/papers/olmocr.pdf"&gt;the technical report (PDF)&lt;/a&gt; is something they call "document anchoring":&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Document anchoring extracts coordinates of salient elements in each page (e.g., text blocks and images) and injects them alongside raw text extracted
from the PDF binary file. [...]&lt;/p&gt;
&lt;p&gt;Document anchoring processes PDF document pages via the PyPDF library to extract a representation of the page’s structure from the underlying PDF. All of the text blocks and images in the page are extracted, including position information. Starting with the most relevant text blocks and images, these are sampled and added to the prompt of the VLM, up to a defined maximum character limit. This extra information is then available to the model when processing the document.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;img alt="Left side shows a green-header interface with coordinates like [150x220]√3x−1+(1+x)², [150x180]Section 6, [150x50]Lorem ipsum dolor sit amet, [150x70]consectetur adipiscing elit, sed do, [150x90]eiusmod tempor incididunt ut, [150x110]labore et dolore magna aliqua, [100x280]Table 1, followed by grid coordinates with A, B, C, AA, BB, CC, AAA, BBB, CCC values. Right side shows the rendered document with equation, text and table." src="https://static.simonwillison.net/static/2025/olmocr-document-anchoring.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;The one limitation of olmOCR at the moment is that it doesn't appear to do anything with diagrams, figures or illustrations. Vision models are actually very good at interpreting these now, so my ideal OCR solution would include detailed automated descriptions of this kind of content in the resulting text.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: Jonathan Soma &lt;a href="https://jonathansoma.com/words/olmocr-on-macos-with-lm-studio.html"&gt;figured out how to run it on a Mac&lt;/a&gt; using LM Studio and the &lt;a href="https://github.com/allenai/olmocr/"&gt;olmocr&lt;/a&gt; Python package.

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://twitter.com/soldni/status/1894418235334037570?s=46"&gt;Luca Soldaini&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fine-tuning"&gt;fine-tuning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai2"&gt;ai2&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ocr"/><category term="pdf"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="fine-tuning"/><category term="vision-llms"/><category term="qwen"/><category term="ai2"/><category term="lm-studio"/><category term="ai-in-china"/></entry><entry><title>Qwen2.5-1M: Deploy Your Own Qwen with Context Length up to 1M Tokens</title><link href="https://simonwillison.net/2025/Jan/26/qwen25-1m/#atom-tag" rel="alternate"/><published>2025-01-26T18:54:45+00:00</published><updated>2025-01-26T18:54:45+00:00</updated><id>https://simonwillison.net/2025/Jan/26/qwen25-1m/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://qwenlm.github.io/blog/qwen2.5-1m/"&gt;Qwen2.5-1M: Deploy Your Own Qwen with Context Length up to 1M Tokens&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Very significant new release from Alibaba's Qwen team. Their openly licensed (sometimes Apache 2, sometimes Qwen license, I've had trouble keeping up) Qwen 2.5 LLM previously had an input token limit of 128,000 tokens. This new model increases that to 1 million, using a new technique called &lt;strong&gt;Dual Chunk Attention&lt;/strong&gt;, first described in &lt;a href="https://arxiv.org/abs/2402.17463"&gt;this paper&lt;/a&gt; from February 2024.&lt;/p&gt;
&lt;p&gt;They've released two models on Hugging Face: &lt;a href="https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-1M"&gt;Qwen2.5-7B-Instruct-1M&lt;/a&gt; and &lt;a href="https://huggingface.co/Qwen/Qwen2.5-14B-Instruct-1M"&gt;Qwen2.5-14B-Instruct-1M&lt;/a&gt;, both requiring CUDA and both under an Apache 2.0 license.&lt;/p&gt;
&lt;p&gt;You'll need a &lt;em&gt;lot&lt;/em&gt; of VRAM to run them at their full capacity:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;VRAM Requirement for processing 1 million-token sequences:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Qwen2.5-7B-Instruct-1M&lt;/strong&gt;: At least 120GB VRAM (total across GPUs).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Qwen2.5-14B-Instruct-1M&lt;/strong&gt;: At least 320GB VRAM (total across GPUs).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If your GPUs do not have sufficient VRAM, you can still use Qwen2.5-1M models for shorter tasks.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Qwen recommend using their custom fork of vLLM to serve the models:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;You can also use the previous framework that supports Qwen2.5 for inference, but accuracy degradation may occur for sequences exceeding 262,144 tokens.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;GGUF quantized versions of the models are already starting to show up. LM Studio's "official model curator" &lt;a href="https://huggingface.co/bartowski"&gt;Bartowski&lt;/a&gt; published &lt;a href="https://huggingface.co/lmstudio-community/Qwen2.5-7B-Instruct-1M-GGUF"&gt;lmstudio-community/Qwen2.5-7B-Instruct-1M-GGUF&lt;/a&gt; and &lt;a href="https://huggingface.co/lmstudio-community/Qwen2.5-14B-Instruct-1M-GGUF"&gt;lmstudio-community/Qwen2.5-14B-Instruct-1M-GGUF&lt;/a&gt; - sizes range from 4.09GB to 8.1GB for the 7B model and 7.92GB to 15.7GB for the 14B.&lt;/p&gt;
&lt;p&gt;These might not work well yet with the full context lengths as the underlying &lt;code&gt;llama.cpp&lt;/code&gt; library may need some changes.&lt;/p&gt;
&lt;p&gt;I tried running the 8.1GB 7B model using &lt;a href="https://ollama.com/"&gt;Ollama&lt;/a&gt; on my Mac like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama run hf.co/lmstudio-community/Qwen2.5-7B-Instruct-1M-GGUF:Q8_0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then with &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-ollama
llm models -q qwen # To search for the model ID
# I set a shorter q1m alias:
llm aliases set q1m hf.co/lmstudio-community/Qwen2.5-7B-Instruct-1M-GGUF:Q8_0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I tried piping a large prompt in using &lt;a href="https://pypi.org/project/files-to-prompt/"&gt;files-to-prompt&lt;/a&gt; like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;files-to-prompt ~/Dropbox/Development/llm -e py -c | llm -m q1m 'describe this codebase in detail'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That should give me every Python file in my &lt;a href="https://github.com/simonw/llm"&gt;llm project&lt;/a&gt;. Piping that through &lt;a href="https://pypi.org/project/ttok/"&gt;ttok&lt;/a&gt; first told me this was 63,014 OpenAI tokens, I expect that count is similar for Qwen.&lt;/p&gt;
&lt;p&gt;The result &lt;a href="https://gist.github.com/simonw/ace6ff544dddabb0797b8d20d84627a8#response"&gt;was disappointing&lt;/a&gt;: it appeared to describe just the last Python file that stream. Then I noticed the token usage report:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;2,048 input, 999 output
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This suggests to me that something's not working right here - maybe the Ollama hosting framework is truncating the input, or maybe there's a problem with the GGUF I'm using?&lt;/p&gt;
&lt;p&gt;I'll update this post when I figure out how to run longer prompts through the new Qwen model using GGUF weights on a Mac.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update:&lt;/strong&gt; It &lt;a href="https://news.ycombinator.com/item?id=42832838#42833427"&gt;turns out&lt;/a&gt; Ollama has a &lt;code&gt;num_ctx&lt;/code&gt; option which defaults to 2048, affecting the input context length. I tried this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;files-to-prompt \
  ~/Dropbox/Development/llm \
  -e py -c | \
llm -m q1m 'describe this codebase in detail' \
 -o num_ctx 80000
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;But I quickly ran out of RAM (I have 64GB but a lot of that was in use already) and hit &lt;code&gt;Ctrl+C&lt;/code&gt; to avoid crashing my computer. I need to experiment a bit to figure out how much RAM is used for what context size.&lt;/p&gt;
&lt;p&gt;Awni Hannun &lt;a href="https://twitter.com/awnihannun/status/1883611098081099914"&gt;shared tips&lt;/a&gt; for running &lt;a href="https://huggingface.co/mlx-community/Qwen2.5-7B-Instruct-1M-4bit"&gt;mlx-community/Qwen2.5-7B-Instruct-1M-4bit&lt;/a&gt; using MLX,  which should work for up to 250,000 tokens. They ran 120,000 tokens and reported:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Peak RAM for prompt filling was 22GB&lt;/li&gt;
&lt;li&gt;Peak RAM for generation 12GB&lt;/li&gt;
&lt;li&gt;Prompt filling took 350 seconds on an M2 Ultra&lt;/li&gt;
&lt;li&gt;Generation ran at 31 tokens-per-second on M2 Ultra&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://twitter.com/reach_vb/status/1883560095176708163"&gt;VB&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/long-context"&gt;long-context&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama-cpp"&gt;llama-cpp&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/files-to-prompt"&gt;files-to-prompt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="qwen"/><category term="mlx"/><category term="ollama"/><category term="long-context"/><category term="llama-cpp"/><category term="llm-release"/><category term="files-to-prompt"/><category term="lm-studio"/><category term="ai-in-china"/></entry></feed>