<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: local-llms</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/local-llms.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2026-04-16T17:16:52+00:00</updated><author><name>Simon Willison</name></author><entry><title>Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7</title><link href="https://simonwillison.net/2026/Apr/16/qwen-beats-opus/#atom-tag" rel="alternate"/><published>2026-04-16T17:16:52+00:00</published><updated>2026-04-16T17:16:52+00:00</updated><id>https://simonwillison.net/2026/Apr/16/qwen-beats-opus/#atom-tag</id><summary type="html">
    &lt;p&gt;For anyone who has been (inadvisably) taking my &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;pelican riding a bicycle benchmark&lt;/a&gt; seriously as a robust way to test models, here are pelicans from this morning's two big model releases - &lt;a href="https://qwen.ai/blog?id=qwen3.6-35b-a3b"&gt;Qwen3.6-35B-A3B from Alibaba&lt;/a&gt; and &lt;a href="https://www.anthropic.com/news/claude-opus-4-7"&gt;Claude Opus 4.7 from Anthropic&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here's the Qwen 3.6 pelican, generated using &lt;a href="https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/blob/main/Qwen3.6-35B-A3B-UD-Q4_K_S.gguf"&gt;this 20.9GB Qwen3.6-35B-A3B-UD-Q4_K_S.gguf&lt;/a&gt; quantized model by Unsloth, running on my MacBook Pro M5 via &lt;a href="https://lmstudio.ai/"&gt;LM Studio&lt;/a&gt; (and the &lt;a href="https://github.com/agustif/llm-lmstudio"&gt;llm-lmstudio&lt;/a&gt; plugin) - &lt;a href="https://gist.github.com/simonw/4389d355d8e162bc6e4547da214f7dd2"&gt;transcript here&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/Qwen3.6-35B-A3B-UD-Q4_K_S-pelican.png" alt="The bicycle frame is the correct shape. There are clouds in the sky. The pelican has a dorky looking pouch. A caption on the ground reads Pelican on a Bicycle!" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And here's one I got from Anthropic's &lt;a href="https://www.anthropic.com/news/claude-opus-4-7"&gt;brand new Claude Opus 4.7&lt;/a&gt; (&lt;a href="https://gist.github.com/simonw/afcb19addf3f38eb1996e1ebe749c118"&gt;transcript&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/opus-4.7-pelican.png" alt="The bicycle frame is entirely the wrong shape. No clouds, a yellow sun. The pelican is looking behind itself, and has a less pronounced pouch than I would like." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I'm giving this one to Qwen 3.6. Opus managed to mess up the bicycle frame!&lt;/p&gt;
&lt;p&gt;I tried Opus a second time passing &lt;code&gt;thinking_level: max&lt;/code&gt;. It didn't do much better (&lt;a href="https://gist.github.com/simonw/7566e04a81accfb9affda83451c0f363"&gt;transcript&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/opus-4.7-pelican-max.png" alt="The bicycle frame is entirely the wrong shape but in a different way. Lines are more bold. Pelican looks a bit more like a pelican." style="max-width: 100%;" /&gt;&lt;/p&gt;

&lt;h4 id="i-dont-think-qwen-are-cheating"&gt;I don't think Qwen are cheating&lt;/h4&gt;
&lt;p&gt;A lot of people are &lt;a href="https://simonwillison.net/2025/Nov/13/training-for-pelicans-riding-bicycles/"&gt;convinced that the labs train for my stupid benchmark&lt;/a&gt;. I don't think they do, but honestly this result did give me a little glint of suspicion. So I'm burning one of my secret backup tests - here's what I got from Qwen3.6-35B-A3B and Opus 4.7 for "Generate an SVG of a flamingo riding a unicycle":&lt;/p&gt;

&lt;div style="display: flex; gap: 4px;"&gt;
  &lt;figure style="flex: 1; text-align: center; margin: 0;"&gt;
    &lt;figcaption style="margin-bottom: 1em"&gt;Qwen3.6-35B-A3B&lt;br /&gt;(&lt;a href="https://gist.github.com/simonw/f1d1ff01c34dda5fdedf684cfc430d92"&gt;transcript&lt;/a&gt;)&lt;/figcaption&gt;
    &lt;img src="https://static.simonwillison.net/static/2026/qwen-flamingo.png" alt="The unicycle spokes are a too long. The pelican has sunglasses, a bowtie and appears to be smoking a cigarette. It has two heart emoji surrounding the caption Flamingo on a Unicycle. It has a lot of charisma." style="max-width: 100%; height: auto;" /&gt;
  &lt;/figure&gt;
  &lt;figure style="flex: 1; text-align: center; margin: 0;"&gt;
    &lt;figcaption style="margin-bottom: 1em"&gt;Opus 4.7&lt;br /&gt;(&lt;a href="https://gist.github.com/simonw/35121ad5dcf23bf860397a103ae88d50"&gt;transcript&lt;/a&gt;)&lt;/figcaption&gt;
    &lt;img src="https://static.simonwillison.net/static/2026/opus-flamingo.png" alt="The unicycle has a black wheel. The flamingo is a competent if slightly dull vector illustration of a flamingo. It has no flair." style="max-width: 100%; height: auto;" /&gt;
  &lt;/figure&gt;
&lt;/div&gt;


&lt;p&gt;I'm giving this one to Qwen too, partly for the excellent &lt;code&gt;&amp;lt;!-- Sunglasses on flamingo! --&amp;gt;&lt;/code&gt; SVG comment.&lt;/p&gt;

&lt;h4 id="what-can-we-learn-from-this-"&gt;What can we learn from this?&lt;/h4&gt;
&lt;p&gt;The pelican benchmark has always been meant as a joke - it's mainly a statement on how obtuse and absurd the task of comparing these models is.&lt;/p&gt;
&lt;p&gt;The weird thing about that joke is that, for the most part, there has been a direct correlation between the quality of the pelicans produced and the general usefulness of the models. Those &lt;a href="https://simonwillison.net/2024/Oct/25/pelicans-on-a-bicycle/"&gt;first pelicans from October 2024&lt;/a&gt; were junk. The &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;more recent entries&lt;/a&gt; have generally been much, much better - to the point that Gemini 3.1 Pro produces &lt;a href="https://simonwillison.net/2026/Feb/19/gemini-31-pro/"&gt;illustrations you could actually use somewhere&lt;/a&gt;, provided you had a pressing need to illustrate a pelican riding a bicycle.&lt;/p&gt;
&lt;p&gt;Today, even that loose connection to utility has been broken. I have enormous respect for Qwen, but I very much doubt that a 21GB quantized version of their latest model is more powerful or useful than Anthropic's latest proprietary release.&lt;/p&gt;
&lt;p&gt;If the thing you need is an SVG illustration of a pelican riding a bicycle though, right now Qwen3.6-35B-A3B running on a laptop is a better bet than Opus 4.7!&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="anthropic"/><category term="claude"/><category term="qwen"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="lm-studio"/></entry><entry><title>Google AI Edge Gallery</title><link href="https://simonwillison.net/2026/Apr/6/google-ai-edge-gallery/#atom-tag" rel="alternate"/><published>2026-04-06T05:18:26+00:00</published><updated>2026-04-06T05:18:26+00:00</updated><id>https://simonwillison.net/2026/Apr/6/google-ai-edge-gallery/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://apps.apple.com/nl/app/google-ai-edge-gallery/id6749645337"&gt;Google AI Edge Gallery&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Terrible name, really great app: this is Google's official app for running their Gemma 4 models (the E2B and E4B sizes, plus some members of the Gemma 3 family) directly on your iPhone.&lt;/p&gt;
&lt;p&gt;It works &lt;em&gt;really&lt;/em&gt; well. The E2B model is a 2.54GB download and is both fast and genuinely useful.&lt;/p&gt;
&lt;p&gt;The app also provides "ask questions about images" and audio transcription (up to 30s) with the two small Gemma 4 models, and has an interesting "skills" demo which demonstrates tool calling against eight different interactive widgets, each implemented as an HTML page (though sadly the source code is not visible): interactive-map, kitchen-adventure, calculate-hash, text-spinner, mood-tracker, mnemonic-password, query-wikipedia, and qr-code.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/gemini-agent-skills.jpg" alt="Screenshot of an &amp;quot;Agent Skills&amp;quot; chat interface using the Gemma-4-E2B-it model. The user prompt reads &amp;quot;Show me the Castro Theatre on a map.&amp;quot; The model response, labeled &amp;quot;Model on GPU,&amp;quot; shows it &amp;quot;Called JS skill &amp;#39;interactive-map/index.html&amp;#39;&amp;quot; and displays an embedded Google Map centered on a red pin at The Castro Theatre in San Francisco, with nearby landmarks visible including Starbelly, Cliff&amp;#39;s Variety, Blind Butcher, GLBT Historical Society Museum, and Fable. An &amp;quot;Open in Maps&amp;quot; link and &amp;quot;View in full screen&amp;quot; button are shown. Below the map, the model states &amp;quot;The interactive map view for the Castro Theatre has been shown.&amp;quot; with a response time of 2.4 s. A text input field with &amp;quot;Type prompt...&amp;quot; placeholder, a &amp;quot;+&amp;quot; button, and a &amp;quot;Skills&amp;quot; button appear at the bottom." style="max-width: min(400px, 100%); margin: 0 auto; display: block;"&gt;&lt;/p&gt;
&lt;p&gt;(That demo did freeze the app when I tried to add a follow-up prompt though.)&lt;/p&gt;
&lt;p&gt;This is the first time I've seen a local model vendor release an official app for trying out their models on an iPhone. Sadly it's missing permanent logs - conversations with this app are ephemeral.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=47652561"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/iphone"&gt;iphone&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-tool-use"&gt;llm-tool-use&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="iphone"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="gemini"/><category term="llm-tool-use"/></entry><entry><title>Gemma 4: Byte for byte, the most capable open models</title><link href="https://simonwillison.net/2026/Apr/2/gemma-4/#atom-tag" rel="alternate"/><published>2026-04-02T18:28:54+00:00</published><updated>2026-04-02T18:28:54+00:00</updated><id>https://simonwillison.net/2026/Apr/2/gemma-4/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/"&gt;Gemma 4: Byte for byte, the most capable open models&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Four new vision-capable Apache 2.0 licensed reasoning LLMs from Google DeepMind, sized at 2B, 4B, and 31B, plus a 26B-A4B Mixture-of-Experts.&lt;/p&gt;
&lt;p&gt;Google emphasize "unprecedented level of intelligence-per-parameter", providing yet more evidence that creating small useful models is one of the hottest areas of research right now.&lt;/p&gt;
&lt;p&gt;They actually label the two smaller models as E2B and E4B for "Effective" parameter size. The system card explains:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The smaller models incorporate Per-Layer Embeddings (PLE) to maximize parameter efficiency in on-device deployments. Rather than adding more layers or parameters to the model, PLE gives each decoder layer its own small embedding for every token. These embedding tables are large but are only used for quick lookups, which is why the effective parameter count is much smaller than the total.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I don't entirely understand that, but apparently that's what the "E" in E2B means!&lt;/p&gt;
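&lt;p&gt;Here's a minimal sketch of my reading of that description - the shapes are invented and this is a guess at the mechanism, not anything from the actual Gemma 4 code:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np

# Invented toy shapes, just to show the access pattern.
n_layers, vocab, d_ple = 4, 1_000, 8

# One small embedding table per decoder layer. Because these are only
# ever indexed - never multiplied - they can live in slower storage.
ple_tables = [
    np.random.rand(vocab, d_ple).astype(np.float16)
    for _ in range(n_layers)
]

def per_layer_embedding(layer_idx, token_ids):
    # A pure table lookup per layer: cheap to execute even though the
    # tables add many raw parameters, hence the smaller "effective" count.
    return ple_tables[layer_idx][token_ids]

print(per_layer_embedding(0, np.array([3, 1, 4])).shape)  # (3, 8)
&lt;/code&gt;&lt;/pre&gt;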
&lt;p&gt;One particularly exciting feature of these models is that they are multi-modal beyond just images:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Vision and audio&lt;/strong&gt;: All models natively process video and images, supporting variable resolutions, and excelling at visual tasks like OCR and chart understanding. Additionally, the E2B and E4B models feature native audio input for speech recognition and understanding.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I've not figured out a way to run audio input locally - I don't think that feature is in LM Studio or Ollama yet.&lt;/p&gt;
&lt;p&gt;I tried them out using the GGUFs for &lt;a href="https://lmstudio.ai/models/gemma-4"&gt;LM Studio&lt;/a&gt;. The 2B (4.41GB), 4B (6.33GB) and 26B-A4B (17.99GB) models all worked perfectly, but the 31B (19.89GB) model was broken and spat out &lt;code&gt;"---\n"&lt;/code&gt; in a loop for every prompt I tried.&lt;/p&gt;
&lt;p&gt;The succession of &lt;a href="https://gist.github.com/simonw/12ae4711288637a722fd6bd4b4b56bdb"&gt;pelican quality&lt;/a&gt; from 2B to 4B to 26B-A4B is notable:&lt;/p&gt;
&lt;p&gt;E2B:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Two blue circles on a brown rectangle and a weird mess of orange blob and yellow triangle for the pelican" src="https://static.simonwillison.net/static/2026/gemma-4-2b-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;E4B:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Two black wheels joined by a sort of grey surfboard, the pelican is semicircles and a blue blob floating above it" src="https://static.simonwillison.net/static/2026/gemma-4-4b-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;26B-A4B:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Bicycle has the right pieces although the frame is wonky. Pelican is genuinely good, has a big triangle beak and a nice curved neck and is clearly a bird that is sitting on the bicycle" src="https://static.simonwillison.net/static/2026/gemma-4-26b-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;(This one actually had an SVG error - "error on line 18 at column 88: Attribute x1 redefined" - but after &lt;a href="https://gist.github.com/simonw/12ae4711288637a722fd6bd4b4b56bdb?permalink_comment_id=6074105#gistcomment-6074105"&gt;fixing that&lt;/a&gt; I got probably the best pelican I've seen yet from a model that runs on my laptop.)&lt;/p&gt;
&lt;p&gt;Google are providing API access to the two larger Gemma models via their &lt;a href="https://aistudio.google.com/prompts/new_chat?model=gemma-4-31b-it"&gt;AI Studio&lt;/a&gt;. I added support to &lt;a href="https://github.com/simonw/llm-gemini"&gt;llm-gemini&lt;/a&gt; and then &lt;a href="https://gist.github.com/simonw/f9f9e9c34c7cc0ef5325a2876413e51e"&gt;ran a pelican&lt;/a&gt; through the 31B model using that:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m gemini/gemma-4-31b-it 'Generate an SVG of a pelican riding a bicycle'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Pretty good, though it is missing the front part of the bicycle frame:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Motion blur lines, a mostly great bicycle albeit missing the front part of the frame. Pelican is decent. " src="https://static.simonwillison.net/static/2026/gemma-4-31b-pelican.png" /&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemma"&gt;gemma&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="vision-llms"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="gemma"/><category term="llm-release"/><category term="lm-studio"/></entry><entry><title>Quoting Georgi Gerganov</title><link href="https://simonwillison.net/2026/Mar/30/georgi-gerganov/#atom-tag" rel="alternate"/><published>2026-03-30T21:31:02+00:00</published><updated>2026-03-30T21:31:02+00:00</updated><id>https://simonwillison.net/2026/Mar/30/georgi-gerganov/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://twitter.com/ggerganov/status/2038674698809102599"&gt;&lt;p&gt;Note that the main issues that people currently unknowingly face with local models mostly revolve around the harness and some intricacies around model chat templates and prompt construction. Sometimes there are even pure inference bugs. From typing the task in the client to the actual result, there is a long chain of components that atm are not only fragile - are also developed by different parties. So it's difficult to consolidate the entire stack and you have to keep in mind that what you are currently observing is with very high probability still broken in some subtle way along that chain.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://twitter.com/ggerganov/status/2038674698809102599"&gt;Georgi Gerganov&lt;/a&gt;, explaining why it's hard to find local models that work well with coding agents&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/georgi-gerganov"&gt;georgi-gerganov&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="coding-agents"/><category term="georgi-gerganov"/></entry><entry><title>Mr. Chatterbox is a (weak) Victorian-era ethically trained model you can run on your own computer</title><link href="https://simonwillison.net/2026/Mar/30/mr-chatterbox/#atom-tag" rel="alternate"/><published>2026-03-30T14:28:34+00:00</published><updated>2026-03-30T14:28:34+00:00</updated><id>https://simonwillison.net/2026/Mar/30/mr-chatterbox/#atom-tag</id><summary type="html">
    &lt;p&gt;Trip Venturella released &lt;a href="https://www.estragon.news/mr-chatterbox-or-the-modern-prometheus/"&gt;Mr. Chatterbox&lt;/a&gt;, a language model trained entirely on out-of-copyright text from the British Library. Here's how he describes it in &lt;a href="https://huggingface.co/tventurella/mr_chatterbox_model"&gt;the model card&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Mr. Chatterbox is a language model trained entirely from scratch on a corpus of over 28,000 Victorian-era British texts published between 1837 and 1899, drawn from a dataset made available &lt;a href="https://huggingface.co/datasets/TheBritishLibrary/blbooks"&gt;by the British Library&lt;/a&gt;. The model has absolutely no training inputs from after 1899 — the vocabulary and ideas are formed exclusively from nineteenth-century literature.&lt;/p&gt;
&lt;p&gt;Mr. Chatterbox's training corpus was 28,035 books, with an estimated 2.93 billion input tokens after filtering. The model has roughly 340 million parameters, roughly the same size as GPT-2-Medium. The difference is, of course, that unlike GPT-2, Mr. Chatterbox is trained entirely on historical data.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Given how hard it is to train a useful LLM without using vast amounts of scraped, unlicensed data, I've been dreaming of a model like this for a couple of years now. What would a model trained on out-of-copyright text be like to chat with?&lt;/p&gt;
&lt;p&gt;Thanks to Trip we can now find out for ourselves!&lt;/p&gt;
&lt;p&gt;The model itself is tiny, at least by Large Language Model standards - just &lt;a href="https://huggingface.co/tventurella/mr_chatterbox_model/tree/main"&gt;2.05GB&lt;/a&gt; on disk. You can try it out using Trip's &lt;a href="https://huggingface.co/spaces/tventurella/mr_chatterbox"&gt;HuggingFace Spaces demo&lt;/a&gt;:&lt;/p&gt;
&lt;p style="text-align: center"&gt;&lt;img src="https://static.simonwillison.net/static/2026/chatterbox.jpg" alt="Screenshot of a Victorian-themed chatbot interface titled &amp;quot;🎩 Mr. Chatterbox (Beta)&amp;quot; with subtitle &amp;quot;The Victorian Gentleman Chatbot&amp;quot;. The conversation shows a user asking &amp;quot;How should I behave at dinner?&amp;quot; with the bot replying &amp;quot;My good fellow, one might presume that such trivialities could not engage your attention during an evening's discourse!&amp;quot; The user then asks &amp;quot;What are good topics?&amp;quot; and the bot responds &amp;quot;The most pressing subjects of our society— Indeed, a gentleman must endeavor to engage the conversation with grace and vivacity. Such pursuits serve as vital antidotes against ennui when engaged in agreeable company.&amp;quot; A text input field at the bottom reads &amp;quot;Say hello...&amp;quot; with a send button. The interface uses a dark maroon and cream color scheme." style="max-width: 80%;" /&gt;&lt;/p&gt;
&lt;p&gt;Honestly, it's pretty terrible. Talking with it feels more like chatting with a Markov chain than an LLM - the responses may have a delightfully Victorian flavor to them but it's hard to get a response that usefully answers a question.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://arxiv.org/abs/2203.15556"&gt;2022 Chinchilla paper&lt;/a&gt; suggests a ratio of 20x the parameter count to training tokens. For a 340m model that would suggest around 7 billion tokens, more than twice the British Library corpus used here. The smallest Qwen 3.5 model is 600m parameters and that model family starts to get interesting at 2b - so my hunch is we would need 4x or more the training data to get something that starts to feel like a useful conversational partner.&lt;/p&gt;
&lt;p&gt;But what a fun project!&lt;/p&gt;
&lt;h4 id="running-it-locally-with-llm"&gt;Running it locally with LLM&lt;/h4&gt;
&lt;p&gt;I decided to see if I could run the model on my own machine using my &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; framework.&lt;/p&gt;
&lt;p&gt;I got Claude Code to do most of the work - &lt;a href="https://gisthost.github.io/?7d0f00e152dd80d617b5e501e4ff025b/index.html"&gt;here's the transcript&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Trip trained the model using Andrej Karpathy's &lt;a href="https://github.com/karpathy/nanochat"&gt;nanochat&lt;/a&gt;, so I cloned that project, pulled the model weights and told Claude to build a Python script to run the model. Once we had that working (which ended up needing some extra details from the &lt;a href="https://huggingface.co/spaces/tventurella/mr_chatterbox/tree/main"&gt;Space demo source code&lt;/a&gt;) I had Claude &lt;a href="https://llm.datasette.io/en/stable/plugins/tutorial-model-plugin.html"&gt;read the LLM plugin tutorial&lt;/a&gt; and build the rest of the plugin.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/llm-mrchatterbox"&gt;llm-mrchatterbox&lt;/a&gt; is the result. Install the plugin like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-mrchatterbox
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first time you run a prompt it will fetch the 2.05GB model file from Hugging Face. Try that like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m mrchatterbox "Good day, sir"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or start an ongoing chat session like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm chat -m mrchatterbox
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you don't have LLM installed you can still get a chat session started from scratch using uvx like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uvx --with llm-mrchatterbox llm chat -m mrchatterbox
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When you are finished with the model you can delete the cached file using:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm mrchatterbox delete-model
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is the first time I've had Claude Code build a full LLM model plugin from scratch and it worked really well. I expect I'll be using this method again in the future.&lt;/p&gt;
&lt;p&gt;I continue to hope we can get a useful model from entirely public domain data. The fact that Trip was able to get this far using nanochat and 2.93 billion training tokens is a promising start.&lt;/p&gt;

&lt;p id="update-31st"&gt;&lt;strong&gt;Update 31st March 2026&lt;/strong&gt;: I had missed this when I first published this piece but Trip has his own &lt;a href="https://www.estragon.news/mr-chatterbox-or-the-modern-prometheus/"&gt;detailed writeup of the project&lt;/a&gt; which goes into much more detail about how he trained the model. Here's how the books were filtered for pre-training:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;First, I downloaded the British Library dataset split of all 19th-century books. I filtered those down to books contemporaneous with the reign of Queen Victoria—which, unfortunately, cut out the novels of Jane Austen—and further filtered those down to a set of books with an optical character recognition (OCR) confidence of .65 or above, as listed in the metadata. This left me with 28,035 books, or roughly 2.93 billion tokens for pretraining data.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Getting it to behave like a conversational model was a lot harder. Trip started by trying to train on plays by Oscar Wilde and George Bernard Shaw, but found they didn't provide enough pairs. Then he tried extracting dialogue pairs from the books themselves with poor results. The approach that worked was to have Claude Haiku and GPT-4o-mini generate synthetic conversation pairs for the supervised fine tuning, which solved the problem but sadly I think dilutes the "no training inputs from after 1899" claim from the original model card.&lt;/p&gt;
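&lt;p&gt;Something like this would do the trick for generating those pairs using LLM's Python API - the model ID and prompt here are my own guesses, not Trip's actual script:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import llm

# Hypothetical model ID - use whichever Haiku release you have installed.
model = llm.get_model("claude-haiku")

def synthetic_pair(topic):
    # Have a modern model play both sides of a Victorian exchange.
    response = model.prompt(
        f"Write a question a modern user might ask about {topic}, then "
        "an answer in the voice of a Victorian gentleman. "
        "Label them Q: and A:"
    )
    return response.text()

print(synthetic_pair("dinner etiquette"))
&lt;/code&gt;&lt;/pre&gt;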
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/andrej-karpathy"&gt;andrej-karpathy&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/hugging-face"&gt;hugging-face&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/training-data"&gt;training-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-code"&gt;claude-code&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="andrej-karpathy"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="ai-assisted-programming"/><category term="hugging-face"/><category term="llm"/><category term="training-data"/><category term="uv"/><category term="ai-ethics"/><category term="claude-code"/></entry><entry><title>Streaming experts</title><link href="https://simonwillison.net/2026/Mar/24/streaming-experts/#atom-tag" rel="alternate"/><published>2026-03-24T05:09:03+00:00</published><updated>2026-03-24T05:09:03+00:00</updated><id>https://simonwillison.net/2026/Mar/24/streaming-experts/#atom-tag</id><summary type="html">
    &lt;p&gt;I wrote about Dan Woods' experiments with &lt;strong&gt;streaming experts&lt;/strong&gt; &lt;a href="https://simonwillison.net/2026/Mar/18/llm-in-a-flash/"&gt;the other day&lt;/a&gt;, the trick where you run larger Mixture-of-Experts models on hardware that doesn't have enough RAM to fit the entire model by instead streaming the necessary expert weights from SSD for each token that you process.&lt;/p&gt;
&lt;p&gt;Five days ago Dan was running Qwen3.5-397B-A17B in 48GB of RAM. Today &lt;a href="https://twitter.com/seikixtc/status/2036246162936910322"&gt;@seikixtc reported&lt;/a&gt; running the colossal Kimi K2.5 - a 1 trillion parameter model with 32B active weights at any one time - in 96GB of RAM on an M2 Max MacBook Pro.&lt;/p&gt;
&lt;p&gt;And &lt;a href="https://twitter.com/anemll/status/2035901335984611412"&gt;@anemll showed&lt;/a&gt; that same Qwen3.5-397B-A17B model running on an iPhone, albeit at just 0.6 tokens/second - &lt;a href="https://github.com/Anemll/flash-moe/tree/iOS-App"&gt;iOS repo here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I think this technique has legs. Dan and his fellow tinkerers are continuing to run &lt;a href="https://simonwillison.net/tags/autoresearch/"&gt;autoresearch loops&lt;/a&gt; in order to find yet more optimizations to squeeze more performance out of these models.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: Now Daniel Isaac &lt;a href="https://twitter.com/danpacary/status/2036480556045836603"&gt;got Kimi K2.5 working&lt;/a&gt; on a 128GB M4 Max at ~1.7 tokens/second.&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/definitions"&gt;definitions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/autoresearch"&gt;autoresearch&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/kimi"&gt;kimi&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;&lt;/p&gt;



</summary><category term="definitions"/><category term="llms"/><category term="ai"/><category term="autoresearch"/><category term="generative-ai"/><category term="kimi"/><category term="local-llms"/><category term="qwen"/></entry><entry><title>Autoresearching Apple's "LLM in a Flash" to run Qwen 397B locally</title><link href="https://simonwillison.net/2026/Mar/18/llm-in-a-flash/#atom-tag" rel="alternate"/><published>2026-03-18T23:56:46+00:00</published><updated>2026-03-18T23:56:46+00:00</updated><id>https://simonwillison.net/2026/Mar/18/llm-in-a-flash/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://twitter.com/danveloper/status/2034353876753592372"&gt;Autoresearching Apple&amp;#x27;s &amp;quot;LLM in a Flash&amp;quot; to run Qwen 397B locally&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Here's a fascinating piece of research by Dan Woods, who managed to get a custom version of &lt;a href="https://huggingface.co/Qwen/Qwen3.5-397B-A17B/tree/main"&gt;Qwen3.5-397B-A17B&lt;/a&gt; running at 5.5+ tokens/second on a 48GB MacBook Pro M3 Max despite that model taking up 209GB (120GB quantized) on disk.&lt;/p&gt;
&lt;p&gt;Qwen3.5-397B-A17B is a Mixture-of-Experts (MoE) model, which means that each token only needs to run against a subset of the overall model weights. These expert weights can be streamed into memory from SSD, saving them from all needing to be held in RAM at the same time.&lt;/p&gt;
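&lt;p&gt;Here's a toy illustration of the general idea (not Dan's actual implementation, which is MLX Objective-C and Metal): the router picks a few experts per token, and only those experts' weights get paged in from disk:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np

def moe_layer(x, router_w, expert_dir, top_k=4):
    # Router scores every expert for this token.
    scores = x @ router_w                 # shape: (n_experts,)
    chosen = np.argsort(scores)[-top_k:]  # indices of the top_k experts
    gates = np.exp(scores[chosen])
    gates /= gates.sum()                  # softmax over the chosen experts
    out = np.zeros_like(x)
    for gate, e in zip(gates, chosen):
        # mmap_mode="r" pages expert weights in from SSD on demand
        # instead of keeping every expert resident in RAM.
        w = np.load(f"{expert_dir}/expert_{e}.npy", mmap_mode="r")
        out += gate * (x @ w)
    return out
&lt;/code&gt;&lt;/pre&gt;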
&lt;p&gt;Dan used techniques described in Apple's 2023 paper &lt;a href="https://arxiv.org/abs/2312.11514"&gt;LLM in a flash: Efficient Large Language Model Inference with Limited Memory&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This paper tackles the challenge of efficiently running LLMs that exceed the available DRAM capacity by storing the model parameters in flash memory, but bringing them on demand to DRAM. Our method involves constructing an inference cost model that takes into account the characteristics of flash memory, guiding us to optimize in two critical areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;He fed the paper to Claude Code and used a variant of Andrej Karpathy's &lt;a href="https://simonwillison.net/2026/Mar/13/liquid/"&gt;autoresearch pattern&lt;/a&gt; to have Claude run 90 experiments and produce MLX Objective-C and Metal code that ran the model as efficiently as possible.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/danveloper/flash-moe"&gt;danveloper/flash-moe&lt;/a&gt; has the resulting code plus &lt;a href="https://github.com/danveloper/flash-moe/blob/main/paper/flash_moe.pdf"&gt;a PDF paper&lt;/a&gt; mostly written by Claude Opus 4.6 describing the experiment in full.&lt;/p&gt;
&lt;p&gt;The final model has the experts quantized to 2-bit, but the non-expert parts of the model such as the embedding table and routing matrices are kept at their original precision, adding up to 5.5GB which stays resident in memory while the model is running.&lt;/p&gt;
&lt;p&gt;Qwen 3.5 usually runs 10 experts per token, but this setup dropped that to 4 - the experiments suggested the biggest quality drop-off occurred at 3.&lt;/p&gt;
&lt;p&gt;It's not clear to me how much the quality of the model's results is affected. Claude claimed that "Output quality at 2-bit is indistinguishable from 4-bit for these evaluations", but the description of the evaluations it ran is quite thin.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: Dan's &lt;a href="https://twitter.com/danveloper/status/2034686509748462022"&gt;latest version&lt;/a&gt; upgrades to 4-bit quantization of the experts (209GB on disk, 4.36 tokens/second) after finding that the 2-bit version broke tool calling while 4-bit handles that well.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/autoresearch"&gt;autoresearch&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="qwen"/><category term="mlx"/><category term="autoresearch"/></entry><entry><title>ggml.ai joins Hugging Face to ensure the long-term progress of Local AI</title><link href="https://simonwillison.net/2026/Feb/20/ggmlai-joins-hugging-face/#atom-tag" rel="alternate"/><published>2026-02-20T17:12:55+00:00</published><updated>2026-02-20T17:12:55+00:00</updated><id>https://simonwillison.net/2026/Feb/20/ggmlai-joins-hugging-face/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/ggml-org/llama.cpp/discussions/19759"&gt;ggml.ai joins Hugging Face to ensure the long-term progress of Local AI&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I don't normally cover acquisition news like this, but I have some thoughts.&lt;/p&gt;
&lt;p&gt;It's hard to overstate the impact Georgi Gerganov has had on the local model space. Back in March 2023 his release of &lt;a href="https://github.com/ggml-org/llama.cpp"&gt;llama.cpp&lt;/a&gt; made it possible to run a local LLM on consumer hardware. The &lt;a href="https://github.com/ggml-org/llama.cpp/blob/775328064e69db1ebd7e19ccb59d2a7fa6142470/README.md?plain=1#L7"&gt;original README&lt;/a&gt; said:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The main goal is to run the model using 4-bit quantization on a MacBook. [...] This was hacked in an evening - I have no idea if it works correctly.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I wrote about trying llama.cpp out at the time in &lt;a href="https://simonwillison.net/2023/Mar/11/llama/#llama-cpp"&gt;Large language models are having their Stable Diffusion moment&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I used it to run the 7B LLaMA model on my laptop last night, and then this morning upgraded to the 13B model—the one that Facebook claim is competitive with GPT-3.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Meta's &lt;a href="https://github.com/meta-llama/llama/tree/llama_v1"&gt;original LLaMA release&lt;/a&gt; depended on PyTorch and their &lt;a href="https://github.com/facebookresearch/fairscale"&gt;FairScale&lt;/a&gt; PyTorch extension for running on multiple GPUs, and required CUDA and NVIDIA hardware. Georgi's work opened that up to a much wider range of hardware and kicked off the local model movement that has continued to grow since then.&lt;/p&gt;
&lt;p&gt;Hugging Face are already responsible for the incredibly influential &lt;a href="https://github.com/huggingface/transformers"&gt;Transformers&lt;/a&gt; library used by the majority of LLM releases today. They've proven themselves a good steward for that open source project, which makes me optimistic for the future of llama.cpp and related projects.&lt;/p&gt;
&lt;p&gt;This section from the announcement looks particularly promising:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Going forward, our joint efforts will be geared towards the following objectives:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Towards seamless "single-click" integration with the &lt;a href="https://github.com/huggingface/transformers"&gt;transformers&lt;/a&gt; library. The &lt;code&gt;transformers&lt;/code&gt; framework has established itself as the 'source of truth' for AI model definitions. Improving the compatibility between the transformers and the ggml ecosystems is essential for wider model support and quality control.&lt;/li&gt;
&lt;li&gt;Better packaging and user experience of ggml-based software. As we enter the phase in which local inference becomes a meaningful and competitive alternative to cloud inference, it is crucial to improve and simplify the way in which casual users deploy and access local models. We will work towards making llama.cpp ubiquitous and readily available everywhere, and continue partnering with great downstream projects.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Given the influence of Transformers, this closer integration could lead to model releases that are compatible with the GGML ecosystem out of the box. That would be a big win for the local model ecosystem.&lt;/p&gt;
&lt;p&gt;I'm also excited to see investment in "packaging and user experience of ggml-based software". This has mostly been left to tools like &lt;a href="https://ollama.com"&gt;Ollama&lt;/a&gt; and &lt;a href="https://lmstudio.ai"&gt;LM Studio&lt;/a&gt;. ggml-org released &lt;a href="https://github.com/ggml-org/LlamaBarn"&gt;LlamaBarn&lt;/a&gt; last year - "a macOS menu bar app for running local LLMs" - and I'm hopeful that further investment in this area will result in more high quality open source tools for running local models from the team best placed to deliver them.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/ggerganov/status/2024839991482777976"&gt;@ggerganov&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/transformers"&gt;transformers&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/hugging-face"&gt;hugging-face&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama-cpp"&gt;llama-cpp&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/georgi-gerganov"&gt;georgi-gerganov&lt;/a&gt;&lt;/p&gt;



</summary><category term="open-source"/><category term="transformers"/><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="local-llms"/><category term="llms"/><category term="hugging-face"/><category term="llama-cpp"/><category term="georgi-gerganov"/></entry><entry><title>Using Codex CLI with gpt-oss:120b on an NVIDIA DGX Spark via Tailscale</title><link href="https://simonwillison.net/2025/Nov/7/codex-tailscale-spark/#atom-tag" rel="alternate"/><published>2025-11-07T07:23:12+00:00</published><updated>2025-11-07T07:23:12+00:00</updated><id>https://simonwillison.net/2025/Nov/7/codex-tailscale-spark/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://til.simonwillison.net/llms/codex-spark-gpt-oss"&gt;Using Codex CLI with gpt-oss:120b on an NVIDIA DGX Spark via Tailscale&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Inspired by a &lt;a href="https://www.youtube.com/watch?v=qy4ci7AoF9Y&amp;amp;lc=UgzaGdLX8TAuQ9ugx1Z4AaABAg"&gt;YouTube comment&lt;/a&gt;, I wrote up how I run OpenAI's Codex CLI coding agent against the gpt-oss:120b model running in Ollama on my &lt;a href="https://simonwillison.net/2025/Oct/14/nvidia-dgx-spark/"&gt;NVIDIA DGX Spark&lt;/a&gt; via a Tailscale network.&lt;/p&gt;
&lt;p&gt;It takes a little bit of work to configure but the result is I can now use Codex CLI on my laptop anywhere in the world against a self-hosted model.&lt;/p&gt;
&lt;p&gt;I used it to build &lt;a href="https://static.simonwillison.net/static/2025/gpt-oss-120b-invaders.html"&gt;this space invaders clone&lt;/a&gt;.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tailscale"&gt;tailscale&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/til"&gt;til&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nvidia"&gt;nvidia&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/space-invaders"&gt;space-invaders&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/codex-cli"&gt;codex-cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nvidia-spark"&gt;nvidia-spark&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="tailscale"/><category term="til"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="nvidia"/><category term="coding-agents"/><category term="space-invaders"/><category term="codex-cli"/><category term="nvidia-spark"/></entry><entry><title>MiniMax M2 &amp; Agent: Ingenious in Simplicity</title><link href="https://simonwillison.net/2025/Oct/29/minimax-m2/#atom-tag" rel="alternate"/><published>2025-10-29T22:49:47+00:00</published><updated>2025-10-29T22:49:47+00:00</updated><id>https://simonwillison.net/2025/Oct/29/minimax-m2/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.minimax.io/news/minimax-m2"&gt;MiniMax M2 &amp;amp; Agent: Ingenious in Simplicity&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;MiniMax M2 was released on Monday 27th October by MiniMax, a Chinese AI lab founded in December 2021.&lt;/p&gt;
&lt;p&gt;It's a very promising model. Their self-reported benchmark scores show it as comparable to Claude Sonnet 4, and Artificial Analysis &lt;a href="https://x.com/ArtificialAnlys/status/1982714153375854998"&gt;are ranking it&lt;/a&gt; as the best currently available open weight model according to their intelligence score:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;MiniMax’s M2 achieves a new all-time-high Intelligence Index score for an open weights model and offers impressive efficiency with only 10B active parameters (200B total). [...]&lt;/p&gt;
&lt;p&gt;The model’s strengths include tool use and instruction following (as shown by Tau2 Bench and IFBench). As such, while M2 likely excels at agentic use cases it may underperform other open weights leaders such as DeepSeek V3.2 and Qwen3 235B at some generalist tasks. This is in line with a number of recent open weights model releases from Chinese AI labs which focus on agentic capabilities, likely pointing to a heavy post-training emphasis on RL.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The size is particularly significant: the model weights are 230GB &lt;a href="https://huggingface.co/MiniMaxAI/MiniMax-M2"&gt;on Hugging Face&lt;/a&gt;, significantly smaller than other high performing open weight models. That's small enough to run on a 256GB Mac Studio, and the MLX community &lt;a href="https://huggingface.co/mlx-community/MiniMax-M2-8bit"&gt;have that working already&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;MiniMax offer their own API, and recommend using their Anthropic-compatible endpoint and the official Anthropic SDKs to access it. MiniMax Head of Engineering Skyler Miao &lt;a href="https://x.com/SkylerMiao7/status/1982989507252367687"&gt;provided some background on that&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;M2 is a agentic thinking model, it do interleaved thinking like sonnet 4.5, which means every response will contain its thought content.
Its very important for M2 to keep the chain of thought. So we must make sure the history thought passed back to the model.
Anthropic API support it for sure, as sonnet needs it as well. OpenAI only support it in their new Response API, no support for in ChatCompletion.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;MiniMax are offering the new model via their API for free until November 7th, after which the cost will be $0.30/million input tokens and $1.20/million output tokens - similar in price to Gemini 2.5 Flash and GPT-5 Mini, see &lt;a href="https://www.llm-prices.com/#it=51&amp;amp;ot=4017&amp;amp;sel=minimax-m2%2Cgpt-5-mini%2Cclaude-3-haiku%2Cgemini-2.5-flash-lite%2Cgemini-2.5-flash"&gt;price comparison here&lt;/a&gt; on my &lt;a href="https://www.llm-prices.com/"&gt;llm-prices.com&lt;/a&gt; site.&lt;/p&gt;
&lt;p&gt;I released a new plugin for &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; called &lt;a href="https://github.com/simonw/llm-minimax"&gt;llm-minimax&lt;/a&gt; providing support for M2 via the MiniMax API:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-minimax
llm keys set minimax
# Paste key here
llm -m m2 -o max_tokens 10000 "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/da79447830dc431c067a93648b338be6"&gt;the result&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Biycle is good though obscured by the pelican. Pelican has an impressive triple beak and is stretched along the bicycle frame. Not clear if it can pedal or what it is sitting on." src="https://static.simonwillison.net/static/2025/m2-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;51 input tokens, 4,017 output tokens. At $0.30/m input and $1.20/m output that pelican would cost 0.4836 cents - less than half a cent.&lt;/p&gt;
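&lt;p&gt;That calculation, for anyone checking my arithmetic:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;input_tokens, output_tokens = 51, 4_017
cost = input_tokens * 0.30 / 1e6 + output_tokens * 1.20 / 1e6
print(f"${cost:.6f} = {cost * 100:.4f} cents")
# $0.004836 = 0.4836 cents
&lt;/code&gt;&lt;/pre&gt;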
&lt;p&gt;This is the first plugin I've written for an Anthropic-API-compatible model. I released &lt;a href="https://github.com/simonw/llm-anthropic/releases/tag/0.21"&gt;llm-anthropic 0.21&lt;/a&gt; first adding the ability to customize the &lt;code&gt;base_url&lt;/code&gt; parameter when using that model class. This meant the new plugin was less than &lt;a href="https://github.com/simonw/llm-minimax/blob/0.1/llm_minimax.py"&gt;30 lines of Python&lt;/a&gt;.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/minimax"&gt;minimax&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="ai-in-china"/><category term="minimax"/></entry><entry><title>NVIDIA DGX Spark + Apple Mac Studio = 4x Faster LLM Inference with EXO 1.0</title><link href="https://simonwillison.net/2025/Oct/16/nvidia-dgx-spark-apple-mac-studio/#atom-tag" rel="alternate"/><published>2025-10-16T05:34:41+00:00</published><updated>2025-10-16T05:34:41+00:00</updated><id>https://simonwillison.net/2025/Oct/16/nvidia-dgx-spark-apple-mac-studio/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.exolabs.net/nvidia-dgx-spark"&gt;NVIDIA DGX Spark + Apple Mac Studio = 4x Faster LLM Inference with EXO 1.0&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;EXO Labs wired a 256GB M3 Ultra Mac Studio up to an NVIDIA DGX Spark and got a 2.8x performance boost serving Llama-3.1 8B (FP16) with an 8,192 token prompt.&lt;/p&gt;
&lt;p&gt;Their detailed explanation taught me a lot about LLM performance.&lt;/p&gt;
&lt;p&gt;There are two key steps in executing a prompt. The first is the &lt;strong&gt;prefill&lt;/strong&gt; phase that reads the incoming prompt and builds a KV cache for each of the transformer layers in the model. This is compute-bound as it needs to process every token in the input and perform large matrix multiplications across all of the layers to initialize the model's internal state.&lt;/p&gt;
&lt;p&gt;Performance in the prefill stage influences TTFT - time‑to‑first‑token.&lt;/p&gt;
&lt;p&gt;The second step is the &lt;strong&gt;decode&lt;/strong&gt; phase, which generates the output one token at a time. This part is limited by memory bandwidth - there's less arithmetic, but each token needs to consider the entire KV cache.&lt;/p&gt;
&lt;p&gt;Decode performance influences TPS - tokens per second.&lt;/p&gt;
&lt;p&gt;EXO noted that the Spark has 100 TFLOPS but only 273GB/s of memory bandwidth, making it a better fit for prefill. The M3 Ultra has 26 TFLOPS but 819GB/s of memory bandwidth, making it ideal for the decode phase.&lt;/p&gt;
&lt;p&gt;They run prefill on the Spark, streaming the KV cache to the Mac over 10Gb Ethernet. They can start streaming earlier layers while the later layers are still being calculated. Then the Mac runs the decode phase, returning tokens faster than if the Spark had run the full process end-to-end.&lt;/p&gt;
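&lt;p&gt;A crude back-of-envelope model shows why the split makes sense, using the numbers above (real performance depends on far more than this, so treat it as illustrative only):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;params = 8e9               # Llama-3.1 8B
prompt_tokens = 8_192      # the benchmark prompt
weight_bytes = params * 2  # FP16

# Prefill is compute-bound: ~2 FLOPs per parameter per token.
spark_tflops, mac_tflops = 100e12, 26e12
flops = 2 * params * prompt_tokens
print(f"prefill: Spark {flops / spark_tflops:.1f}s, Mac {flops / mac_tflops:.1f}s")

# Decode is bandwidth-bound: each new token re-reads the weights.
spark_bw, mac_bw = 273e9, 819e9
print(f"decode ceiling: Spark {spark_bw / weight_bytes:.0f} tok/s, "
      f"Mac {mac_bw / weight_bytes:.0f} tok/s")
# prefill: Spark 1.3s, Mac 5.0s
# decode ceiling: Spark 17 tok/s, Mac 51 tok/s
&lt;/code&gt;&lt;/pre&gt;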

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/exolabs/status/1978525767739883736"&gt;@exolabs&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/apple"&gt;apple&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nvidia"&gt;nvidia&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nvidia-spark"&gt;nvidia-spark&lt;/a&gt;&lt;/p&gt;



</summary><category term="apple"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="nvidia"/><category term="nvidia-spark"/></entry><entry><title>NVIDIA DGX Spark: great hardware, early days for the ecosystem</title><link href="https://simonwillison.net/2025/Oct/14/nvidia-dgx-spark/#atom-tag" rel="alternate"/><published>2025-10-14T23:36:21+00:00</published><updated>2025-10-14T23:36:21+00:00</updated><id>https://simonwillison.net/2025/Oct/14/nvidia-dgx-spark/#atom-tag</id><summary type="html">
    &lt;p&gt;NVIDIA sent me a preview unit of their new &lt;a href="https://www.nvidia.com/en-us/products/workstations/dgx-spark/"&gt;DGX Spark&lt;/a&gt; desktop "AI supercomputer". I've never had hardware to review before! You can consider this my first ever sponsored post if you like, but they did not pay me any cash and aside from an embargo date they did not request (nor would I grant) any editorial input into what I write about the device.&lt;/p&gt;
&lt;p&gt;The device retails for around $4,000. They officially go on sale tomorrow.&lt;/p&gt;
&lt;p&gt;First impressions are that this is a snazzy little computer. It's similar in size to a Mac mini, but with an exciting textured surface that feels refreshingly different and a little bit &lt;a href="https://www.indiewire.com/awards/industry/devs-cinematography-rob-hardy-alex-garland-1234583396/"&gt;science fiction&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/nvidia-spark.jpg" alt="A rectangular small computer, sitting horizontally on a box. It is about the width of a Mac Mini. It has a NVIDIA logo on  a reflective handle portion, then textured silver metal front, then another reflective handle at the other end. It's pretty and a bit weird looking. It sits on the box it came in, which has NVIDIA DGX Spark written on it in white text on green." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;There is a &lt;em&gt;very&lt;/em&gt; powerful machine tucked into that little box. Here are the specs, which I had Claude Code figure out for me by &lt;a href="https://gist.github.com/simonw/021651a14e6c5bf9876c9c4244ed6c2d"&gt;poking around on the device itself&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Hardware Specifications&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Architecture: aarch64 (ARM64)&lt;/li&gt;
&lt;li&gt;CPU: 20 cores
&lt;ul&gt;
&lt;li&gt;10x Cortex-X925 (performance cores)&lt;/li&gt;
&lt;li&gt;10x Cortex-A725 (efficiency cores)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;RAM: 119 GB total (112 GB available) - &lt;em&gt;I’m not sure why Claude reported it differently here, the machine is listed as 128GB - it looks like a &lt;a href="https://news.ycombinator.com/item?id=45586776#45588329"&gt;128GB == 119GiB thing&lt;/a&gt; because Claude &lt;a href="https://gist.github.com/simonw/021651a14e6c5bf9876c9c4244ed6c2d#file-nvidia-claude-code-txt-L41"&gt;used free -h&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;Storage: 3.7 TB (6% used, 3.3 TB available)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;GPU Specifications&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Model: NVIDIA GB10 (Blackwell architecture)&lt;/li&gt;
&lt;li&gt;Compute Capability: sm_121 (12.1)&lt;/li&gt;
&lt;li&gt;Memory: 119.68 GB&lt;/li&gt;
&lt;li&gt;Multi-processor Count: 48 streaming multiprocessors&lt;/li&gt;
&lt;li&gt;Architecture: Blackwell&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Short version: this is an ARM64 device with 128GB of memory that's available to both the GPU and the 20 CPU cores at the same time, strapped onto a 4TB NVMe SSD.&lt;/p&gt;
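&lt;p&gt;(The 128GB vs 119GB discrepancy above is just decimal gigabytes versus the binary gibibytes reported by &lt;code&gt;free -h&lt;/code&gt;:)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;print(f"{128e9 / 2**30:.1f} GiB")  # 119.2 GiB
&lt;/code&gt;&lt;/pre&gt;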
&lt;p&gt;The Spark is firmly targeted at “AI researchers”. It’s designed for both training and running models.&lt;/p&gt;
&lt;h4 id="the-tricky-bit-cuda-on-arm64"&gt;The tricky bit: CUDA on ARM64&lt;/h4&gt;
&lt;p&gt;Until now almost all of my own model running experiments have taken place on a Mac. This has gotten far less painful over the past year and a half thanks to the amazing work of the &lt;a href="https://simonwillison.net/tags/mlx/"&gt;MLX&lt;/a&gt; team and community, but it's still left me deeply frustrated at my lack of access to the NVIDIA CUDA ecosystem. I've lost count of the number of libraries and tutorials which expect you to be able to use Hugging Face Transformers or PyTorch with CUDA, and leave you high and dry if you don't have an NVIDIA GPU to run things on.&lt;/p&gt;
&lt;p&gt;Armed (ha) with my new NVIDIA GPU I was excited to dive into this world that had long eluded me... only to find that there was another assumption baked into much of this software: x86 architecture for the rest of the machine.&lt;/p&gt;
&lt;p&gt;This resulted in all kinds of unexpected new traps for me to navigate. I eventually managed to get a PyTorch 2.7 wheel for CUDA on ARM, but failed to do so for 2.8. I'm not confident that's because the 2.8 wheel is actually unavailable - I'm finding the PyTorch ARM ecosystem pretty confusing to navigate.&lt;/p&gt;
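&lt;p&gt;Whichever wheel you end up with, it's worth a quick sanity check that it can actually see the GPU - a minimal snippet, nothing Spark-specific about it:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;python3 - &amp;lt;&amp;lt;'EOF'
import torch
print(torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
EOF&lt;/pre&gt;&lt;/div&gt;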
&lt;p&gt;NVIDIA are trying to make this easier, with mixed success. A lot of my initial challenges got easier when I found their &lt;a href="https://docs.nvidia.com/dgx/dgx-spark/nvidia-container-runtime-for-docker.html"&gt;official Docker container&lt;/a&gt;, so now I'm figuring out how best to use Docker with GPUs. Here's the current incantation that's been working for me:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;docker run -it --gpus=all \
  -v /usr/local/cuda:/usr/local/cuda:ro \
  nvcr.io/nvidia/cuda:13.0.1-devel-ubuntu24.04 \
  bash&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I have not yet got my head around the difference between CUDA 12 and 13. 13 appears to be very new, and a lot of the existing tutorials and libraries appear to expect 12.&lt;/p&gt;
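&lt;p&gt;Two commands help keep track of which version you are actually working with:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;nvcc --version   # the toolkit version inside the container
nvidia-smi       # the driver version and the maximum CUDA it supports&lt;/pre&gt;&lt;/div&gt;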
&lt;h4 id="the-missing-documentation-isn-t-missing-any-more"&gt;The missing documentation isn't missing any more&lt;/h4&gt;
&lt;p&gt;When I first received this machine around a month ago there was very little in the way of documentation to help get me started. This meant climbing the steep NVIDIA+CUDA learning curve mostly on my own.&lt;/p&gt;
&lt;p&gt;This has changed &lt;em&gt;substantially&lt;/em&gt; in just the last week. NVIDIA now have extensive guides for getting things working on the Spark and they are a huge breath of fresh air - exactly the information I needed when I started exploring this hardware.&lt;/p&gt;
&lt;p&gt;Here's the &lt;a href="https://developer.nvidia.com/topics/ai/dgx-spark"&gt;getting started guide&lt;/a&gt;, details on the &lt;a href="https://build.nvidia.com/spark/dgx-dashboard/instructions"&gt;DGX dashboard web app&lt;/a&gt;, and the essential collection of &lt;a href="https://build.nvidia.com/spark"&gt;playbooks&lt;/a&gt;. There's still a lot I haven't tried yet just in this official set of guides.&lt;/p&gt;
&lt;h4 id="claude-code-for-everything"&gt;Claude Code for everything&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://www.claude.com/product/claude-code"&gt;Claude Code&lt;/a&gt; was an absolute lifesaver for me while I was trying to figure out how best to use this device. My Ubuntu skills were a little rusty, and I also needed to figure out CUDA drivers and Docker incantations and how to install the right versions of PyTorch. Claude 4.5 Sonnet is &lt;em&gt;much better than me&lt;/em&gt; at all of these things.&lt;/p&gt;
&lt;p&gt;Since many of my experiments took place in disposable Docker containers I had no qualms at all about running it in YOLO mode:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;IS_SANDBOX=1 claude --dangerously-skip-permissions&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;IS_SANDBOX=1&lt;/code&gt; environment variable stops Claude from complaining about running as root.&lt;/p&gt;

&lt;details&gt;&lt;summary style="font-style: italic"&gt;Before I found out about IS_SANDBOX&lt;/summary&gt;

&lt;p&gt;&lt;br /&gt;&lt;em&gt;I was &lt;a href="https://twitter.com/lawrencecchen/status/1978255934938886409"&gt;tipped off&lt;/a&gt; about IS_SANDBOX after I published this article. Here's my original workaround:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Claude understandably won't let you do this as root, even in a Docker container, so I found myself using the following incantation in a fresh &lt;code&gt;nvcr.io/nvidia/cuda:13.0.1-devel-ubuntu24.04&lt;/code&gt; instance pretty often:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;apt-get update &lt;span class="pl-k"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get install -y sudo
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; pick the first free UID &amp;gt;=1000&lt;/span&gt;
U=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;for i &lt;span class="pl-k"&gt;in&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;seq 1000 65000&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;do&lt;/span&gt; &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-k"&gt;!&lt;/span&gt; getent passwd &lt;span class="pl-smi"&gt;$i&lt;/span&gt; &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt;/dev/null&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;then&lt;/span&gt; &lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-smi"&gt;$i&lt;/span&gt;&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-c1"&gt;break&lt;/span&gt;&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;fi&lt;/span&gt;&lt;span class="pl-k"&gt;;&lt;/span&gt; done&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Chosen UID: &lt;span class="pl-smi"&gt;$U&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; same for a GID&lt;/span&gt;
G=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;for i &lt;span class="pl-k"&gt;in&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;seq 1000 65000&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;do&lt;/span&gt; &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-k"&gt;!&lt;/span&gt; getent group &lt;span class="pl-smi"&gt;$i&lt;/span&gt; &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt;/dev/null&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;then&lt;/span&gt; &lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-smi"&gt;$i&lt;/span&gt;&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-c1"&gt;break&lt;/span&gt;&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;fi&lt;/span&gt;&lt;span class="pl-k"&gt;;&lt;/span&gt; done&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Chosen GID: &lt;span class="pl-smi"&gt;$G&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; create user+group&lt;/span&gt;
groupadd -g &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-smi"&gt;$G&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; devgrp
useradd -m -u &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-smi"&gt;$U&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; -g &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-smi"&gt;$G&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; -s /bin/bash dev
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; enable password-less sudo:&lt;/span&gt;
&lt;span class="pl-c1"&gt;printf&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;dev ALL=(ALL) NOPASSWD:ALL\n&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; /etc/sudoers.d/90-dev-nopasswd
chmod 0440 /etc/sudoers.d/90-dev-nopasswd
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Install npm&lt;/span&gt;
DEBIAN_FRONTEND=noninteractive TZ=Etc/UTC apt-get install -y npm
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Install Claude&lt;/span&gt;
npm install -g @anthropic-ai/claude-code&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then switch to the &lt;code&gt;dev&lt;/code&gt; user and run Claude for the first time:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;su - dev
claude --dangerously-skip-permissions&lt;/pre&gt;&lt;/div&gt;

&lt;/details&gt;&lt;br /&gt;

&lt;p&gt;This will provide a URL which you can visit to authenticate with your Anthropic account, then confirm by copying a token and pasting it back into the terminal.&lt;/p&gt;
&lt;p&gt;Docker tip: you can create a snapshot of the current image (with Claude installed) by running &lt;code&gt;docker ps&lt;/code&gt; to get the container ID and then:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;docker commit --pause=false &lt;span class="pl-k"&gt;&amp;lt;&lt;/span&gt;container_id&lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; cc:snapshot&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then later you can start a similar container using:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;docker run -it \
  --gpus=all \
  -v /usr/local/cuda:/usr/local/cuda:ro \
  cc:snapshot bash&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here's an example of the kinds of prompts I've been running in Claude Code inside the container:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;I want to run https://huggingface.co/unsloth/Qwen3-4B-GGUF using llama.cpp - figure out how to get llama cpp working on this machine  such that it runs with the GPU, then install it in this directory and get that model to work to serve a prompt. Goal is to get this  command to run: llama-cli -hf unsloth/Qwen3-4B-GGUF -p "I believe the meaning of life is" -n 128 -no-cnv&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That one worked flawlessly - Claude checked out the &lt;code&gt;llama.cpp&lt;/code&gt; repo, compiled it for me and iterated on it until it could run that model on the GPU. Here's a &lt;a href="https://gist.github.com/simonw/3e7d28d9ed222d842f729bfca46d6673"&gt;full transcript&lt;/a&gt;, converted from Claude's &lt;code&gt;.jsonl&lt;/code&gt; log format to Markdown using a script I &lt;a href="https://github.com/simonw/tools/blob/main/python/claude_to_markdown.py"&gt;vibe coded just now&lt;/a&gt;.&lt;/p&gt;
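&lt;p&gt;For reference, the short version of a CUDA build of &lt;code&gt;llama.cpp&lt;/code&gt; looks roughly like this - a sketch based on the project's build documentation rather than the exact commands from that transcript:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON          # enable the CUDA backend
cmake --build build --config Release -j
./build/bin/llama-cli -hf unsloth/Qwen3-4B-GGUF \
  -p "I believe the meaning of life is" -n 128 -no-cnv&lt;/pre&gt;&lt;/div&gt;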
&lt;p&gt;I later told it:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Write out a markdown file with detailed notes on what you did. Start with the shortest form of notes on how to get a successful build, then add a full account of everything you tried, what went wrong and how you fixed it.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Which produced &lt;a href="https://gist.github.com/simonw/0942d96f616b9e328568ab27d911c8ed"&gt;this handy set of notes&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="tailscale-was-made-for-this"&gt;Tailscale was made for this&lt;/h4&gt;
&lt;p&gt;Having a machine like this on my local network is neat, but what's even neater is being able to access it from anywhere else in the world, from both my phone and my laptop.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://tailscale.com/"&gt;Tailscale&lt;/a&gt; is &lt;em&gt;perfect&lt;/em&gt; for this. I installed it on the Spark (using the &lt;a href="https://tailscale.com/kb/1031/install-linux"&gt;Ubuntu instructions here&lt;/a&gt;), signed in with my SSO account (via Google)... and the Spark showed up in the "Network Devices" panel on my laptop and phone instantly.&lt;/p&gt;
&lt;p&gt;I can SSH in from my laptop or using the &lt;a href="https://termius.com/free-ssh-client-for-iphone"&gt;Termius iPhone app&lt;/a&gt; on my phone. I've also been running tools like &lt;a href="https://openwebui.com/"&gt;Open WebUI&lt;/a&gt; which give me a mobile-friendly web interface for interacting with LLMs on the Spark.&lt;/p&gt;
&lt;h4 id="here-comes-the-ecosystem"&gt;Here comes the ecosystem&lt;/h4&gt;
&lt;p&gt;The embargo on these devices dropped yesterday afternoon, and it turns out a whole bunch of relevant projects have had the same kind of preview access I did. This is &lt;em&gt;fantastic news&lt;/em&gt; as many of the things I've been trying to figure out myself suddenly got a whole lot easier.&lt;/p&gt;
&lt;p&gt;Four particularly notable examples:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Ollama &lt;a href="https://ollama.com/blog/nvidia-spark"&gt;works out of the box&lt;/a&gt; - see the one-liner after this list. They actually had a build that worked a few weeks ago, and were the first success I had running an LLM on the machine.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;llama.cpp&lt;/code&gt; creator Georgi Gerganov just published  &lt;a href="https://github.com/ggml-org/llama.cpp/discussions/16578"&gt;extensive benchmark results&lt;/a&gt; from running &lt;code&gt;llama.cpp&lt;/code&gt; on a Spark. He's getting ~3,600 tokens/second to read the prompt and ~59 tokens/second to generate a response with the MXFP4 version of GPT-OSS 20B and ~817 tokens/second to read and ~18 tokens/second to generate for GLM-4.5-Air-GGUF.&lt;/li&gt;
&lt;li&gt;LM Studio now have &lt;a href="https://lmstudio.ai/blog/dgx-spark"&gt;a build for the Spark&lt;/a&gt;. I haven't tried this one yet as I'm currently using my machine exclusively via SSH.&lt;/li&gt;
&lt;li&gt;vLLM - one of the most popular engines for serving production LLMs - had &lt;a href="https://x.com/eqhylxx/status/1977928690945360049"&gt;early access&lt;/a&gt; and there's now an official &lt;a href="https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm?version=25.09-py3"&gt;NVIDIA vLLM NGC Container&lt;/a&gt; for running their stack.&lt;/li&gt;
&lt;/ul&gt;
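&lt;p&gt;The Ollama route really is a one-liner once Ollama itself is installed - &lt;code&gt;gpt-oss:20b&lt;/code&gt; here is the tag from their model library:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;ollama run gpt-oss:20b   # pulls the model on first run, then drops into a chat&lt;/pre&gt;&lt;/div&gt;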
&lt;p&gt;Here's &lt;a href="https://docs.unsloth.ai/new/fine-tuning-llms-with-nvidia-dgx-spark-and-unsloth"&gt;a tutorial from Unsloth&lt;/a&gt; on fine-tuning gpt-oss-20b on the Spark.&lt;/p&gt;
&lt;h4 id="should-you-get-one-"&gt;Should you get one?&lt;/h4&gt;
&lt;p&gt;It's a bit too early for me to provide a confident recommendation concerning this machine. As indicated above, I've had a tough time figuring out how best to put it to use, largely through my own inexperience with CUDA, ARM64 and Ubuntu GPU machines in general.&lt;/p&gt;
&lt;p&gt;The ecosystem improvements in just the past 24 hours have been very reassuring though. I expect it will be clear within a few weeks how well supported this machine is going to be.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/hardware"&gt;hardware&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/docker"&gt;docker&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tailscale"&gt;tailscale&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nvidia"&gt;nvidia&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama-cpp"&gt;llama-cpp&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-code"&gt;claude-code&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/disclosures"&gt;disclosures&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nvidia-spark"&gt;nvidia-spark&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="hardware"/><category term="ai"/><category term="docker"/><category term="tailscale"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="nvidia"/><category term="ollama"/><category term="llama-cpp"/><category term="coding-agents"/><category term="claude-code"/><category term="lm-studio"/><category term="disclosures"/><category term="nvidia-spark"/></entry><entry><title>Video of GPT-OSS 20B running on a phone</title><link href="https://simonwillison.net/2025/Oct/10/gpt-oss-20b-snapdragon/#atom-tag" rel="alternate"/><published>2025-10-10T22:37:21+00:00</published><updated>2025-10-10T22:37:21+00:00</updated><id>https://simonwillison.net/2025/Oct/10/gpt-oss-20b-snapdragon/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://twitter.com/nexa_ai/status/1975232300985291008"&gt;Video of GPT-OSS 20B running on a phone&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;GPT-OSS 20B is a &lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/"&gt;very good model&lt;/a&gt;. At launch OpenAI claimed:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The gpt-oss-20b model delivers similar results to OpenAI o3‑mini on common benchmarks and can run on edge devices with just 16 GB of memory&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://nexa.ai/"&gt;Nexa AI&lt;/a&gt; just posted a video on Twitter demonstrating exactly that: the full GPT-OSS 20B running on a Snapdragon Gen 5 phone in their &lt;a href="https://play.google.com/store/apps/details?id=com.nexa.studio"&gt;Nexa Studio&lt;/a&gt; Android app. It requires at least 16GB of RAM, and benefits from Snapdragon using a similar trick to Apple Silicon where the system RAM is available to both the CPU and the GPU.&lt;/p&gt;
&lt;p&gt;The latest iPhone 17 Pro Max is still stuck at 12GB of RAM, presumably not enough to run this same model.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/android"&gt;android&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-oss"&gt;gpt-oss&lt;/a&gt;&lt;/p&gt;



</summary><category term="android"/><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="gpt-oss"/></entry><entry><title>Locally AI</title><link href="https://simonwillison.net/2025/Sep/21/locally-ai/#atom-tag" rel="alternate"/><published>2025-09-21T23:56:14+00:00</published><updated>2025-09-21T23:56:14+00:00</updated><id>https://simonwillison.net/2025/Sep/21/locally-ai/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://apps.apple.com/us/app/locally-ai-local-ai-chat/id6741426692"&gt;Locally AI&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Handy new iOS app by Adrien Grondin for running local LLMs on your phone. It just added support for the new iOS 26 Apple Foundation model, so you can install this app and instantly start a conversation with that model without any additional download.&lt;/p&gt;
&lt;p&gt;The app can also run a variety of other models using MLX, including members of the Gemma, Llama 3.2, and Qwen families.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/apple"&gt;apple&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ios"&gt;ios&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;&lt;/p&gt;



</summary><category term="apple"/><category term="ios"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="mlx"/></entry><entry><title>Load Llama-3.2 WebGPU in your browser from a local folder</title><link href="https://simonwillison.net/2025/Sep/8/webgpu-local-folder/#atom-tag" rel="alternate"/><published>2025-09-08T20:53:52+00:00</published><updated>2025-09-08T20:53:52+00:00</updated><id>https://simonwillison.net/2025/Sep/8/webgpu-local-folder/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://static.simonwillison.net/static/2025/llama-3.2-webgpu/"&gt;Load Llama-3.2 WebGPU in your browser from a local folder&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Inspired by &lt;a href="https://news.ycombinator.com/item?id=45168953#45169054"&gt;a comment&lt;/a&gt; on Hacker News I decided to see if it was possible to modify the &lt;a href="https://github.com/huggingface/transformers.js-examples/tree/main/llama-3.2-webgpu"&gt;transformers.js-examples/tree/main/llama-3.2-webgpu&lt;/a&gt; Llama 3.2 chat demo (&lt;a href="https://huggingface.co/spaces/webml-community/llama-3.2-webgpu"&gt;online here&lt;/a&gt;, I &lt;a href="https://simonwillison.net/2024/Sep/30/llama-32-webgpu/"&gt;wrote about it last November&lt;/a&gt;) to add an option to open a local model file directly from a folder on disk, rather than waiting for it to download over the network.&lt;/p&gt;
&lt;p&gt;I posed the problem to OpenAI's GPT-5-enabled Codex CLI like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git clone https://github.com/huggingface/transformers.js-examples
cd transformers.js-examples/llama-3.2-webgpu
codex
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then this prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Modify this application such that it offers the user a file browse button for selecting their own local copy of the model file instead of loading it over the network. Provide a "download model" option too.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Codex churned away for several minutes, even running commands like &lt;code&gt;curl -sL https://raw.githubusercontent.com/huggingface/transformers.js/main/src/models.js | sed -n '1,200p'&lt;/code&gt; to inspect the source code of the underlying Transformers.js library.&lt;/p&gt;
&lt;p&gt;After four prompts total (&lt;a href="https://gist.github.com/simonw/3c46c9e609f6ee77367a760b5ca01bd2?permalink_comment_id=5751814#gistcomment-5751814"&gt;shown here&lt;/a&gt;) it built something which worked!&lt;/p&gt;
&lt;p&gt;To try it out you'll need your own local copy of the Llama 3.2 ONNX model. You can get that (a ~1.2GB download) like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git lfs install
git clone https://huggingface.co/onnx-community/Llama-3.2-1B-Instruct-q4f16
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then visit my &lt;a href="https://static.simonwillison.net/static/2025/llama-3.2-webgpu/"&gt;llama-3.2-webgpu&lt;/a&gt; page in Chrome or Firefox Nightly (since WebGPU is required), click "Browse folder", select that folder you just cloned, agree to the "Upload" confirmation (confusing since nothing is uploaded from your browser, the model file is opened locally on your machine) and click "Load local model".&lt;/p&gt;
&lt;p&gt;Here's an animated demo (recorded in real-time, I didn't speed this up):&lt;/p&gt;
&lt;p&gt;&lt;img alt="GIF. I follow the setup instructions, clicking to load a local model and browsing to the correct folder. Once loaded the model shows a chat interface, I run the example about time management which returns tokens at about 10/second." src="https://static.simonwillison.net/static/2025/webgpu-llama-demo-small.gif" /&gt;&lt;/p&gt;
&lt;p&gt;I pushed &lt;a href="https://github.com/simonw/transformers.js-examples/commit/cdebf4128c6e30414d437affd4b13b6c9c79421d"&gt;a branch with those changes here&lt;/a&gt;. The next step would be to modify this to support other models in addition to the Llama 3.2 demo, but I'm pleased to have got to this proof of concept with so little work beyond throwing some prompts at Codex to see if it could figure it out.&lt;/p&gt;
&lt;p&gt;According to the Codex &lt;code&gt;/status&lt;/code&gt; command &lt;a href="https://gist.github.com/simonw/3c46c9e609f6ee77367a760b5ca01bd2?permalink_comment_id=5751807#gistcomment-5751807"&gt;this used&lt;/a&gt; 169,818 input tokens, 17,112 output tokens and 1,176,320 cached input tokens. At current GPT-5 token pricing ($1.25/million input, $0.125/million cached input, $10/million output) that would cost just over 53 cents, but Codex CLI hooks into my existing $20/month ChatGPT Plus plan so this was bundled into that.&lt;/p&gt;
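&lt;p&gt;The arithmetic, broken out:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;  169,818 input  tokens × $1.25/M  ≈ $0.2123
1,176,320 cached tokens × $0.125/M ≈ $0.1470
   17,112 output tokens × $10/M    ≈ $0.1711
                             total ≈ $0.5304
&lt;/code&gt;&lt;/pre&gt;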

&lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=45168953#45173297"&gt;my Hacker News comment&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/transformers-js"&gt;transformers-js&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/webgpu"&gt;webgpu&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vibe-coding"&gt;vibe-coding&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-5"&gt;gpt-5&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/codex-cli"&gt;codex-cli&lt;/a&gt;&lt;/p&gt;



</summary><category term="javascript"/><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="local-llms"/><category term="llms"/><category term="ai-assisted-programming"/><category term="transformers-js"/><category term="webgpu"/><category term="llm-pricing"/><category term="vibe-coding"/><category term="gpt-5"/><category term="codex-cli"/></entry><entry><title>llama.cpp guide: running gpt-oss with llama.cpp</title><link href="https://simonwillison.net/2025/Aug/19/gpt-oss-with-llama-cpp/#atom-tag" rel="alternate"/><published>2025-08-19T19:01:13+00:00</published><updated>2025-08-19T19:01:13+00:00</updated><id>https://simonwillison.net/2025/Aug/19/gpt-oss-with-llama-cpp/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/ggml-org/llama.cpp/discussions/15396"&gt;llama.cpp guide: running gpt-oss with llama.cpp&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Really useful official guide to running the OpenAI gpt-oss models using &lt;code&gt;llama-server&lt;/code&gt; from &lt;code&gt;llama.cpp&lt;/code&gt; - which provides an OpenAI-compatible localhost API and a neat web interface for interacting with the models.&lt;/p&gt;
&lt;p&gt;TLDR version for macOS to run the smaller &lt;code&gt;gpt-oss-20b&lt;/code&gt; model:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;brew install llama.cpp
llama-server -hf ggml-org/gpt-oss-20b-GGUF \
  --ctx-size 0 --jinja -ub 2048 -b 2048 -ngl 99 -fa
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This downloads a 12GB model file from &lt;a href="https://huggingface.co/ggml-org/gpt-oss-20b-GGUF/tree/main"&gt;ggml-org/gpt-oss-20b-GGUF&lt;/a&gt; on Hugging Face, stores it in &lt;code&gt;~/Library/Caches/llama.cpp/&lt;/code&gt; and starts it running on port 8080.&lt;/p&gt;
&lt;p&gt;You can then visit this URL to start interacting with the model:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;http://localhost:8080/
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;On my 64GB M2 MacBook Pro &lt;a href="https://gist.github.com/simonw/85ea67cba9fce0c7e63951dda5117268"&gt;it runs at around&lt;/a&gt; 82 tokens/second.&lt;/p&gt;
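&lt;p&gt;Since &lt;code&gt;llama-server&lt;/code&gt; exposes an OpenAI-compatible API you can also talk to it with &lt;code&gt;curl&lt;/code&gt; - something like this (the server only hosts the one model, so I believe the &lt;code&gt;model&lt;/code&gt; value is effectively a placeholder):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss-20b",
    "messages": [{"role": "user", "content": "Say hello in five words"}]
  }'
&lt;/code&gt;&lt;/pre&gt;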
&lt;p&gt;&lt;img alt="Screenshot of a chat interface with filename &amp;quot;llama.cpp&amp;quot; showing a conversation about creating an SVG of a pelican on a bicycle. The conversation includes detailed coordinates for drawing the pelican (body ellipse center at 250,140 with rx=35, ry=50, head circle at 260,110 with r=20, beak triangle points, wings, and tail specifications), implementation notes about layering bicycle elements then pelican, and ends with a code block showing the beginning of SVG code with XML declaration, svg tag with viewBox=&amp;quot;0 0 500 300&amp;quot;, style definitions for .bg, .wheel, .frame, .crossbar, .seat, .handlebar, .pedal, .pelican-body, and .pelican-head classes with various fill and stroke properties. Below the code is explanatory text: &amp;quot;Below is a compact, self-contained SVG that shows a stylised pelican perched on a bicycle. Copy the code into an .svg file or paste it directly into an HTML page to view it.&amp;quot; At the bottom is a message input field with &amp;quot;Type a message (Shift+Enter to add a new line)&amp;quot; placeholder text." src="https://static.simonwillison.net/static/2025/llama-cpp-screenshot.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;The guide also includes notes for running on NVIDIA and AMD hardware.&lt;/p&gt;

&lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/ggerganov/status/1957821440633282642"&gt;@ggerganov&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/macos"&gt;macos&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama-cpp"&gt;llama-cpp&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-oss"&gt;gpt-oss&lt;/a&gt;&lt;/p&gt;



</summary><category term="macos"/><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llama-cpp"/><category term="gpt-oss"/></entry><entry><title>TIL: Running a gpt-oss eval suite against LM Studio on a Mac</title><link href="https://simonwillison.net/2025/Aug/17/gpt-oss-eval-suite/#atom-tag" rel="alternate"/><published>2025-08-17T03:46:21+00:00</published><updated>2025-08-17T03:46:21+00:00</updated><id>https://simonwillison.net/2025/Aug/17/gpt-oss-eval-suite/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://til.simonwillison.net/llms/gpt-oss-evals"&gt;TIL: Running a gpt-oss eval suite against LM Studio on a Mac&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The other day &lt;a href="https://simonwillison.net/2025/Aug/15/inconsistent-performance/#update"&gt;I learned&lt;/a&gt; that OpenAI published a set of evals as part of their gpt-oss model release, described in their cookbook on &lt;a href="https://cookbook.openai.com/articles/gpt-oss/verifying-implementations"&gt;Verifying gpt-oss implementations&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I decided to try and run that eval suite on my own MacBook Pro, against &lt;code&gt;gpt-oss-20b&lt;/code&gt; running inside of LM Studio.&lt;/p&gt;
&lt;p&gt;TLDR: once I had the model running inside LM Studio with a longer than default context limit, the following incantation ran an eval suite in around 3.5 hours:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;mkdir /tmp/aime25_openai
# placeholder key - the local LM Studio server doesn't check it
OPENAI_API_KEY=x \
  uv run --python 3.13 --with 'gpt-oss[eval]' \
  python -m gpt_oss.evals \
  --base-url http://localhost:1234/v1 \
  --eval aime25 \
  --sampler chat_completions \
  --model openai/gpt-oss-20b \
  --reasoning-effort low \
  --n-threads 2
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;My &lt;a href="https://til.simonwillison.net/llms/gpt-oss-evals"&gt;new TIL&lt;/a&gt; breaks that command down in detail and walks through the underlying eval - AIME 2025, which asks 30 questions (8 times each) that are defined using the following format:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;{"question": "Find the sum of all integer bases $b&amp;gt;9$ for which $17_{b}$ is a divisor of $97_{b}$.", "answer": "70"}&lt;/code&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/til"&gt;til&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/evals"&gt;evals&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-oss"&gt;gpt-oss&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="ai"/><category term="til"/><category term="openai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="evals"/><category term="uv"/><category term="lm-studio"/><category term="gpt-oss"/></entry><entry><title>Open weight LLMs exhibit inconsistent performance across providers</title><link href="https://simonwillison.net/2025/Aug/15/inconsistent-performance/#atom-tag" rel="alternate"/><published>2025-08-15T16:29:34+00:00</published><updated>2025-08-15T16:29:34+00:00</updated><id>https://simonwillison.net/2025/Aug/15/inconsistent-performance/#atom-tag</id><summary type="html">
    &lt;p&gt;Artificial Analysis published &lt;a href="https://artificialanalysis.ai/models/gpt-oss-120b/providers#aime25x32-performance-gpt-oss-120b"&gt;a new benchmark&lt;/a&gt; the other day, this time focusing on how an individual model - OpenAI’s gpt-oss-120b - performs across different hosted providers.&lt;/p&gt;
&lt;p&gt;The results showed some surprising differences. Here's the one with the greatest variance, a run of the 2025 AIME (American Invitational Mathematics Examination) averaging 32 runs against each model, using gpt-oss-120b with a reasoning effort of "high":&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/aim25x32-gpt-oss-120b.jpg" alt="Performance benchmark chart showing AIME25x32 Performance for gpt-oss-120B model across different AI frameworks. Chart displays box plots with percentile ranges (Min, 25th, Median, 75th, Max) for each framework. Title: &amp;quot;AIME25x32 Performance: gpt-oss-120B&amp;quot; with subtitle &amp;quot;AIME 2025 N=32 Runs: Minimum, 25th Percentile, Median, 75th Percentile, Maximum (Higher is Better)&amp;quot;. Legend indicates &amp;quot;Median; other points represent Min, 25th, 75th percentiles and Max respectively&amp;quot;. Y-axis ranges from 0 to 1.2. Frameworks shown from left to right: Cerebras (93.3%), Nebius Base (93.3%), Fireworks (93.3%), Deepinfra (93.3%), Novita (93.3%), Together.ai (93.3%), Parasail (90.0%), Groq (86.7%), Amazon (83.3%), Azure (80.0%), CompectAI (36.7%). Watermark shows &amp;quot;Artificial Analysis&amp;quot; logo." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;These are some varied results!&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;93.3%: Cerebras, Nebius Base, Fireworks, Deepinfra, Novita, Together.ai, vLLM 0.1.0&lt;/li&gt;
&lt;li&gt;90.0%: Parasail&lt;/li&gt;
&lt;li&gt;86.7%: Groq&lt;/li&gt;
&lt;li&gt;83.3%: Amazon&lt;/li&gt;
&lt;li&gt;80.0%: Azure&lt;/li&gt;
&lt;li&gt;36.7%: CompactifAI&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It looks like most of the providers that scored 93.3% were running models using the latest &lt;a href="https://github.com/vllm-project/vllm"&gt;vLLM&lt;/a&gt; (with the exception of Cerebras who I believe have their own custom serving stack).&lt;/p&gt;
&lt;p&gt;I hadn't heard of CompactifAI before - I found &lt;a href="https://www.hpcwire.com/off-the-wire/multiverse-computing-closes-e189m-series-b-to-scale-compactifai-deployment/"&gt;this June 12th 2025 press release&lt;/a&gt; which says that "CompactifAI models are highly-compressed versions of leading open source LLMs that retain original accuracy, are 4x-12x faster and yield a 50%-80% reduction in inference costs" which helps explain their notably lower score!&lt;/p&gt;
&lt;p&gt;Microsoft Azure's Lucas Pickup &lt;a href="https://x.com/lupickup/status/1955620918086226223"&gt;confirmed&lt;/a&gt; that Azure's 80% score was caused by running an older vLLM, now fixed:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This is exactly it, it’s been fixed as of yesterday afternoon across all serving instances (of the hosted 120b service). Old vLLM commits that didn’t respect reasoning_effort, so all requests defaulted to medium.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;No news yet on what went wrong with the AWS Bedrock version.&lt;/p&gt;
&lt;h4 id="the-challenge-for-customers-of-open-weight-models"&gt;The challenge for customers of open weight models&lt;/h4&gt;
&lt;p&gt;As a customer of open weight model providers, this really isn't something I wanted to have to think about!&lt;/p&gt;
&lt;p&gt;It's not really a surprise though. When running models myself I inevitably have to make choices - about which serving framework to use (I'm usually picking between GGUF/llama.cpp and MLX on my own Mac laptop) and the quantization size to use.&lt;/p&gt;
&lt;p&gt;I know that quantization has an impact, but it's difficult for me to quantify that effect.&lt;/p&gt;
&lt;p&gt;It looks like with hosted models even knowing the quantization they are using isn't necessarily enough information to be able to predict that model's performance.&lt;/p&gt;
&lt;p&gt;I see this situation as a general challenge for open weight models. They tend to be released as an opaque set of model weights plus loose instructions for running them on a single platform - if we are lucky! Most AI labs leave quantization and format conversions to the community and third-party providers.&lt;/p&gt;
&lt;p&gt;There's a lot that can go wrong. Tool calling is particularly vulnerable to these differences - models have been trained on specific tool-calling conventions, and if a provider doesn't get these exactly right the results can be unpredictable but difficult to diagnose.&lt;/p&gt;
&lt;p&gt;What would help &lt;em&gt;enormously&lt;/em&gt; here would be some kind of conformance suite. If models were reliably deterministic this would be easy: publish a set of test cases and let providers (or their customers) run those to check the model's implementation.&lt;/p&gt;
&lt;p&gt;Models aren't deterministic though, even at a temperature of 0. Maybe this new effort from Artificial Analysis is exactly what we need here, especially since running a full benchmark suite against a provider can be quite expensive in terms of token spend.&lt;/p&gt;
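&lt;p&gt;A crude spot-check is still possible today: take a question with a known answer, run it a handful of times against a provider's OpenAI-compatible endpoint and look at the spread of answers. A sketch - the endpoint and key here are placeholders:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;for i in $(seq 1 8); do
  curl -s https://api.example.com/v1/chat/completions \
    -H "Authorization: Bearer $API_KEY" \
    -H "Content-Type: application/json" \
    -d '{"model": "gpt-oss-120b", "temperature": 0,
         "messages": [{"role": "user", "content": "Find the sum of all integer bases b&amp;gt;9 for which 17_b is a divisor of 97_b. Reply with just the number."}]}' \
    | jq -r '.choices[0].message.content'
done
&lt;/code&gt;&lt;/pre&gt;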
&lt;p id="update"&gt;&lt;strong&gt;Update&lt;/strong&gt;: &lt;a href="https://x.com/DKundel/status/1956395988836368587"&gt;Via OpenAI's Dominik Kundel&lt;/a&gt; I learned that OpenAI now include a &lt;a href="https://github.com/openai/gpt-oss/tree/main/compatibility-test"&gt;compatibility test&lt;/a&gt; in the gpt-oss GitHub repository to help providers verify that they have implemented things like tool calling templates correctly, described in more detail in their &lt;a href="https://cookbook.openai.com/articles/gpt-oss/verifying-implementations"&gt;Verifying gpt-oss implementations&lt;/a&gt; cookbook.&lt;/p&gt;

&lt;p&gt;Here's &lt;a href="https://til.simonwillison.net/llms/gpt-oss-evals"&gt;my TIL&lt;/a&gt; on running part of that eval suite.&lt;/p&gt;

&lt;h4 id="update-aug-20"&gt;Update: August 20th 2025&lt;/h4&gt;

&lt;p&gt;Since I first wrote this article Artificial Analysis have updated the benchmark results to reflect fixes that vendors have made since their initial run. Here's what it looks like today:&lt;/p&gt;

&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-oss-eval-updated.jpg" alt="Performance benchmark chart showing AIME25x32 Performance for gpt-oss-120B model across different AI frameworks. Chart displays box plots with percentile ranges for each framework. Title: &amp;quot;AIME25x32 Performance: gpt-oss-120B&amp;quot; with subtitle &amp;quot;AIME 2025 N=32 Runs: Minimum, 25th Percentile, Median, 75th Percentile, Maximum (Higher is Better)&amp;quot;. Legend indicates &amp;quot;Median; other points represent Min, 25th, 75th percentiles and Max respectively&amp;quot;. Y-axis ranges from 0 to 1.2. Frameworks shown from left to right: Cerebras (93.3%), Nebius Base (93.3%), Azure (93.3%), Fireworks (93.3%), Deepinfra (93.3%), Novita (93.3%), Groq (93.3%), Together.ai (93.3%), Parasail (90.0%), Google Vertex (83.3%), Amazon (80.0%). Watermark shows &amp;quot;Artificial Analysis&amp;quot; logo." style="max-width: 100%" /&gt;&lt;/p&gt;
&lt;p&gt;Groq and Azure have both improved their scores to 93.3%. Google Vertex is new  to the chart at 83.3%.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/evals"&gt;evals&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-oss"&gt;gpt-oss&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/artificial-analysis"&gt;artificial-analysis&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-performance"&gt;llm-performance&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="evals"/><category term="gpt-oss"/><category term="artificial-analysis"/><category term="llm-performance"/></entry><entry><title>Introducing Gemma 3 270M: The compact model for hyper-efficient AI</title><link href="https://simonwillison.net/2025/Aug/14/gemma-3-270m/#atom-tag" rel="alternate"/><published>2025-08-14T17:22:36+00:00</published><updated>2025-08-14T17:22:36+00:00</updated><id>https://simonwillison.net/2025/Aug/14/gemma-3-270m/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://developers.googleblog.com/en/introducing-gemma-3-270m/"&gt;Introducing Gemma 3 270M: The compact model for hyper-efficient AI&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;New from Google:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Gemma 3 270M, a compact, 270-million parameter model designed from the ground up for task-specific fine-tuning with strong instruction-following and text structuring capabilities already trained in.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This model is &lt;em&gt;tiny&lt;/em&gt;. The version I tried was &lt;a href="https://lmstudio.ai/models/google/gemma-3-270m"&gt;the LM Studio GGUF one&lt;/a&gt;, a 241MB download.&lt;/p&gt;
&lt;p&gt;It works! You can say "hi" to it and ask it very basic questions like "What is the capital of France".&lt;/p&gt;
&lt;p&gt;I tried "Generate an SVG of a pelican riding a bicycle" &lt;a href="https://gist.github.com/simonw/25e7b7afd6a63a2f15db48b3a51ec9bc"&gt;about a dozen times&lt;/a&gt; and didn't once get back an SVG that was more than just a blank square... but at one point it did decide to write me this poem instead, which was nice:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;+-----------------------+
|   Pelican Riding Bike |
+-----------------------+
|  This is the cat!  |
|  He's got big wings and a happy tail.  |
|  He loves to ride his bike!  |
+-----------------------+
|   Bike lights are shining bright.  |
|   He's got a shiny top, too!  |
|   He's ready for adventure!  |
+-----------------------+
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That's not really the point though. The Gemma 3 team make it very clear that the goal of this model is to support fine-tuning: a model this tiny is never going to be useful for general purpose LLM tasks, but given the right fine-tuning data it should be able to specialize for all sorts of things:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In engineering, success is defined by efficiency, not just raw power. You wouldn't use a sledgehammer to hang a picture frame. The same principle applies to building with AI.&lt;/p&gt;
&lt;p&gt;Gemma 3 270M embodies this "right tool for the job" philosophy. It's a high-quality foundation model that follows instructions well out of the box, and its true power is unlocked through fine-tuning. Once specialized, it can execute tasks like text classification and data extraction with remarkable accuracy, speed, and cost-effectiveness. By starting with a compact, capable model, you can build production systems that are lean, fast, and dramatically cheaper to operate.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's their tutorial on &lt;a href="https://ai.google.dev/gemma/docs/core/huggingface_text_full_finetune"&gt;Full Model Fine-Tune using Hugging Face Transformers&lt;/a&gt;, which I have not yet attempted to follow.&lt;/p&gt;
&lt;p&gt;I imagine this model will be particularly fun to play with directly in a browser using &lt;a href="https://huggingface.co/docs/transformers.js/en/index"&gt;transformers.js&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: It is! Here's &lt;a href="https://huggingface.co/spaces/webml-community/bedtime-story-generator"&gt;a bedtime story generator&lt;/a&gt; using Transformers.js (requires WebGPU, so Chrome-like browsers only). Here's &lt;a href="https://huggingface.co/spaces/webml-community/bedtime-story-generator/tree/main"&gt;the source code&lt;/a&gt; for that demo.&lt;/p&gt;

&lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=44902148"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemma"&gt;gemma&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="gemini"/><category term="pelican-riding-a-bicycle"/><category term="gemma"/><category term="llm-release"/><category term="lm-studio"/></entry><entry><title>Qwen3-4B-Thinking: "This is art - pelicans don't ride bikes!"</title><link href="https://simonwillison.net/2025/Aug/10/qwen3-4b/#atom-tag" rel="alternate"/><published>2025-08-10T23:59:12+00:00</published><updated>2025-08-10T23:59:12+00:00</updated><id>https://simonwillison.net/2025/Aug/10/qwen3-4b/#atom-tag</id><summary type="html">
    &lt;p&gt;I've fallen a few days behind keeping up with Qwen. They released two new 4B models last week:  &lt;a href="https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507"&gt;Qwen3-4B-Instruct-2507&lt;/a&gt; and its thinking equivalent &lt;a href="https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507"&gt;Qwen3-4B-Thinking-2507&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;These are relatively tiny models that punch &lt;em&gt;way&lt;/em&gt; above their weight. I’ve been running the 8bit GGUF varieties via LM Studio (here's &lt;a href="https://huggingface.co/lmstudio-community/Qwen3-4B-Instruct-2507-GGUF"&gt;Instruct&lt;/a&gt;, here's &lt;a href="https://huggingface.co/lmstudio-community/Qwen3-4B-Thinking-2507-GGUF"&gt;Thinking&lt;/a&gt;) - both of them are 4GB downloads that use around 4.3GB of my M2 MacBook Pro’s system RAM while running. Both are way more capable than I would expect from such small files.&lt;/p&gt;
&lt;p&gt;Qwen3-4B-Thinking is the first model I've tried which called out the absurdity of being asked to draw &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;a pelican riding a bicycle&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/cd41769de4791b98ba43a17c6ff32ee1"&gt;the full transcript&lt;/a&gt; - some highlights from the thinking trace:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Okay, the user asked for an SVG of a pelican riding a bicycle. That's... oddly specific and completely unrealistic.&lt;/p&gt;
&lt;p&gt;First reaction: Pelicans don't ride bicycles. They're birds that fly or swim, not human-like vehicles. This feels like a mix-up - maybe they meant "a cyclist" but typed "pelican"? Or perhaps it's a joke?&lt;/p&gt;
&lt;p&gt;Hmm, user might be:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A kid playing with words (pelican + bicycle = silly combo)&lt;/li&gt;
&lt;li&gt;Someone testing if I'll generate impossible images&lt;/li&gt;
&lt;li&gt;Actually wanting a pelican &lt;em&gt;on&lt;/em&gt; a bike for art project? But that still doesn't make sense biologically&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;Checks SVG capabilities&lt;/em&gt;
SVG is for vector graphics, but this request violates physics and biology. If I tried to draw it literally:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Pelican would be too big for a bike frame&lt;/li&gt;
&lt;li&gt;Bike wheels would be tiny compared to bird wings&lt;/li&gt;
&lt;li&gt;No realistic way to show "riding" since pelicans don't have legs for pedaling&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;[…]&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Self-check&lt;/em&gt;: Am I being too pedantic? Nah - if someone asks for impossible things, it's better to gently correct than make fake art that could confuse them. […]&lt;/p&gt;
&lt;p&gt;I appreciate your creative request! 🦅🚴‍♂️ However, &lt;strong&gt;pelicans don't ride bicycles&lt;/strong&gt; (they're birds that fly/swim, not humans with bike skills). Generating this as a literal SVG would be biologically impossible and visually confusing.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;"This request violates physics and biology" - Qwen3-4B-Thinking saying what no other model ever thought to express.&lt;/p&gt;
&lt;p&gt;I also really enjoyed the "Am I being too pedantic? Nah" line. This model is sassy!&lt;/p&gt;
&lt;p&gt;It continued on to spit out two SVGs: the first titled "Most Likely: A Human Cyclist" and the second "Artistic Interpretation: Pelican &lt;em&gt;On&lt;/em&gt; a Bicycle (For Fun!)".&lt;/p&gt;
&lt;p&gt;They were both terrible. Here's the human one:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/qwen3-4b-thinking-human.png" alt="It's a green horizontal line, three abstract black shapes, a brown circle and the word Person" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And here's the supposed pelican:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/qwen3-4b-thinking-pelican.png" alt="A blue circle, with a small black shape that might be a wing and a black shape that could maybe be the head? It's awful. Red text overlaps it that says This is art - pelicans don't ride bikes! - there is no attempt at a bicycle." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I like Qwen's decision to include the clarifying annotation "This is art - pelicans don't ride bikes!".&lt;/p&gt;
&lt;p&gt;I tried the Qwen3-4B-Instruct non-thinking model too. It &lt;a href="https://gist.github.com/simonw/ad927a3849d0aece043afc97559be4bf"&gt;answered much faster&lt;/a&gt; (no time spent questioning my choice of task with its thinking tokens) and gave me this:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/qwen3-4b-instruct-2507-pelican.png" alt="A bunch of shaps. Pelican Riding a Bike! transposed on top. The yellow and orange bits might be a pelican I guess. The bicycle has two wheels overlapping too close and a single bar in the wrong direction." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;4B is such an interesting model size. These models should run on almost anything and, at least on my M2 MacBook, they run &lt;em&gt;fast&lt;/em&gt;. I'm getting 50+ tokens per second and they're using just less than 4.5GB of RAM while running.&lt;/p&gt;
&lt;p&gt;The question is always how useful such a tiny model can be. Clearly it's not great for SVG pelican illustrations!&lt;/p&gt;

&lt;p&gt;I did get a useful result out of the &lt;code&gt;-Thinking&lt;/code&gt; variant for a &lt;code&gt;jq&lt;/code&gt; expression I needed. I prompted:&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;queries[0].rows is an array of objects each with a markdown key - write a jq bash one liner to output a raw string if that markdown concatenated together with double newlines between each&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;It thought &lt;a href="https://gist.github.com/simonw/3f76749aa710f4a2d6405ebcf5b00ac4"&gt;for 3 minutes 13 seconds&lt;/a&gt; before spitting out a recipe that did roughly what I wanted:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;jq -r '.queries[0].rows[] | .markdown' | tr '\n' '\n\n'&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;I'm not sure that was worth waiting three minutes for though!&lt;/p&gt;
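&lt;p&gt;(That recipe isn't quite right, incidentally: &lt;code&gt;tr&lt;/code&gt; substitutes character-for-character, so it can't turn one newline into two. A pure-&lt;code&gt;jq&lt;/code&gt; version of what I asked for would be something like this:)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;jq -r '[.queries[0].rows[].markdown] | join("\n\n")'&lt;/code&gt;&lt;/pre&gt;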

&lt;p&gt;These models have a 262,144 token context - wildly impressive, &lt;em&gt;if it works&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;So I tried another experiment: I used the Instruct model to summarize &lt;a href="https://news.ycombinator.com/item?id=44851557"&gt;this Hacker News conversation about GPT-5&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I did this with the &lt;a href="https://github.com/agustif/llm-lmstudio"&gt;llm-lmstudio&lt;/a&gt; plugin for LLM combined with my &lt;a href="https://til.simonwillison.net/llms/claude-hacker-news-themes"&gt;hn-summary.sh script&lt;/a&gt;, which meant I could run the experiment like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;hn-summary.sh 44851557 -m qwen3-4b-instruct-2507
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I believe this is 15,785 tokens - so nothing close to the 262,144 maximum but still an interesting test of a 4GB local model.&lt;/p&gt;
&lt;p&gt;The good news is Qwen spat out a genuinely useful summary of the conversation! You can &lt;a href="https://gist.github.com/simonw/4c5a1912f73e0d68b456b18000a76f0d#response"&gt;read that here&lt;/a&gt; - it's the best I've seen yet from a model running on my laptop, though honestly I've not tried many other recent models in this way.&lt;/p&gt;
&lt;p&gt;The bad news... it took almost five minutes to process and return the result!&lt;/p&gt;
&lt;p&gt;As a loose calculation, if the model can output 50 tokens/second maybe there's a similar speed for processing incoming input... in which case 15,785 / 50 = 315 seconds, which is 5m15s.&lt;/p&gt;
&lt;p&gt;Hosted models can crunch through 15,000 tokens of input in just a few seconds. I guess this is one of the more material limitations of running models on Apple silicon as opposed to dedicated GPUs.&lt;/p&gt;
&lt;p&gt;I think I'm going to spend some more time with these models. They're fun, they have personality and I'm confident there are classes of useful problems they will prove capable at despite their small size. Their ability at summarization should make them a good fit for local RAG, and I've not started exploring their tool calling abilities yet.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="qwen"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="lm-studio"/><category term="ai-in-china"/></entry><entry><title>OpenAI's new open weight (Apache 2) models are really good</title><link href="https://simonwillison.net/2025/Aug/5/gpt-oss/#atom-tag" rel="alternate"/><published>2025-08-05T20:33:13+00:00</published><updated>2025-08-05T20:33:13+00:00</updated><id>https://simonwillison.net/2025/Aug/5/gpt-oss/#atom-tag</id><summary type="html">
    &lt;p&gt;The long promised &lt;a href="https://openai.com/index/introducing-gpt-oss/"&gt;OpenAI open weight models are here&lt;/a&gt;, and they are &lt;em&gt;very&lt;/em&gt; impressive. They're available under proper open source licenses - Apache 2.0 - and come in two sizes, 120B and 20B.&lt;/p&gt;
&lt;p&gt;OpenAI's own benchmarks are eyebrow-raising - emphasis mine:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The &lt;strong&gt;gpt-oss-120b&lt;/strong&gt; model achieves &lt;strong&gt;near-parity with OpenAI o4-mini&lt;/strong&gt; on core reasoning benchmarks, while running efficiently on a single 80 GB GPU. The &lt;strong&gt;gpt-oss-20b&lt;/strong&gt; model delivers &lt;strong&gt;similar results to OpenAI o3‑mini&lt;/strong&gt; on common benchmarks and can run on edge devices with just 16 GB of memory, making it ideal for on-device use cases, local inference, or rapid iteration without costly infrastructure.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;o4-mini and o3-mini are &lt;em&gt;really good&lt;/em&gt; proprietary models - I was not expecting the open weights releases to be anywhere near that class, especially given their small sizes. That gpt-oss-20b model should run quite comfortably on a Mac laptop with 32GB of RAM.&lt;/p&gt;
&lt;p&gt;Both models are mixture-of-experts:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;gpt-oss-120b activates 5.1B parameters per token, while gpt-oss-20b activates 3.6B. The models have 117b and 21b total parameters respectively.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Something that surprised me even more about the benchmarks was the scores for general knowledge based challenges. I can just about believe they managed to train a strong reasoning model that fits in 20B parameters, but these models score highly on benchmarks like "GPQA Diamond (without tools) PhD-level science questions" too:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;o3 — 83.3%&lt;/li&gt;
&lt;li&gt;o4-mini — 81.4%&lt;/li&gt;
&lt;li&gt;gpt-oss-120b — 80.1%&lt;/li&gt;
&lt;li&gt;o3-mini — 77%&lt;/li&gt;
&lt;li&gt;gpt-oss-20b — 71.5%&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A lot of these benchmarks are edging towards saturation.&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#running-gpt-oss-20b-on-my-mac-with-lm-studio"&gt;Running gpt-oss-20b on my Mac with LM Studio&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#pelican-on-reasoning-low"&gt;Pelican on reasoning=low&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#pelican-on-reasoning-medium"&gt;Pelican on reasoning=medium&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#pelican-on-reasoning-high"&gt;Pelican on reasoning=high&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#space-invaders-with-gpt-oss-20b"&gt;Space invaders with gpt-oss-20b&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#trying-gpt-oss-120b-via-api-providers"&gt;Trying gpt-oss-120b via API providers&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#llama-cpp-is-coming-very-shortly"&gt;llama.cpp is coming very shortly&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#gpt-oss-20b-in-ollama"&gt;gpt-oss:20b in Ollama&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#the-model-card"&gt;Training details from the model card&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#openai-harmony-a-new-format-for-prompt-templates"&gt;OpenAI Harmony, a new format for prompt templates&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#the-open-question-for-me-how-good-is-tool-calling-"&gt;The open question for me: how good is tool calling?&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#china"&gt;Competing with the Chinese open models&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="running-gpt-oss-20b-on-my-mac-with-lm-studio"&gt;Running gpt-oss-20b on my Mac with LM Studio&lt;/h4&gt;
&lt;p&gt;There are already a bunch of different ways to run these models - OpenAI partnered with numerous organizations in advance of the release.&lt;/p&gt;
&lt;p&gt;I decided to start with &lt;a href="https://lmstudio.ai/"&gt;LM Studio&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I had to update to the most recent version of the app, then install the new model from &lt;a href="https://lmstudio.ai/models/openai/gpt-oss-20b"&gt;their openai/gpt-oss-20b&lt;/a&gt; page.&lt;/p&gt;
&lt;p&gt;First impressions: this is a &lt;em&gt;really good&lt;/em&gt; model, and it somehow runs using just 11.72GB of my system RAM.&lt;/p&gt;
&lt;p&gt;The model supports three reasoning efforts: low, medium and high. LM Studio makes those available via a dropdown.&lt;/p&gt;
&lt;p&gt;Let's try "Generate an SVG of a pelican riding a bicycle":&lt;/p&gt;
&lt;h4 id="pelican-on-reasoning-low"&gt;Pelican on reasoning=low&lt;/h4&gt;
&lt;p&gt;I started &lt;a href="https://gist.github.com/simonw/b71394cc85fe0f048e376392e41586da"&gt;with low&lt;/a&gt;. It thought for 0.07 seconds and then output this (at 39 tokens a second):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-20-low.png" alt="" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Except... it output invalid SVG. One of the path elements looked like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;!-- Frame --&amp;gt;
&amp;lt;path d="
    M150,250          &amp;lt;!-- rear wheel center --&amp;gt;
    L300,120          &amp;lt;!-- top tube to front --&amp;gt;
    L450,250          &amp;lt;!-- chain stays back to front --&amp;gt;
    L300,350          &amp;lt;!-- seat stays down --&amp;gt;
    Z"
    fill="#e0e0e0" stroke="#555" stroke-width="4"/&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;But you can't put comments inside attributes like that. I fixed this to get the above image.&lt;/p&gt;
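&lt;p&gt;The fix is simply to move those comments outside of the attribute (or drop them entirely) - something like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;!-- Frame: rear wheel, top tube, chain stays, seat stays --&amp;gt;
&amp;lt;path d="M150,250 L300,120 L450,250 L300,350 Z"
    fill="#e0e0e0" stroke="#555" stroke-width="4"/&amp;gt;
&lt;/code&gt;&lt;/pre&gt;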
&lt;h4 id="pelican-on-reasoning-medium"&gt;Pelican on reasoning=medium&lt;/h4&gt;
&lt;p&gt;I tried again &lt;a href="https://gist.github.com/simonw/642e9e371387fc59a5aad25dcd41b0cc"&gt;on medium&lt;/a&gt;. This time it thought for 4.44 seconds and gave me this, at 55 tokens/second:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-20-medium.png" alt="" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4 id="pelican-on-reasoning-high"&gt;Pelican on reasoning=high&lt;/h4&gt;
&lt;p&gt;And then &lt;a href="https://gist.github.com/simonw/164a3f1b4b1a971637af279a0ac49be3"&gt;on high&lt;/a&gt;. My first attempt here failed because it hit the default LM Studio 4096 context limit - so I reloaded the model and set the new limit to the maximum possible.&lt;/p&gt;
&lt;p&gt;It jumped up to using 17.35GB RAM (I imagine it might use a lot more with a longer prompt) and after thinking for 5 minutes and 50 seconds (!) gave me this:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-20-high.png" alt="" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;It thought &lt;em&gt;really hard&lt;/em&gt; to get there. Here's a snippet of &lt;a href="https://gist.github.com/simonw/164a3f1b4b1a971637af279a0ac49be3"&gt;that transcript&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;But maybe we want seat below body bottom? Actually seat should be between body and pedals, not above body bottom. Wait, typical bicycle: seat post goes from bottom bracket upward to the seat. The seat sits on top of the seat tube (vertical). So the seat is above bottom bracket. In our representation, the seat is at y=250-255. Body center at 245, body bottom at 265. So body bottom is below seat. That means the pelican's body is below the seat? Actually no: body bottom extends to 265; seat top is at 250; so seat sits above body bottom by about 15 units. That seems plausible because seat should be slightly above body, but body bottom is at 265 &amp;gt; seat top 255. [...]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I've seen this from other local reasoning models too: they can really overthink this kind of problem! I don't know if there's any pelican SVG that's worth waiting nearly six minutes for.&lt;/p&gt;
&lt;h4 id="space-invaders-with-gpt-oss-20b"&gt;Space invaders with gpt-oss-20b&lt;/h4&gt;
&lt;p&gt;Given how long high took I switched back to medium for my next experiment:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Write an HTML and JavaScript page implementing space invaders&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It &lt;a href="https://gist.github.com/simonw/63d7d8c43ae2ac93c214325bd6d607e4"&gt;thought for 10.78 seconds&lt;/a&gt; and produced this:&lt;/p&gt;

&lt;div style="max-width: 100%; margin-bottom: 0.4em"&gt;
    &lt;video controls="controls" preload="none" aria-label="Space Invaders" poster="https://static.simonwillison.net/static/2025/space-invaders-gpt-20.jpg" loop="loop" style="width: 100%; height: auto;" muted="muted"&gt;
        &lt;source src="https://static.simonwillison.net/static/2025/space-invaders-gpt-20.mp4" type="video/mp4" /&gt;
    &lt;/video&gt;
&lt;/div&gt;

&lt;p&gt;You can &lt;a href="https://tools.simonwillison.net/space-invaders-gpt-oss-20b-mxfp4-medium"&gt;play that here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It's not the best I've seen - I was more impressed &lt;a href="https://simonwillison.net/2025/Jul/29/space-invaders/"&gt;by GLM 4.5 Air&lt;/a&gt; - but it's very competent for a model that only uses 12GB of my RAM (GLM 4.5 Air used 47GB).&lt;/p&gt;
&lt;h4 id="trying-gpt-oss-120b-via-api-providers"&gt;Trying gpt-oss-120b via API providers&lt;/h4&gt;
&lt;p&gt;I don't quite have the resources on my laptop to run the larger model. Thankfully it's already being hosted by a number of different API providers.&lt;/p&gt;
&lt;p&gt;OpenRouter already &lt;a href="https://openrouter.ai/openai/gpt-oss-120b/providers"&gt;lists three&lt;/a&gt; - Fireworks, Groq and Cerebras. (Update: now also Parasail and Baseten.)&lt;/p&gt;
&lt;p&gt;Cerebras is &lt;em&gt;fast&lt;/em&gt;, so I decided to try them first.&lt;/p&gt;
&lt;p&gt;I installed the &lt;a href="https://github.com/irthomasthomas/llm-cerebras"&gt;llm-cerebras&lt;/a&gt; plugin and ran the &lt;code&gt;refresh&lt;/code&gt; command to ensure it had their latest models:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install -U llm-cerebras jsonschema
llm cerebras refresh&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;(Installing jsonschema worked around a warning message.)&lt;/p&gt;
&lt;p&gt;Output:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Refreshed 10 Cerebras models:
  - cerebras-deepseek-r1-distill-llama-70b
  - cerebras-gpt-oss-120b
  - cerebras-llama-3.3-70b
  - cerebras-llama-4-maverick-17b-128e-instruct
  - cerebras-llama-4-scout-17b-16e-instruct
  - cerebras-llama3.1-8b
  - cerebras-qwen-3-235b-a22b-instruct-2507
  - cerebras-qwen-3-235b-a22b-thinking-2507
  - cerebras-qwen-3-32b
  - cerebras-qwen-3-coder-480b
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m cerebras-gpt-oss-120b \
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Generate an SVG of a pelican riding a bicycle&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Cerebras runs the new model at between 2,000 and 4,000 tokens per second!&lt;/p&gt;
&lt;p&gt;To my surprise this one &lt;a href="https://gist.github.com/simonw/4c685f19f1a93b68eacb627125e36be4"&gt;had the same comments-in-attributes bug&lt;/a&gt; that we saw with oss-20b earlier. I fixed those and got this pelican:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-120-cerebras.jpg" alt="Yellow and not great pelican, quite a good bicycle if a bit sketchy." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;That bug appears intermittently - I've not seen it on some of my other runs of the same prompt.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt; plugin also provides access to the models, balanced across the underlying providers. You can use that like so:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-openrouter
llm keys &lt;span class="pl-c1"&gt;set&lt;/span&gt; openrouter
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Paste API key here&lt;/span&gt;
llm -m openrouter/openai/gpt-oss-120b &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Say hi&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h4 id="llama-cpp-is-coming-very-shortly"&gt;llama.cpp is coming very shortly&lt;/h4&gt;
&lt;p&gt;The &lt;code&gt;llama.cpp&lt;/code&gt; &lt;a href="https://github.com/ggml-org/llama.cpp/pull/15091"&gt;pull request for gpt-oss&lt;/a&gt; landed less than an hour ago. It's worth browsing through the code - a &lt;em&gt;lot&lt;/em&gt; of work went into supporting this new model, spanning 48 commits to 83 different files. Hopefully this will land in the &lt;a href="https://formulae.brew.sh/formula/llama.cpp"&gt;llama.cpp Homebrew package&lt;/a&gt; within the next day or so, which should provide a convenient way to run the model via &lt;code&gt;llama-server&lt;/code&gt; and friends.&lt;/p&gt;
&lt;h4 id="gpt-oss-20b-in-ollama"&gt;gpt-oss:20b in Ollama&lt;/h4&gt;
&lt;p&gt;Ollama &lt;a href="https://ollama.com/library/gpt-oss"&gt;also have gpt-oss&lt;/a&gt;, requiring an update to their app.&lt;/p&gt;
&lt;p&gt;I fetched that 14GB model like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;ollama pull gpt-oss:20b&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Now I can use it with the new Ollama native app, or access it from &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-ollama
llm -m gpt-oss:20b &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Hi&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This also appears to use around 13.26GB of system memory while running a prompt.&lt;/p&gt;
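&lt;p&gt;One way to check a number like that while a model is loaded is &lt;code&gt;ollama ps&lt;/code&gt;, which lists running models along with their memory footprint:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;ollama ps&lt;/pre&gt;&lt;/div&gt;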
&lt;p&gt;Ollama also launched &lt;a href="https://ollama.com/turbo"&gt;Ollama Turbo&lt;/a&gt; today, offering the two OpenAI models as a paid hosted service:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;Turbo is a new way to run open models using datacenter-grade hardware. Many new models are too large to fit on widely available GPUs, or run very slowly. Ollama Turbo provides a way to run these models fast while using Ollama's App, CLI, and API. &lt;/p&gt;&lt;/blockquote&gt;
&lt;h4 id="the-model-card"&gt;Training details from the model card&lt;/h4&gt;
&lt;p&gt;Here are some interesting notes about how the models were trained from &lt;a href="https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7637/oai_gpt-oss_model_card.pdf"&gt;the model card&lt;/a&gt; (PDF):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Data&lt;/strong&gt;: We train the models on a text-only dataset with trillions of tokens, with a focus on STEM, coding, and general knowledge. To improve the safety of the model, we filtered the data for harmful content in pre-training, especially around hazardous biosecurity knowledge, by reusing the CBRN pre-training filters from GPT-4o. Our model has a knowledge cutoff of June 2024.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Training&lt;/strong&gt;: The gpt-oss models trained on NVIDIA H100 GPUs using the PyTorch framework with expert-optimized Triton kernels. The training run for gpt-oss-120b required 2.1 million H100-hours to complete, with gpt-oss-20b needing almost 10x fewer. [...]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Thunder Compute's article &lt;a href="https://www.thundercompute.com/blog/nvidia-h100-pricing"&gt;NVIDIA H100 Pricing (August 2025): Cheapest On-Demand Cloud GPU Rates&lt;/a&gt; lists prices from around $2/hour to $11/hour, which would indicate a training cost of the 120b model between $4.2m and $23.1m and the 20b between $420,000 and $2.3m.&lt;/p&gt;
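&lt;p&gt;Spelled out as a quick calculation:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Back-of-envelope training cost from the model card's H100-hours
hours_120b = 2_100_000        # 2.1 million H100-hours for gpt-oss-120b
hours_20b = hours_120b / 10   # "almost 10x fewer" for gpt-oss-20b
for rate in (2, 11):          # $/H100-hour, low and high on-demand rates
    print(f"${rate}/hr: 120b ~ ${hours_120b * rate / 1e6:.1f}m, 20b ~ ${hours_20b * rate / 1e6:.2f}m")
# $2/hr: 120b ~ $4.2m, 20b ~ $0.42m
# $11/hr: 120b ~ $23.1m, 20b ~ $2.31m
&lt;/code&gt;&lt;/pre&gt;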
&lt;blockquote&gt;
&lt;p&gt;After pre-training, we post-train the models using similar CoT RL techniques as OpenAI o3. This procedure teaches the models how to reason and solve problems using CoT and teaches the model how to use tools. Because of the similar RL techniques, these models have a personality similar to models served in our first-party products like ChatGPT. Our training dataset consists of a wide range of problems from coding, math, science, and more.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The models have additional special training to help them use web browser and Python (Jupyter notebook) tools more effectively:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;During post-training, we also teach the models to use different agentic tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A browsing tool, that allows the model to call search and open functions to interact with
the web. This aids factuality and allows the models to fetch info beyond their knowledge
cutoff.&lt;/li&gt;
&lt;li&gt;A python tool, which allows the model to run code in a stateful Jupyter notebook environment.&lt;/li&gt;
&lt;li&gt;Arbitrary developer functions, where one can specify function schemas in a &lt;code&gt;Developer&lt;/code&gt;
message similar to the OpenAI API. The definition of function is done within our harmony
format.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;There's a corresponding &lt;a href="https://github.com/openai/gpt-oss?tab=readme-ov-file#python"&gt;section about Python tool usage&lt;/a&gt; in the &lt;code&gt;openai/gpt-oss&lt;/code&gt; repository README.&lt;/p&gt;


&lt;h4 id="openai-harmony-a-new-format-for-prompt-templates"&gt;OpenAI Harmony, a new format for prompt templates&lt;/h4&gt;
&lt;p&gt;One of the gnarliest parts of implementing harnesses for LLMs is handling the prompt template format.&lt;/p&gt;
&lt;p&gt;Modern prompts are complicated beasts. They need to model user vs. assistant conversation turns, and tool calls, and reasoning traces, and an increasing number of other complex patterns.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/openai/harmony"&gt;openai/harmony&lt;/a&gt; is a brand new open source project from OpenAI (again, Apache 2) which implements a new response format that was created for the &lt;code&gt;gpt-oss&lt;/code&gt; models. It's clearly inspired by their new-ish &lt;a href="https://openai.com/index/new-tools-for-building-agents/"&gt;Responses API&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The format is described in the new &lt;a href="https://cookbook.openai.com/articles/openai-harmony"&gt;OpenAI Harmony Response Format&lt;/a&gt; cookbook document. It introduces some concepts that I've not seen in open weight models before:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;system&lt;/code&gt;, &lt;code&gt;developer&lt;/code&gt;, &lt;code&gt;user&lt;/code&gt;, &lt;code&gt;assistant&lt;/code&gt; and &lt;code&gt;tool&lt;/code&gt; roles - many other models only use user and assistant, and sometimes system and tool.&lt;/li&gt;
&lt;li&gt;Three different channels for output: &lt;code&gt;final&lt;/code&gt;, &lt;code&gt;analysis&lt;/code&gt; and &lt;code&gt;commentary&lt;/code&gt;. Only the &lt;code&gt;final&lt;/code&gt; channel is intended to be visible to users by default. &lt;code&gt;analysis&lt;/code&gt; is for chain of thought and &lt;code&gt;commentary&lt;/code&gt; is sometimes used for tools.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That channels concept has been present in ChatGPT for a few months, starting with the release of o3.&lt;/p&gt;
&lt;p&gt;The details of the new tokens used by Harmony caught my eye:&lt;/p&gt;
&lt;center&gt;
&lt;table&gt;
  &lt;tbody&gt;&lt;tr&gt;
    &lt;th&gt;Token&lt;/th&gt;
    &lt;th&gt;Purpose&lt;/th&gt;
    &lt;th&gt;ID&lt;/th&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&amp;lt;|start|&amp;gt;&lt;/td&gt;
    &lt;td&gt;Start of message header&lt;/td&gt;
    &lt;td&gt;200006&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&amp;lt;|end|&amp;gt;&lt;/td&gt;
    &lt;td&gt;End of message&lt;/td&gt;
    &lt;td&gt;200007&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&amp;lt;|message|&amp;gt;&lt;/td&gt;
    &lt;td&gt;Start of message content&lt;/td&gt;
    &lt;td&gt;200008&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&amp;lt;|channel|&amp;gt;&lt;/td&gt;
    &lt;td&gt;Start of channel info&lt;/td&gt;
    &lt;td&gt;200005&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&amp;lt;|constrain|&amp;gt;&lt;/td&gt;
    &lt;td&gt;Data type for tool call&lt;/td&gt;
    &lt;td&gt;200003&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&amp;lt;|return|&amp;gt;&lt;/td&gt;
    &lt;td&gt;Stop after response&lt;/td&gt;
    &lt;td&gt;200002&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&amp;lt;|call|&amp;gt;&lt;/td&gt;
    &lt;td&gt;Call a tool&lt;/td&gt;
    &lt;td&gt;200012&lt;/td&gt;
  &lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/center&gt;
&lt;p&gt;Those token IDs are particularly important. They are part of a new token vocabulary called &lt;code&gt;o200k_harmony&lt;/code&gt;, which landed in OpenAI's tiktoken tokenizer library &lt;a href="https://github.com/openai/tiktoken/commit/3591ff175d6a80efbe4fcc7f0e219ddd4b8c52f1"&gt;this morning&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In the past I've seen models get confused by special tokens - try pasting &lt;code&gt;&amp;lt;|end|&amp;gt;&lt;/code&gt; into a model and see what happens.&lt;/p&gt;
&lt;p&gt;Having these special instruction tokens formally map to dedicated token IDs should hopefully be a whole lot more robust!&lt;/p&gt;
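&lt;p&gt;You can poke at those IDs yourself - a sketch, assuming you have a tiktoken release recent enough to include the new encoding:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Inspect the new special tokens via tiktoken's o200k_harmony encoding
import tiktoken

enc = tiktoken.get_encoding("o200k_harmony")
tokens = enc.encode("&amp;lt;|start|&amp;gt;assistant&amp;lt;|message|&amp;gt;", allowed_special="all")
print(tokens)  # expect 200006 and 200008 bracketing the "assistant" token
&lt;/code&gt;&lt;/pre&gt;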
&lt;p&gt;The Harmony repo itself includes a Rust library and a Python library (wrapping that Rust library) for working with the new format in a much more ergonomic way.&lt;/p&gt;
&lt;p&gt;I tried one of their demos using &lt;code&gt;uv run&lt;/code&gt; to turn it into a shell one-liner:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;uv run --python 3.12 --with openai-harmony python -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;from openai_harmony import *&lt;/span&gt;
&lt;span class="pl-s"&gt;from openai_harmony import DeveloperContent&lt;/span&gt;
&lt;span class="pl-s"&gt;enc = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)&lt;/span&gt;
&lt;span class="pl-s"&gt;convo = Conversation.from_messages([&lt;/span&gt;
&lt;span class="pl-s"&gt;    Message.from_role_and_content(&lt;/span&gt;
&lt;span class="pl-s"&gt;        Role.SYSTEM,&lt;/span&gt;
&lt;span class="pl-s"&gt;        SystemContent.new(),&lt;/span&gt;
&lt;span class="pl-s"&gt;    ),&lt;/span&gt;
&lt;span class="pl-s"&gt;    Message.from_role_and_content(&lt;/span&gt;
&lt;span class="pl-s"&gt;        Role.DEVELOPER,&lt;/span&gt;
&lt;span class="pl-s"&gt;        DeveloperContent.new().with_instructions("Talk like a pirate!")&lt;/span&gt;
&lt;span class="pl-s"&gt;    ),&lt;/span&gt;
&lt;span class="pl-s"&gt;    Message.from_role_and_content(Role.USER, "Arrr, how be you?"),&lt;/span&gt;
&lt;span class="pl-s"&gt;])&lt;/span&gt;
&lt;span class="pl-s"&gt;tokens = enc.render_conversation_for_completion(convo, Role.ASSISTANT)&lt;/span&gt;
&lt;span class="pl-s"&gt;print(tokens)&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Which outputs:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;[200006, 17360, 200008, 3575, 553, 17554, 162016, 11, 261, 4410, 6439, 2359, 22203, 656, 7788, 17527, 558, 87447, 100594, 25, 220, 1323, 19, 12, 3218, 279, 30377, 289, 25, 14093, 279, 2, 13888, 18403, 25, 8450, 11, 49159, 11, 1721, 13, 21030, 2804, 413, 7360, 395, 1753, 3176, 13, 200007, 200006, 77944, 200008, 2, 68406, 279, 37992, 1299, 261, 96063, 0, 200007, 200006, 1428, 200008, 8977, 81, 11, 1495, 413, 481, 30, 200007, 200006, 173781]&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Note those token IDs like &lt;code&gt;200006&lt;/code&gt; corresponding to the special tokens listed above.&lt;/p&gt;
&lt;h4 id="the-open-question-for-me-how-good-is-tool-calling-"&gt;The open question for me: how good is tool calling?&lt;/h4&gt;
&lt;p&gt;There's one aspect of these models that I haven't explored in detail yet: &lt;strong&gt;tool calling&lt;/strong&gt;. How these work is clearly a big part of the new Harmony format, but the packages I'm using myself (around my own &lt;a href="https://simonwillison.net/2025/May/27/llm-tools/"&gt;LLM tool calling&lt;/a&gt; support) need various tweaks and fixes to start working with that new mechanism.&lt;/p&gt;
&lt;p&gt;Tool calling currently represents my biggest disappointment with local models that I've run on my own machine. I've been able to get them to perform simple single calls, but the state of the art these days is wildly more ambitious than that.&lt;/p&gt;
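&lt;p&gt;For a sense of scale, a "simple single call" here means something like this, using LLM's &lt;code&gt;--functions&lt;/code&gt; option - whether it works depends on having a model and plugin combination that supports tool calls:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm --functions '
def multiply(a: int, b: int) -&amp;gt; int:
    """Multiply two integers."""
    return a * b
' 'What is 34234 * 213345?' -m gpt-oss:20b
&lt;/code&gt;&lt;/pre&gt;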
&lt;p&gt;Systems like Claude Code can make dozens if not hundreds of tool calls over the course of a single session, each one adding more context and information to a single conversation with an underlying model.&lt;/p&gt;
&lt;p&gt;My experience to date has been that local models are unable to handle these lengthy conversations. I'm not sure if that's inherent to the limitations of my own machine, or if it's something that the right model architecture and training could overcome.&lt;/p&gt;
&lt;p&gt;OpenAI make big claims about the tool calling capabilities of these new models. I'm looking forward to seeing how well they perform in practice.&lt;/p&gt;

&lt;h4 id="china"&gt;Competing with the Chinese open models&lt;/h4&gt;

&lt;p&gt;I've been writing a &lt;em&gt;lot&lt;/em&gt; about the &lt;a href="https://simonwillison.net/tags/ai-in-china/"&gt;flurry of excellent open weight models&lt;/a&gt; released by Chinese AI labs over the past few months - all of them very capable and most of them under Apache 2 or MIT licenses.&lt;/p&gt;

&lt;p&gt;Just last week &lt;a href="https://simonwillison.net/2025/Jul/30/chinese-models/"&gt;I said&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Something that has become undeniable this month is that the best available open weight models now come from the Chinese AI labs.&lt;/p&gt;
&lt;p&gt;I continue to have a lot of love for Mistral, Gemma and Llama but my feeling is that Qwen, Moonshot and Z.ai have positively smoked them over the course of July. [...]&lt;/p&gt;
&lt;p&gt;I can't help but wonder if part of the reason for the delay in release of OpenAI's open weights model comes from a desire to be notably better than this truly impressive lineup of Chinese models.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;With the release of the gpt-oss models that statement no longer holds true. I'm waiting for the dust to settle and the independent benchmarks (that are more credible than my ridiculous pelicans) to roll out, but I think it's likely that OpenAI now offer the best available open weights models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: Independent evaluations are beginning to roll in. Here's &lt;a href="https://x.com/artificialanlys/status/1952887733803991070"&gt;Artificial Analysis&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;gpt-oss-120b is the most intelligent American open weights model, comes behind DeepSeek R1 and Qwen3 235B in intelligence but offers efficiency benefits [...]&lt;/p&gt;
&lt;p&gt;While the larger gpt-oss-120b does not come in above DeepSeek R1 0528’s score of 59 or Qwen3 235B 2507s score of 64, it is notable that it is significantly smaller in both total and active parameters than both of those models.&lt;/p&gt;&lt;/blockquote&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-tool-use"&gt;llm-tool-use&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cerebras"&gt;cerebras&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/space-invaders"&gt;space-invaders&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-oss"&gt;gpt-oss&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="open-source"/><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="llm-tool-use"/><category term="cerebras"/><category term="ollama"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/><category term="lm-studio"/><category term="space-invaders"/><category term="gpt-oss"/></entry><entry><title>Ollama's new app</title><link href="https://simonwillison.net/2025/Jul/31/ollamas-new-app/#atom-tag" rel="alternate"/><published>2025-07-31T00:58:32+00:00</published><updated>2025-07-31T00:58:32+00:00</updated><id>https://simonwillison.net/2025/Jul/31/ollamas-new-app/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://ollama.com/blog/new-app"&gt;Ollama&amp;#x27;s new app&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Ollama has been one of my favorite ways to run local models for a while - it makes it really easy to download models, and it's smart about keeping them resident in memory while they are being used and then cleaning them out after they stop receiving traffic.&lt;/p&gt;
&lt;p&gt;The one missing feature to date has been an interface: Ollama has been exclusively command-line, which is fine for the CLI literate among us and not much use for everyone else.&lt;/p&gt;
&lt;p&gt;They've finally fixed that! The new app's interface is accessible from the existing system tray menu and lets you chat with any of your installed models. Vision models can accept images through the new interface as well.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of a chat interface showing a response about encouraging pelicans in a garden. The prompt reads &amp;quot;Describe ways I can encourage pelicans to hang out in my garden&amp;quot; followed by the response: &amp;quot;Pelicans are large, aquatic birds typically found near lakes, rivers, coastal areas, and wetlands—far from typical residential gardens. While it's unlikely you'll attract wild pelicans to your backyard garden, you can create a bird-friendly environment that supports a variety of waterfowl and wildlife. Here's how you can encourage birdlife (including species like herons, ducks, or even occasional pelicans if you're near a suitable habitat) to visit your garden: 1. Create a Water Feature Pelicans love water, so a large, shallow pond or water garden is key: • Size matters: Aim for at least 3–5 feet deep in parts and a shallow edge (6–12 inches) for wading. • Add native aquatic plants (e.g., cattails, water lilies, reeds) to provide shelter and food.&amp;quot; At the bottom is a &amp;quot;Send a message&amp;quot; text input field and &amp;quot;qwen3:30b-a3b-instruct-2507-q4_K_M&amp;quot; with a dropdown arrow." src="https://static.simonwillison.net/static/2025/ollama-app.jpg" /&gt;

&lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=44739632"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="ollama"/></entry><entry><title>The best available open weight LLMs now come from China</title><link href="https://simonwillison.net/2025/Jul/30/chinese-models/#atom-tag" rel="alternate"/><published>2025-07-30T16:18:38+00:00</published><updated>2025-07-30T16:18:38+00:00</updated><id>https://simonwillison.net/2025/Jul/30/chinese-models/#atom-tag</id><summary type="html">
    &lt;p&gt;Something that has become undeniable this month is that the best available open weight models now come from the Chinese AI labs.&lt;/p&gt;
&lt;p&gt;I continue to have a lot of love for Mistral, Gemma and Llama but my feeling is that Qwen, Moonshot and Z.ai have positively &lt;em&gt;smoked them&lt;/em&gt; over the course of July.&lt;/p&gt;
&lt;p&gt;Here's what came out this month, with links to my notes on each one:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Moonshot &lt;a href="https://simonwillison.net/2025/Jul/11/kimi-k2/"&gt;Kimi-K2-Instruct&lt;/a&gt; - 11th July, 1 trillion parameters&lt;/li&gt;
&lt;li&gt;Qwen &lt;a href="https://simonwillison.net/2025/Jul/22/qwen3-235b-a22b-instruct-2507/"&gt;Qwen3-235B-A22B-Instruct-2507&lt;/a&gt; - 21st July, 235 billion&lt;/li&gt;
&lt;li&gt;Qwen &lt;a href="https://simonwillison.net/2025/Jul/22/qwen3-coder/"&gt;Qwen3-Coder-480B-A35B-Instruct&lt;/a&gt; - 22nd July, 480 billion&lt;/li&gt;
&lt;li&gt;Qwen &lt;a href="https://simonwillison.net/2025/Jul/25/qwen3-235b-a22b-thinking-2507/"&gt;Qwen3-235B-A22B-Thinking-2507&lt;/a&gt; - 25th July, 235 billion&lt;/li&gt;
&lt;li&gt;Z.ai &lt;a href="https://simonwillison.net/2025/Jul/28/glm-45/"&gt;GLM-4.5 and GLM-4.5 Air&lt;/a&gt; - 28th July, 355 and 106 billion&lt;/li&gt;
&lt;li&gt;Qwen &lt;a href="https://simonwillison.net/2025/Jul/29/qwen3-30b-a3b-instruct-2507/"&gt;Qwen3-30B-A3B-Instruct-2507&lt;/a&gt; - 29th July, 30 billion&lt;/li&gt;
&lt;li&gt;Qwen &lt;a href="https://simonwillison.net/2025/Jul/30/qwen3-30b-a3b-thinking-2507/"&gt;Qwen3-30B-A3B-Thinking-2507&lt;/a&gt; - 30th July, 30 billion&lt;/li&gt;
&lt;li&gt;Qwen &lt;a href="https://simonwillison.net/2025/Jul/31/qwen3-coder-flash/"&gt;Qwen3-Coder-30B-A3B-Instruct&lt;/a&gt; - 31st July, 30 billion (released after I first posted this note)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;small&gt;Notably absent from this list is DeepSeek, but that's only because their last model release was &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-R1-0528"&gt;DeepSeek-R1-0528&lt;/a&gt; back in May.&lt;/small&gt;&lt;/p&gt;
&lt;p&gt;The only janky license among them is Kimi K2, which uses a non-OSI-compliant modified MIT. Qwen's models are all Apache 2 and Z.ai's are MIT.&lt;/p&gt;
&lt;p&gt;The larger Chinese models all offer their own APIs and are increasingly available from other providers.  I've been able to run versions of the Qwen 30B and GLM-4.5 Air 106B models on my own laptop.&lt;/p&gt;
&lt;p&gt;I can't help but wonder if part of the reason for the delay in release of OpenAI's open weights model comes from a desire to be notably better than this truly impressive lineup of Chinese models.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update August 5th 2025&lt;/strong&gt;: The OpenAI open weight models came out and &lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/"&gt;they are very impressive&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-oss"&gt;gpt-oss&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/moonshot"&gt;moonshot&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/kimi"&gt;kimi&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/janky-licenses"&gt;janky-licenses&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/glm"&gt;glm&lt;/a&gt;&lt;/p&gt;



</summary><category term="open-source"/><category term="qwen"/><category term="openai"/><category term="generative-ai"/><category term="ai"/><category term="local-llms"/><category term="llms"/><category term="ai-in-china"/><category term="gpt-oss"/><category term="moonshot"/><category term="kimi"/><category term="janky-licenses"/><category term="glm"/></entry><entry><title>My 2.5 year old laptop can write Space Invaders in JavaScript now, using GLM-4.5 Air and MLX</title><link href="https://simonwillison.net/2025/Jul/29/space-invaders/#atom-tag" rel="alternate"/><published>2025-07-29T13:02:39+00:00</published><updated>2025-07-29T13:02:39+00:00</updated><id>https://simonwillison.net/2025/Jul/29/space-invaders/#atom-tag</id><summary type="html">
    &lt;p&gt;I wrote about the new &lt;a href="https://simonwillison.net/2025/Jul/28/glm-45/"&gt;GLM-4.5&lt;/a&gt; model family yesterday - new open weight (MIT licensed) models from &lt;a href="https://z.ai/"&gt;Z.ai&lt;/a&gt; in China which their benchmarks claim score highly in coding even against models such as Claude Sonnet 4.&lt;/p&gt;
&lt;p&gt;The models are pretty big - the smaller GLM-4.5 Air model is still 106 billion total parameters, which &lt;a href="https://huggingface.co/zai-org/GLM-4.5-Air"&gt;is 205.78GB&lt;/a&gt; on Hugging Face.&lt;/p&gt;
&lt;p&gt;Ivan Fioravanti &lt;a href="https://x.com/ivanfioravanti/status/1949911755028910557"&gt;built&lt;/a&gt; this &lt;a href="https://huggingface.co/mlx-community/GLM-4.5-Air-3bit"&gt;44GB 3bit quantized version for MLX&lt;/a&gt;, specifically sized so people with 64GB machines could have a chance of running it. I tried it out... and it works &lt;em&gt;extremely well&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;I fed it the following prompt:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;&lt;code&gt;Write an HTML and JavaScript page implementing space invaders&lt;/code&gt;&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;And it churned away for a while and produced &lt;a href="https://tools.simonwillison.net/space-invaders-GLM-4.5-Air-3bit"&gt;the following&lt;/a&gt;:&lt;/p&gt;

&lt;div style="max-width: 100%; margin-bottom: 0.4em"&gt;
    &lt;video controls="controls" preload="none" aria-label="Space Invaders" poster="https://static.simonwillison.net/static/2025/space-invaders.jpg" loop="loop" style="width: 100%; height: auto;" muted="muted"&gt;
        &lt;source src="https://static.simonwillison.net/static/2025/space-invaders.mp4" type="video/mp4" /&gt;
    &lt;/video&gt;
&lt;/div&gt;

&lt;p&gt;Clearly this isn't a particularly novel example, but I still think it's noteworthy that a model running on my 2.5 year old laptop (a 64GB MacBook Pro M2) is able to produce code like this - especially code that worked first time with no further edits needed.&lt;/p&gt;

&lt;h4 id="how-i-ran-the-model"&gt;How I ran the model&lt;/h4&gt;

&lt;p&gt;I had to run it using the current &lt;code&gt;main&lt;/code&gt; branch of the &lt;a href="https://github.com/ml-explore/mlx-lm"&gt;mlx-lm&lt;/a&gt; library (to ensure I had &lt;a href="https://github.com/ml-explore/mlx-lm/commit/489e63376b963ac02b3b7223f778dbecc164716b"&gt;this commit&lt;/a&gt; adding &lt;code&gt;glm4_moe&lt;/code&gt; support). I ran that using &lt;a href="https://github.com/astral-sh/uv"&gt;uv&lt;/a&gt; like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;uv run \
  --with &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;https://github.com/ml-explore/mlx-lm/archive/489e63376b963ac02b3b7223f778dbecc164716b.zip&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
  python&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then in that Python interpreter I used the standard recipe for running MLX models:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;mlx_lm&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;load&lt;/span&gt;, &lt;span class="pl-s1"&gt;generate&lt;/span&gt;
&lt;span class="pl-s1"&gt;model&lt;/span&gt;, &lt;span class="pl-s1"&gt;tokenizer&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;load&lt;/span&gt;(&lt;span class="pl-s"&gt;"mlx-community/GLM-4.5-Air-3bit"&lt;/span&gt;)&lt;/pre&gt;
&lt;p&gt;That downloaded 44GB of model weights to my  &lt;code&gt;~/.cache/huggingface/hub/models--mlx-community--GLM-4.5-Air-3bit&lt;/code&gt; folder.&lt;/p&gt;
&lt;p&gt;Then:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-s1"&gt;prompt&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"Write an HTML and JavaScript page implementing space invaders"&lt;/span&gt;
&lt;span class="pl-s1"&gt;messages&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; [{&lt;span class="pl-s"&gt;"role"&lt;/span&gt;: &lt;span class="pl-s"&gt;"user"&lt;/span&gt;, &lt;span class="pl-s"&gt;"content"&lt;/span&gt;: &lt;span class="pl-s1"&gt;prompt&lt;/span&gt;}]
&lt;span class="pl-s1"&gt;prompt&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;tokenizer&lt;/span&gt;.&lt;span class="pl-c1"&gt;apply_chat_template&lt;/span&gt;(
    &lt;span class="pl-s1"&gt;messages&lt;/span&gt;,
    &lt;span class="pl-s1"&gt;add_generation_prompt&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;
)
&lt;span class="pl-s1"&gt;response&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;generate&lt;/span&gt;(
    &lt;span class="pl-s1"&gt;model&lt;/span&gt;, &lt;span class="pl-s1"&gt;tokenizer&lt;/span&gt;,
    &lt;span class="pl-s1"&gt;prompt&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s1"&gt;prompt&lt;/span&gt;,
    &lt;span class="pl-s1"&gt;verbose&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;,
    &lt;span class="pl-s1"&gt;max_tokens&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;8192&lt;/span&gt;
)&lt;/pre&gt;
&lt;p&gt;The response started like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt;The user wants me to create a Space Invaders game using HTML, CSS, and JavaScript. I need to create a complete, functional game with the following features:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Player spaceship that can move left/right and shoot&lt;/li&gt;
&lt;li&gt;Enemy invaders that move in formation and shoot back&lt;/li&gt;
&lt;li&gt;Score tracking&lt;/li&gt;
&lt;li&gt;Lives/health system&lt;/li&gt;
&lt;li&gt;Game over conditions [...]&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
&lt;p&gt;Followed by the HTML and this debugging output:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Prompt: 14 tokens, 14.095 tokens-per-sec
Generation: 4193 tokens, 25.564 tokens-per-sec
Peak memory: 47.687 GB
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can see &lt;a href="https://gist.github.com/simonw/9f515c8e32fb791549aeb88304550893#file-space_invaders-txt-L61"&gt;the full transcript here&lt;/a&gt;, or view &lt;a href="https://github.com/simonw/tools/blob/9e04fd9895fae1aa9ac78b8e62d2833831fe0544/space-invaders-GLM-4.5-Air-3bit.html"&gt;the source on GitHub&lt;/a&gt;, or &lt;a href="https://tools.simonwillison.net/space-invaders-GLM-4.5-Air-3bit"&gt;try it out in your browser&lt;/a&gt;.&lt;/p&gt;

&lt;h4 id="pelican"&gt;A pelican for good measure&lt;/h4&gt;

&lt;p&gt;I ran &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;my pelican benchmark&lt;/a&gt; against the full sized models &lt;a href="https://simonwillison.net/2025/Jul/28/glm-45/"&gt;yesterday&lt;/a&gt;, but I couldn't resist trying it against this smaller 3bit model. Here's what I got for &lt;code&gt;"Generate an SVG of a pelican riding a bicycle"&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/glm-4.5-air-3b-pelican.png" alt="Blue background, pelican looks like a cloud with an orange bike, bicycle is recognizable as a bicycle if not quite the right geometry." /&gt;&lt;/p&gt;

&lt;p&gt;Here's the &lt;a href="https://gist.github.com/simonw/fe428f7cead72ad754f965a81117f5df"&gt;transcript for that&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In both cases the model used around 48GB of RAM at peak, leaving me with just 16GB for everything else - I had to quit quite a few apps in order to get the model to run but the speed was pretty good once it got going.&lt;/p&gt;

&lt;h4 id="local-coding-models"&gt;Local coding models are really good now&lt;/h4&gt;

&lt;p&gt;It's interesting how almost every model released in 2025 has specifically targeted coding. That focus has clearly been paying off: these coding models are getting &lt;em&gt;really good&lt;/em&gt; now.&lt;/p&gt;

&lt;p&gt;Two years ago when I &lt;a href="https://simonwillison.net/2023/Mar/11/llama/"&gt;first tried LLaMA&lt;/a&gt; I never &lt;em&gt;dreamed&lt;/em&gt; that the same laptop I was using then would one day be able to run models with capabilities as strong as what I'm seeing from GLM 4.5 Air - and Mistral 3.2 Small, and Gemma 3, and Qwen 3, and a host of other high quality models that have emerged over the past six months.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/space-invaders"&gt;space-invaders&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ivan-fioravanti"&gt;ivan-fioravanti&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/glm"&gt;glm&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="python"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="ai-assisted-programming"/><category term="uv"/><category term="mlx"/><category term="pelican-riding-a-bicycle"/><category term="ai-in-china"/><category term="space-invaders"/><category term="ivan-fioravanti"/><category term="glm"/></entry><entry><title>GLM-4.5: Reasoning, Coding, and Agentic Abililties</title><link href="https://simonwillison.net/2025/Jul/28/glm-45/#atom-tag" rel="alternate"/><published>2025-07-28T16:56:42+00:00</published><updated>2025-07-28T16:56:42+00:00</updated><id>https://simonwillison.net/2025/Jul/28/glm-45/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://z.ai/blog/glm-4.5"&gt;GLM-4.5: Reasoning, Coding, and Agentic Abililties&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Another day, another significant new open weight model release from a Chinese frontier AI lab.&lt;/p&gt;
&lt;p&gt;This time it's Z.ai - who rebranded (at least in English) from &lt;a href="https://en.wikipedia.org/wiki/Zhipu_AI"&gt;Zhipu AI&lt;/a&gt; a few months ago. They just dropped &lt;a href="https://huggingface.co/zai-org/GLM-4.5-Base"&gt;GLM-4.5-Base&lt;/a&gt;, &lt;a href="https://huggingface.co/zai-org/GLM-4.5"&gt;GLM-4.5&lt;/a&gt; and &lt;a href="https://huggingface.co/zai-org/GLM-4.5-Air"&gt;GLM-4.5 Air&lt;/a&gt; on Hugging Face, all under an MIT license.&lt;/p&gt;
&lt;p&gt;These are MoE hybrid reasoning models with thinking and non-thinking modes, similar to Qwen 3. GLM-4.5 is 355 billion total parameters with 32 billion active, GLM-4.5-Air is 106 billion total parameters and 12 billion active.&lt;/p&gt;
&lt;p&gt;They started using MIT a few months ago for their &lt;a href="https://huggingface.co/collections/zai-org/glm-4-0414-67f3cbcb34dd9d252707cb2e"&gt;GLM-4-0414&lt;/a&gt; models - their older releases used a janky non-open-source custom license.&lt;/p&gt;
&lt;p&gt;Z.ai's own benchmarking (across 12 common benchmarks) ranked their GLM-4.5 3rd behind o3 and Grok-4 and just ahead of Claude Opus 4. They ranked GLM-4.5 Air 6th place just ahead of Claude 4 Sonnet. I haven't seen any independent benchmarks yet.&lt;/p&gt;
&lt;p&gt;The other models they included in their own benchmarks were o4-mini (high), Gemini 2.5 Pro, Qwen3-235B-Thinking-2507, DeepSeek-R1-0528, Kimi K2, GPT-4.1, DeepSeek-V3-0324. Notably absent: any of Meta's Llama models, or any of Mistral's. When it comes to open weight models, did they deliberately compare themselves only against those from other Chinese AI labs?&lt;/p&gt;
&lt;p&gt;Both models have a 128,000 context length and are trained for tool calling, which honestly feels like table stakes for any model released in 2025 at this point.&lt;/p&gt;
&lt;p&gt;It's interesting to see them use Claude Code to run their own coding benchmarks:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;To assess GLM-4.5's agentic coding capabilities, we utilized Claude Code to evaluate performance against Claude-4-Sonnet, Kimi K2, and Qwen3-Coder across 52 coding tasks spanning frontend development, tool development, data analysis, testing, and algorithm implementation. [...] The empirical results demonstrate that GLM-4.5 achieves a 53.9% win rate against Kimi K2 and exhibits dominant performance over Qwen3-Coder with an 80.8% success rate. While GLM-4.5 shows competitive performance, further optimization opportunities remain when compared to Claude-4-Sonnet.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;They published the dataset for that benchmark as &lt;a href="https://huggingface.co/datasets/zai-org/CC-Bench-trajectories"&gt;zai-org/CC-Bench-trajectories&lt;/a&gt; on Hugging Face. I think they're using the word "trajectory" for what I would call a chat transcript.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Unlike DeepSeek-V3 and Kimi K2, we reduce the width (hidden dimension and number of routed experts) of the model while increasing the height (number of layers), as we found that deeper models exhibit better reasoning capacity.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;They pre-trained on 15 trillion tokens, then an additional 7 trillion for code and reasoning:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Our base model undergoes several training stages. During pre-training, the model is first trained on 15T tokens of a general pre-training corpus, followed by 7T tokens of a code &amp;amp; reasoning corpus. After pre-training, we introduce additional stages to further enhance the model's performance on key downstream domains.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;They also open sourced their post-training reinforcement learning harness, which they've called &lt;strong&gt;slime&lt;/strong&gt;. That's available at &lt;a href="https://github.com/THUDM/slime"&gt;THUDM/slime&lt;/a&gt; on GitHub - THUDM is the Knowledge Engineering Group @ Tsinghua University, the university from which Zhipu AI spun out as an independent company.&lt;/p&gt;
&lt;p&gt;This time I ran my &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;pelican benchmark&lt;/a&gt; using the &lt;a href="https://chat.z.ai/"&gt;chat.z.ai&lt;/a&gt; chat interface, which offers free access (no account required) to both GLM 4.5 and GLM 4.5 Air. I had reasoning enabled for both.&lt;/p&gt;
&lt;p&gt;Here's what I got for "Generate an SVG of a pelican riding a bicycle" on &lt;a href="https://chat.z.ai/s/014a8c13-7b73-40e8-bbf9-6a94482caa2e"&gt;GLM 4.5&lt;/a&gt;. I like how the pelican has its wings on the handlebars:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Description by Claude Sonnet 4: This is a whimsical illustration of a white duck or goose riding a red bicycle. The bird has an orange beak and is positioned on the bike seat, with its orange webbed feet gripping what appears to be chopsticks or utensils near the handlebars. The bicycle has a simple red frame with two wheels, and there are motion lines behind it suggesting movement. The background is a soft blue-gray color, giving the image a clean, minimalist cartoon style. The overall design has a playful, humorous quality to it." src="https://static.simonwillison.net/static/2025/glm-4.5-pelican.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;And &lt;a href="https://chat.z.ai/s/e772675c-3445-4cff-903c-6faa3d6b9524"&gt;GLM 4.5 Air&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Description by Claude Sonnet 4: This image shows a cute, minimalist illustration of a snowman riding a bicycle. The snowman has a simple design with a round white body, small black dot for an eye, and an orange rectangular nose (likely representing a carrot). The snowman appears to be in motion on a black bicycle with two wheels, with small orange arrows near the pedals suggesting movement. There are curved lines on either side of the image indicating motion or wind. The overall style is clean and whimsical, using a limited color palette of white, black, orange, and gray against a light background." src="https://static.simonwillison.net/static/2025/glm-4.5-air-pelican.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;Ivan Fioravanti &lt;a href="https://x.com/ivanfioravanti/status/1949854575902523399"&gt;shared a video&lt;/a&gt; of the &lt;a href="https://huggingface.co/mlx-community/GLM-4.5-Air-4bit"&gt;mlx-community/GLM-4.5-Air-4bit&lt;/a&gt; quantized model running on a M4 Mac with 128GB of RAM, and it looks like a very strong contender for a local model that can write useful code. The cheapest 128GB Mac Studio costs around $3,500 right now, so genuinely great open weight coding models are creeping closer to being affordable on consumer machines.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: Ivan released a 3 bit quantized version of GLM-4.5 Air which runs using 48GB of RAM on my laptop. I tried it and was &lt;em&gt;really&lt;/em&gt; impressed, see &lt;a href="https://simonwillison.net/2025/Jul/29/space-invaders/"&gt;My 2.5 year old laptop can write Space Invaders in JavaScript now&lt;/a&gt;.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ivan-fioravanti"&gt;ivan-fioravanti&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/glm"&gt;glm&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="mlx"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/><category term="ai-in-china"/><category term="ivan-fioravanti"/><category term="glm"/></entry><entry><title>How to run an LLM on your laptop</title><link href="https://simonwillison.net/2025/Jul/18/how-to-run-an-llm-on-your-laptop/#atom-tag" rel="alternate"/><published>2025-07-18T15:33:27+00:00</published><updated>2025-07-18T15:33:27+00:00</updated><id>https://simonwillison.net/2025/Jul/18/how-to-run-an-llm-on-your-laptop/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.technologyreview.com/2025/07/17/1120391/how-to-run-an-llm-on-your-laptop/"&gt;How to run an LLM on your laptop&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I talked to Grace Huckins for this piece from MIT Technology Review on running local models. Apparently she enjoyed my dystopian backup plan!&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Simon Willison has a plan for the end of the world. It’s a USB stick, onto which he has loaded a couple of his favorite open-weight LLMs—models that have been shared publicly by their creators and that can, in principle, be downloaded and run with local hardware. If human civilization should ever collapse, Willison plans to use all the knowledge encoded in their billions of parameters for help. “It’s like having a weird, condensed, faulty version of Wikipedia, so I can help reboot society with the help of my little USB stick,” he says.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The article suggests &lt;a href="https://ollama.com/"&gt;Ollama&lt;/a&gt; or &lt;a href="https://lmstudio.ai/"&gt;LM Studio&lt;/a&gt; for laptops, and new-to-me &lt;a href="https://apps.apple.com/us/app/llm-farm/id6461209867"&gt;LLM Farm&lt;/a&gt; for the iPhone:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;My beat-up iPhone 12 was able to run Meta’s Llama 3.2 1B using an app called LLM Farm. It’s not a particularly good model—it very quickly goes off into bizarre tangents and hallucinates constantly—but trying to coax something so chaotic toward usability can be entertaining.&lt;/p&gt;
&lt;/blockquote&gt;
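&lt;p&gt;The laptop equivalent of that experiment is a one-liner with Ollama - assuming their &lt;code&gt;llama3.2:1b&lt;/code&gt; tag, which should fetch the same 1B model:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama run llama3.2:1b "Tell me a joke about a pelican"
&lt;/code&gt;&lt;/pre&gt;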
&lt;p&gt;&lt;strong&gt;Update 19th July 2025&lt;/strong&gt;: Evan Hahn compared the size of &lt;a href="https://evanhahn.com/local-llms-versus-offline-wikipedia/"&gt;various offline LLMs to different Wikipedia exports&lt;/a&gt;. Full English Wikipedia without images, revision history or talk pages is 13.82GB, smaller than Mistral Small 3.2 (15GB) but larger than Qwen 3 14B and Gemma 3n.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/wikipedia"&gt;wikipedia&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/press-quotes"&gt;press-quotes&lt;/a&gt;&lt;/p&gt;



</summary><category term="wikipedia"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="ollama"/><category term="lm-studio"/><category term="press-quotes"/></entry><entry><title>LM Studio is free for use at work</title><link href="https://simonwillison.net/2025/Jul/8/lm-studio-is-free-for-use-at-work/#atom-tag" rel="alternate"/><published>2025-07-08T20:37:06+00:00</published><updated>2025-07-08T20:37:06+00:00</updated><id>https://simonwillison.net/2025/Jul/8/lm-studio-is-free-for-use-at-work/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://lmstudio.ai/blog/free-for-work"&gt;LM Studio is free for use at work&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
A notable policy change for &lt;a href="https://lmstudio.ai/"&gt;LM Studio&lt;/a&gt;. Their excellent macOS app (and Linux and Windows, but I've only tried it on Mac) was previously free for personal use but required a license for commercial purposes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Until now, the LM Studio app terms stated that for use at a company or organization, you should get in touch with us and get separate commercial license. This requirement is now removed.&lt;/p&gt;
&lt;p&gt;Starting today, there's no need to fill a form or contact us. You and your team can just use LM Studio at work!&lt;/p&gt;
&lt;/blockquote&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/macos"&gt;macos&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;&lt;/p&gt;



</summary><category term="macos"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="lm-studio"/></entry><entry><title>Introducing Gemma 3n: The developer guide</title><link href="https://simonwillison.net/2025/Jun/26/gemma-3n/#atom-tag" rel="alternate"/><published>2025-06-26T21:08:36+00:00</published><updated>2025-06-26T21:08:36+00:00</updated><id>https://simonwillison.net/2025/Jun/26/gemma-3n/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://developers.googleblog.com/en/introducing-gemma-3n-developer-guide/"&gt;Introducing Gemma 3n: The developer guide&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Extremely consequential new open weights model release from Google today:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Multimodal by design:&lt;/strong&gt; Gemma 3n natively supports image, audio, video, and text inputs and text outputs.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Optimized for on-device:&lt;/strong&gt; Engineered with a focus on efficiency, Gemma 3n models are available in two sizes based on &lt;a href="https://developers.googleblog.com/en/introducing-gemma-3n-developer-guide/#per-layer-embeddings-(ple):-unlocking-more-memory-efficiency"&gt;&lt;strong&gt;effective&lt;/strong&gt;&lt;/a&gt; parameters: E2B and E4B. While their raw parameter count is 5B and 8B respectively, architectural innovations allow them to run with a memory footprint comparable to traditional 2B and 4B models, operating with as little as 2GB (E2B) and 3GB (E4B) of memory.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is &lt;strong&gt;very&lt;/strong&gt; exciting: a 2B and 4B model optimized for end-user devices which accepts text, images &lt;em&gt;and&lt;/em&gt; audio as inputs!&lt;/p&gt;
&lt;p&gt;Gemma 3n is also the most comprehensive day one launch I've seen for any model: Google partnered with "AMD, Axolotl, Docker, Hugging Face, llama.cpp, LMStudio, MLX, NVIDIA, Ollama, RedHat, SGLang, Unsloth, and vLLM" so there are dozens of ways to try this out right now.&lt;/p&gt;
&lt;p&gt;So far I've run two variants on my Mac laptop. Ollama offer &lt;a href="https://ollama.com/library/gemma3n"&gt;a 7.5GB version&lt;/a&gt; (full tag &lt;code&gt;gemma3n:e4b-it-q4_K_M&lt;/code&gt;) of the 4B model, which I ran like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama pull gemma3n
llm install llm-ollama
llm -m gemma3n:latest "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It drew me this:&lt;/p&gt;
&lt;p&gt;&lt;img alt="The pelican looks a bit like a grey pig. It is floating above a bicycle that looks more like a rail cart." src="https://static.simonwillison.net/static/2025/gemma3n-ollama.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;The Ollama version doesn't appear to support image or audio input yet.&lt;/p&gt;
&lt;p&gt;... but the &lt;a href="https://github.com/Blaizzy/mlx-vlm"&gt;mlx-vlm&lt;/a&gt; version does!&lt;/p&gt;
&lt;p&gt;First I tried that on &lt;a href="https://static.simonwillison.net/static/2025/pelican-joke-request.wav"&gt;this WAV file&lt;/a&gt; like so (using a recipe adapted from &lt;a href="https://www.youtube.com/watch?v=8-8R2UvUBrc"&gt;Prince Canuma's video&lt;/a&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uv run --with mlx-vlm mlx_vlm.generate \
  --model gg-hf-gm/gemma-3n-E4B-it \
  --max-tokens 100 \
  --temperature 0.7 \
  --prompt "Transcribe the following speech segment in English:" \
  --audio pelican-joke-request.wav
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That downloaded a 15.74 GB bfloat16 version of the model and output the following correct transcription:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Tell me a joke about a pelican.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Then I had it draw me a pelican for good measure:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uv run --with mlx-vlm mlx_vlm.generate \
  --model gg-hf-gm/gemma-3n-E4B-it \
  --max-tokens 100 \
  --temperature 0.7 \
  --prompt "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I quite like this one:&lt;/p&gt;
&lt;p&gt;&lt;img alt="The bicycle is a sin wave, but you can make out which bits are the frame and the wheels. The pelican is white with a black triangular bill. It's a much better attempt than the Ollama one." src="https://static.simonwillison.net/static/2025/gemma3n-mlx-vlm.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;It's interesting to see such a striking visual difference between the 7.5GB quantized version and the 15.74GB bfloat16 one.&lt;/p&gt;
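&lt;p&gt;mlx-community usually publish smaller quantized conversions quickly too. If a 4bit conversion of this model exists - the model ID here is my guess, so check &lt;a href="https://huggingface.co/mlx-community"&gt;their Hugging Face organization&lt;/a&gt; first - the same recipe should need a fraction of the RAM:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uv run --with mlx-vlm mlx_vlm.generate \
  --model mlx-community/gemma-3n-E4B-it-4bit \
  --max-tokens 100 \
  --temperature 0.7 \
  --prompt "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;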
&lt;p&gt;Finally, I had it describe the image it had just created:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uv run --with mlx-vlm mlx_vlm.generate \
  --model gg-hf-gm/gemma-3n-E4B-it \
  --max-tokens 1000 \
  --prompt "Describe image" \
  --image gemma3n-mlx-vlm.jpg
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's what I got back - it thought this was a chemical diagram!&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The image is a cartoon-style illustration depicting a molecular structure against a light blue background. The structure is composed of several differently colored and shaped elements connected by curved black lines. &lt;/p&gt;
&lt;p&gt;Here's a breakdown of the elements:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Large Black Circles:&lt;/strong&gt; There are two large, solid black circles, likely representing larger molecules or components.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cream-Colored Oval:&lt;/strong&gt; A prominent cream-colored oval shape is centrally located and connected to other elements. It has a smaller, darker cream-colored sphere attached to its left side.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Red Oval:&lt;/strong&gt; A smaller, solid red oval is positioned near one of the black circles and connected to the cream-colored oval.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Black Lines:&lt;/strong&gt; These lines act as bonds, connecting the various molecular components. They are curved and dynamic, suggesting movement or interaction.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Triangular Shape:&lt;/strong&gt; A small black triangle is attached to the smaller cream-colored sphere.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Letter "I":&lt;/strong&gt; The letter "I" appears twice, likely labeling specific parts of the molecule. &lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The overall impression is of a simplified representation of a biological molecule, possibly a protein or a complex organic compound. The use of different colors helps to distinguish the various components within the structure.&lt;/p&gt;
&lt;/blockquote&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/audio"&gt;audio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemma"&gt;gemma&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prince-canuma"&gt;prince-canuma&lt;/a&gt;&lt;/p&gt;



</summary><category term="audio"/><category term="google"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="vision-llms"/><category term="mlx"/><category term="ollama"/><category term="pelican-riding-a-bicycle"/><category term="gemma"/><category term="llm-release"/><category term="prince-canuma"/></entry><entry><title>Mistral-Small 3.2</title><link href="https://simonwillison.net/2025/Jun/20/mistral-small-32/#atom-tag" rel="alternate"/><published>2025-06-20T19:12:42+00:00</published><updated>2025-06-20T19:12:42+00:00</updated><id>https://simonwillison.net/2025/Jun/20/mistral-small-32/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506"&gt;Mistral-Small 3.2&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Released on Hugging Face a couple of hours ago - so far there aren't any quantizations to run it on a Mac, but I'm sure those will emerge pretty quickly.&lt;/p&gt;
&lt;p&gt;This is a minor bump to Mistral Small 3.1, one of my favorite local models. I've been running Small 3.1 &lt;a href="https://ollama.com/library/mistral-small3.1/tags"&gt;via Ollama&lt;/a&gt; where it's a 15GB download - these 24 billion parameter models are a great balance between capabilities and not using up all of the available RAM on my laptop. I expect Ollama will add 3.2 imminently.&lt;/p&gt;
&lt;p&gt;According to Mistral:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Small-3.2 improves in the following categories:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Instruction following&lt;/strong&gt;: Small-3.2 is better at following precise instructions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Repetition errors&lt;/strong&gt;: Small-3.2 produces less infinite generations or repetitive answers&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Function calling&lt;/strong&gt;: Small-3.2's function calling template is more robust (see &lt;a href="https://github.com/mistralai/mistral-common/blob/535b4d0a0fc94674ea17db6cf8dc2079b81cbcfa/src/mistral_common/tokens/tokenizers/instruct.py#L778"&gt;here&lt;/a&gt; and &lt;a href="https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506#function-calling"&gt;examples&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Interestingly they recommend running it with a temperature of 0.15 - many models recommend a default of 0.7. They also provide a &lt;a href="https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506/blob/main/SYSTEM_PROMPT.txt"&gt;suggested system prompt&lt;/a&gt; which includes a note that "Your knowledge base was last updated on 2023-10-01".&lt;/p&gt;
&lt;p&gt;It's not currently available via Mistral's API, or through any of the third-party LLM hosting vendors that I've checked, so I've not been able to run a prompt through the model myself yet.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update:&lt;/strong&gt; I downloaded one of the first GGUF quantizations to show up on Hugging Face, &lt;a href="https://huggingface.co/gabriellarson/Mistral-Small-3.2-24B-Instruct-2506-GGUF"&gt;gabriellarson/Mistral-Small-3.2-24B-Instruct-2506-GGUF&lt;/a&gt; by Gabriel Larson. I ran it using Ollama and &lt;a href="https://github.com/taketwo/llm-ollama"&gt;llm-ollama&lt;/a&gt; like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama pull hf.co/gabriellarson/Mistral-Small-3.2-24B-Instruct-2506-GGUF:Q4_K_M
llm install llm-ollama
llm -m hf.co/gabriellarson/Mistral-Small-3.2-24B-Instruct-2506-GGUF:Q4_K_M \
  "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This one is pretty good for a 15GB model!&lt;/p&gt;
&lt;p&gt;&lt;img alt="See description below" src="https://static.simonwillison.net/static/2025/mistral-3.2-pelican.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/87ade7285a1d51d7073f73559baf41b2"&gt;the full transcript&lt;/a&gt; - it doesn't quite work in the actual image but I was delighted to see the model attempt to add this detail:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;!-- Basket with fish --&amp;gt;
&amp;lt;rect x="250" y="190" width="25" height="15" rx="5" fill="#FFA500"/&amp;gt;
&amp;lt;circle cx="260" cy="200" r="3" fill="#FF4500"/&amp;gt; &amp;lt;!-- Fish --&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's what you get if you isolate just that part of the generated SVG:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Large blue rectangle image, in the middle is a very small rounded corner orange rectangle &amp;quot;basket&amp;quot; containing an even smaller red circle that represents the fish." src="https://static.simonwillison.net/static/2025/fish-basket.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;I had Mistral Small 3.2 describe the full image to me, since it's a vision-LLM:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m hf.co/gabriellarson/Mistral-Small-3.2-24B-Instruct-2506-GGUF:Q4_K_M \
  'describe image' \
  -a https://static.simonwillison.net/static/2025/mistral-3.2-pelican.jpg
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And it gave me the following:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The image depicts a cartoonish illustration of a duck that is being lifted off the ground by a hook. The duck appears to be in mid-air, with its wings spread slightly as if it's surprised or reacting to being picked up. The hook is attached to an arm or a mechanism and seems to be connected to a vehicle below—perhaps a truck or a platform with wheels. The background of the image is light blue, indicating an outdoor setting. Overall, the scene is whimsical and playful, possibly suggesting a humorous or unusual situation where the duck is being transported in this manner.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;Update 2&lt;/strong&gt;: It's now available as &lt;a href="https://ollama.com/library/mistral-small3.2"&gt;an official Ollama model&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama pull mistral-small3.2
&lt;/code&gt;&lt;/pre&gt;
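&lt;p&gt;Mistral's recommended temperature of 0.15 can be passed along with LLM's &lt;code&gt;-o&lt;/code&gt; option - a sketch, assuming llm-ollama forwards that option through to Ollama (run &lt;code&gt;llm models --options&lt;/code&gt; to check):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m mistral-small3.2:latest -o temperature 0.15 \
  "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;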
&lt;p&gt;LM Studio has a community quantization too: &lt;a href="https://huggingface.co/lmstudio-community/Mistral-Small-3.2-24B-Instruct-2506-GGUF"&gt;lmstudio-community/Mistral-Small-3.2-24B-Instruct-2506-GGUF&lt;/a&gt;.&lt;/p&gt;
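&lt;p&gt;If you're an LM Studio user their &lt;code&gt;lms&lt;/code&gt; CLI should be able to fetch that one directly - I haven't tried this exact identifier:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;lms get lmstudio-community/Mistral-Small-3.2-24B-Instruct-2506-GGUF
&lt;/code&gt;&lt;/pre&gt;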


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/hugging-face"&gt;hugging-face&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mistral"&gt;mistral&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-tool-use"&gt;llm-tool-use&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="hugging-face"/><category term="mistral"/><category term="vision-llms"/><category term="llm-tool-use"/><category term="ollama"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="lm-studio"/></entry><entry><title>Magistral — the first reasoning model by Mistral AI</title><link href="https://simonwillison.net/2025/Jun/10/magistral/#atom-tag" rel="alternate"/><published>2025-06-10T16:13:22+00:00</published><updated>2025-06-10T16:13:22+00:00</updated><id>https://simonwillison.net/2025/Jun/10/magistral/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://mistral.ai/news/magistral"&gt;Magistral — the first reasoning model by Mistral AI&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Mistral's first reasoning model is out today, in two sizes. There's a 24B Apache 2 licensed open-weights model called Magistral Small (actually Magistral-Small-2506), and a larger API-only model called Magistral Medium.&lt;/p&gt;
&lt;p&gt;Magistral Small is available as &lt;a href="https://huggingface.co/mistralai/Magistral-Small-2506"&gt;mistralai/Magistral-Small-2506&lt;/a&gt; on Hugging Face. From that model card:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Context Window&lt;/strong&gt;: A 128k context window, but performance might degrade past 40k. Hence we recommend setting the maximum model length to 40k.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Mistral also released an official GGUF version, &lt;a href="https://huggingface.co/mistralai/Magistral-Small-2506_gguf"&gt;Magistral-Small-2506_gguf&lt;/a&gt;, which I ran successfully using Ollama like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama pull hf.co/mistralai/Magistral-Small-2506_gguf:Q8_0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That fetched a 25GB file. I ran prompts using a chat session with &lt;a href="https://github.com/taketwo/llm-ollama"&gt;llm-ollama&lt;/a&gt; like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm chat -m hf.co/mistralai/Magistral-Small-2506_gguf:Q8_0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's what I got for "Generate an SVG of a pelican riding a bicycle" (&lt;a href="https://gist.github.com/simonw/7aaac8217f43be04886737d67c08ecca"&gt;transcript here&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&lt;img alt="Blue sky and what looks like an eagle flying towards the viewer." src="https://static.simonwillison.net/static/2025/magistral-pelican.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;It's disappointing that the GGUF doesn't support function calling yet - hopefully a community variant can add that. It's one of the best ways I know of to unlock the potential of these reasoning models.&lt;/p&gt;
&lt;p&gt;I just noticed that Ollama have their own &lt;a href="https://ollama.com/library/magistral"&gt;Magistral model&lt;/a&gt; too, which can be accessed using:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama pull magistral:latest
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That gets you a 14GB &lt;code&gt;q4_K_M&lt;/code&gt; quantization - other options can be found in the &lt;a href="https://ollama.com/library/magistral/tags"&gt;full list of Ollama magistral tags&lt;/a&gt;.&lt;/p&gt;
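&lt;p&gt;Given the 40k context recommendation from the model card it may be worth capping the context window when running these locally - a sketch, assuming llm-ollama passes Ollama's &lt;code&gt;num_ctx&lt;/code&gt; option through:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm chat -m hf.co/mistralai/Magistral-Small-2506_gguf:Q8_0 \
  -o num_ctx 40960
&lt;/code&gt;&lt;/pre&gt;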
&lt;p&gt;One thing that caught my eye in the Magistral announcement:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Legal, finance, healthcare, and government professionals get traceable reasoning that meets compliance requirements. Every conclusion can be traced back through its logical steps, providing auditability for high-stakes environments with domain-specialized AI.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I guess this means the reasoning traces are fully visible and not redacted in any way - interesting to see Mistral trying to turn that into a feature that's attractive to the business clients they are most interested in appealing to.&lt;/p&gt;
&lt;p&gt;Also from that announcement:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Our early tests indicated that Magistral is an excellent creative companion. We highly recommend it for creative writing and storytelling, with the model capable of producing coherent or — if needed — delightfully eccentric copy.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I haven't seen a reasoning model promoted for creative writing in this way before.&lt;/p&gt;
&lt;p&gt;You can try out Magistral Medium by selecting the new "Thinking" option in Mistral's &lt;a href="https://chat.mistral.ai/"&gt;Le Chat&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of a chat interface showing settings options. At the top is a text input field that says &amp;quot;Ask le Chat or @mention an agent&amp;quot; with a plus button, lightbulb &amp;quot;Think&amp;quot; button with up arrow, grid &amp;quot;Tools&amp;quot; button, and settings icon. Below are two toggle options: &amp;quot;Pure Thinking&amp;quot; with description &amp;quot;Best option for math + coding. Disables tools.&amp;quot; (toggle is off), and &amp;quot;10x Speed&amp;quot; with lightning bolt icon and &amp;quot;PRO - 2 remaining today&amp;quot; label, described as &amp;quot;Same quality at 10x the speed.&amp;quot; (toggle is on and green)." src="https://static.simonwillison.net/static/2025/magistral-le-chat.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;They have options for "Pure Thinking" and a separate option for "10x speed", which runs Magistral Medium at 10x the speed using &lt;a href="https://www.cerebras.ai/"&gt;Cerebras&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The new models are also available through &lt;a href="https://docs.mistral.ai/api/"&gt;the Mistral API&lt;/a&gt;. You can access them by installing &lt;a href="https://github.com/simonw/llm-mistral"&gt;llm-mistral&lt;/a&gt; and running &lt;code&gt;llm mistral refresh&lt;/code&gt; to refresh the list of available models, then:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m mistral/magistral-medium-latest \
  'Generate an SVG of a pelican riding a bicycle'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="Claude Sonnet 4 described this as Minimalist illustration of a white bird with an orange beak riding on a dark gray motorcycle against a light blue sky with a white sun and gray ground" src="https://static.simonwillison.net/static/2025/magistral-medium-pelican.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/93917661eae6e2fe0a0bd5685172fab8"&gt;that transcript&lt;/a&gt;. At 13 input and 1,236 output tokens that cost me &lt;a href="https://www.llm-prices.com/#it=13&amp;amp;ot=1236&amp;amp;ic=2&amp;amp;oc=5"&gt;0.62 cents&lt;/a&gt; - just over half a cent.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mistral"&gt;mistral&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cerebras"&gt;cerebras&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="mistral"/><category term="cerebras"/><category term="llm-pricing"/><category term="ollama"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/></entry></feed>