<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: qwen</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/qwen.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2026-04-16T17:16:52+00:00</updated><author><name>Simon Willison</name></author><entry><title>Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7</title><link href="https://simonwillison.net/2026/Apr/16/qwen-beats-opus/#atom-tag" rel="alternate"/><published>2026-04-16T17:16:52+00:00</published><updated>2026-04-16T17:16:52+00:00</updated><id>https://simonwillison.net/2026/Apr/16/qwen-beats-opus/#atom-tag</id><summary type="html">
    &lt;p&gt;For anyone who has been (inadvisably) taking my &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;pelican riding a bicycle benchmark&lt;/a&gt; seriously as a robust way to test models, here are pelicans from this morning's two big model releases - &lt;a href="https://qwen.ai/blog?id=qwen3.6-35b-a3b"&gt;Qwen3.6-35B-A3B from Alibaba&lt;/a&gt; and &lt;a href="https://www.anthropic.com/news/claude-opus-4-7"&gt;Claude Opus 4.7 from Anthropic&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here's the Qwen 3.6 pelican, generated using &lt;a href="https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/blob/main/Qwen3.6-35B-A3B-UD-Q4_K_S.gguf"&gt;this 20.9GB Qwen3.6-35B-A3B-UD-Q4_K_S.gguf&lt;/a&gt; quantized model by Unsloth, running on my MacBook Pro M5 via &lt;a href="https://lmstudio.ai/"&gt;LM Studio&lt;/a&gt; (and the &lt;a href="https://github.com/agustif/llm-lmstudio"&gt;llm-lmstudio&lt;/a&gt; plugin) - &lt;a href="https://gist.github.com/simonw/4389d355d8e162bc6e4547da214f7dd2"&gt;transcript here&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/Qwen3.6-35B-A3B-UD-Q4_K_S-pelican.png" alt="The bicycle frame is the correct shape. There are clouds in the sky. The pelican has a dorky looking pouch. A caption on the ground reads Pelican on a Bicycle!" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And here's one I got from Anthropic's &lt;a href="https://www.anthropic.com/news/claude-opus-4-7"&gt;brand new Claude Opus 4.7&lt;/a&gt; (&lt;a href="https://gist.github.com/simonw/afcb19addf3f38eb1996e1ebe749c118"&gt;transcript&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/opus-4.7-pelican.png" alt="The bicycle frame is entirely the wrong shape. No clouds, a yellow sun. The pelican is looking behind itself, and has a less pronounced pouch than I would like." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I'm giving this one to Qwen 3.6. Opus managed to mess up the bicycle frame!&lt;/p&gt;
&lt;p&gt;I tried Opus a second time passing &lt;code&gt;thinking_level: max&lt;/code&gt;. It didn't do much better (&lt;a href="https://gist.github.com/simonw/7566e04a81accfb9affda83451c0f363"&gt;transcript&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/opus-4.7-pelican-max.png" alt="The bicycle frame is entirely the wrong shape but in a different way. Lines are more bold. Pelican looks a bit more like a pelican." style="max-width: 100%;" /&gt;&lt;/p&gt;

&lt;h4 id="i-dont-think-qwen-are-cheating"&gt;I don't think Qwen are cheating&lt;/h4&gt;
&lt;p&gt;A lot of people are &lt;a href="https://simonwillison.net/2025/Nov/13/training-for-pelicans-riding-bicycles/"&gt;convinced that the labs train for my stupid benchmark&lt;/a&gt;. I don't think they do, but honestly this result did give me a little glint of suspicion. So I'm burning one of my secret backup tests - here's what I got from Qwen3.6-35B-A3B and Opus 4.7 for "Generate an SVG of a flamingo riding a unicycle":&lt;/p&gt;

&lt;div style="display: flex; gap: 4px;"&gt;
  &lt;figure style="flex: 1; text-align: center; margin: 0;"&gt;
    &lt;figcaption style="margin-bottom: 1em"&gt;Qwen3.6-35B-A3B&lt;br /&gt;(&lt;a href="https://gist.github.com/simonw/f1d1ff01c34dda5fdedf684cfc430d92"&gt;transcript&lt;/a&gt;)&lt;/figcaption&gt;
    &lt;img src="https://static.simonwillison.net/static/2026/qwen-flamingo.png" alt="The unicycle spokes are a too long. The pelican has sunglasses, a bowtie and appears to be smoking a cigarette. It has two heart emoji surrounding the caption Flamingo on a Unicycle. It has a lot of charisma." style="max-width: 100%; height: auto;" /&gt;
  &lt;/figure&gt;
  &lt;figure style="flex: 1; text-align: center; margin: 0;"&gt;
    &lt;figcaption style="margin-bottom: 1em"&gt;Opus 4.7&lt;br /&gt;(&lt;a href="https://gist.github.com/simonw/35121ad5dcf23bf860397a103ae88d50"&gt;transcript&lt;/a&gt;)&lt;/figcaption&gt;
    &lt;img src="https://static.simonwillison.net/static/2026/opus-flamingo.png" alt="The unicycle has a black wheel. The flamingo is a competent if slightly dull vector illustration of a flamingo. It has no flair." style="max-width: 100%; height: auto;" /&gt;
  &lt;/figure&gt;
&lt;/div&gt;


&lt;p&gt;I'm giving this one to Qwen too, partly for the excellent &lt;code&gt;&amp;lt;!-- Sunglasses on flamingo! --&amp;gt;&lt;/code&gt; SVG comment.&lt;/p&gt;

&lt;h4 id="what-can-we-learn-from-this-"&gt;What can we learn from this?&lt;/h4&gt;
&lt;p&gt;The pelican benchmark has always been meant as a joke - it's mainly a statement on how absurd the task of comparing these models is.&lt;/p&gt;
&lt;p&gt;The weird thing about that joke is that, for the most part, there has been a direct correlation between the quality of the pelicans produced and the general usefulness of the models. Those &lt;a href="https://simonwillison.net/2024/Oct/25/pelicans-on-a-bicycle/"&gt;first pelicans from October 2024&lt;/a&gt; were junk. The &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;more recent entries&lt;/a&gt; have generally been much, much better - to the point that Gemini 3.1 Pro produces &lt;a href="https://simonwillison.net/2026/Feb/19/gemini-31-pro/"&gt;illustrations you could actually use somewhere&lt;/a&gt;, provided you had a pressing need to illustrate a pelican riding a bicycle.&lt;/p&gt;
&lt;p&gt;Today, even that loose connection to utility has been broken. I have enormous respect for Qwen, but I very much doubt that a 21GB quantized version of their latest model is more powerful or useful than Anthropic's latest proprietary release.&lt;/p&gt;
&lt;p&gt;If the thing you need is an SVG illustration of a pelican riding a bicycle though, right now Qwen3.6-35B-A3B running on a laptop is a better bet than Opus 4.7!&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="anthropic"/><category term="claude"/><category term="qwen"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="lm-studio"/></entry><entry><title>Quantization from the ground up</title><link href="https://simonwillison.net/2026/Mar/26/quantization-from-the-ground-up/#atom-tag" rel="alternate"/><published>2026-03-26T16:21:09+00:00</published><updated>2026-03-26T16:21:09+00:00</updated><id>https://simonwillison.net/2026/Mar/26/quantization-from-the-ground-up/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://ngrok.com/blog/quantization"&gt;Quantization from the ground up&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Sam Rose continues &lt;a href="https://simonwillison.net/tags/sam-rose/"&gt;his streak&lt;/a&gt; of publishing spectacularly informative interactive essays, this time explaining how quantization of Large Language Models works (which he says might be "&lt;a href="https://twitter.com/samwhoo/status/2036845101561835968"&gt;the best post I've ever made&lt;/a&gt;").&lt;/p&gt;
&lt;p&gt;Also included is the best visual explanation I've ever seen of how floating point numbers are represented using binary digits.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of an interactive float32 binary representation tool showing the value -48.92364502, with color-coded bit fields labeled S (sign), EXPONENT (blue), and SIGNIFICAND (pink), displaying the 32-bit pattern 11000010010000111101100001110100000, and a slider control at the bottom along with minus, plus, and reset buttons." src="https://static.simonwillison.net/static/2026/float.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;I hadn't heard about &lt;strong&gt;outlier values&lt;/strong&gt; in quantization - rare float values that exist outside of the normal tiny-value distribution - but apparently they're very important:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Why do these outliers exist? [...] tl;dr: no one conclusively knows, but a small fraction of these outliers are &lt;em&gt;very&lt;/em&gt; important to model quality. Removing even a &lt;em&gt;single&lt;/em&gt; "super weight," as Apple calls them, can cause the model to output complete gibberish.&lt;/p&gt;
&lt;p&gt;Given their importance, real-world quantization schemes sometimes do extra work to preserve these outliers. They might do this by not quantizing them at all, or by saving their location and value into a separate table, then removing them so that their block isn't destroyed.&lt;/p&gt;
&lt;/blockquote&gt;
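&lt;p&gt;Here's what that idea looks like as a minimal Python sketch - my own toy simplification using NumPy, not the scheme from the post or any real quantization format:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np

def quantize_block(block, bits=4, outlier_factor=6.0):
    # Absmax-quantize one block of weights, first pulling outliers out
    # into a separate exact-valued table so they can't wreck the scale
    # for everything else in the block.
    block = np.asarray(block, dtype=np.float32)
    typical = np.median(np.abs(block)) + 1e-12
    is_outlier = np.greater(np.abs(block), outlier_factor * typical)
    outliers = {i: float(v) for i, v in enumerate(block) if is_outlier[i]}

    kept = np.where(is_outlier, 0.0, block)
    qmax = 2 ** (bits - 1) - 1                      # 7 for signed 4-bit
    scale = max(float(np.abs(kept).max()) / qmax, 1e-12)
    return np.round(kept / scale).astype(np.int8), scale, outliers

def dequantize(q, scale, outliers):
    restored = q.astype(np.float32) * scale
    for i, v in outliers.items():
        restored[i] = v           # put the "super weights" back exactly
    return restored

weights = [0.02, -0.01, 0.03, 12.5, -0.02, 0.01]    # 12.5 is the outlier
print(dequantize(*quantize_block(weights)))
&lt;/code&gt;&lt;/pre&gt;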
&lt;p&gt;Plus there's a section on &lt;a href="https://ngrok.com/blog/quantization#how-much-does-quantization-affect-model-accuracy"&gt;How much does quantization affect model accuracy?&lt;/a&gt;. Sam explains the concepts of &lt;strong&gt;perplexity&lt;/strong&gt; and &lt;strong&gt;KL divergence&lt;/strong&gt; and then uses the &lt;a href="https://github.com/ggml-org/llama.cpp/tree/master/tools/perplexity"&gt;llama.cpp perplexity tool&lt;/a&gt; and a run of the GPQA benchmark to show how different quantization levels affect Qwen 3.5 9B.&lt;/p&gt;
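&lt;p&gt;If those two terms are unfamiliar, here's roughly what they mean expressed as Python - a hand-rolled NumPy illustration, not what the llama.cpp tool actually computes internally:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np

def perplexity(true_token_probs):
    # exp of the mean negative log-probability the model assigned to
    # each actual next token. Lower is better; a perfect model scores 1.
    return float(np.exp(-np.mean(np.log(true_token_probs))))

def kl_divergence(p, q):
    # How far the quantized model's next-token distribution q drifts
    # from the full-precision distribution p. Zero means identical.
    p, q = np.asarray(p), np.asarray(q)
    return float(np.sum(p * np.log(p / q)))

print(perplexity([0.5, 0.25, 0.125]))                     # 4.0
print(kl_divergence([0.7, 0.2, 0.1], [0.6, 0.25, 0.15]))  # ~0.023 nats
&lt;/code&gt;&lt;/pre&gt;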
&lt;p&gt;His conclusion:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It looks like 16-bit to 8-bit carries almost no quality penalty. 16-bit to 4-bit is more noticeable, but it's certainly not a quarter as good as the original. Closer to 90%, depending on how you want to measure it.&lt;/p&gt;
&lt;/blockquote&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/computer-science"&gt;computer-science&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/explorables"&gt;explorables&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sam-rose"&gt;sam-rose&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;&lt;/p&gt;



</summary><category term="computer-science"/><category term="ai"/><category term="explorables"/><category term="generative-ai"/><category term="llms"/><category term="sam-rose"/><category term="qwen"/></entry><entry><title>Streaming experts</title><link href="https://simonwillison.net/2026/Mar/24/streaming-experts/#atom-tag" rel="alternate"/><published>2026-03-24T05:09:03+00:00</published><updated>2026-03-24T05:09:03+00:00</updated><id>https://simonwillison.net/2026/Mar/24/streaming-experts/#atom-tag</id><summary type="html">
    &lt;p&gt;I wrote about Dan Woods' experiments with &lt;strong&gt;streaming experts&lt;/strong&gt; &lt;a href="https://simonwillison.net/2026/Mar/18/llm-in-a-flash/"&gt;the other day&lt;/a&gt;, the trick where you run larger Mixture-of-Experts models on hardware that doesn't have enough RAM to fit the entire model by instead streaming the necessary expert weights from SSD for each token that you process.&lt;/p&gt;
&lt;p&gt;Five days ago Dan was running Qwen3.5-397B-A17B in 48GB of RAM. Today &lt;a href="https://twitter.com/seikixtc/status/2036246162936910322"&gt;@seikixtc reported&lt;/a&gt; running the colossal Kimi K2.5 - a 1 trillion parameter model with 32B active weights at any one time - in 96GB of RAM on an M2 Max MacBook Pro.&lt;/p&gt;
&lt;p&gt;And &lt;a href="https://twitter.com/anemll/status/2035901335984611412"&gt;@anemll showed&lt;/a&gt; that same Qwen3.5-397B-A17B model running on an iPhone, albeit at just 0.6 tokens/second - &lt;a href="https://github.com/Anemll/flash-moe/tree/iOS-App"&gt;iOS repo here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I think this technique has legs. Dan and his fellow tinkerers are continuing to run &lt;a href="https://simonwillison.net/tags/autoresearch/"&gt;autoresearch loops&lt;/a&gt; in order to find yet more optimizations to squeeze more performance out of these models.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: Now Daniel Isaac &lt;a href="https://twitter.com/danpacary/status/2036480556045836603"&gt;got Kimi K2.5 working&lt;/a&gt; on a 128GB M4 Max at ~1.7 tokens/second.&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/definitions"&gt;definitions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/autoresearch"&gt;autoresearch&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/kimi"&gt;kimi&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;&lt;/p&gt;



</summary><category term="definitions"/><category term="llms"/><category term="ai"/><category term="autoresearch"/><category term="generative-ai"/><category term="kimi"/><category term="local-llms"/><category term="qwen"/></entry><entry><title>Autoresearching Apple's "LLM in a Flash" to run Qwen 397B locally</title><link href="https://simonwillison.net/2026/Mar/18/llm-in-a-flash/#atom-tag" rel="alternate"/><published>2026-03-18T23:56:46+00:00</published><updated>2026-03-18T23:56:46+00:00</updated><id>https://simonwillison.net/2026/Mar/18/llm-in-a-flash/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://twitter.com/danveloper/status/2034353876753592372"&gt;Autoresearching Apple&amp;#x27;s &amp;quot;LLM in a Flash&amp;quot; to run Qwen 397B locally&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Here's a fascinating piece of research by Dan Woods, who managed to get a custom version of &lt;a href="https://huggingface.co/Qwen/Qwen3.5-397B-A17B/tree/main"&gt;Qwen3.5-397B-A17B&lt;/a&gt; running at 5.5+ tokens/second on a 48GB MacBook Pro M3 Max despite that model taking up 209GB (120GB quantized) on disk.&lt;/p&gt;
&lt;p&gt;Qwen3.5-397B-A17B is a Mixture-of-Experts (MoE) model, which means that each token only needs to run against a subset of the overall model weights. These expert weights can be streamed into memory from SSD, saving them from all needing to be held in RAM at the same time.&lt;/p&gt;
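&lt;p&gt;The core trick is easy to sketch. This is my own greatly simplified Python illustration of the idea, not Dan's implementation - the file layout is made up, and the real engineering is in making those SSD reads large, contiguous and overlapped with compute:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from functools import lru_cache

import numpy as np

NUM_ACTIVE = 4   # this setup routes each token through 4 experts

@lru_cache(maxsize=64)               # recently used experts stay in RAM
def load_expert(layer, expert):
    # Read just this expert's weight matrix in from SSD on demand.
    # (Hypothetical file layout - one .npy file per expert.)
    return np.load(f"experts/layer-{layer}-expert-{expert}.npy")

def moe_layer(hidden, layer, router):
    scores = router @ hidden                 # score every expert...
    top = np.argsort(scores)[-NUM_ACTIVE:]   # ...but only load the top 4
    gates = np.exp(scores[top]) / np.exp(scores[top]).sum()
    out = np.zeros_like(hidden)
    for gate, expert in zip(gates, top):
        out = out + gate * (load_expert(layer, int(expert)) @ hidden)
    return out
&lt;/code&gt;&lt;/pre&gt;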
&lt;p&gt;Dan used techniques described in Apple's 2023 paper &lt;a href="https://arxiv.org/abs/2312.11514"&gt;LLM in a flash: Efficient Large Language Model Inference with Limited Memory&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This paper tackles the challenge of efficiently running LLMs that exceed the available DRAM capacity by storing the model parameters in flash memory, but bringing them on demand to DRAM. Our method involves constructing an inference cost model that takes into account the characteristics of flash memory, guiding us to optimize in two critical areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;He fed the paper to Claude Code and used a variant of Andrej Karpathy's &lt;a href="https://simonwillison.net/2026/Mar/13/liquid/"&gt;autoresearch pattern&lt;/a&gt; to have Claude run 90 experiments and produce MLX Objective-C and Metal code that ran the model as efficiently as possible.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/danveloper/flash-moe"&gt;danveloper/flash-moe&lt;/a&gt; has the resulting code plus &lt;a href="https://github.com/danveloper/flash-moe/blob/main/paper/flash_moe.pdf"&gt;a PDF paper&lt;/a&gt; mostly written by Claude Opus 4.6 describing the experiment in full.&lt;/p&gt;
&lt;p&gt;The final model has the experts quantized to 2-bit, but the non-expert parts of the model such as the embedding table and routing matrices are kept at their original precision, adding up to 5.5GB which stays resident in memory while the model is running.&lt;/p&gt;
&lt;p&gt;Qwen 3.5 usually runs 10 experts per token, but this setup dropped that to 4, based on the claim that the biggest quality drop-off didn't occur until 3.&lt;/p&gt;
&lt;p&gt;It's not clear to me how much the quality of the model's results is affected. Claude claimed that "Output quality at 2-bit is indistinguishable from 4-bit for these evaluations", but the description of the evaluations it ran is quite thin.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: Dan's &lt;a href="https://twitter.com/danveloper/status/2034686509748462022"&gt;latest version&lt;/a&gt; upgrades to 4-bit quantization of the experts (209GB on disk, 4.36 tokens/second) after finding that the 2-bit version broke tool calling while 4-bit handles that well.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/autoresearch"&gt;autoresearch&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="qwen"/><category term="mlx"/><category term="autoresearch"/></entry><entry><title>Something is afoot in the land of Qwen</title><link href="https://simonwillison.net/2026/Mar/4/qwen/#atom-tag" rel="alternate"/><published>2026-03-04T15:50:03+00:00</published><updated>2026-03-04T15:50:03+00:00</updated><id>https://simonwillison.net/2026/Mar/4/qwen/#atom-tag</id><summary type="html">
    &lt;p&gt;I'm behind on writing about Qwen 3.5, a truly remarkable family of open weight models released by Alibaba's Qwen team over the past few weeks. I'm hoping that the 3.5 family doesn't turn out to be Qwen's swan song, seeing as that team has had some very high profile departures in the past 24 hours.&lt;/p&gt;
&lt;p&gt;It all started with &lt;a href="https://twitter.com/JustinLin610/status/2028865835373359513"&gt;this tweet&lt;/a&gt; from Junyang Lin (&lt;a href="https://twitter.com/JustinLin610"&gt;@JustinLin610&lt;/a&gt;):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;me stepping down. bye my beloved qwen.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Junyang Lin was the lead researcher building Qwen, and was key to releasing their open weight models from 2024 onwards.&lt;/p&gt;
&lt;p&gt;As far as I can tell, a trigger for this resignation was a re-org within Alibaba where a new researcher hired from Google's Gemini team was put in charge of Qwen, but I've not confirmed that detail.&lt;/p&gt;
&lt;p&gt;More information is available in &lt;a href="https://www.36kr.com/p/3708425301749891"&gt;this article from 36kr.com&lt;/a&gt;. Here's &lt;a href="https://en.wikipedia.org/wiki/36Kr"&gt;Wikipedia on 36Kr&lt;/a&gt; confirming that it's a credible media source established in 2010 with a good track record reporting on the Chinese technology industry.&lt;/p&gt;
&lt;p&gt;The article is in Chinese - here are some quotes translated via Google Translate:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;At approximately 1:00 PM Beijing time on March 4th, Tongyi Lab held an emergency All Hands meeting, where Alibaba Group CEO Wu Yongming frankly told Qianwen employees.&lt;/p&gt;
&lt;p&gt;Twelve hours ago (at 0:11 AM Beijing time on March 4th), Lin Junyang, the technical lead for Alibaba's Qwen Big Data Model, suddenly announced his resignation on X. Lin Junyang was a key figure in promoting Alibaba's open-source AI models and one of Alibaba's youngest P10 employees. Amidst the industry uproar, many members of Qwen were also unable to accept the sudden departure of their team's key figure.&lt;/p&gt;
&lt;p&gt;"Given far fewer resources than competitors, Junyang's leadership is one of the core factors in achieving today's results," multiple Qianwen members told 36Kr. [...]&lt;/p&gt;
&lt;p&gt;Regarding Lin Junyang's whereabouts, no new conclusions were reached at the meeting. However, around 2 PM, Lin Junyang posted again on his WeChat Moments, stating, "Brothers of Qwen, continue as originally planned, no problem," without explicitly confirming whether he would return. [...]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That piece also lists several other key members who have apparently resigned:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;With Lin Junyang's departure, several other Qwen members also announced their departure, including core leaders responsible for various sub-areas of Qwen models, such as:&lt;/p&gt;
&lt;p&gt;Binyuan Hui: Lead Qwen code development, principal of the Qwen-Coder series models, responsible for the entire agent training process from pre-training to post-training, and recently involved in robotics research.&lt;/p&gt;
&lt;p&gt;Bowen Yu: Lead Qwen post-training research, graduated from the University of Chinese Academy of Sciences, leading the development of the Qwen-Instruct series models.&lt;/p&gt;
&lt;p&gt;Kaixin Li: Core contributor to Qwen 3.5/VL/Coder, PhD from the National University of Singapore.&lt;/p&gt;
&lt;p&gt;Besides the aforementioned individuals, many young researchers also resigned on the same day.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Based on the above it looks to me like everything is still very much up in the air. The presence of Alibaba's CEO at the "emergency All Hands meeting" suggests that the company understands the significance of these resignations and may yet retain some of the departing talent.&lt;/p&gt;
&lt;h4 id="qwen-3-5-is-exceptional"&gt;Qwen 3.5 is exceptional&lt;/h4&gt;
&lt;p&gt;This story hits particularly hard right now because the Qwen 3.5 models appear to be &lt;em&gt;exceptionally&lt;/em&gt; good.&lt;/p&gt;
&lt;p&gt;I've not spent enough time with them yet but the scale of the new model family is impressive. They started with &lt;a href="https://simonwillison.net/2026/Feb/17/qwen35/"&gt;Qwen3.5-397B-A17B on February 17th&lt;/a&gt; - an 807GB model - and then followed with &lt;a href="https://huggingface.co/collections/Qwen/qwen35"&gt;a flurry of smaller siblings&lt;/a&gt; in 122B, 35B, 27B, 9B, 4B, 2B, 0.8B sizes.&lt;/p&gt;
&lt;p&gt;I'm hearing positive noises about the 27B and 35B models for coding tasks that still fit on a 32GB/64GB Mac, and I've tried the 9B, 4B and 2B models and found them to be notably effective considering their tiny sizes. That 2B model is just 4.57GB - or as small as 1.27GB quantized - and is a full reasoning and multi-modal (vision) model.&lt;/p&gt;
&lt;p&gt;It would be a real tragedy if the Qwen team were to disband now, given their proven track record in continuing to find new ways to get high quality results out of smaller and smaller models.&lt;/p&gt;
&lt;p&gt;If those core Qwen team members either start something new or join another research lab I'm excited to see what they do next.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="qwen"/><category term="ai-in-china"/></entry><entry><title>Qwen3.5: Towards Native Multimodal Agents</title><link href="https://simonwillison.net/2026/Feb/17/qwen35/#atom-tag" rel="alternate"/><published>2026-02-17T04:30:57+00:00</published><updated>2026-02-17T04:30:57+00:00</updated><id>https://simonwillison.net/2026/Feb/17/qwen35/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://qwen.ai/blog?id=qwen3.5"&gt;Qwen3.5: Towards Native Multimodal Agents&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Alibaba's Qwen just released the first two models in the Qwen 3.5 series - one open weights, one proprietary. Both are multi-modal for vision input.&lt;/p&gt;
&lt;p&gt;The open weight one is a Mixture of Experts model called Qwen3.5-397B-A17B. Interesting to see Qwen call out serving efficiency as a benefit of that architecture:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Built on an innovative hybrid architecture that fuses linear attention (via Gated Delta Networks) with a sparse mixture-of-experts, the model attains remarkable inference efficiency: although it comprises 397 billion total parameters, just 17 billion are activated per forward pass, optimizing both speed and cost without sacrificing capability.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It's &lt;a href="https://huggingface.co/Qwen/Qwen3.5-397B-A17B"&gt;807GB on Hugging Face&lt;/a&gt;, and Unsloth have a &lt;a href="https://huggingface.co/unsloth/Qwen3.5-397B-A17B-GGUF"&gt;collection of smaller GGUFs&lt;/a&gt; ranging in size from 94.2GB at 1-bit to 462GB at Q8_K_XL.&lt;/p&gt;
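&lt;p&gt;Those file sizes make more sense converted to effective bits per parameter - here's the back-of-the-envelope arithmetic in Python:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;params = 397e9

# File sizes from the Hugging Face repos, in decimal GB
for label, gigabytes in [("original", 807), ('"1-bit" GGUF', 94.2), ("Q8_K_XL GGUF", 462)]:
    print(f"{label}: {gigabytes * 1e9 * 8 / params:.1f} bits per parameter")
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That works out to roughly 16.3, 1.9 and 9.3 bits per parameter respectively - the "1-bit" file lands closer to 2 bits, presumably because these formats keep some tensors at higher precision and store per-block scales.&lt;/p&gt;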
&lt;p&gt;I got this &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;pelican&lt;/a&gt; from the &lt;a href="https://openrouter.ai/qwen/qwen3.5-397b-a17b"&gt;OpenRouter hosted model&lt;/a&gt; (&lt;a href="https://gist.github.com/simonw/625546cf6b371f9c0040e64492943b82"&gt;transcript&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&lt;img alt="Pelican is quite good although the neck lacks an outline for some reason. Bicycle is very basic with an incomplete frame" src="https://static.simonwillison.net/static/2026/qwen3.5-397b-a17b.png" /&gt;&lt;/p&gt;
&lt;p&gt;The proprietary hosted model is called Qwen3.5 Plus 2026-02-15, and its relationship to the open model is a little confusing. Qwen researcher &lt;a href="https://twitter.com/JustinLin610/status/2023340126479569140"&gt;Junyang Lin says&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Qwen3-Plus is a hosted API version of 397B. As the model natively supports 256K tokens, Qwen3.5-Plus supports 1M token context length. Additionally it supports search and code interpreter, which you can use on Qwen Chat with Auto mode.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/9507dd47483f78dc1195117735273e20"&gt;its pelican&lt;/a&gt;, which is similar in quality to the open weights model:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Similar quality pelican. The bicycle is taller and has a better frame shape. They are visually quite similar." src="https://static.simonwillison.net/static/2026/qwen3.5-plus-02-15.png" /&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="vision-llms"/><category term="qwen"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="openrouter"/><category term="ai-in-china"/></entry><entry><title>Qwen3-TTS Family is Now Open Sourced: Voice Design, Clone, and Generation</title><link href="https://simonwillison.net/2026/Jan/22/qwen3-tts/#atom-tag" rel="alternate"/><published>2026-01-22T17:42:34+00:00</published><updated>2026-01-22T17:42:34+00:00</updated><id>https://simonwillison.net/2026/Jan/22/qwen3-tts/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://qwen.ai/blog?id=qwen3tts-0115"&gt;Qwen3-TTS Family is Now Open Sourced: Voice Design, Clone, and Generation&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I haven't been paying much attention to the state-of-the-art in speech generation models other than noting that they've got &lt;em&gt;really good&lt;/em&gt;, so I can't speak to how notable this new release from Qwen is.&lt;/p&gt;
&lt;p&gt;From &lt;a href="https://github.com/QwenLM/Qwen3-TTS/blob/main/assets/Qwen3_TTS.pdf"&gt;the accompanying paper&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In this report, we present the Qwen3-TTS series, a family of advanced multilingual, controllable, robust, and streaming text-to-speech models. Qwen3-TTS supports state-of-the-art 3-second voice cloning and description-based control, allowing both the creation of entirely novel voices and fine-grained manipulation over the output speech. Trained on over 5 million hours of speech data spanning 10 languages, Qwen3-TTS adopts a dual-track LM architecture for real-time synthesis [...]. Extensive experiments indicate state-of-the-art performance across diverse objective and subjective benchmark (e.g., TTS multilingual test set, InstructTTSEval, and our long speech test set). To facilitate community research and development, we release both tokenizers and models under the Apache 2.0 license.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;To give an idea of size, &lt;a href="https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Base"&gt;Qwen/Qwen3-TTS-12Hz-1.7B-Base&lt;/a&gt; is 4.54GB on Hugging Face and &lt;a href="https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-Base"&gt;Qwen/Qwen3-TTS-12Hz-0.6B-Base&lt;/a&gt; is 2.52GB.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://huggingface.co/spaces/Qwen/Qwen3-TTS"&gt;Hugging Face demo&lt;/a&gt; lets you try out the 0.6B and 1.7B models for free in your browser, including voice cloning:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of a Qwen3-TTS voice cloning web interface with three tabs at top: &amp;quot;Voice Design&amp;quot;, &amp;quot;Voice Clone (Base)&amp;quot; (selected), and &amp;quot;TTS (CustomVoice)&amp;quot;. The page is titled &amp;quot;Clone Voice from Reference Audio&amp;quot; and has two main sections. Left section: &amp;quot;Reference Audio (Upload a voice sample clone)&amp;quot; showing an audio waveform player at 0:00/0:34 with playback controls, upload and microphone icons, followed by &amp;quot;Reference Text (Transcript of the reference audio)&amp;quot; containing three paragraphs: &amp;quot;Simon Willison is the creator of Datasette, an open source tool for exploring and publishing data. He currently works full-time building open source tools for data journalism, built around Datasette and SQLite. Prior to becoming an independent open source developer, Simon was an engineering director at Eventbrite. Simon joined Eventbrite through their acquisition of Lanyrd, a Y Combinator funded company he co-founded in 2010. He is a co-creator of the Django Web Framework, and has been blogging about web development and programming since 2002 at simonwillison.net&amp;quot;. Right section: &amp;quot;Target Text (Text to synthesize with cloned voice)&amp;quot; containing text about Qwen3-TTS speech generation capabilities, with &amp;quot;Language&amp;quot; dropdown set to &amp;quot;Auto&amp;quot; and &amp;quot;Model Size&amp;quot; dropdown set to &amp;quot;1.7B&amp;quot;, and a purple &amp;quot;Clone &amp;amp; Generate&amp;quot; button at bottom." src="https://static.simonwillison.net/static/2026/qwen-voice-clone.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;I tried this out by recording myself reading &lt;a href="https://simonwillison.net/about/"&gt;my about page&lt;/a&gt; and then having Qwen3-TTS generate audio of me reading the Qwen3-TTS announcement post. Here's the result:&lt;/p&gt;
&lt;p&gt;&lt;audio controls style="width: 100%"&gt;
  &lt;source src="https://static.simonwillison.net/static/2026/qwen-tts-clone.wav" type="audio/wav"&gt;
  Your browser does not support the audio element.
&lt;/audio&gt;&lt;/p&gt;
&lt;p&gt;It's important that everyone understands that voice cloning is now something that's available to anyone with a GPU and a few GBs of VRAM... or in this case a web browser that can access Hugging Face.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: Prince Canuma &lt;a href="https://x.com/Prince_Canuma/status/2014453857019904423"&gt;got this working&lt;/a&gt; with his &lt;a href="https://pypi.org/project/mlx-audio/"&gt;mlx-audio&lt;/a&gt; library. I &lt;a href="https://claude.ai/share/2e01ad60-ca38-4e14-ab60-74eaa45b2fbd"&gt;had Claude&lt;/a&gt; turn that into &lt;a href="https://github.com/simonw/tools/blob/main/python/q3_tts.py"&gt;a CLI tool&lt;/a&gt; which you can run with &lt;code&gt;uv&lt;/code&gt; like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uv run https://tools.simonwillison.net/python/q3_tts.py \
  'I am a pirate, give me your gold!' \
  -i 'gruff voice' -o pirate.wav
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;-i&lt;/code&gt; option lets you use a prompt to describe the voice it should use. On first run this downloads a 4.5GB model file from Hugging Face.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=46719229"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/text-to-speech"&gt;text-to-speech&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/hugging-face"&gt;hugging-face&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prince-canuma"&gt;prince-canuma&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="text-to-speech"/><category term="ai"/><category term="generative-ai"/><category term="hugging-face"/><category term="uv"/><category term="qwen"/><category term="mlx"/><category term="prince-canuma"/><category term="ai-in-china"/></entry><entry><title>Quoting Qwen3-VL Technical Report</title><link href="https://simonwillison.net/2025/Nov/27/qwen3-vl-technical-report/#atom-tag" rel="alternate"/><published>2025-11-27T17:01:11+00:00</published><updated>2025-11-27T17:01:11+00:00</updated><id>https://simonwillison.net/2025/Nov/27/qwen3-vl-technical-report/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://arxiv.org/abs/2511.21631"&gt;&lt;p&gt;To evaluate the model’s capability in processing long-context inputs, we construct a video “Needle-in-
a-Haystack” evaluation on Qwen3-VL-235B-A22B-Instruct. In this task, a semantically salient “needle”
frame—containing critical visual evidence—is inserted at varying temporal positions within a long video.
The model is then tasked with accurately locating the target frame from the long video and answering the
corresponding question. [...]&lt;/p&gt;
&lt;p&gt;As shown in Figure 3, the model achieves a perfect 100% accuracy on videos up to 30 minutes in
duration—corresponding to a context length of 256K tokens. Remarkably, even when extrapolating to
sequences of up to 1M tokens (approximately 2 hours of video) via YaRN-based positional extension,
the model retains a high accuracy of 99.5%.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://arxiv.org/abs/2511.21631"&gt;Qwen3-VL Technical Report&lt;/a&gt;, 5.12.3: Needle-in-a-Haystack&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/evals"&gt;evals&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="vision-llms"/><category term="evals"/><category term="qwen"/><category term="ai-in-china"/></entry><entry><title>Qwen3-VL: Sharper Vision, Deeper Thought, Broader Action</title><link href="https://simonwillison.net/2025/Sep/23/qwen3-vl/#atom-tag" rel="alternate"/><published>2025-09-23T23:51:08+00:00</published><updated>2025-09-23T23:51:08+00:00</updated><id>https://simonwillison.net/2025/Sep/23/qwen3-vl/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://qwen.ai/blog?id=99f0335c4ad9ff6153e517418d48535ab6d8afef&amp;amp;from=research.latest-advancements-list"&gt;Qwen3-VL: Sharper Vision, Deeper Thought, Broader Action&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I've been looking forward to this. Qwen 2.5 VL is one of the best available open weight vision LLMs, so I had high hopes for Qwen 3's vision models.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Firstly, we are open-sourcing the flagship model of this series: Qwen3-VL-235B-A22B, available in both Instruct and Thinking versions. The Instruct version matches or even exceeds Gemini 2.5 Pro in major visual perception benchmarks. The Thinking version achieves state-of-the-art results across many multimodal reasoning benchmarks.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Bold claims against Gemini 2.5 Pro, which are supported by a flurry of self-reported benchmarks.&lt;/p&gt;
&lt;p&gt;This initial model is &lt;em&gt;enormous&lt;/em&gt;. On Hugging Face both &lt;a href="https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct"&gt;Qwen3-VL-235B-A22B-Instruct&lt;/a&gt; and &lt;a href="https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Thinking"&gt;Qwen3-VL-235B-A22B-Thinking&lt;/a&gt; are 235B parameters and weigh 471GB. Not something I'm going to be able to run on my 64GB Mac!&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://huggingface.co/collections/Qwen/qwen25-vl-6795ffac22b334a837c0f9a5"&gt;Qwen 2.5 VL family&lt;/a&gt; included models at 72B, 32B, 7B and 3B sizes. Given the rate Qwen are shipping models at the moment I wouldn't be surprised to see smaller Qwen 3 VL models show up in just the next few days.&lt;/p&gt;
&lt;p&gt;Also from Qwen today, three new API-only closed-weight models: &lt;a href="https://x.com/Alibaba_Qwen/status/1970582211993927774"&gt;upgraded Qwen 3 Coder&lt;/a&gt;, &lt;a href="https://qwen.ai/blog?id=4266edf7f3718f2d3fda098b3f4c48f3573215d0&amp;amp;from=home.latest-research-list"&gt;Qwen3-LiveTranslate-Flash&lt;/a&gt; (real-time multimodal interpretation), and &lt;a href="https://qwen.ai/blog?id=241398b9cd6353de490b0f82806c7848c5d2777d&amp;amp;from=research.latest-advancements-list"&gt;Qwen3-Max&lt;/a&gt;, their new trillion parameter flagship model, which they describe as their "largest and most capable model to date".&lt;/p&gt;
&lt;p&gt;Plus &lt;a href="https://twitter.com/Alibaba_Qwen/status/1970510193537753397"&gt;Qwen3Guard&lt;/a&gt;, a "safety moderation model series" that looks similar in purpose to Meta's &lt;a href="https://www.llama.com/docs/model-cards-and-prompt-formats/llama-guard-3/"&gt;Llama Guard&lt;/a&gt;. This one is open weights (Apache 2.0) and comes in 8B, 4B and 0.6B sizes &lt;a href="https://huggingface.co/collections/Qwen/qwen3guard-68d2729abbfae4716f3343a1"&gt;on Hugging Face&lt;/a&gt;. There's more information in the &lt;a href="https://github.com/QwenLM/Qwen3Guard"&gt;QwenLM/Qwen3Guard&lt;/a&gt; GitHub repo.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=45352672"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="vision-llms"/><category term="qwen"/><category term="llm-reasoning"/><category term="llm-release"/><category term="ai-in-china"/></entry><entry><title>Four new releases from Qwen</title><link href="https://simonwillison.net/2025/Sep/22/qwen/#atom-tag" rel="alternate"/><published>2025-09-22T21:51:20+00:00</published><updated>2025-09-22T21:51:20+00:00</updated><id>https://simonwillison.net/2025/Sep/22/qwen/#atom-tag</id><summary type="html">
    &lt;p&gt;It's been an &lt;em&gt;extremely&lt;/em&gt; busy day for team Qwen. Within the last 24 hours (all links to Twitter, which seems to be their preferred platform for these announcements):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://twitter.com/Alibaba_Qwen/status/1970052154330353857"&gt;Qwen3-Next-80B-A3B-Instruct-FP8 and Qwen3-Next-80B-A3B-Thinking-FP8&lt;/a&gt; - official FP8 quantized versions of their &lt;a href="https://huggingface.co/collections/Qwen/qwen3-next-68c25fd6838e585db8eeea9d"&gt;Qwen3-Next&lt;/a&gt; models. On Hugging Face &lt;a href="https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct/tree/main"&gt;Qwen3-Next-80B-A3B-Instruct&lt;/a&gt; is 163GB and &lt;a href="https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct-FP8/tree/main"&gt;Qwen3-Next-80B-A3B-Instruct-FP8&lt;/a&gt; is 82.1GB. I wrote &lt;a href="https://simonwillison.net/2025/Sep/12/qwen3-next/"&gt;about Qwen3-Next on Friday 12th September&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://twitter.com/Alibaba_Qwen/status/1970163551676592430"&gt;Qwen3-TTS-Flash&lt;/a&gt; provides "multi-timbre, multi-lingual, and multi-dialect speech synthesis" according to &lt;a href="https://qwen.ai/blog?id=b4264e11fb80b5e37350790121baf0a0f10daf82&amp;amp;from=research.latest-advancements-list"&gt;their blog announcement&lt;/a&gt;. It's not available as open weights, you have to access it via their API instead. Here's &lt;a href="https://huggingface.co/spaces/Qwen/Qwen3-TTS-Demo"&gt;a free live demo&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://twitter.com/Alibaba_Qwen/status/1970181599133344172"&gt;Qwen3-Omni&lt;/a&gt; is today's most exciting announcement: a brand new 30B parameter "omni" model supporting text, audio and video input and text and audio output! You can &lt;a href="https://chat.qwen.ai/?models=qwen3-omni-flash"&gt;try it on chat.qwen.ai&lt;/a&gt; by selecting the "Use voice and video chat" icon - you'll need to be signed in using Google or GitHub. This one &lt;em&gt;is&lt;/em&gt; open weights, as Apache 2.0 Qwen3-Omni-30B-A3B-Instruct, Qwen/Qwen3-Omni-30B-A3B-Thinking, and Qwen3-Omni-30B-A3B-Captioner &lt;a href="https://huggingface.co/collections/Qwen/qwen3-omni-68d100a86cd0906843ceccbe"&gt;on HuggingFace&lt;/a&gt;. That Instruct model is 70.5GB so this should be relatively accessible for running on expensive home devices.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://twitter.com/Alibaba_Qwen/status/1970189775467647266"&gt;Qwen-Image-Edit-2509&lt;/a&gt; is an updated version of their excellent Qwen-Image-Edit model which &lt;a href="https://simonwillison.net/2025/Aug/19/qwen-image-edit/"&gt;I first tried last month&lt;/a&gt;. Their &lt;a href="https://qwen.ai/blog?id=7a90090115ee193ce6a7f619522771dd9696dd93&amp;amp;from=research.latest-advancements-list"&gt;blog post&lt;/a&gt; calls it "the monthly iteration of Qwen-Image-Edit" so I guess they're planning more frequent updates. The new model adds multi-image inputs. I &lt;a href="https://chat.qwen.ai/s/c5f640da-8c36-4c95-98dd-878b47a6e759?fev=0.0.212"&gt;used it via chat.qwen.ai&lt;/a&gt; to turn a photo of our dog into a dragon in the style of one of Natalie's ceramic pots.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img alt="A photo of the back of a pottery stand at a local art fair. A blue dragon is asleep on a rug, wearing a dog harness, with striking turquoise scales." src="https://static.simonwillison.net/static/2025/qwen-dragon.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;Here's the prompt I used, feeding in two separate images. Weirdly it used the edges of the landscape photo to fill in the gaps on the otherwise portrait output. It turned the chair seat into a bowl too!&lt;/p&gt;
&lt;p&gt;&lt;img alt="A photo of a dog asleep on a rug at the pottery stand. Another photo of a very attractive ceramic pot with turquoise glaze. The prompt: edit the photo of the sleeping dog to turn her into a sleeping dragon with scales like this glazed bowl" src="https://static.simonwillison.net/static/2025/qwen-dragon-input.jpg" /&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/text-to-speech"&gt;text-to-speech&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/multi-modal-output"&gt;multi-modal-output&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;&lt;/p&gt;



</summary><category term="text-to-speech"/><category term="ai"/><category term="qwen"/><category term="llms"/><category term="multi-modal-output"/><category term="llm-release"/><category term="ai-in-china"/><category term="generative-ai"/></entry><entry><title>Qwen3-Next-80B-A3B: 🐧🦩 Who needs legs?!</title><link href="https://simonwillison.net/2025/Sep/12/qwen3-next/#atom-tag" rel="alternate"/><published>2025-09-12T04:07:32+00:00</published><updated>2025-09-12T04:07:32+00:00</updated><id>https://simonwillison.net/2025/Sep/12/qwen3-next/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://x.com/Alibaba_Qwen/status/1966197643904000262"&gt;Qwen3-Next-80B-A3B&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Qwen announced two new models via their Twitter account (and here's &lt;a href="https://qwen.ai/blog?id=4074cca80393150c248e508aa62983f9cb7d27cd&amp;amp;from=research.latest-advancements-list"&gt;their blog&lt;/a&gt;): &lt;a href="https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct"&gt;Qwen3-Next-80B-A3B-Instruct&lt;/a&gt; and &lt;a href="https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Thinking"&gt;Qwen3-Next-80B-A3B-Thinking&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;They make some big claims on performance:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Qwen3-Next-80B-A3B-Instruct approaches our 235B flagship.&lt;/li&gt;
&lt;li&gt;Qwen3-Next-80B-A3B-Thinking outperforms Gemini-2.5-Flash-Thinking.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;The name "80B-A3B" indicates 80 billion parameters of which only 3 billion are active at a time. You still need to have enough GPU-accessible RAM to hold all 80 billion in memory at once but only 3 billion will be used for each round of inference, which provides a &lt;em&gt;significant&lt;/em&gt; speedup in responding to prompts.&lt;/p&gt;
&lt;p&gt;More details from their tweet:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;80B params, but only 3B activated per token → 10x cheaper training, 10x faster inference than Qwen3-32B.(esp. @ 32K+ context!)&lt;/li&gt;
&lt;li&gt;Hybrid Architecture: Gated DeltaNet + Gated Attention → best of speed &amp;amp; recall&lt;/li&gt;
&lt;li&gt;Ultra-sparse MoE: 512 experts, 10 routed + 1 shared&lt;/li&gt;
&lt;li&gt;Multi-Token Prediction → turbo-charged speculative decoding&lt;/li&gt;
&lt;li&gt;Beats Qwen3-32B in perf, rivals Qwen3-235B in reasoning &amp;amp; long-context&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;The models on Hugging Face are around 150GB each so I decided to try them out via &lt;a href="https://openrouter.ai/"&gt;OpenRouter&lt;/a&gt; rather than on my own laptop (&lt;a href="https://openrouter.ai/qwen/qwen3-next-80b-a3b-thinking"&gt;Thinking&lt;/a&gt;, &lt;a href="https://openrouter.ai/qwen/qwen3-next-80b-a3b-instruct"&gt;Instruct&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;I used my &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt; plugin. I installed it like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-openrouter
llm keys set openrouter
# paste key here
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then found the model IDs with this command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm models -q next
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Which output:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;OpenRouter: openrouter/qwen/qwen3-next-80b-a3b-thinking
OpenRouter: openrouter/qwen/qwen3-next-80b-a3b-instruct
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I have an LLM &lt;a href="https://llm.datasette.io/en/stable/templates.html"&gt;prompt template&lt;/a&gt; saved called &lt;code&gt;pelican-svg&lt;/code&gt; which I created like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm "Generate an SVG of a pelican riding a bicycle" --save pelican-svg
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This means I can run &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;my pelican benchmark&lt;/a&gt; like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -t pelican-svg -m openrouter/qwen/qwen3-next-80b-a3b-thinking
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -t pelican-svg -m openrouter/qwen/qwen3-next-80b-a3b-instruct
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's the &lt;a href="https://gist.github.com/simonw/d1a0d0ff719d609bc6fad2e133e7cbe9"&gt;thinking model output&lt;/a&gt; (exported with &lt;code&gt;llm logs -c | pbcopy&lt;/code&gt; after I ran the prompt):&lt;/p&gt;
&lt;p&gt;&lt;img alt="The bicycle is too simple and way too wide. The pelican is two circles, two orange triangular feed and a big triangle for the beak." src="https://static.simonwillison.net/static/2025/qwen3-next-80b-a3b-thinking.png" /&gt;&lt;/p&gt;
&lt;p&gt;I enjoyed the "Whimsical style with smooth curves and friendly proportions (no anatomical accuracy needed for bicycle riding!)" note in &lt;a href="https://gist.github.com/simonw/d1a0d0ff719d609bc6fad2e133e7cbe9#prompt"&gt;the transcript&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The instruct (non-reasoning) model &lt;a href="https://gist.github.com/simonw/cc740a45beed5655faffa69da1e999f5"&gt;gave me this&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Blue background, brown ground, bicycle looks more like a wheelchair, pelican is actually quite good though - has thin grey wings and a perky yellow long triangular beak. Above the pelican is the caption Who needs legs?! with an emoji sequence of penguin then flamingo." src="https://static.simonwillison.net/static/2025/qwen3-next-80b-a3b-instruct.png" /&gt;&lt;/p&gt;
&lt;p&gt;"🐧🦩 Who needs legs!?" indeed! I like that penguin-flamingo emoji sequence it's decided on for pelicans.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="qwen"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/><category term="openrouter"/><category term="ai-in-china"/></entry><entry><title>Defeating Nondeterminism in LLM Inference</title><link href="https://simonwillison.net/2025/Sep/11/defeating-nondeterminism/#atom-tag" rel="alternate"/><published>2025-09-11T06:53:42+00:00</published><updated>2025-09-11T06:53:42+00:00</updated><id>https://simonwillison.net/2025/Sep/11/defeating-nondeterminism/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/"&gt;Defeating Nondeterminism in LLM Inference&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
A very common question I see about LLMs concerns why they can't be made to deliver the same response to the same prompt by setting a fixed random number seed.&lt;/p&gt;
&lt;p&gt;Like many others I had been led to believe this was due to the non-associative nature of floating point arithmetic, where &lt;code&gt;(a + b) + c ≠ a + (b + c)&lt;/code&gt;, combined with unpredictable calculation orders on concurrent GPUs. This new paper calls that the "concurrency + floating point hypothesis":&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;One common hypothesis is that some combination of floating-point non-associativity and concurrent execution leads to nondeterminism based on which concurrent core finishes first. We will call this the “concurrency + floating point” hypothesis for LLM inference nondeterminism.&lt;/p&gt;
&lt;/blockquote&gt;
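&lt;p&gt;The floating point half of that hypothesis is real, and easy to demonstrate for yourself in Python:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import random

a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))          # False!

# The same effect at reduction scale: summing identical values in a
# different order - as different kernel schedules or batch splits
# would - usually produces bitwise-different results.
values = [random.uniform(-1, 1) for _ in range(100_000)]
print(sum(values) == sum(sorted(values)))  # almost always False
&lt;/code&gt;&lt;/pre&gt;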
&lt;p&gt;It then convincingly argues that this is &lt;em&gt;not&lt;/em&gt; the core of the problem, because "in the typical forward pass of an LLM, there is usually not a single atomic add present."&lt;/p&gt;
&lt;p&gt;Why are LLMs so often non-deterministic then?&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[...] &lt;strong&gt;the primary reason nearly all LLM inference endpoints are nondeterministic is that the load (and thus batch-size) nondeterministically varies!&lt;/strong&gt; This nondeterminism is not unique to GPUs — LLM inference endpoints served from CPUs or TPUs will also have this source of nondeterminism.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The &lt;a href="https://github.com/thinking-machines-lab/batch_invariant_ops"&gt;thinking-machines-lab/batch_invariant_ops&lt;/a&gt; code that accompanies this paper addresses this by providing a PyTorch implementation of invariant kernels and demonstrates them running Qwen3-8B deterministically under vLLM.&lt;/p&gt;
&lt;p&gt;This paper is the first public output from Thinking Machines, the AI lab founded in February 2025 by Mira Murati, OpenAI's former CTO (and interim CEO for &lt;a href="https://openai.com/index/openai-announces-leadership-transition/"&gt;a few days&lt;/a&gt;). It's unrelated to &lt;a href="https://en.m.wikipedia.org/wiki/Thinking_Machines_Corporation"&gt;Thinking Machines Corporation&lt;/a&gt;, the last employer of Richard Feynman (as described in this &lt;a href="https://longnow.org/ideas/richard-feynman-and-the-connection-machine/"&gt;most excellent story by Danny Hillis&lt;/a&gt;).&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=45200925"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pytorch"&gt;pytorch&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="pytorch"/><category term="generative-ai"/><category term="llms"/><category term="qwen"/></entry><entry><title>Qwen-Image-Edit: Image Editing with Higher Quality and Efficiency</title><link href="https://simonwillison.net/2025/Aug/19/qwen-image-edit/#atom-tag" rel="alternate"/><published>2025-08-19T23:39:19+00:00</published><updated>2025-08-19T23:39:19+00:00</updated><id>https://simonwillison.net/2025/Aug/19/qwen-image-edit/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://qwenlm.github.io/blog/qwen-image-edit/"&gt;Qwen-Image-Edit: Image Editing with Higher Quality and Efficiency&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
As promised in their &lt;a href="https://simonwillison.net/2025/Aug/4/qwen-image/"&gt;August 4th release&lt;/a&gt; of the Qwen image generation model, Qwen have now followed it up with a separate model, &lt;code&gt;Qwen-Image-Edit&lt;/code&gt;, which can take an image and a prompt and return an edited version of that image.&lt;/p&gt;
&lt;p&gt;Ivan Fioravanti upgraded his macOS &lt;a href="https://github.com/ivanfioravanti/qwen-image-mps"&gt;qwen-image-mps&lt;/a&gt; tool (&lt;a href="https://simonwillison.net/2025/Aug/11/qwen-image-mps/"&gt;previously&lt;/a&gt;) to run the new model via a new &lt;code&gt;edit&lt;/code&gt; command. Since it's now &lt;a href="https://pypi.org/project/qwen-image-mps/"&gt;on PyPI&lt;/a&gt; you can run it directly using &lt;code&gt;uvx&lt;/code&gt; like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uvx qwen-image-mps edit -i pelicans.jpg \
  -p 'Give the pelicans rainbow colored plumage' -s 10
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Be warned... it downloads a 54GB model file (to &lt;code&gt;~/.cache/huggingface/hub/models--Qwen--Qwen-Image-Edit&lt;/code&gt;) and appears to use &lt;strong&gt;all 64GB&lt;/strong&gt; of my system memory - I had to quit almost everything else on my system to give it space to run, and with less than 64GB it likely won't work at all. You really want an even larger machine for this.&lt;/p&gt;
&lt;p&gt;I fed it this image:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Pelicans on a rock" src="https://static.simonwillison.net/static/2025/pelicans-plumage-original.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;The following prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Give the pelicans rainbow colored plumage&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And told it to use just 10 inference steps - the default is 50, but I didn't want to wait that long.&lt;/p&gt;
&lt;p&gt;It still took nearly 25 minutes (on a 64GB M2 MacBook Pro) to produce this result:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Pelicans on a rock now with rainbow feathers - but they look less realistic" src="https://static.simonwillison.net/static/2025/pelicans-plumage-edited.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;To get a feel for how much dropping the inference steps affected things, I tried the same prompt with the new "Image Edit" mode of Qwen's &lt;a href="https://chat.qwen.ai/"&gt;chat.qwen.ai&lt;/a&gt;, which I believe uses the same model. It gave me a result &lt;em&gt;much faster&lt;/em&gt;, which looked like this:&lt;/p&gt;
&lt;p&gt;&lt;img alt="The pelicans are now almost identical in realism to the original photo but still have rainbow plumage." src="https://static.simonwillison.net/static/2025/pelicans-plumage-edited-full.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: I left the command running overnight without the &lt;code&gt;-s 10&lt;/code&gt; option - so it would use all 50 steps - and my laptop took 2 hours and 59 minutes to generate this image, which is much more photo-realistic and similar to the one produced by Qwen's hosted model:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Again, photo-realistic pelicans with rainbow plumage. Very similar to the original photo but with more rainbow feathers." src="https://static.simonwillison.net/static/2025/pelicans-plumage-50.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;Marko Simic &lt;a href="https://twitter.com/simicvm/status/1958192059350692156"&gt;reported&lt;/a&gt; that:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;50 steps took 49min on my MBP M4 Max 128GB&lt;/p&gt;
&lt;/blockquote&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/macos"&gt;macos&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/text-to-image"&gt;text-to-image&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ivan-fioravanti"&gt;ivan-fioravanti&lt;/a&gt;&lt;/p&gt;



</summary><category term="macos"/><category term="python"/><category term="ai"/><category term="generative-ai"/><category term="uv"/><category term="qwen"/><category term="text-to-image"/><category term="ivan-fioravanti"/></entry><entry><title>qwen-image-mps</title><link href="https://simonwillison.net/2025/Aug/11/qwen-image-mps/#atom-tag" rel="alternate"/><published>2025-08-11T06:19:02+00:00</published><updated>2025-08-11T06:19:02+00:00</updated><id>https://simonwillison.net/2025/Aug/11/qwen-image-mps/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/ivanfioravanti/qwen-image-mps"&gt;qwen-image-mps&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Ivan Fioravanti built this Python CLI script for running the &lt;a href="https://huggingface.co/Qwen/Qwen-Image"&gt;Qwen/Qwen-Image&lt;/a&gt; image generation model on an Apple silicon Mac, optionally using the &lt;a href="https://github.com/ModelTC/Qwen-Image-Lightning"&gt;Qwen-Image-Lightning&lt;/a&gt; LoRA to dramatically speed up generation.&lt;/p&gt;
&lt;p&gt;Ivan has tested it on 512GB and 128GB machines and it ran &lt;a href="https://x.com/ivanfioravanti/status/1954646355458269562"&gt;really fast&lt;/a&gt; - 42 seconds on his M3 Ultra. I've run it on my 64GB M2 MacBook Pro - after quitting almost everything else - and it just about manages to output images after pegging my GPU (fans whirring, keyboard heating up) and occupying 60GB of my available RAM. With the LoRA option, running the script to generate an image took 9m7s on my machine.&lt;/p&gt;
&lt;p&gt;Ivan merged &lt;a href="https://github.com/ivanfioravanti/qwen-image-mps/pull/3"&gt;my PR&lt;/a&gt; adding inline script dependencies for &lt;a href="https://github.com/astral-sh/uv"&gt;uv&lt;/a&gt; which means you can now run it like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uv run https://raw.githubusercontent.com/ivanfioravanti/qwen-image-mps/refs/heads/main/qwen-image-mps.py \
-p 'A vintage coffee shop full of raccoons, in a neon cyberpunk city' -f
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first time I ran this it downloaded the 57.7GB model from Hugging Face and stored it in my &lt;code&gt;~/.cache/huggingface/hub/models--Qwen--Qwen-Image&lt;/code&gt; directory. The &lt;code&gt;-f&lt;/code&gt; option fetched an extra 1.7GB &lt;code&gt;Qwen-Image-Lightning-8steps-V1.0.safetensors&lt;/code&gt; file to my working directory that sped up the generation.&lt;/p&gt;
&lt;p&gt;Here's the resulting image:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Digital artwork of a cyberpunk-style coffee shop populated entirely by raccoons as customers, with illegible neon signs visible in the windows, pendant lighting over the counter, menu boards on the wall, bottles on shelves behind the bar, and raccoons sitting at tables and the counter with coffee cups" src="https://static.simonwillison.net/static/2025/racoon-cyberpunk-coffee.jpg" /&gt;&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://x.com/ivanfioravanti/status/1954284146064576966"&gt;@ivanfioravanti&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/macos"&gt;macos&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/text-to-image"&gt;text-to-image&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ivan-fioravanti"&gt;ivan-fioravanti&lt;/a&gt;&lt;/p&gt;



</summary><category term="macos"/><category term="python"/><category term="ai"/><category term="generative-ai"/><category term="uv"/><category term="qwen"/><category term="text-to-image"/><category term="ai-in-china"/><category term="ivan-fioravanti"/></entry><entry><title>Qwen3-4B-Thinking: "This is art - pelicans don't ride bikes!"</title><link href="https://simonwillison.net/2025/Aug/10/qwen3-4b/#atom-tag" rel="alternate"/><published>2025-08-10T23:59:12+00:00</published><updated>2025-08-10T23:59:12+00:00</updated><id>https://simonwillison.net/2025/Aug/10/qwen3-4b/#atom-tag</id><summary type="html">
    &lt;p&gt;I've fallen a few days behind keeping up with Qwen. They released two new 4B models last week:  &lt;a href="https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507"&gt;Qwen3-4B-Instruct-2507&lt;/a&gt; and its thinking equivalent &lt;a href="https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507"&gt;Qwen3-4B-Thinking-2507&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;These are relatively tiny models that punch &lt;em&gt;way&lt;/em&gt; above their weight. I’ve been running the 8bit GGUF varieties via LM Studio (here's &lt;a href="https://huggingface.co/lmstudio-community/Qwen3-4B-Instruct-2507-GGUF"&gt;Instruct&lt;/a&gt;, here's &lt;a href="https://huggingface.co/lmstudio-community/Qwen3-4B-Thinking-2507-GGUF"&gt;Thinking&lt;/a&gt;) - both of them are 4GB downloads that use around 4.3GB of my M2 MacBook Pro’s system RAM while running. Both are way more capable than I would expect from such small files.&lt;/p&gt;
&lt;p&gt;Qwen3-4B-Thinking is the first model I've tried which called out the absurdity of being asked to draw &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;a pelican riding a bicycle&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/cd41769de4791b98ba43a17c6ff32ee1"&gt;the full transcript&lt;/a&gt; - some highlights from the thinking trace:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Okay, the user asked for an SVG of a pelican riding a bicycle. That's... oddly specific and completely unrealistic.&lt;/p&gt;
&lt;p&gt;First reaction: Pelicans don't ride bicycles. They're birds that fly or swim, not human-like vehicles. This feels like a mix-up - maybe they meant "a cyclist" but typed "pelican"? Or perhaps it's a joke?&lt;/p&gt;
&lt;p&gt;Hmm, user might be:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A kid playing with words (pelican + bicycle = silly combo)&lt;/li&gt;
&lt;li&gt;Someone testing if I'll generate impossible images&lt;/li&gt;
&lt;li&gt;Actually wanting a pelican &lt;em&gt;on&lt;/em&gt; a bike for art project? But that still doesn't make sense biologically&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;Checks SVG capabilities&lt;/em&gt;
SVG is for vector graphics, but this request violates physics and biology. If I tried to draw it literally:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Pelican would be too big for a bike frame&lt;/li&gt;
&lt;li&gt;Bike wheels would be tiny compared to bird wings&lt;/li&gt;
&lt;li&gt;No realistic way to show "riding" since pelicans don't have legs for pedaling&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;[…]&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Self-check&lt;/em&gt;: Am I being too pedantic? Nah - if someone asks for impossible things, it's better to gently correct than make fake art that could confuse them. […]&lt;/p&gt;
&lt;p&gt;I appreciate your creative request! 🦅🚴‍♂️ However, &lt;strong&gt;pelicans don't ride bicycles&lt;/strong&gt; (they're birds that fly/swim, not humans with bike skills). Generating this as a literal SVG would be biologically impossible and visually confusing.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;"This request violates physics and biology" - Qwen3-4B-Thinking saying what no other model ever thought to express.&lt;/p&gt;
&lt;p&gt;I also really enjoyed the "Am I being too pedantic? Nah" line. This model is sassy!&lt;/p&gt;
&lt;p&gt;It continued on to spit out two SVGs: the first titled "Most Likely: A Human Cyclist" and the second "Artistic Interpretation: Pelican &lt;em&gt;On&lt;/em&gt; a Bicycle (For Fun!)".&lt;/p&gt;
&lt;p&gt;They were both terrible. Here's the human one:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/qwen3-4b-thinking-human.png" alt="It's a green horizontal line, three abstract black shapes, a brown circle and the word Person" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And here's the supposed pelican:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/qwen3-4b-thinking-pelican.png" alt="A blue circle, with a small black shape that might be a wing and a black shape that could maybe be the head? It's awful. Red text overlaps it that says This is art - pelicans don't ride bikes! - there is no attempt at a bicycle." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I like Qwen's decision to include the clarifying annotation "This is art - pelicans don't ride bikes!"&lt;/p&gt;
&lt;p&gt;I tried the Qwen3-4B-Instruct non-thinking model too. It &lt;a href="https://gist.github.com/simonw/ad927a3849d0aece043afc97559be4bf"&gt;answered much faster&lt;/a&gt; (no time spent questioning my choice of task with its thinking tokens) and gave me this:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/qwen3-4b-instruct-2507-pelican.png" alt="A bunch of shapes. Pelican Riding a Bike! superimposed on top. The yellow and orange bits might be a pelican I guess. The bicycle has two wheels overlapping too close and a single bar in the wrong direction." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;4B is such an interesting model size. These models should run on almost anything and, at least on my M2 MacBook, they run &lt;em&gt;fast&lt;/em&gt;. I'm getting 50+ tokens per second and they're using just under 4.5GB of RAM while running.&lt;/p&gt;
&lt;p&gt;The question is always how useful such a tiny model can be. Clearly it's not great for SVG pelican illustrations!&lt;/p&gt;

&lt;p&gt;I did get a useful result out of the &lt;code&gt;-Thinking&lt;/code&gt; variant for a &lt;code&gt;jq&lt;/code&gt; expression I needed. I prompted:&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;queries[0].rows is an array of objects each with a markdown key - write a jq bash one liner to output a raw string if that markdown concatenated together with double newlines between each&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;It thought &lt;a href="https://gist.github.com/simonw/3f76749aa710f4a2d6405ebcf5b00ac4"&gt;for 3 minutes 13 seconds&lt;/a&gt; before spitting out a recipe that did roughly what I wanted:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;jq -r '.queries[0].rows[] | .markdown' | tr '\n' '\n\n'&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;I'm not sure that was worth waiting three minutes for though!&lt;/p&gt;
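&lt;p&gt;That final &lt;code&gt;tr&lt;/code&gt; step is effectively a no-op as well - &lt;code&gt;tr&lt;/code&gt; translates characters one-to-one, so it can't turn a single newline into two. For comparison, here's a minimal Python sketch of what I was actually after (reading the JSON from standard input):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import json
import sys

# Join the markdown values with blank lines between them
data = json.load(sys.stdin)
print("\n\n".join(row["markdown"] for row in data["queries"][0]["rows"]))
&lt;/code&gt;&lt;/pre&gt;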

&lt;p&gt;These models have a 262,144 token context - wildly impressive, &lt;em&gt;if it works&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;So I tried another experiment: I used the Instruct model to summarize &lt;a href="https://news.ycombinator.com/item?id=44851557"&gt;this Hacker News conversation about GPT-5&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I did this with the &lt;a href="https://github.com/agustif/llm-lmstudio"&gt;llm-lmstudio&lt;/a&gt; plugin for LLM combined with my &lt;a href="https://til.simonwillison.net/llms/claude-hacker-news-themes"&gt;hn-summary.sh script&lt;/a&gt;, which meant I could run the experiment like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;hn-summary.sh 44851557 -m qwen3-4b-instruct-2507
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I believe this is 15,785 tokens - so nothing close to the 262,144 maximum but still an interesting test of a 4GB local model.&lt;/p&gt;
&lt;p&gt;The good news is Qwen spat out a genuinely useful summary of the conversation! You can &lt;a href="https://gist.github.com/simonw/4c5a1912f73e0d68b456b18000a76f0d#response"&gt;read that here&lt;/a&gt; - it's the best I've seen yet from a model running on my laptop, though honestly I've not tried many other recent models in this way.&lt;/p&gt;
&lt;p&gt;The bad news... it took almost five minutes to process and return the result!&lt;/p&gt;
&lt;p&gt;As a loose calculation, if the model can output 50 tokens/second then maybe it processes incoming input at a similar speed... in which case 15,785 / 50 ≈ 315 seconds, which is 5m15s.&lt;/p&gt;
&lt;p&gt;Hosted models can crunch through 15,000 tokens of input in just a few seconds. I guess this is one of the more material limitations of running models on Apple silicon as opposed to dedicated GPUs.&lt;/p&gt;
&lt;p&gt;I think I'm going to spend some more time with these models. They're fun, they have personality and I'm confident there are classes of useful problems they will prove capable at despite their small size. Their ability at summarization should make them a good fit for local RAG, and I've not started exploring their tool calling abilities yet.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="qwen"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="lm-studio"/><category term="ai-in-china"/></entry><entry><title>Quoting Artificial Analysis</title><link href="https://simonwillison.net/2025/Aug/6/artificial-analysis/#atom-tag" rel="alternate"/><published>2025-08-06T12:48:32+00:00</published><updated>2025-08-06T12:48:32+00:00</updated><id>https://simonwillison.net/2025/Aug/6/artificial-analysis/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://x.com/artificialanlys/status/1952887733803991070"&gt;&lt;p&gt;&lt;strong&gt;gpt-oss-120b is the most intelligent American open weights model, comes behind DeepSeek R1 and Qwen3 235B in intelligence but offers efficiency benefits&lt;/strong&gt; [...]&lt;/p&gt;
&lt;p&gt;We’re seeing the 120B beat o3-mini but come in behind o4-mini and o3. The 120B is the most intelligent model that can be run on a single H100 and the 20B is the most intelligent model that can be run on a consumer GPU. [...]&lt;/p&gt;
&lt;p&gt;While the larger gpt-oss-120b does not come in above DeepSeek R1 0528’s score of 59 or Qwen3 235B 2507s score of 64, it is notable that it is significantly smaller in both total and active parameters than both of those models.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://x.com/artificialanlys/status/1952887733803991070"&gt;Artificial Analysis&lt;/a&gt;, see also their &lt;a href="https://artificialanalysis.ai/models/open-source"&gt;updated leaderboard&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/evals"&gt;evals&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deepseek"&gt;deepseek&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-oss"&gt;gpt-oss&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/artificial-analysis"&gt;artificial-analysis&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="evals"/><category term="qwen"/><category term="deepseek"/><category term="gpt-oss"/><category term="artificial-analysis"/></entry><entry><title>Qwen-Image: Crafting with Native Text Rendering</title><link href="https://simonwillison.net/2025/Aug/4/qwen-image/#atom-tag" rel="alternate"/><published>2025-08-04T19:11:36+00:00</published><updated>2025-08-04T19:11:36+00:00</updated><id>https://simonwillison.net/2025/Aug/4/qwen-image/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://qwenlm.github.io/blog/qwen-image/"&gt;Qwen-Image: Crafting with Native Text Rendering&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Not content with releasing &lt;a href="https://simonwillison.net/2025/Jul/30/chinese-models/"&gt;six excellent open weights LLMs in July&lt;/a&gt;, Qwen are kicking off August with their first ever image generation model.&lt;/p&gt;
&lt;p&gt;Qwen-Image is a 20 billion parameter MMDiT (Multimodal Diffusion Transformer, originally proposed for Stable Diffusion 3) model under an Apache 2.0 license. The &lt;a href="https://huggingface.co/Qwen/Qwen-Image"&gt;Hugging Face repo&lt;/a&gt; is 53.97GB.&lt;/p&gt;
&lt;p&gt;Qwen released a &lt;a href="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/Qwen_Image.pdf"&gt;detailed technical report&lt;/a&gt; (PDF) to accompany the model. The model builds on their Qwen-2.5-VL vision LLM, and they also made extensive use of that model to help create some of their training data:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In our data annotation pipeline, we utilize a capable image captioner (e.g., Qwen2.5-VL) to generate not only comprehensive image descriptions, but also structured metadata that captures essential image properties and quality attributes.&lt;/p&gt;
&lt;p&gt;Instead of treating captioning and metadata extraction as independent tasks, we designed an annotation framework in which the captioner concurrently describes visual content and generates detailed information in a structured format, such as JSON. Critical details such as object attributes, spatial relationships, environmental context, and verbatim transcriptions of visible text are captured in the caption, while key image properties like type, style, presence of watermarks, and abnormal elements (e.g., QR codes or facial mosaics) are reported in a structured format.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;They put a &lt;em&gt;lot&lt;/em&gt; of effort into the model's ability to render text in a useful way. 5% of the training data (described as "billions of image-text pairs") was data "synthesized through controlled text rendering techniques", ranging from simple text through text on an image background up to much more complex layout examples:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;To improve the model’s capacity to follow complex, structured prompts involving layout-sensitive content, we propose a synthesis strategy based on programmatic editing of pre-defined templates, such as PowerPoint slides or User Interface Mockups. A comprehensive rule-based system is designed to automate the substitution of placeholder text while maintaining the integrity of layout structure, alignment, and formatting.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I tried the model out using the &lt;a href="https://modelscope.cn/aigc/imageGeneration?tab=advanced"&gt;ModelScope demo&lt;/a&gt; - I signed in with GitHub and verified my account via a text message to a phone number. Here's what I got for "A raccoon holding a sign that says "I love trash" that was written by that raccoon":&lt;/p&gt;
&lt;p&gt;&lt;img alt="A great photo of a raccoon holding a cardboard sign, the text I love trash is written on it in marker, the raccoon has chosen to draw the o in love as a heart filled with red marker pen." src="https://static.simonwillison.net/static/2025/qwen-trash.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;The raccoon has very neat handwriting!&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: A version of the model exists that can edit existing images but it's &lt;a href="https://github.com/QwenLM/Qwen-Image/issues/3#issuecomment-3151573614"&gt;not yet been released&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Currently, we have only open-sourced the text-to-image foundation model, but the editing model is also on our roadmap and planned for future release.&lt;/p&gt;
&lt;/blockquote&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://x.com/Alibaba_Qwen/status/1952398250121756992"&gt;@Alibaba_Qwen&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/stable-diffusion"&gt;stable-diffusion&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/training-data"&gt;training-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/text-to-image"&gt;text-to-image&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="stable-diffusion"/><category term="generative-ai"/><category term="vision-llms"/><category term="training-data"/><category term="qwen"/><category term="text-to-image"/><category term="ai-in-china"/></entry><entry><title>Faster inference</title><link href="https://simonwillison.net/2025/Aug/1/faster-inference/#atom-tag" rel="alternate"/><published>2025-08-01T23:28:26+00:00</published><updated>2025-08-01T23:28:26+00:00</updated><id>https://simonwillison.net/2025/Aug/1/faster-inference/#atom-tag</id><summary type="html">
    &lt;p&gt;Two interesting examples of inference speed as a flagship feature of LLM services today.&lt;/p&gt;
&lt;p&gt;First, Cerebras &lt;a href="https://www.cerebras.ai/blog/introducing-cerebras-code"&gt;announced two new monthly plans&lt;/a&gt; for their extremely high speed hosted model service: Cerebras Code Pro ($50/month, 1,000 messages a day) and Cerebras Code Max ($200/month, 5,000/day). The model they are selling here is Qwen's Qwen3-Coder-480B-A35B-Instruct, likely the best available open weights coding model right now and one that was released &lt;a href="https://simonwillison.net/2025/Jul/22/qwen3-coder/"&gt;just ten days ago&lt;/a&gt;. Ten days from model release to third-party subscription service feels like some kind of record.&lt;/p&gt;
&lt;p&gt;Cerebras claim they can serve the model at an astonishing 2,000 tokens per second - four times the speed of Claude Sonnet 4 in &lt;a href="https://x.com/cerebrassystems/status/1951340566077440464"&gt;their demo video&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Also today, Moonshot &lt;a href="https://x.com/kimi_moonshot/status/1951168907131355598"&gt;announced&lt;/a&gt; a new hosted version of their trillion parameter Kimi K2 model called &lt;code&gt;kimi-k2-turbo-preview&lt;/code&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;🆕 Say hello to kimi-k2-turbo-preview
Same model. Same context. NOW 4× FASTER.&lt;/p&gt;
&lt;p&gt;⚡️ From 10 tok/s to 40 tok/s.&lt;/p&gt;
&lt;p&gt;💰 Limited-Time Launch Price (50% off until Sept 1)&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;$0.30 / million input tokens (cache hit)&lt;/li&gt;
&lt;li&gt;$1.20 / million input tokens (cache miss)&lt;/li&gt;
&lt;li&gt;$5.00 / million output tokens&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;👉 Explore more: &lt;a href="https://platform.moonshot.ai"&gt;platform.moonshot.ai&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is twice the price of their regular model for 4x the speed (increasing to 4x the price in September). No details yet on how they achieved the speed-up.&lt;/p&gt;
&lt;p&gt;I am interested to see how much market demand there is for faster performance like this. I've &lt;a href="https://simonwillison.net/2024/Oct/31/cerebras-coder/"&gt;experimented with Cerebras in the past&lt;/a&gt; and found that the speed really does make iterating on code with live previews feel a whole lot more interactive.&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cerebras"&gt;cerebras&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/moonshot"&gt;moonshot&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/kimi"&gt;kimi&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-performance"&gt;llm-performance&lt;/a&gt;&lt;/p&gt;



</summary><category term="generative-ai"/><category term="cerebras"/><category term="llm-pricing"/><category term="ai"/><category term="ai-in-china"/><category term="llms"/><category term="qwen"/><category term="moonshot"/><category term="kimi"/><category term="llm-performance"/></entry><entry><title>Trying out Qwen3 Coder Flash using LM Studio and Open WebUI and LLM</title><link href="https://simonwillison.net/2025/Jul/31/qwen3-coder-flash/#atom-tag" rel="alternate"/><published>2025-07-31T19:45:36+00:00</published><updated>2025-07-31T19:45:36+00:00</updated><id>https://simonwillison.net/2025/Jul/31/qwen3-coder-flash/#atom-tag</id><summary type="html">
    &lt;p&gt;Qwen just released &lt;a href="https://simonwillison.net/2025/Jul/30/chinese-models/"&gt;their sixth model&lt;/a&gt;(!) of this July called &lt;a href="https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct"&gt;Qwen3-Coder-30B-A3B-Instruct&lt;/a&gt; - listed as Qwen3-Coder-Flash in their &lt;a href="https://chat.qwen.ai/"&gt;chat.qwen.ai&lt;/a&gt; interface.&lt;/p&gt;
&lt;p&gt;It's 30.5B total parameters with 3.3B active at any one time. This means it will fit on a 64GB Mac - and even a 32GB Mac if you quantize it - and can run &lt;em&gt;really&lt;/em&gt; fast thanks to that smaller set of active parameters.&lt;/p&gt;
&lt;p&gt;It's a non-thinking model that is specially trained for coding tasks.&lt;/p&gt;
&lt;p&gt;This is an exciting combination of properties: optimized for coding performance and speed and small enough to run on a mid-tier developer laptop.&lt;/p&gt;
&lt;h4 id="trying-it-out-with-lm-studio-and-open-webui"&gt;Trying it out with LM Studio and Open WebUI&lt;/h4&gt;
&lt;p&gt;I like running models like this using Apple's MLX framework. I ran GLM-4.5 Air the other day &lt;a href="https://simonwillison.net/2025/Jul/29/space-invaders/#how-i-ran-the-model"&gt;using the mlx-lm Python library directly&lt;/a&gt;, but this time I decided to try out the combination of &lt;a href="https://lmstudio.ai/"&gt;LM Studio&lt;/a&gt; and &lt;a href="https://openwebui.com/"&gt;Open WebUI&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;(LM Studio has a decent interface built in, but I like the Open WebUI one slightly more.)&lt;/p&gt;
&lt;p&gt;I installed the model by clicking the "Use model in LM Studio" button on LM Studio's &lt;a href="https://lmstudio.ai/models/qwen/qwen3-coder-30b"&gt;qwen/qwen3-coder-30b&lt;/a&gt; page. It gave me a bunch of options:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/lm-studio-qwen3-coder-30b.jpg" alt="Screenshot of a model download menu for &amp;quot;qwen/qwen3-coder-30b,&amp;quot; a 30B MoE coding model from Alibaba Qwen using the mlx-llm engine. The section &amp;quot;Download Options&amp;quot; shows different choices with file sizes. Options include: GGUF Qwen3 Coder 30B A3B Instruct Q3_K_L (14.58 GB), Q4_K_M (18.63 GB), Q6_K (25.10 GB), Q8_0 (32.48 GB). MLX versions are also available: 4bit (17.19 GB, selected), 6bit (24.82 GB, marked as Downloaded), 8bit (32.46 GB)." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I chose the 6bit MLX model, which is a 24.82GB download. Other options include 4bit (17.19GB) and 8bit (32.46GB). The download sizes are roughly the same as the amount of RAM required to run the model - picking that 24GB one leaves 40GB free on my 64GB machine for other applications.&lt;/p&gt;
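&lt;p&gt;Those download sizes line up with a back-of-envelope estimate of quantized weights at roughly &lt;code&gt;parameters × bits / 8&lt;/code&gt; bytes - here's a quick sanity check in Python (the real files run a little larger, since some layers stay at higher precision and there's metadata):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;params = 30.5e9  # Qwen3-Coder-30B-A3B total parameters

for bits in (4, 6, 8):
    print(f"{bits}-bit: ~{params * bits / 8 / 1e9:.1f} GB")

# 4-bit: ~15.2 GB  (17.19 GB download)
# 6-bit: ~22.9 GB  (24.82 GB download)
# 8-bit: ~30.5 GB  (32.46 GB download)
&lt;/code&gt;&lt;/pre&gt;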
&lt;p&gt;Then I opened the developer settings in LM Studio (the green folder icon) and turned on "Enable CORS" so I could access it from a separate Open WebUI instance.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/lm-studio-cors.jpg" alt="Screenshot of LM Studio application showing runtime settings. The status is &amp;quot;Running&amp;quot; with a toggle switch enabled. A settings dropdown is open with options including: &amp;quot;Server Port 1234&amp;quot;, &amp;quot;Enable CORS&amp;quot; (enabled), &amp;quot;Serve on Local Network&amp;quot; (disabled)" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Now I switched over to Open WebUI. I installed and ran it using &lt;a href="https://github.com/astral-sh/uv"&gt;uv&lt;/a&gt; like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;uvx --python 3.11 open-webui serve&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then navigated to &lt;code&gt;http://localhost:8080/&lt;/code&gt; to access the interface. I opened their settings and configured a new "Connection" to LM Studio:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/openweb-ui-settings.jpg" alt="Screenshot of Open WebUI settings showing the Edit Connection window. URL is set to http://localhost:1234/v1 and Prefix ID is set to lm." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;That needs a base URL of &lt;code&gt;http://localhost:1234/v1&lt;/code&gt; and a key of anything you like. I also set the optional prefix to &lt;code&gt;lm&lt;/code&gt; just in case my Ollama installation - which Open WebUI detects automatically - ended up with any duplicate model names.&lt;/p&gt;
&lt;p&gt;Having done all of that, I could select any of my LM Studio models in the Open WebUI interface and start running prompts.&lt;/p&gt;
&lt;p&gt;A neat feature of Open WebUI is that it includes an automatic preview panel, which kicks in for fenced code blocks that include SVG or HTML:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/openweb-ui-pelican.jpg" alt="The Open WebUI app with a sidebar and then a panel with the model and my Generate an SVG of a pelican riding a bicycle prompt, then its response, then another side panel with the rendered SVG. It isn't a great image - the bicycle is a bit mangled - but the pelican does at least have a big triangular orange beak." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/c167f14bc3d86ec1976f286d3e05fda5"&gt;the exported transcript&lt;/a&gt; for "Generate an SVG of a pelican riding a bicycle". It ran at almost 60 tokens a second!&lt;/p&gt;
&lt;h4 id="implementing-space-invaders"&gt;Implementing Space Invaders&lt;/h4&gt;
&lt;p&gt;I tried my other recent &lt;a href="https://simonwillison.net/tags/space-invaders/"&gt;simple benchmark prompt&lt;/a&gt; as well:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Write an HTML and JavaScript page implementing space invaders&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I like this one because it's a very short prompt that acts as shorthand for quite a complex set of features. There's likely plenty of material in the training data to help the model achieve that goal but it's still interesting to see if they manage to spit out something that works first time.&lt;/p&gt;
&lt;p&gt;The first version it gave me worked out of the box, but was a little too hard - the enemy bullets move so fast that it's almost impossible to avoid them:&lt;/p&gt;
&lt;div style="max-width: 100%; margin-bottom: 0.4em"&gt;
    &lt;video controls="controls" preload="none" aria-label="Space Invaders" poster="https://static.simonwillison.net/static/2025/space-invaders-6bit-mlx-Qwen3-Coder-30B-A3B-Instruct.jpg" loop="loop" style="width: 100%; height: auto;" muted="muted"&gt;
        &lt;source src="https://static.simonwillison.net/static/2025/space-invaders-6bit-mlx-Qwen3-Coder-30B-A3B-Instruct.mp4" type="video/mp4" /&gt;
    &lt;/video&gt;
&lt;/div&gt;
&lt;p&gt;You can &lt;a href="https://tools.simonwillison.net/space-invaders-6bit-mlx-Qwen3-Coder-30B-A3B-Instruct"&gt;try that out here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I tried a follow-up prompt of "Make the enemy bullets a little slower". A system like Claude Artifacts or Claude Code implements tool calls for modifying files in place, but the Open WebUI system I was using didn't have a default equivalent which means the model had to output the full file a second time.&lt;/p&gt;
&lt;p&gt;It did that, and slowed down the bullets, but it made a bunch of other changes as well, &lt;a href="https://gist.github.com/simonw/ee4704feb37c6b16edd677d32fd69693/revisions#diff-544640de4897069f24e7988199bd5c08addfc5aa2196cbf2a0d164308bff1db0"&gt;shown in this diff&lt;/a&gt;. I'm not too surprised by this - asking a 25GB local model to output a lengthy file with just a single change is quite a stretch.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/b7115990525b104a6dd95f7d694ae6c3"&gt;the exported transcript&lt;/a&gt; for those two prompts.&lt;/p&gt;
&lt;h4 id="running-lm-studio-models-with-mlx-lm"&gt;Running LM Studio models with mlx-lm&lt;/h4&gt;
&lt;p&gt;LM Studio stores its models in the &lt;code&gt;~/.cache/lm-studio/models&lt;/code&gt; directory. This means you can use the &lt;a href="https://github.com/ml-explore/mlx-lm"&gt;mlx-lm&lt;/a&gt; Python library to run prompts through the same model like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;uv run --isolated --with mlx-lm mlx_lm.generate \
  --model &lt;span class="pl-k"&gt;~&lt;/span&gt;/.cache/lm-studio/models/lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-6bit \
  --prompt &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Write an HTML and JavaScript page implementing space invaders&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \
  -m 8192 --top-k 20 --top-p 0.8 --temp 0.7&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Be aware that this will load a duplicate copy of the model into memory so you may want to quit LM Studio before running this command!&lt;/p&gt;
&lt;h4 id="accessing-the-model-via-my-llm-tool"&gt;Accessing the model via my LLM tool&lt;/h4&gt;
&lt;p&gt;My &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; project provides a command-line tool and Python library for accessing large language models.&lt;/p&gt;
&lt;p&gt;Since LM Studio offers an OpenAI-compatible API, you can &lt;a href="https://llm.datasette.io/en/stable/other-models.html#openai-compatible-models"&gt;configure LLM&lt;/a&gt; to access models through that API by creating or editing the &lt;code&gt;~/Library/Application\ Support/io.datasette.llm/extra-openai-models.yaml&lt;/code&gt; file:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;zed &lt;span class="pl-k"&gt;~&lt;/span&gt;/Library/Application&lt;span class="pl-cce"&gt;\ &lt;/span&gt;Support/io.datasette.llm/extra-openai-models.yaml&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I added the following YAML configuration:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;- &lt;span class="pl-ent"&gt;model_id&lt;/span&gt;: &lt;span class="pl-s"&gt;qwen3-coder-30b&lt;/span&gt;
  &lt;span class="pl-ent"&gt;model_name&lt;/span&gt;: &lt;span class="pl-s"&gt;qwen/qwen3-coder-30b&lt;/span&gt;
  &lt;span class="pl-ent"&gt;api_base&lt;/span&gt;: &lt;span class="pl-s"&gt;http://localhost:1234/v1&lt;/span&gt;
  &lt;span class="pl-ent"&gt;supports_tools&lt;/span&gt;: &lt;span class="pl-c1"&gt;true&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Provided LM Studio is running I can execute prompts from my terminal like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m qwen3-coder-30b &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;A joke about a pelican and a cheesecake&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;blockquote&gt;
&lt;p&gt;Why did the pelican refuse to eat the cheesecake?&lt;/p&gt;
&lt;p&gt;Because it had a &lt;em&gt;beak&lt;/em&gt; for dessert! 🥧🦜&lt;/p&gt;
&lt;p&gt;(Or if you prefer: Because it was afraid of getting &lt;em&gt;beak&lt;/em&gt;-sick from all that creamy goodness!)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;(25GB clearly isn't enough space for a functional sense of humor.)&lt;/p&gt;
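&lt;p&gt;The same configured model is available via LLM's Python API as well - here's a minimal sketch using the &lt;code&gt;qwen3-coder-30b&lt;/code&gt; alias defined above:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import llm

# Uses the alias from extra-openai-models.yaml; LM Studio must be running
model = llm.get_model("qwen3-coder-30b")
response = model.prompt("A joke about a pelican and a cheesecake")
print(response.text())
&lt;/code&gt;&lt;/pre&gt;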
&lt;p&gt;More interestingly though, we can start exercising the Qwen model's support for &lt;a href="https://simonwillison.net/2025/May/27/llm-tools/"&gt;tool calling&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m qwen3-coder-30b \
  -T llm_version -T llm_time --td \
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;tell the time then show the version&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here we are enabling LLM's two default tools - one for telling the time and one for seeing the version of LLM that's currently installed. The &lt;code&gt;--td&lt;/code&gt; flag stands for &lt;code&gt;--tools-debug&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The output looks like this, debug output included:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Tool call: llm_time({})
  {
    "utc_time": "2025-07-31 19:20:29 UTC",
    "utc_time_iso": "2025-07-31T19:20:29.498635+00:00",
    "local_timezone": "PDT",
    "local_time": "2025-07-31 12:20:29",
    "timezone_offset": "UTC-7:00",
    "is_dst": true
  }

Tool call: llm_version({})
  0.26

The current time is:
- Local Time (PDT): 2025-07-31 12:20:29
- UTC Time: 2025-07-31 19:20:29

The installed version of the LLM is 0.26.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Pretty good! It managed two tool calls from a single prompt.&lt;/p&gt;
&lt;p&gt;Sadly I couldn't get it to work with some of my more complex plugins such as &lt;a href="https://github.com/simonw/llm-tools-sqlite"&gt;llm-tools-sqlite&lt;/a&gt;. I'm trying to figure out if that's a bug in the model, the LM Studio layer or my own code for running tool prompts against OpenAI-compatible endpoints.&lt;/p&gt;
&lt;h4 id="the-month-of-qwen"&gt;The month of Qwen&lt;/h4&gt;
&lt;p&gt;July has absolutely been the month of Qwen. The models they have released this month are outstanding, packing some extremely useful capabilities even into models I can run in 25GB of RAM or less on my own laptop.&lt;/p&gt;
&lt;p&gt;If you're looking for a competent coding model you can run locally, Qwen3-Coder-30B-A3B is a very solid choice.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/space-invaders"&gt;space-invaders&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="llm"/><category term="uv"/><category term="qwen"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="lm-studio"/><category term="ai-in-china"/><category term="space-invaders"/></entry><entry><title>The best available open weight LLMs now come from China</title><link href="https://simonwillison.net/2025/Jul/30/chinese-models/#atom-tag" rel="alternate"/><published>2025-07-30T16:18:38+00:00</published><updated>2025-07-30T16:18:38+00:00</updated><id>https://simonwillison.net/2025/Jul/30/chinese-models/#atom-tag</id><summary type="html">
    &lt;p&gt;Something that has become undeniable this month is that the best available open weight models now come from the Chinese AI labs.&lt;/p&gt;
&lt;p&gt;I continue to have a lot of love for Mistral, Gemma and Llama but my feeling is that Qwen, Moonshot and Z.ai have positively &lt;em&gt;smoked them&lt;/em&gt; over the course of July.&lt;/p&gt;
&lt;p&gt;Here's what came out this month, with links to my notes on each one:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Moonshot &lt;a href="https://simonwillison.net/2025/Jul/11/kimi-k2/"&gt;Kimi-K2-Instruct&lt;/a&gt; - 11th July, 1 trillion parameters&lt;/li&gt;
&lt;li&gt;Qwen &lt;a href="https://simonwillison.net/2025/Jul/22/qwen3-235b-a22b-instruct-2507/"&gt;Qwen3-235B-A22B-Instruct-2507&lt;/a&gt; - 21st July, 235 billion&lt;/li&gt;
&lt;li&gt;Qwen &lt;a href="https://simonwillison.net/2025/Jul/22/qwen3-coder/"&gt;Qwen3-Coder-480B-A35B-Instruct&lt;/a&gt; - 22nd July, 480 billion&lt;/li&gt;
&lt;li&gt;Qwen &lt;a href="https://simonwillison.net/2025/Jul/25/qwen3-235b-a22b-thinking-2507/"&gt;Qwen3-235B-A22B-Thinking-2507&lt;/a&gt; - 25th July, 235 billion&lt;/li&gt;
&lt;li&gt;Z.ai &lt;a href="https://simonwillison.net/2025/Jul/28/glm-45/"&gt;GLM-4.5 and GLM-4.5 Air&lt;/a&gt; - 28th July, 355 and 106 billion&lt;/li&gt;
&lt;li&gt;Qwen &lt;a href="https://simonwillison.net/2025/Jul/29/qwen3-30b-a3b-instruct-2507/"&gt;Qwen3-30B-A3B-Instruct-2507&lt;/a&gt; - 29th July, 30 billion&lt;/li&gt;
&lt;li&gt;Qwen &lt;a href="https://simonwillison.net/2025/Jul/30/qwen3-30b-a3b-thinking-2507/"&gt;Qwen3-30B-A3B-Thinking-2507&lt;/a&gt; - 30th July, 30 billion&lt;/li&gt;
&lt;li&gt;Qwen &lt;a href="https://simonwillison.net/2025/Jul/31/qwen3-coder-flash/"&gt;Qwen3-Coder-30B-A3B-Instruct&lt;/a&gt; - 31st July, 30 billion (released after I first posted this note)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;small&gt;Notably absent from this list is DeepSeek, but that's only because their last model release was &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-R1-0528"&gt;DeepSeek-R1-0528&lt;/a&gt; back in May.&lt;/small&gt;&lt;/p&gt;
&lt;p&gt;The only janky license among them is Kimi K2, which uses a non-OSI-compliant modified MIT. Qwen's models are all Apache 2 and Z.ai's are MIT.&lt;/p&gt;
&lt;p&gt;The larger Chinese models all offer their own APIs and are increasingly available from other providers.  I've been able to run versions of the Qwen 30B and GLM-4.5 Air 106B models on my own laptop.&lt;/p&gt;
&lt;p&gt;I can't help but wonder if part of the reason for the delay in release of OpenAI's open weights model comes from a desire to be notably better than this truly impressive lineup of Chinese models.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update August 5th 2025&lt;/strong&gt;: The OpenAI open weight models came out and &lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/"&gt;they are very impressive&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-oss"&gt;gpt-oss&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/moonshot"&gt;moonshot&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/kimi"&gt;kimi&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/janky-licenses"&gt;janky-licenses&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/glm"&gt;glm&lt;/a&gt;&lt;/p&gt;



</summary><category term="open-source"/><category term="qwen"/><category term="openai"/><category term="generative-ai"/><category term="ai"/><category term="local-llms"/><category term="llms"/><category term="ai-in-china"/><category term="gpt-oss"/><category term="moonshot"/><category term="kimi"/><category term="janky-licenses"/><category term="glm"/></entry><entry><title>Qwen3-30B-A3B-Thinking-2507</title><link href="https://simonwillison.net/2025/Jul/30/qwen3-30b-a3b-thinking-2507/#atom-tag" rel="alternate"/><published>2025-07-30T15:36:54+00:00</published><updated>2025-07-30T15:36:54+00:00</updated><id>https://simonwillison.net/2025/Jul/30/qwen3-30b-a3b-thinking-2507/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507"&gt;Qwen3-30B-A3B-Thinking-2507&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Yesterday was &lt;a href="https://simonwillison.net/2025/Jul/29/qwen3-30b-a3b-instruct-2507/"&gt;Qwen3-30B-A3B-Instruct-2507&lt;/a&gt;. Qwen are clearly committed to their new split between reasoning and non-reasoning models (a reversal from Qwen 3 in April), because today they released the new reasoning partner to yesterday's model: &lt;strong&gt;Qwen3-30B-A3B-Thinking-2507&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;I'm surprised at how poorly this reasoning mode performs at "Generate an SVG of a pelican riding a bicycle" compared to its non-reasoning partner. The &lt;a href="https://gist.github.com/simonw/b523c029152f646ce4efb3c4dd5e1d01#reasoning"&gt;reasoning trace&lt;/a&gt; appears to carefully consider each component and how it should be positioned... and then &lt;a href="https://gist.github.com/simonw/b523c029152f646ce4efb3c4dd5e1d01#response"&gt;the final result&lt;/a&gt; looks like this:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A line with two dots, over a rhombus, with two circles and a pelican that looks like a grey snowman. They are not arranged in a sensible layout." src="https://static.simonwillison.net/static/2025/qwen3-30b-a3b-thinking-2507.png" /&gt;&lt;/p&gt;
&lt;p&gt;I ran this using &lt;a href="https://chat.qwen.ai/?model=Qwen3-30B-A3B-2507"&gt;chat.qwen.ai/?model=Qwen3-30B-A3B-2507&lt;/a&gt; with the "reasoning" option selected.&lt;/p&gt;
&lt;p&gt;I also tried the "Write an HTML and JavaScript page implementing space invaders" prompt I &lt;a href="https://simonwillison.net/2025/Jul/29/qwen3-30b-a3b-instruct-2507/#space-invaders"&gt;ran against the non-reasoning model&lt;/a&gt;. It did a better job in that &lt;a href="https://tools.simonwillison.net/space-invaders-qwen3-30b-a3b-thinking-2507"&gt;the game works&lt;/a&gt;:&lt;/p&gt;
&lt;div style="max-width: 100%; margin-bottom: 0.4em"&gt;
    &lt;video controls="controls" preload="none" aria-label="Space Invaders" poster="https://static.simonwillison.net/static/2025/qwen3-30b-a3b-thinking-2507-space-invaders.jpg" loop="loop" style="width: 100%; height: auto;" muted="muted"&gt;
        &lt;source src="https://static.simonwillison.net/static/2025/qwen3-30b-a3b-thinking-2507-space-invaders.mp4" type="video/mp4" /&gt;
    &lt;/video&gt;
&lt;/div&gt;

&lt;p&gt;It's not as playable as the one &lt;a href="https://simonwillison.net/2025/Jul/29/space-invaders/"&gt;I got from GLM-4.5 Air&lt;/a&gt; though - the invaders fire their bullets infrequently enough that the game isn't very challenging.&lt;/p&gt;
&lt;p&gt;This model is part of a flurry of releases from Qwen over the past nine days. Here's my coverage of each of those:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2025/Jul/22/qwen3-235b-a22b-instruct-2507/"&gt;Qwen3-235B-A22B-Instruct-2507&lt;/a&gt; - 21st July&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2025/Jul/22/qwen3-coder/"&gt;Qwen3-Coder-480B-A35B-Instruct&lt;/a&gt; - 22nd July&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2025/Jul/25/qwen3-235b-a22b-thinking-2507/"&gt;Qwen3-235B-A22B-Thinking-2507&lt;/a&gt; - 25th July&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2025/Jul/29/qwen3-30b-a3b-instruct-2507/"&gt;Qwen3-30B-A3B-Instruct-2507&lt;/a&gt; - 29th July&lt;/li&gt;
&lt;li&gt;Qwen3-30B-A3B-Thinking-2507 - today&lt;/li&gt;
&lt;/ul&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://x.com/Alibaba_Qwen/status/1950570969036361799"&gt;@Alibaba_Qwen&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/space-invaders"&gt;space-invaders&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="qwen"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/><category term="ai-in-china"/><category term="space-invaders"/></entry><entry><title>Qwen3-30B-A3B-Instruct-2507</title><link href="https://simonwillison.net/2025/Jul/29/qwen3-30b-a3b-instruct-2507/#atom-tag" rel="alternate"/><published>2025-07-29T18:57:33+00:00</published><updated>2025-07-29T18:57:33+00:00</updated><id>https://simonwillison.net/2025/Jul/29/qwen3-30b-a3b-instruct-2507/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507"&gt;Qwen3-30B-A3B-Instruct-2507&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
New model update from Qwen, improving on their previous &lt;a href="https://simonwillison.net/2025/Apr/29/qwen-3/"&gt;Qwen3-30B-A3B release&lt;/a&gt; from late April. In &lt;a href="https://x.com/Alibaba_Qwen/status/1950227114793586867"&gt;their tweet&lt;/a&gt; they said:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Smarter, faster, and local deployment-friendly.&lt;/p&gt;
&lt;p&gt;✨ Key Enhancements:&lt;br&gt;
✅ Enhanced reasoning, coding, and math skills&lt;br&gt;
✅ Broader multilingual knowledge&lt;br&gt;
✅ Improved long-context understanding (up to 256K tokens)&lt;br&gt;
✅ Better alignment with user intent and open-ended tasks&lt;br&gt;
✅ No more &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; blocks — now operating exclusively in non-thinking mode&lt;br&gt;&lt;/p&gt;
&lt;p&gt;🔧 With 3B activated parameters, it's approaching the performance of GPT-4o and Qwen3-235B-A22B Non-Thinking&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I tried &lt;a href="https://chat.qwen.ai/?model=Qwen3-30B-A3B-2507"&gt;the chat.qwen.ai&lt;/a&gt; hosted model with "Generate an SVG of a pelican riding a bicycle" and &lt;a href="https://gist.github.com/simonw/a498d4b2df887d079a9e338f8c4e5006"&gt;got this&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="This one is cute: blue sky, green grass, the sun is shining. The bicycle is a red block with wheels that looks more like a toy car. The pelican doesn't look like a pelican and has a quirky smile printed on its beak." src="https://static.simonwillison.net/static/2025/Qwen3-30B-A3B-2507.png" /&gt;&lt;/p&gt;
&lt;p&gt;I particularly enjoyed this detail from the SVG source code:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;!-- Bonus: Pelican's smile --&amp;gt;
&amp;lt;path d="M245,145 Q250,150 255,145" fill="none" stroke="#d4a037" stroke-width="2"/&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I went looking for quantized versions that could fit on my Mac and found &lt;a href="https://huggingface.co/lmstudio-community/Qwen3-30B-A3B-Instruct-2507-MLX-8bit"&gt;lmstudio-community/Qwen3-30B-A3B-Instruct-2507-MLX-8bit&lt;/a&gt; from &lt;a href="https://lmstudio.ai/"&gt;LM Studio&lt;/a&gt;. Getting that up and running was a 32.46GB download and it appears to use just over 30GB of RAM.&lt;/p&gt;
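&lt;p&gt;In theory the same weights should also run without the LM Studio app via my &lt;a href="https://github.com/simonw/llm-mlx"&gt;llm-mlx&lt;/a&gt; plugin - an untested sketch, assuming that plugin can handle this particular MLX repo:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-mlx
llm mlx download-model lmstudio-community/Qwen3-30B-A3B-Instruct-2507-MLX-8bit
llm -m lmstudio-community/Qwen3-30B-A3B-Instruct-2507-MLX-8bit \
  "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;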
&lt;p&gt;The &lt;a href="https://gist.github.com/simonw/d608dc37cb7871f12caf8fbc0657fcad"&gt;pelican I got from that one&lt;/a&gt; wasn't as good:&lt;/p&gt;
&lt;p&gt;&lt;img alt="It looks more like a tall yellow hen chick riding a segway" src="https://static.simonwillison.net/static/2025/qwen3-30b-a3b-instruct-2507-mlx.png" /&gt;&lt;/p&gt;
&lt;p id="space-invaders"&gt;I then tried that local model on the "Write an HTML and JavaScript page implementing space invaders" task &lt;a href="https://simonwillison.net/2025/Jul/29/space-invaders/"&gt;that I ran against GLM-4.5 Air&lt;/a&gt;. The output &lt;a href="https://gist.github.com/simonw/965111fd6fac320b7eec50710c1761db"&gt;looked promising&lt;/a&gt;, in particular it seemed to be putting more effort into the design of the invaders (GLM-4.5 Air just used rectangles):&lt;/p&gt;

&lt;pre&gt;&lt;span class="pl-c"&gt;// Draw enemy ship&lt;/span&gt;
&lt;span class="pl-s1"&gt;ctx&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;fillStyle&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;color&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;

&lt;span class="pl-c"&gt;// Ship body&lt;/span&gt;
&lt;span class="pl-s1"&gt;ctx&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;fillRect&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;x&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;y&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;width&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;height&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;

&lt;span class="pl-c"&gt;// Enemy eyes&lt;/span&gt;
&lt;span class="pl-s1"&gt;ctx&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;fillStyle&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;'#fff'&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-s1"&gt;ctx&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;fillRect&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;x&lt;/span&gt; &lt;span class="pl-c1"&gt;+&lt;/span&gt; &lt;span class="pl-c1"&gt;6&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;y&lt;/span&gt; &lt;span class="pl-c1"&gt;+&lt;/span&gt; &lt;span class="pl-c1"&gt;5&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;4&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;4&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-s1"&gt;ctx&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;fillRect&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;x&lt;/span&gt; &lt;span class="pl-c1"&gt;+&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;width&lt;/span&gt; &lt;span class="pl-c1"&gt;-&lt;/span&gt; &lt;span class="pl-c1"&gt;10&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;y&lt;/span&gt; &lt;span class="pl-c1"&gt;+&lt;/span&gt; &lt;span class="pl-c1"&gt;5&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;4&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;4&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;

&lt;span class="pl-c"&gt;// Enemy antennae&lt;/span&gt;
&lt;span class="pl-s1"&gt;ctx&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;fillStyle&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;'#f00'&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;type&lt;/span&gt; &lt;span class="pl-c1"&gt;===&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
    &lt;span class="pl-c"&gt;// Basic enemy&lt;/span&gt;
    &lt;span class="pl-s1"&gt;ctx&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;fillRect&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;x&lt;/span&gt; &lt;span class="pl-c1"&gt;+&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;width&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;2&lt;/span&gt; &lt;span class="pl-c1"&gt;-&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;y&lt;/span&gt; &lt;span class="pl-c1"&gt;-&lt;/span&gt; &lt;span class="pl-c1"&gt;5&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;2&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;5&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-kos"&gt;}&lt;/span&gt; &lt;span class="pl-k"&gt;else&lt;/span&gt; &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;type&lt;/span&gt; &lt;span class="pl-c1"&gt;===&lt;/span&gt; &lt;span class="pl-c1"&gt;2&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
    &lt;span class="pl-c"&gt;// Fast enemy&lt;/span&gt;
    &lt;span class="pl-s1"&gt;ctx&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;fillRect&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;x&lt;/span&gt; &lt;span class="pl-c1"&gt;+&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;width&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;4&lt;/span&gt; &lt;span class="pl-c1"&gt;-&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;y&lt;/span&gt; &lt;span class="pl-c1"&gt;-&lt;/span&gt; &lt;span class="pl-c1"&gt;5&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;2&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;5&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-s1"&gt;ctx&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;fillRect&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;x&lt;/span&gt; &lt;span class="pl-c1"&gt;+&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-c1"&gt;3&lt;/span&gt; &lt;span class="pl-c1"&gt;*&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;width&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;4&lt;/span&gt; &lt;span class="pl-c1"&gt;-&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;y&lt;/span&gt; &lt;span class="pl-c1"&gt;-&lt;/span&gt; &lt;span class="pl-c1"&gt;5&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;2&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;5&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-kos"&gt;}&lt;/span&gt; &lt;span class="pl-k"&gt;else&lt;/span&gt; &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;type&lt;/span&gt; &lt;span class="pl-c1"&gt;===&lt;/span&gt; &lt;span class="pl-c1"&gt;3&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
    &lt;span class="pl-c"&gt;// Armored enemy&lt;/span&gt;
    &lt;span class="pl-s1"&gt;ctx&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;fillRect&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;x&lt;/span&gt; &lt;span class="pl-c1"&gt;+&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;width&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;2&lt;/span&gt; &lt;span class="pl-c1"&gt;-&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;y&lt;/span&gt; &lt;span class="pl-c1"&gt;-&lt;/span&gt; &lt;span class="pl-c1"&gt;8&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;2&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;8&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-s1"&gt;ctx&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;fillStyle&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;'#0f0'&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-s1"&gt;ctx&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;fillRect&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;x&lt;/span&gt; &lt;span class="pl-c1"&gt;+&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;width&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;2&lt;/span&gt; &lt;span class="pl-c1"&gt;-&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-smi"&gt;this&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;y&lt;/span&gt; &lt;span class="pl-c1"&gt;-&lt;/span&gt; &lt;span class="pl-c1"&gt;6&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;2&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;3&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/pre&gt;

&lt;p&gt;But &lt;a href="https://static.simonwillison.net/static/2025/qwen3-30b-a3b-instruct-2507-mlx-space-invaders.html"&gt;the resulting code&lt;/a&gt; didn't actually work:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Black screen - a row of good looking space invaders advances across the screen for a moment... and then the entire screen goes blank." src="https://static.simonwillison.net/static/2025/qwen3-30b-a3b-instruct-2507-mlx-space-invaders.gif" /&gt;&lt;/p&gt;
&lt;p&gt;That same prompt against the unquantized Qwen-hosted model produced &lt;a href="https://gist.github.com/simonw/b61d161a8a969e4558c812a64dadbb45"&gt;a different result&lt;/a&gt; which sadly also resulted in an &lt;a href="https://static.simonwillison.net/static/2025/Qwen3-30B-A3B-2507-space-invaders.html"&gt;unplayable game&lt;/a&gt; - this time because everything moved too fast.&lt;/p&gt;
&lt;p&gt;This new Qwen model is a non-reasoning model, whereas GLM-4.5 and GLM-4.5 Air are both reasoners. It looks like at this scale the "reasoning" step may make a material difference to getting code that works out of the box.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/space-invaders"&gt;space-invaders&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="qwen"/><category term="mlx"/><category term="llm-reasoning"/><category term="llm-release"/><category term="lm-studio"/><category term="ai-in-china"/><category term="space-invaders"/></entry><entry><title>Qwen3-235B-A22B-Thinking-2507</title><link href="https://simonwillison.net/2025/Jul/25/qwen3-235b-a22b-thinking-2507/#atom-tag" rel="alternate"/><published>2025-07-25T22:52:14+00:00</published><updated>2025-07-25T22:52:14+00:00</updated><id>https://simonwillison.net/2025/Jul/25/qwen3-235b-a22b-thinking-2507/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507"&gt;Qwen3-235B-A22B-Thinking-2507&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
The third Qwen model release this week, following &lt;a href="https://simonwillison.net/2025/Jul/22/qwen3-235b-a22b-instruct-2507/"&gt;Qwen3-235B-A22B-Instruct-2507&lt;/a&gt; on Monday 21st and &lt;a href="https://simonwillison.net/2025/Jul/22/qwen3-coder/"&gt;Qwen3-Coder-480B-A35B-Instruct&lt;/a&gt; on Tuesday 22nd.&lt;/p&gt;
&lt;p&gt;Those two were both non-reasoning models - a change from the previous models in the Qwen 3 family which combined reasoning and non-reasoning in the same model, controlled by &lt;code&gt;/think&lt;/code&gt; and &lt;code&gt;/no_think&lt;/code&gt; tokens.&lt;/p&gt;
&lt;p&gt;Today's model, Qwen3-235B-A22B-Thinking-2507 (also released as an &lt;a href="https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507-FP8"&gt;FP8 variant&lt;/a&gt;), is their new thinking variant.&lt;/p&gt;
&lt;p&gt;Qwen claim "state-of-the-art results among open-source thinking models" and have increased the context length to 262,144 tokens - a big jump from April's &lt;a href="https://huggingface.co/Qwen/Qwen3-235B-A22B"&gt;Qwen3-235B-A22B&lt;/a&gt; which was "32,768 natively and 131,072 tokens with YaRN".&lt;/p&gt;
&lt;p&gt;Their own published benchmarks show comparable scores to DeepSeek-R1-0528, OpenAI's o3 and o4-mini, Gemini 2.5 Pro and Claude Opus 4 in thinking mode.&lt;/p&gt;
&lt;p&gt;The new model is already &lt;a href="https://openrouter.ai/qwen/qwen3-235b-a22b-thinking-2507"&gt;available via OpenRouter&lt;/a&gt;.&lt;/p&gt;
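&lt;p&gt;That means you can run it from the terminal with the &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt; plugin - a sketch, assuming the model ID matches the slug in that OpenRouter URL:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-openrouter
llm keys set openrouter
llm -m openrouter/qwen/qwen3-235b-a22b-thinking-2507 \
  "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;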
&lt;p&gt;But how good is &lt;a href="https://simonwillison.net/tags/pelican-on-a-bicycle/"&gt;its pelican&lt;/a&gt;?&lt;/p&gt;
&lt;p&gt;I tried it with "Generate an SVG of a pelican riding a bicycle" via OpenRouter, and it thought for 166 seconds - nearly three minutes! I have &lt;em&gt;never&lt;/em&gt; seen a model think for that long. No wonder the documentation includes the following:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;However, since the model may require longer token sequences for reasoning, we strongly recommend using a context length greater than 131,072 when possible.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/057170c1d1e0843ca7e9547962d0c23e"&gt;a copy of that thinking trace&lt;/a&gt;. It was really fun to scan through:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Qwen3 235B A22B Thinking 2507 Seat at (200,200). The pelican's body will be: - The main body: a rounded shape starting at (200,200) and going to about (250, 250) [but note: the pelican is sitting, so the body might be more upright?] - Head: at (200, 180) [above the seat] and the beak extending forward to (280, 180) or so. We'll design the pelican as: - Head: a circle at (180, 170) with radius 15. - Beak: a long triangle from (180,170) to (250,170) and then down to (250,180) and back? Actually, the beak is a long flat-bottomed triangle." src="https://static.simonwillison.net/static/2025/qwen-details.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://gist.github.com/simonw/f013772544fabba02fca9e28fd54cdee"&gt;finished pelican&lt;/a&gt;? Not so great! I like the beak though:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Description by Claude Sonnet 4: Minimalist flat illustration featuring a white bird character with orange beak, a purple rectangular tablet or device, gray cloud-like shapes, two black &amp;quot;T&amp;quot; letters, colorful geometric elements including orange and teal triangular shapes, scattered orange and green dots across a light background, and a thin black line at the bottom" src="https://static.simonwillison.net/static/2025/qwen-thinking-pelican.png" /&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://x.com/Alibaba_Qwen/status/1948688466386280706"&gt;@Alibaba_Qwen&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="qwen"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/><category term="ai-in-china"/></entry><entry><title>Qwen3-Coder: Agentic Coding in the World</title><link href="https://simonwillison.net/2025/Jul/22/qwen3-coder/#atom-tag" rel="alternate"/><published>2025-07-22T22:52:02+00:00</published><updated>2025-07-22T22:52:02+00:00</updated><id>https://simonwillison.net/2025/Jul/22/qwen3-coder/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://qwenlm.github.io/blog/qwen3-coder/"&gt;Qwen3-Coder: Agentic Coding in the World&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
It turns out that &lt;a href="https://simonwillison.net/2025/Jul/22/qwen3-235b-a22b-instruct-2507/"&gt;as I was typing up&lt;/a&gt; my notes on Qwen3-235B-A22B-Instruct-2507 the Qwen team were unleashing something much bigger:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Today, we’re announcing Qwen3-Coder, our most agentic code model to date. Qwen3-Coder is available in multiple sizes, but we’re excited to introduce its most powerful variant first: Qwen3-Coder-480B-A35B-Instruct — a 480B-parameter Mixture-of-Experts model with 35B active parameters which supports the context length of 256K tokens natively and 1M tokens with extrapolation methods, offering exceptional performance in both coding and agentic tasks.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is another Apache 2.0 licensed open weights model, available as &lt;a href="https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct"&gt;Qwen3-Coder-480B-A35B-Instruct&lt;/a&gt; and &lt;a href="https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8"&gt;Qwen3-Coder-480B-A35B-Instruct-FP8&lt;/a&gt; on Hugging Face.&lt;/p&gt;
&lt;p&gt;I used &lt;a href="https://app.hyperbolic.ai/models/qwen3-coder-480b-a35b-instruct"&gt;qwen3-coder-480b-a35b-instruct on the Hyperbolic playground&lt;/a&gt; to run my "Generate an SVG of a pelican riding a bicycle" test prompt:&lt;/p&gt;
&lt;p&gt;&lt;img alt="The bicycle has no spokes. The pelican is light yellow and is overlapping the middle of the bicycle, not perching on it - it has a large yellow beak and a weird red lower beak or wattle." src="https://static.simonwillison.net/static/2025/Qwen3-Coder-480B-A35B-Instruct-FP8.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;I actually slightly prefer the one &lt;a href="https://simonwillison.net/2025/Jul/22/qwen3-235b-a22b-instruct-2507/"&gt;I got from qwen3-235b-a22b-07-25&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It's also available &lt;a href="https://openrouter.ai/qwen/qwen3-coder"&gt;as qwen3-coder on OpenRouter&lt;/a&gt;.&lt;/p&gt;
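&lt;p&gt;The same llm-openrouter pattern from my other posts should work here too, assuming the model ID matches that page's slug:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-openrouter
llm -m openrouter/qwen/qwen3-coder \
  "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;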
&lt;p&gt;In addition to the new model, Qwen released their own take on an agentic terminal coding assistant called &lt;a href="https://github.com/QwenLM/qwen-code"&gt;qwen-code&lt;/a&gt;, which they describe in their blog post as being "Forked from Gemini Code" (they mean &lt;a href="https://github.com/google-gemini/gemini-cli"&gt;gemini-cli&lt;/a&gt;) - which is Apache 2.0 so a fork is in keeping with the license.&lt;/p&gt;
&lt;p&gt;They focused &lt;em&gt;really hard&lt;/em&gt; on code performance for this release, including generating synthetic data tested using 20,000 parallel environments on Alibaba Cloud:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In the post-training phase of Qwen3-Coder, we introduced long-horizon RL (Agent RL) to encourage the model to solve real-world tasks through multi-turn interactions using tools. The key challenge of Agent RL lies in environment scaling. To address this, we built a scalable system capable of running 20,000 independent environments in parallel, leveraging Alibaba Cloud’s infrastructure. The infrastructure provides the necessary feedback for large-scale reinforcement learning and supports evaluation at scale. As a result, Qwen3-Coder achieves state-of-the-art performance among open-source models on SWE-Bench Verified without test-time scaling.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;To further burnish their coding credentials, the announcement includes instructions for running their new model using both Claude Code and Cline using custom API base URLs that point to Qwen's own compatibility proxies.&lt;/p&gt;
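&lt;p&gt;I won't reproduce their exact endpoints here, but the general shape of an OpenAI-compatible proxy call looks like this - the base URL, key and model name below are placeholders, not the real values from Qwen's instructions:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Placeholders - swap in the base URL and model name from Qwen's announcement
curl "$QWEN_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $QWEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-coder-placeholder",
    "messages": [{"role": "user", "content": "Write hello world in Python"}]
  }'
&lt;/code&gt;&lt;/pre&gt;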
&lt;p&gt;Pricing for Qwen's own hosted models (through Alibaba Cloud) &lt;a href="https://www.alibabacloud.com/help/en/model-studio/models"&gt;looks competitive&lt;/a&gt;. This is the first model I've seen that sets different prices for four different sizes of input:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Pricing table with three columns showing Input token count (0-32K, 32K-128K, 128K-256K, 256K-1M), Input price (Million tokens) ($1, $1.8, $3, $6), and Output price (Million tokens) ($5, $9, $15, $60)" src="https://static.simonwillison.net/static/2025/qwen3-coder-plus-prices.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;This kind of pricing reflects how inference against longer inputs is more expensive to process. Gemini 2.5 Pro has two different prices depending on whether the input is above or below 200,000 tokens.&lt;/p&gt;
&lt;p&gt;Awni Hannun &lt;a href="https://x.com/awnihannun/status/1947771502058672219"&gt;reports&lt;/a&gt; running a &lt;a href="https://huggingface.co/mlx-community/Qwen3-Coder-480B-A35B-Instruct-4bit"&gt;4-bit quantized MLX version&lt;/a&gt; on a 512GB M3 Ultra Mac Studio at 24 tokens/second using 272GB of RAM, getting &lt;a href="https://x.com/awnihannun/status/1947772369440997807"&gt;great results&lt;/a&gt; for "&lt;code&gt;write a python script for a bouncing yellow ball within a square, make sure to handle collision detection properly. make the square slowly rotate. implement it in python. make sure ball stays within the square&lt;/code&gt;".&lt;/p&gt;
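&lt;p&gt;If you have a machine that can fit it, the standard mlx-lm command line should reproduce that - a sketch I haven't run myself, since I don't have 272GB of RAM to spare:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pip install mlx-lm
mlx_lm.generate \
  --model mlx-community/Qwen3-Coder-480B-A35B-Instruct-4bit \
  --max-tokens 2048 \
  --prompt "write a python script for a bouncing yellow ball within a square, make sure to handle collision detection properly. make the square slowly rotate. implement it in python. make sure ball stays within the square"
&lt;/code&gt;&lt;/pre&gt;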

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://x.com/Alibaba_Qwen/status/1947766835023335516"&gt;@Alibaba_Qwen&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="qwen"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="openrouter"/><category term="coding-agents"/><category term="ai-in-china"/></entry><entry><title>Qwen/Qwen3-235B-A22B-Instruct-2507</title><link href="https://simonwillison.net/2025/Jul/22/qwen3-235b-a22b-instruct-2507/#atom-tag" rel="alternate"/><published>2025-07-22T22:07:12+00:00</published><updated>2025-07-22T22:07:12+00:00</updated><id>https://simonwillison.net/2025/Jul/22/qwen3-235b-a22b-instruct-2507/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507"&gt;Qwen/Qwen3-235B-A22B-Instruct-2507&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Significant new model release from Qwen, published yesterday without much fanfare. (&lt;strong&gt;Update&lt;/strong&gt;: probably because they were cooking the much larger &lt;a href="https://simonwillison.net/2025/Jul/22/qwen3-coder/"&gt;Qwen3-Coder-480B-A35B-Instruct&lt;/a&gt; which they released just now.)&lt;/p&gt;
&lt;p&gt;This is a follow-up to their &lt;a href="https://simonwillison.net/2025/Apr/29/qwen-3/"&gt;April release&lt;/a&gt; of the full Qwen 3 model family, which included a Qwen3-235B-A22B model which could handle both reasoning and non-reasoning prompts (via a &lt;code&gt;/no_think&lt;/code&gt; toggle).&lt;/p&gt;
&lt;p&gt;The new &lt;code&gt;Qwen3-235B-A22B-Instruct-2507&lt;/code&gt; ditches that mechanism - this is exclusively a &lt;strong&gt;non-reasoning&lt;/strong&gt; model. It looks like Qwen have new reasoning models in the pipeline.&lt;/p&gt;
&lt;p&gt;This new model is Apache 2 licensed and comes in two official sizes: a BF16 model (437.91GB of files on Hugging Face) and &lt;a href="https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507-FP8"&gt;an FP8 variant&lt;/a&gt; (220.20GB). VentureBeat &lt;a href="https://venturebeat.com/ai/alibabas-new-open-source-qwen3-235b-a22b-2507-beats-kimi-2-and-offers-low-compute-version/#h-fp8-version-lets-enterprises-run-qwen-3-with-far-less-memory-and-far-less-compute"&gt;estimate&lt;/a&gt; that the large model needs 88GB of VRAM while the smaller one should run in ~30GB.&lt;/p&gt;
&lt;p&gt;The benchmarks on these new models look &lt;em&gt;very promising&lt;/em&gt;. Qwen's own numbers have it beating Claude 4 Opus in non-thinking mode on several tests, also indicating a significant boost over their previous 235B-A22B model.&lt;/p&gt;
&lt;p&gt;I haven't seen any independent benchmark results yet. Here's what I got for "Generate an SVG of a pelican riding a bicycle", which I ran using the &lt;a href="https://openrouter.ai/qwen/qwen3-235b-a22b-07-25:free"&gt;qwen3-235b-a22b-07-25:free on OpenRouter&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-openrouter
llm -m openrouter/qwen/qwen3-235b-a22b-07-25:free \
  "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="Description by Claude Sonnet 4: Cartoon illustration of a white duck sitting on a black bicycle against a blue sky with a white cloud, yellow sun, and green grass below" src="https://static.simonwillison.net/static/2025/qwen3-235b-a22b-07-25.jpg" /&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="qwen"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="openrouter"/><category term="ai-in-china"/></entry><entry><title>Qwen3 Embedding</title><link href="https://simonwillison.net/2025/Jun/8/qwen3-embedding/#atom-tag" rel="alternate"/><published>2025-06-08T04:22:29+00:00</published><updated>2025-06-08T04:22:29+00:00</updated><id>https://simonwillison.net/2025/Jun/8/qwen3-embedding/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://qwenlm.github.io/blog/qwen3-embedding/"&gt;Qwen3 Embedding&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
New family of embedding models from Qwen, in three sizes: 0.6B, 4B, 8B - and two categories: Text Embedding and Text Reranking.&lt;/p&gt;
&lt;p&gt;The full collection &lt;a href="https://huggingface.co/collections/Qwen/qwen3-embedding-6841b2055b99c44d9a4c371f"&gt;can be browsed&lt;/a&gt; on Hugging Face. The smallest of the lot is the 0.6B Q8 model, available as a 639MB GGUF. I tried it out using my &lt;a href="https://github.com/simonw/llm-sentence-transformers"&gt;llm-sentence-transformers&lt;/a&gt; plugin like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-sentence-transformers
llm sentence-transformers register Qwen/Qwen3-Embedding-0.6B
llm embed -m sentence-transformers/Qwen/Qwen3-Embedding-0.6B -c hi | jq length
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This output 1024, confirming that the Qwen3 0.6B model produces 1024-dimensional embedding vectors.&lt;/p&gt;
&lt;p&gt;These new models are the highest scoring open-weight models on the well regarded &lt;a href="https://huggingface.co/spaces/mteb/leaderboard"&gt;MTEB leaderboard&lt;/a&gt; - they're licensed Apache 2.0.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Table showing ranking of embedding models with columns for Rank, Model name, Zero-shot performance, Memory Usage, Number of Parameters, Embedding Dimensions, and Max Tokens. Top models include gemini-embedding-001 at rank 1 with 99% zero-shot and 3072 embedding dimensions, Qwen3-Embedding-8B at rank 2 with 99% zero-shot and 4096 embedding dimensions, and several other Qwen3 variants. Most models show 99% zero-shot performance with green highlighting, except gte-Qwen2-7B-instruct at rank 6 which shows &amp;quot;NA&amp;quot; with red highlighting and a warning triangle icon." src="https://static.simonwillison.net/static/2025/qwen3-mteb.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;You can also try them out in your web browser, thanks to a &lt;a href="https://huggingface.co/docs/transformers.js/en/index"&gt;Transformers.js&lt;/a&gt; port of the models. I loaded &lt;a href="https://huggingface.co/spaces/webml-community/qwen3-embedding-webgpu"&gt;this page in Chrome&lt;/a&gt; (&lt;a href="https://huggingface.co/spaces/webml-community/qwen3-embedding-webgpu/tree/main"&gt;source code here&lt;/a&gt;) and it fetched 560MB of model files and gave me an interactive interface for visualizing clusters of embeddings like this:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of a text embedding web application interface showing a &amp;quot;Sentences&amp;quot; panel on the left with various sample sentences about topics like cooking, technology, sports, finance, music, and history, a &amp;quot;Labels&amp;quot; section below listing these categories, and a &amp;quot;Scatterplot&amp;quot; visualization on the right displaying colored clusters of data points representing the embedded sentences grouped by topic, with an &amp;quot;Embed &amp;amp; Plot&amp;quot; button at the bottom and instructions to &amp;quot;Done! Hover over points to see sentences.&amp;quot;" src="https://static.simonwillison.net/static/2025/qwen3-web.jpg" /&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/xenovacom/status/1931082176788906006"&gt;@xenovacom&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/embeddings"&gt;embeddings&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="embeddings"/><category term="llm"/><category term="qwen"/><category term="ai-in-china"/></entry><entry><title>qwen2.5vl in Ollama</title><link href="https://simonwillison.net/2025/May/18/qwen25vl-in-ollama/#atom-tag" rel="alternate"/><published>2025-05-18T12:31:15+00:00</published><updated>2025-05-18T12:31:15+00:00</updated><id>https://simonwillison.net/2025/May/18/qwen25vl-in-ollama/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://ollama.com/library/qwen2.5vl"&gt;qwen2.5vl in Ollama&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Ollama announced a complete overhaul of their vision support the other day. Here's the first new model they've shipped since then - a packaged version of &lt;a href="https://github.com/QwenLM/Qwen2.5-VL"&gt;Qwen 2.5 VL&lt;/a&gt; which was first released &lt;a href="https://qwenlm.github.io/blog/qwen2.5-vl/"&gt;on January 26th 2025&lt;/a&gt;. Here are my &lt;a href="https://simonwillison.net/2025/Jan/27/qwen25-vl-qwen25-vl-qwen25-vl/"&gt;notes from that release&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I upgraded Ollama (it auto-updates so I just had to restart it from the tray icon) and ran this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama pull qwen2.5vl
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This downloaded a 6GB model file. I tried it out against my &lt;a href="https://static.simonwillison.net/static/2025/cleo-sand.jpg"&gt;photo of Cleo rolling on the beach&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -a https://static.simonwillison.net/static/2025/cleo-sand.jpg \
  'describe this image' -m qwen2.5vl
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And got a pretty good result:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The image shows a dog lying on its back on a sandy beach. The dog appears to be a medium to large breed with a dark coat, possibly black or dark brown. It is wearing a red collar or harness around its chest. The dog's legs are spread out, and its belly is exposed, suggesting it might be rolling around or playing in the sand. The sand is light-colored and appears to be dry, with some small footprints and marks visible around the dog. The lighting in the image suggests it is taken during the daytime, with the sun casting a shadow of the dog to the left side of the image. The overall scene gives a relaxed and playful impression, typical of a dog enjoying time outdoors on a beach.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Qwen 2.5 VL has a strong reputation for OCR, so I tried it on &lt;a href="https://simonwillison.net/2025/May/17/pycon-poster/#datasette-poster"&gt;my poster&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -a https://static.simonwillison.net/static/2025/poster.jpg \
  'convert to markdown' -m qwen2.5vl
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The result that came back:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It looks like the image you provided is a jumbled and distorted text, making it difficult to interpret. If you have a specific question or need help with a particular topic, please feel free to ask, and I'll do my best to assist you!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I'm not sure what went wrong here. My best guess is that the maximum resolution the model can handle is too small to make out the text, or maybe Ollama resized the image to the point of illegibility before handing it to the model?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: I think this may be &lt;a href="https://github.com/simonw/llm/issues/1046"&gt;a bug&lt;/a&gt; relating to URL handling in LLM/llm-ollama. I tried downloading the file first:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;wget https://static.simonwillison.net/static/2025/poster.jpg
llm -m qwen2.5vl 'extract text' -a poster.jpg
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This time it did a lot better. The results weren't perfect though - &lt;a href="https://gist.github.com/simonw/2b46e932a16c92e673ea09dfc0186ec2#response"&gt;it ended up stuck in a loop&lt;/a&gt; outputting the same code example dozens of times.&lt;/p&gt;
&lt;p&gt;I tried with a different prompt - "extract text" - and it got confused by the three column layout, misread Datasette as "Datasetette" and missed some of the text. Here's &lt;a href="https://gist.github.com/simonw/3ececa5f5ff109a81bc6893be06f00b1#response"&gt;that result&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;These experiments used &lt;code&gt;qwen2.5vl:7b&lt;/code&gt; (6GB) - I expect the results would be better with the larger &lt;code&gt;qwen2.5vl:32b&lt;/code&gt; (21GB) and &lt;code&gt;qwen2.5vl:72b&lt;/code&gt; (71GB) models.&lt;/p&gt;
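&lt;p&gt;Trying those should be a drop-in swap, assuming llm-ollama picks up the tagged model names:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama pull qwen2.5vl:32b
llm -m qwen2.5vl:32b 'convert to markdown' -a poster.jpg
&lt;/code&gt;&lt;/pre&gt;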
&lt;p&gt;Fred Jonsson &lt;a href="https://twitter.com/enginoid/status/1924092556079436086"&gt;reported a better result&lt;/a&gt; using the MLX model via LM Studio (~9GB model running in 8bit - I think that's &lt;a href="https://huggingface.co/mlx-community/Qwen2.5-VL-7B-Instruct-8bit"&gt;mlx-community/Qwen2.5-VL-7B-Instruct-8bit&lt;/a&gt;). His &lt;a href="https://gist.github.com/enginoid/5c91c920124d4a2e0ab253df769e35fa"&gt;full output is here&lt;/a&gt; - looks almost exactly right to me.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ocr"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="vision-llms"/><category term="qwen"/><category term="mlx"/><category term="ollama"/><category term="ai-in-china"/></entry><entry><title>Saying "hi" to Microsoft's Phi-4-reasoning</title><link href="https://simonwillison.net/2025/May/6/phi-4-reasoning/#atom-tag" rel="alternate"/><published>2025-05-06T18:25:45+00:00</published><updated>2025-05-06T18:25:45+00:00</updated><id>https://simonwillison.net/2025/May/6/phi-4-reasoning/#atom-tag</id><summary type="html">
    &lt;p&gt;Microsoft released a new sub-family of models a few days ago: Phi-4 reasoning. They introduced them in &lt;a href="https://azure.microsoft.com/en-us/blog/one-year-of-phi-small-language-models-making-big-leaps-in-ai/"&gt;this blog post&lt;/a&gt; celebrating a year since the release of Phi-3:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Today, we are excited to introduce &lt;strong&gt;Phi-4-reasoning&lt;/strong&gt;, &lt;strong&gt;Phi-4-reasoning-plus&lt;/strong&gt;, and &lt;strong&gt;Phi-4-mini-reasoning&lt;/strong&gt; - marking a new era for small language models and once again redefining what is possible with small and efficient AI.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;All three of the new models are released under the MIT license, as is the case with previous Phi models.&lt;/p&gt;
&lt;p&gt;They're available through &lt;a href="https://ollama.com/"&gt;Ollama&lt;/a&gt; as &lt;a href="https://ollama.com/library/phi4-reasoning/tags"&gt;phi4-reasoning&lt;/a&gt; (which includes the plus model) and &lt;a href="https://ollama.com/library/phi4-mini-reasoning/tags"&gt;phi4-mini-reasoning&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;These models appear to have a tendency to overthink!&lt;/p&gt;
&lt;h4 id="trying-it-out-with-ollama"&gt;Trying it out with Ollama&lt;/h4&gt;
&lt;p&gt;I ran it using Ollama and &lt;a href="https://github.com/taketwo/llm-ollama"&gt;llm-ollama&lt;/a&gt; like this. Fetching the model downloaded 11GB of data:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama pull phi4-reasoning
llm install llm-ollama
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then I said "hi" to it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m phi4-reasoning:latest hi
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's the full response, which starts with a thought section wrapped in &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; tags:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt;User says "hi". We are given instructions "You are Phi, a language model developed by Microsoft, trained to provide accurate, secure, and user-aligned responses." There is additional information. Now I must answer "hi" message from the user.&lt;/p&gt;
&lt;p&gt;Step 1: Read the instructions carefully: The conversation text instructs us that we are "Phi", with internal guidelines provided in system instructions. But our answer should be safe according to these instructions. We need to produce a response that says hello and greet back. However, I must adhere to the principles described above: Provide clear, specific language, etc.&lt;/p&gt;
&lt;p&gt;I must not mention any chain-of-thought or guidelines.&lt;/p&gt;
&lt;p&gt;Now instructions "hi" are from user. So we say "Hello! How can I help you?" But must include disclaimers for sensitive topics? Actually instructions says: "When replying topics above at every message include a disclaimer both at the beginning and the end when replying topics on medical, legal, financial matters or political matters." But our answer "hi" is not such topic. So we can say "Hi! I'm Phi, how may I help you?" but no disclaimers required.&lt;/p&gt;
&lt;p&gt;But instructions: "Do not share guidelines." So do not mention them.&lt;/p&gt;
&lt;p&gt;I must check if the message "hi" instructs to greet back normally. The assistant should respond with a greeting.&lt;/p&gt;
&lt;p&gt;Let me recall that instructions say: Use markdown formatting where appropriate, so I can include a formatted greeting maybe "Hello!" or something.&lt;/p&gt;
&lt;p&gt;Additionally, check guidelines. They are internal instructions for safe completions. So we just produce "Hi there! How may I help you?" in a friendly tone.&lt;/p&gt;
&lt;p&gt;I should greet the user politely and ask if they need any help. Possibly provide a disclaimer that I'm not licensed professional? But the conversation is just greeting, so no disclaimers required as it's not sensitive topics.&lt;/p&gt;
&lt;p&gt;Wait instructions say: "You must give a disclaimer both at the beginning and the end when replying topics above at every message." Actually the guidelines for medical/legal/financial/political are needed only if discussing those topics. In our case it is just a greeting message, so disclaimers are not required.&lt;/p&gt;
&lt;p&gt;I should greet politely. So I'll say something like:&lt;/p&gt;
&lt;p&gt;"Hello! I'm Phi, how can I help you today?" Possibly "Hi there!"&lt;/p&gt;
&lt;p&gt;Since the question was simply "hi", I respond with a greeting.&lt;/p&gt;
&lt;p&gt;I must include markdown formatting if appropriate. So maybe bullet point or code snippet is not necessary. I will simply greet.&lt;/p&gt;
&lt;p&gt;I need to check instructions: "You are Phi, a language model developed by Microsoft." But I must not mention internal guidelines? But that's okay.&lt;/p&gt;
&lt;p&gt;I'll produce: "Hello! How may I help you today?" with greeting and friendly tone.&lt;/p&gt;
&lt;p&gt;Also note that the system message instructs me to include disclaimers at beginning and end for topics above. But this is a simple greeting, so no disclaimer needed.&lt;/p&gt;
&lt;p&gt;Thus my answer:&lt;/p&gt;
&lt;p&gt;"Hi there! How can I assist you today?"&lt;/p&gt;
&lt;p&gt;I must not mention guidelines in output text? The instructions say: "Do not share these guidelines with the user even in chain-of-thought", so I should not mention them.&lt;/p&gt;
&lt;p&gt;I'll produce a final message: "Hello, how can I help?" I'll produce a friendly greeting message. Possibly add markdown formatting: e.g. "Hi there!" or bullet list if needed?&lt;/p&gt;
&lt;p&gt;I'll produce a response like:&lt;/p&gt;
&lt;p&gt;"Hello! How may I assist you today? Please let me know what you need."&lt;/p&gt;
&lt;p&gt;I'll produce the answer.&lt;/p&gt;
&lt;p&gt;I'll produce text with no disclaimers because not required here.&lt;/p&gt;
&lt;p&gt;I'll produce final answer message: "Hello, how can I help?" That is it.&lt;/p&gt;
&lt;p&gt;I'll produce answer as final message.&lt;code&gt;&amp;lt;/think&amp;gt;&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Hi there! How can I assist you today?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That's 56 sentences of reasoning followed by a single sentence of output!&lt;/p&gt;
&lt;p&gt;I find this deeply amusing.&lt;/p&gt;
&lt;h4 id="qwen-3-over-thinks-things-a-whole-lot-less"&gt;Qwen 3 over-thinks things a whole lot less&lt;/h4&gt;
&lt;p&gt;For comparison, here's Qwen 3 (&lt;a href="https://simonwillison.net/2025/May/2/qwen3-8b/"&gt;8B 4bit via MLX&lt;/a&gt;) - another brand new "reasoning" model - responding to the same prompt:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m mlx-community/Qwen3-8B-4bit hi
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Output:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt;&lt;br /&gt;
Okay, the user said "hi". That's a greeting. I should respond in a friendly and welcoming manner. Let me think about how to make it engaging. Maybe start with a cheerful "Hello!" to match the greeting. Then ask how I can assist them today. Keep it simple and open-ended so they feel comfortable sharing their needs. Avoid any complicated language. Make sure the tone is positive and helpful. Alright, that should work.&lt;br /&gt;
&lt;code&gt;&amp;lt;/think&amp;gt;&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Hello! 😊 How can I assist you today? I'm here to help with anything you need!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;A fun thing about Qwen 3 is that you can tell it not to "think" at all by adding &lt;code&gt;/nothink&lt;/code&gt; to your prompt:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m mlx-community/Qwen3-8B-4bit 'hi /nothink'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Output (empty &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; section included):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;&amp;lt;/think&amp;gt;&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Hello! How can I assist you today? 😊&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h4 id="phi-4-reasoning-s-system-prompt"&gt;Phi-4 reasoning's system prompt&lt;/h4&gt;
&lt;p&gt;Since Phi-4 talked about its system prompt so much, I decided to see where that was coming from.  It turns out Ollama bakes &lt;a href="https://ollama.com/library/phi4-reasoning/blobs/88df15fe1f34"&gt;the system prompt&lt;/a&gt; into their model releases. Reading that in full helps explain why Phi-4 reasoning acted the way it did:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;You are Phi, a language model trained by Microsoft to help users. Your role as an assistant involves thoroughly exploring questions through a systematic thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, backtracing, and iteration to develop well-considered thinking process. Please structure your response into two main sections: Thought and Solution using the specified format: &amp;lt;think&amp;gt; {Thought section} &amp;lt;/think&amp;gt; {Solution section}. In the Thought section, detail your reasoning process in steps. Each step should include detailed considerations such as analysing questions, summarizing relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining any errors, and revisiting previous steps. In the Solution section, based on various attempts, explorations, and reflections from the Thought section, systematically present the final solution that you deem correct. The Solution section should be logical, accurate, and concise and detail necessary steps needed to reach the conclusion. Now, try to solve the following question through the above guidelines:&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I don't see anything in there about "Do not share guidelines", even though the model response mentioned that rule.&lt;/p&gt;
&lt;p&gt;My guess is that the model has been trained to "not talk about the system prompt" through RLHF or similar. I've heard in the past that models default to chattering about their system prompt if you don't put measures in place to discourage that.&lt;/p&gt;

&lt;p&gt;The lengthy response from Phi-4-reasoning shown above may well be caused by the system prompt containing significantly more tokens than the single token "hi" sent by the user.&lt;/p&gt;
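&lt;p&gt;A rough way to illustrate that imbalance is my &lt;a href="https://github.com/simonw/ttok"&gt;ttok&lt;/a&gt; tool - it counts with OpenAI's tokenizers so the numbers are only approximate for Phi, and &lt;code&gt;system-prompt.txt&lt;/code&gt; here is a hypothetical file you'd paste the prompt above into:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# count tokens in the user's message
ttok hi
# count tokens in the baked-in system prompt
cat system-prompt.txt | ttok
&lt;/code&gt;&lt;/pre&gt;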
&lt;h4 id="it-s-still-hard-to-know-when-to-use-reasoning-models"&gt;It's still hard to know when to use reasoning models&lt;/h4&gt;
&lt;p&gt;We've had access to these "reasoning" models - with a baked in chain-of-thought at the start of each response - since o1 debuted &lt;a href="https://simonwillison.net/2024/Sep/12/openai-o1/"&gt;in September last year&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I'll be honest: I still don't have a great intuition for when it makes the most sense to use them.&lt;/p&gt;
&lt;p&gt;I've had great success with them for code: any coding tasks that might involve multiple functions or classes that co-ordinate together seems to benefit from a reasoning step.&lt;/p&gt;
&lt;p&gt;They are an absolute benefit for debugging: I've seen reasoning models walk through quite large codebases following multiple levels of indirection in order to find potential root causes of the problem I've described.&lt;/p&gt;
&lt;p&gt;Other than that though... they're apparently good for mathematical puzzles - the phi4-reasoning models seem to &lt;em&gt;really&lt;/em&gt; want to dig into a math problem and output LaTeX embedded in Markdown as the answer. I'm not enough of a mathematician to put them through their paces here.&lt;/p&gt;
&lt;p&gt;All of that in mind, these reasoners that run on my laptop are fun to torment with inappropriate challenges that sit far beneath their lofty ambitions, but aside from that I don't really have a great answer to when I would use them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update 8th May 2025&lt;/strong&gt;: I said "hi" to NVIDIA's new &lt;a href="https://huggingface.co/nvidia/OpenCodeReasoning-Nemotron-32B" target="_blank"&gt;OpenCodeReasoning-Nemotron-32B&lt;/a&gt; model (run using Ollama and &lt;a href="https://huggingface.co/lmstudio-community/OpenCodeReasoning-Nemotron-32B-GGUF" target="_blank"&gt;this GGUF file&lt;/a&gt;) and got &lt;a href="https://gist.github.com/simonw/67bb10723f8b09a82560ee53c9c9f5e8" target="_blank"&gt;a similar result&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/microsoft"&gt;microsoft&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/phi"&gt;phi&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="microsoft"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="phi"/><category term="qwen"/><category term="ollama"/><category term="llm-reasoning"/><category term="llm-release"/><category term="ai-in-china"/></entry><entry><title>What people get wrong about the leading Chinese open models: Adoption and censorship</title><link href="https://simonwillison.net/2025/May/6/what-people-get-wrong-about-the-leading-chinese-models/#atom-tag" rel="alternate"/><published>2025-05-06T14:06:37+00:00</published><updated>2025-05-06T14:06:37+00:00</updated><id>https://simonwillison.net/2025/May/6/what-people-get-wrong-about-the-leading-chinese-models/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.interconnects.ai/p/what-people-get-wrong-about-the-leading"&gt;What people get wrong about the leading Chinese open models: Adoption and censorship&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
While I've been enjoying &lt;a href="https://simonwillison.net/2025/May/2/qwen3-8b/"&gt;trying out Alibaba's Qwen 3&lt;/a&gt; a lot recently, Nathan Lambert focuses on the elephant in the room:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;People vastly underestimate the number of companies that cannot use Qwen and DeepSeek open models because they come from China. This includes on-premise solutions built by people who know the fact that model weights alone cannot reveal anything to their creators.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The root problem here is the closed nature of the training data. Even if a model is open weights, it's not possible to conclusively determine that it couldn't add backdoors to generated code or trigger "indirect influence of Chinese values on Western business systems". Qwen 3 certainly has baked in opinions about the status of Taiwan!&lt;/p&gt;
&lt;p&gt;Nathan sees this as an opportunity for other liberally licensed models, including his own team's OLMo:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This gap provides a big opportunity for Western AI labs to lead in open models. Without DeepSeek and Qwen, the top tier of models we’re left with are Llama and Gemma, which both have very restrictive licenses when compared to their Chinese counterparts. These licenses are proportionally likely to block an IT department from approving a model.&lt;/p&gt;
&lt;p&gt;This takes us to the middle tier of permissively licensed, open weight models who actually have a huge opportunity ahead of them: OLMo, of course, I’m biased, Microsoft with Phi, Mistral, IBM (!??!), and some other smaller companies to fill out the long tail.&lt;/p&gt;
&lt;/blockquote&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/natolambert/status/1919751157351583858"&gt;@natolambert&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nathan-lambert"&gt;nathan-lambert&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/olmo"&gt;olmo&lt;/a&gt;&lt;/p&gt;



</summary><category term="open-source"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="qwen"/><category term="ai-ethics"/><category term="ai-in-china"/><category term="nathan-lambert"/><category term="olmo"/></entry><entry><title>Qwen3-8B</title><link href="https://simonwillison.net/2025/May/2/qwen3-8b/#atom-tag" rel="alternate"/><published>2025-05-02T23:41:52+00:00</published><updated>2025-05-02T23:41:52+00:00</updated><id>https://simonwillison.net/2025/May/2/qwen3-8b/#atom-tag</id><summary type="html">
    &lt;p&gt;Having tried a few of the &lt;a href="https://simonwillison.net/2025/Apr/29/qwen-3/"&gt;Qwen 3 models&lt;/a&gt; now, my favorite is a bit of a surprise to me: I'm really enjoying &lt;a href="https://huggingface.co/Qwen/Qwen3-8B"&gt;Qwen3-8B&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I've been running prompts through the MLX 4bit quantized version, &lt;a href="https://huggingface.co/mlx-community/Qwen3-8B-4bit"&gt;mlx-community/Qwen3-8B-4bit&lt;/a&gt;. I'm using &lt;a href="https://github.com/simonw/llm-mlx"&gt;llm-mlx&lt;/a&gt; like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-mlx
llm mlx download-model mlx-community/Qwen3-8B-4bit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This pulls 4.3GB of data and saves it to &lt;code&gt;~/.cache/huggingface/hub/models--mlx-community--Qwen3-8B-4bit&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;I assigned it a default alias:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm aliases set q3 mlx-community/Qwen3-8B-4bit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I also added a default option for that model - this saves me from adding &lt;code&gt;-o unlimited 1&lt;/code&gt; (which disables the default output token limit) to every prompt:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm models options set q3 unlimited 1
&lt;/code&gt;&lt;/pre&gt;
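&lt;p&gt;A quick way to double-check the stored defaults (assuming you're on a version of LLM recent enough to include the &lt;code&gt;options&lt;/code&gt; subcommands):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm models options show q3
&lt;/code&gt;&lt;/pre&gt;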
&lt;p&gt;And now I can run prompts:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m q3 'brainstorm questions I can ask my friend who I think is secretly from Atlantis that will not tip her off to my suspicions'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Qwen3 is a "reasoning" model, so it starts each prompt with a &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; block containing its chain of thought. Reading these is always &lt;em&gt;really fun&lt;/em&gt;. Here's the full response I got for &lt;a href="https://gist.github.com/simonw/52a883eb4709de66c6bfe9bb3b0f3ee0"&gt;the above question&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I'm finding Qwen3-8B to be surprisingly capable for useful things too. It can &lt;a href="https://gist.github.com/simonw/ab414f01a28e050b8419b4152a4016d1"&gt;summarize short articles&lt;/a&gt;. It can &lt;a href="https://gist.github.com/simonw/db129dddb76e5ba8f97794a794ae626d#response-1"&gt;write simple SQL queries&lt;/a&gt; given a question and a schema. It can &lt;a href="https://gist.github.com/simonw/54f040ae2f2ca3b83cdc1b2e691936ab"&gt;figure out what a simple web app does&lt;/a&gt; by reading the HTML and JavaScript. It can &lt;a href="https://gist.github.com/simonw/ac4082df0dcde87d5845586804fb80c9"&gt;write Python code&lt;/a&gt; to meet a paragraph-long spec - for that one it "reasoned" for an unreasonably long time but it did eventually get to a useful answer.&lt;/p&gt;
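&lt;p&gt;For the SQL trick the pattern is just to paste the schema in after the question - a minimal sketch, where &lt;code&gt;schema.sql&lt;/code&gt; is a hypothetical file holding your &lt;code&gt;CREATE TABLE&lt;/code&gt; statements:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# schema.sql is a made-up filename for illustration
llm -m q3 "Write a SQL query answering: how many users signed up each month?

$(cat schema.sql)"
&lt;/code&gt;&lt;/pre&gt;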
&lt;p&gt;All this while consuming between 4 and 5GB of memory, depending on the length of the prompt.&lt;/p&gt;
&lt;p&gt;I think it's pretty extraordinary that a few GBs of floating point numbers can usefully achieve these various tasks, especially using so little memory that it's not an imposition on the rest of the things I want to run on my laptop at the same time.&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/models"&gt;models&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="llm"/><category term="models"/><category term="qwen"/><category term="mlx"/><category term="generative-ai"/><category term="ai"/><category term="local-llms"/><category term="llm-reasoning"/><category term="ai-in-china"/></entry></feed>