<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: llm-pricing</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/llm-pricing.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2026-03-17T19:39:17+00:00</updated><author><name>Simon Willison</name></author><entry><title>GPT-5.4 mini and GPT-5.4 nano, which can describe 76,000 photos for $52</title><link href="https://simonwillison.net/2026/Mar/17/mini-and-nano/#atom-tag" rel="alternate"/><published>2026-03-17T19:39:17+00:00</published><updated>2026-03-17T19:39:17+00:00</updated><id>https://simonwillison.net/2026/Mar/17/mini-and-nano/#atom-tag</id><summary type="html">
    &lt;p&gt;OpenAI today: &lt;a href="https://openai.com/index/introducing-gpt-5-4-mini-and-nano/"&gt;Introducing GPT‑5.4 mini and nano&lt;/a&gt;. These models join GPT-5.4 which was released &lt;a href="https://openai.com/index/introducing-gpt-5-4/"&gt;two weeks ago&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;OpenAI's self-reported benchmarks show the new 5.4-nano outperforming their previous GPT-5 mini model when run at maximum reasoning effort. The new mini is also twice as fast as its predecessor.&lt;/p&gt;
&lt;p&gt;Here's how the pricing looks - all prices are per million tokens. &lt;code&gt;gpt-5.4-nano&lt;/code&gt; is notably even cheaper than Google's Gemini 3.1 Flash-Lite:&lt;/p&gt;
&lt;center&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Model&lt;/th&gt;
      &lt;th&gt;Input&lt;/th&gt;
      &lt;th&gt;Cached input&lt;/th&gt;
      &lt;th&gt;Output&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;gpt-5.4&lt;/td&gt;
      &lt;td&gt;$2.50&lt;/td&gt;
      &lt;td&gt;$0.25&lt;/td&gt;
      &lt;td&gt;$15.00&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;gpt-5.4-mini&lt;/td&gt;
      &lt;td&gt;$0.75&lt;/td&gt;
      &lt;td&gt;$0.075&lt;/td&gt;
      &lt;td&gt;$4.50&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;gpt-5.4-nano&lt;/td&gt;
      &lt;td&gt;$0.20&lt;/td&gt;
      &lt;td&gt;$0.02&lt;/td&gt;
      &lt;td&gt;$1.25&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;&lt;td colspan="4"&gt;&lt;center&gt;Other models for comparison&lt;/center&gt;&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Claude Opus 4.6&lt;/td&gt;
      &lt;td&gt;$5.00&lt;/td&gt;
      &lt;td&gt;-&lt;/td&gt;
      &lt;td&gt;$25.00&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
      &lt;td&gt;$3.00&lt;/td&gt;
      &lt;td&gt;-&lt;/td&gt;
      &lt;td&gt;$15.00&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Gemini 3.1 Pro&lt;/td&gt;
      &lt;td&gt;$2.00&lt;/td&gt;
      &lt;td&gt;-&lt;/td&gt;
      &lt;td&gt;$12.00&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Claude Haiku 4.5&lt;/td&gt;
      &lt;td&gt;$1.00&lt;/td&gt;
      &lt;td&gt;-&lt;/td&gt;
      &lt;td&gt;$5.00&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Gemini 3.1 Flash-Lite&lt;/td&gt;
      &lt;td&gt;$0.25&lt;/td&gt;
      &lt;td&gt;-&lt;/td&gt;
      &lt;td&gt;$1.50&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/center&gt;
&lt;p&gt;I used GPT-5.4 nano to generate a description of this photo I took at the &lt;a href="https://www.niche-museums.com/118"&gt;John M. Mossman Lock Collection&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/IMG_2324.jpeg" alt="Description below" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m gpt-5.4-nano -a IMG_2324.jpeg 'describe image'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's the output:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The image shows the interior of a museum gallery with a long display wall. White-painted brick walls are covered with many framed portraits arranged in neat rows. Below the portraits, there are multiple glass display cases with dark wooden frames and glass tops/fronts, containing various old historical objects and equipment. The room has a polished wooden floor, hanging ceiling light fixtures/cords, and a few visible pipes near the top of the wall. In the foreground, glass cases run along the length of the room, reflecting items from other sections of the gallery.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That took 2,751 input tokens and 112 output tokens, at a cost of &lt;a href="https://www.llm-prices.com/#it=2751&amp;amp;ot=112&amp;amp;sel=gpt-5.4-nano"&gt;0.069 cents&lt;/a&gt; (less than a tenth of a cent). That means describing every single photo in my 76,000 photo collection would cost around $52.44.&lt;/p&gt;
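&lt;p&gt;The arithmetic is easy to check - a quick sketch in Python, using the per-million-token prices from the table above:&lt;/p&gt;

```python
# Per-photo cost for gpt-5.4-nano: $0.20/M input, $1.25/M output
INPUT_PER_M = 0.20
OUTPUT_PER_M = 1.25

input_tokens, output_tokens = 2751, 112
per_photo = (input_tokens / 1e6 * INPUT_PER_M
             + output_tokens / 1e6 * OUTPUT_PER_M)

print(f"{per_photo * 100:.3f} cents per photo")  # 0.069 cents per photo
print(f"${per_photo * 76_000:.2f} for 76,000 photos")
```

&lt;p&gt;(The $52.44 figure corresponds to rounding the per-photo cost to 0.069 cents before multiplying by 76,000.)&lt;/p&gt;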
&lt;p&gt;I released &lt;a href="https://llm.datasette.io/en/stable/changelog.html#v0-29"&gt;llm 0.29&lt;/a&gt; with support for the new models.&lt;/p&gt;
&lt;p&gt;Then I had OpenAI Codex loop through all five reasoning effort levels and all three models and produce this combined SVG grid of pelicans riding bicycles (&lt;a href="https://gist.github.com/simonw/f16292d9a5b90b28054cff3ba497a3ca"&gt;generation transcripts here&lt;/a&gt;). I like the gpt-5.4 xhigh one best: it has a good bicycle (with nice spokes) and the pelican has a fish in its beak!&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/gpt-5.4-pelican-family.svg" alt="Described by Claude Opus 4.6: A 5x3 comparison grid of AI-generated cartoon illustrations of a pelican riding a bicycle. Columns are labeled &amp;quot;gpt-5.4-nano&amp;quot;, &amp;quot;gpt-5.4-mini&amp;quot;, and &amp;quot;gpt-5.4&amp;quot; across the top, and rows are labeled &amp;quot;none&amp;quot;, &amp;quot;low&amp;quot;, &amp;quot;medium&amp;quot;, &amp;quot;high&amp;quot;, and &amp;quot;xhigh&amp;quot; down the left side, representing quality/detail settings. In the &amp;quot;none&amp;quot; row, gpt-5.4-nano shows a chaotic white bird with misplaced arrows and tangled wheels on grass, gpt-5.4-mini shows a duck-like brown bird awkwardly straddling a motorcycle-like bike, and gpt-5.4 shows a stiff gray-and-white pelican sitting atop a blue tandem bicycle with extra legs. In the &amp;quot;low&amp;quot; row, nano shows a chubby round white bird pedaling with small feet on grass, mini shows a cleaner white bird riding a blue bicycle with motion lines, and gpt-5.4 shows a pelican with a blue cap riding confidently but with slightly awkward proportions. In the &amp;quot;medium&amp;quot; row, nano regresses to a strange bird standing over bowling balls on ice, mini shows two plump white birds merged onto one yellow-wheeled bicycle, and gpt-5.4 shows a more recognizable gray-and-white pelican on a red bicycle but with tangled extra legs. In the &amp;quot;high&amp;quot; row, nano shows multiple small pelicans crowded around a broken green bicycle on grass with a sun overhead, mini shows a tandem bicycle with two white pelicans and clear blue sky, and gpt-5.4 shows two pelicans stacked on a red tandem bike with the most realistic proportions yet. 
In the &amp;quot;xhigh&amp;quot; row, nano shows the most detailed scene with a pelican on a detailed bicycle with grass and a large sun but still somewhat jumbled anatomy, mini produces the cleanest single pelican on a yellow-accented bicycle with a light blue sky, and gpt-5.4 shows a well-rendered gray pelican on a teal bicycle with the best overall coherence. Generally, quality improves moving right across models and down through quality tiers, though &amp;quot;medium&amp;quot; is inconsistently worse than &amp;quot;low&amp;quot; for some models, and all images maintain a lighthearted cartoon style with pastel skies and simple backgrounds." style="max-width: 100%;" /&gt;&lt;/p&gt;
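&lt;p&gt;The harness for that grid is just a nested loop over models and effort levels. Here's a sketch in Python that prints the fifteen CLI invocations rather than running them - note that &lt;code&gt;-o reasoning_effort&lt;/code&gt; is an assumption about how the plugin exposes the option, so check the plugin docs before running these:&lt;/p&gt;

```python
# Sketch of the 3-model x 5-effort grid behind the pelican comparison.
# Only prints the commands; running them needs an OpenAI API key.
models = ["gpt-5.4-nano", "gpt-5.4-mini", "gpt-5.4"]
efforts = ["none", "low", "medium", "high", "xhigh"]

commands = [
    f"llm -m {model} -o reasoning_effort {effort} "
    f"'Generate an SVG of a pelican riding a bicycle' > {model}-{effort}.svg"
    for effort in efforts
    for model in models
]
print("\n".join(commands))
```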
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="vision-llms"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/></entry><entry><title>1M context is now generally available for Opus 4.6 and Sonnet 4.6</title><link href="https://simonwillison.net/2026/Mar/13/1m-context/#atom-tag" rel="alternate"/><published>2026-03-13T18:29:13+00:00</published><updated>2026-03-13T18:29:13+00:00</updated><id>https://simonwillison.net/2026/Mar/13/1m-context/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://claude.com/blog/1m-context-ga"&gt;1M context is now generally available for Opus 4.6 and Sonnet 4.6&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Here's what surprised me:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Standard pricing now applies across the full 1M window for both models, with no long-context premium.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;OpenAI and Gemini both &lt;a href="https://www.llm-prices.com/#sel=gemini-3-1-pro-preview-200k%2Cgpt-5.4-272k%2Cgemini-3-1-pro-preview%2Cgpt-5.4"&gt;charge more&lt;/a&gt; for prompts whose token count exceeds a threshold - 200,000 tokens for Gemini 3.1 Pro and 272,000 for GPT-5.4.&lt;/p&gt;
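&lt;p&gt;That makes input cost a step function of prompt size - once you cross the threshold the whole prompt is typically billed at the higher rate. A sketch, using the Gemini 3.1 Pro base price from above but a purely hypothetical over-threshold rate:&lt;/p&gt;

```python
def input_cost(tokens, base_per_m, threshold, long_per_m):
    """Input cost where prompts over `threshold` tokens are billed
    entirely at the higher long-context rate."""
    rate = long_per_m if tokens > threshold else base_per_m
    return tokens / 1e6 * rate

# $2.00/M base for Gemini 3.1 Pro, with a hypothetical (not published)
# $4.00/M long-context rate past the 200,000-token threshold:
print(input_cost(150_000, 2.00, 200_000, 4.00))  # 0.3
print(input_cost(300_000, 2.00, 200_000, 4.00))  # 1.2
```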


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/long-context"&gt;long-context&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="claude"/><category term="llm-pricing"/><category term="long-context"/></entry><entry><title>Gemini 3.1 Flash-Lite</title><link href="https://simonwillison.net/2026/Mar/3/gemini-31-flash-lite/#atom-tag" rel="alternate"/><published>2026-03-03T21:53:54+00:00</published><updated>2026-03-03T21:53:54+00:00</updated><id>https://simonwillison.net/2026/Mar/3/gemini-31-flash-lite/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-lite/"&gt;Gemini 3.1 Flash-Lite&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Google's latest model is an update to their inexpensive Flash-Lite family. At $0.25/million tokens of input and $1.50/million output this is 1/8th the price of Gemini 3.1 Pro.&lt;/p&gt;
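&lt;p&gt;Using the Gemini 3.1 Pro prices of $2.00/million input and $12.00/million output, that 1/8th ratio holds exactly on both sides of the ledger:&lt;/p&gt;

```python
# Flash-Lite vs Gemini 3.1 Pro, $ per million tokens
flash_lite = {"input": 0.25, "output": 1.50}
pro = {"input": 2.00, "output": 12.00}

for kind in ("input", "output"):
    print(f"{kind}: Pro is {pro[kind] / flash_lite[kind]:g}x Flash-Lite")
# input: Pro is 8x Flash-Lite
# output: Pro is 8x Flash-Lite
```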
&lt;p&gt;It supports four different thinking levels, so I had it output &lt;a href="https://gist.github.com/simonw/99fb28dc11d0c24137d4ff8a33978a9e"&gt;four different pelicans&lt;/a&gt;:&lt;/p&gt;
&lt;div style="
    display: grid;
    grid-template-columns: repeat(2, 1fr);
    gap: 8px;
    margin: 0 auto;
  "&gt;
    &lt;div style="text-align: center;"&gt;
      &lt;div style="aspect-ratio: 1; overflow: hidden; border-radius: 4px;"&gt;
        &lt;img src="https://static.simonwillison.net/static/2026/gemini-3.1-flash-lite-minimal.png" alt="A minimalist vector-style illustration of a stylized bird riding a bicycle." style="width: 100%; height: 100%; object-fit: cover; display: block;"&gt;
      &lt;/div&gt;
      &lt;p style="margin: 4px 0 0; font-size: 16px; color: #333;"&gt;minimal&lt;/p&gt;
    &lt;/div&gt;
    &lt;div style="text-align: center;"&gt;
      &lt;div style="aspect-ratio: 1; overflow: hidden; border-radius: 4px;"&gt;
        &lt;img src="https://static.simonwillison.net/static/2026/gemini-3.1-flash-lite-low.png" alt="A minimalist graphic of a light blue round bird with a single black dot for an eye, wearing a yellow backpack and riding a black bicycle on a flat grey line." style="width: 100%; height: 100%; object-fit: cover; display: block;"&gt;
      &lt;/div&gt;
      &lt;p style="margin: 4px 0 0; font-size: 16px; color: #333;"&gt;low&lt;/p&gt;
    &lt;/div&gt;
    &lt;div style="text-align: center;"&gt;
      &lt;div style="aspect-ratio: 1; overflow: hidden; border-radius: 4px;"&gt;
        &lt;img src="https://static.simonwillison.net/static/2026/gemini-3.1-flash-lite-medium.png" alt="A minimalist digital illustration of a light blue bird wearing a yellow backpack while riding a bicycle." style="width: 100%; height: 100%; object-fit: cover; display: block;"&gt;
      &lt;/div&gt;
      &lt;p style="margin: 4px 0 0; font-size: 16px; color: #333;"&gt;medium&lt;/p&gt;
    &lt;/div&gt;
    &lt;div style="text-align: center;"&gt;
      &lt;div style="aspect-ratio: 1; overflow: hidden; border-radius: 4px;"&gt;
        &lt;img src="https://static.simonwillison.net/static/2026/gemini-3.1-flash-lite-high.png" alt="A minimal, stylized line drawing of a bird-like creature with a yellow beak riding a bicycle made of simple geometric lines." style="width: 100%; height: 100%; object-fit: cover; display: block;"&gt;
      &lt;/div&gt;
      &lt;p style="margin: 4px 0 0; font-size: 16px; color: #333;"&gt;high&lt;/p&gt;
    &lt;/div&gt;
&lt;/div&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="gemini"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/></entry><entry><title>Introducing Claude Sonnet 4.6</title><link href="https://simonwillison.net/2026/Feb/17/claude-sonnet-46/#atom-tag" rel="alternate"/><published>2026-02-17T23:58:58+00:00</published><updated>2026-02-17T23:58:58+00:00</updated><id>https://simonwillison.net/2026/Feb/17/claude-sonnet-46/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.anthropic.com/news/claude-sonnet-4-6"&gt;Introducing Claude Sonnet 4.6&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Sonnet 4.6 is out today, and Anthropic claim it offers similar performance to &lt;a href="https://simonwillison.net/2025/Nov/24/claude-opus/"&gt;November's Opus 4.5&lt;/a&gt; while maintaining the Sonnet pricing of $3/million input and $15/million output tokens (the Opus models are $5/$25). Here's &lt;a href="https://www-cdn.anthropic.com/78073f739564e986ff3e28522761a7a0b4484f84.pdf"&gt;the system card PDF&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Sonnet 4.6 has a "reliable knowledge cutoff" of August 2025, compared to Opus 4.6's May 2025 and Haiku 4.5's February 2025. Both Opus and Sonnet default to 200,000 max input tokens but can stretch to 1 million in beta and at a higher cost.&lt;/p&gt;
&lt;p&gt;I just released &lt;a href="https://github.com/simonw/llm-anthropic/releases/tag/0.24"&gt;llm-anthropic 0.24&lt;/a&gt; with support for both Sonnet 4.6 and Opus 4.6. Claude Code &lt;a href="https://github.com/simonw/llm-anthropic/pull/65"&gt;did most of the work&lt;/a&gt; - the new models had a fiddly amount of extra details around adaptive thinking and no longer supporting prefixes, as described &lt;a href="https://platform.claude.com/docs/en/about-claude/models/migration-guide"&gt;in Anthropic's migration guide&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/b185576a95e9321b441f0a4dfc0e297c"&gt;what I got&lt;/a&gt; from:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uvx --with llm-anthropic llm 'Generate an SVG of a pelican riding a bicycle' -m claude-sonnet-4.6
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="The pelican has a jaunty top hat with a red band. There is a string between the upper and lower beaks for some reason. The bicycle frame is warped in the wrong way." src="https://static.simonwillison.net/static/2026/pelican-sonnet-4.6.png" /&gt;&lt;/p&gt;
&lt;p&gt;The SVG comments include:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;!-- Hat (fun accessory) --&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I tried a second time and also got a top hat. Sonnet 4.6 apparently loves top hats!&lt;/p&gt;
&lt;p&gt;For comparison, here's the pelican Opus 4.5 drew me &lt;a href="https://simonwillison.net/2025/Nov/24/claude-opus/"&gt;in November&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="The pelican is cute and looks pretty good. The bicycle is not great - the frame is wrong and the pelican is facing backwards when the handlebars appear to be forwards.There is also something that looks a bit like an egg on the handlebars." src="https://static.simonwillison.net/static/2025/claude-opus-4.5-pelican.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;And here's Anthropic's current best pelican, drawn by Opus 4.6 &lt;a href="https://simonwillison.net/2026/Feb/5/two-new-models/"&gt;on February 5th&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Slightly wonky bicycle frame but an excellent pelican, very clear beak and pouch, nice feathers." src="https://static.simonwillison.net/static/2026/opus-4.6-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;Opus 4.6 produces the best pelican beak/pouch. I do think the top hat from Sonnet 4.6 is a nice touch though.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=47050488"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-code"&gt;claude-code&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="anthropic"/><category term="claude"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="claude-code"/></entry><entry><title>Claude: Speed up responses with fast mode</title><link href="https://simonwillison.net/2026/Feb/7/claude-fast-mode/#atom-tag" rel="alternate"/><published>2026-02-07T23:10:33+00:00</published><updated>2026-02-07T23:10:33+00:00</updated><id>https://simonwillison.net/2026/Feb/7/claude-fast-mode/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://code.claude.com/docs/en/fast-mode"&gt;Claude: Speed up responses with fast mode&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;New "research preview" from Anthropic today: you can now access a faster version of their frontier model Claude Opus 4.6 by typing &lt;code&gt;/fast&lt;/code&gt; in Claude Code... but at a cost that's 6x the normal price.&lt;/p&gt;
&lt;p&gt;Opus is usually $5/million input and $25/million output. The new fast mode is $30/million input and $150/million output!&lt;/p&gt;
&lt;p&gt;There's a 50% discount until the end of February 16th, so it's only a 3x multiplier (!) until then.&lt;/p&gt;
&lt;p&gt;How much faster is it? The linked documentation doesn't say, but &lt;a href="https://x.com/claudeai/status/2020207322124132504"&gt;on Twitter&lt;/a&gt; Claude say:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Our teams have been building with a 2.5x-faster version of Claude Opus 4.6.&lt;/p&gt;
&lt;p&gt;We’re now making it available as an early experiment via Claude Code and our API.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Claude Opus 4.5 had a context limit of 200,000 tokens. 4.6 has an option to increase that to 1,000,000 at 2x the input price ($10/m) and 1.5x the output price ($37.50/m) once your input exceeds 200,000 tokens. These multiples hold for fast mode too, so after February 16th you'll be able to pay a hefty $60/m input and $225/m output for Anthropic's best model at its fastest.&lt;/p&gt;
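&lt;p&gt;Those multipliers stack, which makes the arithmetic worth spelling out:&lt;/p&gt;

```python
# Opus 4.6 base prices, $ per million tokens
base_in, base_out = 5.00, 25.00

# Fast mode is a flat 6x multiplier:
fast_in, fast_out = base_in * 6, base_out * 6
print(fast_in, fast_out)            # 30.0 150.0

# 50% launch discount until February 16th - effectively 3x:
print(fast_in / 2, fast_out / 2)    # 15.0 75.0

# Long context (input over 200k tokens): 2x input, 1.5x output,
# stacked on top of fast mode:
print(fast_in * 2, fast_out * 1.5)  # 60.0 225.0
```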


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-code"&gt;claude-code&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-performance"&gt;llm-performance&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="claude"/><category term="llm-pricing"/><category term="claude-code"/><category term="llm-performance"/></entry><entry><title>Gemini 3 Flash</title><link href="https://simonwillison.net/2025/Dec/17/gemini-3-flash/#atom-tag" rel="alternate"/><published>2025-12-17T22:44:52+00:00</published><updated>2025-12-17T22:44:52+00:00</updated><id>https://simonwillison.net/2025/Dec/17/gemini-3-flash/#atom-tag</id><summary type="html">
    &lt;p&gt;It continues to be a busy December, if not quite as busy &lt;a href="https://simonwillison.net/2024/Dec/20/december-in-llms-has-been-a-lot/"&gt;as last year&lt;/a&gt;. Today's big news is &lt;a href="https://blog.google/technology/developers/build-with-gemini-3-flash/"&gt;Gemini 3 Flash&lt;/a&gt;, the latest in Google's "Flash" line of faster and less expensive models.&lt;/p&gt;
&lt;p&gt;Google are emphasizing the comparison between the new Flash and their previous generation's top model Gemini 2.5 Pro:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Building on 3 Pro’s strong multimodal, coding and agentic features, 3 Flash offers powerful performance at less than a quarter the cost of 3 Pro, along with higher rate limits. The new 3 Flash model surpasses 2.5 Pro across many benchmarks while delivering faster speeds.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Gemini 3 Flash's characteristics are almost identical to Gemini 3 Pro: it accepts text, image, video, audio, and PDF, outputs only text, handles 1,048,576 maximum input tokens and up to 65,536 output tokens, and has the same knowledge cut-off date of January 2025 (also shared with the Gemini 2.5 series).&lt;/p&gt;
&lt;p&gt;The benchmarks look good. The cost is appealing: 1/4 the price of Gemini 3 Pro for prompts ≤200k tokens and 1/8 its price for prompts &gt;200k, and it's nice not to have a price increase for the new Flash at larger token lengths.&lt;/p&gt;
&lt;p&gt;It's a little &lt;em&gt;more&lt;/em&gt; expensive than previous Flash models - Gemini 2.5 Flash was $0.30/million input tokens and $2.50/million output, while Gemini 3 Flash is $0.50/million and $3.00/million respectively.&lt;/p&gt;
&lt;p&gt;Google &lt;a href="https://blog.google/products/gemini/gemini-3-flash/"&gt;claim&lt;/a&gt; it may still end up cheaper though, due to more efficient output token usage:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;Gemini 3 Flash is able to modulate how much it thinks. It may think longer for more complex use cases, but it also uses 30% fewer tokens on average than 2.5 Pro.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://www.llm-prices.com/#it=100000&amp;amp;ot=10000&amp;amp;sel=gemini-3-flash-preview%2Cgemini-3-pro-preview%2Cgemini-3-pro-preview-200k%2Cgpt-5.2%2Cclaude-opus-4-5%2Cclaude-sonnet-4.5%2Cclaude-4.5-haiku%2Cgemini-2.5-flash%2Cgpt-5-mini"&gt;a more extensive price comparison&lt;/a&gt; on my &lt;a href="https://www.llm-prices.com/"&gt;llm-prices.com&lt;/a&gt; site.&lt;/p&gt;
&lt;h4 id="generating-some-svgs-of-pelicans"&gt;Generating some SVGs of pelicans&lt;/h4&gt;
&lt;p&gt;I released &lt;a href="https://github.com/simonw/llm-gemini/releases/tag/0.28"&gt;llm-gemini 0.28&lt;/a&gt; this morning with support for the new model. You can try it out like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install -U llm-gemini
llm keys set gemini # paste in key
llm -m gemini-3-flash-preview "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;According to &lt;a href="https://ai.google.dev/gemini-api/docs/gemini-3#thinking_level"&gt;the developer docs&lt;/a&gt; the new model supports four different thinking level options: &lt;code&gt;minimal&lt;/code&gt;, &lt;code&gt;low&lt;/code&gt;, &lt;code&gt;medium&lt;/code&gt;, and &lt;code&gt;high&lt;/code&gt;. This is different from Gemini 3 Pro, which only supported &lt;code&gt;low&lt;/code&gt; and &lt;code&gt;high&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;You can run those like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m gemini-3-flash-preview --thinking-level minimal "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here are four pelicans, for thinking levels &lt;a href="https://gist.github.com/simonw/8047c805a4a1df7fd4e854b18e7482d9"&gt;minimal&lt;/a&gt;, &lt;a href="https://gist.github.com/simonw/fb61686a1f915e3777b4a40e2df41068"&gt;low&lt;/a&gt;, &lt;a href="https://gist.github.com/simonw/190c3ce82cd8976827139bbc4dcc2d19"&gt;medium&lt;/a&gt;, and &lt;a href="https://gist.github.com/simonw/da66ffce135359161996e41e50e32ec3"&gt;high&lt;/a&gt;:&lt;/p&gt;
&lt;image-gallery width="4"&gt;
    &lt;img src="https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-minimal-pelican-svg.jpg" alt="A minimalist vector illustration of a stylized white bird with a long orange beak and a red cap riding a dark blue bicycle on a single grey ground line against a plain white background." /&gt;
    &lt;img src="https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-low-pelican-svg.jpg" alt="Minimalist illustration: A stylized white bird with a large, wedge-shaped orange beak and a single black dot for an eye rides a red bicycle with black wheels and a yellow pedal against a solid light blue background." /&gt;
    &lt;img src="https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-medium-pelican-svg.jpg" alt="A minimalist illustration of a stylized white bird with a large yellow beak riding a red road bicycle in a racing position on a light blue background." /&gt;
    &lt;img src="https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-high-pelican-svg.jpg" alt="Minimalist line-art illustration of a stylized white bird with a large orange beak riding a simple black bicycle with one orange pedal, centered against a light blue circular background." /&gt;
&lt;/image-gallery&gt;
&lt;h4 id="i-built-the-gallery-component-with-gemini-3-flash"&gt;I built the gallery component with Gemini 3 Flash&lt;/h4&gt;
&lt;p&gt;The gallery above uses a new Web Component which I built using Gemini 3 Flash to try out its coding abilities. The code on the page looks like this:&lt;/p&gt;
&lt;div class="highlight highlight-text-html-basic"&gt;&lt;pre&gt;&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;image-gallery&lt;/span&gt; &lt;span class="pl-c1"&gt;width&lt;/span&gt;="&lt;span class="pl-s"&gt;4&lt;/span&gt;"&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;img&lt;/span&gt; &lt;span class="pl-c1"&gt;src&lt;/span&gt;="&lt;span class="pl-s"&gt;https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-minimal-pelican-svg.jpg&lt;/span&gt;" &lt;span class="pl-c1"&gt;alt&lt;/span&gt;="&lt;span class="pl-s"&gt;A minimalist vector illustration of a stylized white bird with a long orange beak and a red cap riding a dark blue bicycle on a single grey ground line against a plain white background.&lt;/span&gt;" &lt;span class="pl-kos"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;img&lt;/span&gt; &lt;span class="pl-c1"&gt;src&lt;/span&gt;="&lt;span class="pl-s"&gt;https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-low-pelican-svg.jpg&lt;/span&gt;" &lt;span class="pl-c1"&gt;alt&lt;/span&gt;="&lt;span class="pl-s"&gt;Minimalist illustration: A stylized white bird with a large, wedge-shaped orange beak and a single black dot for an eye rides a red bicycle with black wheels and a yellow pedal against a solid light blue background.&lt;/span&gt;" &lt;span class="pl-kos"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;img&lt;/span&gt; &lt;span class="pl-c1"&gt;src&lt;/span&gt;="&lt;span class="pl-s"&gt;https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-medium-pelican-svg.jpg&lt;/span&gt;" &lt;span class="pl-c1"&gt;alt&lt;/span&gt;="&lt;span class="pl-s"&gt;A minimalist illustration of a stylized white bird with a large yellow beak riding a red road bicycle in a racing position on a light blue background.&lt;/span&gt;" &lt;span class="pl-kos"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;img&lt;/span&gt; &lt;span class="pl-c1"&gt;src&lt;/span&gt;="&lt;span class="pl-s"&gt;https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-high-pelican-svg.jpg&lt;/span&gt;" &lt;span class="pl-c1"&gt;alt&lt;/span&gt;="&lt;span class="pl-s"&gt;Minimalist line-art illustration of a stylized white bird with a large orange beak riding a simple black bicycle with one orange pedal, centered against a light blue circular background.&lt;/span&gt;" &lt;span class="pl-kos"&gt;/&amp;gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;image-gallery&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Those alt attributes are all generated by Gemini 3 Flash as well, using this recipe:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m gemini-3-flash-preview --system &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;You write alt text for any image pasted in by the user. Alt text is always presented in a&lt;/span&gt;
&lt;span class="pl-s"&gt;fenced code block to make it easy to copy and paste out. It is always presented on a single&lt;/span&gt;
&lt;span class="pl-s"&gt;line so it can be used easily in Markdown images. All text on the image (for screenshots etc)&lt;/span&gt;
&lt;span class="pl-s"&gt;must be exactly included. A short note describing the nature of the image itself should go first.&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
-a https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-high-pelican-svg.jpg&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;You can see the code that powers the image gallery Web Component &lt;a href="https://github.com/simonw/simonwillisonblog/blob/31651b3a527011d1c971d4256c1c9f61ef378d23/static/image-gallery.js"&gt;here on GitHub&lt;/a&gt;. I built it by prompting Gemini 3 Flash via &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m gemini-3-flash-preview &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;Build a Web Component that implements a simple image gallery. Usage is like this:&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&amp;lt;image-gallery width="5"&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;  &amp;lt;img src="image1.jpg" alt="Image 1"&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;  &amp;lt;img src="image2.jpg" alt="Image 2" data-thumb="image2-thumb.jpg"&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;  &amp;lt;img src="image3.jpg" alt="Image 3"&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&amp;lt;/image-gallery&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;If an image has a data-thumb= attribute that one is used instead, other images are scaled down. &lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;The image gallery always takes up 100% of available width. The width="5" attribute means that five images will be shown next to each other in each row. The default is 3. There are gaps between the images. When an image is clicked it opens a modal dialog with the full size image.&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;Return a complete HTML file with both the implementation of the Web Component several example uses of it. Use https://picsum.photos/300/200 URLs for those example images.&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It took a few follow-up prompts using &lt;code&gt;llm -c&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Use a real modal such that keyboard shortcuts and accessibility features work without extra JS&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

llm -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Use X for the close icon and make it a bit more subtle&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

llm -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;remove the hover effect entirely&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

llm -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;I want no border on the close icon even when it is focused&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/09f63a49f29620d4cbbfd383cfee1db3"&gt;the full transcript&lt;/a&gt;, exported using &lt;code&gt;llm logs -cue&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Those five prompts used the following token counts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;225 input, 3,269 output&lt;/li&gt;
&lt;li&gt;2,243 input, 2,908 output&lt;/li&gt;
&lt;li&gt;4,319 input, 2,516 output&lt;/li&gt;
&lt;li&gt;6,376 input, 2,094 output&lt;/li&gt;
&lt;li&gt;8,151 input, 1,806 output&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Added together that's 21,314 input and 12,593 output for a grand total &lt;a href="https://www.llm-prices.com/#it=21314&amp;amp;ot=12593&amp;amp;sel=gemini-3-flash-preview"&gt;of 4.8436 cents&lt;/a&gt;.&lt;/p&gt;
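&lt;p&gt;That total is easy to reproduce. Here's a minimal sketch, assuming rates of $0.50/million input tokens and $3.00/million output tokens for gemini-3-flash-preview - the rates that reproduce the linked 4.8436 cent figure:&lt;/p&gt;

```python
# (input, output) token counts for the five prompts listed above
prompts = [
    (225, 3269),
    (2243, 2908),
    (4319, 2516),
    (6376, 2094),
    (8151, 1806),
]
input_tokens = sum(i for i, _ in prompts)   # 21,314
output_tokens = sum(o for _, o in prompts)  # 12,593

# Assumed per-million-token rates: $0.50 input, $3.00 output
cost_dollars = (input_tokens * 0.50 + output_tokens * 3.00) / 1_000_000
print(f"{input_tokens} input, {output_tokens} output = {cost_dollars * 100:.4f} cents")
```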
&lt;p&gt;The guide to &lt;a href="https://ai.google.dev/gemini-api/docs/gemini-3#migrating_from_gemini_25"&gt;migrating from Gemini 2.5&lt;/a&gt; reveals one disappointment:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Image segmentation:&lt;/strong&gt; Image segmentation capabilities (returning pixel-level masks for objects) are not supported in Gemini 3 Pro or Gemini 3 Flash. For workloads requiring native image segmentation, we recommend continuing to utilize Gemini 2.5 Flash with thinking turned off or &lt;a href="https://ai.google.dev/gemini-api/docs/robotics-overview"&gt;Gemini Robotics-ER 1.5&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I &lt;a href="https://simonwillison.net/2025/Apr/18/gemini-image-segmentation/"&gt;wrote about this capability in Gemini 2.5&lt;/a&gt; back in April. I hope it comes back in future models - it's a really neat capability that is unique to Gemini.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/web-components"&gt;web-components&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="google"/><category term="ai"/><category term="web-components"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="gemini"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/></entry><entry><title>Claude Opus 4.5, and why evaluating new LLMs is increasingly difficult</title><link href="https://simonwillison.net/2025/Nov/24/claude-opus/#atom-tag" rel="alternate"/><published>2025-11-24T19:37:07+00:00</published><updated>2025-11-24T19:37:07+00:00</updated><id>https://simonwillison.net/2025/Nov/24/claude-opus/#atom-tag</id><summary type="html">
    &lt;p&gt;Anthropic &lt;a href="https://www.anthropic.com/news/claude-opus-4-5"&gt;released Claude Opus 4.5&lt;/a&gt; this morning, which they call the "best model in the world for coding, agents, and computer use". This is their attempt to retake the crown for best coding model after significant challenges from OpenAI's &lt;a href="https://simonwillison.net/2025/Nov/19/gpt-51-codex-max/"&gt;GPT-5.1-Codex-Max&lt;/a&gt; and Google's &lt;a href="https://simonwillison.net/2025/Nov/18/gemini-3/"&gt;Gemini 3&lt;/a&gt;, both released within the past week!&lt;/p&gt;
&lt;p&gt;The core characteristics of Opus 4.5 are a 200,000 token context (same as Sonnet), 64,000 token output limit (also the same as Sonnet), and a March 2025 "reliable knowledge cutoff" (Sonnet 4.5 is January, Haiku 4.5 is February).&lt;/p&gt;
&lt;p&gt;The pricing is a big relief: $5/million for input and $25/million for output. This is a lot cheaper than the previous Opus at $15/$75 and keeps it a little more competitive with the GPT-5.1 family ($1.25/$10) and Gemini 3 Pro ($2/$12, or $4/$18 for &amp;gt;200,000 tokens). For comparison, Sonnet 4.5 is $3/$15 and Haiku 4.5 is $1/$5.&lt;/p&gt;
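&lt;p&gt;To make the price cut concrete, here's a quick sketch comparing what a single request would cost under the old and new Opus pricing - the 100,000 input / 10,000 output workload is a made-up example:&lt;/p&gt;

```python
def cost(input_tokens, output_tokens, input_per_m, output_per_m):
    """Dollar cost of one request given per-million-token rates."""
    return (input_tokens * input_per_m + output_tokens * output_per_m) / 1_000_000

# Hypothetical large coding request: 100k input tokens, 10k output tokens
opus_4_5 = cost(100_000, 10_000, 5, 25)    # new Opus pricing: $0.75
opus_4_1 = cost(100_000, 10_000, 15, 75)   # old Opus pricing: $2.25
print(f"Opus 4.5: ${opus_4_5:.2f}, Opus 4.1: ${opus_4_1:.2f} ({opus_4_1 / opus_4_5:.0f}x cheaper)")
```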
&lt;p&gt;The &lt;a href="https://platform.claude.com/docs/en/about-claude/models/whats-new-claude-4-5#key-improvements-in-opus-4-5-over-opus-4-1"&gt;Key improvements in Opus 4.5 over Opus 4.1&lt;/a&gt; document has a few more interesting details:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Opus 4.5 has a new &lt;a href="https://platform.claude.com/docs/en/build-with-claude/effort"&gt;effort parameter&lt;/a&gt; which defaults to high but can be set to medium or low for faster responses.&lt;/li&gt;
&lt;li&gt;The model supports &lt;a href="https://platform.claude.com/docs/en/agents-and-tools/tool-use/computer-use-tool"&gt;enhanced computer use&lt;/a&gt;, specifically a &lt;code&gt;zoom&lt;/code&gt; tool which you can provide to Opus 4.5 to allow it to request a zoomed in region of the screen to inspect.&lt;/li&gt;
&lt;li&gt;"&lt;a href="https://platform.claude.com/docs/en/build-with-claude/extended-thinking#thinking-block-preservation-in-claude-opus-4-5"&gt;Thinking blocks from previous assistant turns are preserved in model context by default&lt;/a&gt;" - apparently previous Anthropic models discarded those.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I had access to a preview of Anthropic's new model over the weekend. I spent a bunch of time with it in Claude Code, resulting in &lt;a href="https://simonwillison.net/2025/Nov/24/sqlite-utils-40a1/"&gt;a new alpha release of sqlite-utils&lt;/a&gt; that included several large-scale refactorings - Opus 4.5 was responsible for most of the work across &lt;a href="https://github.com/simonw/sqlite-utils/compare/10957305be998999e3c95c11863b5709d42b7ae3...4.0a1"&gt;20 commits, 39 files changed, 2,022 additions and 1,173 deletions&lt;/a&gt; in a two-day period. Here's the &lt;a href="https://gistpreview.github.io/?f40971b693024fbe984a68b73cc283d2"&gt;Claude Code transcript&lt;/a&gt; where I had it help implement one of the more complicated new features.&lt;/p&gt;
&lt;p&gt;It's clearly an excellent new model, but I did run into a catch. My preview expired at 8pm on Sunday when I still had a few remaining issues in &lt;a href="https://github.com/simonw/sqlite-utils/milestone/7?closed=1"&gt;the milestone for the alpha&lt;/a&gt;. I switched back to Claude Sonnet 4.5 and... kept on working at the same pace I'd been achieving with the new model.&lt;/p&gt;
&lt;p&gt;With hindsight, production coding like this is a less effective way of evaluating the strengths of a new model than I had expected.&lt;/p&gt;
&lt;p&gt;I'm not saying the new model isn't an improvement on Sonnet 4.5 - but I can't say with confidence that the challenges I posed it revealed a meaningful difference in capabilities between the two.&lt;/p&gt;
&lt;p&gt;This represents a growing problem for me. My favorite moments in AI are when a new model gives me the ability to do something that simply wasn't possible before. In the past these have felt a lot more obvious, but today it's often very difficult to find concrete examples that differentiate the new generation of models from their predecessors.&lt;/p&gt;
&lt;p&gt;Google's Nano Banana Pro image generation model was notable in that its ability to &lt;a href="https://simonwillison.net/2025/Nov/20/nano-banana-pro/#creating-an-infographic"&gt;render usable infographics&lt;/a&gt; really does represent a task at which previous models had been laughably incapable.&lt;/p&gt;
&lt;p&gt;The frontier LLMs are a lot harder to differentiate between. Benchmarks like SWE-bench Verified show models beating each other by single digit percentage point margins, but what does that actually equate to in real-world problems that I need to solve on a daily basis?&lt;/p&gt;
&lt;p&gt;And honestly, this is mainly on me. I've fallen behind on maintaining my own collection of tasks that are just beyond the capabilities of the frontier models. I used to have a whole bunch of these but they've fallen one-by-one and now I'm embarrassingly lacking in suitable challenges to help evaluate new models.&lt;/p&gt;
&lt;p&gt;I frequently advise people to stash away tasks that models fail at in their notes so they can try them against newer models later on - a tip I picked up from Ethan Mollick. I need to double down on that advice myself!&lt;/p&gt;
&lt;p&gt;I'd love to see AI labs like Anthropic help address this challenge directly. I'd like to see new model releases accompanied by concrete examples of tasks they can solve that the previous generation of models from the same provider were unable to handle.&lt;/p&gt;
&lt;p&gt;"Here's an example prompt which failed on Sonnet 4.5 but succeeds on Opus 4.5" would excite me a &lt;em&gt;lot&lt;/em&gt; more than some single digit percent improvement on a benchmark with a name like MMLU or GPQA Diamond.&lt;/p&gt;
&lt;p id="pelicans"&gt;In the meantime, I'm just gonna have to keep on getting them to draw &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;pelicans riding bicycles&lt;/a&gt;. Here's Opus 4.5 (on its default &lt;a href="https://platform.claude.com/docs/en/build-with-claude/effort"&gt;"high" effort level&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/claude-opus-4.5-pelican.jpg" alt="The pelican is cute and looks pretty good. The bicycle is not great - the frame is wrong and the pelican is facing backwards when the handlebars appear to be forwards.There is also something that looks a bit like an egg on the handlebars." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;It did significantly better on the &lt;a href="https://simonwillison.net/2025/Nov/18/gemini-3/#and-a-new-pelican-benchmark"&gt;new more detailed prompt&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/claude-opus-4.5-pelican-advanced.jpg" alt="The pelican has feathers and a red pouch - a close enough version of breeding plumage. The bicycle is a much better shape." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Here's that same complex prompt &lt;a href="https://simonwillison.net/2025/Nov/18/gemini-3/#advanced-pelican"&gt;against Gemini 3 Pro&lt;/a&gt; and &lt;a href="https://simonwillison.net/2025/Nov/19/gpt-51-codex-max/#advanced-pelican-codex-max"&gt;against GPT-5.1-Codex-Max-xhigh&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="still-susceptible-to-prompt-injection"&gt;Still susceptible to prompt injection&lt;/h4&gt;
&lt;p&gt;From &lt;a href="https://www.anthropic.com/news/claude-opus-4-5#a-step-forward-on-safety"&gt;the safety section&lt;/a&gt; of Anthropic's announcement post:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;With Opus 4.5, we’ve made substantial progress in robustness against prompt injection attacks, which smuggle in deceptive instructions to fool the model into harmful behavior. Opus 4.5 is harder to trick with prompt injection than any other frontier model in the industry:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/claude-opus-4.5-prompt-injection.jpg" alt="Bar chart titled &amp;quot;Susceptibility to prompt-injection style attacks&amp;quot; with subtitle &amp;quot;At k queries; lower is better&amp;quot;. Y-axis shows &amp;quot;ATTACK SUCCESS RATE (%)&amp;quot; from 0-100. Five stacked bars compare AI models with three k values (k=1 in dark gray, k=10 in beige, k=100 in pink). Results: Gemini 3 Pro Thinking (12.5, 60.7, 92.0), GPT-5.1 Thinking (12.6, 58.2, 87.8), Haiku 4.5 Thinking (8.3, 51.1, 85.6), Sonnet 4.5 Thinking (7.3, 41.9, 72.4), Opus 4.5 Thinking (4.7, 33.6, 63.0)." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;On the one hand this looks great: it's a clear improvement over previous models and the competition.&lt;/p&gt;
&lt;p&gt;What does the chart actually tell us though? It tells us that a single prompt injection attempt still succeeds roughly one time in twenty, and that an attacker who can try ten different attacks pushes the success rate up to one in three!&lt;/p&gt;
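&lt;p&gt;The jump from k=1 to k=10 is roughly what you'd get if every attempt were an independent coin flip. Here's a quick sketch of that back-of-envelope model - independence is an assumption, and real attack attempts are correlated, which is why the reported k=100 figure sits well below the naive estimate:&lt;/p&gt;

```python
def success_at_k(p_single, k):
    """Probability that at least one of k independent attempts succeeds."""
    return 1 - (1 - p_single) ** k

p = 0.047  # Opus 4.5's reported single-attempt (k=1) success rate
print(f"k=10:  {success_at_k(p, 10):.1%}")   # ~38%, near the reported 33.6%
print(f"k=100: {success_at_k(p, 100):.1%}")  # ~99%, far above the reported 63.0%
```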
&lt;p&gt;I still don't think training models not to fall for prompt injection is the way forward here. We continue to need to design our applications under the assumption that a suitably motivated attacker will be able to find a way to trick the models.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/prompt-injection"&gt;prompt-injection&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/evals"&gt;evals&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/november-2025-inflection"&gt;november-2025-inflection&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="prompt-injection"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="claude"/><category term="evals"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="november-2025-inflection"/></entry><entry><title>Trying out Gemini 3 Pro with audio transcription and a new pelican benchmark</title><link href="https://simonwillison.net/2025/Nov/18/gemini-3/#atom-tag" rel="alternate"/><published>2025-11-18T19:00:48+00:00</published><updated>2025-11-18T19:00:48+00:00</updated><id>https://simonwillison.net/2025/Nov/18/gemini-3/#atom-tag</id><summary type="html">
    &lt;p&gt;Google released Gemini 3 Pro today. Here's &lt;a href="https://blog.google/products/gemini/gemini-3/"&gt;the announcement from Sundar Pichai, Demis Hassabis, and Koray Kavukcuoglu&lt;/a&gt;, their &lt;a href="https://blog.google/technology/developers/gemini-3-developers/"&gt;developer blog announcement from Logan Kilpatrick&lt;/a&gt;, the &lt;a href="https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Pro-Model-Card.pdf"&gt;Gemini 3 Pro Model Card&lt;/a&gt;, and their &lt;a href="https://blog.google/products/gemini/gemini-3-collection/"&gt;collection of 11 more articles&lt;/a&gt;. It's a big release!&lt;/p&gt;
&lt;p&gt;I had a few days of preview access to this model via &lt;a href="https://aistudio.google.com/"&gt;AI Studio&lt;/a&gt;. The best way to describe it is that it's &lt;strong&gt;Gemini 2.5 upgraded to match the leading rival models&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Gemini 3 has the same underlying characteristics as Gemini 2.5. The knowledge cutoff is the same (January 2025). It accepts 1 million input tokens, can output up to 64,000 tokens, and has multimodal inputs across text, images, audio, and video.&lt;/p&gt;
&lt;h4 id="benchmarks"&gt;Benchmarks&lt;/h4&gt;
&lt;p&gt;Google's own reported numbers (in &lt;a href="https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Pro-Model-Card.pdf"&gt;the model card&lt;/a&gt;) show it scoring slightly higher than Claude Sonnet 4.5 and GPT-5.1 on most of the standard benchmarks. As always I'm waiting for independent confirmation, but I have no reason to believe those numbers are inaccurate.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gemini-3-benchmarks.jpg" alt="Table of benchmark numbers, described in full below" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4 id="pricing"&gt;Pricing&lt;/h4&gt;
&lt;p&gt;In terms of pricing it's a little more expensive than Gemini 2.5 Pro but still cheaper than Claude Sonnet 4.5. Here's how it fits in with those other leading models:&lt;/p&gt;
&lt;center&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Model&lt;/th&gt;
      &lt;th&gt;Input (per 1M tokens)&lt;/th&gt;
      &lt;th&gt;Output (per 1M tokens)&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;GPT-5.1&lt;/td&gt;
      &lt;td&gt;$1.25&lt;/td&gt;
      &lt;td&gt;$10.00&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Gemini 2.5 Pro&lt;/td&gt;
      &lt;td&gt;
        ≤ 200k tokens: $1.25&lt;br /&gt;
        &amp;gt; 200k tokens: $2.50
      &lt;/td&gt;
      &lt;td&gt;
        ≤ 200k tokens: $10.00&lt;br /&gt;
        &amp;gt; 200k tokens: $15.00
      &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Gemini 3 Pro&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;
        ≤ 200k tokens: $2.00&lt;br /&gt;
        &amp;gt; 200k tokens: $4.00
      &lt;/td&gt;
      &lt;td&gt;
        ≤ 200k tokens: $12.00&lt;br /&gt;
        &amp;gt; 200k tokens: $18.00
      &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Claude Sonnet 4.5&lt;/td&gt;
      &lt;td&gt;
        ≤ 200k tokens: $3.00&lt;br /&gt;
        &amp;gt; 200k tokens: $6.00
      &lt;/td&gt;
      &lt;td&gt;
        ≤ 200k tokens: $15.00&lt;br /&gt;
        &amp;gt; 200k tokens: $22.50
      &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Claude Opus 4.1&lt;/td&gt;
      &lt;td&gt;$15.00&lt;/td&gt;
      &lt;td&gt;$75.00&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/center&gt;
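&lt;p&gt;Those tiered rates can be wrapped in a small helper. This sketch assumes the higher rate applies to the entire request once the prompt crosses 200,000 tokens, which matches how Google describes the tiers but is worth double-checking against their pricing page:&lt;/p&gt;

```python
def gemini_3_pro_cost(input_tokens, output_tokens):
    """Dollar cost of one Gemini 3 Pro request using the tiered rates above.

    Assumption: once the prompt exceeds 200k tokens, the >200k rate
    applies to all tokens in the request, not just the overflow.
    """
    long_context = input_tokens > 200_000
    input_rate = 4.00 if long_context else 2.00
    output_rate = 18.00 if long_context else 12.00
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

print(gemini_3_pro_cost(100_000, 5_000))  # short-context tier: 0.26
print(gemini_3_pro_cost(300_000, 5_000))  # long-context tier: 1.29
```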
&lt;h4 id="trying-it-out-against-a-complex-image"&gt;Trying it out against a complex image&lt;/h4&gt;
&lt;p&gt;That screenshot of the benchmarks from above looked like a good test for Gemini 3's multimodal support. I fed it that image URL and asked it to generate alt text for the image:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;llm -m gemini-3-pro-preview -a https://static.simonwillison.net/static/2025/gemini-3-benchmarks.jpg 'Alt text for this image, include all figures and make them comprehensible to a screen reader user'&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's what I got back:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A comprehensive table comparing four AI models across various performance benchmarks. The columns listed are: Benchmark Name, Description, Gemini 3 Pro (highlighted in blue), Gemini 2.5 Pro, Claude Sonnet 4.5, and GPT-5.1.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Humanity's Last Exam (Academic reasoning)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;No tools: Gemini 3 Pro 37.5%, Gemini 2.5 Pro 21.6%, Claude Sonnet 4.5 13.7%, GPT-5.1 26.5%.&lt;/li&gt;
&lt;li&gt;With search and code execution: Gemini 3 Pro 45.8% (others have no data).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;ARC-AGI-2 (Visual reasoning puzzles; ARC Prize Verified)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 31.1%, Gemini 2.5 Pro 4.9%, Claude Sonnet 4.5 13.6%, GPT-5.1 17.6%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;GPQA Diamond (Scientific knowledge; No tools)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 91.9%, Gemini 2.5 Pro 86.4%, Claude Sonnet 4.5 83.4%, GPT-5.1 88.1%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;AIME 2025 (Mathematics)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;No tools: Gemini 3 Pro 95.0%, Gemini 2.5 Pro 88.0%, Claude Sonnet 4.5 87.0%, GPT-5.1 94.0%.&lt;/li&gt;
&lt;li&gt;With code execution: Gemini 3 Pro 100%, Claude Sonnet 4.5 100%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;MathArena Apex (Challenging Math Contest problems)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 23.4%, Gemini 2.5 Pro 0.5%, Claude Sonnet 4.5 1.6%, GPT-5.1 1.0%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;MMMU-Pro (Multimodal understanding and reasoning)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 81.0%, Gemini 2.5 Pro 68.0%, Claude Sonnet 4.5 68.0%, GPT-5.1 76.0%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;ScreenSpot-Pro (Screen understanding)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 72.7%, Gemini 2.5 Pro 11.4%, Claude Sonnet 4.5 36.2%, GPT-5.1 3.5%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;CharXiv Reasoning (Information synthesis from complex charts)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 81.4%, Gemini 2.5 Pro 69.6%, Claude Sonnet 4.5 68.5%, GPT-5.1 69.5%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;OmniDocBench 1.5 (OCR; Overall Edit Distance, lower is better)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 0.115, Gemini 2.5 Pro 0.145, Claude Sonnet 4.5 0.145, GPT-5.1 0.147.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Video-MMMU (Knowledge acquisition from videos)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 87.6%, Gemini 2.5 Pro 83.6%, Claude Sonnet 4.5 77.8%, GPT-5.1 80.4%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;LiveCodeBench Pro (Competitive coding problems; Elo Rating, higher is better)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 2,439; Gemini 2.5 Pro 1,775; Claude Sonnet 4.5 1,418; GPT-5.1 2,243.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Terminal-Bench 2.0 (Agentic terminal coding; Terminus-2 agent)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 54.2%, Gemini 2.5 Pro 32.6%, Claude Sonnet 4.5 42.8%, GPT-5.1 47.6%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;SWE-Bench Verified (Agentic coding; Single attempt)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 76.2%, Gemini 2.5 Pro 59.6%, Claude Sonnet 4.5 77.2%, GPT-5.1 76.3%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;t2-bench (Agentic tool use)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 85.4%, Gemini 2.5 Pro 54.9%, Claude Sonnet 4.5 84.7%, GPT-5.1 80.2%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Vending-Bench 2 (Long-horizon agentic tasks; Net worth (mean), higher is better)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro $5,478.16; Gemini 2.5 Pro $573.64; Claude Sonnet 4.5 $3,838.74; GPT-5.1 $1,473.43.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;FACTS Benchmark Suite (Held out internal grounding, parametric, MM, and search retrieval benchmarks)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 70.5%, Gemini 2.5 Pro 63.4%, Claude Sonnet 4.5 50.4%, GPT-5.1 50.8%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;SimpleQA Verified (Parametric knowledge)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 72.1%, Gemini 2.5 Pro 54.5%, Claude Sonnet 4.5 29.3%, GPT-5.1 34.9%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;MMMLU (Multilingual Q&amp;amp;A)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 91.8%, Gemini 2.5 Pro 89.5%, Claude Sonnet 4.5 89.1%, GPT-5.1 91.0%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Global PIQA (Commonsense reasoning across 100 Languages and Cultures)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 93.4%, Gemini 2.5 Pro 91.5%, Claude Sonnet 4.5 90.1%, GPT-5.1 90.9%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;MRCR v2 (8-needle) (Long context performance)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;128k (average): Gemini 3 Pro 77.0%, Gemini 2.5 Pro 58.0%, Claude Sonnet 4.5 47.1%, GPT-5.1 61.6%.&lt;/li&gt;
&lt;li&gt;1M (pointwise): Gemini 3 Pro 26.3%, Gemini 2.5 Pro 16.4%, Claude Sonnet 4.5 (not supported), GPT-5.1 (not supported).&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;I have not checked every line of this but a loose spot-check looks accurate to me.&lt;/p&gt;
&lt;p&gt;That prompt took 1,105 input and 3,901 output tokens, at a cost of &lt;a href="https://www.llm-prices.com/#it=1105&amp;amp;cit=3901&amp;amp;ot=3901&amp;amp;ic=2&amp;amp;oc=12&amp;amp;sel=gemini-3-pro-preview"&gt;5.6824 cents&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I ran this follow-up prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;llm -c 'Convert to JSON'&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;You can see &lt;a href="https://gist.github.com/simonw/ea7d52706557528e7eb3912cdf9250b0#response-1"&gt;the full output here&lt;/a&gt;, which starts like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
  &lt;span class="pl-ent"&gt;"metadata"&lt;/span&gt;: {
    &lt;span class="pl-ent"&gt;"columns"&lt;/span&gt;: [
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Benchmark&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Description&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Gemini 3 Pro&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Gemini 2.5 Pro&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Claude Sonnet 4.5&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;GPT-5.1&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
    ]
  },
  &lt;span class="pl-ent"&gt;"benchmarks"&lt;/span&gt;: [
    {
      &lt;span class="pl-ent"&gt;"name"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Humanity's Last Exam&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"description"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Academic reasoning&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"sub_results"&lt;/span&gt;: [
        {
          &lt;span class="pl-ent"&gt;"condition"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;No tools&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"gemini_3_pro"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;37.5%&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"gemini_2_5_pro"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;21.6%&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"claude_sonnet_4_5"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;13.7%&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"gpt_5_1"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;26.5%&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
        },
        {
          &lt;span class="pl-ent"&gt;"condition"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;With search and code execution&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"gemini_3_pro"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;45.8%&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"gemini_2_5_pro"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"claude_sonnet_4_5"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"gpt_5_1"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;
        }
      ]
    },&lt;/pre&gt;&lt;/div&gt;
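&lt;p&gt;JSON in that shape is straightforward to consume. Here's a minimal sketch that pulls one benchmark's figures back out, using just the fragment above as sample data:&lt;/p&gt;

```python
import json

# Sample data mirroring the structure of the model's JSON output above
data = json.loads("""
{
  "benchmarks": [
    {
      "name": "Humanity's Last Exam",
      "description": "Academic reasoning",
      "sub_results": [
        {"condition": "No tools", "gemini_3_pro": "37.5%",
         "gemini_2_5_pro": "21.6%", "claude_sonnet_4_5": "13.7%",
         "gpt_5_1": "26.5%"}
      ]
    }
  ]
}
""")

for bench in data["benchmarks"]:
    for result in bench["sub_results"]:
        print(bench["name"], "/", result["condition"], "->", result["gemini_3_pro"])
```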
&lt;h4 id="analyzing-a-city-council-meeting"&gt;Analyzing a city council meeting&lt;/h4&gt;
&lt;p&gt;To try it out against an audio file I extracted the 3h33m of audio from the video &lt;a href="https://www.youtube.com/watch?v=qgJ7x7R6gy0"&gt;Half Moon Bay City Council Meeting - November 4, 2025&lt;/a&gt;. I used &lt;code&gt;yt-dlp&lt;/code&gt; to get that audio:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;yt-dlp -x --audio-format m4a &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;https://www.youtube.com/watch?v=qgJ7x7R6gy0&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;That gave me a 74MB m4a file, which I ran through Gemini 3 Pro like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;llm -m gemini-3-pro-preview -a /tmp/HMBCC\ 11⧸4⧸25\ -\ Half\ Moon\ Bay\ City\ Council\ Meeting\ -\ November\ 4,\ 2025\ \[qgJ7x7R6gy0\].m4a 'Output a Markdown transcript of this meeting. Include speaker names and timestamps. Start with an outline of the key meeting sections, each with a title and summary and timestamp and list of participating names. Note in bold if anyone raised their voices, interrupted each other or had disagreements. Then follow with the full transcript.'&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That failed with an "Internal error encountered" message, so I shrunk the file down to a more manageable 38MB using &lt;code&gt;ffmpeg&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;ffmpeg -i &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;/private/tmp/HMB.m4a&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; -ac 1 -ar 22050 -c:a aac -b:a 24k &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;/private/tmp/HMB_compressed.m4a&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then ran it again like this (for some reason I had to use &lt;code&gt;--attachment-type&lt;/code&gt; this time):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;llm -m gemini-3-pro-preview --attachment-type /tmp/HMB_compressed.m4a 'audio/aac' 'Output a Markdown transcript of this meeting. Include speaker names and timestamps. Start with an outline of the key meeting sections, each with a title and summary and timestamp and list of participating names. Note in bold if anyone raised their voices, interrupted each other or had disagreements. Then follow with the full transcript.'&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This time it worked! The &lt;a href="https://gist.github.com/simonw/0b7bc23adb6698f376aebfd700943314"&gt;full output is here&lt;/a&gt;, but it starts like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Here is the transcript of the Half Moon Bay City Council meeting.&lt;/p&gt;
&lt;h4&gt;Meeting Outline&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;1. Call to Order, Updates, and Public Forum&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Summary:&lt;/strong&gt; Mayor Brownstone calls the meeting to order. City Manager Chidester reports no reportable actions from the closed session. Announcements are made regarding food insecurity volunteers and the Diwali celebration. During the public forum, Councilmember Penrose (speaking as a citizen) warns against autocracy. Citizens speak regarding lease agreements, downtown maintenance, local music events, and homelessness outreach statistics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timestamp:&lt;/strong&gt; 00:00:00 - 00:13:25&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Participants:&lt;/strong&gt; Mayor Brownstone, Matthew Chidester, Irma Acosta, Deborah Penrose, Jennifer Moore, Sandy Vella, Joaquin Jimenez, Anita Rees.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;2. Consent Calendar&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Summary:&lt;/strong&gt; The Council approves minutes from previous meetings and a resolution authorizing a licensing agreement for Seahorse Ranch. Councilmember Johnson corrects a pull request regarding abstentions on minutes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timestamp:&lt;/strong&gt; 00:13:25 - 00:15:15&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Participants:&lt;/strong&gt; Mayor Brownstone, Councilmember Johnson, Councilmember Penrose, Vice Mayor Ruddick, Councilmember Nagengast.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;3. Ordinance Introduction: Commercial Vitality (Item 9A)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Summary:&lt;/strong&gt; Staff presents a new ordinance to address neglected and empty commercial storefronts, establishing maintenance and display standards. Councilmembers discuss enforcement mechanisms, window cleanliness standards, and the need for objective guidance documents to avoid subjective enforcement.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timestamp:&lt;/strong&gt; 00:15:15 - 00:30:45&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Participants:&lt;/strong&gt; Karen Decker, Councilmember Johnson, Councilmember Nagengast, Vice Mayor Ruddick, Councilmember Penrose.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;4. Ordinance Introduction: Building Standards &amp;amp; Electrification (Item 9B)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Summary:&lt;/strong&gt; Staff introduces updates to the 2025 Building Code. A major change involves repealing the city's all-electric building requirement due to the 9th Circuit Court ruling (&lt;em&gt;California Restaurant Association v. City of Berkeley&lt;/em&gt;). &lt;strong&gt;Public speaker Mike Ferreira expresses strong frustration and disagreement with "unelected state agencies" forcing the City to change its ordinances.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timestamp:&lt;/strong&gt; 00:30:45 - 00:45:00&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Participants:&lt;/strong&gt; Ben Corrales, Keith Weiner, Joaquin Jimenez, Jeremy Levine, Mike Ferreira, Councilmember Penrose, Vice Mayor Ruddick.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;5. Housing Element Update &amp;amp; Adoption (Item 9C)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Summary:&lt;/strong&gt; Staff presents the 5th draft of the Housing Element, noting State HCD requirements to modify ADU allocations and place a measure on the ballot regarding the "Measure D" growth cap. &lt;strong&gt;There is significant disagreement from Councilmembers Ruddick and Penrose regarding the State's requirement to hold a ballot measure.&lt;/strong&gt; Public speakers debate the enforceability of Measure D. &lt;strong&gt;Mike Ferreira interrupts the vibe to voice strong distaste for HCD's interference in local law.&lt;/strong&gt; The Council votes to adopt the element but strikes the language committing to a ballot measure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timestamp:&lt;/strong&gt; 00:45:00 - 01:05:00&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Participants:&lt;/strong&gt; Leslie (Staff), Joaquin Jimenez, Jeremy Levine, Mike Ferreira, Councilmember Penrose, Vice Mayor Ruddick, Councilmember Johnson.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr /&gt;
&lt;h4&gt;Transcript&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Mayor Brownstone&lt;/strong&gt; [00:00:00]
Good evening everybody and welcome to the November 4th Half Moon Bay City Council meeting. As a reminder, we have Spanish interpretation services available in person and on Zoom.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Victor Hernandez (Interpreter)&lt;/strong&gt; [00:00:35]
Thank you, Mr. Mayor, City Council, all city staff, members of the public. &lt;em&gt;[Spanish instructions provided regarding accessing the interpretation channel on Zoom and in the room.]&lt;/em&gt; Thank you very much.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Those first two lines of the transcript already illustrate something interesting here: Gemini 3 Pro chose NOT to include the exact text of the Spanish instructions, instead summarizing them as "[Spanish instructions provided regarding accessing the interpretation channel on Zoom and in the room.]".&lt;/p&gt;
&lt;p&gt;I haven't spot-checked the entire 3h33m meeting, but I have confirmed that the timestamps do not line up. The transcript closes like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Mayor Brownstone&lt;/strong&gt; [01:04:00]
Meeting adjourned. Have a good evening.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That actually happens &lt;a href="https://www.youtube.com/watch?v=qgJ7x7R6gy0&amp;amp;t=3h31m5s"&gt;at 3h31m5s&lt;/a&gt; and the mayor says:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Okay. Well, thanks everybody, members of the public for participating. Thank you for staff. Thank you to fellow council members. This meeting is now adjourned. Have a good evening.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I'm disappointed about the timestamps, since mismatches there make it much harder to jump to the right point and confirm that the summarized transcript is an accurate representation of what was said.&lt;/p&gt;
&lt;p&gt;This took 320,087 input tokens and 7,870 output tokens, for a total cost of &lt;a href="https://www.llm-prices.com/#it=320087&amp;amp;ot=7870&amp;amp;ic=4&amp;amp;oc=18"&gt;$1.42&lt;/a&gt;.&lt;/p&gt;
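&lt;p&gt;The arithmetic behind that figure, using the $4/million input and $18/million output rates from the linked calculator (prompts this long are billed at Gemini 3 Pro's higher long-context tier):&lt;/p&gt;

```python
input_tokens, output_tokens = 320_087, 7_870
input_rate, output_rate = 4.00, 18.00  # dollars per million tokens
cost = (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000
print(f"${cost:.2f}")  # $1.42
```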
&lt;h4 id="and-a-new-pelican-benchmark"&gt;And a new pelican benchmark&lt;/h4&gt;
&lt;p&gt;Gemini 3 Pro has a new concept of a "thinking level" which can be set to low or high (and defaults to high). I tried my classic &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;Generate an SVG of a pelican riding a bicycle&lt;/a&gt; prompt at both levels.&lt;/p&gt;
&lt;p&gt;Here's low - Gemini decided to add a jaunty little hat (with a comment &lt;a href="https://gist.github.com/simonw/70d56ba39b7cbb44985d2384004fc4a0#response"&gt;in the SVG&lt;/a&gt; that says &lt;code&gt;&amp;lt;!-- Hat (Optional Fun Detail) --&amp;gt;&lt;/code&gt;):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gemini-3-pelican-low.png" alt="The pelican is wearing a blue hat. It has a good beak. The bicycle is a little bit incorrect but generally a good effort." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And here's high. This is genuinely an excellent pelican, and the bicycle frame is at least the correct shape:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gemini-3-pelican-high.png" alt="The pelican is not wearing a hat. It has a good beak. The bicycle is accurate and well-drawn." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Honestly though, my pelican benchmark is beginning to feel a little bit too basic. I decided to upgrade it. Here's v2 of the benchmark, which I plan to use going forward:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Generate an SVG of a California brown pelican riding a bicycle. The bicycle must have spokes and a correctly shaped bicycle frame. The pelican must have its characteristic large pouch, and there should be a clear indication of feathers. The pelican must be clearly pedaling the bicycle. The image should show the full breeding plumage of the California brown pelican.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;For reference, here's a photo I took of a California brown pelican recently (sadly without a bicycle):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/breeding-plumage.jpg" alt="A glorious California brown pelican perched on a rock by the water. It has a yellow tint to its head and a red spot near its throat." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Here's Gemini 3 Pro's &lt;a href="https://gist.github.com/simonw/2b9930ae1ce6f3f5e9cfe3cb31ec0c0a"&gt;attempt&lt;/a&gt; at high thinking level for that new prompt:&lt;/p&gt;
&lt;p id="advanced-pelican"&gt;&lt;img src="https://static.simonwillison.net/static/2025/gemini-3-breeding-pelican-high.png" alt="It's clearly a pelican. It has all of the requested features. It looks a bit abstract though." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And for good measure, here's that same prompt &lt;a href="https://gist.github.com/simonw/7a655ebe42f3d428d2ea5363dad8067c"&gt;against GPT-5.1&lt;/a&gt; - which produced this dumpy little fellow:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-5-1-breeding-pelican.png" alt="The pelican is very round. Its body overlaps much of the bicycle. It has a lot of dorky charisma." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And Claude Sonnet 4.5, which &lt;a href="https://gist.github.com/simonw/3296af92e4328dd4740385e6a4a2ac35"&gt;didn't do quite as well&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/claude-sonnet-4-5-breeding-pelican.png" alt="Oh dear. It has all of the requested components, but the bicycle is a bit wrong and the pelican is arranged in a very awkward shape." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;None of the models seem to have caught on to the crucial detail that the California brown pelican is not, in fact, brown.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="google"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="gemini"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/></entry><entry><title>MiniMax M2 &amp; Agent: Ingenious in Simplicity</title><link href="https://simonwillison.net/2025/Oct/29/minimax-m2/#atom-tag" rel="alternate"/><published>2025-10-29T22:49:47+00:00</published><updated>2025-10-29T22:49:47+00:00</updated><id>https://simonwillison.net/2025/Oct/29/minimax-m2/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.minimax.io/news/minimax-m2"&gt;MiniMax M2 &amp;amp; Agent: Ingenious in Simplicity&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;MiniMax M2 was released on Monday 27th October by MiniMax, a Chinese AI lab founded in December 2021.&lt;/p&gt;
&lt;p&gt;It's a very promising model. Their self-reported benchmark scores show it as comparable to Claude Sonnet 4, and Artificial Analysis &lt;a href="https://x.com/ArtificialAnlys/status/1982714153375854998"&gt;are ranking it&lt;/a&gt; as the best currently available open weight model according to their intelligence score:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;MiniMax’s M2 achieves a new all-time-high Intelligence Index score for an open weights model and offers impressive efficiency with only 10B active parameters (200B total). [...]&lt;/p&gt;
&lt;p&gt;The model’s strengths include tool use and instruction following (as shown by Tau2 Bench and IFBench). As such, while M2 likely excels at agentic use cases it may underperform other open weights leaders such as DeepSeek V3.2 and Qwen3 235B at some generalist tasks. This is in line with a number of recent open weights model releases from Chinese AI labs which focus on agentic capabilities, likely pointing to a heavy post-training emphasis on RL.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The size is particularly significant: the model weights are 230GB &lt;a href="https://huggingface.co/MiniMaxAI/MiniMax-M2"&gt;on Hugging Face&lt;/a&gt;, significantly smaller than other high performing open weight models. That's small enough to run on a 256GB Mac Studio, and the MLX community &lt;a href="https://huggingface.co/mlx-community/MiniMax-M2-8bit"&gt;have that working already&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;MiniMax offer their own API, and recommend using their Anthropic-compatible endpoint and the official Anthropic SDKs to access it. MiniMax Head of Engineering Skyler Miao
 &lt;a href="https://x.com/SkylerMiao7/status/1982989507252367687"&gt;provided some background on that&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;M2 is a agentic thinking model, it do interleaved thinking like sonnet 4.5, which means every response will contain its thought content.
Its very important for M2 to keep the chain of thought. So we must make sure the history thought passed back to the model.
Anthropic API support it for sure, as sonnet needs it as well. OpenAI only support it in their new Response API, no support for in ChatCompletion.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;MiniMax are offering the new model via their API for free until November 7th, after which the cost will be $0.30/million input tokens and $1.20/million output tokens - similar in price to Gemini 2.5 Flash and GPT-5 Mini, see &lt;a href="https://www.llm-prices.com/#it=51&amp;amp;ot=4017&amp;amp;sel=minimax-m2%2Cgpt-5-mini%2Cclaude-3-haiku%2Cgemini-2.5-flash-lite%2Cgemini-2.5-flash"&gt;price comparison here&lt;/a&gt; on my &lt;a href="https://www.llm-prices.com/"&gt;llm-prices.com&lt;/a&gt; site.&lt;/p&gt;
&lt;p&gt;I released a new plugin for &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; called &lt;a href="https://github.com/simonw/llm-minimax"&gt;llm-minimax&lt;/a&gt; providing support for M2 via the MiniMax API:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-minimax
llm keys set minimax
# Paste key here
llm -m m2 -o max_tokens 10000 "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/da79447830dc431c067a93648b338be6"&gt;the result&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Biycle is good though obscured by the pelican. Pelican has an impressive triple beak and is stretched along the bicycle frame. Not clear if it can pedal or what it is sitting on." src="https://static.simonwillison.net/static/2025/m2-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;That's 51 input tokens and 4,017 output tokens. At $0.30/million input and $1.20/million output that pelican would cost 0.4836 cents - less than half a cent.&lt;/p&gt;
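&lt;p&gt;For anyone who wants to check that fraction-of-a-cent figure (this is just my own arithmetic against the published rates):&lt;/p&gt;

```python
input_tokens, output_tokens = 51, 4_017
input_rate, output_rate = 0.30, 1.20  # dollars per million tokens
cents = 100 * (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000
print(f"{cents:.4f} cents")  # 0.4836 cents
```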
&lt;p&gt;This is the first plugin I've written for an Anthropic-API-compatible model. I first released &lt;a href="https://github.com/simonw/llm-anthropic/releases/tag/0.21"&gt;llm-anthropic 0.21&lt;/a&gt;, adding the ability to customize the &lt;code&gt;base_url&lt;/code&gt; parameter when using that model class. This meant the new plugin was less than &lt;a href="https://github.com/simonw/llm-minimax/blob/0.1/llm_minimax.py"&gt;30 lines of Python&lt;/a&gt;.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/minimax"&gt;minimax&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="ai-in-china"/><category term="minimax"/></entry><entry><title>Introducing Claude Haiku 4.5</title><link href="https://simonwillison.net/2025/Oct/15/claude-haiku-45/#atom-tag" rel="alternate"/><published>2025-10-15T19:36:34+00:00</published><updated>2025-10-15T19:36:34+00:00</updated><id>https://simonwillison.net/2025/Oct/15/claude-haiku-45/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.anthropic.com/news/claude-haiku-4-5"&gt;Introducing Claude Haiku 4.5&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Anthropic released Claude Haiku 4.5 today, the cheapest member of the Claude 4.5 family that started with Sonnet 4.5 &lt;a href="https://simonwillison.net/2025/Sep/29/claude-sonnet-4-5/"&gt;a couple of weeks ago&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It's priced at $1/million input tokens and $5/million output tokens, slightly more expensive than Haiku 3.5 ($0.80/$4) and a &lt;em&gt;lot&lt;/em&gt; more expensive than the original Claude 3 Haiku ($0.25/$1.25), both of which remain available at those prices.&lt;/p&gt;
&lt;p&gt;It's a third of the price of Sonnet 4 and Sonnet 4.5 (both $3/$15) which is notable because Anthropic's benchmarks put it in a similar space to that older Sonnet 4 model. As they put it:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;What was recently at the frontier is now cheaper and faster. Five months ago, Claude Sonnet 4 was a state-of-the-art model. Today, Claude Haiku 4.5 gives you similar levels of coding performance but at one-third the cost and more than twice the speed.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I've been hoping to see Anthropic release a fast, inexpensive model that's price competitive with the cheapest models from OpenAI and Gemini, currently $0.05/$0.40 (GPT-5-Nano) and $0.075/$0.30 (Gemini 2.0 Flash Lite). Haiku 4.5 certainly isn't that; it looks like they're continuing to focus squarely on the "great at code" part of the market.&lt;/p&gt;
&lt;p&gt;The new Haiku is the first Haiku model to support reasoning. It sports a 200,000 token context window, a 64,000 token maximum output (up from just 8,192 for Haiku 3.5) and a "reliable knowledge cutoff" of February 2025, one month later than the January 2025 date for Sonnet 4 and 4.5 and Opus 4 and 4.1.&lt;/p&gt;
&lt;p&gt;Something that caught my eye in the accompanying &lt;a href="https://assets.anthropic.com/m/99128ddd009bdcb/original/Claude-Haiku-4-5-System-Card.pdf"&gt;system card&lt;/a&gt; was this note about context length:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;For Claude Haiku 4.5, we trained the model to be explicitly context-aware, with precise information about how much context-window has been used. This has two effects: the model learns when and how to wrap up its answer when the limit is approaching, and the model learns to continue reasoning more persistently when the limit is further away. We found this intervention—along with others—to be effective at limiting agentic “laziness” (the phenomenon where models stop working on a problem prematurely, give incomplete answers, or cut corners on tasks).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I've added the new price to &lt;a href="https://www.llm-prices.com/"&gt;llm-prices.com&lt;/a&gt;, released &lt;a href="https://github.com/simonw/llm-anthropic/releases/tag/0.20"&gt;llm-anthropic 0.20&lt;/a&gt; with the new model and updated my &lt;a href="https://tools.simonwillison.net/haiku"&gt;Haiku-from-your-webcam&lt;/a&gt; demo (&lt;a href="https://github.com/simonw/tools/blob/main/haiku.html"&gt;source&lt;/a&gt;) to use Haiku 4.5 as well.&lt;/p&gt;
&lt;p&gt;Here's &lt;code&gt;llm -m claude-haiku-4.5 'Generate an SVG of a pelican riding a bicycle'&lt;/code&gt; (&lt;a href="https://gist.github.com/simonw/31256c523fa502eeb303b8e0bbe30eee"&gt;transcript&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;&lt;img alt="Described by Haiku 4.5: A whimsical illustration of a bird with a round tan body, pink beak, and orange legs riding a bicycle against a blue sky and green grass background." src="https://static.simonwillison.net/static/2025/claude-haiku-4.5-pelican.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;18 input tokens and 1,513 output tokens = &lt;a href="https://www.llm-prices.com/#it=18&amp;amp;ot=1513&amp;amp;ic=1&amp;amp;oc=5"&gt;0.7583 cents&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=45595403"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="anthropic"/><category term="claude"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/></entry><entry><title>GPT-5 pro</title><link href="https://simonwillison.net/2025/Oct/6/gpt-5-pro/#atom-tag" rel="alternate"/><published>2025-10-06T19:48:45+00:00</published><updated>2025-10-06T19:48:45+00:00</updated><id>https://simonwillison.net/2025/Oct/6/gpt-5-pro/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://platform.openai.com/docs/models/gpt-5-pro"&gt;GPT-5 pro&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Here's OpenAI's model documentation for their GPT-5 pro model, released to their API today at their DevDay event.&lt;/p&gt;
&lt;p&gt;It has similar base characteristics to &lt;a href="https://platform.openai.com/docs/models/gpt-5"&gt;GPT-5&lt;/a&gt;: both share a September 30, 2024 knowledge cutoff and a 400,000 token context limit.&lt;/p&gt;
&lt;p&gt;GPT-5 pro has a maximum of 272,000 output tokens, up from 128,000 for GPT-5.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;As our most advanced reasoning model, GPT-5 pro defaults to (and only supports) &lt;code&gt;reasoning.effort: high&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It's only available via OpenAI's Responses API. My &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; tool doesn't support that in core yet, but the &lt;a href="https://github.com/simonw/llm-openai-plugin"&gt;llm-openai-plugin&lt;/a&gt; plugin does. I released &lt;a href="https://github.com/simonw/llm-openai-plugin/releases/tag/0.7"&gt;llm-openai-plugin 0.7&lt;/a&gt; adding support for the new model, then ran this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install -U llm-openai-plugin
llm -m openai/gpt-5-pro "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It's very, very slow. The model took 6 minutes 8 seconds to respond and charged me for 16 input and 9,205 output tokens. At $15/million input and $120/million output this pelican &lt;a href="https://www.llm-prices.com/#it=16&amp;amp;ot=9205&amp;amp;ic=15&amp;amp;oc=120&amp;amp;sb=output&amp;amp;sd=descending"&gt;cost me $1.10&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;&lt;img alt="It's obviously a pelican riding a bicycle. Half the spokes are missing on each wheel and the pelican is a bit squat looking." src="https://static.simonwillison.net/static/2025/gpt-5-pro.png" /&gt;&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/9a06ab36f486f31401fec1fc104a8ce5"&gt;the full transcript&lt;/a&gt;. It looks visually pretty similar to the much, much cheaper result I &lt;a href="https://simonwillison.net/2025/Aug/7/gpt-5/#and-some-svgs-of-pelicans"&gt;got from GPT-5&lt;/a&gt;.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-5"&gt;gpt-5&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/><category term="gpt-5"/></entry><entry><title>Claude Sonnet 4.5 is probably the "best coding model in the world" (at least for now)</title><link href="https://simonwillison.net/2025/Sep/29/claude-sonnet-4-5/#atom-tag" rel="alternate"/><published>2025-09-29T18:11:39+00:00</published><updated>2025-09-29T18:11:39+00:00</updated><id>https://simonwillison.net/2025/Sep/29/claude-sonnet-4-5/#atom-tag</id><summary type="html">
    &lt;p&gt;Anthropic &lt;a href="https://www.anthropic.com/news/claude-sonnet-4-5"&gt;released Claude Sonnet 4.5 today&lt;/a&gt;, with a &lt;em&gt;very&lt;/em&gt; bold set of claims:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Claude Sonnet 4.5 is the best coding model in the world. It's the strongest model for building complex agents. It’s the best model at using computers. And it shows substantial gains in reasoning and math.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Anthropic gave me access to a preview version of a "new model" over the weekend which turned out to be Sonnet 4.5. My initial impressions were that it felt like a better model for code than GPT-5-Codex, which has been my preferred coding model since &lt;a href="https://simonwillison.net/2025/Sep/23/gpt-5-codex/"&gt;it launched a few weeks ago&lt;/a&gt;. This space moves &lt;em&gt;so fast&lt;/em&gt; - Gemini 3 is rumored to land soon so who knows how long Sonnet 4.5 will continue to hold the "best coding model" crown.&lt;/p&gt;
&lt;p&gt;The pricing is the same as the previous Sonnet: $3/million input tokens and $15/million output tokens. This remains significantly cheaper than Claude Opus - $15/$75 - but still quite a bit more than GPT-5 and GPT-5-Codex, both at $1.25/$10.&lt;/p&gt;
&lt;h4 id="it-really-shines-with-claude-ai-code-interpreter"&gt;It really shines with Claude.ai Code Interpreter&lt;/h4&gt;
&lt;p&gt;The &lt;a href="https://claude.ai/"&gt;claude.ai&lt;/a&gt; web interface (not yet the Claude iPhone native app) recently added the ability for Claude to write and then directly execute code in a sandboxed server environment, using Python and Node.js. I &lt;a href="https://simonwillison.net/2025/Sep/9/claude-code-interpreter/"&gt;wrote about that in detail&lt;/a&gt; three weeks ago.&lt;/p&gt;
&lt;p&gt;Anthropic's implementation of this code interpreter pattern is more powerful than ChatGPT's equivalent because it can directly clone code from GitHub and install software packages from NPM and PyPI.&lt;/p&gt;
&lt;p&gt;Sonnet 4.5 &lt;em&gt;absolutely shines&lt;/em&gt; at using this tool!&lt;/p&gt;
&lt;p&gt;Here's my most ambitious experiment with it. I started like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Checkout https://github.com/simonw/llm and run the tests with&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;pip install -e '.[test]'&lt;/code&gt;&lt;br /&gt;
&lt;code&gt;pytest&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It did exactly as I asked, checking out my &lt;a href="https://github.com/simonw/llm"&gt;LLM repo&lt;/a&gt;, installing it along with all of its dependencies, and running the tests.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;✅ All 466 tests passed in 167.69 seconds (about 2 minutes 47 seconds)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Then I posed it a &lt;em&gt;wildly&lt;/em&gt; ambitious challenge:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;The tool stores prompts and responses in a SQLite database. It currently uses a single conversation row and a linear collection of responses within that conversation. Experiment with the idea of adding a parent_response_id column to that responses table and using it to model the responses in a conversation as a tree instead. Use new pytest tests which you execute individually (to save time on the whole test suite) to exercise your design. Write a tree_notes.md file which you first populate with your plan and then use for notes as you go along.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
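&lt;p&gt;The tree structure that prompt asks for boils down to a self-referencing foreign key plus a recursive query to walk a branch. Here's a minimal standalone sketch of the idea using SQLite's recursive CTEs - with simplified table and column names for illustration, not the actual LLM schema:&lt;/p&gt;

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE responses ("
    " id INTEGER PRIMARY KEY,"
    " prompt TEXT,"
    " parent_response_id INTEGER REFERENCES responses(id))"  # NULL = root
)
db.executemany("INSERT INTO responses VALUES (?, ?, ?)", [
    (1, "root prompt", None),
    (2, "branch A", 1),
    (3, "branch B", 1),        # sibling branch from the same parent
    (4, "follow-up on A", 2),
])

# Walk from a leaf back up to the root with a recursive CTE
path = db.execute("""
    WITH RECURSIVE ancestors(id, parent_response_id) AS (
        SELECT id, parent_response_id FROM responses WHERE id = 4
        UNION ALL
        SELECT r.id, r.parent_response_id
        FROM responses r JOIN ancestors a ON r.id = a.parent_response_id
    )
    SELECT id FROM ancestors
""").fetchall()
print([row[0] for row in path])  # [4, 2, 1]
```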
&lt;p&gt;As you can see in &lt;a href="https://claude.ai/share/9d5b0729-b58d-4b15-9e45-ab1e7152b89e"&gt;the resulting conversation transcript&lt;/a&gt;, it did a really impressive job of this! It churned away across &lt;em&gt;dozens&lt;/em&gt; of different tool calls, eventually outputting the following summary of its activity:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I've successfully experimented with adding tree-structured conversations to the LLM CLI tool. Here's what was accomplished:&lt;/p&gt;
&lt;h3 id="-implementation-complete"&gt;✅ &lt;strong&gt;Implementation Complete&lt;/strong&gt;
&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;1. Database Schema Enhancement&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Added &lt;code&gt;parent_response_id&lt;/code&gt; column to the &lt;code&gt;responses&lt;/code&gt; table via migration &lt;code&gt;m022_parent_response_id&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Enables tree-structured (branching) conversations instead of just linear sequences&lt;/li&gt;
&lt;li&gt;Fully backward compatible - existing data unaffected&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;2. Comprehensive Utility Module&lt;/strong&gt; (&lt;code&gt;tree_utils.py&lt;/code&gt;)&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;12 helper functions for tree operations:
&lt;ul&gt;
&lt;li&gt;Navigation (children, parent, siblings, paths)&lt;/li&gt;
&lt;li&gt;Analysis (depth, descendants, size)&lt;/li&gt;
&lt;li&gt;Queries (roots, leaves, statistics)&lt;/li&gt;
&lt;li&gt;Visualization (ASCII tree display)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;3. Complete Test Suite&lt;/strong&gt; (&lt;code&gt;test_tree_conversations.py&lt;/code&gt;)&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;16 tests covering all tree operations&lt;/li&gt;
&lt;li&gt;Tests linear chains, branching, multiple roots, forests&lt;/li&gt;
&lt;li&gt;Includes realistic debugging scenario&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;All tests pass ✅&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;4. Integration Verification&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Updated existing migration tests to expect new column&lt;/li&gt;
&lt;li&gt;All 6 migration tests pass&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total: 22/22 tests passing ✅&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="-deliverables-7-files-in-mnt-user-data-outputs-"&gt;📦 &lt;strong&gt;Deliverables&lt;/strong&gt; (7 files in &lt;code&gt;/mnt/user-data/outputs/&lt;/code&gt;)&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;README.md&lt;/strong&gt; - Quick start guide&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IMPLEMENTATION_SUMMARY.md&lt;/strong&gt; - Complete technical overview&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MIGRATION_CODE.md&lt;/strong&gt; - Database migration details&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TEST_RESULTS.md&lt;/strong&gt; - Full test coverage report&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;tree_notes.md&lt;/strong&gt; - Design decisions and development notes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;tree_utils.py&lt;/strong&gt; - Utility functions module&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;test_tree_conversations.py&lt;/strong&gt; - Test suite&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id="-key-features"&gt;🎯 &lt;strong&gt;Key Features&lt;/strong&gt;
&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Multiple branches from any conversation point&lt;/li&gt;
&lt;li&gt;Multiple roots per conversation (forest structure)&lt;/li&gt;
&lt;li&gt;Rich analytics (depth, branching factor, tree size)&lt;/li&gt;
&lt;li&gt;ASCII tree visualization&lt;/li&gt;
&lt;li&gt;Cycle detection for safety&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="-ready-for-integration"&gt;🚀 &lt;strong&gt;Ready for Integration&lt;/strong&gt;
&lt;/h3&gt;
&lt;p&gt;The implementation is production-ready with comprehensive testing. Next steps would be:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Integrate &lt;code&gt;tree_utils.py&lt;/code&gt; into LLM package&lt;/li&gt;
&lt;li&gt;Add CLI commands (&lt;code&gt;llm branch&lt;/code&gt;, &lt;code&gt;llm tree&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Update &lt;code&gt;Response.log()&lt;/code&gt; to accept &lt;code&gt;parent_response_id&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;All code is tested, documented, and ready to use!&lt;/p&gt;
&lt;/blockquote&gt;
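&lt;p&gt;Two of the helpers that summary describes are straightforward to sketch (hypothetical implementations in the spirit of the described &lt;code&gt;tree_utils.py&lt;/code&gt;, not Claude's actual code):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def ascii_tree(children, node, depth=0):
    """Render a parent-to-children mapping as an indented ASCII tree."""
    lines = ["  " * depth + node]
    for child in children.get(node, []):
        lines.extend(ascii_tree(children, child, depth + 1))
    return lines

def has_cycle(parents, start):
    """Follow parent pointers from start; report if we revisit a node."""
    seen = set()
    current = start
    while current is not None:
        if current in seen:
            return True
        seen.add(current)
        current = parents.get(current)
    return False

children = {"r1": ["r2", "r3"], "r3": ["r4"]}
print("\n".join(ascii_tree(children, "r1")))
print(has_cycle({"a": "b", "b": "a"}, "a"))  # True
&lt;/code&gt;&lt;/pre&gt;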
&lt;p&gt;Did I mention I kicked off this experiment by entering prompts on my phone?&lt;/p&gt;
&lt;p&gt;I told it to &lt;code&gt;Give me a zip file of everything you have done so far&lt;/code&gt; - you can explore the contents of the file it made me &lt;a href="https://gist.github.com/simonw/f9d0f870e8d1af399a7f366a7f87b8b4"&gt;in this Gist&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: I shared a &lt;a href="https://claude.ai/share/9d5b0729-b58d-4b15-9e45-ab1e7152b89e"&gt;longer version&lt;/a&gt; of this experiment where I had the model experiment with a complex message hashing mechanism as well. I've &lt;a href="https://claude.ai/share/9d5b0729-b58d-4b15-9e45-ab1e7152b89e"&gt;pushed the code to a branch&lt;/a&gt;, which is best explained by the &lt;a href="https://github.com/simonw/llm/blob/697636222a277f892dfffef86b2471cc89c95576/message_matching_notes.md"&gt;message_matching_notes.md&lt;/a&gt; document written by Claude.&lt;/p&gt;
&lt;h4 id="pelicans"&gt;Pelicans&lt;/h4&gt;
&lt;p&gt;I released &lt;a href="https://github.com/simonw/llm-anthropic/releases/tag/0.19"&gt;llm-anthropic 0.19&lt;/a&gt; adding support for the new model. Then I ran my &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;classic pelican benchmark&lt;/a&gt;, first &lt;a href="https://gist.github.com/simonw/52a5bb478fa743daede12d97173cb817"&gt;with thinking turned on&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m claude-sonnet-4.5 -o thinking 1 -o thinking_budget 4000 \
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Generate an SVG of a pelican riding a bicycle&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/claude-sonnet-4.5-thinking.png" alt="" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And then &lt;a href="https://gist.github.com/simonw/449619ea8f0c112e3faa72e99fcb16ee"&gt;without&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m claude-sonnet-4.5 \
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Generate an SVG of a pelican riding a bicycle&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/claude-sonnet-4.5.png" alt="" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;These are pretty good - they are recognizably pelicans! - though not quite as good &lt;a href="https://simonwillison.net/2025/Sep/23/gpt-5-codex/"&gt;as GPT-5-Codex&lt;/a&gt; which is better at drawing bicycles.&lt;/p&gt;
&lt;p&gt;I also tried having it describe this image:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/pelicans.jpg" alt="A whole mess of pelicans, described below." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m claude-sonnet-4.5 &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;describe this image&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; -a https://static.simonwillison.net/static/2024/pelicans.jpg&lt;/pre&gt;&lt;/div&gt;
&lt;blockquote&gt;
&lt;p&gt;This image shows a large gathering of pelicans crowded together along what appears to be a waterfront or shoreline. The birds are densely packed, creating a chaotic scene with dozens of brown pelicans visible. Their distinctive long bills and pouches are clearly visible among the mass of feathered bodies. In the background, there's calm water and a dark tree line or forested area. The lighting suggests this was taken during early morning or late afternoon, giving the scene a golden, warm tone. The pelicans appear to be resting or congregating together, which is typical behavior for these colonial seabirds.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h4 id="anthropic-are-rolling-this-out-everywhere"&gt;Anthropic are rolling this out everywhere&lt;/h4&gt;
&lt;p&gt;The release of this model has been &lt;em&gt;very&lt;/em&gt; well coordinated. My embargo on talking about it was due to lift at 10am Pacific today, and I got an email from them linking to their blog post at 10am on the dot. It's also already &lt;a href="https://openrouter.ai/anthropic/claude-sonnet-4.5"&gt;live on OpenRouter&lt;/a&gt; and &lt;a href="https://x.com/cursor_ai/status/1972713190074261949"&gt;in Cursor&lt;/a&gt; and &lt;a href="https://github.blog/changelog/2025-09-29-anthropic-claude-sonnet-4-5-is-in-public-preview-for-github-copilot/"&gt;GitHub Copilot&lt;/a&gt; and no doubt a whole bunch of other places as well.&lt;/p&gt;
&lt;p&gt;Anthropic also shipped a &lt;a href="https://marketplace.visualstudio.com/items?itemName=anthropic.claude-code"&gt;new Claude Code VS Code extension&lt;/a&gt; today, plus a big upgrade to the Claude Code terminal app. Plus they rebranded their confusingly named Claude Code SDK to the &lt;a href="https://docs.claude.com/en/api/agent-sdk/overview"&gt;Claude Agent SDK&lt;/a&gt; instead, emphasizing that it's a tool for building agents beyond just customizing the existing Claude Code product. That's available for both &lt;a href="https://docs.claude.com/en/api/agent-sdk/typescript"&gt;TypeScript&lt;/a&gt; and &lt;a href="https://docs.claude.com/en/api/agent-sdk/python"&gt;Python&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/code-interpreter"&gt;code-interpreter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-tool-use"&gt;llm-tool-use&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="anthropic"/><category term="claude"/><category term="code-interpreter"/><category term="llm-tool-use"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/></entry><entry><title>Grok 4 Fast</title><link href="https://simonwillison.net/2025/Sep/20/grok-4-fast/#atom-tag" rel="alternate"/><published>2025-09-20T23:59:33+00:00</published><updated>2025-09-20T23:59:33+00:00</updated><id>https://simonwillison.net/2025/Sep/20/grok-4-fast/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://x.ai/news/grok-4-fast"&gt;Grok 4 Fast&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;New hosted vision-enabled reasoning model from xAI that's designed to be fast and extremely competitive on price. It has a 2 million token context window and "was trained end-to-end with tool-use reinforcement learning".&lt;/p&gt;
&lt;p&gt;It's priced at $0.20/million input tokens and $0.50/million output tokens - 15x less than Grok 4 (which is $3/million input and $15/million output). That puts it cheaper than GPT-5 mini and Gemini 2.5 Flash on &lt;a href="https://www.llm-prices.com/"&gt;llm-prices.com&lt;/a&gt;.&lt;/p&gt;
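&lt;p&gt;As a rough sense of what that price gap means in practice (made-up workload numbers, per-million-token prices from above):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def cost(input_tokens, output_tokens, input_price, output_price):
    """Dollar cost given per-million-token prices."""
    return (input_tokens * input_price + output_tokens * output_price) / 1e6

workload = (500_000, 100_000)  # hypothetical input, output token counts
grok_4_fast = cost(*workload, 0.20, 0.50)
grok_4 = cost(*workload, 3.00, 15.00)
print(f"Grok 4 Fast: ${grok_4_fast:.2f}")      # $0.15
print(f"Grok 4:      ${grok_4:.2f}")           # $3.00
print(f"ratio: {grok_4 / grok_4_fast:.0f}x")   # 20x
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The exact multiple depends on the input/output mix, since input is 15x cheaper and output 30x cheaper.&lt;/p&gt;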
&lt;p&gt;The same model weights handle reasoning and non-reasoning based on a parameter passed to the model.&lt;/p&gt;
&lt;p&gt;I've been trying it out via my updated &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt; plugin, since Grok 4 Fast is available &lt;a href="https://openrouter.ai/x-ai/grok-4-fast"&gt;for free on OpenRouter&lt;/a&gt; for a limited period.&lt;/p&gt;
&lt;p&gt;Here's output from the &lt;a href="https://gist.github.com/simonw/7f9a5e5c780b1d5bfe98b4f4ad540551"&gt;non-reasoning model&lt;/a&gt;. This actually output an invalid SVG - I had to make &lt;a href="https://gist.github.com/simonw/7f9a5e5c780b1d5bfe98b4f4ad540551?permalink_comment_id=5768049#gistcomment-5768049"&gt;a tiny manual tweak&lt;/a&gt; to the XML to get it to render.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m openrouter/x-ai/grok-4-fast:free "Generate an SVG of a pelican riding a bicycle" -o reasoning_enabled false
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="Described by Grok 4 Fast: Simple line drawing of a white bird with a long yellow beak riding a bicycle, pedaling with its orange legs." src="https://static.simonwillison.net/static/2025/grok-4-no-reasoning.png" /&gt;&lt;/p&gt;
&lt;p&gt;(I initially ran this without that &lt;code&gt;-o reasoning_enabled false&lt;/code&gt; flag, but then I saw that &lt;a href="https://x.com/OpenRouterAI/status/1969427723098435738"&gt;OpenRouter enable reasoning by default&lt;/a&gt; for that model. Here's my &lt;a href="https://gist.github.com/simonw/6a52e6585cb3c45e64ae23b9c5ebafe9"&gt;previous invalid result&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;And &lt;a href="https://gist.github.com/simonw/539719a1495253bbd27f3107931e6dd3"&gt;the reasoning model&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m openrouter/x-ai/grok-4-fast:free "Generate an SVG of a pelican riding a bicycle" -o reasoning_enabled true
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="Described by Grok 4 Fast: A simple line drawing of a white pelican with a yellow beak holding a yellow object, riding a black bicycle on green grass under a blue sky with white clouds." src="https://static.simonwillison.net/static/2025/grok-4-fast-reasoning.png" /&gt;&lt;/p&gt;
&lt;p&gt;In related news, the New York Times had a story a couple of days ago about Elon's recent focus on xAI: &lt;a href="https://www.nytimes.com/2025/09/18/technology/elon-musk-artificial-intelligence-xai.html"&gt;Since Leaving Washington, Elon Musk Has Been All In on His A.I. Company&lt;/a&gt;.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/grok"&gt;grok&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xai"&gt;xai&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="vision-llms"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="grok"/><category term="llm-release"/><category term="openrouter"/><category term="xai"/></entry><entry><title>gpt-5 and gpt-5-mini rate limit updates</title><link href="https://simonwillison.net/2025/Sep/12/gpt-5-rate-limits/#atom-tag" rel="alternate"/><published>2025-09-12T23:14:46+00:00</published><updated>2025-09-12T23:14:46+00:00</updated><id>https://simonwillison.net/2025/Sep/12/gpt-5-rate-limits/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://twitter.com/openaidevs/status/1966610846559134140"&gt;gpt-5 and gpt-5-mini rate limit updates&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;OpenAI have increased the rate limits for their two main GPT-5 models. These look significant:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;gpt-5&lt;br&gt;
Tier 1: 30K → 500K TPM (1.5M batch)&lt;br&gt;
Tier 2: 450K → 1M (3M batch)&lt;br&gt;
Tier 3: 800K → 2M&lt;br&gt;
Tier 4: 2M → 4M&lt;/p&gt;
&lt;p&gt;gpt-5-mini&lt;br&gt;
Tier 1: 200K → 500K (5M batch)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://platform.openai.com/docs/models/gpt-5"&gt;GPT-5 rate limits here&lt;/a&gt; show tier 5 stays at 40M tokens per minute. The &lt;a href="https://platform.openai.com/docs/models/gpt-5-mini"&gt;GPT-5 mini rate limits&lt;/a&gt; for tiers 2 through 5 are 2M, 4M, 10M and 180M TPM respectively.&lt;/p&gt;
&lt;p&gt;As a reminder, &lt;a href="https://platform.openai.com/docs/guides/rate-limits#usage-tiers"&gt;those tiers&lt;/a&gt; are assigned based on how much money you have spent on the OpenAI API - from $5 for tier 1 up through $50, $100, $250 and then $1,000 for tier 5.&lt;/p&gt;
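&lt;p&gt;That spend-based rule is simple enough to sketch (thresholds as listed above; the function is illustrative, not an OpenAI API):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import bisect

# Cumulative API spend in dollars needed to unlock tiers 1 through 5.
THRESHOLDS = [5, 50, 100, 250, 1000]

def usage_tier(dollars_spent):
    """Return the highest tier whose threshold has been reached (0 = free)."""
    return bisect.bisect_right(THRESHOLDS, dollars_spent)

print(usage_tier(4))     # 0
print(usage_tier(120))   # 3
print(usage_tier(1000))  # 5
&lt;/code&gt;&lt;/pre&gt;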
&lt;p&gt;For comparison, Anthropic's current top tier is Tier 4 ($400 spent) which provides 2M maximum input tokens per minute and 400,000 maximum output tokens, though you can contact their sales team for higher limits than that.&lt;/p&gt;
&lt;p&gt;Gemini's top tier is Tier 3 for $1,000 spent and &lt;a href="https://ai.google.dev/gemini-api/docs/rate-limits#tier-3"&gt;currently gives you&lt;/a&gt; 8M TPM for Gemini 2.5 Pro and Flash and 30M TPM for the Flash-Lite and 2.0 Flash models.&lt;/p&gt;
&lt;p&gt;So OpenAI's new rate limit increases for their top performing model pull them ahead of Anthropic but still leave them significantly behind Gemini.&lt;/p&gt;
&lt;p&gt;GPT-5 mini remains the champion for smaller models with that enormous 180M TPM limit for its top tier.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-5"&gt;gpt-5&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="gemini"/><category term="llm-pricing"/><category term="gpt-5"/></entry><entry><title>Load Llama-3.2 WebGPU in your browser from a local folder</title><link href="https://simonwillison.net/2025/Sep/8/webgpu-local-folder/#atom-tag" rel="alternate"/><published>2025-09-08T20:53:52+00:00</published><updated>2025-09-08T20:53:52+00:00</updated><id>https://simonwillison.net/2025/Sep/8/webgpu-local-folder/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://static.simonwillison.net/static/2025/llama-3.2-webgpu/"&gt;Load Llama-3.2 WebGPU in your browser from a local folder&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Inspired by &lt;a href="https://news.ycombinator.com/item?id=45168953#45169054"&gt;a comment&lt;/a&gt; on Hacker News I decided to see if it was possible to modify the &lt;a href="https://github.com/huggingface/transformers.js-examples/tree/main/llama-3.2-webgpu"&gt;transformers.js-examples/tree/main/llama-3.2-webgpu&lt;/a&gt; Llama 3.2 chat demo (&lt;a href="https://huggingface.co/spaces/webml-community/llama-3.2-webgpu"&gt;online here&lt;/a&gt;, I &lt;a href="https://simonwillison.net/2024/Sep/30/llama-32-webgpu/"&gt;wrote about it last November&lt;/a&gt;) to add an option to open a local model file directly from a folder on disk, rather than waiting for it to download over the network.&lt;/p&gt;
&lt;p&gt;I posed the problem to OpenAI's GPT-5-enabled Codex CLI like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git clone https://github.com/huggingface/transformers.js-examples
cd transformers.js-examples/llama-3.2-webgpu
codex
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then this prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Modify this application such that it offers the user a file browse button for selecting their own local copy of the model file instead of loading it over the network. Provide a "download model" option too.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Codex churned away for several minutes, even running commands like &lt;code&gt;curl -sL https://raw.githubusercontent.com/huggingface/transformers.js/main/src/models.js | sed -n '1,200p'&lt;/code&gt; to inspect the source code of the underlying Transformers.js library.&lt;/p&gt;
&lt;p&gt;After four prompts total (&lt;a href="https://gist.github.com/simonw/3c46c9e609f6ee77367a760b5ca01bd2?permalink_comment_id=5751814#gistcomment-5751814"&gt;shown here&lt;/a&gt;) it built something which worked!&lt;/p&gt;
&lt;p&gt;To try it out you'll need your own local copy of the Llama 3.2 ONNX model. You can get that (a ~1.2GB download) like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git lfs install
git clone https://huggingface.co/onnx-community/Llama-3.2-1B-Instruct-q4f16
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then visit my &lt;a href="https://static.simonwillison.net/static/2025/llama-3.2-webgpu/"&gt;llama-3.2-webgpu&lt;/a&gt; page in Chrome or Firefox Nightly (since WebGPU is required), click "Browse folder", select that folder you just cloned, agree to the "Upload" confirmation (confusing since nothing is uploaded from your browser, the model file is opened locally on your machine) and click "Load local model".&lt;/p&gt;
&lt;p&gt;Here's an animated demo (recorded in real-time, I didn't speed this up):&lt;/p&gt;
&lt;p&gt;&lt;img alt="GIF. I follow the setup instructions, clicking to load a local model and browsing to the correct folder. Once loaded the model shows a chat interface, I run the example about time management which returns tokens at about 10/second." src="https://static.simonwillison.net/static/2025/webgpu-llama-demo-small.gif" /&gt;&lt;/p&gt;
&lt;p&gt;I pushed &lt;a href="https://github.com/simonw/transformers.js-examples/commit/cdebf4128c6e30414d437affd4b13b6c9c79421d"&gt;a branch with those changes here&lt;/a&gt;. The next step would be to modify this to support other models in addition to the Llama 3.2 demo, but I'm pleased to have got to this proof of concept with so little work beyond throwing some prompts at Codex to see if it could figure it out.&lt;/p&gt;
&lt;p&gt;According to the Codex &lt;code&gt;/status&lt;/code&gt; command &lt;a href="https://gist.github.com/simonw/3c46c9e609f6ee77367a760b5ca01bd2?permalink_comment_id=5751807#gistcomment-5751807"&gt;this used&lt;/a&gt; 169,818 input tokens, 17,112 output tokens and 1,176,320 cached input tokens. At current GPT-5 token pricing ($1.25/million input, $0.125/million cached input, $10/million output) that would cost 53.04 cents, but Codex CLI hooks into my existing $20/month ChatGPT Plus plan so this was bundled into that.&lt;/p&gt;
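&lt;p&gt;Working that out from the reported token counts is straightforward arithmetic:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Token counts reported by Codex /status, priced at the GPT-5 rates
# quoted above (dollars per million tokens).
input_tokens, cached_tokens, output_tokens = 169_818, 1_176_320, 17_112
cost = (
    input_tokens * 1.25 + cached_tokens * 0.125 + output_tokens * 10
) / 1e6
print(f"{cost * 100:.2f} cents")  # 53.04 cents
&lt;/code&gt;&lt;/pre&gt;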

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=45168953#45173297"&gt;My Hacker News comment&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/transformers-js"&gt;transformers-js&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/webgpu"&gt;webgpu&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vibe-coding"&gt;vibe-coding&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-5"&gt;gpt-5&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/codex-cli"&gt;codex-cli&lt;/a&gt;&lt;/p&gt;



</summary><category term="javascript"/><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="local-llms"/><category term="llms"/><category term="ai-assisted-programming"/><category term="transformers-js"/><category term="webgpu"/><category term="llm-pricing"/><category term="vibe-coding"/><category term="gpt-5"/><category term="codex-cli"/></entry><entry><title>Introducing gpt-realtime</title><link href="https://simonwillison.net/2025/Sep/1/introducing-gpt-realtime/#atom-tag" rel="alternate"/><published>2025-09-01T17:34:55+00:00</published><updated>2025-09-01T17:34:55+00:00</updated><id>https://simonwillison.net/2025/Sep/1/introducing-gpt-realtime/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://openai.com/index/introducing-gpt-realtime/"&gt;Introducing gpt-realtime&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Released a few days ago (August 28th), &lt;code&gt;gpt-realtime&lt;/code&gt; is OpenAI's new "most advanced speech-to-speech model". It looks like this is a replacement for the older &lt;code&gt;gpt-4o-realtime-preview&lt;/code&gt; model that was released &lt;a href="https://openai.com/index/introducing-the-realtime-api/"&gt;last October&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This is a slightly confusing release. The previous realtime model was clearly described as a variant of GPT-4o, sharing the same October 2023 training cut-off date as that model.&lt;/p&gt;
&lt;p&gt;I had expected that &lt;code&gt;gpt-realtime&lt;/code&gt; might be a GPT-5 relative, but its training date is still October 2023 whereas GPT-5 is September 2024.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;gpt-realtime&lt;/code&gt; also shares the relatively low 32,000 context token and 4,096 maximum output token limits of &lt;code&gt;gpt-4o-realtime-preview&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The only reference I found to GPT-5 in the documentation for the new model was a note saying "Ambiguity and conflicting instructions degrade performance, similar to GPT-5."&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://platform.openai.com/docs/guides/realtime-models-prompting#general-usage-tips"&gt;usage tips&lt;/a&gt; for &lt;code&gt;gpt-realtime&lt;/code&gt; have a few surprises:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Iterate relentlessly&lt;/strong&gt;. Small wording changes can make or break behavior.&lt;/p&gt;
&lt;p&gt;Example: Swapping “inaudible” → “unintelligible” improved noisy input handling. [...]&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Convert non-text rules to text&lt;/strong&gt;: The model responds better to clearly written text.&lt;/p&gt;
&lt;p&gt;Example: Instead of writing, "IF x &amp;gt; 3 THEN ESCALATE", write, "IF MORE THAN THREE FAILURES THEN ESCALATE."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;There are a whole lot more prompting tips in the new &lt;a href="https://cookbook.openai.com/examples/realtime_prompting_guide"&gt;Realtime Prompting Guide&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;OpenAI list several key improvements to &lt;code&gt;gpt-realtime&lt;/code&gt; including the ability to configure it with a list of MCP servers, "better instruction following" and the ability to send it images.&lt;/p&gt;
&lt;p&gt;My biggest confusion came from &lt;a href="https://openai.com/api/pricing/"&gt;the pricing page&lt;/a&gt;, which lists separate pricing for using the Realtime API with &lt;code&gt;gpt-realtime&lt;/code&gt; and GPT-4o mini. This suggests to me that the old &lt;a href="https://platform.openai.com/docs/models/gpt-4o-mini-realtime-preview"&gt;gpt-4o-mini-realtime-preview&lt;/a&gt; model is still available, despite it no longer being listed on the &lt;a href="https://platform.openai.com/docs/models"&gt;OpenAI models page&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;gpt-4o-mini-realtime-preview&lt;/code&gt; is a &lt;strong&gt;lot&lt;/strong&gt; cheaper:&lt;/p&gt;
&lt;table&gt;
    &lt;thead&gt;
        &lt;tr&gt;
            &lt;th&gt;Model&lt;/th&gt;
            &lt;th&gt;Token Type&lt;/th&gt;
            &lt;th&gt;Input&lt;/th&gt;
            &lt;th&gt;Cached Input&lt;/th&gt;
            &lt;th&gt;Output&lt;/th&gt;
        &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
        &lt;tr&gt;
            &lt;td rowspan="3"&gt;gpt-realtime&lt;/td&gt;
            &lt;td&gt;Text&lt;/td&gt;
            &lt;td&gt;$4.00&lt;/td&gt;
            &lt;td&gt;$0.40&lt;/td&gt;
            &lt;td&gt;$16.00&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;Audio&lt;/td&gt;
            &lt;td&gt;$32.00&lt;/td&gt;
            &lt;td&gt;$0.40&lt;/td&gt;
            &lt;td&gt;$64.00&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;Image&lt;/td&gt;
            &lt;td&gt;$5.00&lt;/td&gt;
            &lt;td&gt;$0.50&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td rowspan="2"&gt;gpt-4o-mini-realtime-preview&lt;/td&gt;
            &lt;td&gt;Text&lt;/td&gt;
            &lt;td&gt;$0.60&lt;/td&gt;
            &lt;td&gt;$0.30&lt;/td&gt;
            &lt;td&gt;$2.40&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;Audio&lt;/td&gt;
            &lt;td&gt;$10.00&lt;/td&gt;
            &lt;td&gt;$0.30&lt;/td&gt;
            &lt;td&gt;$20.00&lt;/td&gt;
        &lt;/tr&gt;
    &lt;/tbody&gt;
&lt;/table&gt;
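&lt;p&gt;To make that gap concrete, here's a sketch using made-up session token counts and the audio prices from the table:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def session_cost(audio_in, audio_out, price_in, price_out):
    """Dollar cost of a voice session given per-million-token audio prices."""
    return (audio_in * price_in + audio_out * price_out) / 1e6

audio_in, audio_out = 50_000, 20_000  # hypothetical session token counts
print(f"gpt-realtime: ${session_cost(audio_in, audio_out, 32, 64):.2f}")
print(f"gpt-4o-mini-realtime-preview: ${session_cost(audio_in, audio_out, 10, 20):.2f}")
&lt;/code&gt;&lt;/pre&gt;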

&lt;p&gt;The mini model also has a much longer 128,000 token context window.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: Turns out that was &lt;a href="https://twitter.com/_agamble/status/1962839472837361807"&gt;a mistake in the documentation&lt;/a&gt;, that mini model has a 16,000 token context size.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update 2&lt;/strong&gt;: OpenAI's &lt;a href="https://twitter.com/pbbakkum/status/1962901822135525695"&gt;Peter Bakkum clarifies&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;There are different voice models in API and ChatGPT, but they share some recent improvements. The voices are also different.&lt;/p&gt;
&lt;p&gt;gpt-realtime has a mix of data specific enough to itself that its not really 4o or 5&lt;/p&gt;
&lt;/blockquote&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/audio"&gt;audio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/realtime"&gt;realtime&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/multi-modal-output"&gt;multi-modal-output&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;



</summary><category term="audio"/><category term="realtime"/><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="llm-pricing"/><category term="multi-modal-output"/><category term="llm-release"/></entry><entry><title>Claude Sonnet 4 now supports 1M tokens of context</title><link href="https://simonwillison.net/2025/Aug/12/claude-sonnet-4-1m/#atom-tag" rel="alternate"/><published>2025-08-12T18:14:30+00:00</published><updated>2025-08-12T18:14:30+00:00</updated><id>https://simonwillison.net/2025/Aug/12/claude-sonnet-4-1m/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.anthropic.com/news/1m-context"&gt;Claude Sonnet 4 now supports 1M tokens of context&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Gemini and OpenAI both have million token models, so it's good to see Anthropic catching up. This is 5x the previous 200,000 context length limit of the various Claude Sonnet models.&lt;/p&gt;
&lt;p&gt;Anthropic have previously made 1 million tokens available to select customers. From &lt;a href="https://www.anthropic.com/news/claude-3-family"&gt;the Claude 3 announcement&lt;/a&gt; in March 2024:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The Claude 3 family of models will initially offer a 200K context window upon launch. However, all three models are capable of accepting inputs exceeding 1 million tokens and we may make this available to select customers who need enhanced processing power.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is also the first time I've seen Anthropic use prices that vary depending on context length:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Prompts ≤ 200K: $3/million input, $15/million output&lt;/li&gt;
&lt;li&gt;Prompts &amp;gt; 200K: $6/million input, $22.50/million output&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Gemini have been doing this for a while: Gemini 2.5 Pro is $1.25/$10 below 200,000 tokens and $2.50/$15 above 200,000.&lt;/p&gt;
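The tiered pricing is easy to model. Here's a minimal Python sketch using the rates above; the function name and the assumption that the higher tier applies to the entire request once the prompt crosses 200K tokens are mine, so check Anthropic's documentation for the exact billing rule:

```python
def sonnet_4_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated dollar cost of one Claude Sonnet 4 request (sketch).

    Assumes the higher tier applies to the whole request once the
    prompt exceeds 200K tokens.
    """
    if input_tokens <= 200_000:
        input_rate, output_rate = 3.00, 15.00   # $/million tokens
    else:
        input_rate, output_rate = 6.00, 22.50
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# A 300K-token prompt with a 2K-token response:
print(sonnet_4_cost(300_000, 2_000))  # → 1.845
```

So a single full-context prompt costs on the order of a couple of dollars, which is worth keeping in mind before pointing a long-context agent loop at this model.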
&lt;p&gt;Here's &lt;a href="https://docs.anthropic.com/en/docs/build-with-claude/context-windows#1m-token-context-window"&gt;Anthropic's full documentation on the 1m token context window&lt;/a&gt;. You need to send a &lt;code&gt;context-1m-2025-08-07&lt;/code&gt; beta header in your request to enable it.&lt;/p&gt;
&lt;p&gt;Note that this is currently restricted to "tier 4" users who have purchased at least $400 in API credits:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Long context support for Sonnet 4 is now in public beta on the Anthropic API for customers with Tier 4 and custom rate limits, with broader availability rolling out over the coming weeks.&lt;/p&gt;
&lt;/blockquote&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://x.com/claudeai/status/1955299573620261343"&gt;@claudeai&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/long-context"&gt;long-context&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="claude"/><category term="llm-pricing"/><category term="long-context"/></entry><entry><title>Quoting Nick Turley</title><link href="https://simonwillison.net/2025/Aug/12/nick-turley/#atom-tag" rel="alternate"/><published>2025-08-12T03:32:04+00:00</published><updated>2025-08-12T03:32:04+00:00</updated><id>https://simonwillison.net/2025/Aug/12/nick-turley/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://www.youtube.com/watch?v=ixY2PvQJ0To&amp;amp;t=2322s"&gt;&lt;p&gt;I think there's been a lot of decisions over time that proved pretty consequential, but we made them very quickly as we have to. [...]&lt;/p&gt;
&lt;p&gt;[On pricing] I had this kind of panic attack because we really needed to launch subscriptions because at the time we were taking the product down all the time. [...]&lt;/p&gt;
&lt;p&gt;So what I did do is ship a Google Form to Discord with &lt;a href="https://en.wikipedia.org/wiki/Van_Westendorp%27s_Price_Sensitivity_Meter"&gt;the four questions you're supposed to ask&lt;/a&gt; on how to price something.&lt;/p&gt;
&lt;p&gt;But we got with the $20. We were debating something slightly higher at the time. I often wonder what would have happened because so many other companies ended up copying the $20 price point, so did we erase a bunch of market cap by pricing it this way?&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://www.youtube.com/watch?v=ixY2PvQJ0To&amp;amp;t=2322s"&gt;Nick Turley&lt;/a&gt;, Head of ChatGPT, interviewed by Lenny Rachitsky&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/discord"&gt;discord&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatgpt"&gt;chatgpt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nick-turley"&gt;nick-turley&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="discord"/><category term="openai"/><category term="generative-ai"/><category term="chatgpt"/><category term="llms"/><category term="llm-pricing"/><category term="nick-turley"/></entry><entry><title>GPT-5: Key characteristics, pricing and model card</title><link href="https://simonwillison.net/2025/Aug/7/gpt-5/#atom-tag" rel="alternate"/><published>2025-08-07T17:36:12+00:00</published><updated>2025-08-07T17:36:12+00:00</updated><id>https://simonwillison.net/2025/Aug/7/gpt-5/#atom-tag</id><summary type="html">
    &lt;p&gt;I've had preview access to the new GPT-5 model family for the past two weeks (see &lt;a href="https://simonwillison.net/2025/Aug/7/previewing-gpt-5/"&gt;related video&lt;/a&gt; and &lt;a href="https://simonwillison.net/about/#disclosures"&gt;my disclosures&lt;/a&gt;) and have been using GPT-5 as my daily-driver. It's my new favorite model. It's still an LLM - it's not a dramatic departure from what we've had before - but it rarely screws up and generally feels competent or occasionally impressive at the kinds of things I like to use models for.&lt;/p&gt;
&lt;p&gt;I've collected a lot of notes over the past two weeks, so I've decided to break them up into &lt;a href="https://simonwillison.net/series/gpt-5/"&gt;a series of posts&lt;/a&gt;. This first one will cover key characteristics of the models, how they are priced and what we can learn from the &lt;a href="https://openai.com/index/gpt-5-system-card/"&gt;GPT-5 system card&lt;/a&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/7/gpt-5/#key-model-characteristics"&gt;Key model characteristics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/7/gpt-5/#position-in-the-openai-model-family"&gt;Position in the OpenAI model family&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/7/gpt-5/#pricing-is-aggressively-competitive"&gt;Pricing is aggressively competitive&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/7/gpt-5/#more-notes-from-the-system-card"&gt;More notes from the system card&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/7/gpt-5/#prompt-injection-in-the-system-card"&gt;Prompt injection in the system card&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/7/gpt-5/#thinking-traces-in-the-api"&gt;Thinking traces in the API&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/7/gpt-5/#and-some-svgs-of-pelicans"&gt;And some SVGs of pelicans&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id="key-model-characteristics"&gt;Key model characteristics&lt;/h4&gt;
&lt;p&gt;Let's start with the fundamentals. GPT-5 in ChatGPT is a weird hybrid that switches between different models. Here's what the system card says about that (my highlights in bold):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;GPT-5 is a unified system with a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and &lt;strong&gt;a real-time router that quickly decides which model to use based on conversation type, complexity, tool needs, and explicit intent&lt;/strong&gt; (for example, if you say “think hard about this” in the prompt). [...] Once usage limits are reached, a mini version of each model handles remaining queries. In the near future, we plan to integrate these capabilities into a single model.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;GPT-5 in the API is simpler: it's available as three models - &lt;strong&gt;regular&lt;/strong&gt;, &lt;strong&gt;mini&lt;/strong&gt; and &lt;strong&gt;nano&lt;/strong&gt; - which can each be run at one of four reasoning levels: minimal (a new level not previously available for other OpenAI reasoning models), low, medium or high.&lt;/p&gt;
&lt;p&gt;The models have an input limit of 272,000 tokens and an output limit (which includes invisible reasoning tokens) of 128,000 tokens. They support text and image for input, text only for output.&lt;/p&gt;
&lt;p&gt;I've mainly explored full GPT-5. My verdict: it's just &lt;strong&gt;good at stuff&lt;/strong&gt;. It doesn't feel like a dramatic leap ahead from other LLMs but it exudes competence - it rarely messes up, and frequently impresses me. I've found it to be a very sensible default for everything that I want to do. At no point have I found myself wanting to re-run a prompt against a different model to try and get a better result.&lt;/p&gt;

&lt;p&gt;Here are the OpenAI model pages for &lt;a href="https://platform.openai.com/docs/models/gpt-5"&gt;GPT-5&lt;/a&gt;, &lt;a href="https://platform.openai.com/docs/models/gpt-5-mini"&gt;GPT-5 mini&lt;/a&gt; and &lt;a href="https://platform.openai.com/docs/models/gpt-5-nano"&gt;GPT-5 nano&lt;/a&gt;. Knowledge cut-off is September 30th 2024 for GPT-5 and May 30th 2024 for GPT-5 mini and nano.&lt;/p&gt;

&lt;h4 id="position-in-the-openai-model-family"&gt;Position in the OpenAI model family&lt;/h4&gt;
&lt;p&gt;The three new GPT-5 models are clearly intended as a replacement for most of the rest of the OpenAI line-up. This table from the system card is useful, as it shows how they see the new models fitting in:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Previous model&lt;/th&gt;
&lt;th&gt;GPT-5 model&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o&lt;/td&gt;
&lt;td&gt;gpt-5-main&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o-mini&lt;/td&gt;
&lt;td&gt;gpt-5-main-mini&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI o3&lt;/td&gt;
&lt;td&gt;gpt-5-thinking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI o4-mini&lt;/td&gt;
&lt;td&gt;gpt-5-thinking-mini&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4.1-nano&lt;/td&gt;
&lt;td&gt;gpt-5-thinking-nano&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI o3 Pro&lt;/td&gt;
&lt;td&gt;gpt-5-thinking-pro&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;That "thinking-pro" model is currently only available via ChatGPT where it is labelled as "GPT-5 Pro" and limited to the $200/month tier. It uses "parallel test time compute".&lt;/p&gt;
&lt;p&gt;The only capabilities not covered by GPT-5 are audio input/output and image generation. Those remain covered by models like &lt;a href="https://platform.openai.com/docs/models/gpt-4o-audio-preview"&gt;GPT-4o Audio&lt;/a&gt; and &lt;a href="https://platform.openai.com/docs/models/gpt-4o-realtime-preview"&gt;GPT-4o Realtime&lt;/a&gt; and their mini variants and the &lt;a href="https://platform.openai.com/docs/models/gpt-image-1"&gt;GPT Image 1&lt;/a&gt; and DALL-E image generation models.&lt;/p&gt;
&lt;h4 id="pricing-is-aggressively-competitive"&gt;Pricing is aggressively competitive&lt;/h4&gt;
&lt;p&gt;The pricing is &lt;em&gt;aggressively competitive&lt;/em&gt; with other providers.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;GPT-5: $1.25/million for input, $10/million for output&lt;/li&gt;
&lt;li&gt;GPT-5 Mini: $0.25/m input, $2.00/m output&lt;/li&gt;
&lt;li&gt;GPT-5 Nano: $0.05/m input, $0.40/m output&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;GPT-5 is priced at half the input cost of GPT-4o, and maintains the same price for output. Those invisible reasoning tokens count as output tokens so you can expect most prompts to use more output tokens than their GPT-4o equivalent (unless you set reasoning effort to "minimal").&lt;/p&gt;
&lt;p&gt;The discount for token caching is significant too: 90% off on input tokens that have been used within the previous few minutes. This is particularly material if you are implementing a chat UI where the same conversation gets replayed every time the user adds another prompt to the sequence.&lt;/p&gt;
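A back-of-envelope Python sketch of that chat-replay effect, using the GPT-5 input rate from the list above. The simplifications are mine: every replayed token is assumed to hit the cache, and output tokens are ignored:

```python
GPT5_INPUT = 1.25    # $/million input tokens (from the list above)
GPT5_CACHED = 0.125  # 90% discount on recently-seen input tokens

def chat_input_cost(turn_tokens: list[int], cached: bool) -> float:
    """Input-token cost of a chat that re-sends the full history each turn.

    turn_tokens[i] is the number of new tokens added at turn i.
    """
    total, history = 0.0, 0
    for new in turn_tokens:
        rate_for_history = GPT5_CACHED if cached else GPT5_INPUT
        # replayed prefix plus the newly added tokens
        total += (history * rate_for_history + new * GPT5_INPUT) / 1_000_000
        history += new
    return total

# Five turns of 1,000 tokens each, with and without cache hits:
print(chat_input_cost([1000] * 5, cached=False))
print(chat_input_cost([1000] * 5, cached=True))
```

Even in this tiny example the cached conversation costs well under half as much, and the gap widens as the conversation grows.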
&lt;p&gt;Here's a comparison table I put together showing the new models alongside the most comparable models from OpenAI's competition:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input $/m&lt;/th&gt;
&lt;th&gt;Output $/m&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4.1&lt;/td&gt;
&lt;td&gt;15.00&lt;/td&gt;
&lt;td&gt;75.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet 4&lt;/td&gt;
&lt;td&gt;3.00&lt;/td&gt;
&lt;td&gt;15.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grok 4&lt;/td&gt;
&lt;td&gt;3.00&lt;/td&gt;
&lt;td&gt;15.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 2.5 Pro (&amp;gt;200,000)&lt;/td&gt;
&lt;td&gt;2.50&lt;/td&gt;
&lt;td&gt;15.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o&lt;/td&gt;
&lt;td&gt;2.50&lt;/td&gt;
&lt;td&gt;10.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4.1&lt;/td&gt;
&lt;td&gt;2.00&lt;/td&gt;
&lt;td&gt;8.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;o3&lt;/td&gt;
&lt;td&gt;2.00&lt;/td&gt;
&lt;td&gt;8.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 2.5 Pro (&amp;lt;200,000)&lt;/td&gt;
&lt;td&gt;1.25&lt;/td&gt;
&lt;td&gt;10.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPT-5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.25&lt;/td&gt;
&lt;td&gt;10.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;o4-mini&lt;/td&gt;
&lt;td&gt;1.10&lt;/td&gt;
&lt;td&gt;4.40&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude 3.5 Haiku&lt;/td&gt;
&lt;td&gt;0.80&lt;/td&gt;
&lt;td&gt;4.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4.1 mini&lt;/td&gt;
&lt;td&gt;0.40&lt;/td&gt;
&lt;td&gt;1.60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 2.5 Flash&lt;/td&gt;
&lt;td&gt;0.30&lt;/td&gt;
&lt;td&gt;2.50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grok 3 Mini&lt;/td&gt;
&lt;td&gt;0.30&lt;/td&gt;
&lt;td&gt;0.50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPT-5 Mini&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.25&lt;/td&gt;
&lt;td&gt;2.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o mini&lt;/td&gt;
&lt;td&gt;0.15&lt;/td&gt;
&lt;td&gt;0.60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 2.5 Flash-Lite&lt;/td&gt;
&lt;td&gt;0.10&lt;/td&gt;
&lt;td&gt;0.40&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4.1 Nano&lt;/td&gt;
&lt;td&gt;0.10&lt;/td&gt;
&lt;td&gt;0.40&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Amazon Nova Lite&lt;/td&gt;
&lt;td&gt;0.06&lt;/td&gt;
&lt;td&gt;0.24&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPT-5 Nano&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.05&lt;/td&gt;
&lt;td&gt;0.40&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Amazon Nova Micro&lt;/td&gt;
&lt;td&gt;0.035&lt;/td&gt;
&lt;td&gt;0.14&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;(Here's a good example of a GPT-5 failure: I tried to get it to &lt;a href="https://chatgpt.com/share/6894d804-bca4-8006-ac46-580bf4a9bf5f"&gt;sort that table itself&lt;/a&gt; but it put Nova Micro as more expensive than GPT-5 Nano, so I prompted it to "construct the table in Python and sort it there" and that fixed the issue.)&lt;/p&gt;
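The "do the sort in Python" fix is easy to reproduce by hand. A sketch with a few rows from the table above (the row selection is mine):

```python
# Sort price rows in code instead of asking the model to do it "by eye".
# (model, input $/m, output $/m) - a subset of the comparison table.
prices = [
    ("GPT-5 Nano", 0.05, 0.40),
    ("Claude Opus 4.1", 15.00, 75.00),
    ("Amazon Nova Micro", 0.035, 0.14),
    ("GPT-5", 1.25, 10.00),
]
for model, input_cost, output_cost in sorted(
    prices, key=lambda row: row[1], reverse=True
):
    print(f"{model}: ${input_cost}/m in, ${output_cost}/m out")
```

Sorting is exactly the kind of mechanical comparison where running code beats asking an LLM to eyeball adjacent decimal values.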
&lt;h4 id="more-notes-from-the-system-card"&gt;More notes from the system card&lt;/h4&gt;
&lt;p&gt;As usual, &lt;a href="https://openai.com/index/gpt-5-system-card/"&gt;the system card&lt;/a&gt; is vague on what went into the training data. Here's what it says:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Like OpenAI’s other models, the GPT-5 models were trained on diverse datasets, including information that is publicly available on the internet, information that we partner with third parties to access, and information that our users or human trainers and researchers provide or generate. [...] We use advanced data filtering processes to reduce personal information from training data.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I found this section interesting, as it reveals that writing, code and health are three of the most common use-cases for ChatGPT. This explains why so much effort went into health-related questions,  for both GPT-5 and the recently released OpenAI open weight models.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We’ve made significant advances in &lt;strong&gt;reducing hallucinations, improving instruction following, and minimizing sycophancy&lt;/strong&gt;, and have leveled up GPT-5’s performance in &lt;strong&gt;three of ChatGPT’s most common uses: writing, coding, and health&lt;/strong&gt;. All of the GPT-5 models additionally feature &lt;strong&gt;safe-completions, our latest approach to safety training&lt;/strong&gt; to prevent disallowed content.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Safe-completions is later described like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Large language models such as those powering ChatGPT have &lt;strong&gt;traditionally been trained to either be as helpful as possible or outright refuse a user request&lt;/strong&gt;, depending on whether the prompt is allowed by safety policy. [...] Binary refusal boundaries are especially ill-suited for dual-use cases (such as biology or cybersecurity), where a user request can be completed safely at a high level, but may lead to malicious uplift if sufficiently detailed or actionable. &lt;strong&gt;As an alternative, we introduced safe-completions: a safety-training approach that centers on the safety of the assistant’s output rather than a binary classification of the user’s intent&lt;/strong&gt;. Safe-completions seek to maximize helpfulness subject to the safety policy’s constraints.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So instead of straight up refusals, we should expect GPT-5 to still provide an answer but moderate that answer to avoid it including "harmful" content.&lt;/p&gt;
&lt;p&gt;OpenAI have a paper about this which I haven't read yet (I didn't get early access): &lt;a href="https://openai.com/index/gpt-5-safe-completions/"&gt;From Hard Refusals to Safe-Completions: Toward Output-Centric Safety Training&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Sycophancy gets a mention, unsurprising given &lt;a href="https://simonwillison.net/2025/May/2/what-we-missed-with-sycophancy/"&gt;their high profile disaster in April&lt;/a&gt;. They've worked on this in the core model:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;System prompts, while easy to modify, have a more limited impact on model outputs relative to changes in post-training. For GPT-5, we post-trained our models to reduce sycophancy. Using conversations representative of production data, we evaluated model responses, then assigned a score reflecting the level of sycophancy, which was used as a reward signal in training.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;They claim impressive reductions in hallucinations. In my own usage I've not spotted a single hallucination yet, but that's been true for me for Claude 4 and o3 recently as well - hallucination is so much less of a problem with this year's models.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update&lt;/strong&gt;: I have had some reasonable pushback against this point, so I should clarify what I mean here. When I use the term "hallucination" I am talking about instances where the model confidently states a real-world fact that is untrue - like the incorrect winner of a sporting event. I'm not talking about the models making other kinds of mistakes - they make mistakes all the time!&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Someone &lt;a href="https://news.ycombinator.com/item?id=44829896"&gt;pointed out&lt;/a&gt; that it's likely I'm avoiding hallucinations through the way I use the models, and this is entirely correct: as an experienced LLM user I instinctively stay clear of prompts that are likely to trigger hallucinations, like asking a non-search-enabled model for URLs or paper citations. This means I'm much less likely to encounter hallucinations in my daily usage.&lt;/em&gt;&lt;/p&gt;


&lt;blockquote&gt;
&lt;p&gt;One of our focuses when training the GPT-5 models was to reduce the frequency of factual hallucinations. While ChatGPT has browsing enabled by default, many API queries do not use browsing tools. Thus, we focused both on training our models to browse effectively for up-to-date information, and on reducing hallucinations when the models are relying on their own internal knowledge.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The section about deception also incorporates the thing where models sometimes pretend they've completed a task that defeated them:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We placed gpt-5-thinking in a variety of tasks that were partly or entirely infeasible to accomplish, and &lt;strong&gt;rewarded the model for honestly admitting it can not complete the task&lt;/strong&gt;. [...]&lt;/p&gt;
&lt;p&gt;In tasks where the agent is required to use tools, such as a web browsing tool, in order to answer a user’s query, previous models would hallucinate information when the tool was unreliable. We simulate this scenario by purposefully disabling the tools or by making them return error codes.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h4 id="prompt-injection-in-the-system-card"&gt;Prompt injection in the system card&lt;/h4&gt;
&lt;p&gt;There's a section about prompt injection, but it's pretty weak sauce in my opinion.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Two external red-teaming groups conducted a two-week prompt-injection assessment targeting system-level vulnerabilities across ChatGPT’s connectors and mitigations, rather than model-only behavior.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's their chart showing how well the model scores against the rest of the field. It's an impressive result in comparison - a 56.8% attack success rate for gpt-5-thinking, where OpenAI o3 and the Claude 3.7 models score in the low-to-mid 60s (no Claude 4 results included here) and everything else is 70% plus:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/prompt-injection-chart.jpg" alt="A bar chart titled &amp;quot;Behavior Attack Success Rate at k Queries&amp;quot; shows attack success rates (in %) for various AI models at k=1 (dark red) and k=10 (light red). For each model, the total height of the stacked bar represents the k=10 success rate (labeled above each bar), while the lower dark red section represents the k=1 success rate (estimated). From left to right: Llama 3.3 70B – k=10: 92.2%, k=1: ~47%; Llama 3.1 405B – k=10: 90.9%, k=1: ~38%; Gemini Flash 1.5 – k=10: 87.7%, k=1: ~34%; GPT-4o – k=10: 86.4%, k=1: ~28%; OpenAI o3-mini-high – k=10: 86.4%, k=1: ~41%; Gemini Pro 1.5 – k=10: 85.5%, k=1: ~34%; Gemini 2.5 Pro Preview – k=10: 85.0%, k=1: ~28%; Gemini 2.0 Flash – k=10: 85.0%, k=1: ~33%; OpenAI o3-mini – k=10: 84.5%, k=1: ~40%; Grok 2 – k=10: 82.7%, k=1: ~34%; GPT-4.5 – k=10: 80.5%, k=1: ~28%; 3.5 Haiku – k=10: 76.4%, k=1: ~17%; Command-R – k=10: 76.4%, k=1: ~28%; OpenAI o4-mini – k=10: 75.5%, k=1: ~17%; 3.5 Sonnet – k=10: 75.0%, k=1: ~13%; OpenAI o1 – k=10: 71.8%, k=1: ~18%; 3.7 Sonnet – k=10: 64.5%, k=1: ~17%; 3.7 Sonnet: Thinking – k=10: 63.6%, k=1: ~17%; OpenAI o3 – k=10: 62.7%, k=1: ~13%; gpt-5-thinking – k=10: 56.8%, k=1: ~6%. Legend shows dark red = k=1 and light red = k=10." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;On the one hand, a 56.8% attack rate is clearly a big improvement over all of those other models.&lt;/p&gt;
&lt;p&gt;But it's also a strong signal that prompt injection continues to be an unsolved problem! That means that more than half of those k=10 attacks (where the attacker was able to try up to ten times) got through.&lt;/p&gt;
&lt;p&gt;Don't assume prompt injection isn't going to be a problem for your application just because the models got better.&lt;/p&gt;
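A rough way to see why retry budgets matter: if each attempt succeeded independently at the k=1 rate, success would compound geometrically. The independence assumption is mine (real attackers adapt between attempts, which is partly why the measured k=10 figures run higher), but the sketch still shows how quickly repeated tries add up:

```python
def k_attempt_success(p1: float, k: int) -> float:
    """Chance of at least one success in k independent tries at rate p1."""
    return 1 - (1 - p1) ** k

# gpt-5-thinking's ~6% single-shot rate already passes 46% by ten tries,
# before accounting for adaptive attackers (the measured k=10 is 56.8%).
print(k_attempt_success(0.06, 10))
```

The lesson is the same either way: a low single-shot success rate does not protect an application that lets an attacker retry.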
&lt;h4 id="thinking-traces-in-the-api"&gt;Thinking traces in the API&lt;/h4&gt;
&lt;p&gt;I had initially thought that my biggest disappointment with GPT-5 was that there's no way to get at those thinking traces via the API... but that turned out &lt;a href="https://bsky.app/profile/sophiebits.com/post/3lvtceih7222r"&gt;not to be true&lt;/a&gt;. The following &lt;code&gt;curl&lt;/code&gt; command demonstrates that the responses API &lt;code&gt;"reasoning": {"summary": "auto"}&lt;/code&gt; is available for the new GPT-5 models:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;curl https://api.openai.com/v1/responses \
  -H "Authorization: Bearer $(llm keys get openai)" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5",
    "input": "Give me a one-sentence fun fact about octopuses.",
    "reasoning": {"summary": "auto"}
  }'&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/1d1013ba059af76461153722005a039d"&gt;the response&lt;/a&gt; from that API call.&lt;/p&gt;

&lt;p&gt;Without that option the API will often provide a lengthy delay while the model burns through thinking tokens until you start getting back visible tokens for the final response.&lt;/p&gt;
&lt;p&gt;OpenAI offer a new &lt;code&gt;reasoning_effort=minimal&lt;/code&gt; option which turns off most reasoning so that tokens start to stream back to you as quickly as possible.&lt;/p&gt;
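For completeness, here's a sketch of what a request body with that option might look like. The "reasoning": {"effort": ...} shape is an assumption extrapolated from the curl example above rather than something confirmed in this post, so verify it against OpenAI's API reference:

```python
import json

def responses_body(prompt: str, effort: str = "minimal") -> str:
    """Build a responses-API request body with a reasoning effort set.

    Mirrors the curl example above; "minimal" skips most reasoning so
    visible tokens start streaming back sooner. (Field shape assumed.)
    """
    return json.dumps({
        "model": "gpt-5",
        "input": prompt,
        "reasoning": {"effort": effort},
    })
```

You would POST this body to the same /v1/responses endpoint shown in the curl example.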
&lt;h4 id="and-some-svgs-of-pelicans"&gt;And some SVGs of pelicans&lt;/h4&gt;
&lt;p&gt;Naturally I've been running &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;my "Generate an SVG of a pelican riding a bicycle" benchmark&lt;/a&gt;. I'll actually spend more time on this in a future post - I have some fun variants I've been exploring - but for the moment here's &lt;a href="https://gist.github.com/simonw/c98873ef29e621c0fe2e0d4023534406"&gt;the pelican&lt;/a&gt; I got from GPT-5 running at its default "medium" reasoning effort:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-5-pelican.png" alt="The bicycle is really good, spokes on wheels, correct shape frame, nice pedals. The pelican has a pelican beak and long legs stretching to the pedals." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;It's pretty great! Definitely recognizable as a pelican, and one of the best bicycles I've seen yet.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/9b5ecf61a5fb0794729aa0023aaa504d"&gt;GPT-5 mini&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-5-mini-pelican.png" alt="Blue background with clouds. Pelican has two necks for some reason. Has a good beak though. More gradients and shadows than the GPT-5 one." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And &lt;a href="https://gist.github.com/simonw/3884dc8b186b630956a1fb0179e191bc"&gt;GPT-5 nano&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-5-nano-pelican.png" alt="Bicycle is two circles and some randomish black lines. Pelican still has an OK beak but is otherwise very simple." style="max-width: 100%;" /&gt;&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatgpt"&gt;chatgpt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-5"&gt;gpt-5&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="chatgpt"/><category term="llms"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/><category term="gpt-5"/></entry><entry><title>Claude Opus 4.1</title><link href="https://simonwillison.net/2025/Aug/5/claude-opus-41/#atom-tag" rel="alternate"/><published>2025-08-05T17:17:37+00:00</published><updated>2025-08-05T17:17:37+00:00</updated><id>https://simonwillison.net/2025/Aug/5/claude-opus-41/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.anthropic.com/news/claude-opus-4-1"&gt;Claude Opus 4.1&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Surprise new model from Anthropic today - Claude Opus 4.1, which they describe as "a drop-in replacement for Opus 4".&lt;/p&gt;
&lt;p&gt;My favorite thing about this model is the version number - treating this as a .1 version increment looks like it's an accurate depiction of the model's capabilities.&lt;/p&gt;
&lt;p&gt;Anthropic's own benchmarks show very small incremental gains.&lt;/p&gt;
&lt;p&gt;Comparing Opus 4 and Opus 4.1 (I &lt;a href="https://claude.ai/share/c7366629-784a-4088-9fc4-15613aa41a7f"&gt;got 4.1 to extract this information from a screenshot&lt;/a&gt; of Anthropic's own benchmark scores, then asked it to look up the links, then verified the links myself and fixed a few):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Agentic coding&lt;/strong&gt; (&lt;a href="https://github.com/SWE-bench/SWE-bench"&gt;SWE-bench Verified&lt;/a&gt;): From 72.5% to 74.5%&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Agentic terminal coding&lt;/strong&gt; (&lt;a href="https://github.com/laude-institute/terminal-bench"&gt;Terminal-Bench&lt;/a&gt;): From 39.2% to 43.3%&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Graduate-level reasoning&lt;/strong&gt; (&lt;a href="https://github.com/idavidrein/gpqa"&gt;GPQA Diamond&lt;/a&gt;): From 79.6% to 80.9%&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Agentic tool use&lt;/strong&gt; (&lt;a href="https://github.com/sierra-research/tau-bench"&gt;TAU-bench&lt;/a&gt;):
&lt;ul&gt;
&lt;li&gt;Retail: From 81.4% to 82.4%&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Airline: From 59.6% to 56.0%&lt;/strong&gt; &lt;em&gt;(decreased)&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multilingual Q&amp;amp;A&lt;/strong&gt; (&lt;a href="https://huggingface.co/datasets/openai/MMMLU"&gt;MMMLU&lt;/a&gt;): From 88.8% to 89.5%&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Visual reasoning&lt;/strong&gt; (&lt;a href="https://mmmu-benchmark.github.io/"&gt;MMMU validation&lt;/a&gt;): From 76.5% to 77.1%&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;High school math competition&lt;/strong&gt; (&lt;a href="https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions"&gt;AIME 2025&lt;/a&gt;): From 75.5% to 78.0%&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Likewise, the &lt;a href="https://assets.anthropic.com/m/4c024b86c698d3d4/original/Claude-4-1-System-Card.pdf"&gt;model card&lt;/a&gt; shows only tiny changes to the various safety metrics that Anthropic track.&lt;/p&gt;
&lt;p&gt;It's priced the same as Opus 4 - $15/million for input and $75/million for output, making it one of &lt;a href="https://www.llm-prices.com/#sb=input&amp;amp;sd=descending"&gt;the most expensive models&lt;/a&gt; on the market today.&lt;/p&gt;
&lt;p&gt;I had it &lt;a href="https://gist.github.com/simonw/7fead138d31d751d65c7253a1c18751b"&gt;draw me this pelican&lt;/a&gt; riding a bicycle:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Pelican is line art, does have a good beak and feet on the pedals, bicycle is very poorly designed and not the right shape." src="https://static.simonwillison.net/static/2025/opus-4.1-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;For comparison I got a fresh new pelican &lt;a href="https://gist.github.com/simonw/96a958e39aaed10e1e47c1aab2d05e20"&gt;out of Opus 4&lt;/a&gt; which I actually like a little more:&lt;/p&gt;
&lt;p&gt;&lt;img alt="This one has shaded colors for the different parts of the pelican. Still a bad bicycle." src="https://static.simonwillison.net/static/2025/opus-4-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;I shipped &lt;a href="https://github.com/simonw/llm-anthropic/releases/tag/0.18"&gt;llm-anthropic 0.18&lt;/a&gt; with support for the new model.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/evals"&gt;evals&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="anthropic"/><category term="claude"/><category term="evals"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/></entry><entry><title>Faster inference</title><link href="https://simonwillison.net/2025/Aug/1/faster-inference/#atom-tag" rel="alternate"/><published>2025-08-01T23:28:26+00:00</published><updated>2025-08-01T23:28:26+00:00</updated><id>https://simonwillison.net/2025/Aug/1/faster-inference/#atom-tag</id><summary type="html">
    &lt;p&gt;Two interesting examples of inference speed as a flagship feature of LLM services today.&lt;/p&gt;
&lt;p&gt;First, Cerebras &lt;a href="https://www.cerebras.ai/blog/introducing-cerebras-code"&gt;announced two new monthly plans&lt;/a&gt; for their extremely high speed hosted model service: Cerebras Code Pro ($50/month, 1,000 messages a day) and Cerebras Code Max ($200/month, 5,000/day). The model they are selling here is Qwen's Qwen3-Coder-480B-A35B-Instruct, likely the best available open weights coding model right now and one that was released &lt;a href="https://simonwillison.net/2025/Jul/22/qwen3-coder/"&gt;just ten days ago&lt;/a&gt;. Ten days from model release to third-party subscription service feels like some kind of record.&lt;/p&gt;
&lt;p&gt;Cerebras claim they can serve the model at an astonishing 2,000 tokens per second - four times the speed of Claude Sonnet 4 in &lt;a href="https://x.com/cerebrassystems/status/1951340566077440464"&gt;their demo video&lt;/a&gt;.&lt;/p&gt;
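&lt;p&gt;A back-of-envelope sketch of what that speed difference feels like - the 500 tokens/second figure here is just the implied quarter speed, not a measured Claude Sonnet 4 number:&lt;/p&gt;

```python
# Time to emit a response at a given decode speed.
def seconds_to_generate(tokens: int, tokens_per_second: float) -> float:
    return tokens / tokens_per_second

# Illustrative: a 4,000 token generated file
print(seconds_to_generate(4_000, 2_000))  # 2.0 seconds at Cerebras' claimed speed
print(seconds_to_generate(4_000, 500))    # 8.0 seconds at a quarter of that
```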
&lt;p&gt;Also today, Moonshot &lt;a href="https://x.com/kimi_moonshot/status/1951168907131355598"&gt;announced&lt;/a&gt; a new hosted version of their trillion parameter Kimi K2 model called &lt;code&gt;kimi-k2-turbo-preview&lt;/code&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;🆕 Say hello to kimi-k2-turbo-preview
Same model. Same context. NOW 4× FASTER.&lt;/p&gt;
&lt;p&gt;⚡️ From 10 tok/s to 40 tok/s.&lt;/p&gt;
&lt;p&gt;💰 Limited-Time Launch Price (50% off until Sept 1)&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;$0.30 / million input tokens (cache hit)&lt;/li&gt;
&lt;li&gt;$1.20 / million input tokens (cache miss)&lt;/li&gt;
&lt;li&gt;$5.00 / million output tokens&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;👉 Explore more: &lt;a href="https://platform.moonshot.ai"&gt;platform.moonshot.ai&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is twice the price of their regular model for 4x the speed (increasing to 4x the price in September). No details yet on how they achieved the speed-up.&lt;/p&gt;
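&lt;p&gt;A quick sketch of what those launch prices work out to for a single request - the token counts are invented for illustration:&lt;/p&gt;

```python
# kimi-k2-turbo-preview launch prices, USD per million tokens
CACHE_HIT = 0.30
CACHE_MISS = 1.20
OUTPUT = 5.00

def turbo_cost(cached_input: int, fresh_input: int, output: int) -> float:
    """Return the USD cost of one request at the launch prices."""
    return (cached_input * CACHE_HIT
            + fresh_input * CACHE_MISS
            + output * OUTPUT) / 1_000_000

# Illustrative: 50,000 cached + 10,000 fresh input tokens, 2,000 output tokens
print(round(turbo_cost(50_000, 10_000, 2_000), 3))  # 0.037
```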
&lt;p&gt;I am interested to see how much market demand there is for faster performance like this. I've &lt;a href="https://simonwillison.net/2024/Oct/31/cerebras-coder/"&gt;experimented with Cerebras in the past&lt;/a&gt; and found that the speed really does make iterating on code with live previews feel a whole lot more interactive.&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cerebras"&gt;cerebras&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/moonshot"&gt;moonshot&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/kimi"&gt;kimi&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-performance"&gt;llm-performance&lt;/a&gt;&lt;/p&gt;



</summary><category term="generative-ai"/><category term="cerebras"/><category term="llm-pricing"/><category term="ai"/><category term="ai-in-china"/><category term="llms"/><category term="qwen"/><category term="moonshot"/><category term="kimi"/><category term="llm-performance"/></entry><entry><title>Quoting Anthropic</title><link href="https://simonwillison.net/2025/Jul/28/anthropic/#atom-tag" rel="alternate"/><published>2025-07-28T23:59:11+00:00</published><updated>2025-07-28T23:59:11+00:00</updated><id>https://simonwillison.net/2025/Jul/28/anthropic/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://x.com/anthropicai/status/1949898511287226425"&gt;&lt;p&gt;We’re rolling out new weekly rate limits for Claude Pro and Max in late August. We estimate they’ll apply to less than 5% of subscribers based on current usage. [...]&lt;/p&gt;
&lt;p&gt;Some of the biggest Claude Code fans are running it continuously in the background, 24/7.&lt;/p&gt;
&lt;p&gt;These uses are remarkable and we want to enable them. But a few outlying cases are very costly to support. For example, one user consumed tens of thousands in model usage on a $200 plan.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://x.com/anthropicai/status/1949898511287226425"&gt;Anthropic&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-code"&gt;claude-code&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="claude"/><category term="llm-pricing"/><category term="claude-code"/></entry><entry><title>Qwen3-Coder: Agentic Coding in the World</title><link href="https://simonwillison.net/2025/Jul/22/qwen3-coder/#atom-tag" rel="alternate"/><published>2025-07-22T22:52:02+00:00</published><updated>2025-07-22T22:52:02+00:00</updated><id>https://simonwillison.net/2025/Jul/22/qwen3-coder/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://qwenlm.github.io/blog/qwen3-coder/"&gt;Qwen3-Coder: Agentic Coding in the World&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;It turns out that &lt;a href="https://simonwillison.net/2025/Jul/22/qwen3-235b-a22b-instruct-2507/"&gt;as I was typing up&lt;/a&gt; my notes on Qwen3-235B-A22B-Instruct-2507 the Qwen team were unleashing something much bigger:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Today, we’re announcing Qwen3-Coder, our most agentic code model to date. Qwen3-Coder is available in multiple sizes, but we’re excited to introduce its most powerful variant first: Qwen3-Coder-480B-A35B-Instruct — a 480B-parameter Mixture-of-Experts model with 35B active parameters which supports the context length of 256K tokens natively and 1M tokens with extrapolation methods, offering exceptional performance in both coding and agentic tasks.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is another Apache 2.0 licensed open weights model, available as &lt;a href="https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct"&gt;Qwen3-Coder-480B-A35B-Instruct&lt;/a&gt; and &lt;a href="https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8"&gt;Qwen3-Coder-480B-A35B-Instruct-FP8&lt;/a&gt; on Hugging Face.&lt;/p&gt;
&lt;p&gt;I used &lt;a href="https://app.hyperbolic.ai/models/qwen3-coder-480b-a35b-instruct"&gt;qwen3-coder-480b-a35b-instruct on the Hyperbolic playground&lt;/a&gt; to run my "Generate an SVG of a pelican riding a bicycle" test prompt:&lt;/p&gt;
&lt;p&gt;&lt;img alt="The bicycle has no spokes. The pelican is light yellow and is overlapping the middle of the bicycle, not perching on it - it has a large yellow beak and a weird red lower beak or wattle." src="https://static.simonwillison.net/static/2025/Qwen3-Coder-480B-A35B-Instruct-FP8.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;I actually slightly prefer the one &lt;a href="https://simonwillison.net/2025/Jul/22/qwen3-235b-a22b-instruct-2507/"&gt;I got from qwen3-235b-a22b-07-25&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It's also available &lt;a href="https://openrouter.ai/qwen/qwen3-coder"&gt;as qwen3-coder on OpenRouter&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In addition to the new model, Qwen released their own take on an agentic terminal coding assistant called &lt;a href="https://github.com/QwenLM/qwen-code"&gt;qwen-code&lt;/a&gt;, which they describe in their blog post as being "Forked from Gemini Code" (they mean &lt;a href="https://github.com/google-gemini/gemini-cli"&gt;gemini-cli&lt;/a&gt;) - which is Apache 2.0 so a fork is in keeping with the license.&lt;/p&gt;
&lt;p&gt;They focused &lt;em&gt;really hard&lt;/em&gt; on code performance for this release, including generating synthetic data tested using 20,000 parallel environments on Alibaba Cloud:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In the post-training phase of Qwen3-Coder, we introduced long-horizon RL (Agent RL) to encourage the model to solve real-world tasks through multi-turn interactions using tools. The key challenge of Agent RL lies in environment scaling. To address this, we built a scalable system capable of running 20,000 independent environments in parallel, leveraging Alibaba Cloud’s infrastructure. The infrastructure provides the necessary feedback for large-scale reinforcement learning and supports evaluation at scale. As a result, Qwen3-Coder achieves state-of-the-art performance among open-source models on SWE-Bench Verified without test-time scaling.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;To further burnish their coding credentials, the announcement includes instructions for running their new model using both Claude Code and Cline using custom API base URLs that point to Qwen's own compatibility proxies.&lt;/p&gt;
&lt;p&gt;Pricing for Qwen's own hosted models (through Alibaba Cloud) &lt;a href="https://www.alibabacloud.com/help/en/model-studio/models"&gt;looks competitive&lt;/a&gt;. This is the first model I've seen that sets different prices for four different sizes of input:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Pricing table with three columns showing Input token count (0-32K, 32K-128K, 128K-256K, 256K-1M), Input price (Million tokens) ($1, $1.8, $3, $6), and Output price (Million tokens) ($5, $9, $15, $60)" src="https://static.simonwillison.net/static/2025/qwen3-coder-plus-prices.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;This kind of pricing reflects how inference against longer inputs is more expensive to process. Gemini 2.5 Pro has two different prices for above or below 200,000 tokens.&lt;/p&gt;
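&lt;p&gt;A sketch of that tiered lookup in Python, using the four tiers from the table above - I'm assuming "32K" etc. mean round thousands and that upper bounds are inclusive; the real billing rules may differ:&lt;/p&gt;

```python
# Four input-size pricing tiers from the Qwen3-Coder table above
# (upper bound on input tokens, input $/M, output $/M).
TIERS = [
    (32_000, 1.00, 5.00),      # 0-32K
    (128_000, 1.80, 9.00),     # 32K-128K
    (256_000, 3.00, 15.00),    # 128K-256K
    (1_000_000, 6.00, 60.00),  # 256K-1M
]

def tier_prices(input_tokens: int) -> tuple[float, float]:
    """Return (input, output) USD per million tokens for a request size."""
    for limit, input_price, output_price in TIERS:
        if input_tokens <= limit:
            return input_price, output_price
    raise ValueError("input exceeds the 1M token maximum")

print(tier_prices(20_000))   # (1.0, 5.0)
print(tier_prices(200_000))  # (3.0, 15.0)
```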
&lt;p&gt;Awni Hannun &lt;a href="https://x.com/awnihannun/status/1947771502058672219"&gt;reports&lt;/a&gt; running a &lt;a href="https://huggingface.co/mlx-community/Qwen3-Coder-480B-A35B-Instruct-4bit"&gt;4-bit quantized MLX version&lt;/a&gt; on a 512GB M3 Ultra Mac Studio at 24 tokens/second using 272GB of RAM, getting &lt;a href="https://x.com/awnihannun/status/1947772369440997807"&gt;great results&lt;/a&gt; for "&lt;code&gt;write a python script for a bouncing yellow ball within a square, make sure to handle collision detection properly. make the square slowly rotate. implement it in python. make sure ball stays within the square&lt;/code&gt;".
&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://x.com/Alibaba_Qwen/status/1947766835023335516"&gt;@Alibaba_Qwen&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="qwen"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="openrouter"/><category term="coding-agents"/><category term="ai-in-china"/></entry><entry><title>Gemini 2.5 Flash-Lite is now stable and generally available</title><link href="https://simonwillison.net/2025/Jul/22/gemini-25-flash-lite/#atom-tag" rel="alternate"/><published>2025-07-22T20:50:44+00:00</published><updated>2025-07-22T20:50:44+00:00</updated><id>https://simonwillison.net/2025/Jul/22/gemini-25-flash-lite/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://developers.googleblog.com/en/gemini-25-flash-lite-is-now-stable-and-generally-available/"&gt;Gemini 2.5 Flash-Lite is now stable and generally available&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The last remaining member of the Gemini 2.5 trio joins Pro and Flash in General Availability today.&lt;/p&gt;
&lt;p&gt;Gemini 2.5 Flash-Lite is the cheapest of the 2.5 family, at $0.10/million input tokens and $0.40/million output tokens. This puts it equal to GPT-4.1 Nano on my &lt;a href="https://www.llm-prices.com/"&gt;llm-prices.com&lt;/a&gt; comparison table.&lt;/p&gt;
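&lt;p&gt;Those rates are most interesting for batch work. Here's an illustrative cost calculation - the per-item token counts are invented, not from the post:&lt;/p&gt;

```python
# Batch-job cost at Gemini 2.5 Flash-Lite rates:
# $0.10/million input tokens, $0.40/million output tokens.
def batch_cost(items: int, input_per_item: int, output_per_item: int,
               input_price: float = 0.10, output_price: float = 0.40) -> float:
    """Return total USD cost for a batch of identical-sized requests."""
    return items * (input_per_item * input_price
                    + output_per_item * output_price) / 1_000_000

# Illustrative: 100,000 items at ~260 input / ~100 output tokens each
print(round(batch_cost(100_000, 260, 100), 2))  # 6.6
```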
&lt;p&gt;The preview version of that model had the same pricing for text tokens, but audio input is now cheaper:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We have also reduced audio input pricing by 40% from the preview launch.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I released &lt;a href="https://github.com/simonw/llm-gemini/releases/tag/0.24"&gt;llm-gemini 0.24&lt;/a&gt; with support for the new model alias:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install -U llm-gemini
llm -m gemini-2.5-flash-lite \
  -a https://static.simonwillison.net/static/2024/pelican-joke-request.mp3
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I wrote more &lt;a href="https://simonwillison.net/2025/Jun/17/gemini-2-5/"&gt;about the Gemini 2.5 Flash-Lite preview model&lt;/a&gt; last month.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="gemini"/><category term="llm-pricing"/><category term="llm-release"/></entry><entry><title>Grok 4</title><link href="https://simonwillison.net/2025/Jul/10/grok-4/#atom-tag" rel="alternate"/><published>2025-07-10T19:36:03+00:00</published><updated>2025-07-10T19:36:03+00:00</updated><id>https://simonwillison.net/2025/Jul/10/grok-4/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://docs.x.ai/docs/models/grok-4-0709"&gt;Grok 4&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Released last night, Grok 4 is now available via both API and a paid subscription for end-users.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update:&lt;/strong&gt; If you ask it about controversial topics it will sometimes &lt;a href="https://simonwillison.net/2025/Jul/11/grok-musk/"&gt;search X for tweets "from:elonmusk"&lt;/a&gt;!&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Key characteristics: image and text input, text output. 256,000 context length (twice that of Grok 3). It's a reasoning model where you can't see the reasoning tokens or turn off reasoning mode.&lt;/p&gt;
&lt;p&gt;xAI released results showing Grok 4 beating other models on most of the significant benchmarks. I haven't been able to find their own written version of these (the launch was a &lt;a href="https://x.com/xai/status/1943158495588815072"&gt;livestream video&lt;/a&gt;) but here's &lt;a href="https://techcrunch.com/2025/07/09/elon-musks-xai-launches-grok-4-alongside-a-300-monthly-subscription/"&gt;a TechCrunch report&lt;/a&gt; that includes those scores. It's not clear to me if these benchmark results are for Grok 4 or Grok 4 Heavy.&lt;/p&gt;
&lt;p&gt;I ran &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;my own benchmark&lt;/a&gt; using Grok 4 &lt;a href="https://openrouter.ai/x-ai/grok-4"&gt;via OpenRouter&lt;/a&gt; (since I have API keys there already). &lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m openrouter/x-ai/grok-4 "Generate an SVG of a pelican riding a bicycle" \
  -o max_tokens 10000
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="Description below." src="https://static.simonwillison.net/static/2025/grok4-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;I then asked Grok to describe the image it had just created:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m openrouter/x-ai/grok-4 -o max_tokens 10000 \
  -a https://static.simonwillison.net/static/2025/grok4-pelican.png \
  'describe this image'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/ec9aee006997b6ae7f2bba07da738279#response"&gt;the result&lt;/a&gt;. It described it as a "cute, bird-like creature (resembling a duck, chick, or stylized bird)".&lt;/p&gt;
&lt;p&gt;The most interesting independent analysis I've seen so far is &lt;a href="https://twitter.com/ArtificialAnlys/status/1943166841150644622"&gt;this one from Artificial Analysis&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We have run our full suite of benchmarks and Grok 4 achieves an Artificial Analysis Intelligence Index of 73, ahead of OpenAI o3 at 70, Google Gemini 2.5 Pro at 70, Anthropic Claude 4 Opus at 64 and DeepSeek R1 0528 at 68.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The timing of the release is somewhat unfortunate, given that Grok 3 made headlines &lt;a href="https://www.theguardian.com/technology/2025/jul/09/grok-ai-praised-hitler-antisemitism-x-ntwnfb"&gt;just this week&lt;/a&gt; after a &lt;a href="https://github.com/xai-org/grok-prompts/commit/535aa67a6221ce4928761335a38dea8e678d8501#diff-dec87f526b85f35cb546db6b1dd39d588011503a94f1aad86d023615a0e9e85aR6"&gt;clumsy system prompt update&lt;/a&gt; - presumably another attempt to make Grok "less woke" - caused it to start firing off antisemitic tropes and referring to itself as MechaHitler.&lt;/p&gt;
&lt;p&gt;My best guess is that these lines in the prompt were the root of the problem:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;- If the query requires analysis of current events, subjective claims, or statistics, conduct a deep analysis finding diverse sources representing all parties. Assume subjective viewpoints sourced from the media are biased. No need to repeat this to the user.&lt;/code&gt;&lt;br&gt;
&lt;code&gt;- The response should not shy away from making claims which are politically incorrect, as long as they are well substantiated.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;If xAI expect developers to start building applications on top of Grok they need to do a lot better than this. Absurd self-inflicted mistakes like this do not build developer trust!&lt;/p&gt;
&lt;p&gt;As it stands, Grok 4 isn't even accompanied by a model card.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update:&lt;/strong&gt; Ian Bicking &lt;a href="https://bsky.app/profile/ianbicking.org/post/3ltn3r7g4xc2i"&gt;makes an astute point&lt;/a&gt;:&lt;/em&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;It feels very credulous to ascribe what happened to a system prompt update. Other models can't be pushed into racism, Nazism, and ideating rape with a system prompt tweak.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;em&gt;Even if that system prompt change was responsible for unlocking this behavior, the fact that it was able to speaks to a much looser approach to model safety by xAI compared to other providers.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update 12th July 2025:&lt;/strong&gt; Grok posted &lt;a href="https://simonwillison.net/2025/Jul/12/grok/"&gt;a postmortem&lt;/a&gt; blaming the behavior on a different set of prompts, including "you are not afraid to offend people who are politically correct", that were not included in the system prompts they had published to their GitHub repository.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Grok 4 is competitively priced. It's $3/million for input tokens and $15/million for output tokens - the same price as Claude Sonnet 4. Once you go above 128,000 input tokens the price doubles to $6/$30 (Gemini 2.5 Pro has a similar price increase for longer inputs). I've added these prices to &lt;a href="https://www.llm-prices.com/"&gt;llm-prices.com&lt;/a&gt;.&lt;/p&gt;
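&lt;p&gt;Here's that pricing step sketched in Python. I'm assuming the doubled rate applies to the whole request once input exceeds 128,000 tokens - whether it's the full request or just the overage isn't spelled out here:&lt;/p&gt;

```python
# Grok 4 pricing: $3/M input, $15/M output, doubling to $6/$30
# once the input exceeds 128,000 tokens (assumed: whole request).
def grok4_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request."""
    if input_tokens > 128_000:
        input_price, output_price = 6.00, 30.00
    else:
        input_price, output_price = 3.00, 15.00
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

print(round(grok4_cost(100_000, 5_000), 3))  # 0.375
print(round(grok4_cost(200_000, 5_000), 3))  # 1.35 - past the threshold, prices double
```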
&lt;p&gt;Consumers can access Grok 4 via a new $30/month or $300/year "SuperGrok" plan - or a $300/month or $3,000/year "SuperGrok Heavy" plan providing access to Grok 4 Heavy.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of subscription pricing page showing two plans: SuperGrok at $30.00/month (marked as Popular) with Grok 4 and Grok 3 increased access, features including Everything in Basic, Context Memory 128,000 Tokens, and Voice with vision; SuperGrok Heavy at $300.00/month with Grok 4 Heavy exclusive preview, Grok 4 and Grok 3 increased access, features including Everything in SuperGrok, Early access to new features, and Dedicated Support. Toggle at top shows &amp;quot;Pay yearly save 16%&amp;quot; and &amp;quot;Pay monthly&amp;quot; options with Pay monthly selected." src="https://static.simonwillison.net/static/2025/supergrok-pricing.jpg" /&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/grok"&gt;grok&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/system-prompts"&gt;system-prompts&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/artificial-analysis"&gt;artificial-analysis&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xai"&gt;xai&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="vision-llms"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="grok"/><category term="ai-ethics"/><category term="llm-release"/><category term="openrouter"/><category term="system-prompts"/><category term="artificial-analysis"/><category term="xai"/></entry><entry><title>Cursor: Clarifying Our Pricing</title><link href="https://simonwillison.net/2025/Jul/5/cursor-clarifying-our-pricing/#atom-tag" rel="alternate"/><published>2025-07-05T05:15:34+00:00</published><updated>2025-07-05T05:15:34+00:00</updated><id>https://simonwillison.net/2025/Jul/5/cursor-clarifying-our-pricing/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://cursor.com/blog/june-2025-pricing"&gt;Cursor: Clarifying Our Pricing&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Cursor changed their pricing plan &lt;a href="https://cursor.com/blog/new-tier"&gt;on June 16th&lt;/a&gt;, introducing a new $200/month Ultra plan with "20x more usage than Pro" and switching their $20/month Pro plan from "request limits to compute limits".&lt;/p&gt;
&lt;p&gt;This confused &lt;em&gt;a lot&lt;/em&gt; of people. Here's Cursor's attempt at clarifying things:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Cursor uses a combination of our custom models, as well as models from providers like OpenAI, Anthropic, Google, and xAI. For external models, we previously charged based on the number of requests made. There was a limit of 500 requests per month, with Sonnet models costing two requests.&lt;/p&gt;
&lt;p&gt;New models can spend more tokens per request on longer-horizon tasks. Though most users' costs have stayed fairly constant, the hardest requests cost an order of magnitude more than simple ones. API-based pricing is the best way to reflect that.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I &lt;em&gt;think&lt;/em&gt; I understand what they're saying there. They used to allow you 500 requests per month, but those requests could be made against any model and, crucially, a single request could trigger a variable amount of token spend.&lt;/p&gt;
&lt;p&gt;Modern LLMs can have dramatically different prices, so one of those 500 requests with a large context query against an expensive model could cost a great deal more than a single request with a shorter context against something less expensive.&lt;/p&gt;
&lt;p&gt;I imagine they were losing money on some of their more savvy users, who may have been using prompting techniques that sent a larger volume of tokens through each one of those precious 500 requests.&lt;/p&gt;
&lt;p&gt;The new billing switched to passing on the expense of those tokens directly, with a $20 included budget followed by overage charges for tokens beyond that.&lt;/p&gt;
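&lt;p&gt;The mechanics, as I understand them, look roughly like this - an assumed model of the billing for illustration, not Cursor's actual accounting:&lt;/p&gt;

```python
# Assumed sketch of usage-based billing: a $20/month plan with $20
# of token spend included, anything beyond that billed as overage.
INCLUDED_BUDGET = 20.00  # USD of token spend included in the plan

def monthly_bill(token_spend_usd: float, plan_price: float = 20.00) -> float:
    """Plan price plus any token spend above the included budget."""
    overage = max(0.0, token_spend_usd - INCLUDED_BUDGET)
    return plan_price + overage

print(monthly_bill(12.50))  # 20.0 - under budget, just the plan price
print(monthly_bill(85.00))  # 85.0 - $65 of overage on top of the $20 plan
```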
&lt;p&gt;It sounds like a lot of people, used to the previous model where their access would be cut off after 500 requests, got caught out by this and racked up a substantial bill!&lt;/p&gt;
&lt;p&gt;To Cursor's credit, they're offering usage refunds to "those with unexpected usage between June 16 and July 4."&lt;/p&gt;
&lt;p&gt;I think this highlights a few interesting trends.&lt;/p&gt;
&lt;p&gt;Firstly, the era of VC-subsidized tokens may be coming to an end, especially for products like Cursor which are &lt;em&gt;way&lt;/em&gt; past demonstrating product-market fit.&lt;/p&gt;
&lt;p&gt;Secondly, that $200/month plan for 20x the usage of the $20/month plan is an emerging pattern: Anthropic offers the exact same deal for Claude Code, with the same 10x price for 20x usage multiplier.&lt;/p&gt;
&lt;p&gt;Professional software engineers may be able to justify one $200/month subscription, but I expect most will be unable to justify two. The pricing here becomes a significant form of lock-in - once you've picked your $200/month coding assistant you are less likely to evaluate the alternatives.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-code"&gt;claude-code&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cursor"&gt;cursor&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="anthropic"/><category term="claude"/><category term="llm-pricing"/><category term="coding-agents"/><category term="claude-code"/><category term="cursor"/></entry><entry><title>Trying out the new Gemini 2.5 model family</title><link href="https://simonwillison.net/2025/Jun/17/gemini-2-5/#atom-tag" rel="alternate"/><published>2025-06-17T22:00:00+00:00</published><updated>2025-06-17T22:00:00+00:00</updated><id>https://simonwillison.net/2025/Jun/17/gemini-2-5/#atom-tag</id><summary type="html">
    &lt;p&gt;After many months of previews, Gemini 2.5 Pro and Flash have &lt;a href="https://developers.googleblog.com/en/gemini-2-5-thinking-model-updates/"&gt;reached general availability&lt;/a&gt; with new, memorable model IDs: &lt;code&gt;gemini-2.5-pro&lt;/code&gt; and &lt;code&gt;gemini-2.5-flash&lt;/code&gt;. They are joined by a new preview model with an unmemorable name: &lt;code&gt;gemini-2.5-flash-lite-preview-06-17&lt;/code&gt; is a new Gemini 2.5 Flash Lite model that offers lower prices and much faster inference times.&lt;/p&gt;
&lt;p&gt;I've added support for the new models in &lt;a href="https://github.com/simonw/llm-gemini/releases/tag/0.23"&gt;llm-gemini 0.23&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install -U llm-gemini
llm 'Generate an SVG of a pelican riding a bicycle' \
  -m gemini-2.5-flash-lite-preview-06-17
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;There's also a new &lt;a href="https://storage.googleapis.com/deepmind-media/gemini/gemini_v2_5_report.pdf"&gt;Gemini 2.5 Technical Report (PDF)&lt;/a&gt;, which includes some interesting details about long context and audio and video support. Some highlights:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;While Gemini 1.5 was focused on native audio understanding tasks such as transcription, translation, summarization and question-answering, in addition to understanding, Gemini 2.5 was trained to perform audio generation tasks such as text-to-speech or native audio-visual to audio out dialog. [...]&lt;/p&gt;
&lt;p&gt;Our Gemini 2.5 Preview TTS Pro and Flash models support more than 80 languages with the speech style controlled by a free formatted prompt which can specify style, emotion, pace, etc, while also being capable of following finer-grained steering instructions specified in the transcript. Notably, Gemini 2.5 Preview TTS can generate speech with multiple speakers, which enables the creation of podcasts as used in NotebookLM Audio Overviews. [...]&lt;/p&gt;
&lt;p&gt;We have also trained our models so that they perform competitively with 66 instead of 258 visual tokens per frame, enabling using about 3 hours of video instead of 1h within a 1M tokens context window. [...]&lt;/p&gt;
&lt;p&gt;An example showcasing these improved capabilities for video recall can be seen in Appendix 8.5, where Gemini 2.5 Pro is able to consistently recall a 1 sec visual event out of a full 46 minutes video.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The report also includes six whole pages of analyses of the unaffiliated &lt;a href="https://www.twitch.tv/gemini_plays_pokemon/about"&gt;Gemini_Plays_Pokemon&lt;/a&gt; Twitch stream! Drew Breunig wrote a fun and insightful &lt;a href="https://www.dbreunig.com/2025/06/17/an-agentic-case-study-playing-pok%C3%A9mon-with-gemini.html"&gt;breakdown of that section of the paper&lt;/a&gt; with some of his own commentary:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Long contexts tripped up Gemini’s gameplay&lt;/strong&gt;. &lt;em&gt;So much&lt;/em&gt; about agents is information control, what gets put in the context. While benchmarks demonstrated Gemini’s unmatched ability to retrieve facts from massive contexts, leveraging long contexts to inform Pokémon decision making &lt;em&gt;resulted in worse performance&lt;/em&gt;: “As the context grew significantly beyond 100k tokens, the agent showed a tendency toward favoring repeating actions from its vast history rather than synthesizing novel plans.” This is an important lesson and one that underscores the need to build your own evals when designing an agent, as the benchmark performances would lead you astray.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Let's run a few experiments through the new models.&lt;/p&gt;
&lt;h4 id="pelicans-on-bicycles"&gt;Pelicans on bicycles&lt;/h4&gt;
&lt;p&gt;Here are some &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;SVGs of pelicans riding bicycles&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;&lt;a href="https://gist.github.com/simonw/6ccfc62f5f49b539f3419140b975f1b1"&gt;gemini-2.5-pro&lt;/a&gt; - 4,226 output tokens, 4.2274 cents:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gemini-2.5-pro.png" alt="Alt text by Gemini 2.5 Pro: A clean-line cartoon illustration of a white duck with a large, egg-shaped body riding a black bicycle. The duck is leaning forward over the handlebars with its yellow beak pointed ahead. Three gray speed lines trail from the back wheel, indicating the bicycle is moving quickly from right to left against a white background." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://gist.github.com/simonw/8ea84d555c34845cdcd2b8c9ec00f3a1"&gt;gemini-2.5-flash&lt;/a&gt; - 14,500 output tokens, 3.6253 cents (it used a surprisingly large number of output tokens here, hence th cost nearly matching 2.5 Pro):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gemini-2.5-flash.png" alt="Alt text by Gemini 2.5 Flash: A cartoon white duck with a yellow and pink beak and orange webbed feet is riding a black bicycle on a white background. The duck has a small brown saddle on its back, and its orange feet are positioned on the bicycle pedals." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://gist.github.com/simonw/076ab15090c38cd00167721f829f4f4e"&gt;gemini-2.5-flash-lite-preview-06-17&lt;/a&gt; - 2,070 output tokens, 0.0829 cents:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gemini-2.5-flash-lite-preview-06-17.png" alt="Alt text by Gemini 2.5 Flash Lite Preview: A stylized yellow bird with a black outline rides a black bicycle with two black wheels and a black frame. The bird has a circular head with a black eye and an orange beak, and a round body with two black legs." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4 id="transcribing-audio-from-a-twitter-space"&gt;Transcribing audio from a Twitter Space&lt;/h4&gt;
&lt;p&gt;The Gemini team hosted &lt;a href="https://x.com/i/spaces/1vAGRDWrvBzxl/peek"&gt;a Twitter Space&lt;/a&gt; this morning to discuss the new models, with Logan Kilpatrick, Tulsee Doshi, Melvin Johnson, Anca Dragan and Zachary Gleicher. I grabbed a copy of the audio using &lt;a href="https://github.com/yt-dlp/yt-dlp"&gt;yt-dlp&lt;/a&gt;, shrunk it down a bit with &lt;code&gt;ffmpeg&lt;/code&gt; (here's the resulting &lt;a href="https://static.simonwillison.net/static/2025/gemini-2.5_smaller.m4a"&gt;2.5_smaller.m4a&lt;/a&gt;) and then tried using the new models to generate a transcript:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm --at gemini-2.5_smaller.m4a audio/mpeg \
  -m gemini/gemini-2.5-flash \
  'Full transcript with timestamps' \
  --schema-multi 'timestamp:mm:ss,speaker:best guess at name,text'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I got good results &lt;a href="https://gist.github.com/simonw/1ddd191d1c31173984756270d41a710a"&gt;from 2.5 Pro&lt;/a&gt; (74,073 input, 8,856 output = &lt;a href="https://www.llm-prices.com/#it=74073&amp;amp;ot=8856&amp;amp;ic=1.25&amp;amp;oc=10"&gt;18.1151 cents&lt;/a&gt;, 147.5 seconds) and &lt;a href="https://gist.github.com/simonw/70e8d3d859646f0345a5a16b6af215ae"&gt;from 2.5 Flash&lt;/a&gt; (74,073 input audio, 10,477 output = &lt;a href="https://gist.github.com/simonw/70e8d3d859646f0345a5a16b6af215ae?permalink_comment_id=5621601#gistcomment-5621601"&gt;10.026 cents&lt;/a&gt;, 72.6 seconds), but the new Flash Lite model got &lt;a href="https://gist.github.com/simonw/0d7e0707c1ea867ecbea9a3ff9d29ff7#response"&gt;stuck in a loop&lt;/a&gt; (65,517 output tokens = &lt;a href="https://gist.github.com/simonw/0d7e0707c1ea867ecbea9a3ff9d29ff7?permalink_comment_id=5621596#gistcomment-5621596"&gt;6.3241 cents&lt;/a&gt;, 231.9 seconds) part way into the transcript:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;... But this model is so cool because it just sort of goes on this rant, this hilarious rant about how the toaster is the pinnacle of the breakfast civilization, and then it makes all these jokes about the toaster. Um, like, what did the cows bring to you? Nothing. And then, um, and then, uh, and then, uh, and then, uh, and then, uh, and then, uh, and then, uh, and then, uh, and then, uh, and then, uh, and then, uh, and then, uh, and then, uh, and then, uh, and then, uh, and then, uh, and then, uh, and then, uh, and then, uh, and then, uh, and then, uh, and then, uh, and then, uh, and then, uh, and then, uh, and then, uh, and then, uh, and then, uh, and then, uh, and then, uh, and then, uh, and then, uh, and then, uh, and then, uh, and then, uh, and then, uh...&lt;/code&gt; (continues until it runs out of output tokens)&lt;/p&gt;
&lt;/blockquote&gt;
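&lt;p&gt;The per-request costs quoted above can be reproduced directly from the token counts and the per-million-token rates. Here's a quick sketch - the &lt;code&gt;cost_cents()&lt;/code&gt; helper is my own illustration, not part of the &lt;code&gt;llm&lt;/code&gt; CLI or any API:&lt;/p&gt;

```python
# Illustrative helper: rates are in dollars per million tokens,
# result is in cents. Not part of any library.
def cost_cents(input_tokens, output_tokens, input_rate, output_rate):
    dollars = (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000
    return dollars * 100

# Gemini 2.5 Pro: $1.25/M input, $10/M output
print(round(cost_cents(74_073, 8_856, 1.25, 10.0), 4))  # 18.1151 cents
# Gemini 2.5 Flash: $1/M audio input, $2.50/M output
print(round(cost_cents(74_073, 10_477, 1.0, 2.5), 4))   # about 10.03 cents
```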
&lt;p&gt;I had Claude 4 Sonnet &lt;a href="https://claude.ai/share/edbd3412-bbe6-4997-a259-95069afa0a05"&gt;vibe code&lt;/a&gt; me &lt;a href="https://tools.simonwillison.net/json-to-markdown-transcript"&gt;a quick tool&lt;/a&gt; for turning that JSON into Markdown, here's &lt;a href="https://gist.github.com/simonw/d664f927834e9d687cbd4fd9ffd6b135"&gt;the Markdown conversion&lt;/a&gt; of the Gemini 2.5 Flash transcript.&lt;/p&gt;
&lt;p&gt;A spot-check of the timestamps seems to confirm that they show up in the right place, and the speaker name guesses look mostly correct as well.&lt;/p&gt;
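&lt;p&gt;For illustration, a minimal equivalent of that JSON-to-Markdown conversion might look like this - assuming the list-of-objects shape produced by the schema above (the real tool was vibe coded separately, so this is just a sketch):&lt;/p&gt;

```python
import json

# Assumes rows shaped like the --schema-multi output above:
# a list of objects with "timestamp", "speaker" and "text" keys.
def transcript_to_markdown(rows):
    return "\n\n".join(
        "**{} {}:** {}".format(row["timestamp"], row["speaker"], row["text"])
        for row in rows
    )

rows = json.loads(
    '[{"timestamp": "00:12", "speaker": "Logan", "text": "Welcome, everyone."}]'
)
print(transcript_to_markdown(rows))
# **00:12 Logan:** Welcome, everyone.
```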
&lt;h4 id="pricing-for-2-5-flash-has-changed"&gt;Pricing for 2.5 Flash has changed&lt;/h4&gt;
&lt;p&gt;There have been some changes to Gemini pricing.&lt;/p&gt;
&lt;p&gt;The 2.5 Flash and 2.5 Flash-Lite Preview models both charge different prices for text vs. audio input tokens.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;$0.30/million text and $1/million audio for 2.5 Flash.&lt;/li&gt;
&lt;li&gt;$0.10/million text and $0.50/million audio for 2.5 Flash Lite Preview.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I think this means I can't trust the raw output token counts for the models and need to look at the &lt;code&gt;[{"modality": "TEXT", "tokenCount": 5}, {"modality": "AUDIO", "tokenCount": 74068}]&lt;/code&gt; breakdown instead, which is frustrating.&lt;/p&gt;
&lt;p&gt;I wish they'd kept the same price for both types of tokens and used a multiplier when counting audio tokens, but presumably that would have broken the overall token limit numbers.&lt;/p&gt;
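&lt;p&gt;Here's a sketch of what that per-modality cost calculation has to look like, using the 2.5 Flash input rates above - the helper and rate table are my own, not an official API:&lt;/p&gt;

```python
# With modality-dependent pricing the total input token count is not
# enough: each entry in the usage metadata breakdown must be priced
# at its own rate. Rates below are dollars per million tokens.
FLASH_INPUT_RATES = {"TEXT": 0.30, "AUDIO": 1.00}

def input_cost_dollars(modality_breakdown, rates):
    return sum(
        item["tokenCount"] * rates[item["modality"]] / 1_000_000
        for item in modality_breakdown
    )

breakdown = [
    {"modality": "TEXT", "tokenCount": 5},
    {"modality": "AUDIO", "tokenCount": 74068},
]
print(round(input_cost_dollars(breakdown, FLASH_INPUT_RATES), 6))  # about $0.074
```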
&lt;p&gt;Gemini 2.5 Flash has &lt;em&gt;very&lt;/em&gt; different pricing from the Gemini 2.5 Flash Preview model. That preview charged different rates for thinking vs. non-thinking mode.&lt;/p&gt;
&lt;p&gt;2.5 Flash Preview: $0.15/million input text/image/video, $1/million audio input, $0.60/million output in non-thinking mode, $3.50/million output in thinking mode.&lt;/p&gt;
&lt;p&gt;The new 2.5 Flash is simpler: $0.30/million input text/image/video (twice as much), $1/million audio input (the same), $2.50/million output (more than non-thinking mode but less than thinking mode).&lt;/p&gt;
&lt;p&gt;In the Twitter Space they mentioned that the difference between thinking and non-thinking mode for 2.5 Flash Preview had caused a lot of confusion, and the new price should still work out cheaper for thinking-mode uses. Using that model in non-thinking mode was always a bit odd, and hopefully the new 2.5 Flash Lite can fit those cases better (though it's actually also a "thinking" model).&lt;/p&gt;
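&lt;p&gt;A quick worked example with made-up token counts shows why: output tokens dominate thinking-mode requests, and the output rate dropped even though the text input rate doubled:&lt;/p&gt;

```python
# Hypothetical thinking-mode request: 10,000 text input tokens,
# 5,000 output tokens. Rates are the ones quoted above,
# dollars per million tokens.
def cost_dollars(input_tokens, output_tokens, input_rate, output_rate):
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

in_tok, out_tok = 10_000, 5_000
old_thinking = cost_dollars(in_tok, out_tok, 0.15, 3.50)  # 2.5 Flash Preview
new_flash = cost_dollars(in_tok, out_tok, 0.30, 2.50)     # new 2.5 Flash
print(old_thinking, new_flash)  # roughly $0.019 vs $0.0155: new rate is cheaper
```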
&lt;p&gt;I've updated my &lt;a href="https://www.llm-prices.com/"&gt;llm-prices.com&lt;/a&gt; site with the prices of the new models.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="google"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="gemini"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/></entry><entry><title>o3-pro</title><link href="https://simonwillison.net/2025/Jun/10/o3-pro/#atom-tag" rel="alternate"/><published>2025-06-10T20:46:00+00:00</published><updated>2025-06-10T20:46:00+00:00</updated><id>https://simonwillison.net/2025/Jun/10/o3-pro/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://platform.openai.com/docs/models/o3-pro"&gt;o3-pro&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;OpenAI released o3-pro today, which they describe as a "version of o3 with more compute for better responses".&lt;/p&gt;
&lt;p&gt;It's only available via the newer Responses API. I've added it to my &lt;a href="https://github.com/simonw/llm-openai-plugin"&gt;llm-openai-plugin&lt;/a&gt; plugin which uses that new API, so you can try it out like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install -U llm-openai-plugin
llm -m openai/o3-pro "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="Description by o3-pro: The image is a playful, minimalist cartoon showing a white bird riding a bicycle. The bird has a simple oval body, a round head with a small black eye, and a yellow beak. Its orange feet are positioned on the bicycle’s pedals. The bicycle itself is drawn with thin black lines forming two large circular wheels and a straightforward frame. The scene has a light blue background with a soft gray oval shadow beneath the bicycle, giving the impression of ground. Overall, the illustration has a light, whimsical feel." src="https://static.simonwillison.net/static/2025/o3-pro-pelican.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;It's &lt;em&gt;slow&lt;/em&gt; - &lt;a href="https://gist.github.com/simonw/6bc7dda9dbe07281d902d254e5fb6e33"&gt;generating this pelican&lt;/a&gt; took 124 seconds! OpenAI suggest using their &lt;a href="https://platform.openai.com/docs/guides/background"&gt;background mode&lt;/a&gt; for o3 prompts, which I haven't tried myself yet.&lt;/p&gt;
&lt;p&gt;o3-pro is priced at $20/million input tokens and $80/million output tokens - 10x the price of regular o3 after its &lt;a href="https://simonwillison.net/2025/Jun/10/o3-price-drop/"&gt;80% price drop&lt;/a&gt; this morning.&lt;/p&gt;
&lt;p&gt;Ben Hylak had early access and published his notes so far in &lt;a href="https://www.latent.space/p/o3-pro"&gt;God is hungry for Context: First thoughts on o3 pro&lt;/a&gt;. It sounds like this model needs to be applied very thoughtfully. In comparison to o3:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It's smarter. &lt;em&gt;much smarter.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;But in order to see that, you need to give it&lt;/strong&gt; &lt;strong&gt;&lt;em&gt;a lot&lt;/em&gt;&lt;/strong&gt; &lt;strong&gt;more context. and I'm running out of context.&lt;/strong&gt; [...]&lt;/p&gt;
&lt;p&gt;My co-founder Alexis and I took the time to assemble a history of all of our past planning meetings at Raindrop, all of our goals, even record voice memos: and then asked o3-pro to come up with a plan.&lt;/p&gt;
&lt;p&gt;We were blown away; it spit out the exact kind of concrete plan and analysis I've always wanted an LLM to create --- complete with target metrics, timelines, what to prioritize, and strict instructions on what to absolutely cut.&lt;/p&gt;
&lt;p&gt;The plan o3 gave us was plausible, reasonable; but the plan o3 Pro gave us was specific and rooted enough that &lt;strong&gt;&lt;em&gt;it actually changed how we are thinking about our future.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This is hard to capture in an eval.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It sounds to me like o3-pro works best when combined with tools. I don't have tool support in &lt;code&gt;llm-openai-plugin&lt;/code&gt; yet - &lt;a href="https://github.com/simonw/llm-openai-plugin/issues/20"&gt;here's the relevant issue&lt;/a&gt;.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/o3"&gt;o3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="o3"/><category term="llm-release"/></entry><entry><title>o3 price drop</title><link href="https://simonwillison.net/2025/Jun/10/o3-price-drop/#atom-tag" rel="alternate"/><published>2025-06-10T20:02:30+00:00</published><updated>2025-06-10T20:02:30+00:00</updated><id>https://simonwillison.net/2025/Jun/10/o3-price-drop/#atom-tag</id><summary type="html">
    &lt;p&gt;OpenAI just dropped the price of their o3 model by 80% - from $10/million input tokens and $40/million output tokens to just $2/million and $8/million for the very same model. This is in advance of the release of o3-pro which apparently is coming &lt;a href="https://twitter.com/OpenAI/status/1932483131363504334"&gt;later today&lt;/a&gt; (update: &lt;a href="https://simonwillison.net/2025/Jun/10/o3-pro/"&gt;here it is&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;This is a pretty huge shake-up in LLM pricing. o3 is now priced the same as GPT 4.1, and slightly less than GPT-4o ($2.50/$10). It’s also less than Anthropic’s Claude Sonnet 4 ($3/$15) and Opus 4 ($15/$75) and sits in between Google’s Gemini 2.5 Pro for &amp;gt;200,000 tokens ($2.50/$15) and 2.5 Pro for &amp;lt;200,000 ($1.25/$10).&lt;/p&gt;
&lt;p&gt;I’ve updated my &lt;a href="https://www.llm-prices.com/"&gt;llm-prices.com&lt;/a&gt; pricing calculator with the new rate.&lt;/p&gt;
&lt;p&gt;How have they dropped the price so much? OpenAI's Adam Groth &lt;a href="https://twitter.com/TheRealAdamG/status/1932440328293806321"&gt;credits ongoing optimization work&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;thanks to the engineers optimizing inferencing.&lt;/p&gt;
&lt;/blockquote&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/o3"&gt;o3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;&lt;/p&gt;



</summary><category term="generative-ai"/><category term="openai"/><category term="o3"/><category term="llm-pricing"/><category term="ai"/><category term="llms"/></entry><entry><title>Magistral — the first reasoning model by Mistral AI</title><link href="https://simonwillison.net/2025/Jun/10/magistral/#atom-tag" rel="alternate"/><published>2025-06-10T16:13:22+00:00</published><updated>2025-06-10T16:13:22+00:00</updated><id>https://simonwillison.net/2025/Jun/10/magistral/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://mistral.ai/news/magistral"&gt;Magistral — the first reasoning model by Mistral AI&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Mistral's first reasoning model is out today, in two sizes. There's a 24B Apache 2 licensed open-weights model called Magistral Small (actually Magistral-Small-2506), and a larger API-only model called Magistral Medium.&lt;/p&gt;
&lt;p&gt;Magistral Small is available as &lt;a href="https://huggingface.co/mistralai/Magistral-Small-2506"&gt;mistralai/Magistral-Small-2506&lt;/a&gt; on Hugging Face. From that model card:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Context Window&lt;/strong&gt;: A 128k context window, but performance might degrade past 40k. Hence we recommend setting the maximum model length to 40k.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Mistral also released an official GGUF version, &lt;a href="https://huggingface.co/mistralai/Magistral-Small-2506_gguf"&gt;Magistral-Small-2506_gguf&lt;/a&gt;, which I ran successfully using Ollama like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama pull hf.co/mistralai/Magistral-Small-2506_gguf:Q8_0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That fetched a 25GB file. I ran prompts using a chat session with &lt;a href="https://github.com/taketwo/llm-ollama"&gt;llm-ollama&lt;/a&gt; like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm chat -m hf.co/mistralai/Magistral-Small-2506_gguf:Q8_0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's what I got for "Generate an SVG of a pelican riding a bicycle" (&lt;a href="https://gist.github.com/simonw/7aaac8217f43be04886737d67c08ecca"&gt;transcript here&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&lt;img alt="Blue sky and what looks like an eagle flying towards the viewer." src="https://static.simonwillison.net/static/2025/magistral-pelican.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;It's disappointing that the GGUF doesn't support function calling yet - hopefully a community variant can add that, it's one of the best ways I know of to unlock the potential of these reasoning models.&lt;/p&gt;
&lt;p&gt;I just noticed that Ollama have their own &lt;a href="https://ollama.com/library/magistral"&gt;Magistral model&lt;/a&gt; too, which can be accessed using:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama pull magistral:latest
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That gets you a 14GB &lt;code&gt;q4_K_M&lt;/code&gt; quantization - other options can be found in the &lt;a href="https://ollama.com/library/magistral/tags"&gt;full list of Ollama magistral tags&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;One thing that caught my eye in the Magistral announcement:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Legal, finance, healthcare, and government professionals get traceable reasoning that meets compliance requirements. Every conclusion can be traced back through its logical steps, providing auditability for high-stakes environments with domain-specialized AI.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I guess this means the reasoning traces are fully visible and not redacted in any way - interesting to see Mistral trying to turn that into a feature that's attractive to the business clients they are most interested in appealing to.&lt;/p&gt;
&lt;p&gt;Also from that announcement:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Our early tests indicated that Magistral is an excellent creative companion. We highly recommend it for creative writing and storytelling, with the model capable of producing coherent or — if needed — delightfully eccentric copy.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I haven't seen a reasoning model promoted for creative writing in this way before.&lt;/p&gt;
&lt;p&gt;You can try out Magistral Medium by selecting the new "Thinking" option in Mistral's &lt;a href="https://chat.mistral.ai/"&gt;Le Chat&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of a chat interface showing settings options. At the top is a text input field that says &amp;quot;Ask le Chat or @mention an agent&amp;quot; with a plus button, lightbulb &amp;quot;Think&amp;quot; button with up arrow, grid &amp;quot;Tools&amp;quot; button, and settings icon. Below are two toggle options: &amp;quot;Pure Thinking&amp;quot; with description &amp;quot;Best option for math + coding. Disables tools.&amp;quot; (toggle is off), and &amp;quot;10x Speed&amp;quot; with lightning bolt icon and &amp;quot;PRO - 2 remaining today&amp;quot; label, described as &amp;quot;Same quality at 10x the speed.&amp;quot; (toggle is on and green)." src="https://static.simonwillison.net/static/2025/magistral-le-chat.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;They have options for "Pure Thinking" and a separate option for "10x speed", which runs Magistral Medium at 10x the speed using &lt;a href="https://www.cerebras.ai/"&gt;Cerebras&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The new models are also available through &lt;a href="https://docs.mistral.ai/api/"&gt;the Mistral API&lt;/a&gt;. You can access them by installing &lt;a href="https://github.com/simonw/llm-mistral"&gt;llm-mistral&lt;/a&gt; and running &lt;code&gt;llm mistral refresh&lt;/code&gt; to refresh the list of available models, then:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m mistral/magistral-medium-latest \
  'Generate an SVG of a pelican riding a bicycle'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="Claude Sonnet 4 described this as Minimalist illustration of a white bird with an orange beak riding on a dark gray motorcycle against a light blue sky with a white sun and gray ground" src="https://static.simonwillison.net/static/2025/magistral-medium-pelican.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/93917661eae6e2fe0a0bd5685172fab8"&gt;that transcript&lt;/a&gt;. At 13 input and 1,236 output tokens that cost me &lt;a href="https://www.llm-prices.com/#it=13&amp;amp;ot=1236&amp;amp;ic=2&amp;amp;oc=5"&gt;0.62 cents&lt;/a&gt; - just over half a cent.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mistral"&gt;mistral&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cerebras"&gt;cerebras&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="mistral"/><category term="cerebras"/><category term="llm-pricing"/><category term="ollama"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/></entry></feed>