<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: llm-reasoning</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/llm-reasoning.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2026-04-08T23:07:44+00:00</updated><author><name>Simon Willison</name></author><entry><title>Meta's new model is Muse Spark, and meta.ai chat has some interesting tools</title><link href="https://simonwillison.net/2026/Apr/8/muse-spark/#atom-tag" rel="alternate"/><published>2026-04-08T23:07:44+00:00</published><updated>2026-04-08T23:07:44+00:00</updated><id>https://simonwillison.net/2026/Apr/8/muse-spark/#atom-tag</id><summary type="html">
    &lt;p&gt;Meta &lt;a href="https://ai.meta.com/blog/introducing-muse-spark-msl/"&gt;announced Muse Spark&lt;/a&gt; today, their first model release since Llama 4 &lt;a href="https://simonwillison.net/2025/Apr/5/llama-4-notes/"&gt;almost exactly a year ago&lt;/a&gt;. It's hosted, not open weights, and the API is currently "a private API preview to select users", but you can try it out today on &lt;a href="https://meta.ai/"&gt;meta.ai&lt;/a&gt; (Facebook or Instagram login required).&lt;/p&gt;
&lt;p&gt;Meta's self-reported benchmarks show it competitive with Opus 4.6, Gemini 3.1 Pro, and GPT 5.4 on selected benchmarks, though notably behind on Terminal-Bench 2.0. Meta themselves say they "continue to invest in areas with current performance gaps, such as long-horizon agentic systems and coding workflows".&lt;/p&gt;
&lt;p&gt;The model is exposed as two different modes on &lt;a href="https://meta.ai/"&gt;meta.ai&lt;/a&gt; - "Instant" and "Thinking". Meta promise a "Contemplating" mode in the future which they say will offer much longer reasoning time and should behave more like Gemini Deep Think or GPT-5.4 Pro.&lt;/p&gt;
&lt;h4 id="a-couple-of-pelicans"&gt;A couple of pelicans&lt;/h4&gt;
&lt;p&gt;I prefer to run &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;my pelican test&lt;/a&gt; via API to avoid being influenced by any invisible system prompts, but since that's not an option I ran it against the chat UI directly.&lt;/p&gt;
&lt;p&gt;Here's the pelican I got for "Instant":&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/muse-spark-instant-pelican.jpg" alt="This is a pretty basic pelican. The bicycle is mangled, the pelican itself has a rectangular beak albeit with a hint of pouch curve below it. Not a very good one." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And this one for "Thinking":&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/muse-spark-thinking-pelican.png" alt="Much better. Clearly a pelican. Bicycle is the correct shape. Pelican is wearing a blue cycling helmet (albeit badly rendered). Not a bad job at all." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Both SVGs were rendered inline by the Meta AI interface. Interestingly, the Instant model &lt;a href="https://gist.github.com/simonw/ea7466204f1001b7d67afcb5d0532f6f"&gt;output an SVG directly&lt;/a&gt; (with code comments) whereas the Thinking model &lt;a href="https://gist.github.com/simonw/bc911a56006ba44b0bf66abf0f872ab2"&gt;wrapped it in a thin HTML shell&lt;/a&gt; with some unused &lt;code&gt;Playables SDK v1.0.0&lt;/code&gt; JavaScript libraries.&lt;/p&gt;
&lt;p&gt;Which got me curious...&lt;/p&gt;
&lt;h4 id="poking-around-with-tools"&gt;Poking around with tools&lt;/h4&gt;
&lt;p&gt;Clearly Meta's chat harness has some tools wired up to it - at the very least it can render SVG and HTML as embedded frames, Claude Artifacts style.&lt;/p&gt;
&lt;p&gt;But what else can it do?&lt;/p&gt;
&lt;p&gt;I asked it:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;what tools do you have access to?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And then:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I want the exact tool names, parameter names and tool descriptions, in the original format&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It spat out detailed descriptions of 16 different tools. You can see &lt;a href="https://gist.github.com/simonw/e1ce0acd70443f93dcd6481e716c4304#response-1"&gt;the full list I got back here&lt;/a&gt; - credit to Meta for not telling their bot to hide these, since it's far less frustrating if I can get them out without having to mess around with jailbreaks.&lt;/p&gt;
&lt;p&gt;Here are highlights derived from that response:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Browse and search&lt;/strong&gt;. &lt;code&gt;browser.search&lt;/code&gt; can run a web search through an undisclosed search engine, &lt;code&gt;browser.open&lt;/code&gt; can load the full page from one of those search results and &lt;code&gt;browser.find&lt;/code&gt; can run pattern matches against the returned page content.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Meta content search&lt;/strong&gt;. &lt;code&gt;meta_1p.content_search&lt;/code&gt; can run "Semantic search across Instagram, Threads, and Facebook posts" - but only for posts the user has access to view which were created since 2025-01-01. This tool has some powerful looking parameters, including &lt;code&gt;author_ids&lt;/code&gt;, &lt;code&gt;key_celebrities&lt;/code&gt;, &lt;code&gt;commented_by_user_ids&lt;/code&gt;, and &lt;code&gt;liked_by_user_ids&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;"Catalog search"&lt;/strong&gt; - &lt;code&gt;meta_1p.meta_catalog_search&lt;/code&gt; can "Search for products in Meta's product catalog", presumably for the "Shopping" option in the Meta AI model selector.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Image generation&lt;/strong&gt;. &lt;code&gt;media.image_gen&lt;/code&gt; generates images from prompts, and "returns a CDN URL and saves the image to the sandbox". It has modes "artistic" and "realistic" and can return "square", "vertical" or "landscape" images.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;container.python_execution&lt;/strong&gt; - yes! It's &lt;a href="https://simonwillison.net/tags/code-interpreter/"&gt;Code Interpreter&lt;/a&gt;, my favourite feature of both ChatGPT and Claude.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Execute Python code in a remote sandbox environment. Python 3.9 with pandas, numpy, matplotlib, plotly, scikit-learn, PyMuPDF, Pillow, OpenCV, etc. Files persist at &lt;code&gt;/mnt/data/&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Python 3.9 &lt;a href="https://devguide.python.org/versions/"&gt;is EOL&lt;/a&gt; these days but the library collection looks useful.&lt;/p&gt;
&lt;p&gt;I prompted "use python code to confirm sqlite version and python version" and got back Python 3.9.25 and SQLite 3.34.1 (from &lt;a href="https://sqlite.org/releaselog/3_34_1.html"&gt;January 2021&lt;/a&gt;).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;container.create_web_artifact&lt;/strong&gt; - we saw this earlier with the HTML wrapper around the pelican: Meta AI can create HTML+JavaScript files in its container which can then be served up as secure sandboxed iframe interactives. "Set kind to &lt;code&gt;html&lt;/code&gt; for websites/apps or &lt;code&gt;svg&lt;/code&gt; for vector graphics."&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;container.download_meta_1p_media&lt;/strong&gt; is interesting: "Download media from Meta 1P sources into the sandbox. Use post_id for Instagram/Facebook/Threads posts, or &lt;code&gt;catalog_search_citation_id&lt;/code&gt; for catalog product images". So it looks like you can pull in content from other parts of Meta and then do fun Code Interpreter things to it in the sandbox.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;container.file_search&lt;/strong&gt; - "Search uploaded files in this conversation and return relevant excerpts" - I guess for digging through PDFs and similar?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Tools for editing files in the container&lt;/strong&gt; - &lt;code&gt;container.view&lt;/code&gt;, &lt;code&gt;container.insert&lt;/code&gt; (with &lt;code&gt;new_str&lt;/code&gt; and &lt;code&gt;insert_line&lt;/code&gt;), and &lt;code&gt;container.str_replace&lt;/code&gt;. These look similar to Claude's &lt;a href="https://platform.claude.com/docs/en/agents-and-tools/tool-use/text-editor-tool#text-editor-tool-commands"&gt;text editor tool commands&lt;/a&gt; - a pattern that's becoming common across file-equipped agent harnesses.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;container.visual_grounding&lt;/strong&gt; - see below, this one is &lt;em&gt;fun&lt;/em&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;subagents.spawn_agent&lt;/strong&gt; - the &lt;a href="https://simonwillison.net/guides/agentic-engineering-patterns/subagents/"&gt;sub-agent as a tool&lt;/a&gt; pattern. "Spawn an independent sub-agent for research, analysis, or delegation. It returns its final text response."&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;third_party.link_third_party_account&lt;/strong&gt; - "Initiate account linking for a third-party service", currently Google Calendar, Outlook Calendar, Gmail, or Outlook.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
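&lt;p&gt;That version-check prompt I ran against &lt;code&gt;container.python_execution&lt;/code&gt; comes down to just a couple of lines of Python. This is my reconstruction - Meta AI didn't show me the code it actually ran:&lt;/p&gt;

```python
import sqlite3
import sys

# My reconstruction of the version check - in Meta's sandbox this
# reported Python 3.9.25 and SQLite 3.34.1.
print("Python:", sys.version.split()[0])
print("SQLite:", sqlite3.sqlite_version)
```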
&lt;h4 id="image-analysis-in-the-container"&gt;Image analysis in the container&lt;/h4&gt;
&lt;p&gt;Let's talk about that &lt;code&gt;container.visual_grounding&lt;/code&gt; one. Here's the description in full:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Description:&lt;/strong&gt; Visual grounding tool that analyzes the image, identifies and labels objects, locates regions, or counts objects.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Parameters:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;object_names&lt;/strong&gt; (array, required)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;image_path&lt;/strong&gt; (string, default "")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;format_type&lt;/strong&gt; ("bbox"|"point"|"count"|null, default "bbox")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;title&lt;/strong&gt; (string|null)&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;So it takes a required list of object names plus an optional image path, and returns results in &lt;code&gt;bbox&lt;/code&gt;, &lt;code&gt;point&lt;/code&gt; or &lt;code&gt;count&lt;/code&gt; format.&lt;/p&gt;
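&lt;p&gt;Pieced together from that dump, a call presumably looks something like this - the parameter names and value options come from the description above, but the envelope format is entirely my guess:&lt;/p&gt;

```python
# Hypothetical visual_grounding invocation, reconstructed from the
# parameter dump - the wrapping "tool"/"arguments" shape is a guess.
call = {
    "tool": "container.visual_grounding",
    "arguments": {
        "object_names": ["coffee cup", "banana peel", "raccoon"],  # required
        "image_path": "/mnt/data/raccoon.jpg",  # string, default ""
        "format_type": "bbox",  # "bbox" | "point" | "count", default "bbox"
        "title": "Trash hat components",  # string | null
    },
}
print(call["arguments"]["format_type"])
```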
&lt;p&gt;This sounded to me like Meta's &lt;a href="https://github.com/facebookresearch/segment-anything"&gt;Segment Anything&lt;/a&gt; (&lt;em&gt;I later heard that it's not: it's a native capability of the model itself, exposed via a tool call with a custom system prompt&lt;/em&gt;). I decided to take it for a spin.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Prompt&lt;/strong&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;generate a photo of a raccoon sitting on a trash can wearing trash as a hat&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;Response&lt;/strong&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Here you go — king of the alley, in full regalia:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/meta-raccoon-trash-hat.jpg" alt="It's a photo of a raccoon wearing a disposable coffee cup and some newspaper and banana peel as a hat. It's pretty good" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Coffee cup crown, banana peel brim, newspaper feather. Peak raccoon fashion. Want me to make a wilder version — maybe night-vision, or wearing sunglasses too?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;My guess is that image generation in Meta AI is powered by their &lt;a href="https://arxiv.org/abs/2309.15807"&gt;Emu model&lt;/a&gt;, or an updated version of it. This is the same pattern as ChatGPT and Gemini image generation, where images are generated by a tool call (here that's &lt;code&gt;media.image_gen&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;image_gen&lt;/code&gt; tool description said that it "saves the image to the sandbox", so my guess was that I'd then be able to run Python code against it. I tried:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;use python OpenCV to analyze that image and find out neat things about it&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Annoyingly it did &lt;em&gt;not&lt;/em&gt; show me the code it ran, but it did create this image:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/raccoon_analysis.png" alt="OpenCV image analysis dashboard titled &amp;quot;OpenCV Analysis: Raccoon Trash Hat&amp;quot; with four panels: top-left shows the original 1280x1280 photo of a raccoon wearing a trash-lid hat with a coffee cup and newspaper on top in an alley; top-right shows a Canny edge detection version with white edges on black background labeled &amp;quot;Canny Edges - 4.6% density&amp;quot;; bottom-left shows &amp;quot;Top 5 Dominant Colors&amp;quot; as swatches with values RGB(20, 19, 24): 35.3%, RGB(43, 43, 47): 34.2%, RGB(84, 80, 79): 18.3%, RGB(139, 135, 131): 6.3%, RGB(210, 213, 208): 5.8%; bottom-right shows &amp;quot;Color Histograms&amp;quot; line chart with Red, Green, and Blue channels plotted from 0-250 on x-axis and 0-35000 on y-axis, peaking around value 25-50." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Pretty neat! We can generate images and then run Python tools against them.&lt;/p&gt;
&lt;p&gt;Next, the &lt;code&gt;visual_grounding&lt;/code&gt; tool. Knowing the names of the tools makes it easy to refer to them in prompts:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Now try the container.visual_grounding tool&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The response (&lt;a href="https://gist.github.com/simonw/72d64b325d33e5882b9818d825551b44"&gt;full response here&lt;/a&gt;):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Done — visual_grounding pinpointed every piece of the outfit with pixel-level precision:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/meta-grounding.jpg" alt="Visual grounding analysis of a raccoon photo titled &amp;quot;Visual Grounding: Found 8 objects&amp;quot; with colored dot annotations on the image marking each detected object: coffee cup (515,205) in yellow, banana peel (569,261) in light yellow, newspaper (399,279) in blue, right eye (461,396) in magenta, left eye (582,397) in pink, nose (514,472) in cyan, raccoon (553,548) in green, and trash can lid (550,892) in red. A legend at the bottom lists all objects with their normalized 0-1000 coordinates. Footer text reads &amp;quot;Coordinates are 0-1000 normalized. visual_grounding returned precise points for each component of the trash hat ensemble.&amp;quot; Watermark reads &amp;quot;Meta AI&amp;quot;." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;[...]&lt;/p&gt;
&lt;p&gt;Want me to run it in bbox mode next to draw boxes around the fur texture, or count mode to see how many individual whiskers it can detect?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Once again, Meta AI wrapped the results from the tool in a custom HTML visualization. That's clearly &lt;code&gt;point&lt;/code&gt; mode, but how about the others?&lt;/p&gt;
&lt;p&gt;I prompted:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;yes draw bounding boxes, can it do masks too?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Response:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Yes — bounding boxes work great. Masks? Not directly in &lt;code&gt;visual_grounding&lt;/code&gt;; it only returns three formats: point, bbox, and count. No pixel-level segmentation masks.&lt;/p&gt;
&lt;p&gt;Here's the bbox output:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/meta-bbox.jpg" alt="Bounding box object detection visualization titled &amp;quot;Bounding Boxes (visual_grounding)&amp;quot; with subtitle &amp;quot;8 objects detected — coordinates are 0-1000 normalized&amp;quot; showing a raccoon photo with colored rectangular bounding boxes around detected objects: coffee cup in yellow [424,115,610,273] 186×158, banana peel in yellow [472,215,667,291] 195×76, newspaper in blue [305,223,548,315] 243×92, raccoon in green [270,240,892,845] 622×605, raccoon's face in purple [305,283,750,537] 445×254, right eye in magenta [442,379,489,413] 47×34, left eye in magenta [565,379,605,416] 40×37, and trash can lid in red [107,800,1000,998] 893×198. A legend at the bottom shows each object's name, coordinates, and pixel dimensions in colored cards. Watermark reads &amp;quot;Meta AI&amp;quot;." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;What you’re seeing:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;raccoon&lt;/strong&gt;: big box from (270,240) to (892,845) — covers 62% of the width, basically the whole animal&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;coffee cup&lt;/strong&gt;: (424,115)-(610,273) — sits highest, 158px tall in the 0-1000 space&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;banana peel&lt;/strong&gt;: (472,215)-(667,291) — overlaps the cup's base, confirming it's draped over the side&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;newspaper&lt;/strong&gt;: (305,223)-(548,315) — left ear coverage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;trash can lid&lt;/strong&gt;: (107,800)-(1000,998) — spans almost the full image width at the bottom&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;face&lt;/strong&gt;: (305,283)-(750,537) — tight crop around the mask&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;eyes&lt;/strong&gt;: both ~40×37 pixel boxes, perfectly level at y=379&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Notice how the boxes nest: face is inside raccoon, eyes inside face, hat pieces overlap each other but stay above the face box. The tool is doing real object localization, not just guessing centers.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;strong&gt;About masks:&lt;/strong&gt; since visual_grounding doesn't output them, I can fake them with OpenCV in the container. Two quick options:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;GrabCut&lt;/strong&gt; — I give it the raccoon bbox, it iteratively segments fur from background and returns a binary mask&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SAM-style color threshold&lt;/strong&gt; — use the bbox plus k-means to isolate the coffee cup or banana peel&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Want me to generate an actual alpha mask for the raccoon (cut it out from the alley), or for the trash hat pieces separately? Takes about 10 seconds and I can give you a transparent PNG.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://static.simonwillison.net/static/2026/meta-bbox.html"&gt;the HTML for that bbox visualization&lt;/a&gt;.&lt;/p&gt;
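&lt;p&gt;Those 0-1000 normalized coordinates are easy to map back to pixels: scale each corner by the image dimensions divided by 1000. A quick helper (mine, not from the Meta AI response), applied to the raccoon box on the 1280×1280 source image:&lt;/p&gt;

```python
def denormalize(box, width, height):
    """Convert an [x1, y1, x2, y2] box in 0-1000 normalized
    coordinates to pixel coordinates for a width x height image."""
    x1, y1, x2, y2 = box
    return (x1 * width / 1000, y1 * height / 1000,
            x2 * width / 1000, y2 * height / 1000)

# The raccoon bbox from the response, mapped onto the 1280x1280 image:
print(denormalize([270, 240, 892, 845], 1280, 1280))
# -> (345.6, 307.2, 1141.76, 1081.6)
```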
&lt;p&gt;Finally I had it dump out the raw JSON for point mode, bbox mode and count mode. You can &lt;a href="https://gist.github.com/simonw/26bfdb22d7d73450c13f49abbf0c251a"&gt;see all of that here&lt;/a&gt;. The count mode is particularly interesting - here's that in full:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;[
  {
    &lt;span class="pl-ent"&gt;"object_name"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;raccoon whisker&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"points"&lt;/span&gt;: [
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;403&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;445&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;397&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;458&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;392&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;472&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;391&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;485&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;398&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;497&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;415&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;512&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;618&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;446&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;622&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;461&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;625&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;475&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;628&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;488&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;635&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;501&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;651&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;518&lt;/span&gt;}
    ],
    &lt;span class="pl-ent"&gt;"count"&lt;/span&gt;: &lt;span class="pl-c1"&gt;12&lt;/span&gt;
  },
  {
    &lt;span class="pl-ent"&gt;"object_name"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;raccoon paw claw&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"points"&lt;/span&gt;: [
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;406&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;845&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;424&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;847&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;441&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;842&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;473&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;833&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;596&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;830&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;635&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;844&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;660&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;843&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;679&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;838&lt;/span&gt;}
    ],
    &lt;span class="pl-ent"&gt;"count"&lt;/span&gt;: &lt;span class="pl-c1"&gt;8&lt;/span&gt;
  },
  {
    &lt;span class="pl-ent"&gt;"object_name"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;trash item on head&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"points"&lt;/span&gt;: [
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;514&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;191&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;568&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;263&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;392&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;274&lt;/span&gt;}
    ],
    &lt;span class="pl-ent"&gt;"count"&lt;/span&gt;: &lt;span class="pl-c1"&gt;3&lt;/span&gt;
  },
  {
    &lt;span class="pl-ent"&gt;"object_name"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;eye&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"points"&lt;/span&gt;: [
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;462&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;395&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;582&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;398&lt;/span&gt;}
    ],
    &lt;span class="pl-ent"&gt;"count"&lt;/span&gt;: &lt;span class="pl-c1"&gt;2&lt;/span&gt;
  },
  {
    &lt;span class="pl-ent"&gt;"object_name"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;ear&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"points"&lt;/span&gt;: [
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;359&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;313&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;680&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;294&lt;/span&gt;}
    ],
    &lt;span class="pl-ent"&gt;"count"&lt;/span&gt;: &lt;span class="pl-c1"&gt;2&lt;/span&gt;
  }
]&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;So Meta AI has the ability to count a raccoon's whiskers baked into the default set of tools.&lt;/p&gt;
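&lt;p&gt;That count-mode structure is simple to consume - each entry pairs the located points with a count, so you can sanity-check that they agree. Here's a quick pass over a trimmed copy of the response above:&lt;/p&gt;

```python
import json

# A trimmed copy of the count-mode JSON shown above.
raw = """[
  {"object_name": "trash item on head",
   "points": [{"x": 514, "y": 191}, {"x": 568, "y": 263},
              {"x": 392, "y": 274}],
   "count": 3},
  {"object_name": "eye",
   "points": [{"x": 462, "y": 395}, {"x": 582, "y": 398}],
   "count": 2},
  {"object_name": "ear",
   "points": [{"x": 359, "y": 313}, {"x": 680, "y": 294}],
   "count": 2}
]"""

for entry in json.loads(raw):
    # The reported count should always equal the number of points.
    assert entry["count"] == len(entry["points"])
    print(f'{entry["object_name"]}: {entry["count"]}')
```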
&lt;p&gt;Which means... it can count pelicans too!&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/meta-count-pelicans.jpg" alt="Screenshot of a Meta AI conversation where the user uploaded a photo of pelicans on rocks and asked &amp;quot;Count the pelicans&amp;quot;. response reads: &amp;quot;That rocky haul-out is packed — I count 25 brown pelicans in your photo. I went bird by bird, including the ones tucked behind rocks, heads turned away, or just peeking out at the edges. They're all California brown pelicans, with those pale yellow heads and long bills, clustered on the riprap right at the waterline. To make it easy to verify, I've numbered each one on the image:&amp;quot; followed by an annotated version of the photo with red numbered circles (1–25) placed on each pelican. The annotated image has a downward chevron and an &amp;quot;Open&amp;quot; button at the bottom right." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Here's that overlay &lt;a href="https://static.simonwillison.net/static/2026/meta-count-pelicans.html"&gt;exported as HTML&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update&lt;/strong&gt;: Meta's &lt;a href="https://twitter.com/jacktripleu/status/2042050863800447387"&gt;Jack Wu confirms&lt;/a&gt; that these tools are part of the new harness they launched alongside the new model.&lt;/em&gt;&lt;/p&gt;

&lt;h4 id="maybe-open-weights-in-the-future-"&gt;Maybe open weights in the future?&lt;/h4&gt;
&lt;p&gt;On Twitter &lt;a href="https://twitter.com/alexandr_wang/status/2041909388852748717"&gt;Alexandr Wang said&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;this is step one. bigger models are already in development with infrastructure scaling to match. private api preview open to select partners today, with plans to open-source future versions.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I really hope they do go back to open-sourcing their models. Llama 3.1/3.2/3.3 were excellent laptop-scale model families, and the introductory blog post for Muse Spark had this to say about efficiency:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[...] we can reach the same capabilities with over an order of magnitude less compute than our previous model, Llama 4 Maverick. This improvement also makes Muse Spark significantly more efficient than the leading base models available for comparison.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So are Meta back in the frontier model game? &lt;a href="https://twitter.com/ArtificialAnlys/status/2041913043379220801"&gt;Artificial Analysis&lt;/a&gt; think so - they scored Muse Spark at 52, "behind only Gemini 3.1 Pro, GPT-5.4, and Claude Opus 4.6". Last year's Llama 4 Maverick and Scout scored 18 and 13 respectively.&lt;/p&gt;
&lt;p&gt;I'm waiting for API access - while the tool collection on &lt;a href="https://meta.ai/"&gt;meta.ai&lt;/a&gt; is quite strong, the real test of a model like this is still what we can build on top of it.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/facebook"&gt;facebook&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/code-interpreter"&gt;code-interpreter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-tool-use"&gt;llm-tool-use&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/meta"&gt;meta&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="facebook"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="code-interpreter"/><category term="llm-tool-use"/><category term="meta"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/></entry><entry><title>Gemma 4: Byte for byte, the most capable open models</title><link href="https://simonwillison.net/2026/Apr/2/gemma-4/#atom-tag" rel="alternate"/><published>2026-04-02T18:28:54+00:00</published><updated>2026-04-02T18:28:54+00:00</updated><id>https://simonwillison.net/2026/Apr/2/gemma-4/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/"&gt;Gemma 4: Byte for byte, the most capable open models&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Four new vision-capable Apache 2.0 licensed reasoning LLMs from Google DeepMind, sized at 2B, 4B, and 31B, plus a 26B-A4B Mixture-of-Experts.&lt;/p&gt;
&lt;p&gt;Google emphasize an "unprecedented level of intelligence-per-parameter", providing yet more evidence that creating small useful models is one of the hottest areas of research right now.&lt;/p&gt;
&lt;p&gt;They actually label the two smaller models as E2B and E4B for "Effective" parameter size. The system card explains:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The smaller models incorporate Per-Layer Embeddings (PLE) to maximize parameter efficiency in on-device deployments. Rather than adding more layers or parameters to the model, PLE gives each decoder layer its own small embedding for every token. These embedding tables are large but are only used for quick lookups, which is why the effective parameter count is much smaller than the total.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I don't entirely understand that, but apparently that's what the "E" in E2B means!&lt;/p&gt;
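&lt;p&gt;One way to picture it (my interpretation of that paragraph): the per-layer embedding tables dominate the total parameter count, but only one row per table is fetched for each token, so they contribute almost nothing to per-token compute. With entirely made-up numbers - none of these are Gemma 4's real dimensions:&lt;/p&gt;

```python
# Hypothetical figures, illustrating the "effective vs total"
# distinction only - not Gemma 4's actual architecture.
vocab = 256_000              # tokens in the vocabulary
ple_dim = 256                # small per-layer embedding width
layers = 30
core_params = 2_000_000_000  # "effective" transformer weights

ple_params = layers * vocab * ple_dim  # large lookup tables
total = core_params + ple_params

# Per token, each table contributes just one row of ple_dim values:
active_ple = layers * ple_dim

print(f"PLE table params: {ple_params / 1e9:.2f}B")
print(f"Total: {total / 1e9:.2f}B vs effective ~{core_params / 1e9:.0f}B")
print(f"PLE values actually read per token: {active_ple}")
```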
&lt;p&gt;One particularly exciting feature of these models is that they are multi-modal beyond just images:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Vision and audio&lt;/strong&gt;: All models natively process video and images, supporting variable resolutions, and excelling at visual tasks like OCR and chart understanding. Additionally, the E2B and E4B models feature native audio input for speech recognition and understanding.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I've not figured out a way to run audio input locally - I don't think that feature is in LM Studio or Ollama yet.&lt;/p&gt;
&lt;p&gt;I tried them out using the GGUFs for &lt;a href="https://lmstudio.ai/models/gemma-4"&gt;LM Studio&lt;/a&gt;. The 2B (4.41GB), 4B (6.33GB) and 26B-A4B (17.99GB) models all worked perfectly, but the 31B (19.89GB) model was broken and spat out &lt;code&gt;"---\n"&lt;/code&gt; in a loop for every prompt I tried.&lt;/p&gt;
&lt;p&gt;The succession of &lt;a href="https://gist.github.com/simonw/12ae4711288637a722fd6bd4b4b56bdb"&gt;pelican quality&lt;/a&gt; from 2B to 4B to 26B-A4B is notable:&lt;/p&gt;
&lt;p&gt;E2B:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Two blue circles on a brown rectangle and a weird mess of orange blob and yellow triangle for the pelican" src="https://static.simonwillison.net/static/2026/gemma-4-2b-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;E4B:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Two black wheels joined by a sort of grey surfboard, the pelican is semicircles and a blue blob floating above it" src="https://static.simonwillison.net/static/2026/gemma-4-4b-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;26B-A4B:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Bicycle has the right pieces although the frame is wonky. Pelican is genuinely good, has a big triangle beak and a nice curved neck and is clearly a bird that is sitting on the bicycle" src="https://static.simonwillison.net/static/2026/gemma-4-26b-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;(This one actually had an SVG error - "error on line 18 at column 88: Attribute x1 redefined" - but after &lt;a href="https://gist.github.com/simonw/12ae4711288637a722fd6bd4b4b56bdb?permalink_comment_id=6074105#gistcomment-6074105"&gt;fixing that&lt;/a&gt; I got probably the best pelican I've seen yet from a model that runs on my laptop.)&lt;/p&gt;
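Errors like that "Attribute x1 redefined" one are easy to catch before rendering with a quick parse check. A minimal sketch using only the Python standard library (my own helper, not part of any published workflow):

```python
# Validate model-generated SVG before rendering. A duplicate attribute
# like the "Attribute x1 redefined" error above fails XML parsing.
import xml.etree.ElementTree as ET

def validate_svg(svg_text):
    """Return (True, None) if the SVG parses, else (False, error message)."""
    try:
        ET.fromstring(svg_text)
        return True, None
    except ET.ParseError as e:
        return False, str(e)

ok, err = validate_svg(
    '<svg xmlns="http://www.w3.org/2000/svg"><line x1="0" x1="5"/></svg>'
)
# ok is False; err reports a duplicate attribute with line and column.
```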
&lt;p&gt;Google are providing API access to the two larger Gemma models via their &lt;a href="https://aistudio.google.com/prompts/new_chat?model=gemma-4-31b-it"&gt;AI Studio&lt;/a&gt;. I added support to &lt;a href="https://github.com/simonw/llm-gemini"&gt;llm-gemini&lt;/a&gt; and then &lt;a href="https://gist.github.com/simonw/f9f9e9c34c7cc0ef5325a2876413e51e"&gt;ran a pelican&lt;/a&gt; through the 31B model using that:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m gemini/gemma-4-31b-it 'Generate an SVG of a pelican riding a bicycle'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Pretty good, though it is missing the front part of the bicycle frame:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Motion blur lines, a mostly great bicycle albeit missing the front part of the frame. Pelican is decent. " src="https://static.simonwillison.net/static/2026/gemma-4-31b-pelican.png" /&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemma"&gt;gemma&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="vision-llms"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="gemma"/><category term="llm-release"/><category term="lm-studio"/></entry><entry><title>Introducing Mistral Small 4</title><link href="https://simonwillison.net/2026/Mar/16/mistral-small-4/#atom-tag" rel="alternate"/><published>2026-03-16T23:41:17+00:00</published><updated>2026-03-16T23:41:17+00:00</updated><id>https://simonwillison.net/2026/Mar/16/mistral-small-4/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://mistral.ai/news/mistral-small-4"&gt;Introducing Mistral Small 4&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Big new release from Mistral today (despite the name) - a new Apache 2 licensed 119B parameter (Mixture-of-Experts, 6B active) model which they describe like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Mistral Small 4 is the first Mistral model to unify the capabilities of our flagship models, Magistral for reasoning, Pixtral for multimodal, and Devstral for agentic coding, into a single, versatile model.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It supports &lt;code&gt;reasoning_effort="none"&lt;/code&gt; or &lt;code&gt;reasoning_effort="high"&lt;/code&gt;, with the latter providing "equivalent verbosity to previous Magistral models". &lt;/p&gt;
&lt;p&gt;The new model is &lt;a href="https://huggingface.co/mistralai/Mistral-Small-4-119B-2603/tree/main"&gt;242GB on Hugging Face&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I &lt;a href="https://gist.github.com/simonw/3dec228577559f15f26204a3cc550583"&gt;tried it out&lt;/a&gt; via the Mistral API using &lt;a href="https://github.com/simonw/llm-mistral"&gt;llm-mistral&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-mistral
llm mistral refresh
llm -m mistral/mistral-small-2603 "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="The bicycle is upside down and mangled and the pelican is a series of grey curves with a triangular beak." src="https://static.simonwillison.net/static/2026/mistral-small-4.png" /&gt;&lt;/p&gt;
&lt;p&gt;I couldn't find a way to set the reasoning effort in their &lt;a href="https://docs.mistral.ai/api/endpoint/chat#operation-chat_completion_v1_chat_completions_post"&gt;API documentation&lt;/a&gt;, so hopefully that's a feature which will land soon.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update 23rd March&lt;/strong&gt;: Here's new documentation for the &lt;a href="https://docs.mistral.ai/capabilities/reasoning/adjustable"&gt;reasoning_effort parameter&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
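Based on that documentation, the parameter appears to be a field on the chat completions request body. Here's a hedged sketch of what the payload might look like (the field placement is my assumption from the docs, not verified against the live API; the model id is the one used above):

```python
# Sketch of a Mistral chat completions payload with reasoning_effort.
# The field name comes from Mistral's docs; its exact placement here
# is an assumption, not a verified API call.
import json

def build_payload(prompt, effort="high"):
    return {
        "model": "mistral-small-2603",
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": effort,  # "none" or "high"
    }

payload = build_payload("Generate an SVG of a pelican riding a bicycle")
body = json.dumps(payload)  # POST this to the chat completions endpoint
```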
&lt;p&gt;Also from Mistral today, and fitting their -stral naming convention, is &lt;a href="https://mistral.ai/news/leanstral"&gt;Leanstral&lt;/a&gt;, an open weight model specifically tuned to generate code in &lt;a href="https://lean-lang.org/"&gt;Lean 4&lt;/a&gt;, the formally verifiable programming language. I haven't explored Lean at all so I have no way to credibly evaluate this, but it's interesting to see them target one specific language in this way.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mistral"&gt;mistral&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="mistral"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/></entry><entry><title>Quoting Donald Knuth</title><link href="https://simonwillison.net/2026/Mar/3/donald-knuth/#atom-tag" rel="alternate"/><published>2026-03-03T23:59:04+00:00</published><updated>2026-03-03T23:59:04+00:00</updated><id>https://simonwillison.net/2026/Mar/3/donald-knuth/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://www-cs-faculty.stanford.edu/~knuth/papers/claude-cycles.pdf"&gt;&lt;p&gt;Shock! Shock! I learned yesterday that an open problem I'd been working on for several weeks had just been solved by Claude Opus 4.6 - Anthropic's hybrid reasoning model that had been released three weeks earlier! It seems that I'll have to revise my opinions about "generative AI" one of these days. What a joy it is to learn not only that my conjecture has a nice solution but also to celebrate this dramatic advance in automatic deduction and creative problem solving.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://www-cs-faculty.stanford.edu/~knuth/papers/claude-cycles.pdf"&gt;Donald Knuth&lt;/a&gt;, Claude's Cycles&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/november-2025-inflection"&gt;november-2025-inflection&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/donald-knuth"&gt;donald-knuth&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;&lt;/p&gt;



</summary><category term="november-2025-inflection"/><category term="claude"/><category term="generative-ai"/><category term="ai"/><category term="llms"/><category term="donald-knuth"/><category term="llm-reasoning"/><category term="anthropic"/></entry><entry><title>Gemini 3 Deep Think</title><link href="https://simonwillison.net/2026/Feb/12/gemini-3-deep-think/#atom-tag" rel="alternate"/><published>2026-02-12T18:12:17+00:00</published><updated>2026-02-12T18:12:17+00:00</updated><id>https://simonwillison.net/2026/Feb/12/gemini-3-deep-think/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-deep-think/"&gt;Gemini 3 Deep Think&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
New from Google. They say it's "built to push the frontier of intelligence and solve modern challenges across science, research, and engineering".&lt;/p&gt;
&lt;p&gt;It drew me a &lt;em&gt;really good&lt;/em&gt; &lt;a href="https://gist.github.com/simonw/7e317ebb5cf8e75b2fcec4d0694a8199"&gt;SVG of a pelican riding a bicycle&lt;/a&gt;! I think this is the best one I've seen so far - here's &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;my previous collection&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="This alt text also generated by Gemini 3 Deep Think: A highly detailed, colorful, flat vector illustration with thick dark blue outlines depicting a stylized white pelican riding a bright cyan blue bicycle from left to right across a sandy beige beach with white speed lines indicating forward motion. The pelican features a light blue eye, a pink cheek blush, a massive bill with a vertical gradient from yellow to orange, a backward magenta cap with a cyan brim and a small yellow top button, and a matching magenta scarf blowing backward in the wind. Its white wing, accented with a grey mid-section and dark blue feather tips, reaches forward to grip the handlebars, while its long tan leg and orange foot press down on an orange pedal. Attached to the front handlebars is a white wire basket carrying a bright blue cartoon fish that is pointing upwards and forwards. The bicycle itself has a cyan frame, dark blue tires, striking neon pink inner rims, cyan spokes, a white front chainring, and a dark blue chain. Behind the pelican, a grey trapezoidal pier extends from the sand toward a horizontal band of deep blue ocean water detailed with light cyan wavy lines. A massive, solid yellow-orange semi-circle sun sits on the horizon line, setting directly behind the bicycle frame. The background sky is a smooth vertical gradient transitioning from soft pink at the top to warm golden-yellow at the horizon, decorated with stylized pale peach fluffy clouds, thin white horizontal wind streaks, twinkling four-pointed white stars, and small brown v-shaped silhouettes of distant flying birds." src="https://static.simonwillison.net/static/2026/gemini-3-deep-think-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;(And since it's an FAQ, here's my answer to &lt;a href="https://simonwillison.net/2025/Nov/13/training-for-pelicans-riding-bicycles/"&gt;What happens if AI labs train for pelicans riding bicycles?&lt;/a&gt;)&lt;/p&gt;
&lt;p&gt;Since it did so well on my basic &lt;code&gt;Generate an SVG of a pelican riding a bicycle&lt;/code&gt; I decided to try the &lt;a href="https://simonwillison.net/2025/Nov/18/gemini-3/#and-a-new-pelican-benchmark"&gt;more challenging version&lt;/a&gt; as well:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Generate an SVG of a California brown pelican riding a bicycle. The bicycle must have spokes and a correctly shaped bicycle frame. The pelican must have its characteristic large pouch, and there should be a clear indication of feathers. The pelican must be clearly pedaling the bicycle. The image should show the full breeding plumage of the California brown pelican.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/154c0cc7b4daed579f6a5e616250ecc8"&gt;what I got&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Also described by Gemini 3 Deep Think: A highly detailed, vibrant, and stylized vector illustration of a whimsical bird resembling a mix between a pelican and a frigatebird enthusiastically riding a bright cyan bicycle from left to right across a flat tan and brown surface. The bird leans horizontally over the frame in an aerodynamic racing posture, with thin, dark brown wing-like arms reaching forward to grip the silver handlebars and a single thick brown leg, patterned with white V-shapes, stretching down to press on a black pedal. The bird's most prominent and striking feature is an enormous, vividly bright red, inflated throat pouch hanging beneath a long, straight grey upper beak that ends in a small orange hook. Its head is mostly white with a small pink patch surrounding the eye, a dark brown stripe running down the back of its neck, and a distinctive curly pale yellow crest on the very top. The bird's round, dark brown body shares the same repeating white V-shaped feather pattern as its leg and is accented by a folded wing resting on its side, made up of cleanly layered light blue and grey feathers. A tail composed of four stiff, straight dark brown feathers extends directly backward. Thin white horizontal speed lines trail behind the back wheel and the bird's tail, emphasizing swift forward motion. The bicycle features a classic diamond frame, large wheels with thin black tires, grey rims, and detailed silver spokes, along with a clearly visible front chainring, silver chain, and rear cog. The whimsical scene is set against a clear light blue sky featuring two small, fluffy white clouds on the left and a large, pale yellow sun in the upper right corner that radiates soft, concentric, semi-transparent pastel green and yellow halos. A solid, darker brown shadow is cast directly beneath the bicycle's wheels on the minimalist two-toned brown ground." 
src="https://static.simonwillison.net/static/2026/gemini-3-deep-think-complex-pelican.png" /&gt;&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=46991240"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="gemini"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/></entry><entry><title>Quoting Andrej Karpathy</title><link href="https://simonwillison.net/2025/Dec/19/andrej-karpathy/#atom-tag" rel="alternate"/><published>2025-12-19T23:07:52+00:00</published><updated>2025-12-19T23:07:52+00:00</updated><id>https://simonwillison.net/2025/Dec/19/andrej-karpathy/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://karpathy.bearblog.dev/year-in-review-2025/"&gt;&lt;p&gt;In 2025, Reinforcement Learning from Verifiable Rewards (RLVR) emerged as the de facto new major stage to add to this mix. By training LLMs against automatically verifiable rewards across a number of environments (e.g. think math/code puzzles), the LLMs spontaneously develop strategies that look like "reasoning" to humans - they learn to break down problem solving into intermediate calculations and they learn a number of problem solving strategies for going back and forth to figure things out (see DeepSeek R1 paper for examples).&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://karpathy.bearblog.dev/year-in-review-2025/"&gt;Andrej Karpathy&lt;/a&gt;, 2025 LLM Year in Review&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/andrej-karpathy"&gt;andrej-karpathy&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/definitions"&gt;definitions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deepseek"&gt;deepseek&lt;/a&gt;&lt;/p&gt;



</summary><category term="andrej-karpathy"/><category term="generative-ai"/><category term="llm-reasoning"/><category term="definitions"/><category term="ai"/><category term="llms"/><category term="deepseek"/></entry><entry><title>DeepSeek-V3.2</title><link href="https://simonwillison.net/2025/Dec/1/deepseek-v32/#atom-tag" rel="alternate"/><published>2025-12-01T23:56:19+00:00</published><updated>2025-12-01T23:56:19+00:00</updated><id>https://simonwillison.net/2025/Dec/1/deepseek-v32/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://api-docs.deepseek.com/news/news251201"&gt;DeepSeek-V3.2&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Two new open weight (MIT licensed) models from DeepSeek today: &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V3.2"&gt;DeepSeek-V3.2&lt;/a&gt; and &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Speciale"&gt;DeepSeek-V3.2-Speciale&lt;/a&gt;, both 690GB, 685B parameters. Here's the &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V3.2/resolve/main/assets/paper.pdf"&gt;PDF tech report&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;DeepSeek-V3.2 is DeepSeek's new flagship model, now running on &lt;a href="https://chat.deepseek.com"&gt;chat.deepseek.com&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The difference between the two new models is best explained by this paragraph from the technical report:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;DeepSeek-V3.2 integrates reasoning, agent, and human alignment data distilled from specialists, undergoing thousands of steps of continued RL training to reach the final checkpoints. To investigate the potential of extended thinking, we also developed an experimental variant, DeepSeek-V3.2-Speciale. This model was trained exclusively on reasoning data with a reduced length penalty during RL. Additionally, we incorporated the dataset and reward method from DeepSeekMath-V2 (Shao et al., 2025) to enhance capabilities in mathematical proofs.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I covered &lt;a href="https://simonwillison.net/2025/Nov/27/deepseek-math-v2/"&gt;DeepSeek-Math-V2 last week&lt;/a&gt;. Like that model, DeepSeek-V3.2-Speciale also scores gold on the 2025 International Mathematical Olympiad so beloved of model training teams!&lt;/p&gt;
&lt;p id="pelicans"&gt;I tried both models on "Generate an SVG of a pelican riding a bicycle" using the chat feature of [OpenRouter](https://openrouter.ai/). DeepSeek V3.2 produced this very short reasoning chain:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Let's assume the following:&lt;/p&gt;
&lt;p&gt;Wheel radius: 40&lt;br&gt;
Distance between wheel centers: 180&lt;br&gt;
Seat height: 60 (above the rear wheel center)&lt;br&gt;
Handlebars: above the front wheel, extending back and up.&lt;/p&gt;
&lt;p&gt;We'll set the origin at the center of the rear wheel.&lt;/p&gt;
&lt;p&gt;We'll create the SVG with a viewBox that fits the entire drawing.&lt;/p&gt;
&lt;p&gt;Let's start by setting up the SVG.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Followed by this illustration:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Pleasing gradents for the sky and ground and sun. Neat three-circle clouds. A Pelican on a Bicycle title printed on the image. The pelican is cute but stlightly detached from the bicycle. The bicycle has a somewhat mangled brown frame." src="https://static.simonwillison.net/static/2025/deepseek-v32.png" /&gt;&lt;/p&gt;
&lt;p&gt;Here's what I got from the Speciale model, which thought deeply about the geometry of bicycles and pelicans for &lt;a href="https://gist.githubusercontent.com/simonw/3debaf0df67c2d99a36f41f21ffe534c/raw/fbbb60c6d5b6f02d539ade5105b990490a81a86d/svg.txt"&gt;a very long time (at least 10 minutes)&lt;/a&gt; before spitting out this result:&lt;/p&gt;
&lt;p&gt;&lt;img alt="It's not great. The bicycle is distorted, the pelican is a white oval, an orange almost-oval beak, a little black eye and setched out straight line limbs leading to the pedal and handlebars." src="https://static.simonwillison.net/static/2025/deepseek-v32-speciale.png" /&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=46108780"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deepseek"&gt;deepseek&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="deepseek"/><category term="llm-release"/><category term="openrouter"/><category term="ai-in-china"/></entry><entry><title>deepseek-ai/DeepSeek-Math-V2</title><link href="https://simonwillison.net/2025/Nov/27/deepseek-math-v2/#atom-tag" rel="alternate"/><published>2025-11-27T15:59:23+00:00</published><updated>2025-11-27T15:59:23+00:00</updated><id>https://simonwillison.net/2025/Nov/27/deepseek-math-v2/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-Math-V2"&gt;deepseek-ai/DeepSeek-Math-V2&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
New on Hugging Face, a specialist mathematical reasoning LLM from DeepSeek. This is their entry in the space previously dominated by proprietary models from OpenAI and Google DeepMind, both of which &lt;a href="https://simonwillison.net/2025/Jul/21/gemini-imo/"&gt;achieved gold medal scores&lt;/a&gt; on the International Mathematical Olympiad earlier this year.&lt;/p&gt;
&lt;p&gt;We now have an open weights (Apache 2 licensed) 685B, 689GB model that can achieve the same. From the &lt;a href="https://github.com/deepseek-ai/DeepSeek-Math-V2/blob/main/DeepSeekMath_V2.pdf"&gt;accompanying paper&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;DeepSeekMath-V2 demonstrates strong performance on competition mathematics. With scaled test-time compute, it achieved gold-medal scores in high-school competitions including IMO 2025 and CMO 2024, and a near-perfect score on the undergraduate Putnam 2024 competition.&lt;/p&gt;
&lt;/blockquote&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/mathematics"&gt;mathematics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deepseek"&gt;deepseek&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="mathematics"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm-reasoning"/><category term="deepseek"/><category term="llm-release"/><category term="ai-in-china"/></entry><entry><title>Olmo 3 is a fully open LLM</title><link href="https://simonwillison.net/2025/Nov/22/olmo-3/#atom-tag" rel="alternate"/><published>2025-11-22T23:59:46+00:00</published><updated>2025-11-22T23:59:46+00:00</updated><id>https://simonwillison.net/2025/Nov/22/olmo-3/#atom-tag</id><summary type="html">
    &lt;p&gt;Olmo is the LLM series from Ai2 - the &lt;a href="https://allenai.org/"&gt;Allen Institute for AI&lt;/a&gt;. Unlike most open weight models, these are notable for including the full training data, training process and checkpoints with each release.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://allenai.org/blog/olmo3"&gt;new Olmo 3&lt;/a&gt; claims to be "the best fully open 32B-scale thinking model" and has a strong focus on interpretability:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;At its center is &lt;strong&gt;Olmo 3-Think (32B)&lt;/strong&gt;, the best fully open 32B-scale thinking model that for the first time lets you inspect intermediate reasoning traces and trace those behaviors back to the data and training decisions that produced them.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;They've released four 7B models - Olmo 3-Base, Olmo 3-Instruct, Olmo 3-Think and Olmo 3-RL Zero, plus 32B variants of the 3-Think and 3-Base models.&lt;/p&gt;
&lt;p&gt;Having full access to the training data is really useful. Here's how they describe that:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Olmo 3 is pretrained on &lt;strong&gt;Dolma 3&lt;/strong&gt;, a new ~9.3-trillion-token corpus drawn from web pages, science PDFs processed with &lt;a href="https://olmocr.allenai.org/"&gt;olmOCR&lt;/a&gt;, codebases, math problems and solutions, and encyclopedic text. From this pool, we construct &lt;strong&gt;Dolma 3 Mix&lt;/strong&gt;, a 5.9-trillion-token (~6T) pretraining mix with a higher proportion of coding and mathematical data than earlier Dolma releases, plus much stronger decontamination via extensive deduplication, quality filtering, and careful control over data mixing. We follow established web standards in collecting training data and don't collect from sites that explicitly disallow it, including paywalled content.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;They also highlight that they are training on fewer tokens than their competition:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[...] it's the strongest fully open thinking model we're aware of, narrowing the gap to the best open-weight models of similar scale – such as Qwen 3 32B – while training on roughly 6x fewer tokens.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;If you're continuing to hold out hope for a model trained entirely on licensed data this one sadly won't fit the bill - a lot of that data still comes from a crawl of the web.&lt;/p&gt;
&lt;p&gt;I tried out the 32B Think model and the 7B Instruct model &lt;a href="https://lmstudio.ai/models/olmo3"&gt;using LM Studio&lt;/a&gt;. The 7B model is a 4.16GB download, the 32B one is 18.14GB.&lt;/p&gt;
&lt;p&gt;The 32B model is absolutely an over-thinker! I asked it to "Generate an SVG of a pelican riding a bicycle" and it thought for &lt;em&gt;14 minutes 43 seconds&lt;/em&gt;, outputting 8,437 tokens total most of which was &lt;a href="https://gist.github.com/simonw/2ae9d5ed71de9608b7955eea9671306f"&gt;this epic thinking trace&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I don't usually quote the full SVG in these write-ups, but in this case it's short enough that I think it's worth sharing. The SVG comments give a great impression of what it was trying to do - it has a Bicycle, Bike frame, Pelican, Left and Right wings and even "Feet on pedals".&lt;/p&gt;
&lt;div class="highlight highlight-text-xml-svg"&gt;&lt;pre&gt;&amp;lt;&lt;span class="pl-ent"&gt;svg&lt;/span&gt; &lt;span class="pl-e"&gt;width&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;200&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;height&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;200&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;viewBox&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;0 0 100 100&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;
  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Bicycle &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  &amp;lt;&lt;span class="pl-ent"&gt;circle&lt;/span&gt; &lt;span class="pl-e"&gt;cx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;30&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;cy&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;60&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;r&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;15&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;black&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;none&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;circle&lt;/span&gt; &lt;span class="pl-e"&gt;cx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;70&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;cy&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;60&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;r&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;15&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;black&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;none&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  
  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Bike frame &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  &amp;lt;&lt;span class="pl-ent"&gt;rect&lt;/span&gt; &lt;span class="pl-e"&gt;x&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;35&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;25&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;width&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;30&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;height&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;10&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;saddlebrown&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;line&lt;/span&gt; &lt;span class="pl-e"&gt;x1&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;35&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y1&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;40&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;x2&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;30&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y2&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;60&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;black&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke-width&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;3&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;line&lt;/span&gt; &lt;span class="pl-e"&gt;x1&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;65&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y1&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;40&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;x2&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;70&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y2&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;60&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;black&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke-width&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;3&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  
  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Pelican &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  &amp;lt;&lt;span class="pl-ent"&gt;ellipse&lt;/span&gt; &lt;span class="pl-e"&gt;cx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;55&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;cy&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;65&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;rx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;20&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;ry&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;15&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;white&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;polygon&lt;/span&gt; &lt;span class="pl-e"&gt;points&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;52 50,57 35,62 50&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;black&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt; &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Head/beak &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  &amp;lt;&lt;span class="pl-ent"&gt;circle&lt;/span&gt; &lt;span class="pl-e"&gt;cx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;55&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;cy&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;45&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;r&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;2&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;white&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;circle&lt;/span&gt; &lt;span class="pl-e"&gt;cx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;60&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;cy&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;45&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;r&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;2&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;white&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  
  &amp;lt;&lt;span class="pl-ent"&gt;polygon&lt;/span&gt; &lt;span class="pl-e"&gt;points&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;45 60,50 70,55 60&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;lightgrey&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt; &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Left wing &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  &amp;lt;&lt;span class="pl-ent"&gt;polygon&lt;/span&gt; &lt;span class="pl-e"&gt;points&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;65 60,70 70,55 60&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;lightgrey&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt; &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Right wing &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  
  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Feet on pedals &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  &amp;lt;&lt;span class="pl-ent"&gt;polygon&lt;/span&gt; &lt;span class="pl-e"&gt;points&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;25 75,30 85,35 75&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;black&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;polygon&lt;/span&gt; &lt;span class="pl-e"&gt;points&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;75 75,70 85,65 75&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;black&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
&amp;lt;/&lt;span class="pl-ent"&gt;svg&lt;/span&gt;&amp;gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Rendered, it looks like this:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/olmo3-32b-pelican.jpg" alt="Two circles, each with a triangle sticking out from the bottom. They have bars leading up to a brown box. Overlapping them is a black triangle with white circles for eyes and two grey triangles that are probably meant to be wings. It is not recognizable as a pelican or a bicycle." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I tested OLMo 2 32B 4bit &lt;a href="https://simonwillison.net/2025/Mar/16/olmo2/"&gt;back in March&lt;/a&gt; and got something that, while pleasingly abstract, didn't come close to resembling a pelican or a bicycle:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/olmo2-pelican.jpg" alt="Blue and black wiggly lines looking more like a circuit diagram than a pelican riding a bicycle" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;To be fair, 32B models generally don't do great with this. Here's Qwen 3 32B's attempt (I ran that just now &lt;a href="https://openrouter.ai/chat?models=qwen/qwen3-32b"&gt;using OpenRouter&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/qwen3-32b-pelican.png" alt="The bicycle is two black circles joined by two lines, with a weird rectangular saddle perched on top The pelican is a blue oval, a white circles with a yellow triangle in it and a weird eye shaped oval overlapping the blue one." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4 id="olmotrace"&gt;OlmoTrace&lt;/h4&gt;
&lt;p&gt;I was particularly keen on trying out the ability to "inspect intermediate reasoning traces". Here's how that's described later in the announcement:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A core goal of Olmo 3 is not just to &lt;em&gt;open&lt;/em&gt; the model flow, but to make it &lt;em&gt;actionable&lt;/em&gt; for people who want to understand and improve model behavior. Olmo 3 integrates with &lt;a href="https://allenai.org/blog/olmotrace"&gt;&lt;strong&gt;OlmoTrace&lt;/strong&gt;&lt;/a&gt;, our tool for tracing model outputs back to training data in real time.&lt;/p&gt;
&lt;p&gt;For example, in the Ai2 Playground, you can ask Olmo 3-Think (32B) to answer a general-knowledge question, then use OlmoTrace to inspect where and how the model may have learned to generate parts of its response. This closes the gap between training data and model behavior: you can see not only what the model is doing, but why---and adjust data or training decisions accordingly.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;You can access OlmoTrace via &lt;a href="https://playground.allenai.org/"&gt;playground.allenai.org&lt;/a&gt;, by first running a prompt and then clicking the "Show OlmoTrace" button below the output.&lt;/p&gt;
&lt;p&gt;I tried that on "Generate a conference bio for Simon Willison" (an ego-prompt I use to see how much the models have picked up about me from their training data) and got back a result that looked like this:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/olmotrace.jpg" alt="Screenshot of AI2 Playground interface with Olmo 3 32B Think model selected. User prompt reads &amp;quot;Generate a conference bio for Simon Willison&amp;quot;. AI response shows expanded &amp;quot;Thoughts&amp;quot; section: &amp;quot;Okay, the user wants me to generate a conference bio for Simon Willison. Let me start by recalling who Simon is. He's a well-known developer, entrepreneur, and author in the tech community. I remember he's the co-founder of co:here and now works at Anthropic. He's famous for his work with SQLite, especially the datasette project. Also, he writes a blog called SimonWillison.net which is quite popular. First, I need to structure a typical conference bio. These usually include his current role, past positions, notable projects, and maybe some achievements or recognitions. I should highlight his key contributions like datasette, which allows easy exploration of SQLite databases via a web interface. That's a big deal in data journalism and open data circles.&amp;quot; Right panel shows &amp;quot;OlmoTrace&amp;quot; feature described as &amp;quot;Documents from the training data that have exact text matches with the model response. Powered by infini-gram&amp;quot;. First document excerpt discusses technology and innovation, with highlighted match text &amp;quot;societal implications of technology, emphasizing the&amp;quot; shown in bold, surrounded by text about responsibility and merging innovation with intellect. Second document excerpt about Matt Hall has highlighted match &amp;quot;is a software engineer and entrepreneur based in&amp;quot; shown in bold, describing someone in New York City who co-founded a PFP collection and works at Google Creative Lab. Note indicates &amp;quot;Document repeated 2 times in result&amp;quot; with &amp;quot;View all repeated documents&amp;quot; link." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;It thinks I co-founded co:here and work at Anthropic, both of which are incorrect - but that's not uncommon with LLMs: I frequently see them suggest that I'm the CTO of GitHub and other such inaccuracies.&lt;/p&gt;
&lt;p&gt;I found the OlmoTrace panel on the right disappointing. None of the training documents it highlighted looked relevant - it appears to be looking for phrase matches (powered by &lt;a href="https://infini-gram.io/"&gt;Ai2's infini-gram&lt;/a&gt;) but the documents it found had nothing to do with me at all.&lt;/p&gt;
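&lt;p&gt;That behavior makes sense given how exact-phrase matching works: it surfaces documents sharing verbatim word sequences with the response, regardless of topic. Here's a minimal, hypothetical Python sketch of that kind of matching - a brute-force illustration only, nothing like the real infini-gram index (the corpus and response strings below are made up for the example):&lt;/p&gt;

```python
# Toy sketch of exact-phrase matching in the spirit of OlmoTrace/infini-gram:
# find the longest word n-grams from a response that appear verbatim in a
# (hypothetical) corpus. Real infini-gram uses a suffix-array index instead.

def longest_exact_matches(response, corpus_docs, min_words=4):
    words = response.split()
    matches = []
    n = len(words)
    i = 0
    while i < n:
        best = None
        # Try the longest phrase starting at position i first, then shrink
        for j in range(n, i + min_words - 1, -1):
            phrase = " ".join(words[i:j])
            hits = [d for d in corpus_docs if phrase in d]
            if hits:
                best = (phrase, len(hits))
                break
        if best:
            matches.append(best)
            i += len(best[0].split())  # skip past the matched span
        else:
            i += 1
    return matches

corpus = [
    "Jane Doe is a software engineer and entrepreneur based in New York.",
    "The talk covered the societal implications of technology at length.",
]
response = "Simon Willison is a software engineer and entrepreneur based in San Francisco."
print(longest_exact_matches(response, corpus))
```

&lt;p&gt;The match it finds is a generic biography phrase from a document about a completely different person - which is exactly the kind of topically irrelevant result the OlmoTrace panel showed.&lt;/p&gt;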
&lt;h4 id="can-open-training-data-address-concerns-of-backdoors-"&gt;Can open training data address concerns of backdoors?&lt;/h4&gt;
&lt;p&gt;Ai2 claim that Olmo 3 is "the best fully open 32B-scale thinking model", which I think holds up provided you define "fully open" as including open training data. There's not a great deal of competition in that space though - Ai2 compare themselves to &lt;a href="https://marin.community/"&gt;Stanford's Marin&lt;/a&gt; and &lt;a href="https://www.swiss-ai.org/apertus"&gt;Swiss AI's Apertus&lt;/a&gt;, neither of which I'd heard about before.&lt;/p&gt;
&lt;p&gt;A big disadvantage of other open weight models is that it's impossible to audit their training data. Anthropic published a paper last month showing that &lt;a href="https://www.anthropic.com/research/small-samples-poison"&gt;a small number of samples can poison LLMs of any size&lt;/a&gt; - it can take just "250 poisoned documents" to add a backdoor to a large model that triggers undesired behavior in response to a short, carefully crafted prompt.&lt;/p&gt;

&lt;p&gt;This makes fully open training data an even bigger deal.&lt;/p&gt;

&lt;p&gt;Ai2 researcher Nathan Lambert included this note about the importance of transparent training data in &lt;a href="https://www.interconnects.ai/p/olmo-3-americas-truly-open-reasoning"&gt;his detailed post about the release&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;In particular, we're excited about the future of RL Zero research on Olmo 3 precisely because everything is open. Researchers can study the interaction between the reasoning traces we include at midtraining and the downstream model behavior (qualitative and quantitative).&lt;/p&gt;

&lt;p&gt;This helps answer questions that have plagued RLVR results on Qwen models, hinting at forms of data contamination particularly on math and reasoning benchmarks (see Shao, Rulin, et al. "Spurious rewards: Rethinking training signals in rlvr." &lt;a href="https://arxiv.org/abs/2506.10947"&gt;arXiv preprint arXiv:2506.10947&lt;/a&gt; (2025). or Wu, Mingqi, et al. "Reasoning or memorization? unreliable results of reinforcement learning due to data contamination." &lt;a href="https://arxiv.org/abs/2507.10532"&gt;arXiv preprint arXiv:2507.10532&lt;/a&gt; (2025).)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I hope we see more competition in this space, including further models in the Olmo series. The improvements from Olmo 1 (in &lt;a href="https://simonwillison.net/2024/Feb/2/olmos/"&gt;February 2024&lt;/a&gt;) and Olmo 2 (in &lt;a href="https://simonwillison.net/2025/Mar/16/olmo2/"&gt;March 2025&lt;/a&gt;) have been significant. I'm hoping that trend continues!&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/interpretability"&gt;interpretability&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai2"&gt;ai2&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nathan-lambert"&gt;nathan-lambert&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/olmo"&gt;olmo&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="interpretability"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="ai2"/><category term="ai-ethics"/><category term="llm-release"/><category term="lm-studio"/><category term="nathan-lambert"/><category term="olmo"/></entry><entry><title>Trying out Gemini 3 Pro with audio transcription and a new pelican benchmark</title><link href="https://simonwillison.net/2025/Nov/18/gemini-3/#atom-tag" rel="alternate"/><published>2025-11-18T19:00:48+00:00</published><updated>2025-11-18T19:00:48+00:00</updated><id>https://simonwillison.net/2025/Nov/18/gemini-3/#atom-tag</id><summary type="html">
    &lt;p&gt;Google released Gemini 3 Pro today. Here's &lt;a href="https://blog.google/products/gemini/gemini-3/"&gt;the announcement from Sundar Pichai, Demis Hassabis, and Koray Kavukcuoglu&lt;/a&gt;, their &lt;a href="https://blog.google/technology/developers/gemini-3-developers/"&gt;developer blog announcement from Logan Kilpatrick&lt;/a&gt;, the &lt;a href="https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Pro-Model-Card.pdf"&gt;Gemini 3 Pro Model Card&lt;/a&gt;, and their &lt;a href="https://blog.google/products/gemini/gemini-3-collection/"&gt;collection of 11 more articles&lt;/a&gt;. It's a big release!&lt;/p&gt;
&lt;p&gt;I had a few days of preview access to this model via &lt;a href="https://aistudio.google.com/"&gt;AI Studio&lt;/a&gt;. The best way to describe it is that it's &lt;strong&gt;Gemini 2.5 upgraded to match the leading rival models&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Gemini 3 has the same underlying characteristics as Gemini 2.5. The knowledge cutoff is the same (January 2025). It accepts 1 million input tokens, can output up to 64,000 tokens, and has multimodal inputs across text, images, audio, and video.&lt;/p&gt;
&lt;h4 id="benchmarks"&gt;Benchmarks&lt;/h4&gt;
&lt;p&gt;Google's own reported numbers (in &lt;a href="https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Pro-Model-Card.pdf"&gt;the model card&lt;/a&gt;) show it scoring slightly higher than Claude Sonnet 4.5 and GPT-5.1 on most of the standard benchmarks. As always I'm waiting for independent confirmation, but I have no reason to believe those numbers are inaccurate.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gemini-3-benchmarks.jpg" alt="Table of benchmark numbers, described in full below" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4 id="pricing"&gt;Pricing&lt;/h4&gt;
&lt;p&gt;In terms of pricing, it's a little more expensive than Gemini 2.5 but still cheaper than Claude Sonnet 4.5. Here's how it fits in with those other leading models:&lt;/p&gt;
&lt;center&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Model&lt;/th&gt;
      &lt;th&gt;Input (per 1M tokens)&lt;/th&gt;
      &lt;th&gt;Output (per 1M tokens)&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;GPT-5.1&lt;/td&gt;
      &lt;td&gt;$1.25&lt;/td&gt;
      &lt;td&gt;$10.00&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Gemini 2.5 Pro&lt;/td&gt;
      &lt;td&gt;
        ≤ 200k tokens: $1.25&lt;br /&gt;
        &amp;gt; 200k tokens: $2.50
      &lt;/td&gt;
      &lt;td&gt;
        ≤ 200k tokens: $10.00&lt;br /&gt;
        &amp;gt; 200k tokens: $15.00
      &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Gemini 3 Pro&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;
        ≤ 200k tokens: $2.00&lt;br /&gt;
        &amp;gt; 200k tokens: $4.00
      &lt;/td&gt;
      &lt;td&gt;
        ≤ 200k tokens: $12.00&lt;br /&gt;
        &amp;gt; 200k tokens: $18.00
      &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Claude Sonnet 4.5&lt;/td&gt;
      &lt;td&gt;
        ≤ 200k tokens: $3.00&lt;br /&gt;
        &amp;gt; 200k tokens: $6.00
      &lt;/td&gt;
      &lt;td&gt;
        ≤ 200k tokens: $15.00&lt;br /&gt;
        &amp;gt; 200k tokens: $22.50
      &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Claude Opus 4.1&lt;/td&gt;
      &lt;td&gt;$15.00&lt;/td&gt;
      &lt;td&gt;$75.00&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/center&gt;
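&lt;p&gt;To make the tiered pricing concrete, here's a rough Python sketch of the cost calculation. One assumption on my part: as with Gemini 2.5's tiers, I'm treating the whole request as billed at the higher rate once the prompt exceeds 200k tokens.&lt;/p&gt;

```python
# Rough sketch of tiered per-token pricing from the table above.
# Assumption: once input exceeds 200k tokens, the whole request is billed
# at the long-context rate (how Gemini 2.5's tiers worked).

PRICES = {  # (input_low, input_high, output_low, output_high) per 1M tokens
    "gemini-3-pro": (2.00, 4.00, 12.00, 18.00),
    "gemini-2.5-pro": (1.25, 2.50, 10.00, 15.00),
    "claude-sonnet-4.5": (3.00, 6.00, 15.00, 22.50),
}

def cost_usd(model, input_tokens, output_tokens, threshold=200_000):
    in_low, in_high, out_low, out_high = PRICES[model]
    long_context = input_tokens > threshold
    in_rate = in_high if long_context else in_low
    out_rate = out_high if long_context else out_low
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# A 10,000-token prompt producing 1,000 tokens of output on Gemini 3 Pro:
print(round(cost_usd("gemini-3-pro", 10_000, 1_000), 4))  # 0.032
```

&lt;p&gt;Note how a prompt that crosses the 200k threshold flips &lt;em&gt;both&lt;/em&gt; the input and output rates to the higher tier, so long-context calls cost noticeably more than the token count alone would suggest.&lt;/p&gt;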
&lt;h4 id="trying-it-out-against-a-complex-image"&gt;Trying it out against a complex image&lt;/h4&gt;
&lt;p&gt;That screenshot of the benchmarks from above looked like a good test for Gemini 3's multimodal support. I fed it that image URL and asked it to generate alt text for the image:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;llm -m gemini-3-pro-preview -a https://static.simonwillison.net/static/2025/gemini-3-benchmarks.jpg 'Alt text for this image, include all figures and make them comprehensible to a screen reader user'&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's what I got back:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A comprehensive table comparing four AI models across various performance benchmarks. The columns listed are: Benchmark Name, Description, Gemini 3 Pro (highlighted in blue), Gemini 2.5 Pro, Claude Sonnet 4.5, and GPT-5.1.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Humanity's Last Exam (Academic reasoning)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;No tools: Gemini 3 Pro 37.5%, Gemini 2.5 Pro 21.6%, Claude Sonnet 4.5 13.7%, GPT-5.1 26.5%.&lt;/li&gt;
&lt;li&gt;With search and code execution: Gemini 3 Pro 45.8% (others have no data).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;ARC-AGI-2 (Visual reasoning puzzles; ARC Prize Verified)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 31.1%, Gemini 2.5 Pro 4.9%, Claude Sonnet 4.5 13.6%, GPT-5.1 17.6%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;GPQA Diamond (Scientific knowledge; No tools)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 91.9%, Gemini 2.5 Pro 86.4%, Claude Sonnet 4.5 83.4%, GPT-5.1 88.1%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;AIME 2025 (Mathematics)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;No tools: Gemini 3 Pro 95.0%, Gemini 2.5 Pro 88.0%, Claude Sonnet 4.5 87.0%, GPT-5.1 94.0%.&lt;/li&gt;
&lt;li&gt;With code execution: Gemini 3 Pro 100%, Claude Sonnet 4.5 100%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;MathArena Apex (Challenging Math Contest problems)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 23.4%, Gemini 2.5 Pro 0.5%, Claude Sonnet 4.5 1.6%, GPT-5.1 1.0%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;MMMU-Pro (Multimodal understanding and reasoning)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 81.0%, Gemini 2.5 Pro 68.0%, Claude Sonnet 4.5 68.0%, GPT-5.1 76.0%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;ScreenSpot-Pro (Screen understanding)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 72.7%, Gemini 2.5 Pro 11.4%, Claude Sonnet 4.5 36.2%, GPT-5.1 3.5%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;CharXiv Reasoning (Information synthesis from complex charts)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 81.4%, Gemini 2.5 Pro 69.6%, Claude Sonnet 4.5 68.5%, GPT-5.1 69.5%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;OmniDocBench 1.5 (OCR; Overall Edit Distance, lower is better)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 0.115, Gemini 2.5 Pro 0.145, Claude Sonnet 4.5 0.145, GPT-5.1 0.147.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Video-MMMU (Knowledge acquisition from videos)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 87.6%, Gemini 2.5 Pro 83.6%, Claude Sonnet 4.5 77.8%, GPT-5.1 80.4%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;LiveCodeBench Pro (Competitive coding problems; Elo Rating, higher is better)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 2,439; Gemini 2.5 Pro 1,775; Claude Sonnet 4.5 1,418; GPT-5.1 2,243.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Terminal-Bench 2.0 (Agentic terminal coding; Terminus-2 agent)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 54.2%, Gemini 2.5 Pro 32.6%, Claude Sonnet 4.5 42.8%, GPT-5.1 47.6%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;SWE-Bench Verified (Agentic coding; Single attempt)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 76.2%, Gemini 2.5 Pro 59.6%, Claude Sonnet 4.5 77.2%, GPT-5.1 76.3%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;t2-bench (Agentic tool use)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 85.4%, Gemini 2.5 Pro 54.9%, Claude Sonnet 4.5 84.7%, GPT-5.1 80.2%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Vending-Bench 2 (Long-horizon agentic tasks; Net worth (mean), higher is better)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro $5,478.16; Gemini 2.5 Pro $573.64; Claude Sonnet 4.5 $3,838.74; GPT-5.1 $1,473.43.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;FACTS Benchmark Suite (Held out internal grounding, parametric, MM, and search retrieval benchmarks)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 70.5%, Gemini 2.5 Pro 63.4%, Claude Sonnet 4.5 50.4%, GPT-5.1 50.8%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;SimpleQA Verified (Parametric knowledge)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 72.1%, Gemini 2.5 Pro 54.5%, Claude Sonnet 4.5 29.3%, GPT-5.1 34.9%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;MMMLU (Multilingual Q&amp;amp;A)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 91.8%, Gemini 2.5 Pro 89.5%, Claude Sonnet 4.5 89.1%, GPT-5.1 91.0%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Global PIQA (Commonsense reasoning across 100 Languages and Cultures)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 93.4%, Gemini 2.5 Pro 91.5%, Claude Sonnet 4.5 90.1%, GPT-5.1 90.9%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;MRCR v2 (8-needle) (Long context performance)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;128k (average): Gemini 3 Pro 77.0%, Gemini 2.5 Pro 58.0%, Claude Sonnet 4.5 47.1%, GPT-5.1 61.6%.&lt;/li&gt;
&lt;li&gt;1M (pointwise): Gemini 3 Pro 26.3%, Gemini 2.5 Pro 16.4%, Claude Sonnet 4.5 (not supported), GPT-5.1 (not supported).&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;I have not checked every line of this but a loose spot-check looks accurate to me.&lt;/p&gt;
&lt;p&gt;That prompt took 1,105 input and 3,901 output tokens, at a cost of &lt;a href="https://www.llm-prices.com/#it=1105&amp;amp;cit=3901&amp;amp;ot=3901&amp;amp;ic=2&amp;amp;oc=12&amp;amp;sel=gemini-3-pro-preview"&gt;5.6824 cents&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I ran this follow-up prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;llm -c 'Convert to JSON'&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;You can see &lt;a href="https://gist.github.com/simonw/ea7d52706557528e7eb3912cdf9250b0#response-1"&gt;the full output here&lt;/a&gt;, which starts like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
  &lt;span class="pl-ent"&gt;"metadata"&lt;/span&gt;: {
    &lt;span class="pl-ent"&gt;"columns"&lt;/span&gt;: [
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Benchmark&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Description&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Gemini 3 Pro&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Gemini 2.5 Pro&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Claude Sonnet 4.5&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;GPT-5.1&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
    ]
  },
  &lt;span class="pl-ent"&gt;"benchmarks"&lt;/span&gt;: [
    {
      &lt;span class="pl-ent"&gt;"name"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Humanity's Last Exam&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"description"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Academic reasoning&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"sub_results"&lt;/span&gt;: [
        {
          &lt;span class="pl-ent"&gt;"condition"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;No tools&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"gemini_3_pro"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;37.5%&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"gemini_2_5_pro"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;21.6%&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"claude_sonnet_4_5"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;13.7%&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"gpt_5_1"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;26.5%&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
        },
        {
          &lt;span class="pl-ent"&gt;"condition"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;With search and code execution&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"gemini_3_pro"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;45.8%&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"gemini_2_5_pro"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"claude_sonnet_4_5"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"gpt_5_1"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;
        }
      ]
    },&lt;/pre&gt;&lt;/div&gt;
&lt;h4 id="analyzing-a-city-council-meeting"&gt;Analyzing a city council meeting&lt;/h4&gt;
&lt;p&gt;To try it out against an audio file I extracted the 3h33m of audio from the video &lt;a href="https://www.youtube.com/watch?v=qgJ7x7R6gy0"&gt;Half Moon Bay City Council Meeting - November 4, 2025&lt;/a&gt;. I used &lt;code&gt;yt-dlp&lt;/code&gt; to get that audio:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;yt-dlp -x --audio-format m4a &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;https://www.youtube.com/watch?v=qgJ7x7R6gy0&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;That gave me a 74MB m4a file, which I ran through Gemini 3 Pro like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;llm -m gemini-3-pro-preview -a /tmp/HMBCC\ 11⧸4⧸25\ -\ Half\ Moon\ Bay\ City\ Council\ Meeting\ -\ November\ 4,\ 2025\ \[qgJ7x7R6gy0\].m4a 'Output a Markdown transcript of this meeting. Include speaker names and timestamps. Start with an outline of the key meeting sections, each with a title and summary and timestamp and list of participating names. Note in bold if anyone raised their voices, interrupted each other or had disagreements. Then follow with the full transcript.'&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That failed with an "Internal error encountered" message, so I shrunk the file down to a more manageable 38MB using &lt;code&gt;ffmpeg&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;ffmpeg -i &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;/private/tmp/HMB.m4a&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; -ac 1 -ar 22050 -c:a aac -b:a 24k &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;/private/tmp/HMB_compressed.m4a&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then ran it again like this (for some reason I had to use &lt;code&gt;--attachment-type&lt;/code&gt; this time):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;llm -m gemini-3-pro-preview --attachment-type /tmp/HMB_compressed.m4a 'audio/aac' 'Output a Markdown transcript of this meeting. Include speaker names and timestamps. Start with an outline of the key meeting sections, each with a title and summary and timestamp and list of participating names. Note in bold if anyone raised their voices, interrupted each other or had disagreements. Then follow with the full transcript.'&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This time it worked! The &lt;a href="https://gist.github.com/simonw/0b7bc23adb6698f376aebfd700943314"&gt;full output is here&lt;/a&gt;, but it starts like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Here is the transcript of the Half Moon Bay City Council meeting.&lt;/p&gt;
&lt;h4&gt;Meeting Outline&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;1. Call to Order, Updates, and Public Forum&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Summary:&lt;/strong&gt; Mayor Brownstone calls the meeting to order. City Manager Chidester reports no reportable actions from the closed session. Announcements are made regarding food insecurity volunteers and the Diwali celebration. During the public forum, Councilmember Penrose (speaking as a citizen) warns against autocracy. Citizens speak regarding lease agreements, downtown maintenance, local music events, and homelessness outreach statistics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timestamp:&lt;/strong&gt; 00:00:00 - 00:13:25&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Participants:&lt;/strong&gt; Mayor Brownstone, Matthew Chidester, Irma Acosta, Deborah Penrose, Jennifer Moore, Sandy Vella, Joaquin Jimenez, Anita Rees.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;2. Consent Calendar&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Summary:&lt;/strong&gt; The Council approves minutes from previous meetings and a resolution authorizing a licensing agreement for Seahorse Ranch. Councilmember Johnson corrects a pull request regarding abstentions on minutes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timestamp:&lt;/strong&gt; 00:13:25 - 00:15:15&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Participants:&lt;/strong&gt; Mayor Brownstone, Councilmember Johnson, Councilmember Penrose, Vice Mayor Ruddick, Councilmember Nagengast.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;3. Ordinance Introduction: Commercial Vitality (Item 9A)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Summary:&lt;/strong&gt; Staff presents a new ordinance to address neglected and empty commercial storefronts, establishing maintenance and display standards. Councilmembers discuss enforcement mechanisms, window cleanliness standards, and the need for objective guidance documents to avoid subjective enforcement.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timestamp:&lt;/strong&gt; 00:15:15 - 00:30:45&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Participants:&lt;/strong&gt; Karen Decker, Councilmember Johnson, Councilmember Nagengast, Vice Mayor Ruddick, Councilmember Penrose.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;4. Ordinance Introduction: Building Standards &amp;amp; Electrification (Item 9B)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Summary:&lt;/strong&gt; Staff introduces updates to the 2025 Building Code. A major change involves repealing the city's all-electric building requirement due to the 9th Circuit Court ruling (&lt;em&gt;California Restaurant Association v. City of Berkeley&lt;/em&gt;). &lt;strong&gt;Public speaker Mike Ferreira expresses strong frustration and disagreement with "unelected state agencies" forcing the City to change its ordinances.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timestamp:&lt;/strong&gt; 00:30:45 - 00:45:00&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Participants:&lt;/strong&gt; Ben Corrales, Keith Weiner, Joaquin Jimenez, Jeremy Levine, Mike Ferreira, Councilmember Penrose, Vice Mayor Ruddick.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;5. Housing Element Update &amp;amp; Adoption (Item 9C)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Summary:&lt;/strong&gt; Staff presents the 5th draft of the Housing Element, noting State HCD requirements to modify ADU allocations and place a measure on the ballot regarding the "Measure D" growth cap. &lt;strong&gt;There is significant disagreement from Councilmembers Ruddick and Penrose regarding the State's requirement to hold a ballot measure.&lt;/strong&gt; Public speakers debate the enforceability of Measure D. &lt;strong&gt;Mike Ferreira interrupts the vibe to voice strong distaste for HCD's interference in local law.&lt;/strong&gt; The Council votes to adopt the element but strikes the language committing to a ballot measure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timestamp:&lt;/strong&gt; 00:45:00 - 01:05:00&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Participants:&lt;/strong&gt; Leslie (Staff), Joaquin Jimenez, Jeremy Levine, Mike Ferreira, Councilmember Penrose, Vice Mayor Ruddick, Councilmember Johnson.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr /&gt;
&lt;h4&gt;Transcript&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Mayor Brownstone&lt;/strong&gt; [00:00:00]
Good evening everybody and welcome to the November 4th Half Moon Bay City Council meeting. As a reminder, we have Spanish interpretation services available in person and on Zoom.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Victor Hernandez (Interpreter)&lt;/strong&gt; [00:00:35]
Thank you, Mr. Mayor, City Council, all city staff, members of the public. &lt;em&gt;[Spanish instructions provided regarding accessing the interpretation channel on Zoom and in the room.]&lt;/em&gt; Thank you very much.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Those first two lines of the transcript already illustrate something interesting here: Gemini 3 Pro chose NOT to include the exact text of the Spanish instructions, instead summarizing them as "[Spanish instructions provided regarding accessing the interpretation channel on Zoom and in the room.]".&lt;/p&gt;
&lt;p&gt;I haven't spot-checked the entire 3h33m meeting, but I've confirmed that the timestamps do not line up. The transcript closes like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Mayor Brownstone&lt;/strong&gt; [01:04:00]
Meeting adjourned. Have a good evening.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That actually happens &lt;a href="https://www.youtube.com/watch?v=qgJ7x7R6gy0&amp;amp;t=3h31m5s"&gt;at 3h31m5s&lt;/a&gt; and the mayor says:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Okay. Well, thanks everybody, members of the public for participating. Thank you for staff. Thank you to fellow council members. This meeting is now adjourned. Have a good evening.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I'm disappointed about the timestamps, since mismatches there make it much harder to jump to the right point and confirm that the summarized transcript is an accurate representation of what was said.&lt;/p&gt;
&lt;p&gt;This took 320,087 input tokens and 7,870 output tokens, for a total cost of &lt;a href="https://www.llm-prices.com/#it=320087&amp;amp;ot=7870&amp;amp;ic=4&amp;amp;oc=18"&gt;$1.42&lt;/a&gt;.&lt;/p&gt;
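&lt;p&gt;That figure is easy to reproduce from the prices in that link ($4/million input tokens and $18/million output tokens for Gemini 3 Pro):&lt;/p&gt;

```python
# Gemini 3 Pro pricing from the llm-prices.com link: $4/M input, $18/M output
input_tokens = 320_087
output_tokens = 7_870
cost = input_tokens / 1e6 * 4 + output_tokens / 1e6 * 18
print(f"${cost:.2f}")  # $1.42
```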
&lt;h4 id="and-a-new-pelican-benchmark"&gt;And a new pelican benchmark&lt;/h4&gt;
&lt;p&gt;Gemini 3 Pro has a new concept of a "thinking level" which can be set to low or high (and defaults to high). I tried my classic &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;Generate an SVG of a pelican riding a bicycle&lt;/a&gt; prompt at both levels.&lt;/p&gt;
&lt;p&gt;Here's low - Gemini decided to add a jaunty little hat (with a comment &lt;a href="https://gist.github.com/simonw/70d56ba39b7cbb44985d2384004fc4a0#response"&gt;in the SVG&lt;/a&gt; that says &lt;code&gt;&amp;lt;!-- Hat (Optional Fun Detail) --&amp;gt;&lt;/code&gt;):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gemini-3-pelican-low.png" alt="The pelican is wearing a blue hat. It has a good beak. The bicycle is a little bit incorrect but generally a good effort." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And here's high. This is genuinely an excellent pelican, and the bicycle frame is at least the correct shape:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gemini-3-pelican-high.png" alt="The pelican is not wearing a hat. It has a good beak. The bicycle is accurate and well-drawn." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Honestly though, my pelican benchmark is beginning to feel a little bit too basic. I decided to upgrade it. Here's v2 of the benchmark, which I plan to use going forward:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Generate an SVG of a California brown pelican riding a bicycle. The bicycle must have spokes and a correctly shaped bicycle frame. The pelican must have its characteristic large pouch, and there should be a clear indication of feathers. The pelican must be clearly pedaling the bicycle. The image should show the full breeding plumage of the California brown pelican.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;For reference, here's a photo I took of a California brown pelican recently (sadly without a bicycle):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/breeding-plumage.jpg" alt="A glorious California brown pelican perched on a rock by the water. It has a yellow tint to its head and a red spot near its throat." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Here's Gemini 3 Pro's &lt;a href="https://gist.github.com/simonw/2b9930ae1ce6f3f5e9cfe3cb31ec0c0a"&gt;attempt&lt;/a&gt; at high thinking level for that new prompt:&lt;/p&gt;
&lt;p id="advanced-pelican"&gt;&lt;img src="https://static.simonwillison.net/static/2025/gemini-3-breeding-pelican-high.png" alt="It's clearly a pelican. It has all of the requested features. It looks a bit abstract though." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And for good measure, here's that same prompt &lt;a href="https://gist.github.com/simonw/7a655ebe42f3d428d2ea5363dad8067c"&gt;against GPT-5.1&lt;/a&gt; - which produced this dumpy little fellow:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-5-1-breeding-pelican.png" alt="The pelican is very round. Its body overlaps much of the bicycle. It has a lot of dorky charisma." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And Claude Sonnet 4.5, which &lt;a href="https://gist.github.com/simonw/3296af92e4328dd4740385e6a4a2ac35"&gt;didn't do quite as well&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/claude-sonnet-4-5-breeding-pelican.png" alt="Oh dear. It has all of the requested components, but the bicycle is a bit wrong and the pelican is arranged in a very awkward shape." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;None of the models seem to have caught on to the crucial detail that the California brown pelican is not, in fact, brown.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="google"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="gemini"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/></entry><entry><title>GPT-5.1 Instant and GPT-5.1 Thinking System Card Addendum</title><link href="https://simonwillison.net/2025/Nov/14/gpt-51-system-card-addendum/#atom-tag" rel="alternate"/><published>2025-11-14T13:46:23+00:00</published><updated>2025-11-14T13:46:23+00:00</updated><id>https://simonwillison.net/2025/Nov/14/gpt-51-system-card-addendum/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://openai.com/index/gpt-5-system-card-addendum-gpt-5-1/"&gt;GPT-5.1 Instant and GPT-5.1 Thinking System Card Addendum&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I was confused about whether the new "adaptive thinking" feature of GPT-5.1 meant they were moving away from the "router" mechanism where GPT-5 in ChatGPT automatically selected a model for you.&lt;/p&gt;
&lt;p&gt;This page addresses that, emphasis mine:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;GPT‑5.1 Instant is more conversational than our earlier chat model, with improved instruction following and an adaptive reasoning capability that lets it decide when to think before responding. GPT‑5.1 Thinking adapts thinking time more precisely to each question. &lt;strong&gt;GPT‑5.1 Auto will continue to route each query to the model best suited for it&lt;/strong&gt;, so that in most cases, the user does not need to choose a model at all.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So GPT‑5.1 Instant can decide when to think before responding, GPT-5.1 Thinking can decide how hard to think, and GPT-5.1 Auto (not a model you can use via the API) can decide which out of Instant and Thinking a prompt should be routed to.&lt;/p&gt;
&lt;p&gt;If anything this feels &lt;em&gt;more&lt;/em&gt; confusing than the GPT-5 routing situation!&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://cdn.openai.com/pdf/4173ec8d-1229-47db-96de-06d87147e07e/5_1_system_card.pdf"&gt;system card addendum PDF&lt;/a&gt; itself is somewhat frustrating: it shows results on an internal benchmark called "Production Benchmarks", also mentioned in the &lt;a href="https://openai.com/index/gpt-5-system-card/"&gt;GPT-5 system card&lt;/a&gt;, but with vanishingly little detail about what that tests beyond high level category names like "personal data", "extremism" or "mental health" and "emotional reliance" - those last two both listed as "New evaluations, as introduced in the &lt;a href="https://cdn.openai.com/pdf/3da476af-b937-47fb-9931-88a851620101/addendum-to-gpt-5-system-card-sensitive-conversations.pdf"&gt;GPT-5 update on sensitive conversations&lt;/a&gt;" - a PDF dated October 27th that I had previously missed.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;That&lt;/em&gt; document describes the two new categories like so:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Emotional Reliance not_unsafe - tests that the model does not produce disallowed content under our policies related to unhealthy emotional dependence or attachment to ChatGPT&lt;/li&gt;
&lt;li&gt;Mental Health not_unsafe - tests that the model does not produce disallowed content under our policies in situations where there are signs that a user may be experiencing isolated delusions, psychosis, or mania&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;So these are the &lt;a href="https://www.tiktok.com/@pearlmania500/video/7535954556379761950"&gt;ChatGPT Psychosis&lt;/a&gt; benchmarks!&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatgpt"&gt;chatgpt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-personality"&gt;ai-personality&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-5"&gt;gpt-5&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="chatgpt"/><category term="llms"/><category term="llm-reasoning"/><category term="ai-personality"/><category term="gpt-5"/></entry><entry><title>Introducing GPT-5.1 for developers</title><link href="https://simonwillison.net/2025/Nov/13/gpt-51/#atom-tag" rel="alternate"/><published>2025-11-13T23:59:35+00:00</published><updated>2025-11-13T23:59:35+00:00</updated><id>https://simonwillison.net/2025/Nov/13/gpt-51/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://openai.com/index/gpt-5-1-for-developers/"&gt;Introducing GPT-5.1 for developers&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;OpenAI announced GPT-5.1 yesterday, calling it &lt;a href="https://openai.com/index/gpt-5-1/"&gt;a smarter, more conversational ChatGPT&lt;/a&gt;. Today they've added it to their API.&lt;/p&gt;
&lt;p&gt;We actually got four new models today:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.openai.com/docs/models/gpt-5.1"&gt;gpt-5.1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.openai.com/docs/models/gpt-5.1-chat-latest"&gt;gpt-5.1-chat-latest&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.openai.com/docs/models/gpt-5.1-codex"&gt;gpt-5.1-codex&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.openai.com/docs/models/gpt-5.1-codex-mini"&gt;gpt-5.1-codex-mini&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There are a lot of details to absorb here.&lt;/p&gt;
&lt;p&gt;GPT-5.1 introduces a new reasoning effort called "none" (the previous levels were minimal, low, medium, and high) - and none is the new default.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This makes the model behave like a non-reasoning model for latency-sensitive use cases, with the high intelligence of GPT‑5.1 and added bonus of performant tool-calling. Relative to GPT‑5 with 'minimal' reasoning, GPT‑5.1 with no reasoning is better at parallel tool calling (which itself increases end-to-end task completion speed), coding tasks, following instructions, and using search tools---and supports &lt;a href="https://platform.openai.com/docs/guides/tools-web-search?api-mode=responses"&gt;web search⁠&lt;/a&gt; in our API platform.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;When you DO enable thinking you get to benefit from a new feature called "adaptive reasoning":&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;On straightforward tasks, GPT‑5.1 spends fewer tokens thinking, enabling snappier product experiences and lower token bills. On difficult tasks that require extra thinking, GPT‑5.1 remains persistent, exploring options and checking its work in order to maximize reliability.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Another notable new feature for 5.1 is &lt;a href="https://platform.openai.com/docs/guides/prompt-caching#extended-prompt-cache-retention"&gt;extended prompt cache retention&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Extended prompt cache retention keeps cached prefixes active for longer, up to a maximum of 24 hours. Extended Prompt Caching works by offloading the key/value tensors to GPU-local storage when memory is full, significantly increasing the storage capacity available for caching.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;To enable this set &lt;code&gt;"prompt_cache_retention": "24h"&lt;/code&gt; in the API call. Weirdly there's no price increase involved with this at all. I &lt;a href="https://x.com/simonw/status/1989104422832738305"&gt;asked about that&lt;/a&gt; and OpenAI's Steven Heidel &lt;a href="https://x.com/stevenheidel/status/1989113407149314199"&gt;replied&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;with 24h prompt caching we move the caches from gpu memory to gpu-local storage. that storage is not free, but we made it free since it moves capacity from a limited resource (GPUs) to a more abundant resource (storage). then we can serve more traffic overall!&lt;/p&gt;
&lt;/blockquote&gt;
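&lt;p&gt;Setting that option in a raw Responses API request body would look something like this (a sketch - the &lt;code&gt;model&lt;/code&gt; and &lt;code&gt;input&lt;/code&gt; values are illustrative, only the &lt;code&gt;prompt_cache_retention&lt;/code&gt; key comes from the documentation):&lt;/p&gt;

```json
{
  "model": "gpt-5.1",
  "input": "Summarize this document...",
  "prompt_cache_retention": "24h"
}
```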
&lt;p&gt;The most interesting documentation I've seen so far is in the new &lt;a href="https://cookbook.openai.com/examples/gpt-5/gpt-5-1_prompting_guide"&gt;5.1 cookbook&lt;/a&gt;, which also includes details of the new &lt;code&gt;shell&lt;/code&gt; and &lt;code&gt;apply_patch&lt;/code&gt; built-in tools. The &lt;a href="https://github.com/openai/openai-cookbook/blob/main/examples/gpt-5/apply_patch.py"&gt;apply_patch.py implementation&lt;/a&gt; is worth a look, especially if you're interested in the advancing state-of-the-art of file editing tools for LLMs.&lt;/p&gt;
&lt;p&gt;I'm still working on &lt;a href="https://github.com/simonw/llm/issues/1300"&gt;integrating the new models into LLM&lt;/a&gt;. The Codex models are Responses-API-only.&lt;/p&gt;
&lt;p&gt;I got this pelican for GPT-5.1 default (no thinking):&lt;/p&gt;
&lt;p&gt;&lt;img alt="The bicycle wheels have no spokes at all, the pelican is laying quite flat on it" src="https://static.simonwillison.net/static/2025/gpt-5.1-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;And this one with reasoning effort set to high:&lt;/p&gt;
&lt;p&gt;&lt;img alt="This bicycle has four spokes per wheel, and the pelican is sitting more upright" src="https://static.simonwillison.net/static/2025/gpt-5.1-high-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;These actually feel like a &lt;a href="https://simonwillison.net/2025/Aug/7/gpt-5/#and-some-svgs-of-pelicans"&gt;regression from GPT-5&lt;/a&gt; to me. The bicycles have fewer spokes!&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-5"&gt;gpt-5&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-codex"&gt;gpt-codex&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/november-2025-inflection"&gt;november-2025-inflection&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/><category term="gpt-5"/><category term="gpt-codex"/><category term="november-2025-inflection"/></entry><entry><title>Kimi K2 Thinking</title><link href="https://simonwillison.net/2025/Nov/6/kimi-k2-thinking/#atom-tag" rel="alternate"/><published>2025-11-06T23:53:06+00:00</published><updated>2025-11-06T23:53:06+00:00</updated><id>https://simonwillison.net/2025/Nov/6/kimi-k2-thinking/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/moonshotai/Kimi-K2-Thinking"&gt;Kimi K2 Thinking&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Chinese AI lab Moonshot's Kimi K2 established itself as one of the largest open weight models - 1 trillion parameters - &lt;a href="https://simonwillison.net/2025/Jul/11/kimi-k2/"&gt;back in July&lt;/a&gt;. They've now released the Thinking version, also a trillion parameters (MoE, 32B active) and also under their custom modified (so &lt;a href="https://simonwillison.net/2025/Jul/11/kimi-k2/#kimi-license"&gt;not quite open source&lt;/a&gt;) MIT license.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Starting with Kimi K2, we built it as a thinking agent that reasons step-by-step while dynamically invoking tools. It sets a new state-of-the-art on Humanity's Last Exam (HLE), BrowseComp, and other benchmarks by dramatically scaling multi-step reasoning depth and maintaining stable tool-use across 200–300 sequential calls. At the same time, K2 Thinking is a native INT4 quantization model with 256k context window, achieving lossless reductions in inference latency and GPU memory usage.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This one is only 594GB on Hugging Face - Kimi K2 was 1.03TB - which I think is due to the new INT4 quantization. This makes the model both cheaper and faster to host.&lt;/p&gt;
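&lt;p&gt;The INT4 theory checks out on a napkin: a trillion parameters at 4 bits each is roughly 500GB, which lands in the right ballpark for the 594GB checkpoint once you allow for embeddings and any layers kept at higher precision (a rough estimate, not an exact accounting):&lt;/p&gt;

```python
# Back-of-the-envelope: 1T parameters quantized to INT4 (4 bits = 0.5 bytes each)
params = 1_000_000_000_000
bytes_per_param = 0.5
size_gb = params * bytes_per_param / 1e9
print(f"~{size_gb:.0f} GB")  # ~500 GB
```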
&lt;p&gt;So far the only people hosting it are Moonshot themselves. I tried it out both via &lt;a href="https://platform.moonshot.ai"&gt;their own API&lt;/a&gt; and via &lt;a href="https://openrouter.ai/moonshotai/kimi-k2-thinking/providers"&gt;the OpenRouter proxy to it&lt;/a&gt;, via the &lt;a href="https://github.com/ghostofpokemon/llm-moonshot"&gt;llm-moonshot&lt;/a&gt; plugin (by NickMystic) and my &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt; plugin respectively.&lt;/p&gt;
&lt;p&gt;The buzz around this model so far is very positive. Could this be the first open weight model that's competitive with the latest from OpenAI and Anthropic, especially for long-running agentic tool call sequences?&lt;/p&gt;
&lt;p&gt;Moonshot AI's &lt;a href="https://moonshotai.github.io/Kimi-K2/thinking.html"&gt;self-reported benchmark scores&lt;/a&gt; show K2 Thinking beating the top OpenAI and Anthropic models (GPT-5 and Sonnet 4.5 Thinking) at "Agentic Reasoning" and "Agentic Search" but not quite top for "Coding":&lt;/p&gt;
&lt;p&gt;&lt;img alt="Comparison bar chart showing agentic reasoning, search, and coding benchmark performance scores across three AI systems (K, OpenAI, and AI) on tasks including Humanity's Last Exam (44.9, 41.7, 32.0), BrowseComp (60.2, 54.9, 24.1), Seal-0 (56.3, 51.4, 53.4), SWE-Multilingual (61.1, 55.3, 68.0), SWE-bench Verified (71.3, 74.9, 77.2), and LiveCodeBench V6 (83.1, 87.0, 64.0), with category descriptions including &amp;quot;Expert-level questions across subjects&amp;quot;, &amp;quot;Agentic search &amp;amp; browsing&amp;quot;, &amp;quot;Real-world latest information collection&amp;quot;, &amp;quot;Agentic coding&amp;quot;, and &amp;quot;Competitive programming&amp;quot;." src="https://static.simonwillison.net/static/2025/kimi-k2-thinking-benchmarks.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;I ran a couple of pelican tests:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-moonshot
llm keys set moonshot # paste key
llm -m moonshot/kimi-k2-thinking 'Generate an SVG of a pelican riding a bicycle'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="Sonnet 4.5 described this as: Cartoon illustration of a white duck or goose with an orange beak and gray wings riding a bicycle with a red frame and light blue wheels against a light blue background." src="https://static.simonwillison.net/static/2025/k2-thinking.png" /&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-openrouter
llm keys set openrouter # paste key
llm -m openrouter/moonshotai/kimi-k2-thinking \
  'Generate an SVG of a pelican riding a bicycle'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="Sonnet 4.5: Minimalist cartoon illustration of a white bird with an orange beak and feet standing on a triangular-framed penny-farthing style bicycle with gray-hubbed wheels and a propeller hat on its head, against a light background with dotted lines and a brown ground line." src="https://static.simonwillison.net/static/2025/k2-thinking-openrouter.png" /&gt;&lt;/p&gt;
&lt;p&gt;Artificial Analysis &lt;a href="https://x.com/ArtificialAnlys/status/1986541785511043536"&gt;said&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Kimi K2 Thinking achieves 93% in 𝜏²-Bench Telecom, an agentic tool use benchmark where the model acts as a customer service agent. This is the highest score we have independently measured. Tool use in long horizon agentic contexts was a strength of Kimi K2 Instruct and it appears this new Thinking variant makes substantial gains&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;CNBC quoted a source who &lt;a href="https://www.cnbc.com/2025/11/06/alibaba-backed-moonshot-releases-new-ai-model-kimi-k2-thinking.html"&gt;provided the training price&lt;/a&gt; for the model:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The Kimi K2 Thinking model cost $4.6 million to train, according to a source familiar with the matter. [...] CNBC was unable to independently verify the DeepSeek or Kimi figures.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;MLX developer Awni Hannun &lt;a href="https://x.com/awnihannun/status/1986601104130646266"&gt;got it working&lt;/a&gt; on two 512GB M3 Ultra Mac Studios:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The new 1 Trillion parameter Kimi K2 Thinking model runs well on 2 M3 Ultras in its native format - no loss in quality!&lt;/p&gt;
&lt;p&gt;The model was quantization aware trained (qat) at int4.&lt;/p&gt;
&lt;p&gt;Here it generated ~3500 tokens at 15 toks/sec using pipeline-parallelism in mlx-lm&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://huggingface.co/mlx-community/Kimi-K2-Thinking"&gt;the 658GB mlx-community model&lt;/a&gt;.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/artificial-analysis"&gt;artificial-analysis&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/moonshot"&gt;moonshot&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/kimi"&gt;kimi&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="mlx"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/><category term="openrouter"/><category term="ai-in-china"/><category term="artificial-analysis"/><category term="moonshot"/><category term="kimi"/></entry><entry><title>Quoting MiniMax</title><link href="https://simonwillison.net/2025/Nov/3/minimax/#atom-tag" rel="alternate"/><published>2025-11-03T17:24:39+00:00</published><updated>2025-11-03T17:24:39+00:00</updated><id>https://simonwillison.net/2025/Nov/3/minimax/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://x.com/minimax__ai/status/1985375617622454566"&gt;&lt;p&gt;&lt;strong&gt;Interleaved thinking&lt;/strong&gt; is essential for LLM agents: it means alternating between explicit reasoning and tool use, while carrying that reasoning forward between steps. This process significantly enhances &lt;strong&gt;planning, self‑correction, and reliability&lt;/strong&gt; in long workflows. [...]&lt;/p&gt;
&lt;p&gt;From community feedback, we've often observed failures to preserve prior-round thinking state across multi-turn interactions with M2. The root cause is that the widely-used &lt;strong&gt;OpenAI Chat Completion API does not support passing reasoning content back in subsequent requests&lt;/strong&gt;. Although the Anthropic API natively supports this capability, the community has provided less support for models beyond Claude, and many applications still omit passing back the previous turns' thinking in their Anthropic API implementations. This situation has resulted in poor support for Interleaved Thinking for new models. &lt;strong&gt;To fully unlock M2's capabilities, preserving the reasoning process across multi-turn interactions is essential&lt;/strong&gt;.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://x.com/minimax__ai/status/1985375617622454566"&gt;MiniMax&lt;/a&gt;, Interleaved Thinking Unlocks Reliable MiniMax-M2 Agentic Capability&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-agents"&gt;ai-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/definitions"&gt;definitions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/minimax"&gt;minimax&lt;/a&gt;&lt;/p&gt;



</summary><category term="generative-ai"/><category term="ai-agents"/><category term="llm-reasoning"/><category term="definitions"/><category term="ai"/><category term="ai-in-china"/><category term="llms"/><category term="minimax"/></entry><entry><title>Introducing Claude Haiku 4.5</title><link href="https://simonwillison.net/2025/Oct/15/claude-haiku-45/#atom-tag" rel="alternate"/><published>2025-10-15T19:36:34+00:00</published><updated>2025-10-15T19:36:34+00:00</updated><id>https://simonwillison.net/2025/Oct/15/claude-haiku-45/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.anthropic.com/news/claude-haiku-4-5"&gt;Introducing Claude Haiku 4.5&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Anthropic released Claude Haiku 4.5 today, the cheapest member of the Claude 4.5 family that started with Sonnet 4.5 &lt;a href="https://simonwillison.net/2025/Sep/29/claude-sonnet-4-5/"&gt;a couple of weeks ago&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It's priced at $1/million input tokens and $5/million output tokens, slightly more expensive than Haiku 3.5 ($0.80/$4) and a &lt;em&gt;lot&lt;/em&gt; more expensive than the original Claude 3 Haiku ($0.25/$1.25), both of which remain available at those prices.&lt;/p&gt;
&lt;p&gt;It's a third of the price of Sonnet 4 and Sonnet 4.5 (both $3/$15) which is notable because Anthropic's benchmarks put it in a similar space to that older Sonnet 4 model. As they put it:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;What was recently at the frontier is now cheaper and faster. Five months ago, Claude Sonnet 4 was a state-of-the-art model. Today, Claude Haiku 4.5 gives you similar levels of coding performance but at one-third the cost and more than twice the speed.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I've been hoping to see Anthropic release a fast, inexpensive model that's price competitive with the cheapest models from OpenAI and Gemini, currently $0.05/$0.40 (GPT-5-Nano) and $0.075/$0.30 (Gemini 2.0 Flash Lite). Haiku 4.5 certainly isn't that; it looks like they're continuing to focus squarely on the "great at code" part of the market.&lt;/p&gt;
&lt;p&gt;The new Haiku is the first Haiku model to support reasoning. It sports a 200,000 token context window, a 64,000 token maximum output (up from just 8,192 for Haiku 3.5) and a "reliable knowledge cutoff" of February 2025, one month later than the January 2025 date for Sonnet 4 and 4.5 and Opus 4 and 4.1.&lt;/p&gt;
&lt;p&gt;Something that caught my eye in the accompanying &lt;a href="https://assets.anthropic.com/m/99128ddd009bdcb/original/Claude-Haiku-4-5-System-Card.pdf"&gt;system card&lt;/a&gt; was this note about context length:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;For Claude Haiku 4.5, we trained the model to be explicitly context-aware, with precise information about how much context-window has been used. This has two effects: the model learns when and how to wrap up its answer when the limit is approaching, and the model learns to continue reasoning more persistently when the limit is further away. We found this intervention—along with others—to be effective at limiting agentic “laziness” (the phenomenon where models stop working on a problem prematurely, give incomplete answers, or cut corners on tasks).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I've added the new price to &lt;a href="https://www.llm-prices.com/"&gt;llm-prices.com&lt;/a&gt;, released &lt;a href="https://github.com/simonw/llm-anthropic/releases/tag/0.20"&gt;llm-anthropic 0.20&lt;/a&gt; with the new model and updated my &lt;a href="https://tools.simonwillison.net/haiku"&gt;Haiku-from-your-webcam&lt;/a&gt; demo (&lt;a href="https://github.com/simonw/tools/blob/main/haiku.html"&gt;source&lt;/a&gt;) to use Haiku 4.5 as well.&lt;/p&gt;
&lt;p&gt;Here's &lt;code&gt;llm -m claude-haiku-4.5 'Generate an SVG of a pelican riding a bicycle'&lt;/code&gt; (&lt;a href="https://gist.github.com/simonw/31256c523fa502eeb303b8e0bbe30eee"&gt;transcript&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;&lt;img alt="Described by Haiku 4.5: A whimsical illustration of a bird with a round tan body, pink beak, and orange legs riding a bicycle against a blue sky and green grass background." src="https://static.simonwillison.net/static/2025/claude-haiku-4.5-pelican.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;18 input tokens and 1513 output tokens = &lt;a href="https://www.llm-prices.com/#it=18&amp;amp;ot=1513&amp;amp;ic=1&amp;amp;oc=5"&gt;0.7583 cents&lt;/a&gt;.&lt;/p&gt;
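&lt;p&gt;That figure is easy to reproduce. Here's a minimal sketch of the arithmetic, with rates expressed in dollars per million tokens:&lt;/p&gt;

```python
def request_cost_cents(input_tokens, output_tokens, input_rate, output_rate):
    """Cost of one request in cents, given per-million-token rates in dollars."""
    dollars = (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000
    return dollars * 100

# Haiku 4.5: $1/million input, $5/million output
print(round(request_cost_cents(18, 1513, 1, 5), 4))  # 0.7583
```

The same function covers the other prices in this feed, e.g. GPT-5 pro's $15/$120 rates.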

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=45595403"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="anthropic"/><category term="claude"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/></entry><entry><title>GPT-5 pro</title><link href="https://simonwillison.net/2025/Oct/6/gpt-5-pro/#atom-tag" rel="alternate"/><published>2025-10-06T19:48:45+00:00</published><updated>2025-10-06T19:48:45+00:00</updated><id>https://simonwillison.net/2025/Oct/6/gpt-5-pro/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://platform.openai.com/docs/models/gpt-5-pro"&gt;GPT-5 pro&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Here's OpenAI's model documentation for their GPT-5 pro model, released to their API today at their DevDay event.&lt;/p&gt;
&lt;p&gt;It has similar base characteristics to &lt;a href="https://platform.openai.com/docs/models/gpt-5"&gt;GPT-5&lt;/a&gt;: both share a September 30, 2024 knowledge cutoff and 400,000 context limit.&lt;/p&gt;
&lt;p&gt;GPT-5 pro supports a maximum of 272,000 output tokens, an increase from the 128,000 limit for GPT-5.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;As our most advanced reasoning model, GPT-5 pro defaults to (and only supports) &lt;code&gt;reasoning.effort: high&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It's only available via OpenAI's Responses API. My &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; tool doesn't support that in core yet, but the &lt;a href="https://github.com/simonw/llm-openai-plugin"&gt;llm-openai-plugin&lt;/a&gt; plugin does. I released &lt;a href="https://github.com/simonw/llm-openai-plugin/releases/tag/0.7"&gt;llm-openai-plugin 0.7&lt;/a&gt; adding support for the new model, then ran this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install -U llm-openai-plugin
llm -m openai/gpt-5-pro "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It's very, very slow. The model took 6 minutes 8 seconds to respond and charged me for 16 input and 9,205 output tokens. At $15/million input and $120/million output this pelican &lt;a href="https://www.llm-prices.com/#it=16&amp;amp;ot=9205&amp;amp;ic=15&amp;amp;oc=120&amp;amp;sb=output&amp;amp;sd=descending"&gt;cost me $1.10&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;&lt;img alt="It's obviously a pelican riding a bicycle. Half the spokes are missing on each wheel and the pelican is a bit squat looking." src="https://static.simonwillison.net/static/2025/gpt-5-pro.png" /&gt;&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/9a06ab36f486f31401fec1fc104a8ce5"&gt;the full transcript&lt;/a&gt;. It looks visually pretty similar to the much, much cheaper result I &lt;a href="https://simonwillison.net/2025/Aug/7/gpt-5/#and-some-svgs-of-pelicans"&gt;got from GPT-5&lt;/a&gt;.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-5"&gt;gpt-5&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/><category term="gpt-5"/></entry><entry><title>Claude Sonnet 4.5 is probably the "best coding model in the world" (at least for now)</title><link href="https://simonwillison.net/2025/Sep/29/claude-sonnet-4-5/#atom-tag" rel="alternate"/><published>2025-09-29T18:11:39+00:00</published><updated>2025-09-29T18:11:39+00:00</updated><id>https://simonwillison.net/2025/Sep/29/claude-sonnet-4-5/#atom-tag</id><summary type="html">
    &lt;p&gt;Anthropic &lt;a href="https://www.anthropic.com/news/claude-sonnet-4-5"&gt;released Claude Sonnet 4.5 today&lt;/a&gt;, with a &lt;em&gt;very&lt;/em&gt; bold set of claims:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Claude Sonnet 4.5 is the best coding model in the world. It's the strongest model for building complex agents. It’s the best model at using computers. And it shows substantial gains in reasoning and math.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Anthropic gave me access to a preview version of a "new model" over the weekend which turned out to be Sonnet 4.5. My initial impressions were that it felt like a better model for code than GPT-5-Codex, which has been my preferred coding model since &lt;a href="https://simonwillison.net/2025/Sep/23/gpt-5-codex/"&gt;it launched a few weeks ago&lt;/a&gt;. This space moves &lt;em&gt;so fast&lt;/em&gt; - Gemini 3 is rumored to land soon so who knows how long Sonnet 4.5 will continue to hold the "best coding model" crown.&lt;/p&gt;
&lt;p&gt;The pricing is the same as the previous Sonnet: $3/million input tokens and $15/million output tokens. This remains significantly cheaper than Claude Opus - $15/$75 - but still quite a bit more than GPT-5 and GPT-5-Codex, both at $1.25/$10.&lt;/p&gt;
&lt;h4 id="it-really-shines-with-claude-ai-code-interpreter"&gt;It really shines with Claude.ai Code Interpreter&lt;/h4&gt;
&lt;p&gt;The &lt;a href="https://claude.ai/"&gt;claude.ai&lt;/a&gt; web interface (not yet the Claude iPhone native app) recently added the ability for Claude to write and then directly execute code in a sandboxed server environment, using Python and Node.js. I &lt;a href="https://simonwillison.net/2025/Sep/9/claude-code-interpreter/"&gt;wrote about that in detail&lt;/a&gt; three weeks ago.&lt;/p&gt;
&lt;p&gt;Anthropic's implementation of this code interpreter pattern is more powerful than ChatGPT's equivalent because it can directly clone code from GitHub and install software packages from NPM and PyPI.&lt;/p&gt;
&lt;p&gt;Sonnet 4.5 &lt;em&gt;absolutely shines&lt;/em&gt; at using this tool!&lt;/p&gt;
&lt;p&gt;Here's my most ambitious experiment with it. I started like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Checkout https://github.com/simonw/llm and run the tests with&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;pip install -e '.[test]'&lt;/code&gt;&lt;br /&gt;
&lt;code&gt;pytest&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It did exactly as I asked, checking out my &lt;a href="https://github.com/simonw/llm"&gt;LLM repo&lt;/a&gt;, installing it along with all of its dependencies, and running the tests.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;✅ All 466 tests passed in 167.69 seconds (about 2 minutes 47 seconds)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Then I posed it a &lt;em&gt;wildly&lt;/em&gt; ambitious challenge:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;The tool stores prompts and responses in a SQLite database. It currently uses a single conversation row and a linear collection of responses within that conversation. Experiment with the idea of adding a parent_response_id column to that responses table and using it to model the responses in a conversation as a tree instead. Use new pytest tests which you execute individually (to save time on the whole test suite) to exercise your design. Write a tree_notes.md file which you first populate with your plan and then use for notes as you go along.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;As you can see in &lt;a href="https://claude.ai/share/9d5b0729-b58d-4b15-9e45-ab1e7152b89e"&gt;the resulting conversation transcript&lt;/a&gt;, it did a really impressive job of this! It churned away across &lt;em&gt;dozens&lt;/em&gt; of different tool calls, eventually outputting the following summary of its activity:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I've successfully experimented with adding tree-structured conversations to the LLM CLI tool. Here's what was accomplished:&lt;/p&gt;
&lt;h3 id="-implementation-complete"&gt;✅ &lt;strong&gt;Implementation Complete&lt;/strong&gt;
&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;1. Database Schema Enhancement&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Added &lt;code&gt;parent_response_id&lt;/code&gt; column to the &lt;code&gt;responses&lt;/code&gt; table via migration &lt;code&gt;m022_parent_response_id&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Enables tree-structured (branching) conversations instead of just linear sequences&lt;/li&gt;
&lt;li&gt;Fully backward compatible - existing data unaffected&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;2. Comprehensive Utility Module&lt;/strong&gt; (&lt;code&gt;tree_utils.py&lt;/code&gt;)&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;12 helper functions for tree operations:
&lt;ul&gt;
&lt;li&gt;Navigation (children, parent, siblings, paths)&lt;/li&gt;
&lt;li&gt;Analysis (depth, descendants, size)&lt;/li&gt;
&lt;li&gt;Queries (roots, leaves, statistics)&lt;/li&gt;
&lt;li&gt;Visualization (ASCII tree display)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;3. Complete Test Suite&lt;/strong&gt; (&lt;code&gt;test_tree_conversations.py&lt;/code&gt;)&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;16 tests covering all tree operations&lt;/li&gt;
&lt;li&gt;Tests linear chains, branching, multiple roots, forests&lt;/li&gt;
&lt;li&gt;Includes realistic debugging scenario&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;All tests pass ✅&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;4. Integration Verification&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Updated existing migration tests to expect new column&lt;/li&gt;
&lt;li&gt;All 6 migration tests pass&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total: 22/22 tests passing ✅&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="-deliverables-7-files-in-mnt-user-data-outputs-"&gt;📦 &lt;strong&gt;Deliverables&lt;/strong&gt; (7 files in &lt;code&gt;/mnt/user-data/outputs/&lt;/code&gt;)&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;README.md&lt;/strong&gt; - Quick start guide&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IMPLEMENTATION_SUMMARY.md&lt;/strong&gt; - Complete technical overview&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MIGRATION_CODE.md&lt;/strong&gt; - Database migration details&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TEST_RESULTS.md&lt;/strong&gt; - Full test coverage report&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;tree_notes.md&lt;/strong&gt; - Design decisions and development notes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;tree_utils.py&lt;/strong&gt; - Utility functions module&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;test_tree_conversations.py&lt;/strong&gt; - Test suite&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id="-key-features"&gt;🎯 &lt;strong&gt;Key Features&lt;/strong&gt;
&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Multiple branches from any conversation point&lt;/li&gt;
&lt;li&gt;Multiple roots per conversation (forest structure)&lt;/li&gt;
&lt;li&gt;Rich analytics (depth, branching factor, tree size)&lt;/li&gt;
&lt;li&gt;ASCII tree visualization&lt;/li&gt;
&lt;li&gt;Cycle detection for safety&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="-ready-for-integration"&gt;🚀 &lt;strong&gt;Ready for Integration&lt;/strong&gt;
&lt;/h3&gt;
&lt;p&gt;The implementation is production-ready with comprehensive testing. Next steps would be:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Integrate &lt;code&gt;tree_utils.py&lt;/code&gt; into LLM package&lt;/li&gt;
&lt;li&gt;Add CLI commands (&lt;code&gt;llm branch&lt;/code&gt;, &lt;code&gt;llm tree&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Update &lt;code&gt;Response.log()&lt;/code&gt; to accept &lt;code&gt;parent_response_id&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;All code is tested, documented, and ready to use!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Did I mention I kicked off this experiment by entering prompts on my phone?&lt;/p&gt;
&lt;p&gt;I told it to &lt;code&gt;Give me a zip file of everything you have done so far&lt;/code&gt; - you can explore the contents of the file it made me &lt;a href="https://gist.github.com/simonw/f9d0f870e8d1af399a7f366a7f87b8b4"&gt;in this Gist&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: I shared a &lt;a href="https://claude.ai/share/9d5b0729-b58d-4b15-9e45-ab1e7152b89e"&gt;longer version&lt;/a&gt; of this experiment where I had the model experiment with a complex message hashing mechanism as well. I've &lt;a href="https://claude.ai/share/9d5b0729-b58d-4b15-9e45-ab1e7152b89e"&gt;pushed the code to a branch&lt;/a&gt;, which is best explained by the &lt;a href="https://github.com/simonw/llm/blob/697636222a277f892dfffef86b2471cc89c95576/message_matching_notes.md"&gt;message_matching_notes.md&lt;/a&gt; document written by Claude.&lt;/p&gt;
&lt;h4 id="pelicans"&gt;Pelicans&lt;/h4&gt;
&lt;p&gt;I released &lt;a href="https://github.com/simonw/llm-anthropic/releases/tag/0.19"&gt;llm-anthropic 0.19&lt;/a&gt; adding support for the new model. Then I ran my &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;classic pelican benchmark&lt;/a&gt;, first &lt;a href="https://gist.github.com/simonw/52a5bb478fa743daede12d97173cb817"&gt;with thinking turned on&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m claude-sonnet-4.5 -o thinking 1 -o thinking_budget 4000 \
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Generate an SVG of a pelican riding a bicycle&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/claude-sonnet-4.5-thinking.png" alt="" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And then &lt;a href="https://gist.github.com/simonw/449619ea8f0c112e3faa72e99fcb16ee"&gt;without&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m claude-sonnet-4.5 \
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Generate an SVG of a pelican riding a bicycle&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/claude-sonnet-4.5.png" alt="" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;These are pretty good - they are recognizably pelicans! - though not quite as good &lt;a href="https://simonwillison.net/2025/Sep/23/gpt-5-codex/"&gt;as GPT-5-Codex&lt;/a&gt; which is better at drawing bicycles.&lt;/p&gt;
&lt;p&gt;I also tried having it describe this image:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/pelicans.jpg" alt="A whole mess of pelicans, described below." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m claude-sonnet-4.5 &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;describe this image&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; -a https://static.simonwillison.net/static/2024/pelicans.jpg&lt;/pre&gt;&lt;/div&gt;
&lt;blockquote&gt;
&lt;p&gt;This image shows a large gathering of pelicans crowded together along what appears to be a waterfront or shoreline. The birds are densely packed, creating a chaotic scene with dozens of brown pelicans visible. Their distinctive long bills and pouches are clearly visible among the mass of feathered bodies. In the background, there's calm water and a dark tree line or forested area. The lighting suggests this was taken during early morning or late afternoon, giving the scene a golden, warm tone. The pelicans appear to be resting or congregating together, which is typical behavior for these colonial seabirds.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h4 id="anthropic-are-rolling-this-out-everywhere"&gt;Anthropic are rolling this out everywhere&lt;/h4&gt;
&lt;p&gt;The release of this model has been &lt;em&gt;very&lt;/em&gt; well coordinated. My embargo on talking about it was due to lift at 10am Pacific today, and I got an email from them linking to their blog post at 10am on the dot. It's also already &lt;a href="https://openrouter.ai/anthropic/claude-sonnet-4.5"&gt;live on OpenRouter&lt;/a&gt; and &lt;a href="https://x.com/cursor_ai/status/1972713190074261949"&gt;in Cursor&lt;/a&gt; and &lt;a href="https://github.blog/changelog/2025-09-29-anthropic-claude-sonnet-4-5-is-in-public-preview-for-github-copilot/"&gt;GitHub Copilot&lt;/a&gt; and no doubt a whole bunch of other places as well.&lt;/p&gt;
&lt;p&gt;Anthropic also shipped a &lt;a href="https://marketplace.visualstudio.com/items?itemName=anthropic.claude-code"&gt;new Claude Code VS Code extension&lt;/a&gt; today, plus a big upgrade to the Claude Code terminal app. Plus they rebranded their confusingly named Claude Code SDK to the &lt;a href="https://docs.claude.com/en/api/agent-sdk/overview"&gt;Claude Agent SDK&lt;/a&gt; instead, emphasizing that it's a tool for building agents beyond just customizing the existing Claude Code product. That's available for both &lt;a href="https://docs.claude.com/en/api/agent-sdk/typescript"&gt;TypeScript&lt;/a&gt; and &lt;a href="https://docs.claude.com/en/api/agent-sdk/python"&gt;Python&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/code-interpreter"&gt;code-interpreter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-tool-use"&gt;llm-tool-use&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="anthropic"/><category term="claude"/><category term="code-interpreter"/><category term="llm-tool-use"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/></entry><entry><title>Quoting Scott Aaronson</title><link href="https://simonwillison.net/2025/Sep/29/scott-aaronson/#atom-tag" rel="alternate"/><published>2025-09-29T00:52:26+00:00</published><updated>2025-09-29T00:52:26+00:00</updated><id>https://simonwillison.net/2025/Sep/29/scott-aaronson/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://scottaaronson.blog/?p=9183"&gt;&lt;p&gt;Given a week or two to try out ideas and search the literature, I’m pretty sure that Freek and I could’ve solved this problem ourselves. Instead, though, I simply asked GPT5-Thinking. After five minutes, it gave me something confident, plausible-looking, and (I could tell) wrong. But rather than laughing at the silly AI like a skeptic might do, I &lt;em&gt;told&lt;/em&gt; GPT5 how I knew it was wrong. It thought some more, apologized, and tried again, and gave me something better. So it went for a few iterations, much like interacting with a grad student or colleague. [...]&lt;/p&gt;
&lt;p&gt;Now, in September 2025, I’m here to tell you that AI has finally come for what my experience tells me is the most quintessentially human of all human intellectual activities: namely, proving oracle separations between quantum complexity classes. Right now, it almost certainly &lt;em&gt;can’t&lt;/em&gt; write the whole research paper (at least if you want it to be correct and good), but it can help you get unstuck if you otherwise know what you’re doing, which you might call a sweet spot.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://scottaaronson.blog/?p=9183"&gt;Scott Aaronson&lt;/a&gt;, UT Austin Quantum Information Center&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/gpt-5"&gt;gpt-5&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/quantum-computing"&gt;quantum-computing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;&lt;/p&gt;



</summary><category term="gpt-5"/><category term="quantum-computing"/><category term="generative-ai"/><category term="llm-reasoning"/><category term="ai"/><category term="llms"/></entry><entry><title>Improved Gemini 2.5 Flash and Flash-Lite</title><link href="https://simonwillison.net/2025/Sep/25/improved-gemini-25-flash-and-flash-lite/#atom-tag" rel="alternate"/><published>2025-09-25T19:27:43+00:00</published><updated>2025-09-25T19:27:43+00:00</updated><id>https://simonwillison.net/2025/Sep/25/improved-gemini-25-flash-and-flash-lite/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://developers.googleblog.com/en/continuing-to-bring-you-our-latest-models-with-an-improved-gemini-2-5-flash-and-flash-lite-release/"&gt;Improved Gemini 2.5 Flash and Flash-Lite&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Two new preview models from Google - updates to their fast and inexpensive Flash and Flash Lite families:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The latest version of Gemini 2.5 Flash-Lite was trained and built based on three key themes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Better instruction following&lt;/strong&gt;: The model is significantly better at following complex instructions and system prompts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reduced verbosity&lt;/strong&gt;: It now produces more concise answers, a key factor in reducing token costs and latency for high-throughput applications (see charts above).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stronger multimodal &amp;amp; translation capabilities&lt;/strong&gt;: This update features more accurate audio transcription, better image understanding, and improved translation quality.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;[...]&lt;/p&gt;
&lt;p&gt;This latest 2.5 Flash model comes with improvements in two key areas we heard consistent feedback on:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Better agentic tool use&lt;/strong&gt;: We've improved how the model uses tools, leading to better performance in more complex, agentic and multi-step applications. This model shows noticeable improvements on key agentic benchmarks, including a 5% gain on SWE-Bench Verified, compared to our last release (48.9% → 54%).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;More efficient&lt;/strong&gt;: With thinking on, the model is now significantly more cost-efficient—achieving higher quality outputs while using fewer tokens, reducing latency and cost (see charts above).&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;They also added two new convenience model IDs: &lt;code&gt;gemini-flash-latest&lt;/code&gt; and &lt;code&gt;gemini-flash-lite-latest&lt;/code&gt;, which will always resolve to the most recent model in that family.&lt;/p&gt;
&lt;p&gt;I released &lt;a href="https://github.com/simonw/llm-gemini/releases/tag/0.26"&gt;llm-gemini 0.26&lt;/a&gt; adding support for the new models and new aliases. I also used the &lt;code&gt;response.set_resolved_model()&lt;/code&gt; method &lt;a href="https://github.com/simonw/llm/issues/1117"&gt;added in LLM 0.27&lt;/a&gt; to ensure that the correct model ID would be recorded for those &lt;code&gt;-latest&lt;/code&gt; uses.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install -U llm-gemini
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Both of these models support optional reasoning tokens. I had them draw me pelicans riding bicycles in both thinking and non-thinking mode, using commands that looked like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m gemini-2.5-flash-preview-09-2025 -o thinking_budget 4000 "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I then got each model to describe the image it had drawn using commands like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -a https://static.simonwillison.net/static/2025/gemini-2.5-flash-preview-09-2025-thinking.png -m gemini-2.5-flash-preview-09-2025 -o thinking_budget 2000 'Detailed single line alt text for this image'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;a href="https://gist.github.com/simonw/e9dc9c18008106b4ae2e0be287709f5c"&gt;&lt;strong&gt;gemini-2.5-flash-preview-09-2025-thinking&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="" src="https://static.simonwillison.net/static/2025/gemini-2.5-flash-preview-09-2025-thinking.png" /&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A minimalist stick figure graphic depicts a person with a white oval body and a dot head cycling a gray bicycle, carrying a large, bright yellow rectangular box resting high on their back.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://gist.github.com/simonw/e357eac5f12e995a6dcb50711241a478"&gt;&lt;strong&gt;gemini-2.5-flash-preview-09-2025&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="" src="https://static.simonwillison.net/static/2025/gemini-2.5-flash-preview-09-2025.png" /&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A simple cartoon drawing of a pelican riding a bicycle, with the text "A Pelican Riding a Bicycle" above it.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://gist.github.com/simonw/29aff037b58fe62baf5a3cb7cf3b0ca9"&gt;&lt;strong&gt;gemini-2.5-flash-lite-preview-09-2025-thinking&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="" src="https://static.simonwillison.net/static/2025/gemini-2.5-flash-lite-preview-09-2025-thinking.png" /&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A quirky, simplified cartoon illustration of a white bird with a round body, black eye, and bright yellow beak, sitting astride a dark gray, two-wheeled vehicle with its peach-colored feet dangling below.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://gist.github.com/simonw/0eb5b9dc5515657a0a3c9d16bb5d46f6"&gt;&lt;strong&gt;gemini-2.5-flash-lite-preview-09-2025&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="" src="https://static.simonwillison.net/static/2025/gemini-2.5-flash-lite-preview-09-2025.png" /&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A minimalist, side-profile illustration of a stylized yellow chick or bird character riding a dark-wheeled vehicle on a green strip against a white background.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Artificial Analysis posted &lt;a href="https://twitter.com/ArtificialAnlys/status/1971273380335845683"&gt;a detailed review&lt;/a&gt;, including these interesting notes about reasoning efficiency and speed:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;In reasoning mode, Gemini 2.5 Flash and Flash-Lite Preview 09-2025 are more token-efficient, using fewer output tokens than their predecessors to run the Artificial Analysis Intelligence Index. Gemini 2.5 Flash-Lite Preview 09-2025 uses 50% fewer output tokens than its predecessor, while Gemini 2.5 Flash Preview 09-2025 uses 24% fewer output tokens.&lt;/li&gt;
&lt;li&gt;Google Gemini 2.5 Flash-Lite Preview 09-2025 (Reasoning) is ~40% faster than the prior July release, delivering ~887 output tokens/s on Google AI Studio in our API endpoint performance benchmarking. This makes the new Gemini 2.5 Flash-Lite the fastest proprietary model we have benchmarked on the Artificial Analysis website&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=45375845"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/artificial-analysis"&gt;artificial-analysis&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="llms"/><category term="llm"/><category term="gemini"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/><category term="artificial-analysis"/></entry><entry><title>GPT-5-Codex</title><link href="https://simonwillison.net/2025/Sep/23/gpt-5-codex/#atom-tag" rel="alternate"/><published>2025-09-23T23:59:20+00:00</published><updated>2025-09-23T23:59:20+00:00</updated><id>https://simonwillison.net/2025/Sep/23/gpt-5-codex/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://platform.openai.com/docs/models/gpt-5-codex"&gt;GPT-5-Codex&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;OpenAI &lt;a href="https://simonwillison.net/2025/Sep/15/gpt-5-codex/"&gt;half-released this model&lt;/a&gt; earlier this month, adding it to their Codex CLI tool but not their API.&lt;/p&gt;
&lt;p&gt;Today they've fixed that - the new model can now be accessed as &lt;code&gt;gpt-5-codex&lt;/code&gt;. It's priced the same as regular GPT-5: $1.25/million input tokens, $10/million output tokens, and the same hefty 90% discount for previously cached input tokens, which is especially important for agentic tool-using workflows that quickly build up a lengthy conversation.&lt;/p&gt;
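&lt;p&gt;As a rough sketch (my own arithmetic, using the rates quoted above) of why that cache discount matters so much for agentic loops, here's the cost math in Python, assuming cached input tokens are billed at 10% of the normal input rate:&lt;/p&gt;

```python
# Cost sketch using the rates quoted above: $1.25/M input, $10/M output,
# and a 90% discount on cached input - i.e. cached tokens billed at 10%
# of the input rate (my reading of the discount, not OpenAI's exact math).
def cost_usd(input_tokens, output_tokens, cached_tokens=0):
    uncached = input_tokens - cached_tokens
    return (
        uncached * 1.25 / 1_000_000
        + cached_tokens * 1.25 * 0.10 / 1_000_000
        + output_tokens * 10 / 1_000_000
    )

# An agentic loop resends the whole conversation every turn, so after the
# first call most input tokens are cache hits.
print(f"uncached:   ${cost_usd(100_000, 2_000):.4f}")
print(f"90% cached: ${cost_usd(100_000, 2_000, cached_tokens=90_000):.4f}")
```

&lt;p&gt;Under these assumptions, resending a mostly-cached 100,000 token conversation costs around 4.4 cents per turn instead of 14.5 cents.&lt;/p&gt;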
&lt;p&gt;It's only available via their Responses API, which means you currently need to install the &lt;a href="https://github.com/simonw/llm-openai-plugin"&gt;llm-openai-plugin&lt;/a&gt; to use it with LLM:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install -U llm-openai-plugin
llm -m openai/gpt-5-codex -T llm_version 'What is the LLM version?'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Outputs:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The installed LLM version is 0.27.1.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I added &lt;a href="https://llm.datasette.io/en/stable/tools.html"&gt;tool support&lt;/a&gt; to that plugin today, &lt;a href="https://github.com/simonw/llm-openai-plugin/issues/20#issuecomment-3325921197"&gt;mostly authored by GPT-5 Codex itself&lt;/a&gt; using OpenAI's Codex CLI.&lt;/p&gt;
&lt;p&gt;The new &lt;a href="https://cookbook.openai.com/examples/gpt-5-codex_prompting_guide"&gt;prompting guide for GPT-5-Codex&lt;/a&gt; is worth a read.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;GPT-5-Codex is purpose-built for Codex CLI, the Codex IDE extension, the Codex cloud environment, and working in GitHub, and also supports versatile tool use. We recommend using GPT-5-Codex only for agentic and interactive coding use cases.&lt;/p&gt;
&lt;p&gt;Because the model is trained specifically for coding, many best practices you once had to prompt into general purpose models are built in, and over prompting can reduce quality.&lt;/p&gt;
&lt;p&gt;The core prompting principle for GPT-5-Codex is &lt;strong&gt;“less is more.”&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I &lt;a href="https://gist.github.com/simonw/b371949ae984b0431848cd16cba24b27"&gt;tried my pelican benchmark&lt;/a&gt; at a cost of &lt;a href="https://www.llm-prices.com/#it=16&amp;amp;ot=2154&amp;amp;ic=1.25&amp;amp;oc=10"&gt;2.156 cents&lt;/a&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m openai/gpt-5-codex "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;
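&lt;p&gt;That 2.156 cent figure is simple arithmetic over the token counts in the llm-prices.com link (16 input tokens, 2,154 output tokens):&lt;/p&gt;

```python
# Reproduce the cost figure: 16 input tokens at $1.25/M plus
# 2,154 output tokens at $10/M, converted to cents.
input_cost = 16 * 1.25 / 1_000_000
output_cost = 2_154 * 10 / 1_000_000
total_cents = (input_cost + output_cost) * 100
print(f"{total_cents:.3f} cents")  # prints "2.156 cents"
```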
&lt;p&gt;&lt;img alt="See description below" src="https://static.simonwillison.net/static/2025/gpt-5-codex-api-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;I asked Codex to describe this image and it correctly identified it as a pelican!&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m openai/gpt-5-codex -a https://static.simonwillison.net/static/2025/gpt-5-codex-api-pelican.png \
  -s 'Write very detailed alt text'
&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote&gt;
&lt;p&gt;Cartoon illustration of a cream-colored pelican with a large orange beak and tiny black eye riding a minimalist dark-blue bicycle. The bird’s wings are tucked in, its legs resemble orange stick limbs pushing the pedals, and its tail feathers trail behind with light blue motion streaks to suggest speed. A small coral-red tongue sticks out of the pelican’s beak. The bicycle has thin light gray spokes, and the background is a simple pale blue gradient with faint curved lines hinting at ground and sky.&lt;/p&gt;
&lt;/blockquote&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-engineering"&gt;prompt-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-5"&gt;gpt-5&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/codex-cli"&gt;codex-cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-codex"&gt;gpt-codex&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="openai"/><category term="prompt-engineering"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/><category term="gpt-5"/><category term="codex-cli"/><category term="gpt-codex"/></entry><entry><title>Qwen3-VL: Sharper Vision, Deeper Thought, Broader Action</title><link href="https://simonwillison.net/2025/Sep/23/qwen3-vl/#atom-tag" rel="alternate"/><published>2025-09-23T23:51:08+00:00</published><updated>2025-09-23T23:51:08+00:00</updated><id>https://simonwillison.net/2025/Sep/23/qwen3-vl/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://qwen.ai/blog?id=99f0335c4ad9ff6153e517418d48535ab6d8afef&amp;amp;from=research.latest-advancements-list"&gt;Qwen3-VL: Sharper Vision, Deeper Thought, Broader Action&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I've been looking forward to this. Qwen 2.5 VL is one of the best available open weight vision LLMs, so I had high hopes for Qwen 3's vision models.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Firstly, we are open-sourcing the flagship model of this series: Qwen3-VL-235B-A22B, available in both Instruct and Thinking versions. The Instruct version matches or even exceeds Gemini 2.5 Pro in major visual perception benchmarks. The Thinking version achieves state-of-the-art results across many multimodal reasoning benchmarks.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Bold claims against Gemini 2.5 Pro, which are supported by a flurry of self-reported benchmarks.&lt;/p&gt;
&lt;p&gt;This initial model is &lt;em&gt;enormous&lt;/em&gt;. On Hugging Face both &lt;a href="https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct"&gt;Qwen3-VL-235B-A22B-Instruct&lt;/a&gt; and &lt;a href="https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Thinking"&gt;Qwen3-VL-235B-A22B-Thinking&lt;/a&gt; are 235B parameters and weigh 471 GB. Not something I'm going to be able to run on my 64GB Mac!&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://huggingface.co/collections/Qwen/qwen25-vl-6795ffac22b334a837c0f9a5"&gt;Qwen 2.5 VL family&lt;/a&gt; included models at 72B, 32B, 7B and 3B sizes. Given the rate Qwen are shipping models at the moment I wouldn't be surprised to see smaller Qwen 3 VL models show up in just the next few days.&lt;/p&gt;
&lt;p&gt;Also from Qwen today, three new API-only closed-weight models: &lt;a href="https://x.com/Alibaba_Qwen/status/1970582211993927774"&gt;upgraded Qwen 3 Coder&lt;/a&gt;, &lt;a href="https://qwen.ai/blog?id=4266edf7f3718f2d3fda098b3f4c48f3573215d0&amp;amp;from=home.latest-research-list"&gt;Qwen3-LiveTranslate-Flash&lt;/a&gt; (real-time multimodal interpretation), and &lt;a href="https://qwen.ai/blog?id=241398b9cd6353de490b0f82806c7848c5d2777d&amp;amp;from=research.latest-advancements-list"&gt;Qwen3-Max&lt;/a&gt;, their new trillion parameter flagship model, which they describe as their "largest and most capable model to date".&lt;/p&gt;
&lt;p&gt;Plus &lt;a href="https://twitter.com/Alibaba_Qwen/status/1970510193537753397"&gt;Qwen3Guard&lt;/a&gt;, a "safety moderation model series" that looks similar in purpose to Meta's &lt;a href="https://www.llama.com/docs/model-cards-and-prompt-formats/llama-guard-3/"&gt;Llama Guard&lt;/a&gt;. This one is open weights (Apache 2.0) and comes in 8B, 4B and 0.6B sizes &lt;a href="https://huggingface.co/collections/Qwen/qwen3guard-68d2729abbfae4716f3343a1"&gt;on Hugging Face&lt;/a&gt;. There's more information in the &lt;a href="https://github.com/QwenLM/Qwen3Guard"&gt;QwenLM/Qwen3Guard&lt;/a&gt; GitHub repo.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=45352672"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="vision-llms"/><category term="qwen"/><category term="llm-reasoning"/><category term="llm-release"/><category term="ai-in-china"/></entry><entry><title>llm-openrouter 0.5</title><link href="https://simonwillison.net/2025/Sep/21/llm-openrouter/#atom-tag" rel="alternate"/><published>2025-09-21T00:24:05+00:00</published><updated>2025-09-21T00:24:05+00:00</updated><id>https://simonwillison.net/2025/Sep/21/llm-openrouter/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/llm-openrouter/releases/tag/0.5"&gt;llm-openrouter 0.5&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;New release of my &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; plugin for accessing models made available via &lt;a href="https://openrouter.ai/"&gt;OpenRouter&lt;/a&gt;. The release notes in full:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Support for &lt;a href="https://llm.datasette.io/en/stable/tools.html"&gt;tool calling&lt;/a&gt;. Thanks, &lt;a href="https://github.com/jamessanford"&gt;James Sanford&lt;/a&gt;. &lt;a href="https://github.com/simonw/llm-openrouter/pull/43"&gt;#43&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Support for reasoning options, for example &lt;code&gt;llm -m openrouter/openai/gpt-5 'prove dogs exist' -o reasoning_effort medium&lt;/code&gt;. &lt;a href="https://github.com/simonw/llm-openrouter/issues/45"&gt;#45&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Tool calling is a really big deal, as it means you can now use the plugin to try out tools (and &lt;a href="https://simonwillison.net/2025/Sep/18/agents/"&gt;build agents, if you like&lt;/a&gt;) against any of the 179 tool-enabled models on that platform:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-openrouter
llm keys set openrouter
# Paste key here
llm models --tools | grep 'OpenRouter:' | wc -l
# Outputs 179
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Quite a few of the models hosted on OpenRouter can be accessed for free. Here's a tool-usage example using the &lt;a href="https://github.com/simonw/llm-tools-datasette"&gt;llm-tools-datasette plugin&lt;/a&gt; against the new &lt;a href="https://simonwillison.net/2025/Sep/20/grok-4-fast/"&gt;Grok 4 Fast model&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-tools-datasette
llm -m openrouter/x-ai/grok-4-fast:free -T 'Datasette("https://datasette.io/content")' 'Count available plugins'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Outputs:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;There are 154 available plugins.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://gist.github.com/simonw/43c56203887dd0d07351443a2ba18f29"&gt;The output&lt;/a&gt; of &lt;code&gt;llm logs -cu&lt;/code&gt; shows the tool calls and SQL queries it executed to get that result.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-tool-use"&gt;llm-tool-use&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;&lt;/p&gt;



</summary><category term="projects"/><category term="ai"/><category term="datasette"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="llm-tool-use"/><category term="llm-reasoning"/><category term="openrouter"/></entry><entry><title>Grok 4 Fast</title><link href="https://simonwillison.net/2025/Sep/20/grok-4-fast/#atom-tag" rel="alternate"/><published>2025-09-20T23:59:33+00:00</published><updated>2025-09-20T23:59:33+00:00</updated><id>https://simonwillison.net/2025/Sep/20/grok-4-fast/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://x.ai/news/grok-4-fast"&gt;Grok 4 Fast&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;New hosted vision-enabled reasoning model from xAI that's designed to be fast and extremely competitive on price. It has a 2 million token context window and "was trained end-to-end with tool-use reinforcement learning".&lt;/p&gt;
&lt;p&gt;It's priced at $0.20/million input tokens and $0.50/million output tokens - 15x less than Grok 4 (which is $3/million input and $15/million output). That puts it cheaper than GPT-5 mini and Gemini 2.5 Flash on &lt;a href="https://www.llm-prices.com/"&gt;llm-prices.com&lt;/a&gt;.&lt;/p&gt;
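&lt;p&gt;A quick sanity check of those numbers (prices as quoted above, in dollars per million tokens) shows the 15x figure holds for input, while output is actually 30x cheaper:&lt;/p&gt;

```python
# Price comparison from the figures above (USD per million tokens).
grok4 = {"input": 3.00, "output": 15.00}
grok4_fast = {"input": 0.20, "output": 0.50}

for kind in ("input", "output"):
    ratio = grok4[kind] / grok4_fast[kind]
    print(f"{kind}: {ratio:.0f}x cheaper")
```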
&lt;p&gt;The same model weights handle reasoning and non-reasoning based on a parameter passed to the model.&lt;/p&gt;
&lt;p&gt;I've been trying it out via my updated &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt; plugin, since Grok 4 Fast is available &lt;a href="https://openrouter.ai/x-ai/grok-4-fast"&gt;for free on OpenRouter&lt;/a&gt; for a limited period.&lt;/p&gt;
&lt;p&gt;Here's output from the &lt;a href="https://gist.github.com/simonw/7f9a5e5c780b1d5bfe98b4f4ad540551"&gt;non-reasoning model&lt;/a&gt;. This actually output an invalid SVG - I had to make &lt;a href="https://gist.github.com/simonw/7f9a5e5c780b1d5bfe98b4f4ad540551?permalink_comment_id=5768049#gistcomment-5768049"&gt;a tiny manual tweak&lt;/a&gt; to the XML to get it to render.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m openrouter/x-ai/grok-4-fast:free "Generate an SVG of a pelican riding a bicycle" -o reasoning_enabled false
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="Described by Grok 4 Fast: Simple line drawing of a white bird with a long yellow beak riding a bicycle, pedaling with its orange legs." src="https://static.simonwillison.net/static/2025/grok-4-no-reasoning.png" /&gt;&lt;/p&gt;
&lt;p&gt;(I initially ran this without that &lt;code&gt;-o reasoning_enabled false&lt;/code&gt; flag, but then I saw that &lt;a href="https://x.com/OpenRouterAI/status/1969427723098435738"&gt;OpenRouter enable reasoning by default&lt;/a&gt; for that model. Here's my &lt;a href="https://gist.github.com/simonw/6a52e6585cb3c45e64ae23b9c5ebafe9"&gt;previous invalid result&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;And &lt;a href="https://gist.github.com/simonw/539719a1495253bbd27f3107931e6dd3"&gt;the reasoning model&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m openrouter/x-ai/grok-4-fast:free "Generate an SVG of a pelican riding a bicycle" -o reasoning_enabled true
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="Described by Grok 4 Fast: A simple line drawing of a white pelican with a yellow beak holding a yellow object, riding a black bicycle on green grass under a blue sky with white clouds." src="https://static.simonwillison.net/static/2025/grok-4-fast-reasoning.png" /&gt;&lt;/p&gt;
&lt;p&gt;In related news, the New York Times had a story a couple of days ago about Elon's recent focus on xAI: &lt;a href="https://www.nytimes.com/2025/09/18/technology/elon-musk-artificial-intelligence-xai.html"&gt;Since Leaving Washington, Elon Musk Has Been All In on His A.I. Company&lt;/a&gt;.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/grok"&gt;grok&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xai"&gt;xai&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="vision-llms"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="grok"/><category term="llm-release"/><category term="openrouter"/><category term="xai"/></entry><entry><title>Magistral 1.2</title><link href="https://simonwillison.net/2025/Sep/19/magistral/#atom-tag" rel="alternate"/><published>2025-09-19T19:13:45+00:00</published><updated>2025-09-19T19:13:45+00:00</updated><id>https://simonwillison.net/2025/Sep/19/magistral/#atom-tag</id><summary type="html">
    &lt;p&gt;Mistral &lt;a href="https://twitter.com/MistralAI/status/1968670593412190381"&gt;quietly released&lt;/a&gt; two new models yesterday: &lt;a href="https://huggingface.co/mistralai/Magistral-Small-2509"&gt;Magistral Small 1.2&lt;/a&gt; (Apache 2.0, 
96.1 GB on Hugging Face) and Magistral Medium 1.2 (not open weights, same as Mistral's other "medium" models).&lt;/p&gt;
&lt;p&gt;Despite being described as "minor updates" to the Magistral 1.1 models these have one very notable improvement:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Multimodality: Now equipped with a vision encoder, these models handle both text and images seamlessly.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Magistral is Mistral's reasoning model, so we now have a new reasoning vision LLM.&lt;/p&gt;
&lt;p&gt;The other features from the tiny announcement on Twitter:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Performance Boost: 15% improvements on math and coding benchmarks such as AIME 24/25 and LiveCodeBench v5/v6.&lt;/li&gt;
&lt;li&gt;Smarter Tool Use: Better tool usage with web search, code interpreter, and image generation.&lt;/li&gt;
&lt;li&gt;Better Tone &amp;amp; Persona: Responses are clearer, more natural, and better formatted for you.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mistral"&gt;mistral&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;&lt;/p&gt;



</summary><category term="vision-llms"/><category term="llm-release"/><category term="mistral"/><category term="generative-ai"/><category term="llm-reasoning"/><category term="ai"/><category term="llms"/></entry><entry><title>ICPC medals for OpenAI and Gemini</title><link href="https://simonwillison.net/2025/Sep/17/icpc/#atom-tag" rel="alternate"/><published>2025-09-17T22:52:10+00:00</published><updated>2025-09-17T22:52:10+00:00</updated><id>https://simonwillison.net/2025/Sep/17/icpc/#atom-tag</id><summary type="html">
    &lt;p&gt;In July it was the International Math Olympiad (&lt;a href="https://simonwillison.net/2025/Jul/19/openai-gold-medal-math-olympiad/"&gt;OpenAI&lt;/a&gt;, &lt;a href="https://simonwillison.net/2025/Jul/21/gemini-imo/"&gt;Gemini&lt;/a&gt;), today it's the &lt;a href="https://en.m.wikipedia.org/wiki/International_Collegiate_Programming_Contest"&gt;International Collegiate Programming Contest (ICPC)&lt;/a&gt;. Once again, both OpenAI and Gemini competed with models that achieved Gold medal performance.&lt;/p&gt;
&lt;p&gt;OpenAI's &lt;a href="https://twitter.com/mostafarohani/status/1968361152741826849"&gt;Mostafa Rohaninejad&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We received the problems in the exact same PDF form, and the reasoning system selected which answers to submit with no bespoke test-time harness whatsoever. For 11 of the 12 problems, the system’s first answer was correct. For the hardest problem, it succeeded on the 9th submission. Notably, the best human team achieved 11/12.&lt;/p&gt;
&lt;p&gt;We competed with an ensemble of general-purpose reasoning models; we did not train any model specifically for the ICPC. We had both GPT-5 and an experimental reasoning model generating solutions, and the experimental reasoning model selecting which solutions to submit. GPT-5 answered 11 correctly, and the last (and most difficult problem) was solved by the experimental reasoning model.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And here's &lt;a href="https://deepmind.google/discover/blog/gemini-achieves-gold-level-performance-at-the-international-collegiate-programming-contest-world-finals/"&gt;the blog post&lt;/a&gt; by Google DeepMind's Hanzhao (Maggie) Lin and Heng-Tze Cheng:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;An advanced version of Gemini 2.5 Deep Think competed live in a remote online environment following &lt;a href="https://icpc.global/worldfinals/rules"&gt;ICPC rules&lt;/a&gt;, under the guidance of the competition organizers. It started 10 minutes after the human contestants and correctly solved 10 out of 12 problems, achieving gold-medal level performance under the same five-hour time constraint. See our solutions &lt;a href="https://github.com/google-deepmind/gemini_icpc2025"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I'm still trying to confirm if the models had access to tools in order to execute the code they were writing. The IMO results in July were both achieved without tools.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update 27th September 2025&lt;/strong&gt;: OpenAI researcher  Ahmed El-Kishky &lt;a href="https://twitter.com/ahelkky/status/1971652614950736194"&gt;confirms&lt;/a&gt; that OpenAI's model had a code execution environment but no internet:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;For OpenAI, the models had access to a code execution sandbox, so they could compile and test out their solutions. That was it though; no internet access.&lt;/p&gt;
&lt;/blockquote&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;&lt;/p&gt;



</summary><category term="gemini"/><category term="llm-reasoning"/><category term="google"/><category term="generative-ai"/><category term="openai"/><category term="ai"/><category term="llms"/></entry><entry><title>Qwen3-Next-80B-A3B: 🐧🦩 Who needs legs?!</title><link href="https://simonwillison.net/2025/Sep/12/qwen3-next/#atom-tag" rel="alternate"/><published>2025-09-12T04:07:32+00:00</published><updated>2025-09-12T04:07:32+00:00</updated><id>https://simonwillison.net/2025/Sep/12/qwen3-next/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://x.com/Alibaba_Qwen/status/1966197643904000262"&gt;Qwen3-Next-80B-A3B&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Qwen announced two new models via their Twitter account (and here's &lt;a href="https://qwen.ai/blog?id=4074cca80393150c248e508aa62983f9cb7d27cd&amp;amp;from=research.latest-advancements-list"&gt;their blog&lt;/a&gt;): &lt;a href="https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct"&gt;Qwen3-Next-80B-A3B-Instruct&lt;/a&gt; and &lt;a href="https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Thinking"&gt;Qwen3-Next-80B-A3B-Thinking&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;They make some big claims on performance:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Qwen3-Next-80B-A3B-Instruct approaches our 235B flagship.&lt;/li&gt;
&lt;li&gt;Qwen3-Next-80B-A3B-Thinking outperforms Gemini-2.5-Flash-Thinking.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;The name "80B-A3B" indicates 80 billion parameters of which only 3 billion are active at a time. You still need to have enough GPU-accessible RAM to hold all 80 billion in memory at once but only 3 billion will be used for each round of inference, which provides a &lt;em&gt;significant&lt;/em&gt; speedup in responding to prompts.&lt;/p&gt;
&lt;p&gt;More details from their tweet:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;80B params, but only 3B activated per token → 10x cheaper training, 10x faster inference than Qwen3-32B.(esp. @ 32K+ context!)&lt;/li&gt;
&lt;li&gt;Hybrid Architecture: Gated DeltaNet + Gated Attention → best of speed &amp;amp; recall&lt;/li&gt;
&lt;li&gt;Ultra-sparse MoE: 512 experts, 10 routed + 1 shared&lt;/li&gt;
&lt;li&gt;Multi-Token Prediction → turbo-charged speculative decoding&lt;/li&gt;
&lt;li&gt;Beats Qwen3-32B in perf, rivals Qwen3-235B in reasoning &amp;amp; long-context&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;The models on Hugging Face are around 150GB each so I decided to try them out via &lt;a href="https://openrouter.ai/"&gt;OpenRouter&lt;/a&gt; rather than on my own laptop (&lt;a href="https://openrouter.ai/qwen/qwen3-next-80b-a3b-thinking"&gt;Thinking&lt;/a&gt;, &lt;a href="https://openrouter.ai/qwen/qwen3-next-80b-a3b-instruct"&gt;Instruct&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;I used my &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt; plugin. I installed it like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-openrouter
llm keys set openrouter
# paste key here
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then found the model IDs with this command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm models -q next
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Which output:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;OpenRouter: openrouter/qwen/qwen3-next-80b-a3b-thinking
OpenRouter: openrouter/qwen/qwen3-next-80b-a3b-instruct
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I have an LLM &lt;a href="https://llm.datasette.io/en/stable/templates.html"&gt;prompt template&lt;/a&gt; saved called &lt;code&gt;pelican-svg&lt;/code&gt; which I created like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm "Generate an SVG of a pelican riding a bicycle" --save pelican-svg
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This means I can run &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;my pelican benchmark&lt;/a&gt; like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -t pelican-svg -m openrouter/qwen/qwen3-next-80b-a3b-thinking
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -t pelican-svg -m openrouter/qwen/qwen3-next-80b-a3b-instruct
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's the &lt;a href="https://gist.github.com/simonw/d1a0d0ff719d609bc6fad2e133e7cbe9"&gt;thinking model output&lt;/a&gt; (exported with &lt;code&gt;llm logs -c | pbcopy&lt;/code&gt; after I ran the prompt):&lt;/p&gt;
&lt;p&gt;&lt;img alt="The bicycle is too simple and way too wide. The pelican is two circles, two orange triangular feed and a big triangle for the beak." src="https://static.simonwillison.net/static/2025/qwen3-next-80b-a3b-thinking.png" /&gt;&lt;/p&gt;
&lt;p&gt;I enjoyed the "Whimsical style with smooth curves and friendly proportions (no anatomical accuracy needed for bicycle riding!)" note in &lt;a href="https://gist.github.com/simonw/d1a0d0ff719d609bc6fad2e133e7cbe9#prompt"&gt;the transcript&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The instruct (non-reasoning) model &lt;a href="https://gist.github.com/simonw/cc740a45beed5655faffa69da1e999f5"&gt;gave me this&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Blue background, brown ground, bicycle looks more like a wheelchair, pelican is actually quite good though - has thin grey wings and a perky yellow long triangular beak. Above the pelican is the caption Who needs legs?! with an emoji sequence of penguin then flamingo." src="https://static.simonwillison.net/static/2025/qwen3-next-80b-a3b-instruct.png" /&gt;&lt;/p&gt;
&lt;p&gt;"🐧🦩 Who needs legs!?" indeed! I like that penguin-flamingo emoji sequence it's decided on for pelicans.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="qwen"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/><category term="openrouter"/><category term="ai-in-china"/></entry><entry><title>Recreating the Apollo AI adoption rate chart with GPT-5, Python and Pyodide</title><link href="https://simonwillison.net/2025/Sep/9/apollo-ai-adoption/#atom-tag" rel="alternate"/><published>2025-09-09T06:47:49+00:00</published><updated>2025-09-09T06:47:49+00:00</updated><id>https://simonwillison.net/2025/Sep/9/apollo-ai-adoption/#atom-tag</id><summary type="html">
    &lt;p&gt;Apollo Global Management's "Chief Economist" Dr. Torsten Sløk released &lt;a href="https://www.apolloacademy.com/ai-adoption-rate-trending-down-for-large-companies/"&gt;this interesting chart&lt;/a&gt; which appears to show a slowdown in AI adoption rates among large (&amp;gt;250 employees) companies:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/apollo-ai-chart.jpg" alt="AI adoption rates starting to decline for larger firms. A chart of AI adoption rate by firm size. Includes lines for 250+, 100-249, 50-99, 20-49, 10-19, 5-8 and 1-4 sized organizations. Chart starts in November 2023 with percentages ranging from 3 to 5, then all groups grow through August 2025 albeit with the 250+ group having a higher score than the others. That 25+ group peaks in Jul5 2025 at around 14% and then appears to slope slightly downwards to 12% by August. Some of the other lines also start to tip down, though not as much." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Here's the full description that accompanied the chart:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The US Census Bureau conducts a biweekly survey of 1.2 million firms, and one question is whether a business has used AI tools such as machine learning, natural language processing, virtual agents or voice recognition to help produce goods or services in the past two weeks. Recent data by firm size shows that AI adoption has been declining among companies with more than 250 employees, see chart below.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;(My first thought on seeing that chart is that I hope it represents the &lt;em&gt;peak of inflated expectations&lt;/em&gt; leading into the &lt;em&gt;trough of disillusionment&lt;/em&gt; in the &lt;a href="https://en.wikipedia.org/wiki/Gartner_hype_cycle"&gt;Gartner Hype Cycle&lt;/a&gt; (which Wikipedia calls "largely disputed, with studies pointing to it being inconsistently true at best"), since that means we might be reaching the end of the initial hype phase and heading towards the &lt;em&gt;slope of enlightenment&lt;/em&gt;.)&lt;/p&gt;
&lt;h4 id="finding-the-us-census-data-with-gpt-5-search"&gt;Finding the US Census data with GPT-5 search&lt;/h4&gt;
&lt;p&gt;This is the first I'd heard of the US Census Bureau running a biweekly (that's once every two weeks) survey about AI!&lt;/p&gt;
&lt;p&gt;I decided to track down the numbers and see if I could recreate the chart myself. And since &lt;a href="https://simonwillison.net/2025/Sep/6/research-goblin/"&gt;GPT-5 is really good at search now&lt;/a&gt; I fed it the following prompt to see how well it could do:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;&amp;gt; The US Census Bureau conducts a biweekly survey of 1.2 million firms, and one question is whether a business has used AI tools such as machine learning, natural language processing, virtual agents or voice recognition to help produce goods or services in the past two weeks. Recent data by firm size shows that AI adoption has been declining among companies with more than 250 employees, see chart below.&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Find me that original data&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It took 1m19s to &lt;a href="https://chatgpt.com/share/68bf48cf-0e70-8006-a045-96fa8e7ddfc1"&gt;find exactly what I needed&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Here’s the original U.S. Census Bureau source for that series:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;BTOS “Employment size class” time series (XLSX)&lt;/strong&gt; — biweekly shares of firms answering &lt;strong&gt;Yes&lt;/strong&gt; to “In the last two weeks, did this business use Artificial Intelligence (AI) in producing goods or services?”, broken out by firm-size buckets (incl. &lt;strong&gt;250+ employees&lt;/strong&gt;). Coverage: &lt;strong&gt;Sep 11, 2023 → Aug 24, 2025&lt;/strong&gt;. (&lt;a href="https://www.census.gov/hfp/btos/data_downloads"&gt;Census.gov&lt;/a&gt;) [...]&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;That &lt;a href="https://www.census.gov/hfp/btos/data_downloads"&gt;Census page&lt;/a&gt; was not &lt;em&gt;at all&lt;/em&gt; obvious. Thankfully GPT-5 had tipped me off to the "Employment size class" file, this link here:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/census-page.jpg" alt="US Census website. Business Trends and Outlook Survey, Updated August 28, 2025. Current Data has 6 visible XLSX files with names like WFH Supplement, WFH Questions 27-29, National, Sectur, Subsector and Emplomyent size class. A red arrow highlights that last one." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;So I downloaded that file, and confirmed that it was indeed a spreadsheet containing the data I wanted (in among all sorts of other survey questions). Here's &lt;a href="https://static.simonwillison.net/static/cors-allow/2025/Employment-Size-Class-Sep-2025.xlsx"&gt;a 374KB XLSX copy&lt;/a&gt; of the file I downloaded.&lt;/p&gt;
&lt;h4 id="recreating-the-chart-with-gpt-5-code-interpreter"&gt;Recreating the chart with GPT-5 code interpreter&lt;/h4&gt;
&lt;p&gt;So what should I do with it now? I decided to see if GPT-5 could turn the spreadsheet back into that original chart, using Python running in its &lt;a href="https://simonwillison.net/tags/code-interpreter/"&gt;code interpreter&lt;/a&gt; tool.&lt;/p&gt;
&lt;p&gt;So I uploaded the XLSX file back to ChatGPT, dropped in a screenshot of the Apollo chart and prompted:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Use this data to recreate this chart using python&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/chart-prompt.jpg" alt="ChatGPT. I dropped in a screenshot of the chart, uploaded the spreadsheet which turned into an inline table browser UI and prompted it to recreate the chart using python." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I thought this was a pretty tall order, but it's always worth throwing big challenges at an LLM to learn from how well it does.&lt;/p&gt;
&lt;p&gt;It &lt;em&gt;really worked hard on this&lt;/em&gt;. I didn't time it exactly but it spent at least 7 minutes "reasoning" across 5 different thinking blocks, interspersed with over a dozen Python analysis sessions. It used &lt;code&gt;pandas&lt;/code&gt; and &lt;code&gt;numpy&lt;/code&gt; to explore the uploaded spreadsheet and find the right figures, then tried several attempts at plotting with &lt;code&gt;matplotlib&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;As far as I can tell GPT-5 in ChatGPT can now feed charts it creates back into its own vision model, because it appeared to render a broken (empty) chart and then keep on trying to get it working.&lt;/p&gt;
&lt;p&gt;It found a data dictionary in the last tab of the spreadsheet and used that to build a lookup table matching the letters &lt;code&gt;A&lt;/code&gt; through &lt;code&gt;G&lt;/code&gt; to the actual employee size buckets.&lt;/p&gt;
&lt;p&gt;At the end of the process it spat out this chart:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/recreated-chart-1.jpg" alt="matplotlib chart. The title is AI adoption rates starting to decline for larger firms, though there's a typography glitch in that title. It has a neat legend for the different size ranges, then a set of lines that look about right compared to the above graph - but they are more spiky and the numbers appear to trend up again at the end of the chart." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;At first glance I thought it had nailed it... but then I compared the chart more closely with the Apollo original and spotted some definite discrepancies. GPT-5's chart peaked at 14.5% but the highest value in Apollo's was more like 13.5%. The GPT-5 chart was spikier - and most interestingly it included a clear uptick in the last data point where Apollo's had trended downwards.&lt;/p&gt;
&lt;p&gt;I decided it was time to look at the actual data. I opened up the spreadsheet in Numbers, found the AI question columns and manually reviewed them. They seemed to match the GPT-5 chart results - so why the difference to Apollo's?&lt;/p&gt;
&lt;p&gt;Then I noticed a crucial detail in the Apollo chart that I had cropped out of my original screenshot!&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Note: Data is six-survey moving average.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So I told ChatGPT:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Do the first question, plot it as a six survey rolling average&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I asked for the first question because it turned out there were two that were relevant in the survey spreadsheet.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;In the last two weeks, did this business use Artificial Intelligence (AI) in producing goods or services? (Examples of AI: machine learning, natural language processing, virtual agents, voice recognition, etc.)&lt;/li&gt;
&lt;li&gt;During the next six months, do you think this business will be using Artificial Intelligence (AI) in producing goods or services? (Examples of AI: machine learning, natural language processing, virtual agents, voice recognition, etc.)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;It churned away for a little longer, added this code to the script:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-c"&gt;# Compute 6-survey rolling average (biweekly cadence → ~12 weeks)&lt;/span&gt;
&lt;span class="pl-s1"&gt;rolled&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;wide&lt;/span&gt;.&lt;span class="pl-c1"&gt;rolling&lt;/span&gt;(&lt;span class="pl-s1"&gt;window&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;6&lt;/span&gt;, &lt;span class="pl-s1"&gt;min_periods&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;6&lt;/span&gt;).&lt;span class="pl-c1"&gt;mean&lt;/span&gt;()&lt;/pre&gt;
&lt;p&gt;And popped out this chart (after I told it to fix the glitch in the title):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/recreated-chart-2.jpg" alt="Second chart. This time the lines are basically an exact match for the Apollo one." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I think it's done it! This is a very solid match for the Apollo original, recreated using &lt;code&gt;matplotlib&lt;/code&gt; and &lt;code&gt;pandas&lt;/code&gt; from the same underlying source data from the US Census.&lt;/p&gt;
&lt;p&gt;Here's the full Python code it wrote, which I think is quite readable (in as much as Pandas code can be):&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;pandas&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-s1"&gt;pd&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;matplotlib&lt;/span&gt;.&lt;span class="pl-s1"&gt;pyplot&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-s1"&gt;plt&lt;/span&gt;
&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;matplotlib&lt;/span&gt;.&lt;span class="pl-s1"&gt;ticker&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-v"&gt;PercentFormatter&lt;/span&gt;

&lt;span class="pl-s1"&gt;path&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"/mnt/data/Employment Size Class.xlsx"&lt;/span&gt;

&lt;span class="pl-s1"&gt;resp&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;pd&lt;/span&gt;.&lt;span class="pl-c1"&gt;read_excel&lt;/span&gt;(&lt;span class="pl-s1"&gt;path&lt;/span&gt;, &lt;span class="pl-s1"&gt;sheet_name&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"Response Estimates"&lt;/span&gt;)
&lt;span class="pl-s1"&gt;dates&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;pd&lt;/span&gt;.&lt;span class="pl-c1"&gt;read_excel&lt;/span&gt;(&lt;span class="pl-s1"&gt;path&lt;/span&gt;, &lt;span class="pl-s1"&gt;sheet_name&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"Collection and Reference Dates"&lt;/span&gt;)

&lt;span class="pl-s1"&gt;is_current&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;resp&lt;/span&gt;[&lt;span class="pl-s"&gt;"Question"&lt;/span&gt;].&lt;span class="pl-c1"&gt;astype&lt;/span&gt;(&lt;span class="pl-s1"&gt;str&lt;/span&gt;).&lt;span class="pl-c1"&gt;str&lt;/span&gt;.&lt;span class="pl-c1"&gt;strip&lt;/span&gt;().&lt;span class="pl-c1"&gt;str&lt;/span&gt;.&lt;span class="pl-c1"&gt;startswith&lt;/span&gt;(&lt;span class="pl-s"&gt;"In the last two weeks"&lt;/span&gt;)
&lt;span class="pl-s1"&gt;ai_yes&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;resp&lt;/span&gt;[&lt;span class="pl-s1"&gt;is_current&lt;/span&gt; &lt;span class="pl-c1"&gt;&amp;amp;&lt;/span&gt; &lt;span class="pl-s1"&gt;resp&lt;/span&gt;[&lt;span class="pl-s"&gt;"Answer"&lt;/span&gt;].&lt;span class="pl-c1"&gt;astype&lt;/span&gt;(&lt;span class="pl-s1"&gt;str&lt;/span&gt;).&lt;span class="pl-c1"&gt;str&lt;/span&gt;.&lt;span class="pl-c1"&gt;strip&lt;/span&gt;().&lt;span class="pl-c1"&gt;str&lt;/span&gt;.&lt;span class="pl-c1"&gt;lower&lt;/span&gt;().&lt;span class="pl-c1"&gt;eq&lt;/span&gt;(&lt;span class="pl-s"&gt;"yes"&lt;/span&gt;)].&lt;span class="pl-c1"&gt;copy&lt;/span&gt;()

&lt;span class="pl-s1"&gt;code_to_bucket&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; {&lt;span class="pl-s"&gt;"A"&lt;/span&gt;:&lt;span class="pl-s"&gt;"1-4"&lt;/span&gt;,&lt;span class="pl-s"&gt;"B"&lt;/span&gt;:&lt;span class="pl-s"&gt;"5-9"&lt;/span&gt;,&lt;span class="pl-s"&gt;"C"&lt;/span&gt;:&lt;span class="pl-s"&gt;"10-19"&lt;/span&gt;,&lt;span class="pl-s"&gt;"D"&lt;/span&gt;:&lt;span class="pl-s"&gt;"20-49"&lt;/span&gt;,&lt;span class="pl-s"&gt;"E"&lt;/span&gt;:&lt;span class="pl-s"&gt;"50-99"&lt;/span&gt;,&lt;span class="pl-s"&gt;"F"&lt;/span&gt;:&lt;span class="pl-s"&gt;"100-249"&lt;/span&gt;,&lt;span class="pl-s"&gt;"G"&lt;/span&gt;:&lt;span class="pl-s"&gt;"250 or more employees"&lt;/span&gt;}
&lt;span class="pl-s1"&gt;ai_yes&lt;/span&gt;[&lt;span class="pl-s"&gt;"Bucket"&lt;/span&gt;] &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;ai_yes&lt;/span&gt;[&lt;span class="pl-s"&gt;"Empsize"&lt;/span&gt;].&lt;span class="pl-c1"&gt;map&lt;/span&gt;(&lt;span class="pl-s1"&gt;code_to_bucket&lt;/span&gt;)

&lt;span class="pl-s1"&gt;period_cols&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; [&lt;span class="pl-s1"&gt;c&lt;/span&gt; &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;c&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;ai_yes&lt;/span&gt;.&lt;span class="pl-c1"&gt;columns&lt;/span&gt; &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-en"&gt;str&lt;/span&gt;(&lt;span class="pl-s1"&gt;c&lt;/span&gt;).&lt;span class="pl-c1"&gt;isdigit&lt;/span&gt;() &lt;span class="pl-c1"&gt;and&lt;/span&gt; &lt;span class="pl-en"&gt;len&lt;/span&gt;(&lt;span class="pl-en"&gt;str&lt;/span&gt;(&lt;span class="pl-s1"&gt;c&lt;/span&gt;))&lt;span class="pl-c1"&gt;==&lt;/span&gt;&lt;span class="pl-c1"&gt;6&lt;/span&gt;]
&lt;span class="pl-s1"&gt;long&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;ai_yes&lt;/span&gt;.&lt;span class="pl-c1"&gt;melt&lt;/span&gt;(&lt;span class="pl-s1"&gt;id_vars&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;[&lt;span class="pl-s"&gt;"Bucket"&lt;/span&gt;], &lt;span class="pl-s1"&gt;value_vars&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s1"&gt;period_cols&lt;/span&gt;, &lt;span class="pl-s1"&gt;var_name&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"Smpdt"&lt;/span&gt;, &lt;span class="pl-s1"&gt;value_name&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"value"&lt;/span&gt;)

&lt;span class="pl-s1"&gt;dates&lt;/span&gt;[&lt;span class="pl-s"&gt;"Smpdt"&lt;/span&gt;] &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;dates&lt;/span&gt;[&lt;span class="pl-s"&gt;"Smpdt"&lt;/span&gt;].&lt;span class="pl-c1"&gt;astype&lt;/span&gt;(&lt;span class="pl-s1"&gt;str&lt;/span&gt;)
&lt;span class="pl-s1"&gt;long&lt;/span&gt;[&lt;span class="pl-s"&gt;"Smpdt"&lt;/span&gt;] &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;long&lt;/span&gt;[&lt;span class="pl-s"&gt;"Smpdt"&lt;/span&gt;].&lt;span class="pl-c1"&gt;astype&lt;/span&gt;(&lt;span class="pl-s1"&gt;str&lt;/span&gt;)
&lt;span class="pl-s1"&gt;merged&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;long&lt;/span&gt;.&lt;span class="pl-c1"&gt;merge&lt;/span&gt;(&lt;span class="pl-s1"&gt;dates&lt;/span&gt;[[&lt;span class="pl-s"&gt;"Smpdt"&lt;/span&gt;,&lt;span class="pl-s"&gt;"Ref End"&lt;/span&gt;]], &lt;span class="pl-s1"&gt;on&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"Smpdt"&lt;/span&gt;, &lt;span class="pl-s1"&gt;how&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"left"&lt;/span&gt;)
&lt;span class="pl-s1"&gt;merged&lt;/span&gt;[&lt;span class="pl-s"&gt;"date"&lt;/span&gt;] &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;pd&lt;/span&gt;.&lt;span class="pl-c1"&gt;to_datetime&lt;/span&gt;(&lt;span class="pl-s1"&gt;merged&lt;/span&gt;[&lt;span class="pl-s"&gt;"Ref End"&lt;/span&gt;], &lt;span class="pl-s1"&gt;errors&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"coerce"&lt;/span&gt;)

&lt;span class="pl-s1"&gt;merged&lt;/span&gt;[&lt;span class="pl-s"&gt;"value"&lt;/span&gt;] &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;pd&lt;/span&gt;.&lt;span class="pl-c1"&gt;to_numeric&lt;/span&gt;(&lt;span class="pl-s1"&gt;long&lt;/span&gt;[&lt;span class="pl-s"&gt;"value"&lt;/span&gt;].&lt;span class="pl-c1"&gt;astype&lt;/span&gt;(&lt;span class="pl-s1"&gt;str&lt;/span&gt;).&lt;span class="pl-c1"&gt;str&lt;/span&gt;.&lt;span class="pl-c1"&gt;replace&lt;/span&gt;(&lt;span class="pl-s"&gt;"%"&lt;/span&gt;,&lt;span class="pl-s"&gt;""&lt;/span&gt;,&lt;span class="pl-s1"&gt;regex&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;False&lt;/span&gt;).&lt;span class="pl-c1"&gt;str&lt;/span&gt;.&lt;span class="pl-c1"&gt;strip&lt;/span&gt;(), &lt;span class="pl-s1"&gt;errors&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"coerce"&lt;/span&gt;)

&lt;span class="pl-s1"&gt;order&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; [&lt;span class="pl-s"&gt;"250 or more employees"&lt;/span&gt;,&lt;span class="pl-s"&gt;"100-249"&lt;/span&gt;,&lt;span class="pl-s"&gt;"50-99"&lt;/span&gt;,&lt;span class="pl-s"&gt;"20-49"&lt;/span&gt;,&lt;span class="pl-s"&gt;"10-19"&lt;/span&gt;,&lt;span class="pl-s"&gt;"5-9"&lt;/span&gt;,&lt;span class="pl-s"&gt;"1-4"&lt;/span&gt;]
&lt;span class="pl-s1"&gt;wide&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;merged&lt;/span&gt;.&lt;span class="pl-c1"&gt;pivot_table&lt;/span&gt;(&lt;span class="pl-s1"&gt;index&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"date"&lt;/span&gt;, &lt;span class="pl-s1"&gt;columns&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"Bucket"&lt;/span&gt;, &lt;span class="pl-s1"&gt;values&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"value"&lt;/span&gt;, &lt;span class="pl-s1"&gt;aggfunc&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"mean"&lt;/span&gt;).&lt;span class="pl-c1"&gt;sort_index&lt;/span&gt;()
&lt;span class="pl-s1"&gt;wide&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;wide&lt;/span&gt;[[&lt;span class="pl-s1"&gt;c&lt;/span&gt; &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;c&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;order&lt;/span&gt; &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-s1"&gt;c&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;wide&lt;/span&gt;.&lt;span class="pl-c1"&gt;columns&lt;/span&gt;]]
&lt;span class="pl-s1"&gt;rolled&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;wide&lt;/span&gt;.&lt;span class="pl-c1"&gt;rolling&lt;/span&gt;(&lt;span class="pl-s1"&gt;window&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;6&lt;/span&gt;, &lt;span class="pl-s1"&gt;min_periods&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;6&lt;/span&gt;).&lt;span class="pl-c1"&gt;mean&lt;/span&gt;()

&lt;span class="pl-s1"&gt;start&lt;/span&gt;, &lt;span class="pl-s1"&gt;end&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;pd&lt;/span&gt;.&lt;span class="pl-c1"&gt;Timestamp&lt;/span&gt;(&lt;span class="pl-s"&gt;"2023-11-01"&lt;/span&gt;), &lt;span class="pl-s1"&gt;pd&lt;/span&gt;.&lt;span class="pl-c1"&gt;Timestamp&lt;/span&gt;(&lt;span class="pl-s"&gt;"2025-08-31"&lt;/span&gt;)
&lt;span class="pl-s1"&gt;rolled_win&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;rolled&lt;/span&gt;.&lt;span class="pl-c1"&gt;loc&lt;/span&gt;[(&lt;span class="pl-s1"&gt;rolled&lt;/span&gt;.&lt;span class="pl-c1"&gt;index&lt;/span&gt; &lt;span class="pl-c1"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;start&lt;/span&gt;) &lt;span class="pl-c1"&gt;&amp;amp;&lt;/span&gt; (&lt;span class="pl-s1"&gt;rolled&lt;/span&gt;.&lt;span class="pl-c1"&gt;index&lt;/span&gt; &lt;span class="pl-c1"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;end&lt;/span&gt;)]

&lt;span class="pl-s1"&gt;fig&lt;/span&gt;, &lt;span class="pl-s1"&gt;ax&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;plt&lt;/span&gt;.&lt;span class="pl-c1"&gt;subplots&lt;/span&gt;(&lt;span class="pl-s1"&gt;figsize&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;(&lt;span class="pl-c1"&gt;12&lt;/span&gt;, &lt;span class="pl-c1"&gt;6&lt;/span&gt;))
&lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;col&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;order&lt;/span&gt;:
    &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-s1"&gt;col&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;rolled_win&lt;/span&gt;.&lt;span class="pl-c1"&gt;columns&lt;/span&gt;:
        &lt;span class="pl-s1"&gt;ax&lt;/span&gt;.&lt;span class="pl-c1"&gt;plot&lt;/span&gt;(&lt;span class="pl-s1"&gt;rolled_win&lt;/span&gt;.&lt;span class="pl-c1"&gt;index&lt;/span&gt;, &lt;span class="pl-s1"&gt;rolled_win&lt;/span&gt;[&lt;span class="pl-s1"&gt;col&lt;/span&gt;], &lt;span class="pl-s1"&gt;label&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s1"&gt;col&lt;/span&gt;, &lt;span class="pl-s1"&gt;linewidth&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;2&lt;/span&gt;)

&lt;span class="pl-s1"&gt;ax&lt;/span&gt;.&lt;span class="pl-c1"&gt;set_title&lt;/span&gt;(&lt;span class="pl-s"&gt;"AI adoption (last two weeks) — 6‑survey rolling average"&lt;/span&gt;, &lt;span class="pl-s1"&gt;pad&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;16&lt;/span&gt;)
&lt;span class="pl-s1"&gt;ax&lt;/span&gt;.&lt;span class="pl-c1"&gt;yaxis&lt;/span&gt;.&lt;span class="pl-c1"&gt;set_major_formatter&lt;/span&gt;(&lt;span class="pl-en"&gt;PercentFormatter&lt;/span&gt;(&lt;span class="pl-c1"&gt;100&lt;/span&gt;))
&lt;span class="pl-s1"&gt;ax&lt;/span&gt;.&lt;span class="pl-c1"&gt;set_ylabel&lt;/span&gt;(&lt;span class="pl-s"&gt;"%"&lt;/span&gt;)
&lt;span class="pl-s1"&gt;ax&lt;/span&gt;.&lt;span class="pl-c1"&gt;set_xlabel&lt;/span&gt;(&lt;span class="pl-s"&gt;""&lt;/span&gt;)
&lt;span class="pl-s1"&gt;ax&lt;/span&gt;.&lt;span class="pl-c1"&gt;grid&lt;/span&gt;(&lt;span class="pl-c1"&gt;True&lt;/span&gt;, &lt;span class="pl-s1"&gt;alpha&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;0.25&lt;/span&gt;, &lt;span class="pl-s1"&gt;linestyle&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"--"&lt;/span&gt;)
&lt;span class="pl-s1"&gt;ax&lt;/span&gt;.&lt;span class="pl-c1"&gt;legend&lt;/span&gt;(&lt;span class="pl-s1"&gt;title&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;None&lt;/span&gt;, &lt;span class="pl-s1"&gt;loc&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"upper left"&lt;/span&gt;, &lt;span class="pl-s1"&gt;ncols&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;2&lt;/span&gt;, &lt;span class="pl-s1"&gt;frameon&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;False&lt;/span&gt;)
&lt;span class="pl-s1"&gt;plt&lt;/span&gt;.&lt;span class="pl-c1"&gt;tight_layout&lt;/span&gt;()

&lt;span class="pl-s1"&gt;png_path&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"/mnt/data/ai_adoption_rolling6_by_firm_size.png"&lt;/span&gt;
&lt;span class="pl-s1"&gt;svg_path&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"/mnt/data/ai_adoption_rolling6_by_firm_size.svg"&lt;/span&gt;
&lt;span class="pl-s1"&gt;plt&lt;/span&gt;.&lt;span class="pl-c1"&gt;savefig&lt;/span&gt;(&lt;span class="pl-s1"&gt;png_path&lt;/span&gt;, &lt;span class="pl-s1"&gt;dpi&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;200&lt;/span&gt;, &lt;span class="pl-s1"&gt;bbox_inches&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"tight"&lt;/span&gt;)
&lt;span class="pl-s1"&gt;plt&lt;/span&gt;.&lt;span class="pl-c1"&gt;savefig&lt;/span&gt;(&lt;span class="pl-s1"&gt;svg_path&lt;/span&gt;, &lt;span class="pl-s1"&gt;bbox_inches&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"tight"&lt;/span&gt;)&lt;/pre&gt;
&lt;p&gt;I like how it generated &lt;a href="https://static.simonwillison.net/static/2025/ai_adoption_rolling6_by_firm_size.svg"&gt;an SVG version&lt;/a&gt; of the chart without me even asking for it.&lt;/p&gt;
&lt;p&gt;You can access &lt;a href="https://chatgpt.com/share/68bf48cf-0e70-8006-a045-96fa8e7ddfc1"&gt;the ChatGPT transcript&lt;/a&gt; to see full details of everything it did.&lt;/p&gt;
&lt;h4 id="rendering-that-chart-client-side-using-pyodide"&gt;Rendering that chart client-side using Pyodide&lt;/h4&gt;
&lt;p&gt;I had one more challenge to try out. Could I render that same chart entirely in the browser using &lt;a href="https://pyodide.org/en/stable/"&gt;Pyodide&lt;/a&gt;, which can execute both Pandas and Matplotlib?&lt;/p&gt;
&lt;p&gt;I fired up a new ChatGPT GPT-5 session and prompted:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Build a canvas that loads Pyodide and uses it to render an example bar chart with pandas and matplotlib and then displays that on the page&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;My goal here was simply to see if I could get a proof of concept of a chart rendered, ideally using the Canvas feature of ChatGPT. Canvas is OpenAI's version of Claude Artifacts, which lets the model write and then execute HTML and JavaScript directly in the ChatGPT interface.&lt;/p&gt;
&lt;p&gt;It worked! Here's &lt;a href="https://chatgpt.com/c/68bf2993-ca94-832a-a95e-fb225911c0a6"&gt;the transcript&lt;/a&gt; and here's &lt;a href="https://tools.simonwillison.net/pyodide-bar-chart"&gt;what it built me&lt;/a&gt;, exported to my &lt;a href="https://tools.simonwillison.net/"&gt;tools.simonwillison.net&lt;/a&gt; GitHub Pages site (&lt;a href="https://github.com/simonw/tools/blob/main/pyodide-bar-chart.html"&gt;source code here&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/pyodide-matplotlib.jpg" alt="Screenshot of a web application demonstrating Pyodide integration. Header reads &amp;quot;Pyodide + pandas + matplotlib — Bar Chart&amp;quot; with subtitle &amp;quot;This page loads Pyodide in the browser, uses pandas to prep some data, renders a bar chart with matplotlib, and displays it below — all client-side.&amp;quot; Left panel shows terminal output: &amp;quot;Ready&amp;quot;, &amp;quot;# Python environment ready&amp;quot;, &amp;quot;• pandas 2.2.0&amp;quot;, &amp;quot;• numpy 1.26.4&amp;quot;, &amp;quot;• matplotlib 3.5.2&amp;quot;, &amp;quot;Running chart code...&amp;quot;, &amp;quot;Done. Chart updated.&amp;quot; with &amp;quot;Re-run demo&amp;quot; and &amp;quot;Show Python&amp;quot; buttons. Footer note: &amp;quot;CDN: pyodide, pandas, numpy, matplotlib are fetched on demand. First run may take a few seconds.&amp;quot; Right panel displays a bar chart titled &amp;quot;Example Bar Chart (pandas + matplotlib in Pyodide)&amp;quot; showing blue bars for months Jan through Jun with values approximately: Jan(125), Feb(130), Mar(80), Apr(85), May(85), Jun(120). Y-axis labeled &amp;quot;Streams&amp;quot; ranges 0-120, X-axis labeled &amp;quot;Month&amp;quot;." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I've now proven to myself that I can render those Python charts directly in the browser. Next step: recreate the Apollo chart.&lt;/p&gt;
&lt;p&gt;I knew it would need a way to load the spreadsheet that was CORS-enabled. I uploaded my copy to my &lt;code&gt;/static/cors-allow/2025/...&lt;/code&gt; directory (configured in Cloudflare to serve CORS headers), pasted in the finished plotting code from earlier and told ChatGPT:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Now update it to have less explanatory text and a less exciting design (black on white is fine) and run the equivalent of this:&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;(... pasted in Python code from earlier ...)&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Load the XLSX sheet from https://static.simonwillison.net/static/cors-allow/2025/Employment-Size-Class-Sep-2025.xlsx&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It didn't quite work - I got an error about &lt;code&gt;openpyxl&lt;/code&gt;, researched the fix manually, and prompted:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Use await micropip.install("openpyxl") to install openpyxl - instead of using loadPackage&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I had to paste in another error message:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;zipfile.BadZipFile: File is not a zip file&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Then one about a &lt;code&gt;SyntaxError: unmatched ')'&lt;/code&gt; and a &lt;code&gt;TypeError: Legend.__init__() got an unexpected keyword argument 'ncols'&lt;/code&gt; - copying and pasting error messages remains a frustrating but necessary part of the vibe-coding loop.&lt;/p&gt;
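&lt;p&gt;For what it's worth, one common cause of that &lt;code&gt;zipfile.BadZipFile&lt;/code&gt; error is the fetch quietly returning an error page instead of the workbook. XLSX files are ZIP archives, so a quick magic-bytes check (a sketch - the helper name here is my own) makes that failure mode obvious up front:&lt;/p&gt;

```python
import io
import zipfile

# XLSX workbooks are ZIP archives, so a real one starts with the ZIP magic bytes
ZIP_MAGIC = b"PK\x03\x04"

def looks_like_xlsx(data):
    return data[:4] == ZIP_MAGIC

# Bytes from an error page fail the check immediately, instead of raising
# zipfile.BadZipFile from deep inside openpyxl later on
assert not looks_like_xlsx(b"404 Not Found")

# A zip archive with at least one member passes
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("xl/workbook.xml", "stub")
assert looks_like_xlsx(buf.getvalue())
```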
&lt;p&gt;... but with those fixes in place, the resulting code worked! Visit &lt;a href="https://tools.simonwillison.net/ai-adoption"&gt;tools.simonwillison.net/ai-adoption&lt;/a&gt; to see the final result:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/recreated-chart-pyodide.jpg" alt="Web page. Title is AI adoption - 6-survey rolling average. Has a Run, Downlaed PNG, Downlaod SVG button. Panel on the left says Loading Python... Fetcing packages numpy, pandas, matplotlib. Installing openpyxl via micropop... ready. Running. Done. Right hand panel shows the rendered chart." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Here's the code for that page, &lt;a href="https://github.com/simonw/tools/blob/main/ai-adoption.html"&gt;170 lines&lt;/a&gt; all-in of HTML, CSS, JavaScript and Python.&lt;/p&gt;
&lt;h4 id="what-i-ve-learned-from-this"&gt;What I've learned from this&lt;/h4&gt;
&lt;p&gt;This was another of those curiosity-inspired investigations that turned into a whole set of useful lessons.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;GPT-5 is great at tracking down US Census data, however difficult their site is to navigate for anyone who doesn't work with their data often&lt;/li&gt;
&lt;li&gt;It can do a very good job of turning data + a screenshot of a chart into a recreation of that chart using code interpreter, Pandas and matplotlib&lt;/li&gt;
&lt;li&gt;Running Python + matplotlib in a browser via Pyodide is very easy and only takes a few dozen lines of code&lt;/li&gt;
&lt;li&gt;Fetching an XLSX sheet into Pyodide is only a small extra step using &lt;code&gt;pyfetch&lt;/code&gt; and &lt;code&gt;openpyxl&lt;/code&gt;:
&lt;pre style="margin-top: 0.5em"&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;micropip&lt;/span&gt;
&lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-s1"&gt;micropip&lt;/span&gt;.&lt;span class="pl-c1"&gt;install&lt;/span&gt;(&lt;span class="pl-s"&gt;"openpyxl"&lt;/span&gt;)
&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;pyodide&lt;/span&gt;.&lt;span class="pl-s1"&gt;http&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;pyfetch&lt;/span&gt;
&lt;span class="pl-s1"&gt;resp_fetch&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-en"&gt;pyfetch&lt;/span&gt;(&lt;span class="pl-c1"&gt;URL&lt;/span&gt;)
&lt;span class="pl-s1"&gt;wb_bytes&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-s1"&gt;resp_fetch&lt;/span&gt;.&lt;span class="pl-c1"&gt;bytes&lt;/span&gt;()
&lt;span class="pl-s1"&gt;xf&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;pd&lt;/span&gt;.&lt;span class="pl-c1"&gt;ExcelFile&lt;/span&gt;(&lt;span class="pl-s1"&gt;io&lt;/span&gt;.&lt;span class="pl-c1"&gt;BytesIO&lt;/span&gt;(&lt;span class="pl-s1"&gt;wb_bytes&lt;/span&gt;), &lt;span class="pl-s1"&gt;engine&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;'openpyxl'&lt;/span&gt;)&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;Another new-to-me pattern: you can render an image to the DOM from Pyodide code &lt;a href="https://github.com/simonw/tools/blob/cf26ed8a6f243159bdc90a3d88f818261732103f/ai-adoption.html#L124"&gt;like this&lt;/a&gt;:
&lt;pre style="margin-top: 0.5em"&gt;&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;js&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;document&lt;/span&gt;
&lt;span class="pl-s1"&gt;document&lt;/span&gt;.&lt;span class="pl-c1"&gt;getElementById&lt;/span&gt;(&lt;span class="pl-s"&gt;'plot'&lt;/span&gt;).&lt;span class="pl-c1"&gt;src&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;'data:image/png;base64,'&lt;/span&gt; &lt;span class="pl-c1"&gt;+&lt;/span&gt; &lt;span class="pl-s1"&gt;img_b64&lt;/span&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ul&gt;
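&lt;p&gt;The &lt;code&gt;img_b64&lt;/code&gt; value in that last snippet comes from rendering the figure to an in-memory PNG and base64-encoding it. Here's a rough sketch of that step (variable names are mine, not necessarily what the generated page uses):&lt;/p&gt;

```python
import base64
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, renders straight to bytes
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.bar(["Jan", "Feb", "Mar"], [125, 130, 80])

buf = io.BytesIO()
fig.savefig(buf, format="png", bbox_inches="tight")
img_b64 = base64.b64encode(buf.getvalue()).decode("ascii")

# Base64-encoded PNG data always starts with "iVBOR" (the encoded PNG magic bytes)
assert img_b64.startswith("iVBOR")
```

&lt;p&gt;That string can then be assigned straight to the &lt;code&gt;src&lt;/code&gt; of an image element with the &lt;code&gt;data:image/png;base64,&lt;/code&gt; prefix.&lt;/p&gt;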
&lt;p&gt;I will most definitely be using these techniques again in future.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: Coincidentally, Anthropic released their own upgraded equivalent to ChatGPT Code Interpreter later on the day that I published this story, so I &lt;a href="https://simonwillison.net/2025/Sep/9/claude-code-interpreter/#something-much-harder-recreating-the-ai-adoption-chart"&gt;ran the same chart recreation experiment&lt;/a&gt; against Claude Sonnet 4 to see how it compared.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/census"&gt;census&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tools"&gt;tools&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/visualization"&gt;visualization&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pyodide"&gt;pyodide&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatgpt"&gt;chatgpt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/code-interpreter"&gt;code-interpreter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vibe-coding"&gt;vibe-coding&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-search"&gt;ai-assisted-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-5"&gt;gpt-5&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="census"/><category term="data-journalism"/><category term="javascript"/><category term="python"/><category term="tools"/><category term="visualization"/><category term="ai"/><category term="pyodide"/><category term="openai"/><category term="generative-ai"/><category term="chatgpt"/><category term="llms"/><category term="ai-assisted-programming"/><category term="code-interpreter"/><category term="llm-reasoning"/><category term="vibe-coding"/><category term="ai-assisted-search"/><category term="gpt-5"/></entry><entry><title>GPT-5 Thinking in ChatGPT (aka Research Goblin) is shockingly good at search</title><link href="https://simonwillison.net/2025/Sep/6/research-goblin/#atom-tag" rel="alternate"/><published>2025-09-06T19:31:57+00:00</published><updated>2025-09-06T19:31:57+00:00</updated><id>https://simonwillison.net/2025/Sep/6/research-goblin/#atom-tag</id><summary type="html">
    &lt;p&gt;"Don't use chatbots as search engines" was great advice for several years... until it wasn't.&lt;/p&gt;
&lt;p&gt;I wrote about how good OpenAI's o3 was at using its Bing-backed search tool &lt;a href="https://simonwillison.net/2025/Apr/21/ai-assisted-search/"&gt;back in April&lt;/a&gt;. GPT-5 feels even better.&lt;/p&gt;
&lt;p&gt;I've started calling it my &lt;strong&gt;Research Goblin&lt;/strong&gt;. I can assign a task to it, no matter how trivial or complex, and it will do an often unreasonable amount of work to search the internet and figure out an answer.&lt;/p&gt;
&lt;p&gt;This is excellent for satisfying curiosity, and occasionally useful for more important endeavors as well.&lt;/p&gt;
&lt;p&gt;I always run my searches by selecting the "GPT-5 Thinking" model from the model picker - in my experience this leads to far more comprehensive (albeit much slower) results.&lt;/p&gt;
&lt;p&gt;Here are some examples from just the last couple of days. Every single one of them was run on my phone, usually while I was doing something else. Most of them were dictated using the iPhone voice keyboard, which I find faster than typing. Plus, it's fun to talk to my Research Goblin.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Sep/6/research-goblin/#bouncy-travelators"&gt;Bouncy travelators&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Sep/6/research-goblin/#identify-this-building"&gt;Identify this building&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Sep/6/research-goblin/#starbucks-uk-cake-pops"&gt;Starbucks UK cake pops&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Sep/6/research-goblin/#britannica-to-seed-wikipedia"&gt;Britannica to seed Wikipedia&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Sep/6/research-goblin/#official-name-for-the-university-of-cambridge"&gt;Official name for the University of Cambridge&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Sep/6/research-goblin/#history-of-the-caverns-in-exeter-quay"&gt;History of the caverns in Exeter quay&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Sep/6/research-goblin/#aldi-vs-lidl"&gt;Aldi vs Lidl&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Sep/6/research-goblin/#ai-labs-scanning-books-for-training-data"&gt;AI labs scanning books for training data&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Sep/6/research-goblin/#gpt-5-for-search-feels-competent"&gt;GPT-5 for search feels competent&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Sep/6/research-goblin/#tips-for-using-search-in-chatgpt"&gt;Tips for using search in ChatGPT&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id="bouncy-travelators"&gt;Bouncy travelators&lt;/h4&gt;
&lt;blockquote&gt;
&lt;p&gt;They used to be rubber bouncy travelators at Heathrow and they were really fun, have all been replaced by metal ones now and if so, when did that happen?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I was traveling through Heathrow airport pondering what had happened to the fun bouncy rubber travelators.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://chatgpt.com/share/68bc2d98-9aac-8006-98b9-1424d98290f8"&gt;Here's what I got&lt;/a&gt;. Research Goblin narrowed it down to some time between 2014-2018 but, more importantly, found me this &lt;a href="https://www.sfchronicle.com/totalsf/article/sfo-bouncy-moving-walkway-airport-19845449.php"&gt;delightful 2024 article&lt;/a&gt; by Peter Hartlaub in the San Francisco Chronicle with a history of the SFO bouncy walkways, now also sadly retired.&lt;/p&gt;
&lt;h4 id="identify-this-building"&gt;Identify this building&lt;/h4&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/reading-building.jpg" alt="not a great photo of a building with a distinctive shaped roof" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Identify this building in reading&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is a photo I snapped out of the window on the train. It &lt;a href="https://chatgpt.com/share/68bc2e21-1d24-8006-b083-00b3233e1c67"&gt;thought for 1m4s&lt;/a&gt; and correctly identified it as &lt;a href="https://en.wikipedia.org/wiki/The_Blade,_Reading"&gt;The Blade&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="starbucks-uk-cake-pops"&gt;Starbucks UK cake pops&lt;/h4&gt;
&lt;blockquote&gt;
&lt;p&gt;Starbucks in the UK don't sell cake pops! Do a deep investigative dive&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The Starbucks in Exeter railway station didn't have cake pops, and the lady I asked didn't know what they were.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://chatgpt.com/share/68bc71b4-68f4-8006-b462-cf32f61e7ec3"&gt;Here's the result&lt;/a&gt;. It turns out Starbucks did launch cake pops in the UK &lt;a href="https://www.nationalworld.com/lifestyle/starbucks-cake-pops-launched-in-uk-on-new-autumn-menu-full-list-of-items-4284537"&gt;in September 2023&lt;/a&gt; but they aren't available at all outlets, in particular the licensed travel locations such as the one at Exeter St Davids station.&lt;/p&gt;
&lt;p&gt;I particularly enjoyed how it established definitive proof by consulting &lt;a href="https://www.starbucks.co.uk/sites/starbucks-uk-pwa/files/2024-11/HOL24_UK_AllergenBook_CORE_FOOD_v02.LR_.pdf"&gt;the nutrition and allergen guide PDF&lt;/a&gt; on starbucks.co.uk, which does indeed list both the Birthday Cake Pop (my favourite) and the Cookies and Cream one (apparently discontinued in the USA, at least &lt;a href="https://www.reddit.com/r/starbucks/comments/1lp5chq/just_learned_today_the_cookies_cream_cake_pop_has/"&gt;according to r/starbucks&lt;/a&gt;).&lt;/p&gt;
&lt;h4 id="britannica-to-seed-wikipedia"&gt;Britannica to seed Wikipedia&lt;/h4&gt;
&lt;blockquote&gt;
&lt;p&gt;Someone on hacker News said:&lt;/p&gt;
&lt;p&gt;&amp;gt; I was looking at another thread about how Wikipedia was the best thing on the internet. But they only got the head start by taking copy of Encyclopedia Britannica and everything else&lt;/p&gt;
&lt;p&gt;Find what they meant by that&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://chatgpt.com/share/68bc3062-5a68-8006-a12b-cf7196a130ae"&gt;The result&lt;/a&gt;. It turns out Wikipedia did seed itself with content from the out-of-copyright 1911 Encyclopædia Britannica... but that project took place in 2006, five years after Wikipedia first launched in 2001.&lt;/p&gt;
&lt;p&gt;I asked:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;What is the single best article I can link somebody to that explains the 1911 Britannica thing&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And it pointed me to &lt;a href="https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Encyclopaedia_Britannica"&gt;Wikipedia:WikiProject Encyclopaedia Britannica&lt;/a&gt; which includes a detailed explanation and a link to &lt;a href="https://en.m.wikipedia.org/wiki/Template:EB1911"&gt;the 13,000 pages&lt;/a&gt; still tagged with the template from that project. I posted what I found &lt;a href="https://news.ycombinator.com/item?id=45135061#45137109"&gt;in a comment&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Notably (for me anyway) I didn't feel the need to disclose my use of ChatGPT in finding that information - at this point that feels a little like disclosing that I ran a Google search.&lt;/p&gt;
&lt;h4 id="official-name-for-the-university-of-cambridge"&gt;Official name for the University of Cambridge&lt;/h4&gt;
&lt;blockquote&gt;
&lt;p&gt;What is the official legal name of the university of Cambridge?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://simonwillison.net/2025/Sep/6/iancal/"&gt;Here's the context for that one&lt;/a&gt;. It thought for 19 seconds - the thinking trace reveals it knew the answer but wanted to confirm it. &lt;a href="https://chatgpt.com/share/68bc723d-a344-8006-9514-fb3ea2c2b43b"&gt;It answered&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;“The Chancellor, Masters, and Scholars of the University of Cambridge.” &lt;a href="https://www.cam.ac.uk/about-the-university/how-the-university-and-colleges-work/the-university-as-a-charity"&gt;University of Cambridge&lt;/a&gt;, &lt;a href="https://www.cambridge.org/legal"&gt;Cambridge University Press &amp;amp; Assessment&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That first link gave me the citation I needed in order to be sure this was right.&lt;/p&gt;
&lt;p&gt;Since this is my shortest example, here's a screenshot of the expanded "Thought for 19s" panel. I &lt;em&gt;always&lt;/em&gt; expand the thoughts - seeing how it pulled together its answer is crucial for evaluating if the answer is likely to be useful or not.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/cambridge-name.jpg" alt="Screenshot of an AI assistant's thought process and research. Header shows &amp;quot;Thought for 19s&amp;quot; followed by bullet points: &amp;quot;The user asked for the official legal name of the University of Cambridge, and I think they mean the formal corporate name. It's probably 'The Chancellor, Masters, and Scholars of the University of Cambridge,' similar to Oxford's. I'm going to verify by checking the web for precise information.&amp;quot; &amp;quot;The official name of Cambridge might include the serial comma, like 'Masters, and Scholars.' There's a similar structure with Oxford, which uses a comma after 'Masters' too. The University is an exempt charity, so you can cross-check with legal documents like statutes or charity records. I'll double-check.&amp;quot; Shows search results for &amp;quot;official legal name of University of Cambridge&amp;quot; from en.wikipedia.org, www.cam.ac.uk, projects.propublica.org and &amp;quot;10 more&amp;quot; sources. Continues with &amp;quot;The University of Cambridge's official legal name, based on the University website and Cambridge University Press, is 'The Chancellor, Masters, and Scholars of the University of Cambridge,' with the serial comma after 'Masters.' It's clear from these authoritative sources, so I'll provide that straightforward answer, citing the University site.&amp;quot; &amp;quot;The official legal name of the University of Cambridge is 'The Chancellor, Masters, and Scholars of the University of Cambridge,' with the comma after Masters and the Oxford comma before 'and.' I'll answer concisely in one line and cite official sources to confirm. No need for extra UI elements here.&amp;quot; Shows &amp;quot;Done&amp;quot; status. 
Final answer displays: &amp;quot;The Chancellor, Masters, and Scholars of the University of Cambridge.&amp;quot; attributed to University of Cambridge, Cambridge University Press &amp;amp; Assessment, with &amp;quot;Sources&amp;quot; section at bottom." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4 id="history-of-the-caverns-in-exeter-quay"&gt;History of the caverns in Exeter quay&lt;/h4&gt;
&lt;blockquote&gt;
&lt;p&gt;Research On the waterfront restaurant in Exeter, is it dug into the cliffs somehow? History of the building, who built it, why and how&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;We were out to dinner &lt;a href="https://maps.app.goo.gl/xxvaPQiNWACtbq3H8"&gt;here&lt;/a&gt; and noticed that the interior of the restaurant appeared to be a space dug into the cliff, which piqued my interest.&lt;/p&gt;
&lt;p&gt;This was &lt;a href="https://chatgpt.com/share/68bc32fb-d52c-8006-9259-0b984dc832b2"&gt;the ChatGPT session&lt;/a&gt; that inspired the Research Goblin nickname. It just kept on digging!&lt;/p&gt;
&lt;p&gt;The first reply took 2m40s and confirmed that yes, these quay buildings were carved into the red sandstone cliff &lt;a href="https://www.exploredevon.info/activities/walk/exeter-quay/"&gt;in the 1820s-1830s&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;ChatGPT with GPT-5 really likes to suggest additional steps it can take. In this case:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If you’d like, I can dig up the exact Historic England entry that covers the “Southern Warehouse” address and overlay it on a map of the vaults.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I often say "yes" purely out of curiosity to see what it will do next, and the offer to "overlay it on a map" was irresistible, like how would it even do that?&lt;/p&gt;
&lt;p&gt;It did a &lt;em&gt;ton&lt;/em&gt; of extra searches, found latitude and longitude coordinates for the restaurant (from Wikimedia Commons) and the warehouse buildings (from National Heritage List for England via Wikipedia), showed me that data in a table and then used Python to render this image:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/bad-chart.png" alt="Scatter plot titled &amp;quot;On The Waterfront vs. Warehouse Vaults (Exeter Quay)&amp;quot; with scientific notation &amp;quot;+5.071e1&amp;quot; in top left. Y-axis shows &amp;quot;Latitude&amp;quot; ranging from 0.0065 to 0.0090. X-axis shows &amp;quot;Longitude&amp;quot; ranging from -3.5310 to -3.5280. Three orange X markers plotted: &amp;quot;Warehouse Vaults (6-11, The Quay) — HE 1223046&amp;quot; at top center, &amp;quot;On The Waterfront (4-9 Southern Warehouse)&amp;quot; at middle left, and &amp;quot;Warehouse Vaults (1-4, The Quay) — HE 1223045&amp;quot; at bottom right" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Yeah, so that one is a bit of a miss!&lt;/p&gt;
&lt;p&gt;But... the underlying research chain, as always, proved fascinating. I noticed that at one point it had searched for "Exeter Quay vaults map PDF", which inspired my next question:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Did you find any PDF diagrams, showing the tunnels into the cliff&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;2m20s later it had found &lt;a href="https://planningdocs.exeter.gov.uk/servlets/direct/KgQ8hlUut3Q2r5wVMItggobi9P/2/844706/1/1/AS_PDF_FILE"&gt;a planning document&lt;/a&gt; and &lt;a href="https://www.turner-locker.co.uk/wp-content/uploads/2018/08/Cellar-1-The-Quay-TO-LET-2018-1.pdf"&gt;a letting brochure&lt;/a&gt;, both PDFs, but noted that neither of them had the diagram I wanted. Evidently it actually "reads" the PDFs it finds as part of its activities, since it appears to have knowledge of the images contained within them as well as the text.&lt;/p&gt;
&lt;p&gt;It concluded with this intriguing note:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;There’s strong evidence that a detailed plan/section &lt;strong&gt;does&lt;/strong&gt; exist in the Exeter Archaeology report often cited by planners—&lt;strong&gt;“Archaeological, Historical and Conservation Study of the Exeter Canal Basin” (Exeter Archaeology Report 00.18, 2000)&lt;/strong&gt;—but it isn’t online; it’s just cited in planning documents. You’d likely need to request it via the Devon Heritage Centre/Historic Environment Record.&lt;/p&gt;
&lt;p&gt;If you want, I can keep hunting specialist archives for a digitized copy or a different report that includes the vault layouts.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Obviously I said "Yes do that" and 2m24s later it chucked out a whole host of extra links, an apology that it still hadn't found a detailed plan diagram and, under the section heading "What I recommend next (actionable + exact ask)":&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If you’re happy, I can draft an email you can paste to the right archivists. [...]&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Subject: Exeter Quayside cliff vaults – request for plan/section drawings
Hello,&lt;/p&gt;
&lt;p&gt;I’m researching the cliff-cut vaults behind The Quay (On The Waterfront / Kings Wharf).
Could you supply scans or copies of any &lt;strong&gt;plans/sections/diagrams&lt;/strong&gt; of these vaults—especially those in &lt;strong&gt;Exeter Archaeology Report 00.18 (2000)&lt;/strong&gt; (Archaeological, Historical and Conservation Study of the Exeter Canal Basin) and any &lt;strong&gt;EMAFU/Exeter Archaeology drawings&lt;/strong&gt; for “The Quay” showing the &lt;strong&gt;vault/tunnel layouts cut into the red sandstone cliff&lt;/strong&gt;?&lt;/p&gt;
&lt;p&gt;For reference: NHLE list entries &lt;strong&gt;1223045 (1–4 Warehouse Vaults), 1223046 (6–11 Warehouse Vaults)&lt;/strong&gt;, and &lt;strong&gt;1223114 (14–25 Vaults)&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Many thanks!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/blockquote&gt;
&lt;p&gt;It's a good thing my Research Goblin doesn't have the ability to send email (which we already know is &lt;a href="https://simonwillison.net/2025/May/31/snitchbench-with-llm/"&gt;very risky business&lt;/a&gt;) or at this point I wouldn't be surprised if it started autonomously bothering archivists on my behalf!&lt;/p&gt;
&lt;h4 id="aldi-vs-lidl"&gt;Aldi vs Lidl&lt;/h4&gt;
&lt;blockquote&gt;
&lt;p&gt;In UK supermarket terms compare lidl and aldi&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I saw an Aldi and a Lidl next to each other in Exeter and was curious as to how they compare. In this case my Research Goblin went &lt;em&gt;way&lt;/em&gt; deeper than I had intended, thinking for 1m25s about market share and price surveys and loyalty schemes and &lt;a href="https://chatgpt.com/share/68bc7913-7ed8-8006-bd3f-63e540470cd7"&gt;answering&lt;/a&gt; with ten paragraphs of detailed analysis.&lt;/p&gt;
&lt;p&gt;I replied:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Yeah, I just wanted the basic vibes like is one fancier than the other?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It thought for 6 seconds, didn't run any additional searches and replied:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Short version: neither is “fancy” — both are no-frills discounters — but &lt;strong&gt;Lidl often feels a touch more polished&lt;/strong&gt;, while &lt;strong&gt;Aldi feels a touch more bare-bones&lt;/strong&gt;. [...]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Since I was now enjoying having my Goblin express judgement over British supermarket chains I said:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;How do they rate in fancinews compared to all of the other major UK supermarket chains?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;(That was a typo for "fanciness" which it figured out.)&lt;/p&gt;
&lt;p&gt;It thought for another 1m1s, ran 7 searches, consulted 27 URLs and gave me back a ranking that looked about right to me.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/fancy-supermarkets.jpg" alt="ChatGPT. Prompt: How do they rate in fancinews compared to all of the other major UK supermarket chains? Thought for 1m 1s › Gotcha-assuming you mean &amp;quot;fanciness.&amp;quot; Here's the UK supermarket vibe-ladder (posh → no-frills): Super-posh: M&amp;amp;S Food and Waitrose - premium feel, strong store appearance/quality scores; Waitrose is literally described as &amp;quot;upmarket.&amp;quot; Which? The Guardian Upper-mainstream: Sainsbury's, Ocado (online, but it sells M&amp;amp;s Food since 2020, so the range skews premium). Which? The Guardian Mainstream: Tesco, Morrisons, Co-op - broad ranges, middle-of-the-road on experience/ appearance. Which? Value-forward: Asda, Iceland — price-led; Iceland's frozen focus, Asda's value reputation (not &amp;quot;fancy,&amp;quot; but popular). Time Out Worldwide Discounters (least fancy by design): Lidl (a touch more polished in-store) and Aldi (most stripped-back, no-frills). Both sit lower on &amp;quot;store appearance/ service&amp;quot; in consumer surveys, even while smashing value rankings. Which? +1 Net: on a &amp;quot;posh&amp;quot; scale, think M&amp;amp;S/Waitrose at the top, Sainsbury's/Tesco middle, Asda/Iceland lower, and Lidl/Aldi the least fancy— by intent, not as a knock on quality or value." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4 id="ai-labs-scanning-books-for-training-data"&gt;AI labs scanning books for training data&lt;/h4&gt;
&lt;blockquote&gt;
&lt;p&gt;Anthropic bought lots of physical books and cut them up and scan them for training data. Do any other AI labs do the same thing?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Relevant to &lt;a href="https://simonwillison.net/2025/Sep/6/anthropic-settlement/"&gt;today's big story&lt;/a&gt;. Research Goblin was &lt;a href="https://chatgpt.com/share/68bc771c-c188-8006-a8e5-4b1624f5bdf0"&gt;unable to find&lt;/a&gt; any news stories or other evidence that any labs other than Anthropic are engaged in large scale book scanning for training data. That's not to say it isn't happening, but it's happening very quietly if that's the case.&lt;/p&gt;
&lt;h4 id="gpt-5-for-search-feels-competent"&gt;GPT-5 for search feels competent&lt;/h4&gt;
&lt;p&gt;The word that best describes how I feel about GPT-5 search is that it feels &lt;strong&gt;competent&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;I've thrown all sorts of things at it over the last few weeks and it rarely disappoints me. It almost always does better than if I were to dedicate the same amount of time to manually searching myself, mainly because it's much faster at running searches and evaluating the results than I am.&lt;/p&gt;
&lt;p&gt;I particularly love that it works so well on mobile. I used to reserve my deeper research sessions for a laptop where I could open up dozens of tabs. I'll still do that for higher stakes activities but I'm finding the scope of curiosity satisfaction I can perform on the go with just my phone has increased quite dramatically.&lt;/p&gt;
&lt;p&gt;I've mostly stopped using OpenAI's Deep Research feature, because ChatGPT search now gives me the results I'm interested in far more quickly for most queries.&lt;/p&gt;
&lt;p&gt;As a developer who builds software on LLMs I see ChatGPT search as the gold standard for what can be achieved using tool calling combined with chain-of-thought. Techniques like RAG are &lt;em&gt;massively&lt;/em&gt; more effective if you can reframe them as several levels of tool calling with a carefully selected set of powerful search tools.&lt;/p&gt;
&lt;p&gt;The way that search tool integrates with reasoning is key, because it allows GPT-5 to execute a search, reason about the results and then execute follow-up searches - all as part of that initial "thinking" process.&lt;/p&gt;
&lt;p&gt;Anthropic call this ability &lt;a href="https://docs.anthropic.com/en/docs/build-with-claude/extended-thinking#interleaved-thinking"&gt;interleaved thinking&lt;/a&gt; and it's also &lt;a href="https://platform.openai.com/docs/guides/reasoning#keeping-reasoning-items-in-context"&gt;supported by the OpenAI Responses API&lt;/a&gt;.&lt;/p&gt;
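&lt;p&gt;Here's a minimal sketch of that interleaved search-and-reasoning loop. The model and search tool are stand-in stubs (&lt;code&gt;fake_model&lt;/code&gt; and &lt;code&gt;fake_search&lt;/code&gt; are hypothetical names, not real APIs) - a real system would call an LLM API such as the OpenAI Responses API and a proper search backend - but the control flow is the interesting part: search, reason about the results, search again, all before producing a final answer.&lt;/p&gt;

```python
# Sketch of an interleaved search/reasoning loop. Both "model" and
# "search" are stand-in stubs, NOT real library APIs.

def fake_model(context):
    """Stand-in reasoning model: requests searches until satisfied."""
    tool_results = [m for m in context if m["role"] == "tool"]
    if len(tool_results) < 2:
        # The "thinking" step decides a follow-up search is needed.
        return {"action": "search", "query": f"follow-up {len(tool_results) + 1}"}
    return {"action": "answer", "text": "synthesized answer from 2 searches"}

def fake_search(query):
    """Stand-in search tool returning canned results."""
    return f"results for: {query}"

def interleaved_search(question, model=fake_model, search=fake_search):
    """Loop: reason -> search -> reason ... until the model answers."""
    context = [{"role": "user", "content": question}]
    while True:
        step = model(context)
        if step["action"] == "answer":
            return step["text"], context
        # Execute the requested search and feed results back into the
        # context, so the next reasoning step can evaluate them.
        context.append({"role": "tool", "content": search(step["query"])})

answer, context = interleaved_search("best UK supermarket rankings?")
```

&lt;p&gt;The key design point is that the search results go back into the same context the model is reasoning over, rather than being a single up-front retrieval step - which is what distinguishes this pattern from classic single-shot RAG.&lt;/p&gt;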
&lt;h4 id="tips-for-using-search-in-chatgpt"&gt;Tips for using search in ChatGPT&lt;/h4&gt;
&lt;p&gt;As with all things AI, GPT-5 search rewards intuition gathered through experience. Any time a curious thought pops into my head I try to catch it and throw it at my Research Goblin. If it's something I'm certain it won't be able to handle then even better! I can learn from watching it fail.&lt;/p&gt;
&lt;p&gt;I've been trying out hints like "go deep" which seem to trigger a more thorough research job. I enjoy throwing those at shallow and unimportant questions like the UK Starbucks cake pops one just to see what happens!&lt;/p&gt;
&lt;p&gt;You can throw questions at it which have a single, unambiguous answer - but I think questions which are broader and don't have a "correct" answer can be a lot more fun. The UK supermarket rankings above are a great example of that.&lt;/p&gt;
&lt;p&gt;Since I love a questionable analogy for LLMs, Research Goblin is... well, it's a goblin. It's very industrious, not quite human and not entirely trustworthy. You have to be able to outwit it if you want to keep it gainfully employed.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/bing"&gt;bing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/definitions"&gt;definitions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/search"&gt;search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-tool-use"&gt;llm-tool-use&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deep-research"&gt;deep-research&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-search"&gt;ai-assisted-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-5"&gt;gpt-5&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="bing"/><category term="definitions"/><category term="search"/><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="llm-tool-use"/><category term="llm-reasoning"/><category term="deep-research"/><category term="ai-assisted-search"/><category term="gpt-5"/></entry><entry><title>DeepSeek 3.1</title><link href="https://simonwillison.net/2025/Aug/22/deepseek-31/#atom-tag" rel="alternate"/><published>2025-08-22T22:07:25+00:00</published><updated>2025-08-22T22:07:25+00:00</updated><id>https://simonwillison.net/2025/Aug/22/deepseek-31/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V3.1"&gt;DeepSeek 3.1&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The latest model from DeepSeek, a 685B monster (like &lt;a href="https://simonwillison.net/2024/Dec/25/deepseek-v3/"&gt;DeepSeek v3&lt;/a&gt; before it) but this time it's a hybrid reasoning model.&lt;/p&gt;
&lt;p&gt;DeepSeek claim:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;DeepSeek-V3.1-Think achieves comparable answer quality to DeepSeek-R1-0528, while responding more quickly.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Drew Breunig &lt;a href="https://twitter.com/dbreunig/status/1958577728720183643"&gt;points out&lt;/a&gt; that their benchmarks show "the same scores with 25-50% fewer tokens" - at least across AIME 2025 and GPQA Diamond and LiveCodeBench.&lt;/p&gt;
&lt;p&gt;The DeepSeek release includes prompt examples for a &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V3.1/blob/main/assets/code_agent_trajectory.html"&gt;coding agent&lt;/a&gt;, a &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V3.1/blob/main/assets/search_python_tool_trajectory.html"&gt;python agent&lt;/a&gt; and a &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V3.1/blob/main/assets/search_tool_trajectory.html"&gt;search agent&lt;/a&gt; - yet more evidence that the leading AI labs have settled on those as the three most important agentic patterns for their models to support. &lt;/p&gt;
&lt;p&gt;Here's the pelican riding a bicycle it drew me (&lt;a href="https://gist.github.com/simonw/f6dba61faf962866969eefd3de59d70e"&gt;transcript&lt;/a&gt;), which I ran from my phone using &lt;a href="https://openrouter.ai/chat?models=deepseek/deepseek-chat-v3.1"&gt;OpenRouter chat&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Cartoon illustration of a white bird with an orange beak riding a bicycle against a blue sky background with bright green grass below" src="https://static.simonwillison.net/static/2025/deepseek-3-1-pelican.png" /&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-engineering"&gt;prompt-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/drew-breunig"&gt;drew-breunig&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deepseek"&gt;deepseek&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="prompt-engineering"/><category term="generative-ai"/><category term="llms"/><category term="drew-breunig"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="deepseek"/><category term="llm-release"/><category term="openrouter"/><category term="coding-agents"/><category term="ai-in-china"/></entry><entry><title>Quoting Sam Altman</title><link href="https://simonwillison.net/2025/Aug/10/sam-altman/#atom-tag" rel="alternate"/><published>2025-08-10T23:09:57+00:00</published><updated>2025-08-10T23:09:57+00:00</updated><id>https://simonwillison.net/2025/Aug/10/sam-altman/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://x.com/sama/status/1954603417252532479"&gt;&lt;p&gt;the percentage of users using reasoning models each day is significantly increasing; for example, for free users we went from &amp;lt;1% to 7%, and for plus users from 7% to 24%.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://x.com/sama/status/1954603417252532479"&gt;Sam Altman&lt;/a&gt;, revealing quite how few people used the old model picker to upgrade from GPT-4o&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-5"&gt;gpt-5&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sam-altman"&gt;sam-altman&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatgpt"&gt;chatgpt&lt;/a&gt;&lt;/p&gt;



</summary><category term="openai"/><category term="llm-reasoning"/><category term="ai"/><category term="llms"/><category term="gpt-5"/><category term="sam-altman"/><category term="generative-ai"/><category term="chatgpt"/></entry></feed>