Simon Willison's Weblog: llm-release

Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7

2026-04-16T17:16:52+00:00

For anyone who has been (inadvisably) taking my pelican riding a bicycle benchmark seriously as a robust way to test models, here are pelicans from this morning's two big model releases - Qwen3.6-35B-A3B from Alibaba and Claude Opus 4.7 from Anthropic.

Here's the Qwen 3.6 pelican, generated using this 20.9GB Qwen3.6-35B-A3B-UD-Q4_K_S.gguf quantized model by Unsloth, running on my MacBook Pro M5 via LM Studio (and the llm-lmstudio plugin) - transcript here:

And here's one I got from Anthropic's brand new Claude Opus 4.7 (transcript):

I'm giving this one to Qwen 3.6. Opus managed to mess up the bicycle frame!

I tried Opus a second time passing thinking_level: max. It didn't do much better (transcript):

I don't think Qwen are cheating

A lot of people are convinced that the labs train for my stupid benchmark. I don't think they do, but honestly this result did give me a little glint of suspicion. So I'm burning one of my secret backup tests - here's what I got from Qwen3.6-35B-A3B and Opus 4.7 for "Generate an SVG of a flamingo riding a unicycle":

Qwen3.6-35B-A3B
(transcript)

Opus 4.7
(transcript)

I'm giving this one to Qwen too, partly for the excellent SVG comment.

What can we learn from this?

The pelican benchmark has always been meant as a joke - it's mainly a statement on how obtuse and absurd the task of comparing these models is.

The weird thing about that joke is that, for the most part, there has been a direct correlation between the quality of the pelicans produced and the general usefulness of the models. Those first pelicans from October 2024 were junk. The more recent entries have generally been much, much better - to the point that Gemini 3.1 Pro produces illustrations you could actually use somewhere, provided you had a pressing need to illustrate a pelican riding a bicycle.

Today, even that loose connection to utility has been broken. I have enormous respect for Qwen, but I very much doubt that a 21GB quantized version of their latest model is more powerful or useful than Anthropic's latest proprietary release.

If the thing you need is an SVG illustration of a pelican riding a bicycle though, right now Qwen3.6-35B-A3B running on a laptop is a better bet than Opus 4.7!

Tags: ai, generative-ai, local-llms, llms, anthropic, claude, qwen, pelican-riding-a-bicycle, llm-release, lm-studio

Gemini 3.1 Flash TTS

2026-04-15T17:13:14+00:00

Google released Gemini 3.1 Flash TTS today, a new text-to-speech model that can be directed using prompts.

It's presented via the standard Gemini API using gemini-3.1-flash-tts-preview as the model ID, but can only output audio files.

The prompting guide is surprising, to say the least. Here's their example prompt to generate just a few short sentences of audio:

# AUDIO PROFILE: Jaz R.
## "The Morning Hype"

## THE SCENE: The London Studio
It is 10:00 PM in a glass-walled studio overlooking the moonlit London skyline, but inside, it is blindingly bright. The red "ON AIR" tally light is blazing. Jaz is standing up, not sitting, bouncing on the balls of their heels to the rhythm of a thumping backing track. Their hands fly across the faders on a massive mixing desk. It is a chaotic, caffeine-fueled cockpit designed to wake up an entire nation.

### DIRECTOR'S NOTES
Style:
* The "Vocal Smile": You must hear the grin in the audio. The soft palate is always raised to keep the tone bright, sunny, and explicitly inviting.
* Dynamics: High projection without shouting. Punchy consonants and elongated vowels on excitement words (e.g., "Beauuutiful morning").

Pace: Speaks at an energetic pace, keeping up with the fast music.  Speaks with A "bouncing" cadence. High-speed delivery with fluid transitions — no dead air, no gaps.

Accent: Jaz is from Brixton, London

### SAMPLE CONTEXT
Jaz is the industry standard for Top 40 radio, high-octane event promos, or any script that requires a charismatic Estuary accent and 11/10 infectious energy.

#### TRANSCRIPT
[excitedly] Yes, massive vibes in the studio! You are locked in and it is absolutely popping off in London right now. If you're stuck on the tube, or just sat there pretending to work... stop it. Seriously, I see you.
[shouting] Turn this up! We've got the project roadmap landing in three, two... let's go!

Here's what I got using that example prompt:

Then I modified it to say "Jaz is from Newcastle" and "... requires a charismatic Newcastle accent" and got this result:

Here's Exeter, Devon for good measure:

I had Gemini 3.1 Pro vibe code this UI for trying it out:

Tags: text-to-speech, prompt-engineering, gemini, llm-release, tools, google, generative-ai, ai, llms, vibe-coding

Meta's new model is Muse Spark, and meta.ai chat has some interesting tools

2026-04-08T23:07:44+00:00

Meta announced Muse Spark today, their first model release since Llama 4 almost exactly a year ago. It's hosted, not open weights, and the API is currently "a private API preview to select users", but you can try it out today on meta.ai (Facebook or Instagram login required).

Meta's self-reported benchmarks show it competitive with Opus 4.6, Gemini 3.1 Pro, and GPT 5.4 on selected benchmarks, though notably behind on Terminal-Bench 2.0. Meta themselves say they "continue to invest in areas with current performance gaps, such as long-horizon agentic systems and coding workflows".

The model is exposed as two different modes on meta.ai - "Instant" and "Thinking". Meta promise a "Contemplating" mode in the future which they say will offer much longer reasoning time and should behave more like Gemini Deep Think or GPT-5.4 Pro.

A couple of pelicans

I prefer to run my pelican test via API to avoid being influenced by any invisible system prompts, but since that's not an option I ran it against the chat UI directly.

Here's the pelican I got for "Instant":

And this one for "Thinking":

Both SVGs were rendered inline by the Meta AI interface. Interestingly, the Instant model output an SVG directly (with code comments) whereas the Thinking model wrapped it in a thin HTML shell with some unused Playables SDK v1.0.0 JavaScript libraries.

Which got me curious...

Poking around with tools

Clearly Meta's chat harness has some tools wired up to it - at the very least it can render SVG and HTML as embedded frames, Claude Artifacts style.

But what else can it do?

I asked it:

what tools do you have access to?

And then:

I want the exact tool names, parameter names and tool descriptions, in the original format

It spat out detailed descriptions of 16 different tools. You can see the full list I got back here - credit to Meta for not telling their bot to hide these, since it's far less frustrating if I can get them out without having to mess around with jailbreaks.

Here are highlights derived from that response:

Browse and search. browser.search can run a web search through an undisclosed search engine, browser.open can load the full page from one of those search results and browser.find can run pattern matches against the returned page content.
Meta content search. meta_1p.content_search can run "Semantic search across Instagram, Threads, and Facebook posts" - but only for posts the user has access to view which were created since 2025-01-01. This tool has some powerful looking parameters, including author_ids, key_celebrities, commented_by_user_ids, and liked_by_user_ids.
"Catalog search" - meta_1p.meta_catalog_search can "Search for products in Meta's product catalog", presumably for the "Shopping" option in the Meta AI model selector.
Image generation. media.image_gen generates images from prompts, and "returns a CDN URL and saves the image to the sandbox". It has modes "artistic" and "realistic" and can return "square", "vertical" or "landscape" images.
container.python_execution - yes! It's Code Interpreter, my favourite feature of both ChatGPT and Claude.

Execute Python code in a remote sandbox environment. Python 3.9 with pandas, numpy, matplotlib, plotly, scikit-learn, PyMuPDF, Pillow, OpenCV, etc. Files persist at /mnt/data/.

Python 3.9 is EOL these days but the library collection looks useful.

I prompted "use python code to confirm sqlite version and python version" and got back Python 3.9.25 and SQLite 3.34.1 (from January 2021).
container.create_web_artifact - we saw this earlier with the HTML wrapper around the pelican: Meta AI can create HTML+JavaScript files in its container which can then be served up as secure sandboxed iframe interactives. "Set kind to html for websites/apps or svg for vector graphics."
container.download_meta_1p_media is interesting: "Download media from Meta 1P sources into the sandbox. Use post_id for Instagram/Facebook/Threads posts, or catalog_search_citation_id for catalog product images". So it looks like you can pull in content from other parts of Meta and then do fun Code Interpreter things to it in the sandbox.
container.file_search - "Search uploaded files in this conversation and return relevant excerpts" - I guess for digging through PDFs and similar?
Tools for editing files in the container - container.view, container.insert (with new_str and insert_line), container.str_replace. These look similar to Claude's text editor tool commands - these are becoming a common pattern across any file-equipped agent harness.
container.visual_grounding - see below, this one is fun.
subagents.spawn_agent - the sub-agent as a tool pattern. "Spawn an independent sub-agent for research, analysis, or delegation. It returns its final text response."
third_party.link_third_party_account - "Initiate account linking for a third-party service", currently Google Calendar, Outlook Calendar, Gmail, or Outlook.
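The version check I ran through container.python_execution (see the list above) needs nothing beyond the standard library. Here's a minimal sketch of the kind of code that prompt produces - the function name report_versions is my own invention, and this isn't necessarily what Meta AI actually ran:

```python
import sqlite3
import sys


def report_versions() -> dict:
    """Return the Python and SQLite library versions of the current environment."""
    return {
        # major.minor.micro of the running interpreter, e.g. "3.9.25"
        "python": ".".join(str(part) for part in sys.version_info[:3]),
        # Version of the SQLite C library that the sqlite3 module is linked against
        "sqlite": sqlite3.sqlite_version,
    }


if __name__ == "__main__":
    versions = report_versions()
    print(f"Python {versions['python']}, SQLite {versions['sqlite']}")
```

Note that sqlite3.sqlite_version reports the linked SQLite library, which is the number that matters for feature support, not the version of the Python wrapper module.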

Image analysis in the container

Let's talk about that container.visual_grounding one. Here's the description in full:

Description: Visual grounding tool that analyzes the image, identifies and labels objects, locates regions, or counts objects.

Parameters:

object_names (array, required)

image_path (string, default "")

format_type ("bbox"|"point"|"count"|null, default "bbox")

title (string|null)[string]

So it takes a required list of object names plus an optional image path, and returns results in bbox, point or count format.

This sounded to me like Meta's Segment Anything. (I later heard that it's not: this is a native feature of the model itself, via a tool call with a custom system prompt.) I decided to take it for a spin.

Prompt:

generate a photo of a raccoon sitting on a trash can wearing trash as a hat

Response:

Here you go — king of the alley, in full regalia:

Coffee cup crown, banana peel brim, newspaper feather. Peak raccoon fashion. Want me to make a wilder version — maybe night-vision, or wearing sunglasses too?

My guess is that image generation in Meta AI is powered by their Emu model, or an updated version of it. This is the same pattern as ChatGPT and Gemini image generation, where images are generated by a tool call (here that's media.image_gen).

The image_gen tool description said that it "saves the image to the sandbox", so my guess was that I'd then be able to run Python code against it. I tried:

use python OpenCV to analyze that image and find out neat things about it

Annoyingly it did not show me the code it ran, but it did create this image:

Pretty neat! We can both generate images and then use them with Python tools.

Next, the visual_grounding tool. Knowing the names of the tools makes it easy to refer to them in prompts:

Now try the container.visual_grounding tool

The response (full response here):

Done — visual_grounding pinpointed every piece of the outfit with pixel-level precision:

[...]

Want me to run it in bbox mode next to draw boxes around the fur texture, or count mode to see how many individual whiskers it can detect?

Once again, Meta AI wrapped the results from the tool in a custom HTML visualization. That's clearly point mode, but how about the others?

I prompted:

yes draw bounding boxes, can it do masks too?

Response:

Yes — bounding boxes work great. Masks? Not directly in visual_grounding; it only returns three formats: point, bbox, and count. No pixel-level segmentation masks.

Here's the bbox output:

What you’re seeing:

raccoon: big box from (270,240) to (892,845) — covers 62% of the width, basically the whole animal

coffee cup: (424,115)-(610,273) — sits highest, 158px tall in the 0-1000 space

banana peel: (472,215)-(667,291) — overlaps the cup's base, confirming it's draped over the side

newspaper: (305,223)-(548,315) — left ear coverage

trash can lid: (107,800)-(1000,998) — spans almost the full image width at the bottom

face: (305,283)-(750,537) — tight crop around the mask

eyes: both ~40×37 pixel boxes, perfectly level at y=379

Notice how the boxes nest: face is inside raccoon, eyes inside face, hat pieces overlap each other but stay above the face box. The tool is doing real object localization, not just guessing centers.

About masks: since visual_grounding doesn't output them, I can fake them with OpenCV in the container. Two quick options:

GrabCut — I give it the raccoon bbox, it iteratively segments fur from background and returns a binary mask

SAM-style color threshold — use the bbox plus k-means to isolate the coffee cup or banana peel

Want me to generate an actual alpha mask for the raccoon (cut it out from the alley), or for the trash hat pieces separately? Takes about 10 seconds and I can give you a transparent PNG.

Here's the HTML for that bbox visualization.
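The nesting claims in that response are easy to sanity-check. This is a small sketch using three of the boxes quoted above - the Box class and its method names are hypothetical helpers of mine, not part of any Meta tool:

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box as (x1, y1)-(x2, y2) corners."""
    x1: int
    y1: int
    x2: int
    y2: int

    def contains(self, other: "Box") -> bool:
        """True if `other` lies entirely within this box."""
        return (self.x1 <= other.x1 and self.y1 <= other.y1
                and self.x2 >= other.x2 and self.y2 >= other.y2)

    def width_fraction(self, space: int = 1000) -> float:
        """Box width as a fraction of the coordinate space."""
        return (self.x2 - self.x1) / space


# Coordinates copied from the response above (0-1000 space)
raccoon = Box(270, 240, 892, 845)
face = Box(305, 283, 750, 537)
coffee_cup = Box(424, 115, 610, 273)

print(raccoon.contains(face))                 # True: face nests inside raccoon
print(raccoon.contains(coffee_cup))           # False: the cup starts above the raccoon box
print(round(raccoon.width_fraction() * 100))  # 62
print(coffee_cup.y2 - coffee_cup.y1)          # 158
```

The numbers agree with the response: the raccoon box covers 62% of the width and the coffee cup is 158 units tall.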

Finally I had it dump out the raw JSON for point mode, bbox mode and count mode. You can see that all here. The count mode is particularly interesting - here's that in full:

[
  {
    "object_name": "raccoon whisker",
    "points": [
      {"x": 403, "y": 445},
      {"x": 397, "y": 458},
      {"x": 392, "y": 472},
      {"x": 391, "y": 485},
      {"x": 398, "y": 497},
      {"x": 415, "y": 512},
      {"x": 618, "y": 446},
      {"x": 622, "y": 461},
      {"x": 625, "y": 475},
      {"x": 628, "y": 488},
      {"x": 635, "y": 501},
      {"x": 651, "y": 518}
    ],
    "count": 12
  },
  {
    "object_name": "raccoon paw claw",
    "points": [
      {"x": 406, "y": 845},
      {"x": 424, "y": 847},
      {"x": 441, "y": 842},
      {"x": 473, "y": 833},
      {"x": 596, "y": 830},
      {"x": 635, "y": 844},
      {"x": 660, "y": 843},
      {"x": 679, "y": 838}
    ],
    "count": 8
  },
  {
    "object_name": "trash item on head",
    "points": [
      {"x": 514, "y": 191},
      {"x": 568, "y": 263},
      {"x": 392, "y": 274}
    ],
    "count": 3
  },
  {
    "object_name": "eye",
    "points": [
      {"x": 462, "y": 395},
      {"x": 582, "y": 398}
    ],
    "count": 2
  },
  {
    "object_name": "ear",
    "points": [
      {"x": 359, "y": 313},
      {"x": 680, "y": 294}
    ],
    "count": 2
  }
]

So Meta AI has the ability to count a raccoon's whiskers baked into the default set of tools.
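If you want to draw those points on the original image yourself, you need to scale them to pixels. Here's a rough sketch, assuming the points use the same 0-1000 coordinate space the bbox response mentioned (I haven't confirmed that for count mode). The function names and the 2000x2000 image size are made up for illustration:

```python
import json

# Two entries copied from the count-mode output above
raw = """
[
  {"object_name": "eye",
   "points": [{"x": 462, "y": 395}, {"x": 582, "y": 398}],
   "count": 2},
  {"object_name": "ear",
   "points": [{"x": 359, "y": 313}, {"x": 680, "y": 294}],
   "count": 2}
]
"""


def to_pixels(point: dict, width: int, height: int, space: int = 1000) -> tuple:
    """Scale a point from the assumed 0..space grid to pixel coordinates."""
    return (round(point["x"] * width / space),
            round(point["y"] * height / space))


def summarize(entries: list, width: int, height: int) -> dict:
    """Map each object name to pixel-space points, checking the reported count."""
    result = {}
    for entry in entries:
        if entry["count"] != len(entry["points"]):
            raise ValueError(f"count mismatch for {entry['object_name']}")
        result[entry["object_name"]] = [
            to_pixels(p, width, height) for p in entry["points"]
        ]
    return result


# 2000x2000 is a hypothetical image size
summary = summarize(json.loads(raw), width=2000, height=2000)
print(summary["eye"])  # [(924, 790), (1164, 796)]
```

The count check is there because "count" is redundant with the length of "points", so a mismatch would be a sign that the model had made up one of the two.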

Which means... it can count pelicans too!

Here's that overlay exported as HTML.

Update: Meta's Jack Wu confirms that these tools are part of the new harness they launched alongside the new model.

Maybe open weights in the future?

On Twitter Alexandr Wang said:

this is step one. bigger models are already in development with infrastructure scaling to match. private api preview open to select partners today, with plans to open-source future versions.

I really hope they do go back to open-sourcing their models. Llama 3.1/3.2/3.3 were excellent laptop-scale model families, and the introductory blog post for Muse Spark had this to say about efficiency:

[...] we can reach the same capabilities with over an order of magnitude less compute than our previous model, Llama 4 Maverick. This improvement also makes Muse Spark significantly more efficient than the leading base models available for comparison.

So are Meta back in the frontier model game? Artificial Analysis think so - they scored Muse Spark at 52, "behind only Gemini 3.1 Pro, GPT-5.4, and Claude Opus 4.6". Last year's Llama 4 Maverick and Scout scored 18 and 13 respectively.

I'm waiting for API access - while the tool collection on meta.ai is quite strong the real test of a model like this is still what we can build on top of it.

Tags: facebook, ai, generative-ai, llms, code-interpreter, llm-tool-use, meta, pelican-riding-a-bicycle, llm-reasoning, llm-release

GLM-5.1: Towards Long-Horizon Tasks

2026-04-07T21:25:14+00:00

Chinese AI lab Z.ai's latest model is a giant 754B parameter 1.51TB (on Hugging Face) MIT-licensed monster - the same size as their previous GLM-5 release, and sharing the same paper.

It's available via OpenRouter so I asked it to draw me a pelican:

llm install llm-openrouter
llm -m openrouter/z-ai/glm-5.1 'Generate an SVG of a pelican on a bicycle'

And something new happened... unprompted, the model decided to give me an HTML page that included both the SVG and a separate set of CSS animations!

The SVG was excellent, and might be my new favorite from an open weights model:

But the animation broke it:

That's the pelican, floating up in the top right corner.

I usually don't do follow-up prompts for the pelican test, but in this case I made an exception:

llm -c 'the animation is a bit broken, the pelican ends up positioned off the screen at the top right'

GLM 5.1 replied:

The issue is that CSS transform animations on SVG elements override the SVG transform attribute used for positioning, causing the pelican to lose its placement and fly off to the top-right. The fix is to separate positioning (SVG attribute) from animation (inner group) and use <animateTransform> for SVG rotations since it handles coordinate systems correctly.

And spat out fresh HTML which fixed the problem!
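To make that fix concrete, here's a stripped-down illustration of the pattern GLM 5.1 described: an outer group handles placement with the SVG transform attribute, and an inner group carries only the animateTransform. This is not the model's actual output - the rectangle is a stand-in for the pelican and all of the values are invented. I'm building it in Python so the structure can be checked with the standard library XML parser:

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

svg = f"""<svg xmlns="{SVG_NS}" viewBox="0 0 200 200">
  <!-- Outer group: static placement via the SVG transform attribute -->
  <g transform="translate(100 100)">
    <!-- Inner group: animation only, so placement is never overridden -->
    <g>
      <rect x="-20" y="-20" width="40" height="40" fill="#c06a08"/>
      <animateTransform attributeName="transform" type="rotate"
        values="-5; 5; -5" dur="1s" repeatCount="indefinite"/>
    </g>
  </g>
</svg>"""

root = ET.fromstring(svg)
outer = root.find(f"{{{SVG_NS}}}g")
inner = outer.find(f"{{{SVG_NS}}}g")
animation = inner.find(f"{{{SVG_NS}}}animateTransform")

# Placement lives on the outer group, animation on the inner one
print(outer.get("transform"))  # translate(100 100)
print(inner.get("transform"))  # None
print(animation.get("type"))   # rotate
```

Because the animated group has no transform attribute of its own, the animation has nothing to clobber: the translate on the parent keeps the shape in place while the child rotates.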

I particularly like the animation of the beak, which is described in the SVG comments like so:

<!-- Pouch (lower beak) with wobble -->
<g>
    <path d="M42,-58 Q43,-50 48,-42 Q55,-35 62,-38 Q70,-42 75,-60 L42,-58 Z" fill="url(#pouchGrad)" stroke="#b06008" stroke-width="1" opacity="0.9"/>
    <path d="M48,-50 Q55,-46 60,-52" fill="none" stroke="#c06a08" stroke-width="0.8" opacity="0.6"/>
    <animateTransform attributeName="transform" type="scale"
    values="1,1; 1.03,0.97; 1,1" dur="0.75s" repeatCount="indefinite"
    additive="sum"/>
</g>

Update: On Bluesky @charles.capps.me suggested a "NORTH VIRGINIA OPOSSUM ON AN E-SCOOTER" and...

The HTML+SVG comments on that one include /* Earring sparkle */, among others - here's the transcript and the HTML result.

Tags: llm-release, generative-ai, glm, pelican-riding-a-bicycle, ai, ai-in-china, llms, css, svg

Anthropic's Project Glasswing - restricting Claude Mythos to security researchers - sounds necessary to me

2026-04-07T20:52:54+00:00

Anthropic didn't release their latest model, Claude Mythos (system card PDF), today. They have instead made it available to a very restricted set of preview partners under their newly announced Project Glasswing.

The model is a general purpose model, similar to Claude Opus 4.6, but Anthropic claim that its cyber-security research abilities are strong enough that they need to give the software industry as a whole time to prepare.

Mythos Preview has already found thousands of high-severity vulnerabilities, including some in every major operating system and web browser. Given the rate of AI progress, it will not be long before such capabilities proliferate, potentially beyond actors who are committed to deploying them safely.

[...]

Project Glasswing partners will receive access to Claude Mythos Preview to find and fix vulnerabilities or weaknesses in their foundational systems—systems that represent a very large portion of the world’s shared cyberattack surface. We anticipate this work will focus on tasks like local vulnerability detection, black box testing of binaries, securing endpoints, and penetration testing of systems.

There's a great deal more technical detail in Assessing Claude Mythos Preview’s cybersecurity capabilities on the Anthropic Red Team blog:

In one case, Mythos Preview wrote a web browser exploit that chained together four vulnerabilities, writing a complex JIT heap spray that escaped both renderer and OS sandboxes. It autonomously obtained local privilege escalation exploits on Linux and other operating systems by exploiting subtle race conditions and KASLR-bypasses. And it autonomously wrote a remote code execution exploit on FreeBSD's NFS server that granted full root access to unauthenticated users by splitting a 20-gadget ROP chain over multiple packets.

Plus this comparison with Claude Opus 4.6:

Our internal evaluations showed that Opus 4.6 generally had a near-0% success rate at autonomous exploit development. But Mythos Preview is in a different league. For example, Opus 4.6 turned the vulnerabilities it had found in Mozilla’s Firefox 147 JavaScript engine—all patched in Firefox 148—into JavaScript shell exploits only two times out of several hundred attempts. We re-ran this experiment as a benchmark for Mythos Preview, which developed working exploits 181 times, and achieved register control on 29 more.

Saying "our model is too dangerous to release" is a great way to build buzz around a new model, but in this case I expect their caution is warranted.

Just a few days ago (last Friday) I started a new ai-security-research tag on this blog to acknowledge an uptick in credible security professionals pulling the alarm on how good modern LLMs have got at vulnerability research.

Greg Kroah-Hartman of the Linux kernel:

Months ago, we were getting what we called 'AI slop,' AI-generated security reports that were obviously wrong or low quality. It was kind of funny. It didn't really worry us.

Something happened a month ago, and the world switched. Now we have real reports. All open source projects have real reports that are made with AI, but they're good, and they're real.

Daniel Stenberg of curl:

The challenge with AI in open source security has transitioned from an AI slop tsunami into more of a ... plain security report tsunami. Less slop but lots of reports. Many of them really good.

I'm spending hours per day on this now. It's intense.

And Thomas Ptacek published Vulnerability Research Is Cooked, a post inspired by his podcast conversation with Anthropic's Nicholas Carlini.

Anthropic have a 5 minute talking heads video describing the Glasswing project. Nicholas Carlini appears as one of those talking heads, where he said (highlights mine):

It has the ability to chain together vulnerabilities. So what this means is you find two vulnerabilities, either of which doesn't really get you very much independently. But this model is able to create exploits out of three, four, or sometimes five vulnerabilities that in sequence give you some kind of very sophisticated end outcome. [...]

I've found more bugs in the last couple of weeks than I found in the rest of my life combined. We've used the model to scan a bunch of open source code, and the thing that we went for first was operating systems, because this is the code that underlies the entire internet infrastructure. For OpenBSD, we found a bug that's been present for 27 years, where I can send a couple of pieces of data to any OpenBSD server and crash it. On Linux, we found a number of vulnerabilities where as a user with no permissions, I can elevate myself to the administrator by just running some binary on my machine. For each of these bugs, we told the maintainers who actually run the software about them, and they went and fixed them and have deployed the patches so that anyone who runs the software is no longer vulnerable to these attacks.

I found this on the OpenBSD 7.8 errata page:

025: RELIABILITY FIX: March 25, 2026 All architectures

TCP packets with invalid SACK options could crash the kernel.

A source code patch exists which remedies this problem.

I tracked that change down in the GitHub mirror of the OpenBSD CVS repo (apparently they still use CVS!) and found it using git blame:

Sure enough, the surrounding code is from 27 years ago.

I'm not sure which Linux vulnerability Nicholas was describing, but it may have been this NFS one recently covered by Michael Lynch.

There's enough smoke here that I believe there's a fire. It's not surprising to find vulnerabilities in decades-old software, especially given that they're mostly written in C, but what's new is that coding agents run by the latest frontier LLMs are proving tirelessly capable at digging up these issues.

I actually thought to myself on Friday that this sounded like an industry-wide reckoning in the making, and that it might warrant a huge investment of time and money to get ahead of the inevitable barrage of vulnerabilities. Project Glasswing incorporates "$100M in usage credits ... as well as $4M in direct donations to open-source security organizations". Partners include AWS, Apple, Microsoft, Google, and the Linux Foundation. It would be great to see OpenAI involved as well - GPT-5.4 already has a strong reputation for finding security vulnerabilities and they have stronger models on the near horizon.

The bad news for those of us who are not trusted partners is this:

We do not plan to make Claude Mythos Preview generally available, but our eventual goal is to enable our users to safely deploy Mythos-class models at scale—for cybersecurity purposes, but also for the myriad other benefits that such highly capable models will bring. To do so, we need to make progress in developing cybersecurity (and other) safeguards that detect and block the model’s most dangerous outputs. We plan to launch new safeguards with an upcoming Claude Opus model, allowing us to improve and refine them with a model that does not pose the same level of risk as Mythos Preview.

I can live with that. I think the security risks really are credible here, and having extra time for trusted teams to get ahead of them is a reasonable trade-off.

Tags: security, thomas-ptacek, ai, generative-ai, llms, anthropic, nicholas-carlini, ai-ethics, llm-release, ai-security-research

Gemma 4: Byte for byte, the most capable open models

2026-04-02T18:28:54+00:00

Four new vision-capable Apache 2.0 licensed reasoning LLMs from Google DeepMind, sized at 2B, 4B, 31B, plus a 26B-A4B Mixture-of-Experts.

Google emphasize "unprecedented level of intelligence-per-parameter", providing yet more evidence that creating small useful models is one of the hottest areas of research right now.

They actually label the two smaller models as E2B and E4B for "Effective" parameter size. The system card explains:

The smaller models incorporate Per-Layer Embeddings (PLE) to maximize parameter efficiency in on-device deployments. Rather than adding more layers or parameters to the model, PLE gives each decoder layer its own small embedding for every token. These embedding tables are large but are only used for quick lookups, which is why the effective parameter count is much smaller than the total.

I don't entirely understand that, but apparently that's what the "E" in E2B means!

One particularly exciting feature of these models is that they are multi-modal beyond just images:

Vision and audio: All models natively process video and images, supporting variable resolutions, and excelling at visual tasks like OCR and chart understanding. Additionally, the E2B and E4B models feature native audio input for speech recognition and understanding.

I've not figured out a way to run audio input locally - I don't think that feature is in LM Studio or Ollama yet.

I tried them out using the GGUFs for LM Studio. The 2B (4.41GB), 4B (6.33GB) and 26B-A4B (17.99GB) models all worked perfectly, but the 31B (19.89GB) model was broken and spat out "---\n" in a loop for every prompt I tried.

The succession of pelican quality from 2B to 4B to 26B-A4B is notable:

E2B:

E4B:

26B-A4B:

(This one actually had an SVG error - "error on line 18 at column 88: Attribute x1 redefined" - but after fixing that I got probably the best pelican I've seen yet from a model that runs on my laptop.)

Google are providing API access to the two larger Gemma models via their AI Studio. I added support to llm-gemini and then ran a pelican through the 31B model using that:

llm -m gemini/gemma-4-31b-it 'Generate an SVG of a pelican riding a bicycle'

Pretty good, though it is missing the front part of the bicycle frame:

Tags: vision-llms, llm, llm-reasoning, ai, local-llms, llms, gemma, llm-release, google, generative-ai, lm-studio, pelican-riding-a-bicycle

GPT-5.4 mini and GPT-5.4 nano, which can describe 76,000 photos for $52

2026-03-17T19:39:17+00:00

OpenAI today: Introducing GPT‑5.4 mini and nano. These models join GPT-5.4 which was released two weeks ago.

OpenAI's self-reported benchmarks show the new 5.4-nano out-performing their previous GPT-5 mini model when run at maximum reasoning effort. The new mini is also 2x faster than the previous mini.

Here's how the pricing looks - all prices are per million tokens. gpt-5.4-nano is notably even cheaper than Google's Gemini 3.1 Flash-Lite:

Model	Input	Cached input	Output
gpt-5.4	$2.50	$0.25	$15.00
gpt-5.4-mini	$0.75	$0.075	$4.50
gpt-5.4-nano	$0.20	$0.02	$1.25
Other models for comparison
Claude Opus 4.6	$5.00	-	$25.00
Claude Sonnet 4.6	$3.00	-	$15.00
Gemini 3.1 Pro	$2.00	-	$12.00
Claude Haiku 4.5	$1.00	-	$5.00
Gemini 3.1 Flash-Lite	$0.25	-	$1.50

I used GPT-5.4 nano to generate a description of this photo I took at the John M. Mossman Lock Collection:

llm -m gpt-5.4-nano -a IMG_2324.jpeg 'describe image'

Here's the output:

The image shows the interior of a museum gallery with a long display wall. White-painted brick walls are covered with many framed portraits arranged in neat rows. Below the portraits, there are multiple glass display cases with dark wooden frames and glass tops/fronts, containing various old historical objects and equipment. The room has a polished wooden floor, hanging ceiling light fixtures/cords, and a few visible pipes near the top of the wall. In the foreground, glass cases run along the length of the room, reflecting items from other sections of the gallery.

That took 2,751 input tokens and 112 output tokens, at a cost of 0.069 cents (less than a tenth of a cent). That means describing every single photo in my 76,000 photo collection would cost around $52.44.
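Here's that arithmetic as a short script, using the gpt-5.4-nano prices from the table above. The function and variable names are my own:

```python
# gpt-5.4-nano prices in USD per million tokens, from the table above
INPUT_PRICE = 0.20
OUTPUT_PRICE = 1.25


def cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost of one request in US dollars."""
    return (input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE) / 1_000_000


per_photo = cost_usd(2_751, 112)
rounded_cents = round(per_photo * 100, 3)  # per-photo cost rounded to 0.069 cents

print(f"{per_photo * 100:.3f} cents per photo")                  # 0.069 cents per photo
print(f"${rounded_cents / 100 * 76_000:.2f} from rounded cost")   # $52.44 from rounded cost
print(f"${per_photo * 76_000:.2f} from unrounded cost")           # $52.46 from unrounded cost
```

The $52.44 figure comes from multiplying the rounded 0.069 cents by 76,000. Carrying the unrounded per-photo cost through gives about $52.46, which is the same answer for all practical purposes.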

I released llm 0.29 with support for the new models.

Then I had OpenAI Codex loop through all five reasoning effort levels and all three models and produce this combined SVG grid of pelicans riding bicycles (generation transcripts here). I do like the gpt-5.4 xhigh one the best, it has a good bicycle (with nice spokes) and the pelican has a fish in its beak!

Tags: ai, openai, generative-ai, llms, llm, vision-llms, llm-pricing, pelican-riding-a-bicycle, llm-release

Introducing Mistral Small 4

2026-03-16T23:41:17+00:00

Big new release from Mistral today (despite the name) - a new Apache 2 licensed 119B parameter (Mixture-of-Experts, 6B active) model which they describe like this:

Mistral Small 4 is the first Mistral model to unify the capabilities of our flagship models, Magistral for reasoning, Pixtral for multimodal, and Devstral for agentic coding, into a single, versatile model.

It supports reasoning_effort="none" or reasoning_effort="high", with the latter providing "equivalent verbosity to previous Magistral models".

The new model is 242GB on Hugging Face.

I tried it out via the Mistral API using llm-mistral:

llm install llm-mistral
llm mistral refresh
llm -m mistral/mistral-small-2603 "Generate an SVG of a pelican riding a bicycle"

I couldn't find a way to set the reasoning effort in their API documentation, so hopefully that's a feature which will land soon.

Update 23rd March: Here's new documentation for the reasoning_effort parameter.

Also from Mistral today and fitting their -stral naming convention is Leanstral, an open weight model that is specifically tuned to help output the Lean 4 formally verifiable coding language. I haven't explored Lean at all so I have no way to credibly evaluate this, but it's interesting to see them target one specific language in this way.

Tags: llm, llm-reasoning, pelican-riding-a-bicycle, ai, llms, llm-release, mistral, generative-ai

Introducing GPT‑5.4

2026-03-05T23:56:09+00:00

Introducing GPT‑5.4

Two new API models: gpt-5.4 and gpt-5.4-pro, also available in ChatGPT and Codex CLI. August 31st 2025 knowledge cutoff, 1 million token context window. Priced slightly higher than the GPT-5.2 family with a bump in price for both models if you go above 272,000 tokens.

5.4 beats coding specialist GPT-5.3-Codex on all of the relevant benchmarks. I wonder if we'll get a 5.4 Codex or if that model line has now been merged into main?

Given Claude's recent focus on business applications it's interesting to see OpenAI highlight this in their announcement of GPT-5.4:

We put a particular focus on improving GPT‑5.4’s ability to create and edit spreadsheets, presentations, and documents. On an internal benchmark of spreadsheet modeling tasks that a junior investment banking analyst might do, GPT‑5.4 achieves a mean score of 87.3%, compared to 68.4% for GPT‑5.2.

Here's a pelican on a bicycle drawn by GPT-5.4:

And here's one by GPT-5.4 Pro, which took 4m45s and cost me $1.55:

Tags: llm-release, generative-ai, openai, ai, llms, pelican-riding-a-bicycle

Gemini 3.1 Flash-Lite

2026-03-03T21:53:54+00:00

Gemini 3.1 Flash-Lite

Google's latest model is an update to their inexpensive Flash-Lite family. At $0.25/million tokens of input and $1.5/million output this is 1/8th the price of Gemini 3.1 Pro.

It supports four different thinking levels, so I had it output four different pelicans:

minimal

low

medium

high

Tags: gemini, llm, pelican-riding-a-bicycle, llm-pricing, ai, llms, llm-release, google, generative-ai

Gemini 3.1 Pro

2026-02-19T17:58:37+00:00

Gemini 3.1 Pro

The first in the Gemini 3.1 series, priced the same as Gemini 3 Pro ($2/million input, $12/million output under 200,000 tokens, $4/$18 for 200,000 to 1,000,000). That's less than half the price of Claude Opus 4.6 with very similar benchmark scores to that model.
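To make the two pricing tiers concrete, here's a rough Python sketch of a per-request cost estimate. The function name is hypothetical, and it assumes the tier is picked by the size of the input and that a long prompt bills the whole request at the higher rates - check Google's pricing page before relying on either assumption.

```python
from decimal import Decimal

# Dollars per million tokens as (input, output), from the paragraph above.
STANDARD_RATES = (Decimal("2"), Decimal("12"))
LONG_CONTEXT_RATES = (Decimal("4"), Decimal("18"))
# Assumption: the tier is chosen by input size, and exactly 200,000 stays in the lower tier.
TIER_THRESHOLD = 200_000


def gemini_31_pro_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the cost in dollars of one request (hypothetical helper)."""
    # Assumption: a long prompt bills both input and output at the higher rates.
    if input_tokens > TIER_THRESHOLD:
        input_rate, output_rate = LONG_CONTEXT_RATES
    else:
        input_rate, output_rate = STANDARD_RATES
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


print(gemini_31_pro_cost(100_000, 1_000))  # short prompt, lower tier
print(gemini_31_pro_cost(300_000, 1_000))  # long prompt, higher tier
```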

They boast about its improved SVG animation performance compared to Gemini 3 Pro in the announcement!

I tried "Generate an SVG of a pelican riding a bicycle" in Google AI Studio and it thought for 323.9 seconds (thinking trace here) before producing this one:

It's good to see the legs clearly depicted on both sides of the frame (should satisfy Elon), the fish in the basket is a nice touch and I appreciated this comment in the SVG code:

<!-- Black Flight Feathers on Wing Tip -->
<path d="M 420 175 C 440 182, 460 187, 470 190 C 450 210, 430 208, 410 198 Z" fill="#374151" />

I've added the two new model IDs gemini-3.1-pro-preview and gemini-3.1-pro-preview-customtools to my llm-gemini plugin for LLM. That "custom tools" one is described here - apparently it may provide better tool performance than the default model in some situations.

The model appears to be incredibly slow right now - it took 104s to respond to a simple "hi" and a few of my other tests met "Error: This model is currently experiencing high demand. Spikes in demand are usually temporary. Please try again later." or "Error: Deadline expired before operation could complete" errors. I'm assuming that's just teething problems on launch day.

It sounds like last week's Deep Think release was our first exposure to the 3.1 family:

Last week, we released a major update to Gemini 3 Deep Think to solve modern challenges across science, research and engineering. Today, we’re releasing the upgraded core intelligence that makes those breakthroughs possible: Gemini 3.1 Pro.

Update: In What happens if AI labs train for pelicans riding bicycles? last November I said:

If a model finally comes out that produces an excellent SVG of a pelican riding a bicycle you can bet I’m going to test it on all manner of creatures riding all sorts of transportation devices.

Google's Gemini Lead Jeff Dean tweeted this video featuring an animated pelican riding a bicycle, plus a frog on a penny-farthing and a giraffe driving a tiny car and an ostrich on roller skates and a turtle kickflipping a skateboard and a dachshund driving a stretch limousine.

I've been saying for a while that I wish AI labs would highlight things that their new models can do that their older models could not, so top marks to the Gemini team for this video.

Update 2: I used llm-gemini to run my more detailed Pelican prompt, with this result:

From the SVG comments:

<!-- Pouch Gradient (Breeding Plumage: Red to Olive/Green) -->
...
<!-- Neck Gradient (Breeding Plumage: Chestnut Nape, White/Yellow Front) -->

Tags: gemini, llm, pelican-riding-a-bicycle, ai, llms, llm-release, google, generative-ai, svg

Introducing Claude Sonnet 4.6

2026-02-17T23:58:58+00:00

Introducing Claude Sonnet 4.6

Sonnet 4.6 is out today, and Anthropic claim it offers similar performance to November's Opus 4.5 while maintaining the Sonnet pricing of $3/million input and $15/million output tokens (the Opus models are $5/$25). Here's the system card PDF.

Sonnet 4.6 has a "reliable knowledge cutoff" of August 2025, compared to Opus 4.6's May 2025 and Haiku 4.5's February 2025. Both Opus and Sonnet default to 200,000 max input tokens but can stretch to 1 million in beta and at a higher cost.

I just released llm-anthropic 0.24 with support for both Sonnet 4.6 and Opus 4.6. Claude Code did most of the work - the new models had a fiddly amount of extra details around adaptive thinking and no longer supporting prefixes, as described in Anthropic's migration guide.

Here's what I got from:

uvx --with llm-anthropic llm 'Generate an SVG of a pelican riding a bicycle' -m claude-sonnet-4.6

The SVG comments include:

<!-- Hat (fun accessory) -->

I tried a second time and also got a top hat. Sonnet 4.6 apparently loves top hats!

For comparison, here's the pelican Opus 4.5 drew me in November:

And here's Anthropic's current best pelican, drawn by Opus 4.6 on February 5th:

Opus 4.6 produces the best pelican beak/pouch. I do think the top hat from Sonnet 4.6 is a nice touch though.

Via Hacker News

Tags: llm, anthropic, claude, llm-pricing, ai, llms, llm-release, generative-ai, pelican-riding-a-bicycle, claude-code

Qwen3.5: Towards Native Multimodal Agents

2026-02-17T04:30:57+00:00

Qwen3.5: Towards Native Multimodal Agents

Alibaba's Qwen just released the first two models in the Qwen 3.5 series - one open weights, one proprietary. Both are multi-modal for vision input.

The open weight one is a Mixture of Experts model called Qwen3.5-397B-A17B. Interesting to see Qwen call out serving efficiency as a benefit of that architecture:

Built on an innovative hybrid architecture that fuses linear attention (via Gated Delta Networks) with a sparse mixture-of-experts, the model attains remarkable inference efficiency: although it comprises 397 billion total parameters, just 17 billion are activated per forward pass, optimizing both speed and cost without sacrificing capability.

It's 807GB on Hugging Face, and Unsloth have a collection of smaller GGUFs ranging in size from 94.2GB 1-bit to 462GB Q8_K_XL.

I got this pelican from the OpenRouter hosted model (transcript):

The proprietary hosted model is called Qwen3.5 Plus 2026-02-15, and is a little confusing. Qwen researcher Junyang Lin says:

Qwen3.5-Plus is a hosted API version of 397B. As the model natively supports 256K tokens, Qwen3.5-Plus supports 1M token context length. Additionally it supports search and code interpreter, which you can use on Qwen Chat with Auto mode.

Here's its pelican, which is similar in quality to the open weights model:

Tags: vision-llms, ai, qwen, llms, ai-in-china, llm-release, generative-ai, openrouter, pelican-riding-a-bicycle

Introducing GPT‑5.3‑Codex‑Spark

2026-02-12T21:16:07+00:00

Introducing GPT‑5.3‑Codex‑Spark

OpenAI announced a partnership with Cerebras on January 14th. Four weeks later they're already launching the first integration, "an ultra-fast model for real-time coding in Codex".

Despite being named GPT-5.3-Codex-Spark it's not purely an accelerated alternative to GPT-5.3-Codex - the blog post calls it "a smaller version of GPT‑5.3-Codex" and clarifies that "at launch, Codex-Spark has a 128k context window and is text-only."

I had some preview access to this model and I can confirm that it's significantly faster than their other models.

Here's what that speed looks like running in Codex CLI:

That was the "Generate an SVG of a pelican riding a bicycle" prompt - here's the rendered result:

Compare that to the speed of regular GPT-5.3 Codex medium:

Significantly slower, but the pelican is a lot better:

What's interesting about this model isn't the quality though, it's the speed. When a model responds this fast you can stay in flow state and iterate with the model much more productively.

I showed a demo of Cerebras running Llama 3.1 70B at 2,000 tokens/second against Val Town back in October 2024. OpenAI claim 1,000 tokens/second for their new model, and I expect it will prove to be a ferociously useful partner for hands-on iterative coding sessions.

It's not yet clear what the pricing will look like for this new model.

Tags: llm-performance, openai, cerebras, pelican-riding-a-bicycle, ai, llms, llm-release, codex-cli, generative-ai

Gemini 3 Deep Think

2026-02-12T18:12:17+00:00

Gemini 3 Deep Think

New from Google. They say it's "built to push the frontier of intelligence and solve modern challenges across science, research, and engineering".

It drew me a really good SVG of a pelican riding a bicycle! I think this is the best one I've seen so far - here's my previous collection.

(And since it's an FAQ, here's my answer to What happens if AI labs train for pelicans riding bicycles?)

Since it did so well on my basic "Generate an SVG of a pelican riding a bicycle" prompt I decided to try the more challenging version as well:

Generate an SVG of a California brown pelican riding a bicycle. The bicycle must have spokes and a correctly shaped bicycle frame. The pelican must have its characteristic large pouch, and there should be a clear indication of feathers. The pelican must be clearly pedaling the bicycle. The image should show the full breeding plumage of the California brown pelican.

Here's what I got:

Via Hacker News

Tags: gemini, llm-reasoning, pelican-riding-a-bicycle, ai, llms, llm-release, google, generative-ai

GLM-5: From Vibe Coding to Agentic Engineering

2026-02-11T18:56:14+00:00

GLM-5: From Vibe Coding to Agentic Engineering

This is a huge new MIT-licensed model: 744B parameters and 1.51TB on Hugging Face, twice the size of GLM-4.7 which was 368B and 717GB (4.5 and 4.6 were around that size too).

It's interesting to see Z.ai take a position on what we should call professional software engineers building with LLMs - I've seen Agentic Engineering show up in a few other places recently, most notably from Andrej Karpathy and Addy Osmani.

I ran my "Generate an SVG of a pelican riding a bicycle" prompt through GLM-5 via OpenRouter and got back a very good pelican on a disappointing bicycle frame:

Via Hacker News

Tags: agentic-engineering, pelican-riding-a-bicycle, ai, ai-in-china, llms, llm-release, vibe-coding, ai-assisted-programming, generative-ai, definitions, openrouter, glm

Opus 4.6 and Codex 5.3

2026-02-05T20:29:20+00:00

Two major new model releases today, within about 15 minutes of each other.

Anthropic released Opus 4.6. Here's its pelican:

OpenAI released GPT-5.3-Codex, albeit only via their Codex app, not yet in their API. Here's its pelican:

I've had a bit of preview access to both of these models and to be honest I'm finding it hard to find a good angle to write about them - they're both really good, but so were their predecessors Codex 5.2 and Opus 4.5. I've been having trouble finding tasks that those previous models couldn't handle but the new ones are able to ace.

The most convincing story about capabilities of the new model so far is Nicholas Carlini from Anthropic talking about Opus 4.6 and Building a C compiler with a team of parallel Claudes - Anthropic's version of Cursor's FastRender project.

Tags: llm-release, anthropic, generative-ai, openai, pelican-riding-a-bicycle, ai, llms, parallel-agents, c, nicholas-carlini

Kimi K2.5: Visual Agentic Intelligence

2026-01-27T15:07:41+00:00

Kimi K2.5: Visual Agentic Intelligence

Kimi K2 landed in July as a 1 trillion parameter open weight LLM. It was joined by Kimi K2 Thinking in November which added reasoning capabilities. Now they've made it multi-modal: the K2 models were text-only, but the new 2.5 can handle image inputs as well:

Kimi K2.5 builds on Kimi K2 with continued pretraining over approximately 15T mixed visual and text tokens. Built as a native multimodal model, K2.5 delivers state-of-the-art coding and vision capabilities and a self-directed agent swarm paradigm.

The "self-directed agent swarm paradigm" claim there means improved long-sequence tool calling and training on how to break down tasks for multiple agents to work on at once:

For complex tasks, Kimi K2.5 can self-direct an agent swarm with up to 100 sub-agents, executing parallel workflows across up to 1,500 tool calls. Compared with a single-agent setup, this reduces execution time by up to 4.5x. The agent swarm is automatically created and orchestrated by Kimi K2.5 without any predefined subagents or workflow.

I used the OpenRouter Chat UI to have it "Generate an SVG of a pelican riding a bicycle", and it did quite well:

As a more interesting test, I decided to exercise the claims around multi-agent planning with this prompt:

I want to build a Datasette plugin that offers a UI to upload files to an S3 bucket and stores information about them in a SQLite table. Break this down into ten tasks suitable for execution by parallel coding agents.

Here's the full response. It produced ten realistic tasks and reasoned through the dependencies between them. For comparison here's the same prompt against Claude Opus 4.5 and against GPT-5.2 Thinking.

The Hugging Face repository is 595GB. The model uses Kimi's janky "modified MIT" license, which adds the following clause:

Our only modification part is that, if the Software (or any derivative works thereof) is used for any of your commercial products or services that have more than 100 million monthly active users, or more than 20 million US dollars (or equivalent in other currencies) in monthly revenue, you shall prominently display "Kimi K2.5" on the user interface of such product or service.

Given the model's size, I expect one way to run it locally would be with MLX and a pair of $10,000 512GB RAM M3 Ultra Mac Studios. That setup has been demonstrated to work with previous trillion parameter K2 models.

Via Hacker News

Tags: vision-llms, ai-agents, llm-tool-use, pelican-riding-a-bicycle, ai, ai-in-china, llms, moonshot, kimi, parallel-agents, hugging-face, janky-licenses, llm-release

Introducing GPT-5.2-Codex

2025-12-19T05:21:17+00:00

Introducing GPT-5.2-Codex

The latest in OpenAI's Codex family of models (not the same thing as their Codex CLI or Codex Cloud coding agent tools).

GPT‑5.2-Codex is a version of GPT‑5.2⁠ further optimized for agentic coding in Codex, including improvements on long-horizon work through context compaction, stronger performance on large code changes like refactors and migrations, improved performance in Windows environments, and significantly stronger cybersecurity capabilities.

As with some previous Codex models this one is available via their Codex coding agents now and will be coming to the API "in the coming weeks". Unlike previous models there's a new invite-only preview process for vetted cybersecurity professionals for "more permissive models".

I've been very impressed recently with GPT 5.2's ability to tackle multi-hour agentic coding challenges. 5.2 Codex scores 64% on the Terminal-Bench 2.0 benchmark that GPT-5.2 scored 62.2% on. I'm not sure how concrete that 1.8 percentage point improvement will be!

I didn't hack API access together this time (see previous attempts), instead opting to just ask Codex CLI to "Generate an SVG of a pelican riding a bicycle" while running the new model (effort medium). Here's the transcript in my new Codex CLI timeline viewer, and here's the pelican it drew:

Tags: gpt-codex, openai, pelican-riding-a-bicycle, ai, llms, llm-release, codex-cli, generative-ai

Gemini 3 Flash

2025-12-17T22:44:52+00:00

It continues to be a busy December, if not quite as busy as last year. Today's big news is Gemini 3 Flash, the latest in Google's "Flash" line of faster and less expensive models.

Google are emphasizing the comparison between the new Flash and their previous generation's top model Gemini 2.5 Pro:

Building on 3 Pro’s strong multimodal, coding and agentic features, 3 Flash offers powerful performance at less than a quarter the cost of 3 Pro, along with higher rate limits. The new 3 Flash model surpasses 2.5 Pro across many benchmarks while delivering faster speeds.

Gemini 3 Flash's characteristics are almost identical to Gemini 3 Pro: it accepts text, image, video, audio, and PDF, outputs only text, handles 1,048,576 maximum input tokens and up to 65,536 output tokens, and has the same knowledge cut-off date of January 2025 (also shared with the Gemini 2.5 series).

The benchmarks look good. The cost is appealing: 1/4 the price of Gemini 3 Pro ≤200k and 1/8 the price of Gemini 3 Pro >200k, and it's nice not to have a price increase for the new Flash at larger token lengths.

It's a little more expensive than previous Flash models - Gemini 2.5 Flash was $0.30/million input tokens and $2.50/million on output, Gemini 3 Flash is $0.50/million and $3/million respectively.

Google claim it may still end up cheaper though, due to more efficient output token usage:

Gemini 3 Flash is able to modulate how much it thinks. It may think longer for more complex use cases, but it also uses 30% fewer tokens on average than 2.5 Pro.

Here's a more extensive price comparison on my llm-prices.com site.

Generating some SVGs of pelicans

I released llm-gemini 0.28 this morning with support for the new model. You can try it out like this:

llm install -U llm-gemini
llm keys set gemini # paste in key
llm -m gemini-3-flash-preview "Generate an SVG of a pelican riding a bicycle"

According to the developer docs the new model supports four different thinking level options: minimal, low, medium, and high. This is different from Gemini 3 Pro, which only supported low and high.

You can run those like this:

llm -m gemini-3-flash-preview --thinking-level minimal "Generate an SVG of a pelican riding a bicycle"

Here are four pelicans, for thinking levels minimal, low, medium, and high:

I built the gallery component with Gemini 3 Flash

The gallery above uses a new Web Component which I built using Gemini 3 Flash to try out its coding abilities. The code on the page looks like this:

<image-gallery width="4">
    <img src="https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-minimal-pelican-svg.jpg" alt="A minimalist vector illustration of a stylized white bird with a long orange beak and a red cap riding a dark blue bicycle on a single grey ground line against a plain white background." />
    <img src="https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-low-pelican-svg.jpg" alt="Minimalist illustration: A stylized white bird with a large, wedge-shaped orange beak and a single black dot for an eye rides a red bicycle with black wheels and a yellow pedal against a solid light blue background." />
    <img src="https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-medium-pelican-svg.jpg" alt="A minimalist illustration of a stylized white bird with a large yellow beak riding a red road bicycle in a racing position on a light blue background." />
    <img src="https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-high-pelican-svg.jpg" alt="Minimalist line-art illustration of a stylized white bird with a large orange beak riding a simple black bicycle with one orange pedal, centered against a light blue circular background." />
</image-gallery>

Those alt attributes are all generated by Gemini 3 Flash as well, using this recipe:

llm -m gemini-3-flash-preview --system '
You write alt text for any image pasted in by the user. Alt text is always presented in a
fenced code block to make it easy to copy and paste out. It is always presented on a single
line so it can be used easily in Markdown images. All text on the image (for screenshots etc)
must be exactly included. A short note describing the nature of the image itself should go first.' \
-a https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-high-pelican-svg.jpg

You can see the code that powers the image gallery Web Component here on GitHub. I built it by prompting Gemini 3 Flash via LLM like this:

llm -m gemini-3-flash-preview '
Build a Web Component that implements a simple image gallery. Usage is like this:

<image-gallery width="5">
  <img src="image1.jpg" alt="Image 1">
  <img src="image2.jpg" alt="Image 2" data-thumb="image2-thumb.jpg">
  <img src="image3.jpg" alt="Image 3">
</image-gallery>

If an image has a data-thumb= attribute that one is used instead, other images are scaled down. 

The image gallery always takes up 100% of available width. The width="5" attribute means that five images will be shown next to each other in each row. The default is 3. There are gaps between the images. When an image is clicked it opens a modal dialog with the full size image.

Return a complete HTML file with both the implementation of the Web Component several example uses of it. Use https://picsum.photos/300/200 URLs for those example images.'

It took a few follow-up prompts using llm -c:

llm -c 'Use a real modal such that keyboard shortcuts and accessibility features work without extra JS'

llm -c 'Use X for the close icon and make it a bit more subtle'

llm -c 'remove the hover effect entirely'

llm -c 'I want no border on the close icon even when it is focused'

Here's the full transcript, exported using llm logs -cue.

Those five prompts took:

225 input, 3,269 output
2,243 input, 2,908 output
4,319 input, 2,516 output
6,376 input, 2,094 output
8,151 input, 1,806 output

Added together that's 21,314 input and 12,593 output for a grand total of 4.8436 cents.
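Here's a small Python sketch that reproduces those totals from the per-prompt token counts and the Gemini 3 Flash prices quoted earlier ($0.50/million input, $3/million output). It assumes flat per-token billing with no caching discounts or other adjustments.

```python
from decimal import Decimal

# Gemini 3 Flash prices quoted earlier in this post, in dollars per million tokens.
INPUT_PRICE = Decimal("0.50")
OUTPUT_PRICE = Decimal("3.00")

# (input tokens, output tokens) for each of the five prompts listed above.
PROMPTS = [
    (225, 3269),
    (2243, 2908),
    (4319, 2516),
    (6376, 2094),
    (8151, 1806),
]


def cost_in_cents(input_tokens: int, output_tokens: int) -> Decimal:
    """Cost in cents, assuming flat per-token pricing with no discounts."""
    dollars = (input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE) / 1_000_000
    return dollars * 100


total_input = sum(inp for inp, _ in PROMPTS)
total_output = sum(out for _, out in PROMPTS)
total_cents = cost_in_cents(total_input, total_output)

print(f"{total_input:,} input, {total_output:,} output, {total_cents:.4f} cents")
```

The input counts climb with each follow-up because llm -c sends the whole conversation so far back to the model on every turn.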

The guide to migrating from Gemini 2.5 reveals one disappointment:

Image segmentation: Image segmentation capabilities (returning pixel-level masks for objects) are not supported in Gemini 3 Pro or Gemini 3 Flash. For workloads requiring native image segmentation, we recommend continuing to utilize Gemini 2.5 Flash with thinking turned off or Gemini Robotics-ER 1.5.

I wrote about this capability in Gemini 2.5 back in April. I hope it comes back in future models - it's a really neat capability that is unique to Gemini.

Tags: google, ai, web-components, generative-ai, llms, llm, gemini, llm-pricing, pelican-riding-a-bicycle, llm-release

GPT-5.2

2025-12-11T23:58:04+00:00

OpenAI reportedly declared a "code red" on the 1st of December in response to increasingly credible competition from the likes of Google's Gemini 3. It's less than two weeks later and they just announced GPT-5.2, calling it "the most capable model series yet for professional knowledge work".

Key characteristics of GPT-5.2

The new model comes in two variants: GPT-5.2 and GPT-5.2 Pro. There's no Mini variant yet.

GPT-5.2 is available via their UI in both "instant" and "thinking" modes, presumably still corresponding to the API concept of different reasoning effort levels.

The knowledge cut-off date for both variants is now August 31st 2025. This is significant - GPT 5.1 and 5 were both Sep 30, 2024 and GPT-5 mini was May 31, 2024.

Both of the 5.2 models have a 400,000 token context window and 128,000 max output tokens - no different from 5.1 or 5.

Pricing wise 5.2 is a rare increase - it's 1.4x the cost of GPT 5.1, at $1.75/million input and $14/million output. GPT-5.2 Pro is $21.00/million input and a hefty $168.00/million output, putting it up there with their previous most expensive models o1 Pro and GPT-4.5.

So far the main benchmark results we have are self-reported by OpenAI. The most interesting ones are a 70.9% score on their GDPval "Knowledge work tasks" benchmark (GPT-5 got 38.8%) and a 52.9% on ARC-AGI-2 (up from 17.6% for GPT-5.1 Thinking).

The ARC Prize Twitter account provided this interesting note on the efficiency gains for GPT-5.2 Pro:

A year ago, we verified a preview of an unreleased version of @OpenAI o3 (High) that scored 88% on ARC-AGI-1 at est. $4.5k/task

Today, we’ve verified a new GPT-5.2 Pro (X-High) SOTA score of 90.5% at $11.64/task

This represents a ~390X efficiency improvement in one year
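The ~390X figure appears to be roughly the ratio of the two per-task costs. A quick Python check, assuming "efficiency" here means cost per task and ignoring the small difference between the 88% and 90.5% scores:

```python
# Per-task costs in dollars, from the ARC Prize note above.
O3_PREVIEW_COST_PER_TASK = 4500.00  # "est. $4.5k/task", itself an estimate
GPT_52_PRO_COST_PER_TASK = 11.64

# Assumption: the efficiency gain is the plain ratio of the two costs.
ratio = O3_PREVIEW_COST_PER_TASK / GPT_52_PRO_COST_PER_TASK
print(f"{ratio:.1f}x cheaper per task")
```

That lands in the high 380s, which fits "~390X" given that the $4.5k starting point is only an estimate.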

GPT-5.2 can be accessed in OpenAI's Codex CLI tool like this:

codex -m gpt-5.2

There are three new API models:

gpt-5.2 - I think this is what you get if you select "GPT-5.2 Thinking" in ChatGPT but I'm a little confused.
gpt-5.2-chat-latest - the model used by ChatGPT for "GPT-5.2 Instant" mode. It's priced the same as GPT-5.2 but has a reduced 128,000 context window with 16,384 max output tokens.
gpt-5.2-pro

OpenAI have published a new GPT-5.2 Prompting Guide. An interesting note from that document is that compaction can now be run with a new dedicated server-side API:

For long-running, tool-heavy workflows that exceed the standard context window, GPT-5.2 with Reasoning supports response compaction via the /responses/compact endpoint. Compaction performs a loss-aware compression pass over prior conversation state, returning encrypted, opaque items that preserve task-relevant information while dramatically reducing token footprint. This allows the model to continue reasoning across extended workflows without hitting context limits.

It's better at vision

One note from the announcement that caught my eye:

GPT‑5.2 Thinking is our strongest vision model yet, cutting error rates roughly in half on chart reasoning and software interface understanding.

I had disappointing results from GPT-5 on an OCR task a while ago. I tried it against GPT-5.2 and it did much better:

llm -m gpt-5.2 ocr -a https://static.simonwillison.net/static/2025/ft.jpeg

Here's the result from that, which cost 1,520 input and 1,022 output tokens for a total of 1.6968 cents.

Rendering some pelicans

For my classic "Generate an SVG of a pelican riding a bicycle" test:

llm -m gpt-5.2 "Generate an SVG of a pelican riding a bicycle"

And for the more advanced alternative test, which tests instruction following in a little more depth:

llm -m gpt-5.2 "Generate an SVG of a California brown pelican riding a bicycle. The bicycle
must have spokes and a correctly shaped bicycle frame. The pelican must have its
characteristic large pouch, and there should be a clear indication of feathers.
The pelican must be clearly pedaling the bicycle. The image should show the full
breeding plumage of the California brown pelican."

Update 14th December 2025: I used GPT-5.2 running in Codex CLI to port a complex Python library to JavaScript. It ran without interference for nearly four hours and completed a complex task exactly to my specification.

Tags: ai, openai, generative-ai, llms, llm, pelican-riding-a-bicycle, llm-release, gpt-5

Devstral 2

2025-12-09T23:58:27+00:00

Devstral 2

Two new models from Mistral today: Devstral 2 and Devstral Small 2 - both focused on powering coding agents such as Mistral's newly released Mistral Vibe which I wrote about earlier today.

Devstral 2: SOTA open model for code agents with a fraction of the parameters of its competitors and achieving 72.2% on SWE-bench Verified.

Up to 7x more cost-efficient than Claude Sonnet at real-world tasks.

Devstral 2 is a 123B model released under a janky license - it's "modified MIT" where the modification is:

You are not authorized to exercise any rights under this license if the global consolidated monthly revenue of your company (or that of your employer) exceeds $20 million (or its equivalent in another currency) for the preceding month. This restriction in (b) applies to the Model and any derivatives, modifications, or combined works based on it, whether provided by Mistral AI or by a third party. [...]

Devstral Small 2 is under a proper Apache 2 license with no weird strings attached. It's a 24B model which is 51.6GB on Hugging Face and should quantize to significantly less.

I tried out the larger model via my llm-mistral plugin like this:

llm install llm-mistral
llm mistral refresh
llm -m mistral/devstral-2512 "Generate an SVG of a pelican riding a bicycle"

For a ~120B model that one is pretty good!

Here's the same prompt with -m mistral/labs-devstral-small-2512 for the API hosted version of Devstral Small 2:

Again, a decent result given the small parameter size. For comparison, here's what I got for the 24B Mistral Small 3.2 earlier this year.

Tags: llm-release, mistral, generative-ai, ai, janky-licenses, llms, llm, pelican-riding-a-bicycle

Introducing Mistral 3

2025-12-02T17:30:57+00:00

Introducing Mistral 3

Four new models from Mistral today: three in their "Ministral" smaller model series (14B, 8B, and 3B) and a new Mistral Large 3 MoE model with 675B parameters, 41B active.

All of the models are vision capable, and they are all released under an Apache 2 license.

I'm particularly excited about the 3B model, which appears to be a competent vision-capable model in a tiny ~3GB file.

Xenova from Hugging Face got it working in a browser:

@MistralAI releases Mistral 3, a family of multimodal models, including three start-of-the-art dense models (3B, 8B, and 14B) and Mistral Large 3 (675B, 41B active). All Apache 2.0! 🤗

Surprisingly, the 3B is small enough to run 100% locally in your browser on WebGPU! 🤯

You can try that demo in your browser, which will fetch 3GB of model and then stream from your webcam and let you run text prompts against what the model is seeing, entirely locally.

Mistral's API hosted versions of the new models are supported by my llm-mistral plugin already thanks to the llm mistral refresh command:

$ llm mistral refresh
Added models: ministral-3b-2512, ministral-14b-latest, mistral-large-2512, ministral-14b-2512, ministral-8b-2512

I tried pelicans against all of the models. Here's the best one, from Mistral Large 3:

And the worst from Ministral 3B:

Tags: vision-llms, llm-release, mistral, llm, generative-ai, ai, llms

DeepSeek-V3.2

2025-12-01T23:56:19+00:00

DeepSeek-V3.2

Two new open weight (MIT licensed) models from DeepSeek today: DeepSeek-V3.2 and DeepSeek-V3.2-Speciale, both 690GB, 685B parameters. Here's the PDF tech report.

DeepSeek-V3.2 is DeepSeek's new flagship model, now running on chat.deepseek.com.

The difference between the two new models is best explained by this paragraph from the technical report:

DeepSeek-V3.2 integrates reasoning, agent, and human alignment data distilled from specialists, undergoing thousands of steps of continued RL training to reach the final checkpoints. To investigate the potential of extended thinking, we also developed an experimental variant, DeepSeek-V3.2-Speciale. This model was trained exclusively on reasoning data with a reduced length penalty during RL. Additionally, we incorporated the dataset and reward method from DeepSeekMath-V2 (Shao et al., 2025) to enhance capabilities in mathematical proofs.

I covered DeepSeek-Math-V2 last week. Like that model, DeepSeek-V3.2-Speciale also scores gold on the 2025 International Mathematical Olympiad so beloved of model training teams!

I tried both models on "Generate an SVG of a pelican riding a bicycle" using the chat feature of [OpenRouter](https://openrouter.ai/). DeepSeek V3.2 produced this very short reasoning chain:

Let's assume the following:

Wheel radius: 40
Distance between wheel centers: 180
Seat height: 60 (above the rear wheel center)
Handlebars: above the front wheel, extending back and up.

We'll set the origin at the center of the rear wheel.

We'll create the SVG with a viewBox that fits the entire drawing.

Let's start by setting up the SVG.

Followed by this illustration:

Here's what I got from the Speciale model, which thought deeply about the geometry of bicycles and pelicans for a very long time (at least 10 minutes) before spitting out this result:

Via Hacker News

Tags: llm-release, openrouter, generative-ai, deepseek, ai, ai-in-china, llms, llm-reasoning, pelican-riding-a-bicycle

deepseek-ai/DeepSeek-Math-V2

2025-11-27T15:59:23+00:00

New on Hugging Face, a specialist mathematical reasoning LLM from DeepSeek. This is their entry in the space previously dominated by proprietary models from OpenAI and Google DeepMind, both of which achieved gold medal scores on the International Mathematical Olympiad earlier this year.

We now have an open weights (Apache 2 licensed) 685B, 689GB model that can achieve the same. From the accompanying paper:

DeepSeekMath-V2 demonstrates strong performance on competition mathematics. With scaled test-time compute, it achieved gold-medal scores in high-school competitions including IMO 2025 and CMO 2024, and a near-perfect score on the undergraduate Putnam 2024 competition.

Tags: llm-release, llm-reasoning, deepseek, ai, ai-in-china, llms, mathematics, generative-ai

Claude Opus 4.5, and why evaluating new LLMs is increasingly difficult

2025-11-24T19:37:07+00:00

Anthropic released Claude Opus 4.5 this morning, which they call "best model in the world for coding, agents, and computer use". This is their attempt to retake the crown for best coding model after significant challenges from OpenAI's GPT-5.1-Codex-Max and Google's Gemini 3, both released within the past week!

The core characteristics of Opus 4.5 are a 200,000 token context (same as Sonnet), 64,000 token output limit (also the same as Sonnet), and a March 2025 "reliable knowledge cutoff" (Sonnet 4.5 is January, Haiku 4.5 is February).

The pricing is a big relief: $5/million for input and $25/million for output. This is a lot cheaper than the previous Opus at $15/$75 and keeps it a little more competitive with the GPT-5.1 family ($1.25/$10) and Gemini 3 Pro ($2/$12, or $4/$18 for >200,000 tokens). For comparison, Sonnet 4.5 is $3/$15 and Haiku 4.5 is $1/$5.

The Key improvements in Opus 4.5 over Opus 4.1 document has a few more interesting details:

Opus 4.5 has a new effort parameter which defaults to high but can be set to medium or low for faster responses.
The model supports enhanced computer use, specifically a zoom tool which you can provide to Opus 4.5 to allow it to request a zoomed in region of the screen to inspect.
"Thinking blocks from previous assistant turns are preserved in model context by default" - apparently previous Anthropic models discarded those.

I had access to a preview of Anthropic's new model over the weekend. I spent a bunch of time with it in Claude Code, resulting in a new alpha release of sqlite-utils that included several large-scale refactorings - Opus 4.5 was responsible for most of the work across 20 commits, 39 files changed, 2,022 additions and 1,173 deletions in a two day period. Here's the Claude Code transcript where I had it help implement one of the more complicated new features.

It's clearly an excellent new model, but I did run into a catch. My preview expired at 8pm on Sunday when I still had a few remaining issues in the milestone for the alpha. I switched back to Claude Sonnet 4.5 and... kept on working at the same pace I'd been achieving with the new model.

With hindsight, production coding like this is a less effective way of evaluating the strengths of a new model than I had expected.

I'm not saying the new model isn't an improvement on Sonnet 4.5 - but I can't say with confidence that the challenges I posed it were able to identify a meaningful difference in capabilities between the two.

This represents a growing problem for me. My favorite moments in AI are when a new model gives me the ability to do something that simply wasn't possible before. In the past these have felt a lot more obvious, but today it's often very difficult to find concrete examples that differentiate the new generation of models from their predecessors.

Google's Nano Banana Pro image generation model was notable in that its ability to render usable infographics really does represent a task at which previous models had been laughably incapable.

The frontier LLMs are a lot harder to differentiate between. Benchmarks like SWE-bench Verified show models beating each other by single digit percentage point margins, but what does that actually equate to in real-world problems that I need to solve on a daily basis?

And honestly, this is mainly on me. I've fallen behind on maintaining my own collection of tasks that are just beyond the capabilities of the frontier models. I used to have a whole bunch of these but they've fallen one-by-one and now I'm embarrassingly lacking in suitable challenges to help evaluate new models.

I frequently advise people to stash away tasks that models fail at in their notes so they can try them against newer models later on - a tip I picked up from Ethan Mollick. I need to double-down on that advice myself!

I'd love to see AI labs like Anthropic help address this challenge directly. I'd like to see new model releases accompanied by concrete examples of tasks they can solve that the previous generation of models from the same provider were unable to handle.

"Here's an example prompt which failed on Sonnet 4.5 but succeeds on Opus 4.5" would excite me a lot more than some single digit percent improvement on a benchmark with a name like MMLU or GPQA Diamond.

In the meantime, I'm just gonna have to keep on getting them to draw pelicans riding bicycles. Here's Opus 4.5 (on its default "high" effort level):

It did significantly better on the new more detailed prompt:

Here's that same complex prompt against Gemini 3 Pro and against GPT-5.1-Codex-Max-xhigh.

Still susceptible to prompt injection

From the safety section of Anthropic's announcement post:

With Opus 4.5, we’ve made substantial progress in robustness against prompt injection attacks, which smuggle in deceptive instructions to fool the model into harmful behavior. Opus 4.5 is harder to trick with prompt injection than any other frontier model in the industry:

On the one hand this looks great - it's a clear improvement over previous models and the competition.

What does the chart actually tell us though? It tells us that single attempts at prompt injection still work 1/20 times, and if an attacker can try ten different attacks that success rate goes up to 1/3!
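
As a rough sanity check on those two fractions, here is a back-of-envelope calculation. It assumes each attack attempt succeeds independently at the single-attempt rate of 1/20; that independence assumption is mine and is not something the chart states.

```python
# Back-of-envelope: if every attempt independently succeeded 1 time in 20,
# how often would at least one of k attempts get through?
# The independence assumption is illustrative, not taken from Anthropic's chart.

def at_least_one_success(p_single: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries succeeds."""
    return 1 - (1 - p_single) ** attempts


p_single = 1 / 20
for k in (1, 10, 100):
    print(f"{k:>3} attempts: {at_least_one_success(p_single, k):.1%}")
```

Under that assumption ten attempts would succeed about 40% of the time, a little above the roughly 1/3 shown for ten attacks. That would be consistent with different attacks against the same model not being fully independent, though the fractions quoted here are rounded. Either way, the success rate climbs quickly as the attacker gets more tries.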

I still don't think training models not to fall for prompt injection is the way forward here. We continue to need to design our applications under the assumption that a suitably motivated attacker will be able to find a way to trick the models.

Tags: prompt-injection, generative-ai, llms, anthropic, claude, evals, llm-pricing, pelican-riding-a-bicycle, llm-release, november-2025-inflection

Olmo 3 is a fully open LLM

2025-11-22T23:59:46+00:00

Olmo is the LLM series from Ai2 - the Allen Institute for AI. Unlike most open weight models these are notable for including the full training data, training process and checkpoints along with those releases.

The new Olmo 3 claims to be "the best fully open 32B-scale thinking model" and has a strong focus on interpretability:

At its center is Olmo 3-Think (32B), the best fully open 32B-scale thinking model that for the first time lets you inspect intermediate reasoning traces and trace those behaviors back to the data and training decisions that produced them.

They've released four 7B models - Olmo 3-Base, Olmo 3-Instruct, Olmo 3-Think and Olmo 3-RL Zero, plus 32B variants of the 3-Think and 3-Base models.

Having full access to the training data is really useful. Here's how they describe that:

Olmo 3 is pretrained on Dolma 3, a new ~9.3-trillion-token corpus drawn from web pages, science PDFs processed with olmOCR, codebases, math problems and solutions, and encyclopedic text. From this pool, we construct Dolma 3 Mix, a 5.9-trillion-token (~6T) pretraining mix with a higher proportion of coding and mathematical data than earlier Dolma releases, plus much stronger decontamination via extensive deduplication, quality filtering, and careful control over data mixing. We follow established web standards in collecting training data and don't collect from sites that explicitly disallow it, including paywalled content.

They also highlight that they are training on fewer tokens than their competition:

[...] it's the strongest fully open thinking model we're aware of, narrowing the gap to the best open-weight models of similar scale – such as Qwen 3 32B – while training on roughly 6x fewer tokens.

If you're continuing to hold out hope for a model trained entirely on licensed data this one sadly won't fit the bill - a lot of that data still comes from a crawl of the web.

I tried out the 32B Think model and the 7B Instruct model using LM Studio. The 7B model is a 4.16GB download, the 32B one is 18.14GB.

The 32B model is absolutely an over-thinker! I asked it to "Generate an SVG of a pelican riding a bicycle" and it thought for 14 minutes 43 seconds, outputting 8,437 tokens total, most of which was this epic thinking trace.

I don't usually quote the full SVG in these write-ups, but in this case it's short enough that I think it's worth sharing. The SVG comments give a great impression of what it was trying to do - it has a Bicycle, Bike frame, Pelican, Left and Right wings and even "Feet on pedals".

<svg width="200" height="200" viewBox="0 0 100 100">
  <!-- Bicycle -->
  <circle cx="30" cy="60" r="15" stroke="black" fill="none"/>
  <circle cx="70" cy="60" r="15" stroke="black" fill="none"/>
  
  <!-- Bike frame -->
  <rect x="35" y="25" width="30" height="10" fill="saddlebrown"/>
  <line x1="35" y1="40" x2="30" y2="60" stroke="black" stroke-width="3"/>
  <line x1="65" y1="40" x2="70" y2="60" stroke="black" stroke-width="3"/>
  
  <!-- Pelican -->
  <ellipse cx="55" cy="65" rx="20" ry="15" fill="white"/>
  <polygon points="52 50,57 35,62 50" fill="black"/> <!-- Head/beak -->
  <circle cx="55" cy="45" r="2" fill="white"/>
  <circle cx="60" cy="45" r="2" fill="white"/>
  
  <polygon points="45 60,50 70,55 60" fill="lightgrey"/> <!-- Left wing -->
  <polygon points="65 60,70 70,55 60" fill="lightgrey"/> <!-- Right wing -->
  
  <!-- Feet on pedals -->
  <polygon points="25 75,30 85,35 75" fill="black"/>
  <polygon points="75 75,70 85,65 75" fill="black"/>
</svg>
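
Since the comments are the interesting part, here is a small sketch that pulls them out of the SVG above and counts the shapes used. It relies only on the Python standard library; the `SVG` constant is simply the markup above pasted into a string.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# The SVG quoted above, pasted in verbatim.
SVG = """<svg width="200" height="200" viewBox="0 0 100 100">
  <!-- Bicycle -->
  <circle cx="30" cy="60" r="15" stroke="black" fill="none"/>
  <circle cx="70" cy="60" r="15" stroke="black" fill="none"/>

  <!-- Bike frame -->
  <rect x="35" y="25" width="30" height="10" fill="saddlebrown"/>
  <line x1="35" y1="40" x2="30" y2="60" stroke="black" stroke-width="3"/>
  <line x1="65" y1="40" x2="70" y2="60" stroke="black" stroke-width="3"/>

  <!-- Pelican -->
  <ellipse cx="55" cy="65" rx="20" ry="15" fill="white"/>
  <polygon points="52 50,57 35,62 50" fill="black"/> <!-- Head/beak -->
  <circle cx="55" cy="45" r="2" fill="white"/>
  <circle cx="60" cy="45" r="2" fill="white"/>

  <polygon points="45 60,50 70,55 60" fill="lightgrey"/> <!-- Left wing -->
  <polygon points="65 60,70 70,55 60" fill="lightgrey"/> <!-- Right wing -->

  <!-- Feet on pedals -->
  <polygon points="25 75,30 85,35 75" fill="black"/>
  <polygon points="75 75,70 85,65 75" fill="black"/>
</svg>"""

# ElementTree drops comments by default, so grab them with a regex instead.
comments = re.findall(r"<!--\s*(.*?)\s*-->", SVG)

# Count the direct children of <svg> by tag name.
shapes = Counter(element.tag for element in ET.fromstring(SVG))

print(comments)
print(dict(shapes))
```

That comes to thirteen shapes in total: four circles (two wheels, two eyes), five triangles, two lines, one rectangle and one ellipse for the SVG above.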

Rendered it looks like this:

I tested OLMo 2 32B 4bit back in March and got something that, while pleasingly abstract, didn't come close to resembling a pelican or a bicycle:

To be fair 32B models generally don't do great with this. Here's Qwen 3 32B's attempt (I ran that just now using OpenRouter):

OlmoTrace

I was particularly keen on trying out the ability to "inspect intermediate reasoning traces". Here's how that's described later in the announcement:

A core goal of Olmo 3 is not just to open the model flow, but to make it actionable for people who want to understand and improve model behavior. Olmo 3 integrates with OlmoTrace, our tool for tracing model outputs back to training data in real time.

For example, in the Ai2 Playground, you can ask Olmo 3-Think (32B) to answer a general-knowledge question, then use OlmoTrace to inspect where and how the model may have learned to generate parts of its response. This closes the gap between training data and model behavior: you can see not only what the model is doing, but why---and adjust data or training decisions accordingly.

You can access OlmoTrace via playground.allenai.org, by first running a prompt and then clicking the "Show OlmoTrace" button below the output.

I tried that on "Generate a conference bio for Simon Willison" (an ego-prompt I use to see how much the models have picked up about me from their training data) and got back a result that looked like this:

It thinks I co-founded co:here and work at Anthropic, both of which are incorrect - but that's not uncommon with LLMs, I frequently see them suggest that I'm the CTO of GitHub and other such inaccuracies.

I found the OlmoTrace panel on the right disappointing. None of the training documents it highlighted looked relevant - it appears to be looking for phrase matches (powered by Ai2's infini-gram) but the documents it found had nothing to do with me at all.

Can open training data address concerns of backdoors?

Ai2 claim that Olmo 3 is "the best fully open 32B-scale thinking model", which I think holds up provided you define "fully open" as including open training data. There's not a great deal of competition in that space though - Ai2 compare themselves to Stanford's Marin and Swiss AI's Apertus, neither of which I'd heard about before.

A big disadvantage of other open weight models is that it's impossible to audit their training data. Anthropic published a paper last month showing that a small number of samples can poison LLMs of any size - it can take just "250 poisoned documents" to add a backdoor to a large model that triggers undesired behavior based on a short carefully crafted prompt.

This makes fully open training data an even bigger deal.

Ai2 researcher Nathan Lambert included this note about the importance of transparent training data in his detailed post about the release:

In particular, we're excited about the future of RL Zero research on Olmo 3 precisely because everything is open. Researchers can study the interaction between the reasoning traces we include at midtraining and the downstream model behavior (qualitative and quantitative).

This helps answer questions that have plagued RLVR results on Qwen models, hinting at forms of data contamination particularly on math and reasoning benchmarks (see Shao, Rulin, et al. "Spurious rewards: Rethinking training signals in rlvr." arXiv preprint arXiv:2506.10947 (2025). or Wu, Mingqi, et al. "Reasoning or memorization? unreliable results of reinforcement learning due to data contamination." arXiv preprint arXiv:2507.10532 (2025).)

I hope we see more competition in this space, including further models in the Olmo series. The improvements from Olmo 1 (in February 2024) and Olmo 2 (in March 2025) have been significant. I'm hoping that trend continues!

Tags: ai, generative-ai, llms, interpretability, pelican-riding-a-bicycle, llm-reasoning, ai2, ai-ethics, llm-release, lm-studio, nathan-lambert, olmo

Nano Banana Pro aka gemini-3-pro-image-preview is the best available image generation model

2025-11-20T16:32:25+00:00

Hot on the heels of Tuesday's Gemini 3 Pro release, today it's Nano Banana Pro, also known as Gemini 3 Pro Image. I've had a few days of preview access and this is an astonishingly capable image generation model.

As is often the case, the most useful low-level details can be found in the API documentation:

Designed to tackle the most challenging workflows through advanced reasoning, it excels at complex, multi-turn creation and modification tasks.

High-resolution output: Built-in generation capabilities for 1K, 2K, and 4K visuals.

Advanced text rendering: Capable of generating legible, stylized text for infographics, menus, diagrams, and marketing assets.

Grounding with Google Search: The model can use Google Search as a tool to verify facts and generate imagery based on real-time data (e.g., current weather maps, stock charts, recent events).

Thinking mode: The model utilizes a "thinking" process to reason through complex prompts. It generates interim "thought images" (visible in the backend but not charged) to refine the composition before producing the final high-quality output.

Up to 14 reference images: You can now mix up to 14 reference images to produce the final image.

[...] These 14 images can include the following:

Up to 6 images of objects with high-fidelity to include in the final image

Up to 5 images of humans to maintain character consistency

There is also a short (6 page) model card PDF which lists the following as "new capabilities" compared to the previous Nano Banana: Multi character editing, Chart editing, Text editing, Factuality - Edu, Multi-input 1-3, Infographics, Doodle editing, Visual design.

Trying out some detailed instruction image prompts

Max Woolf published the definitive guide to prompting Nano Banana just a few days ago. I decided to try his example prompts against the new model, requesting results in 4K.

Here's what I got for his first test prompt, using Google's AI Studio:

Create an image of a three-dimensional pancake in the shape of a skull, garnished on top with blueberries and maple syrup.

The result came out as a 24.1MB, 5632 × 3072 pixel PNG file. I don't want to serve that on my own blog so here's a Google Drive link for the original.

Then I ran his follow-up prompt:

Make ALL of the following edits to the image:
- Put a strawberry in the left eye socket.
- Put a blackberry in the right eye socket.
- Put a mint garnish on top of the pancake.
- Change the plate to a plate-shaped chocolate-chip cookie.
- Add happy people to the background.

I'll note that it did put the plate-sized cookie on a regular plate. Here's the 24.9MB PNG.

The new model isn't cheap. Here's the API pricing: it's 24 cents for a 4K image and 13.4 cents for a 1K or 2K image. Image inputs are 0.11 cents (just over 1/10th of a cent) each - an earlier version of their pricing page incorrectly said 6.7 cents each but that's now been fixed.
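
To make those numbers concrete, here is a minimal cost sketch using only the per-image prices quoted above. The function name and structure are hypothetical, and it ignores any separate charge for text prompt tokens.

```python
# Per-image prices in US cents, copied from the paragraph above.
OUTPUT_CENTS = {"1K": 13.4, "2K": 13.4, "4K": 24.0}
INPUT_IMAGE_CENTS = 0.11


def image_job_cents(resolution: str, input_images: int = 0) -> float:
    """Cost of one generated image plus any reference images passed in.

    Illustrative only: text prompt tokens are not included.
    """
    return OUTPUT_CENTS[resolution] + input_images * INPUT_IMAGE_CENTS


# One 4K output using the maximum of 14 reference images.
print(round(image_job_cents("4K", input_images=14), 2))  # 25.54
```

The output image dominates the bill: even with all 14 reference images attached, the inputs add only about a cent and a half.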

Unlike most of Google's other models it also isn't available for free via AI Studio: you have to configure an API key with billing in order to use the model there.

Creating an infographic

So this thing is great at following instructions. How about rendering text?

I tried this prompt, this time using the Gemini consumer app in "thinking" mode (which now uses Nano Banana Pro for image generation). Here's a share link - my prompt was:

Infographic explaining how the Datasette open source project works

This is a great opportunity to test its ability to run searches (aka "Grounding with Google Search"). Here's what it created based on that 9 word prompt:

That's a pretty extraordinary result for such a simple prompt. The text is all spelled correctly and rendered without glitches. The content is solid too - it even included logos for the most popular publish platforms, and a tiny thumbnail of the Datasette UI which is close-enough for an infographic.

"Data Ingestion (Read-Only)" is a bit off. Everything else looks great to me.

Here's the thinking trace for that image generation process:

Initiating Datasette Infographic

I'm starting a detailed infographic about the Datasette project. I plan to use Google Search to gather specific information about its core processes, which I will then visualize in a multi-panel diagram.

Developing the Datasette workflow

I am visualizing the Datasette workflow as described in the search results. I'm focusing on the conversion of data from different formats into a SQLite database, which is then presented as a web application complete with a user interface and JSON output.

Visualizing Datasette Workflow

I've sketched out the initial structure for the infographic, focusing on a flowchart style. The key components include data input from diverse sources, conversion to SQLite, the Datasette core application, its main features (web UI, JSON API, SQL querying), ecosystem plugins, and finally, deployment options. I aim for clear icons and labels to represent each step. Now I will focus on refining the flow and clarifying the relationships between the parts.

Defining Core Datasette Stages

I've sketched out the initial infographic structure. I have identified the key components, focusing on visual representation of the data flow and the key benefits and features of the system. I have made updates to the flowchart with clear icons and labels. Now I'm working to expand the data input section and refine the visual representation of each step.

Constructing Datasette Infographic

I am translating the current Datasette workflow outline into visual elements. The input section will show icons for different data formats. I've sketched the data conversion stage with clear illustrations of the necessary tools. I will add a depiction of the Datasette core application, showcasing its interfaces and functions. The plugin ecosystem and deployment options will follow. Now, I am refining the visual representation.

Analyzing Datasette's Architecture

I'm currently working on representing the Datasette open-source project's architecture visually. I've broken it down into key components: raw data processing, SQLite conversion, core application functionalities, plugin integration, and deployment methods. I'm focusing on creating a clear, linear representation to show the workflow.

Mapping the Datasette Flow

I've crafted a comprehensive graphic to illustrate Datasette's operation. It spans the data ingestion, SQLite conversion, main application features, plugin interactions, and deployment stages, aiming for clarity and a logical flow.

Detecting generated images with SynthID

This note from the announcement post caught my eye:

We believe it's critical to know when an image is AI-generated. This is why all media generated by Google's tools are embedded with our imperceptible SynthID digital watermark.

Today, we are putting a powerful verification tool directly in consumers' hands: you can now upload an image into the Gemini app and simply ask if it was generated by Google AI, thanks to SynthID technology. We are starting with images, but will expand to audio and video soon.

Last night I used Nano Banana Pro to generate a fake photograph of raccoons stealing our food delivery, then scrubbed out the little diamond icon using the Apple Photos "cleanup" tool. I uploaded that to the Gemini app and asked "Was this image created with AI?":

It replied:

Yes, it appears that all or part of this image was created with Google AI. SynthID detected a watermark in 25-50% of the image.

Presumably that 25-50% figure is because the rest of the photo was taken by me - it was just the raccoons that were added by Nano Banana Pro.

Tags: google, ai, datasette, generative-ai, llms, gemini, text-to-image, llm-release, nano-banana

Building more with GPT-5.1-Codex-Max

2025-11-19T23:15:10+00:00

Hot on the heels of yesterday's Gemini 3 Pro release comes a new model from OpenAI called GPT-5.1-Codex-Max.

(Remember when GPT-5 was meant to bring in a new era of less confusing model names? That didn't last!)

It's currently only available through their Codex CLI coding agent, where it's the new default model:

Starting today, GPT‑5.1-Codex-Max will replace GPT‑5.1-Codex as the default model in Codex surfaces. Unlike GPT‑5.1, which is a general-purpose model, we recommend using GPT‑5.1-Codex-Max and the Codex family of models only for agentic coding tasks in Codex or Codex-like environments.

It's not available via the API yet but should be shortly.

The timing of this release is interesting given that Gemini 3 Pro appears to have aced almost all of the benchmarks just yesterday. It's reminiscent of the period in 2024 when OpenAI consistently made big announcements that happened to coincide with Gemini releases.

OpenAI's self-reported SWE-Bench Verified score is particularly notable: 76.5% for thinking level "high" and 77.9% for the new "xhigh". That was the one benchmark where Gemini 3 Pro was out-performed by Claude Sonnet 4.5 - Gemini 3 Pro got 76.2% and Sonnet 4.5 got 77.2%. OpenAI now have the highest scoring model there by a full .7 of a percentage point!

They also report a score of 58.1% on Terminal Bench 2.0, beating Gemini 3 Pro's 54.2% (and Sonnet 4.5's 42.8%).

The most intriguing part of this announcement concerns the model's approach to long context problems:

GPT‑5.1-Codex-Max is built for long-running, detailed work. It’s our first model natively trained to operate across multiple context windows through a process called compaction, coherently working over millions of tokens in a single task. [...]

Compaction enables GPT‑5.1-Codex-Max to complete tasks that would have previously failed due to context-window limits, such as complex refactors and long-running agent loops by pruning its history while preserving the most important context over long horizons. In Codex applications, GPT‑5.1-Codex-Max automatically compacts its session when it approaches its context window limit, giving it a fresh context window. It repeats this process until the task is completed.

There's a lot of confusion on Hacker News about what this actually means. Claude Code already does a version of compaction, automatically summarizing previous turns when the context runs out. Does this just mean that Codex-Max is better at that process?
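
For anyone unfamiliar with the general idea, here is a toy sketch of summarize-when-full compaction. It is not how GPT-5.1-Codex-Max or Claude Code implement it: the `Session` class, the word-count "tokenizer" and the placeholder summary are all hypothetical stand-ins.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    limit: int                # hypothetical context budget, in "tokens"
    keep_recent: int = 2      # most recent turns carried over verbatim
    turns: list[str] = field(default_factory=list)
    compactions: int = 0

    def tokens(self) -> int:
        # Crude stand-in for a tokenizer: one token per whitespace-separated word.
        return sum(len(turn.split()) for turn in self.turns)

    def summarize(self, old: list[str]) -> str:
        # Placeholder: a real agent would ask the model to write this summary.
        return f"[summary of {len(old)} earlier turns]"

    def add(self, turn: str) -> None:
        self.turns.append(turn)
        if self.tokens() > self.limit and len(self.turns) > self.keep_recent:
            old = self.turns[:-self.keep_recent]
            recent = self.turns[-self.keep_recent:]
            # Replace everything except the recent turns with one summary turn.
            self.turns = [self.summarize(old)] + recent
            self.compactions += 1


session = Session(limit=20)
for i in range(10):
    session.add(f"turn {i}: " + "word " * 5)

print(session.compactions, len(session.turns), session.tokens())
```

The loop keeps the history under the budget no matter how many turns are added, at the cost of squashing older detail into a summary. The open question is how well a model preserves the important context when it writes that summary itself.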

I had it draw me a couple of pelicans by typing "Generate an SVG of a pelican riding a bicycle" directly into the Codex CLI tool. Here's thinking level medium:

And here's thinking level "xhigh":

I also tried xhigh on my longer pelican test prompt, which came out like this:

Also today: GPT-5.1 Pro is rolling out to all Pro users. According to the ChatGPT release notes:

GPT-5.1 Pro is rolling out today for all ChatGPT Pro users and is available in the model picker. GPT-5 Pro will remain available as a legacy model for 90 days before being retired.

That's a pretty fast deprecation cycle for the GPT-5 Pro model that was released just three months ago.

Via Hacker News

Tags: november-2025-inflection, llm-release, gpt-5, codex-cli, generative-ai, openai, ai, llms, pelican-riding-a-bicycle, evals, gpt-codex

Trying out Gemini 3 Pro with audio transcription and a new pelican benchmark

2025-11-18T19:00:48+00:00

Google released Gemini 3 Pro today. Here's the announcement from Sundar Pichai, Demis Hassabis, and Koray Kavukcuoglu, their developer blog announcement from Logan Kilpatrick, the Gemini 3 Pro Model Card, and their collection of 11 more articles. It's a big release!

I had a few days of preview access to this model via AI Studio. The best way to describe it is that it's Gemini 2.5 upgraded to match the leading rival models.

Gemini 3 has the same underlying characteristics as Gemini 2.5. The knowledge cutoff is the same (January 2025). It accepts 1 million input tokens, can output up to 64,000 tokens, and has multimodal inputs across text, images, audio, and video.

Benchmarks

Google's own reported numbers (in the model card) show it scoring slightly higher than Claude Sonnet 4.5 and GPT-5.1 on most of the standard benchmarks. As always I'm waiting for independent confirmation, but I have no reason to believe those numbers are inaccurate.

Pricing

In terms of pricing it's a little more expensive than Gemini 2.5 but still cheaper than Claude Sonnet 4.5. Here's how it fits in with those other leading models:

Model	Input (per 1M tokens)	Output (per 1M tokens)
GPT-5.1	$1.25	$10.00
Gemini 2.5 Pro	≤ 200k tokens: $1.25; > 200k tokens: $2.50	≤ 200k tokens: $10.00; > 200k tokens: $15.00
Gemini 3 Pro	≤ 200k tokens: $2.00; > 200k tokens: $4.00	≤ 200k tokens: $12.00; > 200k tokens: $18.00
Claude Sonnet 4.5	≤ 200k tokens: $3.00; > 200k tokens: $6.00	≤ 200k tokens: $15.00; > 200k tokens: $22.50
Claude Opus 4.1	$15.00	$75.00
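
Here is a minimal sketch of how that table turns into a per-request cost. The dictionary keys are my own labels rather than API model IDs, and the code assumes the higher tier applies to the whole request once the input goes over 200k tokens.

```python
# (input $/1M, output $/1M) at or below 200k input tokens, then above it.
# Figures copied from the table above; keys are labels, not API model IDs.
PRICES = {
    "gpt-5.1": ((1.25, 10.00), (1.25, 10.00)),
    "gemini-2.5-pro": ((1.25, 10.00), (2.50, 15.00)),
    "gemini-3-pro": ((2.00, 12.00), (4.00, 18.00)),
    "claude-sonnet-4.5": ((3.00, 15.00), (6.00, 22.50)),
    "claude-opus-4.1": ((15.00, 75.00), (15.00, 75.00)),
}
THRESHOLD = 200_000


def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated cost of one request in US dollars.

    Assumption: crossing the threshold reprices the entire request.
    """
    low, high = PRICES[model]
    in_price, out_price = high if input_tokens > THRESHOLD else low
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# A hypothetical request: 100,000 input tokens and 10,000 output tokens.
for model in PRICES:
    print(f"{model:<18} ${cost_usd(model, 100_000, 10_000):.3f}")
```

For that made-up request Gemini 3 Pro lands at $0.32, between GPT-5.1 at $0.225 and Claude Sonnet 4.5 at $0.45.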

Trying it out against a complex image

That screenshot of the benchmarks from above looked like a good test for Gemini 3's multimodal support. I fed it that image URL and asked it to generate alt text for the image:

llm -m gemini-3-pro-preview -a https://static.simonwillison.net/static/2025/gemini-3-benchmarks.jpg 'Alt text for this image, include all figures and make them comprehensible to a screen reader user'

Here's what I got back:

A comprehensive table comparing four AI models across various performance benchmarks. The columns listed are: Benchmark Name, Description, Gemini 3 Pro (highlighted in blue), Gemini 2.5 Pro, Claude Sonnet 4.5, and GPT-5.1.

Humanity's Last Exam (Academic reasoning)

No tools: Gemini 3 Pro 37.5%, Gemini 2.5 Pro 21.6%, Claude Sonnet 4.5 13.7%, GPT-5.1 26.5%.

With search and code execution: Gemini 3 Pro 45.8% (others have no data).

ARC-AGI-2 (Visual reasoning puzzles; ARC Prize Verified)

Gemini 3 Pro 31.1%, Gemini 2.5 Pro 4.9%, Claude Sonnet 4.5 13.6%, GPT-5.1 17.6%.

GPQA Diamond (Scientific knowledge; No tools)

Gemini 3 Pro 91.9%, Gemini 2.5 Pro 86.4%, Claude Sonnet 4.5 83.4%, GPT-5.1 88.1%.

AIME 2025 (Mathematics)

No tools: Gemini 3 Pro 95.0%, Gemini 2.5 Pro 88.0%, Claude Sonnet 4.5 87.0%, GPT-5.1 94.0%.

With code execution: Gemini 3 Pro 100%, Claude Sonnet 4.5 100%.

MathArena Apex (Challenging Math Contest problems)

Gemini 3 Pro 23.4%, Gemini 2.5 Pro 0.5%, Claude Sonnet 4.5 1.6%, GPT-5.1 1.0%.

MMMU-Pro (Multimodal understanding and reasoning)

Gemini 3 Pro 81.0%, Gemini 2.5 Pro 68.0%, Claude Sonnet 4.5 68.0%, GPT-5.1 76.0%.

ScreenSpot-Pro (Screen understanding)

Gemini 3 Pro 72.7%, Gemini 2.5 Pro 11.4%, Claude Sonnet 4.5 36.2%, GPT-5.1 3.5%.

CharXiv Reasoning (Information synthesis from complex charts)

Gemini 3 Pro 81.4%, Gemini 2.5 Pro 69.6%, Claude Sonnet 4.5 68.5%, GPT-5.1 69.5%.

OmniDocBench 1.5 (OCR; Overall Edit Distance, lower is better)

Gemini 3 Pro 0.115, Gemini 2.5 Pro 0.145, Claude Sonnet 4.5 0.145, GPT-5.1 0.147.

Video-MMMU (Knowledge acquisition from videos)

Gemini 3 Pro 87.6%, Gemini 2.5 Pro 83.6%, Claude Sonnet 4.5 77.8%, GPT-5.1 80.4%.

LiveCodeBench Pro (Competitive coding problems; Elo Rating, higher is better)

Gemini 3 Pro 2,439; Gemini 2.5 Pro 1,775; Claude Sonnet 4.5 1,418; GPT-5.1 2,243.

Terminal-Bench 2.0 (Agentic terminal coding; Terminus-2 agent)

Gemini 3 Pro 54.2%, Gemini 2.5 Pro 32.6%, Claude Sonnet 4.5 42.8%, GPT-5.1 47.6%.

SWE-Bench Verified (Agentic coding; Single attempt)

Gemini 3 Pro 76.2%, Gemini 2.5 Pro 59.6%, Claude Sonnet 4.5 77.2%, GPT-5.1 76.3%.

t2-bench (Agentic tool use)

Gemini 3 Pro 85.4%, Gemini 2.5 Pro 54.9%, Claude Sonnet 4.5 84.7%, GPT-5.1 80.2%.

Vending-Bench 2 (Long-horizon agentic tasks; Net worth (mean), higher is better)

Gemini 3 Pro $5,478.16; Gemini 2.5 Pro $573.64; Claude Sonnet 4.5 $3,838.74; GPT-5.1 $1,473.43.

FACTS Benchmark Suite (Held out internal grounding, parametric, MM, and search retrieval benchmarks)

Gemini 3 Pro 70.5%, Gemini 2.5 Pro 63.4%, Claude Sonnet 4.5 50.4%, GPT-5.1 50.8%.

SimpleQA Verified (Parametric knowledge)

Gemini 3 Pro 72.1%, Gemini 2.5 Pro 54.5%, Claude Sonnet 4.5 29.3%, GPT-5.1 34.9%.

MMMLU (Multilingual Q&A)

Gemini 3 Pro 91.8%, Gemini 2.5 Pro 89.5%, Claude Sonnet 4.5 89.1%, GPT-5.1 91.0%.

Global PIQA (Commonsense reasoning across 100 Languages and Cultures)

Gemini 3 Pro 93.4%, Gemini 2.5 Pro 91.5%, Claude Sonnet 4.5 90.1%, GPT-5.1 90.9%.

MRCR v2 (8-needle) (Long context performance)

128k (average): Gemini 3 Pro 77.0%, Gemini 2.5 Pro 58.0%, Claude Sonnet 4.5 47.1%, GPT-5.1 61.6%.

1M (pointwise): Gemini 3 Pro 26.3%, Gemini 2.5 Pro 16.4%, Claude Sonnet 4.5 (not supported), GPT-5.1 (not supported).

I have not checked every line of this but a loose spot-check looks accurate to me.

That prompt took 1,105 input and 3,901 output tokens, at a cost of 5.6824 cents.

I ran this follow-up prompt:

llm -c 'Convert to JSON'

You can see the full output here, which starts like this:

{
  "metadata": {
    "columns": [
      "Benchmark",
      "Description",
      "Gemini 3 Pro",
      "Gemini 2.5 Pro",
      "Claude Sonnet 4.5",
      "GPT-5.1"
    ]
  },
  "benchmarks": [
    {
      "name": "Humanity's Last Exam",
      "description": "Academic reasoning",
      "sub_results": [
        {
          "condition": "No tools",
          "gemini_3_pro": "37.5%",
          "gemini_2_5_pro": "21.6%",
          "claude_sonnet_4_5": "13.7%",
          "gpt_5_1": "26.5%"
        },
        {
          "condition": "With search and code execution",
          "gemini_3_pro": "45.8%",
          "gemini_2_5_pro": null,
          "claude_sonnet_4_5": null,
          "gpt_5_1": null
        }
      ]
    },
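
For anyone who wants to work with that JSON, here is a minimal Python sketch that flattens the nested shape into one row per model score. It assumes every benchmark uses the `sub_results` layout shown in the excerpt above, which may not hold for the full output (Elo ratings and dollar figures would need their own parsing). `SAMPLE`, `parse_percent` and `flatten` are illustrative names, not part of any tool used in this post.

```python
import json

# Illustrative excerpt mirroring the structure above, closed off so it parses.
SAMPLE = """
{
  "benchmarks": [
    {
      "name": "Humanity's Last Exam",
      "description": "Academic reasoning",
      "sub_results": [
        {"condition": "No tools", "gemini_3_pro": "37.5%",
         "gemini_2_5_pro": "21.6%", "claude_sonnet_4_5": "13.7%",
         "gpt_5_1": "26.5%"},
        {"condition": "With search and code execution",
         "gemini_3_pro": "45.8%", "gemini_2_5_pro": null,
         "claude_sonnet_4_5": null, "gpt_5_1": null}
      ]
    }
  ]
}
"""


def parse_percent(value):
    """Turn "37.5%" into 37.5; pass None through for missing scores."""
    if value is None:
        return None
    return float(value.rstrip("%"))


def flatten(data):
    """Yield one (benchmark, condition, model, score) tuple per cell."""
    for bench in data["benchmarks"]:
        for sub in bench.get("sub_results", []):
            for key, value in sub.items():
                if key == "condition":
                    continue
                yield bench["name"], sub["condition"], key, parse_percent(value)


rows = list(flatten(json.loads(SAMPLE)))
for row in rows:
    print(row)
```

JSON `null` values come through as `None`, so the three missing "With search and code execution" scores stay distinguishable from a score of zero.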

Analyzing a city council meeting

To try it out against an audio file I extracted the 3h33m of audio from the video Half Moon Bay City Council Meeting - November 4, 2025. I used yt-dlp to get that audio:

yt-dlp -x --audio-format m4a 'https://www.youtube.com/watch?v=qgJ7x7R6gy0'

That gave me a 74MB m4a file, which I ran through Gemini 3 Pro like this:

llm -m gemini-3-pro-preview -a /tmp/HMBCC\ 11⧸4⧸25\ -\ Half\ Moon\ Bay\ City\ Council\ Meeting\ -\ November\ 4,\ 2025\ \[qgJ7x7R6gy0\].m4a 'Output a Markdown transcript of this meeting. Include speaker names and timestamps. Start with an outline of the key meeting sections, each with a title and summary and timestamp and list of participating names. Note in bold if anyone raised their voices, interrupted each other or had disagreements. Then follow with the full transcript.'

That failed with an "Internal error encountered" message, so I shrank the file down to a more manageable 38MB using ffmpeg:

ffmpeg -i "/private/tmp/HMB.m4a" -ac 1 -ar 22050 -c:a aac -b:a 24k "/private/tmp/HMB_compressed.m4a"
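
As a rough sanity check on that 38MB figure, the size of a constant-bitrate audio file can be estimated from its duration and bitrate. This is a back-of-the-envelope sketch: it assumes `-b:a 24k` means 24,000 bits per second, treats the encoder as hitting that target exactly, and ignores container overhead. `estimated_size_mb` is an illustrative helper name.

```python
def estimated_size_mb(duration_seconds: int, bitrate_kbps: float) -> float:
    """Approximate audio size in megabytes (10**6 bytes), ignoring overhead."""
    bits = duration_seconds * bitrate_kbps * 1000
    return bits / 8 / 1_000_000


duration = 3 * 3600 + 33 * 60  # 3h33m of audio, in seconds
print(round(estimated_size_mb(duration, 24), 1))  # prints 38.3
```

That lands very close to the size ffmpeg actually produced.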

Then ran it again like this (for some reason I had to use --attachment-type this time):

llm -m gemini-3-pro-preview --attachment-type /tmp/HMB_compressed.m4a 'audio/aac' 'Output a Markdown transcript of this meeting. Include speaker names and timestamps. Start with an outline of the key meeting sections, each with a title and summary and timestamp and list of participating names. Note in bold if anyone raised their voices, interrupted each other or had disagreements. Then follow with the full transcript.'

This time it worked! The full output is here, but it starts like this:

Here is the transcript of the Half Moon Bay City Council meeting.

Meeting Outline

1. Call to Order, Updates, and Public Forum

Summary: Mayor Brownstone calls the meeting to order. City Manager Chidester reports no reportable actions from the closed session. Announcements are made regarding food insecurity volunteers and the Diwali celebration. During the public forum, Councilmember Penrose (speaking as a citizen) warns against autocracy. Citizens speak regarding lease agreements, downtown maintenance, local music events, and homelessness outreach statistics.

Timestamp: 00:00:00 - 00:13:25

Participants: Mayor Brownstone, Matthew Chidester, Irma Acosta, Deborah Penrose, Jennifer Moore, Sandy Vella, Joaquin Jimenez, Anita Rees.

2. Consent Calendar

Summary: The Council approves minutes from previous meetings and a resolution authorizing a licensing agreement for Seahorse Ranch. Councilmember Johnson corrects a pull request regarding abstentions on minutes.

Timestamp: 00:13:25 - 00:15:15

Participants: Mayor Brownstone, Councilmember Johnson, Councilmember Penrose, Vice Mayor Ruddick, Councilmember Nagengast.

3. Ordinance Introduction: Commercial Vitality (Item 9A)

Summary: Staff presents a new ordinance to address neglected and empty commercial storefronts, establishing maintenance and display standards. Councilmembers discuss enforcement mechanisms, window cleanliness standards, and the need for objective guidance documents to avoid subjective enforcement.

Timestamp: 00:15:15 - 00:30:45

Participants: Karen Decker, Councilmember Johnson, Councilmember Nagengast, Vice Mayor Ruddick, Councilmember Penrose.

4. Ordinance Introduction: Building Standards & Electrification (Item 9B)

Summary: Staff introduces updates to the 2025 Building Code. A major change involves repealing the city's all-electric building requirement due to the 9th Circuit Court ruling (California Restaurant Association v. City of Berkeley). Public speaker Mike Ferreira expresses strong frustration and disagreement with "unelected state agencies" forcing the City to change its ordinances.

Timestamp: 00:30:45 - 00:45:00

Participants: Ben Corrales, Keith Weiner, Joaquin Jimenez, Jeremy Levine, Mike Ferreira, Councilmember Penrose, Vice Mayor Ruddick.

5. Housing Element Update & Adoption (Item 9C)

Summary: Staff presents the 5th draft of the Housing Element, noting State HCD requirements to modify ADU allocations and place a measure on the ballot regarding the "Measure D" growth cap. There is significant disagreement from Councilmembers Ruddick and Penrose regarding the State's requirement to hold a ballot measure. Public speakers debate the enforceability of Measure D. Mike Ferreira interrupts the vibe to voice strong distaste for HCD's interference in local law. The Council votes to adopt the element but strikes the language committing to a ballot measure.

Timestamp: 00:45:00 - 01:05:00

Participants: Leslie (Staff), Joaquin Jimenez, Jeremy Levine, Mike Ferreira, Councilmember Penrose, Vice Mayor Ruddick, Councilmember Johnson.

Transcript

Mayor Brownstone [00:00:00] Good evening everybody and welcome to the November 4th Half Moon Bay City Council meeting. As a reminder, we have Spanish interpretation services available in person and on Zoom.

Victor Hernandez (Interpreter) [00:00:35] Thank you, Mr. Mayor, City Council, all city staff, members of the public. [Spanish instructions provided regarding accessing the interpretation channel on Zoom and in the room.] Thank you very much.

Those first two lines of the transcript already illustrate something interesting here: Gemini 3 Pro chose NOT to include the exact text of the Spanish instructions, instead summarizing them as "[Spanish instructions provided regarding accessing the interpretation channel on Zoom and in the room.]".

I haven't spot-checked the entire 3h33m meeting, but I've confirmed that the timestamps do not line up. The transcript closes like this:

Mayor Brownstone [01:04:00] Meeting adjourned. Have a good evening.

That actually happens at 3h31m5s and the mayor says:

Okay. Well, thanks everybody, members of the public for participating. Thank you for staff. Thank you to fellow council members. This meeting is now adjourned. Have a good evening.

I'm disappointed about the timestamps, since mismatches there make it much harder to jump to the right point and confirm that the summarized transcript is an accurate representation of what was said.
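
A mismatch like this is easy to detect mechanically. The following Python sketch pulls the `[HH:MM:SS]` markers out of a transcript and compares the last one against the known position in the recording. The regular expression assumes the bracketed format seen in the output above, and the function names are illustrative.

```python
import re

# Matches timestamps in the [HH:MM:SS] form used by the transcript above.
TIMESTAMP = re.compile(r"\[(\d{2}):(\d{2}):(\d{2})\]")


def to_seconds(hours: int, minutes: int, seconds: int) -> int:
    return hours * 3600 + minutes * 60 + seconds


def transcript_timestamps(text: str) -> list[int]:
    """Return every bracketed timestamp in the text, in seconds."""
    return [to_seconds(*map(int, m.groups())) for m in TIMESTAMP.finditer(text)]


transcript = """Mayor Brownstone [00:00:00] Good evening everybody...
Victor Hernandez (Interpreter) [00:00:35] Thank you, Mr. Mayor...
Mayor Brownstone [01:04:00] Meeting adjourned. Have a good evening."""

stamps = transcript_timestamps(transcript)
claimed_end = stamps[-1]          # 01:04:00 according to the model
actual_end = to_seconds(3, 31, 5)  # 3h31m5s in the real recording
drift = actual_end - claimed_end

print(f"Claimed {claimed_end}s, actual {actual_end}s, drift {drift}s")
```

The drift works out to 8,825 seconds, so the model's final timestamp is off by nearly two and a half hours.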

This took 320,087 input tokens and 7,870 output tokens, for a total cost of $1.42.
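
That figure can be reproduced with a few lines of Python. The per-million-token rates below are an assumption on my part rather than something stated in this section: $2 input and $12 output for prompts up to 200,000 tokens, rising to $4 and $18 above that. The sketch also makes no attempt to account for thinking tokens separately, so it may not reproduce every cost quoted in this post.

```python
def gemini_3_pro_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate cost in USD using assumed tiered per-million-token rates."""
    long_context = input_tokens > 200_000  # assumed threshold for the higher tier
    input_rate = 4.00 if long_context else 2.00     # USD per million input tokens
    output_rate = 18.00 if long_context else 12.00  # USD per million output tokens
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


cost = gemini_3_pro_cost_usd(320_087, 7_870)
print(f"${cost:.2f}")  # prints $1.42
```

Almost all of the cost comes from the 320,087 input tokens, which are billed at the higher tier because the audio pushed the prompt past 200,000 tokens.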

And a new pelican benchmark

Gemini 3 Pro has a new concept of a "thinking level" which can be set to low or high (and defaults to high). I tried my classic Generate an SVG of a pelican riding a bicycle prompt at both levels.

Here's low - Gemini decided to add a jaunty little hat (and labeled it with a comment in the SVG):

And here's high. This is genuinely an excellent pelican, and the bicycle frame is at least the correct shape:

Honestly though, my pelican benchmark is beginning to feel a little bit too basic. I decided to upgrade it. Here's v2 of the benchmark, which I plan to use going forward:

Generate an SVG of a California brown pelican riding a bicycle. The bicycle must have spokes and a correctly shaped bicycle frame. The pelican must have its characteristic large pouch, and there should be a clear indication of feathers. The pelican must be clearly pedaling the bicycle. The image should show the full breeding plumage of the California brown pelican.

For reference, here's a photo I took of a California brown pelican recently (sadly without a bicycle):

Here's Gemini 3 Pro's attempt at high thinking level for that new prompt:

And for good measure, here's that same prompt against GPT-5.1 - which produced this dumpy little fellow:

And Claude Sonnet 4.5, which didn't do quite as well:

None of the models seem to have caught on to the crucial detail that the California brown pelican is not, in fact, brown.

Tags: google, ai, generative-ai, llms, llm, gemini, llm-pricing, pelican-riding-a-bicycle, llm-reasoning, llm-release