<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: llm</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/llm.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2026-04-16T20:37:12+00:00</updated><author><name>Simon Willison</name></author><entry><title>llm-anthropic 0.25</title><link href="https://simonwillison.net/2026/Apr/16/llm-anthropic/#atom-tag" rel="alternate"/><published>2026-04-16T20:37:12+00:00</published><updated>2026-04-16T20:37:12+00:00</updated><id>https://simonwillison.net/2026/Apr/16/llm-anthropic/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;Release:&lt;/strong&gt; &lt;a href="https://github.com/simonw/llm-anthropic/releases/tag/0.25"&gt;llm-anthropic 0.25&lt;/a&gt;&lt;/p&gt;
    &lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;New model: &lt;code&gt;claude-opus-4.7&lt;/code&gt;, which supports &lt;code&gt;thinking_effort&lt;/code&gt;: &lt;code&gt;xhigh&lt;/code&gt;. #66&lt;/li&gt;
&lt;li&gt;New &lt;code&gt;thinking_display&lt;/code&gt; and &lt;code&gt;thinking_adaptive&lt;/code&gt; boolean options. The summarized thinking output from &lt;code&gt;thinking_display&lt;/code&gt; is currently only available in JSON output or JSON logs.&lt;/li&gt;
&lt;li&gt;Increased default &lt;code&gt;max_tokens&lt;/code&gt; to the maximum allowed for each model.&lt;/li&gt;
&lt;li&gt;No longer uses obsolete &lt;code&gt;structured-outputs-2025-11-13&lt;/code&gt; beta header for older models.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
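&lt;p&gt;For example, assuming the plugin and an Anthropic API key are already configured, the new effort level can be selected using LLM's standard &lt;code&gt;-o&lt;/code&gt; option syntax:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m claude-opus-4.7 -o thinking_effort xhigh 'Summarize this design document'
&lt;/code&gt;&lt;/pre&gt;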
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="llm"/><category term="anthropic"/><category term="claude"/></entry><entry><title>research-llm-apis 2026-04-04</title><link href="https://simonwillison.net/2026/Apr/5/research-llm-apis/#atom-tag" rel="alternate"/><published>2026-04-05T00:32:11+00:00</published><updated>2026-04-05T00:32:11+00:00</updated><id>https://simonwillison.net/2026/Apr/5/research-llm-apis/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;Release:&lt;/strong&gt; &lt;a href="https://github.com/simonw/research-llm-apis/releases/tag/2026-04-04"&gt;research-llm-apis 2026-04-04&lt;/a&gt;&lt;/p&gt;
    &lt;p&gt;I'm working on a &lt;a href="https://github.com/simonw/llm/issues/1314"&gt;major change&lt;/a&gt; to my LLM Python library and CLI tool. LLM provides an abstraction layer over hundreds of different LLMs from dozens of different vendors thanks to its plugin system, and some of those vendors have grown new features over the past year which LLM's abstraction layer can't handle, such as server-side tool execution.&lt;/p&gt;
&lt;p&gt;To help design that new abstraction layer I had Claude Code read through the Python client libraries for Anthropic, OpenAI, Gemini and Mistral and use those to help craft &lt;code&gt;curl&lt;/code&gt; commands to access the raw JSON for both streaming and non-streaming modes across a range of different scenarios. Both the scripts and the captured outputs now live in this new repo.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/apis"&gt;apis&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/json"&gt;json&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="llm"/><category term="apis"/><category term="json"/><category term="llms"/></entry><entry><title>Gemma 4: Byte for byte, the most capable open models</title><link href="https://simonwillison.net/2026/Apr/2/gemma-4/#atom-tag" rel="alternate"/><published>2026-04-02T18:28:54+00:00</published><updated>2026-04-02T18:28:54+00:00</updated><id>https://simonwillison.net/2026/Apr/2/gemma-4/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/"&gt;Gemma 4: Byte for byte, the most capable open models&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Four new vision-capable Apache 2.0 licensed reasoning LLMs from Google DeepMind, sized at 2B, 4B, and 31B, plus a 26B-A4B Mixture-of-Experts.&lt;/p&gt;
&lt;p&gt;Google emphasize an "unprecedented level of intelligence-per-parameter", providing yet more evidence that creating small useful models is one of the hottest areas of research right now.&lt;/p&gt;
&lt;p&gt;They actually label the two smaller models as E2B and E4B for "Effective" parameter size. The system card explains:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The smaller models incorporate Per-Layer Embeddings (PLE) to maximize parameter efficiency in on-device deployments. Rather than adding more layers or parameters to the model, PLE gives each decoder layer its own small embedding for every token. These embedding tables are large but are only used for quick lookups, which is why the effective parameter count is much smaller than the total.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I don't entirely understand that, but apparently that's what the "E" in E2B means!&lt;/p&gt;
&lt;p&gt;One particularly exciting feature of these models is that they are multi-modal beyond just images:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Vision and audio&lt;/strong&gt;: All models natively process video and images, supporting variable resolutions, and excelling at visual tasks like OCR and chart understanding. Additionally, the E2B and E4B models feature native audio input for speech recognition and understanding.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I've not figured out a way to run audio input locally - I don't think that feature is in LM Studio or Ollama yet.&lt;/p&gt;
&lt;p&gt;I tried them out using the GGUFs for &lt;a href="https://lmstudio.ai/models/gemma-4"&gt;LM Studio&lt;/a&gt;. The 2B (4.41GB), 4B (6.33GB) and 26B-A4B (17.99GB) models all worked perfectly, but the 31B (19.89GB) model was broken and spat out &lt;code&gt;"---\n"&lt;/code&gt; in a loop for every prompt I tried.&lt;/p&gt;
&lt;p&gt;The progression in &lt;a href="https://gist.github.com/simonw/12ae4711288637a722fd6bd4b4b56bdb"&gt;pelican quality&lt;/a&gt; from 2B to 4B to 26B-A4B is notable:&lt;/p&gt;
&lt;p&gt;E2B:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Two blue circles on a brown rectangle and a weird mess of orange blob and yellow triangle for the pelican" src="https://static.simonwillison.net/static/2026/gemma-4-2b-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;E4B:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Two black wheels joined by a sort of grey surfboard, the pelican is semicircles and a blue blob floating above it" src="https://static.simonwillison.net/static/2026/gemma-4-4b-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;26B-A4B:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Bicycle has the right pieces although the frame is wonky. Pelican is genuinely good, has a big triangle beak and a nice curved neck and is clearly a bird that is sitting on the bicycle" src="https://static.simonwillison.net/static/2026/gemma-4-26b-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;(This one actually had an SVG error - "error on line 18 at column 88: Attribute x1 redefined" - but after &lt;a href="https://gist.github.com/simonw/12ae4711288637a722fd6bd4b4b56bdb?permalink_comment_id=6074105#gistcomment-6074105"&gt;fixing that&lt;/a&gt; I got probably the best pelican I've seen yet from a model that runs on my laptop.)&lt;/p&gt;
&lt;p&gt;Google are providing API access to the two larger Gemma models via their &lt;a href="https://aistudio.google.com/prompts/new_chat?model=gemma-4-31b-it"&gt;AI Studio&lt;/a&gt;. I added support to &lt;a href="https://github.com/simonw/llm-gemini"&gt;llm-gemini&lt;/a&gt; and then &lt;a href="https://gist.github.com/simonw/f9f9e9c34c7cc0ef5325a2876413e51e"&gt;ran a pelican&lt;/a&gt; through the 31B model using that:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m gemini/gemma-4-31b-it 'Generate an SVG of a pelican riding a bicycle'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Pretty good, though it is missing the front part of the bicycle frame:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Motion blur lines, a mostly great bicycle albeit missing the front part of the frame. Pelican is decent. " src="https://static.simonwillison.net/static/2026/gemma-4-31b-pelican.png" /&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemma"&gt;gemma&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;&lt;/p&gt;



</summary><category term="vision-llms"/><category term="llm"/><category term="llm-reasoning"/><category term="ai"/><category term="local-llms"/><category term="llms"/><category term="gemma"/><category term="llm-release"/><category term="google"/><category term="generative-ai"/><category term="lm-studio"/><category term="pelican-riding-a-bicycle"/></entry><entry><title>llm-gemini 0.30</title><link href="https://simonwillison.net/2026/Apr/2/llm-gemini/#atom-tag" rel="alternate"/><published>2026-04-02T18:25:08+00:00</published><updated>2026-04-02T18:25:08+00:00</updated><id>https://simonwillison.net/2026/Apr/2/llm-gemini/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;Release:&lt;/strong&gt; &lt;a href="https://github.com/simonw/llm-gemini/releases/tag/0.30"&gt;llm-gemini 0.30&lt;/a&gt;&lt;/p&gt;
    &lt;p&gt;New models &lt;code&gt;gemini-3.1-flash-lite-preview&lt;/code&gt;, &lt;code&gt;gemma-4-26b-a4b-it&lt;/code&gt; and &lt;code&gt;gemma-4-31b-it&lt;/code&gt;. See &lt;a href="https://simonwillison.net/2026/Apr/2/gemma-4/"&gt;my notes on Gemma 4&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemma"&gt;gemma&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="gemini"/><category term="llm"/><category term="gemma"/></entry><entry><title>datasette-llm 0.1a6</title><link href="https://simonwillison.net/2026/Apr/1/datasette-llm-2/#atom-tag" rel="alternate"/><published>2026-04-01T23:01:37+00:00</published><updated>2026-04-01T23:01:37+00:00</updated><id>https://simonwillison.net/2026/Apr/1/datasette-llm-2/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;Release:&lt;/strong&gt; &lt;a href="https://github.com/datasette/datasette-llm/releases/tag/0.1a6"&gt;datasette-llm 0.1a6&lt;/a&gt;&lt;/p&gt;
    &lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;The same model ID no longer needs to be repeated in both the default model and allowed models lists - setting it as a default model automatically adds it to the allowed models list. &lt;a href="https://github.com/datasette/datasette-llm/issues/6"&gt;#6&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Improved documentation for &lt;a href="https://github.com/datasette/datasette-llm/blob/0.1a6/README.md#usage"&gt;Python API usage&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="llm"/><category term="datasette"/></entry><entry><title>datasette-enrichments-llm 0.2a1</title><link href="https://simonwillison.net/2026/Apr/1/datasette-enrichments-llm-2/#atom-tag" rel="alternate"/><published>2026-04-01T22:00:34+00:00</published><updated>2026-04-01T22:00:34+00:00</updated><id>https://simonwillison.net/2026/Apr/1/datasette-enrichments-llm-2/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;Release:&lt;/strong&gt; &lt;a href="https://github.com/datasette/datasette-enrichments-llm/releases/tag/0.2a1"&gt;datasette-enrichments-llm 0.2a1&lt;/a&gt;&lt;/p&gt;
    &lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;actor&lt;/code&gt; who triggers an enrichment is now passed to the &lt;code&gt;llm.model(... actor=actor)&lt;/code&gt; method. &lt;a href="https://github.com/datasette/datasette-enrichments-llm/issues/3"&gt;#3&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/enrichments"&gt;enrichments&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="enrichments"/><category term="llm"/><category term="datasette"/></entry><entry><title>datasette-extract 0.3a0</title><link href="https://simonwillison.net/2026/Apr/1/datasette-extract/#atom-tag" rel="alternate"/><published>2026-04-01T03:32:16+00:00</published><updated>2026-04-01T03:32:16+00:00</updated><id>https://simonwillison.net/2026/Apr/1/datasette-extract/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;Release:&lt;/strong&gt; &lt;a href="https://github.com/datasette/datasette-extract/releases/tag/0.3a0"&gt;datasette-extract 0.3a0&lt;/a&gt;&lt;/p&gt;
    &lt;ul&gt;
&lt;li&gt;Now uses &lt;a href="https://github.com/datasette/datasette-llm"&gt;datasette-llm&lt;/a&gt; to manage model configuration, which means you can control which models are available for extraction tasks using the &lt;code&gt;extract&lt;/code&gt; purpose and &lt;a href="https://github.com/datasette/datasette-llm/blob/main/README.md#configuration"&gt;LLM model configuration&lt;/a&gt;. &lt;a href="https://github.com/datasette/datasette-extract/issues/38"&gt;#38&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="llm"/><category term="datasette"/></entry><entry><title>datasette-enrichments-llm 0.2a0</title><link href="https://simonwillison.net/2026/Apr/1/datasette-enrichments-llm/#atom-tag" rel="alternate"/><published>2026-04-01T03:28:44+00:00</published><updated>2026-04-01T03:28:44+00:00</updated><id>https://simonwillison.net/2026/Apr/1/datasette-enrichments-llm/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;Release:&lt;/strong&gt; &lt;a href="https://github.com/datasette/datasette-enrichments-llm/releases/tag/0.2a0"&gt;datasette-enrichments-llm 0.2a0&lt;/a&gt;&lt;/p&gt;
    &lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;This plugin now uses &lt;a href="https://github.com/datasette/datasette-llm"&gt;datasette-llm&lt;/a&gt; to configure and manage models. This means it's possible to &lt;a href="https://github.com/datasette/datasette-enrichments-llm/blob/0.2a0/README.md#configuration"&gt;specify which models&lt;/a&gt; should be made available for enrichments, using the new &lt;code&gt;enrichments&lt;/code&gt; purpose.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="llm"/><category term="datasette"/></entry><entry><title>datasette-llm-usage 0.2a0</title><link href="https://simonwillison.net/2026/Apr/1/datasette-llm-usage/#atom-tag" rel="alternate"/><published>2026-04-01T03:24:03+00:00</published><updated>2026-04-01T03:24:03+00:00</updated><id>https://simonwillison.net/2026/Apr/1/datasette-llm-usage/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;Release:&lt;/strong&gt; &lt;a href="https://github.com/datasette/datasette-llm-usage/releases/tag/0.2a0"&gt;datasette-llm-usage 0.2a0&lt;/a&gt;&lt;/p&gt;
    &lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Removed features relating to allowances and estimated pricing. These are now the domain of &lt;a href="https://github.com/datasette/datasette-llm-accountant"&gt;datasette-llm-accountant&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Now depends on &lt;a href="https://github.com/datasette/datasette-llm"&gt;datasette-llm&lt;/a&gt; for model configuration. &lt;a href="https://github.com/datasette/datasette-llm-usage/pull/3"&gt;#3&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Full prompts and responses and tool calls can now be logged to the &lt;code&gt;llm_usage_prompt_log&lt;/code&gt; table in the internal database if you set the new &lt;code&gt;datasette-llm-usage.log_prompts&lt;/code&gt; plugin configuration setting.&lt;/li&gt;
&lt;li&gt;Redesigned the &lt;code&gt;/-/llm-usage-simple-prompt&lt;/code&gt; page, which now requires the &lt;code&gt;llm-usage-simple-prompt&lt;/code&gt; permission.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
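&lt;p&gt;Assuming standard Datasette plugin configuration, enabling that prompt logging should look something like this in &lt;code&gt;datasette.yaml&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;plugins:
  datasette-llm-usage:
    log_prompts: true
&lt;/code&gt;&lt;/pre&gt;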
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="llm"/><category term="datasette"/></entry><entry><title>datasette-llm 0.1a5</title><link href="https://simonwillison.net/2026/Apr/1/datasette-llm/#atom-tag" rel="alternate"/><published>2026-04-01T03:11:01+00:00</published><updated>2026-04-01T03:11:01+00:00</updated><id>https://simonwillison.net/2026/Apr/1/datasette-llm/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;Release:&lt;/strong&gt; &lt;a href="https://github.com/datasette/datasette-llm/releases/tag/0.1a5"&gt;datasette-llm 0.1a5&lt;/a&gt;&lt;/p&gt;
    &lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;llm_prompt_context()&lt;/code&gt; plugin hook wrapper mechanism now tracks prompts executed within a chain as well as one-off prompts, which means it can be used to track tool call loops. &lt;a href="https://github.com/datasette/datasette-llm"&gt;#5&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="llm"/><category term="datasette"/></entry><entry><title>datasette-llm 0.1a4</title><link href="https://simonwillison.net/2026/Mar/31/datasette-llm/#atom-tag" rel="alternate"/><published>2026-03-31T21:17:23+00:00</published><updated>2026-03-31T21:17:23+00:00</updated><id>https://simonwillison.net/2026/Mar/31/datasette-llm/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;Release:&lt;/strong&gt; &lt;a href="https://github.com/datasette/datasette-llm/releases/tag/0.1a4"&gt;datasette-llm 0.1a4&lt;/a&gt;&lt;/p&gt;
    &lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Ability to &lt;a href="https://github.com/datasette/datasette-llm/blob/0.1a4/README.md#model-references-with-custom-api-keys"&gt;configure different API keys for models based on their purpose&lt;/a&gt; - for example, set it up so enrichments always use &lt;code&gt;gpt-5.4-mini&lt;/code&gt; with an API key dedicated to that purpose. &lt;a href="https://github.com/datasette/datasette-llm/pull/4"&gt;#4&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;I released &lt;a href="https://github.com/simonw/llm-echo/releases/tag/0.3"&gt;llm-echo 0.3&lt;/a&gt; to provide an API key testing utility I needed for the tests for this new feature.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="llm"/><category term="datasette"/></entry><entry><title>llm-all-models-async 0.1</title><link href="https://simonwillison.net/2026/Mar/31/llm-all-models-async/#atom-tag" rel="alternate"/><published>2026-03-31T20:52:02+00:00</published><updated>2026-03-31T20:52:02+00:00</updated><id>https://simonwillison.net/2026/Mar/31/llm-all-models-async/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;Release:&lt;/strong&gt; &lt;a href="https://github.com/simonw/llm-all-models-async/releases/tag/0.1"&gt;llm-all-models-async 0.1&lt;/a&gt;&lt;/p&gt;
    &lt;p&gt;LLM plugins can define new models in both &lt;a href="https://llm.datasette.io/en/stable/plugins/tutorial-model-plugin.html"&gt;sync&lt;/a&gt; and &lt;a href="https://llm.datasette.io/en/stable/plugins/advanced-model-plugins.html#async-models"&gt;async&lt;/a&gt; varieties. The async variants are most common for API-backed models - sync variants tend to be things that run the model directly within the plugin.&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://simonwillison.net/2026/Mar/30/mr-chatterbox/#running-it-locally-with-llm"&gt;llm-mrchatterbox&lt;/a&gt; plugin is sync only. I wanted to try it out with various Datasette LLM features (specifically &lt;a href="https://github.com/datasette/datasette-enrichments-llm"&gt;datasette-enrichments-llm&lt;/a&gt;) but Datasette can only use async models.&lt;/p&gt;
&lt;p&gt;So... I had Claude spin up this plugin that turns sync models into async models using a thread pool. This ended up needing an extra plugin hook mechanism in LLM itself, which I shipped just now in &lt;a href="https://llm.datasette.io/en/stable/changelog.html#v0-30"&gt;LLM 0.30&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/async"&gt;async&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="llm"/><category term="async"/><category term="python"/></entry><entry><title>llm 0.30</title><link href="https://simonwillison.net/2026/Mar/31/llm/#atom-tag" rel="alternate"/><published>2026-03-31T20:35:51+00:00</published><updated>2026-03-31T20:35:51+00:00</updated><id>https://simonwillison.net/2026/Mar/31/llm/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;Release:&lt;/strong&gt; &lt;a href="https://github.com/simonw/llm/releases/tag/0.30"&gt;llm 0.30&lt;/a&gt;&lt;/p&gt;
    &lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;a href="http://llm.datasette.io/en/stable/plugins/plugin-hooks.html#plugin-hooks-register-models"&gt;register_models() plugin hook&lt;/a&gt; now takes an optional &lt;code&gt;model_aliases&lt;/code&gt; parameter listing all of the models, async models and aliases that have been registered so far by other plugins. A plugin with &lt;code&gt;@hookimpl(trylast=True)&lt;/code&gt; can use this to take previously registered models into account. &lt;a href="https://github.com/simonw/llm/issues/1389"&gt;#1389&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Added docstrings to public classes and methods and included those directly in the documentation.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
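&lt;p&gt;A minimal sketch of a plugin taking advantage of the new parameter - the &lt;code&gt;MyModel&lt;/code&gt; class and the membership check are hypothetical illustrations, not part of the release:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import llm

@llm.hookimpl(trylast=True)
def register_models(register, model_aliases):
    # model_aliases describes the models, async models and aliases
    # registered so far by other plugins
    if "my-model" not in model_aliases:
        register(MyModel("my-model"))
&lt;/code&gt;&lt;/pre&gt;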
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="llm"/></entry><entry><title>llm-echo 0.4</title><link href="https://simonwillison.net/2026/Mar/31/llm-echo/#atom-tag" rel="alternate"/><published>2026-03-31T16:48:59+00:00</published><updated>2026-03-31T16:48:59+00:00</updated><id>https://simonwillison.net/2026/Mar/31/llm-echo/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;Release:&lt;/strong&gt; &lt;a href="https://github.com/simonw/llm-echo/releases/tag/0.4"&gt;llm-echo 0.4&lt;/a&gt;&lt;/p&gt;
    &lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Prompts now have the &lt;code&gt;input_tokens&lt;/code&gt; and &lt;code&gt;output_tokens&lt;/code&gt; fields populated on the response.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="llm"/></entry><entry><title>llm-echo 0.3</title><link href="https://simonwillison.net/2026/Mar/31/llm-echo-2/#atom-tag" rel="alternate"/><published>2026-03-31T15:43:26+00:00</published><updated>2026-03-31T15:43:26+00:00</updated><id>https://simonwillison.net/2026/Mar/31/llm-echo-2/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;Release:&lt;/strong&gt; &lt;a href="https://github.com/simonw/llm-echo/releases/tag/0.3"&gt;llm-echo 0.3&lt;/a&gt;&lt;/p&gt;
    &lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Mechanisms for &lt;a href="https://github.com/simonw/llm-echo/blob/0.3/README.md#tool-calling"&gt;testing tool calls&lt;/a&gt;. #3&lt;/li&gt;
&lt;li&gt;Mechanism for &lt;a href="https://github.com/simonw/llm-echo/blob/0.3/README.md#raw-responses"&gt;testing raw responses&lt;/a&gt;. #4&lt;/li&gt;
&lt;li&gt;New &lt;code&gt;echo-needs-key&lt;/code&gt; model for &lt;a href="https://github.com/simonw/llm-echo/blob/0.3/README.md#echo-needs-key-model"&gt;testing model key logic&lt;/a&gt;. #7&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="llm"/></entry><entry><title>datasette-llm 0.1a3</title><link href="https://simonwillison.net/2026/Mar/30/datasette-llm/#atom-tag" rel="alternate"/><published>2026-03-30T19:48:43+00:00</published><updated>2026-03-30T19:48:43+00:00</updated><id>https://simonwillison.net/2026/Mar/30/datasette-llm/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;Release:&lt;/strong&gt; &lt;a href="https://github.com/datasette/datasette-llm/releases/tag/0.1a3"&gt;datasette-llm 0.1a3&lt;/a&gt;&lt;/p&gt;
    &lt;p&gt;Adds the ability to configure &lt;a href="https://github.com/datasette/datasette-llm/tree/0.1a3#purpose-specific-configuration"&gt;which LLMs are available for which purpose&lt;/a&gt;, which means you can restrict the list of models that can be used with a specific plugin. &lt;a href="https://github.com/datasette/datasette-llm/issues/3"&gt;#3&lt;/a&gt;&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="llm"/><category term="datasette"/></entry><entry><title>Mr. Chatterbox is a (weak) Victorian-era ethically trained model you can run on your own computer</title><link href="https://simonwillison.net/2026/Mar/30/mr-chatterbox/#atom-tag" rel="alternate"/><published>2026-03-30T14:28:34+00:00</published><updated>2026-03-30T14:28:34+00:00</updated><id>https://simonwillison.net/2026/Mar/30/mr-chatterbox/#atom-tag</id><summary type="html">
    &lt;p&gt;Trip Venturella released &lt;a href="https://www.estragon.news/mr-chatterbox-or-the-modern-prometheus/"&gt;Mr. Chatterbox&lt;/a&gt;, a language model trained entirely on out-of-copyright text from the British Library. Here's how he describes it in &lt;a href="https://huggingface.co/tventurella/mr_chatterbox_model"&gt;the model card&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Mr. Chatterbox is a language model trained entirely from scratch on a corpus of over 28,000 Victorian-era British texts published between 1837 and 1899, drawn from a dataset made available &lt;a href="https://huggingface.co/datasets/TheBritishLibrary/blbooks"&gt;by the British Library&lt;/a&gt;. The model has absolutely no training inputs from after 1899 — the vocabulary and ideas are formed exclusively from nineteenth-century literature.&lt;/p&gt;
&lt;p&gt;Mr. Chatterbox's training corpus was 28,035 books, with an estimated 2.93 billion input tokens after filtering. The model has roughly 340 million parameters, roughly the same size as GPT-2-Medium. The difference is, of course, that unlike GPT-2, Mr. Chatterbox is trained entirely on historical data.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Given how hard it is to train a useful LLM without using vast amounts of scraped, unlicensed data, I've been dreaming of a model like this for a couple of years now. What would a model trained on out-of-copyright text be like to chat with?&lt;/p&gt;
&lt;p&gt;Thanks to Trip we can now find out for ourselves!&lt;/p&gt;
&lt;p&gt;The model itself is tiny, at least by Large Language Model standards - just &lt;a href="https://huggingface.co/tventurella/mr_chatterbox_model/tree/main"&gt;2.05GB&lt;/a&gt; on disk. You can try it out using Trip's &lt;a href="https://huggingface.co/spaces/tventurella/mr_chatterbox"&gt;HuggingFace Spaces demo&lt;/a&gt;:&lt;/p&gt;
&lt;p style="text-align: center"&gt;&lt;img src="https://static.simonwillison.net/static/2026/chatterbox.jpg" alt="Screenshot of a Victorian-themed chatbot interface titled &amp;quot;🎩 Mr. Chatterbox (Beta)&amp;quot; with subtitle &amp;quot;The Victorian Gentleman Chatbot&amp;quot;. The conversation shows a user asking &amp;quot;How should I behave at dinner?&amp;quot; with the bot replying &amp;quot;My good fellow, one might presume that such trivialities could not engage your attention during an evening's discourse!&amp;quot; The user then asks &amp;quot;What are good topics?&amp;quot; and the bot responds &amp;quot;The most pressing subjects of our society— Indeed, a gentleman must endeavor to engage the conversation with grace and vivacity. Such pursuits serve as vital antidotes against ennui when engaged in agreeable company.&amp;quot; A text input field at the bottom reads &amp;quot;Say hello...&amp;quot; with a send button. The interface uses a dark maroon and cream color scheme." style="max-width: 80%;" /&gt;&lt;/p&gt;
&lt;p&gt;Honestly, it's pretty terrible. Talking with it feels more like chatting with a Markov chain than an LLM - the responses may have a delightfully Victorian flavor to them but it's hard to get a response that usefully answers a question.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://arxiv.org/abs/2203.15556"&gt;2022 Chinchilla paper&lt;/a&gt; suggests a ratio of 20x the parameter count to training tokens. For a 340m model that would suggest around 7 billion tokens, more than twice the British Library corpus used here. The smallest Qwen 3.5 model is 600m parameters and that model family starts to get interesting at 2b - so my hunch is we would need 4x or more the training data to get something that starts to feel like a useful conversational partner.&lt;/p&gt;
&lt;p&gt;But what a fun project!&lt;/p&gt;
&lt;h4 id="running-it-locally-with-llm"&gt;Running it locally with LLM&lt;/h4&gt;
&lt;p&gt;I decided to see if I could run the model on my own machine using my &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; framework.&lt;/p&gt;
&lt;p&gt;I got Claude Code to do most of the work - &lt;a href="https://gisthost.github.io/?7d0f00e152dd80d617b5e501e4ff025b/index.html"&gt;here's the transcript&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Trip trained the model using Andrej Karpathy's &lt;a href="https://github.com/karpathy/nanochat"&gt;nanochat&lt;/a&gt;, so I cloned that project, pulled the model weights and told Claude to build a Python script to run the model. Once we had that working (which ended up needing some extra details from the &lt;a href="https://huggingface.co/spaces/tventurella/mr_chatterbox/tree/main"&gt;Space demo source code&lt;/a&gt;) I had Claude &lt;a href="https://llm.datasette.io/en/stable/plugins/tutorial-model-plugin.html"&gt;read the LLM plugin tutorial&lt;/a&gt; and build the rest of the plugin.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/llm-mrchatterbox"&gt;llm-mrchatterbox&lt;/a&gt; is the result. Install the plugin like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-mrchatterbox
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first time you run a prompt it will fetch the 2.05GB model file from Hugging Face. Try that like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m mrchatterbox "Good day, sir"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or start an ongoing chat session like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm chat -m mrchatterbox
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you don't have LLM installed you can still get a chat session started from scratch using uvx like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uvx --with llm-mrchatterbox llm chat -m mrchatterbox
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When you are finished with the model you can delete the cached file using:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm mrchatterbox delete-model
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is the first time I've had Claude Code build a full LLM model plugin from scratch and it worked really well. I expect I'll be using this method again in the future.&lt;/p&gt;
&lt;p&gt;I continue to hope we can get a useful model from entirely public domain data. The fact that Trip was able to get this far using nanochat and 2.93 billion training tokens is a promising start.&lt;/p&gt;

&lt;p id="update-31st"&gt;&lt;strong&gt;Update 31st March 2026&lt;/strong&gt;: I had missed this when I first published this piece but Trip has his own &lt;a href="https://www.estragon.news/mr-chatterbox-or-the-modern-prometheus/"&gt;detailed writeup of the project&lt;/a&gt; which goes into much more detail about how he trained the model. Here's how the books were filtered for pre-training:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;First, I downloaded the British Library dataset split of all 19th-century books. I filtered those down to books contemporaneous with the reign of Queen Victoria—which, unfortunately, cut out the novels of Jane Austen—and further filtered those down to a set of books with an optical character recognition (OCR) confidence of .65 or above, as listed in the metadata. This left me with 28,035 books, or roughly 2.93 billion tokens for pretraining data.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Getting it to behave like a conversational model was a lot harder. Trip started by trying to train on plays by Oscar Wilde and George Bernard Shaw, but found they didn't provide enough pairs. Then he tried extracting dialogue pairs from the books themselves with poor results. The approach that worked was to have Claude Haiku and GPT-4o-mini generate synthetic conversation pairs for the supervised fine tuning, which solved the problem but sadly I think dilutes the "no training inputs from after 1899" claim from the original model card.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/training-data"&gt;training-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-code"&gt;claude-code&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/andrej-karpathy"&gt;andrej-karpathy&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/hugging-face"&gt;hugging-face&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="llm"/><category term="training-data"/><category term="ai"/><category term="local-llms"/><category term="llms"/><category term="ai-ethics"/><category term="claude-code"/><category term="andrej-karpathy"/><category term="ai-assisted-programming"/><category term="generative-ai"/><category term="hugging-face"/><category term="uv"/></entry><entry><title>llm-mrchatterbox 0.1</title><link href="https://simonwillison.net/2026/Mar/30/llm-mrchatterbox-2/#atom-tag" rel="alternate"/><published>2026-03-30T02:20:46+00:00</published><updated>2026-03-30T02:20:46+00:00</updated><id>https://simonwillison.net/2026/Mar/30/llm-mrchatterbox-2/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;Release:&lt;/strong&gt; &lt;a href="https://github.com/simonw/llm-mrchatterbox/releases/tag/0.1"&gt;llm-mrchatterbox 0.1&lt;/a&gt;&lt;/p&gt;
    &lt;p&gt;See &lt;a href="https://simonwillison.net/2026/Mar/30/mr-chatterbox/"&gt;Mr. Chatterbox is a (weak) Victorian-era ethically trained model you can run on your own computer&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="llm"/></entry><entry><title>datasette-llm 0.1a2</title><link href="https://simonwillison.net/2026/Mar/26/datasette-llm/#atom-tag" rel="alternate"/><published>2026-03-26T15:52:32+00:00</published><updated>2026-03-26T15:52:32+00:00</updated><id>https://simonwillison.net/2026/Mar/26/datasette-llm/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;Release:&lt;/strong&gt; &lt;a href="https://github.com/datasette/datasette-llm/releases/tag/0.1a2"&gt;datasette-llm 0.1a2&lt;/a&gt;&lt;/p&gt;
    &lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;actor&lt;/code&gt; is now available to the &lt;code&gt;llm_prompt_context&lt;/code&gt; plugin hook. &lt;a href="https://github.com/datasette/datasette-llm/pull/2"&gt;#2&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="llm"/><category term="datasette"/></entry><entry><title>datasette-llm 0.1a1</title><link href="https://simonwillison.net/2026/Mar/25/datasette-llm/#atom-tag" rel="alternate"/><published>2026-03-25T21:24:31+00:00</published><updated>2026-03-25T21:24:31+00:00</updated><id>https://simonwillison.net/2026/Mar/25/datasette-llm/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;Release:&lt;/strong&gt; &lt;a href="https://github.com/datasette/datasette-llm/releases/tag/0.1a1"&gt;datasette-llm 0.1a1&lt;/a&gt;&lt;/p&gt;
    &lt;p&gt;New release of the base plugin that makes models from &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; available for use by other Datasette plugins such as &lt;a href="https://github.com/datasette/datasette-enrichments-llm"&gt;datasette-enrichments-llm&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;New &lt;a href="https://github.com/datasette/datasette-llm/blob/main/README.md#register_llm_purposes"&gt;&lt;code&gt;register_llm_purposes()&lt;/code&gt; plugin hook&lt;/a&gt; and &lt;code&gt;get_purposes()&lt;/code&gt; function for retrieving registered purpose strings. &lt;a href="https://github.com/datasette/datasette-llm/issues/1"&gt;#1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;One of the responsibilities of this plugin is to configure which models are used for which purposes, so you can say in one place "data enrichment uses GPT-5.4-nano but SQL query assistance happens using Sonnet 4.6", for example.&lt;/p&gt;
&lt;p&gt;Plugins that depend on this can use &lt;code&gt;model = await llm.model(purpose="enrichment")&lt;/code&gt; to indicate the purpose of the prompts they wish to execute against the model. Those plugins can now also use the new &lt;code&gt;register_llm_purposes()&lt;/code&gt; hook to register those purpose strings, which means future plugins can list those purposes in one place to power things like an admin UI for assigning models to purposes.&lt;/p&gt;
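&lt;p&gt;Conceptually the purpose mechanism is a registry mapping purpose strings to model IDs. Here's an illustrative sketch of that idea in plain Python - this mirrors the design, not datasette-llm's actual implementation:&lt;/p&gt;

```python
# Illustrative sketch of purpose-based model lookup: plugins register
# purpose strings, an admin maps each purpose to a model ID, and
# dependent code resolves models by purpose rather than by name.
# Class and method names here are invented for illustration.
class PurposeRegistry:
    def __init__(self) -> None:
        self.purposes: set[str] = set()
        self.assignments: dict[str, str] = {}

    def register(self, purpose: str) -> None:
        # A plugin declares the purposes it will request models for
        self.purposes.add(purpose)

    def assign(self, purpose: str, model_id: str) -> None:
        # Configuration (or an admin UI) picks a model per purpose
        self.assignments[purpose] = model_id

    def model_for(self, purpose: str) -> str:
        return self.assignments[purpose]

registry = PurposeRegistry()
registry.register("enrichment")                # plugin registers its purpose
registry.assign("enrichment", "gpt-5.4-nano")  # admin assigns a model
print(registry.model_for("enrichment"))        # prints: gpt-5.4-nano
```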
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/annotated-release-notes"&gt;annotated-release-notes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/plugins"&gt;plugins&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="annotated-release-notes"/><category term="llm"/><category term="datasette"/><category term="plugins"/></entry><entry><title>GPT-5.4 mini and GPT-5.4 nano, which can describe 76,000 photos for $52</title><link href="https://simonwillison.net/2026/Mar/17/mini-and-nano/#atom-tag" rel="alternate"/><published>2026-03-17T19:39:17+00:00</published><updated>2026-03-17T19:39:17+00:00</updated><id>https://simonwillison.net/2026/Mar/17/mini-and-nano/#atom-tag</id><summary type="html">
    &lt;p&gt;OpenAI today: &lt;a href="https://openai.com/index/introducing-gpt-5-4-mini-and-nano/"&gt;Introducing GPT‑5.4 mini and nano&lt;/a&gt;. These models join GPT-5.4 which was released &lt;a href="https://openai.com/index/introducing-gpt-5-4/"&gt;two weeks ago&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;OpenAI's self-reported benchmarks show the new 5.4-nano outperforming their previous GPT-5 mini model when run at maximum reasoning effort. The new mini is also 2x faster than the previous mini.&lt;/p&gt;
&lt;p&gt;Here's how the pricing looks - all prices are per million tokens. &lt;code&gt;gpt-5.4-nano&lt;/code&gt; is notably even cheaper than Google's Gemini 3.1 Flash-Lite:&lt;/p&gt;
&lt;center&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Model&lt;/th&gt;
      &lt;th&gt;Input&lt;/th&gt;
      &lt;th&gt;Cached input&lt;/th&gt;
      &lt;th&gt;Output&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;gpt-5.4&lt;/td&gt;
      &lt;td&gt;$2.50&lt;/td&gt;
      &lt;td&gt;$0.25&lt;/td&gt;
      &lt;td&gt;$15.00&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;gpt-5.4-mini&lt;/td&gt;
      &lt;td&gt;$0.75&lt;/td&gt;
      &lt;td&gt;$0.075&lt;/td&gt;
      &lt;td&gt;$4.50&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;gpt-5.4-nano&lt;/td&gt;
      &lt;td&gt;$0.20&lt;/td&gt;
      &lt;td&gt;$0.02&lt;/td&gt;
      &lt;td&gt;$1.25&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;&lt;td colspan="4"&gt;&lt;center&gt;Other models for comparison&lt;/center&gt;&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Claude Opus 4.6&lt;/td&gt;
      &lt;td&gt;$5.00&lt;/td&gt;
      &lt;td&gt;-&lt;/td&gt;
      &lt;td&gt;$25.00&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
      &lt;td&gt;$3.00&lt;/td&gt;
      &lt;td&gt;-&lt;/td&gt;
      &lt;td&gt;$15.00&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Gemini 3.1 Pro&lt;/td&gt;
      &lt;td&gt;$2.00&lt;/td&gt;
      &lt;td&gt;-&lt;/td&gt;
      &lt;td&gt;$12.00&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Claude Haiku 4.5&lt;/td&gt;
      &lt;td&gt;$1.00&lt;/td&gt;
      &lt;td&gt;-&lt;/td&gt;
      &lt;td&gt;$5.00&lt;/td&gt;
    &lt;/tr&gt;
&lt;tr&gt;
      &lt;td&gt;Gemini 3.1 Flash-Lite&lt;/td&gt;
      &lt;td&gt;$0.25&lt;/td&gt;
      &lt;td&gt;-&lt;/td&gt;
      &lt;td&gt;$1.50&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/center&gt;
&lt;p&gt;I used GPT-5.4 nano to generate a description of this photo I took at the &lt;a href="https://www.niche-museums.com/118"&gt;John M. Mossman Lock Collection&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/IMG_2324.jpeg" alt="Description below" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m gpt-5.4-nano -a IMG_2324.jpeg 'describe image'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's the output:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The image shows the interior of a museum gallery with a long display wall. White-painted brick walls are covered with many framed portraits arranged in neat rows. Below the portraits, there are multiple glass display cases with dark wooden frames and glass tops/fronts, containing various old historical objects and equipment. The room has a polished wooden floor, hanging ceiling light fixtures/cords, and a few visible pipes near the top of the wall. In the foreground, glass cases run along the length of the room, reflecting items from other sections of the gallery.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That took 2,751 input tokens and 112 output tokens, at a cost of &lt;a href="https://www.llm-prices.com/#it=2751&amp;amp;ot=112&amp;amp;sel=gpt-5.4-nano"&gt;0.069 cents&lt;/a&gt; (less than a tenth of a cent). That means describing every single photo in my 76,000 photo collection would cost around $52.44.&lt;/p&gt;
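&lt;p&gt;You can check that arithmetic directly against the nano prices in the table above:&lt;/p&gt;

```python
# gpt-5.4-nano pricing from the table: $0.20/million input tokens,
# $1.25/million output tokens.
def describe_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single gpt-5.4-nano call."""
    return input_tokens * 0.20 / 1e6 + output_tokens * 1.25 / 1e6

per_photo = describe_cost(2751, 112)
print(round(per_photo * 100, 3))     # cents per photo: 0.069
print(round(per_photo * 76_000, 2))  # dollars for all 76,000 photos
```

&lt;p&gt;(Rounding the per-photo cost to 0.069 cents before multiplying gives the $52.44 figure.)&lt;/p&gt;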
&lt;p&gt;I released &lt;a href="https://llm.datasette.io/en/stable/changelog.html#v0-29"&gt;llm 0.29&lt;/a&gt; with support for the new models.&lt;/p&gt;
&lt;p&gt;Then I had OpenAI Codex loop through all five reasoning effort levels and all three models and produce this combined SVG grid of pelicans riding bicycles (&lt;a href="https://gist.github.com/simonw/f16292d9a5b90b28054cff3ba497a3ca"&gt;generation transcripts here&lt;/a&gt;). I like the gpt-5.4 xhigh one the best: it has a good bicycle (with nice spokes) and the pelican has a fish in its beak!&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/gpt-5.4-pelican-family.svg" alt="Described by Claude Opus 4.6: A 5x3 comparison grid of AI-generated cartoon illustrations of a pelican riding a bicycle. Columns are labeled &amp;quot;gpt-5.4-nano&amp;quot;, &amp;quot;gpt-5.4-mini&amp;quot;, and &amp;quot;gpt-5.4&amp;quot; across the top, and rows are labeled &amp;quot;none&amp;quot;, &amp;quot;low&amp;quot;, &amp;quot;medium&amp;quot;, &amp;quot;high&amp;quot;, and &amp;quot;xhigh&amp;quot; down the left side, representing quality/detail settings. In the &amp;quot;none&amp;quot; row, gpt-5.4-nano shows a chaotic white bird with misplaced arrows and tangled wheels on grass, gpt-5.4-mini shows a duck-like brown bird awkwardly straddling a motorcycle-like bike, and gpt-5.4 shows a stiff gray-and-white pelican sitting atop a blue tandem bicycle with extra legs. In the &amp;quot;low&amp;quot; row, nano shows a chubby round white bird pedaling with small feet on grass, mini shows a cleaner white bird riding a blue bicycle with motion lines, and gpt-5.4 shows a pelican with a blue cap riding confidently but with slightly awkward proportions. In the &amp;quot;medium&amp;quot; row, nano regresses to a strange bird standing over bowling balls on ice, mini shows two plump white birds merged onto one yellow-wheeled bicycle, and gpt-5.4 shows a more recognizable gray-and-white pelican on a red bicycle but with tangled extra legs. In the &amp;quot;high&amp;quot; row, nano shows multiple small pelicans crowded around a broken green bicycle on grass with a sun overhead, mini shows a tandem bicycle with two white pelicans and clear blue sky, and gpt-5.4 shows two pelicans stacked on a red tandem bike with the most realistic proportions yet. 
In the &amp;quot;xhigh&amp;quot; row, nano shows the most detailed scene with a pelican on a detailed bicycle with grass and a large sun but still somewhat jumbled anatomy, mini produces the cleanest single pelican on a yellow-accented bicycle with a light blue sky, and gpt-5.4 shows a well-rendered gray pelican on a teal bicycle with the best overall coherence. Generally, quality improves moving right across models and down through quality tiers, though &amp;quot;medium&amp;quot; is inconsistently worse than &amp;quot;low&amp;quot; for some models, and all images maintain a lighthearted cartoon style with pastel skies and simple backgrounds." style="max-width: 100%;" /&gt;&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="vision-llms"/><category term="llm"/><category term="openai"/><category term="pelican-riding-a-bicycle"/><category term="llm-pricing"/><category term="ai"/><category term="llms"/><category term="llm-release"/><category term="generative-ai"/></entry><entry><title>Introducing Mistral Small 4</title><link href="https://simonwillison.net/2026/Mar/16/mistral-small-4/#atom-tag" rel="alternate"/><published>2026-03-16T23:41:17+00:00</published><updated>2026-03-16T23:41:17+00:00</updated><id>https://simonwillison.net/2026/Mar/16/mistral-small-4/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://mistral.ai/news/mistral-small-4"&gt;Introducing Mistral Small 4&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Big new release from Mistral today (despite the name) - a new Apache 2 licensed 119B parameter (Mixture-of-Experts, 6B active) model which they describe like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Mistral Small 4 is the first Mistral model to unify the capabilities of our flagship models, Magistral for reasoning, Pixtral for multimodal, and Devstral for agentic coding, into a single, versatile model.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It supports &lt;code&gt;reasoning_effort="none"&lt;/code&gt; or &lt;code&gt;reasoning_effort="high"&lt;/code&gt;, with the latter providing "equivalent verbosity to previous Magistral models". &lt;/p&gt;
&lt;p&gt;The new model is &lt;a href="https://huggingface.co/mistralai/Mistral-Small-4-119B-2603/tree/main"&gt;242GB on Hugging Face&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I &lt;a href="https://gist.github.com/simonw/3dec228577559f15f26204a3cc550583"&gt;tried it out&lt;/a&gt; via the Mistral API using &lt;a href="https://github.com/simonw/llm-mistral"&gt;llm-mistral&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-mistral
llm mistral refresh
llm -m mistral/mistral-small-2603 "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="The bicycle is upside down and mangled and the pelican is a series of grey curves with a triangular beak." src="https://static.simonwillison.net/static/2026/mistral-small-4.png" /&gt;&lt;/p&gt;
&lt;p&gt;I couldn't find a way to set the reasoning effort in their &lt;a href="https://docs.mistral.ai/api/endpoint/chat#operation-chat_completion_v1_chat_completions_post"&gt;API documentation&lt;/a&gt;, so hopefully that's a feature which will land soon.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update 23rd March&lt;/strong&gt;: Here's new documentation for the &lt;a href="https://docs.mistral.ai/capabilities/reasoning/adjustable"&gt;reasoning_effort parameter&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Also from Mistral today and fitting their -stral naming convention is &lt;a href="https://mistral.ai/news/leanstral"&gt;Leanstral&lt;/a&gt;, an open weight model that is specifically tuned to help output the &lt;a href="https://lean-lang.org/"&gt;Lean 4&lt;/a&gt; formally verifiable coding language. I haven't explored Lean at all so I have no way to credibly evaluate this, but it's interesting to see them target one specific language in this way.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mistral"&gt;mistral&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;&lt;/p&gt;



</summary><category term="llm"/><category term="llm-reasoning"/><category term="pelican-riding-a-bicycle"/><category term="ai"/><category term="llms"/><category term="llm-release"/><category term="mistral"/><category term="generative-ai"/></entry><entry><title>Gemini 3.1 Flash-Lite</title><link href="https://simonwillison.net/2026/Mar/3/gemini-31-flash-lite/#atom-tag" rel="alternate"/><published>2026-03-03T21:53:54+00:00</published><updated>2026-03-03T21:53:54+00:00</updated><id>https://simonwillison.net/2026/Mar/3/gemini-31-flash-lite/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-lite/"&gt;Gemini 3.1 Flash-Lite&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Google's latest model is an update to their inexpensive Flash-Lite family. At $0.25/million tokens of input and $1.50/million output this is 1/8th the price of Gemini 3.1 Pro.&lt;/p&gt;
&lt;p&gt;It supports four different thinking levels, so I had it output &lt;a href="https://gist.github.com/simonw/99fb28dc11d0c24137d4ff8a33978a9e"&gt;four different pelicans&lt;/a&gt;:&lt;/p&gt;
&lt;div style="
    display: grid;
    grid-template-columns: repeat(2, 1fr);
    gap: 8px;
    margin: 0 auto;
  "&gt;
    &lt;div style="text-align: center;"&gt;
      &lt;div style="aspect-ratio: 1; overflow: hidden; border-radius: 4px;"&gt;
        &lt;img src="https://static.simonwillison.net/static/2026/gemini-3.1-flash-lite-minimal.png" alt="A minimalist vector-style illustration of a stylized bird riding a bicycle." style="width: 100%; height: 100%; object-fit: cover; display: block;"&gt;
      &lt;/div&gt;
      &lt;p style="margin: 4px 0 0; font-size: 16px; color: #333;"&gt;minimal&lt;/p&gt;
    &lt;/div&gt;
    &lt;div style="text-align: center;"&gt;
      &lt;div style="aspect-ratio: 1; overflow: hidden; border-radius: 4px;"&gt;
        &lt;img src="https://static.simonwillison.net/static/2026/gemini-3.1-flash-lite-low.png" alt="A minimalist graphic of a light blue round bird with a single black dot for an eye, wearing a yellow backpack and riding a black bicycle on a flat grey line." style="width: 100%; height: 100%; object-fit: cover; display: block;"&gt;
      &lt;/div&gt;
      &lt;p style="margin: 4px 0 0; font-size: 16px; color: #333;"&gt;low&lt;/p&gt;
    &lt;/div&gt;
    &lt;div style="text-align: center;"&gt;
      &lt;div style="aspect-ratio: 1; overflow: hidden; border-radius: 4px;"&gt;
        &lt;img src="https://static.simonwillison.net/static/2026/gemini-3.1-flash-lite-medium.png" alt="A minimalist digital illustration of a light blue bird wearing a yellow backpack while riding a bicycle." style="width: 100%; height: 100%; object-fit: cover; display: block;"&gt;
      &lt;/div&gt;
      &lt;p style="margin: 4px 0 0; font-size: 16px; color: #333;"&gt;medium&lt;/p&gt;
    &lt;/div&gt;
    &lt;div style="text-align: center;"&gt;
      &lt;div style="aspect-ratio: 1; overflow: hidden; border-radius: 4px;"&gt;
        &lt;img src="https://static.simonwillison.net/static/2026/gemini-3.1-flash-lite-high.png" alt="A minimal, stylized line drawing of a bird-like creature with a yellow beak riding a bicycle made of simple geometric lines." style="width: 100%; height: 100%; object-fit: cover; display: block;"&gt;
      &lt;/div&gt;
      &lt;p style="margin: 4px 0 0; font-size: 16px; color: #333;"&gt;high&lt;/p&gt;
    &lt;/div&gt;
&lt;/div&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;&lt;/p&gt;



</summary><category term="gemini"/><category term="llm"/><category term="pelican-riding-a-bicycle"/><category term="llm-pricing"/><category term="ai"/><category term="llms"/><category term="llm-release"/><category term="google"/><category term="generative-ai"/></entry><entry><title>Gemini 3.1 Pro</title><link href="https://simonwillison.net/2026/Feb/19/gemini-31-pro/#atom-tag" rel="alternate"/><published>2026-02-19T17:58:37+00:00</published><updated>2026-02-19T17:58:37+00:00</updated><id>https://simonwillison.net/2026/Feb/19/gemini-31-pro/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/"&gt;Gemini 3.1 Pro&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The first in the Gemini 3.1 series, priced the same as Gemini 3 Pro ($2/million input, $12/million output under 200,000 tokens, $4/$18 for 200,000 to 1,000,000). That's less than half the price of Claude Opus 4.6 with very similar benchmark scores to that model.&lt;/p&gt;
&lt;p&gt;They boast about its improved SVG animation performance compared to Gemini 3 Pro in the announcement!&lt;/p&gt;
&lt;p&gt;I tried "Generate an SVG of a pelican riding a bicycle" &lt;a href="https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%5B%221ugF9fBfLGxnNoe8_rLlluzo9NSPJDWuF%22%5D,%22action%22:%22open%22,%22userId%22:%22106366615678321494423%22,%22resourceKeys%22:%7B%7D%7D&amp;amp;usp=sharing"&gt;in Google AI Studio&lt;/a&gt; and it thought for 323.9 seconds (&lt;a href="https://gist.github.com/simonw/03a755865021739a3659943a22c125ba#thinking-trace"&gt;thinking trace here&lt;/a&gt;) before producing this one:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Whimsical flat-style illustration of a pelican wearing a blue and white baseball cap, riding a red bicycle with yellow-rimmed wheels along a road. The pelican has a large orange bill and a green scarf. A small fish peeks out of a brown basket on the handlebars. The background features a light blue sky with a yellow sun, white clouds, and green hills." src="https://static.simonwillison.net/static/2026/gemini-3.1-pro-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;It's good to see the legs clearly depicted on both sides of the frame (should &lt;a href="https://twitter.com/elonmusk/status/2023833496804839808"&gt;satisfy Elon&lt;/a&gt;), the fish in the basket is a nice touch and I appreciated this comment in &lt;a href="https://gist.github.com/simonw/03a755865021739a3659943a22c125ba#response"&gt;the SVG code&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;!-- Black Flight Feathers on Wing Tip --&amp;gt;
&amp;lt;path d="M 420 175 C 440 182, 460 187, 470 190 C 450 210, 430 208, 410 198 Z" fill="#374151" /&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I've &lt;a href="https://github.com/simonw/llm-gemini/issues/121"&gt;added&lt;/a&gt; the two new model IDs &lt;code&gt;gemini-3.1-pro-preview&lt;/code&gt; and &lt;code&gt;gemini-3.1-pro-preview-customtools&lt;/code&gt; to my &lt;a href="https://github.com/simonw/llm-gemini"&gt;llm-gemini plugin&lt;/a&gt; for &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt;. That "custom tools" one is &lt;a href="https://ai.google.dev/gemini-api/docs/models/gemini-3.1-pro-preview#gemini-31-pro-preview-customtools"&gt;described here&lt;/a&gt; - apparently it may provide better tool performance than the default model in some situations.&lt;/p&gt;
&lt;p&gt;The model appears to be &lt;em&gt;incredibly&lt;/em&gt; slow right now - it took 104s to respond to a simple "hi", and a few of my other tests ran into "Error: This model is currently experiencing high demand. Spikes in demand are usually temporary. Please try again later." or "Error: Deadline expired before operation could complete" errors. I'm assuming that's just teething problems on launch day.&lt;/p&gt;
&lt;p&gt;It sounds like last week's &lt;a href="https://simonwillison.net/2026/Feb/12/gemini-3-deep-think/"&gt;Deep Think release&lt;/a&gt; was our first exposure to the 3.1 family:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Last week, we released a major update to Gemini 3 Deep Think to solve modern challenges across science, research and engineering. Today, we’re releasing the upgraded core intelligence that makes those breakthroughs possible: Gemini 3.1 Pro.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: In &lt;a href="https://simonwillison.net/2025/nov/13/training-for-pelicans-riding-bicycles/"&gt;What happens if AI labs train for pelicans riding bicycles?&lt;/a&gt; last November I said:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If a model finally comes out that produces an excellent SVG of a pelican riding a bicycle you can bet I’m going to test it on all manner of creatures riding all sorts of transportation devices.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Google's Gemini Lead Jeff Dean &lt;a href="https://x.com/JeffDean/status/2024525132266688757"&gt;tweeted this video&lt;/a&gt; featuring an animated pelican riding a bicycle, plus a frog on a penny-farthing and a giraffe driving a tiny car and an ostrich on roller skates and a turtle kickflipping a skateboard and a dachshund driving a stretch limousine.&lt;/p&gt;
&lt;video style="margin-bottom: 1em; max-width: 100%" poster="https://static.simonwillison.net/static/2026/gemini-animated-pelicans.jpg" muted controls preload="none"&gt;
  &lt;source src="https://static.simonwillison.net/static/2026/gemini-animated-pelicans.mp4" type="video/mp4"&gt;
&lt;/video&gt;

&lt;p&gt;I've been saying for a while that I wish AI labs would highlight things that their new models can do that their older models could not, so top marks to the Gemini team for this video.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update 2&lt;/strong&gt;: I used &lt;code&gt;llm-gemini&lt;/code&gt; to run my &lt;a href="https://simonwillison.net/2025/Nov/18/gemini-3/#and-a-new-pelican-benchmark"&gt;more detailed Pelican prompt&lt;/a&gt;, with &lt;a href="https://gist.github.com/simonw/a3bdd4ec9476ba9e9ba7aa61b46d8296"&gt;this result&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Flat-style illustration of a brown pelican riding a teal bicycle with dark blue-rimmed wheels against a plain white background. Unlike the previous image's white cartoon pelican, this pelican has realistic brown plumage with detailed feather patterns, a dark maroon head, yellow eye, and a large pink-tinged pouch bill. The bicycle is a simpler design without a basket, and the scene lacks the colorful background elements like the sun, clouds, road, hills, cap, and scarf from the first illustration, giving it a more minimalist feel." src="https://static.simonwillison.net/static/2026/gemini-3.1-pro-pelican-2.png" /&gt;&lt;/p&gt;
&lt;p&gt;From the SVG comments:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;!-- Pouch Gradient (Breeding Plumage: Red to Olive/Green) --&amp;gt;
...
&amp;lt;!-- Neck Gradient (Breeding Plumage: Chestnut Nape, White/Yellow Front) --&amp;gt;
&lt;/code&gt;&lt;/pre&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/svg"&gt;svg&lt;/a&gt;&lt;/p&gt;



</summary><category term="gemini"/><category term="llm"/><category term="pelican-riding-a-bicycle"/><category term="ai"/><category term="llms"/><category term="llm-release"/><category term="google"/><category term="generative-ai"/><category term="svg"/></entry><entry><title>llm-gemini 0.29</title><link href="https://simonwillison.net/2026/Feb/19/llm-gemini/#atom-tag" rel="alternate"/><published>2026-02-19T17:57:25+00:00</published><updated>2026-02-19T17:57:25+00:00</updated><id>https://simonwillison.net/2026/Feb/19/llm-gemini/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;Release:&lt;/strong&gt; &lt;a href="https://github.com/simonw/llm-gemini/releases/tag/0.29"&gt;llm-gemini 0.29&lt;/a&gt;&lt;/p&gt;
    
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="llm"/><category term="gemini"/></entry><entry><title>Introducing Claude Sonnet 4.6</title><link href="https://simonwillison.net/2026/Feb/17/claude-sonnet-46/#atom-tag" rel="alternate"/><published>2026-02-17T23:58:58+00:00</published><updated>2026-02-17T23:58:58+00:00</updated><id>https://simonwillison.net/2026/Feb/17/claude-sonnet-46/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.anthropic.com/news/claude-sonnet-4-6"&gt;Introducing Claude Sonnet 4.6&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Sonnet 4.6 is out today, and Anthropic claim it offers similar performance to &lt;a href="https://simonwillison.net/2025/Nov/24/claude-opus/"&gt;November's Opus 4.5&lt;/a&gt; while maintaining the Sonnet pricing of $3/million input and $15/million output tokens (the Opus models are $5/$25). Here's &lt;a href="https://www-cdn.anthropic.com/78073f739564e986ff3e28522761a7a0b4484f84.pdf"&gt;the system card PDF&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Sonnet 4.6 has a "reliable knowledge cutoff" of August 2025, compared to Opus 4.6's May 2025 and Haiku 4.5's February 2025. Both Opus and Sonnet default to 200,000 max input tokens but can stretch to 1 million in beta and at a higher cost.&lt;/p&gt;
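&lt;p&gt;As a rough back-of-envelope sketch (my own, not from Anthropic's announcement), those per-million-token rates translate into per-request costs like this - the example token counts here are hypothetical:&lt;/p&gt;

```python
# Rough cost estimate from per-million-token rates.
# Rates are the ones quoted in the post; the example token
# counts below are hypothetical.

def cost_usd(input_tokens, output_tokens, in_rate, out_rate):
    """Dollar cost given per-million-token input/output rates."""
    return (input_tokens / 1_000_000) * in_rate + (output_tokens / 1_000_000) * out_rate

# A request using the full 200,000-token default context window
# plus 4,000 output tokens:
sonnet = cost_usd(200_000, 4_000, 3, 15)   # Sonnet 4.6: $3 / $15
opus = cost_usd(200_000, 4_000, 5, 25)     # Opus: $5 / $25

print(f"Sonnet: ${sonnet:.2f}, Opus: ${opus:.2f}")
# → Sonnet: $0.66, Opus: $1.10
```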
&lt;p&gt;I just released &lt;a href="https://github.com/simonw/llm-anthropic/releases/tag/0.24"&gt;llm-anthropic 0.24&lt;/a&gt; with support for both Sonnet 4.6 and Opus 4.6. Claude Code &lt;a href="https://github.com/simonw/llm-anthropic/pull/65"&gt;did most of the work&lt;/a&gt; - the new models involved a fiddly set of extra details around adaptive thinking and dropped support for prefixes, as described &lt;a href="https://platform.claude.com/docs/en/about-claude/models/migration-guide"&gt;in Anthropic's migration guide&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/b185576a95e9321b441f0a4dfc0e297c"&gt;what I got&lt;/a&gt; from:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uvx --with llm-anthropic llm 'Generate an SVG of a pelican riding a bicycle' -m claude-sonnet-4.6
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="The pelican has a jaunty top hat with a red band. There is a string between the upper and lower beaks for some reason. The bicycle frame is warped in the wrong way." src="https://static.simonwillison.net/static/2026/pelican-sonnet-4.6.png" /&gt;&lt;/p&gt;
&lt;p&gt;The SVG comments include:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;!-- Hat (fun accessory) --&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I tried a second time and also got a top hat. Sonnet 4.6 apparently loves top hats!&lt;/p&gt;
&lt;p&gt;For comparison, here's the pelican Opus 4.5 drew me &lt;a href="https://simonwillison.net/2025/Nov/24/claude-opus/"&gt;in November&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="The pelican is cute and looks pretty good. The bicycle is not great - the frame is wrong and the pelican is facing backwards when the handlebars appear to be forwards.There is also something that looks a bit like an egg on the handlebars." src="https://static.simonwillison.net/static/2025/claude-opus-4.5-pelican.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;And here's Anthropic's current best pelican, drawn by Opus 4.6 &lt;a href="https://simonwillison.net/2026/Feb/5/two-new-models/"&gt;on February 5th&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Slightly wonky bicycle frame but an excellent pelican, very clear beak and pouch, nice feathers." src="https://static.simonwillison.net/static/2026/opus-4.6-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;Opus 4.6 produces the best pelican beak/pouch. I do think the top hat from Sonnet 4.6 is a nice touch though.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=47050488"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-code"&gt;claude-code&lt;/a&gt;&lt;/p&gt;



</summary><category term="llm"/><category term="anthropic"/><category term="claude"/><category term="llm-pricing"/><category term="ai"/><category term="llms"/><category term="llm-release"/><category term="generative-ai"/><category term="pelican-riding-a-bicycle"/><category term="claude-code"/></entry><entry><title>llm-anthropic 0.24</title><link href="https://simonwillison.net/2026/Feb/17/llm-anthropic/#atom-tag" rel="alternate"/><published>2026-02-17T23:51:23+00:00</published><updated>2026-02-17T23:51:23+00:00</updated><id>https://simonwillison.net/2026/Feb/17/llm-anthropic/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;Release:&lt;/strong&gt; &lt;a href="https://github.com/simonw/llm-anthropic/releases/tag/0.24"&gt;llm-anthropic 0.24&lt;/a&gt;&lt;/p&gt;
    
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="llm"/><category term="claude"/><category term="anthropic"/></entry><entry><title>datasette-llm 0.1a0</title><link href="https://simonwillison.net/2026/Jan/22/datasette-llm/#atom-tag" rel="alternate"/><published>2026-01-22T17:10:06+00:00</published><updated>2026-01-22T17:10:06+00:00</updated><id>https://simonwillison.net/2026/Jan/22/datasette-llm/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;Release:&lt;/strong&gt; &lt;a href="https://github.com/datasette/datasette-llm/releases/tag/0.1a0"&gt;datasette-llm 0.1a0&lt;/a&gt;&lt;/p&gt;
    
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="datasette"/><category term="llm"/></entry><entry><title>jordanhubbard/nanolang</title><link href="https://simonwillison.net/2026/Jan/19/nanolang/#atom-tag" rel="alternate"/><published>2026-01-19T23:58:56+00:00</published><updated>2026-01-19T23:58:56+00:00</updated><id>https://simonwillison.net/2026/Jan/19/nanolang/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/jordanhubbard/nanolang"&gt;jordanhubbard/nanolang&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Plenty of people have mused about what a new programming language specifically designed to be used by LLMs might look like. Jordan Hubbard (&lt;a href="https://en.wikipedia.org/wiki/Jordan_Hubbard"&gt;co-founder of FreeBSD&lt;/a&gt;, with serious stints at Apple and NVIDIA) just released exactly that.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A minimal, LLM-friendly programming language with mandatory testing and unambiguous syntax.&lt;/p&gt;
&lt;p&gt;NanoLang transpiles to C for native performance while providing a clean, modern syntax optimized for both human readability and AI code generation.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The syntax strikes me as an interesting mix between C, Lisp and Rust.&lt;/p&gt;
&lt;p&gt;I decided to see if an LLM could produce working code in it directly, given the necessary context. I started with this &lt;a href="https://github.com/jordanhubbard/nanolang/blob/main/MEMORY.md"&gt;MEMORY.md&lt;/a&gt; file, which begins:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Purpose:&lt;/strong&gt; This file is designed specifically for Large Language Model consumption. It contains the essential knowledge needed to generate, debug, and understand NanoLang code. Pair this with &lt;code&gt;spec.json&lt;/code&gt; for complete language coverage.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I ran that using &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; and &lt;a href="https://github.com/simonw/llm-anthropic"&gt;llm-anthropic&lt;/a&gt; like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m claude-opus-4.5 \
  -s https://raw.githubusercontent.com/jordanhubbard/nanolang/refs/heads/main/MEMORY.md \
  'Build me a mandelbrot fractal CLI tool in this language' \
  &amp;gt; /tmp/fractal.nano
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;a href="https://gist.github.com/simonw/7847f022566d11629ec2139f1d109fb8#mandelbrot-fractal-cli-tool-in-nano"&gt;resulting code&lt;/a&gt;... &lt;a href="https://gist.github.com/simonw/7847f022566d11629ec2139f1d109fb8?permalink_comment_id=5947465#gistcomment-5947465"&gt;did not compile&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I may have been too optimistic expecting a one-shot working program for a new language like this. So I checked out a clone of the actual project, copied in my program and had Claude Code take a look at the failing compiler output.&lt;/p&gt;
&lt;p&gt;... and it worked! Claude happily grepped its way through the various &lt;code&gt;examples/&lt;/code&gt; and built me a working program.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://gisthost.github.io/?9696da6882cb6596be6a9d5196e8a7a5/index.html"&gt;the Claude Code transcript&lt;/a&gt; - you can see it &lt;a href="https://gisthost.github.io/?9696da6882cb6596be6a9d5196e8a7a5/page-001.html#msg-2026-01-19T23-43-09-675Z"&gt;reading relevant examples here&lt;/a&gt; - and here's &lt;a href="https://gist.github.com/simonw/e7f3577adcfd392ab7fa23b1295d00f2"&gt;the finished code plus its output&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I've suspected &lt;a href="https://simonwillison.net/2025/Nov/7/llms-for-new-programming-languages/"&gt;for a while&lt;/a&gt; that LLMs and coding agents might significantly reduce the friction involved in launching a new language. This result reinforces my opinion.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=46684958"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/programming-languages"&gt;programming-languages&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-code"&gt;claude-code&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;&lt;/p&gt;



</summary><category term="coding-agents"/><category term="ai-assisted-programming"/><category term="programming-languages"/><category term="claude-code"/><category term="generative-ai"/><category term="ai"/><category term="llms"/><category term="llm"/></entry><entry><title>llm-gemini 0.28.2</title><link href="https://simonwillison.net/2025/Dec/23/llm-gemini/#atom-tag" rel="alternate"/><published>2025-12-23T16:20:33+00:00</published><updated>2025-12-23T16:20:33+00:00</updated><id>https://simonwillison.net/2025/Dec/23/llm-gemini/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;Release:&lt;/strong&gt; &lt;a href="https://github.com/simonw/llm-gemini/releases/tag/0.28.2"&gt;llm-gemini 0.28.2&lt;/a&gt;&lt;/p&gt;
    
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="llm"/><category term="gemini"/></entry></feed>