<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: ollama</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/ollama.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-10-14T23:36:21+00:00</updated><author><name>Simon Willison</name></author><entry><title>NVIDIA DGX Spark: great hardware, early days for the ecosystem</title><link href="https://simonwillison.net/2025/Oct/14/nvidia-dgx-spark/#atom-tag" rel="alternate"/><published>2025-10-14T23:36:21+00:00</published><updated>2025-10-14T23:36:21+00:00</updated><id>https://simonwillison.net/2025/Oct/14/nvidia-dgx-spark/#atom-tag</id><summary type="html">
    &lt;p&gt;NVIDIA sent me a preview unit of their new &lt;a href="https://www.nvidia.com/en-us/products/workstations/dgx-spark/"&gt;DGX Spark&lt;/a&gt; desktop "AI supercomputer". I've never had hardware to review before! You can consider this my first ever sponsored post if you like, but they did not pay me any cash and aside from an embargo date they did not request (nor would I grant) any editorial input into what I write about the device.&lt;/p&gt;
&lt;p&gt;The device retails for around $4,000. They officially go on sale tomorrow.&lt;/p&gt;
&lt;p&gt;First impressions are that this is a snazzy little computer. It's similar in size to a Mac mini, but with an exciting textured surface that feels refreshingly different and a little bit &lt;a href="https://www.indiewire.com/awards/industry/devs-cinematography-rob-hardy-alex-garland-1234583396/"&gt;science fiction&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/nvidia-spark.jpg" alt="A rectangular small computer, sitting horizontally on a box. It is about the width of a Mac Mini. It has a NVIDIA logo on  a reflective handle portion, then textured silver metal front, then another reflective handle at the other end. It's pretty and a bit weird looking. It sits on the box it came in, which has NVIDIA DGX Spark written on it in white text on green." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;There is a &lt;em&gt;very&lt;/em&gt; powerful machine tucked into that little box. Here are the specs, which I had Claude Code figure out for me by &lt;a href="https://gist.github.com/simonw/021651a14e6c5bf9876c9c4244ed6c2d"&gt;poking around on the device itself&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Hardware Specifications&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Architecture: aarch64 (ARM64)&lt;/li&gt;
&lt;li&gt;CPU: 20 cores
&lt;ul&gt;
&lt;li&gt;10x Cortex-X925 (performance cores)&lt;/li&gt;
&lt;li&gt;10x Cortex-A725 (efficiency cores)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;RAM: 119 GB total (112 GB available) - &lt;em&gt;at first I wasn't sure why Claude reported this differently from the advertised 128GB - it turns out to be a &lt;a href="https://news.ycombinator.com/item?id=45586776#45588329"&gt;128GB == 119GiB thing&lt;/a&gt; because Claude &lt;a href="https://gist.github.com/simonw/021651a14e6c5bf9876c9c4244ed6c2d#file-nvidia-claude-code-txt-L41"&gt;used free -h&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;Storage: 3.7 TB (6% used, 3.3 TB available)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;GPU Specifications&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Model: NVIDIA GB10 (Blackwell architecture)&lt;/li&gt;
&lt;li&gt;Compute Capability: sm_121 (12.1)&lt;/li&gt;
&lt;li&gt;Memory: 119.68 GB&lt;/li&gt;
&lt;li&gt;Multi-processor Count: 48 streaming multiprocessors&lt;/li&gt;
&lt;li&gt;Architecture: Blackwell&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Short version: this is an ARM64 device with 128GB of memory that's available to both the GPU and the 20 CPU cores at the same time, strapped onto a 4TB NVMe SSD.&lt;/p&gt;
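&lt;p&gt;That 119 GB figure versus the advertised 128GB is just decimal gigabytes versus binary gibibytes (the unit &lt;code&gt;free -h&lt;/code&gt; reports). A quick shell sanity check:&lt;/p&gt;

```shell
# 128 decimal gigabytes expressed as binary gibibytes (integer division)
echo $(( 128 * 1000 * 1000 * 1000 / (1024 * 1024 * 1024) ))   # prints 119
```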
&lt;p&gt;The Spark is firmly targeted at “AI researchers”. It’s designed for both training and running models.&lt;/p&gt;
&lt;h4 id="the-tricky-bit-cuda-on-arm64"&gt;The tricky bit: CUDA on ARM64&lt;/h4&gt;
&lt;p&gt;Until now almost all of my own model running experiments have taken place on a Mac. This has gotten far less painful over the past year and a half thanks to the amazing work of the &lt;a href="https://simonwillison.net/tags/mlx/"&gt;MLX&lt;/a&gt; team and community, but it's still left me deeply frustrated at my lack of access to the NVIDIA CUDA ecosystem. I've lost count of the number of libraries and tutorials which expect you to be able to use Hugging Face Transformers or PyTorch with CUDA, and leave you high and dry if you don't have an NVIDIA GPU to run things on.&lt;/p&gt;
&lt;p&gt;Armed (ha) with my new NVIDIA GPU I was excited to dive into this world that had long eluded me... only to find that there was another assumption baked into much of this software: x86 architecture for the rest of the machine.&lt;/p&gt;
&lt;p&gt;This resulted in all kinds of unexpected new traps for me to navigate. I eventually managed to get a PyTorch 2.7 wheel for CUDA on ARM, but failed to do the same for 2.8. I'm not confident the 2.8 wheel is actually unavailable - I may simply have failed to find it - but I'm finding the PyTorch ARM ecosystem pretty confusing to navigate.&lt;/p&gt;
&lt;p&gt;NVIDIA are trying to make this easier, with mixed success. A lot of my initial challenges got easier when I found their &lt;a href="https://docs.nvidia.com/dgx/dgx-spark/nvidia-container-runtime-for-docker.html"&gt;official Docker container&lt;/a&gt;, so now I'm figuring out how best to use Docker with GPUs. Here's the current incantation that's been working for me:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;docker run -it --gpus=all \
  -v /usr/local/cuda:/usr/local/cuda:ro \
  nvcr.io/nvidia/cuda:13.0.1-devel-ubuntu24.04 \
  bash&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I have not yet got my head around the difference between CUDA 12 and 13. CUDA 13 appears to be very new, and a lot of the existing tutorials and libraries still expect 12.&lt;/p&gt;
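&lt;p&gt;One way to see which toolkit a given container image ships is to ask its bundled compiler - this assumes the same container setup as above, so I can't promise it works against other images:&lt;/p&gt;

```shell
# Report the CUDA compiler/toolkit version bundled in the container
docker run --rm --gpus=all \
  nvcr.io/nvidia/cuda:13.0.1-devel-ubuntu24.04 \
  nvcc --version
```

&lt;p&gt;Note that &lt;code&gt;nvidia-smi&lt;/code&gt; on the host reports the maximum CUDA version the driver supports, which is not necessarily the toolkit version inside any given container.&lt;/p&gt;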
&lt;h4 id="the-missing-documentation-isn-t-missing-any-more"&gt;The missing documentation isn't missing any more&lt;/h4&gt;
&lt;p&gt;When I first received this machine around a month ago there was very little in the way of documentation to help get me started. This meant climbing the steep NVIDIA+CUDA learning curve mostly on my own.&lt;/p&gt;
&lt;p&gt;This has changed &lt;em&gt;substantially&lt;/em&gt; in just the last week. NVIDIA now have extensive guides for getting things working on the Spark and they are a huge breath of fresh air - exactly the information I needed when I started exploring this hardware.&lt;/p&gt;
&lt;p&gt;Here's the &lt;a href="https://developer.nvidia.com/topics/ai/dgx-spark"&gt;getting started guide&lt;/a&gt;, details on the &lt;a href="https://build.nvidia.com/spark/dgx-dashboard/instructions"&gt;DGX dashboard web app&lt;/a&gt;, and the essential collection of &lt;a href="https://build.nvidia.com/spark"&gt;playbooks&lt;/a&gt;. There's still a lot I haven't tried yet just in this official set of guides.&lt;/p&gt;
&lt;h4 id="claude-code-for-everything"&gt;Claude Code for everything&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://www.claude.com/product/claude-code"&gt;Claude Code&lt;/a&gt; was an absolute lifesaver for me while I was trying to figure out how best to use this device. My Ubuntu skills were a little rusty, and I also needed to figure out CUDA drivers and Docker incantations and how to install the right versions of PyTorch. Claude 4.5 Sonnet is &lt;em&gt;much better than me&lt;/em&gt; at all of these things.&lt;/p&gt;
&lt;p&gt;Since many of my experiments took place in disposable Docker containers I had no qualms at all about running it in YOLO mode:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;IS_SANDBOX=1 claude --dangerously-skip-permissions&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;IS_SANDBOX=1&lt;/code&gt; environment variable stops Claude from complaining about running as root.&lt;/p&gt;

&lt;details&gt;&lt;summary style="font-style: italic"&gt;Before I found out about IS_SANDBOX&lt;/summary&gt;

&lt;p&gt;&lt;br /&gt;&lt;em&gt;I was &lt;a href="https://twitter.com/lawrencecchen/status/1978255934938886409"&gt;tipped off&lt;/a&gt; about IS_SANDBOX after I published this article. Here's my original workaround:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Claude understandably won't let you do this as root, even in a Docker container, so I found myself using the following incantation in a fresh &lt;code&gt;nvcr.io/nvidia/cuda:13.0.1-devel-ubuntu24.04&lt;/code&gt; instance pretty often:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;apt-get update &lt;span class="pl-k"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get install -y sudo
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; pick the first free UID &amp;gt;=1000&lt;/span&gt;
U=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;for i &lt;span class="pl-k"&gt;in&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;seq 1000 65000&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;do&lt;/span&gt; &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-k"&gt;!&lt;/span&gt; getent passwd &lt;span class="pl-smi"&gt;$i&lt;/span&gt; &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt;/dev/null&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;then&lt;/span&gt; &lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-smi"&gt;$i&lt;/span&gt;&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-c1"&gt;break&lt;/span&gt;&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;fi&lt;/span&gt;&lt;span class="pl-k"&gt;;&lt;/span&gt; done&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Chosen UID: &lt;span class="pl-smi"&gt;$U&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; same for a GID&lt;/span&gt;
G=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;for i &lt;span class="pl-k"&gt;in&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;seq 1000 65000&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;do&lt;/span&gt; &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-k"&gt;!&lt;/span&gt; getent group &lt;span class="pl-smi"&gt;$i&lt;/span&gt; &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt;/dev/null&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;then&lt;/span&gt; &lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-smi"&gt;$i&lt;/span&gt;&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-c1"&gt;break&lt;/span&gt;&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;fi&lt;/span&gt;&lt;span class="pl-k"&gt;;&lt;/span&gt; done&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Chosen GID: &lt;span class="pl-smi"&gt;$G&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; create user+group&lt;/span&gt;
groupadd -g &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-smi"&gt;$G&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; devgrp
useradd -m -u &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-smi"&gt;$U&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; -g &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-smi"&gt;$G&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; -s /bin/bash dev
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; enable password-less sudo:&lt;/span&gt;
&lt;span class="pl-c1"&gt;printf&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;dev ALL=(ALL) NOPASSWD:ALL\n&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; /etc/sudoers.d/90-dev-nopasswd
chmod 0440 /etc/sudoers.d/90-dev-nopasswd
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Install npm&lt;/span&gt;
DEBIAN_FRONTEND=noninteractive TZ=Etc/UTC apt-get install -y npm
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Install Claude&lt;/span&gt;
npm install -g @anthropic-ai/claude-code&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then switch to the &lt;code&gt;dev&lt;/code&gt; user and run Claude for the first time:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;su - dev
claude --dangerously-skip-permissions&lt;/pre&gt;&lt;/div&gt;

&lt;/details&gt;&lt;br /&gt;

&lt;p&gt;That first run will provide a URL which you can visit to authenticate with your Anthropic account, then confirm by copying a token back and pasting it into the terminal.&lt;/p&gt;
&lt;p&gt;Docker tip: you can create a snapshot of the current image (with Claude installed) by running &lt;code&gt;docker ps&lt;/code&gt; to get the container ID and then:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;docker commit --pause=false &lt;span class="pl-k"&gt;&amp;lt;&lt;/span&gt;container_id&lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; cc:snapshot&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then later you can start a similar container using:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;docker run -it \
  --gpus=all \
  -v /usr/local/cuda:/usr/local/cuda:ro \
  cc:snapshot bash&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here's an example of the kinds of prompts I've been running in Claude Code inside the container:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;I want to run https://huggingface.co/unsloth/Qwen3-4B-GGUF using llama.cpp - figure out how to get llama cpp working on this machine  such that it runs with the GPU, then install it in this directory and get that model to work to serve a prompt. Goal is to get this  command to run: llama-cli -hf unsloth/Qwen3-4B-GGUF -p "I believe the meaning of life is" -n 128 -no-cnv&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That one worked flawlessly - Claude checked out the &lt;code&gt;llama.cpp&lt;/code&gt; repo, compiled it for me and iterated on it until it could run that model on the GPU. Here's a &lt;a href="https://gist.github.com/simonw/3e7d28d9ed222d842f729bfca46d6673"&gt;full transcript&lt;/a&gt;, converted from Claude's &lt;code&gt;.jsonl&lt;/code&gt; log format to Markdown using a script I &lt;a href="https://github.com/simonw/tools/blob/main/python/claude_to_markdown.py"&gt;vibe coded just now&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I later told it:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Write out a markdown file with detailed notes on what you did. Start with the shortest form of notes on how to get a successful build, then add a full account of everything you tried, what went wrong and how you fixed it.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Which produced &lt;a href="https://gist.github.com/simonw/0942d96f616b9e328568ab27d911c8ed"&gt;this handy set of notes&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="tailscale-was-made-for-this"&gt;Tailscale was made for this&lt;/h4&gt;
&lt;p&gt;Having a machine like this on my local network is neat, but what's even neater is being able to access it from anywhere else in the world, from both my phone and my laptop.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://tailscale.com/"&gt;Tailscale&lt;/a&gt; is &lt;em&gt;perfect&lt;/em&gt; for this. I installed it on the Spark (using the &lt;a href="https://tailscale.com/kb/1031/install-linux"&gt;Ubuntu instructions here&lt;/a&gt;), signed in with my SSO account (via Google)... and the Spark showed up in the "Network Devices" panel on my laptop and phone instantly.&lt;/p&gt;
&lt;p&gt;I can SSH in from my laptop or using the &lt;a href="https://termius.com/free-ssh-client-for-iphone"&gt;Termius iPhone app&lt;/a&gt; on my phone. I've also been running tools like &lt;a href="https://openwebui.com/"&gt;Open WebUI&lt;/a&gt; which give me a mobile-friendly web interface for interacting with LLMs on the Spark.&lt;/p&gt;
&lt;h4 id="here-comes-the-ecosystem"&gt;Here comes the ecosystem&lt;/h4&gt;
&lt;p&gt;The embargo on these devices dropped yesterday afternoon, and it turns out a whole bunch of relevant projects had preview access similar to mine. This is &lt;em&gt;fantastic news&lt;/em&gt;, as many of the things I've been trying to figure out myself suddenly got a whole lot easier.&lt;/p&gt;
&lt;p&gt;Four particularly notable examples:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Ollama &lt;a href="https://ollama.com/blog/nvidia-spark"&gt;works out of the box&lt;/a&gt;. They actually had a build that worked a few weeks ago, and were the first success I had running an LLM on the machine.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;llama.cpp&lt;/code&gt; creator Georgi Gerganov just published &lt;a href="https://github.com/ggml-org/llama.cpp/discussions/16578"&gt;extensive benchmark results&lt;/a&gt; from running &lt;code&gt;llama.cpp&lt;/code&gt; on a Spark. He's getting ~3,600 tokens/second for prompt processing and ~59 tokens/second for generation with the MXFP4 version of GPT-OSS 20B, and ~817 tokens/second prompt processing with ~18 tokens/second generation for GLM-4.5-Air-GGUF.&lt;/li&gt;
&lt;li&gt;LM Studio now have &lt;a href="https://lmstudio.ai/blog/dgx-spark"&gt;a build for the Spark&lt;/a&gt;. I haven't tried this one yet as I'm currently using my machine exclusively via SSH.&lt;/li&gt;
&lt;li&gt;vLLM - one of the most popular engines for serving production LLMs - had &lt;a href="https://x.com/eqhylxx/status/1977928690945360049"&gt;early access&lt;/a&gt; and there's now an official &lt;a href="https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm?version=25.09-py3"&gt;NVIDIA vLLM NGC Container&lt;/a&gt; for running their stack.&lt;/li&gt;
&lt;/ul&gt;
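&lt;p&gt;As a rough back-of-envelope using those GPT-OSS 20B numbers - the 8,000 token prompt and 500 token response here are my own hypothetical workload, not anything Georgi measured:&lt;/p&gt;

```shell
# Hypothetical workload: 8,000-token prompt, 500-token response
# Rates from the benchmarks: ~3,600 tok/s prompt processing, ~59 tok/s generation
awk 'BEGIN { printf "prompt processing: %.1f seconds\n", 8000 / 3600 }'
awk 'BEGIN { printf "generation:        %.1f seconds\n", 500 / 59 }'
```

&lt;p&gt;So generation time dominates for all but very long prompts.&lt;/p&gt;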
&lt;p&gt;Here's &lt;a href="https://docs.unsloth.ai/new/fine-tuning-llms-with-nvidia-dgx-spark-and-unsloth"&gt;a tutorial from Unsloth&lt;/a&gt; on fine-tuning gpt-oss-20b on the Spark.&lt;/p&gt;
&lt;h4 id="should-you-get-one-"&gt;Should you get one?&lt;/h4&gt;
&lt;p&gt;It's a bit too early for me to provide a confident recommendation on this machine. As indicated above, I've had a tough time figuring out how best to put it to use, largely due to my own inexperience with CUDA, ARM64 and Ubuntu GPU machines in general.&lt;/p&gt;
&lt;p&gt;The ecosystem improvements in just the past 24 hours have been very reassuring though. I expect it will be clear within a few weeks how well supported this machine is going to be.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/hardware"&gt;hardware&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/docker"&gt;docker&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tailscale"&gt;tailscale&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nvidia"&gt;nvidia&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama-cpp"&gt;llama-cpp&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-code"&gt;claude-code&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/disclosures"&gt;disclosures&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nvidia-spark"&gt;nvidia-spark&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="hardware"/><category term="ai"/><category term="docker"/><category term="tailscale"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="nvidia"/><category term="ollama"/><category term="llama-cpp"/><category term="coding-agents"/><category term="claude-code"/><category term="lm-studio"/><category term="disclosures"/><category term="nvidia-spark"/></entry><entry><title>OpenAI's new open weight (Apache 2) models are really good</title><link href="https://simonwillison.net/2025/Aug/5/gpt-oss/#atom-tag" rel="alternate"/><published>2025-08-05T20:33:13+00:00</published><updated>2025-08-05T20:33:13+00:00</updated><id>https://simonwillison.net/2025/Aug/5/gpt-oss/#atom-tag</id><summary type="html">
    &lt;p&gt;The long promised &lt;a href="https://openai.com/index/introducing-gpt-oss/"&gt;OpenAI open weight models are here&lt;/a&gt;, and they are &lt;em&gt;very&lt;/em&gt; impressive. They're available under proper open source licenses - Apache 2.0 - and come in two sizes, 120B and 20B.&lt;/p&gt;
&lt;p&gt;OpenAI's own benchmarks are eyebrow-raising - emphasis mine:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The &lt;strong&gt;gpt-oss-120b&lt;/strong&gt; model achieves &lt;strong&gt;near-parity with OpenAI o4-mini&lt;/strong&gt; on core reasoning benchmarks, while running efficiently on a single 80 GB GPU. The &lt;strong&gt;gpt-oss-20b&lt;/strong&gt; model delivers &lt;strong&gt;similar results to OpenAI o3‑mini&lt;/strong&gt; on common benchmarks and can run on edge devices with just 16 GB of memory, making it ideal for on-device use cases, local inference, or rapid iteration without costly infrastructure.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;o4-mini and o3-mini are &lt;em&gt;really good&lt;/em&gt; proprietary models - I was not expecting the open weights releases to be anywhere near that class, especially given their small sizes. That gpt-oss-20b model should run quite comfortably on a Mac laptop with 32GB of RAM.&lt;/p&gt;
&lt;p&gt;Both models are mixture-of-experts:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;gpt-oss-120b activates 5.1B parameters per token, while gpt-oss-20b activates 3.6B. The models have 117b and 21b total parameters respectively.&lt;/p&gt;
&lt;/blockquote&gt;
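&lt;p&gt;Those active-parameter counts mean only a small fraction of each model's weights fire per token - a rough calculation using the numbers quoted above:&lt;/p&gt;

```shell
# Fraction of total parameters active per token for each MoE model
awk 'BEGIN { printf "gpt-oss-120b: %.1f%% active\n", 5.1 / 117 * 100 }'
awk 'BEGIN { printf "gpt-oss-20b:  %.1f%% active\n", 3.6 / 21 * 100 }'
```

&lt;p&gt;That sparsity is a big part of how mixture-of-experts models can generate tokens quickly relative to their total size.&lt;/p&gt;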
&lt;p&gt;Something that surprised me even more about the benchmarks was the scores for general knowledge based challenges. I can just about believe they managed to train a strong reasoning model that fits in 20B parameters, but these models score highly on benchmarks like "GPQA Diamond (without tools) PhD-level science questions" too:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;o3 — 83.3%&lt;/li&gt;
&lt;li&gt;o4-mini — 81.4%&lt;/li&gt;
&lt;li&gt;gpt-oss-120b — 80.1%&lt;/li&gt;
&lt;li&gt;o3-mini — 77%&lt;/li&gt;
&lt;li&gt;gpt-oss-20b — 71.5%&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A lot of these benchmarks are edging towards saturation.&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#running-gpt-oss-20b-on-my-mac-with-lm-studio"&gt;Running gpt-oss-20b on my Mac with LM Studio&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#pelican-on-reasoning-low"&gt;Pelican on reasoning=low&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#pelican-on-reasoning-medium"&gt;Pelican on reasoning=medium&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#pelican-on-reasoning-high"&gt;Pelican on reasoning=high&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#space-invaders-with-gpt-oss-20b"&gt;Space invaders with gpt-oss-20b&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#trying-gpt-oss-120b-via-api-providers"&gt;Trying gpt-oss-120b via API providers&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#llama-cpp-is-coming-very-shortly"&gt;llama.cpp is coming very shortly&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#gpt-oss-20b-in-ollama"&gt;gpt-oss:20b in Ollama&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#the-model-card"&gt;Training details from the model card&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#openai-harmony-a-new-format-for-prompt-templates"&gt;OpenAI Harmony, a new format for prompt templates&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#the-open-question-for-me-how-good-is-tool-calling-"&gt;The open question for me: how good is tool calling?&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#china"&gt;Competing with the Chinese open models&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="running-gpt-oss-20b-on-my-mac-with-lm-studio"&gt;Running gpt-oss-20b on my Mac with LM Studio&lt;/h4&gt;
&lt;p&gt;There are already a bunch of different ways to run these models - OpenAI partnered with numerous organizations in advance of the release.&lt;/p&gt;
&lt;p&gt;I decided to start with &lt;a href="https://lmstudio.ai/"&gt;LM Studio&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I had to update to the most recent version of the app, then install the new model from &lt;a href="https://lmstudio.ai/models/openai/gpt-oss-20b"&gt;their openai/gpt-oss-20b&lt;/a&gt; page.&lt;/p&gt;
&lt;p&gt;First impressions: this is a &lt;em&gt;really good&lt;/em&gt; model, and it somehow runs using just 11.72GB of my system RAM.&lt;/p&gt;
&lt;p&gt;The model supports three reasoning efforts: low, medium and high. LM Studio makes those available via a dropdown.&lt;/p&gt;
&lt;p&gt;Let's try "Generate an SVG of a pelican riding a bicycle":&lt;/p&gt;
&lt;h4 id="pelican-on-reasoning-low"&gt;Pelican on reasoning=low&lt;/h4&gt;
&lt;p&gt;I started &lt;a href="https://gist.github.com/simonw/b71394cc85fe0f048e376392e41586da"&gt;with low&lt;/a&gt;. It thought for 0.07 seconds and then output this (at 39 tokens a second):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-20-low.png" alt="" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Except... it output invalid SVG. One of the path elements looked like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;!-- Frame --&amp;gt;
&amp;lt;path d="
    M150,250          &amp;lt;!-- rear wheel center --&amp;gt;
    L300,120          &amp;lt;!-- top tube to front --&amp;gt;
    L450,250          &amp;lt;!-- chain stays back to front --&amp;gt;
    L300,350          &amp;lt;!-- seat stays down --&amp;gt;
    Z"
    fill="#e0e0e0" stroke="#555" stroke-width="4"/&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;But you can't put comments inside attributes like that. I fixed this to get the above image.&lt;/p&gt;
&lt;h4 id="pelican-on-reasoning-medium"&gt;Pelican on reasoning=medium&lt;/h4&gt;
&lt;p&gt;I tried again &lt;a href="https://gist.github.com/simonw/642e9e371387fc59a5aad25dcd41b0cc"&gt;on medium&lt;/a&gt;. This time it thought for 4.44 seconds and gave me this, at 55 tokens/second:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-20-medium.png" alt="" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4 id="pelican-on-reasoning-high"&gt;Pelican on reasoning=high&lt;/h4&gt;
&lt;p&gt;And then &lt;a href="https://gist.github.com/simonw/164a3f1b4b1a971637af279a0ac49be3"&gt;on high&lt;/a&gt;. My first attempt here failed because it hit the default LM Studio 4096 context limit - so I reloaded the model and set the new limit to the maximum possible.&lt;/p&gt;
&lt;p&gt;It jumped up to using 17.35GB RAM (I imagine it might use a lot more with a longer prompt) and after thinking for 5 minutes and 50 seconds (!) gave me this:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-20-high.png" alt="" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;It thought &lt;em&gt;really hard&lt;/em&gt; to get there. Here's a snippet of &lt;a href="https://gist.github.com/simonw/164a3f1b4b1a971637af279a0ac49be3"&gt;that transcript&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;But maybe we want seat below body bottom? Actually seat should be between body and pedals, not above body bottom. Wait, typical bicycle: seat post goes from bottom bracket upward to the seat. The seat sits on top of the seat tube (vertical). So the seat is above bottom bracket. In our representation, the seat is at y=250-255. Body center at 245, body bottom at 265. So body bottom is below seat. That means the pelican's body is below the seat? Actually no: body bottom extends to 265; seat top is at 250; so seat sits above body bottom by about 15 units. That seems plausible because seat should be slightly above body, but body bottom is at 265 &amp;gt; seat top 255. [...]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I've seen this from other local reasoning models too: they can really overthink this kind of problem! I don't know if there's any pelican SVG that's worth waiting nearly six minutes for.&lt;/p&gt;
&lt;h4 id="space-invaders-with-gpt-oss-20b"&gt;Space invaders with gpt-oss-20b&lt;/h4&gt;
&lt;p&gt;Given how long high took I switched back to medium for my next experiment:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Write an HTML and JavaScript page implementing space invaders&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It &lt;a href="https://gist.github.com/simonw/63d7d8c43ae2ac93c214325bd6d607e4"&gt;thought for 10.78 seconds&lt;/a&gt; and produced this:&lt;/p&gt;

&lt;div style="max-width: 100%; margin-bottom: 0.4em"&gt;
    &lt;video controls="controls" preload="none" aria-label="Space Invaders" poster="https://static.simonwillison.net/static/2025/space-invaders-gpt-20.jpg" loop="loop" style="width: 100%; height: auto;" muted="muted"&gt;
        &lt;source src="https://static.simonwillison.net/static/2025/space-invaders-gpt-20.mp4" type="video/mp4" /&gt;
    &lt;/video&gt;
&lt;/div&gt;

&lt;p&gt;You can &lt;a href="https://tools.simonwillison.net/space-invaders-gpt-oss-20b-mxfp4-medium"&gt;play that here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It's not the best I've seen - I was more impressed &lt;a href="https://simonwillison.net/2025/Jul/29/space-invaders/"&gt;by GLM 4.5 Air&lt;/a&gt; - but it's very competent for a model that only uses 12GB of my RAM (GLM 4.5 Air used 47GB).&lt;/p&gt;
&lt;h4 id="trying-gpt-oss-120b-via-api-providers"&gt;Trying gpt-oss-120b via API providers&lt;/h4&gt;
&lt;p&gt;I don't quite have the resources on my laptop to run the larger model. Thankfully it's already being hosted by a number of different API providers.&lt;/p&gt;
&lt;p&gt;OpenRouter already &lt;a href="https://openrouter.ai/openai/gpt-oss-120b/providers"&gt;lists three&lt;/a&gt; - Fireworks, Groq and Cerebras. (Update: now also Parasail and Baseten.)&lt;/p&gt;
&lt;p&gt;Cerebras is &lt;em&gt;fast&lt;/em&gt;, so I decided to try them first.&lt;/p&gt;
&lt;p&gt;I installed the &lt;a href="https://github.com/irthomasthomas/llm-cerebras"&gt;llm-cerebras&lt;/a&gt; plugin and ran the &lt;code&gt;refresh&lt;/code&gt; command to ensure it had their latest models:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install -U llm-cerebras jsonschema
llm cerebras refresh&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;(Installing jsonschema worked around a warning message.)&lt;/p&gt;
&lt;p&gt;Output:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Refreshed 10 Cerebras models:
  - cerebras-deepseek-r1-distill-llama-70b
  - cerebras-gpt-oss-120b
  - cerebras-llama-3.3-70b
  - cerebras-llama-4-maverick-17b-128e-instruct
  - cerebras-llama-4-scout-17b-16e-instruct
  - cerebras-llama3.1-8b
  - cerebras-qwen-3-235b-a22b-instruct-2507
  - cerebras-qwen-3-235b-a22b-thinking-2507
  - cerebras-qwen-3-32b
  - cerebras-qwen-3-coder-480b
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m cerebras-gpt-oss-120b \
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Generate an SVG of a pelican riding a bicycle&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Cerebras runs the new model at between 2,000 and 4,000 tokens per second!&lt;/p&gt;
&lt;p&gt;To my surprise this one &lt;a href="https://gist.github.com/simonw/4c685f19f1a93b68eacb627125e36be4"&gt;had the same comments-in-attributes bug&lt;/a&gt; that we saw with oss-20b earlier. I fixed those and got this pelican:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-120-cerebras.jpg" alt="Yellow and not great pelican, quite a good bicycle if a bit sketchy." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;That bug appears intermittently - I've not seen it on some of my other runs of the same prompt.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt; plugin also provides access to the models, balanced across the underlying providers. You can use that like so:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-openrouter
llm keys &lt;span class="pl-c1"&gt;set&lt;/span&gt; openrouter
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Paste API key here&lt;/span&gt;
llm -m openrouter/openai/gpt-oss-120b &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Say hi&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h4 id="llama-cpp-is-coming-very-shortly"&gt;llama.cpp is coming very shortly&lt;/h4&gt;
&lt;p&gt;The &lt;code&gt;llama.cpp&lt;/code&gt; &lt;a href="https://github.com/ggml-org/llama.cpp/pull/15091"&gt;pull request for gpt-oss&lt;/a&gt; was landed less than an hour ago. It's worth browsing through the code - a &lt;em&gt;lot&lt;/em&gt; of work went into supporting this new model, spanning 48 commits to 83 different files. Hopefully this will land in the &lt;a href="https://formulae.brew.sh/formula/llama.cpp"&gt;llama.cpp Homebrew package&lt;/a&gt; within the next day or so, which should provide a convenient way to run the model via &lt;code&gt;llama-server&lt;/code&gt; and friends.&lt;/p&gt;
&lt;h4 id="gpt-oss-20b-in-ollama"&gt;gpt-oss:20b in Ollama&lt;/h4&gt;
&lt;p&gt;Ollama &lt;a href="https://ollama.com/library/gpt-oss"&gt;also have gpt-oss&lt;/a&gt;, requiring an update to their app.&lt;/p&gt;
&lt;p&gt;I fetched that 14GB model like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;ollama pull gpt-oss:20b&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Now I can use it with the new Ollama native app, or access it from &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-ollama
llm -m gpt-oss:20b &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Hi&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This also appears to use around 13.26GB of system memory while running a prompt.&lt;/p&gt;
&lt;p&gt;Ollama also launched &lt;a href="https://ollama.com/turbo"&gt;Ollama Turbo&lt;/a&gt; today, offering the two OpenAI models as a paid hosted service:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;Turbo is a new way to run open models using datacenter-grade hardware. Many new models are too large to fit on widely available GPUs, or run very slowly. Ollama Turbo provides a way to run these models fast while using Ollama's App, CLI, and API. &lt;/p&gt;&lt;/blockquote&gt;
&lt;h4 id="the-model-card"&gt;Training details from the model card&lt;/h4&gt;
&lt;p&gt;Here are some interesting notes about how the models were trained from &lt;a href="https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7637/oai_gpt-oss_model_card.pdf"&gt;the model card&lt;/a&gt; (PDF):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Data&lt;/strong&gt;: We train the models on a text-only dataset with trillions of tokens, with a focus on STEM, coding, and general knowledge. To improve the safety of the model, we filtered the data for harmful content in pre-training, especially around hazardous biosecurity knowledge, by reusing the CBRN pre-training filters from GPT-4o. Our model has a knowledge cutoff of June 2024.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Training&lt;/strong&gt;: The gpt-oss models trained on NVIDIA H100 GPUs using the PyTorch framework with expert-optimized Triton kernels. The training run for gpt-oss-120b required 2.1 million H100-hours to complete, with gpt-oss-20b needing almost 10x fewer. [...]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Thunder Compute's article &lt;a href="https://www.thundercompute.com/blog/nvidia-h100-pricing"&gt;NVIDIA H100 Pricing (August 2025): Cheapest On-Demand Cloud GPU Rates&lt;/a&gt; lists prices from around $2/hour to $11/hour, which would put the training cost for the 120b model at between $4.2m and $23.1m, and for the 20b at between $420,000 and $2.3m.&lt;/p&gt;
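&lt;p&gt;As a sanity check, here's the arithmetic behind those estimates - my own back-of-envelope calculation, using the 2.1 million H100-hours figure from the model card and assuming the 20b model used roughly a tenth of that:&lt;/p&gt;

```python
# Back-of-envelope training cost estimate from H100 rental prices
h100_hours_120b = 2_100_000            # from the gpt-oss model card
h100_hours_20b = h100_hours_120b / 10  # "almost 10x fewer"
low, high = 2, 11                      # $/hour range from Thunder Compute's article

print(f"120b: ${h100_hours_120b * low / 1e6:.1f}m - ${h100_hours_120b * high / 1e6:.1f}m")
print(f" 20b: ${h100_hours_20b * low / 1e6:.2f}m - ${h100_hours_20b * high / 1e6:.2f}m")
# 120b: $4.2m - $23.1m
#  20b: $0.42m - $2.31m
```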
&lt;blockquote&gt;
&lt;p&gt;After pre-training, we post-train the models using similar CoT RL techniques as OpenAI o3. This procedure teaches the models how to reason and solve problems using CoT and teaches the model how to use tools. Because of the similar RL techniques, these models have a personality similar to models served in our first-party products like ChatGPT. Our training dataset consists of a wide range of problems from coding, math, science, and more.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The models have additional special training to help them use web browser and Python (Jupyter notebook) tools more effectively:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;During post-training, we also teach the models to use different agentic tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A browsing tool, that allows the model to call search and open functions to interact with
the web. This aids factuality and allows the models to fetch info beyond their knowledge
cutoff.&lt;/li&gt;
&lt;li&gt;A python tool, which allows the model to run code in a stateful Jupyter notebook environment.&lt;/li&gt;
&lt;li&gt;Arbitrary developer functions, where one can specify function schemas in a &lt;code&gt;Developer&lt;/code&gt;
message similar to the OpenAI API. The definition of function is done within our harmony
format.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;There's a corresponding &lt;a href="https://github.com/openai/gpt-oss?tab=readme-ov-file#python"&gt;section about Python tool usage&lt;/a&gt; in the &lt;code&gt;openai/gpt-oss&lt;/code&gt; repository README.&lt;/p&gt;


&lt;h4 id="openai-harmony-a-new-format-for-prompt-templates"&gt;OpenAI Harmony, a new format for prompt templates&lt;/h4&gt;
&lt;p&gt;One of the gnarliest parts of implementing harnesses for LLMs is handling the prompt template format.&lt;/p&gt;
&lt;p&gt;Modern prompts are complicated beasts. They need to model user vs. assistant conversation turns, tool calls, reasoning traces and an increasing number of other complex patterns.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/openai/harmony"&gt;openai/harmony&lt;/a&gt; is a brand new open source project from OpenAI (again, Apache 2) which implements a new response format that was created for the &lt;code&gt;gpt-oss&lt;/code&gt; models. It's clearly inspired by their new-ish &lt;a href="https://openai.com/index/new-tools-for-building-agents/"&gt;Responses API&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The format is described in the new &lt;a href="https://cookbook.openai.com/articles/openai-harmony"&gt;OpenAI Harmony Response Format&lt;/a&gt; cookbook document. It introduces some concepts that I've not seen in open weight models before:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;system&lt;/code&gt;, &lt;code&gt;developer&lt;/code&gt;, &lt;code&gt;user&lt;/code&gt;, &lt;code&gt;assistant&lt;/code&gt; and &lt;code&gt;tool&lt;/code&gt; roles - many other models only use user and assistant, and sometimes system and tool.&lt;/li&gt;
&lt;li&gt;Three different channels for output: &lt;code&gt;final&lt;/code&gt;, &lt;code&gt;analysis&lt;/code&gt; and &lt;code&gt;commentary&lt;/code&gt;. Only the &lt;code&gt;final&lt;/code&gt; channel is intended to be visible to users by default. &lt;code&gt;analysis&lt;/code&gt; is for chain of thought and &lt;code&gt;commentary&lt;/code&gt; is sometimes used for tools.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That channels concept has been present in ChatGPT for a few months, starting with the release of o3.&lt;/p&gt;
&lt;p&gt;The details of the new tokens used by Harmony caught my eye:&lt;/p&gt;
&lt;center&gt;
&lt;table&gt;
  &lt;tbody&gt;&lt;tr&gt;
    &lt;th&gt;Token&lt;/th&gt;
    &lt;th&gt;Purpose&lt;/th&gt;
    &lt;th&gt;ID&lt;/th&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&amp;lt;|start|&amp;gt;&lt;/td&gt;
    &lt;td&gt;Start of message header&lt;/td&gt;
    &lt;td&gt;200006&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&amp;lt;|end|&amp;gt;&lt;/td&gt;
    &lt;td&gt;End of message&lt;/td&gt;
    &lt;td&gt;200007&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&amp;lt;|message|&amp;gt;&lt;/td&gt;
    &lt;td&gt;Start of message content&lt;/td&gt;
    &lt;td&gt;200008&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&amp;lt;|channel|&amp;gt;&lt;/td&gt;
    &lt;td&gt;Start of channel info&lt;/td&gt;
    &lt;td&gt;200005&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&amp;lt;|constrain|&amp;gt;&lt;/td&gt;
    &lt;td&gt;Data type for tool call&lt;/td&gt;
    &lt;td&gt;200003&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&amp;lt;|return|&amp;gt;&lt;/td&gt;
    &lt;td&gt;Stop after response&lt;/td&gt;
    &lt;td&gt;200002&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&amp;lt;|call|&amp;gt;&lt;/td&gt;
    &lt;td&gt;Call a tool&lt;/td&gt;
    &lt;td&gt;200012&lt;/td&gt;
  &lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/center&gt;
&lt;p&gt;Those token IDs are particularly important. They are part of a new token vocabulary called &lt;code&gt;o200k_harmony&lt;/code&gt;, which landed in OpenAI's tiktoken tokenizer library &lt;a href="https://github.com/openai/tiktoken/commit/3591ff175d6a80efbe4fcc7f0e219ddd4b8c52f1"&gt;this morning&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In the past I've seen models get confused by special tokens - try pasting &lt;code&gt;&amp;lt;|end|&amp;gt;&lt;/code&gt; into a model and see what happens.&lt;/p&gt;
&lt;p&gt;Having these special instruction tokens formally map to dedicated token IDs should hopefully be a whole lot more robust!&lt;/p&gt;
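&lt;p&gt;To make the format concrete, here's a minimal Python sketch of my own - not the official library, just an illustration built from the delimiters in the table above - that assembles a Harmony-style message as plain text:&lt;/p&gt;

```python
# Illustrative only: render a Harmony-style message using the special
# token strings from the table above. The real openai-harmony library
# emits dedicated token IDs (e.g. 200006 for <|start|>), not plain text.
def render_message(role, content, channel=None):
    header = role if channel is None else f"{role}<|channel|>{channel}"
    return f"<|start|>{header}<|message|>{content}<|end|>"

print(render_message("user", "Arrr, how be you?"))
# <|start|>user<|message|>Arrr, how be you?<|end|>
print(render_message("assistant", "Reasoning goes here", channel="analysis"))
# <|start|>assistant<|channel|>analysis<|message|>Reasoning goes here<|end|>
```

&lt;p&gt;For real use you'd want the &lt;code&gt;openai-harmony&lt;/code&gt; library, which renders straight to IDs from the &lt;code&gt;o200k_harmony&lt;/code&gt; vocabulary rather than strings that then need tokenizing.&lt;/p&gt;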
&lt;p&gt;The Harmony repo itself includes a Rust library and a Python library (wrapping that Rust library) for working with the new format in a much more ergonomic way.&lt;/p&gt;
&lt;p&gt;I tried one of their demos using &lt;code&gt;uv run&lt;/code&gt; to turn it into a shell one-liner:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;uv run --python 3.12 --with openai-harmony python -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;from openai_harmony import *&lt;/span&gt;
&lt;span class="pl-s"&gt;from openai_harmony import DeveloperContent&lt;/span&gt;
&lt;span class="pl-s"&gt;enc = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)&lt;/span&gt;
&lt;span class="pl-s"&gt;convo = Conversation.from_messages([&lt;/span&gt;
&lt;span class="pl-s"&gt;    Message.from_role_and_content(&lt;/span&gt;
&lt;span class="pl-s"&gt;        Role.SYSTEM,&lt;/span&gt;
&lt;span class="pl-s"&gt;        SystemContent.new(),&lt;/span&gt;
&lt;span class="pl-s"&gt;    ),&lt;/span&gt;
&lt;span class="pl-s"&gt;    Message.from_role_and_content(&lt;/span&gt;
&lt;span class="pl-s"&gt;        Role.DEVELOPER,&lt;/span&gt;
&lt;span class="pl-s"&gt;        DeveloperContent.new().with_instructions("Talk like a pirate!")&lt;/span&gt;
&lt;span class="pl-s"&gt;    ),&lt;/span&gt;
&lt;span class="pl-s"&gt;    Message.from_role_and_content(Role.USER, "Arrr, how be you?"),&lt;/span&gt;
&lt;span class="pl-s"&gt;])&lt;/span&gt;
&lt;span class="pl-s"&gt;tokens = enc.render_conversation_for_completion(convo, Role.ASSISTANT)&lt;/span&gt;
&lt;span class="pl-s"&gt;print(tokens)&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Which outputs:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;[200006, 17360, 200008, 3575, 553, 17554, 162016, 11, 261, 4410, 6439, 2359, 22203, 656, 7788, 17527, 558, 87447, 100594, 25, 220, 1323, 19, 12, 3218, 279, 30377, 289, 25, 14093, 279, 2, 13888, 18403, 25, 8450, 11, 49159, 11, 1721, 13, 21030, 2804, 413, 7360, 395, 1753, 3176, 13, 200007, 200006, 77944, 200008, 2, 68406, 279, 37992, 1299, 261, 96063, 0, 200007, 200006, 1428, 200008, 8977, 81, 11, 1495, 413, 481, 30, 200007, 200006, 173781]&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Note those token IDs like &lt;code&gt;200006&lt;/code&gt; corresponding to the special tokens listed above.&lt;/p&gt;
&lt;h4 id="the-open-question-for-me-how-good-is-tool-calling-"&gt;The open question for me: how good is tool calling?&lt;/h4&gt;
&lt;p&gt;There's one aspect of these models that I haven't explored in detail yet: &lt;strong&gt;tool calling&lt;/strong&gt;. How these work is clearly a big part of the new Harmony format, but the packages I'm using myself (around my own &lt;a href="https://simonwillison.net/2025/May/27/llm-tools/"&gt;LLM tool calling&lt;/a&gt; support) need various tweaks and fixes to start working with that new mechanism.&lt;/p&gt;
&lt;p&gt;Tool calling currently represents my biggest disappointment with local models that I've run on my own machine. I've been able to get them to perform simple single calls, but the state of the art these days is wildly more ambitious than that.&lt;/p&gt;
&lt;p&gt;Systems like Claude Code can make dozens if not hundreds of tool calls over the course of a single session, each one adding more context and information to a single conversation with an underlying model.&lt;/p&gt;
&lt;p&gt;My experience to date has been that local models are unable to handle these lengthy conversations. I'm not sure if that's inherent to the limitations of my own machine, or if it's something that the right model architecture and training could overcome.&lt;/p&gt;
&lt;p&gt;OpenAI make big claims about the tool calling capabilities of these new models. I'm looking forward to seeing how well they perform in practice.&lt;/p&gt;

&lt;h4 id="china"&gt;Competing with the Chinese open models&lt;/h4&gt;

&lt;p&gt;I've been writing a &lt;em&gt;lot&lt;/em&gt; about the &lt;a href="https://simonwillison.net/tags/ai-in-china/"&gt;flurry of excellent open weight models&lt;/a&gt; released by Chinese AI labs over the past few months - all of them very capable and most of them under Apache 2 or MIT licenses.&lt;/p&gt;

&lt;p&gt;Just last week &lt;a href="https://simonwillison.net/2025/Jul/30/chinese-models/"&gt;I said&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Something that has become undeniable this month is that the best available open weight models now come from the Chinese AI labs.&lt;/p&gt;
&lt;p&gt;I continue to have a lot of love for Mistral, Gemma and Llama but my feeling is that Qwen, Moonshot and Z.ai have positively smoked them over the course of July. [...]&lt;/p&gt;
&lt;p&gt;I can't help but wonder if part of the reason for the delay in release of OpenAI's open weights model comes from a desire to be notably better than this truly impressive lineup of Chinese models.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;With the release of the gpt-oss models that statement no longer holds true. I'm waiting for the dust to settle and the independent benchmarks (that are more credible than my ridiculous pelicans) to roll out, but I think it's likely that OpenAI now offer the best available open weights models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: Independent evaluations are beginning to roll in. Here's &lt;a href="https://x.com/artificialanlys/status/1952887733803991070"&gt;Artificial Analysis&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;gpt-oss-120b is the most intelligent American open weights model, comes behind DeepSeek R1 and Qwen3 235B in intelligence but offers efficiency benefits [...]&lt;/p&gt;
&lt;p&gt;While the larger gpt-oss-120b does not come in above DeepSeek R1 0528’s score of 59 or Qwen3 235B 2507s score of 64, it is notable that it is significantly smaller in both total and active parameters than both of those models.&lt;/p&gt;&lt;/blockquote&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-tool-use"&gt;llm-tool-use&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cerebras"&gt;cerebras&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/space-invaders"&gt;space-invaders&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-oss"&gt;gpt-oss&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="open-source"/><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="llm-tool-use"/><category term="cerebras"/><category term="ollama"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/><category term="lm-studio"/><category term="space-invaders"/><category term="gpt-oss"/></entry><entry><title>Ollama's new app</title><link href="https://simonwillison.net/2025/Jul/31/ollamas-new-app/#atom-tag" rel="alternate"/><published>2025-07-31T00:58:32+00:00</published><updated>2025-07-31T00:58:32+00:00</updated><id>https://simonwillison.net/2025/Jul/31/ollamas-new-app/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://ollama.com/blog/new-app"&gt;Ollama&amp;#x27;s new app&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Ollama has been one of my favorite ways to run local models for a while - it makes it really easy to download models, and it's smart about keeping them resident in memory while they are being used and then cleaning them out after they stop receiving traffic.&lt;/p&gt;
&lt;p&gt;The one missing feature to date has been an interface: Ollama has been exclusively command-line, which is fine for the CLI literate among us and not much use for everyone else.&lt;/p&gt;
&lt;p&gt;They've finally fixed that! The new app's interface is accessible from the existing system tray menu and lets you chat with any of your installed models. Vision models can accept images through the new interface as well.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of a chat interface showing a response about encouraging pelicans in a garden. The prompt reads &amp;quot;Describe ways I can encourage pelicans to hang out in my garden&amp;quot; followed by the response: &amp;quot;Pelicans are large, aquatic birds typically found near lakes, rivers, coastal areas, and wetlands—far from typical residential gardens. While it's unlikely you'll attract wild pelicans to your backyard garden, you can create a bird-friendly environment that supports a variety of waterfowl and wildlife. Here's how you can encourage birdlife (including species like herons, ducks, or even occasional pelicans if you're near a suitable habitat) to visit your garden: 1. Create a Water Feature Pelicans love water, so a large, shallow pond or water garden is key: • Size matters: Aim for at least 3–5 feet deep in parts and a shallow edge (6–12 inches) for wading. • Add native aquatic plants (e.g., cattails, water lilies, reeds) to provide shelter and food.&amp;quot; At the bottom is a &amp;quot;Send a message&amp;quot; text input field and &amp;quot;qwen3:30b-a3b-instruct-2507-q4_K_M&amp;quot; with a dropdown arrow." src="https://static.simonwillison.net/static/2025/ollama-app.jpg" /&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=44739632"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="ollama"/></entry><entry><title>How to run an LLM on your laptop</title><link href="https://simonwillison.net/2025/Jul/18/how-to-run-an-llm-on-your-laptop/#atom-tag" rel="alternate"/><published>2025-07-18T15:33:27+00:00</published><updated>2025-07-18T15:33:27+00:00</updated><id>https://simonwillison.net/2025/Jul/18/how-to-run-an-llm-on-your-laptop/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.technologyreview.com/2025/07/17/1120391/how-to-run-an-llm-on-your-laptop/"&gt;How to run an LLM on your laptop&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I talked to Grace Huckins for this piece from MIT Technology Review on running local models. Apparently she enjoyed my dystopian backup plan!&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Simon Willison has a plan for the end of the world. It’s a USB stick, onto which he has loaded a couple of his favorite open-weight LLMs—models that have been shared publicly by their creators and that can, in principle, be downloaded and run with local hardware. If human civilization should ever collapse, Willison plans to use all the knowledge encoded in their billions of parameters for help. “It’s like having a weird, condensed, faulty version of Wikipedia, so I can help reboot society with the help of my little USB stick,” he says.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The article suggests &lt;a href="https://ollama.com/"&gt;Ollama&lt;/a&gt; or &lt;a href="https://lmstudio.ai/"&gt;LM Studio&lt;/a&gt; for laptops, and new-to-me &lt;a href="https://apps.apple.com/us/app/llm-farm/id6461209867"&gt;LLM Farm&lt;/a&gt; for the iPhone:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;My beat-up iPhone 12 was able to run Meta’s Llama 3.2 1B using an app called LLM Farm. It’s not a particularly good model—it very quickly goes off into bizarre tangents and hallucinates constantly—but trying to coax something so chaotic toward usability can be entertaining.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;Update 19th July 2025&lt;/strong&gt;: Evan Hahn compared the size of &lt;a href="https://evanhahn.com/local-llms-versus-offline-wikipedia/"&gt;various offline LLMs to different Wikipedia exports&lt;/a&gt;. Full English Wikipedia without images, revision history or talk pages is 13.82GB, smaller than Mistral Small 3.2 (15GB) but larger than Qwen 3 14B and Gemma 3n.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/wikipedia"&gt;wikipedia&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/press-quotes"&gt;press-quotes&lt;/a&gt;&lt;/p&gt;



</summary><category term="wikipedia"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="ollama"/><category term="lm-studio"/><category term="press-quotes"/></entry><entry><title>Introducing Gemma 3n: The developer guide</title><link href="https://simonwillison.net/2025/Jun/26/gemma-3n/#atom-tag" rel="alternate"/><published>2025-06-26T21:08:36+00:00</published><updated>2025-06-26T21:08:36+00:00</updated><id>https://simonwillison.net/2025/Jun/26/gemma-3n/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://developers.googleblog.com/en/introducing-gemma-3n-developer-guide/"&gt;Introducing Gemma 3n: The developer guide&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Extremely consequential new open weights model release from Google today:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Multimodal by design:&lt;/strong&gt; Gemma 3n natively supports image, audio, video, and text inputs and text outputs.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Optimized for on-device:&lt;/strong&gt; Engineered with a focus on efficiency, Gemma 3n models are available in two sizes based on &lt;a href="https://developers.googleblog.com/en/introducing-gemma-3n-developer-guide/#per-layer-embeddings-(ple):-unlocking-more-memory-efficiency"&gt;&lt;strong&gt;effective&lt;/strong&gt;&lt;/a&gt; parameters: E2B and E4B. While their raw parameter count is 5B and 8B respectively, architectural innovations allow them to run with a memory footprint comparable to traditional 2B and 4B models, operating with as little as 2GB (E2B) and 3GB (E4B) of memory.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is &lt;strong&gt;very&lt;/strong&gt; exciting: a 2B and 4B model optimized for end-user devices which accepts text, images &lt;em&gt;and&lt;/em&gt; audio as inputs!&lt;/p&gt;
&lt;p&gt;Gemma 3n is also the most comprehensive day one launch I've seen for any model: Google partnered with "AMD, Axolotl, Docker, Hugging Face, llama.cpp, LMStudio, MLX, NVIDIA, Ollama, RedHat, SGLang, Unsloth, and vLLM" so there are dozens of ways to try this out right now.&lt;/p&gt;
&lt;p&gt;So far I've run two variants on my Mac laptop. Ollama offer &lt;a href="https://ollama.com/library/gemma3n"&gt;a 7.5GB version&lt;/a&gt; (full tag &lt;code&gt;gemma3n:e4b-it-q4_K_M0&lt;/code&gt;) of the 4B model, which I ran like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama pull gemma3n
llm install llm-ollama
llm -m gemma3n:latest "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It drew me this:&lt;/p&gt;
&lt;p&gt;&lt;img alt="The pelican looks a bit like a grey pig. It is floating above a bicycle that looks more like a rail cart." src="https://static.simonwillison.net/static/2025/gemma3n-ollama.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;The Ollama version doesn't appear to support image or audio input yet.&lt;/p&gt;
&lt;p&gt;... but the &lt;a href="https://github.com/Blaizzy/mlx-vlm"&gt;mlx-vlm&lt;/a&gt; version does!&lt;/p&gt;
&lt;p&gt;First I tried that on &lt;a href="https://static.simonwillison.net/static/2025/pelican-joke-request.wav"&gt;this WAV file&lt;/a&gt; like so (using a recipe adapted from &lt;a href="https://www.youtube.com/watch?v=8-8R2UvUBrc"&gt;Prince Canuma's video&lt;/a&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uv run --with mlx-vlm mlx_vlm.generate \
  --model gg-hf-gm/gemma-3n-E4B-it \
  --max-tokens 100 \
  --temperature 0.7 \
  --prompt "Transcribe the following speech segment in English:" \
  --audio pelican-joke-request.wav
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That downloaded a 15.74 GB bfloat16 version of the model and output the following correct transcription:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Tell me a joke about a pelican.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Then I had it draw me a pelican for good measure:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uv run --with mlx-vlm mlx_vlm.generate \
  --model gg-hf-gm/gemma-3n-E4B-it \
  --max-tokens 100 \
  --temperature 0.7 \
  --prompt "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I quite like this one:&lt;/p&gt;
&lt;p&gt;&lt;img alt="The bicycle is a sin wave, but you can make out which bits are the frame and the wheels. The pelican is white with a black triangular bill. It's a much better attempt than the Ollama one." src="https://static.simonwillison.net/static/2025/gemma3n-mlx-vlm.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;It's interesting to see such a striking visual difference between those 7.5GB and 15GB model quantizations.&lt;/p&gt;
&lt;p&gt;Finally, I had it describe the image it had just created:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uv run --with mlx-vlm mlx_vlm.generate \
  --model gg-hf-gm/gemma-3n-E4B-it \
  --max-tokens 1000 \
  --prompt "Describe image" \
  --image gemma3n-mlx-vlm.jpg
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's what I got back - it thought this was a chemical diagram!&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The image is a cartoon-style illustration depicting a molecular structure against a light blue background. The structure is composed of several differently colored and shaped elements connected by curved black lines. &lt;/p&gt;
&lt;p&gt;Here's a breakdown of the elements:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Large Black Circles:&lt;/strong&gt; There are two large, solid black circles, likely representing larger molecules or components.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cream-Colored Oval:&lt;/strong&gt; A prominent cream-colored oval shape is centrally located and connected to other elements. It has a smaller, darker cream-colored sphere attached to its left side.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Red Oval:&lt;/strong&gt; A smaller, solid red oval is positioned near one of the black circles and connected to the cream-colored oval.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Black Lines:&lt;/strong&gt; These lines act as bonds, connecting the various molecular components. They are curved and dynamic, suggesting movement or interaction.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Triangular Shape:&lt;/strong&gt; A small black triangle is attached to the smaller cream-colored sphere.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Letter "I":&lt;/strong&gt; The letter "I" appears twice, likely labeling specific parts of the molecule. &lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The overall impression is of a simplified representation of a biological molecule, possibly a protein or a complex organic compound. The use of different colors helps to distinguish the various components within the structure.&lt;/p&gt;
&lt;/blockquote&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/audio"&gt;audio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemma"&gt;gemma&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prince-canuma"&gt;prince-canuma&lt;/a&gt;&lt;/p&gt;



</summary><category term="audio"/><category term="google"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="vision-llms"/><category term="mlx"/><category term="ollama"/><category term="pelican-riding-a-bicycle"/><category term="gemma"/><category term="llm-release"/><category term="prince-canuma"/></entry><entry><title>Mistral-Small 3.2</title><link href="https://simonwillison.net/2025/Jun/20/mistral-small-32/#atom-tag" rel="alternate"/><published>2025-06-20T19:12:42+00:00</published><updated>2025-06-20T19:12:42+00:00</updated><id>https://simonwillison.net/2025/Jun/20/mistral-small-32/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506"&gt;Mistral-Small 3.2&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Released on Hugging Face a couple of hours ago. So far there aren't any quantizations to run it on a Mac, but I'm sure those will emerge pretty quickly.&lt;/p&gt;
&lt;p&gt;This is a minor bump to Mistral Small 3.1, one of my favorite local models. I've been running Small 3.1 &lt;a href="https://ollama.com/library/mistral-small3.1/tags"&gt;via Ollama&lt;/a&gt; where it's a 15GB download - these 24 billion parameter models are a great balance between capabilities and not using up all of the available RAM on my laptop. I expect Ollama will add 3.2 imminently.&lt;/p&gt;
&lt;p&gt;According to Mistral:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Small-3.2 improves in the following categories:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Instruction following&lt;/strong&gt;: Small-3.2 is better at following precise instructions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Repetition errors&lt;/strong&gt;: Small-3.2 produces less infinite generations or repetitive answers&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Function calling&lt;/strong&gt;: Small-3.2's function calling template is more robust (see &lt;a href="https://github.com/mistralai/mistral-common/blob/535b4d0a0fc94674ea17db6cf8dc2079b81cbcfa/src/mistral_common/tokens/tokenizers/instruct.py#L778"&gt;here&lt;/a&gt; and &lt;a href="https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506#function-calling"&gt;examples&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Interestingly they recommend running it with a temperature of 0.15 - many models recommend a default of 0.7. They also provide a &lt;a href="https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506/blob/main/SYSTEM_PROMPT.txt"&gt;suggested system prompt&lt;/a&gt; which includes a note that "Your knowledge base was last updated on 2023-10-01".&lt;/p&gt;
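&lt;p&gt;If you want to honor that recommendation when running a local copy, LLM lets you pass sampling options with &lt;code&gt;-o&lt;/code&gt; - something like this, assuming the plugin you are running it through exposes a &lt;code&gt;temperature&lt;/code&gt; option (llm-ollama does):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m mistral-small3.2 -o temperature 0.15 \
  "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;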
&lt;p&gt;It's not currently available via Mistral's API, or through any of the third-party LLM hosting vendors that I've checked, so I've not been able to run a prompt through the model myself yet.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update:&lt;/strong&gt; I downloaded one of first GGUF quantizations to show up on Hugging Face, &lt;a href="https://huggingface.co/gabriellarson/Mistral-Small-3.2-24B-Instruct-2506-GGUF"&gt;gabriellarson/Mistral-Small-3.2-24B-Instruct-2506-GGUF&lt;/a&gt; by Gabriel Larson. I ran it using Ollama and &lt;a href="https://github.com/taketwo/llm-ollama"&gt;llm-ollama&lt;/a&gt; like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama pull hf.co/gabriellarson/Mistral-Small-3.2-24B-Instruct-2506-GGUF:Q4_K_M
llm install llm-ollama
llm -m hf.co/gabriellarson/Mistral-Small-3.2-24B-Instruct-2506-GGUF:Q4_K_M \
  "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This one is pretty good for a 15GB model!&lt;/p&gt;
&lt;p&gt;&lt;img alt="See description below" src="https://static.simonwillison.net/static/2025/mistral-3.2-pelican.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/87ade7285a1d51d7073f73559baf41b2"&gt;the full transcript&lt;/a&gt; - it doesn't quite work in the actual image but I was delighted to see the model attempt to add this detail:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;!-- Basket with fish --&amp;gt;
&amp;lt;rect x="250" y="190" width="25" height="15" rx="5" fill="#FFA500"/&amp;gt;
&amp;lt;circle cx="260" cy="200" r="3" fill="#FF4500"/&amp;gt; &amp;lt;!-- Fish --&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's what you get if you isolate just that part of the generated SVG:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Large blue rectangle image, in the middle is a very small rounded corner orange rectangle &amp;quot;basket&amp;quot; containing an even smaller red circle that represents the fish." src="https://static.simonwillison.net/static/2025/fish-basket.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;I had Mistral Small 3.2 describe the full image to me, since it's a vision-LLM:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m hf.co/gabriellarson/Mistral-Small-3.2-24B-Instruct-2506-GGUF:Q4_K_M \
  'describe image' \
  -a https://static.simonwillison.net/static/2025/mistral-3.2-pelican.jpg
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And it gave me the following:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The image depicts a cartoonish illustration of a duck that is being lifted off the ground by a hook. The duck appears to be in mid-air, with its wings spread slightly as if it's surprised or reacting to being picked up. The hook is attached to an arm or a mechanism and seems to be connected to a vehicle below—perhaps a truck or a platform with wheels. The background of the image is light blue, indicating an outdoor setting. Overall, the scene is whimsical and playful, possibly suggesting a humorous or unusual situation where the duck is being transported in this manner.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;Update 2&lt;/strong&gt;: It's now available as &lt;a href="https://ollama.com/library/mistral-small3.2"&gt;an official Ollama model&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama pull mistral-small3.2
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;LM Studio has a community quantization too: &lt;a href="https://huggingface.co/lmstudio-community/Mistral-Small-3.2-24B-Instruct-2506-GGUF"&gt;lmstudio-community/Mistral-Small-3.2-24B-Instruct-2506-GGUF&lt;/a&gt;.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/hugging-face"&gt;hugging-face&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mistral"&gt;mistral&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-tool-use"&gt;llm-tool-use&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="hugging-face"/><category term="mistral"/><category term="vision-llms"/><category term="llm-tool-use"/><category term="ollama"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="lm-studio"/></entry><entry><title>Magistral — the first reasoning model by Mistral AI</title><link href="https://simonwillison.net/2025/Jun/10/magistral/#atom-tag" rel="alternate"/><published>2025-06-10T16:13:22+00:00</published><updated>2025-06-10T16:13:22+00:00</updated><id>https://simonwillison.net/2025/Jun/10/magistral/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://mistral.ai/news/magistral"&gt;Magistral — the first reasoning model by Mistral AI&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Mistral's first reasoning model is out today, in two sizes. There's a 24B Apache 2 licensed open-weights model called Magistral Small (actually Magistral-Small-2506), and a larger API-only model called Magistral Medium.&lt;/p&gt;
&lt;p&gt;Magistral Small is available as &lt;a href="https://huggingface.co/mistralai/Magistral-Small-2506"&gt;mistralai/Magistral-Small-2506&lt;/a&gt; on Hugging Face. From that model card:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Context Window&lt;/strong&gt;: A 128k context window, but performance might degrade past 40k. Hence we recommend setting the maximum model length to 40k.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Mistral also released an official GGUF version, &lt;a href="https://huggingface.co/mistralai/Magistral-Small-2506_gguf"&gt;Magistral-Small-2506_gguf&lt;/a&gt;, which I ran successfully using Ollama like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama pull hf.co/mistralai/Magistral-Small-2506_gguf:Q8_0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That fetched a 25GB file. I ran prompts using a chat session with &lt;a href="https://github.com/taketwo/llm-ollama"&gt;llm-ollama&lt;/a&gt; like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm chat -m hf.co/mistralai/Magistral-Small-2506_gguf:Q8_0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's what I got for "Generate an SVG of a pelican riding a bicycle" (&lt;a href="https://gist.github.com/simonw/7aaac8217f43be04886737d67c08ecca"&gt;transcript here&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&lt;img alt="Blue sky and what looks like an eagle flying towards the viewer." src="https://static.simonwillison.net/static/2025/magistral-pelican.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;It's disappointing that the GGUF doesn't support function calling yet - hopefully a community variant can add that, it's one of the best ways I know of to unlock the potential of these reasoning models.&lt;/p&gt;
&lt;p&gt;I just noticed that Ollama have their own &lt;a href="https://ollama.com/library/magistral"&gt;Magistral model&lt;/a&gt; too, which can be accessed using:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama pull magistral:latest
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That gets you a 14GB &lt;code&gt;q4_K_M&lt;/code&gt; quantization - other options can be found in the &lt;a href="https://ollama.com/library/magistral/tags"&gt;full list of Ollama magistral tags&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;One thing that caught my eye in the Magistral announcement:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Legal, finance, healthcare, and government professionals get traceable reasoning that meets compliance requirements. Every conclusion can be traced back through its logical steps, providing auditability for high-stakes environments with domain-specialized AI.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I guess this means the reasoning traces are fully visible and not redacted in any way - interesting to see Mistral trying to turn that into a feature that appeals to the business clients they are most interested in reaching.&lt;/p&gt;
&lt;p&gt;Also from that announcement:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Our early tests indicated that Magistral is an excellent creative companion. We highly recommend it for creative writing and storytelling, with the model capable of producing coherent or — if needed — delightfully eccentric copy.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I haven't seen a reasoning model promoted for creative writing in this way before.&lt;/p&gt;
&lt;p&gt;You can try out Magistral Medium by selecting the new "Thinking" option in Mistral's &lt;a href="https://chat.mistral.ai/"&gt;Le Chat&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of a chat interface showing settings options. At the top is a text input field that says &amp;quot;Ask le Chat or @mention an agent&amp;quot; with a plus button, lightbulb &amp;quot;Think&amp;quot; button with up arrow, grid &amp;quot;Tools&amp;quot; button, and settings icon. Below are two toggle options: &amp;quot;Pure Thinking&amp;quot; with description &amp;quot;Best option for math + coding. Disables tools.&amp;quot; (toggle is off), and &amp;quot;10x Speed&amp;quot; with lightning bolt icon and &amp;quot;PRO - 2 remaining today&amp;quot; label, described as &amp;quot;Same quality at 10x the speed.&amp;quot; (toggle is on and green)." src="https://static.simonwillison.net/static/2025/magistral-le-chat.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;They have options for "Pure Thinking" and a separate option for "10x speed", which runs Magistral Medium at 10x the speed using &lt;a href="https://www.cerebras.ai/"&gt;Cerebras&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The new models are also available through &lt;a href="https://docs.mistral.ai/api/"&gt;the Mistral API&lt;/a&gt;. You can access them by installing &lt;a href="https://github.com/simonw/llm-mistral"&gt;llm-mistral&lt;/a&gt; and running &lt;code&gt;llm mistral refresh&lt;/code&gt; to refresh the list of available models, then:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m mistral/magistral-medium-latest \
  'Generate an SVG of a pelican riding a bicycle'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="Claude Sonnet 4 described this as Minimalist illustration of a white bird with an orange beak riding on a dark gray motorcycle against a light blue sky with a white sun and gray ground" src="https://static.simonwillison.net/static/2025/magistral-medium-pelican.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/93917661eae6e2fe0a0bd5685172fab8"&gt;that transcript&lt;/a&gt;. At 13 input and 1,236 output tokens that cost me &lt;a href="https://www.llm-prices.com/#it=13&amp;amp;ot=1236&amp;amp;ic=2&amp;amp;oc=5"&gt;0.62 cents&lt;/a&gt; - just over half a cent.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mistral"&gt;mistral&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cerebras"&gt;cerebras&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="mistral"/><category term="cerebras"/><category term="llm-pricing"/><category term="ollama"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/></entry><entry><title>deepseek-ai/DeepSeek-R1-0528</title><link href="https://simonwillison.net/2025/May/31/deepseek-aideepseek-r1-0528/#atom-tag" rel="alternate"/><published>2025-05-31T21:18:32+00:00</published><updated>2025-05-31T21:18:32+00:00</updated><id>https://simonwillison.net/2025/May/31/deepseek-aideepseek-r1-0528/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-R1-0528"&gt;deepseek-ai/DeepSeek-R1-0528&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Sadly the trend for &lt;em&gt;terrible naming&lt;/em&gt; of models has infested the Chinese AI labs as well.&lt;/p&gt;
&lt;p&gt;DeepSeek-R1-0528 is a brand new and much improved open weights reasoning model from DeepSeek, a major step up from the DeepSeek R1 they released &lt;a href="https://simonwillison.net/2025/Jan/20/deepseek-r1/"&gt;back in January&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In the latest update, DeepSeek R1 has significantly improved its depth of reasoning and inference capabilities by [...] Its overall performance is now approaching that of leading models, such as O3 and Gemini 2.5 Pro. [...]&lt;/p&gt;
&lt;p&gt;Beyond its improved reasoning capabilities, this version also offers a reduced hallucination rate, enhanced support for function calling, and better experience for vibe coding.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The new R1 comes in two sizes: a 685B model called &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-R1-0528"&gt;deepseek-ai/DeepSeek-R1-0528&lt;/a&gt; (the previous R1 was 671B) and an 8B variant distilled from Qwen 3 called &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B"&gt;deepseek-ai/DeepSeek-R1-0528-Qwen3-8B&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The January release of R1 had a much larger collection of distilled models: four based on Qwen 2.5 (14B, 32B, Math 1.5B and Math 7B) and two based on Llama 3 (Llama 3.1 8B and Llama 3.3 70B Instruct).&lt;/p&gt;
&lt;p&gt;No Llama model at all this time. I wonder if that's because Qwen 3 is &lt;em&gt;really good&lt;/em&gt; and Apache 2 licensed, while Llama continues to stick with their &lt;a href="https://simonwillison.net/2025/Apr/20/janky-license/"&gt;janky license terms&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Further adding to the confusion, Ollama have mixed the two new models into their existing &lt;a href="https://ollama.com/library/deepseek-r1/tags"&gt;deepseek-r1&lt;/a&gt; label. Last week running &lt;code&gt;ollama pull deepseek-r1:8B&lt;/code&gt; from Ollama would get you &lt;code&gt;DeepSeek-R1-Distill-Llama-8B&lt;/code&gt; (&lt;a href="https://web.archive.org/web/20250514072051/https://ollama.com/library/deepseek-r1:8b"&gt;Internet Archive link&lt;/a&gt;), today that same command gets you &lt;a href="https://ollama.com/library/deepseek-r1:8b"&gt;DeepSeek-R1-0528-Qwen3-8B&lt;/a&gt;, a completely different base model.&lt;/p&gt;
&lt;p&gt;If this bothers you as much as it bothers me, thankfully you can use the more explicit tag &lt;a href="https://ollama.com/library/deepseek-r1:8b-0528-qwen3-q8_0"&gt;deepseek-r1:8b-0528-qwen3-q8_0&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update&lt;/strong&gt;: This is a great illustration of how confusing these names are! Ollama's &lt;code&gt;deepseek-r1:8B&lt;/code&gt; alias actually points to &lt;a href="https://ollama.com/library/deepseek-r1:8b-0528-qwen3-q4_K_M"&gt;deepseek-r1:8b-0528-qwen3-q4_K_M&lt;/a&gt;, a 5.2GB model. I pulled the larger &lt;code&gt;q8_0&lt;/code&gt; one.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I ran it like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama pull deepseek-r1:8b-0528-qwen3-q8_0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It fetched 8.9GB of model weights, then I had it draw me a pelican:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-ollama
llm -m deepseek-r1:8b-0528-qwen3-q8_0 \
  "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since the “reasoning” traces of these models are almost always the most interesting part, I’ll quote &lt;a href="https://gist.github.com/simonw/b8aea7c0279175166ebd46160d290d20"&gt;the response&lt;/a&gt; here in full:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt;&lt;br /&gt;
We are going to create an SVG that depicts a pelican riding a bicycle.
Steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create the basic structure of an SVG with width and height set appropriately (e.g., 400x200).&lt;/li&gt;
&lt;li&gt;Draw a pelican: We'll break it down into parts - body, beak, eye, feet, wings, etc.&lt;/li&gt;
&lt;li&gt;Draw a bicycle: We can represent it as two circles for wheels, a rectangle for the frame, and some other details.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Let's design the SVG with the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The pelican will be facing to the right (positive x-axis) so that it is riding the bike from left to right.&lt;/li&gt;
&lt;li&gt;The bicycle will have two wheels and a simple frame in front of the pelican.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We'll use absolute positioning for each part. Let's assume we set up coordinates such that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The top-left of the SVG is (0,0), and we want the scene to be drawn from left to right.&lt;/li&gt;
&lt;li&gt;The pelican will be positioned at (50, 120) with a body length of about 30 units.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We'll use simple shapes for the pelican:&lt;/p&gt;
&lt;p&gt;Body: An ellipse or rounded rectangle. Let's do an ellipse for the body.
Beak: A triangle pointing to the right.
Head: Another ellipse but smaller and positioned in front of the body.
Wings: Two triangles (one above each other, one slightly behind) that are upside down.&lt;/p&gt;
&lt;p&gt;Bicycle:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Wheels: two circles at fixed positions relative to the frame.&lt;/li&gt;
&lt;li&gt;Frame: a rectangle with some details for handlebars and seat.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Since this is a fun illustration, we'll keep it simple but recognizable.&lt;/p&gt;
&lt;p&gt;Let's code accordingly.&lt;br /&gt;
&lt;code&gt;&amp;lt;/think&amp;gt;&lt;/code&gt;&lt;br /&gt;
Here’s an SVG of a pelican riding a bicycle. This example uses basic shapes to create a stylized version:&lt;/p&gt;
&lt;div class="highlight highlight-text-xml-svg"&gt;&lt;pre&gt;&amp;lt;&lt;span class="pl-ent"&gt;svg&lt;/span&gt; &lt;span class="pl-e"&gt;xmlns&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;http://www.w3.org/2000/svg&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;viewBox&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;0 0 400 200&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;
  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Pelican body &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  &amp;lt;&lt;span class="pl-ent"&gt;ellipse&lt;/span&gt; &lt;span class="pl-e"&gt;cx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;150&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;cy&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;150&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;rx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;80&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;ry&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;60&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;#FFD700&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;

  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Pelican head (larger for comedic effect) &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  &amp;lt;&lt;span class="pl-ent"&gt;circle&lt;/span&gt; &lt;span class="pl-e"&gt;cx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;90&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;cy&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;120&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;r&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;45&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;#FFA3CE&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;

  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Pelican beak &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  &amp;lt;&lt;span class="pl-ent"&gt;polygon&lt;/span&gt; &lt;span class="pl-e"&gt;points&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;80,110 60,120 80,130 70,140 60,150&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;#FFC233&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;

  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Pelican eye &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  &amp;lt;&lt;span class="pl-ent"&gt;circle&lt;/span&gt; &lt;span class="pl-e"&gt;cx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;95&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;cy&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;115&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;r&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;3&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;#fff&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;

  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Bicycle wheels &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  &amp;lt;&lt;span class="pl-ent"&gt;ellipse&lt;/span&gt; &lt;span class="pl-e"&gt;cx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;380&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;cy&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;140&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;rx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;25&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;ry&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;60&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;transform&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;rotate(90 380 140)&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;black&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;ellipse&lt;/span&gt; &lt;span class="pl-e"&gt;cx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;380&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;cy&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;70&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;rx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;25&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;ry&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;60&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;transform&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;rotate(90 380 70)&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;black&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;

  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Bicycle frame &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  &amp;lt;&lt;span class="pl-ent"&gt;path&lt;/span&gt; &lt;span class="pl-e"&gt;d&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;M 15,40 L 15,140 H 370 L 370,140 V 40 Z&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;gray&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke-width&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;6&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;none&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;

  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Handlebars and seat (stylized) &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  &amp;lt;&lt;span class="pl-ent"&gt;path&lt;/span&gt; &lt;span class="pl-e"&gt;d&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;M 280,40 C 290,35 300,30 310,40 C 320,50 330,60 310,70 L 310,70 Q 290,50 280,40&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;none&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;gray&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;

  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Pedals &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  &amp;lt;&lt;span class="pl-ent"&gt;circle&lt;/span&gt; &lt;span class="pl-e"&gt;cx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;340&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;cy&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;70&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;r&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;15&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;#3366CC&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;polygon&lt;/span&gt; &lt;span class="pl-e"&gt;points&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;340,65 330,45 350,45&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;#F29E3D&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;

  &amp;lt;&lt;span class="pl-ent"&gt;circle&lt;/span&gt; &lt;span class="pl-e"&gt;cx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;340&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;cy&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;140&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;r&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;15&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;#3366CC&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;polygon&lt;/span&gt; &lt;span class="pl-e"&gt;points&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;340,135 330,115 350,115&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;#F29E3D&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;

  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Neck connecting head to body &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  &amp;lt;&lt;span class="pl-ent"&gt;path&lt;/span&gt; &lt;span class="pl-e"&gt;d&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;M 80,120 C 70,160 100,140 150,120&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;#FFA3CE&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke-width&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;8&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
&amp;lt;/&lt;span class="pl-ent"&gt;svg&lt;/span&gt;&amp;gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h4&gt;Explanation:&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pelican Body&lt;/strong&gt;: A large yellow ellipse.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Head&lt;/strong&gt;: A pink circle with a prominent beak (yellow-orange triangle).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bicycle&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Wheels are black ellipses rotated vertically to simulate spokes.&lt;/li&gt;
&lt;li&gt;The frame is a gray rectangle extending from the pelican’s back to the right side of the SVG.&lt;/li&gt;
&lt;li&gt;Handlebars and seat are simplified curves in gray.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This creates a comical scene with the pelican riding forward. You can adjust colors, sizes, or positions by modifying &lt;code&gt;cx&lt;/code&gt;, &lt;code&gt;cy&lt;/code&gt;, &lt;code&gt;rx&lt;/code&gt;, &lt;code&gt;ry&lt;/code&gt; attributes and fill/stroke values.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Love the thought process, and the explanation! The actual rendered SVG does leave a lot to be desired though:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A pink circle ovenrlaps a yellow oval, with a grey thick lined rectangle cutting thorough both. Two identical blue circles on the right have orange triangular elements and black ovals, a fair distance from the bit that might represent the pelican - it's very hard to tell." src="https://static.simonwillison.net/static/2025/deepseek-qwen-8b-pelican.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;To be fair, this is just using the ~8GB Qwen3 Q8_0 model on my laptop. I don't have the hardware to run the full sized R1 but it's available as &lt;code&gt;deepseek-reasoner&lt;/code&gt; through DeepSeek's API, so I tried it there using the &lt;a href="https://github.com/rumisle/llm-deepseek"&gt;llm-deepseek plugin&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-deepseek
llm -m deepseek-reasoner \
  "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This one came out &lt;a href="https://gist.github.com/simonw/d8765ea8413592b074ded45cbc585c54"&gt;a lot better&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A much better pelican on a bicycle - you can tell it's a bicycle and you can tell that it's a white bird with a long orange beak riding it, albeit it's not clearly a pelican and the bird is lying beneath the seat." src="https://static.simonwillison.net/static/2025/deepseek-r1-better-pelican.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;Meanwhile, on Reddit, u/adrgrondin got &lt;a href="https://www.reddit.com/r/LocalLLM/comments/1kz7qu1/deepseekr10528qwen38b_on_iphone_16_pro/"&gt;DeepSeek-R1-0528-Qwen3-8B running on an iPhone 16 Pro&lt;/a&gt; using MLX:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It runs at a decent speed for the size thanks to MLX, pretty impressive. But not really usable in my opinion, the model is thinking for too long, and the phone gets really hot.&lt;/p&gt;
&lt;/blockquote&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deepseek"&gt;deepseek&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="mlx"/><category term="ollama"/><category term="llm-reasoning"/><category term="deepseek"/><category term="llm-release"/><category term="ai-in-china"/></entry><entry><title>Large Language Models can run tools in your terminal with LLM 0.26</title><link href="https://simonwillison.net/2025/May/27/llm-tools/#atom-tag" rel="alternate"/><published>2025-05-27T20:35:20+00:00</published><updated>2025-05-27T20:35:20+00:00</updated><id>https://simonwillison.net/2025/May/27/llm-tools/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;&lt;a href="https://llm.datasette.io/en/stable/changelog.html#v0-26"&gt;LLM 0.26&lt;/a&gt;&lt;/strong&gt; is out with the biggest new feature since I started the project: &lt;a href="https://llm.datasette.io/en/stable/tools.html"&gt;&lt;strong&gt;support for tools&lt;/strong&gt;&lt;/a&gt;. You can now use the LLM &lt;a href="https://llm.datasette.io/en/stable/usage.html"&gt;CLI tool&lt;/a&gt; - and &lt;a href="https://llm.datasette.io/en/stable/python-api.html"&gt;Python library&lt;/a&gt; - to grant LLMs from OpenAI, Anthropic, Gemini and local models from Ollama with access to any tool that you can represent as a Python function.&lt;/p&gt;
&lt;p&gt;LLM also now has &lt;a href="https://llm.datasette.io/en/stable/plugins/directory.html#tools"&gt;tool plugins&lt;/a&gt;, so you can install a plugin that adds new capabilities to whatever model you are currently using.&lt;/p&gt;
&lt;p&gt;There's a lot to cover here, but here are the highlights:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LLM can run tools now&lt;/strong&gt;! You can &lt;strong&gt;install tools from plugins&lt;/strong&gt; and load them by name with &lt;code&gt;--tool/-T name_of_tool&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;You can also &lt;strong&gt;pass in Python function code on the command-line&lt;/strong&gt; with the &lt;code&gt;--functions&lt;/code&gt; option.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;Python API supports tools too&lt;/strong&gt;: &lt;code&gt;llm.get_model("gpt-4.1").chain("show me the locals", tools=[locals]).text()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Tools work in &lt;strong&gt;both async and sync contexts&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here's what's covered in this post:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2025/May/27/llm-tools/#trying-it-out"&gt;Trying it out&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2025/May/27/llm-tools/#more-interesting-tools-from-plugins"&gt;More interesting tools from plugins&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2025/May/27/llm-tools/#ad-hoc-command-line-tools-with-functions"&gt;Ad-hoc command-line tools with --functions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2025/May/27/llm-tools/#tools-in-the-llm-python-api"&gt;Tools in the LLM Python API&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2025/May/27/llm-tools/#why-did-this-take-me-so-long-"&gt;Why did this take me so long?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2025/May/27/llm-tools/#is-this-agents-then-"&gt;Is this agents then?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2025/May/27/llm-tools/#what-s-next-for-tools-in-llm-"&gt;What's next for tools in LLM?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;h4 id="trying-it-out"&gt;Trying it out&lt;/h4&gt;
&lt;p&gt;First, &lt;a href="https://llm.datasette.io/en/stable/setup.html"&gt;install the latest LLM&lt;/a&gt;. It may not be on Homebrew yet so I suggest using &lt;code&gt;pip&lt;/code&gt; or &lt;code&gt;pipx&lt;/code&gt; or &lt;code&gt;uv&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;uv tool install llm&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;If you have it already, &lt;a href="https://llm.datasette.io/en/stable/setup.html#upgrading-to-the-latest-version"&gt;upgrade it&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;uv tool upgrade llm&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Tools work with other vendors, but let's stick with OpenAI for the moment. Give LLM an OpenAI API key:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm keys &lt;span class="pl-c1"&gt;set&lt;/span&gt; openai
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Paste key here&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Now let's run our first tool:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm --tool llm_version &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;What version?&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; --td&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here's what I get:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/llm-tools.gif" alt="Animated demo. I run that command, LLM shows Tool call: llm_version({}) in yellow, then 0.26a1 in green, then streams out the text The installed version is 0.26a1" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;llm_version&lt;/code&gt; is a very simple demo tool that ships with LLM. Running &lt;code&gt;--tool llm_version&lt;/code&gt; exposes that tool to the model - you can specify it multiple times to enable multiple tools, and there's a shorter &lt;code&gt;-T&lt;/code&gt; alias to save on typing.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;--td&lt;/code&gt; option stands for &lt;code&gt;--tools-debug&lt;/code&gt; - it causes LLM to output information about tool calls and their responses so you can peek behind the scenes.&lt;/p&gt;
&lt;p&gt;This is using the default LLM model, which is usually &lt;code&gt;gpt-4o-mini&lt;/code&gt;. I switched it to &lt;code&gt;gpt-4.1-mini&lt;/code&gt; (better but fractionally more expensive) by running:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm models default gpt-4.1-mini&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;You can try other models using the &lt;code&gt;-m&lt;/code&gt; option. Here's how to run a similar demo of the &lt;code&gt;llm_time&lt;/code&gt; built-in tool using &lt;code&gt;o4-mini&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm --tool llm_time &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;What time is it?&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; --td -m o4-mini&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Outputs:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Tool call: llm_time({})&lt;/code&gt;&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;  {
    &lt;span class="pl-ent"&gt;"utc_time"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;2025-05-27 19:15:55 UTC&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"utc_time_iso"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;2025-05-27T19:15:55.288632+00:00&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"local_timezone"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;PDT&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"local_time"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;2025-05-27 12:15:55&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"timezone_offset"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;UTC-7:00&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"is_dst"&lt;/span&gt;: &lt;span class="pl-c1"&gt;true&lt;/span&gt;
  }&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The current time is 12:15 PM PDT (UTC−7:00) on May 27, 2025, which corresponds to 7:15 PM UTC.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Models from (tool supporting) plugins work too. Anthropic's Claude Sonnet 4:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-anthropic -U
llm keys &lt;span class="pl-c1"&gt;set&lt;/span&gt; anthropic
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Paste Anthropic key here&lt;/span&gt;
llm --tool llm_version &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;What version?&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; --td -m claude-4-sonnet&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Or Google's Gemini 2.5 Flash:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-gemini -U
llm keys &lt;span class="pl-c1"&gt;set&lt;/span&gt; gemini
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Paste Gemini key here&lt;/span&gt;
llm --tool llm_version &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;What version?&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; --td -m gemini-2.5-flash-preview-05-20&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;You can even run simple tools with Qwen3:4b, a &lt;em&gt;tiny&lt;/em&gt; (2.6GB) model that I run using &lt;a href="https://ollama.com/"&gt;Ollama&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;ollama pull qwen3:4b
llm install &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;llm-ollama&amp;gt;=0.11a0&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
llm --tool llm_version &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;What version?&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; --td -m qwen3:4b&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Qwen 3 calls the tool, thinks about it a bit and then prints out a response:
&lt;img src="https://static.simonwillison.net/static/2025/llm-tools-qwen.jpg" alt="Tool call: llm_version({}) 0.26a1&amp;lt;think&amp;gt; Okay, the user asked, &amp;quot;What version?&amp;quot; I need to respond with the version of the LLM. The tool provided is llm_version, which returns the installed version. I called that function and got the response 0.26a1. Now I should present this information clearly. Let me check if there's any additional context needed, but the user just asked for the version, so a straightforward answer should work. I'll state the version number and maybe mention that it's the installed version. Keep it simple and precise. &amp;lt;/think&amp;gt; The installed version of the LLM is 0.26a1." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4 id="more-interesting-tools-from-plugins"&gt;More interesting tools from plugins&lt;/h4&gt;
&lt;p&gt;This demo has been pretty weak so far. Let's do something a whole lot more interesting.&lt;/p&gt;
&lt;p&gt;LLMs are notoriously bad at mathematics. This is deeply surprising to many people: supposedly the most sophisticated computer systems we've ever built can't multiply two large numbers together?&lt;/p&gt;
&lt;p&gt;We can fix that with tools.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/simonw/llm-tools-simpleeval"&gt;llm-tools-simpleeval&lt;/a&gt; plugin exposes the &lt;a href="https://github.com/danthedeckie/simpleeval"&gt;simpleeval&lt;/a&gt; "Simple Safe Sandboxed Extensible Expression Evaluator for Python" library by Daniel Fairhead. This provides a robust-enough sandbox for executing simple Python expressions.&lt;/p&gt;
&lt;p&gt;Here's how to run a calculation:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-tools-simpleeval
llm -T simple_eval&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Trying that out:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -T simple_eval &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Calculate 1234 * 4346 / 32414 and square root it&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; --td&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I got back this - it tried &lt;code&gt;sqrt()&lt;/code&gt; first, then when that didn't work switched to &lt;code&gt;** 0.5&lt;/code&gt; instead:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Tool call: simple_eval({'expression': '1234 * 4346 / 32414'})
  165.45208860368976


Tool call: simple_eval({'expression': 'sqrt(1234 * 4346 / 32414)'})
  Error: Function 'sqrt' not defined, for expression 'sqrt(1234 * 4346 / 32414)'.


Tool call: simple_eval({'expression': '(1234 * 4346 / 32414) ** 0.5'})
  12.862818066181678

The result of (1234 * 4346 / 32414) is approximately
165.45, and the square root of this value is approximately 12.86.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I've released four tool plugins so far:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/llm-tools-simpleeval"&gt;llm-tools-simpleeval&lt;/a&gt;&lt;/strong&gt; - as shown above, simple expression support for things like mathematics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/llm-tools-quickjs"&gt;llm-tools-quickjs&lt;/a&gt;&lt;/strong&gt; - provides access to a sandboxed QuickJS JavaScript interpreter, allowing LLMs to run JavaScript code. The environment persists between calls so the model can set variables and build functions and reuse them later on.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/llm-tools-sqlite"&gt;llm-tools-sqlite&lt;/a&gt;&lt;/strong&gt; - read-only SQL query access to a local SQLite database.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/llm-tools-datasette"&gt;llm-tools-datasette&lt;/a&gt;&lt;/strong&gt; - run SQL queries against a remote &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; instance!&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Let's try that Datasette one now:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-tools-datasette
llm -T &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Datasette("https://datasette.io/content")&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; --td &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;What has the most stars?&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The syntax here is slightly different: the Datasette plugin is what I'm calling a "toolbox" - a plugin that has multiple tools inside it and can be configured with a constructor.&lt;/p&gt;
&lt;p&gt;Specifying &lt;code&gt;--tool&lt;/code&gt; as &lt;code&gt;Datasette("https://datasette.io/content")&lt;/code&gt; provides the plugin with the URL to the Datasette instance it should use - in this case the &lt;a href="https://datasette.io/content"&gt;content database&lt;/a&gt; that powers the Datasette website.&lt;/p&gt;
&lt;p&gt;Here's the output, with the schema section truncated for brevity:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/datasette-tool.jpg" alt="I run that command. It first does a Tool call to Datasette_query with SELECT name, stars, FROM repos ORDER BY stars DESC LIMIT 1. This returns an error message because there is no such column stars. It calls the Datasette_schema() function which returns a whole load of CREATE TABLE statements. Then it executes Datasette_query again this time with SELECT name, stargazers_count FROM repos ORDER BY stargazers_count DESC LIMIT 1. This returns name=datasette a count of 10020, so the model replies and says The repository with the most stars is &amp;quot;datasette&amp;quot; with 10,020 stars." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;This question triggered three calls. The model started by guessing the query! It tried &lt;code&gt;SELECT name, stars FROM repos ORDER BY stars DESC LIMIT 1&lt;/code&gt;, which failed because the &lt;code&gt;stars&lt;/code&gt; column doesn't exist.&lt;/p&gt;
&lt;p&gt;The tool call returned an error, so the model had another go - this time calling the &lt;code&gt;Datasette_schema()&lt;/code&gt; tool to get the schema of the database.&lt;/p&gt;
&lt;p&gt;Based on that schema it assembled and then executed the correct query, and output its interpretation of the result:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The repository with the most stars is "datasette" with 10,020 stars.&lt;/p&gt;
&lt;/blockquote&gt;
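The "toolbox" pattern itself is easy to sketch in plain Python: a class whose constructor holds configuration and whose public methods each become a named tool. This is an illustration of the pattern (the class and method names here are hypothetical), not LLM's actual implementation:

```python
import inspect

class Toolbox:
    "A bundle of related tools sharing configuration from a constructor"
    def tools(self):
        # Every public method (other than tools itself) becomes a tool,
        # prefixed with the class name to avoid collisions
        return {
            f"{type(self).__name__}_{name}": method
            for name, method in inspect.getmembers(self, inspect.ismethod)
            if not name.startswith("_") and name != "tools"
        }

class Datasette(Toolbox):
    def __init__(self, url):
        self.url = url  # configuration captured by the constructor

    def query(self, sql):
        # The real plugin would execute this against the remote instance
        return f"would run {sql!r} against {self.url}"

    def schema(self):
        return f"would fetch the schema from {self.url}"

toolbox = Datasette("https://datasette.io/content")
print(sorted(toolbox.tools()))  # ['Datasette_query', 'Datasette_schema']
```

That's why the transcript above shows tool names like `Datasette_query` and `Datasette_schema`: one configured object, several related tools.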
&lt;p&gt;Getting to this point was a real &lt;a href="https://www.penny-arcade.com/comic/2010/09/17/mine-all-mine-part-one"&gt;Penny Arcade Minecraft moment&lt;/a&gt; for me. The possibilities here are &lt;em&gt;limitless&lt;/em&gt;. If you can write a Python function for it, you can trigger it from an LLM.&lt;/p&gt;
&lt;h4 id="ad-hoc-command-line-tools-with-functions"&gt;Ad-hoc command-line tools with &lt;code&gt;--functions&lt;/code&gt;
&lt;/h4&gt;
&lt;p&gt;I'm looking forward to people building more plugins, but there's also a much less structured and more ad-hoc way to use tools with the LLM CLI tool: the &lt;code&gt;--functions&lt;/code&gt; option.&lt;/p&gt;
&lt;p&gt;This was inspired by a similar feature &lt;a href="https://sqlite-utils.datasette.io/en/stable/cli.html#defining-custom-sql-functions"&gt;I added to sqlite-utils&lt;/a&gt; a while ago.&lt;/p&gt;
&lt;p&gt;You can pass a block of literal Python code directly to the CLI tool using the &lt;code&gt;--functions&lt;/code&gt; option, and any functions defined there will be made available to the model as tools.&lt;/p&gt;
&lt;p&gt;Here's an example that adds the ability to search my blog:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm --functions &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;import httpx&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;def search_blog(q):&lt;/span&gt;
&lt;span class="pl-s"&gt;    "Search Simon Willison blog"&lt;/span&gt;
&lt;span class="pl-s"&gt;    return httpx.get("https://simonwillison.net/search/", params={"q": q}).content&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; --td &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Three features of sqlite-utils&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; -s &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;use Simon search&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This is &lt;em&gt;such a hack&lt;/em&gt; of an implementation! I'm literally just hitting &lt;a href="https://simonwillison.net/search/?q=pelicans"&gt;my search page&lt;/a&gt; and dumping the HTML straight back into the model.&lt;/p&gt;
&lt;p&gt;It totally works though - it helps that the GPT-4.1 series all handle a million tokens now, so crufty HTML is no longer a problem for them.&lt;/p&gt;
&lt;p&gt;(I had to add "use Simon search" as the system prompt because without it the model would try to answer the question itself, rather than using the search tool I provided. System prompts for tools are clearly a &lt;em&gt;big topic&lt;/em&gt;, Anthropic's own web search tool has &lt;a href="https://simonwillison.net/2025/May/25/claude-4-system-prompt/#search-instructions"&gt;6,471 tokens of instructions&lt;/a&gt;!)&lt;/p&gt;
&lt;p&gt;Here's the output I got just now:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Three features of sqlite-utils are:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;It is a combined CLI tool and Python library for manipulating SQLite databases.&lt;/li&gt;
&lt;li&gt;It can automatically add columns to a database table if you attempt to insert data that doesn't quite fit (using the alter=True option).&lt;/li&gt;
&lt;li&gt;It supports plugins, allowing the extension of its functionality through third-party or custom plugins.&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
&lt;p&gt;A better search tool would have more detailed instructions and would return relevant snippets of the results, not just the headline and first paragraph for each result. This is pretty great for just four lines of Python though!&lt;/p&gt;
&lt;h4 id="tools-in-the-llm-python-api"&gt;Tools in the LLM Python API&lt;/h4&gt;
&lt;p&gt;LLM is both a CLI tool and a Python library at the same time (similar to my other project &lt;a href="https://sqlite-utils.datasette.io/"&gt;sqlite-utils&lt;/a&gt;). The LLM Python library &lt;a href="https://llm.datasette.io/en/stable/python-api.html#tools"&gt;grew tool support&lt;/a&gt; in LLM 0.26 as well.&lt;/p&gt;
&lt;p&gt;Here's a simple example solving one of the previously hardest problems in LLMs: counting the number of Rs in "strawberry":&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;llm&lt;/span&gt;

&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;count_char_in_text&lt;/span&gt;(&lt;span class="pl-s1"&gt;char&lt;/span&gt;: &lt;span class="pl-smi"&gt;str&lt;/span&gt;, &lt;span class="pl-s1"&gt;text&lt;/span&gt;: &lt;span class="pl-smi"&gt;str&lt;/span&gt;) &lt;span class="pl-c1"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="pl-smi"&gt;int&lt;/span&gt;:
    &lt;span class="pl-s"&gt;"How many times does char appear in text?"&lt;/span&gt;
    &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-s1"&gt;text&lt;/span&gt;.&lt;span class="pl-c1"&gt;count&lt;/span&gt;(&lt;span class="pl-s1"&gt;char&lt;/span&gt;)

&lt;span class="pl-s1"&gt;model&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;llm&lt;/span&gt;.&lt;span class="pl-c1"&gt;get_model&lt;/span&gt;(&lt;span class="pl-s"&gt;"gpt-4.1-mini"&lt;/span&gt;)
&lt;span class="pl-s1"&gt;chain_response&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;model&lt;/span&gt;.&lt;span class="pl-c1"&gt;chain&lt;/span&gt;(
    &lt;span class="pl-s"&gt;"Rs in strawberry?"&lt;/span&gt;,
    &lt;span class="pl-s1"&gt;tools&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;[&lt;span class="pl-s1"&gt;count_char_in_text&lt;/span&gt;],
    &lt;span class="pl-s1"&gt;after_call&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s1"&gt;print&lt;/span&gt;
)
&lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;chunk&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;chain_response&lt;/span&gt;:
    &lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;chunk&lt;/span&gt;, &lt;span class="pl-s1"&gt;end&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;""&lt;/span&gt;, &lt;span class="pl-s1"&gt;flush&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;)&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;after_call=print&lt;/code&gt; argument is a way to peek at the tool calls, the Python equivalent of the &lt;code&gt;--td&lt;/code&gt; option from earlier.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;model.chain()&lt;/code&gt; method is new: it's similar to &lt;code&gt;model.prompt()&lt;/code&gt; but knows how to spot returned tool call requests, execute them and then prompt the model again with the results. A &lt;code&gt;model.chain()&lt;/code&gt; could potentially execute dozens of responses on the way to giving you a final answer.&lt;/p&gt;
&lt;p&gt;You can iterate over the &lt;code&gt;chain_response&lt;/code&gt; to output those tokens as they are returned by the model, even across multiple responses.&lt;/p&gt;
&lt;p&gt;I got back this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Tool(name='count_char_in_text', description='How many times does char appear in text?', input_schema={'properties': {'char': {'type': 'string'}, 'text': {'type': 'string'}}, 'required': ['char', 'text'], 'type': 'object'}, implementation=&amp;lt;function count_char_in_text at 0x109dd4f40&amp;gt;, plugin=None) ToolCall(name='count_char_in_text', arguments={'char': 'r', 'text': 'strawberry'}, tool_call_id='call_DGXcM8b2B26KsbdMyC1uhGUu') ToolResult(name='count_char_in_text', output='3', tool_call_id='call_DGXcM8b2B26KsbdMyC1uhGUu', instance=None, exception=None)&lt;/code&gt;&lt;br /&gt;&lt;/p&gt;
&lt;p&gt;There are 3 letter "r"s in the word "strawberry".&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;LLM's Python library also supports &lt;code&gt;asyncio&lt;/code&gt;, and tools can be &lt;code&gt;async def&lt;/code&gt; functions &lt;a href="https://llm.datasette.io/en/latest/python-api.html#tool-functions-can-be-sync-or-async"&gt;as described here&lt;/a&gt;. If a model requests multiple async tools at once the library will run them concurrently with &lt;code&gt;asyncio.gather()&lt;/code&gt;.&lt;/p&gt;
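Here's a minimal sketch of why running requested tools with &lt;code&gt;asyncio.gather()&lt;/code&gt; matters - the two simulated slow tool calls (both hypothetical stand-ins for real network calls) overlap instead of running back to back:

```python
import asyncio
import time

async def fetch_weather(city: str) -> str:
    await asyncio.sleep(0.1)  # simulate a slow network call
    return f"Weather for {city}"

async def fetch_news(topic: str) -> str:
    await asyncio.sleep(0.1)  # simulate another slow call
    return f"News about {topic}"

async def run_tool_calls():
    # Run both requested tools concurrently, the way multiple async
    # tool calls from a single model response can be dispatched
    start = time.monotonic()
    results = await asyncio.gather(
        fetch_weather("San Francisco"),
        fetch_news("SQLite"),
    )
    elapsed = time.monotonic() - start
    return results, elapsed

results, elapsed = asyncio.run(run_tool_calls())
print(results, f"{elapsed:.2f}s")  # ~0.1s total rather than 0.2s
```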
&lt;p&gt;The Toolbox form of tools is supported too: you can pass &lt;code&gt;tools=[Datasette("https://datasette.io/content")]&lt;/code&gt; to that &lt;code&gt;chain()&lt;/code&gt; method to achieve the same effect as the &lt;code&gt;--tool 'Datasette(...)'&lt;/code&gt; option from earlier.&lt;/p&gt;
&lt;h4 id="why-did-this-take-me-so-long-"&gt;Why did this take me so long?&lt;/h4&gt;
&lt;p&gt;I've been tracking &lt;a href="https://simonwillison.net/tags/llm-tool-use/"&gt;llm-tool-use&lt;/a&gt; for a while. I first saw the trick described in &lt;a href="https://arxiv.org/abs/2210.03629"&gt;the ReAct paper&lt;/a&gt;, first published in October 2022 (a month before the initial release of ChatGPT). I built &lt;a href="https://til.simonwillison.net/llms/python-react-pattern"&gt;a simple implementation of that&lt;/a&gt; in a few dozen lines of Python. It was clearly a very neat pattern!&lt;/p&gt;
&lt;p&gt;Over the past few years it has become &lt;em&gt;very&lt;/em&gt; apparent that tool use is the single most effective way to extend the abilities of language models. It's such a simple trick: you tell the model that there are tools it can use, and have it output special syntax (JSON or XML or &lt;code&gt;tool_name(arguments)&lt;/code&gt;, it doesn't matter which) requesting a tool action, then stop.&lt;/p&gt;
&lt;p&gt;Your code parses that output, runs the requested tools and then starts a new prompt to the model with the results.&lt;/p&gt;
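That parse-execute-reprompt cycle is simple enough to sketch end to end. Here the "model" is a hypothetical stub that emits a tool call once and then answers - real APIs differ in syntax but not in shape:

```python
import json

def fake_model(messages):
    "Stand-in for an LLM API: request a tool once, then give a final answer"
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "add", "arguments": {"a": 2, "b": 3}}}
    tool_output = [m for m in messages if m["role"] == "tool"][-1]["content"]
    return {"content": f"The answer is {tool_output}"}

TOOLS = {"add": lambda a, b: a + b}

def run_loop(prompt):
    messages = [{"role": "user", "content": prompt}]
    while True:
        response = fake_model(messages)
        call = response.get("tool_call")
        if call is None:
            return response["content"]  # no tool requested: loop ends
        # Run the requested tool and feed its result back to the model
        result = TOOLS[call["name"]](**call["arguments"])
        messages.append({"role": "tool", "content": json.dumps(result)})

print(run_loop("What is 2 + 3?"))  # The answer is 5
```

Swap the stub for a real API call and the `while` loop is the whole pattern.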
&lt;p&gt;This works with almost &lt;strong&gt;every model&lt;/strong&gt; now. Most of them are specifically trained for tool usage, and there are leaderboards like the &lt;a href="https://gorilla.cs.berkeley.edu/leaderboard.html"&gt;Berkeley Function-Calling Leaderboard&lt;/a&gt; dedicated to tracking which models do the best job of it.&lt;/p&gt;
&lt;p&gt;All of the big model vendors - OpenAI, Anthropic, Google, Mistral, Meta - have a version of this baked into their API, either called tool usage or function calling. It's all the same underlying pattern.&lt;/p&gt;
&lt;p&gt;The models you can run locally are getting good at this too. Ollama &lt;a href="https://ollama.com/blog/tool-support"&gt;added tool support&lt;/a&gt; last year, and it's baked into the &lt;a href="https://github.com/ggml-org/llama.cpp/blob/master/docs/function-calling.md"&gt;llama.cpp&lt;/a&gt; server as well.&lt;/p&gt;
&lt;p&gt;It's been clear for a while that LLM absolutely needed to grow support for tools. I released &lt;a href="https://simonwillison.net/2025/Feb/28/llm-schemas/"&gt;LLM schema support&lt;/a&gt; back in February as a stepping stone towards this. I'm glad to finally have it over the line.&lt;/p&gt;
&lt;p&gt;As always with LLM, the challenge was designing an abstraction layer that could work across as many different models as possible. A year ago I didn't feel that model tool support was mature enough to figure this out. Today there's a very definite consensus among vendors about how this should work, which finally gave me the confidence to implement it.&lt;/p&gt;
&lt;p&gt;I also presented a workshop at PyCon US two weeks ago about &lt;a href="https://simonwillison.net/2025/May/15/building-on-llms/"&gt;Building software on top of Large Language Models&lt;/a&gt;, which was exactly the incentive I needed to finally get this working in an alpha! Here's the &lt;a href="https://building-with-llms-pycon-2025.readthedocs.io/en/latest/tools.html"&gt;tools section&lt;/a&gt; from that tutorial.&lt;/p&gt;
&lt;h4 id="is-this-agents-then-"&gt;Is this agents then?&lt;/h4&gt;
&lt;p&gt;&lt;em&gt;Sigh&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;I still &lt;a href="https://simonwillison.net/2024/Dec/31/llms-in-2024/#-agents-still-haven-t-really-happened-yet"&gt;don't like&lt;/a&gt; using the term "agents". I worry that developers will think &lt;a href="https://simonwillison.net/2025/May/22/tools-in-a-loop/"&gt;tools in a loop&lt;/a&gt;, regular people will think virtual AI assistants &lt;a href="https://en.m.wikipedia.org/wiki/Her_(2013_film)"&gt;voiced by Scarlett Johansson&lt;/a&gt; and academics will &lt;a href="https://simonwillison.net/2025/Mar/19/worms-and-dogs-and-countries/"&gt;grumble about thermostats&lt;/a&gt;. But in the LLM world we appear to be converging on "tools in a loop", and that's absolutely what this is.&lt;/p&gt;
&lt;p&gt;So yes, if you want to build "agents" then LLM 0.26 is a great way to do that.&lt;/p&gt;
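&lt;p&gt;The whole "tools in a loop" pattern fits in a dozen lines. Here's a minimal sketch with a stubbed-out model - this is the shape of the loop, not LLM's actual API:&lt;/p&gt;

```python
def fake_model(messages):
    """Stand-in for an LLM API call: requests one tool call, then answers."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "add", "arguments": {"a": 2, "b": 3}}}
    return {"content": "The answer is 5"}

# The tools the model is allowed to call:
TOOLS = {"add": lambda a, b: a + b}

def run_agent(prompt, model=fake_model):
    messages = [{"role": "user", "content": prompt}]
    while True:
        response = model(messages)
        call = response.get("tool_call")
        if call is None:
            # No tool requested: the model is done, return its final answer
            return response["content"]
        # Execute the requested tool and feed the result back to the model
        result = TOOLS[call["name"]](**call["arguments"])
        messages.append({"role": "tool", "content": str(result)})

print(run_agent("What is 2 + 3?"))
```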
&lt;h4 id="what-s-next-for-tools-in-llm-"&gt;What's next for tools in LLM?&lt;/h4&gt;
&lt;p&gt;I already have a &lt;a href="https://github.com/simonw/llm/milestone/13"&gt;LLM tools v2 milestone&lt;/a&gt; with 13 issues in it, mainly around improvements to how tool execution logs are displayed but with quite a few minor issues I decided shouldn't block this release. There's a bunch more stuff in the &lt;a href="https://github.com/simonw/llm/issues?q=is%3Aissue%20state%3Aopen%20label%3Atools"&gt;tools label&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I'm most excited about the potential for plugins.&lt;/p&gt;
&lt;p&gt;Writing tool plugins is &lt;em&gt;really fun&lt;/em&gt;. I have an &lt;a href="https://github.com/simonw/llm-plugin-tools"&gt;llm-plugin-tools&lt;/a&gt; cookiecutter template that I've been using for my own, and I plan to put together a tutorial around that soon.&lt;/p&gt;
&lt;p&gt;There's more work to be done adding tool support to more model plugins. I added &lt;a href="https://llm.datasette.io/en/stable/plugins/advanced-model-plugins.html#supporting-tools"&gt;details of this&lt;/a&gt; to the advanced plugins documentation. This commit &lt;a href="https://github.com/simonw/llm-gemini/commit/a7f1096cfbb733018eb41c29028a8cc6160be298"&gt;adding tool support for Gemini&lt;/a&gt; is a useful illustration of what's involved.&lt;/p&gt;

&lt;p&gt;And yes, &lt;strong&gt;Model Context Protocol&lt;/strong&gt; support is clearly on the agenda as well. MCP is emerging, at a frankly bewildering speed, as the standard way for models to access tools. Two weeks ago it wasn't directly supported by the APIs of any of the major vendors. In just the past eight days &lt;a href="https://simonwillison.net/2025/May/27/mistral-agents-api/"&gt;it's been added&lt;/a&gt; by OpenAI, Anthropic &lt;em&gt;and&lt;/em&gt; Mistral! It's feeling like a lot less of a moving target today.&lt;/p&gt;
&lt;p&gt;I want LLM to be able to act as an MCP client, so that any of the MCP servers people are writing can be easily accessed as additional sources of tools for LLM.&lt;/p&gt;
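&lt;p&gt;Under the hood MCP is JSON-RPC 2.0: a client asks a server for its tools and then calls them. Here's a rough sketch of the two request shapes, assuming the &lt;code&gt;tools/list&lt;/code&gt; and &lt;code&gt;tools/call&lt;/code&gt; method names from the MCP specification - the &lt;code&gt;search_docs&lt;/code&gt; tool here is hypothetical, servers advertise their own:&lt;/p&gt;

```python
import itertools
import json

_ids = itertools.count(1)

def jsonrpc(method, params=None):
    """Build an MCP-style JSON-RPC 2.0 request as a JSON string."""
    request = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        request["params"] = params
    return json.dumps(request)

# Ask the server what tools it offers:
list_request = jsonrpc("tools/list")

# Call one of those tools by name with arguments:
call_request = jsonrpc(
    "tools/call",
    {"name": "search_docs", "arguments": {"query": "sqlite"}},
)
print(list_request)
```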
&lt;p&gt;If you're interested in talking more about what comes next for LLM, &lt;a href="https://datasette.io/discord-llm"&gt;come and chat to us in our Discord&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/releases"&gt;releases&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-tool-use"&gt;llm-tool-use&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-agents"&gt;ai-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="projects"/><category term="releases"/><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="anthropic"/><category term="gemini"/><category term="llm-tool-use"/><category term="ai-agents"/><category term="ollama"/></entry><entry><title>Devstral</title><link href="https://simonwillison.net/2025/May/21/devstral/#atom-tag" rel="alternate"/><published>2025-05-21T22:02:23+00:00</published><updated>2025-05-21T22:02:23+00:00</updated><id>https://simonwillison.net/2025/May/21/devstral/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://mistral.ai/news/devstral"&gt;Devstral&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;New Apache 2.0 licensed LLM release from Mistral, this time specifically trained for code.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Devstral achieves a score of 46.8% on SWE-Bench Verified, outperforming prior open-source SoTA models by more than 6% points. When evaluated under the same test scaffold (OpenHands, provided by &lt;a href="https://www.all-hands.dev/"&gt;All Hands AI&lt;/a&gt; 🙌), Devstral exceeds far larger models such as Deepseek-V3-0324 (671B) and Qwen3 232B-A22B.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I'm always suspicious of small models like this that claim great benchmarks against much larger rivals, but there's a Devstral model that is &lt;a href="https://ollama.com/library/devstral/tags"&gt;just 14GB on Ollama&lt;/a&gt; so it's quite easy to try out for yourself.&lt;/p&gt;
&lt;p&gt;I fetched it like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama pull devstral
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then ran it in a &lt;a href="https://llm.datasette.io/en/stable/usage.html#starting-an-interactive-chat"&gt;llm chat&lt;/a&gt; session with &lt;a href="https://github.com/taketwo/llm-ollama"&gt;llm-ollama&lt;/a&gt; like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-ollama
llm chat -m devstral
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Initial impressions: I think this one is pretty good! Here's &lt;a href="https://gist.github.com/simonw/543e4322c7a3144afb1cc2d685036742"&gt;a full transcript&lt;/a&gt; where I had it write Python code to fetch a CSV file from a URL and import it into a SQLite database, creating the table with the necessary columns. Honestly I need to retire that challenge, it's been a while since a model failed at it, but it's still interesting to see how it handles follow-up prompts to demand things like &lt;code&gt;asyncio&lt;/code&gt; or a different HTTP client library.&lt;/p&gt;
&lt;p&gt;It's also available through &lt;a href="https://docs.mistral.ai/api/"&gt;Mistral's API&lt;/a&gt;. &lt;a href="https://github.com/simonw/llm-mistral/releases/tag/0.13"&gt;llm-mistral 0.13&lt;/a&gt; configures the &lt;code&gt;devstral-small&lt;/code&gt; alias for it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install -U llm-mistral
llm keys set mistral
# paste key here
llm -m devstral-small 'HTML+JS for a large text countdown app from 5m'
&lt;/code&gt;&lt;/pre&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mistral"&gt;mistral&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="ai-assisted-programming"/><category term="llm"/><category term="mistral"/><category term="ollama"/><category term="llm-release"/></entry><entry><title>qwen2.5vl in Ollama</title><link href="https://simonwillison.net/2025/May/18/qwen25vl-in-ollama/#atom-tag" rel="alternate"/><published>2025-05-18T12:31:15+00:00</published><updated>2025-05-18T12:31:15+00:00</updated><id>https://simonwillison.net/2025/May/18/qwen25vl-in-ollama/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://ollama.com/library/qwen2.5vl"&gt;qwen2.5vl in Ollama&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Ollama announced a complete overhaul of their vision support the other day. Here's the first new model they've shipped since then - a packaged version of &lt;a href="https://github.com/QwenLM/Qwen2.5-VL"&gt;Qwen 2.5 VL&lt;/a&gt; which was first released &lt;a href="https://qwenlm.github.io/blog/qwen2.5-vl/"&gt;on January 26th 2025&lt;/a&gt;. Here are my &lt;a href="https://simonwillison.net/2025/Jan/27/qwen25-vl-qwen25-vl-qwen25-vl/"&gt;notes from that release&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I upgraded Ollama (it auto-updates so I just had to restart it from the tray icon) and ran this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama pull qwen2.5vl
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This downloaded a 6GB model file. I tried it out against my &lt;a href="https://static.simonwillison.net/static/2025/cleo-sand.jpg"&gt;photo of Cleo rolling on the beach&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -a https://static.simonwillison.net/static/2025/cleo-sand.jpg \
  'describe this image' -m qwen2.5vl
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And got a pretty good result:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The image shows a dog lying on its back on a sandy beach. The dog appears to be a medium to large breed with a dark coat, possibly black or dark brown. It is wearing a red collar or harness around its chest. The dog's legs are spread out, and its belly is exposed, suggesting it might be rolling around or playing in the sand. The sand is light-colored and appears to be dry, with some small footprints and marks visible around the dog. The lighting in the image suggests it is taken during the daytime, with the sun casting a shadow of the dog to the left side of the image. The overall scene gives a relaxed and playful impression, typical of a dog enjoying time outdoors on a beach.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Qwen 2.5 VL has a strong reputation for OCR, so I tried it on &lt;a href="https://simonwillison.net/2025/May/17/pycon-poster/#datasette-poster"&gt;my poster&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -a https://static.simonwillison.net/static/2025/poster.jpg \
  'convert to markdown' -m qwen2.5vl
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The result that came back:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It looks like the image you provided is a jumbled and distorted text, making it difficult to interpret. If you have a specific question or need help with a particular topic, please feel free to ask, and I'll do my best to assist you!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I'm not sure what went wrong here. My best guess is that the maximum resolution the model can handle is too small to make out the text, or maybe Ollama resized the image to the point of illegibility before handing it to the model?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: I think this may be &lt;a href="https://github.com/simonw/llm/issues/1046"&gt;a bug&lt;/a&gt; relating to URL handling in LLM/llm-ollama. I tried downloading the file first:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;wget https://static.simonwillison.net/static/2025/poster.jpg
llm -m qwen2.5vl 'extract text' -a poster.jpg
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This time it did a lot better. The results weren't perfect though - &lt;a href="https://gist.github.com/simonw/2b46e932a16c92e673ea09dfc0186ec2#response"&gt;it ended up stuck in a loop&lt;/a&gt; outputting the same code example dozens of times.&lt;/p&gt;
&lt;p&gt;I tried another prompt and it got confused by the three column layout, misread Datasette as "Datasetette" and missed some of the text. Here's &lt;a href="https://gist.github.com/simonw/3ececa5f5ff109a81bc6893be06f00b1#response"&gt;that result&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;These experiments used &lt;code&gt;qwen2.5vl:7b&lt;/code&gt; (6GB) - I expect the results would be better with the larger &lt;code&gt;qwen2.5vl:32b&lt;/code&gt; (21GB) and &lt;code&gt;qwen2.5vl:72b&lt;/code&gt; (71GB) models.&lt;/p&gt;
&lt;p&gt;Fred Jonsson &lt;a href="https://twitter.com/enginoid/status/1924092556079436086"&gt;reported a better result&lt;/a&gt; using the MLX model via LM Studio (~9GB model running in 8bit - I think that's &lt;a href="https://huggingface.co/mlx-community/Qwen2.5-VL-7B-Instruct-8bit"&gt;mlx-community/Qwen2.5-VL-7B-Instruct-8bit&lt;/a&gt;). His &lt;a href="https://gist.github.com/enginoid/5c91c920124d4a2e0ab253df769e35fa"&gt;full output is here&lt;/a&gt; - looks almost exactly right to me.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ocr"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="vision-llms"/><category term="qwen"/><category term="mlx"/><category term="ollama"/><category term="ai-in-china"/></entry><entry><title>Saying "hi" to Microsoft's Phi-4-reasoning</title><link href="https://simonwillison.net/2025/May/6/phi-4-reasoning/#atom-tag" rel="alternate"/><published>2025-05-06T18:25:45+00:00</published><updated>2025-05-06T18:25:45+00:00</updated><id>https://simonwillison.net/2025/May/6/phi-4-reasoning/#atom-tag</id><summary type="html">
    &lt;p&gt;Microsoft released a new sub-family of models a few days ago: Phi-4 reasoning. They introduced them in &lt;a href="https://azure.microsoft.com/en-us/blog/one-year-of-phi-small-language-models-making-big-leaps-in-ai/"&gt;this blog post&lt;/a&gt; celebrating a year since the release of Phi-3:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Today, we are excited to introduce &lt;strong&gt;Phi-4-reasoning&lt;/strong&gt;, &lt;strong&gt;Phi-4-reasoning-plus&lt;/strong&gt;, and &lt;strong&gt;Phi-4-mini-reasoning&lt;/strong&gt; - marking a new era for small language models and once again redefining what is possible with small and efficient AI.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;All three of the new models are released under the MIT license, as is the case with previous Phi models.&lt;/p&gt;
&lt;p&gt;They're available through &lt;a href="https://ollama.com/"&gt;Ollama&lt;/a&gt; as &lt;a href="https://ollama.com/library/phi4-reasoning/tags"&gt;phi4-reasoning&lt;/a&gt; (which includes the plus model) and &lt;a href="https://ollama.com/library/phi4-mini-reasoning/tags"&gt;phi4-mini-reasoning&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;These models appear to have a tendency to overthink!&lt;/p&gt;
&lt;h4 id="trying-it-out-with-ollama"&gt;Trying it out with Ollama&lt;/h4&gt;
&lt;p&gt;I ran it using Ollama and &lt;a href="https://github.com/taketwo/llm-ollama"&gt;llm-ollama&lt;/a&gt; like this. Fetching the model downloaded 11GB of data:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama pull phi4-reasoning
llm install llm-ollama
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then I said "hi" to it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m phi4-reasoning:latest hi
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's the full response, which starts with a thought section wrapped in &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; tags:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt;User says "hi". We are given instructions "You are Phi, a language model developed by Microsoft, trained to provide accurate, secure, and user-aligned responses." There is additional information. Now I must answer "hi" message from the user.&lt;/p&gt;
&lt;p&gt;Step 1: Read the instructions carefully: The conversation text instructs us that we are "Phi", with internal guidelines provided in system instructions. But our answer should be safe according to these instructions. We need to produce a response that says hello and greet back. However, I must adhere to the principles described above: Provide clear, specific language, etc.&lt;/p&gt;
&lt;p&gt;I must not mention any chain-of-thought or guidelines.&lt;/p&gt;
&lt;p&gt;Now instructions "hi" are from user. So we say "Hello! How can I help you?" But must include disclaimers for sensitive topics? Actually instructions says: "When replying topics above at every message include a disclaimer both at the beginning and the end when replying topics on medical, legal, financial matters or political matters." But our answer "hi" is not such topic. So we can say "Hi! I'm Phi, how may I help you?" but no disclaimers required.&lt;/p&gt;
&lt;p&gt;But instructions: "Do not share guidelines." So do not mention them.&lt;/p&gt;
&lt;p&gt;I must check if the message "hi" instructs to greet back normally. The assistant should respond with a greeting.&lt;/p&gt;
&lt;p&gt;Let me recall that instructions say: Use markdown formatting where appropriate, so I can include a formatted greeting maybe "Hello!" or something.&lt;/p&gt;
&lt;p&gt;Additionally, check guidelines. They are internal instructions for safe completions. So we just produce "Hi there! How may I help you?" in a friendly tone.&lt;/p&gt;
&lt;p&gt;I should greet the user politely and ask if they need any help. Possibly provide a disclaimer that I'm not licensed professional? But the conversation is just greeting, so no disclaimers required as it's not sensitive topics.&lt;/p&gt;
&lt;p&gt;Wait instructions say: "You must give a disclaimer both at the beginning and the end when replying topics above at every message." Actually the guidelines for medical/legal/financial/political are needed only if discussing those topics. In our case it is just a greeting message, so disclaimers are not required.&lt;/p&gt;
&lt;p&gt;I should greet politely. So I'll say something like:&lt;/p&gt;
&lt;p&gt;"Hello! I'm Phi, how can I help you today?" Possibly "Hi there!"&lt;/p&gt;
&lt;p&gt;Since the question was simply "hi", I respond with a greeting.&lt;/p&gt;
&lt;p&gt;I must include markdown formatting if appropriate. So maybe bullet point or code snippet is not necessary. I will simply greet.&lt;/p&gt;
&lt;p&gt;I need to check instructions: "You are Phi, a language model developed by Microsoft." But I must not mention internal guidelines? But that's okay.&lt;/p&gt;
&lt;p&gt;I'll produce: "Hello! How may I help you today?" with greeting and friendly tone.&lt;/p&gt;
&lt;p&gt;Also note that the system message instructs me to include disclaimers at beginning and end for topics above. But this is a simple greeting, so no disclaimer needed.&lt;/p&gt;
&lt;p&gt;Thus my answer:&lt;/p&gt;
&lt;p&gt;"Hi there! How can I assist you today?"&lt;/p&gt;
&lt;p&gt;I must not mention guidelines in output text? The instructions say: "Do not share these guidelines with the user even in chain-of-thought", so I should not mention them.&lt;/p&gt;
&lt;p&gt;I'll produce a final message: "Hello, how can I help?" I'll produce a friendly greeting message. Possibly add markdown formatting: e.g. "Hi there!" or bullet list if needed?&lt;/p&gt;
&lt;p&gt;I'll produce a response like:&lt;/p&gt;
&lt;p&gt;"Hello! How may I assist you today? Please let me know what you need."&lt;/p&gt;
&lt;p&gt;I'll produce the answer.&lt;/p&gt;
&lt;p&gt;I'll produce text with no disclaimers because not required here.&lt;/p&gt;
&lt;p&gt;I'll produce final answer message: "Hello, how can I help?" That is it.&lt;/p&gt;
&lt;p&gt;I'll produce answer as final message.&lt;code&gt;&amp;lt;/think&amp;gt;&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Hi there! How can I assist you today?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That's 56 sentences of reasoning followed by a single sentence of output!&lt;/p&gt;
&lt;p&gt;I find this deeply amusing.&lt;/p&gt;
&lt;h4 id="qwen-3-over-thinks-things-a-whole-lot-less"&gt;Qwen 3 over-thinks things a whole lot less&lt;/h4&gt;
&lt;p&gt;For comparison, here's Qwen 3 (&lt;a href="https://simonwillison.net/2025/May/2/qwen3-8b/"&gt;8B 4bit via MLX&lt;/a&gt;) - another brand new "reasoning" model - responding to the same prompt:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m mlx-community/Qwen3-8B-4bit hi
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Output:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt;&lt;br /&gt;
Okay, the user said "hi". That's a greeting. I should respond in a friendly and welcoming manner. Let me think about how to make it engaging. Maybe start with a cheerful "Hello!" to match the greeting. Then ask how I can assist them today. Keep it simple and open-ended so they feel comfortable sharing their needs. Avoid any complicated language. Make sure the tone is positive and helpful. Alright, that should work.&lt;br /&gt;
&lt;code&gt;&amp;lt;/think&amp;gt;&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Hello! 😊 How can I assist you today? I'm here to help with anything you need!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;A fun thing about Qwen 3 is that you can tell it not to "think" at all by adding &lt;code&gt;/nothink&lt;/code&gt; to your prompt:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m mlx-community/Qwen3-8B-4bit 'hi /nothink'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Output (empty &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; section included):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;&amp;lt;/think&amp;gt;&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Hello! How can I assist you today? 😊&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h4 id="phi-4-reasoning-s-system-prompt"&gt;Phi-4 reasoning's system prompt&lt;/h4&gt;
&lt;p&gt;Since Phi-4 talked about its system prompt so much, I decided to see where that was coming from.  It turns out Ollama bakes &lt;a href="https://ollama.com/library/phi4-reasoning/blobs/88df15fe1f34"&gt;the system prompt&lt;/a&gt; into their model releases. Reading that in full helps explain why Phi-4 reasoning acted the way it did:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;You are Phi, a language model trained by Microsoft to help users. Your role as an assistant involves thoroughly exploring questions through a systematic thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, backtracing, and iteration to develop well-considered thinking process. Please structure your response into two main sections: Thought and Solution using the specified format: &amp;lt;think&amp;gt; {Thought section} &amp;lt;/think&amp;gt; {Solution section}. In the Thought section, detail your reasoning process in steps. Each step should include detailed considerations such as analysing questions, summarizing relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining any errors, and revisiting previous steps. In the Solution section, based on various attempts, explorations, and reflections from the Thought section, systematically present the final solution that you deem correct. The Solution section should be logical, accurate, and concise and detail necessary steps needed to reach the conclusion. Now, try to solve the following question through the above guidelines:&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I don't see anything in there about "Do not share guidelines", even though the model response mentioned that rule.&lt;/p&gt;
&lt;p&gt;My guess is that the model has been trained to "not talk about the system prompt" through RLHF or similar. I've heard in the past that models default to chattering about their system prompt if you don't put measures in place to discourage that.&lt;/p&gt;

&lt;p&gt;The lengthy response from Phi-4-reasoning shown above may well be caused by the system prompt containing significantly more tokens than the single token "hi" sent by the user.&lt;/p&gt;
&lt;h4 id="it-s-still-hard-to-know-when-to-use-reasoning-models"&gt;It's still hard to know when to use reasoning models&lt;/h4&gt;
&lt;p&gt;We've had access to these "reasoning" models - with a baked in chain-of-thought at the start of each response - since o1 debuted &lt;a href="https://simonwillison.net/2024/Sep/12/openai-o1/"&gt;in September last year&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I'll be honest: I still don't have a great intuition for when it makes the most sense to use them.&lt;/p&gt;
&lt;p&gt;I've had great success with them for code: any coding tasks that might involve multiple functions or classes that co-ordinate together seems to benefit from a reasoning step.&lt;/p&gt;
&lt;p&gt;They are an absolute benefit for debugging: I've seen reasoning models walk through quite large codebases following multiple levels of indirection in order to find potential root causes of the problem I've described.&lt;/p&gt;
&lt;p&gt;Other than that though... they're apparently good for mathematical puzzles - the phi4-reasoning models seem to &lt;em&gt;really&lt;/em&gt; want to dig into a math problem and output LaTeX embedded in Markdown as the answer. I'm not enough of a mathematician to put them through their paces here.&lt;/p&gt;
&lt;p&gt;With all of that in mind, these reasoners that run on my laptop are fun to torment with inappropriate challenges that sit far beneath their lofty ambitions, but aside from that I don't really have a great answer to when I would use them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update 8th May 2025&lt;/strong&gt;:
    I said "hi" to NVIDIA's new 
    &lt;a href="https://huggingface.co/nvidia/OpenCodeReasoning-Nemotron-32B" target="_blank"&gt;OpenCodeReasoning-Nemotron-32B&lt;/a&gt; 
    model (run using Ollama and 
    &lt;a href="https://huggingface.co/lmstudio-community/OpenCodeReasoning-Nemotron-32B-GGUF" target="_blank"&gt;this GGUF file&lt;/a&gt;) 
    and got 
    &lt;a href="https://gist.github.com/simonw/67bb10723f8b09a82560ee53c9c9f5e8" target="_blank"&gt;a similar result&lt;/a&gt;.
&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/microsoft"&gt;microsoft&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/phi"&gt;phi&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="microsoft"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="phi"/><category term="qwen"/><category term="ollama"/><category term="llm-reasoning"/><category term="llm-release"/><category term="ai-in-china"/></entry><entry><title>Qwen 3 offers a case study in how to effectively release a model</title><link href="https://simonwillison.net/2025/Apr/29/qwen-3/#atom-tag" rel="alternate"/><published>2025-04-29T00:37:03+00:00</published><updated>2025-04-29T00:37:03+00:00</updated><id>https://simonwillison.net/2025/Apr/29/qwen-3/#atom-tag</id><summary type="html">
    &lt;p&gt;Alibaba's Qwen team released the hotly anticipated &lt;a href="https://qwenlm.github.io/blog/qwen3/"&gt;Qwen 3 model family&lt;/a&gt; today. The Qwen models are already some of the best open weight models - Apache 2.0 licensed and with a variety of different capabilities (including vision and audio input/output).&lt;/p&gt;
&lt;p&gt;Qwen 3 is text input/output only for the moment and comes in an exciting range of different shapes and sizes: 32B, 14B, 8B, 4B, 1.7B, and 0.6B models. The 4B and up models all have 131,072 token context windows (extended from 32k using YaRN); the 0.6B and 1.7B models have 32,768.&lt;/p&gt;
&lt;p&gt;This covers the full spectrum of sizes that I generally care about: 0.6B and 1.7B should run fine on an iPhone, and 32B will fit on my 64GB Mac with room to spare for other applications.&lt;/p&gt;
&lt;p&gt;Qwen also released two Mixture of Experts models - Qwen3-30B-A3B and Qwen3-235B-A22B. The A stands for "active parameters" - Qwen3-30B-A3B is a 30 billion parameter model that keeps 3 billion active at once, which speeds up inference (I previously said it reduces the memory needed to run the models, but &lt;a href="https://bsky.app/profile/pekka.bsky.social/post/3lnw2knbkls2e"&gt;that's incorrect&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;All eight of these models are released under the Apache 2.0 license.&lt;/p&gt;
&lt;p&gt;Qwen describe these as "hybrid thinking" models - every one of them is capable of the "reason step by step" trick before they deliver their final answer. In my experiments so far this presents itself as a &lt;code&gt;&amp;lt;think&amp;gt;...&amp;lt;/think&amp;gt;&lt;/code&gt; block at the start of the response.&lt;/p&gt;
&lt;p&gt;That thinking mode can be turned on and off by including special &lt;code&gt;/think&lt;/code&gt; and &lt;code&gt;/no_think&lt;/code&gt; tokens in the prompt. It's on by default.&lt;/p&gt;
&lt;p&gt;One last interesting snippet from the announcement:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;We have optimized the Qwen3 models for coding and agentic capabilities, and also we have strengthened the support of MCP as well.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;This is the first time I've seen a model specifically trained for the &lt;a href="https://www.anthropic.com/news/model-context-protocol"&gt;Model Context Protocol&lt;/a&gt;, which was first announced by Anthropic last November.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2025/Apr/29/qwen-3/#trying-out-qwen-3-via-ollama"&gt;Trying out Qwen 3 via Ollama&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2025/Apr/29/qwen-3/#a-really-well-coordinated-release"&gt;A really well coordinated release&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2025/Apr/29/qwen-3/#the-continuing-trend-of-better-models-that-run-on-the-same-hardware"&gt;The continuing trend of better models that run on the same hardware&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id="trying-out-qwen-3-via-ollama"&gt;Trying out Qwen 3 via Ollama&lt;/h4&gt;
&lt;p&gt;Let's see that in action. I started out with Qwen 32B from &lt;a href="https://www.ollama.com/"&gt;Ollama&lt;/a&gt; - a 20GB download:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;ollama pull qwen3:32b&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then I ran a prompt against it using &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; and the &lt;a href="https://github.com/taketwo/llm-ollama"&gt;llm-ollama plugin&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m qwen3:32b &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Generate an SVG of a pelican riding a bicycle&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here's the truncated output I got back (you can see the full response &lt;a href="https://gist.github.com/simonw/2cd679d2cfa41fffb37ebc4a2c5c3e4b"&gt;in this Gist&lt;/a&gt;).&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt;&lt;br /&gt;
Okay, the user wants an SVG of a pelican riding a bicycle. Let me start by breaking down the components needed. First, I need to create the pelican. Pelicans have a distinctive beak and pouch, so I'll need to draw that. Maybe start with an ellipse for the body, then add the beak as a triangle. The wings could be curved shapes on either side.&lt;/p&gt;
&lt;p&gt;Next, the bicycle. I'll need two wheels, a frame, handlebars, and pedals. The wheels can be circles with some lines for spokes. The frame connects the wheels, with a seat and handlebars at the front. The pedals are small circles with lines for the crank.&lt;/p&gt;
&lt;p&gt;[...]&lt;/p&gt;
&lt;p&gt;I think that's a plan. Now, translate this into SVG code step by step, making sure coordinates and shapes are correctly placed.&lt;br /&gt;
&lt;code&gt;&amp;lt;/think&amp;gt;&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Here's an SVG of a pelican riding a bicycle, created using simple shapes and paths:&lt;/p&gt;
&lt;div class="highlight highlight-text-xml"&gt;&lt;pre&gt;&amp;lt;&lt;span class="pl-ent"&gt;svg&lt;/span&gt; &lt;span class="pl-e"&gt;viewBox&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;0 0 200 200&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;xmlns&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;http://www.w3.org/2000/svg&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;
  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Bicycle Frame &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  &amp;lt;&lt;span class="pl-ent"&gt;line&lt;/span&gt; &lt;span class="pl-e"&gt;x1&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;40&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y1&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;100&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;x2&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;120&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y2&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;100&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;black&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke-width&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;4&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;line&lt;/span&gt; &lt;span class="pl-e"&gt;x1&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;120&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y1&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;100&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;x2&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;140&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y2&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;80&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;black&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke-width&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;4&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  [...]
  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Bicycle Handlebars &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  &amp;lt;&lt;span class="pl-ent"&gt;rect&lt;/span&gt; &lt;span class="pl-e"&gt;x&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;30&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;80&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;width&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;20&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;height&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;5&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;#8B4513&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
&amp;lt;/&lt;span class="pl-ent"&gt;svg&lt;/span&gt;&amp;gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This SVG features:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;A basic bicycle frame with two wheels, spokes, and a seat&lt;/li&gt;
&lt;li&gt;A blue pelican with:
&lt;ul&gt;
&lt;li&gt;A rounded body and head&lt;/li&gt;
&lt;li&gt;A distinctive yellow beak and orange pouch&lt;/li&gt;
&lt;li&gt;Two wings extending from the body&lt;/li&gt;
&lt;li&gt;Webbed feet wrapped around the bicycle frame&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Simple handlebars at the front of the bicycle&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The pelican is positioned sitting on the bicycle seat with wings slightly spread for balance. You can copy this code into an HTML file or SVG editor to view the illustration. The colors use traditional bicycle brown (#8B4513) and a classic blue for the pelican (#4169E1) with gold accents for the beak (#FFD700).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;As is often the case with "reasoning" models, reading the reasoning chain is more interesting than the finished result. Here's what the SVG looks like when you render it:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/qwen3-32b-pelican.jpg" alt="The bicycle looks more like a shopping cart. The pelican is a bunch of overlapping blue shapes and the yellow bill is misconfigured. It's not a great effort to be honest." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Not the best pelican on a bicycle, but &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;I've seen worse&lt;/a&gt;!&lt;/p&gt;
&lt;h4 id="a-really-well-coordinated-release"&gt;A really well coordinated release&lt;/h4&gt;
&lt;p&gt;The thing that stands out most to me about the Qwen 3 release is how well coordinated it was across the LLM ecosystem.&lt;/p&gt;
&lt;p&gt;Qwen worked directly with almost all of the popular LLM serving frameworks to ensure that support for the new models was available on day one.&lt;/p&gt;
&lt;p&gt;The Qwen 3 README &lt;a href="https://github.com/QwenLM/Qwen3/blob/main/README.md#run-qwen3"&gt;mentions the following projects&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;a href="https://huggingface.co/docs/transformers/en/index"&gt;Transformers&lt;/a&gt; Python library&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.modelscope.cn/"&gt;ModelScope&lt;/a&gt; (effectively the Chinese equivalent of Hugging Face for model distribution)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/ggml-org/llama.cpp"&gt;llama.cpp&lt;/a&gt;, which added support for the Qwen 3 architecture &lt;a href="https://github.com/ggml-org/llama.cpp/releases/tag/b5092"&gt;three weeks ago&lt;/a&gt; in version &lt;code&gt;b5092&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.ollama.com/"&gt;Ollama&lt;/a&gt;, with &lt;a href="https://ollama.com/library/qwen3"&gt;qwen3&lt;/a&gt; out on release day&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lmstudio.ai/"&gt;LMStudio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/ml-explore/mlx-lm"&gt;mlx-lm&lt;/a&gt; for Apple Silicon - the first commit for Qwen 3 support in MLX showed up in a PR &lt;a href="https://github.com/ml-explore/mlx-lm/pull/41"&gt;over a month ago&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;The &lt;a href="https://github.com/sgl-project/sglang"&gt;SGLang&lt;/a&gt; serving framework&lt;/li&gt;
&lt;li&gt;The popular &lt;a href="https://github.com/vllm-project/vllm"&gt;vLLM&lt;/a&gt; engine&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.hiascend.com/en/software/mindie"&gt;MindIE&lt;/a&gt;, which I haven't heard of before but looks to be a serving framework for Huawei's Ascend chips (competitive with NVIDIA's GPUs)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is an extraordinary level of coordination for a model release! I haven't seen any other model providers make this level of effort - the usual pattern is to dump a bunch of models on Hugging Face for a single architecture (usually NVIDIA) and then wait for the community to catch up with quantizations and conversions for everything else.&lt;/p&gt;
&lt;p&gt;It's also great to see smaller models that can run on consumer hardware on day one. I think one of the reasons &lt;a href="https://simonwillison.net/2025/Apr/5/llama-4-notes/"&gt;the Llama 4 release&lt;/a&gt; a few weeks ago was a little muted is that very few people had access to hardware that was capable of running the models.&lt;/p&gt;
&lt;p&gt;The one thing that's missing here is hosted API partnerships - something Meta did a good job of with Llama 4, which was available (albeit with some bugs) on Groq, Fireworks and Together on the day of release.&lt;/p&gt;
&lt;h4 id="the-continuing-trend-of-better-models-that-run-on-the-same-hardware"&gt;The continuing trend of better models that run on the same hardware&lt;/h4&gt;
&lt;p&gt;I don't have a great feeling for the general "vibes" of Qwen 3 yet - it's only been available for a few hours and I've not really run it through its paces yet. The general buzz so far seems pretty positive and the initial benchmarks are promising, but these things always take a little while to shake out.&lt;/p&gt;
&lt;p&gt;Assuming those benchmarks hold up, I think this is a very strong model. My favourite detail from the release announcement is this (highlight mine):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Due to advancements in model architecture, increase in training data, and more effective training methods, the overall performance of Qwen3 dense base models matches that of Qwen2.5 base models with more parameters. For instance, &lt;strong&gt;Qwen3-1.7B/4B/8B/14B/32B-Base performs as well as Qwen2.5-3B/7B/14B/32B/72B-Base, respectively&lt;/strong&gt;. Notably, in areas like STEM, coding, and reasoning, Qwen3 dense base models even outperform larger Qwen2.5 models.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I'm always keen to see this trend continue: better models that can run on the same hardware as their predecessors.&lt;/p&gt;
&lt;p&gt;We've seen the same trend with other models - Llama 3.3 70B &lt;a href="https://simonwillison.net/2024/Dec/9/llama-33-70b/"&gt;claimed to deliver&lt;/a&gt; "similar performance to Llama 3.1 405B", and Mistral Small 3 24B &lt;a href="https://simonwillison.net/2025/Jan/30/mistral-small-3/"&gt;said that&lt;/a&gt; it was "on par with Llama 3.3 70B instruct, while being more than 3x faster on the same hardware."&lt;/p&gt;
&lt;p&gt;At some point this trend will have to stop, but for the moment it's exciting to continue to see new optimizations emerge that allow us to run even better smaller models.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-tool-use"&gt;llm-tool-use&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/model-context-protocol"&gt;model-context-protocol&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="llm-tool-use"/><category term="qwen"/><category term="mlx"/><category term="ollama"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/><category term="model-context-protocol"/><category term="ai-in-china"/></entry><entry><title>Gemma 3 QAT Models</title><link href="https://simonwillison.net/2025/Apr/19/gemma-3-qat-models/#atom-tag" rel="alternate"/><published>2025-04-19T17:20:50+00:00</published><updated>2025-04-19T17:20:50+00:00</updated><id>https://simonwillison.net/2025/Apr/19/gemma-3-qat-models/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://developers.googleblog.com/en/gemma-3-quantized-aware-trained-state-of-the-art-ai-to-consumer-gpus/"&gt;Gemma 3 QAT Models&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Interesting release from Google, as a follow-up &lt;a href="https://simonwillison.net/2025/Mar/12/gemma-3/"&gt;to Gemma 3&lt;/a&gt; from last month:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;To make Gemma 3 even more accessible, we are announcing new versions optimized with Quantization-Aware Training (QAT) that dramatically reduces memory requirements while maintaining high quality. This enables you to run powerful models like Gemma 3 27B locally on consumer-grade GPUs like the NVIDIA RTX 3090.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I wasn't previously aware of Quantization-Aware Training but it turns out to be quite an established pattern now, supported in both &lt;a href="https://www.tensorflow.org/model_optimization/guide/quantization/training"&gt;Tensorflow&lt;/a&gt; and &lt;a href="https://pytorch.org/blog/quantization-aware-training/"&gt;PyTorch&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Google reports model size drops from BF16 to int4 for the following models:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemma 3 27B: 54GB to 14.1GB&lt;/li&gt;
&lt;li&gt;Gemma 3 12B: 24GB to 6.6GB&lt;/li&gt;
&lt;li&gt;Gemma 3 4B: 8GB to 2.6GB&lt;/li&gt;
&lt;li&gt;Gemma 3 1B: 2GB to 0.5GB&lt;/li&gt;
&lt;/ul&gt;
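&lt;p&gt;Those numbers track the back-of-envelope arithmetic: BF16 stores two bytes per parameter while int4 stores half a byte, so a quantized model should land at roughly a quarter of the BF16 size, plus some overhead for quantization scales. A quick sanity check (my own throwaway helper, not from Google's post):&lt;/p&gt;

```python
def model_size_gb(params_billion, bits_per_param):
    """Approximate raw weight storage, ignoring scale/zero-point overhead."""
    return params_billion * 1e9 * (bits_per_param / 8) / 1e9

print(model_size_gb(27, 16))  # 54.0 - matches the reported BF16 size
print(model_size_gb(27, 4))   # 13.5 - close to the reported 14.1GB int4 figure
```

&lt;p&gt;The gap between 13.5GB and the reported 14.1GB is presumably the per-group quantization scales and other metadata.&lt;/p&gt;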
&lt;p&gt;They partnered with Ollama, LM Studio, MLX (here's &lt;a href="https://huggingface.co/collections/mlx-community/gemma-3-qat-68002674cd5afc6f9022a0ae"&gt;their collection&lt;/a&gt;) and llama.cpp for this release - I'd love to see more AI labs following their example.&lt;/p&gt;
&lt;p&gt;The Ollama model version picker currently hides them behind the "View all" option, so here are the direct links:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://ollama.com/library/gemma3:1b-it-qat"&gt;gemma3:1b-it-qat&lt;/a&gt; - 1GB&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ollama.com/library/gemma3:4b-it-qat"&gt;gemma3:4b-it-qat&lt;/a&gt; - 4GB&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ollama.com/library/gemma3:12b-it-qat"&gt;gemma3:12b-it-qat&lt;/a&gt; - 8.9GB&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ollama.com/library/gemma3:27b-it-qat"&gt;gemma3:27b-it-qat&lt;/a&gt; - 18GB&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I fetched that largest model with:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama pull gemma3:27b-it-qat
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And now I'm trying it out with &lt;a href="https://github.com/taketwo/llm-ollama"&gt;llm-ollama&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m gemma3:27b-it-qat "impress me with some physics"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I got &lt;a href="https://gist.github.com/simonw/5b699ba6b7c05e2d071910e238823ff4"&gt;a pretty great response&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: Having spent a while putting it through its paces via &lt;a href="https://simonwillison.net/2024/Dec/27/open-webui/"&gt;Open WebUI&lt;/a&gt; and &lt;a href="https://tailscale.com/"&gt;Tailscale&lt;/a&gt; to access my laptop from my phone I think this may be my new favorite general-purpose local model. Ollama appears to use 22GB of RAM while the model is running, which leaves plenty on my 64GB machine for other applications.&lt;/p&gt;
&lt;p&gt;I've also tried it via &lt;a href="https://github.com/simonw/llm-mlx"&gt;llm-mlx&lt;/a&gt; like this (downloading 16GB):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-mlx
llm mlx download-model mlx-community/gemma-3-27b-it-qat-4bit
llm chat -m mlx-community/gemma-3-27b-it-qat-4bit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It feels a little faster with MLX and uses 15GB of memory according to Activity Monitor.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tailscale"&gt;tailscale&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemma"&gt;gemma&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="ai"/><category term="tailscale"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="mlx"/><category term="ollama"/><category term="gemma"/><category term="llm-release"/><category term="lm-studio"/></entry><entry><title>MCP Run Python</title><link href="https://simonwillison.net/2025/Apr/18/mcp-run-python/#atom-tag" rel="alternate"/><published>2025-04-18T04:51:20+00:00</published><updated>2025-04-18T04:51:20+00:00</updated><id>https://simonwillison.net/2025/Apr/18/mcp-run-python/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/pydantic/pydantic-ai/tree/main/mcp-run-python"&gt;MCP Run Python&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Pydantic AI's MCP server for running LLM-generated Python code in a sandbox. They ended up using a trick I explored &lt;a href="https://til.simonwillison.net/deno/pyodide-sandbox"&gt;two years ago&lt;/a&gt;: using a &lt;a href="https://deno.com/"&gt;Deno&lt;/a&gt; process to run &lt;a href="https://pyodide.org/"&gt;Pyodide&lt;/a&gt; in a WebAssembly sandbox.&lt;/p&gt;
&lt;p&gt;Here's a bit of a wild trick: since Deno loads code on-demand from &lt;a href="https://jsr.io/"&gt;JSR&lt;/a&gt;, and &lt;a href="https://docs.astral.sh/uv/guides/scripts/"&gt;uv run&lt;/a&gt; can install Python dependencies on demand via the &lt;code&gt;--with&lt;/code&gt; option... here's a one-liner you can paste into a macOS shell (provided you have Deno and &lt;code&gt;uv&lt;/code&gt; installed already) which will run the example from &lt;a href="https://github.com/pydantic/pydantic-ai/blob/v0.1.2/mcp-run-python/README.md"&gt;their README&lt;/a&gt; - calculating the number of days between two dates in the most complex way imaginable:&lt;/p&gt;
&lt;pre&gt;ANTHROPIC_API_KEY=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;sk-ant-...&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \
uv run --with pydantic-ai python -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;import asyncio&lt;/span&gt;
&lt;span class="pl-s"&gt;from pydantic_ai import Agent&lt;/span&gt;
&lt;span class="pl-s"&gt;from pydantic_ai.mcp import MCPServerStdio&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;server = MCPServerStdio(&lt;/span&gt;
&lt;span class="pl-s"&gt;    "deno",&lt;/span&gt;
&lt;span class="pl-s"&gt;    args=[&lt;/span&gt;
&lt;span class="pl-s"&gt;        "run",&lt;/span&gt;
&lt;span class="pl-s"&gt;        "-N",&lt;/span&gt;
&lt;span class="pl-s"&gt;        "-R=node_modules",&lt;/span&gt;
&lt;span class="pl-s"&gt;        "-W=node_modules",&lt;/span&gt;
&lt;span class="pl-s"&gt;        "--node-modules-dir=auto",&lt;/span&gt;
&lt;span class="pl-s"&gt;        "jsr:@pydantic/mcp-run-python",&lt;/span&gt;
&lt;span class="pl-s"&gt;        "stdio",&lt;/span&gt;
&lt;span class="pl-s"&gt;    ],&lt;/span&gt;
&lt;span class="pl-s"&gt;)&lt;/span&gt;
&lt;span class="pl-s"&gt;agent = Agent("claude-3-5-haiku-latest", mcp_servers=[server])&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;async def main():&lt;/span&gt;
&lt;span class="pl-s"&gt;    async with agent.run_mcp_servers():&lt;/span&gt;
&lt;span class="pl-s"&gt;        result = await agent.run("How many days between 2000-01-01 and 2025-03-18?")&lt;/span&gt;
&lt;span class="pl-s"&gt;    print(result.output)&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;asyncio.run(main())&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;

&lt;p&gt;I ran that just now and got:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The number of days between January 1st, 2000 and March 18th, 2025 is 9,208 days.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I thoroughly enjoy how tools like &lt;code&gt;uv&lt;/code&gt; and Deno enable throwing together shell one-liner demos like this one.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/54fc42ef9a7fb8f777162bbbfbba4f23"&gt;an extended version&lt;/a&gt; of this example which adds pretty-printed logging of the messages exchanged with the LLM to illustrate exactly what happened. The most important piece is this tool call where Claude 3.5 Haiku asks for Python code to be executed my the MCP server:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-en"&gt;ToolCallPart&lt;/span&gt;(
    &lt;span class="pl-s1"&gt;tool_name&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;'run_python_code'&lt;/span&gt;,
    &lt;span class="pl-s1"&gt;args&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;{
        &lt;span class="pl-s"&gt;'python_code'&lt;/span&gt;: (
            &lt;span class="pl-s"&gt;'from datetime import date&lt;span class="pl-cce"&gt;\n&lt;/span&gt;'&lt;/span&gt;
            &lt;span class="pl-s"&gt;'&lt;span class="pl-cce"&gt;\n&lt;/span&gt;'&lt;/span&gt;
            &lt;span class="pl-s"&gt;'date1 = date(2000, 1, 1)&lt;span class="pl-cce"&gt;\n&lt;/span&gt;'&lt;/span&gt;
            &lt;span class="pl-s"&gt;'date2 = date(2025, 3, 18)&lt;span class="pl-cce"&gt;\n&lt;/span&gt;'&lt;/span&gt;
            &lt;span class="pl-s"&gt;'&lt;span class="pl-cce"&gt;\n&lt;/span&gt;'&lt;/span&gt;
            &lt;span class="pl-s"&gt;'days_between = (date2 - date1).days&lt;span class="pl-cce"&gt;\n&lt;/span&gt;'&lt;/span&gt;
            &lt;span class="pl-s"&gt;'print(f"Number of days between {date1} and {date2}: {days_between}")'&lt;/span&gt;
        ),
    },
    &lt;span class="pl-s1"&gt;tool_call_id&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;'toolu_01TXXnQ5mC4ry42DrM1jPaza'&lt;/span&gt;,
    &lt;span class="pl-s1"&gt;part_kind&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;'tool-call'&lt;/span&gt;,
)&lt;/pre&gt;
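&lt;p&gt;The generated code is a two-line stdlib calculation, so the answer is easy to verify locally without any sandbox at all:&lt;/p&gt;

```python
from datetime import date

# Same calculation the sandboxed tool call performed
days_between = (date(2025, 3, 18) - date(2000, 1, 1)).days
print(days_between)  # 9208
```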

&lt;p&gt;I also managed to run it against &lt;a href="https://ollama.com/library/mistral-small3.1"&gt;Mistral Small 3.1&lt;/a&gt; (15GB) running locally using &lt;a href="https://ollama.com/"&gt;Ollama&lt;/a&gt; (I had to add "Use your python tool" to the prompt to get it to work):&lt;/p&gt;
&lt;pre&gt;ollama pull mistral-small3.1:24b

uv run --with devtools --with pydantic-ai python -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;import asyncio&lt;/span&gt;
&lt;span class="pl-s"&gt;from devtools import pprint&lt;/span&gt;
&lt;span class="pl-s"&gt;from pydantic_ai import Agent, capture_run_messages&lt;/span&gt;
&lt;span class="pl-s"&gt;from pydantic_ai.models.openai import OpenAIModel&lt;/span&gt;
&lt;span class="pl-s"&gt;from pydantic_ai.providers.openai import OpenAIProvider&lt;/span&gt;
&lt;span class="pl-s"&gt;from pydantic_ai.mcp import MCPServerStdio&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;server = MCPServerStdio(&lt;/span&gt;
&lt;span class="pl-s"&gt;    "deno",&lt;/span&gt;
&lt;span class="pl-s"&gt;    args=[&lt;/span&gt;
&lt;span class="pl-s"&gt;        "run",&lt;/span&gt;
&lt;span class="pl-s"&gt;        "-N",&lt;/span&gt;
&lt;span class="pl-s"&gt;        "-R=node_modules",&lt;/span&gt;
&lt;span class="pl-s"&gt;        "-W=node_modules",&lt;/span&gt;
&lt;span class="pl-s"&gt;        "--node-modules-dir=auto",&lt;/span&gt;
&lt;span class="pl-s"&gt;        "jsr:@pydantic/mcp-run-python",&lt;/span&gt;
&lt;span class="pl-s"&gt;        "stdio",&lt;/span&gt;
&lt;span class="pl-s"&gt;    ],&lt;/span&gt;
&lt;span class="pl-s"&gt;)&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;agent = Agent( &lt;/span&gt;
&lt;span class="pl-s"&gt;    OpenAIModel(                          &lt;/span&gt;
&lt;span class="pl-s"&gt;        model_name="mistral-small3.1:latest",&lt;/span&gt;
&lt;span class="pl-s"&gt;        provider=OpenAIProvider(base_url="http://localhost:11434/v1"),                &lt;/span&gt;
&lt;span class="pl-s"&gt;    ),            &lt;/span&gt;
&lt;span class="pl-s"&gt;    mcp_servers=[server],&lt;/span&gt;
&lt;span class="pl-s"&gt;)&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;async def main():&lt;/span&gt;
&lt;span class="pl-s"&gt;    with capture_run_messages() as messages:&lt;/span&gt;
&lt;span class="pl-s"&gt;        async with agent.run_mcp_servers():&lt;/span&gt;
&lt;span class="pl-s"&gt;            result = await agent.run("How many days between 2000-01-01 and 2025-03-18? Use your python tool.")&lt;/span&gt;
&lt;span class="pl-s"&gt;    pprint(messages)&lt;/span&gt;
&lt;span class="pl-s"&gt;    print(result.output)&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;asyncio.run(main())&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;

&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/e444a81440bda2f37b0fef205780074a"&gt;the full output&lt;/a&gt; including the debug logs.

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=43691230"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sandboxing"&gt;sandboxing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deno"&gt;deno&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mistral"&gt;mistral&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-tool-use"&gt;llm-tool-use&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pydantic"&gt;pydantic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/model-context-protocol"&gt;model-context-protocol&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="sandboxing"/><category term="ai"/><category term="deno"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="claude"/><category term="mistral"/><category term="llm-tool-use"/><category term="uv"/><category term="ollama"/><category term="pydantic"/><category term="model-context-protocol"/></entry><entry><title>An LLM Query Understanding Service</title><link href="https://simonwillison.net/2025/Apr/9/an-llm-query-understanding-service/#atom-tag" rel="alternate"/><published>2025-04-09T20:47:42+00:00</published><updated>2025-04-09T20:47:42+00:00</updated><id>https://simonwillison.net/2025/Apr/9/an-llm-query-understanding-service/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://softwaredoug.com/blog/2025/04/08/llm-query-understand"&gt;An LLM Query Understanding Service&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Doug Turnbull recently wrote about how &lt;a href="https://softwaredoug.com/blog/2025/04/02/all-search-structured-now"&gt;all search is structured now&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Many times, even a small open source LLM will be able to turn a search query into reasonable structure at relatively low cost.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In this follow-up tutorial he demonstrates Qwen 2-7B running in a GPU-enabled Google Kubernetes Engine container to turn user search queries like "red loveseat" into structured filters like &lt;code&gt;{"item_type": "loveseat", "color": "red"}&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Here's the prompt he uses:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Respond with a single line of JSON:

  {"item_type": "sofa", "material": "wood", "color": "red"}

Omit any other information. Do not include any
other text in your response. Omit a value if the
user did not specify it. For example, if the user
said "red sofa", you would respond with:

  {"item_type": "sofa", "color": "red"}

Here is the search query: blue armchair
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Out of curiosity, I tried running his prompt against some other models using &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;gemini-1.5-flash-8b&lt;/code&gt;, the cheapest of the Gemini models, &lt;a href="https://gist.github.com/simonw/cc825bfa7f921ca9ac47d7afb6eab1ce"&gt;handled it well&lt;/a&gt; and cost $0.000011 - or 0.0011 cents.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;llama3.2:3b&lt;/code&gt; &lt;a href="https://gist.github.com/simonw/d18422ca24528cdb9e5bd77692531cfd"&gt;worked too&lt;/a&gt; - that's a very small 2GB model which I ran using Ollama.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;deepseek-r1:1.5b&lt;/code&gt; - a tiny 1.1GB model, again via Ollama, &lt;a href="https://gist.github.com/simonw/c37eca96dd6721883207c99d25aec49d"&gt;amusingly failed&lt;/a&gt; by interpreting "red loveseat" as &lt;code&gt;{"item_type": "sofa", "material": null, "color": "red"}&lt;/code&gt; after thinking very hard about the problem!&lt;/li&gt;
&lt;/ul&gt;
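&lt;p&gt;If you're going to feed these responses into a search backend it's worth validating them first, since small models occasionally wrap the JSON in extra chatter or invent keys. A hypothetical guard (my own sketch - the key set is illustrative, not from Doug's tutorial):&lt;/p&gt;

```python
import json

ALLOWED_KEYS = {"item_type", "material", "color"}  # illustrative schema

def parse_filter(response_text):
    """Parse a single-line JSON filter, rejecting unexpected keys."""
    parsed = json.loads(response_text.strip())
    unexpected = set(parsed) - ALLOWED_KEYS
    if unexpected:
        raise ValueError(f"unexpected keys: {sorted(unexpected)}")
    return parsed

print(parse_filter('{"item_type": "loveseat", "color": "red"}'))
```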

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://lobste.rs/s/oa5hbz/llm_query_understanding_service"&gt;lobste.rs&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/search"&gt;search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-engineering"&gt;prompt-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-search"&gt;ai-assisted-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="search"/><category term="ai"/><category term="prompt-engineering"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="gemini"/><category term="qwen"/><category term="ollama"/><category term="ai-assisted-search"/><category term="ai-in-china"/></entry><entry><title>Political Email Extraction Leaderboard</title><link href="https://simonwillison.net/2025/Apr/8/political-email-extraction-leaderboard/#atom-tag" rel="alternate"/><published>2025-04-08T23:22:41+00:00</published><updated>2025-04-08T23:22:41+00:00</updated><id>https://simonwillison.net/2025/Apr/8/political-email-extraction-leaderboard/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://thescoop.org/LLM-Extraction-Challenge/"&gt;Political Email Extraction Leaderboard&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Derek Willis collects "political fundraising emails from just about every committee" - 3,000-12,000 a month - and has created an LLM benchmark from 1,000 of them that he collected last November.&lt;/p&gt;
&lt;p&gt;He explains the leaderboard &lt;a href="https://thescoop.org/archives/2025/01/27/llm-extraction-challenge-fundraising-emails/index.html"&gt;in this blog post&lt;/a&gt;. The goal is to have an LLM correctly identify the committee name from the disclaimer text included in the email.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://github.com/dwillis/LLM-Extraction-Challenge/blob/main/fundraising-emails/email_ollama.py"&gt;the code&lt;/a&gt; he uses to run prompts using Ollama. It uses this system prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Produce a JSON object with the following keys: 'committee', which is the name of the committee in the disclaimer that begins with Paid for by but does not include 'Paid for by', the committee address or the treasurer name. If no committee is present, the value of 'committee' should be None. Also add a key called 'sender', which is the name of the person, if any, mentioned as the author of the email. If there is no person named, the value is None. Do not include any other text, no yapping.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Gemini 2.5 Pro tops the leaderboard at the moment with 95.40%, but the new Mistral Small 3.1 manages 5th place with 85.70%, pretty good for a local model!&lt;/p&gt;
&lt;p&gt;&lt;img alt="Table comparing AI model performance with columns for Model (JSON Filename), Total Records, Committee Matches, and Match Percentage. Shows 7 models with 1000 records each: gemini_25_november_2024_prompt2.json (95.40%), qwen25_november_2024_prompt2.json (92.90%), gemini20_flash_november_2024_prompt2.json (92.40%), claude37_sonnet_november_2024_prompt2.json (90.70%), mistral_small_31_november_2024_prompt2.json (85.70%), gemma2_27b_november_2024_prompt2.json (84.40%), and gemma2_november_2024_prompt2.json (83.90%)." src="https://static.simonwillison.net/static/2025/derek-leaderboard.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;I said &lt;a href="https://simonwillison.net/2025/Mar/8/nicar-llms/#llms.020.jpeg"&gt;we need our own evals&lt;/a&gt; in my talk at the NICAR Data Journalism conference last month, without realizing Derek had been running one since January.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://bsky.app/profile/dwillis.bsky.social/post/3lmdjmfyeac25"&gt;@dwillis.bsky.social&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/derek-willis"&gt;derek-willis&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-engineering"&gt;prompt-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mistral"&gt;mistral&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/evals"&gt;evals&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/system-prompts"&gt;system-prompts&lt;/a&gt;&lt;/p&gt;



</summary><category term="data-journalism"/><category term="derek-willis"/><category term="ai"/><category term="prompt-engineering"/><category term="generative-ai"/><category term="llms"/><category term="mistral"/><category term="gemini"/><category term="evals"/><category term="ollama"/><category term="system-prompts"/></entry><entry><title>Mistral Small 3.1 on Ollama</title><link href="https://simonwillison.net/2025/Apr/8/mistral-small-31-on-ollama/#atom-tag" rel="alternate"/><published>2025-04-08T22:07:50+00:00</published><updated>2025-04-08T22:07:50+00:00</updated><id>https://simonwillison.net/2025/Apr/8/mistral-small-31-on-ollama/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://ollama.com/library/mistral-small3.1"&gt;Mistral Small 3.1 on Ollama&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Mistral Small 3.1 (&lt;a href="https://simonwillison.net/2025/Mar/17/mistral-small-31/"&gt;previously&lt;/a&gt;) is now available through &lt;a href="https://ollama.com/"&gt;Ollama&lt;/a&gt;, providing an easy way to run this multi-modal (vision) model on a Mac (and other platforms, though I haven't tried those myself).&lt;/p&gt;
&lt;p&gt;I had to upgrade Ollama to the most recent version to get it to work - prior to that I got a &lt;code&gt;Error: unable to load model&lt;/code&gt; message. Upgrades can be accessed through the Ollama macOS system tray icon.&lt;/p&gt;
&lt;p&gt;I fetched the 15GB model by running:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama pull mistral-small3.1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then used &lt;a href="https://github.com/taketwo/llm-ollama"&gt;llm-ollama&lt;/a&gt; to run prompts through it, including one to describe &lt;a href="https://static.simonwillison.net/static/2025/Mpaboundrycdfw-1.png"&gt;this image&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-ollama
llm -m mistral-small3.1 'describe this image' -a https://static.simonwillison.net/static/2025/Mpaboundrycdfw-1.png
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/89005e8aa2daef82c53c2c2c62207f6a#response"&gt;the output&lt;/a&gt;. It's good, though not quite as impressive as the description &lt;a href="https://simonwillison.net/2025/Mar/24/qwen25-vl-32b/"&gt;I got from the slightly larger Qwen2.5-VL-32B&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I also tried it on a scanned (private) PDF of hand-written text with very good results, though it did misread one of the hand-written numbers.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mistral"&gt;mistral&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="mistral"/><category term="vision-llms"/><category term="ollama"/></entry><entry><title>simonw/ollama-models-atom-feed</title><link href="https://simonwillison.net/2025/Mar/22/ollama-models-atom-feed/#atom-tag" rel="alternate"/><published>2025-03-22T22:04:57+00:00</published><updated>2025-03-22T22:04:57+00:00</updated><id>https://simonwillison.net/2025/Mar/22/ollama-models-atom-feed/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/ollama-models-atom-feed"&gt;simonw/ollama-models-atom-feed&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I setup a GitHub Actions + GitHub Pages Atom feed of scraped recent models data from the Ollama &lt;a href="https://ollama.com/search?o=newest"&gt;latest models&lt;/a&gt; page - Ollama remains one of the easiest ways to run models on a laptop so a new model release from them is worth hearing about.&lt;/p&gt;
&lt;p&gt;I built the scraper by pasting example HTML &lt;a href="https://claude.ai/share/c96d6bb9-a976-45f9-82c2-8599c2d6d492"&gt;into Claude&lt;/a&gt; and asking for a Python script to convert it to Atom - here's &lt;a href="https://github.com/simonw/ollama-models-atom-feed/blob/main/to_atom.py"&gt;the script&lt;/a&gt; we wrote together.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update 25th March 2025&lt;/strong&gt;: The first version of this included all 160+ models in a single feed. I've upgraded the script to output two feeds - the original &lt;a href="https://simonw.github.io/ollama-models-atom-feed/atom.xml"&gt;atom.xml&lt;/a&gt; one and a new &lt;a href="https://simonw.github.io/ollama-models-atom-feed/atom-recent-20.xml"&gt;atom-recent-20.xml&lt;/a&gt; feed containing just the most recent 20 items.&lt;/p&gt;
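&lt;p&gt;The change boils down to rendering the same newest-first entry list twice - once in full and once capped at the 20 most recent items. A rough sketch of that shape (not the actual generated script, and with a toy renderer standing in for the real Atom serialization):&lt;/p&gt;

```python
def build_feeds(entries, render):
    """Return the contents of both feed files from one newest-first
    entry list: the full feed plus one capped at the 20 most recent."""
    return {
        "atom.xml": render(entries),
        "atom-recent-20.xml": render(entries[:20]),
    }

# Toy renderer: one line per entry instead of real Atom XML
render = lambda items: "\n".join(items)
feeds = build_feeds([f"entry-{i}" for i in range(30)], render)
print(len(feeds["atom-recent-20.xml"].splitlines()))  # 20
```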
&lt;p&gt;I modified the script using Google's &lt;a href="https://simonwillison.net/2025/Mar/25/gemini/"&gt;new Gemini 2.5 Pro&lt;/a&gt; model, like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cat to_atom.py | llm -m gemini-2.5-pro-exp-03-25 \
  -s 'rewrite this script so that instead of outputting Atom to stdout it saves two files, one called atom.xml with everything and another called atom-recent-20.xml with just the most recent 20 items - remove the output option entirely'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's the &lt;a href="https://gist.github.com/simonw/358b5caa015de53dee0fbc96415ae6d6"&gt;full transcript&lt;/a&gt;.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/atom"&gt;atom&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;&lt;/p&gt;



</summary><category term="atom"/><category term="github"/><category term="projects"/><category term="ai"/><category term="github-actions"/><category term="git-scraping"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="ai-assisted-programming"/><category term="claude"/><category term="gemini"/><category term="ollama"/></entry><entry><title>Notes on Google's Gemma 3</title><link href="https://simonwillison.net/2025/Mar/12/gemma-3/#atom-tag" rel="alternate"/><published>2025-03-12T16:15:19+00:00</published><updated>2025-03-12T16:15:19+00:00</updated><id>https://simonwillison.net/2025/Mar/12/gemma-3/#atom-tag</id><summary type="html">
    &lt;p&gt;Google's Gemma team released an impressive new model today (under their not-open-source &lt;a href="https://ai.google.dev/gemma/terms"&gt;Gemma license&lt;/a&gt;). Gemma 3 comes in four sizes - 1B, 4B, 12B, and 27B - and while 1B is text-only the larger three models are all multi-modal for vision:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It handles context windows up to 128k tokens, understands over 140 languages, and offers improved math, reasoning, and chat capabilities, including structured outputs and function calling.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's the &lt;a href="https://storage.googleapis.com/deepmind-media/gemma/Gemma3Report.pdf"&gt;Gemma 3 Technical Report PDF&lt;/a&gt;, which makes the big claim that they find "Gemma3-27B-IT comparable to Gemini-1.5-Pro across benchmarks".&lt;/p&gt;
&lt;p&gt;I tried out the largest model using the latest &lt;a href="https://ollama.com/"&gt;Ollama&lt;/a&gt; - this is the second time I've spotted a major model release partnering with Ollama on launch day, the first being &lt;a href="https://simonwillison.net/2025/Jan/30/mistral-small-3/"&gt;Mistral Small 3&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I ran this (after upgrading Ollama through their menu icon upgrade option):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama pull gemma3:27b
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That pulled 17GB of model weights. I've been trying it out using &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; and &lt;a href="https://github.com/taketwo/llm-ollama"&gt;llm-ollama&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-ollama
llm -m gemma3:27b 'Build a single page HTML+CSS+JavaScript UI that gives me a large textarea for writing in which constantly saves what I have entered to localStorage (restoring when I reload the page) and displays a word counter'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That was a replay of &lt;a href="https://tools.simonwillison.net/colophon#word-counter.html"&gt;a prompt&lt;/a&gt; I ran against Claude Artifacts a few months ago. Here's &lt;a href="https://static.simonwillison.net/static/2025/gemma3-word-count.html"&gt;what Gemma built&lt;/a&gt;, and the &lt;a href="https://gist.github.com/simonw/f8a87f2242439d3974405a17f5d8bc23"&gt;full chat transcript&lt;/a&gt;. It's a simple example but it worked just right.&lt;/p&gt;

&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gemma-word-count.gif" alt="Animated demo - a simple textarea, I type text in it and a word count updates above." /&gt;&lt;/p&gt;

&lt;p&gt;Let's try out the vision mode, by asking it to describe this photo:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/two-pelicans.jpg" alt="Two pelicans flying against a blue sky " style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m gemma3:27b describe -a https://static.simonwillison.net/static/2025/two-pelicans.jpg
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/0e703ba72d0cf5a6dd117981bd03e604"&gt;the full response&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Overall Impression:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The image features two brown pelicans in flight against a solid, bright blue background. The composition is dynamic, capturing the birds mid-wingbeat.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Details:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pelicans:&lt;/strong&gt; Both pelicans are a rich, dark brown color. They have large, pale yellow-orange beaks and pouches. One pelican is shown with its wings fully extended, appearing to be gliding or soaring. The other is captured with its wings in a more dynamic, upward-sweeping motion.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Background:&lt;/strong&gt; The background is a flat, uniform shade of bright blue, which makes the pelicans stand out.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Composition:&lt;/strong&gt; The birds are positioned in opposite directions, creating a sense of movement and balance. The image is well-lit, highlighting the texture of the pelicans' feathers.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Overall, the image is a striking and visually appealing depiction of these majestic birds in flight.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Is it right to say "the birds are positioned in opposite directions"? I thought that was a mistake at first, but I'll give it a pass since their wings are positioned in opposite directions to each other.&lt;/p&gt;

&lt;p&gt;Something I've been curious about recently is longer context support: how well can a local model on my laptop deal with summarization or data extraction tasks against longer pieces of text?&lt;/p&gt;
&lt;p&gt;I decided to try &lt;a href="https://til.simonwillison.net/llms/claude-hacker-news-themes#user-content-adding-a--m-model-option"&gt;my Hacker News summarize script&lt;/a&gt; using Gemma, against &lt;a href="https://news.ycombinator.com/item?id=43340491"&gt;the thread&lt;/a&gt; there discussing the Gemma 3 technical paper.&lt;/p&gt;
&lt;p&gt;First I did a quick token count (using the OpenAI tokenizer; other models' tokenizers usually produce a similar number):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl 'https://hn.algolia.com/api/v1/items/43340491' | ttok
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This returned 22,260 - well within Gemma's documented limits but still a healthy number considering just last year most models topped out at 4,000 or 8,000.&lt;/p&gt;
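&lt;p&gt;If you don't have &lt;code&gt;ttok&lt;/code&gt; handy, a common rule of thumb is roughly four characters per token for English text - an approximation only, not what the OpenAI tokenizer actually computes:&lt;/p&gt;

```python
def estimate_tokens(text):
    """Very rough token estimate: ~4 characters per token for English.
    Real counts come from a proper tokenizer (ttok uses OpenAI's)."""
    return len(text) // 4

print(estimate_tokens("a" * 8000))  # 2000
```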
&lt;p&gt;I ran my script like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;hn-summary.sh 43340491 -m gemma3:27b
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It did a pretty good job! Here's the &lt;a href="https://gist.github.com/simonw/ab487ea3d1605e719dc2950cd4565146"&gt;full prompt and response&lt;/a&gt;. The one big miss is that it ignored my instructions to include illustrative quotes - I don't know if modifying the prompt will fix that but it's disappointing that it didn't handle that well, given how important direct quotes are for building confidence in RAG-style responses.&lt;/p&gt;
&lt;p&gt;Here's what I got for &lt;a href="https://gist.github.com/simonw/f79c4bd2fbe966e1b600cb8b41cae810"&gt;Generate an SVG of a pelican riding a bicycle&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m gemma3:27b 'Generate an SVG of a pelican riding a bicycle'
&lt;/code&gt;&lt;/pre&gt;

&lt;p style="text-align: center"&gt;&lt;img src="https://static.simonwillison.net/static/2025/gemma-3-pelican.svg" alt="A collection of abstract shapes, definitely not a pelican on a bicycle" /&gt;&lt;/p&gt;

&lt;p&gt;You can also try out the new Gemma &lt;a href="https://aistudio.google.com/prompts/new_chat?model=gemma-3-27b-it"&gt;in Google AI Studio&lt;/a&gt;, and via their API. I added support for it to &lt;a href="https://github.com/simonw/llm-gemini/releases/tag/0.15"&gt;llm-gemini 0.15&lt;/a&gt;, though sadly it appears vision mode doesn't work with that API hosted model yet.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install -U llm-gemini
llm keys set gemini
# paste key here
llm -m gemma-3-27b-it 'five facts about pelicans of interest to skunks'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;a href="https://gist.github.com/simonw/bc22062e60e5af3faf458756cb368d0e"&gt;Here's what I got&lt;/a&gt;. I'm not sure how pricing works for that hosted model.&lt;/p&gt;
&lt;p&gt;Gemma 3 is also already available &lt;a href="https://github.com/Blaizzy/mlx-vlm/pull/235"&gt;through MLX-VLM&lt;/a&gt; - here's &lt;a href="https://huggingface.co/collections/mlx-community/gemma-3-67d14a10480a436ad478b0f9"&gt;the MLX model collection&lt;/a&gt; - but I haven't tried that version yet.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemma"&gt;gemma&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="google"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="gemini"/><category term="vision-llms"/><category term="mlx"/><category term="ollama"/><category term="pelican-riding-a-bicycle"/><category term="gemma"/><category term="llm-release"/></entry><entry><title>QwQ-32B: Embracing the Power of Reinforcement Learning</title><link href="https://simonwillison.net/2025/Mar/5/qwq-32b/#atom-tag" rel="alternate"/><published>2025-03-05T21:10:28+00:00</published><updated>2025-03-05T21:10:28+00:00</updated><id>https://simonwillison.net/2025/Mar/5/qwq-32b/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://qwenlm.github.io/blog/qwq-32b/"&gt;QwQ-32B: Embracing the Power of Reinforcement Learning&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
New Apache 2 licensed reasoning model from Qwen:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We are excited to introduce QwQ-32B, a model with 32 billion parameters that achieves performance comparable to DeepSeek-R1, which boasts 671 billion parameters (with 37 billion activated). This remarkable outcome underscores the effectiveness of RL when applied to robust foundation models pretrained on extensive world knowledge.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I had a lot of fun &lt;a href="https://simonwillison.net/2024/Nov/27/qwq/"&gt;trying out&lt;/a&gt; their previous QwQ reasoning model last November. I demonstrated this new QwQ in &lt;a href="https://simonwillison.net/2025/Mar/8/nicar-llms/#llms.027.jpeg"&gt;my talk at NICAR&lt;/a&gt; about recent LLM developments. Here's &lt;a href="https://gist.github.com/simonw/46cd83701868d364f4cfb1340f0f7fa5"&gt;the example I ran&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;LM Studio just &lt;a href="https://huggingface.co/lmstudio-community/QwQ-32B-GGUF/tree/main"&gt;released GGUFs&lt;/a&gt; ranging in size from 17.2 to 34.8 GB. MLX have compatible weights published in &lt;a href="https://huggingface.co/mlx-community/QwQ-32B-3bit"&gt;3bit&lt;/a&gt;, &lt;a href="https://huggingface.co/mlx-community/QwQ-32B-4bit"&gt;4bit&lt;/a&gt;, &lt;a href="https://huggingface.co/mlx-community/QwQ-32B-6bit"&gt;6bit&lt;/a&gt; and &lt;a href="https://huggingface.co/mlx-community/QwQ-32B-8bit"&gt;8bit&lt;/a&gt;. Ollama &lt;a href="https://ollama.com/library/qwq"&gt;has the new qwq&lt;/a&gt; too - it looks like they've renamed the previous November release &lt;a href="https://ollama.com/library/qwq:32b-preview-q8_0"&gt;qwq:32b-preview&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/alibaba_qwen/status/1897361654763151544"&gt;@alibaba_qwen&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="open-source"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="qwen"/><category term="mlx"/><category term="ollama"/><category term="llm-reasoning"/><category term="llm-release"/><category term="lm-studio"/><category term="ai-in-china"/></entry><entry><title>llm-ollama 0.9.0</title><link href="https://simonwillison.net/2025/Mar/4/llm-ollama-090/#atom-tag" rel="alternate"/><published>2025-03-04T07:17:52+00:00</published><updated>2025-03-04T07:17:52+00:00</updated><id>https://simonwillison.net/2025/Mar/4/llm-ollama-090/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/taketwo/llm-ollama/releases/tag/0.9.0"&gt;llm-ollama 0.9.0&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
This release of the &lt;code&gt;llm-ollama&lt;/code&gt; plugin adds support for &lt;a href="https://simonwillison.net/2025/Feb/28/llm-schemas/"&gt;schemas&lt;/a&gt;, thanks to a &lt;a href="https://github.com/taketwo/llm-ollama/pull/36"&gt;PR by Adam Compton&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Ollama provides very robust support for this pattern thanks to their &lt;a href="https://ollama.com/blog/structured-outputs"&gt;structured outputs&lt;/a&gt; feature, which works across all of the models that they support by intercepting the logic that outputs the next token and restricting it to only tokens that would be valid in the context of the provided schema.&lt;/p&gt;
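&lt;p&gt;Here's a toy illustration of that token-masking idea - nothing like Ollama's actual implementation, which compiles the schema into a grammar over the model's full vocabulary, but it shows the principle of restricting sampling to schema-valid tokens:&lt;/p&gt;

```python
def mask_logits(logits, valid_tokens):
    """Keep only tokens that are valid next steps under the schema;
    everything else is dropped (probability zero)."""
    return {tok: score for tok, score in logits.items() if tok in valid_tokens}

def pick_token(logits, valid_tokens):
    """Greedy-pick the highest-scoring token that the schema allows."""
    allowed = mask_logits(logits, valid_tokens)
    return max(allowed, key=allowed.get)

# After emitting '{"count":' a schema expecting an integer only
# allows digit tokens, however much the model prefers prose:
logits = {"1": 0.2, "maybe": 0.7, "2": 0.1}
print(pick_token(logits, valid_tokens={"1", "2"}))  # 1
```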
&lt;p&gt;With Ollama and &lt;code&gt;llm-ollama&lt;/code&gt; installed you can even run structured schemas against vision prompts with local models. Here's one against Ollama's &lt;a href="https://ollama.com/library/llama3.2-vision"&gt;llama3.2-vision&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m llama3.2-vision:latest \
  'describe images' \
  --schema 'species,description,count int' \
  -a https://static.simonwillison.net/static/2025/two-pelicans.jpg
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I got back this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{
    "species": "Pelicans",
    "description": "The image features a striking brown pelican with its distinctive orange beak, characterized by its large size and impressive wingspan.",
    "count": 1
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(Actually a bit disappointing, as there are &lt;a href="https://static.simonwillison.net/static/2025/two-pelicans.jpg"&gt;two pelicans&lt;/a&gt; and their beaks are brown.)&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/plugins"&gt;plugins&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;&lt;/p&gt;



</summary><category term="plugins"/><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="vision-llms"/><category term="ollama"/></entry><entry><title>llm-mistral 0.11</title><link href="https://simonwillison.net/2025/Mar/4/llm-mistral-011/#atom-tag" rel="alternate"/><published>2025-03-04T07:05:21+00:00</published><updated>2025-03-04T07:05:21+00:00</updated><id>https://simonwillison.net/2025/Mar/4/llm-mistral-011/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/llm-mistral/releases/tag/0.11"&gt;llm-mistral 0.11&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I added &lt;a href="https://simonwillison.net/2025/Feb/28/llm-schemas/"&gt;schema support&lt;/a&gt; to this plugin which adds support for the &lt;a href="https://docs.mistral.ai/api/"&gt;Mistral API&lt;/a&gt; to LLM. Release notes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Support for LLM &lt;a href="https://llm.datasette.io/en/stable/schemas.html"&gt;schemas&lt;/a&gt;. &lt;a href="https://github.com/simonw/llm-mistral/issues/19"&gt;#19&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-o prefix '{'&lt;/code&gt; option for forcing a response prefix. &lt;a href="https://github.com/simonw/llm-mistral/issues/18"&gt;#18&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Schemas now work with OpenAI, Anthropic, Gemini and Mistral hosted models, plus self-hosted models via &lt;a href="https://www.ollama.com/"&gt;Ollama&lt;/a&gt; and &lt;a href="https://github.com/taketwo/llm-ollama"&gt;llm-ollama&lt;/a&gt;.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/plugins"&gt;plugins&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mistral"&gt;mistral&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;&lt;/p&gt;



</summary><category term="plugins"/><category term="projects"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="mistral"/><category term="ollama"/></entry><entry><title>Structured data extraction from unstructured content using LLM schemas</title><link href="https://simonwillison.net/2025/Feb/28/llm-schemas/#atom-tag" rel="alternate"/><published>2025-02-28T17:07:07+00:00</published><updated>2025-02-28T17:07:07+00:00</updated><id>https://simonwillison.net/2025/Feb/28/llm-schemas/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;a href="https://llm.datasette.io/en/stable/changelog.html#v0-23"&gt;LLM 0.23&lt;/a&gt; is out today, and the signature feature is support for &lt;strong&gt;&lt;a href="https://llm.datasette.io/en/stable/schemas.html"&gt;schemas&lt;/a&gt;&lt;/strong&gt; - a new way of providing structured output from a model that matches a specification provided by the user. I've also upgraded both the &lt;a href="https://github.com/simonw/llm-anthropic"&gt;llm-anthropic&lt;/a&gt; and &lt;a href="https://github.com/simonw/llm-gemini"&gt;llm-gemini&lt;/a&gt; plugins to add support for  schemas.&lt;/p&gt;
&lt;p&gt;TLDR: you can now do things like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm --schema &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;name,age int,short_bio&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;invent a cool dog&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And get back:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
  &lt;span class="pl-ent"&gt;"name"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Zylo&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
  &lt;span class="pl-ent"&gt;"age"&lt;/span&gt;: &lt;span class="pl-c1"&gt;4&lt;/span&gt;,
  &lt;span class="pl-ent"&gt;"short_bio"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Zylo is a unique hybrid breed, a mix between a Siberian Husky and a Corgi. With striking blue eyes and a fluffy, colorful coat that changes shades with the seasons, Zylo embodies the spirit of winter and summer alike. Known for his playful personality and intelligence, Zylo can perform a variety of tricks and loves to fetch his favorite frisbee. Always ready for an adventure, he's just as happy hiking in the mountains as he is cuddling on the couch after a long day of play.&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;More details &lt;a href="https://llm.datasette.io/en/stable/changelog.html#v0-23"&gt;in the release notes&lt;/a&gt; and &lt;a href="https://llm.datasette.io/en/stable/schemas.html#schemas-tutorial"&gt;LLM schemas tutorial&lt;/a&gt;, which includes an example (extracting people from news articles) that's even more useful than inventing dogs!&lt;/p&gt;



&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Feb/28/llm-schemas/#structured-data-extraction-is-a-killer-app-for-llms"&gt;Structured data extraction is a killer app for LLMs&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Feb/28/llm-schemas/#designing-this-feature-for-llm"&gt;Designing this feature for LLM&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Feb/28/llm-schemas/#reusing-schemas-and-creating-templates"&gt;Reusing schemas and creating templates&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Feb/28/llm-schemas/#doing-more-with-the-logged-structured-data"&gt;Doing more with the logged structured data&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Feb/28/llm-schemas/#using-schemas-from-llm-s-python-library"&gt;Using schemas from LLM's Python library&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Feb/28/llm-schemas/#what-s-next-for-llm-schemas-"&gt;What's next for LLM schemas?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id="structured-data-extraction-is-a-killer-app-for-llms"&gt;Structured data extraction is a killer app for LLMs&lt;/h4&gt;
&lt;p&gt;I've suspected for a while that the single most commercially valuable application of LLMs is turning unstructured content into structured data. That's the trick where you feed an LLM an article, or a PDF, or a screenshot and use it to turn that into JSON or CSV or some other structured format.&lt;/p&gt;
&lt;p&gt;It's possible to achieve strong results on this with prompting alone: feed data into an LLM, give it an example of the output you would like and let it figure out the details.&lt;/p&gt;
&lt;p&gt;Many of the leading LLM providers now bake this in as a feature. OpenAI, Anthropic, Gemini and Mistral all offer variants of "structured output" as additional options through their APIs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;OpenAI: &lt;a href="https://platform.openai.com/docs/guides/structured-outputs"&gt;Structured Outputs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Gemini: &lt;a href="https://ai.google.dev/gemini-api/docs/structured-output?lang=rest"&gt;Generate structured output with the Gemini API&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Mistral: &lt;a href="https://docs.mistral.ai/capabilities/structured-output/custom_structured_output/"&gt;Custom Structured Outputs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Anthropic's &lt;a href="https://docs.anthropic.com/en/docs/build-with-claude/tool-use/overview"&gt;tool use&lt;/a&gt; can be used for this, as shown in their &lt;a href="https://github.com/anthropics/anthropic-cookbook/blob/main/tool_use/extracting_structured_json.ipynb"&gt;Extracting Structured JSON using Claude and Tool Use&lt;/a&gt; cookbook example.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These mechanisms are all very similar: you pass a &lt;a href="https://json-schema.org/"&gt;JSON schema&lt;/a&gt; to the model defining the shape that you would like, and the provider then uses that schema to guide the output of the model.&lt;/p&gt;
&lt;p&gt;How reliable that is can vary! Some providers use tricks along the lines of &lt;a href="https://github.com/1rgs/jsonformer"&gt;Jsonformer&lt;/a&gt;, compiling the JSON schema into code that interacts with the model's next-token generation at runtime, limiting it to only generate tokens that are valid in the context of the schema.&lt;/p&gt;
&lt;p&gt;Other providers YOLO it - they trust that their model is "good enough" that showing it the schema will produce the right results!&lt;/p&gt;
&lt;p&gt;In practice, this means that you need to be aware that sometimes this stuff will go wrong. As with anything LLM, 100% reliability is never guaranteed.&lt;/p&gt;
&lt;p&gt;From my experiments so far, and depending on the model you choose, these mistakes are rare. If you're using a top tier model it will almost certainly do the right thing.&lt;/p&gt;
&lt;h4 id="designing-this-feature-for-llm"&gt;Designing this feature for LLM&lt;/h4&gt;
&lt;p&gt;I've wanted this feature for ages. I see it as an important step on the way to full tool usage, which is something I'm very excited to bring to the CLI tool and Python library.&lt;/p&gt;
&lt;p&gt;LLM is designed as an abstraction layer over different models. This makes building new features &lt;em&gt;much harder&lt;/em&gt;, because I need to figure out a common denominator and then build an abstraction that captures as much value as possible while still being general enough to work across multiple models.&lt;/p&gt;
&lt;p&gt;Support for structured output across multiple vendors has matured now to the point that I'm ready to commit to a design.&lt;/p&gt;
&lt;p&gt;My first version of this feature worked exclusively with JSON schemas. An earlier version of the tutorial started with this example:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;curl https://www.nytimes.com/ &lt;span class="pl-k"&gt;|&lt;/span&gt; uvx strip-tags &lt;span class="pl-k"&gt;|&lt;/span&gt; \
  llm --schema &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;{&lt;/span&gt;
&lt;span class="pl-s"&gt;  "type": "object",&lt;/span&gt;
&lt;span class="pl-s"&gt;  "properties": {&lt;/span&gt;
&lt;span class="pl-s"&gt;    "items": {&lt;/span&gt;
&lt;span class="pl-s"&gt;      "type": "array",&lt;/span&gt;
&lt;span class="pl-s"&gt;      "items": {&lt;/span&gt;
&lt;span class="pl-s"&gt;        "type": "object",&lt;/span&gt;
&lt;span class="pl-s"&gt;        "properties": {&lt;/span&gt;
&lt;span class="pl-s"&gt;          "headline": {&lt;/span&gt;
&lt;span class="pl-s"&gt;            "type": "string"&lt;/span&gt;
&lt;span class="pl-s"&gt;          },&lt;/span&gt;
&lt;span class="pl-s"&gt;          "short_summary": {&lt;/span&gt;
&lt;span class="pl-s"&gt;            "type": "string"&lt;/span&gt;
&lt;span class="pl-s"&gt;          },&lt;/span&gt;
&lt;span class="pl-s"&gt;          "key_points": {&lt;/span&gt;
&lt;span class="pl-s"&gt;            "type": "array",&lt;/span&gt;
&lt;span class="pl-s"&gt;            "items": {&lt;/span&gt;
&lt;span class="pl-s"&gt;              "type": "string"&lt;/span&gt;
&lt;span class="pl-s"&gt;            }&lt;/span&gt;
&lt;span class="pl-s"&gt;          }&lt;/span&gt;
&lt;span class="pl-s"&gt;        },&lt;/span&gt;
&lt;span class="pl-s"&gt;        "required": ["headline", "short_summary", "key_points"]&lt;/span&gt;
&lt;span class="pl-s"&gt;      }&lt;/span&gt;
&lt;span class="pl-s"&gt;    }&lt;/span&gt;
&lt;span class="pl-s"&gt;  },&lt;/span&gt;
&lt;span class="pl-s"&gt;  "required": ["items"]&lt;/span&gt;
&lt;span class="pl-s"&gt;}&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; jq&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here we're feeding a full JSON schema document to the new &lt;code&gt;llm --schema&lt;/code&gt; option, then piping in the homepage of the New York Times (after running it through &lt;a href="https://github.com/simonw/strip-tags"&gt;strip-tags&lt;/a&gt;) and asking for &lt;code&gt;headline&lt;/code&gt;, &lt;code&gt;short_summary&lt;/code&gt; and &lt;code&gt;key_points&lt;/code&gt; for multiple items on the page.&lt;/p&gt;
&lt;p&gt;This example still works with the finished feature - you can see &lt;a href="https://gist.github.com/simonw/372d11e2729a9745654740ff3f5669ab"&gt;example JSON output here&lt;/a&gt; - but constructing those long-form schemas by hand was a big pain.&lt;/p&gt;
&lt;p&gt;So... I invented my own shortcut syntax.&lt;/p&gt;
&lt;p&gt;That earlier example is a simple illustration:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm --schema &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;name,age int,short_bio&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;invent a cool dog&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here the schema is a comma-separated list of field names, with an optional space-separated type.&lt;/p&gt;
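&lt;p&gt;As a rough sketch of the idea (this is an illustration, not LLM's actual parser - the real syntax also supports descriptions and a newline-delimited form), a concise schema string expands into a JSON schema along these lines:&lt;/p&gt;

```python
# Illustrative sketch: expand a concise "name,age int,short_bio" string
# into a JSON schema. Hypothetical helper, not LLM's real implementation.
def expand_concise_schema(concise: str) -> dict:
    type_map = {"int": "integer", "float": "number", "bool": "boolean"}
    properties = {}
    for field in concise.split(","):
        parts = field.strip().split()
        name = parts[0]
        # default every field to string unless a known type follows it
        json_type = type_map.get(parts[1], "string") if len(parts) > 1 else "string"
        properties[name] = {"type": json_type}
    return {"type": "object", "properties": properties, "required": list(properties)}

schema = expand_concise_schema("name,age int,short_bio")
```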
&lt;p&gt;The full concise schema syntax &lt;a href="https://llm.datasette.io/en/stable/schemas.html#concise-llm-schema-syntax"&gt;is described here&lt;/a&gt;. There's a more complex example &lt;a href="https://llm.datasette.io/en/latest/schemas.html#extracting-people-from-a-news-articles"&gt;in the tutorial&lt;/a&gt;, which uses the newline-delimited form to extract information about people who are mentioned in a news article:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;curl &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;https://apnews.com/article/trump-federal-employees-firings-a85d1aaf1088e050d39dcf7e3664bb9f&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; \
  uvx strip-tags &lt;span class="pl-k"&gt;|&lt;/span&gt; \
  llm --schema-multi &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;name: the person's name&lt;/span&gt;
&lt;span class="pl-s"&gt;organization: who they represent&lt;/span&gt;
&lt;span class="pl-s"&gt;role: their job title or role&lt;/span&gt;
&lt;span class="pl-s"&gt;learned: what we learned about them from this story&lt;/span&gt;
&lt;span class="pl-s"&gt;article_headline: the headline of the story&lt;/span&gt;
&lt;span class="pl-s"&gt;article_date: the publication date in YYYY-MM-DD&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; --system &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;extract people mentioned in this article&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The &lt;code&gt;--schema-multi&lt;/code&gt; option here tells LLM to take that schema for a single object and upgrade it to an array of those objects (actually an object with a single &lt;code&gt;"items"&lt;/code&gt; property that's an array of objects), which is a quick way to request that the same schema be returned multiple times against a single input.&lt;/p&gt;
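&lt;p&gt;Conceptually that transformation is simple. Here's a sketch (the helper name is hypothetical, but the wrapping mirrors the "items" structure described above):&lt;/p&gt;

```python
# Sketch of the --schema-multi idea: wrap a single-object schema in an
# object with one "items" property that is an array of those objects.
def to_multi_schema(object_schema: dict) -> dict:
    return {
        "type": "object",
        "properties": {"items": {"type": "array", "items": object_schema}},
        "required": ["items"],
    }

single = {
    "type": "object",
    "properties": {"name": {"type": "string"}},
    "required": ["name"],
}
multi = to_multi_schema(single)
```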
&lt;h4 id="reusing-schemas-and-creating-templates"&gt;Reusing schemas and creating templates&lt;/h4&gt;
&lt;p&gt;My original plan with schemas was to provide a separate &lt;code&gt;llm extract&lt;/code&gt; command for running these kinds of operations. I ended up going in a different direction - I realized that adding &lt;code&gt;--schema&lt;/code&gt; to the default &lt;code&gt;llm prompt&lt;/code&gt; command would make it interoperable with other existing features (like &lt;a href="https://llm.datasette.io/en/stable/usage.html#attachments"&gt;attachments&lt;/a&gt; for feeding in images and PDFs).&lt;/p&gt;
&lt;p&gt;The most valuable way to apply schemas is across many different prompts, in order to gather the same structure of information from many different sources.&lt;/p&gt;
&lt;p&gt;I put a bunch of thought into the &lt;code&gt;--schema&lt;/code&gt; option. It takes a variety of different values - quoting &lt;a href="https://llm.datasette.io/en/latest/schemas.html#ways-to-specify-a-schema"&gt;the documentation&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This option can take multiple forms:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A string providing a JSON schema: &lt;code&gt;--schema '{"type": "object", ...}'&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;A &lt;a href="https://llm.datasette.io/en/stable/schemas.html#schemas-dsl"&gt;condensed schema definition&lt;/a&gt;: &lt;code&gt;--schema 'name,age int'&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The name or path of a file on disk containing a JSON schema: &lt;code&gt;--schema dogs.schema.json&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The hexadecimal ID of a previously logged schema: &lt;code&gt;--schema 520f7aabb121afd14d0c6c237b39ba2d&lt;/code&gt; - these IDs can be found using the &lt;code&gt;llm schemas&lt;/code&gt; command.&lt;/li&gt;
&lt;li&gt;A schema that has been &lt;a href="https://llm.datasette.io/en/latest/templates.html#prompt-templates-save"&gt;saved in a template&lt;/a&gt;: &lt;code&gt;--schema t:name-of-template&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;The &lt;a href="https://llm.datasette.io/en/latest/schemas.html#extracting-people-from-a-news-articles"&gt;tutorial&lt;/a&gt; demonstrates saving a schema by using it once and then obtaining its ID through the new &lt;code&gt;llm schemas&lt;/code&gt; command, then saving it to a &lt;a href="https://llm.datasette.io/en/stable/templates.html"&gt;template&lt;/a&gt; (along with the system prompt) like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm --schema 3b7702e71da3dd791d9e17b76c88730e \
  --system &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;extract people mentioned in this article&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
  --save people&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And now we can feed in new articles using the &lt;code&gt;llm -t people&lt;/code&gt; shortcut to apply that newly saved template:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;curl https://www.theguardian.com/commentisfree/2025/feb/27/billy-mcfarland-new-fyre-festival-fantasist &lt;span class="pl-k"&gt;|&lt;/span&gt; \
  strip-tags &lt;span class="pl-k"&gt;|&lt;/span&gt; llm -t people&lt;/pre&gt;&lt;/div&gt;
&lt;h4 id="doing-more-with-the-logged-structured-data"&gt;Doing more with the logged structured data&lt;/h4&gt;
&lt;p&gt;Having run a few prompts that use the same schema, an obvious next step is to do something with the data that has been collected.&lt;/p&gt;
&lt;p&gt;I ended up implementing this on top of the existing &lt;a href="https://llm.datasette.io/en/stable/logging.html"&gt;llm logs&lt;/a&gt; mechanism.&lt;/p&gt;
&lt;p&gt;LLM already defaults to logging every prompt and response it makes to a SQLite database - mine contains 4,747 of these records now, according to this query:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;sqlite3 &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;llm logs path&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;select count(*) from responses&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;With schemas, an increasing portion of those are valid JSON.&lt;/p&gt;
&lt;p&gt;Since LLM records the schema that was used for each response - using the schema ID, which is derived from a content hash of the expanded JSON schema - it's now possible to ask LLM for all responses that used a particular schema:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm logs --schema 3b7702e71da3dd791d9e17b76c88730e --short&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I got back:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;- &lt;span class="pl-ent"&gt;model&lt;/span&gt;: &lt;span class="pl-s"&gt;gpt-4o-mini&lt;/span&gt;
  &lt;span class="pl-ent"&gt;datetime&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;2025-02-28T07:37:18&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
  &lt;span class="pl-ent"&gt;conversation&lt;/span&gt;: &lt;span class="pl-s"&gt;01jn5qt397aaxskf1vjp6zxw2a&lt;/span&gt;
  &lt;span class="pl-ent"&gt;system&lt;/span&gt;: &lt;span class="pl-s"&gt;extract people mentioned in this article&lt;/span&gt;
  &lt;span class="pl-ent"&gt;prompt&lt;/span&gt;: &lt;span class="pl-s"&gt;Menu AP Logo Menu World U.S. Politics Sports Entertainment Business Science&lt;/span&gt;
    &lt;span class="pl-s"&gt;Fact Check Oddities Be Well Newsletters N...&lt;/span&gt;
- &lt;span class="pl-ent"&gt;model&lt;/span&gt;: &lt;span class="pl-s"&gt;gpt-4o-mini&lt;/span&gt;
  &lt;span class="pl-ent"&gt;datetime&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;2025-02-28T07:38:58&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
  &lt;span class="pl-ent"&gt;conversation&lt;/span&gt;: &lt;span class="pl-s"&gt;01jn5qx4q5he7yq803rnexp28p&lt;/span&gt;
  &lt;span class="pl-ent"&gt;system&lt;/span&gt;: &lt;span class="pl-s"&gt;extract people mentioned in this article&lt;/span&gt;
  &lt;span class="pl-ent"&gt;prompt&lt;/span&gt;: &lt;span class="pl-s"&gt;Skip to main contentSkip to navigationSkip to navigationPrint subscriptionsNewsletters&lt;/span&gt;
    &lt;span class="pl-s"&gt;Sign inUSUS editionUK editionA...&lt;/span&gt;
- &lt;span class="pl-ent"&gt;model&lt;/span&gt;: &lt;span class="pl-s"&gt;gpt-4o&lt;/span&gt;
  &lt;span class="pl-ent"&gt;datetime&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;2025-02-28T07:39:07&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
  &lt;span class="pl-ent"&gt;conversation&lt;/span&gt;: &lt;span class="pl-s"&gt;01jn5qxh20tksb85tf3bx2m3bd&lt;/span&gt;
  &lt;span class="pl-ent"&gt;system&lt;/span&gt;: &lt;span class="pl-s"&gt;extract people mentioned in this article&lt;/span&gt;
  &lt;span class="pl-ent"&gt;attachments&lt;/span&gt;:
  - &lt;span class="pl-ent"&gt;type&lt;/span&gt;: &lt;span class="pl-s"&gt;image/jpeg&lt;/span&gt;
    &lt;span class="pl-ent"&gt;url&lt;/span&gt;: &lt;span class="pl-s"&gt;https://static.simonwillison.net/static/2025/onion-zuck.jpg&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;As you can see, I've run that example schema three times (while constructing the tutorial) using GPT-4o mini - twice against text content from &lt;code&gt;curl ... | strip-tags&lt;/code&gt; and once against &lt;a href="https://static.simonwillison.net/static/2025/onion-zuck.jpg"&gt;a screenshot JPEG&lt;/a&gt; to demonstrate attachment support.&lt;/p&gt;
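&lt;p&gt;Because the schema ID is a content hash, specifying the same schema in any of its forms always resolves to the same set of logged responses. Here's a sketch of the idea - the exact canonicalization and hash function LLM uses are assumptions on my part:&lt;/p&gt;

```python
import hashlib
import json

# Sketch: derive a stable ID from a content hash of the canonicalized
# JSON schema. Key ordering is normalized so equivalent schemas match.
# (Whether LLM uses MD5 and this exact canonical form is an assumption.)
def schema_id(schema: dict) -> str:
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()

# Two specifications of the same schema, differing only in key order:
a = schema_id({"type": "object", "properties": {"name": {"type": "string"}}})
b = schema_id({"properties": {"name": {"type": "string"}}, "type": "object"})
```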
&lt;p&gt;Extracting gathered JSON from the logs is clearly a useful next step... so I added several options to &lt;code&gt;llm logs&lt;/code&gt; to support that use-case.&lt;/p&gt;
&lt;p&gt;The first is &lt;code&gt;--data&lt;/code&gt; - adding that will cause &lt;code&gt;llm logs&lt;/code&gt; to output just the data that was gathered using a schema. Mix that with &lt;code&gt;-c&lt;/code&gt; to see the JSON from the most recent response:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm logs -c --data&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Outputs:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{&lt;span class="pl-ent"&gt;"name"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Zap&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;, &lt;span class="pl-ent"&gt;"age"&lt;/span&gt;: &lt;span class="pl-c1"&gt;5&lt;/span&gt;, &lt;span class="pl-ent"&gt;"short_bio"&lt;/span&gt;: ...&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Combining that with the &lt;code&gt;--schema&lt;/code&gt; option is where things get really interesting. You can specify a schema using any of the mechanisms described earlier, which means you can see ALL of the data gathered using that schema by combining &lt;code&gt;--data&lt;/code&gt; with &lt;code&gt;--schema X&lt;/code&gt; (and &lt;code&gt;-n 0&lt;/code&gt; for everything).&lt;/p&gt;
&lt;p&gt;Here are all of the dogs I've invented:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm logs --schema &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;name,age int,short_bio&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; --data -n 0&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Output (here truncated):&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{&lt;span class="pl-ent"&gt;"name"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Zap&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;, &lt;span class="pl-ent"&gt;"age"&lt;/span&gt;: &lt;span class="pl-c1"&gt;5&lt;/span&gt;, &lt;span class="pl-ent"&gt;"short_bio"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Zap is a futuristic ...&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;}
{&lt;span class="pl-ent"&gt;"name"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Zephyr&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;, &lt;span class="pl-ent"&gt;"age"&lt;/span&gt;: &lt;span class="pl-c1"&gt;3&lt;/span&gt;, &lt;span class="pl-ent"&gt;"short_bio"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Zephyr is an adventurous...&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;}
{&lt;span class="pl-ent"&gt;"name"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Zylo&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;, &lt;span class="pl-ent"&gt;"age"&lt;/span&gt;: &lt;span class="pl-c1"&gt;4&lt;/span&gt;, &lt;span class="pl-ent"&gt;"short_bio"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Zylo is a unique ...&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Some schemas gather multiple items, producing output that looks like this (from the tutorial):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{"items": [{"name": "Mark Zuckerberg", "organization": "...
{"items": [{"name": "Billy McFarland", "organization": "...
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can get back the individual objects by adding &lt;code&gt;--data-key items&lt;/code&gt;. Here I'm also using the &lt;code&gt;--schema t:people&lt;/code&gt; shortcut to specify the schema that was saved to the &lt;code&gt;people&lt;/code&gt; template earlier on.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm logs --schema t:people --data-key items&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Output:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{"name": "Katy Perry", "organization": ...
{"name": "Gayle King", "organization": ...
{"name": "Lauren Sanchez", "organization": ...
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This feature defaults to outputting newline-delimited JSON, but you can add the &lt;code&gt;--data-array&lt;/code&gt; flag to get back a JSON array of objects instead.&lt;/p&gt;
&lt;p&gt;... which means you can pipe it into &lt;a href="https://sqlite-utils.datasette.io/en/stable/cli.html#inserting-json-data"&gt;sqlite-utils insert&lt;/a&gt; to create a SQLite database!&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm logs --schema t:people --data-key items --data-array &lt;span class="pl-k"&gt;|&lt;/span&gt; \
  sqlite-utils insert data.db people -&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Add all of this together and we can construct a schema, run it against a bunch of sources and dump the resulting structured data into SQLite where we can explore it using SQL queries (and &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt;). It's a really powerful combination.&lt;/p&gt;
&lt;h4 id="using-schemas-from-llm-s-python-library"&gt;Using schemas from LLM's Python library&lt;/h4&gt;
&lt;p&gt;The most popular way to work with schemas in Python these days is with &lt;a href="https://docs.pydantic.dev/"&gt;Pydantic&lt;/a&gt;, to the point that many of the official API libraries for models directly incorporate Pydantic for this purpose.&lt;/p&gt;
&lt;p&gt;LLM depended on Pydantic already, and for this project I finally dropped my dual support for Pydantic v1 and v2 and &lt;a href="https://github.com/simonw/llm/pull/775"&gt;committed to v2 only&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;A key reason Pydantic is popular for this is that it's trivial to use it to build a JSON schema document:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;pydantic&lt;/span&gt;, &lt;span class="pl-s1"&gt;json&lt;/span&gt;

&lt;span class="pl-k"&gt;class&lt;/span&gt; &lt;span class="pl-v"&gt;Dog&lt;/span&gt;(&lt;span class="pl-s1"&gt;pydantic&lt;/span&gt;.&lt;span class="pl-c1"&gt;BaseModel&lt;/span&gt;):
    &lt;span class="pl-s1"&gt;name&lt;/span&gt;: &lt;span class="pl-smi"&gt;str&lt;/span&gt;
    &lt;span class="pl-s1"&gt;age&lt;/span&gt;: &lt;span class="pl-smi"&gt;int&lt;/span&gt;
    &lt;span class="pl-s1"&gt;bio&lt;/span&gt;: &lt;span class="pl-smi"&gt;str&lt;/span&gt;

&lt;span class="pl-s1"&gt;schema&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-v"&gt;Dog&lt;/span&gt;.&lt;span class="pl-c1"&gt;model_json_schema&lt;/span&gt;()
&lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;json&lt;/span&gt;.&lt;span class="pl-c1"&gt;dumps&lt;/span&gt;(&lt;span class="pl-s1"&gt;schema&lt;/span&gt;, &lt;span class="pl-s1"&gt;indent&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;2&lt;/span&gt;))&lt;/pre&gt;
&lt;p&gt;Outputs:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
  &lt;span class="pl-ent"&gt;"properties"&lt;/span&gt;: {
    &lt;span class="pl-ent"&gt;"name"&lt;/span&gt;: {
      &lt;span class="pl-ent"&gt;"title"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Name&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"type"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;string&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
    },
    &lt;span class="pl-ent"&gt;"age"&lt;/span&gt;: {
      &lt;span class="pl-ent"&gt;"title"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Age&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"type"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;integer&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
    },
    &lt;span class="pl-ent"&gt;"bio"&lt;/span&gt;: {
      &lt;span class="pl-ent"&gt;"title"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Bio&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"type"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;string&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
    }
  },
  &lt;span class="pl-ent"&gt;"required"&lt;/span&gt;: [
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;name&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;age&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;bio&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
  ],
  &lt;span class="pl-ent"&gt;"title"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Dog&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
  &lt;span class="pl-ent"&gt;"type"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;object&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;LLM's Python library doesn't require you to use Pydantic, but it supports passing either a Pydantic &lt;code&gt;BaseModel&lt;/code&gt; subclass or a full JSON schema to the new &lt;code&gt;model.prompt(schema=)&lt;/code&gt; parameter. Here's &lt;a href="https://llm.datasette.io/en/latest/python-api.html#schemas"&gt;the usage example&lt;/a&gt; from the documentation:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;llm&lt;/span&gt;, &lt;span class="pl-s1"&gt;json&lt;/span&gt;
&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;pydantic&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-v"&gt;BaseModel&lt;/span&gt;

&lt;span class="pl-k"&gt;class&lt;/span&gt; &lt;span class="pl-v"&gt;Dog&lt;/span&gt;(&lt;span class="pl-v"&gt;BaseModel&lt;/span&gt;):
    &lt;span class="pl-s1"&gt;name&lt;/span&gt;: &lt;span class="pl-smi"&gt;str&lt;/span&gt;
    &lt;span class="pl-s1"&gt;age&lt;/span&gt;: &lt;span class="pl-smi"&gt;int&lt;/span&gt;

&lt;span class="pl-s1"&gt;model&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;llm&lt;/span&gt;.&lt;span class="pl-c1"&gt;get_model&lt;/span&gt;(&lt;span class="pl-s"&gt;"gpt-4o-mini"&lt;/span&gt;)
&lt;span class="pl-s1"&gt;response&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;model&lt;/span&gt;.&lt;span class="pl-c1"&gt;prompt&lt;/span&gt;(&lt;span class="pl-s"&gt;"Describe a nice dog"&lt;/span&gt;, &lt;span class="pl-s1"&gt;schema&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-v"&gt;Dog&lt;/span&gt;)
&lt;span class="pl-s1"&gt;dog&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;json&lt;/span&gt;.&lt;span class="pl-c1"&gt;loads&lt;/span&gt;(&lt;span class="pl-s1"&gt;response&lt;/span&gt;.&lt;span class="pl-c1"&gt;text&lt;/span&gt;())
&lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;dog&lt;/span&gt;)
&lt;span class="pl-c"&gt;# {"name":"Buddy","age":3}&lt;/span&gt;&lt;/pre&gt;
&lt;h4 id="what-s-next-for-llm-schemas-"&gt;What's next for LLM schemas?&lt;/h4&gt;
&lt;p&gt;So far I've implemented schema support for models from OpenAI, Anthropic and Gemini. The &lt;a href="https://llm.datasette.io/en/stable/plugins/advanced-model-plugins.html#supporting-schemas"&gt;plugin author documentation&lt;/a&gt; includes details on how to add this to further plugins - I'd love to see one of the local model plugins implement this pattern as well.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: &lt;a href="https://github.com/taketwo/llm-ollama"&gt;llm-ollama&lt;/a&gt; now supports schemas thanks to &lt;a href="https://github.com/taketwo/llm-ollama/pull/36"&gt;this PR&lt;/a&gt; by Adam Compton. And I've added support &lt;a href="https://simonwillison.net/2025/Mar/4/llm-mistral-011/"&gt;to llm-mistral&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I'm presenting a workshop at the &lt;a href="https://www.ire.org/training/conferences/nicar-2025/"&gt;NICAR 2025&lt;/a&gt; data journalism conference next week about &lt;a href="https://github.com/simonw/nicar-2025-scraping/"&gt;Cutting-edge web scraping techniques&lt;/a&gt;. LLM schemas is a great example of NDD - NICAR-Driven Development - where I'm churning out features I need for that conference (see also shot-scraper's new &lt;a href="https://shot-scraper.datasette.io/en/stable/har.html"&gt;HAR support&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;I expect the workshop will be a great opportunity to further refine the design and implementation of this feature!&lt;/p&gt;
&lt;p&gt;I'm also going to be using this new feature to add multiple model support to my &lt;a href="https://www.datasette.cloud/blog/2024/datasette-extract/"&gt;datasette-extract plugin&lt;/a&gt;, which provides a web UI for structured data extraction that writes the resulting records directly to a SQLite database table.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/annotated-release-notes"&gt;annotated-release-notes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mistral"&gt;mistral&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/structured-extraction"&gt;structured-extraction&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="cli"/><category term="data-journalism"/><category term="projects"/><category term="ai"/><category term="annotated-release-notes"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="mistral"/><category term="gemini"/><category term="ollama"/><category term="structured-extraction"/></entry><entry><title>S1: The $6 R1 Competitor?</title><link href="https://simonwillison.net/2025/Feb/5/s1-the-6-r1-competitor/#atom-tag" rel="alternate"/><published>2025-02-05T20:00:26+00:00</published><updated>2025-02-05T20:00:26+00:00</updated><id>https://simonwillison.net/2025/Feb/5/s1-the-6-r1-competitor/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://timkellogg.me/blog/2025/02/03/s1"&gt;S1: The $6 R1 Competitor?&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Tim Kellogg shares his notes on a new paper, &lt;a href="https://arxiv.org/abs/2501.19393"&gt;s1: Simple test-time scaling&lt;/a&gt;, which describes an inference-scaling model fine-tuned on top of Qwen2.5-32B-Instruct for just $6 - the cost for 26 minutes on 16 NVIDIA H100 GPUs.&lt;/p&gt;
&lt;p&gt;Tim highlights the most exciting result:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;After sifting their dataset of 56K examples down to just the best 1K, they found that the core 1K is all that's needed to achieve o1-preview performance on a 32B model.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The paper describes a technique called "Budget forcing":&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;To enforce a minimum, we suppress the generation of the end-of-thinking token delimiter and optionally append the string “Wait” to the model’s current reasoning trace to encourage the model to reflect on its current generation&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That's the same trick Theia Vogel described &lt;a href="https://simonwillison.net/2025/Jan/22/r1py/"&gt;a few weeks ago&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here's the &lt;code&gt;s1-32B&lt;/code&gt; model &lt;a href="https://huggingface.co/simplescaling/s1-32B"&gt;on Hugging Face&lt;/a&gt;. I found a GGUF version of it at &lt;a href="https://huggingface.co/brittlewis12/s1-32B-GGUF"&gt;brittlewis12/s1-32B-GGUF&lt;/a&gt;, which I ran using &lt;a href="https://ollama.com/"&gt;Ollama&lt;/a&gt; like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama run hf.co/brittlewis12/s1-32B-GGUF:Q4_0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I also found those 1,000 samples in the &lt;a href="https://huggingface.co/datasets/simplescaling/s1K"&gt;simplescaling/s1K&lt;/a&gt; data repository on Hugging Face.&lt;/p&gt;
&lt;p&gt;I used DuckDB to convert the parquet file to CSV (and turn one &lt;code&gt;VARCHAR[]&lt;/code&gt; column into JSON):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;COPY (
    SELECT 
      solution,
      question,
      cot_type,
      source_type,
      metadata,
      cot,
      json_array(thinking_trajectories) as thinking_trajectories,
      attempt
    FROM 's1k-00001.parquet'
) TO 'output.csv' (HEADER, DELIMITER ',');
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then I loaded that CSV into &lt;a href="https://sqlite-utils.datasette.io/"&gt;sqlite-utils&lt;/a&gt; so I could use the &lt;code&gt;convert&lt;/code&gt; command to turn a Python data structure into JSON using &lt;code&gt;json.dumps()&lt;/code&gt; and &lt;code&gt;eval()&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Load into SQLite
sqlite-utils insert s1k.db s1k output.csv --csv
# Fix that column
sqlite-utils convert s1k.db s1k metadata 'json.dumps(eval(value))' --import json
# Dump that back out to CSV
sqlite-utils rows s1k.db s1k --csv &amp;gt; s1k.csv
&lt;/code&gt;&lt;/pre&gt;
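&lt;p&gt;That &lt;code&gt;convert&lt;/code&gt; step works because the &lt;code&gt;metadata&lt;/code&gt; column holds Python dict literals stored as strings, which aren't valid JSON. Here's a minimal sketch of the same transformation in plain Python, using a made-up sample value and &lt;code&gt;ast.literal_eval()&lt;/code&gt; as a safer stand-in for &lt;code&gt;eval()&lt;/code&gt;:&lt;/p&gt;

```python
import ast
import json

# A made-up sample of what the metadata column looks like:
# a Python dict literal stored as a string, not valid JSON
# (single quotes, so json.loads() would reject it).
value = "{'instruction': 'You are an expert solver.'}"

# The sqlite-utils recipe runs json.dumps(eval(value)) per row;
# ast.literal_eval is an equivalent that only accepts literals.
fixed = json.dumps(ast.literal_eval(value))
print(fixed)  # {"instruction": "You are an expert solver."}
```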
&lt;p&gt;Here's that CSV &lt;a href="https://gist.github.com/simonw/048385f27e351c11b488bd9737452fa7"&gt;in a Gist&lt;/a&gt;, which means I can &lt;a href="https://lite.datasette.io/?install=datasette-pretty-json&amp;amp;csv=https://gist.githubusercontent.com/simonw/048385f27e351c11b488bd9737452fa7/raw/5270dacc5aa4a7385f9a6e3d691c81cf3595abc9/s1k.csv#/data/s1k?_facet=cot_type"&gt;load it into Datasette Lite&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of Datasette Lite showing cot_type 1, crossword 15, Link: 93, rowid: 93, solution: &amp;quot;### Answer: INCIDENT ROOM ### Explanation: Definition: investigators' facility **Anagram of**(... changes) NOTICED MINOR. Defn: ... in a police station.&amp;quot; Question text: &amp;quot;Solve the crossword puzzle. You are presented with a clue as input and the number of letters in brackets.&amp;quot; Clue: &amp;quot;Noticed minor changes in investigators' facility (8,4)&amp;quot; cot_type: crossword, source_type: 0xharib/xword1, metadata: { &amp;quot;instruction&amp;quot;: &amp;quot;You are an expert level solver of cryptic crosswords. You are presented with a clue as input. Respond with the answer and explanation.&amp;quot; }" src="https://static.simonwillison.net/static/2025/s1k.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;It really is a tiny amount of training data. It's mostly math and science, but there are also &lt;a href="https://lite.datasette.io/?install=datasette-pretty-json&amp;amp;csv=https://gist.githubusercontent.com/simonw/048385f27e351c11b488bd9737452fa7/raw/5270dacc5aa4a7385f9a6e3d691c81cf3595abc9/s1k.csv#/data/s1k?_facet=cot_type&amp;amp;cot_type=crossword"&gt;15 cryptic crossword examples&lt;/a&gt;.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parquet"&gt;parquet&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/duckdb"&gt;duckdb&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette-lite"&gt;datasette-lite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="datasette"/><category term="parquet"/><category term="duckdb"/><category term="datasette-lite"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="qwen"/><category term="ollama"/><category term="llm-reasoning"/><category term="ai-in-china"/></entry><entry><title>Mistral Small 3</title><link href="https://simonwillison.net/2025/Jan/30/mistral-small-3/#atom-tag" rel="alternate"/><published>2025-01-30T15:36:30+00:00</published><updated>2025-01-30T15:36:30+00:00</updated><id>https://simonwillison.net/2025/Jan/30/mistral-small-3/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://mistral.ai/news/mistral-small-3/"&gt;Mistral Small 3&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;First model release of 2025 for French AI lab Mistral, who describe Mistral Small 3 as "a latency-optimized 24B-parameter model released under the Apache 2.0 license."&lt;/p&gt;
&lt;p&gt;More notably, they claim the following:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Mistral Small 3 is competitive with larger models such as Llama 3.3 70B or Qwen 32B, and is an excellent open replacement for opaque proprietary models like GPT4o-mini. Mistral Small 3 is on par with Llama 3.3 70B instruct, while being more than 3x faster on the same hardware.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Llama 3.3 70B and Qwen 32B are two of my favourite models to run on my laptop - that ~20GB size turns out to be a great trade-off between memory usage and model utility. It's exciting to see a new entrant into that weight class.&lt;/p&gt;
&lt;p&gt;The license is important: previous Mistral Small models used their Mistral Research License, which prohibited commercial deployments unless you negotiate a commercial license with them. They appear to be moving away from that, at least for their core models:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;We’re renewing our commitment to using Apache 2.0 license for our general purpose models, as we progressively move away from MRL-licensed models&lt;/strong&gt;. As with Mistral Small 3, model weights will be available to download and deploy locally, and free to modify and use in any capacity. […] Enterprises and developers that need specialized capabilities (increased speed and context, domain specific knowledge, task-specific models like code completion) can count on additional commercial models complementing what we contribute to the community.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Despite being called Mistral Small 3, this appears to be the fourth release of a model under that label. The Mistral API calls this one &lt;code&gt;mistral-small-2501&lt;/code&gt; - previous model IDs were &lt;code&gt;mistral-small-2312&lt;/code&gt;, &lt;code&gt;mistral-small-2402&lt;/code&gt; and &lt;code&gt;mistral-small-2409&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;I've &lt;a href="https://github.com/simonw/llm-mistral/releases/tag/0.10"&gt;updated the llm-mistral plugin&lt;/a&gt; for talking directly to Mistral's &lt;a href="https://docs.mistral.ai/deployment/laplateforme/overview/"&gt;La Plateforme&lt;/a&gt; API:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install -U llm-mistral
llm keys set mistral
# Paste key here
llm -m mistral/mistral-small-latest "tell me a joke about a badger and a puffin"
&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote&gt;
&lt;p&gt;Sure, here's a light-hearted joke for you:&lt;/p&gt;
&lt;p&gt;Why did the badger bring a puffin to the party?&lt;/p&gt;
&lt;p&gt;Because he heard puffins make great party 'Puffins'!&lt;/p&gt;
&lt;p&gt;(That's a play on the word "puffins" and the phrase "party people.")&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;API pricing is $0.10/million tokens of input, $0.30/million tokens of output - half the price of the previous Mistral Small API model ($0.20/$0.60). For comparison, GPT-4o mini is $0.15/$0.60.&lt;/p&gt;
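&lt;p&gt;As a back-of-envelope comparison (the workload here is illustrative - 1 million input tokens plus 1 million output tokens):&lt;/p&gt;

```python
# Price per million tokens: (input, output) in USD
prices = {
    "mistral-small-2409 (old)": (0.20, 0.60),
    "mistral-small-2501 (new)": (0.10, 0.30),
    "gpt-4o-mini": (0.15, 0.60),
}

# Illustrative workload: 1M input tokens + 1M output tokens
for model, (inp, out) in prices.items():
    print(f"{model}: ${inp + out:.2f}")
```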
&lt;p&gt;Mistral also ensured that the new model was &lt;a href="https://ollama.com/library/mistral-small"&gt;available on Ollama&lt;/a&gt; in time for their release announcement.&lt;/p&gt;
&lt;p&gt;You can pull the model like this (fetching 14GB):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama run mistral-small:24b
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;a href="https://github.com/taketwo/llm-ollama"&gt;llm-ollama&lt;/a&gt; plugin will then let you prompt it like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-ollama
llm -m mistral-small:24b "say hi"
&lt;/code&gt;&lt;/pre&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/MistralAI/status/1884968836606136636"&gt;@MistralAI&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mistral"&gt;mistral&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;



</summary><category term="open-source"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="mistral"/><category term="llm-pricing"/><category term="ollama"/><category term="llm-release"/></entry><entry><title>Qwen2.5-1M: Deploy Your Own Qwen with Context Length up to 1M Tokens</title><link href="https://simonwillison.net/2025/Jan/26/qwen25-1m/#atom-tag" rel="alternate"/><published>2025-01-26T18:54:45+00:00</published><updated>2025-01-26T18:54:45+00:00</updated><id>https://simonwillison.net/2025/Jan/26/qwen25-1m/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://qwenlm.github.io/blog/qwen2.5-1m/"&gt;Qwen2.5-1M: Deploy Your Own Qwen with Context Length up to 1M Tokens&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Very significant new release from Alibaba's Qwen team. Their openly licensed (sometimes Apache 2, sometimes Qwen license, I've had trouble keeping up) Qwen 2.5 LLM previously had an input token limit of 128,000 tokens. This new model increases that to 1 million, using a new technique called &lt;strong&gt;Dual Chunk Attention&lt;/strong&gt;, first described in &lt;a href="https://arxiv.org/abs/2402.17463"&gt;this paper&lt;/a&gt; from February 2024.&lt;/p&gt;
&lt;p&gt;They've released two models on Hugging Face: &lt;a href="https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-1M"&gt;Qwen2.5-7B-Instruct-1M&lt;/a&gt; and &lt;a href="https://huggingface.co/Qwen/Qwen2.5-14B-Instruct-1M"&gt;Qwen2.5-14B-Instruct-1M&lt;/a&gt;, both requiring CUDA and both under an Apache 2.0 license.&lt;/p&gt;
&lt;p&gt;You'll need a &lt;em&gt;lot&lt;/em&gt; of VRAM to run them at their full capacity:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;VRAM Requirement for processing 1 million-token sequences:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Qwen2.5-7B-Instruct-1M&lt;/strong&gt;: At least 120GB VRAM (total across GPUs).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Qwen2.5-14B-Instruct-1M&lt;/strong&gt;: At least 320GB VRAM (total across GPUs).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If your GPUs do not have sufficient VRAM, you can still use Qwen2.5-1M models for shorter tasks.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Qwen recommend using their custom fork of vLLM to serve the models:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;You can also use the previous framework that supports Qwen2.5 for inference, but accuracy degradation may occur for sequences exceeding 262,144 tokens.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;GGUF quantized versions of the models are already starting to show up. LM Studio's "official model curator" &lt;a href="https://huggingface.co/bartowski"&gt;Bartowski&lt;/a&gt; published &lt;a href="https://huggingface.co/lmstudio-community/Qwen2.5-7B-Instruct-1M-GGUF"&gt;lmstudio-community/Qwen2.5-7B-Instruct-1M-GGUF&lt;/a&gt; and &lt;a href="https://huggingface.co/lmstudio-community/Qwen2.5-14B-Instruct-1M-GGUF"&gt;lmstudio-community/Qwen2.5-14B-Instruct-1M-GGUF&lt;/a&gt; - sizes range from 4.09GB to 8.1GB for the 7B model and 7.92GB to 15.7GB for the 14B.&lt;/p&gt;
&lt;p&gt;These might not work well yet with the full context lengths as the underlying &lt;code&gt;llama.cpp&lt;/code&gt; library may need some changes.&lt;/p&gt;
&lt;p&gt;I tried running the 8.1GB 7B model using &lt;a href="https://ollama.com/"&gt;Ollama&lt;/a&gt; on my Mac like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama run hf.co/lmstudio-community/Qwen2.5-7B-Instruct-1M-GGUF:Q8_0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then with &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-ollama
llm models -q qwen # To search for the model ID
# I set a shorter q1m alias:
llm aliases set q1m hf.co/lmstudio-community/Qwen2.5-7B-Instruct-1M-GGUF:Q8_0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I tried piping a large prompt in using &lt;a href="https://pypi.org/project/files-to-prompt/"&gt;files-to-prompt&lt;/a&gt; like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;files-to-prompt ~/Dropbox/Development/llm -e py -c | llm -m q1m 'describe this codebase in detail'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That should give me every Python file in my &lt;a href="https://github.com/simonw/llm"&gt;llm project&lt;/a&gt;. Piping that through &lt;a href="https://pypi.org/project/ttok/"&gt;ttok&lt;/a&gt; first told me this was 63,014 OpenAI tokens; I expect that count is similar for Qwen.&lt;/p&gt;
&lt;p&gt;The result &lt;a href="https://gist.github.com/simonw/ace6ff544dddabb0797b8d20d84627a8#response"&gt;was disappointing&lt;/a&gt;: it appeared to describe just the last Python file in that stream. Then I noticed the token usage report:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;2,048 input, 999 output
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This suggests to me that something's not working right here - maybe the Ollama hosting framework is truncating the input, or maybe there's a problem with the GGUF I'm using?&lt;/p&gt;
&lt;p&gt;I'll update this post when I figure out how to run longer prompts through the new Qwen model using GGUF weights on a Mac.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update:&lt;/strong&gt; It &lt;a href="https://news.ycombinator.com/item?id=42832838#42833427"&gt;turns out&lt;/a&gt; Ollama has a &lt;code&gt;num_ctx&lt;/code&gt; option which defaults to 2048, affecting the input context length. I tried this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;files-to-prompt \
  ~/Dropbox/Development/llm \
  -e py -c | \
llm -m q1m 'describe this codebase in detail' \
 -o num_ctx 80000
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;But I quickly ran out of RAM (I have 64GB but a lot of that was in use already) and hit &lt;code&gt;Ctrl+C&lt;/code&gt; to avoid crashing my computer. I need to experiment a bit to figure out how much RAM is used for what context size.&lt;/p&gt;
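&lt;p&gt;One way to estimate that before trying: the KV cache grows linearly with &lt;code&gt;num_ctx&lt;/code&gt;. Here's a rough sketch, assuming an fp16 KV cache and Qwen2.5-7B's published architecture (28 layers, 4 KV heads, head dimension 128 - numbers from the model card; Ollama's actual footprint will be higher once you add the model weights and compute buffers):&lt;/p&gt;

```python
def kv_cache_bytes(num_ctx, layers=28, kv_heads=4, head_dim=128, bytes_per_value=2):
    # Keys and values are each cached per layer, per KV head, per token,
    # at head_dim values of bytes_per_value each (2 bytes for fp16).
    return 2 * layers * kv_heads * head_dim * bytes_per_value * num_ctx

for ctx in (2_048, 32_000, 80_000):
    gib = kv_cache_bytes(ctx) / 1024**3
    print(f"num_ctx={ctx}: ~{gib:.1f} GiB for the KV cache alone")
```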
&lt;p&gt;Awni Hannun &lt;a href="https://twitter.com/awnihannun/status/1883611098081099914"&gt;shared tips&lt;/a&gt; for running &lt;a href="https://huggingface.co/mlx-community/Qwen2.5-7B-Instruct-1M-4bit"&gt;mlx-community/Qwen2.5-7B-Instruct-1M-4bit&lt;/a&gt; using MLX,  which should work for up to 250,000 tokens. They ran 120,000 tokens and reported:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Peak RAM for prompt filling was 22GB&lt;/li&gt;
&lt;li&gt;Peak RAM for generation 12GB&lt;/li&gt;
&lt;li&gt;Prompt filling took 350 seconds on an M2 Ultra&lt;/li&gt;
&lt;li&gt;Generation ran at 31 tokens-per-second on M2 Ultra&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/reach_vb/status/1883560095176708163"&gt;VB&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/long-context"&gt;long-context&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama-cpp"&gt;llama-cpp&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/files-to-prompt"&gt;files-to-prompt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="qwen"/><category term="mlx"/><category term="ollama"/><category term="long-context"/><category term="llama-cpp"/><category term="llm-release"/><category term="files-to-prompt"/><category term="lm-studio"/><category term="ai-in-china"/></entry><entry><title>DeepSeek-R1 and exploring DeepSeek-R1-Distill-Llama-8B</title><link href="https://simonwillison.net/2025/Jan/20/deepseek-r1/#atom-tag" rel="alternate"/><published>2025-01-20T15:20:23+00:00</published><updated>2025-01-20T15:20:23+00:00</updated><id>https://simonwillison.net/2025/Jan/20/deepseek-r1/#atom-tag</id><summary type="html">
    &lt;p&gt;DeepSeek are the Chinese AI lab who dropped the best currently available open weights LLM &lt;a href="https://simonwillison.net/2024/Dec/26/deepseek-v3/"&gt;on Christmas day&lt;/a&gt;, DeepSeek v3. That model was trained in part using their unreleased R1 "reasoning" model. Today &lt;a href="https://github.com/deepseek-ai/DeepSeek-R1"&gt;they've released R1 itself&lt;/a&gt;, along with a whole family of new models derived from that base.&lt;/p&gt;
&lt;p&gt;There's a whole lot of stuff in the new release.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;DeepSeek-R1-Zero&lt;/strong&gt; appears to be the base model. It's &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-R1-Zero/tree/main"&gt;over 650GB&lt;/a&gt; in size and, like most of their other releases, is &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-R1-Zero/blob/main/LICENSE"&gt;under a clean MIT license&lt;/a&gt;. DeepSeek warn that "DeepSeek-R1-Zero encounters challenges such as endless repetition, poor readability, and language mixing." ... so they also released:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;DeepSeek-R1&lt;/strong&gt; - which "incorporates cold-start data before RL" and "achieves performance comparable to OpenAI-o1 across math, code, and reasoning tasks". That one is also MIT licensed, and is a similar size.&lt;/p&gt;
&lt;p&gt;I don't have the ability to run models larger than about 50GB (I have an M2 with 64GB of RAM), so neither of these two models are something I can easily play with myself. That's where the new &lt;strong&gt;distilled models&lt;/strong&gt; come in.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;To support the research community, we have open-sourced DeepSeek-R1-Zero, DeepSeek-R1, and six dense models distilled from DeepSeek-R1 based on Llama and Qwen.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is a fascinating flex! They have models based on Qwen 2.5 (14B, 32B, Math 1.5B and Math 7B) and Llama 3 (Llama-3.1 8B and Llama 3.3 70B Instruct).&lt;/p&gt;
&lt;p&gt;Weirdly those Llama models have an MIT license attached, which I'm not sure is compatible with the underlying Llama license. Qwen models are Apache licensed so maybe MIT is OK?&lt;/p&gt;
&lt;p&gt;(I also &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B/commit/c79f47acaf303faabb7133b4b7b76f24231f2c8d#d2h-740129"&gt;just noticed&lt;/a&gt; the MIT license files say "Copyright (c) 2023 DeepSeek" so they may need to pay a little bit more attention to how they copied those in.)&lt;/p&gt;
&lt;p&gt;Licensing aside, these distilled models are fascinating beasts.&lt;/p&gt;
&lt;h4 id="running-deepseek-r1-distill-llama-8b-gguf"&gt;Running DeepSeek-R1-Distill-Llama-8B-GGUF&lt;/h4&gt;
&lt;p&gt;Quantized versions are already starting to show up. So far I've tried just one of those -  &lt;code&gt;unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF&lt;/code&gt; released by &lt;a href="https://unsloth.ai/"&gt;Unsloth AI&lt;/a&gt; - and it's really fun to play with.&lt;/p&gt;
&lt;p&gt;I'm running it using the combination of &lt;a href="https://ollama.com/"&gt;Ollama&lt;/a&gt;, &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; and the &lt;a href="https://github.com/taketwo/llm-ollama"&gt;llm-ollama plugin&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;First I fetched and ran the model using Ollama itself:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama run hf.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF:Q8_0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This downloads around 8.5GB of model data and starts an interactive chat interface.&lt;/p&gt;
&lt;p&gt;Once the model has been fetched, LLM can talk to it as well. I prefer using LLM for experiments because it &lt;a href="https://llm.datasette.io/en/stable/logging.html"&gt;logs everything to SQLite&lt;/a&gt; for later exploration.&lt;/p&gt;
&lt;p&gt;I installed the plugin and ran it like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-ollama
llm -m 'hf.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF:Q8_0' \
  'a joke about a pelican and a walrus who run a tea room together'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each response starts with a &lt;code&gt;&amp;lt;think&amp;gt;...&amp;lt;/think&amp;gt;&lt;/code&gt; pseudo-XML tag containing the chain of thought used to help generate the response. For "a joke about a pelican and a walrus who run a tea room together" that started like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt;&lt;br /&gt;
Okay, so I need to come up with a joke about a pelican and a walrus running a tea room together. Hmm, that's an interesting combination. Let me think about how these two characters might interact in a humorous situation.&lt;/p&gt;
&lt;p&gt;First, let's consider their characteristics. Pelicans are known for their long beaks and Webbed feet, often seen near the beach or water. Walruses have big teeth, thick fur, and they're generally found in colder climates, like icebergs or snowy areas. So, combining these two into a tea room setting is already a funny image.&lt;/p&gt;
&lt;p&gt;[...]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It then thought for &lt;em&gt;20 paragraphs&lt;/em&gt; before outputting the joke! Here's &lt;a href="https://gist.github.com/simonw/f505ce733a435c8fc8fdf3448e3816b0"&gt;the full transcript&lt;/a&gt; and here's the output at the end, after the &lt;code&gt;&amp;lt;/think&amp;gt;&lt;/code&gt; tag:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The Joke:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A pelican and a walrus decide to open a quaint little tea room together. The walrus, with its big size, struggles to find comfortable chairs, so it sits on the table by accident, knocking over the teapot. Meanwhile, the pelican, trying to help, uses its beak to place saucers on the table, causing a few spills.&lt;/p&gt;
&lt;p&gt;After a series of comical mishaps, the walrus looks up and says with a grin, "This isn't so fishy anymore." The pelican smirks and remarks, "Maybe not, but we do have a lot of krill in our tea!"&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Explanation:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Setup:&lt;/strong&gt; Pelican and Walrus run an unlikely tea room. Their sizes lead to comedic chaos.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Punchline:&lt;/strong&gt; "This isn't so fishy anymore" is a pun on the situation becoming less absurd. The follow-up joke introduces "krill," a play on words that ties into both marine life ("fish") and tea, making it humorous while fitting the theme.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;So yeah, the joke is &lt;em&gt;awful&lt;/em&gt;. But the process of getting there was such an interesting insight into how these new models work.&lt;/p&gt;
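&lt;p&gt;If you want the answer without the 20 paragraphs of deliberation, the reasoning block is easy to strip programmatically. A small sketch, assuming the single &lt;code&gt;&amp;lt;think&amp;gt;...&amp;lt;/think&amp;gt;&lt;/code&gt; pair shown above:&lt;/p&gt;

```python
import re

def split_think(response: str):
    """Separate the chain-of-thought from the final answer."""
    match = re.match(r"\s*<think>(.*?)</think>\s*(.*)", response, re.DOTALL)
    if match:
        return match.group(1).strip(), match.group(2).strip()
    return "", response.strip()  # no think block found

thinking, answer = split_think(
    "<think>Okay, so I need to come up with a joke...</think>**The Joke:** ..."
)
print(answer)  # **The Joke:** ...
```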
&lt;p&gt;This was a relatively small 8B model. I'm looking forward to trying out the &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B"&gt;Llama 70B version&lt;/a&gt;, which isn't yet available in a GGUF I can run with Ollama. Given &lt;a href="https://simonwillison.net/2024/Dec/9/llama-33-70b/"&gt;the strength of Llama 3.3 70B&lt;/a&gt; - currently my favourite GPT-4 class model that I've run on my own machine - I have high expectations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update 21st January 2025&lt;/strong&gt;: I got &lt;a href="https://huggingface.co/unsloth/DeepSeek-R1-Distill-Llama-70B-GGUF"&gt;this quantized version&lt;/a&gt; of that Llama 3.3 70B R1 distilled model working like this - a 34GB download:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;ollama run hf.co/unsloth/DeepSeek-R1-Distill-Llama-70B-GGUF:Q3_K_M&lt;/code&gt;&lt;/pre&gt;

&lt;h4 id="can-it-draw-a-pelican-"&gt;Can it draw a pelican?&lt;/h4&gt;
&lt;p&gt;I tried &lt;a href="https://github.com/simonw/pelican-bicycle"&gt;my classic&lt;/a&gt; &lt;code&gt;Generate an SVG of a pelican riding a bicycle&lt;/code&gt; prompt too. It &lt;a href="https://gist.github.com/simonw/ddd21a016cd7be74fd3e18407e1e313e"&gt;did not do very well&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/r1-peli-1.svg" alt="Blue square, abstract white shape on it" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;It looked to me like it got the order of the elements wrong, so I followed up with:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;the background ended up covering the rest of the image&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It &lt;a href="https://gist.github.com/simonw/ddd21a016cd7be74fd3e18407e1e313e#response-1"&gt;thought some more&lt;/a&gt; and gave me this:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/r1-peli-2.svg" alt="Blue square, abstract white shape on it" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;As with the earlier joke, the chain of thought in the transcript was far more interesting than the end result.&lt;/p&gt;

&lt;h4 id="other-ways-to-try-deepseek-r1"&gt;Other ways to try DeepSeek-R1&lt;/h4&gt;
&lt;p&gt;If you want to try the model out without installing anything at all you can do so using &lt;a href="https://chat.deepseek.com/"&gt;chat.deepseek.com&lt;/a&gt; - you'll need to create an account (sign in with Google, use an email address or provide a Chinese +86 phone number) and then select the "DeepThink" option below the prompt input box.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/deepseek-chat-deepthink.jpg" alt="Screenshot of DeepSeek Chat showing the DeepThink selected icon" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;DeepSeek offer the model via &lt;a href="https://api-docs.deepseek.com/"&gt;their API&lt;/a&gt;, using an OpenAI-imitating endpoint. You can access that via LLM by dropping this into &lt;a href="https://llm.datasette.io/en/stable/other-models.html#openai-compatible-models"&gt;your extra-openai-models.yaml configuration file&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;- &lt;span class="pl-ent"&gt;model_id&lt;/span&gt;: &lt;span class="pl-s"&gt;deepseek-reasoner&lt;/span&gt;
  &lt;span class="pl-ent"&gt;model_name&lt;/span&gt;: &lt;span class="pl-s"&gt;deepseek-reasoner&lt;/span&gt;
  &lt;span class="pl-ent"&gt;api_key_name&lt;/span&gt;: &lt;span class="pl-s"&gt;deepseek&lt;/span&gt;
  &lt;span class="pl-ent"&gt;api_base&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;https://api.deepseek.com&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then run &lt;code&gt;llm keys set deepseek&lt;/code&gt; and paste in your API key, then use &lt;code&gt;llm -m deepseek-reasoner 'prompt'&lt;/code&gt; to run prompts.&lt;/p&gt;
&lt;p&gt;This won't show you the reasoning tokens, sadly. Those are served up by the API (&lt;a href="https://gist.github.com/simonw/a5ca117dd0325c93a5b1f5a18c4a9e34"&gt;example here&lt;/a&gt;) but LLM doesn't yet have a way to display them.&lt;/p&gt;
&lt;p&gt;Since I published this article &lt;a href="https://www.ollama.com/"&gt;Ollama&lt;/a&gt; made available their own versions of the DeepSeek models. Confusingly, they've taken the various different distilled models and made them all available as different sizes of their &lt;a href="https://ollama.com/library/deepseek-r1"&gt;deepseek-r1&lt;/a&gt; model - for example, their 9GB R1 Qwen2.5-14B model is called &lt;code&gt;deepseek-r1:14b&lt;/code&gt; and can be run like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama run deepseek-r1:14b
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Using a browser that supports WebGPU such as Google Chrome you can visit &lt;a href="https://huggingface.co/spaces/webml-community/deepseek-r1-webgpu"&gt;webml-community/deepseek-r1-webgpu&lt;/a&gt; and load and then prompt a 1.28GB R1 model based on Qwen2.5-Math-1.5B entirely in your browser.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/deepseek-webgpu.jpg" alt="Screenshot of the webml-community/deepseek-r1-webgpu space running a prompt: Solve the equation x^2 - 3x + 2 = 0. It thinks through the problem and produces a final answer of 1 and 2." style="max-width: 100%;" /&gt;&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deepseek"&gt;deepseek&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="ollama"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="deepseek"/><category term="llm-release"/><category term="ai-in-china"/></entry><entry><title>microsoft/phi-4</title><link href="https://simonwillison.net/2025/Jan/8/phi-4/#atom-tag" rel="alternate"/><published>2025-01-08T17:57:18+00:00</published><updated>2025-01-08T17:57:18+00:00</updated><id>https://simonwillison.net/2025/Jan/8/phi-4/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/microsoft/phi-4"&gt;microsoft/phi-4&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Here's the official release of Microsoft's Phi-4 LLM, now under an MIT license.&lt;/p&gt;
&lt;p&gt;A few weeks ago I covered the earlier &lt;a href="https://simonwillison.net/2024/Dec/15/phi-4-technical-report/"&gt;unofficial versions&lt;/a&gt;, where I talked about how the model used synthetic training data in some really interesting ways.&lt;/p&gt;
&lt;p&gt;It benchmarks favorably compared to GPT-4o, suggesting this is yet another example of a GPT-4 class model &lt;a href="https://simonwillison.net/2024/Dec/31/llms-in-2024/#some-of-those-gpt-4-models-run-on-my-laptop"&gt;that can run on a good laptop&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The model already has several available community quantizations. I ran the &lt;a href="https://huggingface.co/mlx-community/phi-4-4bit"&gt;mlx-community/phi-4-4bit&lt;/a&gt; one (a 7.7GB download) using &lt;a href="https://pypi.org/project/mlx-lm/"&gt;mlx-lm&lt;/a&gt; like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uv run --with 'numpy&amp;lt;2' --with mlx-lm python -c '
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/phi-4-4bit")

prompt = "Generate an SVG of a pelican riding a bicycle"

if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True, max_tokens=2048)
print(response)'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;a href="https://gist.github.com/simonw/f58e464dd653e1c637cf42d18416344d"&gt;Here's what I got back&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img style="width: 80%" src="https://static.simonwillison.net/static/2025/phi4-pelican.svg" alt="Hardly recognizable pelican on a bicycle"&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update:&lt;/strong&gt; The model is now available &lt;a href="https://ollama.com/library/ph4"&gt;via Ollama&lt;/a&gt;, so you can fetch a 9.1GB model file using &lt;code&gt;ollama run phi4&lt;/code&gt;, after which it becomes available via the &lt;a href="https://github.com/taketwo/llm-ollama"&gt;llm-ollama&lt;/a&gt; plugin.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/microsoft"&gt;microsoft&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/phi"&gt;phi&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;



</summary><category term="microsoft"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="phi"/><category term="uv"/><category term="mlx"/><category term="ollama"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/></entry><entry><title>Open WebUI</title><link href="https://simonwillison.net/2024/Dec/27/open-webui/#atom-tag" rel="alternate"/><published>2024-12-27T01:38:14+00:00</published><updated>2024-12-27T01:38:14+00:00</updated><id>https://simonwillison.net/2024/Dec/27/open-webui/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/open-webui/open-webui"&gt;Open WebUI&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I tried out this open source (MIT licensed, JavaScript and Python) localhost UI for accessing LLMs today for the first time. It's very nicely done.&lt;/p&gt;
&lt;p&gt;I ran it with &lt;a href="https://docs.astral.sh/uv/guides/tools/"&gt;uvx&lt;/a&gt; like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uvx --python 3.11 open-webui serve
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;On first launch it installed a bunch of dependencies and then downloaded 903MB to &lt;code&gt;~/.cache/huggingface/hub/models--sentence-transformers--all-MiniLM-L6-v2&lt;/code&gt; - a copy of the &lt;a href="https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2"&gt;all-MiniLM-L6-v2&lt;/a&gt; embedding model, presumably for its &lt;a href="https://docs.openwebui.com/features/rag"&gt;RAG feature&lt;/a&gt;.&lt;/p&gt;
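&lt;p&gt;That embedding model is the heart of retrieval: documents and queries get embedded as vectors and ranked by cosine similarity. A minimal sketch of that ranking step in plain Python - the three-dimensional vectors below are toy stand-ins for real all-MiniLM-L6-v2 embeddings, which are 384-dimensional:&lt;/p&gt;

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def top_match(query_vec, doc_vecs):
    """Index of the document vector most similar to the query vector."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy 3-dimensional "embeddings" standing in for real ones:
docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
query = [0.9, 0.1, 0.0]
best = top_match(query, docs)
```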
&lt;p&gt;It then presented me with a working Llama 3.2:3b chat interface, which surprised me because I hadn't spotted it downloading that model. It turns out that was because I have &lt;a href="https://ollama.com/"&gt;Ollama&lt;/a&gt; running on my laptop already (with several models, including Llama 3.2:3b, already installed) - and Open WebUI automatically detected Ollama and gave me access to a list of available models.&lt;/p&gt;
&lt;p&gt;I found a "knowledge" section and added all of the Datasette documentation (by dropping in the &lt;code&gt;.rst&lt;/code&gt; files from the docs) - and now I can type &lt;code&gt;#&lt;/code&gt; in chat to search for a file, add that to the context and then ask questions about it directly.&lt;/p&gt;
&lt;p&gt;I selected the &lt;code&gt;spatialite.rst.txt&lt;/code&gt; file, prompted it with "How do I use SpatiaLite with Datasette" and got back &lt;a href="https://gist.github.com/simonw/3bf5ff0ed3b47aefbb94d3fd992f81f9#response"&gt;this&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Chat UI. Prompt is How do I use SpatiaLite with Datasette? with an attached spatialite.rst.txt file. The response from llama3.2:3b reads: To use SpatiaLite with Datasette, you need to install the mod_spatialite dynamic library. This can then be loaded into Datasette using the --load-extension command-line option.   You can install it on OS X by running brew update and then brew install spatialite-tools. On Linux, SpatiaLite is packaged for most distributions, so you can typically install it with a package manager like apt.   When running Datasette, you can look for SpatiaLite in common installation locations or specify the full path to the extension if it's installed elsewhere.   For example, on OS X, you might run datasette --load-extension=spatialite after installing it with Homebrew. On Linux, you could use datasette --load-extension=/usr/lib/x86_64-linux-gnu/mod_spatialite.so.   It's also important to note that SpatiaLite adds a large number of additional SQL functions, some of which are not safe for untrusted users to execute. To secure your Datasette instance, consider disabling arbitrary SQL queries and defining canned queries with the SQL queries that use SpatiaLite functions you want people to be able to execute." src="https://static.simonwillison.net/static/2024/spatialite-webui.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;That's honestly a very solid answer, especially considering the Llama 3.2 3B model from Ollama is just a 1.9GB file! It's impressive how well that model can handle basic Q&amp;amp;A and summarization against text provided to it - it somehow has a 128,000 token context size.&lt;/p&gt;
&lt;p&gt;Open WebUI has a lot of other tricks up its sleeve: it can talk directly to API models such as OpenAI, has optional integrations with web search and custom tools, and logs every interaction to a SQLite database. It also comes with &lt;a href="https://docs.openwebui.com/"&gt;extensive documentation&lt;/a&gt;.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rag"&gt;rag&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="sqlite"/><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="local-llms"/><category term="llms"/><category term="rag"/><category term="uv"/><category term="ollama"/></entry></feed>