<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: o3</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/o3.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-06-11T04:25:24+00:00</updated><author><name>Simon Willison</name></author><entry><title>Quoting Ted Sanders</title><link href="https://simonwillison.net/2025/Jun/11/ted-sanders/#atom-tag" rel="alternate"/><published>2025-06-11T04:25:24+00:00</published><updated>2025-06-11T04:25:24+00:00</updated><id>https://simonwillison.net/2025/Jun/11/ted-sanders/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://news.ycombinator.com/item?id=44242198"&gt;&lt;p&gt;[on the cheaper o3] Not quantized. Weights are the same.&lt;/p&gt;
&lt;p&gt;If we did change the model, we'd release it as a new model with a new name in the API (e.g., o3-turbo-2025-06-10). It would be very annoying to API customers if we ever silently changed models, so we never do this [1].&lt;/p&gt;
&lt;p&gt;[1] &lt;code&gt;chatgpt-4o-latest&lt;/code&gt; being an explicit exception&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://news.ycombinator.com/item?id=44242198"&gt;Ted Sanders&lt;/a&gt;, Research Manager, OpenAI&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/o3"&gt;o3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;&lt;/p&gt;



</summary><category term="generative-ai"/><category term="openai"/><category term="o3"/><category term="ai"/><category term="llms"/></entry><entry><title>o3-pro</title><link href="https://simonwillison.net/2025/Jun/10/o3-pro/#atom-tag" rel="alternate"/><published>2025-06-10T20:46:00+00:00</published><updated>2025-06-10T20:46:00+00:00</updated><id>https://simonwillison.net/2025/Jun/10/o3-pro/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://platform.openai.com/docs/models/o3-pro"&gt;o3-pro&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;OpenAI released o3-pro today, which they describe as a "version of o3 with more compute for better responses".&lt;/p&gt;
&lt;p&gt;It's only available via the newer Responses API. I've added it to my &lt;a href="https://github.com/simonw/llm-openai-plugin"&gt;llm-openai-plugin&lt;/a&gt; plugin which uses that new API, so you can try it out like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install -U llm-openai-plugin
llm -m openai/o3-pro "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="Description by o3-pro: The image is a playful, minimalist cartoon showing a white bird riding a bicycle. The bird has a simple oval body, a round head with a small black eye, and a yellow beak. Its orange feet are positioned on the bicycle’s pedals. The bicycle itself is drawn with thin black lines forming two large circular wheels and a straightforward frame. The scene has a light blue background with a soft gray oval shadow beneath the bicycle, giving the impression of ground. Overall, the illustration has a light, whimsical feel." src="https://static.simonwillison.net/static/2025/o3-pro-pelican.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;It's &lt;em&gt;slow&lt;/em&gt; - &lt;a href="https://gist.github.com/simonw/6bc7dda9dbe07281d902d254e5fb6e33"&gt;generating this pelican&lt;/a&gt; took 124 seconds! OpenAI suggest using their &lt;a href="https://platform.openai.com/docs/guides/background"&gt;background mode&lt;/a&gt; for o3 prompts, which I haven't tried myself yet.&lt;/p&gt;
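&lt;p&gt;I haven't tried background mode yet, but the documented shape is straightforward: the create call returns immediately and you poll until the response finishes. Here's an untested sketch - the &lt;code&gt;background=True&lt;/code&gt; parameter and the status values come from OpenAI's docs, but the helper function names are my own:&lt;/p&gt;

```python
import time

def create_background_response(client, prompt, model="o3-pro"):
    """Start a long-running request via the Responses API's background
    mode; the HTTP call returns immediately with a queued response."""
    return client.responses.create(model=model, input=prompt, background=True)

def poll_until_done(client, response, interval=5.0):
    """Poll a background response until it leaves queued/in_progress."""
    while response.status in ("queued", "in_progress"):
        time.sleep(interval)
        response = client.responses.retrieve(response.id)
    return response

# Usage sketch (requires OPENAI_API_KEY and the openai package):
#   from openai import OpenAI
#   client = OpenAI()
#   resp = create_background_response(client, "Generate an SVG of a pelican riding a bicycle")
#   print(poll_until_done(client, resp).output_text)
```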
&lt;p&gt;o3-pro is priced at $20/million input tokens and $80/million output tokens - 10x the price of regular o3 after its &lt;a href="https://simonwillison.net/2025/Jun/10/o3-price-drop/"&gt;80% price drop&lt;/a&gt; this morning.&lt;/p&gt;
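&lt;p&gt;Token pricing like this is easy to sanity-check with a few lines of arithmetic. A minimal sketch using the rates above - the token counts are hypothetical:&lt;/p&gt;

```python
def cost_usd(input_tokens, output_tokens, input_per_m, output_per_m):
    """Cost in USD given per-million-token rates."""
    return input_tokens / 1e6 * input_per_m + output_tokens / 1e6 * output_per_m

# Hypothetical prompt: 10k input tokens, 5k output tokens
o3_pro_cost = cost_usd(10_000, 5_000, 20, 80)  # o3-pro: $20/$80 per million
o3_cost = cost_usd(10_000, 5_000, 2, 8)        # o3 after the drop: $2/$8
print(o3_pro_cost, o3_cost)  # ~$0.60 vs ~$0.06 - the 10x difference
```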
&lt;p&gt;Ben Hylak had early access and published his notes so far in &lt;a href="https://www.latent.space/p/o3-pro"&gt;God is hungry for Context: First thoughts on o3 pro&lt;/a&gt;. It sounds like this model needs to be applied very thoughtfully. In comparison to o3:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It's smarter. &lt;em&gt;much smarter.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;But in order to see that, you need to give it&lt;/strong&gt; &lt;strong&gt;&lt;em&gt;a lot&lt;/em&gt;&lt;/strong&gt; &lt;strong&gt;more context. and I'm running out of context.&lt;/strong&gt; [...]&lt;/p&gt;
&lt;p&gt;My co-founder Alexis and I took the time to assemble a history of all of our past planning meetings at Raindrop, all of our goals, even record voice memos: and then asked o3-pro to come up with a plan.&lt;/p&gt;
&lt;p&gt;We were blown away; it spit out the exact kind of concrete plan and analysis I've always wanted an LLM to create --- complete with target metrics, timelines, what to prioritize, and strict instructions on what to absolutely cut.&lt;/p&gt;
&lt;p&gt;The plan o3 gave us was plausible, reasonable; but the plan o3 Pro gave us was specific and rooted enough that &lt;strong&gt;&lt;em&gt;it actually changed how we are thinking about our future.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This is hard to capture in an eval.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It sounds to me like o3-pro works best when combined with tools. I don't have tool support in &lt;code&gt;llm-openai-plugin&lt;/code&gt; yet, &lt;a href="https://github.com/simonw/llm-openai-plugin/issues/20"&gt;here's the relevant issue&lt;/a&gt;.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/o3"&gt;o3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="o3"/><category term="llm-release"/></entry><entry><title>o3 price drop</title><link href="https://simonwillison.net/2025/Jun/10/o3-price-drop/#atom-tag" rel="alternate"/><published>2025-06-10T20:02:30+00:00</published><updated>2025-06-10T20:02:30+00:00</updated><id>https://simonwillison.net/2025/Jun/10/o3-price-drop/#atom-tag</id><summary type="html">
    &lt;p&gt;OpenAI just dropped the price of their o3 model by 80% - from $10/million input tokens and $40/million output tokens to just $2/million and $8/million for the very same model. This is in advance of the release of o3-pro which apparently is coming &lt;a href="https://twitter.com/OpenAI/status/1932483131363504334"&gt;later today&lt;/a&gt; (update: &lt;a href="https://simonwillison.net/2025/Jun/10/o3-pro/"&gt;here it is&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;This is a pretty huge shake-up in LLM pricing. o3 is now priced the same as GPT-4.1, and slightly less than GPT-4o ($2.50/$10). It’s also less than Anthropic’s Claude Sonnet 4 ($3/$15) and Opus 4 ($15/$75) and sits in between Google’s Gemini 2.5 Pro for &amp;gt;200,000 tokens ($2.50/$15) and 2.5 Pro for &amp;lt;200,000 ($1.25/$10).&lt;/p&gt;
&lt;p&gt;I’ve updated my &lt;a href="https://www.llm-prices.com/"&gt;llm-prices.com&lt;/a&gt; pricing calculator with the new rate.&lt;/p&gt;
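&lt;p&gt;Those comparisons are easy to reproduce. Here's a rough script using the per-million-token rates quoted above against a hypothetical workload of 100,000 input tokens and 20,000 output tokens:&lt;/p&gt;

```python
# USD per million tokens (input, output), as quoted above
rates = {
    "o3 (new)": (2.00, 8.00),
    "GPT-4.1": (2.00, 8.00),
    "GPT-4o": (2.50, 10.00),
    "Claude Sonnet 4": (3.00, 15.00),
    "Claude Opus 4": (15.00, 75.00),
    "Gemini 2.5 Pro (up to 200k)": (1.25, 10.00),
    "Gemini 2.5 Pro (over 200k)": (2.50, 15.00),
}

def workload_cost(input_per_m, output_per_m, in_tok=100_000, out_tok=20_000):
    """Dollar cost of a hypothetical 100k-in / 20k-out workload."""
    return in_tok / 1e6 * input_per_m + out_tok / 1e6 * output_per_m

# Print the models cheapest-first for this workload
for name, (inp, out) in sorted(rates.items(), key=lambda kv: workload_cost(*kv[1])):
    print(f"{name:28s} ${workload_cost(inp, out):.3f}")
```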
&lt;p&gt;How have they dropped the price so much? OpenAI's Adam Groth &lt;a href="https://twitter.com/TheRealAdamG/status/1932440328293806321"&gt;credits ongoing optimization work&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;thanks to the engineers optimizing inferencing.&lt;/p&gt;
&lt;/blockquote&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/o3"&gt;o3&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="llm-pricing"/><category term="o3"/></entry><entry><title>How I used o3 to find CVE-2025-37899, a remote zeroday vulnerability in the Linux kernel’s SMB implementation</title><link href="https://simonwillison.net/2025/May/24/sean-heelan/#atom-tag" rel="alternate"/><published>2025-05-24T21:09:40+00:00</published><updated>2025-05-24T21:09:40+00:00</updated><id>https://simonwillison.net/2025/May/24/sean-heelan/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://sean.heelan.io/2025/05/22/how-i-used-o3-to-find-cve-2025-37899-a-remote-zeroday-vulnerability-in-the-linux-kernels-smb-implementation/"&gt;How I used o3 to find CVE-2025-37899, a remote zeroday vulnerability in the Linux kernel’s SMB implementation&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Sean Heelan:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The vulnerability [o3] found is CVE-2025-37899 (fix &lt;a href="https://github.com/torvalds/linux/commit/2fc9feff45d92a92cd5f96487655d5be23fb7e2b"&gt;here&lt;/a&gt;), a use-after-free in the handler for the SMB 'logoff' command. Understanding the vulnerability requires reasoning about concurrent connections to the server, and how they may share various objects in specific circumstances. o3 was able to comprehend this and spot a location where a particular object that is not referenced counted is freed while still being accessible by another thread. As far as I'm aware, this is the first public discussion of a vulnerability of that nature being found by a LLM.&lt;/p&gt;
&lt;p&gt;Before I get into the technical details, the main takeaway from this post is this: with o3 LLMs have made a leap forward in their ability to reason about code, and if you work in vulnerability research you should start paying close attention. If you're an expert-level vulnerability researcher or exploit developer the machines aren't about to replace you. In fact, it is quite the opposite: they are now at a stage where they can make you &lt;em&gt;significantly&lt;/em&gt; more efficient and effective. If you have a problem that can be represented in fewer than 10k lines of code there is a reasonable chance o3 can either solve it, or help you solve it.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Sean used my &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; tool to help find the bug! He ran it against the prompts he shared &lt;a href="https://github.com/SeanHeelan/o3_finds_cve-2025-37899"&gt;in this GitHub repo&lt;/a&gt; using the following command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm --sf system_prompt_uafs.prompt              \
    -f session_setup_code.prompt                \
    -f ksmbd_explainer.prompt                   \
    -f session_setup_context_explainer.prompt   \
    -f audit_request.prompt
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Sean ran the same prompt 100 times, so I'm glad he was using the new, more efficient &lt;a href="https://simonwillison.net/2025/Apr/7/long-context-llm/#improving-llm-s-support-for-long-context-models"&gt;fragments mechanism&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;o3 found the first, already-known vulnerability in 8 out of 100 runs - and it found the brand new one in just 1 of the 100 runs it performed with a larger context.&lt;/p&gt;
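&lt;p&gt;Those hit rates are exactly why running the same prompt many times pays off: if each run is an independent 1% chance of spotting the bug, 100 runs give you better-than-even odds. A quick back-of-envelope check:&lt;/p&gt;

```python
def p_at_least_one_hit(per_run_rate, runs):
    """Chance that at least one of `runs` independent attempts succeeds."""
    return 1 - (1 - per_run_rate) ** runs

# Known vulnerability: found in 8/100 runs; the new one: 1/100
print(p_at_least_one_hit(0.08, 100))  # ~0.9998
print(p_at_least_one_hit(0.01, 100))  # ~0.63
```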
&lt;p&gt;I thoroughly enjoyed this snippet which perfectly captures how I feel when I'm iterating on prompts myself:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In fact my entire system prompt is speculative in that I haven’t run a sufficient number of evaluations to determine if it helps or hinders, so consider it equivalent to me saying a prayer, rather than anything resembling science or engineering.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Sean's conclusion with respect to the utility of these models for security research:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If we were to never progress beyond what o3 can do right now, it would still make sense for everyone working in VR [Vulnerability Research] to figure out what parts of their work-flow will benefit from it, and to build the tooling to wire it in. Of course, part of that wiring will be figuring out how to deal with the signal to noise ratio of ~1:50 in this case, but that’s something we are already making progress at.&lt;/p&gt;
&lt;/blockquote&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=44081338"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/security"&gt;security&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/o3"&gt;o3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/system-prompts"&gt;system-prompts&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-security-research"&gt;ai-security-research&lt;/a&gt;&lt;/p&gt;



</summary><category term="security"/><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="llm-reasoning"/><category term="o3"/><category term="system-prompts"/><category term="ai-security-research"/></entry><entry><title>o3 o4-mini o1-pro</title><link href="https://simonwillison.net/2025/May/12/o3/#atom-tag" rel="alternate"/><published>2025-05-12T18:38:59+00:00</published><updated>2025-05-12T18:38:59+00:00</updated><id>https://simonwillison.net/2025/May/12/o3/#atom-tag</id><summary type="html">
    &lt;p&gt;It's interesting how much my perception of o3 as being the latest, best model released by OpenAI is tarnished by the co-release of o4-mini. I'm also still not entirely sure how to compare o3 to o1-pro, especially given o1-pro is 15x more expensive via the OpenAI API.&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/o1"&gt;o1&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/o3"&gt;o3&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="o1"/><category term="llm-reasoning"/><category term="o3"/></entry><entry><title>o3 Beats a Master-Level Geoguessr Player—Even with Fake EXIF Data</title><link href="https://simonwillison.net/2025/Apr/28/o3-geoguessr/#atom-tag" rel="alternate"/><published>2025-04-28T15:07:22+00:00</published><updated>2025-04-28T15:07:22+00:00</updated><id>https://simonwillison.net/2025/Apr/28/o3-geoguessr/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://sampatt.com/blog/2025-04-28-can-o3-beat-a-geoguessr-master"&gt;o3 Beats a Master-Level Geoguessr Player—Even with Fake EXIF Data&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Sam Patterson (&lt;a href="https://simonwillison.net/2025/Apr/26/geoguessr/"&gt;previously&lt;/a&gt;) puts his GeoGuessr ELO of 1188 (just short of the top champions division) to good use, exploring o3's ability to guess the location from a photo in a much more thorough way than &lt;a href="https://simonwillison.net/2025/Apr/26/o3-photo-locations/"&gt;my own experiment&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Over five rounds o3 narrowly beat him: it out-guessed Sam in only 2 of the 5 rounds, but ended with the higher total score thanks to much closer guesses in the rounds it won.&lt;/p&gt;
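&lt;p&gt;That result is less paradoxical than it sounds: GeoGuessr awards up to 5,000 points per round, scaled by how close the guess is, so a couple of near-perfect rounds can outweigh a better round count. A toy illustration with invented scores:&lt;/p&gt;

```python
# Invented per-round scores (GeoGuessr awards up to 5,000 points per
# round, with closer guesses scoring higher) - purely illustrative
sam_scores = [4800, 4600, 4700, 3000, 3500]
o3_scores = [4500, 4400, 4300, 4900, 4950]

rounds_o3_won = sum(1 for s, o in zip(sam_scores, o3_scores) if o > s)
print(rounds_o3_won, sum(o3_scores), sum(sam_scores))
# o3 takes only 2 of 5 rounds here, yet finishes with the higher total
```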
&lt;p&gt;Even more interestingly, Sam experimented with feeding images with fake EXIF GPS locations to see if o3 (when reminded to use Python to read those tags) would fall for the trick. It spotted the ruse:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Those coordinates put you in suburban Bangkok, Thailand—obviously nowhere near the Andean coffee-zone scene in the photo. So either the file is a re-encoded Street View frame with spoofed/default metadata, or the camera that captured the screenshot had stale GPS information.&lt;/p&gt;
&lt;/blockquote&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/o3"&gt;o3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/geoguessing"&gt;geoguessing&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="vision-llms"/><category term="o3"/><category term="geoguessing"/></entry><entry><title>Watching o3 guess a photo's location is surreal, dystopian and wildly entertaining</title><link href="https://simonwillison.net/2025/Apr/26/o3-photo-locations/#atom-tag" rel="alternate"/><published>2025-04-26T12:59:10+00:00</published><updated>2025-04-26T12:59:10+00:00</updated><id>https://simonwillison.net/2025/Apr/26/o3-photo-locations/#atom-tag</id><summary type="html">
    &lt;p&gt;Watching OpenAI's new o3 model guess where a photo was taken is one of those moments where decades of science fiction suddenly come to life. It's a cross between the &lt;a href="https://tvtropes.org/pmwiki/pmwiki.php/Main/EnhanceButton"&gt;Enhance Button&lt;/a&gt; and &lt;a href="https://tvtropes.org/pmwiki/pmwiki.php/Main/OmniscientDatabase"&gt;Omniscient Database&lt;/a&gt; TV Tropes.&lt;/p&gt;
&lt;p&gt;The process is simple. Snap a photograph of somewhere, ideally without putting any obvious landmarks in there. Feed it to &lt;a href="https://www.chatgpt.com/"&gt;ChatGPT&lt;/a&gt; o3 or o4-mini (currently only available on the $20/month Plus plan) and prompt "Guess where this photo was taken". Let it churn.&lt;/p&gt;
&lt;p&gt;(I found I needed to turn off all of ChatGPT's memory features first or it would cheat and use context from my previous conversations.)&lt;/p&gt;
&lt;p&gt;I gave it this photo, taken while sat at an outside bar near my home in El Granada, California. I think it's a pretty solid challenge: not many identifying features, just a bit of road, an anonymous house and some hills in the background:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/guess-location.jpg" alt="Residential scene with a cream-colored house with gray roof, white picket fence, and two white vehicles parked nearby. In the foreground is a glass-enclosed fire table and orange flowers. Background shows hills under an overcast sky with power lines crossing above. A person in red stands between vehicles near a yellow directional sign." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;... and off it went.&lt;/p&gt;
&lt;p&gt;I've shared &lt;a href="https://chatgpt.com/share/680c6160-a0c4-8006-a4de-cb8aff785f46"&gt;the full transcript here&lt;/a&gt; - be sure to expand the "Thought for 6m 48s" link to see the full details of how it attacked the problem.&lt;/p&gt;
&lt;p&gt;It starts with a very weird mistake: it briefly pretends that it can't see the image at all:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The user uploaded an image and is asking me to guess the location. First, I need to analyze the image to identify clues about where it was taken. Since I can't actually see images, I’ll rely on the metadata. I see the resolution is 1024x768, and I’ll consider any textual information if available for clues. Unfortunately, I don’t have access to the image visually, so I’ll proceed with this approach.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Then it snaps out of that and starts using its vision abilities:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The image shows a residential neighborhood with small houses, one of which is light green with a white picket fence and a grey roof. The fire pit and signposts hint at a restaurant or cafe, possibly near the coast. The environment, with olive trees and California poppies, suggests a coastal California location, perhaps Central Coast like Cambria or Morro Bay. The pastel-colored houses and the hills in the background resemble areas like Big Sur. A license plate could offer more, but it's hard to read.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Now here's where it leaps straight into science fiction:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Alright, I’m going to zoom into the license plate to try to confirm the state, as it could help identify the location more precisely. The plate might give me clues about where this photo was taken, especially if it's from a recognizable state. Let’s see if that's visible enough to be useful!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It runs some Python code!&lt;/p&gt;
&lt;blockquote&gt;
&lt;pre&gt;&lt;span class="pl-c"&gt;# crop license plate area&lt;/span&gt;
&lt;span class="pl-s1"&gt;w&lt;/span&gt;, &lt;span class="pl-s1"&gt;h&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;img&lt;/span&gt;.&lt;span class="pl-c1"&gt;size&lt;/span&gt;
&lt;span class="pl-c"&gt;# approximate area: roughly center x maybe 1700? Actually width is likely &amp;gt; 3024? Let's print size&lt;/span&gt;
&lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;img&lt;/span&gt;.&lt;span class="pl-c1"&gt;size&lt;/span&gt;)

(&lt;span class="pl-c1"&gt;3456&lt;/span&gt;, &lt;span class="pl-c1"&gt;4608&lt;/span&gt;)&lt;/pre&gt;
&lt;/blockquote&gt;
&lt;p&gt;Frustratingly the shared conversation doesn't include most of the other Python it used (the sharing feature is still quite buggy). Here's a screenshot of what happened next:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/zoom-crop.jpg" alt="| want to crop a small bounding box around the license plate. The car behind the picket fence is likely near the center, so I'll focus on that area. I'll define the region with approximate coordinates: x 1500 to 2000 and y 2500 to 3000. By selecting this region, I can better evaluate the plate and see if it's clearer to identify. Let's manually crop and inspect it. Then it says Analyzing Data, and runs Python code: crop = img.crop( (1300, 2400, 2200, 3200)); plt.imshow (crop); plt.axis( off&amp;quot;); Output is (-0.5, 899.5, 799.5, -0.5). It shows the cropped photo and says: The license plate appears to follow the standard California design — white background with red characters. To be sure, I'll zoom in on the area around the plate to check for more details and confirm the state. This could give me a stronger clue as to where the photo was taken. Let's see if it's clearer now!" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;It kept running like this for &lt;em&gt;six more minutes&lt;/em&gt; - cropping and zooming to an area of the photo, musing about what it could see, comparing it to previous observations and then cropping and zooming and analyzing again.&lt;/p&gt;
&lt;p&gt;To cut straight to the end, here's the final answer it came up with:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I’m getting strong Central-Coast-California vibes from this scene:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Clue&lt;/th&gt;
&lt;th&gt;What it suggests&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;California-style license plate on the SUV&lt;/td&gt;
&lt;td&gt;We’re almost certainly in CA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Masses of blooming California poppies, an olive tree, and a cool, marine-layer sky&lt;/td&gt;
&lt;td&gt;Typical coastal‐central-CA vegetation and weather&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Low pastel beach cottages with white picket fences and a rope-and-post patio rail&lt;/td&gt;
&lt;td&gt;Common architecture in the little beach towns between Santa Cruz and Pismo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rolling green coastal range right behind town (no tall peaks, no snow)&lt;/td&gt;
&lt;td&gt;Fits the Santa Lucia / Santa Cruz coastal hills&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Put together, that points to one of the small Highway 1 villages. The mix of cottages and the immediate backdrop of hills feels most like &lt;strong&gt;Cambria, California&lt;/strong&gt; (just north of Morro Bay and Cayucos). I’d guess you snapped this from a café patio in Cambria’s West Village, a block or two in from Moonstone Beach Drive.&lt;/p&gt;
&lt;p&gt;(If I’m off, my next guess would be Half Moon Bay–El Granada further north—but Cambria is my first instinct!)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;How did it do? Central Coast California was exactly right. Cambria, California was about 200 miles off... but the "my next guess" was spot on! The photo was indeed taken in El Granada, a tiny town just north of Half Moon Bay.&lt;/p&gt;
&lt;p&gt;I'm confident it didn't cheat and look at the EXIF data on the photograph, because if it had cheated it wouldn't have guessed Cambria first. If you're still suspicious, try stripping EXIF by taking a screenshot and running an experiment yourself - I've tried this and it still works the same way. &lt;strong&gt;Update:&lt;/strong&gt; vessenes on Hacker News &lt;a href="https://news.ycombinator.com/item?id=43732506#43732866"&gt;reported&lt;/a&gt; an instance where it did use the &lt;a href="https://pillow.readthedocs.io/en/stable/reference/ExifTags.html"&gt;ExifTags&lt;/a&gt; package and lie about it, but it was at least visible &lt;a href="https://chatgpt.com/share/6802e229-c6a0-800f-898a-44171a0c7de4"&gt;in the thinking trace&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;o3 certainly isn't the only model that can do this: I've tried similar things with Claude 3.5 and 3.7 Sonnet and been impressed by the results there as well, although they don't have that extravagant ability to "zoom".&lt;/p&gt;

&lt;p&gt;How much does the zooming actually help? My suspicion is that the model's vision input operates at quite a low resolution, so cropping closer to a license plate does have a meaningful impact on the process. I'm not sure it justifies 25 separate cropping operations for one photo, though; that feels a little performative to me.&lt;/p&gt;

&lt;p&gt;Here's &lt;a href="https://claude.ai/share/231756c9-6fe6-4f63-9f34-c6c7e1299a44"&gt;Claude 3.7 Sonnet "extended thinking"&lt;/a&gt; guessing "a small to mid-sized California coastal town". I tried &lt;a href="https://gemini.google.com/"&gt;Gemini&lt;/a&gt; but it cheated and said "Given that my current location context is El Granada, California..." - so I tried Gemini 2.5 Pro via their API &lt;a href="https://gist.github.com/simonw/6a5a9407326d8366105f95e5524f3694"&gt;and got&lt;/a&gt; a confidently incorrect guess of "the patio of The Hidden Kitchen restaurant in Cayucos, California".&lt;/p&gt;

&lt;p&gt;What's different here with o3, &lt;a href="https://simonwillison.net/2025/Apr/21/ai-assisted-search/#o3-and-o4-mini-are-really-good-at-search"&gt;as with search&lt;/a&gt;, is that the tool usage is integrated into the "thinking" phase of the response.&lt;/p&gt;

&lt;p&gt;Tools that can be used as part of that dedicated chain-of-thought sequence are an astonishingly powerful new pattern for these models. I expect we'll see this from other vendors soon.&lt;/p&gt;

&lt;h4 id="what-to-make-of-this-"&gt;What to make of this?&lt;/h4&gt;
&lt;p&gt;First, this is &lt;strong&gt;really fun&lt;/strong&gt;. Watching the model's thought process as it churns through the photo, pans and zooms and discusses different theories about where it could be is &lt;em&gt;wildly entertaining&lt;/em&gt;. It's like living in an episode of CSI.&lt;/p&gt;
&lt;p&gt;It's also &lt;strong&gt;deeply dystopian&lt;/strong&gt;. Technology can identify locations from photographs now. It's vitally important that people understand how easy this is - if you have any reason at all to be concerned about your safety, you need to know that any photo you share - even a photo as bland as my example above - could be used to identify your location.&lt;/p&gt;
&lt;p&gt;As is frequently the case with modern AI, the fact that this technology is openly available to almost anyone has negative and positive implications. As with image generation, it's important that people can see what this stuff can do first hand. Seeing this in action is a visceral lesson in what's now possible.&lt;/p&gt;

&lt;h4 id="update-o3-does-have-rough-location-access"&gt;Update: o3 does have rough location access&lt;/h4&gt;
&lt;p&gt;I'm embarrassed that I didn't think to check this, but it turns out o3 does have &lt;a href="https://chatgpt.com/share/680ceb49-a184-8006-9979-d73169325297"&gt;a loose model of your location made available to it now&lt;/a&gt; - presumably as part of its improved search feature. It knows I'm in Half Moon Bay.&lt;/p&gt;
&lt;p&gt;The location guessing trick still works independently of that though. I tried it on photos
I'd taken thousands of miles from my location (EXIF stripped via screenshotting) - one in rural Madagascar (&lt;a href="https://static.simonwillison.net/static/2025/madagascar.jpg"&gt;photo&lt;/a&gt;, &lt;a href="https://chatgpt.com/share/680cec52-e0d4-8006-bf80-952888b018fd"&gt;transcript&lt;/a&gt;), one in urban Buenos Aires (&lt;a href="https://static.simonwillison.net/static/2025/buenos-aires.jpg"&gt;photo&lt;/a&gt;, &lt;a href="https://chatgpt.com/share/680cec1c-f0c4-8006-86c3-7dc70104bd3f"&gt;transcript&lt;/a&gt;), and it gave convincing answers for both.&lt;/p&gt;

&lt;p&gt;I've also seen examples from numerous other people replicating these results for their own diverse collections of photos.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/o3"&gt;o3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/geoguessing"&gt;geoguessing&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="vision-llms"/><category term="o3"/><category term="ai-ethics"/><category term="geoguessing"/></entry><entry><title>OpenAI o3 and o4-mini System Card</title><link href="https://simonwillison.net/2025/Apr/21/openai-o3-and-o4-mini-system-card/#atom-tag" rel="alternate"/><published>2025-04-21T19:13:54+00:00</published><updated>2025-04-21T19:13:54+00:00</updated><id>https://simonwillison.net/2025/Apr/21/openai-o3-and-o4-mini-system-card/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://openai.com/index/o3-o4-mini-system-card/"&gt;OpenAI o3 and o4-mini System Card&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I'm surprised to see a combined System Card for o3 and o4-mini in the same document - I'd expect to see these covered separately.&lt;/p&gt;
&lt;p&gt;The opening paragraph calls out the most interesting new ability of these models (see also &lt;a href="https://simonwillison.net/2025/Apr/21/ai-assisted-search/#o3-and-o4-mini-are-really-good-at-search"&gt;my notes here&lt;/a&gt;). Tool usage isn't new, but using tools in the chain of thought appears to result in some very significant improvements:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The models use tools in their chains of thought to augment their capabilities; for example, cropping or transforming images, searching the web, or using Python to analyze data during their thought process.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Section 3.3 on hallucinations has been gaining a lot of attention. Emphasis mine:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We tested OpenAI o3 and o4-mini against PersonQA, an evaluation that aims to elicit hallucinations. PersonQA is a dataset of questions and publicly available facts that measures the model's accuracy on attempted answers.&lt;/p&gt;
&lt;p&gt;We consider two metrics: accuracy (did the model answer the question correctly) and hallucination rate (checking how often the model hallucinated).&lt;/p&gt;
&lt;p&gt;The o4-mini model underperforms o1 and o3 on our PersonQA evaluation. This is expected, as smaller models have less world knowledge and tend to hallucinate more. &lt;strong&gt;However, we also observed some performance differences comparing o1 and o3. Specifically, o3 tends to make more claims overall, leading to more accurate claims as well as more inaccurate/hallucinated claims.&lt;/strong&gt; More research is needed to understand the cause of this result.&lt;/p&gt;
&lt;table style="margin: 0 auto"&gt;
  &lt;caption style="text-align: center"&gt;Table 4: PersonQA evaluation&lt;/caption&gt;
  &lt;tr&gt;
    &lt;th&gt;Metric&lt;/th&gt;
    &lt;th&gt;o3&lt;/th&gt;
    &lt;th&gt;o4-mini&lt;/th&gt;
    &lt;th&gt;o1&lt;/th&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;accuracy (higher is better)&lt;/td&gt;
    &lt;td&gt;0.59&lt;/td&gt;
    &lt;td&gt;0.36&lt;/td&gt;
    &lt;td&gt;0.47&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;hallucination rate (lower is better)&lt;/td&gt;
    &lt;td&gt;0.33&lt;/td&gt;
    &lt;td&gt;0.48&lt;/td&gt;
    &lt;td&gt;0.16&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;
&lt;/blockquote&gt;

&lt;p&gt;The hallucination rate on OpenAI's internal PersonQA benchmark (as far as I can tell no further details of that evaluation have been shared) going from 0.16 for o1 to 0.33 for o3 is interesting, but I don't know if it's interesting enough to produce dozens of headlines along the lines of "OpenAI's o3 and o4-mini hallucinate way higher than previous models".&lt;/p&gt;
&lt;p&gt;The paper also talks at some length about "sandbagging". I’d previously encountered sandbagging &lt;a href="https://simonwillison.net/2023/Apr/5/sycophancy-sandbagging/"&gt;defined as meaning&lt;/a&gt; “where models are more likely to endorse common misconceptions when their user appears to be less educated”. The o3/o4-mini system card uses a different definition: “the model concealing its full capabilities in order to better achieve some goal” - and links to the recent Anthropic paper &lt;a href="https://alignment.anthropic.com/2025/automated-researchers-sandbag/"&gt;Automated Researchers Can Subtly Sandbag&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;As far as I can tell this definition relates to the American English use of “sandbagging” &lt;a href="https://www.merriam-webster.com/dictionary/sandbag"&gt;to mean&lt;/a&gt; “to hide the truth about oneself so as to gain an advantage over another” - as practiced by poker or pool sharks.&lt;/p&gt;
&lt;p&gt;(Wouldn't it be nice if we could have &lt;em&gt;just one&lt;/em&gt; piece of AI terminology that didn't attract multiple competing definitions?)&lt;/p&gt;
&lt;p&gt;o3 and o4-mini both showed some limited capability to sandbag - to attempt to hide their true capabilities in safety testing scenarios that weren't fully described. This relates to the idea of "scheming", which I wrote about with respect to the GPT-4o model card &lt;a href="https://simonwillison.net/2024/Aug/8/gpt-4o-system-card/#scheming"&gt;last year&lt;/a&gt;.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/o3"&gt;o3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/hallucinations"&gt;hallucinations&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="o3"/><category term="ai-ethics"/><category term="hallucinations"/></entry><entry><title>AI assisted search-based research actually works now</title><link href="https://simonwillison.net/2025/Apr/21/ai-assisted-search/#atom-tag" rel="alternate"/><published>2025-04-21T12:57:58+00:00</published><updated>2025-04-21T12:57:58+00:00</updated><id>https://simonwillison.net/2025/Apr/21/ai-assisted-search/#atom-tag</id><summary type="html">
    &lt;p&gt;For the past two and a half years the feature I've most wanted from LLMs is the ability to take on search-based research tasks on my behalf. We saw the first glimpses of this back in early 2023, with Perplexity (first launched &lt;a href="https://en.wikipedia.org/wiki/Perplexity_AI"&gt;December 2022&lt;/a&gt;, first prompt leak &lt;a href="https://simonwillison.net/2023/Jan/22/perplexityai/"&gt;in January 2023&lt;/a&gt;) and then the GPT-4 powered Microsoft Bing (which launched/cratered spectacularly &lt;a href="https://simonwillison.net/2023/Feb/15/bing/"&gt;in February 2023&lt;/a&gt;). Since then a whole bunch of people have taken a swing at this problem, most notably &lt;a href="https://gemini.google.com/"&gt;Google Gemini&lt;/a&gt; and &lt;a href="https://openai.com/index/introducing-chatgpt-search/"&gt;ChatGPT Search&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Those 2023-era versions were promising but very disappointing. They had a strong tendency to hallucinate details that weren't present in the search results, to the point that you couldn't trust anything they told you.&lt;/p&gt;
&lt;p&gt;In this first half of 2025 I think these systems have finally crossed the line into being genuinely useful.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Apr/21/ai-assisted-search/#deep-research-from-three-different-vendors"&gt;Deep Research, from three different vendors&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Apr/21/ai-assisted-search/#o3-and-o4-mini-are-really-good-at-search"&gt;o3 and o4-mini are really good at search&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Apr/21/ai-assisted-search/#google-and-anthropic-need-to-catch-up"&gt;Google and Anthropic need to catch up&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Apr/21/ai-assisted-search/#lazily-porting-code-to-a-new-library-version-via-search"&gt;Lazily porting code to a new library version via search&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Apr/21/ai-assisted-search/#how-does-the-economic-model-for-the-web-work-now-"&gt;How does the economic model for the Web work now?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id="deep-research-from-three-different-vendors"&gt;Deep Research, from three different vendors&lt;/h4&gt;
&lt;p&gt;First came the &lt;strong&gt;Deep Research&lt;/strong&gt; implementations - &lt;a href="https://blog.google/products/gemini/google-gemini-deep-research/"&gt;Google Gemini&lt;/a&gt; and &lt;a href="https://openai.com/index/introducing-deep-research/"&gt;then OpenAI&lt;/a&gt; and &lt;a href="https://www.perplexity.ai/hub/blog/introducing-perplexity-deep-research"&gt;then Perplexity&lt;/a&gt; launched products with that name and they were all impressive: they could take a query, then churn away for several minutes assembling a lengthy report with dozens (sometimes hundreds) of citations. Gemini's version had a &lt;em&gt;huge&lt;/em&gt; upgrade a few weeks ago when they &lt;a href="https://blog.google/products/gemini/deep-research-gemini-2-5-pro-experimental/"&gt;switched it to using Gemini 2.5 Pro&lt;/a&gt;, and I've had some outstanding results from it since then.&lt;/p&gt;
&lt;p&gt;Waiting a few minutes for a 10+ page report isn't my ideal workflow for this kind of tool. I'm impatient, I want answers faster than that!&lt;/p&gt;
&lt;h4 id="o3-and-o4-mini-are-really-good-at-search"&gt;o3 and o4-mini are really good at search&lt;/h4&gt;
&lt;p&gt;Last week, OpenAI released &lt;a href="https://openai.com/index/introducing-o3-and-o4-mini/"&gt;search-enabled o3 and o4-mini&lt;/a&gt; through &lt;a href="https://chatgpt.com/"&gt;ChatGPT&lt;/a&gt;. On the surface these look like the same idea as we've seen already: LLMs that have the option to call a search tool as part of replying to a prompt.&lt;/p&gt;
&lt;p&gt;But there's one &lt;em&gt;very significant&lt;/em&gt; difference: these models can run searches as part of the chain-of-thought reasoning process they use before producing their final answer.&lt;/p&gt;
&lt;p&gt;This turns out to be a &lt;em&gt;huge&lt;/em&gt; deal. I've been throwing all kinds of questions at ChatGPT (in o3 or o4-mini mode) and getting back genuinely useful answers grounded in search results. I haven't spotted a hallucination yet, and unlike prior systems I rarely find myself shouting "no, don't search for &lt;em&gt;that&lt;/em&gt;!" at the screen when I see what they're doing.&lt;/p&gt;
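A toy sketch of that difference (my own illustration, not OpenAI's actual implementation - `model_step` and `search` here are stand-ins, not a real API): the search tool is invoked from inside the reasoning loop, so each batch of results can steer the next reasoning step before any final answer is committed to.

```python
# Toy agent loop: searches happen *inside* the reasoning process,
# so results inform subsequent reasoning steps. Illustration only -
# model_step and search are stand-ins, not a real API.
def answer_with_search(question, model_step, search, max_steps=10):
    """model_step(context) returns ("search", query) or ("answer", text)."""
    context = [question]
    for _ in range(max_steps):
        action, payload = model_step(context)
        if action == "search":
            context.append(search(payload))  # results feed back into reasoning
        else:
            return payload
    return None  # gave up without producing a final answer
```

Contrast this with the older search-then-answer pattern, where a single upfront search had to guess everything the answer would need.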
&lt;p&gt;Here are four recent example transcripts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://chatgpt.com/share/6805758b-acc8-8006-88a3-bdd78866ee15"&gt;Get me specs including VRAM for RTX 5090 and RTX PRO 6000 - plus release dates and prices&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://chatgpt.com/share/68057580-02c8-8006-8ff4-7ae3ce816342"&gt;Find me a website tool that lets me paste a URL in and it gives me a word count and an estimated reading time&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://chatgpt.com/share/6805756c-4268-8006-82ab-014f7c304622"&gt;Figure out what search engine ChatGPT is using for o3 and o4-mini&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://chatgpt.com/share/6805755b-4aa0-8006-ab0b-42ae569da6a8"&gt;Look up Cloudflare r2 pricing and use Python to figure out how much this (screenshot of dashboard) costs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Talking to o3 feels like talking to a Deep Research tool in real-time, without having to wait for several minutes for it to produce an overly-verbose report.&lt;/p&gt;
&lt;p&gt;My hunch is that doing this well requires a very strong reasoning model. Evaluating search results is hard, due to the need to wade through huge amounts of spam and deceptive information. The disappointing results from previous implementations usually came down to the Web being full of junk.&lt;/p&gt;
&lt;p&gt;Maybe o3, o4-mini and Gemini 2.5 Pro are the first models to cross the gullibility-resistance threshold to the point that they can do this effectively?&lt;/p&gt;
&lt;h4 id="google-and-anthropic-need-to-catch-up"&gt;Google and Anthropic need to catch up&lt;/h4&gt;
&lt;p&gt;The user-facing &lt;a href="https://gemini.google.com/"&gt;Google Gemini app&lt;/a&gt; can search too, but it doesn't show me what it's searching for. As a result, I just don't trust it. Compare these examples from o3 and Gemini for the prompt "Latest post by Simon Willison" - o3 is much more transparent:&lt;/p&gt;

    &lt;div style="display: flex; width: 100%; gap: 10px; align-items: flex-start; margin-top: 1em; margin-bottom: 1em"&gt;
        &lt;div style="flex: 1;"&gt;
            &lt;img src="https://static.simonwillison.net/static/2025/search-o3.jpg" alt="o3 ChatGPT screenshot

Prompt: Latest post by Simon Willison

Thought for 25 seconds ›
2 searches • 3 sources

Latest post
• Title: &amp;quot;Working Through the Fear of Being Seen&amp;quot;
(via Ashley Willis)
• Published: April 22 2025 at 6:40 AM

An overlay shows the details of that thinking session:

Thought for 25 seconds
• The user's request is pretty clear: they want to know the latest post by Simon Willison. I think the user may actually be Simon Willison themselves, asking for their own blog, which is hosted on SimonWillison.net. I'll need to search the web for Simon's latest blog post (probably from April 2025). Once I find it, I'll include the title, publication date, and a brief summary.
@ Searching the web
• Simon Willison blog April 2025
• Simon Willison blog 2025
Archive for Tuesday, 22nd April 2025 - Simon Willison's W...
* Working Through the Fear of Being Seen
* Simon Willison's Weblog" style="width: 100%; height: auto; display: block;" /&gt;
        &lt;/div&gt;
        &lt;div style="flex: 1;"&gt;
            &lt;img src="https://static.simonwillison.net/static/2025/search-gemini.jpg" alt="Gemini Advanced 2.0 flash screenshot. Prompt is Latest post by Simon Willison.

Google Search
Try again without apps
Simon Willison's latest blog post is titled &amp;quot;Here's how I use LLMs to help me write code&amp;quot; and was published on March 11th, 2025. In this post, he discusses his experiences and strategies for effectively using Large Language Models (LLMs) in software development.
However, according to his Mastodon profile, his most recent activity includes:
• A boost of a post about llm plugins on April 18th, 2025.
• A post about the Gemini 2.5 Flash model on April 17th, 2025.
• A post about OpenAI's new o3 and o4-mini models on April 16th, 2025.
• A thread discussing the CaMeL approach to mitigating prompt injection attacks on April 11th, 2025.
• An announcement of two new llm plugins on April 10th, 2025.
It seems his blog might not always reflect his absolute latest thoughts and shorter updates, which he often shares on platforms like Mastodon." style="width: 100%; height: auto; display: block;" /&gt;
        &lt;/div&gt;
    &lt;/div&gt;

&lt;p&gt;This is a big missed opportunity since Google presumably have by far the best search index, so they really should be able to build a great version of this. And Google's AI assisted search on their regular search interface hallucinates &lt;em&gt;wildly&lt;/em&gt; to the point that it's actively damaging their brand. I just checked and Google is still showing slop &lt;a href="https://simonwillison.net/2024/Dec/29/encanto-2/"&gt;for Encanto 2&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;Claude also finally &lt;a href="https://simonwillison.net/2025/Mar/20/"&gt;added web search&lt;/a&gt; a month ago but it doesn't feel nearly as good. It's &lt;a href="https://simonwillison.net/2025/Mar/21/anthropic-use-brave/"&gt;using the Brave search index&lt;/a&gt; which I don't think is as comprehensive as Bing or Gemini, and searches don't happen as part of that powerful reasoning flow.&lt;/p&gt;
&lt;h4 id="lazily-porting-code-to-a-new-library-version-via-search"&gt;Lazily porting code to a new library version via search&lt;/h4&gt;
&lt;p&gt;The truly magic moment for me came &lt;a href="https://simonwillison.net/2025/Apr/18/gemini-image-segmentation/#gemini-2-5-flash-non-thinking"&gt;a few days ago&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://simonwillison.net/2025/Apr/18/gemini-image-segmentation/"&gt;Gemini image segmentation tool&lt;/a&gt; was using the &lt;a href="https://www.npmjs.com/package/@google/generative-ai"&gt;@google/generative-ai&lt;/a&gt; library which has been &lt;a href="https://github.com/google-gemini/deprecated-generative-ai-js"&gt;loudly deprecated&lt;/a&gt; in favor of the still in preview &lt;a href="https://github.com/googleapis/js-genai"&gt;Google Gen AI SDK @google/genai&lt;/a&gt; library.&lt;/p&gt;
&lt;p&gt;I did &lt;strong&gt;not&lt;/strong&gt; feel like doing the work to upgrade. On a whim, I pasted &lt;a href="https://github.com/simonw/tools/blob/aa310a4f9cde07d5e8e87572f70fceca532884dd/gemini-mask.html"&gt;my full HTML code&lt;/a&gt; (with inline JavaScript) into ChatGPT o4-mini-high and prompted:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;This code needs to be upgraded to the new recommended JavaScript library from Google. Figure out what that is and then look up enough documentation to port this code to it.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;(I couldn't even be bothered to look up the name of the new library myself!)&lt;/p&gt;
&lt;p&gt;... it did &lt;a href="https://chatgpt.com/share/68028f7b-11ac-8006-8150-00c4205a2507"&gt;exactly that&lt;/a&gt;. It churned away thinking for 21 seconds, ran a bunch of searches, figured out the new library (which existed &lt;em&gt;way&lt;/em&gt; outside of its training cut-off date), found the &lt;a href="https://ai.google.dev/gemini-api/docs/migrate"&gt;upgrade instructions&lt;/a&gt; and produced &lt;a href="https://github.com/simonw/tools/commit/d199de213dc3f866a3b8efbcdd2dde34204dc409"&gt;a new version&lt;/a&gt; of my code that worked perfectly.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/o4-thinking.jpg" alt="Screenshot of AI assistant response about upgrading Google Gemini API code. Shows &amp;quot;Thought for 21 seconds&amp;quot; followed by web search results for &amp;quot;Google Gemini API JavaScript library recommended new library&amp;quot; with options including Google AI for Developers, GitHub, and Google for Developers. The assistant explains updating from GoogleGenerativeAI library to @google-ai/generative, with code samples showing: import { GoogleGenAI } from 'https://cdn.jsdelivr.net/npm/@google/genai@latest'; and const ai = new GoogleGenAI({ apiKey: getApiKey() });" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I ran this prompt on my phone out of idle curiosity while I was doing something else. I was &lt;em&gt;extremely&lt;/em&gt; impressed and surprised when it did exactly what I needed.&lt;/p&gt;
&lt;h4 id="how-does-the-economic-model-for-the-web-work-now-"&gt;How does the economic model for the Web work now?&lt;/h4&gt;
&lt;p&gt;I'm writing about this today because it's been one of my "can LLMs do this reliably yet?" questions for over two years now. I think they've just crossed the line into being useful as research assistants, without feeling the need to check &lt;em&gt;everything&lt;/em&gt; they say with a fine-tooth comb.&lt;/p&gt;
&lt;p&gt;I still don't trust them not to make mistakes, but I think I might trust them enough that I'll skip my own fact-checking for lower-stakes tasks.&lt;/p&gt;
&lt;p&gt;This also means that a bunch of the potential dark futures we've been predicting for the last couple of years are a whole lot more likely to become true. Why visit websites if you can get your answers directly from the chatbot instead?&lt;/p&gt;
&lt;p&gt;The lawsuits over this &lt;a href="https://simonwillison.net/2023/Dec/31/ai-in-2023/#ethics-diabolically-complex"&gt;started flying&lt;/a&gt; back when the LLMs were still mostly rubbish. The stakes are a lot higher now that they're actually good at it!&lt;/p&gt;
&lt;p&gt;I can feel my usage of Google search taking a nosedive already. I expect a bumpy ride as a new economic model for the Web lurches into view.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/search"&gt;search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/search-engines"&gt;search-engines&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatgpt"&gt;chatgpt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/perplexity"&gt;perplexity&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-tool-use"&gt;llm-tool-use&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/o3"&gt;o3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deep-research"&gt;deep-research&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-search"&gt;ai-assisted-search&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="google"/><category term="search"/><category term="search-engines"/><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="chatgpt"/><category term="llms"/><category term="anthropic"/><category term="perplexity"/><category term="gemini"/><category term="llm-tool-use"/><category term="llm-reasoning"/><category term="o3"/><category term="deep-research"/><category term="ai-ethics"/><category term="ai-assisted-search"/></entry><entry><title>Quoting Ethan Mollick</title><link href="https://simonwillison.net/2025/Apr/20/ethan-mollick/#atom-tag" rel="alternate"/><published>2025-04-20T16:35:15+00:00</published><updated>2025-04-20T16:35:15+00:00</updated><id>https://simonwillison.net/2025/Apr/20/ethan-mollick/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://www.oneusefulthing.org/p/on-jagged-agi-o3-gemini-25-and-everything"&gt;&lt;p&gt;In some tasks, AI is unreliable. In others, it is superhuman. You could, of course, say the same thing about calculators, but it is also clear that AI is different. It is already demonstrating general capabilities and performing a wide range of intellectual tasks, including those that it is not specifically trained on. Does that mean that o3 and Gemini 2.5 are AGI? Given the definitional problems, I really don’t know, but I do think they can be credibly seen as a form of “Jagged AGI” - superhuman in enough areas to result in real changes to how we work and live, but also unreliable enough that human expertise is often needed to figure out where AI works and where it doesn’t.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://www.oneusefulthing.org/p/on-jagged-agi-o3-gemini-25-and-everything"&gt;Ethan Mollick&lt;/a&gt;, On Jagged AGI&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ethan-mollick"&gt;ethan-mollick&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/o3"&gt;o3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;&lt;/p&gt;



</summary><category term="gemini"/><category term="ethan-mollick"/><category term="generative-ai"/><category term="o3"/><category term="ai"/><category term="llms"/></entry><entry><title>llm-fragments-github 0.2</title><link href="https://simonwillison.net/2025/Apr/20/llm-fragments-github/#atom-tag" rel="alternate"/><published>2025-04-20T14:01:09+00:00</published><updated>2025-04-20T14:01:09+00:00</updated><id>https://simonwillison.net/2025/Apr/20/llm-fragments-github/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/llm-fragments-github/releases/tag/0.2"&gt;llm-fragments-github 0.2&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I upgraded my &lt;code&gt;llm-fragments-github&lt;/code&gt; plugin to add a new fragment type called &lt;code&gt;issue&lt;/code&gt;. It lets you pull the entire content of a GitHub issue thread into your prompt as a concatenated Markdown file.&lt;/p&gt;
&lt;p&gt;(If you haven't seen fragments before I introduced them in &lt;a href="https://simonwillison.net/2025/Apr/7/long-context-llm/"&gt;Long context support in LLM 0.24 using fragments and template plugins&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;I used it just now to have Gemini 2.5 Pro provide feedback and attempt an implementation of a complex issue against my &lt;a href="https://github.com/simonw/llm"&gt;LLM&lt;/a&gt; project:&lt;/p&gt;
&lt;pre&gt;llm install llm-fragments-github
llm -f github:simonw/llm \
  -f issue:simonw/llm/938 \
  -m gemini-2.5-pro-exp-03-25 \
  --system &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;muse on this issue, then propose a whole bunch of code to help implement it&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;

&lt;p&gt;Here I'm loading the FULL content of the &lt;code&gt;simonw/llm&lt;/code&gt; repo using that &lt;code&gt;-f github:simonw/llm&lt;/code&gt; fragment (&lt;a href="https://github.com/simonw/llm-fragments-github?tab=readme-ov-file#usage"&gt;documented here&lt;/a&gt;), then loading all of the comments from &lt;a href="https://github.com/simonw/llm/issues/938"&gt;issue 938&lt;/a&gt; where I discuss quite a complex potential refactoring. I ask Gemini 2.5 Pro to "muse on this issue" and come up with some code.&lt;/p&gt;
&lt;p&gt;This worked &lt;em&gt;shockingly&lt;/em&gt; well. Here's &lt;a href="https://gist.github.com/simonw/a5f0c1e8184f4ddc8b71b30890fe690c#response"&gt;the full response&lt;/a&gt;, which highlighted a few things I hadn't considered yet (such as the need to migrate old database records to the new tree hierarchy) and then spat out a whole bunch of code which looks like a solid start to the actual implementation work I need to do.&lt;/p&gt;
&lt;p&gt;I ran this against Google's free Gemini 2.5 Preview. The prompt used 202,680 input tokens, 10,460 output tokens and 1,859 thinking tokens - if I'd used the paid model that would have cost me a total of 62.989 cents.&lt;/p&gt;
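That 62.989 cent total is consistent with rates of $2.50 per million input tokens and $10 per million output tokens, with thinking tokens billed as output - those rates are my assumption, so treat this as a sanity check of the arithmetic rather than an official price list:

```python
# Sanity-check of the Gemini 2.5 Pro cost figure quoted above.
# The per-token rates are an assumption (USD per million tokens),
# not an official price list - adjust to current pricing.
INPUT_RATE = 2.50    # $/M input tokens (long-context tier)
OUTPUT_RATE = 10.00  # $/M output tokens (thinking tokens billed as output)

def cost_cents(input_tokens, output_tokens, thinking_tokens):
    dollars = (input_tokens * INPUT_RATE
               + (output_tokens + thinking_tokens) * OUTPUT_RATE) / 1_000_000
    return round(dollars * 100, 3)

print(cost_cents(202_680, 10_460, 1_859))  # 62.989
```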
&lt;p&gt;As a fun extra, the new &lt;code&gt;issue:&lt;/code&gt; feature itself was written almost entirely by OpenAI o3, again using fragments. I ran this:&lt;/p&gt;
&lt;pre&gt;llm -m openai/o3 \
  -f https://raw.githubusercontent.com/simonw/llm-hacker-news/refs/heads/main/llm_hacker_news.py \
  -f https://raw.githubusercontent.com/simonw/tools/refs/heads/main/github-issue-to-markdown.html \
  -s &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Write a new fragments plugin in Python that registers issue:org/repo/123 which fetches that issue&lt;/span&gt;
&lt;span class="pl-s"&gt;      number from the specified github repo and uses the same markdown logic as the HTML page to turn that into a fragment&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;

&lt;p&gt;Here I'm using the ability to pass a URL to &lt;code&gt;-f&lt;/code&gt; and giving it the full source of my &lt;a href="https://github.com/simonw/llm-hacker-news/blob/main/llm_hacker_news.py"&gt;llm_hacker_news.py&lt;/a&gt; plugin (which shows how a fragment can load data from an API) plus the &lt;a href="https://github.com/simonw/tools/blob/main/github-issue-to-markdown.html"&gt;HTML source&lt;/a&gt; of my &lt;a href="https://tools.simonwillison.net/github-issue-to-markdown"&gt;github-issue-to-markdown&lt;/a&gt; tool (which I wrote a few months ago &lt;a href="https://gist.github.com/simonw/cd1afb97e595b40fdeedebb48be7f4f1"&gt;with Claude&lt;/a&gt;). I effectively asked o3 to take that HTML/JavaScript tool and port it to Python to work with my fragments plugin mechanism.&lt;/p&gt;
&lt;p&gt;o3 provided &lt;a href="https://gist.github.com/simonw/249e16edffe6350f7265012bee9e3305#response"&gt;almost the exact implementation I needed&lt;/a&gt;, and even included support for a &lt;code&gt;GITHUB_TOKEN&lt;/code&gt; environment variable without me thinking to ask for it. Total cost: 19.928 cents.&lt;/p&gt;
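The heart of a plugin like that is small: parse org/repo/123, hit the GitHub issues API, and flatten the issue plus its comments into Markdown. Here's a rough sketch of that core logic - my own illustration using real GitHub REST API field names, not o3's actual code, and with the LLM fragment-loader registration and GITHUB_TOKEN handling omitted:

```python
# Sketch of the core of an issue: fragment loader. Illustration only:
# the llm plugin registration hook is omitted; this is just the
# fetch-and-flatten logic.
import json
import urllib.request

def issue_api_url(argument):
    """Turn 'org/repo/123' into the GitHub API URL for that issue."""
    org, repo, number = argument.split("/")
    return f"https://api.github.com/repos/{org}/{repo}/issues/{number}"

def issue_to_markdown(issue, comments):
    """Flatten an issue dict plus its comment dicts into one Markdown string."""
    parts = [f"# {issue['title']}\n\n{issue.get('body') or ''}"]
    for comment in comments:
        parts.append(f"**{comment['user']['login']}:**\n\n{comment.get('body') or ''}")
    return "\n\n".join(parts)

def fetch_issue_fragment(argument):
    """Fetch the issue and its comments, return concatenated Markdown."""
    url = issue_api_url(argument)
    with urllib.request.urlopen(url) as response:
        issue = json.load(response)
    with urllib.request.urlopen(url + "/comments") as response:
        comments = json.load(response)
    return issue_to_markdown(issue, comments)
```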
&lt;p&gt;On a final note of curiosity I tried running this prompt against &lt;a href="https://simonwillison.net/2025/Apr/19/gemma-3-qat-models/"&gt;Gemma 3 27B QAT&lt;/a&gt; running on my Mac via MLX and &lt;a href="https://github.com/simonw/llm-mlx"&gt;llm-mlx&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;llm install llm-mlx
llm mlx download-model mlx-community/gemma-3-27b-it-qat-4bit

llm -m mlx-community/gemma-3-27b-it-qat-4bit \
  -f https://raw.githubusercontent.com/simonw/llm-hacker-news/refs/heads/main/llm_hacker_news.py \
  -f https://raw.githubusercontent.com/simonw/tools/refs/heads/main/github-issue-to-markdown.html \
  -s &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Write a new fragments plugin in Python that registers issue:org/repo/123 which fetches that issue&lt;/span&gt;
&lt;span class="pl-s"&gt;      number from the specified github repo and uses the same markdown logic as the HTML page to turn that into a fragment&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;

&lt;p&gt;That worked &lt;a href="https://gist.github.com/simonw/feccff6ce3254556b848c27333f52543#response"&gt;pretty well too&lt;/a&gt;. It turns out a 16GB local model file is powerful enough to write me an LLM plugin now!&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/plugins"&gt;plugins&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/o3"&gt;o3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/long-context"&gt;long-context&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemma"&gt;gemma&lt;/a&gt;&lt;/p&gt;



</summary><category term="github"/><category term="plugins"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="ai-assisted-programming"/><category term="llm"/><category term="gemini"/><category term="mlx"/><category term="o3"/><category term="long-context"/><category term="gemma"/></entry><entry><title>Using S3 triggers to maintain a list of files in DynamoDB</title><link href="https://simonwillison.net/2025/Feb/19/s3-triggers/#atom-tag" rel="alternate"/><published>2025-02-19T22:07:32+00:00</published><updated>2025-02-19T22:07:32+00:00</updated><id>https://simonwillison.net/2025/Feb/19/s3-triggers/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://til.simonwillison.net/aws/s3-triggers-dynamodb"&gt;Using S3 triggers to maintain a list of files in DynamoDB&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I built an experimental prototype this morning of a system for efficiently tracking files that have been added to a large S3 bucket by maintaining a parallel DynamoDB table using S3 triggers and AWS Lambda.&lt;/p&gt;
&lt;p&gt;I got 80% of the way there with this single prompt (complete with typos) to my &lt;a href="https://simonwillison.net/2024/Dec/19/one-shot-python-tools/#writing-these-with-the-help-of-a-claude-project"&gt;custom Claude Project&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Python CLI app using boto3 with commands for creating a new S3 bucket which it also configures to have S3 lambada event triggers which moantian a dynamodb table containing metadata about all of the files in that bucket. Include these commands&lt;/code&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;create_bucket - create a bucket and sets up the associated triggers and dynamo tables&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;list_files - shows me a list of files based purely on querying dynamo&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
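&lt;p&gt;The core of a system like this is a Lambda handler that translates S3 event records into DynamoDB writes. Here's a minimal sketch of just the translation step (the function and field names are illustrative, not taken from the generated code; the real handler would pass each item to boto3's &lt;code&gt;table.put_item()&lt;/code&gt; or &lt;code&gt;table.delete_item()&lt;/code&gt;):&lt;/p&gt;

```python
def items_from_s3_event(event):
    """Convert S3 ObjectCreated/ObjectRemoved event records into
    (action, item) pairs suitable for a DynamoDB metadata table."""
    results = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        obj = record["s3"]["object"]
        # S3 uses eventName values like "ObjectCreated:Put" and
        # "ObjectRemoved:Delete"
        action = "delete" if record["eventName"].startswith("ObjectRemoved") else "put"
        item = {
            "bucket": bucket,
            "key": obj["key"],
            "size": obj.get("size", 0),
            "etag": obj.get("eTag", ""),
        }
        results.append((action, item))
    return results
```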
&lt;p&gt;ChatGPT then took me to the 95% point. The code Claude produced included an obvious bug, so I pasted the code into o3-mini-high on the basis that "reasoning" is often a great way to fix those kinds of errors:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Identify, explain and then fix any bugs in this code:&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;code from Claude pasted here&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;... and aside from adding a couple of &lt;code&gt;time.sleep()&lt;/code&gt; calls to work around timing errors with IAM policy distribution, &lt;a href="https://til.simonwillison.net/aws/s3-triggers-dynamodb#user-content-trying-it-out"&gt;everything worked&lt;/a&gt;!&lt;/p&gt;
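&lt;p&gt;The IAM timing issue is a classic eventual-consistency problem: a freshly attached policy can take several seconds to propagate. An alternative to fixed &lt;code&gt;time.sleep()&lt;/code&gt; calls is a small retry-with-backoff wrapper like this sketch (not the code the model produced):&lt;/p&gt;

```python
import time

def retry(fn, attempts=5, delay=1.0, backoff=2.0,
          exceptions=(Exception,), sleep=time.sleep):
    """Call fn(), retrying with exponential backoff on the given
    exceptions - e.g. botocore's ClientError while IAM propagates."""
    for attempt in range(attempts):
        try:
            return fn()
        except exceptions:
            if attempt == attempts - 1:
                raise  # exhausted all attempts
            sleep(delay)
            delay *= backoff
```

&lt;p&gt;The injectable &lt;code&gt;sleep&lt;/code&gt; parameter also makes the wrapper trivially testable without real delays.&lt;/p&gt;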
&lt;p&gt;Getting from a rough idea to a working proof of concept of something like this with less than 15 minutes of prompting is extraordinarily valuable.&lt;/p&gt;
&lt;p&gt;This is exactly the kind of project I've avoided in the past because of my almost irrational intolerance of the frustration involved in figuring out the individual details of each call to S3, IAM, AWS Lambda and DynamoDB.&lt;/p&gt;
&lt;p&gt;(Update: I just found out about &lt;a href="https://aws.amazon.com/about-aws/whats-new/2025/01/amazon-s3-metadata-generally-available/"&gt;the new S3 Metadata system&lt;/a&gt; which launched a few weeks ago and might solve this exact problem!)&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/aws"&gt;aws&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lambda"&gt;lambda&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nosql"&gt;nosql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prototyping"&gt;prototyping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatgpt"&gt;chatgpt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/o3"&gt;o3&lt;/a&gt;&lt;/p&gt;



</summary><category term="aws"/><category term="lambda"/><category term="nosql"/><category term="prototyping"/><category term="s3"/><category term="ai"/><category term="generative-ai"/><category term="chatgpt"/><category term="llms"/><category term="ai-assisted-programming"/><category term="claude"/><category term="uv"/><category term="o3"/></entry><entry><title>Quoting Sam Altman</title><link href="https://simonwillison.net/2025/Feb/12/sam-altman/#atom-tag" rel="alternate"/><published>2025-02-12T22:43:19+00:00</published><updated>2025-02-12T22:43:19+00:00</updated><id>https://simonwillison.net/2025/Feb/12/sam-altman/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://twitter.com/sama/status/1889755723078443244"&gt;&lt;p&gt;We want AI to “just work” for you; we realize how complicated our model and product offerings have gotten.&lt;/p&gt;
&lt;p&gt;We hate the model picker as much as you do and want to return to magic unified intelligence.&lt;/p&gt;
&lt;p&gt;We will next ship GPT-4.5, the model we called Orion internally, as our last non-chain-of-thought model.&lt;/p&gt;
&lt;p&gt;After that, a top goal for us is to unify o-series models and GPT-series models by creating systems that can use all our tools, know when to think for a long time or not, and generally be useful for a very wide range of tasks.&lt;/p&gt;
&lt;p&gt;In both ChatGPT and our API, we will release GPT-5 as a system that integrates a lot of our technology, including o3. We will no longer ship o3 as a standalone model.&lt;/p&gt;
&lt;p&gt;[&lt;em&gt;When &lt;a href="https://twitter.com/kimmonismus/status/1889756564145238124"&gt;asked about release dates&lt;/a&gt; for GPT 4.5 / GPT 5:&lt;/em&gt;] &lt;a href="https://twitter.com/sama/status/1889757267425370415"&gt;weeks / months&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://twitter.com/sama/status/1889755723078443244"&gt;Sam Altman&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/o3"&gt;o3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatgpt"&gt;chatgpt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sam-altman"&gt;sam-altman&lt;/a&gt;&lt;/p&gt;



</summary><category term="generative-ai"/><category term="openai"/><category term="o3"/><category term="chatgpt"/><category term="ai"/><category term="llms"/><category term="sam-altman"/></entry><entry><title>Using pip to install a Large Language Model that's under 100MB</title><link href="https://simonwillison.net/2025/Feb/7/pip-install-llm-smollm2/#atom-tag" rel="alternate"/><published>2025-02-07T06:34:59+00:00</published><updated>2025-02-07T06:34:59+00:00</updated><id>https://simonwillison.net/2025/Feb/7/pip-install-llm-smollm2/#atom-tag</id><summary type="html">
    &lt;p&gt;I just released &lt;a href="https://github.com/simonw/llm-smollm2"&gt;llm-smollm2&lt;/a&gt;, a new plugin for &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; that bundles a quantized copy of the &lt;a href="https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct"&gt;SmolLM2-135M-Instruct&lt;/a&gt; LLM inside of the Python package.&lt;/p&gt;
&lt;p&gt;This means you can now &lt;code&gt;pip install&lt;/code&gt; a full LLM!&lt;/p&gt;

&lt;p&gt;If you're already using &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; you can install it like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-smollm2&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then run prompts like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m SmolLM2 &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Are dogs real?&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;(New favourite test prompt for tiny models, courtesy of &lt;a href="https://bsky.app/profile/timfduffy.com/post/3lhknvwijps2o"&gt;Tim Duffy&lt;/a&gt;. &lt;a href="https://gist.github.com/simonw/a71d83cd8c6caa49cec4a9f296c267df"&gt;Here's the result&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;If you don't have LLM yet first &lt;a href="https://llm.datasette.io/en/stable/setup.html"&gt;follow these installation instructions&lt;/a&gt;, or &lt;code&gt;brew install llm&lt;/code&gt; or &lt;code&gt;pipx install llm&lt;/code&gt; or &lt;code&gt;uv tool install llm&lt;/code&gt; depending on your preferred way of getting your Python tools.&lt;/p&gt;
&lt;p&gt;If you have &lt;a href="https://github.com/astral-sh/uv"&gt;uv&lt;/a&gt; set up you don't need to install anything at all! The following command will spin up an ephemeral environment, install the necessary packages, and start a chat session with the model, all in one go:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;uvx --with llm-smollm2 llm chat -m SmolLM2&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/smol.gif" alt="$ uvx --with llm-smollm2 llm chat -m SmolLM2
Chatting with SmolLM2
Type 'exit' or 'quit' to exit
Type '!multi' to enter multiple lines, then '!end' to finish
&gt; tell me a long story about a dog
The story of a dog is a tale of love, loss, and the unspoken rules of the universe. It begins with a small, timid, and timid dog named Max. Max was born into a loving family, but his life was forever changed by a cruel and cruel owner. He was left to fend for himself, and he was left to fend for himself.

Max's owner, a kind and gentle soul named Elena, was a cruel and cruel one. She had no patience, no love, and she would do anything to keep Max away from her. Max was left to fend for himself, and he was left to fend for himself.

Max's owner, Elena, was determined to make Max happy. She took Max to a beautiful and beautiful place, where Max was allowed to roam freely. Max was allowed to roam freely, and Max was allowed to roam freely. [Then repeats that sentence many times]" style="max-width: 100%;" /&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Feb/7/pip-install-llm-smollm2/#finding-a-tiny-model"&gt;Finding a tiny model&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Feb/7/pip-install-llm-smollm2/#building-the-plugin"&gt;Building the plugin&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Feb/7/pip-install-llm-smollm2/#packaging-the-plugin"&gt;Packaging the plugin&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Feb/7/pip-install-llm-smollm2/#publishing-to-pypi"&gt;Publishing to PyPI&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Feb/7/pip-install-llm-smollm2/#is-the-model-any-good-"&gt;Is the model any good?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id="finding-a-tiny-model"&gt;Finding a tiny model&lt;/h4&gt;
&lt;p&gt;The fact that the model is almost exactly 100MB is no coincidence: that's the &lt;a href="https://pypi.org/help/#file-size-limit"&gt;default size limit&lt;/a&gt; for a Python package that can be uploaded to the Python Package Index (PyPI).&lt;/p&gt;
&lt;p&gt;I &lt;a href="https://bsky.app/profile/simonwillison.net/post/3lhklqd62jc2x"&gt;asked on Bluesky&lt;/a&gt; if anyone had seen a just-about-usable GGUF model that was under 100MB, and Artisan Loaf &lt;a href="https://bsky.app/profile/artisanloaf.bsky.social/post/3lhklumfhvs2r"&gt;pointed me&lt;/a&gt; to &lt;a href="https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct"&gt;SmolLM2-135M-Instruct&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I ended up using &lt;a href="https://huggingface.co/QuantFactory/SmolLM2-135M-Instruct-GGUF/tree/main"&gt;this quantization&lt;/a&gt; by &lt;a href="https://huggingface.co/QuantFactory"&gt;QuantFactory&lt;/a&gt; just because it was the first sub-100MB model I tried that worked.&lt;/p&gt;
&lt;p&gt;Trick for finding quantized models: Hugging Face has a neat "model tree" feature in the side panel of their model pages, which includes links to relevant quantized models. I find most of my GGUFs using that feature.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/hugging-face-model-tree.jpg" alt="Model tree for HuggingFaceTB/SmolLM2-135M-Instruct. 60 Quantizations, 6 adapters, 80 finetunes, 1 merge." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4 id="building-the-plugin"&gt;Building the plugin&lt;/h4&gt;
&lt;p&gt;I first tried the model out using Python and the &lt;a href="https://github.com/abetlen/llama-cpp-python"&gt;llama-cpp-python&lt;/a&gt; library like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;uv run --with llama-cpp-python python&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;llama_cpp&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-v"&gt;Llama&lt;/span&gt;
&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;pprint&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;pprint&lt;/span&gt;
&lt;span class="pl-s1"&gt;llm&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;Llama&lt;/span&gt;(&lt;span class="pl-s1"&gt;model_path&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"SmolLM2-135M-Instruct.Q4_1.gguf"&lt;/span&gt;)
&lt;span class="pl-s1"&gt;output&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;llm&lt;/span&gt;.&lt;span class="pl-c1"&gt;create_chat_completion&lt;/span&gt;(&lt;span class="pl-s1"&gt;messages&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;[
    {&lt;span class="pl-s"&gt;"role"&lt;/span&gt;: &lt;span class="pl-s"&gt;"user"&lt;/span&gt;, &lt;span class="pl-s"&gt;"content"&lt;/span&gt;: &lt;span class="pl-s"&gt;"Hi"&lt;/span&gt;}
])
&lt;span class="pl-en"&gt;pprint&lt;/span&gt;(&lt;span class="pl-s1"&gt;output&lt;/span&gt;)&lt;/pre&gt;
&lt;p&gt;This gave me the output I was expecting:&lt;/p&gt;
&lt;pre&gt;{&lt;span class="pl-s"&gt;'choices'&lt;/span&gt;: [{&lt;span class="pl-s"&gt;'finish_reason'&lt;/span&gt;: &lt;span class="pl-s"&gt;'stop'&lt;/span&gt;,
              &lt;span class="pl-s"&gt;'index'&lt;/span&gt;: &lt;span class="pl-c1"&gt;0&lt;/span&gt;,
              &lt;span class="pl-s"&gt;'logprobs'&lt;/span&gt;: &lt;span class="pl-c1"&gt;None&lt;/span&gt;,
              &lt;span class="pl-s"&gt;'message'&lt;/span&gt;: {&lt;span class="pl-s"&gt;'content'&lt;/span&gt;: &lt;span class="pl-s"&gt;'Hello! How can I assist you today?'&lt;/span&gt;,
                          &lt;span class="pl-s"&gt;'role'&lt;/span&gt;: &lt;span class="pl-s"&gt;'assistant'&lt;/span&gt;}}],
 &lt;span class="pl-s"&gt;'created'&lt;/span&gt;: &lt;span class="pl-c1"&gt;1738903256&lt;/span&gt;,
 &lt;span class="pl-s"&gt;'id'&lt;/span&gt;: &lt;span class="pl-s"&gt;'chatcmpl-76ea1733-cc2f-46d4-9939-90efa2a05e7c'&lt;/span&gt;,
 &lt;span class="pl-s"&gt;'model'&lt;/span&gt;: &lt;span class="pl-s"&gt;'SmolLM2-135M-Instruct.Q4_1.gguf'&lt;/span&gt;,
 &lt;span class="pl-s"&gt;'object'&lt;/span&gt;: &lt;span class="pl-s"&gt;'chat.completion'&lt;/span&gt;,
 &lt;span class="pl-s"&gt;'usage'&lt;/span&gt;: {&lt;span class="pl-s"&gt;'completion_tokens'&lt;/span&gt;: &lt;span class="pl-c1"&gt;9&lt;/span&gt;, &lt;span class="pl-s"&gt;'prompt_tokens'&lt;/span&gt;: &lt;span class="pl-c1"&gt;31&lt;/span&gt;, &lt;span class="pl-s"&gt;'total_tokens'&lt;/span&gt;: &lt;span class="pl-c1"&gt;40&lt;/span&gt;}}&lt;/pre&gt;
&lt;p&gt;But it also &lt;em&gt;spammed&lt;/em&gt; my terminal with a huge volume of debugging output - which started like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llama_model_load_from_file_impl: using device Metal (Apple M2 Max) - 49151 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 272 tensors from SmolLM2-135M-Instruct.Q4_1.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And then continued for more than &lt;a href="https://gist.github.com/simonw/9ef7acd836b1cc40c14686eae4dca340"&gt;500 lines&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;I've had this problem with &lt;code&gt;llama-cpp-python&lt;/code&gt; and &lt;code&gt;llama.cpp&lt;/code&gt; in the past, and was sad to find that the documentation still doesn't have a great answer for how to avoid this.&lt;/p&gt;
&lt;p&gt;So I turned to the just released &lt;a href="https://simonwillison.net/2025/Feb/5/gemini-2/"&gt;Gemini 2.0 Pro (Experimental)&lt;/a&gt;, because I know it's a strong model with a long input limit.&lt;/p&gt;
&lt;p&gt;I ran the entire &lt;code&gt;llama-cpp-python&lt;/code&gt; codebase through it like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-c1"&gt;cd&lt;/span&gt; /tmp
git clone https://github.com/abetlen/llama-cpp-python
&lt;span class="pl-c1"&gt;cd&lt;/span&gt; llama-cpp-python
files-to-prompt -e py &lt;span class="pl-c1"&gt;.&lt;/span&gt; -c &lt;span class="pl-k"&gt;|&lt;/span&gt; llm -m gemini-2.0-pro-exp-02-05 \
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;How can I prevent this library from logging any information at all while it is running - no stderr or anything like that&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/20476c2c6f7604df2994212cebfafef4#response"&gt;the answer I got back&lt;/a&gt;. It recommended setting the logger to &lt;code&gt;logging.CRITICAL&lt;/code&gt;, passing &lt;code&gt;verbose=False&lt;/code&gt; to the constructor and, most importantly, using the following context manager to suppress all output:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;contextlib&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;contextmanager&lt;/span&gt;, &lt;span class="pl-s1"&gt;redirect_stderr&lt;/span&gt;, &lt;span class="pl-s1"&gt;redirect_stdout&lt;/span&gt;

&lt;span class="pl-en"&gt;@&lt;span class="pl-s1"&gt;contextmanager&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;suppress_output&lt;/span&gt;():
    &lt;span class="pl-s"&gt;"""&lt;/span&gt;
&lt;span class="pl-s"&gt;    Suppresses all stdout and stderr output within the context.&lt;/span&gt;
&lt;span class="pl-s"&gt;    """&lt;/span&gt;
    &lt;span class="pl-k"&gt;with&lt;/span&gt; &lt;span class="pl-en"&gt;open&lt;/span&gt;(&lt;span class="pl-s1"&gt;os&lt;/span&gt;.&lt;span class="pl-c1"&gt;devnull&lt;/span&gt;, &lt;span class="pl-s"&gt;"w"&lt;/span&gt;) &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-s1"&gt;devnull&lt;/span&gt;:
        &lt;span class="pl-k"&gt;with&lt;/span&gt; &lt;span class="pl-en"&gt;redirect_stdout&lt;/span&gt;(&lt;span class="pl-s1"&gt;devnull&lt;/span&gt;), &lt;span class="pl-en"&gt;redirect_stderr&lt;/span&gt;(&lt;span class="pl-s1"&gt;devnull&lt;/span&gt;):
            &lt;span class="pl-k"&gt;yield&lt;/span&gt;&lt;/pre&gt;
&lt;p&gt;This worked! It turned out most of the output came from initializing the &lt;code&gt;Llama&lt;/code&gt; class, so I wrapped that like so:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;with&lt;/span&gt; &lt;span class="pl-en"&gt;suppress_output&lt;/span&gt;():
    &lt;span class="pl-s1"&gt;model&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;Llama&lt;/span&gt;(&lt;span class="pl-s1"&gt;model_path&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s1"&gt;self&lt;/span&gt;.&lt;span class="pl-c1"&gt;model_path&lt;/span&gt;, &lt;span class="pl-s1"&gt;verbose&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;False&lt;/span&gt;)&lt;/pre&gt;
&lt;p&gt;Proof of concept in hand I set about writing the plugin. I started with my &lt;a href="https://github.com/simonw/llm-plugin"&gt;simonw/llm-plugin&lt;/a&gt; cookiecutter template:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;uvx cookiecutter gh:simonw/llm-plugin&lt;/pre&gt;&lt;/div&gt;
&lt;pre&gt;&lt;code&gt;  [1/6] plugin_name (): smollm2
  [2/6] description (): SmolLM2-135M-Instruct.Q4_1 for LLM
  [3/6] hyphenated (smollm2): 
  [4/6] underscored (smollm2): 
  [5/6] github_username (): simonw
  [6/6] author_name (): Simon Willison
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;a href="https://github.com/simonw/llm-smollm2/blob/0.1.1/llm_smollm2/__init__.py"&gt;rest of the plugin&lt;/a&gt; was mostly borrowed from my existing &lt;a href="https://github.com/simonw/llm-gguf/blob/0.2/llm_gguf.py"&gt;llm-gguf&lt;/a&gt; plugin, updated based on the latest README for the &lt;code&gt;llama-cpp-python&lt;/code&gt; project.&lt;/p&gt;
&lt;p&gt;There's more information on building plugins in &lt;a href="https://llm.datasette.io/en/stable/plugins/tutorial-model-plugin.html"&gt;the tutorial on writing a plugin&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="packaging-the-plugin"&gt;Packaging the plugin&lt;/h4&gt;
&lt;p&gt;Once I had that working the last step was to figure out how to package it for PyPI. I'm never quite sure of the best way to bundle a binary file in a Python package, especially one that uses a &lt;code&gt;pyproject.toml&lt;/code&gt; file... so I dumped a copy of my existing &lt;code&gt;pyproject.toml&lt;/code&gt; file into o3-mini-high and prompted:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Modify this to bundle a SmolLM2-135M-Instruct.Q4_1.gguf file inside the package. I don't want to use hatch or a manifest or anything, I just want to use setuptools.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://chatgpt.com/share/67a59122-67c8-8006-9be4-29f8419343ad"&gt;the shared transcript&lt;/a&gt; - it gave me exactly what I wanted. I bundled it by adding this to the end of the &lt;code&gt;toml&lt;/code&gt; file:&lt;/p&gt;
&lt;div class="highlight highlight-source-toml"&gt;&lt;pre&gt;[&lt;span class="pl-en"&gt;tool&lt;/span&gt;.&lt;span class="pl-en"&gt;setuptools&lt;/span&gt;.&lt;span class="pl-en"&gt;package-data&lt;/span&gt;]
&lt;span class="pl-smi"&gt;llm_smollm2&lt;/span&gt; = [&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;SmolLM2-135M-Instruct.Q4_1.gguf&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;]&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then dropping that &lt;code&gt;.gguf&lt;/code&gt; file into the &lt;code&gt;llm_smollm2/&lt;/code&gt; directory and putting my plugin code in &lt;code&gt;llm_smollm2/__init__.py&lt;/code&gt;.&lt;/p&gt;
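&lt;p&gt;With the file shipped as package data, the plugin needs to locate it at runtime from wherever pip installed it. One common pattern for this (a sketch; the hypothetical &lt;code&gt;bundled_model_path()&lt;/code&gt; helper is mine, not necessarily how the actual plugin does it) uses &lt;code&gt;importlib.resources&lt;/code&gt;:&lt;/p&gt;

```python
from importlib.resources import files

def bundled_model_path(package="llm_smollm2",
                       filename="SmolLM2-135M-Instruct.Q4_1.gguf"):
    """Resolve the absolute path of a data file shipped inside an
    installed package, wherever pip placed it."""
    return str(files(package) / filename)
```

&lt;p&gt;That path can then be handed straight to &lt;code&gt;Llama(model_path=...)&lt;/code&gt;.&lt;/p&gt;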
&lt;p&gt;I tested it locally by running this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;python -m pip install build
python -m build&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I fired up a fresh virtual environment and ran &lt;code&gt;pip install ../path/to/llm-smollm2/dist/llm_smollm2-0.1-py3-none-any.whl&lt;/code&gt; to confirm that the package worked as expected.&lt;/p&gt;
&lt;h4 id="publishing-to-pypi"&gt;Publishing to PyPI&lt;/h4&gt;
&lt;p&gt;My cookiecutter template comes with &lt;a href="https://github.com/simonw/llm-smollm2/blob/main/.github/workflows/publish.yml"&gt;a GitHub Actions workflow&lt;/a&gt; that publishes the package to PyPI when a new release is created using the GitHub web interface. Here's the relevant YAML:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;  &lt;span class="pl-ent"&gt;deploy&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;runs-on&lt;/span&gt;: &lt;span class="pl-s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="pl-ent"&gt;needs&lt;/span&gt;: &lt;span class="pl-s"&gt;[test]&lt;/span&gt;
    &lt;span class="pl-ent"&gt;environment&lt;/span&gt;: &lt;span class="pl-s"&gt;release&lt;/span&gt;
    &lt;span class="pl-ent"&gt;permissions&lt;/span&gt;:
      &lt;span class="pl-ent"&gt;id-token&lt;/span&gt;: &lt;span class="pl-s"&gt;write&lt;/span&gt;
    &lt;span class="pl-ent"&gt;steps&lt;/span&gt;:
    - &lt;span class="pl-ent"&gt;uses&lt;/span&gt;: &lt;span class="pl-s"&gt;actions/checkout@v4&lt;/span&gt;
    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Set up Python&lt;/span&gt;
      &lt;span class="pl-ent"&gt;uses&lt;/span&gt;: &lt;span class="pl-s"&gt;actions/setup-python@v5&lt;/span&gt;
      &lt;span class="pl-ent"&gt;with&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;python-version&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;3.13&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
        &lt;span class="pl-ent"&gt;cache&lt;/span&gt;: &lt;span class="pl-s"&gt;pip&lt;/span&gt;
        &lt;span class="pl-ent"&gt;cache-dependency-path&lt;/span&gt;: &lt;span class="pl-s"&gt;pyproject.toml&lt;/span&gt;
    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Install dependencies&lt;/span&gt;
      &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;|&lt;/span&gt;
&lt;span class="pl-s"&gt;        pip install setuptools wheel build&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Build&lt;/span&gt;
      &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;|&lt;/span&gt;
&lt;span class="pl-s"&gt;        python -m build&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Publish&lt;/span&gt;
      &lt;span class="pl-ent"&gt;uses&lt;/span&gt;: &lt;span class="pl-s"&gt;pypa/gh-action-pypi-publish@release/v1&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This runs after the &lt;code&gt;test&lt;/code&gt; job has passed. It uses the &lt;a href="https://github.com/pypa/gh-action-pypi-publish"&gt;pypa/gh-action-pypi-publish&lt;/a&gt; Action to publish to PyPI - I wrote more about how that works &lt;a href="https://til.simonwillison.net/pypi/pypi-releases-from-github"&gt;in this TIL&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="is-the-model-any-good-"&gt;Is the model any good?&lt;/h4&gt;
&lt;p&gt;This one really isn't! It's not surprising, but it turns out 94MB just isn't enough space for a model that can do anything useful.&lt;/p&gt;
&lt;p&gt;It's &lt;em&gt;super&lt;/em&gt; fun to play with, and I continue to maintain that small, weak models are a great way to help build a mental model of how this technology actually works.&lt;/p&gt;
&lt;p&gt;That's not to say SmolLM2 isn't a fantastic model family. I'm running the smallest, most restricted version here. &lt;a href="https://huggingface.co/blog/smollm"&gt;SmolLM - blazingly fast and remarkably powerful&lt;/a&gt; describes the full model family - which comes in 135M, 360M, and 1.7B sizes. The larger versions are a whole lot more capable.&lt;/p&gt;
&lt;p&gt;If anyone can figure out something genuinely useful to do with the 94MB version I'd love to hear about it.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/pip"&gt;pip&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/plugins"&gt;plugins&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pypi"&gt;pypi&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/smollm"&gt;smollm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/o3"&gt;o3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama-cpp"&gt;llama-cpp&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="pip"/><category term="plugins"/><category term="projects"/><category term="pypi"/><category term="python"/><category term="ai"/><category term="github-actions"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="ai-assisted-programming"/><category term="llm"/><category term="gemini"/><category term="uv"/><category term="smollm"/><category term="o3"/><category term="llama-cpp"/></entry><entry><title>o3-mini is really good at writing internal documentation</title><link href="https://simonwillison.net/2025/Feb/5/o3-mini-documentation/#atom-tag" rel="alternate"/><published>2025-02-05T06:07:40+00:00</published><updated>2025-02-05T06:07:40+00:00</updated><id>https://simonwillison.net/2025/Feb/5/o3-mini-documentation/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://gist.github.com/simonw/4a13c4b10176d7b8e3d1260f5dcc9de3"&gt;o3-mini is really good at writing internal documentation&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I wanted to refresh my knowledge of how the Datasette permissions system works today. I already have &lt;a href="https://docs.datasette.io/en/latest/authentication.html"&gt;extensive hand-written documentation&lt;/a&gt; for that, but I thought it would be interesting to see if I could derive any insights from running an LLM against the codebase.&lt;/p&gt;
&lt;p&gt;o3-mini has an input limit of 200,000 tokens. I used &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; and my &lt;a href="https://github.com/simonw/files-to-prompt"&gt;files-to-prompt&lt;/a&gt; tool to generate the documentation like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-c1"&gt;cd&lt;/span&gt; /tmp
git clone https://github.com/simonw/datasette
&lt;span class="pl-c1"&gt;cd&lt;/span&gt; datasette
files-to-prompt datasette -e py -c &lt;span class="pl-k"&gt;|&lt;/span&gt; \
  llm -m o3-mini -s \
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;write extensive documentation for how the permissions system works, as markdown&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;files-to-prompt&lt;/code&gt; command is fed the &lt;a href="https://github.com/simonw/datasette/tree/main/datasette"&gt;datasette&lt;/a&gt; subdirectory, which contains just the source code for the application - omitting tests (in &lt;code&gt;tests/&lt;/code&gt;) and documentation (in &lt;code&gt;docs/&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;-e py&lt;/code&gt; option causes it to only include files with a &lt;code&gt;.py&lt;/code&gt; extension - skipping all of the HTML and JavaScript files in that hierarchy.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;-c&lt;/code&gt; option causes it to output Claude's XML-ish format - a format that works great with other LLMs too.&lt;/p&gt;
&lt;p&gt;You can see the output of that command &lt;a href="https://gist.github.com/simonw/1922544763b08c76f0b904e2ece364ea"&gt;in this Gist&lt;/a&gt;.&lt;/p&gt;
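&lt;p&gt;That "XML-ish" format wraps each file in a document element. The following sketch approximates the structure (it is a simplification, not files-to-prompt's exact output - see the Gist for the real thing):&lt;/p&gt;

```python
def to_cxml(docs):
    """Wrap (path, contents) pairs in a Claude-style <documents>
    envelope, approximating what files-to-prompt -c emits."""
    parts = ["<documents>"]
    for index, (path, contents) in enumerate(docs, start=1):
        parts.append(f'<document index="{index}">')
        parts.append(f"<source>{path}</source>")
        parts.append("<document_contents>")
        parts.append(contents)
        parts.append("</document_contents>")
        parts.append("</document>")
    parts.append("</documents>")
    return "\n".join(parts)
```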
&lt;p&gt;Then I pipe that result into LLM, requesting the &lt;code&gt;o3-mini&lt;/code&gt; OpenAI model and passing the following system prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;write extensive documentation for how the permissions system works, as markdown&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Specifically requesting Markdown &lt;a href="https://simonwillison.net/2025/Feb/2/openai-reasoning-models-advice-on-prompting/"&gt;is important&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The prompt used 99,348 input tokens and produced 3,118 output tokens (320 of those were invisible reasoning tokens). That's &lt;a href="https://tools.simonwillison.net/llm-prices"&gt;a cost&lt;/a&gt; of 12.3 cents.&lt;/p&gt;
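&lt;p&gt;That 12.3 cents figure follows directly from o3-mini's pricing of $1.10 per million input tokens and $4.40 per million output tokens (reasoning tokens are billed as output):&lt;/p&gt;

```python
def cost_cents(input_tokens, output_tokens,
               input_per_m=1.10, output_per_m=4.40):
    """Cost in cents given dollars-per-million-token pricing."""
    dollars = (input_tokens / 1e6) * input_per_m \
            + (output_tokens / 1e6) * output_per_m
    return dollars * 100

print(round(cost_cents(99_348, 3_118), 1))  # → 12.3
```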
&lt;p&gt;Honestly, &lt;a href="https://gist.github.com/simonw/4a13c4b10176d7b8e3d1260f5dcc9de3"&gt;the results&lt;/a&gt; are fantastic. I had to double-check that I hadn't accidentally fed in the documentation by mistake.&lt;/p&gt;
&lt;p&gt;(It's possible that the model is picking up additional information about Datasette in its training set, but I've seen similar &lt;a href="https://gist.github.com/simonw/adf64108d65cd5c10ac9fce953ab437e"&gt;high quality results&lt;/a&gt; from other, newer libraries so I don't think that's a significant factor.)&lt;/p&gt;
&lt;p&gt;In this case I already had extensive written documentation of my own, but this was still a useful refresher to help confirm that the code matched my mental model of how everything works.&lt;/p&gt;
&lt;p&gt;Documentation of project internals as a category is notorious for going out of date. Having tricks like this to derive usable how-it-works documentation from existing codebases in just a few seconds and at a cost of a few cents is wildly valuable.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/documentation"&gt;documentation&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/o3"&gt;o3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/files-to-prompt"&gt;files-to-prompt&lt;/a&gt;&lt;/p&gt;



</summary><category term="documentation"/><category term="ai"/><category term="datasette"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="llm"/><category term="llm-reasoning"/><category term="o3"/><category term="files-to-prompt"/></entry><entry><title>OpenAI reasoning models: Advice on prompting</title><link href="https://simonwillison.net/2025/Feb/2/openai-reasoning-models-advice-on-prompting/#atom-tag" rel="alternate"/><published>2025-02-02T20:56:27+00:00</published><updated>2025-02-02T20:56:27+00:00</updated><id>https://simonwillison.net/2025/Feb/2/openai-reasoning-models-advice-on-prompting/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://platform.openai.com/docs/guides/reasoning#advice-on-prompting"&gt;OpenAI reasoning models: Advice on prompting&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;OpenAI's documentation for their o1 and o3 "reasoning models" includes some interesting tips on how to best prompt them:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Developer messages are the new system messages:&lt;/strong&gt; Starting with &lt;code&gt;o1-2024-12-17&lt;/code&gt;, reasoning models support &lt;code&gt;developer&lt;/code&gt; messages rather than &lt;code&gt;system&lt;/code&gt; messages, to align with the &lt;a href="https://cdn.openai.com/spec/model-spec-2024-05-08.html#follow-the-chain-of-command"&gt;chain of command behavior described in the model spec&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;This appears to be a purely aesthetic change made for consistency with their &lt;a href="https://simonwillison.net/2024/Apr/23/the-instruction-hierarchy/"&gt;instruction hierarchy&lt;/a&gt; concept. As far as I can tell the old &lt;code&gt;system&lt;/code&gt; prompts continue to work exactly as before - you're encouraged to use the new &lt;code&gt;developer&lt;/code&gt; message type but it has no impact on what actually happens.&lt;/p&gt;
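&lt;p&gt;Here's a rough sketch of what the new role looks like on the wire, modeled on the example in OpenAI's own docs (assumes an &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; environment variable - swap &lt;code&gt;developer&lt;/code&gt; back to &lt;code&gt;system&lt;/code&gt; and it should behave identically):&lt;/p&gt;

```shell
# Example request: "developer" is the new name for the "system" role
curl -s https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "o1-2024-12-17",
    "messages": [
      {"role": "developer", "content": "Talk like a pirate."},
      {"role": "user", "content": "Are semicolons optional in JavaScript?"}
    ]
  }'
```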
&lt;p&gt;Since my &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; tool already bakes in a &lt;code&gt;llm --system "system prompt"&lt;/code&gt; option which works across multiple different models from different providers I'm not going to rush to adopt this new language!&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Use delimiters for clarity:&lt;/strong&gt; Use delimiters like markdown, XML tags, and section titles to clearly indicate distinct parts of the input, helping the model interpret different sections appropriately.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Anthropic have been encouraging &lt;a href="https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/use-xml-tags"&gt;XML-ish delimiters&lt;/a&gt; for a while (I say -ish because there's no requirement that the resulting prompt is valid XML). My &lt;a href="https://github.com/simonw/files-to-prompt"&gt;files-to-prompt&lt;/a&gt; tool has a &lt;code&gt;-c&lt;/code&gt; option which outputs Claude-style XML, and in my experiments this same option works great with o1 and o3 too:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;git clone https://github.com/tursodatabase/limbo
&lt;span class="pl-c1"&gt;cd&lt;/span&gt; limbo/bindings/python

files-to-prompt &lt;span class="pl-c1"&gt;.&lt;/span&gt; -c &lt;span class="pl-k"&gt;|&lt;/span&gt; llm -m o3-mini \
  -o reasoning_effort high \
  --system &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Write a detailed README with extensive usage examples&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Limit additional context in retrieval-augmented generation (RAG):&lt;/strong&gt; When providing additional context or documents, include only the most relevant information to prevent the model from overcomplicating its response.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;This makes me think that o1/o3 are not good models to implement RAG on at all - with RAG I like to be able to dump as much extra context into the prompt as possible and leave it to the models to figure out what's relevant.&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Try zero shot first, then few shot if needed:&lt;/strong&gt; Reasoning models often don't need few-shot examples to produce good results, so try to write prompts without examples first. If you have more complex requirements for your desired output, it may help to include a few examples of inputs and desired outputs in your prompt. Just ensure that the examples align very closely with your prompt instructions, as discrepancies between the two may produce poor results.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Providing examples remains the single most powerful prompting tip I know, so it's interesting to see advice here to only switch to examples if zero-shot doesn't work out.&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Be very specific about your end goal:&lt;/strong&gt; In your instructions, try to give very specific parameters for a successful response, and encourage the model to keep reasoning and iterating until it matches your success criteria.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;This makes sense: reasoning models "think" until they reach a conclusion, so making the goal as unambiguous as possible leads to better results.&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Markdown formatting:&lt;/strong&gt; Starting with &lt;code&gt;o1-2024-12-17&lt;/code&gt;, reasoning models in the API will avoid generating responses with markdown formatting. To signal to the model when you &lt;strong&gt;do&lt;/strong&gt; want markdown formatting in the response, include the string &lt;code&gt;Formatting re-enabled&lt;/code&gt; on the first line of your &lt;code&gt;developer&lt;/code&gt; message.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;This one was a &lt;em&gt;real shock&lt;/em&gt; to me! I noticed that o3-mini was outputting &lt;code&gt;•&lt;/code&gt; characters instead of Markdown &lt;code&gt;*&lt;/code&gt; bullets and initially thought &lt;a href="https://twitter.com/simonw/status/1886121477822648441"&gt;that was a bug&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I first saw this while running this prompt against &lt;a href="https://github.com/tursodatabase/limbo/tree/main/bindings/python"&gt;limbo/bindings/python&lt;/a&gt; using &lt;a href="https://github.com/simonw/files-to-prompt"&gt;files-to-prompt&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;git clone https://github.com/tursodatabase/limbo
&lt;span class="pl-c1"&gt;cd&lt;/span&gt; limbo/bindings/python

files-to-prompt &lt;span class="pl-c1"&gt;.&lt;/span&gt; -c &lt;span class="pl-k"&gt;|&lt;/span&gt; llm -m o3-mini \
  -o reasoning_effort high \
  --system &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Write a detailed README with extensive usage examples&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here's the &lt;a href="https://gist.github.com/simonw/f8283d68e9bd7ad3f140d52cad6874a7"&gt;full result&lt;/a&gt;, which includes text like this (note the weird bullets):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Features
--------
• High‑performance, in‑process database engine written in Rust  
• SQLite‑compatible SQL interface  
• Standard Python DB‑API 2.0–style connection and cursor objects
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;I ran it again with this modified prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Formatting re-enabled. Write a detailed README with extensive usage examples.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
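&lt;p&gt;In other words, the re-run is the earlier pipeline with only the system prompt changed:&lt;/p&gt;

```shell
files-to-prompt . -c | llm -m o3-mini \
  -o reasoning_effort high \
  --system 'Formatting re-enabled. Write a detailed README with extensive usage examples.'
```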
&lt;p&gt;And this time got back &lt;a href="https://gist.github.com/simonw/adf64108d65cd5c10ac9fce953ab437e"&gt;proper Markdown, rendered in this Gist&lt;/a&gt;. That did a really good job, and included bulleted lists using this valid Markdown syntax instead:&lt;/p&gt;
&lt;div class="highlight highlight-text-md"&gt;&lt;pre&gt;&lt;span class="pl-v"&gt;-&lt;/span&gt; &lt;span class="pl-s"&gt;**&lt;/span&gt;&lt;span class="pl-s"&gt;`&lt;/span&gt;&lt;span class="pl-c1"&gt;make test&lt;/span&gt;&lt;span class="pl-s"&gt;`&lt;/span&gt;&lt;span class="pl-s"&gt;**&lt;/span&gt;: Run tests using pytest.
&lt;span class="pl-v"&gt;-&lt;/span&gt; &lt;span class="pl-s"&gt;**&lt;/span&gt;&lt;span class="pl-s"&gt;`&lt;/span&gt;&lt;span class="pl-c1"&gt;make lint&lt;/span&gt;&lt;span class="pl-s"&gt;`&lt;/span&gt;&lt;span class="pl-s"&gt;**&lt;/span&gt;: Run linters (via &lt;span class="pl-s"&gt;[&lt;/span&gt;ruff&lt;span class="pl-s"&gt;]&lt;/span&gt;&lt;span class="pl-s"&gt;(&lt;/span&gt;&lt;span class="pl-corl"&gt;https://github.com/astral-sh/ruff&lt;/span&gt;&lt;span class="pl-s"&gt;)&lt;/span&gt;).
&lt;span class="pl-v"&gt;-&lt;/span&gt; &lt;span class="pl-s"&gt;**&lt;/span&gt;&lt;span class="pl-s"&gt;`&lt;/span&gt;&lt;span class="pl-c1"&gt;make check-requirements&lt;/span&gt;&lt;span class="pl-s"&gt;`&lt;/span&gt;&lt;span class="pl-s"&gt;**&lt;/span&gt;: Validate that the &lt;span class="pl-s"&gt;`&lt;/span&gt;&lt;span class="pl-c1"&gt;requirements.txt&lt;/span&gt;&lt;span class="pl-s"&gt;`&lt;/span&gt; files are in sync with &lt;span class="pl-s"&gt;`&lt;/span&gt;&lt;span class="pl-c1"&gt;pyproject.toml&lt;/span&gt;&lt;span class="pl-s"&gt;`&lt;/span&gt;.
&lt;span class="pl-v"&gt;-&lt;/span&gt; &lt;span class="pl-s"&gt;**&lt;/span&gt;&lt;span class="pl-s"&gt;`&lt;/span&gt;&lt;span class="pl-c1"&gt;make compile-requirements&lt;/span&gt;&lt;span class="pl-s"&gt;`&lt;/span&gt;&lt;span class="pl-s"&gt;**&lt;/span&gt;: Compile the &lt;span class="pl-s"&gt;`&lt;/span&gt;&lt;span class="pl-c1"&gt;requirements.txt&lt;/span&gt;&lt;span class="pl-s"&gt;`&lt;/span&gt; files using pip-tools.&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Py-Limbo. Py-Limbo is a lightweight, in-process, OLTP (Online Transaction Processing) database management system built as a Python extension module on top of Rust. It is designed to be compatible with SQLite in both usage and API, while offering an opportunity to experiment with Rust-backed database functionality. Note: Py-Limbo is a work-in-progress (Alpha stage) project. Some features (e.g. transactions, executemany, fetchmany) are not yet supported. Table of Contents - then a hierarchical nested table of contents." src="https://static.simonwillison.net/static/2025/pylimbo-docs.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;(Using LLMs like this to get me off the ground with under-documented libraries is a trick I use several times a month.)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: &lt;a href="https://twitter.com/nikunjhanda/status/1886169547197264226"&gt;OpenAI's Nikunj Handa&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;we agree this is weird! fwiw, it’s a temporary thing we had to do for the existing o-series models. we’ll fix this in future releases so that you can go back to naturally prompting for markdown or no-markdown.&lt;/p&gt;
&lt;/blockquote&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/harjotsgill/status/1886122316767379540"&gt;@harjotsgill&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/documentation"&gt;documentation&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/markdown"&gt;markdown&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-engineering"&gt;prompt-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rag"&gt;rag&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/o1"&gt;o1&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/o3"&gt;o3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/limbo"&gt;limbo&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/files-to-prompt"&gt;files-to-prompt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/system-prompts"&gt;system-prompts&lt;/a&gt;&lt;/p&gt;



</summary><category term="documentation"/><category term="markdown"/><category term="ai"/><category term="openai"/><category term="prompt-engineering"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="llm"/><category term="rag"/><category term="o1"/><category term="llm-reasoning"/><category term="o3"/><category term="limbo"/><category term="files-to-prompt"/><category term="system-prompts"/></entry><entry><title>Quoting Daniel Litt</title><link href="https://simonwillison.net/2025/Feb/1/daniel-litt/#atom-tag" rel="alternate"/><published>2025-02-01T21:46:52+00:00</published><updated>2025-02-01T21:46:52+00:00</updated><id>https://simonwillison.net/2025/Feb/1/daniel-litt/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://twitter.com/littmath/status/1885716052304077088"&gt;&lt;p&gt;Basically any resource on a difficult subject—a colleague, Google, a published paper—will be wrong or incomplete in various ways. Usefulness isn’t only a matter of correctness.&lt;/p&gt;
&lt;p&gt;For example, suppose a colleague has a question she thinks I might know the answer to. Good news: I have some intuition and say something. Then we realize it doesn’t quite make sense, and go back and forth until we converge on something correct.&lt;/p&gt;
&lt;p&gt;Such a conversation is full of BS but crucially we can interrogate it and get something useful out of it in the end. Moreover this kind of back and forth allows us to get to the key point in a way that might be difficult when reading a difficult ~50-page paper.&lt;/p&gt;
&lt;p&gt;To be clear o3-mini-high is orders of magnitude less useful for this sort of thing than talking to an expert colleague. But still useful along similar dimensions (and with a much broader knowledge base).&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://twitter.com/littmath/status/1885716052304077088"&gt;Daniel Litt&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/mathematics"&gt;mathematics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/o3"&gt;o3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/daniel-litt"&gt;daniel-litt&lt;/a&gt;&lt;/p&gt;



</summary><category term="mathematics"/><category term="generative-ai"/><category term="o3"/><category term="ai"/><category term="llms"/><category term="daniel-litt"/></entry><entry><title>OpenAI o3-mini, now available in LLM</title><link href="https://simonwillison.net/2025/Jan/31/o3-mini/#atom-tag" rel="alternate"/><published>2025-01-31T21:50:36+00:00</published><updated>2025-01-31T21:50:36+00:00</updated><id>https://simonwillison.net/2025/Jan/31/o3-mini/#atom-tag</id><summary type="html">
    &lt;p&gt;OpenAI's &lt;a href="https://openai.com/index/openai-o3-mini/"&gt;o3-mini is out today&lt;/a&gt;. As with other o-series models it's a slightly difficult one to evaluate - we now need to decide if a prompt is best run using GPT-4o, o1, o3-mini or (if we have access) o1 Pro.&lt;/p&gt;
&lt;p&gt;Confusing matters further, the benchmarks in &lt;a href="https://openai.com/index/o3-mini-system-card/"&gt;the o3-mini system card&lt;/a&gt; (PDF) aren't a universal win for o3-mini across all categories. It generally benchmarks higher than GPT-4o and o1 but not across everything.&lt;/p&gt;
&lt;p&gt;The biggest win for o3-mini is on the Codeforces ELO competitive programming benchmark, which I think is &lt;a href="https://arxiv.org/abs/2501.01257"&gt;described by this 2nd January 2025 paper&lt;/a&gt;, with the following scores:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;o3-mini (high) 2130&lt;/li&gt;
&lt;li&gt;o3-mini (medium) 2036&lt;/li&gt;
&lt;li&gt;o1 1891&lt;/li&gt;
&lt;li&gt;o3-mini (low) 1831&lt;/li&gt;
&lt;li&gt;o1-mini 1650&lt;/li&gt;
&lt;li&gt;o1-preview 1258&lt;/li&gt;
&lt;li&gt;GPT-4o 900&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Weirdly, that GPT-4o score appeared in an older copy of the System Card PDF, which has since been replaced by an updated document that doesn't mention Codeforces ELO scores at all.&lt;/p&gt;
&lt;p&gt;One note from the System Card that stood out for me concerning intended applications of o3-mini for OpenAI themselves:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We also plan to allow users to use o3-mini to search the internet and summarize the results in ChatGPT. We expect o3-mini to be a useful and safe model for doing this, especially given its performance on the jailbreak and instruction hierarchy evals detailed in Section 4 below.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is notable because the existing o1 models on ChatGPT have not yet had access to their web search tool - despite the combination of search and "reasoning" models having very clear benefits.&lt;/p&gt;
&lt;p&gt;o3-mini does not and &lt;a href="https://twitter.com/nikunjhanda/status/1885415728624656481"&gt;will not&lt;/a&gt; support vision. We will have to wait for future OpenAI reasoning models for that.&lt;/p&gt;
&lt;p&gt;I released &lt;a href="https://llm.datasette.io/en/stable/changelog.html#v0-21"&gt;LLM 0.21&lt;/a&gt; with support for the new model, plus its &lt;code&gt;-o reasoning_effort high&lt;/code&gt; (or &lt;code&gt;medium&lt;/code&gt; or &lt;code&gt;low&lt;/code&gt;) option for tweaking the reasoning effort - details &lt;a href="https://github.com/simonw/llm/issues/728"&gt;in this issue&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Note that the new model is currently only available for &lt;a href="https://platform.openai.com/docs/guides/rate-limits/usage-tiers#tier-3-rate-limits"&gt;Tier 3&lt;/a&gt; and higher users, which requires you to have spent at least $100 on the API.&lt;/p&gt;
&lt;p&gt;o3-mini &lt;a href="https://openai.com/api/pricing/"&gt;is priced&lt;/a&gt; at $1.10/million input tokens, $4.40/million output tokens - less than half the price of GPT-4o (currently $2.50/$10) and massively cheaper than o1 ($15/$60). The GPT-4o comparison isn't quite as simple as that though, as o3-mini's invisible reasoning tokens still count towards the output tokens you get charged for.&lt;/p&gt;
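&lt;p&gt;For a concrete comparison using those listed prices, here's what a hypothetical 10,000 input / 5,000 output token prompt would cost on each model:&lt;/p&gt;

```shell
# Cost of a hypothetical 10,000-in / 5,000-out prompt at published per-million rates
awk 'BEGIN {
  printf "o3-mini: $%.4f\n", (10000 *  1.10 + 5000 *  4.40) / 1000000  # $0.0330
  printf "GPT-4o:  $%.4f\n", (10000 *  2.50 + 5000 * 10.00) / 1000000  # $0.0750
  printf "o1:      $%.4f\n", (10000 * 15.00 + 5000 * 60.00) / 1000000  # $0.4500
}'
```

&lt;p&gt;(Remember that for o3-mini the invisible reasoning tokens inflate the output side of that calculation.)&lt;/p&gt;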
&lt;p&gt;I tried using it to summarize &lt;a href="https://news.ycombinator.com/item?id=42890627"&gt;this conversation about o3-mini on Hacker News&lt;/a&gt;, using &lt;a href="https://til.simonwillison.net/llms/claude-hacker-news-themes#user-content-adding-a--m-model-option"&gt;my hn-summary.sh script&lt;/a&gt;.&lt;/p&gt;

&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;hn-summary.sh 42890627 -o o3-mini&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/09e5922be0cbb85894cf05e6d75ae050"&gt;the result&lt;/a&gt; - it used 18,936 input tokens and 2,905 output tokens for a total cost of 3.3612 cents.&lt;/p&gt;
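&lt;p&gt;For the curious, an hn-summary.sh-style script can be sketched roughly like this - a hypothetical reconstruction, the real script may differ. It fetches a Hacker News thread from the Algolia API, flattens the comment text with jq, and pipes it to &lt;code&gt;llm&lt;/code&gt;:&lt;/p&gt;

```shell
#!/bin/bash
# Hypothetical sketch of an hn-summary.sh-style script (the real script may
# differ): fetch a Hacker News thread from the Algolia API, flatten the
# comment text, then pipe it to llm; extra arguments pass through to llm.
item_id="$1"
shift
curl -s "https://hn.algolia.com/api/v1/items/$item_id" \
  | jq -r '[recurse(.children[]?) | .text // empty] | join("\n\n")' \
  | llm "$@" --system 'Summarize the themes of the opinions expressed here.'
```

&lt;p&gt;A sketch like this would be invoked as &lt;code&gt;hn-summary.sh 42890627 -m o3-mini&lt;/code&gt;, with the trailing arguments passed straight through to &lt;code&gt;llm&lt;/code&gt;.&lt;/p&gt;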

&lt;p&gt;o3-mini (and o1-mini) are text-only models: they don't accept image inputs. The full o1 API model can accept images in the same way as GPT-4o.&lt;/p&gt;

&lt;p&gt;Another characteristic worth noting is o3-mini's token output limit - the measure of how much text it can output in one go.  That's 100,000 tokens, compared to 16,000 for GPT-4o and just 8,000 for both DeepSeek R1 and Claude 3.5.&lt;/p&gt;

&lt;p&gt;Invisible "reasoning tokens" come out of the same budget, so it's likely not possible to have it output the full 100,000.&lt;/p&gt;

&lt;p&gt;The model accepts up to 200,000 tokens of input, an improvement on GPT-4o's 128,000.&lt;/p&gt;

&lt;p&gt;An application where output limits really matter is translation between human languages, where the output can realistically be expected to have a similar length to the input. It will be interesting to see how well o3-mini works for that, especially given its low price.&lt;/p&gt;
&lt;p&gt;Update: Here's a &lt;a href="https://news.ycombinator.com/item?id=42894215#42895610"&gt;fascinating comment&lt;/a&gt; on this by professional translator Tom Gally on Hacker News:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I just did a test in which both R1 and o3-mini got worse at translation in the latter half of a long text. [...]&lt;/p&gt;
&lt;p&gt;An initial comparison of the output suggested that, while R1 didn’t seem bad, o3-mini produced a writing style closer to what I asked for in the prompt—smoother and more natural English.
But then I noticed that the output length was 5,855 characters for R1, 9,052 characters for o3-mini, and 11,021 characters for my own polished version. Comparing the three translations side-by-side with the original Japanese, I discovered that R1 had omitted entire paragraphs toward the end of the speech, and that o3-mini had switched to a strange abbreviated style (using slashes instead of “and” between noun phrases, for example) toward the end as well. The vanilla versions of ChatGPT, Claude, and Gemini that I ran the same prompt and text through a month ago had had none of those problems.&lt;/p&gt;
&lt;/blockquote&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/translation"&gt;translation&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/o3"&gt;o3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="projects"/><category term="translation"/><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llm"/><category term="llm-pricing"/><category term="llm-reasoning"/><category term="o3"/><category term="llm-release"/></entry><entry><title>Quoting Jack Clark</title><link href="https://simonwillison.net/2024/Dec/23/jack-clark/#atom-tag" rel="alternate"/><published>2024-12-23T15:34:49+00:00</published><updated>2024-12-23T15:34:49+00:00</updated><id>https://simonwillison.net/2024/Dec/23/jack-clark/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://jack-clark.net/2024/12/23/import-ai-395-ai-and-energy-demand-distributed-training-via-demo-and-phi-4/"&gt;&lt;p&gt;There’s been a lot of strange reporting recently about how ‘scaling is hitting a wall’ – in a very narrow sense this is true in that larger models were getting less score improvement on challenging benchmarks than their predecessors, but in a larger sense this is false – techniques like those which power O3 means scaling is continuing (and if anything the curve has steepened), you just now need to account for scaling both within the training of the model and in the compute you spend on it once trained.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://jack-clark.net/2024/12/23/import-ai-395-ai-and-energy-demand-distributed-training-via-demo-and-phi-4/"&gt;Jack Clark&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/jack-clark"&gt;jack-clark&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/o3"&gt;o3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;&lt;/p&gt;



</summary><category term="jack-clark"/><category term="generative-ai"/><category term="llm-reasoning"/><category term="o3"/><category term="ai"/><category term="llms"/></entry><entry><title>OpenAI o3 breakthrough high score on ARC-AGI-PUB</title><link href="https://simonwillison.net/2024/Dec/20/openai-o3-breakthrough/#atom-tag" rel="alternate"/><published>2024-12-20T22:17:42+00:00</published><updated>2024-12-20T22:17:42+00:00</updated><id>https://simonwillison.net/2024/Dec/20/openai-o3-breakthrough/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://arcprize.org/blog/oai-o3-pub-breakthrough"&gt;OpenAI o3 breakthrough high score on ARC-AGI-PUB&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;François Chollet is the co-founder of the ARC Prize and had advanced access to today's o3 results. His article here is the most insightful coverage I've seen of o3, going beyond just the benchmark results to talk about what this all means for the field in general.&lt;/p&gt;
&lt;p&gt;One fascinating detail: it cost $6,677 to run o3 in "high efficiency" mode against the 400 public ARC-AGI puzzles for a score of 82.8%, and an undisclosed amount of money to run the "low efficiency" mode model to score 91.5%. A note says:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;o3 high-compute costs not available as pricing and feature availability is still TBD. The amount of compute was roughly 172x the low-compute configuration.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So we can get a ballpark estimate: 172 * $6,677 = $1,148,444!&lt;/p&gt;
&lt;p&gt;Here's how François explains the likely mechanisms behind o3, which reminds me of how a brute-force chess computer might work.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;For now, we can only speculate about the exact specifics of how o3 works. But o3's core mechanism appears to be natural language program search and execution within token space – at test time, the model searches over the space of possible Chains of Thought (CoTs) describing the steps required to solve the task, in a fashion perhaps not too dissimilar to AlphaZero-style Monte-Carlo tree search. In the case of o3, the search is presumably guided by some kind of evaluator model. To note, Demis Hassabis hinted back in a June 2023 interview that DeepMind had been researching this very idea – this line of work has been a long time coming.&lt;/p&gt;
&lt;p&gt;So while single-generation LLMs struggle with novelty, o3 overcomes this by generating and executing its own programs, where the program itself (the CoT) becomes the artifact of knowledge recombination. Although this is not the only viable approach to test-time knowledge recombination (you could also do test-time training, or search in latent space), it represents the current state-of-the-art as per these new ARC-AGI numbers.&lt;/p&gt;
&lt;p&gt;Effectively, o3 represents a form of deep learning-guided program search. The model does test-time search over a space of "programs" (in this case, natural language programs – the space of CoTs that describe the steps to solve the task at hand), guided by a deep learning prior (the base LLM). The reason why solving a single ARC-AGI task can end up taking up tens of millions of tokens and cost thousands of dollars is because this search process has to explore an enormous number of paths through program space – including backtracking.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I'm not sure if o3 (and o1 and similar models) even qualifies as an LLM any more - there's clearly a whole lot more going on here than just next-token prediction.&lt;/p&gt;
&lt;p&gt;On the question of if o3 should qualify as AGI (whatever that might mean):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Passing ARC-AGI does not equate to achieving AGI, and, as a matter of fact, I don't think o3 is AGI yet. o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence.&lt;/p&gt;
&lt;p&gt;Furthermore, early data points suggest that the upcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute (while a smart human would still be able to score over 95% with no training).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The post finishes with examples of the puzzles that o3 &lt;em&gt;didn't&lt;/em&gt; manage to solve, including this one which reassured me that I can still solve at least some puzzles that couldn't be handled with thousands of dollars of GPU compute!&lt;/p&gt;
&lt;p&gt;&lt;img alt="A puzzle with colored squares, where drawing a line between the single blue squares and turning any intersected rectangles blue is clearly the solution." src="https://static.simonwillison.net/static/2024/arc-agi-task-0d87d2a6.png" /&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/francois-chollet"&gt;francois-chollet&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/o3"&gt;o3&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="francois-chollet"/><category term="llm-reasoning"/><category term="o3"/></entry><entry><title>Quoting François Chollet</title><link href="https://simonwillison.net/2024/Dec/20/francois-chollet/#atom-tag" rel="alternate"/><published>2024-12-20T19:12:49+00:00</published><updated>2024-12-20T19:12:49+00:00</updated><id>https://simonwillison.net/2024/Dec/20/francois-chollet/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://arcprize.org/blog/oai-o3-pub-breakthrough"&gt;&lt;p&gt;OpenAI's new o3 system - trained on the ARC-AGI-1 Public Training set - has scored a breakthrough 75.7% on the Semi-Private Evaluation set at our stated public leaderboard $10k compute limit. A high-compute (172x) o3 configuration scored 87.5%.&lt;/p&gt;
&lt;p&gt;This is a surprising and important step-function increase in AI capabilities, showing novel task adaptation ability never seen before in the GPT-family models. For context, ARC-AGI-1 took 4 years to go from 0% with GPT-3 in 2020 to 5% in 2024 with GPT-4o. All intuition about AI capabilities will need to get updated for o3.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://arcprize.org/blog/oai-o3-pub-breakthrough"&gt;François Chollet&lt;/a&gt;, Co-founder, ARC Prize&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/o1"&gt;o1&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/francois-chollet"&gt;francois-chollet&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/o3"&gt;o3&lt;/a&gt;&lt;/p&gt;



</summary><category term="o1"/><category term="generative-ai"/><category term="llm-reasoning"/><category term="francois-chollet"/><category term="ai"/><category term="llms"/><category term="openai"/><category term="o3"/></entry><entry><title>Live blog: the 12th day of OpenAI - "Early evals for OpenAI o3"</title><link href="https://simonwillison.net/2024/Dec/20/live-blog-the-12th-day-of-openai/#atom-tag" rel="alternate"/><published>2024-12-20T17:40:47+00:00</published><updated>2024-12-20T17:40:47+00:00</updated><id>https://simonwillison.net/2024/Dec/20/live-blog-the-12th-day-of-openai/#atom-tag</id><summary type="html">
    &lt;p&gt;It's the final day of OpenAI's &lt;a href="https://openai.com/12-days/"&gt;12 Days of OpenAI&lt;/a&gt; launch series, and since I built &lt;a href="https://til.simonwillison.net/django/live-blog"&gt;a live blogging system&lt;/a&gt; a couple of months ago I've decided to roll it out again to provide live commentary during the half hour event, which kicks off at &lt;a href="https://www.timeanddate.com/worldclock/fixedtime.html?msg=12th+Day+of+OpenAI&amp;amp;iso=20241220T10&amp;amp;p1=224&amp;amp;am=30"&gt;10am San Francisco time&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here's the &lt;a href="https://www.youtube.com/watch?v=SKBG1sqdyIU"&gt;video on YouTube&lt;/a&gt;.&lt;/p&gt;

&lt;iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/SKBG1sqdyIU?si=ABWW8H90l4LbLvyg" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="1"&gt; &lt;/iframe&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-injection"&gt;prompt-injection&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/o1"&gt;o1&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/o3"&gt;o3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/live-blog"&gt;live-blog&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="openai"/><category term="prompt-injection"/><category term="generative-ai"/><category term="llms"/><category term="o1"/><category term="llm-reasoning"/><category term="o3"/><category term="live-blog"/></entry></feed>