<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: deep-research</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/deep-research.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-10-18T19:21:30+00:00</updated><author><name>Simon Willison</name></author><entry><title>TIL: Exploring OpenAI's deep research API model o4-mini-deep-research</title><link href="https://simonwillison.net/2025/Oct/18/o4-mini-deep-research/#atom-tag" rel="alternate"/><published>2025-10-18T19:21:30+00:00</published><updated>2025-10-18T19:21:30+00:00</updated><id>https://simonwillison.net/2025/Oct/18/o4-mini-deep-research/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://til.simonwillison.net/llms/o4-mini-deep-research"&gt;TIL: Exploring OpenAI&amp;#x27;s deep research API model o4-mini-deep-research&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I landed &lt;a href="https://github.com/simonw/llm-prices/pull/9"&gt;a PR&lt;/a&gt; by Manuel Solorzano adding pricing information to &lt;a href="https://www.llm-prices.com/"&gt;llm-prices.com&lt;/a&gt; for OpenAI's &lt;a href="https://platform.openai.com/docs/models/o4-mini-deep-research"&gt;o4-mini-deep-research&lt;/a&gt; and &lt;a href="https://platform.openai.com/docs/models/o3-deep-research"&gt;o3-deep-research&lt;/a&gt; models, which they released &lt;a href="https://cookbook.openai.com/examples/deep_research_api/introduction_to_deep_research_api"&gt;in June&lt;/a&gt; and &lt;a href="https://platform.openai.com/docs/guides/deep-research"&gt;document here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I realized I'd never tried these before, so I put &lt;code&gt;o4-mini-deep-research&lt;/code&gt; through its paces researching locations of surviving &lt;a href="https://en.wikipedia.org/wiki/Orchestrion"&gt;orchestrions&lt;/a&gt; for me (I &lt;a href="https://www.niche-museums.com/115"&gt;really like orchestrions&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;The API cost me $1.10 and triggered a small flurry of extra vibe-coded tools, including this &lt;a href="https://tools.simonwillison.net/deep-research-viewer#gist=3454a4ce40f8547a5c65c911de611ff4"&gt;new tool&lt;/a&gt; for visualizing Responses API traces from deep research models and &lt;a href="https://gistpreview.github.io/?b9f5416b37c4ceec46d8447b52be0ad2"&gt;this mocked up page&lt;/a&gt; listing the 19 orchestrions it found (only one of which I have fact-checked myself).&lt;/p&gt;
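A call like this goes through OpenAI's Responses API. Here's a minimal, hedged sketch of what the request payload might look like (field names follow OpenAI's deep research documentation, but treat the exact shape as an assumption to verify): deep research models are invoked like other Responses API models, except they require at least one data source tool such as web search, and long runs are intended to use background mode.

```python
import json

# Hedged sketch: build a Responses API request for o4-mini-deep-research.
# Field names follow OpenAI's docs but should be verified before use.
def build_deep_research_request(prompt: str) -> dict:
    return {
        "model": "o4-mini-deep-research",
        "input": prompt,
        "background": True,  # research runs can take many minutes
        "tools": [{"type": "web_search_preview"}],  # at least one data source is required
    }

payload = build_deep_research_request(
    "Find locations of surviving orchestrions around the world"
)
print(json.dumps(payload, indent=2))  # POST this to /v1/responses
```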
&lt;p&gt;&lt;img alt="A web page showing information about historic orchestrions. The header reads &amp;quot;Historic Orchestrions Around the World&amp;quot; with subtitle &amp;quot;A collection of rare and remarkable mechanical orchestras&amp;quot; and three pills showing &amp;quot;19 Orchestrions&amp;quot;, &amp;quot;7 Locations&amp;quot;, and &amp;quot;7 Countries&amp;quot;. Below is a white card titled &amp;quot;The Musical Museum (Brentford)&amp;quot; with a location pin icon showing &amp;quot;London (Brentford), UK&amp;quot; and a blue &amp;quot;View on Google Maps →&amp;quot; button. The card contains three sections: DESCRIPTION: &amp;quot;Imhof &amp;amp; Mukle pipe-organ orchestrion (1899) with multiple registers and percussion (drums, tambourine, triangle) (www.soundsurvey.org.uk).&amp;quot; HISTORY: &amp;quot;Built in London c.1899 by Imhof &amp;amp; Mukle; remained in their Oxford Street showroom until company collapse in the 1970s, when it was acquired by the Brentford Musical Museum (www.soundsurvey.org.uk).&amp;quot; NOTES: &amp;quot;The museum advertises that the soprano Adelina Patti used a similar Imhof orchestrion at her home in Wales (www.soundsurvey.org.uk).&amp;quot;" src="https://static.simonwillison.net/static/2025/orchestrions-around-the-world.jpg" /&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deep-research"&gt;deep-research&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vibe-coding"&gt;vibe-coding&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="deep-research"/><category term="vibe-coding"/></entry><entry><title>GPT-5 Thinking in ChatGPT (aka Research Goblin) is shockingly good at search</title><link href="https://simonwillison.net/2025/Sep/6/research-goblin/#atom-tag" rel="alternate"/><published>2025-09-06T19:31:57+00:00</published><updated>2025-09-06T19:31:57+00:00</updated><id>https://simonwillison.net/2025/Sep/6/research-goblin/#atom-tag</id><summary type="html">
    &lt;p&gt;"Don't use chatbots as search engines" was great advice for several years... until it wasn't.&lt;/p&gt;
&lt;p&gt;I wrote about how good OpenAI's o3 was at using its Bing-backed search tool &lt;a href="https://simonwillison.net/2025/Apr/21/ai-assisted-search/"&gt;back in April&lt;/a&gt;. GPT-5 feels even better.&lt;/p&gt;
&lt;p&gt;I've started calling it my &lt;strong&gt;Research Goblin&lt;/strong&gt;. I can assign a task to it, no matter how trivial or complex, and it will do an often unreasonable amount of work to search the internet and figure out an answer.&lt;/p&gt;
&lt;p&gt;This is excellent for satisfying curiosity, and occasionally useful for more important endeavors as well.&lt;/p&gt;
&lt;p&gt;I always run my searches by selecting the "GPT-5 Thinking" model from the model picker - in my experience this leads to far more comprehensive (albeit much slower) results.&lt;/p&gt;
&lt;p&gt;Here are some examples from just the last couple of days. Every single one of them was run on my phone, usually while I was doing something else. Most of them were dictated using the iPhone voice keyboard, which I find faster than typing. Plus, it's fun to talk to my Research Goblin.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Sep/6/research-goblin/#bouncy-travelators"&gt;Bouncy travelators&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Sep/6/research-goblin/#identify-this-building"&gt;Identify this building&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Sep/6/research-goblin/#starbucks-uk-cake-pops"&gt;Starbucks UK cake pops&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Sep/6/research-goblin/#britannica-to-seed-wikipedia"&gt;Britannica to seed Wikipedia&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Sep/6/research-goblin/#official-name-for-the-university-of-cambridge"&gt;Official name for the University of Cambridge&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Sep/6/research-goblin/#history-of-the-caverns-in-exeter-quay"&gt;History of the caverns in Exeter quay&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Sep/6/research-goblin/#aldi-vs-lidl"&gt;Aldi vs Lidl&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Sep/6/research-goblin/#ai-labs-scanning-books-for-training-data"&gt;AI labs scanning books for training data&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Sep/6/research-goblin/#gpt-5-for-search-feels-competent"&gt;GPT-5 for search feels competent&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Sep/6/research-goblin/#tips-for-using-search-in-chatgpt"&gt;Tips for using search in ChatGPT&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id="bouncy-travelators"&gt;Bouncy travelators&lt;/h4&gt;
&lt;blockquote&gt;
&lt;p&gt;They used to be rubber bouncy travelators at Heathrow and they were really fun, have all been replaced by metal ones now and if so, when did that happen?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I was traveling through Heathrow airport pondering what had happened to the fun bouncy rubber travelators.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://chatgpt.com/share/68bc2d98-9aac-8006-98b9-1424d98290f8"&gt;Here's what I got&lt;/a&gt;. Research Goblin narrowed it down to some time between 2014-2018 but, more importantly, found me this &lt;a href="https://www.sfchronicle.com/totalsf/article/sfo-bouncy-moving-walkway-airport-19845449.php"&gt;delightful 2024 article&lt;/a&gt; by Peter Hartlaub in the San Francisco Chronicle with a history of the SFO bouncy walkways, now also sadly retired.&lt;/p&gt;
&lt;h4 id="identify-this-building"&gt;Identify this building&lt;/h4&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/reading-building.jpg" alt="not a great photo of a building with a distinctive shaped roof" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Identify this building in reading&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is a photo I snapped out of the window on the train. It &lt;a href="https://chatgpt.com/share/68bc2e21-1d24-8006-b083-00b3233e1c67"&gt;thought for 1m4s&lt;/a&gt; and correctly identified it as &lt;a href="https://en.wikipedia.org/wiki/The_Blade,_Reading"&gt;The Blade&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="starbucks-uk-cake-pops"&gt;Starbucks UK cake pops&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;Starbucks in the UK don't sell cake pops! Do a deep investigative dive&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The Starbucks in Exeter railway station didn't have cake pops, and the lady I asked didn't know what they were.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://chatgpt.com/share/68bc71b4-68f4-8006-b462-cf32f61e7ec3"&gt;Here's the result&lt;/a&gt;. It turns out Starbucks did launch cake pops in the UK &lt;a href="https://www.nationalworld.com/lifestyle/starbucks-cake-pops-launched-in-uk-on-new-autumn-menu-full-list-of-items-4284537"&gt;in September 2023&lt;/a&gt; but they aren't available at all outlets, in particular the licensed travel locations such as the one at Exeter St Davids station.&lt;/p&gt;
&lt;p&gt;I particularly enjoyed how it established definitive proof by consulting &lt;a href="https://www.starbucks.co.uk/sites/starbucks-uk-pwa/files/2024-11/HOL24_UK_AllergenBook_CORE_FOOD_v02.LR_.pdf"&gt;the nutrition and allergen guide PDF&lt;/a&gt; on starbucks.co.uk, which does indeed list both the Birthday Cake Pop (my favourite) and the Cookies and Cream one (apparently discontinued in the USA, at least &lt;a href="https://www.reddit.com/r/starbucks/comments/1lp5chq/just_learned_today_the_cookies_cream_cake_pop_has/"&gt;according to r/starbucks&lt;/a&gt;).&lt;/p&gt;
&lt;h4 id="britannica-to-seed-wikipedia"&gt;Britannica to seed Wikipedia&lt;/h4&gt;
&lt;blockquote&gt;
&lt;p&gt;Someone on hacker News said:&lt;/p&gt;
&lt;p&gt;&amp;gt; I was looking at another thread about how Wikipedia was the best thing on the internet. But they only got the head start by taking copy of Encyclopedia Britannica and everything else&lt;/p&gt;
&lt;p&gt;Find what they meant by that&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://chatgpt.com/share/68bc3062-5a68-8006-a12b-cf7196a130ae"&gt;The result&lt;/a&gt;. It turns out Wikipedia did seed itself with content from the out-of-copyright 1911 Encyclopædia Britannica... but that project took place in 2006, five years after Wikipedia first launched in 2001.&lt;/p&gt;
&lt;p&gt;I asked:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;What is the single best article I can link somebody to that explains the 1911 Britannica thing&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And it pointed me to &lt;a href="https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Encyclopaedia_Britannica"&gt;Wikipedia:WikiProject Encyclopaedia Britannica&lt;/a&gt; which includes a detailed explanation and a link to &lt;a href="https://en.m.wikipedia.org/wiki/Template:EB1911"&gt;the 13,000 pages&lt;/a&gt; still tagged with the template from that project. I posted what I found &lt;a href="https://news.ycombinator.com/item?id=45135061#45137109"&gt;in a comment&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Notably (for me anyway) I didn't feel the need to disclose my use of ChatGPT in finding that information - at this point that feels a little like disclosing that I ran a Google search.&lt;/p&gt;
&lt;h4 id="official-name-for-the-university-of-cambridge"&gt;Official name for the University of Cambridge&lt;/h4&gt;
&lt;blockquote&gt;
&lt;p&gt;What is the official legal name of the university of Cambridge?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://simonwillison.net/2025/Sep/6/iancal/"&gt;Here's the context for that one&lt;/a&gt;. It thought for 19 seconds - the thinking trace reveals it knew the answer but wanted to confirm it. &lt;a href="https://chatgpt.com/share/68bc723d-a344-8006-9514-fb3ea2c2b43b"&gt;It answered&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;“The Chancellor, Masters, and Scholars of the University of Cambridge.” &lt;a href="https://www.cam.ac.uk/about-the-university/how-the-university-and-colleges-work/the-university-as-a-charity"&gt;University of Cambridge&lt;/a&gt;, &lt;a href="https://www.cambridge.org/legal"&gt;Cambridge University Press &amp;amp; Assessment&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That first link gave me the citation I needed in order to be sure this was right.&lt;/p&gt;
&lt;p&gt;Since this is my shortest example, here's a screenshot of the expanded "Thought for 19s" panel. I &lt;em&gt;always&lt;/em&gt; expand the thoughts - seeing how it pulled together its answer is crucial for evaluating if the answer is likely to be useful or not.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/cambridge-name.jpg" alt="Screenshot of an AI assistant's thought process and research. Header shows &amp;quot;Thought for 19s&amp;quot; followed by bullet points: &amp;quot;The user asked for the official legal name of the University of Cambridge, and I think they mean the formal corporate name. It's probably 'The Chancellor, Masters, and Scholars of the University of Cambridge,' similar to Oxford's. I'm going to verify by checking the web for precise information.&amp;quot; &amp;quot;The official name of Cambridge might include the serial comma, like 'Masters, and Scholars.' There's a similar structure with Oxford, which uses a comma after 'Masters' too. The University is an exempt charity, so you can cross-check with legal documents like statutes or charity records. I'll double-check.&amp;quot; Shows search results for &amp;quot;official legal name of University of Cambridge&amp;quot; from en.wikipedia.org, www.cam.ac.uk, projects.propublica.org and &amp;quot;10 more&amp;quot; sources. Continues with &amp;quot;The University of Cambridge's official legal name, based on the University website and Cambridge University Press, is 'The Chancellor, Masters, and Scholars of the University of Cambridge,' with the serial comma after 'Masters.' It's clear from these authoritative sources, so I'll provide that straightforward answer, citing the University site.&amp;quot; &amp;quot;The official legal name of the University of Cambridge is 'The Chancellor, Masters, and Scholars of the University of Cambridge,' with the comma after Masters and the Oxford comma before 'and.' I'll answer concisely in one line and cite official sources to confirm. No need for extra UI elements here.&amp;quot; Shows &amp;quot;Done&amp;quot; status. 
Final answer displays: &amp;quot;The Chancellor, Masters, and Scholars of the University of Cambridge.&amp;quot; attributed to University of Cambridge, Cambridge University Press &amp;amp; Assessment, with &amp;quot;Sources&amp;quot; section at bottom." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4 id="history-of-the-caverns-in-exeter-quay"&gt;History of the caverns in Exeter quay&lt;/h4&gt;
&lt;blockquote&gt;
&lt;p&gt;Research On the waterfront restaurant in Exeter, is it dug into the cliffs somehow? History of the building, who built it, why and how&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;We were out to dinner &lt;a href="https://maps.app.goo.gl/xxvaPQiNWACtbq3H8"&gt;here&lt;/a&gt; and noticed that the interior of the restaurant appeared to be a space dug into the cliff, which piqued my interest.&lt;/p&gt;
&lt;p&gt;This was &lt;a href="https://chatgpt.com/share/68bc32fb-d52c-8006-9259-0b984dc832b2"&gt;the ChatGPT session&lt;/a&gt; that inspired the Research Goblin nickname. It just kept on digging!&lt;/p&gt;
&lt;p&gt;The first reply took 2m40s and confirmed that yes, these quay buildings were carved into the red sandstone cliff &lt;a href="https://www.exploredevon.info/activities/walk/exeter-quay/"&gt;in the 1820s-1830s&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;ChatGPT with GPT-5 really likes to suggest additional steps it can take. In this case:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If you’d like, I can dig up the exact Historic England entry that covers the “Southern Warehouse” address and overlay it on a map of the vaults.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I often say "yes" purely out of curiosity to see what it will do next, and the offer to "overlay it on a map" was irresistible, like how would it even do that?&lt;/p&gt;
&lt;p&gt;It did a &lt;em&gt;ton&lt;/em&gt; of extra searches, found latitude and longitude coordinates for the restaurant (from Wikimedia Commons) and the warehouse buildings (from National Heritage List for England via Wikipedia), showed me that data in a table and then used Python to render this image:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/bad-chart.png" alt="Scatter plot titled &amp;quot;On The Waterfront vs. Warehouse Vaults (Exeter Quay)&amp;quot; with scientific notation &amp;quot;+5.071e1&amp;quot; in top left. Y-axis shows &amp;quot;Latitude&amp;quot; ranging from 0.0065 to 0.0090. X-axis shows &amp;quot;Longitude&amp;quot; ranging from -3.5310 to -3.5280. Three orange X markers plotted: &amp;quot;Warehouse Vaults (6-11, The Quay) — HE 1223046&amp;quot; at top center, &amp;quot;On The Waterfront (4-9 Southern Warehouse)&amp;quot; at middle left, and &amp;quot;Warehouse Vaults (1-4, The Quay) — HE 1223045&amp;quot; at bottom right" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Yeah, so that one is a bit of a miss!&lt;/p&gt;
&lt;p&gt;But... the underlying research chain, as always, proved fascinating. I noticed that at one point it had searched for "Exeter Quay vaults map PDF", which inspired my next question:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Did you find any PDF diagrams, showing the tunnels into the cliff&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;2m20s later it had found &lt;a href="https://planningdocs.exeter.gov.uk/servlets/direct/KgQ8hlUut3Q2r5wVMItggobi9P/2/844706/1/1/AS_PDF_FILE"&gt;a planning document&lt;/a&gt; and &lt;a href="https://www.turner-locker.co.uk/wp-content/uploads/2018/08/Cellar-1-The-Quay-TO-LET-2018-1.pdf"&gt;a letting brochure&lt;/a&gt;, both PDFs, but noted that neither of them had the diagram I wanted. Evidently it actually "reads" the PDFs it finds as part of its activities, since it appears to have knowledge of the images contained within them as well as the text.&lt;/p&gt;
&lt;p&gt;It concluded with this intriguing note:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;There’s strong evidence that a detailed plan/section &lt;strong&gt;does&lt;/strong&gt; exist in the Exeter Archaeology report often cited by planners—&lt;strong&gt;“Archaeological, Historical and Conservation Study of the Exeter Canal Basin” (Exeter Archaeology Report 00.18, 2000)&lt;/strong&gt;—but it isn’t online; it’s just cited in planning documents. You’d likely need to request it via the Devon Heritage Centre/Historic Environment Record.&lt;/p&gt;
&lt;p&gt;If you want, I can keep hunting specialist archives for a digitized copy or a different report that includes the vault layouts.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Obviously I said "Yes do that" and 2m24s later it chucked out a whole host of extra links, an apology that it still hadn't found a detailed plan diagram and, under the section heading "What I recommend next (actionable + exact ask)":&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If you’re happy, I can draft an email you can paste to the right archivists. [...]&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Subject: Exeter Quayside cliff vaults – request for plan/section drawings
Hello,&lt;/p&gt;
&lt;p&gt;I’m researching the cliff-cut vaults behind The Quay (On The Waterfront / Kings Wharf).
Could you supply scans or copies of any &lt;strong&gt;plans/sections/diagrams&lt;/strong&gt; of these vaults—especially those in &lt;strong&gt;Exeter Archaeology Report 00.18 (2000)&lt;/strong&gt; (Archaeological, Historical and Conservation Study of the Exeter Canal Basin) and any &lt;strong&gt;EMAFU/Exeter Archaeology drawings&lt;/strong&gt; for “The Quay” showing the &lt;strong&gt;vault/tunnel layouts cut into the red sandstone cliff&lt;/strong&gt;?&lt;/p&gt;
&lt;p&gt;For reference: NHLE list entries &lt;strong&gt;1223045 (1–4 Warehouse Vaults), 1223046 (6–11 Warehouse Vaults)&lt;/strong&gt;, and &lt;strong&gt;1223114 (14–25 Vaults)&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Many thanks!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/blockquote&gt;
&lt;p&gt;It's a good thing my Research Goblin doesn't have the ability to send email (which we already know is &lt;a href="https://simonwillison.net/2025/May/31/snitchbench-with-llm/"&gt;very risky business&lt;/a&gt;) or at this point I wouldn't be surprised if it started autonomously bothering archivists on my behalf!&lt;/p&gt;
&lt;h4 id="aldi-vs-lidl"&gt;Aldi vs Lidl&lt;/h4&gt;
&lt;blockquote&gt;
&lt;p&gt;In UK supermarket terms compare lidl and aldi&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I saw an Aldi and a Lidl next to each other in Exeter and was curious as to how they compare. In this case my Research Goblin went &lt;em&gt;way&lt;/em&gt; deeper than I had intended, thinking for 1m25s about market share and price surveys and loyalty schemes and &lt;a href="https://chatgpt.com/share/68bc7913-7ed8-8006-bd3f-63e540470cd7"&gt;answering&lt;/a&gt; with ten paragraphs of detailed analysis.&lt;/p&gt;
&lt;p&gt;I replied:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Yeah, I just wanted the basic vibes like is one fancier than the other?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It thought for 6 seconds, didn't run any additional searches and replied:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Short version: neither is “fancy” — both are no-frills discounters — but &lt;strong&gt;Lidl often feels a touch more polished&lt;/strong&gt;, while &lt;strong&gt;Aldi feels a touch more bare-bones&lt;/strong&gt;. [...]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Since I was now enjoying having my Goblin express judgement over British supermarket chains I said:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;How do they rate in fancinews compared to all of the other major UK supermarket chains?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;(That was a typo for "fanciness" which it figured out.)&lt;/p&gt;
&lt;p&gt;It thought for another 1m1s, ran 7 searches, consulted 27 URLs and gave me back a ranking that looked about right to me.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/fancy-supermarkets.jpg" alt="ChatGPT. Prompt: How do they rate in fancinews compared to all of the other major UK supermarket chains? Thought for 1m 1s › Gotcha-assuming you mean &amp;quot;fanciness.&amp;quot; Here's the UK supermarket vibe-ladder (posh → no-frills): Super-posh: M&amp;amp;S Food and Waitrose - premium feel, strong store appearance/quality scores; Waitrose is literally described as &amp;quot;upmarket.&amp;quot; Which? The Guardian Upper-mainstream: Sainsbury's, Ocado (online, but it sells M&amp;amp;s Food since 2020, so the range skews premium). Which? The Guardian Mainstream: Tesco, Morrisons, Co-op - broad ranges, middle-of-the-road on experience/ appearance. Which? Value-forward: Asda, Iceland — price-led; Iceland's frozen focus, Asda's value reputation (not &amp;quot;fancy,&amp;quot; but popular). Time Out Worldwide Discounters (least fancy by design): Lidl (a touch more polished in-store) and Aldi (most stripped-back, no-frills). Both sit lower on &amp;quot;store appearance/ service&amp;quot; in consumer surveys, even while smashing value rankings. Which? +1 Net: on a &amp;quot;posh&amp;quot; scale, think M&amp;amp;S/Waitrose at the top, Sainsbury's/Tesco middle, Asda/Iceland lower, and Lidl/Aldi the least fancy— by intent, not as a knock on quality or value." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4 id="ai-labs-scanning-books-for-training-data"&gt;AI labs scanning books for training data&lt;/h4&gt;
&lt;blockquote&gt;
&lt;p&gt;Anthropic bought lots of physical books and cut them up and scan them for training data. Do any other AI labs do the same thing?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Relevant to &lt;a href="https://simonwillison.net/2025/Sep/6/anthropic-settlement/"&gt;today's big story&lt;/a&gt;. Research Goblin was &lt;a href="https://chatgpt.com/share/68bc771c-c188-8006-a8e5-4b1624f5bdf0"&gt;unable to find&lt;/a&gt; any news stories or other evidence that any labs other than Anthropic are engaged in large scale book scanning for training data. That's not to say it isn't happening, but it's happening very quietly if that's the case.&lt;/p&gt;
&lt;h4 id="gpt-5-for-search-feels-competent"&gt;GPT-5 for search feels competent&lt;/h4&gt;
&lt;p&gt;The word that best describes how I feel about GPT-5 search is that it feels &lt;strong&gt;competent&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;I've thrown all sorts of things at it over the last few weeks and it rarely disappoints me. It almost always does better than if I were to dedicate the same amount of time to manually searching myself, mainly because it's much faster at running searches and evaluating the results than I am.&lt;/p&gt;
&lt;p&gt;I particularly love that it works so well on mobile. I used to reserve my deeper research sessions for a laptop where I could open up dozens of tabs. I'll still do that for higher-stakes activities but I'm finding the scope of curiosity satisfaction I can perform on the go with just my phone has increased quite dramatically.&lt;/p&gt;
&lt;p&gt;I've mostly stopped using OpenAI's Deep Research feature, because ChatGPT search now gives me the results I'm interested in far more quickly for most queries.&lt;/p&gt;
&lt;p&gt;As a developer who builds software on LLMs I see ChatGPT search as the gold standard for what can be achieved using tool calling combined with chain-of-thought. Techniques like RAG are &lt;em&gt;massively&lt;/em&gt; more effective if you can reframe them as several levels of tool calling with a carefully selected set of powerful search tools.&lt;/p&gt;
&lt;p&gt;The way that search tool integrates with reasoning is key, because it allows GPT-5 to execute a search, reason about the results and then execute follow-up searches - all as part of that initial "thinking" process.&lt;/p&gt;
&lt;p&gt;Anthropic call this ability &lt;a href="https://docs.anthropic.com/en/docs/build-with-claude/extended-thinking#interleaved-thinking"&gt;interleaved thinking&lt;/a&gt; and it's also &lt;a href="https://platform.openai.com/docs/guides/reasoning#keeping-reasoning-items-in-context"&gt;supported by the OpenAI Responses API&lt;/a&gt;.&lt;/p&gt;
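That interleaved pattern (search, reason about the results, search again) is at heart a tool-calling loop. Here's a toy, model-agnostic sketch: the `model` and `search_tool` callables are hypothetical stand-ins, not any real API; in the real thing `model` would be GPT-5 behind the Responses API and `search_tool` its Bing-backed web search.

```python
# Toy sketch of the interleaved search loop described above.
# `model` takes the conversation so far and returns either
# {"type": "search", "query": ...} or {"type": "answer", "text": ...}.
def run_search_loop(model, search_tool, question, max_rounds=5):
    context = [{"role": "user", "content": question}]
    for _ in range(max_rounds):
        step = model(context)
        if step["type"] == "search":
            results = search_tool(step["query"])
            context.append({"role": "tool", "content": results})
        else:
            return step["text"]
    return None  # gave up after max_rounds


# A fake model, purely for illustration: search once, then answer.
def fake_model(context):
    if any(m["role"] == "tool" for m in context):
        return {"type": "answer", "text": "The Blade"}
    return {"type": "search", "query": "distinctive building Reading"}


result = run_search_loop(
    fake_model,
    lambda q: f"results for {q}",
    "Identify this building in Reading",
)
print(result)
```

The point of the structure is that the model decides when it has searched enough, rather than the harness imposing a fixed retrieve-then-answer step.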
&lt;h4 id="tips-for-using-search-in-chatgpt"&gt;Tips for using search in ChatGPT&lt;/h4&gt;
&lt;p&gt;As with all things AI, GPT-5 search rewards intuition gathered through experience. Any time a curious thought pops into my head I try to catch it and throw it at my Research Goblin. If it's something I'm certain it won't be able to handle then even better! I can learn from watching it fail.&lt;/p&gt;
&lt;p&gt;I've been trying out hints like "go deep" which seem to trigger a more thorough research job. I enjoy throwing those at shallow and unimportant questions like the UK Starbucks cake pops one just to see what happens!&lt;/p&gt;
&lt;p&gt;You can throw questions at it which have a single, unambiguous answer - but I think questions which are broader and don't have a "correct" answer can be a lot more fun. The UK supermarket rankings above are a great example of that.&lt;/p&gt;
&lt;p&gt;Since I love a questionable analogy for LLMs, Research Goblin is... well, it's a goblin. It's very industrious, not quite human and not entirely trustworthy. You have to be able to outwit it if you want to keep it gainfully employed.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/bing"&gt;bing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/definitions"&gt;definitions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/search"&gt;search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-tool-use"&gt;llm-tool-use&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deep-research"&gt;deep-research&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-search"&gt;ai-assisted-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-5"&gt;gpt-5&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="bing"/><category term="definitions"/><category term="search"/><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="llm-tool-use"/><category term="llm-reasoning"/><category term="deep-research"/><category term="ai-assisted-search"/><category term="gpt-5"/></entry><entry><title>Claude as a calculator</title><link href="https://simonwillison.net/2025/May/28/claude-calculator/#atom-tag" rel="alternate"/><published>2025-05-28T15:41:37+00:00</published><updated>2025-05-28T15:41:37+00:00</updated><id>https://simonwillison.net/2025/May/28/claude-calculator/#atom-tag</id><summary type="html">
    &lt;p&gt;Here's a quick demo of the kind of casual things I use LLMs for on a daily basis.&lt;/p&gt;
&lt;p&gt;I just found out that Perplexity offer their Deep Research feature via their API, through a model called &lt;a href="https://docs.perplexity.ai/models/models/sonar-deep-research"&gt;Sonar Deep Research&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Their documentation includes an example response, which included this usage data in the JSON:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;{"prompt_tokens": 19, "completion_tokens": 498, "total_tokens": 517, "citation_tokens": 10175, "num_search_queries": 48, "reasoning_tokens": 95305}&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;But how much would that actually cost?&lt;/p&gt;
&lt;p&gt;Their pricing page lists &lt;a href="https://docs.perplexity.ai/guides/pricing#deep-research-models"&gt;the price for that model&lt;/a&gt;. I snapped this screenshot of the prices:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Pricing table screenshot showing API costs: Input Tokens (Per Million) $2, Output Tokens (Per Million) $8, Price per 1000 Search Queries $5, Reasoning Tokens (Per Million) $3" src="https://static.simonwillison.net/static/2025/perplexity-pricing.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;I could break out a calculator at this point, but I'm not quite curious enough to go through the extra effort.&lt;/p&gt;
&lt;p&gt;So I pasted that screenshot into Claude along with the JSON and &lt;a href="https://claude.ai/share/31c23164-ced3-419c-ba68-55213caf71c3"&gt;prompted&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;{"prompt_tokens": 19, "completion_tokens": 498, "total_tokens": 517, "citation_tokens": 10175, "num_search_queries": 48, "reasoning_tokens": 95305}&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Calculate price, use javascript&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I wanted to make sure Claude would use its JavaScript analysis tool, since LLMs are notoriously unreliable at arithmetic.&lt;/p&gt;
&lt;p&gt;I watched Claude Sonnet 4 &lt;a href="https://gist.github.com/simonw/ad00a97d2e70918cfbbc94d429af46ca"&gt;write 61 lines of JavaScript&lt;/a&gt; - keeping an eye on it to check it didn't do anything obviously wrong. The code spat out this output:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;=== COST CALCULATIONS ===
Input tokens cost: 19 tokens × $2/million = $0.000038
Output tokens cost: 498 tokens × $8/million = $0.003984
Search queries cost: 48 queries × $5/1000 = $0.240000
Reasoning tokens cost: 95305 tokens × $3/million = $0.285915

=== COST SUMMARY ===
Input tokens: $0.000038
Output tokens: $0.003984
Search queries: $0.240000
Reasoning tokens: $0.285915
─────────────────────────
TOTAL COST: $0.529937
TOTAL COST: $0.5299 (rounded to 4 decimal places)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So that Deep Research API call would cost 53 cents! Curiosity satisfied in less than a minute.&lt;/p&gt;
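If you'd rather skip the LLM entirely, the same arithmetic fits in a few lines of JavaScript. This is a sketch: the helper name and the price constants are mine, transcribed from that pricing screenshot, and the usage fields come from Perplexity's example response.

```javascript
// Prices transcribed from Perplexity's pricing screenshot (assumptions, not an API).
const PRICES = {
  inputPerMillion: 2,      // $ per 1M prompt tokens
  outputPerMillion: 8,     // $ per 1M completion tokens
  perThousandSearches: 5,  // $ per 1,000 search queries
  reasoningPerMillion: 3,  // $ per 1M reasoning tokens
};

// Sum the four billed components from a Sonar Deep Research usage object.
function sonarDeepResearchCost(usage, prices = PRICES) {
  return (
    (usage.prompt_tokens / 1e6) * prices.inputPerMillion +
    (usage.completion_tokens / 1e6) * prices.outputPerMillion +
    (usage.num_search_queries / 1000) * prices.perThousandSearches +
    (usage.reasoning_tokens / 1e6) * prices.reasoningPerMillion
  );
}

// The usage data from Perplexity's documented example response:
const usage = {
  prompt_tokens: 19,
  completion_tokens: 498,
  total_tokens: 517,
  citation_tokens: 10175,
  num_search_queries: 48,
  reasoning_tokens: 95305,
};

console.log(sonarDeepResearchCost(usage).toFixed(6)); // "0.529937"
```

Note that `citation_tokens` and `total_tokens` don't appear in the pricing table, so they aren't billed separately here.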

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/perplexity"&gt;perplexity&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-tool-use"&gt;llm-tool-use&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deep-research"&gt;deep-research&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="claude"/><category term="perplexity"/><category term="llm-tool-use"/><category term="llm-pricing"/><category term="deep-research"/></entry><entry><title>Claude feature drop</title><link href="https://simonwillison.net/2025/May/2/claude-search/#atom-tag" rel="alternate"/><published>2025-05-02T18:45:57+00:00</published><updated>2025-05-02T18:45:57+00:00</updated><id>https://simonwillison.net/2025/May/2/claude-search/#atom-tag</id><summary type="html">
    &lt;p&gt;It's not in their &lt;a href="https://docs.anthropic.com/en/release-notes/claude-apps"&gt;release notes&lt;/a&gt; yet but Anthropic pushed some big new features today. &lt;a href="https://twitter.com/alexalbert__/status/1918349277962879218"&gt;Alex Albert&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We've improved web search and rolled it out worldwide to all paid plans. Web search now combines light Research functionality, allowing Claude to automatically adjust search depth based on your question.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Anthropic announced &lt;a href="https://www.anthropic.com/news/research"&gt;Claude Research&lt;/a&gt; a few weeks ago as a product that can combine web search with search against your private Google Workspace - I'm not clear on how much of that product we get in this "light Research" functionality.&lt;/p&gt;
&lt;p&gt;I'm most excited about &lt;a href="https://twitter.com/AnthropicAI/status/1918346819169374470"&gt;this detail&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;You can also drop a web link in any chat and Claude will fetch the content for you.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In my experiments so far the user-agent it uses is &lt;code&gt;Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Claude-User/1.0; +Claude-User@anthropic.com)&lt;/code&gt;. It appears to obey &lt;code&gt;robots.txt&lt;/code&gt;.&lt;/p&gt;
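Out of similar curiosity, here's a deliberately naive sketch of how you might check whether a path is blocked for that crawler token. This is not a spec-complete robots.txt parser (it ignores Allow rules and wildcards), and the function name and sample file are mine.

```javascript
// Naive robots.txt check: does any Disallow rule in a group that applies to
// the given agent token (or to "*") match the path as a prefix?
function disallowedFor(robotsTxt, agentToken, path) {
  let applies = false;
  let disallowed = false;
  for (const raw of robotsTxt.split("\n")) {
    const line = raw.split("#")[0].trim();       // strip comments
    const colon = line.indexOf(":");
    if (colon === -1) continue;
    const field = line.slice(0, colon).trim().toLowerCase();
    const value = line.slice(colon + 1).trim();
    if (field === "user-agent") {
      // A group applies if it names "*" or a token contained in our UA token.
      applies = value === "*" ||
        agentToken.toLowerCase().includes(value.toLowerCase());
    } else if (field === "disallow" && applies && value && path.startsWith(value)) {
      disallowed = true;
    }
  }
  return disallowed;
}

const sample = "User-agent: Claude-User\nDisallow: /private/\n\nUser-agent: *\nDisallow:\n";
console.log(disallowedFor(sample, "Claude-User/1.0", "/private/page")); // true
console.log(disallowedFor(sample, "Claude-User/1.0", "/public"));       // false
```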

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/user-agents"&gt;user-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/alex-albert"&gt;alex-albert&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-tool-use"&gt;llm-tool-use&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deep-research"&gt;deep-research&lt;/a&gt;&lt;/p&gt;



</summary><category term="user-agents"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="claude"/><category term="alex-albert"/><category term="llm-tool-use"/><category term="deep-research"/></entry><entry><title>AI assisted search-based research actually works now</title><link href="https://simonwillison.net/2025/Apr/21/ai-assisted-search/#atom-tag" rel="alternate"/><published>2025-04-21T12:57:58+00:00</published><updated>2025-04-21T12:57:58+00:00</updated><id>https://simonwillison.net/2025/Apr/21/ai-assisted-search/#atom-tag</id><summary type="html">
    &lt;p&gt;For the past two and a half years the feature I've most wanted from LLMs is the ability to take on search-based research tasks on my behalf. We saw the first glimpses of this back in early 2023, with Perplexity (first launched &lt;a href="https://en.wikipedia.org/wiki/Perplexity_AI"&gt;December 2022&lt;/a&gt;, first prompt leak &lt;a href="https://simonwillison.net/2023/Jan/22/perplexityai/"&gt;in January 2023&lt;/a&gt;) and then the GPT-4 powered Microsoft Bing (which launched/cratered spectacularly &lt;a href="https://simonwillison.net/2023/Feb/15/bing/"&gt;in February 2023&lt;/a&gt;). Since then a whole bunch of people have taken a swing at this problem, most notably &lt;a href="https://gemini.google.com/"&gt;Google Gemini&lt;/a&gt; and &lt;a href="https://openai.com/index/introducing-chatgpt-search/"&gt;ChatGPT Search&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Those 2023-era versions were promising but very disappointing. They had a strong tendency to hallucinate details that weren't present in the search results, to the point that you couldn't trust anything they told you.&lt;/p&gt;
&lt;p&gt;In this first half of 2025 I think these systems have finally crossed the line into being genuinely useful.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Apr/21/ai-assisted-search/#deep-research-from-three-different-vendors"&gt;Deep Research, from three different vendors&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Apr/21/ai-assisted-search/#o3-and-o4-mini-are-really-good-at-search"&gt;o3 and o4-mini are really good at search&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Apr/21/ai-assisted-search/#google-and-anthropic-need-to-catch-up"&gt;Google and Anthropic need to catch up&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Apr/21/ai-assisted-search/#lazily-porting-code-to-a-new-library-version-via-search"&gt;Lazily porting code to a new library version via search&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Apr/21/ai-assisted-search/#how-does-the-economic-model-for-the-web-work-now-"&gt;How does the economic model for the Web work now?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id="deep-research-from-three-different-vendors"&gt;Deep Research, from three different vendors&lt;/h4&gt;
&lt;p&gt;First came the &lt;strong&gt;Deep Research&lt;/strong&gt; implementations - &lt;a href="https://blog.google/products/gemini/google-gemini-deep-research/"&gt;Google Gemini&lt;/a&gt; and &lt;a href="https://openai.com/index/introducing-deep-research/"&gt;then OpenAI&lt;/a&gt; and &lt;a href="https://www.perplexity.ai/hub/blog/introducing-perplexity-deep-research"&gt;then Perplexity&lt;/a&gt; launched products with that name and they were all impressive: they could take a query, then churn away for several minutes assembling a lengthy report with dozens (sometimes hundreds) of citations. Gemini's version had a &lt;em&gt;huge&lt;/em&gt; upgrade a few weeks ago when they &lt;a href="https://blog.google/products/gemini/deep-research-gemini-2-5-pro-experimental/"&gt;switched it to using Gemini 2.5 Pro&lt;/a&gt;, and I've had some outstanding results from it since then.&lt;/p&gt;
&lt;p&gt;Waiting a few minutes for a 10+ page report isn't my ideal workflow for this kind of tool. I'm impatient, I want answers faster than that!&lt;/p&gt;
&lt;h4 id="o3-and-o4-mini-are-really-good-at-search"&gt;o3 and o4-mini are really good at search&lt;/h4&gt;
&lt;p&gt;Last week, OpenAI released &lt;a href="https://openai.com/index/introducing-o3-and-o4-mini/"&gt;search-enabled o3 and o4-mini&lt;/a&gt; through &lt;a href="https://chatgpt.com/"&gt;ChatGPT&lt;/a&gt;. On the surface these look like the same idea as we've seen already: LLMs that have the option to call a search tool as part of replying to a prompt.&lt;/p&gt;
&lt;p&gt;But there's one &lt;em&gt;very significant&lt;/em&gt; difference: these models can run searches as part of the chain-of-thought reasoning process they use before producing their final answer.&lt;/p&gt;
&lt;p&gt;This turns out to be a &lt;em&gt;huge&lt;/em&gt; deal. I've been throwing all kinds of questions at ChatGPT (in o3 or o4-mini mode) and getting back genuinely useful answers grounded in search results. I haven't spotted a hallucination yet, and unlike prior systems I rarely find myself shouting "no, don't search for &lt;em&gt;that&lt;/em&gt;!" at the screen when I see what they're doing.&lt;/p&gt;
&lt;p&gt;Here are four recent example transcripts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://chatgpt.com/share/6805758b-acc8-8006-88a3-bdd78866ee15"&gt;Get me specs including VRAM for RTX 5090 and RTX PRO 6000 - plus release dates and prices&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://chatgpt.com/share/68057580-02c8-8006-8ff4-7ae3ce816342"&gt;Find me a website tool that lets me paste a URL in and it gives me a word count and an estimated reading time&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://chatgpt.com/share/6805756c-4268-8006-82ab-014f7c304622"&gt;Figure out what search engine ChatGPT is using for o3 and o4-mini&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://chatgpt.com/share/6805755b-4aa0-8006-ab0b-42ae569da6a8"&gt;Look up Cloudflare r2 pricing and use Python to figure out how much this (screenshot of dashboard) costs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Talking to o3 feels like talking to a Deep Research tool in real-time, without having to wait for several minutes for it to produce an overly-verbose report.&lt;/p&gt;
&lt;p&gt;My hunch is that doing this well requires a very strong reasoning model. Evaluating search results is hard, due to the need to wade through huge amounts of spam and deceptive information. The disappointing results from previous implementations usually came down to the Web being full of junk.&lt;/p&gt;
&lt;p&gt;Maybe o3, o4-mini and Gemini 2.5 Pro are the first models to cross the gullibility-resistance threshold to the point that they can do this effectively?&lt;/p&gt;
&lt;h4 id="google-and-anthropic-need-to-catch-up"&gt;Google and Anthropic need to catch up&lt;/h4&gt;
&lt;p&gt;The user-facing &lt;a href="https://gemini.google.com/"&gt;Google Gemini app&lt;/a&gt; can search too, but it doesn't show me what it's searching for. As a result, I just don't trust it. Compare these examples from o3 and Gemini for the prompt "Latest post by Simon Willison" - o3 is much more transparent:&lt;/p&gt;

    &lt;div style="display: flex; width: 100%; gap: 10px; align-items: flex-start; margin-top: 1em; margin-bottom: 1em"&gt;
        &lt;div style="flex: 1;"&gt;
            &lt;img src="https://static.simonwillison.net/static/2025/search-o3.jpg" alt="o3 ChatGPT screenshot

Prompt: Latest post by Simon Willison

Thought for 25 seconds ›
2 searches • 3 sources

Latest post
• Title: &amp;quot;Working Through the Fear of Being Seen&amp;quot;
(via Ashley Willis)
• Published: April 22 2025 at 6: 40 AM

An overlay shows the details of that thinking session:

Thought for 25 seconds
• The user's request is pretty clear: they want to know the latest post by Simon Willison. I think the user may actually be Simon Willison themselves, asking for their own blog, which is hosted on SimonWillison.net. l'll need to search the web for Simon's latest blog post (probably from April 2025). Once I find it, l'll include the title, publication date, and a brief summary.
@ Searching the web
• Simon Willison blog April 2025
• Simon Willison blog 2025
Archive for Tuesday, 22nd April 2025 - Simon Willison's W...
* Working Through the Fear of Being Seen
* Simon Willison's Weblog" style="width: 100%; height: auto; display: block;" /&gt;
        &lt;/div&gt;
        &lt;div style="flex: 1;"&gt;
            &lt;img src="https://static.simonwillison.net/static/2025/search-gemini.jpg" alt="Gemini Advanced 2.0 flash screenshot. Prompt is Latest post by Simon Willison.

Google Search
Try again without apps
Simon Willison's latest blog post is titled &amp;quot;Here's how I use LLMs to help me write code&amp;quot; and was published on March 11th, 2025. In this post, he discusses his experiences and strategies for effectively using Large Language Models (LLMs) in software development.
However, according to his Mastodon profile, his most recent activity includes:
• A boost of a post about 11m plugins on April 18th, 2025.
• A post about the Gemini 2.5 Flash model on April 17th, 2025.
• A post about OpenAl's new 03 and o4-mini models on April 16th, 2025.
• A thread discussing the CaMeL approach to mitigating prompt injection attacks on April 11th, 2025.
• An announcement of two new 11m plugins on April 10th, 2025.
It seems his blog might not always reflect his absolute latest thoughts and shorter updates, which he often shares on platforms like Mastodon." style="width: 100%; height: auto; display: block;" /&gt;
        &lt;/div&gt;
    &lt;/div&gt;

&lt;p&gt;This is a big missed opportunity since Google presumably have by far the best search index, so they really should be able to build a great version of this. And Google's AI assisted search on their regular search interface hallucinates &lt;em&gt;wildly&lt;/em&gt; to the point that it's actively damaging their brand. I just checked and Google is still showing slop &lt;a href="https://simonwillison.net/2024/Dec/29/encanto-2/"&gt;for Encanto 2&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;Claude also finally &lt;a href="https://simonwillison.net/2025/Mar/20/"&gt;added web search&lt;/a&gt; a month ago but it doesn't feel nearly as good. It's &lt;a href="https://simonwillison.net/2025/Mar/21/anthropic-use-brave/"&gt;using the Brave search index&lt;/a&gt; which I don't think is as comprehensive as Bing or Gemini, and searches don't happen as part of that powerful reasoning flow.&lt;/p&gt;
&lt;h4 id="lazily-porting-code-to-a-new-library-version-via-search"&gt;Lazily porting code to a new library version via search&lt;/h4&gt;
&lt;p&gt;The truly magic moment for me came &lt;a href="https://simonwillison.net/2025/Apr/18/gemini-image-segmentation/#gemini-2-5-flash-non-thinking"&gt;a few days ago&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://simonwillison.net/2025/Apr/18/gemini-image-segmentation/"&gt;Gemini image segmentation tool&lt;/a&gt; was using the &lt;a href="https://www.npmjs.com/package/@google/generative-ai"&gt;@google/generative-ai&lt;/a&gt; library which has been &lt;a href="https://github.com/google-gemini/deprecated-generative-ai-js"&gt;loudly deprecated&lt;/a&gt; in favor of the still in preview &lt;a href="https://github.com/googleapis/js-genai"&gt;Google Gen AI SDK @google/genai&lt;/a&gt; library.&lt;/p&gt;
&lt;p&gt;I did &lt;strong&gt;not&lt;/strong&gt; feel like doing the work to upgrade. On a whim, I pasted &lt;a href="https://github.com/simonw/tools/blob/aa310a4f9cde07d5e8e87572f70fceca532884dd/gemini-mask.html"&gt;my full HTML code&lt;/a&gt; (with inline JavaScript) into ChatGPT o4-mini-high and prompted:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;This code needs to be upgraded to the new recommended JavaScript library from Google. Figure out what that is and then look up enough documentation to port this code to it.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;(I couldn't even be bothered to look up the name of the new library myself!)&lt;/p&gt;
&lt;p&gt;... it did &lt;a href="https://chatgpt.com/share/68028f7b-11ac-8006-8150-00c4205a2507"&gt;exactly that&lt;/a&gt;. It churned away thinking for 21 seconds, ran a bunch of searches, figured out the new library (which existed &lt;em&gt;way&lt;/em&gt; outside of its training cut-off date), found the &lt;a href="https://ai.google.dev/gemini-api/docs/migrate"&gt;upgrade instructions&lt;/a&gt; and produced &lt;a href="https://github.com/simonw/tools/commit/d199de213dc3f866a3b8efbcdd2dde34204dc409"&gt;a new version&lt;/a&gt; of my code that worked perfectly.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/o4-thinking.jpg" alt="Screenshot of AI assistant response about upgrading Google Gemini API code. Shows &amp;quot;Thought for 21 seconds&amp;quot; followed by web search results for &amp;quot;Google Gemini API JavaScript library recommended new library&amp;quot; with options including Google AI for Developers, GitHub, and Google for Developers. The assistant explains updating from GoogleGenerativeAI library to @google-ai/generative, with code samples showing: import { GoogleGenAI } from 'https://cdn.jsdelivr.net/npm/@google/genai@latest'; and const ai = new GoogleGenAI({ apiKey: getApiKey() });" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I ran this prompt on my phone out of idle curiosity while I was doing something else. I was &lt;em&gt;extremely&lt;/em&gt; impressed and surprised when it did exactly what I needed.&lt;/p&gt;
&lt;h4 id="how-does-the-economic-model-for-the-web-work-now-"&gt;How does the economic model for the Web work now?&lt;/h4&gt;
&lt;p&gt;I'm writing about this today because it's been one of my "can LLMs do this reliably yet?" questions for over two years now. I think they've just crossed the line into being useful as research assistants, without feeling the need to check &lt;em&gt;everything&lt;/em&gt; they say with a fine-tooth comb.&lt;/p&gt;
&lt;p&gt;I still don't trust them not to make mistakes, but I think I might trust them enough that I'll skip my own fact-checking for lower-stakes tasks.&lt;/p&gt;
&lt;p&gt;This also means that a bunch of the potential dark futures we've been predicting for the last couple of years are a whole lot more likely to become true. Why visit websites if you can get your answers directly from the chatbot instead?&lt;/p&gt;
&lt;p&gt;The lawsuits over this &lt;a href="https://simonwillison.net/2023/Dec/31/ai-in-2023/#ethics-diabolically-complex"&gt;started flying&lt;/a&gt; back when the LLMs were still mostly rubbish. The stakes are a lot higher now that they're actually good at it!&lt;/p&gt;
&lt;p&gt;I can feel my usage of Google search taking a nosedive already. I expect a bumpy ride as a new economic model for the Web lurches into view.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/search"&gt;search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/search-engines"&gt;search-engines&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatgpt"&gt;chatgpt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/perplexity"&gt;perplexity&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-tool-use"&gt;llm-tool-use&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/o3"&gt;o3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deep-research"&gt;deep-research&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-search"&gt;ai-assisted-search&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="google"/><category term="search"/><category term="search-engines"/><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="chatgpt"/><category term="llms"/><category term="anthropic"/><category term="perplexity"/><category term="gemini"/><category term="llm-tool-use"/><category term="llm-reasoning"/><category term="o3"/><category term="deep-research"/><category term="ai-ethics"/><category term="ai-assisted-search"/></entry><entry><title>Quoting Thane Ruthenis</title><link href="https://simonwillison.net/2025/Mar/10/thane-ruthenis/#atom-tag" rel="alternate"/><published>2025-03-10T01:50:31+00:00</published><updated>2025-03-10T01:50:31+00:00</updated><id>https://simonwillison.net/2025/Mar/10/thane-ruthenis/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://www.lesswrong.com/posts/oKAFFvaouKKEhbBPm/a-bear-case-my-predictions-regarding-ai-progress"&gt;&lt;p&gt;It seems to me that "vibe checks" for how smart a model feels are easily gameable by making it have a better personality.&lt;/p&gt;
&lt;p&gt;My guess is that it's most of the reason Sonnet 3.5.1 was so beloved. Its personality was made much more &lt;em&gt;appealing&lt;/em&gt;, compared to e. g. OpenAI's corporate drones. [...]&lt;/p&gt;
&lt;p&gt;Deep Research was this for me, at first. Some of its summaries were just &lt;em&gt;pleasant&lt;/em&gt; to read, they felt so information-dense and intelligent! Not like typical AI slop at all! But then it turned out most of it was just AI slop underneath anyway, and now my slop-recognition function has adjusted and the effect is gone.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://www.lesswrong.com/posts/oKAFFvaouKKEhbBPm/a-bear-case-my-predictions-regarding-ai-progress"&gt;Thane Ruthenis&lt;/a&gt;, A Bear Case: My Predictions Regarding AI Progress&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/slop"&gt;slop&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deep-research"&gt;deep-research&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-personality"&gt;ai-personality&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="slop"/><category term="deep-research"/><category term="ai-personality"/></entry><entry><title>Deep research System Card</title><link href="https://simonwillison.net/2025/Feb/25/deep-research-system-card/#atom-tag" rel="alternate"/><published>2025-02-25T20:36:08+00:00</published><updated>2025-02-25T20:36:08+00:00</updated><id>https://simonwillison.net/2025/Feb/25/deep-research-system-card/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://openai.com/index/deep-research-system-card/"&gt;Deep research System Card&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;OpenAI are rolling out their Deep research "agentic" research tool to their $20/month ChatGPT Plus users today, who get 10 queries a month. $200/month ChatGPT Pro gets 120 uses.&lt;/p&gt;
&lt;p&gt;Deep research is the best version of this pattern I've tried so far - it can consult dozens of different online sources and produce a very convincing report-style document based on its findings. I've had some great results.&lt;/p&gt;
&lt;p&gt;The problem with this kind of tool is that while it's possible to catch most hallucinations by checking the references it provides, the one thing that can't be easily spotted is misinformation by omission: it's very possible for the tool to miss out on crucial details because they didn't show up in the searches that it conducted.&lt;/p&gt;
&lt;p&gt;Hallucinations are also still possible though. From the system card:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The model may generate factually incorrect information, which can lead to various harmful outcomes depending on its usage. Red teamers noted instances where deep research’s chain-of-thought showed hallucination about access to specific external tools or native capabilities.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;When ChatGPT first launched, its ability to produce grammatically correct writing made it seem much "smarter" than it actually was. Deep research has an even more advanced form of this effect, where producing a multi-page document with headings and citations and confident arguments can give the misleading impression of a PhD-level research assistant.&lt;/p&gt;
&lt;p&gt;It's absolutely worth spending time exploring, but be careful not to fall for its surface-level charm. Benedict Evans wrote more about this in &lt;a href="https://www.ben-evans.com/benedictevans/2025/2/17/the-deep-research-problem"&gt;The Deep Research problem&lt;/a&gt; where he showed some great examples of its convincing mistakes in action.&lt;/p&gt;
&lt;p&gt;The deep research system card includes this slightly unsettling note in the section about chemical and biological threats:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Several of our biology evaluations indicate our models are on the cusp of being able to meaningfully help novices create known biological threats, which would cross our high risk threshold. We expect current trends of rapidly increasing capability to continue, and for models to cross this threshold in the near future. In preparation, we are intensifying our investments in safeguards.&lt;/p&gt;
&lt;/blockquote&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/air"&gt;air&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ethics"&gt;ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatgpt"&gt;chatgpt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-agents"&gt;ai-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deep-research"&gt;deep-research&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/hallucinations"&gt;hallucinations&lt;/a&gt;&lt;/p&gt;



</summary><category term="air"/><category term="ethics"/><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="chatgpt"/><category term="llms"/><category term="ai-agents"/><category term="deep-research"/><category term="ai-ethics"/><category term="hallucinations"/></entry><entry><title>Introducing Perplexity Deep Research</title><link href="https://simonwillison.net/2025/Feb/16/introducing-perplexity-deep-research/#atom-tag" rel="alternate"/><published>2025-02-16T00:46:38+00:00</published><updated>2025-02-16T00:46:38+00:00</updated><id>https://simonwillison.net/2025/Feb/16/introducing-perplexity-deep-research/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.perplexity.ai/hub/blog/introducing-perplexity-deep-research"&gt;Introducing Perplexity Deep Research&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Perplexity became the &lt;em&gt;third&lt;/em&gt; company to release a product with "Deep Research" in the name.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Google's Gemini Deep Research: &lt;a href="https://blog.google/products/gemini/google-gemini-deep-research/"&gt;Try Deep Research and our new experimental model in Gemini, your AI assistant&lt;/a&gt; on December 11th 2024&lt;/li&gt;
&lt;li&gt;OpenAI's ChatGPT Deep Research: &lt;a href="https://openai.com/index/introducing-deep-research/"&gt;Introducing deep research&lt;/a&gt; - February 2nd 2025&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And now &lt;a href="https://www.perplexity.ai/hub/blog/introducing-perplexity-deep-research"&gt;Perplexity Deep Research&lt;/a&gt;, announced on February 14th.&lt;/p&gt;
&lt;p&gt;The three products all do effectively the same thing: you give them a task, they go out and accumulate information from a large number of different websites and then use long context models and prompting to turn the result into a report. All three of them take several minutes to return a result.&lt;/p&gt;
&lt;p&gt;In my &lt;a href="https://simonwillison.net/2025/Jan/10/ai-predictions/#one-year-code-research-assistants"&gt;AI/LLM predictions post on January 10th&lt;/a&gt; I expressed skepticism at the idea of "agents", with the exception of coding and research specialists. I said:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It makes intuitive sense to me that this kind of research assistant can be built on our current generation of LLMs. They’re competent at driving tools, they’re capable of coming up with a relatively obvious research plan (look for newspaper articles and research papers) and they can synthesize sensible answers given the right collection of context gathered through search.&lt;/p&gt;
&lt;p&gt;Google are particularly well suited to solving this problem: they have the world’s largest search index and their Gemini model has a 2 million token context. I expect Deep Research to get a whole lot better, and I expect it to attract plenty of competition.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Just over a month later I'm feeling pretty good about that prediction!&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/search-engines"&gt;search-engines&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatgpt"&gt;chatgpt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/perplexity"&gt;perplexity&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-agents"&gt;ai-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deep-research"&gt;deep-research&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-search"&gt;ai-assisted-search&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="search-engines"/><category term="ai"/><category term="generative-ai"/><category term="chatgpt"/><category term="llms"/><category term="perplexity"/><category term="gemini"/><category term="ai-agents"/><category term="deep-research"/><category term="ai-assisted-search"/></entry><entry><title>My AI/LLM predictions for the next 1, 3 and 6 years, for Oxide and Friends</title><link href="https://simonwillison.net/2025/Jan/10/ai-predictions/#atom-tag" rel="alternate"/><published>2025-01-10T01:43:16+00:00</published><updated>2025-01-10T01:43:16+00:00</updated><id>https://simonwillison.net/2025/Jan/10/ai-predictions/#atom-tag</id><summary type="html">
    &lt;p&gt;The &lt;a href="https://oxide-and-friends.transistor.fm/"&gt;Oxide and Friends&lt;/a&gt; podcast has an annual tradition of asking guests to share their predictions for the next 1, 3 and 6 years. Here's &lt;a href="https://github.com/oxidecomputer/oxide-and-friends/blob/master/2022_01_03.md"&gt;2022&lt;/a&gt;, &lt;a href="https://github.com/oxidecomputer/oxide-and-friends/blob/master/2023_01_09.md"&gt;2023&lt;/a&gt; and &lt;a href="https://github.com/oxidecomputer/oxide-and-friends/blob/master/2024_01_08.md"&gt;2024&lt;/a&gt;. This year they invited me to participate. I've never been brave enough to share &lt;em&gt;any&lt;/em&gt; public predictions before, so this was a great opportunity to get outside my comfort zone!&lt;/p&gt;
&lt;p&gt;We recorded the episode live using Discord on Monday. It's now available &lt;a href="https://www.youtube.com/watch?v=-pk6VokHpGY"&gt;on YouTube&lt;/a&gt; and &lt;a href="https://oxide-and-friends.transistor.fm/"&gt;in podcast form&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;lite-youtube videoid="-pk6VokHpGY"
  title="Oxide and Friends 1/6/2025 -- Predictions 2025"
  playlabel="Play: Oxide and Friends 1/6/2025 -- Predictions 2025"
&gt; &lt;/lite-youtube&gt;&lt;/p&gt;

&lt;p&gt;Here are my predictions, written up here in a little more detail than the stream of consciousness I shared on the podcast.&lt;/p&gt;
&lt;p&gt;I should emphasize that I find the very idea of trying to predict AI/LLMs over a multi-year period to be completely absurd! I can't predict what's going to happen a week from now; six years is a different universe.&lt;/p&gt;
&lt;p&gt;With that disclaimer out of the way, here's an expanded version of what I said.&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Jan/10/ai-predictions/#one-year-agents-fail-to-happen-again"&gt;One year: Agents fail to happen, again&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Jan/10/ai-predictions/#one-year-code-research-assistants"&gt;One year: ... except for code and research assistants&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Jan/10/ai-predictions/#three-years-someone-wins-a-pulitzer-for-ai-assisted-investigative-reporting"&gt;Three years: Someone wins a Pulitzer for AI-assisted investigative reporting&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Jan/10/ai-predictions/#three-years-part-two-privacy-laws-with-teeth"&gt;Three years part two: privacy laws with teeth&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Jan/10/ai-predictions/#six-years-utopian-amazing-art"&gt;Six years utopian: amazing art&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Jan/10/ai-predictions/#six-years-dystopian-agi-asi-causes-mass-civil-unrest"&gt;Six years dystopian: AGI/ASI causes mass civil unrest&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Jan/10/ai-predictions/#my-total-lack-of-conviction"&gt;My total lack of conviction&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="one-year-agents-fail-to-happen-again"&gt;One year: Agents fail to happen, again&lt;/h4&gt;
&lt;p&gt;I wrote about how &lt;a href="https://simonwillison.net/2024/Dec/31/llms-in-2024/#-agents-still-haven-t-really-happened-yet"&gt;“Agents” still haven’t really happened yet&lt;/a&gt; in my review of Large Language Model developments in 2024.&lt;/p&gt;
&lt;p&gt;I think we are going to see a &lt;em&gt;lot&lt;/em&gt; more froth about agents in 2025, but I expect the results will be a great disappointment to most of the people who are excited about this term. I expect a lot of money will be lost chasing after several different poorly defined dreams that share that name.&lt;/p&gt;
&lt;p&gt;What are agents anyway? Ask a dozen people and you'll get a dozen slightly different answers - I collected and &lt;a href="https://gist.github.com/simonw/beaa5f90133b30724c5cc1c4008d0654"&gt;then AI-summarized a bunch of those here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For the sake of argument, let's pick a definition that I can predict won't come to fruition: the idea of an AI assistant that can go out into the world and semi-autonomously act on your behalf. I think of this as the &lt;strong&gt;travel agent&lt;/strong&gt; definition of agents, because for some reason everyone always jumps straight to flight and hotel booking and itinerary planning when they describe this particular dream.&lt;/p&gt;
&lt;p&gt;Having the current generation of LLMs make material decisions on your behalf - like what to spend money on - is a &lt;em&gt;really bad idea&lt;/em&gt;. They're too unreliable, but more importantly they are too &lt;strong&gt;gullible&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;If you're going to arm your AI assistant with a credit card and set it loose on the world, you need to be confident that it's not going to hit "buy" on the first website that claims to offer the best bargains!&lt;/p&gt;
&lt;p&gt;I'm confident that reliability is the reason we haven't seen LLM-powered agents that have taken off yet, despite the idea attracting a huge amount of buzz since right after ChatGPT first came out.&lt;/p&gt;
&lt;p&gt;I would be very surprised if any of the models released over the next twelve months had enough of a reliability improvement to make this work. Solving gullibility is an astonishingly difficult problem.&lt;/p&gt;
&lt;p&gt;(I had &lt;a href="https://www.youtube.com/watch?v=-pk6VokHpGY&amp;amp;t=1206s"&gt;a particularly spicy rant&lt;/a&gt; about how stupid the idea of sending a "digital twin" to a meeting on your behalf is.)&lt;/p&gt;
&lt;h4 id="one-year-code-research-assistants"&gt;One year: ... except for code and research assistants&lt;/h4&gt;
&lt;p&gt;There are two categories of "agent" that I do believe in, because they're proven to work already.&lt;/p&gt;
&lt;p&gt;The first is &lt;strong&gt;coding assistants&lt;/strong&gt; - where an LLM writes, executes and then refines computer code in a loop.&lt;/p&gt;
&lt;p&gt;I first saw this pattern demonstrated by OpenAI with their &lt;a href="https://simonwillison.net/tags/code-interpreter/"&gt;Code Interpreter&lt;/a&gt; feature for ChatGPT, released back in March/April of 2023.&lt;/p&gt;
&lt;p&gt;You can ask ChatGPT to solve a problem that can use Python code and it will write that Python, execute it in a secure sandbox (I think it's Kubernetes) and then use the output - or any error messages - to determine if the goal has been achieved.&lt;/p&gt;
&lt;p&gt;It's a beautiful pattern that worked great with early 2023 models (I believe it first shipped using original GPT-4), and continues to work today.&lt;/p&gt;
&lt;p&gt;Claude added their own version in October (&lt;a href="https://simonwillison.net/2024/Oct/24/claude-analysis-tool/"&gt;Claude analysis&lt;/a&gt;, using JavaScript that runs in the browser), Mistral have it, Gemini has a version and there are dozens of other implementations of the same pattern.&lt;/p&gt;
&lt;p&gt;The second category of agents that I believe in is &lt;strong&gt;research assistants&lt;/strong&gt; - where an LLM can run multiple searches, gather information and aggregate that into an answer to a question or write a report.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.perplexity.ai/"&gt;Perplexity&lt;/a&gt; and &lt;a href="https://openai.com/index/introducing-chatgpt-search/"&gt;ChatGPT Search&lt;/a&gt; have both been operating in this space for a while, but by far the most impressive implementation I've seen is Google Gemini's &lt;a href="https://blog.google/products/gemini/google-gemini-deep-research/"&gt;Deep Research&lt;/a&gt; tool, which I've had access to for a few weeks.&lt;/p&gt;
&lt;p&gt;With Deep Research I can pose a question like this one:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Pillar Point Harbor is one of the largest communal brown pelican roosts on the west coast of North America.&lt;/p&gt;
&lt;p&gt;find others&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And Gemini will draft a plan, consult dozens of different websites via Google Search and then assemble a report (with all-important citations) describing what it found.&lt;/p&gt;
&lt;p&gt;Here's the plan it came up with:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Pillar Point Harbor is one of the largest communal brown pelican roosts on the west coast of North America. Find other large communal brown pelican roosts on the west coast of North America.&lt;br /&gt;
(1) Find a list of brown pelican roosts on the west coast of North America.&lt;br /&gt;
(2) Find research papers or articles about brown pelican roosts and their size.&lt;br /&gt;
(3) Find information from birdwatching organizations or government agencies about brown pelican roosts.&lt;br /&gt;
(4) Compare the size of the roosts found in (3) to the size of the Pillar Point Harbor roost.&lt;br /&gt;
(5) Find any news articles or recent reports about brown pelican roosts and their populations.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It dug up a whole bunch of details, but the one I cared most about was &lt;a href="https://birdallianceoregon.org/wp-content/uploads/2021/04/Brown-Pelican-survey_4-year_summary-infographic_2016-19_final.pdf"&gt;these PDF results for the 2016-2019 Pacific Brown Pelican Survey&lt;/a&gt; conducted by the West Coast Audubon network and partners - a PDF that included this delightful list:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Top 10 Megaroosts (sites that traditionally host &amp;gt;500 pelicans) with average fall count numbers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Alameda Breakwater, CA (3,183)&lt;/li&gt;
&lt;li&gt;Pillar Point Harbor, CA (1,481)&lt;/li&gt;
&lt;li&gt;East Sand Island, OR (1,121)&lt;/li&gt;
&lt;li&gt;Ano Nuevo State Park, CA (1,068)&lt;/li&gt;
&lt;li&gt;Salinas River mouth, CA (762)&lt;/li&gt;
&lt;li&gt;Bolinas Lagoon, CA (755)&lt;/li&gt;
&lt;li&gt;Morro Rock, CA (725)&lt;/li&gt;
&lt;li&gt;Moss landing, CA (570)&lt;/li&gt;
&lt;li&gt;Crescent City Harbor, CA (514)&lt;/li&gt;
&lt;li&gt;Bird Rock Tomales, CA (514)&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;My local harbor is the second biggest megaroost!&lt;/p&gt;
&lt;p&gt;It makes intuitive sense to me that this kind of research assistant can be built on our current generation of LLMs. They're competent at driving tools, they're capable of coming up with a relatively obvious research plan (look for newspaper articles and research papers) and they can synthesize sensible answers given the right collection of context gathered through search.&lt;/p&gt;
&lt;p&gt;Google are particularly well suited to solving this problem: they have the world's largest search index and their Gemini model has a 2 million token context. I expect Deep Research to get a whole lot better, and I expect it to attract plenty of competition.&lt;/p&gt;
&lt;h4 id="three-years-someone-wins-a-pulitzer-for-ai-assisted-investigative-reporting"&gt;Three years: Someone wins a Pulitzer for AI-assisted investigative reporting&lt;/h4&gt;
&lt;p&gt;I went for a bit of a self-serving prediction here: I think within three years someone is going to win a Pulitzer prize for a piece of investigative reporting that was aided by generative AI tools.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update&lt;/strong&gt;: after publishing this piece I learned about this May 2024 story from Nieman Lab: &lt;a href="https://www.niemanlab.org/2024/05/for-the-first-time-two-pulitzer-winners-disclosed-using-ai-in-their-reporting/"&gt;For the first time, two Pulitzer winners disclosed using AI in their reporting&lt;/a&gt;. I think these were both examples of traditional machine learning as opposed to LLM-based generative AI, but this is yet another example of my predictions being less ambitious than I had thought!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I do &lt;em&gt;not&lt;/em&gt; mean that an LLM will write the article! I continue to think that having LLMs write on your behalf is one of the least interesting applications of these tools.&lt;/p&gt;
&lt;p&gt;I called this prediction self-serving because I want to help make this happen! My &lt;a href="https://datasette.io"&gt;Datasette&lt;/a&gt; suite of open source tools for data journalism has been growing AI features, like &lt;a href="https://simonwillison.net/2023/Dec/1/datasette-enrichments/"&gt;LLM-powered data enrichments&lt;/a&gt; and &lt;a href="https://www.datasette.cloud/blog/2024/datasette-extract/"&gt;extracting structured data&lt;/a&gt; into tables from unstructured text.&lt;/p&gt;
&lt;p&gt;My dream is for those tools - or tools like them - to be used for an award winning piece of investigative reporting.&lt;/p&gt;
&lt;p&gt;I picked three years for this because I think that's how long it will take for knowledge of how to responsibly and effectively use these tools to become widespread enough for that to happen.&lt;/p&gt;
&lt;p&gt;LLMs are not an obvious fit for journalism: journalists look for the truth, and LLMs are notoriously prone to hallucination and making things up. But journalists are also &lt;em&gt;really good&lt;/em&gt; at extracting useful information from potentially untrusted sources - that's a lot of what the craft of journalism is about.&lt;/p&gt;
&lt;p&gt;The two areas I think LLMs are particularly relevant to journalism are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Structured data extraction. If you have 10,000 PDFs from a successful Freedom of Information Act request, someone or something needs to kick off the process of reading through them to find the stories. LLMs are a fantastic way to take a vast amount of information and start making some element of sense from it. They can act as lead generators, helping identify the places to start looking more closely.&lt;/li&gt;
&lt;li&gt;Coding assistance. Writing code to help analyze data is a huge part of modern data journalism - from SQL queries through data cleanup scripts, custom web scrapers or visualizations to help find signal among the noise. Most newspapers don't have a team of programmers on staff: I think within three years we'll have robust enough tools built around this pattern that non-programmer journalists will be able to use them as part of their reporting process.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I hope to build some of these tools myself!&lt;/p&gt;
&lt;p&gt;So my concrete prediction for three years is that someone wins a Pulitzer with a small amount of assistance from LLMs.&lt;/p&gt;
&lt;p&gt;My more general prediction: within three years it won't be surprising at all to see most information professionals use LLMs as part of their daily workflow, in increasingly sophisticated ways. We'll know exactly what patterns work and how best to explain them to people. These skills will become widespread.&lt;/p&gt;
&lt;h4 id="three-years-part-two-privacy-laws-with-teeth"&gt;Three years part two: privacy laws with teeth&lt;/h4&gt;
&lt;p&gt;My other three year prediction concerned privacy legislation.&lt;/p&gt;
&lt;p&gt;The levels of (often justified) paranoia around both targeted advertising and what happens to the data people paste into these models is a constantly growing problem.&lt;/p&gt;
&lt;p&gt;I wrote recently about the &lt;a href="https://simonwillison.net/2025/Jan/2/they-spy-on-you-but-not-like-that/"&gt;inexterminable conspiracy theory that Apple targets ads by spying through your phone's microphone&lt;/a&gt;. I've written in the past about &lt;a href="https://simonwillison.net/2023/Dec/14/ai-trust-crisis/"&gt;the AI trust crisis&lt;/a&gt;, where people refuse to believe that models are not being trained on their inputs no matter how emphatically the companies behind them deny it.&lt;/p&gt;
&lt;p&gt;I think the AI industry itself would benefit enormously from legislation that helps clarify what's going on with training on user-submitted data, and the wider tech industry could really do with harder rules around things like data retention and targeted advertising.&lt;/p&gt;
&lt;p&gt;I don't expect the next four years of US federal government to be effective at passing legislation, but I expect we'll see privacy legislation with sharper teeth emerging at the state level or internationally. Let's just hope we don't end up with a new generation of cookie-consent banners as a result!&lt;/p&gt;
&lt;h4 id="six-years-utopian-amazing-art"&gt;Six years utopian: amazing art&lt;/h4&gt;
&lt;p&gt;For six years I decided to go with two rival predictions, one optimistic and one pessimistic.&lt;/p&gt;
&lt;p&gt;I think six years is long enough that we'll figure out how to harness this stuff to make some &lt;strong&gt;really great art&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;I don't think generative AI for art - images, video and music - deserves nearly the same level of respect as a useful tool as text-based LLMs. Generative art tools are a lot of fun to try out, but the lack of fine-grained control over the output greatly limits their utility outside of personal amusement or generating &lt;a href="https://simonwillison.net/tags/slop/"&gt;slop&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;More importantly, they lack social acceptability. The vibes aren't good. Many talented artists have loudly rejected the idea of these tools, to the point that the very term "AI" is developing a distasteful connotation in society at large.&lt;/p&gt;
&lt;p&gt;Image and video models are also ground zero for the AI training data ethics debate, and for good reason: no artist wants to see a model trained on their work without their permission that then directly competes with them!&lt;/p&gt;
&lt;p&gt;I think six years is long enough for this whole thing to shake out - for society to figure out acceptable ways of using these tools to truly elevate human expression. What excites me is the idea of truly talented, visionary creative artists using whatever these tools have evolved into in six years to make meaningful art that could never have been achieved without them.&lt;/p&gt;
&lt;p&gt;On the podcast I talked about &lt;a href="https://en.wikipedia.org/wiki/Everything_Everywhere_All_at_Once"&gt;Everything Everywhere All at Once&lt;/a&gt;, a film that deserved every one of its seven Oscars. The core visual effects team on that film was just five people. Imagine what a team like that could do with the generative AI tools we'll have in six years time!&lt;/p&gt;
&lt;p id="since-recording"&gt;Since recording the podcast I learned from &lt;a href="https://www.swyx.io/"&gt;Swyx&lt;/a&gt; that Everything Everywhere All at Once &lt;a href="https://www.aboutamazon.com/news/aws/how-ai-tools-are-creating-new-possibilities-for-movies-and-visual-design-according-to-this-aws-powered-startup"&gt;used Runway ML as part of their toolset already&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Evan Halleck was on this team, and he used Runway's AI tools to save time and automate tedious aspects of editing. Specifically in the film’s rock scene, he used Runway’s rotoscoping tool to get a quick, clean cut of the rocks as sand and dust were moving around the shot. This translated days of work to a matter of minutes.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I said I thought a film that had used generative AI tools would win an Oscar within six years. Looks like I was eight years out on that one!&lt;/p&gt;
&lt;h4 id="six-years-dystopian-agi-asi-causes-mass-civil-unrest"&gt;Six years dystopian: AGI/ASI causes mass civil unrest&lt;/h4&gt;
&lt;p&gt;My pessimistic alternative take for 2031 concerns "AGI" - a term which, like "agents", is constantly being redefined. The Information &lt;a href="https://www.theinformation.com/articles/microsoft-and-openai-wrangle-over-terms-of-their-blockbuster-partnership"&gt;recently reported&lt;/a&gt; (see also &lt;a href="https://www.theverge.com/2025/1/6/24337106/sam-altman-says-openai-knows-how-to-build-agi-blog-post"&gt;The Verge&lt;/a&gt;) that Microsoft and OpenAI are now defining AGI as a system capable of generating $100bn in profit!&lt;/p&gt;
&lt;p&gt;If we assume AGI is the point at which AI systems are capable of performing almost any job currently reserved for a human being, it's hard &lt;em&gt;not&lt;/em&gt; to see potentially negative consequences.&lt;/p&gt;
&lt;p&gt;Sam Altman may have &lt;a href="https://www.bloomberg.com/news/articles/2024-07-22/ubi-study-backed-by-openai-s-sam-altman-bolsters-support-for-basic-income"&gt;experimented with Universal Basic Income&lt;/a&gt;, but the USA is a country that can't even figure out universal healthcare! I have huge trouble imagining a future economy that works for the majority of people when the majority of jobs are being done by machines.&lt;/p&gt;
&lt;p&gt;So my dystopian prediction for 2031 is that if that form of AGI has come to pass it will be accompanied by extraordinarily bad economic outcomes and mass civil unrest.&lt;/p&gt;
&lt;p&gt;My version of an AI utopia is tools that augment existing humans. That's what we've had with LLMs so far, and my ideal is that those tools continue to improve and subsequently humans become able to take on &lt;a href="https://simonwillison.net/2023/Mar/27/ai-enhanced-development/"&gt;more ambitious work&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If there's a version of AGI that results in that kind of utopia, I'm all for it.&lt;/p&gt;
&lt;h4 id="my-total-lack-of-conviction"&gt;My total lack of conviction&lt;/h4&gt;
&lt;p&gt;There's a reason I haven't made predictions like this before: my confidence in my ability to predict the future is almost non-existent. At least one of my predictions here &lt;a href="https://simonwillison.net/2025/Jan/10/ai-predictions/#since-recording"&gt;already proved to be eight years late&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;These predictions are in the public record now (I even &lt;a href="https://github.com/oxidecomputer/oxide-and-friends/pull/158"&gt;submitted a pull request&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;It's going to be interesting looking back at these in one, three and six years to see how I did.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/predictions"&gt;predictions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/code-interpreter"&gt;code-interpreter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/oxide"&gt;oxide&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-agents"&gt;ai-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deep-research"&gt;deep-research&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-search"&gt;ai-assisted-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/agent-definitions"&gt;agent-definitions&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="data-journalism"/><category term="predictions"/><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="gemini"/><category term="code-interpreter"/><category term="oxide"/><category term="ai-agents"/><category term="deep-research"/><category term="ai-assisted-search"/><category term="coding-agents"/><category term="agent-definitions"/></entry></feed>