<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: artificial-analysis</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/artificial-analysis.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-11-06T23:53:06+00:00</updated><author><name>Simon Willison</name></author><entry><title>Kimi K2 Thinking</title><link href="https://simonwillison.net/2025/Nov/6/kimi-k2-thinking/#atom-tag" rel="alternate"/><published>2025-11-06T23:53:06+00:00</published><updated>2025-11-06T23:53:06+00:00</updated><id>https://simonwillison.net/2025/Nov/6/kimi-k2-thinking/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/moonshotai/Kimi-K2-Thinking"&gt;Kimi K2 Thinking&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Chinese AI lab Moonshot's Kimi K2 established itself as one of the largest open weight models - 1 trillion parameters - &lt;a href="https://simonwillison.net/2025/Jul/11/kimi-k2/"&gt;back in July&lt;/a&gt;. They've now released the Thinking version, also a trillion parameters (MoE, 32B active) and also under their custom modified (so &lt;a href="https://simonwillison.net/2025/Jul/11/kimi-k2/#kimi-license"&gt;not quite open source&lt;/a&gt;) MIT license.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Starting with Kimi K2, we built it as a thinking agent that reasons step-by-step while dynamically invoking tools. It sets a new state-of-the-art on Humanity's Last Exam (HLE), BrowseComp, and other benchmarks by dramatically scaling multi-step reasoning depth and maintaining stable tool-use across 200–300 sequential calls. At the same time, K2 Thinking is a native INT4 quantization model with 256k context window, achieving lossless reductions in inference latency and GPU memory usage.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This one is only 594GB on Hugging Face - Kimi K2 was 1.03TB - which I think is due to the new INT4 quantization. This makes the model both cheaper and faster to host.&lt;/p&gt;
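&lt;p&gt;A quick back-of-envelope check on those file sizes (my own arithmetic, not Moonshot's): raw weight storage scales linearly with bit width, so halving the bits roughly halves the download.&lt;/p&gt;

```python
def weight_storage_gb(n_params: float, bits_per_weight: int) -> float:
    # Rough lower bound on raw weight storage: ignores layers kept at
    # higher precision (embeddings, norms) and checkpoint file overhead.
    return n_params * bits_per_weight / 8 / 1e9

# One trillion parameters:
print(weight_storage_gb(1e12, 8))  # 1000.0 GB, close to Kimi K2's 1.03 TB
print(weight_storage_gb(1e12, 4))  # 500.0 GB, in the ballpark of K2 Thinking's 594 GB
```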
&lt;p&gt;So far the only people hosting it are Moonshot themselves. I tried it out both via &lt;a href="https://platform.moonshot.ai"&gt;their own API&lt;/a&gt; and via &lt;a href="https://openrouter.ai/moonshotai/kimi-k2-thinking/providers"&gt;the OpenRouter proxy to it&lt;/a&gt;, via the &lt;a href="https://github.com/ghostofpokemon/llm-moonshot"&gt;llm-moonshot&lt;/a&gt; plugin (by NickMystic) and my &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt; plugin respectively.&lt;/p&gt;
&lt;p&gt;The buzz around this model so far is very positive. Could this be the first open weight model that's competitive with the latest from OpenAI and Anthropic, especially for long-running agentic tool call sequences?&lt;/p&gt;
&lt;p&gt;Moonshot AI's &lt;a href="https://moonshotai.github.io/Kimi-K2/thinking.html"&gt;self-reported benchmark scores&lt;/a&gt; show K2 Thinking beating the top OpenAI and Anthropic models (GPT-5 and Sonnet 4.5 Thinking) at "Agentic Reasoning" and "Agentic Search" but not quite top for "Coding":&lt;/p&gt;
&lt;p&gt;&lt;img alt="Comparison bar chart showing agentic reasoning, search, and coding benchmark performance scores across three AI systems (K, OpenAI, and AI) on tasks including Humanity's Last Exam (44.9, 41.7, 32.0), BrowseComp (60.2, 54.9, 24.1), Seal-0 (56.3, 51.4, 53.4), SWE-Multilingual (61.1, 55.3, 68.0), SWE-bench Verified (71.3, 74.9, 77.2), and LiveCodeBench V6 (83.1, 87.0, 64.0), with category descriptions including &amp;quot;Expert-level questions across subjects&amp;quot;, &amp;quot;Agentic search &amp;amp; browsing&amp;quot;, &amp;quot;Real-world latest information collection&amp;quot;, &amp;quot;Agentic coding&amp;quot;, and &amp;quot;Competitive programming&amp;quot;." src="https://static.simonwillison.net/static/2025/kimi-k2-thinking-benchmarks.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;I ran a couple of pelican tests:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-moonshot
llm keys set moonshot # paste key
llm -m moonshot/kimi-k2-thinking 'Generate an SVG of a pelican riding a bicycle'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="Sonnet 4.5 described this as: Cartoon illustration of a white duck or goose with an orange beak and gray wings riding a bicycle with a red frame and light blue wheels against a light blue background." src="https://static.simonwillison.net/static/2025/k2-thinking.png" /&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-openrouter
llm keys set openrouter # paste key
llm -m openrouter/moonshotai/kimi-k2-thinking \
  'Generate an SVG of a pelican riding a bicycle'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="Sonnet 4.5: Minimalist cartoon illustration of a white bird with an orange beak and feet standing on a triangular-framed penny-farthing style bicycle with gray-hubbed wheels and a propeller hat on its head, against a light background with dotted lines and a brown ground line." src="https://static.simonwillison.net/static/2025/k2-thinking-openrouter.png" /&gt;&lt;/p&gt;
&lt;p&gt;Artificial Analysis &lt;a href="https://x.com/ArtificialAnlys/status/1986541785511043536"&gt;said&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Kimi K2 Thinking achieves 93% in 𝜏²-Bench Telecom, an agentic tool use benchmark where the model acts as a customer service agent. This is the highest score we have independently measured. Tool use in long horizon agentic contexts was a strength of Kimi K2 Instruct and it appears this new Thinking variant makes substantial gains&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;CNBC quoted a source who &lt;a href="https://www.cnbc.com/2025/11/06/alibaba-backed-moonshot-releases-new-ai-model-kimi-k2-thinking.html"&gt;provided the training price&lt;/a&gt; for the model:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The Kimi K2 Thinking model cost $4.6 million to train, according to a source familiar with the matter. [...] CNBC was unable to independently verify the DeepSeek or Kimi figures.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;MLX developer Awni Hannun &lt;a href="https://x.com/awnihannun/status/1986601104130646266"&gt;got it working&lt;/a&gt; on two 512GB M3 Ultra Mac Studios:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The new 1 Trillion parameter Kimi K2 Thinking model runs well on 2 M3 Ultras in its native format - no loss in quality!&lt;/p&gt;
&lt;p&gt;The model was quantization aware trained (qat) at int4.&lt;/p&gt;
&lt;p&gt;Here it generated ~3500 tokens at 15 toks/sec using pipeline-parallelism in mlx-lm&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://huggingface.co/mlx-community/Kimi-K2-Thinking"&gt;the 658GB mlx-community model&lt;/a&gt;.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/artificial-analysis"&gt;artificial-analysis&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/moonshot"&gt;moonshot&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/kimi"&gt;kimi&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="mlx"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/><category term="openrouter"/><category term="ai-in-china"/><category term="artificial-analysis"/><category term="moonshot"/><category term="kimi"/></entry><entry><title>Improved Gemini 2.5 Flash and Flash-Lite</title><link href="https://simonwillison.net/2025/Sep/25/improved-gemini-25-flash-and-flash-lite/#atom-tag" rel="alternate"/><published>2025-09-25T19:27:43+00:00</published><updated>2025-09-25T19:27:43+00:00</updated><id>https://simonwillison.net/2025/Sep/25/improved-gemini-25-flash-and-flash-lite/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://developers.googleblog.com/en/continuing-to-bring-you-our-latest-models-with-an-improved-gemini-2-5-flash-and-flash-lite-release/"&gt;Improved Gemini 2.5 Flash and Flash-Lite&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Two new preview models from Google - updates to their fast and inexpensive Flash and Flash Lite families:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The latest version of Gemini 2.5 Flash-Lite was trained and built based on three key themes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Better instruction following&lt;/strong&gt;: The model is significantly better at following complex instructions and system prompts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reduced verbosity&lt;/strong&gt;: It now produces more concise answers, a key factor in reducing token costs and latency for high-throughput applications (see charts above).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stronger multimodal &amp;amp; translation capabilities&lt;/strong&gt;: This update features more accurate audio transcription, better image understanding, and improved translation quality.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;[...]&lt;/p&gt;
&lt;p&gt;This latest 2.5 Flash model comes with improvements in two key areas we heard consistent feedback on:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Better agentic tool use&lt;/strong&gt;: We've improved how the model uses tools, leading to better performance in more complex, agentic and multi-step applications. This model shows noticeable improvements on key agentic benchmarks, including a 5% gain on SWE-Bench Verified, compared to our last release (48.9% → 54%).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;More efficient&lt;/strong&gt;: With thinking on, the model is now significantly more cost-efficient—achieving higher quality outputs while using fewer tokens, reducing latency and cost (see charts above).&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;They also added two new convenience model IDs: &lt;code&gt;gemini-flash-latest&lt;/code&gt; and &lt;code&gt;gemini-flash-lite-latest&lt;/code&gt;, which will always resolve to the most recent model in that family.&lt;/p&gt;
&lt;p&gt;I released &lt;a href="https://github.com/simonw/llm-gemini/releases/tag/0.26"&gt;llm-gemini 0.26&lt;/a&gt; adding support for the new models and new aliases. I also used the &lt;code&gt;response.set_resolved_model()&lt;/code&gt; method &lt;a href="https://github.com/simonw/llm/issues/1117"&gt;added in LLM 0.27&lt;/a&gt; to ensure that the correct model ID would be recorded for those &lt;code&gt;-latest&lt;/code&gt; uses.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install -U llm-gemini
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Both of these models support optional reasoning tokens. I had them draw me pelicans riding bicycles in both thinking and non-thinking mode, using commands that looked like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m gemini-2.5-flash-preview-09-2025 -o thinking_budget 4000 "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I then got each model to describe the image it had drawn using commands like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -a https://static.simonwillison.net/static/2025/gemini-2.5-flash-preview-09-2025-thinking.png -m gemini-2.5-flash-preview-09-2025 -o thinking_budget 2000 'Detailed single line alt text for this image'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;a href="https://gist.github.com/simonw/e9dc9c18008106b4ae2e0be287709f5c"&gt;&lt;strong&gt;gemini-2.5-flash-preview-09-2025-thinking&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="" src="https://static.simonwillison.net/static/2025/gemini-2.5-flash-preview-09-2025-thinking.png" /&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A minimalist stick figure graphic depicts a person with a white oval body and a dot head cycling a gray bicycle, carrying a large, bright yellow rectangular box resting high on their back.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://gist.github.com/simonw/e357eac5f12e995a6dcb50711241a478"&gt;&lt;strong&gt;gemini-2.5-flash-preview-09-2025&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="" src="https://static.simonwillison.net/static/2025/gemini-2.5-flash-preview-09-2025.png" /&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A simple cartoon drawing of a pelican riding a bicycle, with the text "A Pelican Riding a Bicycle" above it.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://gist.github.com/simonw/29aff037b58fe62baf5a3cb7cf3b0ca9"&gt;&lt;strong&gt;gemini-2.5-flash-lite-preview-09-2025-thinking&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="" src="https://static.simonwillison.net/static/2025/gemini-2.5-flash-lite-preview-09-2025-thinking.png" /&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A quirky, simplified cartoon illustration of a white bird with a round body, black eye, and bright yellow beak, sitting astride a dark gray, two-wheeled vehicle with its peach-colored feet dangling below.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://gist.github.com/simonw/0eb5b9dc5515657a0a3c9d16bb5d46f6"&gt;&lt;strong&gt;gemini-2.5-flash-lite-preview-09-2025&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="" src="https://static.simonwillison.net/static/2025/gemini-2.5-flash-lite-preview-09-2025.png" /&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A minimalist, side-profile illustration of a stylized yellow chick or bird character riding a dark-wheeled vehicle on a green strip against a white background.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Artificial Analysis posted &lt;a href="https://twitter.com/ArtificialAnlys/status/1971273380335845683"&gt;a detailed review&lt;/a&gt;, including these interesting notes about reasoning efficiency and speed:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;In reasoning mode, Gemini 2.5 Flash and Flash-Lite Preview 09-2025 are more token-efficient, using fewer output tokens than their predecessors to run the Artificial Analysis Intelligence Index. Gemini 2.5 Flash-Lite Preview 09-2025 uses 50% fewer output tokens than its predecessor, while Gemini 2.5 Flash Preview 09-2025 uses 24% fewer output tokens.&lt;/li&gt;
&lt;li&gt;Google Gemini 2.5 Flash-Lite Preview 09-2025 (Reasoning) is ~40% faster than the prior July release, delivering ~887 output tokens/s on Google AI Studio in our API endpoint performance benchmarking. This makes the new Gemini 2.5 Flash-Lite the fastest proprietary model we have benchmarked on the Artificial Analysis website&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
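&lt;p&gt;To make those throughput and efficiency figures concrete, here's a rough sketch (my own arithmetic based on the numbers quoted above, not Artificial Analysis's):&lt;/p&gt;

```python
def seconds_to_emit(n_tokens: int, tokens_per_second: float = 887) -> float:
    # Streaming time at the ~887 output tokens/s Artificial Analysis measured.
    return n_tokens / tokens_per_second

def tokens_after_reduction(n_tokens: int, reduction: float = 0.50) -> float:
    # Output tokens left after the quoted 50% Flash-Lite reduction.
    return n_tokens * (1 - reduction)

print(round(seconds_to_emit(3500), 1))  # ~3.9 seconds for a 3,500 token answer
print(tokens_after_reduction(10_000))   # 5000.0 tokens for the same workload
```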

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=45375845"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/artificial-analysis"&gt;artificial-analysis&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="llms"/><category term="llm"/><category term="gemini"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/><category term="artificial-analysis"/></entry><entry><title>Open weight LLMs exhibit inconsistent performance across providers</title><link href="https://simonwillison.net/2025/Aug/15/inconsistent-performance/#atom-tag" rel="alternate"/><published>2025-08-15T16:29:34+00:00</published><updated>2025-08-15T16:29:34+00:00</updated><id>https://simonwillison.net/2025/Aug/15/inconsistent-performance/#atom-tag</id><summary type="html">
    &lt;p&gt;Artificial Analysis published &lt;a href="https://artificialanalysis.ai/models/gpt-oss-120b/providers#aime25x32-performance-gpt-oss-120b"&gt;a new benchmark&lt;/a&gt; the other day, this time focusing on how an individual model - OpenAI’s gpt-oss-120b - performs across different hosted providers.&lt;/p&gt;
&lt;p&gt;The results showed some surprising differences. Here's the one with the greatest variance, a run of the 2025 AIME (American Invitational Mathematics Examination) averaging 32 runs against each model, using gpt-oss-120b with a reasoning effort of "high":&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/aim25x32-gpt-oss-120b.jpg" alt="Performance benchmark chart showing AIME25x32 Performance for gpt-oss-120B model across different AI frameworks. Chart displays box plots with percentile ranges (Min, 25th, Median, 75th, Max) for each framework. Title: &amp;quot;AIME25x32 Performance: gpt-oss-120B&amp;quot; with subtitle &amp;quot;AIME 2025 N=32 Runs: Minimum, 25th Percentile, Median, 75th Percentile, Maximum (Higher is Better)&amp;quot;. Legend indicates &amp;quot;Median; other points represent Min, 25th, 75th percentiles and Max respectively&amp;quot;. Y-axis ranges from 0 to 1.2. Frameworks shown from left to right: Cerebras (93.3%), Nebius Base (93.3%), Fireworks (93.3%), Deepinfra (93.3%), Novita (93.3%), Together.ai (93.3%), Parasail (90.0%), Groq (86.7%), Amazon (83.3%), Azure (80.0%), CompactifAI (36.7%). Watermark shows &amp;quot;Artificial Analysis&amp;quot; logo." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;These are some varied results!&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;93.3%: Cerebras, Nebius Base, Fireworks, Deepinfra, Novita, Together.ai, vLLM 0.1.0&lt;/li&gt;
&lt;li&gt;90.0%: Parasail&lt;/li&gt;
&lt;li&gt;86.7%: Groq&lt;/li&gt;
&lt;li&gt;83.3%: Amazon&lt;/li&gt;
&lt;li&gt;80.0%: Azure&lt;/li&gt;
&lt;li&gt;36.7%: CompactifAI&lt;/li&gt;
&lt;/ul&gt;
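&lt;p&gt;One quick way to quantify that variance, using the scores listed above (a sketch of my own):&lt;/p&gt;

```python
aime_scores = {
    "Cerebras": 93.3, "Nebius Base": 93.3, "Fireworks": 93.3,
    "Deepinfra": 93.3, "Novita": 93.3, "Together.ai": 93.3,
    "Parasail": 90.0, "Groq": 86.7, "Amazon": 83.3,
    "Azure": 80.0, "CompactifAI": 36.7,
}

spread = max(aime_scores.values()) - min(aime_scores.values())
median = sorted(aime_scores.values())[len(aime_scores) // 2]
print(round(spread, 1), median)  # a 56.6 point gap; the median provider scores 93.3
```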
&lt;p&gt;It looks like most of the providers that scored 93.3% were running models using the latest &lt;a href="https://github.com/vllm-project/vllm"&gt;vLLM&lt;/a&gt; (with the exception of Cerebras who I believe have their own custom serving stack).&lt;/p&gt;
&lt;p&gt;I hadn't heard of CompactifAI before - I found &lt;a href="https://www.hpcwire.com/off-the-wire/multiverse-computing-closes-e189m-series-b-to-scale-compactifai-deployment/"&gt;this June 12th 2025 press release&lt;/a&gt; which says that "CompactifAI models are highly-compressed versions of leading open source LLMs that retain original accuracy, are 4x-12x faster and yield a 50%-80% reduction in inference costs" which helps explain their notably lower score!&lt;/p&gt;
&lt;p&gt;Microsoft Azure's Lucas Pickup &lt;a href="https://x.com/lupickup/status/1955620918086226223"&gt;confirmed&lt;/a&gt; that Azure's 80% score was caused by running an older vLLM, now fixed:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This is exactly it, it’s been fixed as of yesterday afternoon across all serving instances (of the hosted 120b service). Old vLLM commits that didn’t respect reasoning_effort, so all requests defaulted to medium.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;No news yet on what went wrong with the AWS Bedrock version.&lt;/p&gt;
&lt;h4 id="the-challenge-for-customers-of-open-weight-models"&gt;The challenge for customers of open weight models&lt;/h4&gt;
&lt;p&gt;As a customer of open weight model providers, this really isn't something I wanted to have to think about!&lt;/p&gt;
&lt;p&gt;It's not really a surprise though. When running models myself I inevitably have to make choices - about which serving framework to use (I'm usually picking between GGUF/llama.cpp and MLX on my own Mac laptop) and which quantization level to use.&lt;/p&gt;
&lt;p&gt;I know that quantization has an impact, but it's difficult for me to quantify that effect.&lt;/p&gt;
&lt;p&gt;It looks like with hosted models even knowing the quantization they are using isn't necessarily enough information to be able to predict that model's performance.&lt;/p&gt;
&lt;p&gt;I see this situation as a general challenge for open weight models. They tend to be released as an opaque set of model weights plus loose instructions for running them on a single platform - if we are lucky! Most AI labs leave quantization and format conversions to the community and third-party providers.&lt;/p&gt;
&lt;p&gt;There's a lot that can go wrong. Tool calling is particularly vulnerable to these differences - models have been trained on specific tool-calling conventions, and if a provider doesn't get these exactly right the results can be unpredictable but difficult to diagnose.&lt;/p&gt;
&lt;p&gt;What would help &lt;em&gt;enormously&lt;/em&gt; here would be some kind of conformance suite. If models were reliably deterministic this would be easy: publish a set of test cases and let providers (or their customers) run those to check the model's implementation.&lt;/p&gt;
&lt;p&gt;Models aren't deterministic though, even at a temperature of 0. Maybe this new effort from Artificial Analysis is exactly what we need here, especially since running a full benchmark suite against a provider can be quite expensive in terms of token spend.&lt;/p&gt;
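&lt;p&gt;One hypothetical shape for such a check (my own illustration, not an existing tool): instead of demanding identical outputs, compare a provider's pass rate on a fixed eval set against a reference implementation's rate, within a tolerance.&lt;/p&gt;

```python
def within_tolerance(reference_rate: float, results: list,
                     tolerance: float = 0.05) -> bool:
    # Pass if the provider's success rate on a fixed eval set is no more
    # than `tolerance` below the reference implementation's rate.
    # Coarse: a real suite would use confidence intervals over many runs.
    rate = sum(results) / len(results)
    return rate >= reference_rate - tolerance

# Hypothetical 30-question runs against a reference rate of 0.933:
print(within_tolerance(0.933, [True] * 28 + [False] * 2))   # True
print(within_tolerance(0.933, [True] * 11 + [False] * 19))  # False
```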
&lt;p id="update"&gt;&lt;strong&gt;Update&lt;/strong&gt;: &lt;a href="https://x.com/DKundel/status/1956395988836368587"&gt;Via OpenAI's Dominik Kundel&lt;/a&gt; I learned that OpenAI now include a &lt;a href="https://github.com/openai/gpt-oss/tree/main/compatibility-test"&gt;compatibility test&lt;/a&gt; in the gpt-oss GitHub repository to help providers verify that they have implemented things like tool calling templates correctly, described in more detail in their &lt;a href="https://cookbook.openai.com/articles/gpt-oss/verifying-implementations"&gt;Verifying gpt-oss implementations&lt;/a&gt; cookbook.&lt;/p&gt;

&lt;p&gt;Here's &lt;a href="https://til.simonwillison.net/llms/gpt-oss-evals"&gt;my TIL&lt;/a&gt; on running part of that eval suite.&lt;/p&gt;

&lt;h4 id="update-aug-20"&gt;Update: August 20th 2025&lt;/h4&gt;

&lt;p&gt;Since I first wrote this article Artificial Analysis have updated the benchmark results to reflect fixes that vendors have made since their initial run. Here's what it looks like today:&lt;/p&gt;

&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-oss-eval-updated.jpg" alt="Performance benchmark chart showing AIME25x32 Performance for gpt-oss-120B model across different AI frameworks. Chart displays box plots with percentile ranges for each framework. Title: &amp;quot;AIME25x32 Performance: gpt-oss-120B&amp;quot; with subtitle &amp;quot;AIME 2025 N=32 Runs: Minimum, 25th Percentile, Median, 75th Percentile, Maximum (Higher is Better)&amp;quot;. Legend indicates &amp;quot;Median; other points represent Min, 25th, 75th percentiles and Max respectively&amp;quot;. Y-axis ranges from 0 to 1.2. Frameworks shown from left to right: Cerebras (93.3%), Nebius Base (93.3%), Azure (93.3%), Fireworks (93.3%), Deepinfra (93.3%), Novita (93.3%), Groq (93.3%), Together.ai (93.3%), Parasail (90.0%), Google Vertex (83.3%), Amazon (80.0%). Watermark shows &amp;quot;Artificial Analysis&amp;quot; logo." style="max-width: 100%" /&gt;&lt;/p&gt;
&lt;p&gt;Groq and Azure have both improved their scores to 93.3%. Google Vertex is new to the chart at 83.3%.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/evals"&gt;evals&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-oss"&gt;gpt-oss&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/artificial-analysis"&gt;artificial-analysis&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-performance"&gt;llm-performance&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="evals"/><category term="gpt-oss"/><category term="artificial-analysis"/><category term="llm-performance"/></entry><entry><title>Quoting Artificial Analysis</title><link href="https://simonwillison.net/2025/Aug/6/artificial-analysis/#atom-tag" rel="alternate"/><published>2025-08-06T12:48:32+00:00</published><updated>2025-08-06T12:48:32+00:00</updated><id>https://simonwillison.net/2025/Aug/6/artificial-analysis/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://x.com/artificialanlys/status/1952887733803991070"&gt;&lt;p&gt;&lt;strong&gt;gpt-oss-120b is the most intelligent American open weights model, comes behind DeepSeek R1 and Qwen3 235B in intelligence but offers efficiency benefits&lt;/strong&gt; [...]&lt;/p&gt;
&lt;p&gt;We’re seeing the 120B beat o3-mini but come in behind o4-mini and o3. The 120B is the most intelligent model that can be run on a single H100 and the 20B is the most intelligent model that can be run on a consumer GPU. [...]&lt;/p&gt;
&lt;p&gt;While the larger gpt-oss-120b does not come in above DeepSeek R1 0528’s score of 59 or Qwen3 235B 2507s score of 64, it is notable that it is significantly smaller in both total and active parameters than both of those models.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://x.com/artificialanlys/status/1952887733803991070"&gt;Artificial Analysis&lt;/a&gt;, see also their &lt;a href="https://artificialanalysis.ai/models/open-source"&gt;updated leaderboard&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/evals"&gt;evals&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deepseek"&gt;deepseek&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-oss"&gt;gpt-oss&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/artificial-analysis"&gt;artificial-analysis&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="evals"/><category term="qwen"/><category term="deepseek"/><category term="gpt-oss"/><category term="artificial-analysis"/></entry><entry><title>Grok 4</title><link href="https://simonwillison.net/2025/Jul/10/grok-4/#atom-tag" rel="alternate"/><published>2025-07-10T19:36:03+00:00</published><updated>2025-07-10T19:36:03+00:00</updated><id>https://simonwillison.net/2025/Jul/10/grok-4/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://docs.x.ai/docs/models/grok-4-0709"&gt;Grok 4&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Released last night, Grok 4 is now available via both API and a paid subscription for end-users.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update:&lt;/strong&gt; If you ask it about controversial topics it will sometimes &lt;a href="https://simonwillison.net/2025/Jul/11/grok-musk/"&gt;search X for tweets "from:elonmusk"&lt;/a&gt;!&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Key characteristics: image and text input, text output. 256,000 context length (twice that of Grok 3). It's a reasoning model where you can't see the reasoning tokens or turn off reasoning mode.&lt;/p&gt;
&lt;p&gt;xAI released results showing Grok 4 beating other models on most of the significant benchmarks. I haven't been able to find their own written version of these (the launch was a &lt;a href="https://x.com/xai/status/1943158495588815072"&gt;livestream video&lt;/a&gt;) but here's &lt;a href="https://techcrunch.com/2025/07/09/elon-musks-xai-launches-grok-4-alongside-a-300-monthly-subscription/"&gt;a TechCrunch report&lt;/a&gt; that includes those scores. It's not clear to me if these benchmark results are for Grok 4 or Grok 4 Heavy.&lt;/p&gt;
&lt;p&gt;I ran &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;my own benchmark&lt;/a&gt; using Grok 4 &lt;a href="https://openrouter.ai/x-ai/grok-4"&gt;via OpenRouter&lt;/a&gt; (since I have API keys there already). &lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m openrouter/x-ai/grok-4 "Generate an SVG of a pelican riding a bicycle" \
  -o max_tokens 10000
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="Description below." src="https://static.simonwillison.net/static/2025/grok4-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;I then asked Grok to describe the image it had just created:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m openrouter/x-ai/grok-4 -o max_tokens 10000 \
  -a https://static.simonwillison.net/static/2025/grok4-pelican.png \
  'describe this image'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/ec9aee006997b6ae7f2bba07da738279#response"&gt;the result&lt;/a&gt;. It described it as a "cute, bird-like creature (resembling a duck, chick, or stylized bird)".&lt;/p&gt;
&lt;p&gt;The most interesting independent analysis I've seen so far is &lt;a href="https://twitter.com/ArtificialAnlys/status/1943166841150644622"&gt;this one from Artificial Analysis&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We have run our full suite of benchmarks and Grok 4 achieves an Artificial Analysis Intelligence Index of 73, ahead of OpenAI o3 at 70, Google Gemini 2.5 Pro at 70, Anthropic Claude 4 Opus at 64 and DeepSeek R1 0528 at 68.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The timing of the release is somewhat unfortunate, given that Grok 3 made headlines &lt;a href="https://www.theguardian.com/technology/2025/jul/09/grok-ai-praised-hitler-antisemitism-x-ntwnfb"&gt;just this week&lt;/a&gt; after a &lt;a href="https://github.com/xai-org/grok-prompts/commit/535aa67a6221ce4928761335a38dea8e678d8501#diff-dec87f526b85f35cb546db6b1dd39d588011503a94f1aad86d023615a0e9e85aR6"&gt;clumsy system prompt update&lt;/a&gt; - presumably another attempt to make Grok "less woke" - caused it to start firing off antisemitic tropes and referring to itself as MechaHitler.&lt;/p&gt;
&lt;p&gt;My best guess is that these lines in the prompt were the root of the problem:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;- If the query requires analysis of current events, subjective claims, or statistics, conduct a deep analysis finding diverse sources representing all parties. Assume subjective viewpoints sourced from the media are biased. No need to repeat this to the user.&lt;/code&gt;&lt;br&gt;
&lt;code&gt;- The response should not shy away from making claims which are politically incorrect, as long as they are well substantiated.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;If xAI expect developers to start building applications on top of Grok they need to do a lot better than this. Absurd self-inflicted mistakes like this do not build developer trust!&lt;/p&gt;
&lt;p&gt;As it stands, Grok 4 isn't even accompanied by a model card.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update:&lt;/strong&gt; Ian Bicking &lt;a href="https://bsky.app/profile/ianbicking.org/post/3ltn3r7g4xc2i"&gt;makes an astute point&lt;/a&gt;:&lt;/em&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;It feels very credulous to ascribe what happened to a system prompt update. Other models can't be pushed into racism, Nazism, and ideating rape with a system prompt tweak.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;em&gt;Even if that system prompt change was responsible for unlocking this behavior, the fact that it was able to do so speaks to a much looser approach to model safety by xAI compared to other providers.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update 12th July 2025:&lt;/strong&gt; Grok posted &lt;a href="https://simonwillison.net/2025/Jul/12/grok/"&gt;a postmortem&lt;/a&gt; blaming the behavior on a different set of prompts, including "you are not afraid to offend people who are politically correct", that were not included in the system prompts they had published to their GitHub repository.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Grok 4 is competitively priced. It's $3/million for input tokens and $15/million for output tokens - the same price as Claude Sonnet 4. Once you go above 128,000 input tokens the price doubles to $6/$30 (Gemini 2.5 Pro has a similar price increase for longer inputs). I've added these prices to &lt;a href="https://www.llm-prices.com/"&gt;llm-prices.com&lt;/a&gt;.&lt;/p&gt;
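&lt;p&gt;As a quick sanity check on those rates (the token counts below are made up for illustration), the long-context price doubling can be sketched in shell:&lt;/p&gt;

```shell
# Back-of-envelope Grok 4 API cost using the prices quoted above.
# Token counts are hypothetical, purely for illustration.
input_tokens=150000
output_tokens=2000

# Above 128,000 input tokens the rates double: $3/$15 -> $6/$30 per million.
if [ "$input_tokens" -gt 128000 ]; then
  in_rate=6
  out_rate=30
else
  in_rate=3
  out_rate=15
fi

cost=$(awk -v i="$input_tokens" -v o="$output_tokens" \
           -v ir="$in_rate" -v outr="$out_rate" \
  'BEGIN { printf "%.2f", (i * ir + o * outr) / 1000000 }')
echo "Estimated cost: \$$cost"
```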
&lt;p&gt;Consumers can access Grok 4 via a new $30/month or $300/year "SuperGrok" plan - or a $300/month or $3,000/year "SuperGrok Heavy" plan providing access to Grok 4 Heavy.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of subscription pricing page showing two plans: SuperGrok at $30.00/month (marked as Popular) with Grok 4 and Grok 3 increased access, features including Everything in Basic, Context Memory 128,000 Tokens, and Voice with vision; SuperGrok Heavy at $300.00/month with Grok 4 Heavy exclusive preview, Grok 4 and Grok 3 increased access, features including Everything in SuperGrok, Early access to new features, and Dedicated Support. Toggle at top shows &amp;quot;Pay yearly save 16%&amp;quot; and &amp;quot;Pay monthly&amp;quot; options with Pay monthly selected." src="https://static.simonwillison.net/static/2025/supergrok-pricing.jpg" /&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/grok"&gt;grok&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/system-prompts"&gt;system-prompts&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/artificial-analysis"&gt;artificial-analysis&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xai"&gt;xai&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="vision-llms"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="grok"/><category term="ai-ethics"/><category term="llm-release"/><category term="openrouter"/><category term="system-prompts"/><category term="artificial-analysis"/><category term="xai"/></entry><entry><title>Recraft V3</title><link href="https://simonwillison.net/2024/Nov/15/recraft-v3/#atom-tag" rel="alternate"/><published>2024-11-15T04:24:09+00:00</published><updated>2024-11-15T04:24:09+00:00</updated><id>https://simonwillison.net/2024/Nov/15/recraft-v3/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.recraft.ai/blog/recraft-introduces-a-revolutionary-ai-model-that-thinks-in-design-language"&gt;Recraft V3&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Recraft are a generative AI design tool startup based out of London who released their v3 model a few weeks ago. It's currently sat at the top of the &lt;a href="https://artificialanalysis.ai/text-to-image/arena?tab=Leaderboard"&gt;Artificial Analysis Image Arena Leaderboard&lt;/a&gt;, beating Midjourney and Flux 1.1 pro.&lt;/p&gt;
&lt;p&gt;The thing that impressed me is that it can generate both raster &lt;em&gt;and&lt;/em&gt; vector graphics... and the vector graphics can be exported as SVG!&lt;/p&gt;
&lt;p&gt;Here's what I got for &lt;code&gt;raccoon with a sign that says "I love trash"&lt;/code&gt; - &lt;a href="https://static.simonwillison.net/static/2024/racoon-trash.svg"&gt;SVG here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Cute vector cartoon raccoon holding a sign that says I love trash - in the recraft.ai UI which is set to vector and has export options for PNG, JPEG, SVG and Lottie" src="https://static.simonwillison.net/static/2024/recraft-ai.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;That's an editable SVG - when I open it up in Pixelmator I can select and modify the individual paths and shapes:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Pixelmator UI showing the SVG with a sidebar showing each of the individual shapes - I have selected three hearts and they now show resize handles and the paths are highlighted in the sidebar" src="https://static.simonwillison.net/static/2024/recraft-pixelmator.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;They also have &lt;a href="https://www.recraft.ai/docs"&gt;an API&lt;/a&gt;. I spent $1 on 1000 credits and then spent 80 credits (8 cents) making this SVG of a &lt;a href="https://simonwillison.net/2024/Oct/25/pelicans-on-a-bicycle/"&gt;pelican riding a bicycle&lt;/a&gt;, using my API key stored in 1Password:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;export RECRAFT_API_TOKEN="$(
  op item get recraft.ai --fields label=password \
  --format json | jq .value -r)"

curl https://external.api.recraft.ai/v1/images/generations \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $RECRAFT_API_TOKEN" \
  -d '{
    "prompt": "california brown pelican riding a bicycle",
    "style": "vector_illustration",
    "model": "recraftv3"
  }'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="A really rather good SVG of a California Brown Pelican riding a bicycle" src="https://static.simonwillison.net/static/2024/recraft-ai-pelican.svg" /&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/svg"&gt;svg&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/text-to-image"&gt;text-to-image&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/artificial-analysis"&gt;artificial-analysis&lt;/a&gt;&lt;/p&gt;



</summary><category term="svg"/><category term="ai"/><category term="generative-ai"/><category term="text-to-image"/><category term="pelican-riding-a-bicycle"/><category term="artificial-analysis"/></entry></feed>