<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: cerebras</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/cerebras.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2026-02-12T21:16:07+00:00</updated><author><name>Simon Willison</name></author><entry><title>Introducing GPT‑5.3‑Codex‑Spark</title><link href="https://simonwillison.net/2026/Feb/12/codex-spark/#atom-tag" rel="alternate"/><published>2026-02-12T21:16:07+00:00</published><updated>2026-02-12T21:16:07+00:00</updated><id>https://simonwillison.net/2026/Feb/12/codex-spark/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://openai.com/index/introducing-gpt-5-3-codex-spark/"&gt;Introducing GPT‑5.3‑Codex‑Spark&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;OpenAI announced a partnership with Cerebras &lt;a href="https://openai.com/index/cerebras-partnership/"&gt;on January 14th&lt;/a&gt;. Four weeks later they're already launching the first integration, "an ultra-fast model for real-time coding in Codex".&lt;/p&gt;
&lt;p&gt;Despite being named GPT-5.3-Codex-Spark it's not purely an accelerated alternative to GPT-5.3-Codex - the blog post calls it "a smaller version of GPT‑5.3-Codex" and clarifies that "at launch, Codex-Spark has a 128k context window and is text-only."&lt;/p&gt;
&lt;p&gt;I had some preview access to this model and I can confirm that it's significantly faster than their other models.&lt;/p&gt;
&lt;p&gt;Here's what that speed looks like running in Codex CLI:&lt;/p&gt;
&lt;div style="max-width: 100%;"&gt;
    &lt;video 
        controls 
        preload="none"
        poster="https://static.simonwillison.net/static/2026/gpt-5.3-codex-spark-medium-last.jpg"
        style="width: 100%; height: auto;"&gt;
        &lt;source src="https://static.simonwillison.net/static/2026/gpt-5.3-codex-spark-medium.mp4" type="video/mp4"&gt;
    &lt;/video&gt;
&lt;/div&gt;

&lt;p&gt;That was the "Generate an SVG of a pelican riding a bicycle" prompt - here's the rendered result:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Whimsical flat illustration of an orange duck merged with a bicycle, where the duck's body forms the seat and frame area while its head extends forward over the handlebars, set against a simple light blue sky and green grass background." src="https://static.simonwillison.net/static/2026/gpt-5.3-codex-spark-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;Compare that to the speed of regular GPT-5.3 Codex medium:&lt;/p&gt;
&lt;div style="max-width: 100%;"&gt;
    &lt;video 
        controls 
        preload="none"
        poster="https://static.simonwillison.net/static/2026/gpt-5.3-codex-medium-last.jpg"
        style="width: 100%; height: auto;"&gt;
        &lt;source src="https://static.simonwillison.net/static/2026/gpt-5.3-codex-medium.mp4" type="video/mp4"&gt;
    &lt;/video&gt;
&lt;/div&gt;

&lt;p&gt;Significantly slower, but the pelican is a lot better:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Whimsical flat illustration of a white pelican riding a dark blue bicycle at speed, with motion lines behind it, its long orange beak streaming back in the wind, set against a light blue sky and green grass background." src="https://static.simonwillison.net/static/2026/gpt-5.3-codex-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;What's interesting about this model isn't the quality though, it's the &lt;em&gt;speed&lt;/em&gt;. When a model responds this fast you can stay in flow state and iterate with the model much more productively.&lt;/p&gt;
&lt;p&gt;I showed a demo of Cerebras running Llama 3.1 70B at 2,000 tokens/second against Val Town &lt;a href="https://simonwillison.net/2024/Oct/31/cerebras-coder/"&gt;back in October 2024&lt;/a&gt;. OpenAI claim 1,000 tokens/second for their new model, and I expect it will prove to be a ferociously useful partner for hands-on iterative coding sessions.&lt;/p&gt;
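&lt;p&gt;To make those speeds concrete, here's the rough wall-clock time to stream a 2,000 token response at the rates mentioned in this post - my arithmetic, not a benchmark:&lt;/p&gt;

```python
# Seconds to emit a 2,000-token response at various generation speeds.
response_tokens = 2000
for label, tokens_per_second in [
    ("GPT-5.3-Codex-Spark, claimed", 1000),
    ("Cerebras Llama 3.1 70B demo", 2000),
]:
    seconds = response_tokens / tokens_per_second
    print(f"{label}: {seconds:.1f}s")
```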
&lt;p&gt;It's not yet clear what the pricing will look like for this new model.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cerebras"&gt;cerebras&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/codex-cli"&gt;codex-cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-performance"&gt;llm-performance&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="cerebras"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="codex-cli"/><category term="llm-performance"/></entry><entry><title>OpenAI's new open weight (Apache 2) models are really good</title><link href="https://simonwillison.net/2025/Aug/5/gpt-oss/#atom-tag" rel="alternate"/><published>2025-08-05T20:33:13+00:00</published><updated>2025-08-05T20:33:13+00:00</updated><id>https://simonwillison.net/2025/Aug/5/gpt-oss/#atom-tag</id><summary type="html">
    &lt;p&gt;The long promised &lt;a href="https://openai.com/index/introducing-gpt-oss/"&gt;OpenAI open weight models are here&lt;/a&gt;, and they are &lt;em&gt;very&lt;/em&gt; impressive. They're available under proper open source licenses - Apache 2.0 - and come in two sizes, 120B and 20B.&lt;/p&gt;
&lt;p&gt;OpenAI's own benchmarks are eyebrow-raising - emphasis mine:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The &lt;strong&gt;gpt-oss-120b&lt;/strong&gt; model achieves &lt;strong&gt;near-parity with OpenAI o4-mini&lt;/strong&gt; on core reasoning benchmarks, while running efficiently on a single 80 GB GPU. The &lt;strong&gt;gpt-oss-20b&lt;/strong&gt; model delivers &lt;strong&gt;similar results to OpenAI o3‑mini&lt;/strong&gt; on common benchmarks and can run on edge devices with just 16 GB of memory, making it ideal for on-device use cases, local inference, or rapid iteration without costly infrastructure.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;o4-mini and o3-mini are &lt;em&gt;really good&lt;/em&gt; proprietary models - I was not expecting the open weights releases to be anywhere near that class, especially given their small sizes. That gpt-oss-20b model should run quite comfortably on a Mac laptop with 32GB of RAM.&lt;/p&gt;
&lt;p&gt;Both models are mixture-of-experts:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;gpt-oss-120b activates 5.1B parameters per token, while gpt-oss-20b activates 3.6B. The models have 117b and 21b total parameters respectively.&lt;/p&gt;
&lt;/blockquote&gt;
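&lt;p&gt;A quick back-of-envelope calculation (mine, not OpenAI's) shows how sparse those mixtures are - roughly 4.4% of the 120b model's parameters and 17.1% of the 20b model's parameters are active for any given token:&lt;/p&gt;

```python
# Active parameter fraction for each mixture-of-experts model,
# using the totals quoted in the announcement (in billions).
models = {
    "gpt-oss-120b": {"total_b": 117, "active_b": 5.1},
    "gpt-oss-20b": {"total_b": 21, "active_b": 3.6},
}
for name, params in models.items():
    fraction = params["active_b"] / params["total_b"]
    print(f"{name}: {fraction:.1%} of parameters active per token")
```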
&lt;p&gt;Something that surprised me even more about the benchmarks was the scores for general knowledge based challenges. I can just about believe they managed to train a strong reasoning model that fits in 20B parameters, but these models score highly on benchmarks like "GPQA Diamond (without tools) PhD-level science questions" too:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;o3 — 83.3%&lt;/li&gt;
&lt;li&gt;o4-mini — 81.4%&lt;/li&gt;
&lt;li&gt;gpt-oss-120b — 80.1%&lt;/li&gt;
&lt;li&gt;o3-mini — 77%&lt;/li&gt;
&lt;li&gt;gpt-oss-20b — 71.5%&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A lot of these benchmarks are edging towards saturated.&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#running-gpt-oss-20b-on-my-mac-with-lm-studio"&gt;Running gpt-oss-20b on my Mac with LM Studio&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#pelican-on-reasoning-low"&gt;Pelican on reasoning=low&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#pelican-on-reasoning-medium"&gt;Pelican on reasoning=medium&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#pelican-on-reasoning-high"&gt;Pelican on reasoning=high&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#space-invaders-with-gpt-oss-20b"&gt;Space invaders with gpt-oss-20b&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#trying-gpt-oss-120b-via-api-providers"&gt;Trying gpt-oss-120b via API providers&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#llama-cpp-is-coming-very-shortly"&gt;llama.cpp is coming very shortly&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#gpt-oss-20b-in-ollama"&gt;gpt-oss:20b in Ollama&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#the-model-card"&gt;Training details from the model card&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#openai-harmony-a-new-format-for-prompt-templates"&gt;OpenAI Harmony, a new format for prompt templates&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#the-open-question-for-me-how-good-is-tool-calling-"&gt;The open question for me: how good is tool calling?&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#china"&gt;Competing with the Chinese open models&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="running-gpt-oss-20b-on-my-mac-with-lm-studio"&gt;Running gpt-oss-20b on my Mac with LM Studio&lt;/h4&gt;
&lt;p&gt;There are already a bunch of different ways to run these models - OpenAI partnered with numerous organizations in advance of the release.&lt;/p&gt;
&lt;p&gt;I decided to start with &lt;a href="https://lmstudio.ai/"&gt;LM Studio&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I had to update to the most recent version of the app, then install the new model from &lt;a href="https://lmstudio.ai/models/openai/gpt-oss-20b"&gt;their openai/gpt-oss-20b&lt;/a&gt; page.&lt;/p&gt;
&lt;p&gt;First impressions: this is a &lt;em&gt;really good&lt;/em&gt; model, and it somehow runs using just 11.72GB of my system RAM.&lt;/p&gt;
&lt;p&gt;The model supports three reasoning efforts: low, medium and high. LM Studio makes those available via a dropdown.&lt;/p&gt;
&lt;p&gt;Let's try "Generate an SVG of a pelican riding a bicycle":&lt;/p&gt;
&lt;h4 id="pelican-on-reasoning-low"&gt;Pelican on reasoning=low&lt;/h4&gt;
&lt;p&gt;I started &lt;a href="https://gist.github.com/simonw/b71394cc85fe0f048e376392e41586da"&gt;with low&lt;/a&gt;. It thought for 0.07 seconds and then output this (at 39 tokens a second):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-20-low.png" alt="" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Except... it output invalid SVG. One of the path elements looked like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;!-- Frame --&amp;gt;
&amp;lt;path d="
    M150,250          &amp;lt;!-- rear wheel center --&amp;gt;
    L300,120          &amp;lt;!-- top tube to front --&amp;gt;
    L450,250          &amp;lt;!-- chain stays back to front --&amp;gt;
    L300,350          &amp;lt;!-- seat stays down --&amp;gt;
    Z"
    fill="#e0e0e0" stroke="#555" stroke-width="4"/&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;But you can't put comments inside attributes like that. I fixed this to get the above image.&lt;/p&gt;
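&lt;p&gt;I fixed it by hand, but the repair is mechanical enough to automate. Here's a hypothetical sketch (my code, not part of the model output) that strips every XML comment and collapses the leftover whitespace, which is safe for rendering since the model only used comments as annotations:&lt;/p&gt;

```python
import re

# XML comments are not allowed inside attribute values. This pattern
# matches a full comment; \x3c and \x3e are the angle bracket characters.
COMMENT = re.compile(r"\x3c!--.*?--\x3e", re.S)

def strip_attr_comments(svg_text):
    # Remove every comment, then collapse runs of whitespace so the
    # path data inside d="..." reads as a single line.
    cleaned = COMMENT.sub("", svg_text)
    return re.sub(r"\s+", " ", cleaned)
```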
&lt;h4 id="pelican-on-reasoning-medium"&gt;Pelican on reasoning=medium&lt;/h4&gt;
&lt;p&gt;I tried again &lt;a href="https://gist.github.com/simonw/642e9e371387fc59a5aad25dcd41b0cc"&gt;on medium&lt;/a&gt;. This time it thought for 4.44 seconds and gave me this, at 55 tokens/second:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-20-medium.png" alt="" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4 id="pelican-on-reasoning-high"&gt;Pelican on reasoning=high&lt;/h4&gt;
&lt;p&gt;And then &lt;a href="https://gist.github.com/simonw/164a3f1b4b1a971637af279a0ac49be3"&gt;on high&lt;/a&gt;. My first attempt here failed because it hit the default LM Studio 4096 context limit - so I reloaded the model and set the new limit to the maximum possible.&lt;/p&gt;
&lt;p&gt;It jumped up to using 17.35GB RAM (I imagine it might use a lot more with a longer prompt) and after thinking for 5 minutes and 50 seconds (!) gave me this:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-20-high.png" alt="" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;It thought &lt;em&gt;really hard&lt;/em&gt; to get there. Here's a snippet of &lt;a href="https://gist.github.com/simonw/164a3f1b4b1a971637af279a0ac49be3"&gt;that transcript&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;But maybe we want seat below body bottom? Actually seat should be between body and pedals, not above body bottom. Wait, typical bicycle: seat post goes from bottom bracket upward to the seat. The seat sits on top of the seat tube (vertical). So the seat is above bottom bracket. In our representation, the seat is at y=250-255. Body center at 245, body bottom at 265. So body bottom is below seat. That means the pelican's body is below the seat? Actually no: body bottom extends to 265; seat top is at 250; so seat sits above body bottom by about 15 units. That seems plausible because seat should be slightly above body, but body bottom is at 265 &amp;gt; seat top 255. [...]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I've seen this from other local reasoning models too: they can really overthink this kind of problem! I don't know if there's any pelican SVG that's worth waiting nearly six minutes for.&lt;/p&gt;
&lt;h4 id="space-invaders-with-gpt-oss-20b"&gt;Space invaders with gpt-oss-20b&lt;/h4&gt;
&lt;p&gt;Given how long high took I switched back to medium for my next experiment:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Write an HTML and JavaScript page implementing space invaders&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It &lt;a href="https://gist.github.com/simonw/63d7d8c43ae2ac93c214325bd6d607e4"&gt;thought for 10.78 seconds&lt;/a&gt; and produced this:&lt;/p&gt;

&lt;div style="max-width: 100%; margin-bottom: 0.4em"&gt;
    &lt;video controls="controls" preload="none" aria-label="Space Invaders" poster="https://static.simonwillison.net/static/2025/space-invaders-gpt-20.jpg" loop="loop" style="width: 100%; height: auto;" muted="muted"&gt;
        &lt;source src="https://static.simonwillison.net/static/2025/space-invaders-gpt-20.mp4" type="video/mp4" /&gt;
    &lt;/video&gt;
&lt;/div&gt;

&lt;p&gt;You can &lt;a href="https://tools.simonwillison.net/space-invaders-gpt-oss-20b-mxfp4-medium"&gt;play that here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It's not the best I've seen - I was more impressed &lt;a href="https://simonwillison.net/2025/Jul/29/space-invaders/"&gt;by GLM 4.5 Air&lt;/a&gt; - but it's very competent for a model that only uses 12GB of my RAM (GLM 4.5 Air used 47GB).&lt;/p&gt;
&lt;h4 id="trying-gpt-oss-120b-via-api-providers"&gt;Trying gpt-oss-120b via API providers&lt;/h4&gt;
&lt;p&gt;I don't quite have the resources on my laptop to run the larger model. Thankfully it's already being hosted by a number of different API providers.&lt;/p&gt;
&lt;p&gt;OpenRouter already &lt;a href="https://openrouter.ai/openai/gpt-oss-120b/providers"&gt;lists three&lt;/a&gt; - Fireworks, Groq and Cerebras. (Update: now also Parasail and Baseten.)&lt;/p&gt;
&lt;p&gt;Cerebras is &lt;em&gt;fast&lt;/em&gt;, so I decided to try them first.&lt;/p&gt;
&lt;p&gt;I installed the &lt;a href="https://github.com/irthomasthomas/llm-cerebras"&gt;llm-cerebras&lt;/a&gt; plugin and ran the &lt;code&gt;refresh&lt;/code&gt; command to ensure it had their latest models:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install -U llm-cerebras jsonschema
llm cerebras refresh&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;(Installing jsonschema worked around a warning message.)&lt;/p&gt;
&lt;p&gt;Output:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Refreshed 10 Cerebras models:
  - cerebras-deepseek-r1-distill-llama-70b
  - cerebras-gpt-oss-120b
  - cerebras-llama-3.3-70b
  - cerebras-llama-4-maverick-17b-128e-instruct
  - cerebras-llama-4-scout-17b-16e-instruct
  - cerebras-llama3.1-8b
  - cerebras-qwen-3-235b-a22b-instruct-2507
  - cerebras-qwen-3-235b-a22b-thinking-2507
  - cerebras-qwen-3-32b
  - cerebras-qwen-3-coder-480b
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m cerebras-gpt-oss-120b \
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Generate an SVG of a pelican riding a bicycle&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Cerebras runs the new model at between 2,000 and 4,000 tokens per second!&lt;/p&gt;
&lt;p&gt;To my surprise this one &lt;a href="https://gist.github.com/simonw/4c685f19f1a93b68eacb627125e36be4"&gt;had the same comments-in-attributes bug&lt;/a&gt; that we saw with oss-20b earlier. I fixed those and got this pelican:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-120-cerebras.jpg" alt="Yellow and not great pelican, quite a good bicycle if a bit sketchy." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;That bug appears intermittently - I've not seen it on some of my other runs of the same prompt.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt; plugin also provides access to the models, balanced across the underlying providers. You can use that like so:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-openrouter
llm keys &lt;span class="pl-c1"&gt;set&lt;/span&gt; openrouter
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Paste API key here&lt;/span&gt;
llm -m openrouter/openai/gpt-oss-120b &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Say hi&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h4 id="llama-cpp-is-coming-very-shortly"&gt;llama.cpp is coming very shortly&lt;/h4&gt;
&lt;p&gt;The &lt;code&gt;llama.cpp&lt;/code&gt; &lt;a href="https://github.com/ggml-org/llama.cpp/pull/15091"&gt;pull request for gpt-oss&lt;/a&gt; was landed less than an hour ago. It's worth browsing through the code - a &lt;em&gt;lot&lt;/em&gt; of work went into supporting this new model, spanning 48 commits to 83 different files. Hopefully this will land in the &lt;a href="https://formulae.brew.sh/formula/llama.cpp"&gt;llama.cpp Homebrew package&lt;/a&gt; within the next day or so, which should provide a convenient way to run the model via &lt;code&gt;llama-server&lt;/code&gt; and friends.&lt;/p&gt;
&lt;h4 id="gpt-oss-20b-in-ollama"&gt;gpt-oss:20b in Ollama&lt;/h4&gt;
&lt;p&gt;Ollama &lt;a href="https://ollama.com/library/gpt-oss"&gt;also have gpt-oss&lt;/a&gt;, requiring an update to their app.&lt;/p&gt;
&lt;p&gt;I fetched that 14GB model like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;ollama pull gpt-oss:20b&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Now I can use it with the new Ollama native app, or access it from &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-ollama
llm -m gpt-oss:20b &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Hi&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This also appears to use around 13.26GB of system memory while running a prompt.&lt;/p&gt;
&lt;p&gt;Ollama also launched &lt;a href="https://ollama.com/turbo"&gt;Ollama Turbo&lt;/a&gt; today, offering the two OpenAI models as a paid hosted service:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;Turbo is a new way to run open models using datacenter-grade hardware. Many new models are too large to fit on widely available GPUs, or run very slowly. Ollama Turbo provides a way to run these models fast while using Ollama's App, CLI, and API. &lt;/p&gt;&lt;/blockquote&gt;
&lt;h4 id="the-model-card"&gt;Training details from the model card&lt;/h4&gt;
&lt;p&gt;Here are some interesting notes about how the models were trained from &lt;a href="https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7637/oai_gpt-oss_model_card.pdf"&gt;the model card&lt;/a&gt; (PDF):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Data&lt;/strong&gt;: We train the models on a text-only dataset with trillions of tokens, with a focus on STEM, coding, and general knowledge. To improve the safety of the model, we filtered the data for harmful content in pre-training, especially around hazardous biosecurity knowledge, by reusing the CBRN pre-training filters from GPT-4o. Our model has a knowledge cutoff of June 2024.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Training&lt;/strong&gt;: The gpt-oss models trained on NVIDIA H100 GPUs using the PyTorch framework with expert-optimized Triton kernels. The training run for gpt-oss-120b required 2.1 million H100-hours to complete, with gpt-oss-20b needing almost 10x fewer. [...]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Thunder Compute's article &lt;a href="https://www.thundercompute.com/blog/nvidia-h100-pricing"&gt;NVIDIA H100 Pricing (August 2025): Cheapest On-Demand Cloud GPU Rates&lt;/a&gt; lists prices from around $2/hour to $11/hour, which would indicate a training cost for the 120b model of between $4.2m and $23.1m, and for the 20b of between $420,000 and $2.3m.&lt;/p&gt;
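&lt;p&gt;Those estimates come straight from multiplying the quoted H100-hours by the hourly rates:&lt;/p&gt;

```python
# Rough training-cost estimate: quoted H100-hours times the range of
# on-demand hourly prices. The 20b figure uses the model card's
# "almost 10x fewer" hours as an approximation.
h100_hours_120b = 2_100_000
h100_hours_20b = h100_hours_120b / 10
for rate in (2, 11):
    cost_120b = h100_hours_120b * rate
    cost_20b = h100_hours_20b * rate
    print(f"${rate}/hour: 120b ${cost_120b:,.0f}, 20b ${cost_20b:,.0f}")
```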
&lt;blockquote&gt;
&lt;p&gt;After pre-training, we post-train the models using similar CoT RL techniques as OpenAI o3. This procedure teaches the models how to reason and solve problems using CoT and teaches the model how to use tools. Because of the similar RL techniques, these models have a personality similar to models served in our first-party products like ChatGPT. Our training dataset consists of a wide range of problems from coding, math, science, and more.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The models have additional special training to help them use web browser and Python (Jupyter notebook) tools more effectively:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;During post-training, we also teach the models to use different agentic tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A browsing tool, that allows the model to call search and open functions to interact with
the web. This aids factuality and allows the models to fetch info beyond their knowledge
cutoff.&lt;/li&gt;
&lt;li&gt;A python tool, which allows the model to run code in a stateful Jupyter notebook environment.&lt;/li&gt;
&lt;li&gt;Arbitrary developer functions, where one can specify function schemas in a &lt;code&gt;Developer&lt;/code&gt;
message similar to the OpenAI API. The definition of function is done within our harmony
format.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;There's a corresponding &lt;a href="https://github.com/openai/gpt-oss?tab=readme-ov-file#python"&gt;section about Python tool usage&lt;/a&gt; in the &lt;code&gt;openai/gpt-oss&lt;/code&gt; repository README.&lt;/p&gt;


&lt;h4 id="openai-harmony-a-new-format-for-prompt-templates"&gt;OpenAI Harmony, a new format for prompt templates&lt;/h4&gt;
&lt;p&gt;One of the gnarliest parts of implementing harnesses for LLMs is handling the prompt template format.&lt;/p&gt;
&lt;p&gt;Modern prompts are complicated beasts. They need to model user vs. assistant conversation turns, tool calls, reasoning traces and an increasing number of other complex patterns.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/openai/harmony"&gt;openai/harmony&lt;/a&gt; is a brand new open source project from OpenAI (again, Apache 2) which implements a new response format that was created for the &lt;code&gt;gpt-oss&lt;/code&gt; models. It's clearly inspired by their new-ish &lt;a href="https://openai.com/index/new-tools-for-building-agents/"&gt;Responses API&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The format is described in the new &lt;a href="https://cookbook.openai.com/articles/openai-harmony"&gt;OpenAI Harmony Response Format&lt;/a&gt; cookbook document. It introduces some concepts that I've not seen in open weight models before:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;system&lt;/code&gt;, &lt;code&gt;developer&lt;/code&gt;, &lt;code&gt;user&lt;/code&gt;, &lt;code&gt;assistant&lt;/code&gt; and &lt;code&gt;tool&lt;/code&gt; roles - many other models only use user and assistant, and sometimes system and tool.&lt;/li&gt;
&lt;li&gt;Three different channels for output: &lt;code&gt;final&lt;/code&gt;, &lt;code&gt;analysis&lt;/code&gt; and &lt;code&gt;commentary&lt;/code&gt;. Only the &lt;code&gt;final&lt;/code&gt; channel is intended to be visible to users by default. &lt;code&gt;analysis&lt;/code&gt; is for chain of thought and &lt;code&gt;commentary&lt;/code&gt; is sometimes used for tools.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That channels concept has been present in ChatGPT for a few months, starting with the release of o3.&lt;/p&gt;
&lt;p&gt;The details of the new tokens used by Harmony caught my eye:&lt;/p&gt;
&lt;center&gt;
&lt;table&gt;
  &lt;tbody&gt;&lt;tr&gt;
    &lt;th&gt;Token&lt;/th&gt;
    &lt;th&gt;Purpose&lt;/th&gt;
    &lt;th&gt;ID&lt;/th&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&amp;lt;|start|&amp;gt;&lt;/td&gt;
    &lt;td&gt;Start of message header&lt;/td&gt;
    &lt;td&gt;200006&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&amp;lt;|end|&amp;gt;&lt;/td&gt;
    &lt;td&gt;End of message&lt;/td&gt;
    &lt;td&gt;200007&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&amp;lt;|message|&amp;gt;&lt;/td&gt;
    &lt;td&gt;Start of message content&lt;/td&gt;
    &lt;td&gt;200008&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&amp;lt;|channel|&amp;gt;&lt;/td&gt;
    &lt;td&gt;Start of channel info&lt;/td&gt;
    &lt;td&gt;200005&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&amp;lt;|constrain|&amp;gt;&lt;/td&gt;
    &lt;td&gt;Data type for tool call&lt;/td&gt;
    &lt;td&gt;200003&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&amp;lt;|return|&amp;gt;&lt;/td&gt;
    &lt;td&gt;Stop after response&lt;/td&gt;
    &lt;td&gt;200002&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&amp;lt;|call|&amp;gt;&lt;/td&gt;
    &lt;td&gt;Call a tool&lt;/td&gt;
    &lt;td&gt;200012&lt;/td&gt;
  &lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/center&gt;
&lt;p&gt;Those token IDs are particularly important. They are part of a new token vocabulary called &lt;code&gt;o200k_harmony&lt;/code&gt;, which landed in OpenAI's tiktoken tokenizer library &lt;a href="https://github.com/openai/tiktoken/commit/3591ff175d6a80efbe4fcc7f0e219ddd4b8c52f1"&gt;this morning&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In the past I've seen models get confused by special tokens - try pasting &lt;code&gt;&amp;lt;|end|&amp;gt;&lt;/code&gt; into a model and see what happens.&lt;/p&gt;
&lt;p&gt;Having these special instruction tokens formally map to dedicated token IDs should hopefully be a whole lot more robust!&lt;/p&gt;
&lt;p&gt;The Harmony repo itself includes a Rust library and a Python library (wrapping that Rust library) for working with the new format in a much more ergonomic way.&lt;/p&gt;
&lt;p&gt;I tried one of their demos using &lt;code&gt;uv run&lt;/code&gt; to turn it into a shell one-liner:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;uv run --python 3.12 --with openai-harmony python -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;from openai_harmony import *&lt;/span&gt;
&lt;span class="pl-s"&gt;from openai_harmony import DeveloperContent&lt;/span&gt;
&lt;span class="pl-s"&gt;enc = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)&lt;/span&gt;
&lt;span class="pl-s"&gt;convo = Conversation.from_messages([&lt;/span&gt;
&lt;span class="pl-s"&gt;    Message.from_role_and_content(&lt;/span&gt;
&lt;span class="pl-s"&gt;        Role.SYSTEM,&lt;/span&gt;
&lt;span class="pl-s"&gt;        SystemContent.new(),&lt;/span&gt;
&lt;span class="pl-s"&gt;    ),&lt;/span&gt;
&lt;span class="pl-s"&gt;    Message.from_role_and_content(&lt;/span&gt;
&lt;span class="pl-s"&gt;        Role.DEVELOPER,&lt;/span&gt;
&lt;span class="pl-s"&gt;        DeveloperContent.new().with_instructions("Talk like a pirate!")&lt;/span&gt;
&lt;span class="pl-s"&gt;    ),&lt;/span&gt;
&lt;span class="pl-s"&gt;    Message.from_role_and_content(Role.USER, "Arrr, how be you?"),&lt;/span&gt;
&lt;span class="pl-s"&gt;])&lt;/span&gt;
&lt;span class="pl-s"&gt;tokens = enc.render_conversation_for_completion(convo, Role.ASSISTANT)&lt;/span&gt;
&lt;span class="pl-s"&gt;print(tokens)&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Which outputs:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;[200006, 17360, 200008, 3575, 553, 17554, 162016, 11, 261, 4410, 6439, 2359, 22203, 656, 7788, 17527, 558, 87447, 100594, 25, 220, 1323, 19, 12, 3218, 279, 30377, 289, 25, 14093, 279, 2, 13888, 18403, 25, 8450, 11, 49159, 11, 1721, 13, 21030, 2804, 413, 7360, 395, 1753, 3176, 13, 200007, 200006, 77944, 200008, 2, 68406, 279, 37992, 1299, 261, 96063, 0, 200007, 200006, 1428, 200008, 8977, 81, 11, 1495, 413, 481, 30, 200007, 200006, 173781]&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Note those token IDs like &lt;code&gt;200006&lt;/code&gt; corresponding to the special tokens listed above.&lt;/p&gt;
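&lt;p&gt;A quick way to sanity-check that output (my own sketch, using only the IDs from the table above) is to count the structural markers: the stream should contain three end-of-message tokens for the three rendered messages, plus a fourth start token opening the assistant turn the model is being asked to complete:&lt;/p&gt;

```python
# Special Harmony token IDs from the table above, mapped to short names.
SPECIAL = {
    200006: "start",      # start of message header
    200007: "end",        # end of message
    200008: "message",    # start of message content
    200005: "channel",    # start of channel info
    200003: "constrain",  # data type for tool call
    200002: "return",     # stop after response
    200012: "call",       # call a tool
}

def boundary_counts(tokens):
    # Returns (number of header starts, number of message ends).
    starts = sum(1 for t in tokens if SPECIAL.get(t) == "start")
    ends = sum(1 for t in tokens if SPECIAL.get(t) == "end")
    return starts, ends
```

&lt;p&gt;Running that against the token list above reports four starts and three ends, exactly as expected.&lt;/p&gt;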
&lt;h4 id="the-open-question-for-me-how-good-is-tool-calling-"&gt;The open question for me: how good is tool calling?&lt;/h4&gt;
&lt;p&gt;There's one aspect of these models that I haven't explored in detail yet: &lt;strong&gt;tool calling&lt;/strong&gt;. How these work is clearly a big part of the new Harmony format, but the packages I'm using myself (around my own &lt;a href="https://simonwillison.net/2025/May/27/llm-tools/"&gt;LLM tool calling&lt;/a&gt; support) need various tweaks and fixes to start working with that new mechanism.&lt;/p&gt;
&lt;p&gt;Tool calling currently represents my biggest disappointment with local models that I've run on my own machine. I've been able to get them to perform simple single calls, but the state of the art these days is wildly more ambitious than that.&lt;/p&gt;
&lt;p&gt;Systems like Claude Code can make dozens if not hundreds of tool calls over the course of a single session, each one adding more context and information to a single conversation with an underlying model.&lt;/p&gt;
&lt;p&gt;My experience to date has been that local models are unable to handle these lengthy conversations. I'm not sure if that's inherent to the limitations of my own machine, or if it's something that the right model architecture and training could overcome.&lt;/p&gt;
&lt;p&gt;OpenAI make big claims about the tool calling capabilities of these new models. I'm looking forward to seeing how well they perform in practice.&lt;/p&gt;

&lt;h4 id="china"&gt;Competing with the Chinese open models&lt;/h4&gt;

&lt;p&gt;I've been writing a &lt;em&gt;lot&lt;/em&gt; about the &lt;a href="https://simonwillison.net/tags/ai-in-china/"&gt;flurry of excellent open weight models&lt;/a&gt; released by Chinese AI labs over the past few months - all of them very capable and most of them under Apache 2 or MIT licenses.&lt;/p&gt;

&lt;p&gt;Just last week &lt;a href="https://simonwillison.net/2025/Jul/30/chinese-models/"&gt;I said&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Something that has become undeniable this month is that the best available open weight models now come from the Chinese AI labs.&lt;/p&gt;
&lt;p&gt;I continue to have a lot of love for Mistral, Gemma and Llama but my feeling is that Qwen, Moonshot and Z.ai have positively smoked them over the course of July. [...]&lt;/p&gt;
&lt;p&gt;I can't help but wonder if part of the reason for the delay in release of OpenAI's open weights model comes from a desire to be notably better than this truly impressive lineup of Chinese models.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;With the release of the gpt-oss models that statement no longer holds true. I'm waiting for the dust to settle and the independent benchmarks (that are more credible than my ridiculous pelicans) to roll in, but I think it's likely that OpenAI now offer the best available open weights models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: Independent evaluations are beginning to roll in. Here's &lt;a href="https://x.com/artificialanlys/status/1952887733803991070"&gt;Artificial Analysis&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;gpt-oss-120b is the most intelligent American open weights model, comes behind DeepSeek R1 and Qwen3 235B in intelligence but offers efficiency benefits [...]&lt;/p&gt;
&lt;p&gt;While the larger gpt-oss-120b does not come in above DeepSeek R1 0528’s score of 59 or Qwen3 235B 2507’s score of 64, it is notable that it is significantly smaller in both total and active parameters than both of those models.&lt;/p&gt;&lt;/blockquote&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-tool-use"&gt;llm-tool-use&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cerebras"&gt;cerebras&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/space-invaders"&gt;space-invaders&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-oss"&gt;gpt-oss&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="open-source"/><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="llm-tool-use"/><category term="cerebras"/><category term="ollama"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/><category term="lm-studio"/><category term="space-invaders"/><category term="gpt-oss"/></entry><entry><title>Faster inference</title><link href="https://simonwillison.net/2025/Aug/1/faster-inference/#atom-tag" rel="alternate"/><published>2025-08-01T23:28:26+00:00</published><updated>2025-08-01T23:28:26+00:00</updated><id>https://simonwillison.net/2025/Aug/1/faster-inference/#atom-tag</id><summary type="html">
    &lt;p&gt;Two interesting examples of inference speed as a flagship feature of LLM services today.&lt;/p&gt;
&lt;p&gt;First, Cerebras &lt;a href="https://www.cerebras.ai/blog/introducing-cerebras-code"&gt;announced two new monthly plans&lt;/a&gt; for their extremely high speed hosted model service: Cerebras Code Pro ($50/month, 1,000 messages a day) and Cerebras Code Max ($200/month, 5,000/day). The model they are selling here is Qwen's Qwen3-Coder-480B-A35B-Instruct, likely the best available open weights coding model right now and one that was released &lt;a href="https://simonwillison.net/2025/Jul/22/qwen3-coder/"&gt;just ten days ago&lt;/a&gt;. Ten days from model release to third-party subscription service feels like some kind of record.&lt;/p&gt;
&lt;p&gt;Cerebras claim they can serve the model at an astonishing 2,000 tokens per second - four times the speed of Claude Sonnet 4 in &lt;a href="https://x.com/cerebrassystems/status/1951340566077440464"&gt;their demo video&lt;/a&gt;.&lt;/p&gt;
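&lt;p&gt;Those speeds translate directly into wall-clock time. A rough back-of-envelope sketch - the 2,000 tokens/second figure is Cerebras's claim, and the 500 tokens/second comparison point is my assumption, implied by "four times the speed":&lt;/p&gt;

```python
# Rough wall-clock time to stream a response at different speeds.
# 2,000 tok/s is Cerebras's quoted figure; 500 tok/s is the implied
# comparison point (one quarter, per "four times the speed").

def seconds_to_generate(tokens, tokens_per_second):
    return tokens / tokens_per_second

response_tokens = 2000  # a longish code-generation response

fast = seconds_to_generate(response_tokens, 2000)  # 1.0 second
slow = seconds_to_generate(response_tokens, 500)   # 4.0 seconds
```

&lt;p&gt;One second versus four for a substantial chunk of code is the difference between an interactive loop and a wait.&lt;/p&gt;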
&lt;p&gt;Also today, Moonshot &lt;a href="https://x.com/kimi_moonshot/status/1951168907131355598"&gt;announced&lt;/a&gt; a new hosted version of their trillion parameter Kimi K2 model called &lt;code&gt;kimi-k2-turbo-preview&lt;/code&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;🆕 Say hello to kimi-k2-turbo-preview
Same model. Same context. NOW 4× FASTER.&lt;/p&gt;
&lt;p&gt;⚡️ From 10 tok/s to 40 tok/s.&lt;/p&gt;
&lt;p&gt;💰 Limited-Time Launch Price (50% off until Sept 1)&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;$0.30 / million input tokens (cache hit)&lt;/li&gt;
&lt;li&gt;$1.20 / million input tokens (cache miss)&lt;/li&gt;
&lt;li&gt;$5.00 / million output tokens&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;👉 Explore more: &lt;a href="https://platform.moonshot.ai"&gt;platform.moonshot.ai&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is twice the price of their regular model for 4x the speed (increasing to 4x the price in September). No details yet on how they achieved the speed-up.&lt;/p&gt;
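&lt;p&gt;To make those numbers concrete, here's the cost of a hypothetical request at the quoted launch prices, and the same request once the 50% launch discount ends (I'm assuming the listed prices simply double after September 1):&lt;/p&gt;

```python
# Cost of one request against kimi-k2-turbo-preview at the quoted
# launch prices, and the same request after the 50% discount ends
# (assumed: prices double after Sept 1).

def cost_usd(input_tokens, output_tokens, input_per_m, output_per_m):
    return (input_tokens * input_per_m + output_tokens * output_per_m) / 1_000_000

# A request with 100,000 cache-miss input tokens and 10,000 output tokens
launch = cost_usd(100_000, 10_000, 1.20, 5.00)  # 0.17 USD
full_price = 2 * launch                         # 0.34 USD after the discount ends
```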
&lt;p&gt;I am interested to see how much market demand there is for faster performance like this. I've &lt;a href="https://simonwillison.net/2024/Oct/31/cerebras-coder/"&gt;experimented with Cerebras in the past&lt;/a&gt; and found that the speed really does make iterating on code with live previews feel a whole lot more interactive.&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cerebras"&gt;cerebras&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/moonshot"&gt;moonshot&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/kimi"&gt;kimi&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-performance"&gt;llm-performance&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="qwen"/><category term="cerebras"/><category term="llm-pricing"/><category term="ai-in-china"/><category term="moonshot"/><category term="kimi"/><category term="llm-performance"/></entry><entry><title>Magistral — the first reasoning model by Mistral AI</title><link href="https://simonwillison.net/2025/Jun/10/magistral/#atom-tag" rel="alternate"/><published>2025-06-10T16:13:22+00:00</published><updated>2025-06-10T16:13:22+00:00</updated><id>https://simonwillison.net/2025/Jun/10/magistral/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://mistral.ai/news/magistral"&gt;Magistral — the first reasoning model by Mistral AI&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Mistral's first reasoning model is out today, in two sizes. There's a 24B Apache 2 licensed open-weights model called Magistral Small (actually Magistral-Small-2506), and a larger API-only model called Magistral Medium.&lt;/p&gt;
&lt;p&gt;Magistral Small is available as &lt;a href="https://huggingface.co/mistralai/Magistral-Small-2506"&gt;mistralai/Magistral-Small-2506&lt;/a&gt; on Hugging Face. From that model card:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Context Window&lt;/strong&gt;: A 128k context window, but performance might degrade past 40k. Hence we recommend setting the maximum model length to 40k.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Mistral also released an official GGUF version, &lt;a href="https://huggingface.co/mistralai/Magistral-Small-2506_gguf"&gt;Magistral-Small-2506_gguf&lt;/a&gt;, which I ran successfully using Ollama like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama pull hf.co/mistralai/Magistral-Small-2506_gguf:Q8_0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That fetched a 25GB file. I ran prompts using a chat session with &lt;a href="https://github.com/taketwo/llm-ollama"&gt;llm-ollama&lt;/a&gt; like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm chat -m hf.co/mistralai/Magistral-Small-2506_gguf:Q8_0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's what I got for "Generate an SVG of a pelican riding a bicycle" (&lt;a href="https://gist.github.com/simonw/7aaac8217f43be04886737d67c08ecca"&gt;transcript here&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&lt;img alt="Blue sky and what looks like an eagle flying towards the viewer." src="https://static.simonwillison.net/static/2025/magistral-pelican.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;It's disappointing that the GGUF doesn't support function calling yet - hopefully a community variant can add that, it's one of the best ways I know of to unlock the potential of these reasoning models.&lt;/p&gt;
&lt;p&gt;I just noticed that Ollama have their own &lt;a href="https://ollama.com/library/magistral"&gt;Magistral model&lt;/a&gt; too, which can be accessed using:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama pull magistral:latest
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That gets you a 14GB &lt;code&gt;q4_K_M&lt;/code&gt; quantization - other options can be found in the &lt;a href="https://ollama.com/library/magistral/tags"&gt;full list of Ollama magistral tags&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;One thing that caught my eye in the Magistral announcement:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Legal, finance, healthcare, and government professionals get traceable reasoning that meets compliance requirements. Every conclusion can be traced back through its logical steps, providing auditability for high-stakes environments with domain-specialized AI.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I guess this means the reasoning traces are fully visible and not redacted in any way - interesting to see Mistral trying to turn that into a feature that's attractive to the business clients they are most interested in appealing to.&lt;/p&gt;
&lt;p&gt;Also from that announcement:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Our early tests indicated that Magistral is an excellent creative companion. We highly recommend it for creative writing and storytelling, with the model capable of producing coherent or — if needed — delightfully eccentric copy.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I haven't seen a reasoning model promoted for creative writing in this way before.&lt;/p&gt;
&lt;p&gt;You can try out Magistral Medium by selecting the new "Thinking" option in Mistral's &lt;a href="https://chat.mistral.ai/"&gt;Le Chat&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of a chat interface showing settings options. At the top is a text input field that says &amp;quot;Ask le Chat or @mention an agent&amp;quot; with a plus button, lightbulb &amp;quot;Think&amp;quot; button with up arrow, grid &amp;quot;Tools&amp;quot; button, and settings icon. Below are two toggle options: &amp;quot;Pure Thinking&amp;quot; with description &amp;quot;Best option for math + coding. Disables tools.&amp;quot; (toggle is off), and &amp;quot;10x Speed&amp;quot; with lightning bolt icon and &amp;quot;PRO - 2 remaining today&amp;quot; label, described as &amp;quot;Same quality at 10x the speed.&amp;quot; (toggle is on and green)." src="https://static.simonwillison.net/static/2025/magistral-le-chat.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;They have options for "Pure Thinking" and a separate option for "10x speed", which runs Magistral Medium at 10x the speed using &lt;a href="https://www.cerebras.ai/"&gt;Cerebras&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The new models are also available through &lt;a href="https://docs.mistral.ai/api/"&gt;the Mistral API&lt;/a&gt;. You can access them by installing &lt;a href="https://github.com/simonw/llm-mistral"&gt;llm-mistral&lt;/a&gt; and running &lt;code&gt;llm mistral refresh&lt;/code&gt; to refresh the list of available models, then:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m mistral/magistral-medium-latest \
  'Generate an SVG of a pelican riding a bicycle'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="Claude Sonnet 4 described this as Minimalist illustration of a white bird with an orange beak riding on a dark gray motorcycle against a light blue sky with a white sun and gray ground" src="https://static.simonwillison.net/static/2025/magistral-medium-pelican.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/93917661eae6e2fe0a0bd5685172fab8"&gt;that transcript&lt;/a&gt;. At 13 input and 1,236 output tokens that cost me &lt;a href="https://www.llm-prices.com/#it=13&amp;amp;ot=1236&amp;amp;ic=2&amp;amp;oc=5"&gt;0.62 cents&lt;/a&gt; - just over half a cent.
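&lt;p&gt;That figure is easy to verify with the rates encoded in the llm-prices link ($2/million input tokens, $5/million output tokens):&lt;/p&gt;

```python
# Reproducing the quoted cost: 13 input tokens at 2 USD/million plus
# 1,236 output tokens at 5 USD/million, expressed in cents.

input_tokens, output_tokens = 13, 1236
input_usd_per_m, output_usd_per_m = 2.00, 5.00

cost_usd = (
    input_tokens * input_usd_per_m + output_tokens * output_usd_per_m
) / 1_000_000
cost_cents = cost_usd * 100  # roughly 0.62 cents
```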


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mistral"&gt;mistral&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cerebras"&gt;cerebras&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="mistral"/><category term="cerebras"/><category term="llm-pricing"/><category term="ollama"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/></entry><entry><title>Cerebras brings instant inference to Mistral Le Chat</title><link href="https://simonwillison.net/2025/Feb/10/cerebras-mistral/#atom-tag" rel="alternate"/><published>2025-02-10T03:50:18+00:00</published><updated>2025-02-10T03:50:18+00:00</updated><id>https://simonwillison.net/2025/Feb/10/cerebras-mistral/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://cerebras.ai/blog/mistral-le-chat"&gt;Cerebras brings instant inference to Mistral Le Chat&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Mistral &lt;a href="https://mistral.ai/en/news/all-new-le-chat"&gt;announced a major upgrade&lt;/a&gt; to their &lt;a href="https://chat.mistral.ai/chat"&gt;Le Chat&lt;/a&gt; web UI (their version of ChatGPT) a few days ago, and one of the signature features was performance.&lt;/p&gt;
&lt;p&gt;It turns out that performance boost comes from hosting their model on Cerebras:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We are excited to bring our technology to Mistral – specifically the flagship 123B parameter Mistral Large 2 model. Using our Wafer Scale Engine technology, we achieve over 1,100 tokens per second on text queries.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Given Cerebras's so far unrivaled inference performance I'm surprised that no other AI lab has formed a partnership like this already.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mistral"&gt;mistral&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cerebras"&gt;cerebras&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-performance"&gt;llm-performance&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="mistral"/><category term="cerebras"/><category term="llm-performance"/></entry><entry><title>The impact of competition and DeepSeek on Nvidia</title><link href="https://simonwillison.net/2025/Jan/27/deepseek-nvidia/#atom-tag" rel="alternate"/><published>2025-01-27T01:55:51+00:00</published><updated>2025-01-27T01:55:51+00:00</updated><id>https://simonwillison.net/2025/Jan/27/deepseek-nvidia/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://youtubetranscriptoptimizer.com/blog/05_the_short_case_for_nvda"&gt;The impact of competition and DeepSeek on Nvidia&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Long, excellent piece by Jeffrey Emanuel capturing the current state of the AI/LLM industry. The original title is "The Short Case for Nvidia Stock" - I'm using the Hacker News alternative title here, but even that I feel under-sells this essay.&lt;/p&gt;
&lt;p&gt;Jeffrey has a rare combination of experience in both computer science and investment analysis. He combines both worlds here, evaluating NVIDIA's challenges by providing deep insight into a whole host of relevant and interesting topics.&lt;/p&gt;
&lt;p&gt;As Jeffrey describes it, NVIDIA's moat has four components: high-quality Linux drivers, CUDA as an industry standard, the fast GPU interconnect technology they acquired from &lt;a href="https://en.wikipedia.org/wiki/Mellanox_Technologies"&gt;Mellanox&lt;/a&gt; in 2019 and the flywheel effect where they can invest their enormous profits (75-90% margin in some cases!) into more R&amp;amp;D.&lt;/p&gt;
&lt;p&gt;Each of these is under threat.&lt;/p&gt;
&lt;p&gt;Technologies like &lt;a href="https://simonwillison.net/tags/mlx/"&gt;MLX&lt;/a&gt;, Triton and JAX are undermining the CUDA advantage by making it easier for ML developers to target multiple backends - plus LLMs themselves are getting capable enough to help port things to alternative architectures.&lt;/p&gt;
&lt;p&gt;GPU interconnect helps multiple GPUs work together on tasks like model training. Companies like Cerebras are developing &lt;a href="https://simonwillison.net/2025/Jan/16/cerebras-yield-problem/"&gt;enormous chips&lt;/a&gt; that can get way more done on a single chip.&lt;/p&gt;
&lt;p&gt;Those 75-90% margins provide a huge incentive for other companies to catch up - including the customers who spend the most on NVIDIA at the moment - Microsoft, Amazon, Meta, Google, Apple - all of whom have their own internal silicon projects:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Now, it's no secret that there is a strong power law distribution of Nvidia's hyper-scaler customer base, with the top handful of customers representing the lion's share of high-margin revenue. How should one think about the future of this business when literally every single one of these VIP customers is building their own custom chips specifically for AI training and inference?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The real joy of this article is the way it describes technical details of modern LLMs in a relatively accessible manner. I love this description of the inference-scaling tricks used by O1 and R1, compared to traditional transformers:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Basically, the way Transformers work in terms of predicting the next token at each step is that, if they start out on a bad "path" in their initial response, they become almost like a prevaricating child who tries to spin a yarn about why they are actually correct, even if they should have realized mid-stream using common sense that what they are saying couldn't possibly be correct.&lt;/p&gt;
&lt;p&gt;Because the models are always seeking to be internally consistent and to have each successive generated token flow naturally from the preceding tokens and context, it's very hard for them to course-correct and backtrack. By breaking the inference process into what is effectively many intermediate stages, they can try lots of different things and see what's working and keep trying to course-correct and try other approaches until they can reach a fairly high threshold of confidence that they aren't talking nonsense.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The last quarter of the article talks about the seismic waves rocking the industry right now caused by &lt;a href="https://simonwillison.net/tags/deepseek/"&gt;DeepSeek&lt;/a&gt; v3 and R1. v3 remains the top-ranked open weights model, despite being around 45x more efficient in training than its competition: bad news if you are selling GPUs! R1 represents another huge breakthrough in efficiency both for training and for inference - the DeepSeek R1 API is currently 27x cheaper than OpenAI's o1, for a similar level of quality.&lt;/p&gt;
&lt;p&gt;Jeffrey summarized some of the key ideas from the &lt;a href="https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf"&gt;v3 paper&lt;/a&gt; like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A major innovation is their sophisticated mixed-precision training framework that lets them use 8-bit floating point numbers (FP8) throughout the entire training process. [...]&lt;/p&gt;
&lt;p&gt;DeepSeek cracked this problem by developing a clever system that breaks numbers into small tiles for activations and blocks for weights, and strategically uses high-precision calculations at key points in the network. Unlike other labs that train in high precision and then compress later (losing some quality in the process), DeepSeek's native FP8 approach means they get the massive memory savings without compromising performance. When you're training across thousands of GPUs, this dramatic reduction in memory requirements per GPU translates into needing far fewer GPUs overall.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Then for &lt;a href="https://arxiv.org/abs/2501.12948"&gt;R1&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;With R1, DeepSeek essentially cracked one of the holy grails of AI: getting models to reason step-by-step without relying on massive supervised datasets. Their DeepSeek-R1-Zero experiment showed something remarkable: using pure reinforcement learning with carefully crafted reward functions, they managed to get models to develop sophisticated reasoning capabilities completely autonomously. This wasn't just about solving problems— the model organically learned to generate long chains of thought, self-verify its work, and allocate more computation time to harder problems.&lt;/p&gt;
&lt;p&gt;The technical breakthrough here was their novel approach to reward modeling. Rather than using complex neural reward models that can lead to "reward hacking" (where the model finds bogus ways to boost their rewards that don't actually lead to better real-world model performance), they developed a clever rule-based system that combines accuracy rewards (verifying final answers) with format rewards (encouraging structured thinking). This simpler approach turned out to be more robust and scalable than the process-based reward models that others have tried.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This article is packed with insights like that - it's worth spending the time absorbing the whole thing.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=42822162"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nvidia"&gt;nvidia&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cerebras"&gt;cerebras&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deepseek"&gt;deepseek&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="nvidia"/><category term="mlx"/><category term="cerebras"/><category term="llm-reasoning"/><category term="deepseek"/><category term="ai-in-china"/></entry><entry><title>100x Defect Tolerance: How Cerebras Solved the Yield Problem</title><link href="https://simonwillison.net/2025/Jan/16/cerebras-yield-problem/#atom-tag" rel="alternate"/><published>2025-01-16T00:38:01+00:00</published><updated>2025-01-16T00:38:01+00:00</updated><id>https://simonwillison.net/2025/Jan/16/cerebras-yield-problem/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://cerebras.ai/blog/100x-defect-tolerance-how-cerebras-solved-the-yield-problem"&gt;100x Defect Tolerance: How Cerebras Solved the Yield Problem&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I learned a bunch about how chip manufacture works from this piece where Cerebras reveal some notes about how they manufacture chips that are 56x physically larger than NVIDIA's H100.&lt;/p&gt;
&lt;p&gt;The key idea here is core redundancy: designing a chip such that if there are defects the end-product is still useful. This has been a technique for decades:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;For example in 2006 Intel released the Intel Core Duo – a chip with two CPU cores. If one core was faulty, it was disabled and the product was sold as an Intel Core Solo. Nvidia, AMD, and others all embraced this core-level redundancy in the coming years.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Modern GPUs are deliberately designed with redundant cores: the H100 needs 132 but the wafer contains 144, so up to 12 can be defective without the chip failing.&lt;/p&gt;
&lt;p&gt;Cerebras designed their monster (look at &lt;a href="https://www.bbc.com/news/technology-49395577"&gt;the size of this thing&lt;/a&gt;) with absolutely tiny cores: "approximately 0.05mm2" - with the whole chip needing 900,000 enabled cores out of the 970,000 total. This allows 93% of the silicon area to stay active in the finished chip, a notably high proportion.
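&lt;p&gt;The redundancy figures from the piece translate into defect budgets you can compute directly:&lt;/p&gt;

```python
# Defect tolerance implied by core redundancy: spare cores on the
# H100 versus Cerebras's wafer-scale design, using figures from the post.

h100_total, h100_needed = 144, 132
wse_total, wse_needed = 970_000, 900_000

h100_spares = h100_total - h100_needed        # 12 cores can be defective
wse_spares = wse_total - wse_needed           # 70,000 cores can be defective
wse_active_fraction = wse_needed / wse_total  # about 0.928 - the ~93% figure
```

&lt;p&gt;Tiny cores mean each defect knocks out far less silicon, which is how a wafer-sized chip stays viable at all.&lt;/p&gt;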

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=42717165"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/hardware"&gt;hardware&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpus"&gt;gpus&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cerebras"&gt;cerebras&lt;/a&gt;&lt;/p&gt;



</summary><category term="hardware"/><category term="ai"/><category term="gpus"/><category term="cerebras"/></entry><entry><title>Cerebras Coder</title><link href="https://simonwillison.net/2024/Oct/31/cerebras-coder/#atom-tag" rel="alternate"/><published>2024-10-31T22:39:15+00:00</published><updated>2024-10-31T22:39:15+00:00</updated><id>https://simonwillison.net/2024/Oct/31/cerebras-coder/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.val.town/v/stevekrouse/cerebras_coder"&gt;Cerebras Coder&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Val Town founder Steve Krouse has been building demos on top of the Cerebras API that runs Llama3.1-70b at 2,000 tokens/second.&lt;/p&gt;
&lt;p&gt;Having a capable LLM with that kind of performance turns out to be really interesting. Cerebras Coder is a demo that implements Claude Artifact-style on-demand JavaScript apps, and having it run at that speed means changes you request are visible within less than a second:&lt;/p&gt;
&lt;div style="max-width: 100%;"&gt;
    &lt;video 
        controls 
        preload="none"
        poster="https://static.simonwillison.net/static/2024/cascade-emoji.jpeg"
        style="width: 100%; height: auto;"&gt;
        &lt;source src="https://static.simonwillison.net/static/2024/cascade-emoji.mp4" type="video/mp4"&gt;
    &lt;/video&gt;
&lt;/div&gt;

&lt;p&gt;Steve's implementation (created with the help of &lt;a href="https://www.val.town/townie"&gt;Townie&lt;/a&gt;, the Val Town code assistant) demonstrates the simplest possible version of an iframe sandbox:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;iframe
    srcDoc={code}
    sandbox="allow-scripts allow-modals allow-forms allow-popups allow-same-origin allow-top-navigation allow-downloads allow-presentation allow-pointer-lock"
/&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Where &lt;code&gt;code&lt;/code&gt; is populated by a &lt;code&gt;setCode(...)&lt;/code&gt; call inside a React component.&lt;/p&gt;
&lt;p&gt;The most interesting applications of LLMs continue to be where they operate in a tight loop with a human - this can make those review loops potentially much faster and more productive.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/stevekrouse/status/1851995718514327848"&gt;@stevekrouse&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/iframes"&gt;iframes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sandboxing"&gt;sandboxing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/react"&gt;react&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/val-town"&gt;val-town&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/steve-krouse"&gt;steve-krouse&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cerebras"&gt;cerebras&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-performance"&gt;llm-performance&lt;/a&gt;&lt;/p&gt;



</summary><category term="iframes"/><category term="sandboxing"/><category term="ai"/><category term="react"/><category term="generative-ai"/><category term="llama"/><category term="llms"/><category term="ai-assisted-programming"/><category term="val-town"/><category term="steve-krouse"/><category term="cerebras"/><category term="llm-performance"/></entry><entry><title>Pelicans on a bicycle</title><link href="https://simonwillison.net/2024/Oct/25/pelicans-on-a-bicycle/#atom-tag" rel="alternate"/><published>2024-10-25T23:56:50+00:00</published><updated>2024-10-25T23:56:50+00:00</updated><id>https://simonwillison.net/2024/Oct/25/pelicans-on-a-bicycle/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/pelican-bicycle/blob/main/README.md"&gt;Pelicans on a bicycle&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I decided to roll out my own LLM benchmark: how well can different models render an SVG of a pelican riding a bicycle?&lt;/p&gt;
&lt;p&gt;I chose that because a) I like pelicans and b) I'm pretty sure there aren't any pelican on a bicycle SVG files floating around (yet) that might have already been sucked into the training data.&lt;/p&gt;
&lt;p&gt;My prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Generate an SVG of a pelican riding a bicycle&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I've run it through 16 models so far - from OpenAI, Anthropic, Google Gemini and Meta (Llama running on Cerebras), all using my &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; CLI utility. Here's my (&lt;a href="https://gist.github.com/simonw/32273a445da3318df690749701805863"&gt;Claude assisted&lt;/a&gt;) Bash script: &lt;a href="https://github.com/simonw/pelican-bicycle/blob/b25faf3e29dcf73c97278dfdd7b7b973462eb0cb/generate-svgs.sh"&gt;generate-svgs.sh&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Here's Claude 3.5 Sonnet (2024-06-20) and Claude 3.5 Sonnet (2024-10-22):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/pelican-bicycles/claude-3-5-sonnet-20240620.svg" style="width: 45%" alt="Two circles joined by a triangle. The pelican is a grey oval for the body, a circle for the head and has a peak that looks like a yellow banana smile. A wing is hinted at with an upside down curved line. Two legs dangle from the bottom of the bird."&gt; &lt;img src="https://static.simonwillison.net/static/2024/pelican-bicycles/claude-3-5-sonnet-20241022.svg" style="width: 45%"&gt;&lt;/p&gt;
&lt;p&gt;Gemini 1.5 Flash 001 and Gemini 1.5 Flash 002:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/pelican-bicycles/gemini-1.5-flash-001.svg" style="width: 45%"&gt; &lt;img src="https://static.simonwillison.net/static/2024/pelican-bicycles/gemini-1.5-flash-002.svg" style="width: 45%"&gt;&lt;/p&gt;
&lt;p&gt;GPT-4o mini and GPT-4o:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/pelican-bicycles/gpt-4o-mini.svg" style="width: 45%"&gt; &lt;img src="https://static.simonwillison.net/static/2024/pelican-bicycles/gpt-4o.svg" style="width: 45%"&gt;&lt;/p&gt;
&lt;p&gt;o1-mini and o1-preview:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/pelican-bicycles/o1-mini.svg" style="width: 45%"&gt; &lt;img src="https://static.simonwillison.net/static/2024/pelican-bicycles/o1-preview.svg" style="width: 45%"&gt;&lt;/p&gt;
&lt;p&gt;Cerebras Llama 3.1 70B and Llama 3.1 8B:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/pelican-bicycles/cerebras-llama3.1-70b.svg" style="width: 45%"&gt; &lt;img src="https://static.simonwillison.net/static/2024/pelican-bicycles/cerebras-llama3.1-8b.svg" style="width: 45%"&gt;&lt;/p&gt;
&lt;p&gt;And a special mention for Gemini 1.5 Flash 8B:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/pelican-bicycles/gemini-1.5-flash-8b-001.svg" style="width: 45%"&gt;&lt;/p&gt;
&lt;p&gt;The rest of them are &lt;a href="https://github.com/simonw/pelican-bicycle/blob/main/README.md"&gt;linked from the README&lt;/a&gt;.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/svg"&gt;svg&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cerebras"&gt;cerebras&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;&lt;/p&gt;



</summary><category term="svg"/><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llama"/><category term="llms"/><category term="llm"/><category term="anthropic"/><category term="gemini"/><category term="cerebras"/><category term="pelican-riding-a-bicycle"/></entry><entry><title>llm-cerebras</title><link href="https://simonwillison.net/2024/Oct/25/llm-cerebras/#atom-tag" rel="alternate"/><published>2024-10-25T05:50:47+00:00</published><updated>2024-10-25T05:50:47+00:00</updated><id>https://simonwillison.net/2024/Oct/25/llm-cerebras/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/irthomasthomas/llm-cerebras"&gt;llm-cerebras&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;a href="https://cerebras.ai/"&gt;Cerebras&lt;/a&gt; (&lt;a href="https://simonwillison.net/2024/Aug/28/cerebras-inference/"&gt;previously&lt;/a&gt;) provides Llama LLMs hosted on custom hardware at ferociously high speeds.&lt;/p&gt;
&lt;p&gt;GitHub user &lt;a href="https://github.com/irthomasthomas"&gt;irthomasthomas&lt;/a&gt; built an &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; plugin that works against &lt;a href="https://cloud.cerebras.ai/"&gt;their API&lt;/a&gt; - which is currently free, albeit with a rate limit of 30 requests per minute for their two models.&lt;/p&gt;
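&lt;p&gt;That 30 requests/minute cap is easy to respect client-side. Here's a minimal sliding-window limiter sketch - illustrative only, not part of the plugin, and the class name is my own:&lt;/p&gt;

```python
import time


class RateLimiter:
    """Allow at most `limit` calls per `period` seconds (sliding window)."""

    def __init__(self, limit=30, period=60.0, clock=time.monotonic):
        self.limit = limit
        self.period = period
        self.clock = clock  # injectable for testing
        self.calls = []  # timestamps of recent calls

    def wait_time(self):
        """Seconds to wait before the next call is allowed (0.0 if allowed now)."""
        now = self.clock()
        # Drop timestamps that have aged out of the window
        self.calls = [t for t in self.calls if now - t < self.period]
        if len(self.calls) < self.limit:
            return 0.0
        # Oldest call in the window determines when a slot frees up
        return self.calls[0] + self.period - now

    def record(self):
        self.calls.append(self.clock())
```

You'd call `wait_time()` (and `time.sleep()` on the result) before each `llm` invocation.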
&lt;pre&gt;&lt;code&gt;llm install llm-cerebras
llm keys set cerebras
# paste key here
llm -m cerebras-llama3.1-70b 'an epic tail of a walrus pirate'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's &lt;a href="https://static.simonwillison.net/static/2024/cerebras-is-fast.mp4"&gt;a video&lt;/a&gt; showing the speed of that prompt:&lt;/p&gt;
&lt;div style="max-width: 100%;"&gt;
    &lt;video 
        controls 
        preload="none"
        poster="https://static.simonwillison.net/static/2024/cerebras-poster.jpg"
        style="width: 100%; height: auto;"&gt;
        &lt;source src="https://static.simonwillison.net/static/2024/cerebras-is-fast.mp4" type="video/mp4"&gt;
    &lt;/video&gt;
&lt;/div&gt;

&lt;p&gt;The other model is &lt;code&gt;cerebras-llama3.1-8b&lt;/code&gt;.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cerebras"&gt;cerebras&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-performance"&gt;llm-performance&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="cerebras"/><category term="llm-performance"/></entry><entry><title>Cerebras Inference: AI at Instant Speed</title><link href="https://simonwillison.net/2024/Aug/28/cerebras-inference/#atom-tag" rel="alternate"/><published>2024-08-28T04:14:00+00:00</published><updated>2024-08-28T04:14:00+00:00</updated><id>https://simonwillison.net/2024/Aug/28/cerebras-inference/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://cerebras.ai/blog/introducing-cerebras-inference-ai-at-instant-speed"&gt;Cerebras Inference: AI at Instant Speed&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
New hosted API for Llama running at absurdly high speeds: "1,800 tokens per second for Llama3.1 8B and 450 tokens per second for Llama3.1 70B".&lt;/p&gt;
&lt;p&gt;How are they running so fast? Custom hardware. Their &lt;a href="https://cerebras.ai/product-chip/"&gt;WSE-3&lt;/a&gt; is 57x &lt;em&gt;physically larger&lt;/em&gt; than an NVIDIA H100, and has 4 trillion transistors, 900,000 cores and 44GB of memory all on one enormous chip.&lt;/p&gt;
&lt;p&gt;Their &lt;a href="https://inference.cerebras.ai/"&gt;live chat demo&lt;/a&gt; just returned me a response at 1,833 tokens/second. Their API currently has a waitlist.&lt;/p&gt;
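&lt;p&gt;Some back-of-the-envelope arithmetic on those numbers - the response length is a hypothetical, and the per-core split is purely illustrative (the 44GB is on-chip memory shared across the wafer):&lt;/p&gt;

```python
# Figures quoted in the post (illustrative arithmetic only)
tokens_per_second = 1800  # quoted Llama3.1 8B speed
response_tokens = 500     # hypothetical response length
seconds = response_tokens / tokens_per_second
print(f"A {response_tokens}-token response streams in about {seconds:.2f}s")

cores = 900_000
sram_gb = 44
kb_per_core = sram_gb * 1024 * 1024 / cores
print(f"Roughly {kb_per_core:.0f} KB of on-chip memory per core")
```

So at these speeds a typical response finishes in well under half a second.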

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=41369705"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/performance"&gt;performance&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cerebras"&gt;cerebras&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-performance"&gt;llm-performance&lt;/a&gt;&lt;/p&gt;



</summary><category term="performance"/><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="llms"/><category term="cerebras"/><category term="llm-performance"/></entry><entry><title>Downloading and converting the original models (Cerebras-GPT)</title><link href="https://simonwillison.net/2023/Mar/31/cerebras-gpt-ggml/#atom-tag" rel="alternate"/><published>2023-03-31T04:28:12+00:00</published><updated>2023-03-31T04:28:12+00:00</updated><id>https://simonwillison.net/2023/Mar/31/cerebras-gpt-ggml/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/ggerganov/ggml/blob/master/examples/gpt-2/README.md#downloading-and-converting-the-original-models-cerebras-gpt"&gt;Downloading and converting the original models (Cerebras-GPT)&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Georgi Gerganov added support for the Apache 2 licensed Cerebras-GPT language model to his ggml C++ inference library, as used by llama.cpp.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/ggerganov/status/1641554853007708164"&gt;@ggerganov&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cerebras"&gt;cerebras&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama-cpp"&gt;llama-cpp&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/georgi-gerganov"&gt;georgi-gerganov&lt;/a&gt;&lt;/p&gt;



</summary><category term="llama"/><category term="local-llms"/><category term="llms"/><category term="cerebras"/><category term="llama-cpp"/><category term="georgi-gerganov"/></entry><entry><title>Cerebras-GPT: A Family of Open, Compute-efficient, Large Language Models</title><link href="https://simonwillison.net/2023/Mar/28/cerebras-gpt/#atom-tag" rel="alternate"/><published>2023-03-28T22:05:44+00:00</published><updated>2023-03-28T22:05:44+00:00</updated><id>https://simonwillison.net/2023/Mar/28/cerebras-gpt/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.cerebras.net/blog/cerebras-gpt-a-family-of-open-compute-efficient-large-language-models/"&gt;Cerebras-GPT: A Family of Open, Compute-efficient, Large Language Models&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
The latest example of an open source large language model you can run on your own hardware. This one is particularly interesting because the entire thing is under the Apache 2 license. Cerebras are an AI hardware company offering a product with 850,000 cores - this release was trained on their hardware, presumably to demonstrate its capabilities. The model comes in seven sizes from 111 million to 13 billion parameters, and the smaller sizes can be tried directly on Hugging Face.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=35343763"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-3"&gt;gpt-3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cerebras"&gt;cerebras&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;



</summary><category term="open-source"/><category term="ai"/><category term="gpt-3"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="cerebras"/><category term="llm-release"/></entry></feed>