<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: llama</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/llama.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2026-02-20T22:10:04+00:00</updated><author><name>Simon Willison</name></author><entry><title>Taalas serves Llama 3.1 8B at 17,000 tokens/second</title><link href="https://simonwillison.net/2026/Feb/20/taalas/#atom-tag" rel="alternate"/><published>2026-02-20T22:10:04+00:00</published><updated>2026-02-20T22:10:04+00:00</updated><id>https://simonwillison.net/2026/Feb/20/taalas/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://taalas.com/the-path-to-ubiquitous-ai/"&gt;Taalas serves Llama 3.1 8B at 17,000 tokens/second&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This new Canadian hardware startup just announced their first product - a custom hardware implementation of the Llama 3.1 8B model (from &lt;a href="https://simonwillison.net/2024/Jul/23/introducing-llama-31/"&gt;July 2024&lt;/a&gt;) that can run at a staggering 17,000 tokens/second.&lt;/p&gt;
&lt;p&gt;I was going to include a video of their demo but it's so fast it would look more like a screenshot. You can try it out at &lt;a href="https://chatjimmy.ai"&gt;chatjimmy.ai&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;They describe their Silicon Llama as “aggressively quantized, combining 3-bit and 6-bit parameters.” Their next generation will use 4-bit quantization - presumably they have quite a long lead time for baking new models into silicon!&lt;/p&gt;
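&lt;p&gt;To get a feel for why 3-bit counts as "aggressive", here's a toy round-to-nearest quantization sketch. This is purely illustrative - Taalas haven't published their actual scheme, and real quantizers use per-group scales, outlier handling and calibration data:&lt;/p&gt;

```python
import numpy as np

def quantize(weights: np.ndarray, bits: int):
    """Naive symmetric round-to-nearest quantization to `bits` bits.
    A toy illustration only, not Taalas's actual scheme."""
    levels = 2 ** (bits - 1) - 1            # e.g. 3 positive levels for 3-bit
    scale = np.abs(weights).max() / levels  # one scale for the whole tensor
    q = np.clip(np.round(weights / scale), -levels - 1, levels)
    return q.astype(np.int8), scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.array([0.12, -0.5, 0.33, 0.9], dtype=np.float32)
for bits in (3, 6):
    q, scale = quantize(w, bits)
    err = np.abs(dequantize(q, scale) - w).max()
    print(f"{bits}-bit max error: {err:.4f}")
```

&lt;p&gt;The 6-bit reconstruction error is roughly an order of magnitude smaller than the 3-bit one, which is presumably why a mixed 3-bit/6-bit scheme is attractive: spend the extra precision only on the weights that need it.&lt;/p&gt;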

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=47086181"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-performance"&gt;llm-performance&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="llms"/><category term="llm-performance"/></entry><entry><title>ggml.ai joins Hugging Face to ensure the long-term progress of Local AI</title><link href="https://simonwillison.net/2026/Feb/20/ggmlai-joins-hugging-face/#atom-tag" rel="alternate"/><published>2026-02-20T17:12:55+00:00</published><updated>2026-02-20T17:12:55+00:00</updated><id>https://simonwillison.net/2026/Feb/20/ggmlai-joins-hugging-face/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/ggml-org/llama.cpp/discussions/19759"&gt;ggml.ai joins Hugging Face to ensure the long-term progress of Local AI&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I don't normally cover acquisition news like this, but I have some thoughts.&lt;/p&gt;
&lt;p&gt;It's hard to overstate the impact Georgi Gerganov has had on the local model space. Back in March 2023 his release of &lt;a href="https://github.com/ggml-org/llama.cpp"&gt;llama.cpp&lt;/a&gt; made it possible to run a local LLM on consumer hardware. The &lt;a href="https://github.com/ggml-org/llama.cpp/blob/775328064e69db1ebd7e19ccb59d2a7fa6142470/README.md?plain=1#L7"&gt;original README&lt;/a&gt; said:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The main goal is to run the model using 4-bit quantization on a MacBook. [...] This was hacked in an evening - I have no idea if it works correctly.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I wrote about trying llama.cpp out at the time in &lt;a href="https://simonwillison.net/2023/Mar/11/llama/#llama-cpp"&gt;Large language models are having their Stable Diffusion moment&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I used it to run the 7B LLaMA model on my laptop last night, and then this morning upgraded to the 13B model—the one that Facebook claim is competitive with GPT-3.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Meta's &lt;a href="https://github.com/meta-llama/llama/tree/llama_v1"&gt;original LLaMA release&lt;/a&gt; depended on PyTorch and their &lt;a href="https://github.com/facebookresearch/fairscale"&gt;FairScale&lt;/a&gt; PyTorch extension for running on multiple GPUs, and required CUDA and NVIDIA hardware. Georgi's work opened that up to a much wider range of hardware and kicked off the local model movement that has continued to grow since then.&lt;/p&gt;
&lt;p&gt;Hugging Face are already responsible for the incredibly influential &lt;a href="https://github.com/huggingface/transformers"&gt;Transformers&lt;/a&gt; library used by the majority of LLM releases today. They've proven themselves a good steward for that open source project, which makes me optimistic for the future of llama.cpp and related projects.&lt;/p&gt;
&lt;p&gt;This section from the announcement looks particularly promising:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Going forward, our joint efforts will be geared towards the following objectives:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Towards seamless "single-click" integration with the &lt;a href="https://github.com/huggingface/transformers"&gt;transformers&lt;/a&gt; library. The &lt;code&gt;transformers&lt;/code&gt; framework has established itself as the 'source of truth' for AI model definitions. Improving the compatibility between the transformers and the ggml ecosystems is essential for wider model support and quality control.&lt;/li&gt;
&lt;li&gt;Better packaging and user experience of ggml-based software. As we enter the phase in which local inference becomes a meaningful and competitive alternative to cloud inference, it is crucial to improve and simplify the way in which casual users deploy and access local models. We will work towards making llama.cpp ubiquitous and readily available everywhere, and continue partnering with great downstream projects.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Given the influence of Transformers, this closer integration could lead to model releases that are compatible with the GGML ecosystem out of the box. That would be a big win for the local model ecosystem.&lt;/p&gt;
&lt;p&gt;I'm also excited to see investment in "packaging and user experience of ggml-based software". This has mostly been left to tools like &lt;a href="https://ollama.com"&gt;Ollama&lt;/a&gt; and &lt;a href="https://lmstudio.ai"&gt;LM Studio&lt;/a&gt;. ggml-org released &lt;a href="https://github.com/ggml-org/LlamaBarn"&gt;LlamaBarn&lt;/a&gt; last year - "a macOS menu bar app for running local LLMs" - and I'm hopeful that further investment in this area will result in more high quality open source tools for running local models from the team best placed to deliver them.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/ggerganov/status/2024839991482777976"&gt;@ggerganov&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/transformers"&gt;transformers&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/hugging-face"&gt;hugging-face&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama-cpp"&gt;llama-cpp&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/georgi-gerganov"&gt;georgi-gerganov&lt;/a&gt;&lt;/p&gt;



</summary><category term="open-source"/><category term="transformers"/><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="local-llms"/><category term="llms"/><category term="hugging-face"/><category term="llama-cpp"/><category term="georgi-gerganov"/></entry><entry><title>Load Llama-3.2 WebGPU in your browser from a local folder</title><link href="https://simonwillison.net/2025/Sep/8/webgpu-local-folder/#atom-tag" rel="alternate"/><published>2025-09-08T20:53:52+00:00</published><updated>2025-09-08T20:53:52+00:00</updated><id>https://simonwillison.net/2025/Sep/8/webgpu-local-folder/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://static.simonwillison.net/static/2025/llama-3.2-webgpu/"&gt;Load Llama-3.2 WebGPU in your browser from a local folder&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Inspired by &lt;a href="https://news.ycombinator.com/item?id=45168953#45169054"&gt;a comment&lt;/a&gt; on Hacker News I decided to see if it was possible to modify the &lt;a href="https://github.com/huggingface/transformers.js-examples/tree/main/llama-3.2-webgpu"&gt;transformers.js-examples/tree/main/llama-3.2-webgpu&lt;/a&gt; Llama 3.2 chat demo (&lt;a href="https://huggingface.co/spaces/webml-community/llama-3.2-webgpu"&gt;online here&lt;/a&gt;, I &lt;a href="https://simonwillison.net/2024/Sep/30/llama-32-webgpu/"&gt;wrote about it last November&lt;/a&gt;) to add an option to open a local model file directly from a folder on disk, rather than waiting for it to download over the network.&lt;/p&gt;
&lt;p&gt;I posed the problem to OpenAI's GPT-5-enabled Codex CLI like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git clone https://github.com/huggingface/transformers.js-examples
cd transformers.js-examples/llama-3.2-webgpu
codex
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then this prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Modify this application such that it offers the user a file browse button for selecting their own local copy of the model file instead of loading it over the network. Provide a "download model" option too.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Codex churned away for several minutes, even running commands like &lt;code&gt;curl -sL https://raw.githubusercontent.com/huggingface/transformers.js/main/src/models.js | sed -n '1,200p'&lt;/code&gt; to inspect the source code of the underlying Transformers.js library.&lt;/p&gt;
&lt;p&gt;After four prompts total (&lt;a href="https://gist.github.com/simonw/3c46c9e609f6ee77367a760b5ca01bd2?permalink_comment_id=5751814#gistcomment-5751814"&gt;shown here&lt;/a&gt;) it built something which worked!&lt;/p&gt;
&lt;p&gt;To try it out you'll need your own local copy of the Llama 3.2 ONNX model. You can get that (a ~1.2GB download) like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git lfs install
git clone https://huggingface.co/onnx-community/Llama-3.2-1B-Instruct-q4f16
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then visit my &lt;a href="https://static.simonwillison.net/static/2025/llama-3.2-webgpu/"&gt;llama-3.2-webgpu&lt;/a&gt; page in Chrome or Firefox Nightly (since WebGPU is required), click "Browse folder", select that folder you just cloned, agree to the "Upload" confirmation (confusing since nothing is uploaded from your browser, the model file is opened locally on your machine) and click "Load local model".&lt;/p&gt;
&lt;p&gt;Here's an animated demo (recorded in real-time, I didn't speed this up):&lt;/p&gt;
&lt;p&gt;&lt;img alt="GIF. I follow the setup instructions, clicking to load a local model and browsing to the correct folder. Once loaded the model shows a chat interface, I run the example about time management which returns tokens at about 10/second." src="https://static.simonwillison.net/static/2025/webgpu-llama-demo-small.gif" /&gt;&lt;/p&gt;
&lt;p&gt;I pushed &lt;a href="https://github.com/simonw/transformers.js-examples/commit/cdebf4128c6e30414d437affd4b13b6c9c79421d"&gt;a branch with those changes here&lt;/a&gt;. The next step would be to modify this to support other models in addition to the Llama 3.2 demo, but I'm pleased to have got to this proof of concept with so little work beyond throwing some prompts at Codex to see if it could figure it out.&lt;/p&gt;
&lt;p&gt;According to the Codex &lt;code&gt;/status&lt;/code&gt; command &lt;a href="https://gist.github.com/simonw/3c46c9e609f6ee77367a760b5ca01bd2?permalink_comment_id=5751807#gistcomment-5751807"&gt;this used&lt;/a&gt; 169,818 input tokens, 17,112 output tokens and 1,176,320 cached input tokens. At current GPT-5 token pricing ($1.25/million input, $0.125/million cached input, $10/million output) that would cost 53.04 cents, but Codex CLI hooks into my existing $20/month ChatGPT Plus plan so this was bundled into that.&lt;/p&gt;
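&lt;p&gt;The arithmetic is easy to check in Python, using the token counts from the &lt;code&gt;/status&lt;/code&gt; output and the prices quoted above (small discrepancies in the quoted total come down to how the non-cached input count is derived):&lt;/p&gt;

```python
# Token counts reported by Codex /status for this session
input_tokens = 169_818
cached_input_tokens = 1_176_320
output_tokens = 17_112

# GPT-5 prices in dollars per million tokens, as quoted above
PRICE_INPUT, PRICE_CACHED, PRICE_OUTPUT = 1.25, 0.125, 10.0

cost_dollars = (
    input_tokens * PRICE_INPUT
    + cached_input_tokens * PRICE_CACHED
    + output_tokens * PRICE_OUTPUT
) / 1_000_000

print(f"{cost_dollars * 100:.2f} cents")  # 53.04 cents
```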

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=45168953#45173297"&gt;My Hacker News comment&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/transformers-js"&gt;transformers-js&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/webgpu"&gt;webgpu&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vibe-coding"&gt;vibe-coding&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-5"&gt;gpt-5&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/codex-cli"&gt;codex-cli&lt;/a&gt;&lt;/p&gt;



</summary><category term="javascript"/><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="local-llms"/><category term="llms"/><category term="ai-assisted-programming"/><category term="transformers-js"/><category term="webgpu"/><category term="llm-pricing"/><category term="vibe-coding"/><category term="gpt-5"/><category term="codex-cli"/></entry><entry><title>Shisa V2 405B: Japan’s Highest Performing LLM</title><link href="https://simonwillison.net/2025/Jun/3/shisa-v2/#atom-tag" rel="alternate"/><published>2025-06-03T04:07:55+00:00</published><updated>2025-06-03T04:07:55+00:00</updated><id>https://simonwillison.net/2025/Jun/3/shisa-v2/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://shisa.ai/posts/shisa-v2-405b/"&gt;Shisa V2 405B: Japan’s Highest Performing LLM&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Leonard Lin and Adam Lensenmayer have been working on &lt;a href="https://shisa.ai/"&gt;Shisa&lt;/a&gt; for a while. They describe their latest release as "Japan's Highest Performing LLM".&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Shisa V2 405B is the highest-performing LLM ever developed in Japan, and surpasses GPT-4 (0613) and GPT-4 Turbo (2024-04-09) in our eval battery. (It also goes toe-to-toe with GPT-4o (2024-11-20) and DeepSeek-V3 (0324) on Japanese MT-Bench!)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This 405B release is a follow-up to the six smaller Shisa v2 models they released &lt;a href="https://shisa.ai/posts/shisa-v2/"&gt;back in April&lt;/a&gt;, which took a similar approach &lt;a href="https://simonwillison.net/2025/Jan/20/deepseek-r1/"&gt;to DeepSeek-R1&lt;/a&gt; in producing different models, each extending a different existing base model from Llama, Qwen, Mistral and Phi-4.&lt;/p&gt;
&lt;p&gt;The new 405B model uses Llama 3.1 405B Instruct as a base, and is available under the &lt;a href="https://www.llama.com/llama3_1/license/"&gt;Llama 3.1 community license&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Shisa is a prominent example of &lt;strong&gt;Sovereign AI&lt;/strong&gt; - the ability for nations to build models that reflect their own language and culture:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We strongly believe that it’s important for homegrown AI to be developed both in Japan (and globally!), and not just for the sake of cultural diversity and linguistic preservation, but also for data privacy and security, geopolitical resilience, and ultimately, independence.&lt;/p&gt;
&lt;p&gt;We believe the open-source approach is the only realistic way to achieve sovereignty in AI, not just for Japan, or even for nation states, but for the global community at large.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The accompanying &lt;a href="https://shisa.ai/posts/shisa-v2-405b/#overview-report"&gt;overview report&lt;/a&gt; has some fascinating details:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Training the 405B model was extremely difficult. Only three other groups that we know of: Nous Research, Bllossom, and AI2 have published Llama 405B full fine-tunes. [...] We implemented every optimization at our disposal including: DeepSpeed ZeRO-3 parameter and activation offloading, gradient accumulation, 8-bit paged optimizer, and sequence parallelism. Even so, the 405B model still barely fit within the H100’s memory limits&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In addition to the new model the Shisa team have published &lt;a href="https://huggingface.co/datasets/shisa-ai/shisa-v2-sharegpt/viewer"&gt;shisa-ai/shisa-v2-sharegpt&lt;/a&gt;, 180,000 records which they describe as "a best-in-class synthetic dataset, freely available for use to improve the Japanese capabilities of any model. Licensed under Apache 2.0".&lt;/p&gt;
&lt;p&gt;An interesting note: they found that since Shisa outperforms GPT-4 at Japanese, GPT-4 was no longer able to help with evaluation, so they had to upgrade to GPT-4.1:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Comparison of GPT-4.1 vs GPT-4 as judges showing two radar charts comparing Shisa V2 405B and 70B models on JA MT-Bench benchmarks, with text &amp;quot;Why use GPT-4.1 rather than GPT-4 as a Judge?&amp;quot; and explanation that Shisa models exceed GPT-4 in Japanese performance and GPT-4 cannot accurately distinguish performance differences among stronger models, noting GPT-4.1 applies stricter evaluation criteria for more accurate assessment" src="https://static.simonwillison.net/static/2025/shisa-gpt-4.jpg" /&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/leonard-lin"&gt;leonard-lin&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/translation"&gt;translation&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fine-tuning"&gt;fine-tuning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/evals"&gt;evals&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;



</summary><category term="leonard-lin"/><category term="translation"/><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="llms"/><category term="fine-tuning"/><category term="evals"/><category term="llm-release"/></entry><entry><title>Quoting Mark Zuckerberg</title><link href="https://simonwillison.net/2025/May/1/mark-zuckerberg/#atom-tag" rel="alternate"/><published>2025-05-01T00:28:53+00:00</published><updated>2025-05-01T00:28:53+00:00</updated><id>https://simonwillison.net/2025/May/1/mark-zuckerberg/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://www.dwarkesh.com/p/mark-zuckerberg-2"&gt;&lt;p&gt;You also mentioned the whole Chatbot Arena thing, which I think is interesting and points to the challenge around how you do benchmarking. How do you know what models are good for which things?&lt;/p&gt;
&lt;p&gt;One of the things we've generally tried to do over the last year is anchor more of our models in our Meta AI product north star use cases. The issue with open source benchmarks, and any given thing like the LM Arena stuff, is that they’re often skewed toward a very specific set of use cases, which are often not actually what any normal person does in your product. [...]&lt;/p&gt;
&lt;p&gt;So we're trying to anchor our north star on the product value that people report to us, what they say that they want, and what their revealed preferences are, and using the experiences that we have. Sometimes these benchmarks just don't quite line up. I think a lot of them are quite easily gameable.&lt;/p&gt;
&lt;p&gt;On the Arena you'll see stuff like &lt;a href="https://www.anthropic.com/claude/sonnet"&gt;Sonnet 3.7&lt;/a&gt;, which is a great model, and it's not near the top. It was relatively easy for our team to tune a version of Llama 4 Maverick that could be way at the top. But the version we released, the pure model, actually has no tuning for that at all, so it's further down. So you just need to be careful with some of these benchmarks. We're going to index primarily on the products.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://www.dwarkesh.com/p/mark-zuckerberg-2"&gt;Mark Zuckerberg&lt;/a&gt;, on Dwarkesh Patel's podcast&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/meta"&gt;meta&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mark-zuckerberg"&gt;mark-zuckerberg&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatbot-arena"&gt;chatbot-arena&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;&lt;/p&gt;



</summary><category term="meta"/><category term="generative-ai"/><category term="llama"/><category term="mark-zuckerberg"/><category term="ai"/><category term="chatbot-arena"/><category term="llms"/></entry><entry><title>Note on 20th April 2025</title><link href="https://simonwillison.net/2025/Apr/20/janky-license/#atom-tag" rel="alternate"/><published>2025-04-20T16:10:34+00:00</published><updated>2025-04-20T16:10:34+00:00</updated><id>https://simonwillison.net/2025/Apr/20/janky-license/#atom-tag</id><summary type="html">
    &lt;p&gt;Now that Llama has very real competition in open weight models (Gemma 3, latest Mistrals, DeepSeek, Qwen) I think their &lt;a href="https://techcrunch.com/2025/03/14/open-ai-model-licenses-often-carry-concerning-restrictions/"&gt;janky license&lt;/a&gt; is becoming much more of a liability for them. It's just limiting enough that it could be the deciding factor for using something else.&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/meta"&gt;meta&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/janky-licenses"&gt;janky-licenses&lt;/a&gt;&lt;/p&gt;



</summary><category term="open-source"/><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="local-llms"/><category term="llms"/><category term="meta"/><category term="qwen"/><category term="ai-in-china"/><category term="janky-licenses"/></entry><entry><title>Maybe Meta's Llama claims to be open source because of the EU AI act</title><link href="https://simonwillison.net/2025/Apr/19/llama-eu-ai-act/#atom-tag" rel="alternate"/><published>2025-04-19T23:58:18+00:00</published><updated>2025-04-19T23:58:18+00:00</updated><id>https://simonwillison.net/2025/Apr/19/llama-eu-ai-act/#atom-tag</id><summary type="html">
    &lt;p&gt;I encountered a theory a while ago that one of the reasons Meta insist on using the term “open source” for their Llama models despite the Llama license &lt;a href="https://opensource.org/blog/metas-llama-license-is-still-not-open-source"&gt;not actually conforming&lt;/a&gt; to the terms of the &lt;a href="https://opensource.org/osd"&gt;Open Source Definition&lt;/a&gt; is that the EU’s AI act includes special rules for open source models without requiring OSI compliance.&lt;/p&gt;
&lt;p&gt;Since the EU AI act (12 July 2024) is &lt;a href="https://artificialintelligenceact.eu/the-act/"&gt;available online&lt;/a&gt; I decided to take a look for myself.&lt;/p&gt;
&lt;p&gt;Here’s &lt;a href="https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=OJ:L_202401689"&gt;one giant HTML page&lt;/a&gt; containing the full text of the act in English. I checked the token count with &lt;a href="https://github.com/simonw/ttok"&gt;ttok&lt;/a&gt; (which uses the OpenAI tokenizer, but it’s close enough to work as a good estimate for other models):&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;curl &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=OJ:L_202401689&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; ttok&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;241,722 tokens. That should fit nicely into &lt;a href="https://simonwillison.net/2025/Apr/17/start-building-with-gemini-25-flash/"&gt;Gemini 2.5 Flash&lt;/a&gt; (or GPT-4.1 or Gemini 2.5 Pro).&lt;/p&gt;
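&lt;p&gt;A quick sanity check that the document fits. The context window sizes here are the advertised figures at the time of writing - treat them as assumptions and check the current model documentation:&lt;/p&gt;

```python
doc_tokens = 241_722  # measured with ttok above

# Advertised context windows in tokens - assumptions, not guaranteed current
context_windows = {
    "gemini-2.5-flash": 1_048_576,
    "gemini-2.5-pro": 1_048_576,
    "gpt-4.1": 1_047_576,
}

for model, window in context_windows.items():
    print(f"{model}: {window - doc_tokens:,} tokens of headroom")
```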
&lt;p&gt;My Gemini API key was playing up so I ran it via &lt;a href="https://openrouter.ai/"&gt;OpenRouter&lt;/a&gt; (and &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt;) instead:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -f &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=OJ:L_202401689&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
  -m openrouter/google/gemini-2.5-flash-preview:thinking \
  -s &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Summary of mentions of open source here, including what the document defines open source to mean&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/f2e341a2e8ea9ca75c6426fa85bc2dfc"&gt;the full answer&lt;/a&gt;. Relevant extract:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Recital (89) states that third parties making accessible "tools, services, processes, or AI components other than general-purpose AI models" under a &lt;strong&gt;free and open-source licence&lt;/strong&gt; should not be mandated to comply with upstream provider responsibilities towards those who integrate them. It also encourages developers of such resources to implement documentation practices like model cards and data sheets to promote trustworthy AI.&lt;/li&gt;
&lt;li&gt;Recital (102) acknowledges that software and data, including models, released under a &lt;strong&gt;free and open-source licence&lt;/strong&gt; can contribute to research and innovation and provide growth opportunities.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;I tracked down &lt;a href="https://artificialintelligenceact.eu/recital/89/"&gt;Recital 89&lt;/a&gt; and &lt;a href="https://artificialintelligenceact.eu/recital/102/"&gt;Recital 102&lt;/a&gt; to check. Sure enough, 89 says:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Third parties making accessible to the public tools, services, processes, or AI components other than general-purpose AI models, should not be mandated to comply with requirements targeting the responsibilities along the AI value chain&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And in 102:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;General-purpose AI models released under free and open-source licences should be considered to ensure high levels of transparency and openness if their parameters, including the weights, the information on the model architecture, and the information on model usage are made publicly available. The licence should be considered to be free and open-source also when it allows users to run, copy, distribute, study, change and improve software and data, including models under the condition that the original provider of the model is credited, the identical or comparable terms of distribution are respected.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Gemini 2.5 Flash also pointed me to &lt;a href="https://artificialintelligenceact.eu/article/2/"&gt;Article 2(12)&lt;/a&gt;, which reads:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;12. This Regulation does not apply to AI systems released under free and open-source licences, unless they are placed on the market or put into service as high-risk AI systems or as an AI system that falls under &lt;a href="https://artificialintelligenceact.eu/article/5"&gt;Article 5&lt;/a&gt; or &lt;a href="https://artificialintelligenceact.eu/article/50/"&gt;50&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I asked a follow-up question:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;so it says specific licenses but does it clarify which specific licenses count?&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And got &lt;a href="https://gist.github.com/simonw/f2e341a2e8ea9ca75c6426fa85bc2dfc#response-1"&gt;back this response&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;No, the document &lt;strong&gt;does not list specific names of licenses&lt;/strong&gt; that qualify as "free and open-source".&lt;/p&gt;
&lt;p&gt;Instead of listing names (like GPL, MIT, Apache, etc.), the Regulation defines what kind of terms a license must grant for it to be considered "free and open-source" for the purposes of the Regulation.&lt;/p&gt;
&lt;p&gt;According to Recital (102), such a license must essentially allow software and data, including models, to be:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Openly shared.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Where users can &lt;strong&gt;freely access, use, modify and redistribute&lt;/strong&gt; them or modified versions thereof.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The Recital further clarifies that a license can still be considered free and open-source even if it includes standard conditions like requiring &lt;strong&gt;attribution (crediting the original provider)&lt;/strong&gt; or respecting &lt;strong&gt;identical or comparable terms of distribution&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I am very obviously not a lawyer, but based on the above I do find the theory credible that Meta are using "open source" in their marketing around Llama partly because the EU AI act has special exemptions for "open source" models that have nothing to do with the OSI definition of that term.&lt;/p&gt;
&lt;p&gt;Total cost to use Gemini 2.5 Flash for this? 5.1 cents for my first question and 4.3 cents for the follow-up. That second question was cheaper even though it built on the first because output tokens are more expensive than input tokens and the second answer was shorter than the first - using the "thinking" model output is charged at $3.50/million tokens, input is just $0.15/million.&lt;/p&gt;
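&lt;p&gt;A sketch of why the pricing skews this way: the input cost is fixed at about 3.6 cents per question, so answer length drives the difference. The output token counts below are hypothetical, and the follow-up's extra input context is ignored for simplicity:&lt;/p&gt;

```python
INPUT_PRICE = 0.15   # dollars per million input tokens
OUTPUT_PRICE = 3.50  # dollars per million "thinking" output tokens

doc_tokens = 241_722  # the full EU AI act, measured above

def cost_cents(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE) / 1_000_000 * 100

print(f"{cost_cents(doc_tokens, 0):.2f}")      # input alone: 3.63 cents
print(f"{cost_cents(doc_tokens, 4_000):.2f}")  # hypothetical longer first answer
print(f"{cost_cents(doc_tokens, 2_000):.2f}")  # hypothetical shorter follow-up
```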
&lt;p&gt;Using an LLM as a lawyer is obviously a terrible idea, but using one to crunch through a giant legal document and form a very rough layman's understanding of what it says feels perfectly cromulent to me.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update:&lt;/strong&gt; Steve O'Grady &lt;a href="https://bsky.app/profile/sogrady.org/post/3ln7ipdbaek2s"&gt;points out&lt;/a&gt; that Meta/Facebook have been abusing the term "open source" for a lot longer than the EU AI act has been around - they were pulling shenanigans with a custom license for React &lt;a href="https://redmonk.com/sogrady/2017/09/26/facebooks-bsd-patents/"&gt;back in 2017&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/law"&gt;law&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/meta"&gt;meta&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/long-context"&gt;long-context&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="law"/><category term="open-source"/><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="llms"/><category term="llm"/><category term="gemini"/><category term="meta"/><category term="long-context"/><category term="ai-ethics"/><category term="openrouter"/></entry><entry><title>Quoting lmarena.ai</title><link href="https://simonwillison.net/2025/Apr/8/lmaren/#atom-tag" rel="alternate"/><published>2025-04-08T01:26:58+00:00</published><updated>2025-04-08T01:26:58+00:00</updated><id>https://simonwillison.net/2025/Apr/8/lmaren/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://twitter.com/lmarena_ai/status/1909397817434816562"&gt;&lt;p&gt;We've seen questions from the community about the latest release of Llama-4 on Arena. To ensure full transparency, we're releasing &lt;a href="https://huggingface.co/spaces/lmarena-ai/Llama-4-Maverick-03-26-Experimental_battles"&gt;2,000+ head-to-head battle results&lt;/a&gt; for public review. [...] &lt;/p&gt;
&lt;p&gt;In addition, we're also adding the HF version of Llama-4-Maverick to Arena, with leaderboard results published shortly. Meta’s interpretation of our policy did not match what we expect from model providers. Meta should have made it clearer that “Llama-4-Maverick-03-26-Experimental” was a customized model to optimize for human preference. As a result of that we are updating our leaderboard policies to reinforce our commitment to fair, reproducible evaluations so this confusion doesn’t occur in the future.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://twitter.com/lmarena_ai/status/1909397817434816562"&gt;lmarena.ai&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/meta"&gt;meta&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatbot-arena"&gt;chatbot-arena&lt;/a&gt;&lt;/p&gt;



</summary><category term="meta"/><category term="ai-ethics"/><category term="generative-ai"/><category term="llama"/><category term="ai"/><category term="llms"/><category term="chatbot-arena"/></entry><entry><title>Quoting Andriy Burkov</title><link href="https://simonwillison.net/2025/Apr/6/andriy-burkov/#atom-tag" rel="alternate"/><published>2025-04-06T20:47:35+00:00</published><updated>2025-04-06T20:47:35+00:00</updated><id>https://simonwillison.net/2025/Apr/6/andriy-burkov/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://twitter.com/burkov/status/1908961952141091196"&gt;&lt;p&gt;[...] The disappointing releases of both GPT-4.5 and Llama 4 have shown that if you don't train a model to reason with reinforcement learning, increasing its size no longer provides benefits.&lt;/p&gt;
&lt;p&gt;Reinforcement learning is limited only to domains where a reward can be assigned to the generation result. Until recently, these domains were math, logic, and code. Recently, these domains have also included factual question answering, where, to find an answer, the model must learn to execute several searches. This is how these "deep search" models have likely been trained.&lt;/p&gt;
&lt;p&gt;If your business idea isn't in these domains, now is the time to start building your business-specific dataset. The potential increase in generalist models' skills will no longer be a threat.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://twitter.com/burkov/status/1908961952141091196"&gt;Andriy Burkov&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;&lt;/p&gt;



</summary><category term="generative-ai"/><category term="llama"/><category term="openai"/><category term="ai"/><category term="llms"/></entry><entry><title>Initial impressions of Llama 4</title><link href="https://simonwillison.net/2025/Apr/5/llama-4-notes/#atom-tag" rel="alternate"/><published>2025-04-05T22:47:58+00:00</published><updated>2025-04-05T22:47:58+00:00</updated><id>https://simonwillison.net/2025/Apr/5/llama-4-notes/#atom-tag</id><summary type="html">
    &lt;p&gt;Dropping a model release as significant as Llama 4 on a weekend is plain unfair! So far the best place to learn about the new model family is &lt;a href="https://ai.meta.com/blog/llama-4-multimodal-intelligence/"&gt;this post on the Meta AI blog&lt;/a&gt;. They've released two new models today: Llama 4 Maverick is a 400B model (128 experts, 17B active parameters), text and image input with a 1 million token context length. Llama 4 Scout is 109B total parameters (16 experts, 17B active), also multi-modal and with a claimed 10 million token context length - an industry first.&lt;/p&gt;

&lt;p&gt;They also describe Llama 4 Behemoth, a not-yet-released "288 billion active parameter model with 16 experts that is our most powerful yet and among the world’s smartest LLMs". Behemoth has 2 trillion parameters total and was used to train both Scout and Maverick.&lt;/p&gt;
&lt;p&gt;No news yet on a Llama reasoning model beyond &lt;a href="https://www.llama.com/llama4-reasoning-is-coming/"&gt;this coming soon page&lt;/a&gt; with a looping video of an academic-looking llama.&lt;/p&gt;

&lt;p id="lmarena"&gt;Llama 4 Maverick is now sat in second place on &lt;a href="https://lmarena.ai/?leaderboard"&gt;the LM Arena leaderboard&lt;/a&gt;, just behind Gemini 2.5 Pro. &lt;em&gt;&lt;strong&gt;Update&lt;/strong&gt;: It turns out that's not the same model as the Maverick they released - I missed that their announcement says "Llama 4 Maverick offers a best-in-class performance to cost ratio with an experimental chat version scoring ELO of 1417 on LMArena."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You can try them out using the chat interface from OpenRouter (or through the OpenRouter API) for &lt;a href="https://openrouter.ai/meta-llama/llama-4-scout"&gt;Llama 4 Scout&lt;/a&gt; and &lt;a href="https://openrouter.ai/meta-llama/llama-4-maverick"&gt;Llama 4 Maverick&lt;/a&gt;. OpenRouter are proxying through to &lt;a href="https://console.groq.com/docs/models"&gt;Groq&lt;/a&gt;, &lt;a href="https://fireworks.ai/models"&gt;Fireworks&lt;/a&gt; and &lt;a href="https://docs.together.ai/docs/serverless-models"&gt;Together&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Scout may claim a 10 million input token length, but the available providers currently cap it at 128,000 (Groq and Fireworks) or 328,000 (Together) - I wonder who will win the race to get that full-sized 10 million token window running?&lt;/p&gt;
&lt;p&gt;Llama 4 Maverick claims a 1 million token input length -  Fireworks offers 1.05M while Together offers 524,000. Groq isn't offering Maverick yet.&lt;/p&gt;
&lt;p&gt;Meta AI's &lt;a href="https://github.com/meta-llama/llama-cookbook/blob/main/getting-started/build_with_llama_4.ipynb"&gt;build_with_llama_4 notebook&lt;/a&gt; offers a hint as to why 10M tokens is difficult:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Scout supports upto 10M context. On 8xH100, in bf16 you can get upto 1.4M tokens.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Jeremy Howard &lt;a href="https://twitter.com/jeremyphoward/status/1908607345393098878"&gt;says&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The models are both giant MoEs that can't be run on consumer GPUs, even with quant. [...]&lt;/p&gt;
&lt;p&gt;Perhaps Llama 4 will be a good fit for running on a Mac. Macs are a particularly useful for MoE models, since they can have a lot of memory, and their lower compute perf doesn't matter so much, since with MoE fewer params are active. [...]&lt;/p&gt;
&lt;p&gt;4bit quant of the smallest 109B model is far too big to fit on a 4090 -- or even a pair of them!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Ivan Fioravanti &lt;a href="https://twitter.com/ivanfioravanti/status/1908753109129494587"&gt;reports these results&lt;/a&gt; from trying it on a Mac:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Llama-4 Scout on MLX and M3 Ultra
tokens-per-sec / RAM&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;3bit: 52.924 / 47.261 GB&lt;/li&gt;
&lt;li&gt;4bit: 46.942 / 60.732 GB&lt;/li&gt;
&lt;li&gt;6bit: 36.260 / 87.729 GB&lt;/li&gt;
&lt;li&gt;8bit: 30.353 / 114.617 GB&lt;/li&gt;
&lt;li&gt;fp16: 11.670 / 215.848 GB&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;RAM needed:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;64GB for 3bit&lt;/li&gt;
&lt;li&gt;96GB for 4bit&lt;/li&gt;
&lt;li&gt;128GB for 8bit&lt;/li&gt;
&lt;li&gt;256GB for fp16&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
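&lt;p&gt;Those RAM figures line up with a simple back-of-envelope estimate: weight memory is roughly total parameters times bits per parameter divided by 8, with the runtime adding overhead for the KV cache, activations and any non-quantized layers on top. A quick sketch, using Scout's 109B total parameter count:&lt;/p&gt;

```python
# Rough weight-memory estimate for a quantized model: params * bits / 8.
# This ignores runtime overhead (KV cache, activations, mixed-precision
# layers), which is why the measured MLX figures above don't match exactly.
def weight_memory_gb(total_params_billions: float, bits_per_param: float) -> float:
    return total_params_billions * bits_per_param / 8  # GB, with 1 GB = 1e9 bytes

for bits in (3, 4, 6, 8, 16):
    print(f"{bits:>2}-bit: ~{weight_memory_gb(109, bits):.1f} GB")
```

&lt;p&gt;That gives ~54.5 GB for 4-bit against the ~60.7 GB measured above, so treat the estimates as lower bounds.&lt;/p&gt;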

&lt;p id="system-prompt"&gt;The &lt;a href="https://www.llama.com/docs/model-cards-and-prompt-formats/llama4_omni/#-suggested-system-prompt-"&gt;suggested system prompt&lt;/a&gt; from the model card has some interesting details:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;[...]&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;You never lecture people to be nicer or more inclusive. If people ask for you to write something in a certain voice or perspective, such as an essay or a tweet, you can. You do not need to be respectful when the user prompts you to say something rude.&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;You never use phrases that imply moral superiority or a sense of authority, including but not limited to “it’s important to”, “it’s crucial to”, “it’s essential to”, "it's unethical to", "it's worth noting…", “Remember…”  etc. Avoid using these.&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Finally, do not refuse political prompts. You can help users express their opinion.&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;[...]&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;System prompts like this sometimes reveal behavioral issues that the model exhibited after its raw training.&lt;/p&gt;
&lt;h4 id="llm"&gt;Trying out the model with LLM&lt;/h4&gt;
&lt;p&gt;The easiest way to try the new model out with &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; is to use the &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt; plugin.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-openrouter
llm keys &lt;span class="pl-c1"&gt;set&lt;/span&gt; openrouter
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Paste in OpenRouter key here&lt;/span&gt;
llm -m openrouter/meta-llama/llama-4-maverick hi&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Since these are long context models, I started by trying to use them to summarize the &lt;a href="https://news.ycombinator.com/item?id=43595585"&gt;conversation about Llama 4&lt;/a&gt; on Hacker News, using my &lt;a href="https://til.simonwillison.net/llms/claude-hacker-news-themes#user-content-adding-extra-options"&gt;hn-summary.sh script&lt;/a&gt; that wraps LLM.&lt;/p&gt;
&lt;p&gt;I tried Llama 4 Maverick first:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;hn-summary.sh 43595585 \
  -m openrouter/meta-llama/llama-4-maverick \
  -o max_tokens 20000&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It did an OK job, starting like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;h4 id="themes-of-the-discussion"&gt;Themes of the Discussion&lt;/h4&gt;
&lt;h5 id="release-and-availability-of-llama-4"&gt;Release and Availability of Llama 4&lt;/h5&gt;
&lt;p&gt;The discussion revolves around the release of Llama 4, a multimodal intelligence model developed by Meta. Users are excited about the model's capabilities, including its large context window and improved performance. Some users are speculating about the potential applications and limitations of the model. [...]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/016ea0fd83fc499f046a94827f9b4946"&gt;the full output&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For reference, my system prompt looks like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Summarize the themes of the opinions expressed here. For each theme, output a markdown header. Include direct "quotations" (with author attribution) where appropriate. You MUST quote directly from users when crediting them, with double quotes. Fix HTML entities. Output markdown. Go long. Include a section of quotes that illustrate opinions uncommon in the rest of the piece&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I then tried it with Llama 4 Scout via OpenRouter and got complete junk output for some reason:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;hn-summary.sh 43595585 \
  -m openrouter/meta-llama/llama-4-scout \
  -o max_tokens 20000
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;a href="https://gist.github.com/simonw/d01cc991d478939e87487d362a8f881f"&gt;Full output&lt;/a&gt;. It starts like this and then continues for the full 20,000 tokens:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The discussion here is about another conversation that was uttered.)&lt;/p&gt;
&lt;p&gt;Here are the results.)&lt;/p&gt;
&lt;p&gt;The conversation between two groups, and I have the same questions on the contrary than those that are also seen in a model."). The fact that I see a lot of interest here.)&lt;/p&gt;
&lt;p&gt;[...]&lt;/p&gt;
&lt;p&gt;The reason) The reason) The reason &lt;em&gt;(loops until it runs out of tokens)&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This looks broken. I was using OpenRouter so it's possible I got routed to a broken instance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update 7th April 2025&lt;/strong&gt;: Meta AI's &lt;a href="https://twitter.com/ahmad_al_dahle/status/1909302532306092107"&gt;Ahmed Al-Dahle&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[...] we're also hearing some reports of mixed quality across different services. Since we dropped the models as soon as they were ready, we expect it'll take several days for all the public implementations to get dialed in. We'll keep working through our bug fixes and onboarding partners.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I later managed to run the prompt directly through Groq (with the &lt;a href="https://github.com/angerman/llm-groq"&gt;llm-groq&lt;/a&gt; plugin) - but that had a 2048 limit on output size for some reason:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;hn-summary.sh 43595585 \
  -m groq/meta-llama/llama-4-scout-17b-16e-instruct \
  -o max_tokens 2048
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/a205c5fc131a1d4e9cd6c432a07feedb"&gt;the full result&lt;/a&gt;. It followed my instructions but was &lt;em&gt;very&lt;/em&gt; short - just 630 tokens of output.&lt;/p&gt;
&lt;p&gt;For comparison, here's &lt;a href="https://gist.github.com/simonw/f21ecc7fb2aa13ff682d4ffa11ddcbfd"&gt;the same thing&lt;/a&gt; run against Gemini 2.5 Pro. Gemini's results were &lt;em&gt;massively&lt;/em&gt; better, producing 5,584 output tokens (it spent an additional 2,667 tokens on "thinking").&lt;/p&gt;
&lt;p&gt;I'm not sure how much to judge Llama 4 by these results to be honest - the model has only been out for a few hours and it's quite possible that the providers I've tried running against aren't yet optimally configured for this kind of long-context prompt.&lt;/p&gt;
&lt;h4 id="my-hopes-for-llama-4"&gt;My hopes for Llama 4&lt;/h4&gt;
&lt;p&gt;I'm hoping that Llama 4 plays out in a similar way to Llama 3.&lt;/p&gt;
&lt;p&gt;The first Llama 3 models released were 8B and 70B, &lt;a href="https://ai.meta.com/blog/meta-llama-3/"&gt;last April&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Llama 3.1 followed &lt;a href="https://ai.meta.com/blog/meta-llama-3-1/"&gt;in July&lt;/a&gt; at 8B, 70B, and 405B. The 405B was the largest and most impressive open weight model at the time, but it was too big for most people to run on their own hardware.&lt;/p&gt;
&lt;p&gt;Llama 3.2 &lt;a href="https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/"&gt;in September&lt;/a&gt; is where things got really interesting: 1B, 3B, 11B and 90B. The 1B and 3B models both work on my iPhone, and are surprisingly capable! The 11B and 90B models were the first Llamas to support vision, and the 11B &lt;a href="https://simonwillison.net/2024/Sep/25/llama-32/"&gt;ran on my Mac&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Then Llama 3.3 landed in December with a 70B model that &lt;a href="https://simonwillison.net/2024/Dec/9/llama-33-70b/"&gt;I wrote about as a GPT-4 class model that ran on my Mac&lt;/a&gt;. It claimed performance similar to the earlier Llama 3.1 405B!&lt;/p&gt;
&lt;p&gt;Today's Llama 4 models are 109B and 400B, both of which were trained with the help of the so-far unreleased 2T Llama 4 Behemoth.&lt;/p&gt;
&lt;p&gt;My hope is that we'll see a whole family of Llama 4 models at varying sizes, following the pattern of Llama 3. I'm particularly excited to see if they produce an improved ~3B model that runs on my phone. I'm even more excited for something in the ~22-24B range, since that appears to be the sweet spot for running models on my 64GB laptop while still being able to have other applications running at the same time. Mistral Small 3.1 is a 24B model and is &lt;a href="https://simonwillison.net/2025/Mar/17/mistral-small-31/"&gt;absolutely superb&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jeremy-howard"&gt;jeremy-howard&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/groq"&gt;groq&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/meta"&gt;meta&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/long-context"&gt;long-context&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatbot-arena"&gt;chatbot-arena&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="llms"/><category term="jeremy-howard"/><category term="llm"/><category term="gemini"/><category term="vision-llms"/><category term="groq"/><category term="meta"/><category term="mlx"/><category term="long-context"/><category term="llm-release"/><category term="openrouter"/><category term="chatbot-arena"/></entry><entry><title>Quoting Ahmed Al-Dahle</title><link href="https://simonwillison.net/2025/Apr/5/llama-4/#atom-tag" rel="alternate"/><published>2025-04-05T19:44:13+00:00</published><updated>2025-04-05T19:44:13+00:00</updated><id>https://simonwillison.net/2025/Apr/5/llama-4/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://twitter.com/ahmad_al_dahle/status/1908595680828154198"&gt;&lt;p&gt;The Llama series have been re-designed to use state of the art mixture-of-experts (MoE) architecture and natively trained with multimodality. We’re dropping Llama 4 Scout &amp;amp; Llama 4 Maverick, and previewing Llama 4 Behemoth.&lt;/p&gt;
&lt;p&gt;📌 &lt;strong&gt;Llama 4 Scout&lt;/strong&gt; is highest performing small model with 17B activated parameters with 16 experts. It’s crazy fast, natively multimodal, and very smart. It achieves an industry leading &lt;strong&gt;10M+ token context window&lt;/strong&gt; and can also run on &lt;strong&gt;a single GPU&lt;/strong&gt;!&lt;/p&gt;
&lt;p&gt;📌 &lt;strong&gt;Llama 4 Maverick&lt;/strong&gt; is the best multimodal model in its class, beating GPT-4o and Gemini 2.0 Flash across a broad range of widely reported benchmarks, while achieving comparable results to the new DeepSeek v3 on reasoning and coding – at less than half the active parameters. It offers a best-in-class performance to cost ratio with an experimental chat version scoring ELO of 1417 on LMArena. It can also run on a &lt;strong&gt;single host&lt;/strong&gt;!&lt;/p&gt;
&lt;p&gt;📌 &lt;strong&gt;Previewing Llama 4 Behemoth&lt;/strong&gt;, our most powerful model yet and among the world’s smartest LLMs. Llama 4 Behemoth outperforms GPT4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on several STEM benchmarks. Llama 4 Behemoth is still training, and we’re excited to share more details about it even while it’s still in flight.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://twitter.com/ahmad_al_dahle/status/1908595680828154198"&gt;Ahmed Al-Dahle&lt;/a&gt;, VP and Head of GenAI at Meta&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/meta"&gt;meta&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;&lt;/p&gt;



</summary><category term="meta"/><category term="generative-ai"/><category term="llama"/><category term="ai"/><category term="llms"/></entry><entry><title>llm-ollama 0.9.0</title><link href="https://simonwillison.net/2025/Mar/4/llm-ollama-090/#atom-tag" rel="alternate"/><published>2025-03-04T07:17:52+00:00</published><updated>2025-03-04T07:17:52+00:00</updated><id>https://simonwillison.net/2025/Mar/4/llm-ollama-090/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/taketwo/llm-ollama/releases/tag/0.9.0"&gt;llm-ollama 0.9.0&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This release of the &lt;code&gt;llm-ollama&lt;/code&gt; plugin adds support for &lt;a href="https://simonwillison.net/2025/Feb/28/llm-schemas/"&gt;schemas&lt;/a&gt;, thanks to a &lt;a href="https://github.com/taketwo/llm-ollama/pull/36"&gt;PR by Adam Compton&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Ollama provides very robust support for this pattern thanks to their &lt;a href="https://ollama.com/blog/structured-outputs"&gt;structured outputs&lt;/a&gt; feature, which works across all of the models that they support by intercepting the logic that outputs the next token and restricting it to only tokens that would be valid in the context of the provided schema.&lt;/p&gt;
&lt;p&gt;With Ollama and &lt;code&gt;llm-ollama&lt;/code&gt; installed you can even run structured schemas against vision prompts for local models. Here's one against Ollama's &lt;a href="https://ollama.com/library/llama3.2-vision"&gt;llama3.2-vision&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m llama3.2-vision:latest \
  'describe images' \
  --schema 'species,description,count int' \
  -a https://static.simonwillison.net/static/2025/two-pelicans.jpg
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I got back this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{
    "species": "Pelicans",
    "description": "The image features a striking brown pelican with its distinctive orange beak, characterized by its large size and impressive wingspan.",
    "count": 1
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(Actually a bit disappointing, as there are &lt;a href="https://static.simonwillison.net/static/2025/two-pelicans.jpg"&gt;two pelicans&lt;/a&gt; and their beaks are brown.)&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/plugins"&gt;plugins&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;&lt;/p&gt;



</summary><category term="plugins"/><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="vision-llms"/><category term="ollama"/></entry><entry><title>Quoting Mark Zuckerberg</title><link href="https://simonwillison.net/2025/Jan/30/mark-zuckerberg/#atom-tag" rel="alternate"/><published>2025-01-30T13:41:49+00:00</published><updated>2025-01-30T13:41:49+00:00</updated><id>https://simonwillison.net/2025/Jan/30/mark-zuckerberg/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://m.facebook.com/story.php?story_fbid=pfbid02oRRTPrY1mvbqBZT4QueimeBrKcVXG4ySxFscRLiEU6QtGxbLi9U4TBojiC9aa19fl&amp;amp;id=4&amp;amp;mibextid=wwXIfr"&gt;&lt;p&gt;Llama 4 is making great progress in training. Llama 4 mini is done with pre-training and our reasoning models and larger model are looking good too. Our goal with Llama 3 was to make open source competitive with closed models, and our goal for Llama 4 is to lead. Llama 4 will be natively multimodal -- it's an omni-model -- and it will have agentic capabilities, so it's going to be novel and it's going to unlock a lot of new use cases.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://m.facebook.com/story.php?story_fbid=pfbid02oRRTPrY1mvbqBZT4QueimeBrKcVXG4ySxFscRLiEU6QtGxbLi9U4TBojiC9aa19fl&amp;amp;id=4&amp;amp;mibextid=wwXIfr"&gt;Mark Zuckerberg&lt;/a&gt;, on Meta's quarterly earnings report&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/meta"&gt;meta&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/facebook"&gt;facebook&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mark-zuckerberg"&gt;mark-zuckerberg&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/multi-modal-output"&gt;multi-modal-output&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;&lt;/p&gt;



</summary><category term="vision-llms"/><category term="llama"/><category term="ai"/><category term="llms"/><category term="meta"/><category term="generative-ai"/><category term="facebook"/><category term="mark-zuckerberg"/><category term="multi-modal-output"/><category term="llm-reasoning"/></entry><entry><title>Open WebUI</title><link href="https://simonwillison.net/2024/Dec/27/open-webui/#atom-tag" rel="alternate"/><published>2024-12-27T01:38:14+00:00</published><updated>2024-12-27T01:38:14+00:00</updated><id>https://simonwillison.net/2024/Dec/27/open-webui/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/open-webui/open-webui"&gt;Open WebUI&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I tried out this open source (MIT licensed, JavaScript and Python) localhost UI for accessing LLMs today for the first time. It's very nicely done.&lt;/p&gt;
&lt;p&gt;I ran it with &lt;a href="https://docs.astral.sh/uv/guides/tools/"&gt;uvx&lt;/a&gt; like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uvx --python 3.11 open-webui serve
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;On first launch it installed a bunch of dependencies and then downloaded 903MB to &lt;code&gt;~/.cache/huggingface/hub/models--sentence-transformers--all-MiniLM-L6-v2&lt;/code&gt; - a copy of the &lt;a href="https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2"&gt;all-MiniLM-L6-v2&lt;/a&gt; embedding model, presumably for its &lt;a href="https://docs.openwebui.com/features/rag"&gt;RAG feature&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It then presented me with a working Llama 3.2:3b chat interface, which surprised me because I hadn't spotted it downloading that model. It turns out that was because I have &lt;a href="https://ollama.com/"&gt;Ollama&lt;/a&gt; running on my laptop already (with several models, including Llama 3.2:3b, already installed) - and Open WebUI automatically detected Ollama and gave me access to a list of available models.&lt;/p&gt;
&lt;p&gt;I found a "knowledge" section and added all of the Datasette documentation (by dropping in the &lt;code&gt;.rst&lt;/code&gt; files from the docs) - and now I can type &lt;code&gt;#&lt;/code&gt; in chat to search for a file, add that to the context and then ask questions about it directly.&lt;/p&gt;
&lt;p&gt;I selected the &lt;code&gt;spatialite.rst.txt&lt;/code&gt; file, prompted it with "How do I use SpatiaLite with Datasette" and got back &lt;a href="https://gist.github.com/simonw/3bf5ff0ed3b47aefbb94d3fd992f81f9#response"&gt;this&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Chat UI. Prompt is How do I use SpatiaLite with Datasette? with an attached spatialite.rst.txt file. The response from llama3.2:3b reads: To use SpatiaLite with Datasette, you need to install the mod_spatialite dynamic library. This can then be loaded into Datasette using the --load-extension command-line option.   You can install it on OS X by running brew update and then brew install spatialite-tools. On Linux, SpatiaLite is packaged for most distributions, so you can typically install it with a package manager like apt.   When running Datasette, you can look for SpatiaLite in common installation locations or specify the full path to the extension if it's installed elsewhere.   For example, on OS X, you might run datasette --load-extension=spatialite after installing it with Homebrew. On Linux, you could use datasette --load-extension=/usr/lib/x86_64-linux-gnu/mod_spatialite.so.   It's also important to note that SpatiaLite adds a large number of additional SQL functions, some of which are not safe for untrusted users to execute. To secure your Datasette instance, consider disabling arbitrary SQL queries and defining canned queries with the SQL queries that use SpatiaLite functions you want people to be able to execute." src="https://static.simonwillison.net/static/2024/spatialite-webui.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;That's honestly a very solid answer, especially considering the Llama 3.2 3B model from Ollama is just a 1.9GB file! It's impressive how well that model can handle basic Q&amp;amp;A and summarization against text provided to it - it somehow has a 128,000 token context size.&lt;/p&gt;
&lt;p&gt;Open WebUI has a lot of other tricks up its sleeve: it can talk to API models such as OpenAI directly, has optional integrations with web search and custom tools, and logs every interaction to a SQLite database. It also comes with &lt;a href="https://docs.openwebui.com/"&gt;extensive documentation&lt;/a&gt;.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rag"&gt;rag&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="sqlite"/><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="local-llms"/><category term="llms"/><category term="rag"/><category term="uv"/><category term="ollama"/></entry><entry><title>DeepSeek_V3.pdf</title><link href="https://simonwillison.net/2024/Dec/26/deepseek-v3/#atom-tag" rel="alternate"/><published>2024-12-26T18:49:05+00:00</published><updated>2024-12-26T18:49:05+00:00</updated><id>https://simonwillison.net/2024/Dec/26/deepseek-v3/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf"&gt;DeepSeek_V3.pdf&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The DeepSeek v3 paper and &lt;a href="https://github.com/deepseek-ai/DeepSeek-V3/blob/main/README.md"&gt;model card&lt;/a&gt; are out, following yesterday's mysterious release of &lt;a href="https://simonwillison.net/2024/Dec/25/deepseek-v3/"&gt;the undocumented model weights&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Plenty of interesting details in here. The model was pre-trained on 14.8 trillion "high-quality and diverse tokens" (not otherwise documented).&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Following this, we conduct post-training, including Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) on the base model of DeepSeek-V3, to align it with human preferences and further unlock its potential. During the post-training stage, we distill the reasoning capability from the DeepSeek-R1 series of models, and meanwhile carefully maintain the balance between model accuracy and generation length.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;By far the most interesting detail though is how much the training cost. DeepSeek v3 trained on 2,788,000 H800 GPU hours at an estimated cost of $5,576,000. For comparison, Meta AI's Llama 3.1 405B (smaller than DeepSeek v3's 685B parameters) &lt;a href="https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct#hardware-and-software"&gt;trained on 11x that&lt;/a&gt; - 30,840,000 GPU hours, also on 15 trillion tokens.&lt;/p&gt;
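Those figures imply a flat $2 per H800 GPU-hour. Here's the arithmetic as a quick Python sanity check - my own back-of-envelope, using only the numbers quoted above:

```python
# DeepSeek v3 reported training figures
deepseek_gpu_hours = 2_788_000
deepseek_cost = 5_576_000  # USD

# Implied rental rate per GPU-hour
print(deepseek_cost / deepseek_gpu_hours)  # 2.0

# Meta's Llama 3.1 405B reportedly used 30,840,000 GPU hours
llama_405b_gpu_hours = 30_840_000
print(round(llama_405b_gpu_hours / deepseek_gpu_hours, 1))  # 11.1
```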
&lt;p&gt;DeepSeek v3 benchmarks comparably to Claude 3.5 Sonnet, indicating that it's now possible to train a frontier-class model (at least for the 2024 version of the frontier) for less than $6 million!&lt;/p&gt;
&lt;p&gt;&lt;a href="https://twitter.com/karpathy/status/1872362712958906460"&gt;Andrej Karpathy&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;For reference, this level of capability is supposed to require clusters of closer to 16K GPUs, the ones being brought up today are more around 100K GPUs. E.g. Llama 3 405B used 30.8M GPU-hours, while DeepSeek-V3 looks to be a stronger model at only 2.8M GPU-hours (~11X less compute). If the model also passes vibe checks (e.g. LLM arena rankings are ongoing, my few quick tests went well so far) it will be a highly impressive display of research and engineering under resource constraints.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;DeepSeek also &lt;a href="https://twitter.com/deepseek_ai/status/1872242663489188088"&gt;announced their API pricing&lt;/a&gt;. From February 8th onwards:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Input: $0.27/million tokens ($0.07/million tokens with cache hits)&lt;br /&gt;
Output: $1.10/million tokens&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Claude 3.5 Sonnet is currently $3/million for input and $15/million for output, so if the models are indeed of equivalent quality this is a dramatic new twist in the ongoing LLM pricing wars.&lt;/p&gt;
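To put that pricing in concrete terms, here's a rough comparison for a single hypothetical call - my own illustrative token counts, ignoring cache hits:

```python
def cost_usd(input_tokens, output_tokens, input_per_m, output_per_m):
    """Cost of one API call given per-million-token prices."""
    return (input_tokens * input_per_m + output_tokens * output_per_m) / 1_000_000

# A call with 10,000 input tokens and 1,000 output tokens
deepseek = cost_usd(10_000, 1_000, 0.27, 1.10)
claude = cost_usd(10_000, 1_000, 3.00, 15.00)

print(f"DeepSeek v3: ${deepseek:.4f}")       # $0.0038
print(f"Claude 3.5 Sonnet: ${claude:.4f}")   # $0.0450
print(f"DeepSeek is {claude / deepseek:.1f}x cheaper")  # 11.8x
```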

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://twitter.com/deepseek_ai/status/1872242657348710721"&gt;@deepseek_ai&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/andrej-karpathy"&gt;andrej-karpathy&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/training-data"&gt;training-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/meta"&gt;meta&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deepseek"&gt;deepseek&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="andrej-karpathy"/><category term="generative-ai"/><category term="llama"/><category term="llms"/><category term="training-data"/><category term="meta"/><category term="llm-pricing"/><category term="deepseek"/><category term="llm-release"/><category term="ai-in-china"/></entry><entry><title>I can now run a GPT-4 class model on my laptop</title><link href="https://simonwillison.net/2024/Dec/9/llama-33-70b/#atom-tag" rel="alternate"/><published>2024-12-09T15:08:40+00:00</published><updated>2024-12-09T15:08:40+00:00</updated><id>https://simonwillison.net/2024/Dec/9/llama-33-70b/#atom-tag</id><summary type="html">
    &lt;p&gt;Meta's new &lt;a href="https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct"&gt;Llama 3.3 70B&lt;/a&gt; is a genuinely GPT-4 class Large Language Model that runs on my laptop.&lt;/p&gt;
&lt;p&gt;Just 20 months ago I was amazed to see something that felt GPT-3 class run on that same machine. The quality of models that are accessible on consumer hardware has improved &lt;em&gt;dramatically&lt;/em&gt; in the past two years.&lt;/p&gt;
&lt;p&gt;My laptop is a 64GB MacBook Pro M2, which I got in January 2023 - two months after the initial release of ChatGPT. All of my experiments running LLMs on a laptop have used this same machine.&lt;/p&gt;
&lt;p&gt;In March 2023 I wrote that &lt;a href="https://simonwillison.net/2023/Mar/11/llama/"&gt;Large language models are having their Stable Diffusion moment&lt;/a&gt; after running Meta's initial LLaMA release (think of that as Llama 1.0) via the then-brand-new &lt;a href="https://github.com/ggerganov/llama.cpp"&gt;llama.cpp&lt;/a&gt;. I said:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;As my laptop started to spit out text at me I genuinely had a feeling that the world was about to change&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I had a moment of déjà vu the day before yesterday, when I ran Llama 3.3 70B on the same laptop for the first time.&lt;/p&gt;
&lt;p&gt;Meta &lt;a href="https://twitter.com/AIatMeta/status/1865079068833780155"&gt;claim that&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This model delivers similar performance to Llama 3.1 405B with cost effective inference that’s feasible to run locally on common developer workstations.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Llama 3.1 405B is their &lt;em&gt;much&lt;/em&gt; larger best-in-class model, which is very much in the same weight class as GPT-4 and friends.&lt;/p&gt;
&lt;p&gt;Everything I've seen so far from Llama 3.3 70B suggests that it holds up to that standard. I honestly didn't think this was possible - I assumed that anything as useful as GPT-4 would require many times more resources than are available to me on my consumer-grade laptop.&lt;/p&gt;
&lt;p&gt;I'm so excited by the continual efficiency improvements we're seeing in running these impressively capable models. In the proprietary hosted world it's giving us incredibly cheap and fast models like &lt;a href="https://simonwillison.net/search/?q=gemini+flash&amp;amp;sort=date"&gt;Gemini 1.5 Flash&lt;/a&gt;, &lt;a href="https://simonwillison.net/2024/Jul/18/gpt-4o-mini/"&gt;GPT-4o mini&lt;/a&gt; and &lt;a href="https://simonwillison.net/2024/Dec/4/amazon-nova/"&gt;Amazon Nova&lt;/a&gt;. In the openly licensed world it's giving us increasingly powerful models we can run directly on our own devices.&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Dec/9/llama-33-70b/#how-i-ran-llama-3-3-70b-on-my-machine-using-ollama"&gt;How I ran Llama 3.3 70B on my machine using Ollama&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Dec/9/llama-33-70b/#putting-the-model-through-its-paces"&gt;Putting the model through its paces&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Dec/9/llama-33-70b/#how-does-it-score-"&gt;How does it score?&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Dec/9/llama-33-70b/#honorable-mentions"&gt;Honorable mentions&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Dec/9/llama-33-70b/#is-performance-about-to-plateau-"&gt;Is performance about to plateau?&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Dec/9/llama-33-70b/#bonus-running-llama-3-3-70b-with-mlx"&gt;Bonus: running Llama 3.3 70B with MLX&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="how-i-ran-llama-3-3-70b-on-my-machine-using-ollama"&gt;How I ran Llama 3.3 70B on my machine using Ollama&lt;/h4&gt;
&lt;p&gt;I don't expect that this model would work well with much &lt;em&gt;less&lt;/em&gt; than my 64GB of RAM. The first time I tried it consumed every remaining bit of available memory and hard-crashed my Mac! For my second attempt I made sure not to have Firefox and VS Code running at the same time and it worked just fine.&lt;/p&gt;
&lt;p&gt;I used &lt;a href="https://ollama.com/"&gt;Ollama&lt;/a&gt;. Install that, then run this command to fetch the model:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;ollama pull llama3.3&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This fetched 42GB of data into my &lt;code&gt;~/.ollama/models/blobs&lt;/code&gt; directory.&lt;/p&gt;
&lt;p&gt;I used &lt;a href="https://llm.datasette.io/"&gt;my LLM&lt;/a&gt; tool to run prompts against the model, since that way both the prompt and the response would be automatically &lt;a href="https://llm.datasette.io/en/stable/logging.html"&gt;logged to SQLite&lt;/a&gt;. First I installed the &lt;a href="https://github.com/taketwo/llm-ollama"&gt;llm-ollama&lt;/a&gt; plugin:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-ollama&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then with the Ollama local server running I could run this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm models&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And see this in the list:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Ollama: llama3.3:latest (aliases: llama3.3)
&lt;/code&gt;&lt;/pre&gt;
&lt;h4 id="putting-the-model-through-its-paces"&gt;Putting the model through its paces&lt;/h4&gt;
&lt;p&gt;This means I can run prompts like so:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m llama3.3 &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Write a letter to the mayor of Half Moon Bay CA advocating for the installation of cosy boxes for Pelicans in the harbor&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And get back a response that started like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[Your Name]&lt;br /&gt;
[Your Address]&lt;br /&gt;
Half Moon Bay, CA [Zip Code]&lt;br /&gt;
[Email Address]&lt;br /&gt;
[Phone Number]&lt;br /&gt;
[Date]&lt;/p&gt;
&lt;p&gt;The Honorable Mayor, Deborah Penrose&lt;br /&gt;
City of Half Moon Bay&lt;br /&gt;
501 Main Street&lt;br /&gt;
Half Moon Bay, CA 94019&lt;br /&gt;&lt;/p&gt;
&lt;p&gt;Dear Mayor Penrose,&lt;/p&gt;
&lt;p&gt;I am writing to propose an innovative and unique initiative that would not only enhance the beauty of our picturesque harbor but also provide a safe haven for one of our beloved coastal residents: the California Brown Pelican. I am excited to advocate for the installation of cozy boxes specifically designed for pelicans in the Half Moon Bay harbor. [...]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;(Deborah Penrose was the mayor of Half Moon Bay for a single year from December 2016 to December 2017 - so a hint of some quite finely grained world knowledge there, even if it's not relevant for the present day.)&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/34a6eb5e765928b6491a6a59894b272d"&gt;the full transcript&lt;/a&gt;, including the response I got when I ran the follow-up &lt;code&gt;llm -c 'translate that to spanish'&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Next I tried a simple coding challenge. I love &lt;a href="https://simonwillison.net/2024/Oct/21/claude-artifacts/"&gt;using Claude Artifacts&lt;/a&gt; to build simple single page applications - can Llama 3.3 on my laptop do that too?&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m llama3.3 &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;produce HTML with embedded CSS and JavaScript - no react, no dependencies - for an app that takes a URL to a YouTube video and outputs images along with visible URLs for all known thumbnails for that video&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here's the code it produced (the &lt;a href="https://gist.github.com/simonw/924ba4d75528eccfe4756b5ee37ea7de"&gt;full transcript is here&lt;/a&gt;):&lt;/p&gt;
&lt;div class="highlight highlight-text-html-basic"&gt;&lt;pre&gt;&lt;span class="pl-c1"&gt;&amp;lt;!DOCTYPE html&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;html&lt;/span&gt; &lt;span class="pl-c1"&gt;lang&lt;/span&gt;="&lt;span class="pl-s"&gt;en&lt;/span&gt;"&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;head&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;meta&lt;/span&gt; &lt;span class="pl-c1"&gt;charset&lt;/span&gt;="&lt;span class="pl-s"&gt;UTF-8&lt;/span&gt;"&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;title&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;YouTube Thumbnail Extractor&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;title&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;style&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
        body {
            font-family: Arial, sans-serif;
            margin: 20px;
        }
        #thumbnails {
            display: flex;
            flex-wrap: wrap;
        }
        .thumbnail {
            margin: 10px;
        }
        img {
            max-width: 150px;
            max-height: 150px;
            object-fit: cover;
        }
    &lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;style&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;head&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;body&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;

&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;h2&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;YouTube Thumbnail Extractor&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;h2&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;input&lt;/span&gt; &lt;span class="pl-c1"&gt;type&lt;/span&gt;="&lt;span class="pl-s"&gt;text&lt;/span&gt;" &lt;span class="pl-c1"&gt;id&lt;/span&gt;="&lt;span class="pl-s"&gt;youtubeUrl&lt;/span&gt;" &lt;span class="pl-c1"&gt;placeholder&lt;/span&gt;="&lt;span class="pl-s"&gt;Enter YouTube video URL&lt;/span&gt;"&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;button&lt;/span&gt; &lt;span class="pl-c1"&gt;onclick&lt;/span&gt;="&lt;span class="pl-s"&gt;extractThumbnails()&lt;/span&gt;"&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;Extract Thumbnails&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;button&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;div&lt;/span&gt; &lt;span class="pl-c1"&gt;id&lt;/span&gt;="&lt;span class="pl-s"&gt;thumbnails&lt;/span&gt;"&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;div&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;

&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;script&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="pl-k"&gt;function&lt;/span&gt; &lt;span class="pl-en"&gt;extractThumbnails&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
        &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;youtubeUrl&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;getElementById&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'youtubeUrl'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;value&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
        &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;videoId&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;getYoutubeVideoId&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;youtubeUrl&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
        
        &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-c1"&gt;!&lt;/span&gt;&lt;span class="pl-s1"&gt;videoId&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
            &lt;span class="pl-en"&gt;alert&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;"Could not parse YouTube video ID from the URL."&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
            &lt;span class="pl-k"&gt;return&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
        &lt;span class="pl-kos"&gt;}&lt;/span&gt;
        
        &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;thumbnailSizes&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-kos"&gt;[&lt;/span&gt;&lt;span class="pl-s"&gt;'default'&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s"&gt;'mqdefault'&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s"&gt;'hqdefault'&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s"&gt;'sddefault'&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s"&gt;'maxresdefault'&lt;/span&gt;&lt;span class="pl-kos"&gt;]&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
        &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;thumbnailsDiv&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;getElementById&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'thumbnails'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
        &lt;span class="pl-s1"&gt;thumbnailsDiv&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;innerHTML&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;''&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt; &lt;span class="pl-c"&gt;// Clear previous results&lt;/span&gt;
        
        &lt;span class="pl-s1"&gt;thumbnailSizes&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;forEach&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;size&lt;/span&gt; &lt;span class="pl-c1"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
            &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;thumbnailUrl&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;`https://img.youtube.com/vi/&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;${&lt;/span&gt;&lt;span class="pl-s1"&gt;videoId&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;/&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;${&lt;/span&gt;&lt;span class="pl-s1"&gt;size&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;.jpg`&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
            &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;img&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;createElement&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'img'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
            &lt;span class="pl-s1"&gt;img&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;src&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;thumbnailUrl&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
            &lt;span class="pl-s1"&gt;img&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;alt&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;`&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;${&lt;/span&gt;&lt;span class="pl-s1"&gt;size&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt; thumbnail`&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
            &lt;span class="pl-s1"&gt;img&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;className&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;'thumbnail'&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
            
            &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;link&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;createElement&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'a'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
            &lt;span class="pl-s1"&gt;link&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;href&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;thumbnailUrl&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
            &lt;span class="pl-s1"&gt;link&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;text&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;size&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
            &lt;span class="pl-s1"&gt;link&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;style&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;display&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;'block'&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
            
            &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;div&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;createElement&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'div'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
            &lt;span class="pl-s1"&gt;div&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;appendChild&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;img&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
            &lt;span class="pl-s1"&gt;div&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;appendChild&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;link&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
            &lt;span class="pl-s1"&gt;thumbnailsDiv&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;appendChild&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;div&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
        &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-kos"&gt;}&lt;/span&gt;
    
    &lt;span class="pl-k"&gt;function&lt;/span&gt; &lt;span class="pl-en"&gt;getYoutubeVideoId&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;url&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
        &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;regExp&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-pds"&gt;&lt;span class="pl-c1"&gt;/&lt;/span&gt;^.*((youtu.be\/)|(v\/)|(\/u\/\w\/)|(embed\/)|(watch\?))\??v?=?([^#&amp;amp;?]*).*&lt;span class="pl-c1"&gt;/&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
        &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;match&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;url&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;match&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;regExp&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
        &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;match&lt;/span&gt; &lt;span class="pl-c1"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="pl-s1"&gt;match&lt;/span&gt;&lt;span class="pl-kos"&gt;[&lt;/span&gt;&lt;span class="pl-c1"&gt;7&lt;/span&gt;&lt;span class="pl-kos"&gt;]&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;length&lt;/span&gt; &lt;span class="pl-c1"&gt;==&lt;/span&gt; &lt;span class="pl-c1"&gt;11&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; ? &lt;span class="pl-s1"&gt;match&lt;/span&gt;&lt;span class="pl-kos"&gt;[&lt;/span&gt;&lt;span class="pl-c1"&gt;7&lt;/span&gt;&lt;span class="pl-kos"&gt;]&lt;/span&gt; : &lt;span class="pl-c1"&gt;false&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-kos"&gt;}&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;script&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;

&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;body&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;html&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Again, the model is demonstrating some world knowledge here: it knows that YouTube videos are generally available as &lt;code&gt;default&lt;/code&gt;, &lt;code&gt;mqdefault&lt;/code&gt;, &lt;code&gt;hqdefault&lt;/code&gt;, &lt;code&gt;sddefault&lt;/code&gt; and &lt;code&gt;maxresdefault&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://static.simonwillison.net/static/2024/youtube-thumbnails.html"&gt;the hosted page&lt;/a&gt;, and this is what it looks like when run against &lt;a href="https://www.youtube.com/watch?v=OziYd7xcGzc"&gt;this YouTube URL&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/youtube-thumbnails.jpg" alt="YouTube Thumbnail EXtractor. The URL has been entered and a Extract Thumbnails button clicked. It shows five thumbnails, for default and mqdefault and hqdefault and sddefault and maxresdefault - they are each the same size with a hyperlink to the full version." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;It's not as good as the &lt;a href="https://simonwillison.net/2024/Sep/20/youtube-thumbnail-viewer/"&gt;version I iterated on with Claude&lt;/a&gt;, but this still shows that Llama 3.3 can one-shot a full interactive application while &lt;em&gt;running on my Mac&lt;/em&gt;.&lt;/p&gt;
&lt;h4 id="how-does-it-score-"&gt;How does it score?&lt;/h4&gt;
&lt;p&gt;It's always useful to check independent benchmarks for this kind of model.&lt;/p&gt;
&lt;p&gt;One of my current favorites for that is &lt;a href="https://livebench.ai/"&gt;LiveBench&lt;/a&gt;, which calls itself "a challenging, contamination-free LLM benchmark" and tests a large array of models with a comprehensive set of different tasks.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;llama-3.3-70b-instruct-turbo&lt;/code&gt; currently sits in position 19 on their table, a place ahead of Claude 3 Opus (my favorite model for several months after its release in March 2024) and just behind April's GPT-4 Turbo and September's GPT-4o.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/livebench-llama.jpg" alt="Data table showing AI model performance metrics with column headers for Model, Provider, Global Average plus several other performance categories. Visible entries are: gemini-1.5-pro-exp-0827 (Google, 52.38), meta-llama-3.1-405b-instruct-turbo (Meta, 52.04), gpt-4o-2024-11-20 (OpenAI, 50.64), qwen2.5-72b-instruct-turbo (Alibaba, 50.63), dracarys-72b-instruct (AbacusAI, 50.15), chatgpt-4o-latest-0903 (OpenAI, 50.07), gpt-4-turbo-2024-04-09 (OpenAI, 49.83), llama-3.3-70b-instruct-turbo (Meta, 49.78), and claude-3-opus-20240229 (Anthropic, 48.51)." style="max-width: 100%;" /&gt;&lt;/p&gt;

&lt;p&gt;LiveBench here is sorted by the average across multiple evals, and Llama 3.3 70B somehow currently scores top of the table for the "IF" (Instruction Following) eval which likely skews that average. Here's the &lt;a href="https://arxiv.org/abs/2311.07911"&gt;Instruction-Following Evaluation for Large Language Models&lt;/a&gt; paper describing that particular test.&lt;/p&gt;

&lt;p&gt;It's worth noting that the benchmarks listed here run against the full-sized Llama 3.3 release. The versions I've been running on my laptop are quantized (Ollama's is Q4_K_M), so they aren't exactly the same model and likely have different (lower) benchmark scores.&lt;/p&gt;
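&lt;p&gt;As a rough sanity check on what that quantization does to the size, here's a back-of-envelope calculation. The ~4.85 bits-per-weight figure is an approximation often quoted for llama.cpp's mixed Q4_K_M scheme, not an official number:&lt;/p&gt;

```python
# Back-of-envelope size estimate for a Q4_K_M quantized 70B model.
# 4.85 bits/weight is an approximate average for llama.cpp's Q4_K_M mix
# (it combines 4-bit and 6-bit blocks plus per-block scale metadata).
params = 70e9
bits_per_weight = 4.85

size_gb = params * bits_per_weight / 8 / 1e9
print(f"~{size_gb:.0f} GB")  # prints ~42 GB, close to the actual Ollama download
```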

&lt;h4 id="honorable-mentions"&gt;Honorable mentions&lt;/h4&gt;
&lt;p&gt;Llama 3.3 is currently the most impressive model I've managed to run on my own hardware, but I've had several other positive experiences recently.&lt;/p&gt;
&lt;p&gt;Last month &lt;a href="https://simonwillison.net/2024/Nov/12/qwen25-coder/"&gt;I wrote about Qwen2.5-Coder-32B&lt;/a&gt;, an Apache 2.0 licensed model from Alibaba's Qwen research team that also gave me impressive results with code.&lt;/p&gt;
&lt;p&gt;A couple of weeks ago &lt;a href="https://simonwillison.net/2024/Nov/27/qwq/"&gt;I tried another Qwen model, QwQ&lt;/a&gt;, which implements a similar chain-of-thought pattern to OpenAI's o1 series but again runs comfortably on my own device.&lt;/p&gt;
&lt;p&gt;Meta's Llama 3.2 family of models is interesting as well: tiny 1B and 3B models (those should run even on a Raspberry Pi) that are way more capable than I would have expected - plus Meta's first multi-modal vision models at 11B and 90B sizes. &lt;a href="https://simonwillison.net/2024/Sep/25/llama-32/"&gt;I wrote about those in September&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="is-performance-about-to-plateau-"&gt;Is performance about to plateau?&lt;/h4&gt;
&lt;p&gt;I've been mostly unconvinced by the ongoing discourse around LLMs hitting a plateau. The areas I'm personally most excited about are multi-modality (images, audio and video as input) and model efficiency. Both of those have had enormous leaps forward in the past year.&lt;/p&gt;
&lt;p&gt;I don't particularly care about "AGI". I want models that can do useful things that I tell them to, quickly and inexpensively - and that's exactly what I've been getting more of over the past twelve months.&lt;/p&gt;
&lt;p&gt;Even if progress on these tools entirely stopped right now, the amount I could get done with just the models I've downloaded and stashed on a USB drive would keep me busy and productive for years.&lt;/p&gt;
&lt;h4 id="bonus-running-llama-3-3-70b-with-mlx"&gt;Bonus: running Llama 3.3 70B with MLX&lt;/h4&gt;
&lt;p&gt;I focused on Ollama in this article because it's the easiest option, but I also managed to run a version of Llama 3.3 using Apple's excellent &lt;a href="https://github.com/ml-explore/mlx"&gt;MLX library&lt;/a&gt;, which just celebrated its first birthday.&lt;/p&gt;
&lt;p&gt;Here's how I ran the model with MLX, using &lt;a href="https://github.com/astral-sh/uv"&gt;uv&lt;/a&gt; to fire up a temporary virtual environment:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;uv run --with mlx-lm --python 3.12 python&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This gave me a Python interpreter with &lt;a href="https://pypi.org/project/mlx-lm/#description"&gt;mlx-lm&lt;/a&gt; available. Then I ran this:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;mlx_lm&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;load&lt;/span&gt;, &lt;span class="pl-s1"&gt;generate&lt;/span&gt;
&lt;span class="pl-s1"&gt;model&lt;/span&gt;, &lt;span class="pl-s1"&gt;tokenizer&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;load&lt;/span&gt;(&lt;span class="pl-s"&gt;"mlx-community/Llama-3.3-70B-Instruct-4bit"&lt;/span&gt;)&lt;/pre&gt;
&lt;p&gt;This downloaded 37GB from &lt;a href="https://huggingface.co/mlx-community/Llama-3.3-70B-Instruct-4bit"&gt;mlx-community/Llama-3.3-70B-Instruct-4bit&lt;/a&gt; to &lt;code&gt;~/.cache/huggingface/hub/models--mlx-community--Llama-3.3-70B-Instruct-4bit&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Then:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-s1"&gt;prompt&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"Generate an SVG of a pelican riding a bicycle, start with &amp;lt;svg&amp;gt;"&lt;/span&gt;
&lt;span class="pl-s1"&gt;messages&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; [{&lt;span class="pl-s"&gt;"role"&lt;/span&gt;: &lt;span class="pl-s"&gt;"user"&lt;/span&gt;, &lt;span class="pl-s"&gt;"content"&lt;/span&gt;: &lt;span class="pl-s1"&gt;prompt&lt;/span&gt;}]
&lt;span class="pl-s1"&gt;prompt2&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;tokenizer&lt;/span&gt;.&lt;span class="pl-c1"&gt;apply_chat_template&lt;/span&gt;(
    &lt;span class="pl-s1"&gt;messages&lt;/span&gt;, &lt;span class="pl-s1"&gt;tokenize&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;False&lt;/span&gt;, &lt;span class="pl-s1"&gt;add_generation_prompt&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;
)
&lt;span class="pl-s1"&gt;response&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;generate&lt;/span&gt;(
    &lt;span class="pl-s1"&gt;model&lt;/span&gt;, &lt;span class="pl-s1"&gt;tokenizer&lt;/span&gt;, &lt;span class="pl-s1"&gt;prompt&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s1"&gt;prompt2&lt;/span&gt;, &lt;span class="pl-s1"&gt;verbose&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;
)&lt;/pre&gt;
&lt;p&gt;Which gave me:&lt;/p&gt;
&lt;div class="highlight highlight-text-xml-svg"&gt;&lt;pre&gt;&amp;lt;&lt;span class="pl-ent"&gt;svg&lt;/span&gt; &lt;span class="pl-e"&gt;xmlns&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;http://www.w3.org/2000/svg&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;width&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;400&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;height&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;300&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;viewBox&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;0 0 400 300&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;
  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Sky &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  &amp;lt;&lt;span class="pl-ent"&gt;rect&lt;/span&gt; &lt;span class="pl-e"&gt;x&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;0&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;0&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;width&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;400&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;height&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;300&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;#87CEEB&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;rx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;0&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; /&amp;gt;
  
  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Ground &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  &amp;lt;&lt;span class="pl-ent"&gt;rect&lt;/span&gt; &lt;span class="pl-e"&gt;x&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;0&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;250&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;width&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;400&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;height&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;50&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;#964B00&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;rx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;0&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; /&amp;gt;
  
  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Pelican &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  &amp;lt;&lt;span class="pl-ent"&gt;g&lt;/span&gt; &lt;span class="pl-e"&gt;transform&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;translate(150, 150)&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;
    &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Body &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
    &amp;lt;&lt;span class="pl-ent"&gt;ellipse&lt;/span&gt; &lt;span class="pl-e"&gt;cx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;0&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;cy&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;0&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;rx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;30&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;ry&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;40&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;#FFFFFF&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;#000000&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke-width&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;2&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; /&amp;gt;
    
    &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Beak &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
    &amp;lt;&lt;span class="pl-ent"&gt;path&lt;/span&gt; &lt;span class="pl-e"&gt;d&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;M 0 -40 L 20 -60 L 40 -40 Z&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;#FFFF00&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;#000000&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke-width&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;2&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; /&amp;gt;
    
    &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Legs &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
    &amp;lt;&lt;span class="pl-ent"&gt;line&lt;/span&gt; &lt;span class="pl-e"&gt;x1&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;0&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y1&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;40&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;x2&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;-20&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y2&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;80&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;#000000&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke-width&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;5&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; /&amp;gt;
    &amp;lt;&lt;span class="pl-ent"&gt;line&lt;/span&gt; &lt;span class="pl-e"&gt;x1&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;0&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y1&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;40&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;x2&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;20&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y2&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;80&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;#000000&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke-width&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;5&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; /&amp;gt;
    
    &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Eyes &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
    &amp;lt;&lt;span class="pl-ent"&gt;circle&lt;/span&gt; &lt;span class="pl-e"&gt;cx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;-15&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;cy&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;-20&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;r&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;5&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;#000000&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; /&amp;gt;
    &amp;lt;&lt;span class="pl-ent"&gt;circle&lt;/span&gt; &lt;span class="pl-e"&gt;cx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;15&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;cy&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;-20&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;r&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;5&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;#000000&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; /&amp;gt;
  &amp;lt;/&lt;span class="pl-ent"&gt;g&lt;/span&gt;&amp;gt;
  
  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Bicycle &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  &amp;lt;&lt;span class="pl-ent"&gt;g&lt;/span&gt; &lt;span class="pl-e"&gt;transform&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;translate(200, 220) scale(0.7, 0.7)&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;
    &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Frame &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
    &amp;lt;&lt;span class="pl-ent"&gt;rect&lt;/span&gt; &lt;span class="pl-e"&gt;x&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;-50&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;-20&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;width&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;100&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;height&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;40&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;#808080&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;rx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;10&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; /&amp;gt;
    
    &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Wheels &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
    &amp;lt;&lt;span class="pl-ent"&gt;circle&lt;/span&gt; &lt;span class="pl-e"&gt;cx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;-35&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;cy&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;20&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;r&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;30&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;#808080&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; /&amp;gt;
    &amp;lt;&lt;span class="pl-ent"&gt;circle&lt;/span&gt; &lt;span class="pl-e"&gt;cx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;35&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;cy&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;20&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;r&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;30&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;#808080&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; /&amp;gt;
    
    &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Pedals &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
    &amp;lt;&lt;span class="pl-ent"&gt;rect&lt;/span&gt; &lt;span class="pl-e"&gt;x&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;-45&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;40&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;width&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;10&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;height&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;20&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;#808080&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;rx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;5&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; /&amp;gt;
    &amp;lt;&lt;span class="pl-ent"&gt;rect&lt;/span&gt; &lt;span class="pl-e"&gt;x&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;35&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;40&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;width&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;10&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;height&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;20&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;#808080&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;rx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;5&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; /&amp;gt;
    
    &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Seat &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
    &amp;lt;&lt;span class="pl-ent"&gt;rect&lt;/span&gt; &lt;span class="pl-e"&gt;x&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;-20&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;-40&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;width&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;40&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;height&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;20&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;#808080&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;rx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;5&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; /&amp;gt;
  &amp;lt;/&lt;span class="pl-ent"&gt;g&lt;/span&gt;&amp;gt;
  
  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Pelican on Bicycle &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  &amp;lt;&lt;span class="pl-ent"&gt;g&lt;/span&gt; &lt;span class="pl-e"&gt;transform&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;translate(200, 150)&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;
    &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Pelican's hands on handlebars &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
    &amp;lt;&lt;span class="pl-ent"&gt;line&lt;/span&gt; &lt;span class="pl-e"&gt;x1&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;-20&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y1&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;0&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;x2&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;-40&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y2&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;-20&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;#000000&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke-width&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;5&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; /&amp;gt;
    &amp;lt;&lt;span class="pl-ent"&gt;line&lt;/span&gt; &lt;span class="pl-e"&gt;x1&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;20&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y1&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;0&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;x2&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;40&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y2&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;-20&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;#000000&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke-width&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;5&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; /&amp;gt;
    
    &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Pelican's feet on pedals &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
    &amp;lt;&lt;span class="pl-ent"&gt;line&lt;/span&gt; &lt;span class="pl-e"&gt;x1&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;0&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y1&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;40&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;x2&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;-20&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y2&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;60&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;#000000&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke-width&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;5&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; /&amp;gt;
    &amp;lt;&lt;span class="pl-ent"&gt;line&lt;/span&gt; &lt;span class="pl-e"&gt;x1&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;0&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y1&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;40&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;x2&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;20&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y2&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;60&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;#000000&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke-width&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;5&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; /&amp;gt;
  &amp;lt;/&lt;span class="pl-ent"&gt;g&lt;/span&gt;&amp;gt;
&amp;lt;/&lt;span class="pl-ent"&gt;svg&lt;/span&gt;&amp;gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Followed by:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Prompt: 52 tokens, 49.196 tokens-per-sec
Generation: 723 tokens, 8.733 tokens-per-sec
Peak memory: 40.042 GB
&lt;/code&gt;&lt;/pre&gt;
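&lt;p&gt;Those numbers translate to around a second of prompt processing and well over a minute of generation. A quick calculation from the stats above:&lt;/p&gt;

```python
# Approximate wall-clock times implied by the mlx-lm stats above.
prompt_tokens, prompt_tps = 52, 49.196
gen_tokens, gen_tps = 723, 8.733

prompt_seconds = prompt_tokens / prompt_tps    # about 1 second
generation_seconds = gen_tokens / gen_tps      # about 83 seconds
print(f"{prompt_seconds:.1f}s prompt processing, {generation_seconds:.1f}s generating")
```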
&lt;p&gt;Here's what that looks like:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/pelican-bicycle-llama.svg" alt="It's not great. Blue sky, brown dirt, then a weird oval with two eyes and an offset triangular yellow hat, next to a bundle of grey circles that is probably meant to be a bicycle" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Honestly, &lt;a href="https://github.com/simonw/pelican-bicycle?tab=readme-ov-file#pelicans-on-a-bicycle"&gt;I've seen worse&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-4"&gt;gpt-4&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/meta"&gt;meta&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="python"/><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="gpt-4"/><category term="local-llms"/><category term="llms"/><category term="ai-assisted-programming"/><category term="llm"/><category term="meta"/><category term="uv"/><category term="mlx"/><category term="ollama"/><category term="pelican-riding-a-bicycle"/></entry><entry><title>Meta AI release Llama 3.3</title><link href="https://simonwillison.net/2024/Dec/6/llama-33/#atom-tag" rel="alternate"/><published>2024-12-06T18:30:55+00:00</published><updated>2024-12-06T18:30:55+00:00</updated><id>https://simonwillison.net/2024/Dec/6/llama-33/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://twitter.com/AIatMeta/status/1865079067390956006"&gt;Meta AI release Llama 3.3&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
This new &lt;a href="https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct"&gt;Llama-3.3-70B-Instruct model&lt;/a&gt; from Meta AI makes some bold claims:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This model delivers similar performance to Llama 3.1 405B with cost effective inference that’s feasible to run locally on common developer workstations.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I have 64GB of RAM in my M2 MacBook Pro, so I'm looking forward to trying a slightly quantized GGUF of this model to see if I can run it while still leaving some memory free for other applications.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: Ollama have &lt;a href="https://ollama.com/library/llama3.3"&gt;a 43GB GGUF&lt;/a&gt; available now. And here's an &lt;a href="https://huggingface.co/mlx-community/Llama-3.3-70B-Instruct-8bit"&gt;MLX 8bit version&lt;/a&gt; and &lt;a href="https://huggingface.co/collections/mlx-community/llama-33-67538fce5763675dcb8c4463"&gt;other MLX quantizations&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Llama 3.3 has 70B parameters, a 128,000 token context length and was trained to support English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/meta-llama/llama-models/blob/main/models/llama3_3/MODEL_CARD.md"&gt;model card&lt;/a&gt; says that the training data was "A new mix of publicly available online data" - 15 trillion tokens with a December 2023 cut-off.&lt;/p&gt;
&lt;p&gt;They used "39.3M GPU hours of computation on H100-80GB (TDP of 700W) type hardware" which they calculate as 11,390 tons CO2eq. I believe that's equivalent to around 20 fully loaded passenger flights from New York to London (at &lt;a href="https://travelnav.com/emissions-from-jfk-to-lhr"&gt;~550 tons per flight&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update 19th January 2025&lt;/strong&gt;: On further consideration I no longer trust my estimate here: it's surprisingly hard to track down reliable numbers, but I think the total CO2 per flight may be more on the order of 200-400 tons, so my estimate for Llama 3.3 70B should have been more on the order of between 28 and 56 flights. Don't trust those numbers either though!&lt;/p&gt;
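&lt;p&gt;For what it's worth, the arithmetic behind both versions of the estimate is simple - the per-flight emissions figure is the uncertain input:&lt;/p&gt;

```python
# Training emissions for Llama 3.3 70B, from Meta's model card.
total_tons_co2eq = 11_390

# Original estimate: ~550 tons CO2 per fully loaded JFK-LHR flight.
flights_original = total_tons_co2eq / 550    # roughly 21 flights

# Revised estimate: ~200-400 tons per flight seems more plausible.
flights_low = total_tons_co2eq / 400         # roughly 28 flights
flights_high = total_tons_co2eq / 200        # roughly 57 flights
print(f"{flights_original:.0f} flights originally; revised: {flights_low:.0f} to {flights_high:.0f}")
```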


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/training-data"&gt;training-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/meta"&gt;meta&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="local-llms"/><category term="llms"/><category term="training-data"/><category term="meta"/><category term="mlx"/><category term="ollama"/><category term="llm-release"/></entry><entry><title>Ollama: Llama 3.2 Vision</title><link href="https://simonwillison.net/2024/Nov/13/ollama-llama-vision/#atom-tag" rel="alternate"/><published>2024-11-13T01:55:31+00:00</published><updated>2024-11-13T01:55:31+00:00</updated><id>https://simonwillison.net/2024/Nov/13/ollama-llama-vision/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://ollama.com/blog/llama3.2-vision"&gt;Ollama: Llama 3.2 Vision&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Ollama released version 0.4 &lt;a href="https://github.com/ollama/ollama/releases/tag/v0.4.0"&gt;last week&lt;/a&gt; with support for Meta's first Llama vision model, &lt;a href="https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/"&gt;Llama 3.2&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If you have Ollama installed you can fetch the 11B model (7.9 GB) like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama pull llama3.2-vision
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or the larger 90B model (55GB download, likely needs ~88GB of RAM) like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama pull llama3.2-vision:90b
&lt;/code&gt;&lt;/pre&gt;
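&lt;p&gt;As a rough rule of thumb - my own back-of-envelope assumption, not anything from the Ollama docs - a quantized model's file size is approximately parameters times bits-per-weight divided by 8:&lt;/p&gt;

```python
def quantized_size_gb(params_billions: float, bits_per_weight: float) -> float:
    # bytes = parameters * bits / 8, reported in GB (1e9 bytes)
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# ~4.9 bits/weight would account for the 55GB download of the 90B model
print(round(quantized_size_gb(90, 4.9)))  # 55
```

The extra headroom beyond the file size - hence the ~88GB RAM estimate - goes to activations and the context window.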
&lt;p&gt;I was delighted to learn that Sukhbinder Singh had &lt;a href="https://github.com/taketwo/llm-ollama/pull/15"&gt;already contributed&lt;/a&gt; support for &lt;a href="https://simonwillison.net/2024/Oct/29/llm-multi-modal/"&gt;LLM attachments&lt;/a&gt; to Sergey Alexandrov's &lt;a href="https://github.com/taketwo/llm-ollama"&gt;llm-ollama&lt;/a&gt; plugin, which means the following works once you've pulled the models:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install --upgrade llm-ollama
llm -m llama3.2-vision:latest 'describe' \
  -a https://static.simonwillison.net/static/2024/pelican.jpg
&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote&gt;
&lt;p&gt;This image features a brown pelican standing on rocks, facing the camera and positioned to the left of center. The bird's long beak is a light brown color with a darker tip, while its white neck is adorned with gray feathers that continue down to its body. Its legs are also gray.&lt;/p&gt;
&lt;p&gt;In the background, out-of-focus boats and water are visible, providing context for the pelican's environment.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;img alt="See above description - it's a pelican photo" src="https://static.simonwillison.net/static/2024/pelican.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;That's not a bad description of this image, especially for a 7.9GB model that runs happily on my MacBook Pro.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/meta"&gt;meta&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="vision-llms"/><category term="meta"/><category term="ollama"/><category term="llm-release"/></entry><entry><title>Nous Hermes 3</title><link href="https://simonwillison.net/2024/Nov/4/nous-hermes-3/#atom-tag" rel="alternate"/><published>2024-11-04T18:20:16+00:00</published><updated>2024-11-04T18:20:16+00:00</updated><id>https://simonwillison.net/2024/Nov/4/nous-hermes-3/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://nousresearch.com/hermes3/"&gt;Nous Hermes 3&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Nous Hermes family of fine-tuned models has a solid reputation. Their most recent release came out in August, based on Meta's Llama 3.1:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Our training data aggressively encourages the model to follow the system and instruction prompts exactly and in an adaptive manner. Hermes 3 was created by fine-tuning Llama 3.1 8B, 70B and 405B, and training on a dataset of primarily synthetically generated responses. The model boasts comparable and superior performance to Llama 3.1 while unlocking deeper capabilities in reasoning and creativity.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The model weights are on Hugging Face, including GGUF versions of the &lt;a href="https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-70B-GGUF"&gt;70B&lt;/a&gt; and &lt;a href="https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-8B-GGUF"&gt;8B&lt;/a&gt; models. Here's how to try the 8B model (a 4.58GB download) using the &lt;a href="https://github.com/simonw/llm-gguf"&gt;llm-gguf plugin&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-gguf
llm gguf download-model 'https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-8B-GGUF/resolve/main/Hermes-3-Llama-3.1-8B.Q4_K_M.gguf' -a Hermes-3-Llama-3.1-8B
llm -m Hermes-3-Llama-3.1-8B 'hello in spanish'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Nous Research &lt;a href="https://lambdalabs.com/blog/unveiling-hermes-3-the-first-fine-tuned-llama-3.1-405b-model-is-on-lambdas-cloud"&gt;partnered with Lambda Labs&lt;/a&gt; to provide inference APIs. It turns out Lambda hosts &lt;a href="https://docs.lambdalabs.com/public-cloud/lambda-chat-api/"&gt;quite a few models&lt;/a&gt; now, currently providing free inference to users with &lt;a href="https://cloud.lambdalabs.com/api-keys"&gt;an API key&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I just released the first alpha of a &lt;a href="https://github.com/simonw/llm-lambda-labs"&gt;llm-lambda-labs&lt;/a&gt; plugin. You can use that to try the larger 405b model (very hard to run on a consumer device) like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-lambda-labs
llm keys set lambdalabs
# Paste key here
llm -m lambdalabs/hermes3-405b 'short poem about a pelican with a twist'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's &lt;a href="https://github.com/simonw/llm-lambda-labs/blob/0.1a0/llm_lambda_labs.py"&gt;the source code&lt;/a&gt; for the new plugin, which I based on &lt;a href="https://github.com/simonw/llm-mistral"&gt;llm-mistral&lt;/a&gt;. The plugin uses &lt;a href="https://pypi.org/project/httpx-sse/"&gt;httpx-sse&lt;/a&gt; to consume the stream of tokens from the API.&lt;/p&gt;
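&lt;p&gt;Under the hood, server-sent events are just a line-based text protocol. Here's a minimal stdlib-only sketch of the parsing that httpx-sse takes care of - the sample stream is made up for illustration, and this handles only the "data" field of the full protocol:&lt;/p&gt;

```python
def parse_sse(lines):
    """Yield the data payload of each server-sent event.

    An event's fields arrive as "field: value" lines; a blank
    line marks the end of the event. Consecutive data lines
    within one event are joined with newlines.
    """
    data = []
    for line in lines:
        if line.startswith("data:"):
            data.append(line[5:].lstrip())
        elif line == "" and data:
            yield "\n".join(data)
            data = []

# A made-up example stream of three streamed tokens:
stream = ["data: Hello", "", "data: world", "", "data: !", ""]
print(list(parse_sse(stream)))  # ['Hello', 'world', '!']
```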


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/meta"&gt;meta&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nous-research"&gt;nous-research&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;



</summary><category term="projects"/><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="meta"/><category term="nous-research"/><category term="llm-release"/></entry><entry><title>Cerebras Coder</title><link href="https://simonwillison.net/2024/Oct/31/cerebras-coder/#atom-tag" rel="alternate"/><published>2024-10-31T22:39:15+00:00</published><updated>2024-10-31T22:39:15+00:00</updated><id>https://simonwillison.net/2024/Oct/31/cerebras-coder/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.val.town/v/stevekrouse/cerebras_coder"&gt;Cerebras Coder&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Val Town founder Steve Krouse has been building demos on top of the Cerebras API that runs Llama3.1-70b at 2,000 tokens/second.&lt;/p&gt;
&lt;p&gt;Having a capable LLM with that kind of performance turns out to be really interesting. Cerebras Coder is a demo that implements Claude Artifacts-style on-demand JavaScript apps, and at that speed the changes you request are visible in less than a second:&lt;/p&gt;
&lt;div style="max-width: 100%;"&gt;
    &lt;video 
        controls 
        preload="none"
        poster="https://static.simonwillison.net/static/2024/cascade-emoji.jpeg"
        style="width: 100%; height: auto;"&gt;
        &lt;source src="https://static.simonwillison.net/static/2024/cascade-emoji.mp4" type="video/mp4"&gt;
    &lt;/video&gt;
&lt;/div&gt;

&lt;p&gt;Steve's implementation (created with the help of &lt;a href="https://www.val.town/townie"&gt;Townie&lt;/a&gt;, the Val Town code assistant) demonstrates the simplest possible version of an iframe sandbox:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;iframe
    srcDoc={code}
    sandbox="allow-scripts allow-modals allow-forms allow-popups allow-same-origin allow-top-navigation allow-downloads allow-presentation allow-pointer-lock"
/&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Where &lt;code&gt;code&lt;/code&gt; is populated by a &lt;code&gt;setCode(...)&lt;/code&gt; call inside a React component.&lt;/p&gt;
&lt;p&gt;The most interesting applications of LLMs continue to be where they operate in a tight loop with a human - this can make those review loops potentially much faster and more productive.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/stevekrouse/status/1851995718514327848"&gt;@stevekrouse&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/iframes"&gt;iframes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sandboxing"&gt;sandboxing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/react"&gt;react&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/val-town"&gt;val-town&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/steve-krouse"&gt;steve-krouse&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cerebras"&gt;cerebras&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-performance"&gt;llm-performance&lt;/a&gt;&lt;/p&gt;



</summary><category term="iframes"/><category term="sandboxing"/><category term="ai"/><category term="react"/><category term="generative-ai"/><category term="llama"/><category term="llms"/><category term="ai-assisted-programming"/><category term="val-town"/><category term="steve-krouse"/><category term="cerebras"/><category term="llm-performance"/></entry><entry><title>Pelicans on a bicycle</title><link href="https://simonwillison.net/2024/Oct/25/pelicans-on-a-bicycle/#atom-tag" rel="alternate"/><published>2024-10-25T23:56:50+00:00</published><updated>2024-10-25T23:56:50+00:00</updated><id>https://simonwillison.net/2024/Oct/25/pelicans-on-a-bicycle/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/pelican-bicycle/blob/main/README.md"&gt;Pelicans on a bicycle&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I decided to roll my own LLM benchmark: how well can different models render an SVG of a pelican riding a bicycle?&lt;/p&gt;
&lt;p&gt;I chose that because a) I like pelicans and b) I'm pretty sure there aren't any pelican-on-a-bicycle SVG files floating around (yet) that might have already been sucked into the training data.&lt;/p&gt;
&lt;p&gt;My prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Generate an SVG of a pelican riding a bicycle&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I've run it through 16 models so far - from OpenAI, Anthropic, Google Gemini and Meta (Llama running on Cerebras), all using my &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; CLI utility. Here's my (&lt;a href="https://gist.github.com/simonw/32273a445da3318df690749701805863"&gt;Claude-assisted&lt;/a&gt;) Bash script: &lt;a href="https://github.com/simonw/pelican-bicycle/blob/b25faf3e29dcf73c97278dfdd7b7b973462eb0cb/generate-svgs.sh"&gt;generate-svgs.sh&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Here's Claude 3.5 Sonnet (2024-06-20) and Claude 3.5 Sonnet (2024-10-22):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/pelican-bicycles/claude-3-5-sonnet-20240620.svg" style="width: 45%" alt="Two circles joined by a triangle. The pelican is a grey oval for the body, a circle for the head and has a peak that looks like a yellow banana smile. A wing is hinted at with an upside down curved line. Two legs dangle from the bottom of the bird."&gt; &lt;img src="https://static.simonwillison.net/static/2024/pelican-bicycles/claude-3-5-sonnet-20241022.svg" style="width: 45%"&gt;&lt;/p&gt;
&lt;p&gt;Gemini 1.5 Flash 001 and Gemini 1.5 Flash 002:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/pelican-bicycles/gemini-1.5-flash-001.svg" style="width: 45%"&gt; &lt;img src="https://static.simonwillison.net/static/2024/pelican-bicycles/gemini-1.5-flash-002.svg" style="width: 45%"&gt;&lt;/p&gt;
&lt;p&gt;GPT-4o mini and GPT-4o:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/pelican-bicycles/gpt-4o-mini.svg" style="width: 45%"&gt; &lt;img src="https://static.simonwillison.net/static/2024/pelican-bicycles/gpt-4o.svg" style="width: 45%"&gt;&lt;/p&gt;
&lt;p&gt;o1-mini and o1-preview:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/pelican-bicycles/o1-mini.svg" style="width: 45%"&gt; &lt;img src="https://static.simonwillison.net/static/2024/pelican-bicycles/o1-preview.svg" style="width: 45%"&gt;&lt;/p&gt;
&lt;p&gt;Cerebras Llama 3.1 70B and Llama 3.1 8B:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/pelican-bicycles/cerebras-llama3.1-70b.svg" style="width: 45%"&gt; &lt;img src="https://static.simonwillison.net/static/2024/pelican-bicycles/cerebras-llama3.1-8b.svg" style="width: 45%"&gt;&lt;/p&gt;
&lt;p&gt;And a special mention for Gemini 1.5 Flash 8B:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/pelican-bicycles/gemini-1.5-flash-8b-001.svg" style="width: 45%"&gt;&lt;/p&gt;
&lt;p&gt;The rest of them are &lt;a href="https://github.com/simonw/pelican-bicycle/blob/main/README.md"&gt;linked from the README&lt;/a&gt;.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/svg"&gt;svg&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cerebras"&gt;cerebras&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;&lt;/p&gt;



</summary><category term="svg"/><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llama"/><category term="llms"/><category term="llm"/><category term="anthropic"/><category term="gemini"/><category term="cerebras"/><category term="pelican-riding-a-bicycle"/></entry><entry><title>Running Llama 3.2 Vision and Phi-3.5 Vision on a Mac with mistral.rs</title><link href="https://simonwillison.net/2024/Oct/19/mistralrs/#atom-tag" rel="alternate"/><published>2024-10-19T16:14:20+00:00</published><updated>2024-10-19T16:14:20+00:00</updated><id>https://simonwillison.net/2024/Oct/19/mistralrs/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;a href="https://github.com/EricLBuehler/mistral.rs"&gt;mistral.rs&lt;/a&gt; is an LLM inference library written in Rust by Eric Buehler. Today I figured out how to use it to run the Llama 3.2 Vision and Phi-3.5 Vision models on my Mac.&lt;/p&gt;
&lt;p&gt;Despite the name, it's not just for the Mistral family of models - like how &lt;a href="https://github.com/ggerganov/llama.cpp"&gt;&lt;code&gt;llama.cpp&lt;/code&gt;&lt;/a&gt; has grown beyond Llama, &lt;code&gt;mistral.rs&lt;/code&gt; has grown beyond Mistral.&lt;/p&gt;
&lt;p&gt;I already have a Rust installation, so I checked out and compiled the library like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git clone https://github.com/EricLBuehler/mistral.rs.git
cd mistral.rs
cargo run --features metal --release -- -i --isq Q4K vision-plain -m lamm-mit/Cephalo-Llama-3.2-11B-Vision-Instruct-128k -a vllama
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This downloaded around 20GB of model files to &lt;code&gt;~/.cache/huggingface/hub/models--lamm-mit--Cephalo-Llama-3.2-11B-Vision-Instruct-128k&lt;/code&gt;, loaded the model into memory and started a terminal-based chat session.&lt;/p&gt;
&lt;p&gt;I ran this prompt:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;\image https://niche-museums.imgix.net/pioneer-history.jpeg?w=1600&amp;amp;h=800&amp;amp;fit=crop&amp;amp;auto=compress &amp;lt;|image|&amp;gt; describe this image including any text&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/mistral-rs-terminal.jpg" alt="2024-10-19T15:38:18.142895Z  INFO mistralrs_server::interactive_mode: Starting interactive loop with sampling params: SamplingParams { temperature: Some(0.1), top_k: Some(32), top_p: Some(0.1), min_p: Some(0.05), top_n_logprobs: 0, frequency_penalty: Some(0.1), presence_penalty: Some(0.1), stop_toks: None, max_len: Some(4096), logits_bias: None, n_choices: 1, dry_params: Some(DrySamplingParams { sequence_breakers: [&amp;quot;\n&amp;quot;, &amp;quot;:&amp;quot;, &amp;quot;\&amp;quot;&amp;quot;, &amp;quot;*&amp;quot;], multiplier: 0.0, base: 1.75, allowed_length: 2 }) }
====================
Welcome to interactive mode! Because this model is a vision model, you can enter prompts and chat with the model.

To specify a message with an image, use the `\image` command detailed below.

Commands:
- `\help`: Display this message.
- `\exit`: Quit interactive mode.
- `\system &amp;lt;system message here&amp;gt;`:
    Add a system message to the chat without running the model.
    Ex: `\system Always respond as a pirate.`
- `\image &amp;lt;image URL or local path here&amp;gt; &amp;lt;message here&amp;gt;`: 
    Add a message paired with an image. You are responsible for prefixing the message with anything the model
    requires.
    Ex: `\image path/to/image.jpg Describe what is in this image.`
====================
&amp;gt; 
&amp;gt; 
&amp;gt; \image https://niche-museums.imgix.net/pioneer-history.jpeg?w=1600&amp;amp;h=800&amp;amp;fit=crop&amp;amp;auto=compress &amp;lt;|image|&amp;gt; describe this image including any text
The image shows the exterior of the Daughters of Utah Pioneers (DUP) Museum, which is also known as the Pioneer Memorial Museum. The building is located in Salt Lake City, Utah, USA. 

Visible in the image are:

1. **The DUP Museum Building**: The main structure of the museum is a prominent feature in the image. It is a two-story building with a rectangular shape and a flat roof. The exterior walls are made of brick or stone, giving it a classic and historic appearance.
" /&gt;&lt;/p&gt;
&lt;p&gt;The prompt uses this image from my &lt;a href="https://www.niche-museums.com/111"&gt;Niche Museums write-up of the Pioneer Memorial Museum&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Detailed image description follows" src="https://niche-museums.imgix.net/pioneer-history.jpeg?w=1600&amp;amp;h=800&amp;amp;fit=crop&amp;amp;auto=compress" /&gt;&lt;/p&gt;
&lt;p&gt;I got back this response:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The image shows the exterior of the Daughters of Utah Pioneers (DUP) Museum, which is also known as the Pioneer Memorial Museum. The building is located in Salt Lake City, Utah, USA. &lt;/p&gt;
&lt;p&gt;Visible in the image are:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;The DUP Museum Building&lt;/strong&gt;: The main structure of the museum is a prominent feature in the image. It is a two-story building with a rectangular shape and a flat roof. The exterior walls are made of brick or stone, giving it a classic and historic appearance.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Signage&lt;/strong&gt;: Above the entrance, there is a sign that reads "Daughters of Utah Pioneers Museum" in bold letters. Below this sign, another sign indicates that the building is also known as the "Pioneer Memorial Museum."&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Entrance&lt;/strong&gt;: The main entrance to the museum is marked by a doorway with a decorative frame and a pair of large wooden doors.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Landscaping&lt;/strong&gt;: The surrounding area around the museum appears well-maintained with neatly trimmed grass and some trees nearby.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Statue or Monument&lt;/strong&gt;: In front of the museum, there is a small statue or monument that appears to be dedicated to pioneers or early settlers.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The Daughters of Utah Pioneers (DUP) Museum showcases historical artifacts and exhibits related to Utah's pioneer history, focusing on the experiences and contributions of early settlers in the region.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Note that the text it generates correctly describes the museum's contents and identifies the museum as being located in Salt Lake City, neither of which is a detail shown in the image itself.&lt;/p&gt;
&lt;p&gt;Activity Monitor confirmed that this Metal build was using my GPU, and about 10GB of memory, while the prompt was running.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;mistral.rs&lt;/code&gt; also provides &lt;a href="https://github.com/EricLBuehler/mistral.rs/blob/master/mistralrs-pyo3/README.md"&gt;a Python binding&lt;/a&gt;, which can be installed from PyPI but requires a Rust toolchain (I filed an issue &lt;a href="https://github.com/EricLBuehler/mistral.rs/issues/864"&gt;requesting PyPI wheels&lt;/a&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pip install mistralrs-metal
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I got an error running the &lt;a href="https://github.com/EricLBuehler/mistral.rs/blob/v0.3.1/examples/python/llama_vision.py"&gt;Python Llama vision demo&lt;/a&gt; but the &lt;a href="https://github.com/EricLBuehler/mistral.rs/blob/v0.3.1/examples/python/phi3v.py"&gt;Phi 3.5 Vision example&lt;/a&gt; worked as advertised. I modified that to enable streaming output and ran it against a photograph from &lt;a href="https://www.niche-museums.com/107"&gt;Mendenhall's Museum of Gasoline Pumps &amp;amp; Petroliana&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;mistralrs&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-v"&gt;Runner&lt;/span&gt;, &lt;span class="pl-v"&gt;Which&lt;/span&gt;, &lt;span class="pl-v"&gt;ChatCompletionRequest&lt;/span&gt;, &lt;span class="pl-v"&gt;VisionArchitecture&lt;/span&gt;

&lt;span class="pl-s1"&gt;res&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;runner&lt;/span&gt;.&lt;span class="pl-en"&gt;send_chat_completion_request&lt;/span&gt;(
    &lt;span class="pl-v"&gt;ChatCompletionRequest&lt;/span&gt;(
        &lt;span class="pl-s1"&gt;model&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"phi3v"&lt;/span&gt;,
        &lt;span class="pl-s1"&gt;messages&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;[
            {
                &lt;span class="pl-s"&gt;"role"&lt;/span&gt;: &lt;span class="pl-s"&gt;"user"&lt;/span&gt;,
                &lt;span class="pl-s"&gt;"content"&lt;/span&gt;: [
                    {
                        &lt;span class="pl-s"&gt;"type"&lt;/span&gt;: &lt;span class="pl-s"&gt;"image_url"&lt;/span&gt;,
                        &lt;span class="pl-s"&gt;"image_url"&lt;/span&gt;: {
                            &lt;span class="pl-s"&gt;"url"&lt;/span&gt;: &lt;span class="pl-s"&gt;"https://niche-museums.imgix.net/mendenhalls-16.jpeg?w=1200&amp;amp;auto=compress"&lt;/span&gt;
                        },
                    },
                    {
                        &lt;span class="pl-s"&gt;"type"&lt;/span&gt;: &lt;span class="pl-s"&gt;"text"&lt;/span&gt;,
                        &lt;span class="pl-s"&gt;"text"&lt;/span&gt;: &lt;span class="pl-s"&gt;"&amp;lt;|image_1|&amp;gt;&lt;span class="pl-cce"&gt;\n&lt;/span&gt;What is shown in this image? Write a detailed response analyzing the scene."&lt;/span&gt;,
                    },
                ],
            }
        ],
        &lt;span class="pl-s1"&gt;max_tokens&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;1024&lt;/span&gt;,
        &lt;span class="pl-s1"&gt;presence_penalty&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;1.0&lt;/span&gt;,
        &lt;span class="pl-s1"&gt;top_p&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;0.1&lt;/span&gt;,
        &lt;span class="pl-s1"&gt;temperature&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;0.1&lt;/span&gt;,
        &lt;span class="pl-s1"&gt;stream&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;,
    )
)
&lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;item&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;res&lt;/span&gt;:
    &lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;item&lt;/span&gt;.&lt;span class="pl-s1"&gt;choices&lt;/span&gt;[&lt;span class="pl-c1"&gt;0&lt;/span&gt;].&lt;span class="pl-s1"&gt;delta&lt;/span&gt;.&lt;span class="pl-s1"&gt;content&lt;/span&gt;, &lt;span class="pl-s1"&gt;end&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;''&lt;/span&gt;)&lt;/pre&gt;

&lt;p&gt;&lt;img alt="Description follows" src="https://niche-museums.imgix.net/mendenhalls-16.jpeg?w=1200&amp;amp;auto=compress" /&gt;&lt;/p&gt;
&lt;p&gt;Here's what it gave me:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The image captures a nostalgic scene of a vintage gas station, reminiscent of a bygone era. Dominating the left side of the frame is a red and white gas pump, standing tall and proud, its metallic surface gleaming under the soft light filtering through the wooden ceiling. Adjacent to it, a white and black gas canister stands ready for service, its presence adding to the authenticity of the setting.&lt;/p&gt;
&lt;p&gt;On the right, a red and white gasoline sign hangs from the ceiling, its bold colors contrasting with the muted tones of the surroundings. Above it, a yellow and white gasoline sign is suspended, its vibrant hues drawing the eye upwards.&lt;/p&gt;
&lt;p&gt;The floor beneath these relics of the past is a checkerboard pattern, a common design choice for gas stations of yesteryears. It provides a stark contrast to the smooth, unblemished surfaces of the gas pumps and canisters.&lt;/p&gt;
&lt;p&gt;In the background, a variety of other signs and advertisements add to the eclectic mix of objects. They are a testament to the diverse range of products and services that were once available at this location.&lt;/p&gt;
&lt;p&gt;Despite the passage of time, the gas station retains a certain charm, its vintage gas pumps and signs serving as a tangible link to a different era. The image is a snapshot of history, frozen in time, waiting to be discovered and appreciated by those who take the time to look closer.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This description looks fantastic at first glance, but if you review it carefully and compare it to the image you'll see that it's full of inaccuracies. The vibes of the description match the image but the actual details are definitely incorrect.&lt;/p&gt;

&lt;p&gt;This model downloaded 7.7GB to &lt;code&gt;~/.cache/huggingface/hub/models--microsoft--Phi-3.5-vision-instruct&lt;/code&gt; - significantly smaller than Llama 3.2's 20GB. I wonder if that size difference helps explain the greater hallucination rate in Phi-3.5 Vision.&lt;/p&gt;

&lt;p&gt;If you're running Python 3.10 on Apple Silicon you may be able to skip the Rust compiler by installing the wheel I built here:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pip install https://static.simonwillison.net/static/2024/mistralrs_metal-0.3.1-cp310-cp310-macosx_11_0_arm64.whl
&lt;/code&gt;&lt;/pre&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/microsoft"&gt;microsoft&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rust"&gt;rust&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mistral"&gt;mistral&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/phi"&gt;phi&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/meta"&gt;meta&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="microsoft"/><category term="python"/><category term="ai"/><category term="rust"/><category term="generative-ai"/><category term="llama"/><category term="local-llms"/><category term="llms"/><category term="mistral"/><category term="phi"/><category term="vision-llms"/><category term="meta"/></entry><entry><title>llama-3.2-webgpu</title><link href="https://simonwillison.net/2024/Sep/30/llama-32-webgpu/#atom-tag" rel="alternate"/><published>2024-09-30T16:27:22+00:00</published><updated>2024-09-30T16:27:22+00:00</updated><id>https://simonwillison.net/2024/Sep/30/llama-32-webgpu/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/spaces/webml-community/llama-3.2-webgpu"&gt;llama-3.2-webgpu&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Llama 3.2 1B is a really interesting model, given its 128,000 token context length and its tiny size (barely more than a GB).&lt;/p&gt;
&lt;p&gt;This page loads a &lt;a href="https://huggingface.co/onnx-community/Llama-3.2-1B-Instruct-q4f16/tree/main/onnx"&gt;1.24GB q4f16 ONNX build&lt;/a&gt; of the Llama-3.2-1B-Instruct model and runs it with a React-powered chat interface directly in the browser, using &lt;a href="https://huggingface.co/docs/transformers.js/en/index"&gt;Transformers.js&lt;/a&gt; and WebGPU. &lt;a href="https://github.com/huggingface/transformers.js-examples/tree/main/llama-3.2-webgpu"&gt;Source code for the demo is here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It worked for me just now in Chrome; in Firefox and Safari I got a “WebGPU is not supported by this browser” error message.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/xenovacom/status/1840767709317046460"&gt;@xenovacom&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/webassembly"&gt;webassembly&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/transformers-js"&gt;transformers-js&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/webgpu"&gt;webgpu&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="webassembly"/><category term="generative-ai"/><category term="llama"/><category term="llms"/><category term="transformers-js"/><category term="webgpu"/></entry><entry><title>Llama 3.2</title><link href="https://simonwillison.net/2024/Sep/25/llama-32/#atom-tag" rel="alternate"/><published>2024-09-25T20:28:04+00:00</published><updated>2024-09-25T20:28:04+00:00</updated><id>https://simonwillison.net/2024/Sep/25/llama-32/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/"&gt;Llama 3.2&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;In further evidence that AI labs are terrible at naming things, Llama 3.2 is a &lt;em&gt;huge&lt;/em&gt; upgrade to the Llama 3 series - they've released their first multi-modal vision models!&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Today, we’re releasing Llama 3.2, which includes small and medium-sized vision LLMs (11B and 90B), and lightweight, text-only models (1B and 3B) that fit onto edge and mobile devices, including pre-trained and instruction-tuned versions.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The 1B and 3B text-only models are exciting too, with a 128,000 token context length and optimized for edge devices (Qualcomm and MediaTek hardware get called out specifically).&lt;/p&gt;
&lt;p&gt;Meta partnered directly with Ollama to help with distribution, &lt;a href="https://ollama.com/blog/llama3.2"&gt;here's the Ollama blog post&lt;/a&gt;. They only support the two smaller text-only models at the moment - this command will get the 3B model (2GB):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama run llama3.2
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And for the 1B model (a 1.3GB download):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama run llama3.2:1b
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I had to first upgrade my Ollama by clicking on the icon in my macOS task tray and selecting "Restart to update". &lt;/p&gt;
&lt;p&gt;The two vision models are coming to Ollama "very soon".&lt;/p&gt;
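&lt;p&gt;Ollama also serves a local HTTP API, on port 11434 by default. As a quick sketch (the &lt;code&gt;/api/generate&lt;/code&gt; endpoint and port shown here are Ollama's documented defaults, but check the version you have installed), you can prompt the pulled model directly with curl:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:1b",
  "prompt": "Three facts about capybaras",
  "stream": false
}'
&lt;/code&gt;&lt;/pre&gt;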
&lt;p&gt;Once you have fetched the Ollama model you can access it from my &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; command-line tool like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pipx install llm
llm install llm-ollama
llm chat -m llama3.2:1b
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I tried running my &lt;a href="https://github.com/simonw/djp"&gt;djp codebase&lt;/a&gt; through that tiny 1B model just now and got &lt;a href="https://gist.github.com/simonw/64c5f5b111fe473999144932bef4218b"&gt;a surprisingly good result&lt;/a&gt; - by no means comprehensive, but way better than I would ever expect from a model of that size:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;files-to-prompt **/*.py -c | llm -m llama3.2:1b --system 'describe this code'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's a portion of the output:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The first section defines several test functions using the &lt;code&gt;@djp.hookimpl&lt;/code&gt; decorator from the djp library. These hook implementations allow you to intercept and manipulate Django's behavior.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;test_middleware_order&lt;/code&gt;: This function checks that the middleware order is correct by comparing the &lt;code&gt;MIDDLEWARE&lt;/code&gt; setting with a predefined list.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;test_middleware&lt;/code&gt;: This function tests various aspects of middleware:&lt;/li&gt;
&lt;li&gt;It retrieves the response from the URL &lt;code&gt;/from-plugin/&lt;/code&gt; using the &lt;code&gt;Client&lt;/code&gt; object, which simulates a request to this view.&lt;/li&gt;
&lt;li&gt;It checks that certain values are present in the response:&lt;ul&gt;
&lt;li&gt;&lt;code&gt;X-DJP-Middleware-After&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;X-DJP-Middleware&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;X-DJP-Middleware-Before&lt;/code&gt; [...]&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;I found the GGUF file that had been downloaded by Ollama in my &lt;code&gt;~/.ollama/models/blobs&lt;/code&gt; directory. The following command let me run that model directly in LLM using the &lt;a href="https://github.com/simonw/llm-gguf"&gt;llm-gguf&lt;/a&gt; plugin:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-gguf
llm gguf register-model ~/.ollama/models/blobs/sha256-74701a8c35f6c8d9a4b91f3f3497643001d63e0c7a84e085bed452548fa88d45 -a llama321b
llm chat -m llama321b
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Meta themselves claim impressive performance against other existing models:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Our evaluation suggests that the Llama 3.2 vision models are competitive with leading foundation models, Claude 3 Haiku and GPT4o-mini on image recognition and a range of visual understanding tasks. The 3B model outperforms the Gemma 2 2.6B and Phi 3.5-mini models on tasks such as following instructions, summarization, prompt rewriting, and tool-use, while the 1B is competitive with Gemma.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's the &lt;a href="https://huggingface.co/collections/meta-llama/llama-32-66f448ffc8c32f949b04c8cf"&gt;Llama 3.2 collection&lt;/a&gt; on Hugging Face. You need to accept the new Llama 3.2 Community License Agreement there in order to download those models.&lt;/p&gt;
&lt;p&gt;You can try the four new models out via the &lt;a href="https://lmarena.ai/"&gt;Chatbot Arena&lt;/a&gt; - navigate to "Direct Chat" there and select them from the dropdown menu. You can upload images directly to the chat there to try out the vision features.&lt;/p&gt;
&lt;p&gt;&lt;img alt="llama-3.2-vision-90b-instruct  A photo of two capybaras  Prompt: Describe this photo in detail  The image depicts two capybaras standing in a body of water, with one positioned on the edge of a metal grate. The capybara on the right is partially submerged in the water, while the other is standing on the grate.  Key Features:      Capybaras: Two capybaras are visible in the image.     Water: The capybaras are standing in a body of water, which appears to be murky or dirty.     Metal Grate: A metal grate is visible at the bottom right of the image, with one capybara standing on it.     Background: The background of the image is not clearly visible due to the presence of the water and the capybaras." src="https://static.simonwillison.net/static/2024/capybaras-llama-3.2.jpg" /&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/meta"&gt;meta&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/files-to-prompt"&gt;files-to-prompt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatbot-arena"&gt;chatbot-arena&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="local-llms"/><category term="llms"/><category term="vision-llms"/><category term="meta"/><category term="ollama"/><category term="llm-release"/><category term="files-to-prompt"/><category term="chatbot-arena"/></entry><entry><title>Cerebras Inference: AI at Instant Speed</title><link href="https://simonwillison.net/2024/Aug/28/cerebras-inference/#atom-tag" rel="alternate"/><published>2024-08-28T04:14:00+00:00</published><updated>2024-08-28T04:14:00+00:00</updated><id>https://simonwillison.net/2024/Aug/28/cerebras-inference/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://cerebras.ai/blog/introducing-cerebras-inference-ai-at-instant-speed"&gt;Cerebras Inference: AI at Instant Speed&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;New hosted API for Llama running at absurdly high speeds: "1,800 tokens per second for Llama3.1 8B and 450 tokens per second for Llama3.1 70B".&lt;/p&gt;
&lt;p&gt;How are they running so fast? Custom hardware. Their &lt;a href="https://cerebras.ai/product-chip/"&gt;WSE-3&lt;/a&gt; is 57x &lt;em&gt;physically larger&lt;/em&gt; than an NVIDIA H100, and has 4 trillion transistors, 900,000 cores and 44GB of memory all on one enormous chip.&lt;/p&gt;
&lt;p&gt;Their &lt;a href="https://inference.cerebras.ai/"&gt;live chat demo&lt;/a&gt; just returned me a response at 1,833 tokens/second. Their API currently has a waitlist.&lt;/p&gt;
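&lt;p&gt;If the API follows the now-common OpenAI-compatible chat completions convention - an assumption on my part, and the base URL and model ID below are guesses pending access - calling it might look something like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl https://api.cerebras.ai/v1/chat/completions \
  -H "Authorization: Bearer $CEREBRAS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1-8b",
    "messages": [{"role": "user", "content": "Why is the sky blue?"}]
  }'
&lt;/code&gt;&lt;/pre&gt;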

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=41369705"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/performance"&gt;performance&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cerebras"&gt;cerebras&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-performance"&gt;llm-performance&lt;/a&gt;&lt;/p&gt;



</summary><category term="performance"/><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="llms"/><category term="cerebras"/><category term="llm-performance"/></entry><entry><title>Quoting The Llama 3 Herd of Models</title><link href="https://simonwillison.net/2024/Jul/23/llama-3-training/#atom-tag" rel="alternate"/><published>2024-07-23T23:26:02+00:00</published><updated>2024-07-23T23:26:02+00:00</updated><id>https://simonwillison.net/2024/Jul/23/llama-3-training/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://ai.meta.com/research/publications/the-llama-3-herd-of-models/"&gt;&lt;p&gt;One interesting observation is the impact of environmental factors on training performance at scale. For Llama 3 405B , we noted a diurnal 1-2% throughput variation based on time-of-day. This fluctuation is the result of higher mid-day temperatures impacting GPU dynamic voltage and frequency scaling.&lt;/p&gt;
&lt;p&gt;During training, tens of thousands of GPUs may increase or decrease power consumption at the same time, for example, due to all GPUs waiting for checkpointing or collective communications to finish, or the startup or shutdown of the entire training job. When this happens, it can result in instant fluctuations of power consumption across the data center on the order of tens of megawatts, stretching the limits of the power grid. This is an ongoing challenge for us as we scale training for future, even larger Llama models.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://ai.meta.com/research/publications/the-llama-3-herd-of-models/"&gt;The Llama 3 Herd of Models&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/meta"&gt;meta&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpus"&gt;gpus&lt;/a&gt;&lt;/p&gt;



</summary><category term="meta"/><category term="generative-ai"/><category term="llama"/><category term="ai"/><category term="llms"/><category term="gpus"/></entry><entry><title>llm-gguf</title><link href="https://simonwillison.net/2024/Jul/23/llm-gguf/#atom-tag" rel="alternate"/><published>2024-07-23T22:18:40+00:00</published><updated>2024-07-23T22:18:40+00:00</updated><id>https://simonwillison.net/2024/Jul/23/llm-gguf/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/llm-gguf"&gt;llm-gguf&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I just released a new alpha plugin for &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; which adds support for running models from &lt;a href="https://simonwillison.net/2024/Jul/23/introducing-llama-31/"&gt;Meta's new Llama 3.1 family&lt;/a&gt; that have been packaged as GGUF files - it should work for other GGUF chat models too.&lt;/p&gt;
&lt;p&gt;If you've &lt;a href="https://llm.datasette.io/en/stable/setup.html"&gt;already installed LLM&lt;/a&gt; the following set of commands should get you setup with Llama 3.1 8B:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-gguf
llm gguf download-model \
  https://huggingface.co/lmstudio-community/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --alias llama-3.1-8b-instruct --alias l31i
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will download a 4.92GB GGUF from &lt;a href="https://huggingface.co/lmstudio-community/Meta-Llama-3.1-8B-Instruct-GGUF/tree/main"&gt;lmstudio-community/Meta-Llama-3.1-8B-Instruct-GGUF&lt;/a&gt; on Hugging Face and save it (at least on macOS) to your &lt;code&gt;~/Library/Application Support/io.datasette.llm/gguf/models&lt;/code&gt; folder.&lt;/p&gt;
&lt;p&gt;Once installed like that, you can run prompts through the model like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m l31i "five great names for a pet lemur"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or use the &lt;code&gt;llm chat&lt;/code&gt; command to keep the model resident in memory and run an interactive chat session with it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm chat -m l31i
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I decided to ship a new alpha plugin rather than update my existing &lt;a href="https://github.com/simonw/llm-llama-cpp"&gt;llm-llama-cpp&lt;/a&gt; plugin because that older plugin has some design decisions baked in from the Llama 2 release which no longer make sense, and a new plugin gave me a clean slate to adopt the latest features from the excellent underlying &lt;a href="https://github.com/abetlen/llama-cpp-python"&gt;llama-cpp-python&lt;/a&gt; library by Andrei Betlen.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/meta"&gt;meta&lt;/a&gt;&lt;/p&gt;



</summary><category term="projects"/><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="llms"/><category term="llm"/><category term="meta"/></entry><entry><title>Quoting Mark Zuckerberg</title><link href="https://simonwillison.net/2024/Jul/23/mark-zuckerberg/#atom-tag" rel="alternate"/><published>2024-07-23T16:52:07+00:00</published><updated>2024-07-23T16:52:07+00:00</updated><id>https://simonwillison.net/2024/Jul/23/mark-zuckerberg/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://about.fb.com/news/2024/07/open-source-ai-is-the-path-forward/"&gt;&lt;p&gt;I believe the Llama 3.1 release will be an inflection point in the industry where most developers begin to primarily use open source, and I expect that approach to only grow from here.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://about.fb.com/news/2024/07/open-source-ai-is-the-path-forward/"&gt;Mark Zuckerberg&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/meta"&gt;meta&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/facebook"&gt;facebook&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mark-zuckerberg"&gt;mark-zuckerberg&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;&lt;/p&gt;



</summary><category term="meta"/><category term="open-source"/><category term="generative-ai"/><category term="facebook"/><category term="mark-zuckerberg"/><category term="ai"/><category term="llms"/><category term="llama"/></entry><entry><title>Introducing Llama 3.1: Our most capable models to date</title><link href="https://simonwillison.net/2024/Jul/23/introducing-llama-31/#atom-tag" rel="alternate"/><published>2024-07-23T15:40:47+00:00</published><updated>2024-07-23T15:40:47+00:00</updated><id>https://simonwillison.net/2024/Jul/23/introducing-llama-31/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://ai.meta.com/blog/meta-llama-3-1/"&gt;Introducing Llama 3.1: Our most capable models to date&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We've been waiting for the largest release of the Llama 3 model for a few months, and now we're getting a whole new model family instead.&lt;/p&gt;
&lt;p&gt;Meta are calling Llama 3.1 405B "the first frontier-level open source AI model" and it really is benchmarking in that GPT-4+ class, competitive with both GPT-4o and Claude 3.5 Sonnet.&lt;/p&gt;
&lt;p&gt;I'm equally excited by the new 8B and 70B 3.1 models - both of which now support a 128,000 token context and benchmark significantly higher than their Llama 3 equivalents. Same-sized models getting more powerful and capable is a very reassuring trend. I expect the 8B model (or variants of it) to run comfortably on an array of consumer hardware, and I've run a 70B model on a 64GB M2 in the past.&lt;/p&gt;
&lt;p&gt;The 405B model can at least be run on a single server-class node:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;To support large-scale production inference for a model at the scale of the 405B, we quantized our models from 16-bit (BF16) to 8-bit (FP8) numerics, effectively lowering the compute requirements needed and allowing the model to run within a single server node.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Meta also made a significant &lt;a href="https://twitter.com/aiatmeta/status/1815766335219249513"&gt;change to the license&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;We’ve also updated our license&lt;/strong&gt; to allow developers to use the outputs from Llama models — including 405B — to improve other models for the first time.&lt;/p&gt;
&lt;p&gt;We’re excited about how this will &lt;strong&gt;enable new advancements in the field through synthetic data generation and model distillation workflows&lt;/strong&gt;, capabilities that have never been achieved at this scale in open source.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I'm really pleased to see this. Using models to help improve other models has been a crucial technique in LLM research for over a year now, especially for fine-tuned community models released on Hugging Face. Researchers have mostly been ignoring this restriction, so it's reassuring to see the uncertainty around that finally cleared up.&lt;/p&gt;
&lt;p&gt;Lots more details about the new models in the paper &lt;a href="https://ai.meta.com/research/publications/the-llama-3-herd-of-models/"&gt;The Llama 3 Herd of Models&lt;/a&gt; including this somewhat opaque note about the 15 trillion token training data:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Our final data mix contains roughly 50% of tokens corresponding to general knowledge, 25% of mathematical and reasoning tokens, 17% code tokens, and 8% multilingual tokens.&lt;/p&gt;
&lt;/blockquote&gt;
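&lt;p&gt;Applied to the roughly 15 trillion token corpus, that mix works out to about 7.5T general knowledge, 3.75T mathematical and reasoning, 2.55T code and 1.2T multilingual tokens - a quick sanity check (the category labels here are paraphrased from the quote above):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;python3 -c "
total = 15e12
for name, share in [('general', 0.50), ('math/reasoning', 0.25), ('code', 0.17), ('multilingual', 0.08)]:
    print(f'{name}: {total * share / 1e12:.2f}T tokens')
"
&lt;/code&gt;&lt;/pre&gt;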
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: I got the Llama 3.1 8B Instruct model working with my &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; tool via a new plugin, &lt;a href="https://simonwillison.net/2024/Jul/23/llm-gguf/"&gt;llm-gguf&lt;/a&gt;.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/facebook"&gt;facebook&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/meta"&gt;meta&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;



</summary><category term="facebook"/><category term="ai"/><category term="llama"/><category term="llms"/><category term="meta"/><category term="llm-release"/></entry><entry><title>Ultravox</title><link href="https://simonwillison.net/2024/Jun/10/ultravox/#atom-tag" rel="alternate"/><published>2024-06-10T05:34:09+00:00</published><updated>2024-06-10T05:34:09+00:00</updated><id>https://simonwillison.net/2024/Jun/10/ultravox/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/fixie-ai/ultravox"&gt;Ultravox&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Ultravox is "a multimodal Speech LLM built around a pretrained Whisper and Llama 3 backbone". It's effectively an openly licensed version of half of the GPT-4o model &lt;a href="https://openai.com/index/hello-gpt-4o/"&gt;OpenAI demoed&lt;/a&gt; (but did not fully release) a few weeks ago: Ultravox is multimodal for audio input, but still relies on a separate text-to-speech engine for audio output.&lt;/p&gt;
&lt;p&gt;You can try it out directly in your browser through &lt;a href="https://www.ai.town/characters/a90fcca3-53c0-4111-b30a-4984883a23ef"&gt;this page on AI.TOWN&lt;/a&gt; - hit the "Call" button to start an in-browser voice conversation with the model.&lt;/p&gt;
&lt;p&gt;I found the demo extremely impressive - really low latency and it was fun and engaging to talk to. Try saying "pretend to be a wise and sarcastic old fox" to kick it into a different personality.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/fixie-ai/ultravox"&gt;GitHub repo&lt;/a&gt; includes code for both training and inference, and the full model is available &lt;a href="https://huggingface.co/fixie-ai/ultravox-v0.2"&gt;from Hugging Face&lt;/a&gt; - about 30GB of &lt;code&gt;.safetensors&lt;/code&gt; files.&lt;/p&gt;
&lt;p&gt;Ultravox says it's licensed under MIT, but I would expect it to also have to inherit aspects of the Llama 3 license since it uses that as a base model.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/juberti/status/1798898986289684849"&gt;@juberti&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/text-to-speech"&gt;text-to-speech&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;&lt;/p&gt;



</summary><category term="text-to-speech"/><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="local-llms"/><category term="llms"/></entry></feed>