<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: groq</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/groq.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-09-06T16:59:25+00:00</updated><author><name>Simon Willison</name></author><entry><title>Kimi-K2-Instruct-0905</title><link href="https://simonwillison.net/2025/Sep/6/kimi-k2-instruct-0905/#atom-tag" rel="alternate"/><published>2025-09-06T16:59:25+00:00</published><updated>2025-09-06T16:59:25+00:00</updated><id>https://simonwillison.net/2025/Sep/6/kimi-k2-instruct-0905/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905"&gt;Kimi-K2-Instruct-0905&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;New not-quite-MIT licensed model from Chinese AI lab Moonshot AI, a follow-up to the highly regarded Kimi-K2 model they &lt;a href="https://simonwillison.net/2025/Jul/11/kimi-k2/"&gt;released in July&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This one is an incremental improvement - I've seen it referred to online as "Kimi K-2.1". It scores a little higher on a bunch of popular coding benchmarks, reflecting Moonshot's claim that it "demonstrates significant improvements in performance on public benchmarks and real-world coding agent tasks".&lt;/p&gt;
&lt;p&gt;More importantly, the context window size has been increased from 128,000 to 256,000 tokens.&lt;/p&gt;
&lt;p&gt;Like its predecessor this is a &lt;em&gt;big&lt;/em&gt; model - 1 trillion parameters in a mixture-of-experts configuration with 384 experts, 32B activated parameters and 8 selected experts per token.&lt;/p&gt;
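&lt;p&gt;A rough sketch of what that configuration means in practice - this just restates the figures above, it isn't anything from Moonshot's own code:&lt;/p&gt;

```javascript
// With 8 of 384 experts selected per token, each token only touches the
// "activated" slice of the weights, so per-token compute is closer to a
// 32B dense model than to the full 1 trillion parameters.
const totalParams = 1e12;   // 1 trillion total parameters
const activeParams = 32e9;  // 32B activated parameters per token
const activeShare = activeParams / totalParams; // 0.032 -> about 3% of weights per token
```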
&lt;p&gt;I used &lt;a href="https://console.groq.com/playground?model=moonshotai/kimi-k2-instruct-0905"&gt;Groq's playground tool&lt;/a&gt; to try "Generate an SVG of a pelican riding a bicycle" and got &lt;a href="https://gist.github.com/simonw/80f9fc8f888edc43e1f2a5170c95de3d"&gt;this result&lt;/a&gt;, at a very healthy 445 tokens/second taking just under 2 seconds total:&lt;/p&gt;
&lt;p&gt;&lt;img alt="This one is pretty good. Bicycle is mostly the right shape, the pelican is perched quite awkwardly on the saddle, it has quite a good attempt at the chain mechanism." src="https://static.simonwillison.net/static/2025/kimi-0905.png" /&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/groq"&gt;groq&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/moonshot"&gt;moonshot&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/kimi"&gt;kimi&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="groq"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="ai-in-china"/><category term="moonshot"/><category term="kimi"/></entry><entry><title>Initial impressions of Llama 4</title><link href="https://simonwillison.net/2025/Apr/5/llama-4-notes/#atom-tag" rel="alternate"/><published>2025-04-05T22:47:58+00:00</published><updated>2025-04-05T22:47:58+00:00</updated><id>https://simonwillison.net/2025/Apr/5/llama-4-notes/#atom-tag</id><summary type="html">
    &lt;p&gt;Dropping a model release as significant as Llama 4 on a weekend is plain unfair! So far the best place to learn about the new model family is &lt;a href="https://ai.meta.com/blog/llama-4-multimodal-intelligence/"&gt;this post on the Meta AI blog&lt;/a&gt;. They've released two new models today: Llama 4 Maverick is a 400B model (128 experts, 17B active parameters), text and image input with a 1 million token context length. Llama 4 Scout is 109B total parameters (16 experts, 17B active), also multi-modal and with a claimed 10 million token context length - an industry first.&lt;/p&gt;

&lt;p&gt;They also describe Llama 4 Behemoth, a not-yet-released "288 billion active parameter model with 16 experts that is our most powerful yet and among the world’s smartest LLMs". Behemoth has 2 trillion parameters total and was used to train both Scout and Maverick.&lt;/p&gt;
&lt;p&gt;No news yet on a Llama reasoning model beyond &lt;a href="https://www.llama.com/llama4-reasoning-is-coming/"&gt;this coming soon page&lt;/a&gt; with a looping video of an academic-looking llama.&lt;/p&gt;

&lt;p id="lmarena"&gt;Llama 4 Maverick is now sat in second place on &lt;a href="https://lmarena.ai/?leaderboard"&gt;the LM Arena leaderboard&lt;/a&gt;, just behind Gemini 2.5 Pro. &lt;em&gt;&lt;strong&gt;Update&lt;/strong&gt;: It turns out that's not the same model as the Maverick they released - I missed that their announcement says "Llama 4 Maverick offers a best-in-class performance to cost ratio with an experimental chat version scoring ELO of 1417 on LMArena."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You can try them out using the chat interface from OpenRouter (or through the OpenRouter API) for &lt;a href="https://openrouter.ai/meta-llama/llama-4-scout"&gt;Llama 4 Scout&lt;/a&gt; and &lt;a href="https://openrouter.ai/meta-llama/llama-4-maverick"&gt;Llama 4 Maverick&lt;/a&gt;. OpenRouter are proxying through to &lt;a href="https://console.groq.com/docs/models"&gt;Groq&lt;/a&gt;, &lt;a href="https://fireworks.ai/models"&gt;Fireworks&lt;/a&gt; and &lt;a href="https://docs.together.ai/docs/serverless-models"&gt;Together&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Scout may claim a 10 million input token length but the available providers currently seem to cap it at 128,000 (Groq and Fireworks) or 328,000 (Together) - I wonder who will win the race to get that full-sized 10 million token window running?&lt;/p&gt;
&lt;p&gt;Llama 4 Maverick claims a 1 million token input length - Fireworks offers 1.05M while Together offers 524,000. Groq isn't offering Maverick yet.&lt;/p&gt;
&lt;p&gt;Meta AI's &lt;a href="https://github.com/meta-llama/llama-cookbook/blob/main/getting-started/build_with_llama_4.ipynb"&gt;build_with_llama_4 notebook&lt;/a&gt; offers a hint as to why 10M tokens is difficult:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Scout supports upto 10M context. On 8xH100, in bf16 you can get upto 1.4M tokens.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Jeremy Howard &lt;a href="https://twitter.com/jeremyphoward/status/1908607345393098878"&gt;says&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The models are both giant MoEs that can't be run on consumer GPUs, even with quant. [...]&lt;/p&gt;
&lt;p&gt;Perhaps Llama 4 will be a good fit for running on a Mac. Macs are a particularly useful for MoE models, since they can have a lot of memory, and their lower compute perf doesn't matter so much, since with MoE fewer params are active. [...]&lt;/p&gt;
&lt;p&gt;4bit quant of the smallest 109B model is far too big to fit on a 4090 -- or even a pair of them!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Ivan Fioravanti &lt;a href="https://twitter.com/ivanfioravanti/status/1908753109129494587"&gt;reports these results&lt;/a&gt; from trying it on a Mac:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Llama-4 Scout on MLX and M3 Ultra
tokens-per-sec / RAM&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;3bit: 52.924 / 47.261 GB&lt;/li&gt;
&lt;li&gt;4bit: 46.942 / 60.732 GB&lt;/li&gt;
&lt;li&gt;6bit: 36.260 / 87.729 GB&lt;/li&gt;
&lt;li&gt;8bit: 30.353 / 114.617 GB&lt;/li&gt;
&lt;li&gt;fp16: 11.670 / 215.848 GB&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;RAM needed:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;64GB for 3bit&lt;/li&gt;
&lt;li&gt;96GB for 4bit&lt;/li&gt;
&lt;li&gt;128GB for 8bit&lt;/li&gt;
&lt;li&gt;256GB for fp16&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
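&lt;p&gt;Those RAM figures line up with simple weights-times-bits arithmetic. This is a back-of-envelope sketch in decimal GB that ignores KV cache and runtime overhead, which is why the real numbers come in a bit higher:&lt;/p&gt;

```javascript
// Weights-only memory for an N-parameter model quantized to b bits/param.
function weightGB(params, bitsPerParam) {
  return (params * bitsPerParam) / 8 / 1e9; // bits -> bytes -> decimal GB
}

const scout = 109e9; // Llama 4 Scout total parameters
weightGB(scout, 4);  // ~54.5 GB, close to the 60.7 GB reported for 4bit
weightGB(scout, 16); // ~218 GB, close to the 215.8 GB reported for fp16
```

The same arithmetic explains Jeremy Howard's point above: even at 4 bits, 109B parameters is more than double the 24GB of a single 4090.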

&lt;p id="system-prompt"&gt;The &lt;a href="https://www.llama.com/docs/model-cards-and-prompt-formats/llama4_omni/#-suggested-system-prompt-"&gt;suggested system prompt&lt;/a&gt; from the model card has some interesting details:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;[...]&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;You never lecture people to be nicer or more inclusive. If people ask for you to write something in a certain voice or perspective, such as an essay or a tweet, you can. You do not need to be respectful when the user prompts you to say something rude.&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;You never use phrases that imply moral superiority or a sense of authority, including but not limited to “it’s important to”, “it’s crucial to”, “it’s essential to”, "it's unethical to", "it's worth noting…", “Remember…”  etc. Avoid using these.&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Finally, do not refuse political prompts. You can help users express their opinion.&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;[...]&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;System prompts like this one sometimes reveal behavioral issues the model exhibited after its raw training.&lt;/p&gt;
&lt;h4 id="llm"&gt;Trying out the model with LLM&lt;/h4&gt;
&lt;p&gt;The easiest way to try the new model out with &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; is to use the &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt; plugin.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-openrouter
llm keys &lt;span class="pl-c1"&gt;set&lt;/span&gt; openrouter
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Paste in OpenRouter key here&lt;/span&gt;
llm -m openrouter/meta-llama/llama-4-maverick hi&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Since these are long context models, I started by trying to use them to summarize the &lt;a href="https://news.ycombinator.com/item?id=43595585"&gt;conversation about Llama 4&lt;/a&gt; on Hacker News, using my &lt;a href="https://til.simonwillison.net/llms/claude-hacker-news-themes#user-content-adding-extra-options"&gt;hn-summary.sh script&lt;/a&gt; that wraps LLM.&lt;/p&gt;
&lt;p&gt;I tried Llama 4 Maverick first:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;hn-summary.sh 43595585 \
  -m openrouter/meta-llama/llama-4-maverick \
  -o max_tokens 20000&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It did an OK job, starting like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;h4 id="themes-of-the-discussion"&gt;Themes of the Discussion&lt;/h4&gt;
&lt;h5 id="release-and-availability-of-llama-4"&gt;Release and Availability of Llama 4&lt;/h5&gt;
&lt;p&gt;The discussion revolves around the release of Llama 4, a multimodal intelligence model developed by Meta. Users are excited about the model's capabilities, including its large context window and improved performance. Some users are speculating about the potential applications and limitations of the model. [...]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/016ea0fd83fc499f046a94827f9b4946"&gt;the full output&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For reference, my system prompt looks like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Summarize the themes of the opinions expressed here. For each theme, output a markdown header. Include direct "quotations" (with author attribution) where appropriate. You MUST quote directly from users when crediting them, with double quotes. Fix HTML entities. Output markdown. Go long. Include a section of quotes that illustrate opinions uncommon in the rest of the piece&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I then tried it with Llama 4 Scout via OpenRouter and got complete junk output for some reason:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;hn-summary.sh 43595585 \
  -m openrouter/meta-llama/llama-4-scout \
  -o max_tokens 20000
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;a href="https://gist.github.com/simonw/d01cc991d478939e87487d362a8f881f"&gt;Full output&lt;/a&gt;. It starts like this and then continues for the full 20,000 tokens:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The discussion here is about another conversation that was uttered.)&lt;/p&gt;
&lt;p&gt;Here are the results.)&lt;/p&gt;
&lt;p&gt;The conversation between two groups, and I have the same questions on the contrary than those that are also seen in a model."). The fact that I see a lot of interest here.)&lt;/p&gt;
&lt;p&gt;[...]&lt;/p&gt;
&lt;p&gt;The reason) The reason) The reason &lt;em&gt;(loops until it runs out of tokens)&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This looks broken. I was using OpenRouter so it's possible I got routed to a broken instance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update 7th April 2025&lt;/strong&gt;: Meta AI's &lt;a href="https://twitter.com/ahmad_al_dahle/status/1909302532306092107"&gt;Ahmed Al-Dahle&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[...] we're also hearing some reports of mixed quality across different services. Since we dropped the models as soon as they were ready, we expect it'll take several days for all the public implementations to get dialed in. We'll keep working through our bug fixes and onboarding partners.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I later managed to run the prompt directly through Groq (with the &lt;a href="https://github.com/angerman/llm-groq"&gt;llm-groq&lt;/a&gt; plugin) - but that had a 2048 limit on output size for some reason:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;hn-summary.sh 43595585 \
  -m groq/meta-llama/llama-4-scout-17b-16e-instruct \
  -o max_tokens 2048
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/a205c5fc131a1d4e9cd6c432a07feedb"&gt;the full result&lt;/a&gt;. It followed my instructions but was &lt;em&gt;very&lt;/em&gt; short - just 630 tokens of output.&lt;/p&gt;
&lt;p&gt;For comparison, here's &lt;a href="https://gist.github.com/simonw/f21ecc7fb2aa13ff682d4ffa11ddcbfd"&gt;the same thing&lt;/a&gt; run against Gemini 2.5 Pro. Gemini's results were &lt;em&gt;massively&lt;/em&gt; better, producing 5,584 output tokens (it spent an additional 2,667 tokens on "thinking").&lt;/p&gt;
&lt;p&gt;I'm not sure how much to judge Llama 4 by these results to be honest - the model has only been out for a few hours and it's quite possible that the providers I've tried running against aren't yet optimally configured for this kind of long-context prompt.&lt;/p&gt;
&lt;h4 id="my-hopes-for-llama-4"&gt;My hopes for Llama 4&lt;/h4&gt;
&lt;p&gt;I'm hoping that Llama 4 plays out in a similar way to Llama 3.&lt;/p&gt;
&lt;p&gt;The first Llama 3 models released were 8B and 70B, &lt;a href="https://ai.meta.com/blog/meta-llama-3/"&gt;last April&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Llama 3.1 followed &lt;a href="https://ai.meta.com/blog/meta-llama-3-1/"&gt;in July&lt;/a&gt; at 8B, 70B, and 405B. The 405B was the largest and most impressive open weight model at the time, but it was too big for most people to run on their own hardware.&lt;/p&gt;
&lt;p&gt;Llama 3.2 &lt;a href="https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/"&gt;in September&lt;/a&gt; is where things got really interesting: 1B, 3B, 11B and 90B. The 1B and 3B models both work on my iPhone, and are surprisingly capable! The 11B and 90B models were the first Llamas to support vision, and the 11B &lt;a href="https://simonwillison.net/2024/Sep/25/llama-32/"&gt;ran on my Mac&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Then Llama 3.3 landed in December with a 70B model that &lt;a href="https://simonwillison.net/2024/Dec/9/llama-33-70b/"&gt;I wrote about as a GPT-4 class model that ran on my Mac&lt;/a&gt;. It claimed performance similar to the earlier Llama 3.1 405B!&lt;/p&gt;
&lt;p&gt;Today's Llama 4 models are 109B and 400B, both of which were trained with the help of the so-far unreleased 2T Llama 4 Behemoth.&lt;/p&gt;
&lt;p&gt;My hope is that we'll see a whole family of Llama 4 models at varying sizes, following the pattern of Llama 3. I'm particularly excited to see if they produce an improved ~3B model that runs on my phone. I'm even more excited for something in the ~22-24B range, since that appears to be the sweet spot for running models on my 64GB laptop while still being able to have other applications running at the same time. Mistral Small 3.1 is a 24B model and is &lt;a href="https://simonwillison.net/2025/Mar/17/mistral-small-31/"&gt;absolutely superb&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jeremy-howard"&gt;jeremy-howard&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/groq"&gt;groq&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/meta"&gt;meta&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/long-context"&gt;long-context&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatbot-arena"&gt;chatbot-arena&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="llms"/><category term="jeremy-howard"/><category term="llm"/><category term="gemini"/><category term="vision-llms"/><category term="groq"/><category term="meta"/><category term="mlx"/><category term="long-context"/><category term="llm-release"/><category term="openrouter"/><category term="chatbot-arena"/></entry><entry><title>Introducing Llama-3-Groq-Tool-Use Models</title><link href="https://simonwillison.net/2024/Jul/17/llama-3-groq-tool-use-models/#atom-tag" rel="alternate"/><published>2024-07-17T20:32:50+00:00</published><updated>2024-07-17T20:32:50+00:00</updated><id>https://simonwillison.net/2024/Jul/17/llama-3-groq-tool-use-models/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://wow.groq.com/introducing-llama-3-groq-tool-use-models/"&gt;Introducing Llama-3-Groq-Tool-Use Models&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;New from &lt;a href="https://groq.com/"&gt;Groq&lt;/a&gt;: two custom fine-tuned Llama 3 models specifically designed for tool use. Hugging Face model links:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/Groq/Llama-3-Groq-8B-Tool-Use"&gt;Groq/Llama-3-Groq-8B-Tool-Use&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/Groq/Llama-3-Groq-70B-Tool-Use"&gt;Groq/Llama-3-Groq-70B-Tool-Use&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Groq's own internal benchmarks put their 70B model at the top of the &lt;a href="https://gorilla.cs.berkeley.edu/leaderboard.html"&gt;Berkeley Function-Calling Leaderboard&lt;/a&gt; with a score of 90.76 (and 89.06 for their 8B model, which would put it at #3). For comparison, Claude 3.5 Sonnet scores 90.18 and GPT-4-0124 scores 88.29.&lt;/p&gt;
&lt;p&gt;The two new Groq models are also available through their screamingly-fast (fastest in the business?) API, running at 330 tokens/s and 1050 tokens/s respectively.&lt;/p&gt;
&lt;p&gt;Here's the documentation on &lt;a href="https://console.groq.com/docs/tool-use"&gt;how to use tools through their API&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/RickLamers/status/1813341037198204962"&gt;Rick Lamers&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/groq"&gt;groq&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-tool-use"&gt;llm-tool-use&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-performance"&gt;llm-performance&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="groq"/><category term="llm-tool-use"/><category term="llm-performance"/></entry><entry><title>Fast groq-hosted LLMs vs browser jank</title><link href="https://simonwillison.net/2024/May/19/fast-groq-hosted-llms-vs-browser-jank/#atom-tag" rel="alternate"/><published>2024-05-19T13:35:47+00:00</published><updated>2024-05-19T13:35:47+00:00</updated><id>https://simonwillison.net/2024/May/19/fast-groq-hosted-llms-vs-browser-jank/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://taras.glek.net/post/groq-vs-html-reflows/"&gt;Fast groq-hosted LLMs vs browser jank&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;a href="https://groq.com/"&gt;Groq&lt;/a&gt; is now serving LLMs such as Llama 3 so quickly that JavaScript which attempts to render Markdown strings on every new token can cause performance issues in browsers.&lt;/p&gt;
&lt;p&gt;Taras Glek's &lt;a href="https://github.com/tarasglek/chatcraft.org/pull/640/files"&gt;solution&lt;/a&gt; was to move the rendering to a &lt;code&gt;requestAnimationFrame()&lt;/code&gt; callback, effectively buffering the rendering to the fastest rate the browser can support.&lt;/p&gt;
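&lt;p&gt;The technique can be sketched like this - a minimal illustration of frame-coalesced rendering, not Taras Glek's actual implementation. The &lt;code&gt;schedule&lt;/code&gt; parameter is injectable so the logic can be exercised outside a browser; in production you'd pass &lt;code&gt;requestAnimationFrame&lt;/code&gt;:&lt;/p&gt;

```javascript
// Buffer incoming tokens and render at most once per animation frame,
// instead of re-rendering the Markdown on every single token.
function createBufferedRenderer(render, schedule) {
  let buffer = "";
  let scheduled = false;
  return function onToken(token) {
    buffer += token;      // accumulate tokens as fast as they arrive
    if (!scheduled) {     // coalesce: at most one render per frame
      scheduled = true;
      schedule(() => {
        scheduled = false;
        render(buffer);   // render the full accumulated string once
      });
    }
  };
}

// In a browser, assuming a renderMarkdown(text) function exists:
// const onToken = createBufferedRenderer(renderMarkdown, requestAnimationFrame);
```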

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://lobste.rs/s/5i2axx/fast_groq_hosted_llms_vs_browser_jank"&gt;lobste.rs&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/groq"&gt;groq&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-performance"&gt;llm-performance&lt;/a&gt;&lt;/p&gt;



</summary><category term="javascript"/><category term="llms"/><category term="groq"/><category term="llm-performance"/></entry><entry><title>Options for accessing Llama 3 from the terminal using LLM</title><link href="https://simonwillison.net/2024/Apr/22/llama-3/#atom-tag" rel="alternate"/><published>2024-04-22T13:38:09+00:00</published><updated>2024-04-22T13:38:09+00:00</updated><id>https://simonwillison.net/2024/Apr/22/llama-3/#atom-tag</id><summary type="html">
    &lt;p&gt;Llama 3 was released &lt;a href="https://llama.meta.com/llama3/"&gt;on Thursday&lt;/a&gt;. Early indications are that it's now the best available openly licensed model - Llama 3 70b Instruct has taken joint 5th place on the &lt;a href="https://chat.lmsys.org/?leaderboard"&gt;LMSYS arena leaderboard&lt;/a&gt;, behind only Claude 3 Opus and some GPT-4s and sharing 5th place with Gemini Pro and Claude 3 Sonnet. But unlike those other models Llama 3 70b is weights available and can even be run on a (high end) laptop!&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; command-line tool and Python library provides access to dozens of models via plugins. Here are several ways you can use it to access Llama 3, both hosted versions and running locally on your own hardware.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Apr/22/llama-3/#llama-3-8b-instruct-locally-with-llm-gpt4all"&gt;Llama-3-8B-Instruct locally with llm-gpt4all&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Apr/22/llama-3/#fast-api-access-via-groq"&gt;Fast API access via Groq&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Apr/22/llama-3/#local-llama-3-70b-instruct-with-llamafile"&gt;Local Llama 3 70b Instruct with llamafile&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Apr/22/llama-3/#paid-access-via-other-api-providers"&gt;Paid access via other API providers&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id="llama-3-8b-instruct-locally-with-llm-gpt4all"&gt;Llama-3-8B-Instruct locally with llm-gpt4all&lt;/h4&gt;
&lt;p&gt;If you want to run Llama 3 locally, the easiest way to do that with LLM is using the &lt;a href="https://github.com/simonw/llm-gpt4all"&gt;llm-gpt4all&lt;/a&gt; plugin. This plugin builds on the excellent &lt;a href="https://gpt4all.io/index.html"&gt;gpt4all&lt;/a&gt; project by Nomic AI, providing a quantized (q4) Llama 3 8B instruct model that needs 8GB of RAM and requires a 4.34GB download.&lt;/p&gt;
&lt;p&gt;If you don't have LLM installed yet, install it like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;brew install llm&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Or like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;pipx install llm&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Install the plugin:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-gpt4all&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The first time you run a prompt it will download the model:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m Meta-Llama-3-8B-Instruct &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Three great names for a pet emu&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;blockquote&gt;
&lt;p&gt;What an interesting question! Here are three name suggestions for your pet emu:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Emmet&lt;/strong&gt;: A playful take on the word "emu" that suits a friendly and curious bird.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gizmo&lt;/strong&gt;: Emus are known for their quirky behavior, so Gizmo is a fitting name to capture their unique personality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aussie&lt;/strong&gt;: As an Australian native species, naming your emu Aussie would be a nod to its heritage.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Remember to choose a name that suits your pet's temperament and characteristics!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Each time you run a command like this the model will be loaded into memory from scratch. To keep it resident in memory, use the &lt;code&gt;llm chat&lt;/code&gt; command:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm chat -m Meta-Llama-3-8B-Instruct&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here's an example session:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Chatting with Meta-Llama-3-8B-Instruct
Type 'exit' or 'quit' to exit
Type '!multi' to enter multiple lines, then '!end' to finish
&amp;gt; javascript function to reverse a string
%
Here is the JavaScript code that reverses a given string:
```
function reverseString(str) {
  return str.split("").reverse().join("");
}
```

You can use this function by passing your desired input string as an argument, like so:

```javascript
let reversedStr = reverseString("Hello World");
console.log(reversedStr); // Output: "dlroW olleH"
```
In the code above:
- `str.split("")` splits the original string into individual characters.
- `.reverse()` reverses the order of those characters.
- `.join("")` combines them back together as a single string.

This function works for any input string, not just "Hello World". Try it out with different strings to see how it handles punctuation and spaces!
&amp;gt; capitalize it too
%
Here's an updated JavaScript function that not only reverses the input string but also capitalizes each character:
```javascript
function reverseAndCapitalize(str) {
  return str.split("").reverse().map(function(char) { 
    if (char.match(/[a-z]/i)) {
      return char.toUpperCase();
    } else {
      return char;
    }
  }).join("");
}
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(I'm not sure I'd write &lt;code&gt;reverseAndCapitalize()&lt;/code&gt; quite like that, but still a fun example!)&lt;/p&gt;
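&lt;p&gt;Since &lt;code&gt;toUpperCase()&lt;/code&gt; is a no-op on non-letter characters anyway, the per-character regex check in the model's version is unnecessary. Here's an equivalent one-liner - my sketch, not the model's output:&lt;/p&gt;

```javascript
// Uppercase the whole string, then reverse it - same result as checking
// each character individually, because toUpperCase() leaves digits,
// punctuation and spaces unchanged.
function reverseAndCapitalizeSimple(str) {
  return str.toUpperCase().split("").reverse().join("");
}
```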
&lt;p&gt;Consult &lt;a href="https://llm.datasette.io/en/stable/usage.html"&gt;the LLM documentation&lt;/a&gt; for more details on how to use the command-line tool.&lt;/p&gt;
&lt;h4 id="fast-api-access-via-groq"&gt;Fast API access via Groq&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://groq.com/"&gt;Groq&lt;/a&gt; serve openly licensed LLMs at ludicrous speeds using their own custom LPU (Language Processing Unit) Inference Engine. They currently offer a free preview of their API: you can sign up and &lt;a href="https://console.groq.com/keys"&gt;obtain an API key&lt;/a&gt; to start using it.&lt;/p&gt;
&lt;p&gt;You can run prompts against Groq using their &lt;a href="https://console.groq.com/docs/openai"&gt;OpenAI compatible API endpoint&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Edit the file &lt;code&gt; ~/Library/Application Support/io.datasette.llm/extra-openai-models.yaml&lt;/code&gt; - creating it if it doesn't exist - and add the following lines to it:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;- &lt;span class="pl-ent"&gt;model_id&lt;/span&gt;: &lt;span class="pl-s"&gt;groq-openai-llama3&lt;/span&gt;
  &lt;span class="pl-ent"&gt;model_name&lt;/span&gt;: &lt;span class="pl-s"&gt;llama3-70b-8192&lt;/span&gt;
  &lt;span class="pl-ent"&gt;api_base&lt;/span&gt;: &lt;span class="pl-s"&gt;https://api.groq.com/openai/v1&lt;/span&gt;
  &lt;span class="pl-ent"&gt;api_key_name&lt;/span&gt;: &lt;span class="pl-s"&gt;groq&lt;/span&gt;
- &lt;span class="pl-ent"&gt;model_id&lt;/span&gt;: &lt;span class="pl-s"&gt;groq-openai-llama3-8b&lt;/span&gt;
  &lt;span class="pl-ent"&gt;model_name&lt;/span&gt;: &lt;span class="pl-s"&gt;llama3-8b-8192&lt;/span&gt;
  &lt;span class="pl-ent"&gt;api_base&lt;/span&gt;: &lt;span class="pl-s"&gt;https://api.groq.com/openai/v1&lt;/span&gt;
  &lt;span class="pl-ent"&gt;api_key_name&lt;/span&gt;: &lt;span class="pl-s"&gt;groq&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This tells LLM about those models, and makes them accessible via those configured &lt;code&gt;model_id&lt;/code&gt; values.&lt;/p&gt;
&lt;p&gt;Run this command to confirm that the models were registered correctly:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm models &lt;span class="pl-k"&gt;|&lt;/span&gt; grep groq&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;You should see this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;OpenAI Chat: groq-openai-llama3
OpenAI Chat: groq-openai-llama3-8b
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Set your Groq API key like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm keys &lt;span class="pl-c1"&gt;set&lt;/span&gt; groq
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; &amp;lt;Paste your API key here&amp;gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Now you should be able to run prompts through the models like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m groq-openai-llama3 &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;A righteous sonnet about a brave owl&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/groq-sonnet.gif" alt="Animated demo. The sonnet appears in less than a second: Here is a sonnet about a brave owl:  In moonlit skies, a silhouette is seen, A wingspan wide, a watchful, piercing gaze. The owl, a sentinel of secrets keen, Patrols the night, with valor in her ways.  Her feathers soft, a camouflage gray, She glides unseen, a phantom of the night. Her eyes, like lanterns, shining bright and far, Illuminate the darkness, banishing all fright.  Her talons sharp, a grasping, deadly sway, She swoops upon her prey, with silent might. Yet in her heart, a wisdom, old and gray, A fierce devotion to the darkness of the night.  And thus, the owl, a symbol of courage true, Inspires us all, with brave and noble pursuit.  I hope you enjoy this sonnet!" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Groq is &lt;em&gt;fast&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;There's also an &lt;a href="https://github.com/angerman/llm-groq"&gt;llm-groq&lt;/a&gt; plugin, but it hasn't shipped support for the new models just yet - though there's &lt;a href="https://github.com/angerman/llm-groq/pull/5"&gt;a PR for that by Lex Herbert&lt;/a&gt; and you can install the plugin directly from that PR like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install https://github.com/lexh/llm-groq/archive/ba9d7de74b3057b074a85fe99fe873b75519bd78.zip
llm keys &lt;span class="pl-c1"&gt;set&lt;/span&gt; groq
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; paste API key here&lt;/span&gt;
llm -m groq-llama3-70b &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;say hi in spanish five ways&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h4 id="local-llama-3-70b-instruct-with-llamafile"&gt;Local Llama 3 70b Instruct with llamafile&lt;/h4&gt;
&lt;p&gt;The Llama 3 8b model is easy to run on a laptop, but it's pretty limited in capability. The 70b model is the one that's starting to get competitive with GPT-4. Can we run that on a laptop?&lt;/p&gt;
&lt;p&gt;I managed to run the 70b model on my 64GB MacBook Pro M2 using &lt;a href="https://github.com/Mozilla-Ocho/llamafile"&gt;llamafile&lt;/a&gt; (&lt;a href="https://simonwillison.net/2023/Nov/29/llamafile/"&gt;previously on this blog&lt;/a&gt;) - after quitting most other applications to make sure the 37GB of RAM it needed was available.&lt;/p&gt;
&lt;p&gt;I used the &lt;code&gt;Meta-Llama-3-70B-Instruct.Q4_0.llamafile&lt;/code&gt; Q4 version from &lt;a href="https://huggingface.co/jartine/Meta-Llama-3-70B-Instruct-llamafile/tree/main"&gt;jartine/Meta-Llama-3-70B-Instruct-llamafile&lt;/a&gt; - a 37GB download. I have a dedicated external hard disk (a Samsung T7 Shield) for this kind of thing.&lt;/p&gt;
&lt;p&gt;Here's how I got it working:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;curl -L -o Meta-Llama-3-70B-Instruct.Q4_0.llamafile &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;https://huggingface.co/jartine/Meta-Llama-3-70B-Instruct-llamafile/resolve/main/Meta-Llama-3-70B-Instruct.Q4_0.llamafile?download=true&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; That downloads 37GB - now make it executable&lt;/span&gt;
chmod 755 Meta-Llama-3-70B-Instruct.Q4_0.llamafile
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; And start it running:&lt;/span&gt;
./Meta-Llama-3-70B-Instruct.Q4_0.llamafile&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;A &lt;code&gt;llamafile&lt;/code&gt; is an executable that runs on virtually &lt;em&gt;any platform&lt;/em&gt; - see &lt;a href="https://til.simonwillison.net/cosmopolitan/ecosystem"&gt;my previous notes&lt;/a&gt; on Cosmopolitan and Actually Portable Executable for more on how that works.&lt;/p&gt;
&lt;p&gt;This will take quite a while to start, because it needs to load that full 37GB of binary content into memory. Once it's finished loading, a local web server becomes available at &lt;a href="http://127.0.0.1:8080/"&gt;http://127.0.0.1:8080/&lt;/a&gt; - this serves a web UI you can use to interact with the model, and also exposes an OpenAI-compatible API endpoint.&lt;/p&gt;
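&lt;p&gt;If you want to hit that API endpoint directly, a minimal &lt;code&gt;curl&lt;/code&gt; sketch looks something like this - the &lt;code&gt;/v1/chat/completions&lt;/code&gt; path and JSON shape are assumed from standard OpenAI API compatibility, so check the llamafile server documentation if it doesn't respond as expected:&lt;/p&gt;

```shell
# Sketch: call the local llamafile server's OpenAI-compatible endpoint
# (assumes the server is running on the default port 8080)
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llamafile",
    "messages": [{"role": "user", "content": "Say hello in one sentence"}]
  }'
```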
&lt;p&gt;The easiest way to access this from LLM is to install the &lt;a href="https://github.com/simonw/llm-llamafile"&gt;llm-llamafile&lt;/a&gt; plugin:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-llamafile&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;All this plugin does is configure a model called &lt;code&gt;llamafile&lt;/code&gt; that attempts to access the model hosted on port 8080. You can run prompts like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m llamafile &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;3 neat characteristics of a pelican&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;blockquote&gt;
&lt;p&gt;Here are three neat characteristics of a pelican:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Unique Beak&lt;/strong&gt;: Pelicans have a distinctive beak that is shaped like a scoop or a basket. This beak is specially designed to catch fish, and it can hold up to 3 gallons of water! The beak is also very sensitive, which helps pelicans detect the presence of fish in the water.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Waterproof Feathers&lt;/strong&gt;: Pelicans have a special coating on their feathers that makes them waterproof. This is essential for their fishing lifestyle, as they need to be able to dive into the water without getting weighed down by wet feathers. The coating is made up of a waxy substance that helps to repel water.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Pouch-Like Throat&lt;/strong&gt;: Pelicans have a unique throat pouch that allows them to catch and store fish. When they dive into the water, they use their beak to scoop up fish, and then they store them in their throat pouch. The pouch can expand to hold multiple fish, and the pelican can then swallow the fish whole or regurgitate them to feed their young. This pouch is a key adaptation that helps pelicans thrive in their aquatic environment.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
&lt;p&gt;If you don't want to install another plugin, you can instead configure the model by adding this to your &lt;code&gt;openai-extra-models.yaml&lt;/code&gt; file:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;- &lt;span class="pl-ent"&gt;model_id&lt;/span&gt;: &lt;span class="pl-s"&gt;llamafile&lt;/span&gt;
  &lt;span class="pl-ent"&gt;model_name&lt;/span&gt;: &lt;span class="pl-s"&gt;llamafile&lt;/span&gt;
  &lt;span class="pl-ent"&gt;api_base&lt;/span&gt;: &lt;span class="pl-s"&gt;http://localhost:8080/v1&lt;/span&gt;
  &lt;span class="pl-ent"&gt;api_key&lt;/span&gt;: &lt;span class="pl-s"&gt;x&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;One warning about this approach: if you use LLM like this, every prompt you run through &lt;code&gt;llamafile&lt;/code&gt; will be stored under the same model name in your &lt;a href="https://llm.datasette.io/en/stable/logging.html"&gt;SQLite logs&lt;/a&gt;, even if you try out different &lt;code&gt;llamafile&lt;/code&gt; models at different times. You could work around this by registering them with different &lt;code&gt;model_id&lt;/code&gt; values in the YAML file.&lt;/p&gt;
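&lt;p&gt;That workaround might look like this - the &lt;code&gt;model_id&lt;/code&gt; names and the second port are hypothetical, standing in for two different llamafiles you might run at different times or on different ports:&lt;/p&gt;

```yaml
# Sketch: register two llamafile servers under distinct model_ids
# so their prompts are logged separately (names and ports are illustrative)
- model_id: llamafile-llama3-70b
  model_name: llamafile
  api_base: http://localhost:8080/v1
  api_key: x
- model_id: llamafile-llama3-8b
  model_name: llamafile
  api_base: http://localhost:8081/v1
  api_key: x
```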
&lt;h4 id="paid-access-via-other-api-providers"&gt;Paid access via other API providers&lt;/h4&gt;
&lt;p&gt;A neat thing about open weight models is that multiple API providers can offer them, which encourages them to compete aggressively on price.&lt;/p&gt;
&lt;p&gt;Groq is currently free to use, though with a limit on the number of requests.&lt;/p&gt;
&lt;p&gt;A number of other providers are now hosting Llama 3, and many of them have plugins available for LLM. Here are a few examples:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.perplexity.ai/"&gt;Perplexity Labs&lt;/a&gt; are offering &lt;code&gt;llama-3-8b-instruct&lt;/code&gt; and &lt;code&gt;llama-3-70b-instruct&lt;/code&gt;. The &lt;a href="https://github.com/hex/llm-perplexity"&gt;llm-perplexity&lt;/a&gt; plugin provides access - &lt;code&gt;llm install llm-perplexity&lt;/code&gt; to install, &lt;code&gt;llm keys set perplexity&lt;/code&gt; to set an &lt;a href="https://www.perplexity.ai/settings/api"&gt;API key&lt;/a&gt; and then run prompts against those two model IDs. Current &lt;a href="https://docs.perplexity.ai/docs/pricing"&gt;price&lt;/a&gt; for 8b is $0.20 per million tokens, for 80b is $1.00.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.anyscale.com/endpoints"&gt;Anyscale Endpoints&lt;/a&gt; have &lt;code&gt;meta-llama/Llama-3-8b-chat-hf&lt;/code&gt; ($0.15/million tokens) and &lt;code&gt;meta-llama/Llama-3-70b-chat-hf&lt;/code&gt; ($1.0/million tokens) (&lt;a href="https://docs.endpoints.anyscale.com/pricing/"&gt;pricing&lt;/a&gt;). &lt;code&gt;llm install llm-anyscale-endpoints&lt;/code&gt;, then &lt;code&gt;llm keys set anyscale-endpoints&lt;/code&gt; to set the &lt;a href="https://app.endpoints.anyscale.com/"&gt;API key&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://fireworks.ai/"&gt;Fireworks AI&lt;/a&gt; have &lt;code&gt;fireworks/models/llama-v3-8b-instruct&lt;/code&gt; for $0.20/million and &lt;code&gt;fireworks/models/llama-v3-70b-instruct&lt;/code&gt; for $0.90/million (&lt;a href="https://fireworks.ai/pricing"&gt;pricing&lt;/a&gt;). &lt;code&gt;llm install llm-fireworks&lt;/code&gt;, then &lt;code&gt;llm keys set fireworks&lt;/code&gt; to set the &lt;a href="https://fireworks.ai/api-keys"&gt;API key&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://openrouter.ai/"&gt;OpenRouter&lt;/a&gt; provide proxied accessed to Llama 3 from a number of different providers at different prices, documented on their &lt;a href="https://openrouter.ai/models/meta-llama/llama-3-70b-instruct"&gt;meta-llama/llama-3-70b-instruct&lt;/a&gt; and &lt;a href="https://openrouter.ai/models/meta-llama/llama-3-8b-instruct"&gt;meta-llama/llama-3-8b-instruct&lt;/a&gt; pages (&lt;a href="https://openrouter.ai/models?q=llama%203"&gt;and more&lt;/a&gt;). Use the &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt; plugin for those.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.together.ai/"&gt;Together AI&lt;/a&gt; has both models as well. The &lt;a href="https://github.com/wearedevx/llm-together"&gt;llm-together&lt;/a&gt; plugin provides access to &lt;code&gt;meta-llama/Llama-3-8b-chat-hf&lt;/code&gt; and &lt;code&gt;meta-llama/Llama-3-70b-chat-hf&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I'm sure there are more - these are just the ones I've tried out myself. Check the &lt;a href="https://llm.datasette.io/en/stable/plugins/directory.html"&gt;LLM plugin directory&lt;/a&gt; for other providers, or if a provider emulates the OpenAI API you can configure it with the YAML file as shown above or as &lt;a href="https://llm.datasette.io/en/stable/other-models.html#openai-compatible-models"&gt;described in the LLM documentation&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="that-s-a-lot-of-options"&gt;That's a lot of options&lt;/h4&gt;
&lt;p&gt;One key idea behind LLM is to use plugins to provide access to as many different models as possible. Above I've listed two ways to run Llama 3 locally and six different API vendors that LLM can access as well.&lt;/p&gt;
&lt;p&gt;If you're inspired to write your own plugin it's pretty simple: each of the above plugins is open source, and there's a detailed tutorial on &lt;a href="https://llm.datasette.io/en/stable/plugins/tutorial-model-plugin.html"&gt;Writing a plugin to support a new model&lt;/a&gt; on the LLM website.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llamafile"&gt;llamafile&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/groq"&gt;groq&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatbot-arena"&gt;chatbot-arena&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="projects"/><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="llamafile"/><category term="groq"/><category term="llm-release"/><category term="openrouter"/><category term="chatbot-arena"/></entry></feed>