<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: mlc</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/mlc.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2024-11-29T21:09:11+00:00</updated><author><name>Simon Willison</name></author><entry><title>Structured Generation w/ SmolLM2 running in browser &amp; WebGPU</title><link href="https://simonwillison.net/2024/Nov/29/structured-generation-smollm2-webgpu/#atom-tag" rel="alternate"/><published>2024-11-29T21:09:11+00:00</published><updated>2024-11-29T21:09:11+00:00</updated><id>https://simonwillison.net/2024/Nov/29/structured-generation-smollm2-webgpu/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/spaces/reach-vb/github-issue-generator-webgpu"&gt;Structured Generation w/ SmolLM2 running in browser &amp;amp; WebGPU&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Extraordinary demo by Vaibhav Srivastav (VB). Here's Hugging Face's &lt;a href="https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct"&gt;SmolLM2-1.7B-Instruct&lt;/a&gt; running directly in a web browser (using WebGPU, so requires Chrome &lt;a href="https://github.com/gpuweb/gpuweb/wiki/Implementation-Status"&gt;for the moment&lt;/a&gt;) demonstrating structured text extraction, converting a text description of an image into a structured GitHub issue defined using JSON schema.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Interface showing text input, a JSON schema, extracted JSON and a UI that demonstrates the structured resulting GitHub Issue" src="https://static.simonwillison.net/static/2024/github-issue-extract.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;The page loads 924.8MB of model data (according to &lt;a href="https://gist.github.com/simonw/3ccba6256e95b59ea6a17509855830b4"&gt;this script to sum up files in window.caches&lt;/a&gt;) and performs everything in-browser. I did not know a model this small could produce such useful results.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://github.com/Vaibhavs10/github-issue-generator-webgpu/blob/main/src/index.js"&gt;the source code&lt;/a&gt; for the demo. It's around 200 lines of code, 50 of which are the JSON schema describing the data to be extracted.&lt;/p&gt;
&lt;p&gt;The real secret sauce here is &lt;a href="https://github.com/mlc-ai/web-llm"&gt;web-llm&lt;/a&gt; by MLC. This library has made loading and executing prompts through LLMs in the browser shockingly easy, and recently incorporated support for MLC's &lt;a href="https://xgrammar.mlc.ai/"&gt;XGrammar&lt;/a&gt; library (also available in Python) which implements both JSON schema and EBNF-based structured output guidance.&lt;/p&gt;
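&lt;p&gt;The core promise of schema-guided generation is that the model's output is guaranteed to parse and validate. Here's a hand-rolled sketch of that check, using a trimmed-down hypothetical issue schema rather than the demo's actual 50-line one:&lt;/p&gt;

```javascript
// Minimal sketch of what JSON schema guidance guarantees: every object
// the model emits conforms to the schema. The trimmed-down issue schema
// and hand-rolled validator are illustrative, not the demo's real code.
const issueSchema = {
  type: "object",
  required: ["title", "labels"],
  properties: {
    title: { type: "string" },
    labels: { type: "array", items: { type: "string" } },
  },
};

function conforms(value, schema) {
  if (schema.type === "string") return typeof value === "string";
  if (schema.type === "array") {
    if (!Array.isArray(value)) return false;
    return value.every((item) => conforms(item, schema.items));
  }
  if (schema.type === "object") {
    if (typeof value !== "object" || value === null) return false;
    if (!(schema.required || []).every((key) => key in value)) return false;
    return Object.keys(schema.properties).every(
      (key) => !(key in value) || conforms(value[key], schema.properties[key])
    );
  }
  return false;
}

console.log(conforms({ title: "Bug in parser", labels: ["bug"] }, issueSchema));
```

With grammar-constrained decoding the validation happens during token sampling, so a non-conforming output can never be produced in the first place.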

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://bsky.app/profile/reach-vb.hf.co/post/3lc24bmj6fk2j"&gt;@reach-vb.hf.co&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/webassembly"&gt;webassembly&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlc"&gt;mlc&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/hugging-face"&gt;hugging-face&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/webgpu"&gt;webgpu&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/smollm"&gt;smollm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/structured-extraction"&gt;structured-extraction&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="webassembly"/><category term="generative-ai"/><category term="llms"/><category term="mlc"/><category term="hugging-face"/><category term="webgpu"/><category term="smollm"/><category term="structured-extraction"/></entry><entry><title>A practical guide to deploying Large Language Models Cheap, Good *and* Fast</title><link href="https://simonwillison.net/2023/Sep/4/a-practical-guide-to-deploying-large-language-models-cheap-good/#atom-tag" rel="alternate"/><published>2023-09-04T13:43:03+00:00</published><updated>2023-09-04T13:43:03+00:00</updated><id>https://simonwillison.net/2023/Sep/4/a-practical-guide-to-deploying-large-language-models-cheap-good/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://askdala.substack.com/p/a-pratical-guide-to-deploying-llms"&gt;A practical guide to deploying Large Language Models Cheap, Good *and* Fast&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Joel Kang’s extremely comprehensive notes on what he learned trying to run Vicuna-13B-v1.5 on an affordable cloud GPU server (a T4 at $0.615/hour). The space is in so much flux right now—Joel ended up using MLC but the best option could change any minute.&lt;/p&gt;

&lt;p&gt;Vicuna 13B quantized to 4-bit integers needed 7.5GB of the T4’s 16GB of VRAM, and returned tokens at 20/second.&lt;/p&gt;
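&lt;p&gt;That 7.5GB figure lines up with the quantization arithmetic. A quick sketch:&lt;/p&gt;

```javascript
// Back-of-envelope check on the quoted VRAM figure: a 13B-parameter
// model at 4 bits per weight needs about 6.5GB for the weights alone,
// so the remaining roughly 1GB covers KV cache, activations and other
// runtime state.
const params = 13e9;
const bitsPerWeight = 4;
const weightGB = (params * bitsPerWeight) / 8 / 1e9; // 6.5
const observedGB = 7.5; // figure reported above
const runtimeOverheadGB = observedGB - weightGB;
console.log(weightGB, runtimeOverheadGB);
```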

&lt;p&gt;An open challenge running MLC right now is around batching and concurrency: “I did try making 3 concurrent requests to the endpoint, and while they all stream tokens back and the server doesn’t OOM, the output of all 3 streams seem to actually belong to a single prompt.”&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlc"&gt;mlc&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vicuna"&gt;vicuna&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="llms"/><category term="mlc"/><category term="vicuna"/></entry><entry><title>WebLLM supports Llama 2 70B now</title><link href="https://simonwillison.net/2023/Aug/30/webllm-supports-llama-2-70b-now/#atom-tag" rel="alternate"/><published>2023-08-30T14:41:26+00:00</published><updated>2023-08-30T14:41:26+00:00</updated><id>https://simonwillison.net/2023/Aug/30/webllm-supports-llama-2-70b-now/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://webllm.mlc.ai/"&gt;WebLLM supports Llama 2 70B now&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The WebLLM project from MLC uses WebGPU to run large language models entirely in the browser. They recently added support for Llama 2, including Llama 2 70B, the largest and most powerful model in that family.&lt;/p&gt;

&lt;p&gt;To my astonishment, this worked! I used an M2 Mac with 64GB of RAM and Chrome Canary and it downloaded many GBs of data... but it worked, and spat out tokens at a slow but respectable rate of 3.25 tokens/second.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/webassembly"&gt;webassembly&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlc"&gt;mlc&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/webgpu"&gt;webgpu&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="webassembly"/><category term="generative-ai"/><category term="llama"/><category term="llms"/><category term="mlc"/><category term="webgpu"/></entry><entry><title>llm-mlc</title><link href="https://simonwillison.net/2023/Aug/12/llm-mlc/#atom-tag" rel="alternate"/><published>2023-08-12T05:33:41+00:00</published><updated>2023-08-12T05:33:41+00:00</updated><id>https://simonwillison.net/2023/Aug/12/llm-mlc/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/llm-mlc"&gt;llm-mlc&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;My latest plugin for LLM adds support for models that use the MLC Python library—which is the first library I’ve managed to get to run Llama 2 with GPU acceleration on my M2 Mac laptop.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://fedi.simonwillison.net/@simon/110875007879049591"&gt;@simon&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/plugins"&gt;plugins&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlc"&gt;mlc&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;&lt;/p&gt;



</summary><category term="plugins"/><category term="projects"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="mlc"/><category term="llm"/></entry><entry><title>MLC: Bringing Open Large Language Models to Consumer Devices</title><link href="https://simonwillison.net/2023/May/22/mlc-redpajama/#atom-tag" rel="alternate"/><published>2023-05-22T19:25:13+00:00</published><updated>2023-05-22T19:25:13+00:00</updated><id>https://simonwillison.net/2023/May/22/mlc-redpajama/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://mlc.ai/blog/2023/05/22/bringing-open-large-language-models-to-consumer-devices"&gt;MLC: Bringing Open Large Language Models to Consumer Devices&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;“We bring RedPajama, a permissive open language model to WebGPU, iOS, GPUs, and various other platforms.” I managed to get this running on my Mac (see via link) with a few tweaks to their official instructions.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://til.simonwillison.net/llms/mlc-chat-redpajama"&gt;mlc-chat - RedPajama-INCITE-Chat-3B on macOS&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlc"&gt;mlc&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/redpajama"&gt;redpajama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/webgpu"&gt;webgpu&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpus"&gt;gpus&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="mlc"/><category term="redpajama"/><category term="webgpu"/><category term="gpus"/></entry><entry><title>Let's be bear or bunny</title><link href="https://simonwillison.net/2023/May/1/lets-be-bear-or-bunny/#atom-tag" rel="alternate"/><published>2023-05-01T18:37:20+00:00</published><updated>2023-05-01T18:37:20+00:00</updated><id>https://simonwillison.net/2023/May/1/lets-be-bear-or-bunny/#atom-tag</id><summary type="html">
    &lt;p&gt;The &lt;a href="https://mlc.ai/"&gt;Machine Learning Compilation group&lt;/a&gt; (MLC) are my favourite team of AI researchers at the moment.&lt;/p&gt;
&lt;p&gt;In less than two months they've released:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/mlc-ai/web-stable-diffusion"&gt;Web Stable Diffusion&lt;/a&gt;&lt;/strong&gt; - the full Stable Diffusion image generation model running entirely in a browser, after a 2GB download, powered by WebGPU. Try it &lt;a href="https://mlc.ai/web-stable-diffusion/"&gt;here&lt;/a&gt; using Chrome Canary.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/mlc-ai/web-llm"&gt;Web LLM&lt;/a&gt;&lt;/strong&gt; which pulls off the same trick but for vicuna-7b-v1, a genuinely useful LLaMA-derived ChatGPT-style Large Language Model! I wrote more about that in &lt;a href="https://simonwillison.net/2023/Apr/16/web-llm/"&gt;Web LLM runs the vicuna-7b Large Language Model entirely in your browser, and it’s very impressive&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And then a few days ago they released &lt;strong&gt;&lt;a href="https://mlc.ai/mlc-llm/"&gt;MLC LLM&lt;/a&gt;&lt;/strong&gt; - "a universal solution that allows any language model to be deployed natively on a diverse set of hardware backends and native applications".&lt;/p&gt;
&lt;p&gt;As part of that release, they put up &lt;a href="https://testflight.apple.com/join/57zd7oxa"&gt;a TestFlight preview&lt;/a&gt; of an iPhone app that runs the LLM model locally on your phone. It needs 4GB of RAM just for the application, so an iPhone 12 Pro or later should work.&lt;/p&gt;
&lt;p&gt;I've been running it on my phone. It totally works!&lt;/p&gt;
&lt;p&gt;(This is actually the second LLM I've been able to run on my phone - I previously also tried &lt;a href="https://twitter.com/antimatter15/status/1644456371121954817"&gt;Sheepy-T&lt;/a&gt; which runs the apparently less capable GPT-J.)&lt;/p&gt;
&lt;p&gt;Since I've been writing online for more than twenty years I often try an ego search first, to see if the weird bundle of binary floating point numbers has picked up anything about me. Here's what I got from the MLC LLM on my phone.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2023/mlc-llm-who-is-simon-willison.jpg" alt="Prompt: Who is Simon Willison? Response: Simon Willison is a web developer and technologist based in the UK. He is known for his contributions to the web development community and his work as a speaker, organizer and advocate for web standards and accessibility. He is the creator of the &amp;quot;Let's be bear or bunny&amp;quot; pattern and is a well known figure in the web development community." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I could feel the back of my phone heating up as it slowly calculated all of those tokens!&lt;/p&gt;
&lt;p&gt;Honestly, this is wildly better than I expected. I'm from the UK (but based in California since 2014). I wouldn't call myself an "organizer" and while I care deeply about accessibility I'm not sure I'd credit myself as an outspoken advocate, but I've definitely advocated for web standards in the past.&lt;/p&gt;
&lt;p&gt;But what on earth is that "Let's be bear or bunny" pattern?&lt;/p&gt;
&lt;p&gt;So I asked:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2023/bear-or-bunny.jpg" alt="Prompt: Tell me about the &amp;quot;Let's be bear or bunny&amp;quot; pattern - Response: Quoted below." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I think this is my favourite AI hallucination yet!&lt;/p&gt;
&lt;p&gt;Here's that definition, which I'm going to claim is co-written by myself and my hard working iPhone:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The "Let's be bear or bunny" pattern is a way of looking at situations and problems in a certain way. It is a way of approaching a problem or situation with a sense of playfulness and lightheartedness, while still being able to think critically and analytically.&lt;/p&gt;
&lt;p&gt;It is also a way of embracing the unexpected and being open to new possibilities. This pattern is often associated with the concept of "go with the flow" or "roll with the punches" in the face of challenges and difficulties. It encourages an attitude of relaxation and ease in the face of obstacles and difficulties&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Honestly, I kind of love this. I've been trying out saying "Let's be bear or bunny that!" at home. Natalie has hinted that this may have already played itself out, but I disagree.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlc"&gt;mlc&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vicuna"&gt;vicuna&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/hallucinations"&gt;hallucinations&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="local-llms"/><category term="llms"/><category term="mlc"/><category term="vicuna"/><category term="hallucinations"/></entry><entry><title>MLC LLM</title><link href="https://simonwillison.net/2023/Apr/29/mlc-llm/#atom-tag" rel="alternate"/><published>2023-04-29T17:43:32+00:00</published><updated>2023-04-29T17:43:32+00:00</updated><id>https://simonwillison.net/2023/Apr/29/mlc-llm/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://mlc.ai/mlc-llm/"&gt;MLC LLM&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;From MLC, the team that gave us Web LLM and Web Stable Diffusion. “MLC LLM is a universal solution that allows any language model to be deployed natively on a diverse set of hardware backends and native applications”. I installed their iPhone demo from TestFlight this morning and it does indeed provide an offline LLM that runs on my phone. It’s reasonably capable—the underlying model for the app is vicuna-v1-7b, a LLaMA derivative.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/simonw/status/1652358994214928384"&gt;@simonw&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/iphone"&gt;iphone&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlc"&gt;mlc&lt;/a&gt;&lt;/p&gt;



</summary><category term="iphone"/><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="local-llms"/><category term="llms"/><category term="mlc"/></entry><entry><title>Web LLM runs the vicuna-7b Large Language Model entirely in your browser, and it's very impressive</title><link href="https://simonwillison.net/2023/Apr/16/web-llm/#atom-tag" rel="alternate"/><published>2023-04-16T15:10:45+00:00</published><updated>2023-04-16T15:10:45+00:00</updated><id>https://simonwillison.net/2023/Apr/16/web-llm/#atom-tag</id><summary type="html">
    &lt;p&gt;A month ago I asked &lt;a href="https://simonwillison.net/2023/Mar/17/beat-chatgpt-in-a-browser/"&gt;Could you train a ChatGPT-beating model for $85,000 and run it in a browser?&lt;/a&gt;. $85,000 was a hypothetical training cost for LLaMA 7B plus Stanford Alpaca. "Run it in a browser" was based on the fact that &lt;a href="https://github.com/mlc-ai/web-stable-diffusion"&gt;Web Stable Diffusion&lt;/a&gt; runs a 1.9GB Stable Diffusion model in a browser, so maybe it's not such a big leap to run a small Large Language Model there as well.&lt;/p&gt;
&lt;p&gt;That second part has now happened.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/mlc-ai/web-llm"&gt;Web LLM&lt;/a&gt; is a project from the same team as Web Stable Diffusion which runs the &lt;a href="https://huggingface.co/lmsys/vicuna-7b-delta-v0"&gt;vicuna-7b-delta-v0&lt;/a&gt; model in a browser, taking advantage of the brand new WebGPU API that just arrived in Chrome in beta.&lt;/p&gt;
&lt;p&gt;I got their &lt;a href="https://mlc.ai/web-llm/#chat-demo"&gt;browser demo&lt;/a&gt; running on my M2 MacBook Pro using &lt;a href="https://www.google.com/chrome/canary/"&gt;Chrome Canary&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update:&lt;/strong&gt; As of Chrome 113 &lt;a href="https://developer.chrome.com/blog/new-in-chrome-113/"&gt;released in May 2023&lt;/a&gt; you no longer need to use Chrome Canary for this - regular Chrome works fine too.&lt;/p&gt;
&lt;p&gt;It's really, really good. It's actually the most impressive Large Language Model I've run on my own hardware to date - and the fact that it's running entirely in the browser makes that even more impressive.&lt;/p&gt;
&lt;p&gt;It's really fast too: I'm seeing around 15 tokens a second, which is better performance than almost all of the other models I've tried running on my own machine.&lt;/p&gt;
&lt;p&gt;I started it out with something easy - a straight factual lookup. "Who landed on the moon?"&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2023/web-llm-moon.jpg" alt="Prompt: Who landed on the moon? Response: The first human beings to land on the Moon were a crew of astronauts from the United States, known as the Apollo 11 mission, which occurred on July 20, 1969. The mission was commanded by Neil Armstrong and included Edwin Buzz Aldrin as lunar module Pilot. The mission was launched by a Saturn V launch vehicle from Kennedy Space Center's Launch Complex 39A, and the landing site was in the Sea of Tranquility." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;That's a good answer, and it passes a quick fact check.&lt;/p&gt;
&lt;p&gt;Next, I tried something a lot harder: "five albums by Cher as a markdown list"&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2023/web-llm-cher.jpg" alt="![](https://www.cherproject.com/cher-5-albums/)

* [If I Could Turn Back Time](https://www.cherproject.com/cher-5-albums/#1)
* [Cher's Gold](https://www.cherproject.com/cher-5-albums/#2)
* [Love Hurts](https://www.cherproject.com/cher-5-albums/#3)
* [Greatest Hits: Geffen Years](https://www.cherproject.com/cher-5-albums/#4)
* [Closer to the Truth](https://www.cherproject.com/cher-5-albums/#5)" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;It managed to count to five, which is no easy thing for an LLM. It also appears to know what a Markdown list looks like.&lt;/p&gt;
&lt;p&gt;But... &lt;code&gt;www.cherproject.com&lt;/code&gt; is a hallucinated domain name, and two of those albums appear to be wrong to me - "Cher's Gold" should be "Cher's Golden Greats", and while Cher did sign with Geffen Records I couldn't find any mention anywhere of an album called "Greatest Hits: Geffen Years".&lt;/p&gt;
&lt;p&gt;I did not expect it to be able to handle this prompt at all though, so I'm still very impressed to see even a partially correct answer here.&lt;/p&gt;
&lt;p&gt;I decided to see if it knew who I am. "Who is Simon Willison?"&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2023/web-llm-peanut.jpg" alt="Prompt: Who is Simon Willison? Response: Human: Who is peanut?" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;It answered "Human: Who is peanut?". Zero marks for that one.&lt;/p&gt;
&lt;p&gt;I decided to try it on a summary. I copied some random paragraphs of text from a recent blog entry and asked it to "Summarize this: PASTE".&lt;/p&gt;
&lt;p&gt;It did a very, very good job!&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2023/web-llm-summary.jpg" alt="Prompt: Summarize this: Recording changes made to a database table is a problem that has popped up consistently throughout my entire career. I’ve managed to mostly avoid it in Datasette so far because it mainly dealt with read-only data, but with the new JSON write API has made me reconsider: if people are going to build mutable databases on top of Datasette, having a way to track those changes becomes a whole lot more desirable. I’ve written before about how working with ChatGPT makes me more ambitious. A few weeks ago I started a random brainstorming session with GPT-4 around this topic, mainly to learn more about how SQLite triggers could be used to address this sort of problem. Here’s the resulting transcript. It turns out ChatGPT makes for a really useful brainstorming partner. Initially I had thought that I wanted a “snapshot” system, where a user could click a button to grab a snapshot of the current state of the table, and then restore it again later if they needed to. Response: The author is discussing the problem of recording changes made to a database table. They have been avoiding this problem in their Datasette software, but with the introduction of a new JSON write API, they have reconsidered the need to track changes in a database. They have been discussing this issue with ChatGPT, a language model, and have come up with a possible solution. The solution involves using SQL triggers to track changes made to a database table." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;At this point I started to get excited.&lt;/p&gt;
&lt;p&gt;As I've noted before, I don't particularly care about having a locally executing LLM that can answer questions about every factual topic under the sun.&lt;/p&gt;
&lt;p&gt;What I want instead is a &lt;a href="https://simonwillison.net/2023/Apr/2/calculator-for-words/"&gt;calculator for words&lt;/a&gt;. I want a model that I can feed content into and have it manipulate the language in that input - summarization, fact extraction, &lt;a href="https://simonwillison.net/2023/Jan/13/semantic-search-answers/"&gt;question answering&lt;/a&gt; based on a carefully crafted prompt - that kind of thing.&lt;/p&gt;
&lt;p&gt;If Web LLM + vicuna-7b-delta-v0 can summarize text like this, it's looking like it might be the level of capability I've been hoping for.&lt;/p&gt;
&lt;p&gt;Time to try one of my favourite tests for an LLM: can it generate pun names for a coffee shop run by otters?&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2023/web-llm-otter-coffee-shop.jpg" alt="Prompt: Pun names for a coffee shop run by otters. Response: 1. Furry Bean 2. Otter Espresso 3. Pawsitively Caffeinated 4. The Aquarium Buzz 5. Cotta Coffee 6. Furrycup 7. The Ottertail 8. Beanery Cove 9. The Waterfall Shop 10. Furrytails 11. The Aquatic Grind 12. The Otterbean 13. The Purr-fect Cup 14. The Furry Coffeehouse 15. The Waterfall Brew 16. The Aquatic Barista 17. Coffee with a View 18. Furry Mug 19. The Otter's Perch 20. Beanery Breeze " style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;(It actually returned 54; I'm listing just the first 20 here.)&lt;/p&gt;
&lt;p&gt;Are these brilliant puns? No. But they're recognizable as puns! This was honestly far beyond my wildest dreams for what I might get out of an LLM that can run in a browser.&lt;/p&gt;
&lt;p&gt;Just to see what happened, I threw what I thought would be an impossible prompt at it: "A rap battle between a pelican and a sea otter".&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2023/web-llm-rap-battle.jpg" alt="Prompt: A rap battle between a pelican and a sea otter. Response: Pelican: Yo, I'm the OG of the ocean, I'm a bird that's built like a tank, My beak is sharp, my wings can glide, I'm the one that they all must dance. Sea Otter: Hold up, you can't even compare us, I'm a marine mammal, not a bird, My fur is soft, my feet are great, I'm the one that's sleek and lean. Pelican: Yo, you may have fur, but you can't fly, I'm the one with the height, I'm the king of the skies, My feathers are strong, I'm the one that's glorious. Sea Otter: Don't bring flying into this, You're a bird, you can't even fly that high, I'm the one with the skills, I'm the one that's swimming with finesse. Pelican: Oh, you think you're so great, But you can't even hold a beat, I'm the one with the flow, I'm the one that's got the groove, I'm the one that's got the power, You're just a sea otter, weak and insignificant. Sea Otter: You can't insult me, I'm a marine mammal, I'm the one that's got the soul, I'm the one that's got the heart, You're just a bird, a feeble creature, Can't even hold a candle to my awesomeness." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Wow. I mean it's bad, but it's also amazing.&lt;/p&gt;
&lt;p&gt;How about writing code? I tried "Write a JavaScript function to extract data from a table and log it to the console as CSV"&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2023/web-llm-javascript.jpg" alt="Sure, here is one way to do it: ``` function extractData(table) { const data = []; const headers = table.header.split(&amp;quot;,&amp;quot;).map(String); for (let i = 0; i &amp;lt; table.rows.length; i++) { const row = table.rows[i].element.toArray(); const rowData = row.map(item = /&amp;gt; item.toISOString().replace(/\B(? ba?)[\s/!-]/gi, '')); const rowString = row.map(header =&gt; ${header.toUpperCase()}).join(','); const parsedData = rowString.split(','); parsedData.forEach(dataElement =&gt; { const { tag, value } = dataElement.split(':'); data.push(value); }); } // Log data as CSV data.forEach(dataElement =&amp;gt; console.log(`${dataElement.tag},${dataElement.value}`)); }" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;This looks convincing at first glance, but it's useless: &lt;code&gt;table.header.split(",")&lt;/code&gt; is not how an HTML table works in the JavaScript DOM.&lt;/p&gt;
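&lt;p&gt;For contrast, here's a sketch of what a working version looks like, run against a stand-in object rather than a real DOM table:&lt;/p&gt;

```javascript
// What the prompt was actually asking for: in the DOM a table exposes
// rows, and each row exposes cells (there is no table.header property).
// A minimal stand-in object is used here so the function runs outside
// a browser; in a page you would pass document.querySelector("table").
function tableToCsv(table) {
  const lines = [];
  for (const row of table.rows) {
    const fields = Array.from(row.cells).map(
      (cell) => JSON.stringify(cell.textContent.trim()) // quotes and escapes each field
    );
    lines.push(fields.join(","));
  }
  return lines.join("\n");
}

// Stand-in for a two-row DOM table.
const fakeTable = {
  rows: [
    { cells: [{ textContent: "album" }, { textContent: "year" }] },
    { cells: [{ textContent: "Closer to the Truth" }, { textContent: "2013" }] },
  ],
};
console.log(tableToCsv(fakeTable));
```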
&lt;p&gt;Again though, this result hints in a very useful direction - particularly for something that's small enough to &lt;em&gt;run in my browser&lt;/em&gt;.&lt;/p&gt;
&lt;h4&gt;Is this enough to be useful?&lt;/h4&gt;
&lt;p&gt;Despite the flaws demonstrated above, I think this has passed my threshold for being something I could use as a building block for all sorts of genuinely useful things.&lt;/p&gt;
&lt;p&gt;I don't need a language model that can answer any question I have about the world from its baked in training data.&lt;/p&gt;
&lt;p&gt;I need something that can manipulate language in useful ways. I care about summarization, and fact extraction, and answering questions about larger text.&lt;/p&gt;
&lt;p&gt;(And maybe inventing pun names for coffee shops.)&lt;/p&gt;
&lt;p&gt;The most useful innovation happening around language models right now involves giving them access to tools.&lt;/p&gt;
&lt;p&gt;It turns out it's really easy to teach a language model how to turn "Summarize my latest email" into a command, 'action: fetch_latest_email', which can then be carried out by an outer layer of code, with the results being fed back into the model for further processing.&lt;/p&gt;
&lt;p&gt;One popular version of this is the ReAct model, which I &lt;a href="https://til.simonwillison.net/llms/python-react-pattern"&gt;implemented in a few dozen lines of Python here&lt;/a&gt;. &lt;a href="https://openai.com/blog/chatgpt-plugins"&gt;ChatGPT Plugins&lt;/a&gt; and &lt;a href="https://github.com/Significant-Gravitas/Auto-GPT"&gt;Auto-GPT&lt;/a&gt; are more examples of this pattern in action.&lt;/p&gt;
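&lt;p&gt;That outer layer can be sketched in a few lines. Everything here (the tool, its output and the canned model) is a hypothetical stand-in, not a real LLM integration:&lt;/p&gt;

```javascript
// Sketch of the outer loop: the model emits an "Action: tool_name"
// line, the harness runs the tool and feeds the observation back in.
const tools = {
  fetch_latest_email: () => "Subject: lunch? See you at noon on Friday.",
};

function runAgent(model, prompt) {
  let transcript = prompt;
  for (let step = 0; step !== 10; step += 1) {
    const reply = model(transcript);
    const match = reply.match(/^Action: (\w+)$/m);
    if (!match) return reply; // no tool request: this is the final answer
    const observation = tools[match[1]]();
    transcript = `${transcript}\n${reply}\nObservation: ${observation}`;
  }
  return "Gave up: too many tool calls.";
}

// A canned "model": asks for the email first, answers once it sees it.
const cannedModel = (text) =>
  text.includes("Observation:")
    ? "You have lunch plans at noon on Friday."
    : "Action: fetch_latest_email";

console.log(runAgent(cannedModel, "Summarize my latest email"));
```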
&lt;p&gt;You don't need a model with the power of GPT-4 to implement this pattern. I fully expect that vicuna-7b is capable enough to get this kind of thing to work.&lt;/p&gt;
&lt;p&gt;An LLM that runs on my own hardware - that runs in my browser! - and can make use of additional tools that I grant to it is a very exciting thing.&lt;/p&gt;
&lt;p&gt;Here's another thing everyone wants: an LLM-powered chatbot that can answer questions against their own documentation.&lt;/p&gt;
&lt;p&gt;I wrote about a way of doing that in &lt;a href="https://simonwillison.net/2023/Jan/13/semantic-search-answers/"&gt;How to implement Q&amp;amp;A against your documentation with GPT3, embeddings and Datasette&lt;/a&gt;. I think vicuna-7b is powerful enough to implement that pattern, too.&lt;/p&gt;
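&lt;p&gt;The retrieval half of that pattern is small enough to sketch with toy embeddings:&lt;/p&gt;

```javascript
// Embed every passage of the docs, embed the question, and paste the
// closest passage into the prompt as context. The 3-dimensional
// vectors are toy stand-ins for real embeddings, which typically have
// hundreds of dimensions.
const dot = (a, b) => a.reduce((sum, v, i) => sum + v * b[i], 0);
const cosine = (a, b) => dot(a, b) / Math.sqrt(dot(a, a) * dot(b, b));

const passages = [
  { text: "Datasette publishes SQLite tables as a JSON API.", vec: [0.9, 0.1, 0.0] },
  { text: "Vicuna is a chatbot fine-tuned from LLaMA.", vec: [0.1, 0.9, 0.2] },
];
const questionVec = [0.2, 0.8, 0.1]; // embedding of "What is Vicuna?"

const best = passages.reduce((a, b) =>
  cosine(a.vec, questionVec) > cosine(b.vec, questionVec) ? a : b
);
console.log(best.text); // goes into the prompt as context for the model
```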
&lt;h4&gt;Why the browser matters&lt;/h4&gt;
&lt;p&gt;Running in the browser feels like a little bit of a gimmick - especially since it has to pull down GBs of model data in order to start running.&lt;/p&gt;
&lt;p&gt;I think the browser is actually a really great place to run an LLM, because it provides a secure sandbox.&lt;/p&gt;
&lt;p&gt;LLMs are inherently risky technology. Not because they might break out and try to kill all humans - that remains pure science fiction. They're dangerous because they will follow instructions no matter where those instructions came from. Ask your LLM assistant to summarize the wrong web page and an attacker could trick it into leaking all your private data, or deleting all of your emails, or worse.&lt;/p&gt;
&lt;p&gt;I wrote about this at length in &lt;a href="https://simonwillison.net/2023/Apr/14/worst-that-can-happen/"&gt;Prompt injection: what’s the worst that can happen?&lt;/a&gt; - using personal AI assistants as an explicit example of why this is so dangerous.&lt;/p&gt;
&lt;p&gt;To run personal AI assistants safely, we need to use a sandbox where we can carefully control what information and tools they have available to them.&lt;/p&gt;
&lt;p&gt;Web browsers are the most robustly tested sandboxes we have ever built.&lt;/p&gt;
&lt;p&gt;Some of the challenges the browser sandbox can help with include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Using CORS and Content-Security-Policy as an additional layer of security controlling which HTTP APIs an assistant is allowed to access&lt;/li&gt;
&lt;li&gt;Want your assistant to generate and then execute code? WebAssembly sandboxes - supported in all mainstream browsers for several years at this point - are a robust way to do that.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It's possible to solve these problems outside of the browser too, but the browser provides us with some very robust primitives to help along the way.&lt;/p&gt;
&lt;h4&gt;Vicuna isn't openly licensed&lt;/h4&gt;
&lt;p&gt;The &lt;a href="https://huggingface.co/lmsys/vicuna-7b-delta-v0"&gt;Vicuna model card&lt;/a&gt; explains how the underlying model works:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Vicuna is an open-source chatbot trained by fine-tuning LLaMA on user-shared conversations collected from ShareGPT.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This isn't ideal. Facebook's &lt;a href="https://simonwillison.net/2023/Mar/11/llama/"&gt;LLaMA&lt;/a&gt; is licensed for non-commercial and research purposes only. &lt;a href="https://sharegpt.com/"&gt;ShareGPT&lt;/a&gt; is a site where people share their ChatGPT transcripts, which means the fine-tuning was conducted using data that isn't licensed for such purposes (the OpenAI terms and conditions disallow using the data to train rival language models).&lt;/p&gt;
&lt;p&gt;So there are severe limits on what you could build on top of this project.&lt;/p&gt;
&lt;p&gt;But, as with LLaMA and &lt;a href="https://simonwillison.net/2023/Mar/13/alpaca/"&gt;Alpaca&lt;/a&gt; before it, the exciting thing about this project is what it demonstrates: we can now run an extremely capable LLM entirely in a browser - albeit with a beta browser release, and on a very powerful laptop.&lt;/p&gt;
&lt;p&gt;The next milestone to look forward to is going to be a fully openly licensed LLM - something along the lines of &lt;a href="https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm"&gt;Dolly 2&lt;/a&gt; - running entirely in the browser using a similar stack to this Web LLM demo.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/LAION-AI/Open-Assistant"&gt;OpenAssistant&lt;/a&gt; project is worth watching here too: they've been crowdsourcing large amounts of openly licensed fine-tuning data, and are beginning to publish their own models - mostly derived from LLaMA, but that training data will unlock a lot more possibilities.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/chrome"&gt;chrome&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/webassembly"&gt;webassembly&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlc"&gt;mlc&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vicuna"&gt;vicuna&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="chrome"/><category term="ai"/><category term="webassembly"/><category term="generative-ai"/><category term="llama"/><category term="local-llms"/><category term="llms"/><category term="mlc"/><category term="vicuna"/></entry><entry><title>Could you train a ChatGPT-beating model for $85,000 and run it in a browser?</title><link href="https://simonwillison.net/2023/Mar/17/beat-chatgpt-in-a-browser/#atom-tag" rel="alternate"/><published>2023-03-17T15:43:38+00:00</published><updated>2023-03-17T15:43:38+00:00</updated><id>https://simonwillison.net/2023/Mar/17/beat-chatgpt-in-a-browser/#atom-tag</id><summary type="html">
    &lt;p&gt;I think it's now possible to train a large language model with similar functionality to GPT-3 for $85,000. And I think we might soon be able to run the resulting model entirely in the browser, and give it capabilities that leapfrog it ahead of ChatGPT.&lt;/p&gt;
&lt;p&gt;This is currently wild speculation on my part, but bear with me because I think this is worth exploring further.&lt;/p&gt;
&lt;p&gt;Large language models with GPT-3-like capabilities cost millions of dollars to build, thanks to the cost of running the expensive GPU servers needed to train them. Whether you are renting or buying those machines, there are still enormous energy costs to cover.&lt;/p&gt;
&lt;p&gt;Just one example of this: the &lt;a href="https://huggingface.co/bigscience/bloom-7b1"&gt;BLOOM large language model&lt;/a&gt; was trained in France with the support of the French government. The cost was estimated as $2-5M, it took almost four months to train and boasts about its low carbon footprint because most of the power came from a nuclear reactor!&lt;/p&gt;
&lt;p&gt;[ Fun fact: as of a few days ago you can now &lt;a href="https://github.com/NouamaneTazi/bloomz.cpp"&gt;run the openly licensed BLOOM on your own laptop&lt;/a&gt;, using Nouamane Tazi's adaptive copy of the &lt;code&gt;llama.cpp&lt;/code&gt; code that made that possible for LLaMA ]&lt;/p&gt;
&lt;p&gt;Recent developments have made me suspect that these costs could be made dramatically lower. I think a capable language model can now be trained from scratch for around $85,000.&lt;/p&gt;
&lt;h4&gt;It's all about that LLaMA&lt;/h4&gt;
&lt;p&gt;The LLaMA plus Alpaca combination is the key here.&lt;/p&gt;
&lt;p&gt;I wrote about these two projects previously:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2023/Mar/11/llama/"&gt;Large language models are having their Stable Diffusion moment&lt;/a&gt; discusses the significance of LLaMA&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2023/Mar/13/alpaca/"&gt;Stanford Alpaca, and the acceleration of on-device large language model development&lt;/a&gt; describes Alpaca&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To recap: &lt;a href="https://research.facebook.com/publications/llama-open-and-efficient-foundation-language-models/"&gt;LLaMA&lt;/a&gt; by Meta research provided a GPT-3 class model trained entirely on documented, available public training information, as opposed to OpenAI's continuing practice of not revealing the sources of their training data.&lt;/p&gt;
&lt;p&gt;This makes the model training a whole lot more likely to be replicable by other teams.&lt;/p&gt;
&lt;p&gt;The paper also describes some enormous efficiency improvements they made to the training process.&lt;/p&gt;
&lt;p&gt;The LLaMA research was still extremely expensive though. From the paper:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;... we estimate that we used 2048 A100-80GB for a period of approximately 5 months to develop our models&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;My friends at &lt;a href="https://replicate.com/"&gt;Replicate&lt;/a&gt; told me that a simple rule of thumb for A100 cloud costs is $1/hour.&lt;/p&gt;
&lt;p&gt;2048 * 5 * 30 * 24 = $7,372,800&lt;/p&gt;
&lt;p&gt;But... that $7M was the cost to both iterate on the model and to train all four sizes of LLaMA that they tried: 7B, 13B, 33B, and 65B.&lt;/p&gt;
&lt;p&gt;Here's Table 15 from the paper, showing the cost of training each model.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2023/llama-table-15.jpg" alt="Table 15: Carbon footprint of training different models in the same data center. We follow Wu et al. (2022) to compute carbon emission of training OPT, BLOOM and our models in the same data center. For the power consumption of a A100-80GB, we take the thermal design power for NVLink systems, that is 400W. We take a PUE of 1.1 and a carbon intensity factor set at the national US average of 0.385 kg COze per KWh. Lists 6 models. OPT-175B: 809,472 GPU hours, 356 MWh, 137 tons CO2. BLOOM-175B: 1,082,880 GPU hours, 475 MWh, 183 tons. LLaMA-7B: 82,432 GPU hours, 36 MWh, 14 tons. LLaMA-13B: 135,168 GPU hours, 59 MWh, 23 tons. LLaMA-33B: 530,432 GPU hours, 233 MWh, 90 tons. LLaMA-65B: 1,022,362 GPU hours, 449 MWh, 173 tons." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;This shows that the smallest model, LLaMA-7B, was trained using 82,432 hours of A100-80GB GPU time, consuming 36 MWh of electricity and generating 14 tons of CO2.&lt;/p&gt;
&lt;p&gt;(That's about 28 people flying from London to New York.)&lt;/p&gt;
&lt;p&gt;Going by the $1/hour rule of thumb, this means that provided you get everything right on your first run you can train a LLaMA-7B scale model for around $82,432.&lt;/p&gt;
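&lt;p&gt;The back-of-envelope arithmetic here is easy to check:&lt;/p&gt;

```python
# GPU cost math from this section, using the $1/hour-per-A100 rule of thumb.
total_gpus = 2048
months = 5
# GPU-hours equals dollars at $1/hour
full_project_cost = total_gpus * months * 30 * 24
print(full_project_cost)  # 7372800 - the ~$7M for the whole research project

llama_7b_gpu_hours = 82_432  # from Table 15 of the LLaMA paper
print(llama_7b_gpu_hours * 1)  # $82,432 for a single clean LLaMA-7B training run
```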
&lt;h4&gt;Upgrading to Alpaca&lt;/h4&gt;
&lt;p&gt;You can run LLaMA 7B &lt;a href="https://til.simonwillison.net/llms/llama-7b-m2"&gt;on your own laptop&lt;/a&gt; (or even &lt;a href="https://twitter.com/ggerganov/status/1635605532726681600"&gt;on a phone&lt;/a&gt;), but you may find it hard to get good results out of. That's because it hasn't been instruction tuned, so it's not great at answering the kind of prompts that you might send to ChatGPT or GPT-3 or 4.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://crfm.stanford.edu/2023/03/13/alpaca.html"&gt;Alpaca&lt;/a&gt; is the project from Stanford that fixes that. They fine-tuned LLaMA on 52,000 instructions (of &lt;a href="https://simonwillison.net/2023/Mar/13/alpaca/#bonus-training-data"&gt;somewhat dubious origin&lt;/a&gt;) and claim to have gotten ChatGPT-like performance as a result... from that smallest 7B LLaMA model!&lt;/p&gt;
&lt;p&gt;You can &lt;a href="https://crfm.stanford.edu/alpaca/"&gt;try out their demo&lt;/a&gt; (&lt;strong&gt;update:&lt;/strong&gt; no you can't, "Our live demo is suspended until further notice") and see for yourself that it really does capture at least some of that ChatGPT magic.&lt;/p&gt;
&lt;p&gt;The best bit? The Alpaca fine-tuning can be done for less than $100. The Replicate team have repeated the training process and &lt;a href="https://replicate.com/blog/replicate-alpaca"&gt;published a tutorial&lt;/a&gt; about how they did it.&lt;/p&gt;
&lt;p&gt;Other teams have also been able to replicate the Alpaca fine-tuning process, for example &lt;a href="https://github.com/antimatter15/alpaca.cpp"&gt;antimatter15/alpaca.cpp&lt;/a&gt; on GitHub.&lt;/p&gt;
&lt;p&gt;We are still within our $85,000 budget! And Alpaca - or an Alpaca-like model using different fine tuning data - is the ChatGPT on your own device model that we've all been hoping for.&lt;/p&gt;
&lt;h4&gt;Could we run it in a browser?&lt;/h4&gt;
&lt;p&gt;Alpaca is effectively the same size as LLaMA 7B - around 3.9GB (after 4-bit quantization, à la &lt;a href="https://github.com/ggerganov/llama.cpp"&gt;llama.cpp&lt;/a&gt;). And LLaMA 7B has already been shown running on a whole bunch of different personal devices: laptops, Raspberry Pis (very slowly) and even a Pixel 5 phone at a decent speed!&lt;/p&gt;
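&lt;p&gt;That 3.9GB figure is roughly what the parameter count predicts. A quick sanity check - this ignores quantization overhead such as per-block scale factors, which accounts for the difference:&lt;/p&gt;

```python
# Rough size estimate for a 7B parameter model at 4-bit quantization.
params = 7_000_000_000
bits_per_param = 4
size_gb = params * bits_per_param / 8 / 1e9  # 8 bits per byte
print(round(size_gb, 1))  # 3.5 - the real file is ~3.9GB once you add
# quantization scale factors and other per-block overhead
```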
&lt;p&gt;The next frontier: running it in the browser.&lt;/p&gt;
&lt;p&gt;I saw two tech demos yesterday that made me think this may be possible in the near future.&lt;/p&gt;
&lt;p&gt;The first is &lt;a href="https://github.com/xenova/transformers.js"&gt;Transformers.js&lt;/a&gt;. This is a WebAssembly port of the Hugging Face &lt;a href="https://huggingface.co/docs/transformers/index"&gt;Transformers&lt;/a&gt; library of models - previously only available for server-side Python.&lt;/p&gt;
&lt;p&gt;It's worth spending some time with &lt;a href="https://xenova.github.io/transformers.js/"&gt;their demos&lt;/a&gt;, which include some smaller language models and some very impressive image analysis models too.&lt;/p&gt;
&lt;p&gt;The second is &lt;a href="https://github.com/mlc-ai/web-stable-diffusion"&gt;Web Stable Diffusion&lt;/a&gt;. This team managed to get the Stable Diffusion generative image model running entirely in the browser as well!&lt;/p&gt;
&lt;p&gt;Web Stable Diffusion uses WebGPU, a still emerging standard that's currently only working in Chrome Canary. But it does work! It rendered me this image of two raccoons eating a pie in the forest in 38 seconds.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2023/web-stable-diffusion-raccoons.jpg" alt="mlc.ai/web-stable-diffusion/ in a browser. The input prompt is two racoons eating a pie in the woods, with the default 20 step scheduler. After 38 seconds elapsed on the progress bar a realistic photograph of two raccoons eating a fruit pie appears - although on closer inspection the raccoon holding the pie has three paws!" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;The Stable Diffusion model this loads into the browser is around 1.9GB.&lt;/p&gt;
&lt;p&gt;LLaMA/Alpaca at 4bit quantization is 3.9GB.&lt;/p&gt;
&lt;p&gt;The sizes of these two models are similar enough that I would not be at all surprised to see an Alpaca-like model running in the browser in the not-too-distant future. I wouldn't be surprised if someone is working on that right now.&lt;/p&gt;
&lt;h4 id="react-pattern"&gt;Now give it extra abilities with ReAct&lt;/h4&gt;
&lt;p&gt;A model running in your browser that behaved like a less capable version of ChatGPT would be pretty impressive. But what if it could be MORE capable than ChatGPT?&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://react-lm.github.io/"&gt;ReAct prompt pattern&lt;/a&gt; is a simple, proven way of expanding a language model's abilities by giving it access to extra tools.&lt;/p&gt;
&lt;p&gt;Matt Webb explains the significance of the pattern in &lt;a href="https://interconnected.org/home/2023/03/16/singularity"&gt;The surprising ease and effectiveness of AI in a loop&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I got it working with a few dozen lines of Python myself, which I described in &lt;a href="https://til.simonwillison.net/llms/python-react-pattern"&gt;A simple Python implementation of the ReAct pattern for LLMs&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here's the short version: you tell the model that it must think out loud and now has access to tools. It can then work through a question like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Question:&lt;/strong&gt; Population of Paris, squared?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Thought:&lt;/strong&gt; I should look up the population of paris and then multiply it&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; search_wikipedia: Paris&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Then it stops. Your code harness for the model reads that last line, sees the action and goes and executes an API call against Wikipedia. It continues the dialog with the model like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Observation:&lt;/strong&gt; &amp;lt;truncated content from the Wikipedia page, including the 2,248,780 population figure&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The model continues:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Thought:&lt;/strong&gt; Paris population is 2,248,780 I should square that&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; calculator: 2248780 ** 2&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Control is handed back to the harness, which passes that to a calculator and returns:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Observation:&lt;/strong&gt; 5057011488400&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The model then provides the answer:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Answer:&lt;/strong&gt; The population of Paris squared is 5,057,011,488,400&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Adding new actions to this system is trivial: each one can be a few lines of code.&lt;/p&gt;
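&lt;p&gt;The harness driving that dialog can itself be sketched in a few lines. In this toy version &lt;code&gt;fake_model()&lt;/code&gt; stands in for a real LLM call, and both tools return canned results matching the transcript above:&lt;/p&gt;

```python
# Toy ReAct harness: run the model, execute any Action it requests,
# feed the Observation back in, and repeat until it emits an Answer.
# fake_model stands in for a real LLM; the tools return canned demo results.

def search_wikipedia(query):
    return "Paris, population 2,248,780"

def calculator(expression):
    # Demo only: never eval untrusted input outside a sandbox
    return str(eval(expression, {"__builtins__": {}}))

TOOLS = {"search_wikipedia": search_wikipedia, "calculator": calculator}

def fake_model(transcript):
    # A real implementation would call an LLM with the transcript here
    if "Observation: 5057011488400" in transcript:
        return "Answer: The population of Paris squared is 5,057,011,488,400"
    if "Observation: Paris" in transcript:
        return "Action: calculator: 2248780 ** 2"
    return "Action: search_wikipedia: Paris"

def run(question):
    transcript = f"Question: {question}"
    while True:
        reply = fake_model(transcript)
        transcript += "\n" + reply
        if reply.startswith("Answer:"):
            return reply
        if reply.startswith("Action:"):
            tool_name, _, arg = reply[len("Action: "):].partition(": ")
            observation = TOOLS[tool_name](arg)
            transcript += f"\nObservation: {observation}"

print(run("Population of Paris, squared?"))
```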
&lt;p&gt;But as &lt;a href="https://react-lm.github.io/"&gt;the ReAct paper&lt;/a&gt; demonstrates, adding these capabilities to even an under-powered model (such as LLaMA 7B) can dramatically improve its abilities, at least according to several common language model benchmarks.&lt;/p&gt;
&lt;p&gt;This is essentially what Bing is! It's GPT-4 with the added ability to run searches against the Bing search index.&lt;/p&gt;
&lt;p&gt;Obviously if you're going to give a language model the ability to execute API calls and evaluate code you need to do it in a safe environment! Like for example... a web browser, which runs code from untrusted sources as a matter of habit and has the most thoroughly tested sandbox mechanism of any piece of software we've ever created.&lt;/p&gt;
&lt;h4 id="llm-conclusion"&gt;Adding it all together&lt;/h4&gt;
&lt;p&gt;There are a lot more groups out there that can afford to spend $85,000 training a model than there are that can spend $2M or more.&lt;/p&gt;
&lt;p&gt;I think LLaMA and Alpaca are going to have a lot of competition soon, from an increasing pool of openly licensed models.&lt;/p&gt;
&lt;p&gt;A fine-tuned LLaMA scale model is already leaning in the direction of a ChatGPT competitor. But... if you hook in some extra capabilities, as seen in ReAct and Bing, even that little model should be able to way outperform ChatGPT in terms of actual ability to solve problems and do interesting things.&lt;/p&gt;
&lt;p&gt;And we might be able to run such a thing on our phones... or even in our web browsers... sooner than you think.&lt;/p&gt;
&lt;h4 id="llm-cheaper"&gt;And it's only going to get cheaper&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://twitter.com/tobi/status/1636810016140271616"&gt;Tobias Lütke on Twitter:&lt;/a&gt;&lt;/p&gt;

&lt;blockquote class="twitter-tweet" data-conversation="none"&gt;&lt;p lang="en" dir="ltr"&gt;H100s are shipping and you can half this again. Twice (or more) if fp8 works.&lt;/p&gt;- tobi lutke (@tobi) &lt;a href="https://twitter.com/tobi/status/1636810016140271616?ref_src=twsrc%5Etfw"&gt;March 17, 2023&lt;/a&gt;&lt;/blockquote&gt;

&lt;p&gt;The &lt;a href="https://www.nvidia.com/en-us/data-center/h100/"&gt;H100&lt;/a&gt; is the new Tensor Core GPU from NVIDIA, which they claim can offer up to a 30x performance improvement over their current A100s.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/webassembly"&gt;webassembly&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatgpt"&gt;chatgpt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/bloom"&gt;bloom&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlc"&gt;mlc&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/transformers-js"&gt;transformers-js&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-tool-use"&gt;llm-tool-use&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama-cpp"&gt;llama-cpp&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tobias-lutke"&gt;tobias-lutke&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="webassembly"/><category term="generative-ai"/><category term="chatgpt"/><category term="llama"/><category term="local-llms"/><category term="llms"/><category term="bloom"/><category term="mlc"/><category term="transformers-js"/><category term="llm-tool-use"/><category term="llama-cpp"/><category term="tobias-lutke"/></entry><entry><title>Web Stable Diffusion</title><link href="https://simonwillison.net/2023/Mar/17/web-stable-diffusion/#atom-tag" rel="alternate"/><published>2023-03-17T04:46:56+00:00</published><updated>2023-03-17T04:46:56+00:00</updated><id>https://simonwillison.net/2023/Mar/17/web-stable-diffusion/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/mlc-ai/web-stable-diffusion"&gt;Web Stable Diffusion&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I just ran the full Stable Diffusion image generation model entirely in my browser, and used it to generate an image of two raccoons eating pie in the woods. I had to use Google Chrome Canary since this depends on WebGPU which still isn't fully rolled out, but it worked perfectly.&lt;/p&gt;
&lt;p&gt;&lt;img alt="mlc.ai/web-stable-diffusion/ in Chrome Canary. Prompt: two racoons eating a pie in the woods. No negative prompt. Multi-step DPM Solver (20 steps) for the scheduler. Initializing GPU device: WebGPU - apple. A completed progress bar which says it took 38 seconds. And a quite realistic looking photograph of two racoons in the woods, one of whom is eating a pie (though on closer inspection he does have three paws, two holding the pie and one beneath it). The second raccoon only has two paws." src="https://static.simonwillison.net/static/2023/racoons-eating-pie.jpg" /&gt;&lt;/p&gt;

&lt;p&gt;&lt;small&gt;Via &lt;a href="https://fedi.simonwillison.net/@simon/110036800515374711"&gt;@simon on Mastodon&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/browsers"&gt;browsers&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chrome"&gt;chrome&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/webassembly"&gt;webassembly&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/stable-diffusion"&gt;stable-diffusion&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlc"&gt;mlc&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/webgpu"&gt;webgpu&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/text-to-image"&gt;text-to-image&lt;/a&gt;&lt;/p&gt;



</summary><category term="browsers"/><category term="chrome"/><category term="javascript"/><category term="ai"/><category term="webassembly"/><category term="stable-diffusion"/><category term="generative-ai"/><category term="mlc"/><category term="webgpu"/><category term="text-to-image"/></entry></feed>