<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: llamafile</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/llamafile.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2024-08-19T20:08:59+00:00</updated><author><name>Simon Willison</name></author><entry><title>llamafile v0.8.13 (and whisperfile)</title><link href="https://simonwillison.net/2024/Aug/19/whisperfile/#atom-tag" rel="alternate"/><published>2024-08-19T20:08:59+00:00</published><updated>2024-08-19T20:08:59+00:00</updated><id>https://simonwillison.net/2024/Aug/19/whisperfile/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/Mozilla-Ocho/llamafile/releases/tag/0.8.13"&gt;llamafile v0.8.13 (and whisperfile)&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The latest release of &lt;a href="https://github.com/Mozilla-Ocho/llamafile"&gt;llamafile&lt;/a&gt; (&lt;a href="https://simonwillison.net/2023/Nov/29/llamafile/"&gt;previously&lt;/a&gt;) adds support for &lt;a href="https://blog.google/technology/developers/gemma-open-models/"&gt;Gemma 2B&lt;/a&gt; (pre-bundled &lt;a href="https://huggingface.co/jartine/gemma-2-27b-it-llamafile/tree/main"&gt;llamafiles available here&lt;/a&gt;), significant performance improvements, and new support for the Whisper speech-to-text model, based on &lt;a href="https://github.com/ggerganov/whisper.cpp"&gt;whisper.cpp&lt;/a&gt;, Georgi Gerganov's C++ implementation of Whisper that pre-dates his work on &lt;code&gt;llama.cpp&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;I got &lt;code&gt;whisperfile&lt;/code&gt; working locally by first downloading the cross-platform executable attached to &lt;a href="https://github.com/Mozilla-Ocho/llamafile/releases/tag/0.8.13"&gt;the GitHub release&lt;/a&gt; and then grabbing a &lt;code&gt;whisper-tiny.en-q5_1.bin&lt;/code&gt; model from Hugging Face:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;wget -O whisper-tiny.en-q5_1.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-tiny.en-q5_1.bin
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then I ran &lt;code&gt;chmod 755 whisperfile-0.8.13&lt;/code&gt; to make it executable, and ran it against an example &lt;code&gt;.wav&lt;/code&gt; file like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;./whisperfile-0.8.13 -m whisper-tiny.en-q5_1.bin -f raven_poe_64kb.wav --no-prints
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;--no-prints&lt;/code&gt; option suppresses the debug output, so you just get text that looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[00:00:00.000 --&amp;gt; 00:00:12.000]   This is a LibraVox recording. All LibraVox recordings are in the public domain. For more information please visit LibraVox.org.
[00:00:12.000 --&amp;gt; 00:00:20.000]   Today's reading The Raven by Edgar Allan Poe, read by Chris Scurringe.
[00:00:20.000 --&amp;gt; 00:00:40.000]   Once upon a midnight dreary, while I pondered weak and weary, over many a quaint and curious volume of forgotten lore. While I nodded nearly napping, suddenly there came a tapping as of someone gently rapping, rapping at my chamber door.
&lt;/code&gt;&lt;/pre&gt;
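&lt;p&gt;If you only want the prose, those timestamp prefixes are easy to strip. Here's a minimal sketch (the &lt;code&gt;strip_timestamps&lt;/code&gt; name is my own):&lt;/p&gt;

```shell
# Remove the "[00:00:00.000 --> 00:00:12.000]" prefix whisperfile
# puts on each line, leaving just the transcribed text
strip_timestamps() {
  sed 's/^\[[0-9:.]* --> [0-9:.]*\] *//'
}

# Usage: ./whisperfile-0.8.13 -m whisper-tiny.en-q5_1.bin \
#          -f raven_poe_64kb.wav --no-prints | strip_timestamps
```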
&lt;p&gt;There are quite a few &lt;a href="https://github.com/Mozilla-Ocho/llamafile/issues/544#issuecomment-2297368432"&gt;undocumented options&lt;/a&gt;. This one writes out JSON to a file called &lt;code&gt;transcript.json&lt;/code&gt; (&lt;a href="https://gist.github.com/simonw/39173ac94e71cb01b749f9256a9408c4"&gt;example output&lt;/a&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;./whisperfile-0.8.13 -m whisper-tiny.en-q5_1.bin -f /tmp/raven_poe_64kb.wav --no-prints --output-json --output-file transcript
&lt;/code&gt;&lt;/pre&gt;
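&lt;p&gt;The resulting JSON holds the segments in a &lt;code&gt;transcription&lt;/code&gt; array. A short helper can pull the plain text back out - a sketch, assuming the layout from the example output linked above (the &lt;code&gt;extract_text&lt;/code&gt; name is mine):&lt;/p&gt;

```shell
# Print just the text of each segment from whisperfile's --output-json file
extract_text() {
  python3 -c '
import json, sys

doc = json.load(sys.stdin)
for segment in doc["transcription"]:
    print(segment["text"].strip())
' < "$1"
}

# Usage: extract_text transcript.json
```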
&lt;p&gt;I had to convert my own audio recordings to 16kHz &lt;code&gt;.wav&lt;/code&gt; files in order to use them with &lt;code&gt;whisperfile&lt;/code&gt;. I used &lt;code&gt;ffmpeg&lt;/code&gt; to do this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ffmpeg -i runthrough-26-oct-2023.wav -ar 16000 /tmp/out.wav
&lt;/code&gt;&lt;/pre&gt;
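&lt;p&gt;To convert a whole directory of recordings I could wrap that same invocation in a function - a sketch (the helper names are my own invention):&lt;/p&gt;

```shell
# Name the 16kHz copy after the original: foo.wav -> foo-16k.wav
out_name() {
  printf '%s-16k.wav' "${1%.wav}"
}

# Downsample one file using the same ffmpeg flags as above
to16k() {
  ffmpeg -i "$1" -ar 16000 "$(out_name "$1")"
}

# Usage: for f in *.wav; do to16k "$f"; done
```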
&lt;p&gt;Then I could transcribe that like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;./whisperfile-0.8.13 -m whisper-tiny.en-q5_1.bin -f /tmp/out.wav --no-prints
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: &lt;a href="https://twitter.com/JustineTunney/status/1825676741593149949"&gt;Justine says&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I've just uploaded new whisperfiles &lt;a href="https://huggingface.co/Mozilla/whisperfile"&gt;to Hugging Face&lt;/a&gt; which use miniaudio.h to automatically resample and convert your mp3/ogg/flac/wav files to the appropriate format.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;With that &lt;code&gt;whisper-tiny&lt;/code&gt; model this took just 11s to transcribe a 10m41s audio file!&lt;/p&gt;
&lt;p&gt;I also tried the much larger Whisper Medium model, using the 539MB &lt;code&gt;ggml-medium-q5_0.bin&lt;/code&gt; quantized version from &lt;a href="https://huggingface.co/ggerganov/whisper.cpp/tree/main"&gt;huggingface.co/ggerganov/whisper.cpp&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;./whisperfile-0.8.13 -m ggml-medium-q5_0.bin -f out.wav --no-prints
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This time it took 1m49s, using 761% of CPU according to Activity Monitor.&lt;/p&gt;
&lt;p&gt;I tried adding &lt;code&gt;--gpu auto&lt;/code&gt; to exercise the GPU on my M2 Max MacBook Pro:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;./whisperfile-0.8.13 -m ggml-medium-q5_0.bin -f out.wav --no-prints --gpu auto
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That used just 16.9% of CPU and 93% of GPU according to Activity Monitor, and finished in 1m08s. &lt;/p&gt;
&lt;p&gt;I tried this with the &lt;code&gt;tiny&lt;/code&gt; model too, but the performance difference there was imperceptible.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/JustineTunney/status/1825551821857010143"&gt;@JustineTunney&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ffmpeg"&gt;ffmpeg&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/whisper"&gt;whisper&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llamafile"&gt;llamafile&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/justine-tunney"&gt;justine-tunney&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/speech-to-text"&gt;speech-to-text&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/georgi-gerganov"&gt;georgi-gerganov&lt;/a&gt;&lt;/p&gt;



</summary><category term="ffmpeg"/><category term="ai"/><category term="whisper"/><category term="local-llms"/><category term="llamafile"/><category term="justine-tunney"/><category term="speech-to-text"/><category term="georgi-gerganov"/></entry><entry><title>gemma-2-27b-it-llamafile</title><link href="https://simonwillison.net/2024/Jul/2/gemma-2-27b-it-llamafile/#atom-tag" rel="alternate"/><published>2024-07-02T22:38:06+00:00</published><updated>2024-07-02T22:38:06+00:00</updated><id>https://simonwillison.net/2024/Jul/2/gemma-2-27b-it-llamafile/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/jartine/gemma-2-27b-it-llamafile"&gt;gemma-2-27b-it-llamafile&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Justine Tunney shipped llamafile packages of Google's new openly licensed (though definitely not open source) &lt;a href="https://ai.google.dev/gemma"&gt;Gemma&lt;/a&gt; 2 27b model this morning.&lt;/p&gt;
&lt;p&gt;I downloaded the &lt;code&gt;gemma-2-27b-it.Q5_1.llamafile&lt;/code&gt; version (20.5GB) to my Mac, ran &lt;code&gt;chmod 755 gemma-2-27b-it.Q5_1.llamafile&lt;/code&gt; and then &lt;code&gt;./gemma-2-27b-it.Q5_1.llamafile&lt;/code&gt; and now I'm trying it out through the &lt;code&gt;llama.cpp&lt;/code&gt; default web UI in my browser. It works great.&lt;/p&gt;
&lt;p&gt;It's a &lt;em&gt;very&lt;/em&gt; capable model - currently sitting at position 12 on the &lt;a href="https://chat.lmsys.org/"&gt;LMSYS Arena&lt;/a&gt;, making it the highest-ranked open weights model - one position ahead of Llama-3-70b-Instruct and within striking distance of the GPT-4 class models.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/JustineTunney/status/1808165898743878108"&gt;@JustineTunney&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llamafile"&gt;llamafile&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/justine-tunney"&gt;justine-tunney&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama-cpp"&gt;llama-cpp&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemma"&gt;gemma&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatbot-arena"&gt;chatbot-arena&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llamafile"/><category term="justine-tunney"/><category term="llama-cpp"/><category term="gemma"/><category term="chatbot-arena"/></entry><entry><title>Language models on the command-line</title><link href="https://simonwillison.net/2024/Jun/17/cli-language-models/#atom-tag" rel="alternate"/><published>2024-06-17T16:44:39+00:00</published><updated>2024-06-17T16:44:39+00:00</updated><id>https://simonwillison.net/2024/Jun/17/cli-language-models/#atom-tag</id><summary type="html">
    &lt;p&gt;I gave a talk about accessing Large Language Models from the command-line last week as part of the &lt;a href="https://maven.com/parlance-labs/fine-tuning"&gt;Mastering LLMs: A Conference For Developers &amp;amp; Data Scientists&lt;/a&gt; six-week online conference. The talk focused on my &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; Python command-line utility and ways you can use it (and &lt;a href="https://llm.datasette.io/en/stable/plugins/index.html"&gt;its plugins&lt;/a&gt;) to explore LLMs and use them for useful tasks.&lt;/p&gt;

&lt;p&gt;The talk was recorded and is available &lt;a href="https://www.youtube.com/watch?v=QUXQNi6jQ30"&gt;on YouTube&lt;/a&gt;. Here I've turned it into an &lt;a href="https://simonwillison.net/tags/annotatedtalks/"&gt;annotated presentation&lt;/a&gt;, with detailed notes and screenshots (there were no slides) to accompany the video.&lt;/p&gt;

&lt;iframe style="max-width: 100%" width="560" height="315" src="https://www.youtube-nocookie.com/embed/QUXQNi6jQ30" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen="allowfullscreen"&gt; &lt;/iframe&gt;

  &lt;p style="margin-top: 1em"&gt;&lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; is a tool I started building last year to help run LLM prompts directly from a command-line terminal. Instructions for installing it &lt;a href="https://llm.datasette.io/en/stable/setup.html"&gt;are here&lt;/a&gt; - you can use &lt;code&gt;pipx install llm&lt;/code&gt; or &lt;code&gt;pip install llm&lt;/code&gt; or &lt;code&gt;brew install llm&lt;/code&gt;.&lt;/p&gt;

&lt;div class="llm-slide" id="frame_000003.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_000003.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_000003.jpg" alt="The LLM website" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_000003.jpg"&gt;#&lt;/a&gt;

&lt;p&gt;Once installed you can use it with OpenAI models by running &lt;code&gt;llm keys set openai&lt;/code&gt; and pasting in your OpenAI key - or &lt;a href="https://llm.datasette.io/en/stable/plugins/directory.html#plugin-directory"&gt;install plugins&lt;/a&gt; to use models by other providers, including models you can run locally.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_000259.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_000259.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_000259.jpg" alt="$ llm &amp;#39;five great names for a pet pelican&amp;#39;
Sure, here are five great names for a pet pelican:
1. **Splash**
2. **Captain Beak**
3. **Seabreeze**
4. **Marina**
5. **Pelicano**
These names incorporate elements of the pelican&amp;#39;s natural habitat and distinctive features, adding a fun and fitting touch for your feathered friend!" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_000259.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;Basic usage is very simple: once you've configured your OpenAI key, you can run prompts against their models like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm 'five great names for a pet pelican'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The output will stream to your terminal, or you can redirect it to a file like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm 'five great names for a pet pelican' &amp;gt; pelicans.txt
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_000341.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_000341.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_000341.jpg" alt="$ llm -c &amp;#39;now do walruses&amp;#39;
Of course, here are five fitting names for a pet walrus:
1. **Whiskers**
2. **Tusk**
3. **Gustav**
4. **Blubber**
5. **Wally**

$ llm -c &amp;#39;justify those!&amp;#39;

Certainly! Here&amp;#39;s why each name suits a pet walrus:
1. **Whiskers**
- **Justification:** Walruses have distinctive, prominent whiskers (vibrissae) that the
y use to detect food on the ocean floor. This name highlights one of their most characteristic features.
2. **Tusk**
- **Justification:** Walruses are known for their long, formidable tusks, which they use for various purposes" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_000341.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;You can use the &lt;code&gt;-c&lt;/code&gt; (for continue) option to send follow-up prompts as part of the same ongoing conversation:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -c 'now do walruses'
# ...
llm -c 'justify those!'
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_000506.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_000506.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_000506.jpg" alt="Screenshot of the list of Remote APIs plugins in the LLM plugins directory." /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_000506.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;LLM supports additional models via plugins. These are listed in the &lt;a href="https://llm.datasette.io/en/stable/plugins/directory.html"&gt;LLM plugins directory&lt;/a&gt;, with dozens of plugins covering both remote API-hosted models and models you can run directly on your own computer.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_000704.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_000704.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_000704.jpg" alt="$ llm models
Outputs a list of OpenAI models followed by several anthropic models.

$ llm -m claude-3-haiku-20240307 &amp;#39;say hi in spanish with a flourish&amp;#39;
¡Hola, mi amigo! 🌟" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_000704.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;Here I'm using the &lt;a href="https://github.com/simonw/llm-claude-3"&gt;llm-claude-3&lt;/a&gt; plugin, which provides access to the Anthropic &lt;a href="https://www.anthropic.com/news/claude-3-family"&gt;Claude 3 family&lt;/a&gt; of models.&lt;/p&gt;
&lt;p&gt;I really like these models. Claude 3 Opus is about equivalent to GPT-4o in terms of quality. Claude 3 Haiku is both cheaper and better than GPT-3.5, and can handle 100,000 input tokens including images.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-bash"&gt;llm install llm-claude-3
llm keys set claude
# &amp;lt;Paste key here&amp;gt;
# Now list available models
llm models
# Then run a prompt
llm -m claude-3-haiku-20240307 'say hi in spanish with a flourish'
# Or use the haiku alias
llm -m haiku 'say hi in spanish with a flourish'
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_000818.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_000818.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_000818.jpg" alt="$ llm logs path
/Users/simon/Library/Application Support/io.datasette.llm/logs.db

$ datasette &amp;quot;$(llm logs path)&amp;quot;
...
Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_000818.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;Every prompt and response run through the LLM tool is permanently logged to a SQLite database, &lt;a href="https://llm.datasette.io/en/stable/logging.html"&gt;as described here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This command shows the path to that database:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm logs path
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you install &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; you can use it to browse your SQLite database like this, using a terminal trick where the output of one command is passed to another (with double quotes to avoid any problems caused by the space in the directory name):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;datasette "$(llm logs path)"
&lt;/code&gt;&lt;/pre&gt;
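&lt;p&gt;The quoting matters because that path contains a space. Here's a quick pure-shell illustration of the difference (no LLM install needed; the helper names are invented):&lt;/p&gt;

```shell
# Stand-in for `llm logs path`, which returns a path containing a space
fake_logs_path() {
  printf '%s' '/Users/simon/Library/Application Support/io.datasette.llm/logs.db'
}

# Report how many arguments a command received
count_args() { echo "$#"; }

count_args "$(fake_logs_path)"   # quoted: one argument, the full path
count_args $(fake_logs_path)     # unquoted: word-split at the space
```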
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_000832.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_000832.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_000832.jpg" alt="Datasette showing the 2,434 responses I have logged as a table, with a search interface at the top." /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_000832.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;Here's my searchable database of 2,434 responses I've logged from using LLM on my laptop, running in Datasette.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_000853.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_000853.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_000853.jpg" alt="Datasette screenshot showing a conversation, with 3 linked responses" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_000853.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;Earlier we ran a prompt and then sent two follow-up prompts to it using the &lt;code&gt;llm -c&lt;/code&gt; option. Those are stored in the database as three responses that are part of the same conversation.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_000924.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_000924.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_000924.jpg" alt="This data as json, copyable, CSV (advanced)

Suggested facets: datetime_utc (date)

model &amp;gt;30

- gpt-3.5-turbo 383
- claude-3-opus-20240229 334
- gpt-4 233
- gpt-4-1106-preview 134
- claude-2 117
- mlc-chat-Llama-2-7b-chat-hf-q4f16_1 93
- mistral-7b-instruct-v0 78
- mlc-chat-Llama-2-13b-chat-hf-q4f16_1 74
- claude-3-haiku-20240307 49
- gpt-3.5-turbo-instruct 49
- gpt-4o 37
- mlc-chat-Mistral-7B-Instruct-v0.2-q3f16_1 35
- Phi-3-mini-4k-instruct 33
- gemini-pro 33
- gemini-1.5-pro-latest 32
- gguf 28
- gpt-4-turbo-preview 27
- gpt-3.5-turbo-16k 26
- mistral-medium 26
- llama-2-7b-chat.ggmlv3.g8 0 2
- mistral-tiny 23
- Meta-Llama-3-8B-Instruct 21
- llamafile 21
- llama-2-13b.Q8_0 20
- orca-mini-3b 19
- command-r 14
- llama-2-7b.ggmlv3.q8_0 14
- mistral-small 14
- mistral-7b-instruct-v0.1.Q8_0 13
- claude 12" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_000924.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;Datasette supports facets, which provide a quick overview of unique value counts within the data. I've used GPT-3.5 Turbo 383 times, Claude 3 Opus 334 times, and a whole bunch of other models.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_000940.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_000940.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_000940.jpg" alt="6 rows where search matches &amp;quot;pelican&amp;quot; and model = &amp;quot;claude-3-opus-20240229&amp;quot; sorted by id descending" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_000940.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;Datasette provides search and filtering too - here are the prompts I've run against Claude 3 Opus that match "pelican".&lt;/p&gt;
&lt;p&gt;I have an experimental feature that lets me pass images to some models using the &lt;code&gt;-i filename.png&lt;/code&gt; option. More on that &lt;a href="https://github.com/simonw/llm/issues/331"&gt;in this issue&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/llm-cmd"&gt;llm-cmd&lt;/a&gt; is an example of a plugin that adds an extra sub-command to LLM - in this case one that takes a description of a task and turns that into a pre-populated shell command. I wrote more about that in &lt;a href="https://simonwillison.net/2024/Mar/26/llm-cmd/"&gt;llm cmd undo last git commit—a new plugin for LLM&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_001335.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_001335.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_001335.jpg" alt="Screenshot of the GPT4All website: A free-to-use, locally running, privacy-aware chatbot. No GPU or internet required." /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_001335.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;There are plenty of ways to run local models using LLM. One of my favourites is with the &lt;a href="https://github.com/simonw/llm-gpt4all"&gt;llm-gpt4all&lt;/a&gt; plugin, which builds on top of the excellent &lt;a href="https://gpt4all.io/index.html"&gt;GPT4All&lt;/a&gt; Python library by Nomic AI.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_001450.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_001450.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_001450.jpg" alt="$ llm -m mistral-7b-instruct-v@ &amp;#39;five great names for a pet seagull, with explanations&amp;#39; =3
1. &amp;quot;Sally&amp;quot; - This name is perfect for a pet seagull because it&amp;#39;s short and easy to remember,
yet still has a charming and friendly vibe. It also evokes images of a playful and adventuro
us bird that loves to explore the world around her.
2. &amp;quot;Captain Jack&amp;quot; - For any seafaring seagull who loves adventure and exploration, this name
is perfect. Captain Jack suggests a sense of independence and leadership, which could be fitt
ing for a bird who enjoys flying high and exploring new places.
3. &amp;quot;Scarlett&amp;quot; - This name is bold and eye-catching, much like the vibrant feathers of a seagull. Scarlett also has a fiery energy that could reflect the spirit of your adventurous pet.
4. &amp;quot;Nautica&amp;quot; - Nautica means &amp;quot;sailor&amp;quot; in Latin, making it an excellent choice for a seagull

Activity Monitor is also visible, showing a Python process using 350MB of RAM." /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_001450.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;To install that plugin:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-gpt4all
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then run &lt;code&gt;llm models&lt;/code&gt; to list the new models. Each model will be downloaded the first time you try running a prompt through it.&lt;/p&gt;
&lt;p&gt;I used this to run Mistral-7B Instruct - an extremely high-quality small (~4GB) model:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m mistral-7b-instruct-v0 'five great names for a pet seagull, with explanations'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can run Activity Monitor to see the resources the model is using.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_001544.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_001544.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_001544.jpg" alt="$ llm chat -m mistral-7b-instruct-v0
Chatting with mistral-7b-instruct-v0
Type &amp;#39;exit&amp;#39; or &amp;#39;quit&amp;#39; to exit
Type &amp;#39;!multi&amp;#39; to enter multiple lines, then &amp;#39;!end&amp;#39; to finish
&amp;gt; say hello in spanish

Hola! How can I assist you today?
&amp;gt; now in french
Bonjour! Comment puis-je vous aider aujourd&amp;#39;hui?" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_001544.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;Running prompts like this is inefficient, because it loads the full model into memory, runs the prompt and then shuts down the program again.&lt;/p&gt;
&lt;p&gt;Instead, you can use the &lt;code&gt;llm chat&lt;/code&gt; command which keeps the model in memory across multiple prompts:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm chat -m mistral-7b-instruct-v0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Another option is to run &lt;a href="https://ollama.com/"&gt;Ollama&lt;/a&gt;, which runs its own local server hosting models. The &lt;a href="https://github.com/taketwo/llm-ollama"&gt;llm-ollama&lt;/a&gt; plugin can then be used to run prompts through Ollama from LLM.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_002013.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_002013.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_002013.jpg" alt="The llamafile website" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_002013.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;Another really exciting option for running models is &lt;a href="https://github.com/Mozilla-Ocho/llamafile"&gt;llamafile&lt;/a&gt;, a project sponsored by Mozilla that uses Justine Tunney's &lt;a href="https://justine.lol/cosmo3/"&gt;Cosmopolitan&lt;/a&gt; to compile a binary that bundles both a model and the software needed to serve it, in a single file that can execute on several different operating systems.&lt;/p&gt;
&lt;p&gt;I've written more about that here:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2023/Nov/29/llamafile/"&gt;llamafile is the new best way to run a LLM on your own computer
&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/cosmopolitan/ecosystem"&gt;Catching up with the Cosmopolitan ecosystem
&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_002521.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_002521.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_002521.jpg" alt="Screenshot of a browser window titled llama.cpp showing a photograph

User: Describe image

Llama: The image features a person sitting in a chair next to an outdoor setting with a rooster nearby. A white bowl filled with eggs is placed on the ground, and it appears that the person may be preparing or collecting them from their backyard chicken coop. There are several other birds scattered around the scene as well, adding more life to this" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_002521.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;Running &lt;a href="https://llava-vl.github.io/"&gt;LLaVA&lt;/a&gt; using a Llamafile is particularly fun - it's an openly licensed model that can accept images as input as well. The results you can get from that, running as a single binary on your laptop, are pretty amazing.&lt;/p&gt;
&lt;p&gt;Grab that from &lt;a href="https://huggingface.co/Mozilla/llava-v1.5-7b-llamafile"&gt;Mozilla/llava-v1.5-7b-llamafile&lt;/a&gt; on Hugging Face.&lt;/p&gt;
&lt;p&gt;LLM can talk to Llamafile instances via the &lt;a href="https://github.com/simonw/llm-llamafile"&gt;llm-llamafile&lt;/a&gt; plugin.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_002636.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_002636.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_002636.jpg" alt="#!/bin/bash

# Validate that the first argument is an integer
if [[ ! $1 =~ ^[0-9]+$ ]]; then
  echo &amp;quot;Please provide a valid integer as the first argument.&amp;quot;
  exit 1
fi

id=&amp;quot;$1&amp;quot;

# Parse the optional -m argument
model=&amp;quot;haiku&amp;quot;
if [[ $2 == &amp;quot;-m&amp;quot; &amp;amp;&amp;amp; -n $3 ]]; then
  model=&amp;quot;$3&amp;quot;
fi

# Make API call, parse and summarize the discussion
curl -s &amp;quot;https://hn.algolia.com/api/v1/items/$id&amp;quot; | \
  jq -r &amp;#39;recurse(.children[]) | .author + &amp;quot;: &amp;quot; + .text&amp;#39; | \
  llm -m &amp;quot;$model&amp;quot; -s &amp;#39;Summarize the themes of the opinions expressed here.
  For each theme, output a markdown header.
  Include direct &amp;quot;quotations&amp;quot; (with author attribution) where appropriate.
  You MUST quote directly from users when crediting them, with double quotes.
  Fix HTML entities. Output markdown. Go long.&amp;#39;" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_002636.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;Now that we can run prompts from our terminal, we can start assembling software by writing scripts.&lt;/p&gt;
&lt;p&gt;Here's a Bash script I wrote to summarize conversations on Hacker News, using longer-context models such as Claude 3 Haiku, Google Gemini 1.5 or GPT-4o.&lt;/p&gt;
&lt;p&gt;I wrote more about this in &lt;a href="https://til.simonwillison.net/llms/claude-hacker-news-themes"&gt;Summarizing Hacker News discussion themes with Claude and LLM&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_002850.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_002850.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_002850.jpg" alt="$ hn-summary.sh 40651054 -m 4o

# Dependency and Linking in Software Distribution
Several users discussed issues around dependency management and linking in software distribution:
### Stability and Compatibility Issues
- **e63f67dd-@65b:** &amp;quot;It’s just a classic dependency issue... a combination of ABI instability and just plain version incompatibility from one distro to the next... My opinion is the opposite: I think the old paradigm of distros managing a giant set of system libraries is a bad one.&amp;quot;
- **umanwizard:** &amp;quot;Some distros do actually break out rust dependencies into separate packages (e.g. Guix does this).&amp;quot;
### Dynamic vs Static Linking
- **jiripospisil:** &amp;quot;Until there&amp;#39;s a vulnerability in one of the dependencies and now you have to rebuild all of the packages which use it... For Rust, there&amp;#39;s also the fact that most projects use a lock file.&amp;quot;
- **pglata:** &amp;quot;Static linking is essentially easier for people who don&amp;#39;t want to care. But ignoring security does not mean it solves it" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_002850.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;The script works by hitting the Hacker News Algolia API to return the full, nested JSON structure of the conversation (e.g. &lt;a href="https://hn.algolia.com/api/v1/items/40651054"&gt;this JSON endpoint&lt;/a&gt; for &lt;a href="https://news.ycombinator.com/item?id=40651054"&gt;this conversation&lt;/a&gt;), then runs that through &lt;code&gt;jq&lt;/code&gt; to turn it into text, then pipes that into a model using LLM.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_003325.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_003325.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_003325.jpg" alt="Tutorial: Writing a plugin to support a new model

This tutorial will walk you through developing a new plugin for LLM that adds support for a new Large Language Model.

We will be developing a plugin that implements a simple Markov chain to generate words based on an input string..." /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_003325.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;If you want to contribute to LLM itself, a great way to do that is to write plugins that support new models. I have &lt;a href="https://llm.datasette.io/en/stable/plugins/tutorial-model-plugin.html"&gt;an extensive tutorial&lt;/a&gt; describing how to do that.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_003440.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_003440.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_003440.jpg" alt="$ llm --system &amp;#39;You are a sentient cheesecake&amp;#39; -m gpt-4o --save cheesecake

$ llm chat -t cheesecake
Chatting with gpt-4o
Type &amp;#39;exit&amp;#39; or &amp;#39;quit&amp;#39; to exit
Type &amp;#39;!multi&amp;#39; to enter multiple lines, then &amp;#39;!end&amp;#39; to finish
&amp;gt; tell me abotu yourself
Hello! I&amp;#39;m a sentient cheesecake, a delightful fusion of creamy textures and rich flavors. My main ingredients include cream cheese, sugar, eggs, and a touch of vanilla, all resting on a
buttery graham cracker crust. I&amp;#39;m the perfect balance of sweet and tangy, and I take great pride in bringing joy to those who savor a slice of me. Despite my delicate nature, I have a surprisingly strong personality-and apparently, the ability to converse! Is there anything specific you&amp;#39;d like to know about the life of a sentient cheesecake?" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_003440.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;Another fun trick is to use templates to save and execute parameterized prompts, &lt;a href="https://llm.datasette.io/en/stable/templates.html"&gt;documented here&lt;/a&gt;. The easiest way to create a template is with the &lt;code&gt;--save&lt;/code&gt; option like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm --system 'You are a sentient cheesecake' -m gpt-4o --save cheesecake
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now you can chat with a cheesecake:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm chat -t cheesecake
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_003532.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_003532.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_003532.jpg" alt="$ cat demos.md | llm -s &amp;#39;summary, short&amp;#39;
1. **Terminal Basics**
- Quick reminder on basic terminal commands and piping with &amp;quot;|&amp;quot; and &amp;quot;&amp;gt;&amp;quot;, and using &amp;quot;cat&amp;quot;.
2. **Ttok Installation &amp;amp; Demonstration**
- Using &amp;quot;pipx install ttok&amp;quot; to install &amp;quot;ttok&amp;quot; and demonstrating its usage.
3. **Installing LLM**
- Installation of LLM using either &amp;quot;brew&amp;quot; or &amp;quot;pipx&amp;quot;.
4. **Configuring for OpenAI**
- How to configure LLM for OpenAI.
5. **Running a Prompt**
- Example of running a prompt with LLM.
6. **Installing LLM Plugin: llm-claude-3**
- Installing the &amp;quot;llm-claude-3&amp;quot; plugin and running a prompt with it.
7. **Browsing Logs with Datasette**
- Demonstrating browsing logs using Datasette.
8. **Using LLM Command**" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_003532.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;Being able to pipe content into &lt;code&gt;llm&lt;/code&gt; is a really important feature.&lt;/p&gt;
&lt;p&gt;The simplest way to do this is with &lt;code&gt;cat&lt;/code&gt; to send in files. This command summarizes the content of a provided file:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cat demos.md | llm -s 'summary, short'
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_003552.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_003552.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_003552.jpg" alt="$ llm -c &amp;#39;no much much much shorter, and in haikus&amp;#39;

Terminal basics,
Commands and piping galore,
Cat reminds us all.

Install ttok first,
Pipx makes it easy-peasy,
Quick demonstration.

LLM install,
Brew or pipx, your choice,
Configured OpenAI." /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_003552.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;Once you've done this you can send follow-up prompts with &lt;code&gt;-c&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -c 'no much much much shorter, and in haikus'
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_003619.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_003619.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_003619.jpg" alt="Output of files-to-prompt - different files are concatenated together with --- between them followed by their filename" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_003619.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;My &lt;a href="https://github.com/simonw/files-to-prompt"&gt;files-to-prompt&lt;/a&gt; command can turn a directory of files into a single prompt, suitable for piping into LLM.&lt;/p&gt;
&lt;p&gt;I wrote more about this, including how I developed it, in &lt;a href="https://simonwillison.net/2024/Apr/8/files-to-prompt/"&gt;Building files-to-prompt entirely using Claude 3 Opus&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_003715.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_003715.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_003715.jpg" alt="Screeshot of the shot-scraper website

A command-line utility for taking automated screenshots of websites

Quick start:

pip install shot-scraper
shot-scraper install
shot-scraper https://github.com/simonw/shot-scraper -h 900
Produces this screenshot in a file called github-com-simonw-shot-scraper.png

" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_003715.jpg"&gt;#&lt;/a&gt;
 &lt;p&gt;Another tool I frequently use with LLM is &lt;a href="https://shot-scraper.datasette.io/"&gt;shot-scraper&lt;/a&gt; - my command-line tool for screenshotting and scraping websites.&lt;/p&gt;  
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_003736.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_003736.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_003736.jpg" alt="shot-scraper JavaScript command documentation 
" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_003736.jpg"&gt;#&lt;/a&gt;

&lt;p&gt;&lt;code&gt;shot-scraper&lt;/code&gt; can &lt;a href="https://shot-scraper.datasette.io/en/stable/javascript.html"&gt;run JavaScript directly against a page&lt;/a&gt; and output the result back to the terminal, suitable for piping into LLM.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_003832.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_003832.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_003832.jpg" alt="Screenshot of Google search results, with the inspector panel open to show the structure of the HTML." /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_003832.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;Google &lt;em&gt;hate&lt;/em&gt; being scraped. Let's use &lt;code&gt;shot-scraper&lt;/code&gt; to scrape Google.&lt;/p&gt;
&lt;p&gt;Google search results have a structure where each result title is an &lt;code&gt;&amp;lt;h3&amp;gt;&lt;/code&gt; element wrapped in a link.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_003844.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_003844.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_003844.jpg" alt="shot-scraper javascript &amp;#39;https://www.google.com/search?q=nytimes+slop&amp;#39; &amp;#39;
Array.from(
  document.querySelectorAll(&amp;quot;h3&amp;quot;),
  el =&amp;gt; ({href: el.parentNode.href, title: el.innerText})
)&amp;#39;" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_003844.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;We can scrape that using the following terminal command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;shot-scraper javascript 'https://www.google.com/search?q=nytimes+slop' '
Array.from(
  document.querySelectorAll("h3"),
  el =&amp;gt; ({href: el.parentNode.href, title: el.innerText})
)'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will load up the search results page in an invisible browser, then execute JavaScript that extracts the results and returns them as JSON.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_003856.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_003856.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_003856.jpg" alt="R T R - A =4 [oBE IR W&amp;#39; R TR |
£ Searc =
= =g _
06_jun_lim — -2sh — 93x24 Rt - .
i 8
&amp;quot;href&amp;quot;: &amp;quot;http://www.nytimes.com/1896/07/31/archives/to-slop.html&amp;quot;, i
“title&amp;quot;: &amp;quot;To Slop.&amp;quot;
i
{
&amp;quot;href&amp;quot;: &amp;quot;https://www.nytimes.com/section/technology”,
&amp;quot;title&amp;quot;: &amp;quot;Technology&amp;quot;
i
1
&amp;quot;href&amp;quot;: &amp;quot;https://www.linkedin.com/posts/luigiraymontanez_slop-is-the-new-name-for-unw
anted-ai-generated-activity-7195821255621574658-MQWK&amp;quot;,
&amp;quot;title&amp;quot;: &amp;quot;Slop is the new name for unwanted AI-generated content”
i
{ e
&amp;quot;href&amp;quot;: &amp;quot;https://www.google.com/search?g=nytimes+slop&amp;amp;sca_esv=a22b124cb26f8e73&amp;amp;ei=F_B
0ZpLQHfet@PEPttSubA4&amp;amp;start=10&amp;amp;sa=N&amp;quot;,
&amp;quot;title&amp;quot;: &amp;quot;More results&amp;quot;
i
o
BTG
&amp;quot;title&amp;quot;: &amp;quot;Try again&amp;quot;
}
1
3
= 7,‘. 54 function findParentWithHveid(element) {
L85 4 __while (element 8§ lelement hasAttribute =
®0A0®17 WO  sSpaces:4 UTF-8 LF Markdown &amp;amp; A17Spell [
" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_003856.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;The results as JSON include the &lt;code&gt;href&lt;/code&gt; and &lt;code&gt;title&lt;/code&gt; of each of those search results.&lt;/p&gt;
&lt;p&gt;We could send that to LLM, but I'd like to grab the search snippets as well.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_003914.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_003914.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_003914.jpg" alt="Screenshot showing the command with the more complex JavaScript." /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_003914.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;Here's the more complex recipe that also fetches the search snippets visible on the page:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;shot-scraper javascript 'https://www.google.com/search?q=nytimes+slop' '
() =&amp;gt; {
    function findParentWithHveid(element) {
        while (element &amp;amp;&amp;amp; !element.hasAttribute("data-hveid")) {
            element = element.parentElement;
        }
        return element;
    }
    return Array.from(
        document.querySelectorAll("h3"),
        el =&amp;gt; findParentWithHveid(el).innerText
    );
}' | llm -s 'describe slop'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;At the end it pipes them into LLM with instructions to use that context to "describe slop".&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_003931.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_003931.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_003931.jpg" alt="The term &amp;quot;slop&amp;quot; has recently emerged in the context of artificial intelligence to describe questionable or low-quality AI-generated material. This term appears to be gaining traction among tech and media commentators, reflecting growing concerns about the proliferation of such content." /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_003931.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;... and it works! We get back an answer from the LLM that summarizes the search results that we just scraped.&lt;/p&gt;
&lt;p&gt;We have implemented basic RAG - Retrieval Augmented Generation, where search results are used to answer a question - using a terminal script that scrapes search results from Google and pipes them into an LLM.&lt;/p&gt;
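&lt;p&gt;The shape of that pattern is easy to sketch in a few lines of Python. This toy version scores snippets by naive word overlap rather than a real search engine or embeddings, and just builds the prompt string instead of calling a model - it illustrates the plumbing, not the actual pipeline above:&lt;/p&gt;

```python
# Toy RAG sketch: retrieve the snippets most relevant to a question,
# then combine them into a prompt for a model. Scoring here is naive
# word overlap, purely for illustration.
def score(question, snippet):
    q = set(question.lower().split())
    return len(q.intersection(snippet.lower().split()))

def build_prompt(question, snippets, k=2):
    top = sorted(snippets, key=lambda s: score(question, s), reverse=True)[:k]
    return "\n".join(top) + "\n\nUsing the context above: " + question

# Invented snippets standing in for scraped search results
snippets = [
    "Slop is the new name for unwanted AI-generated content",
    "The weather today is sunny",
    "AI slop is flooding search results",
]
prompt = build_prompt("describe slop", snippets)
```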
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_004133.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_004133.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_004133.jpg" alt="Output of llm --help showing a list of commands that includes cmd, embed, embed-models and embed-multi." /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_004133.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;Speaking of RAG... a common technique for implementing that pattern is to take advantage of embeddings and vector search to find content that is semantically similar to the user's question, without necessarily matching on exact keywords.&lt;/p&gt;
&lt;p&gt;I wrote an extensive introduction to embeddings in &lt;a href="https://simonwillison.net/2023/Oct/23/embeddings/"&gt;Embeddings: What they are and why they matter&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;LLM includes support for calculating, storing and searching embeddings through its &lt;code&gt;llm embed-models&lt;/code&gt;, &lt;code&gt;llm embed&lt;/code&gt; and &lt;code&gt;llm embed-multi&lt;/code&gt; commands, &lt;a href="https://llm.datasette.io/en/stable/embeddings/index.html"&gt;documented here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;llm embed-models&lt;/code&gt; command lists currently available embedding models - the OpenAI models plus any that have been added by plugins.&lt;/p&gt;
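&lt;p&gt;The core idea behind embedding search fits in a few lines of code. Each piece of text becomes a vector of floats, and cosine similarity measures how close two vectors are. The vectors below are invented three-dimensional toys - real models such as OpenAI's &lt;code&gt;3-small&lt;/code&gt; produce 1,536 dimensions:&lt;/p&gt;

```python
import math

# Cosine similarity between two vectors: dot product divided by the
# product of their magnitudes. Higher means more semantically similar.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    mag = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / mag

# Invented toy vectors - real embeddings come from a model
angry_doc = [0.9, 0.1, 0.2]
happy_doc = [0.1, 0.9, 0.3]
query = [0.8, 0.2, 0.1]
```

&lt;p&gt;Comparing the query against each document vector picks out the semantically closer one, which is exactly the comparison the similarity commands perform against thousands of stored vectors at once.&lt;/p&gt;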
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_004239.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_004239.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_004239.jpg" alt="A huge JSON array of floating point numbers, followed by:

$ llm embed -m 3-small -c &amp;#39;hello there&amp;#39; | jq length

1536" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_004239.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;The &lt;code&gt;llm embed&lt;/code&gt; command can embed content. This will output a list of floating point numbers for the specified content, using the OpenAI 3-small embedding model.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm embed -m 3-small -c "hello there"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Add &lt;code&gt;-f hex&lt;/code&gt; to get that out as hexadecimal. Neither of these formats is particularly useful on its own!&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_004332.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_004332.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_004332.jpg" alt="Screenshot of the llm embed-multi documentation." /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_004332.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;The &lt;a href="https://llm.datasette.io/en/stable/embeddings/cli.html#llm-embed-multi"&gt;llm embed-multi&lt;/a&gt; command is much more useful. It can run embeddings against content in bulk - from a CSV or JSON file, from a directory full of content or even from a SQLite database. Those embedding vectors will be stored in SQLite ready to be used for search or similarity queries.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_004418.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_004418.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_004418.jpg" alt="Screenshot of my blog&amp;#39;s collection of 7178 blogmarks" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_004418.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;I collect bookmarks (I call them "blogmarks") on my blog - links with short descriptions. I have &lt;a href="https://simonwillison.net/search/?type=blogmark"&gt;7,178 of them&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Let's create embeddings for all of those using LLM.&lt;/p&gt;
&lt;p&gt;I used the SQLite database version of my blog available from &lt;code&gt;https://datasette.simonwillison.net/simonwillisonblog.db&lt;/code&gt; (a 90MB file).&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_004424.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_004424.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_004424.jpg" alt="Screenshot of the blog_blogmark table in Datasette" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_004424.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;The content lives in the &lt;a href="https://datasette.simonwillison.net/simonwillisonblog/blog_blogmark"&gt;blog_blogmark&lt;/a&gt; SQLite table, which I can explore using Datasette (in this case the &lt;a href="https://datasette.io/desktop"&gt;Datasette Desktop&lt;/a&gt; macOS Electron app).&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_004444.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_004444.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_004444.jpg" alt="Custom SQL query returning more than 2,000 rows 

select id, link_url, link_title, commentary from blog_blogmark" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_004444.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;First step is to compose a SQL query returning the data we want to embed. The first column will be treated as a unique identifier to store along with the embedding vector, and any subsequent columns will be used as input to the embedding model.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;select id, link_url, link_title, commentary from blog_blogmark
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_004502.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_004502.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_004502.jpg" alt="R I R S P Y @ B e A C m 7 Qg
06_jun_lim — -zsh — 93x24 R ERCTT

$ llm embed-multi links \
-d simonwillisonblog.db \
--sql &amp;#39;select id, link_url, link_title, commentary from blog_blogmark&amp;#39; \
-m 3-small --store" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_004502.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;I can run this command to embed all of the content:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm embed-multi links \
  -d simonwillisonblog.db \
  --sql 'select id, link_url, link_title, commentary from blog_blogmark' \
  -m 3-small --store
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will create an embedding collection called "links". It will run the SQL query we created before, using the OpenAI &lt;code&gt;3-small&lt;/code&gt; model. The &lt;code&gt;--store&lt;/code&gt; option means it will store a copy of the text in the database as well - without that it would just store identifiers and we would need to use those to look up the text later on when running queries.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_004521.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_004521.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_004521.jpg" alt="Progress bar - 4% done, 00:01:18" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_004521.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;The &lt;code&gt;llm embed-multi&lt;/code&gt; command shows a progress bar for how far it has got.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_004539.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_004539.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_004539.jpg" alt="Screenshot of the embeddings table in Datasette, with a bunch of binary data visible." /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_004539.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;While the command is running we can use Datasette to watch as the &lt;code&gt;embeddings&lt;/code&gt; table is filled with data - one row for each of the items we are embedding, each storing a big ugly binary blob of data representing the embedding vector (in &lt;a href="https://llm.datasette.io/en/stable/embeddings/storage.html"&gt;this storage format&lt;/a&gt;).&lt;/p&gt;
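&lt;p&gt;A compact way to store a vector of floats as a single blob is to pack them as 32-bit values, four bytes per number. The sketch below, using Python's standard library, shows the general idea - the exact on-disk format is defined by LLM itself:&lt;/p&gt;

```python
from array import array

# Pack a list of floats into a compact binary blob (4 bytes per value)
# and unpack it again. An illustration of the general idea, not
# necessarily the exact format LLM uses on disk.
def encode(values):
    return array("f", values).tobytes()

def decode(blob):
    return list(array("f", blob))

vector = [0.25, -1.5, 3.0]
blob = encode(vector)  # 12 bytes instead of a JSON list of floats
```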
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_004720.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_004720.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_004720.jpg" alt="$ llm similar links -c &amp;#39;things that make me angry&amp;#39; -d simonwillisonblog.db
{&amp;quot;id&amp;quot;: &amp;quot;448&amp;quot;, &amp;quot;score&amp;quot;: 0.31105587659133327, &amp;quot;content&amp;quot;: &amp;quot;http://www.russellbeattie.com/notebook/1006697.html Overreaction Absolutely storming rant from Russell about America&amp;#39;s culture of fear.&amp;quot;, &amp;quot;metadata&amp;quot;: null}
..." /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_004720.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;And now we can run searches! This command finds everything in the "links" collection that is most similar to the search term "things that make me angry":&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm similar links -c 'things that make me angry' -d simonwillisonblog.db
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_004824.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_004824.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_004824.jpg" alt="cat ~/.local/bin/blog-answer.sh

#!/bin/bash

# Check if a query was provided
if [ &amp;quot;$#&amp;quot; -ne 1 ]; then
    echo &amp;quot;Usage: $0 &amp;#39;Your query&amp;#39;&amp;quot;
    exit 1
fi

llm similar blog-paragraphs -c &amp;quot;query: $1&amp;quot; \
  | jq &amp;#39;.content | sub(&amp;quot;passage: &amp;quot;; &amp;quot;&amp;quot;)&amp;#39; -r \
  | llm -m llamafile \
  &amp;quot;$1&amp;quot; -s &amp;#39;You answer questions as a single paragraph&amp;#39;

# | llm -m mlc-chat-Llama-2-7b-chat-hf-q4f16_1 \
# /Users/simon/.local/share/virtualenvs/llm-mlc-SwKbovmI/bin/llm -m mlc-chat-Llama-2-7b-chat-hf-q4f16_1
" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_004824.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;We can implement another version of RAG on top of this as well, by finding similar documents to our search term and then piping those results back into LLM to execute a prompt.&lt;/p&gt;
&lt;p&gt;I wrote more about this in &lt;a href="https://til.simonwillison.net/llms/embed-paragraphs"&gt;Embedding paragraphs from my blog with E5-large-v2&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_005209.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_005209.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_005209.jpg" alt="LLM documentation: OpenAl-compatible models" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_005209.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;During the Q&amp;amp;A I mentioned that LLM can talk to anything that provides an OpenAI-compatible API endpoint using just configuration, no extra code. That's &lt;a href="https://llm.datasette.io/en/stable/other-models.html#openai-compatible-models"&gt;described in the documentation here&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_005333.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_005333.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_005333.jpg" alt="Screenshot of the Python API documentation" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_005333.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;I also showed LLM's &lt;a href="https://llm.datasette.io/en/stable/python-api.html"&gt;Python API documentation&lt;/a&gt;, but warned that this is not yet considered stable as I'm not 100% happy with how this API works yet.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_010311.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_010311.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_010311.jpg" alt="GitHub Issue with 77 comments: Figure out how to serve an AWS Lambda function with a Function URL from a custom subdomain" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_010311.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;The final question from the audience was about my personal productivity.&lt;/p&gt;
&lt;p&gt;I wrote a bit about this a few years ago in &lt;a href="https://simonwillison.net/2022/Nov/26/productivity/"&gt;Coping strategies for the serial project hoarder&lt;/a&gt; - how I use extensive documentation and unit tests to allow me to work on way more projects at once by ensuring I don't have to remember the details of any of them.&lt;/p&gt;
&lt;p&gt;My other trick is that I tend to pick projects that fit my unique combination of previous experiences. I built LLM because I already had experience with LLM APIs, Python CLI tools (using &lt;a href="https://click.palletsprojects.com/"&gt;Click&lt;/a&gt;) and plugin systems (using &lt;a href="https://pluggy.readthedocs.io/"&gt;Pluggy&lt;/a&gt;). As a result I happened to be one of the best positioned people in the world to build a plugin-based CLI tool for working with LLMs!&lt;/p&gt;
&lt;/div&gt;

&lt;h4 id="llm-colophon"&gt;Colophon&lt;/h4&gt;
&lt;p&gt;Here's how I turned the YouTube video of this talk into an annotated presentation:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;I downloaded a &lt;code&gt;.mp4&lt;/code&gt; version of the talk from YouTube using &lt;a href="https://github.com/yt-dlp/yt-dlp"&gt;yt-dlp&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;I ran that through &lt;a href="https://goodsnooze.gumroad.com/l/macwhisper"&gt;MacWhisper&lt;/a&gt; to create my own transcript for copying extracts from into my write-up - although this time I didn't end up using any of the transcript text.&lt;/li&gt;
&lt;li&gt;I played the video (at 2x speed) in QuickTime Player and used the &lt;code&gt;capture.sh&lt;/code&gt; script &lt;a href="https://til.simonwillison.net/macos/quicktime-capture-script"&gt;described here&lt;/a&gt; to grab screenshots of the individual interesting frames that I wanted to use for my post.&lt;/li&gt;
&lt;li&gt;I loaded those screenshots into &lt;a href="https://til.simonwillison.net/tools/annotated-presentations"&gt;my annotated presentation tool&lt;/a&gt; (which I &lt;a href="https://simonwillison.net/2023/Aug/6/annotated-presentations/"&gt;described in this post&lt;/a&gt;) and used that to run OCR against them for alt text and to add commentary to accompany each screenshot.&lt;/li&gt;
&lt;li&gt;I assembled the result into this finished blog entry, adding intro text and the YouTube embed as HTML.&lt;/li&gt;
&lt;/ol&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/my-talks"&gt;my-talks&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/annotated-talks"&gt;annotated-talks&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llamafile"&gt;llamafile&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/files-to-prompt"&gt;files-to-prompt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/macwhisper"&gt;macwhisper&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="cli"/><category term="projects"/><category term="my-talks"/><category term="ai"/><category term="datasette"/><category term="openai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="anthropic"/><category term="annotated-talks"/><category term="llamafile"/><category term="ollama"/><category term="files-to-prompt"/><category term="macwhisper"/></entry><entry><title>Options for accessing Llama 3 from the terminal using LLM</title><link href="https://simonwillison.net/2024/Apr/22/llama-3/#atom-tag" rel="alternate"/><published>2024-04-22T13:38:09+00:00</published><updated>2024-04-22T13:38:09+00:00</updated><id>https://simonwillison.net/2024/Apr/22/llama-3/#atom-tag</id><summary type="html">
    &lt;p&gt;Llama 3 was released &lt;a href="https://llama.meta.com/llama3/"&gt;on Thursday&lt;/a&gt;. Early indications are that it's now the best available openly licensed model - Llama 3 70b Instruct has taken joint 5th place on the &lt;a href="https://chat.lmsys.org/?leaderboard"&gt;LMSYS arena leaderboard&lt;/a&gt;, sharing that position with Gemini Pro and Claude 3 Sonnet and sitting behind only Claude 3 Opus and some GPT-4s. But unlike those other models, Llama 3 70b is weights available and can even be run on a (high end) laptop!&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; command-line tool and Python library provide access to dozens of models via plugins. Here are several ways you can use them to access Llama 3, both hosted versions and running locally on your own hardware.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Apr/22/llama-3/#llama-3-8b-instruct-locally-with-llm-gpt4all"&gt;Llama-3-8B-Instruct locally with llm-gpt4all&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Apr/22/llama-3/#fast-api-access-via-groq"&gt;Fast API access via Groq&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Apr/22/llama-3/#local-llama-3-70b-instruct-with-llamafile"&gt;Local Llama 3 70b Instruct with llamafile&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Apr/22/llama-3/#paid-access-via-other-api-providers"&gt;Paid access via other API providers&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id="llama-3-8b-instruct-locally-with-llm-gpt4all"&gt;Llama-3-8B-Instruct locally with llm-gpt4all&lt;/h4&gt;
&lt;p&gt;If you want to run Llama 3 locally, the easiest way to do that with LLM is using the &lt;a href="https://github.com/simonw/llm-gpt4all"&gt;llm-gpt4all&lt;/a&gt; plugin. This plugin builds on the excellent &lt;a href="https://gpt4all.io/index.html"&gt;gpt4all&lt;/a&gt; project by Nomic AI, providing a quantized (q4) Llama 3 8B instruct model that needs 8GB of RAM and requires a 4.34GB download.&lt;/p&gt;
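&lt;p&gt;As a rough back-of-envelope check on those numbers (an illustration only, not the exact gpt4all packaging): quantizing roughly 8 billion parameters down to around 4.3 bits per weight - a plausible average for a q4 format once scales and metadata are included - lands close to that 4.34GB download:&lt;/p&gt;

```python
# Rough size estimate for a quantized model download.
# Assumptions (illustrative, not exact gpt4all numbers):
# - Llama 3 8B has about 8.03 billion parameters
# - a q4 quantization averages roughly 4.3 bits per weight
#   once scales and other metadata are included

def quantized_size_gb(params, bits_per_weight):
    """Approximate model file size in gigabytes."""
    return params * bits_per_weight / 8 / 1e9

estimate = quantized_size_gb(8.03e9, 4.3)
print(round(estimate, 2))  # prints 4.32 - close to the 4.34GB download
```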
&lt;p&gt;If you don't have LLM installed yet, install it like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;brew install llm&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Or like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;pipx install llm&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Install the plugin:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-gpt4all&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The first time you run a prompt it will download the model:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m Meta-Llama-3-8B-Instruct &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Three great names for a pet emu&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;blockquote&gt;
&lt;p&gt;What an interesting question! Here are three name suggestions for your pet emu:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Emmet&lt;/strong&gt;: A playful take on the word "emu" that suits a friendly and curious bird.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gizmo&lt;/strong&gt;: Emus are known for their quirky behavior, so Gizmo is a fitting name to capture their unique personality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aussie&lt;/strong&gt;: As an Australian native species, naming your emu Aussie would be a nod to its heritage.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Remember to choose a name that suits your pet's temperament and characteristics!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Each time you run a command like this the model will be loaded into memory from scratch. To keep it resident in memory, use the &lt;code&gt;llm chat&lt;/code&gt; command:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm chat -m Meta-Llama-3-8B-Instruct&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here's an example session:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Chatting with Meta-Llama-3-8B-Instruct
Type 'exit' or 'quit' to exit
Type '!multi' to enter multiple lines, then '!end' to finish
&amp;gt; javascript function to reverse a string
%
Here is the JavaScript code that reverses a given string:
```
function reverseString(str) {
  return str.split("").reverse().join("");
}
```

You can use this function by passing your desired input string as an argument, like so:

```javascript
let reversedStr = reverseString("Hello World");
console.log(reversedStr); // Output: "dlroW olleH"
```
In the code above:
- `str.split("")` splits the original string into individual characters.
- `.reverse()` reverses the order of those characters.
- `.join("")` combines them back together as a single string.

This function works for any input string, not just "Hello World". Try it out with different strings to see how it handles punctuation and spaces!
&amp;gt; capitalize it too
%
Here's an updated JavaScript function that not only reverses the input string but also capitalizes each character:
```javascript
function reverseAndCapitalize(str) {
  return str.split("").reverse().map(function(char) { 
    if (char.match(/[a-z]/i)) {
      return char.toUpperCase();
    } else {
      return char;
    }
  }).join("");
}
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(I'm not sure I'd write &lt;code&gt;reverseAndCapitalize()&lt;/code&gt; quite like that, but still a fun example!)&lt;/p&gt;
&lt;p&gt;Consult &lt;a href="https://llm.datasette.io/en/stable/usage.html"&gt;the LLM documentation&lt;/a&gt; for more details on how to use the command-line tool.&lt;/p&gt;
&lt;h4 id="fast-api-access-via-groq"&gt;Fast API access via Groq&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://groq.com/"&gt;Groq&lt;/a&gt; serve openly licensed LLMs at ludicrous speeds using their own custom LPU (Language Processing Unit) Inference Engine. They currently offer a free preview of their API: you can sign up and &lt;a href="https://console.groq.com/keys"&gt;obtain an API key&lt;/a&gt; to start using it.&lt;/p&gt;
&lt;p&gt;You can run prompts against Groq using their &lt;a href="https://console.groq.com/docs/openai"&gt;OpenAI compatible API endpoint&lt;/a&gt;.&lt;/p&gt;
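&lt;p&gt;If you're curious what a raw call to that endpoint looks like, here's a minimal sketch using just the Python standard library - the model name is one of the Groq model IDs configured below, and the API key placeholder is obviously something you'd replace with your own:&lt;/p&gt;

```python
import json

def build_chat_request(model, prompt):
    """Build the JSON body for an OpenAI-compatible chat completion call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

body = build_chat_request("llama3-70b-8192", "Three great names for a pet emu")
print(json.dumps(body))

# To actually call Groq (needs a real API key in the Authorization header):
# import urllib.request
# req = urllib.request.Request(
#     "https://api.groq.com/openai/v1/chat/completions",
#     data=json.dumps(body).encode("utf-8"),
#     headers={
#         "Content-Type": "application/json",
#         "Authorization": "Bearer YOUR_GROQ_API_KEY",
#     },
# )
# response = json.load(urllib.request.urlopen(req))
# print(response["choices"][0]["message"]["content"])
```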
&lt;p&gt;Edit the file &lt;code&gt; ~/Library/Application Support/io.datasette.llm/extra-openai-models.yaml&lt;/code&gt; - creating it if it doesn't exist - and add the following lines to it:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;- &lt;span class="pl-ent"&gt;model_id&lt;/span&gt;: &lt;span class="pl-s"&gt;groq-openai-llama3&lt;/span&gt;
  &lt;span class="pl-ent"&gt;model_name&lt;/span&gt;: &lt;span class="pl-s"&gt;llama3-70b-8192&lt;/span&gt;
  &lt;span class="pl-ent"&gt;api_base&lt;/span&gt;: &lt;span class="pl-s"&gt;https://api.groq.com/openai/v1&lt;/span&gt;
  &lt;span class="pl-ent"&gt;api_key_name&lt;/span&gt;: &lt;span class="pl-s"&gt;groq&lt;/span&gt;
- &lt;span class="pl-ent"&gt;model_id&lt;/span&gt;: &lt;span class="pl-s"&gt;groq-openai-llama3-8b&lt;/span&gt;
  &lt;span class="pl-ent"&gt;model_name&lt;/span&gt;: &lt;span class="pl-s"&gt;llama3-8b-8192&lt;/span&gt;
  &lt;span class="pl-ent"&gt;api_base&lt;/span&gt;: &lt;span class="pl-s"&gt;https://api.groq.com/openai/v1&lt;/span&gt;
  &lt;span class="pl-ent"&gt;api_key_name&lt;/span&gt;: &lt;span class="pl-s"&gt;groq&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This tells LLM about those models, and makes them accessible via those configured &lt;code&gt;model_id&lt;/code&gt; values.&lt;/p&gt;
&lt;p&gt;Run this command to confirm that the models were registered correctly:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm models &lt;span class="pl-k"&gt;|&lt;/span&gt; grep groq&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;You should see this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;OpenAI Chat: groq-openai-llama3
OpenAI Chat: groq-openai-llama3-8b
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Set your Groq API key like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm keys &lt;span class="pl-c1"&gt;set&lt;/span&gt; groq
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; &amp;lt;Paste your API key here&amp;gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Now you should be able to run prompts through the models like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m groq-openai-llama3 &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;A righteous sonnet about a brave owl&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/groq-sonnet.gif" alt="Animated demo. The sonnet appears in less than a second: Here is a sonnet about a brave owl:  In moonlit skies, a silhouette is seen, A wingspan wide, a watchful, piercing gaze. The owl, a sentinel of secrets keen, Patrols the night, with valor in her ways.  Her feathers soft, a camouflage gray, She glides unseen, a phantom of the night. Her eyes, like lanterns, shining bright and far, Illuminate the darkness, banishing all fright.  Her talons sharp, a grasping, deadly sway, She swoops upon her prey, with silent might. Yet in her heart, a wisdom, old and gray, A fierce devotion to the darkness of the night.  And thus, the owl, a symbol of courage true, Inspires us all, with brave and noble pursuit.  I hope you enjoy this sonnet!" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Groq is &lt;em&gt;fast&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;There's also a &lt;a href="https://github.com/angerman/llm-groq"&gt;llm-groq&lt;/a&gt; plugin but it hasn't shipped support for the new models just yet - though there's &lt;a href="https://github.com/angerman/llm-groq/pull/5"&gt;a PR for that by Lex Herbert here&lt;/a&gt; and you can install the plugin directly from that PR like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install https://github.com/lexh/llm-groq/archive/ba9d7de74b3057b074a85fe99fe873b75519bd78.zip
llm keys &lt;span class="pl-c1"&gt;set&lt;/span&gt; groq
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; paste API key here&lt;/span&gt;
llm -m groq-llama3-70b &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;say hi in spanish five ways&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h4 id="local-llama-3-70b-instruct-with-llamafile"&gt;Local Llama 3 70b Instruct with llamafile&lt;/h4&gt;
&lt;p&gt;The Llama 3 8b model is easy to run on a laptop, but it's pretty limited in capability. The 70b model is the one that's starting to get competitive with GPT-4. Can we run that on a laptop?&lt;/p&gt;
&lt;p&gt;I managed to run the 70b model on my 64GB MacBook Pro M2 using &lt;a href="https://github.com/Mozilla-Ocho/llamafile"&gt;llamafile&lt;/a&gt; (&lt;a href="https://simonwillison.net/2023/Nov/29/llamafile/"&gt;previously on this blog&lt;/a&gt;) - after quitting most other applications to make sure the 37GB of RAM it needed was available.&lt;/p&gt;
&lt;p&gt;I used the &lt;code&gt;Meta-Llama-3-70B-Instruct.Q4_0.llamafile&lt;/code&gt; Q4 version from &lt;a href="https://huggingface.co/jartine/Meta-Llama-3-70B-Instruct-llamafile/tree/main"&gt;jartine/Meta-Llama-3-70B-Instruct-llamafile&lt;/a&gt; - a 37GB download. I have a dedicated external hard disk (a Samsung T7 Shield) for this kind of thing.&lt;/p&gt;
&lt;p&gt;Here's how I got it working:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;curl -L -o Meta-Llama-3-70B-Instruct.Q4_0.llamafile &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;https://huggingface.co/jartine/Meta-Llama-3-70B-Instruct-llamafile/resolve/main/Meta-Llama-3-70B-Instruct.Q4_0.llamafile?download=true&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; That downloads 37GB - now make it executable&lt;/span&gt;
chmod 755 Meta-Llama-3-70B-Instruct.Q4_0.llamafile
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; And start it running:&lt;/span&gt;
./Meta-Llama-3-70B-Instruct.Q4_0.llamafile&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;A &lt;code&gt;llamafile&lt;/code&gt; is an executable that runs on virtually &lt;em&gt;any platform&lt;/em&gt; - see &lt;a href="https://til.simonwillison.net/cosmopolitan/ecosystem"&gt;my previous notes&lt;/a&gt; on Cosmopolitan and Actually Portable Executable for more on how that works.&lt;/p&gt;
&lt;p&gt;This will take quite a while to start, because it needs to load that full 37GB of binary content into memory. Once it's finished loading, a local web server becomes available at &lt;a href="http://127.0.0.1:8080/"&gt;http://127.0.0.1:8080/&lt;/a&gt; - this serves a web UI you can use to interact with the model, and also exposes an OpenAI-compatible API endpoint.&lt;/p&gt;
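&lt;p&gt;Because of that slow startup, it can be handy to poll the server until it's ready before pointing other tools at it. Here's a small sketch - the attempt count and delay are arbitrary assumptions, tune them to taste:&lt;/p&gt;

```python
import time
import urllib.request

# Poll the llamafile server until it answers, since loading 37GB takes
# a while. The attempt count and delay are assumptions - tune them.
def wait_for_server(url, attempts=60, delay=5):
    """Return True once the server responds, False if it never does."""
    for _ in range(attempts):
        try:
            urllib.request.urlopen(url, timeout=2)
            return True
        except OSError:
            time.sleep(delay)
    return False

# With nothing listening yet this returns False quickly; once the
# llamafile server is up it returns True.
print(wait_for_server("http://127.0.0.1:8080/", attempts=1, delay=0))
```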
&lt;p&gt;The easiest way to access this from LLM is to install the &lt;a href="https://github.com/simonw/llm-llamafile"&gt;llm-llamafile&lt;/a&gt; plugin:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-llamafile&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;All this plugin does is configure a model called &lt;code&gt;llamafile&lt;/code&gt; that attempts to access the model hosted on port 8080. You can run prompts like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m llamafile &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;3 neat characteristics of a pelican&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;blockquote&gt;
&lt;p&gt;Here are three neat characteristics of a pelican:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Unique Beak&lt;/strong&gt;: Pelicans have a distinctive beak that is shaped like a scoop or a basket. This beak is specially designed to catch fish, and it can hold up to 3 gallons of water! The beak is also very sensitive, which helps pelicans detect the presence of fish in the water.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Waterproof Feathers&lt;/strong&gt;: Pelicans have a special coating on their feathers that makes them waterproof. This is essential for their fishing lifestyle, as they need to be able to dive into the water without getting weighed down by wet feathers. The coating is made up of a waxy substance that helps to repel water.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Pouch-Like Throat&lt;/strong&gt;: Pelicans have a unique throat pouch that allows them to catch and store fish. When they dive into the water, they use their beak to scoop up fish, and then they store them in their throat pouch. The pouch can expand to hold multiple fish, and the pelican can then swallow the fish whole or regurgitate them to feed their young. This pouch is a key adaptation that helps pelicans thrive in their aquatic environment.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
&lt;p&gt;If you don't want to install another plugin, you can instead configure the model by adding this to your &lt;code&gt;extra-openai-models.yaml&lt;/code&gt; file:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;- &lt;span class="pl-ent"&gt;model_id&lt;/span&gt;: &lt;span class="pl-s"&gt;llamafile&lt;/span&gt;
  &lt;span class="pl-ent"&gt;model_name&lt;/span&gt;: &lt;span class="pl-s"&gt;llamafile&lt;/span&gt;
  &lt;span class="pl-ent"&gt;api_base&lt;/span&gt;: &lt;span class="pl-s"&gt;http://localhost:8080/v1&lt;/span&gt;
  &lt;span class="pl-ent"&gt;api_key&lt;/span&gt;: &lt;span class="pl-s"&gt;x&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;One warning about this approach: if you use LLM like this then every prompt you run through &lt;code&gt;llamafile&lt;/code&gt; will be stored under the same model name in your &lt;a href="https://llm.datasette.io/en/stable/logging.html"&gt;SQLite logs&lt;/a&gt;, even if you try out different &lt;code&gt;llamafile&lt;/code&gt; models at different times. You could work around this by registering them with different &lt;code&gt;model_id&lt;/code&gt; values in the YAML file.&lt;/p&gt;
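&lt;p&gt;You can see the effect directly by querying the log database (find it with &lt;code&gt;llm logs path&lt;/code&gt;). This sketch assumes the &lt;code&gt;responses&lt;/code&gt; table and its &lt;code&gt;model&lt;/code&gt; column from LLM's logging schema, and demos against an in-memory stand-in rather than the real file:&lt;/p&gt;

```python
import sqlite3

# Count logged prompts per model name. Illustrative sketch: it assumes
# LLM's logging schema has a "responses" table with a "model" column
# (locate the real database with: llm logs path).
def prompts_per_model(conn):
    rows = conn.execute(
        "select model, count(*) from responses group by model order by model"
    ).fetchall()
    return dict(rows)

# Demo against an in-memory database mimicking that assumed schema.
# Note how two different llamafile models collapse into one name:
conn = sqlite3.connect(":memory:")
conn.execute("create table responses (model text, prompt text)")
conn.executemany(
    "insert into responses values (?, ?)",
    [
        ("llamafile", "prompt run against Llama 3 70b"),
        ("llamafile", "prompt run against a different llamafile model"),
        ("groq-openai-llama3", "prompt run against Groq"),
    ],
)
print(prompts_per_model(conn))  # {'groq-openai-llama3': 1, 'llamafile': 2}
```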
&lt;h4 id="paid-access-via-other-api-providers"&gt;Paid access via other API providers&lt;/h4&gt;
&lt;p&gt;A neat thing about open weight models is that multiple API providers can offer them, encouraging them to aggressively compete on price.&lt;/p&gt;
&lt;p&gt;Groq is currently free to use, but only for a limited number of requests.&lt;/p&gt;
&lt;p&gt;A number of other providers are now hosting Llama 3, and many of them have plugins available for LLM. Here are a few examples:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.perplexity.ai/"&gt;Perplexity Labs&lt;/a&gt; are offering &lt;code&gt;llama-3-8b-instruct&lt;/code&gt; and &lt;code&gt;llama-3-70b-instruct&lt;/code&gt;. The &lt;a href="https://github.com/hex/llm-perplexity"&gt;llm-perplexity&lt;/a&gt; plugin provides access - &lt;code&gt;llm install llm-perplexity&lt;/code&gt; to install, &lt;code&gt;llm keys set perplexity&lt;/code&gt; to set an &lt;a href="https://www.perplexity.ai/settings/api"&gt;API key&lt;/a&gt; and then run prompts against those two model IDs. Current &lt;a href="https://docs.perplexity.ai/docs/pricing"&gt;price&lt;/a&gt; for 8b is $0.20 per million tokens, for 80b is $1.00.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.anyscale.com/endpoints"&gt;Anyscale Endpoints&lt;/a&gt; have &lt;code&gt;meta-llama/Llama-3-8b-chat-hf&lt;/code&gt; ($0.15/million tokens) and &lt;code&gt;meta-llama/Llama-3-70b-chat-hf&lt;/code&gt; ($1.0/million tokens) (&lt;a href="https://docs.endpoints.anyscale.com/pricing/"&gt;pricing&lt;/a&gt;). &lt;code&gt;llm install llm-anyscale-endpoints&lt;/code&gt;, then &lt;code&gt;llm keys set anyscale-endpoints&lt;/code&gt; to set the &lt;a href="https://app.endpoints.anyscale.com/"&gt;API key&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://fireworks.ai/"&gt;Fireworks AI&lt;/a&gt; have &lt;code&gt;fireworks/models/llama-v3-8b-instruct&lt;/code&gt; for $0.20/million and &lt;code&gt;fireworks/models/llama-v3-70b-instruct&lt;/code&gt; for $0.90/million (&lt;a href="https://fireworks.ai/pricing"&gt;pricing&lt;/a&gt;). &lt;code&gt;llm install llm-fireworks&lt;/code&gt;, then &lt;code&gt;llm keys set fireworks&lt;/code&gt; to set the &lt;a href="https://fireworks.ai/api-keys"&gt;API key&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://openrouter.ai/"&gt;OpenRouter&lt;/a&gt; provide proxied accessed to Llama 3 from a number of different providers at different prices, documented on their &lt;a href="https://openrouter.ai/models/meta-llama/llama-3-70b-instruct"&gt;meta-llama/llama-3-70b-instruct&lt;/a&gt; and &lt;a href="https://openrouter.ai/models/meta-llama/llama-3-8b-instruct"&gt;meta-llama/llama-3-8b-instruct&lt;/a&gt; pages (&lt;a href="https://openrouter.ai/models?q=llama%203"&gt;and more&lt;/a&gt;). Use the &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt; plugin for those.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.together.ai/"&gt;Together AI&lt;/a&gt; has both models as well. The &lt;a href="https://github.com/wearedevx/llm-together"&gt;llm-together&lt;/a&gt; plugin provides access to &lt;code&gt;meta-llama/Llama-3-8b-chat-hf&lt;/code&gt; and &lt;code&gt;meta-llama/Llama-3-70b-chat-hf&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
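&lt;p&gt;Those per-million-token prices are easy to compare with a couple of lines of arithmetic - an illustration using the 70b prices quoted above (prices change often, so treat these as a snapshot):&lt;/p&gt;

```python
# Cost of a prompt at a given per-million-token price.
def cost_usd(total_tokens, price_per_million):
    return total_tokens * price_per_million / 1_000_000

# 70b-class prices quoted above (USD per million tokens):
prices = {
    "Perplexity Labs": 1.00,
    "Anyscale Endpoints": 1.00,
    "Fireworks AI": 0.90,
}

# What would a million tokens a day for a month cost at each provider?
for provider, price in sorted(prices.items()):
    monthly = cost_usd(30 * 1_000_000, price)
    print(provider, round(monthly, 2))  # Fireworks AI comes to $27/month
```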
&lt;p&gt;I'm sure there are more - these are just the ones I've tried out myself. Check the &lt;a href="https://llm.datasette.io/en/stable/plugins/directory.html"&gt;LLM plugin directory&lt;/a&gt; for other providers, or if a provider emulates the OpenAI API you can configure with the YAML file as shown above or &lt;a href="https://llm.datasette.io/en/stable/other-models.html#openai-compatible-models"&gt;described in the LLM documentation&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="that-s-a-lot-of-options"&gt;That's a lot of options&lt;/h4&gt;
&lt;p&gt;One key idea behind LLM is to use plugins to provide access to as many different models as possible. Above I've listed two ways to run Llama 3 locally and six different API vendors that LLM can access as well.&lt;/p&gt;
&lt;p&gt;If you're inspired to write your own plugin it's pretty simple: each of the above plugins is open source, and there's a detailed tutorial on &lt;a href="https://llm.datasette.io/en/stable/plugins/tutorial-model-plugin.html"&gt;Writing a plugin to support a new model&lt;/a&gt; on the LLM website.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llamafile"&gt;llamafile&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/groq"&gt;groq&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatbot-arena"&gt;chatbot-arena&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="projects"/><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="llamafile"/><category term="groq"/><category term="llm-release"/><category term="openrouter"/><category term="chatbot-arena"/></entry><entry><title>Many options for running Mistral models in your terminal using LLM</title><link href="https://simonwillison.net/2023/Dec/18/mistral/#atom-tag" rel="alternate"/><published>2023-12-18T18:18:44+00:00</published><updated>2023-12-18T18:18:44+00:00</updated><id>https://simonwillison.net/2023/Dec/18/mistral/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;a href="https://mistral.ai/"&gt;Mistral AI&lt;/a&gt; is the most exciting AI research lab at the moment. They've now released two extremely powerful smaller Large Language Models under an Apache 2 license, and have a third much larger one that's available via their API.&lt;/p&gt;
&lt;p&gt;I've been trying out their models using my &lt;a href="https://llm.datasette.io/"&gt;LLM command-line tool&lt;/a&gt;. Here's what I've figured out so far.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2023/Dec/18/mistral/#mixtral-llama-cpp"&gt;Mixtral 8x7B via llama.cpp and llm-llama-cpp&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2023/Dec/18/mistral/#mistral-7b-local"&gt;Mistral 7B via llm-llama-cpp or llm-gpt4all or llm-mlc&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2023/Dec/18/mistral/#mistral-api"&gt;Using the Mistral API, which includes the new Mistral-medium&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2023/Dec/18/mistral/#mistral-other-apis"&gt;Mistral via other API providers&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2023/Dec/18/mistral/#llamafile-openai"&gt;Using Llamafile's OpenAI API endpoint&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id="mixtral-llama-cpp"&gt;Mixtral 8x7B via llama.cpp and llm-llama-cpp&lt;/h4&gt;
&lt;p&gt;On Friday 8th December Mistral AI &lt;a href="https://twitter.com/MistralAI/status/1733150512395038967"&gt;tweeted a mysterious magnet&lt;/a&gt; (BitTorrent) link. This is the second time they've done this, the first was on September 26th when &lt;a href="https://twitter.com/MistralAI/status/1706877320844509405"&gt;they released&lt;/a&gt; their excellent Mistral 7B model, also as a magnet link.&lt;/p&gt;
&lt;p&gt;The new release was an 87GB file containing Mixtral 8x7B - "a high-quality sparse mixture of experts model (SMoE) with open weights", according to &lt;a href="https://mistral.ai/news/mixtral-of-experts/"&gt;the article&lt;/a&gt; they released three days later.&lt;/p&gt;
&lt;p&gt;Mixtral is a &lt;em&gt;very&lt;/em&gt; impressive model. GPT-4 has long been rumored to use a mixture of experts architecture, and Mixtral is the first truly convincing openly licensed implementation of this architecture I've seen. It's already showing impressive benchmark scores.&lt;/p&gt;
&lt;p&gt;This &lt;a href="https://github.com/ggerganov/llama.cpp/pull/4406"&gt;PR for llama.cpp&lt;/a&gt; added support for the new model. &lt;a href="https://github.com/abetlen/llama-cpp-python"&gt;llama-cpp-python&lt;/a&gt; updated to land that patch shortly afterwards.&lt;/p&gt;
&lt;p&gt;Which means... you can now run Mixtral on a Mac (and other platforms too, though I haven't tested them myself yet) using my &lt;a href="https://github.com/simonw/llm-llama-cpp"&gt;llm-llama-cpp plugin&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here's how to do that:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://llm.datasette.io/en/stable/setup.html"&gt;Install LLM&lt;/a&gt;:
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;pipx install llm&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;Install the plugin:
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-llama-cpp&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;Install &lt;code&gt;llama-cpp-python&lt;/code&gt; - this needs to be done manually because the best approach differs for different platforms. On an Apple Silicon Mac I recommend running:
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;CMAKE_ARGS=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;-DLLAMA_METAL=on&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; FORCE_CMAKE=1 llm install llama-cpp-python&lt;/pre&gt;&lt;/div&gt;
More details &lt;a href="https://github.com/simonw/llm-llama-cpp"&gt;in the llm-llama-cpp README&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Download a GGUF file for Mixtral 8X7B Instruct v0.1. TheBloke has &lt;a href="https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/tree/main"&gt;8 different options&lt;/a&gt; for this - I picked the 36GB &lt;code&gt;mixtral-8x7b-instruct-v0.1.Q6_K.gguf&lt;/code&gt;:
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;curl -LO &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/resolve/main/mixtral-8x7b-instruct-v0.1.Q6_K.gguf?download=true&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;Run the model like this:
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m gguf \
  -o path mixtral-8x7b-instruct-v0.1.Q6_K.gguf \
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;[INST] Write a Python function that downloads a file from a URL[/INST]&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The &lt;code&gt;-m gguf&lt;/code&gt; here says to use the &lt;code&gt;gguf&lt;/code&gt; model that was installed by the &lt;code&gt;llm-llama-cpp&lt;/code&gt; plugin - this is a model which requires you to pass a &lt;code&gt;path&lt;/code&gt; option - with &lt;code&gt;-o path filename&lt;/code&gt; - specifying which model you would like to run.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;[INST]&lt;/code&gt; part there is the format of prompt expected by the Instruct model. I need to update the &lt;code&gt;llm-llama-cpp&lt;/code&gt; plugin to make it easier to automatically construct the correct prompt format, see &lt;a href="https://github.com/simonw/llm-llama-cpp/issues/28"&gt;this issue&lt;/a&gt;.&lt;/p&gt;
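&lt;p&gt;In the meantime, a tiny helper makes the wrapping harder to get wrong - this is my own illustrative function, and it only covers the single-turn case, not multi-turn conversations:&lt;/p&gt;

```python
# Wrap a plain prompt in the single-turn Mistral/Mixtral instruct format.
# The helper name is illustrative; it only handles the one-shot case.
def instruct_prompt(text):
    return "[INST] " + text.strip() + " [/INST]"

print(instruct_prompt("Write a Python function that downloads a file from a URL"))
# [INST] Write a Python function that downloads a file from a URL [/INST]
```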
&lt;p&gt;Here's what I got for that prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Sure, here's an example Python function that downloads a file from a URL using the &lt;code&gt;requests&lt;/code&gt; library:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;requests&lt;/span&gt;

&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;download_file&lt;/span&gt;(&lt;span class="pl-s1"&gt;url&lt;/span&gt;, &lt;span class="pl-s1"&gt;save_path&lt;/span&gt;):
    &lt;span class="pl-s"&gt;"""&lt;/span&gt;
&lt;span class="pl-s"&gt;    Download a file from a URL and save it to a local path.&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;    :param url: The URL of the file to download.&lt;/span&gt;
&lt;span class="pl-s"&gt;    :param save_path: The local path to save the file.&lt;/span&gt;
&lt;span class="pl-s"&gt;    """&lt;/span&gt;
    &lt;span class="pl-s1"&gt;response&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;requests&lt;/span&gt;.&lt;span class="pl-en"&gt;get&lt;/span&gt;(&lt;span class="pl-s1"&gt;url&lt;/span&gt;, &lt;span class="pl-s1"&gt;stream&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;)
    &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-s1"&gt;response&lt;/span&gt;.&lt;span class="pl-s1"&gt;status_code&lt;/span&gt; &lt;span class="pl-c1"&gt;==&lt;/span&gt; &lt;span class="pl-c1"&gt;200&lt;/span&gt;:
        &lt;span class="pl-k"&gt;with&lt;/span&gt; &lt;span class="pl-en"&gt;open&lt;/span&gt;(&lt;span class="pl-s1"&gt;save_path&lt;/span&gt;, &lt;span class="pl-s"&gt;'wb'&lt;/span&gt;) &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-s1"&gt;f&lt;/span&gt;:
            &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;chunk&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;response&lt;/span&gt;.&lt;span class="pl-en"&gt;iter_content&lt;/span&gt;(&lt;span class="pl-c1"&gt;1024&lt;/span&gt;):
                &lt;span class="pl-s1"&gt;f&lt;/span&gt;.&lt;span class="pl-en"&gt;write&lt;/span&gt;(&lt;span class="pl-s1"&gt;chunk&lt;/span&gt;)
        &lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s"&gt;f"File downloaded successfully at &lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;save_path&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;"&lt;/span&gt;)
    &lt;span class="pl-k"&gt;else&lt;/span&gt;:
        &lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s"&gt;f"Failed to download file from &lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;url&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;. Status code: &lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;response&lt;/span&gt;.&lt;span class="pl-s1"&gt;status_code&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;"&lt;/span&gt;)&lt;/pre&gt;
&lt;p&gt;To use this function, simply call it with the URL of the file you want to download and the local path where you want to save it. For example:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-en"&gt;download_file&lt;/span&gt;(&lt;span class="pl-s"&gt;'https://example.com/myfile.txt'&lt;/span&gt;, &lt;span class="pl-s"&gt;'/path/to/save/myfile.txt'&lt;/span&gt;)&lt;/pre&gt;
&lt;p&gt;Note that this function uses the &lt;code&gt;stream=True&lt;/code&gt; parameter in the &lt;code&gt;requests.get()&lt;/code&gt; method to download the file in chunks, which can help prevent memory issues when downloading large files. It also checks the HTTP status code of the response before saving the file, so you can handle any errors that might occur during the download.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That's a very solid reply!&lt;/p&gt;
&lt;h4 id="mistral-7b-local"&gt;Mistral 7B via llm-llama-cpp or llm-gpt4all or llm-mlc&lt;/h4&gt;
&lt;p&gt;The smaller Mistral 7B model dropped back in September. It's since established itself as the most capable model family of that size - a size which is very convenient for running on personal devices.&lt;/p&gt;
&lt;p&gt;I'm even running Mistral 7B on my iPhone now, thanks to an update to the &lt;a href="https://apps.apple.com/us/app/mlc-chat/id6448482937"&gt;MLC Chat iOS app&lt;/a&gt; from a few days ago.&lt;/p&gt;
&lt;p&gt;There are a bunch of different options for running this model and its variants locally using LLM on a Mac - and probably other platforms too, though I've not tested these options myself on Linux or Windows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Using &lt;a href="https://github.com/simonw/llm-llama-cpp"&gt;llm-llama-cpp&lt;/a&gt;: download one of &lt;a href="https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF"&gt;these Mistral-7B-Instruct GGUF files&lt;/a&gt; for the chat-tuned version, or &lt;a href="https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/tree/main"&gt;one of these&lt;/a&gt; for base Mistral, then follow the steps listed above&lt;/li&gt;
&lt;li&gt;Using &lt;a href="https://github.com/simonw/llm-gpt4all"&gt;llm-gpt4all&lt;/a&gt;. This is the easiest plugin to install:
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-gpt4all&lt;/pre&gt;&lt;/div&gt;
The model will be downloaded the first time you try to use it:
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m mistral-7b-instruct-v0 &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Introduce yourself&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;Using &lt;a href="https://github.com/simonw/llm-mlc"&gt;llm-mlc&lt;/a&gt;. Follow the instructions in the README to install it, then:
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Download the model:&lt;/span&gt;
llm mlc download-model https://huggingface.co/mlc-ai/mlc-chat-Mistral-7B-Instruct-v0.2-q3f16_1
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Run it like this:&lt;/span&gt;
llm -m mlc-chat-Mistral-7B-Instruct-v0.2-q3f16_1 &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Introduce yourself&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each of these options works, but I've not yet spent time comparing them in terms of output quality or performance.&lt;/p&gt;
&lt;h4 id="mistral-api"&gt;Using the Mistral API, which includes the new Mistral-medium&lt;/h4&gt;
&lt;p&gt;Mistral also recently announced &lt;a href="https://mistral.ai/news/la-plateforme/"&gt;La plateforme&lt;/a&gt;, their early access API for calling hosted versions of their models.&lt;/p&gt;
&lt;p&gt;Their new API renames the Mistral 7B model to "Mistral-tiny" and the new Mixtral model to "Mistral-small"... and offers something called &lt;strong&gt;Mistral-medium&lt;/strong&gt; as well:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Our highest-quality endpoint currently serves a prototype model, that is currently among the top serviced models available based on standard benchmarks. It masters English/French/Italian/German/Spanish and code and obtains a score of 8.6 on MT-Bench.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I got access to their API and used it to build a new plugin, &lt;a href="https://github.com/simonw/llm-mistral"&gt;llm-mistral&lt;/a&gt;. Here's how to use that:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Install it:
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-mistral&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;Set your Mistral API key:
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm keys &lt;span class="pl-c1"&gt;set&lt;/span&gt; mistral
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; &amp;lt;paste key here&amp;gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;Run the models like this:
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m mistral-tiny &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Say hi&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Or mistral-small or mistral-medium&lt;/span&gt;
cat mycode.py &lt;span class="pl-k"&gt;|&lt;/span&gt; llm -m mistral-medium -s &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Explain this code&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ol&gt;
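&lt;p&gt;Under the hood this is a regular HTTPS API. As a rough sketch of the kind of request involved (assuming Mistral's OpenAI-style chat completions endpoint at &lt;code&gt;https://api.mistral.ai/v1/chat/completions&lt;/code&gt; - the helper name here is my own, not part of any plugin), you can build it with just the standard library:&lt;/p&gt;

```python
import json

# Assumed endpoint - check Mistral's API docs for the current URL.
API_URL = "https://api.mistral.ai/v1/chat/completions"

def build_request(prompt, model="mistral-tiny", api_key="YOUR-KEY"):
    """Hypothetical helper: return (url, headers, body) for a
    single-turn OpenAI-style chat completion request."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + api_key,
    }
    body = json.dumps({
        "model": model,  # or mistral-small / mistral-medium
        "messages": [{"role": "user", "content": prompt}],
    })
    return API_URL, headers, body

url, headers, body = build_request("Say hi")
print(url)
print(body)
# Actually sending it needs a real API key, e.g. via
# urllib.request.Request(url, data=body.encode(), headers=headers)
```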
&lt;p&gt;Here's their comparison table pitching Mistral Small and Medium against GPT-3.5:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2023/mistral-table.jpg" alt="MMLU (MCQ in 57 subjects): GPT - 3.5 scored 70%, Mistral Small scored 70.6%, Mistral Medium scored 75.3%. HellaSwag (10-shot): GPT - 3.5 scored 85.5%, Mistral Small scored 86.7%, Mistral Medium scored 88%. ARC Challenge (25-shot): GPT - 3.5 scored 85.2%, Mistral Small scored 85.8%, Mistral Medium scored 89.9%. WinoGrande (5-shot): GPT - 3.5 scored 81.6%, Mistral Small scored 81.2%, Mistral Medium scored 88%. MBPP (pass@1): GPT - 3.5 scored 52.2%, Mistral Small scored 60.7%, Mistral Medium scored 62.3%. GSM-8K (5-shot): GPT - 3.5 scored 57.1%, Mistral Small scored 58.4%, Mistral Medium scored 66.7%. MT Bench (for Instruct models): GPT - 3.5 scored 8.32, Mistral Small scored 8.30, Mistral Medium scored 8.61." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;These may well be cherry-picked, but note that Small beats GPT-3.5 on almost every metric, and Medium beats it on everything by a wider margin.&lt;/p&gt;
&lt;p&gt;Here's the &lt;a href="https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard"&gt;MT Bench leaderboard&lt;/a&gt; which includes scores for GPT-4 and Claude 2.1:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2023/mt-bench.jpg" alt="GPT-4-Turbo: Arena Elo rating 1217, MT-bench score 9.32. GPT-4-0613: Arena Elo rating 1152, MT-bench score 9.18. GPT-4-0314: Arena Elo rating 1201, MT-bench score 8.96. GPT-3.5-turbo-0613: Arena Elo rating 1112, MT-bench score 8.39. GPT-3.5-Turbo-1106: Arena Elo rating 1074, MT-bench score 8.32. Claude-2.1: Arena Elo rating 1118, MT-bench score 8.18." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;That 8.61 score for Medium puts it roughly halfway between GPT-3.5 and GPT-4.&lt;/p&gt;
&lt;p&gt;Benchmark scores are no replacement for spending time with a model to get a feel for how well it behaves across a wide spectrum of tasks, but these scores are extremely promising. GPT-4 may not hold the best model crown for much longer.&lt;/p&gt;
&lt;h4 id="mistral-other-apis"&gt;Mistral via other API providers&lt;/h4&gt;
&lt;p&gt;Since both Mistral 7B and Mixtral 8x7B are available under an Apache 2 license, there's been something of a race to the bottom in terms of pricing from other LLM hosting providers.&lt;/p&gt;
&lt;p&gt;This trend makes me a little nervous, since it actively disincentivizes future open model releases from Mistral and from other providers who are hoping to offer their own hosted versions.&lt;/p&gt;
&lt;p&gt;LLM has plugins for a bunch of these providers already. The three that I've tried so far are Replicate, Anyscale Endpoints and OpenRouter.&lt;/p&gt;
&lt;p&gt;For &lt;a href="https://replicate.com/"&gt;Replicate&lt;/a&gt; using &lt;a href="https://github.com/simonw/llm-replicate"&gt;llm-replicate&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-replicate
llm keys &lt;span class="pl-c1"&gt;set&lt;/span&gt; replicate
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; &amp;lt;paste API key here&amp;gt;&lt;/span&gt;
llm replicate add mistralai/mistral-7b-v0.1&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then run prompts like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m replicate-mistralai-mistral-7b-v0.1 &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;3 reasons to get a pet weasel:&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This example is the non-instruct tuned model, so the prompt needs to be shaped such that the model can complete it.&lt;/p&gt;
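&lt;p&gt;To illustrate the difference (these prompt strings are just examples of mine):&lt;/p&gt;

```python
# A base model continues text rather than following instructions,
# so shape the prompt as the beginning of the output you want.
instruct_prompt = "Give me 3 reasons to get a pet weasel."  # for chat-tuned models
completion_prompt = "3 reasons to get a pet weasel:\n1."    # for base models

# Given completion_prompt, a base model will keep writing the list;
# given instruct_prompt, it may simply continue the sentence instead.
print(completion_prompt)
```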
&lt;p&gt;For &lt;a href="https://www.anyscale.com/endpoints"&gt;Anyscale Endpoints&lt;/a&gt; using &lt;a href="https://github.com/simonw/llm-anyscale-endpoints"&gt;llm-anyscale-endpoints&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-anyscale-endpoints
llm keys &lt;span class="pl-c1"&gt;set&lt;/span&gt; anyscale-endpoints
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; &amp;lt;paste API key here&amp;gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Now you can run both the 7B and the Mixtral 8x7B models:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m mistralai/Mixtral-8x7B-Instruct-v0.1 \
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;3 reasons to get a pet weasel&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
llm -m mistralai/Mistral-7B-Instruct-v0.1 \
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;3 reasons to get a pet weasel&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And for &lt;a href="https://openrouter.ai/"&gt;OpenRouter&lt;/a&gt; using &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-openrouter
llm keys &lt;span class="pl-c1"&gt;set&lt;/span&gt; openrouter
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; &amp;lt;paste API key here&amp;gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then run the models like so:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m openrouter/mistralai/mistral-7b-instruct \
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;2 reasons to get a pet dragon&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
llm -m openrouter/mistralai/mixtral-8x7b-instruct \
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;2 reasons to get a pet dragon&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;OpenRouter are currently offering Mistral and Mixtral via their API for $0.00/1M input tokens - it's free! Obviously not sustainable, so don't rely on that continuing, but that does make them a great platform for running some initial experiments with these models.&lt;/p&gt;
&lt;h4 id="llamafile-openai"&gt;Using Llamafile's OpenAI API endpoint&lt;/h4&gt;
&lt;p&gt;I &lt;a href="https://simonwillison.net/2023/Nov/29/llamafile/"&gt;wrote about Llamafile&lt;/a&gt; recently: a fascinating option for running LLMs, where the model is bundled up in an executable that includes everything needed to run it, on multiple platforms.&lt;/p&gt;
&lt;p&gt;Justine Tunney released &lt;a href="https://huggingface.co/jartine/Mixtral-8x7B-v0.1.llamafile/tree/main"&gt;llamafiles for Mixtral&lt;/a&gt; a few days ago.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://huggingface.co/jartine/Mixtral-8x7B-v0.1.llamafile/blob/main/mixtral-8x7b-instruct-v0.1.Q5_K_M-server.llamafile"&gt;mixtral-8x7b-instruct-v0.1.Q5_K_M-server.llamafile&lt;/a&gt; one runs an OpenAI-compatible API endpoint which LLM can talk to.&lt;/p&gt;
&lt;p&gt;Here's how to use that:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Download the llamafile:
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;curl -LO https://huggingface.co/jartine/Mixtral-8x7B-v0.1.llamafile/resolve/main/mixtral-8x7b-instruct-v0.1.Q5_K_M-server.llamafile&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;Start that running:
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;./mixtral-8x7b-instruct-v0.1.Q5_K_M-server.llamafile&lt;/pre&gt;&lt;/div&gt;
You may need to run &lt;code&gt;chmod 755 mixtral-8x7b-instruct-v0.1.Q5_K_M-server.llamafile&lt;/code&gt; first, though I found I didn't need to.&lt;/li&gt;
&lt;li&gt;Configure LLM to know about that endpoint, by adding the following to a file at &lt;code&gt;~/Library/Application Support/io.datasette.llm/extra-openai-models.yaml&lt;/code&gt;:
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;- &lt;span class="pl-ent"&gt;model_id&lt;/span&gt;: &lt;span class="pl-s"&gt;llamafile&lt;/span&gt;
  &lt;span class="pl-ent"&gt;model_name&lt;/span&gt;: &lt;span class="pl-s"&gt;llamafile&lt;/span&gt;
  &lt;span class="pl-ent"&gt;api_base&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;http://127.0.0.1:8080/v1&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
This registers a model called &lt;code&gt;llamafile&lt;/code&gt; which you can now call like this:
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m llamafile &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Say hello to the world&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Setting up that &lt;code&gt;llamafile&lt;/code&gt; alias means you'll be able to use the same CLI invocation for any llamafile models you run on that default 8080 port.&lt;/p&gt;
&lt;p&gt;The same exact approach should work for other model hosting options that provide an endpoint that imitates the OpenAI API.&lt;/p&gt;
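&lt;p&gt;Since the payload shape is the standard OpenAI chat completions format, you can also hit the llamafile server directly without going through LLM at all. Here's a minimal stdlib-only sketch (assuming the server from step 2 is running on port 8080; the helper name is my own):&lt;/p&gt;

```python
import json
import urllib.request

# Assumes a llamafile server listening on its default port 8080.
API_BASE = "http://127.0.0.1:8080/v1"

def build_chat_request(prompt, model="llamafile"):
    """Hypothetical helper: return (url, body_bytes) for an
    OpenAI-style chat completion against the local server."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return API_BASE + "/chat/completions", body

url, body = build_chat_request("Say hello to the world")
print(url)
# With the server running, send it like this:
#   req = urllib.request.Request(
#       url, data=body, headers={"Content-Type": "application/json"})
#   reply = json.loads(urllib.request.urlopen(req).read())
#   print(reply["choices"][0]["message"]["content"])
```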
&lt;h4&gt;This is LLM plugins working as intended&lt;/h4&gt;
&lt;p&gt;When I &lt;a href="https://simonwillison.net/2023/Jul/12/llm/"&gt;added plugin support to LLM&lt;/a&gt; this was exactly what I had in mind: I want it to be as easy as possible to add support for new models, both local and remotely hosted.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://llm.datasette.io/en/stable/plugins/directory.html"&gt;LLM plugin directory&lt;/a&gt; lists 19 plugins in total now.&lt;/p&gt;
&lt;p&gt;If you want to build your own plugin - for a locally hosted model or for one exposed via a remote API - the &lt;a href="https://llm.datasette.io/en/stable/plugins/tutorial-model-plugin.html"&gt;plugin author tutorial&lt;/a&gt; (plus reviewing code from the existing plugins) should hopefully provide everything you need.&lt;/p&gt;
&lt;p&gt;You're also welcome to join us in the &lt;a href="https://datasette.io/discord-llm"&gt;#llm Discord channel&lt;/a&gt; to talk about your plans for your project.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/plugins"&gt;plugins&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mistral"&gt;mistral&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llamafile"&gt;llamafile&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama-cpp"&gt;llama-cpp&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="cli"/><category term="plugins"/><category term="projects"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="mistral"/><category term="llamafile"/><category term="llama-cpp"/><category term="openrouter"/></entry><entry><title>llamafile is the new best way to run an LLM on your own computer</title><link href="https://simonwillison.net/2023/Nov/29/llamafile/#atom-tag" rel="alternate"/><published>2023-11-29T20:54:16+00:00</published><updated>2023-11-29T20:54:16+00:00</updated><id>https://simonwillison.net/2023/Nov/29/llamafile/#atom-tag</id><summary type="html">
    &lt;p&gt;Mozilla’s innovation group and Justine Tunney &lt;a href="https://hacks.mozilla.org/2023/11/introducing-llamafile/"&gt;just released llamafile&lt;/a&gt;, and I think it's now the single best way to get started running Large Language Models (think your own local copy of ChatGPT) on your own computer.&lt;/p&gt;
&lt;p&gt;A llamafile is a single multi-GB file that contains both the model weights for an LLM and the code needed to run that model - in some cases a full local server with a web UI for interacting with it.&lt;/p&gt;
&lt;p&gt;The executable is compiled using &lt;a href="https://justine.lol/cosmopolitan/index.html"&gt;Cosmopolitan Libc&lt;/a&gt;, Justine's incredible project that supports compiling a single binary that works, unmodified, on multiple different operating systems and hardware architectures.&lt;/p&gt;
&lt;p&gt;Here's how to get started with &lt;a href="https://llava-vl.github.io/"&gt;LLaVA 1.5&lt;/a&gt;, a large multimodal model (which means text and image inputs, like GPT-4 Vision) fine-tuned on top of Llama 2. I've tested this process on an M2 Mac, but it should work on other platforms as well (though be sure to &lt;a href="https://github.com/mozilla-Ocho/llamafile#gotchas"&gt;read the Gotchas&lt;/a&gt; section of the README, and take a look at Justine's &lt;a href="https://news.ycombinator.com/item?id=38465916"&gt;list of supported platforms&lt;/a&gt; in a comment on Hacker News).&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Download the 4.29GB &lt;code&gt;llava-v1.5-7b-q4.llamafile&lt;/code&gt; file &lt;a href="https://huggingface.co/jartine/llava-v1.5-7B-GGUF/tree/main"&gt;from Justine's repository on Hugging Face&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;curl -LO https://huggingface.co/jartine/llava-v1.5-7B-GGUF/resolve/main/llava-v1.5-7b-q4.llamafile&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Make that binary executable, by running this in a terminal:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;chmod 755 llava-v1.5-7b-q4.llamafile&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Run your new executable, which will start a web server on port 8080:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;./llava-v1.5-7b-q4.llamafile&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Navigate to &lt;code&gt;http://127.0.0.1:8080/&lt;/code&gt; to start interacting with the model in your browser.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;That's all there is to it. On my M2 Mac it runs at around 55 tokens a second, which is &lt;em&gt;really&lt;/em&gt; fast. And it can analyze images - here's what I got when I uploaded a photograph and asked "Describe this plant":&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2023/llamafile.jpg" alt="Screenshot. llama.cpp - then a photo I took of a plant

User: Describe this plant

Llama: The image features a large, green plant with numerous thin branches and leaves. Among the many stems of this plant, there is an orange flower visible near its center. This beautifully decorated plant stands out in the scene due to its vibrant colors and intricate structure.

18ms per token, 54.24 tokens per second
Powered by llama.cpp, ggml.ai, and llamafile" style="max-width: 100%" /&gt;&lt;/p&gt;
&lt;h4 id="llamafile-how-this-works"&gt;How this works&lt;/h4&gt;
&lt;p&gt;There are a number of different components working together here to make this work.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The LLaVA 1.5 model by Haotian Liu, Chunyuan Li, Yuheng Li and Yong Jae Lee is &lt;a href="https://arxiv.org/abs/2310.03744"&gt;described in this paper&lt;/a&gt;, with further details on &lt;a href="https://llava-vl.github.io/"&gt;llava-vl.github.io&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;The models are executed using &lt;a href="https://github.com/ggerganov/llama.cpp"&gt;llama.cpp&lt;/a&gt;, and in the above demo also use the &lt;code&gt;llama.cpp&lt;/code&gt; server example to provide the UI.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://justine.lol/cosmopolitan/index.html"&gt;Cosmopolitan Libc&lt;/a&gt; is the magic that makes one binary work on multiple platforms. I wrote more about that in a TIL a few months ago, &lt;a href="https://til.simonwillison.net/cosmopolitan/ecosystem"&gt;Catching up with the Cosmopolitan ecosystem&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="llamafile-trying-other-models"&gt;Trying more models&lt;/h4&gt;
&lt;p&gt;The &lt;a href="https://github.com/mozilla-Ocho/llamafile"&gt;llamafile README&lt;/a&gt; currently links to binaries for &lt;code&gt;Mistral-7B-Instruct&lt;/code&gt;, &lt;code&gt;LLaVA 1.5&lt;/code&gt; and &lt;code&gt;WizardCoder-Python-13B&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;You can also download a much smaller &lt;code&gt;llamafile&lt;/code&gt; binary from &lt;a href="https://github.com/Mozilla-Ocho/llamafile/releases/tag/0.1"&gt;their releases&lt;/a&gt;, which can then execute any model that has been compiled to GGUF format:&lt;/p&gt;
&lt;p&gt;I grabbed &lt;code&gt;llamafile-server-0.1&lt;/code&gt; (4.45MB) like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;curl -LO https://github.com/Mozilla-Ocho/llamafile/releases/download/0.1/llamafile-server-0.1
chmod 755 llamafile-server-0.1&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then ran it against a 13GB &lt;code&gt;llama-2-13b.Q8_0.gguf&lt;/code&gt; file I had &lt;a href="https://huggingface.co/TheBloke/Llama-2-13B-GGUF/tree/main"&gt;previously downloaded&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;./llamafile-server-0.1 -m llama-2-13b.Q8_0.gguf&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This gave me the same interface at &lt;code&gt;http://127.0.0.1:8080/&lt;/code&gt; (without the image upload) and let me talk with the model at 24 tokens per second.&lt;/p&gt;
&lt;h4 id="llamafile-one-file"&gt;One file is all you need&lt;/h4&gt;
&lt;p&gt;I think my favourite thing about llamafile is what it represents. This is a single binary file which you can download and then use, forever, on (almost) any computer.&lt;/p&gt;

&lt;p&gt;You don't need a network connection, and you don't need to keep track of more than one file.&lt;/p&gt;

&lt;p&gt;Stick that file on a USB stick and stash it in a drawer as insurance against a future apocalypse. You'll never be without a language model ever again.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/mozilla"&gt;mozilla&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cosmopolitan"&gt;cosmopolitan&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mistral"&gt;mistral&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llamafile"&gt;llamafile&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/justine-tunney"&gt;justine-tunney&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama-cpp"&gt;llama-cpp&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="mozilla"/><category term="ai"/><category term="generative-ai"/><category term="cosmopolitan"/><category term="llama"/><category term="local-llms"/><category term="llms"/><category term="mistral"/><category term="llamafile"/><category term="justine-tunney"/><category term="llama-cpp"/></entry></feed>