<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: gemma</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/gemma.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2026-04-12T23:57:53+00:00</updated><author><name>Simon Willison</name></author><entry><title>Gemma 4 audio with MLX</title><link href="https://simonwillison.net/2026/Apr/12/mlx-audio/#atom-tag" rel="alternate"/><published>2026-04-12T23:57:53+00:00</published><updated>2026-04-12T23:57:53+00:00</updated><id>https://simonwillison.net/2026/Apr/12/mlx-audio/#atom-tag</id><summary type="html">
    &lt;p&gt;Thanks to a &lt;a href="https://twitter.com/RahimNathwani/status/2039961945613209852"&gt;tip from Rahim Nathwani&lt;/a&gt;, here's a &lt;code&gt;uv run&lt;/code&gt; recipe for transcribing an audio file on macOS using the 10.28 GB &lt;a href="https://huggingface.co/google/gemma-4-E2B"&gt;Gemma 4 E2B model&lt;/a&gt; with MLX and &lt;a href="https://github.com/Blaizzy/mlx-vlm"&gt;mlx-vlm&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uv run --python 3.13 --with mlx_vlm --with torchvision --with gradio \
  mlx_vlm.generate \
  --model google/gemma-4-e2b-it \
  --audio file.wav \
  --prompt "Transcribe this audio" \
  --max-tokens 500 \
  --temperature 1.0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;audio controls style="width: 100%"&gt;
  &lt;source src="https://static.simonwillison.net/static/2026/demo-audio-for-gemma.wav" type="audio/wav"&gt;
  Your browser does not support the audio element.
&lt;/audio&gt;&lt;/p&gt;
&lt;p&gt;I tried it on &lt;a href="https://static.simonwillison.net/static/2026/demo-audio-for-gemma.wav"&gt;this 14 second &lt;code&gt;.wav&lt;/code&gt; file&lt;/a&gt; and it output the following:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This front here is a quick voice memo. I want to try it out with MLX VLM. Just going to see if it can be transcribed by Gemma and how that works.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;(That was supposed to be "This right here..." and "... how well that works", but I can hear why it misheard those as "front here" and "how that works".)&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemma"&gt;gemma&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/speech-to-text"&gt;speech-to-text&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="uv"/><category term="mlx"/><category term="gemma"/><category term="speech-to-text"/></entry><entry><title>Gemma 4: Byte for byte, the most capable open models</title><link href="https://simonwillison.net/2026/Apr/2/gemma-4/#atom-tag" rel="alternate"/><published>2026-04-02T18:28:54+00:00</published><updated>2026-04-02T18:28:54+00:00</updated><id>https://simonwillison.net/2026/Apr/2/gemma-4/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/"&gt;Gemma 4: Byte for byte, the most capable open models&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Four new vision-capable Apache 2.0 licensed reasoning LLMs from Google DeepMind, sized at 2B, 4B, and 31B, plus a 26B-A4B Mixture-of-Experts.&lt;/p&gt;
&lt;p&gt;Google emphasize "unprecedented level of intelligence-per-parameter", providing yet more evidence that creating small useful models is one of the hottest areas of research right now.&lt;/p&gt;
&lt;p&gt;They actually label the two smaller models as E2B and E4B for "Effective" parameter size. The system card explains:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The smaller models incorporate Per-Layer Embeddings (PLE) to maximize parameter efficiency in on-device deployments. Rather than adding more layers or parameters to the model, PLE gives each decoder layer its own small embedding for every token. These embedding tables are large but are only used for quick lookups, which is why the effective parameter count is much smaller than the total.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I don't entirely understand that, but apparently that's what the "E" in E2B means!&lt;/p&gt;
&lt;p&gt;One particularly exciting feature of these models is that they are multi-modal beyond just images:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Vision and audio&lt;/strong&gt;: All models natively process video and images, supporting variable resolutions, and excelling at visual tasks like OCR and chart understanding. Additionally, the E2B and E4B models feature native audio input for speech recognition and understanding.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I've not figured out a way to run audio input locally - I don't think that feature is in LM Studio or Ollama yet.&lt;/p&gt;
&lt;p&gt;I tried them out using the GGUFs for &lt;a href="https://lmstudio.ai/models/gemma-4"&gt;LM Studio&lt;/a&gt;. The 2B (4.41GB), 4B (6.33GB) and 26B-A4B (17.99GB) models all worked perfectly, but the 31B (19.89GB) model was broken and spat out &lt;code&gt;"---\n"&lt;/code&gt; in a loop for every prompt I tried.&lt;/p&gt;
&lt;p&gt;The progression in &lt;a href="https://gist.github.com/simonw/12ae4711288637a722fd6bd4b4b56bdb"&gt;pelican quality&lt;/a&gt; from 2B to 4B to 26B-A4B is notable:&lt;/p&gt;
&lt;p&gt;E2B:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Two blue circles on a brown rectangle and a weird mess of orange blob and yellow triangle for the pelican" src="https://static.simonwillison.net/static/2026/gemma-4-2b-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;E4B:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Two black wheels joined by a sort of grey surfboard, the pelican is semicircles and a blue blob floating above it" src="https://static.simonwillison.net/static/2026/gemma-4-4b-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;26B-A4B:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Bicycle has the right pieces although the frame is wonky. Pelican is genuinely good, has a big triangle beak and a nice curved neck and is clearly a bird that is sitting on the bicycle" src="https://static.simonwillison.net/static/2026/gemma-4-26b-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;(This one actually had an SVG error - "error on line 18 at column 88: Attribute x1 redefined" - but after &lt;a href="https://gist.github.com/simonw/12ae4711288637a722fd6bd4b4b56bdb?permalink_comment_id=6074105#gistcomment-6074105"&gt;fixing that&lt;/a&gt; I got probably the best pelican I've seen yet from a model that runs on my laptop.)&lt;/p&gt;
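&lt;p&gt;That kind of validation error - a repeated attribute on a single element - is easy to spot mechanically. Here's a quick hypothetical Python helper (an illustration, not the fix I actually used) that scans an element's attribute string for duplicates:&lt;/p&gt;

```python
import re

def find_duplicate_attrs(tag_text):
    """Return attribute names that appear more than once in an element's
    attribute string - the cause of an "Attribute x1 redefined" error
    from an XML parser on invalid SVG."""
    names = re.findall(r'([\w:-]+)=', tag_text)
    seen, dupes = set(), []
    for name in names:
        if name in seen and name not in dupes:
            dupes.append(name)
        seen.add(name)
    return dupes

# The attribute string from a broken line element (angle brackets omitted):
print(find_duplicate_attrs('line x1="0" y1="0" x1="5" y2="9"'))  # → ['x1']
```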
&lt;p&gt;Google are providing API access to the two larger Gemma models via their &lt;a href="https://aistudio.google.com/prompts/new_chat?model=gemma-4-31b-it"&gt;AI Studio&lt;/a&gt;. I added support to &lt;a href="https://github.com/simonw/llm-gemini"&gt;llm-gemini&lt;/a&gt; and then &lt;a href="https://gist.github.com/simonw/f9f9e9c34c7cc0ef5325a2876413e51e"&gt;ran a pelican&lt;/a&gt; through the 31B model using that:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m gemini/gemma-4-31b-it 'Generate an SVG of a pelican riding a bicycle'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Pretty good, though it is missing the front part of the bicycle frame:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Motion blur lines, a mostly great bicycle albeit missing the front part of the frame. Pelican is decent. " src="https://static.simonwillison.net/static/2026/gemma-4-31b-pelican.png" /&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemma"&gt;gemma&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="vision-llms"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="gemma"/><category term="llm-release"/><category term="lm-studio"/></entry><entry><title>llm-gemini 0.30</title><link href="https://simonwillison.net/2026/Apr/2/llm-gemini/#atom-tag" rel="alternate"/><published>2026-04-02T18:25:08+00:00</published><updated>2026-04-02T18:25:08+00:00</updated><id>https://simonwillison.net/2026/Apr/2/llm-gemini/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;Release:&lt;/strong&gt; &lt;a href="https://github.com/simonw/llm-gemini/releases/tag/0.30"&gt;llm-gemini 0.30&lt;/a&gt;&lt;/p&gt;
    &lt;p&gt;New models &lt;code&gt;gemini-3.1-flash-lite-preview&lt;/code&gt;, &lt;code&gt;gemma-4-26b-a4b-it&lt;/code&gt; and &lt;code&gt;gemma-4-31b-it&lt;/code&gt;. See &lt;a href="https://simonwillison.net/2026/Apr/2/gemma-4/"&gt;my notes on Gemma 4&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemma"&gt;gemma&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="llm"/><category term="gemini"/><category term="gemma"/></entry><entry><title>Introducing EmbeddingGemma</title><link href="https://simonwillison.net/2025/Sep/4/embedding-gemma/#atom-tag" rel="alternate"/><published>2025-09-04T22:27:41+00:00</published><updated>2025-09-04T22:27:41+00:00</updated><id>https://simonwillison.net/2025/Sep/4/embedding-gemma/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://developers.googleblog.com/en/introducing-embeddinggemma/"&gt;Introducing EmbeddingGemma&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Brand new open weights (under the slightly janky &lt;a href="https://ai.google.dev/gemma/terms"&gt;Gemma license&lt;/a&gt;) 308M parameter embedding model from Google:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Based on the Gemma 3 architecture, EmbeddingGemma is trained on 100+ languages and is small enough to run on less than 200MB of RAM with quantization.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It's available via &lt;a href="https://ai.google.dev/gemma/docs/embeddinggemma/fine-tuning-embeddinggemma-with-sentence-transformers"&gt;sentence-transformers&lt;/a&gt;, &lt;a href="https://huggingface.co/collections/ggml-org/embeddinggemma-300m-68b2a87d78ca52408f7918f3"&gt;llama.cpp&lt;/a&gt;, &lt;a href="https://huggingface.co/collections/mlx-community/embeddinggemma-68b9a55aac55466fbd514f7c"&gt;MLX&lt;/a&gt;, &lt;a href="https://ollama.com/library/embeddinggemma"&gt;Ollama&lt;/a&gt;, &lt;a href="https://lmstudio.ai/models/google/embedding-gemma-300m"&gt;LMStudio&lt;/a&gt; and more. &lt;/p&gt;
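&lt;p&gt;The standard trick with an embedding model like this is semantic search: embed your documents once, then rank them against an embedded query by cosine similarity. Here's an illustrative sketch - the commented-out lines show how loading the model via &lt;code&gt;sentence-transformers&lt;/code&gt; might look (the checkpoint name is my assumption), while the toy vectors below stand in for real embeddings:&lt;/p&gt;

```python
import math

# Loading the real model would look something like this (assumes the
# sentence-transformers package; the checkpoint name is an assumption):
#
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("google/embeddinggemma-300m")
#   vectors = model.encode(["a pelican riding a bicycle", ...])

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy 3-dimensional vectors standing in for real, much higher-dimensional embeddings:
query = [0.1, 0.9, 0.2]
docs = {"cycling": [0.2, 0.8, 0.1], "weather": [0.9, 0.1, 0.3]}
ranked = sorted(docs, key=lambda k: cosine_similarity(query, docs[k]), reverse=True)
print(ranked)  # → ['cycling', 'weather']
```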
&lt;p&gt;As usual for these smaller models there's a &lt;a href="https://huggingface.co/blog/embeddinggemma#transformersjs"&gt;Transformers.js&lt;/a&gt; demo (&lt;a href="https://twitter.com/xenovacom/status/1963638444233511016"&gt;via&lt;/a&gt;) that runs directly in the browser (in Chrome variants) - &lt;a href="https://huggingface.co/spaces/webml-community/semantic-galaxy"&gt;Semantic Galaxy&lt;/a&gt; loads a ~400MB model and then lets you run embeddings against hundreds of text sentences, map them in a 2D space and run similarity searches to zoom to points within that space.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of The Semantic Galaxy web application interface showing a semantic search tool with a left sidebar containing &amp;quot;Your Dataset&amp;quot; with sample text &amp;quot;The sun peeked through the clouds after a drizzly&amp;quot; and a blue &amp;quot;Generate Galaxy&amp;quot; button, below which is text &amp;quot;Galaxy generated with 106 points. Ready to explore!&amp;quot; followed by &amp;quot;Search Results&amp;quot; listing various text snippets with similarity scores to the search term &amp;quot;pelican riding a bicycle&amp;quot; such as &amp;quot;The cyclist pedaled up the steep hill... 0.491&amp;quot;, &amp;quot;It was so hot that even the birds sou... 0.446&amp;quot;, etc. The main area shows a dark starfield visualization with white dots representing semantic clusters and text snippets floating as labels near the clusters." src="https://static.simonwillison.net/static/2025/semantic-galaxy-transformers.jpg" /&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/embeddings"&gt;embeddings&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/transformers-js"&gt;transformers-js&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemma"&gt;gemma&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/janky-licenses"&gt;janky-licenses&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="ai"/><category term="embeddings"/><category term="transformers-js"/><category term="gemma"/><category term="janky-licenses"/></entry><entry><title>Introducing Gemma 3 270M: The compact model for hyper-efficient AI</title><link href="https://simonwillison.net/2025/Aug/14/gemma-3-270m/#atom-tag" rel="alternate"/><published>2025-08-14T17:22:36+00:00</published><updated>2025-08-14T17:22:36+00:00</updated><id>https://simonwillison.net/2025/Aug/14/gemma-3-270m/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://developers.googleblog.com/en/introducing-gemma-3-270m/"&gt;Introducing Gemma 3 270M: The compact model for hyper-efficient AI&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;New from Google:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Gemma 3 270M, a compact, 270-million parameter model designed from the ground up for task-specific fine-tuning with strong instruction-following and text structuring capabilities already trained in.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This model is &lt;em&gt;tiny&lt;/em&gt;. The version I tried was &lt;a href="https://lmstudio.ai/models/google/gemma-3-270m"&gt;the LM Studio GGUF one&lt;/a&gt;, a 241MB download.&lt;/p&gt;
&lt;p&gt;It works! You can say "hi" to it and ask it very basic questions like "What is the capital of France".&lt;/p&gt;
&lt;p&gt;I tried "Generate an SVG of a pelican riding a bicycle" &lt;a href="https://gist.github.com/simonw/25e7b7afd6a63a2f15db48b3a51ec9bc"&gt;about a dozen times&lt;/a&gt; and didn't once get back an SVG that was more than just a blank square... but at one point it did decide to write me this poem instead, which was nice:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;+-----------------------+
|   Pelican Riding Bike |
+-----------------------+
|  This is the cat!  |
|  He's got big wings and a happy tail.  |
|  He loves to ride his bike!  |
+-----------------------+
|   Bike lights are shining bright.  |
|   He's got a shiny top, too!  |
|   He's ready for adventure!  |
+-----------------------+
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That's not really the point though. The Gemma 3 team make it very clear that the goal of this model is to support fine-tuning: a model this tiny is never going to be useful for general purpose LLM tasks, but given the right fine-tuning data it should be able to specialize for all sorts of things:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In engineering, success is defined by efficiency, not just raw power. You wouldn't use a sledgehammer to hang a picture frame. The same principle applies to building with AI.&lt;/p&gt;
&lt;p&gt;Gemma 3 270M embodies this "right tool for the job" philosophy. It's a high-quality foundation model that follows instructions well out of the box, and its true power is unlocked through fine-tuning. Once specialized, it can execute tasks like text classification and data extraction with remarkable accuracy, speed, and cost-effectiveness. By starting with a compact, capable model, you can build production systems that are lean, fast, and dramatically cheaper to operate.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's their tutorial on &lt;a href="https://ai.google.dev/gemma/docs/core/huggingface_text_full_finetune"&gt;Full Model Fine-Tune using Hugging Face Transformers&lt;/a&gt;, which I have not yet attempted to follow.&lt;/p&gt;
&lt;p&gt;I imagine this model will be particularly fun to play with directly in a browser using &lt;a href="https://huggingface.co/docs/transformers.js/en/index"&gt;transformers.js&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: It is! Here's &lt;a href="https://huggingface.co/spaces/webml-community/bedtime-story-generator"&gt;a bedtime story generator&lt;/a&gt; using Transformers.js (requires WebGPU, so Chrome-like browsers only). Here's &lt;a href="https://huggingface.co/spaces/webml-community/bedtime-story-generator/tree/main"&gt;the source code&lt;/a&gt; for that demo.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=44902148"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemma"&gt;gemma&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="gemini"/><category term="pelican-riding-a-bicycle"/><category term="gemma"/><category term="llm-release"/><category term="lm-studio"/></entry><entry><title>Introducing Gemma 3n: The developer guide</title><link href="https://simonwillison.net/2025/Jun/26/gemma-3n/#atom-tag" rel="alternate"/><published>2025-06-26T21:08:36+00:00</published><updated>2025-06-26T21:08:36+00:00</updated><id>https://simonwillison.net/2025/Jun/26/gemma-3n/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://developers.googleblog.com/en/introducing-gemma-3n-developer-guide/"&gt;Introducing Gemma 3n: The developer guide&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Extremely consequential new open weights model release from Google today:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Multimodal by design:&lt;/strong&gt; Gemma 3n natively supports image, audio, video, and text inputs and text outputs.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Optimized for on-device:&lt;/strong&gt; Engineered with a focus on efficiency, Gemma 3n models are available in two sizes based on &lt;a href="https://developers.googleblog.com/en/introducing-gemma-3n-developer-guide/#per-layer-embeddings-(ple):-unlocking-more-memory-efficiency"&gt;&lt;strong&gt;effective&lt;/strong&gt;&lt;/a&gt; parameters: E2B and E4B. While their raw parameter count is 5B and 8B respectively, architectural innovations allow them to run with a memory footprint comparable to traditional 2B and 4B models, operating with as little as 2GB (E2B) and 3GB (E4B) of memory.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is &lt;strong&gt;very&lt;/strong&gt; exciting: a 2B and 4B model optimized for end-user devices which accepts text, images &lt;em&gt;and&lt;/em&gt; audio as inputs!&lt;/p&gt;
&lt;p&gt;Gemma 3n is also the most comprehensive day one launch I've seen for any model: Google partnered with "AMD, Axolotl, Docker, Hugging Face, llama.cpp, LMStudio, MLX, NVIDIA, Ollama, RedHat, SGLang, Unsloth, and vLLM" so there are dozens of ways to try this out right now.&lt;/p&gt;
&lt;p&gt;So far I've run two variants on my Mac laptop. Ollama offer &lt;a href="https://ollama.com/library/gemma3n"&gt;a 7.5GB version&lt;/a&gt; (full tag &lt;code&gt;gemma3n:e4b-it-q4_K_M&lt;/code&gt;) of the 4B model, which I ran like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama pull gemma3n
llm install llm-ollama
llm -m gemma3n:latest "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It drew me this:&lt;/p&gt;
&lt;p&gt;&lt;img alt="The pelican looks a bit like a grey pig. It is floating above a bicycle that looks more like a rail cart." src="https://static.simonwillison.net/static/2025/gemma3n-ollama.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;The Ollama version doesn't appear to support image or audio input yet.&lt;/p&gt;
&lt;p&gt;... but the &lt;a href="https://github.com/Blaizzy/mlx-vlm"&gt;mlx-vlm&lt;/a&gt; version does!&lt;/p&gt;
&lt;p&gt;First I tried that on &lt;a href="https://static.simonwillison.net/static/2025/pelican-joke-request.wav"&gt;this WAV file&lt;/a&gt; like so (using a recipe adapted from &lt;a href="https://www.youtube.com/watch?v=8-8R2UvUBrc"&gt;Prince Canuma's video&lt;/a&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uv run --with mlx-vlm mlx_vlm.generate \
  --model gg-hf-gm/gemma-3n-E4B-it \
  --max-tokens 100 \
  --temperature 0.7 \
  --prompt "Transcribe the following speech segment in English:" \
  --audio pelican-joke-request.wav
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That downloaded a 15.74 GB bfloat16 version of the model and output the following correct transcription:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Tell me a joke about a pelican.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Then I had it draw me a pelican for good measure:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uv run --with mlx-vlm mlx_vlm.generate \
  --model gg-hf-gm/gemma-3n-E4B-it \
  --max-tokens 100 \
  --temperature 0.7 \
  --prompt "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I quite like this one:&lt;/p&gt;
&lt;p&gt;&lt;img alt="The bicycle is a sin wave, but you can make out which bits are the frame and the wheels. The pelican is white with a black triangular bill. It's a much better attempt than the Ollama one." src="https://static.simonwillison.net/static/2025/gemma3n-mlx-vlm.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;It's interesting to see such a striking visual difference between those 7.5GB and 15GB model quantizations.&lt;/p&gt;
&lt;p&gt;Finally, I had it describe the image it had just created:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uv run --with mlx-vlm mlx_vlm.generate \
  --model gg-hf-gm/gemma-3n-E4B-it \
  --max-tokens 1000 \
  --prompt "Describe image" \
  --image gemma3n-mlx-vlm.jpg
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's what I got back - it thought this was a chemical diagram!&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The image is a cartoon-style illustration depicting a molecular structure against a light blue background. The structure is composed of several differently colored and shaped elements connected by curved black lines. &lt;/p&gt;
&lt;p&gt;Here's a breakdown of the elements:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Large Black Circles:&lt;/strong&gt; There are two large, solid black circles, likely representing larger molecules or components.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cream-Colored Oval:&lt;/strong&gt; A prominent cream-colored oval shape is centrally located and connected to other elements. It has a smaller, darker cream-colored sphere attached to its left side.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Red Oval:&lt;/strong&gt; A smaller, solid red oval is positioned near one of the black circles and connected to the cream-colored oval.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Black Lines:&lt;/strong&gt; These lines act as bonds, connecting the various molecular components. They are curved and dynamic, suggesting movement or interaction.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Triangular Shape:&lt;/strong&gt; A small black triangle is attached to the smaller cream-colored sphere.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Letter "I":&lt;/strong&gt; The letter "I" appears twice, likely labeling specific parts of the molecule. &lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The overall impression is of a simplified representation of a biological molecule, possibly a protein or a complex organic compound. The use of different colors helps to distinguish the various components within the structure.&lt;/p&gt;
&lt;/blockquote&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/audio"&gt;audio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemma"&gt;gemma&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prince-canuma"&gt;prince-canuma&lt;/a&gt;&lt;/p&gt;



</summary><category term="audio"/><category term="google"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="vision-llms"/><category term="mlx"/><category term="ollama"/><category term="pelican-riding-a-bicycle"/><category term="gemma"/><category term="llm-release"/><category term="prince-canuma"/></entry><entry><title>Trying out llama.cpp's new vision support</title><link href="https://simonwillison.net/2025/May/10/llama-cpp-vision/#atom-tag" rel="alternate"/><published>2025-05-10T06:29:10+00:00</published><updated>2025-05-10T06:29:10+00:00</updated><id>https://simonwillison.net/2025/May/10/llama-cpp-vision/#atom-tag</id><summary type="html">
    &lt;p&gt;This &lt;a href="https://github.com/ggml-org/llama.cpp/pull/12898"&gt;llama.cpp server vision support via libmtmd&lt;/a&gt; pull request - via &lt;a href="https://news.ycombinator.com/item?id=43943047"&gt;Hacker News&lt;/a&gt; - was merged earlier today. The PR finally adds full support for vision models to the excellent &lt;a href="https://github.com/ggml-org/llama.cpp"&gt;llama.cpp&lt;/a&gt; project. It's documented &lt;a href="https://github.com/ggml-org/llama.cpp/blob/master/docs/multimodal.md"&gt;on this page&lt;/a&gt;, but the more detailed technical details are &lt;a href="https://github.com/ggml-org/llama.cpp/tree/master/tools/mtmd#multimodal-support-in-llamacpp"&gt;covered here&lt;/a&gt;. Here are my notes on getting it working on a Mac.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;llama.cpp&lt;/code&gt; models are usually distributed as &lt;code&gt;.gguf&lt;/code&gt; files. This project introduces a new variant of those called &lt;code&gt;mmproj&lt;/code&gt;, for multimodal projector. &lt;code&gt;libmtmd&lt;/code&gt; is the new library for handling these.&lt;/p&gt;
&lt;p&gt;You can try it out by compiling &lt;code&gt;llama.cpp&lt;/code&gt; from source, but I found another option that works: you can download pre-compiled binaries from the &lt;a href="https://github.com/ggml-org/llama.cpp/releases"&gt;GitHub releases&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;On macOS there's an extra step to jump through to get these working, which I'll describe below.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: it turns out the &lt;a href="https://formulae.brew.sh/formula/llama.cpp"&gt;Homebrew package&lt;/a&gt; for &lt;code&gt;llama.cpp&lt;/code&gt; turns things around &lt;em&gt;extremely&lt;/em&gt; quickly. You can run &lt;code&gt;brew install llama.cpp&lt;/code&gt; or &lt;code&gt;brew upgrade llama.cpp&lt;/code&gt; and start running the below tools without any extra steps.&lt;/p&gt;

&lt;p&gt;I downloaded the &lt;code&gt;llama-b5332-bin-macos-arm64.zip&lt;/code&gt; file from &lt;a href="https://github.com/ggml-org/llama.cpp/releases/tag/b5332"&gt;this GitHub release&lt;/a&gt; and unzipped it, which created a &lt;code&gt;build/bin&lt;/code&gt; directory.&lt;/p&gt;
&lt;p&gt;That directory contains a bunch of binary executables and a whole lot of &lt;code&gt;.dylib&lt;/code&gt; files. macOS wouldn't let me execute these files because they were quarantined. Running this command fixed that for the &lt;code&gt;llama-mtmd-cli&lt;/code&gt; and &lt;code&gt;llama-server&lt;/code&gt; executables and the &lt;code&gt;.dylib&lt;/code&gt; files they needed:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sudo xattr -rd com.apple.quarantine llama-server llama-mtmd-cli *.dylib
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now I can run an interactive terminal LLM session using this command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;./llama-mtmd-cli -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first time this runs it fetches four files from &lt;a href="https://huggingface.co/unsloth/gemma-3-4b-it-GGUF"&gt;unsloth/gemma-3-4b-it-GGUF&lt;/a&gt; - a quantized copy of Google DeepMind's &lt;a href="https://huggingface.co/google/gemma-3-4b-pt"&gt;Gemma 3 4B&lt;/a&gt; - and stores them in &lt;code&gt;~/Library/Caches/llama.cpp&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;2.4G unsloth_gemma-3-4b-it-GGUF_gemma-3-4b-it-UD-Q4_K_XL.gguf
217B unsloth_gemma-3-4b-it-GGUF_gemma-3-4b-it-UD-Q4_K_XL.gguf.json
812M unsloth_gemma-3-4b-it-GGUF_mmproj-F16.gguf
203B unsloth_gemma-3-4b-it-GGUF_mmproj-F16.gguf.json
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once they've downloaded, an interactive session starts. You can chat directly with the model or use the &lt;code&gt;/image path-to-image.jpg&lt;/code&gt; command to load an image into the context.&lt;/p&gt;
&lt;p&gt;I loaded in &lt;a href="https://static.simonwillison.net/static/2025/pelican-no-hat.jpg"&gt;a pelican photo&lt;/a&gt; and prompted "describe it" and got back this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It’s a striking photograph of a brown pelican standing in shallow water on a sandy beach. Here's a breakdown of what I see:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Subject:&lt;/strong&gt; The main focus is a large brown pelican. It’s standing with its wings fully extended, creating a dramatic pose.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Pose:&lt;/strong&gt; The pelican is in a slightly awkward, almost comical stance, with its legs planted firmly in the wet sand and its wings spread wide. It looks like it might be preparing to take off or is simply enjoying the water.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Color &amp;amp; Texture:&lt;/strong&gt; The pelican is a rich, dark brown with lighter markings on its wings and head. The sand is a muted beige color, and the water reflects the sunlight, creating a shimmering effect. You can see the texture of the
sand and the ripples in the water.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Lighting:&lt;/strong&gt; The lighting appears to be bright and sunny, casting shadows and highlighting the details of the pelican's feathers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Background:&lt;/strong&gt; The background is blurred, suggesting a shallow depth of field, which keeps the focus on the pelican. It appears to be a beach scene.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Overall Impression:&lt;/strong&gt; The image has a dynamic feel due to the pelican’s pose and the movement implied by the water. It’s a well-composed shot that captures a moment of wildlife in its natural habitat.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Do you want me to zoom in on a particular detail, or perhaps tell you more about pelicans?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Not bad for a 3.2GB model running on my laptop!&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/llama-vision.jpg" alt="Screenshot of the terminal session showing the response quoted above." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4 id="running-llama-server"&gt;Running llama-server&lt;/h4&gt;
&lt;p&gt;Even more fun is the &lt;a href="https://github.com/ggml-org/llama.cpp/tree/master/tools/server#llamacpp-http-server"&gt;&lt;code&gt;llama-server&lt;/code&gt; command&lt;/a&gt;. This starts a localhost web server running on port 8080 to serve the model, with both a web UI and an OpenAI-compatible API endpoint.&lt;/p&gt;
&lt;p&gt;The command to run it is the same:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;./llama-server -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now visit &lt;code&gt;http://localhost:8080&lt;/code&gt; in your browser to start interacting with the model:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/llama-server-vision.jpg" alt="screenshot of llama.cpp web UI - I've added a photo of some pelicans and the prompt decsribe these creatures, it came back with the following: Okay, let's describe these beautiful birds!  These are Brown Pelicans (Pelecanus occidentalis).  Here's a breakdown of their key features, based on the image:      Appearance: They are large, brown birds with distinctive long, pointed bills. Their plumage is a rich, chocolate brown, with lighter underparts. They have large, webbed feet.     Posture: As you can see, they're standing together on a concrete embankment, likely near the water's edge. They're in a somewhat huddled or attentive stance.     Size: They are quite large birds – adults can have a wingspan of up to 6.5 feet (2 meters).     Behavior: Pelicans are known for their unique feeding technique, where they plunge-dive into the water to scoop up fish with their pouches.  In the image, you can see:      A group of 6-7 Brown Pelicans.     A single bird in the foreground, slightly out of focus, showing a more detailed view of their feathers and feet.  Where they are: The presence of these birds suggests they are likely in a coastal or wetland environment – perhaps a bay, estuary, or near a large body of water.  Do you want me to delve deeper into any specific aspect of these birds, such as their habitat, diet, or conservation status? On the right is a Conversations sidebar with three other conversations listed." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;It miscounted the pelicans in &lt;a href="https://static.simonwillison.net/static/2025/pelican-group.jpg"&gt;the group photo&lt;/a&gt;, but again, this is a &lt;em&gt;tiny&lt;/em&gt; 3.2GB model.&lt;/p&gt;
&lt;p&gt;With the server running on port 8080 you can also access the OpenAI-compatible API endpoint. Here's how to do that using &lt;code&gt;curl&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;curl -X POST http://localhost:8080/v1/chat/completions \
  -H &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Content-Type: application/json&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \
  -d &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;{&lt;/span&gt;
&lt;span class="pl-s"&gt;    "messages": [&lt;/span&gt;
&lt;span class="pl-s"&gt;      {"role": "user", "content": "Describe a pelicans ideal corporate retreat"}&lt;/span&gt;
&lt;span class="pl-s"&gt;    ]&lt;/span&gt;
&lt;span class="pl-s"&gt;  }&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; jq&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I built a new plugin for LLM just now called &lt;a href="https://github.com/simonw/llm-llama-server"&gt;llm-llama-server&lt;/a&gt; to make interacting with this API more convenient. You can use that like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-llama-server
llm -m llama-server &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;invent a theme park ride for a pelican&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Or for vision models use &lt;code&gt;llama-server-vision&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m llama-server-vision &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;describe this image&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; -a https://static.simonwillison.net/static/2025/pelican-group.jpg&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The LLM plugin uses the streaming API, so responses will stream back to you as they are being generated.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/theme-park.gif" alt="Animated terminal session. $ llm -m llama-server 'invent a theme park ride for a pelican' Okay, this is a fun challenge! Let's design a theme park ride specifically for a pelican – a majestic, diving bird. Here’s my concept:  Ride Name: “Pelican’s Plunge”   Theme: Coastal Exploration &amp;amp; Underwater Discovery  Target Audience: Families with children (8+ recommended), animal lovers, and those who enjoy a mix of thrills and gentle exploration.  Ride Type: A partially submerged, rotating “pod” experience with a focus on simulated dives and underwater views.  Ride Mechanics:  1. The Pod: Guests ride in a large, semi-circular pod shaped like a stylized, open-mouthed pelican’s beak.  It’s made of reinforced, transparent acrylic and has comfortable seating inside. The pod can hold around 8-10 people.  2. The Launch: Guests board the pod and are positioned facing forward. The ride begins with a slow, gentle rise up a ramp, mimicking the pelican’s ascent from the water.   3. The &amp;quot;Dive&amp;quot; Sequence: This is the core of the ride.  The pod enters a large, darkened chamber that simulates the ocean floor.     * Rotating Simulation: The pod begins to rotate slowly, mimicking a pelican diving into the water.     * Water Effects:  The chamber is filled with realistic, moving water – created by a sophisticated system of pumps, jets, and screens. This creates the illusion of being surrounded by the ocean.    * Projection Mapping:  Powerful projection mapping is used on the walls and floor to create stunning underwater visuals: schools of fish, coral reefs, kelp forests, and even glimpses of marine life like sharks (safely projected, of course!).    * “Dive” Trigger:  At specific points in the rotation, the pod will perform a short, controlled “dive” – a sudden drop that creates a feeling of speed and the sensation of plunging into the water.  
Sensors detect the speed of the rotation and trigger these dives.  4. Underwater Exploration: After the initial dive, the pod continues its rotation, offering increasingly detailed views of the projected underwater scenes.      * Interactive Elements (Optional):  Small, strategically placed screens could display sonar-like visuals, allowing guests to “scan” the environment and reveal hidden creatures or details.  5. The Ascent &amp;amp; Return: The ride gradually slows, bringing the pod back to the surface. As it rises, the projections shift to show a sunny coastline and seabirds flying overhead. The pod returns to the loading area.   Theming &amp;amp; Atmosphere:  * Sound Design: Immersive sound effects – waves crashing, seabirds calling, underwater ambience – are crucial. * Lighting: Dynamic lighting that shifts with the projections and the &amp;quot;dive&amp;quot; sequences. * Pelican Props:  Realistic pelican statues and props are integrated throughout the queue and surrounding area. * Educational Element: Small informational panels explain pelican behavior, conservation efforts, and the importance of marine ecosystems.  Why this works for a pelican:  * Mimics Natural Behavior: The ride accurately reflects a pelican’s primary activity – diving for fish. * Visually Engaging: The combination of water effects, projection mapping, and rotation creates a captivating and immersive experience. * Family-Friendly Thrill: The “dive” sequences provide a moderate thrill without being overly intense. * Educational Value: It promotes awareness and appreciation for these amazing birds and the marine environment.    ---  Further Development Ideas:  * Different &amp;quot;Dive Routes&amp;quot;: Create multiple routes through the underwater environment, each with a different theme (e.g., a coral reef route, a deep-sea route, a pelican’s feeding ground route). * Animatronic Pelican: A large animatronic pelican could “greet” guests as they board the pod. 
* Smell Integration: Subtle scents of saltwater and seaweed could enhance the immersion.    Would you like me to brainstorm a specific element of the ride further, such as:  *   The projection mapping details? *   The technical aspects of the water effects? *   A unique interactive element? " style="max-width: 100%;" /&gt;&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/homebrew"&gt;homebrew&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama-cpp"&gt;llama-cpp&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemma"&gt;gemma&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="homebrew"/><category term="projects"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="vision-llms"/><category term="llama-cpp"/><category term="gemma"/></entry><entry><title>llm-fragments-github 0.2</title><link href="https://simonwillison.net/2025/Apr/20/llm-fragments-github/#atom-tag" rel="alternate"/><published>2025-04-20T14:01:09+00:00</published><updated>2025-04-20T14:01:09+00:00</updated><id>https://simonwillison.net/2025/Apr/20/llm-fragments-github/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/llm-fragments-github/releases/tag/0.2"&gt;llm-fragments-github 0.2&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I upgraded my &lt;code&gt;llm-fragments-github&lt;/code&gt; plugin to add a new fragment type called &lt;code&gt;issue&lt;/code&gt;. It lets you pull the entire content of a GitHub issue thread into your prompt as a concatenated Markdown file.&lt;/p&gt;
&lt;p&gt;(If you haven't seen fragments before I introduced them in &lt;a href="https://simonwillison.net/2025/Apr/7/long-context-llm/"&gt;Long context support in LLM 0.24 using fragments and template plugins&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;I used it just now to have Gemini 2.5 Pro provide feedback and attempt an implementation of a complex issue against my &lt;a href="https://github.com/simonw/llm"&gt;LLM&lt;/a&gt; project:&lt;/p&gt;
&lt;pre&gt;llm install llm-fragments-github
llm -f github:simonw/llm \
  -f issue:simonw/llm/938 \
  -m gemini-2.5-pro-exp-03-25 \
  --system &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;muse on this issue, then propose a whole bunch of code to help implement it&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;

&lt;p&gt;Here I'm loading the FULL content of the &lt;code&gt;simonw/llm&lt;/code&gt; repo using that &lt;code&gt;-f github:simonw/llm&lt;/code&gt; fragment (&lt;a href="https://github.com/simonw/llm-fragments-github?tab=readme-ov-file#usage"&gt;documented here&lt;/a&gt;), then loading all of the comments from &lt;a href="https://github.com/simonw/llm/issues/938"&gt;issue 938&lt;/a&gt; where I discuss quite a complex potential refactoring. I ask Gemini 2.5 Pro to "muse on this issue" and come up with some code.&lt;/p&gt;
&lt;p&gt;This worked &lt;em&gt;shockingly&lt;/em&gt; well. Here's &lt;a href="https://gist.github.com/simonw/a5f0c1e8184f4ddc8b71b30890fe690c#response"&gt;the full response&lt;/a&gt;, which highlighted a few things I hadn't considered yet (such as the need to migrate old database records to the new tree hierarchy) and then spat out a whole bunch of code which looks like a solid start to the actual implementation work I need to do.&lt;/p&gt;
&lt;p&gt;I ran this against Google's free Gemini 2.5 Preview, but if I'd used the paid model it would have cost me 202,680 input tokens, 10,460 output tokens and 1,859 thinking tokens for a total of 62.989 cents.&lt;/p&gt;
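&lt;p&gt;That total is easy to sanity-check. The per-million-token rates below are my assumption (they happen to reproduce the 62.989 cent figure exactly), not numbers quoted anywhere in the post:&lt;/p&gt;

```python
# Token counts from the run above; the per-million rates are assumed,
# chosen because they reproduce the quoted 62.989 cents exactly.
input_tokens = 202_680
output_tokens = 10_460
thinking_tokens = 1_859

cost_cents = (
    input_tokens * 2.50 / 1_000_000
    + (output_tokens + thinking_tokens) * 10.00 / 1_000_000
) * 100
print(round(cost_cents, 3))  # 62.989
```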
&lt;p&gt;As a fun extra, the new &lt;code&gt;issue:&lt;/code&gt; feature itself was written almost entirely by OpenAI o3, again using fragments. I ran this:&lt;/p&gt;
&lt;pre&gt;llm -m openai/o3 \
  -f https://raw.githubusercontent.com/simonw/llm-hacker-news/refs/heads/main/llm_hacker_news.py \
  -f https://raw.githubusercontent.com/simonw/tools/refs/heads/main/github-issue-to-markdown.html \
  -s &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Write a new fragments plugin in Python that registers issue:org/repo/123 which fetches that issue&lt;/span&gt;
&lt;span class="pl-s"&gt;      number from the specified github repo and uses the same markdown logic as the HTML page to turn that into a fragment&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;

&lt;p&gt;Here I'm using the ability to pass a URL to &lt;code&gt;-f&lt;/code&gt; and giving it the full source of my &lt;a href="https://github.com/simonw/llm-hacker-news/blob/main/llm_hacker_news.py"&gt;llm_hacker_news.py&lt;/a&gt; plugin (which shows how a fragment can load data from an API) plus the &lt;a href="https://github.com/simonw/tools/blob/main/github-issue-to-markdown.html"&gt;HTML source&lt;/a&gt; of my &lt;a href="https://tools.simonwillison.net/github-issue-to-markdown"&gt;github-issue-to-markdown&lt;/a&gt; tool (which I wrote a few months ago &lt;a href="https://gist.github.com/simonw/cd1afb97e595b40fdeedebb48be7f4f1"&gt;with Claude&lt;/a&gt;). I effectively asked o3 to take that HTML/JavaScript tool and port it to Python to work with my fragments plugin mechanism.&lt;/p&gt;
&lt;p&gt;o3 provided &lt;a href="https://gist.github.com/simonw/249e16edffe6350f7265012bee9e3305#response"&gt;almost the exact implementation I needed&lt;/a&gt;, and even included support for a &lt;code&gt;GITHUB_TOKEN&lt;/code&gt; environment variable without me thinking to ask for it. Total cost: 19.928 cents.&lt;/p&gt;
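&lt;p&gt;The heart of a plugin like that is a function that flattens an issue plus its comments into one Markdown document. This is an illustrative sketch of that piece, not the code o3 produced - the field names match the GitHub REST API responses, and the registration with LLM's fragment loader hook is omitted:&lt;/p&gt;

```python
def issue_thread_to_markdown(issue, comments):
    """Concatenate a GitHub issue and its comment thread into one
    Markdown document, ready to be used as a prompt fragment.

    issue / comments are shaped like the GitHub REST API responses
    (only "title", "body" and "user.login" are used here).
    """
    parts = [
        f"# {issue['title']}\n\n*{issue['user']['login']}*\n\n{issue['body']}"
    ]
    for comment in comments:
        parts.append(f"*{comment['user']['login']}*\n\n{comment['body']}")
    return "\n\n---\n\n".join(parts)
```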
&lt;p&gt;On a final note of curiosity I tried running this prompt against &lt;a href="https://simonwillison.net/2025/Apr/19/gemma-3-qat-models/"&gt;Gemma 3 27B QAT&lt;/a&gt; running on my Mac via MLX and &lt;a href="https://github.com/simonw/llm-mlx"&gt;llm-mlx&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;llm install llm-mlx
llm mlx download-model mlx-community/gemma-3-27b-it-qat-4bit

llm -m mlx-community/gemma-3-27b-it-qat-4bit \
  -f https://raw.githubusercontent.com/simonw/llm-hacker-news/refs/heads/main/llm_hacker_news.py \
  -f https://raw.githubusercontent.com/simonw/tools/refs/heads/main/github-issue-to-markdown.html \
  -s &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Write a new fragments plugin in Python that registers issue:org/repo/123 which fetches that issue&lt;/span&gt;
&lt;span class="pl-s"&gt;      number from the specified github repo and uses the same markdown logic as the HTML page to turn that into a fragment&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;

&lt;p&gt;That worked &lt;a href="https://gist.github.com/simonw/feccff6ce3254556b848c27333f52543#response"&gt;pretty well too&lt;/a&gt;. It turns out a 16GB local model file is powerful enough to write me an LLM plugin now!&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/plugins"&gt;plugins&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/o3"&gt;o3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/long-context"&gt;long-context&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemma"&gt;gemma&lt;/a&gt;&lt;/p&gt;



</summary><category term="github"/><category term="plugins"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="ai-assisted-programming"/><category term="llm"/><category term="gemini"/><category term="mlx"/><category term="o3"/><category term="long-context"/><category term="gemma"/></entry><entry><title>Gemma 3 QAT Models</title><link href="https://simonwillison.net/2025/Apr/19/gemma-3-qat-models/#atom-tag" rel="alternate"/><published>2025-04-19T17:20:50+00:00</published><updated>2025-04-19T17:20:50+00:00</updated><id>https://simonwillison.net/2025/Apr/19/gemma-3-qat-models/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://developers.googleblog.com/en/gemma-3-quantized-aware-trained-state-of-the-art-ai-to-consumer-gpus/"&gt;Gemma 3 QAT Models&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Interesting release from Google, as a follow-up &lt;a href="https://simonwillison.net/2025/Mar/12/gemma-3/"&gt;to Gemma 3&lt;/a&gt; from last month:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;To make Gemma 3 even more accessible, we are announcing new versions optimized with Quantization-Aware Training (QAT) that dramatically reduces memory requirements while maintaining high quality. This enables you to run powerful models like Gemma 3 27B locally on consumer-grade GPUs like the NVIDIA RTX 3090.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I wasn't previously aware of Quantization-Aware Training but it turns out to be quite an established pattern now, supported in both &lt;a href="https://www.tensorflow.org/model_optimization/guide/quantization/training"&gt;TensorFlow&lt;/a&gt; and &lt;a href="https://pytorch.org/blog/quantization-aware-training/"&gt;PyTorch&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Google reports the following model size drops from BF16 to int4:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemma 3 27B: 54GB to 14.1GB&lt;/li&gt;
&lt;li&gt;Gemma 3 12B: 24GB to 6.6GB&lt;/li&gt;
&lt;li&gt;Gemma 3 4B: 8GB to 2.6GB&lt;/li&gt;
&lt;li&gt;Gemma 3 1B: 2GB to 0.5GB&lt;/li&gt;
&lt;/ul&gt;
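&lt;p&gt;Those reductions are close to the 4x you'd expect from cutting 16-bit weights down to 4 bits - a quick back-of-envelope check using the sizes from the list above:&lt;/p&gt;

```python
# (BF16 GB, int4 GB) pairs from Google's announcement
sizes = {"27B": (54, 14.1), "12B": (24, 6.6), "4B": (8, 2.6), "1B": (2, 0.5)}
ratios = {name: round(bf16 / int4, 2) for name, (bf16, int4) in sizes.items()}
print(ratios)  # {'27B': 3.83, '12B': 3.64, '4B': 3.08, '1B': 4.0}
# The shortfall below 4x reflects the parts of the model (such as
# embeddings) that are presumably kept at higher precision.
```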
&lt;p&gt;They partnered with Ollama, LM Studio, MLX (here's &lt;a href="https://huggingface.co/collections/mlx-community/gemma-3-qat-68002674cd5afc6f9022a0ae"&gt;their collection&lt;/a&gt;) and llama.cpp for this release - I'd love to see more AI labs following their example.&lt;/p&gt;
&lt;p&gt;The Ollama model version picker currently hides them behind the "View all" option, so here are the direct links:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://ollama.com/library/gemma3:1b-it-qat"&gt;gemma3:1b-it-qat&lt;/a&gt; - 1GB&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ollama.com/library/gemma3:4b-it-qat"&gt;gemma3:4b-it-qat&lt;/a&gt; - 4GB&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ollama.com/library/gemma3:12b-it-qat"&gt;gemma3:12b-it-qat&lt;/a&gt; - 8.9GB&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ollama.com/library/gemma3:27b-it-qat"&gt;gemma3:27b-it-qat&lt;/a&gt; - 18GB&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I fetched that largest model with:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama pull gemma3:27b-it-qat
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And now I'm trying it out with &lt;a href="https://github.com/taketwo/llm-ollama"&gt;llm-ollama&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m gemma3:27b-it-qat "impress me with some physics"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I got &lt;a href="https://gist.github.com/simonw/5b699ba6b7c05e2d071910e238823ff4"&gt;a pretty great response&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: Having spent a while putting it through its paces via &lt;a href="https://simonwillison.net/2024/Dec/27/open-webui/"&gt;Open WebUI&lt;/a&gt; and &lt;a href="https://tailscale.com/"&gt;Tailscale&lt;/a&gt; to access my laptop from my phone I think this may be my new favorite general-purpose local model. Ollama appears to use 22GB of RAM while the model is running, which leaves plenty on my 64GB machine for other applications.&lt;/p&gt;
&lt;p&gt;I've also tried it via &lt;a href="https://github.com/simonw/llm-mlx"&gt;llm-mlx&lt;/a&gt; like this (downloading 16GB):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-mlx
llm mlx download-model mlx-community/gemma-3-27b-it-qat-4bit
llm chat -m mlx-community/gemma-3-27b-it-qat-4bit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It feels a little faster with MLX and uses 15GB of memory according to Activity Monitor.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tailscale"&gt;tailscale&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemma"&gt;gemma&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="ai"/><category term="tailscale"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="mlx"/><category term="ollama"/><category term="gemma"/><category term="llm-release"/><category term="lm-studio"/></entry><entry><title>Function calling with Gemma</title><link href="https://simonwillison.net/2025/Mar/26/function-calling-with-gemma/#atom-tag" rel="alternate"/><published>2025-03-26T20:23:06+00:00</published><updated>2025-03-26T20:23:06+00:00</updated><id>https://simonwillison.net/2025/Mar/26/function-calling-with-gemma/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://ai.google.dev/gemma/docs/capabilities/function-calling"&gt;Function calling with Gemma&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Google's Gemma 3 model (the 27B variant is particularly capable, I've been trying it out &lt;a href="https://ollama.com/library/gemma3"&gt;via Ollama&lt;/a&gt;) supports function calling exclusively through prompt engineering. The official documentation describes two recommended prompts - both suggest passing the tool definitions in as JSON schema, but they differ in how the model should request tool executions.&lt;/p&gt;
&lt;p&gt;The first prompt uses Python-style function calling syntax:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;You have access to functions. If you decide to invoke any of the function(s),
 you MUST put it in the format of [func_name1(params_name1=params_value1, params_name2=params_value2...), func_name2(params)]&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;You SHOULD NOT include any other text in the response if you call a function&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;(Always love seeing CAPITALS for emphasis in prompts, makes me wonder if they proved to themselves that capitalization makes a difference in this case.)&lt;/p&gt;
&lt;p&gt;The second variant uses JSON instead:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;You have access to functions. If you decide to invoke any of the function(s),
you MUST put it in the format of {"name": function name, "parameters": dictionary of argument name and its value}&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;You SHOULD NOT include any other text in the response if you call a function&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
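&lt;p&gt;Since the model just emits text in the requested format, it's up to your harness to spot and parse the call. Here's a minimal sketch for the JSON variant - the code-fence stripping is my own guess at a common failure mode, not something the docs specify:&lt;/p&gt;

```python
import json

def parse_gemma_tool_call(text):
    """Return (function_name, parameters) if the model output looks like
    the JSON tool-call format quoted above, otherwise None."""
    text = text.strip()
    # Models sometimes wrap JSON in a ```json fence (assumption)
    if text.startswith("```"):
        text = text.strip("`").removeprefix("json").strip()
    try:
        call = json.loads(text)
    except json.JSONDecodeError:
        return None
    if isinstance(call, dict) and "name" in call:
        return call["name"], call.get("parameters", {})
    return None
```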
&lt;p&gt;This is a neat illustration of the fact that all of these fancy tool-using LLMs are still using effectively the same pattern as was described in &lt;a href="https://react-lm.github.io/"&gt;the ReAct paper&lt;/a&gt; back in October 2022. Here's &lt;a href="https://til.simonwillison.net/llms/python-react-pattern"&gt;my implementation of that pattern&lt;/a&gt; from March 2023.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=43451406"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-engineering"&gt;prompt-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-tool-use"&gt;llm-tool-use&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemma"&gt;gemma&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="ai"/><category term="prompt-engineering"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm-tool-use"/><category term="gemma"/></entry><entry><title>Notes on Google's Gemma 3</title><link href="https://simonwillison.net/2025/Mar/12/gemma-3/#atom-tag" rel="alternate"/><published>2025-03-12T16:15:19+00:00</published><updated>2025-03-12T16:15:19+00:00</updated><id>https://simonwillison.net/2025/Mar/12/gemma-3/#atom-tag</id><summary type="html">
    &lt;p&gt;Google's Gemma team released an impressive new model today (under their not-open-source &lt;a href="https://ai.google.dev/gemma/terms"&gt;Gemma license&lt;/a&gt;). Gemma 3 comes in four sizes - 1B, 4B, 12B, and 27B - and while 1B is text-only the larger three models are all multi-modal for vision:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It handles context windows up to 128k tokens, understands over 140 languages, and offers improved math, reasoning, and chat capabilities, including structured outputs and function calling.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's the &lt;a href="https://storage.googleapis.com/deepmind-media/gemma/Gemma3Report.pdf"&gt;Gemma 3 Technical Report PDF&lt;/a&gt;, which makes the big claim that they find "Gemma3-27B-IT comparable to Gemini-1.5-Pro across
benchmarks".&lt;/p&gt;
&lt;p&gt;I tried out the largest model using the latest &lt;a href="https://ollama.com/"&gt;Ollama&lt;/a&gt; - this is the second time I've spotted a major model release partnering with Ollama on launch day, the first being &lt;a href="https://simonwillison.net/2025/Jan/30/mistral-small-3/"&gt;Mistral Small 3&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I ran this (after upgrading Ollama through their menu icon upgrade option):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama pull gemma3:27b
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That pulled 17GB of model weights. I've been trying it out using &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; and &lt;a href="https://github.com/taketwo/llm-ollama"&gt;llm-ollama&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-ollama
llm -m gemma3:27b 'Build a single page HTML+CSS+JavaScript UI that gives me a large textarea for writing in which constantly saves what I have entered to localStorage (restoring when I reload the page) and displays a word counter'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That was a replay of &lt;a href="https://tools.simonwillison.net/colophon#word-counter.html"&gt;a prompt&lt;/a&gt; I ran against Claude Artifacts a few months ago. Here's &lt;a href="https://static.simonwillison.net/static/2025/gemma3-word-count.html"&gt;what Gemma built&lt;/a&gt;, and the &lt;a href="https://gist.github.com/simonw/f8a87f2242439d3974405a17f5d8bc23"&gt;full chat transcript&lt;/a&gt;. It's a simple example but it worked just right.&lt;/p&gt;

&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gemma-word-count.gif" alt="Animated demo - a simple textarea, I type text in it and a word count updates above." /&gt;&lt;/p&gt;

&lt;p&gt;Let's try out the vision mode, by asking it to describe this photo:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/two-pelicans.jpg" alt="Two pelicans flying against a blue sky " style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m gemma3:27b describe -a https://static.simonwillison.net/static/2025/two-pelicans.jpg
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/0e703ba72d0cf5a6dd117981bd03e604"&gt;the full response&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Overall Impression:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The image features two brown pelicans in flight against a solid, bright blue background. The composition is dynamic, capturing the birds mid-wingbeat.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Details:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pelicans:&lt;/strong&gt; Both pelicans are a rich, dark brown color. They have large, pale yellow-orange beaks and pouches. One pelican is shown with its wings fully extended, appearing to be gliding or soaring. The other is captured with its wings in a more dynamic, upward-sweeping motion.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Background:&lt;/strong&gt; The background is a flat, uniform shade of bright blue, which makes the pelicans stand out.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Composition:&lt;/strong&gt; The birds are positioned in opposite directions, creating a sense of movement and balance. The image is well-lit, highlighting the texture of the pelicans' feathers.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Overall, the image is a striking and visually appealing depiction of these majestic birds in flight.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Is it right to say "the birds are positioned in opposite directions"? I thought that was a mistake at first, but I'll give it a pass since their wings are positioned in opposite directions to each other.&lt;/p&gt;

&lt;p&gt;Something I've been curious about recently is longer context support: how well can a local model on my laptop deal with summarization or data extraction tasks against longer pieces of text?&lt;/p&gt;
&lt;p&gt;I decided to try &lt;a href="https://til.simonwillison.net/llms/claude-hacker-news-themes#user-content-adding-a--m-model-option"&gt;my Hacker News summarize script&lt;/a&gt; using Gemma, against &lt;a href="https://news.ycombinator.com/item?id=43340491"&gt;the thread&lt;/a&gt; there discussing the Gemma 3 technical paper.&lt;/p&gt;
&lt;p&gt;First I did a quick token count (using the OpenAI tokenizer via ttok - other models' tokenizers usually produce a similar number):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl 'https://hn.algolia.com/api/v1/items/43340491' | ttok
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This returned 22,260 tokens - well within Gemma 3's documented 128,000 token context window, but still a healthy number considering that just last year most models topped out at 4,000 or 8,000.&lt;/p&gt;
&lt;p&gt;I ran my script like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;hn-summary.sh 43340491 -m gemma3:27b
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It did a pretty good job! Here's the &lt;a href="https://gist.github.com/simonw/ab487ea3d1605e719dc2950cd4565146"&gt;full prompt and response&lt;/a&gt;. The one big miss is that it ignored my instructions to include illustrative quotes - I don't know if modifying the prompt will fix that but it's disappointing that it didn't handle that well, given how important direct quotes are for building confidence in RAG-style responses.&lt;/p&gt;
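<p>For anyone curious what a script like that has to do under the hood: the Algolia items API returns the whole thread as one nested JSON tree, so the main work is flattening it into plain text for the prompt. Here's a minimal Python sketch of that step - the function names are my own, not taken from the actual <code>hn-summary.sh</code> script:</p>

```python
import json
import re
import urllib.request


def flatten_comments(item, depth=0):
    """Yield one indented line of plain text per comment in the tree."""
    text = item.get("text") or item.get("title") or ""
    if text:
        # Comment bodies come back as HTML; strip the tags crudely.
        yield "  " * depth + re.sub(r"<[^>]+>", " ", text).strip()
    for child in item.get("children", []):
        yield from flatten_comments(child, depth + 1)


def thread_as_text(item_id):
    """Fetch a Hacker News item from the Algolia API and flatten it."""
    url = f"https://hn.algolia.com/api/v1/items/{item_id}"
    with urllib.request.urlopen(url) as response:
        return "\n".join(flatten_comments(json.load(response)))
```

<p>The flattened text can then be piped into any model alongside a summarization prompt.</p>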
&lt;p&gt;Here's what I got for &lt;a href="https://gist.github.com/simonw/f79c4bd2fbe966e1b600cb8b41cae810"&gt;Generate an SVG of a pelican riding a bicycle&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m gemma3:27b 'Generate an SVG of a pelican riding a bicycle'
&lt;/code&gt;&lt;/pre&gt;

&lt;p style="text-align: center"&gt;&lt;img src="https://static.simonwillison.net/static/2025/gemma-3-pelican.svg" alt="A collection of abstract shapes, definitely not a pelican on a bicycle" /&gt;&lt;/p&gt;

&lt;p&gt;You can also try out the new Gemma &lt;a href="https://aistudio.google.com/prompts/new_chat?model=gemma-3-27b-it"&gt;in Google AI Studio&lt;/a&gt;, and via their API. I added support for it to &lt;a href="https://github.com/simonw/llm-gemini/releases/tag/0.15"&gt;llm-gemini 0.15&lt;/a&gt;, though sadly it appears vision mode doesn't work with that API hosted model yet.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install -U llm-gemini
llm keys set gemini
# paste key here
llm -m gemma-3-27b-it 'five facts about pelicans of interest to skunks'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;a href="https://gist.github.com/simonw/bc22062e60e5af3faf458756cb368d0e"&gt;Here's what I got&lt;/a&gt;. I'm not sure how pricing works for that hosted model.&lt;/p&gt;
&lt;p&gt;Gemma 3 is also already available &lt;a href="https://github.com/Blaizzy/mlx-vlm/pull/235"&gt;through MLX-VLM&lt;/a&gt; - here's &lt;a href="https://huggingface.co/collections/mlx-community/gemma-3-67d14a10480a436ad478b0f9"&gt;the MLX model collection&lt;/a&gt; - but I haven't tried that version yet.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemma"&gt;gemma&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="google"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="gemini"/><category term="vision-llms"/><category term="mlx"/><category term="ollama"/><category term="pelican-riding-a-bicycle"/><category term="gemma"/><category term="llm-release"/></entry><entry><title>gemma-2-27b-it-llamafile</title><link href="https://simonwillison.net/2024/Jul/2/gemma-2-27b-it-llamafile/#atom-tag" rel="alternate"/><published>2024-07-02T22:38:06+00:00</published><updated>2024-07-02T22:38:06+00:00</updated><id>https://simonwillison.net/2024/Jul/2/gemma-2-27b-it-llamafile/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/jartine/gemma-2-27b-it-llamafile"&gt;gemma-2-27b-it-llamafile&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Justine Tunney shipped llamafile packages of Google's new openly licensed (though definitely not open source) &lt;a href="https://ai.google.dev/gemma"&gt;Gemma&lt;/a&gt; 2 27b model this morning.&lt;/p&gt;
&lt;p&gt;I downloaded the &lt;code&gt;gemma-2-27b-it.Q5_1.llamafile&lt;/code&gt; version (20.5GB) to my Mac, ran &lt;code&gt;chmod 755 gemma-2-27b-it.Q5_1.llamafile&lt;/code&gt; and then &lt;code&gt;./gemma-2-27b-it.Q5_1.llamafile&lt;/code&gt; and now I'm trying it out through the &lt;code&gt;llama.cpp&lt;/code&gt; default web UI in my browser. It works great.&lt;/p&gt;
&lt;p&gt;It's a &lt;em&gt;very&lt;/em&gt; capable model - currently sitting at position 12 on the &lt;a href="https://chat.lmsys.org/"&gt;LMSYS Arena&lt;/a&gt;, making it the highest ranked open weights model - one position ahead of Llama-3-70b-Instruct and within striking distance of the GPT-4 class models.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://twitter.com/JustineTunney/status/1808165898743878108"&gt;@JustineTunney&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llamafile"&gt;llamafile&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/justine-tunney"&gt;justine-tunney&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama-cpp"&gt;llama-cpp&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemma"&gt;gemma&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatbot-arena"&gt;chatbot-arena&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llamafile"/><category term="justine-tunney"/><category term="llama-cpp"/><category term="gemma"/><category term="chatbot-arena"/></entry><entry><title>PaliGemma model README</title><link href="https://simonwillison.net/2024/May/15/paligemma/#atom-tag" rel="alternate"/><published>2024-05-15T21:16:36+00:00</published><updated>2024-05-15T21:16:36+00:00</updated><id>https://simonwillison.net/2024/May/15/paligemma/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/paligemma/README.md"&gt;PaliGemma model README&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;One of the more overlooked announcements from Google I/O yesterday was PaliGemma, an openly licensed VLM (Vision Language Model) in the Gemma family of models.&lt;/p&gt;
&lt;p&gt;The model accepts an image and a text prompt. It outputs text, but that text can include special tokens representing regions on the image. This means it can return both bounding boxes and fuzzier segment outlines of detected objects, behavior that can be triggered using a prompt such as "segment puffins".&lt;/p&gt;
&lt;p&gt;From &lt;a href="https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/paligemma/README.md#tokenizer"&gt;the README&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;PaliGemma uses the Gemma tokenizer with 256,000 tokens, but we further extend its vocabulary with 1024 entries that represent coordinates in normalized image-space (&lt;code&gt;&amp;lt;loc0000&amp;gt;...&amp;lt;loc1023&amp;gt;&lt;/code&gt;), and another with 128 entries (&lt;code&gt;&amp;lt;seg000&amp;gt;...&amp;lt;seg127&amp;gt;&lt;/code&gt;) that are codewords used by a lightweight referring-expression segmentation vector-quantized variational auto-encoder (VQ-VAE) [...]&lt;/p&gt;
&lt;/blockquote&gt;
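<p>Those coordinate tokens are pleasantly easy to decode: each <code>&lt;locNNNN&gt;</code> value is a position on a 0-1023 normalized grid, and detection output emits four of them per bounding box in (y_min, x_min, y_max, x_max) order. Here's a rough Python sketch of the conversion back to pixel coordinates - my own helper, not code from the PaliGemma repository:</p>

```python
import re


def decode_boxes(model_output, width, height):
    """Convert PaliGemma-style <locNNNN> tokens into pixel bounding boxes.

    Each loc token encodes a coordinate on a 0-1023 normalized grid;
    detection output emits four per box as (y_min, x_min, y_max, x_max).
    Returns boxes as (x_min, y_min, x_max, y_max) pixel tuples.
    """
    vals = [int(n) / 1023 for n in re.findall(r"<loc(\d{4})>", model_output)]
    boxes = []
    for i in range(0, len(vals) - len(vals) % 4, 4):
        y0, x0, y1, x1 = vals[i:i + 4]
        boxes.append((round(x0 * width), round(y0 * height),
                      round(x1 * width), round(y1 * height)))
    return boxes
```

<p>So a prompt like "detect puffins" against a 640&#215;480 image might come back with <code>&lt;loc0000&gt;&lt;loc0000&gt;&lt;loc1023&gt;&lt;loc1023&gt; puffin</code>, which decodes to a box covering the full frame.</p>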
&lt;p&gt;You can try it out &lt;a href="https://huggingface.co/spaces/google/paligemma"&gt;on Hugging Face&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It's a 3B model, making it feasible to run on consumer hardware.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://blog.roboflow.com/paligemma-multimodal-vision/"&gt;Roboflow: PaliGemma: Open Source Multimodal Model by Google&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google-io"&gt;google-io&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemma"&gt;gemma&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/image-segmentation"&gt;image-segmentation&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="google-io"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="vision-llms"/><category term="gemma"/><category term="image-segmentation"/></entry><entry><title>Gemma: Introducing new state-of-the-art open models</title><link href="https://simonwillison.net/2024/Feb/21/gemma/#atom-tag" rel="alternate"/><published>2024-02-21T16:22:21+00:00</published><updated>2024-02-21T16:22:21+00:00</updated><id>https://simonwillison.net/2024/Feb/21/gemma/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.google/technology/developers/gemma-open-models/"&gt;Gemma: Introducing new state-of-the-art open models&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Google get in on the openly licensed LLM game: Gemma comes in two sizes, 2B and 7B, trained on 2 trillion and 6 trillion tokens respectively. The terms of use “permit responsible commercial usage”. In the benchmarks it appears to compare favorably to Mistral and Llama 2.&lt;/p&gt;

&lt;p&gt;Something that caught my eye in the terms: “Google may update Gemma from time to time, and you must make reasonable efforts to use the latest version of Gemma.”&lt;/p&gt;

&lt;p&gt;One of the biggest benefits of running your own model is that it can protect you from model updates that break your carefully tested prompts, so I’m not thrilled by that particular clause.&lt;/p&gt;

&lt;p&gt;UPDATE: It turns out that clause isn’t uncommon—the phrase “You shall undertake reasonable efforts to use the latest version of the Model” is present in both the Stable Diffusion and BigScience Open RAIL-M licenses.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemma"&gt;gemma&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="gemma"/></entry></feed>