<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: prince-canuma</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/prince-canuma.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2026-01-22T17:42:34+00:00</updated><author><name>Simon Willison</name></author><entry><title>Qwen3-TTS Family is Now Open Sourced: Voice Design, Clone, and Generation</title><link href="https://simonwillison.net/2026/Jan/22/qwen3-tts/#atom-tag" rel="alternate"/><published>2026-01-22T17:42:34+00:00</published><updated>2026-01-22T17:42:34+00:00</updated><id>https://simonwillison.net/2026/Jan/22/qwen3-tts/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://qwen.ai/blog?id=qwen3tts-0115"&gt;Qwen3-TTS Family is Now Open Sourced: Voice Design, Clone, and Generation&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I haven't been paying much attention to the state-of-the-art in speech generation models other than noting that they've got &lt;em&gt;really good&lt;/em&gt;, so I can't speak for how notable this new release from Qwen is.&lt;/p&gt;
&lt;p&gt;From &lt;a href="https://github.com/QwenLM/Qwen3-TTS/blob/main/assets/Qwen3_TTS.pdf"&gt;the accompanying paper&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In this report, we present the Qwen3-TTS series, a family of advanced multilingual, controllable, robust, and streaming text-to-speech models. Qwen3-TTS supports state-of-the-art 3-second voice cloning and description-based control, allowing both the creation of entirely novel voices and fine-grained manipulation over the output speech. Trained on over 5 million hours of speech data spanning 10 languages, Qwen3-TTS adopts a dual-track LM architecture for real-time synthesis [...]. Extensive experiments indicate state-of-the-art performance across diverse objective and subjective benchmarks (e.g., TTS multilingual test set, InstructTTSEval, and our long speech test set). To facilitate community research and development, we release both tokenizers and models under the Apache 2.0 license.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;To give an idea of size, &lt;a href="https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Base"&gt;Qwen/Qwen3-TTS-12Hz-1.7B-Base&lt;/a&gt; is 4.54GB on Hugging Face and &lt;a href="https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-Base"&gt;Qwen/Qwen3-TTS-12Hz-0.6B-Base&lt;/a&gt; is 2.52GB.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://huggingface.co/spaces/Qwen/Qwen3-TTS"&gt;Hugging Face demo&lt;/a&gt; lets you try out the 0.6B and 1.7B models for free in your browser, including voice cloning:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of a Qwen3-TTS voice cloning web interface with three tabs at top: &amp;quot;Voice Design&amp;quot;, &amp;quot;Voice Clone (Base)&amp;quot; (selected), and &amp;quot;TTS (CustomVoice)&amp;quot;. The page is titled &amp;quot;Clone Voice from Reference Audio&amp;quot; and has two main sections. Left section: &amp;quot;Reference Audio (Upload a voice sample clone)&amp;quot; showing an audio waveform player at 0:00/0:34 with playback controls, upload and microphone icons, followed by &amp;quot;Reference Text (Transcript of the reference audio)&amp;quot; containing three paragraphs: &amp;quot;Simon Willison is the creator of Datasette, an open source tool for exploring and publishing data. He currently works full-time building open source tools for data journalism, built around Datasette and SQLite. Prior to becoming an independent open source developer, Simon was an engineering director at Eventbrite. Simon joined Eventbrite through their acquisition of Lanyrd, a Y Combinator funded company he co-founded in 2010. He is a co-creator of the Django Web Framework, and has been blogging about web development and programming since 2002 at simonwillison.net&amp;quot;. Right section: &amp;quot;Target Text (Text to synthesize with cloned voice)&amp;quot; containing text about Qwen3-TTS speech generation capabilities, with &amp;quot;Language&amp;quot; dropdown set to &amp;quot;Auto&amp;quot; and &amp;quot;Model Size&amp;quot; dropdown set to &amp;quot;1.7B&amp;quot;, and a purple &amp;quot;Clone &amp;amp; Generate&amp;quot; button at bottom." src="https://static.simonwillison.net/static/2026/qwen-voice-clone.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;I tried this out by recording myself reading &lt;a href="https://simonwillison.net/about/"&gt;my about page&lt;/a&gt; and then having Qwen3-TTS generate audio of me reading the Qwen3-TTS announcement post. Here's the result:&lt;/p&gt;
&lt;p&gt;&lt;audio controls style="width: 100%"&gt;
  &lt;source src="https://static.simonwillison.net/static/2026/qwen-tts-clone.wav" type="audio/wav"&gt;
  Your browser does not support the audio element.
&lt;/audio&gt;&lt;/p&gt;
&lt;p&gt;It's important that everyone understands that voice cloning is now something that's available to anyone with a GPU and a few GBs of VRAM... or in this case a web browser that can access Hugging Face.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: Prince Canuma &lt;a href="https://x.com/Prince_Canuma/status/2014453857019904423"&gt;got this working&lt;/a&gt; with his &lt;a href="https://pypi.org/project/mlx-audio/"&gt;mlx-audio&lt;/a&gt; library. I &lt;a href="https://claude.ai/share/2e01ad60-ca38-4e14-ab60-74eaa45b2fbd"&gt;had Claude&lt;/a&gt; turn that into &lt;a href="https://github.com/simonw/tools/blob/main/python/q3_tts.py"&gt;a CLI tool&lt;/a&gt; which you can run with &lt;code&gt;uv&lt;/code&gt; like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uv run https://tools.simonwillison.net/python/q3_tts.py \
  'I am a pirate, give me your gold!' \
  -i 'gruff voice' -o pirate.wav
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;-i&lt;/code&gt; option lets you provide a prompt describing the voice it should use. On first run this downloads a 4.5GB model file from Hugging Face.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=46719229"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/text-to-speech"&gt;text-to-speech&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/hugging-face"&gt;hugging-face&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prince-canuma"&gt;prince-canuma&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="text-to-speech"/><category term="ai"/><category term="generative-ai"/><category term="hugging-face"/><category term="uv"/><category term="qwen"/><category term="mlx"/><category term="prince-canuma"/><category term="ai-in-china"/></entry><entry><title>Introducing Gemma 3n: The developer guide</title><link href="https://simonwillison.net/2025/Jun/26/gemma-3n/#atom-tag" rel="alternate"/><published>2025-06-26T21:08:36+00:00</published><updated>2025-06-26T21:08:36+00:00</updated><id>https://simonwillison.net/2025/Jun/26/gemma-3n/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://developers.googleblog.com/en/introducing-gemma-3n-developer-guide/"&gt;Introducing Gemma 3n: The developer guide&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Extremely consequential new open weights model release from Google today:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Multimodal by design:&lt;/strong&gt; Gemma 3n natively supports image, audio, video, and text inputs and text outputs.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Optimized for on-device:&lt;/strong&gt; Engineered with a focus on efficiency, Gemma 3n models are available in two sizes based on &lt;a href="https://developers.googleblog.com/en/introducing-gemma-3n-developer-guide/#per-layer-embeddings-(ple):-unlocking-more-memory-efficiency"&gt;&lt;strong&gt;effective&lt;/strong&gt;&lt;/a&gt; parameters: E2B and E4B. While their raw parameter count is 5B and 8B respectively, architectural innovations allow them to run with a memory footprint comparable to traditional 2B and 4B models, operating with as little as 2GB (E2B) and 3GB (E4B) of memory.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is &lt;strong&gt;very&lt;/strong&gt; exciting: a 2B and 4B model optimized for end-user devices which accepts text, images &lt;em&gt;and&lt;/em&gt; audio as inputs!&lt;/p&gt;
&lt;p&gt;Gemma 3n is also the most comprehensive day one launch I've seen for any model: Google partnered with "AMD, Axolotl, Docker, Hugging Face, llama.cpp, LMStudio, MLX, NVIDIA, Ollama, RedHat, SGLang, Unsloth, and vLLM" so there are dozens of ways to try this out right now.&lt;/p&gt;
&lt;p&gt;So far I've run two variants on my Mac laptop. Ollama offer &lt;a href="https://ollama.com/library/gemma3n"&gt;a 7.5GB version&lt;/a&gt; (full tag &lt;code&gt;gemma3n:e4b-it-q4_K_M&lt;/code&gt;) of the 4B model, which I ran like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama pull gemma3n
llm install llm-ollama
llm -m gemma3n:latest "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It drew me this:&lt;/p&gt;
&lt;p&gt;&lt;img alt="The pelican looks a bit like a grey pig. It is floating above a bicycle that looks more like a rail cart." src="https://static.simonwillison.net/static/2025/gemma3n-ollama.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;The Ollama version doesn't appear to support image or audio input yet.&lt;/p&gt;
&lt;p&gt;... but the &lt;a href="https://github.com/Blaizzy/mlx-vlm"&gt;mlx-vlm&lt;/a&gt; version does!&lt;/p&gt;
&lt;p&gt;First I tried that on &lt;a href="https://static.simonwillison.net/static/2025/pelican-joke-request.wav"&gt;this WAV file&lt;/a&gt; like so (using a recipe adapted from &lt;a href="https://www.youtube.com/watch?v=8-8R2UvUBrc"&gt;Prince Canuma's video&lt;/a&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uv run --with mlx-vlm mlx_vlm.generate \
  --model gg-hf-gm/gemma-3n-E4B-it \
  --max-tokens 100 \
  --temperature 0.7 \
  --prompt "Transcribe the following speech segment in English:" \
  --audio pelican-joke-request.wav
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That downloaded a 15.74 GB bfloat16 version of the model and output the following correct transcription:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Tell me a joke about a pelican.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Then I had it draw me a pelican for good measure:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uv run --with mlx-vlm mlx_vlm.generate \
  --model gg-hf-gm/gemma-3n-E4B-it \
  --max-tokens 100 \
  --temperature 0.7 \
  --prompt "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I quite like this one:&lt;/p&gt;
&lt;p&gt;&lt;img alt="The bicycle is a sin wave, but you can make out which bits are the frame and the wheels. The pelican is white with a black triangular bill. It's a much better attempt than the Ollama one." src="https://static.simonwillison.net/static/2025/gemma3n-mlx-vlm.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;It's interesting to see such a striking visual difference between those 7.5GB and 15GB model quantizations.&lt;/p&gt;
&lt;p&gt;Finally, I had it describe the image it had just created:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uv run --with mlx-vlm mlx_vlm.generate \
  --model gg-hf-gm/gemma-3n-E4B-it \
  --max-tokens 1000 \
  --prompt "Describe image" \
  --image gemma3n-mlx-vlm.jpg
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's what I got back - it thought this was a chemical diagram!&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The image is a cartoon-style illustration depicting a molecular structure against a light blue background. The structure is composed of several differently colored and shaped elements connected by curved black lines. &lt;/p&gt;
&lt;p&gt;Here's a breakdown of the elements:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Large Black Circles:&lt;/strong&gt; There are two large, solid black circles, likely representing larger molecules or components.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cream-Colored Oval:&lt;/strong&gt; A prominent cream-colored oval shape is centrally located and connected to other elements. It has a smaller, darker cream-colored sphere attached to its left side.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Red Oval:&lt;/strong&gt; A smaller, solid red oval is positioned near one of the black circles and connected to the cream-colored oval.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Black Lines:&lt;/strong&gt; These lines act as bonds, connecting the various molecular components. They are curved and dynamic, suggesting movement or interaction.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Triangular Shape:&lt;/strong&gt; A small black triangle is attached to the smaller cream-colored sphere.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Letter "I":&lt;/strong&gt; The letter "I" appears twice, likely labeling specific parts of the molecule. &lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The overall impression is of a simplified representation of a biological molecule, possibly a protein or a complex organic compound. The use of different colors helps to distinguish the various components within the structure.&lt;/p&gt;
&lt;/blockquote&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/audio"&gt;audio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemma"&gt;gemma&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prince-canuma"&gt;prince-canuma&lt;/a&gt;&lt;/p&gt;



</summary><category term="audio"/><category term="google"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="vision-llms"/><category term="mlx"/><category term="ollama"/><category term="pelican-riding-a-bicycle"/><category term="gemma"/><category term="llm-release"/><category term="prince-canuma"/></entry><entry><title>Qwen2.5-VL-32B: Smarter and Lighter</title><link href="https://simonwillison.net/2025/Mar/24/qwen25-vl-32b/#atom-tag" rel="alternate"/><published>2025-03-24T22:43:22+00:00</published><updated>2025-03-24T22:43:22+00:00</updated><id>https://simonwillison.net/2025/Mar/24/qwen25-vl-32b/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://qwenlm.github.io/blog/qwen2.5-vl-32b/"&gt;Qwen2.5-VL-32B: Smarter and Lighter&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The second big open weight LLM release from China today - the first being &lt;a href="https://simonwillison.net/2025/Mar/24/deepseek/"&gt;DeepSeek v3-0324&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Qwen's previous vision model was Qwen2.5 VL, &lt;a href="https://simonwillison.net/2025/Jan/27/qwen25-vl-qwen25-vl-qwen25-vl/"&gt;released in January&lt;/a&gt; in 3B, 7B and 72B sizes.&lt;/p&gt;
&lt;p&gt;Today's Apache 2.0 licensed release is a 32B model, which is quickly becoming my personal favourite model size - large enough to have GPT-4-class capabilities, but small enough that on my 64GB Mac there's still enough RAM for me to run other memory-hungry applications like Firefox and VS Code.&lt;/p&gt;
&lt;p&gt;Qwen claim that the new model (when compared to their previous 2.5 VL family) can "align more closely with human preferences", is better at "mathematical reasoning" and provides "enhanced accuracy and detailed analysis in tasks such as image parsing, content recognition, and visual logic deduction".&lt;/p&gt;
&lt;p&gt;They also offer some presumably carefully selected benchmark results showing it outperforming Gemma 3-27B, Mistral Small 3.1 24B and GPT-4o-0513 (there have been two more recent GPT-4o releases since that one, 2024-08-06 and 2024-11-20).&lt;/p&gt;
&lt;p&gt;As usual, Prince Canuma had MLX versions of the models live within hours of the release, in &lt;a href="https://huggingface.co/mlx-community/Qwen2.5-VL-32B-Instruct-4bit"&gt;4 bit&lt;/a&gt;, &lt;a href="https://huggingface.co/mlx-community/Qwen2.5-VL-32B-Instruct-6bit"&gt;6 bit&lt;/a&gt;, &lt;a href="https://huggingface.co/mlx-community/Qwen2.5-VL-32B-Instruct-8bit"&gt;8 bit&lt;/a&gt;, and &lt;a href="https://huggingface.co/mlx-community/Qwen2.5-VL-32B-Instruct-bf16"&gt;bf16&lt;/a&gt; variants.&lt;/p&gt;
&lt;p&gt;I ran the 4bit version (an 18GB model download) using &lt;code&gt;uv&lt;/code&gt; and Prince's &lt;a href="https://github.com/Blaizzy/mlx-vlm"&gt;mlx-vlm&lt;/a&gt; like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;uv run --with &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;numpy&amp;lt;2&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; --with mlx-vlm \
  python -m mlx_vlm.generate \
    --model mlx-community/Qwen2.5-VL-32B-Instruct-4bit \
    --max-tokens 1000 \
    --temperature 0.0 \
    --prompt &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Describe this image.&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \
    --image Mpaboundrycdfw-1.png&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Here's the image:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A map image, described below" src="https://static.simonwillison.net/static/2025/Mpaboundrycdfw-1.png" /&gt;&lt;/p&gt;
&lt;p&gt;And here's the resulting description, which I found to be &lt;em&gt;very&lt;/em&gt; impressive:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This image appears to be a map highlighting specific areas along the coastline, likely in California, based on the geographical features and place names. Here is a detailed description of the image:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Geographic Features&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The map shows a section of the coastline, with land on the right and the sea on the left.&lt;/li&gt;
&lt;li&gt;The land area includes parts of SanMateo County, as noted in the top-right corner.&lt;/li&gt;
&lt;li&gt;The coastline features several named locations, including:
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Point Montara&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pillar Point&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Kings Rock&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;VicenteCreek&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DennistonCreek&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;State Beach&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Highlighted Areas&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The map contains two distinct areas marked with different colors:
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Red Area (Top Section)&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;Labeled as "MONTARA" and "MONTARA M R" (possibly referring to a "Marine Reserves" or similar protected area).&lt;/li&gt;
&lt;li&gt;This area is marked with a dark red color and extends from the coastline inland.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blue Area (Bottom Section)&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;Labeled as "PILLAR" and "PILLAR P O I N T" along with "SMCA" (Special Management Area).&lt;/li&gt;
&lt;li&gt;This area is marked with a blue color and is located closer to the coastline, including the area around "Pillar Point."&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Depth Contours&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The map includes depth markings in the sea, indicating the bathymetrical features:
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;20 fathom (fm) line&lt;/strong&gt; is marked in the upper-left part of the sea.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;10 fathom (fm) line&lt;/strong&gt; is marked closer to the coastline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;30 fathom (fm) line&lt;/strong&gt; is marked further out to sea.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Other Features&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;State Beach&lt;/strong&gt;: Marked near the top-right, indicating a protected recreational area.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kings Rock&lt;/strong&gt;: A prominent feature near the coastline, likely a rocky outcropping.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Creeks&lt;/strong&gt;: The map shows several creeks, including &lt;strong&gt;VicenteCreek&lt;/strong&gt; and &lt;strong&gt;DennistonCreek&lt;/strong&gt;, which flow into the sea.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Protected Areas&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The map highlights specific protected areas:
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Marine Reserves&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;"MONTARA M R" (Marine Reserves) in red.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Special Management Area (SMCA)&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;"PILLAR P O I N T" in blue, indicating a Special Management Area.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Grid and Coordinates&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The map includes a grid with latitude and longitude markings:
&lt;ul&gt;
&lt;li&gt;Latitude ranges from approximately 37°25'N to 37°35'N.&lt;/li&gt;
&lt;li&gt;Longitude ranges from approximately 122°22.5'W to 122°35.5'W.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Topography&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The land area shows topographic features, including elevations and vegetation, with green areas indicating higher elevations or vegetated land.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Other Labels&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;"SMR"&lt;/strong&gt;: Likely stands for "State Managed Reserves."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"SMCA"&lt;/strong&gt;: Likely stands for "Special Management Control Area."&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In summary, this map highlights specific protected areas along the coastline, including a red "Marine Reserves" area and a blue "Special Management Area" near "Pillar Point." The map also includes depth markings, geographical features, and place names, providing a detailed view of the region's natural and protected areas.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It included the following runtime statistics:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Prompt: 1051 tokens, 111.985 tokens-per-sec
Generation: 760 tokens, 17.328 tokens-per-sec
Peak memory: 21.110 GB
&lt;/code&gt;&lt;/pre&gt;
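&lt;p&gt;Some quick back-of-the-envelope arithmetic (mine, derived from the figures above) turns those rates into wall-clock time:&lt;/p&gt;

```python
# Back-of-the-envelope timing from the reported statistics above.
prompt_tokens, prompt_tps = 1051, 111.985
gen_tokens, gen_tps = 760, 17.328

prompt_seconds = prompt_tokens / prompt_tps  # ~9.4s to process the prompt
gen_seconds = gen_tokens / gen_tps           # ~43.9s to generate the description

print(f"{prompt_seconds:.1f}s prompt + {gen_seconds:.1f}s generation "
      f"= {prompt_seconds + gen_seconds:.1f}s total")
# 9.4s prompt + 43.9s generation = 53.2s total
```

&lt;p&gt;So the full description took a little under a minute, with the bulk of that spent on generation rather than prompt processing.&lt;/p&gt;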


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prince-canuma"&gt;prince-canuma&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="vision-llms"/><category term="uv"/><category term="qwen"/><category term="mlx"/><category term="llm-release"/><category term="prince-canuma"/><category term="ai-in-china"/></entry><entry><title>Run LLMs on macOS using llm-mlx and Apple's MLX framework</title><link href="https://simonwillison.net/2025/Feb/15/llm-mlx/#atom-tag" rel="alternate"/><published>2025-02-15T19:13:29+00:00</published><updated>2025-02-15T19:13:29+00:00</updated><id>https://simonwillison.net/2025/Feb/15/llm-mlx/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;a href="https://github.com/simonw/llm-mlx"&gt;llm-mlx&lt;/a&gt; is a brand new plugin for my &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; Python Library and CLI utility which builds on top of Apple's excellent &lt;a href="https://github.com/ml-explore/mlx"&gt;MLX array framework library&lt;/a&gt; and &lt;a href="https://pypi.org/project/mlx-lm/"&gt;mlx-lm&lt;/a&gt; package. If you're a terminal user or Python developer with a Mac this may be the new easiest way to start exploring local Large Language Models.&lt;/p&gt;
&lt;h4 id="running-llama-3-2-3b-using-llm-mlx"&gt;Running Llama 3.2 3B using llm-mlx&lt;/h4&gt;
&lt;p&gt;If you haven't already got LLM installed you'll need &lt;a href="https://llm.datasette.io/en/stable/setup.html"&gt;to install it&lt;/a&gt; - you can do that in a bunch of different ways - in order of preference I like &lt;code&gt;uv tool install llm&lt;/code&gt; or &lt;code&gt;pipx install llm&lt;/code&gt; or &lt;code&gt;brew install llm&lt;/code&gt; or &lt;code&gt;pip install llm&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Next, install the new plugin (macOS only):&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-mlx&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Now download and register a model. Llama 3.2 3B is an excellent first choice - it's pretty small (a 1.8GB download) but is a surprisingly capable starter model.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm mlx download-model mlx-community/Llama-3.2-3B-Instruct-4bit&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This will download 1.8GB of model weights from &lt;a href="https://huggingface.co/mlx-community/Llama-3.2-3B-Instruct-4bit"&gt;mlx-community/Llama-3.2-3B-Instruct-4bit&lt;/a&gt; on Hugging Face and store them here:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;~/.cache/huggingface/hub/models--mlx-community--Llama-3.2-3B-Instruct-4bit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now you can start running prompts:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m mlx-community/Llama-3.2-3B-Instruct-4bit &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Python code to traverse a tree, briefly&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Which &lt;a href="https://gist.github.com/simonw/85ce6a0faf1a58a243f9e5f4b2daeb3d"&gt;output this&lt;/a&gt; for me:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Here's a brief example of how to traverse a tree in Python:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;class&lt;/span&gt; &lt;span class="pl-v"&gt;Node&lt;/span&gt;:
    &lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;__init__&lt;/span&gt;(&lt;span class="pl-s1"&gt;self&lt;/span&gt;, &lt;span class="pl-s1"&gt;value&lt;/span&gt;):
        &lt;span class="pl-s1"&gt;self&lt;/span&gt;.&lt;span class="pl-c1"&gt;value&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;value&lt;/span&gt;
        &lt;span class="pl-s1"&gt;self&lt;/span&gt;.&lt;span class="pl-c1"&gt;children&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; []

&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;traverse_tree&lt;/span&gt;(&lt;span class="pl-s1"&gt;node&lt;/span&gt;):
    &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-s1"&gt;node&lt;/span&gt; &lt;span class="pl-c1"&gt;is&lt;/span&gt; &lt;span class="pl-c1"&gt;None&lt;/span&gt;:
        &lt;span class="pl-k"&gt;return&lt;/span&gt;
    
    &lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;node&lt;/span&gt;.&lt;span class="pl-c1"&gt;value&lt;/span&gt;)
    
    &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;child&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;node&lt;/span&gt;.&lt;span class="pl-c1"&gt;children&lt;/span&gt;:
        &lt;span class="pl-en"&gt;traverse_tree&lt;/span&gt;(&lt;span class="pl-s1"&gt;child&lt;/span&gt;)

&lt;span class="pl-c"&gt;# Example usage:&lt;/span&gt;
&lt;span class="pl-s1"&gt;root&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;Node&lt;/span&gt;(&lt;span class="pl-s"&gt;"A"&lt;/span&gt;)
&lt;span class="pl-s1"&gt;root&lt;/span&gt;.&lt;span class="pl-c1"&gt;children&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; [&lt;span class="pl-en"&gt;Node&lt;/span&gt;(&lt;span class="pl-s"&gt;"B"&lt;/span&gt;), &lt;span class="pl-en"&gt;Node&lt;/span&gt;(&lt;span class="pl-s"&gt;"C"&lt;/span&gt;)]
&lt;span class="pl-s1"&gt;root&lt;/span&gt;.&lt;span class="pl-c1"&gt;children&lt;/span&gt;[&lt;span class="pl-c1"&gt;0&lt;/span&gt;].&lt;span class="pl-c1"&gt;children&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; [&lt;span class="pl-en"&gt;Node&lt;/span&gt;(&lt;span class="pl-s"&gt;"D"&lt;/span&gt;), &lt;span class="pl-en"&gt;Node&lt;/span&gt;(&lt;span class="pl-s"&gt;"E"&lt;/span&gt;)]
&lt;span class="pl-s1"&gt;root&lt;/span&gt;.&lt;span class="pl-c1"&gt;children&lt;/span&gt;[&lt;span class="pl-c1"&gt;1&lt;/span&gt;].&lt;span class="pl-c1"&gt;children&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; [&lt;span class="pl-en"&gt;Node&lt;/span&gt;(&lt;span class="pl-s"&gt;"F"&lt;/span&gt;)]

&lt;span class="pl-en"&gt;traverse_tree&lt;/span&gt;(&lt;span class="pl-s1"&gt;root&lt;/span&gt;)  &lt;span class="pl-c"&gt;# Output: A, B, D, E, C, F&lt;/span&gt;&lt;/pre&gt;
&lt;p&gt;In this example, we define a &lt;code&gt;Node&lt;/code&gt; class to represent each node in the tree, with a &lt;code&gt;value&lt;/code&gt; attribute and a list of &lt;code&gt;children&lt;/code&gt;. The &lt;code&gt;traverse_tree&lt;/code&gt; function recursively visits each node in the tree, printing its value.&lt;/p&gt;
&lt;p&gt;This is a basic example of a tree traversal, and there are many variations and optimizations depending on the specific use case.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That generation ran at an impressive &lt;a href="https://gist.github.com/simonw/85ce6a0faf1a58a243f9e5f4b2daeb3d?permalink_comment_id=5440701#gistcomment-5440701"&gt;152 tokens per second&lt;/a&gt;!&lt;/p&gt;
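&lt;p&gt;The model's closing note about variations is easy to illustrate. Here's a minimal sketch (my own addition, not part of the model's output) of an iterative, stack-based version of the same pre-order traversal, which sidesteps Python's recursion limit on very deep trees:&lt;/p&gt;

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.children = []

def traverse_tree_iterative(node):
    """Pre-order traversal using an explicit stack instead of recursion."""
    visited = []
    stack = [node]
    while stack:
        current = stack.pop()
        if current is None:
            continue
        visited.append(current.value)
        # Push children in reverse so the leftmost child is popped first
        stack.extend(reversed(current.children))
    return visited

# The same tree the model built in its example
root = Node("A")
root.children = [Node("B"), Node("C")]
root.children[0].children = [Node("D"), Node("E")]
root.children[1].children = [Node("F")]

print(traverse_tree_iterative(root))  # ['A', 'B', 'D', 'E', 'C', 'F']
```

&lt;p&gt;It visits the nodes in the same A, B, D, E, C, F order as the recursive version.&lt;/p&gt;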
&lt;p&gt;That command was a bit of a mouthful, so let's assign an alias to the model:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm aliases &lt;span class="pl-c1"&gt;set&lt;/span&gt; l32 mlx-community/Llama-3.2-3B-Instruct-4bit&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Now we can use that shorter alias instead:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m l32 &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;a joke about a haggis buying a car&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;(The &lt;a href="https://gist.github.com/simonw/af322b399ce448f766d9349d9dc1cb5d#response"&gt;joke&lt;/a&gt; isn't very good.)&lt;/p&gt;
&lt;p&gt;As with other models supported by LLM, you can also pipe things to it. Here's how to get it to explain a piece of Python code (in this case &lt;a href="https://github.com/simonw/llm-mlx/blob/0.2/llm_mlx.py"&gt;itself&lt;/a&gt;):&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;cat llm_mlx.py &lt;span class="pl-k"&gt;|&lt;/span&gt; llm -m l32 &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;explain this code&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The response started like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This code is a part of a larger project that uses the Hugging Face Transformers library to create a text-to-text conversational AI model. The code defines a custom model class &lt;code&gt;MlxModel&lt;/code&gt; and a set of command-line interface (CLI) commands for working with MLX models. [...]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/a577c820b18591d4825b70679cee3f9c"&gt;the rest of the response&lt;/a&gt;. I'm pretty amazed at how well it did for a tiny 1.8GB model!&lt;/p&gt;

&lt;p&gt;This plugin can only run models that have been converted to work with Apple's MLX framework. Thankfully the &lt;a href="https://huggingface.co/mlx-community"&gt;mlx-community&lt;/a&gt; organization on Hugging Face has published over 1,000 of these. A few I've tried successfully:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;mlx-community/Qwen2.5-0.5B-Instruct-4bit&lt;/code&gt; - &lt;a href="https://huggingface.co/mlx-community/Qwen2.5-0.5B-Instruct-4bit"&gt;278 MB&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;mlx-community/Mistral-7B-Instruct-v0.3-4bit&lt;/code&gt; - &lt;a href="https://huggingface.co/mlx-community/Mistral-7B-Instruct-v0.3-4bit"&gt;4.08 GB&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;mlx-community/Mistral-Small-24B-Instruct-2501-4bit&lt;/code&gt; - &lt;a href="https://huggingface.co/mlx-community/Mistral-Small-24B-Instruct-2501-4bit"&gt;13.26 GB&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;mlx-community/DeepSeek-R1-Distill-Qwen-32B-4bit&lt;/code&gt; - &lt;a href="https://huggingface.co/mlx-community/DeepSeek-R1-Distill-Qwen-32B-4bit"&gt;18.5 GB&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;mlx-community/Llama-3.3-70B-Instruct-4bit&lt;/code&gt; - &lt;a href="https://huggingface.co/mlx-community/Llama-3.3-70B-Instruct-4bit"&gt;40 GB&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Mistral-Small-24B really does feel GPT-4 quality despite only needing around 12GB of RAM to run - so it's a good default model if you want to leave space to run other apps.&lt;/p&gt;

&lt;p&gt;DeepSeek-R1-Distill-Qwen-32B-4bit is a great model if you want to try out R1 style "reasoning" which outputs a &lt;code&gt;&amp;lt;think&amp;gt;...&amp;lt;/think&amp;gt;...&lt;/code&gt; block at the start of each response.&lt;/p&gt;
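&lt;p&gt;If you only want the final answer, that reasoning block is easy to strip. A minimal sketch of my own (assuming a single think block at the start of the response):&lt;/p&gt;

```python
import re

def split_reasoning(response):
    # Separate an R1-style <think>...</think> block from the final answer.
    # Assumes at most one think block, at the start of the response.
    match = re.match(r"\s*<think>(.*?)</think>\s*(.*)", response, re.DOTALL)
    if match:
        return match.group(1).strip(), match.group(2).strip()
    return None, response.strip()

thinking, answer = split_reasoning(
    "<think>The user wants a joke about haggis...</think>Why did the haggis cross the road?"
)
print(answer)  # Why did the haggis cross the road?
```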

&lt;p&gt;That last Llama 70B one needs a machine with 64GB of RAM to run but is &lt;em&gt;very&lt;/em&gt; capable - I wrote about trying out the Ollama version of Llama 3.3 70B in &lt;a href="https://simonwillison.net/2024/Dec/9/llama-33-70b/"&gt;I can now run a GPT-4 class model on my laptop&lt;/a&gt; back in December.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm mlx download-model mlx-community/Llama-3.3-70B-Instruct-4bit&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/mlx-download-llama-33.jpg" alt="Screenshot of the download-model command running against the Llama 3.3 70B model, it shows fetching 13 files and displays progress bars for 8 different safetensors files each weighing between 2.6 and 3.3GB." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;For larger models like that I recommend using the &lt;code&gt;llm chat&lt;/code&gt; command as that ensures they stay in memory for multiple rounds of the conversation:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm chat -m mlx-community/Llama-3.3-70B-Instruct-4bit&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Activity Monitor shows the process using 37.5GB of RAM while the model is running.&lt;/p&gt;
&lt;p&gt;Some of these jokes are actually pretty good!&lt;/p&gt;
&lt;blockquote&gt;
&lt;pre&gt;&lt;code&gt;Chatting with mlx-community/Llama-3.3-70B-Instruct-4bit
Type 'exit' or 'quit' to exit
Type '!multi' to enter multiple lines, then '!end' to finish
&amp;gt; hi
How's your day going so far? Is there something I can help you with or would you like to chat?
&amp;gt; jokes about weird animals please
Here are a few:

1. Why did the axolotl go to therapy? Because it was struggling to come to terms with its identity crisis... and also because it had a lot of "regrowing" to do!
2. What did the aye-aye say to the other aye-aye when they were out on a date? "You've got a finger on the pulse of my heart!"
3. Why did the naked mole rat go to the party? Because it was a "hair-raising" experience and it wanted to get to the "root" of the fun!
4. What do you call a group of flamingos playing instruments? The Pink Floyd... of the avian world, of course!
5. Why did the blobfish go to the doctor? It had a "gut" feeling something was wrong... and also because it was feeling a little "deflated"!

Want more? I've got a whole zoo of 'em!
&lt;/code&gt;&lt;/pre&gt;
&lt;/blockquote&gt;
&lt;p&gt;That ran at &lt;a href="https://gist.github.com/simonw/ec99b6cf65ae61256a34f3ac269951dc"&gt;8.8 tokens per second&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As with every LLM plugin, all prompts and responses are logged to &lt;a href="https://llm.datasette.io/en/stable/logging.html"&gt;a SQLite database&lt;/a&gt; that can be accessed using the &lt;code&gt;llm logs&lt;/code&gt; command.&lt;/p&gt;
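&lt;p&gt;You can also query that database directly with Python's &lt;code&gt;sqlite3&lt;/code&gt; module. This sketch uses an in-memory database with a deliberately simplified &lt;code&gt;responses&lt;/code&gt; table - the real schema is richer, and the real database path comes from &lt;code&gt;llm logs path&lt;/code&gt;:&lt;/p&gt;

```python
import sqlite3

# Illustrative only: LLM's actual logs database has a richer schema -
# this stripped-down "responses" table is an assumption for the demo.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE responses (model TEXT, prompt TEXT, response TEXT)")
db.execute(
    "INSERT INTO responses VALUES (?, ?, ?)",
    ("l32", "a joke about a haggis buying a car", "Why did the haggis..."),
)
rows = db.execute(
    "SELECT model, prompt FROM responses ORDER BY rowid DESC LIMIT 3"
).fetchall()
print(rows)  # [('l32', 'a joke about a haggis buying a car')]
```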

&lt;p&gt;You can use LLM models from your Python code as well. The README includes &lt;a href="https://github.com/simonw/llm-mlx/blob/0.2.1/README.md#using-models-from-python"&gt;notes on how to do this&lt;/a&gt;, as does the &lt;a href="https://llm.datasette.io/en/stable/python-api.html"&gt;LLM Python API documentation&lt;/a&gt;. TLDR:&lt;/p&gt;

&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;llm&lt;/span&gt;
&lt;span class="pl-s1"&gt;model&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;llm&lt;/span&gt;.&lt;span class="pl-c1"&gt;get_model&lt;/span&gt;(&lt;span class="pl-s"&gt;"mlx-community/Llama-3.2-3B-Instruct-4bit"&lt;/span&gt;)
&lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;model&lt;/span&gt;.&lt;span class="pl-c1"&gt;prompt&lt;/span&gt;(&lt;span class="pl-s"&gt;"hi"&lt;/span&gt;).&lt;span class="pl-c1"&gt;text&lt;/span&gt;())&lt;/pre&gt;

&lt;h4 id="mlx-and-mlx-lm"&gt;MLX and mlx-lm&lt;/h4&gt;
&lt;p&gt;The secret sauce behind this new plugin is Apple's &lt;a href="https://github.com/ml-explore/mlx"&gt;MLX library&lt;/a&gt;, which has been improving at an extraordinary pace over the past year - it first launched on GitHub in November 2023.&lt;/p&gt;
&lt;p&gt;The core library is C++, but the MLX team at Apple also maintain an &lt;a href="https://github.com/ml-explore/mlx-examples"&gt;mlx-examples&lt;/a&gt; repo with a wealth of Python examples, including the &lt;a href="https://github.com/ml-explore/mlx-examples/blob/main/llms/README.md"&gt;mlx-lm&lt;/a&gt; and &lt;a href="https://github.com/ml-explore/mlx-examples/tree/main/whisper"&gt;mlx-whisper&lt;/a&gt; packages.&lt;/p&gt;
&lt;p&gt;The community around it is growing too. I'm particularly impressed by Prince Canuma's &lt;a href="https://github.com/Blaizzy/mlx-vlm"&gt;mlx-vlm&lt;/a&gt; library for accessing multi-modal vision LLMs - I'd love to get that working as an LLM plugin once I figure out &lt;a href="https://github.com/Blaizzy/mlx-vlm/issues/68"&gt;how to use it for conversations&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I've used MLX for &lt;a href="https://simonwillison.net/tags/mlx/"&gt;a few experiments&lt;/a&gt; in the past, but &lt;a href="https://twitter.com/awnihannun/status/1890524526821126620"&gt;this tweet&lt;/a&gt; from MLX core developer Awni Hannun finally convinced me to wrap it up as an LLM plugin:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In the latest MLX small LLMs are a lot faster.&lt;/p&gt;
&lt;p&gt;On M4 Max 4-bit Qwen 0.5B generates 1k tokens at a whopping 510 toks/sec. And runs at over 150 tok/sec on iPhone 16 pro.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is &lt;em&gt;really&lt;/em&gt; good software. This small team at Apple appear to be almost single-handedly giving NVIDIA's CUDA a run for its money!&lt;/p&gt;
&lt;h4 id="building-the-plugin"&gt;Building the plugin&lt;/h4&gt;
&lt;p&gt;The &lt;a href="https://github.com/simonw/llm-mlx"&gt;llm-mlx&lt;/a&gt; plugin came together pretty quickly. The &lt;a href="https://github.com/simonw/llm-mlx/blob/0.1/llm_mlx.py"&gt;first version&lt;/a&gt; was ~100 lines of Python, much of it repurposed from my existing &lt;a href="https://github.com/simonw/llm-gguf"&gt;llm-gguf&lt;/a&gt; plugin.&lt;/p&gt;
&lt;p&gt;The hardest problem was figuring out how to hide the Hugging Face progress bars that displayed every time it attempted to access a model!&lt;/p&gt;
&lt;p&gt;I eventually found the &lt;code&gt;from huggingface_hub.utils import disable_progress_bars&lt;/code&gt; utility function by piping library code through Gemini 2.0.&lt;/p&gt;
&lt;p&gt;I then added &lt;a href="https://github.com/simonw/llm-mlx/issues/3"&gt;model options support&lt;/a&gt; allowing you to pass options like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m l32 &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;a greeting&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; -o temperature 1.0 -o seed 2&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;So far using a fixed seed &lt;a href="https://github.com/simonw/llm-mlx/issues/3#issuecomment-2661025591"&gt;appears to produce repeatable results&lt;/a&gt;, which is exciting for writing more predictable tests in the future.&lt;/p&gt;
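&lt;p&gt;That's the behaviour you'd hope for from seeded sampling in general: the same seed drives the same sequence of random draws. A toy illustration in plain Python (nothing to do with MLX internals):&lt;/p&gt;

```python
import random

def sample_tokens(seed, vocab=("a", "haggis", "buys", "a", "car"), n=5):
    # Toy sampler: a fixed seed makes the pseudo-random draws repeatable,
    # which is what makes seeded generation useful for predictable tests.
    rng = random.Random(seed)
    return [rng.choice(vocab) for _ in range(n)]

assert sample_tokens(2) == sample_tokens(2)  # same seed, identical output
print(sample_tokens(2))
```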

&lt;p&gt;For the &lt;a href="https://github.com/simonw/llm-mlx/blob/0.2/tests/test_mlx.py"&gt;automated tests&lt;/a&gt; that run in GitHub Actions I decided to use a small model - I went with the tiny 75MB &lt;a href="https://huggingface.co/mlx-community/SmolLM-135M-Instruct-4bit/tree/main"&gt;mlx-community/SmolLM-135M-Instruct-4bit&lt;/a&gt; (explored &lt;a href="https://simonwillison.net/2025/Feb/7/pip-install-llm-smollm2/"&gt;previously&lt;/a&gt;). I configured GitHub Actions to cache the model in between CI runs by adding the following YAML to my &lt;a href="https://github.com/simonw/llm-mlx/blob/0.2/.github/workflows/test.yml"&gt;.github/workflows/test.yml&lt;/a&gt; file:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Cache models&lt;/span&gt;
      &lt;span class="pl-ent"&gt;uses&lt;/span&gt;: &lt;span class="pl-s"&gt;actions/cache@v4&lt;/span&gt;
      &lt;span class="pl-ent"&gt;with&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;path&lt;/span&gt;: &lt;span class="pl-s"&gt;~/.cache/huggingface&lt;/span&gt;
        &lt;span class="pl-ent"&gt;key&lt;/span&gt;: &lt;span class="pl-s"&gt;${{ runner.os }}-huggingface-&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/apple"&gt;apple&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/plugins"&gt;plugins&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prince-canuma"&gt;prince-canuma&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="apple"/><category term="plugins"/><category term="projects"/><category term="ai"/><category term="github-actions"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="qwen"/><category term="mlx"/><category term="prince-canuma"/><category term="ai-in-china"/></entry><entry><title>Trying out QvQ - Qwen's new visual reasoning model</title><link href="https://simonwillison.net/2024/Dec/24/qvq/#atom-tag" rel="alternate"/><published>2024-12-24T20:49:08+00:00</published><updated>2024-12-24T20:49:08+00:00</updated><id>https://simonwillison.net/2024/Dec/24/qvq/#atom-tag</id><summary type="html">
    &lt;p&gt;I thought we were done for major model releases in 2024, but apparently not: Alibaba's Qwen team just dropped the &lt;s&gt;Apache 2.0 licensed&lt;/s&gt; Qwen licensed (&lt;a href="https://simonwillison.net/2024/Dec/24/qvq/#the-license-changed"&gt;the license changed&lt;/a&gt;) QvQ-72B-Preview, "an experimental research model focusing on enhancing visual reasoning capabilities".&lt;/p&gt;
&lt;p&gt;Their blog post is titled &lt;a href="https://qwenlm.github.io/blog/qvq-72b-preview/"&gt;QvQ: To See the World with Wisdom&lt;/a&gt; - similar flowery language to their QwQ announcement &lt;a href="https://qwenlm.github.io/blog/qwq-32b-preview/"&gt;QwQ: Reflect Deeply on the Boundaries of the Unknown&lt;/a&gt; a few weeks ago in November.&lt;/p&gt;
&lt;p&gt;It's a vision-focused follow-up to QwQ, which I &lt;a href="https://simonwillison.net/2024/Nov/27/qwq/"&gt;wrote about previously&lt;/a&gt;. QwQ is an impressive openly licensed inference-scaling model: give it a prompt and it will think out loud over many tokens while trying to derive a good answer, similar to OpenAI's o1 and o3 models.&lt;/p&gt;
&lt;p&gt;The new QvQ adds vision to the mix. You can try it out &lt;a href="https://huggingface.co/spaces/Qwen/QVQ-72B-preview"&gt;on Hugging Face Spaces&lt;/a&gt; - it accepts an image and a single prompt and then streams out a &lt;em&gt;very&lt;/em&gt; long response where it thinks through the problem you have posed it. There's no option to send a follow-up prompt.&lt;/p&gt;
&lt;p&gt;I've tried it out with a bunch of things, with mixed results - but it's really fun seeing how it works through a problem.&lt;/p&gt;
&lt;p&gt;My most successful prompt was "Count the pelicans" with this image:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/count-pelicans-easy.jpg" alt="A photograph of four glorious young California Brown Pelicans perched on a dock. The back end of another pelican is visible at the front of the photo." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/6c296f4b9323736dc77978447b6368fc#response"&gt;the full response&lt;/a&gt;. Some highlights:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;So I've got this picture of some pelicans, and I need to count them. Let me see... First off, I know that pelicans are those big birds with the massive beaks, right? [...]&lt;/p&gt;
&lt;p&gt;Let me try counting again, more carefully this time. Starting from the left:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The first pelican is fully visible, standing at the front.&lt;/li&gt;
&lt;li&gt;Behind it, slightly to the right, is the second pelican.&lt;/li&gt;
&lt;li&gt;Then, further back, there's a third pelican.&lt;/li&gt;
&lt;li&gt;And next to that one, there's a fourth pelican.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;So, that's four pelicans in total. But, I notice that on the far right of the image, there's part of another bird, maybe a pelican's back or tail feathers, but it's not fully in the frame, so I don't think I should count it as a whole pelican.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Something about the tone of voice it uses, "First off, I know that pelicans are those big birds with the massive beaks, right?", tickles me.&lt;/p&gt;
&lt;p&gt;I also tried this on a &lt;a href="https://gist.github.com/simonw/6c296f4b9323736dc77978447b6368fc#image-2"&gt;much harder image&lt;/a&gt; - it counted 17 which looks about right to me but honestly I have trouble with that image myself.&lt;/p&gt;
&lt;p&gt;I fed in &lt;a href="https://gist.github.com/simonw/6c296f4b9323736dc77978447b6368fc#arc-agi-puzzle-0d87d2a6"&gt;one of the ARC-AGI puzzles&lt;/a&gt; that o3 had failed at.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A puzzle with colored squares, where drawing a line between the single blue squares and turning any intersected rectangles blue is clearly the solution." src="https://static.simonwillison.net/static/2024/arc-agi-task-0d87d2a6.png" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;It produced &lt;a href="https://gist.github.com/simonw/6c296f4b9323736dc77978447b6368fc#response-1"&gt;a very lengthy chain of thought&lt;/a&gt; that was almost entirely incorrect, but had some interesting "ideas" in it:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[...] Let me try to think in terms of cellular automata rules, like Conway's Game of Life.&lt;/p&gt;
&lt;p&gt;In Game of Life, each cell's state in the next generation is determined by its current state and the states of its eight neighbors.&lt;/p&gt;
&lt;p&gt;Perhaps a similar rule applies here.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Finally, I asked it to "Estimate the height of the dinosaur" against this image (which, as it correctly noted, is actually an inflatable dragon):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/inflatable-dragon.jpg" alt="A large inflatable Christmas decoration of a reindeer riding a dragon - both reindeer and dragon are wearing Christmas hats" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I'm not sure this task is even possible given the lack of items in the frame of comparable known height. It &lt;a href="https://gist.github.com/simonw/6c296f4b9323736dc77978447b6368fc#response-3"&gt;had a good go at it&lt;/a&gt; though! I particularly enjoyed how it began:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;So I've got this inflatable Christmas decoration, and it's a dragon! But not just any dragon—it's a green dragon wearing a Santa hat and a red and white scarf. And get this, there's a reindeer on its back, also wearing a Santa hat and holding onto the dragon's wings. It's pretty festive and colorful, with lots of vibrant greens, reds, and yellows.&lt;/p&gt;
&lt;p&gt;The dragon has these big, friendly eyes and a smiling face, which makes it look really cheerful. Its wings are spread out, and it has white claws on its feet. The reindeer is brown with antlers and seems to be enjoying the ride on the dragon's back.&lt;/p&gt;
&lt;p&gt;Now, the question is, how tall is this inflatable dragon? I need to estimate its height. Let me see. I'm standing next to it, and I'm about 5 feet 10 inches tall. The dragon is definitely taller than me. Maybe around 8 or 9 feet high? But I should try to be more precise.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I wonder how it decided that its own height was 5 feet 10 inches!&lt;/p&gt;
&lt;h4 id="running-qvq-locally"&gt;Running QvQ locally&lt;/h4&gt;
&lt;p&gt;All of my experiments so far have used the &lt;a href="https://huggingface.co/spaces/Qwen/QVQ-72B-preview"&gt;hosted Hugging Face demo&lt;/a&gt;. I'm hoping to try it out on my own laptop soon - I think it should just fit in 64GB of M2, maybe with a smaller quantization.&lt;/p&gt;
&lt;p&gt;Right now the options are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/Qwen/QVQ-72B-Preview"&gt;Qwen/QVQ-72B-Preview&lt;/a&gt; on Hugging Face has the GPU model weights, for use with Hugging Face Transformers and the &lt;a href="https://pypi.org/project/qwen-vl-utils/"&gt;qwen-vl-utils&lt;/a&gt; Python package.&lt;/li&gt;
&lt;li&gt;Prince Canuma &lt;a href="https://twitter.com/Prince_Canuma/status/1871687356815716672"&gt;converted the model for Apple's MLX framework&lt;/a&gt; as &lt;a href="https://huggingface.co/collections/mlx-community/qvq-72b-preview-676b345a6f93172ba980c0d5"&gt;mlx-community/qvq-72b-preview&lt;/a&gt; and got it working with his excellent &lt;a href="https://github.com/Blaizzy/mlx-vlm"&gt;mlx-vlm&lt;/a&gt; package.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As a happy user of Ollama's &lt;a href="https://ollama.com/library/qwq"&gt;qwq port&lt;/a&gt; I'm hoping they add a QvQ release at some point soon as well.&lt;/p&gt;
&lt;h4 id="with-mlx-vlm"&gt;Running QvQ on my Mac using mlx-vlm&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: I successfully ran QvQ on my 64GB M2 MacBook Pro using Prince Canuma's &lt;code&gt;mlx-vlm&lt;/code&gt; via &lt;code&gt;uv run&lt;/code&gt; with the following command:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;uv run --with &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;numpy&amp;lt;2.0&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; --with mlx-vlm python \
  -m mlx_vlm.generate \
    --model mlx-community/QVQ-72B-Preview-4bit \
    --max-tokens 10000 \
    --temp 0.0 \
    --prompt &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;describe this&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \
    --image pelicans-on-bicycles-veo2.jpg&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;My image input here was taken from my experiment with Google's &lt;a href="https://simonwillison.net/2024/Dec/16/veo-2/"&gt;Veo 2&lt;/a&gt; video generation model:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/pelicans-on-bicycles-veo2.jpg" alt="A screenshot of the Google VideoFX video generation tool" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Here's the full output from running that command, after it first downloaded 38GB to &lt;code&gt;~/.cache/huggingface/hub/models--mlx-community--QVQ-72B-Preview-4bi&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;==========
Image: ['pelicans-on-bicycles-veo2.jpg'] 

Prompt: &amp;lt;|im_start|&amp;gt;system
You are a helpful and harmless assistant. You are Qwen developed by Alibaba. You should think step-by-step.&amp;lt;|im_end|&amp;gt;
&amp;lt;|im_start|&amp;gt;user
describe this&amp;lt;|vision_start|&amp;gt;&amp;lt;|image_pad|&amp;gt;&amp;lt;|vision_end|&amp;gt;&amp;lt;|im_end|&amp;gt;
&amp;lt;|im_start|&amp;gt;assistant

Alright, I've got this "Text to Video" tool to explore. It seems pretty advanced, allowing me to turn text descriptions into actual videos. The interface looks user-friendly, with a dark theme that's easy on the eyes. On the left side, there's a panel where I can input my text prompt. It already has an example filled in: "A pelican riding a bicycle along a coastal path overlooking a harbor."

That sounds interesting! I can see four different video thumbnails on the right side, each showing variations of this scene. One of them even has the pelican wearing a helmet, which adds a humorous touch. Another shows the pelican standing next to a bicycle, almost as if it's ready to hop on and start pedaling.

Below the text input, there are some style options like "photorealistic," "35mm film," "minimal," "sketchy," "handmade," and "abstract." These probably let me choose the visual style of the video. For now, I'll stick with the default settings.

At the bottom of the left panel, there are two buttons: "Start over" and "Create videos." I guess "Create videos" is the one I need to press to generate the videos based on my text prompt.

Let me think about this. I want to see what kind of videos this tool can produce based on my description. The example prompt is already filled in, so maybe I should just go ahead and click "Create videos" to see what happens.

Wait a minute, the prompt mentions a coastal path overlooking a harbor. In the thumbnails, I can see a background with water and boats, which matches the description. It's cool how the tool has interpreted the text into visual elements.

I should also pay attention to the "Visual Prompt" section, which says "View only." That probably means I can't edit it directly, but I can view the generated videos based on this prompt.

Hmm, maybe I should try changing the prompt to see different results. For example, instead of a pelican riding a bicycle, maybe I could describe a different scene. But for now, I'll stick with the given example to understand how it works.

After clicking "Create videos," I expect to see the videos being generated. There might be a loading process, depending on the complexity of the prompt and the tool's processing speed.

Once the videos are ready, I can preview them to see how well they match my description. If I'm not satisfied, I can always go back and adjust the prompt or try different style options.

Overall, this seems like a powerful tool for creating custom videos based on textual descriptions. It could be really useful for storytelling, presentations, or even just for fun, experimenting with different scenes and styles.

I think I'll spend some time exploring the different style options to see how they affect the final videos. For instance, selecting "35mm film" might give the videos a vintage, cinematic look, while "abstract" could produce more artistic and less realistic visuals.

Also, I wonder if there's a way to customize more specific aspects, like the camera angle, lighting, or movement. Maybe after generating the initial videos, there are additional editing options available.

In summary, I'm excited to use this Text to Video tool and see what kind of creative possibilities it opens up. With a bit of practice, I could probably create some impressive visual content just by describing what I want to see.

**Final Answer**

To proceed with generating videos based on the provided text prompt, I need to click on the "Create videos" button. This action will initiate the process of converting the textual description into visual content, resulting in multiple video thumbnails that can be previewed for further use or adjustments.
==========
Prompt: 0.870 tokens-per-sec
Generation: 7.694 tokens-per-sec
&lt;/code&gt;&lt;/pre&gt;

&lt;h4 id="the-license-changed"&gt;The license changed from Apache 2.0 to Qwen&lt;/h4&gt;
&lt;p&gt;When I wrote this post yesterday the LICENSE file in the Hugging Face repository was Apache 2.0. Just after midnight UTC on 25th December &lt;a href="https://huggingface.co/Qwen/QVQ-72B-Preview/commit/53b19b90d67220c896e868a809ef1b93d0c8dab8"&gt;this commit&lt;/a&gt; landed updating the &lt;code&gt;QVQ-72B-Preview&lt;/code&gt; license file to the &lt;a href="https://huggingface.co/Qwen/QVQ-72B-Preview/blob/main/LICENSE"&gt;Qwen license&lt;/a&gt; instead.&lt;/p&gt;
&lt;p&gt;This looks to me like they were correcting a mistake, not changing their policy. The &lt;a href="https://huggingface.co/Qwen/QVQ-72B-Preview/blob/main/README.md"&gt;README.md&lt;/a&gt; for that repository has this block of YAML:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;license: other
license_name: qwen
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And &lt;a href="https://huggingface.co/Qwen/QVQ-72B-Preview/commits/main/README.md"&gt;commits to that README&lt;/a&gt; at one point linked to the &lt;a href="https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/blob/main/LICENSE"&gt;Qwen2.5-72B-Instruct copy of the Qwen license&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The QwQ model repository &lt;a href="https://huggingface.co/Qwen/QwQ-32B-Preview/blob/main/LICENSE"&gt;continues to list Apache 2.0&lt;/a&gt;, which matches the YAML in &lt;a href="https://huggingface.co/Qwen/QwQ-32B-Preview/blob/main/README.md"&gt;its README&lt;/a&gt; as well.&lt;/p&gt;
&lt;p&gt;So it looks to me like the intention is for QvQ and Qwen2.5-72B-Instruct to be Qwen licensed, while QwQ is Apache 2.0.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/hugging-face"&gt;hugging-face&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prince-canuma"&gt;prince-canuma&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="python"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="hugging-face"/><category term="vision-llms"/><category term="uv"/><category term="qwen"/><category term="mlx"/><category term="llm-reasoning"/><category term="llm-release"/><category term="prince-canuma"/><category term="ai-in-china"/></entry><entry><title>SmolVLM - small yet mighty Vision Language Model</title><link href="https://simonwillison.net/2024/Nov/28/smolvlm/#atom-tag" rel="alternate"/><published>2024-11-28T20:29:27+00:00</published><updated>2024-11-28T20:29:27+00:00</updated><id>https://simonwillison.net/2024/Nov/28/smolvlm/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/blog/smolvlm"&gt;SmolVLM - small yet mighty Vision Language Model&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I've been having fun playing with this new vision model from the Hugging Face team behind &lt;a href="https://simonwillison.net/2024/Nov/2/smollm2/"&gt;SmolLM&lt;/a&gt;. They describe it as:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[...] a 2B VLM, SOTA for its memory footprint. SmolVLM is small, fast, memory-efficient, and fully open-source. All model checkpoints, VLM datasets, training recipes and tools are released under the Apache 2.0 license.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I've tried it in a few flavours but my favourite so far is the &lt;a href="https://github.com/Blaizzy/mlx-vlm"&gt;mlx-vlm&lt;/a&gt; approach, via &lt;code&gt;mlx-vlm&lt;/code&gt; author &lt;a href="https://twitter.com/Prince_Canuma/status/1862168514842280401"&gt;Prince Canuma&lt;/a&gt;. Here's the &lt;code&gt;uv&lt;/code&gt; recipe I'm using to run it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uv run \
  --with mlx-vlm \
  --with torch \
  python -m mlx_vlm.generate \
    --model mlx-community/SmolVLM-Instruct-bf16 \
    --max-tokens 500 \
    --temp 0.5 \
    --prompt "Describe this image in detail" \
    --image IMG_4414.JPG
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you run into an error using Python 3.13 (torch compatibility) try &lt;code&gt;uv run --python 3.11&lt;/code&gt; instead.&lt;/p&gt;
&lt;p&gt;This one-liner installs the necessary dependencies, downloads the model (about 4.2GB, saved to &lt;code&gt;~/.cache/huggingface/hub/models--mlx-community--SmolVLM-Instruct-bf16&lt;/code&gt;) and executes the prompt and displays the result.&lt;/p&gt;
&lt;p&gt;I ran that against &lt;a href="https://static.simonwillison.net/static/2024/IMG_4414.JPG"&gt;this Pelican photo&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A glorious pelican on some rocks, two other pelicans are visible plus some other birds" src="https://static.simonwillison.net/static/2024/IMG_4414.JPG" /&gt;&lt;/p&gt;
&lt;p&gt;The model replied:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In the foreground of this photograph, a pelican is perched on a pile of rocks. The pelican’s wings are spread out, and its beak is open. There is a small bird standing on the rocks in front of the pelican. The bird has its head cocked to one side, and it seems to be looking at the pelican. To the left of the pelican is another bird, and behind the pelican are some other birds. The rocks in the background of the image are gray, and they are covered with a variety of textures. The rocks in the background appear to be wet from either rain or sea spray.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;There are a few spatial mistakes in that description but the vibes are generally in the right direction.&lt;/p&gt;
&lt;p&gt;On my 64GB M2 MacBook Pro it read the prompt at 7.831 tokens/second and generated that response at an impressive 74.765 tokens/second.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/smollm"&gt;smollm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prince-canuma"&gt;prince-canuma&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="vision-llms"/><category term="uv"/><category term="mlx"/><category term="smollm"/><category term="llm-release"/><category term="prince-canuma"/></entry><entry><title>mlx-vlm</title><link href="https://simonwillison.net/2024/Sep/29/mlx-vlm/#atom-tag" rel="alternate"/><published>2024-09-29T21:38:46+00:00</published><updated>2024-09-29T21:38:46+00:00</updated><id>https://simonwillison.net/2024/Sep/29/mlx-vlm/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/Blaizzy/mlx-vlm"&gt;mlx-vlm&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
The MLX ecosystem of libraries for running machine learning models on Apple Silicon continues to expand. Prince Canuma is actively developing this library for running vision models such as Qwen-2 VL, Pixtral, and LLaVA using Python on a Mac.&lt;/p&gt;
&lt;p&gt;I used &lt;a href="https://docs.astral.sh/uv/"&gt;uv&lt;/a&gt; to run it against &lt;a href="https://static.simonwillison.net/static/2024/django-roadmap.png"&gt;this image&lt;/a&gt; with this shell one-liner:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uv run --with mlx-vlm \
  python -m mlx_vlm.generate \
  --model Qwen/Qwen2-VL-2B-Instruct \
  --max-tokens 1000 \
  --temp 0.0 \
  --image https://static.simonwillison.net/static/2024/django-roadmap.png \
  --prompt "Describe image in detail, include all text"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;--image&lt;/code&gt; option works equally well with a URL or a path to a local file on disk.&lt;/p&gt;
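&lt;p&gt;So if you had already downloaded the screenshot, the same invocation with a local file would look like this (the filename here is just for illustration):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uv run --with mlx-vlm \
  python -m mlx_vlm.generate \
  --model Qwen/Qwen2-VL-2B-Instruct \
  --max-tokens 1000 \
  --temp 0.0 \
  --image django-roadmap.png \
  --prompt "Describe image in detail, include all text"
&lt;/code&gt;&lt;/pre&gt;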
&lt;p&gt;&lt;img alt="Django roadmap diagram, described below" src="https://static.simonwillison.net/static/2024/django-roadmap.png" /&gt;&lt;/p&gt;
&lt;p&gt;This first downloaded 4.1GB to my &lt;code&gt;~/.cache/huggingface/hub/models--Qwen--Qwen2-VL-2B-Instruct&lt;/code&gt; folder and then output &lt;a href="https://gist.github.com/simonw/9e02d425cacb902260ec1307e0671e17"&gt;this result&lt;/a&gt;, which starts:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The image is a horizontal timeline chart that represents the release dates of various software versions. The timeline is divided into years from 2023 to 2029, with each year represented by a vertical line. The chart includes a legend at the bottom, which distinguishes between different types of software versions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Legend&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Mainstream Support:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;4.2 (2023)&lt;/li&gt;
&lt;li&gt;5.0 (2024)&lt;/li&gt;
&lt;li&gt;5.1 (2025)&lt;/li&gt;
&lt;li&gt;5.2 (2026)&lt;/li&gt;
&lt;li&gt;6.0 (2027) [...]&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://mastodon.social/@zubakskees/113221293869864076"&gt;Chris Zubak-Skees&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/apple"&gt;apple&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prince-canuma"&gt;prince-canuma&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="apple"/><category term="python"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="vision-llms"/><category term="uv"/><category term="qwen"/><category term="mlx"/><category term="prince-canuma"/><category term="ai-in-china"/></entry></feed>