Simon Willison's Weblog: ffmpeg

Feed a video to a vision LLM as a sequence of JPEG frames on the CLI (also LLM 0.25)

2025-05-05T17:38:25+00:00

The new llm-video-frames plugin can turn a video file into a sequence of JPEG frames and feed them directly into a long context vision LLM such as GPT-4.1, even when that LLM doesn't directly support video input. It depends on a plugin feature I added to LLM 0.25, which I released last night.

Here's how to try it out:

brew install ffmpeg # or apt-get or your package manager of choice
uv tool install llm # or pipx install llm or pip install llm
llm install llm-video-frames
llm keys set openai
# Paste your OpenAI API key here

llm -f video-frames:video.mp4 \
  'describe the key scenes in this video' \
  -m gpt-4.1-mini

The video-frames:filepath.mp4 syntax is provided by the new plugin. It takes that video, converts it to a JPEG for every second of the video and then turns those into attachments that can be passed to the LLM.

Here's a demo, against this video of Cleo:

llm -f video-frames:cleo.mp4 'describe key moments' -m gpt-4.1-mini

And the output from the model (transcript here):

The sequence of images captures the key moments of a dog being offered and then enjoying a small treat:

In the first image, a hand is holding a small cupcake with purple frosting close to a black dog's face. The dog looks eager and is focused intently on the treat.

The second image shows the dog beginning to take a bite of the cupcake from the person's fingers. The dog's mouth is open, gently nibbling on the treat.

In the third image, the dog has finished or is almost done with the treat and looks content, with a slight smile and a relaxed posture. The treat is no longer in the person's hand, indicating that the dog has consumed it.

This progression highlights the dog's anticipation, enjoyment, and satisfaction with the treat.

Total cost: 7,072 input tokens, 156 output tokens - for GPT-4.1 mini that's 0.3078 cents (less than a third of a cent).

In this case the plugin converted the video into three images: frame_00001.jpg, frame_00002.jpg and frame_00003.jpg.

The plugin accepts additional arguments. You can increase the frames-per-second using ?fps=2 - for example:

llm -f 'video-frames:video.mp4?fps=2' 'summarize this video'

Or you can add ?timestamps=1 to cause ffmpeg to overlay a timestamp in the bottom right corner of each frame. This gives the model a chance to return timestamps in its output.

Let's try that with the Cleo video:

llm -f 'video-frames:cleo.mp4?timestamps=1&fps=5' \
  'key moments, include timestamps' -m gpt-4.1-mini

Here's the output (transcript here):

Here are the key moments from the video "cleo.mp4" with timestamps:

00:00:00.000 - A dog on leash looks at a small cupcake with purple frosting being held by a person.

00:00:00.800 - The dog closely sniffs the cupcake.

00:00:01.400 - The person puts a small amount of the cupcake frosting on their finger.

00:00:01.600 - The dog starts licking the frosting from the person's finger.

00:00:02.600 - The dog continues licking enthusiastically.

Let me know if you need more details or a description of any specific part.

That one sent 14 images to the API, at a total cost of 32,968 input, 141 output = 1.3413 cents.

It sent 5.9MB of image data to OpenAI's API, encoded as base64 in the JSON API call.

The GPT-4.1 model family accepts up to 1,047,576 tokens. Aside from a 20MB size limit per image I haven't seen any documentation of limits on the number of images. You can fit a whole lot of JPEG frames in a million tokens!

Here's what one of those frames looks like with the timestamp overlaid in the corner:

How I built the plugin with o4-mini

This is a great example of how rapid prototyping with an LLM can help demonstrate the value of a feature.

I was considering whether it would make sense for fragment plugins to return images in issue 972 when I had the idea to use ffmpeg to split a video into frames.

I know from past experience that a good model can write an entire plugin for LLM if you feed it the right example, so I started with this (reformatted here for readability):

llm -m o4-mini -f github:simonw/llm-hacker-news -s 'write a new plugin called llm_video_frames.py which takes video:path-to-video.mp4 and creates a temporary directory which it then populates with one frame per second of that video using ffmpeg - then it returns a list of [llm.Attachment(path="path-to-frame1.jpg"), ...] - it should also support passing video:video.mp4?fps=2 to increase to two frames per second, and if you pass ?timestamps=1 or &timestamps=1 then it should add a text timestamp to the bottom right conner of each image with the mm:ss timestamp of that frame (or hh:mm:ss if more than one hour in) and the filename of the video without the path as well.' -o reasoning_effort high

Here's the transcript.

The new attachment mechanism went from vague idea to "I should build that" as a direct result of having an LLM-built proof-of-concept that demonstrated the feasibility of the new feature.

The code it produced was about 90% of the code I shipped in the finished plugin. Total cost 5,018 input, 2,208 output = 1.5235 cents.

Annotated release notes for everything else in LLM 0.25

Here are the annotated release notes for everything else:

New plugin feature: register_fragment_loaders(register) plugins can now return a mixture of fragments and attachments. The llm-video-frames plugin is the first to take advantage of this mechanism. #972

As decsribed above. The inspiration for this feature came from the llm-arxiv plugin by agustif.

New OpenAI models: gpt-4.1, gpt-4.1-mini, gpt-41-nano, o3, o4-mini. #945, #965, #976.

My original plan was to leave these models exclusively to the new llm-openai plugin, since that allows me to add support for new models without a full LLM release. I'm going to punt on that until I'm ready to entirely remove the OpenAI models from LLM core.

New environment variables: LLM_MODEL and LLM_EMBEDDING_MODEL for setting the model to use without needing to specify -m model_id every time. #932

A convenience feature for if you want to set the default model for a terminal session with LLM without using the global default model" mechanism.

New command: llm fragments loaders, to list all currently available fragment loader prefixes provided by plugins. #941

Mainly for consistence with the existing llm templates loaders command. Here's the output when I run llm fragments loaders on my machine:

docs:
  Fetch the latest documentation for the specified package from
  https://github.com/simonw/docs-for-llms

  Use '-f docs:' for the documentation of your current version of LLM.

docs-preview:
  Similar to docs: but fetches the latest docs including alpha/beta releases.

symbex:
  Walk the given directory, parse every .py file, and for every
  top-level function or class-method produce its signature and
  docstring plus an import line.

github:
  Load files from a GitHub repository as fragments

  Argument is a GitHub repository URL or username/repository

issue:
  Fetch GitHub issue/pull and comments as Markdown

  Argument is either "owner/repo/NUMBER" or URL to an issue

pr:
  Fetch GitHub pull request with comments and diff as Markdown

  Argument is either "owner/repo/NUMBER" or URL to a pull request

hn:
  Given a Hacker News article ID returns the full nested conversation.

  For example: -f hn:43875136

video-frames:
  Fragment loader "video-frames:<path>?fps=N&timestamps=1"
  - extracts frames at `fps` per second (default 1)
  - if `timestamps=1`, overlays "filename hh:mm:ss" at bottom-right

That's from llm-docs, llm-fragments-symbex, llm-fragments-github, llm-hacker-news and llm-video-frames.

llm fragments command now shows fragments ordered by the date they were first used. #973

This makes it easier to quickly debug a new fragment plugin - you can run llm fragments and glance at the bottom few entries.

I've also been using the new llm-echo debugging plugin for this - it adds a new fake model called "echo" which simply outputs whatever the prompt, system prompt, fragments and attachments are that were passed to the model:

llm -f docs:sqlite-utils -m echo 'Show me the context'

Output here.

llm chat now includes a !edit command for editing a prompt using your default terminal text editor. Thanks, Benedikt Willi. #969

This is a really nice enhancement to llm chat, making it much more convenient to edit longe prompts.

And the rest:

Allow -t and --system to be used at the same time. #916

Fixed a bug where accessing a model via its alias would fail to respect any default options set for that model. #968

Improved documentation for extra-openai-models.yaml. Thanks, Rahim Nathwani and Dan Guido. #950, #957

llm -c/--continue now works correctly with the -d/--database option. llm chat now accepts that -d/--database option. Thanks, Sukhbinder Singh. #933

Tags: cli, ffmpeg, plugins, projects, ai, generative-ai, llms, ai-assisted-programming, llm, vision-llms

QuickTime video script to capture frames and bounding boxes

2024-11-14T19:00:54+00:00

QuickTime video script to capture frames and bounding boxes

An update to an older TIL. I'm working on the write-up for my DjangoCon US talk on plugins and I found myself wanting to capture individual frames from the video in two formats: a full frame capture, and another that captured just the portion of the screen shared from my laptop.

I have a script for the former, so I got Claude to update my script to add support for one or more --box options, like this:

capture-bbox.sh ../output.mp4  --box '31,17,100,87' --box '0,0,50,50'

Open output.mp4 in QuickTime Player, run that script and then every time you hit a key in the terminal app it will capture three JPEGs from the current position in QuickTime Player - one for the whole screen and one each for the specified bounding box regions.

Those bounding box regions are percentages of the width and height of the image. I also got Claude to build me this interactive tool on top of cropperjs to help figure out those boxes:

Tags: ffmpeg, projects, tools, ai, generative-ai, llms, ai-assisted-programming, claude, claude-artifacts, prompt-to-app

llamafile v0.8.13 (and whisperfile)

2024-08-19T20:08:59+00:00

llamafile v0.8.13 (and whisperfile)

The latest release of llamafile (previously) adds support for Gemma 2B (pre-bundled llamafiles available here), significant performance improvements and new support for the Whisper speech-to-text model, based on whisper.cpp, Georgi Gerganov's C++ implementation of Whisper that pre-dates his work on llama.cpp.

I got whisperfile working locally by first downloading the cross-platform executable attached to the GitHub release and then grabbing a whisper-tiny.en-q5_1.bin model from Hugging Face:

wget -O whisper-tiny.en-q5_1.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-tiny.en-q5_1.bin

Then I ran chmod 755 whisperfile-0.8.13 and then executed it against an example .wav file like this:

./whisperfile-0.8.13 -m whisper-tiny.en-q5_1.bin -f raven_poe_64kb.wav --no-prints

The --no-prints option suppresses the debug output, so you just get text that looks like this:

[00:00:00.000 --> 00:00:12.000]   This is a LibraVox recording. All LibraVox recordings are in the public domain. For more information please visit LibraVox.org.
[00:00:12.000 --> 00:00:20.000]   Today's reading The Raven by Edgar Allan Poe, read by Chris Scurringe.
[00:00:20.000 --> 00:00:40.000]   Once upon a midnight dreary, while I pondered weak and weary, over many a quaint and curious volume of forgotten lore. While I nodded nearly napping, suddenly there came a tapping as of someone gently rapping, rapping at my chamber door.

There are quite a few undocumented options - to write out JSON to a file called transcript.json (example output):

./whisperfile-0.8.13 -m whisper-tiny.en-q5_1.bin -f /tmp/raven_poe_64kb.wav --no-prints --output-json --output-file transcript

I had to convert my own audio recordings to 16kHz .wav files in order to use them with whisperfile. I used ffmpeg to do this:

ffmpeg -i runthrough-26-oct-2023.wav -ar 16000 /tmp/out.wav

Then I could transcribe that like so:

./whisperfile-0.8.13 -m whisper-tiny.en-q5_1.bin -f /tmp/out.wav --no-prints

Update: Justine says:

I've just uploaded new whisperfiles to Hugging Face which use miniaudio.h to automatically resample and convert your mp3/ogg/flac/wav files to the appropriate format.

With that whisper-tiny model this took just 11s to transcribe a 10m41s audio file!

I also tried the much larger Whisper Medium model - I chose to use the 539MB ggml-medium-q5_0.bin quantized version of that from huggingface.co/ggerganov/whisper.cpp:

./whisperfile-0.8.13 -m ggml-medium-q5_0.bin -f out.wav --no-prints

This time it took 1m49s, using 761% of CPU according to Activity Monitor.

I tried adding --gpu auto to exercise the GPU on my M2 Max MacBook Pro:

./whisperfile-0.8.13 -m ggml-medium-q5_0.bin -f out.wav --no-prints --gpu auto

That used just 16.9% of CPU and 93% of GPU according to Activity Monitor, and finished in 1m08s.

I tried this with the tiny model too but the performance difference there was imperceptible.

Via @JustineTunney

Tags: ffmpeg, ai, whisper, local-llms, llamafile, justine-tunney, speech-to-text, georgi-gerganov

Tracking Fireworks Impact on Fourth of July AQI

2024-07-05T22:52:51+00:00

Tracking Fireworks Impact on Fourth of July AQI

Danny Page ran shot-scraper once per minute (using cron) against this Purple Air map of the Bay Area and turned the captured screenshots into an animation using ffmpeg. The result shows the impact of 4th of July fireworks on air quality between 7pm and 7am.

Via @DannyPage

Tags: ffmpeg, shot-scraper

Announcing the Ladybird Browser Initiative

2024-07-01T16:08:42+00:00

Announcing the Ladybird Browser Initiative

Andreas Kling's Ladybird is a really exciting project: a from-scratch implementation of a web browser, initially built as part of the Serenity OS project, which aims to provide a completely independent, open source and fully standards compliant browser.

Last month Andreas forked Ladybird away from Serenity, recognizing that the potential impact of the browser project on its own was greater than as a component of that project. Crucially, Serenity OS avoids any outside code - splitting out Ladybird allows Ladybird to add dependencies like libjpeg and ffmpeg. The Ladybird June update video talks through some of the dependencies they've been able to add since making that decision.

The new Ladybird Browser Initiative puts some financial weight behind the project: it's a US 501(c)(3) non-profit initially funded with $1m from GitHub co-founder Chris Chris Wanstrath. The money is going on engineers: Andreas says:

We are 4 full-time engineers today, and we'll be adding another 3 in the near future

Here's a 2m28s video from Chris introducing the new foundation and talking about why this project is worth supporting.

Via @ladybirdbrowser

Tags: browsers, ffmpeg, open-source, andreas-kling, ladybird

Mass Video Conversion Using AWS

2007-04-03T23:44:37+00:00

Mass Video Conversion Using AWS

How to use S3, SQS, EC2, ffmpeg and some Python to bulk convert videos with Amazon Web Services.

Tags: amazon, aws, ec2, ffmpeg, python, s3, sqs