<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: ffmpeg</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/ffmpeg.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-05-05T17:38:25+00:00</updated><author><name>Simon Willison</name></author><entry><title>Feed a video to a vision LLM as a sequence of JPEG frames on the CLI (also LLM 0.25)</title><link href="https://simonwillison.net/2025/May/5/llm-video-frames/#atom-tag" rel="alternate"/><published>2025-05-05T17:38:25+00:00</published><updated>2025-05-05T17:38:25+00:00</updated><id>https://simonwillison.net/2025/May/5/llm-video-frames/#atom-tag</id><summary type="html">
    &lt;p&gt;The new &lt;strong&gt;&lt;a href="https://github.com/simonw/llm-video-frames"&gt;llm-video-frames&lt;/a&gt;&lt;/strong&gt; plugin can turn a video file into a sequence of JPEG frames and feed them directly into a long context vision LLM such as GPT-4.1, even when that LLM doesn't directly support video input. It depends on a plugin feature I added to &lt;a href="https://llm.datasette.io/en/stable/changelog.html#v0-25"&gt;LLM 0.25&lt;/a&gt;, which I released last night.&lt;/p&gt;
&lt;p&gt;Here's how to try it out:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;brew install ffmpeg &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; or apt-get or your package manager of choice&lt;/span&gt;
uv tool install llm &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; or pipx install llm or pip install llm&lt;/span&gt;
llm install llm-video-frames
llm keys &lt;span class="pl-c1"&gt;set&lt;/span&gt; openai
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Paste your OpenAI API key here&lt;/span&gt;

llm -f video-frames:video.mp4 \
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;describe the key scenes in this video&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
  -m gpt-4.1-mini&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The &lt;code&gt;video-frames:filepath.mp4&lt;/code&gt; syntax is provided by the new plugin. It takes that video, extracts one JPEG frame for every second of footage and then turns those frames into &lt;a href="https://llm.datasette.io/en/stable/usage.html#attachments"&gt;attachments&lt;/a&gt; that can be passed to the LLM.&lt;/p&gt;
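&lt;p&gt;The frame extraction step might look roughly like the sketch below. This is an illustration, not the plugin's actual code: the function names &lt;code&gt;build_ffmpeg_command&lt;/code&gt; and &lt;code&gt;extract_frames&lt;/code&gt; are hypothetical, and it assumes &lt;code&gt;ffmpeg&lt;/code&gt; is on your path and that its &lt;code&gt;fps&lt;/code&gt; video filter is used to sample the frames.&lt;/p&gt;

```python
import os
import subprocess
import tempfile


def build_ffmpeg_command(video_path, out_dir, fps=1):
    """Return an ffmpeg argument list that writes `fps` JPEG frames per second."""
    # %05d is expanded by ffmpeg into a zero-padded frame counter
    pattern = os.path.join(out_dir, "frame_%05d.jpg")
    return ["ffmpeg", "-i", video_path, "-vf", f"fps={fps}", pattern]


def extract_frames(video_path, fps=1):
    """Run ffmpeg into a fresh temporary directory and return the frame paths."""
    out_dir = tempfile.mkdtemp(prefix="video-frames-")
    subprocess.run(build_ffmpeg_command(video_path, out_dir, fps), check=True)
    return sorted(os.path.join(out_dir, name) for name in os.listdir(out_dir))
```

&lt;p&gt;Each returned path could then be wrapped in an &lt;code&gt;llm.Attachment(path=...)&lt;/code&gt; object before being handed to the model.&lt;/p&gt;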
&lt;p&gt;Here's a demo, against &lt;a href="https://static.simonwillison.net/static/2025/cleo.mp4"&gt;this video&lt;/a&gt; of Cleo:&lt;/p&gt;

&lt;div style="max-width: 100%; margin-bottom: 0.4em"&gt;
    &lt;video controls="controls" preload="none" aria-label="Cleo " poster="https://static.simonwillison.net/static/2025/cleo-still.jpg" loop="loop" style="width: 100%; height: auto;" muted="muted"&gt;
        &lt;source src="https://static.simonwillison.net/static/2025/cleo.mp4" type="video/mp4" /&gt;
    &lt;/video&gt;
&lt;/div&gt;

&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -f video-frames:cleo.mp4 &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;describe key moments&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; -m gpt-4.1-mini&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And the output from the model (&lt;a href="https://gist.github.com/simonw/a4e26166a524f9c07b4cf32b2f8da6c9"&gt;transcript here&lt;/a&gt;):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The sequence of images captures the key moments of a dog being offered and then enjoying a small treat:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;In the first image, a hand is holding a small cupcake with purple frosting close to a black dog's face. The dog looks eager and is focused intently on the treat.&lt;/li&gt;
&lt;li&gt;The second image shows the dog beginning to take a bite of the cupcake from the person's fingers. The dog's mouth is open, gently nibbling on the treat.&lt;/li&gt;
&lt;li&gt;In the third image, the dog has finished or is almost done with the treat and looks content, with a slight smile and a relaxed posture. The treat is no longer in the person's hand, indicating that the dog has consumed it.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This progression highlights the dog's anticipation, enjoyment, and satisfaction with the treat.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Total cost: 7,072 input tokens, 156 output tokens - for GPT-4.1 mini that's 0.3078 cents (less than a third of a cent).&lt;/p&gt;
&lt;p&gt;In this case the plugin converted the video into three images: &lt;a href="https://static.simonwillison.net/static/2025/frame_00001.jpg"&gt;frame_00001.jpg&lt;/a&gt;, &lt;a href="https://static.simonwillison.net/static/2025/frame_00002.jpg"&gt;frame_00002.jpg&lt;/a&gt; and &lt;a href="https://static.simonwillison.net/static/2025/frame_00003.jpg"&gt;frame_00003.jpg&lt;/a&gt;.&lt;/p&gt;
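&lt;p&gt;The cost figure above can be reproduced with a little arithmetic. The sketch below assumes GPT-4.1 mini list prices of $0.40 per million input tokens and $1.60 per million output tokens. Those prices are not stated in this post and may change, so treat them as assumptions. The helper name &lt;code&gt;cost_in_cents&lt;/code&gt; is made up for illustration.&lt;/p&gt;

```python
def cost_in_cents(input_tokens, output_tokens, input_price, output_price):
    """Prices are US dollars per million tokens; the result is in cents."""
    dollars = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return dollars * 100


# Assumed GPT-4.1 mini prices (dollars per million tokens), not taken from this post
GPT_41_MINI_INPUT = 0.40
GPT_41_MINI_OUTPUT = 1.60

# 7,072 input tokens and 156 output tokens, as reported for the three-frame run
print(round(cost_in_cents(7072, 156, GPT_41_MINI_INPUT, GPT_41_MINI_OUTPUT), 4))
```

&lt;p&gt;Under those assumed prices this prints 0.3078, matching the number reported for the three-frame run.&lt;/p&gt;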
&lt;p&gt;The plugin accepts additional arguments. You can increase the frames-per-second using &lt;code&gt;?fps=2&lt;/code&gt; - for example:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -f &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;video-frames:video.mp4?fps=2&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;summarize this video&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Or you can add &lt;code&gt;?timestamps=1&lt;/code&gt; to cause &lt;code&gt;ffmpeg&lt;/code&gt; to overlay a timestamp in the bottom right corner of each frame. This gives the model a chance to return timestamps in its output.&lt;/p&gt;
&lt;p&gt;Let's try that with the Cleo video:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -f &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;video-frames:cleo.mp4?timestamps=1&amp;amp;fps=5&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;key moments, include timestamps&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; -m gpt-4.1-mini&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here's the output (&lt;a href="https://gist.github.com/simonw/371719849a7d2260f1eb0f422ea63187"&gt;transcript here&lt;/a&gt;):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Here are the key moments from the video "cleo.mp4" with timestamps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;00:00:00.000 - A dog on leash looks at a small cupcake with purple frosting being held by a person.&lt;/li&gt;
&lt;li&gt;00:00:00.800 - The dog closely sniffs the cupcake.&lt;/li&gt;
&lt;li&gt;00:00:01.400 - The person puts a small amount of the cupcake frosting on their finger.&lt;/li&gt;
&lt;li&gt;00:00:01.600 - The dog starts licking the frosting from the person's finger.&lt;/li&gt;
&lt;li&gt;00:00:02.600 - The dog continues licking enthusiastically.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Let me know if you need more details or a description of any specific part.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That one sent 14 images to the API, at a total cost of 32,968 input, 141 output = 1.3413 cents.&lt;/p&gt;
&lt;p&gt;It sent 5.9MB of image data to OpenAI's API, encoded as base64 in the JSON API call.&lt;/p&gt;
&lt;p&gt;The GPT-4.1 model family accepts up to 1,047,576 tokens. Aside from a 20MB size limit per image I haven't seen any documentation of limits on the number of images. You can fit a whole lot of JPEG frames in a million tokens!&lt;/p&gt;
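&lt;p&gt;For a rough sense of scale, here is a back-of-the-envelope estimate. It assumes every frame costs about as many tokens as the frames in the 14-image run did, and it ignores the tokens used by the prompt text itself. Frames of a different resolution may well be counted differently. The function name is hypothetical.&lt;/p&gt;

```python
def estimate_max_frames(context_tokens, sample_tokens, sample_frames):
    """Scale an observed tokens-per-frame ratio up to the full context window."""
    # Integer arithmetic for context_tokens / (sample_tokens / sample_frames), rounded down
    return context_tokens * sample_frames // sample_tokens


# 32,968 input tokens for 14 frames, against a 1,047,576 token limit
print(estimate_max_frames(1_047_576, 32_968, 14))
```

&lt;p&gt;That works out to around 444 frames, or a little over seven minutes of video at the default one frame per second.&lt;/p&gt;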
&lt;p&gt;Here's what one of those frames looks like with the timestamp overlaid in the corner:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/cleo-finger.jpg" alt="Cleo taking a treat from my fingers, in the bottom right corner is an overlay that says cleo.mp4 00:00:01.600" style="max-width: 100%;" /&gt;&lt;/p&gt;
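&lt;p&gt;The &lt;code&gt;fps&lt;/code&gt; and &lt;code&gt;timestamps&lt;/code&gt; options travel as a query string on the end of the file path, so a loader needs to split them apart. Here is one way that parsing could be done with the standard library. It is a sketch under assumptions rather than the plugin's real implementation, and &lt;code&gt;parse_loader_argument&lt;/code&gt; is a hypothetical name.&lt;/p&gt;

```python
from urllib.parse import parse_qs, urlencode


def parse_loader_argument(argument):
    """Split 'path?key=value...' into a (path, fps, timestamps) tuple."""
    path, _, query = argument.partition("?")
    params = parse_qs(query)
    # Fall back to one frame per second and no overlay when an option is absent
    fps = float(params.get("fps", ["1"])[0])
    timestamps = params.get("timestamps", ["0"])[0] == "1"
    return path, fps, timestamps


# urlencode joins the two options with an ampersand, the same form the CLI accepts
argument = "cleo.mp4?" + urlencode({"timestamps": "1", "fps": "5"})
print(parse_loader_argument(argument))
```

&lt;p&gt;Returning plain values keeps the parsing separate from the &lt;code&gt;ffmpeg&lt;/code&gt; call, which makes both halves easier to test on their own.&lt;/p&gt;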
&lt;h4 id="how-i-built-the-plugin-with-o4-mini"&gt;How I built the plugin with o4-mini&lt;/h4&gt;
&lt;p&gt;This is a great example of how rapid prototyping with an LLM can help demonstrate the value of a feature.&lt;/p&gt;
&lt;p&gt;In &lt;a href="https://github.com/simonw/llm/issues/972#issuecomment-2849342103"&gt;issue 972&lt;/a&gt; I was considering whether it would make sense for fragment plugins to return images when I had the idea to use &lt;code&gt;ffmpeg&lt;/code&gt; to split a video into frames.&lt;/p&gt;
&lt;p&gt;I know &lt;a href="https://simonwillison.net/2025/Apr/23/llm-fragment-symbex/"&gt;from past experience&lt;/a&gt; that a good model can write an entire plugin for LLM if you feed it the right example, so I started with this (reformatted here for readability):&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m o4-mini -f github:simonw/llm-hacker-news -s &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;write a new plugin called llm_video_frames.py which takes video:path-to-video.mp4 and creates a temporary directory which it then populates with one frame per second of that video using ffmpeg - then it returns a list of [llm.Attachment(path="path-to-frame1.jpg"), ...] - it should also support passing video:video.mp4?fps=2 to increase to two frames per second, and if you pass ?timestamps=1 or &amp;amp;timestamps=1 then it should add a text timestamp to the bottom right conner of each image with the mm:ss timestamp of that frame (or hh:mm:ss if more than one hour in) and the filename of the video without the path as well.&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; -o reasoning_effort high&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/4f545ecb347884d1d923dbc49550b8b0#response"&gt;the transcript&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The new attachment mechanism went from vague idea to "I should build that" as a direct result of having an LLM-built proof-of-concept that demonstrated the feasibility of the new feature.&lt;/p&gt;
&lt;p&gt;The code it produced was about 90% of the code I shipped in the finished plugin. Total cost 5,018 input, 2,208 output = 1.5235 cents.&lt;/p&gt;
&lt;h4 id="annotated-release-notes-for-everything-else-in-llm-0-25"&gt;Annotated release notes for everything else in LLM 0.25&lt;/h4&gt;
&lt;p&gt;Here are the annotated release notes for everything else:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;New plugin feature: &lt;a href="https://llm.datasette.io/en/stable/plugins/plugin-hooks.html#plugin-hooks-register-fragment-loaders"&gt;register_fragment_loaders(register)&lt;/a&gt; plugins can now return a mixture of fragments and attachments. The &lt;a href="https://github.com/simonw/llm-video-frames"&gt;llm-video-frames&lt;/a&gt; plugin is the first to take advantage of this mechanism. &lt;a href="https://github.com/simonw/llm/issues/972"&gt;#972&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;As described above. The inspiration for this feature came from the &lt;a href="https://github.com/agustif/llm-arxiv"&gt;llm-arxiv&lt;/a&gt; plugin by &lt;a href="https://github.com/agustif"&gt;agustif&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;New OpenAI models: &lt;code&gt;gpt-4.1&lt;/code&gt;, &lt;code&gt;gpt-4.1-mini&lt;/code&gt;, &lt;code&gt;gpt-4.1-nano&lt;/code&gt;, &lt;code&gt;o3&lt;/code&gt;, &lt;code&gt;o4-mini&lt;/code&gt;. &lt;a href="https://github.com/simonw/llm/issues/945"&gt;#945&lt;/a&gt;, &lt;a href="https://github.com/simonw/llm/issues/965"&gt;#965&lt;/a&gt;, &lt;a href="https://github.com/simonw/llm/issues/976"&gt;#976&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;My original plan was to leave these models exclusively to the new &lt;a href="https://github.com/simonw/llm-openai-plugin"&gt;llm-openai&lt;/a&gt; plugin, since that allows me to add support for new models without a full LLM release. I'm going to punt on that until I'm ready to entirely remove the OpenAI models from LLM core.&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;New environment variables: &lt;code&gt;LLM_MODEL&lt;/code&gt; and &lt;code&gt;LLM_EMBEDDING_MODEL&lt;/code&gt; for setting the model to use without needing to specify &lt;code&gt;-m model_id&lt;/code&gt; every time. &lt;a href="https://github.com/simonw/llm/issues/932"&gt;#932&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;A convenience feature for when you want to set the default model for a single terminal session without using the global &lt;a href="https://llm.datasette.io/en/stable/setup.html#setting-a-custom-default-model"&gt;default model mechanism&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;New command: &lt;code&gt;llm fragments loaders&lt;/code&gt;, to list all currently available fragment loader prefixes provided by plugins. &lt;a href="https://github.com/simonw/llm/issues/941"&gt;#941&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Mainly for consistency with the existing &lt;a href="https://llm.datasette.io/en/stable/help.html#llm-templates-loaders-help"&gt;llm templates loaders&lt;/a&gt; command. Here's the output when I run &lt;code&gt;llm fragments loaders&lt;/code&gt; on my machine:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;docs:
  Fetch the latest documentation for the specified package from
  https://github.com/simonw/docs-for-llms

  Use '-f docs:' for the documentation of your current version of LLM.

docs-preview:
  Similar to docs: but fetches the latest docs including alpha/beta releases.

symbex:
  Walk the given directory, parse every .py file, and for every
  top-level function or class-method produce its signature and
  docstring plus an import line.

github:
  Load files from a GitHub repository as fragments

  Argument is a GitHub repository URL or username/repository

issue:
  Fetch GitHub issue/pull and comments as Markdown

  Argument is either "owner/repo/NUMBER" or URL to an issue

pr:
  Fetch GitHub pull request with comments and diff as Markdown

  Argument is either "owner/repo/NUMBER" or URL to a pull request

hn:
  Given a Hacker News article ID returns the full nested conversation.

  For example: -f hn:43875136

video-frames:
  Fragment loader "video-frames:&amp;lt;path&amp;gt;?fps=N&amp;amp;timestamps=1"
  - extracts frames at `fps` per second (default 1)
  - if `timestamps=1`, overlays "filename hh:mm:ss" at bottom-right
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That's from &lt;a href="https://github.com/simonw/llm-docs"&gt;llm-docs&lt;/a&gt;, &lt;a href="https://github.com/simonw/llm-fragments-symbex"&gt;llm-fragments-symbex&lt;/a&gt;, &lt;a href="https://github.com/simonw/llm-fragments-github"&gt;llm-fragments-github&lt;/a&gt;, &lt;a href="https://github.com/simonw/llm-hacker-news"&gt;llm-hacker-news&lt;/a&gt; and &lt;a href="https://github.com/simonw/llm-video-frames"&gt;llm-video-frames&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;llm fragments&lt;/code&gt; command now shows fragments ordered by the date they were first used. &lt;a href="https://github.com/simonw/llm/issues/973"&gt;#973&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;This makes it easier to quickly debug a new fragment plugin - you can run &lt;code&gt;llm fragments&lt;/code&gt; and glance at the bottom few entries.&lt;/p&gt;
&lt;p&gt;I've also been using the new &lt;a href="https://github.com/simonw/llm-echo"&gt;llm-echo&lt;/a&gt; debugging plugin for this - it adds a new fake model called "echo" which simply outputs the prompt, system prompt, fragments and attachments that were passed to the model:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -f docs:sqlite-utils -m &lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Show me the context&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;a href="https://gist.github.com/simonw/cb3249856887379759515022c76d0d9e"&gt;Output here&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;llm chat&lt;/code&gt; now includes a &lt;code&gt;!edit&lt;/code&gt; command for editing a prompt using your default terminal text editor. Thanks, &lt;a href="https://github.com/Hopiu"&gt;Benedikt Willi&lt;/a&gt;. &lt;a href="https://github.com/simonw/llm/pull/969"&gt;#969&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is a really nice enhancement to &lt;code&gt;llm chat&lt;/code&gt;, making it much more convenient to edit longer prompts.&lt;/p&gt;
&lt;p&gt;And the rest:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Allow &lt;code&gt;-t&lt;/code&gt; and &lt;code&gt;--system&lt;/code&gt; to be used at the same time. &lt;a href="https://github.com/simonw/llm/issues/916"&gt;#916&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Fixed a bug where accessing a model via its alias would fail to respect any default options set for that model. &lt;a href="https://github.com/simonw/llm/issues/968"&gt;#968&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Improved documentation for &lt;a href="https://llm.datasette.io/en/stable/other-models.html#openai-compatible-models"&gt;extra-openai-models.yaml&lt;/a&gt;. Thanks, &lt;a href="https://github.com/rahimnathwani"&gt;Rahim Nathwani&lt;/a&gt; and &lt;a href="https://github.com/dguido"&gt;Dan Guido&lt;/a&gt;. &lt;a href="https://github.com/simonw/llm/pull/950"&gt;#950&lt;/a&gt;, &lt;a href="https://github.com/simonw/llm/pull/957"&gt;#957&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;llm -c/--continue&lt;/code&gt; now works correctly with the &lt;code&gt;-d/--database&lt;/code&gt; option. &lt;code&gt;llm chat&lt;/code&gt; now accepts that &lt;code&gt;-d/--database&lt;/code&gt; option. Thanks, &lt;a href="https://github.com/sukhbinder"&gt;Sukhbinder Singh&lt;/a&gt;. &lt;a href="https://github.com/simonw/llm/issues/933"&gt;#933&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ffmpeg"&gt;ffmpeg&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/plugins"&gt;plugins&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="cli"/><category term="ffmpeg"/><category term="plugins"/><category term="projects"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="llm"/><category term="vision-llms"/></entry><entry><title>QuickTime video script to capture frames and bounding boxes</title><link href="https://simonwillison.net/2024/Nov/14/capture-frames-and-bounding-boxes/#atom-tag" rel="alternate"/><published>2024-11-14T19:00:54+00:00</published><updated>2024-11-14T19:00:54+00:00</updated><id>https://simonwillison.net/2024/Nov/14/capture-frames-and-bounding-boxes/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://til.simonwillison.net/macos/quicktime-capture-script#user-content-a-version-that-captures-bounding-box-regions-too"&gt;QuickTime video script to capture frames and bounding boxes&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;An update to an older TIL. I'm working on the write-up for my DjangoCon US talk on plugins and I found myself wanting to capture individual frames from the video in two formats: a full frame capture, and another that captured just the portion of the screen shared from my laptop.&lt;/p&gt;
&lt;p&gt;I have a script for the former, so I &lt;a href="https://gist.github.com/simonw/799babf92e1eaf36a5336b4889f72492"&gt;got Claude&lt;/a&gt; to update my script to add support for one or more &lt;code&gt;--box&lt;/code&gt; options, like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;capture-bbox.sh ../output.mp4  --box '31,17,100,87' --box '0,0,50,50'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Open &lt;code&gt;output.mp4&lt;/code&gt; in QuickTime Player, run that script and then every time you hit a key in the terminal app it will capture three JPEGs from the current position in QuickTime Player - one for the whole screen and one each for the specified bounding box regions.&lt;/p&gt;
&lt;p&gt;Those bounding box regions are percentages of the width and height of the image. I also got Claude to build me &lt;a href="https://tools.simonwillison.net/bbox-cropper"&gt;this interactive tool&lt;/a&gt; on top of &lt;a href="https://github.com/fengyuanchen/cropperjs"&gt;cropperjs&lt;/a&gt; to help figure out those boxes:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of the tool. A frame from a video of a talk I gave at DjangoCon US is shown, with a crop region on it using drag handles for the different edges of the crop. Below that is a box showing --bbox '31,17,99,86'" src="https://static.simonwillison.net/static/2024/bbox-tool.jpg" /&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ffmpeg"&gt;ffmpeg&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tools"&gt;tools&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-artifacts"&gt;claude-artifacts&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-to-app"&gt;prompt-to-app&lt;/a&gt;&lt;/p&gt;



</summary><category term="ffmpeg"/><category term="projects"/><category term="tools"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="claude"/><category term="claude-artifacts"/><category term="prompt-to-app"/></entry><entry><title>llamafile v0.8.13 (and whisperfile)</title><link href="https://simonwillison.net/2024/Aug/19/whisperfile/#atom-tag" rel="alternate"/><published>2024-08-19T20:08:59+00:00</published><updated>2024-08-19T20:08:59+00:00</updated><id>https://simonwillison.net/2024/Aug/19/whisperfile/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/Mozilla-Ocho/llamafile/releases/tag/0.8.13"&gt;llamafile v0.8.13 (and whisperfile)&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The latest release of &lt;a href="https://github.com/Mozilla-Ocho/llamafile"&gt;llamafile&lt;/a&gt; (&lt;a href="https://simonwillison.net/2023/Nov/29/llamafile/"&gt;previously&lt;/a&gt;) adds support for &lt;a href="https://blog.google/technology/developers/gemma-open-models/"&gt;Gemma 2B&lt;/a&gt; (pre-bundled &lt;a href="https://huggingface.co/jartine/gemma-2-27b-it-llamafile/tree/main"&gt;llamafiles available here&lt;/a&gt;), significant performance improvements and new support for the Whisper speech-to-text model, based on &lt;a href="https://github.com/ggerganov/whisper.cpp"&gt;whisper.cpp&lt;/a&gt;, Georgi Gerganov's C++ implementation of Whisper that pre-dates his work on &lt;code&gt;llama.cpp&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;I got &lt;code&gt;whisperfile&lt;/code&gt; working locally by first downloading the cross-platform executable attached to &lt;a href="https://github.com/Mozilla-Ocho/llamafile/releases/tag/0.8.13"&gt;the GitHub release&lt;/a&gt; and then grabbing a &lt;code&gt;whisper-tiny.en-q5_1.bin&lt;/code&gt; model from Hugging Face:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;wget -O whisper-tiny.en-q5_1.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-tiny.en-q5_1.bin
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then I ran &lt;code&gt;chmod 755 whisperfile-0.8.13&lt;/code&gt; and then executed it against an example &lt;code&gt;.wav&lt;/code&gt; file like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;./whisperfile-0.8.13 -m whisper-tiny.en-q5_1.bin -f raven_poe_64kb.wav --no-prints
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;--no-prints&lt;/code&gt; option suppresses the debug output, so you just get text that looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[00:00:00.000 --&amp;gt; 00:00:12.000]   This is a LibraVox recording. All LibraVox recordings are in the public domain. For more information please visit LibraVox.org.
[00:00:12.000 --&amp;gt; 00:00:20.000]   Today's reading The Raven by Edgar Allan Poe, read by Chris Scurringe.
[00:00:20.000 --&amp;gt; 00:00:40.000]   Once upon a midnight dreary, while I pondered weak and weary, over many a quaint and curious volume of forgotten lore. While I nodded nearly napping, suddenly there came a tapping as of someone gently rapping, rapping at my chamber door.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;There are quite a few &lt;a href="https://github.com/Mozilla-Ocho/llamafile/issues/544#issuecomment-2297368432"&gt;undocumented options&lt;/a&gt; - to write out JSON to a file called &lt;code&gt;transcript.json&lt;/code&gt; (&lt;a href="https://gist.github.com/simonw/39173ac94e71cb01b749f9256a9408c4"&gt;example output&lt;/a&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;./whisperfile-0.8.13 -m whisper-tiny.en-q5_1.bin -f /tmp/raven_poe_64kb.wav --no-prints --output-json --output-file transcript
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I had to convert my own audio recordings to 16kHz &lt;code&gt;.wav&lt;/code&gt; files in order to use them with &lt;code&gt;whisperfile&lt;/code&gt;. I used &lt;code&gt;ffmpeg&lt;/code&gt; to do this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ffmpeg -i runthrough-26-oct-2023.wav -ar 16000 /tmp/out.wav
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then I could transcribe that like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;./whisperfile-0.8.13 -m whisper-tiny.en-q5_1.bin -f /tmp/out.wav --no-prints
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: &lt;a href="https://twitter.com/JustineTunney/status/1825676741593149949"&gt;Justine says&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I've just uploaded new whisperfiles &lt;a href="https://huggingface.co/Mozilla/whisperfile"&gt;to Hugging Face&lt;/a&gt; which use miniaudio.h to automatically resample and convert your mp3/ogg/flac/wav files to the appropriate format.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;With that &lt;code&gt;whisper-tiny&lt;/code&gt; model this took just 11s to transcribe a 10m41s audio file!&lt;/p&gt;
&lt;p&gt;I also tried the much larger Whisper Medium model - I chose to use the 539MB &lt;code&gt;ggml-medium-q5_0.bin&lt;/code&gt; quantized version of that from &lt;a href="https://huggingface.co/ggerganov/whisper.cpp/tree/main"&gt;huggingface.co/ggerganov/whisper.cpp&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;./whisperfile-0.8.13 -m ggml-medium-q5_0.bin -f out.wav --no-prints
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This time it took 1m49s, using 761% of CPU according to Activity Monitor.&lt;/p&gt;
&lt;p&gt;I tried adding &lt;code&gt;--gpu auto&lt;/code&gt; to exercise the GPU on my M2 Max MacBook Pro:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;./whisperfile-0.8.13 -m ggml-medium-q5_0.bin -f out.wav --no-prints --gpu auto
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That used just 16.9% of CPU and 93% of GPU according to Activity Monitor, and finished in 1m08s. &lt;/p&gt;
&lt;p&gt;I tried this with the &lt;code&gt;tiny&lt;/code&gt; model too but the performance difference there was imperceptible.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/JustineTunney/status/1825551821857010143"&gt;@JustineTunney&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ffmpeg"&gt;ffmpeg&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/whisper"&gt;whisper&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llamafile"&gt;llamafile&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/justine-tunney"&gt;justine-tunney&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/speech-to-text"&gt;speech-to-text&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/georgi-gerganov"&gt;georgi-gerganov&lt;/a&gt;&lt;/p&gt;



</summary><category term="ffmpeg"/><category term="ai"/><category term="whisper"/><category term="local-llms"/><category term="llamafile"/><category term="justine-tunney"/><category term="speech-to-text"/><category term="georgi-gerganov"/></entry><entry><title>Tracking Fireworks Impact on Fourth of July AQI</title><link href="https://simonwillison.net/2024/Jul/5/tracking-fireworks/#atom-tag" rel="alternate"/><published>2024-07-05T22:52:51+00:00</published><updated>2024-07-05T22:52:51+00:00</updated><id>https://simonwillison.net/2024/Jul/5/tracking-fireworks/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://danny.page/views/tracking-fireworks-on-july-4th"&gt;Tracking Fireworks Impact on Fourth of July AQI&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Danny Page ran &lt;a href="https://shot-scraper.datasette.io/"&gt;shot-scraper&lt;/a&gt; once per minute (using cron) against &lt;a href="https://map.purpleair.com/1/mAQI/a10/p604800/cC0#8.45/37.764/-121.62"&gt;this Purple Air map&lt;/a&gt; of the Bay Area and turned the captured screenshots into an animation using &lt;code&gt;ffmpeg&lt;/code&gt;. The result shows the impact of 4th of July fireworks on air quality between 7pm and 7am.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/DannyPage/status/1809331303386329194"&gt;@DannyPage&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ffmpeg"&gt;ffmpeg&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/shot-scraper"&gt;shot-scraper&lt;/a&gt;&lt;/p&gt;



</summary><category term="ffmpeg"/><category term="shot-scraper"/></entry><entry><title>Announcing the Ladybird Browser Initiative</title><link href="https://simonwillison.net/2024/Jul/1/the-ladybird-browser-initiative/#atom-tag" rel="alternate"/><published>2024-07-01T16:08:42+00:00</published><updated>2024-07-01T16:08:42+00:00</updated><id>https://simonwillison.net/2024/Jul/1/the-ladybird-browser-initiative/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://ladybird.org/announcement.html"&gt;Announcing the Ladybird Browser Initiative&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Andreas Kling's &lt;a href="https://awesomekling.github.io/Ladybird-a-new-cross-platform-browser-project/"&gt;Ladybird&lt;/a&gt; is a really exciting project: a from-scratch implementation of a web browser, initially built as part of the Serenity OS project, which aims to provide a completely independent, open source and fully standards compliant browser.&lt;/p&gt;
&lt;p&gt;Last month Andreas &lt;a href="https://awesomekling.substack.com/p/forking-ladybird-and-stepping-down-serenityos"&gt;forked Ladybird away from Serenity&lt;/a&gt;, recognizing that the potential impact of the browser project on its own was greater than as a component of that project. Crucially, Serenity OS avoids &lt;em&gt;any&lt;/em&gt; outside code - splitting out Ladybird allows Ladybird to add dependencies like libjpeg and ffmpeg. The &lt;a href="https://www.youtube.com/watch?v=cbw0KrMGHvc"&gt;Ladybird June update&lt;/a&gt; video talks through some of the dependencies they've been able to add since making that decision.&lt;/p&gt;
&lt;p&gt;The new Ladybird Browser Initiative puts some financial weight behind the project: it's a US 501(c)(3) non-profit &lt;a href="https://twitter.com/defunkt/status/1807779408092234134"&gt;initially funded with $1m from GitHub co-founder Chris Wanstrath&lt;/a&gt;. The money is going on engineers: &lt;a href="https://twitter.com/awesomekling/status/1807804311696531575"&gt;Andreas says&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We are 4 full-time engineers today, and we'll be adding another 3 in the near future&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://www.youtube.com/watch?v=k9edTqPMX_k"&gt;a 2m28s video&lt;/a&gt; from Chris introducing the new foundation and talking about why this project is worth supporting.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/ladybirdbrowser/status/1807777318330339331"&gt;@ladybirdbrowser&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/browsers"&gt;browsers&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ffmpeg"&gt;ffmpeg&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/andreas-kling"&gt;andreas-kling&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ladybird"&gt;ladybird&lt;/a&gt;&lt;/p&gt;



</summary><category term="browsers"/><category term="ffmpeg"/><category term="open-source"/><category term="andreas-kling"/><category term="ladybird"/></entry><entry><title>Mass Video Conversion Using AWS</title><link href="https://simonwillison.net/2007/Apr/3/amazon/#atom-tag" rel="alternate"/><published>2007-04-03T23:44:37+00:00</published><updated>2007-04-03T23:44:37+00:00</updated><id>https://simonwillison.net/2007/Apr/3/amazon/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://developer.amazonwebservices.com/connect/entry.jspa?externalID=691"&gt;Mass Video Conversion Using AWS&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;How to use S3, SQS, EC2, ffmpeg and some Python to bulk convert videos with Amazon Web Services.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/amazon"&gt;amazon&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/aws"&gt;aws&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ec2"&gt;ec2&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ffmpeg"&gt;ffmpeg&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqs"&gt;sqs&lt;/a&gt;&lt;/p&gt;



</summary><category term="amazon"/><category term="aws"/><category term="ec2"/><category term="ffmpeg"/><category term="python"/><category term="s3"/><category term="sqs"/></entry></feed>