<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: nvidia</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/nvidia.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-11-18T22:19:26+00:00</updated><author><name>Simon Willison</name></author><entry><title>MacWhisper has Automatic Speaker Recognition now</title><link href="https://simonwillison.net/2025/Nov/18/macwhisper-speaker-recognition/#atom-tag" rel="alternate"/><published>2025-11-18T22:19:26+00:00</published><updated>2025-11-18T22:19:26+00:00</updated><id>https://simonwillison.net/2025/Nov/18/macwhisper-speaker-recognition/#atom-tag</id><summary type="html">
    &lt;p&gt;Inspired by &lt;a href="https://news.ycombinator.com/item?id=45970519#45971014"&gt;this conversation&lt;/a&gt; on Hacker News I decided to upgrade &lt;a href="https://goodsnooze.gumroad.com/l/macwhisper"&gt;MacWhisper&lt;/a&gt; to try out NVIDIA Parakeet and the new Automatic Speaker Recognition feature.&lt;/p&gt;
&lt;p&gt;It appears to work really well! Here's the result against &lt;a href="https://static.simonwillison.net/static/2025/HMB-nov-4-2025.m4a"&gt;this 39.7MB m4a file&lt;/a&gt; from my &lt;a href="https://simonwillison.net/2025/Nov/18/gemini-3/#analyzing-a-city-council-meeting"&gt;Gemini 3 Pro write-up&lt;/a&gt; this morning:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A screenshot of the MacWhisper transcription application interface displaying a file named &amp;quot;HMB_compressed.&amp;quot; The center panel shows a transcript of a City Council meeting. Speaker 2 begins, &amp;quot;Thank you, Mr. Mayor, uh City Council... Victor Hernandez, Spanish interpreter,&amp;quot; followed by Spanish instructions: &amp;quot;Buenas noches, les queremos dejar saber a todos ustedes que pueden acceder lo que es el canal de Zoom...&amp;quot; Speaker 1 responds, &amp;quot;Thank you. Appreciate that. Can we please have a roll call?&amp;quot; Speaker 3 then calls out &amp;quot;Councilmember Johnson?&amp;quot; and &amp;quot;Councilmember Nagengast?&amp;quot; to which Speaker 1 answers, &amp;quot;Here.&amp;quot; The interface includes metadata on the right indicating the model &amp;quot;Parakeet v3&amp;quot; and a total word count of 26,109." src="https://static.simonwillison.net/static/2025/macwhisper-parakeet.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;You can export the transcript with both timestamps and speaker names using the Share -&amp;gt; Segments -&amp;gt; .json menu item:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A close-up of the MacWhisper interface showing the export dropdown menu with &amp;quot;Segments&amp;quot; selected. A secondary menu lists various file formats including .txt, .csv, and .pdf, with a red arrow pointing specifically to the &amp;quot;.json&amp;quot; option, set against the background of the meeting transcript." src="https://static.simonwillison.net/static/2025/macwhisper-export.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/2149eb880142561b8fccf1866bc16767"&gt;the resulting JSON&lt;/a&gt;.&lt;/p&gt;
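&lt;p&gt;As a rough sketch of what you can do with that export (the &lt;code&gt;speaker&lt;/code&gt; and &lt;code&gt;text&lt;/code&gt; field names here are assumptions for illustration - check them against your own file), you can flatten the segments into a plain speaker-labelled transcript:&lt;/p&gt;

```shell
# Flatten a MacWhisper-style segments export into "Speaker: text" lines.
# NOTE: the JSON shape here is a guess for illustration - inspect your
# own .json export to confirm the actual keys.
printf '%s' '[
  {"speaker": "Speaker 1", "text": "Can we please have a roll call?"},
  {"speaker": "Speaker 3", "text": "Councilmember Johnson?"}
]' > segments.json

python3 -c "import json; [print(seg['speaker'] + ': ' + seg['text']) for seg in json.load(open('segments.json'))]"
```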

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/whisper"&gt;whisper&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nvidia"&gt;nvidia&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/speech-to-text"&gt;speech-to-text&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/macwhisper"&gt;macwhisper&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="whisper"/><category term="nvidia"/><category term="speech-to-text"/><category term="macwhisper"/></entry><entry><title>parakeet-mlx</title><link href="https://simonwillison.net/2025/Nov/14/parakeet-mlx/#atom-tag" rel="alternate"/><published>2025-11-14T20:00:32+00:00</published><updated>2025-11-14T20:00:32+00:00</updated><id>https://simonwillison.net/2025/Nov/14/parakeet-mlx/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/senstella/parakeet-mlx"&gt;parakeet-mlx&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Neat MLX project by Senstella bringing NVIDIA's &lt;a href="https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2"&gt;Parakeet&lt;/a&gt; ASR (Automatic Speech Recognition, like Whisper) model to Apple's MLX framework.&lt;/p&gt;
&lt;p&gt;It's packaged as a Python CLI tool, so you can run it like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uvx parakeet-mlx default_tc.mp3
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first time I ran this it downloaded a 2.5GB model file.&lt;/p&gt;
&lt;p&gt;Once that was fetched it took 53 seconds to transcribe a 65MB 1hr 1m 28s podcast episode (&lt;a href="https://accessibility-and-gen-ai.simplecast.com/episodes/ep-6-simon-willison-datasette"&gt;this one&lt;/a&gt;) and produced &lt;a href="https://gist.github.com/simonw/ea1dc73029bf080676839289e705a2a2"&gt;this default_tc.srt file&lt;/a&gt; with a timestamped transcript of the audio I fed into it. The quality appears to be very high.&lt;/p&gt;
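&lt;p&gt;For a rough sense of scale, that works out to roughly 70x faster than real time:&lt;/p&gt;

```shell
# Rough real-time factor for the run described above:
# 1h 1m 28s of audio transcribed in 53 seconds.
audio_s=$((1 * 3600 + 1 * 60 + 28))   # 3688 seconds of audio
rtf=$((audio_s / 53))                 # integer real-time factor
echo "about ${rtf}x faster than real time"
```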


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nvidia"&gt;nvidia&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/speech-to-text"&gt;speech-to-text&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="ai"/><category term="nvidia"/><category term="uv"/><category term="mlx"/><category term="speech-to-text"/></entry><entry><title>Using Codex CLI with gpt-oss:120b on an NVIDIA DGX Spark via Tailscale</title><link href="https://simonwillison.net/2025/Nov/7/codex-tailscale-spark/#atom-tag" rel="alternate"/><published>2025-11-07T07:23:12+00:00</published><updated>2025-11-07T07:23:12+00:00</updated><id>https://simonwillison.net/2025/Nov/7/codex-tailscale-spark/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://til.simonwillison.net/llms/codex-spark-gpt-oss"&gt;Using Codex CLI with gpt-oss:120b on an NVIDIA DGX Spark via Tailscale&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Inspired by a &lt;a href="https://www.youtube.com/watch?v=qy4ci7AoF9Y&amp;amp;lc=UgzaGdLX8TAuQ9ugx1Z4AaABAg"&gt;YouTube comment&lt;/a&gt; I wrote up how I run OpenAI's Codex CLI coding agent against the gpt-oss:120b model running in Ollama on my &lt;a href="https://simonwillison.net/2025/Oct/14/nvidia-dgx-spark/"&gt;NVIDIA DGX Spark&lt;/a&gt; via a Tailscale network.&lt;/p&gt;
&lt;p&gt;It takes a little bit of work to configure but the result is I can now use Codex CLI on my laptop anywhere in the world against a self-hosted model.&lt;/p&gt;
&lt;p&gt;I used it to build &lt;a href="https://static.simonwillison.net/static/2025/gpt-oss-120b-invaders.html"&gt;this space invaders clone&lt;/a&gt;.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tailscale"&gt;tailscale&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/til"&gt;til&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nvidia"&gt;nvidia&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/space-invaders"&gt;space-invaders&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/codex-cli"&gt;codex-cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nvidia-spark"&gt;nvidia-spark&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="tailscale"/><category term="til"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="nvidia"/><category term="coding-agents"/><category term="space-invaders"/><category term="codex-cli"/><category term="nvidia-spark"/></entry><entry><title>Getting DeepSeek-OCR working on an NVIDIA Spark via brute force using Claude Code</title><link href="https://simonwillison.net/2025/Oct/20/deepseek-ocr-claude-code/#atom-tag" rel="alternate"/><published>2025-10-20T17:21:52+00:00</published><updated>2025-10-20T17:21:52+00:00</updated><id>https://simonwillison.net/2025/Oct/20/deepseek-ocr-claude-code/#atom-tag</id><summary type="html">
    &lt;p&gt;DeepSeek released a new model yesterday: &lt;a href="https://github.com/deepseek-ai/DeepSeek-OCR"&gt;DeepSeek-OCR&lt;/a&gt;, a 6.6GB model fine-tuned specifically for OCR. They released it as model weights that run using PyTorch and CUDA. I got it running on the NVIDIA Spark by having Claude Code effectively brute force the challenge of getting it working on that particular hardware.&lt;/p&gt;
&lt;p&gt;This small project (40 minutes this morning, most of which was Claude Code churning away while I had breakfast and did some other things) ties together a bunch of different concepts I've been exploring recently. I &lt;a href="https://simonwillison.net/2025/Sep/30/designing-agentic-loops/"&gt;designed an agentic loop&lt;/a&gt; for the problem, gave Claude full permissions inside a Docker sandbox, embraced the &lt;a href="https://simonwillison.net/2025/Oct/5/parallel-coding-agents/"&gt;parallel agents lifestyle&lt;/a&gt; and reused my &lt;a href="https://simonwillison.net/2025/Oct/14/nvidia-dgx-spark/"&gt;notes on the NVIDIA Spark&lt;/a&gt; from last week.&lt;/p&gt;
&lt;p&gt;I knew getting a PyTorch CUDA model running on the Spark was going to be a little frustrating, so I decided to outsource the entire process to Claude Code to see what would happen.&lt;/p&gt;
&lt;p&gt;TLDR: It worked. It took four prompts (one long, three very short) to have Claude Code figure out everything necessary to run the new DeepSeek model on the NVIDIA Spark, OCR a document for me and produce &lt;em&gt;copious&lt;/em&gt; notes about the process.&lt;/p&gt;
&lt;h4 id="the-setup"&gt;The setup&lt;/h4&gt;
&lt;p&gt;I connected to the Spark from my Mac via SSH and started a new Docker container there:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;docker run -it --gpus=all \
  -v /usr/local/cuda:/usr/local/cuda:ro \
  nvcr.io/nvidia/cuda:13.0.1-devel-ubuntu24.04 \
  bash&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then I installed npm and used that to install Claude Code:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;apt-get update
DEBIAN_FRONTEND=noninteractive TZ=Etc/UTC apt-get install -y npm
npm install -g @anthropic-ai/claude-code&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then started Claude Code, telling it that it's OK that it's running as &lt;code&gt;root&lt;/code&gt; because it's in a sandbox:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;IS_SANDBOX=1 claude --dangerously-skip-permissions&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It provided me a URL to click on to authenticate with my Anthropic account.&lt;/p&gt;
&lt;h4 id="the-initial-prompts"&gt;The initial prompts&lt;/h4&gt;
&lt;p&gt;I kicked things off with this prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Create a folder deepseek-ocr and do everything else in that folder&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Then I ran the following, providing links to both the GitHub repository and the Hugging Face model, providing a clue about NVIDIA ARM and giving it an image (&lt;a href="https://static.simonwillison.net/static/2025/ft.jpeg"&gt;this one&lt;/a&gt;, see &lt;a href="https://simonwillison.net/2025/Aug/29/the-perils-of-vibe-coding/"&gt;previous post&lt;/a&gt;) that I wanted it to run OCR on.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Your task is to get this working: &lt;a href="https://github.com/deepseek-ai/DeepSeek-OCR"&gt;https://github.com/deepseek-ai/DeepSeek-OCR&lt;/a&gt; - it uses Hugging Face Transformers and models from &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-OCR"&gt;https://huggingface.co/deepseek-ai/DeepSeek-OCR&lt;/a&gt; - you are running in a Docker container on an NVIDIA ARM device, first research your environment to check it will be possible to do this. Git clone both of those - you may need to enable LFS first. Make extensive notes as you go in a notes.md file which you should only ever append new notes to. At the end of this you should have scripts in the folder which can be run to both setup and then run the model against an example - use &lt;a href="https://static.simonwillison.net/static/2025/ft.jpeg"&gt;https://static.simonwillison.net/static/2025/ft.jpeg&lt;/a&gt; as the example image that you OCR. As a final step create a detailed README.md file explaining the project and what you learned and how to run your code.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Note how I also told it to "make extensive notes as you go in a notes.md file which you should only ever append new notes to" so I'd have some useful notes at the end.&lt;/p&gt;
&lt;h4 id="the-follow-ups"&gt;The follow-ups&lt;/h4&gt;
&lt;p&gt;It churned away for 13 minutes, and made a sequence of discoveries about the platform and the model and the difficulties of running PyTorch on it.&lt;/p&gt;
&lt;p&gt;And then... &lt;a href="https://github.com/simonw/research/blob/main/deepseek-ocr-nvidia-spark/claude-log.md#-assistant--2025-10-20-154706"&gt;it gave up&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The NVIDIA GB10 GPU has compute capability sm_121, which is too new for PyTorch 2.5.1&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;PyTorch 2.5.1 supports: sm_50, sm_80, sm_86, sm_89, sm_90, sm_90a&lt;/li&gt;
&lt;li&gt;GB10 requires: sm_121&lt;/li&gt;
&lt;li&gt;Error: &lt;code&gt;CUDA error: no kernel image is available for execution on the device&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Additionally, the model has hardcoded &lt;code&gt;.cuda()&lt;/code&gt; calls, preventing CPU fallback without code modifications.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So I gave it some encouragement:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Any other options that might help get CUDA working like using different versions of PyTorch that have available CUDA wheels for this platform? Have you checked to see what PyTorch versions already have these ARM CUDA wheels that would work here?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It ran this command to grep for all available PyTorch wheels on &lt;code&gt;download.pytorch.org&lt;/code&gt; that match the ARM architecture:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;curl -s https://download.pytorch.org/whl/torch/ &lt;span class="pl-k"&gt;|&lt;/span&gt; grep -o &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;torch-[0-9.]*.*aarch64.whl&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; sort -V &lt;span class="pl-k"&gt;|&lt;/span&gt; tail -20&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;... and that unblocked it.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Excellent! I can see that PyTorch 2.9.0 has ARM64 wheels for CUDA 12.8, 12.9, and &lt;strong&gt;CUDA 13.0&lt;/strong&gt; (cu130)! This is exactly what we need!&lt;/p&gt;
&lt;/blockquote&gt;
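&lt;p&gt;That filtering pipeline is easy to experiment with offline too - here it is run against some sample wheel filenames (these names are illustrative, not a real index listing):&lt;/p&gt;

```shell
# The same grep / sort -V filter, demonstrated offline on sample wheel
# names (illustrative filenames, not a real download.pytorch.org listing):
printf '%s\n' \
  'torch-2.5.1-cp312-cp312-manylinux_2_28_aarch64.whl' \
  'torch-2.9.0%2Bcu130-cp312-cp312-manylinux_2_28_aarch64.whl' \
  'torch-2.8.0-cp312-cp312-manylinux_2_28_aarch64.whl' \
  | grep -o 'torch-[0-9.]*.*aarch64.whl' | sort -V | tail -2
```

&lt;p&gt;&lt;code&gt;sort -V&lt;/code&gt; does a version-aware sort, so the newest releases end up at the bottom for &lt;code&gt;tail&lt;/code&gt; to pick out.&lt;/p&gt;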
&lt;p&gt;It installed the wheels it found and reported:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Good news! PyTorch 2.9.0 can detect the GB10 GPU with CUDA capability 12.1 (sm_121), but it says the maximum supported is 12.0. However, it's just a warning now instead of a hard error! Let's try running inference: [...]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Then kept on churning for a bit, ignoring warnings, and managed to &lt;a href="https://github.com/simonw/research/blob/main/deepseek-ocr-nvidia-spark/claude-log.md#-user--2025-10-20-155056"&gt;run the model&lt;/a&gt;! It loaded the image and ran this prompt:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;|grounding|&amp;gt;OCR this image.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And got back output that started like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;|ref|&amp;gt;The perils of vibe coding&amp;lt;|/ref|&amp;gt;&amp;lt;|det|&amp;gt;[[352, 30, 624, 111]]&amp;lt;|/det|&amp;gt;
&amp;lt;|ref|&amp;gt;opt&amp;lt;|/ref|&amp;gt;&amp;lt;|det|&amp;gt;[[970, 0, 994, 30]]&amp;lt;|/det|&amp;gt;
&amp;lt;|ref|&amp;gt;such&amp;lt;|/ref|&amp;gt;&amp;lt;|det|&amp;gt;[[970, 24, 996, 58]]&amp;lt;|/det|&amp;gt;
&amp;lt;|ref|&amp;gt;days&amp;lt;|/ref|&amp;gt;&amp;lt;|det|&amp;gt;[[970, 52, 996, 87]]&amp;lt;|/det|&amp;gt;
&amp;lt;|ref|&amp;gt;pavi&amp;lt;|/ref|&amp;gt;&amp;lt;|det|&amp;gt;[[970, 85, 996, 118]]&amp;lt;|/det|&amp;gt;
&amp;lt;|ref|&amp;gt;TECHNOLOGY&amp;lt;|/ref|&amp;gt;&amp;lt;|det|&amp;gt;[[33, 199, 127, 230]]&amp;lt;|/det|&amp;gt;
&amp;lt;|ref|&amp;gt;holds the promise of replacing program-&amp;lt;|/ref|&amp;gt;&amp;lt;|det|&amp;gt;[[220, 170, 390, 210]]&amp;lt;|/det|&amp;gt;
&amp;lt;|ref|&amp;gt;mers - a profession of very well paid&amp;lt;|/ref|&amp;gt;&amp;lt;|det|&amp;gt;[[220, 199, 390, 238]]&amp;lt;|/det|&amp;gt;
&amp;lt;|ref|&amp;gt;people.For another, the work can be&amp;lt;|/ref|&amp;gt;&amp;lt;|det|&amp;gt;[[220, 228, 390, 267]]&amp;lt;|/det|&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Claude declared victory and pointed me to the &lt;a href="https://github.com/simonw/research/blob/main/deepseek-ocr-nvidia-spark/output/result.mmd?short_path=f43da6a"&gt;output/result.mmd&lt;/a&gt; file, which contained only whitespace. So OCR had worked but the result had failed to be written correctly to disk.&lt;/p&gt;
&lt;p&gt;I &lt;a href="https://github.com/simonw/research/blob/main/deepseek-ocr-nvidia-spark/claude-log.md#-user--2025-10-20-155826"&gt;said&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;That result.mmd file looks to be exclusively whitespace to me, try and get a better result there - the bounding box image was good though&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Which caused Claude to &lt;a href="https://github.com/simonw/research/blob/main/deepseek-ocr-nvidia-spark/claude-log.md#-assistant--2025-10-20-155900"&gt;reconsider&lt;/a&gt; the prompt it used:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Maybe we need to use a different prompt. Looking at the README from DeepSeek-OCR, there are different prompts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;"&amp;lt;image&amp;gt;\n&amp;lt;|grounding|&amp;gt;Convert the document to markdown."&lt;/code&gt; - for documents&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;"&amp;lt;image&amp;gt;\n&amp;lt;|grounding|&amp;gt;OCR this image."&lt;/code&gt; - general OCR&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;"&amp;lt;image&amp;gt;\nFree OCR."&lt;/code&gt; - without layouts&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The "grounding" mode seems to be focused on bounding boxes, which is why we got good bounding box results but the text output might be in a different format.&lt;/p&gt;
&lt;p&gt;Let me try running it again with a different prompt that's more focused on text extraction rather than grounding.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It tried all of the prompt variants, &lt;a href="https://github.com/simonw/research/blob/main/deepseek-ocr-nvidia-spark/notes.md#text-output-improvement-2025-10-20-1601"&gt;made a note about them&lt;/a&gt; and created a file called &lt;a href="https://github.com/simonw/research/blob/main/deepseek-ocr-nvidia-spark/PROMPTS_GUIDE.md"&gt;PROMPTS_GUIDE.md&lt;/a&gt; with a bunch more details, including this comparison table:&lt;/p&gt;
&lt;blockquote&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Prompt&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;th&gt;Text Quality&lt;/th&gt;
&lt;th&gt;Structure&lt;/th&gt;
&lt;th&gt;Coordinates&lt;/th&gt;
&lt;th&gt;Best Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Free OCR&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;⚡⚡⚡ Fast&lt;/td&gt;
&lt;td&gt;⭐⭐⭐ Excellent&lt;/td&gt;
&lt;td&gt;⭐ Basic&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;General OCR&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Markdown&lt;/td&gt;
&lt;td&gt;⚡⚡ Medium&lt;/td&gt;
&lt;td&gt;⭐⭐⭐ Excellent&lt;/td&gt;
&lt;td&gt;⭐⭐⭐ Full&lt;/td&gt;
&lt;td&gt;⭐⭐ Partial&lt;/td&gt;
&lt;td&gt;Documents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grounding&lt;/td&gt;
&lt;td&gt;⚡ Slow&lt;/td&gt;
&lt;td&gt;⭐⭐ Good&lt;/td&gt;
&lt;td&gt;⭐ Basic&lt;/td&gt;
&lt;td&gt;⭐⭐⭐ Full&lt;/td&gt;
&lt;td&gt;Annotations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Detailed&lt;/td&gt;
&lt;td&gt;⚡⚡⚡ Fastest&lt;/td&gt;
&lt;td&gt;⭐ N/A&lt;/td&gt;
&lt;td&gt;❌ N/A&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;td&gt;Image analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/blockquote&gt;
&lt;p&gt;And this table under the heading "Performance Benchmarks":&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Test image: 3503×1668 pixels (Financial Times article)&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Prompt&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;Output Size&lt;/th&gt;
&lt;th&gt;Tokens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Free OCR&lt;/td&gt;
&lt;td&gt;24s&lt;/td&gt;
&lt;td&gt;Clean text&lt;/td&gt;
&lt;td&gt;2257&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Markdown&lt;/td&gt;
&lt;td&gt;39s&lt;/td&gt;
&lt;td&gt;Formatted MD&lt;/td&gt;
&lt;td&gt;2257 + structure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grounding&lt;/td&gt;
&lt;td&gt;58s&lt;/td&gt;
&lt;td&gt;Text + coords&lt;/td&gt;
&lt;td&gt;2257 + boxes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Detailed&lt;/td&gt;
&lt;td&gt;9s&lt;/td&gt;
&lt;td&gt;Description&lt;/td&gt;
&lt;td&gt;~300&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/blockquote&gt;
&lt;p&gt;My final prompt was this, to gather everything together into a zip file I could extract from the Docker container:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Create a zip file with the output and output_text and all of the scripts and notes - but leave out the github repo and the huggingface repo directories&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I added the contents of that zip file to my new &lt;a href="https://github.com/simonw/research"&gt;simonw/research&lt;/a&gt; GitHub repo &lt;a href="https://github.com/simonw/research/tree/main/deepseek-ocr-nvidia-spark"&gt;in the deepseek-ocr-nvidia-spark&lt;/a&gt; folder.&lt;/p&gt;
&lt;p&gt;Claude really likes writing notes! Here's the directory listing of that finished folder:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;  |-- download_test_image.sh
  |-- FINAL_SUMMARY.md
  |-- notes.md
  |-- output
  |   |-- images
  |   |-- result_with_boxes.jpg
  |   `-- result.mmd
  |-- output_text
  |   |-- detailed
  |   |   |-- images
  |   |   |-- result_with_boxes.jpg
  |   |   `-- result.mmd
  |   |-- free_ocr
  |   |   |-- images
  |   |   |-- result_with_boxes.jpg
  |   |   `-- result.mmd
  |   `-- markdown
  |       |-- images
  |       |   `-- 0.jpg
  |       |-- result_with_boxes.jpg
  |       `-- result.mmd
  |-- PROMPTS_GUIDE.md
  |-- README_SUCCESS.md
  |-- README.md
  |-- run_ocr_best.py
  |-- run_ocr_cpu_nocuda.py
  |-- run_ocr_cpu.py
  |-- run_ocr_text_focused.py
  |-- run_ocr.py
  |-- run_ocr.sh
  |-- setup.sh
  |-- SOLUTION.md
  |-- test_image.jpeg
  |-- TEXT_OUTPUT_SUMMARY.md
  `-- UPDATE_PYTORCH.md
&lt;/code&gt;&lt;/pre&gt;
&lt;h4 id="takeaways"&gt;Takeaways&lt;/h4&gt;
&lt;p&gt;My first prompt was at 15:31:07 (UTC). The final message from Claude Code came in at 16:10:03. That means it took less than 40 minutes start to finish, and I was only actively involved for about 5-10 minutes of that time. The rest of the time I was having breakfast and doing other things.&lt;/p&gt;
&lt;p&gt;Having tried and failed to get PyTorch stuff working in the past, I count this as a &lt;em&gt;huge&lt;/em&gt; win. I'll be using this process a whole lot more in the future.&lt;/p&gt;
&lt;p&gt;How good were the actual results? There's honestly so much material in the resulting notes created by Claude that I haven't reviewed all of it. There may well be all sorts of errors in there, but it's indisputable that it managed to run the model and made notes on how it did that such that I'll be able to do the same thing in the future.&lt;/p&gt;
&lt;p&gt;I think the key factors in executing this project successfully were the following:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;I gave it exactly what it needed: a Docker environment in the target hardware, instructions on where to get what it needed (the code and the model) and a clear goal for it to pursue. This is a great example of the pattern I described in &lt;a href="https://simonwillison.net/2025/Sep/30/designing-agentic-loops/"&gt;designing agentic loops&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Running it in a Docker sandbox meant I could use &lt;code&gt;claude --dangerously-skip-permissions&lt;/code&gt; and leave it running on its own. If I'd had to approve every command it wanted to run I would have got frustrated and quit the project after just a few minutes.&lt;/li&gt;
&lt;li&gt;I applied my own knowledge and experience when it got stuck. I was confident (based on &lt;a href="https://simonwillison.net/2025/Oct/14/nvidia-dgx-spark/#claude-code-for-everything"&gt;previous experiments&lt;/a&gt; with the Spark) that a CUDA wheel for ARM64 existed that was likely to work, so when it gave up I prompted it to try again, leading to success.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Oh, and it looks like DeepSeek OCR is a pretty good model if you spend the time experimenting with different ways to run it.&lt;/p&gt;
&lt;h4 id="bonus-using-vs-code-to-monitor-the-container"&gt;Bonus: Using VS Code to monitor the container&lt;/h4&gt;
&lt;p&gt;A small TIL from today: I had kicked off the job running in the Docker container via SSH to the Spark when I realized it would be neat if I could easily monitor the files it was creating while it was running.&lt;/p&gt;
&lt;p&gt;I &lt;a href="https://claude.ai/share/68a0ebff-b586-4278-bd91-6b715a657d2b"&gt;asked Claude.ai&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I am running a Docker container on a remote machine, which I started over SSH&lt;/p&gt;
&lt;p&gt;How can I have my local VS Code on MacOS show me the filesystem in that docker container inside that remote machine, without restarting anything?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It gave me a set of steps that solved this exact problem:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Install the VS Code "Remote SSH" and "Dev Containers" extensions&lt;/li&gt;
&lt;li&gt;Use "Remote-SSH: Connect to Host" to connect to the remote machine (on my Tailscale network that's &lt;code&gt;spark@100.113.1.114&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;In the window for that remote SSH session, run "Dev Containers: Attach to Running Container" - this shows a list of containers and you can select the one you want to attach to&lt;/li&gt;
&lt;li&gt;... and that's it! VS Code opens a new window providing full access to all of the files in that container. I opened up &lt;code&gt;notes.md&lt;/code&gt; and watched it as Claude Code appended to it in real time.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;At the end when I told Claude to create a zip file of the results I could select that in the VS Code file explorer and use the "Download" menu item to download it to my Mac.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/docker"&gt;docker&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pytorch"&gt;pytorch&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nvidia"&gt;nvidia&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vs-code"&gt;vs-code&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deepseek"&gt;deepseek&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-code"&gt;claude-code&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nvidia-spark"&gt;nvidia-spark&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ocr"/><category term="python"/><category term="ai"/><category term="docker"/><category term="pytorch"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="anthropic"/><category term="claude"/><category term="nvidia"/><category term="vs-code"/><category term="vision-llms"/><category term="deepseek"/><category term="llm-release"/><category term="coding-agents"/><category term="claude-code"/><category term="ai-in-china"/><category term="nvidia-spark"/></entry><entry><title>NVIDIA DGX Spark + Apple Mac Studio = 4x Faster LLM Inference with EXO 1.0</title><link href="https://simonwillison.net/2025/Oct/16/nvidia-dgx-spark-apple-mac-studio/#atom-tag" rel="alternate"/><published>2025-10-16T05:34:41+00:00</published><updated>2025-10-16T05:34:41+00:00</updated><id>https://simonwillison.net/2025/Oct/16/nvidia-dgx-spark-apple-mac-studio/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.exolabs.net/nvidia-dgx-spark"&gt;NVIDIA DGX Spark + Apple Mac Studio = 4x Faster LLM Inference with EXO 1.0&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;EXO Labs wired a 256GB M3 Ultra Mac Studio up to an NVIDIA DGX Spark and got a 2.8x performance boost serving Llama-3.1 8B (FP16) with an 8,192 token prompt.&lt;/p&gt;
&lt;p&gt;Their detailed explanation taught me a lot about LLM performance.&lt;/p&gt;
&lt;p&gt;There are two key steps in executing a prompt. The first is the &lt;strong&gt;prefill&lt;/strong&gt; phase that reads the incoming prompt and builds a KV cache for each of the transformer layers in the model. This is compute-bound as it needs to process every token in the input and perform large matrix multiplications across all of the layers to initialize the model's internal state.&lt;/p&gt;
&lt;p&gt;Performance in the prefill stage influences TTFT - time‑to‑first‑token.&lt;/p&gt;
&lt;p&gt;The second step is the &lt;strong&gt;decode&lt;/strong&gt; phase, which generates the output one token at a time. This part is limited by memory bandwidth - there's less arithmetic, but each token needs to consider the entire KV cache.&lt;/p&gt;
&lt;p&gt;Decode performance influences TPS - tokens per second.&lt;/p&gt;
&lt;p&gt;EXO noted that the Spark has 100 TFLOPS but only 273GB/s of memory bandwidth, making it a better fit for prefill. The M3 Ultra has 26 TFLOPS but 819GB/s of memory bandwidth, making it ideal for the decode phase.&lt;/p&gt;
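&lt;p&gt;A back-of-envelope check shows why those bandwidth numbers dominate decode: each generated token has to stream the full FP16 weights (roughly 16GB for an 8B model) through memory once, ignoring KV cache traffic, so tokens per second is capped at bandwidth divided by model size:&lt;/p&gt;

```shell
# Rough decode ceilings, ignoring KV cache reads: tokens/sec is at most
# memory bandwidth (GB/s) divided by FP16 weight size (~16 GB for 8B).
weights_gb=16
spark_tps=$((273 / weights_gb))   # DGX Spark: 273 GB/s bandwidth
mac_tps=$((819 / weights_gb))     # M3 Ultra: 819 GB/s bandwidth
echo "Spark ceiling: about ${spark_tps} tok/s; Mac ceiling: about ${mac_tps} tok/s"
```

&lt;p&gt;That roughly 3x ratio in decode ceilings (819 vs 273 GB/s) lines up neatly with the 2.8x speedup they measured.&lt;/p&gt;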
&lt;p&gt;They run prefill on the Spark, streaming the KV cache to the Mac over 10Gb Ethernet. They can start streaming earlier layers while the later layers are still being calculated. Then the Mac runs the decode phase, returning tokens faster than if the Spark had run the full process end-to-end.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/exolabs/status/1978525767739883736"&gt;@exolabs&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/apple"&gt;apple&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nvidia"&gt;nvidia&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nvidia-spark"&gt;nvidia-spark&lt;/a&gt;&lt;/p&gt;



</summary><category term="apple"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="nvidia"/><category term="nvidia-spark"/></entry><entry><title>NVIDIA DGX Spark: great hardware, early days for the ecosystem</title><link href="https://simonwillison.net/2025/Oct/14/nvidia-dgx-spark/#atom-tag" rel="alternate"/><published>2025-10-14T23:36:21+00:00</published><updated>2025-10-14T23:36:21+00:00</updated><id>https://simonwillison.net/2025/Oct/14/nvidia-dgx-spark/#atom-tag</id><summary type="html">
    &lt;p&gt;NVIDIA sent me a preview unit of their new &lt;a href="https://www.nvidia.com/en-us/products/workstations/dgx-spark/"&gt;DGX Spark&lt;/a&gt; desktop "AI supercomputer". I've never had hardware to review before! You can consider this my first ever sponsored post if you like, but they did not pay me any cash and aside from an embargo date they did not request (nor would I grant) any editorial input into what I write about the device.&lt;/p&gt;
&lt;p&gt;The device retails for around $4,000. They officially go on sale tomorrow.&lt;/p&gt;
&lt;p&gt;First impressions are that this is a snazzy little computer. It's similar in size to a Mac mini, but with an exciting textured surface that feels refreshingly different and a little bit &lt;a href="https://www.indiewire.com/awards/industry/devs-cinematography-rob-hardy-alex-garland-1234583396/"&gt;science fiction&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/nvidia-spark.jpg" alt="A rectangular small computer, sitting horizontally on a box. It is about the width of a Mac Mini. It has a NVIDIA logo on  a reflective handle portion, then textured silver metal front, then another reflective handle at the other end. It's pretty and a bit weird looking. It sits on the box it came in, which has NVIDIA DGX Spark written on it in white text on green." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;There is a &lt;em&gt;very&lt;/em&gt; powerful machine tucked into that little box. Here are the specs, which I had Claude Code figure out for me by &lt;a href="https://gist.github.com/simonw/021651a14e6c5bf9876c9c4244ed6c2d"&gt;poking around on the device itself&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Hardware Specifications&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Architecture: aarch64 (ARM64)&lt;/li&gt;
&lt;li&gt;CPU: 20 cores
&lt;ul&gt;
&lt;li&gt;10x Cortex-X925 (performance cores)&lt;/li&gt;
&lt;li&gt;10x Cortex-A725 (efficiency cores)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;RAM: 119 GB total (112 GB available) - &lt;em&gt;I’m not sure why Claude reported it differently here, the machine is listed as 128GB - it looks like a &lt;a href="https://news.ycombinator.com/item?id=45586776#45588329"&gt;128GB == 119GiB thing&lt;/a&gt; because Claude &lt;a href="https://gist.github.com/simonw/021651a14e6c5bf9876c9c4244ed6c2d#file-nvidia-claude-code-txt-L41"&gt;used free -h&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;Storage: 3.7 TB (6% used, 3.3 TB available)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;GPU Specifications&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Model: NVIDIA GB10 (Blackwell architecture)&lt;/li&gt;
&lt;li&gt;Compute Capability: sm_121 (12.1)&lt;/li&gt;
&lt;li&gt;Memory: 119.68 GB&lt;/li&gt;
&lt;li&gt;Multi-processor Count: 48 streaming multiprocessors&lt;/li&gt;
&lt;li&gt;Architecture: Blackwell&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Short version: this is an ARM64 device with 128GB of memory that's available to both the GPU and the 20 CPU cores at the same time, strapped onto a 4TB NVMe SSD.&lt;/p&gt;
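&lt;p&gt;The 128GB-vs-119GB discrepancy in those specs is just decimal gigabytes versus binary gibibytes - &lt;code&gt;free -h&lt;/code&gt; reports GiB:&lt;/p&gt;

```python
# 128 GB (decimal, 10**9 bytes) expressed in GiB (binary, 2**30 bytes).
gb = 128 * 10**9
gib = gb / 2**30
print(f"{gib:.1f} GiB")  # matches the ~119 GB figure free -h showed Claude
```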
&lt;p&gt;The Spark is firmly targeted at “AI researchers”. It’s designed for both training and running models.&lt;/p&gt;
&lt;h4 id="the-tricky-bit-cuda-on-arm64"&gt;The tricky bit: CUDA on ARM64&lt;/h4&gt;
&lt;p&gt;Until now almost all of my own model running experiments have taken place on a Mac. This has gotten far less painful over the past year and a half thanks to the amazing work of the &lt;a href="https://simonwillison.net/tags/mlx/"&gt;MLX&lt;/a&gt; team and community, but it's still left me deeply frustrated at my lack of access to the NVIDIA CUDA ecosystem. I've lost count of the number of libraries and tutorials which expect you to be able to use Hugging Face Transformers or PyTorch with CUDA, and leave you high and dry if you don't have an NVIDIA GPU to run things on.&lt;/p&gt;
&lt;p&gt;Armed (ha) with my new NVIDIA GPU I was excited to dive into this world that had long eluded me... only to find that there was another assumption baked in to much of this software: x86 architecture for the rest of the machine.&lt;/p&gt;
&lt;p&gt;This resulted in all kinds of unexpected new traps for me to navigate. I eventually managed to get a PyTorch 2.7 wheel for CUDA on ARM, but failed to do so for 2.8. I'm not confident the 2.8 wheel is genuinely unavailable - I just found navigating the PyTorch ARM ecosystem pretty confusing.&lt;/p&gt;
&lt;p&gt;NVIDIA are trying to make this easier, with mixed success. A lot of my initial challenges got easier when I found their &lt;a href="https://docs.nvidia.com/dgx/dgx-spark/nvidia-container-runtime-for-docker.html"&gt;official Docker container&lt;/a&gt;, so now I'm figuring out how best to use Docker with GPUs. Here's the current incantation that's been working for me:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;docker run -it --gpus=all \
  -v /usr/local/cuda:/usr/local/cuda:ro \
  nvcr.io/nvidia/cuda:13.0.1-devel-ubuntu24.04 \
  bash&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I have not yet got my head around the difference between CUDA 12 and 13. 13 appears to be very new, and a lot of the existing tutorials and libraries appear to expect 12.&lt;/p&gt;
&lt;h4 id="the-missing-documentation-isn-t-missing-any-more"&gt;The missing documentation isn't missing any more&lt;/h4&gt;
&lt;p&gt;When I first received this machine around a month ago there was very little in the way of documentation to help get me started. This meant climbing the steep NVIDIA+CUDA learning curve mostly on my own.&lt;/p&gt;
&lt;p&gt;This has changed &lt;em&gt;substantially&lt;/em&gt; in just the last week. NVIDIA now have extensive guides for getting things working on the Spark and they are a huge breath of fresh air - exactly the information I needed when I started exploring this hardware.&lt;/p&gt;
&lt;p&gt;Here's the &lt;a href="https://developer.nvidia.com/topics/ai/dgx-spark"&gt;getting started guide&lt;/a&gt;, details on the &lt;a href="https://build.nvidia.com/spark/dgx-dashboard/instructions"&gt;DGX dashboard web app&lt;/a&gt;, and the essential collection of &lt;a href="https://build.nvidia.com/spark"&gt;playbooks&lt;/a&gt;. There's still a lot I haven't tried yet just in this official set of guides.&lt;/p&gt;
&lt;h4 id="claude-code-for-everything"&gt;Claude Code for everything&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://www.claude.com/product/claude-code"&gt;Claude Code&lt;/a&gt; was an absolute lifesaver for me while I was trying to figure out how best to use this device. My Ubuntu skills were a little rusty, and I also needed to figure out CUDA drivers and Docker incantations and how to install the right versions of PyTorch. Claude 4.5 Sonnet is &lt;em&gt;much better than me&lt;/em&gt; at all of these things.&lt;/p&gt;
&lt;p&gt;Since many of my experiments took place in disposable Docker containers I had no qualms at all about running it in YOLO mode:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;IS_SANDBOX=1 claude --dangerously-skip-permissions&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;IS_SANDBOX=1&lt;/code&gt; environment variable stops Claude from complaining about running as root.&lt;/p&gt;

&lt;details&gt;&lt;summary style="font-style: italic"&gt;Before I found out about IS_SANDBOX&lt;/summary&gt;

&lt;p&gt;&lt;br /&gt;&lt;em&gt;I was &lt;a href="https://twitter.com/lawrencecchen/status/1978255934938886409"&gt;tipped off&lt;/a&gt; about IS_SANDBOX after I published this article. Here's my original workaround:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Claude understandably won't let you do this as root, even in a Docker container, so I found myself using the following incantation in a fresh &lt;code&gt;nvcr.io/nvidia/cuda:13.0.1-devel-ubuntu24.04&lt;/code&gt; instance pretty often:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;apt-get update &lt;span class="pl-k"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get install -y sudo
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; pick the first free UID &amp;gt;=1000&lt;/span&gt;
U=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;for i &lt;span class="pl-k"&gt;in&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;seq 1000 65000&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;do&lt;/span&gt; &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-k"&gt;!&lt;/span&gt; getent passwd &lt;span class="pl-smi"&gt;$i&lt;/span&gt; &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt;/dev/null&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;then&lt;/span&gt; &lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-smi"&gt;$i&lt;/span&gt;&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-c1"&gt;break&lt;/span&gt;&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;fi&lt;/span&gt;&lt;span class="pl-k"&gt;;&lt;/span&gt; done&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Chosen UID: &lt;span class="pl-smi"&gt;$U&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; same for a GID&lt;/span&gt;
G=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;for i &lt;span class="pl-k"&gt;in&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;seq 1000 65000&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;do&lt;/span&gt; &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-k"&gt;!&lt;/span&gt; getent group &lt;span class="pl-smi"&gt;$i&lt;/span&gt; &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt;/dev/null&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;then&lt;/span&gt; &lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-smi"&gt;$i&lt;/span&gt;&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-c1"&gt;break&lt;/span&gt;&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;fi&lt;/span&gt;&lt;span class="pl-k"&gt;;&lt;/span&gt; done&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Chosen GID: &lt;span class="pl-smi"&gt;$G&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; create user+group&lt;/span&gt;
groupadd -g &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-smi"&gt;$G&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; devgrp
useradd -m -u &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-smi"&gt;$U&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; -g &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-smi"&gt;$G&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; -s /bin/bash dev
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; enable password-less sudo:&lt;/span&gt;
&lt;span class="pl-c1"&gt;printf&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;dev ALL=(ALL) NOPASSWD:ALL\n&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; /etc/sudoers.d/90-dev-nopasswd
chmod 0440 /etc/sudoers.d/90-dev-nopasswd
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Install npm&lt;/span&gt;
DEBIAN_FRONTEND=noninteractive TZ=Etc/UTC apt-get install -y npm
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Install Claude&lt;/span&gt;
npm install -g @anthropic-ai/claude-code&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then switch to the &lt;code&gt;dev&lt;/code&gt; user and run Claude for the first time:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;su - dev
claude --dangerously-skip-permissions&lt;/pre&gt;&lt;/div&gt;

&lt;/details&gt;&lt;br /&gt;

&lt;p&gt;On first run Claude provides a URL to visit to authenticate with your Anthropic account; you confirm by copying a token from that page and pasting it back into the terminal.&lt;/p&gt;
&lt;p&gt;Docker tip: you can create a snapshot of the current image (with Claude installed) by running &lt;code&gt;docker ps&lt;/code&gt; to get the container ID and then:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;docker commit --pause=false &lt;span class="pl-k"&gt;&amp;lt;&lt;/span&gt;container_id&lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; cc:snapshot&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then later you can start a similar container using:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;docker run -it \
  --gpus=all \
  -v /usr/local/cuda:/usr/local/cuda:ro \
  cc:snapshot bash&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here's an example of the kinds of prompts I've been running in Claude Code inside the container:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;I want to run https://huggingface.co/unsloth/Qwen3-4B-GGUF using llama.cpp - figure out how to get llama cpp working on this machine  such that it runs with the GPU, then install it in this directory and get that model to work to serve a prompt. Goal is to get this  command to run: llama-cli -hf unsloth/Qwen3-4B-GGUF -p "I believe the meaning of life is" -n 128 -no-cnv&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That one worked flawlessly - Claude checked out the &lt;code&gt;llama.cpp&lt;/code&gt; repo, compiled it for me and iterated on it until it could run that model on the GPU. Here's a &lt;a href="https://gist.github.com/simonw/3e7d28d9ed222d842f729bfca46d6673"&gt;full transcript&lt;/a&gt;, converted from Claude's &lt;code&gt;.jsonl&lt;/code&gt; log format to Markdown using a script I &lt;a href="https://github.com/simonw/tools/blob/main/python/claude_to_markdown.py"&gt;vibe coded just now&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I later told it:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Write out a markdown file with detailed notes on what you did. Start with the shortest form of notes on how to get a successful build, then add a full account of everything you tried, what went wrong and how you fixed it.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Which produced &lt;a href="https://gist.github.com/simonw/0942d96f616b9e328568ab27d911c8ed"&gt;this handy set of notes&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="tailscale-was-made-for-this"&gt;Tailscale was made for this&lt;/h4&gt;
&lt;p&gt;Having a machine like this on my local network is neat, but what's even neater is being able to access it from anywhere else in the world, from both my phone and my laptop.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://tailscale.com/"&gt;Tailscale&lt;/a&gt; is &lt;em&gt;perfect&lt;/em&gt; for this. I installed it on the Spark (using the &lt;a href="https://tailscale.com/kb/1031/install-linux"&gt;Ubuntu instructions here&lt;/a&gt;), signed in with my SSO account (via Google)... and the Spark showed up in the "Network Devices" panel on my laptop and phone instantly.&lt;/p&gt;
&lt;p&gt;I can SSH in from my laptop or using the &lt;a href="https://termius.com/free-ssh-client-for-iphone"&gt;Termius iPhone app&lt;/a&gt; on my phone. I've also been running tools like &lt;a href="https://openwebui.com/"&gt;Open WebUI&lt;/a&gt; which give me a mobile-friendly web interface for interacting with LLMs on the Spark.&lt;/p&gt;
&lt;h4 id="here-comes-the-ecosystem"&gt;Here comes the ecosystem&lt;/h4&gt;
&lt;p&gt;The embargo on these devices dropped yesterday afternoon, and it turns out a whole bunch of relevant projects have had similar preview access to me. This is &lt;em&gt;fantastic news&lt;/em&gt;, as many of the things I've been trying to figure out myself suddenly got a whole lot easier.&lt;/p&gt;
&lt;p&gt;Four particularly notable examples:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Ollama &lt;a href="https://ollama.com/blog/nvidia-spark"&gt;works out of the box&lt;/a&gt;. They actually had a build that worked a few weeks ago, and were the first success I had running an LLM on the machine.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;llama.cpp&lt;/code&gt; creator Georgi Gerganov just published &lt;a href="https://github.com/ggml-org/llama.cpp/discussions/16578"&gt;extensive benchmark results&lt;/a&gt; from running &lt;code&gt;llama.cpp&lt;/code&gt; on a Spark. He's getting ~3,600 tokens/second to read the prompt and ~59 tokens/second to generate a response with the MXFP4 version of GPT-OSS 20B, and ~817 tokens/second to read and ~18 tokens/second to generate for GLM-4.5-Air-GGUF.&lt;/li&gt;
&lt;li&gt;LM Studio now have &lt;a href="https://lmstudio.ai/blog/dgx-spark"&gt;a build for the Spark&lt;/a&gt;. I haven't tried this one yet as I'm currently using my machine exclusively via SSH.&lt;/li&gt;
&lt;li&gt;vLLM - one of the most popular engines for serving production LLMs - had &lt;a href="https://x.com/eqhylxx/status/1977928690945360049"&gt;early access&lt;/a&gt; and there's now an official &lt;a href="https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm?version=25.09-py3"&gt;NVIDIA vLLM NGC Container&lt;/a&gt; for running their stack.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here's &lt;a href="https://docs.unsloth.ai/new/fine-tuning-llms-with-nvidia-dgx-spark-and-unsloth"&gt;a tutorial from Unsloth&lt;/a&gt; on fine-tuning gpt-oss-20b on the Spark.&lt;/p&gt;
&lt;h4 id="should-you-get-one-"&gt;Should you get one?&lt;/h4&gt;
&lt;p&gt;It's a bit too early for me to provide a confident recommendation concerning this machine. As indicated above, I've had a tough time figuring out how best to put it to use, largely through my own inexperience with CUDA, ARM64 and Ubuntu GPU machines in general.&lt;/p&gt;
&lt;p&gt;The ecosystem improvements in just the past 24 hours have been very reassuring though. I expect it will be clear within a few weeks how well supported this machine is going to be.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/hardware"&gt;hardware&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/docker"&gt;docker&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tailscale"&gt;tailscale&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nvidia"&gt;nvidia&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama-cpp"&gt;llama-cpp&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-code"&gt;claude-code&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/disclosures"&gt;disclosures&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nvidia-spark"&gt;nvidia-spark&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="hardware"/><category term="ai"/><category term="docker"/><category term="tailscale"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="nvidia"/><category term="ollama"/><category term="llama-cpp"/><category term="coding-agents"/><category term="claude-code"/><category term="lm-studio"/><category term="disclosures"/><category term="nvidia-spark"/></entry><entry><title>Quoting Paul Kedrosky</title><link href="https://simonwillison.net/2025/Jul/19/paul-kedrosky/#atom-tag" rel="alternate"/><published>2025-07-19T00:25:08+00:00</published><updated>2025-07-19T00:25:08+00:00</updated><id>https://simonwillison.net/2025/Jul/19/paul-kedrosky/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://paulkedrosky.com/honey-ai-capex-ate-the-economy/"&gt;&lt;p&gt;One analyst recently speculated (via &lt;a href="https://www.edwardconard.com/macro-roundup/using-nvidias-datacenter-revenue-as-a-reference-for-us-ai-capex-jensnordvig-estimates-that-ai-will-make-up-2-of-us-gdp-in-2025-given-a-standard-multiplier-implying-an-ai-contribution-to-g/?view=detail&amp;amp;filters=macro-roundup-database"&gt;Ed Conard&lt;/a&gt;) that, based on Nvidia's latest datacenter sales figures, AI capex may be &lt;strong&gt;~2% of US GDP in 2025&lt;/strong&gt;, given a standard multiplier. [...]&lt;/p&gt;
&lt;p&gt;Capital expenditures on AI data centers is likely around &lt;strong&gt;20% of the peak spending on railroads&lt;/strong&gt;, as a percentage of GDP, and it is still rising quickly. [...]&lt;/p&gt;
&lt;p&gt;Regardless of what one thinks about the merits of AI or &lt;strong&gt;explosive datacenter expansion&lt;/strong&gt;, the scale and pace of capital deployment into a &lt;strong&gt;rapidly depreciating technology&lt;/strong&gt; is remarkable. These are not railroads—we aren’t building &lt;strong&gt;century-long infrastructure&lt;/strong&gt;. AI datacenters are short-lived, asset-intensive facilities riding declining-cost technology curves, requiring &lt;strong&gt;frequent hardware replacement&lt;/strong&gt; to preserve margins.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://paulkedrosky.com/honey-ai-capex-ate-the-economy/"&gt;Paul Kedrosky&lt;/a&gt;, Honey, AI Capex is Eating the Economy&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/economics"&gt;economics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nvidia"&gt;nvidia&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/paul-kedrosky"&gt;paul-kedrosky&lt;/a&gt;&lt;/p&gt;



</summary><category term="economics"/><category term="ai"/><category term="nvidia"/><category term="ai-ethics"/><category term="paul-kedrosky"/></entry><entry><title>Quoting Ben Thompson</title><link href="https://simonwillison.net/2025/Jan/28/ben-thompson/#atom-tag" rel="alternate"/><published>2025-01-28T02:38:49+00:00</published><updated>2025-01-28T02:38:49+00:00</updated><id>https://simonwillison.net/2025/Jan/28/ben-thompson/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://stratechery.com/2025/deepseek-faq/"&gt;&lt;p&gt;H100s were prohibited by the chip ban, but not H800s. Everyone assumed that training leading edge models required more interchip memory bandwidth, but that is exactly what DeepSeek optimized both their model structure and infrastructure around.&lt;/p&gt;
&lt;p&gt;Again, just to emphasize this point, all of the decisions DeepSeek made in the design of this model only make sense if you are constrained to the H800; if DeepSeek had access to H100s, they probably would have used a larger training cluster with much fewer optimizations specifically focused on overcoming the lack of bandwidth.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://stratechery.com/2025/deepseek-faq/"&gt;Ben Thompson&lt;/a&gt;, DeepSeek FAQ&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nvidia"&gt;nvidia&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpus"&gt;gpus&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deepseek"&gt;deepseek&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="nvidia"/><category term="gpus"/><category term="deepseek"/><category term="ai-in-china"/></entry><entry><title>The impact of competition and DeepSeek on Nvidia</title><link href="https://simonwillison.net/2025/Jan/27/deepseek-nvidia/#atom-tag" rel="alternate"/><published>2025-01-27T01:55:51+00:00</published><updated>2025-01-27T01:55:51+00:00</updated><id>https://simonwillison.net/2025/Jan/27/deepseek-nvidia/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://youtubetranscriptoptimizer.com/blog/05_the_short_case_for_nvda"&gt;The impact of competition and DeepSeek on Nvidia&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Long, excellent piece by Jeffrey Emanuel capturing the current state of the AI/LLM industry. The original title is "The Short Case for Nvidia Stock" - I'm using the Hacker News alternative title here, but even that I feel under-sells this essay.&lt;/p&gt;
&lt;p&gt;Jeffrey has a rare combination of experience in both computer science and investment analysis. He combines both worlds here, evaluating NVIDIA's challenges by providing deep insight into a whole host of relevant and interesting topics.&lt;/p&gt;
&lt;p&gt;As Jeffrey describes it, NVIDIA's moat has four components: high-quality Linux drivers, CUDA as an industry standard, the fast GPU interconnect technology they acquired from &lt;a href="https://en.wikipedia.org/wiki/Mellanox_Technologies"&gt;Mellanox&lt;/a&gt; in 2019 and the flywheel effect where they can invest their enormous profits (75-90% margin in some cases!) into more R&amp;amp;D.&lt;/p&gt;
&lt;p&gt;Each of these is under threat.&lt;/p&gt;
&lt;p&gt;Technologies like &lt;a href="https://simonwillison.net/tags/mlx/"&gt;MLX&lt;/a&gt;, Triton and JAX are undermining the CUDA advantage by making it easier for ML developers to target multiple backends - plus LLMs themselves are getting capable enough to help port things to alternative architectures.&lt;/p&gt;
&lt;p&gt;GPU interconnect helps multiple GPUs work together on tasks like model training. Companies like Cerebras are developing &lt;a href="https://simonwillison.net/2025/Jan/16/cerebras-yield-problem/"&gt;enormous chips&lt;/a&gt; that can get way more done on a single chip.&lt;/p&gt;
&lt;p&gt;Those 75-90% margins provide a huge incentive for other companies to catch up - including the customers who spend the most on NVIDIA at the moment - Microsoft, Amazon, Meta, Google, Apple - all of whom have their own internal silicon projects:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Now, it's no secret that there is a strong power law distribution of Nvidia's hyper-scaler customer base, with the top handful of customers representing the lion's share of high-margin revenue. How should one think about the future of this business when literally every single one of these VIP customers is building their own custom chips specifically for AI training and inference?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The real joy of this article is the way it describes technical details of modern LLMs in a relatively accessible manner. I love this description of the inference-scaling tricks used by O1 and R1, compared to traditional transformers:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Basically, the way Transformers work in terms of predicting the next token at each step is that, if they start out on a bad "path" in their initial response, they become almost like a prevaricating child who tries to spin a yarn about why they are actually correct, even if they should have realized mid-stream using common sense that what they are saying couldn't possibly be correct.&lt;/p&gt;
&lt;p&gt;Because the models are always seeking to be internally consistent and to have each successive generated token flow naturally from the preceding tokens and context, it's very hard for them to course-correct and backtrack. By breaking the inference process into what is effectively many intermediate stages, they can try lots of different things and see what's working and keep trying to course-correct and try other approaches until they can reach a fairly high threshold of confidence that they aren't talking nonsense.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The last quarter of the article talks about the seismic waves rocking the industry right now caused by &lt;a href="https://simonwillison.net/tags/deepseek/"&gt;DeepSeek&lt;/a&gt; v3 and R1. v3 remains the top-ranked open weights model, despite being around 45x more efficient in training than its competition: bad news if you are selling GPUs! R1 represents another huge breakthrough in efficiency both for training and for inference - the DeepSeek R1 API is currently 27x cheaper than OpenAI's o1, for a similar level of quality.&lt;/p&gt;
&lt;p&gt;Jeffrey summarized some of the key ideas from the &lt;a href="https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf"&gt;v3 paper&lt;/a&gt; like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A major innovation is their sophisticated mixed-precision training framework that lets them use 8-bit floating point numbers (FP8) throughout the entire training process. [...]&lt;/p&gt;
&lt;p&gt;DeepSeek cracked this problem by developing a clever system that breaks numbers into small tiles for activations and blocks for weights, and strategically uses high-precision calculations at key points in the network. Unlike other labs that train in high precision and then compress later (losing some quality in the process), DeepSeek's native FP8 approach means they get the massive memory savings without compromising performance. When you're training across thousands of GPUs, this dramatic reduction in memory requirements per GPU translates into needing far fewer GPUs overall.&lt;/p&gt;
&lt;/blockquote&gt;
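&lt;p&gt;The per-block scaling idea is easy to sketch. This is my own toy simulation - a crude signed 8-bit integer grid standing in for FP8, with invented weights - not DeepSeek's actual scheme:&lt;/p&gt;

```python
import random

# Sketch of per-block quantization scales - the general idea behind
# tile/block-wise FP8 - simulated with a crude signed 8-bit grid
# rather than real FP8 hardware, purely for illustration.
random.seed(0)

def quantize_dequantize(values, levels=127):
    # One shared scale per block, chosen from the block's absolute max.
    scale = (max(abs(v) for v in values) / levels) or 1.0
    return [round(v / scale) * scale for v in values]

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Typical small weights plus one rare outlier, as in real networks.
weights = [random.gauss(0, 0.02) for _ in range(256)]
weights[7] = 3.0

# One scale for the whole tensor: the outlier stretches the grid and
# ordinary weights land on only a couple of quantization steps.
global_err = mean_abs_error(weights, quantize_dequantize(weights))

# One scale per 64-value block: only the outlier's block pays the price.
blocks = [weights[i:i + 64] for i in range(0, len(weights), 64)]
dequantized = [v for block in blocks for v in quantize_dequantize(block)]
block_err = mean_abs_error(weights, dequantized)

print(f"mean abs error - per-tensor: {global_err:.5f}, per-block: {block_err:.5f}")
```

&lt;p&gt;A single per-tensor scale lets one outlier stretch the grid for every weight; per-block scales confine that damage to the outlier's own block.&lt;/p&gt;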
&lt;p&gt;Then for &lt;a href="https://arxiv.org/abs/2501.12948"&gt;R1&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;With R1, DeepSeek essentially cracked one of the holy grails of AI: getting models to reason step-by-step without relying on massive supervised datasets. Their DeepSeek-R1-Zero experiment showed something remarkable: using pure reinforcement learning with carefully crafted reward functions, they managed to get models to develop sophisticated reasoning capabilities completely autonomously. This wasn't just about solving problems— the model organically learned to generate long chains of thought, self-verify its work, and allocate more computation time to harder problems.&lt;/p&gt;
&lt;p&gt;The technical breakthrough here was their novel approach to reward modeling. Rather than using complex neural reward models that can lead to "reward hacking" (where the model finds bogus ways to boost their rewards that don't actually lead to better real-world model performance), they developed a clever rule-based system that combines accuracy rewards (verifying final answers) with format rewards (encouraging structured thinking). This simpler approach turned out to be more robust and scalable than the process-based reward models that others have tried.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This article is packed with insights like that - it's worth spending the time absorbing the whole thing.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=42822162"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nvidia"&gt;nvidia&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cerebras"&gt;cerebras&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deepseek"&gt;deepseek&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="nvidia"/><category term="mlx"/><category term="cerebras"/><category term="llm-reasoning"/><category term="deepseek"/><category term="ai-in-china"/></entry><entry><title>Generative AI – The Power and the Glory</title><link href="https://simonwillison.net/2025/Jan/12/generative-ai-the-power-and-the-glory/#atom-tag" rel="alternate"/><published>2025-01-12T01:51:46+00:00</published><updated>2025-01-12T01:51:46+00:00</updated><id>https://simonwillison.net/2025/Jan/12/generative-ai-the-power-and-the-glory/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://about.bnef.com/blog/liebreich-generative-ai-the-power-and-the-glory/"&gt;Generative AI – The Power and the Glory&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Michael Liebreich's epic report for BloombergNEF on the current state of play with regards to generative AI, energy usage and data center growth.&lt;/p&gt;
&lt;p&gt;I learned &lt;em&gt;so much&lt;/em&gt; from reading this. If you're at all interested in the energy impact of the latest wave of AI tools I recommend spending some time with this article.&lt;/p&gt;
&lt;p&gt;Just a few of the points that stood out to me:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;This isn't the first time a leap in data center power use has been predicted. In 2007 the EPA predicted data center energy usage would double: it didn't, thanks to efficiency gains from better servers and the shift from in-house to cloud hosting. In 2017 the WEF predicted cryptocurrency could consume &lt;em&gt;all&lt;/em&gt; the world's electric power by 2020, which was cut short by the first crypto bubble burst. Is this time different? &lt;em&gt;Maybe&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Michael re-iterates (Sequoia) David Cahn's &lt;a href="https://www.sequoiacap.com/article/ais-600b-question/"&gt;$600B question&lt;/a&gt;, pointing out that if the anticipated infrastructure spend on AI requires $600bn in annual revenue that means 1 billion people will need to spend $600/year or 100 million intensive users will need to spend $6,000/year.&lt;/li&gt;
&lt;li&gt;Existing data centers often have a power capacity of less than 10MW, but new AI-training focused data centers tend to be in the 75-150MW range, due to the need to colocate vast numbers of GPUs for efficient communication between them - these can at least be located anywhere in the world. Inference is a lot less demanding as the GPUs don't need to collaborate in the same way, but it needs to be close to human population centers to provide low latency responses.&lt;/li&gt;
&lt;li&gt;NVIDIA are claiming huge efficiency gains. "Nvidia claims to have delivered a 45,000x improvement in energy efficiency per token (a unit of data processed by AI models) over the past eight years" - and training a 1.8 trillion-parameter model using Blackwell GPUs required only 4MW, versus 15MW using the previous Hopper architecture.&lt;/li&gt;
&lt;li&gt;Michael's own global estimate is "45GW of additional demand by 2030", which he points out is "equivalent to one third of the power demand from the world’s aluminum smelters". But much of this demand needs to be local, which makes things a lot more challenging, especially given the need to integrate with the existing grid.&lt;/li&gt;
&lt;li&gt;Google, Microsoft, Meta and Amazon all have net-zero emission targets which they take very seriously, making them "some of the most significant corporate purchasers of renewable energy in the world". This helps explain why they're taking very real interest in nuclear power.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Elon's 100,000-GPU data center in Memphis currently runs on gas:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;When Elon Musk rushed to get x.AI's Memphis Supercluster up and running in record time, he brought in 14 mobile &lt;a href="https://www.npr.org/2024/09/11/nx-s1-5088134/elon-musk-ai-xai-supercomputer-memphis-pollution"&gt;natural gas-powered generators&lt;/a&gt;, each of them generating 2.5MW. It seems they do not require an air quality permit, as long as they do not remain in the same location for more than 364 days.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Here's a reassuring statistic: "91% of all new power capacity added worldwide in 2023 was wind and solar".&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
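The $600bn arithmetic in that second bullet is easy to sanity-check with a trivial back-of-envelope script (the figures are from David Cahn's piece as restated above):

```python
# Sanity-checking the $600bn-in-annual-revenue arithmetic.
required_revenue = 600e9  # $600bn/year of AI revenue

for users, label in [(1e9, "1 billion people"),
                     (100e6, "100 million intensive users")]:
    # Revenue needed per user per year to hit the target.
    print(f"{label}: ${required_revenue / users:,.0f}/year each")
```

This prints $600/year for a billion people and $6,000/year for 100 million intensive users, matching the figures above.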
&lt;p&gt;There's so much more in there, I feel like I'm doing the article a disservice by attempting to extract just the points above.&lt;/p&gt;
&lt;p&gt;Michael's conclusion is somewhat optimistic:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In the end, the tech titans will find out that the best way to power AI data centers is in the traditional way, by building the same generating technologies as are proving most cost effective for other users, connecting them to a robust and resilient grid, and working with local communities. [...]&lt;/p&gt;
&lt;p&gt;When it comes to new technologies – be it SMRs, fusion, novel renewables or superconducting transmission lines – it is a blessing to have some cash-rich, technologically advanced, risk-tolerant players creating demand, which has for decades been missing in low-growth developed world power markets.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;(&lt;a href="https://en.wikipedia.org/wiki/Bloomberg_L.P.#New_Energy_Finance"&gt;BloombergNEF&lt;/a&gt; is an energy research group acquired by Bloomberg in 2009, originally founded by Michael as New Energy Finance in 2004.)&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://bsky.app/profile/mtth.org/post/3lfitoklmms2g"&gt;Jamie Matthews&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/energy"&gt;energy&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ethics"&gt;ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nvidia"&gt;nvidia&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-energy-usage"&gt;ai-energy-usage&lt;/a&gt;&lt;/p&gt;



</summary><category term="energy"/><category term="ethics"/><category term="ai"/><category term="generative-ai"/><category term="nvidia"/><category term="ai-ethics"/><category term="ai-energy-usage"/></entry><entry><title>Leaked Documents Show Nvidia Scraping ‘A Human Lifetime’ of Videos Per Day to Train AI</title><link href="https://simonwillison.net/2024/Aug/5/nvidia-scraping-videos/#atom-tag" rel="alternate"/><published>2024-08-05T17:19:36+00:00</published><updated>2024-08-05T17:19:36+00:00</updated><id>https://simonwillison.net/2024/Aug/5/nvidia-scraping-videos/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.404media.co/nvidia-ai-scraping-foundational-model-cosmos-project/"&gt;Leaked Documents Show Nvidia Scraping ‘A Human Lifetime’ of Videos Per Day to Train AI&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Samantha Cole at 404 Media reports on a huge leak of internal NVIDIA communications - mainly from a Slack channel - revealing details of how they have been collecting video training data for a new video foundation model called Cosmos. The data is mostly from YouTube, downloaded via &lt;code&gt;yt-dlp&lt;/code&gt; using a rotating set of AWS IP addresses and consisting of millions (maybe even hundreds of millions) of videos.&lt;/p&gt;
&lt;p&gt;The fact that companies scrape unlicensed data to train models isn't at all surprising. This article still provides a fascinating insight into what model training teams care about, with details like this from a project update via email:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;As we measure against our desired distribution focus for the next week remains on cinematic, drone footage, egocentric, some travel and nature.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Or this from Slack:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Movies are actually a good source of data to get gaming-like 3D consistency and fictional content but much higher quality.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;My intuition here is that the backlash against scraped video data will be even more intense than for static images used to train generative image models. Video is generally more expensive to create, and video creators (such as Marques Brownlee / MKBHD, who is mentioned in a Slack message here as a potential source of "tech product neviews - super high quality") have a lot of influence.&lt;/p&gt;
&lt;p&gt;There was &lt;a href="https://simonwillison.net/2024/Jul/18/youtube-captions/"&gt;considerable uproar&lt;/a&gt; a few weeks ago over &lt;a href="https://www.proofnews.org/apple-nvidia-anthropic-used-thousands-of-swiped-youtube-videos-to-train-ai/"&gt;this story&lt;/a&gt; about training against just &lt;em&gt;captions&lt;/em&gt; scraped from YouTube, and now we have a much bigger story involving the actual video content itself.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ethics"&gt;ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/slack"&gt;slack&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nvidia"&gt;nvidia&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/training-data"&gt;training-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;&lt;/p&gt;



</summary><category term="ethics"/><category term="ai"/><category term="slack"/><category term="generative-ai"/><category term="nvidia"/><category term="training-data"/><category term="ai-ethics"/></entry><entry><title>Quoting Amir Efrati and Aaron Holmes</title><link href="https://simonwillison.net/2024/Jul/25/amir-efrati-and-aaron-holmes/#atom-tag" rel="alternate"/><published>2024-07-25T21:35:52+00:00</published><updated>2024-07-25T21:35:52+00:00</updated><id>https://simonwillison.net/2024/Jul/25/amir-efrati-and-aaron-holmes/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://www.theinformation.com/articles/why-openai-could-lose-5-billion-this-year"&gt;&lt;p&gt;Our estimate of OpenAI’s $4 billion in inference costs comes from a person with knowledge of the cluster of servers OpenAI rents from Microsoft. That cluster has the equivalent of 350,000 Nvidia A100 chips, this person said. About 290,000 of those chips, or more than 80% of the cluster, were powering ChatGPT, this person said.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://www.theinformation.com/articles/why-openai-could-lose-5-billion-this-year"&gt;Amir Efrati and Aaron Holmes&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatgpt"&gt;chatgpt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nvidia"&gt;nvidia&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="chatgpt"/><category term="llms"/><category term="nvidia"/></entry><entry><title>Apple, Nvidia, Anthropic Used Thousands of Swiped YouTube Videos to Train AI</title><link href="https://simonwillison.net/2024/Jul/18/youtube-captions/#atom-tag" rel="alternate"/><published>2024-07-18T16:22:40+00:00</published><updated>2024-07-18T16:22:40+00:00</updated><id>https://simonwillison.net/2024/Jul/18/youtube-captions/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.proofnews.org/apple-nvidia-anthropic-used-thousands-of-swiped-youtube-videos-to-train-ai/"&gt;Apple, Nvidia, Anthropic Used Thousands of Swiped YouTube Videos to Train AI&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This article has been getting a lot of attention over the past couple of days.&lt;/p&gt;
&lt;p&gt;The story itself is nothing new: &lt;a href="https://pile.eleuther.ai/"&gt;the Pile&lt;/a&gt; is four years old now, and has been widely used for training LLMs since before anyone even cared what an LLM was. It turns out one of the components of the Pile is a set of ~170,000 YouTube video captions (just the captions, not the actual video) and this story by Annie Gilbertson and Alex Reisner highlights that and interviews some of the creators who were included in the data, as well as providing a &lt;a href="https://www.proofnews.org/youtube-ai-search/"&gt;search tool&lt;/a&gt; for seeing if a specific creator has content that was included.&lt;/p&gt;
&lt;p&gt;What's notable is the response. Marques Brownlee (19m subscribers) &lt;a href="https://www.youtube.com/watch?v=xiJMjTnlxg4"&gt;posted a video about it&lt;/a&gt;. Abigail Thorn (&lt;a href="https://www.youtube.com/user/thephilosophytube"&gt;Philosophy Tube&lt;/a&gt;, 1.57m subscribers) &lt;a href="https://twitter.com/PhilosophyTube/status/1813227210569920685"&gt;tweeted this&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Very sad to have to say this - an AI company called EleutherAI stole tens of thousands of YouTube videos - including many of mine. I’m one of the creators Proof News spoke to. The stolen data was sold to Apple, Nvidia, and other companies to build AI&lt;/p&gt;
&lt;p&gt;When I was told about this I lay on the floor and cried, it’s so violating, it made me want to quit writing forever. The reason I got back up was because I know my audience come to my show for real connection and ideas, not cheapfake AI garbage, and I know they’ll stay with me&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Framing the data as "sold to Apple..." is a slight misrepresentation here - EleutherAI have been giving the Pile away for free since 2020. It's a good illustration of the emotional impact here though: many creative people &lt;em&gt;do not want&lt;/em&gt; their work used in this way, especially without their permission.&lt;/p&gt;
&lt;p&gt;It's interesting seeing how attitudes to this stuff change over time. Four years ago the fact that a bunch of academic researchers were sharing and training models using 170,000 YouTube subtitles would likely not have caught any attention at all. Today, people care!&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ethics"&gt;ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/youtube"&gt;youtube&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nvidia"&gt;nvidia&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/training-data"&gt;training-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;&lt;/p&gt;



</summary><category term="ethics"/><category term="youtube"/><category term="ai"/><category term="llms"/><category term="nvidia"/><category term="training-data"/><category term="ai-ethics"/></entry><entry><title>GPUs Go Brrr</title><link href="https://simonwillison.net/2024/May/13/gpus-go-brrr/#atom-tag" rel="alternate"/><published>2024-05-13T04:08:46+00:00</published><updated>2024-05-13T04:08:46+00:00</updated><id>https://simonwillison.net/2024/May/13/gpus-go-brrr/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://hazyresearch.stanford.edu/blog/2024-05-12-tk"&gt;GPUs Go Brrr&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Fascinating, detailed low-level notes on how to get the most out of NVIDIA's H100 GPUs (currently selling for around $40,000 a piece) from the research team at Stanford who created FlashAttention, among other things.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The swizzled memory layouts are flat-out incorrectly documented, which took considerable time for us to figure out.&lt;/p&gt;
&lt;/blockquote&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=40337936"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/stanford"&gt;stanford&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nvidia"&gt;nvidia&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpus"&gt;gpus&lt;/a&gt;&lt;/p&gt;



</summary><category term="stanford"/><category term="ai"/><category term="nvidia"/><category term="gpus"/></entry><entry><title>GPUs on Fly.io are available to everyone!</title><link href="https://simonwillison.net/2024/Feb/14/gpus-on-flyio-are-available-to-everyone/#atom-tag" rel="alternate"/><published>2024-02-14T04:28:23+00:00</published><updated>2024-02-14T04:28:23+00:00</updated><id>https://simonwillison.net/2024/Feb/14/gpus-on-flyio-are-available-to-everyone/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://fly.io/blog/gpu-ga/"&gt;GPUs on Fly.io are available to everyone!&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We’ve been experimenting with GPUs on Fly for a few months for Datasette Cloud. They’re well documented and quite easy to use—any example Python code you find that uses NVIDIA CUDA stuff generally Just Works. Most interestingly of all, Fly GPUs can scale to zero—so while they cost $2.50/hr for an A100 40G (VRAM) and $3.50/hr for an A100 80G you can configure them to stop running when the machine runs out of things to do.&lt;/p&gt;

&lt;p&gt;We’ve successfully used them to run Whisper and to experiment with running various Llama 2 LLMs as well.&lt;/p&gt;

&lt;p&gt;To look forward to: “We are working on getting some lower-cost A10 GPUs in the next few weeks”.&lt;/p&gt;
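Scale-to-zero is what makes those hourly prices workable for small projects. A rough illustration using the $2.50/hr A100 40G rate quoted above and an invented workload that is only busy two hours a day:

```python
HOURS_PER_MONTH = 30 * 24  # 720 hours in a 30-day month

def monthly_cost(rate_per_hr, active_hours_per_day, scale_to_zero):
    """Bill for the whole month, or only for active hours if the
    machine stops running when it has nothing to do."""
    hours = active_hours_per_day * 30 if scale_to_zero else HOURS_PER_MONTH
    return rate_per_hr * hours

print(monthly_cost(2.50, 2, scale_to_zero=False))  # 1800.0 - always on
print(monthly_cost(2.50, 2, scale_to_zero=True))   # 150.0 - scale to zero
```

The two-hours-a-day workload is an assumption for the sake of the comparison; the point is that scale-to-zero turns a fixed monthly bill into one proportional to actual use.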


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette-cloud"&gt;datasette-cloud&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fly"&gt;fly&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/whisper"&gt;whisper&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nvidia"&gt;nvidia&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpus"&gt;gpus&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="datasette-cloud"/><category term="fly"/><category term="generative-ai"/><category term="whisper"/><category term="llms"/><category term="nvidia"/><category term="gpus"/></entry><entry><title>A Hackers' Guide to Language Models</title><link href="https://simonwillison.net/2023/Sep/25/a-hackers-guide-to-language-models/#atom-tag" rel="alternate"/><published>2023-09-25T00:24:50+00:00</published><updated>2023-09-25T00:24:50+00:00</updated><id>https://simonwillison.net/2023/Sep/25/a-hackers-guide-to-language-models/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=jkrNMKz9pWU"&gt;A Hackers&amp;#x27; Guide to Language Models&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Jeremy Howard’s new 1.5 hour YouTube introduction to language models looks like a really useful place to catch up if you’re an experienced Python programmer looking to start experimenting with LLMs. He covers what they are and how they work, then shows how to build against the OpenAI API, build a Code Interpreter clone using OpenAI functions, run models from Hugging Face on your own machine (with NVIDIA cards or on a Mac) and finishes with a demo of fine-tuning a Llama 2 model to perform text-to-SQL using an open dataset.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jeremy-howard"&gt;jeremy-howard&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fine-tuning"&gt;fine-tuning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nvidia"&gt;nvidia&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llama"/><category term="llms"/><category term="jeremy-howard"/><category term="fine-tuning"/><category term="nvidia"/></entry></feed>