<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: ocr</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/ocr.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2026-04-23T21:54:24+00:00</updated><author><name>Simon Willison</name></author><entry><title>Extract PDF text in your browser with LiteParse for the web</title><link href="https://simonwillison.net/2026/Apr/23/liteparse-for-the-web/#atom-tag" rel="alternate"/><published>2026-04-23T21:54:24+00:00</published><updated>2026-04-23T21:54:24+00:00</updated><id>https://simonwillison.net/2026/Apr/23/liteparse-for-the-web/#atom-tag</id><summary type="html">
    &lt;p&gt;LlamaIndex have a most excellent open source project called &lt;a href="https://github.com/run-llama/liteparse"&gt;LiteParse&lt;/a&gt;, which provides a Node.js CLI tool for extracting text from PDFs. I got a version of LiteParse working entirely in the browser, using most of the same libraries that LiteParse uses to run in Node.js.&lt;/p&gt;
&lt;h4 id="spatial-text-parsing"&gt;Spatial text parsing&lt;/h4&gt;
&lt;p&gt;Refreshingly, LiteParse doesn't use AI models to do what it does: it's good old-fashioned PDF parsing, falling back to Tesseract OCR (or other pluggable OCR engines) for PDFs that contain images of text rather than the text itself.&lt;/p&gt;
&lt;p&gt;The hard problem that LiteParse solves is extracting text in a sensible order despite the infuriating vagaries of PDF layouts. They describe this as "spatial text parsing" - they use some very clever heuristics to detect things like multi-column layouts and group and return the text in a sensible linear flow.&lt;/p&gt;
&lt;p&gt;The LiteParse documentation describes a pattern for implementing &lt;a href="https://developers.llamaindex.ai/liteparse/guides/visual-citations/"&gt;Visual Citations with Bounding Boxes&lt;/a&gt;. I really like this idea: being able to answer questions from a PDF and accompany those answers with cropped, highlighted images feels like a great way of increasing the credibility of answers from RAG-style Q&amp;amp;A.&lt;/p&gt;
&lt;p&gt;LiteParse is provided as a pure CLI tool, designed to be used by agents. You run it like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;npm i -g @llamaindex/liteparse
lit parse document.pdf
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I &lt;a href="https://claude.ai/share/44a5ed86-e5b5-4e14-90be-1eba1e0acd13"&gt;explored its capabilities with Claude&lt;/a&gt; and quickly determined that there was no real reason it had to stay a CLI app: it's built on top of PDF.js and Tesseract.js, two libraries I've used for something similar in a browser &lt;a href="https://simonwillison.net/2024/Mar/30/ocr-pdfs-images/"&gt;in the past&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The only reason LiteParse didn't have a pure browser-based version is that nobody had built one yet...&lt;/p&gt;
&lt;h4 id="introducing-liteparse-for-the-web"&gt;Introducing LiteParse for the web&lt;/h4&gt;
&lt;p&gt;Visit &lt;a href="https://simonw.github.io/liteparse/"&gt;https://simonw.github.io/liteparse/&lt;/a&gt; to try out LiteParse against any PDF file, running entirely in your browser. Here's what that looks like:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/liteparse-web.jpg" alt="Screenshot of the LiteParse browser demo web page. Header reads &amp;quot;LiteParse&amp;quot; with subtitle &amp;quot;Browser demo of LiteParse — parse PDFs in your browser. Nothing leaves your machine.&amp;quot; A dashed-border drop zone says &amp;quot;Drop a PDF here or click to choose / Your file stays in your browser.&amp;quot; with a file pill labeled &amp;quot;19720005243.pdf&amp;quot;. Below are a checked &amp;quot;Run OCR&amp;quot; checkbox, an unchecked &amp;quot;Render page screenshots&amp;quot; checkbox, and a blue &amp;quot;Parse&amp;quot; button. Status text: &amp;quot;Parsed 86 pages.&amp;quot; Two side-by-side panels follow. Left panel titled &amp;quot;Text&amp;quot; with a Copy button shows monospace extracted text beginning &amp;quot;Apollo 5 was an unmanned system, both propulsion systems ascent and descent stages&amp;quot;. Right panel titled &amp;quot;JSON&amp;quot;, also with a copy button, contains JSON showing the dimensions and position and detected font of each piece of text." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;The tool can work with or without running OCR, and can optionally display images for every page in the PDF further down the page.&lt;/p&gt;
&lt;h4 id="building-it-with-claude-code-and-opus-4-7"&gt;Building it with Claude Code and Opus 4.7&lt;/h4&gt;
&lt;p&gt;The process of building this started in the regular Claude app on my iPhone. I wanted to try out LiteParse myself, so I started by uploading a random PDF I happened to have on my phone along with this prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Clone https://github.com/run-llama/liteparse and try it against this file&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Regular Claude chat can clone directly from GitHub these days, and while by default it can't access most of the internet from its container, it can also install packages from PyPI and npm.&lt;/p&gt;
&lt;p&gt;I often use this to try out new pieces of open source software on my phone - it's a quick way to exercise something without having to sit down with my laptop.&lt;/p&gt;
&lt;p&gt;You can follow my full conversation in &lt;a href="https://claude.ai/share/44a5ed86-e5b5-4e14-90be-1eba1e0acd13"&gt;this shared Claude transcript&lt;/a&gt;. I asked a few follow-up questions about how it worked, and then asked:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Does this library run in a browser? Could it?&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This gave me a thorough enough answer that I was convinced it was worth trying to get that working for real. I opened up my laptop and switched to Claude Code.&lt;/p&gt;
&lt;p&gt;I forked the original repo on GitHub, cloned a local copy, started a new &lt;code&gt;web&lt;/code&gt; branch and pasted that last reply from Claude into a new file called &lt;a href="https://github.com/simonw/liteparse/blob/web/notes.md"&gt;notes.md&lt;/a&gt;. Then I told Claude Code:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Get this working as a web app. index.html, when loaded, should render an app that lets users open a PDF in their browser and select OCR or non-OCR mode and have this run. Read notes.md for initial research on this problem, then write out plan.md with your detailed implementation plan&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I always like to start with a plan for this kind of project. Sometimes I'll use Claude's "planning mode", but in this case I knew I'd want the plan as an artifact in the repository so I told it to write &lt;code&gt;plan.md&lt;/code&gt; directly.&lt;/p&gt;
&lt;p&gt;This also means I can iterate on the plan with Claude. I noticed that Claude had decided to punt on generating screenshots of images in the PDF, and suggested we defer a "canvas-encode swap" to v2. I fixed that by prompting:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Update the plan to say we WILL do the canvas-encode swap so the screenshots thing works&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;After a few short follow-up prompts, here's the &lt;a href="https://github.com/simonw/liteparse/blob/web/plan.md"&gt;plan.md&lt;/a&gt; I thought was strong enough to implement.&lt;/p&gt;
&lt;p&gt;I prompted:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;build it.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And then mostly left Claude Code to its own devices, tinkered with some other projects, caught up on Duolingo and occasionally checked in to see how it was doing.&lt;/p&gt;
&lt;p&gt;I added a few prompts to the queue as I was working. Those don't yet show up in my exported transcript, but it turns out running &lt;code&gt;rg queue-operation --no-filename | grep enqueue | jq -r '.content'&lt;/code&gt; in the relevant &lt;code&gt;~/.claude/projects/&lt;/code&gt; folder extracts them.&lt;/p&gt;
&lt;p&gt;Here are the key follow-up prompts with some notes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;When you implement this use playwright and red/green TDD, plan that too&lt;/code&gt; - I've written more &lt;a href="https://simonwillison.net/guides/agentic-engineering-patterns/red-green-tdd/"&gt;about red/green TDD here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;let's use PDF.js's own renderer&lt;/code&gt; (it was messing around with pdfium)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;The final UI should include both the text and the pretty-printed JSON output, both of those in textareas and both with copy-to-clipboard buttons - it should also be mobile friendly&lt;/code&gt; - I had a new idea for how the UI should work&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;small commits along the way&lt;/code&gt; - see below&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Make sure the index.html page includes a link back to https://github.com/run-llama/liteparse near the top of the page&lt;/code&gt; - it's important to credit your dependencies in a project like this!&lt;/li&gt;
&lt;li&gt;&lt;code&gt;View on GitHub → is bad copy because that's not the repo with this web app in, it's the web app for the underlying LiteParse library&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Run OCR should be unchecked by default&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;When I try to parse a PDF in my browser I see 'Parse failed: undefined is not a function (near '...value of readableStream...')&lt;/code&gt; - it was testing with Playwright in Chrome, turned out there was a bug in Safari&lt;/li&gt;
&lt;li&gt;&lt;code&gt;... oh that is in safari but it works in chrome&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;When "Copy" is clicked the text should change to "Copied!" for 1.5s&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;[Image #1] Style the file input so that long filenames don't break things on Firefox like this - in fact add one of those drag-drop zone UIs which you can also click to select a file&lt;/code&gt; - dropping screenshots in of small UI glitches works surprisingly well&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Tweak the drop zone such that the text is vertically centered, right now it is a bit closer to the top&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;it breaks in Safari on macOS, works in both Chrome and Firefox. On Safari I see "Parse failed: undefined is not a function (near '...value of readableStream...')" after I click the Parse button, when OCR is not checked&lt;/code&gt; - it still wasn't working in Safari...&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;works in safari now&lt;/code&gt; - but it fixed it pretty quickly once I pointed that out and it got Playwright working with that browser&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I've started habitually asking for "small commits along the way" because it makes for code that's easier to understand or review later on, and I have an unproven hunch that it helps the agent work more effectively too - it's yet another encouragement towards planning and taking on one problem at a time.&lt;/p&gt;
&lt;p&gt;While it was working I decided it would be nice to be able to interact with an in-progress version. I asked a separate Claude Code session against the same directory for tips on how to run it, and it told me to use &lt;code&gt;npx vite&lt;/code&gt;. Running that started a development server with live-reloading, which meant I could instantly see the effect of each change it made on disk - and prompt with further requests for tweaks and fixes.&lt;/p&gt;
&lt;p&gt;Towards the end I decided it was going to be good enough to publish. I started a fresh Claude Code instance and told it:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Look at the web/ folder - set up GitHub actions for this repo such that any push runs the tests, and if the tests pass it then does a GitHub Pages deploy of the built vite app such that the web/index.html page is the index.html page for the thing that is deployed and it works on GitHub Pages&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;After a bit more iteration &lt;a href="https://github.com/simonw/liteparse/blob/web/.github/workflows/deploy-web.yml"&gt;here's the GitHub Actions workflow&lt;/a&gt; that builds the app using Vite and deploys the result to &lt;a href="https://simonw.github.io/liteparse/"&gt;https://simonw.github.io/liteparse/&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I love GitHub Pages for this kind of thing because it can be quickly configured (by Claude, in this case) to turn any repository into a deployed web-app, at zero cost and with whatever build step is necessary. It even works against private repos, if you don't mind your only security being a secret URL.&lt;/p&gt;
&lt;p&gt;With this kind of project there's always a major risk that the model might "cheat" - mark key features as "TODO" and fake them, or take shortcuts that ignore the initial requirements.&lt;/p&gt;
&lt;p&gt;The responsible way to prevent this is to review all of the code... but this wasn't intended as that kind of project, so instead I fired up OpenAI Codex with GPT-5.5 (I had preview access) and told it:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Describe the difference between how the node.js CLI tool runs and how the web/ version runs&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The answer I got back was enough to give me confidence that Claude hadn't taken any project-threatening shortcuts.&lt;/p&gt;
&lt;p&gt;... and that was about it. Total time in Claude Code for that "build it" step was 59 minutes. I used my &lt;a href="https://github.com/simonw/claude-code-transcripts"&gt;claude-code-transcripts&lt;/a&gt; tool to export a readable version of the full transcript which you can &lt;a href="https://gisthost.github.io/?d64889bfc1b897fea3867adfec62ed89/index.html"&gt;view here&lt;/a&gt;, albeit without those additional queued prompts (here's my &lt;a href="https://github.com/simonw/claude-code-transcripts/issues/98"&gt;issue to fix that&lt;/a&gt;).&lt;/p&gt;
&lt;h4 id="is-this-even-vibe-coding-any-more-"&gt;Is this even vibe coding any more?&lt;/h4&gt;
&lt;p&gt;I'm a pedantic stickler when it comes to &lt;a href="https://simonwillison.net/2025/Mar/19/vibe-coding/"&gt;the original definition of vibe coding&lt;/a&gt; - vibe coding does &lt;em&gt;not&lt;/em&gt; mean any time you use AI to help you write code; it's when you use AI without reviewing or caring about the code that's written at all.&lt;/p&gt;
&lt;p&gt;By my own definition, this LiteParse for the web project is about as pure vibe coding as you can get! I have not looked at a &lt;em&gt;single line&lt;/em&gt; of the HTML and TypeScript written for this project - in fact while writing this sentence I had to go and check if it had used JavaScript or TypeScript.&lt;/p&gt;
&lt;p&gt;Yet somehow this one doesn't feel as vibe coded to me as many of my other vibe coded projects:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;As a static in-browser web application hosted on GitHub Pages the blast radius for any bugs is almost non-existent: it either works for your PDF or doesn't.&lt;/li&gt;
&lt;li&gt;No private data is transferred anywhere - all processing happens in your browser - so a security audit is unnecessary. I've glanced once at the network panel while it's running and no additional requests are made when a PDF is being parsed.&lt;/li&gt;
&lt;li&gt;There was still a whole lot of engineering experience and knowledge required to use the models in this way. Identifying that LiteParse could be ported to run directly in a browser was critical to the rest of the project.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Most importantly, I'm happy to attach my reputation to this project and recommend that other people try it out. Unlike most of my vibe coded tools I'm not convinced that spending significant additional engineering time on this would have resulted in a meaningfully better initial release. It's fine as it is!&lt;/p&gt;
&lt;p&gt;I haven't opened a PR against the &lt;a href="https://github.com/run-llama/liteparse"&gt;original repository&lt;/a&gt; because I've not discussed it with the LiteParse team. I've &lt;a href="https://github.com/run-llama/liteparse/issues/147"&gt;opened an issue&lt;/a&gt;, and if they want my vibe coded implementation as a starting point for something more official they're welcome to take it.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vibe-coding"&gt;vibe-coding&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-code"&gt;claude-code&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/agentic-engineering"&gt;agentic-engineering&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="javascript"/><category term="ocr"/><category term="pdf"/><category term="projects"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="vibe-coding"/><category term="coding-agents"/><category term="claude-code"/><category term="agentic-engineering"/></entry><entry><title>Getting DeepSeek-OCR working on an NVIDIA Spark via brute force using Claude Code</title><link href="https://simonwillison.net/2025/Oct/20/deepseek-ocr-claude-code/#atom-tag" rel="alternate"/><published>2025-10-20T17:21:52+00:00</published><updated>2025-10-20T17:21:52+00:00</updated><id>https://simonwillison.net/2025/Oct/20/deepseek-ocr-claude-code/#atom-tag</id><summary type="html">
    &lt;p&gt;DeepSeek released a new model yesterday: &lt;a href="https://github.com/deepseek-ai/DeepSeek-OCR"&gt;DeepSeek-OCR&lt;/a&gt;, a 6.6GB model fine-tuned specifically for OCR. They released it as model weights that run using PyTorch and CUDA. I got it running on the NVIDIA Spark by having Claude Code effectively brute force the challenge of getting it working on that particular hardware.&lt;/p&gt;
&lt;p&gt;This small project (40 minutes this morning, most of which was Claude Code churning away while I had breakfast and did some other things) ties together a bunch of different concepts I've been exploring recently. I &lt;a href="https://simonwillison.net/2025/Sep/30/designing-agentic-loops/"&gt;designed an agentic loop&lt;/a&gt; for the problem, gave Claude full permissions inside a Docker sandbox, embraced the &lt;a href="https://simonwillison.net/2025/Oct/5/parallel-coding-agents/"&gt;parallel agents lifestyle&lt;/a&gt; and reused my &lt;a href="https://simonwillison.net/2025/Oct/14/nvidia-dgx-spark/"&gt;notes on the NVIDIA Spark&lt;/a&gt; from last week.&lt;/p&gt;
&lt;p&gt;I knew getting a PyTorch CUDA model running on the Spark was going to be a little frustrating, so I decided to outsource the entire process to Claude Code to see what would happen.&lt;/p&gt;
&lt;p&gt;TLDR: It worked. It took four prompts (one long, three very short) to have Claude Code figure out everything necessary to run the new DeepSeek model on the NVIDIA Spark, OCR a document for me and produce &lt;em&gt;copious&lt;/em&gt; notes about the process.&lt;/p&gt;
&lt;h4 id="the-setup"&gt;The setup&lt;/h4&gt;
&lt;p&gt;I connected to the Spark from my Mac via SSH and started a new Docker container there:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;docker run -it --gpus=all \
  -v /usr/local/cuda:/usr/local/cuda:ro \
  nvcr.io/nvidia/cuda:13.0.1-devel-ubuntu24.04 \
  bash&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then I installed npm and used that to install Claude Code:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;apt-get update
DEBIAN_FRONTEND=noninteractive TZ=Etc/UTC apt-get install -y npm
npm install -g @anthropic-ai/claude-code&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then I started Claude Code, telling it that it's OK that it's running as &lt;code&gt;root&lt;/code&gt; because it's in a sandbox:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;IS_SANDBOX=1 claude --dangerously-skip-permissions&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It provided me a URL to click on to authenticate with my Anthropic account.&lt;/p&gt;
&lt;h4 id="the-initial-prompts"&gt;The initial prompts&lt;/h4&gt;
&lt;p&gt;I kicked things off with this prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Create a folder deepseek-ocr and do everything else in that folder&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Then I ran the following, providing links to both the GitHub repository and the Hugging Face model, providing a clue about NVIDIA ARM and giving it an image (&lt;a href="https://static.simonwillison.net/static/2025/ft.jpeg"&gt;this one&lt;/a&gt;, see &lt;a href="https://simonwillison.net/2025/Aug/29/the-perils-of-vibe-coding/"&gt;previous post&lt;/a&gt;) that I wanted it to run OCR on.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Your task is to get this working: &lt;a href="https://github.com/deepseek-ai/DeepSeek-OCR"&gt;https://github.com/deepseek-ai/DeepSeek-OCR&lt;/a&gt; - it uses Hugging Face Transformers and models from &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-OCR"&gt;https://huggingface.co/deepseek-ai/DeepSeek-OCR&lt;/a&gt; - you are running in a Docker container on an NVIDIA ARM device, first research your environment to check it will be possible to do this. Git clone both of those - you may need to enable LFS first. Make extensive notes as you go in a notes.md file which you should only ever append new notes to. At the end of this you should have scripts in the folder which can be run to both setup and then run the model against an example - use &lt;a href="https://static.simonwillison.net/static/2025/ft.jpeg"&gt;https://static.simonwillison.net/static/2025/ft.jpeg&lt;/a&gt; as the example image that you OCR. As a final step create a detailed README.md file explaining the project and what you learned and how to run your code.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Note how I also told it to "make extensive notes as you go in a notes.md file which you should only ever append new notes to" so I'd have some useful notes at the end.&lt;/p&gt;
&lt;h4 id="the-follow-ups"&gt;The follow-ups&lt;/h4&gt;
&lt;p&gt;It churned away for 13 minutes, and made a sequence of discoveries about the platform and the model and the difficulties of running PyTorch on it.&lt;/p&gt;
&lt;p&gt;And then... &lt;a href="https://github.com/simonw/research/blob/main/deepseek-ocr-nvidia-spark/claude-log.md#-assistant--2025-10-20-154706"&gt;it gave up&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The NVIDIA GB10 GPU has compute capability sm_121, which is too new for PyTorch 2.5.1&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;PyTorch 2.5.1 supports: sm_50, sm_80, sm_86, sm_89, sm_90, sm_90a&lt;/li&gt;
&lt;li&gt;GB10 requires: sm_121&lt;/li&gt;
&lt;li&gt;Error: &lt;code&gt;CUDA error: no kernel image is available for execution on the device&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Additionally, the model has hardcoded &lt;code&gt;.cuda()&lt;/code&gt; calls, preventing CPU fallback without code modifications.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So I gave it some encouragement:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Any other options that might help get CUDA working like using different versions of PyTorch that have available CUDA wheels for this platform? Have you checked to see what PyTorch versions already have these ARM CUDA wheels that would work here?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It ran this command to grep for all available PyTorch wheels on &lt;code&gt;download.pytorch.org&lt;/code&gt; that match the ARM architecture:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;curl -s https://download.pytorch.org/whl/torch/ &lt;span class="pl-k"&gt;|&lt;/span&gt; grep -o &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;torch-[0-9.]*.*aarch64.whl&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; sort -V &lt;span class="pl-k"&gt;|&lt;/span&gt; tail -20&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;... and that unblocked it.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Excellent! I can see that PyTorch 2.9.0 has ARM64 wheels for CUDA 12.8, 12.9, and &lt;strong&gt;CUDA 13.0&lt;/strong&gt; (cu130)! This is exactly what we need!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It installed the wheels it found and reported:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Good news! PyTorch 2.9.0 can detect the GB10 GPU with CUDA capability 12.1 (sm_121), but it says the maximum supported is 12.0. However, it's just a warning now instead of a hard error! Let's try running inference: [...]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It then kept on churning for a bit, ignoring warnings, and managed to &lt;a href="https://github.com/simonw/research/blob/main/deepseek-ocr-nvidia-spark/claude-log.md#-user--2025-10-20-155056"&gt;run the model&lt;/a&gt;! It loaded the image and ran this prompt:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;|grounding|&amp;gt;OCR this image.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And got back output that started like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;|ref|&amp;gt;The perils of vibe coding&amp;lt;|/ref|&amp;gt;&amp;lt;|det|&amp;gt;[[352, 30, 624, 111]]&amp;lt;|/det|&amp;gt;
&amp;lt;|ref|&amp;gt;opt&amp;lt;|/ref|&amp;gt;&amp;lt;|det|&amp;gt;[[970, 0, 994, 30]]&amp;lt;|/det|&amp;gt;
&amp;lt;|ref|&amp;gt;such&amp;lt;|/ref|&amp;gt;&amp;lt;|det|&amp;gt;[[970, 24, 996, 58]]&amp;lt;|/det|&amp;gt;
&amp;lt;|ref|&amp;gt;days&amp;lt;|/ref|&amp;gt;&amp;lt;|det|&amp;gt;[[970, 52, 996, 87]]&amp;lt;|/det|&amp;gt;
&amp;lt;|ref|&amp;gt;pavi&amp;lt;|/ref|&amp;gt;&amp;lt;|det|&amp;gt;[[970, 85, 996, 118]]&amp;lt;|/det|&amp;gt;
&amp;lt;|ref|&amp;gt;TECHNOLOGY&amp;lt;|/ref|&amp;gt;&amp;lt;|det|&amp;gt;[[33, 199, 127, 230]]&amp;lt;|/det|&amp;gt;
&amp;lt;|ref|&amp;gt;holds the promise of replacing program-&amp;lt;|/ref|&amp;gt;&amp;lt;|det|&amp;gt;[[220, 170, 390, 210]]&amp;lt;|/det|&amp;gt;
&amp;lt;|ref|&amp;gt;mers - a profession of very well paid&amp;lt;|/ref|&amp;gt;&amp;lt;|det|&amp;gt;[[220, 199, 390, 238]]&amp;lt;|/det|&amp;gt;
&amp;lt;|ref|&amp;gt;people.For another, the work can be&amp;lt;|/ref|&amp;gt;&amp;lt;|det|&amp;gt;[[220, 228, 390, 267]]&amp;lt;|/det|&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
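&lt;p&gt;Each line of that output pairs a piece of text with a bounding box. Here's a minimal Python sketch of how output in that shape could be parsed. The regular expression is inferred purely from the sample above, and the &lt;code&gt;parse_grounding()&lt;/code&gt; name and the reading of the four numbers as &lt;code&gt;x1, y1, x2, y2&lt;/code&gt; are assumptions for illustration, not something taken from the DeepSeek-OCR code:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import re

# Pattern inferred from the sample output: a text span wrapped in ref tags,
# followed by a single box wrapped in det tags. This is an assumption, the
# real format may allow more variations (for example several boxes).
PATTERN = re.compile(
    r"&amp;lt;\|ref\|&amp;gt;(.*?)&amp;lt;\|/ref\|&amp;gt;"
    r"&amp;lt;\|det\|&amp;gt;\[\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]\]&amp;lt;\|/det\|&amp;gt;"
)


def parse_grounding(output):
    """Return a list of (text, (x1, y1, x2, y2)) tuples (hypothetical helper)."""
    results = []
    for match in PATTERN.finditer(output):
        text = match.group(1)
        # Groups 2-5 are the four integers inside the double brackets
        box = tuple(int(n) for n in match.groups()[1:])
        results.append((text, box))
    return results


sample = (
    "&amp;lt;|ref|&amp;gt;The perils of vibe coding&amp;lt;|/ref|&amp;gt;"
    "&amp;lt;|det|&amp;gt;[[352, 30, 624, 111]]&amp;lt;|/det|&amp;gt;\n"
    "&amp;lt;|ref|&amp;gt;TECHNOLOGY&amp;lt;|/ref|&amp;gt;"
    "&amp;lt;|det|&amp;gt;[[33, 199, 127, 230]]&amp;lt;|/det|&amp;gt;\n"
)

for text, box in parse_grounding(sample):
    print(text, box)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Lines that don't match the pattern are silently skipped, which seems like a reasonable default for a sketch but would hide any format changes in real use.&lt;/p&gt;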
&lt;p&gt;Claude declared victory and pointed me to the &lt;a href="https://github.com/simonw/research/blob/main/deepseek-ocr-nvidia-spark/output/result.mmd?short_path=f43da6a"&gt;output/result.mmd&lt;/a&gt; file, which contained only whitespace. So OCR had worked but the result had failed to be written correctly to disk.&lt;/p&gt;
&lt;p&gt;I &lt;a href="https://github.com/simonw/research/blob/main/deepseek-ocr-nvidia-spark/claude-log.md#-user--2025-10-20-155826"&gt;said&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;That result.mmd file looks to be exclusively whitespace to me, try and get a better result there - the bounding box image was good though&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Which caused Claude to &lt;a href="https://github.com/simonw/research/blob/main/deepseek-ocr-nvidia-spark/claude-log.md#-assistant--2025-10-20-155900"&gt;reconsider&lt;/a&gt; the prompt it used:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Maybe we need to use a different prompt. Looking at the README from DeepSeek-OCR, there are different prompts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;"&amp;lt;image&amp;gt;\n&amp;lt;|grounding|&amp;gt;Convert the document to markdown."&lt;/code&gt; - for documents&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;"&amp;lt;image&amp;gt;\n&amp;lt;|grounding|&amp;gt;OCR this image."&lt;/code&gt; - general OCR&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;"&amp;lt;image&amp;gt;\nFree OCR."&lt;/code&gt; - without layouts&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The "grounding" mode seems to be focused on bounding boxes, which is why we got good bounding box results but the text output might be in a different format.&lt;/p&gt;
&lt;p&gt;Let me try running it again with a different prompt that's more focused on text extraction rather than grounding.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It tried all of the prompt variants, &lt;a href="https://github.com/simonw/research/blob/main/deepseek-ocr-nvidia-spark/notes.md#text-output-improvement-2025-10-20-1601"&gt;made a note about them&lt;/a&gt; and created a file called &lt;a href="https://github.com/simonw/research/blob/main/deepseek-ocr-nvidia-spark/PROMPTS_GUIDE.md"&gt;PROMPTS_GUIDE.md&lt;/a&gt; with a bunch more details, including this comparison table:&lt;/p&gt;
&lt;blockquote&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Prompt&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;th&gt;Text Quality&lt;/th&gt;
&lt;th&gt;Structure&lt;/th&gt;
&lt;th&gt;Coordinates&lt;/th&gt;
&lt;th&gt;Best Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Free OCR&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;⚡⚡⚡ Fast&lt;/td&gt;
&lt;td&gt;⭐⭐⭐ Excellent&lt;/td&gt;
&lt;td&gt;⭐ Basic&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;General OCR&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Markdown&lt;/td&gt;
&lt;td&gt;⚡⚡ Medium&lt;/td&gt;
&lt;td&gt;⭐⭐⭐ Excellent&lt;/td&gt;
&lt;td&gt;⭐⭐⭐ Full&lt;/td&gt;
&lt;td&gt;⭐⭐ Partial&lt;/td&gt;
&lt;td&gt;Documents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grounding&lt;/td&gt;
&lt;td&gt;⚡ Slow&lt;/td&gt;
&lt;td&gt;⭐⭐ Good&lt;/td&gt;
&lt;td&gt;⭐ Basic&lt;/td&gt;
&lt;td&gt;⭐⭐⭐ Full&lt;/td&gt;
&lt;td&gt;Annotations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Detailed&lt;/td&gt;
&lt;td&gt;⚡⚡⚡ Fastest&lt;/td&gt;
&lt;td&gt;⭐ N/A&lt;/td&gt;
&lt;td&gt;❌ N/A&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;td&gt;Image analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/blockquote&gt;
&lt;p&gt;And this table under the heading "Performance Benchmarks":&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Test image: 3503×1668 pixels (Financial Times article)&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Prompt&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;Output Size&lt;/th&gt;
&lt;th&gt;Tokens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Free OCR&lt;/td&gt;
&lt;td&gt;24s&lt;/td&gt;
&lt;td&gt;Clean text&lt;/td&gt;
&lt;td&gt;2257&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Markdown&lt;/td&gt;
&lt;td&gt;39s&lt;/td&gt;
&lt;td&gt;Formatted MD&lt;/td&gt;
&lt;td&gt;2257 + structure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grounding&lt;/td&gt;
&lt;td&gt;58s&lt;/td&gt;
&lt;td&gt;Text + coords&lt;/td&gt;
&lt;td&gt;2257 + boxes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Detailed&lt;/td&gt;
&lt;td&gt;9s&lt;/td&gt;
&lt;td&gt;Description&lt;/td&gt;
&lt;td&gt;~300&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/blockquote&gt;
&lt;p&gt;My final prompt was this, to gather everything together into a zip file I could extract from the Docker container:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Create a zip file with the output and output_text and all of the scripts and notes - but leave out the github repo and the huggingface repo directories&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I added the contents of that zip file to my new &lt;a href="https://github.com/simonw/research"&gt;simonw/research&lt;/a&gt; GitHub repo &lt;a href="https://github.com/simonw/research/tree/main/deepseek-ocr-nvidia-spark"&gt;in the deepseek-ocr-nvidia-spark&lt;/a&gt; folder.&lt;/p&gt;
&lt;p&gt;Claude really likes writing notes! Here's the directory listing of that finished folder:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;  |-- download_test_image.sh
  |-- FINAL_SUMMARY.md
  |-- notes.md
  |-- output
  |   |-- images
  |   |-- result_with_boxes.jpg
  |   `-- result.mmd
  |-- output_text
  |   |-- detailed
  |   |   |-- images
  |   |   |-- result_with_boxes.jpg
  |   |   `-- result.mmd
  |   |-- free_ocr
  |   |   |-- images
  |   |   |-- result_with_boxes.jpg
  |   |   `-- result.mmd
  |   `-- markdown
  |       |-- images
  |       |   `-- 0.jpg
  |       |-- result_with_boxes.jpg
  |       `-- result.mmd
  |-- PROMPTS_GUIDE.md
  |-- README_SUCCESS.md
  |-- README.md
  |-- run_ocr_best.py
  |-- run_ocr_cpu_nocuda.py
  |-- run_ocr_cpu.py
  |-- run_ocr_text_focused.py
  |-- run_ocr.py
  |-- run_ocr.sh
  |-- setup.sh
  |-- SOLUTION.md
  |-- test_image.jpeg
  |-- TEXT_OUTPUT_SUMMARY.md
  `-- UPDATE_PYTORCH.md
&lt;/code&gt;&lt;/pre&gt;
&lt;h4 id="takeaways"&gt;Takeaways&lt;/h4&gt;
&lt;p&gt;My first prompt was at 15:31:07 (UTC). The final message from Claude Code came in at 16:10:03. That means it took less than 40 minutes start to finish, and I was only actively involved for about 5-10 minutes of that time. The rest of the time I was having breakfast and doing other things.&lt;/p&gt;
&lt;p&gt;Having tried and failed to get PyTorch stuff working in the past, I count this as a &lt;em&gt;huge&lt;/em&gt; win. I'll be using this process a whole lot more in the future.&lt;/p&gt;
&lt;p&gt;How good were the actual results? There's honestly so much material in the resulting notes created by Claude that I haven't reviewed all of it. There may well be all sorts of errors in there, but it's indisputable that it managed to run the model and made notes on how it did that such that I'll be able to do the same thing in the future.&lt;/p&gt;
&lt;p&gt;I think the key factors in executing this project successfully were the following:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;I gave it exactly what it needed: a Docker environment on the target hardware, instructions on where to get what it needed (the code and the model) and a clear goal for it to pursue. This is a great example of the pattern I described in &lt;a href="https://simonwillison.net/2025/Sep/30/designing-agentic-loops/"&gt;designing agentic loops&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Running it in a Docker sandbox meant I could use &lt;code&gt;claude --dangerously-skip-permissions&lt;/code&gt; and leave it running on its own. If I'd had to approve every command it wanted to run I would have got frustrated and quit the project after just a few minutes.&lt;/li&gt;
&lt;li&gt;I applied my own knowledge and experience when it got stuck. I was confident (based on &lt;a href="https://simonwillison.net/2025/Oct/14/nvidia-dgx-spark/#claude-code-for-everything"&gt;previous experiments&lt;/a&gt; with the Spark) that a CUDA wheel for ARM64 existed that was likely to work, so when it gave up I prompted it to try again, leading to success.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Oh, and it looks like DeepSeek OCR is a pretty good model if you spend the time experimenting with different ways to run it.&lt;/p&gt;
&lt;h4 id="bonus-using-vs-code-to-monitor-the-container"&gt;Bonus: Using VS Code to monitor the container&lt;/h4&gt;
&lt;p&gt;A small TIL from today: I had kicked off the job running in the Docker container via SSH to the Spark when I realized it would be neat if I could easily monitor the files it was creating while it was running.&lt;/p&gt;
&lt;p&gt;I &lt;a href="https://claude.ai/share/68a0ebff-b586-4278-bd91-6b715a657d2b"&gt;asked Claude.ai&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I am running a Docker container on a remote machine, which I started over SSH&lt;/p&gt;
&lt;p&gt;How can I have my local VS Code on MacOS show me the filesystem in that docker container inside that remote machine, without restarting anything?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It gave me a set of steps that solved this exact problem:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Install the VS Code "Remote SSH" and "Dev Containers" extensions&lt;/li&gt;
&lt;li&gt;Use "Remote-SSH: Connect to Host" to connect to the remote machine (on my Tailscale network that's &lt;code&gt;spark@100.113.1.114&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;In the window for that remote SSH session, run "Dev Containers: Attach to Running Container" - this shows a list of containers and you can select the one you want to attach to&lt;/li&gt;
&lt;li&gt;... and that's it! VS Code opens a new window providing full access to all of the files in that container. I opened up &lt;code&gt;notes.md&lt;/code&gt; and watched it as Claude Code appended to it in real time.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;At the end, when I told Claude to create a zip file of the results, I could select that in the VS Code file explorer and use the "Download" menu item to download it to my Mac.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/docker"&gt;docker&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pytorch"&gt;pytorch&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nvidia"&gt;nvidia&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vs-code"&gt;vs-code&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deepseek"&gt;deepseek&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-code"&gt;claude-code&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nvidia-spark"&gt;nvidia-spark&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ocr"/><category term="python"/><category term="ai"/><category term="docker"/><category term="pytorch"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="anthropic"/><category term="claude"/><category term="nvidia"/><category term="vs-code"/><category term="vision-llms"/><category term="deepseek"/><category term="llm-release"/><category term="coding-agents"/><category term="claude-code"/><category term="ai-in-china"/><category term="nvidia-spark"/></entry><entry><title>The perils of vibe coding</title><link href="https://simonwillison.net/2025/Aug/29/the-perils-of-vibe-coding/#atom-tag" rel="alternate"/><published>2025-08-29T17:51:10+00:00</published><updated>2025-08-29T17:51:10+00:00</updated><id>https://simonwillison.net/2025/Aug/29/the-perils-of-vibe-coding/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.ft.com/content/5b3d410a-6e02-41ad-9e0a-c2e4d672ca00"&gt;The perils of vibe coding&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I was interviewed by Elaine Moore for this opinion piece in the Financial Times, which ended up in the print edition of the paper too! I picked up a copy yesterday:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://static.simonwillison.net/static/2025/ft.jpeg" style="text-decoration: none; border-bottom: none"&gt;&lt;img src="https://static.simonwillison.net/static/2025/ft.jpeg" alt="The perils of vibe coding - A new OpenAI model arrived this month with a glossy livestream, group watch parties and a lingering sense of disappointment. The YouTube comment section was underwhelmed. “I think they are all starting to realize this isn’t going to become the world like they thought it would,” wrote one viewer. “I can see it on their faces.” But if the casual user was unimpressed, the AI model’s saving grace may be vibe. Coding is generative AI’s newest battleground. With big bills to pay, high valuations to live up to and a market wobble to erase, the sector needs to prove its corporate productivity chops. Coding is hardly promoted as a business use case that already works. For one thing, AI-generated code holds the promise of replacing programmers — a profession of very well paid people. For another, the work can be quantified. In April, Microsoft chief executive Satya Nadella said that up to 50 per cent of the company’s code was now being written by AI. Google chief executive Sundar Pichai has said the same thing. Salesforce has paused engineering hires and Mark Zuckerberg told podcaster Joe Rogan that Meta would use AI as a “mid-level engineer” that writes code. Meanwhile, start-ups such as Replit and Cursor’s Anysphere are trying to persuade people that with AI, anyone can code. In theory, every employee can become a software engineer. So why aren’t we? One possibility is that it’s all still too unfamiliar. But when I ask people who write code for a living they offer an alternative suggestion: unpredictability. As programmer Simon Willison put it: “A lot of people are missing how weird and funny this space is. 
I’ve been a computer programmer for 30 years and [AI models] don’t behave like normal computers.” Willison is well known in the software engineering community for his AI experiments. He’s an enthusiastic vibe coder — using LLMs to generate code using natural language prompts. OpenAI’s latest model GPT-3.1s, he is now favourite. Still, he predicts that a vibe coding crash is due if it is used to produce glitchy software. It makes sense that programmers — people who are interested in finding new ways to solve problems — would be early adopters of LLMs. Code is a language, albeit an abstract one. And generative AI is trained in nearly all of them, including older ones like Cobol. That doesn’t mean they accept all of its suggestions. Willison thinks the best way to see what a new model can do is to ask for something unusual. He likes to request an svg (an image made out of lines described with code) of a pelican on a bike and asks it to remember the chickens in his garden by name. Results can be bizarre. One model ignored key prompts in favour of composing a poem. Still, his adventures in vibe coding sound like an advert for the sector’s future. Anthropic’s Claude Code, the favoured model for developers, to make an OCR (optical character recognition) software loves screenshots) tool that will copy and paste text from a screenshot. He wrote software that summarises blog comments and has planned to cut a custom tool that will alert him when a whale is visible from his Pacific coast home. All this by typing prompts in English. It’s sounds like the sort of thing Bill Gates might have had in mind when he wrote that natural language AI agents would bring about “the biggest revolution in computing since we went from typing commands to tapping on icons”. But watching code appear and know how it works are two different things. My efforts to make my own comment summary tool produced something unworkable that gave overly long answers and then congratulated itself as a success. 
Willison says he wouldn’t use AI-generated code for projects he planned to ship out unless he had reviewed each line. Not only is there the risk of hallucination but the chatbot’s desire to be agreeable means it may an unusable idea works. That is a particular issue for those of us who don’t know how to fix the code. We risk creating software with hidden problems. It may not save time either. A study published in July by the non-profit Model Evaluation and Threat Research assessed work done by 16 developers — some with AI tools, some without. Those using AI assistance it had made them faster. In fact it took them nearly a fifth longer. Several developers I spoke to said AI was best used as a way to talk through coding problems. It’s a version of something they call rubber ducking (after their habit of talking to the toys on their desk) — only this rubber duck can talk back. As one put it, code shouldn’t be judged by volume or speed. Progress in AI coding is tangible. But measuring productivity gains is not as neat as a simple percentage calculation."&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;From the article, with links added by me to relevant projects:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Willison thinks the best way to see what a new model can do is to ask for something unusual. He likes to request an SVG (an image made out of lines described with code) of &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;a pelican on a bike&lt;/a&gt; and asks it to remember the chickens in his garden by name. Results can be bizarre. One model ignored his prompts in favour of &lt;a href="https://simonwillison.net/2025/Aug/14/gemma-3-270m/"&gt;composing a poem&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Still, his adventures in vibe coding sound like an advert for the sector. He used Anthropic's Claude Code, the favoured model for developers, to &lt;a href="https://simonwillison.net/2024/Mar/30/ocr-pdfs-images/"&gt;make an OCR&lt;/a&gt; (optical character recognition - software loves acronyms) tool that will copy and paste text from a screenshot.&lt;/p&gt;
&lt;p&gt;He wrote software that &lt;a href="https://til.simonwillison.net/llms/claude-hacker-news-themes"&gt;summarises blog comments&lt;/a&gt; and has plans to build a custom tool that will alert him when a whale is visible from his Pacific coast home. All this by typing prompts in English.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I've been talking about that whale spotting project for far too long. Now that it's been in the FT I really need to build it.&lt;/p&gt;
&lt;p&gt;(On the subject of OCR... I tried extracting the text from the above image using GPT-5 and got a &lt;a href="https://chatgpt.com/share/68b1e707-add0-8006-8344-4c2fca902b2e"&gt;surprisingly bad result&lt;/a&gt; full of hallucinated details. Claude Opus 4.1 &lt;a href="https://claude.ai/share/e98d2fe1-0c81-4f51-8739-483f843e4c0e"&gt;did a lot better&lt;/a&gt; but still made some mistakes. Gemini 2.5 &lt;a href="https://aistudio.google.com/app/prompts?state=%257B%2522ids%2522:%255B%25221MOzgBJI-FJF1uyile_7h2zL4F6lD0sgK%2522%255D,%2522action%2522:%2522open%2522,%2522userId%2522:%2522106366615678321494423%2522,%2522resourceKeys%2522:%257B%257D%257D&amp;amp;usp=sharing,%20https://drive.google.com/file/d/1ffD88ORjgjFzbPsvQ-Z52Exhb_Z9MgtL/view?usp=sharing"&gt;did much better&lt;/a&gt;.)&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vibe-coding"&gt;vibe-coding&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/press-quotes"&gt;press-quotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-5"&gt;gpt-5&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt"&gt;gpt&lt;/a&gt;&lt;/p&gt;



</summary><category term="ocr"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="claude"/><category term="vibe-coding"/><category term="press-quotes"/><category term="gpt-5"/><category term="gpt"/></entry><entry><title>How OpenElections Uses LLMs</title><link href="https://simonwillison.net/2025/Jun/19/how-openelections-uses-llms/#atom-tag" rel="alternate"/><published>2025-06-19T18:26:38+00:00</published><updated>2025-06-19T18:26:38+00:00</updated><id>https://simonwillison.net/2025/Jun/19/how-openelections-uses-llms/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://thescoop.org/archives/2025/06/09/how-openelections-uses-llms/index.html"&gt;How OpenElections Uses LLMs&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/openelections"&gt;OpenElections project&lt;/a&gt; collects detailed election data for the USA, all the way down to the precinct level. This is a &lt;em&gt;surprisingly&lt;/em&gt; hard problem: while county and state-level results are widely available, precinct-level results are published in thousands of different ad-hoc ways and rarely aggregated once the election result has been announced.&lt;/p&gt;
&lt;p&gt;A lot of those precinct results are published as image-filled PDFs.&lt;/p&gt;
&lt;p&gt;Derek Willis has recently started leaning on Gemini to help parse those PDFs into CSV data:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;For parsing image PDFs into CSV files, Google’s Gemini is my model of choice, for two main reasons. First, the results are usually very, very accurate (with a few caveats I’ll detail below), and second, Gemini’s large context window means it’s possible to work with PDF files that can be multiple MBs in size.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In &lt;a href="https://thescoop.org/archives/2025/06/09/how-openelections-uses-llms/index.html"&gt;this piece&lt;/a&gt; he shares the process and prompts for a real-world, expert-level data entry project, assisted by Gemini.&lt;/p&gt;
&lt;p&gt;This example from Limestone County, Texas is a great illustration of how tricky this problem can get. Getting traditional OCR software to correctly interpret multi-column layouts like this always requires some level of manual intervention:&lt;/p&gt;
&lt;p&gt;&lt;img alt="The results are typewritten and slightly wonky and come in several different columns" src="https://static.simonwillison.net/static/2025/election-results.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;Derek's &lt;a href="https://aistudio.google.com/app/prompts/1vZq4hi_eCqR58TkuzqPugDcOc2kE1tms"&gt;prompt against Gemini 2.5 Pro&lt;/a&gt; throws in an example, some special instructions and a note about the two column format:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Produce a CSV file from the attached PDF based on this example:&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;county,precinct,office,district,party,candidate,votes,absentee,early_voting,election_day&lt;/code&gt;&lt;br&gt;
&lt;code&gt;Limestone,Precinct 101,Registered Voters,,,,1858,,,&lt;/code&gt;&lt;br&gt;
&lt;code&gt;Limestone,Precinct 101,Ballots Cast,,,,1160,,,&lt;/code&gt;&lt;br&gt;
&lt;code&gt;Limestone,Precinct 101,President,,REP,Donald J. Trump,879,,,&lt;/code&gt;&lt;br&gt;
&lt;code&gt;Limestone,Precinct 101,President,,DEM,Kamala D. Harris,271,,,&lt;/code&gt;&lt;br&gt;
&lt;code&gt;Limestone,Precinct 101,President,,LIB,Chase Oliver,1,,,&lt;/code&gt;&lt;br&gt;
&lt;code&gt;Limestone,Precinct 101,President,,GRN,Jill Stein,4,,,&lt;/code&gt;&lt;br&gt;
&lt;code&gt;Limestone,Precinct 101,President,,,Write-ins,1,,,&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Skip Write-ins with candidate names and rows with "Cast Votes", "Not Assigned", "Rejected write-in votes", "Unresolved write-in votes" or "Contest Totals". Do not extract any values that end in "%"&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Use the following offices:&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;President/Vice President -&amp;gt; President&lt;/code&gt;&lt;br&gt;
&lt;code&gt;United States Senator -&amp;gt; U.S. Senate&lt;/code&gt;&lt;br&gt;
&lt;code&gt;US Representative -&amp;gt; U.S. House&lt;/code&gt;&lt;br&gt;
&lt;code&gt;State Senator -&amp;gt; State Senate&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Quote all office and candidate values. The results are split into two columns on each page; parse the left column first and then the right column.&lt;/code&gt;&lt;br&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;A spot-check and a few manual tweaks and &lt;a href="https://github.com/openelections/openelections-data-tx/blob/master/2024/counties/20241105__tx__general__limestone__precinct.csv"&gt;the result&lt;/a&gt; against a 42-page PDF was exactly what was needed.&lt;/p&gt;
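&lt;p&gt;Part of a spot-check like that can be automated. Here's a minimal sketch, assuming the column layout from the example in the prompt above. The &lt;code&gt;find_problems&lt;/code&gt; helper is hypothetical and is not part of Derek's actual workflow:&lt;/p&gt;

```python
import csv
import io

# Column layout taken from the example in the prompt.
EXPECTED = [
    "county", "precinct", "office", "district", "party",
    "candidate", "votes", "absentee", "early_voting", "election_day",
]


def find_problems(csv_text):
    """Return human-readable problems found in model-generated CSV text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows or rows[0] != EXPECTED:
        return ["header does not match the expected columns"]
    problems = []
    # Row 1 is the header, so data rows are numbered from 2.
    for number, row in enumerate(rows[1:], start=2):
        if len(row) != len(EXPECTED):
            problems.append(
                f"row {number}: expected {len(EXPECTED)} fields, got {len(row)}"
            )
            continue
        votes = row[EXPECTED.index("votes")]
        if not votes.isdigit():
            problems.append(f"row {number}: votes is not a whole number: {votes!r}")
        # The prompt asks the model to skip any value ending in "%".
        if any(value.endswith("%") for value in row):
            problems.append(f"row {number}: contains a percentage value")
    return problems
```

&lt;p&gt;A check like this only catches structural mistakes. Comparing the extracted numbers against the official totals still needs a human.&lt;/p&gt;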
&lt;p&gt;How about something harder? The results for Cameron County came as more than 600 pages and looked like this - note the hole-punch holes that obscure some of the text!&lt;/p&gt;
&lt;p&gt;&lt;img alt="Precinct results report, Cameron County Texas, November 5th 2024. A hole punch hole obscures Precinct 16 and another further down the page deletes the first three letters in both Undervotes and Overvotes" src="https://static.simonwillison.net/static/2025/cameron.png" /&gt;&lt;/p&gt;
&lt;p&gt;This file had to be split into chunks of 100 pages each, and the entire process still took a full hour of work - but the resulting table matched up with the official vote totals.&lt;/p&gt;
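&lt;p&gt;The chunking step is simple to sketch. This is an illustration only: the &lt;code&gt;page_chunks&lt;/code&gt; name is hypothetical, and the piece doesn't say which tool was actually used to split the file:&lt;/p&gt;

```python
def page_chunks(total_pages, chunk_size=100):
    """Yield inclusive, 1-indexed (first, last) page ranges."""
    for first in range(1, total_pages + 1, chunk_size):
        # The final chunk may be shorter than chunk_size.
        yield first, min(first + chunk_size - 1, total_pages)
```

&lt;p&gt;Each range could then be handed to whatever PDF tool does the splitting, with one prompt run per chunk.&lt;/p&gt;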
&lt;p&gt;I love how realistic this example is. AI data entry like this isn't a silver bullet - there's still a bunch of work needed to verify the results and creative thinking needed to work through limitations - but it represents a very real improvement in how small teams can take on projects of this scale.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In the six weeks since we started working on Texas precinct results, we’ve been able to convert them for more than half of the state’s 254 counties, including many image PDFs like the ones on display here. That pace simply wouldn’t be possible with data entry or traditional OCR software.&lt;/p&gt;
&lt;/blockquote&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=44320001"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/derek-willis"&gt;derek-willis&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/structured-extraction"&gt;structured-extraction&lt;/a&gt;&lt;/p&gt;



</summary><category term="data-journalism"/><category term="derek-willis"/><category term="ocr"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="gemini"/><category term="vision-llms"/><category term="structured-extraction"/></entry><entry><title>qwen2.5vl in Ollama</title><link href="https://simonwillison.net/2025/May/18/qwen25vl-in-ollama/#atom-tag" rel="alternate"/><published>2025-05-18T12:31:15+00:00</published><updated>2025-05-18T12:31:15+00:00</updated><id>https://simonwillison.net/2025/May/18/qwen25vl-in-ollama/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://ollama.com/library/qwen2.5vl"&gt;qwen2.5vl in Ollama&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Ollama announced a complete overhaul of their vision support the other day. Here's the first new model they've shipped since then - a packaged version of &lt;a href="https://github.com/QwenLM/Qwen2.5-VL"&gt;Qwen 2.5 VL&lt;/a&gt; which was first released &lt;a href="https://qwenlm.github.io/blog/qwen2.5-vl/"&gt;on January 26th 2025&lt;/a&gt;. Here are my &lt;a href="https://simonwillison.net/2025/Jan/27/qwen25-vl-qwen25-vl-qwen25-vl/"&gt;notes from that release&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I upgraded Ollama (it auto-updates so I just had to restart it from the tray icon) and ran this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama pull qwen2.5vl
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This downloaded a 6GB model file. I tried it out against my &lt;a href="https://static.simonwillison.net/static/2025/cleo-sand.jpg"&gt;photo of Cleo rolling on the beach&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -a https://static.simonwillison.net/static/2025/cleo-sand.jpg \
  'describe this image' -m qwen2.5vl
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And got a pretty good result:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The image shows a dog lying on its back on a sandy beach. The dog appears to be a medium to large breed with a dark coat, possibly black or dark brown. It is wearing a red collar or harness around its chest. The dog's legs are spread out, and its belly is exposed, suggesting it might be rolling around or playing in the sand. The sand is light-colored and appears to be dry, with some small footprints and marks visible around the dog. The lighting in the image suggests it is taken during the daytime, with the sun casting a shadow of the dog to the left side of the image. The overall scene gives a relaxed and playful impression, typical of a dog enjoying time outdoors on a beach.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Qwen 2.5 VL has a strong reputation for OCR, so I tried it on &lt;a href="https://simonwillison.net/2025/May/17/pycon-poster/#datasette-poster"&gt;my poster&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -a https://static.simonwillison.net/static/2025/poster.jpg \
  'convert to markdown' -m qwen2.5vl
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The result that came back:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It looks like the image you provided is a jumbled and distorted text, making it difficult to interpret. If you have a specific question or need help with a particular topic, please feel free to ask, and I'll do my best to assist you!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I'm not sure what went wrong here. My best guess is that the maximum resolution the model can handle is too small to make out the text, or maybe Ollama resized the image to the point of illegibility before handing it to the model?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: I think this may be &lt;a href="https://github.com/simonw/llm/issues/1046"&gt;a bug&lt;/a&gt; relating to URL handling in LLM/llm-ollama. I tried downloading the file first:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;wget https://static.simonwillison.net/static/2025/poster.jpg
llm -m qwen2.5vl 'extract text' -a poster.jpg
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This time it did a lot better. The results weren't perfect though - &lt;a href="https://gist.github.com/simonw/2b46e932a16c92e673ea09dfc0186ec2#response"&gt;it ended up stuck in a loop&lt;/a&gt; outputting the same code example dozens of times.&lt;/p&gt;
&lt;p&gt;I tried with a different prompt - "extract text" - and it got confused by the three column layout, misread Datasette as "Datasetette" and missed some of the text. Here's &lt;a href="https://gist.github.com/simonw/3ececa5f5ff109a81bc6893be06f00b1#response"&gt;that result&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;These experiments used &lt;code&gt;qwen2.5vl:7b&lt;/code&gt; (6GB) - I expect the results would be better with the larger &lt;code&gt;qwen2.5vl:32b&lt;/code&gt; (21GB) and &lt;code&gt;qwen2.5vl:72b&lt;/code&gt; (71GB) models.&lt;/p&gt;
&lt;p&gt;Fred Jonsson &lt;a href="https://twitter.com/enginoid/status/1924092556079436086"&gt;reported a better result&lt;/a&gt; using the MLX model via LM Studio (~9GB model running in 8bit - I think that's &lt;a href="https://huggingface.co/mlx-community/Qwen2.5-VL-7B-Instruct-8bit"&gt;mlx-community/Qwen2.5-VL-7B-Instruct-8bit&lt;/a&gt;). His &lt;a href="https://gist.github.com/enginoid/5c91c920124d4a2e0ab253df769e35fa"&gt;full output is here&lt;/a&gt; - looks almost exactly right to me.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ocr"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="vision-llms"/><category term="qwen"/><category term="mlx"/><category term="ollama"/><category term="ai-in-china"/></entry><entry><title>Mistral OCR</title><link href="https://simonwillison.net/2025/Mar/7/mistral-ocr/#atom-tag" rel="alternate"/><published>2025-03-07T01:39:26+00:00</published><updated>2025-03-07T01:39:26+00:00</updated><id>https://simonwillison.net/2025/Mar/7/mistral-ocr/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://mistral.ai/fr/news/mistral-ocr"&gt;Mistral OCR&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;New closed-source specialist OCR model by Mistral - you can feed it images or a PDF and it produces Markdown with optional embedded images.&lt;/p&gt;
&lt;p&gt;It's available &lt;a href="https://docs.mistral.ai/api/#tag/ocr"&gt;via their API&lt;/a&gt;, or it's "available to self-host on a selective basis" for people with stringent privacy requirements who are willing to talk to their sales team.&lt;/p&gt;
&lt;p&gt;I decided to try out their API, so I copied and pasted example code &lt;a href="https://colab.research.google.com/drive/11NdqWVwC_TtJyKT6cmuap4l9SryAeeVt?usp=sharing"&gt;from their notebook&lt;/a&gt; into my &lt;a href="https://simonwillison.net/2024/Dec/19/one-shot-python-tools/"&gt;custom Claude project&lt;/a&gt; and &lt;a href="https://claude.ai/share/153d8eb8-82dd-4f8c-a3d0-6c23b4dc21a2"&gt;told it&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Turn this into a CLI app, depends on mistralai - it should take a file path and an optional API key defauling to env vironment called MISTRAL_API_KEY&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;After &lt;a href="https://claude.ai/share/b746cab4-293b-4e04-b662-858bb164ab78"&gt;some further&lt;/a&gt; iteration / vibe coding I got to something that worked, which I then tidied up and shared as &lt;a href="https://github.com/simonw/tools/blob/main/python/mistral_ocr.py"&gt;mistral_ocr.py&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;You can try it out like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;export MISTRAL_API_KEY='...'
uv run http://tools.simonwillison.net/python/mistral_ocr.py \
  mixtral.pdf --html --inline-images &amp;gt; mixtral.html
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I fed in &lt;a href="https://arxiv.org/abs/2401.04088"&gt;the Mixtral paper&lt;/a&gt; as a PDF. The API returns Markdown, but my &lt;code&gt;--html&lt;/code&gt; option renders that Markdown as HTML and the &lt;code&gt;--inline-images&lt;/code&gt; option takes any images and inlines them as base64 URIs (inspired &lt;a href="https://simonwillison.net/2025/Mar/6/monolith/"&gt;by monolith&lt;/a&gt;). The result is &lt;a href="https://static.simonwillison.net/static/2025/mixtral.html"&gt;mixtral.html&lt;/a&gt;, a 972KB HTML file with images and text bundled together.&lt;/p&gt;
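&lt;p&gt;The inlining trick itself is small. Here's a minimal sketch of turning image bytes into a base64 &lt;code&gt;data:&lt;/code&gt; URI using only the Python standard library. The &lt;code&gt;to_data_uri&lt;/code&gt; helper is hypothetical and is not necessarily how &lt;code&gt;mistral_ocr.py&lt;/code&gt; does it:&lt;/p&gt;

```python
import base64
import mimetypes


def to_data_uri(image_bytes, filename):
    """Encode image bytes as a data: URI suitable for an img src attribute."""
    # Guess the MIME type from the file extension, e.g. image/png for .png
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None:
        mime_type = "application/octet-stream"
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

&lt;p&gt;Base64 adds roughly a third to the size of each image, which is the price of getting everything into a single self-contained file.&lt;/p&gt;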
&lt;p&gt;This did a pretty great job!&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of part of the document, it has a heading, some text, an image and the start of a table. The table contains some unrendered MathML syntax." src="https://static.simonwillison.net/static/2025/mixtral-as-html.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;My script renders Markdown tables but I haven't figured out how to render inline Markdown MathML yet. I ran the command a second time and requested Markdown output (the default) like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uv run http://tools.simonwillison.net/python/mistral_ocr.py \
  mixtral.pdf &amp;gt; mixtral.md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/023d1cf403c1cd9f41801c85510aef21"&gt;that Markdown rendered as a Gist&lt;/a&gt; - there are a few MathML glitches so clearly the Mistral OCR MathML dialect and the GitHub Flavored Markdown dialect don't quite line up.&lt;/p&gt;
&lt;p&gt;My tool can also output raw JSON as an alternative to Markdown or HTML - full details &lt;a href="https://tools.simonwillison.net/python/#mistral_ocrpy"&gt;in the documentation&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The Mistral API is priced at roughly 1000 pages per dollar, with a 50% discount for batch usage.&lt;/p&gt;
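&lt;p&gt;As a back-of-the-envelope sketch using only the two figures quoted above (the function name and constants are illustrative, not part of any Mistral SDK):&lt;/p&gt;

```python
# Figures quoted in the text: roughly 1000 pages per dollar,
# with a 50% discount for batch usage.
PAGES_PER_DOLLAR = 1000
BATCH_DISCOUNT = 0.5


def estimated_cost(pages, batch=False):
    """Approximate cost in dollars for OCRing `pages` pages."""
    cost = pages / PAGES_PER_DOLLAR
    if batch:
        cost *= 1 - BATCH_DISCOUNT
    return cost


if __name__ == "__main__":
    print(estimated_cost(25_000))
    print(estimated_cost(25_000, batch=True))
```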
&lt;p&gt;The big question with LLM-based OCR is always how well it copes with accidental instructions in the text (can you safely OCR a document full of prompting examples?) and how well it handles text it can't read.&lt;/p&gt;
&lt;p&gt;Mistral's Sophia Yang says it &lt;a href="https://x.com/sophiamyang/status/1897719199595720722"&gt;"should be robust"&lt;/a&gt; against following instructions in the text, and invited people to try and find counter-examples.&lt;/p&gt;
&lt;p&gt;Alexander Doria noted that &lt;a href="https://twitter.com/Dorialexander/status/1897702264543875535"&gt;Mistral OCR can hallucinate text&lt;/a&gt; when faced with handwriting that it cannot understand.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/sophiamyang/status/1897713370029068381"&gt;@sophiamyang&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mistral"&gt;mistral&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;&lt;/p&gt;



</summary><category term="cli"/><category term="ocr"/><category term="pdf"/><category term="projects"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="claude"/><category term="mistral"/><category term="vision-llms"/><category term="uv"/></entry><entry><title>olmOCR</title><link href="https://simonwillison.net/2025/Feb/26/olmocr/#atom-tag" rel="alternate"/><published>2025-02-26T02:04:03+00:00</published><updated>2025-02-26T02:04:03+00:00</updated><id>https://simonwillison.net/2025/Feb/26/olmocr/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://olmocr.allenai.org/"&gt;olmOCR&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;New from &lt;a href="https://allenai.org/"&gt;Ai2&lt;/a&gt; - olmOCR is "an open-source tool designed for high-throughput conversion of PDFs and other documents into plain text while preserving natural reading order".&lt;/p&gt;
&lt;p&gt;At its core is &lt;a href="https://huggingface.co/allenai/olmOCR-7B-0225-preview"&gt;allenai/olmOCR-7B-0225-preview&lt;/a&gt;, a Qwen2-VL-7B-Instruct variant trained on ~250,000 pages of diverse PDF content (both scanned and text-based) that were labelled using GPT-4o and made available as the &lt;a href="https://huggingface.co/datasets/allenai/olmOCR-mix-0225"&gt;olmOCR-mix-0225 dataset&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/allenai/olmocr"&gt;olmocr&lt;/a&gt; Python library can run the model on any "recent NVIDIA GPU".  I haven't managed to run it on my own Mac yet - there are &lt;a href="https://huggingface.co/lmstudio-community/olmOCR-7B-0225-preview-GGUF"&gt;GGUFs out there&lt;/a&gt; but it's not clear to me how to run vision prompts through them - but Ai2 offer &lt;a href="https://olmocr.allenai.org/"&gt;an online demo&lt;/a&gt; which can handle up to ten pages for free.&lt;/p&gt;
&lt;p&gt;Given the right hardware this looks like a very inexpensive way to run large scale document conversion projects:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We carefully optimized our inference pipeline for large-scale batch processing using SGLang, enabling olmOCR to convert one million PDF pages for just $190 - about 1/32nd the cost of using GPT-4o APIs.&lt;/p&gt;
&lt;/blockquote&gt;
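&lt;p&gt;Working through the numbers in that quote (a rough sketch; the GPT-4o figure is only what the stated 1/32 ratio implies, not a price taken from any price list):&lt;/p&gt;

```python
# Figures from the quoted passage.
OLMOCR_DOLLARS_PER_MILLION_PAGES = 190.0
GPT4O_COST_MULTIPLE = 32  # olmOCR is described as about 1/32nd the cost


def olmocr_cost(pages):
    """Approximate olmOCR cost in dollars for `pages` pages."""
    return pages * OLMOCR_DOLLARS_PER_MILLION_PAGES / 1_000_000


# Cost per million pages implied for the GPT-4o APIs by the 1/32 ratio.
implied_gpt4o_per_million = OLMOCR_DOLLARS_PER_MILLION_PAGES * GPT4O_COST_MULTIPLE

if __name__ == "__main__":
    print(olmocr_cost(1_000_000))
    print(implied_gpt4o_per_million)
```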
&lt;p&gt;The most interesting idea from &lt;a href="https://olmocr.allenai.org/papers/olmocr.pdf"&gt;the technical report (PDF)&lt;/a&gt; is something they call "document anchoring":&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Document anchoring extracts coordinates of salient elements in each page (e.g., text blocks and images) and injects them alongside raw text extracted
from the PDF binary file. [...]&lt;/p&gt;
&lt;p&gt;Document anchoring processes PDF document pages via the PyPDF library to extract a representation of the page’s structure from the underlying PDF. All of the text blocks and images in the page are extracted, including position information. Starting with the most relevant text blocks and images, these are sampled and added to the prompt of the VLM, up to a defined maximum character limit. This extra information is then available to the model when processing the document.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;img alt="Left side shows a green-header interface with coordinates like [150x220]√3x−1+(1+x)², [150x180]Section 6, [150x50]Lorem ipsum dolor sit amet, [150x70]consectetur adipiscing elit, sed do, [150x90]eiusmod tempor incididunt ut, [150x110]labore et dolore magna aliqua, [100x280]Table 1, followed by grid coordinates with A, B, C, AA, BB, CC, AAA, BBB, CCC values. Right side shows the rendered document with equation, text and table." src="https://static.simonwillison.net/static/2025/olmocr-document-anchoring.jpg" /&gt;&lt;/p&gt;
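&lt;p&gt;A simplified sketch of the general shape of document anchoring, not olmOCR's actual implementation: it assumes the text blocks have already been extracted with positions and sorted by relevance, and that each one is written as &lt;code&gt;[XxY]text&lt;/code&gt; as in the illustration above. The &lt;code&gt;Block&lt;/code&gt; class and &lt;code&gt;build_anchor_text()&lt;/code&gt; function are hypothetical names.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A text block with its page position (hypothetical structure)."""
    x: int
    y: int
    text: str


def build_anchor_text(blocks, max_chars=6000):
    """Join blocks as [XxY]text lines, stopping at a character budget.

    Assumes `blocks` is already ordered from most to least relevant.
    """
    lines = []
    used = 0
    for block in blocks:
        line = f"[{block.x}x{block.y}]{block.text}"
        # +1 accounts for the newline that joins this line to the rest.
        if used + len(line) + 1 > max_chars:
            break
        lines.append(line)
        used += len(line) + 1
    return "\n".join(lines)


if __name__ == "__main__":
    page = [Block(150, 180, "Section 6"), Block(150, 50, "Lorem ipsum")]
    print(build_anchor_text(page))
```

&lt;p&gt;The resulting string would be placed in the prompt next to the rendered page image, giving the model both the pixels and the text recovered from the PDF itself.&lt;/p&gt;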
&lt;p&gt;The one limitation of olmOCR at the moment is that it doesn't appear to do anything with diagrams, figures or illustrations. Vision models are actually very good at interpreting these now, so my ideal OCR solution would include detailed automated descriptions of this kind of content in the resulting text.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: Jonathan Soma &lt;a href="https://jonathansoma.com/words/olmocr-on-macos-with-lm-studio.html"&gt;figured out how to run it on a Mac&lt;/a&gt; using LM Studio and the &lt;a href="https://github.com/allenai/olmocr/"&gt;olmocr&lt;/a&gt; Python package.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/soldni/status/1894418235334037570?s=46"&gt;Luca Soldaini&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fine-tuning"&gt;fine-tuning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai2"&gt;ai2&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ocr"/><category term="pdf"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="fine-tuning"/><category term="vision-llms"/><category term="qwen"/><category term="ai2"/><category term="lm-studio"/><category term="ai-in-china"/></entry><entry><title>Qwen2.5 VL! Qwen2.5 VL! Qwen2.5 VL!</title><link href="https://simonwillison.net/2025/Jan/27/qwen25-vl-qwen25-vl-qwen25-vl/#atom-tag" rel="alternate"/><published>2025-01-27T20:05:03+00:00</published><updated>2025-01-27T20:05:03+00:00</updated><id>https://simonwillison.net/2025/Jan/27/qwen25-vl-qwen25-vl-qwen25-vl/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://qwenlm.github.io/blog/qwen2.5-vl/"&gt;Qwen2.5 VL! Qwen2.5 VL! Qwen2.5 VL!&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Hot on the heels of yesterday's &lt;a href="https://simonwillison.net/2025/Jan/26/qwen25-1m/"&gt;Qwen2.5-1M&lt;/a&gt;, here's Qwen2.5 VL (with an excitable announcement title) - the latest in Qwen's series of vision LLMs.&lt;/p&gt;
&lt;p&gt;They're &lt;a href="https://huggingface.co/collections/Qwen/qwen25-vl-6795ffac22b334a837c0f9a5"&gt;releasing multiple versions&lt;/a&gt;: base models and instruction tuned models at 3B, 7B and 72B. There's a lot of additional information &lt;a href="https://github.com/QwenLM/Qwen2.5-VL/blob/main/README.md"&gt;in their README&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The 72B model can be selected on Qwen's &lt;a href="https://chat.qwenlm.ai/"&gt;Qwen Chat&lt;/a&gt; interface. I gave it this image of our dog Cleo joy rolling on the beach:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Cleo having a joyful roll on the beach" src="https://static.simonwillison.net/static/2025/cleo-sand.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;And prompted:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Describe this image in detail, be flowery&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;(Since the Qwen team tends to use &lt;a href="https://simonwillison.net/search/?q=flowery&amp;amp;tag=qwen"&gt;flowery language&lt;/a&gt; in their announcements.)&lt;/p&gt;
&lt;p&gt;Here's what I got back:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In this enchanting tableau, a canine companion finds solace in the embrace of nature's softest bed. The sandy expanse stretches out like an endless canvas, its surface etched with the delicate imprints of countless wanderers who have come and gone, leaving behind a tapestry of footprints that whisper tales of joyous romps and serene strolls.&lt;/p&gt;
&lt;p&gt;The dog, a vision of contentment, lies sprawled upon the sand, its dark fur gleaming under the gentle caress of the sun. Its body is relaxed, limbs splayed in abandon, as if it has surrendered to the soothing rhythm of the beach. A vibrant red collar adorns its neck, a splash of color against the muted tones of the surroundings, hinting at the care and love bestowed upon this cherished pet.&lt;/p&gt;
&lt;p&gt;The sand itself is a work of art, its grains shimmering with a silvery sheen that catches the light. Here and there, small tufts of seaweed peek through, remnants of the ocean's bounty, adding a touch of wild beauty to the scene. The shadows cast by unseen objects dance across the sand, creating a play of light and dark that lends an air of mystery to the tranquil setting.&lt;/p&gt;
&lt;p&gt;This moment, frozen in time, speaks of peace and harmony, where man's best friend finds respite in the simple pleasures of life. It is a scene that invites one to pause, breathe deeply, and revel in the serenity of nature's embrace.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I'm looking forward to trying out their smaller models - the 3B model in particular should be small enough to run comfortably even on an iPhone, so hopefully someone will wire up an example of that soon (maybe using MLX).&lt;/p&gt;
&lt;p&gt;&lt;a href="https://twitter.com/reach_vb/status/1883959714796675206"&gt;VB points out&lt;/a&gt; that the vision benchmarks for Qwen 2.5 VL 7B show it out-performing GPT-4o mini!&lt;/p&gt;
&lt;p&gt;&lt;img alt="Qwen 2.5-VL 7B scores highest on MMMU Pro, DocVQA, InfoVQA, CC-OCR, OCRBenchV2, MMStar, MMBench1.1, MathVista, MathVision, VideoMME, LVBench, and AITZ. GPT-4o Mini scores best on MMMU, MegaBench. Other models of a similar size score better on CharadesSTA, Android Control, ScreenSpot, and ScreenSpot Pro." src="https://static.simonwillison.net/static/2025/qwen-25-vl-benchmarks.jpg" /&gt;&lt;/p&gt;
&lt;h4 id="cookbooks"&gt;Qwen2.5 VL cookbooks&lt;/h4&gt;

&lt;p&gt;Qwen also just published a set of &lt;a href="https://github.com/QwenLM/Qwen2.5-VL/tree/main/cookbooks"&gt;cookbook recipes&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/QwenLM/Qwen2.5-VL/blob/main/cookbooks/universal_recognition.ipynb"&gt;universal_recognition.ipynb&lt;/a&gt; demonstrates basic visual Q&amp;amp;A, including prompts like &lt;code&gt;Who are these in this picture? Please give their names in Chinese and English&lt;/code&gt; against photos of celebrities, an ability other models have deliberately suppressed.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/QwenLM/Qwen2.5-VL/blob/main/cookbooks/spatial_understanding.ipynb"&gt;spatial_understanding.ipynb&lt;/a&gt; demonstrates bounding box support, with prompts like &lt;code&gt;Locate the top right brown cake, output its bbox coordinates using JSON format&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/QwenLM/Qwen2.5-VL/blob/main/cookbooks/video_understanding.ipynb"&gt;video_understanding.ipynb&lt;/a&gt; breaks a video into individual frames and asks questions like &lt;code&gt;Could you go into detail about the content of this long video?&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/QwenLM/Qwen2.5-VL/blob/main/cookbooks/ocr.ipynb"&gt;ocr.ipynb&lt;/a&gt; shows &lt;code&gt;Qwen2.5-VL-7B-Instruct&lt;/code&gt; performing OCR in multiple different languages.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/QwenLM/Qwen2.5-VL/blob/main/cookbooks/document_parsing.ipynb"&gt;document_parsing.ipynb&lt;/a&gt; uses Qwen to convert images of documents to HTML and other formats, and notes that "we introduce a unique Qwenvl HTML format that includes positional information for each component, enabling precise document reconstruction and manipulation."&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/QwenLM/Qwen2.5-VL/blob/main/cookbooks/mobile_agent.ipynb"&gt;mobile_agent.ipynb&lt;/a&gt; runs Qwen with tool use against tools for controlling a mobile phone, similar to ChatGPT Operator or Claude Computer Use.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/QwenLM/Qwen2.5-VL/blob/main/cookbooks/computer_use.ipynb"&gt;computer_use.ipynb&lt;/a&gt; showcases "GUI grounding" -  feeding in screenshots of a user's desktop and running tools for things like left clicking on a specific coordinate.&lt;/li&gt;
&lt;/ul&gt;
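&lt;p&gt;For the bounding box recipe, here is a hedged sketch of consuming the model's JSON reply. It assumes the reply is a JSON list of objects with a &lt;code&gt;bbox_2d&lt;/code&gt; key holding &lt;code&gt;[x1, y1, x2, y2]&lt;/code&gt; pixel coordinates and an optional &lt;code&gt;label&lt;/code&gt;; check the notebook for the exact format before relying on this. &lt;code&gt;parse_boxes()&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
import json


def parse_boxes(model_output):
    """Convert assumed bbox_2d corner coordinates into left/top/width/height."""
    boxes = []
    for item in json.loads(model_output):
        x1, y1, x2, y2 = item["bbox_2d"]
        boxes.append({
            "label": item.get("label", ""),
            "left": x1,
            "top": y1,
            "width": x2 - x1,
            "height": y2 - y1,
        })
    return boxes


if __name__ == "__main__":
    # Made-up reply, for illustration only.
    reply = '[{"bbox_2d": [10, 20, 110, 70], "label": "brown cake"}]'
    print(parse_boxes(reply))
```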
&lt;h4 id="qwen-vl-mlx-vlm"&gt;Running it with mlx-vlm&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Update 30th January 2025&lt;/strong&gt;: I got it working on my Mac using &lt;a href="https://github.com/astral-sh/uv"&gt;uv&lt;/a&gt; and &lt;a href="https://github.com/Blaizzy/mlx-vlm"&gt;mlx-vlm&lt;/a&gt;, with some &lt;a href="https://github.com/Blaizzy/mlx-vlm/issues/192"&gt;hints from this issue&lt;/a&gt;. Here's the recipe that worked (downloading a 9GB model from &lt;a href="https://huggingface.co/mlx-community/Qwen2.5-VL-7B-Instruct-8bit"&gt;mlx-community/Qwen2.5-VL-7B-Instruct-8bit&lt;/a&gt;):&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;uv run --with &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;numpy&amp;lt;2&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; --with &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;git+https://github.com/huggingface/transformers&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
  --with mlx-vlm \
  python -m mlx_vlm.generate \
    --model mlx-community/Qwen2.5-VL-7B-Instruct-8bit \
    --max-tokens 100 \
    --temp 0.0 \
    --prompt &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Describe this image.&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \
    --image path-to-image.png&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I ran that against this image:&lt;/p&gt;
&lt;p style="margin-top: 1.5em"&gt;&lt;img alt="Diagram showing data exfiltration attack flow: User conversing with AI Agent (shown as blue star), with arrows showing &amp;quot;Retrieval request&amp;quot; to information mediums (email, cloud, globe icons) and &amp;quot;Retrieval of attacker-controlled data entering prompt context &amp; agent reasoning loop&amp;quot; leading to &amp;quot;Exfiltration of private information initiated by retrieval of attacker-controlled data&amp;quot;. Attacker figure shown in red on right side with arrow indicating &amp;quot;Attacker-controlled data planted through private (e.g. email, cloud storage) or public (web search, internet) information mediums&amp;quot;" src="https://static.simonwillison.net/static/2025/exfiltration-diagram.jpg" style="max-width: 100%"&gt;&lt;/p&gt;

&lt;p&gt;And got back this result:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The image appears to illustrate a flowchart or diagram related to a cybersecurity scenario. Here's a breakdown of the elements:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;User&lt;/strong&gt;: Represented by a simple icon of a person.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AI Agent&lt;/strong&gt;: Represented by a blue diamond shape.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Attacker&lt;/strong&gt;: Represented by a red icon of a person.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cloud and Email Icons&lt;/strong&gt;: Represented by a cloud and an envelope, indicating data or information being transferred.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Text&lt;/strong&gt;: The text&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-engineering"&gt;prompt-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/computer-use"&gt;computer-use&lt;/a&gt;&lt;/p&gt;



</summary><category term="ocr"/><category term="ai"/><category term="prompt-engineering"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="vision-llms"/><category term="uv"/><category term="qwen"/><category term="mlx"/><category term="ai-in-china"/><category term="computer-use"/></entry><entry><title>Docling</title><link href="https://simonwillison.net/2024/Nov/3/docling/#atom-tag" rel="alternate"/><published>2024-11-03T04:57:56+00:00</published><updated>2024-11-03T04:57:56+00:00</updated><id>https://simonwillison.net/2024/Nov/3/docling/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://ds4sd.github.io/docling/"&gt;Docling&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;MIT licensed document extraction Python library from the Deep Search team at IBM, who released &lt;a href="https://ds4sd.github.io/docling/v2/#changes-in-docling-v2"&gt;Docling v2&lt;/a&gt; on October 16th.&lt;/p&gt;
&lt;p&gt;Here's the &lt;a href="https://arxiv.org/abs/2408.09869"&gt;Docling Technical Report&lt;/a&gt; paper from August, which provides details of two custom models: a layout analysis model for figuring out the structure of the document (sections, figures, text, tables etc) and a TableFormer model specifically for extracting structured data from tables.&lt;/p&gt;
&lt;p&gt;Those models are &lt;a href="https://huggingface.co/ds4sd/docling-models"&gt;available on Hugging Face&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here's how to try out the Docling CLI interface using &lt;code&gt;uvx&lt;/code&gt; (avoiding the need to install it first - though since it downloads models it will take a while to run the first time):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uvx docling mydoc.pdf --to json --to md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will output a &lt;code&gt;mydoc.json&lt;/code&gt; file with complex layout information and a &lt;code&gt;mydoc.md&lt;/code&gt; Markdown file which includes Markdown tables where appropriate.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://ds4sd.github.io/docling/usage/"&gt;Python API&lt;/a&gt; is a lot more comprehensive. It can even extract tables &lt;a href="https://ds4sd.github.io/docling/examples/export_tables/"&gt;as Pandas DataFrames&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;docling&lt;/span&gt;.&lt;span class="pl-s1"&gt;document_converter&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-v"&gt;DocumentConverter&lt;/span&gt;
&lt;span class="pl-s1"&gt;converter&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-v"&gt;DocumentConverter&lt;/span&gt;()
&lt;span class="pl-s1"&gt;result&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;converter&lt;/span&gt;.&lt;span class="pl-en"&gt;convert&lt;/span&gt;(&lt;span class="pl-s"&gt;"document.pdf"&lt;/span&gt;)
&lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;table&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;result&lt;/span&gt;.&lt;span class="pl-s1"&gt;document&lt;/span&gt;.&lt;span class="pl-s1"&gt;tables&lt;/span&gt;:
    &lt;span class="pl-s1"&gt;df&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;table&lt;/span&gt;.&lt;span class="pl-en"&gt;export_to_dataframe&lt;/span&gt;()
    &lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;df&lt;/span&gt;)&lt;/pre&gt;

&lt;p&gt;I ran that inside &lt;code&gt;uv run --with docling python&lt;/code&gt;. It took a little while to run, but it demonstrated that the library works.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ibm"&gt;ibm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/hugging-face"&gt;hugging-face&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;&lt;/p&gt;



</summary><category term="cli"/><category term="ibm"/><category term="ocr"/><category term="pdf"/><category term="python"/><category term="ai"/><category term="hugging-face"/><category term="uv"/></entry><entry><title>Running prompts against images and PDFs with Google Gemini</title><link href="https://simonwillison.net/2024/Oct/23/prompt-gemini/#atom-tag" rel="alternate"/><published>2024-10-23T18:25:07+00:00</published><updated>2024-10-23T18:25:07+00:00</updated><id>https://simonwillison.net/2024/Oct/23/prompt-gemini/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://til.simonwillison.net/llms/prompt-gemini"&gt;Running prompts against images and PDFs with Google Gemini&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;New TIL. I've been experimenting with the Google Gemini APIs for running prompts against images and PDFs (in preparation for finally adding multi-modal support to &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt;) - here are my notes on how to send images or PDF files to their API using &lt;code&gt;curl&lt;/code&gt; and the &lt;code&gt;base64 -i&lt;/code&gt; macOS command.&lt;/p&gt;
&lt;p&gt;I figured out the &lt;code&gt;curl&lt;/code&gt; incantation first and then &lt;a href="https://gist.github.com/simonw/7cc2a9c3e612a8af502d733ff619e066"&gt;got Claude to build me&lt;/a&gt; a Bash script that I can execute like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;prompt-gemini 'extract text' example-handwriting.jpg
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/prompt-gemini-extract.gif" alt="Animated terminal demo. At the top of the screen is an example-handwriting.jpg image with some rough handwriting. I run this command in a terminal: 
prompt-gemini 'extract text' example-handwriting.jpg It returns JSON showing 270 tokens used by gemini-1.5-flash-8b. Then I run the command again with -r on the end and it returns the text from the image: Example handwriting Let's try this out"&gt;&lt;/p&gt;
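&lt;p&gt;For comparison, a Python sketch of the same base64 approach. This is not the Bash script itself, and the request body shape (&lt;code&gt;contents&lt;/code&gt;, &lt;code&gt;parts&lt;/code&gt;, &lt;code&gt;inline_data&lt;/code&gt;) reflects my understanding of the Gemini REST API, so verify it against the official documentation. &lt;code&gt;build_gemini_payload()&lt;/code&gt; is a hypothetical helper that only builds the JSON and makes no network call.&lt;/p&gt;

```python
import base64
import json


def build_gemini_payload(prompt, file_bytes, mime_type):
    """Build a request body pairing a text prompt with one inline file."""
    encoded = base64.b64encode(file_bytes).decode("ascii")
    return {
        "contents": [
            {
                "parts": [
                    {"text": prompt},
                    # The file travels inside the JSON as base64 text.
                    {"inline_data": {"mime_type": mime_type, "data": encoded}},
                ]
            }
        ]
    }


if __name__ == "__main__":
    # Placeholder bytes stand in for a real image file.
    body = build_gemini_payload("extract text", b"hi", "image/jpeg")
    print(json.dumps(body, indent=2))
```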
&lt;p&gt;Playing with this is &lt;em&gt;really fun&lt;/em&gt;. The Gemini models charge less than 1/10th of a cent per image, so it's really inexpensive to try them out.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/bash"&gt;bash&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;&lt;/p&gt;



</summary><category term="bash"/><category term="google"/><category term="ocr"/><category term="projects"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="llm"/><category term="gemini"/><category term="vision-llms"/><category term="llm-pricing"/></entry><entry><title>State-of-the-art music scanning by Soundslice</title><link href="https://simonwillison.net/2024/Jun/20/music-scanning-by-soundslice/#atom-tag" rel="alternate"/><published>2024-06-20T04:37:28+00:00</published><updated>2024-06-20T04:37:28+00:00</updated><id>https://simonwillison.net/2024/Jun/20/music-scanning-by-soundslice/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.soundslice.com/sheet-music-scanner/"&gt;State-of-the-art music scanning by Soundslice&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;It's been a while since I checked in on &lt;a href="https://www.soundslice.com/"&gt;Soundslice&lt;/a&gt;, Adrian Holovaty's beautiful web application focused on music education.&lt;/p&gt;
&lt;p&gt;The latest feature is spectacular. The Soundslice music editor - already one of the most impressive web applications I've ever experienced - can now import notation directly from scans or photos of sheet music.&lt;/p&gt;
&lt;p&gt;The attention to detail is immaculate. The custom machine learning model can handle a wide variety of notation details, and the system asks the user to verify or correct details that it couldn't perfectly determine using a neatly designed flow.&lt;/p&gt;
&lt;p&gt;Free accounts can scan two single page documents a month, and paid plans get a much higher allowance. I tried it out just now on a low resolution image I found on Wikipedia and it did a fantastic job, even allowing me to listen to a simulated piano rendition of the music once it had finished processing.&lt;/p&gt;
&lt;p&gt;It's worth spending some time with the &lt;a href="https://www.soundslice.com/blog/music-scanning/"&gt;release notes&lt;/a&gt; for the feature to appreciate how much work they've put into improving it since the initial release.&lt;/p&gt;
&lt;p&gt;If you're new to Soundslice, here's &lt;a href="https://www.soundslice.com/slices/RXTDc/course-preview-5904/"&gt;an example&lt;/a&gt; of their core player interface which syncs the display of music notation to an accompanying video.&lt;/p&gt;
&lt;p&gt;Adrian wrote up some &lt;a href="https://www.holovaty.com/writing/machine-learning-thoughts/"&gt;detailed notes&lt;/a&gt; on the machine learning behind the feature when they first launched it in beta back in November 2022.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;OMR [Optical Music Recognition] is an inherently hard problem, significantly more difficult than text OCR. For one, music symbols have complex spatial relationships, and mistakes have a tendency to cascade. A single misdetected key signature might result in &lt;em&gt;multiple&lt;/em&gt; incorrect note pitches. And there’s a wide diversity of symbols, each with its own behavior and semantics — meaning the problems and subproblems aren’t just hard, there are &lt;em&gt;many&lt;/em&gt; of them.&lt;/p&gt;
&lt;/blockquote&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/adrian-holovaty"&gt;adrian-holovaty&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/machine-learning"&gt;machine-learning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/music"&gt;music&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;&lt;/p&gt;



</summary><category term="adrian-holovaty"/><category term="machine-learning"/><category term="music"/><category term="ocr"/><category term="ai"/></entry><entry><title>Civic Band</title><link href="https://simonwillison.net/2024/Jun/19/civic-band/#atom-tag" rel="alternate"/><published>2024-06-19T21:30:46+00:00</published><updated>2024-06-19T21:30:46+00:00</updated><id>https://simonwillison.net/2024/Jun/19/civic-band/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://civic.band/"&gt;Civic Band&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Exciting new civic tech project from Philip James: 30 (and counting) Datasette instances serving full-text search enabled collections of OCRd meeting minutes for different civic governments. Includes &lt;a href="https://alameda.ca.civic.band/civic_minutes/pages"&gt;20,000 pages for Alameda&lt;/a&gt;, &lt;a href="https://pittsburgh.pa.civic.band/civic_minutes/pages"&gt;17,000 for Pittsburgh&lt;/a&gt;, &lt;a href="https://baltimore.md.civic.band/civic_minutes/pages"&gt;3,567 for Baltimore&lt;/a&gt; and an enormous &lt;a href="https://maui-county.hi.civic.band/civic_minutes/pages"&gt;117,000 for Maui County&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Philip includes &lt;a href="https://civic.band/how.html"&gt;some notes&lt;/a&gt; on how they're doing it. They gather PDF minute notes from anywhere that provides API access to them, then run local Tesseract for OCR (the cost of cloud-based OCR proving prohibitive given the volume of data). The collection is then deployed to a single VPS running multiple instances of Datasette via Caddy, one instance for each of the covered regions.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tesseract"&gt;tesseract&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;



</summary><category term="data-journalism"/><category term="ocr"/><category term="tesseract"/><category term="datasette"/></entry><entry><title>textract-cli</title><link href="https://simonwillison.net/2024/Mar/30/textract-cli/#atom-tag" rel="alternate"/><published>2024-03-30T19:01:13+00:00</published><updated>2024-03-30T19:01:13+00:00</updated><id>https://simonwillison.net/2024/Mar/30/textract-cli/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/textract-cli"&gt;textract-cli&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This is my other OCR project from yesterday: I built the thinnest possible CLI wrapper around Amazon Textract, out of frustration at how hard that tool is to use on an ad-hoc basis.&lt;/p&gt;

&lt;p&gt;It only works with JPEGs and PNGs (not PDFs) up to 5MB in size, reflecting limitations in Textract’s synchronous API. Textract can handle PDFs amazingly well, but you have to upload them to an S3 bucket first, and I decided to keep the scope tight for the first version of this tool.&lt;/p&gt;
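&lt;p&gt;To make those limits concrete, here is a minimal sketch of a pre-flight check in JavaScript. It is not part of textract-cli: the &lt;code&gt;checkImage&lt;/code&gt; helper is hypothetical, it looks only at the file extension rather than the file contents, and it assumes "5MB" means 5 * 1024 * 1024 bytes.&lt;/p&gt;

```javascript
// Limits as described above for Textract's synchronous API.
// Assumption: "5MB" is taken to mean 5 * 1024 * 1024 bytes.
const MAX_BYTES = 5 * 1024 * 1024;
const ALLOWED_EXTENSIONS = [".jpg", ".jpeg", ".png"];

// Hypothetical helper (not part of textract-cli): decide whether a file
// looks like something the synchronous API could accept.
// It checks the extension only, which is a heuristic, not a format check.
function checkImage(filename, sizeInBytes) {
  const lower = filename.toLowerCase();
  const extensionOk = ALLOWED_EXTENSIONS.some((ext) => lower.endsWith(ext));
  if (!extensionOk) {
    return { ok: false, reason: "not a JPEG or PNG" };
  }
  if (sizeInBytes > MAX_BYTES) {
    return { ok: false, reason: "larger than 5MB" };
  }
  return { ok: true, reason: "" };
}
```

&lt;p&gt;A PDF fails the extension check in this sketch, matching the S3 requirement described above.&lt;/p&gt;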

&lt;p&gt;Assuming you’ve configured AWS credentials already, this is all you need to know:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;pipx install textract-cli
textract-cli image.jpeg &amp;gt; output.txt&lt;/code&gt;&lt;/pre&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/aws"&gt;aws&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;&lt;/p&gt;



</summary><category term="aws"/><category term="cli"/><category term="ocr"/><category term="projects"/></entry><entry><title>Running OCR against PDFs and images directly in your browser</title><link href="https://simonwillison.net/2024/Mar/30/ocr-pdfs-images/#atom-tag" rel="alternate"/><published>2024-03-30T17:59:56+00:00</published><updated>2024-03-30T17:59:56+00:00</updated><id>https://simonwillison.net/2024/Mar/30/ocr-pdfs-images/#atom-tag</id><summary type="html">
    &lt;p&gt;I attended the &lt;a href="https://biglocalnews.org/content/events/"&gt;Story Discovery At Scale&lt;/a&gt; data journalism conference at Stanford this week. One of the perennial hot topics at any journalism conference concerns data extraction: how can we best get data out of PDFs and images?&lt;/p&gt;
&lt;p&gt;I've been having some very promising results with Gemini Pro 1.5, Claude 3 and GPT-4 Vision recently - I'll write more about that soon. But those tools are still inconvenient for most people to use.&lt;/p&gt;
&lt;p&gt;Meanwhile, older tools like &lt;a href="https://github.com/tesseract-ocr/tesseract"&gt;Tesseract OCR&lt;/a&gt; are still extremely useful - if only they were easier to use as well.&lt;/p&gt;
&lt;p&gt;Then I remembered that Tesseract runs happily in a browser these days thanks to the excellent &lt;a href="https://tesseract.projectnaptha.com/"&gt;Tesseract.js&lt;/a&gt; project. And PDFs can be processed using JavaScript too thanks to Mozilla's extremely mature and well-tested &lt;a href="https://mozilla.github.io/pdf.js/"&gt;PDF.js&lt;/a&gt; library.&lt;/p&gt;
&lt;p&gt;So I built a new tool!&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://tools.simonwillison.net/ocr"&gt;tools.simonwillison.net/ocr&lt;/a&gt;&lt;/strong&gt; provides a single page web app that can run Tesseract OCR against images or PDFs that are opened in (or dragged and dropped onto) the app.&lt;/p&gt;
&lt;p&gt;Crucially, everything runs in the browser. There is no server component here, and nothing is uploaded. Your images and documents never leave your computer or phone.&lt;/p&gt;
&lt;p&gt;Here's an animated demo:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/ocr-demo.gif" alt="First an image file is dragged onto the page, which then shows that image and accompanying OCR text. Then the drop zone is clicked and a PDF file is selected - that PDF is rendered a page at a time down the page with OCR text displayed beneath each page." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;It's not perfect: multi-column PDFs (thanks, academia) will be treated as a single column, illustrations or photos may result in garbled ASCII-art and there are plenty of other edge cases that will trip it up.&lt;/p&gt;
&lt;p&gt;But... having Tesseract OCR available against PDFs in a web browser (including in Mobile Safari) is still a really useful thing.&lt;/p&gt;
&lt;h4 id="ocr-how-i-built-this"&gt;How I built this&lt;/h4&gt;
&lt;p&gt;&lt;em&gt;For more recent examples of projects I've built with the assistance of LLMs, see &lt;a href="https://simonwillison.net/2024/Mar/23/building-c-extensions-for-sqlite-with-chatgpt-code-interpreter/"&gt;Building and testing C extensions for SQLite with ChatGPT Code Interpreter&lt;/a&gt; and &lt;a href="https://simonwillison.net/2024/Mar/22/claude-and-chatgpt-case-study/"&gt;Claude and ChatGPT for ad-hoc sidequests&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I built the first version of this tool in just a few minutes, using Claude 3 Opus.&lt;/p&gt;
&lt;p&gt;I already had my own JavaScript code lying around for the two most important tasks: running Tesseract.js against an image and using PDF.js to turn a PDF into a series of images.&lt;/p&gt;
&lt;p&gt;The OCR code came from the system I built and explained in &lt;a href="https://simonwillison.net/2023/Aug/6/annotated-presentations/"&gt;How I make annotated presentations&lt;/a&gt; (built with the help of &lt;a href="https://simonwillison.net/2023/Aug/6/annotated-presentations/#chatgpt-sessions"&gt;multiple ChatGPT sessions&lt;/a&gt;). The PDF to images code was from an &lt;a href="https://gist.github.com/simonw/e58796324abb0e729b2dcd351f46728a#prompt-2"&gt;unfinished experiment&lt;/a&gt; which I wrote with the aid of Claude 3 Opus a week ago.&lt;/p&gt;
&lt;p&gt;I composed the following prompt for Claude 3, where I pasted in both of my code examples and then added some instructions about what I wanted it to build at the end:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This code shows how to open a PDF and turn it into an image per page:&lt;/p&gt;
&lt;div class="highlight highlight-text-html-basic"&gt;&lt;pre&gt;&lt;span class="pl-c1"&gt;&amp;lt;!DOCTYPE html&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;html&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;head&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;title&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;PDF to Images&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;title&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;script&lt;/span&gt; &lt;span class="pl-c1"&gt;src&lt;/span&gt;="&lt;span class="pl-s"&gt;https://cdnjs.cloudflare.com/ajax/libs/pdf.js/2.9.359/pdf.min.js&lt;/span&gt;"&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;script&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;style&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
    .image-container img {
      margin-bottom: 10px;
    }
    .image-container p {
      margin: 0;
      font-size: 14px;
      color: #888;
    }
  &lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;style&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;head&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;body&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;input&lt;/span&gt; &lt;span class="pl-c1"&gt;type&lt;/span&gt;="&lt;span class="pl-s"&gt;file&lt;/span&gt;" &lt;span class="pl-c1"&gt;id&lt;/span&gt;="&lt;span class="pl-s"&gt;fileInput&lt;/span&gt;" &lt;span class="pl-c1"&gt;accept&lt;/span&gt;="&lt;span class="pl-s"&gt;.pdf&lt;/span&gt;" /&amp;gt;
  &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;div&lt;/span&gt; &lt;span class="pl-c1"&gt;class&lt;/span&gt;="&lt;span class="pl-s"&gt;image-container&lt;/span&gt;"&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;div&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;

  &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;script&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;desiredWidth&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;800&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;fileInput&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;getElementById&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'fileInput'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;imageContainer&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelector&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'.image-container'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;

    &lt;span class="pl-s1"&gt;fileInput&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;addEventListener&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'change'&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s1"&gt;handleFileUpload&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;

    &lt;span class="pl-s1"&gt;pdfjsLib&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;GlobalWorkerOptions&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;workerSrc&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;'https://cdnjs.cloudflare.com/ajax/libs/pdf.js/2.9.359/pdf.worker.min.js'&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;

    &lt;span class="pl-k"&gt;async&lt;/span&gt; &lt;span class="pl-k"&gt;function&lt;/span&gt; &lt;span class="pl-en"&gt;handleFileUpload&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;event&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
      &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;file&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;event&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;target&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;files&lt;/span&gt;&lt;span class="pl-kos"&gt;[&lt;/span&gt;&lt;span class="pl-c1"&gt;0&lt;/span&gt;&lt;span class="pl-kos"&gt;]&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
      &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;imageIterator&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;convertPDFToImages&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;file&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;

      &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt; imageURL&lt;span class="pl-kos"&gt;,&lt;/span&gt; size &lt;span class="pl-kos"&gt;}&lt;/span&gt; &lt;span class="pl-k"&gt;of&lt;/span&gt; &lt;span class="pl-s1"&gt;imageIterator&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
        &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;imgElement&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;createElement&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'img'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
        &lt;span class="pl-s1"&gt;imgElement&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;src&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;imageURL&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
        &lt;span class="pl-s1"&gt;imageContainer&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;appendChild&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;imgElement&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;

        &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;sizeElement&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;createElement&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'p'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
        &lt;span class="pl-s1"&gt;sizeElement&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;textContent&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;`Size: &lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;${&lt;/span&gt;&lt;span class="pl-en"&gt;formatSize&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;size&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;`&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
        &lt;span class="pl-s1"&gt;imageContainer&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;appendChild&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;sizeElement&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
      &lt;span class="pl-kos"&gt;}&lt;/span&gt;
    &lt;span class="pl-kos"&gt;}&lt;/span&gt;

    &lt;span class="pl-k"&gt;async&lt;/span&gt; &lt;span class="pl-k"&gt;function&lt;/span&gt;&lt;span class="pl-c1"&gt;*&lt;/span&gt; &lt;span class="pl-s1"&gt;convertPDFToImages&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;file&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
      &lt;span class="pl-k"&gt;try&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
        &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;pdf&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-s1"&gt;pdfjsLib&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;getDocument&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-c1"&gt;URL&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;createObjectURL&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;file&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;promise&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
        &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;numPages&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;pdf&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;numPages&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;

        &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-k"&gt;let&lt;/span&gt; &lt;span class="pl-s1"&gt;i&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt; &lt;span class="pl-s1"&gt;i&lt;/span&gt; &lt;span class="pl-c1"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;numPages&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt; &lt;span class="pl-s1"&gt;i&lt;/span&gt;&lt;span class="pl-c1"&gt;++&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
          &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;page&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-s1"&gt;pdf&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;getPage&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;i&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
          &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;viewport&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;page&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;getViewport&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt; &lt;span class="pl-c1"&gt;scale&lt;/span&gt;: &lt;span class="pl-c1"&gt;1&lt;/span&gt; &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
          &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;canvas&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;createElement&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'canvas'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
          &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;context&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;canvas&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;getContext&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'2d'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
          &lt;span class="pl-s1"&gt;canvas&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;width&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;desiredWidth&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
          &lt;span class="pl-s1"&gt;canvas&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;height&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;desiredWidth&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; &lt;span class="pl-s1"&gt;viewport&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;width&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-c1"&gt;*&lt;/span&gt; &lt;span class="pl-s1"&gt;viewport&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;height&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
          &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;renderContext&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
            &lt;span class="pl-c1"&gt;canvasContext&lt;/span&gt;: &lt;span class="pl-s1"&gt;context&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
            &lt;span class="pl-c1"&gt;viewport&lt;/span&gt;: &lt;span class="pl-s1"&gt;page&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;getViewport&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt; &lt;span class="pl-c1"&gt;scale&lt;/span&gt;: &lt;span class="pl-s1"&gt;desiredWidth&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; &lt;span class="pl-s1"&gt;viewport&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;width&lt;/span&gt; &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
          &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
          &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-s1"&gt;page&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;render&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;renderContext&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;promise&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
          &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;imageURL&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;canvas&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;toDataURL&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'image/jpeg'&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;0.8&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
          &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;size&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;calculateSize&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;imageURL&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
          &lt;span class="pl-k"&gt;yield&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt; imageURL&lt;span class="pl-kos"&gt;,&lt;/span&gt; size &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
        &lt;span class="pl-kos"&gt;}&lt;/span&gt;
      &lt;span class="pl-kos"&gt;}&lt;/span&gt; &lt;span class="pl-k"&gt;catch&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;error&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
        &lt;span class="pl-smi"&gt;console&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;error&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'Error:'&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s1"&gt;error&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
      &lt;span class="pl-kos"&gt;}&lt;/span&gt;
    &lt;span class="pl-kos"&gt;}&lt;/span&gt;

    &lt;span class="pl-k"&gt;function&lt;/span&gt; &lt;span class="pl-en"&gt;calculateSize&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;imageURL&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
      &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;base64Length&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;imageURL&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;length&lt;/span&gt; &lt;span class="pl-c1"&gt;-&lt;/span&gt; &lt;span class="pl-s"&gt;'data:image/jpeg;base64,'&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;length&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
      &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;sizeInBytes&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-v"&gt;Math&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;ceil&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;base64Length&lt;/span&gt; &lt;span class="pl-c1"&gt;*&lt;/span&gt; &lt;span class="pl-c1"&gt;0.75&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
      &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-s1"&gt;sizeInBytes&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-kos"&gt;}&lt;/span&gt;

    &lt;span class="pl-k"&gt;function&lt;/span&gt; &lt;span class="pl-en"&gt;formatSize&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;size&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
      &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;sizeInKB&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;size&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;1024&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;toFixed&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-c1"&gt;2&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
      &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-s"&gt;`&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;${&lt;/span&gt;&lt;span class="pl-s1"&gt;sizeInKB&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt; KB`&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-kos"&gt;}&lt;/span&gt;
  &lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;script&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;body&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;html&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This code shows how to OCR an image:&lt;/p&gt;
&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;async&lt;/span&gt; &lt;span class="pl-k"&gt;function&lt;/span&gt; &lt;span class="pl-en"&gt;ocrMissingAltText&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
    &lt;span class="pl-c"&gt;// Load Tesseract&lt;/span&gt;
    &lt;span class="pl-k"&gt;var&lt;/span&gt; &lt;span class="pl-s1"&gt;s&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;createElement&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;"script"&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-s1"&gt;s&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;src&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"https://unpkg.com/tesseract.js@v2.1.0/dist/tesseract.min.js"&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;head&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;appendChild&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;s&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;

    &lt;span class="pl-s1"&gt;s&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;onload&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-k"&gt;async&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-c1"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
      &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;images&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;getElementsByTagName&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;"img"&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
      &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;worker&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-v"&gt;Tesseract&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;createWorker&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
      &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-s1"&gt;worker&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;load&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
      &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-s1"&gt;worker&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;loadLanguage&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;"eng"&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
      &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-s1"&gt;worker&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;initialize&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;"eng"&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
      &lt;span class="pl-s1"&gt;ocrButton&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;innerText&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"Running OCR..."&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;

      &lt;span class="pl-c"&gt;// Iterate through all the images in the output div&lt;/span&gt;
      &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;img&lt;/span&gt; &lt;span class="pl-k"&gt;of&lt;/span&gt; &lt;span class="pl-s1"&gt;images&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
        &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;altTextarea&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;img&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;parentNode&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelector&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;".textarea-alt"&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
        &lt;span class="pl-c"&gt;// Check if the alt textarea is empty&lt;/span&gt;
        &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;altTextarea&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;value&lt;/span&gt; &lt;span class="pl-c1"&gt;===&lt;/span&gt; &lt;span class="pl-s"&gt;""&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
          &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;imageUrl&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;img&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;src&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
          &lt;span class="pl-k"&gt;var&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
            &lt;span class="pl-c1"&gt;data&lt;/span&gt;: &lt;span class="pl-kos"&gt;{&lt;/span&gt; text &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
          &lt;span class="pl-kos"&gt;}&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-s1"&gt;worker&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;recognize&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;imageUrl&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
          &lt;span class="pl-s1"&gt;altTextarea&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;value&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;text&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt; &lt;span class="pl-c"&gt;// Set the OCR result to the alt textarea&lt;/span&gt;
          &lt;span class="pl-s1"&gt;progressBar&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;value&lt;/span&gt; &lt;span class="pl-c1"&gt;+=&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
        &lt;span class="pl-kos"&gt;}&lt;/span&gt;
      &lt;span class="pl-kos"&gt;}&lt;/span&gt;

      &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-s1"&gt;worker&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;terminate&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
      &lt;span class="pl-s1"&gt;ocrButton&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;innerText&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"OCR complete"&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Use these examples to put together a single HTML page with embedded HTML and CSS and JavaScript that provides a big square which users can drag and drop a PDF file onto and when they do that the PDF has every page converted to a JPEG and shown below on the page, then OCR is run with tesseract and the results are shown in textarea blocks below each image.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I saved this prompt to a &lt;code&gt;prompt.txt&lt;/code&gt; file and ran it using my &lt;a href="https://github.com/simonw/llm-claude-3"&gt;llm-claude-3&lt;/a&gt; plugin for &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m claude-3-opus &lt;span class="pl-k"&gt;&amp;lt;&lt;/span&gt; prompt.txt&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It gave me &lt;a href="https://static.simonwillison.net/static/2024/pdf-ocr-v1.html"&gt;a working initial version&lt;/a&gt; on the first attempt!&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/ocr-v1.jpg" alt="A square dotted border around the text Drag and drop PDF file here" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://gist.github.com/simonw/6a9f077bf8db616e44893a24ae1d36eb"&gt;Here's the full transcript&lt;/a&gt;, including my follow-up prompts and their responses. Iterating on software in this way is &lt;em&gt;so&lt;/em&gt; much fun.&lt;/p&gt;
&lt;p&gt;First follow-up:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Modify this to also have a file input that can be used - dropping a file onto the drop area fills that input&lt;/p&gt;
&lt;p&gt;make the drop zone 100% wide but have a 2em padding on the body. it should be 10em high. it should turn pink when an image is dragged over it.&lt;/p&gt;
&lt;p&gt;Each textarea should be 100% wide and 10em high&lt;/p&gt;
&lt;p&gt;At the very bottom of the page add a h2 that says Full document - then a 30em high textarea with all of the page text in it separated by two newlines&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://static.simonwillison.net/static/2024/pdf-ocr-v2.html"&gt;Here's the interactive result&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/ocr-v2.jpg" alt="A PDF file is dragged over the box and it turned pink. The heading Full document displays below" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Rather delightfully it used the neater pattern where the file input itself is hidden but can be triggered by clicking on the large drop zone, and it updated the copy on the drop zone to reflect that - without me suggesting those requirements.&lt;/p&gt;
&lt;p&gt;And then:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;get rid of the code that shows image sizes. Set the placeholder on each textarea to be Processing... and clear that placeholder when the job is done.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://static.simonwillison.net/static/2024/pdf-ocr-v3.html"&gt;Which gave me this&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I realized it would be useful if it could handle non-PDF images as well. So I fired up ChatGPT (for no reason other than curiosity to see how well it did) and got GPT-4 to add that feature for me. I &lt;a href="https://chat.openai.com/share/665eca31-3b5d-4cd9-a3cb-85ab608169a6"&gt;pasted in the code so far and added&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Modify this so jpg and png and gif images can be dropped or opened too - they skip the PDF step and get appended to the page and OCRd directly. Also move the full document heading and textarea above the page preview and hide it u til there is data to be shown in it&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Then I spotted that the Tesseract worker was being created multiple times in a loop, which is inefficient - so I prompted:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Create the worker once and use it for all OCR tasks and terminate it at the end&lt;/p&gt;
&lt;/blockquote&gt;
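&lt;p&gt;A minimal sketch of that create-once pattern, assuming a worker object with the same &lt;code&gt;recognize()&lt;/code&gt; and &lt;code&gt;terminate()&lt;/code&gt; methods used in the example code above. The &lt;code&gt;ocrAll&lt;/code&gt; name and the injected &lt;code&gt;createWorker&lt;/code&gt; factory are hypothetical, chosen so the sketch runs without Tesseract.js loaded; this is not the code GPT-4 produced:&lt;/p&gt;

```javascript
// Sketch only: createWorker is passed in as an argument (an assumption),
// so the function can be exercised with a fake worker outside the browser.
async function ocrAll(createWorker, imageUrls) {
  const worker = await createWorker(); // created once, outside the loop
  const texts = [];
  try {
    for (const url of imageUrls) {
      // Same result shape as the example above: { data: { text } }
      const result = await worker.recognize(url);
      texts.push(result.data.text);
    }
  } finally {
    // Terminated exactly once, even if recognize() throws part way through
    await worker.terminate();
  }
  return texts;
}
```

&lt;p&gt;The &lt;code&gt;try&lt;/code&gt;/&lt;code&gt;finally&lt;/code&gt; is a design choice of the sketch rather than something the prompt asked for: it keeps a failed page from leaving the worker running.&lt;/p&gt;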
&lt;p&gt;I'd tweaked the HTML and CSS a little before feeding it to GPT-4, so now the site had a title and rendered in Helvetica.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://static.simonwillison.net/static/2024/pdf-ocr-v4.html"&gt;the version GPT-4 produced for me&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/ocr-v4.jpg" alt="A heading reads OCR a PDF or Image - This tool runs entirely in your browser. No files are uploaded to a server. The dotted box now contains text that reads Drag and drop a PDF, JPG, PNG, or GIF file here or click to select a file" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4 id="ocr-finishing-touches"&gt;Manual finishing touches&lt;/h4&gt;
&lt;p&gt;Fun though it was iterating on this project entirely through prompting, I decided it would be more productive to make the finishing touches myself. You can see those &lt;a href="https://github.com/simonw/tools/commits/cc609194a0d0a54c2ae676dae962e14b3e3a9d22/"&gt;in the commit history&lt;/a&gt;. They're not particularly interesting:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;I added &lt;a href="https://plausible.io/"&gt;Plausible&lt;/a&gt; analytics (which I like because they use no cookies).&lt;/li&gt;
&lt;li&gt;I added better progress indicators, including the text that shows how many pages of the PDF have been processed so far.&lt;/li&gt;
&lt;li&gt;I bumped up the width of the rendered PDF page images from 800 to 1000. This seemed to improve OCR quality - in particular, the &lt;a href="https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf"&gt;Claude 3 model card PDF&lt;/a&gt; now has fewer OCR errors than it did before.&lt;/li&gt;
&lt;li&gt;I upgraded both Tesseract.js and PDF.js to the most recent versions. Unsurprisingly, Claude 3 Opus had used older versions of both libraries.&lt;/li&gt;
&lt;/ul&gt;
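&lt;p&gt;To illustrate the width change in that list: rendering a page at a fixed pixel width comes down to picking a scale factor from the page's unscaled width. This is a rough sketch with a hypothetical &lt;code&gt;scaleForWidth&lt;/code&gt; helper, not the code from the tool itself, and the 612 point page width is just an example value (US Letter):&lt;/p&gt;

```javascript
// Hypothetical helper: the scale factor needed to render a page at a target
// pixel width. unscaledWidth is the page width at scale 1.
function scaleForWidth(unscaledWidth, targetWidth) {
  if (!(unscaledWidth > 0)) {
    throw new RangeError("unscaledWidth must be a positive number");
  }
  return targetWidth / unscaledWidth;
}

// Example values only: a US Letter page is 612 points wide at scale 1.
const scaleAt800 = scaleForWidth(612, 800);
const scaleAt1000 = scaleForWidth(612, 1000);
console.log(scaleAt800, scaleAt1000);
```

&lt;p&gt;Going from 800 to 1000 gives Tesseract 25% more pixels along each axis to work with, which fits the improvement in OCR quality described above.&lt;/p&gt;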
&lt;p&gt;I'm really pleased with this project. I consider it &lt;em&gt;finished&lt;/em&gt; - it does the job I designed it to do and I don't see any need to keep on iterating on it. And because it's all static JavaScript and WebAssembly I expect it to continue working effectively forever.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update:&lt;/strong&gt; OK, a few more features: I added &lt;a href="https://github.com/simonw/tools/issues/4"&gt;language selection&lt;/a&gt;, &lt;a href="https://github.com/simonw/tools/issues/7"&gt;paste support&lt;/a&gt; and some &lt;a href="https://github.com/simonw/tools/issues/8"&gt;basic automated tests&lt;/a&gt; using Playwright Python.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tesseract"&gt;tesseract&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/webassembly"&gt;webassembly&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="data-journalism"/><category term="ocr"/><category term="pdf"/><category term="projects"/><category term="tesseract"/><category term="webassembly"/><category term="ai-assisted-programming"/></entry><entry><title>unstructured</title><link href="https://simonwillison.net/2024/Feb/2/unstructured/#atom-tag" rel="alternate"/><published>2024-02-02T02:47:15+00:00</published><updated>2024-02-02T02:47:15+00:00</updated><id>https://simonwillison.net/2024/Feb/2/unstructured/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/Unstructured-IO/unstructured"&gt;unstructured&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Relatively new but impressively capable Python library (Apache 2 licensed) for extracting information from unstructured documents, such as PDFs, images, Word documents and many other formats.&lt;/p&gt;

&lt;p&gt;I got some good initial results against a PDF by running “pip install ’unstructured[pdf]’” and then using the “unstructured.partition.pdf.partition_pdf(filename)” function.&lt;/p&gt;

&lt;p&gt;There are a lot of moving parts under the hood: pytesseract, OpenCV, various PDF libraries, even an ONNX model—but it installed cleanly for me on macOS and worked out of the box.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/onnx"&gt;onnx&lt;/a&gt;&lt;/p&gt;



</summary><category term="ocr"/><category term="pdf"/><category term="python"/><category term="onnx"/></entry><entry><title>Our search for the best OCR tool in 2023, and what we found</title><link href="https://simonwillison.net/2023/Oct/31/our-search-for-the-best-ocr-tool-in-2023-and-what-we-found/#atom-tag" rel="alternate"/><published>2023-10-31T19:21:53+00:00</published><updated>2023-10-31T19:21:53+00:00</updated><id>https://simonwillison.net/2023/Oct/31/our-search-for-the-best-ocr-tool-in-2023-and-what-we-found/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://source.opennews.org/articles/our-search-best-ocr-tool-2023/"&gt;Our search for the best OCR tool in 2023, and what we found&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
DocumentCloud’s Sanjin Ibrahimovic reviews the best options for OCR. Tesseract scores highly for easily machine readable text, newcomer docTR is great for ease of use but still not great at handwriting. Amazon Textract is great for everything except non-Latin languages, Google Cloud Vision is great at pretty much everything except for ease-of-use. Azure AI Document Intelligence sounds worth considering as well.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/documentcloud"&gt;documentcloud&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;&lt;/p&gt;



</summary><category term="documentcloud"/><category term="ocr"/></entry><entry><title>How I make annotated presentations</title><link href="https://simonwillison.net/2023/Aug/6/annotated-presentations/#atom-tag" rel="alternate"/><published>2023-08-06T17:15:33+00:00</published><updated>2023-08-06T17:15:33+00:00</updated><id>https://simonwillison.net/2023/Aug/6/annotated-presentations/#atom-tag</id><summary type="html">
    &lt;p&gt;Giving a talk is a lot of work. I go by a rule of thumb I learned from &lt;a href="https://en.wikipedia.org/wiki/Damian_Conway"&gt;Damian Conway&lt;/a&gt;: a minimum of ten hours of preparation for every one hour spent on stage.&lt;/p&gt;
&lt;p&gt;If you're going to put that much work into something, I think it's worth taking steps to maximize the value that work produces - both for you and for your audience.&lt;/p&gt;
&lt;p&gt;One of my favourite ways of getting "paid" for a talk is when the event puts in the work to produce a really good video of that talk, and then shares that video online. &lt;a href="https://2023.northbaypython.org"&gt;North Bay Python&lt;/a&gt; is a fantastic example of an event that does this well: they team up with &lt;a href="https://nextdayvideo.com"&gt;Next Day Video&lt;/a&gt; and &lt;a href="https://whitecoatcaptioning.com"&gt;White Coat Captioning&lt;/a&gt; and have talks professionally recorded, captioned and uploaded to YouTube within 24 hours of the talk being given.&lt;/p&gt;
&lt;p&gt;Even with that quality of presentation, I don't think a video on its own is enough. My most recent talk was 40 minutes long - I'd love people to watch it, but I myself watch very few 40m long YouTube videos each year.&lt;/p&gt;
&lt;p&gt;So I like to publish my talks with a text and image version of the talk that can provide as much of the value as possible to people who don't have the time or inclination to sit through a 40m talk (or 20m if you run it at 2x speed, which I do for many of the talks I watch myself).&lt;/p&gt;
&lt;h4&gt;Annotated presentations&lt;/h4&gt;
&lt;p&gt;My preferred format for publishing these documents is as an &lt;em&gt;annotated presentation&lt;/em&gt; - a single document (no clicking "next" dozens of times) combining key slides from the talk with custom written text to accompany each one, plus additional links and resources.&lt;/p&gt;
&lt;p&gt;Here's my most recent example: &lt;a href="https://simonwillison.net/2023/Aug/3/weird-world-of-llms/"&gt;Catching up on the weird world of LLMs&lt;/a&gt;, from North Bay Python last week.&lt;/p&gt;
&lt;p&gt;More examples (see also my &lt;a href="https://simonwillison.net/tags/annotated-talks/"&gt;annotated-talks tag&lt;/a&gt;):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2023/May/2/prompt-injection-explained/"&gt;Prompt injection explained, with video, slides, and a transcript&lt;/a&gt; for a LangChain webinar in May 2023.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2022/Nov/26/productivity/"&gt;Coping strategies for the serial project hoarder&lt;/a&gt; for DjangoCon US 2022.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2021/Nov/4/publish-open-source-python-library/"&gt;How to build, test and publish an open source Python library&lt;/a&gt; for PyGotham 2021&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2021/Feb/7/video/"&gt;Video introduction to Datasette and sqlite-utils&lt;/a&gt; for FOSDEM February 2021&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2021/Jul/22/small-data/"&gt;Datasette—an ecosystem of tools for working with small data&lt;/a&gt; for PyGotham 2020.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2020/Nov/14/personal-data-warehouses/"&gt;Personal Data Warehouses: Reclaiming Your Data&lt;/a&gt; for the GitHub OCTO speaker series in November 2020.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://static.simonwillison.net/static/2010/redis-tutorial/"&gt;Redis tutorial&lt;/a&gt; for NoSQL Europe 2010 (my first attempt at this format).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I don't tend to write a detailed script for my talks in advance. If I did, I might use that as a starting point, but I usually prepare the outline of the talk and then give it off-the-cuff on the day. I find this fits my style (best described as "enthusiastic rambling") better.&lt;/p&gt;
&lt;p&gt;Instead, I'll assemble notes for each slide from re-watching the video after it has been released.&lt;/p&gt;
&lt;p&gt;I don't just cover the things I said in the talk - I'll also add additional context, and links to related resources. The annotated presentation isn't just for people who didn't watch the talk, it's aimed at providing extra context for people who did watch it as well.&lt;/p&gt;
&lt;h4&gt;A custom tool for building annotated presentations&lt;/h4&gt;
&lt;p&gt;For this most recent talk I finally built something I've been wanting for &lt;em&gt;years&lt;/em&gt;: a custom tool to help me construct the annotated presentation as quickly as possible.&lt;/p&gt;
&lt;p&gt;Annotated presentations look deceptively simple: each slide is an image and one or two paragraphs of text.&lt;/p&gt;
&lt;p&gt;There are a few extra details though:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The images really need good &lt;code&gt;alt=&lt;/code&gt; text - a big part of the information in the presentation is conveyed by those images, so they need to have good descriptions both for screen reader users and to index in search engines / for retrieval augmented generation.&lt;/li&gt;
&lt;li&gt;Presentations might have dozens of slides - just assembling the image tags in the correct order can be a frustrating task.&lt;/li&gt;
&lt;li&gt;For editing the annotations I like to use Markdown, as it's quicker to write than HTML. Making this as easy as possible encourages me to add more links, bullet points and code snippets.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;One of my favourite use-cases for tools like ChatGPT is to quickly create one-off custom tools. This was a perfect fit for that.&lt;/p&gt;
&lt;p&gt;You can see the tool I created here: &lt;a href="https://til.simonwillison.net/tools/annotated-presentations"&gt;Annotated presentation creator&lt;/a&gt; (&lt;a href="https://github.com/simonw/til/blob/main/templates/pages/tools/annotated-presentations.html"&gt;source code here&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;The first step is to export the slides as images, being sure to have filenames which sort alphabetically in the correct order. I use Apple Keynote for my slides and it has an "Export" feature which does this for me.&lt;/p&gt;
&lt;p&gt;Next, open those images using the annotation tool.&lt;/p&gt;
&lt;p&gt;The tool is written in JavaScript and works entirely in your browser - it asks you to select images but doesn't actually upload them to a server, just displays them directly inline in the page.&lt;/p&gt;
&lt;p&gt;Anything you type in a &lt;code&gt;textarea&lt;/code&gt; as work-in-progress will be saved to &lt;code&gt;localStorage&lt;/code&gt;, so a browser crash or restart shouldn't lose any of your work.&lt;/p&gt;
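&lt;p&gt;A minimal sketch of that autosave idea, not the tool's actual source. The &lt;code&gt;savedTextAreas&lt;/code&gt; key and the once-a-second interval come from one of the ChatGPT prompts linked further down; the function names and the injected &lt;code&gt;storage&lt;/code&gt; argument are assumptions made so the sketch can run outside a browser:&lt;/p&gt;

```javascript
// storage is any object with getItem/setItem (window.localStorage in a browser).
// textareas is any iterable of objects with name and value properties.
function saveTextareas(textareas, storage) {
  const saved = {};
  for (const ta of textareas) {
    saved[ta.name] = ta.value; // keyed by the name= attribute
  }
  storage.setItem("savedTextAreas", JSON.stringify(saved));
  return saved;
}

function restoreTextareas(textareas, storage) {
  // Fall back to an empty object when nothing has been saved yet
  const saved = JSON.parse(storage.getItem("savedTextAreas") || "{}");
  for (const ta of textareas) {
    if (ta.name in saved) {
      ta.value = saved[ta.name];
    }
  }
}

// In a browser this could be wired up as:
// setInterval(() => saveTextareas(document.querySelectorAll("textarea"), localStorage), 1000);
```

&lt;p&gt;Saving on a timer rather than on every keystroke keeps the wiring trivial, at the cost of possibly losing the last second of typing.&lt;/p&gt;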
&lt;p&gt;It uses &lt;a href="https://tesseract.projectnaptha.com/"&gt;Tesseract.js&lt;/a&gt; to run OCR against your images, providing a starting point for the &lt;code&gt;alt=&lt;/code&gt; attributes for each slide.&lt;/p&gt;
&lt;p&gt;Annotations can be entered in Markdown and are rendered to HTML as a live preview using the &lt;a href="https://marked.js.org/"&gt;Marked&lt;/a&gt; library.&lt;/p&gt;
&lt;p&gt;Finally, it offers a templating mechanism for the final output, which works using JavaScript template literals. So once you've finished editing the &lt;code&gt;alt=&lt;/code&gt; text and writing the annotations, click "Execute template" at the bottom of the page and copy out the resulting HTML.&lt;/p&gt;
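&lt;p&gt;One way a template-literal mechanism like this can work is sketched below. The &lt;code&gt;executeTemplates(template, arrayOfObjects)&lt;/code&gt; signature comes from one of the ChatGPT prompts linked further down, but the body is an assumption rather than the tool's real implementation, and the &lt;code&gt;index&lt;/code&gt; and &lt;code&gt;alt&lt;/code&gt; fields are made up for the example:&lt;/p&gt;

```javascript
// Sketch: evaluate a user-supplied string as a JavaScript template literal,
// once per object, with the object's keys available as variables.
// Assumes the keys are valid identifiers and the template has no backticks.
function executeTemplates(template, arrayOfObjects) {
  return arrayOfObjects.map((obj) => {
    const names = Object.keys(obj);
    const values = names.map((name) => obj[name]);
    // The function body is the template wrapped in backticks
    const render = new Function(...names, "return `" + template + "`;");
    return render(...values);
  });
}

const slides = [
  { index: 1, alt: "Title slide" },
  { index: 2, alt: "Agenda" },
];
console.log(executeTemplates("Slide ${index}: ${alt}", slides).join("\n"));
```

&lt;p&gt;Because the template is executed as code, this approach only makes sense for templates you write yourself, which is the case for a personal tool running in your own browser.&lt;/p&gt;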
&lt;p&gt;Here's an animated GIF demo of the tool in action:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2023/annotated-presentation-creator.gif" alt="Animated demo of the tool. I load 90 images, each one of which becomes a slide. Then I click the OCR button and it starts populating the alt textareas with OCR text from the slides. I type some markdown into an annotation box, then scroll to the bottom and click the Execute template button to get back the final HTML." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p id="chatgpt-sessions"&gt;I ended up putting this together with the help of multiple different ChatGPT sessions. You can see those here:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://chat.openai.com/share/61cd85f6-7002-4676-b204-0349a723232a"&gt;HTML and JavaScript in a single document to create an app that lets me do the following...&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://chat.openai.com/share/5218799e-0423-49ad-88ba-c72ee27e3fe3"&gt;JavaScript and HTML app on one page. User can select multiple image files on their own computer...&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://chat.openai.com/share/7867657b-aa29-4ad0-8ab3-1d353c29a224"&gt;JavaScript that runs once every 1s and builds a JavaScript object of every textarea on the page where the key is the name= attribute of that textarea and the value is its current contents. That whole object is then stored in localStorage in a key called savedTextAreas...&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://chat.openai.com/share/4e6fd644-de57-4597-a1cc-412483c2adf3"&gt;Write a JavaScript function like this: executeTemplates(template, arrayOfObjects)...&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Cleaning up the transcript with Claude&lt;/h4&gt;
&lt;p&gt;Since the video was already up on YouTube when I started writing the annotations, I decided to see if I could get a head start on writing them using the YouTube generated transcript.&lt;/p&gt;
&lt;p&gt;I used my &lt;a href="https://simonwillison.net/2022/Sep/30/action-transcription/"&gt;Action Transcription&lt;/a&gt; tool to extract the transcript, but it was pretty low quality - you can see &lt;a href="https://gist.github.com/simonw/3d8a335244711c675c456db147aa05fa"&gt;a copy of it here&lt;/a&gt;. A sample:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;okay hey everyone it's uh really
exciting to be here so yeah I call this
court talk catching up on the weird
world of llms I'm going to try and give
you the last few years of of llm
developments in 35 minutes this is
impossible so uh hopefully I'll at least
give you a flavor of some of the weirder
corners of the space because the thing
about language models is the more I look
at the more I think they're practically
interesting any particular aspect of
them anything at all if you zoom in
there are just more questions there are
just more unknowns about it there are
more interesting things to get into lots
of them are deeply disturbing and
unethical lots of them are fascinating
it's um I've called it um it's it's
impossible to tear myself away from this
I I just keep on keep on finding new
aspects of it that are interesting
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It's basically one big run-on sentence, with no punctuation, little capitalization and lots of umms and ahs.&lt;/p&gt;
&lt;p&gt;Anthropic's &lt;a href="https://claude.ai"&gt;Claude 2&lt;/a&gt; was &lt;a href="https://www.anthropic.com/index/claude-2"&gt;released last month&lt;/a&gt; and supports up to 100,000 tokens per prompt - a huge improvement on ChatGPT (4,000) and GPT-4 (8,000). I decided to see if I could use that to clean up my transcript.&lt;/p&gt;
&lt;p&gt;I pasted it into Claude and tried a few prompts... until I hit upon this one:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Reformat this transcript into paragraphs and sentences, fix the capitalization and make very light edits such as removing ums&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2023/claude-transcript.jpg" alt="Claude interface: Taming Large Language Models. I have pasted in a paste.txt file with 42KB of data, then prompted it to reformat. It outputs Here is the reformatted transcript: followed by that transcript." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;This worked really, really well! Here's the first paragraph it produced, based on the transcript I show above:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Okay everyone, it's really exciting to be here. Yeah I call this talk "Catching Up on the Weird World of LLMs." I'm going to try and give you the last few years of LLMs developments in 35 minutes. This is impossible, so hopefully I'll at least give you a flavor of some of the weirder corners of the space. The thing about language models is the more I look at them, the more I think they're practically interesting. Focus on any particular aspect, and there are just more questions, more unknowns, more interesting things to get into.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Note that I said "fractally interesting", not "practically interesting" - but that error was there in the YouTube transcript, so Claude picked it up from there.&lt;/p&gt;
&lt;p&gt;Here's the &lt;a href="https://gist.github.com/simonw/f6d83d69cca018c07b58aaadfb4c918c"&gt;full generated transcript&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It's really impressive! At one point it even turns my dialogue into a set of bullet points:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Today the best are ChatGPT (aka GPT-3.5 Turbo), GPT-4 for capability, and Claude 2 which is free. Google has PaLM 2 and Bard. Llama and Claude are from Anthropic, a splinter of OpenAI focused on ethics. Google and Meta are the other big players.&lt;/p&gt;
&lt;p&gt;Some tips:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;OpenAI models cutoff at September 2021 training data. Anything later isn't in there. This reduces issues like recycling their own text.&lt;/li&gt;
&lt;li&gt;Claude and Palm have more recent data, so I'll use them for recent events.&lt;/li&gt;
&lt;li&gt;Always consider context length. GPT has 4,000 tokens, GPT-4 has 8,000, Claude 100,000.&lt;/li&gt;
&lt;li&gt;If a friend who read the Wikipedia article could answer my question, I'm confident feeding it in directly. The more obscure, the more likely pure invention.&lt;/li&gt;
&lt;li&gt;Avoid superstitious thinking. Long prompts that "always work" are usually mostly pointless.&lt;/li&gt;
&lt;li&gt;Develop an immunity to hallucinations. Notice signs and check answers.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Compare that to &lt;a href="https://gist.github.com/simonw/3d8a335244711c675c456db147aa05fa#file-transcription-txt-L327-L469"&gt;my rambling original&lt;/a&gt; to see quite how much of an improvement this is.&lt;/p&gt;
&lt;p&gt;But, all of that said... I specified "make very light edits" and it clearly did a whole lot more than just that.&lt;/p&gt;
&lt;p&gt;I didn't use the Claude version directly. Instead, I copied and pasted chunks of it into my annotation tool that made the most sense, then directly edited them to better fit what I was trying to convey.&lt;/p&gt;
&lt;p&gt;As with so many things in LLM/AI land: a significant time saver, but no silver bullet.&lt;/p&gt;
&lt;h4&gt;For workshops, publish the handout&lt;/h4&gt;
&lt;p&gt;I took the Software Carpentries &lt;a href="https://carpentries.org/become-instructor/"&gt;instructor training&lt;/a&gt; a few years ago, which was a really great experience.&lt;/p&gt;
&lt;p&gt;A key idea I got from that is that a great way to run a workshop is to prepare an extensive, detailed handout in advance - and then spend the actual workshop time working through that handout yourself, at a sensible pace, in a way that lets the attendees follow along.&lt;/p&gt;
&lt;p&gt;A bonus of this approach is that it forces you to put together a really high quality handout which you can distribute after the event.&lt;/p&gt;
&lt;p&gt;I used this approach for the 3 hour workshop I ran at PyCon US 2023: &lt;a href="https://datasette.io/tutorials/data-analysis"&gt;Data analysis with SQLite and Python&lt;/a&gt;. I turned that into a new official tutorial on the Datasette website, accompanied by the video but also useful for people who don't want to spend three hours watching me talk!&lt;/p&gt;
&lt;h4&gt;More people should do this&lt;/h4&gt;
&lt;p&gt;I'm writing this in the hope that I can inspire more people to give their talks this kind of treatment. It's not a zero amount of work - it takes me 2-3 hours any time I do this - but it greatly increases the longevity of the talk and ensures that the work I've already put into it provides maximum value, both to myself (giving talks is partly a selfish act!) and to the people I want to benefit from it.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/alt-text"&gt;alt-text&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/localstorage"&gt;localstorage&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/speaking"&gt;speaking&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/my-talks"&gt;my-talks&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tools"&gt;tools&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/annotated-talks"&gt;annotated-talks&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="alt-text"/><category term="localstorage"/><category term="ocr"/><category term="projects"/><category term="speaking"/><category term="my-talks"/><category term="tools"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="anthropic"/><category term="claude"/><category term="annotated-talks"/></entry><entry><title>textra</title><link href="https://simonwillison.net/2023/Mar/23/textra/#atom-tag" rel="alternate"/><published>2023-03-23T21:08:51+00:00</published><updated>2023-03-23T21:08:51+00:00</updated><id>https://simonwillison.net/2023/Mar/23/textra/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/freedmand/textra"&gt;textra&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Tiny (432KB) macOS binary CLI tool by Dylan Freedman which produces high quality text extraction from PDFs, images and even audio files using the VisionKit APIs in macOS 13 and higher. It handles handwriting too!

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/agarcia_me/status/1639001713461071873"&gt;Alex Garcia&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/audio"&gt;audio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/macos"&gt;macos&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;&lt;/p&gt;



</summary><category term="audio"/><category term="cli"/><category term="macos"/><category term="ocr"/><category term="pdf"/></entry><entry><title>Building a searchable archive for the San Francisco Microscopical Society</title><link href="https://simonwillison.net/2022/Aug/25/sfms-archive/#atom-tag" rel="alternate"/><published>2022-08-25T17:24:50+00:00</published><updated>2022-08-25T17:24:50+00:00</updated><id>https://simonwillison.net/2022/Aug/25/sfms-archive/#atom-tag</id><summary type="html">
    &lt;p&gt;The &lt;a href="https://sfmicrosociety.org/"&gt;San Francisco Microscopical Society&lt;/a&gt; was &lt;a href="https://sfmicrosociety.org/history"&gt;founded&lt;/a&gt; in 1870 by a group of scientists dedicated to advancing the field of microscopy.&lt;/p&gt;
&lt;p&gt;Today the society is a non-profit run entirely by volunteers (they welcome &lt;a href="https://sfmicrosociety.org/join"&gt;new members&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;This year they've been working with the UC Berkeley Bancroft Library to &lt;a href="https://mailchi.mp/sfmicrosociety/spring-2022#FROM%20THE%20ARCHIVE"&gt;digitize their archives&lt;/a&gt;, which reach back to the founding of the organization.&lt;/p&gt;
&lt;p&gt;Hep Svadja, Jenny Jacox and Ariel Waldman have taken the lead on this effort, resulting in a wealth of scanned PDF files covering 150 years of microscope history.&lt;/p&gt;
&lt;p&gt;I've been helping out with the project running OCR against the archives (using Amazon Textract via my &lt;a href="https://simonwillison.net/2022/Jun/30/s3-ocr/"&gt;s3-ocr tool&lt;/a&gt;, which I built for this project) and turning them into a searchable website using &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The result is now live! Take a look at &lt;strong&gt;&lt;a href="https://archive.sfmicrosociety.org/"&gt;archive.sfmicrosociety.org&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/sfms-homepage.jpg" alt="The website homepage. The title reads San Francisco Microscopical Society Archive. The page has a search box, a link to browse all documents and a heading titled Some random pages which shows a mixture of six different scanned pages, some typewritten and some with handwriting." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Try running a search for &lt;a href="https://archive.sfmicrosociety.org/?q=newton"&gt;newton&lt;/a&gt; to see the search feature in action.&lt;/p&gt;
&lt;p&gt;You can also &lt;a href="https://archive.sfmicrosociety.org/folders/PUBLIC/Section+5+Supporting+Documents+for+SFHS+Application"&gt;browse by folder structure&lt;/a&gt; or by &lt;a href="https://archive.sfmicrosociety.org/docs/12abb3e5"&gt;individual document&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;A neat thing about using AWS Textract is that it is significantly better at reading 19th century handwritten documents than I am.&lt;/p&gt;
&lt;p&gt;Each document page is shown with the OCRd text alongside the original scan, so you can see that for yourself on &lt;a href="https://archive.sfmicrosociety.org/docs/402f6021/15"&gt;this page&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/sfms-page.jpg" alt="A page titled Misc Notes and Business Cards 1885.pdf pafe 15. The scanned image on the right shows some beautiful but very hard to read handwritten notes. The OCR text on the left looks to me like it's pretty accurate." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4&gt;How it works&lt;/h4&gt;
&lt;p&gt;The website itself is a custom Datasette instance hosted on &lt;a href="https://vercel.com/"&gt;Vercel&lt;/a&gt; - deployed using the &lt;a href="https://datasette.io/plugins/datasette-publish-vercel"&gt;datasette-publish-vercel plugin&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It's an implementation of the &lt;a href="https://simonwillison.net/2021/Jul/28/baked-data/"&gt;Baked Data&lt;/a&gt; architectural pattern, where the database itself is packaged up as a SQLite file and included as a binary file in the application that is deployed to Vercel.&lt;/p&gt;
&lt;p&gt;You can directly browse and query the underlying database (a 12.4MB SQLite file) using the default Datasette interface at &lt;a href="https://archive.sfmicrosociety.org/sfms"&gt;archive.sfmicrosociety.org/sfms&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The database schema (excluding the tables that enable full-text search) looks like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;CREATE TABLE [documents] (
   [id] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [title] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [&lt;span class="pl-k"&gt;path&lt;/span&gt;] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [etag] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
)

CREATE TABLE [pages] (
   [id] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [document_id] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [page] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;,
   [&lt;span class="pl-k"&gt;text&lt;/span&gt;] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   &lt;span class="pl-k"&gt;FOREIGN KEY&lt;/span&gt;([document_id]) &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [documents]([id])
)&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;As you can see, the site consists of documents and pages. Documents have a &lt;code&gt;path&lt;/code&gt; (the directory structure), a title (the filename) and an ID. Pages belong to documents and have a &lt;code&gt;text&lt;/code&gt; column and a page number. It's a pretty simple schema!&lt;/p&gt;
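&lt;p&gt;To make the relationship between the two tables concrete, here's a minimal sketch that recreates the schema in an in-memory SQLite database using Python's &lt;code&gt;sqlite3&lt;/code&gt; module and counts the pages for each document. The sample rows and the &lt;code&gt;pages_per_document()&lt;/code&gt; helper are invented for illustration and are not part of the real archive.&lt;/p&gt;

```python
import sqlite3

# In-memory database using the two-table schema shown above.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE [documents] (
       [id] TEXT PRIMARY KEY,
       [title] TEXT,
       [path] TEXT,
       [etag] TEXT
    );
    CREATE TABLE [pages] (
       [id] TEXT PRIMARY KEY,
       [document_id] TEXT,
       [page] INTEGER,
       [text] TEXT,
       FOREIGN KEY([document_id]) REFERENCES [documents]([id])
    );
    """
)

# Hypothetical sample rows, invented purely for illustration.
conn.execute(
    "insert into documents values (?, ?, ?, ?)",
    ("12abb3e5", "Example.pdf", "PUBLIC/Example.pdf", "12abb3e5ffff"),
)
conn.executemany(
    "insert into pages values (?, ?, ?, ?)",
    [
        ("12abb3e5-1", "12abb3e5", 1, "First page of OCR text"),
        ("12abb3e5-2", "12abb3e5", 2, "Second page of OCR text"),
    ],
)


def pages_per_document(connection):
    """Return (document id, path, page count) for every document."""
    return connection.execute(
        """
        select documents.id, documents.path, count(*) as num_pages
        from pages join documents on pages.document_id = documents.id
        group by documents.id
        order by path
        """
    ).fetchall()


rows = pages_per_document(conn)
```

&lt;p&gt;Each row in &lt;code&gt;rows&lt;/code&gt; pairs a document with the number of pages that reference it through &lt;code&gt;document_id&lt;/code&gt;.&lt;/p&gt;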
&lt;p&gt;The public pages on the site are all built using custom templates and views. You can browse the source code in the &lt;a href="https://github.com/simonw/sfms-history"&gt;simonw/sfms-history&lt;/a&gt; GitHub repository.&lt;/p&gt;
&lt;p&gt;In particular, the custom pages are implemented using a &lt;a href="https://docs.datasette.io/en/stable/writing_plugins.html#writing-one-off-plugins"&gt;one-off site plugin&lt;/a&gt; (in the &lt;code&gt;plugins/&lt;/code&gt; folder) that adds extra routes to Datasette using the &lt;a href="https://docs.datasette.io/en/stable/plugin_hooks.html#register-routes-datasette"&gt;register_routes()&lt;/a&gt; plugin hook, &lt;a href="https://github.com/simonw/sfms-history/blob/cdf31999b9534b78c41ccfe6dc0c38067ca9cfe3/plugins/views.py#L5-L12"&gt;like this&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-en"&gt;@&lt;span class="pl-s1"&gt;hookimpl&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;register_routes&lt;/span&gt;():
    &lt;span class="pl-k"&gt;return&lt;/span&gt; [
        (&lt;span class="pl-s"&gt;r"^/docs$"&lt;/span&gt;, &lt;span class="pl-s1"&gt;docs&lt;/span&gt;),
        (&lt;span class="pl-s"&gt;r"^/docs/(?P&amp;lt;document_id&amp;gt;[a-z0-9]+)$"&lt;/span&gt;, &lt;span class="pl-s1"&gt;document&lt;/span&gt;),
        (&lt;span class="pl-s"&gt;r"^/docs/(?P&amp;lt;document_id&amp;gt;[a-z0-9]+)/(?P&amp;lt;page&amp;gt;\d+)/?$"&lt;/span&gt;, &lt;span class="pl-s1"&gt;page&lt;/span&gt;),
        (&lt;span class="pl-s"&gt;r"^/folders/(?P&amp;lt;folder&amp;gt;.*)$"&lt;/span&gt;, &lt;span class="pl-s1"&gt;folder&lt;/span&gt;),
    ]&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;docs&lt;/code&gt;, &lt;code&gt;document&lt;/code&gt;, &lt;code&gt;page&lt;/code&gt; and &lt;code&gt;folder&lt;/code&gt; are Python view functions that take a request and return a custom response.&lt;/p&gt;
&lt;p&gt;Here's the &lt;code&gt;docs&lt;/code&gt; function for example, which powers the listing of all documents at &lt;a href="https://archive.sfmicrosociety.org/docs"&gt;archive.sfmicrosociety.org/docs&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;async&lt;/span&gt; &lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;docs&lt;/span&gt;(&lt;span class="pl-s1"&gt;datasette&lt;/span&gt;, &lt;span class="pl-s1"&gt;request&lt;/span&gt;):
    &lt;span class="pl-s1"&gt;db&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;datasette&lt;/span&gt;.&lt;span class="pl-en"&gt;get_database&lt;/span&gt;(&lt;span class="pl-s"&gt;"sfms"&lt;/span&gt;)
    &lt;span class="pl-s1"&gt;documents&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; [
        &lt;span class="pl-en"&gt;to_document&lt;/span&gt;(&lt;span class="pl-s1"&gt;doc&lt;/span&gt;)
        &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;doc&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-s1"&gt;db&lt;/span&gt;.&lt;span class="pl-en"&gt;execute&lt;/span&gt;(
            &lt;span class="pl-s"&gt;"""&lt;/span&gt;
&lt;span class="pl-s"&gt;    select documents.*, count(*) as num_pages&lt;/span&gt;
&lt;span class="pl-s"&gt;    from pages join documents on pages.document_id = documents.id&lt;/span&gt;
&lt;span class="pl-s"&gt;    group by documents.id&lt;/span&gt;
&lt;span class="pl-s"&gt;    order by path&lt;/span&gt;
&lt;span class="pl-s"&gt;    """&lt;/span&gt;
        )
    ]
    &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-v"&gt;Response&lt;/span&gt;.&lt;span class="pl-en"&gt;html&lt;/span&gt;(
        &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-s1"&gt;datasette&lt;/span&gt;.&lt;span class="pl-en"&gt;render_template&lt;/span&gt;(&lt;span class="pl-s"&gt;"docs.html"&lt;/span&gt;, {&lt;span class="pl-s"&gt;"documents"&lt;/span&gt;: &lt;span class="pl-s1"&gt;documents&lt;/span&gt;}, &lt;span class="pl-s1"&gt;request&lt;/span&gt;)
    )&lt;/pre&gt;
&lt;p&gt;And here's &lt;a href="https://github.com/simonw/sfms-history/blob/cdf31999b9534b78c41ccfe6dc0c38067ca9cfe3/templates/docs.html"&gt;the docs.html template&lt;/a&gt; it uses.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/simonw/sfms-history/blob/cdf31999b9534b78c41ccfe6dc0c38067ca9cfe3/templates/index.html"&gt;index.html&lt;/a&gt; template which implements both the homepage and the search results page is worth a look too. It takes &lt;a href="https://github.com/simonw/sfms-history/blob/cdf31999b9534b78c41ccfe6dc0c38067ca9cfe3/plugins/index.py"&gt;a slightly different approach&lt;/a&gt;, using the &lt;a href="https://docs.datasette.io/en/stable/plugin_hooks.html#extra-template-vars-template-database-table-columns-view-name-request-datasette"&gt;extra_template_vars()&lt;/a&gt; plugin hook to populate extra custom template variables that can be used by that page to loop through the search results.&lt;/p&gt;
&lt;p&gt;I built that page first, before adding custom routes for the other pages - I should probably refactor that to use &lt;code&gt;register_routes()&lt;/code&gt; instead for consistency.&lt;/p&gt;
&lt;h4&gt;OCRing the documents&lt;/h4&gt;
&lt;p&gt;The documents started out in a Google Drive. I used the &lt;a href="https://rclone.org/"&gt;Rclone&lt;/a&gt; tool to sync that to an S3 bucket, and then ran my &lt;a href="https://datasette.io/tools/s3-ocr"&gt;s3-ocr tool&lt;/a&gt; against specific folders within that bucket to run them through AWS Textract.&lt;/p&gt;
&lt;p&gt;Then I ran the &lt;code&gt;s3-ocr index&lt;/code&gt; command to suck down the OCR results and build those into a SQLite database.&lt;/p&gt;
&lt;p&gt;The resulting schema wasn't quite right for this particular project. I wrote &lt;a href="https://github.com/simonw/sfms-history/blob/cdf31999b9534b78c41ccfe6dc0c38067ca9cfe3/build-db.sh"&gt;this build-db.sh script&lt;/a&gt; to construct a schema specifically designed for the archive, copy the data in and then enable full-text search on the resulting tables.&lt;/p&gt;
&lt;p&gt;The SQLite database file containing the initial OCR results is called &lt;code&gt;index.db&lt;/code&gt;. I used the following commands to attach that &lt;code&gt;index.db&lt;/code&gt; file to the same connection as the &lt;code&gt;sfms.db&lt;/code&gt; database and copy and re-shape the records that I wanted:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Populate documents&lt;/span&gt;
sqlite-utils sfms.db --attach index2 index.db &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;cat &lt;span class="pl-s"&gt;&lt;span class="pl-k"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="pl-k"&gt;EOF&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;insert into documents select&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  substr(s3_ocr_etag, 2, 8) as id,&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  key as title,&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  key as path,&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  replace(s3_ocr_etag, '"', '') as etag&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;from&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  index2.ocr_jobs&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;where&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  key in (&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;    select path from index2.pages&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;    where (&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;      folder like 'INTAKE/%'&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;      or folder like 'PUBLIC/%'&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;    )&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;    and folder not like '%PROCESSED INTAKE DOCUMENTS/%'&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  )&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-k"&gt;EOF&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;

&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Populate pages&lt;/span&gt;
sqlite-utils sfms.db --attach index2 index.db &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;cat &lt;span class="pl-s"&gt;&lt;span class="pl-k"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="pl-k"&gt;EOF&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;insert into pages select distinct&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  substr(s3_ocr_etag, 2, 8) || '-' || page as id,&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  substr(s3_ocr_etag, 2, 8) as document_id,&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  page,&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  text&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;from index2.pages&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  join index2.ocr_jobs&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;    on index2.pages.path = index2.ocr_jobs.key&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;where&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  (&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;    folder like 'INTAKE/%'&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;    or folder like 'PUBLIC/%'&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  )&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  and folder not like '%PROCESSED INTAKE DOCUMENTS/%'&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-k"&gt;EOF&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This script is &lt;a href="https://github.com/simonw/sfms-history/blob/cdf31999b9534b78c41ccfe6dc0c38067ca9cfe3/.github/workflows/deploy.yml#L39-L41"&gt;run by GitHub Actions&lt;/a&gt; as part of the workflow that deploys the site to Vercel.&lt;/p&gt;
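&lt;p&gt;The &lt;code&gt;substr(s3_ocr_etag, 2, 8)&lt;/code&gt; expression in those queries is what produces the short document IDs. Here's a small sketch of how that works, assuming (as the &lt;code&gt;replace()&lt;/code&gt; call suggests) that the stored ETag is wrapped in double quotes. The ETag value and the page number 15 are made up for the example.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical ETag value, assumed to be stored wrapped in double quotes.
etag = '"12abb3e5c0ffee0123456789abcdef01"'

# substr() is 1-indexed, so starting at position 2 skips the leading quote
# and the length of 8 keeps the first eight characters of the hash.
doc_id, clean_etag, page_id = conn.execute(
    """
    select
      substr(?, 2, 8),
      replace(?, '"', ''),
      substr(?, 2, 8) || '-' || 15
    """,
    (etag, etag, etag),
).fetchone()
```

&lt;p&gt;The page ID is just the document ID joined to the page number with a hyphen, so it stays unique as long as the eight character prefixes don't collide.&lt;/p&gt;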
&lt;h4&gt;Images of pages&lt;/h4&gt;
&lt;p&gt;The site itself makes use of the OCRd text that is stored in the database - that's how the search feature works.&lt;/p&gt;
&lt;p&gt;But the scanned pages are visually so much more interesting than that!&lt;/p&gt;
&lt;p&gt;I wanted to display those page images both as thumbnails and larger images, both to add visual appeal and because OCRd text loses a lot of the richness of the underlying records.&lt;/p&gt;
&lt;p&gt;Since I already had the PDFs in an S3 bucket, the easiest way to build this was using &lt;a href="https://imgix.com/"&gt;imgix&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;imgix offers a proxy service for running URL-defined transforms against images. I've used it on a bunch of projects before: it's really powerful.&lt;/p&gt;
&lt;p&gt;In this case, given an imgix source configured against an S3 bucket, the URL for an image of a specific page of a PDF can be constructed like this (newlines added for readability):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;https://sfms-history.imgix.net/
  PUBLIC/Scientific%20Documents/
  MicroscopicGymnastics_J.EdwardsSmith-04-06-1876.pdf
  ?page=1
  &amp;amp;w=200
  &amp;amp;auto=format
  &amp;amp;s=634c00249fbe4a2eda90f00be0ae66d6
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here we're providing the path to the PDF in the S3 bucket, requesting a render of page 1 of that PDF at width 200px, and using &lt;code&gt;auto=format&lt;/code&gt;, which causes imgix to &lt;a href="https://docs.imgix.com/apis/rendering/auto/auto#format"&gt;serve the image in an optimized format&lt;/a&gt; based on the capabilities of the user's device.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;&amp;amp;s=&lt;/code&gt; parameter is a signature which can be configured to prevent people from constructing their own arbitrary URLs. In this case the S3 bucket has some files in it that are not part of the public archive, so using a signature prevents people from guessing filenames and sniffing around in the bucket contents.&lt;/p&gt;
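&lt;p&gt;Here's a rough sketch of how a signed URL like that could be generated in Python. It assumes imgix's scheme of taking the MD5 hex digest of the secure token followed by the URL-encoded path and query string, which is worth checking against the imgix documentation before relying on it. The &lt;code&gt;sign_imgix_url()&lt;/code&gt; function, the token and the file path are all hypothetical.&lt;/p&gt;

```python
import hashlib
from urllib.parse import quote, urlencode


def sign_imgix_url(domain, token, path, params):
    """Build a signed imgix URL (a sketch; verify against imgix's docs)."""
    # Percent-encode the path, keeping "/" separators intact.
    encoded_path = "/" + quote(path.lstrip("/"))
    query = urlencode(params)
    # Assumed scheme: MD5 over token + path + "?" + query string.
    base = token + encoded_path + ("?" + query if query else "")
    signature = hashlib.md5(base.encode("utf-8")).hexdigest()
    # Append the signature as the final "s" parameter.
    signed_query = urlencode({**params, "s": signature})
    return "https://" + domain + encoded_path + "?" + signed_query


url = sign_imgix_url(
    "sfms-history.imgix.net",
    "not-the-real-token",  # hypothetical secret, never commit a real one
    "PUBLIC/Scientific Documents/Example.pdf",
    {"page": 1, "w": 200, "auto": "format"},
)
```

&lt;p&gt;Because the token never appears in the URL, anyone who changes the path or the parameters ends up with a signature that no longer matches.&lt;/p&gt;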
&lt;h4&gt;s3-ocr elsewhere&lt;/h4&gt;
&lt;p&gt;I built &lt;code&gt;s3-ocr&lt;/code&gt; for this specific project, but it's already starting to see use for other projects. My favourite example so far is the work Philip James has been doing to deploy search engines against OCRd meeting minutes for the cities of Alameda and Oakland (and soon more) - he wrote about that in detail in &lt;a href="https://phildini.dev/digitizing-55-000-pages-of-civic-meetings"&gt;Digitizing 55,000 pages of civic meetings&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If you have an interesting archive of PDFs that need to be OCRd I would love to hear from you! Contact me &lt;a href="https://twitter.com/simonw"&gt;on Twitter&lt;/a&gt; or email me at &lt;code&gt;swillison&lt;/code&gt; at Google's webmail provider.&lt;/p&gt;
&lt;h4&gt;Also this week&lt;/h4&gt;
&lt;p&gt;I got very distracted by &lt;a href="https://simonwillison.net/2022/Aug/21/scotrail/"&gt;Analyzing ScotRail audio announcements with Datasette&lt;/a&gt; - which did at least produce a new, detailed tutorial describing my process for approaching new projects with Datasette.&lt;/p&gt;
&lt;p&gt;I added &lt;a href="https://simonwillison.net/2022/Aug/17/datasette-lite-plugins/"&gt;Plugin support to Datasette Lite&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-multiline-links"&gt;datasette-multiline-links&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-multiline-links/releases/tag/0.1"&gt;0.1&lt;/a&gt; - 2022-08-24
&lt;br /&gt;Make multiple newline separated URLs clickable in Datasette&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-publish-vercel"&gt;datasette-publish-vercel&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-publish-vercel/releases/tag/0.14.2"&gt;0.14.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-publish-vercel/releases"&gt;23 releases total&lt;/a&gt;) - 2022-08-24
&lt;br /&gt;Datasette plugin for publishing data using Vercel&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-edit-schema"&gt;datasette-edit-schema&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-edit-schema/releases/tag/0.5.1"&gt;0.5.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-edit-schema/releases"&gt;10 releases total&lt;/a&gt;) - 2022-08-22
&lt;br /&gt;Datasette plugin for modifying table schemas&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/dogsheep/pocket-to-sqlite"&gt;pocket-to-sqlite&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/dogsheep/pocket-to-sqlite/releases/tag/0.2.2"&gt;0.2.2&lt;/a&gt; - (&lt;a href="https://github.com/dogsheep/pocket-to-sqlite/releases"&gt;4 releases total&lt;/a&gt;) - 2022-08-22
&lt;br /&gt;Create a SQLite database containing data from your Pocket account&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-mp3-audio"&gt;datasette-mp3-audio&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-mp3-audio/releases/tag/0.2"&gt;0.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-mp3-audio/releases"&gt;2 releases total&lt;/a&gt;) - 2022-08-20
&lt;br /&gt;Turn .mp3 URLs into an audio player in the Datasette interface&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/sqlite-diffable"&gt;sqlite-diffable&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/sqlite-diffable/releases/tag/0.5"&gt;0.5&lt;/a&gt; - (&lt;a href="https://github.com/simonw/sqlite-diffable/releases"&gt;6 releases total&lt;/a&gt;) - 2022-08-18
&lt;br /&gt;Tools for dumping/loading a SQLite database to diffable directory structure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-sentry"&gt;datasette-sentry&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-sentry/releases/tag/0.2"&gt;0.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-sentry/releases"&gt;5 releases total&lt;/a&gt;) - 2022-08-14
&lt;br /&gt;Datasette plugin for configuring Sentry&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette"&gt;datasette&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette/releases/tag/0.62"&gt;0.62&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette/releases"&gt;113 releases total&lt;/a&gt;) - 2022-08-14
&lt;br /&gt;An open source multi-tool for exploring and publishing data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.13"&gt;0.13&lt;/a&gt; - (&lt;a href="https://github.com/simonw/s3-credentials/releases"&gt;14 releases total&lt;/a&gt;) - 2022-08-12
&lt;br /&gt;A tool for creating credentials for accessing S3 buckets&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/javascript/openseadragon"&gt;Creating a tiled zoomable image with OpenSeadragon and vips&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/python/setup-py-from-url"&gt;Defining setup.py dependencies using a URL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/googlecloud/google-cloud-spend-datasette"&gt;Analyzing Google Cloud spend with Datasette&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/sqlite/sort-by-number-of-json-intersections"&gt;Sort by number of JSON intersections&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/quarto/trying-out-quarto"&gt;Trying out Quarto on macOS&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/sqlite/function-list"&gt;Seeing which functions are unique to a specific SQLite / Datasette instance&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ocr"/><category term="pdf"/><category term="projects"/><category term="datasette"/><category term="weeknotes"/></entry><entry><title>Digitizing 55,000 pages of civic meetings</title><link href="https://simonwillison.net/2022/Aug/22/digitizing-55000-pages-of-civic-meetings/#atom-tag" rel="alternate"/><published>2022-08-22T16:26:04+00:00</published><updated>2022-08-22T16:26:04+00:00</updated><id>https://simonwillison.net/2022/Aug/22/digitizing-55000-pages-of-civic-meetings/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://phildini.dev/digitizing-55-000-pages-of-civic-meetings"&gt;Digitizing 55,000 pages of civic meetings&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Philip James has been building public, searchable archives of city council meetings for various cities (Oakland and Alameda so far) using my s3-ocr script to run Textract OCR against the PDFs of the minutes, and deploying them to Fly using Datasette. This is a really cool project, and very much the kind of thing I’ve been hoping to support with the tools I’ve been building.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/phildini/status/1561745329971613696"&gt;@phildini&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/archiving"&gt;archiving&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/political-hacking"&gt;political-hacking&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fly"&gt;fly&lt;/a&gt;&lt;/p&gt;



</summary><category term="archiving"/><category term="ocr"/><category term="political-hacking"/><category term="datasette"/><category term="fly"/></entry><entry><title>Litestream backups for Datasette Cloud (and weeknotes)</title><link href="https://simonwillison.net/2022/Aug/11/litestream-backups/#atom-tag" rel="alternate"/><published>2022-08-11T17:19:52+00:00</published><updated>2022-08-11T17:19:52+00:00</updated><id>https://simonwillison.net/2022/Aug/11/litestream-backups/#atom-tag</id><summary type="html">
    &lt;p&gt;My main focus this week has been adding robust backups to the forthcoming Datasette Cloud.&lt;/p&gt;
&lt;p&gt;Datasette Cloud is a SaaS service for &lt;a href="https://datasette.io"&gt;Datasette&lt;/a&gt;. It allows people to create a private Datasette instance where they can upload data, visualize and transform it and share it with other members of their team. You can join the waiting list to try it out using &lt;a href="https://docs.google.com/forms/d/e/1FAIpQLSeMRYHBHXlWGDkjCqhAOinDrKEVwqNR5GfPs3iEXn8LzDT2Qg/viewform?embedded=true"&gt;this form&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I'm building Datasette Cloud on &lt;a href="https://fly.io/"&gt;Fly&lt;/a&gt;, specifically on &lt;a href="https://fly.io/blog/fly-machines/"&gt;Fly Machines&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Security is a big concern for Datasette Cloud. Teams should only be able to access their own data - bugs where users accidentally (or maliciously) access data for another team should be protected against as much as possible.&lt;/p&gt;
&lt;p&gt;To help guarantee that, I've designed Datasette Cloud so that each team gets their own, dedicated instance, running in a &lt;a href="https://firecracker-microvm.github.io/"&gt;Firecracker VM&lt;/a&gt; managed by Fly. Their data lives in a dedicated volume.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://fly.io/docs/reference/volumes/"&gt;Fly volumes&lt;/a&gt; already implement snapshot backups, but I'm interested in defence in depth. This is where &lt;a href="https://litestream.io/"&gt;Litestream&lt;/a&gt; comes in (coincidentally now &lt;a href="https://fly.io/blog/all-in-on-sqlite-litestream/"&gt;part of Fly&lt;/a&gt;, although it wasn't when I first selected it as my backup strategy).&lt;/p&gt;
&lt;p&gt;I'm using Litestream to constantly back up the data for each Datasette Cloud team to an S3 bucket. In the case of a complete failure of a volume, I can restore data from a backup that should be at most a few seconds out of date. Litestream also gives me point-in-time backups, such that I can recover a previous version of the data within a configurable retention window.&lt;/p&gt;
&lt;h4&gt;Keeping backups isolated&lt;/h4&gt;
&lt;p&gt;Litestream &lt;a href="https://litestream.io/how-it-works/"&gt;works&lt;/a&gt; by writing a constant stream of pages from SQLite's WAL (Write-Ahead Log) up to an S3 bucket. It needs the ability to both read and write from S3.&lt;/p&gt;
&lt;p&gt;This requires making S3 credentials available within the containers that run Datasette and Litestream for each team account.&lt;/p&gt;
&lt;p&gt;Credentials in those containers are not visible to the users of the software, but I still wanted to be confident that if the credentials leaked in some way the isolation between teams would be maintained.&lt;/p&gt;
&lt;p&gt;Initially I thought about having a separate S3 bucket for each team, but it turns out AWS has a default limit of 100 buckets per account, and a hard limit of 1,000. I aspire to have more than 1,000 customers, so this limit makes a bucket-per-team seem like the wrong solution.&lt;/p&gt;
&lt;p&gt;I've learned an absolute ton about S3 and AWS permissions building my &lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt; tool for creating credentials for accessing S3.&lt;/p&gt;
&lt;p&gt;One of the tricks I've learned is that it's possible to create temporary, time-limited credentials that &lt;a href="https://simonwillison.net/2022/Jan/18/weeknotes/"&gt;only work for a prefix&lt;/a&gt; (effectively a folder) within an S3 bucket.&lt;/p&gt;
&lt;p&gt;This means I can run Litestream with credentials that are specific to the team - that can read and write only from the &lt;code&gt;team-ID/&lt;/code&gt; prefix in the S3 bucket I am using to store the backups.&lt;/p&gt;
&lt;h4&gt;Obtaining temporary credentials&lt;/h4&gt;
&lt;p&gt;My &lt;code&gt;s3-credentials&lt;/code&gt; tool can create credentials for a prefix within an S3 bucket like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;s3-credentials create my-bucket-for-backups \
  --duration 12h \
  --prefix team-56/
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command uses the &lt;a href="https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sts.html#STS.Client.assume_role"&gt;sts.assume_role()&lt;/a&gt; AWS method to create credentials that allow access to that bucket, attaching &lt;a href="https://github.com/simonw/s3-credentials/blob/0.12.1/README.md#--prefix-my-prefix"&gt;this generated JSON policy&lt;/a&gt; to it in order to restrict access to the provided prefix.&lt;/p&gt;
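&lt;p&gt;Here's a minimal sketch of what a prefix-restricting policy can look like. It is not a copy of the policy that &lt;code&gt;s3-credentials&lt;/code&gt; generates: the function name &lt;code&gt;prefix_policy&lt;/code&gt;, the example bucket name and the exact list of actions are assumptions chosen for illustration. The resulting JSON string is the kind of value that can be passed as the &lt;code&gt;Policy&lt;/code&gt; argument to &lt;code&gt;sts.assume_role()&lt;/code&gt;.&lt;/p&gt;

```python
import json


def prefix_policy(bucket, prefix):
    """Build an IAM policy dict limiting access to one prefix (illustrative)."""
    bucket_arn = "arn:aws:s3:::" + bucket
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action, so it is narrowed with a
                # condition on the requested prefix rather than by resource.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
                "Condition": {"StringLike": {"s3:prefix": [prefix + "*"]}},
            },
            {
                # Object-level actions are narrowed by the resource ARN.
                # The choice of actions here is an assumption.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": [bucket_arn + "/" + prefix + "*"],
            },
        ],
    }


# Hypothetical bucket and team prefix, serialized ready for the Policy argument.
policy_json = json.dumps(prefix_policy("example-backup-bucket", "team-56/"))
```

&lt;p&gt;Splitting the policy into a bucket-level statement and an object-level statement reflects how S3 permissions work: listing keys is granted on the bucket, while reading and writing are granted on object ARNs.&lt;/p&gt;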
&lt;p&gt;I extracted the relevant Python code from &lt;code&gt;s3-credentials&lt;/code&gt; and used it to create a private API endpoint in my Datasette Cloud management server which could return the temporary credentials needed by the team container.&lt;/p&gt;
&lt;p&gt;With the endpoint in place, my code for launching a team container can do this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Create the volume and machine for that team (if they do not yet exist)&lt;/li&gt;
&lt;li&gt;Generate a signed secret token that the machine container can exchange for its S3 credentials&lt;/li&gt;
&lt;li&gt;Launch the machine container, passing it the secret token&lt;/li&gt;
&lt;li&gt;On launch, the container runs a script which exchanges that secret token for its 12 hour S3 credentials, using the private API endpoint I created&lt;/li&gt;
&lt;li&gt;Those credentials are used to populate the &lt;code&gt;AWS_ACCESS_KEY_ID&lt;/code&gt;, &lt;code&gt;AWS_SECRET_ACCESS_KEY&lt;/code&gt; and &lt;code&gt;AWS_SESSION_TOKEN&lt;/code&gt; environment variables used by Litestream&lt;/li&gt;
&lt;li&gt;Start Litestream, which then starts Datasette&lt;/li&gt;
&lt;/ul&gt;
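&lt;p&gt;The post doesn't say how the secret token is signed, so the following is only a sketch of one common approach: an HMAC-SHA256 signature over the team ID, using a key known only to the management server. The function names &lt;code&gt;sign_token&lt;/code&gt; and &lt;code&gt;verify_token&lt;/code&gt; and the &lt;code&gt;team_id:signature&lt;/code&gt; format are hypothetical.&lt;/p&gt;

```python
import hashlib
import hmac


def sign_token(team_id, signing_key):
    """Return 'team_id:signature', where the signature is an HMAC-SHA256 hex digest."""
    digest = hmac.new(signing_key, team_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return team_id + ":" + digest


def verify_token(token, signing_key):
    """Return the team ID if the signature is valid, otherwise None."""
    team_id, sep, digest = token.rpartition(":")
    if not sep:
        return None
    expected = hmac.new(signing_key, team_id.encode("utf-8"), hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through comparison timing.
    if hmac.compare_digest(digest, expected):
        return team_id
    return None
```

&lt;p&gt;With a scheme like this the endpoint only needs the signing key to decide which team's prefix to issue credentials for; it doesn't have to store a token per container.&lt;/p&gt;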
&lt;h4&gt;Restarting every 12 hours&lt;/h4&gt;
&lt;p&gt;You may be wondering why I bothered with that initial secret token - why not just pass the temporary AWS credentials to the container when I launch it?&lt;/p&gt;
&lt;p&gt;The reason for this is that I need to be able to obtain fresh credentials every 12 hours.&lt;/p&gt;
&lt;p&gt;A really neat feature of Fly Machines is that they support scale-to-zero. You can stop them, and Fly will automatically restart them the next time they receive traffic.&lt;/p&gt;
&lt;p&gt;All you need to do is call &lt;code&gt;sys.exit(0)&lt;/code&gt; in your Python code (or the equivalent in any other language) and Fly will stop your container... and then restart it again with a couple of seconds of cold start time the next time an HTTP request for your container hits the Fly router.&lt;/p&gt;
&lt;p&gt;So far I'm mainly using this to avoid the cost of running containers when they aren't actually in use. But there's a neat benefit when it comes to Litestream too.&lt;/p&gt;
&lt;p&gt;I'm using S3 credentials which expire after 12 hours. This means I need to periodically refresh the credentials and restart Litestream or it will stop being able to write to the S3 bucket.&lt;/p&gt;
&lt;p&gt;After considering a few ways of doing this, I selected the simplest to implement: have Datasette call &lt;code&gt;sys.exit(0)&lt;/code&gt; after ten hours, and let Fly restart the container causing my startup script to fetch freshly generated 12 hour credentials and pass them to Litestream.&lt;/p&gt;
&lt;p&gt;I implemented this by adding it as a new setting to my existing &lt;a href="https://datasette.io/plugins/datasette-scale-to-zero"&gt;datasette-scale-to-zero&lt;/a&gt; plugin. You can now configure that with &lt;code&gt;"max-age": "10h"&lt;/code&gt; and it will shut down Datasette once the server has been running for that long.&lt;/p&gt;
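&lt;p&gt;As a rough illustration of the idea (not the plugin's actual code), a maximum-age check can be as small as this. The helper names and the set of supported unit suffixes are assumptions.&lt;/p&gt;

```python
import sys
import time

# Assumed unit suffixes: seconds, minutes, hours.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(value):
    """Convert a string such as '10h' or '30m' into a number of seconds."""
    number, unit = value[:-1], value[-1:]
    if unit not in UNIT_SECONDS or not number.isdigit():
        raise ValueError("Unsupported duration: " + repr(value))
    return int(number) * UNIT_SECONDS[unit]


def should_exit(started_at, now, max_age):
    """True once the server has been running for at least max_age."""
    return now - started_at >= parse_duration(max_age)


def exit_if_too_old(started_at, max_age):
    # started_at is expected to be a time.monotonic() reading taken at startup.
    # Exiting with status 0 stops the process cleanly.
    if should_exit(started_at, time.monotonic(), max_age):
        sys.exit(0)
```

&lt;p&gt;Choosing ten hours against a twelve hour credential lifetime leaves a two hour margin, so the process exits before the credentials it holds can expire.&lt;/p&gt;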
&lt;p&gt;Why does this require my own secret token system? Because when the container is restarted, it needs to make an authenticated call to my endpoint to retrieve those fresh S3 credentials. Fly persists environment variable secrets between restarts to the container, so that secret can be long-lived even while it is exchanged for short-term S3 credentials.&lt;/p&gt;
&lt;p&gt;I only just put the new backup system in place, so I'm exercising it a bit before I open things up to trial users - but so far it's looking like a very robust solution to the problem.&lt;/p&gt;
&lt;h4&gt;s3-ocr improvements&lt;/h4&gt;
&lt;p&gt;I released a &lt;a href="https://datasette.io/tools/s3-ocr"&gt;few new versions&lt;/a&gt; of &lt;a href="https://simonwillison.net/2022/Jun/30/s3-ocr/"&gt;s3-ocr&lt;/a&gt; this week, as part of my ongoing project working with the San Francisco Microscopical Society team to release a searchable version of their scanned document archives.&lt;/p&gt;
&lt;p&gt;The two main improvements are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A new &lt;code&gt;--dry-run&lt;/code&gt; option to &lt;code&gt;s3-ocr start&lt;/code&gt; which shows you what the tool will do without making any changes to your S3 bucket, or triggering any OCR jobs. &lt;a href="https://github.com/simonw/s3-ocr/issues/22"&gt;#22&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;s3-ocr start&lt;/code&gt; used to fail with an error if running it would create more than 100 (or 600 depending on your region) concurrent OCR jobs. The tool now knows how to identify that error and pause and retry starting the jobs instead. &lt;a href="https://github.com/simonw/s3-ocr/issues/21"&gt;#21&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
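&lt;p&gt;The pause-and-retry behaviour can be sketched like this. It is a simplified illustration rather than the code in &lt;code&gt;s3-ocr&lt;/code&gt;: &lt;code&gt;LimitExceeded&lt;/code&gt; is a hypothetical stand-in for the error the service raises when too many jobs are running, and the pause length and attempt count are made-up defaults.&lt;/p&gt;

```python
import time


class LimitExceeded(Exception):
    """Hypothetical stand-in for the 'too many concurrent jobs' error."""


def start_with_retry(start_job, key, pause_seconds=30, max_attempts=10, sleep=time.sleep):
    """Call start_job(key), pausing and retrying while the job limit is hit."""
    for attempt in range(1, max_attempts + 1):
        try:
            return start_job(key)
        except LimitExceeded:
            # Give up on the final attempt so the caller sees the error.
            if attempt == max_attempts:
                raise
            sleep(pause_seconds)
```

&lt;p&gt;Passing &lt;code&gt;sleep&lt;/code&gt; in as a parameter makes the retry loop easy to exercise in tests without actually waiting.&lt;/p&gt;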
&lt;p&gt;The fix that took the most time is this: installations of the tool no longer arbitrarily fail to work depending on the environment you install them into!&lt;/p&gt;
&lt;p&gt;Solving this took me the best part of a day. The short version is this: Click 8.1.0 introduced a new feature that lets you use &lt;code&gt;@cli.command&lt;/code&gt; as a decorator instead of &lt;code&gt;@cli.command()&lt;/code&gt;. This meant that installing &lt;code&gt;s3-ocr&lt;/code&gt; in an environment that already had a previous version of Click would result in silent errors.&lt;/p&gt;
&lt;p&gt;The solution is simple: require &lt;code&gt;click&amp;gt;=8.1.0&lt;/code&gt; in the project dependencies if you plan to use this new syntax.&lt;/p&gt;
&lt;p&gt;If I'd read the Click &lt;a href="https://click.palletsprojects.com/en/8.1.x/changes/#version-8-1-0"&gt;changelog more closely&lt;/a&gt; I would have saved myself a whole lot of time.&lt;/p&gt;
&lt;p&gt;Issues &lt;a href="https://github.com/simonw/s3-ocr/issues/25"&gt;#25&lt;/a&gt; and &lt;a href="https://github.com/simonw/s3-ocr/issues/26"&gt;#26&lt;/a&gt; detail the many false turns I took trying to figure this out.&lt;/p&gt;
&lt;h4&gt;More fun with GPT-3 and DALL-E&lt;/h4&gt;
&lt;p&gt;This tweet scored over a million impressions on Twitter:&lt;/p&gt;

&lt;blockquote class="twitter-tweet"&gt;&lt;p lang="en" dir="ltr"&gt;New hobby: prototyping video games in 60 seconds using a combination of GPT-3 and DALL-E&lt;br /&gt;&lt;br /&gt;Here&amp;#39;s &amp;quot;Raccoon Heist&amp;quot; &lt;a href="https://t.co/xQ3Vm8p2XW"&gt;pic.twitter.com/xQ3Vm8p2XW&lt;/a&gt;&lt;/p&gt;- Simon Willison (@simonw) &lt;a href="https://twitter.com/simonw/status/1555626060384911360?ref_src=twsrc%5Etfw"&gt;August 5, 2022&lt;/a&gt;&lt;/blockquote&gt;

&lt;p&gt;As this got retweeted outside of my usual circles it started confusing people who thought the "prototype" was a working game, as opposed to a fake screenshot and a paragraph of descriptive text! I wasn't kidding when I said I spent 60 seconds on this.&lt;/p&gt;
&lt;p&gt;I also figured out how to use GPT-3 to write &lt;code&gt;jq&lt;/code&gt; one-liners. I love &lt;code&gt;jq&lt;/code&gt; but I have to look up how to use it every time, so having GPT-3 do the work for me is a pretty neat time saver. More on that in this TIL: &lt;a href="https://til.simonwillison.net/gpt3/jq"&gt;Using GPT-3 to figure out jq recipes&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-ocr"&gt;s3-ocr&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/s3-ocr/releases/tag/0.6.3"&gt;0.6.3&lt;/a&gt; - (&lt;a href="https://github.com/simonw/s3-ocr/releases"&gt;9 releases total&lt;/a&gt;) - 2022-08-10
&lt;br /&gt;Tools for running OCR against files stored in S3&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-scale-to-zero"&gt;datasette-scale-to-zero&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-scale-to-zero/releases/tag/0.2"&gt;0.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-scale-to-zero/releases"&gt;4 releases total&lt;/a&gt;) - 2022-08-05
&lt;br /&gt;Quit Datasette if it has not received traffic for a specified time period&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/shot-scraper"&gt;shot-scraper&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/shot-scraper/releases/tag/0.14.3"&gt;0.14.3&lt;/a&gt; - (&lt;a href="https://github.com/simonw/shot-scraper/releases"&gt;18 releases total&lt;/a&gt;) - 2022-08-02
&lt;br /&gt;A command-line utility for taking automated screenshots of websites&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.12.1"&gt;0.12.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/s3-credentials/releases"&gt;13 releases total&lt;/a&gt;) - 2022-08-01
&lt;br /&gt;A tool for creating credentials for accessing S3 buckets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-sqlite-fts4"&gt;datasette-sqlite-fts4&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-sqlite-fts4/releases/tag/0.3.2"&gt;0.3.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-sqlite-fts4/releases"&gt;2 releases total&lt;/a&gt;) - 2022-07-31&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/sqlite/related-content"&gt;Related content with SQLite FTS and a Datasette template function&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/aws/boto-command-line"&gt;Using boto3 from the command line&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/sqlite/trying-macos-extensions"&gt;Trying out SQLite extensions on macOS&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/pytest/mocking-boto"&gt;Mocking a Textract LimitExceededException with boto&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/gpt3/jq"&gt;Using GPT-3 to figure out jq recipes&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette-cloud"&gt;datasette-cloud&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fly"&gt;fly&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/litestream"&gt;litestream&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-3"&gt;gpt-3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dalle"&gt;dalle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt"&gt;gpt&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ocr"/><category term="s3"/><category term="datasette"/><category term="weeknotes"/><category term="datasette-cloud"/><category term="fly"/><category term="litestream"/><category term="gpt-3"/><category term="dalle"/><category term="gpt"/></entry><entry><title>s3-ocr: Extract text from PDF files stored in an S3 bucket</title><link href="https://simonwillison.net/2022/Jun/30/s3-ocr/#atom-tag" rel="alternate"/><published>2022-06-30T21:40:27+00:00</published><updated>2022-06-30T21:40:27+00:00</updated><id>https://simonwillison.net/2022/Jun/30/s3-ocr/#atom-tag</id><summary type="html">
    &lt;p&gt;I've released &lt;strong&gt;&lt;a href="https://datasette.io/tools/s3-ocr"&gt;s3-ocr&lt;/a&gt;&lt;/strong&gt;, a new tool that runs Amazon's &lt;a href="https://aws.amazon.com/textract/"&gt;Textract&lt;/a&gt; OCR text extraction against PDF files in an S3 bucket, then writes the resulting text out to a SQLite database with full-text search configured so you can run searches against the extracted data.&lt;/p&gt;
&lt;p&gt;You can search through a demo of 697 pages of OCRd text at &lt;a href="https://s3-ocr-demo.datasette.io/pages/pages"&gt;s3-ocr-demo.datasette.io/pages/pages&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Textract works extremely well: it handles dodgy scanned PDFs full of typewritten code and reads handwritten text better than I can! It &lt;a href="https://aws.amazon.com/textract/pricing/"&gt;charges&lt;/a&gt; $1.50 per thousand pages processed.&lt;/p&gt;
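&lt;p&gt;To put that price in context, here's a quick back-of-the-envelope calculation using the quoted rate. The function name is just for illustration, and actual pricing may vary by region, feature and date.&lt;/p&gt;

```python
from decimal import Decimal

PRICE_PER_THOUSAND_PAGES = Decimal("1.50")  # USD, the rate quoted in the text


def ocr_cost(pages):
    """Estimated cost in USD for running OCR on the given number of pages."""
    return PRICE_PER_THOUSAND_PAGES * pages / 1000


demo_cost = ocr_cost(697)  # the 697-page demo comes to a little over one dollar
```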
&lt;h4&gt;Why I built this&lt;/h4&gt;
&lt;p&gt;My initial need for this is a collaboration I have running with the &lt;a href="https://sfmicrosociety.org/"&gt;San Francisco Microscopical Society&lt;/a&gt;. They've been digitizing their archives - which stretch back to 1870! - and were looking for help turning the digital scans into something more useful.&lt;/p&gt;
&lt;p&gt;The archives are full of hand-written and type-written notes, scanned and stored as PDFs.&lt;/p&gt;
&lt;p&gt;I decided to wrap my work up as a tool because I'm sure there are a LOT of organizations out there with a giant bucket of PDF files that would benefit from being able to easily run OCR and turn the results into a searchable database.&lt;/p&gt;
&lt;p&gt;Running Textract directly against large numbers of files is somewhat inconvenient (here's my &lt;a href="https://til.simonwillison.net/aws/ocr-pdf-textract"&gt;earlier TIL about it&lt;/a&gt;). &lt;code&gt;s3-ocr&lt;/code&gt; is my attempt to make it easier.&lt;/p&gt;
&lt;h4&gt;Tutorial: How I built that demo&lt;/h4&gt;
&lt;p&gt;The demo instance uses three PDFs from the Library of Congress Harry Houdini Collection &lt;a href="https://archive.org/search.php?query=creator%3A%22Harry+Houdini+Collection+%28Library+of+Congress%29+DLC%22"&gt;on the Internet Archive&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://archive.org/details/unmaskingrobert00houdgoog"&gt;The unmasking of Robert-Houdin&lt;/a&gt; from 1908&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://archive.org/details/practicalmagicia00harr"&gt;The practical magician and ventriloquist's guide: a practical manual of fireside magic and conjuring illusions: containing also complete instructions for acquiring &amp;amp; practising the art of ventriloquism&lt;/a&gt; from 1876&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://archive.org/details/latestmagicbeing00hoff"&gt;Latest magic, being original conjuring tricks&lt;/a&gt; from 1918&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I started by downloading PDFs of those three files.&lt;/p&gt;
&lt;p&gt;Then I installed the two tools I needed:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pip install s3-ocr s3-credentials
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I used my &lt;a href="https://datasette.io/tools/s3-credentials"&gt;s3-credentials&lt;/a&gt; tool to create a new S3 bucket and credentials with the ability to write files to it, with the new &lt;a href="https://github.com/simonw/s3-credentials/issues/72"&gt;--statement option&lt;/a&gt; (which I released today) to add &lt;code&gt;textract&lt;/code&gt; permissions to the generated credentials:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;s3-credentials create s3-ocr-demo --statement '{
  "Effect": "Allow",
  "Action": "textract:*",
  "Resource": "*"
}' --create-bucket &amp;gt; ocr.json
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(Note that you don't need to use &lt;code&gt;s3-credentials&lt;/code&gt; at all if you have AWS credentials configured on your machine with root access to your account - just leave off the &lt;code&gt;-a ocr.json&lt;/code&gt; options in the following examples.)&lt;/p&gt;
&lt;p&gt;&lt;code&gt;s3-ocr-demo&lt;/code&gt; is now a bucket I can use for the demo. &lt;code&gt;ocr.json&lt;/code&gt; contains JSON with an access key and secret key for an IAM user account that can interact with that bucket, and also has permission to access the AWS Textract APIs.&lt;/p&gt;
&lt;p&gt;I uploaded my three PDFs to the bucket:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;s3-credentials put-object s3-ocr-demo latestmagicbeing00hoff.pdf latestmagicbeing00hoff.pdf -a ocr.json
s3-credentials put-object s3-ocr-demo practicalmagicia00harr.pdf practicalmagicia00harr.pdf -a ocr.json
s3-credentials put-object s3-ocr-demo unmaskingrobert00houdgoog.pdf unmaskingrobert00houdgoog.pdf -a ocr.json
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(I often use &lt;a href="https://panic.com/transmit/"&gt;Transmit&lt;/a&gt; as a GUI for this kind of operation.)&lt;/p&gt;
&lt;p&gt;Then I kicked off OCR jobs against every PDF file in the bucket:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% s3-ocr start s3-ocr-demo --all -a ocr.json 
Found 0 files with .s3-ocr.json out of 3 PDFs
Starting OCR for latestmagicbeing00hoff.pdf, Job ID: f66bc2d00fb75d1c42d1f829e5b6788891f9799fda404c4550580959f65a5402
Starting OCR for practicalmagicia00harr.pdf, Job ID: ef085728135d524a39bc037ad6f7253284b1fdbeb728dddcfbb260778d902b55
Starting OCR for unmaskingrobert00houdgoog.pdf, Job ID: 93bd46f02eb099eca369c41e384836d2bd3199b95d415c0257ef3fa3602cbef9
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;--all&lt;/code&gt; option scans for any file with a &lt;code&gt;.pdf&lt;/code&gt; extension. You can pass explicit file names instead if you just want to process one or two files at a time.&lt;/p&gt;
&lt;p&gt;This returns straight away, but the OCR process itself can take several minutes depending on the size of the files.&lt;/p&gt;
&lt;p&gt;The job IDs can be used to inspect the progress of each task like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% s3-ocr inspect-job f66bc2d00fb75d1c42d1f829e5b6788891f9799fda404c4550580959f65a5402
{
  "DocumentMetadata": {
    "Pages": 244
  },
  "JobStatus": "SUCCEEDED",
  "DetectDocumentTextModelVersion": "1.0"
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once the job completed, I could preview the text extracted from the PDF like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% s3-ocr text s3-ocr-demo latestmagicbeing00hoff.pdf
111
.
116

LATEST MAGIC
BEING
ORIGINAL CONJURING TRICKS
INVENTED AND ARRANGED
BY
PROFESSOR HOFFMANN
(ANGELO LEWIS, M.A.)
Author of "Modern Magic," etc.
WITH NUMEROUS ILLUSTRATIONS
FIRST EDITION
NEW YORK
SPON &amp;amp; CHAMBERLAIN, 120 LIBERTY ST.
...
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To create a SQLite database with a table containing rows for every page of scanned text, I ran this command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% s3-ocr index s3-ocr-demo pages.db -a ocr.json 
Fetching job details  [####################################]  100%
Populating pages table  [####--------------------------------]   13%  00:00:34
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I then published the resulting &lt;code&gt;pages.db&lt;/code&gt; SQLite database using Datasette - you can &lt;a href="https://s3-ocr-demo.datasette.io/pages"&gt;explore it here&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;How s3-ocr works&lt;/h4&gt;
&lt;p&gt;&lt;code&gt;s3-ocr&lt;/code&gt; works by calling Amazon's S3 and Textract APIs.&lt;/p&gt;
&lt;p&gt;Textract only works against PDF files in &lt;a href="https://docs.aws.amazon.com/textract/latest/dg/api-async.html"&gt;asynchronous mode&lt;/a&gt;: you call an API endpoint to tell it "start running OCR against this PDF file in this S3 bucket", then wait for it to finish - which can take several minutes.&lt;/p&gt;
&lt;p&gt;It defaults to storing the OCR results in its own storage, expiring after seven days. You can instead tell it to store them in your own S3 bucket - I use that option in &lt;code&gt;s3-ocr&lt;/code&gt;.&lt;/p&gt;
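&lt;p&gt;The start-then-wait pattern boils down to a polling loop. Here's a minimal sketch with the status lookup passed in as a function, so it runs without AWS credentials. With boto3 that function could be something like &lt;code&gt;lambda job_id: textract.get_document_text_detection(JobId=job_id)["JobStatus"]&lt;/code&gt;. The name &lt;code&gt;wait_for_job&lt;/code&gt;, the polling interval and the timeout are assumptions.&lt;/p&gt;

```python
import time


def wait_for_job(get_status, job_id, poll_seconds=10, timeout_seconds=1800,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll get_status(job_id) until the job leaves IN_PROGRESS or time runs out."""
    deadline = clock() + timeout_seconds
    while True:
        status = get_status(job_id)
        if status != "IN_PROGRESS":
            # Any other status (for example SUCCEEDED or FAILED) is final.
            return status
        if clock() >= deadline:
            raise TimeoutError("OCR job still running: " + job_id)
        sleep(poll_seconds)
```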
&lt;p&gt;A design challenge I faced was that I wanted to make the command restartable and resumable: if the user cancelled the task, I wanted to be able to pick up from where it had got to. I also wanted to be able to run it again after adding more PDFs to the bucket without repeating work for the previously processed files.&lt;/p&gt;
&lt;p&gt;I also needed to persist those job IDs: Textract writes the OCR results to keys in the bucket called &lt;code&gt;textract-output/JOB_ID/1-?&lt;/code&gt; - but there's no indication as to which PDF file the results correspond to.&lt;/p&gt;
&lt;p&gt;My solution is to write tiny extra JSON files to the bucket when the OCR job is first started.&lt;/p&gt;
&lt;p&gt;If you have a file called &lt;code&gt;latestmagicbeing00hoff.pdf&lt;/code&gt; the &lt;code&gt;start&lt;/code&gt; command will create a new file called &lt;code&gt;latestmagicbeing00hoff.pdf.s3-ocr.json&lt;/code&gt; with the following content:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
  &lt;span class="pl-ent"&gt;"job_id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;f66bc2d00fb75d1c42d1f829e5b6788891f9799fda404c4550580959f65a5402&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
  &lt;span class="pl-ent"&gt;"etag"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-cce"&gt;\"&lt;/span&gt;d79af487579dcbbef26c9b3be763eb5e-2&lt;span class="pl-cce"&gt;\"&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This associates the job ID with the PDF file. It also records the original ETag of the PDF file - this is so in the future I can implement a system that can re-run OCR if the PDF has been updated.&lt;/p&gt;
&lt;p&gt;The existence of these files lets me do two things:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If you run &lt;code&gt;s3-ocr start s3-ocr-demo --all&lt;/code&gt; it can avoid re-submitting PDF files that have already been sent for OCR, by checking for the existence of the &lt;code&gt;.s3-ocr.json&lt;/code&gt; file.&lt;/li&gt;
&lt;li&gt;When you later ask for the results of the OCR it can use these files to associate the PDF with the results.&lt;/li&gt;
&lt;/ul&gt;
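&lt;p&gt;The first of those checks can be sketched as a pure function over a list of bucket keys. This is an illustration of the idea, not the tool's implementation: &lt;code&gt;split_pdfs&lt;/code&gt; is a hypothetical name, and matching the &lt;code&gt;.pdf&lt;/code&gt; extension case-sensitively is an assumption.&lt;/p&gt;

```python
MARKER_SUFFIX = ".s3-ocr.json"


def split_pdfs(keys):
    """Split bucket keys into PDFs already submitted for OCR and PDFs still pending."""
    key_set = set(keys)
    pdfs = sorted(k for k in key_set if k.endswith(".pdf"))
    # A PDF counts as submitted when its marker file sits next to it.
    submitted = [k for k in pdfs if k + MARKER_SUFFIX in key_set]
    pending = [k for k in pdfs if k + MARKER_SUFFIX not in key_set]
    return submitted, pending
```

&lt;p&gt;Because the marker file's name is derived from the PDF's key, a single listing of the bucket is enough to work out what still needs to be submitted.&lt;/p&gt;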
&lt;p&gt;Scattering &lt;code&gt;.s3-ocr.json&lt;/code&gt; files all over the place feels a little messy, so I have an &lt;a href="https://github.com/simonw/s3-ocr/issues/14"&gt;open issue&lt;/a&gt; considering moving them all to a &lt;code&gt;s3-ocr/&lt;/code&gt; prefix in the bucket instead.&lt;/p&gt;
&lt;h4&gt;Try it and let me know what you think&lt;/h4&gt;
&lt;p&gt;This is a brand new project, but I think it's ready for other people to start trying it out.&lt;/p&gt;
&lt;p&gt;I ran it against around 7,000 pages from 531 PDF files in the San Francisco Microscopical Society archive and it seemed to work well!&lt;/p&gt;
&lt;p&gt;If you try this out and it works (or it doesn't work) please &lt;a href="https://twitter.com/simonw"&gt;let me know via Twitter&lt;/a&gt; or &lt;a href="https://github.com/simonw/s3-ocr"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;A challenging example page&lt;/h4&gt;
&lt;p&gt;Here's one of the more challenging pages I processed using Textract:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A very old page of difficult to read handwriting" src="https://static.simonwillison.net/static/2022/s3-ocr-sample-handwriting.jpg" style="max-width: 100%"/&gt;&lt;/p&gt;
&lt;p&gt;Here's the result:&lt;/p&gt;
&lt;pre&gt;
In. In J a ... the Joe 14
162
Volxv
Lalpa spinosa, Eggt bud development. of
146
Farcomas spindle. cells in nested gowers 271
Fayigaga tridactylites, leaf glaur of ruce 33
staining &amp;amp; mounting
Stiles 133
tilica films, a structure of Diatoins morehouse 38
thile new microscopic
Broeck 22 /
Smith reproduction in the huntroom tribe
6
Trakes, develop mouht succession of the porsion tango/229
Soirce President of the Roy: truc: Soo
285
forby, Presidents address
105
pongida, difficulties of classification
238
tage, american adjustable concentric
150
ttlese staining &amp;amp; mountring wood sections 133
Stodder, Frustulia Iasconica, havicula
chomboides, &amp;amp; havi cula crassinervis 265
Vol XVI
falicylic acid u movorcopy
160
falpar enctry ology of
Brooke 9.97
Sanderson micros: characters If inflammation
43
tap, circulation of the
42
Jars, structure of the genus Brisinga
44
latter throvite connective substances 191- 241
Jehorey Cessification in birds, formation
of ed blood corpuseles during the
ossification process
by
&lt;/pre&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-ocr"&gt;s3-ocr&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/s3-ocr/releases/tag/0.4"&gt;0.4&lt;/a&gt; - (&lt;a href="https://github.com/simonw/s3-ocr/releases"&gt;4 releases total&lt;/a&gt;) - 2022-06-30
&lt;br /&gt;Tools for running OCR against files stored in S3&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.12"&gt;0.12&lt;/a&gt; - (&lt;a href="https://github.com/simonw/s3-credentials/releases"&gt;12 releases total&lt;/a&gt;) - 2022-06-30
&lt;br /&gt;A tool for creating credentials for accessing S3 buckets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-scale-to-zero"&gt;datasette-scale-to-zero&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-scale-to-zero/releases/tag/0.1.2"&gt;0.1.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-scale-to-zero/releases"&gt;3 releases total&lt;/a&gt;) - 2022-06-23
&lt;br /&gt;Quit Datasette if it has not received traffic for a specified time period&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/sqlite/one-line-csv-operations"&gt;One-liner for running queries against CSV files with SQLite&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/bash/ignore-errors"&gt;Ignoring errors in a section of a Bash script&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/aws/ocr-pdf-textract"&gt;Running OCR against a PDF file with AWS Textract&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/aws"&gt;aws&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="aws"/><category term="ocr"/><category term="pdf"/><category term="projects"/><category term="s3"/><category term="weeknotes"/><category term="s3-credentials"/></entry><entry><title>Organize and Index Your Screenshots (OCR) on macOS</title><link href="https://simonwillison.net/2021/Jul/18/screenshots/#atom-tag" rel="alternate"/><published>2021-07-18T16:11:01+00:00</published><updated>2021-07-18T16:11:01+00:00</updated><id>https://simonwillison.net/2021/Jul/18/screenshots/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://alexn.org/blog/2020/11/11/organize-index-screenshots-ocr-macos.html"&gt;Organize and Index Your Screenshots (OCR) on macOS&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Alexandru Nedelcu has a very neat recipe for creating an archive of searchable screenshots on macOS: set the default save location for screenshots to a Dropbox folder, then create a launch agent that runs a script against new files in that folder to run tesseract OCR to convert them into a searchable PDF.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/alexelcu/status/1416782157259943945"&gt;@alexelcu&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/macos"&gt;macos&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;&lt;/p&gt;



</summary><category term="macos"/><category term="ocr"/></entry><entry><title>Google Docs OCR</title><link href="https://simonwillison.net/2009/Sep/29/ocr/#atom-tag" rel="alternate"/><published>2009-09-29T21:57:59+00:00</published><updated>2009-09-29T21:57:59+00:00</updated><id>https://simonwillison.net/2009/Sep/29/ocr/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://googlesystem.blogspot.com/2009/09/google-docs-ocr.html"&gt;Google Docs OCR&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Whoa, the Google Docs API just got really interesting—you can upload an image to it (POST /feeds/default/private/full?ocr=true) and it will OCR the text and turn it into a document. Since this is Google, I imagine they’ll also be using the processed documents to further improve their OCR technology.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/apis"&gt;apis&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/googldocs"&gt;googldocs&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;&lt;/p&gt;



</summary><category term="apis"/><category term="googldocs"/><category term="google"/><category term="ocr"/></entry><entry><title>OCR and Neural Nets in JavaScript</title><link href="https://simonwillison.net/2009/Jan/25/ocr/#atom-tag" rel="alternate"/><published>2009-01-25T00:00:28+00:00</published><updated>2009-01-25T00:00:28+00:00</updated><id>https://simonwillison.net/2009/Jan/25/ocr/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://ejohn.org/blog/ocr-and-neural-nets-in-javascript/"&gt;OCR and Neural Nets in JavaScript&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
John dissects the brilliant Greasemonkey script that solves simple captchas using the canvas element and HTML5’s getImageData API.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/canvas"&gt;canvas&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/captchas"&gt;captchas&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/getimagedata"&gt;getimagedata&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/greasemonkey"&gt;greasemonkey&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/john-resig"&gt;john-resig&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;&lt;/p&gt;



</summary><category term="canvas"/><category term="captchas"/><category term="getimagedata"/><category term="greasemonkey"/><category term="javascript"/><category term="john-resig"/><category term="ocr"/></entry><entry><title>tesseract-ocr</title><link href="https://simonwillison.net/2007/Jul/26/tesseractocr/#atom-tag" rel="alternate"/><published>2007-07-26T20:23:08+00:00</published><updated>2007-07-26T20:23:08+00:00</updated><id>https://simonwillison.net/2007/Jul/26/tesseractocr/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://code.google.com/p/tesseract-ocr/"&gt;tesseract-ocr&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Open source OCR, sponsored by Google. I just sat in on a talk on this at OSCON and the complexity of the problem is pretty incredible.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/oscon"&gt;oscon&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/oscon07"&gt;oscon07&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tesseract"&gt;tesseract&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="ocr"/><category term="oscon"/><category term="oscon07"/><category term="tesseract"/></entry></feed>