Simon Willison's Weblog: ocr

Extract PDF text in your browser with LiteParse for the web

2026-04-23T21:54:24+00:00

LlamaIndex have a most excellent open source project called LiteParse, which provides a Node.js CLI tool for extracting text from PDFs. I got a version of LiteParse working entirely in the browser, using most of the same libraries that LiteParse uses to run in Node.js.

Spatial text parsing

Refreshingly, LiteParse doesn't use AI models to do what it does: it's good old-fashioned PDF parsing, falling back to Tesseract OCR (or other pluggable OCR engines) for PDFs that contain images of text rather than the text itself.

The hard problem that LiteParse solves is extracting text in a sensible order despite the infuriating vagaries of PDF layouts. They describe this as "spatial text parsing" - they use some very clever heuristics to detect things like multi-column layouts and group and return the text in a sensible linear flow.
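To make the idea of ordering text by layout concrete, here is a toy sketch of one possible column heuristic. It is not LiteParse's algorithm: the `TextItem` type, the `reading_order` function and the `column_gap` threshold are all invented for illustration, and real PDFs need far more careful handling.

```python
from dataclasses import dataclass


@dataclass
class TextItem:
    x: float  # left edge of the text fragment
    y: float  # top edge, with y increasing down the page
    text: str


def reading_order(items: list[TextItem], column_gap: float = 50.0) -> list[str]:
    """Group fragments into columns by x position, then read each column top to bottom."""
    columns: list[list[TextItem]] = []
    for item in sorted(items, key=lambda i: i.x):
        # Start a new column when the jump from the previous fragment's
        # left edge exceeds the (assumed) gap threshold.
        if columns and item.x - columns[-1][-1].x <= column_gap:
            columns[-1].append(item)
        else:
            columns.append([item])
    ordered: list[str] = []
    for column in columns:
        ordered.extend(i.text for i in sorted(column, key=lambda i: i.y))
    return ordered
```

Sorting naively by y alone would interleave the two columns line by line, which is exactly the failure mode that layout-aware parsing is meant to avoid.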

The LiteParse documentation describes a pattern for implementing Visual Citations with Bounding Boxes. I really like this idea: being able to answer questions from a PDF and accompany those answers with cropped, highlighted images feels like a great way of increasing the credibility of answers from RAG-style Q&A.
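A small piece of that pattern can be sketched in isolation: turning a bounding box into a slightly larger crop region that stays within the page. The `(x0, y0, x1, y1)` box format, the `crop_box` name and the default padding are assumptions for illustration, not taken from the LiteParse documentation.

```python
def crop_box(bbox, page_width, page_height, padding=10):
    """Expand a bounding box by `padding` on every side and clamp it to the page."""
    x0, y0, x1, y1 = bbox
    return (
        max(0, x0 - padding),
        max(0, y0 - padding),
        min(page_width, x1 + padding),
        min(page_height, y1 + padding),
    )
```

The resulting tuple could then be handed to whatever image library renders the page in order to cut out the cited region.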

LiteParse is provided as a pure CLI tool, designed to be used by agents. You run it like this:

npm i -g @llamaindex/liteparse
lit parse document.pdf

I explored its capabilities with Claude and quickly determined that there was no real reason it had to stay a CLI app: it's built on top of PDF.js and Tesseract.js, two libraries I've used for something similar in a browser in the past.

The only reason LiteParse didn't have a pure browser-based version is that nobody had built one yet...

Introducing LiteParse for the web

Visit https://simonw.github.io/liteparse/ to try out LiteParse against any PDF file, running entirely in your browser. Here's what that looks like:

The tool can work with or without running OCR, and can optionally display images for every page in the PDF further down the page.

Building it with Claude Code and Opus 4.7

The process of building this started in the regular Claude app on my iPhone. I wanted to try out LiteParse myself, so I started by uploading a random PDF I happened to have on my phone along with this prompt:

Clone https://github.com/run-llama/liteparse and try it against this file

Regular Claude chat can clone directly from GitHub these days, and while by default it can't access most of the internet from its container it can also install packages from PyPI and npm.

I often use this to try out new pieces of open source software on my phone - it's a quick way to exercise something without having to sit down with my laptop.

You can follow my full conversation in this shared Claude transcript. I asked a few follow-up questions about how it worked, and then asked:

Does this library run in a browser? Could it?

This gave me a thorough enough answer that I was convinced it was worth trying to get that working for real. I opened up my laptop and switched to Claude Code.

I forked the original repo on GitHub, cloned a local copy, started a new web branch and pasted that last reply from Claude into a new file called notes.md. Then I told Claude Code:

Get this working as a web app. index.html, when loaded, should render an app that lets users open a PDF in their browser and select OCR or non-OCR mode and have this run. Read notes.md for initial research on this problem, then write out plan.md with your detailed implementation plan

I always like to start with a plan for this kind of project. Sometimes I'll use Claude's "planning mode", but in this case I knew I'd want the plan as an artifact in the repository so I told it to write plan.md directly.

This also means I can iterate on the plan with Claude. I noticed that Claude had decided to punt on generating screenshots of images in the PDF, suggesting that we defer a "canvas-encode swap" to v2. I fixed that by prompting:

Update the plan to say we WILL do the canvas-encode swap so the screenshots thing works

After a few short follow-up prompts, here's the plan.md I thought was strong enough to implement.

I prompted:

build it.

And then mostly left Claude Code to its own devices, tinkered with some other projects, caught up on Duolingo and occasionally checked in to see how it was doing.

I added a few prompts to the queue as I was working. Those don't yet show up in my exported transcript, but it turns out running rg queue-operation --no-filename | grep enqueue | jq -r '.content' in the relevant ~/.claude/projects/ folder extracts them.
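For anyone without rg and jq to hand, the same filtering can be sketched in Python. This mirrors the pipeline above literally: keep lines that mention both `queue-operation` and `enqueue`, then read the top-level `content` key. The `.jsonl` file extension and the function names are assumptions, and the real transcript format may differ.

```python
import json
from pathlib import Path


def queued_prompts(lines):
    """Yield the `content` value from JSON lines mentioning both marker strings."""
    for line in lines:
        if "queue-operation" in line and "enqueue" in line:
            record = json.loads(line)
            # Unlike jq, which would print "null", skip records without content.
            if isinstance(record, dict) and record.get("content") is not None:
                yield record["content"]


def queued_prompts_in(folder: Path):
    """Scan every .jsonl file under `folder` (assumed extension) for queued prompts."""
    for path in sorted(folder.rglob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            yield from queued_prompts(handle)
```

Calling `list(queued_prompts_in(Path.home() / ".claude" / "projects"))` would scan every project folder rather than just the relevant one, so pointing it at a narrower directory is probably wiser.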

Here are the key follow-up prompts with some notes:

When you implement this use playwright and red/green TDD, plan that too - I've written more about red/green TDD here.
let's use PDF.js's own renderer (it was messing around with pdfium)
The final UI should include both the text and the pretty-printed JSON output, both of those in textareas and both with copy-to-clipboard buttons - it should also be mobile friendly - I had a new idea for how the UI should work
small commits along the way - see below
Make sure the index.html page includes a link back to https://github.com/run-llama/liteparse near the top of the page - it's important to credit your dependencies in a project like this!
View on GitHub → is bad copy because that's not the repo with this web app in, it's the web app for the underlying LiteParse library
Run OCR should be unchecked by default
When I try to parse a PDF in my browser I see 'Parse failed: undefined is not a function (near '...value of readableStream...') - it was testing with Playwright in Chrome, turned out there was a bug in Safari
... oh that is in safari but it works in chrome
When "Copy" is clicked the text should change to "Copied!" for 1.5s
[Image #1] Style the file input so that long filenames don't break things on Firefox like this - in fact add one of those drag-drop zone UIs which you can also click to select a file - dropping screenshots in of small UI glitches works surprisingly well
Tweak the drop zone such that the text is vertically centered, right now it is a bit closer to the top
it breaks in Safari on macOS, works in both Chrome and Firefox. On Safari I see "Parse failed: undefined is not a function (near '...value of readableStream...')" after I click the Parse button, when OCR is not checked - it still wasn't working in Safari...
works in safari now - but it fixed it pretty quickly once I pointed that out and it got Playwright working with that browser

I've started habitually asking for "small commits along the way" because it makes for code that's easier to understand or review later on, and I have an unproven hunch that it helps the agent work more effectively too - it's yet another encouragement towards planning and taking on one problem at a time.

While it was working I decided it would be nice to be able to interact with an in-progress version. I asked a separate Claude Code session against the same directory for tips on how to run it, and it told me to use npx vite. Running that started a development server with live-reloading, which meant I could instantly see the effect of each change it made on disk - and prompt with further requests for tweaks and fixes.

Towards the end I decided it was going to be good enough to publish. I started a fresh Claude Code instance and told it:

Look at the web/ folder - set up GitHub actions for this repo such that any push runs the tests, and if the tests pass it then does a GitHub Pages deploy of the built vite app such that the web/index.html page is the index.html page for the thing that is deployed and it works on GitHub Pages

After a bit more iteration here's the GitHub Actions workflow that builds the app using Vite and deploys the result to https://simonw.github.io/liteparse/.

I love GitHub Pages for this kind of thing because it can be quickly configured (by Claude, in this case) to turn any repository into a deployed web-app, at zero cost and with whatever build step is necessary. It even works against private repos, if you don't mind your only security being a secret URL.

With this kind of project there's always a major risk that the model might "cheat" - mark key features as "TODO" and fake them, or take shortcuts that ignore the initial requirements.

The responsible way to prevent this is to review all of the code... but this wasn't intended as that kind of project, so instead I fired up OpenAI Codex with GPT-5.5 (I had preview access) and told it:

Describe the difference between how the node.js CLI tool runs and how the web/ version runs

The answer I got back was enough to give me confidence that Claude hadn't taken any project-threatening shortcuts.

... and that was about it. Total time in Claude Code for that "build it" step was 59 minutes. I used my claude-code-transcripts tool to export a readable version of the full transcript which you can view here, albeit without those additional queued prompts (here's my issue to fix that).

Is this even vibe coding any more?

I'm a pedantic stickler when it comes to the original definition of vibe coding - vibe coding does not mean any time you use AI to help you write code, it's when you use AI without reviewing or caring about the code that's written at all.

By my own definition, this LiteParse for the web project is about as pure vibe coding as you can get! I have not looked at a single line of the HTML and TypeScript written for this project - in fact while writing this sentence I had to go and check if it had used JavaScript or TypeScript.

Yet somehow this one doesn't feel as vibe coded to me as many of my other vibe coded projects:

As a static in-browser web application hosted on GitHub Pages the blast radius for any bugs is almost non-existent: it either works for your PDF or doesn't.
No private data is transferred anywhere - all processing happens in your browser - so a security audit is unnecessary. I've glanced once at the network panel while it's running and no additional requests are made when a PDF is being parsed.
There was still a whole lot of engineering experience and knowledge required to use the models in this way. Identifying that LiteParse could be ported to run directly in a browser was critical to the rest of the project.

Most importantly, I'm happy to attach my reputation to this project and recommend that other people try it out. Unlike most of my vibe coded tools I'm not convinced that spending significant additional engineering time on this would have resulted in a meaningfully better initial release. It's fine as it is!

I haven't opened a PR against the origin repository because I've not discussed it with the LiteParse team. I've opened an issue, and if they want my vibe coded implementation as a starting point for something more official they're welcome to take it.

Tags: javascript, ocr, pdf, projects, ai, generative-ai, llms, vibe-coding, coding-agents, claude-code, agentic-engineering

Getting DeepSeek-OCR working on an NVIDIA Spark via brute force using Claude Code

2025-10-20T17:21:52+00:00

DeepSeek released a new model yesterday: DeepSeek-OCR, a 6.6GB model fine-tuned specifically for OCR. They released it as model weights that run using PyTorch and CUDA. I got it running on the NVIDIA Spark by having Claude Code effectively brute force the challenge of getting it working on that particular hardware.

This small project (40 minutes this morning, most of which was Claude Code churning away while I had breakfast and did some other things) ties together a bunch of different concepts I've been exploring recently. I designed an agentic loop for the problem, gave Claude full permissions inside a Docker sandbox, embraced the parallel agents lifestyle and reused my notes on the NVIDIA Spark from last week.

I knew getting a PyTorch CUDA model running on the Spark was going to be a little frustrating, so I decided to outsource the entire process to Claude Code to see what would happen.

TLDR: It worked. It took four prompts (one long, three very short) to have Claude Code figure out everything necessary to run the new DeepSeek model on the NVIDIA Spark, OCR a document for me and produce copious notes about the process.

The setup

I connected to the Spark from my Mac via SSH and started a new Docker container there:

docker run -it --gpus=all \
  -v /usr/local/cuda:/usr/local/cuda:ro \
  nvcr.io/nvidia/cuda:13.0.1-devel-ubuntu24.04 \
  bash

Then I installed npm and used that to install Claude Code:

apt-get update
DEBIAN_FRONTEND=noninteractive TZ=Etc/UTC apt-get install -y npm
npm install -g @anthropic-ai/claude-code

Then started Claude Code, telling it that it's OK that it's running as root because it's in a sandbox:

IS_SANDBOX=1 claude --dangerously-skip-permissions

It provided me a URL to click on to authenticate with my Anthropic account.

The initial prompts

I kicked things off with this prompt:

Create a folder deepseek-ocr and do everything else in that folder

Then I ran the following, providing links to both the GitHub repository and the Hugging Face model, providing a clue about NVIDIA ARM and giving it an image (this one, see previous post) that I wanted it to run OCR on.

Your task is to get this working: https://github.com/deepseek-ai/DeepSeek-OCR - it uses Hugging Face Transformers and models from https://huggingface.co/deepseek-ai/DeepSeek-OCR - you are running in a Docker container on an NVIDIA ARM device, first research your environment to check it will be possible to do this. Git clone both of those - you may need to enable LFS first. Make extensive notes as you go in a notes.md file which you should only ever append new notes to. At the end of this you should have scripts in the folder which can be run to both setup and then run the model against an example - use https://static.simonwillison.net/static/2025/ft.jpeg as the example image that you OCR. As a final step create a detailed README.md file explaining the project and what you learned and how to run your code.

Note how I also told it to "make extensive notes as you go in a notes.md file which you should only ever append new notes to" so I'd have some useful notes at the end.

The follow-ups

It churned away for 13 minutes, and made a sequence of discoveries about the platform and the model and the difficulties of running PyTorch on it.

And then... it gave up:

The NVIDIA GB10 GPU has compute capability sm_121, which is too new for PyTorch 2.5.1

PyTorch 2.5.1 supports: sm_50, sm_80, sm_86, sm_89, sm_90, sm_90a

GB10 requires: sm_121

Error: CUDA error: no kernel image is available for execution on the device

Additionally, the model has hardcoded .cuda() calls, preventing CPU fallback without code modifications.

So I gave it some encouragement:

Any other options that might help get CUDA working like using different versions of PyTorch that have available CUDA wheels for this platform? Have you checked to see what PyTorch versions already have these ARM CUDA wheels that would work here?

It ran this command to grep for all available PyTorch wheels on download.pytorch.org that match the ARM architecture:

curl -s https://download.pytorch.org/whl/torch/ | grep -o 'torch-[0-9.]*.*aarch64.whl' | sort -V | tail -20

... and that unblocked it.

Excellent! I can see that PyTorch 2.9.0 has ARM64 wheels for CUDA 12.8, 12.9, and CUDA 13.0 (cu130)! This is exactly what we need!
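The curl and grep pipeline above can also be expressed as a small Python function, which makes the version sorting explicit. This is a sketch only: `aarch64_wheels` and its regular expression are illustrative, fetching the index page is left out, and the pattern assumes wheel filenames contain no whitespace or quotes.

```python
import re

# "torch-", a dotted numeric version, then anything up to "aarch64.whl".
WHEEL_RE = re.compile(r"torch-(\d+(?:\.\d+)*)[^\"<>\s]*aarch64\.whl")


def aarch64_wheels(index_html: str) -> list[str]:
    """Return unique aarch64 torch wheel names, sorted by numeric version."""
    found = {m.group(0): m.group(1) for m in WHEEL_RE.finditer(index_html)}

    def version_key(name: str) -> tuple[int, ...]:
        return tuple(int(part) for part in found[name].split("."))

    # Sort numerically so that 2.10 lands after 2.9, as `sort -V` does.
    return sorted(found, key=lambda name: (version_key(name), name))
```

Taking the last few entries of the returned list is the equivalent of the `tail -20` step.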

It installed the wheels it found and reported:

Good news! PyTorch 2.9.0 can detect the GB10 GPU with CUDA capability 12.1 (sm_121), but it says the maximum supported is 12.0. However, it's just a warning now instead of a hard error! Let's try running inference: [...]

Then kept on churning for a bit, ignoring warnings, and managed to run the model! It loaded the image and ran this prompt:

<|grounding|>OCR this image.

And got back output that started like this:

<|ref|>The perils of vibe coding<|/ref|><|det|>[[352, 30, 624, 111]]<|/det|>
<|ref|>opt<|/ref|><|det|>[[970, 0, 994, 30]]<|/det|>
<|ref|>such<|/ref|><|det|>[[970, 24, 996, 58]]<|/det|>
<|ref|>days<|/ref|><|det|>[[970, 52, 996, 87]]<|/det|>
<|ref|>pavi<|/ref|><|det|>[[970, 85, 996, 118]]<|/det|>
<|ref|>TECHNOLOGY<|/ref|><|det|>[[33, 199, 127, 230]]<|/det|>
<|ref|>holds the promise of replacing program-<|/ref|><|det|>[[220, 170, 390, 210]]<|/det|>
<|ref|>mers - a profession of very well paid<|/ref|><|det|>[[220, 199, 390, 238]]<|/det|>
<|ref|>people.For another, the work can be<|/ref|><|det|>[[220, 228, 390, 267]]<|/det|>

Claude declared victory and pointed me to the output/result.mmd file, which contained only whitespace. So OCR had worked but the result had failed to be written correctly to disk.
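The grounding output shown above is regular enough to pick apart with a regular expression. Here is a minimal sketch that assumes every entry has exactly one four-number box, as in the sample; `parse_grounding` is an illustrative name, not part of DeepSeek-OCR, and the sketch makes no claim about what coordinate system those numbers use.

```python
import re

REF_DET_RE = re.compile(
    r"<\|ref\|>(.*?)<\|/ref\|>"
    r"<\|det\|>\[\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]\]<\|/det\|>"
)


def parse_grounding(output: str) -> list[tuple[str, tuple[int, ...]]]:
    """Extract (text, (n1, n2, n3, n4)) pairs from grounding-mode output."""
    return [
        (m.group(1), tuple(int(m.group(i)) for i in range(2, 6)))
        for m in REF_DET_RE.finditer(output)
    ]
```

Joining the first element of each pair with newlines gives a rough plain-text version of the page, which is one way to recover usable text when the written file turns out to be empty.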

I said:

That result.mmd file looks to be exclusively whitespace to me, try and get a better result there - the bounding box image was good though

Which caused Claude to reconsider the prompt it used:

Maybe we need to use a different prompt. Looking at the README from DeepSeek-OCR, there are different prompts:

"<image>\n<|grounding|>Convert the document to markdown." - for documents

"<image>\n<|grounding|>OCR this image." - general OCR

"<image>\nFree OCR." - without layouts

The "grounding" mode seems to be focused on bounding boxes, which is why we got good bounding box results but the text output might be in a different format.

Let me try running it again with a different prompt that's more focused on text extraction rather than grounding.

It tried all of the prompt variants, made a note about them and created a file called PROMPTS_GUIDE.md with a bunch more details, including this comparison table:

Prompt	Speed	Text Quality	Structure	Coordinates	Best Use Case
Free OCR	⚡⚡⚡ Fast	⭐⭐⭐ Excellent	⭐ Basic	❌ No	General OCR
Markdown	⚡⚡ Medium	⭐⭐⭐ Excellent	⭐⭐⭐ Full	⭐⭐ Partial	Documents
Grounding	⚡ Slow	⭐⭐ Good	⭐ Basic	⭐⭐⭐ Full	Annotations
Detailed	⚡⚡⚡ Fastest	⭐ N/A	❌ N/A	❌ No	Image analysis

And this table under the heading "Performance Benchmarks":

Test image: 3503×1668 pixels (Financial Times article)

Prompt	Time	Output Size	Tokens
Free OCR	24s	Clean text	2257
Markdown	39s	Formatted MD	2257 + structure
Grounding	58s	Text + coords	2257 + boxes
Detailed	9s	Description	~300

My final prompt was this, to gather everything together into a zip file I could extract from the Docker container:

Create a zip file with the output and output_text and all of the scripts and notes - but leave out the github repo and the huggingface repo directories

I added the contents of that zip file to my new simonw/research GitHub repo in the deepseek-ocr-nvidia-spark folder.

Claude really likes writing notes! Here's the directory listing of that finished folder:

  |-- download_test_image.sh
  |-- FINAL_SUMMARY.md
  |-- notes.md
  |-- output
  |   |-- images
  |   |-- result_with_boxes.jpg
  |   `-- result.mmd
  |-- output_text
  |   |-- detailed
  |   |   |-- images
  |   |   |-- result_with_boxes.jpg
  |   |   `-- result.mmd
  |   |-- free_ocr
  |   |   |-- images
  |   |   |-- result_with_boxes.jpg
  |   |   `-- result.mmd
  |   `-- markdown
  |       |-- images
  |       |   `-- 0.jpg
  |       |-- result_with_boxes.jpg
  |       `-- result.mmd
  |-- PROMPTS_GUIDE.md
  |-- README_SUCCESS.md
  |-- README.md
  |-- run_ocr_best.py
  |-- run_ocr_cpu_nocuda.py
  |-- run_ocr_cpu.py
  |-- run_ocr_text_focused.py
  |-- run_ocr.py
  |-- run_ocr.sh
  |-- setup.sh
  |-- SOLUTION.md
  |-- test_image.jpeg
  |-- TEXT_OUTPUT_SUMMARY.md
  `-- UPDATE_PYTORCH.md

Takeaways

My first prompt was at 15:31:07 (UTC). The final message from Claude Code came in at 16:10:03. That means it took less than 40 minutes start to finish, and I was only actively involved for about 5-10 minutes of that time. The rest of the time I was having breakfast and doing other things.

Having tried and failed to get PyTorch stuff working in the past, I count this as a huge win. I'll be using this process a whole lot more in the future.

How good were the actual results? There's honestly so much material in the resulting notes created by Claude that I haven't reviewed all of it. There may well be all sorts of errors in there, but it's indisputable that it managed to run the model and made notes on how it did that such that I'll be able to do the same thing in the future.

I think the key factors in executing this project successfully were the following:

I gave it exactly what it needed: a Docker environment in the target hardware, instructions on where to get what it needed (the code and the model) and a clear goal for it to pursue. This is a great example of the pattern I described in designing agentic loops.
Running it in a Docker sandbox meant I could use claude --dangerously-skip-permissions and leave it running on its own. If I'd had to approve every command it wanted to run I would have got frustrated and quit the project after just a few minutes.
I applied my own knowledge and experience when it got stuck. I was confident (based on previous experiments with the Spark) that a CUDA wheel for ARM64 existed that was likely to work, so when it gave up I prompted it to try again, leading to success.

Oh, and it looks like DeepSeek OCR is a pretty good model if you spend the time experimenting with different ways to run it.

Bonus: Using VS Code to monitor the container

A small TIL from today: I had kicked off the job running in the Docker container via SSH to the Spark when I realized it would be neat if I could easily monitor the files it was creating while it was running.

I asked Claude.ai:

I am running a Docker container on a remote machine, which I started over SSH

How can I have my local VS Code on MacOS show me the filesystem in that docker container inside that remote machine, without restarting anything?

It gave me a set of steps that solved this exact problem:

Install the VS Code "Remote SSH" and "Dev Containers" extensions
Use "Remote-SSH: Connect to Host" to connect to the remote machine (on my Tailscale network that's spark@100.113.1.114)
In the window for that remote SSH session, run "Dev Containers: Attach to Running Container" - this shows a list of containers and you can select the one you want to attach to
... and that's it! VS Code opens a new window providing full access to all of the files in that container. I opened up notes.md and watched it as Claude Code appended to it in real time.

At the end when I told Claude to create a zip file of the results I could select that in the VS Code file explorer and use the "Download" menu item to download it to my Mac.

Tags: ocr, python, ai, docker, pytorch, generative-ai, llms, ai-assisted-programming, anthropic, claude, nvidia, vs-code, vision-llms, deepseek, llm-release, coding-agents, claude-code, ai-in-china, nvidia-spark

The perils of vibe coding

2025-08-29T17:51:10+00:00

I was interviewed by Elaine Moore for this opinion piece in the Financial Times, which ended up in the print edition of the paper too! I picked up a copy yesterday:

From the article, with links added by me to relevant projects:

Willison thinks the best way to see what a new model can do is to ask for something unusual. He likes to request an SVG (an image made out of lines described with code) of a pelican on a bike and asks it to remember the chickens in his garden by name. Results can be bizarre. One model ignored his prompts in favour of composing a poem.

Still, his adventures in vibe coding sound like an advert for the sector. He used Anthropic's Claude Code, the favoured model for developers, to make an OCR (optical character recognition - software loves acronyms) tool that will copy and paste text from a screenshot.

He wrote software that summarises blog comments and has plans to build a custom tool that will alert him when a whale is visible from his Pacific coast home. All this by typing prompts in English.

I've been talking about that whale spotting project for far too long. Now that it's been in the FT I really need to build it.

(On the subject of OCR... I tried extracting the text from the above image using GPT-5 and got a surprisingly bad result full of hallucinated details. Claude Opus 4.1 did a lot better but still made some mistakes. Gemini 2.5 did much better.)

Tags: ocr, ai, generative-ai, llms, claude, vibe-coding, press-quotes, gpt-5, gpt

How OpenElections Uses LLMs

2025-06-19T18:26:38+00:00

The OpenElections project collects detailed election data for the USA, all the way down to the precinct level. This is a surprisingly hard problem: while county and state-level results are widely available, precinct-level results are published in thousands of different ad-hoc ways and rarely aggregated once the election result has been announced.

A lot of those precinct results are published as image-filled PDFs.

Derek Willis has recently started leaning on Gemini to help parse those PDFs into CSV data:

For parsing image PDFs into CSV files, Google’s Gemini is my model of choice, for two main reasons. First, the results are usually very, very accurate (with a few caveats I’ll detail below), and second, Gemini’s large context window means it’s possible to work with PDF files that can be multiple MBs in size.

In this piece he shares the process and prompts for a real-world expert level data entry project, assisted by Gemini.

This example from Limestone County, Texas is a great illustration of how tricky this problem can get. Getting traditional OCR software to correctly interpret multi-column layouts like this always requires some level of manual intervention:

Derek's prompt against Gemini 2.5 Pro throws in an example, some special instructions and a note about the two column format:

Produce a CSV file from the attached PDF based on this example:

county,precinct,office,district,party,candidate,votes,absentee,early_voting,election_day
Limestone,Precinct 101,Registered Voters,,,,1858,,,
Limestone,Precinct 101,Ballots Cast,,,,1160,,,
Limestone,Precinct 101,President,,REP,Donald J. Trump,879,,,
Limestone,Precinct 101,President,,DEM,Kamala D. Harris,271,,,
Limestone,Precinct 101,President,,LIB,Chase Oliver,1,,,
Limestone,Precinct 101,President,,GRN,Jill Stein,4,,,
Limestone,Precinct 101,President,,,Write-ins,1,,,

Skip Write-ins with candidate names and rows with "Cast Votes", "Not Assigned", "Rejected write-in votes", "Unresolved write-in votes" or "Contest Totals". Do not extract any values that end in "%"

Use the following offices:

President/Vice President -> President
United States Senator -> U.S. Senate
US Representative -> U.S. House
State Senator -> State Senate

Quote all office and candidate values. The results are split into two columns on each page; parse the left column first and then the right column.

A spot-check and a few manual tweaks and the result against a 42 page PDF was exactly what was needed.
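Part of a spot-check like that can be automated. The sketch below checks model-produced CSV against the rules in the prompt above: the expected header, no values ending in "%", none of the rows the prompt says to skip, and a numeric vote count. It is an assumption that the skipped labels would show up in the candidate column, and `check_results` is an illustrative name rather than anything from the OpenElections tooling.

```python
import csv
import io

EXPECTED_HEADER = [
    "county", "precinct", "office", "district", "party",
    "candidate", "votes", "absentee", "early_voting", "election_day",
]
SKIPPED_LABELS = {
    "Cast Votes", "Not Assigned", "Rejected write-in votes",
    "Unresolved write-in votes", "Contest Totals",
}


def check_results(csv_text: str) -> list[str]:
    """Return human-readable problems found in model-produced CSV output."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header != EXPECTED_HEADER:
        return [f"unexpected header: {header}"]
    problems = []
    for line_no, row in enumerate(reader, start=2):
        if len(row) != len(EXPECTED_HEADER):
            problems.append(f"line {line_no}: expected {len(EXPECTED_HEADER)} fields, got {len(row)}")
            continue
        record = dict(zip(EXPECTED_HEADER, row))
        if record["candidate"] in SKIPPED_LABELS:
            problems.append(f"line {line_no}: row should have been skipped")
        if any(value.endswith("%") for value in row):
            problems.append(f"line {line_no}: percentage value extracted")
        if not record["votes"].isdigit():
            problems.append(f"line {line_no}: votes is not a whole number")
    return problems
```

Checks like these only catch structural mistakes. Comparing totals against the official figures is still needed to catch misread digits.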

How about something harder? The results for Cameron County came as more than 600 pages and looked like this - note the hole-punch holes that obscure some of the text!

This file had to be split into chunks of 100 pages each, and the entire process still took a full hour of work - but the resulting table matched up with the official vote totals.
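Working out the page ranges for that kind of split is simple arithmetic. This sketch only computes the ranges and leaves the actual PDF splitting to whatever tool is preferred; `page_chunks` is an illustrative name.

```python
def page_chunks(total_pages: int, chunk_size: int = 100) -> list[tuple[int, int]]:
    """Return inclusive 1-based (first, last) page ranges covering the document."""
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]
```

For a hypothetical 650 page file this produces seven ranges, the last of which covers pages 601 to 650.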

I love how realistic this example is. AI data entry like this isn't a silver bullet - there's still a bunch of work needed to verify the results and creative thinking needed to work through limitations - but it represents a very real improvement in how small teams can take on projects of this scale.

In the six weeks since we started working on Texas precinct results, we’ve been able to convert them for more than half of the state’s 254 counties, including many image PDFs like the ones on display here. That pace simply wouldn’t be possible with data entry or traditional OCR software.

Via Hacker News

Tags: data-journalism, derek-willis, ocr, ai, generative-ai, llms, gemini, vision-llms, structured-extraction

qwen2.5vl in Ollama

2025-05-18T12:31:15+00:00

Ollama announced a complete overhaul of their vision support the other day. Here's the first new model they've shipped since then - a packaged version of Qwen 2.5 VL which was first released on January 26th 2025. Here are my notes from that release.

I upgraded Ollama (it auto-updates so I just had to restart it from the tray icon) and ran this:

ollama pull qwen2.5vl

This downloaded a 6GB model file. I tried it out against my photo of Cleo rolling on the beach:

llm -a https://static.simonwillison.net/static/2025/cleo-sand.jpg \
  'describe this image' -m qwen2.5vl

And got a pretty good result:

The image shows a dog lying on its back on a sandy beach. The dog appears to be a medium to large breed with a dark coat, possibly black or dark brown. It is wearing a red collar or harness around its chest. The dog's legs are spread out, and its belly is exposed, suggesting it might be rolling around or playing in the sand. The sand is light-colored and appears to be dry, with some small footprints and marks visible around the dog. The lighting in the image suggests it is taken during the daytime, with the sun casting a shadow of the dog to the left side of the image. The overall scene gives a relaxed and playful impression, typical of a dog enjoying time outdoors on a beach.

Qwen 2.5 VL has a strong reputation for OCR, so I tried it on my poster:

llm -a https://static.simonwillison.net/static/2025/poster.jpg \
  'convert to markdown' -m qwen2.5vl

The result that came back:

It looks like the image you provided is a jumbled and distorted text, making it difficult to interpret. If you have a specific question or need help with a particular topic, please feel free to ask, and I'll do my best to assist you!

I'm not sure what went wrong here. My best guess is that the maximum resolution the model can handle is too small to make out the text, or maybe Ollama resized the image to the point of illegibility before handing it to the model?

Update: I think this may be a bug relating to URL handling in LLM/llm-ollama. I tried downloading the file first:

wget https://static.simonwillison.net/static/2025/poster.jpg
llm -m qwen2.5vl 'extract text' -a poster.jpg

This time it did a lot better. The results weren't perfect though - it ended up stuck in a loop outputting the same code example dozens of times.

I tried with a different prompt - "extract text" - and it got confused by the three column layout, misread Datasette as "Datasetette" and missed some of the text. Here's that result.

These experiments used qwen2.5vl:7b (6GB) - I expect the results would be better with the larger qwen2.5vl:32b (21GB) and qwen2.5vl:72b (71GB) models.

Fred Jonsson reported a better result using the MLX model via LM Studio (~9GB model running in 8bit - I think that's mlx-community/Qwen2.5-VL-7B-Instruct-8bit). His full output is here - looks almost exactly right to me.

Tags: ocr, ai, generative-ai, local-llms, llms, llm, vision-llms, qwen, mlx, ollama, ai-in-china

Mistral OCR

2025-03-07T01:39:26+00:00

New closed-source specialist OCR model by Mistral - you can feed it images or a PDF and it produces Markdown with optional embedded images.

It's available via their API, or it's "available to self-host on a selective basis" for people with stringent privacy requirements who are willing to talk to their sales team.

I decided to try out their API, so I copied and pasted example code from their notebook into my custom Claude project and told it:

Turn this into a CLI app, depends on mistralai - it should take a file path and an optional API key defauling to env vironment called MISTRAL_API_KEY

After some further iteration / vibe coding I got to something that worked, which I then tidied up and shared as mistral_ocr.py.

You can try it out like this:

export MISTRAL_API_KEY='...'
uv run http://tools.simonwillison.net/python/mistral_ocr.py \
  mixtral.pdf --html --inline-images > mixtral.html

I fed in the Mixtral paper as a PDF. The API returns Markdown, but my --html option renders that Markdown as HTML and the --inline-images option takes any images and inlines them as base64 URIs (inspired by monolith). The result is mixtral.html, a 972KB HTML file with images and text bundled together.
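To illustrate the idea behind inlining images as base64 URIs, here is a minimal Python sketch. It is not the code from mistral_ocr.py: the function names `to_data_uri` and `inline_images` are hypothetical, and it assumes the HTML references each image with a plain `src="filename"` attribute.

```python
import base64
import mimetypes


def to_data_uri(image_bytes: bytes, filename: str) -> str:
    """Return a data: URI that embeds image_bytes, guessing the MIME type from the filename."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None:
        # Fallback for unknown extensions (an assumption, not a rule from the original tool)
        mime_type = "application/octet-stream"
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def inline_images(html: str, images: dict[str, bytes]) -> str:
    """Replace each src="name" reference with the matching data: URI."""
    for name, data in images.items():
        html = html.replace(f'src="{name}"', f'src="{to_data_uri(data, name)}"')
    return html
```

Base64 encoding makes each image roughly a third larger than the original bytes, which is part of why a bundled file like this can approach a megabyte.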

This did a pretty great job!

My script renders Markdown tables but I haven't figured out how to render inline Markdown MathML yet. I ran the command a second time and requested Markdown output (the default) like this:

uv run http://tools.simonwillison.net/python/mistral_ocr.py \
  mixtral.pdf > mixtral.md

Here's that Markdown rendered as a Gist - there are a few MathML glitches so clearly the Mistral OCR MathML dialect and the GitHub Flavored Markdown dialect don't quite line up.

My tool can also output raw JSON as an alternative to Markdown or HTML - full details in the documentation.

The Mistral API is priced at roughly 1000 pages per dollar, with a 50% discount for batch usage.
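As a rough worked example of that pricing, here is a small sketch. The constants come straight from the "roughly 1000 pages per dollar" and "50% discount" figures above, so treat the output as an estimate; the function name `estimate_cost` is just for illustration.

```python
PAGES_PER_DOLLAR = 1000  # "roughly 1000 pages per dollar"
BATCH_DISCOUNT = 0.5     # "50% discount for batch usage"


def estimate_cost(pages: int, batch: bool = False) -> float:
    """Estimate the dollar cost of running OCR over the given number of pages."""
    cost = pages / PAGES_PER_DOLLAR
    if batch:
        cost *= 1 - BATCH_DISCOUNT
    return cost


if __name__ == "__main__":
    for pages in (32, 1_000, 100_000):
        print(pages, estimate_cost(pages), estimate_cost(pages, batch=True))
```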

The big question with LLM-based OCR is always how well it copes with accidental instructions in the text (can you safely OCR a document full of prompting examples?) and how well it handles text it can't read.

Mistral's Sophia Yang says it "should be robust" against following instructions in the text, and invited people to try and find counter-examples.

Alexander Doria noted that Mistral OCR can hallucinate text when faced with handwriting that it cannot understand.

Via @sophiamyang

Tags: cli, ocr, pdf, projects, ai, generative-ai, llms, claude, mistral, vision-llms, uv

olmOCR

2025-02-26T02:04:03+00:00

olmOCR

New from Ai2 - olmOCR is "an open-source tool designed for high-throughput conversion of PDFs and other documents into plain text while preserving natural reading order".

At its core is allenai/olmOCR-7B-0225-preview, a Qwen2-VL-7B-Instruct variant trained on ~250,000 pages of diverse PDF content (both scanned and text-based) that were labelled using GPT-4o and made available as the olmOCR-mix-0225 dataset.

The olmocr Python library can run the model on any "recent NVIDIA GPU". I haven't managed to run it on my own Mac yet - there are GGUFs out there but it's not clear to me how to run vision prompts through them - but Ai2 offer an online demo which can handle up to ten pages for free.

Given the right hardware this looks like a very inexpensive way to run large scale document conversion projects:

We carefully optimized our inference pipeline for large-scale batch processing using SGLang, enabling olmOCR to convert one million PDF pages for just $190 - about 1/32nd the cost of using GPT-4o APIs.
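The arithmetic behind that claim can be sketched in a few lines. Both constants are taken from the quote; the GPT-4o figure is only what the "1/32nd" ratio implies, not a number from a price list.

```python
OLMOCR_COST_PER_MILLION = 190.0  # dollars per million pages, from the quote
GPT4O_COST_RATIO = 32            # olmOCR is "about 1/32nd the cost"


def cost_for_pages(pages: int, cost_per_million: float) -> float:
    """Scale a per-million-pages price down (or up) to a given page count."""
    return pages * cost_per_million / 1_000_000


# Cost per million pages implied for GPT-4o by the quoted ratio
implied_gpt4o_per_million = OLMOCR_COST_PER_MILLION * GPT4O_COST_RATIO

if __name__ == "__main__":
    print(cost_for_pages(10_000, OLMOCR_COST_PER_MILLION))
    print(implied_gpt4o_per_million)
```

In other words the quoted figures work out to about $0.00019 per page for olmOCR, and imply roughly $6,080 per million pages for the GPT-4o comparison.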

The most interesting idea from the technical report (PDF) is something they call "document anchoring":

Document anchoring extracts coordinates of salient elements in each page (e.g., text blocks and images) and injects them alongside raw text extracted from the PDF binary file. [...]

Document anchoring processes PDF document pages via the PyPDF library to extract a representation of the page’s structure from the underlying PDF. All of the text blocks and images in the page are extracted, including position information. Starting with the most relevant text blocks and images, these are sampled and added to the prompt of the VLM, up to a defined maximum character limit. This extra information is then available to the model when processing the document.
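Here is a minimal sketch of how I read that description - it is not olmOCR's actual implementation. The `Block` class, the `relevance` score and the bracketed output format are all hypothetical, and the sketch assumes the positions have already been extracted from the PDF by some other step.

```python
from dataclasses import dataclass


@dataclass
class Block:
    kind: str               # "text" or "image"
    x: float                # position of the block on the page
    y: float
    text: str = ""          # empty for images
    relevance: float = 0.0  # hypothetical score used to pick blocks first


def build_anchor_text(blocks: list[Block], max_chars: int = 4000) -> str:
    """Add blocks to the prompt, most relevant first, until the character limit is hit."""
    lines = []
    used = 0
    for block in sorted(blocks, key=lambda b: b.relevance, reverse=True):
        if block.kind == "image":
            line = f"[Image {block.x:.0f}x{block.y:.0f}]"
        else:
            line = f"[{block.x:.0f}x{block.y:.0f}]{block.text}"
        # +1 accounts for the newline that joins this line to the rest
        if used + len(line) + 1 > max_chars:
            break
        lines.append(line)
        used += len(line) + 1
    return "\n".join(lines)
```

The resulting string would be included in the prompt alongside the rendered page image, giving the vision model both the pixels and the coordinates.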

The one limitation of olmOCR at the moment is that it doesn't appear to do anything with diagrams, figures or illustrations. Vision models are actually very good at interpreting these now, so my ideal OCR solution would include detailed automated descriptions of this kind of content in the resulting text.

Update: Jonathan Soma figured out how to run it on a Mac using LM Studio and the olmocr Python package.

Via Luca Soldaini

Tags: ocr, pdf, ai, generative-ai, local-llms, llms, fine-tuning, vision-llms, qwen, ai2, lm-studio, ai-in-china

Qwen2.5 VL! Qwen2.5 VL! Qwen2.5 VL!

2025-01-27T20:05:03+00:00

Qwen2.5 VL! Qwen2.5 VL! Qwen2.5 VL!

Hot on the heels of yesterday's Qwen2.5-1M, here's Qwen2.5 VL (with an excitable announcement title) - the latest in Qwen's series of vision LLMs.

They're releasing multiple versions: base models and instruction tuned models at 3B, 7B and 72B. There's a lot of additional information in their README.

The 72B model can be selected on Qwen's Qwen Chat interface. I gave it this image of our dog Cleo joy rolling on the beach:

And prompted:

Describe this image in detail, be flowery

(Since the Qwen team tends to use flowery language in their announcements.)

Here's what I got back:

In this enchanting tableau, a canine companion finds solace in the embrace of nature's softest bed. The sandy expanse stretches out like an endless canvas, its surface etched with the delicate imprints of countless wanderers who have come and gone, leaving behind a tapestry of footprints that whisper tales of joyous romps and serene strolls.

The dog, a vision of contentment, lies sprawled upon the sand, its dark fur gleaming under the gentle caress of the sun. Its body is relaxed, limbs splayed in abandon, as if it has surrendered to the soothing rhythm of the beach. A vibrant red collar adorns its neck, a splash of color against the muted tones of the surroundings, hinting at the care and love bestowed upon this cherished pet.

The sand itself is a work of art, its grains shimmering with a silvery sheen that catches the light. Here and there, small tufts of seaweed peek through, remnants of the ocean's bounty, adding a touch of wild beauty to the scene. The shadows cast by unseen objects dance across the sand, creating a play of light and dark that lends an air of mystery to the tranquil setting.

This moment, frozen in time, speaks of peace and harmony, where man's best friend finds respite in the simple pleasures of life. It is a scene that invites one to pause, breathe deeply, and revel in the serenity of nature's embrace.

I'm looking forward to trying out their smaller models - the 3B model in particular should be small enough to run comfortably even on an iPhone, so hopefully someone will wire up an example of that soon (maybe using MLX).

VB points out that the vision benchmarks for Qwen 2.5 VL 7B show it out-performing GPT-4o mini!

Qwen2.5 VL cookbooks

Qwen also just published a set of cookbook recipes:

universal_recognition.ipynb demonstrates basic visual Q&A, including prompts like Who are these in this picture? Please give their names in Chinese and English against photos of celebrities, an ability other models have deliberately suppressed.
spatial_understanding.ipynb demonstrates bounding box support, with prompts like Locate the top right brown cake, output its bbox coordinates using JSON format.
video_understanding.ipynb breaks a video into individual frames and asks questions like Could you go into detail about the content of this long video?
ocr.ipynb shows Qwen2.5-VL-7B-Instruct performing OCR in multiple different languages.
document_parsing.ipynb uses Qwen to convert images of documents to HTML and other formats, and notes that "we introduce a unique Qwenvl HTML format that includes positional information for each component, enabling precise document reconstruction and manipulation."
mobile_agent.ipynb runs Qwen with tool use against tools for controlling a mobile phone, similar to ChatGPT Operator or Claude Computer Use.
computer_use.ipynb showcases "GUI grounding" - feeding in screenshots of a user's desktop and running tools for things like left clicking on a specific coordinate.
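If you ask a model for bounding boxes "using JSON format" you still need to parse what comes back. This sketch assumes the model returns a JSON list of objects with a `bbox_2d` key holding `[x1, y1, x2, y2]` and an optional `label` - that shape is an assumption on my part, so check it against real output before relying on it.

```python
import json


def parse_boxes(model_output: str) -> list[dict]:
    """Convert a JSON list of corner-style boxes into x/y/width/height dictionaries."""
    boxes = []
    for item in json.loads(model_output):
        x1, y1, x2, y2 = item["bbox_2d"]  # assumed key and corner ordering
        boxes.append(
            {
                "label": item.get("label", ""),
                "x": x1,
                "y": y1,
                "width": x2 - x1,
                "height": y2 - y1,
            }
        )
    return boxes
```

Width and height are often more convenient than two corners when drawing rectangles on a canvas or cropping an image.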

Running it with mlx-vlm

Update 30th January 2025: I got it working on my Mac using uv and mlx-vlm, with some hints from this issue. Here's the recipe that worked (downloading a 9GB model from mlx-community/Qwen2.5-VL-7B-Instruct-8bit):

uv run --with 'numpy<2' --with 'git+https://github.com/huggingface/transformers' \
  --with mlx-vlm \
  python -m mlx_vlm.generate \
    --model mlx-community/Qwen2.5-VL-7B-Instruct-8bit \
    --max-tokens 100 \
    --temp 0.0 \
    --prompt "Describe this image." \
    --image path-to-image.png

I ran that against this image:

And got back this result:

The image appears to illustrate a flowchart or diagram related to a cybersecurity scenario. Here's a breakdown of the elements:

User: Represented by a simple icon of a person.

AI Agent: Represented by a blue diamond shape.

Attacker: Represented by a red icon of a person.

Cloud and Email Icons: Represented by a cloud and an envelope, indicating data or information being transferred.

Text: The text

Tags: ocr, ai, prompt-engineering, generative-ai, local-llms, llms, vision-llms, uv, qwen, mlx, ai-in-china, computer-use

Docling

2024-11-03T04:57:56+00:00

Docling

MIT licensed document extraction Python library from the Deep Search team at IBM, who released Docling v2 on October 16th.

Here's the Docling Technical Report paper from August, which provides details of two custom models: a layout analysis model for figuring out the structure of the document (sections, figures, text, tables etc) and a TableFormer model specifically for extracting structured data from tables.

Those models are available on Hugging Face.

Here's how to try out the Docling CLI interface using uvx (avoiding the need to install it first - though since it downloads models it will take a while to run the first time):

uvx docling mydoc.pdf --to json --to md

This will output a mydoc.json file with complex layout information and a mydoc.md Markdown file which includes Markdown tables where appropriate.

The Python API is a lot more comprehensive. It can even extract tables as Pandas DataFrames:

from docling.document_converter import DocumentConverter
converter = DocumentConverter()
result = converter.convert("document.pdf")
for table in result.document.tables:
    df = table.export_to_dataframe()
    print(df)

I ran that inside uv run --with docling python. It took a little while to run, but it demonstrated that the library works.

Tags: cli, ibm, ocr, pdf, python, ai, hugging-face, uv

Running prompts against images and PDFs with Google Gemini

2024-10-23T18:25:07+00:00

Running prompts against images and PDFs with Google Gemini

New TIL. I've been experimenting with the Google Gemini APIs for running prompts against images and PDFs (in preparation for finally adding multi-modal support to LLM) - here are my notes on how to send images or PDF files to their API using curl and the base64 -i macOS command.

I figured out the curl incantation first and then got Claude to build me a Bash script that I can execute like this:

prompt-gemini 'extract text' example-handwriting.jpg

Playing with this is really fun. The Gemini models charge less than 1/10th of a cent per image, so it's really inexpensive to try them out.
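For anyone who prefers Python to curl, here is a sketch of building the same kind of request body: a text prompt plus a base64-encoded file. The `contents` / `parts` / `inline_data` layout reflects my understanding of the Gemini REST API rather than anything copied from the TIL, and the function name is hypothetical. It only builds the JSON - sending it to the API with your key is left out.

```python
import base64
import json


def build_gemini_request(prompt: str, file_bytes: bytes, mime_type: str) -> str:
    """Return a JSON request body containing a text prompt and one inline file."""
    payload = {
        "contents": [
            {
                "parts": [
                    {"text": prompt},
                    {
                        "inline_data": {
                            "mime_type": mime_type,
                            # Binary data has to be base64 encoded to travel inside JSON
                            "data": base64.b64encode(file_bytes).decode("ascii"),
                        }
                    },
                ]
            }
        ]
    }
    return json.dumps(payload)
```

The same structure should work for PDFs by passing `application/pdf` as the MIME type.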

Tags: bash, google, ocr, projects, ai, generative-ai, llms, ai-assisted-programming, llm, gemini, vision-llms, llm-pricing

State-of-the-art music scanning by Soundslice

2024-06-20T04:37:28+00:00

State-of-the-art music scanning by Soundslice

It's been a while since I checked in on Soundslice, Adrian Holovaty's beautiful web application focused on music education.

The latest feature is spectacular. The Soundslice music editor - already one of the most impressive web applications I've ever experienced - can now import notation directly from scans or photos of sheet music.

The attention to detail is immaculate. The custom machine learning model can handle a wide variety of notation details, and the system asks the user to verify or correct details that it couldn't perfectly determine using a neatly designed flow.

Free accounts can scan two single page documents a month, and paid plans get a much higher allowance. I tried it out just now on a low resolution image I found on Wikipedia and it did a fantastic job, even allowing me to listen to a simulated piano rendition of the music once it had finished processing.

It's worth spending some time with the release notes for the feature to appreciate how much work they've put into improving it since the initial release.

If you're new to Soundslice, here's an example of their core player interface which syncs the display of music notation to an accompanying video.

Adrian wrote up some detailed notes on the machine learning behind the feature when they first launched it in beta back in November 2022.

OMR [Optical Music Recognition] is an inherently hard problem, significantly more difficult than text OCR. For one, music symbols have complex spatial relationships, and mistakes have a tendency to cascade. A single misdetected key signature might result in multiple incorrect note pitches. And there’s a wide diversity of symbols, each with its own behavior and semantics — meaning the problems and subproblems aren’t just hard, there are many of them.

Tags: adrian-holovaty, machine-learning, music, ocr, ai

Civic Band

2024-06-19T21:30:46+00:00

Civic Band

Exciting new civic tech project from Philip James: 30 (and counting) Datasette instances serving full-text search enabled collections of OCRd meeting minutes for different civic governments. Includes 20,000 pages for Alameda, 17,000 for Pittsburgh, 3,567 for Baltimore and an enormous 117,000 for Maui County.

Philip includes some notes on how they're doing it. They gather PDF minute notes from anywhere that provides API access to them, then run local Tesseract for OCR (the cost of cloud-based OCR proving prohibitive given the volume of data). The collection is then deployed to a single VPS running multiple instances of Datasette via Caddy, one instance for each of the covered regions.
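The core of this pattern - OCR text loaded into SQLite with full-text search - can be sketched with Python's standard library. This is not Philip's pipeline: the table layout and function names are hypothetical, and it assumes your SQLite build includes the FTS5 extension (most do).

```python
import sqlite3


def build_index(pages: list[tuple[str, int, str]]) -> sqlite3.Connection:
    """Load (meeting, page, text) rows into an in-memory FTS5 table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE VIRTUAL TABLE pages USING fts5(meeting, page, text)")
    db.executemany(
        "INSERT INTO pages (meeting, page, text) VALUES (?, ?, ?)", pages
    )
    return db


def search(db: sqlite3.Connection, query: str) -> list[tuple]:
    """Return matching rows, best match first."""
    return db.execute(
        "SELECT meeting, page FROM pages WHERE pages MATCH ? ORDER BY rank",
        (query,),
    ).fetchall()
```

Swap `:memory:` for a filename and the resulting database file is the kind of thing Datasette can serve directly.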

Tags: data-journalism, ocr, tesseract, datasette

textract-cli

2024-03-30T19:01:13+00:00

textract-cli

This is my other OCR project from yesterday: I built the thinnest possible CLI wrapper around Amazon Textract, out of frustration at how hard that tool is to use on an ad-hoc basis.

It only works with JPEGs and PNGs (not PDFs) up to 5MB in size, reflecting limitations in Textract’s synchronous API: it can handle PDFs amazingly well but you have to upload them to an S3 bucket first and I decided to keep the scope tight for the first version of this tool.

Assuming you’ve configured AWS credentials already, this is all you need to know:

pipx install textract-cli
textract-cli image.jpeg > output.txt
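The JPEG/PNG and 5MB limits are easy to check before making an API call. This sketch is not taken from textract-cli: the function name is hypothetical, it treats 5MB as 5 * 1024 * 1024 bytes (an assumption) and it identifies the format from the first few bytes of the file rather than the extension.

```python
MAX_BYTES = 5 * 1024 * 1024      # assumed interpretation of the 5MB limit
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"  # every PNG file starts with these 8 bytes
JPEG_MAGIC = b"\xff\xd8\xff"      # JPEG files start with these 3 bytes


def check_image(data: bytes) -> str:
    """Return "png" or "jpeg" if the data is acceptable, otherwise raise ValueError."""
    if len(data) > MAX_BYTES:
        raise ValueError(f"Image is {len(data)} bytes, limit is {MAX_BYTES}")
    if data.startswith(PNG_MAGIC):
        return "png"
    if data.startswith(JPEG_MAGIC):
        return "jpeg"
    raise ValueError("Only JPEG and PNG images are supported")
```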

Tags: aws, cli, ocr, projects

Running OCR against PDFs and images directly in your browser

2024-03-30T17:59:56+00:00

I attended the Story Discovery At Scale data journalism conference at Stanford this week. One of the perennial hot topics at any journalism conference concerns data extraction: how can we best get data out of PDFs and images?

I've been having some very promising results with Gemini Pro 1.5, Claude 3 and GPT-4 Vision recently - I'll write more about that soon. But those tools are still inconvenient for most people to use.

Meanwhile, older tools like Tesseract OCR are still extremely useful - if only they were easier to use as well.

Then I remembered that Tesseract runs happily in a browser these days thanks to the excellent Tesseract.js project. And PDFs can be processed using JavaScript too thanks to Mozilla's extremely mature and well-tested PDF.js library.

So I built a new tool!

tools.simonwillison.net/ocr provides a single page web app that can run Tesseract OCR against images or PDFs that are opened in (or dragged and dropped onto) the app.

Crucially, everything runs in the browser. There is no server component here, and nothing is uploaded. Your images and documents never leave your computer or phone.

Here's an animated demo:

It's not perfect: multi-column PDFs (thanks, academia) will be treated as a single column, illustrations or photos may result in garbled ASCII-art and there are plenty of other edge cases that will trip it up.

But... having Tesseract OCR available against PDFs in a web browser (including in Mobile Safari) is still a really useful thing.

How I built this

For more recent examples of projects I've built with the assistance of LLMs, see Building and testing C extensions for SQLite with ChatGPT Code Interpreter and Claude and ChatGPT for ad-hoc sidequests.

I built the first version of this tool in just a few minutes, using Claude 3 Opus.

I already had my own JavaScript code lying around for the two most important tasks: running Tesseract.js against an image and using PDF.js to turn a PDF into a series of images.

The OCR code came from the system I built and explained in How I make annotated presentations (built with the help of multiple ChatGPT sessions). The PDF to images code was from an unfinished experiment which I wrote with the aid of Claude 3 Opus a week ago.

I composed the following prompt for Claude 3, where I pasted in both of my code examples and then added some instructions about what I wanted it to build at the end:

This code shows how to open a PDF and turn it into an image per page:

<!DOCTYPE html>
<html>
<head>
  <title>PDF to Images</title>
  <script src="https://cdnjs.cloudflare.com/ajax/libs/pdf.js/2.9.359/pdf.min.js"></script>
  <style>
    .image-container img {
      margin-bottom: 10px;
    }
    .image-container p {
      margin: 0;
      font-size: 14px;
      color: #888;
    }
  </style>
</head>
<body>
  <input type="file" id="fileInput" accept=".pdf" />
  <div class="image-container"></div>

  <script>
  const desiredWidth = 800;
    const fileInput = document.getElementById('fileInput');
    const imageContainer = document.querySelector('.image-container');

    fileInput.addEventListener('change', handleFileUpload);

    pdfjsLib.GlobalWorkerOptions.workerSrc = 'https://cdnjs.cloudflare.com/ajax/libs/pdf.js/2.9.359/pdf.worker.min.js';

    async function handleFileUpload(event) {
      const file = event.target.files[0];
      const imageIterator = convertPDFToImages(file);

      for await (const { imageURL, size } of imageIterator) {
        const imgElement = document.createElement('img');
        imgElement.src = imageURL;
        imageContainer.appendChild(imgElement);

        const sizeElement = document.createElement('p');
        sizeElement.textContent = `Size: ${formatSize(size)}`;
        imageContainer.appendChild(sizeElement);
      }
    }

    async function* convertPDFToImages(file) {
      try {
        const pdf = await pdfjsLib.getDocument(URL.createObjectURL(file)).promise;
        const numPages = pdf.numPages;

        for (let i = 1; i <= numPages; i++) {
          const page = await pdf.getPage(i);
          const viewport = page.getViewport({ scale: 1 });
          const canvas = document.createElement('canvas');
          const context = canvas.getContext('2d');
          canvas.width = desiredWidth;
          canvas.height = (desiredWidth / viewport.width) * viewport.height;
          const renderContext = {
            canvasContext: context,
            viewport: page.getViewport({ scale: desiredWidth / viewport.width }),
          };
          await page.render(renderContext).promise;
          const imageURL = canvas.toDataURL('image/jpeg', 0.8);
          const size = calculateSize(imageURL);
          yield { imageURL, size };
        }
      } catch (error) {
        console.error('Error:', error);
      }
    }

    function calculateSize(imageURL) {
      const base64Length = imageURL.length - 'data:image/jpeg;base64,'.length;
      const sizeInBytes = Math.ceil(base64Length * 0.75);
      return sizeInBytes;
    }

    function formatSize(size) {
      const sizeInKB = (size / 1024).toFixed(2);
      return `${sizeInKB} KB`;
    }
  </script>
</body>
</html>

This code shows how to OCR an image:

async function ocrMissingAltText() {
    // Load Tesseract
    var s = document.createElement("script");
    s.src = "https://unpkg.com/tesseract.js@v2.1.0/dist/tesseract.min.js";
    document.head.appendChild(s);

    s.onload = async () => {
      const images = document.getElementsByTagName("img");
      const worker = Tesseract.createWorker();
      await worker.load();
      await worker.loadLanguage("eng");
      await worker.initialize("eng");
      ocrButton.innerText = "Running OCR...";

      // Iterate through all the images in the output div
      for (const img of images) {
        const altTextarea = img.parentNode.querySelector(".textarea-alt");
        // Check if the alt textarea is empty
        if (altTextarea.value === "") {
          const imageUrl = img.src;
          var {
            data: { text },
          } = await worker.recognize(imageUrl);
          altTextarea.value = text; // Set the OCR result to the alt textarea
          progressBar.value += 1;
        }
      }

      await worker.terminate();
      ocrButton.innerText = "OCR complete";
    };
  }

Use these examples to put together a single HTML page with embedded HTML and CSS and JavaScript that provides a big square which users can drag and drop a PDF file onto and when they do that the PDF has every page converted to a JPEG and shown below on the page, then OCR is run with tesseract and the results are shown in textarea blocks below each image.

I saved this prompt to a prompt.txt file and ran it using my llm-claude-3 plugin for LLM:

llm -m claude-3-opus < prompt.txt

It gave me a working initial version on the first attempt!

Here's the full transcript, including my follow-up prompts and their responses. Iterating on software in this way is so much fun.

First follow-up:

Modify this to also have a file input that can be used - dropping a file onto the drop area fills that input

make the drop zone 100% wide but have a 2em padding on the body. it should be 10em high. it should turn pink when an image is dragged over it.

Each textarea should be 100% wide and 10em high

At the very bottom of the page add a h2 that says Full document - then a 30em high textarea with all of the page text in it separated by two newlines

Here's the interactive result.

Rather delightfully it used the neater pattern where the file input itself is hidden but can be triggered by clicking on the large drop zone, and it updated the copy on the drop zone to reflect that - without me suggesting those requirements.

And then:

get rid of the code that shows image sizes. Set the placeholder on each textarea to be Processing... and clear that placeholder when the job is done.

Which gave me this.

I realized it would be useful if it could handle non-PDF images as well. So I fired up ChatGPT (for no reason other than curiosity to see how well it did) and got GPT-4 to add that feature for me. I pasted in the code so far and added:

Modify this so jpg and png and gif images can be dropped or opened too - they skip the PDF step and get appended to the page and OCRd directly. Also move the full document heading and textarea above the page preview and hide it u til there is data to be shown in it

Then I spotted that the Tesseract worker was being created multiple times in a loop, which is inefficient - so I prompted:

Create the worker once and use it for all OCR tasks and terminate it at the end

I'd tweaked the HTML and CSS a little before feeding it to GPT-4, so now the site had a title and rendered in Helvetica.

Here's the version GPT-4 produced for me.

Manual finishing touches

Fun though it was iterating on this project entirely through prompting, I decided it would be more productive to make the finishing touches myself. You can see those in the commit history. They're not particularly interesting:

I added Plausible analytics (which I like because they use no cookies).
I added better progress indicators, including the text that shows how many pages of the PDF have been processed so far.
I bumped up the width of the rendered PDF page images from 800 to 1000. This seemed to improve OCR quality - in particular, the Claude 3 model card PDF now has fewer OCR errors than it did before.
I upgraded both Tesseract.js and PDF.js to the most recent versions. Unsurprisingly, Claude 3 Opus had used older versions of both libraries.
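To give a sense of what that width change means in pixels, here is the scaling calculation from the PDF rendering code sketched in Python. The 612 x 792 page size is an assumption (a US Letter page at scale 1), and the function name is hypothetical.

```python
def scaled_size(page_width: float, page_height: float, desired_width: int):
    """Return (scale, width, height) for rendering a page at the desired pixel width."""
    scale = desired_width / page_width
    return scale, desired_width, round(page_height * scale)


if __name__ == "__main__":
    for width in (800, 1000):
        scale, w, h = scaled_size(612, 792, width)
        print(f"{w}x{h} at scale {scale:.2f} = {w * h:,} pixels")
```

Going from 800 to 1000 wide increases the pixel count by a factor of (1000/800)² - about 1.56 - so Tesseract has noticeably more detail to work with on each character.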

I'm really pleased with this project. I consider it finished - it does the job I designed it to do and I don't see any need to keep on iterating on it. And because it's all static JavaScript and WebAssembly I expect it to continue working effectively forever.

Update: OK, a few more features: I added language selection, paste support and some basic automated tests using Playwright Python.

Tags: data-journalism, ocr, pdf, projects, tesseract, webassembly, ai-assisted-programming

unstructured

2024-02-02T02:47:15+00:00

unstructured

Relatively new but impressively capable Python library (Apache 2 licensed) for extracting information from unstructured documents, such as PDFs, images, Word documents and many other formats.

I got some good initial results against a PDF by running “pip install 'unstructured[pdf]'” and then using the “unstructured.partition.pdf.partition_pdf(filename)” function.

There are a lot of moving parts under the hood: pytesseract, OpenCV, various PDF libraries, even an ONNX model—but it installed cleanly for me on macOS and worked out of the box.

Tags: ocr, pdf, python, onnx

Our search for the best OCR tool in 2023, and what we found

2023-10-31T19:21:53+00:00

Our search for the best OCR tool in 2023, and what we found

DocumentCloud’s Sanjin Ibrahimovic reviews the best options for OCR. Tesseract scores highly for easily machine readable text, newcomer docTR is great for ease of use but still not great at handwriting. Amazon Textract is great for everything except non-Latin languages, Google Cloud Vision is great at pretty much everything except for ease-of-use. Azure AI Document Intelligence sounds worth considering as well.

Tags: documentcloud, ocr

How I make annotated presentations

2023-08-06T17:15:33+00:00

Giving a talk is a lot of work. I go by a rule of thumb I learned from Damian Conway: a minimum of ten hours of preparation for every one hour spent on stage.

If you're going to put that much work into something, I think it's worth taking steps to maximize the value that work produces - both for you and for your audience.

One of my favourite ways of getting "paid" for a talk is when the event puts in the work to produce a really good video of that talk, and then shares that video online. North Bay Python is a fantastic example of an event that does this well: they team up with Next Day Video and White Coat Captioning and have talks professionally recorded, captioned and uploaded to YouTube within 24 hours of the talk being given.

Even with that quality of presentation, I don't think a video on its own is enough. My most recent talk was 40 minutes long - I'd love people to watch it, but I myself watch very few 40m long YouTube videos each year.

So I like to publish my talks with a text and image version of the talk that can provide as much of the value as possible to people who don't have the time or inclination to sit through a 40m talk (or 20m if you run it at 2x speed, which I do for many of the talks I watch myself).

Annotated presentations

My preferred format for publishing these documents is as an annotated presentation - a single document (no clicking "next" dozens of times) combining key slides from the talk with custom written text to accompany each one, plus additional links and resources.

Here's my most recent example: Catching up on the weird world of LLMs, from North Bay Python last week.

More examples (see also my annotated-talks tag):

Prompt injection explained, with video, slides, and a transcript for a LangChain webinar in May 2023.
Coping strategies for the serial project hoarder for DjangoCon US 2022.
How to build, test and publish an open source Python library for PyGotham 2021
Video introduction to Datasette and sqlite-utils for FOSDEM February 2021
Datasette—an ecosystem of tools for working with small data for PyGotham 2020.
Personal Data Warehouses: Reclaiming Your Data for the GitHub OCTO speaker series in November 2020.
Redis tutorial for NoSQL Europe 2010 (my first attempt at this format).

I don't tend to write a detailed script for my talks in advance. If I did, I might use that as a starting point, but I usually prepare the outline of the talk and then give it off-the-cuff on the day. I find this fits my style (best described as "enthusiastic rambling") better.

Instead, I'll assemble notes for each slide from re-watching the video after it has been released.

I don't just cover the things I said in the talk - I'll also add additional context, and links to related resources. The annotated presentation isn't just for people who didn't watch the talk, it's aimed at providing extra context for people who did watch it as well.

A custom tool for building annotated presentations

For this most recent talk I finally built something I've been wanting for years: a custom tool to help me construct the annotated presentation as quickly as possible.

Annotated presentations look deceptively simple: each slide is an image and one or two paragraphs of text.

There are a few extra details though:

The images really need good alt= text - a big part of the information in the presentation is conveyed by those images, so they need to have good descriptions both for screen reader users and to index in search engines / for retrieval augmented generation.
Presentations might have dozens of slides - just assembling the image tags in the correct order can be a frustrating task.
For editing the annotations I like to use Markdown, as it's quicker to write than HTML. Making this as easy as possible encourages me to add more links, bullet points and code snippets.

One of my favourite use-cases for tools like ChatGPT is to quickly create one-off custom tools. This was a perfect fit for that.

You can see the tool I created here: Annotated presentation creator (source code here).

The first step is to export the slides as images, being sure to have filenames which sort alphabetically in the correct order. I use Apple Keynote for my slides and it has an "Export" feature which does this for me.

Next, open those images using the annotation tool.

The tool is written in JavaScript and works entirely in your browser - it asks you to select images but doesn't actually upload them to a server, just displays them directly inline in the page.

Anything you type in a textarea as work-in-progress will be saved to localStorage, so a browser crash or restart shouldn't lose any of your work.

It uses Tesseract.js to run OCR against your images, providing a starting point for the alt= attributes for each slide.

Annotations can be entered in Markdown and are rendered to HTML as a live preview using the Marked library.

Finally, it offers a templating mechanism for the final output, which works using JavaScript template literals. So once you've finished editing the alt= text and writing the annotations, click "Execute template" at the bottom of the page and copy out the resulting HTML.
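The tool itself does this with JavaScript template literals. As a rough illustration of the same idea in Python, here is a sketch using `string.Template` as a stand-in - the template markup, dictionary keys and function name are all hypothetical rather than taken from the tool.

```python
import html
from string import Template

# One block of HTML per slide; $src, $alt and $annotation are filled in per slide
SLIDE_TEMPLATE = Template(
    '<div class="slide">\n'
    '  <img src="$src" alt="$alt">\n'
    "  $annotation\n"
    "</div>"
)


def render_slides(slides: list[dict]) -> str:
    """Render every slide through the template and join the results."""
    return "\n".join(
        SLIDE_TEMPLATE.substitute(
            # Escape attribute values so quotes in OCR text can't break the markup
            src=html.escape(slide["src"], quote=True),
            alt=html.escape(slide["alt"], quote=True),
            # The annotation is assumed to already be HTML rendered from Markdown
            annotation=slide["annotation_html"],
        )
        for slide in slides
    )
```

Escaping matters here because OCR output frequently contains quote characters, which would otherwise terminate the alt attribute early.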

Here's an animated GIF demo of the tool in action:

I ended up putting this together with the help of multiple different ChatGPT sessions. You can see those here:

Cleaning up the transcript with Claude

Since the video was already up on YouTube when I started writing the annotations, I decided to see if I could get a head start on writing them using the YouTube generated transcript.

I used my Action Transcription tool to extract the transcript, but it was pretty low quality - you can see a copy of it here. A sample:

okay hey everyone it's uh really
exciting to be here so yeah I call this
court talk catching up on the weird
world of llms I'm going to try and give
you the last few years of of llm
developments in 35 minutes this is
impossible so uh hopefully I'll at least
give you a flavor of some of the weirder
corners of the space because the thing
about language models is the more I look
at the more I think they're practically
interesting any particular aspect of
them anything at all if you zoom in
there are just more questions there are
just more unknowns about it there are
more interesting things to get into lots
of them are deeply disturbing and
unethical lots of them are fascinating
it's um I've called it um it's it's
impossible to tear myself away from this
I I just keep on keep on finding new
aspects of it that are interesting

It's basically one big run-on sentence, with no punctuation, little capitalization and lots of umms and ahs.

Anthropic's Claude 2 was released last month and supports up to 100,000 tokens per prompt - a huge improvement on ChatGPT (4,000) and GPT-4 (8,000). I decided to see if I could use that to clean up my transcript.

I pasted it into Claude and tried a few prompts... until I hit upon this one:

Reformat this transcript into paragraphs and sentences, fix the capitalization and make very light edits such as removing ums

This worked really, really well! Here's the first paragraph it produced, based on the transcript I show above:

Okay everyone, it's really exciting to be here. Yeah I call this talk "Catching Up on the Weird World of LLMs." I'm going to try and give you the last few years of LLMs developments in 35 minutes. This is impossible, so hopefully I'll at least give you a flavor of some of the weirder corners of the space. The thing about language models is the more I look at them, the more I think they're practically interesting. Focus on any particular aspect, and there are just more questions, more unknowns, more interesting things to get into.

Note that I said "fractally interesting", not "practically interesting" - but that error was there in the YouTube transcript, so Claude picked it up from there.

Here's the full generated transcript.

It's really impressive! At one point it even turns my dialogue into a set of bullet points:

Today the best are ChatGPT (aka GPT-3.5 Turbo), GPT-4 for capability, and Claude 2 which is free. Google has PaLM 2 and Bard. Llama and Claude are from Anthropic, a splinter of OpenAI focused on ethics. Google and Meta are the other big players.

Some tips:

OpenAI models cutoff at September 2021 training data. Anything later isn't in there. This reduces issues like recycling their own text.

Claude and Palm have more recent data, so I'll use them for recent events.

Always consider context length. GPT has 4,000 tokens, GPT-4 has 8,000, Claude 100,000.

If a friend who read the Wikipedia article could answer my question, I'm confident feeding it in directly. The more obscure, the more likely pure invention.

Avoid superstitious thinking. Long prompts that "always work" are usually mostly pointless.

Develop an immunity to hallucinations. Notice signs and check answers.

Compare that to my rambling original to see quite how much of an improvement this is.

But, all of that said... I specified "make very light edits" and it clearly did a whole lot more than just that.

I didn't use the Claude version directly. Instead, I copied and pasted the chunks of it that made the most sense into my annotation tool, then directly edited them to better fit what I was trying to convey.

As with so many things in LLM/AI land: a significant time saver, but no silver bullet.

For workshops, publish the handout

I took the Software Carpentries instructor training a few years ago, which was a really great experience.

A key idea I got from that is that a great way to run a workshop is to prepare an extensive, detailed handout in advance - and then spend the actual workshop time working through that handout yourself, at a sensible pace, in a way that lets the attendees follow along.

A bonus of this approach is that it forces you to put together a really high quality handout which you can distribute after the event.

I used this approach for the 3 hour workshop I ran at PyCon US 2023: Data analysis with SQLite and Python. I turned that into a new official tutorial on the Datasette website, accompanied by the video but also useful for people who don't want to spend three hours watching me talk!

More people should do this

I'm writing this in the hope that I can inspire more people to give their talks this kind of treatment. It's not a zero amount of work - it takes me 2-3 hours any time I do this - but it greatly increases the longevity of the talk and ensures that the work I've already put into it provides maximum value, both to myself (giving talks is partly a selfish act!) and to the people I want to benefit from it.

Tags: alt-text, localstorage, ocr, projects, speaking, my-talks, tools, ai, generative-ai, llms, ai-assisted-programming, anthropic, claude, annotated-talks

textra

2023-03-23T21:08:51+00:00

textra

Tiny (432KB) macOS binary CLI tool by Dylan Freedman which produces high quality text extraction from PDFs, images and even audio files using the VisionKit APIs in macOS 13 and higher. It handles handwriting too!

Via Alex Garcia

Tags: audio, cli, macos, ocr, pdf

Building a searchable archive for the San Francisco Microscopical Society

2022-08-25T17:24:50+00:00

The San Francisco Microscopical Society was founded in 1870 by a group of scientists dedicated to advancing the field of microscopy.

Today the society is a non-profit run entirely by volunteers (they welcome new members).

This year they've been working with the UC Berkeley Bancroft Library to digitize their archives, which reach back to the founding of the organization.

Hep Svadja, Jenny Jacox and Ariel Waldman have taken the lead on this effort, resulting in a wealth of scanned PDF files covering 150 years of microscope history.

I've been helping out with the project running OCR against the archives (using Amazon Textract via my s3-ocr tool, which I built for this project) and turning them into a searchable website using Datasette.

The result is now live! Take a look at archive.sfmicrosociety.org.

Try running a search for newton to see the search feature in action.

You can also browse by folder structure or by individual document.

A neat thing about using AWS Textract is that it is significantly better at reading 19th century handwritten documents than I am.

Each document page is shown with the OCRd text alongside the original scan, so you can see that for yourself on this page.

How it works

The website itself is a custom Datasette instance hosted on Vercel - deployed using the datasette-publish-vercel plugin.

It's an implementation of the Baked Data architectural pattern, where the database itself is packaged up as a SQLite file and included as a binary file in the application that is deployed to Vercel.

You can directly browse and query the underlying database (a 12.4MB SQLite file) using the default Datasette interface at archive.sfmicrosociety.org/sfms.

The database schema (excluding the tables that enable full-text search) looks like this:

CREATE TABLE [documents] (
   [id] TEXT PRIMARY KEY,
   [title] TEXT,
   [path] TEXT,
   [etag] TEXT
)

CREATE TABLE [pages] (
   [id] TEXT PRIMARY KEY,
   [document_id] TEXT,
   [page] INTEGER,
   [text] TEXT,
   FOREIGN KEY([document_id]) REFERENCES [documents]([id])
)

As you can see, the site consists of documents and pages. Documents have a path (the directory structure), a title (the filename) and an ID. Pages belong to documents and have a text column and a page number. It's a pretty simple schema!
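A minimal sketch of how the two tables relate, using Python's built-in sqlite3 module against an in-memory database. The sample rows and the pages_for_document() helper are made up for illustration and are not part of the site's code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE [documents] (
       [id] TEXT PRIMARY KEY,
       [title] TEXT,
       [path] TEXT,
       [etag] TEXT
    );
    CREATE TABLE [pages] (
       [id] TEXT PRIMARY KEY,
       [document_id] TEXT,
       [page] INTEGER,
       [text] TEXT,
       FOREIGN KEY([document_id]) REFERENCES [documents]([id])
    );
    """
)

# Made-up sample rows, purely for illustration
conn.execute(
    "insert into documents values (?, ?, ?, ?)",
    ("abc12345", "minutes.pdf", "PUBLIC/minutes.pdf", "abc12345ffff"),
)
conn.executemany(
    "insert into pages values (?, ?, ?, ?)",
    [
        ("abc12345-2", "abc12345", 2, "Second page text"),
        ("abc12345-1", "abc12345", 1, "First page text"),
    ],
)


def pages_for_document(conn, document_id):
    """Hypothetical helper: return (page, text) rows for one document, in page order."""
    return conn.execute(
        "select page, text from pages where document_id = ? order by page",
        (document_id,),
    ).fetchall()


pages = pages_for_document(conn, "abc12345")
```

Page IDs here follow a document-ID-plus-page-number shape, which keeps them unique across the whole table.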

The public pages on the site are all built using custom templates and views. You can browse the source code in the simonw/sfms-history GitHub repository.

In particular, the custom pages are implemented using a one-off site plugin (in the plugins/ folder) that adds extra routes to Datasette using the register_routes() plugin hook, like this:

@hookimpl
def register_routes():
    return [
        (r"^/docs$", docs),
        (r"^/docs/(?P<document_id>[a-z0-9]+)$", document),
        (r"^/docs/(?P<document_id>[a-z0-9]+)/(?P<page>\d+)/?$", page),
        (r"^/folders/(?P<folder>.*)$", folder),
    ]

docs, document, page and folder are Python view functions that take a request and return a custom response.
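To see how those patterns carve up incoming paths, here is a small standalone sketch using Python's re module. The resolve() function is a simplified stand-in written for this illustration, not Datasette's actual dispatch code. In a real view function the named groups should be available through request.url_vars.

```python
import re

# The same patterns as in register_routes(), paired with view names
ROUTES = [
    (r"^/docs$", "docs"),
    (r"^/docs/(?P<document_id>[a-z0-9]+)$", "document"),
    (r"^/docs/(?P<document_id>[a-z0-9]+)/(?P<page>\d+)/?$", "page"),
    (r"^/folders/(?P<folder>.*)$", "folder"),
]


def resolve(path):
    """Return (view_name, captured_groups) for the first matching pattern, or None."""
    for pattern, name in ROUTES:
        match = re.match(pattern, path)
        if match:
            return name, match.groupdict()
    return None


print(resolve("/docs/abc12345/3"))
print(resolve("/folders/PUBLIC/Scientific Documents"))
```

Captured values always arrive as strings, so a view that needs the page number as an integer has to convert it itself.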

Here's the docs function for example, which powers the listing of all documents at archive.sfmicrosociety.org/docs:

async def docs(datasette, request):
    db = datasette.get_database("sfms")
    documents = [
        to_document(doc)
        for doc in await db.execute(
            """
    select documents.*, count(*) as num_pages
    from pages join documents on pages.document_id = documents.id
    group by documents.id
    order by path
    """
        )
    ]
    return Response.html(
        await datasette.render_template("docs.html", {"documents": documents}, request)
    )

And here's the docs.html template it uses.

The index.html template which implements both the homepage and the search results page is worth a look too. It takes a slightly different approach, using the extra_template_vars() plugin hook to populate extra custom template variables that can be used by that page to loop through the search results.

I built that page first, before adding custom routes for the other pages - I should probably refactor that to use register_routes() instead for consistency.

OCRing the documents

The documents started out in a Google Drive. I used the Rclone tool to sync that to an S3 bucket, and then ran my s3-ocr tool against specific folders within that bucket to run them through AWS Textract.

Then I ran the s3-ocr index command to suck down the OCR results and build those into a SQLite database.

The resulting schema wasn't quite right for this particular project. I wrote this build-db.sh script to construct a schema specifically designed for the archive, copy the data in and then enable full-text search on the resulting tables.

The SQLite database file containing the initial OCR results is called index.db. I used the following commands to attach that index.db file to the same connection as the sfms.db database and copy and re-shape the records that I wanted:

# Populate documents
sqlite-utils sfms.db --attach index2 index.db "$(cat <<EOF
insert into documents select
  substr(s3_ocr_etag, 2, 8) as id,
  key as title,
  key as path,
  replace(s3_ocr_etag, '"', '') as etag
from
  index2.ocr_jobs
where
  key in (
    select path from index2.pages
    where (
      folder like 'INTAKE/%'
      or folder like 'PUBLIC/%'
    )
    and folder not like '%PROCESSED INTAKE DOCUMENTS/%'
  )
EOF
)"

# Populate pages
sqlite-utils sfms.db --attach index2 index.db "$(cat <<EOF
insert into pages select distinct
  substr(s3_ocr_etag, 2, 8) || '-' || page as id,
  substr(s3_ocr_etag, 2, 8) as document_id,
  page,
  text
from index2.pages
  join index2.ocr_jobs
    on index2.pages.path = index2.ocr_jobs.key
where
  (
    folder like 'INTAKE/%'
    or folder like 'PUBLIC/%'
  )
  and folder not like '%PROCESSED INTAKE DOCUMENTS/%'
EOF
)"

This script is run by GitHub Actions as part of the workflow that deploys the site to Vercel.
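The ID derivation in those queries is easy to miss. SQLite's substr() is 1-indexed, so substr(s3_ocr_etag, 2, 8) skips a leading double quote and keeps the next eight characters. The sketch below checks that with Python's sqlite3 module. It assumes the stored ETag is wrapped in double quotes, which is what the replace() call in the query suggests. The ETag value is only an example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Example value only: an S3-style ETag including its surrounding double quotes
etag = '"d79af487579dcbbef26c9b3be763eb5e-2"'

# Same expressions as the documents query: 8-character ID and unquoted ETag
doc_id, clean_etag = conn.execute(
    "select substr(?, 2, 8), replace(?, '\"', '')", (etag, etag)
).fetchone()

# Same expression as the pages query: document ID, a hyphen, then the page number
page_id = conn.execute("select ? || '-' || ?", (doc_id, 3)).fetchone()[0]

print(doc_id, clean_etag, page_id)
```

Eight hex characters keep the URLs short. Two documents sharing the same first eight ETag characters would collide on the primary key, so a much larger archive might want longer IDs.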

Images of pages

The site itself makes use of the OCRd text that is stored in the database - that's how the search feature works.

But the scanned pages are visually so much more interesting than that!

I wanted to display those page images both as thumbnails and larger images, both to add visual appeal and because OCRd text loses a lot of the richness of the underlying records.

Since I already had the PDFs in an S3 bucket, the easiest way to build this was using imgix.

imgix offers a proxy service for running URL-defined transforms against images. I've used it on a bunch of projects before: it's really powerful.

In this case, given an imgix source configured against an S3 bucket, the URL for an image of a specific page of a PDF can be constructed like this (newlines added for readability):

https://sfms-history.imgix.net/
  PUBLIC/Scientific%20Documents/
  MicroscopicGymnastics_J.EdwardsSmith-04-06-1876.pdf
  ?page=1
  &w=200
  &auto=format
  &s=634c00249fbe4a2eda90f00be0ae66d6

Here we're providing the path to the PDF in the S3 bucket, requesting a render of page 1 of that PDF, at width 200px and using auto=format which causes imgix to serve the image in an optimized format based on the capabilities of the user's device.

The &s= parameter is a signature which can be configured to prevent people from constructing their own arbitrary URLs. In this case the S3 bucket has some files in it that are not part of the public archive, so using a signature prevents people from guessing filenames and sniffing around in the bucket contents.
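Here is a rough Python sketch of building a signed URL like that. The signing step is an assumption: it takes the signature to be the MD5 hex digest of the source's secure token, followed by the encoded path, a question mark and the query string. Check the imgix documentation before relying on it. The function name and the token value are hypothetical.

```python
import hashlib
from urllib.parse import quote, urlencode


def signed_imgix_url(domain, token, path, params):
    """Build an imgix-style URL with an s= signature (signing scheme is an assumption)."""
    # Percent-encode the path (spaces become %20) while keeping the slashes
    encoded_path = "/" + quote(path.lstrip("/"))
    query = urlencode(params)
    # Assumed scheme: MD5 of token + path + "?" + query
    signature_base = f"{token}{encoded_path}?{query}"
    signature = hashlib.md5(signature_base.encode("utf-8")).hexdigest()
    return f"https://{domain}{encoded_path}?{query}&s={signature}"


url = signed_imgix_url(
    "sfms-history.imgix.net",
    "not-the-real-token",  # placeholder, the real token stays secret
    "PUBLIC/Scientific Documents/MicroscopicGymnastics_J.EdwardsSmith-04-06-1876.pdf",
    {"page": 1, "w": 200, "auto": "format"},
)
print(url)
```

Because the signature covers the path and every parameter, changing the page number or width without the token produces a URL that no longer validates.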

s3-ocr elsewhere

I built s3-ocr for this specific project, but it's already starting to see use for other projects. My favourite example so far is the work Philip James has been doing to deploy search engines against OCRd meeting minutes for the cities of Alameda and Oakland (and soon more) - he wrote about that in detail in Digitizing 55,000 pages of civic meetings.

If you have an interesting archive of PDFs that need to be OCRd I would love to hear from you! Contact me on Twitter or email me at swillison at Google's webmail provider.

Also this week

I got very distracted by Analyzing ScotRail audio announcements with Datasette - which did at least produce a new, detailed tutorial describing my process for approaching new projects with Datasette.

I added Plugin support to Datasette Lite.

Releases this week

datasette-multiline-links: 0.1 - 2022-08-24
Make multiple newline separated URLs clickable in Datasette
datasette-publish-vercel: 0.14.2 - (23 releases total) - 2022-08-24
Datasette plugin for publishing data using Vercel
datasette-edit-schema: 0.5.1 - (10 releases total) - 2022-08-22
Datasette plugin for modifying table schemas
pocket-to-sqlite: 0.2.2 - (4 releases total) - 2022-08-22
Create a SQLite database containing data from your Pocket account
datasette-mp3-audio: 0.2 - (2 releases total) - 2022-08-20
Turn .mp3 URLs into an audio player in the Datasette interface
sqlite-diffable: 0.5 - (6 releases total) - 2022-08-18
Tools for dumping/loading a SQLite database to diffable directory structure
datasette-sentry: 0.2 - (5 releases total) - 2022-08-14
Datasette plugin for configuring Sentry
datasette: 0.62 - (113 releases total) - 2022-08-14
An open source multi-tool for exploring and publishing data
s3-credentials: 0.13 - (14 releases total) - 2022-08-12
A tool for creating credentials for accessing S3 buckets

TIL this week

Tags: ocr, pdf, projects, datasette, weeknotes

Digitizing 55,000 pages of civic meetings

2022-08-22T16:26:04+00:00

Digitizing 55,000 pages of civic meetings

Philip James has been building public, searchable archives of city council meetings for various cities - Oakland and Alameda so far - using my s3-ocr script to run Textract OCR against the PDFs of the minutes, and deploying them to Fly using Datasette. This is a really cool project, and very much the kind of thing I’ve been hoping to support with the tools I’ve been building.

Via @phildini

Tags: archiving, ocr, political-hacking, datasette, fly

Litestream backups for Datasette Cloud (and weeknotes)

2022-08-11T17:19:52+00:00

My main focus this week has been adding robust backups to the forthcoming Datasette Cloud.

Datasette Cloud is a SaaS service for Datasette. It allows people to create a private Datasette instance where they can upload data, visualize and transform it and share it with other members of their team. You can join the waiting list to try it out using this form.

I'm building Datasette Cloud on Fly, specifically on Fly Machines.

Security is a big concern for Datasette Cloud. Teams should only be able to access their own data - bugs where users accidentally (or maliciously) access data for another team should be protected against as much as possible.

To help guarantee that, I've designed Datasette Cloud so that each team gets their own, dedicated instance, running in a Firecracker VM managed by Fly. Their data lives in a dedicated volume.

Fly volumes already implement snapshot backups, but I'm interested in defence in depth. This is where Litestream comes in (coincidentally now part of Fly, although it wasn't when I first selected it as my backup strategy).

I'm using Litestream to constantly backup the data for each Datasette Cloud team to an S3 bucket. In the case of a complete failure of a volume, I can restore data from a backup that should be at most a few seconds out of date. Litestream also gives me point-in-time backups, such that I can recover a previous version of the data within a configurable retention window.

Keeping backups isolated

Litestream works by writing a constant stream of pages from SQLite's WAL (Write-Ahead Log) up to an S3 bucket. It needs the ability to both read and write from S3.

This requires making S3 credentials available within the containers that run Datasette and Litestream for each team account.

Credentials in those containers are not visible to the users of the software, but I still wanted to be confident that if the credentials leaked in some way the isolation between teams would be maintained.

Initially I thought about having a separate S3 bucket for each team, but it turns out AWS has a default limit of 100 buckets per account, and a hard limit of 1,000. I aspire to have more than 1,000 customers, so this limit makes a bucket-per-team seem like the wrong solution.

I've learned an absolute ton about S3 and AWS permissions building my s3-credentials tool for creating credentials for accessing S3.

One of the tricks I've learned is that it's possible to create temporary, time-limited credentials that only work for a prefix (effectively a folder) within an S3 bucket.

This means I can run Litestream with credentials that are specific to the team - that can read and write only from the team-ID/ prefix in the S3 bucket I am using to store the backups.

Obtaining temporary credentials

My s3-credentials tool can create credentials for a prefix within an S3 bucket like this:

s3-credentials create my-bucket-for-backups \
  --duration 12h \
  --prefix team-56/

This command uses the sts.assume_role() AWS method to create credentials that allow access to that bucket, attaching this generated JSON policy to it in order to restrict access to the provided prefix.
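To give a feel for what a prefix-restricting policy can look like, here is an illustrative sketch. It is not the exact JSON that s3-credentials generates, and the prefix_policy() function and the chosen actions are assumptions made for this example. A document like this can be serialized with json.dumps() and passed as the Policy argument to sts.assume_role().

```python
import json


def prefix_policy(bucket, prefix):
    """Hypothetical helper: IAM policy limiting access to one prefix in a bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is granted on the bucket, narrowed to keys under the prefix
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
            },
            {
                # Object reads and writes are granted only beneath the prefix
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}*"],
            },
        ],
    }


policy = prefix_policy("my-bucket-for-backups", "team-56/")
print(json.dumps(policy, indent=2))
```

The trailing slash in the prefix matters: without it, credentials for team-5 would also match keys under team-56/.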

I extracted the relevant Python code from s3-credentials and used it to create a private API endpoint in my Datasette Cloud management server which could return the temporary credentials needed by the team container.

With the endpoint in place, my code for launching a team container can do this:

Create the volume and machine for that team (if they do not yet exist)
Generate a signed secret token that the machine container can exchange for its S3 credentials
Launch the machine container, passing it the secret token
On launch, the container runs a script which exchanges that secret token for its 12 hour S3 credentials, using the private API endpoint I created
Those credentials are used to populate the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN environment variables used by Litestream
Start Litestream, which then starts Datasette
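The token exchange in those steps could be sketched with an HMAC signature, as below. This is one plausible way to build a signed token and not necessarily how Datasette Cloud does it. The function names, the token format and the secret are all hypothetical.

```python
import hashlib
import hmac


def sign_token(team_id, secret):
    """Management server side: issue a token that names the team and proves it was issued by us."""
    digest = hmac.new(secret, team_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{team_id}:{digest}"


def verify_token(token, secret):
    """Private API endpoint side: return the team ID if the signature is valid, else None."""
    team_id, _, digest = token.rpartition(":")
    expected = hmac.new(secret, team_id.encode("utf-8"), hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through comparison timing
    if team_id and hmac.compare_digest(digest, expected):
        return team_id
    return None


secret = b"management-server-secret"  # placeholder value
token = sign_token("team-56", secret)
print(verify_token(token, secret))
```

Once verify_token() returns a team ID, the endpoint would generate temporary credentials scoped to that team's prefix and nothing else, so a leaked token exposes at most one team's backups.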

Restarting every 12 hours

You may be wondering why I bothered with that initial secret token - why not just pass the temporary AWS credentials to the container when I launch it?

The reason for this is that I need to be able to obtain fresh credentials every 12 hours.

A really neat feature of Fly Machines is that they support scale-to-zero. You can stop them, and Fly will automatically restart them the next time they receive traffic.

All you need to do is call sys.exit(0) in your Python code (or the equivalent in any other language) and Fly will stop your container... and then restart it again with a couple of seconds of cold start time the next time an HTTP request for your container hits the Fly router.

So far I'm mainly using this to avoid the cost of running containers when they aren't actually in use. But there's a neat benefit when it comes to Litestream too.

I'm using S3 credentials which expire after 12 hours. This means I need to periodically refresh the credentials and restart Litestream or it will stop being able to write to the S3 bucket.

After considering a few ways of doing this, I selected the simplest to implement: have Datasette call sys.exit(0) after ten hours, and let Fly restart the container causing my startup script to fetch freshly generated 12 hour credentials and pass them to Litestream.

I implemented this by adding it as a new setting to my existing datasette-scale-to-zero plugin. You can now configure that with "max-age": "10h" and it will shut down Datasette once the server has been running for that long.
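The logic behind a setting like that is small. The following is a hypothetical sketch and not the plugin's actual code: it assumes durations are written as a number followed by s, m or h, and both function names are invented for this example.

```python
import re
import time

# Assumed duration units: seconds, minutes, hours
UNITS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(value):
    """Convert a string such as '10h' into a number of seconds."""
    match = re.fullmatch(r"(\d+)([smh])", value)
    if match is None:
        raise ValueError(f"Invalid duration: {value!r}")
    return int(match.group(1)) * UNITS[match.group(2)]


def should_exit(started_at, max_age, now=None):
    """Return True once the server has been running for at least max_age."""
    now = time.monotonic() if now is None else now
    return now - started_at >= parse_duration(max_age)


started_at = time.monotonic()
print(parse_duration("10h"), should_exit(started_at, "10h"))
```

A periodic background check that calls sys.exit(0) when should_exit() returns True would then hand control back to Fly. Ten hours leaves a two hour margin before the 12 hour credentials expire.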

Why does this require my own secret token system? Because when the container is restarted, it needs to make an authenticated call to my endpoint to retrieve those fresh S3 credentials. Fly persists environment variable secrets between restarts to the container, so that secret can be long-lived even while it is exchanged for short-term S3 credentials.

I only just put the new backup system in place, so I'm exercising it a bit before I open things up to trial users - but so far it's looking like a very robust solution to the problem.

s3-ocr improvements

I released a few new versions of s3-ocr this week, as part of my ongoing project working with the San Francisco Microscopical Society team to release a searchable version of their scanned document archives.

The two main improvements are:

A new --dry-run option to s3-ocr start which shows you what the tool will do without making any changes to your S3 bucket, or triggering any OCR jobs. #22
s3-ocr start used to fail with an error if running it would create more than 100 (or 600 depending on your region) concurrent OCR jobs. The tool now knows how to identify that error and pause and retry starting the jobs instead. #21
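The pause-and-retry behaviour can be sketched generically like this. TooManyJobsError is a hypothetical stand-in for whatever error the API raises when the concurrency limit is hit, and the attempt count and delay are arbitrary values chosen for illustration, not the ones s3-ocr uses.

```python
import time


class TooManyJobsError(Exception):
    """Hypothetical stand-in for the concurrent job limit error."""


def start_with_retry(start_job, key, max_attempts=5, delay=30.0, sleep=time.sleep):
    """Call start_job(key), pausing and retrying if the concurrency limit is hit."""
    for attempt in range(1, max_attempts + 1):
        try:
            return start_job(key)
        except TooManyJobsError:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            sleep(delay)


# Usage with a fake start function that fails twice before succeeding
calls = []


def fake_start(key):
    calls.append(key)
    if len(calls) < 3:
        raise TooManyJobsError()
    return "job-1"


pauses = []
result = start_with_retry(fake_start, "a.pdf", sleep=pauses.append)
print(result, pauses)
```

Passing the sleep function in as a parameter makes the retry logic testable without actually waiting.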

The fix that took the most time is this: installations of the tool no longer arbitrarily fail to work depending on the environment you install them into!

Solving this took me the best part of a day. The short version is this: Click 8.1.0 introduced a new feature that lets you use @cli.command as a decorator instead of @cli.command(). This meant that installing s3-ocr in an environment that already had a previous version of Click would result in silent errors.

The solution is simple: pin to click>=8.1.0 in the project dependencies if you plan to use this new syntax.

If I'd read the Click changelog more closely I would have saved myself a whole lot of time.

Issues #25 and #26 detail the many false turns I took trying to figure this out.

More fun with GPT-3 and DALL-E

This tweet scored over a million impressions on Twitter:

New hobby: prototyping video games in 60 seconds using a combination of GPT-3 and DALL-E

Here's "Raccoon Heist" pic.twitter.com/xQ3Vm8p2XW
- Simon Willison (@simonw) August 5, 2022

As this got retweeted outside of my usual circles it started confusing people who thought the "prototype" was a working game, as opposed to a fake screenshot and a paragraph of descriptive text! I wasn't kidding when I said I spent 60 seconds on this.

I also figured out how to use GPT-3 to write jq one-liners. I love jq but I have to look up how to use it every time, so having GPT-3 do the work for me is a pretty neat time saver. More on that in this TIL: Using GPT-3 to figure out jq recipes

Releases this week

s3-ocr: 0.6.3 - (9 releases total) - 2022-08-10
Tools for running OCR against files stored in S3
datasette-scale-to-zero: 0.2 - (4 releases total) - 2022-08-05
Quit Datasette if it has not received traffic for a specified time period
shot-scraper: 0.14.3 - (18 releases total) - 2022-08-02
A command-line utility for taking automated screenshots of websites
s3-credentials: 0.12.1 - (13 releases total) - 2022-08-01
A tool for creating credentials for accessing S3 buckets
datasette-sqlite-fts4: 0.3.2 - (2 releases total) - 2022-07-31

TIL this week

Tags: ocr, s3, datasette, weeknotes, datasette-cloud, fly, litestream, gpt-3, dalle, gpt

s3-ocr: Extract text from PDF files stored in an S3 bucket

2022-06-30T21:40:27+00:00

I've released s3-ocr, a new tool that runs Amazon's Textract OCR text extraction against PDF files in an S3 bucket, then writes the resulting text out to a SQLite database with full-text search configured so you can run searches against the extracted data.

You can search through a demo of 697 pages of OCRd text at s3-ocr-demo.datasette.io/pages/pages.

Textract works extremely well: it handles dodgy scanned PDFs full of typewritten code and reads handwritten text better than I can! It charges $1.50 per thousand pages processed.

Why I built this

My initial need for this is a collaboration I have running with the San Francisco Microscopical Society. They've been digitizing their archives - which stretch back to 1870! - and were looking for help turning the digital scans into something more useful.

The archives are full of hand-written and type-written notes, scanned and stored as PDFs.

I decided to wrap my work up as a tool because I'm sure there are a LOT of organizations out there with a giant bucket of PDF files that would benefit from being able to easily run OCR and turn the results into a searchable database.

Running Textract directly against large numbers of files is somewhat inconvenient (here's my earlier TIL about it). s3-ocr is my attempt to make it easier.

Tutorial: How I built that demo

The demo instance uses three PDFs from the Library of Congress Harry Houdini Collection on the Internet Archive:

I started by downloading PDFs of those three files.

Then I installed the two tools I needed:

pip install s3-ocr s3-credentials

I used my s3-credentials tool to create a new S3 bucket and credentials with the ability to write files to it, with the new --statement option (which I released today) to add textract permissions to the generated credentials:

s3-credentials create s3-ocr-demo --statement '{
  "Effect": "Allow",
  "Action": "textract:*",
  "Resource": "*"
}' --create-bucket > ocr.json

(Note that you don't need to use s3-credentials at all if you have AWS credentials configured on your machine with root access to your account - just leave off the -a ocr.json options in the following examples.)

s3-ocr-demo is now a bucket I can use for the demo. ocr.json contains JSON with an access key and secret key for an IAM user account that can interact with that bucket, and also has permission to access the AWS Textract APIs.

I uploaded my three PDFs to the bucket:

s3-credentials put-object s3-ocr-demo latestmagicbeing00hoff.pdf latestmagicbeing00hoff.pdf -a ocr.json
s3-credentials put-object s3-ocr-demo practicalmagicia00harr.pdf practicalmagicia00harr.pdf -a ocr.json
s3-credentials put-object s3-ocr-demo unmaskingrobert00houdgoog.pdf unmaskingrobert00houdgoog.pdf -a ocr.json

(I often use Transmit as a GUI for this kind of operation.)

Then I kicked off OCR jobs against every PDF file in the bucket:

% s3-ocr start s3-ocr-demo --all -a ocr.json 
Found 0 files with .s3-ocr.json out of 3 PDFs
Starting OCR for latestmagicbeing00hoff.pdf, Job ID: f66bc2d00fb75d1c42d1f829e5b6788891f9799fda404c4550580959f65a5402
Starting OCR for practicalmagicia00harr.pdf, Job ID: ef085728135d524a39bc037ad6f7253284b1fdbeb728dddcfbb260778d902b55
Starting OCR for unmaskingrobert00houdgoog.pdf, Job ID: 93bd46f02eb099eca369c41e384836d2bd3199b95d415c0257ef3fa3602cbef9

The --all option scans for any file with a .pdf extension. You can pass explicit file names instead if you just want to process one or two files at a time.

This returns straight away, but the OCR process itself can take several minutes depending on the size of the files.

The job IDs can be used to inspect the progress of each task like so:

% s3-ocr inspect-job f66bc2d00fb75d1c42d1f829e5b6788891f9799fda404c4550580959f65a5402
{
  "DocumentMetadata": {
    "Pages": 244
  },
  "JobStatus": "SUCCEEDED",
  "DetectDocumentTextModelVersion": "1.0"
}

Once the job completed, I could preview the text extracted from the PDF like so:

% s3-ocr text s3-ocr-demo latestmagicbeing00hoff.pdf
111
.
116

LATEST MAGIC
BEING
ORIGINAL CONJURING TRICKS
INVENTED AND ARRANGED
BY
PROFESSOR HOFFMANN
(ANGELO LEWIS, M.A.)
Author of "Modern Magic," etc.
WITH NUMEROUS ILLUSTRATIONS
FIRST EDITION
NEW YORK
SPON & CHAMBERLAIN, 120 LIBERTY ST.
...

To create a SQLite database with a table containing rows for every page of scanned text, I ran this command:

% s3-ocr index s3-ocr-demo pages.db -a ocr.json 
Fetching job details  [####################################]  100%
Populating pages table  [####--------------------------------]   13%  00:00:34

I then published the resulting pages.db SQLite database using Datasette - you can explore it here.

How s3-ocr works

s3-ocr works by calling Amazon's S3 and Textract APIs.

Textract only works against PDF files in asynchronous mode: you call an API endpoint to tell it "start running OCR against this PDF file in this S3 bucket", then wait for it to finish - which can take several minutes.

It defaults to storing the OCR results in its own storage, expiring after seven days. You can instead tell it to store them in your own S3 bucket - I use that option in s3-ocr.

A design challenge I faced was that I wanted to make the command restartable and resumable: if the user cancelled the task, I wanted to be able to pick up from where it had got to. I also wanted to be able to run it again after adding more PDFs to the bucket without repeating work for the previously processed files.

I also needed to persist those job IDs: Textract writes the OCR results to keys in the bucket called textract-output/JOB_ID/1-? - but there's no indication as to which PDF file the results correspond to.

My solution is to write tiny extra JSON files to the bucket when the OCR job is first started.

If you have a file called latestmagicbeing00hoff.pdf the start command will create a new file called latestmagicbeing00hoff.pdf.s3-ocr.json with the following content:

{
  "job_id": "f66bc2d00fb75d1c42d1f829e5b6788891f9799fda404c4550580959f65a5402",
  "etag": "\"d79af487579dcbbef26c9b3be763eb5e-2\""
}

This associates the job ID with the PDF file. It also records the original ETag of the PDF file - this is so in the future I can implement a system that can re-run OCR if the PDF has been updated.

The existence of these files lets me do two things:

If you run s3-ocr start s3-ocr-demo --all it can avoid re-submitting PDF files that have already been sent for OCR, by checking for the existence of the .s3-ocr.json file.
When you later ask for the results of the OCR it can use these files to associate the PDF with the results.
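Both uses of those sidecar files can be sketched as plain functions over bucket keys. This is an illustration of the idea and not the code in s3-ocr itself. Both function names are hypothetical, and the second one takes an ordinary dictionary in place of real S3 reads.

```python
import json

SUFFIX = ".s3-ocr.json"


def pdfs_needing_ocr(keys):
    """Return PDF keys that have no matching .s3-ocr.json sidecar file yet."""
    keys = set(keys)
    return sorted(
        key for key in keys if key.endswith(".pdf") and f"{key}{SUFFIX}" not in keys
    )


def job_ids_to_pdfs(sidecars):
    """Map Textract job IDs back to PDF keys, given {sidecar_key: json_text}."""
    return {
        json.loads(body)["job_id"]: key[: -len(SUFFIX)]
        for key, body in sidecars.items()
        if key.endswith(SUFFIX)
    }


bucket_keys = ["a.pdf", "a.pdf.s3-ocr.json", "b.pdf", "notes.txt"]
print(pdfs_needing_ocr(bucket_keys))
print(job_ids_to_pdfs({"a.pdf.s3-ocr.json": '{"job_id": "j1", "etag": "x"}'}))
```

Since the sidecar name is just the PDF key plus a fixed suffix, both lookups work from a single listing of the bucket.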

Scattering .s3-ocr.json files all over the place feels a little messy, so I have an open issue considering moving them all to a s3-ocr/ prefix in the bucket instead.

Try it and let me know what you think

This is a brand new project, but I think it's ready for other people to start trying it out.

I ran it against around 7,000 pages from 531 PDF files in the San Francisco Microscopical Society archive and it seemed to work well!

If you try this out and it works (or it doesn't work) please let me know via Twitter or GitHub.

A challenging example page

Here's one of the more challenging pages I processed using Textract:

Here's the result:

In. In J a ... the Joe 14
162
Volxv
Lalpa spinosa, Eggt bud development. of
146
Farcomas spindle. cells in nested gowers 271
Fayigaga tridactylites, leaf glaur of ruce 33
staining & mounting
Stiles 133
tilica films, a structure of Diatoins morehouse 38
thile new microscopic
Broeck 22 /
Smith reproduction in the huntroom tribe
6
Trakes, develop mouht succession of the porsion tango/229
Soirce President of the Roy: truc: Soo
285
forby, Presidents address
105
pongida, difficulties of classification
238
tage, american adjustable concentric
150
ttlese staining & mountring wood sections 133
Stodder, Frustulia Iasconica, havicula
chomboides, & havi cula crassinervis 265
Vol XVI
falicylic acid u movorcopy
160
falpar enctry ology of
Brooke 9.97
Sanderson micros: characters If inflammation
43
tap, circulation of the
42
Jars, structure of the genus Brisinga
44
latter throvite connective substances 191- 241
Jehorey Cessification in birds, formation
of ed blood corpuseles during the
ossification process
by

Releases this week

s3-ocr: 0.4 - (4 releases total) - 2022-06-30
Tools for running OCR against files stored in S3
s3-credentials: 0.12 - (12 releases total) - 2022-06-30
A tool for creating credentials for accessing S3 buckets
datasette-scale-to-zero: 0.1.2 - (3 releases total) - 2022-06-23
Quit Datasette if it has not received traffic for a specified time period

TIL this week

Tags: aws, ocr, pdf, projects, s3, weeknotes, s3-credentials

Organize and Index Your Screenshots (OCR) on macOS

2021-07-18T16:11:01+00:00

Alexandru Nedelcu has a very neat recipe for creating an archive of searchable screenshots on macOS: set the default save location for screenshots to a Dropbox folder, then create a launch agent that runs a script against new files in that folder, using Tesseract OCR to convert them into searchable PDFs.

Via @alexelcu

Tags: macos, ocr

Google Docs OCR

2009-09-29T21:57:59+00:00

Whoa, the Google Docs API just got really interesting - you can upload an image to it (POST /feeds/default/private/full?ocr=true) and it will OCR the text and turn it into a document. Since this is Google, I imagine they’ll also be using the processed documents to further improve their OCR technology.

Tags: apis, googldocs, google, ocr

OCR and Neural Nets in JavaScript

2009-01-25T00:00:28+00:00

John dissects the brilliant Greasemonkey script that solves simple captchas using the canvas element and HTML5’s getImageData API.

Tags: canvas, captchas, getimagedata, greasemonkey, javascript, john-resig, ocr

tesseract-ocr

2007-07-26T20:23:08+00:00

Open source OCR, sponsored by Google. I just sat in on a talk on this at OSCON and the complexity of the problem is pretty incredible.

Tags: google, ocr, oscon, oscon07, tesseract