<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: tesseract</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/tesseract.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2024-06-19T21:30:46+00:00</updated><author><name>Simon Willison</name></author><entry><title>Civic Band</title><link href="https://simonwillison.net/2024/Jun/19/civic-band/#atom-tag" rel="alternate"/><published>2024-06-19T21:30:46+00:00</published><updated>2024-06-19T21:30:46+00:00</updated><id>https://simonwillison.net/2024/Jun/19/civic-band/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://civic.band/"&gt;Civic Band&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Exciting new civic tech project from Philip James: 30 (and counting) Datasette instances serving full-text search enabled collections of OCRd meeting minutes for different civic governments. Includes &lt;a href="https://alameda.ca.civic.band/civic_minutes/pages"&gt;20,000 pages for Alameda&lt;/a&gt;, &lt;a href="https://pittsburgh.pa.civic.band/civic_minutes/pages"&gt;17,000 for Pittsburgh&lt;/a&gt;, &lt;a href="https://baltimore.md.civic.band/civic_minutes/pages"&gt;3,567 for Baltimore&lt;/a&gt; and an enormous &lt;a href="https://maui-county.hi.civic.band/civic_minutes/pages"&gt;117,000 for Maui County&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Philip includes &lt;a href="https://civic.band/how.html"&gt;some notes&lt;/a&gt; on how they're doing it. They gather PDF minute notes from anywhere that provides API access to them, then run local Tesseract for OCR (the cost of cloud-based OCR proving prohibitive given the volume of data). The collection is then deployed to a single VPS running multiple instances of Datasette via Caddy, one instance for each of the covered regions.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tesseract"&gt;tesseract&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;



</summary><category term="data-journalism"/><category term="ocr"/><category term="tesseract"/><category term="datasette"/></entry><entry><title>Running OCR against PDFs and images directly in your browser</title><link href="https://simonwillison.net/2024/Mar/30/ocr-pdfs-images/#atom-tag" rel="alternate"/><published>2024-03-30T17:59:56+00:00</published><updated>2024-03-30T17:59:56+00:00</updated><id>https://simonwillison.net/2024/Mar/30/ocr-pdfs-images/#atom-tag</id><summary type="html">
    &lt;p&gt;I attended the &lt;a href="https://biglocalnews.org/content/events/"&gt;Story Discovery At Scale&lt;/a&gt; data journalism conference at Stanford this week. One of the perennial hot topics at any journalism conference concerns data extraction: how can we best get data out of PDFs and images?&lt;/p&gt;
&lt;p&gt;I've been having some very promising results with Gemini Pro 1.5, Claude 3 and GPT-4 Vision recently - I'll write more about that soon. But those tools are still inconvenient for most people to use.&lt;/p&gt;
&lt;p&gt;Meanwhile, older tools like &lt;a href="https://github.com/tesseract-ocr/tesseract"&gt;Tesseract OCR&lt;/a&gt; are still extremely useful - if only they were easier to use as well.&lt;/p&gt;
&lt;p&gt;Then I remembered that Tesseract runs happily in a browser these days thanks to the excellent &lt;a href="https://tesseract.projectnaptha.com/"&gt;Tesseract.js&lt;/a&gt; project. And PDFs can be processed using JavaScript too thanks to Mozilla's extremely mature and well-tested &lt;a href="https://mozilla.github.io/pdf.js/"&gt;PDF.js&lt;/a&gt; library.&lt;/p&gt;
&lt;p&gt;So I built a new tool!&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://tools.simonwillison.net/ocr"&gt;tools.simonwillison.net/ocr&lt;/a&gt;&lt;/strong&gt; provides a single page web app that can run Tesseract OCR against images or PDFs that are opened in (or dragged and dropped onto) the app.&lt;/p&gt;
&lt;p&gt;Crucially, everything runs in the browser. There is no server component here, and nothing is uploaded. Your images and documents never leave your computer or phone.&lt;/p&gt;
&lt;p&gt;Here's an animated demo:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/ocr-demo.gif" alt="First an image file is dragged onto the page, which then shows that image and accompanying OCR text. Then the drop zone is clicked and a PDF file is selected - that PDF is rendered a page at a time down the page with OCR text displayed beneath each page." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;It's not perfect: multi-column PDFs (thanks, academia) will be treated as a single column, illustrations or photos may result in garbled ASCII-art and there are plenty of other edge cases that will trip it up.&lt;/p&gt;
&lt;p&gt;But... having Tesseract OCR available against PDFs in a web browser (including in Mobile Safari) is still a really useful thing.&lt;/p&gt;
&lt;h4 id="ocr-how-i-built-this"&gt;How I built this&lt;/h4&gt;
&lt;p&gt;&lt;em&gt;For more recent examples of projects I've built with the assistance of LLMs, see &lt;a href="https://simonwillison.net/2024/Mar/23/building-c-extensions-for-sqlite-with-chatgpt-code-interpreter/"&gt;Building and testing C extensions for SQLite with ChatGPT Code Interpreter&lt;/a&gt; and &lt;a href="https://simonwillison.net/2024/Mar/22/claude-and-chatgpt-case-study/"&gt;Claude and ChatGPT for ad-hoc sidequests&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I built the first version of this tool in just a few minutes, using Claude 3 Opus.&lt;/p&gt;
&lt;p&gt;I already had my own JavaScript code lying around for the two most important tasks: running Tesseract.js against an images and using PDF.js to turn a PDF into a series of images.&lt;/p&gt;
&lt;p&gt;The OCR code came from the system I built and explained in &lt;a href="https://simonwillison.net/2023/Aug/6/annotated-presentations/"&gt;How I make annotated presentations&lt;/a&gt; (built with the help of &lt;a href="https://simonwillison.net/2023/Aug/6/annotated-presentations/#chatgpt-sessions"&gt;multiple ChatGPT sessions&lt;/a&gt;). The PDF to images code was from an &lt;a href="https://gist.github.com/simonw/e58796324abb0e729b2dcd351f46728a#prompt-2"&gt;unfinished experiment&lt;/a&gt; which I wrote with the aid of Claude 3 Opus a week ago.&lt;/p&gt;
&lt;p&gt;I composed the following prompt for Claude 3, where I pasted in both of my code examples and then added some instructions about what I wanted it to build at the end:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This code shows how to open a PDF and turn it into an image per page:&lt;/p&gt;
&lt;div class="highlight highlight-text-html-basic"&gt;&lt;pre&gt;&lt;span class="pl-c1"&gt;&amp;lt;!DOCTYPE html&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;html&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;head&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;title&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;PDF to Images&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;title&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;script&lt;/span&gt; &lt;span class="pl-c1"&gt;src&lt;/span&gt;="&lt;span class="pl-s"&gt;https://cdnjs.cloudflare.com/ajax/libs/pdf.js/2.9.359/pdf.min.js&lt;/span&gt;"&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;script&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;style&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
    .image-container img {
      margin-bottom: 10px;
    }
    .image-container p {
      margin: 0;
      font-size: 14px;
      color: #888;
    }
  &lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;style&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;head&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;body&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;input&lt;/span&gt; &lt;span class="pl-c1"&gt;type&lt;/span&gt;="&lt;span class="pl-s"&gt;file&lt;/span&gt;" &lt;span class="pl-c1"&gt;id&lt;/span&gt;="&lt;span class="pl-s"&gt;fileInput&lt;/span&gt;" &lt;span class="pl-c1"&gt;accept&lt;/span&gt;="&lt;span class="pl-s"&gt;.pdf&lt;/span&gt;" /&amp;gt;
  &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;div&lt;/span&gt; &lt;span class="pl-c1"&gt;class&lt;/span&gt;="&lt;span class="pl-s"&gt;image-container&lt;/span&gt;"&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;div&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;

  &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;script&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;desiredWidth&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;800&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;fileInput&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;getElementById&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'fileInput'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;imageContainer&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelector&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'.image-container'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;

    &lt;span class="pl-s1"&gt;fileInput&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;addEventListener&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'change'&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s1"&gt;handleFileUpload&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;

    &lt;span class="pl-s1"&gt;pdfjsLib&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;GlobalWorkerOptions&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;workerSrc&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;'https://cdnjs.cloudflare.com/ajax/libs/pdf.js/2.9.359/pdf.worker.min.js'&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;

    &lt;span class="pl-k"&gt;async&lt;/span&gt; &lt;span class="pl-k"&gt;function&lt;/span&gt; &lt;span class="pl-en"&gt;handleFileUpload&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;event&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
      &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;file&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;event&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;target&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;files&lt;/span&gt;&lt;span class="pl-kos"&gt;[&lt;/span&gt;&lt;span class="pl-c1"&gt;0&lt;/span&gt;&lt;span class="pl-kos"&gt;]&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
      &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;imageIterator&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;convertPDFToImages&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;file&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;

      &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt; imageURL&lt;span class="pl-kos"&gt;,&lt;/span&gt; size &lt;span class="pl-kos"&gt;}&lt;/span&gt; &lt;span class="pl-k"&gt;of&lt;/span&gt; &lt;span class="pl-s1"&gt;imageIterator&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
        &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;imgElement&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;createElement&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'img'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
        &lt;span class="pl-s1"&gt;imgElement&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;src&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;imageURL&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
        &lt;span class="pl-s1"&gt;imageContainer&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;appendChild&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;imgElement&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;

        &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;sizeElement&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;createElement&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'p'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
        &lt;span class="pl-s1"&gt;sizeElement&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;textContent&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;`Size: &lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;${&lt;/span&gt;&lt;span class="pl-en"&gt;formatSize&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;size&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;`&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
        &lt;span class="pl-s1"&gt;imageContainer&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;appendChild&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;sizeElement&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
      &lt;span class="pl-kos"&gt;}&lt;/span&gt;
    &lt;span class="pl-kos"&gt;}&lt;/span&gt;

    &lt;span class="pl-k"&gt;async&lt;/span&gt; &lt;span class="pl-k"&gt;function&lt;/span&gt;&lt;span class="pl-c1"&gt;*&lt;/span&gt; &lt;span class="pl-s1"&gt;convertPDFToImages&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;file&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
      &lt;span class="pl-k"&gt;try&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
        &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;pdf&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-s1"&gt;pdfjsLib&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;getDocument&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-c1"&gt;URL&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;createObjectURL&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;file&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;promise&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
        &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;numPages&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;pdf&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;numPages&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;

        &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-k"&gt;let&lt;/span&gt; &lt;span class="pl-s1"&gt;i&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt; &lt;span class="pl-s1"&gt;i&lt;/span&gt; &lt;span class="pl-c1"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;numPages&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt; &lt;span class="pl-s1"&gt;i&lt;/span&gt;&lt;span class="pl-c1"&gt;++&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
          &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;page&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-s1"&gt;pdf&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;getPage&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;i&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
          &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;viewport&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;page&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;getViewport&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt; &lt;span class="pl-c1"&gt;scale&lt;/span&gt;: &lt;span class="pl-c1"&gt;1&lt;/span&gt; &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
          &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;canvas&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;createElement&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'canvas'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
          &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;context&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;canvas&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;getContext&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'2d'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
          &lt;span class="pl-s1"&gt;canvas&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;width&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;desiredWidth&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
          &lt;span class="pl-s1"&gt;canvas&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;height&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;desiredWidth&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; &lt;span class="pl-s1"&gt;viewport&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;width&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-c1"&gt;*&lt;/span&gt; &lt;span class="pl-s1"&gt;viewport&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;height&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
          &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;renderContext&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
            &lt;span class="pl-c1"&gt;canvasContext&lt;/span&gt;: &lt;span class="pl-s1"&gt;context&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
            &lt;span class="pl-c1"&gt;viewport&lt;/span&gt;: &lt;span class="pl-s1"&gt;page&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;getViewport&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt; &lt;span class="pl-c1"&gt;scale&lt;/span&gt;: &lt;span class="pl-s1"&gt;desiredWidth&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; &lt;span class="pl-s1"&gt;viewport&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;width&lt;/span&gt; &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
          &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
          &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-s1"&gt;page&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;render&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;renderContext&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;promise&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
          &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;imageURL&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;canvas&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;toDataURL&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'image/jpeg'&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;0.8&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
          &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;size&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;calculateSize&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;imageURL&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
          &lt;span class="pl-k"&gt;yield&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt; imageURL&lt;span class="pl-kos"&gt;,&lt;/span&gt; size &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
        &lt;span class="pl-kos"&gt;}&lt;/span&gt;
      &lt;span class="pl-kos"&gt;}&lt;/span&gt; &lt;span class="pl-k"&gt;catch&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;error&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
        &lt;span class="pl-smi"&gt;console&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;error&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'Error:'&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s1"&gt;error&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
      &lt;span class="pl-kos"&gt;}&lt;/span&gt;
    &lt;span class="pl-kos"&gt;}&lt;/span&gt;

    &lt;span class="pl-k"&gt;function&lt;/span&gt; &lt;span class="pl-en"&gt;calculateSize&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;imageURL&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
      &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;base64Length&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;imageURL&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;length&lt;/span&gt; &lt;span class="pl-c1"&gt;-&lt;/span&gt; &lt;span class="pl-s"&gt;'data:image/jpeg;base64,'&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;length&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
      &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;sizeInBytes&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-v"&gt;Math&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;ceil&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;base64Length&lt;/span&gt; &lt;span class="pl-c1"&gt;*&lt;/span&gt; &lt;span class="pl-c1"&gt;0.75&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
      &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-s1"&gt;sizeInBytes&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-kos"&gt;}&lt;/span&gt;

    &lt;span class="pl-k"&gt;function&lt;/span&gt; &lt;span class="pl-en"&gt;formatSize&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;size&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
      &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;sizeInKB&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;size&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;1024&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;toFixed&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-c1"&gt;2&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
      &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-s"&gt;`&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;${&lt;/span&gt;&lt;span class="pl-s1"&gt;sizeInKB&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt; KB`&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-kos"&gt;}&lt;/span&gt;
  &lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;script&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;body&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;html&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This code shows how to OCR an image:&lt;/p&gt;
&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;async&lt;/span&gt; &lt;span class="pl-k"&gt;function&lt;/span&gt; &lt;span class="pl-en"&gt;ocrMissingAltText&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
    &lt;span class="pl-c"&gt;// Load Tesseract&lt;/span&gt;
    &lt;span class="pl-k"&gt;var&lt;/span&gt; &lt;span class="pl-s1"&gt;s&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;createElement&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;"script"&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-s1"&gt;s&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;src&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"https://unpkg.com/tesseract.js@v2.1.0/dist/tesseract.min.js"&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;head&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;appendChild&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;s&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;

    &lt;span class="pl-s1"&gt;s&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;onload&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-k"&gt;async&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-c1"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
      &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;images&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;getElementsByTagName&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;"img"&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
      &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;worker&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-v"&gt;Tesseract&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;createWorker&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
      &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-s1"&gt;worker&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;load&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
      &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-s1"&gt;worker&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;loadLanguage&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;"eng"&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
      &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-s1"&gt;worker&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;initialize&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;"eng"&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
      &lt;span class="pl-s1"&gt;ocrButton&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;innerText&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"Running OCR..."&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;

      &lt;span class="pl-c"&gt;// Iterate through all the images in the output div&lt;/span&gt;
      &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;img&lt;/span&gt; &lt;span class="pl-k"&gt;of&lt;/span&gt; &lt;span class="pl-s1"&gt;images&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
        &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;altTextarea&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;img&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;parentNode&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelector&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;".textarea-alt"&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
        &lt;span class="pl-c"&gt;// Check if the alt textarea is empty&lt;/span&gt;
        &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;altTextarea&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;value&lt;/span&gt; &lt;span class="pl-c1"&gt;===&lt;/span&gt; &lt;span class="pl-s"&gt;""&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
          &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;imageUrl&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;img&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;src&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
          &lt;span class="pl-k"&gt;var&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
            &lt;span class="pl-c1"&gt;data&lt;/span&gt;: &lt;span class="pl-kos"&gt;{&lt;/span&gt; text &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
          &lt;span class="pl-kos"&gt;}&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-s1"&gt;worker&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;recognize&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;imageUrl&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
          &lt;span class="pl-s1"&gt;altTextarea&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;value&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;text&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt; &lt;span class="pl-c"&gt;// Set the OCR result to the alt textarea&lt;/span&gt;
          &lt;span class="pl-s1"&gt;progressBar&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;value&lt;/span&gt; &lt;span class="pl-c1"&gt;+=&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
        &lt;span class="pl-kos"&gt;}&lt;/span&gt;
      &lt;span class="pl-kos"&gt;}&lt;/span&gt;

      &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-s1"&gt;worker&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;terminate&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
      &lt;span class="pl-s1"&gt;ocrButton&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;innerText&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"OCR complete"&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Use these examples to put together a single HTML page with embedded HTML and CSS and JavaScript that provides a big square which users can drag and drop a PDF file onto and when they do that the PDF has every page converted to a JPEG and shown below on the page, then OCR is run with tesseract and the results are shown in textarea blocks below each image.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I saved this prompt to a &lt;code&gt;prompt.txt&lt;/code&gt; file and ran it using my &lt;a href="https://github.com/simonw/llm-claude-3"&gt;llm-claude-3&lt;/a&gt; plugin for &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m claude-3-opus &lt;span class="pl-k"&gt;&amp;lt;&lt;/span&gt; prompt.txt&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It gave me &lt;a href="https://static.simonwillison.net/static/2024/pdf-ocr-v1.html"&gt;a working initial version&lt;/a&gt; on the first attempt!&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/ocr-v1.jpg" alt="A square dotted border around the text Drag and drop PDF file here" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://gist.github.com/simonw/6a9f077bf8db616e44893a24ae1d36eb"&gt;Here's the full transcript&lt;/a&gt;, including my follow-up prompts and their responses. Iterating on software in this way is &lt;em&gt;so&lt;/em&gt; much fun.&lt;/p&gt;
&lt;p&gt;First follow-up:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Modify this to also have a file input that can be used - dropping a file onto the drop area fills that input&lt;/p&gt;
&lt;p&gt;make the drop zone 100% wide but have a 2em padding on the body. it should be 10em high. it should turn pink when an image is dragged over it.&lt;/p&gt;
&lt;p&gt;Each textarea should be 100% wide and 10em high&lt;/p&gt;
&lt;p&gt;At the very bottom of the page add a h2 that says Full document - then a 30em high textarea with all of the page text in it separated by two newlines&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://static.simonwillison.net/static/2024/pdf-ocr-v2.html"&gt;Here's the interactive result&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/ocr-v2.jpg" alt="A PDF file is dragged over the box and it turned pink. The heading Full document displays below" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Rather delightfully it used the neater pattern where the file input itself is hidden but can be triggered by clicking on the large drop zone, and it updated the copy on the drop zone to reflect that - without me suggesting those requirements.&lt;/p&gt;
&lt;p&gt;And then:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;get rid of the code that shows image sizes. Set the placeholder on each textarea to be Processing... and clear that placeholder when the job is done.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://static.simonwillison.net/static/2024/pdf-ocr-v3.html"&gt;Which gave me this&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I realized it would be useful if it could handle non-PDF images as well. So I fired up ChatGPT (for no reason other than curiosity to see how well it did) and got GPT-4 to add that feature for me. I &lt;a href="https://chat.openai.com/share/665eca31-3b5d-4cd9-a3cb-85ab608169a6"&gt;pasted in the code so far and added&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Modify this so jpg and png and gif images can be dropped or opened too - they skip the PDF step and get appended to the page and OCRd directly. Also move the full document heading and textarea above the page preview and hide it u til there is data to be shown in it&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Then I spotted that the Tesseract worker was being created multiple times in a loop, which is inefficient - so I prompted:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Create the worker once and use it for all OCR tasks and terminate it at the end&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I'd tweaked the HTML and CSS a little before feeding it to GPT-4, so now the site had a title and rendered in Helvetica.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://static.simonwillison.net/static/2024/pdf-ocr-v4.html"&gt;the version GPT-4 produced for me&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/ocr-v4.jpg" alt="A heading reads OCR a PDF or Image - This tool runs entirely in your browser. No files are uploaded to a server. The dotted box now contains text that reads Drag and drop a PDF, JPG, PNG, or GIF file here or click to select a file" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4 id="ocr-finishing-touches"&gt;Manual finishing touches&lt;/h4&gt;
&lt;p&gt;Fun though it was iterating on this project entirely through prompting, I decided it would be more productive to make the finishing touches myself. You can see those &lt;a href="https://github.com/simonw/tools/commits/cc609194a0d0a54c2ae676dae962e14b3e3a9d22/"&gt;in the commit history&lt;/a&gt;. They're not particularly interesting:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;I added &lt;a href="https://plausible.io/"&gt;Plausible&lt;/a&gt; analytics (which I like because they use no cookies).&lt;/li&gt;
&lt;li&gt;I added better progress indicators, including the text that shows how many pages of the PDF have been processed so far.&lt;/li&gt;
&lt;li&gt;I bumped up the width of the rendered PDF page images from 800 to 1000. This seemed to improve OCR quality - in particular, the &lt;a href="https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf"&gt;Claude 3 model card PDF&lt;/a&gt; now has less OCR errors than it did before.&lt;/li&gt;
&lt;li&gt;I upgraded both Tesseract.js and PDF.js to the most recent versions. Unsurprisingly, Claude 3 Opus had used older versions of both libraries.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I'm really pleased with this project. I consider it &lt;em&gt;finished&lt;/em&gt; - it does the job I designed it to do and I don't see any need to keep on iterating on it. And because it's all static JavaScript and WebAssembly I expect it to continue working effectively forever.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update:&lt;/strong&gt; OK, a few more features: I added &lt;a href="https://github.com/simonw/tools/issues/4"&gt;language selection&lt;/a&gt;, &lt;a href="https://github.com/simonw/tools/issues/7"&gt;paste support&lt;/a&gt; and some &lt;a href="https://github.com/simonw/tools/issues/8"&gt;basic automated tests&lt;/a&gt; using Playwright Python.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tesseract"&gt;tesseract&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="data-journalism"/><category term="ocr"/><category term="pdf"/><category term="projects"/><category term="tesseract"/><category term="ai-assisted-programming"/></entry><entry><title>tesseract-ocr</title><link href="https://simonwillison.net/2007/Jul/26/tesseractocr/#atom-tag" rel="alternate"/><published>2007-07-26T20:23:08+00:00</published><updated>2007-07-26T20:23:08+00:00</updated><id>https://simonwillison.net/2007/Jul/26/tesseractocr/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://code.google.com/p/tesseract-ocr/"&gt;tesseract-ocr&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Open source OCR, sponsored by Google. I just sat in on a talk on this at OSCON and the complexity of the problem is pretty incredible.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/oscon"&gt;oscon&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/oscon07"&gt;oscon07&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tesseract"&gt;tesseract&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="ocr"/><category term="oscon"/><category term="oscon07"/><category term="tesseract"/></entry></feed>