<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: scraping</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/scraping.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-08-11T18:11:49+00:00</updated><author><name>Simon Willison</name></author><entry><title>Reddit will block the Internet Archive</title><link href="https://simonwillison.net/2025/Aug/11/reddit-will-block-the-internet-archive/#atom-tag" rel="alternate"/><published>2025-08-11T18:11:49+00:00</published><updated>2025-08-11T18:11:49+00:00</updated><id>https://simonwillison.net/2025/Aug/11/reddit-will-block-the-internet-archive/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.theverge.com/news/757538/reddit-internet-archive-wayback-machine-block-limit"&gt;Reddit will block the Internet Archive&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Well this &lt;em&gt;sucks&lt;/em&gt;. Jay Peters for the Verge:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Reddit says that it has caught AI companies scraping its data from the Internet Archive’s Wayback Machine, so it’s going to start blocking the Internet Archive from indexing the vast majority of Reddit. The Wayback Machine will no longer be able to crawl post detail pages, comments, or profiles; instead, it will only be able to index the Reddit.com homepage, which effectively means Internet Archive will only be able to archive insights into which news headlines and posts were most popular on a given day.&lt;/p&gt;
&lt;/blockquote&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/internet-archive"&gt;internet-archive&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/reddit"&gt;reddit&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/training-data"&gt;training-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;&lt;/p&gt;



</summary><category term="internet-archive"/><category term="reddit"/><category term="scraping"/><category term="ai"/><category term="training-data"/><category term="ai-ethics"/></entry><entry><title>Vibe scraping and vibe coding a schedule app for Open Sauce 2025 entirely on my phone</title><link href="https://simonwillison.net/2025/Jul/17/vibe-scraping/#atom-tag" rel="alternate"/><published>2025-07-17T19:38:50+00:00</published><updated>2025-07-17T19:38:50+00:00</updated><id>https://simonwillison.net/2025/Jul/17/vibe-scraping/#atom-tag</id><summary type="html">
    &lt;p&gt;This morning, working entirely on my phone, I scraped a conference website and vibe coded up an alternative UI for interacting with the schedule using a combination of OpenAI Codex and Claude Artifacts.&lt;/p&gt;
&lt;p&gt;This weekend is &lt;a href="https://opensauce.com/"&gt;Open Sauce 2025&lt;/a&gt;, the third edition of the Bay Area conference for YouTube creators in the science and engineering space. I have a couple of friends going and they were complaining that the official schedule was difficult to navigate on a phone - it's not even linked from the homepage on mobile, and once you do find &lt;a href="https://opensauce.com/agenda/"&gt;the agenda&lt;/a&gt; it isn't particularly mobile-friendly.&lt;/p&gt;
&lt;p&gt;We were out for coffee this morning so I only had my phone, but I decided to see if I could fix it anyway.&lt;/p&gt;
&lt;p&gt;TLDR: Working entirely on my iPhone, using a combination of &lt;a href="https://chatgpt.com/codex"&gt;OpenAI Codex&lt;/a&gt; in the ChatGPT mobile app and Claude Artifacts via the Claude app, I was able to scrape the full schedule and then build and deploy this: &lt;a href="https://tools.simonwillison.net/open-sauce-2025"&gt;tools.simonwillison.net/open-sauce-2025&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/open-sauce-2025-card.jpg" alt="Screenshot of a blue page, Open Sauce 2025, July 18-20 2025, Download Calendar ICS button, then Friday 18th and Saturday 19th and Sunday 20th pill buttons, Friday is selected, the Welcome to Open Sauce with William Osman event on the Industry Stage is visible." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;The site offers a faster loading and more useful agenda view, but more importantly it includes an option to "Download Calendar (ICS)" which allows mobile phone users (Android and iOS) to easily import the schedule events directly into their calendar app of choice.&lt;/p&gt;
&lt;p&gt;Here are some detailed notes on how I built it.&lt;/p&gt;
&lt;h4 id="scraping-the-schedule"&gt;Scraping the schedule&lt;/h4&gt;
&lt;p&gt;Step one was to get that schedule in a structured format. I don't have good tools for viewing source on my iPhone, so I took a different approach to turning the schedule site into structured data.&lt;/p&gt;
&lt;p&gt;My first thought was to screenshot the schedule on my phone and then dump the images into a vision LLM - but the schedule was long enough that I didn't feel like scrolling through several different pages and stitching together dozens of images.&lt;/p&gt;
&lt;p&gt;If I was working on a laptop I'd turn to scraping: I'd dig around in the site itself and figure out where the data came from, then write code to extract it out.&lt;/p&gt;
&lt;p&gt;How could I do the same thing working on my phone?&lt;/p&gt;
&lt;p&gt;I decided to use &lt;strong&gt;OpenAI Codex&lt;/strong&gt; - the &lt;a href="https://simonwillison.net/2025/May/16/openai-codex/"&gt;hosted tool&lt;/a&gt;, not the confusingly named &lt;a href="https://simonwillison.net/2025/Apr/16/openai-codex/"&gt;CLI utility&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Codex recently &lt;a href="https://simonwillison.net/2025/Jun/3/codex-agent-internet-access/"&gt;grew the ability&lt;/a&gt; to interact with the internet while attempting to resolve a task. I have a dedicated Codex "environment" configured against a GitHub repository that doesn't do anything else, purely so I can run internet-enabled sessions there that can execute arbitrary network-enabled commands.&lt;/p&gt;
&lt;p&gt;I started a new task there (using the Codex interface inside the ChatGPT iPhone app) and prompted:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Install playwright and use it to visit https://opensauce.com/agenda/ and grab the full details of all three day schedules from the tabs - Friday and Saturday and Sunday - then save and on Data in as much detail as possible in a JSON file and submit that as a PR&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Codex is frustrating in that you only get one shot: it can go away and work autonomously on a task for a long time, but while it's working you can't give it follow-up prompts. You can wait for it to finish entirely and then tell it to try again in a new session, but ideally the instructions you give it are enough for it to get to the finish state where it submits a pull request against your repo with the results.&lt;/p&gt;
&lt;p&gt;I got lucky: my above prompt worked exactly as intended.&lt;/p&gt;
&lt;p&gt;Codex churned for &lt;em&gt;13 minutes&lt;/em&gt;! I sat chatting in a coffee shop, occasionally checking the logs to see what it was up to.&lt;/p&gt;
&lt;p&gt;It tried a whole bunch of approaches, all involving running the Playwright Python library to interact with the site. You can see &lt;a href="https://chatgpt.com/s/cd_687945dea5f48191892e0d73ebb45aa4"&gt;the full transcript here&lt;/a&gt;. It includes notes like "&lt;em&gt;Looks like xxd isn't installed. I'll grab "vim-common" or "xxd" to fix it.&lt;/em&gt;".&lt;/p&gt;
&lt;p&gt;Eventually it downloaded an enormous obfuscated chunk of JavaScript called &lt;a href="https://opensauce.com/wp-content/uploads/2025/07/schedule-overview-main-1752724893152.js"&gt;schedule-overview-main-1752724893152.js&lt;/a&gt; (316KB) and then ran a complex sequence of grep, grep, sed, strings, xxd and dd commands against it to figure out the location of the raw schedule data in order to extract it out.&lt;/p&gt;
&lt;p&gt;Here's the eventual &lt;a href="https://github.com/simonw/.github/blob/f671bf57f7c20a4a7a5b0642837811e37c557499/extract_schedule.py"&gt;extract_schedule.py&lt;/a&gt; Python script it wrote, which uses Playwright to save that &lt;code&gt;schedule-overview-main-1752724893152.js&lt;/code&gt; file and then extracts the raw data using the following code (which calls Node.js inside Python, just so it can use the JavaScript &lt;code&gt;eval()&lt;/code&gt; function):&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-s1"&gt;node_script&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; (
    &lt;span class="pl-s"&gt;"const fs=require('fs');"&lt;/span&gt;
    &lt;span class="pl-s"&gt;f"const d=fs.readFileSync('&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;tmp_path&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;','utf8');"&lt;/span&gt;
    &lt;span class="pl-s"&gt;"const m=d.match(/var oo=(&lt;span class="pl-cce"&gt;\\&lt;/span&gt;{.*?&lt;span class="pl-cce"&gt;\\&lt;/span&gt;});/s);"&lt;/span&gt;
    &lt;span class="pl-s"&gt;"if(!m){throw new Error('not found');}"&lt;/span&gt;
    &lt;span class="pl-s"&gt;"const obj=eval('(' + m[1] + ')');"&lt;/span&gt;
    &lt;span class="pl-s"&gt;f"fs.writeFileSync('&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-c1"&gt;OUTPUT_FILE&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;', JSON.stringify(obj, null, 2));"&lt;/span&gt;
)
&lt;span class="pl-s1"&gt;subprocess&lt;/span&gt;.&lt;span class="pl-c1"&gt;run&lt;/span&gt;([&lt;span class="pl-s"&gt;'node'&lt;/span&gt;, &lt;span class="pl-s"&gt;'-e'&lt;/span&gt;, &lt;span class="pl-s1"&gt;node_script&lt;/span&gt;], &lt;span class="pl-s1"&gt;check&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;)&lt;/pre&gt;
&lt;p&gt;As instructed, it then filed &lt;a href="https://github.com/simonw/.github/pull/1"&gt;a PR against my repo&lt;/a&gt;. It included the Python Playwright script, but more importantly it also included that full extracted &lt;a href="https://github.com/simonw/.github/blob/f671bf57f7c20a4a7a5b0642837811e37c557499/schedule.json"&gt;schedule.json&lt;/a&gt; file. That meant I now had the schedule data, with a &lt;code&gt;raw.githubusercontent.com&lt;/code&gt; URL with open CORS headers that could be fetched by a web app!&lt;/p&gt;
&lt;h4 id="building-the-web-app"&gt;Building the web app&lt;/h4&gt;
&lt;p&gt;Now that I had the data, the next step was to build a web application to preview it and serve it up in a more useful format.&lt;/p&gt;
&lt;p&gt;I decided I wanted two things: a nice mobile-friendly interface for browsing the schedule, and a mechanism for importing that schedule into a calendar application, such as Apple or Google Calendar.&lt;/p&gt;
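&lt;p&gt;The calendar side is simpler than it might sound: an iCalendar (ICS) file is just plain text with one &lt;code&gt;VEVENT&lt;/code&gt; block per session. A minimal hand-rolled sketch - the field names here are my illustrative guesses, not necessarily what schedule.json actually uses:&lt;/p&gt;

```python
from datetime import datetime, timedelta

def schedule_to_ics(sessions: list[dict]) -> str:
    """Render a list of sessions as a minimal iCalendar file.
    Each session dict is assumed to have 'title', 'where', a 'start'
    datetime and a 'length' in minutes - illustrative field names."""
    fmt = "%Y%m%dT%H%M%S"
    lines = ["BEGIN:VCALENDAR", "VERSION:2.0", "PRODID:-//example//schedule//EN"]
    for s in sessions:
        end = s["start"] + timedelta(minutes=s["length"])
        lines += [
            "BEGIN:VEVENT",
            f"DTSTART:{s['start'].strftime(fmt)}",
            f"DTEND:{end.strftime(fmt)}",
            f"SUMMARY:{s['title']}",
            f"LOCATION:{s['where']}",
            "END:VEVENT",
        ]
    lines.append("END:VCALENDAR")
    return "\r\n".join(lines)  # RFC 5545 requires CRLF line endings

ics = schedule_to_ics([{
    "title": "Welcome to Open Sauce",
    "where": "Industry Stage",
    "start": datetime(2025, 7, 18, 10, 0),
    "length": 30,
}])
print(ics.splitlines()[0])  # BEGIN:VCALENDAR
```

&lt;p&gt;In a web app the equivalent string would be built in JavaScript and handed to the browser as a Blob download, which is how a "Download Calendar (ICS)" button can work without any server-side code.&lt;/p&gt;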
&lt;p&gt;It took me several false starts to get this to work. The biggest challenge was getting that 63KB of schedule JSON data into the app. I tried a few approaches here, all on my iPhone while sitting in a coffee shop and later while driving with a friend to drop them off at the closest BART station.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Using ChatGPT Canvas and o3, since unlike Claude Artifacts a Canvas can fetch data from remote URLs if you allow-list that domain. I later found out that &lt;a href="https://chatgpt.com/share/687948b7-e8b8-8006-a450-0c07bdfd7f85"&gt;this had worked&lt;/a&gt; when I viewed it on my laptop, but on my phone it threw errors so I gave up on it.&lt;/li&gt;
&lt;li&gt;Uploading the JSON to Claude and telling it to build an artifact that read the file directly - this &lt;a href="https://claude.ai/share/25297074-37a9-4583-bc2f-630f6dea5c5d"&gt;failed with an error&lt;/a&gt; "undefined is not an object (evaluating 'window.fs.readFile')". The Claude 4 system prompt &lt;a href="https://simonwillison.net/2025/May/25/claude-4-system-prompt/#artifacts-the-missing-manual"&gt;had led me to expect this to work&lt;/a&gt;; I'm not sure why it didn't.&lt;/li&gt;
&lt;li&gt;Having Claude copy the full JSON into the artifact. This took too long - typing out 63KB of JSON is not a sensible use of LLM tokens, and it flaked out on me when my connection went intermittent driving through a tunnel.&lt;/li&gt;
&lt;li&gt;Telling Claude to fetch from the URL to that schedule JSON instead. This was my last resort because the Claude Artifacts UI blocks access to external URLs, so you have to copy and paste the code out to a separate interface (on an iPhone, which still lacks a "select all" button) making for a frustrating process.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;That final option worked! Here's the full sequence of prompts I used with Claude to get to a working implementation - &lt;a href="https://claude.ai/share/e391bbcc-09a2-4f86-9bec-c6def8fc8dc9"&gt;full transcript here&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Use your analyst tool to read this JSON file and show me the top level keys&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This was to prime Claude - I wanted to remind it about its &lt;code&gt;window.fs.readFile&lt;/code&gt; function and have it read enough of the JSON to understand the structure.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Build an artifact with no react that turns the schedule into a nice mobile friendly webpage - there are three days Friday, Saturday and Sunday, which corresponded to the 25th and 26th and 27th of July 2025&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Don’t copy the raw JSON over to the artifact - use your fs function to read it instead&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Also include a button to download ICS at the top of the page which downloads a ICS version of the schedule&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I had noticed that the schedule data had keys for "friday" and "saturday" and "sunday" but no indication of the dates, so I told it those. It turned out later I'd got these wrong!&lt;/p&gt;
&lt;p&gt;This got me a version of the page that failed with an error, because that &lt;code&gt;fs.readFile()&lt;/code&gt; couldn't load the data from the artifact for some reason. So I fixed that with:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Change it so instead of using the readFile thing it fetches the same JSON from  https://raw.githubusercontent.com/simonw/.github/f671bf57f7c20a4a7a5b0642837811e37c557499/schedule.json&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;... then copied the HTML out to a Gist and previewed it with &lt;a href="https://gistpreview.github.io/"&gt;gistpreview.github.io&lt;/a&gt; - here's &lt;a href="https://gistpreview.github.io/?06a5d1f3bf0af81d55a411f32b2f37c7"&gt;that preview&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Then we spot-checked it, since there are &lt;em&gt;so many ways&lt;/em&gt; this could have gone wrong. Thankfully the schedule JSON itself never round-tripped through an LLM so we didn't need to worry about hallucinated session details, but this was almost pure vibe coding so there was a big risk of a mistake sneaking through.&lt;/p&gt;
&lt;p&gt;I'd set myself a deadline of "by the time we drop my friend at the BART station" and I hit that deadline with just seconds to spare. I pasted the resulting HTML &lt;a href="https://github.com/simonw/tools/blob/main/open-sauce-2025.html"&gt;into my simonw/tools GitHub repo&lt;/a&gt; using the GitHub mobile web interface which deployed it to that final &lt;a href="https://tools.simonwillison.net/open-sauce-2025"&gt;tools.simonwillison.net/open-sauce-2025&lt;/a&gt; URL.&lt;/p&gt;
&lt;p&gt;... then we noticed that we &lt;em&gt;had&lt;/em&gt; missed a bug: I had given it the dates of "25th and 26th and 27th of July 2025" but that was actually a week too late - the correct dates were July 18th-20th.&lt;/p&gt;
&lt;p&gt;Thankfully I have Codex configured against my &lt;code&gt;simonw/tools&lt;/code&gt; repo as well, so fixing that was a case of prompting a new Codex session with:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;The open sauce schedule got the dates wrong - Friday is 18 July 2025 and Saturday is 19 and Sunday is 20 - fix it&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://chatgpt.com/s/cd_68794c97a3d88191a2cbe9de78103334"&gt;that Codex transcript&lt;/a&gt;, which resulted in &lt;a href="https://github.com/simonw/tools/pull/34"&gt;this PR&lt;/a&gt; which I landed and deployed, again using the GitHub mobile web interface.&lt;/p&gt;
&lt;h4 id="what-this-all-demonstrates"&gt;What this all demonstrates&lt;/h4&gt;
&lt;p&gt;So, to recap: I was able to scrape a website (without even a view source tool), turn the resulting JSON data into a mobile-friendly website, add an ICS export feature and deploy the results to a static hosting platform (GitHub Pages), working entirely on my phone.&lt;/p&gt;
&lt;p&gt;If I'd had a laptop this project would have been faster, but honestly aside from a little bit more hands-on debugging I wouldn't have gone about it in a particularly different way.&lt;/p&gt;
&lt;p&gt;I was able to do other stuff at the same time - the Codex scraping project ran entirely autonomously, and the app build was only more involved because I had to work around the limitations my tools placed on fetching data from external sources.&lt;/p&gt;
&lt;p&gt;As usual with this stuff, my 25+ years of previous web development experience was critical to being able to execute the project. I knew about Codex, and Artifacts, and GitHub, and Playwright, and CORS headers, and Artifacts sandbox limitations, and the capabilities of ICS files on mobile phones.&lt;/p&gt;
&lt;p&gt;This whole thing was &lt;em&gt;so much fun!&lt;/em&gt; Being able to spin up multiple coding agents directly from my phone and have them solve quite complex problems while only paying partial attention to the details is a solid demonstration of why I continue to enjoy exploring the edges of &lt;a href="https://simonwillison.net/tags/ai-assisted-programming/"&gt;AI-assisted programming&lt;/a&gt;.&lt;/p&gt;

&lt;h4 id="update-i-removed-the-speaker-avatars"&gt;Update: I removed the speaker avatars&lt;/h4&gt;
&lt;p&gt;Here's a beautiful cautionary tale about the dangers of vibe-coding on a phone with no access to performance profiling tools. A commenter on Hacker News &lt;a href="https://news.ycombinator.com/item?id=44597405#44597808"&gt;pointed out&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The web app makes 176 requests and downloads 130 megabytes.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And yeah, it did! Turns out those speaker avatar images weren't optimized, and there were over 170 of them.&lt;/p&gt;
&lt;p&gt;I told &lt;a href="https://chatgpt.com/s/cd_6879631d99c48191b1ab7f84dfab8dea"&gt;a fresh Codex instance&lt;/a&gt; "Remove the speaker avatar images from open-sauce-2025.html" and now the page weighs 93.58 KB - about 1,400 times smaller!&lt;/p&gt;
&lt;h4 id="update-2-improved-accessibility"&gt;Update 2: Improved accessibility&lt;/h4&gt;
&lt;p&gt;That same commenter &lt;a href="https://news.ycombinator.com/item?id=44597405#44597808"&gt;on Hacker News&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It's also &lt;code&gt;&amp;lt;div&amp;gt;&lt;/code&gt; soup and largely inaccessible.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Yeah, this HTML isn't great:&lt;/p&gt;
&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-s1"&gt;dayContainer&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;innerHTML&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;sessions&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;map&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;session&lt;/span&gt; &lt;span class="pl-c1"&gt;=&amp;gt;&lt;/span&gt; `
    &amp;lt;div class="session-card"&amp;gt;
        &amp;lt;div class="session-header"&amp;gt;
            &amp;lt;div&amp;gt;
                &amp;lt;span class="session-time"&amp;gt;&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;${&lt;/span&gt;&lt;span class="pl-s1"&gt;session&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;time&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;&amp;lt;/span&amp;gt;
                &amp;lt;span class="length-badge"&amp;gt;&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;${&lt;/span&gt;&lt;span class="pl-s1"&gt;session&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;length&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt; min&amp;lt;/span&amp;gt;
            &amp;lt;/div&amp;gt;
            &amp;lt;div class="session-location"&amp;gt;&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;${&lt;/span&gt;&lt;span class="pl-s1"&gt;session&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;where&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;&amp;lt;/&lt;span class="pl-s1"&gt;div&lt;/span&gt;&lt;span class="pl-c1"&gt;&amp;gt;&lt;/span&gt;
        &amp;lt;/&lt;span class="pl-s1"&gt;div&lt;/span&gt;&lt;span class="pl-c1"&gt;&amp;gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I &lt;a href="https://github.com/simonw/tools/issues/36"&gt;opened an issue&lt;/a&gt; and had both Claude Code and Codex look at it. Claude Code &lt;a href="https://github.com/simonw/tools/issues/36#issuecomment-3085516331"&gt;failed to submit a PR&lt;/a&gt; for some reason, but Codex &lt;a href="https://github.com/simonw/tools/pull/37"&gt;opened one&lt;/a&gt; with a fix that sounded good to me when I tried it with VoiceOver on iOS (using &lt;a href="https://codex-make-open-sauce-2025-h.tools-b1q.pages.dev/open-sauce-2025"&gt;a Cloudflare Pages preview&lt;/a&gt;) so I landed that. Here's &lt;a href="https://github.com/simonw/tools/commit/29c8298363869bbd4b4e7c51378c20dc8ac30c39"&gt;the diff&lt;/a&gt;, which added a hidden "skip to content" link, some &lt;code&gt;aria-&lt;/code&gt; attributes on buttons and upgraded the HTML to use &lt;code&gt;&amp;lt;h3&amp;gt;&lt;/code&gt; for the session titles.&lt;/p&gt;
&lt;p&gt;Next time I'll remember to specify accessibility as a requirement in the initial prompt. I'm disappointed that Claude didn't consider that without me having to ask.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/definitions"&gt;definitions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/icalendar"&gt;icalendar&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mobile"&gt;mobile&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tools"&gt;tools&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/playwright"&gt;playwright&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatgpt"&gt;chatgpt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-artifacts"&gt;claude-artifacts&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-agents"&gt;ai-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vibe-coding"&gt;vibe-coding&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/async-coding-agents"&gt;async-coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-to-app"&gt;prompt-to-app&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="definitions"/><category term="github"/><category term="icalendar"/><category term="mobile"/><category term="scraping"/><category term="tools"/><category term="ai"/><category term="playwright"/><category term="openai"/><category term="generative-ai"/><category term="chatgpt"/><category term="llms"/><category term="ai-assisted-programming"/><category term="claude"/><category term="claude-artifacts"/><category term="ai-agents"/><category term="vibe-coding"/><category term="coding-agents"/><category term="async-coding-agents"/><category term="prompt-to-app"/></entry><entry><title>shot-scraper 1.8</title><link href="https://simonwillison.net/2025/Mar/25/shot-scraper/#atom-tag" rel="alternate"/><published>2025-03-25T01:59:38+00:00</published><updated>2025-03-25T01:59:38+00:00</updated><id>https://simonwillison.net/2025/Mar/25/shot-scraper/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/shot-scraper/releases/tag/1.8"&gt;shot-scraper 1.8&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I've added a new feature to &lt;a href="https://shot-scraper.datasette.io/"&gt;shot-scraper&lt;/a&gt; that makes it easier to share scripts for other people to use with the &lt;a href="https://shot-scraper.datasette.io/en/stable/javascript.html"&gt;shot-scraper javascript&lt;/a&gt; command.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;shot-scraper javascript&lt;/code&gt; lets you load up a web page in an invisible Chrome browser (via Playwright), execute some JavaScript against that page and output the results to your terminal. It's a fun way of running complex screen-scraping routines as part of a terminal session, or even chained together with other commands using pipes.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;-i/--input&lt;/code&gt; option lets you load that JavaScript from a file on disk - but now you can also use a &lt;code&gt;gh:&lt;/code&gt; prefix to specify loading code from GitHub instead.&lt;/p&gt;
&lt;p&gt;To quote &lt;a href="https://github.com/simonw/shot-scraper/releases/tag/1.8"&gt;the release notes&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;shot-scraper javascript&lt;/code&gt; can now optionally &lt;a href="https://shot-scraper.datasette.io/en/stable/javascript.html#running-javascript-from-github"&gt;load scripts hosted on GitHub&lt;/a&gt; via the new &lt;code&gt;gh:&lt;/code&gt; prefix to the &lt;code&gt;shot-scraper javascript -i/--input&lt;/code&gt; option. &lt;a href="https://github.com/simonw/shot-scraper/issues/173"&gt;#173&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Scripts can be referenced as &lt;code&gt;gh:username/repo/path/to/script.js&lt;/code&gt; or, if the GitHub user has created a dedicated &lt;code&gt;shot-scraper-scripts&lt;/code&gt; repository and placed scripts in the root of it, using &lt;code&gt;gh:username/name-of-script&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;For example, to run this &lt;a href="https://github.com/simonw/shot-scraper-scripts/blob/main/readability.js"&gt;readability.js&lt;/a&gt; script against any web page you can use the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;shot-scraper javascript --input gh:simonw/readability \
  https://simonwillison.net/2025/Mar/24/qwen25-vl-32b/
&lt;/code&gt;&lt;/pre&gt;
&lt;/blockquote&gt;
&lt;p&gt;The &lt;a href="https://gist.github.com/simonw/60e196ec39a5a75dcabfd75fbe911a4c"&gt;output from that example&lt;/a&gt; starts like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
    &lt;span class="pl-ent"&gt;"title"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Qwen2.5-VL-32B: Smarter and Lighter&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"byline"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Simon Willison&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"dir"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"lang"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;en-gb&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"content"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&amp;lt;div id=&lt;span class="pl-cce"&gt;\"&lt;/span&gt;readability-page-1&lt;span class="pl-cce"&gt;\"...&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;My &lt;a href="https://github.com/simonw/shot-scraper-scripts"&gt;simonw/shot-scraper-scripts&lt;/a&gt; repo only has that one file in it so far, but I'm looking forward to growing that collection and hopefully seeing other people create and share their own &lt;code&gt;shot-scraper-scripts&lt;/code&gt; repos as well.&lt;/p&gt;
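&lt;p&gt;Under the hood, a &lt;code&gt;gh:&lt;/code&gt; reference only needs to be mapped to a &lt;code&gt;raw.githubusercontent.com&lt;/code&gt; URL before the script is fetched. Here's a sketch of that resolution logic - my reconstruction from the documented behaviour, not shot-scraper's actual implementation; the branch name and URL layout are assumptions:&lt;/p&gt;

```python
def resolve_gh(spec: str, branch: str = "main") -> str:
    """Map a gh: script reference to a raw.githubusercontent.com URL.
    Two documented forms:
      gh:username/repo/path/to/script.js -> that file in that repo
      gh:username/name-of-script        -> name-of-script.js in the
        user's shot-scraper-scripts repo
    Branch and URL layout are assumptions for illustration."""
    path = spec.removeprefix("gh:")
    parts = path.split("/")
    if len(parts) == 2:  # shorthand form: username/name-of-script
        user, name = parts
        return (f"https://raw.githubusercontent.com/{user}/"
                f"shot-scraper-scripts/{branch}/{name}.js")
    user, repo, *rest = parts  # explicit form: username/repo/path...
    return f"https://raw.githubusercontent.com/{user}/{repo}/{branch}/{'/'.join(rest)}"

print(resolve_gh("gh:simonw/readability"))
```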
&lt;p&gt;This feature is an imitation of &lt;a href="https://github.com/simonw/llm/issues/809"&gt;a similar feature&lt;/a&gt; that's coming in the next release of LLM.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/annotated-release-notes"&gt;annotated-release-notes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/playwright"&gt;playwright&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/shot-scraper"&gt;shot-scraper&lt;/a&gt;&lt;/p&gt;



</summary><category term="github"/><category term="javascript"/><category term="projects"/><category term="scraping"/><category term="annotated-release-notes"/><category term="playwright"/><category term="shot-scraper"/></entry><entry><title>Cutting-edge web scraping techniques at NICAR</title><link href="https://simonwillison.net/2025/Mar/8/cutting-edge-web-scraping/#atom-tag" rel="alternate"/><published>2025-03-08T19:25:36+00:00</published><updated>2025-03-08T19:25:36+00:00</updated><id>https://simonwillison.net/2025/Mar/8/cutting-edge-web-scraping/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/nicar-2025-scraping/blob/main/README.md"&gt;Cutting-edge web scraping techniques at NICAR&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Here's the handout for a workshop I presented this morning at &lt;a href="https://www.ire.org/training/conferences/nicar-2025/"&gt;NICAR 2025&lt;/a&gt; on web scraping, focusing on lesser-known tips and tricks that became possible only with recent developments in LLMs.&lt;/p&gt;
&lt;p&gt;For workshops like this I like to work off an extremely detailed handout, so that people can move at their own pace or catch up later if they didn't get everything done.&lt;/p&gt;
&lt;p&gt;The workshop consisted of four parts:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ol&gt;
&lt;li&gt;Building a &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraper&lt;/a&gt; - an automated scraper in GitHub Actions that records changes to a resource over time&lt;/li&gt;
&lt;li&gt;Using in-browser JavaScript and then &lt;a href="https://shot-scraper.datasette.io/"&gt;shot-scraper&lt;/a&gt; to extract useful information&lt;/li&gt;
&lt;li&gt;Using &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; with both OpenAI and Google Gemini to extract structured data from unstructured websites&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2024/Oct/17/video-scraping/"&gt;Video scraping&lt;/a&gt; using &lt;a href="https://aistudio.google.com/"&gt;Google AI Studio&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
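&lt;p&gt;The first of those parts - the Git scraper - boils down to a small scheduled GitHub Actions workflow that fetches a resource and commits it only when it has changed, so the repo's history becomes a changelog of the data. A hedged sketch of that pattern (the URL is a placeholder and the exact steps will vary by scraper):&lt;/p&gt;

```yaml
name: Scrape latest data
on:
  schedule:
    - cron: "0 * * * *"   # run hourly
  workflow_dispatch: {}    # allow manual runs too
jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Fetch the resource
        run: curl -s https://example.com/data.json -o data.json
      - name: Commit if changed
        run: |
          git config user.name "github-actions"
          git config user.email "actions@users.noreply.github.com"
          git add -A
          git diff --staged --quiet || git commit -m "Latest data"
          git push
```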
&lt;p&gt;I released several new tools in preparation for this workshop (I call this "NICAR Driven Development"):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/git-scraper-template"&gt;git-scraper-template&lt;/a&gt; template repository for quickly setting up new Git scrapers, which I &lt;a href="https://simonwillison.net/2025/Feb/26/git-scraper-template/"&gt;wrote about here&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2025/Feb/28/llm-schemas/"&gt;LLM schemas&lt;/a&gt;, finally adding structured schema support to my LLM tool&lt;/li&gt;
&lt;li&gt;&lt;a href="https://shot-scraper.datasette.io/en/stable/har.html"&gt;shot-scraper har&lt;/a&gt;  for archiving pages as HTML Archive files - though I cut this from the workshop for time&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I also came up with a fun way to distribute API keys for workshop participants: I &lt;a href="https://claude.ai/share/8d3330c8-7fd4-46d1-93d4-a3bd05915793"&gt;had Claude build me&lt;/a&gt; a web page where I can create an encrypted message with a passphrase, then share a URL to that page with users and give them the passphrase to unlock the encrypted message. You can try that at &lt;a href="https://tools.simonwillison.net/encrypt"&gt;tools.simonwillison.net/encrypt&lt;/a&gt; - or &lt;a href="https://tools.simonwillison.net/encrypt#5ZeXCdZ5pqCcHqE1y0aGtoIijlUW+ipN4gjQV4A2/6jQNovxnDvO6yoohgxBIVWWCN8m6ppAdjKR41Qzyq8Keh0RP7E="&gt;use this link&lt;/a&gt; and enter the passphrase "demo":&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of a message encryption/decryption web interface showing the title &amp;quot;Encrypt / decrypt message&amp;quot; with two tab options: &amp;quot;Encrypt a message&amp;quot; and &amp;quot;Decrypt a message&amp;quot; (highlighted). Below shows a decryption form with text &amp;quot;This page contains an encrypted message&amp;quot;, a passphrase input field with dots, a blue &amp;quot;Decrypt message&amp;quot; button, and a revealed message saying &amp;quot;This is a secret message&amp;quot;." src="https://static.simonwillison.net/static/2025/encrypt-decrypt.jpg" /&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/speaking"&gt;speaking&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/shot-scraper"&gt;shot-scraper&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nicar"&gt;nicar&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-artifacts"&gt;claude-artifacts&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-to-app"&gt;prompt-to-app&lt;/a&gt;&lt;/p&gt;



</summary><category term="scraping"/><category term="speaking"/><category term="ai"/><category term="git-scraping"/><category term="shot-scraper"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="claude"/><category term="gemini"/><category term="nicar"/><category term="claude-artifacts"/><category term="prompt-to-app"/></entry><entry><title>monolith</title><link href="https://simonwillison.net/2025/Mar/6/monolith/#atom-tag" rel="alternate"/><published>2025-03-06T15:37:48+00:00</published><updated>2025-03-06T15:37:48+00:00</updated><id>https://simonwillison.net/2025/Mar/6/monolith/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/Y2Z/monolith"&gt;monolith&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Neat CLI tool built in Rust that can create a single packaged HTML file of a web page plus all of its dependencies.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cargo install monolith  # or: brew install monolith
monolith https://simonwillison.net/ &amp;gt; simonwillison.html
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That command produced &lt;a href="https://static.simonwillison.net/static/2025/simonwillison.html"&gt;this 1.5MB single file result&lt;/a&gt;. All of the linked images, CSS and JavaScript assets have had their contents inlined into base64 URIs in their &lt;code&gt;src=&lt;/code&gt; and &lt;code&gt;href=&lt;/code&gt; attributes.&lt;/p&gt;
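The inlining step monolith performs (in Rust) can be sketched in a few lines of Python: an asset's bytes become a base64 `data:` URI suitable for dropping into a `src=` or `href=` attribute.

```python
import base64

def to_data_uri(content: bytes, mime: str) -> str:
    # encode the asset bytes and prefix with the data: URI scheme
    return f"data:{mime};base64,{base64.b64encode(content).decode('ascii')}"

css = b"body { color: red }"
print(to_data_uri(css, "text/css"))
```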
&lt;p&gt;I was intrigued as to how it works, so I dumped the whole repository into Gemini 2.0 Pro and asked for an architectural summary:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cd /tmp
git clone https://github.com/Y2Z/monolith
cd monolith
files-to-prompt . -c | llm -m gemini-2.0-pro-exp-02-05 \
  -s 'architectural overview as markdown'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/2c80749935ae3339d6f7175dc7cf325b"&gt;what I got&lt;/a&gt;. Short version: it uses the &lt;code&gt;reqwest&lt;/code&gt;, &lt;code&gt;html5ever&lt;/code&gt;, &lt;code&gt;markup5ever_rcdom&lt;/code&gt; and &lt;code&gt;cssparser&lt;/code&gt; crates to fetch and parse HTML and CSS and extract, combine and rewrite the assets. It doesn't currently attempt to run any JavaScript.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=42933383#42935115"&gt;Comment on Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rust"&gt;rust&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/files-to-prompt"&gt;files-to-prompt&lt;/a&gt;&lt;/p&gt;



</summary><category term="cli"/><category term="scraping"/><category term="ai"/><category term="rust"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="files-to-prompt"/></entry><entry><title>simonw/git-scraper-template</title><link href="https://simonwillison.net/2025/Feb/26/git-scraper-template/#atom-tag" rel="alternate"/><published>2025-02-26T05:34:05+00:00</published><updated>2025-02-26T05:34:05+00:00</updated><id>https://simonwillison.net/2025/Feb/26/git-scraper-template/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/git-scraper-template"&gt;simonw/git-scraper-template&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I built this new GitHub template repository in preparation for a workshop I'm giving at &lt;a href="https://www.ire.org/training/conferences/nicar-2025/"&gt;NICAR&lt;/a&gt; (the data journalism conference) next week on &lt;a href="https://github.com/simonw/nicar-2025-scraping/"&gt;Cutting-edge web scraping techniques&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;One of the topics I'll be covering is &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt; - creating a GitHub repository that uses scheduled GitHub Actions workflows to grab copies of websites and data feeds and store their changes over time using Git.&lt;/p&gt;
&lt;p&gt;This template repository is designed to be the fastest possible way to get started with a new Git scraper: simply &lt;a href="https://github.com/new?template_name=git-scraper-template&amp;amp;template_owner=simonw"&gt;create a new repository from the template&lt;/a&gt;, paste the URL you want to scrape into the &lt;strong&gt;description&lt;/strong&gt; field, and the repository will be initialized with a custom script that scrapes and stores that URL.&lt;/p&gt;
&lt;p&gt;It's modeled after my earlier &lt;a href="https://github.com/simonw/shot-scraper-template"&gt;shot-scraper-template&lt;/a&gt; tool which I described in detail in &lt;a href="https://simonwillison.net/2022/Mar/14/shot-scraper-template/"&gt;Instantly create a GitHub repository to take screenshots of a web page&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The new &lt;code&gt;git-scraper-template&lt;/code&gt; repo took &lt;a href="https://github.com/simonw/git-scraper-template/issues/2#issuecomment-2683871054"&gt;some help from Claude&lt;/a&gt; to figure out. It uses a &lt;a href="https://github.com/simonw/git-scraper-template/blob/a2b12972584099d7c793ee4b38303d94792bf0f0/download.sh"&gt;custom script&lt;/a&gt; to download the provided URL and derive a filename to use based on the URL and the content type, detected using &lt;code&gt;file --mime-type -b "$file_path"&lt;/code&gt; against the downloaded file.&lt;/p&gt;
&lt;p&gt;It also detects if the downloaded content is JSON and, if it is, pretty-prints it using &lt;code&gt;jq&lt;/code&gt; - I find this is a quick way to generate much more useful diffs when the content changes.
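The template's actual download.sh does this in shell with `file` and `jq`; a minimal Python analogue of just the detect-and-pretty-print idea looks like this:

```python
import json

def pretty_if_json(raw: bytes):
    """Return pretty-printed JSON for diff-friendly commits, or None."""
    try:
        parsed = json.loads(raw)
    except ValueError:
        return None  # not JSON: store the downloaded bytes as-is
    # stable indentation means small upstream changes produce small Git diffs
    return json.dumps(parsed, indent=2) + "\n"
```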


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nicar"&gt;nicar&lt;/a&gt;&lt;/p&gt;



</summary><category term="data-journalism"/><category term="git"/><category term="github"/><category term="projects"/><category term="scraping"/><category term="github-actions"/><category term="git-scraping"/><category term="nicar"/></entry><entry><title>Using a Tailscale exit node with GitHub Actions</title><link href="https://simonwillison.net/2025/Feb/23/tailscale-exit-node-with-github-actions/#atom-tag" rel="alternate"/><published>2025-02-23T02:49:32+00:00</published><updated>2025-02-23T02:49:32+00:00</updated><id>https://simonwillison.net/2025/Feb/23/tailscale-exit-node-with-github-actions/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://til.simonwillison.net/tailscale/tailscale-github-actions"&gt;Using a Tailscale exit node with GitHub Actions&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
New TIL. I started running a &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;git scraper&lt;/a&gt; against doge.gov to track changes made to that website over time. The DOGE site runs behind Cloudflare which was blocking requests from the GitHub Actions IP range, but I figured out how to run a Tailscale exit node on my Apple TV and use that to proxy my &lt;a href="https://shot-scraper.datasette.io/"&gt;shot-scraper&lt;/a&gt; requests.&lt;/p&gt;
&lt;p&gt;The scraper is running in &lt;a href="https://github.com/simonw/scrape-doge-gov"&gt;simonw/scrape-doge-gov&lt;/a&gt;. It uses the new &lt;a href="https://shot-scraper.datasette.io/en/stable/har.html"&gt;shot-scraper har&lt;/a&gt; command I added in &lt;a href="https://github.com/simonw/shot-scraper/releases/tag/1.6"&gt;shot-scraper 1.6&lt;/a&gt; (and improved in &lt;a href="https://github.com/simonw/shot-scraper/releases/tag/1.7"&gt;shot-scraper 1.7&lt;/a&gt;).


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tailscale"&gt;tailscale&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/til"&gt;til&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/shot-scraper"&gt;shot-scraper&lt;/a&gt;&lt;/p&gt;



</summary><category term="github"/><category term="scraping"/><category term="github-actions"/><category term="tailscale"/><category term="til"/><category term="git-scraping"/><category term="shot-scraper"/></entry><entry><title>shot-scraper 1.6 with support for HTTP Archives</title><link href="https://simonwillison.net/2025/Feb/13/shot-scraper/#atom-tag" rel="alternate"/><published>2025-02-13T21:02:37+00:00</published><updated>2025-02-13T21:02:37+00:00</updated><id>https://simonwillison.net/2025/Feb/13/shot-scraper/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/shot-scraper/releases/tag/1.6"&gt;shot-scraper 1.6 with support for HTTP Archives&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
New release of my &lt;a href="https://shot-scraper.datasette.io/"&gt;shot-scraper&lt;/a&gt; CLI tool for taking screenshots and scraping web pages.&lt;/p&gt;
&lt;p&gt;The big new feature is &lt;a href="https://en.wikipedia.org/wiki/HAR_(file_format)"&gt;HTTP Archive (HAR)&lt;/a&gt; support. The new &lt;a href="https://shot-scraper.datasette.io/en/stable/har.html"&gt;shot-scraper har command&lt;/a&gt; can now create an archive of a page and all of its dependents like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;shot-scraper har https://datasette.io/
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This produces a &lt;code&gt;datasette-io.har&lt;/code&gt; file (currently 163KB) which is JSON representing the full set of requests used to render that page. Here's &lt;a href="https://gist.github.com/simonw/b1fdf434e460814efdb89c95c354f794"&gt;a copy of that file&lt;/a&gt;. You can visualize that &lt;a href="https://ericduran.github.io/chromeHAR/?url=https://gist.githubusercontent.com/simonw/b1fdf434e460814efdb89c95c354f794/raw/924c1eb12b940ff02cefa2cc068f23c9d3cc5895/datasette.har.json"&gt;here using ericduran.github.io/chromeHAR&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="The HAR viewer shows a line for each of the loaded resources, with options to view timing information" src="https://static.simonwillison.net/static/2025/har-viewer.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;That JSON includes full copies of all of the responses, base64 encoded if they are binary files such as images.&lt;/p&gt;
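Since a HAR file is just JSON, it's easy to post-process. Here's a sketch that walks the entries and decodes those base64 response bodies - the structure follows the HAR 1.2 format, and the function name is my own:

```python
import base64
import json

def har_bodies(path):
    """Return (url, body_size_in_bytes) for each entry in a HAR file."""
    with open(path) as f:
        har = json.load(f)
    out = []
    for entry in har["log"]["entries"]:
        content = entry["response"]["content"]
        body = content.get("text", "")
        if content.get("encoding") == "base64":
            body = base64.b64decode(body)  # binary responses are base64
        else:
            body = body.encode()
        out.append((entry["request"]["url"], len(body)))
    return out
```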
&lt;p&gt;You can add the &lt;code&gt;--zip&lt;/code&gt; flag to instead get a &lt;code&gt;datasette-io.har.zip&lt;/code&gt; file, containing JSON data in &lt;code&gt;har.har&lt;/code&gt; but with the response bodies saved as separate files in that archive.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;shot-scraper multi&lt;/code&gt; command lets you run &lt;code&gt;shot-scraper&lt;/code&gt; against multiple URLs in sequence, specified using a YAML file. That command now takes a &lt;code&gt;--har&lt;/code&gt; option (or &lt;code&gt;--har-zip&lt;/code&gt; or &lt;code&gt;--har-file name-of-file&lt;/code&gt;), &lt;a href="https://shot-scraper.datasette.io/en/stable/multi.html#recording-to-an-http-archive"&gt;described in the documentation&lt;/a&gt;, which will produce a HAR at the same time as taking the screenshots.&lt;/p&gt;
&lt;p&gt;Shots are usually defined in YAML that looks like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;- &lt;span class="pl-ent"&gt;output&lt;/span&gt;: &lt;span class="pl-s"&gt;example.com.png&lt;/span&gt;
  &lt;span class="pl-ent"&gt;url&lt;/span&gt;: &lt;span class="pl-s"&gt;http://www.example.com/&lt;/span&gt;
- &lt;span class="pl-ent"&gt;output&lt;/span&gt;: &lt;span class="pl-s"&gt;w3c.org.png&lt;/span&gt;
  &lt;span class="pl-ent"&gt;url&lt;/span&gt;: &lt;span class="pl-s"&gt;https://www.w3.org/&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You can now omit the &lt;code&gt;output:&lt;/code&gt; keys and generate a HAR file without taking any screenshots at all:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;- &lt;span class="pl-ent"&gt;url&lt;/span&gt;: &lt;span class="pl-s"&gt;http://www.example.com/&lt;/span&gt;
- &lt;span class="pl-ent"&gt;url&lt;/span&gt;: &lt;span class="pl-s"&gt;https://www.w3.org/&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Run like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;shot-scraper multi shots.yml --har
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Which outputs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Skipping screenshot of 'https://www.example.com/'
Skipping screenshot of 'https://www.w3.org/'
Wrote to HAR file: trace.har
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;shot-scraper&lt;/code&gt; is built on top of Playwright, and the new features use the &lt;a href="https://playwright.dev/python/docs/next/api/class-browser#browser-new-context-option-record-har-path"&gt;browser.new_context(record_har_path=...)&lt;/a&gt; parameter.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/playwright"&gt;playwright&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/shot-scraper"&gt;shot-scraper&lt;/a&gt;&lt;/p&gt;



</summary><category term="cli"/><category term="projects"/><category term="python"/><category term="scraping"/><category term="playwright"/><category term="shot-scraper"/></entry><entry><title>Video scraping: extracting JSON data from a 35 second screen capture for less than 1/10th of a cent</title><link href="https://simonwillison.net/2024/Oct/17/video-scraping/#atom-tag" rel="alternate"/><published>2024-10-17T12:32:47+00:00</published><updated>2024-10-17T12:32:47+00:00</updated><id>https://simonwillison.net/2024/Oct/17/video-scraping/#atom-tag</id><summary type="html">
    &lt;p&gt;The other day I found myself needing to add up some numeric values that were scattered across twelve different emails.&lt;/p&gt;
&lt;p&gt;I didn't particularly feel like copying and pasting all of the numbers out one at a time, so I decided to try something different: could I record a screen capture while browsing around my Gmail account and then extract the numbers from that video using Google Gemini?&lt;/p&gt;
&lt;p&gt;This turned out to work &lt;em&gt;incredibly&lt;/em&gt; well.&lt;/p&gt;
&lt;h4 id="ai-studio-and-quicktime"&gt;AI Studio and QuickTime&lt;/h4&gt;
&lt;p&gt;I recorded the video using QuickTime Player on my Mac: &lt;code&gt;File -&amp;gt; New Screen Recording&lt;/code&gt;. I dragged a box around a portion of my screen containing my Gmail account, then clicked on each of the emails in turn, pausing for a couple of seconds on each one.&lt;/p&gt;
&lt;p&gt;I uploaded the resulting file directly into Google's &lt;a href="https://aistudio.google.com/"&gt;AI Studio&lt;/a&gt; tool and prompted the following:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Turn this into a JSON array where each item has a yyyy-mm-dd date and a floating point dollar amount for that date&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;... and it worked. It spat out a JSON array like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;[
  {
    &lt;span class="pl-ent"&gt;"date"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;2023-01-01&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"amount"&lt;/span&gt;: &lt;span class="pl-c1"&gt;2...&lt;/span&gt;
  },
  &lt;span class="pl-c1"&gt;...&lt;/span&gt;
]&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/video-scraping.jpg" alt="Screenshot of the Google AI Studio interface - I used Gemini 1.5 Flash 0002, a 35 second screen recording video (which was 10,326 tokens) and the token count says 11,018/1,000,000 - the screenshot redacts some details but you can see the start of the JSON output with date and amount keys in a list" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I wanted to paste that into Numbers, so I followed up with:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;turn that into copy-pastable csv&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Which gave me back the same data formatted as CSV.&lt;/p&gt;
&lt;p&gt;You should never trust these things not to make mistakes, so I re-watched the 35 second video and manually checked the numbers. It got everything right.&lt;/p&gt;
&lt;p&gt;I had intended to use Gemini 1.5 Pro, aka Google's best model... but it turns out I forgot to select the model and I'd actually run the entire process using the much less expensive Gemini 1.5 Flash 002.&lt;/p&gt;
&lt;h4 id="how-much-did-it-cost"&gt;How much did it cost?&lt;/h4&gt;

&lt;p&gt;According to AI Studio I used 11,018 tokens, of which 10,326 were for the video.&lt;/p&gt;
&lt;p&gt;Gemini 1.5 Flash &lt;a href="https://ai.google.dev/pricing#1_5flash"&gt;charges&lt;/a&gt; $0.075/1 million tokens (the price &lt;a href="https://developers.googleblog.com/en/gemini-15-flash-updates-google-ai-studio-gemini-api/"&gt;dropped in August&lt;/a&gt;).&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;11018/1000000 = 0.011018
0.011018 * $0.075 = $0.00082635
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So this entire exercise should have cost me just under 1/10th of a cent!&lt;/p&gt;
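That arithmetic generalizes into a tiny helper (the prices here are the ones from the post, not current ones):

```python
def token_cost_dollars(tokens: int, price_per_million: float) -> float:
    # price is quoted per million tokens, so scale down accordingly
    return tokens / 1_000_000 * price_per_million

# the 11,018 tokens from this experiment at $0.075/million
print(token_cost_dollars(11_018, 0.075))  # just under a tenth of a cent
```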
&lt;p&gt;&lt;em&gt;And in fact, it was &lt;strong&gt;free&lt;/strong&gt;. Google AI Studio &lt;a href="https://ai.google.dev/gemini-api/docs/billing#is-AI-Studio-free"&gt;currently&lt;/a&gt; "remains free of charge regardless of if you set up billing across all supported regions". I believe that means they &lt;a href="https://simonwillison.net/2024/Oct/17/gemini-terms-of-service/"&gt;can train on your data&lt;/a&gt; though, which is not the case for their paid APIs.&lt;/em&gt;&lt;/p&gt;
&lt;h4 id="the-alternatives-aren-t-actually-that-great"&gt;The alternatives aren't actually that great&lt;/h4&gt;
&lt;p&gt;Let's consider the alternatives here.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;I could have clicked through the emails and copied out the data manually one at a time. This is error prone and kind of boring. For twelve emails it would have been OK, but for a hundred it would have been a real pain.&lt;/li&gt;
&lt;li&gt;Accessing my Gmail data programmatically. This seems to get harder every year - it's still possible to access it via IMAP right now if you set up a dedicated &lt;a href="https://support.google.com/mail/answer/185833"&gt;app password&lt;/a&gt; but that's a whole lot of work for a one-off scraping task. The &lt;a href="https://developers.google.com/gmail/api/guides"&gt;official API&lt;/a&gt; is no fun at all.&lt;/li&gt;
&lt;li&gt;Some kind of browser automation (Playwright or similar) that can click through my Gmail account for me. Even with an LLM to help write the code this is still a lot more work, and it doesn't help deal with formatting differences in emails either - I'd have to solve the email parsing step separately.&lt;/li&gt;
&lt;li&gt;Using some kind of much more sophisticated pre-existing AI tool that has access to my email. A separate Google product also called Gemini can do this if you grant it access, but my results with that so far haven't been particularly great. AI tools are inherently unpredictable. I'm also nervous about giving any tool full access to my email account due to the risk from things like &lt;a href="https://simonwillison.net/tags/prompt-injection/"&gt;prompt injection&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="video-scraping-is-really-powerful"&gt;Video scraping is really powerful&lt;/h4&gt;
&lt;p&gt;The great thing about this &lt;strong&gt;video scraping&lt;/strong&gt; technique is that it works with &lt;em&gt;anything&lt;/em&gt; that you can see on your screen... and it puts you in total control of what you end up exposing to the AI model.&lt;/p&gt;
&lt;p&gt;There's no level of website authentication or anti-scraping technology that can stop me from recording a video of my screen while I manually click around inside a web application.&lt;/p&gt;
&lt;p&gt;The results I get depend entirely on how thoughtful I was about how I positioned my screen capture area and how I clicked around.&lt;/p&gt;
&lt;p&gt;There is &lt;em&gt;no setup cost&lt;/em&gt; for this at all - sign into a site, hit record, browse around a bit and then dump the video into Gemini.&lt;/p&gt;
&lt;p&gt;And the cost is so low that I had to re-run my calculations three times to make sure I hadn't made a mistake.&lt;/p&gt;
&lt;p&gt;I expect I'll be using this technique a whole lot more in the future. It also has applications in the data journalism world, which frequently involves the need to scrape data from sources that really don't want to be scraped.&lt;/p&gt;

&lt;h4 id="a-note-on-reliability"&gt;A note on reliability&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Added 22nd December 2024&lt;/em&gt;. As with anything involving LLMs, it's worth noting that you cannot trust these models to return exactly correct results with 100% reliability. I verified the results here manually through eyeball comparison of the JSON to the underlying video, but in a larger project this may not be feasible. Consider spot-checks or other strategies for double-checking the results, especially if mistakes could have meaningful real-world impact.&lt;/p&gt;

&lt;h4 id="bonus-calculator"&gt;Bonus: An LLM pricing calculator&lt;/h4&gt;

&lt;p&gt;In writing up this experiment I got fed up with having to manually calculate token prices. I usually outsource that to ChatGPT Code Interpreter, but I've caught it &lt;a href="https://gist.github.com/simonw/3a4406eeed70f7f2de604892eb3548c4?permalink_comment_id=5239420#gistcomment-5239420"&gt;messing up the conversion&lt;/a&gt; from dollars to cents once or twice, so I always have to double-check its work.&lt;/p&gt;

&lt;p&gt;So I got Claude 3.5 Sonnet with Claude Artifacts to build me &lt;a href="https://tools.simonwillison.net/llm-prices"&gt;this pricing calculator tool&lt;/a&gt; (&lt;a href="https://github.com/simonw/tools/blob/main/llm-prices.html"&gt;source code here&lt;/a&gt;):&lt;/p&gt;

&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm-pricing-calculator.jpg" alt="Screenshot of LLM Pricing Calculator interface. Left panel: input fields for tokens and costs. Input Tokens: 11018, Output Tokens: empty, Cost per Million Input Tokens: $0.075, Cost per Million Output Tokens: $0.3. Total Cost calculated: $0.000826 or 0.0826 cents. Right panel: Presets for various models including Gemini, Claude, and GPT versions with their respective input/output costs per 1M tokens. Footer: Prices were correct as of 16th October 2024, they may have changed." /&gt;&lt;/p&gt;

&lt;p&gt;You can set the input/output token prices by hand, or click one of the preset buttons to pre-fill it with the prices for different existing models (as of 16th October 2024 - I won't promise that I'll promptly update them in the future!)&lt;/p&gt;

&lt;p&gt;The entire thing was written by Claude. Here's &lt;a href="https://gist.github.com/simonw/6b684b5f7d75fb82034fc963cc487530"&gt;the full conversation transcript&lt;/a&gt; - we spent 19 minutes iterating on it through 10 different versions.&lt;/p&gt;

&lt;p&gt;Rather than hunt down all of those prices myself, I took screenshots of the pricing pages for each of the model providers and dumped those directly into the Claude conversation:&lt;/p&gt;

&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/claude-screenshots.jpg" alt="Claude: Is there anything else you'd like me to adjust or explain about this updated calculator? Me: Add a onkeyup event too, I want that calculator to update as I type. Also add a section underneath the calculator called Presets which lets the user click a model to populate the cost per million fields with that model's prices - which should be shown on the page too. I've dumped in some screenshots of pricing pages you can use - ignore prompt caching prices. There are five attached screenshots of pricing pages for different models." /&gt;&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gmail"&gt;gmail&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-artifacts"&gt;claude-artifacts&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-3-5-sonnet"&gt;claude-3-5-sonnet&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-to-app"&gt;prompt-to-app&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="data-journalism"/><category term="gmail"/><category term="google"/><category term="scraping"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="claude"/><category term="gemini"/><category term="vision-llms"/><category term="claude-artifacts"/><category term="claude-3-5-sonnet"/><category term="prompt-to-app"/></entry><entry><title>Quoting Kieran McCarthy</title><link href="https://simonwillison.net/2024/Feb/28/kieran-mccarthy/#atom-tag" rel="alternate"/><published>2024-02-28T15:15:13+00:00</published><updated>2024-02-28T15:15:13+00:00</updated><id>https://simonwillison.net/2024/Feb/28/kieran-mccarthy/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://blog.ericgoldman.org/archives/2024/02/facebook-drops-anti-scraping-lawsuit-against-bright-data-guest-blog-post.htm"&gt;&lt;p&gt;For the last few years, Meta has had a team of attorneys dedicated to policing unauthorized forms of scraping and data collection on Meta platforms. The decision not to further pursue these claims seems as close to waving the white flag as you can get against these kinds of companies. But why? [...]&lt;/p&gt;
&lt;p&gt;In short, I think Meta cares more about access to large volumes of data and AI than it does about outsiders scraping their public data now. My hunch is that they know that any success in anti-scraping cases can be thrown back at them in their own attempts to build AI training databases and LLMs. And they care more about the latter than the former.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://blog.ericgoldman.org/archives/2024/02/facebook-drops-anti-scraping-lawsuit-against-bright-data-guest-blog-post.htm"&gt;Kieran McCarthy&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/facebook"&gt;facebook&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/training-data"&gt;training-data&lt;/a&gt;&lt;/p&gt;



</summary><category term="facebook"/><category term="scraping"/><category term="ai"/><category term="llms"/><category term="training-data"/></entry><entry><title>scrapeghost</title><link href="https://simonwillison.net/2023/Mar/26/scrapeghost/#atom-tag" rel="alternate"/><published>2023-03-26T05:29:37+00:00</published><updated>2023-03-26T05:29:37+00:00</updated><id>https://simonwillison.net/2023/Mar/26/scrapeghost/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://jamesturk.github.io/scrapeghost/"&gt;scrapeghost&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Scraping is a really interesting application for large language model tools like GPT-3. James Turk’s scrapeghost is a very neatly designed entrant into this space: it’s a Python library and CLI tool that can be pointed at any URL and given a roughly defined schema (using a neat mini schema language), and will then use GPT-3 to scrape the page and try to return the results in the supplied format.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://mastodon.social/@jamesturk/110081261241625224"&gt;@jamesturk&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-3"&gt;gpt-3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-4"&gt;gpt-4&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;&lt;/p&gt;



</summary><category term="cli"/><category term="scraping"/><category term="ai"/><category term="gpt-3"/><category term="generative-ai"/><category term="gpt-4"/><category term="llms"/></entry><entry><title>Quoting Me</title><link href="https://simonwillison.net/2023/Mar/16/gpt4-scraping/#atom-tag" rel="alternate"/><published>2023-03-16T01:09:52+00:00</published><updated>2023-03-16T01:09:52+00:00</updated><id>https://simonwillison.net/2023/Mar/16/gpt4-scraping/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://fedi.simonwillison.net/@simon/110030289294541249"&gt;&lt;p&gt;I expect GPT-4 will have a LOT of applications in web scraping&lt;/p&gt;
&lt;p&gt;The increased 32,000 token limit will be large enough to send it the full DOM of most pages, serialized to HTML - then ask questions to extract data&lt;/p&gt;
&lt;p&gt;Or... take a screenshot and use the GPT4 image input mode to ask questions about the visually rendered page instead!&lt;/p&gt;
&lt;p&gt;Might need to dust off all of those old semantic web dreams, because the world's information is rapidly becoming fully machine readable&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://fedi.simonwillison.net/@simon/110030289294541249"&gt;Me&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/semanticweb"&gt;semanticweb&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-4"&gt;gpt-4&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;&lt;/p&gt;



</summary><category term="scraping"/><category term="semanticweb"/><category term="gpt-4"/><category term="llms"/></entry><entry><title>datasette-scraper walkthrough on YouTube</title><link href="https://simonwillison.net/2023/Jan/29/datasette-scraper-walkthrough/#atom-tag" rel="alternate"/><published>2023-01-29T05:23:42+00:00</published><updated>2023-01-29T05:23:42+00:00</updated><id>https://simonwillison.net/2023/Jan/29/datasette-scraper-walkthrough/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=zrSGnz7ErNI"&gt;datasette-scraper walkthrough on YouTube&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;datasette-scraper is Colin Dellow’s new plugin that turns Datasette into a powerful web scraping tool, with a web UI based on plugin-driven customizations to the Datasette interface. It’s really impressive, and this ten-minute demo shows quite how much it is capable of: it can crawl sitemaps and fetch pages, caching them (using zstandard with optional custom dictionaries for extra compression) to speed up subsequent crawls... and you can add your own plugins to extract structured data from crawled pages and save it to a separate SQLite table!&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://datasette.io/plugins/datasette-scraper"&gt;datasette-scraper&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/plugins"&gt;plugins&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/colin-dellow"&gt;colin-dellow&lt;/a&gt;&lt;/p&gt;



</summary><category term="plugins"/><category term="scraping"/><category term="datasette"/><category term="colin-dellow"/></entry><entry><title>curl-impersonate</title><link href="https://simonwillison.net/2022/Aug/10/curl-impersonate/#atom-tag" rel="alternate"/><published>2022-08-10T15:34:46+00:00</published><updated>2022-08-10T15:34:46+00:00</updated><id>https://simonwillison.net/2022/Aug/10/curl-impersonate/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/lwthiker/curl-impersonate"&gt;curl-impersonate&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;“A special build of curl that can impersonate the four major browsers: Chrome, Edge, Safari &amp;amp; Firefox. curl-impersonate is able to perform TLS and HTTP handshakes that are identical to that of a real browser.”&lt;/p&gt;

&lt;p&gt;I hadn’t realized that it’s become increasingly common for sites to use fingerprinting of TLS and HTTP handshakes to block crawlers. curl-impersonate attempts to impersonate browsers much more accurately, using tricks like compiling with Firefox’s nss TLS library and Chrome’s BoringSSL.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=32409632"&gt;Ask HN: What are the best tools for web scraping in 2022?&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/crawling"&gt;crawling&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/curl"&gt;curl&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;&lt;/p&gt;



</summary><category term="crawling"/><category term="curl"/><category term="scraping"/></entry><entry><title>Web Scraping via Javascript Runtime Heap Snapshots</title><link href="https://simonwillison.net/2022/May/3/web-scraping-via-javascript-runtime-heap-snapshots/#atom-tag" rel="alternate"/><published>2022-05-03T00:51:29+00:00</published><updated>2022-05-03T00:51:29+00:00</updated><id>https://simonwillison.net/2022/May/3/web-scraping-via-javascript-runtime-heap-snapshots/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.adriancooney.ie/blog/web-scraping-via-javascript-heap-snapshots"&gt;Web Scraping via Javascript Runtime Heap Snapshots&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This is an absolutely brilliant scraping trick. Adrian Cooney figured out a way to use Puppeteer and the Chrome DevTools protocol to take a heap snapshot of all of the JavaScript running on a web page, then recursively crawl through the heap looking for any JavaScript objects that have a specified selection of properties. This allows him to scrape data from arbitrarily complex client-side web applications. He built a JavaScript library and command line tool that implements the pattern.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/dathanvp/status/1521216735931633664"&gt;Dathan Pattishall&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;&lt;/p&gt;



</summary><category term="javascript"/><category term="scraping"/></entry><entry><title>Scraping web pages from the command line with shot-scraper</title><link href="https://simonwillison.net/2022/Mar/14/scraping-web-pages-shot-scraper/#atom-tag" rel="alternate"/><published>2022-03-14T01:29:56+00:00</published><updated>2022-03-14T01:29:56+00:00</updated><id>https://simonwillison.net/2022/Mar/14/scraping-web-pages-shot-scraper/#atom-tag</id><summary type="html">
    &lt;p&gt;I've added a powerful new capability to my &lt;strong&gt;&lt;a href="https://github.com/simonw/shot-scraper"&gt;shot-scraper&lt;/a&gt;&lt;/strong&gt; command line browser automation tool: you can now use it to load a web page in a headless browser, execute JavaScript to extract information and return that information back to the terminal as JSON.&lt;/p&gt;
&lt;p&gt;Among other things, this means you can construct Unix pipelines that incorporate a full headless web browser as part of their processing.&lt;/p&gt;
&lt;p&gt;It's also a really neat web scraping tool.&lt;/p&gt;
&lt;h4&gt;shot-scraper&lt;/h4&gt;
&lt;p&gt;I &lt;a href="https://simonwillison.net/2022/Mar/10/shot-scraper/"&gt;introduced shot-scraper&lt;/a&gt; last Thursday. It's a Python utility that wraps &lt;a href="https://playwright.dev/"&gt;Playwright&lt;/a&gt;, providing both a command line interface and a YAML-driven configuration flow for automating the process of taking screenshots of web pages.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% pip install shot-scraper
% shot-scraper https://simonwillison.net/ --height 800
Screenshot of 'https://simonwillison.net/' written to 'simonwillison-net.png'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/simonwillison-net.png" alt="Screenshot of my blog homepage" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Since Thursday &lt;code&gt;shot-scraper&lt;/code&gt; has had &lt;a href="https://github.com/simonw/shot-scraper/releases"&gt;a flurry of releases&lt;/a&gt;, adding features like &lt;a href="https://github.com/simonw/shot-scraper/blob/0.9/README.md#saving-a-web-page-to-pdf"&gt;PDF exports&lt;/a&gt;, the ability to dump the Chromium &lt;a href="https://github.com/simonw/shot-scraper/blob/0.9/README.md#dumping-out-an-accessibility-tree"&gt;accessibility tree&lt;/a&gt; and the ability to take screenshots of &lt;a href="https://github.com/simonw/shot-scraper/blob/0.9/README.md#websites-that-need-authentication"&gt;authenticated web pages&lt;/a&gt;. But the most exciting new feature landed today.&lt;/p&gt;
&lt;h4&gt;Executing JavaScript and returning the result&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/shot-scraper/releases/tag/0.9"&gt;Release 0.9&lt;/a&gt; takes the tool in a new direction. The following command will execute JavaScript on the page and return the resulting value:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% shot-scraper javascript simonwillison.net document.title
"Simon Willison\u2019s Weblog"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or you can return a JSON object:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% shot-scraper javascript https://datasette.io/ "({
  title: document.title,
  tagline: document.querySelector('.tagline').innerText
})"
{
  "title": "Datasette: An open source multi-tool for exploring and publishing data",
  "tagline": "An open source multi-tool for exploring and publishing data"
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or if you want to use functions like &lt;code&gt;setTimeout()&lt;/code&gt; - for example, if you want to insert a delay to allow an animation to finish before running the rest of your code - you can return a promise:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% shot-scraper javascript datasette.io "
new Promise(done =&amp;gt; setTimeout(
  () =&amp;gt; {
    done({
      title: document.title,
      tagline: document.querySelector('.tagline').innerText
    });
  }, 1000
));"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Errors that occur in the JavaScript turn into an exit code of 1 returned by the tool - which means you can also use this to execute simple tests in a CI flow. This example will fail a GitHub Actions workflow if the extracted page title is not the expected value:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;- &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Test page title&lt;/span&gt;
  &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;    shot-scraper javascript datasette.io "&lt;/span&gt;
&lt;span class="pl-s"&gt;      if (document.title != 'Datasette') {&lt;/span&gt;
&lt;span class="pl-s"&gt;        throw 'Wrong title detected';&lt;/span&gt;
&lt;span class="pl-s"&gt;      }"&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h4 id="scrape-a-web-page"&gt;Using this to scrape a web page&lt;/h4&gt;
&lt;p&gt;The most exciting use case for this new feature is web scraping. I'll illustrate that with an example.&lt;/p&gt;
&lt;p&gt;Posts from my blog occasionally show up on &lt;a href="https://news.ycombinator.com/"&gt;Hacker News&lt;/a&gt; - sometimes I spot them, sometimes I don't.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://news.ycombinator.com/from?site=simonwillison.net"&gt;https://news.ycombinator.com/from?site=simonwillison.net&lt;/a&gt; is a Hacker News page showing content from the specified domain. It's really useful, but it sadly isn't included in the official &lt;a href="https://github.com/HackerNews/API"&gt;Hacker News API&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/news-ycombinator-com-from.png" alt="Screenshot of the Hacker News listing for my domain" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;So... let's write a scraper for it.&lt;/p&gt;
&lt;p&gt;I started out running the Firefox developer console against that page, trying to figure out the right JavaScript to extract the data I was interested in. I came up with this:&lt;/p&gt;

&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-v"&gt;Array&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;from&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelectorAll&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'.athing'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s1"&gt;el&lt;/span&gt; &lt;span class="pl-c1"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
  &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;title&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;el&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelector&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'.titleline a'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;innerText&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;points&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;parseInt&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;el&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;nextSibling&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelector&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'.score'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;innerText&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;url&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;el&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelector&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'.titleline a'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;href&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;dt&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;el&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;nextSibling&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelector&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'.age'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;title&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;submitter&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;el&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;nextSibling&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelector&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'.hnuser'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;innerText&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;commentsUrl&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;el&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;nextSibling&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelector&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'.age a'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;href&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;id&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;commentsUrl&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;split&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'?id='&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;[&lt;/span&gt;&lt;span class="pl-c1"&gt;1&lt;/span&gt;&lt;span class="pl-kos"&gt;]&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-c"&gt;// Only posts with comments have a comments link&lt;/span&gt;
  &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;commentsLink&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-v"&gt;Array&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;from&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;
    &lt;span class="pl-s1"&gt;el&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;nextSibling&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelectorAll&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'a'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;
  &lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;filter&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;el&lt;/span&gt; &lt;span class="pl-c1"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="pl-s1"&gt;el&lt;/span&gt; &lt;span class="pl-c1"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="pl-s1"&gt;el&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;innerText&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;includes&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'comment'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;[&lt;/span&gt;&lt;span class="pl-c1"&gt;0&lt;/span&gt;&lt;span class="pl-kos"&gt;]&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-k"&gt;let&lt;/span&gt; &lt;span class="pl-s1"&gt;numComments&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;0&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;commentsLink&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
    &lt;span class="pl-s1"&gt;numComments&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;parseInt&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;commentsLink&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;innerText&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;split&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;[&lt;/span&gt;&lt;span class="pl-c1"&gt;0&lt;/span&gt;&lt;span class="pl-kos"&gt;]&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-kos"&gt;}&lt;/span&gt;
  &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;id&lt;span class="pl-kos"&gt;,&lt;/span&gt; title&lt;span class="pl-kos"&gt;,&lt;/span&gt; url&lt;span class="pl-kos"&gt;,&lt;/span&gt; dt&lt;span class="pl-kos"&gt;,&lt;/span&gt; points&lt;span class="pl-kos"&gt;,&lt;/span&gt; submitter&lt;span class="pl-kos"&gt;,&lt;/span&gt; commentsUrl&lt;span class="pl-kos"&gt;,&lt;/span&gt; numComments&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The great thing about modern JavaScript is that everything you could need to write a scraper is already there in the default environment.&lt;/p&gt;
&lt;p&gt;I'm using &lt;code&gt;document.querySelectorAll('.athing')&lt;/code&gt; to loop through each element that matches that selector.&lt;/p&gt;
&lt;p&gt;I wrap that with &lt;code&gt;Array.from(...)&lt;/code&gt;, which accepts a mapping function as its second argument. That function runs once for each matched element, extracting out the details that I need.&lt;/p&gt;
&lt;p&gt;The resulting array contains 30 items that look like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;[
  {
    &lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;30658310&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"title"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Track changes to CLI tools by recording their help output&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"url"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;https://simonwillison.net/2022/Feb/2/help-scraping/&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"dt"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;2022-03-13T05:36:13&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"submitter"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;appwiz&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"commentsUrl"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;https://news.ycombinator.com/item?id=30658310&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"numComments"&lt;/span&gt;: &lt;span class="pl-c1"&gt;19&lt;/span&gt;
  }
]&lt;/pre&gt;&lt;/div&gt;
&lt;h4&gt;Running it with shot-scraper&lt;/h4&gt;
&lt;p&gt;Now that I have a recipe for a scraper, I can run it in the terminal like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;shot-scraper javascript &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;https://news.ycombinator.com/from?site=simonwillison.net&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;Array.from(document.querySelectorAll('.athing'), el =&amp;gt; {&lt;/span&gt;
&lt;span class="pl-s"&gt;  const title = el.querySelector('.titleline a').innerText;&lt;/span&gt;
&lt;span class="pl-s"&gt;  const points = parseInt(el.nextSibling.querySelector('.score').innerText);&lt;/span&gt;
&lt;span class="pl-s"&gt;  const url = el.querySelector('.titleline a').href;&lt;/span&gt;
&lt;span class="pl-s"&gt;  const dt = el.nextSibling.querySelector('.age').title;&lt;/span&gt;
&lt;span class="pl-s"&gt;  const submitter = el.nextSibling.querySelector('.hnuser').innerText;&lt;/span&gt;
&lt;span class="pl-s"&gt;  const commentsUrl = el.nextSibling.querySelector('.age a').href;&lt;/span&gt;
&lt;span class="pl-s"&gt;  const id = commentsUrl.split('?id=')[1];&lt;/span&gt;
&lt;span class="pl-s"&gt;  // Only posts with comments have a comments link&lt;/span&gt;
&lt;span class="pl-s"&gt;  const commentsLink = Array.from(&lt;/span&gt;
&lt;span class="pl-s"&gt;    el.nextSibling.querySelectorAll('a')&lt;/span&gt;
&lt;span class="pl-s"&gt;  ).filter(el =&amp;gt; el &amp;amp;&amp;amp; el.innerText.includes('comment'))[0];&lt;/span&gt;
&lt;span class="pl-s"&gt;  let numComments = 0;&lt;/span&gt;
&lt;span class="pl-s"&gt;  if (commentsLink) {&lt;/span&gt;
&lt;span class="pl-s"&gt;    numComments = parseInt(commentsLink.innerText.split()[0]);&lt;/span&gt;
&lt;span class="pl-s"&gt;  }&lt;/span&gt;
&lt;span class="pl-s"&gt;  return {id, title, url, dt, points, submitter, commentsUrl, numComments};&lt;/span&gt;
&lt;span class="pl-s"&gt;})&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; simonwillison-net.json&lt;/pre&gt;&lt;/div&gt;  
&lt;p&gt;&lt;code&gt;simonwillison-net.json&lt;/code&gt; is now a JSON file containing the scraped data.&lt;/p&gt;
&lt;h4&gt;Running the scraper in GitHub Actions&lt;/h4&gt;
&lt;p&gt;I want to keep track of changes to this data structure over time. My preferred technique for that is something I call &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt; - the core idea is to keep the data in a Git repository and commit a new copy any time it changes. This provides a cheap and robust history of changes over time.&lt;/p&gt;
&lt;p&gt;Running the scraper in GitHub Actions means I don't need to maintain my own server to keep this running.&lt;/p&gt;
&lt;p&gt;So I built exactly that, in the &lt;a href="https://github.com/simonw/scrape-hacker-news-by-domain"&gt;simonw/scrape-hacker-news-by-domain&lt;/a&gt; repository.&lt;/p&gt;
&lt;p&gt;The GitHub Actions workflow is in &lt;a href="https://github.com/simonw/scrape-hacker-news-by-domain/blob/485841482a39869759e39f4d8dee21b9adc963d7/.github/workflows/scrape.yml"&gt;.github/workflows/scrape.yml&lt;/a&gt;. It runs the above command once an hour, then pushes a commit back to the repository should the file have any changes since last time it ran.&lt;/p&gt;
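&lt;p&gt;The commit-if-changed step at the heart of that workflow can be sketched in a few lines of shell. This is a minimal illustration of the pattern, not the actual workflow file (which wraps the same idea in GitHub Actions YAML) - the scratch repository, file contents and commit messages here are stand-ins:&lt;/p&gt;

```shell
# Minimal sketch of the git scraping "commit only if changed" step,
# demonstrated in a scratch repository. In the real workflow the JSON
# file would be written by shot-scraper before this runs.
set -euo pipefail
repo="$(mktemp -d)"; cd "$repo"
git init -q . && git config user.email demo@example.com && git config user.name demo

echo '{"id": "30658310"}' > simonwillison-net.json   # stand-in for scraper output
git add simonwillison-net.json && git commit -q -m "Initial data"

echo '{"id": "30658310"}' > simonwillison-net.json   # unchanged run: nothing to commit
git add simonwillison-net.json
git diff --cached --quiet && echo "no changes, skipping commit"

echo '{"id": "99999999"}' > simonwillison-net.json   # changed run: commit the update
git add simonwillison-net.json
# `git diff --cached --quiet` exits non-zero when staged changes exist
git diff --cached --quiet || git commit -q -m "Latest data: $(date -u +%FT%TZ)"
```

&lt;p&gt;The commit history of such a repository then doubles as a change log for the scraped data.&lt;/p&gt;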
&lt;p&gt;The &lt;a href="https://github.com/simonw/scrape-hacker-news-by-domain/commits/main/simonwillison-net.json"&gt;commit history of simonwillison-net.json&lt;/a&gt; will show me any time a new link from my site appears on Hacker News, or a comment is added.&lt;/p&gt;
&lt;p&gt;(Fun GitHub trick: add &lt;code&gt;.atom&lt;/code&gt; to the end of that URL to get &lt;a href="https://github.com/simonw/scrape-hacker-news-by-domain/commits/main/simonwillison-net.json.atom"&gt;an Atom feed of those commits&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;The whole scraper, from idea to finished implementation, took less than fifteen minutes to build and deploy.&lt;/p&gt;
&lt;p&gt;I can see myself using this technique &lt;em&gt;a lot&lt;/em&gt; in the future.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/hacker-news"&gt;hacker-news&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/shot-scraper"&gt;shot-scraper&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="cli"/><category term="github"/><category term="hacker-news"/><category term="scraping"/><category term="github-actions"/><category term="git-scraping"/><category term="shot-scraper"/></entry><entry><title>shot-scraper: automated screenshots for documentation, built on Playwright</title><link href="https://simonwillison.net/2022/Mar/10/shot-scraper/#atom-tag" rel="alternate"/><published>2022-03-10T00:13:30+00:00</published><updated>2022-03-10T00:13:30+00:00</updated><id>https://simonwillison.net/2022/Mar/10/shot-scraper/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;a href="https://github.com/simonw/shot-scraper"&gt;shot-scraper&lt;/a&gt; is a new tool that I’ve built to help automate the process of keeping screenshots up-to-date in my documentation. It also doubles as a scraping tool - hence the name - which I picked as a complement to my &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;git scraping&lt;/a&gt; and &lt;a href="https://simonwillison.net/2022/Feb/2/help-scraping/"&gt;help scraping&lt;/a&gt; techniques.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update 13th March 2022:&lt;/strong&gt; The new &lt;code&gt;shot-scraper javascript&lt;/code&gt; command can now be used to &lt;a href="https://simonwillison.net/2022/Mar/14/scraping-web-pages-shot-scraper/"&gt;scrape web pages from the command line&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update 14th October 2022:&lt;/strong&gt; &lt;a href="https://simonwillison.net/2022/Oct/14/automating-screenshots/"&gt;Automating screenshots for the Datasette documentation using shot-scraper&lt;/a&gt; offers a tutorial introduction to using the tool.&lt;/p&gt;
&lt;h4&gt;The problem&lt;/h4&gt;
&lt;p&gt;I like to include screenshots in documentation. I recently &lt;a href="https://simonwillison.net/2022/Feb/27/datasette-tutorials/"&gt;started writing end-user tutorials&lt;/a&gt; for Datasette, which are particularly image heavy (&lt;a href="https://datasette.io/tutorials/explore"&gt;for example&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;As software changes over time, screenshots get out-of-date. I don't like the idea of stale screenshots, but I also don't want to have to manually recreate them every time I make the tiniest tweak to the visual appearance of my software.&lt;/p&gt;
&lt;h4&gt;Introducing shot-scraper&lt;/h4&gt;
&lt;p&gt;&lt;code&gt;shot-scraper&lt;/code&gt; is a tool for automating this process. You can install it using &lt;code&gt;pip&lt;/code&gt; like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pip install shot-scraper
shot-scraper install
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That second &lt;code&gt;shot-scraper install&lt;/code&gt; line will install the browser it needs to do its job - more on that later.&lt;/p&gt;
&lt;p&gt;You can use it in two ways. To take a one-off screenshot, you can run it like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;shot-scraper https://simonwillison.net/ -o simonwillison.png
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or if you want to take a set of screenshots in a repeatable way, you can define them in a YAML file that looks like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;- &lt;span class="pl-ent"&gt;url&lt;/span&gt;: &lt;span class="pl-s"&gt;https://simonwillison.net/&lt;/span&gt;
  &lt;span class="pl-ent"&gt;output&lt;/span&gt;: &lt;span class="pl-s"&gt;simonwillison.png&lt;/span&gt;
- &lt;span class="pl-ent"&gt;url&lt;/span&gt;: &lt;span class="pl-s"&gt;https://www.example.com/&lt;/span&gt;
  &lt;span class="pl-ent"&gt;width&lt;/span&gt;: &lt;span class="pl-c1"&gt;400&lt;/span&gt;
  &lt;span class="pl-ent"&gt;height&lt;/span&gt;: &lt;span class="pl-c1"&gt;400&lt;/span&gt;
  &lt;span class="pl-ent"&gt;quality&lt;/span&gt;: &lt;span class="pl-c1"&gt;80&lt;/span&gt;
  &lt;span class="pl-ent"&gt;output&lt;/span&gt;: &lt;span class="pl-s"&gt;example.jpg&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And then use &lt;code&gt;shot-scraper multi&lt;/code&gt; to execute every screenshot in one go:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% shot-scraper multi shots.yml 
Screenshot of 'https://simonwillison.net/' written to 'simonwillison.png'
Screenshot of 'https://www.example.com/' written to 'example.jpg'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;a href="https://shot-scraper.datasette.io/en/stable/screenshots.html"&gt;The documentation&lt;/a&gt; describes all of the available options you can use when taking a screenshot.&lt;/p&gt;
&lt;p&gt;Each option can be provided to the &lt;code&gt;shot-scraper&lt;/code&gt; one-off tool, or can be embedded in the YAML file for use with &lt;code&gt;shot-scraper multi&lt;/code&gt;.&lt;/p&gt;
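&lt;p&gt;For example, the &lt;code&gt;width&lt;/code&gt;, &lt;code&gt;height&lt;/code&gt; and &lt;code&gt;quality&lt;/code&gt; settings from the YAML example above can be passed straight to the one-off command instead:&lt;/p&gt;

```shell
# Command-line equivalent of the example.jpg entry in the YAML file above
shot-scraper https://www.example.com/ -o example.jpg \
  --width 400 --height 400 --quality 80
```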
&lt;h4&gt;JavaScript and CSS selectors&lt;/h4&gt;
&lt;p&gt;The default behaviour for &lt;code&gt;shot-scraper&lt;/code&gt; is to take a full page screenshot, using a browser width of 1280px.&lt;/p&gt;
&lt;p&gt;For documentation screenshots you probably don't want the whole page though - you likely want to create an image of one specific part of the interface.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;--selector&lt;/code&gt; option allows you to specify an area of the page by CSS selector. The resulting image will consist just of that part of the page.&lt;/p&gt;
&lt;p&gt;What if you want to modify the page in addition to selecting a specific area?&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;--javascript&lt;/code&gt; option lets you pass in a block of JavaScript code which will be injected into the page and executed after the page has loaded, but before the screenshot is taken.&lt;/p&gt;
&lt;p&gt;The combination of these two options - also available as &lt;code&gt;javascript:&lt;/code&gt; and &lt;code&gt;selector:&lt;/code&gt; keys in the YAML file - should be flexible enough to cover the custom screenshot case for documentation.&lt;/p&gt;
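&lt;p&gt;Here's a sketch of what that combination looks like in the YAML file - the selector and the JavaScript tweak here are illustrative, not taken from a real page:&lt;/p&gt;

```yaml
- url: https://simonwillison.net/
  # Screenshot just this element (hypothetical selector)
  selector: "#bighead"
  # Runs after page load, before the screenshot is taken
  javascript: |-
    document.body.style.backgroundColor = 'pink';
  output: header-pink.png
```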
&lt;h4 id="a-complex-example"&gt;A complex example&lt;/h4&gt;
&lt;p&gt;To prove to myself that the tool works, I decided to try replicating this screenshot from &lt;a href="https://datasette.io/tutorials/explore"&gt;my tutorial&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I made the original using &lt;a href="https://cleanshot.com/"&gt;CleanShot X&lt;/a&gt;, manually adding the two pink arrows:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/select-facets-original.jpg" alt="A screenshot of a portion of the table interface in Datasette, with a menu open and two pink arrows pointing to menu items" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;This is pretty tricky!&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;It's not &lt;a href="https://congress-legislators.datasettes.com/legislators/executive_terms?start__startswith=18&amp;amp;type=prez"&gt;this whole page&lt;/a&gt;, just a subset of the page&lt;/li&gt;
&lt;li&gt;The cog menu for one of the columns is open, which means the cog icon needs to be clicked before taking the screenshot&lt;/li&gt;
&lt;li&gt;There are two pink arrows superimposed on the image&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I decided to use just one arrow for the moment, which should hopefully result in a clearer image.&lt;/p&gt;
&lt;p&gt;I started by &lt;a href="https://github.com/simonw/shot-scraper/issues/9#issuecomment-1063314278"&gt;creating my own pink arrow SVG&lt;/a&gt; using Figma:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/pink-arrow.png" alt="A big pink arrow, with a drop shadow" style="width: 200px; max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I then fiddled around in the Firefox developer console for quite a while, working out the JavaScript needed to trim the page down to the bit I wanted, open the menu and position the arrow.&lt;/p&gt;
&lt;p&gt;With the JavaScript figured out, I pasted it into a YAML file called &lt;code&gt;shot.yml&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;- &lt;span class="pl-ent"&gt;url&lt;/span&gt;: &lt;span class="pl-s"&gt;https://congress-legislators.datasettes.com/legislators/executive_terms?start__startswith=18&amp;amp;type=prez&lt;/span&gt;
  &lt;span class="pl-ent"&gt;javascript&lt;/span&gt;: &lt;span class="pl-s"&gt;|&lt;/span&gt;
&lt;span class="pl-s"&gt;    new Promise(resolve =&amp;gt; {&lt;/span&gt;
&lt;span class="pl-s"&gt;      // Run in a promise so we can sleep 1s at the end&lt;/span&gt;
&lt;span class="pl-s"&gt;      function remove(el) { el.parentNode.removeChild(el);}&lt;/span&gt;
&lt;span class="pl-s"&gt;      // Remove header and footer&lt;/span&gt;
&lt;span class="pl-s"&gt;      remove(document.querySelector('header'));&lt;/span&gt;
&lt;span class="pl-s"&gt;      remove(document.querySelector('footer'));&lt;/span&gt;
&lt;span class="pl-s"&gt;      // Remove most of the children of .content&lt;/span&gt;
&lt;span class="pl-s"&gt;      Array.from(document.querySelectorAll('.content &amp;gt; *:not(.table-wrapper,.suggested-facets)')).map(remove)&lt;/span&gt;
&lt;span class="pl-s"&gt;      // Bit of breathing room for the screenshot&lt;/span&gt;
&lt;span class="pl-s"&gt;      document.body.style.marginTop = '10px';&lt;/span&gt;
&lt;span class="pl-s"&gt;      // Add a bit of padding to .content&lt;/span&gt;
&lt;span class="pl-s"&gt;      var content = document.querySelector('.content');&lt;/span&gt;
&lt;span class="pl-s"&gt;      content.style.width = '820px';&lt;/span&gt;
&lt;span class="pl-s"&gt;      content.style.padding = '10px';&lt;/span&gt;
&lt;span class="pl-s"&gt;      // Open the menu - it's an SVG so we need to use dispatchEvent here&lt;/span&gt;
&lt;span class="pl-s"&gt;      document.querySelector('th.col-executive_id svg').dispatchEvent(new Event('click'));&lt;/span&gt;
&lt;span class="pl-s"&gt;      // Remove all but table header and first 11 rows&lt;/span&gt;
&lt;span class="pl-s"&gt;      Array.from(document.querySelectorAll('tr')).slice(12).map(remove);&lt;/span&gt;
&lt;span class="pl-s"&gt;      // Add a pink SVG arrow&lt;/span&gt;
&lt;span class="pl-s"&gt;      let div = document.createElement('div');&lt;/span&gt;
&lt;span class="pl-s"&gt;      div.innerHTML = `&amp;lt;svg width="104" height="60" fill="none" xmlns="http://www.w3.org/2000/svg"&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;        &amp;lt;g filter="url(#a)"&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;          &amp;lt;path fill-rule="evenodd" clip-rule="evenodd" d="m76.7 1 2 2 .2-.1.1.4 20 20a3.5 3.5 0 0 1 0 5l-20 20-.1.4-.3-.1-1.9 2a3.5 3.5 0 0 1-5.4-4.4l3.2-14.4H4v-12h70.6L71.3 5.4A3.5 3.5 0 0 1 76.7 1Z" fill="#FF31A0"/&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;        &amp;lt;/g&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;        &amp;lt;defs&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;          &amp;lt;filter id="a" x="0" y="0" width="104" height="59.5" filterUnits="userSpaceOnUse" color-interpolation-filters="sRGB"&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;              &amp;lt;feFlood flood-opacity="0" result="BackgroundImageFix"/&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;              &amp;lt;feColorMatrix in="SourceAlpha" values="0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 127 0" result="hardAlpha"/&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;              &amp;lt;feOffset dy="4"/&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;              &amp;lt;feGaussianBlur stdDeviation="2"/&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;              &amp;lt;feComposite in2="hardAlpha" operator="out"/&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;              &amp;lt;feColorMatrix values="0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.25 0"/&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;              &amp;lt;feBlend in2="BackgroundImageFix" result="effect1_dropShadow_2_26"/&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;              &amp;lt;feBlend in="SourceGraphic" in2="effect1_dropShadow_2_26" result="shape"/&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;          &amp;lt;/filter&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;        &amp;lt;/defs&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;      &amp;lt;/svg&amp;gt;`;&lt;/span&gt;
&lt;span class="pl-s"&gt;      let svg = div.firstChild;&lt;/span&gt;
&lt;span class="pl-s"&gt;      content.appendChild(svg);&lt;/span&gt;
&lt;span class="pl-s"&gt;      content.style.position = 'relative';&lt;/span&gt;
&lt;span class="pl-s"&gt;      svg.style.position = 'absolute';&lt;/span&gt;
&lt;span class="pl-s"&gt;      // Give the menu time to finish fading in&lt;/span&gt;
&lt;span class="pl-s"&gt;      setTimeout(() =&amp;gt; {&lt;/span&gt;
&lt;span class="pl-s"&gt;        // Position arrow pointing to the 'facet by this' menu item&lt;/span&gt;
&lt;span class="pl-s"&gt;        var pos = document.querySelector('.dropdown-facet').getBoundingClientRect();&lt;/span&gt;
&lt;span class="pl-s"&gt;        svg.style.left = (pos.left - pos.width) + 'px';&lt;/span&gt;
&lt;span class="pl-s"&gt;        svg.style.top = (pos.top - 20) + 'px';&lt;/span&gt;
&lt;span class="pl-s"&gt;        resolve();&lt;/span&gt;
&lt;span class="pl-s"&gt;      }, 1000);&lt;/span&gt;
&lt;span class="pl-s"&gt;    });&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;  &lt;span class="pl-ent"&gt;output&lt;/span&gt;: &lt;span class="pl-s"&gt;annotated-screenshot.png&lt;/span&gt;
  &lt;span class="pl-ent"&gt;selector&lt;/span&gt;: &lt;span class="pl-s"&gt;.content&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And ran this command to generate the screenshot:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;shot-scraper multi shot.yml
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The generated &lt;code&gt;annotated-screenshot.png&lt;/code&gt; image looks like this:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/annotated-screenshot.png" alt="A screenshot of the table with the menu open and a single pink arrow pointing to the 'facet by this' menu item" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I'm pretty happy with this! I think it works very well as a proof of concept for the process.&lt;/p&gt;
&lt;h4 id="how-it-works-playwright"&gt;How it works: Playwright&lt;/h4&gt;
&lt;p&gt;I built the &lt;a href="https://github.com/simonw/shot-scraper/tree/44995cd45ca6c56d34c5c3d131217f7b9170f6f7"&gt;first prototype&lt;/a&gt; of &lt;code&gt;shot-scraper&lt;/code&gt; using Puppeteer, because I had &lt;a href="https://simonwillison.net/2020/Sep/3/weeknotes-airtable-screenshots-dogsheep/"&gt;used that before&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Then I noticed that the &lt;a href="https://www.npmjs.com/package/puppeteer-cli"&gt;puppeteer-cli&lt;/a&gt; package I was using hadn't had an update in two years, which reminded me to check out Playwright.&lt;/p&gt;
&lt;p&gt;I've been looking for an excuse to learn &lt;a href="https://playwright.dev/"&gt;Playwright&lt;/a&gt; for a while now, and this project turned out to be ideal.&lt;/p&gt;
&lt;p&gt;Playwright is Microsoft's open source browser automation framework. They promote it as a testing tool, but it has plenty of applications outside of testing - screenshot automation and screen scraping being two of the most obvious.&lt;/p&gt;
&lt;p&gt;Playwright is comprehensive: it downloads its own custom browser builds, and can run tests across multiple different rendering engines.&lt;/p&gt;
&lt;p&gt;The second prototype used the &lt;a href="https://github.com/simonw/shot-scraper/tree/b3318b2f27ca1526d5a9f06de50cf9900dd4d8d0"&gt;Playwright CLI utility&lt;/a&gt; instead, &lt;a href="https://github.com/simonw/shot-scraper/blob/b3318b2f27ca1526d5a9f06de50cf9900dd4d8d0/shot_scraper/cli.py#L39-L50"&gt;executed via npx&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-s1"&gt;subprocess&lt;/span&gt;.&lt;span class="pl-en"&gt;run&lt;/span&gt;(
    [
        &lt;span class="pl-s"&gt;"npx"&lt;/span&gt;,
        &lt;span class="pl-s"&gt;"playwright"&lt;/span&gt;,
        &lt;span class="pl-s"&gt;"screenshot"&lt;/span&gt;,
        &lt;span class="pl-s"&gt;"--full-page"&lt;/span&gt;,
        &lt;span class="pl-s1"&gt;url&lt;/span&gt;,
        &lt;span class="pl-s1"&gt;output&lt;/span&gt;,
    ],
    &lt;span class="pl-s1"&gt;capture_output&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;,
)&lt;/pre&gt;
&lt;p&gt;This could take a full page screenshot, but that CLI tool wasn't flexible enough to take screenshots of specific elements. So I needed to switch to the Playwright programmatic API.&lt;/p&gt;
&lt;p&gt;I started out trying to get Python to generate and pass JavaScript to the Node.js library... and then I spotted the official &lt;a href="https://playwright.dev/python/docs/intro"&gt;Playwright for Python&lt;/a&gt; package.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pip install playwright
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It's amazing! It has the exact same functionality as the JavaScript library - the same classes, the same methods. Everything just works, in both languages.&lt;/p&gt;
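&lt;p&gt;As a rough sketch of what the Python API looks like - simplified from what &lt;code&gt;shot-scraper&lt;/code&gt; actually does, and using a made-up selector (this needs both the &lt;code&gt;playwright&lt;/code&gt; package and its browsers installed):&lt;/p&gt;

```python
from playwright.sync_api import sync_playwright

# Mirrors the JavaScript API: launch a browser, open a page, then
# screenshot a single element identified by a CSS selector.
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page(viewport={"width": 1280, "height": 720})
    page.goto("https://simonwillison.net/")
    page.locator("#bighead").screenshot(path="bighead.png")
    browser.close()
```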
&lt;p&gt;I was curious how they pulled this off, so I dug inside the &lt;code&gt;playwright&lt;/code&gt; Python package in my &lt;code&gt;site-packages&lt;/code&gt; folder... and found it bundles a full Node.js binary executable and uses it to bridge the two worlds! What a wild hack.&lt;/p&gt;
&lt;p&gt;Thanks to Playwright, the entire implementation of &lt;code&gt;shot-scraper&lt;/code&gt; is currently just &lt;a href="https://github.com/simonw/shot-scraper/blob/0.3/shot_scraper/cli.py"&gt;181 lines of Python code&lt;/a&gt; - it's all glue code tying together a &lt;a href="https://click.palletsprojects.com/"&gt;Click&lt;/a&gt; CLI interface with some code that calls Playwright to do the actual work.&lt;/p&gt;
&lt;p&gt;I couldn't be more impressed with Playwright. I'll definitely be using it for other projects - for one thing, I think I'll finally be able to add automated tests to my &lt;a href="https://datasette.io/desktop"&gt;Datasette Desktop&lt;/a&gt; Electron application.&lt;/p&gt;
&lt;h4&gt;Hooking shot-scraper up to GitHub Actions&lt;/h4&gt;
&lt;p&gt;I built &lt;code&gt;shot-scraper&lt;/code&gt; very much with GitHub Actions in mind.&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://github.com/simonw/shot-scraper-demo"&gt;shot-scraper-demo&lt;/a&gt; repository is my first live demo of the tool.&lt;/p&gt;
&lt;p&gt;Once a day, it runs &lt;a href="https://github.com/simonw/shot-scraper-demo/blob/3fdd9d3e79f95d9d396aeefd5bf65e85a7700ef4/.github/workflows/shots.yml"&gt;this shots.yml&lt;/a&gt; file, generates two screenshots and commits them back to the repository.&lt;/p&gt;
&lt;p&gt;One of them is the tutorial screenshot described above.&lt;/p&gt;
&lt;p&gt;The other is a screenshot of the list of "recently spotted owls" from &lt;a href="https://www.owlsnearme.com/?place=127871"&gt;this page&lt;/a&gt; on &lt;a href="https://www.owlsnearme.com/"&gt;owlsnearme.com&lt;/a&gt;. I wanted a page that would change on an occasional basis, to demonstrate GitHub's neat image diffing interface.&lt;/p&gt;
&lt;p&gt;I may need to change that demo though! That page includes "spotted 5 hours ago" text, which means that there's almost always a tiny pixel difference, &lt;a href="https://github.com/simonw/shot-scraper-demo/commit/bc86510f49b6f8d6728c9f1880b999c83361dd5a#diff-897c3444fbbb2033cbba5840da4994d01c3f396e0cdf4b0613d7f410db9887e0"&gt;like this one&lt;/a&gt; (use the "swipe" comparison tool to watch 6 hours ago change to 7 hours ago under the top left photo).&lt;/p&gt;
&lt;p&gt;Storing image files that change frequently in a free repository on GitHub feels rude to me, so please use this tool cautiously there!&lt;/p&gt;
&lt;h4&gt;What's next?&lt;/h4&gt;
&lt;p&gt;I had ambitious plans to add utilities to the tool that would &lt;a href="https://github.com/simonw/shot-scraper/issues/9"&gt;help with annotations&lt;/a&gt;, such as adding pink arrows and drawing circles around different elements on the page.&lt;/p&gt;
&lt;p&gt;I've shelved those plans for the moment: as the demo above shows, the JavaScript hook is good enough. I may revisit this later once common patterns have started to emerge.&lt;/p&gt;
&lt;p&gt;So really, my next step is to start using this tool for my own projects - to generate screenshots for my documentation.&lt;/p&gt;
&lt;p&gt;I'm also very interested to see what kinds of things other people use this for.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/documentation"&gt;documentation&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/puppeteer"&gt;puppeteer&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/playwright"&gt;playwright&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/shot-scraper"&gt;shot-scraper&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="cli"/><category term="documentation"/><category term="projects"/><category term="scraping"/><category term="github-actions"/><category term="git-scraping"/><category term="puppeteer"/><category term="playwright"/><category term="shot-scraper"/></entry><entry><title>Help scraping: track changes to CLI tools by recording their --help using Git</title><link href="https://simonwillison.net/2022/Feb/2/help-scraping/#atom-tag" rel="alternate"/><published>2022-02-02T23:46:35+00:00</published><updated>2022-02-02T23:46:35+00:00</updated><id>https://simonwillison.net/2022/Feb/2/help-scraping/#atom-tag</id><summary type="html">
    &lt;p&gt;I've been experimenting with a new variant of &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt; this week which I'm calling &lt;strong&gt;Help scraping&lt;/strong&gt;. The key idea is to track changes made to CLI tools over time by recording the output of their &lt;code&gt;--help&lt;/code&gt; commands in a Git repository.&lt;/p&gt;
&lt;p&gt;My new &lt;a href="https://github.com/simonw/help-scraper"&gt;help-scraper GitHub repository&lt;/a&gt; is my first implementation of this pattern.&lt;/p&gt;
&lt;p&gt;It uses &lt;a href="https://github.com/simonw/help-scraper/blob/cd18c5d7c1ac7c3851823dcabaa21ee920d73720/.github/workflows/scrape.yml"&gt;this GitHub Actions workflow&lt;/a&gt; to record the &lt;code&gt;--help&lt;/code&gt; output for the Amazon Web Services &lt;code&gt;aws&lt;/code&gt; CLI tool, and also for the &lt;code&gt;flyctl&lt;/code&gt; tool maintained by the &lt;a href="https://fly.io/"&gt;Fly.io&lt;/a&gt; hosting platform.&lt;/p&gt;
&lt;p&gt;The workflow runs once a day. It loops through every available AWS command (using &lt;a href="https://github.com/simonw/help-scraper/blob/cd18c5d7c1ac7c3851823dcabaa21ee920d73720/aws_commands.py"&gt;this script&lt;/a&gt;) and records the output of that command's CLI help option to a &lt;code&gt;.txt&lt;/code&gt; file in the repository - then commits the result at the end.&lt;/p&gt;
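&lt;p&gt;The core pattern is very simple - something like this sketch, which uses the Python interpreter itself as a stand-in command (the real workflow shells out to each &lt;code&gt;aws&lt;/code&gt; and &lt;code&gt;flyctl&lt;/code&gt; subcommand instead):&lt;/p&gt;

```python
import subprocess
import sys
from pathlib import Path

# Capture a command's help output and save it to a text file - committing
# these files to Git produces a diffable history of changes to the tool.
out_dir = Path("help")
out_dir.mkdir(exist_ok=True)
result = subprocess.run(
    [sys.executable, "--help"], capture_output=True, text=True
)
(out_dir / "python.txt").write_text(result.stdout)
```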
&lt;p&gt;The result is a version history of changes made to those help files. It's essentially a much more detailed version of a changelog - capturing all sorts of details that might not be reflected in the official release notes for the tool.&lt;/p&gt;
&lt;p&gt;Here's an example. This morning, AWS released version 1.22.47 of their CLI helper tool. They release new versions on an almost daily basis.&lt;/p&gt;
&lt;p&gt;Here are &lt;a href="https://github.com/aws/aws-cli/blob/develop/CHANGELOG.rst#12247"&gt;the official release notes&lt;/a&gt; - 12 bullet points, spanning 12 different AWS services.&lt;/p&gt;
&lt;p&gt;My help scraper caught the details of the release in &lt;a href="https://github.com/simonw/help-scraper/commit/cd18c5d7c1ac7c3851823dcabaa21ee920d73720#diff-c2559859df8912eb13a6017d81019bf5452cead3e6495744e2d0c82202bf33ac"&gt;this commit&lt;/a&gt; - 89 changed files with 3,543 additions and 1,324 deletions. It tells the story of what's changed in a whole lot more detail.&lt;/p&gt;
&lt;p&gt;The AWS CLI tool is &lt;em&gt;enormous&lt;/em&gt;. Running &lt;code&gt;find aws -name '*.txt' | wc -l&lt;/code&gt; in that repository counts help pages for 11,401 individual commands - or 11,390 if you checkout the previous version, showing that there were 11 commands added just in this morning's new release.&lt;/p&gt;
&lt;p&gt;There are plenty of other ways of tracking changes made to AWS. I've previously kept an eye on &lt;a href="https://github.com/boto/botocore/commits/develop"&gt;the botocore GitHub history&lt;/a&gt;, which exposes changes to the underlying JSON - and there are projects like &lt;a href="https://awsapichanges.info/"&gt;awschanges.info&lt;/a&gt; which try to turn those sources of data into something more readable.&lt;/p&gt;
&lt;p&gt;But I think there's something pretty neat about being able to track changes in detail for any CLI tool that offers help output, independent of the official release notes for that tool. Not everyone writes release notes &lt;a href="https://simonwillison.net/2022/Jan/31/release-notes/"&gt;with the detail I like from them&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;I implemented this for &lt;code&gt;flyctl&lt;/code&gt; first, because I wanted to see what changes were being made that might impact my &lt;a href="https://datasette.io/plugins/datasette-publish-fly"&gt;datasette-publish-fly&lt;/a&gt; plugin which shells out to that tool. Then I realized it could be applied to AWS as well.&lt;/p&gt;
&lt;h4&gt;Help scraping my own projects&lt;/h4&gt;
&lt;p&gt;I got the initial idea for this technique from a change I made to my &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; and &lt;a href="https://sqlite-utils.datasette.io"&gt;sqlite-utils&lt;/a&gt; projects a few weeks ago.&lt;/p&gt;
&lt;p&gt;Both tools offer CLI commands with &lt;code&gt;--help&lt;/code&gt; output - but I kept on forgetting to update the help, partly because there was no easy way to see its output online without running the tools themselves.&lt;/p&gt;
&lt;p&gt;So, I added documentation pages that list the output of &lt;code&gt;--help&lt;/code&gt; for each of the CLI commands, generated using the &lt;a href="https://nedbatchelder.com/code/cog"&gt;Cog&lt;/a&gt; file generation tool:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://sqlite-utils.datasette.io/en/stable/cli-reference.html"&gt;sqlite-utils CLI reference&lt;/a&gt; (39 commands!)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.datasette.io/en/stable/cli-reference.html"&gt;datasette CLI reference&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Having added these pages, I realized that the Git commit history of those generated documentation pages could double up as a history of changes I made to the &lt;code&gt;--help&lt;/code&gt; output - here's &lt;a href="https://github.com/simonw/sqlite-utils/commits/main/docs/cli-reference.rst"&gt;that history for sqlite-utils&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It was a short jump from that to the idea of combining it with &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt; to generate history for other tools.&lt;/p&gt;
&lt;h4&gt;Bonus trick: GraphQL schema scraping&lt;/h4&gt;
&lt;p&gt;I've started making selective use of the &lt;a href="https://fly.io/"&gt;Fly.io&lt;/a&gt; GraphQL API as part of &lt;a href="https://github.com/simonw/datasette-publish-fly"&gt;my plugin&lt;/a&gt; for publishing Datasette instances to that platform.&lt;/p&gt;
&lt;p&gt;Their GraphQL API is openly available, but it's not extensively documented - presumably because they reserve the right to make breaking changes to it at any time. I collected some notes on it in this TIL: &lt;a href="https://til.simonwillison.net/fly/undocumented-graphql-api"&gt;Using the undocumented Fly GraphQL API&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This gave me an idea: could I track changes made to their GraphQL schema using the same scraping trick?&lt;/p&gt;
&lt;p&gt;It turns out I can! There's an NPM package called &lt;a href="https://www.npmjs.com/package/get-graphql-schema"&gt;get-graphql-schema&lt;/a&gt; which can extract the GraphQL schema from any GraphQL server and write it out to disk:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;npx get-graphql-schema https://api.fly.io/graphql &amp;gt; /tmp/fly.graphql
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I've added that to my &lt;code&gt;help-scraper&lt;/code&gt; repository too - so now I have a &lt;a href="https://github.com/simonw/help-scraper/commits/main/flyctl/fly.graphql"&gt;commit history&lt;/a&gt; of the changes they are making there. Here's &lt;a href="https://github.com/simonw/help-scraper/commit/f11072ff23f0d654395be7c2b1e98e84dbbc26a3#diff-c9cd49cf2aa3b983457e2812ba9313cc254aba74aaba9a36d56c867e32221589"&gt;an example&lt;/a&gt; from this morning.&lt;/p&gt;
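&lt;p&gt;The workflow addition is a couple of steps along these lines (a sketch - step names here are illustrative, see the repository's &lt;code&gt;scrape.yml&lt;/code&gt; for the real version):&lt;/p&gt;

```yaml
- name: Fetch Fly GraphQL schema
  run: npx get-graphql-schema https://api.fly.io/graphql > flyctl/fly.graphql
- name: Commit and push if the schema changed
  run: |-
    git config user.name "Automated"
    git config user.email "actions@users.noreply.github.com"
    git add -A
    git commit -m "Updated Fly GraphQL schema" || exit 0
    git push
```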
&lt;h3&gt;Other weeknotes&lt;/h3&gt;
&lt;p&gt;I've decided to start setting goals on a monthly basis. My goal for February is to finally ship Datasette 1.0! I'm trying to make at least one commit every day that takes me closer to &lt;a href="https://github.com/simonw/datasette/milestone/7"&gt;that milestone&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This week I did &lt;a href="https://github.com/simonw/datasette/issues/1533"&gt;a bunch of work&lt;/a&gt; adding a &lt;code&gt;Link: https://...; rel="alternate"; type="application/datasette+json"&lt;/code&gt; HTTP header to a bunch of different pages in the Datasette interface, to support discovery of the JSON version of a page based on a URL to the human-readable version.&lt;/p&gt;
&lt;p&gt;(I had originally planned &lt;a href="https://github.com/simonw/datasette/issues/1534"&gt;to also support&lt;/a&gt; &lt;code&gt;Accept: application/json&lt;/code&gt; request headers for this, but I've been put off that idea by the discovery that Cloudflare &lt;a href="https://twitter.com/simonw/status/1478470282931163137"&gt;deliberately ignores&lt;/a&gt; the &lt;code&gt;Vary: Accept&lt;/code&gt; header.)&lt;/p&gt;
&lt;p&gt;Unrelated to Datasette: I also started a new Twitter thread, gathering &lt;a href="https://twitter.com/simonw/status/1487673496977113088"&gt;behind the scenes material from the movie the Mitchells vs the Machines&lt;/a&gt;. There's been a flurry of great material shared recently by the creative team, presumably as part of the run-up to awards season - and I've been enjoying trying to tie it all together in a thread.&lt;/p&gt;
&lt;p&gt;The last time I did this &lt;a href="https://twitter.com/simonw/status/1077737871602110466"&gt;was for Into the Spider-Verse&lt;/a&gt; (from the same studio) and that thread ended up running for more than a year!&lt;/p&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/pytest/only-run-integration"&gt;Opt-in integration tests with pytest --integration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/graphql/get-graphql-schema"&gt;get-graphql-schema&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/github-actions/python-3-11"&gt;Testing against Python 3.11 preview using GitHub Actions&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/graphql"&gt;graphql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fly"&gt;fly&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="cli"/><category term="git"/><category term="github"/><category term="projects"/><category term="scraping"/><category term="graphql"/><category term="weeknotes"/><category term="github-actions"/><category term="git-scraping"/><category term="fly"/></entry><entry><title>git-history: a tool for analyzing scraped data collected using Git and SQLite</title><link href="https://simonwillison.net/2021/Dec/7/git-history/#atom-tag" rel="alternate"/><published>2021-12-07T22:32:55+00:00</published><updated>2021-12-07T22:32:55+00:00</updated><id>https://simonwillison.net/2021/Dec/7/git-history/#atom-tag</id><summary type="html">
    &lt;p&gt;I described &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt; last year: a technique for writing scrapers where you periodically snapshot a source of data to a Git repository in order to record changes to that source over time.&lt;/p&gt;
&lt;p&gt;The open challenge was how to analyze that data once it was collected. &lt;a href="https://datasette.io/tools/git-history"&gt;git-history&lt;/a&gt; is my new tool designed to tackle that problem.&lt;/p&gt;
&lt;h4&gt;Git scraping, a refresher&lt;/h4&gt;
&lt;p&gt;A neat thing about scraping to a Git repository is that the scrapers themselves can be really simple. I demonstrated how to run scrapers for free using GitHub Actions in this &lt;a href="https://simonwillison.net/2021/Mar/5/git-scraping/"&gt;five minute lightning talk&lt;/a&gt; back in March.&lt;/p&gt;
&lt;p&gt;Here's a concrete example: California's state fire department, Cal Fire, maintain an incident map at &lt;a href="https://www.fire.ca.gov/incidents/"&gt;fire.ca.gov/incidents&lt;/a&gt; showing the status of current large fires in the state.&lt;/p&gt;
&lt;p&gt;I found the underlying data here:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then I built &lt;a href="https://github.com/simonw/ca-fires-history/blob/main/.github/workflows/scrape.yml"&gt;a simple scraper&lt;/a&gt; that grabs a copy of that every 20 minutes and commits it to Git. I've been running that for 14 months now, and it's collected &lt;a href="https://github.com/simonw/ca-fires-history"&gt;1,559 commits&lt;/a&gt;!&lt;/p&gt;
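&lt;p&gt;The scraper itself is little more than a scheduled workflow that fetches the JSON and commits it if anything changed - a simplified sketch (the schedule and file names here are illustrative, see the linked &lt;code&gt;scrape.yml&lt;/code&gt; for the real version):&lt;/p&gt;

```yaml
name: Scrape latest incident data
on:
  schedule:
    - cron: "6,26,46 * * * *"  # roughly every 20 minutes
jobs:
  scheduled:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Fetch latest data
        run: |-
          curl https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents \
            | jq . > incidents.json
      - name: Commit and push if it changed
        run: |-
          git config user.name "Automated"
          git config user.email "actions@users.noreply.github.com"
          git add -A
          git commit -m "Latest data: $(date -u)" || exit 0
          git push
```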
&lt;p&gt;The thing that excites me most about Git scraping is that it can create truly unique datasets. It's common for organizations not to keep detailed archives of what changed and where, so by scraping their data into a Git repository you can often end up with a more detailed history than they maintain themselves.&lt;/p&gt;
&lt;p&gt;There's one big challenge though: having collected that data, how can you best analyze it? Reading through thousands of commit differences and eyeballing changes to JSON or CSV files isn't a great way of finding the interesting stories that have been captured.&lt;/p&gt;
&lt;h4&gt;git-history&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://datasette.io/tools/git-history"&gt;git-history&lt;/a&gt; is the new CLI tool I've built to answer that question. It reads through the entire history of a file and generates a SQLite database reflecting changes to that file over time. You can then use &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; to explore the resulting data.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://git-history-demos.datasette.io/ca-fires"&gt;an example database&lt;/a&gt; created by running the tool against my &lt;code&gt;ca-fires-history&lt;/code&gt; repository. I created the SQLite database by running this in the repository directory:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;git-history file ca-fires.db incidents.json \
  --namespace incident \
  --id UniqueId \
  --convert &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;json.loads(content)["Incidents"]&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/ca-fires-progress.gif" alt="Animated gif showing the progress bar" style="max-width:100%; border-top: 5px solid black;" /&gt;&lt;/p&gt;
&lt;p&gt;In this example we are processing the history of a single file called &lt;code&gt;incidents.json&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;We use the &lt;code&gt;UniqueId&lt;/code&gt; column to identify which records have changed over time, as opposed to being newly created.&lt;/p&gt;
&lt;p&gt;Specifying &lt;code&gt;--namespace incident&lt;/code&gt; causes the created database tables to be called &lt;code&gt;incident&lt;/code&gt; and &lt;code&gt;incident_version&lt;/code&gt; rather than the default of &lt;code&gt;item&lt;/code&gt; and &lt;code&gt;item_version&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;And we have a fragment of Python code that knows how to turn each version stored in that commit history into a list of objects compatible with the tool, see &lt;a href="https://github.com/simonw/git-history/blob/0.6/README.md#custom-conversions-using---convert"&gt;--convert in the documentation&lt;/a&gt; for details.&lt;/p&gt;
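&lt;p&gt;For Cal Fire the conversion really is that one expression: the endpoint wraps its list in an &lt;code&gt;{"Incidents": [...]}&lt;/code&gt; envelope, and the code simply unwraps it. As standalone Python (the sample payload here is invented for illustration):&lt;/p&gt;

```python
import json

def convert(content):
    # Unwrap the {"Incidents": [...]} envelope so git-history sees
    # a plain list of objects, each carrying its UniqueId
    return json.loads(content)["Incidents"]

sample = '{"Incidents": [{"UniqueId": "abc", "Name": "Example Fire"}]}'
```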
&lt;p&gt;Let's use the database to answer some questions about fires in California over the past 14 months.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;incident&lt;/code&gt; table contains a copy of the latest record for every incident. We can use that to see &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident"&gt;a map of every fire&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/ca-fires-map.png" alt="A map showing 250 fires in California" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;This uses the &lt;a href="https://datasette.io/plugins/datasette-cluster-map"&gt;datasette-cluster-map&lt;/a&gt; plugin, which draws a map of every row with a valid latitude and longitude column.&lt;/p&gt;
&lt;p&gt;Where things get interesting is the &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version"&gt;incident_version&lt;/a&gt; table. This is where changes between different scraped versions of each item are recorded.&lt;/p&gt;
&lt;p&gt;Those 250 fires have 2,060 recorded versions. If we &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item"&gt;facet by _item&lt;/a&gt; we can see which fires had the most versions recorded. Here are the top ten:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=174"&gt;Dixie Fire&lt;/a&gt; 268&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=209"&gt;Caldor Fire&lt;/a&gt; 153&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=197"&gt;Monument Fire&lt;/a&gt; 65&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=1"&gt;August Complex (includes Doe Fire)&lt;/a&gt; 64&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=2"&gt;Creek Fire&lt;/a&gt; 56&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=213"&gt;French Fire&lt;/a&gt; 53&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=32"&gt;Silverado Fire&lt;/a&gt; 52&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=240"&gt;Fawn Fire&lt;/a&gt; 45&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=34"&gt;Blue Ridge Fire&lt;/a&gt; 39&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=190"&gt;McFarland Fire&lt;/a&gt; 34&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This looks about right - the larger the number of versions the longer the fire must have been burning. The Dixie Fire &lt;a href="https://en.wikipedia.org/wiki/Dixie_Fire"&gt;has its own Wikipedia page&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;Clicking through to &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=174"&gt;the Dixie Fire&lt;/a&gt; lands us on a page showing every "version" that we captured, ordered by version number.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;git-history&lt;/code&gt; only writes values to this table that have changed since the previous version. This means you can glance at the table grid and get a feel for which pieces of information were updated over time:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/ca-fires-incident-versions.png" alt="The table showing changes to that fire over time" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;ConditionStatement&lt;/code&gt; is a text description that changes frequently, but the other two interesting columns look to be &lt;code&gt;AcresBurned&lt;/code&gt; and &lt;code&gt;PercentContained&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;That &lt;code&gt;_commit&lt;/code&gt; column is a foreign key to the &lt;a href="https://git-history-demos.datasette.io/ca-fires/commits"&gt;commits&lt;/a&gt; table, which records the commits that have been processed by the tool - mainly so that when you run it a second time it can pick up where it finished last time.&lt;/p&gt;
&lt;p&gt;We can join against &lt;code&gt;commits&lt;/code&gt; to see the date that each version was created. Or we can use the &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version_detail"&gt;incident_version_detail&lt;/a&gt; view which performs that join for us.&lt;/p&gt;
&lt;p&gt;Using that view, we can filter for just rows where &lt;code&gt;_item&lt;/code&gt; is 174 and &lt;code&gt;AcresBurned&lt;/code&gt; is not blank, then use the &lt;a href="https://datasette.io/plugins/datasette-vega"&gt;datasette-vega&lt;/a&gt; plugin to visualize the &lt;code&gt;_commit_at&lt;/code&gt; date column against the &lt;code&gt;AcresBurned&lt;/code&gt; numeric column... and we get a graph of &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version_detail?_item__exact=174&amp;amp;AcresBurned__notblank=1#g.mark=line&amp;amp;g.x_column=_commit_at&amp;amp;g.x_type=temporal&amp;amp;g.y_column=AcresBurned&amp;amp;g.y_type=quantitative"&gt;the growth of the Dixie Fire over time&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/ca-fires-chart.png" alt="The chart plugin showing a line chart" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;To review: we started out with a GitHub Actions scheduled workflow grabbing a copy of a JSON API endpoint every 20 minutes. Thanks to &lt;code&gt;git-history&lt;/code&gt;, Datasette and &lt;code&gt;datasette-vega&lt;/code&gt; we now have a chart showing the growth of the longest-lived California wildfire of the last 14 months over time.&lt;/p&gt;
&lt;h4&gt;A note on schema design&lt;/h4&gt;
&lt;p&gt;One of the hardest problems in designing &lt;code&gt;git-history&lt;/code&gt; was deciding on an appropriate schema for storing version changes over time.&lt;/p&gt;
&lt;p&gt;I ended up with the following (edited for clarity):&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;CREATE TABLE [commits] (
   [id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [hash] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [commit_at] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
);
CREATE TABLE [item] (
   [_id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [_item_id] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [IncidentID] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [Location] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [Type] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [_commit] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;
);
CREATE TABLE [item_version] (
   [_id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [_item] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [item]([_id]),
   [_version] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;,
   [_commit] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [commits]([id]),
   [IncidentID] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [Location] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [Type] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
);
CREATE TABLE [columns] (
   [id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [namespace] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [namespaces]([id]),
   [name] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
);
CREATE TABLE [item_changed] (
   [item_version] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [item_version]([_id]),
   [column] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [columns]([id]),
   &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt; ([item_version], [column])
);&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;As shown earlier, records in the &lt;code&gt;item_version&lt;/code&gt; table represent snapshots over time - but to save on database space and provide a neater interface for browsing versions, they only record columns that had changed since their previous version. Any unchanged columns are stored as &lt;code&gt;null&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;There's one catch with this schema: what do we do if a new version of an item sets one of the columns to &lt;code&gt;null&lt;/code&gt;? How can we tell the difference between that and a column that didn't change?&lt;/p&gt;
&lt;p&gt;I ended up solving that with an &lt;code&gt;item_changed&lt;/code&gt; many-to-many table, which uses pairs of integers (hopefully taking up as little space as possible) to record exactly which columns were modified in which &lt;code&gt;item_version&lt;/code&gt; records.&lt;/p&gt;
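&lt;p&gt;With that table in place a &lt;code&gt;null&lt;/code&gt; is no longer ambiguous: a column was genuinely set to &lt;code&gt;null&lt;/code&gt; only if it is listed in &lt;code&gt;item_changed&lt;/code&gt; for that version, otherwise the previous value still applies. A sketch of that reconstruction logic, in memory rather than SQL:&lt;/p&gt;

```python
def resolve_version(previous, version_row, changed):
    # Rebuild the full record for a version: start from the prior
    # state and apply only the columns listed in item_changed -
    # a null among those really means "changed to null"
    result = dict(previous)
    for column in changed:
        result[column] = version_row.get(column)
    return result
```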
&lt;p&gt;The &lt;code&gt;item_version_detail&lt;/code&gt; view displays columns from that many-to-many table as JSON - here's &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version_detail?_version__gt=1&amp;amp;_col=_changed_columns&amp;amp;_col=_item&amp;amp;_col=_version"&gt;a filtered example&lt;/a&gt; showing which columns were changed in which versions of which items:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/ca-fires-changed-columns.png" alt="This table shows a JSON list of column names against items and versions" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://git-history-demos.datasette.io/ca-fires?sql=select+columns.name%2C+count%28*%29%0D%0Afrom+incident_changed%0D%0A++join+incident_version+on+incident_changed.item_version+%3D+incident_version._id%0D%0A++join+columns+on+incident_changed.column+%3D+columns.id%0D%0Awhere+incident_version._version+%3E+1%0D%0Agroup+by+columns.name%0D%0Aorder+by+count%28*%29+desc"&gt;a SQL query&lt;/a&gt; that shows, for &lt;code&gt;ca-fires&lt;/code&gt;, which columns were updated most often:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;select&lt;/span&gt; &lt;span class="pl-c1"&gt;columns&lt;/span&gt;.&lt;span class="pl-c1"&gt;name&lt;/span&gt;, &lt;span class="pl-c1"&gt;count&lt;/span&gt;(&lt;span class="pl-k"&gt;*&lt;/span&gt;)
&lt;span class="pl-k"&gt;from&lt;/span&gt; incident_changed
  &lt;span class="pl-k"&gt;join&lt;/span&gt; incident_version &lt;span class="pl-k"&gt;on&lt;/span&gt; &lt;span class="pl-c1"&gt;incident_changed&lt;/span&gt;.&lt;span class="pl-c1"&gt;item_version&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;incident_version&lt;/span&gt;.&lt;span class="pl-c1"&gt;_id&lt;/span&gt;
  &lt;span class="pl-k"&gt;join&lt;/span&gt; columns &lt;span class="pl-k"&gt;on&lt;/span&gt; &lt;span class="pl-c1"&gt;incident_changed&lt;/span&gt;.&lt;span class="pl-c1"&gt;column&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;columns&lt;/span&gt;.&lt;span class="pl-c1"&gt;id&lt;/span&gt;
&lt;span class="pl-k"&gt;where&lt;/span&gt; &lt;span class="pl-c1"&gt;incident_version&lt;/span&gt;.&lt;span class="pl-c1"&gt;_version&lt;/span&gt; &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;
&lt;span class="pl-k"&gt;group by&lt;/span&gt; &lt;span class="pl-c1"&gt;columns&lt;/span&gt;.&lt;span class="pl-c1"&gt;name&lt;/span&gt;
&lt;span class="pl-k"&gt;order by&lt;/span&gt; &lt;span class="pl-c1"&gt;count&lt;/span&gt;(&lt;span class="pl-k"&gt;*&lt;/span&gt;) &lt;span class="pl-k"&gt;desc&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;Updated: 1785&lt;/li&gt;
&lt;li&gt;PercentContained: 740&lt;/li&gt;
&lt;li&gt;ConditionStatement: 734&lt;/li&gt;
&lt;li&gt;AcresBurned: 616&lt;/li&gt;
&lt;li&gt;Started: 327&lt;/li&gt;
&lt;li&gt;PersonnelInvolved: 286&lt;/li&gt;
&lt;li&gt;Engines: 274&lt;/li&gt;
&lt;li&gt;CrewsInvolved: 256&lt;/li&gt;
&lt;li&gt;WaterTenders: 225&lt;/li&gt;
&lt;li&gt;Dozers: 211&lt;/li&gt;
&lt;li&gt;AirTankers: 181&lt;/li&gt;
&lt;li&gt;StructuresDestroyed: 125&lt;/li&gt;
&lt;li&gt;Helicopters: 122&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Helicopters are exciting! Let's find all of the fires which had at least one record where the number of helicopters changed (after the first version). We'll use a nested SQL query:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;select&lt;/span&gt; &lt;span class="pl-k"&gt;*&lt;/span&gt; &lt;span class="pl-k"&gt;from&lt;/span&gt; incident
&lt;span class="pl-k"&gt;where&lt;/span&gt; _id &lt;span class="pl-k"&gt;in&lt;/span&gt; (
  &lt;span class="pl-k"&gt;select&lt;/span&gt; _item &lt;span class="pl-k"&gt;from&lt;/span&gt; incident_version
  &lt;span class="pl-k"&gt;where&lt;/span&gt; _id &lt;span class="pl-k"&gt;in&lt;/span&gt; (
    &lt;span class="pl-k"&gt;select&lt;/span&gt; item_version &lt;span class="pl-k"&gt;from&lt;/span&gt; incident_changed &lt;span class="pl-k"&gt;where&lt;/span&gt; column &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;15&lt;/span&gt;
  )
  &lt;span class="pl-k"&gt;and&lt;/span&gt; _version &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;
)&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;That returned 19 fires that were significant enough to involve helicopters - &lt;a href="https://git-history-demos.datasette.io/ca-fires?sql=select+*+from+incident%0D%0Awhere+_id+in+%28%0D%0A++select+_item+from+incident_version%0D%0A++where+_id+in+%28%0D%0A++++select+item_version+from+incident_changed+where+column+%3D+15%0D%0A++%29%0D%0A++and+_version+%3E+1%0D%0A%29"&gt;here they are on a map&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/ca-fire-helicopter-map.png" alt="A map of 19 fires that involved helicopters" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;h4&gt;Advanced usage of --convert&lt;/h4&gt;
&lt;p&gt;Drew Breunig has been running a Git scraper for the past 8 months in &lt;a href="https://github.com/dbreunig/511-events-history"&gt;dbreunig/511-events-history&lt;/a&gt; against &lt;a href="https://511.org/"&gt;511.org&lt;/a&gt;, a site showing traffic incidents in the San Francisco Bay Area. I loaded his data into this example &lt;a href="https://git-history-demos.datasette.io/sf-bay-511"&gt;sf-bay-511 database&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;sf-bay-511&lt;/code&gt; example is useful for digging more into the &lt;code&gt;--convert&lt;/code&gt; option to &lt;code&gt;git-history&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;git-history&lt;/code&gt; requires recorded data to be in a specific shape: it needs a JSON list of JSON objects, where each object has a column that can be treated as a unique ID for purposes of tracking changes to that specific record over time.&lt;/p&gt;
&lt;p&gt;The ideal tracked JSON file would look something like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;[
  {
    &lt;span class="pl-ent"&gt;"IncidentID"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;abc123&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"Location"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Corner of 4th and Vermont&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"Type"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;fire&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
  },
  {
    &lt;span class="pl-ent"&gt;"IncidentID"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;cde448&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"Location"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;555 West Example Drive&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"Type"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;medical&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
  }
]&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It's common for data that has been scraped to not fit this ideal shape.&lt;/p&gt;
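&lt;p&gt;A quick way to check whether a scraped file already fits is to confirm it parses to a JSON list of objects that all carry a usable ID. This helper is my own illustration, not part of git-history:&lt;/p&gt;

```python
import json

def fits_ideal_shape(content, id_column):
    # git-history wants a JSON list of objects, each with a
    # value in the chosen ID column
    data = json.loads(content)
    if not isinstance(data, list):
        return False
    return all(
        isinstance(item, dict) and item.get(id_column)
        for item in data
    )
```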
&lt;p&gt;The &lt;code&gt;511.org&lt;/code&gt; JSON feed &lt;a href="https://backend-prod.511.org/api-proxy/api/v1/traffic/events/?extended=true"&gt;can be found here&lt;/a&gt; - it's a pretty complicated nested set of objects, and there's a bunch of data in there that's quite noisy without adding much to the overall analysis - things like an &lt;code&gt;updated&lt;/code&gt; timestamp field that changes in every version even if there are no changes, or a deeply nested &lt;code&gt;"extension"&lt;/code&gt; object full of duplicate data.&lt;/p&gt;
&lt;p&gt;I wrote a snippet of Python to transform each of those recorded snapshots into a simpler structure, and then passed that code to the script's &lt;code&gt;--convert&lt;/code&gt; option:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
git-history file sf-bay-511.db 511-events-history/events.json \
  --repo 511-events-history \
  --id id \
  --convert '
data = json.loads(content)
if data.get("error"):
    # {"code": 500, "error": "Error accessing remote data..."}
    return
for event in data["Events"]:
    event["id"] = event["extension"]["event-reference"]["event-identifier"]
    # Remove noisy updated timestamp
    del event["updated"]
    # Drop extension block entirely
    del event["extension"]
    # "schedule" block is noisy but not interesting
    del event["schedule"]
    # Flatten nested subtypes
    event["event_subtypes"] = event["event_subtypes"]["event_subtype"]
    if not isinstance(event["event_subtypes"], list):
        event["event_subtypes"] = [event["event_subtypes"]]
    yield event
'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The single-quoted string passed to &lt;code&gt;--convert&lt;/code&gt; is compiled into a Python function and run against each Git version in turn. My code loops through the nested &lt;code&gt;Events&lt;/code&gt; list, modifying each record and then outputting them as an iterable sequence using &lt;code&gt;yield&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;A few of the records in the history were server 500 errors, so the code block knows how to identify and skip those as well.&lt;/p&gt;
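&lt;p&gt;The compile-a-string-into-a-function trick looks roughly like this - a simplified sketch of the pattern, not git-history's exact implementation:&lt;/p&gt;

```python
import json

def make_convert_function(code):
    # Indent the user's code into a function body; because the body
    # can use yield, calling the function returns a generator
    body = "\n".join("    " + line for line in code.splitlines())
    source = "def convert(content):\n" + body
    namespace = {"json": json}
    exec(source, namespace)
    return namespace["convert"]
```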
&lt;p&gt;When working with &lt;code&gt;git-history&lt;/code&gt; I find myself spending most of my time iterating on these conversion scripts. Passing strings of Python code to tools like this is a pretty fun pattern - I also used it &lt;a href="https://simonwillison.net/2021/Aug/6/sqlite-utils-convert/"&gt;for sqlite-utils convert&lt;/a&gt; earlier this year.&lt;/p&gt;
&lt;h4&gt;Trying this out yourself&lt;/h4&gt;
&lt;p&gt;If you want to try this out for yourself the &lt;code&gt;git-history&lt;/code&gt; tool has &lt;a href="https://github.com/simonw/git-history/blob/main/README.md"&gt;an extensive README&lt;/a&gt; describing the other options, and the scripts used to create these demos can be found in the &lt;a href="https://github.com/simonw/git-history/tree/main/demos"&gt;demos folder&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/topics/git-scraping"&gt;git-scraping topic&lt;/a&gt; on GitHub now has over 200 repos now built by dozens of different people - that's a lot of interesting scraped data sat there waiting to be explored!&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-history"&gt;git-history&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="cli"/><category term="data-journalism"/><category term="git"/><category term="projects"/><category term="scraping"/><category term="sqlite"/><category term="datasette"/><category term="git-history"/></entry><entry><title>Git scraping, the five minute lightning talk</title><link href="https://simonwillison.net/2021/Mar/5/git-scraping/#atom-tag" rel="alternate"/><published>2021-03-05T00:44:15+00:00</published><updated>2021-03-05T00:44:15+00:00</updated><id>https://simonwillison.net/2021/Mar/5/git-scraping/#atom-tag</id><summary type="html">
    &lt;p&gt;I prepared a lightning talk about &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt; for the &lt;a href="https://www.ire.org/training/conferences/nicar-2021/"&gt;NICAR 2021&lt;/a&gt; data journalism conference. In the talk I explain the idea of running scheduled scrapers in GitHub Actions, show some examples and then live code a new scraper for the CDC's vaccination data using the GitHub web interface. Here's the video.&lt;/p&gt;
&lt;div class="resp-container"&gt;
    &lt;iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/2CjA-03yK8I" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="allowfullscreen"&gt; &lt;/iframe&gt;
&lt;/div&gt;
&lt;h4&gt;Notes from the talk&lt;/h4&gt;
&lt;p&gt;Here's &lt;a href="https://m.pge.com/#outages"&gt;the PG&amp;amp;E outage map&lt;/a&gt; that I scraped. The trick here is to open the browser developer tools network tab, then order resources by size and see if you can find the JSON resource that contains the most interesting data.&lt;/p&gt;
&lt;p&gt;I scraped that outage data into &lt;a href="https://github.com/simonw/pge-outages"&gt;simonw/pge-outages&lt;/a&gt; - here's the &lt;a href="https://github.com/simonw/pge-outages/commits"&gt;commit history&lt;/a&gt; (over 40,000 commits now!)&lt;/p&gt;
&lt;p&gt;The scraper code itself &lt;a href="https://github.com/simonw/disaster-scrapers/blob/3eed6eca820e14e2f89db3910d1aece72717d387/pge.py"&gt;is here&lt;/a&gt;. I wrote about the project in detail in &lt;a href="https://simonwillison.net/2019/Oct/10/pge-outages/"&gt;Tracking PG&amp;amp;E outages by scraping to a git repo&lt;/a&gt; - my database of outages is at &lt;a href="https://pge-outages.simonwillison.net/pge-outages/outages"&gt;pge-outages.simonwillison.net&lt;/a&gt; and the animation I made of outages over time is attached to &lt;a href="https://twitter.com/simonw/status/1188612004572880896"&gt;this tweet&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote class="twitter-tweet"&gt;&lt;p lang="en" dir="ltr"&gt;Here&amp;#39;s a video animation of PG&amp;amp;E&amp;#39;s outages from October 5th up until just a few minutes ago &lt;a href="https://t.co/50K3BrROZR"&gt;pic.twitter.com/50K3BrROZR&lt;/a&gt;&lt;/p&gt;- Simon Willison (@simonw) &lt;a href="https://twitter.com/simonw/status/1188612004572880896?ref_src=twsrc%5Etfw"&gt;October 28, 2019&lt;/a&gt;&lt;/blockquote&gt;
&lt;p&gt;The much simpler scraper for the &lt;a href="https://www.fire.ca.gov/incidents"&gt;www.fire.ca.gov/incidents&lt;/a&gt; website is at &lt;a href="https://github.com/simonw/ca-fires-history"&gt;simonw/ca-fires-history&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In the video I used that as the template to create a new scraper for CDC vaccination data - their website is &lt;a href="https://covid.cdc.gov/covid-data-tracker/#vaccinations"&gt;https://covid.cdc.gov/covid-data-tracker/#vaccinations&lt;/a&gt; and the API I found using the browser developer tools is &lt;a href="https://covid.cdc.gov/covid-data-tracker/COVIDData/getAjaxData?id=vaccination_data"&gt;https://covid.cdc.gov/covid-data-tracker/COVIDData/getAjaxData?id=vaccination_data&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The new CDC scraper and the data it has scraped lives in &lt;a href="https://github.com/simonw/cdc-vaccination-history"&gt;simonw/cdc-vaccination-history&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;You can find more examples of Git scraping in the &lt;a href="https://github.com/topics/git-scraping"&gt;git-scraping GitHub topic&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/my-talks"&gt;my-talks&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/annotated-talks"&gt;annotated-talks&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nicar"&gt;nicar&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="data-journalism"/><category term="scraping"/><category term="my-talks"/><category term="github-actions"/><category term="git-scraping"/><category term="annotated-talks"/><category term="nicar"/></entry><entry><title>selenium-wire</title><link href="https://simonwillison.net/2020/Nov/2/selenium-wire/#atom-tag" rel="alternate"/><published>2020-11-02T18:58:59+00:00</published><updated>2020-11-02T18:58:59+00:00</updated><id>https://simonwillison.net/2020/Nov/2/selenium-wire/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://pypi.org/project/selenium-wire/"&gt;selenium-wire&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Really useful scraping tool: it enhances the Python Selenium bindings to run against a proxy, which then allows Python scraping code to inspect captured requests - great when a site you are working with triggers Ajax requests and you want to extract data from the raw JSON that comes back.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/selenium"&gt;selenium&lt;/a&gt;&lt;/p&gt;



</summary><category term="data-journalism"/><category term="python"/><category term="scraping"/><category term="selenium"/></entry><entry><title>Weeknotes: evernote-to-sqlite, Datasette Weekly, scrapers, csv-diff, sqlite-utils</title><link href="https://simonwillison.net/2020/Oct/16/weeknotes-evernote-datasette-weekly/#atom-tag" rel="alternate"/><published>2020-10-16T21:14:46+00:00</published><updated>2020-10-16T21:14:46+00:00</updated><id>https://simonwillison.net/2020/Oct/16/weeknotes-evernote-datasette-weekly/#atom-tag</id><summary type="html">
    &lt;p&gt;This week I built &lt;code&gt;evernote-to-sqlite&lt;/code&gt; (see &lt;a href="https://simonwillison.net/2020/Oct/16/building-evernote-sqlite-exporter/"&gt;Building an Evernote to SQLite exporter&lt;/a&gt;), launched the &lt;a href="https://datasette.substack.com/"&gt;Datasette Weekly newsletter&lt;/a&gt;, worked on some scrapers and pushed out some small improvements to several other projects.&lt;/p&gt;
&lt;h4&gt;The Datasette Weekly newsletter&lt;/h4&gt;
&lt;p&gt;After procrastinating on it for several months I finally launched the new &lt;a href="https://datasette.substack.com/"&gt;Datasette Weekly&lt;/a&gt; newsletter!&lt;/p&gt;
&lt;p&gt;My plan is to put this out once a week with a combination of news from the Datasette/Dogsheep/sqlite-utils ecosystem of tools, plus tips and tricks for using them to solve data problems.&lt;/p&gt;
&lt;p&gt;You can read &lt;a href="https://datasette.substack.com/p/datasette-050-git-scraping-extracting"&gt;the first edition here&lt;/a&gt;, which covers &lt;a href="https://docs.datasette.io/en/stable/changelog.html#v0-50"&gt;Datasette 0.50&lt;/a&gt;, &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;git scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/2020/Sep/23/sqlite-utils-extract/"&gt;sqlite-utils extract&lt;/a&gt; and features &lt;a href="https://github.com/simonw/datasette-graphql"&gt;datasette-graphql&lt;/a&gt; as the plugin of the week.&lt;/p&gt;
&lt;p&gt;I'm using &lt;a href="https://substack.com/"&gt;Substack&lt;/a&gt; because people I trust use it for their newsletters and I decided that picking an option and launching was more important than spending even more time procrastinating on picking the best possible newsletter platform. So far it seems fit for purpose, and it provides an export option should I decide to move to something else.&lt;/p&gt;
&lt;h4&gt;Writing scrapers with a Python+JavaScript hybrid&lt;/h4&gt;
&lt;p&gt;I've been writing some scraper code to help out with a student journalism project at Stanford. I ended up using &lt;a href="https://selenium-python.readthedocs.io/"&gt;Selenium Python&lt;/a&gt; running in a Jupyter Notebook.&lt;/p&gt;
&lt;p&gt;Historically I've avoided Selenium due to how weird and complex it has been to use in the past. I've now completely changed my mind: these days it's a really solid option for browser automation driven by Python thanks to &lt;code&gt;chromedriver&lt;/code&gt; and &lt;code&gt;geckodriver&lt;/code&gt;, which I recently learned can be &lt;a href="https://til.simonwillison.net/til/til/selenium_selenium-python-macos.md"&gt;installed using Homebrew&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;My preferred way of writing scrapers is to do most of the work in JavaScript. The combination of &lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelector"&gt;querySelector()&lt;/a&gt;, &lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelectorAll"&gt;querySelectorAll()&lt;/a&gt;, &lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/Fetch_API"&gt;fetch()&lt;/a&gt; and the new-to-me &lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/DOMParser"&gt;DOMParser&lt;/a&gt; class makes light work of extracting data from any shape of HTML, and browser DevTools mean that I can interactively build up scrapers by pasting code directly into the console.&lt;/p&gt;
&lt;p&gt;My big breakthrough this week was figuring out how to write scrapers as a Python-JavaScript hybrid. The Selenium &lt;code&gt;driver.execute_script()&lt;/code&gt; and &lt;code&gt;driver.execute_async_script()&lt;/code&gt; (&lt;a href="https://til.simonwillison.net/til/til/selenium_async-javascript-in-selenium.md"&gt;TIL&lt;/a&gt;) methods make it trivial to execute chunks of JavaScript from Python and get back the results.&lt;/p&gt;
&lt;p&gt;This meant I could scrape pages one at a time using JavaScript and save the results directly to SQLite via &lt;a href="https://sqlite-utils.readthedocs.io/en/stable/python-api.html"&gt;sqlite-utils&lt;/a&gt;. I could even run database queries on the Python side to skip items that had already been scraped.&lt;/p&gt;
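&lt;p&gt;As a rough illustration of that skip-already-scraped pattern (a sketch, not the project's actual code), here's the shape of it using Python's standard-library &lt;code&gt;sqlite3&lt;/code&gt; in place of sqlite-utils - &lt;code&gt;fetch_page&lt;/code&gt; here is a hypothetical stand-in for a &lt;code&gt;driver.execute_script()&lt;/code&gt; call:&lt;/p&gt;

```python
import sqlite3

def scrape_missing(db, urls, fetch_page):
    # fetch_page stands in for driver.execute_script(...) in the real scraper
    db.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, html TEXT)")
    seen = {row[0] for row in db.execute("SELECT url FROM pages")}
    for url in urls:
        if url in seen:
            continue  # already scraped on a previous run
        db.execute("INSERT INTO pages VALUES (?, ?)", (url, fetch_page(url)))
        db.commit()  # commit per page so an interrupted run loses nothing

db = sqlite3.connect(":memory:")
scrape_missing(db, ["/a", "/b"], lambda url: f"<html>{url}</html>")
```

&lt;p&gt;Running it a second time only fetches pages that aren't already in the table, so an interrupted scrape can simply be restarted.&lt;/p&gt;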
&lt;h4&gt;csv-diff 1.0&lt;/h4&gt;
&lt;p&gt;I'm trying to get more of my tools past the 1.0 mark, mainly to indicate to potential users that I won't be breaking backwards compatibility without bumping them to 2.0.&lt;/p&gt;
&lt;p&gt;I built &lt;a href="https://github.com/simonw/csv-diff"&gt;csv-diff&lt;/a&gt; for my &lt;a href="https://simonwillison.net/2019/Mar/13/tree-history/#csvdiff_18"&gt;San Francisco Trees project&lt;/a&gt; last year. It produces human-readable diffs for CSV files.&lt;/p&gt;
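&lt;p&gt;To give a flavour of what "human-readable diffs for CSV files" means (this is a toy sketch, not csv-diff's actual implementation), here's keyed row comparison using the standard-library &lt;code&gt;csv&lt;/code&gt; module:&lt;/p&gt;

```python
import csv, io

def diff_csv(old_text, new_text, key):
    # Index rows by the key column, then compare matching rows field by field
    def index(text):
        return {row[key]: row for row in csv.DictReader(io.StringIO(text))}
    old, new = index(old_text), index(new_text)
    changes = []
    for k in old.keys() & new.keys():
        for col in old[k]:
            if old[k][col] != new[k][col]:
                changes.append(
                    f'{key}={k}: {col} changed from "{old[k][col]}" to "{new[k][col]}"'
                )
    return changes

print(diff_csv("id,name\n1,Oak\n", "id,name\n1,Maple\n", key="id"))
# ['id=1: name changed from "Oak" to "Maple"']
```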
&lt;p&gt;The version 1.0 release notes are &lt;a href="https://github.com/simonw/csv-diff/releases/tag/1.0"&gt;as follows&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;New &lt;code&gt;--show-unchanged&lt;/code&gt; option for outputting the unchanged values of rows that had at least one change. &lt;a href="https://github.com/simonw/csv-diff/issues/9"&gt;#9&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Fix for bug with column names that contained a &lt;code&gt;.&lt;/code&gt; character. &lt;a href="https://github.com/simonw/csv-diff/issues/7"&gt;#7&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Fix for error when no &lt;code&gt;--key&lt;/code&gt; provided - thanks, @MainHanzo. &lt;a href="https://github.com/simonw/csv-diff/issues/3"&gt;#3&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;The CSV delimiter sniffer now handles &lt;code&gt;;&lt;/code&gt; delimited files. &lt;a href="https://github.com/simonw/csv-diff/issues/6"&gt;#6&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;sqlite-utils 2.22&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://sqlite-utils.readthedocs.io/en/stable/changelog.html#v2-22"&gt;sqlite-utils 2.22&lt;/a&gt; adds some minor features - an &lt;code&gt;--encoding&lt;/code&gt; option for processing TSV and CSV files in encodings other than UTF-8, and more support for loading SQLite extensions modules.&lt;/p&gt;
&lt;p&gt;Full release notes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;New &lt;code&gt;--encoding&lt;/code&gt; option for processing CSV and TSV files that use a non-utf-8 encoding, for both the &lt;code&gt;insert&lt;/code&gt; and &lt;code&gt;update&lt;/code&gt; commands. (&lt;a href="https://github.com/simonw/sqlite-utils/issues/182"&gt;#182&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;--load-extension&lt;/code&gt; option is now available to many more commands. (&lt;a href="https://github.com/simonw/sqlite-utils/issues/137"&gt;#137&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--load-extension=spatialite&lt;/code&gt; can be used to load SpatiaLite from common installation locations, if it is available. (&lt;a href="https://github.com/simonw/sqlite-utils/issues/136"&gt;#136&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Tests now also run against Python 3.9. (&lt;a href="https://github.com/simonw/sqlite-utils/issues/184"&gt;#184&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Passing &lt;code&gt;pk=["id"]&lt;/code&gt; now has the same effect as passing &lt;code&gt;pk="id"&lt;/code&gt;. (&lt;a href="https://github.com/simonw/sqlite-utils/issues/181"&gt;#181&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
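&lt;p&gt;To illustrate the problem &lt;code&gt;--encoding&lt;/code&gt; solves (a sketch using only the standard library, not sqlite-utils itself): a CSV file saved as Latin-1 needs an explicit encoding to round-trip correctly:&lt;/p&gt;

```python
import csv, os, tempfile

# Write a Latin-1 encoded CSV, then read it back with an explicit encoding -
# this is the situation sqlite-utils' --encoding option handles on the CLI
path = os.path.join(tempfile.mkdtemp(), "rows.csv")
with open(path, "w", encoding="latin-1") as f:
    f.write("name\nCafé\n")

with open(path, encoding="latin-1") as f:
    rows = list(csv.DictReader(f))

print(rows)  # [{'name': 'Café'}]
```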
&lt;h4&gt;Datasette&lt;/h4&gt;
&lt;p&gt;No new release yet, but I've landed some small new features to the &lt;code&gt;main&lt;/code&gt; branch.&lt;/p&gt;
&lt;p&gt;Inspired by the GitHub and WordPress APIs, Datasette's JSON API now supports &lt;code&gt;Link:&lt;/code&gt; HTTP header pagination (&lt;a href="https://github.com/simonw/datasette/issues/1014"&gt;#1014&lt;/a&gt;).&lt;/p&gt;
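&lt;p&gt;The header takes the form &lt;code&gt;Link: &amp;lt;url&amp;gt;; rel="next"&lt;/code&gt;. As a sketch of how a client might consume it (my illustration, not Datasette code), here's a tiny parser for the happy path:&lt;/p&gt;

```python
def parse_link_header(value):
    # Parse 'Link: <url>; rel="next", <url2>; rel="prev"' into {rel: url}
    links = {}
    for part in value.split(","):
        url_part, _, rel_part = part.strip().partition(";")
        url = url_part.strip().lstrip("<").rstrip(">")
        rel = rel_part.split('"')[1]  # the value between the quotes in rel="..."
        links[rel] = url
    return links

header = '<https://latest.datasette.io/fixtures/facetable.json?_next=5>; rel="next"'
print(parse_link_header(header))
```

&lt;p&gt;(A real parser also needs to cope with commas inside URLs and unquoted parameters; this only covers the common case.)&lt;/p&gt;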
&lt;p&gt;This is part of my ongoing effort to &lt;a href="https://github.com/simonw/datasette/issues/782"&gt;redesign the default JSON format&lt;/a&gt; ready for Datasette 1.0. I started a new plugin called &lt;a href="https://github.com/simonw/datasette-json-preview"&gt;datasette-json-preview&lt;/a&gt; to let me iterate on that format independently of Datasette itself.&lt;/p&gt;
&lt;p&gt;Jacob Fenton suggested an &lt;a href="https://github.com/simonw/datasette/issues/1019"&gt;"Edit SQL" button on canned queries&lt;/a&gt;. That's a great idea, so I built it - &lt;a href="https://github.com/simonw/datasette/issues/1019#issuecomment-708139822"&gt;this issue comment&lt;/a&gt; links to some demos, e.g. &lt;a href="https://latest.datasette.io/fixtures/neighborhood_search?text=ber"&gt;this one here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I added an "x" button for clearing filters to the table page (&lt;a href="https://github.com/simonw/datasette/issues/1016"&gt;#1016&lt;/a&gt;) demonstrated by this GIF:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Animation demonstrating the new x button next to filters" src="https://static.simonwillison.net/static/2020/x-button-filters.gif" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/til/til/homebrew_upgrading-python-homebrew-packages.md"&gt;Upgrading Python Homebrew packages using pip&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/til/til/python_click-file-encoding.md"&gt;Explicit file encodings using click.File&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/sqlite-utils/releases/tag/2.22"&gt;sqlite-utils 2.22&lt;/a&gt; - 2020-10-16&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/csv-diff/releases/tag/1.0"&gt;csv-diff 1.0&lt;/a&gt; - 2020-10-16&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/dogsheep/swarm-to-sqlite/releases/tag/0.3.2"&gt;swarm-to-sqlite 0.3.2&lt;/a&gt; - 2020-10-12&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/dogsheep/evernote-to-sqlite/releases/tag/0.2"&gt;evernote-to-sqlite 0.2&lt;/a&gt; - 2020-10-12&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/dogsheep/evernote-to-sqlite/releases/tag/0.1"&gt;evernote-to-sqlite 0.1&lt;/a&gt; - 2020-10-11&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/xml-analyser/releases/tag/1.0"&gt;xml-analyser 1.0&lt;/a&gt; - 2020-10-11&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-json-preview/releases/tag/0.1"&gt;datasette-json-preview 0.1&lt;/a&gt; - 2020-10-11&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="projects"/><category term="scraping"/><category term="datasette"/><category term="weeknotes"/><category term="sqlite-utils"/></entry><entry><title>Git scraping: track changes over time by scraping to a Git repository</title><link href="https://simonwillison.net/2020/Oct/9/git-scraping/#atom-tag" rel="alternate"/><published>2020-10-09T18:27:23+00:00</published><updated>2020-10-09T18:27:23+00:00</updated><id>https://simonwillison.net/2020/Oct/9/git-scraping/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;Git scraping&lt;/strong&gt; is the name I've given a scraping technique that I've been experimenting with for a few years now. It's really effective, and more people should use it.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update 5th March 2021:&lt;/strong&gt; I presented a version of this post as &lt;a href="https://simonwillison.net/2021/Mar/5/git-scraping/"&gt;a five minute lightning talk at NICAR 2021&lt;/a&gt;, which includes a live coding demo of building a new git scraper.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update 5th January 2022:&lt;/strong&gt; I released a tool called &lt;a href="https://simonwillison.net/2021/Dec/7/git-history/"&gt;git-history&lt;/a&gt; that helps analyze data that has been collected using this technique.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The internet is full of interesting data that changes over time. These changes can sometimes be more interesting than the underlying static data - the &lt;a href="https://twitter.com/nyt_diff"&gt;@nyt_diff Twitter account&lt;/a&gt;, for example, tracks changes made to New York Times headlines, which offers a fascinating insight into that publication's editorial process.&lt;/p&gt;
&lt;p&gt;We already have a great tool for efficiently tracking changes to text over time: &lt;strong&gt;Git&lt;/strong&gt;. And &lt;a href="https://github.com/features/actions"&gt;GitHub Actions&lt;/a&gt; (and other CI systems) make it easy to create a scraper that runs every few minutes, records the current state of a resource and records changes to that resource over time in the commit history.&lt;/p&gt;
&lt;p&gt;Here's a recent example. Fires continue to rage in California, and the &lt;a href="https://www.fire.ca.gov/"&gt;CAL FIRE website&lt;/a&gt; offers an &lt;a href="https://www.fire.ca.gov/incidents/"&gt;incident map&lt;/a&gt; showing the latest fire activity around the state.&lt;/p&gt;
&lt;p&gt;Firing up the Firefox Network pane, filtering to requests triggered by XHR and sorting by size (largest first) reveals this endpoint:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents"&gt;https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;That's a 241KB JSON endpoint with full details of the various fires around the state.&lt;/p&gt;
&lt;p&gt;So... I started running a git scraper against it. My scraper lives in the &lt;a href="https://github.com/simonw/ca-fires-history"&gt;simonw/ca-fires-history&lt;/a&gt; repository on GitHub.&lt;/p&gt;
&lt;p&gt;Every 20 minutes it grabs the latest copy of that JSON endpoint, pretty-prints it (for diff readability) using &lt;code&gt;jq&lt;/code&gt; and commits it back to the repo if it has changed.&lt;/p&gt;
&lt;p&gt;This means I now have a &lt;a href="https://github.com/simonw/ca-fires-history/commits/main"&gt;commit log&lt;/a&gt; of changes to that information about fires in California. Here's an &lt;a href="https://github.com/simonw/ca-fires-history/commit/7b0f42d4bf198885ab2b41a22a8da47157572d18"&gt;example commit&lt;/a&gt; showing that last night the Zogg Fires percentage contained increased from 90% to 92%, the number of personnel involved dropped from 968 to 798 and the number of engines responding dropped from 82 to 59.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2020/git-scraping.png" alt="Screenshot of a diff against the Zogg Fires, showing personnel involved dropping from 968 to 798, engines dropping 82 to 59, water tenders dropping 31 to 27 and percent contained increasing from 90 to 92." style="max-width: 100%" /&gt;&lt;/p&gt;
&lt;p&gt;The implementation of the scraper is entirely contained in a single GitHub Actions workflow. It's in a file called &lt;a href="https://github.com/simonw/ca-fires-history/blob/main/.github/workflows/scrape.yml"&gt;.github/workflows/scrape.yml&lt;/a&gt; which looks like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Scrape latest data&lt;/span&gt;

&lt;span class="pl-ent"&gt;on&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;push&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;workflow_dispatch&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;schedule&lt;/span&gt;:
    - &lt;span class="pl-ent"&gt;cron&lt;/span&gt;:  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;6,26,46 * * * *&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

&lt;span class="pl-ent"&gt;jobs&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;scheduled&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;runs-on&lt;/span&gt;: &lt;span class="pl-s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="pl-ent"&gt;steps&lt;/span&gt;:
    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Check out this repo&lt;/span&gt;
      &lt;span class="pl-ent"&gt;uses&lt;/span&gt;: &lt;span class="pl-s"&gt;actions/checkout@v2&lt;/span&gt;
    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Fetch latest data&lt;/span&gt;
      &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;        curl https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents | jq . &amp;gt; incidents.json&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Commit and push if it changed&lt;/span&gt;
      &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;        git config user.name "Automated"&lt;/span&gt;
&lt;span class="pl-s"&gt;        git config user.email "actions@users.noreply.github.com"&lt;/span&gt;
&lt;span class="pl-s"&gt;        git add -A&lt;/span&gt;
&lt;span class="pl-s"&gt;        timestamp=$(date -u)&lt;/span&gt;
&lt;span class="pl-s"&gt;        git commit -m "Latest data: ${timestamp}" || exit 0&lt;/span&gt;
&lt;span class="pl-s"&gt;        git push&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;That's not a lot of code!&lt;/p&gt;
&lt;p&gt;It runs on a schedule at 6, 26 and 46 minutes past the hour - I like to offset my cron times like this since I assume that the majority of crons run exactly on the hour, so running not-on-the-hour feels polite.&lt;/p&gt;
&lt;p&gt;The scraper itself works by fetching the JSON using &lt;code&gt;curl&lt;/code&gt;, piping it through &lt;code&gt;jq .&lt;/code&gt; to pretty-print it and saving the result to &lt;code&gt;incidents.json&lt;/code&gt;.&lt;/p&gt;
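&lt;p&gt;The pretty-printing step matters because stable, indented output means identical data always serializes to identical text, keeping each commit's diff down to just the values that changed. A Python equivalent of the &lt;code&gt;jq .&lt;/code&gt; step might look like this (a sketch, not part of the workflow above):&lt;/p&gt;

```python
import json

def pretty(data):
    # Stable, indented output: identical data always produces identical text,
    # so git only records a commit when the content actually changed
    return json.dumps(data, indent=2, sort_keys=True) + "\n"

old = pretty({"incidents": [{"name": "Zogg Fire", "contained": 90}]})
new = pretty({"incidents": [{"name": "Zogg Fire", "contained": 92}]})
print(old == new)  # False - the change shows up as a tiny, readable diff
```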
&lt;p&gt;The "commit and push if it changed" block uses a pattern that commits and pushes only if the file has changed. I wrote about this pattern in &lt;a href="https://til.simonwillison.net/til/til/github-actions_commit-if-file-changed.md"&gt;this TIL&lt;/a&gt; a few months ago.&lt;/p&gt;
&lt;p&gt;I have a whole bunch of repositories running git scrapers now. I've been labeling them with the &lt;a href="https://github.com/topics/git-scraping"&gt;git-scraping topic&lt;/a&gt; so they show up in one place on GitHub (other people have started using that topic as well).&lt;/p&gt;
&lt;p&gt;I've written about some of these &lt;a href="https://simonwillison.net/tags/gitscraping/"&gt;in the past&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2017/Sep/10/scraping-irma/"&gt;Scraping hurricane Irma&lt;/a&gt; back in September 2017 is when I first came up with the idea to use a Git repository in this way.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2017/Oct/10/fires-in-the-north-bay/"&gt;Changelogs to help understand the fires in the North Bay&lt;/a&gt; from October 2017 describes an early attempt at scraping fire-related information.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2019/Mar/13/tree-history/"&gt;Generating a commit log for San Francisco’s official list of trees&lt;/a&gt; remains my favourite application of this technique. The City of San Francisco maintains a frequently updated CSV file of 190,000 trees in the city, and I have &lt;a href="https://github.com/simonw/sf-tree-history/find/master"&gt;a commit log&lt;/a&gt; of changes to it stretching back over more than a year. This example uses my &lt;a href="https://github.com/simonw/csv-diff"&gt;csv-diff&lt;/a&gt; utility to generate human-readable commit messages.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2019/Oct/10/pge-outages/"&gt;Tracking PG&amp;amp;E outages by scraping to a git repo&lt;/a&gt; documents my attempts to track the impact of PG&amp;amp;E's outages last year by scraping their outage map. I used the GitPython library to turn the values recorded in the commit history into a database that let me run visualizations of changes over time.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2020/Jan/21/github-actions-cloud-run/"&gt;Tracking FARA by deploying a data API using GitHub Actions and Cloud Run&lt;/a&gt; shows how I track new registrations for the US Foreign Agents Registration Act (FARA) in a repository and deploy the latest version of the data using Datasette.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I hope that by giving this technique a name I can encourage more people to add it to their toolbox. It's an extremely effective way of turning all sorts of interesting data sources into a changelog over time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://news.ycombinator.com/item?id=24732943"&gt;Comment thread&lt;/a&gt; on this post over on Hacker News.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="git"/><category term="github"/><category term="projects"/><category term="scraping"/><category term="github-actions"/><category term="git-scraping"/></entry><entry><title>Tracking PG&amp;E outages by scraping to a git repo</title><link href="https://simonwillison.net/2019/Oct/10/pge-outages/#atom-tag" rel="alternate"/><published>2019-10-10T23:32:14+00:00</published><updated>2019-10-10T23:32:14+00:00</updated><id>https://simonwillison.net/2019/Oct/10/pge-outages/#atom-tag</id><summary type="html">
    &lt;p&gt;PG&amp;amp;E have &lt;a href="https://twitter.com/bedwardstiek/status/1182047040932470784"&gt;cut off power&lt;/a&gt; to several million people in northern California, supposedly as a precaution against wildfires.&lt;/p&gt;

&lt;p&gt;As it happens, I've been scraping and recording PG&amp;amp;E's outage data every 10 minutes for the past 4+ months. This data got really interesting over the past two days!&lt;/p&gt;

&lt;p&gt;The original data lives in &lt;a href="https://github.com/simonw/pge-outages"&gt;a GitHub repo&lt;/a&gt; (more importantly in &lt;a href="https://github.com/simonw/pge-outages/commits/master"&gt;the commit history&lt;/a&gt; of that repo).&lt;/p&gt;

&lt;p&gt;Reading JSON in a Git repo isn't particularly productive, so this afternoon I figured out how to transform that data into a SQLite database and publish it with &lt;a href="https://github.com/simonw/datasette"&gt;Datasette&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The result is &lt;code&gt;https://pge-outages.simonwillison.net/&lt;/code&gt; (no longer available)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update from 27th October 2019&lt;/strong&gt;: I also used the data to create this animation (first shared &lt;a href="https://twitter.com/simonw/status/1188612004572880896"&gt;on Twitter&lt;/a&gt;):&lt;/p&gt;

&lt;video style="max-width: 100%" src="https://static.simonwillison.net/static/2019/outages.mp4" controls="controls"&gt;
  Your browser does not support the video tag.
&lt;/video&gt;

&lt;h3 id="thedatamodeloutagesandsnapshots"&gt;The data model: outages and snapshots&lt;/h3&gt;

&lt;p&gt;The three key tables to understand are &lt;code&gt;outages&lt;/code&gt;, &lt;code&gt;snapshots&lt;/code&gt; and &lt;code&gt;outage_snapshots&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;PG&amp;amp;E assign an outage ID to every outage - where an outage is usually something that affects a few dozen customers. I store these in the &lt;a href="https://pge-outages.simonwillison.net/pge-outages/outages?_sort_desc=outageStartTime"&gt;outages table&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Every 10 minutes I grab a snapshot of their full JSON file, which reports every single outage that is currently ongoing. I store a record of when I grabbed that snapshot in the &lt;a href="https://pge-outages.simonwillison.net/pge-outages/snapshots?_sort_desc=id"&gt;snapshots table&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The most interesting table is &lt;code&gt;outage_snapshots&lt;/code&gt;. Every time I see an outage in the JSON feed, I record a new copy of its data as an &lt;code&gt;outage_snapshot&lt;/code&gt; row. This allows me to reconstruct the full history of any outage, in 10 minute increments.&lt;/p&gt;
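&lt;p&gt;A heavily simplified sketch of that three-table design (the real schema has many more columns) using Python's standard-library &lt;code&gt;sqlite3&lt;/code&gt;:&lt;/p&gt;

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE snapshots (id INTEGER PRIMARY KEY, captured_at TEXT);
CREATE TABLE outages (id INTEGER PRIMARY KEY);
CREATE TABLE outage_snapshots (
    snapshot INTEGER REFERENCES snapshots(id),
    outage INTEGER REFERENCES outages(id),
    estCustAffected INTEGER
);
""")

# One outage, seen in two consecutive 10-minute snapshots
db.execute("INSERT INTO outages VALUES (1)")
for snap_id, affected in [(1, 120), (2, 80)]:
    db.execute("INSERT INTO snapshots VALUES (?, ?)", (snap_id, f"2019-10-10T0{snap_id}:00"))
    db.execute("INSERT INTO outage_snapshots VALUES (?, 1, ?)", (snap_id, affected))

# Reconstruct the full history of outage 1, in snapshot order
history = db.execute(
    "SELECT snapshot, estCustAffected FROM outage_snapshots WHERE outage = 1 ORDER BY snapshot"
).fetchall()
print(history)  # [(1, 120), (2, 80)]
```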

&lt;p&gt;Here are &lt;a href="https://pge-outages.simonwillison.net/pge-outages/outage_snapshots?snapshot=1269"&gt;all of the outages&lt;/a&gt; that were represented in &lt;a href="https://pge-outages.simonwillison.net/pge-outages/snapshots/1269"&gt;snapshot 1269&lt;/a&gt; - captured at 4:10pm Pacific Time today.&lt;/p&gt;

&lt;p&gt;I can run &lt;code&gt;select sum(estCustAffected) from outage_snapshots where snapshot = 1269&lt;/code&gt; (&lt;a href="https://pge-outages.simonwillison.net/pge-outages?sql=select+sum%28estCustAffected%29+from+outage_snapshots+where+snapshot+%3D+%3Aid&amp;amp;id=1269"&gt;try it here&lt;/a&gt;) to count up the total PG&amp;amp;E estimate of the number of affected customers - it's 545,706!&lt;/p&gt;

&lt;p&gt;I've installed &lt;a href="https://github.com/simonw/datasette-vega"&gt;datasette-vega&lt;/a&gt; which means I can render graphs. Here's my first attempt at a graph showing &lt;a href="https://pge-outages.simonwillison.net/pge-outages?sql=select+snapshots.id%2C+title+as+snapshotTime%2C+hash%2C+sum%28outage_snapshots.estCustAffected%29+as+totalEstCustAffected%0D%0Afrom+snapshots+join+outage_snapshots+on+snapshots.id+%3D+outage_snapshots.snapshot%0D%0Agroup+by+snapshots.id+order+by+snapshots.id+desc+limit+150#g.mark=line&amp;amp;g.x_column=snapshotTime&amp;amp;g.x_type=ordinal&amp;amp;g.y_column=totalEstCustAffected&amp;amp;g.y_type=quantitative"&gt;the number of estimated customers affected over time&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://static.simonwillison.net/static/2019/pge-outages-graph.png" style="text-decoration: none; border: none;"&gt;&lt;img src="https://static.simonwillison.net/static/2019/pge-outages-graph.png" style="max-width: 100%" /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(I don't know why there's a dip towards the end of the graph).&lt;/p&gt;

&lt;p&gt;I also defined &lt;a href="https://pge-outages.simonwillison.net/pge-outages/most_recent_snapshot"&gt;a SQL view&lt;/a&gt; which shows all of the outages from the most recently captured snapshot (usually within the past 10 minutes if the PG&amp;amp;E website hasn't gone down) and renders them using &lt;a href="https://github.com/simonw/datasette-cluster-map"&gt;datasette-cluster-map&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://static.simonwillison.net/static/2019/pge-map.jpg" style="text-decoration: none; border: none;"&gt;&lt;img src="https://static.simonwillison.net/static/2019/pge-map.jpg" style="max-width: 100%" /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3 id="thingstobeawareof"&gt;Things to be aware of&lt;/h3&gt;

&lt;p&gt;There are a huge amount of unanswered questions about this data. I've just been looking at PG&amp;amp;E's JSON and making guesses about what things like &lt;code&gt;estCustAffected&lt;/code&gt; means. Without official documentation we can only guess as to how accurate this data is, or how it should be interpreted.&lt;/p&gt;

&lt;p&gt;Some things to question:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What's the quality of this data? Does it reflect accurately on what's actually going on out there?&lt;/li&gt;

&lt;li&gt;What's the exact meaning of the different columns - &lt;code&gt;estCustAffected&lt;/code&gt;, &lt;code&gt;currentEtor&lt;/code&gt;, &lt;code&gt;autoEtor&lt;/code&gt;, &lt;code&gt;hazardFlag&lt;/code&gt; etc?&lt;/li&gt;

&lt;li&gt;Various columns (&lt;code&gt;lastUpdateTime&lt;/code&gt;, &lt;code&gt;currentEtor&lt;/code&gt;, &lt;code&gt;autoEtor&lt;/code&gt;) appear to be integer &lt;a href="https://en.wikipedia.org/wiki/Unix_time"&gt;unix timestamps&lt;/a&gt;. What timezone were they recorded in? Do they include DST etc?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id="howitworks"&gt;How it works&lt;/h3&gt;

&lt;p&gt;I originally wrote the scraper &lt;a href="https://simonwillison.net/2017/Oct/10/fires-in-the-north-bay/"&gt;back in October 2017&lt;/a&gt; during the North Bay fires, and moved it to run on Circle CI based on my work building &lt;a href="https://simonwillison.net/2019/Mar/13/tree-history/"&gt;a commit history of San Francisco's trees&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It's pretty simple: every 10 minutes &lt;a href="https://circleci.com/gh/simonw/disaster-scrapers"&gt;a Circle CI job&lt;/a&gt; runs which scrapes &lt;a href="https://apim.pge.com/cocoutage/outages/getOutagesRegions?regionType=city&amp;amp;expand=true"&gt;the JSON feed&lt;/a&gt; that powers the PG&amp;amp;E website's &lt;a href="https://www.pge.com/myhome/outages/outage/index.shtml"&gt;outage map&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The JSON is then committed to my &lt;a href="https://github.com/simonw/pge-outages"&gt;pge-outages GitHub repository&lt;/a&gt;, over-writing the existing &lt;a href="https://github.com/simonw/pge-outages/blob/master/pge-outages.json"&gt;pge-outages.json file&lt;/a&gt;. There's some code that attempts to generate a human-readable commit message, but the historic data itself is saved in the commit history of that single file.&lt;/p&gt;

&lt;h3 id="buildingthedatasette"&gt;Building the Datasette&lt;/h3&gt;

&lt;p&gt;The hardest part of this project was figuring out how to turn a GitHub commit history of changes to a JSON file into a SQLite database for use with Datasette.&lt;/p&gt;

&lt;p&gt;After a bunch of prototyping in a Jupyter notebook, I ended up with the schema described above.&lt;/p&gt;

&lt;p&gt;The code that generates the database can be found in &lt;a href="https://github.com/simonw/pge-outages/blob/master/build_database.py"&gt;build_database.py&lt;/a&gt;. I used &lt;a href="https://gitpython.readthedocs.io/en/stable/"&gt;GitPython&lt;/a&gt; to read data from the git repository and my &lt;a href="https://sqlite-utils.readthedocs.io/en/stable/python-api.html"&gt;sqlite-utils library&lt;/a&gt; to create and update the database.&lt;/p&gt;

&lt;h3 id="deployment"&gt;Deployment&lt;/h3&gt;

&lt;p&gt;Since this is a large database that changes every ten minutes, I couldn't use the usual &lt;a href="https://datasette.readthedocs.io/en/stable/publish.html"&gt;datasette publish&lt;/a&gt; trick of packaging it up and re-deploying it to a serverless host (Cloud Run or Heroku or Zeit Now) every time it updates.&lt;/p&gt;

&lt;p&gt;Instead, I'm running it on a VPS instance. I ended up trying out Digital Ocean for this, after &lt;a href="https://twitter.com/simonw/status/1182077259839991808"&gt;an enjoyable Twitter conversation&lt;/a&gt; about good options for stateful (as opposed to stateless) hosting.&lt;/p&gt;

&lt;h3 id="nextsteps"&gt;Next steps&lt;/h3&gt;

&lt;p&gt;I'm putting this out there and sharing it with the California News Nerd community in the hope that people can find interesting stories in there and help firm up my methodology - or take what I've done and spin up much more interesting forks of it.&lt;/p&gt;

&lt;p&gt;If you build something interesting with this please let me know, via email (swillison is my Gmail) or &lt;a href="https://twitter.com/simonw"&gt;on Twitter&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/digitalocean"&gt;digitalocean&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="data-journalism"/><category term="projects"/><category term="scraping"/><category term="sqlite"/><category term="datasette"/><category term="git-scraping"/><category term="digitalocean"/><category term="sqlite-utils"/></entry><entry><title>scrapely</title><link href="https://simonwillison.net/2018/Jul/10/scrapely/#atom-tag" rel="alternate"/><published>2018-07-10T20:25:01+00:00</published><updated>2018-07-10T20:25:01+00:00</updated><id>https://simonwillison.net/2018/Jul/10/scrapely/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/scrapy/scrapely"&gt;scrapely&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Neat twist on a screen scraping library: this one lets you “train it” by feeding it examples of URLs paired with a dictionary of the data you would like to have extracted from that URL, then uses an instance-based learning algorithm to run against new URLs. Slightly confusing name since it’s maintained by the scrapy team but is a totally independent project from the scrapy web crawling framework.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="scraping"/></entry><entry><title>sqlitebiter</title><link href="https://simonwillison.net/2018/May/17/sqlitebiter/#atom-tag" rel="alternate"/><published>2018-05-17T22:40:28+00:00</published><updated>2018-05-17T22:40:28+00:00</updated><id>https://simonwillison.net/2018/May/17/sqlitebiter/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/thombashi/sqlitebiter"&gt;sqlitebiter&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Similar to my csvs-to-sqlite tool, but sqlitebiter handles “CSV/Excel/HTML/JSON/LTSV/Markdown/SQLite/SSV/TSV/Google-Sheets”. Most interestingly, it works against HTML pages—run “sqlitebiter -v url ’https://en.wikipedia.org/wiki/Comparison_of_firewalls’” and it will scrape that Wikipedia page and create a SQLite table for each of the HTML tables it finds there.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;



</summary><category term="csv"/><category term="scraping"/><category term="sqlite"/><category term="datasette"/></entry><entry><title>kennethreitz/requests-html: HTML Parsing for Humans™</title><link href="https://simonwillison.net/2018/Feb/25/requests-html/#atom-tag" rel="alternate"/><published>2018-02-25T16:49:19+00:00</published><updated>2018-02-25T16:49:19+00:00</updated><id>https://simonwillison.net/2018/Feb/25/requests-html/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/kennethreitz/requests-html"&gt;kennethreitz/requests-html: HTML Parsing for Humans™&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Neat and tiny wrapper around requests, lxml and html2text that provides a Kenneth Reitz grade API design for intuitively fetching and scraping web pages. The inclusion of html2text means you can use a CSS selector to select a specific HTML element and then convert that to the equivalent markdown in a one-liner.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/kennethreitz/status/967749676312211456"&gt;@kennethreitz&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/html"&gt;html&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/requests"&gt;requests&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;&lt;/p&gt;



</summary><category term="html"/><category term="python"/><category term="requests"/><category term="scraping"/></entry><entry><title>Using “import refs” to iteratively import data into Django</title><link href="https://simonwillison.net/2017/Nov/4/import-refs/#atom-tag" rel="alternate"/><published>2017-11-04T19:17:00+00:00</published><updated>2017-11-04T19:17:00+00:00</updated><id>https://simonwillison.net/2017/Nov/4/import-refs/#atom-tag</id><summary type="html">
    &lt;p&gt;I’ve been writing a few scripts to backfill my blog with content I originally posted elsewhere. So far I’ve imported &lt;a href="https://simonwillison.net/tags/quora/"&gt;answers I posted on Quora&lt;/a&gt; (&lt;a href="https://simonwillison.net/2017/Oct/1/ship/"&gt;background&lt;/a&gt;), &lt;a href="https://simonwillison.net/tags/askmetafilter/"&gt;answers I posted on Ask MetaFilter&lt;/a&gt; and &lt;a href="https://simonwillison.net/2017/Oct/8/missing-content/"&gt;content I recovered from the Internet Archive&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I started out writing custom import scripts (like &lt;a href="https://github.com/simonw/simonwillisonblog/blob/e737be8b4228229e833fe7a9ec698f3e262cd094/blog/management/commands/import_quora.py"&gt;this Quora one&lt;/a&gt;), but I’ve now built a generalized mechanism for this which I thought was worth writing up.&lt;/p&gt;
&lt;p&gt;Any of my content imports now take the form of a JSON document, which looks something like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[
  {
    &amp;quot;body&amp;quot;: &amp;quot;&amp;lt;p&amp;gt;&amp;lt;em&amp;gt;My answer to ...&amp;lt;/em&amp;gt;&amp;lt;/p&amp;gt;&amp;quot;,
    &amp;quot;tags&amp;quot;: [
      &amp;quot;backpacks&amp;quot;,
      &amp;quot;laptops&amp;quot;,
      &amp;quot;style&amp;quot;,
      &amp;quot;accessories&amp;quot;,
      &amp;quot;bags&amp;quot;
    ],
    &amp;quot;title&amp;quot;: &amp;quot;I need a new backpack&amp;quot;,
    &amp;quot;datetime&amp;quot;: &amp;quot;2005-01-16T14:08:00&amp;quot;,
    &amp;quot;import_ref&amp;quot;: &amp;quot;askmetafilter:14075&amp;quot;,
    &amp;quot;type&amp;quot;: &amp;quot;entry&amp;quot;,
    &amp;quot;slug&amp;quot;: &amp;quot;i-need-a-new-backpack&amp;quot;
  }
]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Two larger examples: the &lt;a href="https://gist.github.com/simonw/5a5bc1f58297d2c7d68dd7448a4d6614"&gt;missing content I extracted from the Internet Archive&lt;/a&gt;, and &lt;a href="https://gist.github.com/simonw/857572d9b36cd1e791c730790ed489ef"&gt;the answers I scraped from Ask MetaFilter&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;type&lt;/code&gt; property can be set to &lt;code&gt;entry&lt;/code&gt;, &lt;code&gt;quotation&lt;/code&gt; or &lt;code&gt;blogmark&lt;/code&gt; and specifies which type of content should be imported. The &lt;code&gt;datetime&lt;/code&gt;, &lt;code&gt;slug&lt;/code&gt; and &lt;code&gt;tags&lt;/code&gt; fields are common across all three types - the other fields differ for each type.&lt;/p&gt;
&lt;p&gt;The most interesting field here is &lt;code&gt;import_ref&lt;/code&gt;. This is optional, but if provided forms a unique reference associated with that item of content. I then use that reference in a call Django’s &lt;a href="https://docs.djangoproject.com/en/1.11/ref/models/querysets/#update-or-create"&gt;&lt;code&gt;update_or_create()&lt;/code&gt;&lt;/a&gt; method. This means I can run the same import multiple times - the first run will create objects, while subsequent runs update objects in place.&lt;/p&gt;
&lt;p&gt;The end result is that I can incrementally improve the scrapers I am writing, re-importing the resulting JSON to update previously imported records in-place. In addition to hacking on my blog, I’ve been using this pattern for some API integrations at work recently and it’s worked out very well.&lt;/p&gt;
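The same idempotent-import idea works outside Django too. Here is a sketch using SQLite's upsert support keyed on a unique &lt;code&gt;import_ref&lt;/code&gt; column (raw sqlite3 rather than &lt;code&gt;update_or_create()&lt;/code&gt;, with a made-up table schema):

```python
# Idempotent importer: re-running updates rows in place instead of duplicating.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE entry (
    import_ref TEXT UNIQUE, title TEXT, body TEXT, slug TEXT)""")

def import_items(items):
    for item in items:
        conn.execute(
            """INSERT INTO entry (import_ref, title, body, slug)
               VALUES (:import_ref, :title, :body, :slug)
               ON CONFLICT(import_ref) DO UPDATE SET
                 title=excluded.title, body=excluded.body, slug=excluded.slug""",
            item,
        )

items = json.loads("""[{"import_ref": "askmetafilter:14075",
  "title": "I need a new backpack", "body": "...",
  "slug": "i-need-a-new-backpack"}]""")
import_items(items)
items[0]["title"] = "I need a new backpack (updated)"
import_items(items)  # second run: same import_ref, so the row is updated

print(conn.execute("SELECT count(*), title FROM entry").fetchone())
# (1, 'I need a new backpack (updated)')
```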
&lt;p&gt;&lt;code&gt;import_ref&lt;/code&gt; is defined on my models as a unique, nullable text field:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;    import_ref = models.TextField(max_length=64, null=True, unique=True)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since the Django admin doesn’t handle nullable fields well by default, I &lt;a href="https://github.com/simonw/simonwillisonblog/blob/e737be8b4228229e833fe7a9ec698f3e262cd094/blog/admin.py#L19"&gt;added &lt;code&gt;import_ref&lt;/code&gt; to my &lt;code&gt;readonly_fields&lt;/code&gt; property&lt;/a&gt; in my admin configuration to avoid accidentally setting it to a blank string when editing through the admin interface.&lt;/p&gt;
&lt;p&gt;Here’s my completed &lt;a href="https://github.com/simonw/simonwillisonblog/blob/739a8cb49cfd49da5c643e41027af04d484e2aef/blog/management/commands/import_blog_json.py"&gt;&lt;code&gt;import_blog_json&lt;/code&gt; management command&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;My workflow for importing data is now pretty streamlined. I write the scrapers in &lt;a href="https://github.com/simonw/simonwillisonblog/tree/b7b59e504b5c2f5e04ad59e83a1f4fb6f76c58da/jupyter-notebooks"&gt;a Jupyter notebook&lt;/a&gt; and use that to generate a list of importable items as Python dictionaries. I run &lt;code&gt;open('/tmp/items.json', 'w').write(json.dumps(items, indent=2))&lt;/code&gt; to dump the items to a JSON file. Then I can run &lt;code&gt;./manage.py import_blog_json /tmp/items.json&lt;/code&gt; to import them into my local development environment - thanks to the &lt;code&gt;import_ref&lt;/code&gt; I can do this as many times as I like until I’m pleased with the result.&lt;/p&gt;
&lt;p&gt;Once it’s ready, I run &lt;code&gt;!cat /tmp/items.json | pbcopy&lt;/code&gt; in Jupyter to copy the JSON to my clipboard, then paste it into a new &lt;a href="https://gist.github.com/"&gt;GitHub Gist&lt;/a&gt;. I then copy the URL to the raw JSON and run the import command with that URL against my production instance.&lt;/p&gt;
&lt;p&gt;Heroku tip: running &lt;code&gt;heroku run bash&lt;/code&gt; will start a bash prompt in a dyno hooked up to your application. You can then run &lt;code&gt;./manage.py ...&lt;/code&gt; commands against your production environment.&lt;/p&gt;
&lt;p&gt;So… I just have to run &lt;code&gt;heroku run bash&lt;/code&gt; followed by &lt;code&gt;./manage.py import_blog_json https://gist.github.com/path-to-json --tag_with=askmetafilter&lt;/code&gt; and the new content will be live on my site.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;tag_with&lt;/code&gt; option allows me to specify a tag to apply to all of that imported content, useful for checking that everything worked as expected.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/django"&gt;django&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/django-admin"&gt;django-admin&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/heroku"&gt;heroku&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jupyter"&gt;jupyter&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="django"/><category term="django-admin"/><category term="scraping"/><category term="heroku"/><category term="jupyter"/></entry><entry><title>Changelogs to help understand the fires in the North Bay</title><link href="https://simonwillison.net/2017/Oct/10/fires-in-the-north-bay/#atom-tag" rel="alternate"/><published>2017-10-10T06:48:07+00:00</published><updated>2017-10-10T06:48:07+00:00</updated><id>https://simonwillison.net/2017/Oct/10/fires-in-the-north-bay/#atom-tag</id><summary type="html">
    &lt;p&gt;The situation in the counties north of San Francisco &lt;a href="http://www.sfgate.com/bayarea/article/Latest-on-North-Bay-fires-A-really-rough-12263721.php"&gt;is horrifying right now&lt;/a&gt;. I’ve repurposed some of &lt;a href="https://simonwillison.net/2017/Sep/10/scraping-irma/"&gt;the tools I built for the Irma Response project&lt;/a&gt; last month to collect and track some data that might be of use to anyone trying to understand what’s happening up there. I’m sharing these now in the hope that they might prove useful.&lt;/p&gt;
&lt;p&gt;I’m scraping a number of sources relevant to the crisis, and making the data available in &lt;a href="https://github.com/simonw/irma-scraped-data/"&gt;a repository on GitHub&lt;/a&gt;. Because it’s a git repository, changes to those sources are tracked automatically. The value I’m providing here isn’t so much the data itself, it’s the history of the data. If you need to see what has changed and when, my repository’s commit log should have the answers for you. Or maybe you’ll just want to occasionally hit refresh on &lt;a href="https://github.com/simonw/irma-scraped-data/commits/master/santa-rosa-emergency.json"&gt;this history of changes&lt;/a&gt; to &lt;a href="https://srcity.org/610/Emergency-Information"&gt;srcity.org/610/Emergency-Information&lt;/a&gt; to see when they edited the information.&lt;/p&gt;
&lt;p&gt;The sources I’m tracking right now are:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The &lt;a href="https://srcity.org/610/Emergency-Information"&gt;Santa Rosa Fire Department’s Emergency Information&lt;/a&gt; page. This is being maintained by hand so it’s not a great source of structured data, but it has key details like the location and availability of shelters and it’s useful to know what was changed and when. &lt;a href="https://github.com/simonw/irma-scraped-data/commits/master/santa-rosa-emergency.json"&gt;History of changes to that page&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://m.pge.com/#outages"&gt;PG&amp;amp;E power outages&lt;/a&gt;. This is probably the highest quality dataset with the &lt;a href="https://github.com/simonw/irma-scraped-data/commit/50ab3d3f3a5f117054e3209c7f0d520e6b483f0e#diff-2432d375ba73b2c87c88f55b12a0a2f0"&gt;neatest commit messages&lt;/a&gt;. The &lt;a href="https://github.com/simonw/irma-scraped-data/commits/master/pge-outages-individual.json"&gt;commit history of these&lt;/a&gt; shows exactly when new outages are reported and how many customers were affected.&lt;/li&gt;
&lt;li&gt;&lt;a href="http://roadconditions.sonoma-county.org/"&gt;Road Conditions in the County of Sonoma&lt;/a&gt;. If you want to understand how far the fire has spread, this is a useful source of data as it shows which roads have been closed due to fire or other reasons. &lt;a href="https://github.com/simonw/irma-scraped-data/commits/master/sonoma-road-conditions.json"&gt;History of changes&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;California Highway Patrol Incidents, extracted from a KML feed on &lt;a href="http://quickmap.dot.ca.gov/"&gt;quickmap.dot.ca.gov&lt;/a&gt;. Since these cover the whole state of California there’s a lot of stuff in here that isn’t directly relevant to the North Bay, but the incidents that mention fire still help tell the story of what’s been happening. &lt;a href="https://github.com/simonw/irma-scraped-data/commits/master/chp-incidents.json"&gt;History of changes&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The code for the scrapers can be &lt;a href="https://github.com/simonw/irma-scrapers/blob/master/north_bay.py"&gt;found in north_bay.py&lt;/a&gt;. Please leave comments, feedback or suggestions on other useful potential sources of data &lt;a href="https://github.com/simonw/simonwillisonblog/issues/4"&gt;in this GitHub issue&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/crisishacking"&gt;crisishacking&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="scraping"/><category term="crisishacking"/><category term="git-scraping"/></entry><entry><title>Scraping hurricane Irma</title><link href="https://simonwillison.net/2017/Sep/10/scraping-irma/#atom-tag" rel="alternate"/><published>2017-09-10T06:21:17+00:00</published><updated>2017-09-10T06:21:17+00:00</updated><id>https://simonwillison.net/2017/Sep/10/scraping-irma/#atom-tag</id><summary type="html">
    &lt;p&gt;The &lt;a href="https://www.irmaresponse.org/"&gt;Irma Response project&lt;/a&gt; is a team of volunteers working together to make information available during and after the storm. There is a huge amount of information out there, on many different websites. The &lt;a href="https://irma-api.herokuapp.com/"&gt;Irma API&lt;/a&gt; is an attempt to gather key information in one place, verify it and publish it in a reuseable way. It currently powers the &lt;a href="https://www.irmashelters.org/"&gt;irmashelters.org&lt;/a&gt; website.&lt;/p&gt;
&lt;p&gt;To aid this effort, I built a collection of screen scrapers that pull data from a number of different websites and APIs. That data is then stored in &lt;a href="https://github.com/simonw/irma-scraped-data/"&gt;a Git repository&lt;/a&gt;, providing a clear history of changes made to the various sources that are being tracked.&lt;/p&gt;
&lt;p&gt;Some of the scrapers also publish their findings to Slack in a format designed to make it obvious when key events happen, such as new shelters being added or removed from public listings.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Tracking_changes_over_time_8"&gt;&lt;/a&gt;Tracking changes over time&lt;/h3&gt;
&lt;p&gt;A key goal of this screen scraping mechanism is to allow changes to the underlying data sources to be tracked over time. This is achieved using git, via the GitHub API. Each scraper pulls down data from a source (an API or a website) and reformats that data into a sanitized JSON format. That JSON is then written to the git repository. If the data has changed since the last time the scraper ran, those changes will be captured by git and made available in the commit log.&lt;/p&gt;
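The snapshot step can be sketched like this (a simplified stand-in for the real scrapers, which write via the GitHub API; the filename is made up, and the actual &lt;code&gt;git add&lt;/code&gt;/&lt;code&gt;git commit&lt;/code&gt; is left to the caller):

```python
# Normalize scraped data to stable JSON and only rewrite the file when
# something changed, so every git commit represents a real diff.
import json
import pathlib
import tempfile

def save_snapshot(data, path):
    path = pathlib.Path(path)
    new = json.dumps(data, indent=2, sort_keys=True)  # stable key order
    old = path.read_text() if path.exists() else None
    if new == old:
        return False  # no change: nothing for git to commit
    path.write_text(new)
    return True  # caller then commits the file to the repository

tmp = pathlib.Path(tempfile.mkdtemp()) / "shelters.json"
print(save_snapshot({"shelters": ["A"]}, tmp))       # True  (first write)
print(save_snapshot({"shelters": ["A"]}, tmp))       # False (unchanged)
print(save_snapshot({"shelters": ["A", "B"]}, tmp))  # True  (new shelter)
```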
&lt;p&gt;Recent changes tracked by the scraper collection can be seen here: &lt;a href="https://github.com/simonw/irma-scraped-data/commits/master"&gt;https://github.com/simonw/irma-scraped-data/commits/master&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;&lt;a id="Generating_useful_commit_messages_14"&gt;&lt;/a&gt;Generating useful commit messages&lt;/h3&gt;
&lt;p&gt;The most complex code for most of the scrapers isn’t in fetching the data: it’s in generating useful, human-readable commit messages that summarize the underlying change. For example, here is &lt;a href="https://github.com/simonw/irma-scraped-data/commit/7919aeff0913ec26d1bea8dc"&gt;a commit message&lt;/a&gt; generated by the scraper that tracks the &lt;a href="http://www.floridadisaster.org/shelters/summary.aspx"&gt;http://www.floridadisaster.org/shelters/summary.aspx&lt;/a&gt; page:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;florida-shelters.json: 2 shelters added

Added shelter: Atwater Elementary School (Sarasota County)
Added shelter: DEBARY ELEMENTARY SCHOOL (Volusia County)
Change detected on http://www.floridadisaster.org/shelters/summary.aspx
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The full commit also shows the changes to the underlying JSON, but the human-readable message provides enough information that people who are not JSON-literate programmers can still derive value from the commit.&lt;/p&gt;
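The shape of that commit-message generation can be sketched as a diff of the previous and current shelter lists by name (a simplification of the real scraper, which also records counties and handles other change types):

```python
# Diff two shelter lists and describe the change in plain English.
def commit_message(old, new, url):
    old_names, new_names = set(old), set(new)
    added = sorted(new_names - old_names)
    removed = sorted(old_names - new_names)
    lines = [f"florida-shelters.json: {len(added)} shelters added", ""]
    lines += [f"Added shelter: {name}" for name in added]
    lines += [f"Removed shelter: {name}" for name in removed]
    lines.append(f"Change detected on {url}")
    return "\n".join(lines)

msg = commit_message(
    old=["North Port High School"],
    new=["North Port High School",
         "Atwater Elementary School (Sarasota County)"],
    url="http://www.floridadisaster.org/shelters/summary.aspx",
)
print(msg.splitlines()[0])  # florida-shelters.json: 1 shelters added
```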
&lt;h3&gt;&lt;a id="Publishing_to_Slack_26"&gt;&lt;/a&gt;Publishing to Slack&lt;/h3&gt;
&lt;p&gt;The Irma Response team use Slack to co-ordinate their efforts. You can join their Slack here: &lt;a href="https://irma-response-slack.herokuapp.com/"&gt;https://irma-response-slack.herokuapp.com/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Some of the scrapers publish detected changes in their data source to Slack, as links to the commits generated for each change. The human-readable message is posted directly to the channel.&lt;/p&gt;
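Posting a detected change to Slack via an incoming webhook looks roughly like this (the webhook URL is a placeholder, not the project's real one, and the network call itself is left unexecuted here):

```python
# Turn a human-readable commit message into a Slack incoming-webhook payload.
import json
import urllib.request

def slack_payload(message, commit_url):
    return {"text": f"{message}\n{commit_url}"}

def post_to_slack(webhook_url, payload):
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)  # requires a live webhook URL

payload = slack_payload(
    "florida-shelters.json: 2 shelters added",
    "https://github.com/simonw/irma-scraped-data/commit/7919aef",
)
print(payload["text"].splitlines()[0])  # florida-shelters.json: 2 shelters added
```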
&lt;p&gt;&lt;img style="width: 100%" src="http://static.simonwillison.net.s3.amazonaws.com/static/2017/irma-slack.jpg" alt="Bot publishing to Slack" /&gt;&lt;/p&gt;
&lt;p&gt;The source code for all of the scrapers can be found at &lt;a href="https://github.com/simonw/irma-scrapers"&gt;https://github.com/simonw/irma-scrapers&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This entry started out as a &lt;a href="https://github.com/simonw/irma-scrapers/blob/master/README.md"&gt;README file&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/crisishacking"&gt;crisishacking&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="scraping"/><category term="crisishacking"/><category term="git-scraping"/></entry></feed>