<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: scraping</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/scraping.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-08-11T18:11:49+00:00</updated><author><name>Simon Willison</name></author><entry><title>Reddit will block the Internet Archive</title><link href="https://simonwillison.net/2025/Aug/11/reddit-will-block-the-internet-archive/#atom-tag" rel="alternate"/><published>2025-08-11T18:11:49+00:00</published><updated>2025-08-11T18:11:49+00:00</updated><id>https://simonwillison.net/2025/Aug/11/reddit-will-block-the-internet-archive/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.theverge.com/news/757538/reddit-internet-archive-wayback-machine-block-limit"&gt;Reddit will block the Internet Archive&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Well this &lt;em&gt;sucks&lt;/em&gt;. Jay Peters for the Verge:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Reddit says that it has caught AI companies scraping its data from the Internet Archive’s Wayback Machine, so it’s going to start blocking the Internet Archive from indexing the vast majority of Reddit. The Wayback Machine will no longer be able to crawl post detail pages, comments, or profiles; instead, it will only be able to index the Reddit.com homepage, which effectively means Internet Archive will only be able to archive insights into which news headlines and posts were most popular on a given day.&lt;/p&gt;
&lt;/blockquote&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/internet-archive"&gt;internet-archive&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/reddit"&gt;reddit&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/training-data"&gt;training-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;&lt;/p&gt;



</summary><category term="internet-archive"/><category term="reddit"/><category term="scraping"/><category term="ai"/><category term="training-data"/><category term="ai-ethics"/></entry><entry><title>Vibe scraping and vibe coding a schedule app for Open Sauce 2025 entirely on my phone</title><link href="https://simonwillison.net/2025/Jul/17/vibe-scraping/#atom-tag" rel="alternate"/><published>2025-07-17T19:38:50+00:00</published><updated>2025-07-17T19:38:50+00:00</updated><id>https://simonwillison.net/2025/Jul/17/vibe-scraping/#atom-tag</id><summary type="html">
    &lt;p&gt;This morning, working entirely on my phone, I scraped a conference website and vibe coded up an alternative UI for interacting with the schedule using a combination of OpenAI Codex and Claude Artifacts.&lt;/p&gt;
&lt;p&gt;This weekend is &lt;a href="https://opensauce.com/"&gt;Open Sauce 2025&lt;/a&gt;, the third edition of the Bay Area conference for YouTube creators in the science and engineering space. I have a couple of friends going and they were complaining that the official schedule was difficult to navigate on a phone - it's not even linked from the homepage on mobile, and once you do find &lt;a href="https://opensauce.com/agenda/"&gt;the agenda&lt;/a&gt; it isn't particularly mobile-friendly.&lt;/p&gt;
&lt;p&gt;We were out for coffee this morning so I only had my phone, but I decided to see if I could fix it anyway.&lt;/p&gt;
&lt;p&gt;TLDR: Working entirely on my iPhone, using a combination of &lt;a href="https://chatgpt.com/codex"&gt;OpenAI Codex&lt;/a&gt; in the ChatGPT mobile app and Claude Artifacts via the Claude app, I was able to scrape the full schedule and then build and deploy this: &lt;a href="https://tools.simonwillison.net/open-sauce-2025"&gt;tools.simonwillison.net/open-sauce-2025&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/open-sauce-2025-card.jpg" alt="Screenshot of a blue page, Open Sauce 2025, July 18-20 2025, Download Calendar ICS button, then Friday 18th and Saturday 19th and Sunday 20th pill buttons, Friday is selected, the Welcome to Open Sauce with William Osman event on the Industry Stage is visible." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;The site offers a faster loading and more useful agenda view, but more importantly it includes an option to "Download Calendar (ICS)" which allows mobile phone users (Android and iOS) to easily import the schedule events directly into their calendar app of choice.&lt;/p&gt;
&lt;p&gt;Here are some detailed notes on how I built it.&lt;/p&gt;
&lt;h4 id="scraping-the-schedule"&gt;Scraping the schedule&lt;/h4&gt;
&lt;p&gt;Step one was to get that schedule in a structured format. I don't have good tools for viewing source on my iPhone, so I took a different approach to turning the schedule site into structured data.&lt;/p&gt;
&lt;p&gt;My first thought was to screenshot the schedule on my phone and then dump the images into a vision LLM - but the schedule was long enough that I didn't feel like scrolling through several different pages and stitching together dozens of images.&lt;/p&gt;
&lt;p&gt;If I was working on a laptop I'd turn to scraping: I'd dig around in the site itself and figure out where the data came from, then write code to extract it out.&lt;/p&gt;
&lt;p&gt;How could I do the same thing working on my phone?&lt;/p&gt;
&lt;p&gt;I decided to use &lt;strong&gt;OpenAI Codex&lt;/strong&gt; - the &lt;a href="https://simonwillison.net/2025/May/16/openai-codex/"&gt;hosted tool&lt;/a&gt;, not the confusingly named &lt;a href="https://simonwillison.net/2025/Apr/16/openai-codex/"&gt;CLI utility&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Codex recently &lt;a href="https://simonwillison.net/2025/Jun/3/codex-agent-internet-access/"&gt;grew the ability&lt;/a&gt; to interact with the internet while attempting to resolve a task. I have a dedicated Codex "environment" configured against a GitHub repository that doesn't do anything else, purely so I can run internet-enabled sessions there that can execute arbitrary network-enabled commands.&lt;/p&gt;
&lt;p&gt;I started a new task there (using the Codex interface inside the ChatGPT iPhone app) and prompted:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Install playwright and use it to visit https://opensauce.com/agenda/ and grab the full details of all three day schedules from the tabs - Friday and Saturday and Sunday - then save and on Data in as much detail as possible in a JSON file and submit that as a PR&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Codex is frustrating in that you only get one shot: it can go away and work autonomously on a task for a long time, but while it's working you can't give it follow-up prompts. You can wait for it to finish entirely and then tell it to try again in a new session, but ideally the instructions you give it are enough for it to get to the finish state where it submits a pull request against your repo with the results.&lt;/p&gt;
&lt;p&gt;I got lucky: my above prompt worked exactly as intended.&lt;/p&gt;
&lt;p&gt;Codex churned for &lt;em&gt;13 minutes&lt;/em&gt;! I sat chatting in a coffee shop, occasionally checking the logs to see what it was up to.&lt;/p&gt;
&lt;p&gt;It tried a whole bunch of approaches, all involving running the Playwright Python library to interact with the site. You can see &lt;a href="https://chatgpt.com/s/cd_687945dea5f48191892e0d73ebb45aa4"&gt;the full transcript here&lt;/a&gt;. It includes notes like "&lt;em&gt;Looks like xxd isn't installed. I'll grab "vim-common" or "xxd" to fix it.&lt;/em&gt;".&lt;/p&gt;
&lt;p&gt;Eventually it downloaded an enormous obfuscated chunk of JavaScript called &lt;a href="https://opensauce.com/wp-content/uploads/2025/07/schedule-overview-main-1752724893152.js"&gt;schedule-overview-main-1752724893152.js&lt;/a&gt; (316KB) and then ran a complex sequence of grep, grep, sed, strings, xxd and dd commands against it to figure out the location of the raw schedule data in order to extract it out.&lt;/p&gt;
&lt;p&gt;Here's the eventual &lt;a href="https://github.com/simonw/.github/blob/f671bf57f7c20a4a7a5b0642837811e37c557499/extract_schedule.py"&gt;extract_schedule.py&lt;/a&gt; Python script it wrote, which uses Playwright to save that &lt;code&gt;schedule-overview-main-1752724893152.js&lt;/code&gt; file and then extracts the raw data using the following code (which calls Node.js inside Python, just so it can use the JavaScript &lt;code&gt;eval()&lt;/code&gt; function):&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-s1"&gt;node_script&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; (
    &lt;span class="pl-s"&gt;"const fs=require('fs');"&lt;/span&gt;
    &lt;span class="pl-s"&gt;f"const d=fs.readFileSync('&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;tmp_path&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;','utf8');"&lt;/span&gt;
    &lt;span class="pl-s"&gt;"const m=d.match(/var oo=(&lt;span class="pl-cce"&gt;\\&lt;/span&gt;{.*?&lt;span class="pl-cce"&gt;\\&lt;/span&gt;});/s);"&lt;/span&gt;
    &lt;span class="pl-s"&gt;"if(!m){throw new Error('not found');}"&lt;/span&gt;
    &lt;span class="pl-s"&gt;"const obj=eval('(' + m[1] + ')');"&lt;/span&gt;
    &lt;span class="pl-s"&gt;f"fs.writeFileSync('&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-c1"&gt;OUTPUT_FILE&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;', JSON.stringify(obj, null, 2));"&lt;/span&gt;
)
&lt;span class="pl-s1"&gt;subprocess&lt;/span&gt;.&lt;span class="pl-c1"&gt;run&lt;/span&gt;([&lt;span class="pl-s"&gt;'node'&lt;/span&gt;, &lt;span class="pl-s"&gt;'-e'&lt;/span&gt;, &lt;span class="pl-s1"&gt;node_script&lt;/span&gt;], &lt;span class="pl-s1"&gt;check&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;)&lt;/pre&gt;
&lt;p&gt;As instructed, it then filed &lt;a href="https://github.com/simonw/.github/pull/1"&gt;a PR against my repo&lt;/a&gt;. It included the Python Playwright script, but more importantly it also included that full extracted &lt;a href="https://github.com/simonw/.github/blob/f671bf57f7c20a4a7a5b0642837811e37c557499/schedule.json"&gt;schedule.json&lt;/a&gt; file. That meant I now had the schedule data, with a &lt;code&gt;raw.githubusercontent.com&lt;/code&gt; URL with open CORS headers that could be fetched by a web app!&lt;/p&gt;
&lt;h4 id="building-the-web-app"&gt;Building the web app&lt;/h4&gt;
&lt;p&gt;Now that I had the data, the next step was to build a web application to preview it and serve it up in a more useful format.&lt;/p&gt;
&lt;p&gt;I decided I wanted two things: a nice mobile-friendly interface for browsing the schedule, and a mechanism for importing that schedule into a calendar application, such as Apple or Google Calendar.&lt;/p&gt;
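&lt;p&gt;The calendar side is simpler than it might sound: an iCalendar (ICS) file is just plain text with one &lt;code&gt;VEVENT&lt;/code&gt; block per session. A minimal hand-rolled sketch - the field names here are my illustrative guesses, not necessarily what schedule.json actually uses:&lt;/p&gt;

```python
from datetime import datetime, timedelta

def schedule_to_ics(sessions: list[dict]) -> str:
    """Render a list of sessions as a minimal iCalendar file.
    Each session dict is assumed to have 'title', 'where', a 'start'
    datetime and a 'length' in minutes - illustrative field names."""
    fmt = "%Y%m%dT%H%M%S"
    lines = ["BEGIN:VCALENDAR", "VERSION:2.0", "PRODID:-//example//schedule//EN"]
    for s in sessions:
        end = s["start"] + timedelta(minutes=s["length"])
        lines += [
            "BEGIN:VEVENT",
            f"DTSTART:{s['start'].strftime(fmt)}",
            f"DTEND:{end.strftime(fmt)}",
            f"SUMMARY:{s['title']}",
            f"LOCATION:{s['where']}",
            "END:VEVENT",
        ]
    lines.append("END:VCALENDAR")
    return "\r\n".join(lines)  # RFC 5545 requires CRLF line endings

ics = schedule_to_ics([{
    "title": "Welcome to Open Sauce",
    "where": "Industry Stage",
    "start": datetime(2025, 7, 18, 10, 0),
    "length": 30,
}])
print(ics.splitlines()[0])  # BEGIN:VCALENDAR
```

&lt;p&gt;In a web app the equivalent string would be built in JavaScript and handed to the browser as a Blob download, which is how a "Download Calendar (ICS)" button can work without any server-side code.&lt;/p&gt;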
&lt;p&gt;It took me several false starts to get this to work. The biggest challenge was getting that 63KB of schedule JSON data into the app. I tried a few approaches here, all on my iPhone while sitting in a coffee shop and later while driving with a friend to drop them off at the closest BART station.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Using ChatGPT Canvas and o3, since unlike Claude Artifacts a Canvas can fetch data from remote URLs if you allow-list that domain. I later found out that &lt;a href="https://chatgpt.com/share/687948b7-e8b8-8006-a450-0c07bdfd7f85"&gt;this had worked&lt;/a&gt; when I viewed it on my laptop, but on my phone it threw errors so I gave up on it.&lt;/li&gt;
&lt;li&gt;Uploading the JSON to Claude and telling it to build an artifact that read the file directly - this &lt;a href="https://claude.ai/share/25297074-37a9-4583-bc2f-630f6dea5c5d"&gt;failed with an error&lt;/a&gt; "undefined is not an object (evaluating 'window.fs.readFile')". The Claude 4 system prompt &lt;a href="https://simonwillison.net/2025/May/25/claude-4-system-prompt/#artifacts-the-missing-manual"&gt;had led me to expect this to work&lt;/a&gt;; I'm not sure why it didn't.&lt;/li&gt;
&lt;li&gt;Having Claude copy the full JSON into the artifact. This took too long - typing out 63KB of JSON is not a sensible use of LLM tokens, and it flaked out on me when my connection went intermittent driving through a tunnel.&lt;/li&gt;
&lt;li&gt;Telling Claude to fetch from the URL to that schedule JSON instead. This was my last resort because the Claude Artifacts UI blocks access to external URLs, so you have to copy and paste the code out to a separate interface (on an iPhone, which still lacks a "select all" button) making for a frustrating process.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;That final option worked! Here's the full sequence of prompts I used with Claude to get to a working implementation - &lt;a href="https://claude.ai/share/e391bbcc-09a2-4f86-9bec-c6def8fc8dc9"&gt;full transcript here&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Use your analyst tool to read this JSON file and show me the top level keys&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This was to prime Claude - I wanted to remind it about its &lt;code&gt;window.fs.readFile&lt;/code&gt; function and have it read enough of the JSON to understand the structure.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Build an artifact with no react that turns the schedule into a nice mobile friendly webpage - there are three days Friday, Saturday and Sunday, which corresponded to the 25th and 26th and 27th of July 2025&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Don’t copy the raw JSON over to the artifact - use your fs function to read it instead&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Also include a button to download ICS at the top of the page which downloads a ICS version of the schedule&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I had noticed that the schedule data had keys for "friday" and "saturday" and "sunday" but no indication of the dates, so I told it those. It turned out later I'd got these wrong!&lt;/p&gt;
&lt;p&gt;This got me a version of the page that failed with an error, because that &lt;code&gt;fs.readFile()&lt;/code&gt; couldn't load the data from the artifact for some reason. So I fixed that with:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Change it so instead of using the readFile thing it fetches the same JSON from  https://raw.githubusercontent.com/simonw/.github/f671bf57f7c20a4a7a5b0642837811e37c557499/schedule.json&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;... then copied the HTML out to a Gist and previewed it with &lt;a href="https://gistpreview.github.io/"&gt;gistpreview.github.io&lt;/a&gt; - here's &lt;a href="https://gistpreview.github.io/?06a5d1f3bf0af81d55a411f32b2f37c7"&gt;that preview&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Then we spot-checked it, since there are &lt;em&gt;so many ways&lt;/em&gt; this could have gone wrong. Thankfully the schedule JSON itself never round-tripped through an LLM so we didn't need to worry about hallucinated session details, but this was almost pure vibe coding so there was a big risk of a mistake sneaking through.&lt;/p&gt;
&lt;p&gt;I'd set myself a deadline of "by the time we drop my friend at the BART station" and I hit that deadline with just seconds to spare. I pasted the resulting HTML &lt;a href="https://github.com/simonw/tools/blob/main/open-sauce-2025.html"&gt;into my simonw/tools GitHub repo&lt;/a&gt; using the GitHub mobile web interface which deployed it to that final &lt;a href="https://tools.simonwillison.net/open-sauce-2025"&gt;tools.simonwillison.net/open-sauce-2025&lt;/a&gt; URL.&lt;/p&gt;
&lt;p&gt;... then we noticed that we &lt;em&gt;had&lt;/em&gt; missed a bug: I had given it the dates of "25th and 26th and 27th of July 2025" but that was actually a week too late - the correct dates were July 18th-20th.&lt;/p&gt;
&lt;p&gt;Thankfully I have Codex configured against my &lt;code&gt;simonw/tools&lt;/code&gt; repo as well, so fixing that was a case of prompting a new Codex session with:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;The open sauce schedule got the dates wrong - Friday is 18 July 2025 and Saturday is 19 and Sunday is 20 - fix it&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://chatgpt.com/s/cd_68794c97a3d88191a2cbe9de78103334"&gt;that Codex transcript&lt;/a&gt;, which resulted in &lt;a href="https://github.com/simonw/tools/pull/34"&gt;this PR&lt;/a&gt; which I landed and deployed, again using the GitHub mobile web interface.&lt;/p&gt;
&lt;h4 id="what-this-all-demonstrates"&gt;What this all demonstrates&lt;/h4&gt;
&lt;p&gt;So, to recap: I was able to scrape a website (without even a view source tool), turn the resulting JSON data into a mobile-friendly website, add an ICS export feature and deploy the results to a static hosting platform (GitHub Pages), working entirely on my phone.&lt;/p&gt;
&lt;p&gt;If I'd had a laptop this project would have been faster, but honestly aside from a little bit more hands-on debugging I wouldn't have gone about it in a particularly different way.&lt;/p&gt;
&lt;p&gt;I was able to do other stuff at the same time - the Codex scraping project ran entirely autonomously, and the app build was only more involved because I had to work around the limitations my tools placed on fetching data from external sources.&lt;/p&gt;
&lt;p&gt;As usual with this stuff, my 25+ years of previous web development experience was critical to being able to execute the project. I knew about Codex, and Artifacts, and GitHub, and Playwright, and CORS headers, and Artifacts sandbox limitations, and the capabilities of ICS files on mobile phones.&lt;/p&gt;
&lt;p&gt;This whole thing was &lt;em&gt;so much fun!&lt;/em&gt; Being able to spin up multiple coding agents directly from my phone and have them solve quite complex problems while only paying partial attention to the details is a solid demonstration of why I continue to enjoy exploring the edges of &lt;a href="https://simonwillison.net/tags/ai-assisted-programming/"&gt;AI-assisted programming&lt;/a&gt;.&lt;/p&gt;

&lt;h4 id="update-i-removed-the-speaker-avatars"&gt;Update: I removed the speaker avatars&lt;/h4&gt;
&lt;p&gt;Here's a beautiful cautionary tale about the dangers of vibe-coding on a phone with no access to performance profiling tools. A commenter on Hacker News &lt;a href="https://news.ycombinator.com/item?id=44597405#44597808"&gt;pointed out&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The web app makes 176 requests and downloads 130 megabytes.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And yeah, it did! Turns out those speaker avatar images weren't optimized, and there were over 170 of them.&lt;/p&gt;
&lt;p&gt;I told &lt;a href="https://chatgpt.com/s/cd_6879631d99c48191b1ab7f84dfab8dea"&gt;a fresh Codex instance&lt;/a&gt; "Remove the speaker avatar images from open-sauce-2025.html" and now the page weighs 93.58 KB - about 1,400 times smaller!&lt;/p&gt;
&lt;h4 id="update-2-improved-accessibility"&gt;Update 2: Improved accessibility&lt;/h4&gt;
&lt;p&gt;That same commenter &lt;a href="https://news.ycombinator.com/item?id=44597405#44597808"&gt;on Hacker News&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It's also &lt;code&gt;&amp;lt;div&amp;gt;&lt;/code&gt; soup and largely inaccessible.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Yeah, this HTML isn't great:&lt;/p&gt;
&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-s1"&gt;dayContainer&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;innerHTML&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;sessions&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;map&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;session&lt;/span&gt; &lt;span class="pl-c1"&gt;=&amp;gt;&lt;/span&gt; `
    &amp;lt;div class="session-card"&amp;gt;
        &amp;lt;div class="session-header"&amp;gt;
            &amp;lt;div&amp;gt;
                &amp;lt;span class="session-time"&amp;gt;&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;${&lt;/span&gt;&lt;span class="pl-s1"&gt;session&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;time&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;&amp;lt;/span&amp;gt;
                &amp;lt;span class="length-badge"&amp;gt;&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;${&lt;/span&gt;&lt;span class="pl-s1"&gt;session&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;length&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt; min&amp;lt;/span&amp;gt;
            &amp;lt;/div&amp;gt;
            &amp;lt;div class="session-location"&amp;gt;&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;${&lt;/span&gt;&lt;span class="pl-s1"&gt;session&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;where&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;&amp;lt;/&lt;span class="pl-s1"&gt;div&lt;/span&gt;&lt;span class="pl-c1"&gt;&amp;gt;&lt;/span&gt;
        &amp;lt;/&lt;span class="pl-s1"&gt;div&lt;/span&gt;&lt;span class="pl-c1"&gt;&amp;gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I &lt;a href="https://github.com/simonw/tools/issues/36"&gt;opened an issue&lt;/a&gt; and had both Claude Code and Codex look at it. Claude Code &lt;a href="https://github.com/simonw/tools/issues/36#issuecomment-3085516331"&gt;failed to submit a PR&lt;/a&gt; for some reason, but Codex &lt;a href="https://github.com/simonw/tools/pull/37"&gt;opened one&lt;/a&gt; with a fix that sounded good to me when I tried it with VoiceOver on iOS (using &lt;a href="https://codex-make-open-sauce-2025-h.tools-b1q.pages.dev/open-sauce-2025"&gt;a Cloudflare Pages preview&lt;/a&gt;) so I landed that. Here's &lt;a href="https://github.com/simonw/tools/commit/29c8298363869bbd4b4e7c51378c20dc8ac30c39"&gt;the diff&lt;/a&gt;, which added a hidden "skip to content" link, some &lt;code&gt;aria-&lt;/code&gt; attributes on buttons and upgraded the HTML to use &lt;code&gt;&amp;lt;h3&amp;gt;&lt;/code&gt; for the session titles.&lt;/p&gt;
&lt;p&gt;Next time I'll remember to specify accessibility as a requirement in the initial prompt. I'm disappointed that Claude didn't consider that without me having to ask.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/definitions"&gt;definitions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/icalendar"&gt;icalendar&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mobile"&gt;mobile&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tools"&gt;tools&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/playwright"&gt;playwright&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatgpt"&gt;chatgpt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-artifacts"&gt;claude-artifacts&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-agents"&gt;ai-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vibe-coding"&gt;vibe-coding&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/async-coding-agents"&gt;async-coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-to-app"&gt;prompt-to-app&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="definitions"/><category term="github"/><category term="icalendar"/><category term="mobile"/><category term="scraping"/><category term="tools"/><category term="ai"/><category term="playwright"/><category term="openai"/><category term="generative-ai"/><category term="chatgpt"/><category term="llms"/><category term="ai-assisted-programming"/><category term="claude"/><category term="claude-artifacts"/><category term="ai-agents"/><category term="vibe-coding"/><category term="coding-agents"/><category term="async-coding-agents"/><category term="prompt-to-app"/></entry><entry><title>shot-scraper 1.8</title><link href="https://simonwillison.net/2025/Mar/25/shot-scraper/#atom-tag" rel="alternate"/><published>2025-03-25T01:59:38+00:00</published><updated>2025-03-25T01:59:38+00:00</updated><id>https://simonwillison.net/2025/Mar/25/shot-scraper/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/shot-scraper/releases/tag/1.8"&gt;shot-scraper 1.8&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I've added a new feature to &lt;a href="https://shot-scraper.datasette.io/"&gt;shot-scraper&lt;/a&gt; that makes it easier to share scripts for other people to use with the &lt;a href="https://shot-scraper.datasette.io/en/stable/javascript.html"&gt;shot-scraper javascript&lt;/a&gt; command.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;shot-scraper javascript&lt;/code&gt; lets you load up a web page in an invisible Chrome browser (via Playwright), execute some JavaScript against that page and output the results to your terminal. It's a fun way of running complex screen-scraping routines as part of a terminal session, or even chained together with other commands using pipes.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;-i/--input&lt;/code&gt; option lets you load that JavaScript from a file on disk - but now you can also use a &lt;code&gt;gh:&lt;/code&gt; prefix to specify loading code from GitHub instead.&lt;/p&gt;
&lt;p&gt;To quote &lt;a href="https://github.com/simonw/shot-scraper/releases/tag/1.8"&gt;the release notes&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;shot-scraper javascript&lt;/code&gt; can now optionally &lt;a href="https://shot-scraper.datasette.io/en/stable/javascript.html#running-javascript-from-github"&gt;load scripts hosted on GitHub&lt;/a&gt; via the new &lt;code&gt;gh:&lt;/code&gt; prefix to the &lt;code&gt;shot-scraper javascript -i/--input&lt;/code&gt; option. &lt;a href="https://github.com/simonw/shot-scraper/issues/173"&gt;#173&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Scripts can be referenced as &lt;code&gt;gh:username/repo/path/to/script.js&lt;/code&gt; or, if the GitHub user has created a dedicated &lt;code&gt;shot-scraper-scripts&lt;/code&gt; repository and placed scripts in the root of it, using &lt;code&gt;gh:username/name-of-script&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;For example, to run this &lt;a href="https://github.com/simonw/shot-scraper-scripts/blob/main/readability.js"&gt;readability.js&lt;/a&gt; script against any web page you can use the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;shot-scraper javascript --input gh:simonw/readability \
  https://simonwillison.net/2025/Mar/24/qwen25-vl-32b/
&lt;/code&gt;&lt;/pre&gt;
&lt;/blockquote&gt;
&lt;p&gt;The &lt;a href="https://gist.github.com/simonw/60e196ec39a5a75dcabfd75fbe911a4c"&gt;output from that example&lt;/a&gt; starts like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
    &lt;span class="pl-ent"&gt;"title"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Qwen2.5-VL-32B: Smarter and Lighter&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"byline"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Simon Willison&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"dir"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"lang"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;en-gb&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"content"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&amp;lt;div id=&lt;span class="pl-cce"&gt;\"&lt;/span&gt;readability-page-1&lt;span class="pl-cce"&gt;\"...&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;My &lt;a href="https://github.com/simonw/shot-scraper-scripts"&gt;simonw/shot-scraper-scripts&lt;/a&gt; repo only has that one file in it so far, but I'm looking forward to growing that collection and hopefully seeing other people create and share their own &lt;code&gt;shot-scraper-scripts&lt;/code&gt; repos as well.&lt;/p&gt;
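&lt;p&gt;Under the hood, a &lt;code&gt;gh:&lt;/code&gt; reference only needs to be mapped to a &lt;code&gt;raw.githubusercontent.com&lt;/code&gt; URL before the script is fetched. Here's a sketch of that resolution logic - my reconstruction from the documented behaviour, not shot-scraper's actual implementation; the branch name and URL layout are assumptions:&lt;/p&gt;

```python
def resolve_gh(spec: str, branch: str = "main") -> str:
    """Map a gh: script reference to a raw.githubusercontent.com URL.
    Two documented forms:
      gh:username/repo/path/to/script.js -> that file in that repo
      gh:username/name-of-script        -> name-of-script.js in the
        user's shot-scraper-scripts repo
    Branch and URL layout are assumptions for illustration."""
    path = spec.removeprefix("gh:")
    parts = path.split("/")
    if len(parts) == 2:  # shorthand form: username/name-of-script
        user, name = parts
        return (f"https://raw.githubusercontent.com/{user}/"
                f"shot-scraper-scripts/{branch}/{name}.js")
    user, repo, *rest = parts  # explicit form: username/repo/path...
    return f"https://raw.githubusercontent.com/{user}/{repo}/{branch}/{'/'.join(rest)}"

print(resolve_gh("gh:simonw/readability"))
```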
&lt;p&gt;This feature is an imitation of &lt;a href="https://github.com/simonw/llm/issues/809"&gt;a similar feature&lt;/a&gt; that's coming in the next release of LLM.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/annotated-release-notes"&gt;annotated-release-notes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/playwright"&gt;playwright&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/shot-scraper"&gt;shot-scraper&lt;/a&gt;&lt;/p&gt;



</summary><category term="github"/><category term="javascript"/><category term="projects"/><category term="scraping"/><category term="annotated-release-notes"/><category term="playwright"/><category term="shot-scraper"/></entry><entry><title>Cutting-edge web scraping techniques at NICAR</title><link href="https://simonwillison.net/2025/Mar/8/cutting-edge-web-scraping/#atom-tag" rel="alternate"/><published>2025-03-08T19:25:36+00:00</published><updated>2025-03-08T19:25:36+00:00</updated><id>https://simonwillison.net/2025/Mar/8/cutting-edge-web-scraping/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/nicar-2025-scraping/blob/main/README.md"&gt;Cutting-edge web scraping techniques at NICAR&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Here's the handout for a workshop I presented this morning at &lt;a href="https://www.ire.org/training/conferences/nicar-2025/"&gt;NICAR 2025&lt;/a&gt; on web scraping, focusing on lesser-known tips and tricks that became possible only with recent developments in LLMs.&lt;/p&gt;
&lt;p&gt;For workshops like this I like to work off an extremely detailed handout, so that people can move at their own pace or catch up later if they didn't get everything done.&lt;/p&gt;
&lt;p&gt;The workshop consisted of four parts:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ol&gt;
&lt;li&gt;Building a &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraper&lt;/a&gt; - an automated scraper in GitHub Actions that records changes to a resource over time&lt;/li&gt;
&lt;li&gt;Using in-browser JavaScript and then &lt;a href="https://shot-scraper.datasette.io/"&gt;shot-scraper&lt;/a&gt; to extract useful information&lt;/li&gt;
&lt;li&gt;Using &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; with both OpenAI and Google Gemini to extract structured data from unstructured websites&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2024/Oct/17/video-scraping/"&gt;Video scraping&lt;/a&gt; using &lt;a href="https://aistudio.google.com/"&gt;Google AI Studio&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
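&lt;p&gt;The first of those parts - the Git scraper - boils down to a small scheduled GitHub Actions workflow that fetches a resource and commits it only when it has changed, so the repo's history becomes a changelog of the data. A hedged sketch of that pattern (the URL is a placeholder and the exact steps will vary by scraper):&lt;/p&gt;

```yaml
name: Scrape latest data
on:
  schedule:
    - cron: "0 * * * *"   # run hourly
  workflow_dispatch: {}    # allow manual runs too
jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Fetch the resource
        run: curl -s https://example.com/data.json -o data.json
      - name: Commit if changed
        run: |
          git config user.name "github-actions"
          git config user.email "actions@users.noreply.github.com"
          git add -A
          git diff --staged --quiet || git commit -m "Latest data"
          git push
```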
&lt;p&gt;I released several new tools in preparation for this workshop (I call this "NICAR Driven Development"):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/git-scraper-template"&gt;git-scraper-template&lt;/a&gt; template repository for quickly setting up new Git scrapers, which I &lt;a href="https://simonwillison.net/2025/Feb/26/git-scraper-template/"&gt;wrote about here&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2025/Feb/28/llm-schemas/"&gt;LLM schemas&lt;/a&gt;, finally adding structured schema support to my LLM tool&lt;/li&gt;
&lt;li&gt;&lt;a href="https://shot-scraper.datasette.io/en/stable/har.html"&gt;shot-scraper har&lt;/a&gt;  for archiving pages as HTML Archive files - though I cut this from the workshop for time&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I also came up with a fun way to distribute API keys for workshop participants: I &lt;a href="https://claude.ai/share/8d3330c8-7fd4-46d1-93d4-a3bd05915793"&gt;had Claude build me&lt;/a&gt; a web page where I can create an encrypted message with a passphrase, then share a URL to that page with users and give them the passphrase to unlock the encrypted message. You can try that at &lt;a href="https://tools.simonwillison.net/encrypt"&gt;tools.simonwillison.net/encrypt&lt;/a&gt; - or &lt;a href="https://tools.simonwillison.net/encrypt#5ZeXCdZ5pqCcHqE1y0aGtoIijlUW+ipN4gjQV4A2/6jQNovxnDvO6yoohgxBIVWWCN8m6ppAdjKR41Qzyq8Keh0RP7E="&gt;use this link&lt;/a&gt; and enter the passphrase "demo":&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of a message encryption/decryption web interface showing the title &amp;quot;Encrypt / decrypt message&amp;quot; with two tab options: &amp;quot;Encrypt a message&amp;quot; and &amp;quot;Decrypt a message&amp;quot; (highlighted). Below shows a decryption form with text &amp;quot;This page contains an encrypted message&amp;quot;, a passphrase input field with dots, a blue &amp;quot;Decrypt message&amp;quot; button, and a revealed message saying &amp;quot;This is a secret message&amp;quot;." src="https://static.simonwillison.net/static/2025/encrypt-decrypt.jpg" /&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/speaking"&gt;speaking&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/shot-scraper"&gt;shot-scraper&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nicar"&gt;nicar&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-artifacts"&gt;claude-artifacts&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-to-app"&gt;prompt-to-app&lt;/a&gt;&lt;/p&gt;



</summary><category term="scraping"/><category term="speaking"/><category term="ai"/><category term="git-scraping"/><category term="shot-scraper"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="claude"/><category term="gemini"/><category term="nicar"/><category term="claude-artifacts"/><category term="prompt-to-app"/></entry><entry><title>monolith</title><link href="https://simonwillison.net/2025/Mar/6/monolith/#atom-tag" rel="alternate"/><published>2025-03-06T15:37:48+00:00</published><updated>2025-03-06T15:37:48+00:00</updated><id>https://simonwillison.net/2025/Mar/6/monolith/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/Y2Z/monolith"&gt;monolith&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Neat CLI tool built in Rust that can create a single packaged HTML file of a web page plus all of its dependencies.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cargo install monolith  # or: brew install monolith
monolith https://simonwillison.net/ &amp;gt; simonwillison.html
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That command produced &lt;a href="https://static.simonwillison.net/static/2025/simonwillison.html"&gt;this 1.5MB single file result&lt;/a&gt;. All of the linked images, CSS and JavaScript assets have had their contents inlined into base64 URIs in their &lt;code&gt;src=&lt;/code&gt; and &lt;code&gt;href=&lt;/code&gt; attributes.&lt;/p&gt;
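The inlining step monolith performs (in Rust) can be sketched in a few lines of Python: an asset's bytes become a base64 `data:` URI suitable for dropping into a `src=` or `href=` attribute.

```python
import base64

def to_data_uri(content: bytes, mime: str) -> str:
    # encode the asset bytes and prefix with the data: URI scheme
    return f"data:{mime};base64,{base64.b64encode(content).decode('ascii')}"

css = b"body { color: red }"
print(to_data_uri(css, "text/css"))
```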
&lt;p&gt;I was intrigued as to how it works, so I dumped the whole repository into Gemini 2.0 Pro and asked for an architectural summary:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cd /tmp
git clone https://github.com/Y2Z/monolith
cd monolith
files-to-prompt . -c | llm -m gemini-2.0-pro-exp-02-05 \
  -s 'architectural overview as markdown'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/2c80749935ae3339d6f7175dc7cf325b"&gt;what I got&lt;/a&gt;. Short version: it uses the &lt;code&gt;reqwest&lt;/code&gt;, &lt;code&gt;html5ever&lt;/code&gt;, &lt;code&gt;markup5ever_rcdom&lt;/code&gt; and &lt;code&gt;cssparser&lt;/code&gt; crates to fetch and parse HTML and CSS and extract, combine and rewrite the assets. It doesn't currently attempt to run any JavaScript.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=42933383#42935115"&gt;Comment on Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rust"&gt;rust&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/files-to-prompt"&gt;files-to-prompt&lt;/a&gt;&lt;/p&gt;



</summary><category term="cli"/><category term="scraping"/><category term="ai"/><category term="rust"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="files-to-prompt"/></entry><entry><title>simonw/git-scraper-template</title><link href="https://simonwillison.net/2025/Feb/26/git-scraper-template/#atom-tag" rel="alternate"/><published>2025-02-26T05:34:05+00:00</published><updated>2025-02-26T05:34:05+00:00</updated><id>https://simonwillison.net/2025/Feb/26/git-scraper-template/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/git-scraper-template"&gt;simonw/git-scraper-template&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I built this new GitHub template repository in preparation for a workshop I'm giving at &lt;a href="https://www.ire.org/training/conferences/nicar-2025/"&gt;NICAR&lt;/a&gt; (the data journalism conference) next week on &lt;a href="https://github.com/simonw/nicar-2025-scraping/"&gt;Cutting-edge web scraping techniques&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;One of the topics I'll be covering is &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt; - creating a GitHub repository that uses scheduled GitHub Actions workflows to grab copies of websites and data feeds and store their changes over time using Git.&lt;/p&gt;
&lt;p&gt;This template repository is designed to be the fastest possible way to get started with a new Git scraper: simply &lt;a href="https://github.com/new?template_name=git-scraper-template&amp;amp;template_owner=simonw"&gt;create a new repository from the template&lt;/a&gt;, paste the URL you want to scrape into the &lt;strong&gt;description&lt;/strong&gt; field, and the repository will be initialized with a custom script that scrapes and stores that URL.&lt;/p&gt;
&lt;p&gt;It's modeled after my earlier &lt;a href="https://github.com/simonw/shot-scraper-template"&gt;shot-scraper-template&lt;/a&gt; tool which I described in detail in &lt;a href="https://simonwillison.net/2022/Mar/14/shot-scraper-template/"&gt;Instantly create a GitHub repository to take screenshots of a web page&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The new &lt;code&gt;git-scraper-template&lt;/code&gt; repo took &lt;a href="https://github.com/simonw/git-scraper-template/issues/2#issuecomment-2683871054"&gt;some help from Claude&lt;/a&gt; to figure out. It uses a &lt;a href="https://github.com/simonw/git-scraper-template/blob/a2b12972584099d7c793ee4b38303d94792bf0f0/download.sh"&gt;custom script&lt;/a&gt; to download the provided URL and derive a filename to use based on the URL and the content type, detected using &lt;code&gt;file --mime-type -b "$file_path"&lt;/code&gt; against the downloaded file.&lt;/p&gt;
&lt;p&gt;It also detects if the downloaded content is JSON and, if it is, pretty-prints it using &lt;code&gt;jq&lt;/code&gt; - I find this is a quick way to generate much more useful diffs when the content changes.
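The template's actual download.sh does this in shell with `file` and `jq`; a minimal Python analogue of just the detect-and-pretty-print idea looks like this:

```python
import json

def pretty_if_json(raw: bytes):
    """Return pretty-printed JSON for diff-friendly commits, or None."""
    try:
        parsed = json.loads(raw)
    except ValueError:
        return None  # not JSON: store the downloaded bytes as-is
    # stable indentation means small upstream changes produce small Git diffs
    return json.dumps(parsed, indent=2) + "\n"
```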


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nicar"&gt;nicar&lt;/a&gt;&lt;/p&gt;



</summary><category term="data-journalism"/><category term="git"/><category term="github"/><category term="projects"/><category term="scraping"/><category term="github-actions"/><category term="git-scraping"/><category term="nicar"/></entry><entry><title>Using a Tailscale exit node with GitHub Actions</title><link href="https://simonwillison.net/2025/Feb/23/tailscale-exit-node-with-github-actions/#atom-tag" rel="alternate"/><published>2025-02-23T02:49:32+00:00</published><updated>2025-02-23T02:49:32+00:00</updated><id>https://simonwillison.net/2025/Feb/23/tailscale-exit-node-with-github-actions/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://til.simonwillison.net/tailscale/tailscale-github-actions"&gt;Using a Tailscale exit node with GitHub Actions&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
New TIL. I started running a &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;git scraper&lt;/a&gt; against doge.gov to track changes made to that website over time. The DOGE site runs behind Cloudflare which was blocking requests from the GitHub Actions IP range, but I figured out how to run a Tailscale exit node on my Apple TV and use that to proxy my &lt;a href="https://shot-scraper.datasette.io/"&gt;shot-scraper&lt;/a&gt; requests.&lt;/p&gt;
&lt;p&gt;The scraper is running in &lt;a href="https://github.com/simonw/scrape-doge-gov"&gt;simonw/scrape-doge-gov&lt;/a&gt;. It uses the new &lt;a href="https://shot-scraper.datasette.io/en/stable/har.html"&gt;shot-scraper har&lt;/a&gt; command I added in &lt;a href="https://github.com/simonw/shot-scraper/releases/tag/1.6"&gt;shot-scraper 1.6&lt;/a&gt; (and improved in &lt;a href="https://github.com/simonw/shot-scraper/releases/tag/1.7"&gt;shot-scraper 1.7&lt;/a&gt;).


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tailscale"&gt;tailscale&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/til"&gt;til&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/shot-scraper"&gt;shot-scraper&lt;/a&gt;&lt;/p&gt;



</summary><category term="github"/><category term="scraping"/><category term="github-actions"/><category term="tailscale"/><category term="til"/><category term="git-scraping"/><category term="shot-scraper"/></entry><entry><title>shot-scraper 1.6 with support for HTTP Archives</title><link href="https://simonwillison.net/2025/Feb/13/shot-scraper/#atom-tag" rel="alternate"/><published>2025-02-13T21:02:37+00:00</published><updated>2025-02-13T21:02:37+00:00</updated><id>https://simonwillison.net/2025/Feb/13/shot-scraper/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/shot-scraper/releases/tag/1.6"&gt;shot-scraper 1.6 with support for HTTP Archives&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
New release of my &lt;a href="https://shot-scraper.datasette.io/"&gt;shot-scraper&lt;/a&gt; CLI tool for taking screenshots and scraping web pages.&lt;/p&gt;
&lt;p&gt;The big new feature is &lt;a href="https://en.wikipedia.org/wiki/HAR_(file_format)"&gt;HTTP Archive (HAR)&lt;/a&gt; support. The new &lt;a href="https://shot-scraper.datasette.io/en/stable/har.html"&gt;shot-scraper har command&lt;/a&gt; can now create an archive of a page and all of its dependents like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;shot-scraper har https://datasette.io/
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This produces a &lt;code&gt;datasette-io.har&lt;/code&gt; file (currently 163KB) which is JSON representing the full set of requests used to render that page. Here's &lt;a href="https://gist.github.com/simonw/b1fdf434e460814efdb89c95c354f794"&gt;a copy of that file&lt;/a&gt;. You can visualize that &lt;a href="https://ericduran.github.io/chromeHAR/?url=https://gist.githubusercontent.com/simonw/b1fdf434e460814efdb89c95c354f794/raw/924c1eb12b940ff02cefa2cc068f23c9d3cc5895/datasette.har.json"&gt;here using ericduran.github.io/chromeHAR&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="The HAR viewer shows a line for each of the loaded resources, with options to view timing information" src="https://static.simonwillison.net/static/2025/har-viewer.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;That JSON includes full copies of all of the responses, base64 encoded if they are binary files such as images.&lt;/p&gt;
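Since a HAR file is just JSON, it's easy to post-process. Here's a sketch that walks the entries and decodes those base64 response bodies - the structure follows the HAR 1.2 format, and the function name is my own:

```python
import base64
import json

def har_bodies(path):
    """Return (url, body_size_in_bytes) for each entry in a HAR file."""
    with open(path) as f:
        har = json.load(f)
    out = []
    for entry in har["log"]["entries"]:
        content = entry["response"]["content"]
        body = content.get("text", "")
        if content.get("encoding") == "base64":
            body = base64.b64decode(body)  # binary responses are base64
        else:
            body = body.encode()
        out.append((entry["request"]["url"], len(body)))
    return out
```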
&lt;p&gt;You can add the &lt;code&gt;--zip&lt;/code&gt; flag to instead get a &lt;code&gt;datasette-io.har.zip&lt;/code&gt; file, containing JSON data in &lt;code&gt;har.har&lt;/code&gt; but with the response bodies saved as separate files in that archive.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;shot-scraper multi&lt;/code&gt; command lets you run &lt;code&gt;shot-scraper&lt;/code&gt; against multiple URLs in sequence, specified using a YAML file. That command now takes a &lt;code&gt;--har&lt;/code&gt; option (or &lt;code&gt;--har-zip&lt;/code&gt; or &lt;code&gt;--har-file name-of-file&lt;/code&gt;), &lt;a href="https://shot-scraper.datasette.io/en/stable/multi.html#recording-to-an-http-archive"&gt;described in the documentation&lt;/a&gt;, which will produce a HAR at the same time as taking the screenshots.&lt;/p&gt;
&lt;p&gt;Shots are usually defined in YAML that looks like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;- &lt;span class="pl-ent"&gt;output&lt;/span&gt;: &lt;span class="pl-s"&gt;example.com.png&lt;/span&gt;
  &lt;span class="pl-ent"&gt;url&lt;/span&gt;: &lt;span class="pl-s"&gt;http://www.example.com/&lt;/span&gt;
- &lt;span class="pl-ent"&gt;output&lt;/span&gt;: &lt;span class="pl-s"&gt;w3c.org.png&lt;/span&gt;
  &lt;span class="pl-ent"&gt;url&lt;/span&gt;: &lt;span class="pl-s"&gt;https://www.w3.org/&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You can now omit the &lt;code&gt;output:&lt;/code&gt; keys and generate a HAR file without taking any screenshots at all:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;- &lt;span class="pl-ent"&gt;url&lt;/span&gt;: &lt;span class="pl-s"&gt;http://www.example.com/&lt;/span&gt;
- &lt;span class="pl-ent"&gt;url&lt;/span&gt;: &lt;span class="pl-s"&gt;https://www.w3.org/&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Run like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;shot-scraper multi shots.yml --har
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Which outputs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Skipping screenshot of 'https://www.example.com/'
Skipping screenshot of 'https://www.w3.org/'
Wrote to HAR file: trace.har
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;shot-scraper&lt;/code&gt; is built on top of Playwright, and the new features use the &lt;a href="https://playwright.dev/python/docs/next/api/class-browser#browser-new-context-option-record-har-path"&gt;browser.new_context(record_har_path=...)&lt;/a&gt; parameter.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/playwright"&gt;playwright&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/shot-scraper"&gt;shot-scraper&lt;/a&gt;&lt;/p&gt;



</summary><category term="cli"/><category term="projects"/><category term="python"/><category term="scraping"/><category term="playwright"/><category term="shot-scraper"/></entry><entry><title>Video scraping: extracting JSON data from a 35 second screen capture for less than 1/10th of a cent</title><link href="https://simonwillison.net/2024/Oct/17/video-scraping/#atom-tag" rel="alternate"/><published>2024-10-17T12:32:47+00:00</published><updated>2024-10-17T12:32:47+00:00</updated><id>https://simonwillison.net/2024/Oct/17/video-scraping/#atom-tag</id><summary type="html">
    &lt;p&gt;The other day I found myself needing to add up some numeric values that were scattered across twelve different emails.&lt;/p&gt;
&lt;p&gt;I didn't particularly feel like copying and pasting all of the numbers out one at a time, so I decided to try something different: could I record a screen capture while browsing around my Gmail account and then extract the numbers from that video using Google Gemini?&lt;/p&gt;
&lt;p&gt;This turned out to work &lt;em&gt;incredibly&lt;/em&gt; well.&lt;/p&gt;
&lt;h4 id="ai-studio-and-quicktime"&gt;AI Studio and QuickTime&lt;/h4&gt;
&lt;p&gt;I recorded the video using QuickTime Player on my Mac: &lt;code&gt;File -&amp;gt; New Screen Recording&lt;/code&gt;. I dragged a box around a portion of my screen containing my Gmail account, then clicked on each of the emails in turn, pausing for a couple of seconds on each one.&lt;/p&gt;
&lt;p&gt;I uploaded the resulting file directly into Google's &lt;a href="https://aistudio.google.com/"&gt;AI Studio&lt;/a&gt; tool and prompted the following:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Turn this into a JSON array where each item has a yyyy-mm-dd date and a floating point dollar amount for that date&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;... and it worked. It spat out a JSON array like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;[
  {
    &lt;span class="pl-ent"&gt;"date"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;2023-01-01&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"amount"&lt;/span&gt;: &lt;span class="pl-c1"&gt;2...&lt;/span&gt;
  },
  &lt;span class="pl-c1"&gt;...&lt;/span&gt;
]&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/video-scraping.jpg" alt="Screenshot of the Google AI Studio interface - I used Gemini 1.5 Flash 0002, a 35 second screen recording video (which was 10,326 tokens) and the token count says 11,018/1,000,000 - the screenshot redacts some details but you can see the start of the JSON output with date and amount keys in a list" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I wanted to paste that into Numbers, so I followed up with:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;turn that into copy-pastable csv&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Which gave me back the same data formatted as CSV.&lt;/p&gt;
&lt;p&gt;You should never trust these things not to make mistakes, so I re-watched the 35 second video and manually checked the numbers. It got everything right.&lt;/p&gt;
&lt;p&gt;I had intended to use Gemini 1.5 Pro, aka Google's best model... but it turns out I forgot to select the model and I'd actually run the entire process using the much less expensive Gemini 1.5 Flash 002.&lt;/p&gt;
&lt;h4 id="how-much-did-it-cost"&gt;How much did it cost?&lt;/h4&gt;

&lt;p&gt;According to AI Studio I used 11,018 tokens, of which 10,326 were for the video.&lt;/p&gt;
&lt;p&gt;Gemini 1.5 Flash &lt;a href="https://ai.google.dev/pricing#1_5flash"&gt;charges&lt;/a&gt; $0.075/1 million tokens (the price &lt;a href="https://developers.googleblog.com/en/gemini-15-flash-updates-google-ai-studio-gemini-api/"&gt;dropped in August&lt;/a&gt;).&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;11018/1000000 = 0.011018
0.011018 * $0.075 = $0.00082635
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So this entire exercise should have cost me just under 1/10th of a cent!&lt;/p&gt;
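That arithmetic generalizes into a tiny helper (the prices here are the ones from the post, not current ones):

```python
def token_cost_dollars(tokens: int, price_per_million: float) -> float:
    # price is quoted per million tokens, so scale down accordingly
    return tokens / 1_000_000 * price_per_million

# the 11,018 tokens from this experiment at $0.075/million
print(token_cost_dollars(11_018, 0.075))  # just under a tenth of a cent
```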
&lt;p&gt;&lt;em&gt;And in fact, it was &lt;strong&gt;free&lt;/strong&gt;. Google AI Studio &lt;a href="https://ai.google.dev/gemini-api/docs/billing#is-AI-Studio-free"&gt;currently&lt;/a&gt; "remains free of charge regardless of if you set up billing across all supported regions". I believe that means they &lt;a href="https://simonwillison.net/2024/Oct/17/gemini-terms-of-service/"&gt;can train on your data&lt;/a&gt; though, which is not the case for their paid APIs.&lt;/em&gt;&lt;/p&gt;
&lt;h4 id="the-alternatives-aren-t-actually-that-great"&gt;The alternatives aren't actually that great&lt;/h4&gt;
&lt;p&gt;Let's consider the alternatives here.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;I could have clicked through the emails and copied out the data manually one at a time. This is error prone and kind of boring. For twelve emails it would have been OK, but for a hundred it would have been a real pain.&lt;/li&gt;
&lt;li&gt;Accessing my Gmail data programmatically. This seems to get harder every year - it's still possible to access it via IMAP right now if you set up a dedicated &lt;a href="https://support.google.com/mail/answer/185833"&gt;app password&lt;/a&gt; but that's a whole lot of work for a one-off scraping task. The &lt;a href="https://developers.google.com/gmail/api/guides"&gt;official API&lt;/a&gt; is no fun at all.&lt;/li&gt;
&lt;li&gt;Some kind of browser automation (Playwright or similar) that can click through my Gmail account for me. Even with an LLM to help write the code this is still a lot more work, and it doesn't help deal with formatting differences in emails either - I'd have to solve the email parsing step separately.&lt;/li&gt;
&lt;li&gt;Using some kind of much more sophisticated pre-existing AI tool that has access to my email. A separate Google product also called Gemini can do this if you grant it access, but my results with that so far haven't been particularly great. AI tools are inherently unpredictable. I'm also nervous about giving any tool full access to my email account due to the risk from things like &lt;a href="https://simonwillison.net/tags/prompt-injection/"&gt;prompt injection&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="video-scraping-is-really-powerful"&gt;Video scraping is really powerful&lt;/h4&gt;
&lt;p&gt;The great thing about this &lt;strong&gt;video scraping&lt;/strong&gt; technique is that it works with &lt;em&gt;anything&lt;/em&gt; that you can see on your screen... and it puts you in total control of what you end up exposing to the AI model.&lt;/p&gt;
&lt;p&gt;There's no level of website authentication or anti-scraping technology that can stop me from recording a video of my screen while I manually click around inside a web application.&lt;/p&gt;
&lt;p&gt;The results I get depend entirely on how thoughtful I was about how I positioned my screen capture area and how I clicked around.&lt;/p&gt;
&lt;p&gt;There is &lt;em&gt;no setup cost&lt;/em&gt; for this at all - sign into a site, hit record, browse around a bit and then dump the video into Gemini.&lt;/p&gt;
&lt;p&gt;And the cost is so low that I had to re-run my calculations three times to make sure I hadn't made a mistake.&lt;/p&gt;
&lt;p&gt;I expect I'll be using this technique a whole lot more in the future. It also has applications in the data journalism world, which frequently involves the need to scrape data from sources that really don't want to be scraped.&lt;/p&gt;

&lt;h4 id="a-note-on-reliability"&gt;A note on reliability&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Added 22nd December 2024&lt;/em&gt;. As with anything involving LLMs, it's worth noting that you cannot trust these models to return exactly correct results with 100% reliability. I verified the results here manually through eyeball comparison of the JSON to the underlying video, but in a larger project this may not be feasible. Consider spot-checks or other strategies for double-checking the results, especially if mistakes could have meaningful real-world impact.&lt;/p&gt;

&lt;h4 id="bonus-calculator"&gt;Bonus: An LLM pricing calculator&lt;/h4&gt;

&lt;p&gt;In writing up this experiment I got fed up with having to manually calculate token prices. I usually outsource that to ChatGPT Code Interpreter, but I've caught it &lt;a href="https://gist.github.com/simonw/3a4406eeed70f7f2de604892eb3548c4?permalink_comment_id=5239420#gistcomment-5239420"&gt;messing up the conversion&lt;/a&gt; from dollars to cents once or twice, so I always have to double-check its work.&lt;/p&gt;

&lt;p&gt;So I got Claude 3.5 Sonnet with Claude Artifacts to build me &lt;a href="https://tools.simonwillison.net/llm-prices"&gt;this pricing calculator tool&lt;/a&gt; (&lt;a href="https://github.com/simonw/tools/blob/main/llm-prices.html"&gt;source code here&lt;/a&gt;):&lt;/p&gt;

&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm-pricing-calculator.jpg" alt="Screenshot of LLM Pricing Calculator interface. Left panel: input fields for tokens and costs. Input Tokens: 11018, Output Tokens: empty, Cost per Million Input Tokens: $0.075, Cost per Million Output Tokens: $0.3. Total Cost calculated: $0.000826 or 0.0826 cents. Right panel: Presets for various models including Gemini, Claude, and GPT versions with their respective input/output costs per 1M tokens. Footer: Prices were correct as of 16th October 2024, they may have changed." /&gt;&lt;/p&gt;

&lt;p&gt;You can set the input/output token prices by hand, or click one of the preset buttons to pre-fill it with the prices for different existing models (as of 16th October 2024 - I won't promise that I'll promptly update them in the future!)&lt;/p&gt;

&lt;p&gt;The entire thing was written by Claude. Here's &lt;a href="https://gist.github.com/simonw/6b684b5f7d75fb82034fc963cc487530"&gt;the full conversation transcript&lt;/a&gt; - we spent 19 minutes iterating on it through 10 different versions.&lt;/p&gt;

&lt;p&gt;Rather than hunt down all of those prices myself, I took screenshots of the pricing pages for each of the model providers and dumped those directly into the Claude conversation:&lt;/p&gt;

&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/claude-screenshots.jpg" alt="Claude: Is there anything else you'd like me to adjust or explain about this updated calculator? Me: Add a onkeyup event too, I want that calculator to update as I type. Also add a section underneath the calculator called Presets which lets the user click a model to populate the cost per million fields with that model's prices - which should be shown on the page too. I've dumped in some screenshots of pricing pages you can use - ignore prompt caching prices. There are five attached screenshots of pricing pages for different models." /&gt;&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gmail"&gt;gmail&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-artifacts"&gt;claude-artifacts&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-3-5-sonnet"&gt;claude-3-5-sonnet&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-to-app"&gt;prompt-to-app&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="data-journalism"/><category term="gmail"/><category term="google"/><category term="scraping"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="claude"/><category term="gemini"/><category term="vision-llms"/><category term="claude-artifacts"/><category term="claude-3-5-sonnet"/><category term="prompt-to-app"/></entry><entry><title>Quoting Kieran McCarthy</title><link href="https://simonwillison.net/2024/Feb/28/kieran-mccarthy/#atom-tag" rel="alternate"/><published>2024-02-28T15:15:13+00:00</published><updated>2024-02-28T15:15:13+00:00</updated><id>https://simonwillison.net/2024/Feb/28/kieran-mccarthy/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://blog.ericgoldman.org/archives/2024/02/facebook-drops-anti-scraping-lawsuit-against-bright-data-guest-blog-post.htm"&gt;&lt;p&gt;For the last few years, Meta has had a team of attorneys dedicated to policing unauthorized forms of scraping and data collection on Meta platforms. The decision not to further pursue these claims seems as close to waving the white flag as you can get against these kinds of companies. But why? [...]&lt;/p&gt;
&lt;p&gt;In short, I think Meta cares more about access to large volumes of data and AI than it does about outsiders scraping their public data now. My hunch is that they know that any success in anti-scraping cases can be thrown back at them in their own attempts to build AI training databases and LLMs. And they care more about the latter than the former.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://blog.ericgoldman.org/archives/2024/02/facebook-drops-anti-scraping-lawsuit-against-bright-data-guest-blog-post.htm"&gt;Kieran McCarthy&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/facebook"&gt;facebook&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/training-data"&gt;training-data&lt;/a&gt;&lt;/p&gt;



</summary><category term="facebook"/><category term="scraping"/><category term="ai"/><category term="llms"/><category term="training-data"/></entry><entry><title>scrapeghost</title><link href="https://simonwillison.net/2023/Mar/26/scrapeghost/#atom-tag" rel="alternate"/><published>2023-03-26T05:29:37+00:00</published><updated>2023-03-26T05:29:37+00:00</updated><id>https://simonwillison.net/2023/Mar/26/scrapeghost/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://jamesturk.github.io/scrapeghost/"&gt;scrapeghost&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Scraping is a really interesting application for large language model tools like GPT-3. James Turk’s scrapeghost is a very neatly designed entrant into this space: it’s a Python library and CLI tool that can be pointed at any URL and given a roughly defined schema (using a neat mini schema language), and will then use GPT-3 to scrape the page and try to return the results in the supplied format.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://mastodon.social/@jamesturk/110081261241625224"&gt;@jamesturk&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-3"&gt;gpt-3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-4"&gt;gpt-4&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;&lt;/p&gt;



</summary><category term="cli"/><category term="scraping"/><category term="ai"/><category term="gpt-3"/><category term="generative-ai"/><category term="gpt-4"/><category term="llms"/></entry><entry><title>Quoting Me</title><link href="https://simonwillison.net/2023/Mar/16/gpt4-scraping/#atom-tag" rel="alternate"/><published>2023-03-16T01:09:52+00:00</published><updated>2023-03-16T01:09:52+00:00</updated><id>https://simonwillison.net/2023/Mar/16/gpt4-scraping/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://fedi.simonwillison.net/@simon/110030289294541249"&gt;&lt;p&gt;I expect GPT-4 will have a LOT of applications in web scraping&lt;/p&gt;
&lt;p&gt;The increased 32,000 token limit will be large enough to send it the full DOM of most pages, serialized to HTML - then ask questions to extract data&lt;/p&gt;
&lt;p&gt;Or... take a screenshot and use the GPT4 image input mode to ask questions about the visually rendered page instead!&lt;/p&gt;
&lt;p&gt;Might need to dust off all of those old semantic web dreams, because the world's information is rapidly becoming fully machine readable&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://fedi.simonwillison.net/@simon/110030289294541249"&gt;Me&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/semanticweb"&gt;semanticweb&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-4"&gt;gpt-4&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;&lt;/p&gt;



</summary><category term="scraping"/><category term="semanticweb"/><category term="gpt-4"/><category term="llms"/></entry><entry><title>datasette-scraper walkthrough on YouTube</title><link href="https://simonwillison.net/2023/Jan/29/datasette-scraper-walkthrough/#atom-tag" rel="alternate"/><published>2023-01-29T05:23:42+00:00</published><updated>2023-01-29T05:23:42+00:00</updated><id>https://simonwillison.net/2023/Jan/29/datasette-scraper-walkthrough/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=zrSGnz7ErNI"&gt;datasette-scraper walkthrough on YouTube&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;datasette-scraper is Colin Dellow’s new plugin that turns Datasette into a powerful web scraping tool, with a web UI based on plugin-driven customizations to the Datasette interface. It’s really impressive, and this ten-minute demo shows quite how much it is capable of: it can crawl sitemaps and fetch pages, caching them (using zstandard with optional custom dictionaries for extra compression) to speed up subsequent crawls... and you can add your own plugins to extract structured data from crawled pages and save it to a separate SQLite table!&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://datasette.io/plugins/datasette-scraper"&gt;datasette-scraper&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/plugins"&gt;plugins&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/colin-dellow"&gt;colin-dellow&lt;/a&gt;&lt;/p&gt;



</summary><category term="plugins"/><category term="scraping"/><category term="datasette"/><category term="colin-dellow"/></entry><entry><title>curl-impersonate</title><link href="https://simonwillison.net/2022/Aug/10/curl-impersonate/#atom-tag" rel="alternate"/><published>2022-08-10T15:34:46+00:00</published><updated>2022-08-10T15:34:46+00:00</updated><id>https://simonwillison.net/2022/Aug/10/curl-impersonate/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/lwthiker/curl-impersonate"&gt;curl-impersonate&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;“A special build of curl that can impersonate the four major browsers: Chrome, Edge, Safari &amp;amp; Firefox. curl-impersonate is able to perform TLS and HTTP handshakes that are identical to that of a real browser.”&lt;/p&gt;

&lt;p&gt;I hadn’t realized that it’s become increasingly common for sites to use fingerprinting of TLS and HTTP handshakes to block crawlers. curl-impersonate attempts to impersonate browsers much more accurately, using tricks like compiling with Firefox’s nss TLS library and Chrome’s BoringSSL.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=32409632"&gt;Ask HN: What are the best tools for web scraping in 2022?&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/crawling"&gt;crawling&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/curl"&gt;curl&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;&lt;/p&gt;



</summary><category term="crawling"/><category term="curl"/><category term="scraping"/></entry><entry><title>Web Scraping via Javascript Runtime Heap Snapshots</title><link href="https://simonwillison.net/2022/May/3/web-scraping-via-javascript-runtime-heap-snapshots/#atom-tag" rel="alternate"/><published>2022-05-03T00:51:29+00:00</published><updated>2022-05-03T00:51:29+00:00</updated><id>https://simonwillison.net/2022/May/3/web-scraping-via-javascript-runtime-heap-snapshots/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.adriancooney.ie/blog/web-scraping-via-javascript-heap-snapshots"&gt;Web Scraping via Javascript Runtime Heap Snapshots&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This is an absolutely brilliant scraping trick. Adrian Cooney figured out a way to use Puppeteer and the Chrome DevTools protocol to take a heap snapshot of all of the JavaScript running on a web page, then recursively crawl through the heap looking for any JavaScript objects that have a specified selection of properties. This allows him to scrape data from arbitrarily complex client-side web applications. He built a JavaScript library and command line tool that implements the pattern.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/dathanvp/status/1521216735931633664"&gt;Dathan Pattishall&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;&lt;/p&gt;



</summary><category term="javascript"/><category term="scraping"/></entry><entry><title>Scraping web pages from the command line with shot-scraper</title><link href="https://simonwillison.net/2022/Mar/14/scraping-web-pages-shot-scraper/#atom-tag" rel="alternate"/><published>2022-03-14T01:29:56+00:00</published><updated>2022-03-14T01:29:56+00:00</updated><id>https://simonwillison.net/2022/Mar/14/scraping-web-pages-shot-scraper/#atom-tag</id><summary type="html">
    &lt;p&gt;I've added a powerful new capability to my &lt;strong&gt;&lt;a href="https://github.com/simonw/shot-scraper"&gt;shot-scraper&lt;/a&gt;&lt;/strong&gt; command line browser automation tool: you can now use it to load a web page in a headless browser, execute JavaScript to extract information and return that information back to the terminal as JSON.&lt;/p&gt;
&lt;p&gt;Among other things, this means you can construct Unix pipelines that incorporate a full headless web browser as part of their processing.&lt;/p&gt;
&lt;p&gt;It's also a really neat web scraping tool.&lt;/p&gt;
&lt;h4&gt;shot-scraper&lt;/h4&gt;
&lt;p&gt;I &lt;a href="https://simonwillison.net/2022/Mar/10/shot-scraper/"&gt;introduced shot-scraper&lt;/a&gt; last Thursday. It's a Python utility that wraps &lt;a href="https://playwright.dev/"&gt;Playwright&lt;/a&gt;, providing both a command line interface and a YAML-driven configuration flow for automating the process of taking screenshots of web pages.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% pip install shot-scraper
% shot-scraper https://simonwillison.net/ --height 800
Screenshot of 'https://simonwillison.net/' written to 'simonwillison-net.png'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/simonwillison-net.png" alt="Screenshot of my blog homepage" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Since Thursday &lt;code&gt;shot-scraper&lt;/code&gt; has had &lt;a href="https://github.com/simonw/shot-scraper/releases"&gt;a flurry of releases&lt;/a&gt;, adding features like &lt;a href="https://github.com/simonw/shot-scraper/blob/0.9/README.md#saving-a-web-page-to-pdf"&gt;PDF exports&lt;/a&gt;, the ability to dump the Chromium &lt;a href="https://github.com/simonw/shot-scraper/blob/0.9/README.md#dumping-out-an-accessibility-tree"&gt;accessibility tree&lt;/a&gt; and the ability to take screenshots of &lt;a href="https://github.com/simonw/shot-scraper/blob/0.9/README.md#websites-that-need-authentication"&gt;authenticated web pages&lt;/a&gt;. But the most exciting new feature landed today.&lt;/p&gt;
&lt;h4&gt;Executing JavaScript and returning the result&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/shot-scraper/releases/tag/0.9"&gt;Release 0.9&lt;/a&gt; takes the tool in a new direction. The following command will execute JavaScript on the page and return the resulting value:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% shot-scraper javascript simonwillison.net document.title
"Simon Willison\u2019s Weblog"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or you can return a JSON object:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% shot-scraper javascript https://datasette.io/ "({
  title: document.title,
  tagline: document.querySelector('.tagline').innerText
})"
{
  "title": "Datasette: An open source multi-tool for exploring and publishing data",
  "tagline": "An open source multi-tool for exploring and publishing data"
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or if you want to use functions like &lt;code&gt;setTimeout()&lt;/code&gt; - for example, if you want to insert a delay to allow an animation to finish before running the rest of your code - you can return a promise:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% shot-scraper javascript datasette.io "
new Promise(done =&amp;gt; setTimeout(
  () =&amp;gt; {
    done({
      title: document.title,
      tagline: document.querySelector('.tagline').innerText
    });
  }, 1000
));"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Errors that occur in the JavaScript turn into an exit code of 1 returned by the tool - which means you can also use this to execute simple tests in a CI flow. This example will fail a GitHub Actions workflow if the extracted page title is not the expected value:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;- &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Test page title&lt;/span&gt;
  &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;    shot-scraper javascript datasette.io "&lt;/span&gt;
&lt;span class="pl-s"&gt;      if (document.title != 'Datasette') {&lt;/span&gt;
&lt;span class="pl-s"&gt;        throw 'Wrong title detected';&lt;/span&gt;
&lt;span class="pl-s"&gt;      }"&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h4 id="scrape-a-web-page"&gt;Using this to scrape a web page&lt;/h4&gt;
&lt;p&gt;The most exciting use case for this new feature is web scraping. I'll illustrate that with an example.&lt;/p&gt;
&lt;p&gt;Posts from my blog occasionally show up on &lt;a href="https://news.ycombinator.com/"&gt;Hacker News&lt;/a&gt; - sometimes I spot them, sometimes I don't.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://news.ycombinator.com/from?site=simonwillison.net"&gt;https://news.ycombinator.com/from?site=simonwillison.net&lt;/a&gt; is a Hacker News page showing content from the specified domain. It's really useful, but it sadly isn't included in the official &lt;a href="https://github.com/HackerNews/API"&gt;Hacker News API&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/news-ycombinator-com-from.png" alt="Screenshot of the Hacker News listing for my domain" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;So... let's write a scraper for it.&lt;/p&gt;
&lt;p&gt;I started out running the Firefox developer console against that page, trying to figure out the right JavaScript to extract the data I was interested in. I came up with this:&lt;/p&gt;

&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-v"&gt;Array&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;from&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelectorAll&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'.athing'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s1"&gt;el&lt;/span&gt; &lt;span class="pl-c1"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
  &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;title&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;el&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelector&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'.titleline a'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;innerText&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;points&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;parseInt&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;el&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;nextSibling&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelector&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'.score'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;innerText&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;url&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;el&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelector&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'.titleline a'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;href&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;dt&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;el&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;nextSibling&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelector&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'.age'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;title&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;submitter&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;el&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;nextSibling&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelector&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'.hnuser'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;innerText&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;commentsUrl&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;el&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;nextSibling&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelector&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'.age a'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;href&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;id&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;commentsUrl&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;split&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'?id='&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;[&lt;/span&gt;&lt;span class="pl-c1"&gt;1&lt;/span&gt;&lt;span class="pl-kos"&gt;]&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-c"&gt;// Only posts with comments have a comments link&lt;/span&gt;
  &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;commentsLink&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-v"&gt;Array&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;from&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;
    &lt;span class="pl-s1"&gt;el&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;nextSibling&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelectorAll&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'a'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;
  &lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;filter&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;el&lt;/span&gt; &lt;span class="pl-c1"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="pl-s1"&gt;el&lt;/span&gt; &lt;span class="pl-c1"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="pl-s1"&gt;el&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;innerText&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;includes&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'comment'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;[&lt;/span&gt;&lt;span class="pl-c1"&gt;0&lt;/span&gt;&lt;span class="pl-kos"&gt;]&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-k"&gt;let&lt;/span&gt; &lt;span class="pl-s1"&gt;numComments&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;0&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;commentsLink&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
    &lt;span class="pl-s1"&gt;numComments&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;parseInt&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;commentsLink&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;innerText&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;split&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;[&lt;/span&gt;&lt;span class="pl-c1"&gt;0&lt;/span&gt;&lt;span class="pl-kos"&gt;]&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-kos"&gt;}&lt;/span&gt;
  &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;id&lt;span class="pl-kos"&gt;,&lt;/span&gt; title&lt;span class="pl-kos"&gt;,&lt;/span&gt; url&lt;span class="pl-kos"&gt;,&lt;/span&gt; dt&lt;span class="pl-kos"&gt;,&lt;/span&gt; points&lt;span class="pl-kos"&gt;,&lt;/span&gt; submitter&lt;span class="pl-kos"&gt;,&lt;/span&gt; commentsUrl&lt;span class="pl-kos"&gt;,&lt;/span&gt; numComments&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The great thing about modern JavaScript is that everything you could need to write a scraper is already there in the default environment.&lt;/p&gt;
&lt;p&gt;I'm using &lt;code&gt;document.querySelectorAll('.athing')&lt;/code&gt; to loop through each element that matches that selector.&lt;/p&gt;
&lt;p&gt;I wrap that with &lt;code&gt;Array.from(...)&lt;/code&gt;, which accepts a mapping function as its second argument. That function runs once for each matched element, extracting out the details that I need.&lt;/p&gt;
&lt;p&gt;The resulting array contains 30 items that look like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;[
  {
    &lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;30658310&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"title"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Track changes to CLI tools by recording their help output&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"url"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;https://simonwillison.net/2022/Feb/2/help-scraping/&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"dt"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;2022-03-13T05:36:13&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"submitter"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;appwiz&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"commentsUrl"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;https://news.ycombinator.com/item?id=30658310&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"numComments"&lt;/span&gt;: &lt;span class="pl-c1"&gt;19&lt;/span&gt;
  }
]&lt;/pre&gt;&lt;/div&gt;
&lt;h4&gt;Running it with shot-scraper&lt;/h4&gt;
&lt;p&gt;Now that I have a recipe for a scraper, I can run it in the terminal like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;shot-scraper javascript &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;https://news.ycombinator.com/from?site=simonwillison.net&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;Array.from(document.querySelectorAll('.athing'), el =&amp;gt; {&lt;/span&gt;
&lt;span class="pl-s"&gt;  const title = el.querySelector('.titleline a').innerText;&lt;/span&gt;
&lt;span class="pl-s"&gt;  const points = parseInt(el.nextSibling.querySelector('.score').innerText);&lt;/span&gt;
&lt;span class="pl-s"&gt;  const url = el.querySelector('.titleline a').href;&lt;/span&gt;
&lt;span class="pl-s"&gt;  const dt = el.nextSibling.querySelector('.age').title;&lt;/span&gt;
&lt;span class="pl-s"&gt;  const submitter = el.nextSibling.querySelector('.hnuser').innerText;&lt;/span&gt;
&lt;span class="pl-s"&gt;  const commentsUrl = el.nextSibling.querySelector('.age a').href;&lt;/span&gt;
&lt;span class="pl-s"&gt;  const id = commentsUrl.split('?id=')[1];&lt;/span&gt;
&lt;span class="pl-s"&gt;  // Only posts with comments have a comments link&lt;/span&gt;
&lt;span class="pl-s"&gt;  const commentsLink = Array.from(&lt;/span&gt;
&lt;span class="pl-s"&gt;    el.nextSibling.querySelectorAll('a')&lt;/span&gt;
&lt;span class="pl-s"&gt;  ).filter(el =&amp;gt; el &amp;amp;&amp;amp; el.innerText.includes('comment'))[0];&lt;/span&gt;
&lt;span class="pl-s"&gt;  let numComments = 0;&lt;/span&gt;
&lt;span class="pl-s"&gt;  if (commentsLink) {&lt;/span&gt;
&lt;span class="pl-s"&gt;    numComments = parseInt(commentsLink.innerText.split()[0]);&lt;/span&gt;
&lt;span class="pl-s"&gt;  }&lt;/span&gt;
&lt;span class="pl-s"&gt;  return {id, title, url, dt, points, submitter, commentsUrl, numComments};&lt;/span&gt;
&lt;span class="pl-s"&gt;})&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; simonwillison-net.json&lt;/pre&gt;&lt;/div&gt;  
&lt;p&gt;&lt;code&gt;simonwillison-net.json&lt;/code&gt; is now a JSON file containing the scraped data.&lt;/p&gt;
&lt;h4&gt;Running the scraper in GitHub Actions&lt;/h4&gt;
&lt;p&gt;I want to keep track of changes to this data structure over time. My preferred technique for that is something I call &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt; - the core idea is to keep the data in a Git repository and commit a new copy any time it changes. This provides a cheap and robust history of changes over time.&lt;/p&gt;
&lt;p&gt;Running the scraper in GitHub Actions means I don't need to maintain my own server to keep this running.&lt;/p&gt;
&lt;p&gt;So I built exactly that, in the &lt;a href="https://github.com/simonw/scrape-hacker-news-by-domain"&gt;simonw/scrape-hacker-news-by-domain&lt;/a&gt; repository.&lt;/p&gt;
&lt;p&gt;The GitHub Actions workflow is in &lt;a href="https://github.com/simonw/scrape-hacker-news-by-domain/blob/485841482a39869759e39f4d8dee21b9adc963d7/.github/workflows/scrape.yml"&gt;.github/workflows/scrape.yml&lt;/a&gt;. It runs the above command once an hour, then pushes a commit back to the repository should the file have any changes since last time it ran.&lt;/p&gt;
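&lt;p&gt;The commit-if-changed step at the heart of that workflow can be sketched in a few lines of shell. This is a minimal illustration of the pattern, not the actual workflow file (which wraps the same idea in GitHub Actions YAML) - the scratch repository, file contents and commit messages here are stand-ins:&lt;/p&gt;

```shell
# Minimal sketch of the git scraping "commit only if changed" step,
# demonstrated in a scratch repository. In the real workflow the JSON
# file would be written by shot-scraper before this runs.
set -euo pipefail
repo="$(mktemp -d)"; cd "$repo"
git init -q . && git config user.email demo@example.com && git config user.name demo

echo '{"id": "30658310"}' > simonwillison-net.json   # stand-in for scraper output
git add simonwillison-net.json && git commit -q -m "Initial data"

echo '{"id": "30658310"}' > simonwillison-net.json   # unchanged run: nothing to commit
git add simonwillison-net.json
git diff --cached --quiet && echo "no changes, skipping commit"

echo '{"id": "99999999"}' > simonwillison-net.json   # changed run: commit the update
git add simonwillison-net.json
# `git diff --cached --quiet` exits non-zero when staged changes exist
git diff --cached --quiet || git commit -q -m "Latest data: $(date -u +%FT%TZ)"
```

&lt;p&gt;The commit history of such a repository then doubles as a change log for the scraped data.&lt;/p&gt;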
&lt;p&gt;The &lt;a href="https://github.com/simonw/scrape-hacker-news-by-domain/commits/main/simonwillison-net.json"&gt;commit history of simonwillison-net.json&lt;/a&gt; will show me any time a new link from my site appears on Hacker News, or a comment is added.&lt;/p&gt;
&lt;p&gt;(Fun GitHub trick: add &lt;code&gt;.atom&lt;/code&gt; to the end of that URL to get &lt;a href="https://github.com/simonw/scrape-hacker-news-by-domain/commits/main/simonwillison-net.json.atom"&gt;an Atom feed of those commits&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;The whole scraper, from idea to finished implementation, took less than fifteen minutes to build and deploy.&lt;/p&gt;
&lt;p&gt;I can see myself using this technique &lt;em&gt;a lot&lt;/em&gt; in the future.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/hacker-news"&gt;hacker-news&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/shot-scraper"&gt;shot-scraper&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="cli"/><category term="github"/><category term="hacker-news"/><category term="scraping"/><category term="github-actions"/><category term="git-scraping"/><category term="shot-scraper"/></entry><entry><title>shot-scraper: automated screenshots for documentation, built on Playwright</title><link href="https://simonwillison.net/2022/Mar/10/shot-scraper/#atom-tag" rel="alternate"/><published>2022-03-10T00:13:30+00:00</published><updated>2022-03-10T00:13:30+00:00</updated><id>https://simonwillison.net/2022/Mar/10/shot-scraper/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;a href="https://github.com/simonw/shot-scraper"&gt;shot-scraper&lt;/a&gt; is a new tool that I’ve built to help automate the process of keeping screenshots up-to-date in my documentation. It also doubles as a scraping tool - hence the name - which I picked as a complement to my &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;git scraping&lt;/a&gt; and &lt;a href="https://simonwillison.net/2022/Feb/2/help-scraping/"&gt;help scraping&lt;/a&gt; techniques.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update 13th March 2022:&lt;/strong&gt; The new &lt;code&gt;shot-scraper javascript&lt;/code&gt; command can now be used to &lt;a href="https://simonwillison.net/2022/Mar/14/scraping-web-pages-shot-scraper/"&gt;scrape web pages from the command line&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update 14th October 2022:&lt;/strong&gt; &lt;a href="https://simonwillison.net/2022/Oct/14/automating-screenshots/"&gt;Automating screenshots for the Datasette documentation using shot-scraper&lt;/a&gt; offers a tutorial introduction to using the tool.&lt;/p&gt;
&lt;h4&gt;The problem&lt;/h4&gt;
&lt;p&gt;I like to include screenshots in documentation. I recently &lt;a href="https://simonwillison.net/2022/Feb/27/datasette-tutorials/"&gt;started writing end-user tutorials&lt;/a&gt; for Datasette, which are particularly image heavy (&lt;a href="https://datasette.io/tutorials/explore"&gt;for example&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;As software changes over time, screenshots get out-of-date. I don't like the idea of stale screenshots, but I also don't want to have to manually recreate them every time I make the tiniest tweak to the visual appearance of my software.&lt;/p&gt;
&lt;h4&gt;Introducing shot-scraper&lt;/h4&gt;
&lt;p&gt;&lt;code&gt;shot-scraper&lt;/code&gt; is a tool for automating this process. You can install it using &lt;code&gt;pip&lt;/code&gt; like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pip install shot-scraper
shot-scraper install
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That second &lt;code&gt;shot-scraper install&lt;/code&gt; line will install the browser it needs to do its job - more on that later.&lt;/p&gt;
&lt;p&gt;You can use it in two ways. To take a one-off screenshot, you can run it like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;shot-scraper https://simonwillison.net/ -o simonwillison.png
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or if you want to take a set of screenshots in a repeatable way, you can define them in a YAML file that looks like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;- &lt;span class="pl-ent"&gt;url&lt;/span&gt;: &lt;span class="pl-s"&gt;https://simonwillison.net/&lt;/span&gt;
  &lt;span class="pl-ent"&gt;output&lt;/span&gt;: &lt;span class="pl-s"&gt;simonwillison.png&lt;/span&gt;
- &lt;span class="pl-ent"&gt;url&lt;/span&gt;: &lt;span class="pl-s"&gt;https://www.example.com/&lt;/span&gt;
  &lt;span class="pl-ent"&gt;width&lt;/span&gt;: &lt;span class="pl-c1"&gt;400&lt;/span&gt;
  &lt;span class="pl-ent"&gt;height&lt;/span&gt;: &lt;span class="pl-c1"&gt;400&lt;/span&gt;
  &lt;span class="pl-ent"&gt;quality&lt;/span&gt;: &lt;span class="pl-c1"&gt;80&lt;/span&gt;
  &lt;span class="pl-ent"&gt;output&lt;/span&gt;: &lt;span class="pl-s"&gt;example.jpg&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And then use &lt;code&gt;shot-scraper multi&lt;/code&gt; to execute every screenshot in one go:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% shot-scraper multi shots.yml 
Screenshot of 'https://simonwillison.net/' written to 'simonwillison.png'
Screenshot of 'https://www.example.com/' written to 'example.jpg'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;a href="https://shot-scraper.datasette.io/en/stable/screenshots.html"&gt;The documentation&lt;/a&gt; describes all of the available options you can use when taking a screenshot.&lt;/p&gt;
&lt;p&gt;Each option can be provided to the &lt;code&gt;shot-scraper&lt;/code&gt; one-off tool, or can be embedded in the YAML file for use with &lt;code&gt;shot-scraper multi&lt;/code&gt;.&lt;/p&gt;
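&lt;p&gt;For example, the &lt;code&gt;width&lt;/code&gt;, &lt;code&gt;height&lt;/code&gt; and &lt;code&gt;quality&lt;/code&gt; settings from the YAML example above can be passed straight to the one-off command instead:&lt;/p&gt;

```shell
# Command-line equivalent of the example.jpg entry in the YAML file above
shot-scraper https://www.example.com/ -o example.jpg \
  --width 400 --height 400 --quality 80
```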
&lt;h4&gt;JavaScript and CSS selectors&lt;/h4&gt;
&lt;p&gt;The default behaviour for &lt;code&gt;shot-scraper&lt;/code&gt; is to take a full page screenshot, using a browser width of 1280px.&lt;/p&gt;
&lt;p&gt;For documentation screenshots you probably don't want the whole page though - you likely want to create an image of one specific part of the interface.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;--selector&lt;/code&gt; option allows you to specify an area of the page by CSS selector. The resulting image will consist just of that part of the page.&lt;/p&gt;
&lt;p&gt;What if you want to modify the page in addition to selecting a specific area?&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;--javascript&lt;/code&gt; option lets you pass in a block of JavaScript code which will be injected into the page and executed after the page has loaded, but before the screenshot is taken.&lt;/p&gt;
&lt;p&gt;The combination of these two options - also available as &lt;code&gt;javascript:&lt;/code&gt; and &lt;code&gt;selector:&lt;/code&gt; keys in the YAML file - should be flexible enough to cover the custom screenshot case for documentation.&lt;/p&gt;
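&lt;p&gt;Here's a sketch of what that combination looks like in the YAML file - the selector and the JavaScript tweak here are illustrative, not taken from a real page:&lt;/p&gt;

```yaml
- url: https://simonwillison.net/
  # Screenshot just this element (hypothetical selector)
  selector: "#bighead"
  # Runs after page load, before the screenshot is taken
  javascript: |-
    document.body.style.backgroundColor = 'pink';
  output: header-pink.png
```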
&lt;h4 id="a-complex-example"&gt;A complex example&lt;/h4&gt;
&lt;p&gt;To prove to myself that the tool works, I decided to try replicating this screenshot from &lt;a href="https://datasette.io/tutorials/explore"&gt;my tutorial&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I made the original using &lt;a href="https://cleanshot.com/"&gt;CleanShot X&lt;/a&gt;, manually adding the two pink arrows:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/select-facets-original.jpg" alt="A screenshot of a portion of the table interface in Datasette, with a menu open and two pink arrows pointing to menu items" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;This is pretty tricky!&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;It's not &lt;a href="https://congress-legislators.datasettes.com/legislators/executive_terms?start__startswith=18&amp;amp;type=prez"&gt;this whole page&lt;/a&gt;, just a subset of the page&lt;/li&gt;
&lt;li&gt;The cog menu for one of the columns is open, which means the cog icon needs to be clicked before taking the screenshot&lt;/li&gt;
&lt;li&gt;There are two pink arrows superimposed on the image&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I decided to use just one arrow for the moment, which should hopefully result in a clearer image.&lt;/p&gt;
&lt;p&gt;I started by &lt;a href="https://github.com/simonw/shot-scraper/issues/9#issuecomment-1063314278"&gt;creating my own pink arrow SVG&lt;/a&gt; using Figma:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/pink-arrow.png" alt="A big pink arrow, with a drop shadow" style="width: 200px; max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I then fiddled around in the Firefox developer console for quite a while, working out the JavaScript needed to trim the page down to the bit I wanted, open the menu and position the arrow.&lt;/p&gt;
&lt;p&gt;With the JavaScript figured out, I pasted it into a YAML file called &lt;code&gt;shot.yml&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;- &lt;span class="pl-ent"&gt;url&lt;/span&gt;: &lt;span class="pl-s"&gt;https://congress-legislators.datasettes.com/legislators/executive_terms?start__startswith=18&amp;amp;type=prez&lt;/span&gt;
  &lt;span class="pl-ent"&gt;javascript&lt;/span&gt;: &lt;span class="pl-s"&gt;|&lt;/span&gt;
&lt;span class="pl-s"&gt;    new Promise(resolve =&amp;gt; {&lt;/span&gt;
&lt;span class="pl-s"&gt;      // Run in a promise so we can sleep 1s at the end&lt;/span&gt;
&lt;span class="pl-s"&gt;      function remove(el) { el.parentNode.removeChild(el);}&lt;/span&gt;
&lt;span class="pl-s"&gt;      // Remove header and footer&lt;/span&gt;
&lt;span class="pl-s"&gt;      remove(document.querySelector('header'));&lt;/span&gt;
&lt;span class="pl-s"&gt;      remove(document.querySelector('footer'));&lt;/span&gt;
&lt;span class="pl-s"&gt;      // Remove most of the children of .content&lt;/span&gt;
&lt;span class="pl-s"&gt;      Array.from(document.querySelectorAll('.content &amp;gt; *:not(.table-wrapper,.suggested-facets)')).map(remove)&lt;/span&gt;
&lt;span class="pl-s"&gt;      // Bit of breathing room for the screenshot&lt;/span&gt;
&lt;span class="pl-s"&gt;      document.body.style.marginTop = '10px';&lt;/span&gt;
&lt;span class="pl-s"&gt;      // Add a bit of padding to .content&lt;/span&gt;
&lt;span class="pl-s"&gt;      var content = document.querySelector('.content');&lt;/span&gt;
&lt;span class="pl-s"&gt;      content.style.width = '820px';&lt;/span&gt;
&lt;span class="pl-s"&gt;      content.style.padding = '10px';&lt;/span&gt;
&lt;span class="pl-s"&gt;      // Open the menu - it's an SVG so we need to use dispatchEvent here&lt;/span&gt;
&lt;span class="pl-s"&gt;      document.querySelector('th.col-executive_id svg').dispatchEvent(new Event('click'));&lt;/span&gt;
&lt;span class="pl-s"&gt;      // Remove all but table header and first 11 rows&lt;/span&gt;
&lt;span class="pl-s"&gt;      Array.from(document.querySelectorAll('tr')).slice(12).map(remove);&lt;/span&gt;
&lt;span class="pl-s"&gt;      // Add a pink SVG arrow&lt;/span&gt;
&lt;span class="pl-s"&gt;      let div = document.createElement('div');&lt;/span&gt;
&lt;span class="pl-s"&gt;      div.innerHTML = `&amp;lt;svg width="104" height="60" fill="none" xmlns="http://www.w3.org/2000/svg"&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;        &amp;lt;g filter="url(#a)"&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;          &amp;lt;path fill-rule="evenodd" clip-rule="evenodd" d="m76.7 1 2 2 .2-.1.1.4 20 20a3.5 3.5 0 0 1 0 5l-20 20-.1.4-.3-.1-1.9 2a3.5 3.5 0 0 1-5.4-4.4l3.2-14.4H4v-12h70.6L71.3 5.4A3.5 3.5 0 0 1 76.7 1Z" fill="#FF31A0"/&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;        &amp;lt;/g&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;        &amp;lt;defs&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;          &amp;lt;filter id="a" x="0" y="0" width="104" height="59.5" filterUnits="userSpaceOnUse" color-interpolation-filters="sRGB"&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;              &amp;lt;feFlood flood-opacity="0" result="BackgroundImageFix"/&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;              &amp;lt;feColorMatrix in="SourceAlpha" values="0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 127 0" result="hardAlpha"/&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;              &amp;lt;feOffset dy="4"/&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;              &amp;lt;feGaussianBlur stdDeviation="2"/&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;              &amp;lt;feComposite in2="hardAlpha" operator="out"/&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;              &amp;lt;feColorMatrix values="0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.25 0"/&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;              &amp;lt;feBlend in2="BackgroundImageFix" result="effect1_dropShadow_2_26"/&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;              &amp;lt;feBlend in="SourceGraphic" in2="effect1_dropShadow_2_26" result="shape"/&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;          &amp;lt;/filter&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;        &amp;lt;/defs&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;      &amp;lt;/svg&amp;gt;`;&lt;/span&gt;
&lt;span class="pl-s"&gt;      let svg = div.firstChild;&lt;/span&gt;
&lt;span class="pl-s"&gt;      content.appendChild(svg);&lt;/span&gt;
&lt;span class="pl-s"&gt;      content.style.position = 'relative';&lt;/span&gt;
&lt;span class="pl-s"&gt;      svg.style.position = 'absolute';&lt;/span&gt;
&lt;span class="pl-s"&gt;      // Give the menu time to finish fading in&lt;/span&gt;
&lt;span class="pl-s"&gt;      setTimeout(() =&amp;gt; {&lt;/span&gt;
&lt;span class="pl-s"&gt;        // Position arrow pointing to the 'facet by this' menu item&lt;/span&gt;
&lt;span class="pl-s"&gt;        var pos = document.querySelector('.dropdown-facet').getBoundingClientRect();&lt;/span&gt;
&lt;span class="pl-s"&gt;        svg.style.left = (pos.left - pos.width) + 'px';&lt;/span&gt;
&lt;span class="pl-s"&gt;        svg.style.top = (pos.top - 20) + 'px';&lt;/span&gt;
&lt;span class="pl-s"&gt;        resolve();&lt;/span&gt;
&lt;span class="pl-s"&gt;      }, 1000);&lt;/span&gt;
&lt;span class="pl-s"&gt;    });&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;  &lt;span class="pl-ent"&gt;output&lt;/span&gt;: &lt;span class="pl-s"&gt;annotated-screenshot.png&lt;/span&gt;
  &lt;span class="pl-ent"&gt;selector&lt;/span&gt;: &lt;span class="pl-s"&gt;.content&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And ran this command to generate the screenshot:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;shot-scraper multi shot.yml
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The generated &lt;code&gt;annotated-screenshot.png&lt;/code&gt; image looks like this:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/annotated-screenshot.png" alt="A screenshot of the table with the menu open and a single pink arrow pointing to the 'facet by this' menu item" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I'm pretty happy with this! I think it works very well as a proof of concept for the process.&lt;/p&gt;
&lt;h4 id="how-it-works-playwright"&gt;How it works: Playwright&lt;/h4&gt;
&lt;p&gt;I built the &lt;a href="https://github.com/simonw/shot-scraper/tree/44995cd45ca6c56d34c5c3d131217f7b9170f6f7"&gt;first prototype&lt;/a&gt; of &lt;code&gt;shot-scraper&lt;/code&gt; using Puppeteer, because I had &lt;a href="https://simonwillison.net/2020/Sep/3/weeknotes-airtable-screenshots-dogsheep/"&gt;used that before&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Then I noticed that the &lt;a href="https://www.npmjs.com/package/puppeteer-cli"&gt;puppeteer-cli&lt;/a&gt; package I was using hadn't had an update in two years, which reminded me to check out Playwright.&lt;/p&gt;
&lt;p&gt;I've been looking for an excuse to learn &lt;a href="https://playwright.dev/"&gt;Playwright&lt;/a&gt; for a while now, and this project turned out to be ideal.&lt;/p&gt;
&lt;p&gt;Playwright is Microsoft's open source browser automation framework. They promote it as a testing tool, but it has plenty of applications outside of testing - screenshot automation and screen scraping being two of the most obvious.&lt;/p&gt;
&lt;p&gt;Playwright is comprehensive: it downloads its own custom browser builds, and can run tests across multiple different rendering engines.&lt;/p&gt;
&lt;p&gt;The second prototype used the &lt;a href="https://github.com/simonw/shot-scraper/tree/b3318b2f27ca1526d5a9f06de50cf9900dd4d8d0"&gt;Playwright CLI utility&lt;/a&gt; instead, &lt;a href="https://github.com/simonw/shot-scraper/blob/b3318b2f27ca1526d5a9f06de50cf9900dd4d8d0/shot_scraper/cli.py#L39-L50"&gt;executed via npx&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-s1"&gt;subprocess&lt;/span&gt;.&lt;span class="pl-en"&gt;run&lt;/span&gt;(
    [
        &lt;span class="pl-s"&gt;"npx"&lt;/span&gt;,
        &lt;span class="pl-s"&gt;"playwright"&lt;/span&gt;,
        &lt;span class="pl-s"&gt;"screenshot"&lt;/span&gt;,
        &lt;span class="pl-s"&gt;"--full-page"&lt;/span&gt;,
        &lt;span class="pl-s1"&gt;url&lt;/span&gt;,
        &lt;span class="pl-s1"&gt;output&lt;/span&gt;,
    ],
    &lt;span class="pl-s1"&gt;capture_output&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;,
)&lt;/pre&gt;
&lt;p&gt;This could take a full page screenshot, but that CLI tool wasn't flexible enough to take screenshots of specific elements. So I needed to switch to the Playwright programmatic API.&lt;/p&gt;
&lt;p&gt;I started out trying to get Python to generate and pass JavaScript to the Node.js library... and then I spotted the official &lt;a href="https://playwright.dev/python/docs/intro"&gt;Playwright for Python&lt;/a&gt; package.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pip install playwright
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It's amazing! It has the exact same functionality as the JavaScript library - the same classes, the same methods. Everything just works, in both languages.&lt;/p&gt;
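&lt;p&gt;As a rough sketch of what the Python API looks like - simplified from what &lt;code&gt;shot-scraper&lt;/code&gt; actually does, and using a made-up selector (this needs both the &lt;code&gt;playwright&lt;/code&gt; package and its browsers installed):&lt;/p&gt;

```python
from playwright.sync_api import sync_playwright

# Mirrors the JavaScript API: launch a browser, open a page, then
# screenshot a single element identified by a CSS selector.
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page(viewport={"width": 1280, "height": 720})
    page.goto("https://simonwillison.net/")
    page.locator("#bighead").screenshot(path="bighead.png")
    browser.close()
```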
&lt;p&gt;I was curious how they pulled this off, so I dug inside the &lt;code&gt;playwright&lt;/code&gt; Python package in my &lt;code&gt;site-packages&lt;/code&gt; folder... and found it bundles a full Node.js binary executable and uses it to bridge the two worlds! What a wild hack.&lt;/p&gt;
&lt;p&gt;Thanks to Playwright, the entire implementation of &lt;code&gt;shot-scraper&lt;/code&gt; is currently just &lt;a href="https://github.com/simonw/shot-scraper/blob/0.3/shot_scraper/cli.py"&gt;181 lines of Python code&lt;/a&gt; - it's all glue code tying together a &lt;a href="https://click.palletsprojects.com/"&gt;Click&lt;/a&gt; CLI interface with some code that calls Playwright to do the actual work.&lt;/p&gt;
&lt;p&gt;I couldn't be more impressed with Playwright. I'll definitely be using it for other projects - for one thing, I think I'll finally be able to add automated tests to my &lt;a href="https://datasette.io/desktop"&gt;Datasette Desktop&lt;/a&gt; Electron application.&lt;/p&gt;
&lt;h4&gt;Hooking shot-scraper up to GitHub Actions&lt;/h4&gt;
&lt;p&gt;I built &lt;code&gt;shot-scraper&lt;/code&gt; very much with GitHub Actions in mind.&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://github.com/simonw/shot-scraper-demo"&gt;shot-scraper-demo&lt;/a&gt; repository is my first live demo of the tool.&lt;/p&gt;
&lt;p&gt;Once a day, it runs &lt;a href="https://github.com/simonw/shot-scraper-demo/blob/3fdd9d3e79f95d9d396aeefd5bf65e85a7700ef4/.github/workflows/shots.yml"&gt;this shots.yml&lt;/a&gt; file, generates two screenshots and commits them back to the repository.&lt;/p&gt;
&lt;p&gt;One of them is the tutorial screenshot described above.&lt;/p&gt;
&lt;p&gt;The other is a screenshot of the list of "recently spotted owls" from &lt;a href="https://www.owlsnearme.com/?place=127871"&gt;this page&lt;/a&gt; on &lt;a href="https://www.owlsnearme.com/"&gt;owlsnearme.com&lt;/a&gt;. I wanted a page that would change on an occasional basis, to demonstrate GitHub's neat image diffing interface.&lt;/p&gt;
&lt;p&gt;I may need to change that demo though! That page includes "spotted 5 hours ago" text, which means that there's almost always a tiny pixel difference, &lt;a href="https://github.com/simonw/shot-scraper-demo/commit/bc86510f49b6f8d6728c9f1880b999c83361dd5a#diff-897c3444fbbb2033cbba5840da4994d01c3f396e0cdf4b0613d7f410db9887e0"&gt;like this one&lt;/a&gt; (use the "swipe" comparison tool to watch 6 hours ago change to 7 hours ago under the top left photo).&lt;/p&gt;
&lt;p&gt;Storing image files that change frequently in a free repository on GitHub feels rude to me, so please use this tool cautiously there!&lt;/p&gt;
&lt;h4&gt;What's next?&lt;/h4&gt;
&lt;p&gt;I had ambitious plans to add utilities to the tool that would &lt;a href="https://github.com/simonw/shot-scraper/issues/9"&gt;help with annotations&lt;/a&gt;, such as adding pink arrows and drawing circles around different elements on the page.&lt;/p&gt;
&lt;p&gt;I've shelved those plans for the moment: as the demo above shows, the JavaScript hook is good enough. I may revisit this later once common patterns have started to emerge.&lt;/p&gt;
&lt;p&gt;So really, my next step is to start using this tool for my own projects - to generate screenshots for my documentation.&lt;/p&gt;
&lt;p&gt;I'm also very interested to see what kinds of things other people use this for.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/documentation"&gt;documentation&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/puppeteer"&gt;puppeteer&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/playwright"&gt;playwright&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/shot-scraper"&gt;shot-scraper&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="cli"/><category term="documentation"/><category term="projects"/><category term="scraping"/><category term="github-actions"/><category term="git-scraping"/><category term="puppeteer"/><category term="playwright"/><category term="shot-scraper"/></entry><entry><title>Help scraping: track changes to CLI tools by recording their --help using Git</title><link href="https://simonwillison.net/2022/Feb/2/help-scraping/#atom-tag" rel="alternate"/><published>2022-02-02T23:46:35+00:00</published><updated>2022-02-02T23:46:35+00:00</updated><id>https://simonwillison.net/2022/Feb/2/help-scraping/#atom-tag</id><summary type="html">
    &lt;p&gt;I've been experimenting with a new variant of &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt; this week which I'm calling &lt;strong&gt;Help scraping&lt;/strong&gt;. The key idea is to track changes made to CLI tools over time by recording the output of their &lt;code&gt;--help&lt;/code&gt; commands in a Git repository.&lt;/p&gt;
&lt;p&gt;My new &lt;a href="https://github.com/simonw/help-scraper"&gt;help-scraper GitHub repository&lt;/a&gt; is my first implementation of this pattern.&lt;/p&gt;
&lt;p&gt;It uses &lt;a href="https://github.com/simonw/help-scraper/blob/cd18c5d7c1ac7c3851823dcabaa21ee920d73720/.github/workflows/scrape.yml"&gt;this GitHub Actions workflow&lt;/a&gt; to record the &lt;code&gt;--help&lt;/code&gt; output for the Amazon Web Services &lt;code&gt;aws&lt;/code&gt; CLI tool, and also for the &lt;code&gt;flyctl&lt;/code&gt; tool maintained by the &lt;a href="https://fly.io/"&gt;Fly.io&lt;/a&gt; hosting platform.&lt;/p&gt;
&lt;p&gt;The workflow runs once a day. It loops through every available AWS command (using &lt;a href="https://github.com/simonw/help-scraper/blob/cd18c5d7c1ac7c3851823dcabaa21ee920d73720/aws_commands.py"&gt;this script&lt;/a&gt;) and records the output of that command's CLI help option to a &lt;code&gt;.txt&lt;/code&gt; file in the repository - then commits the result at the end.&lt;/p&gt;
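&lt;p&gt;The core pattern is very simple - something like this sketch, which uses the Python interpreter itself as a stand-in command (the real workflow shells out to each &lt;code&gt;aws&lt;/code&gt; and &lt;code&gt;flyctl&lt;/code&gt; subcommand instead):&lt;/p&gt;

```python
import subprocess
import sys
from pathlib import Path

# Capture a command's help output and save it to a text file - committing
# these files to Git produces a diffable history of changes to the tool.
out_dir = Path("help")
out_dir.mkdir(exist_ok=True)
result = subprocess.run(
    [sys.executable, "--help"], capture_output=True, text=True
)
(out_dir / "python.txt").write_text(result.stdout)
```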
&lt;p&gt;The result is a version history of changes made to those help files. It's essentially a much more detailed version of a changelog - capturing all sorts of details that might not be reflected in the official release notes for the tool.&lt;/p&gt;
&lt;p&gt;Here's an example. This morning, AWS released version 1.22.47 of their CLI helper tool. They release new versions on an almost daily basis.&lt;/p&gt;
&lt;p&gt;Here are &lt;a href="https://github.com/aws/aws-cli/blob/develop/CHANGELOG.rst#12247"&gt;the official release notes&lt;/a&gt; - 12 bullet points, spanning 12 different AWS services.&lt;/p&gt;
&lt;p&gt;My help scraper caught the details of the release in &lt;a href="https://github.com/simonw/help-scraper/commit/cd18c5d7c1ac7c3851823dcabaa21ee920d73720#diff-c2559859df8912eb13a6017d81019bf5452cead3e6495744e2d0c82202bf33ac"&gt;this commit&lt;/a&gt; - 89 changed files with 3,543 additions and 1,324 deletions. It tells the story of what's changed in a whole lot more detail.&lt;/p&gt;
&lt;p&gt;The AWS CLI tool is &lt;em&gt;enormous&lt;/em&gt;. Running &lt;code&gt;find aws -name '*.txt' | wc -l&lt;/code&gt; in that repository counts help pages for 11,401 individual commands - or 11,390 if you checkout the previous version, showing that there were 11 commands added just in this morning's new release.&lt;/p&gt;
&lt;p&gt;There are plenty of other ways of tracking changes made to AWS. I've previously kept an eye on &lt;a href="https://github.com/boto/botocore/commits/develop"&gt;the botocore GitHub history&lt;/a&gt;, which exposes changes to the underlying JSON - and there are projects like &lt;a href="https://awsapichanges.info/"&gt;awschanges.info&lt;/a&gt; which try to turn those sources of data into something more readable.&lt;/p&gt;
&lt;p&gt;But I think there's something pretty neat about being able to track changes in detail for any CLI tool that offers help output, independent of the official release notes for that tool. Not everyone writes release notes &lt;a href="https://simonwillison.net/2022/Jan/31/release-notes/"&gt;with the detail I like from them&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;I implemented this for &lt;code&gt;flyctl&lt;/code&gt; first, because I wanted to see what changes were being made that might impact my &lt;a href="https://datasette.io/plugins/datasette-publish-fly"&gt;datasette-publish-fly&lt;/a&gt; plugin which shells out to that tool. Then I realized it could be applied to AWS as well.&lt;/p&gt;
&lt;h4&gt;Help scraping my own projects&lt;/h4&gt;
&lt;p&gt;I got the initial idea for this technique from a change I made to my &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; and &lt;a href="https://sqlite-utils.datasette.io"&gt;sqlite-utils&lt;/a&gt; projects a few weeks ago.&lt;/p&gt;
&lt;p&gt;Both tools offer CLI commands with &lt;code&gt;--help&lt;/code&gt; output - but I kept on forgetting to update the help, partly because there was no easy way to see its output online without running the tools themselves.&lt;/p&gt;
&lt;p&gt;So, I added documentation pages that list the output of &lt;code&gt;--help&lt;/code&gt; for each of the CLI commands, generated using the &lt;a href="https://nedbatchelder.com/code/cog"&gt;Cog&lt;/a&gt; file generation tool:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://sqlite-utils.datasette.io/en/stable/cli-reference.html"&gt;sqlite-utils CLI reference&lt;/a&gt; (39 commands!)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.datasette.io/en/stable/cli-reference.html"&gt;datasette CLI reference&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Having added these pages, I realized that the Git commit history of those generated documentation pages could double up as a history of changes I made to the &lt;code&gt;--help&lt;/code&gt; output - here's &lt;a href="https://github.com/simonw/sqlite-utils/commits/main/docs/cli-reference.rst"&gt;that history for sqlite-utils&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It was a short jump from that to the idea of combining it with &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt; to generate history for other tools.&lt;/p&gt;
&lt;h4&gt;Bonus trick: GraphQL schema scraping&lt;/h4&gt;
&lt;p&gt;I've started making selective use of the &lt;a href="https://fly.io/"&gt;Fly.io&lt;/a&gt; GraphQL API as part of &lt;a href="https://github.com/simonw/datasette-publish-fly"&gt;my plugin&lt;/a&gt; for publishing Datasette instances to that platform.&lt;/p&gt;
&lt;p&gt;Their GraphQL API is openly available, but it's not extensively documented - presumably because they reserve the right to make breaking changes to it at any time. I collected some notes on it in this TIL: &lt;a href="https://til.simonwillison.net/fly/undocumented-graphql-api"&gt;Using the undocumented Fly GraphQL API&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This gave me an idea: could I track changes made to their GraphQL schema using the same scraping trick?&lt;/p&gt;
&lt;p&gt;It turns out I can! There's an NPM package called &lt;a href="https://www.npmjs.com/package/get-graphql-schema"&gt;get-graphql-schema&lt;/a&gt; which can extract the GraphQL schema from any GraphQL server and write it out to disk:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;npx get-graphql-schema https://api.fly.io/graphql &amp;gt; /tmp/fly.graphql
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I've added that to my &lt;code&gt;help-scraper&lt;/code&gt; repository too - so now I have a &lt;a href="https://github.com/simonw/help-scraper/commits/main/flyctl/fly.graphql"&gt;commit history&lt;/a&gt; of the changes they are making there. Here's &lt;a href="https://github.com/simonw/help-scraper/commit/f11072ff23f0d654395be7c2b1e98e84dbbc26a3#diff-c9cd49cf2aa3b983457e2812ba9313cc254aba74aaba9a36d56c867e32221589"&gt;an example&lt;/a&gt; from this morning.&lt;/p&gt;
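&lt;p&gt;The workflow addition is a couple of steps along these lines (a sketch - step names here are illustrative, see the repository's &lt;code&gt;scrape.yml&lt;/code&gt; for the real version):&lt;/p&gt;

```yaml
- name: Fetch Fly GraphQL schema
  run: npx get-graphql-schema https://api.fly.io/graphql > flyctl/fly.graphql
- name: Commit and push if the schema changed
  run: |-
    git config user.name "Automated"
    git config user.email "actions@users.noreply.github.com"
    git add -A
    git commit -m "Updated Fly GraphQL schema" || exit 0
    git push
```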
&lt;h3&gt;Other weeknotes&lt;/h3&gt;
&lt;p&gt;I've decided to start setting goals on a monthly basis. My goal for February is to finally ship Datasette 1.0! I'm trying to make at least one commit every day that takes me closer to &lt;a href="https://github.com/simonw/datasette/milestone/7"&gt;that milestone&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This week I did &lt;a href="https://github.com/simonw/datasette/issues/1533"&gt;a bunch of work&lt;/a&gt; adding a &lt;code&gt;Link: https://...; rel="alternate"; type="application/datasette+json"&lt;/code&gt; HTTP header to a bunch of different pages in the Datasette interface, to support discovery of the JSON version of a page based on a URL to the human-readable version.&lt;/p&gt;
&lt;p&gt;(I had originally planned &lt;a href="https://github.com/simonw/datasette/issues/1534"&gt;to also support&lt;/a&gt; &lt;code&gt;Accept: application/json&lt;/code&gt; request headers for this, but I've been put off that idea by the discovery that Cloudflare &lt;a href="https://twitter.com/simonw/status/1478470282931163137"&gt;deliberately ignores&lt;/a&gt; the &lt;code&gt;Vary: Accept&lt;/code&gt; header.)&lt;/p&gt;
&lt;p&gt;Unrelated to Datasette: I also started a new Twitter thread, gathering &lt;a href="https://twitter.com/simonw/status/1487673496977113088"&gt;behind the scenes material from the movie the Mitchells vs the Machines&lt;/a&gt;. There's been a flurry of great material shared recently by the creative team, presumably as part of the run-up to awards season - and I've been enjoying trying to tie it all together in a thread.&lt;/p&gt;
&lt;p&gt;The last time I did this &lt;a href="https://twitter.com/simonw/status/1077737871602110466"&gt;was for Into the Spider-Verse&lt;/a&gt; (from the same studio) and that thread ended up running for more than a year!&lt;/p&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/pytest/only-run-integration"&gt;Opt-in integration tests with pytest --integration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/graphql/get-graphql-schema"&gt;get-graphql-schema&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/github-actions/python-3-11"&gt;Testing against Python 3.11 preview using GitHub Actions&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/graphql"&gt;graphql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fly"&gt;fly&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="cli"/><category term="git"/><category term="github"/><category term="projects"/><category term="scraping"/><category term="graphql"/><category term="weeknotes"/><category term="github-actions"/><category term="git-scraping"/><category term="fly"/></entry><entry><title>git-history: a tool for analyzing scraped data collected using Git and SQLite</title><link href="https://simonwillison.net/2021/Dec/7/git-history/#atom-tag" rel="alternate"/><published>2021-12-07T22:32:55+00:00</published><updated>2021-12-07T22:32:55+00:00</updated><id>https://simonwillison.net/2021/Dec/7/git-history/#atom-tag</id><summary type="html">
    &lt;p&gt;I described &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt; last year: a technique for writing scrapers where you periodically snapshot a source of data to a Git repository in order to record changes to that source over time.&lt;/p&gt;
&lt;p&gt;The open challenge was how to analyze that data once it was collected. &lt;a href="https://datasette.io/tools/git-history"&gt;git-history&lt;/a&gt; is my new tool designed to tackle that problem.&lt;/p&gt;
&lt;h4&gt;Git scraping, a refresher&lt;/h4&gt;
&lt;p&gt;A neat thing about scraping to a Git repository is that the scrapers themselves can be really simple. I demonstrated how to run scrapers for free using GitHub Actions in this &lt;a href="https://simonwillison.net/2021/Mar/5/git-scraping/"&gt;five minute lightning talk&lt;/a&gt; back in March.&lt;/p&gt;
&lt;p&gt;Here's a concrete example: California's state fire department, Cal Fire, maintain an incident map at &lt;a href="https://www.fire.ca.gov/incidents/"&gt;fire.ca.gov/incidents&lt;/a&gt; showing the status of current large fires in the state.&lt;/p&gt;
&lt;p&gt;I found the underlying data here:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then I built &lt;a href="https://github.com/simonw/ca-fires-history/blob/main/.github/workflows/scrape.yml"&gt;a simple scraper&lt;/a&gt; that grabs a copy of that every 20 minutes and commits it to Git. I've been running that for 14 months now, and it's collected &lt;a href="https://github.com/simonw/ca-fires-history"&gt;1,559 commits&lt;/a&gt;!&lt;/p&gt;
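&lt;p&gt;The scraper itself is little more than a scheduled workflow that fetches the JSON and commits it if anything changed - a simplified sketch (the schedule and file names here are illustrative, see the linked &lt;code&gt;scrape.yml&lt;/code&gt; for the real version):&lt;/p&gt;

```yaml
name: Scrape latest incident data
on:
  schedule:
    - cron: "6,26,46 * * * *"  # roughly every 20 minutes
jobs:
  scheduled:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Fetch latest data
        run: |-
          curl https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents \
            | jq . > incidents.json
      - name: Commit and push if it changed
        run: |-
          git config user.name "Automated"
          git config user.email "actions@users.noreply.github.com"
          git add -A
          git commit -m "Latest data: $(date -u)" || exit 0
          git push
```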
&lt;p&gt;The thing that excites me most about Git scraping is that it can create truly unique datasets. It's common for organizations not to keep detailed archives of what changed and where, so by scraping their data into a Git repository you can often end up with a more detailed history than they maintain themselves.&lt;/p&gt;
&lt;p&gt;There's one big challenge though: having collected that data, how can you best analyze it? Reading through thousands of commit differences and eyeballing changes to JSON or CSV files isn't a great way of finding the interesting stories that have been captured.&lt;/p&gt;
&lt;h4&gt;git-history&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://datasette.io/tools/git-history"&gt;git-history&lt;/a&gt; is the new CLI tool I've built to answer that question. It reads through the entire history of a file and generates a SQLite database reflecting changes to that file over time. You can then use &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; to explore the resulting data.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://git-history-demos.datasette.io/ca-fires"&gt;an example database&lt;/a&gt; created by running the tool against my &lt;code&gt;ca-fires-history&lt;/code&gt; repository. I created the SQLite database by running this in the repository directory:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;git-history file ca-fires.db incidents.json \
  --namespace incident \
  --id UniqueId \
  --convert &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;json.loads(content)["Incidents"]&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/ca-fires-progress.gif" alt="Animated gif showing the progress bar" style="max-width:100%; border-top: 5px solid black;" /&gt;&lt;/p&gt;
&lt;p&gt;In this example we are processing the history of a single file called &lt;code&gt;incidents.json&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;We use the &lt;code&gt;UniqueId&lt;/code&gt; column to identify which records have changed over time, as opposed to being newly created.&lt;/p&gt;
&lt;p&gt;Specifying &lt;code&gt;--namespace incident&lt;/code&gt; causes the created database tables to be called &lt;code&gt;incident&lt;/code&gt; and &lt;code&gt;incident_version&lt;/code&gt; rather than the default of &lt;code&gt;item&lt;/code&gt; and &lt;code&gt;item_version&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;And we have a fragment of Python code that knows how to turn each version stored in that commit history into a list of objects compatible with the tool, see &lt;a href="https://github.com/simonw/git-history/blob/0.6/README.md#custom-conversions-using---convert"&gt;--convert in the documentation&lt;/a&gt; for details.&lt;/p&gt;
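&lt;p&gt;For Cal Fire the conversion really is that one expression: the endpoint wraps its list in an &lt;code&gt;{"Incidents": [...]}&lt;/code&gt; envelope, and the code simply unwraps it. As standalone Python (the sample payload here is invented for illustration):&lt;/p&gt;

```python
import json

def convert(content):
    # Unwrap the {"Incidents": [...]} envelope so git-history sees
    # a plain list of objects, each carrying its UniqueId
    return json.loads(content)["Incidents"]

sample = '{"Incidents": [{"UniqueId": "abc", "Name": "Example Fire"}]}'
```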
&lt;p&gt;Let's use the database to answer some questions about fires in California over the past 14 months.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;incident&lt;/code&gt; table contains a copy of the latest record for every incident. We can use that to see &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident"&gt;a map of every fire&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/ca-fires-map.png" alt="A map showing 250 fires in California" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;This uses the &lt;a href="https://datasette.io/plugins/datasette-cluster-map"&gt;datasette-cluster-map&lt;/a&gt; plugin, which draws a map of every row with a valid latitude and longitude column.&lt;/p&gt;
&lt;p&gt;Where things get interesting is the &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version"&gt;incident_version&lt;/a&gt; table. This is where changes between different scraped versions of each item are recorded.&lt;/p&gt;
&lt;p&gt;Those 250 fires have 2,060 recorded versions. If we &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item"&gt;facet by _item&lt;/a&gt; we can see which fires had the most versions recorded. Here are the top ten:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=174"&gt;Dixie Fire&lt;/a&gt; 268&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=209"&gt;Caldor Fire&lt;/a&gt; 153&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=197"&gt;Monument Fire&lt;/a&gt; 65&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=1"&gt;August Complex (includes Doe Fire)&lt;/a&gt; 64&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=2"&gt;Creek Fire&lt;/a&gt; 56&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=213"&gt;French Fire&lt;/a&gt; 53&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=32"&gt;Silverado Fire&lt;/a&gt; 52&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=240"&gt;Fawn Fire&lt;/a&gt; 45&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=34"&gt;Blue Ridge Fire&lt;/a&gt; 39&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=190"&gt;McFarland Fire&lt;/a&gt; 34&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This looks about right - the larger the number of versions the longer the fire must have been burning. The Dixie Fire &lt;a href="https://en.wikipedia.org/wiki/Dixie_Fire"&gt;has its own Wikipedia page&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;Clicking through to &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=174"&gt;the Dixie Fire&lt;/a&gt; lands us on a page showing every "version" that we captured, ordered by version number.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;git-history&lt;/code&gt; only writes values to this table that have changed since the previous version. This means you can glance at the table grid and get a feel for which pieces of information were updated over time:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/ca-fires-incident-versions.png" alt="The table showing changes to that fire over time" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;ConditionStatement&lt;/code&gt; is a text description that changes frequently, but the other two interesting columns look to be &lt;code&gt;AcresBurned&lt;/code&gt; and &lt;code&gt;PercentContained&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;That &lt;code&gt;_commit&lt;/code&gt; column is a foreign key to the &lt;a href="https://git-history-demos.datasette.io/ca-fires/commits"&gt;commits&lt;/a&gt; table, which records the commits that have been processed by the tool - mainly so that when you run it a second time it can pick up where it finished last time.&lt;/p&gt;
&lt;p&gt;We can join against &lt;code&gt;commits&lt;/code&gt; to see the date that each version was created. Or we can use the &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version_detail"&gt;incident_version_detail&lt;/a&gt; view which performs that join for us.&lt;/p&gt;
&lt;p&gt;Using that view, we can filter for just rows where &lt;code&gt;_item&lt;/code&gt; is 174 and &lt;code&gt;AcresBurned&lt;/code&gt; is not blank, then use the &lt;a href="https://datasette.io/plugins/datasette-vega"&gt;datasette-vega&lt;/a&gt; plugin to visualize the &lt;code&gt;_commit_at&lt;/code&gt; date column against the &lt;code&gt;AcresBurned&lt;/code&gt; numeric column... and we get a graph of &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version_detail?_item__exact=174&amp;amp;AcresBurned__notblank=1#g.mark=line&amp;amp;g.x_column=_commit_at&amp;amp;g.x_type=temporal&amp;amp;g.y_column=AcresBurned&amp;amp;g.y_type=quantitative"&gt;the growth of the Dixie Fire over time&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/ca-fires-chart.png" alt="The chart plugin showing a line chart" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;To review: we started out with a GitHub Actions scheduled workflow grabbing a copy of a JSON API endpoint every 20 minutes. Thanks to &lt;code&gt;git-history&lt;/code&gt;, Datasette and &lt;code&gt;datasette-vega&lt;/code&gt; we now have a chart showing the growth of the longest-lived California wildfire of the last 14 months over time.&lt;/p&gt;
&lt;h4&gt;A note on schema design&lt;/h4&gt;
&lt;p&gt;One of the hardest problems in designing &lt;code&gt;git-history&lt;/code&gt; was deciding on an appropriate schema for storing version changes over time.&lt;/p&gt;
&lt;p&gt;I ended up with the following (edited for clarity):&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;CREATE TABLE [commits] (
   [id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [hash] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [commit_at] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
);
CREATE TABLE [item] (
   [_id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [_item_id] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [IncidentID] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [Location] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [Type] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [_commit] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;
);
CREATE TABLE [item_version] (
   [_id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [_item] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [item]([_id]),
   [_version] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;,
   [_commit] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [commits]([id]),
   [IncidentID] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [Location] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [Type] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
);
CREATE TABLE [columns] (
   [id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [namespace] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [namespaces]([id]),
   [name] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
);
CREATE TABLE [item_changed] (
   [item_version] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [item_version]([_id]),
   [column] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [columns]([id]),
   &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt; ([item_version], [column])
);&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;As shown earlier, records in the &lt;code&gt;item_version&lt;/code&gt; table represent snapshots over time - but to save on database space and provide a neater interface for browsing versions, they only record columns that had changed since their previous version. Any unchanged columns are stored as &lt;code&gt;null&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;There's one catch with this schema: what do we do if a new version of an item sets one of the columns to &lt;code&gt;null&lt;/code&gt;? How can we tell the difference between that and a column that didn't change?&lt;/p&gt;
&lt;p&gt;I ended up solving that with an &lt;code&gt;item_changed&lt;/code&gt; many-to-many table, which uses pairs of integers (hopefully taking up as little space as possible) to record exactly which columns were modified in which &lt;code&gt;item_version&lt;/code&gt; records.&lt;/p&gt;
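&lt;p&gt;With that table in place a &lt;code&gt;null&lt;/code&gt; is no longer ambiguous: a column was genuinely set to &lt;code&gt;null&lt;/code&gt; only if it is listed in &lt;code&gt;item_changed&lt;/code&gt; for that version, otherwise the previous value still applies. A sketch of that reconstruction logic, in memory rather than SQL:&lt;/p&gt;

```python
def resolve_version(previous, version_row, changed):
    # Rebuild the full record for a version: start from the prior
    # state and apply only the columns listed in item_changed -
    # a null among those really means "changed to null"
    result = dict(previous)
    for column in changed:
        result[column] = version_row.get(column)
    return result
```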
&lt;p&gt;The &lt;code&gt;item_version_detail&lt;/code&gt; view displays columns from that many-to-many table as JSON - here's &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version_detail?_version__gt=1&amp;amp;_col=_changed_columns&amp;amp;_col=_item&amp;amp;_col=_version"&gt;a filtered example&lt;/a&gt; showing which columns were changed in which versions of which items:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/ca-fires-changed-columns.png" alt="This table shows a JSON list of column names against items and versions" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://git-history-demos.datasette.io/ca-fires?sql=select+columns.name%2C+count%28*%29%0D%0Afrom+incident_changed%0D%0A++join+incident_version+on+incident_changed.item_version+%3D+incident_version._id%0D%0A++join+columns+on+incident_changed.column+%3D+columns.id%0D%0Awhere+incident_version._version+%3E+1%0D%0Agroup+by+columns.name%0D%0Aorder+by+count%28*%29+desc"&gt;a SQL query&lt;/a&gt; that shows, for &lt;code&gt;ca-fires&lt;/code&gt;, which columns were updated most often:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;select&lt;/span&gt; &lt;span class="pl-c1"&gt;columns&lt;/span&gt;.&lt;span class="pl-c1"&gt;name&lt;/span&gt;, &lt;span class="pl-c1"&gt;count&lt;/span&gt;(&lt;span class="pl-k"&gt;*&lt;/span&gt;)
&lt;span class="pl-k"&gt;from&lt;/span&gt; incident_changed
  &lt;span class="pl-k"&gt;join&lt;/span&gt; incident_version &lt;span class="pl-k"&gt;on&lt;/span&gt; &lt;span class="pl-c1"&gt;incident_changed&lt;/span&gt;.&lt;span class="pl-c1"&gt;item_version&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;incident_version&lt;/span&gt;.&lt;span class="pl-c1"&gt;_id&lt;/span&gt;
  &lt;span class="pl-k"&gt;join&lt;/span&gt; columns &lt;span class="pl-k"&gt;on&lt;/span&gt; &lt;span class="pl-c1"&gt;incident_changed&lt;/span&gt;.&lt;span class="pl-c1"&gt;column&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;columns&lt;/span&gt;.&lt;span class="pl-c1"&gt;id&lt;/span&gt;
&lt;span class="pl-k"&gt;where&lt;/span&gt; &lt;span class="pl-c1"&gt;incident_version&lt;/span&gt;.&lt;span class="pl-c1"&gt;_version&lt;/span&gt; &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;
&lt;span class="pl-k"&gt;group by&lt;/span&gt; &lt;span class="pl-c1"&gt;columns&lt;/span&gt;.&lt;span class="pl-c1"&gt;name&lt;/span&gt;
&lt;span class="pl-k"&gt;order by&lt;/span&gt; &lt;span class="pl-c1"&gt;count&lt;/span&gt;(&lt;span class="pl-k"&gt;*&lt;/span&gt;) &lt;span class="pl-k"&gt;desc&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;Updated: 1785&lt;/li&gt;
&lt;li&gt;PercentContained: 740&lt;/li&gt;
&lt;li&gt;ConditionStatement: 734&lt;/li&gt;
&lt;li&gt;AcresBurned: 616&lt;/li&gt;
&lt;li&gt;Started: 327&lt;/li&gt;
&lt;li&gt;PersonnelInvolved: 286&lt;/li&gt;
&lt;li&gt;Engines: 274&lt;/li&gt;
&lt;li&gt;CrewsInvolved: 256&lt;/li&gt;
&lt;li&gt;WaterTenders: 225&lt;/li&gt;
&lt;li&gt;Dozers: 211&lt;/li&gt;
&lt;li&gt;AirTankers: 181&lt;/li&gt;
&lt;li&gt;StructuresDestroyed: 125&lt;/li&gt;
&lt;li&gt;Helicopters: 122&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Helicopters are exciting! Let's find all of the fires which had at least one record where the number of helicopters changed (after the first version). We'll use a nested SQL query:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;select&lt;/span&gt; &lt;span class="pl-k"&gt;*&lt;/span&gt; &lt;span class="pl-k"&gt;from&lt;/span&gt; incident
&lt;span class="pl-k"&gt;where&lt;/span&gt; _id &lt;span class="pl-k"&gt;in&lt;/span&gt; (
  &lt;span class="pl-k"&gt;select&lt;/span&gt; _item &lt;span class="pl-k"&gt;from&lt;/span&gt; incident_version
  &lt;span class="pl-k"&gt;where&lt;/span&gt; _id &lt;span class="pl-k"&gt;in&lt;/span&gt; (
    &lt;span class="pl-k"&gt;select&lt;/span&gt; item_version &lt;span class="pl-k"&gt;from&lt;/span&gt; incident_changed &lt;span class="pl-k"&gt;where&lt;/span&gt; column &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;15&lt;/span&gt;
  )
  &lt;span class="pl-k"&gt;and&lt;/span&gt; _version &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;
)&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;That returned 19 fires that were significant enough to involve helicopters - &lt;a href="https://git-history-demos.datasette.io/ca-fires?sql=select+*+from+incident%0D%0Awhere+_id+in+%28%0D%0A++select+_item+from+incident_version%0D%0A++where+_id+in+%28%0D%0A++++select+item_version+from+incident_changed+where+column+%3D+15%0D%0A++%29%0D%0A++and+_version+%3E+1%0D%0A%29"&gt;here they are on a map&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/ca-fire-helicopter-map.png" alt="A map of 19 fires that involved helicopters" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;h4&gt;Advanced usage of --convert&lt;/h4&gt;
&lt;p&gt;Drew Breunig has been running a Git scraper for the past 8 months in &lt;a href="https://github.com/dbreunig/511-events-history"&gt;dbreunig/511-events-history&lt;/a&gt; against &lt;a href="https://511.org/"&gt;511.org&lt;/a&gt;, a site showing traffic incidents in the San Francisco Bay Area. I loaded his data into this example &lt;a href="https://git-history-demos.datasette.io/sf-bay-511"&gt;sf-bay-511 database&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;sf-bay-511&lt;/code&gt; example is useful for digging more into the &lt;code&gt;--convert&lt;/code&gt; option to &lt;code&gt;git-history&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;git-history&lt;/code&gt; requires recorded data to be in a specific shape: it needs a JSON list of JSON objects, where each object has a column that can be treated as a unique ID for purposes of tracking changes to that specific record over time.&lt;/p&gt;
&lt;p&gt;The ideal tracked JSON file would look something like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;[
  {
    &lt;span class="pl-ent"&gt;"IncidentID"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;abc123&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"Location"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Corner of 4th and Vermont&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"Type"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;fire&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
  },
  {
    &lt;span class="pl-ent"&gt;"IncidentID"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;cde448&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"Location"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;555 West Example Drive&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"Type"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;medical&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
  }
]&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It's common for data that has been scraped to not fit this ideal shape.&lt;/p&gt;
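&lt;p&gt;A quick way to check whether a scraped file already fits is to confirm it parses to a JSON list of objects that all carry a usable ID. This helper is my own illustration, not part of git-history:&lt;/p&gt;

```python
import json

def fits_ideal_shape(content, id_column):
    # git-history wants a JSON list of objects, each with a
    # value in the chosen ID column
    data = json.loads(content)
    if not isinstance(data, list):
        return False
    return all(
        isinstance(item, dict) and item.get(id_column)
        for item in data
    )
```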
&lt;p&gt;The &lt;code&gt;511.org&lt;/code&gt; JSON feed &lt;a href="https://backend-prod.511.org/api-proxy/api/v1/traffic/events/?extended=true"&gt;can be found here&lt;/a&gt; - it's a pretty complicated nested set of objects, and there's a bunch of data in there that's quite noisy without adding much to the overall analysis - things like an &lt;code&gt;updated&lt;/code&gt; timestamp field that changes in every version even if there are no changes, or a deeply nested &lt;code&gt;"extension"&lt;/code&gt; object full of duplicate data.&lt;/p&gt;
&lt;p&gt;I wrote a snippet of Python to transform each of those recorded snapshots into a simpler structure, and then passed that code to the script's &lt;code&gt;--convert&lt;/code&gt; option:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
git-history file sf-bay-511.db 511-events-history/events.json \
  --repo 511-events-history \
  --id id \
  --convert '
data = json.loads(content)
if data.get("error"):
    # {"code": 500, "error": "Error accessing remote data..."}
    return
for event in data["Events"]:
    event["id"] = event["extension"]["event-reference"]["event-identifier"]
    # Remove noisy updated timestamp
    del event["updated"]
    # Drop extension block entirely
    del event["extension"]
    # "schedule" block is noisy but not interesting
    del event["schedule"]
    # Flatten nested subtypes
    event["event_subtypes"] = event["event_subtypes"]["event_subtype"]
    if not isinstance(event["event_subtypes"], list):
        event["event_subtypes"] = [event["event_subtypes"]]
    yield event
'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The single-quoted string passed to &lt;code&gt;--convert&lt;/code&gt; is compiled into a Python function and run against each Git version in turn. My code loops through the nested &lt;code&gt;Events&lt;/code&gt; list, modifying each record and then outputting them as an iterable sequence using &lt;code&gt;yield&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;A few of the records in the history were server 500 errors, so the code block knows how to identify and skip those as well.&lt;/p&gt;
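&lt;p&gt;The compile-a-string-into-a-function trick looks roughly like this - a simplified sketch of the pattern, not git-history's exact implementation:&lt;/p&gt;

```python
import json

def make_convert_function(code):
    # Indent the user's code into a function body; because the body
    # can use yield, calling the function returns a generator
    body = "\n".join("    " + line for line in code.splitlines())
    source = "def convert(content):\n" + body
    namespace = {"json": json}
    exec(source, namespace)
    return namespace["convert"]
```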
&lt;p&gt;When working with &lt;code&gt;git-history&lt;/code&gt; I find myself spending most of my time iterating on these conversion scripts. Passing strings of Python code to tools like this is a pretty fun pattern - I also used it &lt;a href="https://simonwillison.net/2021/Aug/6/sqlite-utils-convert/"&gt;for sqlite-utils convert&lt;/a&gt; earlier this year.&lt;/p&gt;
&lt;h4&gt;Trying this out yourself&lt;/h4&gt;
&lt;p&gt;If you want to try this out for yourself the &lt;code&gt;git-history&lt;/code&gt; tool has &lt;a href="https://github.com/simonw/git-history/blob/main/README.md"&gt;an extensive README&lt;/a&gt; describing the other options, and the scripts used to create these demos can be found in the &lt;a href="https://github.com/simonw/git-history/tree/main/demos"&gt;demos folder&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/topics/git-scraping"&gt;git-scraping topic&lt;/a&gt; on GitHub now has over 200 repos now built by dozens of different people - that's a lot of interesting scraped data sat there waiting to be explored!&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-history"&gt;git-history&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="cli"/><category term="data-journalism"/><category term="git"/><category term="projects"/><category term="scraping"/><category term="sqlite"/><category term="datasette"/><category term="git-history"/></entry><entry><title>Git scraping, the five minute lightning talk</title><link href="https://simonwillison.net/2021/Mar/5/git-scraping/#atom-tag" rel="alternate"/><published>2021-03-05T00:44:15+00:00</published><updated>2021-03-05T00:44:15+00:00</updated><id>https://simonwillison.net/2021/Mar/5/git-scraping/#atom-tag</id><summary type="html">
    &lt;p&gt;I prepared a lightning talk about &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt; for the &lt;a href="https://www.ire.org/training/conferences/nicar-2021/"&gt;NICAR 2021&lt;/a&gt; data journalism conference. In the talk I explain the idea of running scheduled scrapers in GitHub Actions, show some examples and then live code a new scraper for the CDC's vaccination data using the GitHub web interface. Here's the video.&lt;/p&gt;
&lt;div class="resp-container"&gt;
    &lt;iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/2CjA-03yK8I" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="allowfullscreen"&gt; &lt;/iframe&gt;
&lt;/div&gt;
&lt;h4&gt;Notes from the talk&lt;/h4&gt;
&lt;p&gt;Here's &lt;a href="https://m.pge.com/#outages"&gt;the PG&amp;amp;E outage map&lt;/a&gt; that I scraped. The trick here is to open the browser developer tools network tab, then order resources by size and see if you can find the JSON resource that contains the most interesting data.&lt;/p&gt;
&lt;p&gt;I scraped that outage data into &lt;a href="https://github.com/simonw/pge-outages"&gt;simonw/pge-outages&lt;/a&gt; - here's the &lt;a href="https://github.com/simonw/pge-outages/commits"&gt;commit history&lt;/a&gt; (over 40,000 commits now!)&lt;/p&gt;
&lt;p&gt;The scraper code itself &lt;a href="https://github.com/simonw/disaster-scrapers/blob/3eed6eca820e14e2f89db3910d1aece72717d387/pge.py"&gt;is here&lt;/a&gt;. I wrote about the project in detail in &lt;a href="https://simonwillison.net/2019/Oct/10/pge-outages/"&gt;Tracking PG&amp;amp;E outages by scraping to a git repo&lt;/a&gt; - my database of outages is at &lt;a href="https://pge-outages.simonwillison.net/pge-outages/outages"&gt;pge-outages.simonwillison.net&lt;/a&gt; and the animation I made of outages over time is attached to &lt;a href="https://twitter.com/simonw/status/1188612004572880896"&gt;this tweet&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote class="twitter-tweet"&gt;&lt;p lang="en" dir="ltr"&gt;Here&amp;#39;s a video animation of PG&amp;amp;E&amp;#39;s outages from October 5th up until just a few minutes ago &lt;a href="https://t.co/50K3BrROZR"&gt;pic.twitter.com/50K3BrROZR&lt;/a&gt;&lt;/p&gt;- Simon Willison (@simonw) &lt;a href="https://twitter.com/simonw/status/1188612004572880896?ref_src=twsrc%5Etfw"&gt;October 28, 2019&lt;/a&gt;&lt;/blockquote&gt;
&lt;p&gt;The much simpler scraper for the &lt;a href="https://www.fire.ca.gov/incidents"&gt;www.fire.ca.gov/incidents&lt;/a&gt; website is at &lt;a href="https://github.com/simonw/ca-fires-history"&gt;simonw/ca-fires-history&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In the video I used that as the template to create a new scraper for CDC vaccination data - their website is &lt;a href="https://covid.cdc.gov/covid-data-tracker/#vaccinations"&gt;https://covid.cdc.gov/covid-data-tracker/#vaccinations&lt;/a&gt; and the API I found using the browser developer tools is &lt;a href="https://covid.cdc.gov/covid-data-tracker/COVIDData/getAjaxData?id=vaccination_data"&gt;https://covid.cdc.gov/covid-data-tracker/COVIDData/getAjaxData?id=vaccination_data&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The new CDC scraper and the data it has scraped lives in &lt;a href="https://github.com/simonw/cdc-vaccination-history"&gt;simonw/cdc-vaccination-history&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;You can find more examples of Git scraping in the &lt;a href="https://github.com/topics/git-scraping"&gt;git-scraping GitHub topic&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/my-talks"&gt;my-talks&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/annotated-talks"&gt;annotated-talks&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nicar"&gt;nicar&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="data-journalism"/><category term="scraping"/><category term="my-talks"/><category term="github-actions"/><category term="git-scraping"/><category term="annotated-talks"/><category term="nicar"/></entry><entry><title>selenium-wire</title><link href="https://simonwillison.net/2020/Nov/2/selenium-wire/#atom-tag" rel="alternate"/><published>2020-11-02T18:58:59+00:00</published><updated>2020-11-02T18:58:59+00:00</updated><id>https://simonwillison.net/2020/Nov/2/selenium-wire/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://pypi.org/project/selenium-wire/"&gt;selenium-wire&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Really useful scraping tool: it enhances the Python Selenium bindings to run against a proxy, which then allows Python scraping code to inspect captured requests - great when a site you are working with triggers Ajax requests and you want to extract data from the raw JSON that comes back.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/selenium"&gt;selenium&lt;/a&gt;&lt;/p&gt;



</summary><category term="data-journalism"/><category term="python"/><category term="scraping"/><category term="selenium"/></entry><entry><title>Weeknotes: evernote-to-sqlite, Datasette Weekly, scrapers, csv-diff, sqlite-utils</title><link href="https://simonwillison.net/2020/Oct/16/weeknotes-evernote-datasette-weekly/#atom-tag" rel="alternate"/><published>2020-10-16T21:14:46+00:00</published><updated>2020-10-16T21:14:46+00:00</updated><id>https://simonwillison.net/2020/Oct/16/weeknotes-evernote-datasette-weekly/#atom-tag</id><summary type="html">
    &lt;p&gt;This week I built &lt;code&gt;evernote-to-sqlite&lt;/code&gt; (see &lt;a href="https://simonwillison.net/2020/Oct/16/building-evernote-sqlite-exporter/"&gt;Building an Evernote to SQLite exporter&lt;/a&gt;), launched the &lt;a href="https://datasette.substack.com/"&gt;Datasette Weekly newsletter&lt;/a&gt;, worked on some scrapers and pushed out some small improvements to several other projects.&lt;/p&gt;
&lt;h4&gt;The Datasette Weekly newsletter&lt;/h4&gt;
&lt;p&gt;After procrastinating on it for several months I finally launched the new &lt;a href="https://datasette.substack.com/"&gt;Datasette Weekly&lt;/a&gt; newsletter!&lt;/p&gt;
&lt;p&gt;My plan is to put this out once a week with a combination of news from the Datasette/Dogsheep/sqlite-utils ecosystem of tools, plus tips and tricks for using them to solve data problems.&lt;/p&gt;
&lt;p&gt;You can read &lt;a href="https://datasette.substack.com/p/datasette-050-git-scraping-extracting"&gt;the first edition here&lt;/a&gt;, which covers &lt;a href="https://docs.datasette.io/en/stable/changelog.html#v0-50"&gt;Datasette 0.50&lt;/a&gt;, &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;git scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/2020/Sep/23/sqlite-utils-extract/"&gt;sqlite-utils extract&lt;/a&gt; and features &lt;a href="https://github.com/simonw/datasette-graphql"&gt;datasette-graphql&lt;/a&gt; as the plugin of the week.&lt;/p&gt;
&lt;p&gt;I'm using &lt;a href="https://substack.com/"&gt;Substack&lt;/a&gt; because people I trust use it for their newsletters and I decided that picking an option and launching was more important than spending even more time procrastinating on picking the best possible newsletter platform. So far it seems fit for purpose, and it provides an export option should I decide to move to something else.&lt;/p&gt;
&lt;h4&gt;Writing scrapers with a Python+JavaScript hybrid&lt;/h4&gt;
&lt;p&gt;I've been writing some scraper code to help out with a student journalism project at Stanford. I ended up using &lt;a href="https://selenium-python.readthedocs.io/"&gt;Selenium Python&lt;/a&gt; running in a Jupyter Notebook.&lt;/p&gt;
&lt;p&gt;Historically I've avoided Selenium due to how weird and complex it has been to use in the past. I've now completely changed my mind: these days it's a really solid option for browser automation driven by Python thanks to &lt;code&gt;chromedriver&lt;/code&gt; and &lt;code&gt;geckodriver&lt;/code&gt;, which I recently learned can be &lt;a href="https://til.simonwillison.net/til/til/selenium_selenium-python-macos.md"&gt;installed using Homebrew&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;My preferred way of writing scrapers is to do most of the work in JavaScript. The combination of &lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelector"&gt;querySelector()&lt;/a&gt;, &lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelectorAll"&gt;querySelectorAll()&lt;/a&gt;, &lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/Fetch_API"&gt;fetch()&lt;/a&gt; and the new-to-me &lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/DOMParser"&gt;DOMParser&lt;/a&gt; class makes light work of extracting data from any shape of HTML, and browser DevTools mean that I can interactively build up scrapers by pasting code directly into the console.&lt;/p&gt;
&lt;p&gt;My big breakthrough this week was figuring out how to write scrapers as a Python-JavaScript hybrid. The Selenium &lt;code&gt;driver.execute_script()&lt;/code&gt; and &lt;code&gt;driver.execute_async_script()&lt;/code&gt; (&lt;a href="https://til.simonwillison.net/til/til/selenium_async-javascript-in-selenium.md"&gt;TIL&lt;/a&gt;) methods make it trivial to execute chunks of JavaScript from Python and get back the results.&lt;/p&gt;
&lt;p&gt;This meant I could scrape pages one at a time using JavaScript and save the results directly to SQLite via &lt;a href="https://sqlite-utils.readthedocs.io/en/stable/python-api.html"&gt;sqlite-utils&lt;/a&gt;. I could even run database queries on the Python side to skip items that had already been scraped.&lt;/p&gt;
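&lt;p&gt;As a rough illustration of that skip-already-scraped pattern (a sketch, not the project's actual code), here's the shape of it using Python's standard-library &lt;code&gt;sqlite3&lt;/code&gt; in place of sqlite-utils - &lt;code&gt;fetch_page&lt;/code&gt; here is a hypothetical stand-in for a &lt;code&gt;driver.execute_script()&lt;/code&gt; call:&lt;/p&gt;

```python
import sqlite3

def scrape_missing(db, urls, fetch_page):
    # fetch_page stands in for driver.execute_script(...) in the real scraper
    db.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, html TEXT)")
    seen = {row[0] for row in db.execute("SELECT url FROM pages")}
    for url in urls:
        if url in seen:
            continue  # already scraped on a previous run
        db.execute("INSERT INTO pages VALUES (?, ?)", (url, fetch_page(url)))
        db.commit()  # commit per page so an interrupted run loses nothing

db = sqlite3.connect(":memory:")
scrape_missing(db, ["/a", "/b"], lambda url: f"<html>{url}</html>")
```

&lt;p&gt;Running it a second time only fetches pages that aren't already in the table, so an interrupted scrape can simply be restarted.&lt;/p&gt;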
&lt;h4&gt;csv-diff 1.0&lt;/h4&gt;
&lt;p&gt;I'm trying to get more of my tools past the 1.0 mark, mainly to indicate to potential users that I won't be breaking backwards compatibility without bumping them to 2.0.&lt;/p&gt;
&lt;p&gt;I built &lt;a href="https://github.com/simonw/csv-diff"&gt;csv-diff&lt;/a&gt; for my &lt;a href="https://simonwillison.net/2019/Mar/13/tree-history/#csvdiff_18"&gt;San Francisco Trees project&lt;/a&gt; last year. It produces human-readable diffs for CSV files.&lt;/p&gt;
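&lt;p&gt;To give a flavour of what "human-readable diffs for CSV files" means (this is a toy sketch, not csv-diff's actual implementation), here's keyed row comparison using the standard-library &lt;code&gt;csv&lt;/code&gt; module:&lt;/p&gt;

```python
import csv, io

def diff_csv(old_text, new_text, key):
    # Index rows by the key column, then compare matching rows field by field
    def index(text):
        return {row[key]: row for row in csv.DictReader(io.StringIO(text))}
    old, new = index(old_text), index(new_text)
    changes = []
    for k in old.keys() & new.keys():
        for col in old[k]:
            if old[k][col] != new[k][col]:
                changes.append(
                    f'{key}={k}: {col} changed from "{old[k][col]}" to "{new[k][col]}"'
                )
    return changes

print(diff_csv("id,name\n1,Oak\n", "id,name\n1,Maple\n", key="id"))
# ['id=1: name changed from "Oak" to "Maple"']
```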
&lt;p&gt;The version 1.0 release notes are &lt;a href="https://github.com/simonw/csv-diff/releases/tag/1.0"&gt;as follows&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;New &lt;code&gt;--show-unchanged&lt;/code&gt; option for outputting the unchanged values of rows that had at least one change. &lt;a href="https://github.com/simonw/csv-diff/issues/9"&gt;#9&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Fix for bug with column names that contained a &lt;code&gt;.&lt;/code&gt; character. &lt;a href="https://github.com/simonw/csv-diff/issues/7"&gt;#7&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Fix for error when no &lt;code&gt;--key&lt;/code&gt; provided - thanks, @MainHanzo. &lt;a href="https://github.com/simonw/csv-diff/issues/3"&gt;#3&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;The CSV delimiter sniffer now handles &lt;code&gt;;&lt;/code&gt; delimited files. &lt;a href="https://github.com/simonw/csv-diff/issues/6"&gt;#6&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;sqlite-utils 2.22&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://sqlite-utils.readthedocs.io/en/stable/changelog.html#v2-22"&gt;sqlite-utils 2.22&lt;/a&gt; adds some minor features - an &lt;code&gt;--encoding&lt;/code&gt; option for processing TSV and CSV files in encodings other than UTF-8, and more support for loading SQLite extensions modules.&lt;/p&gt;
&lt;p&gt;Full release notes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;New &lt;code&gt;--encoding&lt;/code&gt; option for processing CSV and TSV files that use a non-utf-8 encoding, for both the &lt;code&gt;insert&lt;/code&gt; and &lt;code&gt;update&lt;/code&gt; commands. (&lt;a href="https://github.com/simonw/sqlite-utils/issues/182"&gt;#182&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;--load-extension&lt;/code&gt; option is now available to many more commands. (&lt;a href="https://github.com/simonw/sqlite-utils/issues/137"&gt;#137&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--load-extension=spatialite&lt;/code&gt; can be used to load SpatiaLite from common installation locations, if it is available. (&lt;a href="https://github.com/simonw/sqlite-utils/issues/136"&gt;#136&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Tests now also run against Python 3.9. (&lt;a href="https://github.com/simonw/sqlite-utils/issues/184"&gt;#184&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Passing &lt;code&gt;pk=["id"]&lt;/code&gt; now has the same effect as passing &lt;code&gt;pk="id"&lt;/code&gt;. (&lt;a href="https://github.com/simonw/sqlite-utils/issues/181"&gt;#181&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
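&lt;p&gt;To illustrate the problem &lt;code&gt;--encoding&lt;/code&gt; solves (a sketch using only the standard library, not sqlite-utils itself): a CSV file saved as Latin-1 needs an explicit encoding to round-trip correctly:&lt;/p&gt;

```python
import csv, os, tempfile

# Write a Latin-1 encoded CSV, then read it back with an explicit encoding -
# this is the situation sqlite-utils' --encoding option handles on the CLI
path = os.path.join(tempfile.mkdtemp(), "rows.csv")
with open(path, "w", encoding="latin-1") as f:
    f.write("name\nCafé\n")

with open(path, encoding="latin-1") as f:
    rows = list(csv.DictReader(f))

print(rows)  # [{'name': 'Café'}]
```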
&lt;h4&gt;Datasette&lt;/h4&gt;
&lt;p&gt;No new release yet, but I've landed some small new features to the &lt;code&gt;main&lt;/code&gt; branch.&lt;/p&gt;
&lt;p&gt;Inspired by the GitHub and WordPress APIs, Datasette's JSON API now supports &lt;code&gt;Link:&lt;/code&gt; HTTP header pagination (&lt;a href="https://github.com/simonw/datasette/issues/1014"&gt;#1014&lt;/a&gt;).&lt;/p&gt;
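&lt;p&gt;The header takes the form &lt;code&gt;Link: &amp;lt;url&amp;gt;; rel="next"&lt;/code&gt;. As a sketch of how a client might consume it (my illustration, not Datasette code), here's a tiny parser for the happy path:&lt;/p&gt;

```python
def parse_link_header(value):
    # Parse 'Link: <url>; rel="next", <url2>; rel="prev"' into {rel: url}
    links = {}
    for part in value.split(","):
        url_part, _, rel_part = part.strip().partition(";")
        url = url_part.strip().lstrip("<").rstrip(">")
        rel = rel_part.split('"')[1]  # the value between the quotes in rel="..."
        links[rel] = url
    return links

header = '<https://latest.datasette.io/fixtures/facetable.json?_next=5>; rel="next"'
print(parse_link_header(header))
```

&lt;p&gt;(A real parser also needs to cope with commas inside URLs and unquoted parameters; this only covers the common case.)&lt;/p&gt;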
&lt;p&gt;This is part of my ongoing effort to &lt;a href="https://github.com/simonw/datasette/issues/782"&gt;redesign the default JSON format&lt;/a&gt; ready for Datasette 1.0. I started a new plugin called &lt;a href="https://github.com/simonw/datasette-json-preview"&gt;datasette-json-preview&lt;/a&gt; to let me iterate on that format independently of Datasette itself.&lt;/p&gt;
&lt;p&gt;Jacob Fenton suggested an &lt;a href="https://github.com/simonw/datasette/issues/1019"&gt;"Edit SQL" button on canned queries&lt;/a&gt;. That's a great idea, so I built it - &lt;a href="https://github.com/simonw/datasette/issues/1019#issuecomment-708139822"&gt;this issue comment&lt;/a&gt; links to some demos, e.g. &lt;a href="https://latest.datasette.io/fixtures/neighborhood_search?text=ber"&gt;this one here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I added an "x" button for clearing filters to the table page (&lt;a href="https://github.com/simonw/datasette/issues/1016"&gt;#1016&lt;/a&gt;) demonstrated by this GIF:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Animation demonstrating the new x button next to filters" src="https://static.simonwillison.net/static/2020/x-button-filters.gif" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/til/til/homebrew_upgrading-python-homebrew-packages.md"&gt;Upgrading Python Homebrew packages using pip&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/til/til/python_click-file-encoding.md"&gt;Explicit file encodings using click.File&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/sqlite-utils/releases/tag/2.22"&gt;sqlite-utils 2.22&lt;/a&gt; - 2020-10-16&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/csv-diff/releases/tag/1.0"&gt;csv-diff 1.0&lt;/a&gt; - 2020-10-16&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/dogsheep/swarm-to-sqlite/releases/tag/0.3.2"&gt;swarm-to-sqlite 0.3.2&lt;/a&gt; - 2020-10-12&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/dogsheep/evernote-to-sqlite/releases/tag/0.2"&gt;evernote-to-sqlite 0.2&lt;/a&gt; - 2020-10-12&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/dogsheep/evernote-to-sqlite/releases/tag/0.1"&gt;evernote-to-sqlite 0.1&lt;/a&gt; - 2020-10-11&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/xml-analyser/releases/tag/1.0"&gt;xml-analyser 1.0&lt;/a&gt; - 2020-10-11&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-json-preview/releases/tag/0.1"&gt;datasette-json-preview 0.1&lt;/a&gt; - 2020-10-11&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="projects"/><category term="scraping"/><category term="datasette"/><category term="weeknotes"/><category term="sqlite-utils"/></entry><entry><title>Git scraping: track changes over time by scraping to a Git repository</title><link href="https://simonwillison.net/2020/Oct/9/git-scraping/#atom-tag" rel="alternate"/><published>2020-10-09T18:27:23+00:00</published><updated>2020-10-09T18:27:23+00:00</updated><id>https://simonwillison.net/2020/Oct/9/git-scraping/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;Git scraping&lt;/strong&gt; is the name I've given a scraping technique that I've been experimenting with for a few years now. It's really effective, and more people should use it.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update 5th March 2021:&lt;/strong&gt; I presented a version of this post as &lt;a href="https://simonwillison.net/2021/Mar/5/git-scraping/"&gt;a five minute lightning talk at NICAR 2021&lt;/a&gt;, which includes a live coding demo of building a new git scraper.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update 5th January 2022:&lt;/strong&gt; I released a tool called &lt;a href="https://simonwillison.net/2021/Dec/7/git-history/"&gt;git-history&lt;/a&gt; that helps analyze data that has been collected using this technique.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The internet is full of interesting data that changes over time. These changes can sometimes be more interesting than the underlying static data - the &lt;a href="https://twitter.com/nyt_diff"&gt;@nyt_diff Twitter account&lt;/a&gt;, for example, tracks changes made to New York Times headlines, which offers a fascinating insight into that publication's editorial process.&lt;/p&gt;
&lt;p&gt;We already have a great tool for efficiently tracking changes to text over time: &lt;strong&gt;Git&lt;/strong&gt;. And &lt;a href="https://github.com/features/actions"&gt;GitHub Actions&lt;/a&gt; (and other CI systems) make it easy to create a scraper that runs every few minutes, records the current state of a resource and records changes to that resource over time in the commit history.&lt;/p&gt;
&lt;p&gt;Here's a recent example. Fires continue to rage in California, and the &lt;a href="https://www.fire.ca.gov/"&gt;CAL FIRE website&lt;/a&gt; offers an &lt;a href="https://www.fire.ca.gov/incidents/"&gt;incident map&lt;/a&gt; showing the latest fire activity around the state.&lt;/p&gt;
&lt;p&gt;Firing up the Firefox Network pane, filtering to requests triggered by XHR and sorting by size (largest first) reveals this endpoint:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents"&gt;https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;That's a 241KB JSON endpoint with full details of the various fires around the state.&lt;/p&gt;
&lt;p&gt;So... I started running a git scraper against it. My scraper lives in the &lt;a href="https://github.com/simonw/ca-fires-history"&gt;simonw/ca-fires-history&lt;/a&gt; repository on GitHub.&lt;/p&gt;
&lt;p&gt;Every 20 minutes it grabs the latest copy of that JSON endpoint, pretty-prints it (for diff readability) using &lt;code&gt;jq&lt;/code&gt; and commits it back to the repo if it has changed.&lt;/p&gt;
&lt;p&gt;This means I now have a &lt;a href="https://github.com/simonw/ca-fires-history/commits/main"&gt;commit log&lt;/a&gt; of changes to that information about fires in California. Here's an &lt;a href="https://github.com/simonw/ca-fires-history/commit/7b0f42d4bf198885ab2b41a22a8da47157572d18"&gt;example commit&lt;/a&gt; showing that last night the Zogg Fires percentage contained increased from 90% to 92%, the number of personnel involved dropped from 968 to 798 and the number of engines responding dropped from 82 to 59.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2020/git-scraping.png" alt="Screenshot of a diff against the Zogg Fires, showing personnel involved dropping from 968 to 798, engines dropping 82 to 59, water tenders dropping 31 to 27 and percent contained increasing from 90 to 92." style="max-width: 100%" /&gt;&lt;/p&gt;
&lt;p&gt;The implementation of the scraper is entirely contained in a single GitHub Actions workflow. It's in a file called &lt;a href="https://github.com/simonw/ca-fires-history/blob/main/.github/workflows/scrape.yml"&gt;.github/workflows/scrape.yml&lt;/a&gt; which looks like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Scrape latest data&lt;/span&gt;

&lt;span class="pl-ent"&gt;on&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;push&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;workflow_dispatch&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;schedule&lt;/span&gt;:
    - &lt;span class="pl-ent"&gt;cron&lt;/span&gt;:  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;6,26,46 * * * *&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

&lt;span class="pl-ent"&gt;jobs&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;scheduled&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;runs-on&lt;/span&gt;: &lt;span class="pl-s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="pl-ent"&gt;steps&lt;/span&gt;:
    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Check out this repo&lt;/span&gt;
      &lt;span class="pl-ent"&gt;uses&lt;/span&gt;: &lt;span class="pl-s"&gt;actions/checkout@v2&lt;/span&gt;
    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Fetch latest data&lt;/span&gt;
      &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;        curl https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents | jq . &amp;gt; incidents.json&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Commit and push if it changed&lt;/span&gt;
      &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;        git config user.name "Automated"&lt;/span&gt;
&lt;span class="pl-s"&gt;        git config user.email "actions@users.noreply.github.com"&lt;/span&gt;
&lt;span class="pl-s"&gt;        git add -A&lt;/span&gt;
&lt;span class="pl-s"&gt;        timestamp=$(date -u)&lt;/span&gt;
&lt;span class="pl-s"&gt;        git commit -m "Latest data: ${timestamp}" || exit 0&lt;/span&gt;
&lt;span class="pl-s"&gt;        git push&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;That's not a lot of code!&lt;/p&gt;
&lt;p&gt;It runs on a schedule at 6, 26 and 46 minutes past the hour - I like to offset my cron times like this since I assume that the majority of crons run exactly on the hour, so running not-on-the-hour feels polite.&lt;/p&gt;
&lt;p&gt;The scraper itself works by fetching the JSON using &lt;code&gt;curl&lt;/code&gt;, piping it through &lt;code&gt;jq .&lt;/code&gt; to pretty-print it and saving the result to &lt;code&gt;incidents.json&lt;/code&gt;.&lt;/p&gt;
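&lt;p&gt;The pretty-printing step matters because stable, indented output means identical data always serializes to identical text, keeping each commit's diff down to just the values that changed. A Python equivalent of the &lt;code&gt;jq .&lt;/code&gt; step might look like this (a sketch, not part of the workflow above):&lt;/p&gt;

```python
import json

def pretty(data):
    # Stable, indented output: identical data always produces identical text,
    # so git only records a commit when the content actually changed
    return json.dumps(data, indent=2, sort_keys=True) + "\n"

old = pretty({"incidents": [{"name": "Zogg Fire", "contained": 90}]})
new = pretty({"incidents": [{"name": "Zogg Fire", "contained": 92}]})
print(old == new)  # False - the change shows up as a tiny, readable diff
```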
&lt;p&gt;The "commit and push if it changed" block uses a pattern that commits and pushes only if the file has changed. I wrote about this pattern in &lt;a href="https://til.simonwillison.net/til/til/github-actions_commit-if-file-changed.md"&gt;this TIL&lt;/a&gt; a few months ago.&lt;/p&gt;
&lt;p&gt;I have a whole bunch of repositories running git scrapers now. I've been labeling them with the &lt;a href="https://github.com/topics/git-scraping"&gt;git-scraping topic&lt;/a&gt; so they show up in one place on GitHub (other people have started using that topic as well).&lt;/p&gt;
&lt;p&gt;I've written about some of these &lt;a href="https://simonwillison.net/tags/gitscraping/"&gt;in the past&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2017/Sep/10/scraping-irma/"&gt;Scraping hurricane Irma&lt;/a&gt; back in September 2017 is when I first came up with the idea to use a Git repository in this way.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2017/Oct/10/fires-in-the-north-bay/"&gt;Changelogs to help understand the fires in the North Bay&lt;/a&gt; from October 2017 describes an early attempt at scraping fire-related information.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2019/Mar/13/tree-history/"&gt;Generating a commit log for San Francisco’s official list of trees&lt;/a&gt; remains my favourite application of this technique. The City of San Francisco maintains a frequently updated CSV file of 190,000 trees in the city, and I have &lt;a href="https://github.com/simonw/sf-tree-history/find/master"&gt;a commit log&lt;/a&gt; of changes to it stretching back over more than a year. This example uses my &lt;a href="https://github.com/simonw/csv-diff"&gt;csv-diff&lt;/a&gt; utility to generate human-readable commit messages.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2019/Oct/10/pge-outages/"&gt;Tracking PG&amp;amp;E outages by scraping to a git repo&lt;/a&gt; documents my attempts to track the impact of PG&amp;amp;E's outages last year by scraping their outage map. I used the GitPython library to turn the values recorded in the commit history into a database that let me run visualizations of changes over time.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2020/Jan/21/github-actions-cloud-run/"&gt;Tracking FARA by deploying a data API using GitHub Actions and Cloud Run&lt;/a&gt; shows how I track new registrations for the US Foreign Agents Registration Act (FARA) in a repository and deploy the latest version of the data using Datasette.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I hope that by giving this technique a name I can encourage more people to add it to their toolbox. It's an extremely effective way of turning all sorts of interesting data sources into a changelog over time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://news.ycombinator.com/item?id=24732943"&gt;Comment thread&lt;/a&gt; on this post over on Hacker News.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="git"/><category term="github"/><category term="projects"/><category term="scraping"/><category term="github-actions"/><category term="git-scraping"/></entry><entry><title>Tracking PG&amp;E outages by scraping to a git repo</title><link href="https://simonwillison.net/2019/Oct/10/pge-outages/#atom-tag" rel="alternate"/><published>2019-10-10T23:32:14+00:00</published><updated>2019-10-10T23:32:14+00:00</updated><id>https://simonwillison.net/2019/Oct/10/pge-outages/#atom-tag</id><summary type="html">
    &lt;p&gt;PG&amp;amp;E have &lt;a href="https://twitter.com/bedwardstiek/status/1182047040932470784"&gt;cut off power&lt;/a&gt; to several million people in northern California, supposedly as a precaution against wildfires.&lt;/p&gt;

&lt;p&gt;As it happens, I've been scraping and recording PG&amp;amp;E's outage data every 10 minutes for the past 4+ months. This data got really interesting over the past two days!&lt;/p&gt;

&lt;p&gt;The original data lives in &lt;a href="https://github.com/simonw/pge-outages"&gt;a GitHub repo&lt;/a&gt; (more importantly in &lt;a href="https://github.com/simonw/pge-outages/commits/master"&gt;the commit history&lt;/a&gt; of that repo).&lt;/p&gt;

&lt;p&gt;Reading JSON in a Git repo isn't particularly productive, so this afternoon I figured out how to transform that data into a SQLite database and publish it with &lt;a href="https://github.com/simonw/datasette"&gt;Datasette&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The result is &lt;code&gt;https://pge-outages.simonwillison.net/&lt;/code&gt; (no longer available)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update from 27th October 2019&lt;/strong&gt;: I also used the data to create this animation (first shared &lt;a href="https://twitter.com/simonw/status/1188612004572880896"&gt;on Twitter&lt;/a&gt;):&lt;/p&gt;

&lt;video style="max-width: 100%" src="https://static.simonwillison.net/static/2019/outages.mp4" controls="controls"&gt;
  Your browser does not support the video tag.
&lt;/video&gt;

&lt;h3 id="thedatamodeloutagesandsnapshots"&gt;The data model: outages and snapshots&lt;/h3&gt;

&lt;p&gt;The three key tables to understand are &lt;code&gt;outages&lt;/code&gt;, &lt;code&gt;snapshots&lt;/code&gt; and &lt;code&gt;outage_snapshots&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;PG&amp;amp;E assign an outage ID to every outage - where an outage is usually something that affects a few dozen customers. I store these in the &lt;a href="https://pge-outages.simonwillison.net/pge-outages/outages?_sort_desc=outageStartTime"&gt;outages table&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Every 10 minutes I grab a snapshot of their full JSON file, which reports every single outage that is currently ongoing. I store a record of when I grabbed that snapshot in the &lt;a href="https://pge-outages.simonwillison.net/pge-outages/snapshots?_sort_desc=id"&gt;snapshots table&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The most interesting table is &lt;code&gt;outage_snapshots&lt;/code&gt;. Every time I see an outage in the JSON feed, I record a new copy of its data as an &lt;code&gt;outage_snapshot&lt;/code&gt; row. This allows me to reconstruct the full history of any outage, in 10 minute increments.&lt;/p&gt;
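&lt;p&gt;A heavily simplified sketch of that three-table design (the real schema has many more columns) using Python's standard-library &lt;code&gt;sqlite3&lt;/code&gt;:&lt;/p&gt;

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE snapshots (id INTEGER PRIMARY KEY, captured_at TEXT);
CREATE TABLE outages (id INTEGER PRIMARY KEY);
CREATE TABLE outage_snapshots (
    snapshot INTEGER REFERENCES snapshots(id),
    outage INTEGER REFERENCES outages(id),
    estCustAffected INTEGER
);
""")

# One outage, seen in two consecutive 10-minute snapshots
db.execute("INSERT INTO outages VALUES (1)")
for snap_id, affected in [(1, 120), (2, 80)]:
    db.execute("INSERT INTO snapshots VALUES (?, ?)", (snap_id, f"2019-10-10T0{snap_id}:00"))
    db.execute("INSERT INTO outage_snapshots VALUES (?, 1, ?)", (snap_id, affected))

# Reconstruct the full history of outage 1, in snapshot order
history = db.execute(
    "SELECT snapshot, estCustAffected FROM outage_snapshots WHERE outage = 1 ORDER BY snapshot"
).fetchall()
print(history)  # [(1, 120), (2, 80)]
```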

&lt;p&gt;Here are &lt;a href="https://pge-outages.simonwillison.net/pge-outages/outage_snapshots?snapshot=1269"&gt;all of the outages&lt;/a&gt; that were represented in &lt;a href="https://pge-outages.simonwillison.net/pge-outages/snapshots/1269"&gt;snapshot 1269&lt;/a&gt; - captured at 4:10pm Pacific Time today.&lt;/p&gt;

&lt;p&gt;I can run &lt;code&gt;select sum(estCustAffected) from outage_snapshots where snapshot = 1269&lt;/code&gt; (&lt;a href="https://pge-outages.simonwillison.net/pge-outages?sql=select+sum%28estCustAffected%29+from+outage_snapshots+where+snapshot+%3D+%3Aid&amp;amp;id=1269"&gt;try it here&lt;/a&gt;) to count up the total PG&amp;amp;E estimate of the number of affected customers - it's 545,706!&lt;/p&gt;

&lt;p&gt;I've installed &lt;a href="https://github.com/simonw/datasette-vega"&gt;datasette-vega&lt;/a&gt; which means I can render graphs. Here's my first attempt at a graph showing &lt;a href="https://pge-outages.simonwillison.net/pge-outages?sql=select+snapshots.id%2C+title+as+snapshotTime%2C+hash%2C+sum%28outage_snapshots.estCustAffected%29+as+totalEstCustAffected%0D%0Afrom+snapshots+join+outage_snapshots+on+snapshots.id+%3D+outage_snapshots.snapshot%0D%0Agroup+by+snapshots.id+order+by+snapshots.id+desc+limit+150#g.mark=line&amp;amp;g.x_column=snapshotTime&amp;amp;g.x_type=ordinal&amp;amp;g.y_column=totalEstCustAffected&amp;amp;g.y_type=quantitative"&gt;the number of estimated customers affected over time&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://static.simonwillison.net/static/2019/pge-outages-graph.png" style="text-decoration: none; border: none;"&gt;&lt;img src="https://static.simonwillison.net/static/2019/pge-outages-graph.png" style="max-width: 100%" /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(I don't know why there's a dip towards the end of the graph).&lt;/p&gt;

&lt;p&gt;I also defined &lt;a href="https://pge-outages.simonwillison.net/pge-outages/most_recent_snapshot"&gt;a SQL view&lt;/a&gt; which shows all of the outages from the most recently captured snapshot (usually within the past 10 minutes if the PG&amp;amp;E website hasn't gone down) and renders them using &lt;a href="https://github.com/simonw/datasette-cluster-map"&gt;datasette-cluster-map&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://static.simonwillison.net/static/2019/pge-map.jpg" style="text-decoration: none; border: none;"&gt;&lt;img src="https://static.simonwillison.net/static/2019/pge-map.jpg" style="max-width: 100%" /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3 id="thingstobeawareof"&gt;Things to be aware of&lt;/h3&gt;

&lt;p&gt;There are a huge amount of unanswered questions about this data. I've just been looking at PG&amp;amp;E's JSON and making guesses about what things like &lt;code&gt;estCustAffected&lt;/code&gt; means. Without official documentation we can only guess as to how accurate this data is, or how it should be interpreted.&lt;/p&gt;

&lt;p&gt;Some things to question:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What's the quality of this data? Does it reflect accurately on what's actually going on out there?&lt;/li&gt;

&lt;li&gt;What's the exact meaning of the different columns - &lt;code&gt;estCustAffected&lt;/code&gt;, &lt;code&gt;currentEtor&lt;/code&gt;, &lt;code&gt;autoEtor&lt;/code&gt;, &lt;code&gt;hazardFlag&lt;/code&gt; etc?&lt;/li&gt;

&lt;li&gt;Various columns (&lt;code&gt;lastUpdateTime&lt;/code&gt;, &lt;code&gt;currentEtor&lt;/code&gt;, &lt;code&gt;autoEtor&lt;/code&gt;) appear to be integer &lt;a href="https://en.wikipedia.org/wiki/Unix_time"&gt;unix timestamps&lt;/a&gt;. What timezone were they recorded in? Do they include DST etc?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id="howitworks"&gt;How it works&lt;/h3&gt;

&lt;p&gt;I originally wrote the scraper &lt;a href="https://simonwillison.net/2017/Oct/10/fires-in-the-north-bay/"&gt;back in October 2017&lt;/a&gt; during the North Bay fires, and moved it to run on Circle CI based on my work building &lt;a href="https://simonwillison.net/2019/Mar/13/tree-history/"&gt;a commit history of San Francisco's trees&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It's pretty simple: every 10 minutes &lt;a href="https://circleci.com/gh/simonw/disaster-scrapers"&gt;a Circle CI job&lt;/a&gt; runs which scrapes &lt;a href="https://apim.pge.com/cocoutage/outages/getOutagesRegions?regionType=city&amp;amp;expand=true"&gt;the JSON feed&lt;/a&gt; that powers the PG&amp;amp;E website's &lt;a href="https://www.pge.com/myhome/outages/outage/index.shtml"&gt;outage map&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The JSON is then committed to my &lt;a href="https://github.com/simonw/pge-outages"&gt;pge-outages GitHub repository&lt;/a&gt;, over-writing the existing &lt;a href="https://github.com/simonw/pge-outages/blob/master/pge-outages.json"&gt;pge-outages.json file&lt;/a&gt;. There's some code that attempts to generate a human-readable commit message, but the historic data itself is saved in the commit history of that single file.&lt;/p&gt;

&lt;h3 id="buildingthedatasette"&gt;Building the Datasette&lt;/h3&gt;

&lt;p&gt;The hardest part of this project was figuring out how to turn a GitHub commit history of changes to a JSON file into a SQLite database for use with Datasette.&lt;/p&gt;

&lt;p&gt;After a bunch of prototyping in a Jupyter notebook, I ended up with the schema described above.&lt;/p&gt;

&lt;p&gt;The code that generates the database can be found in &lt;a href="https://github.com/simonw/pge-outages/blob/master/build_database.py"&gt;build_database.py&lt;/a&gt;. I used &lt;a href="https://gitpython.readthedocs.io/en/stable/"&gt;GitPython&lt;/a&gt; to read data from the git repository and my &lt;a href="https://sqlite-utils.readthedocs.io/en/stable/python-api.html"&gt;sqlite-utils library&lt;/a&gt; to create and update the database.&lt;/p&gt;

&lt;h3 id="deployment"&gt;Deployment&lt;/h3&gt;

&lt;p&gt;Since this is a large database that changes every ten minutes, I couldn't use the usual &lt;a href="https://datasette.readthedocs.io/en/stable/publish.html"&gt;datasette publish&lt;/a&gt; trick of packaging it up and re-deploying it to a serverless host (Cloud Run or Heroku or Zeit Now) every time it updates.&lt;/p&gt;

&lt;p&gt;Instead, I'm running it on a VPS instance. I ended up trying out Digital Ocean for this, after &lt;a href="https://twitter.com/simonw/status/1182077259839991808"&gt;an enjoyable Twitter conversation&lt;/a&gt; about good options for stateful (as opposed to stateless) hosting.&lt;/p&gt;

&lt;h3 id="nextsteps"&gt;Next steps&lt;/h3&gt;

&lt;p&gt;I'm putting this out there and sharing it with the California News Nerd community in the hope that people can find interesting stories in there and help firm up my methodology - or take what I've done and spin up much more interesting forks of it.&lt;/p&gt;

&lt;p&gt;If you build something interesting with this please let me know, via email (swillison is my Gmail) or &lt;a href="https://twitter.com/simonw"&gt;on Twitter&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/digitalocean"&gt;digitalocean&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="data-journalism"/><category term="projects"/><category term="scraping"/><category term="sqlite"/><category term="datasette"/><category term="git-scraping"/><category term="digitalocean"/><category term="sqlite-utils"/></entry><entry><title>scrapely</title><link href="https://simonwillison.net/2018/Jul/10/scrapely/#atom-tag" rel="alternate"/><published>2018-07-10T20:25:01+00:00</published><updated>2018-07-10T20:25:01+00:00</updated><id>https://simonwillison.net/2018/Jul/10/scrapely/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/scrapy/scrapely"&gt;scrapely&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Neat twist on a screen scraping library: this one lets you “train it” by feeding it examples of URLs paired with a dictionary of the data you would like to have extracted from that URL, then uses an instance-based learning algorithm to run against new URLs. Slightly confusing name since it’s maintained by the scrapy team but is a totally independent project from the scrapy web crawling framework.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="scraping"/></entry><entry><title>sqlitebiter</title><link href="https://simonwillison.net/2018/May/17/sqlitebiter/#atom-tag" rel="alternate"/><published>2018-05-17T22:40:28+00:00</published><updated>2018-05-17T22:40:28+00:00</updated><id>https://simonwillison.net/2018/May/17/sqlitebiter/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/thombashi/sqlitebiter"&gt;sqlitebiter&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Similar to my csvs-to-sqlite tool, but sqlitebiter handles “CSV/Excel/HTML/JSON/LTSV/Markdown/SQLite/SSV/TSV/Google-Sheets”. Most interestingly, it works against HTML pages—run “sqlitebiter -v url ’https://en.wikipedia.org/wiki/Comparison_of_firewalls’” and it will scrape that Wikipedia page and create a SQLite table for each of the HTML tables it finds there.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;



</summary><category term="csv"/><category term="scraping"/><category term="sqlite"/><category term="datasette"/></entry><entry><title>kennethreitz/requests-html: HTML Parsing for Humans™</title><link href="https://simonwillison.net/2018/Feb/25/requests-html/#atom-tag" rel="alternate"/><published>2018-02-25T16:49:19+00:00</published><updated>2018-02-25T16:49:19+00:00</updated><id>https://simonwillison.net/2018/Feb/25/requests-html/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/kennethreitz/requests-html"&gt;kennethreitz/requests-html: HTML Parsing for Humans™&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Neat and tiny wrapper around requests, lxml and html2text that provides a Kenneth Reitz grade API design for intuitively fetching and scraping web pages. The inclusion of html2text means you can use a CSS selector to select a specific HTML element and then convert that to the equivalent markdown in a one-liner.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/kennethreitz/status/967749676312211456"&gt;@kennethreitz&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/html"&gt;html&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/requests"&gt;requests&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;&lt;/p&gt;



</summary><category term="html"/><category term="python"/><category term="requests"/><category term="scraping"/></entry><entry><title>Using “import refs” to iteratively import data into Django</title><link href="https://simonwillison.net/2017/Nov/4/import-refs/#atom-tag" rel="alternate"/><published>2017-11-04T19:17:00+00:00</published><updated>2017-11-04T19:17:00+00:00</updated><id>https://simonwillison.net/2017/Nov/4/import-refs/#atom-tag</id><summary type="html">
    &lt;p&gt;I’ve been writing a few scripts to backfill my blog with content I originally posted elsewhere. So far I’ve imported &lt;a href="https://simonwillison.net/tags/quora/"&gt;answers I posted on Quora&lt;/a&gt; (&lt;a href="https://simonwillison.net/2017/Oct/1/ship/"&gt;background&lt;/a&gt;), &lt;a href="https://simonwillison.net/tags/askmetafilter/"&gt;answers I posted on Ask MetaFilter&lt;/a&gt; and &lt;a href="https://simonwillison.net/2017/Oct/8/missing-content/"&gt;content I recovered from the Internet Archive&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I started out writing custom import scripts (like &lt;a href="https://github.com/simonw/simonwillisonblog/blob/e737be8b4228229e833fe7a9ec698f3e262cd094/blog/management/commands/import_quora.py"&gt;this Quora one&lt;/a&gt;), but I’ve now built a generalized mechanism for this which I thought was worth writing up.&lt;/p&gt;
&lt;p&gt;Any of my content imports now take the form of a JSON document, which looks something like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[
  {
    &amp;quot;body&amp;quot;: &amp;quot;&amp;lt;p&amp;gt;&amp;lt;em&amp;gt;My answer to ...&amp;lt;/em&amp;gt;&amp;lt;/p&amp;gt;&amp;quot;,
    &amp;quot;tags&amp;quot;: [
      &amp;quot;backpacks&amp;quot;,
      &amp;quot;laptops&amp;quot;,
      &amp;quot;style&amp;quot;,
      &amp;quot;accessories&amp;quot;,
      &amp;quot;bags&amp;quot;
    ],
    &amp;quot;title&amp;quot;: &amp;quot;I need a new backpack&amp;quot;,
    &amp;quot;datetime&amp;quot;: &amp;quot;2005-01-16T14:08:00&amp;quot;,
    &amp;quot;import_ref&amp;quot;: &amp;quot;askmetafilter:14075&amp;quot;,
    &amp;quot;type&amp;quot;: &amp;quot;entry&amp;quot;,
    &amp;quot;slug&amp;quot;: &amp;quot;i-need-a-new-backpack&amp;quot;
  }
]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Two larger examples: the &lt;a href="https://gist.github.com/simonw/5a5bc1f58297d2c7d68dd7448a4d6614"&gt;missing content I extracted from the Internet Archive&lt;/a&gt;, and &lt;a href="https://gist.github.com/simonw/857572d9b36cd1e791c730790ed489ef"&gt;the answers I scraped from Ask MetaFilter&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;type&lt;/code&gt; property can be set to &lt;code&gt;entry&lt;/code&gt;, &lt;code&gt;quotation&lt;/code&gt; or &lt;code&gt;blogmark&lt;/code&gt; and specifies which type of content should be imported. The &lt;code&gt;datetime&lt;/code&gt;, &lt;code&gt;slug&lt;/code&gt; and &lt;code&gt;tags&lt;/code&gt; fields are common across all three types - the other fields differ for each type.&lt;/p&gt;
&lt;p&gt;The most interesting field here is &lt;code&gt;import_ref&lt;/code&gt;. This is optional, but if provided forms a unique reference associated with that item of content. I then use that reference in a call Django’s &lt;a href="https://docs.djangoproject.com/en/1.11/ref/models/querysets/#update-or-create"&gt;&lt;code&gt;update_or_create()&lt;/code&gt;&lt;/a&gt; method. This means I can run the same import multiple times - the first run will create objects, while subsequent runs update objects in place.&lt;/p&gt;
&lt;p&gt;The end result is that I can incrementally improve the scrapers I am writing, re-importing the resulting JSON to update previously imported records in-place. In addition to hacking on my blog, I’ve been using this pattern for some API integrations at work recently and it’s worked out very well.&lt;/p&gt;
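The same idempotent-import idea works outside Django too. Here is a sketch using SQLite's upsert support keyed on a unique &lt;code&gt;import_ref&lt;/code&gt; column (raw sqlite3 rather than &lt;code&gt;update_or_create()&lt;/code&gt;, with a made-up table schema):

```python
# Idempotent importer: re-running updates rows in place instead of duplicating.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE entry (
    import_ref TEXT UNIQUE, title TEXT, body TEXT, slug TEXT)""")

def import_items(items):
    for item in items:
        conn.execute(
            """INSERT INTO entry (import_ref, title, body, slug)
               VALUES (:import_ref, :title, :body, :slug)
               ON CONFLICT(import_ref) DO UPDATE SET
                 title=excluded.title, body=excluded.body, slug=excluded.slug""",
            item,
        )

items = json.loads("""[{"import_ref": "askmetafilter:14075",
  "title": "I need a new backpack", "body": "...",
  "slug": "i-need-a-new-backpack"}]""")
import_items(items)
items[0]["title"] = "I need a new backpack (updated)"
import_items(items)  # second run: same import_ref, so the row is updated

print(conn.execute("SELECT count(*), title FROM entry").fetchone())
# (1, 'I need a new backpack (updated)')
```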
&lt;p&gt;&lt;code&gt;import_ref&lt;/code&gt; is defined on my models as a unique, nullable text field:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;    import_ref = models.TextField(max_length=64, null=True, unique=True)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since the Django admin doesn’t handle nullable fields well by default, I &lt;a href="https://github.com/simonw/simonwillisonblog/blob/e737be8b4228229e833fe7a9ec698f3e262cd094/blog/admin.py#L19"&gt;added &lt;code&gt;import_ref&lt;/code&gt; to my &lt;code&gt;readonly_fields&lt;/code&gt; property&lt;/a&gt; in my admin configuration to avoid accidentally setting it to a blank string when editing through the admin interface.&lt;/p&gt;
&lt;p&gt;Here’s my completed &lt;a href="https://github.com/simonw/simonwillisonblog/blob/739a8cb49cfd49da5c643e41027af04d484e2aef/blog/management/commands/import_blog_json.py"&gt;&lt;code&gt;import_blog_json&lt;/code&gt; management command&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;My workflow for importing data is now pretty streamlined. I write the scrapers in &lt;a href="https://github.com/simonw/simonwillisonblog/tree/b7b59e504b5c2f5e04ad59e83a1f4fb6f76c58da/jupyter-notebooks"&gt;a Jupyter notebook&lt;/a&gt; and use that to generate a list of importable items as Python dictionaries. I run &lt;code&gt;open('/tmp/items.json', 'w').write(json.dumps(items, indent=2))&lt;/code&gt; to dump the items to a JSON file. Then I can run &lt;code&gt;./manage.py import_blog_json /tmp/items.json&lt;/code&gt; to import them into my local development environment - thanks to the &lt;code&gt;import_ref&lt;/code&gt; I can do this as many times as I like until I’m pleased with the result.&lt;/p&gt;
&lt;p&gt;Once it’s ready, I run &lt;code&gt;!cat /tmp/items.json | pbcopy&lt;/code&gt; in Jupyter to copy the JSON to my clipboard, then paste it into a new &lt;a href="https://gist.github.com/"&gt;GitHub Gist&lt;/a&gt;. I then copy the URL to the raw JSON and run the import command with that URL against my production instance.&lt;/p&gt;
&lt;p&gt;Heroku tip: running &lt;code&gt;heroku run bash&lt;/code&gt; will start a bash prompt in a dyno hooked up to your application. You can then run &lt;code&gt;./manage.py ...&lt;/code&gt; commands against your production environment.&lt;/p&gt;
&lt;p&gt;So… I just have to run &lt;code&gt;heroku run bash&lt;/code&gt; followed by &lt;code&gt;./manage.py import_blog_json https://gist.github.com/path-to-json --tag_with=askmetafilter&lt;/code&gt; and the new content will be live on my site.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;tag_with&lt;/code&gt; option allows me to specify a tag to apply to all of that imported content, useful for checking that everything worked as expected.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/django"&gt;django&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/django-admin"&gt;django-admin&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/heroku"&gt;heroku&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jupyter"&gt;jupyter&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="django"/><category term="django-admin"/><category term="scraping"/><category term="heroku"/><category term="jupyter"/></entry><entry><title>Changelogs to help understand the fires in the North Bay</title><link href="https://simonwillison.net/2017/Oct/10/fires-in-the-north-bay/#atom-tag" rel="alternate"/><published>2017-10-10T06:48:07+00:00</published><updated>2017-10-10T06:48:07+00:00</updated><id>https://simonwillison.net/2017/Oct/10/fires-in-the-north-bay/#atom-tag</id><summary type="html">
    &lt;p&gt;The situation in the counties north of San Francisco &lt;a href="http://www.sfgate.com/bayarea/article/Latest-on-North-Bay-fires-A-really-rough-12263721.php"&gt;is horrifying right now&lt;/a&gt;. I’ve repurposed some of &lt;a href="https://simonwillison.net/2017/Sep/10/scraping-irma/"&gt;the tools I built for the Irma Response project&lt;/a&gt; last month to collect and track some data that might be of use to anyone trying to understand what’s happening up there. I’m sharing these now in the hope that they might prove useful.&lt;/p&gt;
&lt;p&gt;I’m scraping a number of sources relevant to the crisis, and making the data available in &lt;a href="https://github.com/simonw/irma-scraped-data/"&gt;a repository on GitHub&lt;/a&gt;. Because it’s a git repository, changes to those sources are tracked automatically. The value I’m providing here isn’t so much the data itself, it’s the history of the data. If you need to see what has changed and when, my repository’s commit log should have the answers for you. Or maybe you’ll just want to occasionally hit refresh on &lt;a href="https://github.com/simonw/irma-scraped-data/commits/master/santa-rosa-emergency.json"&gt;this history of changes&lt;/a&gt; to &lt;a href="https://srcity.org/610/Emergency-Information"&gt;srcity.org/610/Emergency-Information&lt;/a&gt; to see when they edited the information.&lt;/p&gt;
&lt;p&gt;The sources I’m tracking right now are:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The &lt;a href="https://srcity.org/610/Emergency-Information"&gt;Santa Rosa Fire Department’s Emergency Information&lt;/a&gt; page. This is being maintained by hand so it’s not a great source of structured data, but it has key details like the location and availability of shelters and it’s useful to know what was changed and when. &lt;a href="https://github.com/simonw/irma-scraped-data/commits/master/santa-rosa-emergency.json"&gt;History of changes to that page&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://m.pge.com/#outages"&gt;PG&amp;amp;E power outages&lt;/a&gt;. This is probably the highest quality dataset with the &lt;a href="https://github.com/simonw/irma-scraped-data/commit/50ab3d3f3a5f117054e3209c7f0d520e6b483f0e#diff-2432d375ba73b2c87c88f55b12a0a2f0"&gt;neatest commit messages&lt;/a&gt;. The &lt;a href="https://github.com/simonw/irma-scraped-data/commits/master/pge-outages-individual.json"&gt;commit history of these&lt;/a&gt; shows exactly when new outages are reported and how many customers were affected.&lt;/li&gt;
&lt;li&gt;&lt;a href="http://roadconditions.sonoma-county.org/"&gt;Road Conditions in the County of Sonoma&lt;/a&gt;. If you want to understand how far the fire has spread, this is a useful source of data as it shows which roads have been closed due to fire or other reasons. &lt;a href="https://github.com/simonw/irma-scraped-data/commits/master/sonoma-road-conditions.json"&gt;History of changes&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;California Highway Patrol Incidents, extracted from a KML feed on &lt;a href="http://quickmap.dot.ca.gov/"&gt;quickmap.dot.ca.gov&lt;/a&gt;. Since these cover the whole state of California there’s a lot of stuff in here that isn’t directly relevant to the North Bay, but the incidents that mention fire still help tell the story of what’s been happening. &lt;a href="https://github.com/simonw/irma-scraped-data/commits/master/chp-incidents.json"&gt;History of changes&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The code for the scrapers can be &lt;a href="https://github.com/simonw/irma-scrapers/blob/master/north_bay.py"&gt;found in north_bay.py&lt;/a&gt;. Please leave comments, feedback or suggestions on other useful potential sources of data &lt;a href="https://github.com/simonw/simonwillisonblog/issues/4"&gt;in this GitHub issue&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/crisishacking"&gt;crisishacking&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="scraping"/><category term="crisishacking"/><category term="git-scraping"/></entry><entry><title>Scraping hurricane Irma</title><link href="https://simonwillison.net/2017/Sep/10/scraping-irma/#atom-tag" rel="alternate"/><published>2017-09-10T06:21:17+00:00</published><updated>2017-09-10T06:21:17+00:00</updated><id>https://simonwillison.net/2017/Sep/10/scraping-irma/#atom-tag</id><summary type="html">
    &lt;p&gt;The &lt;a href="https://www.irmaresponse.org/"&gt;Irma Response project&lt;/a&gt; is a team of volunteers working together to make information available during and after the storm. There is a huge amount of information out there, on many different websites. The &lt;a href="https://irma-api.herokuapp.com/"&gt;Irma API&lt;/a&gt; is an attempt to gather key information in one place, verify it and publish it in a reuseable way. It currently powers the &lt;a href="https://www.irmashelters.org/"&gt;irmashelters.org&lt;/a&gt; website.&lt;/p&gt;
&lt;p&gt;To aid this effort, I built a collection of screen scrapers that pull data from a number of different websites and APIs. That data is then stored in &lt;a href="https://github.com/simonw/irma-scraped-data/"&gt;a Git repository&lt;/a&gt;, providing a clear history of changes made to the various sources that are being tracked.&lt;/p&gt;
&lt;p&gt;Some of the scrapers also publish their findings to Slack in a format designed to make it obvious when key events happen, such as new shelters being added or removed from public listings.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Tracking_changes_over_time_8"&gt;&lt;/a&gt;Tracking changes over time&lt;/h3&gt;
&lt;p&gt;A key goal of this screen scraping mechanism is to allow changes to the underlying data sources to be tracked over time. This is achieved using git, via the GitHub API. Each scraper pulls down data from a source (an API or a website) and reformats that data into a sanitized JSON format. That JSON is then written to the git repository. If the data has changed since the last time the scraper ran, those changes will be captured by git and made available in the commit log.&lt;/p&gt;
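The snapshot step can be sketched like this (a simplified stand-in for the real scrapers, which write via the GitHub API; the filename is made up, and the actual &lt;code&gt;git add&lt;/code&gt;/&lt;code&gt;git commit&lt;/code&gt; is left to the caller):

```python
# Normalize scraped data to stable JSON and only rewrite the file when
# something changed, so every git commit represents a real diff.
import json
import pathlib
import tempfile

def save_snapshot(data, path):
    path = pathlib.Path(path)
    new = json.dumps(data, indent=2, sort_keys=True)  # stable key order
    old = path.read_text() if path.exists() else None
    if new == old:
        return False  # no change: nothing for git to commit
    path.write_text(new)
    return True  # caller then commits the file to the repository

tmp = pathlib.Path(tempfile.mkdtemp()) / "shelters.json"
print(save_snapshot({"shelters": ["A"]}, tmp))       # True  (first write)
print(save_snapshot({"shelters": ["A"]}, tmp))       # False (unchanged)
print(save_snapshot({"shelters": ["A", "B"]}, tmp))  # True  (new shelter)
```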
&lt;p&gt;Recent changes tracked by the scraper collection can be seen here: &lt;a href="https://github.com/simonw/irma-scraped-data/commits/master"&gt;https://github.com/simonw/irma-scraped-data/commits/master&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;&lt;a id="Generating_useful_commit_messages_14"&gt;&lt;/a&gt;Generating useful commit messages&lt;/h3&gt;
&lt;p&gt;The most complex code for most of the scrapers isn’t in fetching the data: it’s in generating useful, human-readable commit messages that summarize the underlying change. For example, here is &lt;a href="https://github.com/simonw/irma-scraped-data/commit/7919aeff0913ec26d1bea8dc"&gt;a commit message&lt;/a&gt; generated by the scraper that tracks the &lt;a href="http://www.floridadisaster.org/shelters/summary.aspx"&gt;http://www.floridadisaster.org/shelters/summary.aspx&lt;/a&gt; page:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;florida-shelters.json: 2 shelters added

Added shelter: Atwater Elementary School (Sarasota County)
Added shelter: DEBARY ELEMENTARY SCHOOL (Volusia County)
Change detected on http://www.floridadisaster.org/shelters/summary.aspx
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The full commit also shows the changes to the underlying JSON, but the human-readable message provides enough information that people who are not JSON-literate programmers can still derive value from the commit.&lt;/p&gt;
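The shape of that commit-message generation can be sketched as a diff of the previous and current shelter lists by name (a simplification of the real scraper, which also records counties and handles other change types):

```python
# Diff two shelter lists and describe the change in plain English.
def commit_message(old, new, url):
    old_names, new_names = set(old), set(new)
    added = sorted(new_names - old_names)
    removed = sorted(old_names - new_names)
    lines = [f"florida-shelters.json: {len(added)} shelters added", ""]
    lines += [f"Added shelter: {name}" for name in added]
    lines += [f"Removed shelter: {name}" for name in removed]
    lines.append(f"Change detected on {url}")
    return "\n".join(lines)

msg = commit_message(
    old=["North Port High School"],
    new=["North Port High School",
         "Atwater Elementary School (Sarasota County)"],
    url="http://www.floridadisaster.org/shelters/summary.aspx",
)
print(msg.splitlines()[0])  # florida-shelters.json: 1 shelters added
```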
&lt;h3&gt;&lt;a id="Publishing_to_Slack_26"&gt;&lt;/a&gt;Publishing to Slack&lt;/h3&gt;
&lt;p&gt;The Irma Response team use Slack to co-ordinate their efforts. You can join their Slack here: &lt;a href="https://irma-response-slack.herokuapp.com/"&gt;https://irma-response-slack.herokuapp.com/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Some of the scrapers publish detected changes in their data source to Slack, as links to the commits generated for each change. The human-readable message is posted directly to the channel.&lt;/p&gt;
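Posting a detected change to Slack via an incoming webhook looks roughly like this (the webhook URL is a placeholder, not the project's real one, and the network call itself is left unexecuted here):

```python
# Turn a human-readable commit message into a Slack incoming-webhook payload.
import json
import urllib.request

def slack_payload(message, commit_url):
    return {"text": f"{message}\n{commit_url}"}

def post_to_slack(webhook_url, payload):
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)  # requires a live webhook URL

payload = slack_payload(
    "florida-shelters.json: 2 shelters added",
    "https://github.com/simonw/irma-scraped-data/commit/7919aef",
)
print(payload["text"].splitlines()[0])  # florida-shelters.json: 2 shelters added
```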
&lt;p&gt;&lt;img style="width: 100%" src="http://static.simonwillison.net.s3.amazonaws.com/static/2017/irma-slack.jpg" alt="Bot publishing to Slack" /&gt;&lt;/p&gt;
&lt;p&gt;The source code for all of the scrapers can be found at &lt;a href="https://github.com/simonw/irma-scrapers"&gt;https://github.com/simonw/irma-scrapers&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This entry started out as a &lt;a href="https://github.com/simonw/irma-scrapers/blob/master/README.md"&gt;README file&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/crisishacking"&gt;crisishacking&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="scraping"/><category term="crisishacking"/><category term="git-scraping"/></entry></feed>