<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: git</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/git.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2026-03-21T22:08:24+00:00</updated><author><name>Simon Willison</name></author><entry><title>Using Git with coding agents</title><link href="https://simonwillison.net/guides/agentic-engineering-patterns/using-git-with-coding-agents/#atom-tag" rel="alternate"/><published>2026-03-21T22:08:24+00:00</published><updated>2026-03-21T22:08:24+00:00</updated><id>https://simonwillison.net/guides/agentic-engineering-patterns/using-git-with-coding-agents/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;em&gt;&lt;a href="https://simonwillison.net/guides/agentic-engineering-patterns/"&gt;Agentic Engineering Patterns&lt;/a&gt; &amp;gt;&lt;/em&gt;&lt;/p&gt;
    &lt;p&gt;Git is a key tool for working with coding agents. Keeping code in version control lets us record how that code changes over time and investigate and reverse any mistakes. All of the coding agents are fluent in using Git's features, both basic and advanced.&lt;/p&gt;
&lt;p&gt;This fluency means we can be more ambitious about how we use Git ourselves. We don't need to memorize &lt;em&gt;how&lt;/em&gt; to do things with Git, but staying aware of what's possible means we can take advantage of the full suite of Git's abilities.&lt;/p&gt;
&lt;h2 id="git-essentials"&gt;Git essentials&lt;/h2&gt;
&lt;p&gt;Each Git project lives in a &lt;strong&gt;repository&lt;/strong&gt; - a folder on disk that can track changes made to the files within it. Those changes are recorded in &lt;strong&gt;commits&lt;/strong&gt; - timestamped bundles of changes to one or more files accompanied by a &lt;strong&gt;commit message&lt;/strong&gt; describing those changes and an &lt;strong&gt;author&lt;/strong&gt; recording who made them.&lt;/p&gt;
&lt;p&gt;Git supports &lt;strong&gt;branches&lt;/strong&gt;, which allow you to construct and experiment with new changes independently of each other. Branches can then be &lt;strong&gt;merged&lt;/strong&gt; back into your main branch (using various methods) once they are deemed ready.&lt;/p&gt;
&lt;p&gt;Git repositories can be &lt;strong&gt;cloned&lt;/strong&gt; onto a new machine, and that clone includes both the current files and the full history of changes to them.
This means developers - or coding agents - can browse and explore that history without any extra network traffic, making history diving effectively free.&lt;/p&gt;
&lt;p&gt;Git repositories can live just on your own machine, but Git is designed to support collaboration and backups by publishing them to a &lt;strong&gt;remote&lt;/strong&gt;, which can be public or private. GitHub is the most popular host for these remotes, but because Git is open source software you can host a remote on any machine or service that speaks the Git protocol.&lt;/p&gt;
&lt;h2 id="core-concepts-and-prompts"&gt;Core concepts and prompts&lt;/h2&gt;
&lt;p&gt;Coding agents all have a deep understanding of Git jargon. The following prompts should work with any of them:&lt;/p&gt;
&lt;p&gt;&lt;div&gt;&lt;markdown-copy&gt;&lt;textarea&gt;Start a new Git repo here&lt;/textarea&gt;&lt;/markdown-copy&gt;&lt;/div&gt;
To turn the folder the agent is working in into a Git repository - the agent will probably run the &lt;code&gt;git init&lt;/code&gt; command. If you just say "repo", agents will assume you mean a Git repository.&lt;/p&gt;
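&lt;p&gt;Here's a rough sketch of what that expands to, runnable in a scratch directory (the README content is a hypothetical stand-in for your files):&lt;/p&gt;

```shell
cd "$(mktemp -d)"                     # scratch directory standing in for your project
git init                              # create the hidden .git/ directory
git config user.email "you@example.com"
git config user.name "Your Name"
echo "# My project" | tee README.md   # hypothetical first file
git add -A                            # stage everything in the folder
git commit -m "Initial commit"        # record the first commit
```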
&lt;p&gt;&lt;div&gt;&lt;markdown-copy&gt;&lt;textarea&gt;Commit these changes&lt;/textarea&gt;&lt;/markdown-copy&gt;&lt;/div&gt;
Create a new Git commit to record the changes the agent has made - usually with the &lt;code&gt;git commit -m "commit message"&lt;/code&gt; command.&lt;/p&gt;
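&lt;p&gt;Under the hood that usually looks something like this sketch (the file and the message are hypothetical):&lt;/p&gt;

```shell
cd "$(mktemp -d)"                    # scratch repo for the demo
git init -q
git config user.email "you@example.com"
git config user.name "Your Name"
echo "print('fixed')" | tee app.py   # stands in for the agent's edits
git status --short                   # agents usually check what changed first
git add app.py                       # stage just the modified files
git commit -m "Fix the reported bug in app.py"
git log --oneline -1                 # confirm the commit landed
```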
&lt;p&gt;&lt;div&gt;&lt;markdown-copy&gt;&lt;textarea&gt;Add username/repo as a github remote&lt;/textarea&gt;&lt;/markdown-copy&gt;&lt;/div&gt;
This should configure your repository for GitHub. You'll need to create a new repo first using &lt;a href="https://github.com/new"&gt;github.com/new&lt;/a&gt;, and configure your machine to talk to GitHub.&lt;/p&gt;
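&lt;p&gt;The manual equivalent is a sketch like this - substitute your own username/repo, and note that the actual push needs working GitHub credentials on your machine:&lt;/p&gt;

```shell
cd "$(mktemp -d)"
git init -q
# Hypothetical remote URL - substitute your own username/repo:
git remote add origin git@github.com:username/repo.git
git remote -v               # confirm the remote is configured
# git push -u origin main   # first push - requires GitHub auth to be set up
```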
&lt;p&gt;&lt;div&gt;&lt;markdown-copy&gt;&lt;textarea&gt;Review changes made today&lt;/textarea&gt;&lt;/markdown-copy&gt;&lt;/div&gt;
Or "recent changes" or "last three commits".&lt;/p&gt;
&lt;p&gt;This is a great way to start a fresh coding agent session. Telling the agent to look at recent changes causes it to run &lt;code&gt;git log&lt;/code&gt;, which can instantly load its context with details of what you have been working on recently - both the modified code and the commit messages that describe it.&lt;/p&gt;
&lt;p&gt;Seeding the session in this way means you can start talking about that code - suggest additional fixes, ask questions about how it works, or propose the next change that builds on what came before.&lt;/p&gt;
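&lt;p&gt;The history commands the agent reaches for look something like this sketch, shown here against a hypothetical scratch repo with four commits:&lt;/p&gt;

```shell
cd "$(mktemp -d)"                    # build a toy repo with four commits
git init -q
git config user.email "you@example.com"
git config user.name "Your Name"
for n in 1 2 3 4; do
  echo "change $n" | tee "file$n.txt"
  git add .
  git commit -qm "Change $n"
done
git log --oneline -10                # recent commit messages
git diff --stat HEAD~3               # what changed across the last three commits
git log --since=midnight --oneline   # commits made today
```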
&lt;p&gt;&lt;div&gt;&lt;markdown-copy&gt;&lt;textarea&gt;Integrate latest changes from main&lt;/textarea&gt;&lt;/markdown-copy&gt;&lt;/div&gt;
Run this on your main branch to fetch other contributions from the remote repository, or run it in a branch to integrate the latest changes on main.&lt;/p&gt;
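&lt;p&gt;As a sketch, here are the two most common options played out in a hypothetical scratch repo (on your actual main branch, the first step would usually be a &lt;code&gt;git pull&lt;/code&gt;):&lt;/p&gt;

```shell
cd "$(mktemp -d)"
git init -qb main
git config user.email "you@example.com"
git config user.name "Your Name"
echo base | tee f.txt; git add .; git commit -qm "Base"
git switch -qc feature                        # start a feature branch
echo feature-work | tee g.txt; git add .; git commit -qm "Feature work"
git switch -q main                            # meanwhile, main moves on
echo main-work | tee h.txt; git add .; git commit -qm "More work on main"
git switch -q feature
git merge main -m "Merge main into feature"   # option 1: a merge commit
# git rebase main                             # option 2: replay this branch on top of main
```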
&lt;p&gt;There are multiple ways to merge changes, including a merge commit, a rebase, a squash or a fast-forward. If you can't remember the details of these, that's fine:
&lt;div&gt;&lt;markdown-copy&gt;&lt;textarea&gt;Discuss options for integrating changes from main&lt;/textarea&gt;&lt;/markdown-copy&gt;&lt;/div&gt;
Agents are great at explaining the pros and cons of different merging strategies, and almost anything in Git can be undone, so there's minimal risk in trying new things.
&lt;div&gt;&lt;markdown-copy&gt;&lt;textarea&gt;Sort out this git mess for me&lt;/textarea&gt;&lt;/markdown-copy&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;I use this universal prompt surprisingly often! Here's &lt;a href="https://gisthost.github.io/?2aa2ee2fbd08d272528bbfc3b54a1a7d/page-001.html"&gt;a recent example&lt;/a&gt; where it fixed a cherry-pick for me that failed with a merge conflict.&lt;/p&gt;
&lt;p&gt;There are plenty of ways you can get into a mess with Git, often through pulls or rebases that end in a merge conflict, or just through adding the wrong things to Git's staging area.&lt;/p&gt;
&lt;p&gt;Unpicking those used to be one of the most difficult and time-consuming parts of working with Git. No more! Coding agents can navigate the most Byzantine of merge conflicts, reasoning through the intent of the new code and figuring out what to keep and how to combine conflicting changes. If your code has automated tests (and &lt;a href="https://simonwillison.net/guides/agentic-engineering-patterns/red-green-tdd/"&gt;it should&lt;/a&gt;) the agent can ensure those pass before finalizing the merge.&lt;/p&gt;
&lt;p&gt;&lt;div&gt;&lt;markdown-copy&gt;&lt;textarea&gt;Find and recover my code that does ...&lt;/textarea&gt;&lt;/markdown-copy&gt;&lt;/div&gt;
If you lose code you were working on that was previously committed (or saved with &lt;code&gt;git stash&lt;/code&gt;), your agent can probably find it for you.&lt;/p&gt;
&lt;p&gt;Git has a mechanism called the &lt;code&gt;reflog&lt;/code&gt; which can often capture details of code that hasn't been committed to a permanent branch. Agents can search that, and search other branches too.&lt;/p&gt;
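&lt;p&gt;The places an agent typically looks are sketched below, against a hypothetical scratch repo (the function name is made up):&lt;/p&gt;

```shell
cd "$(mktemp -d)"
git init -q
git config user.email "you@example.com"
git config user.name "Your Name"
echo "def some_function(): pass" | tee lib.py   # hypothetical "lost" code
git add .
git commit -qm "Add some_function"
git reflog                                   # every recent position of HEAD
git stash list                               # anything parked with git stash
git log --all -S "some_function" --oneline   # commits anywhere touching that string
```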
&lt;p&gt;Just tell them what to find and watch them dive in.&lt;/p&gt;
&lt;p&gt;&lt;div&gt;&lt;markdown-copy&gt;&lt;textarea&gt;Use git bisect to find when this bug was introduced: ...&lt;/textarea&gt;&lt;/markdown-copy&gt;&lt;/div&gt;
Git bisect is one of the most powerful debugging tools in Git's arsenal, but its relatively steep learning curve often deters developers from using it.&lt;/p&gt;
&lt;p&gt;When you run a bisect operation you provide Git with some kind of test condition plus a known-good commit and a known-bad commit. Git then runs a binary search to identify the earliest commit at which your test condition fails.&lt;/p&gt;
&lt;p&gt;This can efficiently answer the question "what first caused this bug?" The only downside is the need to express the test for the bug in a format that Git bisect can execute.&lt;/p&gt;
&lt;p&gt;Coding agents can handle this boilerplate for you. This upgrades Git bisect from an occasional-use tool to one you can deploy any time you are curious about the historical behavior of your software.&lt;/p&gt;
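&lt;p&gt;Here's a self-contained sketch of the mechanics, using a toy history where the sixth of ten commits introduces the "bug":&lt;/p&gt;

```shell
cd "$(mktemp -d)"
git init -q
git config user.email "you@example.com"
git config user.name "Your Name"
# Build ten commits; from commit 6 onwards status.txt says "bug":
for n in 1 2 3 4 5 6 7 8 9 10; do
  if [ "$n" -ge 6 ]; then echo "bug $n" | tee status.txt; else echo "ok $n" | tee status.txt; fi
  git add .
  git commit -qm "Commit $n"
done
# Bad commit first, then a known-good commit (here: the very first one):
git bisect start HEAD "$(git rev-list --max-parents=0 HEAD)"
# Any command that exits 0 on good commits and non-zero on bad ones works:
git bisect run grep -q ok status.txt   # binary-searches to "Commit 6"
git bisect reset
```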
&lt;h2 id="rewriting-history"&gt;Rewriting history&lt;/h2&gt;
&lt;p&gt;Let's get into the fun advanced stuff.&lt;/p&gt;
&lt;p&gt;The commit history of a Git repository is not fixed. The data is just files on disk after all (tucked away in a hidden &lt;code&gt;.git/&lt;/code&gt; directory), and Git itself provides tools that can be used to modify that history.&lt;/p&gt;
&lt;p&gt;Don't think of the Git history as a permanent record of what actually happened - instead consider it to be a deliberately authored story that describes the progression of the software project.&lt;/p&gt;
&lt;p&gt;This story is a tool to aid future development. Permanently recording mistakes and cancelled directions can sometimes be useful, but repository authors can make editorial decisions about what to keep and how best to capture that history.&lt;/p&gt;
&lt;p&gt;Coding agents are really good at using Git's advanced history rewriting features.&lt;/p&gt;
&lt;h3 id="undo-or-rewrite-commits"&gt;Undo or rewrite commits&lt;/h3&gt;
&lt;p&gt;&lt;div&gt;&lt;markdown-copy&gt;&lt;textarea&gt;Undo last commit&lt;/textarea&gt;&lt;/markdown-copy&gt;&lt;/div&gt;
It's common to commit code and then regret it - realize that it includes a file you didn't mean to include, for example. The git recipe for this is &lt;code&gt;git reset --soft HEAD~1&lt;/code&gt;. I've never been able to remember that, and now I don't have to!&lt;/p&gt;
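&lt;p&gt;A sketch of that recipe in action, in a hypothetical scratch repo:&lt;/p&gt;

```shell
cd "$(mktemp -d)"
git init -q
git config user.email "you@example.com"
git config user.name "Your Name"
echo good | tee f.txt; git add .; git commit -qm "Good commit"
echo oops | tee secret.txt; git add .; git commit -qm "Committed too much"
git reset --soft HEAD~1   # rewind the commit but keep the changes staged
git status --short        # secret.txt is back in the staging area - nothing lost
# git reset --hard HEAD~1 would discard the changes entirely - use with care
```

&lt;p&gt;If the commit you are undoing has already been pushed, rewriting it means your next push will need to be a force-push.&lt;/p&gt;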
&lt;p&gt;&lt;div&gt;&lt;markdown-copy&gt;&lt;textarea&gt;Remove uv.lock from that last commit&lt;/textarea&gt;&lt;/markdown-copy&gt;&lt;/div&gt;
You can also perform finer-grained surgery on commits - rewriting them to remove just a single file, for example.&lt;/p&gt;
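&lt;p&gt;One way an agent might do that particular piece of surgery (a sketch, with hypothetical files):&lt;/p&gt;

```shell
cd "$(mktemp -d)"
git init -q
git config user.email "you@example.com"
git config user.name "Your Name"
echo code | tee app.py
echo lock | tee uv.lock
git add .
git commit -qm "Add feature"
git rm --cached uv.lock        # stage the removal without deleting it from disk
git commit --amend --no-edit   # rewrite the last commit without uv.lock
git show --stat HEAD           # confirm only app.py remains in the commit
```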
&lt;p&gt;&lt;div&gt;&lt;markdown-copy&gt;&lt;textarea&gt;Combine last three commits with a better commit message&lt;/textarea&gt;&lt;/markdown-copy&gt;&lt;/div&gt;
Agents can rewrite commit messages and can combine multiple commits into a single unit.&lt;/p&gt;
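&lt;p&gt;A minimal sketch of one way to do that squash (the commit messages are hypothetical):&lt;/p&gt;

```shell
cd "$(mktemp -d)"
git init -q
git config user.email "you@example.com"
git config user.name "Your Name"
git commit -q --allow-empty -m "Base"
for n in 1 2 3; do
  echo "$n" | tee "part$n.txt"
  git add .
  git commit -qm "WIP $n"
done
git reset --soft HEAD~3   # rewind past the three WIP commits, keeping their changes staged
git commit -m "Add the feature as one tidy commit"
git log --oneline         # now just two commits: Base and the combined one
```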
&lt;p&gt;I've found that frontier models usually have really good taste in commit messages. I used to insist on writing these myself, but I've accepted that what they produce is generally good enough - often better than what I would have written myself.&lt;/p&gt;
&lt;h3 id="building-a-new-repository-from-scraps-of-an-older-one"&gt;Building a new repository from scraps of an older one&lt;/h3&gt;
&lt;p&gt;A trick I find myself using quite often is extracting code from a larger repository into a new one while maintaining the key history of that code.&lt;/p&gt;
&lt;p&gt;One common example is library extraction. I may have built some classes and functions into a project and later realized they would make more sense as a standalone reusable code library.&lt;/p&gt;
&lt;p&gt;&lt;div&gt;&lt;markdown-copy&gt;&lt;textarea&gt;Start a new repo at /tmp/distance-functions and build a Python library there with the lib/distance_functions.py module from here - build a similar commit history copying the author and commit dates in the new repo&lt;/textarea&gt;&lt;/markdown-copy&gt;&lt;/div&gt;
This kind of operation used to be involved enough that most developers would create a fresh copy detached from that old commit history. We don't have to settle for that any more!&lt;/p&gt;
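&lt;p&gt;The core trick for preserving authorship and dates is that Git lets you set both when creating a commit. A sketch of a single replayed commit, with hypothetical names and values:&lt;/p&gt;

```shell
cd "$(mktemp -d)"
git init -q
# Setting the repo identity stands in for the original commit's author here:
git config user.name "Original Author"
git config user.email "orig@example.com"
mkdir -p lib
echo "def distance(a, b): return abs(a - b)" | tee lib/distance_functions.py
git add .
# Both the author date and the committer date can be supplied via environment variables:
GIT_AUTHOR_DATE="2023-05-01 12:00:00 +0000" GIT_COMMITTER_DATE="2023-05-01 12:00:00 +0000" \
  git commit -qm "Extract distance functions into their own library"
git log --format="%an %ad" --date=short   # shows the preserved author and date
```

&lt;p&gt;An agent replaying history would repeat this for each original commit touching the file, oldest first.&lt;/p&gt;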
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/agentic-engineering"&gt;agentic-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="coding-agents"/><category term="generative-ai"/><category term="github"/><category term="agentic-engineering"/><category term="ai"/><category term="git"/><category term="llms"/></entry><entry><title>TIL: Downloading archived Git repositories from archive.softwareheritage.org</title><link href="https://simonwillison.net/2025/Dec/30/software-heritage/#atom-tag" rel="alternate"/><published>2025-12-30T23:51:33+00:00</published><updated>2025-12-30T23:51:33+00:00</updated><id>https://simonwillison.net/2025/Dec/30/software-heritage/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://til.simonwillison.net/github/software-archive-recovery"&gt;TIL: Downloading archived Git repositories from archive.softwareheritage.org&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Back in February I &lt;a href="https://simonwillison.net/2025/Feb/7/sqlite-s3vfs/"&gt;blogged about&lt;/a&gt; a neat Python library called &lt;code&gt;sqlite-s3vfs&lt;/code&gt; for accessing SQLite databases hosted in an S3 bucket, released as MIT licensed open source by the UK government's Department for Business and Trade.&lt;/p&gt;
&lt;p&gt;I went looking for it today and found that the &lt;a href="https://github.com/uktrade/sqlite-s3vfs"&gt;github.com/uktrade/sqlite-s3vfs&lt;/a&gt; repository is now a 404.&lt;/p&gt;
&lt;p&gt;Since this is taxpayer-funded open source software I saw it as my moral duty to try and restore access! It turns out &lt;a href="https://archive.softwareheritage.org/browse/origin/directory/?origin_url=https://github.com/uktrade/sqlite-s3vfs"&gt;a full copy&lt;/a&gt; had been captured by &lt;a href="https://archive.softwareheritage.org/"&gt;the Software Heritage archive&lt;/a&gt;, so I was able to restore  the repository from there. My copy is now archived at &lt;a href="https://github.com/simonw/sqlite-s3vfs"&gt;simonw/sqlite-s3vfs&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The process for retrieving an archive was non-obvious, so I've written up a TIL and also published a new &lt;a href="https://tools.simonwillison.net/software-heritage-repo#https%3A%2F%2Fgithub.com%2Fuktrade%2Fsqlite-s3vfs"&gt;Software Heritage Repository Retriever&lt;/a&gt; tool which takes advantage of the CORS-enabled APIs provided by Software Heritage. Here's &lt;a href="https://gistpreview.github.io/?3a76a868095c989d159c226b7622b092/index.html"&gt;the Claude Code transcript&lt;/a&gt; from building that.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=46435308#46438857"&gt;Hacker News comment&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/archives"&gt;archives&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tools"&gt;tools&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/til"&gt;til&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-code"&gt;claude-code&lt;/a&gt;&lt;/p&gt;



</summary><category term="archives"/><category term="git"/><category term="github"/><category term="open-source"/><category term="tools"/><category term="ai"/><category term="til"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="claude-code"/></entry><entry><title>simonw/git-scraper-template</title><link href="https://simonwillison.net/2025/Feb/26/git-scraper-template/#atom-tag" rel="alternate"/><published>2025-02-26T05:34:05+00:00</published><updated>2025-02-26T05:34:05+00:00</updated><id>https://simonwillison.net/2025/Feb/26/git-scraper-template/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/git-scraper-template"&gt;simonw/git-scraper-template&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I built this new GitHub template repository in preparation for a workshop I'm giving at &lt;a href="https://www.ire.org/training/conferences/nicar-2025/"&gt;NICAR&lt;/a&gt; (the data journalism conference) next week on &lt;a href="https://github.com/simonw/nicar-2025-scraping/"&gt;Cutting-edge web scraping techniques&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;One of the topics I'll be covering is &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt; - creating a GitHub repository that uses scheduled GitHub Actions workflows to grab copies of websites and data feeds and store their changes over time using Git.&lt;/p&gt;
&lt;p&gt;This template repository is designed to be the fastest possible way to get started with a new Git scraper: simply &lt;a href="https://github.com/new?template_name=git-scraper-template&amp;amp;template_owner=simonw"&gt;create a new repository from the template&lt;/a&gt;, paste the URL you want to scrape into the &lt;strong&gt;description&lt;/strong&gt; field and the repository will be initialized with a custom script that scrapes and stores that URL.&lt;/p&gt;
&lt;p&gt;It's modeled after my earlier &lt;a href="https://github.com/simonw/shot-scraper-template"&gt;shot-scraper-template&lt;/a&gt; tool which I described in detail in &lt;a href="https://simonwillison.net/2022/Mar/14/shot-scraper-template/"&gt;Instantly create a GitHub repository to take screenshots of a web page&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The new &lt;code&gt;git-scraper-template&lt;/code&gt; repo took &lt;a href="https://github.com/simonw/git-scraper-template/issues/2#issuecomment-2683871054"&gt;some help from Claude&lt;/a&gt; to figure out. It uses a &lt;a href="https://github.com/simonw/git-scraper-template/blob/a2b12972584099d7c793ee4b38303d94792bf0f0/download.sh"&gt;custom script&lt;/a&gt; to download the provided URL and derive a filename to use based on the URL and the content type, detected using &lt;code&gt;file --mime-type -b "$file_path"&lt;/code&gt; against the downloaded file.&lt;/p&gt;
&lt;p&gt;It also detects if the downloaded content is JSON and, if it is, pretty-prints it using &lt;code&gt;jq&lt;/code&gt; - I find this is a quick way to generate much more useful diffs when the content changes.&lt;/p&gt;
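&lt;p&gt;A simplified sketch of that detection step (assuming &lt;code&gt;file&lt;/code&gt; and &lt;code&gt;jq&lt;/code&gt; are installed; the file name and content are hypothetical):&lt;/p&gt;

```shell
cd "$(mktemp -d)"
file_path="data"
printf '{"a": 1, "b": [2, 3]}' | tee "$file_path"
mime="$(file --mime-type -b "$file_path")"   # e.g. application/json
echo "$mime"
if [ "$mime" = "application/json" ]; then
  jq . "$file_path" | tee pretty.tmp         # pretty-print for nicer diffs
  mv pretty.tmp "$file_path"                 # replace the original in place
fi
cat "$file_path"
```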


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nicar"&gt;nicar&lt;/a&gt;&lt;/p&gt;



</summary><category term="data-journalism"/><category term="git"/><category term="github"/><category term="projects"/><category term="scraping"/><category term="github-actions"/><category term="git-scraping"/><category term="nicar"/></entry><entry><title>otterwiki</title><link href="https://simonwillison.net/2024/Oct/9/otterwiki/#atom-tag" rel="alternate"/><published>2024-10-09T15:22:04+00:00</published><updated>2024-10-09T15:22:04+00:00</updated><id>https://simonwillison.net/2024/Oct/9/otterwiki/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/redimp/otterwiki"&gt;otterwiki&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;It's been a while since I've seen a new-ish Wiki implementation, and this one by  Ralph Thesen is really nice. It's written in Python (Flask + SQLAlchemy + &lt;a href="https://github.com/lepture/mistune"&gt;mistune&lt;/a&gt; for Markdown + &lt;a href="https://github.com/gitpython-developers/GitPython"&gt;GitPython&lt;/a&gt;) and keeps all of the actual wiki content as Markdown files in a local Git repository.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://otterwiki.com/Installation"&gt;installation instructions&lt;/a&gt; are a little in-depth as they assume a production installation with Docker or systemd - I figured out &lt;a href="https://github.com/redimp/otterwiki/issues/146"&gt;this recipe&lt;/a&gt; for trying it locally using &lt;code&gt;uv&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git clone https://github.com/redimp/otterwiki.git
cd otterwiki

mkdir -p app-data/repository
git init app-data/repository

echo "REPOSITORY='${PWD}/app-data/repository'" &amp;gt;&amp;gt; settings.cfg
echo "SQLALCHEMY_DATABASE_URI='sqlite:///${PWD}/app-data/db.sqlite'" &amp;gt;&amp;gt; settings.cfg
echo "SECRET_KEY='$(echo $RANDOM | md5sum | head -c 16)'" &amp;gt;&amp;gt; settings.cfg

export OTTERWIKI_SETTINGS=$PWD/settings.cfg
uv run --with gunicorn gunicorn --bind 127.0.0.1:8080 otterwiki.server:app
&lt;/code&gt;&lt;/pre&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=41749680"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/flask"&gt;flask&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlalchemy"&gt;sqlalchemy&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/markdown"&gt;markdown&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/wikis"&gt;wikis&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;&lt;/p&gt;



</summary><category term="flask"/><category term="git"/><category term="python"/><category term="sqlalchemy"/><category term="sqlite"/><category term="markdown"/><category term="wikis"/><category term="uv"/></entry><entry><title>Why GitHub Actually Won</title><link href="https://simonwillison.net/2024/Sep/9/why-github-actually-won/#atom-tag" rel="alternate"/><published>2024-09-09T17:16:22+00:00</published><updated>2024-09-09T17:16:22+00:00</updated><id>https://simonwillison.net/2024/Sep/9/why-github-actually-won/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.gitbutler.com/why-github-actually-won/"&gt;Why GitHub Actually Won&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;GitHub co-founder Scott Chacon shares some thoughts on how GitHub won the open source code hosting market. Shortened to two words: timing, and taste.&lt;/p&gt;
&lt;p&gt;There are some interesting numbers in here. I hadn't realized that when GitHub launched in 2008 the term "open source" had only been coined ten years earlier, in 1998. &lt;a href="https://dirkriehle.com/publications/2008-selected/the-total-growth-of-open-source/comment-page-1/"&gt;This paper&lt;/a&gt; by Dirk Riehle estimates there were 18,000 open source projects in 2008 - Scott points out that today there are over 280 million public repositories on GitHub alone.&lt;/p&gt;
&lt;p&gt;Scott's conclusion:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We were there when a new paradigm was being born and we approached the problem of helping people embrace that new paradigm with a developer experience centric approach that nobody else had the capacity for or interest in.&lt;/p&gt;
&lt;/blockquote&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=41490161"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;&lt;/p&gt;



</summary><category term="git"/><category term="github"/><category term="open-source"/></entry><entry><title>AI-powered Git Commit Function</title><link href="https://simonwillison.net/2024/Aug/26/ai-powered-git-commit-function/#atom-tag" rel="alternate"/><published>2024-08-26T01:06:59+00:00</published><updated>2024-08-26T01:06:59+00:00</updated><id>https://simonwillison.net/2024/Aug/26/ai-powered-git-commit-function/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://gist.github.com/karpathy/1dd0294ef9567971c1e4348a90d69285"&gt;AI-powered Git Commit Function&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Andrej Karpathy built a shell alias, &lt;code&gt;gcm&lt;/code&gt;, which passes your staged Git changes to an LLM via my &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; tool, generates a short commit message and then asks you if you want to "(a)ccept, (e)dit, (r)egenerate, or (c)ancel?".&lt;/p&gt;
&lt;p&gt;Here's the incantation he's using to generate that commit message:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;git diff --cached &lt;span class="pl-k"&gt;|&lt;/span&gt; llm &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;Below is a diff of all staged changes, coming from the command:&lt;/span&gt;
&lt;span class="pl-s"&gt;\`\`\`&lt;/span&gt;
&lt;span class="pl-s"&gt;git diff --cached&lt;/span&gt;
&lt;span class="pl-s"&gt;\`\`\`&lt;/span&gt;
&lt;span class="pl-s"&gt;Please generate a concise, one-line commit message for these changes.&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This pipes the data into LLM (using the default model, currently &lt;code&gt;gpt-4o-mini&lt;/code&gt; unless you &lt;a href="https://llm.datasette.io/en/stable/setup.html#setting-a-custom-default-model"&gt;set it to something else&lt;/a&gt;) and then appends the prompt telling it what to do with that input.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/karpathy/status/1827810695658029262"&gt;@karpathy&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/andrej-karpathy"&gt;andrej-karpathy&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-engineering"&gt;prompt-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;&lt;/p&gt;



</summary><category term="git"/><category term="ai"/><category term="andrej-karpathy"/><category term="prompt-engineering"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="llm"/></entry><entry><title>EpicEnv</title><link href="https://simonwillison.net/2024/Aug/3/epicenv/#atom-tag" rel="alternate"/><published>2024-08-03T00:31:33+00:00</published><updated>2024-08-03T00:31:33+00:00</updated><id>https://simonwillison.net/2024/Aug/3/epicenv/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/danthegoodman1/EpicEnv"&gt;EpicEnv&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Dan Goodman's tool for managing shared secrets via a Git repository. This uses a really neat trick: you can run &lt;code&gt;epicenv invite githubuser&lt;/code&gt; and the tool will retrieve that user's public key from &lt;code&gt;github.com/{username}.keys&lt;/code&gt; (&lt;a href="https://github.com/simonw.keys"&gt;here's mine&lt;/a&gt;) and use that to encrypt the secrets such that the user can decrypt them with their private key.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://lobste.rs/s/gruxeg/epicenv_local_environment_variable"&gt;lobste.rs&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/encryption"&gt;encryption&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;&lt;/p&gt;



</summary><category term="encryption"/><category term="git"/></entry><entry><title>1991-WWW-NeXT-Implementation on GitHub</title><link href="https://simonwillison.net/2024/Aug/1/www-next-implementation-on-github/#atom-tag" rel="alternate"/><published>2024-08-01T21:15:29+00:00</published><updated>2024-08-01T21:15:29+00:00</updated><id>https://simonwillison.net/2024/Aug/1/www-next-implementation-on-github/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/1991-WWW-NeXT-Implementation"&gt;1991-WWW-NeXT-Implementation on GitHub&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I fell down a bit of a rabbit hole today trying to answer &lt;a href="https://simonwillison.net/2024/Aug/1/august-1st-world-wide-web-day/"&gt;that question about when World Wide Web Day was first celebrated&lt;/a&gt;. I found my way to &lt;a href="https://www.w3.org/History/1991-WWW-NeXT/Implementation/"&gt;www.w3.org/History/1991-WWW-NeXT/Implementation/&lt;/a&gt; - an Apache directory listing of the source code for Tim Berners-Lee's original WorldWideWeb application for NeXT!&lt;/p&gt;
&lt;p&gt;The code wasn't particularly easy to browse: clicking a &lt;code&gt;.m&lt;/code&gt; file would trigger a download rather than showing the code in the browser, and there were no niceties like syntax highlighting.&lt;/p&gt;
&lt;p&gt;So I decided to mirror that code to a &lt;a href="https://github.com/simonw/1991-WWW-NeXT-Implementation"&gt;new repository on GitHub&lt;/a&gt;. I grabbed the code using &lt;code&gt;wget -r&lt;/code&gt; and was delighted to find that the last modified dates (from the early 1990s) were preserved ... which made me want to preserve them in the GitHub repo too.&lt;/p&gt;
&lt;p&gt;I used Claude to write a Python script to back-date those commits, and wrote up what I learned in this new TIL: &lt;a href="https://til.simonwillison.net/git/backdate-git-commits"&gt;Back-dating Git commits based on file modification dates&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;End result: I now have a repo with Tim's original code, plus commit dates that reflect when that code was last modified.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Three commits credited to Tim Berners-Lee, in 1995, 1994 and 1993" src="https://static.simonwillison.net/static/2024/tbl-commits.jpg" /&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/history"&gt;history&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tim-berners-lee"&gt;tim-berners-lee&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/w3c"&gt;w3c&lt;/a&gt;&lt;/p&gt;



</summary><category term="git"/><category term="github"/><category term="history"/><category term="tim-berners-lee"/><category term="w3c"/></entry><entry><title>AWS CodeCommit quietly deprecated</title><link href="https://simonwillison.net/2024/Jul/30/aws-codecommit-quietly-deprecated/#atom-tag" rel="alternate"/><published>2024-07-30T05:51:42+00:00</published><updated>2024-07-30T05:51:42+00:00</updated><id>https://simonwillison.net/2024/Jul/30/aws-codecommit-quietly-deprecated/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://repost.aws/questions/QUshILm0xbTjWJZSD8afYVgA/codecommit-cannot-create-a-repository"&gt;AWS CodeCommit quietly deprecated&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;CodeCommit is AWS's Git hosting service. From a reply by an AWS employee in this forum thread:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Beginning on 06 June 2024, AWS CodeCommit ceased onboarding new customers. Going forward, only customers who have an existing repository in AWS CodeCommit will be able to create additional repositories.&lt;/p&gt;
&lt;p&gt;[...] If you would like to use AWS CodeCommit in a new AWS account that is part of your AWS Organization, please let us know so that we can evaluate the request for allowlisting the new account. If you would like to use an alternative to AWS CodeCommit given this news, we recommend using GitLab, GitHub, or another third party source provider of your choice.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;What's weird about this is that, as far as I can tell, this is the first official public acknowledgement from AWS that CodeCommit is no longer accepting customers. The &lt;a href="https://aws.amazon.com/codecommit/"&gt;CodeCommit landing page&lt;/a&gt; continues to promote the product, though it does link to the &lt;a href="https://aws.amazon.com/blogs/devops/how-to-migrate-your-aws-codecommit-repository-to-another-git-provider/"&gt;How to migrate your AWS CodeCommit repository to another Git provider&lt;/a&gt; blog post from July 25th, which gives no direct indication that CodeCommit is being quietly sunset.&lt;/p&gt;
&lt;p&gt;I wonder how long they'll continue to support their existing customers?&lt;/p&gt;
&lt;h4 id="aws-qldb"&gt;Amazon QLDB too&lt;/h4&gt;

&lt;p&gt;It looks like AWS may be having a bit of a clear-out. &lt;a href="https://aws.amazon.com/qldb/"&gt;Amazon QLDB&lt;/a&gt; - Quantum Ledger Database (a blockchain-adjacent immutable ledger, launched in 2019) - quietly put out a deprecation announcement &lt;a href="https://docs.aws.amazon.com/qldb/latest/developerguide/document-history.html"&gt;in their release history on July 18th&lt;/a&gt; (again, no official announcement elsewhere):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;End of support notice: Existing customers will be able to use Amazon QLDB until end of support on 07/31/2025. For more details, see &lt;a href="https://aws.amazon.com/blogs/database/migrate-an-amazon-qldb-ledger-to-amazon-aurora-postgresql/"&gt;Migrate an Amazon QLDB Ledger to Amazon Aurora PostgreSQL&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This one is more surprising, because migrating to a different Git host is massively less work than entirely rewriting a system to use a fundamentally different database.&lt;/p&gt;
&lt;p&gt;It turns out there's an infrequently updated community GitHub repo called &lt;a href="https://github.com/SummitRoute/aws_breaking_changes"&gt;SummitRoute/aws_breaking_changes&lt;/a&gt; which tracks these kinds of changes. Other services listed there include CodeStar, Cloud9, CloudSearch, OpsWorks, Workdocs and Snowmobile, and they cleverly (ab)use the GitHub releases mechanism to provide an &lt;a href="https://github.com/SummitRoute/aws_breaking_changes/releases.atom"&gt;Atom feed&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=41104997"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/aws"&gt;aws&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/blockchain"&gt;blockchain&lt;/a&gt;&lt;/p&gt;



</summary><category term="aws"/><category term="git"/><category term="blockchain"/></entry><entry><title>Quoting Ed Page</title><link href="https://simonwillison.net/2024/Jul/13/ed-page/#atom-tag" rel="alternate"/><published>2024-07-13T05:28:01+00:00</published><updated>2024-07-13T05:28:01+00:00</updated><id>https://simonwillison.net/2024/Jul/13/ed-page/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://news.ycombinator.com/item?id=40949229#40951540"&gt;&lt;p&gt;Add tests in a commit &lt;em&gt;before&lt;/em&gt; the fix. They should pass, showing the behavior before your change. Then, the commit with your change will update the tests. The diff between these commits represents the change in behavior. This helps the author test their tests (I've written tests thinking they covered the relevant case but didn't), the reviewer to more precisely see the change in behavior and comment on it, and the wider community to understand what the PR description is about.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://news.ycombinator.com/item?id=40949229#40951540"&gt;Ed Page&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;&lt;/p&gt;



</summary><category term="git"/></entry><entry><title>uv pip install --exclude-newer example</title><link href="https://simonwillison.net/2024/May/10/uv-pip-install-exclude-newer/#atom-tag" rel="alternate"/><published>2024-05-10T16:35:40+00:00</published><updated>2024-05-10T16:35:40+00:00</updated><id>https://simonwillison.net/2024/May/10/uv-pip-install-exclude-newer/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/hauntsaninja/typing_extensions/blob/f694a4e2effdd2179f76e886498ffd3446e96b0b/.github/workflows/third_party.yml#L111"&gt;uv pip install --exclude-newer example&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A neat new feature of the &lt;code&gt;uv pip install&lt;/code&gt; command is the &lt;code&gt;--exclude-newer&lt;/code&gt; option, which can be used to avoid installing any package versions released after the specified date.&lt;/p&gt;
&lt;p&gt;Here's a clever example of that in use from the &lt;code&gt;typing_extensions&lt;/code&gt; package's CI tests that run against some downstream packages:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;uv pip install --system -r test-requirements.txt --exclude-newer $(git show -s --date=format:'%Y-%m-%dT%H:%M:%SZ' --format=%cd HEAD)&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;They use &lt;code&gt;git show&lt;/code&gt; to get the date of the most recent commit (&lt;code&gt;%cd&lt;/code&gt; means commit date) formatted as an ISO timestamp, then pass that to &lt;code&gt;--exclude-newer&lt;/code&gt;.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/hauntsaninja/status/1788848732437713171"&gt;@hauntsaninja&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pip"&gt;pip&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/astral"&gt;astral&lt;/a&gt;&lt;/p&gt;



</summary><category term="git"/><category term="pip"/><category term="python"/><category term="uv"/><category term="astral"/></entry><entry><title>Use an llm to automagically generate meaningful git commit messages</title><link href="https://simonwillison.net/2024/Apr/11/use-an-llm-to-automagically-generate-meaningful-git-commit-messa/#atom-tag" rel="alternate"/><published>2024-04-11T04:06:15+00:00</published><updated>2024-04-11T04:06:15+00:00</updated><id>https://simonwillison.net/2024/Apr/11/use-an-llm-to-automagically-generate-meaningful-git-commit-messa/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://harper.blog/2024/03/11/use-an-llm-to-automagically-generate-meaningful-git-commit-messages/"&gt;Use an llm to automagically generate meaningful git commit messages&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Neat, thoroughly documented recipe by Harper Reed that uses my LLM CLI tool as part of a scheme for when you’re feeling too lazy to write a commit message: a &lt;code&gt;prepare-commit-msg&lt;/code&gt; Git hook runs any time you commit without a message and pipes your changes to a model along with a custom system prompt.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;&lt;/p&gt;



</summary><category term="cli"/><category term="git"/><category term="ai"/><category term="generative-ai"/><category term="llm"/></entry><entry><title>How I use git worktrees</title><link href="https://simonwillison.net/2024/Mar/6/how-i-use-git-worktrees/#atom-tag" rel="alternate"/><published>2024-03-06T15:21:31+00:00</published><updated>2024-03-06T15:21:31+00:00</updated><id>https://simonwillison.net/2024/Mar/6/how-i-use-git-worktrees/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://notes.billmill.org/blog/2024/03/How_I_use_git_worktrees.html"&gt;How I use git worktrees&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;TIL about worktrees, a Git feature that lets you have multiple repository branches checked out to separate directories at the same time.&lt;/p&gt;

&lt;p&gt;The default UI for them is a little unergonomic (classic Git) but Bill Mill here shares a neat utility script for managing them in a more convenient way.&lt;/p&gt;

&lt;p&gt;One particularly neat trick: Bill’s “worktree” Bash script checks for a node_modules folder and, if one exists, duplicates it to the new directory using copy-on-write, saving you from having to run yet another lengthy “npm install”.&lt;/p&gt;
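&lt;p&gt;If you haven't tried worktrees, the core commands are worth seeing. This minimal sketch (directory and branch names invented) checks a new branch out into a sibling directory without disturbing the original checkout:&lt;/p&gt;

```shell
# Minimal worktree demo: two branches checked out side by side
mkdir -p demo-wt
git -C demo-wt init -q
git -C demo-wt config user.email demo@example.com
git -C demo-wt config user.name Demo
echo 'hello' > demo-wt/README.md
git -C demo-wt add README.md
git -C demo-wt commit -q -m 'Initial commit'

# Create branch "feature" and check it out into a separate directory
git -C demo-wt worktree add -b feature ../demo-wt-feature

# Both checkouts share one repository; list them
git -C demo-wt worktree list
```

&lt;p&gt;Each worktree gets its own working directory and index, so you can run tests in one while editing in the other.&lt;/p&gt;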

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://lobste.rs/s/ikbbnt/how_i_use_git_worktrees"&gt;lobste.rs&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;&lt;/p&gt;



</summary><category term="git"/></entry><entry><title>Figure out who's leaving the company: dump, diff, repeat</title><link href="https://simonwillison.net/2024/Feb/9/figure-out-whos-leaving-the-company/#atom-tag" rel="alternate"/><published>2024-02-09T05:44:31+00:00</published><updated>2024-02-09T05:44:31+00:00</updated><id>https://simonwillison.net/2024/Feb/9/figure-out-whos-leaving-the-company/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://rachelbythebay.com/w/2024/02/08/ldap/"&gt;Figure out who&amp;#x27;s leaving the company: dump, diff, repeat&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Rachel Kroll describes a neat hack for companies with an internal LDAP server or similar machine-readable employee directory: run a cron somewhere internal that grabs the latest version and diffs it against the previous to figure out who has joined or left the company.&lt;/p&gt;
&lt;p&gt;I suggest using Git for this - a form of Git scraping - as then you get a detailed commit log of changes over time effectively for free.&lt;/p&gt;
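&lt;p&gt;The whole loop fits in a few lines of shell. In this sketch the &lt;code&gt;snapshot&lt;/code&gt; function is a stand-in for the real directory dump (for example an &lt;code&gt;ldapsearch&lt;/code&gt; invocation piped through &lt;code&gt;sort&lt;/code&gt;), and it only commits when the output actually changed:&lt;/p&gt;

```shell
# Sketch of the Git-scraping variant (paths and the snapshot are illustrative)
mkdir -p demo-scrape
git -C demo-scrape init -q
git -C demo-scrape config user.email demo@example.com
git -C demo-scrape config user.name Demo

snapshot() {
    # Stand-in for the real employee-directory dump
    printf 'alice\nbob\n'
}

# The cron job body: dump the directory, commit only if something changed
snapshot > demo-scrape/directory.txt
git -C demo-scrape add directory.txt
git -C demo-scrape diff --quiet --cached 2>/dev/null || \
    git -C demo-scrape commit -q -m "Directory snapshot $(date -u +%F)"
```

&lt;p&gt;Run on a schedule, &lt;code&gt;git log -p directory.txt&lt;/code&gt; becomes the joiners-and-leavers history for free.&lt;/p&gt;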
&lt;p&gt;I really enjoyed Rachel's closing thought:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Incidentally, if someone gets mad about you running this sort of thing, you probably don't want to work there anyway. On the other hand, if you're able to build such tools without IT or similar getting "threatened" by it, then you might be somewhere that actually enjoys creating interesting and useful stuff. Treasure such places. They don't tend to last.&lt;/p&gt;
&lt;/blockquote&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=39311507"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rachel-kroll"&gt;rachel-kroll&lt;/a&gt;&lt;/p&gt;



</summary><category term="git"/><category term="git-scraping"/><category term="rachel-kroll"/></entry><entry><title>Inside .git</title><link href="https://simonwillison.net/2024/Jan/25/inside-git/#atom-tag" rel="alternate"/><published>2024-01-25T14:59:54+00:00</published><updated>2024-01-25T14:59:54+00:00</updated><id>https://simonwillison.net/2024/Jan/25/inside-git/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://wizardzines.com/comics/inside-git/"&gt;Inside .git&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This single diagram filled in all sorts of gaps in my mental model of how git actually works under the hood.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/julia-evans"&gt;julia-evans&lt;/a&gt;&lt;/p&gt;



</summary><category term="git"/><category term="julia-evans"/></entry><entry><title>See the History of a Method with git log -L</title><link href="https://simonwillison.net/2023/Nov/5/see-the-history-of-a-method-with-git-log-l/#atom-tag" rel="alternate"/><published>2023-11-05T20:16:55+00:00</published><updated>2023-11-05T20:16:55+00:00</updated><id>https://simonwillison.net/2023/Nov/5/see-the-history-of-a-method-with-git-log-l/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://calebhearth.com/git-method-history"&gt;See the History of a Method with git log -L&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Neat Git trick from Caleb Hearth that I hadn’t seen before, and it works for Python out of the box:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;git log -L :path_with_format:__init__.py&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;That command displays a log (with diffs) of just the portion of commits that changed the &lt;code&gt;path_with_format&lt;/code&gt; function in the &lt;code&gt;__init__.py&lt;/code&gt; file.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=38153309"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;&lt;/p&gt;



</summary><category term="git"/><category term="python"/></entry><entry><title>Tracking SQLite Database Changes in Git</title><link href="https://simonwillison.net/2023/Nov/1/tracking-sqlite-database-changes-in-git/#atom-tag" rel="alternate"/><published>2023-11-01T18:53:51+00:00</published><updated>2023-11-01T18:53:51+00:00</updated><id>https://simonwillison.net/2023/Nov/1/tracking-sqlite-database-changes-in-git/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://garrit.xyz/posts/2023-11-01-tracking-sqlite-database-changes-in-git"&gt;Tracking SQLite Database Changes in Git&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A neat trick from Garrit Franke that I hadn’t seen before: you can teach &lt;code&gt;git diff&lt;/code&gt; how to display human-readable versions of the differences between binary files with a specific extension using the following:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;git config diff.sqlite3.binary true&lt;/code&gt;&lt;br&gt;&lt;code&gt;git config diff.sqlite3.textconv "echo .dump | sqlite3"&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;That way you can store binary files in your repo but still get back SQL diffs to compare them.&lt;/p&gt;
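&lt;p&gt;One detail worth noting: a textconv driver only applies to files that are mapped to it via &lt;code&gt;.gitattributes&lt;/code&gt;. Here's a runnable sketch of the full setup (file names invented; requires the &lt;code&gt;sqlite3&lt;/code&gt; CLI on your PATH):&lt;/p&gt;

```shell
# SQL-level diffs for binary SQLite files (illustrative names)
mkdir -p demo-db
git -C demo-db init -q
git -C demo-db config user.email demo@example.com
git -C demo-db config user.name Demo

# Map *.db files to the "sqlite3" diff driver, then define the driver
echo '*.db diff=sqlite3' > demo-db/.gitattributes
git -C demo-db config diff.sqlite3.binary true
git -C demo-db config diff.sqlite3.textconv 'echo .dump | sqlite3'

sqlite3 demo-db/data.db 'CREATE TABLE t (id INTEGER); INSERT INTO t VALUES (1);'
git -C demo-db add -A
git -C demo-db commit -q -m 'Initial database'

sqlite3 demo-db/data.db 'INSERT INTO t VALUES (2);'
# Shows an SQL diff rather than "Binary files differ"
git -C demo-db diff
```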

&lt;p&gt;I still worry about the efficiency of storing binary files in Git, since I expect multiple versions of a text file to compress together better.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://lobste.rs/s/gnv9ho/tracking_sqlite_database_changes_git"&gt;lobste.rs&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;&lt;/p&gt;



</summary><category term="git"/><category term="sqlite"/></entry><entry><title>The Perfect Commit</title><link href="https://simonwillison.net/2022/Oct/29/the-perfect-commit/#atom-tag" rel="alternate"/><published>2022-10-29T20:41:01+00:00</published><updated>2022-10-29T20:41:01+00:00</updated><id>https://simonwillison.net/2022/Oct/29/the-perfect-commit/#atom-tag</id><summary type="html">
    &lt;p&gt;For the last few years I've been trying to center my work around creating what I consider to be the &lt;em&gt;Perfect Commit&lt;/em&gt;. This is a single commit that contains all of the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;implementation&lt;/strong&gt;: a single, focused change&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tests&lt;/strong&gt; that demonstrate the implementation works&lt;/li&gt;
&lt;li&gt;Updated &lt;strong&gt;documentation&lt;/strong&gt; reflecting the change&lt;/li&gt;
&lt;li&gt;A link to an &lt;strong&gt;issue thread&lt;/strong&gt; providing further context&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Our job as software engineers generally isn't to write new software from scratch: we spend the majority of our time adding features and fixing bugs in existing software.&lt;/p&gt;
&lt;p&gt;The commit is our principal unit of work. It deserves to be treated thoughtfully and with care.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update 26th November 2022&lt;/strong&gt;: My 25 minute talk &lt;a href="https://simonwillison.net/2022/Nov/26/productivity/"&gt;Massively increase your productivity on personal projects with comprehensive documentation and automated tests&lt;/a&gt; describes this approach to software development in detail.&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2022/Oct/29/the-perfect-commit/#implementation"&gt;Implementation&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2022/Oct/29/the-perfect-commit/#tests"&gt;Tests&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2022/Oct/29/the-perfect-commit/#documentation"&gt;Documentation&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2022/Oct/29/the-perfect-commit/#link-to-an-issue"&gt;A link to an issue&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2022/Oct/29/the-perfect-commit/#issue-over-commit-message"&gt;An issue is more valuable than a commit message&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2022/Oct/29/the-perfect-commit/#not-all-perfect"&gt;Not every commit needs to be "perfect"&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2022/Oct/29/the-perfect-commit/#scrappy-branches"&gt;Write scrappy commits in a branch&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2022/Oct/29/the-perfect-commit/#examples"&gt;Some examples&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="implementation"&gt;Implementation&lt;/h4&gt;
&lt;p&gt;Each commit should change a single thing.&lt;/p&gt;
&lt;p&gt;The definition of "thing" here is left deliberately vague!&lt;/p&gt;
&lt;p&gt;The goal is to have something that can be easily reviewed, and that can be clearly understood in the future when revisited using tools like &lt;code&gt;git blame&lt;/code&gt; or &lt;a href="https://til.simonwillison.net/git/git-bisect"&gt;git bisect&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I like to keep my commit history linear, as I find that makes it much easier to comprehend later. This further reinforces the value of each commit being a single, focused change.&lt;/p&gt;
&lt;p&gt;Atomic commits are also much easier to cleanly revert if something goes wrong - or to cherry-pick into other branches.&lt;/p&gt;
&lt;p&gt;For things like web applications that can be deployed to production, a commit should be a unit that can be deployed. Aiming to keep the main branch in a deployable state is a good rule of thumb for deciding if a commit is a sensible atomic change or not.&lt;/p&gt;
&lt;h4 id="tests"&gt;Tests&lt;/h4&gt;
&lt;p&gt;The ultimate goal of tests is to &lt;em&gt;increase&lt;/em&gt; your productivity. If your testing practices are slowing you down, you should consider ways to improve them.&lt;/p&gt;
&lt;p&gt;In the longer term, this productivity improvement comes from gaining the freedom to make changes and stay confident that your change hasn't broken something else.&lt;/p&gt;
&lt;p&gt;But tests can help increase productivity in the immediate short term as well.&lt;/p&gt;
&lt;p&gt;How do you know when the change you have made is finished and ready to commit? It's ready when the new tests pass.&lt;/p&gt;
&lt;p&gt;I find this reduces the time I spend second-guessing myself and questioning whether I've done enough and thought through all of the edge cases.&lt;/p&gt;
&lt;p&gt;Without tests, there's a very strong possibility that your change will have broken some other, potentially unrelated feature. Your commit could be held up by hours of tedious manual testing. Or you could &lt;abbr title="You Only Live Once"&gt;YOLO&lt;/abbr&gt; it and learn that you broke something important later!&lt;/p&gt;
&lt;p&gt;Writing tests becomes far less time consuming if you already have good testing practices in place.&lt;/p&gt;
&lt;p&gt;Adding a new test to a project with a lot of existing tests is easy: you can often find an existing test that has 90% of the pattern you need already worked out for you.&lt;/p&gt;
&lt;p&gt;If your project has no tests at all, adding a test for your change will be a lot more work.&lt;/p&gt;
&lt;p&gt;This is why I start every single one of my projects with a passing test. It doesn't matter what this test is - &lt;code&gt;assert 1 + 1 == 2&lt;/code&gt; is fine! The key thing is to get a testing framework in place, such that you can run a command (for me that's usually &lt;code&gt;pytest&lt;/code&gt;) to execute the test suite - and you have an obvious place to add new tests in the future.&lt;/p&gt;
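&lt;p&gt;To make that concrete, here's what that starting point can look like - a sketch with an invented project name:&lt;/p&gt;

```shell
# The entire initial test suite for a new project. The assertion is
# trivial on purpose: the point is that the harness exists and future
# tests have an obvious home.
mkdir -p demo-proj/tests
printf 'def test_sanity():\n    assert 1 + 1 == 2\n' \
    > demo-proj/tests/test_sanity.py
# From inside demo-proj, running `pytest` now reports one passing test
```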
&lt;p&gt;I use &lt;a href="https://simonwillison.net/2021/Aug/28/dynamic-github-repository-templates/"&gt;these cookiecutter templates&lt;/a&gt; for almost all of my new projects. They configure a testing framework with a single passing test and GitHub Actions workflows to exercise it all from the very start.&lt;/p&gt;
&lt;p&gt;I'm not a huge advocate of test-first development, where tests are written before the code itself. What I care about is tests-included development, where the final commit bundles the tests and the implementation together. I wrote more about my approach to testing in &lt;a href="https://simonwillison.net/2020/Feb/11/cheating-at-unit-tests-pytest-black/"&gt;How to cheat at unit tests with pytest and Black&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="documentation"&gt;Documentation&lt;/h4&gt;
&lt;p&gt;If your project defines APIs that are meant to be used outside of your project, they need to be documented. In my work these projects are usually one of the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Python APIs (modules, functions and classes) that provide code designed to be imported into other projects.&lt;/li&gt;
&lt;li&gt;Web APIs - usually JSON over HTTP these days - that provide functionality to be consumed by other applications.&lt;/li&gt;
&lt;li&gt;Command line interface tools, such as those implemented using &lt;a href="https://click.palletsprojects.com/"&gt;Click&lt;/a&gt; or &lt;a href="https://typer.tiangolo.com/"&gt;Typer&lt;/a&gt; or &lt;a href="https://docs.python.org/3/library/argparse.html"&gt;argparse&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It is critical that this documentation &lt;strong&gt;lives in the same repository as the code itself&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;This is important for a number of reasons.&lt;/p&gt;
&lt;p&gt;Documentation is only valuable &lt;strong&gt;if people trust it&lt;/strong&gt;. People will only trust it if they know that it is kept up to date.&lt;/p&gt;
&lt;p&gt;If your docs live in a separate wiki somewhere it's easy for them to get out of date - but more importantly it's hard for anyone to quickly confirm if the documentation is being updated in sync with the code or not.&lt;/p&gt;
&lt;p&gt;Documentation should be &lt;strong&gt;versioned&lt;/strong&gt;. People need to be able to find the docs for the specific version of your software that they are using. Keeping it in the same repository as the code gives you synchronized versioning for free.&lt;/p&gt;
&lt;p&gt;Documentation changes should be &lt;strong&gt;reviewed&lt;/strong&gt; in the same way as your code. If they live in the same repository you can catch changes that need to be reflected in the documentation as part of your code review process.&lt;/p&gt;
&lt;p&gt;And ideally, documentation should be &lt;strong&gt;tested&lt;/strong&gt;. I wrote about my approach to doing this using &lt;a href="https://simonwillison.net/2018/Jul/28/documentation-unit-tests/"&gt;Documentation unit tests&lt;/a&gt;. Executing example code in the documentation using a testing framework is a great idea too.&lt;/p&gt;
&lt;p&gt;As with tests, writing documentation from scratch is much more work than incrementally modifying existing documentation.&lt;/p&gt;
&lt;p&gt;Many of my commits include documentation that is just a sentence or two. This doesn't take very long to write, but it adds up to something very comprehensive over time.&lt;/p&gt;
&lt;p&gt;How about end-user facing documentation? I'm still figuring that out myself. I created my &lt;a href="https://simonwillison.net/2022/Mar/10/shot-scraper/"&gt;shot-scraper tool&lt;/a&gt; to help automate the process of keeping screenshots up-to-date, but I've not yet found personal habits and styles for end-user documentation that I'm confident in.&lt;/p&gt;
&lt;h4 id="link-to-an-issue"&gt;A link to an issue&lt;/h4&gt;
&lt;p&gt;Every perfect commit should include a link to an issue thread that accompanies that change.&lt;/p&gt;
&lt;p&gt;Sometimes I'll even open an issue seconds before writing the commit message, just to give myself something I can link to from the commit itself!&lt;/p&gt;
&lt;p&gt;The reason I like issue threads is that they provide effectively unlimited space for commentary and background for the change that is being made.&lt;/p&gt;
&lt;p&gt;Most of my issue threads are me talking to myself - sometimes with dozens of issue comments, all written by me.&lt;/p&gt;
&lt;p&gt;Things that can go in an issue thread include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Background&lt;/strong&gt;: the reason for the change. I try to include this in the opening comment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State of play&lt;/strong&gt; before the change. I'll often link to the current version of the code and documentation. This is great for if I return to an open issue a few days later, as it saves me from having to repeat that initial research.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Links to things&lt;/strong&gt;. So many links! Inspiration for the change, relevant documentation, conversations on Slack or Discord, clues found on StackOverflow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code snippets&lt;/strong&gt; illustrating potential designs and false-starts. Use &lt;code&gt;```python ... ```&lt;/code&gt; blocks to get syntax highlighting in your issue comments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decisions&lt;/strong&gt;. What did you consider? What did you decide? As programmers we make hundreds of tiny decisions a day. Write them down! Then you'll never find yourself relitigating them in the future having forgotten your original reasoning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Screenshots&lt;/strong&gt;. What it looked like before, what it looked like after. Animated screenshots are even better! I use &lt;a href="https://www.cockos.com/licecap/"&gt;LICEcap&lt;/a&gt; to generate quick GIF screen captures or QuickTime to capture videos - both of which can be dropped straight into a GitHub issue comment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prototypes&lt;/strong&gt;. I'll often paste a few lines of code copied from a Python console session. Sometimes I'll even paste in a block of HTML and CSS, or add a screenshot of a UI prototype.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;After I've closed my issues I like to add one last comment that links to the updated documentation and ideally a live demo of the new feature.&lt;/p&gt;
&lt;h4 id="issue-over-commit-message"&gt;An issue is more valuable than a commit message&lt;/h4&gt;
&lt;p&gt;I went through a several-year phase of writing essays in my commit messages, trying to capture as much of the background context and thinking as possible.&lt;/p&gt;
&lt;p&gt;My commit messages grew a lot shorter when I started bundling the updated documentation in the commit - since often much of the material I'd previously included in the commit message was now in that documentation instead.&lt;/p&gt;
&lt;p&gt;As I extended my practice of writing issue threads, I found that they were a better place for most of this context than the commit messages themselves. They supported embedded media, were more discoverable and I could continue to extend them even after the commit had landed.&lt;/p&gt;
&lt;p&gt;Today many of my commit messages are a single line summary and a link to an issue!&lt;/p&gt;
&lt;p&gt;The biggest benefit of lengthy commit messages is that they are guaranteed to survive for as long as the repository itself. If you're going to use issue threads in the way I describe here it is critical that you consider their long term archival value.&lt;/p&gt;
&lt;p&gt;I expect this to be controversial! I'm advocating for abandoning one of the core ideas of Git here - that each repository should incorporate a full, decentralized record of its history that is copied in its entirety when someone clones a repo.&lt;/p&gt;
&lt;p&gt;I understand that philosophy. All I'll say here is that my own experience has been that dropping that requirement has resulted in a net increase in my overall productivity. Other people may reach a different conclusion.&lt;/p&gt;
&lt;p&gt;If this offends you too much, you're welcome to construct an &lt;em&gt;even more perfect commit&lt;/em&gt; that incorporates background information and additional context in an extended commit message as well.&lt;/p&gt;
&lt;p&gt;One of the reasons I like GitHub Issues is that it includes a comprehensive API, which can be used to extract all of that data. I use my &lt;a href="https://github.com/dogsheep/github-to-sqlite"&gt;github-to-sqlite tool&lt;/a&gt; to maintain an ongoing archive of my issues and issue comments as a SQLite database file.&lt;/p&gt;
&lt;h4 id="not-all-perfect"&gt;Not every commit needs to be "perfect"&lt;/h4&gt;
&lt;p&gt;I find that the vast majority of my work fits into this pattern, but there are exceptions.&lt;/p&gt;
&lt;p&gt;Typo fix for some documentation or a comment? Just ship it, it's fine.&lt;/p&gt;
&lt;p&gt;Bug fix that doesn't deserve documentation? Still bundle the implementation and the test plus a link to an issue, but no need to update the docs - especially if they already describe the expected bug-free behaviour.&lt;/p&gt;
&lt;p&gt;Generally though, I find that aiming for implementation, tests, documentation and an issue link covers almost all of my work. It's a really good default model.&lt;/p&gt;
&lt;h4 id="scrappy-branches"&gt;Write scrappy commits in a branch&lt;/h4&gt;
&lt;p&gt;If I'm writing more exploratory or experimental code it often doesn't make sense to work in this strict way. For those instances I'll usually work in a branch, where I can ship "WIP" commit messages and failing tests with abandon. I'll then squash-merge them into a single perfect commit (sometimes via a self-closed GitHub pull request) to keep my main branch as tidy as possible.&lt;/p&gt;
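&lt;p&gt;As a sketch of that workflow (repository, branch and file names invented):&lt;/p&gt;

```shell
# Scrappy WIP commits on a branch, squashed into one tidy commit on main
mkdir -p demo-sq
git -C demo-sq init -q
git -C demo-sq config user.email demo@example.com
git -C demo-sq config user.name Demo
echo 'v1' > demo-sq/app.py
git -C demo-sq add app.py
git -C demo-sq commit -q -m 'Initial commit'

# Experiment freely with throwaway commit messages
git -C demo-sq checkout -q -b experiment
echo 'v2' > demo-sq/app.py
git -C demo-sq commit -qam 'WIP'
echo 'v3' > demo-sq/app.py
git -C demo-sq commit -qam 'WIP, tests failing'

# Back on the main branch: squash the whole experiment into one commit
git -C demo-sq checkout -q -
git -C demo-sq merge --squash experiment
git -C demo-sq commit -q -m 'Add feature: implementation, tests, docs'
```

&lt;p&gt;The main branch's history stays linear - just the initial commit and the one squashed feature commit.&lt;/p&gt;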
&lt;h4 id="examples"&gt;Some examples&lt;/h4&gt;
&lt;p&gt;Here are some examples of my commits that follow this pattern:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette/commit/9676b2deb07cff20247ba91dad3e84a4ab0b00d1"&gt;Upgrade Docker images to Python 3.11&lt;/a&gt; for &lt;a href="https://github.com/simonw/datasette/issues/1853"&gt;datasette #1853&lt;/a&gt; - a pretty tiny change, but still includes tests, docs and an issue link.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/sqlite-utils/commit/ab8d4aad0c42f905640981f6f24bc1e37205ae62"&gt;sqlite-utils schema now takes optional tables&lt;/a&gt; for &lt;a href="https://github.com/simonw/sqlite-utils/issues/299"&gt;sqlite-utils #299&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/shot-scraper/commit/5048e21a1ca5accedfeca6ac25a16a38dc240b81"&gt;shot-scraper html command&lt;/a&gt; for &lt;a href="https://github.com/simonw/shot-scraper/issues/96"&gt;shot-scraper #96&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/s3-credentials/commit/c7bb7268c4a124349bb511f7ec3ee3f28f9581ad"&gt;s3-credentials put-objects command&lt;/a&gt; for &lt;a href="https://github.com/simonw/s3-credentials/issues/68"&gt;s3-credentials #68&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-gunicorn/commit/0d561d7a94f76079b1eb7779b3e944c163d2539e"&gt;Initial implementation&lt;/a&gt; for &lt;a href="https://github.com/simonw/datasette-gunicorn/issues/1"&gt;datasette-gunicorn #1&lt;/a&gt; - this was the first commit to this repository, but I still bundled the tests, docs, implementation and a link to an issue.&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/code-review"&gt;code-review&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/definitions"&gt;definitions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/documentation"&gt;documentation&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/software-engineering"&gt;software-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/testing"&gt;testing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-issues"&gt;github-issues&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="code-review"/><category term="definitions"/><category term="documentation"/><category term="git"/><category term="github"/><category term="software-engineering"/><category term="testing"/><category term="github-issues"/></entry><entry><title>sqlite-comprehend: run AWS entity extraction against content in a SQLite database</title><link href="https://simonwillison.net/2022/Jul/11/sqlite-comprehend/#atom-tag" rel="alternate"/><published>2022-07-11T21:31:21+00:00</published><updated>2022-07-11T21:31:21+00:00</updated><id>https://simonwillison.net/2022/Jul/11/sqlite-comprehend/#atom-tag</id><summary type="html">
    &lt;p&gt;I built a new tool this week: &lt;a href="https://datasette.io/tools/sqlite-comprehend"&gt;sqlite-comprehend&lt;/a&gt;, which passes text from a SQLite database through the &lt;a href="https://aws.amazon.com/comprehend/"&gt;AWS Comprehend&lt;/a&gt; entity extraction service and stores the returned entities.&lt;/p&gt;
&lt;p&gt;I created this as a complement to my &lt;a href="https://simonwillison.net/2022/Jun/30/s3-ocr/"&gt;s3-ocr tool&lt;/a&gt;, which uses &lt;a href="https://aws.amazon.com/textract/"&gt;AWS Textract&lt;/a&gt; service to perform OCR against every PDF file in an S3 bucket.&lt;/p&gt;
&lt;p&gt;Short version: given a database table full of text, run the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% pip install sqlite-comprehend
% sqlite-comprehend entities myblog.db blog_entry body --strip-tags
  [###---------------------------------]    9%  00:01:02
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will churn through every piece of text in the &lt;code&gt;body&lt;/code&gt; column of the &lt;code&gt;blog_entry&lt;/code&gt; table in the &lt;code&gt;myblog.db&lt;/code&gt; SQLite database, strip any HTML tags (the &lt;code&gt;--strip-tags&lt;/code&gt; option), submit it to AWS Comprehend, and store the extracted entities in the following tables:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://datasette.simonwillison.net/simonwillisonblog/comprehend_entities"&gt;comprehend_entities&lt;/a&gt; - the extracted entities, classified by type&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://datasette.simonwillison.net/simonwillisonblog/blog_entry_comprehend_entities"&gt;blog_entry_comprehend_entities&lt;/a&gt; - a table relating entities to the entries that they appear in&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://datasette.simonwillison.net/simonwillisonblog/comprehend_entity_types"&gt;comprehend_entity_types&lt;/a&gt; - a small lookup table of entity types&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The above table names link to a live demo produced by running the tool against all of the content in my blog.&lt;/p&gt;
&lt;p&gt;Here are &lt;a href="https://datasette.simonwillison.net/simonwillisonblog/blog_entry_comprehend_entities?entity=47&amp;amp;_sort_desc=rowid"&gt;225 mentions&lt;/a&gt; that Comprehend classified as the organization called "Mozilla".&lt;/p&gt;
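&lt;p&gt;The join behind that query can be sketched with plain &lt;code&gt;sqlite3&lt;/code&gt;. This builds a tiny stand-in database with roughly the same table shapes - the column names here are illustrative guesses, not the tool's exact schema:&lt;/p&gt;

```shell
# Tiny stand-in for the entity tables (hypothetical columns and data):
sqlite3 entities-demo.db "
create table comprehend_entities (id integer primary key, name text);
create table blog_entry_comprehend_entities (entry_id integer, entity integer);
insert into comprehend_entities values (1, 'Mozilla');
insert into blog_entry_comprehend_entities values (42, 1), (43, 1);
-- Count mentions of each entity across entries:
select e.name, count(*) as mentions
from blog_entry_comprehend_entities j
join comprehend_entities e on e.id = j.entity
group by e.name;
"
# prints: Mozilla|2
```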
&lt;p&gt;The tool tracks which rows have been processed already (in the &lt;a href="https://datasette.simonwillison.net/simonwillisonblog/blog_entry_comprehend_entities_done"&gt;blog_entry_comprehend_entities_done&lt;/a&gt; table), so you can run it multiple times and it will only process newly added rows.&lt;/p&gt;
&lt;p&gt;AWS Comprehend &lt;a href="https://aws.amazon.com/comprehend/pricing/"&gt;pricing&lt;/a&gt; starts at $0.0001 per hundred characters. &lt;code&gt;sqlite-comprehend&lt;/code&gt; only submits the first 5,000 characters of each row.&lt;/p&gt;
&lt;h4&gt;How the demo works&lt;/h4&gt;
&lt;p&gt;My live demo for this tool uses a new &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; instance at &lt;a href="https://datasette.simonwillison.net/"&gt;datasette.simonwillison.net&lt;/a&gt;. It hosts a complete copy of the data from my blog - data that lives in a Django/PostgreSQL database on Heroku, but is now mirrored to a SQLite database hosted by Datasette.&lt;/p&gt;
&lt;p&gt;The demo runs out of my &lt;a href="https://github.com/simonw/simonwillisonblog-backup"&gt;simonwillisonblog-backup&lt;/a&gt; GitHub repository.&lt;/p&gt;
&lt;p&gt;A couple of years ago I realized that I'm no longer happy having any content I care about &lt;em&gt;not&lt;/em&gt; stored in a Git repository. I want to track my changes! I also want really robust backups: GitHub mirror their repos to three different regions around the world, and having data in a Git repository makes mirroring it somewhere else as easy as running a &lt;code&gt;git pull&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;So I created &lt;code&gt;simonwillisonblog-backup&lt;/code&gt; using a couple of my other tools: &lt;a href="https://datasette.io/tools/db-to-sqlite"&gt;db-to-sqlite&lt;/a&gt;, which converts a PostgreSQL database to a SQLite database, and &lt;a href="https://datasette.io/tools/sqlite-diffable"&gt;sqlite-diffable&lt;/a&gt;, which dumps out a SQLite database as a "diffable" directory of newline-delimited JSON files.&lt;/p&gt;
&lt;p&gt;Here's the simplest version of that pattern:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;db-to-sqlite \
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;postgresql+psycopg2://user:pass@hostname:5432/dbname&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
    simonwillisonblog.db --all&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This connects to PostgreSQL, loops through all of the database tables and converts each of them to a SQLite table stored in &lt;code&gt;simonwillisonblog.db&lt;/code&gt;.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;sqlite-diffable dump simonwillisonblog.db simonwillisonblog --all&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This converts that SQLite database into a directory of JSON files. Each table gets two files: &lt;code&gt;table.metadata.json&lt;/code&gt; containing the table's name, columns and schema and &lt;code&gt;table.ndjson&lt;/code&gt; containing a newline-separated list of JSON arrays representing every row in that table.&lt;/p&gt;
&lt;p&gt;You can see these files for my blog's database in the &lt;a href="https://github.com/simonw/simonwillisonblog-backup/tree/main/simonwillisonblog"&gt;simonwillisonblog&lt;/a&gt; folder.&lt;/p&gt;
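&lt;p&gt;The &lt;code&gt;.ndjson&lt;/code&gt; half of each pair is easy to work with from other tools, since every line is a standalone JSON array. A minimal sketch with made-up rows:&lt;/p&gt;

```shell
# Fake two-row dump for a hypothetical blog_tag table - one JSON array per line:
printf '%s\n' '[1, "quora"]' '[2, "datasette"]' > blog_tag.ndjson

# Each line parses independently; pull the tag out of the first row:
python3 -c "import json; print(json.loads(open('blog_tag.ndjson').readline())[1])"
# prints: quora
```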
&lt;p&gt;(My actual script is &lt;a href="https://github.com/simonw/simonwillisonblog-backup/blob/30fc957b5b6711c9a9cd3e427b709f59fd5b9c56/.github/workflows/backup.yml#L36-L67"&gt;a little more complex&lt;/a&gt;, because I backup only selected tables and then run extra code to redact some of the fields.)&lt;/p&gt;
&lt;p&gt;Since I set this up it's captured &lt;a href="https://github.com/simonw/simonwillisonblog-backup/commits/main"&gt;over 600 changes&lt;/a&gt; I've applied to my blog's database, all made through the regular Django admin interface.&lt;/p&gt;
&lt;p&gt;This morning I &lt;a href="https://github.com/simonw/simonwillisonblog-backup/compare/bab595ae7cb0802b9dbc12ef29864bee75081be5...30fc957b5b6711c9a9cd3e427b709f59fd5b9c56#diff-5876767d72a925ae5674ea64687a681f6a5a935fea020c62d168c7a172ccd2c6"&gt;extended the script&lt;/a&gt; to run &lt;code&gt;sqlite-comprehend&lt;/code&gt; against my blog entries and deploy the resulting data using Datasette.&lt;/p&gt;
&lt;p&gt;The concise version of the new script looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;wget -q https://datasette.simonwillison.net/simonwillisonblog.db
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This retrieves the previous version of the database. I do this to avoid being charged by AWS Comprehend for running entity extraction against rows I have already processed.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sqlite-diffable load simonwillisonblog.db simonwillisonblog --replace
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This creates the &lt;code&gt;simonwillisonblog.db&lt;/code&gt; database by loading in the JSON from the &lt;code&gt;simonwillisonblog/&lt;/code&gt; folder. I do it this way mainly to exercise the new &lt;a href="https://github.com/simonw/sqlite-diffable/issues/3"&gt;sqlite-diffable load&lt;/a&gt; command I just added to that tool.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;--replace&lt;/code&gt; option ensures that any tables that already exist are replaced by a fresh copy (while leaving my existing comprehend entity extraction data intact).&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sqlite-comprehend entities simonwillisonblog.db blog_entry title body --strip-tags
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This runs &lt;code&gt;sqlite-comprehend&lt;/code&gt; against the blog entries that have not yet been processed.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set +e
sqlite-utils enable-fts simonwillisonblog.db blog_series title summary --create-triggers --tokenize porter 2&amp;gt;/dev/null
sqlite-utils enable-fts simonwillisonblog.db blog_tag tag --create-triggers --tokenize porter 2&amp;gt;/dev/null
sqlite-utils enable-fts simonwillisonblog.db blog_quotation quotation source --create-triggers --tokenize porter 2&amp;gt;/dev/null
sqlite-utils enable-fts simonwillisonblog.db blog_entry title body --create-triggers --tokenize porter 2&amp;gt;/dev/null
sqlite-utils enable-fts simonwillisonblog.db blog_blogmark link_title via_title commentary --create-triggers --tokenize porter 2&amp;gt;/dev/null
set -e
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This configures SQLite full-text search against each of those tables, using &lt;a href="https://til.simonwillison.net/bash/ignore-errors"&gt;this pattern&lt;/a&gt; to suppress any errors that occur if the FTS tables already exist.&lt;/p&gt;
&lt;p&gt;Setting up FTS in this way means I can use the &lt;a href="https://datasette.io/plugins/datasette-search-all"&gt;datasette-search-all plugin&lt;/a&gt; to run searches like &lt;a href="https://datasette.simonwillison.net/-/search?q=s3"&gt;this one for s3&lt;/a&gt; across all of those tables at once.&lt;/p&gt;
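&lt;p&gt;Under the hood &lt;code&gt;enable-fts&lt;/code&gt; creates standard SQLite full-text search tables. Here's a rough sketch of the equivalent using FTS5 and the porter tokenizer directly, with a hypothetical table and row (assuming your &lt;code&gt;sqlite3&lt;/code&gt; build includes FTS5):&lt;/p&gt;

```shell
sqlite3 fts-demo.db "
create virtual table blog_entry_fts using fts5(title, body, tokenize='porter');
insert into blog_entry_fts values ('Scraping AWS', 'Tracking changes to s3 buckets');
-- Porter stemming means 'bucket' matches 'buckets':
select title from blog_entry_fts where blog_entry_fts match 'bucket';
"
# prints: Scraping AWS
```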
&lt;pre&gt;&lt;code&gt;datasette publish cloudrun simonwillisonblog.db \
-m metadata.yml \
--service simonwillisonblog \
--install datasette-block-robots \
--install datasette-graphql \
--install datasette-search-all
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This uses the &lt;a href="https://docs.datasette.io/en/stable/publish.html"&gt;datasette publish&lt;/a&gt; command to deploy the &lt;a href="https://datasette.simonwillison.net/"&gt;datasette.simonwillison.net&lt;/a&gt; site to Google Cloud Run.&lt;/p&gt;
&lt;p&gt;I'm adding two more plugins here: &lt;a href="https://datasette.io/plugins/datasette-block-robots"&gt;datasette-block-robots&lt;/a&gt; to avoid search engine crawlers indexing a duplicate copy of my blog's content, and &lt;a href="https://datasette.io/plugins/datasette-graphql"&gt;datasette-graphql&lt;/a&gt; to enable GraphQL queries against my data.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://datasette.simonwillison.net/graphql/simonwillisonblog?query=%7B%0A%20%20blog_entry(%0A%20%20%20%20sort_desc%3A%20id%0A%20%20%20%20where%3A%20%22id%20in%20(select%20entry_id%20from%20blog_entry_tags%20where%20tag_id%20in%20(select%20id%20from%20blog_tag%20where%20tag%20%3D%20%27datasette%27))%22%2C%0A%20%20)%20%7B%0A%20%20%20%20totalCount%0A%20%20%20%20pageInfo%20%7B%0A%20%20%20%20%20%20hasNextPage%0A%20%20%20%20%20%20endCursor%0A%20%20%20%20%7D%0A%20%20%20%20nodes%20%7B%0A%20%20%20%20%20%20id%0A%20%20%20%20%20%20title%0A%20%20%20%20%20%20created%0A%20%20%20%20%20%20body%0A%20%20%20%20%20%20blog_entry_tags_list%20%7B%0A%20%20%20%20%20%20%20%20nodes%20%7B%0A%20%20%20%20%20%20%20%20%20%20tag_id%20%7B%0A%20%20%20%20%20%20%20%20%20%20%20%20tag%0A%20%20%20%20%20%20%20%20%20%20%7D%0A%20%20%20%20%20%20%20%20%7D%0A%20%20%20%20%20%20%7D%0A%20%20%20%20%7D%0A%20%20%7D%0A%7D"&gt;an example GraphQL query&lt;/a&gt; that returns my most recent blog entries that are tagged with &lt;code&gt;datasette&lt;/code&gt;.&lt;/p&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/sqlite-comprehend"&gt;sqlite-comprehend&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/sqlite-comprehend/releases/tag/0.2.1"&gt;0.2.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/sqlite-comprehend/releases"&gt;4 releases total&lt;/a&gt;) - 2022-07-11
&lt;br /&gt;Tools for running data in a SQLite database through AWS Comprehend&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/sqlite-diffable"&gt;sqlite-diffable&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/sqlite-diffable/releases/tag/0.4"&gt;0.4&lt;/a&gt; - (&lt;a href="https://github.com/simonw/sqlite-diffable/releases"&gt;5 releases total&lt;/a&gt;) - 2022-07-11
&lt;br /&gt;Tools for dumping/loading a SQLite database to diffable directory structure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-redirect-to-https"&gt;datasette-redirect-to-https&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-redirect-to-https/releases/tag/0.2"&gt;0.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-redirect-to-https/releases"&gt;2 releases total&lt;/a&gt;) - 2022-07-04
&lt;br /&gt;Datasette plugin that redirects all non-https requests to https&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-unsafe-expose-env"&gt;datasette-unsafe-expose-env&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-unsafe-expose-env/releases/tag/0.1.1"&gt;0.1.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-unsafe-expose-env/releases"&gt;2 releases total&lt;/a&gt;) - 2022-07-03
&lt;br /&gt;Datasette plugin to expose some environment variables at /-/env for debugging&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-expose-env"&gt;datasette-expose-env&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-expose-env/releases/tag/0.1"&gt;0.1&lt;/a&gt; - 2022-07-03
&lt;br /&gt;Datasette plugin to expose selected environment variables at /-/env for debugging&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-upload-csvs"&gt;datasette-upload-csvs&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-upload-csvs/releases/tag/0.7.2"&gt;0.7.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-upload-csvs/releases"&gt;10 releases total&lt;/a&gt;) - 2022-07-03
&lt;br /&gt;Datasette plugin for uploading CSV files and converting them to database tables&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-packages"&gt;datasette-packages&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-packages/releases/tag/0.2"&gt;0.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-packages/releases"&gt;3 releases total&lt;/a&gt;) - 2022-07-03
&lt;br /&gt;Show a list of currently installed Python packages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-graphql"&gt;datasette-graphql&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-graphql/releases/tag/2.1"&gt;2.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-graphql/releases"&gt;35 releases total&lt;/a&gt;) - 2022-07-03
&lt;br /&gt;Datasette plugin providing an automatic GraphQL API for your SQLite databases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-edit-schema"&gt;datasette-edit-schema&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-edit-schema/releases/tag/0.5"&gt;0.5&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-edit-schema/releases"&gt;9 releases total&lt;/a&gt;) - 2022-07-01
&lt;br /&gt;Datasette plugin for modifying table schemas&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/zsh/argument-heredoc"&gt;Passing command arguments using heredoc syntax&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/github/reporting-bugs"&gt;Reporting bugs in GitHub to GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/github-actions/conditionally-run-a-second-job"&gt;Conditionally running a second job in a GitHub Actions workflow&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/aws"&gt;aws&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="aws"/><category term="git"/><category term="github"/><category term="projects"/><category term="sqlite"/><category term="datasette"/><category term="weeknotes"/></entry><entry><title>A tiny CI system</title><link href="https://simonwillison.net/2022/Apr/26/a-tiny-ci-system/#atom-tag" rel="alternate"/><published>2022-04-26T15:39:27+00:00</published><updated>2022-04-26T15:39:27+00:00</updated><id>https://simonwillison.net/2022/Apr/26/a-tiny-ci-system/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.0chris.com/tiny-ci-system.html"&gt;A tiny CI system&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Christian Ştefănescu shares a recipe for building a tiny self-hosted CI system using Git and Redis. A post-receive hook runs when a commit is pushed to the repo and uses redis-cli to push jobs to a list. Then a separate bash script runs a loop with a blocking “redis-cli blpop jobs” operation which waits for new jobs and then executes the CI job as a shell script.
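&lt;p&gt;The two moving parts could be sketched like this. This version just writes the scripts out rather than running them - the queue name and &lt;code&gt;run-ci.sh&lt;/code&gt; are hypothetical, and actually using them requires a running Redis server:&lt;/p&gt;

```shell
# hooks/post-receive in the bare repository - enqueue each pushed commit:
cat > post-receive <<'EOF'
#!/bin/sh
read oldrev newrev refname
redis-cli rpush jobs "$newrev"
EOF

# Long-running worker - block until a job arrives, then execute it:
cat > ci-worker.sh <<'EOF'
#!/bin/sh
while true; do
  # blpop prints the key name then the value; keep just the value
  commit=$(redis-cli blpop jobs 0 | tail -n 1)
  sh ./run-ci.sh "$commit"   # hypothetical CI script
done
EOF
chmod +x post-receive ci-worker.sh
```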

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/stchris_/status/1518977088723861505"&gt;@stchris_&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/bash"&gt;bash&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/continuous-integration"&gt;continuous-integration&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/redis"&gt;redis&lt;/a&gt;&lt;/p&gt;



</summary><category term="bash"/><category term="continuous-integration"/><category term="git"/><category term="redis"/></entry><entry><title>Help scraping: track changes to CLI tools by recording their --help using Git</title><link href="https://simonwillison.net/2022/Feb/2/help-scraping/#atom-tag" rel="alternate"/><published>2022-02-02T23:46:35+00:00</published><updated>2022-02-02T23:46:35+00:00</updated><id>https://simonwillison.net/2022/Feb/2/help-scraping/#atom-tag</id><summary type="html">
    &lt;p&gt;I've been experimenting with a new variant of &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt; this week which I'm calling &lt;strong&gt;Help scraping&lt;/strong&gt;. The key idea is to track changes made to CLI tools over time by recording the output of their &lt;code&gt;--help&lt;/code&gt; commands in a Git repository.&lt;/p&gt;
&lt;p&gt;My new &lt;a href="https://github.com/simonw/help-scraper"&gt;help-scraper GitHub repository&lt;/a&gt; is my first implementation of this pattern.&lt;/p&gt;
&lt;p&gt;It uses &lt;a href="https://github.com/simonw/help-scraper/blob/cd18c5d7c1ac7c3851823dcabaa21ee920d73720/.github/workflows/scrape.yml"&gt;this GitHub Actions workflow&lt;/a&gt; to record the &lt;code&gt;--help&lt;/code&gt; output for the Amazon Web Services &lt;code&gt;aws&lt;/code&gt; CLI tool, and also for the &lt;code&gt;flyctl&lt;/code&gt; tool maintained by the &lt;a href="https://fly.io/"&gt;Fly.io&lt;/a&gt; hosting platform.&lt;/p&gt;
&lt;p&gt;The workflow runs once a day. It loops through every available AWS command (using &lt;a href="https://github.com/simonw/help-scraper/blob/cd18c5d7c1ac7c3851823dcabaa21ee920d73720/aws_commands.py"&gt;this script&lt;/a&gt;) and records the output of that command's CLI help option to a &lt;code&gt;.txt&lt;/code&gt; file in the repository - then commits the result at the end.&lt;/p&gt;
&lt;p&gt;The result is a version history of changes made to those help files. It's essentially a much more detailed version of a changelog - capturing all sorts of details that might not be reflected in the official release notes for the tool.&lt;/p&gt;
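&lt;p&gt;The core of that pattern fits in a few lines of shell. This hypothetical version scrapes &lt;code&gt;--help&lt;/code&gt; from a couple of locally installed tools (&lt;code&gt;python3&lt;/code&gt; and &lt;code&gt;git&lt;/code&gt; as stand-ins) instead of the full AWS command tree:&lt;/p&gt;

```shell
# Minimal help scraper sketch - hypothetical repo layout.
set -e
mkdir -p help-demo && cd help-demo
git init -q -b main .

# Record each tool's --help output to a .txt file:
for tool in python3 git; do
  "$tool" --help > "$tool.txt" 2>&1
done

# Commit the results; over time the diff history becomes a detailed changelog:
git add -A
git -c user.email=scraper@example.com -c user.name=scraper \
    commit -q -m "Scrape latest --help output"
git log --oneline
```

&lt;p&gt;Run on a schedule (as the GitHub Actions workflow does), each run that changes any output produces a new commit to diff against.&lt;/p&gt;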
&lt;p&gt;Here's an example. This morning, AWS released version 1.22.47 of their CLI helper tool. They release new versions on an almost daily basis.&lt;/p&gt;
&lt;p&gt;Here are &lt;a href="https://github.com/aws/aws-cli/blob/develop/CHANGELOG.rst#12247"&gt;the official release notes&lt;/a&gt; - 12 bullet points, spanning 12 different AWS services.&lt;/p&gt;
&lt;p&gt;My help scraper caught the details of the release in &lt;a href="https://github.com/simonw/help-scraper/commit/cd18c5d7c1ac7c3851823dcabaa21ee920d73720#diff-c2559859df8912eb13a6017d81019bf5452cead3e6495744e2d0c82202bf33ac"&gt;this commit&lt;/a&gt; - 89 changed files with 3,543 additions and 1,324 deletions. It tells the story of what's changed in a whole lot more detail.&lt;/p&gt;
&lt;p&gt;The AWS CLI tool is &lt;em&gt;enormous&lt;/em&gt;. Running &lt;code&gt;find aws -name '*.txt' | wc -l&lt;/code&gt; in that repository counts help pages for 11,401 individual commands - or 11,390 if you check out the previous version, showing that there were 11 commands added just in this morning's new release.&lt;/p&gt;
&lt;p&gt;There are plenty of other ways of tracking changes made to AWS. I've previously kept an eye on &lt;a href="https://github.com/boto/botocore/commits/develop"&gt;the botocore GitHub history&lt;/a&gt;, which exposes changes to the underlying JSON - and there are projects like &lt;a href="https://awsapichanges.info/"&gt;awschanges.info&lt;/a&gt; which try to turn those sources of data into something more readable.&lt;/p&gt;
&lt;p&gt;But I think there's something pretty neat about being able to track changes in detail for any CLI tool that offers help output, independent of the official release notes for that tool. Not everyone writes release notes &lt;a href="https://simonwillison.net/2022/Jan/31/release-notes/"&gt;with the detail I like from them&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;I implemented this for &lt;code&gt;flyctl&lt;/code&gt; first, because I wanted to see what changes were being made that might impact my &lt;a href="https://datasette.io/plugins/datasette-publish-fly"&gt;datasette-publish-fly&lt;/a&gt; plugin which shells out to that tool. Then I realized it could be applied to AWS as well.&lt;/p&gt;
&lt;h4&gt;Help scraping my own projects&lt;/h4&gt;
&lt;p&gt;I got the initial idea for this technique from a change I made to my &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; and &lt;a href="https://sqlite-utils.datasette.io"&gt;sqlite-utils&lt;/a&gt; projects a few weeks ago.&lt;/p&gt;
&lt;p&gt;Both tools offer CLI commands with &lt;code&gt;--help&lt;/code&gt; output - but I kept on forgetting to update the help, partly because there was no easy way to see its output online without running the tools themselves.&lt;/p&gt;
&lt;p&gt;So, I added documentation pages that list the output of &lt;code&gt;--help&lt;/code&gt; for each of the CLI commands, generated using the &lt;a href="https://nedbatchelder.com/code/cog"&gt;Cog&lt;/a&gt; file generation tool:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://sqlite-utils.datasette.io/en/stable/cli-reference.html"&gt;sqlite-utils CLI reference&lt;/a&gt; (39 commands!)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.datasette.io/en/stable/cli-reference.html"&gt;datasette CLI reference&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Having added these pages, I realized that the Git commit history of those generated documentation pages could double up as a history of changes I made to the &lt;code&gt;--help&lt;/code&gt; output - here's &lt;a href="https://github.com/simonw/sqlite-utils/commits/main/docs/cli-reference.rst"&gt;that history for sqlite-utils&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It was a short jump from that to the idea of combining it with &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt; to generate history for other tools.&lt;/p&gt;
&lt;h4&gt;Bonus trick: GraphQL schema scraping&lt;/h4&gt;
&lt;p&gt;I've started making selective use of the &lt;a href="https://fly.io/"&gt;Fly.io&lt;/a&gt; GraphQL API as part of &lt;a href="https://github.com/simonw/datasette-publish-fly"&gt;my plugin&lt;/a&gt; for publishing Datasette instances to that platform.&lt;/p&gt;
&lt;p&gt;Their GraphQL API is openly available, but it's not extensively documented - presumably because they reserve the right to make breaking changes to it at any time. I collected some notes on it in this TIL: &lt;a href="https://til.simonwillison.net/fly/undocumented-graphql-api"&gt;Using the undocumented Fly GraphQL API&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This gave me an idea: could I track changes made to their GraphQL schema using the same scraping trick?&lt;/p&gt;
&lt;p&gt;It turns out I can! There's an NPM package called &lt;a href="https://www.npmjs.com/package/get-graphql-schema"&gt;get-graphql-schema&lt;/a&gt; which can extract the GraphQL schema from any GraphQL server and write it out to disk:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;npx get-graphql-schema https://api.fly.io/graphql &amp;gt; /tmp/fly.graphql
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I've added that to my &lt;code&gt;help-scraper&lt;/code&gt; repository too - so now I have a &lt;a href="https://github.com/simonw/help-scraper/commits/main/flyctl/fly.graphql"&gt;commit history of changes&lt;/a&gt; they are making there too. Here's &lt;a href="https://github.com/simonw/help-scraper/commit/f11072ff23f0d654395be7c2b1e98e84dbbc26a3#diff-c9cd49cf2aa3b983457e2812ba9313cc254aba74aaba9a36d56c867e32221589"&gt;an example&lt;/a&gt; from this morning.&lt;/p&gt;
&lt;h3&gt;Other weeknotes&lt;/h3&gt;
&lt;p&gt;I've decided to start setting goals on a monthly basis. My goal for February is to finally ship Datasette 1.0! I'm trying to make at least one commit every day that takes me closer to &lt;a href="https://github.com/simonw/datasette/milestone/7"&gt;that milestone&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This week I did &lt;a href="https://github.com/simonw/datasette/issues/1533"&gt;a bunch of work&lt;/a&gt; adding a &lt;code&gt;Link: https://...; rel="alternate"; type="application/datasette+json"&lt;/code&gt; HTTP header to many different pages in the Datasette interface, to support discovery of the JSON version of a page based on a URL to the human-readable version.&lt;/p&gt;
&lt;p&gt;(I had originally planned &lt;a href="https://github.com/simonw/datasette/issues/1534"&gt;to also support&lt;/a&gt; &lt;code&gt;Accept: application/json&lt;/code&gt; request headers for this, but I've been put off that idea by the discovery that Cloudflare &lt;a href="https://twitter.com/simonw/status/1478470282931163137"&gt;deliberately ignores&lt;/a&gt; the &lt;code&gt;Vary: Accept&lt;/code&gt; header.)&lt;/p&gt;
&lt;p&gt;Unrelated to Datasette: I also started a new Twitter thread, gathering &lt;a href="https://twitter.com/simonw/status/1487673496977113088"&gt;behind the scenes material from the movie the Mitchells vs the Machines&lt;/a&gt;. There's been a flurry of great material shared recently by the creative team, presumably as part of the run-up to awards season - and I've been enjoying trying to tie it all together in a thread.&lt;/p&gt;
&lt;p&gt;The last time I did this &lt;a href="https://twitter.com/simonw/status/1077737871602110466"&gt;was for Into the Spider-Verse&lt;/a&gt; (from the same studio) and that thread ended up running for more than a year!&lt;/p&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/pytest/only-run-integration"&gt;Opt-in integration tests with pytest --integration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/graphql/get-graphql-schema"&gt;get-graphql-schema&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/github-actions/python-3-11"&gt;Testing against Python 3.11 preview using GitHub Actions&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/graphql"&gt;graphql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fly"&gt;fly&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="cli"/><category term="git"/><category term="github"/><category term="projects"/><category term="scraping"/><category term="graphql"/><category term="weeknotes"/><category term="github-actions"/><category term="git-scraping"/><category term="fly"/></entry><entry><title>How I build a feature</title><link href="https://simonwillison.net/2022/Jan/12/how-i-build-a-feature/#atom-tag" rel="alternate"/><published>2022-01-12T18:10:17+00:00</published><updated>2022-01-12T18:10:17+00:00</updated><id>https://simonwillison.net/2022/Jan/12/how-i-build-a-feature/#atom-tag</id><summary type="html">
    &lt;p&gt;I'm maintaining &lt;a href="https://github.com/simonw/simonw/blob/main/releases.md"&gt;a lot of different projects&lt;/a&gt; at the moment. I thought it would be useful to describe the process I use for adding a new feature to one of them, using the new &lt;a href="https://sqlite-utils.datasette.io/en/stable/cli.html#cli-create-database"&gt;sqlite-utils create-database&lt;/a&gt; command as an example.&lt;/p&gt;
&lt;p&gt;I like each feature to be represented by what I consider to be the &lt;strong&gt;perfect commit&lt;/strong&gt; - one that bundles together the implementation, the tests, the documentation and a link to an external issue thread.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update 29th October 2022:&lt;/strong&gt; I wrote &lt;a href="https://simonwillison.net/2022/Oct/29/the-perfect-commit/"&gt;more about the perfect commit&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;sqlite-utils create-database&lt;/code&gt; command is very simple: it creates a new, empty SQLite database file. You use it like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% sqlite-utils create-database empty.db
&lt;/code&gt;&lt;/pre&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2022/Jan/12/how-i-build-a-feature/#everything-starts-with-an-issue"&gt;Everything starts with an issue&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2022/Jan/12/how-i-build-a-feature/#development-environment"&gt;Development environment&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2022/Jan/12/how-i-build-a-feature/#automated-tests"&gt;Automated tests&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2022/Jan/12/how-i-build-a-feature/#implementing-the-feature"&gt;Implementing the feature&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2022/Jan/12/how-i-build-a-feature/#code-formatting-with-black"&gt;Code formatting with Black&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2022/Jan/12/how-i-build-a-feature/#linting"&gt;Linting&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2022/Jan/12/how-i-build-a-feature/#documentation"&gt;Documentation&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2022/Jan/12/how-i-build-a-feature/#committing-the-change"&gt;Committing the change&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2022/Jan/12/how-i-build-a-feature/#branches-and-pull-requests"&gt;Branches and pull requests&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2022/Jan/12/how-i-build-a-feature/#release-notes-and-a-release"&gt;Release notes, and a release&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2022/Jan/12/how-i-build-a-feature/#a-live-demo"&gt;A live demo&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2022/Jan/12/how-i-build-a-feature/#tell-the-world-about-it"&gt;Tell the world about it&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2022/Jan/12/how-i-build-a-feature/#more-examples-of-this-pattern"&gt;More examples of this pattern&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id="everything-starts-with-an-issue"&gt;Everything starts with an issue&lt;/h4&gt;
&lt;p&gt;Every piece of work I do has an associated issue. This acts as ongoing work-in-progress notes and lets me record decisions, reference any research, drop in code snippets and sometimes even add screenshots and video - stuff that is really helpful but doesn't necessarily fit in code comments or commit messages.&lt;/p&gt;
&lt;p&gt;Even if it's a tiny improvement that's only a few lines of code, I'll still open an issue for it - sometimes just a few minutes before closing it again as complete.&lt;/p&gt;
&lt;p&gt;Any commits I create that relate to an issue reference the issue number in their commit message. GitHub does a great job of automatically linking these together, bidirectionally, so I can navigate from the commit to the issue or from the issue to the commit.&lt;/p&gt;
&lt;p&gt;Having an issue also gives me something I can link to from my release notes.&lt;/p&gt;
&lt;p&gt;In the case of the &lt;code&gt;create-database&lt;/code&gt; command, I opened &lt;a href="https://github.com/simonw/sqlite-utils/issues/348"&gt;this issue&lt;/a&gt; in November when I had the idea for the feature.&lt;/p&gt;
&lt;p&gt;I didn't do the work until over a month later - but because I had designed the feature in the issue comments I could get started on the implementation really quickly.&lt;/p&gt;
&lt;h4 id="development-environment"&gt;Development environment&lt;/h4&gt;
&lt;p&gt;Being able to quickly spin up a development environment for a project is crucial. All of my projects have a section in the README or the documentation describing how to do this - here's &lt;a href="https://sqlite-utils.datasette.io/en/stable/contributing.html"&gt;that section for sqlite-utils&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;On my own laptop each project gets a directory, and I use &lt;code&gt;pipenv shell&lt;/code&gt; in that directory to activate a directory-specific virtual environment, then &lt;code&gt;pip install -e '.[test]'&lt;/code&gt; to install the dependencies and test dependencies.&lt;/p&gt;
&lt;h4 id="automated-tests"&gt;Automated tests&lt;/h4&gt;
&lt;p&gt;All of my features are accompanied by automated tests. This gives me the confidence to boldly make changes to the software in the future without fear of breaking any existing features.&lt;/p&gt;
&lt;p&gt;This means that writing tests needs to be as quick and easy as possible - the less friction here the better.&lt;/p&gt;
&lt;p&gt;The best way to make writing tests easy is to have a great testing framework in place from the very beginning of the project. My cookiecutter templates (&lt;a href="https://github.com/simonw/python-lib"&gt;python-lib&lt;/a&gt;, &lt;a href="https://github.com/simonw/datasette-plugin"&gt;datasette-plugin&lt;/a&gt; and &lt;a href="https://github.com/simonw/click-app"&gt;click-app&lt;/a&gt;) all configure &lt;a href="https://docs.pytest.org/"&gt;pytest&lt;/a&gt; and add a &lt;code&gt;tests/&lt;/code&gt; folder with a single passing test, to give me something to start adding tests to.&lt;/p&gt;
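&lt;p&gt;That starter test doesn't need to do anything interesting - something along these lines (illustrative, not the exact file the templates generate) is enough to prove the test plumbing works:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# tests/test_example.py - a single passing test to build on
def test_placeholder():
    assert 1 + 1 == 2
&lt;/code&gt;&lt;/pre&gt;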
&lt;p&gt;I can't say enough good things about pytest. Before I adopted it, writing tests was a chore. Now it's an activity I genuinely look forward to!&lt;/p&gt;
&lt;p&gt;I'm not a religious adherent to writing the tests first - see &lt;a href="https://simonwillison.net/2020/Feb/11/cheating-at-unit-tests-pytest-black/"&gt;How to cheat at unit tests with pytest and Black&lt;/a&gt; for more thoughts on that - but I'll write the test first if it's pragmatic to do so.&lt;/p&gt;
&lt;p&gt;In the case of &lt;code&gt;create-database&lt;/code&gt;, writing the test first felt like the right thing to do. Here's the test I started with:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;test_create_database&lt;/span&gt;(&lt;span class="pl-s1"&gt;tmpdir&lt;/span&gt;):
    &lt;span class="pl-s1"&gt;db_path&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;tmpdir&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; &lt;span class="pl-s"&gt;"test.db"&lt;/span&gt;
    &lt;span class="pl-k"&gt;assert&lt;/span&gt; &lt;span class="pl-c1"&gt;not&lt;/span&gt; &lt;span class="pl-s1"&gt;db_path&lt;/span&gt;.&lt;span class="pl-en"&gt;exists&lt;/span&gt;()
    &lt;span class="pl-s1"&gt;result&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-v"&gt;CliRunner&lt;/span&gt;().&lt;span class="pl-en"&gt;invoke&lt;/span&gt;(
        &lt;span class="pl-s1"&gt;cli&lt;/span&gt;.&lt;span class="pl-s1"&gt;cli&lt;/span&gt;, [&lt;span class="pl-s"&gt;"create-database"&lt;/span&gt;, &lt;span class="pl-en"&gt;str&lt;/span&gt;(&lt;span class="pl-s1"&gt;db_path&lt;/span&gt;)]
    )
    &lt;span class="pl-k"&gt;assert&lt;/span&gt; &lt;span class="pl-s1"&gt;result&lt;/span&gt;.&lt;span class="pl-s1"&gt;exit_code&lt;/span&gt; &lt;span class="pl-c1"&gt;==&lt;/span&gt; &lt;span class="pl-c1"&gt;0&lt;/span&gt;
    &lt;span class="pl-k"&gt;assert&lt;/span&gt; &lt;span class="pl-s1"&gt;db_path&lt;/span&gt;.&lt;span class="pl-en"&gt;exists&lt;/span&gt;()&lt;/pre&gt;
&lt;p&gt;This test uses the &lt;a href="https://docs.pytest.org/en/6.2.x/tmpdir.html#the-tmpdir-fixture"&gt;tmpdir pytest fixture&lt;/a&gt; to provide a temporary directory that will be automatically cleaned up by pytest after the test run finishes.&lt;/p&gt;
&lt;p&gt;It checks that the &lt;code&gt;test.db&lt;/code&gt; file doesn't exist yet, then uses the Click framework's &lt;a href="https://click.palletsprojects.com/en/8.0.x/testing/"&gt;CliRunner utility&lt;/a&gt; to execute the create-database command. Then it checks that the command didn't throw an error and that the file has been created.&lt;/p&gt;
&lt;p&gt;Then I run the test and watch it fail - because I haven't built the feature yet!&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% pytest -k test_create_database

============ test session starts ============
platform darwin -- Python 3.8.2, pytest-6.2.4, py-1.10.0, pluggy-0.13.1
rootdir: /Users/simon/Dropbox/Development/sqlite-utils
plugins: cov-2.12.1, hypothesis-6.14.5
collected 808 items / 807 deselected / 1 selected                           

tests/test_cli.py F                                                   [100%]

================= FAILURES ==================
___________ test_create_database ____________

tmpdir = local('/private/var/folders/wr/hn3206rs1yzgq3r49bz8nvnh0000gn/T/pytest-of-simon/pytest-659/test_create_database0')

    def test_create_database(tmpdir):
        db_path = tmpdir / "test.db"
        assert not db_path.exists()
        result = CliRunner().invoke(
            cli.cli, ["create-database", str(db_path)]
        )
&amp;gt;       assert result.exit_code == 0
E       assert 1 == 0
E        +  where 1 = &amp;lt;Result SystemExit(1)&amp;gt;.exit_code

tests/test_cli.py:2097: AssertionError
========== short test summary info ==========
FAILED tests/test_cli.py::test_create_database - assert 1 == 0
===== 1 failed, 807 deselected in 0.99s ====
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;-k&lt;/code&gt; option lets me run just the tests that match the search string, rather than running the full test suite. I use this all the time.&lt;/p&gt;
&lt;p&gt;Other pytest features I often use:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;pytest -x&lt;/code&gt;: runs the entire test suite but quits at the first test that fails&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pytest --lf&lt;/code&gt;: re-runs any tests that failed during the last test run&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pytest --pdb -x&lt;/code&gt;: open the Python debugger at the first failed test (omit the &lt;code&gt;-x&lt;/code&gt; to open it at every failed test). This is the main way I interact with the Python debugger. I often use this to help write the tests, since I can add &lt;code&gt;assert False&lt;/code&gt; and get a shell inside the test to interact with various objects and figure out how to best run assertions against them.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="implementing-the-feature"&gt;Implementing the feature&lt;/h4&gt;
&lt;p&gt;With the test in place, it's time to implement the command. I added this code to my existing &lt;a href="https://github.com/simonw/sqlite-utils/blob/3.20/sqlite_utils/cli.py"&gt;cli.py module&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-en"&gt;@&lt;span class="pl-s1"&gt;cli&lt;/span&gt;.&lt;span class="pl-en"&gt;command&lt;/span&gt;(&lt;span class="pl-s1"&gt;name&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"create-database"&lt;/span&gt;)&lt;/span&gt;
&lt;span class="pl-en"&gt;@&lt;span class="pl-s1"&gt;click&lt;/span&gt;.&lt;span class="pl-en"&gt;argument&lt;/span&gt;(&lt;/span&gt;
&lt;span class="pl-en"&gt;    &lt;span class="pl-s"&gt;"path"&lt;/span&gt;,&lt;/span&gt;
&lt;span class="pl-en"&gt;    &lt;span class="pl-s1"&gt;type&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s1"&gt;click&lt;/span&gt;.&lt;span class="pl-v"&gt;Path&lt;/span&gt;(&lt;span class="pl-s1"&gt;file_okay&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;, &lt;span class="pl-s1"&gt;dir_okay&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;False&lt;/span&gt;, &lt;span class="pl-s1"&gt;allow_dash&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;False&lt;/span&gt;),&lt;/span&gt;
&lt;span class="pl-en"&gt;    &lt;span class="pl-s1"&gt;required&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;,&lt;/span&gt;
&lt;span class="pl-en"&gt;)&lt;/span&gt;
&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;create_database&lt;/span&gt;(&lt;span class="pl-s1"&gt;path&lt;/span&gt;):
    &lt;span class="pl-s"&gt;"Create a new empty database file."&lt;/span&gt;
    &lt;span class="pl-s1"&gt;db&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;sqlite_utils&lt;/span&gt;.&lt;span class="pl-v"&gt;Database&lt;/span&gt;(&lt;span class="pl-s1"&gt;path&lt;/span&gt;)
    &lt;span class="pl-s1"&gt;db&lt;/span&gt;.&lt;span class="pl-en"&gt;vacuum&lt;/span&gt;()&lt;/pre&gt;
&lt;p&gt;(I happen to know that the quickest way to create an empty SQLite database file is to run &lt;code&gt;VACUUM&lt;/code&gt; against it.)&lt;/p&gt;
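&lt;p&gt;That trick is easy to verify with Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module - this standalone sketch (not part of sqlite-utils itself) shows that connecting alone leaves a zero-byte file, and running &lt;code&gt;VACUUM&lt;/code&gt; is what writes a valid database to disk:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "empty.db")
conn = sqlite3.connect(path)  # creates a zero-byte file on disk
conn.execute("VACUUM")        # forces SQLite to write a valid empty database
conn.close()

# The file now begins with the standard 16-byte SQLite header
with open(path, "rb") as f:
    print(f.read(16))  # b'SQLite format 3\x00'
&lt;/code&gt;&lt;/pre&gt;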
&lt;p&gt;The test now passes!&lt;/p&gt;
&lt;p&gt;I iterated on this implementation a little bit more, to add the &lt;code&gt;--enable-wal&lt;/code&gt; option I had designed &lt;a href="https://github.com/simonw/sqlite-utils/issues/348#issuecomment-983120066"&gt;in the issue comments&lt;/a&gt; - and updated the test to match. You can see the final implementation in this commit: &lt;a href="https://github.com/simonw/sqlite-utils/commit/1d64cd2e5b402ff957f9be2d9bb490d313c73989"&gt;1d64cd2e5b402ff957f9be2d9bb490d313c73989&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If I add a new test and it passes the first time, I’m always suspicious of it. I’ll deliberately break the test (change a 1 to a 2 for example) and run it again to make sure it fails, then change it back again.&lt;/p&gt;
&lt;h4 id="code-formatting-with-black"&gt;Code formatting with Black&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://github.com/psf/black"&gt;Black&lt;/a&gt; has increased my productivity as a Python developer by a material amount. I used to spend a whole bunch of brain cycles agonizing over how to indent my code, where to break up long function calls and suchlike. Thanks to Black I never think about this at all - I instinctively run &lt;code&gt;black .&lt;/code&gt; in the root of my project and accept whatever style decisions it applies for me.&lt;/p&gt;
&lt;h4 id="linting"&gt;Linting&lt;/h4&gt;
&lt;p&gt;I have a few linters set up to run on every commit. I can run these locally too - how to do that is &lt;a href="https://sqlite-utils.datasette.io/en/stable/contributing.html#linting-and-formatting"&gt;documented here&lt;/a&gt; - but I'm often a bit lazy and leave them to &lt;a href="https://github.com/simonw/sqlite-utils/blob/main/.github/workflows/test.yml"&gt;run in CI&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In this case one of my linters failed! I accidentally called the new command function &lt;code&gt;create_table()&lt;/code&gt; when it should have been called &lt;code&gt;create_database()&lt;/code&gt;. The code worked fine due to how the &lt;code&gt;cli.command(name=...)&lt;/code&gt; decorator works but &lt;code&gt;mypy&lt;/code&gt; &lt;a href="https://github.com/simonw/sqlite-utils/runs/4754944593?check_suite_focus=true"&gt;complained about&lt;/a&gt; the redefined function name. I fixed that in &lt;a href="https://github.com/simonw/sqlite-utils/commit/2f8879235afc6a06a8ae25ded1b2fe289ad8c3a6#diff-76294b3d4afeb27e74e738daa01c26dd4dc9ccb6f4477451483a2ece1095902e"&gt;a separate commit&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="documentation"&gt;Documentation&lt;/h4&gt;
&lt;p&gt;My policy these days is that if a feature isn't documented it doesn't exist. Updating documentation that's already in place isn't much work at all, and over time these incremental improvements add up to something really comprehensive.&lt;/p&gt;
&lt;p&gt;For smaller projects I use a single &lt;code&gt;README.md&lt;/code&gt; which gets displayed on both GitHub and PyPI (and the Datasette website too, for example on &lt;a href="https://datasette.io/tools/git-history"&gt;datasette.io/tools/git-history&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;My larger projects, such as &lt;a href="https://docs.datasette.io/"&gt;Datasette&lt;/a&gt; and &lt;a href="https://sqlite-utils.datasette.io/"&gt;sqlite-utils&lt;/a&gt;, use &lt;a href="https://readthedocs.org/"&gt;Read the Docs&lt;/a&gt; and &lt;a href="https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html"&gt;reStructuredText&lt;/a&gt; with &lt;a href="https://www.sphinx-doc.org/"&gt;Sphinx&lt;/a&gt; instead.&lt;/p&gt;
&lt;p&gt;I like reStructuredText mainly because it has really good support for internal reference links - something that is missing from Markdown, though it can be enabled using &lt;a href="https://myst-parser.readthedocs.io"&gt;MyST&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;sqlite-utils&lt;/code&gt; uses Sphinx. I have the &lt;a href="https://github.com/executablebooks/sphinx-autobuild"&gt;sphinx-autobuild&lt;/a&gt; extension configured, which means I can run a live reloading server with the documentation like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cd docs
make livehtml
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Any time I'm working on the documentation I have that server running, so I can hit "save" in VS Code and see a preview in my browser a few seconds later.&lt;/p&gt;
&lt;p&gt;For Markdown documentation I use the VS Code preview pane directly.&lt;/p&gt;
&lt;p&gt;The moment the documentation is live online, I like to add a link to it in a comment on the issue thread.&lt;/p&gt;
&lt;h4 id="committing-the-change"&gt;Committing the change&lt;/h4&gt;
&lt;p&gt;I run &lt;code&gt;git diff&lt;/code&gt; a LOT while hacking on code, to make sure I haven’t accidentally changed something unrelated. This also helps spot things like rogue &lt;code&gt;print()&lt;/code&gt; debug statements I may have added.&lt;/p&gt;
&lt;p&gt;Before my final commit, I sometimes even run &lt;code&gt;git diff | grep print&lt;/code&gt; to check for those.&lt;/p&gt;
&lt;p&gt;My goal with the commit is to bundle the test, documentation and implementation. If those are the only files I've changed I do this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git commit -a -m "sqlite-utils create-database command, closes #348"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If this completes the work on the issue I use "&lt;code&gt;closes #N&lt;/code&gt;", which causes GitHub to close the issue for me. If it's not yet ready to close I use "&lt;code&gt;refs #N&lt;/code&gt;" instead.&lt;/p&gt;
&lt;p&gt;Sometimes there will be unrelated changes in my working directory. If so, I use &lt;code&gt;git add &amp;lt;files&amp;gt;&lt;/code&gt; and then commit just with &lt;code&gt;git commit -m message&lt;/code&gt;.&lt;/p&gt;
&lt;h4 id="branches-and-pull-requests"&gt;Branches and pull requests&lt;/h4&gt;
&lt;p&gt;&lt;code&gt;create-database&lt;/code&gt; is a good example of a feature that can be implemented in a single commit, with no need to work in a branch.&lt;/p&gt;
&lt;p&gt;For larger features, I'll work in a feature branch:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git checkout -b my-feature
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I'll make a commit (often just labelled "WIP prototype, refs #N") and then push that to GitHub and open a pull request for it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git push -u origin my-feature 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I ensure the new pull request links back to the issue in its description, then switch my ongoing commentary to comments on the pull request itself.&lt;/p&gt;
&lt;p&gt;I'll sometimes add a task checklist to the opening comment on the pull request, since tasks there get reflected in the GitHub UI anywhere that links to the PR. Then I'll check those off as I complete them.&lt;/p&gt;
&lt;p&gt;An example of a PR I used like this is &lt;a href="https://github.com/simonw/sqlite-utils/pull/361"&gt;#361: --lines and --text and --convert and --import&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I don't like merge commits - I much prefer to keep my &lt;code&gt;main&lt;/code&gt; branch history as linear as possible. I usually merge my PRs through the GitHub web interface using the squash feature, which results in a single, clean commit to main with the combined tests, documentation and implementation. Occasionally I will see value in keeping the individual commits, in which case I will rebase merge them.&lt;/p&gt;
&lt;p&gt;Another goal here is to keep the &lt;code&gt;main&lt;/code&gt; branch releasable at all times. Incomplete work should stay in a branch. This makes turning around and releasing quick bug fixes a lot less stressful!&lt;/p&gt;
&lt;h4 id="release-notes-and-a-release"&gt;Release notes, and a release&lt;/h4&gt;
&lt;p&gt;A feature isn't truly finished until it's been released to &lt;a href="https://pypi.org/"&gt;PyPI&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;All of my projects are configured the same way: they use GitHub releases to trigger a GitHub Actions workflow which publishes the new release to PyPI. The &lt;code&gt;sqlite-utils&lt;/code&gt; workflow for that &lt;a href="https://github.com/simonw/sqlite-utils/blob/main/.github/workflows/publish.yml"&gt;is here in publish.yml&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://cookiecutter.readthedocs.io/"&gt;cookiecutter&lt;/a&gt; templates for new projects set up this workflow for me. I just need to create a PyPI token for the project and assign it as a repository secret. See the &lt;a href="https://github.com/simonw/python-lib"&gt;python-lib cookiecutter README&lt;/a&gt; for details.&lt;/p&gt;
&lt;p&gt;To push out a new release, I need to increment the version number in &lt;a href="https://github.com/simonw/sqlite-utils/blob/main/setup.py"&gt;setup.py&lt;/a&gt; and write the release notes.&lt;/p&gt;
&lt;p&gt;I use &lt;a href="https://semver.org/"&gt;semantic versioning&lt;/a&gt; - a new feature is a minor version bump, a breaking change is a major version bump (I try very hard to avoid these) and a bug fix or documentation-only update is a patch increment.&lt;/p&gt;
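&lt;p&gt;Those rules can be sketched as a tiny helper - a hypothetical illustration of the decision, not code from any of my projects:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def bump_version(version, change):
    # Semantic versioning: major for breaking changes, minor for new
    # features, patch for bug fixes and documentation-only releases.
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "major":
        return f"{major + 1}.0.0"
    if change == "minor":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"

print(bump_version("3.20.0", "minor"))  # 3.21.0
&lt;/code&gt;&lt;/pre&gt;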
&lt;p&gt;Since &lt;code&gt;create-database&lt;/code&gt; was a new feature, it went out in &lt;a href="https://github.com/simonw/sqlite-utils/releases/3.21"&gt;release 3.21&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;My projects that use Sphinx for documentation have &lt;a href="https://github.com/simonw/sqlite-utils/blob/main/docs/changelog.rst"&gt;changelog.rst&lt;/a&gt; files in their repositories. I add the release notes there, linking to the relevant issues and cross-referencing the new documentation. Then I ship a commit that bundles the release notes with the bumped version number, with a commit message that looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git commit -m "Release 3.21

Refs #348, #364, #366, #368, #371, #372, #374, #375, #376, #379"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's &lt;a href="https://github.com/simonw/sqlite-utils/commit/7c637b11805adc3d3970076a7ba6afe8e34b371e"&gt;the commit for release 3.21&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Referencing the issue numbers in the release automatically adds a note to their issue threads indicating the release that they went out in.&lt;/p&gt;
&lt;p&gt;I generate that list of issue numbers by pasting the release notes into an Observable notebook I built for the purpose: &lt;a href="https://observablehq.com/@simonw/extract-issue-numbers-from-pasted-text"&gt;Extract issue numbers from pasted text&lt;/a&gt;. Observable is really great for building this kind of tiny interactive utility.&lt;/p&gt;
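&lt;p&gt;The same extraction is nearly a one-liner in Python - a rough sketch of what that notebook does, with a made-up sample input:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import re

def extract_issue_numbers(text):
    # Find every #N reference and return the unique numbers in order
    return sorted({int(n) for n in re.findall(r"#(\d+)", text)})

notes = "sqlite-utils create-database command, closes #348 (see also #372)"
print(extract_issue_numbers(notes))  # [348, 372]
&lt;/code&gt;&lt;/pre&gt;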
&lt;p&gt;For projects that just have a README I write the release notes in Markdown and paste them directly into the GitHub "new release" form.&lt;/p&gt;
&lt;p&gt;I like to duplicate the release notes to GitHub releases for my Sphinx changelog projects too. This is mainly so the &lt;a href="https://datasette.io/"&gt;datasette.io&lt;/a&gt; website will display the release notes on its homepage, which is populated &lt;a href="https://simonwillison.net/2020/Dec/13/datasette-io/"&gt;at build time&lt;/a&gt; using the GitHub GraphQL API.&lt;/p&gt;
&lt;p&gt;To convert my reStructuredText to Markdown I copy and paste the rendered HTML into this brilliant &lt;a href="https://euangoddard.github.io/clipboard2markdown/"&gt;Paste to Markdown&lt;/a&gt; tool by &lt;a href="https://github.com/euangoddard"&gt;Euan Goddard&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="a-live-demo"&gt;A live demo&lt;/h4&gt;
&lt;p&gt;When possible, I like to have a live demo that I can link to.&lt;/p&gt;
&lt;p&gt;This is easiest for features in Datasette core. Datasette’s main branch gets &lt;a href="https://github.com/simonw/datasette/blob/0.60a1/.github/workflows/deploy-latest.yml#L51-L73"&gt;deployed automatically&lt;/a&gt; to &lt;a href="https://latest.datasette.io/"&gt;latest.datasette.io&lt;/a&gt; so I can often link to a demo there.&lt;/p&gt;
&lt;p&gt;For Datasette plugins, I’ll deploy a fresh instance with the plugin (e.g. &lt;a href="https://datasette-graphql-demo.datasette.io/"&gt;this one for datasette-graphql&lt;/a&gt;) or (more commonly) add it to my big &lt;a href="https://latest-with-plugins.datasette.io/"&gt;latest-with-plugins.datasette.io&lt;/a&gt; instance - which tries to demonstrate what happens to Datasette if you install dozens of plugins at once (so far it works OK).&lt;/p&gt;
&lt;p&gt;Here’s a demo of the &lt;a href="https://datasette.io/plugins/datasette-copyable"&gt;datasette-copyable plugin&lt;/a&gt; running there:  &lt;a href="https://latest-with-plugins.datasette.io/github/commits.copyable"&gt;https://latest-with-plugins.datasette.io/github/commits.copyable&lt;/a&gt;&lt;/p&gt;
&lt;h4 id="tell-the-world-about-it"&gt;Tell the world about it&lt;/h4&gt;
&lt;p&gt;The last step is to tell the world (beyond the people who meticulously read the release notes) about the new feature.&lt;/p&gt;
&lt;p&gt;Depending on the size of the feature, I might do this with a tweet &lt;a href="https://twitter.com/simonw/status/1455266746701471746"&gt;like this one&lt;/a&gt; - usually with a screenshot and a link to the documentation. I often extend this into a short Twitter thread, which gives me a chance to link to related concepts and demos or add more screenshots.&lt;/p&gt;
&lt;p&gt;For larger or more interesting features I'll blog about them. I may save this for my weekly &lt;a href="https://simonwillison.net/tags/weeknotes/"&gt;weeknotes&lt;/a&gt;, but sometimes for particularly exciting features I'll write up a dedicated blog entry. Some examples include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2020/Sep/23/sqlite-advanced-alter-table/"&gt;Executing advanced ALTER TABLE operations in SQLite&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2020/Jul/30/fun-binary-data-and-sqlite/"&gt;Fun with binary data and SQLite&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2020/Sep/23/sqlite-utils-extract/"&gt;Refactoring databases with sqlite-utils extract&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2021/Jun/19/sqlite-utils-memory/"&gt;Joining CSV and JSON data with an in-memory SQLite database&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2021/Aug/6/sqlite-utils-convert/"&gt;Apply conversion functions to data in SQLite columns with the sqlite-utils CLI tool&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I may even assemble a full set of &lt;a href="https://simonwillison.net/tags/annotatedreleasenotes/"&gt;annotated release notes&lt;/a&gt; on my blog, where I quote each item from the release in turn and provide some fleshed out examples plus background information on why I built it.&lt;/p&gt;
&lt;p&gt;If it’s a new Datasette (or Datasette-adjacent) feature, I’ll try to remember to write about it in the next edition of the &lt;a href="https://datasette.substack.com/"&gt;Datasette Newsletter&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Finally, if I learned a new trick while building a feature I might extract that into &lt;a href="https://til.simonwillison.net/"&gt;a TIL&lt;/a&gt;. If I do that I'll link to the new TIL from the issue thread.&lt;/p&gt;
&lt;h4 id="more-examples-of-this-pattern"&gt;More examples of this pattern&lt;/h4&gt;
&lt;p&gt;Here are a bunch of examples of commits that implement this pattern, combining the tests, implementation and documentation into a single unit:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;sqlite-utils: &lt;a href="https://github.com/simonw/sqlite-utils/commit/324ebc31308752004fe5f7e4941fc83706c5539c"&gt;adding --limit and --offset to sqlite-utils rows&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;sqlite-utils: &lt;a href="https://github.com/simonw/sqlite-utils/commit/d83b2568131f2b1cc01228419bb08c96d843d65d"&gt;--where and -p options for sqlite-utils convert&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;s3-credentials: &lt;a href="https://github.com/simonw/s3-credentials/commit/905258379817e8b458528e4ccc5e6cc2c8cf4352"&gt;s3-credentials policy command&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;datasette: &lt;a href="https://github.com/simonw/datasette/commit/5cadc244895fc47e0534c6e90df976d34293921e"&gt;db.execute_write_script() and db.execute_write_many()&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;datasette: &lt;a href="https://github.com/simonw/datasette/commit/992496f2611a72bd51e94bfd0b17c1d84e732487"&gt;?_nosuggest=1 parameter for table views&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;datasette-graphql: &lt;a href="https://github.com/simonw/datasette-graphql/commit/2d8c042e93e3429c5b187121d26f8817997073dd"&gt;GraphQL execution limits: time_limit_ms and num_queries_limit&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/software-engineering"&gt;software-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/testing"&gt;testing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pytest"&gt;pytest&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/black"&gt;black&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/read-the-docs"&gt;read-the-docs&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-issues"&gt;github-issues&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="git"/><category term="github"/><category term="software-engineering"/><category term="testing"/><category term="pytest"/><category term="black"/><category term="read-the-docs"/><category term="github-issues"/></entry><entry><title>git-history: a tool for analyzing scraped data collected using Git and SQLite</title><link href="https://simonwillison.net/2021/Dec/7/git-history/#atom-tag" rel="alternate"/><published>2021-12-07T22:32:55+00:00</published><updated>2021-12-07T22:32:55+00:00</updated><id>https://simonwillison.net/2021/Dec/7/git-history/#atom-tag</id><summary type="html">
    &lt;p&gt;I described &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt; last year: a technique for writing scrapers where you periodically snapshot a source of data to a Git repository in order to record changes to that source over time.&lt;/p&gt;
&lt;p&gt;The open challenge was how to analyze that data once it was collected. &lt;a href="https://datasette.io/tools/git-history"&gt;git-history&lt;/a&gt; is my new tool designed to tackle that problem.&lt;/p&gt;
&lt;h4&gt;Git scraping, a refresher&lt;/h4&gt;
&lt;p&gt;A neat thing about scraping to a Git repository is that the scrapers themselves can be really simple. I demonstrated how to run scrapers for free using GitHub Actions in this &lt;a href="https://simonwillison.net/2021/Mar/5/git-scraping/"&gt;five minute lightning talk&lt;/a&gt; back in March.&lt;/p&gt;
&lt;p&gt;Here's a concrete example: California's state fire department, Cal Fire, maintains an incident map at &lt;a href="https://www.fire.ca.gov/incidents/"&gt;fire.ca.gov/incidents&lt;/a&gt; showing the status of current large fires in the state.&lt;/p&gt;
&lt;p&gt;I found the underlying data here:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then I built &lt;a href="https://github.com/simonw/ca-fires-history/blob/main/.github/workflows/scrape.yml"&gt;a simple scraper&lt;/a&gt; that grabs a copy of that every 20 minutes and commits it to Git. I've been running that for 14 months now, and it's collected &lt;a href="https://github.com/simonw/ca-fires-history"&gt;1,559 commits&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;The thing that excites me most about Git scraping is that it can create truly unique datasets. It's common for organizations not to keep detailed archives of what changed and where, so by scraping their data into a Git repository you can often end up with a more detailed history than they maintain themselves.&lt;/p&gt;
&lt;p&gt;There's one big challenge though: having collected that data, how can you best analyze it? Reading through thousands of commit differences and eyeballing changes to JSON or CSV files isn't a great way of finding the interesting stories that have been captured.&lt;/p&gt;
&lt;h4&gt;git-history&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://datasette.io/tools/git-history"&gt;git-history&lt;/a&gt; is the new CLI tool I've built to answer that question. It reads through the entire history of a file and generates a SQLite database reflecting changes to that file over time. You can then use &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; to explore the resulting data.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://git-history-demos.datasette.io/ca-fires"&gt;an example database&lt;/a&gt; created by running the tool against my &lt;code&gt;ca-fires-history&lt;/code&gt; repository. I created the SQLite database by running this in the repository directory:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;git-history file ca-fires.db incidents.json \
  --namespace incident \
  --id UniqueId \
  --convert &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;json.loads(content)["Incidents"]&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/ca-fires-progress.gif" alt="Animated gif showing the progress bar" style="max-width:100%; border-top: 5px solid black;" /&gt;&lt;/p&gt;
&lt;p&gt;In this example we are processing the history of a single file called &lt;code&gt;incidents.json&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;We use the &lt;code&gt;UniqueId&lt;/code&gt; column to distinguish records that have changed over time from records that are newly created.&lt;/p&gt;
&lt;p&gt;Specifying &lt;code&gt;--namespace incident&lt;/code&gt; causes the created database tables to be called &lt;code&gt;incident&lt;/code&gt; and &lt;code&gt;incident_version&lt;/code&gt; rather than the default of &lt;code&gt;item&lt;/code&gt; and &lt;code&gt;item_version&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;And we have a fragment of Python code that knows how to turn each version stored in that commit history into a list of objects compatible with the tool - see &lt;a href="https://github.com/simonw/git-history/blob/0.6/README.md#custom-conversions-using---convert"&gt;--convert in the documentation&lt;/a&gt; for details.&lt;/p&gt;
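&lt;p&gt;As an illustrative sketch (this mirrors the behaviour described here, not git-history's actual internals), the &lt;code&gt;--convert&lt;/code&gt; expression acts like the body of a function that receives the tracked file's content at each commit and returns a list of records:&lt;/p&gt;

```python
import json

# Sketch of how the --convert expression is applied (illustrative only,
# not git-history's internal code): `content` is bound to the tracked
# file's text at a given commit, and the expression must produce a list
# of dicts that each expose the --id column.
def apply_convert(content):
    # equivalent to: --convert 'json.loads(content)["Incidents"]'
    return json.loads(content)["Incidents"]

snapshot = '{"Incidents": [{"UniqueId": "abc", "Name": "Dixie Fire"}]}'
records = apply_convert(snapshot)
# records -> [{"UniqueId": "abc", "Name": "Dixie Fire"}]
```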
&lt;p&gt;Let's use the database to answer some questions about fires in California over the past 14 months.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;incident&lt;/code&gt; table contains a copy of the latest record for every incident. We can use that to see &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident"&gt;a map of every fire&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/ca-fires-map.png" alt="A map showing 250 fires in California" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;This uses the &lt;a href="https://datasette.io/plugins/datasette-cluster-map"&gt;datasette-cluster-map&lt;/a&gt; plugin, which draws a map of every row with a valid latitude and longitude column.&lt;/p&gt;
&lt;p&gt;Where things get interesting is the &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version"&gt;incident_version&lt;/a&gt; table. This is where changes between different scraped versions of each item are recorded.&lt;/p&gt;
&lt;p&gt;Those 250 fires have 2,060 recorded versions. If we &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item"&gt;facet by _item&lt;/a&gt; we can see which fires had the most versions recorded. Here are the top ten:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=174"&gt;Dixie Fire&lt;/a&gt; 268&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=209"&gt;Caldor Fire&lt;/a&gt; 153&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=197"&gt;Monument Fire&lt;/a&gt; 65&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=1"&gt;August Complex (includes Doe Fire)&lt;/a&gt; 64&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=2"&gt;Creek Fire&lt;/a&gt; 56&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=213"&gt;French Fire&lt;/a&gt; 53&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=32"&gt;Silverado Fire&lt;/a&gt; 52&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=240"&gt;Fawn Fire&lt;/a&gt; 45&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=34"&gt;Blue Ridge Fire&lt;/a&gt; 39&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=190"&gt;McFarland Fire&lt;/a&gt; 34&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This looks about right - the larger the number of versions the longer the fire must have been burning. The Dixie Fire &lt;a href="https://en.wikipedia.org/wiki/Dixie_Fire"&gt;has its own Wikipedia page&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;Clicking through to &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=174"&gt;the Dixie Fire&lt;/a&gt; lands us on a page showing every "version" that we captured, ordered by version number.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;git-history&lt;/code&gt; only writes values to this table that have changed since the previous version. This means you can glance at the table grid and get a feel for which pieces of information were updated over time:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/ca-fires-incident-versions.png" alt="The table showing changes to that fire over time" style="max-width:100%;" /&gt;&lt;/p&gt;
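&lt;p&gt;The core idea can be sketched in a few lines of Python (this mirrors the behaviour described above, not git-history's own implementation): given the previous and current snapshots of a record, keep only the columns whose values differ.&lt;/p&gt;

```python
# Keep only columns whose value differs from the previous version -
# a sketch of the behaviour described above, not git-history's own code.
def changed_columns(previous, current):
    return {
        key: value
        for key, value in current.items()
        if previous.get(key) != value
    }

v1 = {"Name": "Demo Fire", "AcresBurned": 100, "PercentContained": 0}
v2 = {"Name": "Demo Fire", "AcresBurned": 250, "PercentContained": 5}
changed = changed_columns(v1, v2)
# changed -> {"AcresBurned": 250, "PercentContained": 5}
```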
&lt;p&gt;The &lt;code&gt;ConditionStatement&lt;/code&gt; is a text description that changes frequently, but the other two interesting columns look to be &lt;code&gt;AcresBurned&lt;/code&gt; and &lt;code&gt;PercentContained&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;That &lt;code&gt;_commit&lt;/code&gt; column is a foreign key to the &lt;a href="https://git-history-demos.datasette.io/ca-fires/commits"&gt;commits&lt;/a&gt; table, which records the commits that have been processed by the tool - mainly so that when you run it a second time it can pick up where it finished last time.&lt;/p&gt;
&lt;p&gt;We can join against &lt;code&gt;commits&lt;/code&gt; to see the date that each version was created. Or we can use the &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version_detail"&gt;incident_version_detail&lt;/a&gt; view which performs that join for us.&lt;/p&gt;
&lt;p&gt;Using that view, we can filter for just rows where &lt;code&gt;_item&lt;/code&gt; is 174 and &lt;code&gt;AcresBurned&lt;/code&gt; is not blank, then use the &lt;a href="https://datasette.io/plugins/datasette-vega"&gt;datasette-vega&lt;/a&gt; plugin to visualize the &lt;code&gt;_commit_at&lt;/code&gt; date column against the &lt;code&gt;AcresBurned&lt;/code&gt; numeric column... and we get a graph of &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version_detail?_item__exact=174&amp;amp;AcresBurned__notblank=1#g.mark=line&amp;amp;g.x_column=_commit_at&amp;amp;g.x_type=temporal&amp;amp;g.y_column=AcresBurned&amp;amp;g.y_type=quantitative"&gt;the growth of the Dixie Fire over time&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/ca-fires-chart.png" alt="The chart plugin showing a line chart" style="max-width:100%;" /&gt;&lt;/p&gt;
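&lt;p&gt;Under the hood that view is a plain SQL join. Here's a minimal sketch using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module, with a cut-down version of the schema shown later in this post - the rows are invented for illustration:&lt;/p&gt;

```python
import sqlite3

# A cut-down sketch of the join performed by the incident_version_detail
# view: attach each version row to the date of the commit that produced
# it. Schema is simplified; the data here is made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE commits (id INTEGER PRIMARY KEY, hash TEXT, commit_at TEXT);
CREATE TABLE incident_version (
   _id INTEGER PRIMARY KEY,
   _item INTEGER,
   _version INTEGER,
   _commit INTEGER REFERENCES commits(id),
   AcresBurned INTEGER
);
INSERT INTO commits VALUES (1, 'abc123', '2021-07-14'), (2, 'def456', '2021-07-15');
INSERT INTO incident_version VALUES (1, 174, 1, 1, 100), (2, 174, 2, 2, 250);
""")
rows = conn.execute("""
    SELECT commits.commit_at AS _commit_at, incident_version.AcresBurned
    FROM incident_version
    JOIN commits ON incident_version._commit = commits.id
    WHERE incident_version._item = 174
      AND incident_version.AcresBurned IS NOT NULL
    ORDER BY commits.commit_at
""").fetchall()
# rows -> [('2021-07-14', 100), ('2021-07-15', 250)]
```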
&lt;p&gt;To review: we started out with a GitHub Actions scheduled workflow grabbing a copy of a JSON API endpoint every 20 minutes. Thanks to &lt;code&gt;git-history&lt;/code&gt;, Datasette and &lt;code&gt;datasette-vega&lt;/code&gt; we now have a chart showing the growth of the longest-lived California wildfire of the last 14 months over time.&lt;/p&gt;
&lt;h4&gt;A note on schema design&lt;/h4&gt;
&lt;p&gt;One of the hardest problems in designing &lt;code&gt;git-history&lt;/code&gt; was deciding on an appropriate schema for storing version changes over time.&lt;/p&gt;
&lt;p&gt;I ended up with the following (edited for clarity):&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;CREATE TABLE [commits] (
   [id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [hash] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [commit_at] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
);
CREATE TABLE [item] (
   [_id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [_item_id] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [IncidentID] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [Location] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [Type] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [_commit] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;
);
CREATE TABLE [item_version] (
   [_id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [_item] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [item]([_id]),
   [_version] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;,
   [_commit] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [commits]([id]),
   [IncidentID] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [Location] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [Type] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
);
CREATE TABLE [columns] (
   [id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [namespace] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [namespaces]([id]),
   [name] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
);
CREATE TABLE [item_changed] (
   [item_version] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [item_version]([_id]),
   [column] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [columns]([id]),
   &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt; ([item_version], [column])
);&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;As shown earlier, records in the &lt;code&gt;item_version&lt;/code&gt; table represent snapshots over time - but to save on database space and provide a neater interface for browsing versions, they only record columns that had changed since their previous version. Any unchanged columns are stored as &lt;code&gt;null&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;There's one catch with this schema: what do we do if a new version of an item sets one of the columns to &lt;code&gt;null&lt;/code&gt;? How can we tell the difference between that and a column that didn't change?&lt;/p&gt;
&lt;p&gt;I ended up solving that with an &lt;code&gt;item_changed&lt;/code&gt; many-to-many table, which uses pairs of integers (hopefully taking up as little space as possible) to record exactly which columns were modified in which &lt;code&gt;item_version&lt;/code&gt; records.&lt;/p&gt;
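&lt;p&gt;Here's a sketch of that disambiguation logic (illustrative, not git-history's own code): a &lt;code&gt;null&lt;/code&gt; in the version row means unchanged, unless a matching pair in the change log says the column really was modified:&lt;/p&gt;

```python
# Sketch of the disambiguation: store None for unchanged columns, and
# log a (version, column) pair for every column that really changed -
# including changes *to* None. Illustrative only, not git-history's code.
def record_version(version_id, previous, current, changed_log):
    row = {}
    for column, value in current.items():
        if previous.get(column) != value:
            row[column] = value                      # new value, may be None
            changed_log.append((version_id, column))
        else:
            row[column] = None                       # None here means "unchanged"
    return row

log = []
row = record_version(
    2,
    {"Engines": 5, "Dozers": 3},       # previous version
    {"Engines": None, "Dozers": 3},    # new version sets Engines to null
    log,
)
# row -> {"Engines": None, "Dozers": None}
# log -> [(2, "Engines")]: only Engines actually changed, to a real null
```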
&lt;p&gt;The &lt;code&gt;item_version_detail&lt;/code&gt; view displays columns from that many-to-many table as JSON - here's &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version_detail?_version__gt=1&amp;amp;_col=_changed_columns&amp;amp;_col=_item&amp;amp;_col=_version"&gt;a filtered example&lt;/a&gt; showing which columns were changed in which versions of which items:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/ca-fires-changed-columns.png" alt="This table shows a JSON list of column names against items and versions" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://git-history-demos.datasette.io/ca-fires?sql=select+columns.name%2C+count%28*%29%0D%0Afrom+incident_changed%0D%0A++join+incident_version+on+incident_changed.item_version+%3D+incident_version._id%0D%0A++join+columns+on+incident_changed.column+%3D+columns.id%0D%0Awhere+incident_version._version+%3E+1%0D%0Agroup+by+columns.name%0D%0Aorder+by+count%28*%29+desc"&gt;a SQL query&lt;/a&gt; that shows, for &lt;code&gt;ca-fires&lt;/code&gt;, which columns were updated most often:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;select&lt;/span&gt; &lt;span class="pl-c1"&gt;columns&lt;/span&gt;.&lt;span class="pl-c1"&gt;name&lt;/span&gt;, &lt;span class="pl-c1"&gt;count&lt;/span&gt;(&lt;span class="pl-k"&gt;*&lt;/span&gt;)
&lt;span class="pl-k"&gt;from&lt;/span&gt; incident_changed
  &lt;span class="pl-k"&gt;join&lt;/span&gt; incident_version &lt;span class="pl-k"&gt;on&lt;/span&gt; &lt;span class="pl-c1"&gt;incident_changed&lt;/span&gt;.&lt;span class="pl-c1"&gt;item_version&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;incident_version&lt;/span&gt;.&lt;span class="pl-c1"&gt;_id&lt;/span&gt;
  &lt;span class="pl-k"&gt;join&lt;/span&gt; columns &lt;span class="pl-k"&gt;on&lt;/span&gt; &lt;span class="pl-c1"&gt;incident_changed&lt;/span&gt;.&lt;span class="pl-c1"&gt;column&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;columns&lt;/span&gt;.&lt;span class="pl-c1"&gt;id&lt;/span&gt;
&lt;span class="pl-k"&gt;where&lt;/span&gt; &lt;span class="pl-c1"&gt;incident_version&lt;/span&gt;.&lt;span class="pl-c1"&gt;_version&lt;/span&gt; &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;
&lt;span class="pl-k"&gt;group by&lt;/span&gt; &lt;span class="pl-c1"&gt;columns&lt;/span&gt;.&lt;span class="pl-c1"&gt;name&lt;/span&gt;
&lt;span class="pl-k"&gt;order by&lt;/span&gt; &lt;span class="pl-c1"&gt;count&lt;/span&gt;(&lt;span class="pl-k"&gt;*&lt;/span&gt;) &lt;span class="pl-k"&gt;desc&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;Updated: 1785&lt;/li&gt;
&lt;li&gt;PercentContained: 740&lt;/li&gt;
&lt;li&gt;ConditionStatement: 734&lt;/li&gt;
&lt;li&gt;AcresBurned: 616&lt;/li&gt;
&lt;li&gt;Started: 327&lt;/li&gt;
&lt;li&gt;PersonnelInvolved: 286&lt;/li&gt;
&lt;li&gt;Engines: 274&lt;/li&gt;
&lt;li&gt;CrewsInvolved: 256&lt;/li&gt;
&lt;li&gt;WaterTenders: 225&lt;/li&gt;
&lt;li&gt;Dozers: 211&lt;/li&gt;
&lt;li&gt;AirTankers: 181&lt;/li&gt;
&lt;li&gt;StructuresDestroyed: 125&lt;/li&gt;
&lt;li&gt;Helicopters: 122&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Helicopters are exciting! Let's find all of the fires which had at least one record where the number of helicopters changed (after the first version). We'll use a nested SQL query:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;select&lt;/span&gt; &lt;span class="pl-k"&gt;*&lt;/span&gt; &lt;span class="pl-k"&gt;from&lt;/span&gt; incident
&lt;span class="pl-k"&gt;where&lt;/span&gt; _id &lt;span class="pl-k"&gt;in&lt;/span&gt; (
  &lt;span class="pl-k"&gt;select&lt;/span&gt; _item &lt;span class="pl-k"&gt;from&lt;/span&gt; incident_version
  &lt;span class="pl-k"&gt;where&lt;/span&gt; _id &lt;span class="pl-k"&gt;in&lt;/span&gt; (
    &lt;span class="pl-k"&gt;select&lt;/span&gt; item_version &lt;span class="pl-k"&gt;from&lt;/span&gt; incident_changed &lt;span class="pl-k"&gt;where&lt;/span&gt; column &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;15&lt;/span&gt;
  )
  &lt;span class="pl-k"&gt;and&lt;/span&gt; _version &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;
)&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;That returned 19 fires that were significant enough to involve helicopters - &lt;a href="https://git-history-demos.datasette.io/ca-fires?sql=select+*+from+incident%0D%0Awhere+_id+in+%28%0D%0A++select+_item+from+incident_version%0D%0A++where+_id+in+%28%0D%0A++++select+item_version+from+incident_changed+where+column+%3D+15%0D%0A++%29%0D%0A++and+_version+%3E+1%0D%0A%29"&gt;here they are on a map&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/ca-fire-helicopter-map.png" alt="A map of 19 fires that involved helicopters" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;h4&gt;Advanced usage of --convert&lt;/h4&gt;
&lt;p&gt;Drew Breunig has been running a Git scraper for the past 8 months in &lt;a href="https://github.com/dbreunig/511-events-history"&gt;dbreunig/511-events-history&lt;/a&gt; against &lt;a href="https://511.org/"&gt;511.org&lt;/a&gt;, a site showing traffic incidents in the San Francisco Bay Area. I loaded his data into this example &lt;a href="https://git-history-demos.datasette.io/sf-bay-511"&gt;sf-bay-511 database&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;sf-bay-511&lt;/code&gt; example is useful for digging more into the &lt;code&gt;--convert&lt;/code&gt; option to &lt;code&gt;git-history&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;git-history&lt;/code&gt; requires recorded data to be in a specific shape: it needs a JSON list of JSON objects, where each object has a column that can be treated as a unique ID for purposes of tracking changes to that specific record over time.&lt;/p&gt;
&lt;p&gt;The ideal tracked JSON file would look something like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;[
  {
    &lt;span class="pl-ent"&gt;"IncidentID"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;abc123&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"Location"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Corner of 4th and Vermont&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"Type"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;fire&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
  },
  {
    &lt;span class="pl-ent"&gt;"IncidentID"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;cde448&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"Location"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;555 West Example Drive&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"Type"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;medical&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
  }
]&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It's common for data that has been scraped to not fit this ideal shape.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;511.org&lt;/code&gt; JSON feed &lt;a href="https://backend-prod.511.org/api-proxy/api/v1/traffic/events/?extended=true"&gt;can be found here&lt;/a&gt; - it's a pretty complicated nested set of objects, and there's a bunch of data in there that's quite noisy without adding much to the overall analysis - things like an &lt;code&gt;updated&lt;/code&gt; timestamp field that changes in every version even if there are no changes, or a deeply nested &lt;code&gt;"extension"&lt;/code&gt; object full of duplicate data.&lt;/p&gt;
&lt;p&gt;I wrote a snippet of Python to transform each of those recorded snapshots into a simpler structure, and then passed that Python code to the &lt;code&gt;--convert&lt;/code&gt; option to the script:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
git-history file sf-bay-511.db 511-events-history/events.json \
  --repo 511-events-history \
  --id id \
  --convert '
data = json.loads(content)
if data.get("error"):
    # {"code": 500, "error": "Error accessing remote data..."}
    return
for event in data["Events"]:
    event["id"] = event["extension"]["event-reference"]["event-identifier"]
    # Remove noisy updated timestamp
    del event["updated"]
    # Drop extension block entirely
    del event["extension"]
    # "schedule" block is noisy but not interesting
    del event["schedule"]
    # Flatten nested subtypes
    event["event_subtypes"] = event["event_subtypes"]["event_subtype"]
    if not isinstance(event["event_subtypes"], list):
        event["event_subtypes"] = [event["event_subtypes"]]
    yield event
'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The single-quoted string passed to &lt;code&gt;--convert&lt;/code&gt; is compiled into a Python function and run against each Git version in turn. My code loops through the nested &lt;code&gt;Events&lt;/code&gt; list, modifying each record and then outputting them as an iterable sequence using &lt;code&gt;yield&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;A few of the records in the history were server 500 errors, so the code block knows how to identify and skip those as well.&lt;/p&gt;
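&lt;p&gt;The compilation step can be sketched like this - just the general shape of the pattern, not git-history's actual implementation: indent the user's code into a function body, &lt;code&gt;exec&lt;/code&gt; it, and call the resulting generator function once per version:&lt;/p&gt;

```python
import json
import textwrap

# Sketch of the "compile a string into a function" pattern described
# above - the general shape only, not git-history's actual implementation.
def compile_convert(code):
    body = textwrap.indent(textwrap.dedent(code), "    ")
    source = "def convert(content):\n" + body
    namespace = {"json": json}
    exec(source, namespace)
    return namespace["convert"]

convert = compile_convert("""
data = json.loads(content)
for event in data["Events"]:
    yield {"id": event["id"]}
""")
events = list(convert('{"Events": [{"id": "a1"}, {"id": "b2"}]}'))
# events -> [{'id': 'a1'}, {'id': 'b2'}]
```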
&lt;p&gt;When working with &lt;code&gt;git-history&lt;/code&gt; I find myself spending most of my time iterating on these conversion scripts. Passing strings of Python code to tools like this is a pretty fun pattern - I also used it &lt;a href="https://simonwillison.net/2021/Aug/6/sqlite-utils-convert/"&gt;for sqlite-utils convert&lt;/a&gt; earlier this year.&lt;/p&gt;
&lt;h4&gt;Trying this out yourself&lt;/h4&gt;
&lt;p&gt;If you want to try this out for yourself the &lt;code&gt;git-history&lt;/code&gt; tool has &lt;a href="https://github.com/simonw/git-history/blob/main/README.md"&gt;an extensive README&lt;/a&gt; describing the other options, and the scripts used to create these demos can be found in the &lt;a href="https://github.com/simonw/git-history/tree/main/demos"&gt;demos folder&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/topics/git-scraping"&gt;git-scraping topic&lt;/a&gt; on GitHub now has over 200 repos built by dozens of different people - that's a lot of interesting scraped data sat there waiting to be explored!&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-history"&gt;git-history&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="cli"/><category term="data-journalism"/><category term="git"/><category term="projects"/><category term="scraping"/><category term="sqlite"/><category term="datasette"/><category term="git-history"/></entry><entry><title>Commits are snapshots, not diffs</title><link href="https://simonwillison.net/2020/Dec/17/commits-are-snapshots-not-diffs/#atom-tag" rel="alternate"/><published>2020-12-17T22:01:39+00:00</published><updated>2020-12-17T22:01:39+00:00</updated><id>https://simonwillison.net/2020/Dec/17/commits-are-snapshots-not-diffs/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.blog/2020-12-17-commits-are-snapshots-not-diffs/"&gt;Commits are snapshots, not diffs&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Useful, clearly explained revision of some Git fundamentals.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=25458230"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;&lt;/p&gt;



</summary><category term="git"/><category term="github"/></entry><entry><title>nyt-2020-election-scraper</title><link href="https://simonwillison.net/2020/Nov/6/nyt-2020-election-scraper/#atom-tag" rel="alternate"/><published>2020-11-06T14:24:36+00:00</published><updated>2020-11-06T14:24:36+00:00</updated><id>https://simonwillison.net/2020/Nov/6/nyt-2020-election-scraper/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/alex/nyt-2020-election-scraper"&gt;nyt-2020-election-scraper&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Brilliant application of git scraping by Alex Gaynor and a growing team of contributors. Takes a JSON snapshot of the NYT’s latest election poll figures every five minutes, then runs a Python script to iterate through the history and build an HTML page showing the trends, including what percentage of the remaining votes each candidate needs to win each state. This is the perfect case study in why it can be useful to take a “snapshot of the world right now” data source and turn it into a git revision history over time.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/alex-gaynor"&gt;alex-gaynor&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/elections"&gt;elections&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/new-york-times"&gt;new-york-times&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;&lt;/p&gt;



</summary><category term="alex-gaynor"/><category term="data-journalism"/><category term="elections"/><category term="git"/><category term="new-york-times"/><category term="git-scraping"/></entry><entry><title>Git scraping: track changes over time by scraping to a Git repository</title><link href="https://simonwillison.net/2020/Oct/9/git-scraping/#atom-tag" rel="alternate"/><published>2020-10-09T18:27:23+00:00</published><updated>2020-10-09T18:27:23+00:00</updated><id>https://simonwillison.net/2020/Oct/9/git-scraping/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;Git scraping&lt;/strong&gt; is the name I've given a scraping technique that I've been experimenting with for a few years now. It's really effective, and more people should use it.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update 5th March 2021:&lt;/strong&gt; I presented a version of this post as &lt;a href="https://simonwillison.net/2021/Mar/5/git-scraping/"&gt;a five minute lightning talk at NICAR 2021&lt;/a&gt;, which includes a live coding demo of building a new git scraper.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update 7th December 2021:&lt;/strong&gt; I released a tool called &lt;a href="https://simonwillison.net/2021/Dec/7/git-history/"&gt;git-history&lt;/a&gt; that helps analyze data that has been collected using this technique.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The internet is full of interesting data that changes over time. These changes can sometimes be more interesting than the underlying static data - The &lt;a href="https://twitter.com/nyt_diff"&gt;@nyt_diff Twitter account&lt;/a&gt; tracks changes made to New York Times headlines for example, which offers a fascinating insight into that publication's editorial process.&lt;/p&gt;
&lt;p&gt;We already have a great tool for efficiently tracking changes to text over time: &lt;strong&gt;Git&lt;/strong&gt;. And &lt;a href="https://github.com/features/actions"&gt;GitHub Actions&lt;/a&gt; (and other CI systems) make it easy to create a scraper that runs every few minutes, records the current state of a resource and records changes to that resource over time in the commit history.&lt;/p&gt;
&lt;p&gt;Here's a recent example. Fires continue to rage in California, and the &lt;a href="https://www.fire.ca.gov/"&gt;CAL FIRE website&lt;/a&gt; offers an &lt;a href="https://www.fire.ca.gov/incidents/"&gt;incident map&lt;/a&gt; showing the latest fire activity around the state.&lt;/p&gt;
&lt;p&gt;Firing up the Firefox Network pane, filtering to requests triggered by XHR and sorting by size, largest first reveals this endpoint:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents"&gt;https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;That's a 241KB JSON endpoint with full details of the various fires around the state.&lt;/p&gt;
&lt;p&gt;So... I started running a git scraper against it. My scraper lives in the &lt;a href="https://github.com/simonw/ca-fires-history"&gt;simonw/ca-fires-history&lt;/a&gt; repository on GitHub.&lt;/p&gt;
&lt;p&gt;Every 20 minutes it grabs the latest copy of that JSON endpoint, pretty-prints it (for diff readability) using &lt;code&gt;jq&lt;/code&gt; and commits it back to the repo if it has changed.&lt;/p&gt;
&lt;p&gt;This means I now have a &lt;a href="https://github.com/simonw/ca-fires-history/commits/main"&gt;commit log&lt;/a&gt; of changes to that information about fires in California. Here's an &lt;a href="https://github.com/simonw/ca-fires-history/commit/7b0f42d4bf198885ab2b41a22a8da47157572d18"&gt;example commit&lt;/a&gt; showing that last night the Zogg Fires percentage contained increased from 90% to 92%, the number of personnel involved dropped from 968 to 798 and the number of engines responding dropped from 82 to 59.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2020/git-scraping.png" alt="Screenshot of a diff against the Zogg Fires, showing personnel involved dropping from 968 to 798, engines dropping 82 to 59, water tenders dropping 31 to 27 and percent contained increasing from 90 to 92." style="max-width: 100%" /&gt;&lt;/p&gt;
&lt;p&gt;The implementation of the scraper is entirely contained in a single GitHub Actions workflow. It's in a file called &lt;a href="https://github.com/simonw/ca-fires-history/blob/main/.github/workflows/scrape.yml"&gt;.github/workflows/scrape.yml&lt;/a&gt; which looks like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Scrape latest data&lt;/span&gt;

&lt;span class="pl-ent"&gt;on&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;push&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;workflow_dispatch&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;schedule&lt;/span&gt;:
    - &lt;span class="pl-ent"&gt;cron&lt;/span&gt;:  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;6,26,46 * * * *&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

&lt;span class="pl-ent"&gt;jobs&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;scheduled&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;runs-on&lt;/span&gt;: &lt;span class="pl-s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="pl-ent"&gt;steps&lt;/span&gt;:
    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Check out this repo&lt;/span&gt;
      &lt;span class="pl-ent"&gt;uses&lt;/span&gt;: &lt;span class="pl-s"&gt;actions/checkout@v2&lt;/span&gt;
    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Fetch latest data&lt;/span&gt;
      &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;        curl https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents | jq . &amp;gt; incidents.json&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Commit and push if it changed&lt;/span&gt;
      &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;        git config user.name "Automated"&lt;/span&gt;
&lt;span class="pl-s"&gt;        git config user.email "actions@users.noreply.github.com"&lt;/span&gt;
&lt;span class="pl-s"&gt;        git add -A&lt;/span&gt;
&lt;span class="pl-s"&gt;        timestamp=$(date -u)&lt;/span&gt;
&lt;span class="pl-s"&gt;        git commit -m "Latest data: ${timestamp}" || exit 0&lt;/span&gt;
&lt;span class="pl-s"&gt;        git push&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;That's not a lot of code!&lt;/p&gt;
&lt;p&gt;It runs on a schedule at 6, 26 and 46 minutes past the hour - I like to offset my cron times like this since I assume that the majority of crons run exactly on the hour, so running not-on-the-hour feels polite.&lt;/p&gt;
&lt;p&gt;The scraper itself works by fetching the JSON using &lt;code&gt;curl&lt;/code&gt;, piping it through &lt;code&gt;jq .&lt;/code&gt; to pretty-print it and saving the result to &lt;code&gt;incidents.json&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The "commit and push if it changed" block uses a pattern that commits and pushes only if the file has changed. I wrote about this pattern in &lt;a href="https://til.simonwillison.net/til/til/github-actions_commit-if-file-changed.md"&gt;this TIL&lt;/a&gt; a few months ago.&lt;/p&gt;
&lt;p&gt;I have a whole bunch of repositories running git scrapers now. I've been labeling them with the &lt;a href="https://github.com/topics/git-scraping"&gt;git-scraping topic&lt;/a&gt; so they show up in one place on GitHub (other people have started using that topic as well).&lt;/p&gt;
&lt;p&gt;I've written about some of these &lt;a href="https://simonwillison.net/tags/gitscraping/"&gt;in the past&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2017/Sep/10/scraping-irma/"&gt;Scraping hurricane Irma&lt;/a&gt; back in September 2017 is when I first came up with the idea to use a Git repository in this way.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2017/Oct/10/fires-in-the-north-bay/"&gt;Changelogs to help understand the fires in the North Bay&lt;/a&gt; from October 2017 describes an early attempt at scraping fire-related information.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2019/Mar/13/tree-history/"&gt;Generating a commit log for San Francisco’s official list of trees&lt;/a&gt; remains my favourite application of this technique. The City of San Francisco maintains a frequently updated CSV file of 190,000 trees in the city, and I have &lt;a href="https://github.com/simonw/sf-tree-history/find/master"&gt;a commit log&lt;/a&gt; of changes to it stretching back over more than a year. This example uses my &lt;a href="https://github.com/simonw/csv-diff"&gt;csv-diff&lt;/a&gt; utility to generate human-readable commit messages.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2019/Oct/10/pge-outages/"&gt;Tracking PG&amp;amp;E outages by scraping to a git repo&lt;/a&gt; documents my attempts to track the impact of PG&amp;amp;E's outages last year by scraping their outage map. I used the GitPython library to turn the values recorded in the commit history into a database that let me run visualizations of changes over time.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2020/Jan/21/github-actions-cloud-run/"&gt;Tracking FARA by deploying a data API using GitHub Actions and Cloud Run&lt;/a&gt; shows how I track new registrations for the US Foreign Agents Registration Act (FARA) in a repository and deploy the latest version of the data using Datasette.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I hope that by giving this technique a name I can encourage more people to add it to their toolbox. It's an extremely effective way of turning all sorts of interesting data sources into a changelog over time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://news.ycombinator.com/item?id=24732943"&gt;Comment thread&lt;/a&gt; on this post over on Hacker News.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="git"/><category term="github"/><category term="projects"/><category term="scraping"/><category term="github-actions"/><category term="git-scraping"/></entry><entry><title>Weeknotes: Rocky Beaches, Datasette 0.48, a commit history of my database</title><link href="https://simonwillison.net/2020/Aug/21/weeknotes-rocky-beaches/#atom-tag" rel="alternate"/><published>2020-08-21T00:52:16+00:00</published><updated>2020-08-21T00:52:16+00:00</updated><id>https://simonwillison.net/2020/Aug/21/weeknotes-rocky-beaches/#atom-tag</id><summary type="html">
    &lt;p&gt;This week I helped Natalie launch &lt;a href="https://www.rockybeaches.com/"&gt;Rocky Beaches&lt;/a&gt;, shipped Datasette 0.48 and several releases of &lt;code&gt;datasette-graphql&lt;/code&gt;, upgraded the CSRF protection for &lt;code&gt;datasette-upload-csvs&lt;/code&gt; and figured out how to get a commit log of changes to my blog by backing up its database to a GitHub repository.&lt;/p&gt;
&lt;h4 id="rocky-beaches"&gt;Rocky Beaches&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://twitter.com/natbat"&gt;Natalie&lt;/a&gt; released the first version of &lt;a href="https://www.rockybeaches.com/"&gt;rockybeaches.com&lt;/a&gt; this week. It's a site that helps you find places to go tidepooling (known as rockpooling in the UK) and figure out the best times to go based on low tide times.&lt;/p&gt;

&lt;p&gt;&lt;img style="max-width: 100%" src="https://static.simonwillison.net/static/2020/Rocky_Beaches__Pillar_Point_Harbor_CA.jpg" alt="Screenshot of the Pillar Point page for Rocky Beaches" /&gt;&lt;/p&gt;

&lt;p&gt;I helped out with the backend for the site, mainly as an excuse to further explore the idea of using Datasette to power full websites (previously explored with &lt;a href="https://simonwillison.net/2019/Nov/25/niche-museums/"&gt;Niche Museums&lt;/a&gt; and &lt;a href="https://simonwillison.net/2020/Apr/20/self-rewriting-readme/"&gt;my TILs&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;The site uses a pattern I've been really enjoying: it's essentially a static dynamic site. Pages are dynamically rendered by Datasette using Jinja templates and a SQLite database, but the database itself is treated as a static asset: it's built at deploy time by &lt;a href="https://github.com/natbat/rockybeaches/blob/main/.github/workflows/deploy.yml"&gt;this GitHub Actions workflow&lt;/a&gt; and deployed (currently to &lt;a href="https://www.vercel.com/"&gt;Vercel&lt;/a&gt;) as a binary asset along with the code.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/natbat/rockybeaches/blob/main/scripts/build.sh"&gt;build script&lt;/a&gt; uses &lt;a href="https://github.com/simonw/yaml-to-sqlite"&gt;yaml-to-sqlite&lt;/a&gt; to load two YAML files - &lt;a href="https://github.com/natbat/rockybeaches/blob/4127c0f0539178664cefed4aca00db2b5c00c855/data/places.yml"&gt;places.yml&lt;/a&gt; and &lt;a href="https://github.com/natbat/rockybeaches/blob/4127c0f0539178664cefed4aca00db2b5c00c855/data/stations.yml"&gt;stations.yml&lt;/a&gt; - and create the &lt;code&gt;stations&lt;/code&gt; and &lt;code&gt;places&lt;/code&gt; database tables.&lt;/p&gt;
&lt;p&gt;It then runs two custom Python scripts to fetch relevant data for those places from &lt;a href="https://www.inaturalist.org/"&gt;iNaturalist&lt;/a&gt; and the &lt;a href="https://tidesandcurrents.noaa.gov/web_services_info.html"&gt;NOAA Tides &amp;amp; Currents API&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The data all ends up in the Datasette instance that powers the site - you can browse it at &lt;a href="http://www.rockybeaches.com/data"&gt;www.rockybeaches.com/data&lt;/a&gt; or interact with it using the GraphQL API at &lt;a href="http://www.rockybeaches.com/graphql"&gt;www.rockybeaches.com/graphql&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The code is a little convoluted at the moment - I'm still iterating towards the best patterns for building websites like this using Datasette - but I'm very pleased with the productivity and performance that this approach produced.&lt;/p&gt;
&lt;h4 id="datasette-048"&gt;Datasette 0.48&lt;/h4&gt;
&lt;p&gt;Highlights from &lt;a href="https://docs.datasette.io/en/stable/changelog.html#v0-48"&gt;Datasette 0.48&lt;/a&gt; release notes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Datasette documentation now lives at &lt;a href="https://docs.datasette.io/"&gt;docs.datasette.io&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;extra_template_vars&lt;/code&gt;, &lt;code&gt;extra_css_urls&lt;/code&gt;, &lt;code&gt;extra_js_urls&lt;/code&gt; and &lt;code&gt;extra_body_script&lt;/code&gt; plugin hooks now all accept the same arguments. See &lt;a href="https://docs.datasette.io/en/stable/plugin_hooks.html#plugin-hook-extra-template-vars"&gt;extra_template_vars(template, database, table, columns, view_name, request, datasette)&lt;/a&gt; for details. (&lt;a href="https://github.com/simonw/datasette/issues/939"&gt;#939&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Those hooks now accept a new &lt;code&gt;columns&lt;/code&gt; argument detailing the table columns that will be rendered on that page. (&lt;a href="https://github.com/simonw/datasette/issues/938"&gt;#938&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I released a new version of &lt;a href="https://github.com/simonw/datasette-cluster-map"&gt;datasette-cluster-map&lt;/a&gt; that takes advantage of the new &lt;code&gt;columns&lt;/code&gt; argument to only inject Leaflet maps JavaScript onto the page if the table being rendered includes latitude and longitude columns - previously the plugin would load extra code on pages that weren't going to render a map at all. That's now running on &lt;a href="https://global-power-plants.datasettes.com/"&gt;https://global-power-plants.datasettes.com/&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="datasette-graphql"&gt;datasette-graphql&lt;/h4&gt;
&lt;p&gt;Using &lt;a href="https://github.com/simonw/datasette-graphql"&gt;datasette-graphql&lt;/a&gt; for Rocky Beaches inspired me to add two new features:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A new &lt;code&gt;graphql()&lt;/code&gt; Jinja custom template function that lets you execute custom GraphQL queries inside a Datasette template page - which turns out to be a pretty elegant way for the template to load exactly the data that it needs in order to render the page. Here's &lt;a href="https://github.com/natbat/rockybeaches/blob/70039f18b3d3823a4f069deca513e950a3aaba4f/templates/row-data-places.html#L29-L46"&gt;how Rocky Beaches uses that&lt;/a&gt;. &lt;a href="https://github.com/simonw/datasette-graphql/issues/50"&gt;Issue 50&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Some of the iNaturalist data that Rocky Beaches uses is stored as JSON data in text columns in SQLite - mainly because I was too lazy to model it out as tables. This was coming out of the GraphQL API as strings-containing-JSON, so I added a &lt;code&gt;json_columns&lt;/code&gt; plugin configuration mechanism for turning those into Graphene &lt;code&gt;GenericScalar&lt;/code&gt; fields - see &lt;a href="https://github.com/simonw/datasette-graphql/issues/53"&gt;issue 53&lt;/a&gt; for details.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I also landed a big performance improvement. The plugin works by introspecting the database and generating a GraphQL schema that represents its tables, columns and views. For databases with a lot of tables this can get expensive, and the introspection was being run on every request.&lt;/p&gt;
&lt;p&gt;I didn't want to require a server restart any time the schema changed, so caching the schema in-memory indefinitely wasn't an option. Ideally the schema would be cached, with the cache invalidated any time the schema itself changed.&lt;/p&gt;
&lt;p&gt;It turns out SQLite has a mechanism for this: the &lt;code&gt;PRAGMA schema_version&lt;/code&gt; statement, which returns an integer version number that changes any time the underlying schema is changed (e.g. a table is added or modified).&lt;/p&gt;
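&lt;p&gt;Here's a self-contained illustration of that mechanism using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module - a simplified sketch of the caching idea, not the plugin's actual code:&lt;/p&gt;

```python
import sqlite3

# SQLite bumps PRAGMA schema_version every time the schema changes,
# which makes it a handy cache key for anything derived from the schema.
conn = sqlite3.connect(":memory:")
version_before = conn.execute("PRAGMA schema_version").fetchone()[0]
conn.execute("CREATE TABLE places (id INTEGER PRIMARY KEY, name TEXT)")
version_after = conn.execute("PRAGMA schema_version").fetchone()[0]
assert version_after > version_before  # the CREATE TABLE bumped the version

# Cache pattern: only rebuild the expensive derived value (here, a list
# of table names standing in for the GraphQL schema) when the version moves.
cache = {}

def get_schema(conn):
    version = conn.execute("PRAGMA schema_version").fetchone()[0]
    if version not in cache:
        cache.clear()  # old versions are stale, drop them
        cache[version] = [
            row[0]
            for row in conn.execute(
                "select name from sqlite_master where type = 'table'"
            )
        ]
    return cache[version]

print(get_schema(conn))  # ['places']
```

&lt;p&gt;Subsequent calls to &lt;code&gt;get_schema()&lt;/code&gt; hit the cache until another DDL statement changes the schema version.&lt;/p&gt;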
&lt;p&gt;I built a quick &lt;a href="https://github.com/simonw/datasette-schema-versions"&gt;datasette-schema-versions&lt;/a&gt; plugin to try this feature out (in less than twenty minutes thanks to my &lt;a href="https://simonwillison.net/2020/Jun/20/cookiecutter-plugins/"&gt;datasette-plugin cookiecutter template&lt;/a&gt;) and prove to myself that it works. Then I built a caching mechanism for &lt;code&gt;datasette-graphql&lt;/code&gt; that uses the current &lt;code&gt;schema_version&lt;/code&gt; as the cache key. See &lt;a href="https://github.com/simonw/datasette-graphql/issues/51"&gt;issue 51&lt;/a&gt; for details.&lt;/p&gt;
&lt;h4 id="asgi-csrf-and-datasette-upload-csvs"&gt;asgi-csrf and datasette-upload-csvs&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/datasette-upload-csvs"&gt;datasette-upload-csvs&lt;/a&gt; is a Datasette plugin that adds a form for uploading CSV files and converting them to SQLite tables.&lt;/p&gt;
&lt;p&gt;Datasette 0.44 &lt;a href="https://docs.datasette.io/en/latest/changelog.html#csrf-protection"&gt;added CSRF protection&lt;/a&gt;, which broke the plugin. I fixed that this week, but it took some extra work because file uploads use the &lt;code&gt;multipart/form-data&lt;/code&gt; HTTP mechanism and my &lt;a href="https://github.com/simonw/asgi-csrf"&gt;asgi-csrf&lt;/a&gt; library didn't support that.&lt;/p&gt;
&lt;p&gt;I &lt;a href="https://github.com/simonw/asgi-csrf/issues/1"&gt;fixed that&lt;/a&gt;, but the resulting code was quite complicated. Since &lt;code&gt;asgi-csrf&lt;/code&gt; is a security library I decided to aim for 100% code coverage, the first time I've done that for one of my projects.&lt;/p&gt;
&lt;p&gt;I got there with the help of codecov.io and &lt;a href="https://pypi.org/project/pytest-cov/"&gt;pytest-cov&lt;/a&gt;. I wrote up what I learned about those tools in &lt;a href="https://github.com/simonw/til/blob/main/pytest/pytest-code-coverage.md"&gt;a TIL&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="backing-up-my-blog-database-to-a-github-repository"&gt;Backing up my blog database to a GitHub repository&lt;/h4&gt;
&lt;p&gt;I really like keeping content in a git repository (see Rocky Beaches and Niche Museums). Every content management system I've ever worked on has eventually needed revision control, and modeling that in a database and retrofitting it onto an existing project is always a huge pain.&lt;/p&gt;
&lt;p&gt;I have 18 years of content on this blog. I want that backed up to git - and this week I realized I have the tools to do that already.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/db-to-sqlite"&gt;db-to-sqlite&lt;/a&gt; is my tool for taking any SQL Alchemy supported database (so far tested with MySQL and PostgreSQL) and exporting it into a SQLite database.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/sqlite-diffable"&gt;sqlite-diffable&lt;/a&gt; is a very early stage tool I built last year. The idea is to dump a SQLite database out to disk in a way that is designed to work well with git diffs. Each table is dumped out as newline-delimited JSON, one row per line.&lt;/p&gt;
&lt;p&gt;So... how about converting my blog's PostgreSQL database to SQLite, then dumping it to disk with &lt;code&gt;sqlite-diffable&lt;/code&gt; and committing the result to a git repository? And then running that in a GitHub Action?&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://github.com/simonw/simonwillisonblog-backup/blob/main/.github/workflows/backup.yml"&gt;the workflow&lt;/a&gt;. It does exactly that, with a few extra steps: it only grabs a subset of my tables, and it redacts the &lt;code&gt;password&lt;/code&gt; column from my &lt;code&gt;auth_user&lt;/code&gt; table so that my hashed password isn't exposed in the backup.&lt;/p&gt;
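&lt;p&gt;The redaction step is worth illustrating. A hedged sketch of the idea, using in-memory SQLite databases and a table shaped like Django's &lt;code&gt;auth_user&lt;/code&gt; (the actual workflow handles this during the export itself):&lt;/p&gt;

```python
import sqlite3

# Pretend source database with a sensitive column
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE auth_user (id INTEGER PRIMARY KEY, username TEXT, password TEXT)"
)
source.execute("INSERT INTO auth_user VALUES (1, 'simonw', 'pbkdf2_sha256$hash')")

# Copy rows into the backup, replacing the password column so hashed
# passwords never reach the public repository
backup = sqlite3.connect(":memory:")
backup.execute(
    "CREATE TABLE auth_user (id INTEGER PRIMARY KEY, username TEXT, password TEXT)"
)
for user_id, username, _password in source.execute("select * from auth_user"):
    backup.execute(
        "INSERT INTO auth_user VALUES (?, ?, ?)",
        (user_id, username, "REDACTED"),
    )

print(backup.execute("select password from auth_user").fetchone()[0])  # REDACTED
```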
&lt;p&gt;I now have &lt;a href="https://github.com/simonw/simonwillisonblog-backup/commits/main"&gt;a commit log&lt;/a&gt; of changes to my blog's database!&lt;/p&gt;
&lt;p&gt;I've set it to run nightly, but I can trigger it manually by clicking a button too.&lt;/p&gt;
&lt;h4 id="til-this-week-46"&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/til/blob/main/readthedocs/custom-subdomain.md"&gt;Pointing a custom subdomain at Read the Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/til/blob/main/pytest/pytest-code-coverage.md"&gt;Code coverage using pytest and codecov.io&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/til/blob/main/readthedocs/readthedocs-search-api.md"&gt;Read the Docs Search API&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/til/blob/main/heroku/programatic-access-postgresql.md"&gt;Programatically accessing Heroku PostgreSQL from GitHub Actions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/til/blob/main/macos/find-largest-sqlite.md"&gt;Finding the largest SQLite files on a Mac&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/til/blob/main/github-actions/grep-tests.md"&gt;Using grep to write tests in CI&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="releases-this-week-46"&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/datasette-graphql/releases/tag/0.14"&gt;datasette-graphql 0.14&lt;/a&gt; - 2020-08-20&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/datasette-graphql/releases/tag/0.13"&gt;datasette-graphql 0.13&lt;/a&gt; - 2020-08-19&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/datasette-schema-versions/releases/tag/0.1"&gt;datasette-schema-versions 0.1&lt;/a&gt; - 2020-08-19&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/datasette-graphql/releases/tag/0.12.3"&gt;datasette-graphql 0.12.3&lt;/a&gt; - 2020-08-19&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/dogsheep/github-to-sqlite/releases/tag/2.5"&gt;github-to-sqlite 2.5&lt;/a&gt; - 2020-08-18&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/datasette-publish-vercel/releases/tag/0.8"&gt;datasette-publish-vercel 0.8&lt;/a&gt; - 2020-08-17&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/datasette-cluster-map/releases/tag/0.12"&gt;datasette-cluster-map 0.12&lt;/a&gt; - 2020-08-16&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/datasette/releases/tag/0.48"&gt;datasette 0.48&lt;/a&gt; - 2020-08-16&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/datasette-graphql/releases/tag/0.12.2"&gt;datasette-graphql 0.12.2&lt;/a&gt; - 2020-08-16&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/datasette-saved-queries/releases/tag/0.2.1"&gt;datasette-saved-queries 0.2.1&lt;/a&gt; - 2020-08-15&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/datasette/releases/tag/0.47.3"&gt;datasette 0.47.3&lt;/a&gt; - 2020-08-15&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/datasette-upload-csvs/releases/tag/0.5"&gt;datasette-upload-csvs 0.5&lt;/a&gt; - 2020-08-15&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/asgi-csrf/releases/tag/0.7"&gt;asgi-csrf 0.7&lt;/a&gt; - 2020-08-15&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/asgi-csrf/releases/tag/0.7a0"&gt;asgi-csrf 0.7a0&lt;/a&gt; - 2020-08-15&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/asgi-csrf/releases/tag/0.7a0"&gt;asgi-csrf 0.7a0&lt;/a&gt; - 2020-08-15&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/datasette-cluster-map/releases/tag/0.11.1"&gt;datasette-cluster-map 0.11.1&lt;/a&gt; - 2020-08-14&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/datasette-cluster-map/releases/tag/0.11"&gt;datasette-cluster-map 0.11&lt;/a&gt; - 2020-08-14&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/datasette-graphql/releases/tag/0.12.1"&gt;datasette-graphql 0.12.1&lt;/a&gt; - 2020-08-13&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csrf"&gt;csrf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/databases"&gt;databases&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/natalie-downe"&gt;natalie-downe&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/graphql"&gt;graphql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/inaturalist"&gt;inaturalist&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="csrf"/><category term="databases"/><category term="git"/><category term="github"/><category term="natalie-downe"/><category term="projects"/><category term="graphql"/><category term="datasette"/><category term="inaturalist"/><category term="weeknotes"/></entry><entry><title>Weeknotes: cookiecutter templates, better plugin documentation, sqlite-generate</title><link href="https://simonwillison.net/2020/Jun/26/weeknotes-plugins-sqlite-generate/#atom-tag" rel="alternate"/><published>2020-06-26T01:39:50+00:00</published><updated>2020-06-26T01:39:50+00:00</updated><id>https://simonwillison.net/2020/Jun/26/weeknotes-plugins-sqlite-generate/#atom-tag</id><summary type="html">
    &lt;p&gt;I spent this week spreading myself between a bunch of smaller projects, and finally getting familiar with &lt;a href="https://cookiecutter.readthedocs.io/"&gt;cookiecutter&lt;/a&gt;. I wrote about &lt;a href="https://simonwillison.net/2020/Jun/20/cookiecutter-plugins/"&gt;my datasette-plugin cookiecutter template&lt;/a&gt; earlier in the week; here's what else I've been working on.&lt;/p&gt;

&lt;h4 id="sqlite-generate"&gt;sqlite-generate&lt;/h4&gt;

&lt;p&gt;Datasette is supposed to work against any SQLite database you throw at it, no matter how weird the schema or how unwieldy the database shape or size.&lt;/p&gt;

&lt;p&gt;I built a new tool called &lt;a href="https://github.com/simonw/sqlite-generate"&gt;sqlite-generate&lt;/a&gt; this week to help me create databases of different shapes. It's a Python command-line tool which uses &lt;a href="https://faker.readthedocs.io/"&gt;Faker&lt;/a&gt; to populate a new database with random data. You run it something like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;sqlite-generate demo.db \
    --tables=20 \
    --rows=100,500 \
    --columns=5,20 \
    --fks=0,3 \
    --pks=0,2 \
    --fts&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This command creates a database containing 20 tables, each with between 100 and 500 rows and 5-20 columns. Each table will also have between 0 and 3 foreign key columns to other tables, and will feature between 0 and 2 primary key columns. SQLite full-text search will be configured against all of the text columns in the table.&lt;/p&gt;
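&lt;p&gt;The core loop is simple: pick a random shape for each table, then fill it with random values. Here's a toy version using only the standard library - the real tool uses Faker for realistic data and also handles foreign keys, compound primary keys and FTS, so treat the names and options here as illustrative:&lt;/p&gt;

```python
import random
import sqlite3
import string

def generate_database(conn, tables=3, rows=(5, 10), columns=(2, 4), seed=42):
    """Create randomly shaped tables filled with random text values."""
    rng = random.Random(seed)
    for i in range(tables):
        n_columns = rng.randint(*columns)
        column_names = [f"col_{j}" for j in range(n_columns)]
        column_sql = ", ".join(f"{name} TEXT" for name in column_names)
        conn.execute(
            f"CREATE TABLE table_{i} (id INTEGER PRIMARY KEY, {column_sql})"
        )
        placeholders = ", ".join("?" for _ in column_names)
        for _ in range(rng.randint(*rows)):
            values = [
                "".join(rng.choices(string.ascii_lowercase, k=8))
                for _ in column_names
            ]
            conn.execute(
                f"INSERT INTO table_{i} ({', '.join(column_names)}) "
                f"VALUES ({placeholders})",
                values,
            )

conn = sqlite3.connect(":memory:")
generate_database(conn)
print([r[0] for r in conn.execute(
    "select name from sqlite_master where type = 'table'"
)])
```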

&lt;p&gt;I always try to include a live demo with any of my projects, and &lt;code&gt;sqlite-generate&lt;/code&gt; is no exception. &lt;a href="https://github.com/simonw/sqlite-generate/blob/main/.github/workflows/demo.yml"&gt;This GitHub Action&lt;/a&gt; runs on every push to main and deploys a demo to &lt;a href="https://sqlite-generate-demo.datasette.io/"&gt;https://sqlite-generate-demo.datasette.io/&lt;/a&gt; showing the latest version of the code in action.&lt;/p&gt;

&lt;p&gt;The demo runs my &lt;a href="https://github.com/simonw/datasette-search-all"&gt;datasette-search-all&lt;/a&gt; plugin in order to more easily demonstrate full-text search across all of the text columns in the generated tables. Try searching for &lt;a href="https://sqlite-generate-demo.datasette.io/-/search?q=newspaper"&gt;newspaper&lt;/a&gt;.&lt;/p&gt;

&lt;h4 id="click-app"&gt;click-app cookiecutter template&lt;/h4&gt;

&lt;p&gt;I write quite a lot of &lt;a href="https://click.palletsprojects.com/"&gt;Click&lt;/a&gt; powered command-line tools like this one, so inspired by &lt;a href="https://github.com/simonw/datasette-plugin"&gt;datasette-plugin&lt;/a&gt; I created a new &lt;a href="https://github.com/simonw/click-app"&gt;click-app&lt;/a&gt; cookiecutter template that bakes in my own preferences about how to set up a new Click project (complete with GitHub Actions). &lt;code&gt;sqlite-generate&lt;/code&gt; is the first tool I've built using that template.&lt;/p&gt;

&lt;h4 id="improved-plugin-docs"&gt;Improved Datasette plugin documentation&lt;/h4&gt;

&lt;p&gt;I've split Datasette's plugin documentation into five separate pages, and added a new page to the documentation about patterns for testing plugins.&lt;/p&gt;

&lt;p&gt;The five pages are:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;&lt;a href="https://datasette.readthedocs.io/en/latest/plugins.html"&gt;Plugins&lt;/a&gt; describing how to install and configure plugins&lt;/li&gt;&lt;li&gt;&lt;a href="https://datasette.readthedocs.io/en/latest/writing_plugins.html"&gt;Writing plugins&lt;/a&gt; showing how to write one-off plugins, how to use the &lt;code&gt;datasette-plugin&lt;/code&gt; cookiecutter template and how to package templates for release to PyPI&lt;/li&gt;&lt;li&gt;&lt;a href="https://datasette.readthedocs.io/en/latest/plugin_hooks.html"&gt;Plugin hooks&lt;/a&gt; documenting all of the available plugin hooks&lt;/li&gt;&lt;li&gt;&lt;a href="https://datasette.readthedocs.io/en/latest/testing_plugins.html"&gt;Testing plugins&lt;/a&gt; describing my preferred patterns for writing tests for them (using &lt;a href="https://docs.pytest.org/"&gt;pytest&lt;/a&gt; and &lt;a href="https://www.python-httpx.org/"&gt;HTTPX&lt;/a&gt;)&lt;/li&gt;&lt;li&gt;&lt;a href="https://datasette.readthedocs.io/en/latest/internals.html"&gt;Internals for plugins&lt;/a&gt; describing the APIs Datasette makes available for use within plugin hook implementations&lt;/li&gt;&lt;/ul&gt;

&lt;p&gt;There's also a &lt;a href="https://datasette.readthedocs.io/en/latest/ecosystem.html#datasette-plugins"&gt;list of available plugins&lt;/a&gt; on the Datasette Ecosystem page of the documentation, though I plan to move those to a separate plugin directory in the future.&lt;/p&gt;

&lt;h4 id="datasette-block-robots"&gt;datasette-block-robots&lt;/h4&gt;

&lt;p&gt;The &lt;a href="https://github.com/simonw/datasette-plugin"&gt;datasette-plugin&lt;/a&gt; template practically eliminates the friction involved in starting a new plugin.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sqlite-generate&lt;/code&gt; generates random names for people. I don't particularly want people who search for their own names stumbling across the live demo and being weirded out by their name featured there, so I decided to block it from search engine crawlers using &lt;code&gt;robots.txt&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I wrote a tiny plugin to do this: &lt;a href="https://github.com/simonw/datasette-block-robots"&gt;datasette-block-robots&lt;/a&gt;, which uses the new &lt;a href="https://datasette.readthedocs.io/en/latest/plugin_hooks.html#register-routes"&gt;register_routes() plugin hook&lt;/a&gt; to add a &lt;code&gt;/robots.txt&lt;/code&gt; page.&lt;/p&gt;

&lt;p&gt;It's also a neat example of the &lt;a href="https://github.com/simonw/datasette-block-robots/blob/main/datasette_block_robots/__init__.py"&gt;simplest possible plugin&lt;/a&gt; to use that feature - along with the &lt;a href="https://github.com/simonw/datasette-block-robots/blob/main/tests/test_block_robots.py"&gt;simplest possible unit test&lt;/a&gt; for exercising such a page.&lt;/p&gt;

&lt;h4 id="datasette-saved-queries"&gt;datasette-saved-queries&lt;/h4&gt;

&lt;p&gt;Another new plugin, this time with a bit more substance to it. &lt;a href="https://github.com/simonw/datasette-saved-queries"&gt;datasette-saved-queries&lt;/a&gt; exercises the new &lt;a href="https://datasette.readthedocs.io/en/latest/plugin_hooks.html#canned-queries-datasette-database-actor"&gt;canned_queries()&lt;/a&gt; hook I &lt;a href="https://simonwillison.net/2020/Jun/19/datasette-alphas/"&gt;described last week&lt;/a&gt;. It uses the new &lt;a href="https://datasette.readthedocs.io/en/latest/plugin_hooks.html#startup-datasette"&gt;startup()&lt;/a&gt; hook to create tables on startup (if they are missing), then lets users insert records into those tables to save their own queries. Queries saved in this way are then returned as canned queries for that particular database.&lt;/p&gt;

&lt;h4 id="main-not-master"&gt;main, not master&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;main&lt;/code&gt; is a better name for the main GitHub branch than &lt;code&gt;master&lt;/code&gt;, which has unpleasant connotations (it apparently derives from master/slave in BitKeeper). My &lt;code&gt;datasette-plugin&lt;/code&gt; and &lt;code&gt;click-app&lt;/code&gt; cookiecutter templates both include instructions for renaming &lt;code&gt;master&lt;/code&gt; to &lt;code&gt;main&lt;/code&gt; in their READMEs - it's as easy as running &lt;code&gt;git branch -m master main&lt;/code&gt; before running your first push to GitHub.&lt;/p&gt;

&lt;p&gt;I'm working towards &lt;a href="https://github.com/simonw/datasette/issues/849"&gt;making the switch&lt;/a&gt; for Datasette itself.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/plugins"&gt;plugins&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/robots-txt"&gt;robots-txt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cookiecutter"&gt;cookiecutter&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="git"/><category term="plugins"/><category term="projects"/><category term="robots-txt"/><category term="sqlite"/><category term="datasette"/><category term="weeknotes"/><category term="cookiecutter"/></entry><entry><title>Quoting Vincent Driessen</title><link href="https://simonwillison.net/2020/May/14/vincent-driessen/#atom-tag" rel="alternate"/><published>2020-05-14T13:49:55+00:00</published><updated>2020-05-14T13:49:55+00:00</updated><id>https://simonwillison.net/2020/May/14/vincent-driessen/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://nvie.com/posts/a-successful-git-branching-model/"&gt;&lt;p&gt;Web apps are typically continuously delivered, not rolled back, and you don't have to support multiple versions of the software running in the wild.&lt;/p&gt;
&lt;p&gt;This is not the class of software that I had in mind when I wrote the blog post 10 years ago. If your team is doing continuous delivery of software, I would suggest to adopt a much simpler workflow (like GitHub flow) instead of trying to shoehorn git-flow into your team.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://nvie.com/posts/a-successful-git-branching-model/"&gt;Vincent Driessen&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/continuous-deployment"&gt;continuous-deployment&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/continuous-integration"&gt;continuous-integration&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;&lt;/p&gt;



</summary><category term="continuous-deployment"/><category term="continuous-integration"/><category term="git"/></entry><entry><title>Weeknotes: Archiving coronavirus.data.gov.uk, custom pages and directory configuration in Datasette, photos-to-sqlite</title><link href="https://simonwillison.net/2020/Apr/29/weeknotes/#atom-tag" rel="alternate"/><published>2020-04-29T19:41:11+00:00</published><updated>2020-04-29T19:41:11+00:00</updated><id>https://simonwillison.net/2020/Apr/29/weeknotes/#atom-tag</id><summary type="html">
    &lt;p&gt;I mainly made progress on three projects this week: Datasette, photos-to-sqlite and a cleaner way of archiving data to a git repository.&lt;/p&gt;

&lt;h3&gt;Archiving coronavirus.data.gov.uk&lt;/h3&gt;

&lt;p&gt;The UK government have a new portal website sharing detailed Coronavirus data for regions around the country, at &lt;a href="https://coronavirus.data.gov.uk/"&gt;coronavirus.data.gov.uk&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As with everything else built in 2020, it's a big single-page JavaScript app. Matthew Somerville &lt;a href="http://dracos.co.uk/wrote/coronavirus-dashboard/"&gt;investigated&lt;/a&gt; what it would take to build a much lighter (and faster loading) site displaying the same information by moving much of the rendering to the server.&lt;/p&gt;

&lt;p&gt;One of the best things about the SPA craze is that it strongly encourages structured data to be published as JSON files. Matthew's article inspired me to take a look, and sure enough the government figures are available in an extremely comprehensive (and 3.3MB in size) JSON file, available from &lt;a href="https://c19downloads.azureedge.net/downloads/data/data_latest.json"&gt;https://c19downloads.azureedge.net/downloads/data/data_latest.json&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Any time I see a file like this my first questions are how often does it change - and what kind of changes are being made to it?&lt;/p&gt;

&lt;p&gt;I've written about scraping to a git repository (see my new &lt;a href="https://simonwillison.net/tags/gitscraping/"&gt;gitscraping&lt;/a&gt; tag) a bunch in the past:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;&lt;a href="https://simonwillison.net/2017/Sep/10/scraping-irma/"&gt;Scraping hurricane Irma&lt;/a&gt; - September 2017&lt;/li&gt;&lt;li&gt;&lt;a href="https://simonwillison.net/2017/Oct/10/fires-in-the-north-bay/"&gt;Changelogs to help understand the fires in the North Bay&lt;/a&gt; - October 2017&lt;/li&gt;&lt;li&gt;&lt;a href="https://simonwillison.net/2019/Mar/13/tree-history/"&gt;Generating a commit log for San Francisco’s official list of trees&lt;/a&gt; - March 2019&lt;/li&gt;&lt;li&gt;&lt;a href="https://simonwillison.net/2019/Oct/10/pge-outages/"&gt;Tracking PG&amp;amp;E outages by scraping to a git repo&lt;/a&gt; - October 2019&lt;/li&gt;&lt;li&gt;&lt;a href="https://simonwillison.net/2020/Jan/21/github-actions-cloud-run/"&gt;Deploying a data API using GitHub Actions and Cloud Run&lt;/a&gt; - January 2020&lt;/li&gt;&lt;/ul&gt;

&lt;p&gt;Now that I've figured out a really clean way to &lt;a href="https://github.com/simonw/til/blob/master/github-actions/commit-if-file-changed.md"&gt;Commit a file if it changed&lt;/a&gt; in a GitHub Action, knocking out new versions of this pattern is really quick.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/simonw/coronavirus-data-gov-archive"&gt;simonw/coronavirus-data-gov-archive&lt;/a&gt; is my new repo that does exactly that: it periodically fetches the latest versions of the JSON data files powering that site and commits them if they have changed. The aim is to build a &lt;a href="https://github.com/simonw/coronavirus-data-gov-archive/commits/master/data_latest.json"&gt;commit history&lt;/a&gt; of changes made to the underlying data.&lt;/p&gt;

&lt;p&gt;The first implementation was extremely simple - here's the &lt;a href="https://github.com/simonw/coronavirus-data-gov-archive/blob/c83d69e95ec6400bf77d7b0d474e868baa78841e/.github/workflows/scheduled.yml"&gt;entire action&lt;/a&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;name: Fetch latest data

on:
  push:
  repository_dispatch:
  schedule:
    - cron:  '25 * * * *'

jobs:
  scheduled:
    runs-on: ubuntu-latest
    steps:
    - name: Check out this repo
      uses: actions/checkout@v2
    - name: Fetch latest data
      run: |-
        curl https://c19downloads.azureedge.net/downloads/data/data_latest.json | jq . &amp;gt; data_latest.json
        curl https://c19pub.azureedge.net/utlas.geojson | gunzip | jq . &amp;gt; utlas.geojson
        curl https://c19pub.azureedge.net/countries.geojson | gunzip | jq . &amp;gt; countries.geojson
        curl https://c19pub.azureedge.net/regions.geojson | gunzip | jq . &amp;gt; regions.geojson
    - name: Commit and push if it changed
      run: |-
        git config user.name "Automated"
        git config user.email "actions@users.noreply.github.com"
        git add -A
        timestamp=$(date -u)
        git commit -m "Latest data: ${timestamp}" || exit 0
        git push&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;It uses a combination of &lt;code&gt;curl&lt;/code&gt; and &lt;code&gt;jq&lt;/code&gt; (both available &lt;a href="https://github.com/actions/virtual-environments/blob/master/images/linux/Ubuntu1804-README.md"&gt;in the default worker environment&lt;/a&gt;) to pull down the data and pretty-print it (better for readable diffs), then commits the result.&lt;/p&gt;
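&lt;p&gt;The same normalization idea can be sketched in Python if you'd rather not shell out to &lt;code&gt;jq&lt;/code&gt;: re-serialize the JSON with indentation (and, as an extra step that &lt;code&gt;jq .&lt;/code&gt; doesn't do, sorted keys) so each value lives on its own line and a git diff pinpoints exactly what changed.&lt;/p&gt;

```python
import json

def normalize(raw: str) -> str:
    # One value per line, stable key order: diffs show only real changes
    return json.dumps(json.loads(raw), indent=2, sort_keys=True) + "\n"
```

&lt;p&gt;Write the normalized output to the file before &lt;code&gt;git add&lt;/code&gt; and byte-identical fetches produce no commit at all.&lt;/p&gt;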

&lt;p&gt;Matthew Somerville &lt;a href="https://twitter.com/dracos/status/1255221799085846532"&gt;pointed out&lt;/a&gt; that inefficient polling sets a bad precedent. Here I'm hitting &lt;code&gt;azureedge.net&lt;/code&gt;, the Azure CDN, so that didn't particularly worry me - but since I want this pattern to be used widely it's good to provide a best-practice example.&lt;/p&gt;

&lt;p&gt;Figuring out the best way to make &lt;a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Conditional_requests"&gt;conditional get requests&lt;/a&gt; in a GitHub Action led me down &lt;a href="https://github.com/simonw/coronavirus-data-gov-archive/issues/1"&gt;something of a rabbit hole&lt;/a&gt;. I wanted to use &lt;a href="https://daniel.haxx.se/blog/2019/12/06/curl-speaks-etag/"&gt;curl's new ETag support&lt;/a&gt; but I ran into &lt;a href="https://github.com/curl/curl/issues/5309"&gt;a curl bug&lt;/a&gt;, so I ended up rolling a simple Python CLI tool called &lt;a href="https://github.com/simonw/conditional-get"&gt;conditional-get&lt;/a&gt; to solve my problem. In the time it took me to release that tool (just a few hours) a &lt;a href="https://github.com/curl/curl/issues/5309#issuecomment-621265179"&gt;new curl release&lt;/a&gt; came out with a fix for that bug!&lt;/p&gt;
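&lt;p&gt;The core idea is small enough to sketch with just the standard library. This is not the &lt;code&gt;conditional-get&lt;/code&gt; tool itself, and the &lt;code&gt;etags.json&lt;/code&gt; cache file is my invention for the example: store the &lt;code&gt;ETag&lt;/code&gt; from each response, send it back as &lt;code&gt;If-None-Match&lt;/code&gt;, and treat a &lt;code&gt;304 Not Modified&lt;/code&gt; as "nothing to commit".&lt;/p&gt;

```python
import json
import urllib.error
import urllib.request
from pathlib import Path

# Hypothetical cache file mapping URL -> last seen ETag
ETAG_STORE = Path("etags.json")

def conditional_get(url):
    """Fetch url, sending If-None-Match when an ETag is cached.
    Returns the body as bytes, or None on 304 Not Modified."""
    etags = json.loads(ETAG_STORE.read_text()) if ETAG_STORE.exists() else {}
    request = urllib.request.Request(url)
    if url in etags:
        request.add_header("If-None-Match", etags[url])
    try:
        with urllib.request.urlopen(request) as response:
            etag = response.headers.get("ETag")
            if etag:
                etags[url] = etag
                ETAG_STORE.write_text(json.dumps(etags))
            return response.read()
    except urllib.error.HTTPError as error:
        if error.code == 304:
            return None  # unchanged since the cached ETag
        raise
```

&lt;p&gt;A scraper built on this skips the write-and-commit step entirely whenever the function returns &lt;code&gt;None&lt;/code&gt;.&lt;/p&gt;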

&lt;p&gt;Here's &lt;a href="https://github.com/simonw/coronavirus-data-gov-archive/blob/a95d7661b236a9ee9a26a441dd948eb00308f919/.github/workflows/scheduled.yml"&gt;the workflow&lt;/a&gt; using my &lt;code&gt;conditional-get&lt;/code&gt; tool. See &lt;a href="https://github.com/simonw/coronavirus-data-gov-archive/issues/1"&gt;the issue thread&lt;/a&gt; for all of the other potential solutions, including a really neat &lt;a href="https://github.com/hubgit/curl-etag"&gt;Action shell-script solution&lt;/a&gt; by Alf Eaton.&lt;/p&gt;

&lt;p&gt;To my absolute delight, the project has already been forked once by Daniel Langer to &lt;a href="https://github.com/dlanger/coronavirus-hc-infobase-archive"&gt;capture Canadian Covid-19 cases&lt;/a&gt;!&lt;/p&gt;

&lt;h3 id="new-datasette-features"&gt;New Datasette features&lt;/h3&gt;

&lt;p&gt;I pushed two new features to &lt;a href="https://github.com/simonw/datasette"&gt;Datasette&lt;/a&gt; master, ready for release in 0.41.&lt;/p&gt;

&lt;h4&gt;Configuration directory mode&lt;/h4&gt;

&lt;p&gt;This is an idea I had while building &lt;a href="https://github.com/simonw/datasette-publish-now"&gt;datasette-publish-now&lt;/a&gt;. Datasette instances can be run with custom metadata, custom plugins and custom templates. I'm increasingly finding myself working on projects that run using something like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ datasette data1.db data2.db data3.db \
    --metadata=metadata.json \
    --template-dir=templates \
    --plugins-dir=plugins&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Directory configuration mode introduces the idea that Datasette can configure itself based on a directory layout. The above example can instead be handled by creating the following layout:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;my-project/data1.db
my-project/data2.db
my-project/data3.db
my-project/metadata.json
my-project/templates/index.html
my-project/plugins/custom_plugin.py&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Then run Datasette directly targeting that directory:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ datasette my-project/&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;See &lt;a href="https://github.com/simonw/datasette/issues/731"&gt;issue #731&lt;/a&gt; for more details. Directory configuration mode &lt;a href="https://datasette.readthedocs.io/en/latest/config.html#configuration-directory-mode"&gt;is documented here&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;Define custom pages using templates/pages&lt;/h4&gt;

&lt;p&gt;In &lt;a href="https://simonwillison.net/2019/Nov/25/niche-museums/"&gt;niche-museums.com, powered by Datasette&lt;/a&gt; I described how I built the &lt;a href="https://www.niche-museums.com/"&gt;www.niche-museums.com&lt;/a&gt; website as a heavily customized Datasette instance.&lt;/p&gt;

&lt;p&gt;That site has &lt;a href="https://www.niche-museums.com/about"&gt;/about&lt;/a&gt; and &lt;a href="https://www.niche-museums.com/map"&gt;/map&lt;/a&gt; pages which are served by custom templates - but I had to do some gnarly hacks with empty &lt;code&gt;about.db&lt;/code&gt; and &lt;code&gt;map.db&lt;/code&gt; files to get them to work.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/simonw/datasette/issues/648"&gt;Issue #648&lt;/a&gt; introduces a new mechanism for creating this kind of page: create a &lt;code&gt;templates/pages/map.html&lt;/code&gt; template file and custom 404 handling code will ensure that any hits to &lt;code&gt;/map&lt;/code&gt; serve the rendered contents of that template.&lt;/p&gt;
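&lt;p&gt;The fallback mechanism is easy to picture in isolation. This is not Datasette's actual routing code, just a minimal sketch of the idea: when no other route matches, look for a template file named after the request path, taking care that &lt;code&gt;../&lt;/code&gt; in a path can't escape the template directory.&lt;/p&gt;

```python
from pathlib import Path

# Hypothetical layout matching the templates/pages/ convention
TEMPLATE_DIR = Path("templates/pages")

def resolve_custom_page(path):
    """Given a request path that matched no other route, return the
    matching template (e.g. /map -> templates/pages/map.html) or None."""
    candidate = TEMPLATE_DIR / (path.strip("/") + ".html")
    # Refuse paths that escape the template directory via ../
    if TEMPLATE_DIR.resolve() not in candidate.resolve().parents:
        return None
    return candidate if candidate.exists() else None
```

&lt;p&gt;A hit returns the template to render; a miss falls through to the normal 404 response.&lt;/p&gt;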

&lt;p&gt;This could work really well with the &lt;a href="https://github.com/simonw/datasette-template-sql"&gt;datasette-template-sql&lt;/a&gt; plugin, which allows templates to execute arbitrary SQL queries (à la PHP or ColdFusion).&lt;/p&gt;

&lt;p&gt;Here's the new &lt;a href="https://datasette.readthedocs.io/en/latest/custom_templates.html#custom-pages"&gt;documentation on custom pages&lt;/a&gt;, including details of how to use the new &lt;code&gt;custom_status()&lt;/code&gt;, &lt;code&gt;custom_header()&lt;/code&gt; and &lt;code&gt;custom_redirect()&lt;/code&gt; template functions to go beyond just returning HTML.&lt;/p&gt;

&lt;h3&gt;photos-to-sqlite&lt;/h3&gt;

&lt;p&gt;My &lt;a href="https://dogsheep.github.io/"&gt;Dogsheep&lt;/a&gt; personal analytics project brings my &lt;a href="https://github.com/dogsheep/twitter-to-sqlite"&gt;tweets&lt;/a&gt;, &lt;a href="https://github.com/dogsheep/github-to-sqlite"&gt;GitHub activity&lt;/a&gt;, &lt;a href="https://github.com/dogsheep/swarm-to-sqlite"&gt;Swarm checkins&lt;/a&gt; and more together in one place. But the big missing feature is my photos.&lt;/p&gt;

&lt;p&gt;As of yesterday, I have 39,000 photos from Apple Photos uploaded to an S3 bucket using my new &lt;a href="https://github.com/dogsheep/photos-to-sqlite/"&gt;photos-to-sqlite&lt;/a&gt; tool. I can run the following SQL query and get back ten random photos!&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;select
  json_object(
    'img_src',
    'https://photos.simonwillison.net/i/' || 
    sha256 || '.' || ext || '?w=400'
  ),
  filepath,
  ext
from
  photos
where
  ext in ('jpeg', 'jpg', 'heic')
order by
  random()
limit
  10&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;code&gt;photos.simonwillison.net&lt;/code&gt; is running a modified version of my &lt;a href="https://github.com/simonw/heic-to-jpeg"&gt;heic-to-jpeg&lt;/a&gt; image converting and resizing proxy, which I'll release at some point soon.&lt;/p&gt;
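&lt;p&gt;The &lt;code&gt;sha256&lt;/code&gt; in those image URLs suggests the photos are content-addressed. Here's a sketch of what deriving such a key could look like - my inference from the URL shape, not necessarily how photos-to-sqlite actually builds its S3 keys:&lt;/p&gt;

```python
import hashlib
from pathlib import Path

def content_key(filepath):
    """Content-address a photo: sha256 of the bytes plus the original
    extension, so re-uploading the same image maps to the same object."""
    digest = hashlib.sha256(Path(filepath).read_bytes()).hexdigest()
    ext = Path(filepath).suffix.lstrip(".").lower()
    return "%s.%s" % (digest, ext)
```

&lt;p&gt;One nice property of hashing the bytes rather than using the filename: duplicate photos deduplicate themselves in the bucket for free.&lt;/p&gt;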

&lt;p&gt;There's still plenty of work to do - I still need to import EXIF data (including locations) into SQLite, and I plan to use &lt;a href="https://github.com/RhetTbull/osxphotos"&gt;osxphotos&lt;/a&gt; to export additional metadata from my Apple Photos library. But this week it went from a pure research project to something I can actually start using, which is exciting.&lt;/p&gt;

&lt;h3&gt;TIL this week&lt;/h3&gt;

&lt;ul&gt;&lt;li&gt;&lt;a href="https://github.com/simonw/til/blob/master/macos/fixing-compinit-insecure-directories.md"&gt;Fixing "compinit: insecure directories" error&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://github.com/simonw/til/blob/master/tailscale/lock-down-sshd.md"&gt;Restricting SSH connections to devices within a Tailscale network&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://github.com/simonw/til/blob/master/python/generate-nested-json-summary.md"&gt;Generated a summary of nested JSON data&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://github.com/simonw/til/blob/master/pytest/session-scoped-tmp.md"&gt;Session-scoped temporary directories in pytest&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://github.com/simonw/til/blob/master/pytest/mock-httpx.md"&gt;How to mock httpx using pytest-mock&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;

&lt;p&gt;Generated using &lt;a href="https://til.simonwillison.net/til?sql=select+json_object(%27pre%27%2C+group_concat(%27*+[%27+||+title+||+%27](%27+||+url+||+%27)%27%2C+%27%0D%0A%27))+from+til+where+%22created_utc%22+%3E%3D+%3Ap0+order+by+updated_utc+desc+limit+101&amp;amp;p0=2020-04-23"&gt;this query&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/http"&gt;http&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/matthew-somerville"&gt;matthew-somerville&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/photos"&gt;photos&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/covid19"&gt;covid19&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="git"/><category term="http"/><category term="matthew-somerville"/><category term="photos"/><category term="projects"/><category term="datasette"/><category term="weeknotes"/><category term="covid19"/><category term="git-scraping"/></entry></feed>