<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: git</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/git.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2026-03-21T22:08:24+00:00</updated><author><name>Simon Willison</name></author><entry><title>Using Git with coding agents</title><link href="https://simonwillison.net/guides/agentic-engineering-patterns/using-git-with-coding-agents/#atom-tag" rel="alternate"/><published>2026-03-21T22:08:24+00:00</published><updated>2026-03-21T22:08:24+00:00</updated><id>https://simonwillison.net/guides/agentic-engineering-patterns/using-git-with-coding-agents/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;em&gt;&lt;a href="https://simonwillison.net/guides/agentic-engineering-patterns/"&gt;Agentic Engineering Patterns&lt;/a&gt; &amp;gt;&lt;/em&gt;&lt;/p&gt;
    &lt;p&gt;Git is a key tool for working with coding agents. Keeping code in version control lets us record how that code changes over time and investigate and reverse any mistakes. All of the coding agents are fluent in using Git's features, both basic and advanced.&lt;/p&gt;
&lt;p&gt;This fluency means we can be more ambitious about how we use Git ourselves. We don't need to memorize &lt;em&gt;how&lt;/em&gt; to do things with Git, but staying aware of what's possible means we can take advantage of the full suite of Git's abilities.&lt;/p&gt;
&lt;h2 id="git-essentials"&gt;Git essentials&lt;/h2&gt;
&lt;p&gt;Each Git project lives in a &lt;strong&gt;repository&lt;/strong&gt; - a folder on disk that can track changes made to the files within it. Those changes are recorded in &lt;strong&gt;commits&lt;/strong&gt; - timestamped bundles of changes to one or more files accompanied by a &lt;strong&gt;commit message&lt;/strong&gt; describing those changes and an &lt;strong&gt;author&lt;/strong&gt; recording who made them.&lt;/p&gt;
&lt;p&gt;Git supports &lt;strong&gt;branches&lt;/strong&gt;, which allow you to construct and experiment with new changes independently of each other. Branches can then be &lt;strong&gt;merged&lt;/strong&gt; back into your main branch (using various methods) once they are deemed ready.&lt;/p&gt;
&lt;p&gt;Git repositories can be &lt;strong&gt;cloned&lt;/strong&gt; onto a new machine, and that clone includes both the current files and the full history of changes to them.
This means developers - or coding agents - can browse and explore that history without any extra network traffic, making history diving effectively free.&lt;/p&gt;
&lt;p&gt;Git repositories can live just on your own machine, but Git is designed to support collaboration and backups by publishing them to a &lt;strong&gt;remote&lt;/strong&gt;, which can be public or private. GitHub is the most popular host for these remotes, but because Git is open source software you can host a remote on any machine or service that speaks the Git protocol.&lt;/p&gt;
&lt;h2 id="core-concepts-and-prompts"&gt;Core concepts and prompts&lt;/h2&gt;
&lt;p&gt;Coding agents all have a deep understanding of Git jargon. The following prompts should work with any of them:&lt;/p&gt;
&lt;p&gt;&lt;div&gt;&lt;markdown-copy&gt;&lt;textarea&gt;Start a new Git repo here&lt;/textarea&gt;&lt;/markdown-copy&gt;&lt;/div&gt;
To turn the folder the agent is working in into a Git repository - the agent will probably run the &lt;code&gt;git init&lt;/code&gt; command. If you just say "repo", agents will assume you mean a Git repository.&lt;/p&gt;
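&lt;p&gt;Here's a rough sketch of what that expands to, runnable in a scratch directory (the README content is a hypothetical stand-in for your files):&lt;/p&gt;

```shell
cd "$(mktemp -d)"                     # scratch directory standing in for your project
git init                              # create the hidden .git/ directory
git config user.email "you@example.com"
git config user.name "Your Name"
echo "# My project" | tee README.md   # hypothetical first file
git add -A                            # stage everything in the folder
git commit -m "Initial commit"        # record the first commit
```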
&lt;p&gt;&lt;div&gt;&lt;markdown-copy&gt;&lt;textarea&gt;Commit these changes&lt;/textarea&gt;&lt;/markdown-copy&gt;&lt;/div&gt;
Create a new Git commit to record the changes the agent has made - usually with the &lt;code&gt;git commit -m "commit message"&lt;/code&gt; command.&lt;/p&gt;
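&lt;p&gt;Under the hood that usually looks something like this sketch (the file and the message are hypothetical):&lt;/p&gt;

```shell
cd "$(mktemp -d)"                    # scratch repo for the demo
git init -q
git config user.email "you@example.com"
git config user.name "Your Name"
echo "print('fixed')" | tee app.py   # stands in for the agent's edits
git status --short                   # agents usually check what changed first
git add app.py                       # stage just the modified files
git commit -m "Fix the reported bug in app.py"
git log --oneline -1                 # confirm the commit landed
```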
&lt;p&gt;&lt;div&gt;&lt;markdown-copy&gt;&lt;textarea&gt;Add username/repo as a github remote&lt;/textarea&gt;&lt;/markdown-copy&gt;&lt;/div&gt;
This should configure your repository for GitHub. You'll need to create a new repo first using &lt;a href="https://github.com/new"&gt;github.com/new&lt;/a&gt;, and configure your machine to talk to GitHub.&lt;/p&gt;
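&lt;p&gt;The manual equivalent is a sketch like this - substitute your own username/repo, and note that the actual push needs working GitHub credentials on your machine:&lt;/p&gt;

```shell
cd "$(mktemp -d)"
git init -q
# Hypothetical remote URL - substitute your own username/repo:
git remote add origin git@github.com:username/repo.git
git remote -v               # confirm the remote is configured
# git push -u origin main   # first push - requires GitHub auth to be set up
```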
&lt;p&gt;&lt;div&gt;&lt;markdown-copy&gt;&lt;textarea&gt;Review changes made today&lt;/textarea&gt;&lt;/markdown-copy&gt;&lt;/div&gt;
Or "recent changes" or "last three commits".&lt;/p&gt;
&lt;p&gt;This is a great way to start a fresh coding agent session. Telling the agent to look at recent changes causes it to run &lt;code&gt;git log&lt;/code&gt;, which can instantly load its context with details of what you have been working on recently - both the modified code and the commit messages that describe it.&lt;/p&gt;
&lt;p&gt;Seeding the session in this way means you can start talking about that code - suggest additional fixes, ask questions about how it works, or propose the next change that builds on what came before.&lt;/p&gt;
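&lt;p&gt;The history commands the agent reaches for look something like this sketch, shown here against a hypothetical scratch repo with four commits:&lt;/p&gt;

```shell
cd "$(mktemp -d)"                    # build a toy repo with four commits
git init -q
git config user.email "you@example.com"
git config user.name "Your Name"
for n in 1 2 3 4; do
  echo "change $n" | tee "file$n.txt"
  git add .
  git commit -qm "Change $n"
done
git log --oneline -10                # recent commit messages
git diff --stat HEAD~3               # what changed across the last three commits
git log --since=midnight --oneline   # commits made today
```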
&lt;p&gt;&lt;div&gt;&lt;markdown-copy&gt;&lt;textarea&gt;Integrate latest changes from main&lt;/textarea&gt;&lt;/markdown-copy&gt;&lt;/div&gt;
Run this on your main branch to fetch other contributions from the remote repository, or run it in a branch to integrate the latest changes on main.&lt;/p&gt;
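&lt;p&gt;As a sketch, here are the two most common options played out in a hypothetical scratch repo (on your actual main branch, the first step would usually be a &lt;code&gt;git pull&lt;/code&gt;):&lt;/p&gt;

```shell
cd "$(mktemp -d)"
git init -qb main
git config user.email "you@example.com"
git config user.name "Your Name"
echo base | tee f.txt; git add .; git commit -qm "Base"
git switch -qc feature                        # start a feature branch
echo feature-work | tee g.txt; git add .; git commit -qm "Feature work"
git switch -q main                            # meanwhile, main moves on
echo main-work | tee h.txt; git add .; git commit -qm "More work on main"
git switch -q feature
git merge main -m "Merge main into feature"   # option 1: a merge commit
# git rebase main                             # option 2: replay this branch on top of main
```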
&lt;p&gt;There are multiple ways to merge changes, including a merge commit, a rebase, a squash or a fast-forward. If you can't remember the details of these, that's fine:
&lt;div&gt;&lt;markdown-copy&gt;&lt;textarea&gt;Discuss options for integrating changes from main&lt;/textarea&gt;&lt;/markdown-copy&gt;&lt;/div&gt;
Agents are great at explaining the pros and cons of different merging strategies, and almost anything in Git can be undone, so there's minimal risk in trying new things.
&lt;div&gt;&lt;markdown-copy&gt;&lt;textarea&gt;Sort out this git mess for me&lt;/textarea&gt;&lt;/markdown-copy&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;I use this universal prompt surprisingly often! Here's &lt;a href="https://gisthost.github.io/?2aa2ee2fbd08d272528bbfc3b54a1a7d/page-001.html"&gt;a recent example&lt;/a&gt; where it fixed a cherry-pick for me that failed with a merge conflict.&lt;/p&gt;
&lt;p&gt;There are plenty of ways you can get into a mess with Git, often through pulls or rebases that end in a merge conflict, or just through adding the wrong things to Git's staging area.&lt;/p&gt;
&lt;p&gt;Unpicking those used to be one of the most difficult and time-consuming parts of working with Git. No more! Coding agents can navigate the most Byzantine of merge conflicts, reasoning through the intent of the new code and figuring out what to keep and how to combine conflicting changes. If your code has automated tests (and &lt;a href="https://simonwillison.net/guides/agentic-engineering-patterns/red-green-tdd/"&gt;it should&lt;/a&gt;) the agent can ensure those pass before finalizing the merge.&lt;/p&gt;
&lt;p&gt;&lt;div&gt;&lt;markdown-copy&gt;&lt;textarea&gt;Find and recover my code that does ...&lt;/textarea&gt;&lt;/markdown-copy&gt;&lt;/div&gt;
If you lose code you were working on that was previously committed (or saved with &lt;code&gt;git stash&lt;/code&gt;), your agent can probably find it for you.&lt;/p&gt;
&lt;p&gt;Git has a mechanism called the &lt;code&gt;reflog&lt;/code&gt; which can often capture details of code that hasn't been committed to a permanent branch. Agents can search that, and search other branches too.&lt;/p&gt;
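&lt;p&gt;The places an agent typically looks are sketched below, against a hypothetical scratch repo (the function name is made up):&lt;/p&gt;

```shell
cd "$(mktemp -d)"
git init -q
git config user.email "you@example.com"
git config user.name "Your Name"
echo "def some_function(): pass" | tee lib.py   # hypothetical "lost" code
git add .
git commit -qm "Add some_function"
git reflog                                   # every recent position of HEAD
git stash list                               # anything parked with git stash
git log --all -S "some_function" --oneline   # commits anywhere touching that string
```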
&lt;p&gt;Just tell them what to find and watch them dive in.&lt;/p&gt;
&lt;p&gt;&lt;div&gt;&lt;markdown-copy&gt;&lt;textarea&gt;Use git bisect to find when this bug was introduced: ...&lt;/textarea&gt;&lt;/markdown-copy&gt;&lt;/div&gt;
Git bisect is one of the most powerful debugging tools in Git's arsenal, but its relatively steep learning curve often deters developers from using it.&lt;/p&gt;
&lt;p&gt;When you run a bisect operation you provide Git with some kind of test condition plus a known-good commit and a known-bad commit. Git then runs a binary search to identify the earliest commit at which your test condition fails.&lt;/p&gt;
&lt;p&gt;This can efficiently answer the question "what first caused this bug?" The only downside is the need to express the test for the bug in a format that Git bisect can execute.&lt;/p&gt;
&lt;p&gt;Coding agents can handle this boilerplate for you. This upgrades Git bisect from an occasional-use tool to one you can deploy any time you are curious about the historical behavior of your software.&lt;/p&gt;
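&lt;p&gt;Here's a self-contained sketch of the mechanics, using a toy history where the sixth of ten commits introduces the "bug":&lt;/p&gt;

```shell
cd "$(mktemp -d)"
git init -q
git config user.email "you@example.com"
git config user.name "Your Name"
# Build ten commits; from commit 6 onwards status.txt says "bug":
for n in 1 2 3 4 5 6 7 8 9 10; do
  if [ "$n" -ge 6 ]; then echo "bug $n" | tee status.txt; else echo "ok $n" | tee status.txt; fi
  git add .
  git commit -qm "Commit $n"
done
# Bad commit first, then a known-good commit (here: the very first one):
git bisect start HEAD "$(git rev-list --max-parents=0 HEAD)"
# Any command that exits 0 on good commits and non-zero on bad ones works:
git bisect run grep -q ok status.txt   # binary-searches to "Commit 6"
git bisect reset
```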
&lt;h2 id="rewriting-history"&gt;Rewriting history&lt;/h2&gt;
&lt;p&gt;Let's get into the fun advanced stuff.&lt;/p&gt;
&lt;p&gt;The commit history of a Git repository is not fixed. The data is just files on disk after all (tucked away in a hidden &lt;code&gt;.git/&lt;/code&gt; directory), and Git itself provides tools that can be used to modify that history.&lt;/p&gt;
&lt;p&gt;Don't think of the Git history as a permanent record of what actually happened - instead consider it to be a deliberately authored story that describes the progression of the software project.&lt;/p&gt;
&lt;p&gt;This story is a tool to aid future development. Permanently recording mistakes and cancelled directions can sometimes be useful, but repository authors can make editorial decisions about what to keep and how best to capture that history.&lt;/p&gt;
&lt;p&gt;Coding agents are really good at using Git's advanced history rewriting features.&lt;/p&gt;
&lt;h3 id="undo-or-rewrite-commits"&gt;Undo or rewrite commits&lt;/h3&gt;
&lt;p&gt;&lt;div&gt;&lt;markdown-copy&gt;&lt;textarea&gt;Undo last commit&lt;/textarea&gt;&lt;/markdown-copy&gt;&lt;/div&gt;
It's common to commit code and then regret it - realize that it includes a file you didn't mean to include, for example. The git recipe for this is &lt;code&gt;git reset --soft HEAD~1&lt;/code&gt;. I've never been able to remember that, and now I don't have to!&lt;/p&gt;
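&lt;p&gt;A sketch of that recipe in action, in a hypothetical scratch repo:&lt;/p&gt;

```shell
cd "$(mktemp -d)"
git init -q
git config user.email "you@example.com"
git config user.name "Your Name"
echo good | tee f.txt; git add .; git commit -qm "Good commit"
echo oops | tee secret.txt; git add .; git commit -qm "Committed too much"
git reset --soft HEAD~1   # rewind the commit but keep the changes staged
git status --short        # secret.txt is back in the staging area - nothing lost
# git reset --hard HEAD~1 would discard the changes entirely - use with care
```

&lt;p&gt;If the commit you are undoing has already been pushed, rewriting it means your next push will need to be a force-push.&lt;/p&gt;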
&lt;p&gt;&lt;div&gt;&lt;markdown-copy&gt;&lt;textarea&gt;Remove uv.lock from that last commit&lt;/textarea&gt;&lt;/markdown-copy&gt;&lt;/div&gt;
You can also perform finer-grained surgery on commits - rewriting them to remove just a single file, for example.&lt;/p&gt;
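&lt;p&gt;One way an agent might do that particular piece of surgery (a sketch, with hypothetical files):&lt;/p&gt;

```shell
cd "$(mktemp -d)"
git init -q
git config user.email "you@example.com"
git config user.name "Your Name"
echo code | tee app.py
echo lock | tee uv.lock
git add .
git commit -qm "Add feature"
git rm --cached uv.lock        # stage the removal without deleting it from disk
git commit --amend --no-edit   # rewrite the last commit without uv.lock
git show --stat HEAD           # confirm only app.py remains in the commit
```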
&lt;p&gt;&lt;div&gt;&lt;markdown-copy&gt;&lt;textarea&gt;Combine last three commits with a better commit message&lt;/textarea&gt;&lt;/markdown-copy&gt;&lt;/div&gt;
Agents can rewrite commit messages and can combine multiple commits into a single unit.&lt;/p&gt;
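&lt;p&gt;A minimal sketch of one way to do that squash (the commit messages are hypothetical):&lt;/p&gt;

```shell
cd "$(mktemp -d)"
git init -q
git config user.email "you@example.com"
git config user.name "Your Name"
git commit -q --allow-empty -m "Base"
for n in 1 2 3; do
  echo "$n" | tee "part$n.txt"
  git add .
  git commit -qm "WIP $n"
done
git reset --soft HEAD~3   # rewind past the three WIP commits, keeping their changes staged
git commit -m "Add the feature as one tidy commit"
git log --oneline         # now just two commits: Base and the combined one
```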
&lt;p&gt;I've found that frontier models usually have really good taste in commit messages. I used to insist on writing these myself, but I've accepted that what they produce is generally good enough - often better than what I would have written myself.&lt;/p&gt;
&lt;h3 id="building-a-new-repository-from-scraps-of-an-older-one"&gt;Building a new repository from scraps of an older one&lt;/h3&gt;
&lt;p&gt;A trick I find myself using quite often is extracting code from a larger repository into a new one while maintaining the key history of that code.&lt;/p&gt;
&lt;p&gt;One common example is library extraction. I may have built some classes and functions into a project and later realized they would make more sense as a standalone reusable code library.&lt;/p&gt;
&lt;p&gt;&lt;div&gt;&lt;markdown-copy&gt;&lt;textarea&gt;Start a new repo at /tmp/distance-functions and build a Python library there with the lib/distance_functions.py module from here - build a similar commit history copying the author and commit dates in the new repo&lt;/textarea&gt;&lt;/markdown-copy&gt;&lt;/div&gt;
This kind of operation used to be involved enough that most developers would create a fresh copy detached from that old commit history. We don't have to settle for that any more!&lt;/p&gt;
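&lt;p&gt;The core trick for preserving authorship and dates is that Git lets you set both when creating a commit. A sketch of a single replayed commit, with hypothetical names and values:&lt;/p&gt;

```shell
cd "$(mktemp -d)"
git init -q
# Setting the repo identity stands in for the original commit's author here:
git config user.name "Original Author"
git config user.email "orig@example.com"
mkdir -p lib
echo "def distance(a, b): return abs(a - b)" | tee lib/distance_functions.py
git add .
# Both the author date and the committer date can be supplied via environment variables:
GIT_AUTHOR_DATE="2023-05-01 12:00:00 +0000" GIT_COMMITTER_DATE="2023-05-01 12:00:00 +0000" \
  git commit -qm "Extract distance functions into their own library"
git log --format="%an %ad" --date=short   # shows the preserved author and date
```

&lt;p&gt;An agent replaying history would repeat this for each original commit touching the file, oldest first.&lt;/p&gt;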
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/agentic-engineering"&gt;agentic-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="coding-agents"/><category term="generative-ai"/><category term="github"/><category term="agentic-engineering"/><category term="ai"/><category term="git"/><category term="llms"/></entry><entry><title>TIL: Downloading archived Git repositories from archive.softwareheritage.org</title><link href="https://simonwillison.net/2025/Dec/30/software-heritage/#atom-tag" rel="alternate"/><published>2025-12-30T23:51:33+00:00</published><updated>2025-12-30T23:51:33+00:00</updated><id>https://simonwillison.net/2025/Dec/30/software-heritage/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://til.simonwillison.net/github/software-archive-recovery"&gt;TIL: Downloading archived Git repositories from archive.softwareheritage.org&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Back in February I &lt;a href="https://simonwillison.net/2025/Feb/7/sqlite-s3vfs/"&gt;blogged about&lt;/a&gt; a neat Python library called &lt;code&gt;sqlite-s3vfs&lt;/code&gt; for accessing SQLite databases hosted in an S3 bucket, released as MIT licensed open source by the UK government's Department for Business and Trade.&lt;/p&gt;
&lt;p&gt;I went looking for it today and found that the &lt;a href="https://github.com/uktrade/sqlite-s3vfs"&gt;github.com/uktrade/sqlite-s3vfs&lt;/a&gt; repository is now a 404.&lt;/p&gt;
&lt;p&gt;Since this is taxpayer-funded open source software I saw it as my moral duty to try and restore access! It turns out &lt;a href="https://archive.softwareheritage.org/browse/origin/directory/?origin_url=https://github.com/uktrade/sqlite-s3vfs"&gt;a full copy&lt;/a&gt; had been captured by &lt;a href="https://archive.softwareheritage.org/"&gt;the Software Heritage archive&lt;/a&gt;, so I was able to restore  the repository from there. My copy is now archived at &lt;a href="https://github.com/simonw/sqlite-s3vfs"&gt;simonw/sqlite-s3vfs&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The process for retrieving an archive was non-obvious, so I've written up a TIL and also published a new &lt;a href="https://tools.simonwillison.net/software-heritage-repo#https%3A%2F%2Fgithub.com%2Fuktrade%2Fsqlite-s3vfs"&gt;Software Heritage Repository Retriever&lt;/a&gt; tool which takes advantage of the CORS-enabled APIs provided by Software Heritage. Here's &lt;a href="https://gistpreview.github.io/?3a76a868095c989d159c226b7622b092/index.html"&gt;the Claude Code transcript&lt;/a&gt; from building that.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=46435308#46438857"&gt;Hacker News comment&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/archives"&gt;archives&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tools"&gt;tools&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/til"&gt;til&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-code"&gt;claude-code&lt;/a&gt;&lt;/p&gt;



</summary><category term="archives"/><category term="git"/><category term="github"/><category term="open-source"/><category term="tools"/><category term="ai"/><category term="til"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="claude-code"/></entry><entry><title>simonw/git-scraper-template</title><link href="https://simonwillison.net/2025/Feb/26/git-scraper-template/#atom-tag" rel="alternate"/><published>2025-02-26T05:34:05+00:00</published><updated>2025-02-26T05:34:05+00:00</updated><id>https://simonwillison.net/2025/Feb/26/git-scraper-template/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/git-scraper-template"&gt;simonw/git-scraper-template&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I built this new GitHub template repository in preparation for a workshop I'm giving at &lt;a href="https://www.ire.org/training/conferences/nicar-2025/"&gt;NICAR&lt;/a&gt; (the data journalism conference) next week on &lt;a href="https://github.com/simonw/nicar-2025-scraping/"&gt;Cutting-edge web scraping techniques&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;One of the topics I'll be covering is &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt; - creating a GitHub repository that uses scheduled GitHub Actions workflows to grab copies of websites and data feeds and store their changes over time using Git.&lt;/p&gt;
&lt;p&gt;This template repository is designed to be the fastest possible way to get started with a new Git scraper: simply &lt;a href="https://github.com/new?template_name=git-scraper-template&amp;amp;template_owner=simonw"&gt;create a new repository from the template&lt;/a&gt;, paste the URL you want to scrape into the &lt;strong&gt;description&lt;/strong&gt; field and the repository will be initialized with a custom script that scrapes and stores that URL.&lt;/p&gt;
&lt;p&gt;It's modeled after my earlier &lt;a href="https://github.com/simonw/shot-scraper-template"&gt;shot-scraper-template&lt;/a&gt; tool which I described in detail in &lt;a href="https://simonwillison.net/2022/Mar/14/shot-scraper-template/"&gt;Instantly create a GitHub repository to take screenshots of a web page&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The new &lt;code&gt;git-scraper-template&lt;/code&gt; repo took &lt;a href="https://github.com/simonw/git-scraper-template/issues/2#issuecomment-2683871054"&gt;some help from Claude&lt;/a&gt; to figure out. It uses a &lt;a href="https://github.com/simonw/git-scraper-template/blob/a2b12972584099d7c793ee4b38303d94792bf0f0/download.sh"&gt;custom script&lt;/a&gt; to download the provided URL and derive a filename to use based on the URL and the content type, detected using &lt;code&gt;file --mime-type -b "$file_path"&lt;/code&gt; against the downloaded file.&lt;/p&gt;
&lt;p&gt;It also detects if the downloaded content is JSON and, if it is, pretty-prints it using &lt;code&gt;jq&lt;/code&gt; - I find this is a quick way to generate much more useful diffs when the content changes.&lt;/p&gt;
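&lt;p&gt;A simplified sketch of that detection step (assuming &lt;code&gt;file&lt;/code&gt; and &lt;code&gt;jq&lt;/code&gt; are installed; the file name and content are hypothetical):&lt;/p&gt;

```shell
cd "$(mktemp -d)"
file_path="data"
printf '{"a": 1, "b": [2, 3]}' | tee "$file_path"
mime="$(file --mime-type -b "$file_path")"   # e.g. application/json
echo "$mime"
if [ "$mime" = "application/json" ]; then
  jq . "$file_path" | tee pretty.tmp         # pretty-print for nicer diffs
  mv pretty.tmp "$file_path"                 # replace the original in place
fi
cat "$file_path"
```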


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nicar"&gt;nicar&lt;/a&gt;&lt;/p&gt;



</summary><category term="data-journalism"/><category term="git"/><category term="github"/><category term="projects"/><category term="scraping"/><category term="github-actions"/><category term="git-scraping"/><category term="nicar"/></entry><entry><title>otterwiki</title><link href="https://simonwillison.net/2024/Oct/9/otterwiki/#atom-tag" rel="alternate"/><published>2024-10-09T15:22:04+00:00</published><updated>2024-10-09T15:22:04+00:00</updated><id>https://simonwillison.net/2024/Oct/9/otterwiki/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/redimp/otterwiki"&gt;otterwiki&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;It's been a while since I've seen a new-ish Wiki implementation, and this one by  Ralph Thesen is really nice. It's written in Python (Flask + SQLAlchemy + &lt;a href="https://github.com/lepture/mistune"&gt;mistune&lt;/a&gt; for Markdown + &lt;a href="https://github.com/gitpython-developers/GitPython"&gt;GitPython&lt;/a&gt;) and keeps all of the actual wiki content as Markdown files in a local Git repository.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://otterwiki.com/Installation"&gt;installation instructions&lt;/a&gt; are a little in-depth as they assume a production installation with Docker or systemd - I figured out &lt;a href="https://github.com/redimp/otterwiki/issues/146"&gt;this recipe&lt;/a&gt; for trying it locally using &lt;code&gt;uv&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git clone https://github.com/redimp/otterwiki.git
cd otterwiki

mkdir -p app-data/repository
git init app-data/repository

echo "REPOSITORY='${PWD}/app-data/repository'" &amp;gt;&amp;gt; settings.cfg
echo "SQLALCHEMY_DATABASE_URI='sqlite:///${PWD}/app-data/db.sqlite'" &amp;gt;&amp;gt; settings.cfg
echo "SECRET_KEY='$(echo $RANDOM | md5sum | head -c 16)'" &amp;gt;&amp;gt; settings.cfg

export OTTERWIKI_SETTINGS=$PWD/settings.cfg
uv run --with gunicorn gunicorn --bind 127.0.0.1:8080 otterwiki.server:app
&lt;/code&gt;&lt;/pre&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=41749680"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/flask"&gt;flask&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlalchemy"&gt;sqlalchemy&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/markdown"&gt;markdown&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/wikis"&gt;wikis&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;&lt;/p&gt;



</summary><category term="flask"/><category term="git"/><category term="python"/><category term="sqlalchemy"/><category term="sqlite"/><category term="markdown"/><category term="wikis"/><category term="uv"/></entry><entry><title>Why GitHub Actually Won</title><link href="https://simonwillison.net/2024/Sep/9/why-github-actually-won/#atom-tag" rel="alternate"/><published>2024-09-09T17:16:22+00:00</published><updated>2024-09-09T17:16:22+00:00</updated><id>https://simonwillison.net/2024/Sep/9/why-github-actually-won/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.gitbutler.com/why-github-actually-won/"&gt;Why GitHub Actually Won&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;GitHub co-founder Scott Chacon shares some thoughts on how GitHub won the open source code hosting market. Shortened to two words: timing, and taste.&lt;/p&gt;
&lt;p&gt;There are some interesting numbers in here. I hadn't realized that when GitHub launched in 2008 the term "open source" had only been coined ten years earlier, in 1998. &lt;a href="https://dirkriehle.com/publications/2008-selected/the-total-growth-of-open-source/comment-page-1/"&gt;This paper&lt;/a&gt; by Dirk Riehle estimates there were 18,000 open source projects in 2008 - Scott points out that today there are over 280 million public repositories on GitHub alone.&lt;/p&gt;
&lt;p&gt;Scott's conclusion:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We were there when a new paradigm was being born and we approached the problem of helping people embrace that new paradigm with a developer experience centric approach that nobody else had the capacity for or interest in.&lt;/p&gt;
&lt;/blockquote&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=41490161"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;&lt;/p&gt;



</summary><category term="git"/><category term="github"/><category term="open-source"/></entry><entry><title>AI-powered Git Commit Function</title><link href="https://simonwillison.net/2024/Aug/26/ai-powered-git-commit-function/#atom-tag" rel="alternate"/><published>2024-08-26T01:06:59+00:00</published><updated>2024-08-26T01:06:59+00:00</updated><id>https://simonwillison.net/2024/Aug/26/ai-powered-git-commit-function/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://gist.github.com/karpathy/1dd0294ef9567971c1e4348a90d69285"&gt;AI-powered Git Commit Function&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Andrej Karpathy built a shell alias, &lt;code&gt;gcm&lt;/code&gt;, which passes your staged Git changes to an LLM via my &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; tool, generates a short commit message and then asks you if you want to "(a)ccept, (e)dit, (r)egenerate, or (c)ancel?".&lt;/p&gt;
&lt;p&gt;Here's the incantation he's using to generate that commit message:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;git diff --cached &lt;span class="pl-k"&gt;|&lt;/span&gt; llm &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;Below is a diff of all staged changes, coming from the command:&lt;/span&gt;
&lt;span class="pl-s"&gt;\`\`\`&lt;/span&gt;
&lt;span class="pl-s"&gt;git diff --cached&lt;/span&gt;
&lt;span class="pl-s"&gt;\`\`\`&lt;/span&gt;
&lt;span class="pl-s"&gt;Please generate a concise, one-line commit message for these changes.&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This pipes the data into LLM (using the default model, currently &lt;code&gt;gpt-4o-mini&lt;/code&gt; unless you &lt;a href="https://llm.datasette.io/en/stable/setup.html#setting-a-custom-default-model"&gt;set it to something else&lt;/a&gt;) and then appends the prompt telling it what to do with that input.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/karpathy/status/1827810695658029262"&gt;@karpathy&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/andrej-karpathy"&gt;andrej-karpathy&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-engineering"&gt;prompt-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;&lt;/p&gt;



</summary><category term="git"/><category term="ai"/><category term="andrej-karpathy"/><category term="prompt-engineering"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="llm"/></entry><entry><title>EpicEnv</title><link href="https://simonwillison.net/2024/Aug/3/epicenv/#atom-tag" rel="alternate"/><published>2024-08-03T00:31:33+00:00</published><updated>2024-08-03T00:31:33+00:00</updated><id>https://simonwillison.net/2024/Aug/3/epicenv/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/danthegoodman1/EpicEnv"&gt;EpicEnv&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Dan Goodman's tool for managing shared secrets via a Git repository. This uses a really neat trick: you can run &lt;code&gt;epicenv invite githubuser&lt;/code&gt; and the tool will retrieve that user's public key from &lt;code&gt;github.com/{username}.keys&lt;/code&gt; (&lt;a href="https://github.com/simonw.keys"&gt;here's mine&lt;/a&gt;) and use that to encrypt the secrets such that the user can decrypt them with their private key.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://lobste.rs/s/gruxeg/epicenv_local_environment_variable"&gt;lobste.rs&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/encryption"&gt;encryption&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;&lt;/p&gt;



</summary><category term="encryption"/><category term="git"/></entry><entry><title>1991-WWW-NeXT-Implementation on GitHub</title><link href="https://simonwillison.net/2024/Aug/1/www-next-implementation-on-github/#atom-tag" rel="alternate"/><published>2024-08-01T21:15:29+00:00</published><updated>2024-08-01T21:15:29+00:00</updated><id>https://simonwillison.net/2024/Aug/1/www-next-implementation-on-github/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/1991-WWW-NeXT-Implementation"&gt;1991-WWW-NeXT-Implementation on GitHub&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I fell down a bit of a rabbit hole today trying to answer &lt;a href="https://simonwillison.net/2024/Aug/1/august-1st-world-wide-web-day/"&gt;that question about when World Wide Web Day was first celebrated&lt;/a&gt;. I found my way to &lt;a href="https://www.w3.org/History/1991-WWW-NeXT/Implementation/"&gt;www.w3.org/History/1991-WWW-NeXT/Implementation/&lt;/a&gt; - an Apache directory listing of the source code for Tim Berners-Lee's original WorldWideWeb application for NeXT!&lt;/p&gt;
&lt;p&gt;The code wasn't particularly easy to browse: clicking a &lt;code&gt;.m&lt;/code&gt; file would trigger a download rather than showing the code in the browser, and there were no niceties like syntax highlighting.&lt;/p&gt;
&lt;p&gt;So I decided to mirror that code to a &lt;a href="https://github.com/simonw/1991-WWW-NeXT-Implementation"&gt;new repository on GitHub&lt;/a&gt;. I grabbed the code using &lt;code&gt;wget -r&lt;/code&gt; and was delighted to find that the last modified dates (from the early 1990s) were preserved ... which made me want to preserve them in the GitHub repo too.&lt;/p&gt;
&lt;p&gt;I used Claude to write a Python script to back-date those commits, and wrote up what I learned in this new TIL: &lt;a href="https://til.simonwillison.net/git/backdate-git-commits"&gt;Back-dating Git commits based on file modification dates&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;End result: I now have a repo with Tim's original code, plus commit dates that reflect when that code was last modified.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Three commits credited to Tim Berners-Lee, in 1995, 1994 and 1993" src="https://static.simonwillison.net/static/2024/tbl-commits.jpg" /&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/history"&gt;history&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tim-berners-lee"&gt;tim-berners-lee&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/w3c"&gt;w3c&lt;/a&gt;&lt;/p&gt;



</summary><category term="git"/><category term="github"/><category term="history"/><category term="tim-berners-lee"/><category term="w3c"/></entry><entry><title>AWS CodeCommit quietly deprecated</title><link href="https://simonwillison.net/2024/Jul/30/aws-codecommit-quietly-deprecated/#atom-tag" rel="alternate"/><published>2024-07-30T05:51:42+00:00</published><updated>2024-07-30T05:51:42+00:00</updated><id>https://simonwillison.net/2024/Jul/30/aws-codecommit-quietly-deprecated/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://repost.aws/questions/QUshILm0xbTjWJZSD8afYVgA/codecommit-cannot-create-a-repository"&gt;AWS CodeCommit quietly deprecated&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;CodeCommit is AWS's Git hosting service. From a reply by an AWS employee in this forum thread:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Beginning on 06 June 2024, AWS CodeCommit ceased onboarding new customers. Going forward, only customers who have an existing repository in AWS CodeCommit will be able to create additional repositories.&lt;/p&gt;
&lt;p&gt;[...] If you would like to use AWS CodeCommit in a new AWS account that is part of your AWS Organization, please let us know so that we can evaluate the request for allowlisting the new account. If you would like to use an alternative to AWS CodeCommit given this news, we recommend using GitLab, GitHub, or another third party source provider of your choice.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;What's weird about this is that, as far as I can tell, this is the first official public acknowledgement from AWS that CodeCommit is no longer accepting customers. The &lt;a href="https://aws.amazon.com/codecommit/"&gt;CodeCommit landing page&lt;/a&gt; continues to promote the product, though it does link to the &lt;a href="https://aws.amazon.com/blogs/devops/how-to-migrate-your-aws-codecommit-repository-to-another-git-provider/"&gt;How to migrate your AWS CodeCommit repository to another Git provider&lt;/a&gt; blog post from July 25th, which gives no direct indication that CodeCommit is being quietly sunset.&lt;/p&gt;
&lt;p&gt;I wonder how long they'll continue to support their existing customers?&lt;/p&gt;
&lt;h4 id="aws-qldb"&gt;Amazon QLDB too&lt;/h4&gt;

&lt;p&gt;It looks like AWS may be having a bit of a clear-out. &lt;a href="https://aws.amazon.com/qldb/"&gt;Amazon QLDB&lt;/a&gt; - Quantum Ledger Database (a blockchain-adjacent immutable ledger, launched in 2019) - quietly put out a deprecation announcement &lt;a href="https://docs.aws.amazon.com/qldb/latest/developerguide/document-history.html"&gt;in their release history on July 18th&lt;/a&gt; (again, no official announcement elsewhere):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;End of support notice: Existing customers will be able to use Amazon QLDB until end of support on 07/31/2025. For more details, see &lt;a href="https://aws.amazon.com/blogs/database/migrate-an-amazon-qldb-ledger-to-amazon-aurora-postgresql/"&gt;Migrate an Amazon QLDB Ledger to Amazon Aurora PostgreSQL&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This one is more surprising, because migrating to a different Git host is massively less work than entirely rewriting a system to use a fundamentally different database.&lt;/p&gt;
&lt;p&gt;It turns out there's an infrequently updated community GitHub repo called &lt;a href="https://github.com/SummitRoute/aws_breaking_changes"&gt;SummitRoute/aws_breaking_changes&lt;/a&gt; which tracks these kinds of changes. Other services listed there include CodeStar, Cloud9, CloudSearch, OpsWorks, Workdocs and Snowmobile, and they cleverly (ab)use the GitHub releases mechanism to provide an &lt;a href="https://github.com/SummitRoute/aws_breaking_changes/releases.atom"&gt;Atom feed&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=41104997"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/aws"&gt;aws&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/blockchain"&gt;blockchain&lt;/a&gt;&lt;/p&gt;



</summary><category term="aws"/><category term="git"/><category term="blockchain"/></entry><entry><title>Quoting Ed Page</title><link href="https://simonwillison.net/2024/Jul/13/ed-page/#atom-tag" rel="alternate"/><published>2024-07-13T05:28:01+00:00</published><updated>2024-07-13T05:28:01+00:00</updated><id>https://simonwillison.net/2024/Jul/13/ed-page/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://news.ycombinator.com/item?id=40949229#40951540"&gt;&lt;p&gt;Add tests in a commit &lt;em&gt;before&lt;/em&gt; the fix. They should pass, showing the behavior before your change. Then, the commit with your change will update the tests. The diff between these commits represents the change in behavior. This helps the author test their tests (I've written tests thinking they covered the relevant case but didn't), the reviewer to more precisely see the change in behavior and comment on it, and the wider community to understand what the PR description is about.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://news.ycombinator.com/item?id=40949229#40951540"&gt;Ed Page&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;&lt;/p&gt;



</summary><category term="git"/></entry><entry><title>uv pip install --exclude-newer example</title><link href="https://simonwillison.net/2024/May/10/uv-pip-install-exclude-newer/#atom-tag" rel="alternate"/><published>2024-05-10T16:35:40+00:00</published><updated>2024-05-10T16:35:40+00:00</updated><id>https://simonwillison.net/2024/May/10/uv-pip-install-exclude-newer/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/hauntsaninja/typing_extensions/blob/f694a4e2effdd2179f76e886498ffd3446e96b0b/.github/workflows/third_party.yml#L111"&gt;uv pip install --exclude-newer example&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A neat new feature of the &lt;code&gt;uv pip install&lt;/code&gt; command is the &lt;code&gt;--exclude-newer&lt;/code&gt; option, which can be used to avoid installing any package versions released after the specified date.&lt;/p&gt;
&lt;p&gt;Here's a clever example of that in use from the &lt;code&gt;typing_extensions&lt;/code&gt; package's CI tests that run against some downstream packages:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;uv pip install --system -r test-requirements.txt --exclude-newer $(git show -s --date=format:'%Y-%m-%dT%H:%M:%SZ' --format=%cd HEAD)&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;They use &lt;code&gt;git show&lt;/code&gt; to get the date of the most recent commit (&lt;code&gt;%cd&lt;/code&gt; means commit date) formatted as an ISO timestamp, then pass that to &lt;code&gt;--exclude-newer&lt;/code&gt;.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/hauntsaninja/status/1788848732437713171"&gt;@hauntsaninja&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pip"&gt;pip&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/astral"&gt;astral&lt;/a&gt;&lt;/p&gt;



</summary><category term="git"/><category term="pip"/><category term="python"/><category term="uv"/><category term="astral"/></entry><entry><title>Use an llm to automagically generate meaningful git commit messages</title><link href="https://simonwillison.net/2024/Apr/11/use-an-llm-to-automagically-generate-meaningful-git-commit-messa/#atom-tag" rel="alternate"/><published>2024-04-11T04:06:15+00:00</published><updated>2024-04-11T04:06:15+00:00</updated><id>https://simonwillison.net/2024/Apr/11/use-an-llm-to-automagically-generate-meaningful-git-commit-messa/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://harper.blog/2024/03/11/use-an-llm-to-automagically-generate-meaningful-git-commit-messages/"&gt;Use an llm to automagically generate meaningful git commit messages&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Neat, thoroughly documented recipe by Harper Reed that uses my LLM CLI tool as part of a scheme for when you’re feeling too lazy to write a commit message: a &lt;code&gt;prepare-commit-msg&lt;/code&gt; Git hook runs any time you commit without a message and pipes your changes to a model along with a custom system prompt.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;&lt;/p&gt;



</summary><category term="cli"/><category term="git"/><category term="ai"/><category term="generative-ai"/><category term="llm"/></entry><entry><title>How I use git worktrees</title><link href="https://simonwillison.net/2024/Mar/6/how-i-use-git-worktrees/#atom-tag" rel="alternate"/><published>2024-03-06T15:21:31+00:00</published><updated>2024-03-06T15:21:31+00:00</updated><id>https://simonwillison.net/2024/Mar/6/how-i-use-git-worktrees/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://notes.billmill.org/blog/2024/03/How_I_use_git_worktrees.html"&gt;How I use git worktrees&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;TIL about worktrees, a Git feature that lets you have multiple repository branches checked out to separate directories at the same time.&lt;/p&gt;

&lt;p&gt;The default UI for them is a little unergonomic (classic Git) but Bill Mill here shares a neat utility script for managing them in a more convenient way.&lt;/p&gt;

&lt;p&gt;One particularly neat trick: Bill’s “worktree” Bash script checks for a node_modules folder and, if one exists, duplicates it to the new directory using copy-on-write, saving you from having to run yet another lengthy “npm install”.&lt;/p&gt;
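&lt;p&gt;If you haven't tried worktrees, the core commands are worth seeing. This minimal sketch (directory and branch names invented) checks a new branch out into a sibling directory without disturbing the original checkout:&lt;/p&gt;

```shell
# Minimal worktree demo: two branches checked out side by side
mkdir -p demo-wt
git -C demo-wt init -q
git -C demo-wt config user.email demo@example.com
git -C demo-wt config user.name Demo
echo 'hello' > demo-wt/README.md
git -C demo-wt add README.md
git -C demo-wt commit -q -m 'Initial commit'

# Create branch "feature" and check it out into a separate directory
git -C demo-wt worktree add -b feature ../demo-wt-feature

# Both checkouts share one repository; list them
git -C demo-wt worktree list
```

&lt;p&gt;Each worktree gets its own working directory and index, so you can run tests in one while editing in the other.&lt;/p&gt;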

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://lobste.rs/s/ikbbnt/how_i_use_git_worktrees"&gt;lobste.rs&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;&lt;/p&gt;



</summary><category term="git"/></entry><entry><title>Figure out who's leaving the company: dump, diff, repeat</title><link href="https://simonwillison.net/2024/Feb/9/figure-out-whos-leaving-the-company/#atom-tag" rel="alternate"/><published>2024-02-09T05:44:31+00:00</published><updated>2024-02-09T05:44:31+00:00</updated><id>https://simonwillison.net/2024/Feb/9/figure-out-whos-leaving-the-company/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://rachelbythebay.com/w/2024/02/08/ldap/"&gt;Figure out who&amp;#x27;s leaving the company: dump, diff, repeat&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Rachel Kroll describes a neat hack for companies with an internal LDAP server or similar machine-readable employee directory: run a cron somewhere internal that grabs the latest version and diffs it against the previous to figure out who has joined or left the company.&lt;/p&gt;
&lt;p&gt;I suggest using Git for this - a form of Git scraping - as then you get a detailed commit log of changes over time effectively for free.&lt;/p&gt;
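&lt;p&gt;The whole loop fits in a few lines of shell. In this sketch the &lt;code&gt;snapshot&lt;/code&gt; function is a stand-in for the real directory dump (for example an &lt;code&gt;ldapsearch&lt;/code&gt; invocation piped through &lt;code&gt;sort&lt;/code&gt;), and it only commits when the output actually changed:&lt;/p&gt;

```shell
# Sketch of the Git-scraping variant (paths and the snapshot are illustrative)
mkdir -p demo-scrape
git -C demo-scrape init -q
git -C demo-scrape config user.email demo@example.com
git -C demo-scrape config user.name Demo

snapshot() {
    # Stand-in for the real employee-directory dump
    printf 'alice\nbob\n'
}

# The cron job body: dump the directory, commit only if something changed
snapshot > demo-scrape/directory.txt
git -C demo-scrape add directory.txt
git -C demo-scrape diff --quiet --cached 2>/dev/null || \
    git -C demo-scrape commit -q -m "Directory snapshot $(date -u +%F)"
```

&lt;p&gt;Run on a schedule, &lt;code&gt;git log -p directory.txt&lt;/code&gt; becomes the joiners-and-leavers history for free.&lt;/p&gt;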
&lt;p&gt;I really enjoyed Rachel's closing thought:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Incidentally, if someone gets mad about you running this sort of thing, you probably don't want to work there anyway. On the other hand, if you're able to build such tools without IT or similar getting "threatened" by it, then you might be somewhere that actually enjoys creating interesting and useful stuff. Treasure such places. They don't tend to last.&lt;/p&gt;
&lt;/blockquote&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=39311507"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rachel-kroll"&gt;rachel-kroll&lt;/a&gt;&lt;/p&gt;



</summary><category term="git"/><category term="git-scraping"/><category term="rachel-kroll"/></entry><entry><title>Inside .git</title><link href="https://simonwillison.net/2024/Jan/25/inside-git/#atom-tag" rel="alternate"/><published>2024-01-25T14:59:54+00:00</published><updated>2024-01-25T14:59:54+00:00</updated><id>https://simonwillison.net/2024/Jan/25/inside-git/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://wizardzines.com/comics/inside-git/"&gt;Inside .git&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This single diagram filled in all sorts of gaps in my mental model of how git actually works under the hood.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/julia-evans"&gt;julia-evans&lt;/a&gt;&lt;/p&gt;



</summary><category term="git"/><category term="julia-evans"/></entry><entry><title>See the History of a Method with git log -L</title><link href="https://simonwillison.net/2023/Nov/5/see-the-history-of-a-method-with-git-log-l/#atom-tag" rel="alternate"/><published>2023-11-05T20:16:55+00:00</published><updated>2023-11-05T20:16:55+00:00</updated><id>https://simonwillison.net/2023/Nov/5/see-the-history-of-a-method-with-git-log-l/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://calebhearth.com/git-method-history"&gt;See the History of a Method with git log -L&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Neat Git trick from Caleb Hearth that I hadn’t seen before, and it works for Python out of the box:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;git log -L :path_with_format:__init__.py&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;That command displays a log (with diffs) of just the portion of commits that changed the &lt;code&gt;path_with_format&lt;/code&gt; function in the &lt;code&gt;__init__.py&lt;/code&gt; file.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=38153309"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;&lt;/p&gt;



</summary><category term="git"/><category term="python"/></entry><entry><title>Tracking SQLite Database Changes in Git</title><link href="https://simonwillison.net/2023/Nov/1/tracking-sqlite-database-changes-in-git/#atom-tag" rel="alternate"/><published>2023-11-01T18:53:51+00:00</published><updated>2023-11-01T18:53:51+00:00</updated><id>https://simonwillison.net/2023/Nov/1/tracking-sqlite-database-changes-in-git/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://garrit.xyz/posts/2023-11-01-tracking-sqlite-database-changes-in-git"&gt;Tracking SQLite Database Changes in Git&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A neat trick from Garrit Franke that I hadn’t seen before: you can teach &lt;code&gt;git diff&lt;/code&gt; how to display human-readable versions of the differences between binary files with a specific extension using the following:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;git config diff.sqlite3.binary true&lt;/code&gt;&lt;br&gt;&lt;code&gt;git config diff.sqlite3.textconv "echo .dump | sqlite3"&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;That way you can store binary files in your repo but still get back SQL diffs to compare them.&lt;/p&gt;
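&lt;p&gt;One detail worth noting: a textconv driver only applies to files that are mapped to it via &lt;code&gt;.gitattributes&lt;/code&gt;. Here's a runnable sketch of the full setup (file names invented; requires the &lt;code&gt;sqlite3&lt;/code&gt; CLI on your PATH):&lt;/p&gt;

```shell
# SQL-level diffs for binary SQLite files (illustrative names)
mkdir -p demo-db
git -C demo-db init -q
git -C demo-db config user.email demo@example.com
git -C demo-db config user.name Demo

# Map *.db files to the "sqlite3" diff driver, then define the driver
echo '*.db diff=sqlite3' > demo-db/.gitattributes
git -C demo-db config diff.sqlite3.binary true
git -C demo-db config diff.sqlite3.textconv 'echo .dump | sqlite3'

sqlite3 demo-db/data.db 'CREATE TABLE t (id INTEGER); INSERT INTO t VALUES (1);'
git -C demo-db add -A
git -C demo-db commit -q -m 'Initial database'

sqlite3 demo-db/data.db 'INSERT INTO t VALUES (2);'
# Shows an SQL diff rather than "Binary files differ"
git -C demo-db diff
```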

&lt;p&gt;I still worry about the efficiency of storing binary files in Git, since I expect multiple versions of a text file to compress together better.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://lobste.rs/s/gnv9ho/tracking_sqlite_database_changes_git"&gt;lobste.rs&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;&lt;/p&gt;



</summary><category term="git"/><category term="sqlite"/></entry><entry><title>The Perfect Commit</title><link href="https://simonwillison.net/2022/Oct/29/the-perfect-commit/#atom-tag" rel="alternate"/><published>2022-10-29T20:41:01+00:00</published><updated>2022-10-29T20:41:01+00:00</updated><id>https://simonwillison.net/2022/Oct/29/the-perfect-commit/#atom-tag</id><summary type="html">
    &lt;p&gt;For the last few years I've been trying to center my work around creating what I consider to be the &lt;em&gt;Perfect Commit&lt;/em&gt;. This is a single commit that contains all of the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;implementation&lt;/strong&gt;: a single, focused change&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tests&lt;/strong&gt; that demonstrate the implementation works&lt;/li&gt;
&lt;li&gt;Updated &lt;strong&gt;documentation&lt;/strong&gt; reflecting the change&lt;/li&gt;
&lt;li&gt;A link to an &lt;strong&gt;issue thread&lt;/strong&gt; providing further context&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Our job as software engineers generally isn't to write new software from scratch: we spend the majority of our time adding features and fixing bugs in existing software.&lt;/p&gt;
&lt;p&gt;The commit is our principal unit of work. It deserves to be treated thoughtfully and with care.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update 26th November 2022&lt;/strong&gt;: My 25 minute talk &lt;a href="https://simonwillison.net/2022/Nov/26/productivity/"&gt;Massively increase your productivity on personal projects with comprehensive documentation and automated tests&lt;/a&gt; describes this approach to software development in detail.&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2022/Oct/29/the-perfect-commit/#implementation"&gt;Implementation&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2022/Oct/29/the-perfect-commit/#tests"&gt;Tests&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2022/Oct/29/the-perfect-commit/#documentation"&gt;Documentation&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2022/Oct/29/the-perfect-commit/#link-to-an-issue"&gt;A link to an issue&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2022/Oct/29/the-perfect-commit/#issue-over-commit-message"&gt;An issue is more valuable than a commit message&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2022/Oct/29/the-perfect-commit/#not-all-perfect"&gt;Not every commit needs to be "perfect"&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2022/Oct/29/the-perfect-commit/#scrappy-branches"&gt;Write scrappy commits in a branch&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2022/Oct/29/the-perfect-commit/#examples"&gt;Some examples&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="implementation"&gt;Implementation&lt;/h4&gt;
&lt;p&gt;Each commit should change a single thing.&lt;/p&gt;
&lt;p&gt;The definition of "thing" here is left deliberately vague!&lt;/p&gt;
&lt;p&gt;The goal is to have something that can be easily reviewed, and that can be clearly understood in the future when revisited using tools like &lt;code&gt;git blame&lt;/code&gt; or &lt;a href="https://til.simonwillison.net/git/git-bisect"&gt;git bisect&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I like to keep my commit history linear, as I find that makes it much easier to comprehend later. This further reinforces the value of each commit being a single, focused change.&lt;/p&gt;
&lt;p&gt;Atomic commits are also much easier to cleanly revert if something goes wrong - or to cherry-pick into other branches.&lt;/p&gt;
&lt;p&gt;For things like web applications that can be deployed to production, a commit should be a unit that can be deployed. Aiming to keep the main branch in a deployable state is a good rule of thumb for deciding if a commit is a sensible atomic change or not.&lt;/p&gt;
&lt;h4 id="tests"&gt;Tests&lt;/h4&gt;
&lt;p&gt;The ultimate goal of tests is to &lt;em&gt;increase&lt;/em&gt; your productivity. If your testing practices are slowing you down, you should consider ways to improve them.&lt;/p&gt;
&lt;p&gt;In the longer term, this productivity improvement comes from gaining the freedom to make changes and stay confident that your change hasn't broken something else.&lt;/p&gt;
&lt;p&gt;But tests can help increase productivity in the immediate short term as well.&lt;/p&gt;
&lt;p&gt;How do you know when the change you have made is finished and ready to commit? It's ready when the new tests pass.&lt;/p&gt;
&lt;p&gt;I find this reduces the time I spend second-guessing myself and questioning whether I've done enough and thought through all of the edge cases.&lt;/p&gt;
&lt;p&gt;Without tests, there's a very strong possibility that your change will have broken some other, potentially unrelated feature. Your commit could be held up by hours of tedious manual testing. Or you could &lt;abbr title="You Only Live Once"&gt;YOLO&lt;/abbr&gt; it and learn that you broke something important later!&lt;/p&gt;
&lt;p&gt;Writing tests becomes far less time consuming if you already have good testing practices in place.&lt;/p&gt;
&lt;p&gt;Adding a new test to a project with a lot of existing tests is easy: you can often find an existing test that has 90% of the pattern you need already worked out for you.&lt;/p&gt;
&lt;p&gt;If your project has no tests at all, adding a test for your change will be a lot more work.&lt;/p&gt;
&lt;p&gt;This is why I start every single one of my projects with a passing test. It doesn't matter what this test is - &lt;code&gt;assert 1 + 1 == 2&lt;/code&gt; is fine! The key thing is to get a testing framework in place, such that you can run a command (for me that's usually &lt;code&gt;pytest&lt;/code&gt;) to execute the test suite - and you have an obvious place to add new tests in the future.&lt;/p&gt;
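&lt;p&gt;To make that concrete, here's what that starting point can look like - a sketch with an invented project name:&lt;/p&gt;

```shell
# The entire initial test suite for a new project. The assertion is
# trivial on purpose: the point is that the harness exists and future
# tests have an obvious home.
mkdir -p demo-proj/tests
printf 'def test_sanity():\n    assert 1 + 1 == 2\n' \
    > demo-proj/tests/test_sanity.py
# From inside demo-proj, running `pytest` now reports one passing test
```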
&lt;p&gt;I use &lt;a href="https://simonwillison.net/2021/Aug/28/dynamic-github-repository-templates/"&gt;these cookiecutter templates&lt;/a&gt; for almost all of my new projects. They configure a testing framework with a single passing test and GitHub Actions workflows to exercise it all from the very start.&lt;/p&gt;
&lt;p&gt;I'm not a huge advocate of test-first development, where tests are written before the code itself. What I care about is tests-included development, where the final commit bundles the tests and the implementation together. I wrote more about my approach to testing in &lt;a href="https://simonwillison.net/2020/Feb/11/cheating-at-unit-tests-pytest-black/"&gt;How to cheat at unit tests with pytest and Black&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="documentation"&gt;Documentation&lt;/h4&gt;
&lt;p&gt;If your project defines APIs that are meant to be used outside of your project, they need to be documented. In my work these projects are usually one of the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Python APIs (modules, functions and classes) that provide code designed to be imported into other projects.&lt;/li&gt;
&lt;li&gt;Web APIs - usually JSON over HTTP these days - that provide functionality to be consumed by other applications.&lt;/li&gt;
&lt;li&gt;Command line interface tools, such as those implemented using &lt;a href="https://click.palletsprojects.com/"&gt;Click&lt;/a&gt; or &lt;a href="https://typer.tiangolo.com/"&gt;Typer&lt;/a&gt; or &lt;a href="https://docs.python.org/3/library/argparse.html"&gt;argparse&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It is critical that this documentation &lt;strong&gt;lives in the same repository as the code itself&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;This is important for a number of reasons.&lt;/p&gt;
&lt;p&gt;Documentation is only valuable &lt;strong&gt;if people trust it&lt;/strong&gt;. People will only trust it if they know that it is kept up to date.&lt;/p&gt;
&lt;p&gt;If your docs live in a separate wiki somewhere it's easy for them to get out of date - but more importantly it's hard for anyone to quickly confirm if the documentation is being updated in sync with the code or not.&lt;/p&gt;
&lt;p&gt;Documentation should be &lt;strong&gt;versioned&lt;/strong&gt;. People need to be able to find the docs for the specific version of your software that they are using. Keeping it in the same repository as the code gives you synchronized versioning for free.&lt;/p&gt;
&lt;p&gt;Documentation changes should be &lt;strong&gt;reviewed&lt;/strong&gt; in the same way as your code. If they live in the same repository you can catch changes that need to be reflected in the documentation as part of your code review process.&lt;/p&gt;
&lt;p&gt;And ideally, documentation should be &lt;strong&gt;tested&lt;/strong&gt;. I wrote about my approach to doing this using &lt;a href="https://simonwillison.net/2018/Jul/28/documentation-unit-tests/"&gt;Documentation unit tests&lt;/a&gt;. Executing example code in the documentation using a testing framework is a great idea too.&lt;/p&gt;
&lt;p&gt;As with tests, writing documentation from scratch is much more work than incrementally modifying existing documentation.&lt;/p&gt;
&lt;p&gt;Many of my commits include documentation that is just a sentence or two. This doesn't take very long to write, but it adds up to something very comprehensive over time.&lt;/p&gt;
&lt;p&gt;How about end-user facing documentation? I'm still figuring that out myself. I created my &lt;a href="https://simonwillison.net/2022/Mar/10/shot-scraper/"&gt;shot-scraper tool&lt;/a&gt; to help automate the process of keeping screenshots up-to-date, but I've not yet found personal habits and styles for end-user documentation that I'm confident in.&lt;/p&gt;
&lt;h4 id="link-to-an-issue"&gt;A link to an issue&lt;/h4&gt;
&lt;p&gt;Every perfect commit should include a link to an issue thread that accompanies that change.&lt;/p&gt;
&lt;p&gt;Sometimes I'll even open an issue seconds before writing the commit message, just to give myself something I can link to from the commit itself!&lt;/p&gt;
&lt;p&gt;The reason I like issue threads is that they provide effectively unlimited space for commentary and background for the change that is being made.&lt;/p&gt;
&lt;p&gt;Most of my issue threads are me talking to myself - sometimes with dozens of issue comments, all written by me.&lt;/p&gt;
&lt;p&gt;Things that can go in an issue thread include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Background&lt;/strong&gt;: the reason for the change. I try to include this in the opening comment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State of play&lt;/strong&gt; before the change. I'll often link to the current version of the code and documentation. This is great for if I return to an open issue a few days later, as it saves me from having to repeat that initial research.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Links to things&lt;/strong&gt;. So many links! Inspiration for the change, relevant documentation, conversations on Slack or Discord, clues found on StackOverflow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code snippets&lt;/strong&gt; illustrating potential designs and false-starts. Use &lt;code&gt;```python ... ```&lt;/code&gt; blocks to get syntax highlighting in your issue comments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decisions&lt;/strong&gt;. What did you consider? What did you decide? As programmers we make hundreds of tiny decisions a day. Write them down! Then you'll never find yourself relitigating them in the future having forgotten your original reasoning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Screenshots&lt;/strong&gt;. What it looked like before, what it looked like after. Animated screenshots are even better! I use &lt;a href="https://www.cockos.com/licecap/"&gt;LICEcap&lt;/a&gt; to generate quick GIF screen captures or QuickTime to capture videos - both of which can be dropped straight into a GitHub issue comment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prototypes&lt;/strong&gt;. I'll often paste a few lines of code copied from a Python console session. Sometimes I'll even paste in a block of HTML and CSS, or add a screenshot of a UI prototype.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;After I've closed my issues I like to add one last comment that links to the updated documentation and ideally a live demo of the new feature.&lt;/p&gt;
&lt;h4 id="issue-over-commit-message"&gt;An issue is more valuable than a commit message&lt;/h4&gt;
&lt;p&gt;I went through a several-year phase of writing essays in my commit messages, trying to capture as much of the background context and thinking as possible.&lt;/p&gt;
&lt;p&gt;My commit messages grew a lot shorter when I started bundling the updated documentation in the commit - since often much of the material I'd previously included in the commit message was now in that documentation instead.&lt;/p&gt;
&lt;p&gt;As I extended my practice of writing issue threads, I found that they were a better place for most of this context than the commit messages themselves. They supported embedded media, were more discoverable and I could continue to extend them even after the commit had landed.&lt;/p&gt;
&lt;p&gt;Today many of my commit messages are a single line summary and a link to an issue!&lt;/p&gt;
&lt;p&gt;The biggest benefit of lengthy commit messages is that they are guaranteed to survive for as long as the repository itself. If you're going to use issue threads in the way I describe here it is critical that you consider their long term archival value.&lt;/p&gt;
&lt;p&gt;I expect this to be controversial! I'm advocating for abandoning one of the core ideas of Git here - that each repository should incorporate a full, decentralized record of its history that is copied in its entirety when someone clones a repo.&lt;/p&gt;
&lt;p&gt;I understand that philosophy. All I'll say here is that my own experience has been that dropping that requirement has resulted in a net increase in my overall productivity. Other people may reach a different conclusion.&lt;/p&gt;
&lt;p&gt;If this offends you too much, you're welcome to construct an &lt;em&gt;even more perfect commit&lt;/em&gt; that incorporates background information and additional context in an extended commit message as well.&lt;/p&gt;
&lt;p&gt;One of the reasons I like GitHub Issues is that it includes a comprehensive API, which can be used to extract all of that data. I use my &lt;a href="https://github.com/dogsheep/github-to-sqlite"&gt;github-to-sqlite tool&lt;/a&gt; to maintain an ongoing archive of my issues and issue comments as a SQLite database file.&lt;/p&gt;
&lt;h4 id="not-all-perfect"&gt;Not every commit needs to be "perfect"&lt;/h4&gt;
&lt;p&gt;I find that the vast majority of my work fits into this pattern, but there are exceptions.&lt;/p&gt;
&lt;p&gt;Typo fix for some documentation or a comment? Just ship it, it's fine.&lt;/p&gt;
&lt;p&gt;Bug fix that doesn't deserve documentation? Still bundle the implementation and the test plus a link to an issue, but no need to update the docs - especially if they already describe the expected bug-free behaviour.&lt;/p&gt;
&lt;p&gt;Generally though, I find that aiming for implementation, tests, documentation and an issue link covers almost all of my work. It's a really good default model.&lt;/p&gt;
&lt;h4 id="scrappy-branches"&gt;Write scrappy commits in a branch&lt;/h4&gt;
&lt;p&gt;If I'm writing more exploratory or experimental code it often doesn't make sense to work in this strict way. For those instances I'll usually work in a branch, where I can ship "WIP" commit messages and failing tests with abandon. I'll then squash-merge them into a single perfect commit (sometimes via a self-closed GitHub pull request) to keep my main branch as tidy as possible.&lt;/p&gt;
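&lt;p&gt;As a sketch of that workflow (repository, branch and file names invented):&lt;/p&gt;

```shell
# Scrappy WIP commits on a branch, squashed into one tidy commit on main
mkdir -p demo-sq
git -C demo-sq init -q
git -C demo-sq config user.email demo@example.com
git -C demo-sq config user.name Demo
echo 'v1' > demo-sq/app.py
git -C demo-sq add app.py
git -C demo-sq commit -q -m 'Initial commit'

# Experiment freely with throwaway commit messages
git -C demo-sq checkout -q -b experiment
echo 'v2' > demo-sq/app.py
git -C demo-sq commit -qam 'WIP'
echo 'v3' > demo-sq/app.py
git -C demo-sq commit -qam 'WIP, tests failing'

# Back on the main branch: squash the whole experiment into one commit
git -C demo-sq checkout -q -
git -C demo-sq merge --squash experiment
git -C demo-sq commit -q -m 'Add feature: implementation, tests, docs'
```

&lt;p&gt;The main branch's history stays linear - just the initial commit and the one squashed feature commit.&lt;/p&gt;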
&lt;h4 id="examples"&gt;Some examples&lt;/h4&gt;
&lt;p&gt;Here are some examples of my commits that follow this pattern:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette/commit/9676b2deb07cff20247ba91dad3e84a4ab0b00d1"&gt;Upgrade Docker images to Python 3.11&lt;/a&gt; for &lt;a href="https://github.com/simonw/datasette/issues/1853"&gt;datasette #1853&lt;/a&gt; - a pretty tiny change, but still includes tests, docs and an issue link.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/sqlite-utils/commit/ab8d4aad0c42f905640981f6f24bc1e37205ae62"&gt;sqlite-utils schema now takes optional tables&lt;/a&gt; for &lt;a href="https://github.com/simonw/sqlite-utils/issues/299"&gt;sqlite-utils #299&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/shot-scraper/commit/5048e21a1ca5accedfeca6ac25a16a38dc240b81"&gt;shot-scraper html command&lt;/a&gt; for &lt;a href="https://github.com/simonw/shot-scraper/issues/96"&gt;shot-scraper #96&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/s3-credentials/commit/c7bb7268c4a124349bb511f7ec3ee3f28f9581ad"&gt;s3-credentials put-objects command&lt;/a&gt; for &lt;a href="https://github.com/simonw/s3-credentials/issues/68"&gt;s3-credentials #68&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-gunicorn/commit/0d561d7a94f76079b1eb7779b3e944c163d2539e"&gt;Initial implementation&lt;/a&gt; for &lt;a href="https://github.com/simonw/datasette-gunicorn/issues/1"&gt;datasette-gunicorn #1&lt;/a&gt; - this was the first commit to this repository, but I still bundled the tests, docs, implementation and a link to an issue.&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/code-review"&gt;code-review&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/definitions"&gt;definitions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/documentation"&gt;documentation&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/software-engineering"&gt;software-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/testing"&gt;testing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-issues"&gt;github-issues&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="code-review"/><category term="definitions"/><category term="documentation"/><category term="git"/><category term="github"/><category term="software-engineering"/><category term="testing"/><category term="github-issues"/></entry><entry><title>sqlite-comprehend: run AWS entity extraction against content in a SQLite database</title><link href="https://simonwillison.net/2022/Jul/11/sqlite-comprehend/#atom-tag" rel="alternate"/><published>2022-07-11T21:31:21+00:00</published><updated>2022-07-11T21:31:21+00:00</updated><id>https://simonwillison.net/2022/Jul/11/sqlite-comprehend/#atom-tag</id><summary type="html">
    &lt;p&gt;I built a new tool this week: &lt;a href="https://datasette.io/tools/sqlite-comprehend"&gt;sqlite-comprehend&lt;/a&gt;, which passes text from a SQLite database through the &lt;a href="https://aws.amazon.com/comprehend/"&gt;AWS Comprehend&lt;/a&gt; entity extraction service and stores the returned entities.&lt;/p&gt;
&lt;p&gt;I created this as a complement to my &lt;a href="https://simonwillison.net/2022/Jun/30/s3-ocr/"&gt;s3-ocr tool&lt;/a&gt;, which uses &lt;a href="https://aws.amazon.com/textract/"&gt;AWS Textract&lt;/a&gt; service to perform OCR against every PDF file in an S3 bucket.&lt;/p&gt;
&lt;p&gt;Short version: given a database table full of text, run the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% pip install sqlite-comprehend
% sqlite-comprehend entities myblog.db blog_entry body --strip-tags
  [###---------------------------------]    9%  00:01:02
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will churn through every piece of text in the &lt;code&gt;body&lt;/code&gt; column of the &lt;code&gt;blog_entry&lt;/code&gt; table in the &lt;code&gt;myblog.db&lt;/code&gt; SQLite database, strip any HTML tags (the &lt;code&gt;--strip-tags&lt;/code&gt; option), submit it to AWS Comprehend, and store the extracted entities in the following tables:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://datasette.simonwillison.net/simonwillisonblog/comprehend_entities"&gt;comprehend_entities&lt;/a&gt; - the extracted entities, classified by type&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://datasette.simonwillison.net/simonwillisonblog/blog_entry_comprehend_entities"&gt;blog_entry_comprehend_entities&lt;/a&gt; - a table relating entities to the entries that they appear in&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://datasette.simonwillison.net/simonwillisonblog/comprehend_entity_types"&gt;comprehend_entity_types&lt;/a&gt; - a small lookup table of entity types&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The above table names link to a live demo produced by running the tool against all of the content in my blog.&lt;/p&gt;
&lt;p&gt;Here are &lt;a href="https://datasette.simonwillison.net/simonwillisonblog/blog_entry_comprehend_entities?entity=47&amp;amp;_sort_desc=rowid"&gt;225 mentions&lt;/a&gt; that Comprehend classified as the organization called "Mozilla".&lt;/p&gt;
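&lt;p&gt;The join behind that query can be sketched with plain &lt;code&gt;sqlite3&lt;/code&gt;. This builds a tiny stand-in database with roughly the same table shapes - the column names here are illustrative guesses, not the tool's exact schema:&lt;/p&gt;

```shell
# Tiny stand-in for the entity tables (hypothetical columns and data):
sqlite3 entities-demo.db "
create table comprehend_entities (id integer primary key, name text);
create table blog_entry_comprehend_entities (entry_id integer, entity integer);
insert into comprehend_entities values (1, 'Mozilla');
insert into blog_entry_comprehend_entities values (42, 1), (43, 1);
-- Count mentions of each entity across entries:
select e.name, count(*) as mentions
from blog_entry_comprehend_entities j
join comprehend_entities e on e.id = j.entity
group by e.name;
"
# prints: Mozilla|2
```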
&lt;p&gt;The tool tracks which rows have been processed already (in the &lt;a href="https://datasette.simonwillison.net/simonwillisonblog/blog_entry_comprehend_entities_done"&gt;blog_entry_comprehend_entities_done&lt;/a&gt; table), so you can run it multiple times and it will only process newly added rows.&lt;/p&gt;
&lt;p&gt;AWS Comprehend &lt;a href="https://aws.amazon.com/comprehend/pricing/"&gt;pricing&lt;/a&gt; starts at $0.0001 per hundred characters. &lt;code&gt;sqlite-comprehend&lt;/code&gt; only submits the first 5,000 characters of each row.&lt;/p&gt;
&lt;h4&gt;How the demo works&lt;/h4&gt;
&lt;p&gt;My live demo for this tool uses a new &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; instance at &lt;a href="https://datasette.simonwillison.net/"&gt;datasette.simonwillison.net&lt;/a&gt;. It hosts a complete copy of the data from my blog - data that lives in a Django/PostgreSQL database on Heroku, but is now mirrored to a SQLite database hosted by Datasette.&lt;/p&gt;
&lt;p&gt;The demo runs out of my &lt;a href="https://github.com/simonw/simonwillisonblog-backup"&gt;simonwillisonblog-backup&lt;/a&gt; GitHub repository.&lt;/p&gt;
&lt;p&gt;A couple of years ago I realized that I'm no longer happy having any content I care about &lt;em&gt;not&lt;/em&gt; stored in a Git repository. I want to track my changes! I also want really robust backups: GitHub mirror their repos to three different regions around the world, and having data in a Git repository makes mirroring it somewhere else as easy as running a &lt;code&gt;git pull&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;So I created &lt;code&gt;simonwillisonblog-backup&lt;/code&gt; using a couple of my other tools: &lt;a href="https://datasette.io/tools/db-to-sqlite"&gt;db-to-sqlite&lt;/a&gt;, which converts a PostgreSQL database to a SQLite database, and &lt;a href="https://datasette.io/tools/sqlite-diffable"&gt;sqlite-diffable&lt;/a&gt;, which dumps out a SQLite database as a "diffable" directory of newline-delimited JSON files.&lt;/p&gt;
&lt;p&gt;Here's the simplest version of that pattern:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;db-to-sqlite \
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;postgresql+psycopg2://user:pass@hostname:5432/dbname&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
    simonwillisonblog.db --all&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This connects to PostgreSQL, loops through all of the database tables and converts each of them to a SQLite table stored in &lt;code&gt;simonwillisonblog.db&lt;/code&gt;.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;sqlite-diffable dump simonwillisonblog.db simonwillisonblog --all&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This converts that SQLite database into a directory of JSON files. Each table gets two files: &lt;code&gt;table.metadata.json&lt;/code&gt; containing the table's name, columns and schema and &lt;code&gt;table.ndjson&lt;/code&gt; containing a newline-separated list of JSON arrays representing every row in that table.&lt;/p&gt;
&lt;p&gt;You can see these files for my blog's database in the &lt;a href="https://github.com/simonw/simonwillisonblog-backup/tree/main/simonwillisonblog"&gt;simonwillisonblog&lt;/a&gt; folder.&lt;/p&gt;
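&lt;p&gt;The &lt;code&gt;.ndjson&lt;/code&gt; half of each pair is easy to work with from other tools, since every line is a standalone JSON array. A minimal sketch with made-up rows:&lt;/p&gt;

```shell
# Fake two-row dump for a hypothetical blog_tag table - one JSON array per line:
printf '%s\n' '[1, "quora"]' '[2, "datasette"]' > blog_tag.ndjson

# Each line parses independently; pull the tag out of the first row:
python3 -c "import json; print(json.loads(open('blog_tag.ndjson').readline())[1])"
# prints: quora
```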
&lt;p&gt;(My actual script is &lt;a href="https://github.com/simonw/simonwillisonblog-backup/blob/30fc957b5b6711c9a9cd3e427b709f59fd5b9c56/.github/workflows/backup.yml#L36-L67"&gt;a little more complex&lt;/a&gt;, because I backup only selected tables and then run extra code to redact some of the fields.)&lt;/p&gt;
&lt;p&gt;Since I set this up it's captured &lt;a href="https://github.com/simonw/simonwillisonblog-backup/commits/main"&gt;over 600 changes&lt;/a&gt; I've applied to my blog's database, all made through the regular Django admin interface.&lt;/p&gt;
&lt;p&gt;This morning I &lt;a href="https://github.com/simonw/simonwillisonblog-backup/compare/bab595ae7cb0802b9dbc12ef29864bee75081be5...30fc957b5b6711c9a9cd3e427b709f59fd5b9c56#diff-5876767d72a925ae5674ea64687a681f6a5a935fea020c62d168c7a172ccd2c6"&gt;extended the script&lt;/a&gt; to run &lt;code&gt;sqlite-comprehend&lt;/code&gt; against my blog entries and deploy the resulting data using Datasette.&lt;/p&gt;
&lt;p&gt;The concise version of the new script looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;wget -q https://datasette.simonwillison.net/simonwillisonblog.db
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This retrieves the previous version of the database. I do this to avoid being charged by AWS Comprehend for running entity extraction against rows I have already processed.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sqlite-diffable load simonwillisonblog.db simonwillisonblog --replace
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This creates the &lt;code&gt;simonwillisonblog.db&lt;/code&gt; database by loading in the JSON from the &lt;code&gt;simonwillisonblog/&lt;/code&gt; folder. I do it this way mainly to exercise the new &lt;a href="https://github.com/simonw/sqlite-diffable/issues/3"&gt;sqlite-diffable load&lt;/a&gt; command I just added to that tool.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;--replace&lt;/code&gt; option ensures that any tables that already exist are replaced by a fresh copy (while leaving my existing comprehend entity extraction data intact).&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sqlite-comprehend entities simonwillisonblog.db blog_entry title body --strip-tags
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This runs &lt;code&gt;sqlite-comprehend&lt;/code&gt; against the blog entries that have not yet been processed.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set +e
sqlite-utils enable-fts simonwillisonblog.db blog_series title summary --create-triggers --tokenize porter 2&amp;gt;/dev/null
sqlite-utils enable-fts simonwillisonblog.db blog_tag tag --create-triggers --tokenize porter 2&amp;gt;/dev/null
sqlite-utils enable-fts simonwillisonblog.db blog_quotation quotation source --create-triggers --tokenize porter 2&amp;gt;/dev/null
sqlite-utils enable-fts simonwillisonblog.db blog_entry title body --create-triggers --tokenize porter 2&amp;gt;/dev/null
sqlite-utils enable-fts simonwillisonblog.db blog_blogmark link_title via_title commentary --create-triggers --tokenize porter 2&amp;gt;/dev/null
set -e
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This configures SQLite full-text search against each of those tables, using &lt;a href="https://til.simonwillison.net/bash/ignore-errors"&gt;this pattern&lt;/a&gt; to suppress any errors that occur if the FTS tables already exist.&lt;/p&gt;
&lt;p&gt;Setting up FTS in this way means I can use the &lt;a href="https://datasette.io/plugins/datasette-search-all"&gt;datasette-search-all plugin&lt;/a&gt; to run searches like &lt;a href="https://datasette.simonwillison.net/-/search?q=s3"&gt;this one for s3&lt;/a&gt; across all of those tables at once.&lt;/p&gt;
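&lt;p&gt;Under the hood &lt;code&gt;enable-fts&lt;/code&gt; creates standard SQLite full-text search tables. Here's a rough sketch of the equivalent using FTS5 and the porter tokenizer directly, with a hypothetical table and row (assuming your &lt;code&gt;sqlite3&lt;/code&gt; build includes FTS5):&lt;/p&gt;

```shell
sqlite3 fts-demo.db "
create virtual table blog_entry_fts using fts5(title, body, tokenize='porter');
insert into blog_entry_fts values ('Scraping AWS', 'Tracking changes to s3 buckets');
-- Porter stemming means 'bucket' matches 'buckets':
select title from blog_entry_fts where blog_entry_fts match 'bucket';
"
# prints: Scraping AWS
```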
&lt;pre&gt;&lt;code&gt;datasette publish cloudrun simonwillisonblog.db \
-m metadata.yml \
--service simonwillisonblog \
--install datasette-block-robots \
--install datasette-graphql \
--install datasette-search-all
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This uses the &lt;a href="https://docs.datasette.io/en/stable/publish.html"&gt;datasette publish&lt;/a&gt; command to deploy the &lt;a href="https://datasette.simonwillison.net/"&gt;datasette.simonwillison.net&lt;/a&gt; site to Google Cloud Run.&lt;/p&gt;
&lt;p&gt;I'm adding two more plugins here: &lt;a href="https://datasette.io/plugins/datasette-block-robots"&gt;datasette-block-robots&lt;/a&gt; to avoid search engine crawlers indexing a duplicate copy of my blog's content, and &lt;a href="https://datasette.io/plugins/datasette-graphql"&gt;datasette-graphql&lt;/a&gt; to enable GraphQL queries against my data.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://datasette.simonwillison.net/graphql/simonwillisonblog?query=%7B%0A%20%20blog_entry(%0A%20%20%20%20sort_desc%3A%20id%0A%20%20%20%20where%3A%20%22id%20in%20(select%20entry_id%20from%20blog_entry_tags%20where%20tag_id%20in%20(select%20id%20from%20blog_tag%20where%20tag%20%3D%20%27datasette%27))%22%2C%0A%20%20)%20%7B%0A%20%20%20%20totalCount%0A%20%20%20%20pageInfo%20%7B%0A%20%20%20%20%20%20hasNextPage%0A%20%20%20%20%20%20endCursor%0A%20%20%20%20%7D%0A%20%20%20%20nodes%20%7B%0A%20%20%20%20%20%20id%0A%20%20%20%20%20%20title%0A%20%20%20%20%20%20created%0A%20%20%20%20%20%20body%0A%20%20%20%20%20%20blog_entry_tags_list%20%7B%0A%20%20%20%20%20%20%20%20nodes%20%7B%0A%20%20%20%20%20%20%20%20%20%20tag_id%20%7B%0A%20%20%20%20%20%20%20%20%20%20%20%20tag%0A%20%20%20%20%20%20%20%20%20%20%7D%0A%20%20%20%20%20%20%20%20%7D%0A%20%20%20%20%20%20%7D%0A%20%20%20%20%7D%0A%20%20%7D%0A%7D"&gt;an example GraphQL query&lt;/a&gt; that returns my most recent blog entries that are tagged with &lt;code&gt;datasette&lt;/code&gt;.&lt;/p&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/sqlite-comprehend"&gt;sqlite-comprehend&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/sqlite-comprehend/releases/tag/0.2.1"&gt;0.2.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/sqlite-comprehend/releases"&gt;4 releases total&lt;/a&gt;) - 2022-07-11
&lt;br /&gt;Tools for running data in a SQLite database through AWS Comprehend&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/sqlite-diffable"&gt;sqlite-diffable&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/sqlite-diffable/releases/tag/0.4"&gt;0.4&lt;/a&gt; - (&lt;a href="https://github.com/simonw/sqlite-diffable/releases"&gt;5 releases total&lt;/a&gt;) - 2022-07-11
&lt;br /&gt;Tools for dumping/loading a SQLite database to diffable directory structure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-redirect-to-https"&gt;datasette-redirect-to-https&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-redirect-to-https/releases/tag/0.2"&gt;0.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-redirect-to-https/releases"&gt;2 releases total&lt;/a&gt;) - 2022-07-04
&lt;br /&gt;Datasette plugin that redirects all non-https requests to https&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-unsafe-expose-env"&gt;datasette-unsafe-expose-env&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-unsafe-expose-env/releases/tag/0.1.1"&gt;0.1.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-unsafe-expose-env/releases"&gt;2 releases total&lt;/a&gt;) - 2022-07-03
&lt;br /&gt;Datasette plugin to expose some environment variables at /-/env for debugging&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-expose-env"&gt;datasette-expose-env&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-expose-env/releases/tag/0.1"&gt;0.1&lt;/a&gt; - 2022-07-03
&lt;br /&gt;Datasette plugin to expose selected environment variables at /-/env for debugging&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-upload-csvs"&gt;datasette-upload-csvs&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-upload-csvs/releases/tag/0.7.2"&gt;0.7.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-upload-csvs/releases"&gt;10 releases total&lt;/a&gt;) - 2022-07-03
&lt;br /&gt;Datasette plugin for uploading CSV files and converting them to database tables&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-packages"&gt;datasette-packages&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-packages/releases/tag/0.2"&gt;0.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-packages/releases"&gt;3 releases total&lt;/a&gt;) - 2022-07-03
&lt;br /&gt;Show a list of currently installed Python packages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-graphql"&gt;datasette-graphql&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-graphql/releases/tag/2.1"&gt;2.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-graphql/releases"&gt;35 releases total&lt;/a&gt;) - 2022-07-03
&lt;br /&gt;Datasette plugin providing an automatic GraphQL API for your SQLite databases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-edit-schema"&gt;datasette-edit-schema&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-edit-schema/releases/tag/0.5"&gt;0.5&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-edit-schema/releases"&gt;9 releases total&lt;/a&gt;) - 2022-07-01
&lt;br /&gt;Datasette plugin for modifying table schemas&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/zsh/argument-heredoc"&gt;Passing command arguments using heredoc syntax&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/github/reporting-bugs"&gt;Reporting bugs in GitHub to GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/github-actions/conditionally-run-a-second-job"&gt;Conditionally running a second job in a GitHub Actions workflow&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/aws"&gt;aws&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="aws"/><category term="git"/><category term="github"/><category term="projects"/><category term="sqlite"/><category term="datasette"/><category term="weeknotes"/></entry><entry><title>A tiny CI system</title><link href="https://simonwillison.net/2022/Apr/26/a-tiny-ci-system/#atom-tag" rel="alternate"/><published>2022-04-26T15:39:27+00:00</published><updated>2022-04-26T15:39:27+00:00</updated><id>https://simonwillison.net/2022/Apr/26/a-tiny-ci-system/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.0chris.com/tiny-ci-system.html"&gt;A tiny CI system&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Christian Ştefănescu shares a recipe for building a tiny self-hosted CI system using Git and Redis. A post-receive hook runs when a commit is pushed to the repo and uses redis-cli to push jobs to a list. Then a separate bash script runs a loop with a blocking “redis-cli blpop jobs” operation which waits for new jobs and then executes the CI job as a shell script.
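&lt;p&gt;The two moving parts could be sketched like this. This version just writes the scripts out rather than running them - the queue name and &lt;code&gt;run-ci.sh&lt;/code&gt; are hypothetical, and actually using them requires a running Redis server:&lt;/p&gt;

```shell
# hooks/post-receive in the bare repository - enqueue each pushed commit:
cat > post-receive <<'EOF'
#!/bin/sh
read oldrev newrev refname
redis-cli rpush jobs "$newrev"
EOF

# Long-running worker - block until a job arrives, then execute it:
cat > ci-worker.sh <<'EOF'
#!/bin/sh
while true; do
  # blpop prints the key name then the value; keep just the value
  commit=$(redis-cli blpop jobs 0 | tail -n 1)
  sh ./run-ci.sh "$commit"   # hypothetical CI script
done
EOF
chmod +x post-receive ci-worker.sh
```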

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/stchris_/status/1518977088723861505"&gt;@stchris_&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/bash"&gt;bash&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/continuous-integration"&gt;continuous-integration&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/redis"&gt;redis&lt;/a&gt;&lt;/p&gt;



</summary><category term="bash"/><category term="continuous-integration"/><category term="git"/><category term="redis"/></entry><entry><title>Help scraping: track changes to CLI tools by recording their --help using Git</title><link href="https://simonwillison.net/2022/Feb/2/help-scraping/#atom-tag" rel="alternate"/><published>2022-02-02T23:46:35+00:00</published><updated>2022-02-02T23:46:35+00:00</updated><id>https://simonwillison.net/2022/Feb/2/help-scraping/#atom-tag</id><summary type="html">
    &lt;p&gt;I've been experimenting with a new variant of &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt; this week which I'm calling &lt;strong&gt;Help scraping&lt;/strong&gt;. The key idea is to track changes made to CLI tools over time by recording the output of their &lt;code&gt;--help&lt;/code&gt; commands in a Git repository.&lt;/p&gt;
&lt;p&gt;My new &lt;a href="https://github.com/simonw/help-scraper"&gt;help-scraper GitHub repository&lt;/a&gt; is my first implementation of this pattern.&lt;/p&gt;
&lt;p&gt;It uses &lt;a href="https://github.com/simonw/help-scraper/blob/cd18c5d7c1ac7c3851823dcabaa21ee920d73720/.github/workflows/scrape.yml"&gt;this GitHub Actions workflow&lt;/a&gt; to record the &lt;code&gt;--help&lt;/code&gt; output for the Amazon Web Services &lt;code&gt;aws&lt;/code&gt; CLI tool, and also for the &lt;code&gt;flyctl&lt;/code&gt; tool maintained by the &lt;a href="https://fly.io/"&gt;Fly.io&lt;/a&gt; hosting platform.&lt;/p&gt;
&lt;p&gt;The workflow runs once a day. It loops through every available AWS command (using &lt;a href="https://github.com/simonw/help-scraper/blob/cd18c5d7c1ac7c3851823dcabaa21ee920d73720/aws_commands.py"&gt;this script&lt;/a&gt;) and records the output of that command's CLI help option to a &lt;code&gt;.txt&lt;/code&gt; file in the repository - then commits the result at the end.&lt;/p&gt;
&lt;p&gt;The result is a version history of changes made to those help files. It's essentially a much more detailed version of a changelog - capturing all sorts of details that might not be reflected in the official release notes for the tool.&lt;/p&gt;
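&lt;p&gt;The core of that pattern fits in a few lines of shell. This hypothetical version scrapes &lt;code&gt;--help&lt;/code&gt; from a couple of locally installed tools (&lt;code&gt;python3&lt;/code&gt; and &lt;code&gt;git&lt;/code&gt; as stand-ins) instead of the full AWS command tree:&lt;/p&gt;

```shell
# Minimal help scraper sketch - hypothetical repo layout.
set -e
mkdir -p help-demo && cd help-demo
git init -q -b main .

# Record each tool's --help output to a .txt file:
for tool in python3 git; do
  "$tool" --help > "$tool.txt" 2>&1
done

# Commit the results; over time the diff history becomes a detailed changelog:
git add -A
git -c user.email=scraper@example.com -c user.name=scraper \
    commit -q -m "Scrape latest --help output"
git log --oneline
```

&lt;p&gt;Run on a schedule (as the GitHub Actions workflow does), each run that changes any output produces a new commit to diff against.&lt;/p&gt;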
&lt;p&gt;Here's an example. This morning, AWS released version 1.22.47 of their CLI helper tool. They release new versions on an almost daily basis.&lt;/p&gt;
&lt;p&gt;Here are &lt;a href="https://github.com/aws/aws-cli/blob/develop/CHANGELOG.rst#12247"&gt;the official release notes&lt;/a&gt; - 12 bullet points, spanning 12 different AWS services.&lt;/p&gt;
&lt;p&gt;My help scraper caught the details of the release in &lt;a href="https://github.com/simonw/help-scraper/commit/cd18c5d7c1ac7c3851823dcabaa21ee920d73720#diff-c2559859df8912eb13a6017d81019bf5452cead3e6495744e2d0c82202bf33ac"&gt;this commit&lt;/a&gt; - 89 changed files with 3,543 additions and 1,324 deletions. It tells the story of what's changed in a whole lot more detail.&lt;/p&gt;
&lt;p&gt;The AWS CLI tool is &lt;em&gt;enormous&lt;/em&gt;. Running &lt;code&gt;find aws -name '*.txt' | wc -l&lt;/code&gt; in that repository counts help pages for 11,401 individual commands - or 11,390 if you check out the previous version, showing that there were 11 commands added just in this morning's new release.&lt;/p&gt;
&lt;p&gt;There are plenty of other ways of tracking changes made to AWS. I've previously kept an eye on &lt;a href="https://github.com/boto/botocore/commits/develop"&gt;the botocore GitHub history&lt;/a&gt;, which exposes changes to the underlying JSON - and there are projects like &lt;a href="https://awsapichanges.info/"&gt;awschanges.info&lt;/a&gt; which try to turn those sources of data into something more readable.&lt;/p&gt;
&lt;p&gt;But I think there's something pretty neat about being able to track changes in detail for any CLI tool that offers help output, independent of the official release notes for that tool. Not everyone writes release notes &lt;a href="https://simonwillison.net/2022/Jan/31/release-notes/"&gt;with the detail I like from them&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;I implemented this for &lt;code&gt;flyctl&lt;/code&gt; first, because I wanted to see what changes were being made that might impact my &lt;a href="https://datasette.io/plugins/datasette-publish-fly"&gt;datasette-publish-fly&lt;/a&gt; plugin which shells out to that tool. Then I realized it could be applied to AWS as well.&lt;/p&gt;
&lt;h4&gt;Help scraping my own projects&lt;/h4&gt;
&lt;p&gt;I got the initial idea for this technique from a change I made to my &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; and &lt;a href="https://sqlite-utils.datasette.io"&gt;sqlite-utils&lt;/a&gt; projects a few weeks ago.&lt;/p&gt;
&lt;p&gt;Both tools offer CLI commands with &lt;code&gt;--help&lt;/code&gt; output - but I kept on forgetting to update the help, partly because there was no easy way to see its output online without running the tools themselves.&lt;/p&gt;
&lt;p&gt;So, I added documentation pages that list the output of &lt;code&gt;--help&lt;/code&gt; for each of the CLI commands, generated using the &lt;a href="https://nedbatchelder.com/code/cog"&gt;Cog&lt;/a&gt; file generation tool:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://sqlite-utils.datasette.io/en/stable/cli-reference.html"&gt;sqlite-utils CLI reference&lt;/a&gt; (39 commands!)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.datasette.io/en/stable/cli-reference.html"&gt;datasette CLI reference&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Having added these pages, I realized that the Git commit history of those generated documentation pages could double up as a history of changes I made to the &lt;code&gt;--help&lt;/code&gt; output - here's &lt;a href="https://github.com/simonw/sqlite-utils/commits/main/docs/cli-reference.rst"&gt;that history for sqlite-utils&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It was a short jump from that to the idea of combining it with &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt; to generate history for other tools.&lt;/p&gt;
&lt;h4&gt;Bonus trick: GraphQL schema scraping&lt;/h4&gt;
&lt;p&gt;I've started making selective use of the &lt;a href="https://fly.io/"&gt;Fly.io&lt;/a&gt; GraphQL API as part of &lt;a href="https://github.com/simonw/datasette-publish-fly"&gt;my plugin&lt;/a&gt; for publishing Datasette instances to that platform.&lt;/p&gt;
&lt;p&gt;Their GraphQL API is openly available, but it's not extensively documented - presumably because they reserve the right to make breaking changes to it at any time. I collected some notes on it in this TIL: &lt;a href="https://til.simonwillison.net/fly/undocumented-graphql-api"&gt;Using the undocumented Fly GraphQL API&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This gave me an idea: could I track changes made to their GraphQL schema using the same scraping trick?&lt;/p&gt;
&lt;p&gt;It turns out I can! There's an NPM package called &lt;a href="https://www.npmjs.com/package/get-graphql-schema"&gt;get-graphql-schema&lt;/a&gt; which can extract the GraphQL schema from any GraphQL server and write it out to disk:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;npx get-graphql-schema https://api.fly.io/graphql &amp;gt; /tmp/fly.graphql
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I've added that to my &lt;code&gt;help-scraper&lt;/code&gt; repository too - so now I have a &lt;a href="https://github.com/simonw/help-scraper/commits/main/flyctl/fly.graphql"&gt;commit history of changes&lt;/a&gt; they are making there too. Here's &lt;a href="https://github.com/simonw/help-scraper/commit/f11072ff23f0d654395be7c2b1e98e84dbbc26a3#diff-c9cd49cf2aa3b983457e2812ba9313cc254aba74aaba9a36d56c867e32221589"&gt;an example&lt;/a&gt; from this morning.&lt;/p&gt;
&lt;h3&gt;Other weeknotes&lt;/h3&gt;
&lt;p&gt;I've decided to start setting goals on a monthly basis. My goal for February is to finally ship Datasette 1.0! I'm trying to make at least one commit every day that takes me closer to &lt;a href="https://github.com/simonw/datasette/milestone/7"&gt;that milestone&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This week I did &lt;a href="https://github.com/simonw/datasette/issues/1533"&gt;a bunch of work&lt;/a&gt; adding a &lt;code&gt;Link: https://...; rel="alternate"; type="application/datasette+json"&lt;/code&gt; HTTP header to many different pages in the Datasette interface, to support discovery of the JSON version of a page based on a URL to the human-readable version.&lt;/p&gt;
&lt;p&gt;(I had originally planned &lt;a href="https://github.com/simonw/datasette/issues/1534"&gt;to also support&lt;/a&gt; &lt;code&gt;Accept: application/json&lt;/code&gt; request headers for this, but I've been put off that idea by the discovery that Cloudflare &lt;a href="https://twitter.com/simonw/status/1478470282931163137"&gt;deliberately ignores&lt;/a&gt; the &lt;code&gt;Vary: Accept&lt;/code&gt; header.)&lt;/p&gt;
&lt;p&gt;Unrelated to Datasette: I also started a new Twitter thread, gathering &lt;a href="https://twitter.com/simonw/status/1487673496977113088"&gt;behind the scenes material from the movie the Mitchells vs the Machines&lt;/a&gt;. There's been a flurry of great material shared recently by the creative team, presumably as part of the run-up to awards season - and I've been enjoying trying to tie it all together in a thread.&lt;/p&gt;
&lt;p&gt;The last time I did this &lt;a href="https://twitter.com/simonw/status/1077737871602110466"&gt;was for Into the Spider-Verse&lt;/a&gt; (from the same studio) and that thread ended up running for more than a year!&lt;/p&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/pytest/only-run-integration"&gt;Opt-in integration tests with pytest --integration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/graphql/get-graphql-schema"&gt;get-graphql-schema&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/github-actions/python-3-11"&gt;Testing against Python 3.11 preview using GitHub Actions&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/graphql"&gt;graphql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fly"&gt;fly&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="cli"/><category term="git"/><category term="github"/><category term="projects"/><category term="scraping"/><category term="graphql"/><category term="weeknotes"/><category term="github-actions"/><category term="git-scraping"/><category term="fly"/></entry><entry><title>How I build a feature</title><link href="https://simonwillison.net/2022/Jan/12/how-i-build-a-feature/#atom-tag" rel="alternate"/><published>2022-01-12T18:10:17+00:00</published><updated>2022-01-12T18:10:17+00:00</updated><id>https://simonwillison.net/2022/Jan/12/how-i-build-a-feature/#atom-tag</id><summary type="html">
    &lt;p&gt;I'm maintaining &lt;a href="https://github.com/simonw/simonw/blob/main/releases.md"&gt;a lot of different projects&lt;/a&gt; at the moment. I thought it would be useful to describe the process I use for adding a new feature to one of them, using the new &lt;a href="https://sqlite-utils.datasette.io/en/stable/cli.html#cli-create-database"&gt;sqlite-utils create-database&lt;/a&gt; command as an example.&lt;/p&gt;
&lt;p&gt;I like each feature to be represented by what I consider to be the &lt;strong&gt;perfect commit&lt;/strong&gt; - one that bundles together the implementation, the tests, the documentation and a link to an external issue thread.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update 29th October 2022:&lt;/strong&gt; I wrote &lt;a href="https://simonwillison.net/2022/Oct/29/the-perfect-commit/"&gt;more about the perfect commit&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;sqlite-utils create-database&lt;/code&gt; command is very simple: it creates a new, empty SQLite database file. You use it like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% sqlite-utils create-database empty.db
&lt;/code&gt;&lt;/pre&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2022/Jan/12/how-i-build-a-feature/#everything-starts-with-an-issue"&gt;Everything starts with an issue&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2022/Jan/12/how-i-build-a-feature/#development-environment"&gt;Development environment&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2022/Jan/12/how-i-build-a-feature/#automated-tests"&gt;Automated tests&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2022/Jan/12/how-i-build-a-feature/#implementing-the-feature"&gt;Implementing the feature&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2022/Jan/12/how-i-build-a-feature/#code-formatting-with-black"&gt;Code formatting with Black&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2022/Jan/12/how-i-build-a-feature/#linting"&gt;Linting&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2022/Jan/12/how-i-build-a-feature/#documentation"&gt;Documentation&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2022/Jan/12/how-i-build-a-feature/#committing-the-change"&gt;Committing the change&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2022/Jan/12/how-i-build-a-feature/#branches-and-pull-requests"&gt;Branches and pull requests&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2022/Jan/12/how-i-build-a-feature/#release-notes-and-a-release"&gt;Release notes, and a release&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2022/Jan/12/how-i-build-a-feature/#a-live-demo"&gt;A live demo&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2022/Jan/12/how-i-build-a-feature/#tell-the-world-about-it"&gt;Tell the world about it&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2022/Jan/12/how-i-build-a-feature/#more-examples-of-this-pattern"&gt;More examples of this pattern&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id="everything-starts-with-an-issue"&gt;Everything starts with an issue&lt;/h4&gt;
&lt;p&gt;Every piece of work I do has an associated issue. This acts as ongoing work-in-progress notes and lets me record decisions, reference any research, drop in code snippets and sometimes even add screenshots and video - stuff that is really helpful but doesn't necessarily fit in code comments or commit messages.&lt;/p&gt;
&lt;p&gt;Even if it's a tiny improvement that's only a few lines of code, I'll still open an issue for it - sometimes just a few minutes before closing it again as complete.&lt;/p&gt;
&lt;p&gt;Any commits I create that relate to an issue reference the issue number in their commit message. GitHub does a great job of automatically linking these together, bidirectionally, so I can navigate from the commit to the issue or from the issue to the commit.&lt;/p&gt;
&lt;p&gt;Having an issue also gives me something I can link to from my release notes.&lt;/p&gt;
&lt;p&gt;In the case of the &lt;code&gt;create-database&lt;/code&gt; command, I opened &lt;a href="https://github.com/simonw/sqlite-utils/issues/348"&gt;this issue&lt;/a&gt; in November when I had the idea for the feature.&lt;/p&gt;
&lt;p&gt;I didn't do the work until over a month later - but because I had designed the feature in the issue comments I could get started on the implementation really quickly.&lt;/p&gt;
&lt;h4 id="development-environment"&gt;Development environment&lt;/h4&gt;
&lt;p&gt;Being able to quickly spin up a development environment for a project is crucial. All of my projects have a section in the README or the documentation describing how to do this - here's &lt;a href="https://sqlite-utils.datasette.io/en/stable/contributing.html"&gt;that section for sqlite-utils&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;On my own laptop each project gets a directory, and I use &lt;code&gt;pipenv shell&lt;/code&gt; in that directory to activate a directory-specific virtual environment, then &lt;code&gt;pip install -e '.[test]'&lt;/code&gt; to install the dependencies and test dependencies.&lt;/p&gt;
&lt;h4 id="automated-tests"&gt;Automated tests&lt;/h4&gt;
&lt;p&gt;All of my features are accompanied by automated tests. This gives me the confidence to boldly make changes to the software in the future without fear of breaking any existing features.&lt;/p&gt;
&lt;p&gt;This means that writing tests needs to be as quick and easy as possible - the less friction here the better.&lt;/p&gt;
&lt;p&gt;The best way to make writing tests easy is to have a great testing framework in place from the very beginning of the project. My cookiecutter templates (&lt;a href="https://github.com/simonw/python-lib"&gt;python-lib&lt;/a&gt;, &lt;a href="https://github.com/simonw/datasette-plugin"&gt;datasette-plugin&lt;/a&gt; and &lt;a href="https://github.com/simonw/click-app"&gt;click-app&lt;/a&gt;) all configure &lt;a href="https://docs.pytest.org/"&gt;pytest&lt;/a&gt; and add a &lt;code&gt;tests/&lt;/code&gt; folder with a single passing test, to give me something to start adding tests to.&lt;/p&gt;
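&lt;p&gt;That starter test doesn't need to do anything interesting - something along these lines (illustrative, not the exact file the templates generate) is enough to prove the test plumbing works:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# tests/test_example.py - a single passing test to build on
def test_placeholder():
    assert 1 + 1 == 2
&lt;/code&gt;&lt;/pre&gt;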
&lt;p&gt;I can't say enough good things about pytest. Before I adopted it, writing tests was a chore. Now it's an activity I genuinely look forward to!&lt;/p&gt;
&lt;p&gt;I'm not a religious adherent to writing the tests first - see &lt;a href="https://simonwillison.net/2020/Feb/11/cheating-at-unit-tests-pytest-black/"&gt;How to cheat at unit tests with pytest and Black&lt;/a&gt; for more thoughts on that - but I'll write the test first if it's pragmatic to do so.&lt;/p&gt;
&lt;p&gt;In the case of &lt;code&gt;create-database&lt;/code&gt;, writing the test first felt like the right thing to do. Here's the test I started with:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;test_create_database&lt;/span&gt;(&lt;span class="pl-s1"&gt;tmpdir&lt;/span&gt;):
    &lt;span class="pl-s1"&gt;db_path&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;tmpdir&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; &lt;span class="pl-s"&gt;"test.db"&lt;/span&gt;
    &lt;span class="pl-k"&gt;assert&lt;/span&gt; &lt;span class="pl-c1"&gt;not&lt;/span&gt; &lt;span class="pl-s1"&gt;db_path&lt;/span&gt;.&lt;span class="pl-en"&gt;exists&lt;/span&gt;()
    &lt;span class="pl-s1"&gt;result&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-v"&gt;CliRunner&lt;/span&gt;().&lt;span class="pl-en"&gt;invoke&lt;/span&gt;(
        &lt;span class="pl-s1"&gt;cli&lt;/span&gt;.&lt;span class="pl-s1"&gt;cli&lt;/span&gt;, [&lt;span class="pl-s"&gt;"create-database"&lt;/span&gt;, &lt;span class="pl-en"&gt;str&lt;/span&gt;(&lt;span class="pl-s1"&gt;db_path&lt;/span&gt;)]
    )
    &lt;span class="pl-k"&gt;assert&lt;/span&gt; &lt;span class="pl-s1"&gt;result&lt;/span&gt;.&lt;span class="pl-s1"&gt;exit_code&lt;/span&gt; &lt;span class="pl-c1"&gt;==&lt;/span&gt; &lt;span class="pl-c1"&gt;0&lt;/span&gt;
    &lt;span class="pl-k"&gt;assert&lt;/span&gt; &lt;span class="pl-s1"&gt;db_path&lt;/span&gt;.&lt;span class="pl-en"&gt;exists&lt;/span&gt;()&lt;/pre&gt;
&lt;p&gt;This test uses the &lt;a href="https://docs.pytest.org/en/6.2.x/tmpdir.html#the-tmpdir-fixture"&gt;tmpdir pytest fixture&lt;/a&gt; to provide a temporary directory that will be automatically cleaned up by pytest after the test run finishes.&lt;/p&gt;
&lt;p&gt;It checks that the &lt;code&gt;test.db&lt;/code&gt; file doesn't exist yet, then uses the Click framework's &lt;a href="https://click.palletsprojects.com/en/8.0.x/testing/"&gt;CliRunner utility&lt;/a&gt; to execute the create-database command. Then it checks that the command didn't throw an error and that the file has been created.&lt;/p&gt;
&lt;p&gt;Then I run the test and watch it fail - because I haven't built the feature yet!&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% pytest -k test_create_database

============ test session starts ============
platform darwin -- Python 3.8.2, pytest-6.2.4, py-1.10.0, pluggy-0.13.1
rootdir: /Users/simon/Dropbox/Development/sqlite-utils
plugins: cov-2.12.1, hypothesis-6.14.5
collected 808 items / 807 deselected / 1 selected                           

tests/test_cli.py F                                                   [100%]

================= FAILURES ==================
___________ test_create_database ____________

tmpdir = local('/private/var/folders/wr/hn3206rs1yzgq3r49bz8nvnh0000gn/T/pytest-of-simon/pytest-659/test_create_database0')

    def test_create_database(tmpdir):
        db_path = tmpdir / "test.db"
        assert not db_path.exists()
        result = CliRunner().invoke(
            cli.cli, ["create-database", str(db_path)]
        )
&amp;gt;       assert result.exit_code == 0
E       assert 1 == 0
E        +  where 1 = &amp;lt;Result SystemExit(1)&amp;gt;.exit_code

tests/test_cli.py:2097: AssertionError
========== short test summary info ==========
FAILED tests/test_cli.py::test_create_database - assert 1 == 0
===== 1 failed, 807 deselected in 0.99s ====
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;-k&lt;/code&gt; option lets me run just the tests that match the search string, rather than running the full test suite. I use this all the time.&lt;/p&gt;
&lt;p&gt;Other pytest features I often use:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;pytest -x&lt;/code&gt;: runs the entire test suite but quits at the first test that fails&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pytest --lf&lt;/code&gt;: re-runs any tests that failed during the last test run&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pytest --pdb -x&lt;/code&gt;: open the Python debugger at the first failed test (omit the &lt;code&gt;-x&lt;/code&gt; to open it at every failed test). This is the main way I interact with the Python debugger. I often use this to help write the tests, since I can add &lt;code&gt;assert False&lt;/code&gt; and get a shell inside the test to interact with various objects and figure out how to best run assertions against them.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="implementing-the-feature"&gt;Implementing the feature&lt;/h4&gt;
&lt;p&gt;With the test in place, it's time to implement the command. I added this code to my existing &lt;a href="https://github.com/simonw/sqlite-utils/blob/3.20/sqlite_utils/cli.py"&gt;cli.py module&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-en"&gt;@&lt;span class="pl-s1"&gt;cli&lt;/span&gt;.&lt;span class="pl-en"&gt;command&lt;/span&gt;(&lt;span class="pl-s1"&gt;name&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"create-database"&lt;/span&gt;)&lt;/span&gt;
&lt;span class="pl-en"&gt;@&lt;span class="pl-s1"&gt;click&lt;/span&gt;.&lt;span class="pl-en"&gt;argument&lt;/span&gt;(&lt;/span&gt;
&lt;span class="pl-en"&gt;    &lt;span class="pl-s"&gt;"path"&lt;/span&gt;,&lt;/span&gt;
&lt;span class="pl-en"&gt;    &lt;span class="pl-s1"&gt;type&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s1"&gt;click&lt;/span&gt;.&lt;span class="pl-v"&gt;Path&lt;/span&gt;(&lt;span class="pl-s1"&gt;file_okay&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;, &lt;span class="pl-s1"&gt;dir_okay&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;False&lt;/span&gt;, &lt;span class="pl-s1"&gt;allow_dash&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;False&lt;/span&gt;),&lt;/span&gt;
&lt;span class="pl-en"&gt;    &lt;span class="pl-s1"&gt;required&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;,&lt;/span&gt;
&lt;span class="pl-en"&gt;)&lt;/span&gt;
&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;create_database&lt;/span&gt;(&lt;span class="pl-s1"&gt;path&lt;/span&gt;):
    &lt;span class="pl-s"&gt;"Create a new empty database file."&lt;/span&gt;
    &lt;span class="pl-s1"&gt;db&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;sqlite_utils&lt;/span&gt;.&lt;span class="pl-v"&gt;Database&lt;/span&gt;(&lt;span class="pl-s1"&gt;path&lt;/span&gt;)
    &lt;span class="pl-s1"&gt;db&lt;/span&gt;.&lt;span class="pl-en"&gt;vacuum&lt;/span&gt;()&lt;/pre&gt;
&lt;p&gt;(I happen to know that the quickest way to create an empty SQLite database file is to run &lt;code&gt;VACUUM&lt;/code&gt; against it.)&lt;/p&gt;
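&lt;p&gt;That trick is easy to verify with Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module - this standalone sketch (not part of sqlite-utils itself) shows that connecting alone leaves a zero-byte file, and running &lt;code&gt;VACUUM&lt;/code&gt; is what writes a valid database to disk:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "empty.db")
conn = sqlite3.connect(path)  # creates a zero-byte file on disk
conn.execute("VACUUM")        # forces SQLite to write a valid empty database
conn.close()

# The file now begins with the standard 16-byte SQLite header
with open(path, "rb") as f:
    print(f.read(16))  # b'SQLite format 3\x00'
&lt;/code&gt;&lt;/pre&gt;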
&lt;p&gt;The test now passes!&lt;/p&gt;
&lt;p&gt;I iterated on this implementation a little bit more, to add the &lt;code&gt;--enable-wal&lt;/code&gt; option I had designed &lt;a href="https://github.com/simonw/sqlite-utils/issues/348#issuecomment-983120066"&gt;in the issue comments&lt;/a&gt; - and updated the test to match. You can see the final implementation in this commit: &lt;a href="https://github.com/simonw/sqlite-utils/commit/1d64cd2e5b402ff957f9be2d9bb490d313c73989"&gt;1d64cd2e5b402ff957f9be2d9bb490d313c73989&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If I add a new test and it passes the first time, I’m always suspicious of it. I’ll deliberately break the test (change a 1 to a 2 for example) and run it again to make sure it fails, then change it back again.&lt;/p&gt;
&lt;h4 id="code-formatting-with-black"&gt;Code formatting with Black&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://github.com/psf/black"&gt;Black&lt;/a&gt; has increased my productivity as a Python developer by a material amount. I used to spend a whole bunch of brain cycles agonizing over how to indent my code, where to break up long function calls and suchlike. Thanks to Black I never think about this at all - I instinctively run &lt;code&gt;black .&lt;/code&gt; in the root of my project and accept whatever style decisions it applies for me.&lt;/p&gt;
&lt;h4 id="linting"&gt;Linting&lt;/h4&gt;
&lt;p&gt;I have a few linters set up to run on every commit. I can run these locally too - how to do that is &lt;a href="https://sqlite-utils.datasette.io/en/stable/contributing.html#linting-and-formatting"&gt;documented here&lt;/a&gt; - but I'm often a bit lazy and leave them to &lt;a href="https://github.com/simonw/sqlite-utils/blob/main/.github/workflows/test.yml"&gt;run in CI&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In this case one of my linters failed! I accidentally called the new command function &lt;code&gt;create_table()&lt;/code&gt; when it should have been called &lt;code&gt;create_database()&lt;/code&gt;. The code worked fine due to how the &lt;code&gt;cli.command(name=...)&lt;/code&gt; decorator works but &lt;code&gt;mypy&lt;/code&gt; &lt;a href="https://github.com/simonw/sqlite-utils/runs/4754944593?check_suite_focus=true"&gt;complained about&lt;/a&gt; the redefined function name. I fixed that in &lt;a href="https://github.com/simonw/sqlite-utils/commit/2f8879235afc6a06a8ae25ded1b2fe289ad8c3a6#diff-76294b3d4afeb27e74e738daa01c26dd4dc9ccb6f4477451483a2ece1095902e"&gt;a separate commit&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="documentation"&gt;Documentation&lt;/h4&gt;
&lt;p&gt;My policy these days is that if a feature isn't documented it doesn't exist. Updating documentation that's already in place isn't much work at all, and over time these incremental improvements add up to something really comprehensive.&lt;/p&gt;
&lt;p&gt;For smaller projects I use a single &lt;code&gt;README.md&lt;/code&gt; which gets displayed on both GitHub and PyPI (and the Datasette website too, for example on &lt;a href="https://datasette.io/tools/git-history"&gt;datasette.io/tools/git-history&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;My larger projects, such as &lt;a href="https://docs.datasette.io/"&gt;Datasette&lt;/a&gt; and &lt;a href="https://sqlite-utils.datasette.io/"&gt;sqlite-utils&lt;/a&gt;, use &lt;a href="https://readthedocs.org/"&gt;Read the Docs&lt;/a&gt; and &lt;a href="https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html"&gt;reStructuredText&lt;/a&gt; with &lt;a href="https://www.sphinx-doc.org/"&gt;Sphinx&lt;/a&gt; instead.&lt;/p&gt;
&lt;p&gt;I like reStructuredText mainly because it has really good support for internal reference links - something that is missing from Markdown, though it can be enabled using &lt;a href="https://myst-parser.readthedocs.io"&gt;MyST&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;sqlite-utils&lt;/code&gt; uses Sphinx. I have the &lt;a href="https://github.com/executablebooks/sphinx-autobuild"&gt;sphinx-autobuild&lt;/a&gt; extension configured, which means I can run a live reloading server with the documentation like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cd docs
make livehtml
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Any time I'm working on the documentation I have that server running, so I can hit "save" in VS Code and see a preview in my browser a few seconds later.&lt;/p&gt;
&lt;p&gt;For Markdown documentation I use the VS Code preview pane directly.&lt;/p&gt;
&lt;p&gt;The moment the documentation is live online, I like to add a link to it in a comment on the issue thread.&lt;/p&gt;
&lt;h4 id="committing-the-change"&gt;Committing the change&lt;/h4&gt;
&lt;p&gt;I run &lt;code&gt;git diff&lt;/code&gt; a LOT while hacking on code, to make sure I haven’t accidentally changed something unrelated. This also helps spot things like rogue &lt;code&gt;print()&lt;/code&gt; debug statements I may have added.&lt;/p&gt;
&lt;p&gt;Before my final commit, I sometimes even run &lt;code&gt;git diff | grep print&lt;/code&gt; to check for those.&lt;/p&gt;
&lt;p&gt;My goal with the commit is to bundle the test, documentation and implementation. If those are the only files I've changed I do this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git commit -a -m "sqlite-utils create-database command, closes #348"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If this completes the work on the issue I use "&lt;code&gt;closes #N&lt;/code&gt;", which causes GitHub to close the issue for me. If it's not yet ready to close I use "&lt;code&gt;refs #N&lt;/code&gt;" instead.&lt;/p&gt;
&lt;p&gt;Sometimes there will be unrelated changes in my working directory. If so, I use &lt;code&gt;git add &amp;lt;files&amp;gt;&lt;/code&gt; and then commit just with &lt;code&gt;git commit -m message&lt;/code&gt;.&lt;/p&gt;
&lt;h4 id="branches-and-pull-requests"&gt;Branches and pull requests&lt;/h4&gt;
&lt;p&gt;&lt;code&gt;create-database&lt;/code&gt; is a good example of a feature that can be implemented in a single commit, with no need to work in a branch.&lt;/p&gt;
&lt;p&gt;For larger features, I'll work in a feature branch:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git checkout -b my-feature
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I'll make a commit (often just labelled "WIP prototype, refs #N") and then push that to GitHub and open a pull request for it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git push -u origin my-feature 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I ensure the new pull request links back to the issue in its description, then switch my ongoing commentary to comments on the pull request itself.&lt;/p&gt;
&lt;p&gt;I'll sometimes add a task checklist to the opening comment on the pull request, since tasks there get reflected in the GitHub UI anywhere that links to the PR. Then I'll check those off as I complete them.&lt;/p&gt;
&lt;p&gt;An example of a PR I used like this is &lt;a href="https://github.com/simonw/sqlite-utils/pull/361"&gt;#361: --lines and --text and --convert and --import&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I don't like merge commits - I much prefer to keep my &lt;code&gt;main&lt;/code&gt; branch history as linear as possible. I usually merge my PRs through the GitHub web interface using the squash feature, which results in a single, clean commit to main with the combined tests, documentation and implementation. Occasionally I will see value in keeping the individual commits, in which case I will rebase merge them.&lt;/p&gt;
&lt;p&gt;Another goal here is to keep the &lt;code&gt;main&lt;/code&gt; branch releasable at all times. Incomplete work should stay in a branch. This makes turning around and releasing quick bug fixes a lot less stressful!&lt;/p&gt;
&lt;h4 id="release-notes-and-a-release"&gt;Release notes, and a release&lt;/h4&gt;
&lt;p&gt;A feature isn't truly finished until it's been released to &lt;a href="https://pypi.org/"&gt;PyPI&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;All of my projects are configured the same way: they use GitHub releases to trigger a GitHub Actions workflow which publishes the new release to PyPI. The &lt;code&gt;sqlite-utils&lt;/code&gt; workflow for that &lt;a href="https://github.com/simonw/sqlite-utils/blob/main/.github/workflows/publish.yml"&gt;is here in publish.yml&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://cookiecutter.readthedocs.io/"&gt;cookiecutter&lt;/a&gt; templates for new projects set up this workflow for me. I just need to create a PyPI token for the project and assign it as a repository secret. See the &lt;a href="https://github.com/simonw/python-lib"&gt;python-lib cookiecutter README&lt;/a&gt; for details.&lt;/p&gt;
&lt;p&gt;To push out a new release, I need to increment the version number in &lt;a href="https://github.com/simonw/sqlite-utils/blob/main/setup.py"&gt;setup.py&lt;/a&gt; and write the release notes.&lt;/p&gt;
&lt;p&gt;I use &lt;a href="https://semver.org/"&gt;semantic versioning&lt;/a&gt; - a new feature is a minor version bump, a breaking change is a major version bump (I try very hard to avoid these) and a bug fix or documentation-only update is a patch increment.&lt;/p&gt;
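&lt;p&gt;Those rules can be sketched as a tiny helper - a hypothetical illustration of the decision, not code from any of my projects:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def bump_version(version, change):
    # Semantic versioning: major for breaking changes, minor for new
    # features, patch for bug fixes and documentation-only releases.
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "major":
        return f"{major + 1}.0.0"
    if change == "minor":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"

print(bump_version("3.20.0", "minor"))  # 3.21.0
&lt;/code&gt;&lt;/pre&gt;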
&lt;p&gt;Since &lt;code&gt;create-database&lt;/code&gt; was a new feature, it went out in &lt;a href="https://github.com/simonw/sqlite-utils/releases/3.21"&gt;release 3.21&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;My projects that use Sphinx for documentation have &lt;a href="https://github.com/simonw/sqlite-utils/blob/main/docs/changelog.rst"&gt;changelog.rst&lt;/a&gt; files in their repositories. I add the release notes there, linking to the relevant issues and cross-referencing the new documentation. Then I ship a commit that bundles the release notes with the bumped version number, with a commit message that looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git commit -m "Release 3.21

Refs #348, #364, #366, #368, #371, #372, #374, #375, #376, #379"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's &lt;a href="https://github.com/simonw/sqlite-utils/commit/7c637b11805adc3d3970076a7ba6afe8e34b371e"&gt;the commit for release 3.21&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Referencing the issue numbers in the release automatically adds a note to their issue threads indicating the release that they went out in.&lt;/p&gt;
&lt;p&gt;I generate that list of issue numbers by pasting the release notes into an Observable notebook I built for the purpose: &lt;a href="https://observablehq.com/@simonw/extract-issue-numbers-from-pasted-text"&gt;Extract issue numbers from pasted text&lt;/a&gt;. Observable is really great for building this kind of tiny interactive utility.&lt;/p&gt;
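&lt;p&gt;The same extraction is nearly a one-liner in Python - a rough sketch of what that notebook does, with a made-up sample input:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import re

def extract_issue_numbers(text):
    # Find every #N reference and return the unique numbers in order
    return sorted({int(n) for n in re.findall(r"#(\d+)", text)})

notes = "sqlite-utils create-database command, closes #348 (see also #372)"
print(extract_issue_numbers(notes))  # [348, 372]
&lt;/code&gt;&lt;/pre&gt;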
&lt;p&gt;For projects that just have a README I write the release notes in Markdown and paste them directly into the GitHub "new release" form.&lt;/p&gt;
&lt;p&gt;I like to duplicate the release notes to GitHub releases for my Sphinx changelog projects too. This is mainly so the &lt;a href="https://datasette.io/"&gt;datasette.io&lt;/a&gt; website will display the release notes on its homepage, which is populated &lt;a href="https://simonwillison.net/2020/Dec/13/datasette-io/"&gt;at build time&lt;/a&gt; using the GitHub GraphQL API.&lt;/p&gt;
&lt;p&gt;To convert my reStructuredText to Markdown I copy and paste the rendered HTML into this brilliant &lt;a href="https://euangoddard.github.io/clipboard2markdown/"&gt;Paste to Markdown&lt;/a&gt; tool by &lt;a href="https://github.com/euangoddard"&gt;Euan Goddard&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="a-live-demo"&gt;A live demo&lt;/h4&gt;
&lt;p&gt;When possible, I like to have a live demo that I can link to.&lt;/p&gt;
&lt;p&gt;This is easiest for features in Datasette core. Datasette’s main branch gets &lt;a href="https://github.com/simonw/datasette/blob/0.60a1/.github/workflows/deploy-latest.yml#L51-L73"&gt;deployed automatically&lt;/a&gt; to &lt;a href="https://latest.datasette.io/"&gt;latest.datasette.io&lt;/a&gt; so I can often link to a demo there.&lt;/p&gt;
&lt;p&gt;For Datasette plugins, I’ll deploy a fresh instance with the plugin (e.g. &lt;a href="https://datasette-graphql-demo.datasette.io/"&gt;this one for datasette-graphql&lt;/a&gt;) or (more commonly) add it to my big &lt;a href="https://latest-with-plugins.datasette.io/"&gt;latest-with-plugins.datasette.io&lt;/a&gt; instance - which tries to demonstrate what happens to Datasette if you install dozens of plugins at once (so far it works OK).&lt;/p&gt;
&lt;p&gt;Here’s a demo of the &lt;a href="https://datasette.io/plugins/datasette-copyable"&gt;datasette-copyable plugin&lt;/a&gt; running there:  &lt;a href="https://latest-with-plugins.datasette.io/github/commits.copyable"&gt;https://latest-with-plugins.datasette.io/github/commits.copyable&lt;/a&gt;&lt;/p&gt;
&lt;h4 id="tell-the-world-about-it"&gt;Tell the world about it&lt;/h4&gt;
&lt;p&gt;The last step is to tell the world (beyond the people who meticulously read the release notes) about the new feature.&lt;/p&gt;
&lt;p&gt;Depending on the size of the feature, I might do this with a tweet &lt;a href="https://twitter.com/simonw/status/1455266746701471746"&gt;like this one&lt;/a&gt; - usually with a screenshot and a link to the documentation. I often extend this into a short Twitter thread, which gives me a chance to link to related concepts and demos or add more screenshots.&lt;/p&gt;
&lt;p&gt;For larger or more interesting features I'll blog about them. I may save this for my weekly &lt;a href="https://simonwillison.net/tags/weeknotes/"&gt;weeknotes&lt;/a&gt;, but sometimes for particularly exciting features I'll write up a dedicated blog entry. Some examples include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2020/Sep/23/sqlite-advanced-alter-table/"&gt;Executing advanced ALTER TABLE operations in SQLite&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2020/Jul/30/fun-binary-data-and-sqlite/"&gt;Fun with binary data and SQLite&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2020/Sep/23/sqlite-utils-extract/"&gt;Refactoring databases with sqlite-utils extract&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2021/Jun/19/sqlite-utils-memory/"&gt;Joining CSV and JSON data with an in-memory SQLite database&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2021/Aug/6/sqlite-utils-convert/"&gt;Apply conversion functions to data in SQLite columns with the sqlite-utils CLI tool&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I may even assemble a full set of &lt;a href="https://simonwillison.net/tags/annotatedreleasenotes/"&gt;annotated release notes&lt;/a&gt; on my blog, where I quote each item from the release in turn and provide some fleshed out examples plus background information on why I built it.&lt;/p&gt;
&lt;p&gt;If it’s a new Datasette (or Datasette-adjacent) feature, I’ll try to remember to write about it in the next edition of the &lt;a href="https://datasette.substack.com/"&gt;Datasette Newsletter&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Finally, if I learned a new trick while building a feature I might extract that into &lt;a href="https://til.simonwillison.net/"&gt;a TIL&lt;/a&gt;. If I do that I'll link to the new TIL from the issue thread.&lt;/p&gt;
&lt;h4 id="more-examples-of-this-pattern"&gt;More examples of this pattern&lt;/h4&gt;
&lt;p&gt;Here are a bunch of examples of commits that implement this pattern, combining the tests, implementation and documentation into a single unit:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;sqlite-utils: &lt;a href="https://github.com/simonw/sqlite-utils/commit/324ebc31308752004fe5f7e4941fc83706c5539c"&gt;adding --limit and --offset to sqlite-utils rows&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;sqlite-utils: &lt;a href="https://github.com/simonw/sqlite-utils/commit/d83b2568131f2b1cc01228419bb08c96d843d65d"&gt;--where and -p options for sqlite-utils convert&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;s3-credentials: &lt;a href="https://github.com/simonw/s3-credentials/commit/905258379817e8b458528e4ccc5e6cc2c8cf4352"&gt;s3-credentials policy command&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;datasette: &lt;a href="https://github.com/simonw/datasette/commit/5cadc244895fc47e0534c6e90df976d34293921e"&gt;db.execute_write_script() and db.execute_write_many()&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;datasette: &lt;a href="https://github.com/simonw/datasette/commit/992496f2611a72bd51e94bfd0b17c1d84e732487"&gt;?_nosuggest=1 parameter for table views&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;datasette-graphql: &lt;a href="https://github.com/simonw/datasette-graphql/commit/2d8c042e93e3429c5b187121d26f8817997073dd"&gt;GraphQL execution limits: time_limit_ms and num_queries_limit&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/software-engineering"&gt;software-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/testing"&gt;testing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pytest"&gt;pytest&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/black"&gt;black&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/read-the-docs"&gt;read-the-docs&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-issues"&gt;github-issues&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="git"/><category term="github"/><category term="software-engineering"/><category term="testing"/><category term="pytest"/><category term="black"/><category term="read-the-docs"/><category term="github-issues"/></entry><entry><title>git-history: a tool for analyzing scraped data collected using Git and SQLite</title><link href="https://simonwillison.net/2021/Dec/7/git-history/#atom-tag" rel="alternate"/><published>2021-12-07T22:32:55+00:00</published><updated>2021-12-07T22:32:55+00:00</updated><id>https://simonwillison.net/2021/Dec/7/git-history/#atom-tag</id><summary type="html">
    &lt;p&gt;I described &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt; last year: a technique for writing scrapers where you periodically snapshot a source of data to a Git repository in order to record changes to that source over time.&lt;/p&gt;
&lt;p&gt;The open challenge was how to analyze that data once it was collected. &lt;a href="https://datasette.io/tools/git-history"&gt;git-history&lt;/a&gt; is my new tool designed to tackle that problem.&lt;/p&gt;
&lt;h4&gt;Git scraping, a refresher&lt;/h4&gt;
&lt;p&gt;A neat thing about scraping to a Git repository is that the scrapers themselves can be really simple. I demonstrated how to run scrapers for free using GitHub Actions in this &lt;a href="https://simonwillison.net/2021/Mar/5/git-scraping/"&gt;five minute lightning talk&lt;/a&gt; back in March.&lt;/p&gt;
&lt;p&gt;Here's a concrete example: California's state fire department, Cal Fire, maintains an incident map at &lt;a href="https://www.fire.ca.gov/incidents/"&gt;fire.ca.gov/incidents&lt;/a&gt; showing the status of current large fires in the state.&lt;/p&gt;
&lt;p&gt;I found the underlying data here:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then I built &lt;a href="https://github.com/simonw/ca-fires-history/blob/main/.github/workflows/scrape.yml"&gt;a simple scraper&lt;/a&gt; that grabs a copy of that every 20 minutes and commits it to Git. I've been running that for 14 months now, and it's collected &lt;a href="https://github.com/simonw/ca-fires-history"&gt;1,559 commits&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;The thing that excites me most about Git scraping is that it can create truly unique datasets. It's common for organizations not to keep detailed archives of what changed and where, so by scraping their data into a Git repository you can often end up with a more detailed history than they maintain themselves.&lt;/p&gt;
&lt;p&gt;There's one big challenge though: having collected that data, how can you best analyze it? Reading through thousands of commit differences and eyeballing changes to JSON or CSV files isn't a great way of finding the interesting stories that have been captured.&lt;/p&gt;
&lt;h4&gt;git-history&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://datasette.io/tools/git-history"&gt;git-history&lt;/a&gt; is the new CLI tool I've built to answer that question. It reads through the entire history of a file and generates a SQLite database reflecting changes to that file over time. You can then use &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; to explore the resulting data.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://git-history-demos.datasette.io/ca-fires"&gt;an example database&lt;/a&gt; created by running the tool against my &lt;code&gt;ca-fires-history&lt;/code&gt; repository. I created the SQLite database by running this in the repository directory:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;git-history file ca-fires.db incidents.json \
  --namespace incident \
  --id UniqueId \
  --convert &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;json.loads(content)["Incidents"]&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/ca-fires-progress.gif" alt="Animated gif showing the progress bar" style="max-width:100%; border-top: 5px solid black;" /&gt;&lt;/p&gt;
&lt;p&gt;In this example we are processing the history of a single file called &lt;code&gt;incidents.json&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;We use the &lt;code&gt;UniqueId&lt;/code&gt; column to distinguish records that have changed over time from records that are newly created.&lt;/p&gt;
&lt;p&gt;Specifying &lt;code&gt;--namespace incident&lt;/code&gt; causes the created database tables to be called &lt;code&gt;incident&lt;/code&gt; and &lt;code&gt;incident_version&lt;/code&gt; rather than the default of &lt;code&gt;item&lt;/code&gt; and &lt;code&gt;item_version&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;And we have a fragment of Python code that knows how to turn each version stored in that commit history into a list of objects compatible with the tool - see &lt;a href="https://github.com/simonw/git-history/blob/0.6/README.md#custom-conversions-using---convert"&gt;--convert in the documentation&lt;/a&gt; for details.&lt;/p&gt;
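&lt;p&gt;As an illustrative sketch (this mirrors the behaviour described here, not git-history's actual internals), the &lt;code&gt;--convert&lt;/code&gt; expression acts like the body of a function that receives the tracked file's content at each commit and returns a list of records:&lt;/p&gt;

```python
import json

# Sketch of how the --convert expression is applied (illustrative only,
# not git-history's internal code): `content` is bound to the tracked
# file's text at a given commit, and the expression must produce a list
# of dicts that each expose the --id column.
def apply_convert(content):
    # equivalent to: --convert 'json.loads(content)["Incidents"]'
    return json.loads(content)["Incidents"]

snapshot = '{"Incidents": [{"UniqueId": "abc", "Name": "Dixie Fire"}]}'
records = apply_convert(snapshot)
# records -> [{"UniqueId": "abc", "Name": "Dixie Fire"}]
```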
&lt;p&gt;Let's use the database to answer some questions about fires in California over the past 14 months.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;incident&lt;/code&gt; table contains a copy of the latest record for every incident. We can use that to see &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident"&gt;a map of every fire&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/ca-fires-map.png" alt="A map showing 250 fires in California" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;This uses the &lt;a href="https://datasette.io/plugins/datasette-cluster-map"&gt;datasette-cluster-map&lt;/a&gt; plugin, which draws a map of every row with a valid latitude and longitude column.&lt;/p&gt;
&lt;p&gt;Where things get interesting is the &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version"&gt;incident_version&lt;/a&gt; table. This is where changes between different scraped versions of each item are recorded.&lt;/p&gt;
&lt;p&gt;Those 250 fires have 2,060 recorded versions. If we &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item"&gt;facet by _item&lt;/a&gt; we can see which fires had the most versions recorded. Here are the top ten:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=174"&gt;Dixie Fire&lt;/a&gt; 268&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=209"&gt;Caldor Fire&lt;/a&gt; 153&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=197"&gt;Monument Fire&lt;/a&gt; 65&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=1"&gt;August Complex (includes Doe Fire)&lt;/a&gt; 64&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=2"&gt;Creek Fire&lt;/a&gt; 56&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=213"&gt;French Fire&lt;/a&gt; 53&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=32"&gt;Silverado Fire&lt;/a&gt; 52&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=240"&gt;Fawn Fire&lt;/a&gt; 45&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=34"&gt;Blue Ridge Fire&lt;/a&gt; 39&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=190"&gt;McFarland Fire&lt;/a&gt; 34&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This looks about right - the larger the number of versions the longer the fire must have been burning. The Dixie Fire &lt;a href="https://en.wikipedia.org/wiki/Dixie_Fire"&gt;has its own Wikipedia page&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;Clicking through to &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=174"&gt;the Dixie Fire&lt;/a&gt; lands us on a page showing every "version" that we captured, ordered by version number.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;git-history&lt;/code&gt; only writes values to this table that have changed since the previous version. This means you can glance at the table grid and get a feel for which pieces of information were updated over time:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/ca-fires-incident-versions.png" alt="The table showing changes to that fire over time" style="max-width:100%;" /&gt;&lt;/p&gt;
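&lt;p&gt;The core idea can be sketched in a few lines of Python (this mirrors the behaviour described above, not git-history's own implementation): given the previous and current snapshots of a record, keep only the columns whose values differ.&lt;/p&gt;

```python
# Keep only columns whose value differs from the previous version -
# a sketch of the behaviour described above, not git-history's own code.
def changed_columns(previous, current):
    return {
        key: value
        for key, value in current.items()
        if previous.get(key) != value
    }

v1 = {"Name": "Demo Fire", "AcresBurned": 100, "PercentContained": 0}
v2 = {"Name": "Demo Fire", "AcresBurned": 250, "PercentContained": 5}
changed = changed_columns(v1, v2)
# changed -> {"AcresBurned": 250, "PercentContained": 5}
```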
&lt;p&gt;The &lt;code&gt;ConditionStatement&lt;/code&gt; is a text description that changes frequently, but the other two interesting columns look to be &lt;code&gt;AcresBurned&lt;/code&gt; and &lt;code&gt;PercentContained&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;That &lt;code&gt;_commit&lt;/code&gt; column is a foreign key to the &lt;a href="https://git-history-demos.datasette.io/ca-fires/commits"&gt;commits&lt;/a&gt; table, which records the commits that have been processed by the tool - mainly so that when you run it a second time it can pick up where it finished last time.&lt;/p&gt;
&lt;p&gt;We can join against &lt;code&gt;commits&lt;/code&gt; to see the date that each version was created. Or we can use the &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version_detail"&gt;incident_version_detail&lt;/a&gt; view which performs that join for us.&lt;/p&gt;
&lt;p&gt;Using that view, we can filter for just rows where &lt;code&gt;_item&lt;/code&gt; is 174 and &lt;code&gt;AcresBurned&lt;/code&gt; is not blank, then use the &lt;a href="https://datasette.io/plugins/datasette-vega"&gt;datasette-vega&lt;/a&gt; plugin to visualize the &lt;code&gt;_commit_at&lt;/code&gt; date column against the &lt;code&gt;AcresBurned&lt;/code&gt; numeric column... and we get a graph of &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version_detail?_item__exact=174&amp;amp;AcresBurned__notblank=1#g.mark=line&amp;amp;g.x_column=_commit_at&amp;amp;g.x_type=temporal&amp;amp;g.y_column=AcresBurned&amp;amp;g.y_type=quantitative"&gt;the growth of the Dixie Fire over time&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/ca-fires-chart.png" alt="The chart plugin showing a line chart" style="max-width:100%;" /&gt;&lt;/p&gt;
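&lt;p&gt;Under the hood that view is a plain SQL join. Here's a minimal sketch using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module, with a cut-down version of the schema shown later in this post - the rows are invented for illustration:&lt;/p&gt;

```python
import sqlite3

# A cut-down sketch of the join performed by the incident_version_detail
# view: attach each version row to the date of the commit that produced
# it. Schema is simplified; the data here is made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE commits (id INTEGER PRIMARY KEY, hash TEXT, commit_at TEXT);
CREATE TABLE incident_version (
   _id INTEGER PRIMARY KEY,
   _item INTEGER,
   _version INTEGER,
   _commit INTEGER REFERENCES commits(id),
   AcresBurned INTEGER
);
INSERT INTO commits VALUES (1, 'abc123', '2021-07-14'), (2, 'def456', '2021-07-15');
INSERT INTO incident_version VALUES (1, 174, 1, 1, 100), (2, 174, 2, 2, 250);
""")
rows = conn.execute("""
    SELECT commits.commit_at AS _commit_at, incident_version.AcresBurned
    FROM incident_version
    JOIN commits ON incident_version._commit = commits.id
    WHERE incident_version._item = 174
      AND incident_version.AcresBurned IS NOT NULL
    ORDER BY commits.commit_at
""").fetchall()
# rows -> [('2021-07-14', 100), ('2021-07-15', 250)]
```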
&lt;p&gt;To review: we started out with a GitHub Actions scheduled workflow grabbing a copy of a JSON API endpoint every 20 minutes. Thanks to &lt;code&gt;git-history&lt;/code&gt;, Datasette and &lt;code&gt;datasette-vega&lt;/code&gt; we now have a chart showing the growth of the longest-lived California wildfire of the last 14 months over time.&lt;/p&gt;
&lt;h4&gt;A note on schema design&lt;/h4&gt;
&lt;p&gt;One of the hardest problems in designing &lt;code&gt;git-history&lt;/code&gt; was deciding on an appropriate schema for storing version changes over time.&lt;/p&gt;
&lt;p&gt;I ended up with the following (edited for clarity):&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;CREATE TABLE [commits] (
   [id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [hash] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [commit_at] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
);
CREATE TABLE [item] (
   [_id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [_item_id] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [IncidentID] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [Location] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [Type] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [_commit] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;
);
CREATE TABLE [item_version] (
   [_id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [_item] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [item]([_id]),
   [_version] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;,
   [_commit] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [commits]([id]),
   [IncidentID] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [Location] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [Type] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
);
CREATE TABLE [columns] (
   [id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [namespace] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [namespaces]([id]),
   [name] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
);
CREATE TABLE [item_changed] (
   [item_version] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [item_version]([_id]),
   [column] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [columns]([id]),
   &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt; ([item_version], [column])
);&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;As shown earlier, records in the &lt;code&gt;item_version&lt;/code&gt; table represent snapshots over time - but to save on database space and provide a neater interface for browsing versions, they only record columns that had changed since their previous version. Any unchanged columns are stored as &lt;code&gt;null&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;There's one catch with this schema: what do we do if a new version of an item sets one of the columns to &lt;code&gt;null&lt;/code&gt;? How can we tell the difference between that and a column that didn't change?&lt;/p&gt;
&lt;p&gt;I ended up solving that with an &lt;code&gt;item_changed&lt;/code&gt; many-to-many table, which uses pairs of integers (hopefully taking up as little space as possible) to record exactly which columns were modified in which &lt;code&gt;item_version&lt;/code&gt; records.&lt;/p&gt;
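&lt;p&gt;Here's a sketch of that disambiguation logic (illustrative, not git-history's own code): a &lt;code&gt;null&lt;/code&gt; in the version row means unchanged, unless a matching pair in the change log says the column really was modified:&lt;/p&gt;

```python
# Sketch of the disambiguation: store None for unchanged columns, and
# log a (version, column) pair for every column that really changed -
# including changes *to* None. Illustrative only, not git-history's code.
def record_version(version_id, previous, current, changed_log):
    row = {}
    for column, value in current.items():
        if previous.get(column) != value:
            row[column] = value                      # new value, may be None
            changed_log.append((version_id, column))
        else:
            row[column] = None                       # None here means "unchanged"
    return row

log = []
row = record_version(
    2,
    {"Engines": 5, "Dozers": 3},       # previous version
    {"Engines": None, "Dozers": 3},    # new version sets Engines to null
    log,
)
# row -> {"Engines": None, "Dozers": None}
# log -> [(2, "Engines")]: only Engines actually changed, to a real null
```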
&lt;p&gt;The &lt;code&gt;item_version_detail&lt;/code&gt; view displays columns from that many-to-many table as JSON - here's &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version_detail?_version__gt=1&amp;amp;_col=_changed_columns&amp;amp;_col=_item&amp;amp;_col=_version"&gt;a filtered example&lt;/a&gt; showing which columns were changed in which versions of which items:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/ca-fires-changed-columns.png" alt="This table shows a JSON list of column names against items and versions" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://git-history-demos.datasette.io/ca-fires?sql=select+columns.name%2C+count%28*%29%0D%0Afrom+incident_changed%0D%0A++join+incident_version+on+incident_changed.item_version+%3D+incident_version._id%0D%0A++join+columns+on+incident_changed.column+%3D+columns.id%0D%0Awhere+incident_version._version+%3E+1%0D%0Agroup+by+columns.name%0D%0Aorder+by+count%28*%29+desc"&gt;a SQL query&lt;/a&gt; that shows, for &lt;code&gt;ca-fires&lt;/code&gt;, which columns were updated most often:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;select&lt;/span&gt; &lt;span class="pl-c1"&gt;columns&lt;/span&gt;.&lt;span class="pl-c1"&gt;name&lt;/span&gt;, &lt;span class="pl-c1"&gt;count&lt;/span&gt;(&lt;span class="pl-k"&gt;*&lt;/span&gt;)
&lt;span class="pl-k"&gt;from&lt;/span&gt; incident_changed
  &lt;span class="pl-k"&gt;join&lt;/span&gt; incident_version &lt;span class="pl-k"&gt;on&lt;/span&gt; &lt;span class="pl-c1"&gt;incident_changed&lt;/span&gt;.&lt;span class="pl-c1"&gt;item_version&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;incident_version&lt;/span&gt;.&lt;span class="pl-c1"&gt;_id&lt;/span&gt;
  &lt;span class="pl-k"&gt;join&lt;/span&gt; columns &lt;span class="pl-k"&gt;on&lt;/span&gt; &lt;span class="pl-c1"&gt;incident_changed&lt;/span&gt;.&lt;span class="pl-c1"&gt;column&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;columns&lt;/span&gt;.&lt;span class="pl-c1"&gt;id&lt;/span&gt;
&lt;span class="pl-k"&gt;where&lt;/span&gt; &lt;span class="pl-c1"&gt;incident_version&lt;/span&gt;.&lt;span class="pl-c1"&gt;_version&lt;/span&gt; &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;
&lt;span class="pl-k"&gt;group by&lt;/span&gt; &lt;span class="pl-c1"&gt;columns&lt;/span&gt;.&lt;span class="pl-c1"&gt;name&lt;/span&gt;
&lt;span class="pl-k"&gt;order by&lt;/span&gt; &lt;span class="pl-c1"&gt;count&lt;/span&gt;(&lt;span class="pl-k"&gt;*&lt;/span&gt;) &lt;span class="pl-k"&gt;desc&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;Updated: 1785&lt;/li&gt;
&lt;li&gt;PercentContained: 740&lt;/li&gt;
&lt;li&gt;ConditionStatement: 734&lt;/li&gt;
&lt;li&gt;AcresBurned: 616&lt;/li&gt;
&lt;li&gt;Started: 327&lt;/li&gt;
&lt;li&gt;PersonnelInvolved: 286&lt;/li&gt;
&lt;li&gt;Engines: 274&lt;/li&gt;
&lt;li&gt;CrewsInvolved: 256&lt;/li&gt;
&lt;li&gt;WaterTenders: 225&lt;/li&gt;
&lt;li&gt;Dozers: 211&lt;/li&gt;
&lt;li&gt;AirTankers: 181&lt;/li&gt;
&lt;li&gt;StructuresDestroyed: 125&lt;/li&gt;
&lt;li&gt;Helicopters: 122&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Helicopters are exciting! Let's find all of the fires which had at least one record where the number of helicopters changed (after the first version). We'll use a nested SQL query:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;select&lt;/span&gt; &lt;span class="pl-k"&gt;*&lt;/span&gt; &lt;span class="pl-k"&gt;from&lt;/span&gt; incident
&lt;span class="pl-k"&gt;where&lt;/span&gt; _id &lt;span class="pl-k"&gt;in&lt;/span&gt; (
  &lt;span class="pl-k"&gt;select&lt;/span&gt; _item &lt;span class="pl-k"&gt;from&lt;/span&gt; incident_version
  &lt;span class="pl-k"&gt;where&lt;/span&gt; _id &lt;span class="pl-k"&gt;in&lt;/span&gt; (
    &lt;span class="pl-k"&gt;select&lt;/span&gt; item_version &lt;span class="pl-k"&gt;from&lt;/span&gt; incident_changed &lt;span class="pl-k"&gt;where&lt;/span&gt; column &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;15&lt;/span&gt;
  )
  &lt;span class="pl-k"&gt;and&lt;/span&gt; _version &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;
)&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;That returned 19 fires that were significant enough to involve helicopters - &lt;a href="https://git-history-demos.datasette.io/ca-fires?sql=select+*+from+incident%0D%0Awhere+_id+in+%28%0D%0A++select+_item+from+incident_version%0D%0A++where+_id+in+%28%0D%0A++++select+item_version+from+incident_changed+where+column+%3D+15%0D%0A++%29%0D%0A++and+_version+%3E+1%0D%0A%29"&gt;here they are on a map&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/ca-fire-helicopter-map.png" alt="A map of 19 fires that involved helicopters" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;h4&gt;Advanced usage of --convert&lt;/h4&gt;
&lt;p&gt;Drew Breunig has been running a Git scraper for the past 8 months in &lt;a href="https://github.com/dbreunig/511-events-history"&gt;dbreunig/511-events-history&lt;/a&gt; against &lt;a href="https://511.org/"&gt;511.org&lt;/a&gt;, a site showing traffic incidents in the San Francisco Bay Area. I loaded his data into this example &lt;a href="https://git-history-demos.datasette.io/sf-bay-511"&gt;sf-bay-511 database&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;sf-bay-511&lt;/code&gt; example is useful for digging more into the &lt;code&gt;--convert&lt;/code&gt; option to &lt;code&gt;git-history&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;git-history&lt;/code&gt; requires recorded data to be in a specific shape: it needs a JSON list of JSON objects, where each object has a column that can be treated as a unique ID for purposes of tracking changes to that specific record over time.&lt;/p&gt;
&lt;p&gt;The ideal tracked JSON file would look something like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;[
  {
    &lt;span class="pl-ent"&gt;"IncidentID"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;abc123&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"Location"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Corner of 4th and Vermont&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"Type"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;fire&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
  },
  {
    &lt;span class="pl-ent"&gt;"IncidentID"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;cde448&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"Location"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;555 West Example Drive&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"Type"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;medical&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
  }
]&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It's common for data that has been scraped to not fit this ideal shape.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;511.org&lt;/code&gt; JSON feed &lt;a href="https://backend-prod.511.org/api-proxy/api/v1/traffic/events/?extended=true"&gt;can be found here&lt;/a&gt; - it's a pretty complicated nested set of objects, and there's a bunch of data in there that's quite noisy without adding much to the overall analysis - things like an &lt;code&gt;updated&lt;/code&gt; timestamp field that changes in every version even if there are no changes, or a deeply nested &lt;code&gt;"extension"&lt;/code&gt; object full of duplicate data.&lt;/p&gt;
&lt;p&gt;I wrote a snippet of Python to transform each of those recorded snapshots into a simpler structure, and then passed that Python code to the &lt;code&gt;--convert&lt;/code&gt; option to the script:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
git-history file sf-bay-511.db 511-events-history/events.json \
  --repo 511-events-history \
  --id id \
  --convert '
data = json.loads(content)
if data.get("error"):
    # {"code": 500, "error": "Error accessing remote data..."}
    return
for event in data["Events"]:
    event["id"] = event["extension"]["event-reference"]["event-identifier"]
    # Remove noisy updated timestamp
    del event["updated"]
    # Drop extension block entirely
    del event["extension"]
    # "schedule" block is noisy but not interesting
    del event["schedule"]
    # Flatten nested subtypes
    event["event_subtypes"] = event["event_subtypes"]["event_subtype"]
    if not isinstance(event["event_subtypes"], list):
        event["event_subtypes"] = [event["event_subtypes"]]
    yield event
'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The single-quoted string passed to &lt;code&gt;--convert&lt;/code&gt; is compiled into a Python function and run against each Git version in turn. My code loops through the nested &lt;code&gt;Events&lt;/code&gt; list, modifying each record and then outputting them as an iterable sequence using &lt;code&gt;yield&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;A few of the records in the history were server 500 errors, so the code block knows how to identify and skip those as well.&lt;/p&gt;
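&lt;p&gt;The compilation step can be sketched like this - just the general shape of the pattern, not git-history's actual implementation: indent the user's code into a function body, &lt;code&gt;exec&lt;/code&gt; it, and call the resulting generator function once per version:&lt;/p&gt;

```python
import json
import textwrap

# Sketch of the "compile a string into a function" pattern described
# above - the general shape only, not git-history's actual implementation.
def compile_convert(code):
    body = textwrap.indent(textwrap.dedent(code), "    ")
    source = "def convert(content):\n" + body
    namespace = {"json": json}
    exec(source, namespace)
    return namespace["convert"]

convert = compile_convert("""
data = json.loads(content)
for event in data["Events"]:
    yield {"id": event["id"]}
""")
events = list(convert('{"Events": [{"id": "a1"}, {"id": "b2"}]}'))
# events -> [{'id': 'a1'}, {'id': 'b2'}]
```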
&lt;p&gt;When working with &lt;code&gt;git-history&lt;/code&gt; I find myself spending most of my time iterating on these conversion scripts. Passing strings of Python code to tools like this is a pretty fun pattern - I also used it &lt;a href="https://simonwillison.net/2021/Aug/6/sqlite-utils-convert/"&gt;for sqlite-utils convert&lt;/a&gt; earlier this year.&lt;/p&gt;
&lt;h4&gt;Trying this out yourself&lt;/h4&gt;
&lt;p&gt;If you want to try this out for yourself the &lt;code&gt;git-history&lt;/code&gt; tool has &lt;a href="https://github.com/simonw/git-history/blob/main/README.md"&gt;an extensive README&lt;/a&gt; describing the other options, and the scripts used to create these demos can be found in the &lt;a href="https://github.com/simonw/git-history/tree/main/demos"&gt;demos folder&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/topics/git-scraping"&gt;git-scraping topic&lt;/a&gt; on GitHub now has over 200 repos built by dozens of different people - that's a lot of interesting scraped data sat there waiting to be explored!&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-history"&gt;git-history&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="cli"/><category term="data-journalism"/><category term="git"/><category term="projects"/><category term="scraping"/><category term="sqlite"/><category term="datasette"/><category term="git-history"/></entry><entry><title>Commits are snapshots, not diffs</title><link href="https://simonwillison.net/2020/Dec/17/commits-are-snapshots-not-diffs/#atom-tag" rel="alternate"/><published>2020-12-17T22:01:39+00:00</published><updated>2020-12-17T22:01:39+00:00</updated><id>https://simonwillison.net/2020/Dec/17/commits-are-snapshots-not-diffs/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.blog/2020-12-17-commits-are-snapshots-not-diffs/"&gt;Commits are snapshots, not diffs&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Useful, clearly explained revision of some Git fundamentals.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=25458230"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;&lt;/p&gt;



</summary><category term="git"/><category term="github"/></entry><entry><title>nyt-2020-election-scraper</title><link href="https://simonwillison.net/2020/Nov/6/nyt-2020-election-scraper/#atom-tag" rel="alternate"/><published>2020-11-06T14:24:36+00:00</published><updated>2020-11-06T14:24:36+00:00</updated><id>https://simonwillison.net/2020/Nov/6/nyt-2020-election-scraper/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/alex/nyt-2020-election-scraper"&gt;nyt-2020-election-scraper&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Brilliant application of git scraping by Alex Gaynor and a growing team of contributors. Takes a JSON snapshot of the NYT’s latest election poll figures every five minutes, then runs a Python script to iterate through the history and build an HTML page showing the trends, including what percentage of the remaining votes each candidate needs to win each state. This is the perfect case study in why it can be useful to take a “snapshot of the world right now” data source and turn it into a git revision history over time.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/alex-gaynor"&gt;alex-gaynor&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/elections"&gt;elections&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/new-york-times"&gt;new-york-times&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;&lt;/p&gt;



</summary><category term="alex-gaynor"/><category term="data-journalism"/><category term="elections"/><category term="git"/><category term="new-york-times"/><category term="git-scraping"/></entry><entry><title>Git scraping: track changes over time by scraping to a Git repository</title><link href="https://simonwillison.net/2020/Oct/9/git-scraping/#atom-tag" rel="alternate"/><published>2020-10-09T18:27:23+00:00</published><updated>2020-10-09T18:27:23+00:00</updated><id>https://simonwillison.net/2020/Oct/9/git-scraping/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;Git scraping&lt;/strong&gt; is the name I've given a scraping technique that I've been experimenting with for a few years now. It's really effective, and more people should use it.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update 5th March 2021:&lt;/strong&gt; I presented a version of this post as &lt;a href="https://simonwillison.net/2021/Mar/5/git-scraping/"&gt;a five minute lightning talk at NICAR 2021&lt;/a&gt;, which includes a live coding demo of building a new git scraper.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update 7th December 2021:&lt;/strong&gt; I released a tool called &lt;a href="https://simonwillison.net/2021/Dec/7/git-history/"&gt;git-history&lt;/a&gt; that helps analyze data that has been collected using this technique.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The internet is full of interesting data that changes over time. These changes can sometimes be more interesting than the underlying static data - The &lt;a href="https://twitter.com/nyt_diff"&gt;@nyt_diff Twitter account&lt;/a&gt; tracks changes made to New York Times headlines for example, which offers a fascinating insight into that publication's editorial process.&lt;/p&gt;
&lt;p&gt;We already have a great tool for efficiently tracking changes to text over time: &lt;strong&gt;Git&lt;/strong&gt;. And &lt;a href="https://github.com/features/actions"&gt;GitHub Actions&lt;/a&gt; (and other CI systems) make it easy to create a scraper that runs every few minutes, records the current state of a resource and records changes to that resource over time in the commit history.&lt;/p&gt;
&lt;p&gt;Here's a recent example. Fires continue to rage in California, and the &lt;a href="https://www.fire.ca.gov/"&gt;CAL FIRE website&lt;/a&gt; offers an &lt;a href="https://www.fire.ca.gov/incidents/"&gt;incident map&lt;/a&gt; showing the latest fire activity around the state.&lt;/p&gt;
&lt;p&gt;Firing up the Firefox Network pane, filtering to requests triggered by XHR and sorting by size, largest first reveals this endpoint:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents"&gt;https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;That's a 241KB JSON endpoint with full details of the various fires around the state.&lt;/p&gt;
&lt;p&gt;So... I started running a git scraper against it. My scraper lives in the &lt;a href="https://github.com/simonw/ca-fires-history"&gt;simonw/ca-fires-history&lt;/a&gt; repository on GitHub.&lt;/p&gt;
&lt;p&gt;Every 20 minutes it grabs the latest copy of that JSON endpoint, pretty-prints it (for diff readability) using &lt;code&gt;jq&lt;/code&gt; and commits it back to the repo if it has changed.&lt;/p&gt;
&lt;p&gt;This means I now have a &lt;a href="https://github.com/simonw/ca-fires-history/commits/main"&gt;commit log&lt;/a&gt; of changes to that information about fires in California. Here's an &lt;a href="https://github.com/simonw/ca-fires-history/commit/7b0f42d4bf198885ab2b41a22a8da47157572d18"&gt;example commit&lt;/a&gt; showing that last night the Zogg Fires percentage contained increased from 90% to 92%, the number of personnel involved dropped from 968 to 798 and the number of engines responding dropped from 82 to 59.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2020/git-scraping.png" alt="Screenshot of a diff against the Zogg Fires, showing personnel involved dropping from 968 to 798, engines dropping 82 to 59, water tenders dropping 31 to 27 and percent contained increasing from 90 to 92." style="max-width: 100%" /&gt;&lt;/p&gt;
&lt;p&gt;The implementation of the scraper is entirely contained in a single GitHub Actions workflow. It's in a file called &lt;a href="https://github.com/simonw/ca-fires-history/blob/main/.github/workflows/scrape.yml"&gt;.github/workflows/scrape.yml&lt;/a&gt; which looks like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Scrape latest data&lt;/span&gt;

&lt;span class="pl-ent"&gt;on&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;push&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;workflow_dispatch&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;schedule&lt;/span&gt;:
    - &lt;span class="pl-ent"&gt;cron&lt;/span&gt;:  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;6,26,46 * * * *&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

&lt;span class="pl-ent"&gt;jobs&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;scheduled&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;runs-on&lt;/span&gt;: &lt;span class="pl-s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="pl-ent"&gt;steps&lt;/span&gt;:
    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Check out this repo&lt;/span&gt;
      &lt;span class="pl-ent"&gt;uses&lt;/span&gt;: &lt;span class="pl-s"&gt;actions/checkout@v2&lt;/span&gt;
    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Fetch latest data&lt;/span&gt;
      &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;        curl https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents | jq . &amp;gt; incidents.json&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Commit and push if it changed&lt;/span&gt;
      &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;        git config user.name "Automated"&lt;/span&gt;
&lt;span class="pl-s"&gt;        git config user.email "actions@users.noreply.github.com"&lt;/span&gt;
&lt;span class="pl-s"&gt;        git add -A&lt;/span&gt;
&lt;span class="pl-s"&gt;        timestamp=$(date -u)&lt;/span&gt;
&lt;span class="pl-s"&gt;        git commit -m "Latest data: ${timestamp}" || exit 0&lt;/span&gt;
&lt;span class="pl-s"&gt;        git push&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;That's not a lot of code!&lt;/p&gt;
&lt;p&gt;It runs on a schedule at 6, 26 and 46 minutes past the hour - I like to offset my cron times like this since I assume that the majority of crons run exactly on the hour, so running not-on-the-hour feels polite.&lt;/p&gt;
&lt;p&gt;The scraper itself works by fetching the JSON using &lt;code&gt;curl&lt;/code&gt;, piping it through &lt;code&gt;jq .&lt;/code&gt; to pretty-print it and saving the result to &lt;code&gt;incidents.json&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The "commit and push if it changed" block uses a pattern that commits and pushes only if the file has changed. I wrote about this pattern in &lt;a href="https://til.simonwillison.net/til/til/github-actions_commit-if-file-changed.md"&gt;this TIL&lt;/a&gt; a few months ago.&lt;/p&gt;
&lt;p&gt;I have a whole bunch of repositories running git scrapers now. I've been labeling them with the &lt;a href="https://github.com/topics/git-scraping"&gt;git-scraping topic&lt;/a&gt; so they show up in one place on GitHub (other people have started using that topic as well).&lt;/p&gt;
&lt;p&gt;I've written about some of these &lt;a href="https://simonwillison.net/tags/gitscraping/"&gt;in the past&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2017/Sep/10/scraping-irma/"&gt;Scraping hurricane Irma&lt;/a&gt; back in September 2017 is when I first came up with the idea to use a Git repository in this way.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2017/Oct/10/fires-in-the-north-bay/"&gt;Changelogs to help understand the fires in the North Bay&lt;/a&gt; from October 2017 describes an early attempt at scraping fire-related information.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2019/Mar/13/tree-history/"&gt;Generating a commit log for San Francisco’s official list of trees&lt;/a&gt; remains my favourite application of this technique. The City of San Francisco maintains a frequently updated CSV file of 190,000 trees in the city, and I have &lt;a href="https://github.com/simonw/sf-tree-history/find/master"&gt;a commit log&lt;/a&gt; of changes to it stretching back over more than a year. This example uses my &lt;a href="https://github.com/simonw/csv-diff"&gt;csv-diff&lt;/a&gt; utility to generate human-readable commit messages.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2019/Oct/10/pge-outages/"&gt;Tracking PG&amp;amp;E outages by scraping to a git repo&lt;/a&gt; documents my attempts to track the impact of PG&amp;amp;E's outages last year by scraping their outage map. I used the GitPython library to turn the values recorded in the commit history into a database that let me run visualizations of changes over time.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2020/Jan/21/github-actions-cloud-run/"&gt;Tracking FARA by deploying a data API using GitHub Actions and Cloud Run&lt;/a&gt; shows how I track new registrations for the US Foreign Agents Registration Act (FARA) in a repository and deploy the latest version of the data using Datasette.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I hope that by giving this technique a name I can encourage more people to add it to their toolbox. It's an extremely effective way of turning all sorts of interesting data sources into a changelog over time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://news.ycombinator.com/item?id=24732943"&gt;Comment thread&lt;/a&gt; on this post over on Hacker News.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="git"/><category term="github"/><category term="projects"/><category term="scraping"/><category term="github-actions"/><category term="git-scraping"/></entry><entry><title>Weeknotes: Rocky Beaches, Datasette 0.48, a commit history of my database</title><link href="https://simonwillison.net/2020/Aug/21/weeknotes-rocky-beaches/#atom-tag" rel="alternate"/><published>2020-08-21T00:52:16+00:00</published><updated>2020-08-21T00:52:16+00:00</updated><id>https://simonwillison.net/2020/Aug/21/weeknotes-rocky-beaches/#atom-tag</id><summary type="html">
    &lt;p&gt;This week I helped Natalie launch &lt;a href="https://www.rockybeaches.com/"&gt;Rocky Beaches&lt;/a&gt;, shipped Datasette 0.48 and several releases of &lt;code&gt;datasette-graphql&lt;/code&gt;, upgraded the CSRF protection for &lt;code&gt;datasette-upload-csvs&lt;/code&gt; and figured out how to get a commit log of changes to my blog by backing up its database to a GitHub repository.&lt;/p&gt;
&lt;h4 id="rocky-beaches"&gt;Rocky Beaches&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://twitter.com/natbat"&gt;Natalie&lt;/a&gt; released the first version of &lt;a href="https://www.rockybeaches.com/"&gt;rockybeaches.com&lt;/a&gt; this week. It's a site that helps you find places to go tidepooling (known as rockpooling in the UK) and figure out the best times to go based on low tide times.&lt;/p&gt;

&lt;p&gt;&lt;img style="max-width: 100%" src="https://static.simonwillison.net/static/2020/Rocky_Beaches__Pillar_Point_Harbor_CA.jpg" alt="Screenshot of the Pillar Point page for Rocky Beaches" /&gt;&lt;/p&gt;

&lt;p&gt;I helped out with the backend for the site, mainly as an excuse to further explore the idea of using Datasette to power full websites (previously explored with &lt;a href="https://simonwillison.net/2019/Nov/25/niche-museums/"&gt;Niche Museums&lt;/a&gt; and &lt;a href="https://simonwillison.net/2020/Apr/20/self-rewriting-readme/"&gt;my TILs&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;The site uses a pattern I've been really enjoying: it's essentially a static dynamic site. Pages are dynamically rendered by Datasette using Jinja templates and a SQLite database, but the database itself is treated as a static asset: it's built at deploy time by &lt;a href="https://github.com/natbat/rockybeaches/blob/main/.github/workflows/deploy.yml"&gt;this GitHub Actions workflow&lt;/a&gt; and deployed (currently to &lt;a href="https://www.vercel.com/"&gt;Vercel&lt;/a&gt;) as a binary asset along with the code.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/natbat/rockybeaches/blob/main/scripts/build.sh"&gt;build script&lt;/a&gt; uses &lt;a href="https://github.com/simonw/yaml-to-sqlite"&gt;yaml-to-sqlite&lt;/a&gt; to load two YAML files - &lt;a href="https://github.com/natbat/rockybeaches/blob/4127c0f0539178664cefed4aca00db2b5c00c855/data/places.yml"&gt;places.yml&lt;/a&gt; and &lt;a href="https://github.com/natbat/rockybeaches/blob/4127c0f0539178664cefed4aca00db2b5c00c855/data/stations.yml"&gt;stations.yml&lt;/a&gt; - and create the &lt;code&gt;stations&lt;/code&gt; and &lt;code&gt;places&lt;/code&gt; database tables.&lt;/p&gt;
&lt;p&gt;It then runs two custom Python scripts to fetch relevant data for those places from &lt;a href="https://www.inaturalist.org/"&gt;iNaturalist&lt;/a&gt; and the &lt;a href="https://tidesandcurrents.noaa.gov/web_services_info.html"&gt;NOAA Tides &amp;amp; Currents API&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The data all ends up in the Datasette instance that powers the site - you can browse it at &lt;a href="http://www.rockybeaches.com/data"&gt;www.rockybeaches.com/data&lt;/a&gt; or interact with it using the GraphQL API at &lt;a href="http://www.rockybeaches.com/graphql"&gt;www.rockybeaches.com/graphql&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The code is a little convoluted at the moment - I'm still iterating towards the best patterns for building websites like this using Datasette - but I'm very pleased with the productivity and performance that this approach produced.&lt;/p&gt;
&lt;h4 id="datasette-048"&gt;Datasette 0.48&lt;/h4&gt;
&lt;p&gt;Highlights from &lt;a href="https://docs.datasette.io/en/stable/changelog.html#v0-48"&gt;Datasette 0.48&lt;/a&gt; release notes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Datasette documentation now lives at &lt;a href="https://docs.datasette.io/"&gt;docs.datasette.io&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;extra_template_vars&lt;/code&gt;, &lt;code&gt;extra_css_urls&lt;/code&gt;, &lt;code&gt;extra_js_urls&lt;/code&gt; and &lt;code&gt;extra_body_script&lt;/code&gt; plugin hooks now all accept the same arguments. See &lt;a href="https://docs.datasette.io/en/stable/plugin_hooks.html#plugin-hook-extra-template-vars"&gt;extra_template_vars(template, database, table, columns, view_name, request, datasette)&lt;/a&gt; for details. (&lt;a href="https://github.com/simonw/datasette/issues/939"&gt;#939&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Those hooks now accept a new &lt;code&gt;columns&lt;/code&gt; argument detailing the table columns that will be rendered on that page. (&lt;a href="https://github.com/simonw/datasette/issues/938"&gt;#938&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I released a new version of &lt;a href="https://github.com/simonw/datasette-cluster-map"&gt;datasette-cluster-map&lt;/a&gt; that takes advantage of the new &lt;code&gt;columns&lt;/code&gt; argument to only inject Leaflet maps JavaScript onto the page if the table being rendered includes latitude and longitude columns - previously the plugin would load extra code on pages that weren't going to render a map at all. That's now running on &lt;a href="https://global-power-plants.datasettes.com/"&gt;https://global-power-plants.datasettes.com/&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="datasette-graphql"&gt;datasette-graphql&lt;/h4&gt;
&lt;p&gt;Using &lt;a href="https://github.com/simonw/datasette-graphql"&gt;datasette-graphql&lt;/a&gt; for Rocky Beaches inspired me to add two new features:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A new &lt;code&gt;graphql()&lt;/code&gt; Jinja custom template function that lets you execute custom GraphQL queries inside a Datasette template page - which turns out to be a pretty elegant way for the template to load exactly the data that it needs in order to render the page. Here's &lt;a href="https://github.com/natbat/rockybeaches/blob/70039f18b3d3823a4f069deca513e950a3aaba4f/templates/row-data-places.html#L29-L46"&gt;how Rocky Beaches uses that&lt;/a&gt;. &lt;a href="https://github.com/simonw/datasette-graphql/issues/50"&gt;Issue 50&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Some of the iNaturalist data that Rocky Beaches uses is stored as JSON data in text columns in SQLite - mainly because I was too lazy to model it out as tables. This was coming out of the GraphQL API as strings-containing-JSON, so I added a &lt;code&gt;json_columns&lt;/code&gt; plugin configuration mechanism for turning those into Graphene &lt;code&gt;GenericScalar&lt;/code&gt; fields - see &lt;a href="https://github.com/simonw/datasette-graphql/issues/53"&gt;issue 53&lt;/a&gt; for details.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I also landed a big performance improvement. The plugin works by introspecting the database and generating a GraphQL schema that represents its tables, columns and views. For databases with a lot of tables this can get expensive, and the introspection was being run on every request.&lt;/p&gt;
&lt;p&gt;I didn't want to require a server restart any time the schema changed, so caching the schema in-memory indefinitely wasn't an option. Ideally the schema would be cached, with the cache invalidated any time the schema itself changed.&lt;/p&gt;
&lt;p&gt;It turns out SQLite has a mechanism for this: the &lt;code&gt;PRAGMA schema_version&lt;/code&gt; statement, which returns an integer version number that changes any time the underlying schema is changed (e.g. a table is added or modified).&lt;/p&gt;
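&lt;p&gt;Here's a self-contained illustration of that mechanism using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module - a simplified sketch of the caching idea, not the plugin's actual code:&lt;/p&gt;

```python
import sqlite3

# SQLite bumps PRAGMA schema_version every time the schema changes,
# which makes it a handy cache key for anything derived from the schema.
conn = sqlite3.connect(":memory:")
version_before = conn.execute("PRAGMA schema_version").fetchone()[0]
conn.execute("CREATE TABLE places (id INTEGER PRIMARY KEY, name TEXT)")
version_after = conn.execute("PRAGMA schema_version").fetchone()[0]
assert version_after > version_before  # the CREATE TABLE bumped the version

# Cache pattern: only rebuild the expensive derived value (here, a list
# of table names standing in for the GraphQL schema) when the version moves.
cache = {}

def get_schema(conn):
    version = conn.execute("PRAGMA schema_version").fetchone()[0]
    if version not in cache:
        cache.clear()  # old versions are stale, drop them
        cache[version] = [
            row[0]
            for row in conn.execute(
                "select name from sqlite_master where type = 'table'"
            )
        ]
    return cache[version]

print(get_schema(conn))  # ['places']
```

&lt;p&gt;Subsequent calls to &lt;code&gt;get_schema()&lt;/code&gt; hit the cache until another DDL statement changes the schema version.&lt;/p&gt;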
&lt;p&gt;I built a quick &lt;a href="https://github.com/simonw/datasette-schema-versions"&gt;datasette-schema-versions&lt;/a&gt; plugin to try this feature out (in less than twenty minutes thanks to my &lt;a href="https://simonwillison.net/2020/Jun/20/cookiecutter-plugins/"&gt;datasette-plugin cookiecutter template&lt;/a&gt;) and prove to myself that it works. Then I built a caching mechanism for &lt;code&gt;datasette-graphql&lt;/code&gt; that uses the current &lt;code&gt;schema_version&lt;/code&gt; as the cache key. See &lt;a href="https://github.com/simonw/datasette-graphql/issues/51"&gt;issue 51&lt;/a&gt; for details.&lt;/p&gt;
&lt;h4 id="asgi-csrf-and-datasette-upload-csvs"&gt;asgi-csrf and datasette-upload-csvs&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/datasette-upload-csvs"&gt;datasette-upload-csvs&lt;/a&gt; is a Datasette plugin that adds a form for uploading CSV files and converting them to SQLite tables.&lt;/p&gt;
&lt;p&gt;Datasette 0.44 &lt;a href="https://docs.datasette.io/en/latest/changelog.html#csrf-protection"&gt;added CSRF protection&lt;/a&gt;, which broke the plugin. I fixed that this week, but it took some extra work because file uploads use the &lt;code&gt;multipart/form-data&lt;/code&gt; HTTP mechanism and my &lt;a href="https://github.com/simonw/asgi-csrf"&gt;asgi-csrf&lt;/a&gt; library didn't support that.&lt;/p&gt;
&lt;p&gt;I &lt;a href="https://github.com/simonw/asgi-csrf/issues/1"&gt;fixed that&lt;/a&gt;, but the resulting code was quite complicated. Since &lt;code&gt;asgi-csrf&lt;/code&gt; is a security library I decided to aim for 100% code coverage, the first time I've done that for one of my projects.&lt;/p&gt;
&lt;p&gt;I got there with the help of codecov.io and &lt;a href="https://pypi.org/project/pytest-cov/"&gt;pytest-cov&lt;/a&gt;. I wrote up what I learned about those tools in &lt;a href="https://github.com/simonw/til/blob/main/pytest/pytest-code-coverage.md"&gt;a TIL&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="backing-up-my-blog-database-to-a-github-repository"&gt;Backing up my blog database to a GitHub repository&lt;/h4&gt;
&lt;p&gt;I really like keeping content in a git repository (see Rocky Beaches and Niche Museums). Every content management system I've ever worked on has eventually needed revision control, and modeling that in a database and retrofitting it onto an existing project is always a huge pain.&lt;/p&gt;
&lt;p&gt;I have 18 years of content on this blog. I want that backed up to git - and this week I realized I have the tools to do that already.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/db-to-sqlite"&gt;db-to-sqlite&lt;/a&gt; is my tool for taking any SQL Alchemy supported database (so far tested with MySQL and PostgreSQL) and exporting it into a SQLite database.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/sqlite-diffable"&gt;sqlite-diffable&lt;/a&gt; is a very early stage tool I built last year. The idea is to dump a SQLite database out to disk in a way that is designed to work well with git diffs. Each table is dumped out as newline-delimited JSON, one row per line.&lt;/p&gt;
&lt;p&gt;So... how about converting my blog's PostgreSQL database to SQLite, then dumping it to disk with &lt;code&gt;sqlite-diffable&lt;/code&gt; and committing the result to a git repository? And then running that in a GitHub Action?&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://github.com/simonw/simonwillisonblog-backup/blob/main/.github/workflows/backup.yml"&gt;the workflow&lt;/a&gt;. It does exactly that, with a few extra steps: it only grabs a subset of my tables, and it redacts the &lt;code&gt;password&lt;/code&gt; column from my &lt;code&gt;auth_user&lt;/code&gt; table so that my hashed password isn't exposed in the backup.&lt;/p&gt;
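&lt;p&gt;The redaction step is worth illustrating. A hedged sketch of the idea, using in-memory SQLite databases and a table shaped like Django's &lt;code&gt;auth_user&lt;/code&gt; (the actual workflow handles this during the export itself):&lt;/p&gt;

```python
import sqlite3

# Pretend source database with a sensitive column
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE auth_user (id INTEGER PRIMARY KEY, username TEXT, password TEXT)"
)
source.execute("INSERT INTO auth_user VALUES (1, 'simonw', 'pbkdf2_sha256$hash')")

# Copy rows into the backup, replacing the password column so hashed
# passwords never reach the public repository
backup = sqlite3.connect(":memory:")
backup.execute(
    "CREATE TABLE auth_user (id INTEGER PRIMARY KEY, username TEXT, password TEXT)"
)
for user_id, username, _password in source.execute("select * from auth_user"):
    backup.execute(
        "INSERT INTO auth_user VALUES (?, ?, ?)",
        (user_id, username, "REDACTED"),
    )

print(backup.execute("select password from auth_user").fetchone()[0])  # REDACTED
```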
&lt;p&gt;I now have &lt;a href="https://github.com/simonw/simonwillisonblog-backup/commits/main"&gt;a commit log&lt;/a&gt; of changes to my blog's database!&lt;/p&gt;
&lt;p&gt;I've set it to run nightly, but I can trigger it manually by clicking a button too.&lt;/p&gt;
&lt;h4 id="til-this-week-46"&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/til/blob/main/readthedocs/custom-subdomain.md"&gt;Pointing a custom subdomain at Read the Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/til/blob/main/pytest/pytest-code-coverage.md"&gt;Code coverage using pytest and codecov.io&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/til/blob/main/readthedocs/readthedocs-search-api.md"&gt;Read the Docs Search API&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/til/blob/main/heroku/programatic-access-postgresql.md"&gt;Programatically accessing Heroku PostgreSQL from GitHub Actions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/til/blob/main/macos/find-largest-sqlite.md"&gt;Finding the largest SQLite files on a Mac&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/til/blob/main/github-actions/grep-tests.md"&gt;Using grep to write tests in CI&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="releases-this-week-46"&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/datasette-graphql/releases/tag/0.14"&gt;datasette-graphql 0.14&lt;/a&gt; - 2020-08-20&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/datasette-graphql/releases/tag/0.13"&gt;datasette-graphql 0.13&lt;/a&gt; - 2020-08-19&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/datasette-schema-versions/releases/tag/0.1"&gt;datasette-schema-versions 0.1&lt;/a&gt; - 2020-08-19&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/datasette-graphql/releases/tag/0.12.3"&gt;datasette-graphql 0.12.3&lt;/a&gt; - 2020-08-19&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/dogsheep/github-to-sqlite/releases/tag/2.5"&gt;github-to-sqlite 2.5&lt;/a&gt; - 2020-08-18&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/datasette-publish-vercel/releases/tag/0.8"&gt;datasette-publish-vercel 0.8&lt;/a&gt; - 2020-08-17&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/datasette-cluster-map/releases/tag/0.12"&gt;datasette-cluster-map 0.12&lt;/a&gt; - 2020-08-16&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/datasette/releases/tag/0.48"&gt;datasette 0.48&lt;/a&gt; - 2020-08-16&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/datasette-graphql/releases/tag/0.12.2"&gt;datasette-graphql 0.12.2&lt;/a&gt; - 2020-08-16&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/datasette-saved-queries/releases/tag/0.2.1"&gt;datasette-saved-queries 0.2.1&lt;/a&gt; - 2020-08-15&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/datasette/releases/tag/0.47.3"&gt;datasette 0.47.3&lt;/a&gt; - 2020-08-15&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/datasette-upload-csvs/releases/tag/0.5"&gt;datasette-upload-csvs 0.5&lt;/a&gt; - 2020-08-15&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/asgi-csrf/releases/tag/0.7"&gt;asgi-csrf 0.7&lt;/a&gt; - 2020-08-15&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/asgi-csrf/releases/tag/0.7a0"&gt;asgi-csrf 0.7a0&lt;/a&gt; - 2020-08-15&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/asgi-csrf/releases/tag/0.7a0"&gt;asgi-csrf 0.7a0&lt;/a&gt; - 2020-08-15&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/datasette-cluster-map/releases/tag/0.11.1"&gt;datasette-cluster-map 0.11.1&lt;/a&gt; - 2020-08-14&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/datasette-cluster-map/releases/tag/0.11"&gt;datasette-cluster-map 0.11&lt;/a&gt; - 2020-08-14&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/datasette-graphql/releases/tag/0.12.1"&gt;datasette-graphql 0.12.1&lt;/a&gt; - 2020-08-13&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csrf"&gt;csrf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/databases"&gt;databases&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/natalie-downe"&gt;natalie-downe&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/graphql"&gt;graphql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/inaturalist"&gt;inaturalist&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="csrf"/><category term="databases"/><category term="git"/><category term="github"/><category term="natalie-downe"/><category term="projects"/><category term="graphql"/><category term="datasette"/><category term="inaturalist"/><category term="weeknotes"/></entry><entry><title>Weeknotes: cookiecutter templates, better plugin documentation, sqlite-generate</title><link href="https://simonwillison.net/2020/Jun/26/weeknotes-plugins-sqlite-generate/#atom-tag" rel="alternate"/><published>2020-06-26T01:39:50+00:00</published><updated>2020-06-26T01:39:50+00:00</updated><id>https://simonwillison.net/2020/Jun/26/weeknotes-plugins-sqlite-generate/#atom-tag</id><summary type="html">
    &lt;p&gt;I spent this week spreading myself between a bunch of smaller projects, and finally getting familiar with &lt;a href="https://cookiecutter.readthedocs.io/"&gt;cookiecutter&lt;/a&gt;. I wrote about &lt;a href="https://simonwillison.net/2020/Jun/20/cookiecutter-plugins/"&gt;my datasette-plugin cookiecutter template&lt;/a&gt; earlier in the week; here's what else I've been working on.&lt;/p&gt;

&lt;h4 id="sqlite-generate"&gt;sqlite-generate&lt;/h4&gt;

&lt;p&gt;Datasette is supposed to work against any SQLite database you throw at it, no matter how weird the schema or how unwieldy the database shape or size.&lt;/p&gt;

&lt;p&gt;I built a new tool called &lt;a href="https://github.com/simonw/sqlite-generate"&gt;sqlite-generate&lt;/a&gt; this week to help me create databases of different shapes. It's a Python command-line tool which uses &lt;a href="https://faker.readthedocs.io/"&gt;Faker&lt;/a&gt; to populate a new database with random data. You run it something like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;sqlite-generate demo.db \
    --tables=20 \
    --rows=100,500 \
    --columns=5,20 \
    --fks=0,3 \
    --pks=0,2 \
    --fts&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This command creates a database containing 20 tables, each with between 100 and 500 rows and 5-20 columns. Each table will also have between 0 and 3 foreign key columns to other tables, and will feature between 0 and 2 primary key columns. SQLite full-text search will be configured against all of the text columns in the table.&lt;/p&gt;
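&lt;p&gt;The core loop is simple: pick a random shape for each table, then fill it with random values. Here's a toy version using only the standard library - the real tool uses Faker for realistic data and also handles foreign keys, compound primary keys and FTS, so treat the names and options here as illustrative:&lt;/p&gt;

```python
import random
import sqlite3
import string

def generate_database(conn, tables=3, rows=(5, 10), columns=(2, 4), seed=42):
    """Create randomly shaped tables filled with random text values."""
    rng = random.Random(seed)
    for i in range(tables):
        n_columns = rng.randint(*columns)
        column_names = [f"col_{j}" for j in range(n_columns)]
        column_sql = ", ".join(f"{name} TEXT" for name in column_names)
        conn.execute(
            f"CREATE TABLE table_{i} (id INTEGER PRIMARY KEY, {column_sql})"
        )
        placeholders = ", ".join("?" for _ in column_names)
        for _ in range(rng.randint(*rows)):
            values = [
                "".join(rng.choices(string.ascii_lowercase, k=8))
                for _ in column_names
            ]
            conn.execute(
                f"INSERT INTO table_{i} ({', '.join(column_names)}) "
                f"VALUES ({placeholders})",
                values,
            )

conn = sqlite3.connect(":memory:")
generate_database(conn)
print([r[0] for r in conn.execute(
    "select name from sqlite_master where type = 'table'"
)])
```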

&lt;p&gt;I always try to include a live demo with any of my projects, and &lt;code&gt;sqlite-generate&lt;/code&gt; is no exception. &lt;a href="https://github.com/simonw/sqlite-generate/blob/main/.github/workflows/demo.yml"&gt;This GitHub Action&lt;/a&gt; runs on every push to main and deploys a demo to &lt;a href="https://sqlite-generate-demo.datasette.io/"&gt;https://sqlite-generate-demo.datasette.io/&lt;/a&gt; showing the latest version of the code in action.&lt;/p&gt;

&lt;p&gt;The demo runs my &lt;a href="https://github.com/simonw/datasette-search-all"&gt;datasette-search-all&lt;/a&gt; plugin in order to more easily demonstrate full-text search across all of the text columns in the generated tables. Try searching for &lt;a href="https://sqlite-generate-demo.datasette.io/-/search?q=newspaper"&gt;newspaper&lt;/a&gt;.&lt;/p&gt;

&lt;h4 id="click-app"&gt;click-app cookiecutter template&lt;/h4&gt;

&lt;p&gt;I write quite a lot of &lt;a href="https://click.palletsprojects.com/"&gt;Click&lt;/a&gt; powered command-line tools like this one, so inspired by &lt;a href="https://github.com/simonw/datasette-plugin"&gt;datasette-plugin&lt;/a&gt; I created a new &lt;a href="https://github.com/simonw/click-app"&gt;click-app&lt;/a&gt; cookiecutter template that bakes in my own preferences about how to set up a new Click project (complete with GitHub Actions). &lt;code&gt;sqlite-generate&lt;/code&gt; is the first tool I've built using that template.&lt;/p&gt;

&lt;h4 id="improved-plugin-docs"&gt;Improved Datasette plugin documentation&lt;/h4&gt;

&lt;p&gt;I've split Datasette's plugin documentation into five separate pages, and added a new page to the documentation about patterns for testing plugins.&lt;/p&gt;

&lt;p&gt;The five pages are:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;&lt;a href="https://datasette.readthedocs.io/en/latest/plugins.html"&gt;Plugins&lt;/a&gt; describing how to install and configure plugins&lt;/li&gt;&lt;li&gt;&lt;a href="https://datasette.readthedocs.io/en/latest/writing_plugins.html"&gt;Writing plugins&lt;/a&gt; showing how to write one-off plugins, how to use the &lt;code&gt;datasette-plugin&lt;/code&gt; cookiecutter template and how to package templates for release to PyPI&lt;/li&gt;&lt;li&gt;&lt;a href="https://datasette.readthedocs.io/en/latest/plugin_hooks.html"&gt;Plugin hooks&lt;/a&gt; documenting all of the available plugin hooks&lt;/li&gt;&lt;li&gt;&lt;a href="https://datasette.readthedocs.io/en/latest/testing_plugins.html"&gt;Testing plugins&lt;/a&gt; describing my preferred patterns for writing tests for them (using &lt;a href="https://docs.pytest.org/"&gt;pytest&lt;/a&gt; and &lt;a href="https://www.python-httpx.org/"&gt;HTTPX&lt;/a&gt;)&lt;/li&gt;&lt;li&gt;&lt;a href="https://datasette.readthedocs.io/en/latest/internals.html"&gt;Internals for plugins&lt;/a&gt; describing the APIs Datasette makes available for use within plugin hook implementations&lt;/li&gt;&lt;/ul&gt;

&lt;p&gt;There's also a &lt;a href="https://datasette.readthedocs.io/en/latest/ecosystem.html#datasette-plugins"&gt;list of available plugins&lt;/a&gt; on the Datasette Ecosystem page of the documentation, though I plan to move those to a separate plugin directory in the future.&lt;/p&gt;

&lt;h4 id="datasette-block-robots"&gt;datasette-block-robots&lt;/h4&gt;

&lt;p&gt;The &lt;a href="https://github.com/simonw/datasette-plugin"&gt;datasette-plugin&lt;/a&gt; template practically eliminates the friction involved in starting a new plugin.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sqlite-generate&lt;/code&gt; generates random names for people. I don't particularly want people who search for their own names stumbling across the live demo and being weirded out by their name featured there, so I decided to block it from search engine crawlers using &lt;code&gt;robots.txt&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I wrote a tiny plugin to do this: &lt;a href="https://github.com/simonw/datasette-block-robots"&gt;datasette-block-robots&lt;/a&gt;, which uses the new &lt;a href="https://datasette.readthedocs.io/en/latest/plugin_hooks.html#register-routes"&gt;register_routes() plugin hook&lt;/a&gt; to add a &lt;code&gt;/robots.txt&lt;/code&gt; page.&lt;/p&gt;

&lt;p&gt;It's also a neat example of the &lt;a href="https://github.com/simonw/datasette-block-robots/blob/main/datasette_block_robots/__init__.py"&gt;simplest possible plugin&lt;/a&gt; to use that feature - along with the &lt;a href="https://github.com/simonw/datasette-block-robots/blob/main/tests/test_block_robots.py"&gt;simplest possible unit test&lt;/a&gt; for exercising such a page.&lt;/p&gt;

&lt;h4 id="datasette-saved-queries"&gt;datasette-saved-queries&lt;/h4&gt;

&lt;p&gt;Another new plugin, this time with a bit more substance to it. &lt;a href="https://github.com/simonw/datasette-saved-queries"&gt;datasette-saved-queries&lt;/a&gt; exercises the new &lt;a href="https://datasette.readthedocs.io/en/latest/plugin_hooks.html#canned-queries-datasette-database-actor"&gt;canned_queries()&lt;/a&gt; hook I &lt;a href="https://simonwillison.net/2020/Jun/19/datasette-alphas/"&gt;described last week&lt;/a&gt;. It uses the new &lt;a href="https://datasette.readthedocs.io/en/latest/plugin_hooks.html#startup-datasette"&gt;startup()&lt;/a&gt; hook to create tables on startup (if they are missing), then lets users insert records into those tables to save their own queries. Queries saved in this way are then returned as canned queries for that particular database.&lt;/p&gt;

&lt;h4 id="main-not-master"&gt;main, not master&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;main&lt;/code&gt; is a better name for the main GitHub branch than &lt;code&gt;master&lt;/code&gt;, which has unpleasant connotations (it apparently derives from master/slave in BitKeeper). My &lt;code&gt;datasette-plugin&lt;/code&gt; and &lt;code&gt;click-app&lt;/code&gt; cookiecutter templates both include instructions for renaming &lt;code&gt;master&lt;/code&gt; to &lt;code&gt;main&lt;/code&gt; in their READMEs - it's as easy as running &lt;code&gt;git branch -m master main&lt;/code&gt; before running your first push to GitHub.&lt;/p&gt;

&lt;p&gt;I'm working towards &lt;a href="https://github.com/simonw/datasette/issues/849"&gt;making the switch&lt;/a&gt; for Datasette itself.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/plugins"&gt;plugins&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/robots-txt"&gt;robots-txt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cookiecutter"&gt;cookiecutter&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="git"/><category term="plugins"/><category term="projects"/><category term="robots-txt"/><category term="sqlite"/><category term="datasette"/><category term="weeknotes"/><category term="cookiecutter"/></entry><entry><title>Quoting Vincent Driessen</title><link href="https://simonwillison.net/2020/May/14/vincent-driessen/#atom-tag" rel="alternate"/><published>2020-05-14T13:49:55+00:00</published><updated>2020-05-14T13:49:55+00:00</updated><id>https://simonwillison.net/2020/May/14/vincent-driessen/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://nvie.com/posts/a-successful-git-branching-model/"&gt;&lt;p&gt;Web apps are typically continuously delivered, not rolled back, and you don't have to support multiple versions of the software running in the wild.&lt;/p&gt;
&lt;p&gt;This is not the class of software that I had in mind when I wrote the blog post 10 years ago. If your team is doing continuous delivery of software, I would suggest to adopt a much simpler workflow (like GitHub flow) instead of trying to shoehorn git-flow into your team.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://nvie.com/posts/a-successful-git-branching-model/"&gt;Vincent Driessen&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/continuous-deployment"&gt;continuous-deployment&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/continuous-integration"&gt;continuous-integration&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;&lt;/p&gt;



</summary><category term="continuous-deployment"/><category term="continuous-integration"/><category term="git"/></entry><entry><title>Weeknotes: Archiving coronavirus.data.gov.uk, custom pages and directory configuration in Datasette, photos-to-sqlite</title><link href="https://simonwillison.net/2020/Apr/29/weeknotes/#atom-tag" rel="alternate"/><published>2020-04-29T19:41:11+00:00</published><updated>2020-04-29T19:41:11+00:00</updated><id>https://simonwillison.net/2020/Apr/29/weeknotes/#atom-tag</id><summary type="html">
    &lt;p&gt;I mainly made progress on three projects this week: Datasette, photos-to-sqlite and a cleaner way of archiving data to a git repository.&lt;/p&gt;

&lt;h3&gt;Archiving coronavirus.data.gov.uk&lt;/h3&gt;

&lt;p&gt;The UK government have a new portal website sharing detailed Coronavirus data for regions around the country, at &lt;a href="https://coronavirus.data.gov.uk/"&gt;coronavirus.data.gov.uk&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As with everything else built in 2020, it's a big single-page JavaScript app. Matthew Somerville &lt;a href="http://dracos.co.uk/wrote/coronavirus-dashboard/"&gt;investigated&lt;/a&gt; what it would take to build a much lighter (and faster loading) site displaying the same information by moving much of the rendering to the server.&lt;/p&gt;

&lt;p&gt;One of the best things about the SPA craze is that it strongly encourages structured data to be published as JSON files. Matthew's article inspired me to take a look, and sure enough the government figures are available in an extremely comprehensive (and 3.3MB in size) JSON file, available from &lt;a href="https://c19downloads.azureedge.net/downloads/data/data_latest.json"&gt;https://c19downloads.azureedge.net/downloads/data/data_latest.json&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Any time I see a file like this my first questions are how often does it change - and what kind of changes are being made to it?&lt;/p&gt;

&lt;p&gt;I've written about scraping to a git repository (see my new &lt;a href="https://simonwillison.net/tags/gitscraping/"&gt;gitscraping&lt;/a&gt; tag) a bunch in the past:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;&lt;a href="https://simonwillison.net/2017/Sep/10/scraping-irma/"&gt;Scraping hurricane Irma&lt;/a&gt; - September 2017&lt;/li&gt;&lt;li&gt;&lt;a href="https://simonwillison.net/2017/Oct/10/fires-in-the-north-bay/"&gt;Changelogs to help understand the fires in the North Bay&lt;/a&gt; - October 2017&lt;/li&gt;&lt;li&gt;&lt;a href="https://simonwillison.net/2019/Mar/13/tree-history/"&gt;Generating a commit log for San Francisco’s official list of trees&lt;/a&gt; - March 2019&lt;/li&gt;&lt;li&gt;&lt;a href="https://simonwillison.net/2019/Oct/10/pge-outages/"&gt;Tracking PG&amp;amp;E outages by scraping to a git repo&lt;/a&gt; - October 2019&lt;/li&gt;&lt;li&gt;&lt;a href="https://simonwillison.net/2020/Jan/21/github-actions-cloud-run/"&gt;Deploying a data API using GitHub Actions and Cloud Run&lt;/a&gt; - January 2020&lt;/li&gt;&lt;/ul&gt;

&lt;p&gt;Now that I've figured out a really clean way to &lt;a href="https://github.com/simonw/til/blob/master/github-actions/commit-if-file-changed.md"&gt;Commit a file if it changed&lt;/a&gt; in a GitHub Action, knocking out new versions of this pattern is really quick.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/simonw/coronavirus-data-gov-archive"&gt;simonw/coronavirus-data-gov-archive&lt;/a&gt; is my new repo that does exactly that: it periodically fetches the latest versions of the JSON data files powering that site and commits them if they have changed. The aim is to build a &lt;a href="https://github.com/simonw/coronavirus-data-gov-archive/commits/master/data_latest.json"&gt;commit history&lt;/a&gt; of changes made to the underlying data.&lt;/p&gt;

&lt;p&gt;The first implementation was extremely simple - here's the &lt;a href="https://github.com/simonw/coronavirus-data-gov-archive/blob/c83d69e95ec6400bf77d7b0d474e868baa78841e/.github/workflows/scheduled.yml"&gt;entire action&lt;/a&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;name: Fetch latest data

on:
  push:
  repository_dispatch:
  schedule:
    - cron:  '25 * * * *'

jobs:
  scheduled:
    runs-on: ubuntu-latest
    steps:
    - name: Check out this repo
      uses: actions/checkout@v2
    - name: Fetch latest data
      run: |-
        curl https://c19downloads.azureedge.net/downloads/data/data_latest.json | jq . &amp;gt; data_latest.json
        curl https://c19pub.azureedge.net/utlas.geojson | gunzip | jq . &amp;gt; utlas.geojson
        curl https://c19pub.azureedge.net/countries.geojson | gunzip | jq . &amp;gt; countries.geojson
        curl https://c19pub.azureedge.net/regions.geojson | gunzip | jq . &amp;gt; regions.geojson
    - name: Commit and push if it changed
      run: |-
        git config user.name "Automated"
        git config user.email "actions@users.noreply.github.com"
        git add -A
        timestamp=$(date -u)
        git commit -m "Latest data: ${timestamp}" || exit 0
        git push&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;It uses a combination of &lt;code&gt;curl&lt;/code&gt; and &lt;code&gt;jq&lt;/code&gt; (both available &lt;a href="https://github.com/actions/virtual-environments/blob/master/images/linux/Ubuntu1804-README.md"&gt;in the default worker environment&lt;/a&gt;) to pull down the data and pretty-print it (better for readable diffs), then commits the result.&lt;/p&gt;
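&lt;p&gt;The same normalization idea can be sketched in Python if you'd rather not shell out to &lt;code&gt;jq&lt;/code&gt;: re-serialize the JSON with indentation (and, as an extra step that &lt;code&gt;jq .&lt;/code&gt; doesn't do, sorted keys) so each value lives on its own line and a git diff pinpoints exactly what changed.&lt;/p&gt;

```python
import json

def normalize(raw: str) -> str:
    # One value per line, stable key order: diffs show only real changes
    return json.dumps(json.loads(raw), indent=2, sort_keys=True) + "\n"
```

&lt;p&gt;Write the normalized output to the file before &lt;code&gt;git add&lt;/code&gt; and byte-identical fetches produce no commit at all.&lt;/p&gt;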

&lt;p&gt;Matthew Somerville &lt;a href="https://twitter.com/dracos/status/1255221799085846532"&gt;pointed out&lt;/a&gt; that inefficient polling sets a bad precedent. Here I'm hitting &lt;code&gt;azureedge.net&lt;/code&gt;, the Azure CDN, so that didn't particularly worry me - but since I want this pattern to be used widely it's good to provide a best-practice example.&lt;/p&gt;

&lt;p&gt;Figuring out the best way to make &lt;a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Conditional_requests"&gt;conditional get requests&lt;/a&gt; in a GitHub Action led me down &lt;a href="https://github.com/simonw/coronavirus-data-gov-archive/issues/1"&gt;something of a rabbit hole&lt;/a&gt;. I wanted to use &lt;a href="https://daniel.haxx.se/blog/2019/12/06/curl-speaks-etag/"&gt;curl's new ETag support&lt;/a&gt; but I ran into &lt;a href="https://github.com/curl/curl/issues/5309"&gt;a curl bug&lt;/a&gt;, so I ended up rolling a simple Python CLI tool called &lt;a href="https://github.com/simonw/conditional-get"&gt;conditional-get&lt;/a&gt; to solve my problem. In the time it took me to release that tool (just a few hours) a &lt;a href="https://github.com/curl/curl/issues/5309#issuecomment-621265179"&gt;new curl release&lt;/a&gt; came out with a fix for that bug!&lt;/p&gt;
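&lt;p&gt;The core idea is small enough to sketch with just the standard library. This is not the &lt;code&gt;conditional-get&lt;/code&gt; tool itself, and the &lt;code&gt;etags.json&lt;/code&gt; cache file is my invention for the example: store the &lt;code&gt;ETag&lt;/code&gt; from each response, send it back as &lt;code&gt;If-None-Match&lt;/code&gt;, and treat a &lt;code&gt;304 Not Modified&lt;/code&gt; as "nothing to commit".&lt;/p&gt;

```python
import json
import urllib.error
import urllib.request
from pathlib import Path

# Hypothetical cache file mapping URL -> last seen ETag
ETAG_STORE = Path("etags.json")

def conditional_get(url):
    """Fetch url, sending If-None-Match when an ETag is cached.
    Returns the body as bytes, or None on 304 Not Modified."""
    etags = json.loads(ETAG_STORE.read_text()) if ETAG_STORE.exists() else {}
    request = urllib.request.Request(url)
    if url in etags:
        request.add_header("If-None-Match", etags[url])
    try:
        with urllib.request.urlopen(request) as response:
            etag = response.headers.get("ETag")
            if etag:
                etags[url] = etag
                ETAG_STORE.write_text(json.dumps(etags))
            return response.read()
    except urllib.error.HTTPError as error:
        if error.code == 304:
            return None  # unchanged since the cached ETag
        raise
```

&lt;p&gt;A scraper built on this skips the write-and-commit step entirely whenever the function returns &lt;code&gt;None&lt;/code&gt;.&lt;/p&gt;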

&lt;p&gt;Here's &lt;a href="https://github.com/simonw/coronavirus-data-gov-archive/blob/a95d7661b236a9ee9a26a441dd948eb00308f919/.github/workflows/scheduled.yml"&gt;the workflow&lt;/a&gt; using my &lt;code&gt;conditional-get&lt;/code&gt; tool. See &lt;a href="https://github.com/simonw/coronavirus-data-gov-archive/issues/1"&gt;the issue thread&lt;/a&gt; for all of the other potential solutions, including a really neat &lt;a href="https://github.com/hubgit/curl-etag"&gt;Action shell-script solution&lt;/a&gt; by Alf Eaton.&lt;/p&gt;

&lt;p&gt;To my absolute delight, the project has already been forked once by Daniel Langer to &lt;a href="https://github.com/dlanger/coronavirus-hc-infobase-archive"&gt;capture Canadian Covid-19 cases&lt;/a&gt;!&lt;/p&gt;

&lt;h3 id="new-datasette-features"&gt;New Datasette features&lt;/h3&gt;

&lt;p&gt;I pushed two new features to &lt;a href="https://github.com/simonw/datasette"&gt;Datasette&lt;/a&gt; master, ready for release in 0.41.&lt;/p&gt;

&lt;h4&gt;Configuration directory mode&lt;/h4&gt;

&lt;p&gt;This is an idea I had while building &lt;a href="https://github.com/simonw/datasette-publish-now"&gt;datasette-publish-now&lt;/a&gt;. Datasette instances can be run with custom metadata, custom plugins and custom templates. I'm increasingly finding myself working on projects that run using something like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ datasette data1.db data2.db data3.db \
    --metadata=metadata.json \
    --template-dir=templates \
    --plugins-dir=plugins&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Directory configuration mode introduces the idea that Datasette can configure itself based on a directory layout. The above example can instead be handled by creating the following layout:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;my-project/data1.db
my-project/data2.db
my-project/data3.db
my-project/metadata.json
my-project/templates/index.html
my-project/plugins/custom_plugin.py&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Then run Datasette directly targeting that directory:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ datasette my-project/&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;See &lt;a href="https://github.com/simonw/datasette/issues/731"&gt;issue #731&lt;/a&gt; for more details. Directory configuration mode &lt;a href="https://datasette.readthedocs.io/en/latest/config.html#configuration-directory-mode"&gt;is documented here&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;Define custom pages using templates/pages&lt;/h4&gt;

&lt;p&gt;In &lt;a href="https://simonwillison.net/2019/Nov/25/niche-museums/"&gt;niche-museums.com, powered by Datasette&lt;/a&gt; I described how I built the &lt;a href="https://www.niche-museums.com/"&gt;www.niche-museums.com&lt;/a&gt; website as a heavily customized Datasette instance.&lt;/p&gt;

&lt;p&gt;That site has &lt;a href="https://www.niche-museums.com/about"&gt;/about&lt;/a&gt; and &lt;a href="https://www.niche-museums.com/map"&gt;/map&lt;/a&gt; pages which are served by custom templates - but I had to do some gnarly hacks with empty &lt;code&gt;about.db&lt;/code&gt; and &lt;code&gt;map.db&lt;/code&gt; files to get them to work.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/simonw/datasette/issues/648"&gt;Issue #648&lt;/a&gt; introduces a new mechanism for creating this kind of page: create a &lt;code&gt;templates/pages/map.html&lt;/code&gt; template file and custom 404 handling code will ensure that any hits to &lt;code&gt;/map&lt;/code&gt; serve the rendered contents of that template.&lt;/p&gt;
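&lt;p&gt;The fallback mechanism is easy to picture in isolation. This is not Datasette's actual routing code, just a minimal sketch of the idea: when no other route matches, look for a template file named after the request path, taking care that &lt;code&gt;../&lt;/code&gt; in a path can't escape the template directory.&lt;/p&gt;

```python
from pathlib import Path

# Hypothetical layout matching the templates/pages/ convention
TEMPLATE_DIR = Path("templates/pages")

def resolve_custom_page(path):
    """Given a request path that matched no other route, return the
    matching template (e.g. /map -> templates/pages/map.html) or None."""
    candidate = TEMPLATE_DIR / (path.strip("/") + ".html")
    # Refuse paths that escape the template directory via ../
    if TEMPLATE_DIR.resolve() not in candidate.resolve().parents:
        return None
    return candidate if candidate.exists() else None
```

&lt;p&gt;A hit returns the template to render; a miss falls through to the normal 404 response.&lt;/p&gt;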

&lt;p&gt;This could work really well with the &lt;a href="https://github.com/simonw/datasette-template-sql"&gt;datasette-template-sql&lt;/a&gt; plugin, which allows templates to execute arbitrary SQL queries (à la PHP or ColdFusion).&lt;/p&gt;

&lt;p&gt;Here's the new &lt;a href="https://datasette.readthedocs.io/en/latest/custom_templates.html#custom-pages"&gt;documentation on custom pages&lt;/a&gt;, including details of how to use the new &lt;code&gt;custom_status()&lt;/code&gt;, &lt;code&gt;custom_header()&lt;/code&gt; and &lt;code&gt;custom_redirect()&lt;/code&gt; template functions to go beyond just returning HTML.&lt;/p&gt;

&lt;h3&gt;photos-to-sqlite&lt;/h3&gt;

&lt;p&gt;My &lt;a href="https://dogsheep.github.io/"&gt;Dogsheep&lt;/a&gt; personal analytics project brings my &lt;a href="https://github.com/dogsheep/twitter-to-sqlite"&gt;tweets&lt;/a&gt;, &lt;a href="https://github.com/dogsheep/github-to-sqlite"&gt;GitHub activity&lt;/a&gt;, &lt;a href="https://github.com/dogsheep/swarm-to-sqlite"&gt;Swarm checkins&lt;/a&gt; and more together in one place. But the big missing feature is my photos.&lt;/p&gt;

&lt;p&gt;As of yesterday, I have 39,000 photos from Apple Photos uploaded to an S3 bucket using my new &lt;a href="https://github.com/dogsheep/photos-to-sqlite/"&gt;photos-to-sqlite&lt;/a&gt; tool. I can run the following SQL query and get back ten random photos!&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;select
  json_object(
    'img_src',
    'https://photos.simonwillison.net/i/' || 
    sha256 || '.' || ext || '?w=400'
  ),
  filepath,
  ext
from
  photos
where
  ext in ('jpeg', 'jpg', 'heic')
order by
  random()
limit
  10&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;code&gt;photos.simonwillison.net&lt;/code&gt; is running a modified version of my &lt;a href="https://github.com/simonw/heic-to-jpeg"&gt;heic-to-jpeg&lt;/a&gt; image converting and resizing proxy, which I'll release at some point soon.&lt;/p&gt;
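&lt;p&gt;The &lt;code&gt;sha256&lt;/code&gt; in those image URLs suggests the photos are content-addressed. Here's a sketch of what deriving such a key could look like - my inference from the URL shape, not necessarily how photos-to-sqlite actually builds its S3 keys:&lt;/p&gt;

```python
import hashlib
from pathlib import Path

def content_key(filepath):
    """Content-address a photo: sha256 of the bytes plus the original
    extension, so re-uploading the same image maps to the same object."""
    digest = hashlib.sha256(Path(filepath).read_bytes()).hexdigest()
    ext = Path(filepath).suffix.lstrip(".").lower()
    return "%s.%s" % (digest, ext)
```

&lt;p&gt;One nice property of hashing the bytes rather than using the filename: duplicate photos deduplicate themselves in the bucket for free.&lt;/p&gt;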

&lt;p&gt;There's still plenty of work to do - I still need to import EXIF data (including locations) into SQLite, and I plan to use &lt;a href="https://github.com/RhetTbull/osxphotos"&gt;osxphotos&lt;/a&gt; to export additional metadata from my Apple Photos library. But this week it went from a pure research project to something I can actually start using, which is exciting.&lt;/p&gt;

&lt;h3&gt;TIL this week&lt;/h3&gt;

&lt;ul&gt;&lt;li&gt;&lt;a href="https://github.com/simonw/til/blob/master/macos/fixing-compinit-insecure-directories.md"&gt;Fixing "compinit: insecure directories" error&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://github.com/simonw/til/blob/master/tailscale/lock-down-sshd.md"&gt;Restricting SSH connections to devices within a Tailscale network&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://github.com/simonw/til/blob/master/python/generate-nested-json-summary.md"&gt;Generated a summary of nested JSON data&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://github.com/simonw/til/blob/master/pytest/session-scoped-tmp.md"&gt;Session-scoped temporary directories in pytest&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://github.com/simonw/til/blob/master/pytest/mock-httpx.md"&gt;How to mock httpx using pytest-mock&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;

&lt;p&gt;Generated using &lt;a href="https://til.simonwillison.net/til?sql=select+json_object(%27pre%27%2C+group_concat(%27*+[%27+||+title+||+%27](%27+||+url+||+%27)%27%2C+%27%0D%0A%27))+from+til+where+%22created_utc%22+%3E%3D+%3Ap0+order+by+updated_utc+desc+limit+101&amp;amp;p0=2020-04-23"&gt;this query&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/http"&gt;http&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/matthew-somerville"&gt;matthew-somerville&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/photos"&gt;photos&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/covid19"&gt;covid19&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="git"/><category term="http"/><category term="matthew-somerville"/><category term="photos"/><category term="projects"/><category term="datasette"/><category term="weeknotes"/><category term="covid19"/><category term="git-scraping"/></entry></feed>