<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: covid19</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/covid19.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2024-10-11T01:45:23+00:00</updated><author><name>Simon Willison</name></author><entry><title>Quoting Ed Yong</title><link href="https://simonwillison.net/2024/Oct/11/ed-yong/#atom-tag" rel="alternate"/><published>2024-10-11T01:45:23+00:00</published><updated>2024-10-11T01:45:23+00:00</updated><id>https://simonwillison.net/2024/Oct/11/ed-yong/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://xoxofest.com/2024/videos/ed-yong/"&gt;&lt;p&gt;Providing validation, strength, and stability to people who feel gaslit and dismissed and forgotten can help them feel stronger and surer in their decisions. These pieces made me understand that journalism can be a caretaking profession, even if it is never really thought about in those terms. It is often framed in terms of antagonism. Speaking truth to power turns into being hard-nosed and removed from our subject matter, which so easily turns into be an asshole and do whatever you like.&lt;/p&gt;
&lt;p&gt;This is a viewpoint that I reject. My pillars are empathy, curiosity, and kindness. And much else flows from that. For people who feel lost and alone, we get to say through our work, you are not. For people who feel like society has abandoned them and their lives do not matter, we get to say, actually, they fucking do. We are one of the only professions that can do that through our work and that can do that at scale.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://xoxofest.com/2024/videos/ed-yong/"&gt;Ed Yong&lt;/a&gt;, at &lt;a href="https://www.youtube.com/watch?v=ddy5uMdzZB8&amp;amp;t=1187s"&gt;19:47&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/journalism"&gt;journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/covid19"&gt;covid19&lt;/a&gt;&lt;/p&gt;



</summary><category term="journalism"/><category term="covid19"/></entry><entry><title>My @covidsewage bot now includes useful alt text</title><link href="https://simonwillison.net/2024/Aug/25/covidsewage-alt-text/#atom-tag" rel="alternate"/><published>2024-08-25T16:09:49+00:00</published><updated>2024-08-25T16:09:49+00:00</updated><id>https://simonwillison.net/2024/Aug/25/covidsewage-alt-text/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://fedi.simonwillison.net/@covidsewage/113023397159658020"&gt;My @covidsewage bot now includes useful alt text&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I've been running a &lt;a href="https://fedi.simonwillison.net/@covidsewage"&gt;@covidsewage&lt;/a&gt; Mastodon bot for a while now, posting daily screenshots (taken with &lt;a href="https://shot-scraper.datasette.io/"&gt;shot-scraper&lt;/a&gt;) of the Santa Clara County &lt;a href="https://publichealth.santaclaracounty.gov/health-information/health-data/disease-data/covid-19/covid-19-wastewater"&gt;COVID in wastewater&lt;/a&gt; dashboard.&lt;/p&gt;
&lt;p&gt;Prior to today the screenshot was accompanied by the decidedly unhelpful alt text "Screenshot of the latest Covid charts".&lt;/p&gt;
&lt;p&gt;I finally fixed that today, closing &lt;a href="https://github.com/simonw/covidsewage-bot/issues/2"&gt;issue #2&lt;/a&gt; more than two years after I first opened it.&lt;/p&gt;
&lt;p&gt;The screenshot is of a Microsoft Power BI dashboard. I hoped I could scrape the key information out of it using JavaScript, but the weirdness of their DOM proved insurmountable.&lt;/p&gt;
&lt;p&gt;Instead, I'm using GPT-4o - specifically, this Python code (run using a &lt;code&gt;python -c&lt;/code&gt; block in the GitHub Actions YAML file):&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;base64&lt;/span&gt;, &lt;span class="pl-s1"&gt;openai&lt;/span&gt;
&lt;span class="pl-s1"&gt;client&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;openai&lt;/span&gt;.&lt;span class="pl-v"&gt;OpenAI&lt;/span&gt;()
&lt;span class="pl-k"&gt;with&lt;/span&gt; &lt;span class="pl-en"&gt;open&lt;/span&gt;(&lt;span class="pl-s"&gt;'/tmp/covid.png'&lt;/span&gt;, &lt;span class="pl-s"&gt;'rb'&lt;/span&gt;) &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-s1"&gt;image_file&lt;/span&gt;:
    &lt;span class="pl-s1"&gt;encoded_image&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;base64&lt;/span&gt;.&lt;span class="pl-en"&gt;b64encode&lt;/span&gt;(&lt;span class="pl-s1"&gt;image_file&lt;/span&gt;.&lt;span class="pl-en"&gt;read&lt;/span&gt;()).&lt;span class="pl-en"&gt;decode&lt;/span&gt;(&lt;span class="pl-s"&gt;'utf-8'&lt;/span&gt;)
&lt;span class="pl-s1"&gt;messages&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; [
    {&lt;span class="pl-s"&gt;'role'&lt;/span&gt;: &lt;span class="pl-s"&gt;'system'&lt;/span&gt;,
     &lt;span class="pl-s"&gt;'content'&lt;/span&gt;: &lt;span class="pl-s"&gt;'Return the concentration levels in the sewersheds - single paragraph, no markdown'&lt;/span&gt;},
    {&lt;span class="pl-s"&gt;'role'&lt;/span&gt;: &lt;span class="pl-s"&gt;'user'&lt;/span&gt;, &lt;span class="pl-s"&gt;'content'&lt;/span&gt;: [
        {&lt;span class="pl-s"&gt;'type'&lt;/span&gt;: &lt;span class="pl-s"&gt;'image_url'&lt;/span&gt;, &lt;span class="pl-s"&gt;'image_url'&lt;/span&gt;: {
            &lt;span class="pl-s"&gt;'url'&lt;/span&gt;: &lt;span class="pl-s"&gt;'data:image/png;base64,'&lt;/span&gt; &lt;span class="pl-c1"&gt;+&lt;/span&gt; &lt;span class="pl-s1"&gt;encoded_image&lt;/span&gt;
        }}
    ]}
]
&lt;span class="pl-s1"&gt;completion&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;client&lt;/span&gt;.&lt;span class="pl-s1"&gt;chat&lt;/span&gt;.&lt;span class="pl-s1"&gt;completions&lt;/span&gt;.&lt;span class="pl-en"&gt;create&lt;/span&gt;(&lt;span class="pl-s1"&gt;model&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;'gpt-4o'&lt;/span&gt;, &lt;span class="pl-s1"&gt;messages&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s1"&gt;messages&lt;/span&gt;)
&lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;completion&lt;/span&gt;.&lt;span class="pl-s1"&gt;choices&lt;/span&gt;[&lt;span class="pl-c1"&gt;0&lt;/span&gt;].&lt;span class="pl-s1"&gt;message&lt;/span&gt;.&lt;span class="pl-s1"&gt;content&lt;/span&gt;)&lt;/pre&gt;

&lt;p&gt;I'm base64 encoding the screenshot and sending it with this system prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Return the concentration levels in the sewersheds - single paragraph, no markdown&lt;/p&gt;
&lt;/blockquote&gt;
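&lt;p&gt;The data: URL construction step can be pulled out into a standalone helper - a sketch for illustration only (the &lt;code&gt;to_data_url&lt;/code&gt; name is invented here, it isn't part of the script above):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import base64

def to_data_url(png_bytes):
    # Base64-encode raw PNG bytes and wrap them in a data: URL,
    # the inline image format accepted by the chat completions API
    encoded = base64.b64encode(png_bytes).decode("utf-8")
    return "data:image/png;base64," + encoded
&lt;/code&gt;&lt;/pre&gt;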
&lt;p&gt;Given this input image:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of a Power BI dashboard showing information that is described below" src="https://static.simonwillison.net/static/2024/covid-power-bi.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;Here's the text that comes back:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The concentration levels of SARS-CoV-2 in the sewersheds from collected samples are as follows: San Jose Sewershed has a high concentration, Palo Alto Sewershed has a high concentration, Sunnyvale Sewershed has a high concentration, and Gilroy Sewershed has a medium concentration.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The full implementation can be found in &lt;a href="https://github.com/simonw/covidsewage-bot/blob/main/.github/workflows/post.yml"&gt;the GitHub Actions workflow&lt;/a&gt;, which runs on a schedule at 7am Pacific time every day.&lt;/p&gt;
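&lt;p&gt;A daily schedule like that is configured with a cron trigger in the workflow YAML. A sketch of the relevant fragment (the time shown is illustrative: GitHub Actions cron runs in UTC, so 7am Pacific is 14:00 or 15:00 UTC depending on daylight saving):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;on:
  schedule:
    # 14:00 UTC is 7am Pacific during daylight saving time
    - cron: "0 14 * * *"
  # allow manual runs too
  workflow_dispatch: {}
&lt;/code&gt;&lt;/pre&gt;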


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/accessibility"&gt;accessibility&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/alt-text"&gt;alt-text&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/covid19"&gt;covid19&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/shot-scraper"&gt;shot-scraper&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-4"&gt;gpt-4&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;&lt;/p&gt;



</summary><category term="accessibility"/><category term="alt-text"/><category term="projects"/><category term="ai"/><category term="covid19"/><category term="shot-scraper"/><category term="openai"/><category term="generative-ai"/><category term="gpt-4"/><category term="llms"/></entry><entry><title>Fix @covidsewage bot to handle a change to the underlying website</title><link href="https://simonwillison.net/2024/Aug/18/fix-covidsewage-bot/#atom-tag" rel="alternate"/><published>2024-08-18T17:26:32+00:00</published><updated>2024-08-18T17:26:32+00:00</updated><id>https://simonwillison.net/2024/Aug/18/fix-covidsewage-bot/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/covidsewage-bot/issues/6"&gt;Fix @covidsewage bot to handle a change to the underlying website&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I've been running &lt;a href="https://fedi.simonwillison.net/@covidsewage"&gt;@covidsewage&lt;/a&gt; on Mastodon since February last year, posting a daily screenshot of the Santa Clara County charts showing Covid levels in wastewater.&lt;/p&gt;
&lt;p&gt;A few days ago the county changed their website, breaking the bot. The chart now lives on their new &lt;a href="https://publichealth.santaclaracounty.gov/health-information/health-data/disease-data/covid-19/covid-19-wastewater"&gt;COVID in wastewater&lt;/a&gt; page.&lt;/p&gt;
&lt;p&gt;It's still a Microsoft Power BI dashboard in an &lt;code&gt;&amp;lt;iframe&amp;gt;&lt;/code&gt;, but my initial attempts to scrape it didn't quite work. Eventually I realized that Cloudflare protection was blocking my attempts to access the page, but thankfully sending a Firefox user-agent fixed that problem.&lt;/p&gt;
&lt;p&gt;The new recipe I'm using to screenshot the chart involves a delightfully messy nested set of calls to &lt;a href="https://shot-scraper.datasette.io/"&gt;shot-scraper&lt;/a&gt; - first using &lt;code&gt;shot-scraper javascript&lt;/code&gt; to extract the &lt;code&gt;src&lt;/code&gt; attribute of that &lt;code&gt;&amp;lt;iframe&amp;gt;&lt;/code&gt;, then feeding that URL to a separate &lt;code&gt;shot-scraper&lt;/code&gt; call to generate the screenshot:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;shot-scraper -o /tmp/covid.png $(
  shot-scraper javascript \
    'https://publichealth.santaclaracounty.gov/health-information/health-data/disease-data/covid-19/covid-19-wastewater' \
    'document.querySelector("iframe").src' \
    -b firefox \
    --user-agent 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:128.0) Gecko/20100101 Firefox/128.0' \
    --raw
) --wait 5000 -b firefox --retina
&lt;/code&gt;&lt;/pre&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/covid19"&gt;covid19&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/shot-scraper"&gt;shot-scraper&lt;/a&gt;&lt;/p&gt;



</summary><category term="projects"/><category term="covid19"/><category term="shot-scraper"/></entry><entry><title>Building a Covid sewage Twitter bot (and other weeknotes)</title><link href="https://simonwillison.net/2022/Apr/18/covid-sewage/#atom-tag" rel="alternate"/><published>2022-04-18T02:49:06+00:00</published><updated>2022-04-18T02:49:06+00:00</updated><id>https://simonwillison.net/2022/Apr/18/covid-sewage/#atom-tag</id><summary type="html">
    &lt;p&gt;I built a new Twitter bot today: &lt;a href="https://twitter.com/covidsewage"&gt;@covidsewage&lt;/a&gt;. It tweets a daily screenshot of the latest &lt;a href="https://covid19.sccgov.org/dashboard-wastewater"&gt;Covid sewage monitoring data&lt;/a&gt; published by Santa Clara county.&lt;/p&gt;
&lt;p&gt;I'm increasingly distrustful of Covid numbers as fewer people are tested in ways that feed into the official statistics. But the sewage numbers don't lie! As the &lt;a href="https://covid19.sccgov.org/dashboard-wastewater"&gt;Santa Clara county page&lt;/a&gt; explains:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;SARS-CoV-2 (the virus that causes COVID-19) is shed in feces by infected individuals and can be measured in wastewater. More cases of COVID-19 in the community are associated with increased levels of SARS-CoV-2 in wastewater, meaning that data from wastewater analysis can be used as an indicator of the level of transmission of COVID-19 in the community.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That page also embeds some beautiful charts of the latest numbers, powered by an embedded Observable notebook built by &lt;a href="https://www.zanarmstrong.com/"&gt;Zan Armstrong&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Once a day, my bot tweets a screenshot of those latest charts that looks &lt;a href="https://twitter.com/covidsewage/status/1515832038443544578"&gt;like this&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/covidsewage.jpg" alt="Screenshot of a tweet that says &amp;quot;Latest Covid sewage charts for the SF Bay Area&amp;quot; with an attached screenshot of some charts. The numbers are trending up in an alarming direction." style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;h4&gt;How the bot works&lt;/h4&gt;
&lt;p&gt;The bot runs once a day using &lt;a href="https://github.com/simonw/covidsewage-bot/blob/main/.github/workflows/tweet.yml"&gt;this scheduled GitHub Actions workflow&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here's the bit of the workflow that generates the screenshot:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;- &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Generate screenshot with shot-scraper&lt;/span&gt;
  &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;    shot-scraper https://covid19.sccgov.org/dashboard-wastewater \&lt;/span&gt;
&lt;span class="pl-s"&gt;      -s iframe --wait 3000 -b firefox --retina -o /tmp/covid.png&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This uses my &lt;a href="https://datasette.io/tools/shot-scraper"&gt;shot-scraper&lt;/a&gt; screenshot tool, described here &lt;a href="https://simonwillison.net/2022/Mar/10/shot-scraper/"&gt;previously&lt;/a&gt;. It takes a retina screenshot just of the embedded iframe, and uses Firefox because for some reason the default Chromium screenshot failed to load the embed.&lt;/p&gt;
&lt;p&gt;This bit sends the tweet:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;- &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Tweet the new image&lt;/span&gt;
  &lt;span class="pl-ent"&gt;env&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;TWITTER_CONSUMER_KEY&lt;/span&gt;: &lt;span class="pl-s"&gt;${{ secrets.TWITTER_CONSUMER_KEY }}&lt;/span&gt;
    &lt;span class="pl-ent"&gt;TWITTER_CONSUMER_SECRET&lt;/span&gt;: &lt;span class="pl-s"&gt;${{ secrets.TWITTER_CONSUMER_SECRET }}&lt;/span&gt;
    &lt;span class="pl-ent"&gt;TWITTER_ACCESS_TOKEN_KEY&lt;/span&gt;: &lt;span class="pl-s"&gt;${{ secrets.TWITTER_ACCESS_TOKEN_KEY }}&lt;/span&gt;
    &lt;span class="pl-ent"&gt;TWITTER_ACCESS_TOKEN_SECRET&lt;/span&gt;: &lt;span class="pl-s"&gt;${{ secrets.TWITTER_ACCESS_TOKEN_SECRET }}&lt;/span&gt;
  &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;    tweet-images "Latest Covid sewage charts for the SF Bay Area" \&lt;/span&gt;
&lt;span class="pl-s"&gt;      /tmp/covid.png --alt "Screenshot of the charts" &amp;gt; latest-tweet.md&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/tweet-images"&gt;tweet-images&lt;/a&gt; is a tiny new tool I built for this project. It uses the &lt;a href="https://github.com/bear/python-twitter"&gt;python-twitter&lt;/a&gt; library to send a tweet with one or more images attached to it.&lt;/p&gt;
&lt;p&gt;The hardest part of the project was getting the credentials for sending tweets with the bot! I had to go through Twitter's manual verification flow, presumably because I checked the "bot" option when I applied for the new developer account. I also had to figure out how to extract all four credentials (with write permissions) from the Twitter developer portal.&lt;/p&gt;
&lt;p&gt;I wrote up full notes on this in a TIL: &lt;a href="https://til.simonwillison.net/twitter/credentials-twitter-bot"&gt;How to get credentials for a new Twitter bot&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;Datasette for geospatial analysis&lt;/h4&gt;
&lt;p&gt;I stumbled across &lt;a href="https://github.com/datanews/amtrak-geojson"&gt;datanews/amtrak-geojson&lt;/a&gt;, a GitHub repository containing GeoJSON files (from 2015) showing all of the Amtrak stations and sections of track in the USA.&lt;/p&gt;
&lt;p&gt;I decided to try exploring it using my &lt;a href="https://datasette.io/tools/geojson-to-sqlite"&gt;geojson-to-sqlite&lt;/a&gt; tool, which revealed &lt;a href="https://github.com/simonw/geojson-to-sqlite/issues/30"&gt;a bug&lt;/a&gt; triggered by records with a geometry but no properties. I fixed that in version &lt;a href="https://github.com/simonw/geojson-to-sqlite/releases/tag/1.0.1"&gt;1.0.1&lt;/a&gt;, and later shipped version &lt;a href="https://github.com/simonw/geojson-to-sqlite/releases/tag/1.1"&gt;1.1&lt;/a&gt; with improvements by Chris Amico.&lt;/p&gt;
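&lt;p&gt;The failing shape is easy to reproduce: a GeoJSON feature whose &lt;code&gt;properties&lt;/code&gt; member is &lt;code&gt;null&lt;/code&gt;. A minimal sketch of the defensive handling (for illustration - this is not the tool's actual code):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;feature = {
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [-122.08, 37.39]},
    "properties": None,  # geometry but no properties - the case that triggered the bug
}
# Treat missing or null properties as an empty dict before building columns
properties = feature.get("properties") or {}
&lt;/code&gt;&lt;/pre&gt;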
&lt;p&gt;In exploring the Amtrak data I found myself needing to learn how to use the SpatiaLite &lt;code&gt;GUnion&lt;/code&gt; function to aggregate multiple geometries together. This resulted in a detailed TIL on using &lt;a href="https://til.simonwillison.net/spatialite/gunion-to-combine-geometries"&gt;GUnion to combine geometries in SpatiaLite&lt;/a&gt;, which further evolved as I used it as a chance to learn how to use Chris's &lt;a href="https://datasette.io/plugins/datasette-geojson-map"&gt;datasette-geojson-map&lt;/a&gt; and &lt;a href="https://datasette.io/plugins/sqlite-colorbrewer"&gt;sqlite-colorbrewer&lt;/a&gt; plugins.&lt;/p&gt;
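&lt;p&gt;The core of that technique is a single aggregate call. A sketch of the SQL, with table and column names invented for illustration:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;-- Combine every track segment for a route into one geometry
select route_name, GUnion(geometry) as combined_geometry
from amtrak_track_segments
group by route_name;
&lt;/code&gt;&lt;/pre&gt;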
&lt;p&gt;This was so much fun that I was inspired to add a new "uses" page to the official Datasette website: &lt;a href="https://datasette.io/for/geospatial"&gt;Datasette for geospatial analysis&lt;/a&gt; now gathers together links to plugins, tools and tutorials for handling geospatial data.&lt;/p&gt;
&lt;h4&gt;sqlite-utils 3.26&lt;/h4&gt;
&lt;p&gt;I'll quote the release notes for &lt;a href="https://sqlite-utils.datasette.io/en/stable/changelog.html#v3-26"&gt;sqlite-utils 3.26&lt;/a&gt; in full:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;New &lt;code&gt;errors=r.IGNORE/r.SET_NULL&lt;/code&gt; parameter for the &lt;code&gt;r.parsedatetime()&lt;/code&gt; and &lt;code&gt;r.parsedate()&lt;/code&gt; &lt;a href="https://sqlite-utils.datasette.io/en/stable/cli.html#cli-convert-recipes"&gt;convert recipes&lt;/a&gt;. (&lt;a href="https://github.com/simonw/sqlite-utils/issues/416"&gt;#416&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Fixed a bug where &lt;code&gt;--multi&lt;/code&gt; could not be used in combination with &lt;code&gt;--dry-run&lt;/code&gt; for the &lt;a href="https://sqlite-utils.datasette.io/en/stable/cli.html#cli-convert"&gt;convert&lt;/a&gt; command. (&lt;a href="https://github.com/simonw/sqlite-utils/issues/415"&gt;#415&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;New documentation: &lt;a href="https://sqlite-utils.datasette.io/en/stable/cli.html#cli-convert-complex"&gt;Using a convert() function to execute initialization&lt;/a&gt;. (&lt;a href="https://github.com/simonw/sqlite-utils/issues/420"&gt;#420&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;More robust detection for whether or not &lt;code&gt;deterministic=True&lt;/code&gt; is supported. (&lt;a href="https://github.com/simonw/sqlite-utils/issues/425"&gt;#425&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;h4&gt;shot-scraper 0.12&lt;/h4&gt;
&lt;p&gt;In addition to &lt;a href="https://github.com/simonw/shot-scraper/pull/56"&gt;support for WebKit&lt;/a&gt; contributed by Ryan Murphy, &lt;a href="https://github.com/simonw/shot-scraper/releases/tag/0.12"&gt;shot-scraper 0.12&lt;/a&gt; adds options for taking a screenshot that encompasses all of the elements on a page that match a CSS selector.&lt;/p&gt;
&lt;p&gt;It also adds a new &lt;code&gt;--js-selector&lt;/code&gt; option, &lt;a href="https://github.com/simonw/shot-scraper/issues/43"&gt;suggested by&lt;/a&gt; Tony Hirst. This covers the case where you want to take a screenshot of an element on the page that cannot be easily specified using a CSS selector. For example, this expression takes a screenshot of the first paragraph on a page that includes the text "shot-scraper":&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;shot-scraper https://simonwillison.net/2022/Apr/8/weeknotes/ \
  --js-selector 'el.tagName == "P" &amp;amp;&amp;amp; el.innerText.includes("shot-scraper")' \
  --padding 15 --retina
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;And an airship museum!&lt;/h4&gt;
&lt;p&gt;I finally got to add another listing to my &lt;a href="https://www.niche-museums.com/"&gt;www.niche-museums.com&lt;/a&gt; website about small or niche museums I have visited.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://www.niche-museums.com/105"&gt;Moffett Field Historical Society&lt;/a&gt; museum in Mountain View is situated in the shadow of Hangar One, an airship hangar built in 1933 to house the mighty USS Macon.&lt;/p&gt;
&lt;p&gt;It's the absolute best kind of local history museum. Our docent was a retired pilot who had landed planes on aircraft carriers using the kind of equipment now on display in the museum. They had dioramas and models. They even had a model railway. It was superb.&lt;/p&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/tweet-images"&gt;tweet-images&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/tweet-images/releases/tag/0.1.1"&gt;0.1.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/tweet-images/releases"&gt;2 releases total&lt;/a&gt;) - 2022-04-17
&lt;br /&gt;Send tweets with images from the command line&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/asyncinject"&gt;asyncinject&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/asyncinject/releases/tag/0.3"&gt;0.3&lt;/a&gt; - (&lt;a href="https://github.com/simonw/asyncinject/releases"&gt;5 releases total&lt;/a&gt;) - 2022-04-16
&lt;br /&gt;Run async workflows using pytest-fixtures-style dependency injection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/geojson-to-sqlite"&gt;geojson-to-sqlite&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/geojson-to-sqlite/releases/tag/1.1.1"&gt;1.1.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/geojson-to-sqlite/releases"&gt;11 releases total&lt;/a&gt;) - 2022-04-13
&lt;br /&gt;CLI tool for converting GeoJSON files to SQLite (with SpatiaLite)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/sqlite-utils/releases/tag/3.26"&gt;3.26&lt;/a&gt; - (&lt;a href="https://github.com/simonw/sqlite-utils/releases"&gt;99 releases total&lt;/a&gt;) - 2022-04-13
&lt;br /&gt;Python CLI utility and library for manipulating SQLite databases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/summarize-template"&gt;summarize-template&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/summarize-template/releases/tag/0.1"&gt;0.1&lt;/a&gt; - 2022-04-13
&lt;br /&gt;Show a summary of a Django or Jinja template&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/shot-scraper"&gt;shot-scraper&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/shot-scraper/releases/tag/0.12"&gt;0.12&lt;/a&gt; - (&lt;a href="https://github.com/simonw/shot-scraper/releases"&gt;13 releases total&lt;/a&gt;) - 2022-04-11
&lt;br /&gt;Tools for taking automated screenshots of websites&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/spatialite/gunion-to-combine-geometries"&gt;GUnion to combine geometries in SpatiaLite&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/macos/apple-photos-large-files"&gt;Trick Apple Photos into letting you access your video files&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/twitter/credentials-twitter-bot"&gt;How to get credentials for a new Twitter bot&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/twitter"&gt;twitter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/covid19"&gt;covid19&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="projects"/><category term="twitter"/><category term="datasette"/><category term="weeknotes"/><category term="github-actions"/><category term="covid19"/><category term="sqlite-utils"/></entry><entry><title>Weeknotes: CDC vaccination history fixes, developing in GitHub Codespaces</title><link href="https://simonwillison.net/2021/Sep/28/weeknotes/#atom-tag" rel="alternate"/><published>2021-09-28T01:53:49+00:00</published><updated>2021-09-28T01:53:49+00:00</updated><id>https://simonwillison.net/2021/Sep/28/weeknotes/#atom-tag</id><summary type="html">
    &lt;p&gt;I spent the last week mostly surrounded by boxes: we're completing our move to the new place and life is mostly unpacking now. I did find some time to fix some issues with my &lt;a href="https://cdc-vaccination-history.datasette.io/"&gt;CDC vaccination history&lt;/a&gt; Datasette instance though.&lt;/p&gt;
&lt;h4&gt;Fixing my CDC vaccination history site&lt;/h4&gt;
&lt;p&gt;I started tracking changes made to the &lt;a href="https://covid.cdc.gov/covid-data-tracker/#vaccinations_vacc-total-admin-rate-total"&gt;CDC's COVID Data Tracker&lt;/a&gt; website back in February. I created &lt;a href="https://github.com/simonw/cdc-vaccination-history"&gt;a git scraper repository&lt;/a&gt; for it as part of my &lt;a href="https://simonwillison.net/2021/Mar/5/git-scraping/"&gt;five minute lightning talk on git scraping&lt;/a&gt; (notes and video) at this year's NICAR data journalism conference.&lt;/p&gt;
&lt;p&gt;Since then it's been quietly ticking along, recording the latest data in a git repository that now has &lt;a href="https://github.com/simonw/cdc-vaccination-history/commits/main"&gt;335 commits&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In March I &lt;a href="https://github.com/simonw/cdc-vaccination-history/commit/bf88c1e6cc3e5b6344a7dfea5d2a70dcb0552847#diff-87ee5504a3e25ac558b343724c905f2f7949e8cec3d92b9c4300bb922afa164f"&gt;added a script&lt;/a&gt; to build the collected historic data into a SQLite database and publish it to Vercel using GitHub Actions. That started breaking a few weeks ago, and it turned out that was because the database file had grown in size to the point where it was too large to deploy to Vercel (~100MB).&lt;/p&gt;
&lt;p&gt;I got a bug report about this, so I took some time to &lt;a href="https://github.com/simonw/cdc-vaccination-history/issues/8"&gt;move the deployment over&lt;/a&gt; to Google Cloud Run, which doesn't have a documented size limit (though in my experience it starts to creak once you go above about 2GB).&lt;/p&gt;
&lt;p&gt;I also started publishing the raw collected data &lt;a href="https://github.com/simonw/cdc-vaccination-history/issues/9"&gt;directly as a CSV file&lt;/a&gt;, partly as an excuse to learn &lt;a href="https://til.simonwillison.net/googlecloud/gsutil-bucket"&gt;how to publish to Google Cloud Storage&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;datasette-template-request&lt;/h4&gt;
&lt;p&gt;I released an extremely simple plugin this week called &lt;a href="https://datasette.io/plugins/datasette-template-request"&gt;datasette-template-request&lt;/a&gt; - all it does is expose Datasette's &lt;a href="https://docs.datasette.io/en/stable/internals.html#request-object"&gt;request object&lt;/a&gt; in the context passed to &lt;a href="https://docs.datasette.io/en/stable/custom_templates.html"&gt;custom templates&lt;/a&gt;, for people who want to update their custom page based on incoming request parameters.&lt;/p&gt;
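&lt;p&gt;A plugin like this is nearly a one-liner using Datasette's &lt;code&gt;extra_template_vars&lt;/code&gt; plugin hook - roughly this shape:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from datasette import hookimpl

@hookimpl
def extra_template_vars(request):
    # Expose the incoming request object as {{ request }} in custom templates
    return {"request": request}
&lt;/code&gt;&lt;/pre&gt;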
&lt;p&gt;More notable is how I built the plugin: this is the first plugin I've developed, tested and released entirely in my browser using the new &lt;a href="https://github.com/features/codespaces"&gt;GitHub Codespaces&lt;/a&gt; online development environment.&lt;/p&gt;
&lt;p&gt;I created the new repo using my &lt;a href="https://github.com/simonw/datasette-plugin-template-repository"&gt;Datasette plugin template repository&lt;/a&gt;, opened it up in Codespaces, implemented the plugin and tests, tried it out using the port forwarding feature and then published it to PyPI using the &lt;a href="https://github.com/simonw/datasette-template-request/blob/0.1/.github/workflows/publish.yml"&gt;publish.yml&lt;/a&gt; workflow.&lt;/p&gt;
&lt;p&gt;Not having to even open a text editor on my laptop (let alone get a new Python development environment up and running) felt really good. I should turn this into a tutorial.&lt;/p&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-template-request"&gt;datasette-template-request&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-template-request/releases/tag/0.1"&gt;0.1&lt;/a&gt; - 2021-09-23
&lt;br /&gt;Expose the Datasette request object to custom templates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-notebook"&gt;datasette-notebook&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-notebook/releases/tag/0.1a1"&gt;0.1a1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-notebook/releases"&gt;2 releases total&lt;/a&gt;) - 2021-09-22
&lt;br /&gt;A markdown wiki and dashboarding system for Datasette&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-render-markdown"&gt;datasette-render-markdown&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-render-markdown/releases/tag/2.0"&gt;2.0&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-render-markdown/releases"&gt;8 releases total&lt;/a&gt;) - 2021-09-22
&lt;br /&gt;Datasette plugin for rendering Markdown&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/sqlite-utils/releases/tag/3.17.1"&gt;3.17.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/sqlite-utils/releases"&gt;87 releases total&lt;/a&gt;) - 2021-09-22
&lt;br /&gt;Python CLI utility and library for manipulating SQLite databases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/dogsheep/twitter-to-sqlite"&gt;twitter-to-sqlite&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/dogsheep/twitter-to-sqlite/releases/tag/0.22"&gt;0.22&lt;/a&gt; - (&lt;a href="https://github.com/dogsheep/twitter-to-sqlite/releases"&gt;28 releases total&lt;/a&gt;) - 2021-09-21
&lt;br /&gt;Save data from Twitter to a SQLite database&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/til/til/googlecloud_gsutil-bucket.md"&gt;Publishing to a public Google Cloud bucket with gsutil&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/til/til/javascript_lit-with-skypack.md"&gt;Loading lit from Skypack&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/covid19"&gt;covid19&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-codespaces"&gt;github-codespaces&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="github"/><category term="projects"/><category term="weeknotes"/><category term="covid19"/><category term="git-scraping"/><category term="github-codespaces"/></entry><entry><title>Quoting Dan Sinker</title><link href="https://simonwillison.net/2021/Aug/23/dan-sinker/#atom-tag" rel="alternate"/><published>2021-08-23T01:59:52+00:00</published><updated>2021-08-23T01:59:52+00:00</updated><id>https://simonwillison.net/2021/Aug/23/dan-sinker/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://www.theatlantic.com/ideas/archive/2021/08/parents-are-not-okay/619859/"&gt;&lt;p&gt;The rapid increase of COVID-19 cases among kids has shattered last year’s oft-repeated falsehood that kids don’t get COVID-19, and if they do, it’s not that bad. It was a convenient lie that was easy to believe in part because we kept most of our kids home. With remote learning not an option now, this year we’ll find out how dangerous this virus is for children in the worst way possible.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://www.theatlantic.com/ideas/archive/2021/08/parents-are-not-okay/619859/"&gt;Dan Sinker&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/covid19"&gt;covid19&lt;/a&gt;&lt;/p&gt;



</summary><category term="covid19"/></entry><entry><title>The Tyranny of Spreadsheets</title><link href="https://simonwillison.net/2021/Jul/23/the-tyranny-of-spreadsheets/#atom-tag" rel="alternate"/><published>2021-07-23T03:57:50+00:00</published><updated>2021-07-23T03:57:50+00:00</updated><id>https://simonwillison.net/2021/Jul/23/the-tyranny-of-spreadsheets/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://timharford.com/2021/07/the-tyranny-of-spreadsheets/"&gt;The Tyranny of Spreadsheets&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
In discussing the notorious Excel incident last year, when the UK lost track of 16,000 Covid cases because they exceeded the 65,536-row limit of the old .xls format, Tim Harford presents a history of the spreadsheet, dating all the way back to Francesco di Marco Datini and double-entry bookkeeping in 1396. A delightful piece of writing.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=27923998"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/history"&gt;history&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/spreadsheets"&gt;spreadsheets&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/covid19"&gt;covid19&lt;/a&gt;&lt;/p&gt;



</summary><category term="history"/><category term="spreadsheets"/><category term="covid19"/></entry><entry><title>Trying to end the pandemic a little earlier with VaccinateCA</title><link href="https://simonwillison.net/2021/Feb/28/vaccinateca/#atom-tag" rel="alternate"/><published>2021-02-28T05:40:28+00:00</published><updated>2021-02-28T05:40:28+00:00</updated><id>https://simonwillison.net/2021/Feb/28/vaccinateca/#atom-tag</id><summary type="html">
    &lt;p&gt;This week I got involved with the &lt;a href="https://www.vaccinateca.com/"&gt;VaccinateCA&lt;/a&gt; effort. We are trying to end the pandemic a little earlier, by building the most accurate database possible of vaccination locations and availability in California.&lt;/p&gt;

&lt;h4&gt;VaccinateCA&lt;/h4&gt;
&lt;p&gt;I’ve been following this project for a while through Twitter, mainly via &lt;a href="https://twitter.com/patio11"&gt;Patrick McKenzie&lt;/a&gt; - here’s &lt;a href="https://twitter.com/patio11/status/1351942635682816002"&gt;his tweet&lt;/a&gt; about the project from January 20th.&lt;/p&gt;

&lt;blockquote class="twitter-tweet"&gt;&lt;p lang="en" dir="ltr"&gt;&lt;a href="https://t.co/JrD5mb4TAN"&gt;https://t.co/JrD5mb4TAN&lt;/a&gt; calls medical professionals daily to ask who they could vaccinate and how to get in line. We publish this, covering the entire state of California, to help more people get their vaccines faster. Please tell your friends and networks.&lt;/p&gt;- Patrick McKenzie (@patio11) &lt;a href="https://twitter.com/patio11/status/1351942635682816002?ref_src=twsrc%5Etfw"&gt;January 20, 2021&lt;/a&gt;&lt;/blockquote&gt;

&lt;p&gt;The core idea is one of those things that sounds obviously correct the moment you hear it. The Covid vaccination roll-out is decentralized and pretty chaotic. VaccinateCA realized that the best way to find out where the vaccine is available is to call the places that are distributing it - pharmacies, hospitals, clinics - as often as possible and ask if they have any in stock, who is eligible for the shot and how people can sign up for an appointment.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://blog.vaccinateca.com/what-weve-learned-so-far/"&gt;What We've Learned (So Far)&lt;/a&gt; by Patrick talks about lessons learned in the first 42 days of the project.&lt;/p&gt;
&lt;p&gt;There are three public-facing components to VaccinateCA:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.vaccinateca.com/"&gt;www.vaccinateca.com&lt;/a&gt; is a website to help you find available vaccines near you.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;help.vaccinateca&lt;/code&gt; is the web app used by volunteers who make calls - it provides a script and buttons to submit information gleaned from the call. If you’re interested in volunteering there’s &lt;a href="https://www.vaccinateca.com/about-us"&gt;information on the website&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;api.vaccinateca&lt;/code&gt; is the public API, which is &lt;a href="https://docs.vaccinateca.com/reference"&gt;documented here&lt;/a&gt; and is also used by the end-user facing website. It provides a full dump of collected location data, plus information on county policies and large-scale providers (pharmacy chains, health care providers).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The system currently mostly runs on &lt;a href="https://airtable.com/"&gt;Airtable&lt;/a&gt;, and takes advantage of pretty much every feature of that platform.&lt;/p&gt;
&lt;h4&gt;Why I got involved&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://twitter.com/obra"&gt;Jesse Vincent&lt;/a&gt; convinced me to get involved. It turns out to be a perfect fit for both my interests and my skills and experience.&lt;/p&gt;
&lt;p&gt;I’ve built crowdsourcing platforms before - for &lt;a href="https://simonwillison.net/2009/Dec/20/crowdsourcing/"&gt;MP’s expense reports at the Guardian&lt;/a&gt;, and then for conference and event listings with our startup, Lanyrd.&lt;/p&gt;
&lt;p&gt;VaccinateCA is a very data-heavy organization: the key goal is to build a comprehensive database of vaccine locations and availability. My background in data journalism and the last three years I’ve spent working on &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; have given me a wealth of relevant experience here.&lt;/p&gt;
&lt;p&gt;And finally… VaccinateCA are quickly running up against the limits of what you can sensibly do with Airtable - especially given Airtable’s hard limit at 100,000 records. They need to port critical tables to a custom PostgreSQL database, while maintaining as much as possible the agility that Airtable has enabled for them.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.djangoproject.com/"&gt;Django&lt;/a&gt; is a great fit for this kind of challenge, and I know quite a bit about both Django and using Django to quickly build robust, scalable and maintainable applications!&lt;/p&gt;
&lt;p&gt;So I spent this week starting a Django replacement for the Airtable backend used by the volunteer calling application. I hope to get to feature parity (at least as an API backend that the application can write to) in the next few days, to demonstrate that a switch-over is both possible and a good idea.&lt;/p&gt;
&lt;h4&gt;What about Datasette?&lt;/h4&gt;
&lt;p&gt;On Monday I spun up a Datasette instance at &lt;a href="https://vaccinateca.datasette.io/"&gt;vaccinateca.datasette.io&lt;/a&gt; (&lt;a href="https://github.com/simonw/vaccinate-ca-datasette/"&gt;underlying repository&lt;/a&gt;) against data from the public VaccinateCA API. The map visualization of &lt;a href="https://vaccinateca.datasette.io/vaccinateca/locations?_facet=Affiliation&amp;amp;_facet=Latest+report+yes%3F&amp;amp;_facet_array=Availability+Info"&gt;all of the locations&lt;/a&gt; instantly proved useful in helping spot locations that had incorrectly been located with latitudes and longitudes outside of California.&lt;/p&gt;
&lt;p&gt;I hope to use Datasette for a variety of tasks like this, but it shouldn’t be the core of the solution. VaccinateCA is the perfect example of a problem that needs to be solved with &lt;a href="http://boringtechnology.club/"&gt;Boring Technology&lt;/a&gt; - it needs to Just Work, and time that could be spent learning exciting new technologies needs to be spent building what’s needed as quickly, robustly and risk-free as possible.&lt;/p&gt;
&lt;p&gt;That said, I’m already starting to experiment with the new &lt;a href="https://docs.djangoproject.com/en/3.1/ref/models/fields/#django.db.models.JSONField"&gt;JSONField&lt;/a&gt; introduced in Django 3.1 - I’m hoping that a few JSON columns can help compensate for the lack of flexibility compared to Airtable, which makes it ridiculously easy for anyone to add additional columns.&lt;/p&gt;
&lt;p&gt;(To be fair JSONField has been a feature of Django's PostgreSQL extension since &lt;a href="https://docs.djangoproject.com/en/3.1/releases/1.9/"&gt;version 1.9 in 2015&lt;/a&gt; so it's just about made it into the boring technology bucket by now.)&lt;/p&gt;
&lt;h4&gt;Also this week&lt;/h4&gt;
&lt;p&gt;Working on VaccinateCA has given me a chance to use some of my tools in new and interesting ways, so I got to ship a bunch of small fixes, detailed in &lt;a href="#releases-2021-feb-27"&gt;Releases this week&lt;/a&gt; below.&lt;/p&gt;
&lt;p&gt;On Friday I gave a talk at &lt;a href="https://speakeasyjs.com/"&gt;Speakeasy JS&lt;/a&gt;, "the JavaScript meetup for 🥼 mad science, 🧙‍♂️ hacking, and 🧪 experiments", about why "SQL in your client-side JavaScript is a great idea". The video for that &lt;a href="https://www.youtube.com/watch?v=JyOYqJGrWak"&gt;is on YouTube&lt;/a&gt; and I plan to provide a full write-up soon.&lt;/p&gt;
&lt;p&gt;I also recorded a five minute lightning talk about &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git Scraping&lt;/a&gt; for next week's &lt;a href="https://www.ire.org/training/conferences/nicar-2021/"&gt;NICAR 2021&lt;/a&gt; data journalism conference.&lt;/p&gt;
&lt;p&gt;I also made a few small cosmetic upgrades to the way tags are displayed on my blog - they now show with a rounded border and purple background, and include a count of items published with that tag. My &lt;a href="https://simonwillison.net/tags/"&gt;tags page&lt;/a&gt; is one example of where I've now applied this style.&lt;/p&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/til/til/sphinx_sphinx-ext-extlinks.md"&gt;Using sphinx.ext.extlinks for issue links&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/til/til/postgresql_show-schema.md"&gt;Show the SQL schema for a PostgreSQL database&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/til/til/github-actions_postgresq-service-container.md"&gt;Running tests against PostgreSQL in a service container&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/til/til/django_extra-read-only-admin-information.md"&gt;Adding extra read-only information to a Django admin change page&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/til/til/postgresql_read-only-postgresql-user.md"&gt;Granting a PostgreSQL user read-only access to some tables&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="releases-2021-feb-27"&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/flatten-single-item-arrays"&gt;flatten-single-item-arrays&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/flatten-single-item-arrays/releases/tag/0.1"&gt;0.1&lt;/a&gt; - 2021-02-25
&lt;br /&gt;Given a JSON list of objects, flatten any keys which always contain single item arrays to just a single value&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-auth-github"&gt;datasette-auth-github&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-auth-github/releases/tag/0.13.1"&gt;0.13.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-auth-github/releases"&gt;25 releases total&lt;/a&gt;) - 2021-02-25
&lt;br /&gt;Datasette plugin that authenticates users against GitHub&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-block"&gt;datasette-block&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-block/releases/tag/0.1.1"&gt;0.1.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-block/releases"&gt;2 releases total&lt;/a&gt;) - 2021-02-25
&lt;br /&gt;Block all access to specific path prefixes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/github-contents"&gt;github-contents&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/github-contents/releases/tag/0.2"&gt;0.2&lt;/a&gt; - 2021-02-24
&lt;br /&gt;Python class for reading and writing data to a GitHub repository&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/csv-diff"&gt;csv-diff&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/csv-diff/releases/tag/1.1"&gt;1.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/csv-diff/releases"&gt;9 releases total&lt;/a&gt;) - 2021-02-23
&lt;br /&gt;Python CLI tool and library for diffing CSV and JSON files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/sqlite-transform"&gt;sqlite-transform&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/sqlite-transform/releases/tag/0.4"&gt;0.4&lt;/a&gt; - (&lt;a href="https://github.com/simonw/sqlite-transform/releases"&gt;5 releases total&lt;/a&gt;) - 2021-02-22
&lt;br /&gt;Tool for running transformations on columns in a SQLite database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/airtable-export"&gt;airtable-export&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/airtable-export/releases/tag/0.5"&gt;0.5&lt;/a&gt; - (&lt;a href="https://github.com/simonw/airtable-export/releases"&gt;7 releases total&lt;/a&gt;) - 2021-02-22
&lt;br /&gt;Export Airtable data to YAML, JSON or SQLite files on disk&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/crowdsourcing"&gt;crowdsourcing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/django"&gt;django&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/postgresql"&gt;postgresql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/patrick-mckenzie"&gt;patrick-mckenzie&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/covid19"&gt;covid19&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vaccinate-ca"&gt;vaccinate-ca&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/personal-news"&gt;personal-news&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jesse-vincent"&gt;jesse-vincent&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="crowdsourcing"/><category term="django"/><category term="postgresql"/><category term="patrick-mckenzie"/><category term="datasette"/><category term="weeknotes"/><category term="covid19"/><category term="vaccinate-ca"/><category term="personal-news"/><category term="jesse-vincent"/></entry><entry><title>CoronaFaceImpact</title><link href="https://simonwillison.net/2020/Nov/15/coronafaceimpact/#atom-tag" rel="alternate"/><published>2020-11-15T22:41:16+00:00</published><updated>2020-11-15T22:41:16+00:00</updated><id>https://simonwillison.net/2020/Nov/15/coronafaceimpact/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://v-fonts.com/fonts/coronafaceimpact"&gt;CoronaFaceImpact&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Variable fonts can be customized by passing in additional parameters, which is done in CSS using the font-variation-settings property. Here’s a variable font that shows multiple effects of Covid-19 lockdown on a bearded face, created by Friedrich Althausen.
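A rough illustration of the property in action - the axis tags here are hypothetical, since each variable font defines its own:

```css
/* Load a variable font (filename assumed for illustration) */
@font-face {
  font-family: "CoronaFaceImpact";
  src: url("CoronaFaceImpact-VF.woff2") format("woff2-variations");
}

/* Dial the custom axes up or down to change the rendered face.
   "BERD" and "HAIR" are invented axis tags - real fonts document theirs. */
.locked-down-face {
  font-family: "CoronaFaceImpact";
  font-variation-settings: "BERD" 700, "HAIR" 300;
}
```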

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://chat.indieweb.org/2020-11-15/1605479988328700"&gt;Kevin Marks&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/css"&gt;css&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fonts"&gt;fonts&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/typography"&gt;typography&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/covid19"&gt;covid19&lt;/a&gt;&lt;/p&gt;



</summary><category term="css"/><category term="fonts"/><category term="typography"/><category term="covid19"/></entry><entry><title>Quoting Wade Davis</title><link href="https://simonwillison.net/2020/Aug/8/wade-davis/#atom-tag" rel="alternate"/><published>2020-08-08T15:48:28+00:00</published><updated>2020-08-08T15:48:28+00:00</updated><id>https://simonwillison.net/2020/Aug/8/wade-davis/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://www.rollingstone.com/politics/political-commentary/covid-19-end-of-american-era-wade-davis-1038206/"&gt;&lt;p&gt;COVID-19 attacks our physical bodies, but also the cultural foundations of our lives, the toolbox of community and connectivity that is for the human what claws and teeth represent to the tiger.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://www.rollingstone.com/politics/political-commentary/covid-19-end-of-american-era-wade-davis-1038206/"&gt;Wade Davis&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/covid19"&gt;covid19&lt;/a&gt;&lt;/p&gt;



</summary><category term="covid19"/></entry><entry><title>Weeknotes: datasette-auth-passwords, a Datasette logo and a whole lot more</title><link href="https://simonwillison.net/2020/Jul/17/weeknotes-datasette-logo/#atom-tag" rel="alternate"/><published>2020-07-17T03:41:13+00:00</published><updated>2020-07-17T03:41:13+00:00</updated><id>https://simonwillison.net/2020/Jul/17/weeknotes-datasette-logo/#atom-tag</id><summary type="html">
    &lt;p&gt;All sorts of project updates this week.&lt;/p&gt;

&lt;h4&gt;datasette-auth-passwords&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://simonwillison.net/2020/Jun/12/annotated-release-notes/"&gt;Datasette 0.44&lt;/a&gt; added authentication support as a core concept, but left the actual implementation details up to the plugins.&lt;/p&gt;

&lt;p&gt;I released &lt;a href="https://github.com/simonw/datasette-auth-passwords"&gt;datasette-auth-passwords&lt;/a&gt; on Monday. It's an implementation of the most obvious form of authentication (as opposed to &lt;a href="https://github.com/simonw/datasette-auth-github"&gt;GitHub SSO&lt;/a&gt; or &lt;a href="https://github.com/simonw/datasette-auth-tokens"&gt;bearer tokens&lt;/a&gt; or &lt;a href="https://github.com/simonw/datasette-auth-existing-cookies"&gt;existing domain cookies&lt;/a&gt;): usernames and passwords, typed into a form.&lt;/p&gt;

&lt;p&gt;Implementing passwords responsibly is actually pretty tricky, due to the need to effectively hash them. After &lt;a href="https://github.com/simonw/datasette-auth-passwords/issues/1"&gt;some research&lt;/a&gt; I ended up mostly copying how Django does it (never a bad approach): I'm using 260,000 salted pbkdf2_hmac iterations, taking advantage of the Python standard library. I wrote this up &lt;a href="https://github.com/simonw/til/blob/master/python/password-hashing-with-pbkdf2.md"&gt;in a TIL&lt;/a&gt;.&lt;/p&gt;
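&lt;p&gt;A minimal sketch of that approach using only the standard library - the stored string format here is illustrative, not the plugin's exact layout:&lt;/p&gt;

```python
import hashlib
import hmac
import os

ITERATIONS = 260_000  # salted pbkdf2_hmac iterations, matching Django's default at the time

def hash_password(password, salt=None):
    """Hash a password with a random salt; returns an algorithm$iterations$salt$hash string."""
    salt = salt or os.urandom(16).hex()
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt.encode(), ITERATIONS)
    return f"pbkdf2_sha256${ITERATIONS}${salt}${digest.hex()}"

def verify_password(password, stored):
    """Re-derive the hash using the stored salt and iteration count, compare in constant time."""
    _algorithm, iterations, salt, expected = stored.split("$")
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt.encode(), int(iterations))
    return hmac.compare_digest(digest.hex(), expected)
```

&lt;p&gt;Storing the iteration count alongside the hash is what lets you raise it later without invalidating existing accounts.&lt;/p&gt;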

&lt;p&gt;The plugin currently only supports hard-coded password hashes that are fed to Datasette via an environment variable - enough to set up a password-protected Datasette instance with a couple of users, but not really good for anything more complex than that. I have an &lt;a href="https://github.com/simonw/datasette-auth-passwords/issues/6"&gt;open issue&lt;/a&gt; for implementing database-backed password accounts, although again the big challenge is figuring out how to responsibly store those password hashes.&lt;/p&gt;

&lt;p&gt;I've set up a live demo of the password plugin at &lt;a href="https://datasette-auth-passwords-demo.datasette.io/"&gt;datasette-auth-passwords-demo.datasette.io&lt;/a&gt; - you can sign into it to reveal a private database that's only available to authenticated users.&lt;/p&gt;

&lt;h4&gt;Datasette website and logo&lt;/h4&gt;

&lt;p&gt;I'm finally making good progress on a website for Datasette. As part of that I've been learning to use &lt;a href="https://www.figma.com/"&gt;Figma&lt;/a&gt;, which I used to create a Datasette logo.&lt;/p&gt;

&lt;p&gt;&lt;img alt="Datasette" src="https://static.simonwillison.net/static/2020/datasette-logo.svg" style="max-width: 100%; margin: 1.5em 0" /&gt;&lt;/p&gt;

&lt;p&gt;Figma is really neat: it's an entirely web-based vector image editor, aimed at supporting the kind of design work that goes into websites and apps. It has full collaborative editing for teams but it's free for single users. Most importantly it has &lt;a href="https://www.figma.com/blog/with-figmas-new-svg-exports-less-more/"&gt;extremely competent SVG exports&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I've added the logo to &lt;a href="https://datasette.readthedocs.io/en/latest/"&gt;the latest version&lt;/a&gt; of the Datasette docs, and I have an &lt;a href="https://github.com/readthedocs/sphinx_rtd_theme/pull/978"&gt;open pull request&lt;/a&gt; to &lt;code&gt;sphinx_rtd_theme&lt;/code&gt; to add support for setting a custom link target on the logo so I can link back to the rest of the official site, when it goes live.&lt;/p&gt;

&lt;h4&gt;TIL search snippet highlighting&lt;/h4&gt;

&lt;p&gt;My &lt;a href="https://til.simonwillison.net/"&gt;TIL site&lt;/a&gt; has a search engine, but it didn't do snippet highlighting. I reused the pattern I described in &lt;a href="https://24ways.org/2018/fast-autocomplete-search-for-your-website/"&gt;Fast Autocomplete Search for Your Website&lt;/a&gt; - implemented server-side rather than client-side this time - to add that functionality. The implementation &lt;a href="https://github.com/simonw/til/commit/51f5daef61b6bbe6c5be564b8644d2bff6761ab0"&gt;is here&lt;/a&gt; - here's &lt;a href="https://til.simonwillison.net/til/search?q=asgi"&gt;a demo&lt;/a&gt; of it in action.&lt;/p&gt;
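&lt;p&gt;The core idea can be sketched with a toy substring matcher - the real implementation leans on SQLite full-text search, and &lt;code&gt;highlight_snippet&lt;/code&gt; and its parameters are invented here for illustration:&lt;/p&gt;

```python
import re

def highlight_snippet(text, query, context=30, marker="*"):
    """Return a short snippet around the first case-insensitive match,
    with the matched term wrapped in marker characters."""
    match = re.search(re.escape(query), text, re.IGNORECASE)
    if match is None:
        # No match: fall back to the start of the document
        return text[: context * 2]
    start = max(0, match.start() - context)
    end = min(len(text), match.end() + context)
    snippet = (
        text[start : match.start()]
        + marker + match.group(0) + marker
        + text[match.end() : end]
    )
    prefix = "..." if start else ""
    suffix = "..." if end != len(text) else ""
    return prefix + snippet + suffix
```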

&lt;h4&gt;SRCCON schedule&lt;/h4&gt;

&lt;p&gt;I'm attending (virtually) the &lt;a href="https://2020.srccon.org/"&gt;SRCCON 2020&lt;/a&gt; journalism conference this week, and Datasette is part of the &lt;a href="https://2020.srccon.org/projects-products-research/#datasette"&gt;Projects, Products, &amp;amp; Research&lt;/a&gt; track.&lt;/p&gt;

&lt;p&gt;As a demo, I set up a Datasette powered copy of the conference schedule at &lt;a href="https://srccon-2020.datasette.io/"&gt;srccon-2020.datasette.io&lt;/a&gt; - it's running the &lt;a href="https://github.com/simonw/datasette-ics"&gt;datasette-ics&lt;/a&gt; plugin which means it can provide a URL that can be subscribed to in Google or Apple Calendar.&lt;/p&gt;

&lt;p&gt;The site runs out of the &lt;a href="https://github.com/simonw/srccon-2020-datasette"&gt;simonw/srccon-2020-datasette&lt;/a&gt; repository, which uses a GitHub Action to download the schedule JSON, modify it a little (mainly to turn the start and end dates into ISO datestamps), save it to a SQLite database with &lt;a href="https://github.com/simonw/sqlite-utils"&gt;sqlite-utils&lt;/a&gt; and publish it to &lt;a href="https://vercel.com/"&gt;Vercel&lt;/a&gt;.&lt;/p&gt;
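&lt;p&gt;The transformation step might look something like this with just the standard library - the real pipeline uses &lt;code&gt;sqlite-utils&lt;/code&gt;, and the input keys here are hypothetical:&lt;/p&gt;

```python
import sqlite3
from datetime import datetime, timezone

def load_sessions(db_path, sessions):
    """Normalize schedule timestamps to ISO datestamps and load them into SQLite."""
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS sessions "
        "(id TEXT PRIMARY KEY, title TEXT, start_time TEXT, end_time TEXT)"
    )
    for session in sessions:
        # Hypothetical input keys - the real SRCCON schedule JSON differs
        start = datetime.fromtimestamp(session["start_epoch"], tz=timezone.utc).isoformat()
        end = datetime.fromtimestamp(session["end_epoch"], tz=timezone.utc).isoformat()
        db.execute(
            "INSERT OR REPLACE INTO sessions VALUES (?, ?, ?, ?)",
            (session["id"], session["title"], start, end),
        )
    db.commit()
    return db
```

&lt;p&gt;ISO datestamps sort correctly as plain text, which is what makes them friendly to SQLite and to calendar-feed plugins downstream.&lt;/p&gt;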

&lt;h4&gt;Covid 19 population data&lt;/h4&gt;

&lt;p&gt;My &lt;a href="https://simonwillison.net/2020/Mar/11/covid-19/"&gt;Covid-19 tracker&lt;/a&gt; publishes updated numbers of cases and deaths from the New York Times, the LA Times and Johns Hopkins university on an hourly basis.&lt;/p&gt;

&lt;p&gt;One thing that was missing was county population data. US counties are identified in the data by their &lt;a href="https://en.wikipedia.org/wiki/FIPS_county_code"&gt;FIPS codes&lt;/a&gt;, which offers a mechanism for joining against population estimates pulled from the US Census.&lt;/p&gt;

&lt;p&gt;Thanks to &lt;a href="https://github.com/nytimes/covid-19-data/pull/155"&gt;Aaron King&lt;/a&gt; I've now incorporated that data into the site, as a new &lt;a href="https://covid-19.datasettes.com/covid/us_census_county_populations_2019"&gt;us_census_county_populations_2019&lt;/a&gt; table.&lt;/p&gt;

&lt;p&gt;I used that data to define a SQL view - &lt;a href="https://covid-19.datasettes.com/covid/latest_ny_times_counties_with_populations"&gt;latest_ny_times_counties_with_populations&lt;/a&gt; - which shows the latest New York Times county data with new derived &lt;code&gt;cases_per_million&lt;/code&gt; and &lt;code&gt;deaths_per_million&lt;/code&gt; columns.&lt;/p&gt;
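&lt;p&gt;A view like that can be defined along these lines - table and column names are simplified from the real schema, and the sample rows are invented:&lt;/p&gt;

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE counties (fips TEXT PRIMARY KEY, cases INTEGER, deaths INTEGER);
CREATE TABLE populations (fips TEXT PRIMARY KEY, population INTEGER);

-- Derive per-million rates by joining case counts against census populations
CREATE VIEW counties_with_rates AS
SELECT
  counties.fips,
  cases,
  deaths,
  population,
  1000000.0 * cases / population AS cases_per_million,
  1000000.0 * deaths / population AS deaths_per_million
FROM counties JOIN populations ON counties.fips = populations.fips;
""")
db.execute("INSERT INTO counties VALUES ('06075', 5000, 50)")
db.execute("INSERT INTO populations VALUES ('06075', 881549)")
row = db.execute("SELECT cases_per_million FROM counties_with_rates").fetchone()
```

&lt;p&gt;Because it's a view rather than a table, the derived columns stay correct as the hourly scrapes update the underlying case counts.&lt;/p&gt;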

&lt;h4&gt;Tweaks to this blog&lt;/h4&gt;

&lt;p&gt;For many years this blog's main content has sat on the left of the page - which looks increasingly strange as screens get wider and wider. As of &lt;a href="https://github.com/simonw/simonwillisonblog/commit/3d44c67a2cfee128d0168cb2e6a650f45211446a"&gt;this commit&lt;/a&gt; the main layout is centered, which I think looks much nicer.&lt;/p&gt;

&lt;p&gt;I also ran &lt;a href="https://github.com/simonw/simonwillisonblog/commit/b085679933985c44b8171b556d141cdef8f232d2"&gt;a data migration&lt;/a&gt; to fix some old internal links.&lt;/p&gt;

&lt;h4&gt;Miscellaneous&lt;/h4&gt;

&lt;p&gt;I gave a (virtual) talk at &lt;a href="https://www.djangolondon.com/"&gt;Django London&lt;/a&gt; on Monday about Datasette. I've taken to sharing a Google Doc for this kind of talk, which I prepare before the talk with notes and then update afterwards to reflect additional material from the Q&amp;amp;A. Here's &lt;a href="https://docs.google.com/document/d/17ZDlxHOqDGugKqn_Nh_Q7JER5vjKin1D3d17oPhrs9o/edit"&gt;the document&lt;/a&gt; from Monday's talk.&lt;/p&gt;

&lt;p&gt;San Francisco Public Works maintain a page of &lt;a href="https://sfpublicworks.org/tree-removal-notifications"&gt;tree removal notifications&lt;/a&gt; showing trees that are scheduled for removal. I &lt;a href="https://simonwillison.net/2019/Mar/13/tree-history/"&gt;like those trees&lt;/a&gt;. They don't provide an archive of notifications from that page, so I've set up a &lt;a href="https://simonwillison.net/tags/gitscraping/"&gt;git scraping&lt;/a&gt; &lt;a href="https://github.com/simonw/sfpublicworks-tree-removal-notifications"&gt;GitHub repository&lt;/a&gt; that scrapes the page daily and maintains a history of its contents in the commit log.&lt;/p&gt;

&lt;p&gt;I updated &lt;a href="https://github.com/simonw/datasette-publish-fly/releases/tag/1.0"&gt;datasette-publish-fly&lt;/a&gt; for compatibility with Datasette 0.44 and Python 3.6.&lt;/p&gt;

&lt;p&gt;I made a few tweaks to &lt;a href="https://simonwillison.net/2020/Jul/10/self-updating-profile-readme/"&gt;my GitHub profile README&lt;/a&gt;, which is now Apache 2 licensed so people know they can adapt it for their own purposes.&lt;/p&gt;

&lt;p&gt;I released &lt;a href="https://github.com/dogsheep/github-to-sqlite/releases/tag/2.3"&gt;github-to-sqlite 2.3&lt;/a&gt; with a new option for fetching information for just specific repositories.&lt;/p&gt;

&lt;p&gt;The Develomentor podcast published &lt;a href="https://develomentor.com/2020/07/16/simon-willison-data-journalism-the-importance-of-side-projects/"&gt;an interview with me&lt;/a&gt; about my career, and how it's been mostly defined by side-projects.&lt;/p&gt;

&lt;h4&gt;TIL this week&lt;/h4&gt;

&lt;ul&gt;&lt;li&gt;&lt;a href="https://github.com/simonw/til/blob/master/heroku/pg-pull.md"&gt;Using heroku pg:pull to restore a backup to a macOS laptop&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://github.com/simonw/til/blob/master/python/password-hashing-with-pbkdf2.md"&gt;Password hashing in Python with pbkdf2&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/design"&gt;design&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/passwords"&gt;passwords&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/covid19"&gt;covid19&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="design"/><category term="passwords"/><category term="projects"/><category term="datasette"/><category term="weeknotes"/><category term="covid19"/><category term="git-scraping"/></entry><entry><title>Weeknotes: SBA Covid-19 PPP loans, Datasette talks, Datasette plugin upgrades</title><link href="https://simonwillison.net/2020/Jul/9/sba-covid-19-ppp-loans/#atom-tag" rel="alternate"/><published>2020-07-09T22:44:49+00:00</published><updated>2020-07-09T22:44:49+00:00</updated><id>https://simonwillison.net/2020/Jul/9/sba-covid-19-ppp-loans/#atom-tag</id><summary type="html">
    &lt;p&gt;This week I've mainly been exploring Small Business Administration Covid-19 loans data, pitching some talks and upgrading some plugins for compatibility with Datasette 0.44+.&lt;/p&gt;

&lt;h4&gt;SBA PPP Covid-19 loan data&lt;/h4&gt;

&lt;p&gt;On Monday the Small Business Administration and the Treasury Department &lt;a href="https://home.treasury.gov/news/press-releases/sm1052"&gt;released detailed loan-level data&lt;/a&gt; for loans made under the Paycheck Protection Program as part of their Covid-19 response.&lt;/p&gt;

&lt;p&gt;They released the data as &lt;a href="https://sba.app.box.com/s/wz72fqag1nd99kj3t9xlq49deoop6gzf"&gt;a zip file full of CSVs&lt;/a&gt; on their Box account (the first time I've seen Box used for this kind of government data release).&lt;/p&gt;

&lt;p&gt;The most interesting file in there was &lt;code&gt;foia_150k_plus.csv&lt;/code&gt; - a file containing 661,218 loans over $150,000. So I loaded it into Datasette and published it at &lt;a href="https://sba-loans-covid-19.datasettes.com/loans_150k_plus/foia_150k_plus"&gt;https://sba-loans-covid-19.datasettes.com/loans_150k_plus/foia_150k_plus&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I made one modification to the data: on the &lt;a href="https://twitter.com/zeeg/status/1280280556102512640"&gt;suggestion of David Cramer&lt;/a&gt; I imported a list of NAICS code descriptions &lt;a href="https://www.census.gov/eos/www/naics/downloadables/downloadables.html"&gt;from the US Census&lt;/a&gt; and set up the &lt;code&gt;NAICSCode&lt;/code&gt; column as a foreign key against that table.&lt;/p&gt;
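&lt;p&gt;The lookup-table pattern is easy to sketch with just Python's standard library - the table and column names here match the ones above, but the rows are invented for illustration:&lt;/p&gt;

```python
import sqlite3

# Sketch of the lookup-table pattern: a naics_2017 table of code
# descriptions, with the loans table's NAICSCode column pointing at it.
db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE naics_2017 (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE foia_150k_plus (
        BusinessName TEXT,
        NAICSCode INTEGER REFERENCES naics_2017(id)
    );
    """
)
db.execute("INSERT INTO naics_2017 VALUES (621210, 'Offices of Dentists')")
db.execute("INSERT INTO foia_150k_plus VALUES ('Example Dental LLC', 621210)")

# The foreign key lets queries join loans to human-readable descriptions:
row = db.execute(
    """
    SELECT f.BusinessName, n.name
    FROM foia_150k_plus f JOIN naics_2017 n ON f.NAICSCode = n.id
    """
).fetchone()
print(row)  # ('Example Dental LLC', 'Offices of Dentists')
```

&lt;p&gt;In the real project &lt;code&gt;csvs-to-sqlite&lt;/code&gt; handled the import; this is just the shape of the resulting schema.&lt;/p&gt;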

&lt;p&gt;Here's &lt;a href="https://sba-loans-covid-19.datasettes.com/loans_150k_plus?sql=with+counts+as+%28select+NAICSCode%2C+count%28*%29+as+num_loan_recipients+from+foia_150k_plus+group+by+NAICSCode+order+by+num_loan_recipients+desc%29%0D%0Aselect+counts.NAICSCode%2C+counts.num_loan_recipients%2C+naics_2017.name%2C+%27https%3A%2F%2Fsba-loans-covid-19.datasettes.com%2Floans_150k_plus%2Ffoia_150k_plus%3F_facet%3DCity%26_facet%3DState%26_facet%3DRaceEthnicity%26_facet%3DBusinessType%26_facet%3DGender%26_facet%3DVeteran%26NAICSCode%3D%27+%7C%7C+NAICSCode+as+view_them+from+counts+join+naics_2017+on+counts.NAICSCode+%3D+naics_2017.id"&gt;a custom query&lt;/a&gt; showing the NAICS codes with the most loan claims &amp;gt; $150k - &lt;a href="https://sba-loans-covid-19.datasettes.com/loans_150k_plus/foia_150k_plus?_facet=City&amp;amp;_facet=State&amp;amp;_facet=RaceEthnicity&amp;amp;_facet=BusinessType&amp;amp;_facet=Gender&amp;amp;_facet=Veteran&amp;amp;NAICSCode=621210"&gt;Offices of Dentists&lt;/a&gt; come in 8th place with 10,627 loans!&lt;/p&gt;

&lt;p&gt;My &lt;a href="https://twitter.com/simonw/status/1280283053726691329"&gt;Twitter thread&lt;/a&gt; has more commentary on things I found exploring the data, and my &lt;a href="https://github.com/simonw/sba-loans-covid-19-datasette"&gt;sba-loans-covid-19-datasette GitHub repo&lt;/a&gt; describes the exact steps I went through to create the Datasette instance (using &lt;a href="https://github.com/simonw/csvs-to-sqlite"&gt;csvs-to-sqlite&lt;/a&gt; and &lt;a href="https://github.com/simonw/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;).&lt;/p&gt;

&lt;h4&gt;Pitching some talks&lt;/h4&gt;

&lt;p&gt;I haven't done any public speaking in a while, and the pandemic means I'm not going to be giving any in-person talks for the foreseeable future... so I spent some time pitching talks to remote events.&lt;/p&gt;

&lt;p&gt;I'll be speaking &lt;a href="https://www.meetup.com/djangolondon/events/271800940/"&gt;at Django London&lt;/a&gt; on July 14th and I have a few other submissions in the pipeline.&lt;/p&gt;

&lt;p&gt;I'm also attending (virtually) the &lt;a href="https://2020.srccon.org/"&gt;SRCCON journalism conference&lt;/a&gt; next week. They asked me to put together a short video introduction to Datasette, which I've embedded below. I'll be hanging out and talking to anyone who's interested in learning more about the project, or who can help me figure out what direction to take it next.&lt;/p&gt;

&lt;iframe src="https://player.vimeo.com/video/436903714" style="width: 100%" width="640" height="400" frameborder="0" allow="autoplay; fullscreen" allowfullscreen="allowfullscreen"&gt;&amp;#160;&lt;/iframe&gt;
&lt;p&gt;&lt;a href="https://vimeo.com/436903714"&gt;SRCCON 2020: Datasette&lt;/a&gt; from &lt;a href="https://vimeo.com/user23009240"&gt;OpenNews Source&lt;/a&gt; on &lt;a href="https://vimeo.com"&gt;Vimeo&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;Upgrading plugins&lt;/h4&gt;

&lt;p&gt;Datasette 0.44 broke some of my existing plugins due to a change in how it handles ASGI lifespan events. I've upgraded the following this week:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;&lt;a href="https://github.com/simonw/datasette-configure-fts/releases/tag/1.0"&gt;datasette-configure-fts 1.0&lt;/a&gt; - a plugin for configuring which columns in a table are enabled for full-text search.&lt;/li&gt;&lt;li&gt;&lt;a href="https://github.com/simonw/datasette-edit-tables/releases/tag/0.2a"&gt;datasette-edit-tables 0.2a&lt;/a&gt; - tools for renaming tables and adding columns. This isn't particularly useful yet but I'm excited about its potential.&lt;/li&gt;&lt;li&gt;&lt;a href="https://github.com/simonw/datasette-media/releases/tag/0.3"&gt;datasette-media 0.3&lt;/a&gt; - a plugin for serving media from disk based on paths served out of the SQLite database.&lt;/li&gt;&lt;li&gt;&lt;a href="https://github.com/simonw/datasette-search-all/releases/tag/0.3"&gt;datasette-search-all 0.3&lt;/a&gt; - a plugin providing a mechanism for searching all FTS-enabled tables at once, &lt;a href="https://simonwillison.net/2020/Mar/9/datasette-search-all/"&gt;discussed here previously&lt;/a&gt;.&lt;/li&gt;&lt;/ul&gt;

&lt;h4&gt;sqlite-utils 2.11&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://sqlite-utils.readthedocs.io/en/stable/changelog.html#v2-11"&gt;sqlite-utils 2.11&lt;/a&gt; is the first release of &lt;code&gt;sqlite-utils&lt;/code&gt; that was entirely written by someone else! Thomas Sibley added a new &lt;code&gt;--truncate&lt;/code&gt; option for emptying a table (safely within a transaction) before populating it and made an improvement to how transactions work generally.&lt;/p&gt;

&lt;p&gt;Thomas inspired me to &lt;a href="https://github.com/simonw/sqlite-utils/issues/121"&gt;start thinking more carefully&lt;/a&gt; about how transactions should work with the library.&lt;/p&gt;
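&lt;p&gt;The behaviour of &lt;code&gt;--truncate&lt;/code&gt; is roughly this (a standard-library sketch of the idea, not the actual &lt;code&gt;sqlite-utils&lt;/code&gt; internals): empty the table and repopulate it inside a single transaction, so a failed import rolls back rather than leaving the table empty.&lt;/p&gt;

```python
import sqlite3

def replace_rows(db, table, rows):
    # Empty the table and repopulate it inside one transaction, so
    # readers never observe a half-empty table and a failed insert
    # rolls the whole operation back.
    with db:  # commits on success, rolls back on any exception
        db.execute(f"DELETE FROM {table}")
        db.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE counts (name TEXT, value INTEGER)")
replace_rows(db, "counts", [("a", 1), ("b", 2)])
replace_rows(db, "counts", [("c", 3)])  # old rows are gone
print(db.execute("SELECT * FROM counts").fetchall())  # [('c', 3)]
```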
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/covid19"&gt;covid19&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="projects"/><category term="datasette"/><category term="weeknotes"/><category term="covid19"/><category term="sqlite-utils"/></entry><entry><title>sba-loans-covid-19-datasette</title><link href="https://simonwillison.net/2020/Jul/7/sba-loans-covid-19-datasette/#atom-tag" rel="alternate"/><published>2020-07-07T02:42:40+00:00</published><updated>2020-07-07T02:42:40+00:00</updated><id>https://simonwillison.net/2020/Jul/7/sba-loans-covid-19-datasette/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/sba-loans-covid-19-datasette"&gt;sba-loans-covid-19-datasette&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
The Treasury Department released a bunch of data on the Covid-19 SBA Paycheck Protection Program Loan recipients today—I’ve loaded the most interesting data (the $150,000+ loans) into a Datasette instance.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/simonw/status/1280283053726691329"&gt;@simonw&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/covid19"&gt;covid19&lt;/a&gt;&lt;/p&gt;



</summary><category term="data-journalism"/><category term="projects"/><category term="datasette"/><category term="covid19"/></entry><entry><title>Quoting Tim O'Reilly</title><link href="https://simonwillison.net/2020/Jul/4/tim-oreilly/#atom-tag" rel="alternate"/><published>2020-07-04T16:06:41+00:00</published><updated>2020-07-04T16:06:41+00:00</updated><id>https://simonwillison.net/2020/Jul/4/tim-oreilly/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://www.oreilly.com/tim/21stcentury/"&gt;&lt;p&gt;The future will not be like the past. The comfortable Victorian and Georgian world complete with grand country houses, a globe-spanning British empire, and lords and commoners each knowing their place, was swept away by the events that began in the summer of 1914 (and that with Britain on the “winning” side of both world wars.) So too, our comfortable “American century” of conspicuous consumer consumption, global tourism, and ever-increasing stock and home prices may be gone forever.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://www.oreilly.com/tim/21stcentury/"&gt;Tim O&amp;#x27;Reilly&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/tim-oreilly"&gt;tim-oreilly&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/covid19"&gt;covid19&lt;/a&gt;&lt;/p&gt;



</summary><category term="tim-oreilly"/><category term="covid19"/></entry><entry><title>Weeknotes: Archiving coronavirus.data.gov.uk, custom pages and directory configuration in Datasette, photos-to-sqlite</title><link href="https://simonwillison.net/2020/Apr/29/weeknotes/#atom-tag" rel="alternate"/><published>2020-04-29T19:41:11+00:00</published><updated>2020-04-29T19:41:11+00:00</updated><id>https://simonwillison.net/2020/Apr/29/weeknotes/#atom-tag</id><summary type="html">
    &lt;p&gt;I mainly made progress on three projects this week: Datasette, photos-to-sqlite and a cleaner way of archiving data to a git repository.&lt;/p&gt;

&lt;h3&gt;Archiving coronavirus.data.gov.uk&lt;/h3&gt;

&lt;p&gt;The UK government have a new portal website sharing detailed Coronavirus data for regions around the country, at &lt;a href="https://coronavirus.data.gov.uk/"&gt;coronavirus.data.gov.uk&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As with everything else built in 2020, it's a big single-page JavaScript app. Matthew Somerville &lt;a href="http://dracos.co.uk/wrote/coronavirus-dashboard/"&gt;investigated&lt;/a&gt; what it would take to build a much lighter (and faster loading) site displaying the same information by moving much of the rendering to the server.&lt;/p&gt;

&lt;p&gt;One of the best things about the SPA craze is that it strongly encourages structured data to be published as JSON files. Matthew's article inspired me to take a look, and sure enough the government figures are available in an extremely comprehensive (and 3.3MB in size) JSON file, available from &lt;a href="https://c19downloads.azureedge.net/downloads/data/data_latest.json"&gt;https://c19downloads.azureedge.net/downloads/data/data_latest.json&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Any time I see a file like this my first questions are: how often does it change, and what kind of changes are being made to it?&lt;/p&gt;

&lt;p&gt;I've written about scraping to a git repository (see my new &lt;a href="https://simonwillison.net/tags/gitscraping/"&gt;gitscraping&lt;/a&gt; tag) a bunch in the past:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;&lt;a href="https://simonwillison.net/2017/Sep/10/scraping-irma/"&gt;Scraping hurricane Irma&lt;/a&gt; - September 2017&lt;/li&gt;&lt;li&gt;&lt;a href="https://simonwillison.net/2017/Oct/10/fires-in-the-north-bay/"&gt;Changelogs to help understand the fires in the North Bay&lt;/a&gt; - October 2017&lt;/li&gt;&lt;li&gt;&lt;a href="https://simonwillison.net/2019/Mar/13/tree-history/"&gt;Generating a commit log for San Francisco’s official list of trees&lt;/a&gt; - March 2019&lt;/li&gt;&lt;li&gt;&lt;a href="https://simonwillison.net/2019/Oct/10/pge-outages/"&gt;Tracking PG&amp;amp;E outages by scraping to a git repo&lt;/a&gt; - October 2019&lt;/li&gt;&lt;li&gt;&lt;a href="https://simonwillison.net/2020/Jan/21/github-actions-cloud-run/"&gt;Deploying a data API using GitHub Actions and Cloud Run&lt;/a&gt; - January 2020&lt;/li&gt;&lt;/ul&gt;

&lt;p&gt;Now that I've figured out a really clean way to &lt;a href="https://github.com/simonw/til/blob/master/github-actions/commit-if-file-changed.md"&gt;Commit a file if it changed&lt;/a&gt; in a GitHub Action, knocking out new versions of this pattern is really quick.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/simonw/coronavirus-data-gov-archive"&gt;simonw/coronavirus-data-gov-archive&lt;/a&gt; is my new repo that does exactly that: it periodically fetches the latest versions of the JSON data files powering that site and commits them if they have changed. The aim is to build a &lt;a href="https://github.com/simonw/coronavirus-data-gov-archive/commits/master/data_latest.json"&gt;commit history&lt;/a&gt; of changes made to the underlying data.&lt;/p&gt;

&lt;p&gt;The first implementation was extremely simple - here's the &lt;a href="https://github.com/simonw/coronavirus-data-gov-archive/blob/c83d69e95ec6400bf77d7b0d474e868baa78841e/.github/workflows/scheduled.yml"&gt;entire action&lt;/a&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;name: Fetch latest data

on:
  push:
  repository_dispatch:
  schedule:
    - cron: '25 * * * *'

jobs:
  scheduled:
    runs-on: ubuntu-latest
    steps:
      - name: Check out this repo
        uses: actions/checkout@v2
      - name: Fetch latest data
        run: |-
          curl https://c19downloads.azureedge.net/downloads/data/data_latest.json | jq . &amp;gt; data_latest.json
          curl https://c19pub.azureedge.net/utlas.geojson | gunzip | jq . &amp;gt; utlas.geojson
          curl https://c19pub.azureedge.net/countries.geojson | gunzip | jq . &amp;gt; countries.geojson
          curl https://c19pub.azureedge.net/regions.geojson | gunzip | jq . &amp;gt; regions.geojson
      - name: Commit and push if it changed
        run: |-
          git config user.name "Automated"
          git config user.email "actions@users.noreply.github.com"
          git add -A
          timestamp=$(date -u)
          git commit -m "Latest data: ${timestamp}" || exit 0
          git push&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;It uses a combination of &lt;code&gt;curl&lt;/code&gt; and &lt;code&gt;jq&lt;/code&gt; (both available &lt;a href="https://github.com/actions/virtual-environments/blob/master/images/linux/Ubuntu1804-README.md"&gt;in the default worker environment&lt;/a&gt;) to pull down the data and pretty-print it (better for readable diffs), then commits the result.&lt;/p&gt;
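&lt;p&gt;The pretty-printing trick matters more than it looks: one object per line means git diffs show exactly which values changed. The same &lt;code&gt;jq .&lt;/code&gt; effect in Python is a one-liner (the sample JSON here is invented):&lt;/p&gt;

```python
import json

# The same trick as "| jq ." above: pretty-print fetched JSON so that
# git diffs show one changed value per line instead of one giant line.
raw = '{"updated": "2020-04-29", "areas": [{"name": "London", "cases": 12}]}'
pretty = json.dumps(json.loads(raw), indent=2, sort_keys=True)
print(pretty)  # sort_keys also keeps key ordering stable between fetches
```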

&lt;p&gt;Matthew Somerville &lt;a href="https://twitter.com/dracos/status/1255221799085846532"&gt;pointed out&lt;/a&gt; that inefficient polling sets a bad precedent. Here I'm hitting &lt;code&gt;azureedge.net&lt;/code&gt;, the Azure CDN, so that didn't particularly worry me - but since I want this pattern to be used widely it's good to provide a best-practice example.&lt;/p&gt;

&lt;p&gt;Figuring out the best way to make &lt;a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Conditional_requests"&gt;conditional get requests&lt;/a&gt; in a GitHub Action led me down &lt;a href="https://github.com/simonw/coronavirus-data-gov-archive/issues/1"&gt;something of a rabbit hole&lt;/a&gt;. I wanted to use &lt;a href="https://daniel.haxx.se/blog/2019/12/06/curl-speaks-etag/"&gt;curl's new ETag support&lt;/a&gt; but I ran into &lt;a href="https://github.com/curl/curl/issues/5309"&gt;a curl bug&lt;/a&gt;, so I ended up rolling a simple Python CLI tool called &lt;a href="https://github.com/simonw/conditional-get"&gt;conditional-get&lt;/a&gt; to solve my problem. In the time it took me to release that tool (just a few hours) a &lt;a href="https://github.com/curl/curl/issues/5309#issuecomment-621265179"&gt;new curl release&lt;/a&gt; came out with a fix for that bug!&lt;/p&gt;
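&lt;p&gt;The core of any conditional-get implementation is tiny. Here's a sketch with the HTTP layer stubbed out - send the previously saved ETag as &lt;code&gt;If-None-Match&lt;/code&gt;, and only keep the body when the server replies with fresh content. The &lt;code&gt;fetch&lt;/code&gt; callable and &lt;code&gt;fake_server&lt;/code&gt; are stand-ins for illustration, not the tool's real API:&lt;/p&gt;

```python
def conditional_fetch(fetch, url, saved_etag):
    # Ask the server for the resource, but tell it which version we
    # already have via If-None-Match; a 304 means "unchanged".
    headers = {}
    if saved_etag:
        headers["If-None-Match"] = saved_etag
    status, etag, body = fetch(url, headers)
    if status == 304:
        return saved_etag, None  # unchanged: nothing to write or commit
    return etag, body

def fake_server(url, headers):
    # Pretends the resource has ETag "abc123" and has not changed.
    if headers.get("If-None-Match") == '"abc123"':
        return 304, '"abc123"', None
    return 200, '"abc123"', '{"data": 1}'

etag, body = conditional_fetch(fake_server, "https://example.com/data.json", None)
print(body)  # first fetch: the full body comes back
etag, body = conditional_fetch(fake_server, "https://example.com/data.json", etag)
print(body)  # second fetch: None - the 304 path, nothing downloaded
```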

&lt;p&gt;Here's &lt;a href="https://github.com/simonw/coronavirus-data-gov-archive/blob/a95d7661b236a9ee9a26a441dd948eb00308f919/.github/workflows/scheduled.yml"&gt;the workflow&lt;/a&gt; using my &lt;code&gt;conditional-get&lt;/code&gt; tool. See &lt;a href="https://github.com/simonw/coronavirus-data-gov-archive/issues/1"&gt;the issue thread&lt;/a&gt; for all of the other potential solutions, including a really neat &lt;a href="https://github.com/hubgit/curl-etag"&gt;Action shell-script solution&lt;/a&gt; by Alf Eaton.&lt;/p&gt;

&lt;p&gt;To my absolute delight, the project has already been forked once by Daniel Langer to &lt;a href="https://github.com/dlanger/coronavirus-hc-infobase-archive"&gt;capture Canadian Covid-19 cases&lt;/a&gt;!&lt;/p&gt;

&lt;h3 id="new-datasette-features"&gt;New Datasette features&lt;/h3&gt;

&lt;p&gt;I pushed two new features to &lt;a href="https://github.com/simonw/datasette"&gt;Datasette&lt;/a&gt; master, ready for release in 0.41.&lt;/p&gt;

&lt;h4&gt;Configuration directory mode&lt;/h4&gt;

&lt;p&gt;This is an idea I had while building &lt;a href="https://github.com/simonw/datasette-publish-now"&gt;datasette-publish-now&lt;/a&gt;. Datasette instances can be run with custom metadata, custom plugins and custom templates. I'm increasingly finding myself working on projects that run using something like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ datasette data1.db data2.db data3.db \
    --metadata=metadata.json \
    --template-dir=templates \
    --plugins-dir=plugins&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Directory configuration mode introduces the idea that Datasette can configure itself based on a directory layout. The above example can instead be handled by creating the following layout:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;my-project/data1.db
my-project/data2.db
my-project/data3.db
my-project/metadata.json
my-project/templates/index.html
my-project/plugins/custom_plugin.py&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Then run Datasette directly targeting that directory:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ datasette my-project/&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;See &lt;a href="https://github.com/simonw/datasette/issues/731"&gt;issue #731&lt;/a&gt; for more details. Directory configuration mode &lt;a href="https://datasette.readthedocs.io/en/latest/config.html#configuration-directory-mode"&gt;is documented here&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;Define custom pages using templates/pages&lt;/h4&gt;

&lt;p&gt;In &lt;a href="https://simonwillison.net/2019/Nov/25/niche-museums/"&gt;niche-museums.com, powered by Datasette&lt;/a&gt; I described how I built the &lt;a href="https://www.niche-museums.com/"&gt;www.niche-museums.com&lt;/a&gt; website as a heavily customized Datasette instance.&lt;/p&gt;

&lt;p&gt;That site has &lt;a href="https://www.niche-museums.com/about"&gt;/about&lt;/a&gt; and &lt;a href="https://www.niche-museums.com/map"&gt;/map&lt;/a&gt; pages which are served by custom templates - but I had to do some gnarly hacks with empty &lt;code&gt;about.db&lt;/code&gt; and &lt;code&gt;map.db&lt;/code&gt; files to get them to work.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/simonw/datasette/issues/648"&gt;Issue #648&lt;/a&gt; introduces a new mechanism for creating this kind of page: create a &lt;code&gt;templates/pages/map.html&lt;/code&gt; template file and custom 404 handling code will ensure that any hits to &lt;code&gt;/map&lt;/code&gt; serve the rendered contents of that template.&lt;/p&gt;
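&lt;p&gt;The routing idea boils down to a path-to-template mapping: a request that would otherwise 404 falls through to checking for a matching file under &lt;code&gt;templates/pages/&lt;/code&gt;. The helper below is illustrative, not Datasette's actual implementation:&lt;/p&gt;

```python
import os.path

def custom_page_template(path, pages_dir="templates/pages"):
    # Map a URL path like /map to templates/pages/map.html; the bare
    # root path falls back to an index template.
    slug = path.strip("/")
    if not slug:
        slug = "index"
    return os.path.join(pages_dir, slug + ".html")

print(custom_page_template("/map"))    # templates/pages/map.html
print(custom_page_template("/about"))  # templates/pages/about.html
```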

&lt;p&gt;This could work really well with the &lt;a href="https://github.com/simonw/datasette-template-sql"&gt;datasette-template-sql&lt;/a&gt; plugin, which allows templates to execute arbitrary SQL queries (à la PHP or ColdFusion).&lt;/p&gt;

&lt;p&gt;Here's the new &lt;a href="https://datasette.readthedocs.io/en/latest/custom_templates.html#custom-pages"&gt;documentation on custom pages&lt;/a&gt;, including details of how to use the new &lt;code&gt;custom_status()&lt;/code&gt;, &lt;code&gt;custom_header()&lt;/code&gt; and &lt;code&gt;custom_redirect()&lt;/code&gt; template functions to go beyond just returning HTML.&lt;/p&gt;

&lt;h3&gt;photos-to-sqlite&lt;/h3&gt;

&lt;p&gt;My &lt;a href="https://dogsheep.github.io/"&gt;Dogsheep&lt;/a&gt; personal analytics project brings my &lt;a href="https://github.com/dogsheep/twitter-to-sqlite"&gt;tweets&lt;/a&gt;, &lt;a href="https://github.com/dogsheep/github-to-sqlite"&gt;GitHub activity&lt;/a&gt;, &lt;a href="https://github.com/dogsheep/swarm-to-sqlite"&gt;Swarm checkins&lt;/a&gt; and more together in one place. But the big missing feature is my photos.&lt;/p&gt;

&lt;p&gt;As of yesterday, I have 39,000 photos from Apple Photos uploaded to an S3 bucket using my new &lt;a href="https://github.com/dogsheep/photos-to-sqlite/"&gt;photos-to-sqlite&lt;/a&gt; tool. I can run the following SQL query and get back ten random photos!&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;select
  json_object(
    'img_src',
    'https://photos.simonwillison.net/i/' || 
    sha256 || '.' || ext || '?w=400'
  ),
  filepath,
  ext
from
  photos
where
  ext in ('jpeg', 'jpg', 'heic')
order by
  random()
limit
  10&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;code&gt;photos.simonwillison.net&lt;/code&gt; is running a modified version of my &lt;a href="https://github.com/simonw/heic-to-jpeg"&gt;heic-to-jpeg&lt;/a&gt; image converting and resizing proxy, which I'll release at some point soon.&lt;/p&gt;

&lt;p&gt;There's still plenty of work to do - I still need to import EXIF data (including locations) into SQLite, and I plan to use &lt;a href="https://github.com/RhetTbull/osxphotos"&gt;osxphotos&lt;/a&gt; to export additional metadata from my Apple Photos library. But this week it went from a pure research project to something I can actually start using, which is exciting.&lt;/p&gt;

&lt;h3&gt;TIL this week&lt;/h3&gt;

&lt;ul&gt;&lt;li&gt;&lt;a href="https://github.com/simonw/til/blob/master/macos/fixing-compinit-insecure-directories.md"&gt;Fixing "compinit: insecure directories" error&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://github.com/simonw/til/blob/master/tailscale/lock-down-sshd.md"&gt;Restricting SSH connections to devices within a Tailscale network&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://github.com/simonw/til/blob/master/python/generate-nested-json-summary.md"&gt;Generated a summary of nested JSON data&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://github.com/simonw/til/blob/master/pytest/session-scoped-tmp.md"&gt;Session-scoped temporary directories in pytest&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://github.com/simonw/til/blob/master/pytest/mock-httpx.md"&gt;How to mock httpx using pytest-mock&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;

&lt;p&gt;Generated using &lt;a href="https://til.simonwillison.net/til?sql=select+json_object(%27pre%27%2C+group_concat(%27*+[%27+||+title+||+%27](%27+||+url+||+%27)%27%2C+%27%0D%0A%27))+from+til+where+%22created_utc%22+%3E%3D+%3Ap0+order+by+updated_utc+desc+limit+101&amp;amp;p0=2020-04-23"&gt;this query&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/http"&gt;http&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/matthew-somerville"&gt;matthew-somerville&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/photos"&gt;photos&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/covid19"&gt;covid19&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="git"/><category term="http"/><category term="matthew-somerville"/><category term="photos"/><category term="projects"/><category term="datasette"/><category term="weeknotes"/><category term="covid19"/><category term="git-scraping"/></entry><entry><title>Bill Gates’s vision for life beyond the coronavirus</title><link href="https://simonwillison.net/2020/Apr/28/bill-gates/#atom-tag" rel="alternate"/><published>2020-04-28T01:01:58+00:00</published><updated>2020-04-28T01:01:58+00:00</updated><id>https://simonwillison.net/2020/Apr/28/bill-gates/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.vox.com/coronavirus-covid19/2020/4/27/21236270/bill-gates-coronavirus-covid-19-plan-vaccines-conspiracies-podcast"&gt;Bill Gates’s vision for life beyond the coronavirus&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Fascinating interview with Bill Gates—the most interesting and informative article I’ve read about Covid-19 in quite a while.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/bill-gates"&gt;bill-gates&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/covid19"&gt;covid19&lt;/a&gt;&lt;/p&gt;



</summary><category term="bill-gates"/><category term="covid19"/></entry><entry><title>Estimating COVID-19's Rt in Real-Time</title><link href="https://simonwillison.net/2020/Apr/20/estimating-covid-19s-rt-real-time/#atom-tag" rel="alternate"/><published>2020-04-20T15:06:53+00:00</published><updated>2020-04-20T15:06:53+00:00</updated><id>https://simonwillison.net/2020/Apr/20/estimating-covid-19s-rt-real-time/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/k-sys/covid-19/blob/master/Realtime%20R0.ipynb"&gt;Estimating COVID-19&amp;#x27;s Rt in Real-Time&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I’m not qualified to comment on the mathematical approach, but this is a really nice example of a Jupyter Notebook explanatory essay by Kevin Systrom.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/jupyter"&gt;jupyter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/covid19"&gt;covid19&lt;/a&gt;&lt;/p&gt;



</summary><category term="jupyter"/><category term="covid19"/></entry><entry><title>Weeknotes: Covid-19, First Python Notebook, more Dogsheep, Tailscale</title><link href="https://simonwillison.net/2020/Apr/1/weeknotes/#atom-tag" rel="alternate"/><published>2020-04-01T20:29:59+00:00</published><updated>2020-04-01T20:29:59+00:00</updated><id>https://simonwillison.net/2020/Apr/1/weeknotes/#atom-tag</id><summary type="html">
    &lt;p&gt;My &lt;a href="https://covid-19.datasettes.com/"&gt;covid-19.datasettes.com&lt;/a&gt; project publishes information on COVID-19 cases around the world. The project started out using data &lt;a href="https://github.com/CSSEGISandData/COVID-19"&gt;from Johns Hopkins CSSE&lt;/a&gt;, but last week the New York Times &lt;a href="https://www.nytimes.com/article/coronavirus-county-data-us.html"&gt;started publishing&lt;/a&gt; high quality USA county- and state-level daily numbers to their &lt;a href="https://github.com/nytimes/covid-19-data"&gt;own repository&lt;/a&gt;. Here's &lt;a href="https://github.com/simonw/covid-19-datasette/commit/56e1644390e5d01ff67c61d6c165749093675632"&gt;the change&lt;/a&gt; that added the NY Times data.&lt;/p&gt;

&lt;p&gt;It's very easy to use this data to accidentally build misleading things. I've been &lt;a href="https://github.com/simonw/covid-19-datasette/blob/master/README.md"&gt;updating the README&lt;/a&gt; with links about this - my current favourite is &lt;a href="https://fivethirtyeight.com/features/why-its-so-freaking-hard-to-make-a-good-covid-19-model/"&gt;Why It’s So Freaking Hard To Make A Good COVID-19 Model&lt;/a&gt; by  Maggie Koerth, Laura Bronner and Jasmine Mithani at FiveThirtyEight.&lt;/p&gt;

&lt;h3 id="weeknotes-first-python-notebook"&gt;First Python Notebook&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://twitter.com/palewire"&gt;Ben Welsh&lt;/a&gt; from the LA Times teaches a course called &lt;a href="https://www.firstpythonnotebook.org/"&gt;First Python Notebook&lt;/a&gt; at journalism conferences such as NICAR. He ran a free online version of the course last weekend, and I offered to help out as a TA.&lt;/p&gt;

&lt;p&gt;Most of the help I provided came before the course: Ben asked attendees to confirm that they had working installations of Python 3 and pipenv, and if they didn't volunteers such as myself would step in to help. I had Zoom and email conversations with at least ten people to help them get their environments into shape.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://xkcd.com/1987/"&gt;This XKCD&lt;/a&gt; neatly summarizes the problem:&lt;/p&gt;

&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2020/python_environment_2x.png" alt="XKCD Python Environments" style="max-width: 100%" /&gt;&lt;/p&gt;

&lt;p&gt;One of the most common problems I had to debug was PATH issues: people had installed the software, but due to various environmental differences &lt;code&gt;python3&lt;/code&gt; and &lt;code&gt;pipenv&lt;/code&gt; weren't available on the PATH. Talking people through the obscurities of creating a &lt;code&gt;~/.bashrc&lt;/code&gt; file and using it to define a PATH override really helps emphasize how arcane this kind of knowledge is.&lt;/p&gt;
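&lt;p&gt;For the record, the override itself is a single line in &lt;code&gt;~/.bashrc&lt;/code&gt; - the hard part is knowing which directory to add. The path below is just an example (it's where macOS puts &lt;code&gt;pip install --user&lt;/code&gt; scripts); it varies by operating system and Python install:&lt;/p&gt;

```shell
# Example ~/.bashrc PATH override: prepend the directory that
# contains the python3/pipenv executables (directory is an example,
# yours will differ depending on how Python was installed).
export PATH="$HOME/Library/Python/3.8/bin:$PATH"
```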

&lt;p&gt;I enjoyed this comment:&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;"Welcome to intro to Tennis. In the first two weeks, we'll discuss how to rig a net and resurface a court." - &lt;a href="https://twitter.com/ClausWilke/status/1234941405883138048"&gt;Claus Wilke&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Ben's course itself is hands down the best introduction to Python from a Data Journalism perspective I have ever seen. Within an hour of starting the students are using Pandas in a Jupyter notebook to find interesting discrepancies in California campaign finance data.&lt;/p&gt;

&lt;p&gt;If you want to check it out yourself, the entire four hour workshop &lt;a href="https://twitter.com/palewire/status/1244410903279177728"&gt;is now on YouTube&lt;/a&gt; and closely follows the material on &lt;a href="https://www.firstpythonnotebook.org/"&gt;firstpythonnotebook.org&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id="weeknotes-coronavirus-diary"&gt;Coronavirus Diary&lt;/h3&gt;

&lt;p&gt;We are clearly living through a notable and very painful period of history right now. On the 19th of March (just under two weeks ago, but time is moving both really fast and incredibly slowly right now) I started a personal diary - something I've never done before. It lives in an Apple Note and I'm adding around a dozen paragraphs to it every day. I think it's helping. I'm sure it will be interesting to look back on in a few years' time.&lt;/p&gt;

&lt;h3 id="weeknotes-dogsheep"&gt;Dogsheep&lt;/h3&gt;

&lt;p&gt;Much of my development work this past week has gone into my &lt;a href="https://github.com/dogsheep"&gt;Dogsheep&lt;/a&gt; suite of tools for personal analytics.&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;I upgraded the entire family of tools for compatibility with &lt;a href="https://sqlite-utils.readthedocs.io/en/stable/changelog.html#v2"&gt;sqlite-utils 2.x&lt;/a&gt;.&lt;/li&gt;&lt;li&gt;&lt;a href="https://github.com/dogsheep/pocket-to-sqlite"&gt;pocket-to-sqlite&lt;/a&gt; got a major upgrade: it now fetches items using Pocket's API pagination (previously it just tried to pull in 5,000 items in one go) and has the ability to only fetch new items. As a result I'm now running it from cron in my personal Dogsheep instance, so "Save to Pocket" is now my preferred Dogsheep-compatible way of bookmarking content.&lt;/li&gt;&lt;li&gt;&lt;a href="https://github.com/dogsheep/twitter-to-sqlite"&gt;twitter-to-sqlite&lt;/a&gt; got a couple of important new features in &lt;a href="https://github.com/dogsheep/twitter-to-sqlite/releases/tag/0.20"&gt;release 0.20&lt;/a&gt;. I fixed &lt;a href="https://github.com/dogsheep/twitter-to-sqlite/issues/39"&gt;a nasty bug&lt;/a&gt; in the &lt;code&gt;--since&lt;/code&gt; flag where retweets from other accounts could cause new tweets from an account to be ignored. I also added a new &lt;code&gt;count_history&lt;/code&gt; table which automatically tracks changes to a Twitter user's friends, follower and listed counts over time (&lt;a href="https://github.com/dogsheep/twitter-to-sqlite/issues/40"&gt;#40&lt;/a&gt;).&lt;/li&gt;&lt;/ul&gt;

&lt;p&gt;I'm also now using Dogsheep for some journalism! I'm working with the &lt;a href="https://biglocalnews.org/"&gt;Big Local News&lt;/a&gt; team at Stanford to help track and archive tweets by a number of different US politicians and health departments relating to the ongoing pandemic. This collaboration resulted in the above improvements to &lt;code&gt;twitter-to-sqlite&lt;/code&gt;.&lt;/p&gt;

&lt;h3 id="weeknotes-tailscale"&gt;Tailscale&lt;/h3&gt;

&lt;p&gt;My personal Dogsheep is currently protected by &lt;a href="https://simonwillison.net/2019/Oct/5/client-side-certificate-authentication-nginx/"&gt;client certificates&lt;/a&gt;, so only my personal laptop and iPhone (with the right certificates installed) can connect to the web server it is running on.&lt;/p&gt;

&lt;p&gt;I spent a bit of time this week playing with &lt;a href="https://tailscale.com/"&gt;Tailscale&lt;/a&gt;, and I'm &lt;em&gt;really&lt;/em&gt; impressed by it.&lt;/p&gt;

&lt;p&gt;Tailscale is a commercial company built on top of &lt;a href="https://www.wireguard.com/"&gt;WireGuard&lt;/a&gt;, the new approach to VPN tunnels which just &lt;a href="https://arstechnica.com/gadgets/2020/03/wireguard-vpn-makes-it-to-1-0-0-and-into-the-next-linux-kernel/"&gt;got merged&lt;/a&gt; into the Linux 5.6 kernel. Tailscale first caught my attention in January when they &lt;a href="https://bradfitz.com/2020/01/30/joining-tailscale"&gt;hired Brad Fitzpatrick&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;WireGuard lets you form a private network by having individual hosts exchange public/private keys with each other. Tailscale provides software which manages those keys for you, making it trivial to set up a private network between different nodes.&lt;/p&gt;
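&lt;p&gt;For a sense of what Tailscale is automating, here's a hypothetical hand-rolled WireGuard config for one node - the keys, addresses and endpoint are placeholders. Tailscale generates and distributes the equivalent of this for every node in your network so you never have to touch it:&lt;/p&gt;

```ini
# /etc/wireguard/wg0.conf - one node in a manually managed network
[Interface]
PrivateKey = THIS_NODES_PRIVATE_KEY          ; generated with: wg genkey
Address = 10.0.0.1/24
ListenPort = 51820

[Peer]
PublicKey = OTHER_NODES_PUBLIC_KEY           ; exchanged out-of-band
AllowedIPs = 10.0.0.2/32
Endpoint = peer.example.com:51820
```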

&lt;p&gt;How trivial? It took me less than ten minutes to get a three-node private network running between my iPhone, laptop and a Linux server. I installed the &lt;a href="https://apps.apple.com/us/app/tailscale/id1470499037?ls=1"&gt;iPhone app&lt;/a&gt;, the &lt;a href="https://tailscale.com/kb/1037/install-ubuntu-1804"&gt;Ubuntu package&lt;/a&gt;, the &lt;a href="https://apps.apple.com/ca/app/tailscale/id1475387142?mt=12"&gt;OS X app&lt;/a&gt;, signed them all into my Google account and I was done.&lt;/p&gt;

&lt;p&gt;Each of those devices now has an additional IP address in the 100.x range which they can use to talk to each other. Tailscale guarantees that the IP address will stay constant for each of them.&lt;/p&gt;

&lt;p&gt;Since the network is public/private key encrypted between the nodes, Tailscale can't see any of my traffic - they're purely acting as a key management mechanism. And it's free: Tailscale charge for networks with multiple users, but a personal network like this is free of charge.&lt;/p&gt;

&lt;p&gt;I'm not running my own personal Dogsheep on it yet, but I'm tempted to switch over. I'd love other people to start running their own personal Dogsheep instances, but I'm paranoid about encouraging this when securing them is so important. Tailscale looks like it might be a great solution for making secure personal infrastructure easier and more widely available.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/brad-fitzpatrick"&gt;brad-fitzpatrick&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/teaching"&gt;teaching&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dogsheep"&gt;dogsheep&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tailscale"&gt;tailscale&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/covid19"&gt;covid19&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ben-welsh"&gt;ben-welsh&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="brad-fitzpatrick"/><category term="data-journalism"/><category term="projects"/><category term="python"/><category term="teaching"/><category term="datasette"/><category term="dogsheep"/><category term="weeknotes"/><category term="tailscale"/><category term="covid19"/><category term="ben-welsh"/></entry><entry><title>Weeknotes: COVID-19 numbers in Datasette</title><link href="https://simonwillison.net/2020/Mar/11/covid-19/#atom-tag" rel="alternate"/><published>2020-03-11T04:49:35+00:00</published><updated>2020-03-11T04:49:35+00:00</updated><id>https://simonwillison.net/2020/Mar/11/covid-19/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Coronavirus_disease_2019"&gt;COVID-19&lt;/a&gt;, the disease caused by the novel coronavirus, gets more terrifying every day. Johns Hopkins Center for Systems Science and Engineering (CSSE) have been &lt;a href="https://github.com/CSSEGISandData/COVID-19"&gt;collating data&lt;/a&gt; about the spread of the disease and publishing it as CSV files on GitHub.&lt;/p&gt;

&lt;p&gt;This morning I used the pattern described in &lt;a href="https://simonwillison.net/2020/Jan/21/github-actions-cloud-run/"&gt;Deploying a data API using GitHub Actions and Cloud Run&lt;/a&gt; to set up a scheduled task that grabs their data once an hour and publishes it to &lt;a href="https://covid-19.datasettes.com/"&gt;https://covid-19.datasettes.com/&lt;/a&gt; as a table in Datasette.&lt;/p&gt;
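&lt;p&gt;The heart of that pattern is a scheduled GitHub Actions workflow. Here's a minimal sketch of the shape of it - the script name and service name here are placeholders, not the actual configuration from the covid-19-datasette repository:&lt;/p&gt;

```yaml
name: Fetch and deploy COVID-19 data
on:
  schedule:
    - cron: "0 * * * *"  # once an hour
jobs:
  build_and_deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Fetch the latest Johns Hopkins CSVs and build a SQLite database
        run: python build_database.py  # hypothetical build script
      - name: Deploy the database with Datasette
        run: datasette publish cloudrun covid.db --service covid-19
```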

&lt;p&gt;If you're not yet concerned about COVID-19 you clearly haven't been paying attention to what's been happening in Italy. Here's &lt;a href="https://covid-19.datasettes.com/covid/daily_reports?country_or_region=Italy&amp;amp;_sort_desc=confirmed#g.mark=bar&amp;amp;g.x_column=day&amp;amp;g.x_type=ordinal&amp;amp;g.y_column=confirmed&amp;amp;g.y_type=quantitative"&gt;a query&lt;/a&gt; which shows a graph of the number of confirmed cases in Italy over the past few weeks (using &lt;a href="https://github.com/simonw/datasette-vega"&gt;datasette-vega&lt;/a&gt;):&lt;/p&gt;

&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2020/covid-19-italy.png" alt="COVID-19 confirmed cases in Italy, spiking up to 10,149" style="max-width: 100%" /&gt;&lt;/p&gt;

&lt;p&gt;155 cases 17 days ago to 10,149 cases today is really frightening. And the USA still doesn't have robust testing in place, so the numbers here are likely to really shock people once they start to become more apparent.&lt;/p&gt;

&lt;p&gt;If you're going to use the data in covid-19.datasettes.com for anything please be responsible with it and &lt;a href="https://github.com/simonw/covid-19-datasette/blob/master/README.md"&gt;read the warnings in the README file&lt;/a&gt; in detail: it's important to fully understand the sources of the data and how it is being processed before you use it to make any assertions about the spread of COVID-19.&lt;/p&gt;

&lt;p&gt;My favourite resource to understand Coronavirus and what we should be doing about it is &lt;a href="https://www.flattenthecurve.com/"&gt;flattenthecurve.com&lt;/a&gt;, compiled by &lt;a href="https://twitter.com/figgyjam"&gt;Julie McMurry&lt;/a&gt;, an assistant professor at Oregon State University College of Public Health. I strongly recommend checking it out.&lt;/p&gt;

&lt;h3&gt;Other projects&lt;/h3&gt;

&lt;p&gt;I've worked on a bunch of other projects this week, some of which were inspired by my time at &lt;a href="https://www.ire.org/events-and-training/conferences/nicar-2020"&gt;NICAR&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;&lt;a href="https://github.com/simonw/fec-to-sqlite"&gt;fec-to-sqlite&lt;/a&gt; is a script for saving FEC campaign finance filings to a SQLite database. Since those filings are pulled in via HTTP and can get pretty big, it uses a neat trick to generate a progress bar with the &lt;a href="https://github.com/tqdm/tqdm"&gt;tqdm&lt;/a&gt; library - it &lt;a href="https://github.com/simonw/fec-to-sqlite/blob/d3ec100f4e9d5acbc5798d95b49e6e373c1ce778/fec_to_sqlite/cli.py#L26-L27"&gt;initiates a progress bar&lt;/a&gt; with &lt;a href="https://github.com/simonw/fec-to-sqlite/blob/d3ec100f4e9d5acbc5798d95b49e6e373c1ce778/fec_to_sqlite/utils.py#L89"&gt;the Content-Length&lt;/a&gt; of the incoming file, then as it iterates over the lines coming in over HTTP it uses the length of each line &lt;a href="https://github.com/simonw/fec-to-sqlite/blob/d3ec100f4e9d5acbc5798d95b49e6e373c1ce778/fec_to_sqlite/utils.py#L75-L78"&gt;to update that bar&lt;/a&gt;.&lt;/li&gt;&lt;li&gt;&lt;a href="https://github.com/simonw/datasette-search-all"&gt;datasette-search-all&lt;/a&gt; is a new plugin that enables search across multiple FTS-enabled SQLite tables at once. I wrote more about that in &lt;a href="https://simonwillison.net/2020/Mar/9/datasette-search-all/"&gt;this blog post&lt;/a&gt; on Monday.&lt;/li&gt;&lt;li&gt;&lt;a href="https://github.com/simonw/datasette-column-inspect"&gt;datasette-column-inspect&lt;/a&gt; is an extremely experimental plugin that tries out a "column inspector" tool for Datasette tables - click on a column heading and the plugin shows you interesting facts about that column, such as the min/mean/max/stdev, any outlying values, the most common values and the least common values. Screenshot below. This prototype came about as part of a JSK team project for the Designing Machine Learning course at Stanford - we were thinking about ways in which machine learning could help journalists find stories in large datasets. 
The prototype doesn't have any machine learning in it - just some simple statistics to identify outliers - but it's meant to illustrate how a tool that exposes machine learning insights against tabular data might work.&lt;/li&gt;&lt;li&gt;&lt;a href="https://github.com/dogsheep/github-to-sqlite"&gt;github-to-sqlite&lt;/a&gt; grew a new sub-command: &lt;code&gt;github-to-sqlite commits github.db simonw/datasette&lt;/code&gt; - which imports information about commits to a repository (just the author and commit message, not the body of the commit itself). I'm running a private version of this against all of my projects, which is really useful for seeing what I worked on over the past week when writing my weeknotes.&lt;/li&gt;&lt;/ul&gt;
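&lt;p&gt;That progress bar trick works just as well without tqdm. Here's a standard-library-only sketch of the idea: seed the total from the Content-Length, then advance by the byte length of each line as it streams in (the function and data here are illustrative, not fec-to-sqlite's actual code):&lt;/p&gt;

```python
import io

def stream_with_progress(stream, total_bytes):
    """Yield (line, fraction_complete) pairs from a byte stream.

    Mimics the fec-to-sqlite trick: the total comes from the
    Content-Length header, and progress advances by the byte
    length of each line as it arrives over HTTP."""
    seen = 0
    for line in stream:
        seen += len(line)
        yield line, seen / total_bytes

# Simulate an HTTP body with a known Content-Length
body = b"first line\nsecond line\nthird line\n"
for line, fraction in stream_with_progress(io.BytesIO(body), len(body)):
    print(f"{fraction:6.1%}  {line.decode().rstrip()}")
```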

&lt;p&gt;Here are two screenshots of &lt;code&gt;datasette-column-inspect&lt;/code&gt; in action. You can try out a live demo of the plugin &lt;a href="https://datasette-column-inspect-demo-j7hipcg4aq-uc.a.run.app/fivethirtyeight/antiquities-act%2Factions_under_antiquities_act"&gt;over here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2020/column-inspect-avengers.png" alt="Outliers in number of appearences in the Avengers: Iron Man, Captain America, Spider Man and Wolverine" style="max-width: 100%" /&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2020/column-inspect-antiquities.png" alt="Column summary for states in actions_under_antiquities_act" style="max-width: 100%" /&gt;&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/plugins"&gt;plugins&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coronavirus"&gt;coronavirus&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/covid19"&gt;covid19&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="plugins"/><category term="projects"/><category term="datasette"/><category term="weeknotes"/><category term="coronavirus"/><category term="covid19"/></entry></feed>