<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: baked-data</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/baked-data.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-03-17T22:49:59+00:00</updated><author><name>Simon Willison</name></author><entry><title>OpenTimes</title><link href="https://simonwillison.net/2025/Mar/17/opentimes/#atom-tag" rel="alternate"/><published>2025-03-17T22:49:59+00:00</published><updated>2025-03-17T22:49:59+00:00</updated><id>https://simonwillison.net/2025/Mar/17/opentimes/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://sno.ws/opentimes/"&gt;OpenTimes&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Spectacular new open geospatial project by &lt;a href="https://sno.ws/"&gt;Dan Snow&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;OpenTimes is a database of pre-computed, point-to-point travel times between United States Census geographies. It lets you download bulk travel time data for free and with no limits.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://opentimes.org/?id=060816135022&amp;amp;mode=car#9.76/37.5566/-122.3085"&gt;what I get&lt;/a&gt; for travel times by car from El Granada, California:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Isochrone map showing driving times from the El Granada census tract to other places in the San Francisco Bay Area" src="https://static.simonwillison.net/static/2025/opentimes.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;The technical details are &lt;em&gt;fascinating&lt;/em&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;The entire OpenTimes backend is just static Parquet files on &lt;a href="https://www.cloudflare.com/developer-platform/products/r2/"&gt;Cloudflare's R2&lt;/a&gt;. There's no RDBMS or running service, just files and a CDN. The whole thing costs about $10/month to host and costs nothing to serve. In my opinion, this is a &lt;em&gt;great&lt;/em&gt; way to serve infrequently updated, large public datasets at low cost (as long as you partition the files correctly).&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Sure enough, &lt;a href="https://developers.cloudflare.com/r2/pricing/"&gt;R2 pricing&lt;/a&gt; charges "based on the total volume of data stored" - $0.015 / GB-month for standard storage, then $0.36 / million requests for "Class B" operations which include reads. They charge nothing for outbound bandwidth.&lt;/p&gt;
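&lt;p&gt;Those rates make back-of-the-envelope estimates easy. Here's a quick sketch in Python - the per-unit prices are the ones quoted above, but the dataset size and read volume are invented example figures, not OpenTimes' actual numbers:&lt;/p&gt;

```python
# Back-of-envelope R2 cost estimate. The per-unit rates are from
# Cloudflare's published pricing; the example inputs are made up.

STORAGE_PER_GB_MONTH = 0.015   # USD per GB-month, standard storage
CLASS_B_PER_MILLION = 0.36     # USD per million Class B (read) operations

def monthly_cost(stored_gb: float, reads: int) -> float:
    """Estimate the monthly R2 bill: storage plus read operations.
    Egress bandwidth is free, so it does not appear here."""
    storage = stored_gb * STORAGE_PER_GB_MONTH
    requests = (reads / 1_000_000) * CLASS_B_PER_MILLION
    return storage + requests

# e.g. ~500 GB of Parquet files and 2 million reads in a month:
print(round(monthly_cost(500, 2_000_000), 2))  # → 8.22
```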
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;All travel times were calculated by pre-building the inputs (OSM, OSRM networks) and then distributing the compute over &lt;a href="https://github.com/dfsnow/opentimes/actions/workflows/calculate-times.yaml"&gt;hundreds of GitHub Actions jobs&lt;/a&gt;. This worked shockingly well for this specific workload (and was also completely free).&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's a &lt;a href="https://github.com/dfsnow/opentimes/actions/runs/13094249792"&gt;GitHub Actions run&lt;/a&gt; of the &lt;a href="https://github.com/dfsnow/opentimes/blob/a6a5f7abcdd69559b3e29f360fe0ff0399dbb400/.github/workflows/calculate-times.yaml#L78-L80"&gt;calculate-times.yaml workflow&lt;/a&gt; which uses a matrix to run 255 jobs!&lt;/p&gt;
&lt;p&gt;&lt;img alt="GitHub Actions run: calculate-times.yaml run by workflow_dispatch taking 1h49m to execute 255 jobs with names like run-job (2020-01) " src="https://static.simonwillison.net/static/2025/opentimes-github-actions.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;Relevant YAML:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;  matrix:
    year: ${{ fromJSON(needs.setup-jobs.outputs.years) }}
    state: ${{ fromJSON(needs.setup-jobs.outputs.states) }}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Where those JSON files were created by the previous step, which reads in the year and state values from &lt;a href="https://github.com/dfsnow/opentimes/blob/a6a5f7abcdd69559b3e29f360fe0ff0399dbb400/data/params.yaml#L72-L132"&gt;this params.yaml file&lt;/a&gt;.&lt;/p&gt;
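&lt;p&gt;A setup job like that typically serializes its lists as JSON strings and appends them to the file named by &lt;code&gt;GITHUB_OUTPUT&lt;/code&gt;, where later jobs pick them up via &lt;code&gt;fromJSON()&lt;/code&gt;. Here's a hypothetical sketch of that step - the field names and values are illustrative, not copied from the OpenTimes repo:&lt;/p&gt;

```python
# Hypothetical sketch of a matrix setup job: read year and state
# lists (here hard-coded as a stand-in for the parsed params.yaml)
# and render them as JSON strings in the key=value format that
# $GITHUB_OUTPUT expects.
import json

params = {  # stand-in for values read from params.yaml
    "years": ["2020", "2021", "2022", "2023", "2024"],
    "states": ["01", "02", "04"],  # one entry per state FIPS code
}

def github_outputs(params: dict) -> list[str]:
    """Render key=value lines suitable for appending to $GITHUB_OUTPUT."""
    return [f"{key}={json.dumps(value)}" for key, value in params.items()]

lines = github_outputs(params)
# In a real workflow step you would append these lines to the file
# named by the GITHUB_OUTPUT environment variable.
print(lines[0])
```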
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;The query layer uses a single DuckDB database file with &lt;em&gt;views&lt;/em&gt; that point to static Parquet files via HTTP. This lets you query a table with hundreds of billions of records after downloading just the ~5MB pointer file.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is a really creative use of DuckDB's feature that lets you run queries against large data from a laptop using HTTP range queries to avoid downloading the whole thing.&lt;/p&gt;
&lt;p&gt;The README shows &lt;a href="https://github.com/dfsnow/opentimes/blob/3439fa2c54af227e40997b4a5f55678739e0f6df/README.md#using-duckdb"&gt;how to use that from R and Python&lt;/a&gt; - I got this working in the &lt;code&gt;duckdb&lt;/code&gt; client (&lt;code&gt;brew install duckdb&lt;/code&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;INSTALL httpfs;
LOAD httpfs;
ATTACH 'https://data.opentimes.org/databases/0.0.1.duckdb' AS opentimes;

SELECT origin_id, destination_id, duration_sec
  FROM opentimes.public.times
  WHERE version = '0.0.1'
      AND mode = 'car'
      AND year = '2024'
      AND geography = 'tract'
      AND state = '17'
      AND origin_id LIKE '17031%'
  LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In answer to a question about adding public transit times, &lt;a href="https://news.ycombinator.com/item?id=43392521#43393183"&gt;Dan said&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In the next year or so maybe. The biggest obstacles to adding public transit are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Collecting all the necessary scheduling data (e.g. GTFS feeds) for every transit system in the country. Not insurmountable since there are services that do this currently.&lt;/li&gt;
&lt;li&gt;Finding a routing engine that can compute nation-scale travel time matrices quickly. Currently, the two fastest open-source engines I've tried (OSRM and Valhalla) don't support public transit for matrix calculations and the engines that do support public transit (R5, OpenTripPlanner, etc.) are too slow.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://gtfs.org/"&gt;GTFS&lt;/a&gt; is a popular CSV-based format for sharing transit schedules - here's &lt;a href="https://gtfs.org/resources/data/"&gt;an official list&lt;/a&gt; of available feed directories.&lt;/p&gt;
&lt;p&gt;This whole project feels to me like a great example of the &lt;a href="https://simonwillison.net/2021/Jul/28/baked-data/"&gt;baked data&lt;/a&gt; architectural pattern in action.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=43392521"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/census"&gt;census&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/geospatial"&gt;geospatial&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/open-data"&gt;open-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openstreetmap"&gt;openstreetmap&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cloudflare"&gt;cloudflare&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parquet"&gt;parquet&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/baked-data"&gt;baked-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/duckdb"&gt;duckdb&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/http-range-requests"&gt;http-range-requests&lt;/a&gt;&lt;/p&gt;



</summary><category term="census"/><category term="geospatial"/><category term="open-data"/><category term="openstreetmap"/><category term="cloudflare"/><category term="parquet"/><category term="github-actions"/><category term="baked-data"/><category term="duckdb"/><category term="http-range-requests"/></entry><entry><title>Quoting Jake Teton-Landis</title><link href="https://simonwillison.net/2024/Sep/25/jake-teton-landis/#atom-tag" rel="alternate"/><published>2024-09-25T18:08:19+00:00</published><updated>2024-09-25T18:08:19+00:00</updated><id>https://simonwillison.net/2024/Sep/25/jake-teton-landis/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://news.ycombinator.com/item?id=41645173#41648480"&gt;&lt;p&gt;We used this model [periodically transmitting configuration to different hosts] to distribute translations, feature flags, configuration, search indexes, etc at Airbnb. But instead of SQLite we used &lt;a href="https://github.com/spotify/sparkey"&gt;Sparkey&lt;/a&gt;, a KV file format developed by Spotify. In early years there was a Cron job on every box that pulled that service’s thingies; then once we switched to Kubernetes we used a daemonset &amp;amp; host tagging (taints?) to pull a variety of thingies to each host and then ensure the services that use the thingies only ran on the hosts that had the thingies.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://news.ycombinator.com/item?id=41645173#41648480"&gt;Jake Teton-Landis&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/kubernetes"&gt;kubernetes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/feature-flags"&gt;feature-flags&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/baked-data"&gt;baked-data&lt;/a&gt;&lt;/p&gt;



</summary><category term="sqlite"/><category term="kubernetes"/><category term="feature-flags"/><category term="baked-data"/></entry><entry><title>Clickhouse on Cloud Run</title><link href="https://simonwillison.net/2021/Jul/29/clickhouse-on-cloud-run/#atom-tag" rel="alternate"/><published>2021-07-29T06:07:51+00:00</published><updated>2021-07-29T06:07:51+00:00</updated><id>https://simonwillison.net/2021/Jul/29/clickhouse-on-cloud-run/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://alexjreid.dev/posts/clickhouse-on-cloud-run/"&gt;Clickhouse on Cloud Run&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Alex Reid figured out how to run Clickhouse against read-only baked data on Cloud Run last year, and wrote up some comprehensive notes.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/alexjreid/status/1420625467384737797"&gt;@alexjreid&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cloudrun"&gt;cloudrun&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/baked-data"&gt;baked-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/clickhouse"&gt;clickhouse&lt;/a&gt;&lt;/p&gt;



</summary><category term="cloudrun"/><category term="baked-data"/><category term="clickhouse"/></entry><entry><title>The Baked Data architectural pattern</title><link href="https://simonwillison.net/2021/Jul/28/baked-data/#atom-tag" rel="alternate"/><published>2021-07-28T20:23:44+00:00</published><updated>2021-07-28T20:23:44+00:00</updated><id>https://simonwillison.net/2021/Jul/28/baked-data/#atom-tag</id><summary type="html">
    &lt;p&gt;I've been exploring an architectural pattern for publishing websites over the past few years that I call the "Baked Data" pattern. It provides many of the advantages of static site generators while avoiding most of their limitations. I think it deserves to be used more widely.&lt;/p&gt;
&lt;p&gt;I define the Baked Data architectural pattern as the following:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Baked Data: bundling a read-only copy of your data alongside the code for your application, as part of the same deployment&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Most dynamic websites keep their code and data separate: the code runs on an application server, the data lives independently in some kind of external data store - something like PostgreSQL, MySQL or MongoDB.&lt;/p&gt;
&lt;p&gt;With Baked Data, the data is deployed as part of the application bundle. Any time the content changes, a fresh copy of the site is deployed that includes those updates.&lt;/p&gt;
&lt;p&gt;I mostly use SQLite database files for this, but plenty of other formats can work here too.&lt;/p&gt;
&lt;p&gt;This works particularly well with so-called "serverless" deployment platforms - platforms that support stateless deployments and only charge for resources spent servicing incoming requests ("scale to zero").&lt;/p&gt;
&lt;p&gt;Since every change to the data results in a fresh deployment this pattern doesn't work for sites that change often - but in my experience many content-oriented sites update their content at most a few times a day. Consider blogs, documentation sites, project websites - anything where content is edited by a small group of authors.&lt;/p&gt;
&lt;h4 id="benefits-of-baked-data"&gt;Benefits of Baked Data&lt;/h4&gt;
&lt;p&gt;Why would you want to apply this pattern? A few reasons:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Inexpensive to host&lt;/strong&gt;. Anywhere that can run application code can host a Baked Data application - there's no need to pay extra for a managed database system. Scale-to-zero serverless hosts such as &lt;a href="https://cloud.google.com/run"&gt;Cloud Run&lt;/a&gt;, &lt;a href="https://vercel.com/"&gt;Vercel&lt;/a&gt; or &lt;a href="https://aws.amazon.com/lambda/"&gt;AWS Lambda&lt;/a&gt; will charge only cents per month for low-traffic deployments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Easy to scale&lt;/strong&gt;. Need to handle more traffic? Run more copies of your application and its bundled data. Horizontally scaling Baked Data applications is trivial. They're also a great fit to run behind a caching proxy CDN such as Cloudflare or Fastly - when you deploy a new version you can purge that entire cache.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Difficult to break&lt;/strong&gt;. Hosting server-side applications on a VPS is always disquieting because there's so much that might go wrong - the server could be compromised, or a rogue log file could cause it to run out of disk space. With Baked Data the worst that can happen is that you need to re-deploy the application - there's no risk at all of data loss, and providers that can auto-restart code can recover from errors automatically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Server-side functionality is supported&lt;/strong&gt;. Static site generators provide many of the above benefits, but with the limitation that any dynamic functionality needs to happen in client-side JavaScript. With a Baked Data application you can execute server-side code too.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Templated pages&lt;/strong&gt;. Another improvement over static site generators: if you have 10,000 pages, a static site generator will need to generate 10,000 HTML files. With Baked Data those 10,000 pages can exist as rows in a single SQLite database file, and the pages can be generated at run-time using a server-side template.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Easy to support multiple formats&lt;/strong&gt;. Since your content is in a dynamic data store, outputting that same content in alternative formats is easy. I use Datasette plugins for this: &lt;a href="https://datasette.io/plugins/datasette-atom"&gt;datasette-atom&lt;/a&gt; can produce an Atom feed from a SQL query, and &lt;a href="https://datasette.io/plugins/datasette-ics"&gt;datasette-ics&lt;/a&gt; does the same thing for iCalendar feeds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integrates well with version control&lt;/strong&gt;. I like to keep my site content under version control. The Baked Data pattern works well with build scripts that read content from a git repository and use it to build assets that are bundled with the deployment.&lt;/li&gt;
&lt;/ul&gt;
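&lt;p&gt;The templated-pages point is easy to demonstrate: one template plus rows in a SQLite table can stand in for thousands of generated files. Here's a toy sketch using only Python's standard library - a real site would use something like Datasette with Jinja templates instead:&lt;/p&gt;

```python
# Toy illustration of templated pages: 10,000 pages are one template
# plus 10,000 rows in SQLite, rendered at request time. Stdlib only;
# the schema and template are invented for this example.
import sqlite3
from string import Template

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (slug TEXT PRIMARY KEY, title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?)",
    [(f"page-{i}", f"Page {i}", f"Body of page {i}") for i in range(10_000)],
)

PAGE_TEMPLATE = Template("$title | $body")

def render(slug: str) -> str:
    """Look up one row and render it through the shared template."""
    row = conn.execute(
        "SELECT title, body FROM pages WHERE slug = ?", (slug,)
    ).fetchone()
    return PAGE_TEMPLATE.substitute(title=row[0], body=row[1])

print(render("page-42"))  # → Page 42 | Body of page 42
```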
&lt;h4 id="how-to-bake-your-data"&gt;How to bake your data&lt;/h4&gt;
&lt;p&gt;My initial implementations of Baked Data have all used SQLite. It's an ideal format for this kind of application: a single binary file which can store anything that can be represented as relational tables, &lt;a href="https://www.sqlite.org/json1.html"&gt;JSON documents&lt;/a&gt; or &lt;a href="https://simonwillison.net/2020/Jul/30/fun-binary-data-and-sqlite/"&gt;binary objects&lt;/a&gt; - essentially anything at all.&lt;/p&gt;
&lt;p&gt;Any format that can be read from disk by your dynamic server-side code will work too: YAML or CSV files, Berkeley DB files, or anything else that can be represented by a bucket of read-only bytes in a file on disk.&lt;/p&gt;
&lt;p&gt;[I have a hunch that you could even use something like PostgreSQL, MySQL or Elasticsearch by packaging up their on-disk representations and shipping them as part of a Docker container, but I've not tried that myself yet.]&lt;/p&gt;
&lt;p&gt;Once your data is available in a file, your application code can read from that file and use it to generate and return web pages.&lt;/p&gt;
&lt;p&gt;You can write code that does this in any server-side language. I use Python, usually with my &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; application server which can read from a SQLite database file and use &lt;a href="https://docs.datasette.io/en/stable/custom_templates.html"&gt;Jinja templates&lt;/a&gt; to generate pages.&lt;/p&gt;
&lt;p&gt;The final piece of the puzzle is a build and deploy script. I use GitHub Actions for this, but any CI tool will work well here. The script builds the site content into a deployable asset, then deploys that asset along with the application code to a hosting platform.&lt;/p&gt;
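&lt;p&gt;As a simplified sketch of that baking step - the table name and content rows here are invented, and in practice tools like &lt;a href="https://datasette.io/tools/yaml-to-sqlite"&gt;yaml-to-sqlite&lt;/a&gt; automate this:&lt;/p&gt;

```python
# Simplified sketch of a bake step: a build script turns source
# content into a SQLite file that ships with the deployment. The
# content list stands in for whatever your repo holds (YAML files,
# markdown with front matter, etc.).
import sqlite3

content = [
    {"slug": "hello", "title": "Hello", "body": "First post"},
    {"slug": "second", "title": "Second", "body": "Another post"},
]

def bake(rows: list[dict], path: str = ":memory:") -> sqlite3.Connection:
    """Write content rows into a SQLite database - the 'baked' asset."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts (slug TEXT PRIMARY KEY, title TEXT, body TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO posts VALUES (:slug, :title, :body)", rows
    )
    conn.commit()
    return conn

db = bake(content)  # a build script would pass a real file path here
print(db.execute("SELECT count(*) FROM posts").fetchone()[0])  # → 2
```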
&lt;h4 id="baked-data-datasette-io"&gt;Baked Data in action: datasette.io&lt;/h4&gt;
&lt;p&gt;The most sophisticated Baked Data site I've published myself is the official website for my Datasette project, &lt;a href="https://datasette.io/"&gt;datasette.io&lt;/a&gt; - source code &lt;a href="https://github.com/simonw/datasette.io"&gt;in this repo&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/datasette-io-baked-data.png" alt="A screenshot of the datasette.io homepage" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;The site is deployed using Cloud Run. It's actually a heavily customized Datasette instance, using &lt;a href="https://github.com/simonw/datasette.io/blob/main/templates/index.html"&gt;a custom template&lt;/a&gt; for the homepage, &lt;a href="https://github.com/simonw/datasette.io/tree/main/templates/pages"&gt;custom pages&lt;/a&gt; for other parts of the site and the &lt;a href="https://datasette.io/plugins/datasette-template-sql"&gt;datasette-template-sql&lt;/a&gt; plugin to execute SQL queries and display their results from those templates.&lt;/p&gt;
&lt;p&gt;The site currently runs off &lt;a href="https://datasette.io/-/databases"&gt;four database files&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://datasette.io/content"&gt;content.db&lt;/a&gt; has most of the site content. It is built inside GitHub Actions by the &lt;a href="https://github.com/simonw/datasette.io/blob/main/scripts/build.sh"&gt;build.sh&lt;/a&gt; script, which does the following:
&lt;ul&gt;
&lt;li&gt;Import the contents of the &lt;a href="https://github.com/simonw/datasette.io/blob/main/news.yaml"&gt;news.yaml&lt;/a&gt; file into a &lt;a href="https://datasette.io/content/news"&gt;news&lt;/a&gt; table using &lt;a href="https://datasette.io/tools/yaml-to-sqlite"&gt;yaml-to-sqlite&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Import the markdown files from the &lt;a href="https://github.com/simonw/datasette.io/tree/main/for"&gt;for/ folder&lt;/a&gt; (use-cases for Datasette) into the &lt;a href="https://datasette.io/content/uses"&gt;uses&lt;/a&gt; table using &lt;a href="https://datasette.io/tools/markdown-to-sqlite"&gt;markdown-to-sqlite&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Populate the &lt;a href="https://datasette.io/content/plugin_repos"&gt;plugin_repos&lt;/a&gt; and &lt;a href="https://datasette.io/content/tool_repos"&gt;tool_repos&lt;/a&gt; single-column tables using data from more YAML files. These are used in the next step.&lt;/li&gt;
&lt;li&gt;Run the &lt;a href="https://github.com/simonw/datasette.io/blob/main/build_directory.py"&gt;build_directory.py&lt;/a&gt; Python script. This uses the GitHub GraphQL API to fetch information about all of those plugin and tool repositories, including their README files and their most recent tagged &lt;a href="https://datasette.io/content/releases"&gt;releases&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Populate a &lt;a href="https://datasette.io/content/stats"&gt;stats&lt;/a&gt; table with the latest download statistics for all of the Datasette ecosystem PyPI packages. That data is imported from a &lt;code&gt;stats.json&lt;/code&gt; file in my &lt;a href="https://github.com/simonw/package-stats"&gt;simonw/package-stats&lt;/a&gt; repository, which is itself populated by this &lt;a href="https://github.com/simonw/package-stats/blob/main/.github/workflows/fetch_stats.yml"&gt;git scraping script&lt;/a&gt; that runs in GitHub Actions. I also use this for my &lt;a href="https://observablehq.com/@simonw/datasette-downloads-per-day-with-observable-plot"&gt;Datasette Downloads Observable notebook&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://datasette.io/blog"&gt;blog.db&lt;/a&gt; contains content from my blog that carries any of the &lt;a href="https://simonwillison.net/tags/datasette/"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dogsheep/"&gt;dogsheep&lt;/a&gt; or &lt;a href="https://simonwillison.net/tags/sqliteutils/"&gt;sqliteutils&lt;/a&gt; tags.
&lt;ul&gt;
&lt;li&gt;This is fetched by the &lt;a href="https://github.com/simonw/datasette.io/blob/main/fetch_blog_content.py"&gt;fetch_blog_content.py&lt;/a&gt; script, which hits the paginated per-tag Atom feed for my blog content, &lt;a href="https://github.com/simonw/simonwillisonblog/blob/a5b53a24b00d4c95c88c8371cfc17453b0726c23/blog/views.py#L411-L421"&gt;implemented in Django here&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://datasette.io/docs-index"&gt;docs-index.db&lt;/a&gt; is a database table containing the documentation for the most recent stable Datasette release, broken up by &lt;a href="https://datasette.io/docs-index/sections"&gt;sections&lt;/a&gt;.
&lt;ul&gt;
&lt;li&gt;This database file is downloaded from a separate site, &lt;a href="https://stable-docs.datasette.io/"&gt;stable-docs.datasette.io&lt;/a&gt;, which is built and deployed as &lt;a href="https://github.com/simonw/datasette/blob/0.58.1/.github/workflows/publish.yml#L60-L98"&gt;part of Datasette's release process&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://datasette.io/dogsheep-index"&gt;dogsheep-index.db&lt;/a&gt; is the search index that powers site search (e.g. &lt;a href="https://datasette.io/-/beta?q=dogsheep"&gt;this search for dogsheep&lt;/a&gt;).
&lt;ul&gt;
&lt;li&gt;The search index is built by &lt;a href="https://datasette.io/plugins/dogsheep-beta"&gt;dogsheep-beta&lt;/a&gt; using data pulled from tables in the other database files, as configured by &lt;a href="https://github.com/simonw/datasette.io/blob/main/templates/dogsheep-beta.yml"&gt;this YAML file&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The site is automatically deployed once a day by &lt;a href="https://github.com/simonw/datasette.io/blob/main/.github/workflows/deploy.yml"&gt;a scheduled action&lt;/a&gt;, and I can also manually trigger that action if I want to ensure a new software release is reflected on the homepage.&lt;/p&gt;
&lt;h4 id="other-real-world-examples"&gt;Other real-world examples of Baked Data&lt;/h4&gt;
&lt;p&gt;I'm currently running two other sites using this pattern:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.niche-museums.com/"&gt;Niche Museums&lt;/a&gt; is my blog about tiny museums that I've visited. Again, it's Datasette with custom templates. Most of the content comes from this &lt;a href="https://github.com/simonw/museums/blob/main/museums.yaml"&gt;museums.yaml&lt;/a&gt; file, but I also run &lt;a href="https://github.com/simonw/museums/blob/main/annotate_timestamps.py"&gt;a script&lt;/a&gt; to figure out when each item was created or updated from the git history.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://til.simonwillison.net/"&gt;My TILs site&lt;/a&gt; runs on &lt;a href="https://vercel.com/"&gt;Vercel&lt;/a&gt; and is built from my &lt;a href="https://github.com/simonw/til"&gt;simonw/til&lt;/a&gt; GitHub repository by &lt;a href="https://github.com/simonw/til/blob/main/build_database.py"&gt;this build script&lt;/a&gt; (populating &lt;a href="https://til.simonwillison.net/tils"&gt;this tils table&lt;/a&gt;). It uses the GitHub API to convert GitHub Flavored Markdown to HTML. I'm also running &lt;a href="https://github.com/simonw/til/blob/main/generate_screenshots.py"&gt;a script&lt;/a&gt; that generates small screenshots of each page and stashes them in a BLOB column in SQLite in order to provide social media preview cards, see &lt;a href="https://simonwillison.net/2020/Sep/3/weeknotes-airtable-screenshots-dogsheep/#weeknotes-2020-09-03-social-media-cards-tils"&gt;Social media cards for my TILs&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;My favourite example of this pattern in a site that I haven't worked on myself is &lt;a href="https://www.mozilla.org/"&gt;Mozilla.org&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;They started using SQLite back in 2018 in a system they call Bedrock - Paul McLanahan provides &lt;a href="https://mozilla.github.io/meao/2018/03/28/bedrock-the-sqlitening/"&gt;a detailed description&lt;/a&gt; of how this works.&lt;/p&gt;
&lt;p&gt;Their site content lives in a ~22MB SQLite database file, which is built and uploaded to S3 and then downloaded on a regular basis to each of their application servers.&lt;/p&gt;
&lt;p&gt;You can view &lt;a href="https://www.mozilla.org/healthz-cron/"&gt;their healthcheck page&lt;/a&gt; to see when the database was last downloaded, and grab a copy of the SQLite file yourself. It's fun to explore that using Datasette:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/mozilla-site-content.png" alt="Datasette running against the Mozilla contentncards_contentcard table" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;h4 id="compared-to-ssgs"&gt;Compared to static site generators&lt;/h4&gt;
&lt;p&gt;Static site generators have exploded in popularity over the past ten years. They drive the cost of hosting a site down to almost nothing, provide excellent performance, work well with CDNs and produce sites that are extremely unlikely to break.&lt;/p&gt;
&lt;p&gt;Used carefully, the Baked Data pattern keeps most of these characteristics while still enabling server-side code execution.&lt;/p&gt;
&lt;p&gt;My example sites use this in a few different ways:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;datasette.io &lt;a href="https://datasette.io/-/beta?q=search"&gt;provides search&lt;/a&gt; across 1,588 different pieces of content, plus &lt;a href="https://datasette.io/plugins?q=ics"&gt;simpler search&lt;/a&gt; on the plugins and tools pages.&lt;/li&gt;
&lt;li&gt;My TIL site also &lt;a href="https://til.simonwillison.net/tils/search?q=search"&gt;provides search&lt;/a&gt;, as &lt;a href="https://www.niche-museums.com/browse/search?q=bigfoot"&gt;does Niche Museums&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;All three sites provide Atom feeds that are configured using a server-side SQL query: &lt;a href="https://datasette.io/content/feed.atom"&gt;Datasette&lt;/a&gt;, &lt;a href="https://www.niche-museums.com/browse/feed.atom"&gt;Niche Museums&lt;/a&gt;, &lt;a href="https://til.simonwillison.net/tils/feed.atom"&gt;TILs&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Niche Museums offers a "Use my location" button which then serves &lt;a href="https://www.niche-museums.com/?latitude=37.5&amp;amp;longitude=-122.5"&gt;museums near you&lt;/a&gt;, using &lt;a href="https://www.niche-museums.com/browse/nearby"&gt;a SQL query&lt;/a&gt; that makes use of the &lt;a href="https://datasette.io/plugins/datasette-haversine"&gt;datasette-haversine&lt;/a&gt; plugin.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A common complaint about static site generators when used for larger sites is that build times can get pretty long if the builder has to generate tens of thousands of pages.&lt;/p&gt;
&lt;p&gt;With Baked Data, 10,000 pages can be generated by a single template file and 10,000 rows in a SQLite database table.&lt;/p&gt;
&lt;p&gt;This also makes for a faster iteration cycle during development: you can edit a template and hit "refresh" to see any page rendered by the new template instantly, without needing to rebuild any pages.&lt;/p&gt;
&lt;h4 id="give-this-a-go"&gt;Want to give this a go?&lt;/h4&gt;
&lt;p&gt;If you want to give the Baked Data pattern a try, I recommend starting out using the combination of &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt;, &lt;a href="https://docs.github.com/en/actions"&gt;GitHub Actions&lt;/a&gt; and &lt;a href="https://vercel.com/"&gt;Vercel&lt;/a&gt;. Hopefully the examples I've provided above are a good starting point - also feel free to reach out to me &lt;a href="https://twitter.com/"&gt;on Twitter&lt;/a&gt; or in &lt;a href="https://github.com/simonw/datasette/discussions"&gt;the Datasette Discussions forum&lt;/a&gt; with any questions.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/definitions"&gt;definitions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/design-patterns"&gt;design-patterns&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/static-generator"&gt;static-generator&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/baked-data"&gt;baked-data&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="definitions"/><category term="design-patterns"/><category term="sqlite"/><category term="static-generator"/><category term="datasette"/><category term="baked-data"/></entry><entry><title>Building a search engine for datasette.io</title><link href="https://simonwillison.net/2020/Dec/19/dogsheep-beta/#atom-tag" rel="alternate"/><published>2020-12-19T18:12:31+00:00</published><updated>2020-12-19T18:12:31+00:00</updated><id>https://simonwillison.net/2020/Dec/19/dogsheep-beta/#atom-tag</id><summary type="html">
    &lt;p&gt;This week I added &lt;a href="https://datasette.io/-/beta"&gt;a search engine&lt;/a&gt; to &lt;a href="https://datasette.io/"&gt;datasette.io&lt;/a&gt;, using the search indexing tool I've been building for &lt;a href="https://datasette.substack.com/p/dogsheep-personal-analytics-with"&gt;Dogsheep&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A screenshot of dogsheep.io search results for ripgrep" src="https://static.simonwillison.net/static/2020/dogsheep-beta-ripgrep.jpg" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;h4&gt;Project search for Datasette&lt;/h4&gt;
&lt;p&gt;The Datasette project has a &lt;em&gt;lot&lt;/em&gt; of constituent parts. There's the project itself and its &lt;a href="https://docs.datasette.io/"&gt;documentation&lt;/a&gt; - 171 pages when exported to PDF and counting. Then there are the &lt;a href="https://datasette.io/plugins"&gt;48 plugins&lt;/a&gt;, &lt;a href="https://datasette.io/tools/sqlite-utils"&gt;sqlite-utils&lt;/a&gt; and &lt;a href="https://datasette.io/tools"&gt;21 more tools&lt;/a&gt; for creating SQLite databases, the &lt;a href="https://dogsheep.github.io/"&gt;Dogsheep&lt;/a&gt; collection and over three years of content I've written about the project &lt;a href="https://simonwillison.net/tags/datasette/"&gt;on my blog&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The new &lt;a href="https://datasette.io/-/beta"&gt;datasette.io search engine&lt;/a&gt; provides a faceted search interface to all of this material in one place. It currently searches across:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Every section of the latest documentation (415 total)&lt;/li&gt;
&lt;li&gt;48 plugin READMEs&lt;/li&gt;
&lt;li&gt;22 tool READMEs&lt;/li&gt;
&lt;li&gt;63 news items posted on the Datasette website&lt;/li&gt;
&lt;li&gt;212 items from my blog&lt;/li&gt;
&lt;li&gt;Release notes from 557 package releases&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I plan to extend it with more data sources in the future.&lt;/p&gt;
&lt;h4&gt;How it works: Dogsheep Beta&lt;/h4&gt;
&lt;p&gt;I'm reusing the search engine I originally built for my Dogsheep personal analytics project (see &lt;a href="https://simonwillison.net/2020/Nov/14/personal-data-warehouses/"&gt;Personal Data Warehouses: Reclaiming Your Data&lt;/a&gt;). I call that search engine &lt;a href="https://github.com/dogsheep/beta"&gt;Dogsheep Beta&lt;/a&gt;. The name is &lt;a href="https://datasette.substack.com/p/dogsheep-personal-analytics-with"&gt;a pun&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;SQLite has great &lt;a href="https://sqlite.org/fts5.html"&gt;full-text search&lt;/a&gt; built in, and I make extensive use of that in Datasette projects already. But out of the box it's not quite right for this kind of search engine that spans multiple different content types.&lt;/p&gt;
&lt;p&gt;The problem is relevance calculation. I wrote about this in &lt;a href="https://simonwillison.net/2019/Jan/7/exploring-search-relevance-algorithms-sqlite/"&gt;Exploring search relevance algorithms with SQLite&lt;/a&gt; - short version: query relevance is calculated using statistics against the whole corpus, so search terms that occur rarely in the overall corpus contribute a higher score than more common terms.&lt;/p&gt;
&lt;p&gt;This means that full-text ranking scores calculated against one table of data cannot be meaningfully compared to scores calculated independently against a separate table, as the corpus statistics used to calculate the rank will differ.&lt;/p&gt;
&lt;p&gt;To get usable scores, you need everything in a single table. That's what Dogsheep Beta does: it creates a new table, called &lt;code&gt;search_index&lt;/code&gt;, and copies searchable content from the other tables into that new table.&lt;/p&gt;
&lt;p&gt;This is analogous to how an external search index like Elasticsearch works: you store your data in the main database, then periodically update an index in Elasticsearch. It's the &lt;a href="https://2017.djangocon.us/talks/the-denormalized-query-engine-design-pattern/"&gt;denormalized query engine&lt;/a&gt; design pattern in action.&lt;/p&gt;
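A minimal sketch of the single-table idea, using Python's bundled sqlite3 and FTS5 (the schema and data here are illustrative, not Dogsheep Beta's actual schema):

```python
import sqlite3

# Copy searchable text from several source tables into one FTS5 table
# so relevance scores are calculated against a single shared corpus.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table releases (id integer primary key, body text);
create table docs (id integer primary key, body text);
create virtual table search_index using fts5(type, key, search_1);
""")
conn.execute("insert into releases values (1, 'datasette plugin release')")
conn.execute("insert into docs values (1, 'installing datasette')")

# One insert per content type populates the shared index
conn.execute("insert into search_index select 'release', id, body from releases")
conn.execute("insert into search_index select 'docs', id, body from docs")

# A single MATCH query now ranks both content types together
rows = conn.execute(
    "select type, key from search_index"
    " where search_index match 'datasette' order by rank"
).fetchall()
print(rows)
```

Because both content types live in the same FTS5 table, `order by rank` produces scores that are directly comparable across them.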
&lt;h4&gt;Configuring Dogsheep Beta&lt;/h4&gt;
&lt;p&gt;There are two components to Dogsheep Beta: a command-line tool for building a search index, and a Datasette plugin for providing an interface for running searches.&lt;/p&gt;
&lt;p&gt;Both of these run off a YAML configuration file, which defines the tables that should be indexed and also defines how those search results should be displayed.&lt;/p&gt;
&lt;p&gt;(Having one configuration file handle both indexing and display feels a little inelegant, but it's extremely productive for iterating on, so I'm letting that slide.)&lt;/p&gt;
&lt;p&gt;Here's the full &lt;a href="https://github.com/simonw/datasette.io/blob/0e3e839ca1efb19cd84100291789521c0a1a3561/templates/dogsheep-beta.yml"&gt;Dogsheep configuration for datasette.io&lt;/a&gt;. An annotated extract:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Index material in the content.db SQLite file&lt;/span&gt;
&lt;span class="pl-ent"&gt;content.db&lt;/span&gt;:
  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Define a search type called 'releases'&lt;/span&gt;
  &lt;span class="pl-ent"&gt;releases&lt;/span&gt;:
    &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Populate that search type by executing this SQL&lt;/span&gt;
    &lt;span class="pl-ent"&gt;sql&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;      select&lt;/span&gt;
&lt;span class="pl-s"&gt;        releases.id as key,&lt;/span&gt;
&lt;span class="pl-s"&gt;        repos.name || ' ' || releases.tag_name as title,&lt;/span&gt;
&lt;span class="pl-s"&gt;        releases.published_at as timestamp,&lt;/span&gt;
&lt;span class="pl-s"&gt;        releases.body as search_1,&lt;/span&gt;
&lt;span class="pl-s"&gt;        1 as is_public&lt;/span&gt;
&lt;span class="pl-s"&gt;      from&lt;/span&gt;
&lt;span class="pl-s"&gt;        releases&lt;/span&gt;
&lt;span class="pl-s"&gt;        join repos on releases.repo = repos.id&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;    &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; When displaying a search result, use this SQL to&lt;/span&gt;
    &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; return extra details about the item&lt;/span&gt;
    &lt;span class="pl-ent"&gt;display_sql&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;      select&lt;/span&gt;
&lt;span class="pl-s"&gt;        -- highlight() is a custom SQL function&lt;/span&gt;
&lt;span class="pl-s"&gt;        highlight(render_markdown(releases.body), :q) as snippet,&lt;/span&gt;
&lt;span class="pl-s"&gt;        html_url&lt;/span&gt;
&lt;span class="pl-s"&gt;      from releases where id = :key&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;    &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Jinja template fragment to display the result&lt;/span&gt;
    &lt;span class="pl-ent"&gt;display&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;      &amp;lt;h3&amp;gt;Release: &amp;lt;a href="{{ display.html_url }}"&amp;gt;{{ title }}&amp;lt;/a&amp;gt;&amp;lt;/h3&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;      &amp;lt;p&amp;gt;{{ display.snippet|safe }}&amp;lt;/p&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;      &amp;lt;p&amp;gt;&amp;lt;small&amp;gt;Released {{ timestamp }}&amp;lt;/small&amp;gt;&amp;lt;/p&amp;gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The core pattern here is the &lt;code&gt;sql:&lt;/code&gt; key, which defines a SQL query that must return the following columns:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;key&lt;/code&gt; - a unique identifier for this search item&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;title&lt;/code&gt; - a title for this indexed document&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;timestamp&lt;/code&gt; - a timestamp for when it was created. May be null.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;search_1&lt;/code&gt; - text to be searched. I may add support for &lt;code&gt;search_2&lt;/code&gt; and &lt;code&gt;search_3&lt;/code&gt; later on to store text that will be treated with a lower relevance score.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;is_public&lt;/code&gt; - whether this should be considered "public" data. This is a holdover from Dogsheep Beta's application for personal analytics; I don't actually need it for datasette.io.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To create an index, run the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;dogsheep-beta index dogsheep-index.db dogsheep-config.yml
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;index&lt;/code&gt; command will loop through every configured search type in the YAML file, execute the SQL query and use it to populate a &lt;code&gt;search_index&lt;/code&gt; table in the &lt;code&gt;dogsheep-index.db&lt;/code&gt; SQLite database file.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://datasette.io/dogsheep-index/search_index"&gt;the search_index table&lt;/a&gt; for &lt;a href="https://datasette.io/"&gt;datasette.io&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;When you run a search, the plugin queries that table and gets back results sorted by relevance (or other sort criteria, if specified).&lt;/p&gt;
&lt;p&gt;To display the results, it loops through each one and uses the Jinja template fragment from the configuration file to turn it into HTML.&lt;/p&gt;
&lt;p&gt;If a &lt;code&gt;display_sql:&lt;/code&gt; query is defined, that query will be executed for each result to populate the &lt;code&gt;{{ display }}&lt;/code&gt; object made available to the template. &lt;a href="https://www.sqlite.org/np1queryprob.html"&gt;Many Small Queries Are Efficient In SQLite&lt;/a&gt;.&lt;/p&gt;
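As a sketch (table and column names invented for illustration), the per-result `display_sql` step looks roughly like this:

```python
import sqlite3

# Illustrative sketch of the display_sql pattern: run one small query
# per search result to fetch extra fields for display.
conn = sqlite3.connect(":memory:")
conn.execute("create table releases (id integer primary key, html_url text, body text)")
conn.execute("insert into releases values (1, 'https://example.com/r1', 'First release')")

search_results = [{"key": 1, "title": "demo 0.1"}]
display_sql = "select html_url, body from releases where id = :key"

for result in search_results:
    row = conn.execute(display_sql, {"key": result["key"]}).fetchone()
    # These columns become the {{ display }} object in the template
    result["display"] = {"html_url": row[0], "body": row[1]}
```

One query per result would be a performance problem against a networked database server, but against a local SQLite file each query is just a function call.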
&lt;h4&gt;Search term highlighting&lt;/h4&gt;
&lt;p&gt;I spent &lt;a href="https://github.com/simonw/datasette.io/issues/49"&gt;a bit of time&lt;/a&gt; thinking about search highlighting. SQLite has an implementation of highlighting built in - &lt;a href="https://sqlite.org/fts5.html#the_snippet_function"&gt;the snippet() function&lt;/a&gt; - but it's not designed to be HTML-aware so there's a risk it might mangle HTML by adding highlighting marks in the middle of a tag or attribute.&lt;/p&gt;
&lt;p&gt;I ended up borrowing a BSD-licensed &lt;a href="https://github.com/django-haystack/django-haystack/blob/v3.0/haystack/utils/highlighting.py"&gt;highlighting class&lt;/a&gt; from the &lt;a href="https://github.com/django-haystack/django-haystack"&gt;django-haystack&lt;/a&gt; project. It deals with HTML by stripping tags, which seems to be more-or-less what Google do for their own search results, so I figured that's good enough for me.&lt;/p&gt;
&lt;p&gt;I used this &lt;a href="https://github.com/simonw/datasette.io/blob/0e3e839ca1efb19cd84100291789521c0a1a3561/plugins/sql_functions.py"&gt;one-off site plugin&lt;/a&gt; to wrap the highlighting code in a custom SQLite function. This meant I could call it from the &lt;code&gt;display_sql:&lt;/code&gt; query in the Dogsheep Beta YAML configuration.&lt;/p&gt;
&lt;p&gt;A custom template tag would be more elegant, but I don't yet have a way to expose custom template tags to the Dogsheep Beta rendering mechanism.&lt;/p&gt;
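The strip-then-highlight idea can be sketched in a few lines of Python (this is the general approach, not the actual django-haystack class):

```python
import re

def highlight(html, query):
    """Strip tags first so highlight markers can't land inside a tag or attribute."""
    text = re.sub(r"<[^>]+>", "", html)
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    return pattern.sub(lambda m: "<strong>" + m.group(0) + "</strong>", text)

print(highlight("<p>Search with <em>ripgrep</em> is fast</p>", "ripgrep"))
# → Search with <strong>ripgrep</strong> is fast
```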
&lt;h4&gt;Build, index, deploy&lt;/h4&gt;
&lt;p&gt;The Datasette website implements the Baked Data pattern, where the content is compiled into SQLite database files and bundled with the application code itself as part of the deploy.&lt;/p&gt;
&lt;p&gt;Building the index is just another step of that process.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://github.com/simonw/datasette.io/blob/0e3e839ca1efb19cd84100291789521c0a1a3561/.github/workflows/deploy.yml"&gt;the deploy.yml&lt;/a&gt; GitHub workflow used by the site. It roughly does the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Download the current version of the &lt;a href="https://datasette.io/content"&gt;content.db&lt;/a&gt; database file. This is so it doesn't have to re-fetch release and README content that was previously stored there.&lt;/li&gt;
&lt;li&gt;Download the current version of &lt;a href="https://datasette.io/blog"&gt;blog.db&lt;/a&gt;, with entries from my blog. This means I don't have to fetch all entries, just the new ones.&lt;/li&gt;
&lt;li&gt;Run &lt;a href="https://github.com/simonw/datasette.io/blob/0e3e839ca1efb19cd84100291789521c0a1a3561/build_directory.py"&gt;build_directory.py&lt;/a&gt;, the script which fetches data for the plugins and tools pages.
&lt;ul&gt;
&lt;li&gt;This hits the GitHub GraphQL API to find new repositories tagged &lt;code&gt;datasette-io&lt;/code&gt; and &lt;code&gt;datasette-plugin&lt;/code&gt; and &lt;code&gt;datasette-tool&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;That GraphQL query also returns the most recent release. The script then checks to see if those releases have previously been fetched and, if not, uses &lt;a href="https://datasette.io/tools/github-to-sqlite"&gt;github-to-sqlite&lt;/a&gt; to fetch them.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Imports the data from &lt;a href="https://github.com/simonw/datasette.io/blob/0e3e839ca1efb19cd84100291789521c0a1a3561/news.yaml"&gt;news.yaml&lt;/a&gt; into a &lt;code&gt;news&lt;/code&gt; table using &lt;a href="https://datasette.io/tools/yaml-to-sqlite"&gt;yaml-to-sqlite&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Imports the latest PyPI download statistics for my packages from my &lt;a href="https://github.com/simonw/package-stats"&gt;simonw/package-stats&lt;/a&gt; repository, which implements &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;git scraping&lt;/a&gt; against the most excellent &lt;a href="https://pypistats.org/"&gt;pypistats.org&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Runs the &lt;code&gt;dogsheep-beta index&lt;/code&gt; command to build a &lt;code&gt;dogsheep-index.db&lt;/code&gt; search index.&lt;/li&gt;
&lt;li&gt;Runs some soundness checks, e.g. &lt;code&gt;datasette . --get "/plugins"&lt;/code&gt;, to verify that Datasette is likely to return a 200 status code for critical pages once published.&lt;/li&gt;
&lt;li&gt;Uses &lt;code&gt;datasette publish cloudrun&lt;/code&gt; to deploy the results to Google Cloud Run, which hosts the website.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I love building websites this way. You can have as much complexity as you like in the build script (my TIL website build script &lt;a href="https://simonwillison.net/2020/Sep/3/weeknotes-airtable-screenshots-dogsheep/#weeknotes-2020-09-03-social-media-cards-tils"&gt;generates screenshots using Puppeteer&lt;/a&gt;) but the end result is some simple database files running on inexpensive, immutable, scalable hosting.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/search"&gt;search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dogsheep"&gt;dogsheep&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cloudrun"&gt;cloudrun&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/baked-data"&gt;baked-data&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="projects"/><category term="search"/><category term="sqlite"/><category term="datasette"/><category term="dogsheep"/><category term="weeknotes"/><category term="cloudrun"/><category term="baked-data"/></entry><entry><title>datasette.io, an official project website for Datasette</title><link href="https://simonwillison.net/2020/Dec/13/datasette-io/#atom-tag" rel="alternate"/><published>2020-12-13T08:34:44+00:00</published><updated>2020-12-13T08:34:44+00:00</updated><id>https://simonwillison.net/2020/Dec/13/datasette-io/#atom-tag</id><summary type="html">
    &lt;p&gt;This week I launched &lt;a href="https://datasette.io/"&gt;datasette.io&lt;/a&gt; - the new official project website for Datasette.&lt;/p&gt;
&lt;p&gt;Datasette's first open source release was &lt;a href="https://simonwillison.net/2017/Nov/13/datasette/"&gt;just over three years ago&lt;/a&gt;, but until now the official site duties have been split between the &lt;a href="https://github.com/simonw/datasette"&gt;GitHub repository&lt;/a&gt; and &lt;a href="https://docs.datasette.io/"&gt;the documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img style="max-width: 100%" alt="A screenshot of datasette.io" src="https://static.simonwillison.net/static/2020/datasette-io.png" /&gt;&lt;/p&gt;

&lt;h4&gt;The Baked Data architectural pattern&lt;/h4&gt;
&lt;p&gt;The site itself is built on Datasette (&lt;a href="https://github.com/simonw/datasette.io"&gt;source code here&lt;/a&gt;). I'm using a pattern that I &lt;a href="https://simonwillison.net/2019/Oct/28/niche-museums-kepler/"&gt;first started exploring&lt;/a&gt; with &lt;a href="https://www.niche-museums.com/"&gt;Niche Museums&lt;/a&gt;: most of the site content lives in a SQLite database, and I use custom Jinja templates to implement the site's different pages.&lt;/p&gt;
&lt;p&gt;This is effectively a variant of the static site generator pattern. The SQLite database is built by scripts as part of the deploy process, then deployed to Google Cloud Run as a binary asset bundled with the templates and Datasette itself.&lt;/p&gt;
&lt;p&gt;I call this the &lt;strong&gt;Baked Data&lt;/strong&gt; architectural pattern - with credit to Kevin Marks for &lt;a href="https://chat.indieweb.org/dev/2020-11-15#t1605478365993400"&gt;helping me&lt;/a&gt; coin the right term. You bake the data into the application.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Update: I wrote more about this in July 2021: &lt;a href="https://simonwillison.net/2021/Jul/28/baked-data/"&gt;The Baked Data architectural pattern&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;It's comparable to static site generation because everything is immutable, which greatly reduces the amount of things that can go wrong - and any content changes require a fresh deploy. It's extremely easy to scale - just run more copies of the application with the bundled copy of the database. Cloud Run and other serverless providers handle that kind of scaling automatically.&lt;/p&gt;
&lt;p&gt;Unlike static site generation, if a site has a thousand pages you don't need to build a thousand HTML pages in order to deploy. A single template and a SQL query that incorporates arguments from the URL can serve as many pages as there are records in the database.&lt;/p&gt;
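To illustrate the point (with an invented table and template, not the site's actual code), one template plus one parameterized query can stand in for an arbitrary number of pages:

```python
import sqlite3

# Toy sketch of Baked Data page serving: the URL argument feeds a SQL
# query, so adding a row adds a page with no per-page build step.
conn = sqlite3.connect(":memory:")
conn.execute("create table museums (slug text primary key, name text)")
conn.executemany("insert into museums values (?, ?)", [
    ("bigfoot", "Bigfoot Discovery Museum"),
    ("pez", "Burlingame Museum of PEZ Memorabilia"),
])
TEMPLATE = "<h1>{name}</h1>"

def page_for(slug):
    row = conn.execute(
        "select name from museums where slug = ?", (slug,)
    ).fetchone()
    return TEMPLATE.format(name=row[0]) if row else None
```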
&lt;h4&gt;How the site is built&lt;/h4&gt;
&lt;p&gt;You can browse the site's underlying database tables in Datasette &lt;a href="https://datasette.io/content"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://datasette.io/content/news"&gt;news table&lt;/a&gt; powers the latest news on the homepage and &lt;a href="https://datasette.io/news"&gt;/news&lt;/a&gt;. News lives in a &lt;a href="https://github.com/simonw/datasette.io/blob/main/news.yaml"&gt;news.yaml file&lt;/a&gt; in the site's GitHub repository. I wrote &lt;a href="https://gist.github.com/simonw/6a59833eee83bec1f1317c7f80406275"&gt;a script&lt;/a&gt; to import the news that had been accumulating in &lt;a href="https://github.com/simonw/datasette/blob/0.52.5/README.md"&gt;the 0.52 README&lt;/a&gt; - now that news has moved to the site the README is a lot more slim!&lt;/p&gt;
&lt;p&gt;At build time my &lt;a href="https://github.com/simonw/yaml-to-sqlite"&gt;yaml-to-sqlite&lt;/a&gt; script runs to load that news content into a database table.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/simonw/datasette.io/blob/72e3b6a470d2543dede13d51a61a22180b5c96f9/templates/index.html#L78-L85"&gt;index.html&lt;/a&gt; template then uses the following Jinja code to output the latest news stories, using the &lt;code&gt;sql()&lt;/code&gt; function from the &lt;a href="https://github.com/simonw/datasette-template-sql"&gt;datasette-template-sql&lt;/a&gt; Datasette plugin:&lt;/p&gt;
&lt;div class="highlight highlight-text-html-django"&gt;&lt;pre&gt;&lt;span class="pl-e"&gt;{%&lt;/span&gt; &lt;span class="pl-s"&gt;set&lt;/span&gt; &lt;span class="pl-s"&gt;ns&lt;/span&gt; = &lt;span class="pl-s"&gt;namespace&lt;/span&gt;(&lt;span class="pl-s"&gt;current_date&lt;/span&gt;=&lt;span class="pl-s"&gt;""&lt;/span&gt;) &lt;span class="pl-e"&gt;%}&lt;/span&gt;
&lt;span class="pl-e"&gt;{%&lt;/span&gt; &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s"&gt;row&lt;/span&gt; &lt;span class="pl-k"&gt;in&lt;/span&gt; &lt;span class="pl-s"&gt;sql&lt;/span&gt;(&lt;span class="pl-s"&gt;"select date, body from news order by date desc limit 15"&lt;/span&gt;, &lt;span class="pl-s"&gt;database&lt;/span&gt;=&lt;span class="pl-s"&gt;"content"&lt;/span&gt;) &lt;span class="pl-e"&gt;%}&lt;/span&gt;
    &lt;span class="pl-e"&gt;{%&lt;/span&gt; &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-s"&gt;prettydate&lt;/span&gt;(&lt;span class="pl-s"&gt;row&lt;/span&gt;[&lt;span class="pl-s"&gt;"date"&lt;/span&gt;]) != (&lt;span class="pl-s"&gt;ns&lt;/span&gt;.&lt;span class="pl-s"&gt;current_date&lt;/span&gt; &lt;span class="pl-k"&gt;and&lt;/span&gt; &lt;span class="pl-s"&gt;prettydate&lt;/span&gt;(&lt;span class="pl-s"&gt;ns&lt;/span&gt;.&lt;span class="pl-s"&gt;current_date&lt;/span&gt;)) &lt;span class="pl-e"&gt;%}&lt;/span&gt;
    &amp;lt;&lt;span class="pl-ent"&gt;h3&lt;/span&gt;&amp;gt;{{ prettydate(row["date&lt;span class="pl-smi"&gt;"]&lt;/span&gt;) }} &amp;lt;&lt;span class="pl-ent"&gt;a&lt;/span&gt; &lt;span class="pl-e"&gt;href&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;/news/{{ row[&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-e"&gt;date&lt;/span&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;] }}&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;style&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-s1"&gt;&lt;span class="pl-c1"&gt;&lt;span class="pl-c1"&gt;font-size&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;0.8&lt;span class="pl-k"&gt;em&lt;/span&gt;&lt;/span&gt;; &lt;span class="pl-c1"&gt;&lt;span class="pl-c1"&gt;opacity&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;0.4&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;#&amp;lt;/&lt;span class="pl-ent"&gt;a&lt;/span&gt;&amp;gt;&amp;lt;/&lt;span class="pl-ent"&gt;h3&lt;/span&gt;&amp;gt;
    &lt;span class="pl-e"&gt;{%&lt;/span&gt; &lt;span class="pl-s"&gt;set&lt;/span&gt; &lt;span class="pl-s"&gt;ns&lt;/span&gt;.&lt;span class="pl-s"&gt;current_date&lt;/span&gt; = &lt;span class="pl-s"&gt;prettydate&lt;/span&gt;(&lt;span class="pl-s"&gt;row&lt;/span&gt;[&lt;span class="pl-s"&gt;"date"&lt;/span&gt;]) &lt;span class="pl-e"&gt;%}&lt;/span&gt;
    &lt;span class="pl-e"&gt;{%&lt;/span&gt; &lt;span class="pl-k"&gt;endif&lt;/span&gt; &lt;span class="pl-e"&gt;%}&lt;/span&gt;
    {{ render_markdown(row["body&lt;span class="pl-smi"&gt;"]&lt;/span&gt;) }}
&lt;span class="pl-e"&gt;{%&lt;/span&gt; &lt;span class="pl-k"&gt;endfor&lt;/span&gt; &lt;span class="pl-e"&gt;%}&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;code&gt;prettydate()&lt;/code&gt; is a custom function I wrote in &lt;a href="https://github.com/simonw/datasette.io/blob/353f1f940cb0cb3c38d3bdf6e345328740990702/plugins/dateformat.py"&gt;a one-off plugin&lt;/a&gt; for the site. The &lt;code&gt;namespace()&lt;/code&gt; stuff is a Jinja trick that lets me keep track of the current date heading in the loop, so I can output a new date heading only if the news item occurs on a different day from the previous one.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;render_markdown()&lt;/code&gt; is provided by the &lt;a href="https://github.com/simonw/datasette-render-markdown"&gt;datasette-render-markdown&lt;/a&gt; plugin.&lt;/p&gt;
&lt;p&gt;I wanted permalinks for news stories, but since they don't have identifiers or titles I decided to provide a page for each day instead - for example &lt;a href="https://datasette.io/news/2020-12-10"&gt;https://datasette.io/news/2020-12-10&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;These pages are implemented using &lt;a href="https://simonwillison.net/2020/Sep/15/datasette-0-49/#path-parameters-custom-page-templates"&gt;Path parameters for custom page templates&lt;/a&gt;, introduced in Datasette 0.49. The implementation is a single template file at &lt;a href="https://github.com/simonw/datasette.io/blob/353f1f940cb0cb3c38d3bdf6e345328740990702/templates/pages/news/%7Byyyy%7D-%7Bmm%7D-%7Bdd%7D.html"&gt;templates/pages/news/{yyyy}-{mm}-{dd}.html&lt;/a&gt;, the full contents of which is:&lt;/p&gt;
&lt;div class="highlight highlight-text-html-django"&gt;&lt;pre&gt;&lt;span class="pl-e"&gt;{%&lt;/span&gt; &lt;span class="pl-k"&gt;extends&lt;/span&gt; &lt;span class="pl-s"&gt;"page_base.html"&lt;/span&gt; &lt;span class="pl-e"&gt;%}&lt;/span&gt;

&lt;span class="pl-e"&gt;{%&lt;/span&gt; &lt;span class="pl-k"&gt;block&lt;/span&gt; &lt;span class="pl-s"&gt;title&lt;/span&gt; &lt;span class="pl-e"&gt;%}&lt;/span&gt;Datasette News: {{ prettydate(yyyy + "-" + mm + "-" + dd) }}&lt;span class="pl-e"&gt;{%&lt;/span&gt; &lt;span class="pl-k"&gt;endblock&lt;/span&gt; &lt;span class="pl-e"&gt;%}&lt;/span&gt;

&lt;span class="pl-e"&gt;{%&lt;/span&gt; &lt;span class="pl-k"&gt;block&lt;/span&gt; &lt;span class="pl-s"&gt;content&lt;/span&gt; &lt;span class="pl-e"&gt;%}&lt;/span&gt;

&lt;span class="pl-e"&gt;{%&lt;/span&gt; &lt;span class="pl-s"&gt;set&lt;/span&gt; &lt;span class="pl-s"&gt;stories&lt;/span&gt; = &lt;span class="pl-s"&gt;sql&lt;/span&gt;(&lt;span class="pl-s"&gt;"select date, body from news where date = ? order by date desc"&lt;/span&gt;, [&lt;span class="pl-s"&gt;yyyy&lt;/span&gt; + &lt;span class="pl-s"&gt;"-"&lt;/span&gt; + &lt;span class="pl-s"&gt;mm&lt;/span&gt; + &lt;span class="pl-s"&gt;"-"&lt;/span&gt; + &lt;span class="pl-s"&gt;dd&lt;/span&gt;], &lt;span class="pl-s"&gt;database&lt;/span&gt;=&lt;span class="pl-s"&gt;"content"&lt;/span&gt;) &lt;span class="pl-e"&gt;%}&lt;/span&gt;
&lt;span class="pl-e"&gt;{%&lt;/span&gt; &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-k"&gt;not&lt;/span&gt; &lt;span class="pl-s"&gt;stories&lt;/span&gt; &lt;span class="pl-e"&gt;%}&lt;/span&gt;
    {{ raise_404("News not found") }}
&lt;span class="pl-e"&gt;{%&lt;/span&gt; &lt;span class="pl-k"&gt;endif&lt;/span&gt; &lt;span class="pl-e"&gt;%}&lt;/span&gt;
&amp;lt;&lt;span class="pl-ent"&gt;h1&lt;/span&gt;&amp;gt;&amp;lt;&lt;span class="pl-ent"&gt;a&lt;/span&gt; &lt;span class="pl-e"&gt;href&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;/news&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;News&amp;lt;/&lt;span class="pl-ent"&gt;a&lt;/span&gt;&amp;gt;: {{ prettydate(yyyy + "-" + mm + "-" + dd) }}&amp;lt;/&lt;span class="pl-ent"&gt;h1&lt;/span&gt;&amp;gt;

&lt;span class="pl-e"&gt;{%&lt;/span&gt; &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s"&gt;row&lt;/span&gt; &lt;span class="pl-k"&gt;in&lt;/span&gt; &lt;span class="pl-s"&gt;stories&lt;/span&gt; &lt;span class="pl-e"&gt;%}&lt;/span&gt;
    {{ render_markdown(row["body&lt;span class="pl-smi"&gt;"]&lt;/span&gt;) }}
&lt;span class="pl-e"&gt;{%&lt;/span&gt; &lt;span class="pl-k"&gt;endfor&lt;/span&gt; &lt;span class="pl-e"&gt;%}&lt;/span&gt;

&lt;span class="pl-e"&gt;{%&lt;/span&gt; &lt;span class="pl-k"&gt;endblock&lt;/span&gt; &lt;span class="pl-e"&gt;%}&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The crucial trick here is that, because the filename is &lt;code&gt;news/{yyyy}-{mm}-{dd}.html&lt;/code&gt;, a request to &lt;code&gt;/news/2020-12-10&lt;/code&gt; will render that template with the &lt;code&gt;yyyy&lt;/code&gt;, &lt;code&gt;mm&lt;/code&gt; and &lt;code&gt;dd&lt;/code&gt; template variables set to those values from the URL.&lt;/p&gt;
&lt;p&gt;It can then execute a SQL query that incorporates those values. It assigns the results to a &lt;code&gt;stories&lt;/code&gt; variable, then checks that at least one story was returned - if not, it raises a 404 error.&lt;/p&gt;
&lt;p&gt;See Datasette's &lt;a href="https://docs.datasette.io/en/stable/custom_templates.html#custom-pages"&gt;custom pages documentation&lt;/a&gt; for more details on how this all works.&lt;/p&gt;
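A rough sketch of how a filename like that could be turned into a URL pattern (Datasette's real routing is more involved than this):

```python
import re

def pattern_from_filename(filename):
    # Turn each {name} in the template filename into a named capture group
    regex = re.sub(r"\{(\w+)\}", r"(?P<\1>[^/-]+)", filename.removesuffix(".html"))
    return re.compile("^" + regex + "$")

pattern = pattern_from_filename("news/{yyyy}-{mm}-{dd}.html")
print(pattern.match("news/2020-12-10").groupdict())
# → {'yyyy': '2020', 'mm': '12', 'dd': '10'}
```

The captured groups are what end up as the `yyyy`, `mm` and `dd` variables available to the template.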
&lt;p&gt;The site also offers an &lt;a href="https://datasette.io/content/feed.atom"&gt;Atom feed&lt;/a&gt; of recent news. This is powered by the &lt;a href="https://github.com/simonw/datasette-atom"&gt;datasette-atom&lt;/a&gt; using the output of &lt;a href="https://datasette.io/content/feed"&gt;this canned SQL query&lt;/a&gt;, with a &lt;code&gt;render_markdown()&lt;/code&gt; SQL function provided by &lt;a href="https://github.com/simonw/datasette.io/blob/353f1f940cb0cb3c38d3bdf6e345328740990702/plugins/sql_functions.py"&gt;this site plugin&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;The plugin directory&lt;/h4&gt;
&lt;p&gt;One of the features I'm most excited about on the site is the new &lt;a href="https://datasette.io/plugins"&gt;Datasette plugin directory&lt;/a&gt;. Datasette has over 50 plugins now and I've been wanting a definitive directory of them for a while.&lt;/p&gt;
&lt;p&gt;It's pretty basic at the moment, offering a list of plugins plus simple &lt;code&gt;LIKE&lt;/code&gt; based search, but I plan to expand it a great deal in the future.&lt;/p&gt;
&lt;p&gt;The fun part is where the data comes from. For a couple of years now I've been using GitHub topics to tag my plugins - I tag them with &lt;code&gt;datasette-plugin&lt;/code&gt;, and the ones that I planned to feature on the site when I finally launched it were also tagged with &lt;code&gt;datasette-io&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;datasette.io&lt;/code&gt; deployment process runs a script called &lt;a href="https://github.com/simonw/datasette.io/blob/353f1f940cb0cb3c38d3bdf6e345328740990702/build_plugin_directory.py"&gt;build_plugin_directory.py&lt;/a&gt;, which uses a GraphQL query against the GitHub search API to find all repositories belonging to me that have been tagged with those tags.&lt;/p&gt;
&lt;p&gt;That GraphQL query looks like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-graphql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;query&lt;/span&gt; {
  &lt;span class="pl-v"&gt;search&lt;/span&gt;(&lt;span class="pl-v"&gt;query&lt;/span&gt;:&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;topic:datasette-io topic:datasette-plugin user:simonw&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-v"&gt;type&lt;/span&gt;:&lt;span class="pl-c1"&gt;REPOSITORY&lt;/span&gt;, &lt;span class="pl-v"&gt;first&lt;/span&gt;:&lt;span class="pl-c1"&gt;100&lt;/span&gt;) {
    &lt;span class="pl-v"&gt;repositoryCount&lt;/span&gt;
    &lt;span class="pl-v"&gt;nodes&lt;/span&gt; {
      &lt;span class="pl-k"&gt;...&lt;/span&gt; &lt;span class="pl-k"&gt;on&lt;/span&gt; &lt;span class="pl-c1"&gt;Repository&lt;/span&gt; {
        &lt;span class="pl-v"&gt;id&lt;/span&gt;
        &lt;span class="pl-v"&gt;nameWithOwner&lt;/span&gt;
        &lt;span class="pl-v"&gt;openGraphImageUrl&lt;/span&gt;
        &lt;span class="pl-v"&gt;usesCustomOpenGraphImage&lt;/span&gt;
        &lt;span class="pl-v"&gt;repositoryTopics&lt;/span&gt;(&lt;span class="pl-v"&gt;first&lt;/span&gt;:&lt;span class="pl-c1"&gt;100&lt;/span&gt;) {
          &lt;span class="pl-v"&gt;totalCount&lt;/span&gt;
          &lt;span class="pl-v"&gt;nodes&lt;/span&gt; {
            &lt;span class="pl-v"&gt;topic&lt;/span&gt; {
              &lt;span class="pl-v"&gt;name&lt;/span&gt;
            }
          }
        }
        &lt;span class="pl-s"&gt;openIssueCount&lt;/span&gt;: &lt;span class="pl-v"&gt;issues&lt;/span&gt;(&lt;span class="pl-v"&gt;states&lt;/span&gt;:[&lt;span class="pl-c1"&gt;OPEN&lt;/span&gt;]) {
          &lt;span class="pl-v"&gt;totalCount&lt;/span&gt;
        }
        &lt;span class="pl-s"&gt;closedIssueCount&lt;/span&gt;: &lt;span class="pl-v"&gt;issues&lt;/span&gt;(&lt;span class="pl-v"&gt;states&lt;/span&gt;:[&lt;span class="pl-c1"&gt;CLOSED&lt;/span&gt;]) {
          &lt;span class="pl-v"&gt;totalCount&lt;/span&gt;
        }
        &lt;span class="pl-v"&gt;releases&lt;/span&gt;(&lt;span class="pl-v"&gt;last&lt;/span&gt;: &lt;span class="pl-c1"&gt;1&lt;/span&gt;) {
          &lt;span class="pl-v"&gt;totalCount&lt;/span&gt;
          &lt;span class="pl-v"&gt;nodes&lt;/span&gt; {
            &lt;span class="pl-v"&gt;tagName&lt;/span&gt;
          }
        }
      }
    }
  }
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It fetches the name of each repository, the &lt;code&gt;openGraphImageUrl&lt;/code&gt; (which doesn't appear to be included in the regular GitHub REST API), the number of open and closed issues and details of the most recent release.&lt;/p&gt;
&lt;p&gt;The script has access to a copy of the current site database, which is downloaded on each deploy by the build script. It uses this to check if any of the repositories have new releases that haven't previously been seen by the script.&lt;/p&gt;
&lt;p&gt;Then it runs the &lt;code&gt;github-to-sqlite releases&lt;/code&gt; command (part of &lt;a href="https://github.com/dogsheep/github-to-sqlite"&gt;github-to-sqlite&lt;/a&gt;) to fetch details of those new releases.&lt;/p&gt;
&lt;p&gt;The end result is a database of repositories and releases for all of my tagged plugins. The plugin directory is then built against a &lt;a href="https://datasette.io/content/plugins"&gt;custom SQL view&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;Other site content&lt;/h4&gt;
&lt;p&gt;The rest of the site content is mainly static template files. I use the &lt;code&gt;render_markdown()&lt;/code&gt; function inline in some of them so I can author in Markdown rather than HTML - here's &lt;a href="https://github.com/simonw/datasette.io/blob/main/templates/pages/examples.html"&gt;the template&lt;/a&gt; for the &lt;a href="https://datasette.io/examples"&gt;/examples page&lt;/a&gt;. The various &lt;a href="https://datasette.io/for"&gt;Use cases for Datasette&lt;/a&gt; pages are likewise built as static templates.&lt;/p&gt;
&lt;h4 id="sqlite-utils-analyze-tables"&gt;Also this week: sqlite-utils analyze-tables&lt;/h4&gt;
&lt;p&gt;My other big project this week has involved building out a Datasette instance for a client. I'm working with over 5,000,000 rows of CSV data for this, which has been a great opportunity to push the limits of some of my tools.&lt;/p&gt;
&lt;p&gt;Any time I'm working with new data I like to get a feel for its general shape. Having imported 5,000,000 rows with dozens of columns into a database, what can I learn about the columns beyond just browsing them in Datasette?&lt;/p&gt;
&lt;p&gt;&lt;code&gt;sqlite-utils analyze-tables&lt;/code&gt; (&lt;a href="https://sqlite-utils.readthedocs.io/en/stable/cli.html#analyzing-tables"&gt;documented here&lt;/a&gt;) is my new tool for doing just that. It loops through every table and every column in the database, and for each column it calculates statistics that include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The total number of distinct values&lt;/li&gt;
&lt;li&gt;The total number of null or blank values&lt;/li&gt;
&lt;li&gt;For non-distinct columns, the 10 most common and 10 least common values&lt;/li&gt;
&lt;/ul&gt;
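&lt;p&gt;A minimal sketch of those per-column statistics - illustrative code only, not the actual &lt;code&gt;sqlite-utils&lt;/code&gt; implementation, which works differently - looks something like this:&lt;/p&gt;

```python
import sqlite3
from collections import Counter


def analyze_column(conn, table, column, common_limit=10):
    # Sketch of per-column statistics in the spirit of
    # `sqlite-utils analyze-tables`: count distinct values, count
    # null/blank values, and find the most and least common values.
    values = [row[0] for row in conn.execute(f"select [{column}] from [{table}]")]
    counts = Counter(values)
    ordered = counts.most_common()  # sorted most common first
    return {
        "total_rows": len(values),
        "num_distinct": len(counts),
        "num_null_or_blank": sum(1 for v in values if v is None or v == ""),
        "most_common": ordered[:common_limit],
        "least_common": ordered[-common_limit:][::-1],
    }
```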
&lt;p&gt;It can output those to the terminal, or if you add the &lt;code&gt;--save&lt;/code&gt; option it will also save them to a SQLite table called &lt;code&gt;_analyze_tables_&lt;/code&gt; - here's &lt;a href="https://github-to-sqlite.dogsheep.net/github/_analyze_tables_"&gt;that table&lt;/a&gt; for my github-to-sqlite demo instance.&lt;/p&gt;
&lt;p&gt;I can then use the output of the tool to figure out which columns might be a primary key, or which ones warrant being extracted out into a separate lookup table using &lt;a href="https://simonwillison.net/2020/Sep/23/sqlite-utils-extract/"&gt;sqlite-utils extract&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I expect I'll be expanding this feature a lot in the future, but I'm already finding it to be really helpful.&lt;/p&gt;
&lt;h4&gt;Datasette 0.53&lt;/h4&gt;
&lt;p&gt;I pushed out a small feature release of Datasette to accompany the new project website. Quoting &lt;a href="https://docs.datasette.io/en/stable/changelog.html#v0-53"&gt;the release notes&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;New &lt;code&gt;?column__arraynotcontains=&lt;/code&gt; table filter. (&lt;a href="https://github.com/simonw/datasette/issues/1132"&gt;#1132&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;datasette serve&lt;/code&gt; has a new &lt;code&gt;--create&lt;/code&gt; option, which will create blank database files if they do not already exist rather than exiting with an error. (&lt;a href="https://github.com/simonw/datasette/issues/1135"&gt;#1135&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;New &lt;code&gt;?_header=off&lt;/code&gt; option for CSV export which omits the CSV header row, &lt;a href="https://docs.datasette.io/en/stable/csv_export.html#csv-export-url-parameters"&gt;documented here&lt;/a&gt;. (&lt;a href="https://github.com/simonw/datasette/issues/1133"&gt;#1133&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;"Powered by Datasette" link in the footer now links to &lt;a href="https://datasette.io/"&gt;https://datasette.io/&lt;/a&gt;. (&lt;a href="https://github.com/simonw/datasette/issues/1138"&gt;#1138&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Project news no longer lives in the README - it can now be found at &lt;a href="https://datasette.io/news"&gt;https://datasette.io/news&lt;/a&gt;. (&lt;a href="https://github.com/simonw/datasette/issues/1137"&gt;#1137&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;h4&gt;Office hours&lt;/h4&gt;

&lt;p&gt;I had my first round of &lt;a href="https://calendly.com/swillison/datasette-office-hours"&gt;Datasette office hours&lt;/a&gt; on Friday - 20-minute video chats with anyone who wants to talk to me about the project. I had five great conversations - it's hard to overstate how thrilling it is to talk to people who are using Datasette to solve problems. If you're an open source maintainer I can thoroughly recommend giving this format a try.&lt;/p&gt;

&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-publish-fly"&gt;datasette-publish-fly&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-publish-fly/releases/tag/1.0.1"&gt;1.0.1&lt;/a&gt; - 2020-12-12&lt;br /&gt;
Datasette plugin for publishing data using Fly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-auth-passwords"&gt;datasette-auth-passwords&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-auth-passwords/releases/tag/0.3.3"&gt;0.3.3&lt;/a&gt; - 2020-12-11&lt;br /&gt;
Datasette plugin for authentication using passwords&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette"&gt;datasette&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette/releases/tag/0.53"&gt;0.53&lt;/a&gt; -  - 2020-12-11&lt;br /&gt;
An open source multi-tool for exploring and publishing data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-column-inspect"&gt;datasette-column-inspect&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-column-inspect/releases/tag/0.2a"&gt;0.2a&lt;/a&gt; - 2020-12-09&lt;br /&gt;
Experimental plugin that adds a column inspector&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-pretty-json"&gt;datasette-pretty-json&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-pretty-json/releases/tag/0.2.1"&gt;0.2.1&lt;/a&gt; - 2020-12-09&lt;br /&gt;
Datasette plugin that pretty-prints any column values that are valid JSON objects or arrays&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/yaml-to-sqlite"&gt;yaml-to-sqlite&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/yaml-to-sqlite/releases/tag/0.3.1"&gt;0.3.1&lt;/a&gt; - 2020-12-07&lt;br /&gt;
Utility for converting YAML files to SQLite&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-seaborn"&gt;datasette-seaborn&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-seaborn/releases/tag/0.2a0"&gt;0.2a0&lt;/a&gt; - 2020-12-07&lt;br /&gt;
Statistical visualizations for Datasette using Seaborn&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/readthedocs/custom-sphinx-templates"&gt;Using custom Sphinx templates on Read the Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/python/style-yaml-dump"&gt;Controlling the style of dumped YAML using PyYAML&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/bash/escaping-sql-for-curl-to-datasette"&gt;Escaping a SQL query to use with curl and Datasette&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/bash/skip-csv-rows-with-odd-numbers"&gt;Skipping CSV rows with odd numbers of quotes using ripgrep&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/baked-data"&gt;baked-data&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="projects"/><category term="datasette"/><category term="weeknotes"/><category term="sqlite-utils"/><category term="baked-data"/></entry><entry><title>datasette-ripgrep: deploy a regular expression search engine for your source code</title><link href="https://simonwillison.net/2020/Nov/28/datasette-ripgrep/#atom-tag" rel="alternate"/><published>2020-11-28T06:51:06+00:00</published><updated>2020-11-28T06:51:06+00:00</updated><id>https://simonwillison.net/2020/Nov/28/datasette-ripgrep/#atom-tag</id><summary type="html">
    &lt;p&gt;This week I built &lt;a href="https://github.com/simonw/datasette-ripgrep"&gt;datasette-ripgrep&lt;/a&gt; - a web application for running regular expression searches against source code, built on top of the amazing &lt;a href="https://github.com/BurntSushi/ripgrep"&gt;ripgrep&lt;/a&gt; command-line tool.&lt;/p&gt;
&lt;h4&gt;datasette-ripgrep demo&lt;/h4&gt;
&lt;p&gt;I've deployed a demo version of the application here:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://ripgrep.datasette.io/-/ripgrep?pattern=pytest"&gt;ripgrep.datasette.io/-/ripgrep?pattern=pytest&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The demo runs searches against the source code of every one of my GitHub repositories that start with &lt;code&gt;datasette&lt;/code&gt; - &lt;a href="https://github-to-sqlite.dogsheep.net/github/repos?name__startswith=datasette&amp;amp;owner__exact=9599"&gt;61 repos&lt;/a&gt; right now - so it should include all of my Datasette plugins plus the core Datasette repository itself.&lt;/p&gt;
&lt;p&gt;Since it's running on top of &lt;code&gt;ripgrep&lt;/code&gt;, it supports regular expressions. This is absurdly useful. Some examples:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Every usage of the &lt;code&gt;.plugin_config(&lt;/code&gt; method: &lt;a href="https://ripgrep.datasette.io/-/ripgrep?pattern=%5C.plugin_config%5C%28"&gt;plugin_config\(&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Everywhere I use &lt;code&gt;async with httpx.AsyncClient&lt;/code&gt; (usually in tests): &lt;a href="https://ripgrep.datasette.io/-/ripgrep?pattern=async+with.*AsyncClient"&gt;async with.*AsyncClient&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;All places where I use a Jinja &lt;code&gt;|&lt;/code&gt; filter inside a variable: &lt;a href="https://ripgrep.datasette.io/-/ripgrep?pattern=%5C%7B%5C%7B.*%5C%7C.*%5C%7D%5C%7D"&gt;\{\{.*\|.*\}\}&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I usually run ripgrep as &lt;code&gt;rg&lt;/code&gt; on the command-line, or use it within Visual Studio Code (&lt;a href="https://twitter.com/simonw/status/1331381448171929600"&gt;fun fact&lt;/a&gt;: the reason VS Code's "Find in Files" is so good is that it's running ripgrep under the hood).&lt;/p&gt;
&lt;p&gt;So why have it as a web application? Because this means I can link to it, bookmark it and use it on my phone.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2020/datasette-ripgrep.png" alt="A screenshot of datasette-ripgrep in action" style="max-width: 100%" /&gt;&lt;/p&gt;
&lt;h4&gt;Why build this?&lt;/h4&gt;
&lt;p&gt;There are plenty of great existing code search tools out there already: I've heard great things about &lt;a href="https://github.com/livegrep/livegrep"&gt;livegrep&lt;/a&gt;, and a quick Google search shows a bunch of other options.&lt;/p&gt;
&lt;p&gt;Aside from being a fun project, &lt;code&gt;datasette-ripgrep&lt;/code&gt; has one key advantage: it gets to benefit from Datasette's publishing mechanism, which means it's really easy to deploy.&lt;/p&gt;
&lt;p&gt;That &lt;a href="https://ripgrep.datasette.io/"&gt;ripgrep.datasette.io&lt;/a&gt; demo is deployed by checking out the source code to be searched into a &lt;code&gt;all&lt;/code&gt; directory and then using the following command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;datasette publish cloudrun \
    --metadata metadata.json \
    --static all:all \
    --install=datasette-ripgrep \
    --service datasette-ripgrep \
    --apt-get-install ripgrep
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;all&lt;/code&gt; is a folder containing the source code to be searched. &lt;code&gt;metadata.json&lt;/code&gt; contains this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;plugins&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: {
        &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;datasette-ripgrep&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: {
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;path&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;/app/all&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;time_limit&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;3.0&lt;/span&gt;
        }
    }
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;That's all there is to it! The result is a deployed code search engine, running on Google Cloud Run.&lt;/p&gt;
&lt;p&gt;(If you want to try this yourself you'll need to be using the just-released Datasette 0.52.)&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/simonw/datasette-ripgrep/blob/main/.github/workflows/deploy_demo.yml"&gt;GitHub Action workflow&lt;/a&gt; that deploys the demo also uses my &lt;a href="https://github.com/dogsheep/github-to-sqlite"&gt;github-to-sqlite&lt;/a&gt; tool to fetch my repos and then shallow-clones the ones that begin with &lt;code&gt;datasette&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;If you have &lt;a href="https://docs.datasette.io/en/stable/publish.html#publishing-to-google-cloud-run"&gt;your own Google Cloud Run credentials&lt;/a&gt;, you can run your own copy of that workflow against your own repositories.&lt;/p&gt;
&lt;h4&gt;A different kind of Datasette plugin&lt;/h4&gt;
&lt;p&gt;Datasette is a tool for publishing SQLite databases, so most Datasette plugins integrate with SQLite in some way.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;datasette-ripgrep&lt;/code&gt; is different: it makes no use of SQLite at all, but instead takes advantage of Datasette's URL routing, &lt;code&gt;datasette publish&lt;/code&gt; deployments and permissions system.&lt;/p&gt;
&lt;p&gt;The plugin implementation is currently &lt;a href="https://github.com/simonw/datasette-ripgrep/blob/07b9ced2935b0b6080c1c42fcaf6ab9e8003d186/datasette_ripgrep/__init__.py"&gt;134 lines of code&lt;/a&gt;, excluding tests and templates.&lt;/p&gt;
&lt;p&gt;While the plugin doesn't use SQLite, it does share a common philosophy with Datasette: the plugin bundles the source code that it is going to search as part of the deployed application, in a similar way to how Datasette usually bundles one or more SQLite database files.&lt;/p&gt;
&lt;p&gt;As such, it's extremely inexpensive to run and can be deployed to serverless hosting. If you need to scale it, you can run more copies.&lt;/p&gt;
&lt;p&gt;This does mean that the application needs to be re-deployed to pick up changes to the searchable code. I'll probably set my demo to do this on a daily basis.&lt;/p&gt;
&lt;h4&gt;Controlling processes from asyncio&lt;/h4&gt;
&lt;p&gt;The trickiest part of the implementation was figuring out how to use Python's &lt;code&gt;asyncio.create_subprocess_exec()&lt;/code&gt; method to safely run the &lt;code&gt;rg&lt;/code&gt; process in response to incoming requests.&lt;/p&gt;
&lt;p&gt;I don't want expensive searches to tie up the server, so I implemented two limits here. The first is a time limit: by default, searches are given one second to run, after which the &lt;code&gt;rg&lt;/code&gt; process is terminated and only the results received so far are returned. This is achieved using the &lt;a href="https://docs.python.org/3/library/asyncio-task.html#asyncio.wait_for"&gt;asyncio.wait_for()&lt;/a&gt; function.&lt;/p&gt;
&lt;p&gt;I also implemented a limit on the number of matching lines that can be returned, defaulting to 2,000. Any more than that and the process is terminated early.&lt;/p&gt;
&lt;p&gt;Both of these limits can be customized using plugin settings (documented in &lt;a href="https://github.com/simonw/datasette-ripgrep/blob/main/README.md"&gt;the README&lt;/a&gt;). You can see how they are implemented in the &lt;a href="https://github.com/simonw/datasette-ripgrep/blob/0.2/datasette_ripgrep/__init__.py#L9-L55"&gt;async def run_ripgrep(pattern, path, time_limit=1.0, max_lines=2000)&lt;/a&gt; function.&lt;/p&gt;
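&lt;p&gt;Stripped of the Datasette plumbing, the combination of those two limits can be sketched like this (illustrative code under the same general approach, not the plugin's actual implementation):&lt;/p&gt;

```python
import asyncio


async def run_with_limits(args, time_limit=1.0, max_lines=2000):
    # Sketch of the two safety limits: terminate the subprocess after
    # time_limit seconds, and stop reading after max_lines lines.
    proc = await asyncio.create_subprocess_exec(
        *args, stdout=asyncio.subprocess.PIPE
    )
    lines = []

    async def read_lines():
        while len(lines) < max_lines:
            line = await proc.stdout.readline()
            if not line:
                break  # EOF - the process finished on its own
            lines.append(line.decode("utf-8", "replace"))

    try:
        await asyncio.wait_for(read_lines(), timeout=time_limit)
    except asyncio.TimeoutError:
        pass  # Deadline hit - return whatever was collected so far
    if proc.returncode is None:
        try:
            proc.terminate()
        except ProcessLookupError:
            pass  # Process already exited
    await proc.wait()
    return lines
```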
&lt;h4&gt;Highlighted linkable line numbers&lt;/h4&gt;
&lt;p&gt;The other fun implementation detail is the way the source code listings are displayed. I'm using CSS to display the line numbers in a way that makes them visible without them breaking copy-and-paste (inspired by &lt;a href="https://www.sylvaindurand.org/using-css-to-add-line-numbering/"&gt;this article by Sylvain Durand&lt;/a&gt;).&lt;/p&gt;
&lt;div class="highlight highlight-source-css"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;code&lt;/span&gt;:&lt;span class="pl-c1"&gt;before&lt;/span&gt; {
    &lt;span class="pl-c1"&gt;content&lt;/span&gt;: &lt;span class="pl-en"&gt;attr&lt;/span&gt;(data-line);
    &lt;span class="pl-c1"&gt;display&lt;/span&gt;: inline-block;
    &lt;span class="pl-c1"&gt;width&lt;/span&gt;: &lt;span class="pl-c1"&gt;3.5&lt;span class="pl-smi"&gt;ch&lt;/span&gt;&lt;/span&gt;;
    &lt;span class="pl-c1"&gt;-webkit-user-select&lt;/span&gt;: none;
    &lt;span class="pl-c1"&gt;color&lt;/span&gt;: &lt;span class="pl-pds"&gt;&lt;span class="pl-kos"&gt;#&lt;/span&gt;666&lt;/span&gt;;
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The HTML looks like this:&lt;/p&gt;
&lt;div class="highlight highlight-text-html-basic"&gt;&lt;pre&gt;&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;pre&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;code&lt;/span&gt; &lt;span class="pl-c1"&gt;id&lt;/span&gt;="&lt;span class="pl-s"&gt;L1&lt;/span&gt;" &lt;span class="pl-c1"&gt;data-line&lt;/span&gt;="&lt;span class="pl-s"&gt;1&lt;/span&gt;"&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;from setuptools import setup&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;code&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;code&lt;/span&gt; &lt;span class="pl-c1"&gt;id&lt;/span&gt;="&lt;span class="pl-s"&gt;L2&lt;/span&gt;" &lt;span class="pl-c1"&gt;data-line&lt;/span&gt;="&lt;span class="pl-s"&gt;2&lt;/span&gt;"&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;import os&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;code&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;code&lt;/span&gt; &lt;span class="pl-c1"&gt;id&lt;/span&gt;="&lt;span class="pl-s"&gt;L3&lt;/span&gt;" &lt;span class="pl-c1"&gt;data-line&lt;/span&gt;="&lt;span class="pl-s"&gt;3&lt;/span&gt;"&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;&amp;amp;nbsp;&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;code&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;code&lt;/span&gt; &lt;span class="pl-c1"&gt;id&lt;/span&gt;="&lt;span class="pl-s"&gt;L4&lt;/span&gt;" &lt;span class="pl-c1"&gt;data-line&lt;/span&gt;="&lt;span class="pl-s"&gt;4&lt;/span&gt;"&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;VERSION = &amp;amp;#34;0.1&amp;amp;#34;&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;code&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
...&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I wanted to imitate GitHub's handling of line links, where adding &lt;code&gt;#L23&lt;/code&gt; to the URL both jumps to that line and causes the line to be highlighted. Here's &lt;a href="https://ripgrep.datasette.io/-/ripgrep/view/datasette-allow-permissions-debug/setup.py#L23"&gt;a demo of that&lt;/a&gt; - I use the following JavaScript to update the contents of a &lt;code&gt;&amp;lt;style id="highlightStyle"&amp;gt;&amp;lt;/style&amp;gt;&lt;/code&gt; element in the document head any time the URL fragment changes:&lt;/p&gt;
&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-c1"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;script&lt;/span&gt;&lt;span class="pl-c1"&gt;&amp;gt;&lt;/span&gt;
var highlightStyle = document.getElementById('highlightStyle');
function highlightLineFromFragment() &lt;span class="pl-kos"&gt;{&lt;/span&gt;
    &lt;span class="pl-en"&gt;if&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-pds"&gt;/&lt;span class="pl-cce"&gt;^&lt;/span&gt;#L&lt;span class="pl-cce"&gt;\d&lt;/span&gt;&lt;span class="pl-c1"&gt;+&lt;/span&gt;&lt;span class="pl-cce"&gt;$&lt;/span&gt;/&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;exec&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;location&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;hash&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
        &lt;span class="pl-s1"&gt;highlightStyle&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;innerText&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;`&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;${&lt;/span&gt;&lt;span class="pl-s1"&gt;location&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;hash&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt; { background-color: yellow; }`&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-kos"&gt;}&lt;/span&gt;
&lt;span class="pl-kos"&gt;}&lt;/span&gt;
&lt;span class="pl-en"&gt;highlightLineFromFragment&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-smi"&gt;window&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;addEventListener&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;"hashchange"&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s1"&gt;highlightLineFromFragment&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-c1"&gt;&amp;lt;&lt;/span&gt;/&lt;span class="pl-ent"&gt;script&lt;/span&gt;&lt;span class="pl-c1"&gt;&amp;gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It's the simplest way I could think of to achieve this effect.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update 28th November 2020&lt;/strong&gt;: Louis Lévêque on Twitter suggested using the CSS &lt;a href="https://developer.mozilla.org/en-US/docs/Web/CSS/:target"&gt;:target selector&lt;/a&gt; instead, which is indeed MUCH simpler - I deleted the above JavaScript and replaced it with this CSS:&lt;/p&gt;
&lt;div class="highlight highlight-source-css"&gt;&lt;pre&gt;:&lt;span class="pl-c1"&gt;target&lt;/span&gt; {
    &lt;span class="pl-c1"&gt;background-color&lt;/span&gt;: &lt;span class="pl-pds"&gt;&lt;span class="pl-kos"&gt;#&lt;/span&gt;FFFF99&lt;/span&gt;;
}&lt;/pre&gt;&lt;/div&gt;
&lt;h4&gt;Next steps for this project&lt;/h4&gt;
&lt;p&gt;I'm pleased to have got &lt;a href="https://github.com/simonw/datasette-ripgrep"&gt;datasette-ripgrep&lt;/a&gt; to a workable state, and I'm looking forward to using it to answer questions about the growing Datasette ecosystem. I don't know how much more time I'll invest in this - if it proves useful then I may well expand it.&lt;/p&gt;
&lt;p&gt;I do think there's something really interesting about being able to spin up this kind of code search engine on demand using &lt;code&gt;datasette publish&lt;/code&gt;. It feels like a very useful trick to have access to.&lt;/p&gt;
&lt;h4&gt;Better URLs for my TILs&lt;/h4&gt;
&lt;p&gt;My other project this week was an upgrade to &lt;a href="https://til.simonwillison.net/"&gt;til.simonwillison.net&lt;/a&gt;: I finally spent the time to &lt;a href="https://github.com/simonw/til/issues/34"&gt;design nicer URLs&lt;/a&gt; for the site.&lt;/p&gt;
&lt;p&gt;Before:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;til.simonwillison.net/til/til/javascript_manipulating-query-params.md&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;After:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;til.simonwillison.net/javascript/manipulating-query-params&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;The implementation for this takes advantage of a feature I sneaked into Datasette 0.49: &lt;a href="https://simonwillison.net/2020/Sep/15/datasette-0-49#path-parameters-custom-page-templates"&gt;Path parameters for custom page templates&lt;/a&gt;. I can create a template file called &lt;code&gt;pages/{topic}/{slug}.html&lt;/code&gt; and Datasette will use that template to handle 404 errors that match that pattern.&lt;/p&gt;
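&lt;p&gt;The underlying idea - matching a 404'd request path against template filenames with &lt;code&gt;{parameter}&lt;/code&gt; segments - can be sketched like this (illustrative code, not Datasette's actual routing):&lt;/p&gt;

```python
import re


def match_page_template(patterns, path):
    # Match a path like /javascript/manipulating-query-params against
    # templates such as "pages/{topic}/{slug}.html", returning the
    # matching template plus the captured parameters.
    segments = path.strip("/").split("/")
    for pattern in patterns:
        parts = pattern[len("pages/"):-len(".html")].split("/")
        if len(parts) != len(segments):
            continue
        params = {}
        for part, segment in zip(parts, segments):
            m = re.fullmatch(r"\{(\w+)\}", part)
            if m:
                params[m.group(1)] = segment  # capture the parameter
            elif part != segment:
                break  # literal segment does not match
        else:
            return pattern, params
    return None, {}
```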
&lt;p&gt;Here's &lt;a href="https://github.com/simonw/til/blob/main/templates/pages/%7Btopic%7D/%7Bslug%7D.html"&gt;the new pages/{topic}/{slug}.html&lt;/a&gt; template for my TIL site. It uses the &lt;code&gt;sql()&lt;/code&gt; template function from the &lt;a href="https://github.com/simonw/datasette-template-sql"&gt;datasette-template-sql&lt;/a&gt; plugin to retrieve and render the matching TIL, or raises a 404 if no TIL can be found.&lt;/p&gt;
&lt;p&gt;I also needed to set up redirects from the old pages to the new ones. I wrote a &lt;a href="https://til.simonwillison.net/til/til/datasette_redirects-for-datasette.md"&gt;TIL on redirects for Datasette&lt;/a&gt; explaining how I did that.&lt;/p&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/til/til/datasette_redirects-for-datasette.md"&gt;Redirects for Datasette&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-ripgrep/releases/tag/0.2"&gt;datasette-ripgrep 0.2&lt;/a&gt; - 2020-11-27&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-ripgrep/releases/tag/0.1"&gt;datasette-ripgrep 0.1&lt;/a&gt; - 2020-11-26&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-atom/releases/tag/0.8.1"&gt;datasette-atom 0.8.1&lt;/a&gt; - 2020-11-25&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-ripgrep/releases/tag/0.1a1"&gt;datasette-ripgrep 0.1a1&lt;/a&gt; - 2020-11-25&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-ripgrep/releases/tag/0.1a0"&gt;datasette-ripgrep 0.1a0&lt;/a&gt; - 2020-11-25&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-graphql/releases/tag/1.2.1"&gt;datasette-graphql 1.2.1&lt;/a&gt; - 2020-11-24&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/async"&gt;async&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/css"&gt;css&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cloudrun"&gt;cloudrun&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ripgrep"&gt;ripgrep&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/baked-data"&gt;baked-data&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="async"/><category term="css"/><category term="projects"/><category term="python"/><category term="datasette"/><category term="weeknotes"/><category term="cloudrun"/><category term="ripgrep"/><category term="baked-data"/></entry><entry><title>Bedrock: The SQLitening</title><link href="https://simonwillison.net/2020/Oct/7/bedrock-sqlitening/#atom-tag" rel="alternate"/><published>2020-10-07T23:47:22+00:00</published><updated>2020-10-07T23:47:22+00:00</updated><id>https://simonwillison.net/2020/Oct/7/bedrock-sqlitening/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://mozilla.github.io/meao/2018/03/28/bedrock-the-sqlitening/"&gt;Bedrock: The SQLitening&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Back in March 2018 www.mozilla.org switched over to running on Django using SQLite! They’re using the same pattern I’ve been exploring with Datasette: their SQLite database is treated as a read-only cache by their frontend servers, and a new SQLite database is built by a separate process and fetched onto the frontend machines every five minutes by a scheduled task. They have a healthcheck page which shows the latest version of the database and when it was fetched, and even lets you download the 25MB SQLite database directly (I’ve been exploring it using Datasette).

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/simonw/status/1313986116161134592"&gt;@simonw&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/django"&gt;django&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mozilla"&gt;mozilla&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/baked-data"&gt;baked-data&lt;/a&gt;&lt;/p&gt;



</summary><category term="django"/><category term="mozilla"/><category term="sqlite"/><category term="datasette"/><category term="baked-data"/></entry><entry><title>niche-museums.com, powered by Datasette</title><link href="https://simonwillison.net/2019/Nov/25/niche-museums/#atom-tag" rel="alternate"/><published>2019-11-25T22:27:46+00:00</published><updated>2019-11-25T22:27:46+00:00</updated><id>https://simonwillison.net/2019/Nov/25/niche-museums/#atom-tag</id><summary type="html">
    &lt;p&gt;I just released a major upgrade to my &lt;a href="https://www.niche-museums.com/"&gt;www.niche-museums.com&lt;/a&gt; website (launched &lt;a href="https://simonwillison.net/2019/Oct/28/niche-museums-kepler/"&gt;last month&lt;/a&gt;).&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The site is now rendered server-side. The previous version used &lt;a href="https://lit-html.polymer-project.org/"&gt;lit-html&lt;/a&gt; to render content using JavaScript.&lt;/li&gt;
&lt;li&gt;Each museum now has its own page. Here&amp;#39;s today&amp;#39;s new museum listing for the &lt;a href="https://www.niche-museums.com/browse/museums/46"&gt;Conservatory of Flowers&lt;/a&gt; in San Francisco. These pages have a map on them.&lt;/li&gt;
&lt;li&gt;The site has an &lt;a href="https://www.niche-museums.com/about"&gt;about page&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;You can now link to the page for a specific latitude and longitude, e.g. &lt;a href="https://www.niche-museums.com/?latitude=37.77&amp;amp;longitude=-122.458"&gt;this location in Golden Gate Park&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;The source code for the site is now &lt;a href="https://github.com/simonw/museums"&gt;available on GitHub&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Notably, the site is entirely powered by &lt;a href="https://github.com/simonw/datasette"&gt;Datasette&lt;/a&gt;. It&amp;#39;s a heavily customized Datasette instance, making extensive use of &lt;a href="https://datasette.readthedocs.io/en/0.32/custom_templates.html#custom-templates"&gt;custom templates&lt;/a&gt; and &lt;a href="https://datasette.readthedocs.io/en/0.32/plugins.html"&gt;plugins&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It&amp;#39;s a really fun experiment. I&amp;#39;m essentially using Datasette as a weird twist on a static site generator - no moving parts since the database is immutable but there&amp;#39;s still stuff happening server-side to render the pages.&lt;/p&gt;
&lt;h3 id="continuous-deployment"&gt;Continuous deployment&lt;/h3&gt;
&lt;p&gt;The site is entirely stateless and is published &lt;a href="https://circleci.com/gh/simonw/museums"&gt;using Circle CI&lt;/a&gt; to a serverless hosting provider (currently Zeit Now v1, but I&amp;#39;ll probably move it to Google Cloud Run in the near future.)&lt;/p&gt;
&lt;p&gt;The site content - 46 museums and counting - lives in the &lt;a href="https://github.com/simonw/museums/blob/master/museums.yaml"&gt;museums.yaml&lt;/a&gt; file. I&amp;#39;ve been adding a new museum listing every day by editing the YAML file using &lt;a href="https://workingcopyapp.com/"&gt;Working Copy&lt;/a&gt; on my iPhone.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/simonw/museums/blob/master/.circleci/config.yml"&gt;build script&lt;/a&gt; runs automatically on every commit. It converts the YAML file into a SQLite database using my &lt;a href="https://github.com/simonw/yaml-to-sqlite"&gt;yaml-to-sqlite&lt;/a&gt; tool, then runs &lt;code&gt;datasette publish now...&lt;/code&gt; to deploy the resulting database.&lt;/p&gt;
&lt;p&gt;The full deployment command is as follows:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;datasette publish now browse.db about.db \
    --token=$NOW_TOKEN \
    --alias=www.niche-museums.com \
    --name=niche-museums \
    --install=datasette-haversine \
    --install=datasette-pretty-json \
    --install=datasette-template-sql \
    --install=datasette-json-html \
    --install=datasette-cluster-map~=0.8 \
    --metadata=metadata.json \
    --template-dir=templates \
    --plugins-dir=plugins \
    --branch=master
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;There&amp;#39;s a lot going on here.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;browse.db&lt;/code&gt; is the SQLite database file that was built by running &lt;code&gt;yaml-to-sqlite&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;about.db&lt;/code&gt; is an empty database built using &lt;code&gt;sqlite3 about.db &amp;#39;&amp;#39;&lt;/code&gt; - more on this later.&lt;/p&gt;
&lt;p&gt;(Running &lt;code&gt;sqlite3&lt;/code&gt; with an empty SQL argument creates a valid, zero-table database file if one does not already exist.)&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;--alias=&lt;/code&gt; option tells Zeit Now to alias that URL to the resulting deployment. This is the single biggest feature that I&amp;#39;m missing from Google Cloud Run at the moment. It&amp;#39;s possible to point domains at deployments there but it&amp;#39;s not nearly as easy to script.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;--install=&lt;/code&gt; options tell &lt;code&gt;datasette publish&lt;/code&gt; which plugins should be installed on the resulting instance.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;--metadata=&lt;/code&gt;, &lt;code&gt;--template-dir=&lt;/code&gt; and &lt;code&gt;--plugins-dir=&lt;/code&gt; are the options that customize the instance.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;--branch=master&lt;/code&gt; means we always deploy the latest master of Datasette directly from GitHub, ignoring the most recent release to PyPI. This isn&amp;#39;t strictly necessary here.&lt;/p&gt;
&lt;h3 id="customization"&gt;Customization&lt;/h3&gt;
&lt;p&gt;The site itself is built almost entirely using Datasette custom templates. I have four of them:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/museums/blob/c81e8ec9f39d87f13481608832c94b8e824fd347/templates/index.html"&gt;index.html&lt;/a&gt; is the template used for the homepage, and for the page you see when you search for museums near a specific latitude and longitude.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/museums/blob/c81e8ec9f39d87f13481608832c94b8e824fd347/templates/row-browse-museums.html"&gt;row-browse-museums.html&lt;/a&gt; is the template used for the &lt;a href="https://www.niche-museums.com/browse/museums/43"&gt;individual museum pages&lt;/a&gt;. It includes the JavaScript used for the map (which is powered by &lt;a href="https://leafletjs.com/"&gt;Leaflet&lt;/a&gt; and uses &lt;a href="https://foundation.wikimedia.org/wiki/Maps_Terms_of_Use"&gt;Wikimedia&amp;#39;s OpenStreetMap tiles&lt;/a&gt;, which I discovered thanks to &lt;a href="https://observablehq.com/@tmcw/leaflet"&gt;this Observable notebook&lt;/a&gt; by Tom MacWright).&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/museums/blob/c81e8ec9f39d87f13481608832c94b8e824fd347/templates/_museum_card.html"&gt;_museum_card.html&lt;/a&gt; is an included template rendering a card for a museum, shared by the index and museum pages.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/museums/blob/c81e8ec9f39d87f13481608832c94b8e824fd347/templates/database-about.html"&gt;database-about.html&lt;/a&gt; is the template for &lt;a href="https://www.niche-museums.com/about"&gt;the about page&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The about page uses a particularly devious hack.&lt;/p&gt;
&lt;p&gt;Datasette doesn&amp;#39;t have an easy way to create additional custom pages with URLs at the moment (without abusing the &lt;a href="https://datasette.readthedocs.io/en/stable/plugins.html#asgi-wrapper-datasette"&gt;asgi_wrapper()&lt;/a&gt; hook, which is pretty low-level).&lt;/p&gt;
&lt;p&gt;But... every attached database gets its own URL at &lt;code&gt;/database-name&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;So, to create the &lt;code&gt;/about&lt;/code&gt; page I create an empty database called &lt;code&gt;about.db&lt;/code&gt; using the &lt;code&gt;sqlite3 about.db &amp;quot;&amp;quot;&lt;/code&gt; command. I serve that using Datasette, then create a custom template for that specific database using Datasette&amp;#39;s &lt;a href="https://datasette.readthedocs.io/en/0.32/custom_templates.html#custom-templates"&gt;template naming conventions&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I&amp;#39;ll probably come up with a less grotesque way of doing this and bake it into Datasette in the future. For the moment this seems to work pretty well.&lt;/p&gt;
&lt;h3 id="plugins"&gt;Plugins&lt;/h3&gt;
&lt;p&gt;The two key plugins here are &lt;code&gt;datasette-haversine&lt;/code&gt; and &lt;code&gt;datasette-template-sql&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/datasette-haversine"&gt;datasette-haversine&lt;/a&gt; adds a custom SQL function to Datasette called &lt;code&gt;haversine()&lt;/code&gt;, which calculates the haversine distance between two latitude/longitude points.&lt;/p&gt;
&lt;p&gt;It&amp;#39;s used by the SQL query which finds the nearest museums to the user.&lt;/p&gt;
&lt;p&gt;This is very inefficient - it&amp;#39;s essentially a brute-force approach which calculates that distance for every museum in the database and sorts them accordingly - but it will be years before I have enough museums listed for that to cause any kind of performance issue.&lt;/p&gt;
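&lt;p&gt;Here&amp;#39;s a minimal sketch of how a custom SQL function like this can work, using Python&amp;#39;s built-in &lt;code&gt;sqlite3&lt;/code&gt; module - the museum rows and coordinates below are made up for illustration, and the real plugin registers its function through a Datasette plugin hook rather than directly on a connection:&lt;/p&gt;

```python
import math
import sqlite3

def haversine(lat1, lon1, lat2, lon2):
    # Great-circle distance in miles between two latitude/longitude points
    r = 3959  # approximate Earth radius in miles
    dlat = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1)
    a = (math.sin(dlat / 2) ** 2
         + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2))
         * math.sin(dlon / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

conn = sqlite3.connect(":memory:")
conn.create_function("haversine", 4, haversine)
conn.execute("create table museums (name text, latitude real, longitude real)")
conn.executemany("insert into museums values (?, ?, ?)", [
    ("Cable Car Museum", 37.7946, -122.4117),
    ("Conservatory of Flowers", 37.7726, -122.4603),
])
# Brute-force nearest-museum query: compute the distance for every row, sort
rows = conn.execute("""
    select name, haversine(latitude, longitude, ?, ?) as distance
    from museums order by distance limit 5
""", (37.77, -122.458)).fetchall()
```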
&lt;p&gt;&lt;a href="https://github.com/simonw/datasette-template-sql"&gt;datasette-template-sql&lt;/a&gt; is the new plugin I &lt;a href="https://simonwillison.net/2019/Nov/18/datasette-template-sql/"&gt;described last week&lt;/a&gt;, made possible by Datasette dropping Python 3.5 support. It allows SQL queries to be executed directly from templates. I&amp;#39;m using it here to &lt;a href="https://github.com/simonw/museums/blob/c81e8ec9f39d87f13481608832c94b8e824fd347/templates/index.html#L58-L69"&gt;run the queries&lt;/a&gt; that power homepage.&lt;/p&gt;
&lt;p&gt;I tried to get the site working just using code in the templates, but it got pretty messy. Instead, I took advantage of Datasette&amp;#39;s &lt;code&gt;--plugins-dir&lt;/code&gt; option, which causes Datasette to treat all Python modules in a specific directory as plugins and attempt to load them.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/museums/blob/c81e8ec9f39d87f13481608832c94b8e824fd347/plugins/index_vars.py"&gt;index_vars.py&lt;/a&gt; is a single custom plugin that I&amp;#39;m bundling with the site. It uses the &lt;a href="https://datasette.readthedocs.io/en/0.32/plugins.html#extra-template-vars-template-database-table-view-name-request-datasette"&gt;extra_template_vars()&lt;/a&gt; plugin took to detect requests to the &lt;code&gt;index&lt;/code&gt; page and inject some additional custom template variables based on values read from the querystring.&lt;/p&gt;
&lt;p&gt;This ends up acting a little bit like a custom Django view function. It&amp;#39;s a slightly weird pattern but again it does the job - and helps me further explore the potential of Datasette as a tool for powering websites in addition to just providing an API.&lt;/p&gt;
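&lt;p&gt;The shape of that plugin looks something like this - a simplified sketch based on the documented &lt;code&gt;extra_template_vars()&lt;/code&gt; hook signature. The real &lt;code&gt;index_vars.py&lt;/code&gt; decorates the function with &lt;code&gt;@hookimpl&lt;/code&gt; from the &lt;code&gt;datasette&lt;/code&gt; package; here a fake request object stands in so the sketch runs standalone:&lt;/p&gt;

```python
# Simplified sketch of an index_vars.py-style plugin. In a real Datasette
# plugin this function would be decorated with @hookimpl and the request
# object would be provided by Datasette itself.

def extra_template_vars(template, request):
    # Only inject extra variables when the index page is being rendered
    if template != "index.html":
        return {}
    args = request.args
    return {
        "latitude": args.get("latitude"),
        "longitude": args.get("longitude"),
    }

class FakeRequest:
    # Stand-in for Datasette's request object, for illustration only
    def __init__(self, args):
        self.args = args

template_vars = extra_template_vars("index.html", FakeRequest({"latitude": "37.77"}))
```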
&lt;h2 id="weeknotes"&gt;Weeknotes&lt;/h2&gt;
&lt;p&gt;This post is standing in for my regular weeknotes, because it represents most of what I achieved this last week. A few other bits and pieces:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;I&amp;#39;ve been exploring ways to enable CSV upload directly into a Datasette instance. I&amp;#39;m building a prototype of this on top of &lt;a href="https://www.starlette.io/"&gt;Starlette&lt;/a&gt;, because it has solid ASGI &lt;a href="https://www.starlette.io/requests/#request-files"&gt;file upload support&lt;/a&gt;. This is currently a standalone web application but I&amp;#39;ll probably make it work as a Datasette ASGI plugin once I have something I like.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://sixcolors.com/post/2019/09/13-features-of-ios-13-shortcuts/"&gt;Shortcuts in iOS 13&lt;/a&gt; got some very interesting new features, most importantly the ability to trigger shortcuts automatically on specific actions - including every time you open a specific app. I&amp;#39;ve been experimenting with using this to automatically copy data from my iPhone up to a custom web application - maybe this could help ingest notes and photos into &lt;a href="https://simonwillison.net/2019/Oct/7/dogsheep/"&gt;Dogsheep&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Posted seven new museums to niche-museums.com: &lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/39"&gt;Cable Car Museum&lt;/a&gt; in San Francisco&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/40"&gt;Audium&lt;/a&gt; in San Francisco&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/41"&gt;House of Broel Dollhouse Museum&lt;/a&gt; in New Orleans&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/43"&gt;Neptune Society Columbarium&lt;/a&gt; in San Francisco&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/44"&gt;Recoleta Cemetery&lt;/a&gt; in Buenos Aires&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/45"&gt;NASA Glenn Visitor Center&lt;/a&gt; in Cleveland&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/46"&gt;Conservatory of Flowers&lt;/a&gt; in San Francisco&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;I composed &lt;a href="https://www.niche-museums.com/browse?sql=select+json_object%28%22pre%22%2C+group_concat%28%27*+%5B%27+%7C%7C+name+%7C%7C+%27%5D%28https%3A%2F%2Fwww.niche-museums.com%2Fbrowse%2Fmuseums%2F%27+%7C%7C+id+%7C%7C++%2B+%27%29+in+%27+%7C%7C+coalesce%28osm_city%2C+osm_county%2C+osm_state%2C+osm_country%2C+%27%27%29%2C+%27%0D%0A%27%29%29+from+%28select+*+from+%28select+*+from+museums+order+by+id+desc+limit+7%29+order+by+id%29%3B"&gt;devious SQL query&lt;/a&gt; for generating the markdown for the seven most recently added museums.&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/museums"&gt;museums&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/yaml"&gt;yaml&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/baked-data"&gt;baked-data&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="museums"/><category term="projects"/><category term="yaml"/><category term="datasette"/><category term="weeknotes"/><category term="baked-data"/></entry><entry><title>Weeknotes: Niche Museums, Kepler, Trees and Streaks</title><link href="https://simonwillison.net/2019/Oct/28/niche-museums-kepler/#atom-tag" rel="alternate"/><published>2019-10-28T22:42:10+00:00</published><updated>2019-10-28T22:42:10+00:00</updated><id>https://simonwillison.net/2019/Oct/28/niche-museums-kepler/#atom-tag</id><summary type="html">
    &lt;h3 id="Niche_Museums_4"&gt;Niche Museums&lt;/h3&gt;

&lt;p&gt;Every now and then someone will ask “so when are you going to build Museums Near Me then?”, based on &lt;a href="https://twitter.com/simonw/status/1171159213436997633"&gt;my obsession with niche museums&lt;/a&gt; and websites like &lt;a href="https://www.owlsnearme.com/"&gt;www.owlsnearme.com&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For my Strategic Communications course at Stanford last week I had to perform a midterm presentation - a six minute talk to convince my audience of something, accompanied by slides and a handout.&lt;/p&gt;
&lt;p&gt;I chose “you should seek out and explore tiny museums” as my topic, and used it as an excuse to finally start the website!&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.niche-museums.com/"&gt;www.niche-museums.com&lt;/a&gt; is the result. It’s a small but growing collection of niche museums (17 so far, mostly in the San Francisco Bay Area) complete with the all important blue “Use my location” button to see museums near you.&lt;/p&gt;
&lt;p&gt;Naturally I built it on &lt;a href="https://github.com/simonw/datasette"&gt;Datasette&lt;/a&gt;. I’ll be writing more about the implementation (and releasing the underlying code) soon. I also built a new plugin for it, &lt;a href="https://github.com/simonw/datasette-haversine"&gt;datasette-haversine&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id="Mapping_museums_against_Starbucks_16"&gt;Mapping museums against Starbucks&lt;/h3&gt;

&lt;p&gt;I needed a way to emphasize quite how many tiny museums there are in the USA. I decided to do this with a visualization.&lt;/p&gt;
&lt;p&gt;It turns out there are 15,891 branches of Starbucks in the USA… and at least 30,132 museums!&lt;/p&gt;

&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2019/starbucks.png" alt="15,891 Starbucks" style="max-width: 100%" /&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2019/museums.png" alt="At least 30.132 museums!" style="max-width: 100%" /&gt;&lt;/p&gt;

&lt;p&gt;I made these maps using a couple of sources.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.alltheplaces.xyz/"&gt;All The Places&lt;/a&gt; is a crowdsourced scraper project which aims to build scrapers for every company that has a “store locator” area of their website. Starbucks has &lt;a href="https://www.starbucks.com/store-locator"&gt;a store locator&lt;/a&gt; and All The Places have &lt;a href="https://github.com/alltheplaces/alltheplaces/blob/master/locations/spiders/starbucks.py"&gt;a scraper for it&lt;/a&gt;, so you can download GeoJSON of every Starbucks. I wrote a quick script to import that GeoJSON into Datasette using sqlite-utils.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://www.imls.gov/"&gt;Institute of Museum and Library Services&lt;/a&gt; is an independent agency of the federal government that supports museums and libraries across the country. They publish a &lt;a href="https://www.imls.gov/research-evaluation/data-collection/museum-data-files"&gt;dataset of Museums in the USA&lt;/a&gt; as a set of CSV files. I used &lt;a href="https://github.com/simonw/csvs-to-sqlite"&gt;csvs-to-sqlite&lt;/a&gt; to load those into Datasette, than ran a union query to combine the three files together.&lt;/p&gt;
&lt;p&gt;So I have Datasette instances (with a CSV export feature) for both Starbucks and USA museums, with latitudes and longitudes for each.&lt;/p&gt;
&lt;p&gt;Now how to turn that into a map?&lt;/p&gt;
&lt;p&gt;I turned to my new favourite GIS tool, &lt;a href="https://kepler.gl/"&gt;Kepler&lt;/a&gt;. Kepler is an open source GIS visualization tool released by Uber, based on WebGL. It’s astonishingly powerful and can be used directly in your browser by clicking the “Get Started” button on their website (which I assumed would take you to installation instructions, but no, it loads up the entire tool in your browser).&lt;/p&gt;
&lt;p&gt;You can import millions of points of data into Kepler and it will visualize them for you directly. I used a Datasette query to export the CSVs, then loaded in my Starbucks CSV, exported an image, loaded in the Museums CSV as a separate colour and exported a second image. The whole project ended up taking about 15 minutes. Kepler is a great addition to the toolbelt!&lt;/p&gt;

&lt;h3 id="Animating_the_PGE_outages_40"&gt;Animating the PG&amp;amp;E outages&lt;/h3&gt;

&lt;p&gt;My &lt;a href="https://simonwillison.net/2019/Oct/10/pge-outages/"&gt;PG&amp;amp;E outages scraper&lt;/a&gt; continues to record a snapshot of the PG&amp;amp;E outage map JSON every ten minutes. I’m posting updates to &lt;a href="https://twitter.com/simonw/status/1182440312590848001"&gt;a thread on Twitter&lt;/a&gt;, but discovering Kepler inspired me to look at more sophisticated visualization options.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://medium.com/vis-gl/animating-40-years-of-california-earthquakes-e4ffcdd4a289"&gt;This tutorial&lt;/a&gt; by Giuseppe Macrì tipped me off the the fact that you can use Kepler to animate points against timestamps!&lt;/p&gt;
&lt;p&gt;Here’s the result: a video animation showing how PG&amp;amp;E’s outages have evolved since the 5th of October:&lt;/p&gt;

&lt;blockquote class="twitter-tweet"&gt;&lt;p lang="en" dir="ltr"&gt;Here&amp;#39;s a video animation of PG&amp;amp;E&amp;#39;s outages from October 5th up until just a few minutes ago &lt;a href="https://t.co/50K3BrROZR"&gt;pic.twitter.com/50K3BrROZR&lt;/a&gt;&lt;/p&gt;- Simon Willison (@simonw) &lt;a href="https://twitter.com/simonw/status/1188612004572880896?ref_src=twsrc%5Etfw"&gt;October 28, 2019&lt;/a&gt;&lt;/blockquote&gt;

&lt;h3 id="Hayes_Valley_Trees_50"&gt;Hayes Valley Trees&lt;/h3&gt;

&lt;p&gt;The city announced plans to cut down 27 ficus trees in our neighborhood in San Francisco. I’ve been working with Natalie to help a small group of citizens organize an appeal, and this weekend I helped run a survey of the affected trees (recording their exact locations in a CSV file) and then built &lt;a href="https://www.hayes-valley-trees.com/"&gt;www.hayes-valley-trees.com&lt;/a&gt; (&lt;a href="https://github.com/simonw/hayes-valley-trees"&gt;source on GitHub&lt;/a&gt;) to link to from fliers attached to each affected tree.&lt;/p&gt;
&lt;p&gt;It started out as &lt;a href="https://glitch.com/~hayes-valley-trees"&gt;a Datasette&lt;/a&gt; (running on Glitch) but since it’s only 27 data points I ended up freezing the data in a static JSON file to avoid having to tolerate any cold start times. The site is deployed as static assets on Zeit Now using their handy &lt;a href="https://zeit.co/github"&gt;GitHub continuous deployment tool&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="Streaks_56"&gt;Streaks&lt;/h3&gt;
&lt;p&gt;It turns out I’m very motivated by streaks: I’m at 342 days for Duolingo Spanish and 603 days for an Apple Watch move streak. Could I apply this to other things in my life?&lt;/p&gt;
&lt;p&gt;I &lt;a href="https://twitter.com/simonw/status/1186824721280593920"&gt;asked on Twitter&lt;/a&gt; and was recommended the &lt;a href="https://streaks.app/"&gt;Streaks iOS app&lt;/a&gt;. It’s beautiful! I’m now tracking streaks for guitar practice, Duolingo, checking email, checking Slack, reading some books and adding a new museum to &lt;a href="http://www.niche-museums.com"&gt;www.niche-museums.com&lt;/a&gt; (if I add one a day I can get from 17 museums today to 382 in a year!)&lt;/p&gt;
&lt;p&gt;It seems to be working pretty well so far. I particularly like their iPhone widget.&lt;/p&gt;

&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2019/streaks-widget.jpg" alt="Streaks widget" style="max-width: 100%" /&gt;&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/museums"&gt;museums&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/productivity"&gt;productivity&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/visualization"&gt;visualization&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/baked-data"&gt;baked-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/streaks"&gt;streaks&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/duolingo"&gt;duolingo&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="museums"/><category term="productivity"/><category term="projects"/><category term="visualization"/><category term="weeknotes"/><category term="baked-data"/><category term="streaks"/><category term="duolingo"/></entry><entry><title>The interesting ideas in Datasette</title><link href="https://simonwillison.net/2018/Oct/4/datasette-ideas/#atom-tag" rel="alternate"/><published>2018-10-04T02:28:45+00:00</published><updated>2018-10-04T02:28:45+00:00</updated><id>https://simonwillison.net/2018/Oct/4/datasette-ideas/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;a href="https://github.com/simonw/datasette"&gt;Datasette&lt;/a&gt; (&lt;a href="https://simonwillison.net/tags/datasette/"&gt;previously&lt;/a&gt;) is my open source tool for exploring and publishing structured data. There are a lot of ideas embedded in Datasette. I realized that I haven’t put many of them into writing.&lt;/p&gt;
&lt;p&gt;
&lt;a href="#Publishing_readonly_data"&gt;Publishing read-only data&lt;/a&gt;&lt;br /&gt;
&lt;a href="#Bundling_the_data_with_the_code"&gt;Bundling the data with the code&lt;/a&gt;&lt;br /&gt;
&lt;a href="#SQLite_as_the_underlying_data_engine"&gt;SQLite as the underlying data engine&lt;/a&gt;&lt;br /&gt;
&lt;a href="#Farfuture_cache_expiration"&gt;Far-future cache expiration&lt;/a&gt;&lt;br /&gt;
&lt;a href="#Publishing_as_a_core_feature"&gt;Publishing as a core feature&lt;/a&gt;&lt;br /&gt;
&lt;a href="#License_and_source_metadata"&gt;License and source metadata&lt;/a&gt;&lt;br /&gt;
&lt;a href="#Facet_everything"&gt;Facet everything&lt;/a&gt;&lt;br /&gt;
&lt;a href="#Respect_for_CSV"&gt;Respect for CSV&lt;/a&gt;&lt;br /&gt;
&lt;a href="#SQL_as_an_API_language"&gt;SQL as an API language&lt;/a&gt;&lt;br /&gt;
&lt;a href="#Optimistic_query_execution_with_time_limits"&gt;Optimistic query execution with time limits&lt;/a&gt;&lt;br /&gt;
&lt;a href="#Keyset_pagination"&gt;Keyset pagination&lt;/a&gt;&lt;br /&gt;
&lt;a href="#Interactive_demos_based_on_the_unit_tests"&gt;Interactive demos based on the unit tests&lt;/a&gt;&lt;br /&gt;
&lt;a href="#Documentation_unit_tests"&gt;Documentation unit tests&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;&lt;a id="Publishing_readonly_data"&gt;&lt;/a&gt;Publishing read-only data&lt;/h3&gt;
&lt;p&gt;Datasette provides a read-only API to your data. It makes no attempt to deal with writes. Avoiding writes entirely is fundamental to a plethora of interesting properties, many of which are expanded on further below. In brief:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Hosting web applications with no read/write persistence requirements is incredibly cheap in 2018 - often free (both &lt;a href="https://zeit.co/now"&gt;ZEIT Now&lt;/a&gt; and &lt;a href="https://www.heroku.com/"&gt;Heroku&lt;/a&gt; have generous free tiers). This is a big deal: even having to pay a few dollars a month is enough to disincentivise sharing data, since now you have to figure out who will pay and ensure the payments don’t expire in the future.&lt;/li&gt;
&lt;li&gt;Being read-only makes it trivial to scale: just add more instances, each with their own copy of the data. All of the hard problems in scaling web applications that relate to writable data stores can be skipped entirely.&lt;/li&gt;
&lt;li&gt;Since the database file is opened using SQLite’s &lt;a href="https://www.sqlite.org/uri.html#uriimmutable"&gt;immutable mode&lt;/a&gt;, we can accept arbitrary SQL queries with no risk of them corrupting the data.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Any time your data changes, you need to publish a brand new copy of the whole database. With the right hosting this is easy: deploy a brand new copy of your data and application in parallel to your existing live deployment, then switch over incoming HTTP traffic to your API at the load balancer level. Heroku and Zeit Now both support this strategy out of the box.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Bundling_the_data_with_the_code"&gt;&lt;/a&gt;Bundling the data with the code&lt;/h3&gt;
&lt;p&gt;Since the data is read-only and is encapsulated in a single binary SQLite database file, we can bundle the data as part of the app. This means we can trivially create and publish Docker images that provide both the data and the API and UI for accessing it. We can also publish to any hosting provider that will allow us to run a Python application, without also needing to provision a mutable database.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://datasette.readthedocs.io/en/stable/publish.html#datasette-package"&gt;datasette package&lt;/a&gt; command takes one or more SQLite databases and bundles them together with the Datasette application in a single Docker image, ready to be deployed anywhere that can run Docker containers.&lt;/p&gt;
&lt;h3&gt;&lt;a id="SQLite_as_the_underlying_data_engine"&gt;&lt;/a&gt;SQLite as the underlying data engine&lt;/h3&gt;
&lt;p&gt;Datasette encourages people to use &lt;a href="https://www.sqlite.org/"&gt;SQLite&lt;/a&gt; as a standard format for publishing data.&lt;/p&gt;
&lt;p&gt;Relational databases are great: once you know how to use them, you can represent any data you can imagine using a carefully designed schema.&lt;/p&gt;
&lt;p&gt;What about data that’s too unstructured to fit a relational schema? SQLite includes excellent &lt;a href="https://www.sqlite.org/json1.html"&gt;support for JSON data&lt;/a&gt; - so if you can’t shape your data to fit a table schema you can instead store it as text blobs of JSON - and use SQLite’s JSON functions to filter by or extract specific fields.&lt;/p&gt;
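&lt;p&gt;For example, using Python&amp;#39;s built-in &lt;code&gt;sqlite3&lt;/code&gt; module (this assumes a SQLite build with the JSON1 functions compiled in, which is standard in modern builds; the table and document are made up for illustration):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table documents (id integer primary key, body text)")
conn.execute(
    "insert into documents (body) values (?)",
    ('{"name": "Ada", "tags": ["math", "computing"]}',),
)
# Extract specific fields from the JSON text blob using JSON1 functions
name = conn.execute(
    "select json_extract(body, '$.name') from documents"
).fetchone()[0]
first_tag = conn.execute(
    "select json_extract(body, '$.tags[0]') from documents"
).fetchone()[0]
```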
&lt;p&gt;What about binary data? Even that’s covered: SQLite will happily store binary blobs. My &lt;a href="https://github.com/simonw/datasette-render-images"&gt;datasette-render-images plugin&lt;/a&gt; (&lt;a href="https://datasette-render-images-demo.datasette.io/favicons-6a27915/favicons"&gt;live demo here&lt;/a&gt;) is one example of a tool that works with binary image data stored in SQLite blobs.&lt;/p&gt;
&lt;p&gt;What if my data is too big? Datasette is not a “big data” tool, but if your definition of big data is something that won’t fit in RAM, that threshold is growing all the time (2TB of RAM on a single AWS instance &lt;a href="https://aws.amazon.com/about-aws/whats-new/2016/05/now-available-x1-instances-the-largest-amazon-ec2-memory-optimized-instance-with-2-tb-of-memory/"&gt;now costs less than $4/hour&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;I’ve personally had great results from multiple GB SQLite databases and Datasette. The theoretical maximum size of a single SQLite database is &lt;a href="https://www.sqlite.org/limits.html"&gt;around 140TB&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;SQLite also has built-in support for &lt;a href="https://www.sqlite.org/fts5.html"&gt;surprisingly good full-text search&lt;/a&gt;, and thanks to being extensible via modules has excellent geospatial functionality in the form of the &lt;a href="https://www.gaia-gis.it/fossil/libspatialite/index"&gt;SpatiaLite extension&lt;/a&gt;. Datasette benefits enormously from this wider ecosystem.&lt;/p&gt;
&lt;p&gt;The reason most developers avoid SQLite for production web applications is that it doesn’t deal brilliantly with large volumes of concurrent writes. Since Datasette is read-only we can entirely ignore this limitation.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Farfuture_cache_expiration"&gt;&lt;/a&gt;Far-future cache expiration&lt;/h3&gt;
&lt;p&gt;Since the data in a Datasette instance never changes, why not cache calls to it forever?&lt;/p&gt;
&lt;p&gt;Datasette sends a far-future HTTP cache expiry header with every API response. This means that browsers will only ever fetch data the first time a specific URL is accessed, and if you host Datasette behind a CDN such as &lt;a href="https://www.fastly.com/"&gt;Fastly&lt;/a&gt; or &lt;a href="https://www.cloudflare.com/"&gt;Cloudflare&lt;/a&gt; each unique API call will hit Datasette just once and then be cached essentially forever by the CDN.&lt;/p&gt;
&lt;p&gt;This means it’s safe to deploy a JavaScript app using an inexpensively hosted Datasette-backed API to the front page of even a high traffic site - the CDN will easily take the load.&lt;/p&gt;
&lt;p&gt;Zeit added Cloudflare to every deployment (even their free tier) &lt;a href="https://zeit.co/blog/now-cdn"&gt;back in July&lt;/a&gt;, so if you are hosted there you get this CDN benefit for free.&lt;/p&gt;
&lt;p&gt;What if you re-publish an updated copy of your data? Datasette has that covered too. You may have noticed that every Datasette database gets a hashed suffix automatically when it is deployed:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://fivethirtyeight.datasettes.com/fivethirtyeight-c9e67c4"&gt;https://fivethirtyeight.datasettes.com/fivethirtyeight-c9e67c4&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This suffix is based on the SHA256 hash of the entire database file contents - so any change to the data will result in new URLs. If you query a previous suffix Datasette will notice and redirect you to the new one.&lt;/p&gt;
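&lt;p&gt;Computing a suffix like that is straightforward - something along these lines (a sketch: the bytes below are fake, and the seven-character truncation matches the example URL above rather than being a documented constant):&lt;/p&gt;

```python
import hashlib

def hash_suffix(db_bytes, length=7):
    # SHA-256 of the entire database file contents, truncated.
    # Any change to the data produces a different suffix, and so a new URL.
    return hashlib.sha256(db_bytes).hexdigest()[:length]

suffix = hash_suffix(b"pretend these are the bytes of fivethirtyeight.db")
url = "https://fivethirtyeight.datasettes.com/fivethirtyeight-%s" % suffix
```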
&lt;p&gt;If you know you’ll be changing your data, you can build your application against the non-suffixed URL. This will not be cached and will always 302 redirect to the correct version (and these redirects are extremely fast).&lt;/p&gt;
&lt;p&gt;&lt;a href="https://fivethirtyeight.datasettes.com/fivethirtyeight/alcohol-consumption%2Fdrinks.json"&gt;https://fivethirtyeight.datasettes.com/fivethirtyeight/alcohol-consumption%2Fdrinks.json&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The redirect sends an HTTP/2 push header such that if you are running behind a CDN that understands push (&lt;a href="https://blog.cloudflare.com/announcing-support-for-http-2-server-push-2/"&gt;such as Cloudflare&lt;/a&gt;) your browser won’t have to make two requests to follow the redirect. You can use the Chrome DevTools to see this in action:&lt;/p&gt;
&lt;p&gt;&lt;img  style="max-width: 100%" src="https://static.simonwillison.net/static/2018/http2-push.png" alt="Chrome DevTools showing a redirect initiated by an HTTP/2 push" /&gt;&lt;/p&gt;
&lt;p&gt;And finally, if you need to opt out of HTTP caching for some reason you can disable it on a per-request basis by including &lt;code&gt;?_ttl=0&lt;/code&gt; &lt;a href="https://datasette.readthedocs.io/en/stable/json_api.html#special-json-arguments"&gt;in the URL query string&lt;/a&gt; - for example, if you want to return a random member of the Avengers it doesn’t make sense to cache the response:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://fivethirtyeight.datasettes.com/fivethirtyeight?sql=select+*+from+%5Bavengers%2Favengers%5D+order+by+random()+limit+1&amp;amp;_ttl=0"&gt;https://fivethirtyeight.datasettes.com/fivethirtyeight?sql=select+*+from+[avengers%2Favengers]+order+by+random()+limit+1&amp;amp;_ttl=0&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;&lt;a id="Publishing_as_a_core_feature"&gt;&lt;/a&gt;Publishing as a core feature&lt;/h3&gt;
&lt;p&gt;Datasette aims to reduce the friction for publishing interesting data online as much as possible.&lt;/p&gt;
&lt;p&gt;To this end, Datasette includes &lt;a href="https://datasette.readthedocs.io/en/stable/publish.html"&gt;a “publish” subcommand&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# deploy to Heroku
datasette publish heroku mydatabase.db
# Or deploy to Zeit Now
datasette publish now mydatabase.db
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These commands take one or more SQLite databases, upload them to a hosting provider, configure a Datasette instance to serve them and return the public URL of the newly deployed application.&lt;/p&gt;
&lt;p&gt;Out of the box, Datasette can publish to either Heroku or to Zeit Now. The &lt;a href="https://datasette.readthedocs.io/en/stable/plugins.html#publish-subcommand-publish"&gt;publish_subcommand plugin hook&lt;/a&gt; means other providers can be supported by writing plugins.&lt;/p&gt;
&lt;h3&gt;&lt;a id="License_and_source_metadata"&gt;&lt;/a&gt;License and source metadata&lt;/h3&gt;
&lt;p&gt;Datasette believes that data should be accompanied by source information and a license, whenever possible. The &lt;a href="https://datasette.readthedocs.io/en/stable/metadata.html"&gt;metadata.json file&lt;/a&gt; that can be bundled with your data supports these. You can also provide source and license information when you run &lt;code&gt;datasette publish&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;datasette publish now fivethirtyeight.db \
    --source=&amp;quot;FiveThirtyEight&amp;quot; \
    --source_url=&amp;quot;https://github.com/fivethirtyeight/data&amp;quot; \
    --license=&amp;quot;CC BY 4.0&amp;quot; \
    --license_url=&amp;quot;https://creativecommons.org/licenses/by/4.0/&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When you use these options Datasette will create the corresponding &lt;code&gt;metadata.json&lt;/code&gt; file for you as part of the deployment.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Facet_everything"&gt;&lt;/a&gt;Facet everything&lt;/h3&gt;
&lt;p&gt;I really love faceted search: it’s the first tool I turn to whenever I want to start understanding a collection of data. I’ve built faceted search engines on top of Solr, Elasticsearch and PostgreSQL and many of my favourite tools (like Splunk and Datadog) have it as a core feature.&lt;/p&gt;
&lt;p&gt;Datasette automatically attempts to calculate facets against every table. You can read &lt;a href="https://simonwillison.net/2018/May/20/datasette-facets/"&gt;more about the Datasette Facets feature here&lt;/a&gt; - as a huge faceted search fan it’s one of my all-time favourite features of the project. Now I can add SQLite to the list of technologies I’ve used to build faceted search!&lt;/p&gt;
&lt;h3&gt;&lt;a id="Respect_for_CSV"&gt;&lt;/a&gt;Respect for CSV&lt;/h3&gt;
&lt;p&gt;CSV is by far the most common format for sharing and publishing data online. Almost every useful data tool has the ability to export to it, and it remains the lingua franca of spreadsheet import and export.&lt;/p&gt;
&lt;p&gt;It has many flaws: it can’t easily represent nested data structures, escaping rules for values containing commas are inconsistently implemented, and it doesn’t have a standard way of representing character encoding.&lt;/p&gt;
&lt;p&gt;Datasette aims to promote SQLite as a much better default format for publishing data. I would much rather download a .db file full of pre-structured data than download a .csv and then have to re-structure it as a separate piece of work.&lt;/p&gt;
&lt;p&gt;But interacting well with the enormous CSV ecosystem is essential. Datasette has &lt;a href="https://datasette.readthedocs.io/en/stable/csv_export.html"&gt;deep CSV export functionality&lt;/a&gt;: any data you can see, you can export - including the results of arbitrary SQL queries. If your query can be paginated Datasette can stream down every page in a single CSV file for you.&lt;/p&gt;
&lt;p&gt;Datasette’s sister-tool &lt;a href="https://github.com/simonw/csvs-to-sqlite"&gt;csvs-to-sqlite&lt;/a&gt; handles the other side of the equation: importing data from CSV into SQLite tables. And the &lt;a href="https://simonwillison.net/2018/Jan/17/datasette-publish/"&gt;Datasette Publish web application&lt;/a&gt; allows users to upload their CSVs and have them deployed directly to their own fresh Datasette instance - no command line required.&lt;/p&gt;
&lt;h3&gt;&lt;a id="SQL_as_an_API_language"&gt;&lt;/a&gt;SQL as an API language&lt;/h3&gt;
&lt;p&gt;A lot of people these days are excited about &lt;a href="https://graphql.org/"&gt;GraphQL&lt;/a&gt;, because it allows API clients to request exactly the data they need, including traversing into related objects in a single query.&lt;/p&gt;
&lt;p&gt;Guess what? SQL has been able to do that since the 1970s!&lt;/p&gt;
&lt;p&gt;There are a number of reasons most APIs don’t allow people to pass them arbitrary SQL queries:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Security: we don’t want people messing up our data&lt;/li&gt;
&lt;li&gt;Performance: what if someone sends an accidental (or deliberate) expensive query that exhausts our resources?&lt;/li&gt;
&lt;li&gt;Hiding implementation details: if people write SQL against our API we can never change the structure of our database tables&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Datasette has answers to all three.&lt;/p&gt;
&lt;p&gt;On security: the data is read-only, using SQLite’s immutable mode. You can’t damage it with a query - INSERTs and UPDATEs will simply throw harmless errors.&lt;/p&gt;
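&lt;p&gt;You can see the same principle with Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module by opening a database read-only via a URI connection string. This is a sketch of the idea, not Datasette's code - immutable mode goes further, additionally assuming the file will never change:&lt;/p&gt;

```python
import os
import sqlite3
import tempfile

# create a throwaway database file with one row in it
path = os.path.join(tempfile.mkdtemp(), "data.db")
conn = sqlite3.connect(path)
conn.execute("create table avengers (name text)")
conn.execute("insert into avengers values ('Hulk')")
conn.commit()
conn.close()

# re-open it read-only: SELECTs work, writes raise harmless errors
ro = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
assert ro.execute("select name from avengers").fetchall() == [("Hulk",)]
try:
    ro.execute("insert into avengers values ('Loki')")
except sqlite3.OperationalError as e:
    print("write rejected:", e)
```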
&lt;p&gt;On performance: SQLite has a mechanism for canceling queries that take longer than a certain threshold. Datasette sets this to one second by default, though you can &lt;a href="https://datasette.readthedocs.io/en/stable/config.html#sql-time-limit-ms"&gt;alter that configuration&lt;/a&gt; if you need to (I often bump it up to ten seconds when exploring multi-GB data on my laptop).&lt;/p&gt;
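&lt;p&gt;One way to implement a limit like that is SQLite's progress handler, a callback invoked periodically during query execution that can abort the query by returning a truthy value. A minimal sketch of the pattern - Datasette's actual implementation differs in its details:&lt;/p&gt;

```python
import sqlite3
import time


def execute_with_time_limit(conn, sql, time_limit_ms):
    """Run sql, aborting it if it runs longer than time_limit_ms."""
    deadline = time.monotonic() + time_limit_ms / 1000
    def check():
        # a truthy return value aborts the running query
        return time.monotonic() > deadline
    # invoke the handler every 1000 SQLite virtual machine instructions
    conn.set_progress_handler(check, 1000)
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.set_progress_handler(None, 1000)


conn = sqlite3.connect(":memory:")
# an intentionally expensive query: count ten million generated rows
slow = """
with recursive c(x) as (select 1 union all select x + 1 from c limit 10000000)
select count(*) from c
"""
try:
    execute_with_time_limit(conn, slow, 50)
except sqlite3.OperationalError:
    print("query cancelled after 50ms")
```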
&lt;p&gt;On hidden implementation details: since we are publishing static data rather than maintaining an evolving API, we can mostly ignore this issue. If you are really worried about it you can take advantage of &lt;a href="https://datasette.readthedocs.io/en/stable/sql_queries.html#canned-queries"&gt;canned queries&lt;/a&gt; and &lt;a href="https://datasette.readthedocs.io/en/stable/sql_queries.html#views"&gt;SQL view definitions&lt;/a&gt; to expose a carefully selected forward-compatible view into your data.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Optimistic_query_execution_with_time_limits"&gt;&lt;/a&gt;Optimistic query execution with time limits&lt;/h3&gt;
&lt;p&gt;I mentioned Datasette’s SQL time limits above. These aren’t just there to avoid malicious queries: the idea of “optimistic SQL evaluation” is baked into some of Datasette’s core features.&lt;/p&gt;
&lt;p&gt;Consider &lt;a href="https://datasette.readthedocs.io/en/stable/facets.html#suggested-facets"&gt;suggested facets&lt;/a&gt; - where Datasette inspects any table you view and tries to suggest columns that are worth faceting against.&lt;/p&gt;
&lt;p&gt;The way this works is Datasette loops over &lt;em&gt;every&lt;/em&gt; column in the table and runs a query to see if there are fewer than 20 unique values for that column. On a large table this could take a prohibitive amount of time, so Datasette sets an aggressive timeout on those queries: &lt;a href="https://datasette.readthedocs.io/en/stable/config.html#facet-suggest-time-limit-ms"&gt;just 50ms&lt;/a&gt;. If the query fails to run in that time it is silently dropped and the column is not listed as a suggested facet.&lt;/p&gt;
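&lt;p&gt;A simplified version of that check, leaving out the 50ms time limit and Datasette's other heuristics (the table and column names here are purely illustrative):&lt;/p&gt;

```python
import sqlite3


def suggest_facets(conn, table, max_values=20):
    """Suggest columns whose number of distinct values is small.

    A sketch only: the real feature also applies a 50ms time limit
    to each query and has additional heuristics.
    """
    total = conn.execute(f'select count(*) from "{table}"').fetchone()[0]
    columns = [row[1] for row in conn.execute(f'pragma table_info("{table}")')]
    suggested = []
    for column in columns:
        # fetch at most max_values + 1 distinct values; getting the
        # extra one back means the column has too many to facet by
        distinct = conn.execute(
            f'select distinct "{column}" from "{table}" limit {max_values + 1}'
        ).fetchall()
        n = len(distinct)
        # interesting facets have more than one value, but not too
        # many, and repeat at least some values across rows
        if n > 1 and max_values >= n and total > n:
            suggested.append(column)
    return suggested


conn = sqlite3.connect(":memory:")
conn.execute("create table riders (id integer primary key, name text, team text)")
conn.executemany(
    "insert into riders (name, team) values (?, ?)",
    [("A", "red"), ("B", "red"), ("C", "blue"), ("D", "blue"), ("E", "green")],
)
print(suggest_facets(conn, "riders"))  # ['team']
```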
&lt;p&gt;Datasette’s JSON API provides a mechanism for JavaScript applications to use that same pattern. If you add &lt;code&gt;?_timelimit=20&lt;/code&gt; to any Datasette API call, the underlying query will only get 20ms to run. If it goes over you’ll get a very fast error response from the API. This means you can design your own features that attempt to optimistically run expensive queries without damaging the performance of your app.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Keyset_pagination"&gt;&lt;/a&gt;Keyset pagination&lt;/h3&gt;
&lt;p&gt;SQL pagination using OFFSET/LIMIT has a fatal flaw: if you request page number 300 at 20 per page the underlying SQL engine needs to calculate and sort all 6,000 preceding rows before it can return the 20 you have requested.&lt;/p&gt;
&lt;p&gt;This does not scale at all well.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://use-the-index-luke.com/sql/partial-results/fetch-next-page"&gt;Keyset pagination&lt;/a&gt; (often known by other names, including cursor-based pagination) is a far more efficient way to paginate through data. It works against ordered data. Each page is returned with a token representing the last record you saw, then when you request the next page the engine merely has to filter for records that are greater than that tokenized value and scan through the next 20 of them.&lt;/p&gt;
&lt;p&gt;(Actually, it scans through 21. By requesting one more record than you intend to display you can detect if another page of results exists - if you ask for 21 but get back 20 or fewer you know you are on the last page.)&lt;/p&gt;
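&lt;p&gt;The core of the technique fits in a few lines against SQLite. This is an illustrative sketch with made-up table and column names, not Datasette's actual code:&lt;/p&gt;

```python
import sqlite3


def keyset_page(conn, after_id=None, page_size=20):
    """Return one page of rows plus a token for the next page."""
    # fetch one extra row so we can tell whether another page exists
    rows = conn.execute(
        "select id, name from items where id > ? order by id limit ?",
        (after_id if after_id is not None else 0, page_size + 1),
    ).fetchall()
    has_more = len(rows) == page_size + 1
    rows = rows[:page_size]
    next_token = rows[-1][0] if has_more else None
    return rows, next_token


conn = sqlite3.connect(":memory:")
conn.execute("create table items (id integer primary key, name text)")
conn.executemany(
    "insert into items values (?, ?)",
    [(i, f"item {i}") for i in range(1, 46)],  # 45 rows
)

page, token = keyset_page(conn)
print(len(page), token)  # 20 20
page, token = keyset_page(conn, after_id=token)
print(len(page), token)  # 20 40
page, token = keyset_page(conn, after_id=token)
print(len(page), token)  # 5 None
```

Each page is a cheap indexed range scan on the primary key, so page 300 costs the same as page 1.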
&lt;p&gt;Datasette’s table view includes a sophisticated implementation of keyset pagination.&lt;/p&gt;
&lt;p&gt;Datasette defaults to sorting by primary key (or SQLite rowid). This is perfect for efficient pagination: running a select against the primary key column for values greater than X is one of the fastest range scan queries any database can support. This allows users to paginate as deep as they like without paying the offset/limit performance penalty.&lt;/p&gt;
&lt;p&gt;This is also how the “export all rows as CSV” option works: when you select that option, Datasette opens a stream to your browser and internally starts keyset-pagination over the entire table. This keeps resource usage in check even while streaming back millions of rows.&lt;/p&gt;
&lt;p&gt;Here’s where Datasette gets fancy: it handles keyset pagination for any other sort order as well. If you sort by any column and click “next” you’ll be requesting the next set of rows after the last value you saw. And this even works for columns containing duplicate values: if you sort by such a column, Datasette actually sorts by that column combined with the primary key. The “next” pagination token it generates encodes both the sorted value and the primary key, allowing it to correctly serve you the next page when you click the link.&lt;/p&gt;
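&lt;p&gt;A sketch of that compound cursor using SQLite's row value comparisons (supported since SQLite 3.15; the table and column names are again illustrative):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table tracks (id integer primary key, album text)")
conn.executemany(
    "insert into tracks (album) values (?)",
    [("a",), ("a",), ("a",), ("b",), ("b",)],  # album values repeat
)


def page_by_album(conn, after=None, page_size=2):
    # after is an (album, id) pair from the last row previously seen;
    # together the pair is unique, so no rows are skipped or repeated
    # even though album values are duplicated
    if after is None:
        sql = "select album, id from tracks order by album, id limit ?"
        params = (page_size + 1,)
    else:
        sql = ("select album, id from tracks where (album, id) > (?, ?) "
               "order by album, id limit ?")
        params = (after[0], after[1], page_size + 1)
    rows = conn.execute(sql, params).fetchall()
    has_more = len(rows) == page_size + 1
    rows = rows[:page_size]
    return rows, (rows[-1] if has_more else None)


rows, token = page_by_album(conn)
print(rows, token)  # [('a', 1), ('a', 2)] ('a', 2)
rows, token = page_by_album(conn, after=token)
print(rows, token)  # [('a', 3), ('b', 4)] ('b', 4)
rows, token = page_by_album(conn, after=token)
print(rows, token)  # [('b', 5)] None
```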
&lt;p&gt;Try clicking “next” &lt;a href="https://latest.datasette.io/fixtures/sortable?_sort_desc=sortable"&gt;on this page&lt;/a&gt; to see keyset pagination against a sorted column in action.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Interactive_demos_based_on_the_unit_tests"&gt;&lt;/a&gt;Interactive demos based on the unit tests&lt;/h3&gt;
&lt;p&gt;I love interactive demos. I decided it would be useful if every single release of Datasette had a permanent interactive demo illustrating its features.&lt;/p&gt;
&lt;p&gt;Thanks to Zeit Now, this was pretty easy to set up. I’ve actually taken it a step further: every successful push to master on GitHub is also deployed to a permanent URL.&lt;/p&gt;
&lt;p&gt;Some examples:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://latest.datasette.io/"&gt;https://latest.datasette.io/&lt;/a&gt; - the most recent commit to Datasette master. You can see the currently deployed commit hash on &lt;a href="https://latest.datasette.io/-/versions"&gt;https://latest.datasette.io/-/versions&lt;/a&gt; and compare it to &lt;a href="https://github.com/simonw/datasette/commits"&gt;https://github.com/simonw/datasette/commits&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://v0-25.datasette.io/"&gt;https://v0-25.datasette.io/&lt;/a&gt; is a permanent URL to the 0.25 tagged release of Datasette. See also &lt;a href="https://v0-24.datasette.io/"&gt;https://v0-24.datasette.io/&lt;/a&gt; and &lt;a href="https://v0-23-2.datasette.io/"&gt;https://v0-23-2.datasette.io/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://700d83d.datasette.io/-/versions"&gt;https://700d83d.datasette.io/-/versions&lt;/a&gt; is a permanent URL to the code from this commit: &lt;a href="https://github.com/simonw/datasette/commit/700d83d"&gt;https://github.com/simonw/datasette/commit/700d83d&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The database that is used for this demo is the exact same database that is created by Datasette’s &lt;a href="https://github.com/simonw/datasette/blob/master/tests/fixtures.py"&gt;unit test fixtures&lt;/a&gt;. The unit tests are already designed to exercise every feature, so reusing them for a live demo makes a lot of sense.&lt;/p&gt;
&lt;p&gt;You can view this test database on your own machine by checking out the full Datasette repository from GitHub and running the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;python tests/fixtures.py fixtures.db metadata.json
datasette fixtures.db -m metadata.json
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here’s &lt;a href="https://github.com/simonw/datasette/blob/96af802352e49e35751e295e9846aa39c5e22311/.travis.yml#L23-L42"&gt;the code in the Datasette Travis CI configuration&lt;/a&gt; that deploys a live demo for every commit and every released tag.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Documentation_unit_tests"&gt;&lt;/a&gt;Documentation unit tests&lt;/h3&gt;
&lt;p&gt;I wrote about the &lt;a href="https://simonwillison.net/2018/Jul/28/documentation-unit-tests/"&gt;Documentation unit tests&lt;/a&gt; pattern back in July.&lt;/p&gt;
&lt;p&gt;Datasette’s unit tests &lt;a href="https://github.com/simonw/datasette/blob/master/tests/test_docs.py"&gt;include some assertions&lt;/a&gt; that ensure that every plugin hook, configuration setting and underlying view class is mentioned in the documentation. A commit or pull request that adds or modifies these without also updating the documentation (or at least ensuring there is a corresponding heading in the docs) will fail its tests.&lt;/p&gt;
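&lt;p&gt;The heart of the pattern is tiny: gather the names you care about, then assert that each one appears somewhere in the documentation source. A distilled sketch - the hook names here are real Datasette plugin hooks, but the documentation snippet is invented and the real tests are more sophisticated about matching labelled sections:&lt;/p&gt;

```python
def undocumented(names, docs_text):
    """Return every name that never appears in the documentation."""
    return [name for name in names if name not in docs_text]


# real hook names, paired with a made-up documentation snippet that
# is missing one of them
hooks = ["prepare_connection", "prepare_jinja2_environment", "extra_css_urls"]
docs = """
prepare_connection(conn)
    Modify the SQLite connection when it is created...
extra_css_urls()
    Extra CSS URLs to include on every page...
"""

missing = undocumented(hooks, docs)
print("undocumented hooks:", missing)  # ['prepare_jinja2_environment']
```

In a test suite this becomes `assert undocumented(hooks, docs) == []`, so any commit adding a hook without documenting it fails CI.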
&lt;h3&gt;&lt;a id="Learning_more"&gt;&lt;/a&gt;Learning more&lt;/h3&gt;
&lt;p&gt;Datasette’s &lt;a href="http://datasette.readthedocs.io/"&gt;documentation&lt;/a&gt; is in pretty good shape now, and &lt;a href="https://datasette.readthedocs.io/en/stable/changelog.html"&gt;the changelog&lt;/a&gt; provides a detailed overview of new features that I’ve added to the project. I presented Datasette at the PyBay conference in August and I’ve published &lt;a href="https://static.simonwillison.net/static/2018/pybay-datasette/"&gt;my annotated slides&lt;/a&gt; from that talk. I was &lt;a href="https://changelog.com/podcast/296#t=00:54:45"&gt;interviewed about Datasette&lt;/a&gt; for the Changelog podcast in May and &lt;a href="https://simonwillison.net/2018/May/9/changelog/"&gt;my notes from that conversation&lt;/a&gt; include some of my favourite demos.&lt;/p&gt;
&lt;p&gt;Datasette now has an official Twitter account - you can follow &lt;a href="https://twitter.com/datasetteproj"&gt;@datasetteproj&lt;/a&gt; there for updates about the project.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/testing"&gt;testing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/baked-data"&gt;baked-data&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="projects"/><category term="sqlite"/><category term="testing"/><category term="datasette"/><category term="baked-data"/></entry></feed>