<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: baked-data</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/baked-data.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-03-17T22:49:59+00:00</updated><author><name>Simon Willison</name></author><entry><title>OpenTimes</title><link href="https://simonwillison.net/2025/Mar/17/opentimes/#atom-tag" rel="alternate"/><published>2025-03-17T22:49:59+00:00</published><updated>2025-03-17T22:49:59+00:00</updated><id>https://simonwillison.net/2025/Mar/17/opentimes/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://sno.ws/opentimes/"&gt;OpenTimes&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Spectacular new open geospatial project by &lt;a href="https://sno.ws/"&gt;Dan Snow&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;OpenTimes is a database of pre-computed, point-to-point travel times between United States Census geographies. It lets you download bulk travel time data for free and with no limits.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://opentimes.org/?id=060816135022&amp;amp;mode=car#9.76/37.5566/-122.3085"&gt;what I get&lt;/a&gt; for travel times by car from El Granada, California:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Isochrone map showing driving times from the El Granada census tract to other places in the San Francisco Bay Area" src="https://static.simonwillison.net/static/2025/opentimes.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;The technical details are &lt;em&gt;fascinating&lt;/em&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;The entire OpenTimes backend is just static Parquet files on &lt;a href="https://www.cloudflare.com/developer-platform/products/r2/"&gt;Cloudflare's R2&lt;/a&gt;. There's no RDBMS or running service, just files and a CDN. The whole thing costs about $10/month to host and costs nothing to serve. In my opinion, this is a &lt;em&gt;great&lt;/em&gt; way to serve infrequently updated, large public datasets at low cost (as long as you partition the files correctly).&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Sure enough, &lt;a href="https://developers.cloudflare.com/r2/pricing/"&gt;R2 pricing&lt;/a&gt; charges "based on the total volume of data stored" - $0.015 / GB-month for standard storage, then $0.36 / million requests for "Class B" operations which include reads. They charge nothing for outbound bandwidth.&lt;/p&gt;
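&lt;p&gt;Those rates make back-of-the-envelope estimates easy. Here's a quick sketch in Python - the per-unit prices are the ones quoted above, but the dataset size and read volume are invented example figures, not OpenTimes' actual numbers:&lt;/p&gt;

```python
# Back-of-envelope R2 cost estimate. The per-unit rates are from
# Cloudflare's published pricing; the example inputs are made up.

STORAGE_PER_GB_MONTH = 0.015   # USD per GB-month, standard storage
CLASS_B_PER_MILLION = 0.36     # USD per million Class B (read) operations

def monthly_cost(stored_gb: float, reads: int) -> float:
    """Estimate the monthly R2 bill: storage plus read operations.
    Egress bandwidth is free, so it does not appear here."""
    storage = stored_gb * STORAGE_PER_GB_MONTH
    requests = (reads / 1_000_000) * CLASS_B_PER_MILLION
    return storage + requests

# e.g. ~500 GB of Parquet files and 2 million reads in a month:
print(round(monthly_cost(500, 2_000_000), 2))  # → 8.22
```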
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;All travel times were calculated by pre-building the inputs (OSM, OSRM networks) and then distributing the compute over &lt;a href="https://github.com/dfsnow/opentimes/actions/workflows/calculate-times.yaml"&gt;hundreds of GitHub Actions jobs&lt;/a&gt;. This worked shockingly well for this specific workload (and was also completely free).&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's a &lt;a href="https://github.com/dfsnow/opentimes/actions/runs/13094249792"&gt;GitHub Actions run&lt;/a&gt; of the &lt;a href="https://github.com/dfsnow/opentimes/blob/a6a5f7abcdd69559b3e29f360fe0ff0399dbb400/.github/workflows/calculate-times.yaml#L78-L80"&gt;calculate-times.yaml workflow&lt;/a&gt; which uses a matrix to run 255 jobs!&lt;/p&gt;
&lt;p&gt;&lt;img alt="GitHub Actions run: calculate-times.yaml run by workflow_dispatch taking 1h49m to execute 255 jobs with names like run-job (2020-01) " src="https://static.simonwillison.net/static/2025/opentimes-github-actions.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;Relevant YAML:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;  matrix:
    year: ${{ fromJSON(needs.setup-jobs.outputs.years) }}
    state: ${{ fromJSON(needs.setup-jobs.outputs.states) }}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Where those JSON files were created by the previous step, which reads in the year and state values from &lt;a href="https://github.com/dfsnow/opentimes/blob/a6a5f7abcdd69559b3e29f360fe0ff0399dbb400/data/params.yaml#L72-L132"&gt;this params.yaml file&lt;/a&gt;.&lt;/p&gt;
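&lt;p&gt;A setup job like that typically serializes its lists as JSON strings and appends them to the file named by &lt;code&gt;GITHUB_OUTPUT&lt;/code&gt;, where later jobs pick them up via &lt;code&gt;fromJSON()&lt;/code&gt;. Here's a hypothetical sketch of that step - the field names and values are illustrative, not copied from the OpenTimes repo:&lt;/p&gt;

```python
# Hypothetical sketch of a matrix setup job: read year and state
# lists (here hard-coded as a stand-in for the parsed params.yaml)
# and render them as JSON strings in the key=value format that
# $GITHUB_OUTPUT expects.
import json

params = {  # stand-in for values read from params.yaml
    "years": ["2020", "2021", "2022", "2023", "2024"],
    "states": ["01", "02", "04"],  # one entry per state FIPS code
}

def github_outputs(params: dict) -> list[str]:
    """Render key=value lines suitable for appending to $GITHUB_OUTPUT."""
    return [f"{key}={json.dumps(value)}" for key, value in params.items()]

lines = github_outputs(params)
# In a real workflow step you would append these lines to the file
# named by the GITHUB_OUTPUT environment variable.
print(lines[0])
```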
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;The query layer uses a single DuckDB database file with &lt;em&gt;views&lt;/em&gt; that point to static Parquet files via HTTP. This lets you query a table with hundreds of billions of records after downloading just the ~5MB pointer file.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is a really creative use of DuckDB's feature that lets you run queries against large data from a laptop using HTTP range queries to avoid downloading the whole thing.&lt;/p&gt;
&lt;p&gt;The README shows &lt;a href="https://github.com/dfsnow/opentimes/blob/3439fa2c54af227e40997b4a5f55678739e0f6df/README.md#using-duckdb"&gt;how to use that from R and Python&lt;/a&gt; - I got this working in the &lt;code&gt;duckdb&lt;/code&gt; client (&lt;code&gt;brew install duckdb&lt;/code&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;INSTALL httpfs;
LOAD httpfs;
ATTACH 'https://data.opentimes.org/databases/0.0.1.duckdb' AS opentimes;

SELECT origin_id, destination_id, duration_sec
  FROM opentimes.public.times
  WHERE version = '0.0.1'
      AND mode = 'car'
      AND year = '2024'
      AND geography = 'tract'
      AND state = '17'
      AND origin_id LIKE '17031%'
  LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In answer to a question about adding public transit times, &lt;a href="https://news.ycombinator.com/item?id=43392521#43393183"&gt;Dan said&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In the next year or so maybe. The biggest obstacles to adding public transit are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Collecting all the necessary scheduling data (e.g. GTFS feeds) for every transit system in the country. Not insurmountable since there are services that do this currently.&lt;/li&gt;
&lt;li&gt;Finding a routing engine that can compute nation-scale travel time matrices quickly. Currently, the two fastest open-source engines I've tried (OSRM and Valhalla) don't support public transit for matrix calculations and the engines that do support public transit (R5, OpenTripPlanner, etc.) are too slow.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://gtfs.org/"&gt;GTFS&lt;/a&gt; is a popular CSV-based format for sharing transit schedules - here's &lt;a href="https://gtfs.org/resources/data/"&gt;an official list&lt;/a&gt; of available feed directories.&lt;/p&gt;
&lt;p&gt;This whole project feels to me like a great example of the &lt;a href="https://simonwillison.net/2021/Jul/28/baked-data/"&gt;baked data&lt;/a&gt; architectural pattern in action.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=43392521"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/census"&gt;census&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/geospatial"&gt;geospatial&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/open-data"&gt;open-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openstreetmap"&gt;openstreetmap&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cloudflare"&gt;cloudflare&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parquet"&gt;parquet&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/baked-data"&gt;baked-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/duckdb"&gt;duckdb&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/http-range-requests"&gt;http-range-requests&lt;/a&gt;&lt;/p&gt;



</summary><category term="census"/><category term="geospatial"/><category term="open-data"/><category term="openstreetmap"/><category term="cloudflare"/><category term="parquet"/><category term="github-actions"/><category term="baked-data"/><category term="duckdb"/><category term="http-range-requests"/></entry><entry><title>Quoting Jake Teton-Landis</title><link href="https://simonwillison.net/2024/Sep/25/jake-teton-landis/#atom-tag" rel="alternate"/><published>2024-09-25T18:08:19+00:00</published><updated>2024-09-25T18:08:19+00:00</updated><id>https://simonwillison.net/2024/Sep/25/jake-teton-landis/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://news.ycombinator.com/item?id=41645173#41648480"&gt;&lt;p&gt;We used this model [periodically transmitting configuration to different hosts] to distribute translations, feature flags, configuration, search indexes, etc at Airbnb. But instead of SQLite we used &lt;a href="https://github.com/spotify/sparkey"&gt;Sparkey&lt;/a&gt;, a KV file format developed by Spotify. In early years there was a Cron job on every box that pulled that service’s thingies; then once we switched to Kubernetes we used a daemonset &amp;amp; host tagging (taints?) to pull a variety of thingies to each host and then ensure the services that use the thingies only ran on the hosts that had the thingies.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://news.ycombinator.com/item?id=41645173#41648480"&gt;Jake Teton-Landis&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/kubernetes"&gt;kubernetes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/feature-flags"&gt;feature-flags&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/baked-data"&gt;baked-data&lt;/a&gt;&lt;/p&gt;



</summary><category term="sqlite"/><category term="kubernetes"/><category term="feature-flags"/><category term="baked-data"/></entry><entry><title>Clickhouse on Cloud Run</title><link href="https://simonwillison.net/2021/Jul/29/clickhouse-on-cloud-run/#atom-tag" rel="alternate"/><published>2021-07-29T06:07:51+00:00</published><updated>2021-07-29T06:07:51+00:00</updated><id>https://simonwillison.net/2021/Jul/29/clickhouse-on-cloud-run/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://alexjreid.dev/posts/clickhouse-on-cloud-run/"&gt;Clickhouse on Cloud Run&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Alex Reid figured out how to run Clickhouse against read-only baked data on Cloud Run last year, and wrote up some comprehensive notes.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/alexjreid/status/1420625467384737797"&gt;@alexjreid&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cloudrun"&gt;cloudrun&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/baked-data"&gt;baked-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/clickhouse"&gt;clickhouse&lt;/a&gt;&lt;/p&gt;



</summary><category term="cloudrun"/><category term="baked-data"/><category term="clickhouse"/></entry><entry><title>The Baked Data architectural pattern</title><link href="https://simonwillison.net/2021/Jul/28/baked-data/#atom-tag" rel="alternate"/><published>2021-07-28T20:23:44+00:00</published><updated>2021-07-28T20:23:44+00:00</updated><id>https://simonwillison.net/2021/Jul/28/baked-data/#atom-tag</id><summary type="html">
    &lt;p&gt;I've been exploring an architectural pattern for publishing websites over the past few years that I call the "Baked Data" pattern. It provides many of the advantages of static site generators while avoiding most of their limitations. I think it deserves to be used more widely.&lt;/p&gt;
&lt;p&gt;I define the Baked Data architectural pattern as the following:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Baked Data: bundling a read-only copy of your data alongside the code for your application, as part of the same deployment&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Most dynamic websites keep their code and data separate: the code runs on an application server, the data lives independently in some kind of external data store - something like PostgreSQL, MySQL or MongoDB.&lt;/p&gt;
&lt;p&gt;With Baked Data, the data is deployed as part of the application bundle. Any time the content changes, a fresh copy of the site is deployed that includes those updates.&lt;/p&gt;
&lt;p&gt;I mostly use SQLite database files for this, but plenty of other formats can work here too.&lt;/p&gt;
&lt;p&gt;This works particularly well with so-called "serverless" deployment platforms - platforms that support stateless deployments and only charge for resources spent servicing incoming requests ("scale to zero").&lt;/p&gt;
&lt;p&gt;Since every change to the data results in a fresh deployment this pattern doesn't work for sites that change often - but in my experience many content-oriented sites update their content at most a few times a day. Consider blogs, documentation sites, project websites - anything where content is edited by a small group of authors.&lt;/p&gt;
&lt;h4 id="benefits-of-baked-data"&gt;Benefits of Baked Data&lt;/h4&gt;
&lt;p&gt;Why would you want to apply this pattern? A few reasons:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Inexpensive to host&lt;/strong&gt;. Anywhere that can run application code can host a Baked Data application - there's no need to pay extra for a managed database system. Scale-to-zero serverless hosts such as &lt;a href="https://cloud.google.com/run"&gt;Cloud Run&lt;/a&gt;, &lt;a href="https://vercel.com/"&gt;Vercel&lt;/a&gt; or &lt;a href="https://aws.amazon.com/lambda/"&gt;AWS Lambda&lt;/a&gt; will charge only cents per month for low-traffic deployments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Easy to scale&lt;/strong&gt;. Need to handle more traffic? Run more copies of your application and its bundled data. Horizontally scaling Baked Data applications is trivial. They're also a great fit to run behind a caching proxy CDN such as Cloudflare or Fastly - when you deploy a new version you can purge that entire cache.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Difficult to break&lt;/strong&gt;. Hosting server-side applications on a VPS is always disquieting because there's so much that might go wrong - the server could be compromised, or a rogue log file could cause it to run out of disk space. With Baked Data the worst that can happen is that you need to re-deploy the application - there's no risk at all of data loss, and providers that can auto-restart code can recover from errors automatically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Server-side functionality is supported&lt;/strong&gt;. Static site generators provide many of the above benefits, but with the limitation that any dynamic functionality needs to happen in client-side JavaScript. With a Baked Data application you can execute server-side code too.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Templated pages&lt;/strong&gt;. Another improvement over static site generators: if you have 10,000 pages, a static site generator will need to generate 10,000 HTML files. With Baked Data those 10,000 pages can exist as rows in a single SQLite database file, and the pages can be generated at run-time using a server-side template.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Easy to support multiple formats&lt;/strong&gt;. Since your content is in a dynamic data store, outputting that same content in alternative formats is easy. I use Datasette plugins for this: &lt;a href="https://datasette.io/plugins/datasette-atom"&gt;datasette-atom&lt;/a&gt; can produce an Atom feed from a SQL query, and &lt;a href="https://datasette.io/plugins/datasette-ics"&gt;datasette-ics&lt;/a&gt; does the same thing for iCalendar feeds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integrates well with version control&lt;/strong&gt;. I like to keep my site content under version control. The Baked Data pattern works well with build scripts that read content from a git repository and use it to build assets that are bundled with the deployment.&lt;/li&gt;
&lt;/ul&gt;
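&lt;p&gt;The templated-pages point is easy to demonstrate: one template plus rows in a SQLite table can stand in for thousands of generated files. Here's a toy sketch using only Python's standard library - a real site would use something like Datasette with Jinja templates instead:&lt;/p&gt;

```python
# Toy illustration of templated pages: 10,000 pages are one template
# plus 10,000 rows in SQLite, rendered at request time. Stdlib only;
# the schema and template are invented for this example.
import sqlite3
from string import Template

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (slug TEXT PRIMARY KEY, title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?)",
    [(f"page-{i}", f"Page {i}", f"Body of page {i}") for i in range(10_000)],
)

PAGE_TEMPLATE = Template("$title | $body")

def render(slug: str) -> str:
    """Look up one row and render it through the shared template."""
    row = conn.execute(
        "SELECT title, body FROM pages WHERE slug = ?", (slug,)
    ).fetchone()
    return PAGE_TEMPLATE.substitute(title=row[0], body=row[1])

print(render("page-42"))  # → Page 42 | Body of page 42
```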
&lt;h4 id="how-to-bake-your-data"&gt;How to bake your data&lt;/h4&gt;
&lt;p&gt;My initial implementations of Baked Data have all used SQLite. It's an ideal format for this kind of application: a single binary file which can store anything that can be represented as relational tables, &lt;a href="https://www.sqlite.org/json1.html"&gt;JSON documents&lt;/a&gt; or &lt;a href="https://simonwillison.net/2020/Jul/30/fun-binary-data-and-sqlite/"&gt;binary objects&lt;/a&gt; - essentially anything at all.&lt;/p&gt;
&lt;p&gt;Any format that can be read from disk by your dynamic server-side code will work too: YAML or CSV files, Berkeley DB files, or anything else that can be represented by a bucket of read-only bytes in a file on disk.&lt;/p&gt;
&lt;p&gt;[I have a hunch that you could even use something like PostgreSQL, MySQL or Elasticsearch by packaging up their on-disk representations and shipping them as part of a Docker container, but I've not tried that myself yet.]&lt;/p&gt;
&lt;p&gt;Once your data is available in a file, your application code can read from that file and use it to generate and return web pages.&lt;/p&gt;
&lt;p&gt;You can write code that does this in any server-side language. I use Python, usually with my &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; application server which can read from a SQLite database file and use &lt;a href="https://docs.datasette.io/en/stable/custom_templates.html"&gt;Jinja templates&lt;/a&gt; to generate pages.&lt;/p&gt;
&lt;p&gt;The final piece of the puzzle is a build and deploy script. I use GitHub Actions for this, but any CI tool will work well here. The script builds the site content into a deployable asset, then deploys that asset along with the application code to a hosting platform.&lt;/p&gt;
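&lt;p&gt;As a simplified sketch of that baking step - the table name and content rows here are invented, and in practice tools like &lt;a href="https://datasette.io/tools/yaml-to-sqlite"&gt;yaml-to-sqlite&lt;/a&gt; automate this:&lt;/p&gt;

```python
# Simplified sketch of a bake step: a build script turns source
# content into a SQLite file that ships with the deployment. The
# content list stands in for whatever your repo holds (YAML files,
# markdown with front matter, etc.).
import sqlite3

content = [
    {"slug": "hello", "title": "Hello", "body": "First post"},
    {"slug": "second", "title": "Second", "body": "Another post"},
]

def bake(rows: list[dict], path: str = ":memory:") -> sqlite3.Connection:
    """Write content rows into a SQLite database - the 'baked' asset."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts (slug TEXT PRIMARY KEY, title TEXT, body TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO posts VALUES (:slug, :title, :body)", rows
    )
    conn.commit()
    return conn

db = bake(content)  # a build script would pass a real file path here
print(db.execute("SELECT count(*) FROM posts").fetchone()[0])  # → 2
```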
&lt;h4 id="baked-data-datasette-io"&gt;Baked Data in action: datasette.io&lt;/h4&gt;
&lt;p&gt;The most sophisticated Baked Data site I've published myself is the official website for my Datasette project, &lt;a href="https://datasette.io/"&gt;datasette.io&lt;/a&gt; - source code &lt;a href="https://github.com/simonw/datasette.io"&gt;in this repo&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/datasette-io-baked-data.png" alt="A screenshot of the datasette.io homepage" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;The site is deployed using Cloud Run. It's actually a heavily customized Datasette instance, using &lt;a href="https://github.com/simonw/datasette.io/blob/main/templates/index.html"&gt;a custom template&lt;/a&gt; for the homepage, &lt;a href="https://github.com/simonw/datasette.io/tree/main/templates/pages"&gt;custom pages&lt;/a&gt; for other parts of the site and the &lt;a href="https://datasette.io/plugins/datasette-template-sql"&gt;datasette-template-sql&lt;/a&gt; plugin to execute SQL queries and display their results from those templates.&lt;/p&gt;
&lt;p&gt;The site currently runs off &lt;a href="https://datasette.io/-/databases"&gt;four database files&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://datasette.io/content"&gt;content.db&lt;/a&gt; has most of the site content. It is built inside GitHub Actions by the &lt;a href="https://github.com/simonw/datasette.io/blob/main/scripts/build.sh"&gt;build.sh&lt;/a&gt; script, which does the following:
&lt;ul&gt;
&lt;li&gt;Import the contents of the &lt;a href="https://github.com/simonw/datasette.io/blob/main/news.yaml"&gt;news.yaml&lt;/a&gt; file into a &lt;a href="https://datasette.io/content/news"&gt;news&lt;/a&gt; table using &lt;a href="https://datasette.io/tools/yaml-to-sqlite"&gt;yaml-to-sqlite&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Import the markdown files from the &lt;a href="https://github.com/simonw/datasette.io/tree/main/for"&gt;for/ folder&lt;/a&gt; (use-cases for Datasette) into the &lt;a href="https://datasette.io/content/uses"&gt;uses&lt;/a&gt; table using &lt;a href="https://datasette.io/tools/markdown-to-sqlite"&gt;markdown-to-sqlite&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Populate the &lt;a href="https://datasette.io/content/plugin_repos"&gt;plugin_repos&lt;/a&gt; and &lt;a href="https://datasette.io/content/tool_repos"&gt;tool_repos&lt;/a&gt; single-column tables using data from more YAML files. These are used in the next step.&lt;/li&gt;
&lt;li&gt;Run the &lt;a href="https://github.com/simonw/datasette.io/blob/main/build_directory.py"&gt;build_directory.py&lt;/a&gt; Python script. This uses the GitHub GraphQL API to fetch information about all of those plugin and tool repositories, including their README files and their most recent tagged &lt;a href="https://datasette.io/content/releases"&gt;releases&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Populate a &lt;a href="https://datasette.io/content/stats"&gt;stats&lt;/a&gt; table with the latest download statistics for all of the Datasette ecosystem PyPI packages. That data is imported from a &lt;code&gt;stats.json&lt;/code&gt; file in my &lt;a href="https://github.com/simonw/package-stats"&gt;simonw/package-stats&lt;/a&gt; repository, which is itself populated by this &lt;a href="https://github.com/simonw/package-stats/blob/main/.github/workflows/fetch_stats.yml"&gt;git scraping script&lt;/a&gt; that runs in GitHub Actions. I also use this for my &lt;a href="https://observablehq.com/@simonw/datasette-downloads-per-day-with-observable-plot"&gt;Datasette Downloads Observable notebook&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://datasette.io/blog"&gt;blog.db&lt;/a&gt; contains content from my blog that carries any of the &lt;a href="https://simonwillison.net/tags/datasette/"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dogsheep/"&gt;dogsheep&lt;/a&gt; or &lt;a href="https://simonwillison.net/tags/sqliteutils/"&gt;sqliteutils&lt;/a&gt; tags.
&lt;ul&gt;
&lt;li&gt;This is fetched by the &lt;a href="https://github.com/simonw/datasette.io/blob/main/fetch_blog_content.py"&gt;fetch_blog_content.py&lt;/a&gt; script, which hits the paginated per-tag Atom feed for my blog content, &lt;a href="https://github.com/simonw/simonwillisonblog/blob/a5b53a24b00d4c95c88c8371cfc17453b0726c23/blog/views.py#L411-L421"&gt;implemented in Django here&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://datasette.io/docs-index"&gt;docs-index.db&lt;/a&gt; is a database table containing the documentation for the most recent stable Datasette release, broken up by &lt;a href="https://datasette.io/docs-index/sections"&gt;sections&lt;/a&gt;.
&lt;ul&gt;
&lt;li&gt;This database file is downloaded from a separate site, &lt;a href="https://stable-docs.datasette.io/"&gt;stable-docs.datasette.io&lt;/a&gt;, which is built and deployed as &lt;a href="https://github.com/simonw/datasette/blob/0.58.1/.github/workflows/publish.yml#L60-L98"&gt;part of Datasette's release process&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://datasette.io/dogsheep-index"&gt;dogsheep-index.db&lt;/a&gt; is the search index that powers site search (e.g. &lt;a href="https://datasette.io/-/beta?q=dogsheep"&gt;this search for dogsheep&lt;/a&gt;).
&lt;ul&gt;
&lt;li&gt;The search index is built by &lt;a href="https://datasette.io/plugins/dogsheep-beta"&gt;dogsheep-beta&lt;/a&gt; using data pulled from tables in the other database files, as configured by &lt;a href="https://github.com/simonw/datasette.io/blob/main/templates/dogsheep-beta.yml"&gt;this YAML file&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The site is automatically deployed once a day by &lt;a href="https://github.com/simonw/datasette.io/blob/main/.github/workflows/deploy.yml"&gt;a scheduled action&lt;/a&gt;, and I can also manually trigger that action if I want to ensure a new software release is reflected on the homepage.&lt;/p&gt;
&lt;h4 id="other-real-world-examples"&gt;Other real-world examples of Baked Data&lt;/h4&gt;
&lt;p&gt;I'm currently running two other sites using this pattern:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.niche-museums.com/"&gt;Niche Museums&lt;/a&gt; is my blog about tiny museums that I've visited. Again, it's Datasette with custom templates. Most of the content comes from this &lt;a href="https://github.com/simonw/museums/blob/main/museums.yaml"&gt;museums.yaml&lt;/a&gt; file, but I also run &lt;a href="https://github.com/simonw/museums/blob/main/annotate_timestamps.py"&gt;a script&lt;/a&gt; to figure out when each item was created or updated from the git history.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://til.simonwillison.net/"&gt;My TILs site&lt;/a&gt; runs on &lt;a href="https://vercel.com/"&gt;Vercel&lt;/a&gt; and is built from my &lt;a href="https://github.com/simonw/til"&gt;simonw/til&lt;/a&gt; GitHub repository by &lt;a href="https://github.com/simonw/til/blob/main/build_database.py"&gt;this build script&lt;/a&gt; (populating &lt;a href="https://til.simonwillison.net/tils"&gt;this tils table&lt;/a&gt;). It uses the GitHub API to convert GitHub Flavored Markdown to HTML. I'm also running &lt;a href="https://github.com/simonw/til/blob/main/generate_screenshots.py"&gt;a script&lt;/a&gt; that generates small screenshots of each page and stashes them in a BLOB column in SQLite in order to provide social media preview cards, see &lt;a href="https://simonwillison.net/2020/Sep/3/weeknotes-airtable-screenshots-dogsheep/#weeknotes-2020-09-03-social-media-cards-tils"&gt;Social media cards for my TILs&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;My favourite example of this pattern in a site that I haven't worked on myself is &lt;a href="https://www.mozilla.org/"&gt;Mozilla.org&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;They started using SQLite back in 2018 in a system they call Bedrock - Paul McLanahan provides &lt;a href="https://mozilla.github.io/meao/2018/03/28/bedrock-the-sqlitening/"&gt;a detailed description&lt;/a&gt; of how this works.&lt;/p&gt;
&lt;p&gt;Their site content lives in a ~22MB SQLite database file, which is built and uploaded to S3 and then downloaded on a regular basis to each of their application servers.&lt;/p&gt;
&lt;p&gt;You can view &lt;a href="https://www.mozilla.org/healthz-cron/"&gt;their healthcheck page&lt;/a&gt; to see when the database was last downloaded, and grab a copy of the SQLite file yourself. It's fun to explore that using Datasette:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/mozilla-site-content.png" alt="Datasette running against the Mozilla contentncards_contentcard table" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;h4 id="compared-to-ssgs"&gt;Compared to static site generators&lt;/h4&gt;
&lt;p&gt;Static site generators have exploded in popularity over the past ten years. They drive the cost of hosting a site down to almost nothing, provide excellent performance, work well with CDNs and produce sites that are extremely unlikely to break.&lt;/p&gt;
&lt;p&gt;Used carefully, the Baked Data pattern keeps most of these characteristics while still enabling server-side code execution.&lt;/p&gt;
&lt;p&gt;My example sites use this in a few different ways:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;datasette.io &lt;a href="https://datasette.io/-/beta?q=search"&gt;provides search&lt;/a&gt; across 1,588 different pieces of content, plus &lt;a href="https://datasette.io/plugins?q=ics"&gt;simpler search&lt;/a&gt; on the plugins and tools pages.&lt;/li&gt;
&lt;li&gt;My TIL site also &lt;a href="https://til.simonwillison.net/tils/search?q=search"&gt;provides search&lt;/a&gt;, as &lt;a href="https://www.niche-museums.com/browse/search?q=bigfoot"&gt;does Niche Museums&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;All three sites provide Atom feeds that are configured using a server-side SQL query: &lt;a href="https://datasette.io/content/feed.atom"&gt;Datasette&lt;/a&gt;, &lt;a href="https://www.niche-museums.com/browse/feed.atom"&gt;Niche Museums&lt;/a&gt;, &lt;a href="https://til.simonwillison.net/tils/feed.atom"&gt;TILs&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Niche Museums offers a "Use my location" button which then serves &lt;a href="https://www.niche-museums.com/?latitude=37.5&amp;amp;longitude=-122.5"&gt;museums near you&lt;/a&gt;, using &lt;a href="https://www.niche-museums.com/browse/nearby"&gt;a SQL query&lt;/a&gt; that makes use of the &lt;a href="https://datasette.io/plugins/datasette-haversine"&gt;datasette-haversine&lt;/a&gt; plugin.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A common complaint about static site generators when used for larger sites is that build times can get pretty long if the builder has to generate tens of thousands of pages.&lt;/p&gt;
&lt;p&gt;With Baked Data, 10,000 pages can be generated by a single template file and 10,000 rows in a SQLite database table.&lt;/p&gt;
&lt;p&gt;This also makes for a faster iteration cycle during development: you can edit a template and hit "refresh" to see any page rendered by the new template instantly, without needing to rebuild any pages.&lt;/p&gt;
&lt;h4 id="give-this-a-go"&gt;Want to give this a go?&lt;/h4&gt;
&lt;p&gt;If you want to give the Baked Data pattern a try, I recommend starting out using the combination of &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt;, &lt;a href="https://docs.github.com/en/actions"&gt;GitHub Actions&lt;/a&gt; and &lt;a href="https://vercel.com/"&gt;Vercel&lt;/a&gt;. Hopefully the examples I've provided above are a good starting point - also feel free to reach out to me &lt;a href="https://twitter.com/"&gt;on Twitter&lt;/a&gt; or in &lt;a href="https://github.com/simonw/datasette/discussions"&gt;the Datasette Discussions forum&lt;/a&gt; with any questions.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/definitions"&gt;definitions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/design-patterns"&gt;design-patterns&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/static-generator"&gt;static-generator&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/baked-data"&gt;baked-data&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="definitions"/><category term="design-patterns"/><category term="sqlite"/><category term="static-generator"/><category term="datasette"/><category term="baked-data"/></entry><entry><title>Building a search engine for datasette.io</title><link href="https://simonwillison.net/2020/Dec/19/dogsheep-beta/#atom-tag" rel="alternate"/><published>2020-12-19T18:12:31+00:00</published><updated>2020-12-19T18:12:31+00:00</updated><id>https://simonwillison.net/2020/Dec/19/dogsheep-beta/#atom-tag</id><summary type="html">
    &lt;p&gt;This week I added &lt;a href="https://datasette.io/-/beta"&gt;a search engine&lt;/a&gt; to &lt;a href="https://datasette.io/"&gt;datasette.io&lt;/a&gt;, using the search indexing tool I've been building for &lt;a href="https://datasette.substack.com/p/dogsheep-personal-analytics-with"&gt;Dogsheep&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A screenshot of dogsheep.io search results for ripgrep" src="https://static.simonwillison.net/static/2020/dogsheep-beta-ripgrep.jpg" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;h4&gt;Project search for Datasette&lt;/h4&gt;
&lt;p&gt;The Datasette project has a &lt;em&gt;lot&lt;/em&gt; of constituent parts. There's the project itself and its &lt;a href="https://docs.datasette.io/"&gt;documentation&lt;/a&gt; - 171 pages when exported to PDF and counting. Then there are the &lt;a href="https://datasette.io/plugins"&gt;48 plugins&lt;/a&gt;, &lt;a href="https://datasette.io/tools/sqlite-utils"&gt;sqlite-utils&lt;/a&gt; and &lt;a href="https://datasette.io/tools"&gt;21 more tools&lt;/a&gt; for creating SQLite databases, the &lt;a href="https://dogsheep.github.io/"&gt;Dogsheep&lt;/a&gt; collection and over three years of content I've written about the project &lt;a href="https://simonwillison.net/tags/datasette/"&gt;on my blog&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The new &lt;a href="https://datasette.io/-/beta"&gt;datasette.io search engine&lt;/a&gt; provides a faceted search interface to all of this material in one place. It currently searches across:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Every section of the latest documentation (415 total)&lt;/li&gt;
&lt;li&gt;48 plugin READMEs&lt;/li&gt;
&lt;li&gt;22 tool READMEs&lt;/li&gt;
&lt;li&gt;63 news items posted on the Datasette website&lt;/li&gt;
&lt;li&gt;212 items from my blog&lt;/li&gt;
&lt;li&gt;Release notes from 557 package releases&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I plan to extend it with more data sources in the future.&lt;/p&gt;
&lt;h4&gt;How it works: Dogsheep Beta&lt;/h4&gt;
&lt;p&gt;I'm reusing the search engine I originally built for my Dogsheep personal analytics project (see &lt;a href="https://simonwillison.net/2020/Nov/14/personal-data-warehouses/"&gt;Personal Data Warehouses: Reclaiming Your Data&lt;/a&gt;). I call that search engine &lt;a href="https://github.com/dogsheep/beta"&gt;Dogsheep Beta&lt;/a&gt;. The name is &lt;a href="https://datasette.substack.com/p/dogsheep-personal-analytics-with"&gt;a pun&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;SQLite has great &lt;a href="https://sqlite.org/fts5.html"&gt;full-text search&lt;/a&gt; built in, and I make extensive use of that in Datasette projects already. But out of the box it's not quite right for this kind of search engine that spans multiple different content types.&lt;/p&gt;
&lt;p&gt;The problem is relevance calculation. I wrote about this in &lt;a href="https://simonwillison.net/2019/Jan/7/exploring-search-relevance-algorithms-sqlite/"&gt;Exploring search relevance algorithms with SQLite&lt;/a&gt; - short version: query relevance is calculated using statistics against the whole corpus, so search terms that occur rarely in the overall corpus contribute a higher score than more common terms.&lt;/p&gt;
&lt;p&gt;This means that full-text ranking scores calculated against one table of data cannot be meaningfully compared to scores calculated independently against a separate table, as the corpus statistics used to calculate the rank will differ.&lt;/p&gt;
&lt;p&gt;To get usable scores, you need everything in a single table. That's what Dogsheep Beta does: it creates a new table, called &lt;code&gt;search_index&lt;/code&gt;, and copies searchable content from the other tables into that new table.&lt;/p&gt;
&lt;p&gt;This is analogous to how an external search index like Elasticsearch works: you store your data in the main database, then periodically update an index in Elasticsearch. It's the &lt;a href="https://2017.djangocon.us/talks/the-denormalized-query-engine-design-pattern/"&gt;denormalized query engine&lt;/a&gt; design pattern in action.&lt;/p&gt;
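A minimal sketch of the single-table idea, using Python's bundled sqlite3 and FTS5 (the schema and data here are illustrative, not Dogsheep Beta's actual schema):

```python
import sqlite3

# Copy searchable text from several source tables into one FTS5 table
# so relevance scores are calculated against a single shared corpus.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table releases (id integer primary key, body text);
create table docs (id integer primary key, body text);
create virtual table search_index using fts5(type, key, search_1);
""")
conn.execute("insert into releases values (1, 'datasette plugin release')")
conn.execute("insert into docs values (1, 'installing datasette')")

# One insert per content type populates the shared index
conn.execute("insert into search_index select 'release', id, body from releases")
conn.execute("insert into search_index select 'docs', id, body from docs")

# A single MATCH query now ranks both content types together
rows = conn.execute(
    "select type, key from search_index"
    " where search_index match 'datasette' order by rank"
).fetchall()
print(rows)
```

Because both content types live in the same FTS5 table, `order by rank` produces scores that are directly comparable across them.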
&lt;h4&gt;Configuring Dogsheep Beta&lt;/h4&gt;
&lt;p&gt;There are two components to Dogsheep Beta: a command-line tool for building a search index, and a Datasette plugin for providing an interface for running searches.&lt;/p&gt;
&lt;p&gt;Both of these run off a YAML configuration file, which defines the tables that should be indexed and also defines how those search results should be displayed.&lt;/p&gt;
&lt;p&gt;(Having one configuration file handle both indexing and display feels a little inelegant, but it's extremely productive for iterating on, so I'm letting that slide.)&lt;/p&gt;
&lt;p&gt;Here's the full &lt;a href="https://github.com/simonw/datasette.io/blob/0e3e839ca1efb19cd84100291789521c0a1a3561/templates/dogsheep-beta.yml"&gt;Dogsheep configuration for datasette.io&lt;/a&gt;. An annotated extract:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Index material in the content.db SQLite file&lt;/span&gt;
&lt;span class="pl-ent"&gt;content.db&lt;/span&gt;:
  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Define a search type called 'releases'&lt;/span&gt;
  &lt;span class="pl-ent"&gt;releases&lt;/span&gt;:
    &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Populate that search type by executing this SQL&lt;/span&gt;
    &lt;span class="pl-ent"&gt;sql&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;      select&lt;/span&gt;
&lt;span class="pl-s"&gt;        releases.id as key,&lt;/span&gt;
&lt;span class="pl-s"&gt;        repos.name || ' ' || releases.tag_name as title,&lt;/span&gt;
&lt;span class="pl-s"&gt;        releases.published_at as timestamp,&lt;/span&gt;
&lt;span class="pl-s"&gt;        releases.body as search_1,&lt;/span&gt;
&lt;span class="pl-s"&gt;        1 as is_public&lt;/span&gt;
&lt;span class="pl-s"&gt;      from&lt;/span&gt;
&lt;span class="pl-s"&gt;        releases&lt;/span&gt;
&lt;span class="pl-s"&gt;        join repos on releases.repo = repos.id&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;    &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; When displaying a search result, use this SQL to&lt;/span&gt;
    &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; return extra details about the item&lt;/span&gt;
    &lt;span class="pl-ent"&gt;display_sql&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;      select&lt;/span&gt;
&lt;span class="pl-s"&gt;        -- highlight() is a custom SQL function&lt;/span&gt;
&lt;span class="pl-s"&gt;        highlight(render_markdown(releases.body), :q) as snippet,&lt;/span&gt;
&lt;span class="pl-s"&gt;        html_url&lt;/span&gt;
&lt;span class="pl-s"&gt;      from releases where id = :key&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;    &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Jinja template fragment to display the result&lt;/span&gt;
    &lt;span class="pl-ent"&gt;display&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;      &amp;lt;h3&amp;gt;Release: &amp;lt;a href="{{ display.html_url }}"&amp;gt;{{ title }}&amp;lt;/a&amp;gt;&amp;lt;/h3&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;      &amp;lt;p&amp;gt;{{ display.snippet|safe }}&amp;lt;/p&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;      &amp;lt;p&amp;gt;&amp;lt;small&amp;gt;Released {{ timestamp }}&amp;lt;/small&amp;gt;&amp;lt;/p&amp;gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The core pattern here is the &lt;code&gt;sql:&lt;/code&gt; key, which defines a SQL query that must return the following columns:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;key&lt;/code&gt; - a unique identifier for this search item&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;title&lt;/code&gt; - a title for this indexed document&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;timestamp&lt;/code&gt; - a timestamp for when it was created. May be null.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;search_1&lt;/code&gt; - text to be searched. I may add support for &lt;code&gt;search_2&lt;/code&gt; and &lt;code&gt;search_3&lt;/code&gt; later on to store text that will be treated with a lower relevance score.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;is_public&lt;/code&gt; - whether this should be considered "public" data. This is a holdover from Dogsheep Beta's application for personal analytics; I don't actually need it for datasette.io.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To create an index, run the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;dogsheep-beta index dogsheep-index.db dogsheep-config.yml
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;index&lt;/code&gt; command will loop through every configured search type in the YAML file, execute the SQL query and use it to populate a &lt;code&gt;search_index&lt;/code&gt; table in the &lt;code&gt;dogsheep-index.db&lt;/code&gt; SQLite database file.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://datasette.io/dogsheep-index/search_index"&gt;the search_index table&lt;/a&gt; for &lt;a href="https://datasette.io/"&gt;datasette.io&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;When you run a search, the plugin queries that table and gets back results sorted by relevance (or other sort criteria, if specified).&lt;/p&gt;
&lt;p&gt;To display the results, it loops through each one and uses the Jinja template fragment from the configuration file to turn it into HTML.&lt;/p&gt;
&lt;p&gt;If a &lt;code&gt;display_sql:&lt;/code&gt; query is defined, that query will be executed for each result to populate the &lt;code&gt;{{ display }}&lt;/code&gt; object made available to the template. &lt;a href="https://www.sqlite.org/np1queryprob.html"&gt;Many Small Queries Are Efficient In SQLite&lt;/a&gt;.&lt;/p&gt;
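As a sketch (table and column names invented for illustration), the per-result `display_sql` step looks roughly like this:

```python
import sqlite3

# Illustrative sketch of the display_sql pattern: run one small query
# per search result to fetch extra fields for display.
conn = sqlite3.connect(":memory:")
conn.execute("create table releases (id integer primary key, html_url text, body text)")
conn.execute("insert into releases values (1, 'https://example.com/r1', 'First release')")

search_results = [{"key": 1, "title": "demo 0.1"}]
display_sql = "select html_url, body from releases where id = :key"

for result in search_results:
    row = conn.execute(display_sql, {"key": result["key"]}).fetchone()
    # These columns become the {{ display }} object in the template
    result["display"] = {"html_url": row[0], "body": row[1]}
```

One query per result would be a performance problem against a networked database server, but against a local SQLite file each query is just a function call.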
&lt;h4&gt;Search term highlighting&lt;/h4&gt;
&lt;p&gt;I spent &lt;a href="https://github.com/simonw/datasette.io/issues/49"&gt;a bit of time&lt;/a&gt; thinking about search highlighting. SQLite has an implementation of highlighting built in - &lt;a href="https://sqlite.org/fts5.html#the_snippet_function"&gt;the snippet() function&lt;/a&gt; - but it's not designed to be HTML-aware so there's a risk it might mangle HTML by adding highlighting marks in the middle of a tag or attribute.&lt;/p&gt;
&lt;p&gt;I ended up borrowing a BSD-licensed &lt;a href="https://github.com/django-haystack/django-haystack/blob/v3.0/haystack/utils/highlighting.py"&gt;highlighting class&lt;/a&gt; from the &lt;a href="https://github.com/django-haystack/django-haystack"&gt;django-haystack&lt;/a&gt; project. It deals with HTML by stripping tags, which seems to be more-or-less what Google do for their own search results, so I figured that's good enough for me.&lt;/p&gt;
&lt;p&gt;I used this &lt;a href="https://github.com/simonw/datasette.io/blob/0e3e839ca1efb19cd84100291789521c0a1a3561/plugins/sql_functions.py"&gt;one-off site plugin&lt;/a&gt; to wrap the highlighting code in a custom SQLite function. This meant I could call it from the &lt;code&gt;display_sql:&lt;/code&gt; query in the Dogsheep Beta YAML configuration.&lt;/p&gt;
&lt;p&gt;A custom template tag would be more elegant, but I don't yet have a way to expose custom template tags to the Dogsheep Beta rendering mechanism.&lt;/p&gt;
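The strip-then-highlight idea can be sketched in a few lines of Python (this is the general approach, not the actual django-haystack class):

```python
import re

def highlight(html, query):
    """Strip tags first so highlight markers can't land inside a tag or attribute."""
    text = re.sub(r"<[^>]+>", "", html)
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    return pattern.sub(lambda m: "<strong>" + m.group(0) + "</strong>", text)

print(highlight("<p>Search with <em>ripgrep</em> is fast</p>", "ripgrep"))
# → Search with <strong>ripgrep</strong> is fast
```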
&lt;h4&gt;Build, index, deploy&lt;/h4&gt;
&lt;p&gt;The Datasette website implements the Baked Data pattern, where the content is compiled into SQLite database files and bundled with the application code itself as part of the deploy.&lt;/p&gt;
&lt;p&gt;Building the index is just another step of that process.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://github.com/simonw/datasette.io/blob/0e3e839ca1efb19cd84100291789521c0a1a3561/.github/workflows/deploy.yml"&gt;the deploy.yml&lt;/a&gt; GitHub workflow used by the site. It roughly does the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Download the current version of the &lt;a href="https://datasette.io/content"&gt;content.db&lt;/a&gt; database file. This is so it doesn't have to re-fetch release and README content that was previously stored there.&lt;/li&gt;
&lt;li&gt;Download the current version of &lt;a href="https://datasette.io/blog"&gt;blog.db&lt;/a&gt;, with entries from my blog. This means I don't have to fetch all entries, just the new ones.&lt;/li&gt;
&lt;li&gt;Run &lt;a href="https://github.com/simonw/datasette.io/blob/0e3e839ca1efb19cd84100291789521c0a1a3561/build_directory.py"&gt;build_directory.py&lt;/a&gt;, the script which fetches data for the plugins and tools pages.
&lt;ul&gt;
&lt;li&gt;This hits the GitHub GraphQL API to find new repositories tagged &lt;code&gt;datasette-io&lt;/code&gt; and &lt;code&gt;datasette-plugin&lt;/code&gt; and &lt;code&gt;datasette-tool&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;That GraphQL query also returns the most recent release. The script then checks to see if those releases have previously been fetched and, if not, uses &lt;a href="https://datasette.io/tools/github-to-sqlite"&gt;github-to-sqlite&lt;/a&gt; to fetch them.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Imports the data from &lt;a href="https://github.com/simonw/datasette.io/blob/0e3e839ca1efb19cd84100291789521c0a1a3561/news.yaml"&gt;news.yaml&lt;/a&gt; into a &lt;code&gt;news&lt;/code&gt; table using &lt;a href="https://datasette.io/tools/yaml-to-sqlite"&gt;yaml-to-sqlite&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Imports the latest PyPI download statistics for my packages from my &lt;a href="https://github.com/simonw/package-stats"&gt;simonw/package-stats&lt;/a&gt; repository, which implements &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;git scraping&lt;/a&gt; against the most excellent &lt;a href="https://pypistats.org/"&gt;pypistats.org&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Runs the &lt;code&gt;dogsheep-beta index&lt;/code&gt; command to build a &lt;code&gt;dogsheep-index.db&lt;/code&gt; search index.&lt;/li&gt;
&lt;li&gt;Runs some soundness checks, e.g. &lt;code&gt;datasette . --get "/plugins"&lt;/code&gt;, to verify that Datasette is likely to return a 200 status code for critical pages once published.&lt;/li&gt;
&lt;li&gt;Uses &lt;code&gt;datasette publish cloudrun&lt;/code&gt; to deploy the results to Google Cloud Run, which hosts the website.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I love building websites this way. You can have as much complexity as you like in the build script (my TIL website build script &lt;a href="https://simonwillison.net/2020/Sep/3/weeknotes-airtable-screenshots-dogsheep/#weeknotes-2020-09-03-social-media-cards-tils"&gt;generates screenshots using Puppeteer&lt;/a&gt;) but the end result is some simple database files running on inexpensive, immutable, scalable hosting.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/search"&gt;search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dogsheep"&gt;dogsheep&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cloudrun"&gt;cloudrun&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/baked-data"&gt;baked-data&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="projects"/><category term="search"/><category term="sqlite"/><category term="datasette"/><category term="dogsheep"/><category term="weeknotes"/><category term="cloudrun"/><category term="baked-data"/></entry><entry><title>datasette.io, an official project website for Datasette</title><link href="https://simonwillison.net/2020/Dec/13/datasette-io/#atom-tag" rel="alternate"/><published>2020-12-13T08:34:44+00:00</published><updated>2020-12-13T08:34:44+00:00</updated><id>https://simonwillison.net/2020/Dec/13/datasette-io/#atom-tag</id><summary type="html">
    &lt;p&gt;This week I launched &lt;a href="https://datasette.io/"&gt;datasette.io&lt;/a&gt; - the new official project website for Datasette.&lt;/p&gt;
&lt;p&gt;Datasette's first open source release was &lt;a href="https://simonwillison.net/2017/Nov/13/datasette/"&gt;just over three years ago&lt;/a&gt;, but until now the official site duties have been split between the &lt;a href="https://github.com/simonw/datasette"&gt;GitHub repository&lt;/a&gt; and &lt;a href="https://docs.datasette.io/"&gt;the documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img style="max-width: 100%" alt="A screenshot of datasette.io" src="https://static.simonwillison.net/static/2020/datasette-io.png" /&gt;&lt;/p&gt;

&lt;h4&gt;The Baked Data architectural pattern&lt;/h4&gt;
&lt;p&gt;The site itself is built on Datasette (&lt;a href="https://github.com/simonw/datasette.io"&gt;source code here&lt;/a&gt;). I'm using a pattern that I &lt;a href="https://simonwillison.net/2019/Oct/28/niche-museums-kepler/"&gt;first started exploring&lt;/a&gt; with &lt;a href="https://www.niche-museums.com/"&gt;Niche Museums&lt;/a&gt;: most of the site content lives in a SQLite database, and I use custom Jinja templates to implement the site's different pages.&lt;/p&gt;
&lt;p&gt;This is effectively a variant of the static site generator pattern. The SQLite database is built by scripts as part of the deploy process, then deployed to Google Cloud Run as a binary asset bundled with the templates and Datasette itself.&lt;/p&gt;
&lt;p&gt;I call this the &lt;strong&gt;Baked Data&lt;/strong&gt; architectural pattern - with credit to Kevin Marks for &lt;a href="https://chat.indieweb.org/dev/2020-11-15#t1605478365993400"&gt;helping me&lt;/a&gt; coin the right term. You bake the data into the application.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Update: I wrote more about this in July 2021: &lt;a href="https://simonwillison.net/2021/Jul/28/baked-data/"&gt;The Baked Data architectural pattern&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;It's comparable to static site generation because everything is immutable, which greatly reduces the amount of things that can go wrong - and any content changes require a fresh deploy. It's extremely easy to scale - just run more copies of the application with the bundled copy of the database. Cloud Run and other serverless providers handle that kind of scaling automatically.&lt;/p&gt;
&lt;p&gt;Unlike static site generation, if a site has a thousand pages you don't need to build a thousand HTML pages in order to deploy. A single template and a SQL query that incorporates arguments from the URL can serve as many pages as there are records in the database.&lt;/p&gt;
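To illustrate the point (with an invented table and template, not the site's actual code), one template plus one parameterized query can stand in for an arbitrary number of pages:

```python
import sqlite3

# Toy sketch of Baked Data page serving: the URL argument feeds a SQL
# query, so adding a row adds a page with no per-page build step.
conn = sqlite3.connect(":memory:")
conn.execute("create table museums (slug text primary key, name text)")
conn.executemany("insert into museums values (?, ?)", [
    ("bigfoot", "Bigfoot Discovery Museum"),
    ("pez", "Burlingame Museum of PEZ Memorabilia"),
])
TEMPLATE = "<h1>{name}</h1>"

def page_for(slug):
    row = conn.execute(
        "select name from museums where slug = ?", (slug,)
    ).fetchone()
    return TEMPLATE.format(name=row[0]) if row else None
```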
&lt;h4&gt;How the site is built&lt;/h4&gt;
&lt;p&gt;You can browse the site's underlying database tables in Datasette &lt;a href="https://datasette.io/content"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://datasette.io/content/news"&gt;news table&lt;/a&gt; powers the latest news on the homepage and &lt;a href="https://datasette.io/news"&gt;/news&lt;/a&gt;. News lives in a &lt;a href="https://github.com/simonw/datasette.io/blob/main/news.yaml"&gt;news.yaml file&lt;/a&gt; in the site's GitHub repository. I wrote &lt;a href="https://gist.github.com/simonw/6a59833eee83bec1f1317c7f80406275"&gt;a script&lt;/a&gt; to import the news that had been accumulating in &lt;a href="https://github.com/simonw/datasette/blob/0.52.5/README.md"&gt;the 0.52 README&lt;/a&gt; - now that news has moved to the site the README is a lot more slim!&lt;/p&gt;
&lt;p&gt;At build time my &lt;a href="https://github.com/simonw/yaml-to-sqlite"&gt;yaml-to-sqlite&lt;/a&gt; script runs to load that news content into a database table.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/simonw/datasette.io/blob/72e3b6a470d2543dede13d51a61a22180b5c96f9/templates/index.html#L78-L85"&gt;index.html&lt;/a&gt; template then uses the following Jinja code to output the latest news stories, using the &lt;code&gt;sql()&lt;/code&gt; function from the &lt;a href="https://github.com/simonw/datasette-template-sql"&gt;datasette-template-sql&lt;/a&gt; Datasette plugin:&lt;/p&gt;
&lt;div class="highlight highlight-text-html-django"&gt;&lt;pre&gt;&lt;span class="pl-e"&gt;{%&lt;/span&gt; &lt;span class="pl-s"&gt;set&lt;/span&gt; &lt;span class="pl-s"&gt;ns&lt;/span&gt; = &lt;span class="pl-s"&gt;namespace&lt;/span&gt;(&lt;span class="pl-s"&gt;current_date&lt;/span&gt;=&lt;span class="pl-s"&gt;""&lt;/span&gt;) &lt;span class="pl-e"&gt;%}&lt;/span&gt;
&lt;span class="pl-e"&gt;{%&lt;/span&gt; &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s"&gt;row&lt;/span&gt; &lt;span class="pl-k"&gt;in&lt;/span&gt; &lt;span class="pl-s"&gt;sql&lt;/span&gt;(&lt;span class="pl-s"&gt;"select date, body from news order by date desc limit 15"&lt;/span&gt;, &lt;span class="pl-s"&gt;database&lt;/span&gt;=&lt;span class="pl-s"&gt;"content"&lt;/span&gt;) &lt;span class="pl-e"&gt;%}&lt;/span&gt;
    &lt;span class="pl-e"&gt;{%&lt;/span&gt; &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-s"&gt;prettydate&lt;/span&gt;(&lt;span class="pl-s"&gt;row&lt;/span&gt;[&lt;span class="pl-s"&gt;"date"&lt;/span&gt;]) != (&lt;span class="pl-s"&gt;ns&lt;/span&gt;.&lt;span class="pl-s"&gt;current_date&lt;/span&gt; &lt;span class="pl-k"&gt;and&lt;/span&gt; &lt;span class="pl-s"&gt;prettydate&lt;/span&gt;(&lt;span class="pl-s"&gt;ns&lt;/span&gt;.&lt;span class="pl-s"&gt;current_date&lt;/span&gt;)) &lt;span class="pl-e"&gt;%}&lt;/span&gt;
    &amp;lt;&lt;span class="pl-ent"&gt;h3&lt;/span&gt;&amp;gt;{{ prettydate(row["date&lt;span class="pl-smi"&gt;"]&lt;/span&gt;) }} &amp;lt;&lt;span class="pl-ent"&gt;a&lt;/span&gt; &lt;span class="pl-e"&gt;href&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;/news/{{ row[&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-e"&gt;date&lt;/span&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;] }}&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;style&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-s1"&gt;&lt;span class="pl-c1"&gt;&lt;span class="pl-c1"&gt;font-size&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;0.8&lt;span class="pl-k"&gt;em&lt;/span&gt;&lt;/span&gt;; &lt;span class="pl-c1"&gt;&lt;span class="pl-c1"&gt;opacity&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;0.4&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;#&amp;lt;/&lt;span class="pl-ent"&gt;a&lt;/span&gt;&amp;gt;&amp;lt;/&lt;span class="pl-ent"&gt;h3&lt;/span&gt;&amp;gt;
    &lt;span class="pl-e"&gt;{%&lt;/span&gt; &lt;span class="pl-s"&gt;set&lt;/span&gt; &lt;span class="pl-s"&gt;ns&lt;/span&gt;.&lt;span class="pl-s"&gt;current_date&lt;/span&gt; = &lt;span class="pl-s"&gt;prettydate&lt;/span&gt;(&lt;span class="pl-s"&gt;row&lt;/span&gt;[&lt;span class="pl-s"&gt;"date"&lt;/span&gt;]) &lt;span class="pl-e"&gt;%}&lt;/span&gt;
    &lt;span class="pl-e"&gt;{%&lt;/span&gt; &lt;span class="pl-k"&gt;endif&lt;/span&gt; &lt;span class="pl-e"&gt;%}&lt;/span&gt;
    {{ render_markdown(row["body&lt;span class="pl-smi"&gt;"]&lt;/span&gt;) }}
&lt;span class="pl-e"&gt;{%&lt;/span&gt; &lt;span class="pl-k"&gt;endfor&lt;/span&gt; &lt;span class="pl-e"&gt;%}&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;code&gt;prettydate()&lt;/code&gt; is a custom function I wrote in &lt;a href="https://github.com/simonw/datasette.io/blob/353f1f940cb0cb3c38d3bdf6e345328740990702/plugins/dateformat.py"&gt;a one-off plugin&lt;/a&gt; for the site. The &lt;code&gt;namespace()&lt;/code&gt; stuff is a Jinja trick that lets me keep track of the current date heading in the loop, so I can output a new date heading only if the news item occurs on a different day from the previous one.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;render_markdown()&lt;/code&gt; is provided by the &lt;a href="https://github.com/simonw/datasette-render-markdown"&gt;datasette-render-markdown&lt;/a&gt; plugin.&lt;/p&gt;
&lt;p&gt;I wanted permalinks for news stories, but since they don't have identifiers or titles I decided to provide a page for each day instead - for example &lt;a href="https://datasette.io/news/2020-12-10"&gt;https://datasette.io/news/2020-12-10&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;These pages are implemented using &lt;a href="https://simonwillison.net/2020/Sep/15/datasette-0-49/#path-parameters-custom-page-templates"&gt;Path parameters for custom page templates&lt;/a&gt;, introduced in Datasette 0.49. The implementation is a single template file at &lt;a href="https://github.com/simonw/datasette.io/blob/353f1f940cb0cb3c38d3bdf6e345328740990702/templates/pages/news/%7Byyyy%7D-%7Bmm%7D-%7Bdd%7D.html"&gt;templates/pages/news/{yyyy}-{mm}-{dd}.html&lt;/a&gt;, the full contents of which is:&lt;/p&gt;
&lt;div class="highlight highlight-text-html-django"&gt;&lt;pre&gt;&lt;span class="pl-e"&gt;{%&lt;/span&gt; &lt;span class="pl-k"&gt;extends&lt;/span&gt; &lt;span class="pl-s"&gt;"page_base.html"&lt;/span&gt; &lt;span class="pl-e"&gt;%}&lt;/span&gt;

&lt;span class="pl-e"&gt;{%&lt;/span&gt; &lt;span class="pl-k"&gt;block&lt;/span&gt; &lt;span class="pl-s"&gt;title&lt;/span&gt; &lt;span class="pl-e"&gt;%}&lt;/span&gt;Datasette News: {{ prettydate(yyyy + "-" + mm + "-" + dd) }}&lt;span class="pl-e"&gt;{%&lt;/span&gt; &lt;span class="pl-k"&gt;endblock&lt;/span&gt; &lt;span class="pl-e"&gt;%}&lt;/span&gt;

&lt;span class="pl-e"&gt;{%&lt;/span&gt; &lt;span class="pl-k"&gt;block&lt;/span&gt; &lt;span class="pl-s"&gt;content&lt;/span&gt; &lt;span class="pl-e"&gt;%}&lt;/span&gt;

&lt;span class="pl-e"&gt;{%&lt;/span&gt; &lt;span class="pl-s"&gt;set&lt;/span&gt; &lt;span class="pl-s"&gt;stories&lt;/span&gt; = &lt;span class="pl-s"&gt;sql&lt;/span&gt;(&lt;span class="pl-s"&gt;"select date, body from news where date = ? order by date desc"&lt;/span&gt;, [&lt;span class="pl-s"&gt;yyyy&lt;/span&gt; + &lt;span class="pl-s"&gt;"-"&lt;/span&gt; + &lt;span class="pl-s"&gt;mm&lt;/span&gt; + &lt;span class="pl-s"&gt;"-"&lt;/span&gt; + &lt;span class="pl-s"&gt;dd&lt;/span&gt;], &lt;span class="pl-s"&gt;database&lt;/span&gt;=&lt;span class="pl-s"&gt;"content"&lt;/span&gt;) &lt;span class="pl-e"&gt;%}&lt;/span&gt;
&lt;span class="pl-e"&gt;{%&lt;/span&gt; &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-k"&gt;not&lt;/span&gt; &lt;span class="pl-s"&gt;stories&lt;/span&gt; &lt;span class="pl-e"&gt;%}&lt;/span&gt;
    {{ raise_404("News not found") }}
&lt;span class="pl-e"&gt;{%&lt;/span&gt; &lt;span class="pl-k"&gt;endif&lt;/span&gt; &lt;span class="pl-e"&gt;%}&lt;/span&gt;
&amp;lt;&lt;span class="pl-ent"&gt;h1&lt;/span&gt;&amp;gt;&amp;lt;&lt;span class="pl-ent"&gt;a&lt;/span&gt; &lt;span class="pl-e"&gt;href&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;/news&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;News&amp;lt;/&lt;span class="pl-ent"&gt;a&lt;/span&gt;&amp;gt;: {{ prettydate(yyyy + "-" + mm + "-" + dd) }}&amp;lt;/&lt;span class="pl-ent"&gt;h1&lt;/span&gt;&amp;gt;

&lt;span class="pl-e"&gt;{%&lt;/span&gt; &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s"&gt;row&lt;/span&gt; &lt;span class="pl-k"&gt;in&lt;/span&gt; &lt;span class="pl-s"&gt;stories&lt;/span&gt; &lt;span class="pl-e"&gt;%}&lt;/span&gt;
    {{ render_markdown(row["body&lt;span class="pl-smi"&gt;"]&lt;/span&gt;) }}
&lt;span class="pl-e"&gt;{%&lt;/span&gt; &lt;span class="pl-k"&gt;endfor&lt;/span&gt; &lt;span class="pl-e"&gt;%}&lt;/span&gt;

&lt;span class="pl-e"&gt;{%&lt;/span&gt; &lt;span class="pl-k"&gt;endblock&lt;/span&gt; &lt;span class="pl-e"&gt;%}&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The crucial trick here is that, because the filename is &lt;code&gt;news/{yyyy}-{mm}-{dd}.html&lt;/code&gt;, a request to &lt;code&gt;/news/2020-12-10&lt;/code&gt; will render that template with the &lt;code&gt;yyyy&lt;/code&gt;, &lt;code&gt;mm&lt;/code&gt; and &lt;code&gt;dd&lt;/code&gt; template variables set to those values from the URL.&lt;/p&gt;
&lt;p&gt;It can then execute a SQL query that incorporates those values. It assigns the results to a &lt;code&gt;stories&lt;/code&gt; variable, then checks that at least one story was returned - if not, it raises a 404 error.&lt;/p&gt;
&lt;p&gt;See Datasette's &lt;a href="https://docs.datasette.io/en/stable/custom_templates.html#custom-pages"&gt;custom pages documentation&lt;/a&gt; for more details on how this all works.&lt;/p&gt;
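A rough sketch of how a filename like that could be turned into a URL pattern (Datasette's real routing is more involved than this):

```python
import re

def pattern_from_filename(filename):
    # Turn each {name} in the template filename into a named capture group
    regex = re.sub(r"\{(\w+)\}", r"(?P<\1>[^/-]+)", filename.removesuffix(".html"))
    return re.compile("^" + regex + "$")

pattern = pattern_from_filename("news/{yyyy}-{mm}-{dd}.html")
print(pattern.match("news/2020-12-10").groupdict())
# → {'yyyy': '2020', 'mm': '12', 'dd': '10'}
```

The captured groups are what end up as the `yyyy`, `mm` and `dd` variables available to the template.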
&lt;p&gt;The site also offers an &lt;a href="https://datasette.io/content/feed.atom"&gt;Atom feed&lt;/a&gt; of recent news. This is powered by the &lt;a href="https://github.com/simonw/datasette-atom"&gt;datasette-atom&lt;/a&gt; using the output of &lt;a href="https://datasette.io/content/feed"&gt;this canned SQL query&lt;/a&gt;, with a &lt;code&gt;render_markdown()&lt;/code&gt; SQL function provided by &lt;a href="https://github.com/simonw/datasette.io/blob/353f1f940cb0cb3c38d3bdf6e345328740990702/plugins/sql_functions.py"&gt;this site plugin&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;The plugin directory&lt;/h4&gt;
&lt;p&gt;One of the features I'm most excited about on the site is the new &lt;a href="https://datasette.io/plugins"&gt;Datasette plugin directory&lt;/a&gt;. Datasette has over 50 plugins now and I've been wanting a definitive directory of them for a while.&lt;/p&gt;
&lt;p&gt;It's pretty basic at the moment, offering a list of plugins plus simple &lt;code&gt;LIKE&lt;/code&gt; based search, but I plan to expand it a great deal in the future.&lt;/p&gt;
&lt;p&gt;The fun part is where the data comes from. For a couple of years now I've been using GitHub topics to tag my plugins - I tag them with &lt;code&gt;datasette-plugin&lt;/code&gt;, and the ones that I planned to feature on the site when I finally launched it were also tagged with &lt;code&gt;datasette-io&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;datasette.io&lt;/code&gt; deployment process runs a script called &lt;a href="https://github.com/simonw/datasette.io/blob/353f1f940cb0cb3c38d3bdf6e345328740990702/build_plugin_directory.py"&gt;build_plugin_directory.py&lt;/a&gt;, which uses a GraphQL query against the GitHub search API to find all repositories belonging to me that have been tagged with those tags.&lt;/p&gt;
&lt;p&gt;That GraphQL query looks like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-graphql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;query&lt;/span&gt; {
  &lt;span class="pl-v"&gt;search&lt;/span&gt;(&lt;span class="pl-v"&gt;query&lt;/span&gt;:&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;topic:datasette-io topic:datasette-plugin user:simonw&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-v"&gt;type&lt;/span&gt;:&lt;span class="pl-c1"&gt;REPOSITORY&lt;/span&gt;, &lt;span class="pl-v"&gt;first&lt;/span&gt;:&lt;span class="pl-c1"&gt;100&lt;/span&gt;) {
    &lt;span class="pl-v"&gt;repositoryCount&lt;/span&gt;
    &lt;span class="pl-v"&gt;nodes&lt;/span&gt; {
      &lt;span class="pl-k"&gt;...&lt;/span&gt; &lt;span class="pl-k"&gt;on&lt;/span&gt; &lt;span class="pl-c1"&gt;Repository&lt;/span&gt; {
        &lt;span class="pl-v"&gt;id&lt;/span&gt;
        &lt;span class="pl-v"&gt;nameWithOwner&lt;/span&gt;
        &lt;span class="pl-v"&gt;openGraphImageUrl&lt;/span&gt;
        &lt;span class="pl-v"&gt;usesCustomOpenGraphImage&lt;/span&gt;
        &lt;span class="pl-v"&gt;repositoryTopics&lt;/span&gt;(&lt;span class="pl-v"&gt;first&lt;/span&gt;:&lt;span class="pl-c1"&gt;100&lt;/span&gt;) {
          &lt;span class="pl-v"&gt;totalCount&lt;/span&gt;
          &lt;span class="pl-v"&gt;nodes&lt;/span&gt; {
            &lt;span class="pl-v"&gt;topic&lt;/span&gt; {
              &lt;span class="pl-v"&gt;name&lt;/span&gt;
            }
          }
        }
        &lt;span class="pl-s"&gt;openIssueCount&lt;/span&gt;: &lt;span class="pl-v"&gt;issues&lt;/span&gt;(&lt;span class="pl-v"&gt;states&lt;/span&gt;:[&lt;span class="pl-c1"&gt;OPEN&lt;/span&gt;]) {
          &lt;span class="pl-v"&gt;totalCount&lt;/span&gt;
        }
        &lt;span class="pl-s"&gt;closedIssueCount&lt;/span&gt;: &lt;span class="pl-v"&gt;issues&lt;/span&gt;(&lt;span class="pl-v"&gt;states&lt;/span&gt;:[&lt;span class="pl-c1"&gt;CLOSED&lt;/span&gt;]) {
          &lt;span class="pl-v"&gt;totalCount&lt;/span&gt;
        }
        &lt;span class="pl-v"&gt;releases&lt;/span&gt;(&lt;span class="pl-v"&gt;last&lt;/span&gt;: &lt;span class="pl-c1"&gt;1&lt;/span&gt;) {
          &lt;span class="pl-v"&gt;totalCount&lt;/span&gt;
          &lt;span class="pl-v"&gt;nodes&lt;/span&gt; {
            &lt;span class="pl-v"&gt;tagName&lt;/span&gt;
          }
        }
      }
    }
  }
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It fetches the name of each repository, the &lt;code&gt;openGraphImageUrl&lt;/code&gt; (which doesn't appear to be included in the regular GitHub REST API), the number of open and closed issues and details of the most recent release.&lt;/p&gt;
&lt;p&gt;The script has access to a copy of the current site database, which is downloaded on each deploy by the build script. It uses this to check if any of the repositories have new releases that haven't previously been seen by the script.&lt;/p&gt;
&lt;p&gt;Then it runs the &lt;code&gt;github-to-sqlite releases&lt;/code&gt; command (part of &lt;a href="https://github.com/dogsheep/github-to-sqlite"&gt;github-to-sqlite&lt;/a&gt;) to fetch details of those new releases.&lt;/p&gt;
&lt;p&gt;The end result is a database of repositories and releases for all of my tagged plugins. The plugin directory is then built against a &lt;a href="https://datasette.io/content/plugins"&gt;custom SQL view&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;Other site content&lt;/h4&gt;
&lt;p&gt;The rest of the site content is mainly static template files. I use the &lt;code&gt;render_markdown()&lt;/code&gt; function inline in some of them so I can author in Markdown rather than HTML - here's &lt;a href="https://github.com/simonw/datasette.io/blob/main/templates/pages/examples.html"&gt;the template&lt;/a&gt; for the &lt;a href="https://datasette.io/examples"&gt;/examples page&lt;/a&gt;. The various &lt;a href="https://datasette.io/for"&gt;Use cases for Datasette&lt;/a&gt; pages are likewise built as static templates.&lt;/p&gt;
&lt;h4 id="sqlite-utils-analyze-tables"&gt;Also this week: sqlite-utils analyze-tables&lt;/h4&gt;
&lt;p&gt;My other big project this week has involved building out a Datasette instance for a client. I'm working with over 5,000,000 rows of CSV data for this, which has been a great opportunity to push the limits of some of my tools.&lt;/p&gt;
&lt;p&gt;Any time I'm working with new data I like to get a feel for its general shape. Having imported 5,000,000 rows with dozens of columns into a database, what can I learn about the columns beyond just browsing them in Datasette?&lt;/p&gt;
&lt;p&gt;&lt;code&gt;sqlite-utils analyze-tables&lt;/code&gt; (&lt;a href="https://sqlite-utils.readthedocs.io/en/stable/cli.html#analyzing-tables"&gt;documented here&lt;/a&gt;) is my new tool for doing just that. It loops through every table and every column in the database, and for each column it calculates statistics that include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The total number of distinct values&lt;/li&gt;
&lt;li&gt;The total number of null or blank values&lt;/li&gt;
&lt;li&gt;For non-distinct columns, the 10 most common and 10 least common values&lt;/li&gt;
&lt;/ul&gt;
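&lt;p&gt;A minimal sketch of those per-column statistics - illustrative code only, not the actual &lt;code&gt;sqlite-utils&lt;/code&gt; implementation, which works differently - looks something like this:&lt;/p&gt;

```python
import sqlite3
from collections import Counter


def analyze_column(conn, table, column, common_limit=10):
    # Sketch of per-column statistics in the spirit of
    # `sqlite-utils analyze-tables`: count distinct values, count
    # null/blank values, and find the most and least common values.
    values = [row[0] for row in conn.execute(f"select [{column}] from [{table}]")]
    counts = Counter(values)
    ordered = counts.most_common()  # sorted most common first
    return {
        "total_rows": len(values),
        "num_distinct": len(counts),
        "num_null_or_blank": sum(1 for v in values if v is None or v == ""),
        "most_common": ordered[:common_limit],
        "least_common": ordered[-common_limit:][::-1],
    }
```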
&lt;p&gt;It can output those to the terminal, or if you add the &lt;code&gt;--save&lt;/code&gt; option it will also save them to a SQLite table called &lt;code&gt;_analyze_tables_&lt;/code&gt; - here's &lt;a href="https://github-to-sqlite.dogsheep.net/github/_analyze_tables_"&gt;that table&lt;/a&gt; for my github-to-sqlite demo instance.&lt;/p&gt;
&lt;p&gt;I can then use the output of the tool to figure out which columns might be a primary key, or which ones warrant being extracted out into a separate lookup table using &lt;a href="https://simonwillison.net/2020/Sep/23/sqlite-utils-extract/"&gt;sqlite-utils extract&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I expect I'll be expanding this feature a lot in the future, but I'm already finding it to be really helpful.&lt;/p&gt;
&lt;h4&gt;Datasette 0.53&lt;/h4&gt;
&lt;p&gt;I pushed out a small feature release of Datasette to accompany the new project website. Quoting &lt;a href="https://docs.datasette.io/en/stable/changelog.html#v0-53"&gt;the release notes&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;New &lt;code&gt;?column__arraynotcontains=&lt;/code&gt; table filter. (&lt;a href="https://github.com/simonw/datasette/issues/1132"&gt;#1132&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;datasette serve&lt;/code&gt; has a new &lt;code&gt;--create&lt;/code&gt; option, which will create blank database files if they do not already exist rather than exiting with an error. (&lt;a href="https://github.com/simonw/datasette/issues/1135"&gt;#1135&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;New &lt;code&gt;?_header=off&lt;/code&gt; option for CSV export which omits the CSV header row, &lt;a href="https://docs.datasette.io/en/stable/csv_export.html#csv-export-url-parameters"&gt;documented here&lt;/a&gt;. (&lt;a href="https://github.com/simonw/datasette/issues/1133"&gt;#1133&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;"Powered by Datasette" link in the footer now links to &lt;a href="https://datasette.io/"&gt;https://datasette.io/&lt;/a&gt;. (&lt;a href="https://github.com/simonw/datasette/issues/1138"&gt;#1138&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Project news no longer lives in the README - it can now be found at &lt;a href="https://datasette.io/news"&gt;https://datasette.io/news&lt;/a&gt;. (&lt;a href="https://github.com/simonw/datasette/issues/1137"&gt;#1137&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;h4&gt;Office hours&lt;/h4&gt;

&lt;p&gt;I had my first round of &lt;a href="https://calendly.com/swillison/datasette-office-hours"&gt;Datasette office hours&lt;/a&gt; on Friday - 20-minute video chats with anyone who wants to talk to me about the project. I had five great conversations - it's hard to overstate how thrilling it is to talk to people who are using Datasette to solve problems. If you're an open source maintainer I can thoroughly recommend giving this format a try.&lt;/p&gt;

&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-publish-fly"&gt;datasette-publish-fly&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-publish-fly/releases/tag/1.0.1"&gt;1.0.1&lt;/a&gt; - 2020-12-12&lt;br /&gt;
Datasette plugin for publishing data using Fly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-auth-passwords"&gt;datasette-auth-passwords&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-auth-passwords/releases/tag/0.3.3"&gt;0.3.3&lt;/a&gt; - 2020-12-11&lt;br /&gt;
Datasette plugin for authentication using passwords&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette"&gt;datasette&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette/releases/tag/0.53"&gt;0.53&lt;/a&gt; -  - 2020-12-11&lt;br /&gt;
An open source multi-tool for exploring and publishing data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-column-inspect"&gt;datasette-column-inspect&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-column-inspect/releases/tag/0.2a"&gt;0.2a&lt;/a&gt; - 2020-12-09&lt;br /&gt;
Experimental plugin that adds a column inspector&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-pretty-json"&gt;datasette-pretty-json&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-pretty-json/releases/tag/0.2.1"&gt;0.2.1&lt;/a&gt; - 2020-12-09&lt;br /&gt;
Datasette plugin that pretty-prints any column values that are valid JSON objects or arrays&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/yaml-to-sqlite"&gt;yaml-to-sqlite&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/yaml-to-sqlite/releases/tag/0.3.1"&gt;0.3.1&lt;/a&gt; - 2020-12-07&lt;br /&gt;
Utility for converting YAML files to SQLite&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-seaborn"&gt;datasette-seaborn&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-seaborn/releases/tag/0.2a0"&gt;0.2a0&lt;/a&gt; - 2020-12-07&lt;br /&gt;
Statistical visualizations for Datasette using Seaborn&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/readthedocs/custom-sphinx-templates"&gt;Using custom Sphinx templates on Read the Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/python/style-yaml-dump"&gt;Controlling the style of dumped YAML using PyYAML&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/bash/escaping-sql-for-curl-to-datasette"&gt;Escaping a SQL query to use with curl and Datasette&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/bash/skip-csv-rows-with-odd-numbers"&gt;Skipping CSV rows with odd numbers of quotes using ripgrep&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/baked-data"&gt;baked-data&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="projects"/><category term="datasette"/><category term="weeknotes"/><category term="sqlite-utils"/><category term="baked-data"/></entry><entry><title>datasette-ripgrep: deploy a regular expression search engine for your source code</title><link href="https://simonwillison.net/2020/Nov/28/datasette-ripgrep/#atom-tag" rel="alternate"/><published>2020-11-28T06:51:06+00:00</published><updated>2020-11-28T06:51:06+00:00</updated><id>https://simonwillison.net/2020/Nov/28/datasette-ripgrep/#atom-tag</id><summary type="html">
    &lt;p&gt;This week I built &lt;a href="https://github.com/simonw/datasette-ripgrep"&gt;datasette-ripgrep&lt;/a&gt; - a web application for running regular expression searches against source code, built on top of the amazing &lt;a href="https://github.com/BurntSushi/ripgrep"&gt;ripgrep&lt;/a&gt; command-line tool.&lt;/p&gt;
&lt;h4&gt;datasette-ripgrep demo&lt;/h4&gt;
&lt;p&gt;I've deployed a demo version of the application here:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://ripgrep.datasette.io/-/ripgrep?pattern=pytest"&gt;ripgrep.datasette.io/-/ripgrep?pattern=pytest&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The demo runs searches against the source code of every one of my GitHub repositories that start with &lt;code&gt;datasette&lt;/code&gt; - &lt;a href="https://github-to-sqlite.dogsheep.net/github/repos?name__startswith=datasette&amp;amp;owner__exact=9599"&gt;61 repos&lt;/a&gt; right now - so it should include all of my Datasette plugins plus the core Datasette repository itself.&lt;/p&gt;
&lt;p&gt;Since it's running on top of &lt;code&gt;ripgrep&lt;/code&gt;, it supports regular expressions. This is absurdly useful. Some examples:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Every usage of the &lt;code&gt;.plugin_config(&lt;/code&gt; method: &lt;a href="https://ripgrep.datasette.io/-/ripgrep?pattern=%5C.plugin_config%5C%28"&gt;plugin_config\(&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Everywhere I use &lt;code&gt;async with httpx.AsyncClient&lt;/code&gt; (usually in tests): &lt;a href="https://ripgrep.datasette.io/-/ripgrep?pattern=async+with.*AsyncClient"&gt;async with.*AsyncClient&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;All places where I use a Jinja &lt;code&gt;|&lt;/code&gt; filter inside a variable: &lt;a href="https://ripgrep.datasette.io/-/ripgrep?pattern=%5C%7B%5C%7B.*%5C%7C.*%5C%7D%5C%7D"&gt;\{\{.*\|.*\}\}&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I usually run ripgrep as &lt;code&gt;rg&lt;/code&gt; on the command-line, or use it within Visual Studio Code (&lt;a href="https://twitter.com/simonw/status/1331381448171929600"&gt;fun fact&lt;/a&gt;: the reason VS Code's "Find in Files" is so good is that it's running ripgrep under the hood).&lt;/p&gt;
&lt;p&gt;So why have it as a web application? Because this means I can link to it, bookmark it and use it on my phone.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2020/datasette-ripgrep.png" alt="A screenshot of datasette-ripgrep in action" style="max-width: 100%" /&gt;&lt;/p&gt;
&lt;h4&gt;Why build this?&lt;/h4&gt;
&lt;p&gt;There are plenty of great existing code search tools out there already: I've heard great things about &lt;a href="https://github.com/livegrep/livegrep"&gt;livegrep&lt;/a&gt;, and a quick Google search shows a bunch of other options.&lt;/p&gt;
&lt;p&gt;Aside from being a fun project, &lt;code&gt;datasette-ripgrep&lt;/code&gt; has one key advantage: it gets to benefit from Datasette's publishing mechanism, which means it's really easy to deploy.&lt;/p&gt;
&lt;p&gt;That &lt;a href="https://ripgrep.datasette.io/"&gt;ripgrep.datasette.io&lt;/a&gt; demo is deployed by checking out the source code to be searched into a &lt;code&gt;all&lt;/code&gt; directory and then using the following command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;datasette publish cloudrun \
    --metadata metadata.json \
    --static all:all \
    --install=datasette-ripgrep \
    --service datasette-ripgrep \
    --apt-get-install ripgrep
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;all&lt;/code&gt; is a folder containing the source code to be searched. &lt;code&gt;metadata.json&lt;/code&gt; contains this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;plugins&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: {
        &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;datasette-ripgrep&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: {
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;path&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;/app/all&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;time_limit&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;3.0&lt;/span&gt;
        }
    }
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;That's all there is to it! The result is a deployed code search engine, running on Google Cloud Run.&lt;/p&gt;
&lt;p&gt;(If you want to try this yourself you'll need to be using the just-released Datasette 0.52.)&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/simonw/datasette-ripgrep/blob/main/.github/workflows/deploy_demo.yml"&gt;GitHub Action workflow&lt;/a&gt; that deploys the demo also uses my &lt;a href="https://github.com/dogsheep/github-to-sqlite"&gt;github-to-sqlite&lt;/a&gt; tool to fetch my repos and then shallow-clones the ones that begin with &lt;code&gt;datasette&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;If you have &lt;a href="https://docs.datasette.io/en/stable/publish.html#publishing-to-google-cloud-run"&gt;your own Google Cloud Run credentials&lt;/a&gt;, you can run your own copy of that workflow against your own repositories.&lt;/p&gt;
&lt;h4&gt;A different kind of Datasette plugin&lt;/h4&gt;
&lt;p&gt;Datasette is a tool for publishing SQLite databases, so most Datasette plugins integrate with SQLite in some way.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;datasette-ripgrep&lt;/code&gt; is different: it makes no use of SQLite at all, but instead takes advantage of Datasette's URL routing, &lt;code&gt;datasette publish&lt;/code&gt; deployments and permissions system.&lt;/p&gt;
&lt;p&gt;The plugin implementation is currently &lt;a href="https://github.com/simonw/datasette-ripgrep/blob/07b9ced2935b0b6080c1c42fcaf6ab9e8003d186/datasette_ripgrep/__init__.py"&gt;134 lines of code&lt;/a&gt;, excluding tests and templates.&lt;/p&gt;
&lt;p&gt;While the plugin doesn't use SQLite, it does share a common philosophy with Datasette: the plugin bundles the source code that it is going to search as part of the deployed application, in a similar way to how Datasette usually bundles one or more SQLite database files.&lt;/p&gt;
&lt;p&gt;As such, it's extremely inexpensive to run and can be deployed to serverless hosting. If you need to scale it, you can run more copies.&lt;/p&gt;
&lt;p&gt;This does mean that the application needs to be re-deployed to pick up changes to the searchable code. I'll probably set my demo to do this on a daily basis.&lt;/p&gt;
&lt;h4&gt;Controlling processes from asyncio&lt;/h4&gt;
&lt;p&gt;The trickiest part of the implementation was figuring out how to use Python's &lt;code&gt;asyncio.create_subprocess_exec()&lt;/code&gt; method to safely run the &lt;code&gt;rg&lt;/code&gt; process in response to incoming requests.&lt;/p&gt;
&lt;p&gt;I don't want expensive searches to tie up the server, so I implemented two limits here. The first is a time limit: by default, searches are given one second to run, after which the &lt;code&gt;rg&lt;/code&gt; process is terminated and only the results received so far are returned. This is achieved using the &lt;a href="https://docs.python.org/3/library/asyncio-task.html#asyncio.wait_for"&gt;asyncio.wait_for()&lt;/a&gt; function.&lt;/p&gt;
&lt;p&gt;I also implemented a limit on the number of matching lines that can be returned, defaulting to 2,000. Any more than that and the process is terminated early.&lt;/p&gt;
&lt;p&gt;Both of these limits can be customized using plugin settings (documented in &lt;a href="https://github.com/simonw/datasette-ripgrep/blob/main/README.md"&gt;the README&lt;/a&gt;). You can see how they are implemented in the &lt;a href="https://github.com/simonw/datasette-ripgrep/blob/0.2/datasette_ripgrep/__init__.py#L9-L55"&gt;async def run_ripgrep(pattern, path, time_limit=1.0, max_lines=2000)&lt;/a&gt; function.&lt;/p&gt;
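&lt;p&gt;Stripped of the Datasette plumbing, the combination of those two limits can be sketched like this (illustrative code under the same general approach, not the plugin's actual implementation):&lt;/p&gt;

```python
import asyncio


async def run_with_limits(args, time_limit=1.0, max_lines=2000):
    # Sketch of the two safety limits: terminate the subprocess after
    # time_limit seconds, and stop reading after max_lines lines.
    proc = await asyncio.create_subprocess_exec(
        *args, stdout=asyncio.subprocess.PIPE
    )
    lines = []

    async def read_lines():
        while len(lines) < max_lines:
            line = await proc.stdout.readline()
            if not line:
                break  # EOF - the process finished on its own
            lines.append(line.decode("utf-8", "replace"))

    try:
        await asyncio.wait_for(read_lines(), timeout=time_limit)
    except asyncio.TimeoutError:
        pass  # Deadline hit - return whatever was collected so far
    if proc.returncode is None:
        try:
            proc.terminate()
        except ProcessLookupError:
            pass  # Process already exited
    await proc.wait()
    return lines
```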
&lt;h4&gt;Highlighted linkable line numbers&lt;/h4&gt;
&lt;p&gt;The other fun implementation detail is the way the source code listings are displayed. I'm using CSS to display the line numbers in a way that makes them visible without them breaking copy-and-paste (inspired by &lt;a href="https://www.sylvaindurand.org/using-css-to-add-line-numbering/"&gt;this article by Sylvain Durand&lt;/a&gt;).&lt;/p&gt;
&lt;div class="highlight highlight-source-css"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;code&lt;/span&gt;:&lt;span class="pl-c1"&gt;before&lt;/span&gt; {
    &lt;span class="pl-c1"&gt;content&lt;/span&gt;: &lt;span class="pl-en"&gt;attr&lt;/span&gt;(data-line);
    &lt;span class="pl-c1"&gt;display&lt;/span&gt;: inline-block;
    &lt;span class="pl-c1"&gt;width&lt;/span&gt;: &lt;span class="pl-c1"&gt;3.5&lt;span class="pl-smi"&gt;ch&lt;/span&gt;&lt;/span&gt;;
    &lt;span class="pl-c1"&gt;-webkit-user-select&lt;/span&gt;: none;
    &lt;span class="pl-c1"&gt;color&lt;/span&gt;: &lt;span class="pl-pds"&gt;&lt;span class="pl-kos"&gt;#&lt;/span&gt;666&lt;/span&gt;;
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The HTML looks like this:&lt;/p&gt;
&lt;div class="highlight highlight-text-html-basic"&gt;&lt;pre&gt;&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;pre&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;code&lt;/span&gt; &lt;span class="pl-c1"&gt;id&lt;/span&gt;="&lt;span class="pl-s"&gt;L1&lt;/span&gt;" &lt;span class="pl-c1"&gt;data-line&lt;/span&gt;="&lt;span class="pl-s"&gt;1&lt;/span&gt;"&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;from setuptools import setup&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;code&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;code&lt;/span&gt; &lt;span class="pl-c1"&gt;id&lt;/span&gt;="&lt;span class="pl-s"&gt;L2&lt;/span&gt;" &lt;span class="pl-c1"&gt;data-line&lt;/span&gt;="&lt;span class="pl-s"&gt;2&lt;/span&gt;"&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;import os&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;code&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;code&lt;/span&gt; &lt;span class="pl-c1"&gt;id&lt;/span&gt;="&lt;span class="pl-s"&gt;L3&lt;/span&gt;" &lt;span class="pl-c1"&gt;data-line&lt;/span&gt;="&lt;span class="pl-s"&gt;3&lt;/span&gt;"&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;&amp;amp;nbsp;&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;code&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;code&lt;/span&gt; &lt;span class="pl-c1"&gt;id&lt;/span&gt;="&lt;span class="pl-s"&gt;L4&lt;/span&gt;" &lt;span class="pl-c1"&gt;data-line&lt;/span&gt;="&lt;span class="pl-s"&gt;4&lt;/span&gt;"&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;VERSION = &amp;amp;#34;0.1&amp;amp;#34;&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;code&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
...&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I wanted to imitate GitHub's handling of line links, where adding &lt;code&gt;#L23&lt;/code&gt; to the URL both jumps to that line and causes the line to be highlighted. Here's &lt;a href="https://ripgrep.datasette.io/-/ripgrep/view/datasette-allow-permissions-debug/setup.py#L23"&gt;a demo of that&lt;/a&gt; - I use the following JavaScript to update the contents of a &lt;code&gt;&amp;lt;style id="highlightStyle"&amp;gt;&amp;lt;/style&amp;gt;&lt;/code&gt; element in the document head any time the URL fragment changes:&lt;/p&gt;
&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-c1"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;script&lt;/span&gt;&lt;span class="pl-c1"&gt;&amp;gt;&lt;/span&gt;
var highlightStyle = document.getElementById('highlightStyle');
function highlightLineFromFragment() &lt;span class="pl-kos"&gt;{&lt;/span&gt;
    &lt;span class="pl-en"&gt;if&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-pds"&gt;/&lt;span class="pl-cce"&gt;^&lt;/span&gt;#L&lt;span class="pl-cce"&gt;\d&lt;/span&gt;&lt;span class="pl-c1"&gt;+&lt;/span&gt;&lt;span class="pl-cce"&gt;$&lt;/span&gt;/&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;exec&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;location&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;hash&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
        &lt;span class="pl-s1"&gt;highlightStyle&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;innerText&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;`&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;${&lt;/span&gt;&lt;span class="pl-s1"&gt;location&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;hash&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt; { background-color: yellow; }`&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-kos"&gt;}&lt;/span&gt;
&lt;span class="pl-kos"&gt;}&lt;/span&gt;
&lt;span class="pl-en"&gt;highlightLineFromFragment&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-smi"&gt;window&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;addEventListener&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;"hashchange"&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s1"&gt;highlightLineFromFragment&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-c1"&gt;&amp;lt;&lt;/span&gt;/&lt;span class="pl-ent"&gt;script&lt;/span&gt;&lt;span class="pl-c1"&gt;&amp;gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It's the simplest way I could think of to achieve this effect.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update 28th November 2020&lt;/strong&gt;: Louis Lévêque on Twitter suggested using the CSS &lt;a href="https://developer.mozilla.org/en-US/docs/Web/CSS/:target"&gt;:target selector&lt;/a&gt; instead, which is indeed MUCH simpler - I deleted the above JavaScript and replaced it with this CSS:&lt;/p&gt;
&lt;div class="highlight highlight-source-css"&gt;&lt;pre&gt;:&lt;span class="pl-c1"&gt;target&lt;/span&gt; {
    &lt;span class="pl-c1"&gt;background-color&lt;/span&gt;: &lt;span class="pl-pds"&gt;&lt;span class="pl-kos"&gt;#&lt;/span&gt;FFFF99&lt;/span&gt;;
}&lt;/pre&gt;&lt;/div&gt;
&lt;h4&gt;Next steps for this project&lt;/h4&gt;
&lt;p&gt;I'm pleased to have got &lt;a href="https://github.com/simonw/datasette-ripgrep"&gt;datasette-ripgrep&lt;/a&gt; to a workable state, and I'm looking forward to using it to answer questions about the growing Datasette ecosystem. I don't know how much more time I'll invest in this - if it proves useful then I may well expand it.&lt;/p&gt;
&lt;p&gt;I do think there's something really interesting about being able to spin up this kind of code search engine on demand using &lt;code&gt;datasette publish&lt;/code&gt;. It feels like a very useful trick to have access to.&lt;/p&gt;
&lt;h4&gt;Better URLs for my TILs&lt;/h4&gt;
&lt;p&gt;My other project this week was an upgrade to &lt;a href="https://til.simonwillison.net/"&gt;til.simonwillison.net&lt;/a&gt;: I finally spent the time to &lt;a href="https://github.com/simonw/til/issues/34"&gt;design nicer URLs&lt;/a&gt; for the site.&lt;/p&gt;
&lt;p&gt;Before:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;til.simonwillison.net/til/til/javascript_manipulating-query-params.md&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;After:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;til.simonwillison.net/javascript/manipulating-query-params&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;The implementation for this takes advantage of a feature I sneaked into Datasette 0.49: &lt;a href="https://simonwillison.net/2020/Sep/15/datasette-0-49#path-parameters-custom-page-templates"&gt;Path parameters for custom page templates&lt;/a&gt;. I can create a template file called &lt;code&gt;pages/{topic}/{slug}.html&lt;/code&gt; and Datasette will use that template to handle 404 errors that match that pattern.&lt;/p&gt;
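&lt;p&gt;The underlying idea - matching a 404'd request path against template filenames with &lt;code&gt;{parameter}&lt;/code&gt; segments - can be sketched like this (illustrative code, not Datasette's actual routing):&lt;/p&gt;

```python
import re


def match_page_template(patterns, path):
    # Match a path like /javascript/manipulating-query-params against
    # templates such as "pages/{topic}/{slug}.html", returning the
    # matching template plus the captured parameters.
    segments = path.strip("/").split("/")
    for pattern in patterns:
        parts = pattern[len("pages/"):-len(".html")].split("/")
        if len(parts) != len(segments):
            continue
        params = {}
        for part, segment in zip(parts, segments):
            m = re.fullmatch(r"\{(\w+)\}", part)
            if m:
                params[m.group(1)] = segment  # capture the parameter
            elif part != segment:
                break  # literal segment does not match
        else:
            return pattern, params
    return None, {}
```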
&lt;p&gt;Here's &lt;a href="https://github.com/simonw/til/blob/main/templates/pages/%7Btopic%7D/%7Bslug%7D.html"&gt;the new pages/{topic}/{slug}.html&lt;/a&gt; template for my TIL site. It uses the &lt;code&gt;sql()&lt;/code&gt; template function from the &lt;a href="https://github.com/simonw/datasette-template-sql"&gt;datasette-template-sql&lt;/a&gt; plugin to retrieve and render the matching TIL, or raises a 404 if no TIL can be found.&lt;/p&gt;
&lt;p&gt;I also needed to set up redirects from the old pages to the new ones. I wrote a &lt;a href="https://til.simonwillison.net/til/til/datasette_redirects-for-datasette.md"&gt;TIL on redirects for Datasette&lt;/a&gt; explaining how I did that.&lt;/p&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/til/til/datasette_redirects-for-datasette.md"&gt;Redirects for Datasette&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-ripgrep/releases/tag/0.2"&gt;datasette-ripgrep 0.2&lt;/a&gt; - 2020-11-27&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-ripgrep/releases/tag/0.1"&gt;datasette-ripgrep 0.1&lt;/a&gt; - 2020-11-26&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-atom/releases/tag/0.8.1"&gt;datasette-atom 0.8.1&lt;/a&gt; - 2020-11-25&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-ripgrep/releases/tag/0.1a1"&gt;datasette-ripgrep 0.1a1&lt;/a&gt; - 2020-11-25&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-ripgrep/releases/tag/0.1a0"&gt;datasette-ripgrep 0.1a0&lt;/a&gt; - 2020-11-25&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-graphql/releases/tag/1.2.1"&gt;datasette-graphql 1.2.1&lt;/a&gt; - 2020-11-24&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/async"&gt;async&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/css"&gt;css&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cloudrun"&gt;cloudrun&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ripgrep"&gt;ripgrep&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/baked-data"&gt;baked-data&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="async"/><category term="css"/><category term="projects"/><category term="python"/><category term="datasette"/><category term="weeknotes"/><category term="cloudrun"/><category term="ripgrep"/><category term="baked-data"/></entry><entry><title>Bedrock: The SQLitening</title><link href="https://simonwillison.net/2020/Oct/7/bedrock-sqlitening/#atom-tag" rel="alternate"/><published>2020-10-07T23:47:22+00:00</published><updated>2020-10-07T23:47:22+00:00</updated><id>https://simonwillison.net/2020/Oct/7/bedrock-sqlitening/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://mozilla.github.io/meao/2018/03/28/bedrock-the-sqlitening/"&gt;Bedrock: The SQLitening&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Back in March 2018 www.mozilla.org switched over to running on Django using SQLite! They’re using the same pattern I’ve been exploring with Datasette: their SQLite database is treated as a read-only cache by their frontend servers, and a new SQLite database is built by a separate process and fetched onto the frontend machines every five minutes by a scheduled task. They have a healthcheck page which shows the latest version of the database and when it was fetched, and even lets you download the 25MB SQLite database directly (I’ve been exploring it using Datasette).

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/simonw/status/1313986116161134592"&gt;@simonw&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/django"&gt;django&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mozilla"&gt;mozilla&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/baked-data"&gt;baked-data&lt;/a&gt;&lt;/p&gt;



</summary><category term="django"/><category term="mozilla"/><category term="sqlite"/><category term="datasette"/><category term="baked-data"/></entry><entry><title>niche-museums.com, powered by Datasette</title><link href="https://simonwillison.net/2019/Nov/25/niche-museums/#atom-tag" rel="alternate"/><published>2019-11-25T22:27:46+00:00</published><updated>2019-11-25T22:27:46+00:00</updated><id>https://simonwillison.net/2019/Nov/25/niche-museums/#atom-tag</id><summary type="html">
    &lt;p&gt;I just released a major upgrade to my &lt;a href="https://www.niche-museums.com/"&gt;www.niche-museums.com&lt;/a&gt; website (launched &lt;a href="https://simonwillison.net/2019/Oct/28/niche-museums-kepler/"&gt;last month&lt;/a&gt;).&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The site is now rendered server-side. The previous version used &lt;a href="https://lit-html.polymer-project.org/"&gt;lit-html&lt;/a&gt; to render content using JavaScript.&lt;/li&gt;
&lt;li&gt;Each museum now has its own page. Here&amp;#39;s today&amp;#39;s new museum listing for the &lt;a href="https://www.niche-museums.com/browse/museums/46"&gt;Conservatory of Flowers&lt;/a&gt; in San Francisco. These pages have a map on them.&lt;/li&gt;
&lt;li&gt;The site has an &lt;a href="https://www.niche-museums.com/about"&gt;about page&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;You can now link to the page for a specific latitude and longitude, e.g. &lt;a href="https://www.niche-museums.com/?latitude=37.77&amp;amp;longitude=-122.458"&gt;this location in Golden Gate Park&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;The source code for the site is now &lt;a href="https://github.com/simonw/museums"&gt;available on GitHub&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Notably, the site is entirely powered by &lt;a href="https://github.com/simonw/datasette"&gt;Datasette&lt;/a&gt;. It&amp;#39;s a heavily customized Datasette instance, making extensive use of &lt;a href="https://datasette.readthedocs.io/en/0.32/custom_templates.html#custom-templates"&gt;custom templates&lt;/a&gt; and &lt;a href="https://datasette.readthedocs.io/en/0.32/plugins.html"&gt;plugins&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It&amp;#39;s a really fun experiment. I&amp;#39;m essentially using Datasette as a weird twist on a static site generator - no moving parts since the database is immutable but there&amp;#39;s still stuff happening server-side to render the pages.&lt;/p&gt;
&lt;h3 id="continuous-deployment"&gt;Continuous deployment&lt;/h3&gt;
&lt;p&gt;The site is entirely stateless and is published &lt;a href="https://circleci.com/gh/simonw/museums"&gt;using Circle CI&lt;/a&gt; to a serverless hosting provider (currently Zeit Now v1, but I&amp;#39;ll probably move it to Google Cloud Run in the near future.)&lt;/p&gt;
&lt;p&gt;The site content - 46 museums and counting - lives in the &lt;a href="https://github.com/simonw/museums/blob/master/museums.yaml"&gt;museums.yaml&lt;/a&gt; file. I&amp;#39;ve been adding a new museum listing every day by editing the YAML file using &lt;a href="https://workingcopyapp.com/"&gt;Working Copy&lt;/a&gt; on my iPhone.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/simonw/museums/blob/master/.circleci/config.yml"&gt;build script&lt;/a&gt; runs automatically on every commit. It converts the YAML file into a SQLite database using my &lt;a href="https://github.com/simonw/yaml-to-sqlite"&gt;yaml-to-sqlite&lt;/a&gt; tool, then runs &lt;code&gt;datasette publish now...&lt;/code&gt; to deploy the resulting database.&lt;/p&gt;
&lt;p&gt;The full deployment command is as follows:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;datasette publish now browse.db about.db \
    --token=$NOW_TOKEN \
    --alias=www.niche-museums.com \
    --name=niche-museums \
    --install=datasette-haversine \
    --install=datasette-pretty-json \
    --install=datasette-template-sql \
    --install=datasette-json-html \
    --install=datasette-cluster-map~=0.8 \
    --metadata=metadata.json \
    --template-dir=templates \
    --plugins-dir=plugins \
    --branch=master
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;There&amp;#39;s a lot going on here.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;browse.db&lt;/code&gt; is the SQLite database file that was built by running &lt;code&gt;yaml-to-sqlite&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;about.db&lt;/code&gt; is an empty database built using &lt;code&gt;sqlite3 about.db &amp;#39;&amp;#39;&lt;/code&gt; - more on this later.&lt;/p&gt;
&lt;p&gt;(Running &lt;code&gt;sqlite3&lt;/code&gt; with an empty SQL argument creates a valid, zero-table database file if one does not already exist.)&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;--alias=&lt;/code&gt; option tells Zeit Now to alias that URL to the resulting deployment. This is the single biggest feature that I&amp;#39;m missing from Google Cloud Run at the moment. It&amp;#39;s possible to point domains at deployments there but it&amp;#39;s not nearly as easy to script.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;--install=&lt;/code&gt; options tell &lt;code&gt;datasette publish&lt;/code&gt; which plugins should be installed on the resulting instance.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;--metadata=&lt;/code&gt;, &lt;code&gt;--template-dir=&lt;/code&gt; and &lt;code&gt;--plugins-dir=&lt;/code&gt; are the options that customize the instance.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;--branch=master&lt;/code&gt; means we always deploy the latest master of Datasette directly from GitHub, ignoring the most recent release to PyPI. This isn&amp;#39;t strictly necessary here.&lt;/p&gt;
&lt;h3 id="customization"&gt;Customization&lt;/h3&gt;
&lt;p&gt;The site itself is built almost entirely using Datasette custom templates. I have four of them:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/museums/blob/c81e8ec9f39d87f13481608832c94b8e824fd347/templates/index.html"&gt;index.html&lt;/a&gt; is the template used for the homepage, and for the page you see when you search for museums near a specific latitude and longitude.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/museums/blob/c81e8ec9f39d87f13481608832c94b8e824fd347/templates/row-browse-museums.html"&gt;row-browse-museums.html&lt;/a&gt; is the template used for the &lt;a href="https://www.niche-museums.com/browse/museums/43"&gt;individual museum pages&lt;/a&gt;. It includes the JavaScript used for the map (which is powered by &lt;a href="https://leafletjs.com/"&gt;Leaflet&lt;/a&gt; and uses &lt;a href="https://foundation.wikimedia.org/wiki/Maps_Terms_of_Use"&gt;Wikimedia&amp;#39;s OpenStreetMap tiles&lt;/a&gt;, which I discovered thanks to &lt;a href="https://observablehq.com/@tmcw/leaflet"&gt;this Observable notebook&lt;/a&gt; by Tom MacWright).&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/museums/blob/c81e8ec9f39d87f13481608832c94b8e824fd347/templates/_museum_card.html"&gt;_museum_card.html&lt;/a&gt; is an included template rendering a card for a museum, shared by the index and museum pages.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/museums/blob/c81e8ec9f39d87f13481608832c94b8e824fd347/templates/database-about.html"&gt;database-about.html&lt;/a&gt; is the template for &lt;a href="https://www.niche-museums.com/about"&gt;the about page&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The about page uses a particularly devious hack.&lt;/p&gt;
&lt;p&gt;Datasette doesn&amp;#39;t have an easy way to create additional custom pages with URLs at the moment (without abusing the &lt;a href="https://datasette.readthedocs.io/en/stable/plugins.html#asgi-wrapper-datasette"&gt;asgi_wrapper()&lt;/a&gt; hook, which is pretty low-level).&lt;/p&gt;
&lt;p&gt;But... every attached database gets its own URL at &lt;code&gt;/database-name&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;So, to create the &lt;code&gt;/about&lt;/code&gt; page I create an empty database called &lt;code&gt;about.db&lt;/code&gt; using the &lt;code&gt;sqlite3 about.db &amp;quot;&amp;quot;&lt;/code&gt; command. I serve that using Datasette, then create a custom template for that specific database using Datasette&amp;#39;s &lt;a href="https://datasette.readthedocs.io/en/0.32/custom_templates.html#custom-templates"&gt;template naming conventions&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I&amp;#39;ll probably come up with a less grotesque way of doing this and bake it into Datasette in the future. For the moment this seems to work pretty well.&lt;/p&gt;
&lt;h3 id="plugins"&gt;Plugins&lt;/h3&gt;
&lt;p&gt;The two key plugins here are &lt;code&gt;datasette-haversine&lt;/code&gt; and &lt;code&gt;datasette-template-sql&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/datasette-haversine"&gt;datasette-haversine&lt;/a&gt; adds a custom SQL function to Datasette called &lt;code&gt;haversine()&lt;/code&gt;, which calculates the haversine distance between two latitude/longitude points.&lt;/p&gt;
&lt;p&gt;It&amp;#39;s used by the SQL query which finds the nearest museums to the user.&lt;/p&gt;
&lt;p&gt;This is very inefficient - it&amp;#39;s essentially a brute-force approach which calculates that distance for every museum in the database and sorts them accordingly - but it will be years before I have enough museums listed for that to cause any kind of performance issue.&lt;/p&gt;
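&lt;p&gt;Here&amp;#39;s a minimal sketch of how a custom SQL function like this can work, using Python&amp;#39;s built-in &lt;code&gt;sqlite3&lt;/code&gt; module - the museum rows and coordinates below are made up for illustration, and the real plugin registers its function through a Datasette plugin hook rather than directly on a connection:&lt;/p&gt;

```python
import math
import sqlite3

def haversine(lat1, lon1, lat2, lon2):
    # Great-circle distance in miles between two latitude/longitude points
    r = 3959  # approximate Earth radius in miles
    dlat = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1)
    a = (math.sin(dlat / 2) ** 2
         + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2))
         * math.sin(dlon / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

conn = sqlite3.connect(":memory:")
conn.create_function("haversine", 4, haversine)
conn.execute("create table museums (name text, latitude real, longitude real)")
conn.executemany("insert into museums values (?, ?, ?)", [
    ("Cable Car Museum", 37.7946, -122.4117),
    ("Conservatory of Flowers", 37.7726, -122.4603),
])
# Brute-force nearest-museum query: compute the distance for every row, sort
rows = conn.execute("""
    select name, haversine(latitude, longitude, ?, ?) as distance
    from museums order by distance limit 5
""", (37.77, -122.458)).fetchall()
```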
&lt;p&gt;&lt;a href="https://github.com/simonw/datasette-template-sql"&gt;datasette-template-sql&lt;/a&gt; is the new plugin I &lt;a href="https://simonwillison.net/2019/Nov/18/datasette-template-sql/"&gt;described last week&lt;/a&gt;, made possible by Datasette dropping Python 3.5 support. It allows SQL queries to be executed directly from templates. I&amp;#39;m using it here to &lt;a href="https://github.com/simonw/museums/blob/c81e8ec9f39d87f13481608832c94b8e824fd347/templates/index.html#L58-L69"&gt;run the queries&lt;/a&gt; that power homepage.&lt;/p&gt;
&lt;p&gt;I tried to get the site working just using code in the templates, but it got pretty messy. Instead, I took advantage of Datasette&amp;#39;s &lt;code&gt;--plugins-dir&lt;/code&gt; option, which causes Datasette to treat all Python modules in a specific directory as plugins and attempt to load them.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/museums/blob/c81e8ec9f39d87f13481608832c94b8e824fd347/plugins/index_vars.py"&gt;index_vars.py&lt;/a&gt; is a single custom plugin that I&amp;#39;m bundling with the site. It uses the &lt;a href="https://datasette.readthedocs.io/en/0.32/plugins.html#extra-template-vars-template-database-table-view-name-request-datasette"&gt;extra_template_vars()&lt;/a&gt; plugin took to detect requests to the &lt;code&gt;index&lt;/code&gt; page and inject some additional custom template variables based on values read from the querystring.&lt;/p&gt;
&lt;p&gt;This ends up acting a little bit like a custom Django view function. It&amp;#39;s a slightly weird pattern but again it does the job - and helps me further explore the potential of Datasette as a tool for powering websites in addition to just providing an API.&lt;/p&gt;
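&lt;p&gt;The shape of that plugin looks something like this - a simplified sketch based on the documented &lt;code&gt;extra_template_vars()&lt;/code&gt; hook signature. The real &lt;code&gt;index_vars.py&lt;/code&gt; decorates the function with &lt;code&gt;@hookimpl&lt;/code&gt; from the &lt;code&gt;datasette&lt;/code&gt; package; here a fake request object stands in so the sketch runs standalone:&lt;/p&gt;

```python
# Simplified sketch of an index_vars.py-style plugin. In a real Datasette
# plugin this function would be decorated with @hookimpl and the request
# object would be provided by Datasette itself.

def extra_template_vars(template, request):
    # Only inject extra variables when the index page is being rendered
    if template != "index.html":
        return {}
    args = request.args
    return {
        "latitude": args.get("latitude"),
        "longitude": args.get("longitude"),
    }

class FakeRequest:
    # Stand-in for Datasette's request object, for illustration only
    def __init__(self, args):
        self.args = args

template_vars = extra_template_vars("index.html", FakeRequest({"latitude": "37.77"}))
```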
&lt;h2 id="weeknotes"&gt;Weeknotes&lt;/h2&gt;
&lt;p&gt;This post is standing in for my regular weeknotes, because it represents most of what I achieved this last week. A few other bits and pieces:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;I&amp;#39;ve been exploring ways to enable CSV upload directly into a Datasette instance. I&amp;#39;m building a prototype of this on top of &lt;a href="https://www.starlette.io/"&gt;Starlette&lt;/a&gt;, because it has solid ASGI &lt;a href="https://www.starlette.io/requests/#request-files"&gt;file upload support&lt;/a&gt;. This is currently a standalone web application but I&amp;#39;ll probably make it work as a Datasette ASGI plugin once I have something I like.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://sixcolors.com/post/2019/09/13-features-of-ios-13-shortcuts/"&gt;Shortcuts in iOS 13&lt;/a&gt; got some very interesting new features, most importantly the ability to trigger shortcuts automatically on specific actions - including every time you open a specific app. I&amp;#39;ve been experimenting with using this to automatically copy data from my iPhone up to a custom web application - maybe this could help ingest notes and photos into &lt;a href="https://simonwillison.net/2019/Oct/7/dogsheep/"&gt;Dogsheep&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Posted seven new museums to niche-museums.com: &lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/39"&gt;Cable Car Museum&lt;/a&gt; in San Francisco&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/40"&gt;Audium&lt;/a&gt; in San Francisco&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/41"&gt;House of Broel Dollhouse Museum&lt;/a&gt; in New Orleans&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/43"&gt;Neptune Society Columbarium&lt;/a&gt; in San Francisco&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/44"&gt;Recoleta Cemetery&lt;/a&gt; in Buenos Aires&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/45"&gt;NASA Glenn Visitor Center&lt;/a&gt; in Cleveland&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/46"&gt;Conservatory of Flowers&lt;/a&gt; in San Francisco&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;I composed &lt;a href="https://www.niche-museums.com/browse?sql=select+json_object%28%22pre%22%2C+group_concat%28%27*+%5B%27+%7C%7C+name+%7C%7C+%27%5D%28https%3A%2F%2Fwww.niche-museums.com%2Fbrowse%2Fmuseums%2F%27+%7C%7C+id+%7C%7C++%2B+%27%29+in+%27+%7C%7C+coalesce%28osm_city%2C+osm_county%2C+osm_state%2C+osm_country%2C+%27%27%29%2C+%27%0D%0A%27%29%29+from+%28select+*+from+%28select+*+from+museums+order+by+id+desc+limit+7%29+order+by+id%29%3B"&gt;devious SQL query&lt;/a&gt; for generating the markdown for the seven most recently added museums.&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/museums"&gt;museums&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/yaml"&gt;yaml&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/baked-data"&gt;baked-data&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="museums"/><category term="projects"/><category term="yaml"/><category term="datasette"/><category term="weeknotes"/><category term="baked-data"/></entry><entry><title>Weeknotes: Niche Museums, Kepler, Trees and Streaks</title><link href="https://simonwillison.net/2019/Oct/28/niche-museums-kepler/#atom-tag" rel="alternate"/><published>2019-10-28T22:42:10+00:00</published><updated>2019-10-28T22:42:10+00:00</updated><id>https://simonwillison.net/2019/Oct/28/niche-museums-kepler/#atom-tag</id><summary type="html">
    &lt;h3 id="Niche_Museums_4"&gt;Niche Museums&lt;/h3&gt;

&lt;p&gt;Every now and then someone will ask “so when are you going to build Museums Near Me then?”, based on &lt;a href="https://twitter.com/simonw/status/1171159213436997633"&gt;my obsession with niche museums&lt;/a&gt; and websites like &lt;a href="https://www.owlsnearme.com/"&gt;www.owlsnearme.com&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For my Strategic Communications course at Stanford last week I had to perform a midterm presentation - a six minute talk to convince my audience of something, accompanied by slides and a handout.&lt;/p&gt;
&lt;p&gt;I chose “you should seek out and explore tiny museums” as my topic, and used it as an excuse to finally start the website!&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.niche-museums.com/"&gt;www.niche-museums.com&lt;/a&gt; is the result. It’s a small but growing collection of niche museums (17 so far, mostly in the San Francisco Bay Area) complete with the all important blue “Use my location” button to see museums near you.&lt;/p&gt;
&lt;p&gt;Naturally I built it on &lt;a href="https://github.com/simonw/datasette"&gt;Datasette&lt;/a&gt;. I’ll be writing more about the implementation (and releasing the underlying code) soon. I also built a new plugin for it, &lt;a href="https://github.com/simonw/datasette-haversine"&gt;datasette-haversine&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id="Mapping_museums_against_Starbucks_16"&gt;Mapping museums against Starbucks&lt;/h3&gt;

&lt;p&gt;I needed a way to emphasize quite how many tiny museums there are in the USA. I decided to do this with a visualization.&lt;/p&gt;
&lt;p&gt;It turns out there are 15,891 branches of Starbucks in the USA… and at least 30,132 museums!&lt;/p&gt;

&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2019/starbucks.png" alt="15,891 Starbucks" style="max-width: 100%" /&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2019/museums.png" alt="At least 30.132 museums!" style="max-width: 100%" /&gt;&lt;/p&gt;

&lt;p&gt;I made these maps using a couple of sources.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.alltheplaces.xyz/"&gt;All The Places&lt;/a&gt; is a crowdsourced scraper project which aims to build scrapers for every company that has a “store locator” area of their website. Starbucks has &lt;a href="https://www.starbucks.com/store-locator"&gt;a store locator&lt;/a&gt; and All The Places have &lt;a href="https://github.com/alltheplaces/alltheplaces/blob/master/locations/spiders/starbucks.py"&gt;a scraper for it&lt;/a&gt;, so you can download GeoJSON of every Starbucks. I wrote a quick script to import that GeoJSON into Datasette using sqlite-utils.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://www.imls.gov/"&gt;Institute of Museum and Library Services&lt;/a&gt; is an independent agency of the federal government that supports museums and libraries across the country. They publish a &lt;a href="https://www.imls.gov/research-evaluation/data-collection/museum-data-files"&gt;dataset of Museums in the USA&lt;/a&gt; as a set of CSV files. I used &lt;a href="https://github.com/simonw/csvs-to-sqlite"&gt;csvs-to-sqlite&lt;/a&gt; to load those into Datasette, than ran a union query to combine the three files together.&lt;/p&gt;
&lt;p&gt;So I have Datasette instances (with a CSV export feature) for both Starbucks and USA museums, with latitudes and longitudes for each.&lt;/p&gt;
&lt;p&gt;Now how to turn that into a map?&lt;/p&gt;
&lt;p&gt;I turned to my new favourite GIS tool, &lt;a href="https://kepler.gl/"&gt;Kepler&lt;/a&gt;. Kepler is an open source GIS visualization tool released by Uber, based on WebGL. It’s astonishingly powerful and can be used directly in your browser by clicking the “Get Started” button on their website (which I assumed would take you to installation instructions, but no, it loads up the entire tool in your browser).&lt;/p&gt;
&lt;p&gt;You can import millions of points of data into Kepler and it will visualize them for you directly. I used a Datasette query to export the CSVs, then loaded in my Starbucks CSV, exported an image, loaded in the Museums CSV as a separate colour and exported a second image. The whole project ended up taking about 15 minutes. Kepler is a great addition to the toolbelt!&lt;/p&gt;

&lt;h3 id="Animating_the_PGE_outages_40"&gt;Animating the PG&amp;amp;E outages&lt;/h3&gt;

&lt;p&gt;My &lt;a href="https://simonwillison.net/2019/Oct/10/pge-outages/"&gt;PG&amp;amp;E outages scraper&lt;/a&gt; continues to record a snapshot of the PG&amp;amp;E outage map JSON every ten minutes. I’m posting updates to &lt;a href="https://twitter.com/simonw/status/1182440312590848001"&gt;a thread on Twitter&lt;/a&gt;, but discovering Kepler inspired me to look at more sophisticated visualization options.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://medium.com/vis-gl/animating-40-years-of-california-earthquakes-e4ffcdd4a289"&gt;This tutorial&lt;/a&gt; by Giuseppe Macrì tipped me off the the fact that you can use Kepler to animate points against timestamps!&lt;/p&gt;
&lt;p&gt;Here’s the result: a video animation showing how PG&amp;amp;E’s outages have evolved since the 5th of October:&lt;/p&gt;

&lt;blockquote class="twitter-tweet"&gt;&lt;p lang="en" dir="ltr"&gt;Here&amp;#39;s a video animation of PG&amp;amp;E&amp;#39;s outages from October 5th up until just a few minutes ago &lt;a href="https://t.co/50K3BrROZR"&gt;pic.twitter.com/50K3BrROZR&lt;/a&gt;&lt;/p&gt;- Simon Willison (@simonw) &lt;a href="https://twitter.com/simonw/status/1188612004572880896?ref_src=twsrc%5Etfw"&gt;October 28, 2019&lt;/a&gt;&lt;/blockquote&gt;

&lt;h3 id="Hayes_Valley_Trees_50"&gt;Hayes Valley Trees&lt;/h3&gt;

&lt;p&gt;The city announced plans to cut down 27 ficus trees in our neighborhood in San Francisco. I’ve been working with Natalie to help a small group of citizens organize an appeal, and this weekend I helped run a survey of the affected trees (recording their exact locations in a CSV file) and then built &lt;a href="https://www.hayes-valley-trees.com/"&gt;www.hayes-valley-trees.com&lt;/a&gt; (&lt;a href="https://github.com/simonw/hayes-valley-trees"&gt;source on GitHub&lt;/a&gt;) to link to from fliers attached to each affected tree.&lt;/p&gt;
&lt;p&gt;It started out as &lt;a href="https://glitch.com/~hayes-valley-trees"&gt;a Datasette&lt;/a&gt; (running on Glitch) but since it’s only 27 data points I ended up freezing the data in a static JSON file to avoid having to tolerate any cold start times. The site is deployed as static assets on Zeit Now using their handy &lt;a href="https://zeit.co/github"&gt;GitHub continuous deployment tool&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="Streaks_56"&gt;Streaks&lt;/h3&gt;
&lt;p&gt;It turns out I’m very motivated by streaks: I’m at 342 days for Duolingo Spanish and 603 days for an Apple Watch move streak. Could I apply this to other things in my life?&lt;/p&gt;
&lt;p&gt;I &lt;a href="https://twitter.com/simonw/status/1186824721280593920"&gt;asked on Twitter&lt;/a&gt; and was recommended the &lt;a href="https://streaks.app/"&gt;Streaks iOS app&lt;/a&gt;. It’s beautiful! I’m now tracking streaks for guitar practice, Duolingo, checking email, checking Slack, reading some books and adding a new museum to &lt;a href="http://www.niche-museums.com"&gt;www.niche-museums.com&lt;/a&gt; (if I add one a day I can get from 17 museums today to 382 in a year!)&lt;/p&gt;
&lt;p&gt;It seems to be working pretty well so far. I particularly like their iPhone widget.&lt;/p&gt;

&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2019/streaks-widget.jpg" alt="Streaks widget" style="max-width: 100%" /&gt;&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/museums"&gt;museums&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/productivity"&gt;productivity&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/visualization"&gt;visualization&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/baked-data"&gt;baked-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/streaks"&gt;streaks&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/duolingo"&gt;duolingo&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="museums"/><category term="productivity"/><category term="projects"/><category term="visualization"/><category term="weeknotes"/><category term="baked-data"/><category term="streaks"/><category term="duolingo"/></entry><entry><title>The interesting ideas in Datasette</title><link href="https://simonwillison.net/2018/Oct/4/datasette-ideas/#atom-tag" rel="alternate"/><published>2018-10-04T02:28:45+00:00</published><updated>2018-10-04T02:28:45+00:00</updated><id>https://simonwillison.net/2018/Oct/4/datasette-ideas/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;a href="https://github.com/simonw/datasette"&gt;Datasette&lt;/a&gt; (&lt;a href="https://simonwillison.net/tags/datasette/"&gt;previously&lt;/a&gt;) is my open source tool for exploring and publishing structured data. There are a lot of ideas embedded in Datasette. I realized that I haven’t put many of them into writing.&lt;/p&gt;
&lt;p&gt;
&lt;a href="#Publishing_readonly_data"&gt;Publishing read-only data&lt;/a&gt;&lt;br /&gt;
&lt;a href="#Bundling_the_data_with_the_code"&gt;Bundling the data with the code&lt;/a&gt;&lt;br /&gt;
&lt;a href="#SQLite_as_the_underlying_data_engine"&gt;SQLite as the underlying data engine&lt;/a&gt;&lt;br /&gt;
&lt;a href="#Farfuture_cache_expiration"&gt;Far-future cache expiration&lt;/a&gt;&lt;br /&gt;
&lt;a href="#Publishing_as_a_core_feature"&gt;Publishing as a core feature&lt;/a&gt;&lt;br /&gt;
&lt;a href="#License_and_source_metadata"&gt;License and source metadata&lt;/a&gt;&lt;br /&gt;
&lt;a href="#Facet_everything"&gt;Facet everything&lt;/a&gt;&lt;br /&gt;
&lt;a href="#Respect_for_CSV"&gt;Respect for CSV&lt;/a&gt;&lt;br /&gt;
&lt;a href="#SQL_as_an_API_language"&gt;SQL as an API language&lt;/a&gt;&lt;br /&gt;
&lt;a href="#Optimistic_query_execution_with_time_limits"&gt;Optimistic query execution with time limits&lt;/a&gt;&lt;br /&gt;
&lt;a href="#Keyset_pagination"&gt;Keyset pagination&lt;/a&gt;&lt;br /&gt;
&lt;a href="#Interactive_demos_based_on_the_unit_tests"&gt;Interactive demos based on the unit tests&lt;/a&gt;&lt;br /&gt;
&lt;a href="#Documentation_unit_tests"&gt;Documentation unit tests&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;&lt;a id="Publishing_readonly_data"&gt;&lt;/a&gt;Publishing read-only data&lt;/h3&gt;
&lt;p&gt;Datasette provides a read-only API to your data. It makes no attempt to deal with writes. Avoiding writes entirely is fundamental to a plethora of interesting properties, many of which are expanded on further below. In brief:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Hosting web applications with no read/write persistence requirements is incredibly cheap in 2018 - often free (both &lt;a href="https://zeit.co/now"&gt;ZEIT Now&lt;/a&gt; and &lt;a href="https://www.heroku.com/"&gt;Heroku&lt;/a&gt; have generous free tiers). This is a big deal: even having to pay a few dollars a month is enough to disincentivise sharing data, since now you have to figure out who will pay and ensure the payments don’t expire in the future.&lt;/li&gt;
&lt;li&gt;Being read-only makes it trivial to scale: just add more instances, each with their own copy of the data. All of the hard problems in scaling web applications that relate to writable data stores can be skipped entirely.&lt;/li&gt;
&lt;li&gt;Since the database file is opened using SQLite’s &lt;a href="https://www.sqlite.org/uri.html#uriimmutable"&gt;immutable mode&lt;/a&gt;, we can accept arbitrary SQL queries with no risk of them corrupting the data.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Any time your data changes, you need to publish a brand new copy of the whole database. With the right hosting this is easy: deploy a brand new copy of your data and application in parallel to your existing live deployment, then switch over incoming HTTP traffic to your API at the load balancer level. Heroku and Zeit Now both support this strategy out of the box.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Bundling_the_data_with_the_code"&gt;&lt;/a&gt;Bundling the data with the code&lt;/h3&gt;
&lt;p&gt;Since the data is read-only and is encapsulated in a single binary SQLite database file, we can bundle the data as part of the app. This means we can trivially create and publish Docker images that provide both the data and the API and UI for accessing it. We can also publish to any hosting provider that will allow us to run a Python application, without also needing to provision a mutable database.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://datasette.readthedocs.io/en/stable/publish.html#datasette-package"&gt;datasette package&lt;/a&gt; command takes one or more SQLite databases and bundles them together with the Datasette application in a single Docker image, ready to be deployed anywhere that can run Docker containers.&lt;/p&gt;
&lt;h3&gt;&lt;a id="SQLite_as_the_underlying_data_engine"&gt;&lt;/a&gt;SQLite as the underlying data engine&lt;/h3&gt;
&lt;p&gt;Datasette encourages people to use &lt;a href="https://www.sqlite.org/"&gt;SQLite&lt;/a&gt; as a standard format for publishing data.&lt;/p&gt;
&lt;p&gt;Relational databases are great: once you know how to use them, you can represent any data you can imagine using a carefully designed schema.&lt;/p&gt;
&lt;p&gt;What about data that’s too unstructured to fit a relational schema? SQLite includes excellent &lt;a href="https://www.sqlite.org/json1.html"&gt;support for JSON data&lt;/a&gt; - so if you can’t shape your data to fit a table schema you can instead store it as text blobs of JSON - and use SQLite’s JSON functions to filter by or extract specific fields.&lt;/p&gt;
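&lt;p&gt;For example, using Python&amp;#39;s built-in &lt;code&gt;sqlite3&lt;/code&gt; module (this assumes a SQLite build with the JSON1 functions compiled in, which is standard in modern builds; the table and document are made up for illustration):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table documents (id integer primary key, body text)")
conn.execute(
    "insert into documents (body) values (?)",
    ('{"name": "Ada", "tags": ["math", "computing"]}',),
)
# Extract specific fields from the JSON text blob using JSON1 functions
name = conn.execute(
    "select json_extract(body, '$.name') from documents"
).fetchone()[0]
first_tag = conn.execute(
    "select json_extract(body, '$.tags[0]') from documents"
).fetchone()[0]
```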
&lt;p&gt;What about binary data? Even that’s covered: SQLite will happily store binary blobs. My &lt;a href="https://github.com/simonw/datasette-render-images"&gt;datasette-render-images plugin&lt;/a&gt; (&lt;a href="https://datasette-render-images-demo.datasette.io/favicons-6a27915/favicons"&gt;live demo here&lt;/a&gt;) is one example of a tool that works with binary image data stored in SQLite blobs.&lt;/p&gt;
&lt;p&gt;What if my data is too big? Datasette is not a “big data” tool, but if your definition of big data is something that won’t fit in RAM, that threshold is growing all the time (2TB of RAM on a single AWS instance &lt;a href="https://aws.amazon.com/about-aws/whats-new/2016/05/now-available-x1-instances-the-largest-amazon-ec2-memory-optimized-instance-with-2-tb-of-memory/"&gt;now costs less than $4/hour&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;I’ve personally had great results from multiple GB SQLite databases and Datasette. The theoretical maximum size of a single SQLite database is &lt;a href="https://www.sqlite.org/limits.html"&gt;around 140TB&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;SQLite also has built-in support for &lt;a href="https://www.sqlite.org/fts5.html"&gt;surprisingly good full-text search&lt;/a&gt;, and thanks to being extensible via modules has excellent geospatial functionality in the form of the &lt;a href="https://www.gaia-gis.it/fossil/libspatialite/index"&gt;SpatiaLite extension&lt;/a&gt;. Datasette benefits enormously from this wider ecosystem.&lt;/p&gt;
&lt;p&gt;The reason most developers avoid SQLite for production web applications is that it doesn’t deal brilliantly with large volumes of concurrent writes. Since Datasette is read-only we can entirely ignore this limitation.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Farfuture_cache_expiration"&gt;&lt;/a&gt;Far-future cache expiration&lt;/h3&gt;
&lt;p&gt;Since the data in a Datasette instance never changes, why not cache calls to it forever?&lt;/p&gt;
&lt;p&gt;Datasette sends a far-future HTTP cache expiry header with every API response. This means that browsers will only ever fetch data the first time a specific URL is accessed, and if you host Datasette behind a CDN such as &lt;a href="https://www.fastly.com/"&gt;Fastly&lt;/a&gt; or &lt;a href="https://www.cloudflare.com/"&gt;Cloudflare&lt;/a&gt; each unique API call will hit Datasette just once and then be cached essentially forever by the CDN.&lt;/p&gt;
&lt;p&gt;This means it’s safe to deploy a JavaScript app using an inexpensively hosted Datasette-backed API to the front page of even a high traffic site - the CDN will easily take the load.&lt;/p&gt;
&lt;p&gt;Zeit added Cloudflare to every deployment (even their free tier) &lt;a href="https://zeit.co/blog/now-cdn"&gt;back in July&lt;/a&gt;, so if you are hosted there you get this CDN benefit for free.&lt;/p&gt;
&lt;p&gt;What if you re-publish an updated copy of your data? Datasette has that covered too. You may have noticed that every Datasette database gets a hashed suffix automatically when it is deployed:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://fivethirtyeight.datasettes.com/fivethirtyeight-c9e67c4"&gt;https://fivethirtyeight.datasettes.com/fivethirtyeight-c9e67c4&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This suffix is based on the SHA256 hash of the entire database file contents - so any change to the data will result in new URLs. If you query a previous suffix Datasette will notice and redirect you to the new one.&lt;/p&gt;
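&lt;p&gt;Computing a suffix like that is straightforward - something along these lines (a sketch: the bytes below are fake, and the seven-character truncation matches the example URL above rather than being a documented constant):&lt;/p&gt;

```python
import hashlib

def hash_suffix(db_bytes, length=7):
    # SHA-256 of the entire database file contents, truncated.
    # Any change to the data produces a different suffix, and so a new URL.
    return hashlib.sha256(db_bytes).hexdigest()[:length]

suffix = hash_suffix(b"pretend these are the bytes of fivethirtyeight.db")
url = "https://fivethirtyeight.datasettes.com/fivethirtyeight-%s" % suffix
```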
&lt;p&gt;If you know you’ll be changing your data, you can build your application against the non-suffixed URL. This will not be cached and will always 302 redirect to the correct version (and these redirects are extremely fast).&lt;/p&gt;
&lt;p&gt;&lt;a href="https://fivethirtyeight.datasettes.com/fivethirtyeight/alcohol-consumption%2Fdrinks.json"&gt;https://fivethirtyeight.datasettes.com/fivethirtyeight/alcohol-consumption%2Fdrinks.json&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The redirect sends an HTTP/2 push header such that if you are running behind a CDN that understands push (&lt;a href="https://blog.cloudflare.com/announcing-support-for-http-2-server-push-2/"&gt;such as Cloudflare&lt;/a&gt;) your browser won’t have to make two requests to follow the redirect. You can use the Chrome DevTools to see this in action:&lt;/p&gt;
&lt;p&gt;&lt;img  style="max-width: 100%" src="https://static.simonwillison.net/static/2018/http2-push.png" alt="Chrome DevTools showing a redirect initiated by an HTTP/2 push" /&gt;&lt;/p&gt;
&lt;p&gt;And finally, if you need to opt out of HTTP caching for some reason you can disable it on a per-request basis by including &lt;code&gt;?_ttl=0&lt;/code&gt; &lt;a href="https://datasette.readthedocs.io/en/stable/json_api.html#special-json-arguments"&gt;in the URL query string&lt;/a&gt; - for example, if you want to return a random member of the Avengers it doesn’t make sense to cache the response:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://fivethirtyeight.datasettes.com/fivethirtyeight?sql=select+*+from+%5Bavengers%2Favengers%5D+order+by+random()+limit+1&amp;amp;_ttl=0"&gt;https://fivethirtyeight.datasettes.com/fivethirtyeight?sql=select+*+from+[avengers%2Favengers]+order+by+random()+limit+1&amp;amp;_ttl=0&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;&lt;a id="Publishing_as_a_core_feature"&gt;&lt;/a&gt;Publishing as a core feature&lt;/h3&gt;
&lt;p&gt;Datasette aims to reduce the friction for publishing interesting data online as much as possible.&lt;/p&gt;
&lt;p&gt;To this end, Datasette includes &lt;a href="https://datasette.readthedocs.io/en/stable/publish.html"&gt;a “publish” subcommand&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# deploy to Heroku
datasette publish heroku mydatabase.db
# Or deploy to Zeit Now
datasette publish now mydatabase.db
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These commands take one or more SQLite databases, upload them to a hosting provider, configure a Datasette instance to serve them and return the public URL of the newly deployed application.&lt;/p&gt;
&lt;p&gt;Out of the box, Datasette can publish to either Heroku or to Zeit Now. The &lt;a href="https://datasette.readthedocs.io/en/stable/plugins.html#publish-subcommand-publish"&gt;publish_subcommand plugin hook&lt;/a&gt; means other providers can be supported by writing plugins.&lt;/p&gt;
&lt;h3&gt;&lt;a id="License_and_source_metadata"&gt;&lt;/a&gt;License and source metadata&lt;/h3&gt;
&lt;p&gt;Datasette believes that data should be accompanied by source information and a license, whenever possible. The &lt;a href="https://datasette.readthedocs.io/en/stable/metadata.html"&gt;metadata.json file&lt;/a&gt; that can be bundled with your data supports these. You can also provide source and license information when you run &lt;code&gt;datasette publish&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;datasette publish now fivethirtyeight.db \
    --source=&amp;quot;FiveThirtyEight&amp;quot; \
    --source_url=&amp;quot;https://github.com/fivethirtyeight/data&amp;quot; \
    --license=&amp;quot;CC BY 4.0&amp;quot; \
    --license_url=&amp;quot;https://creativecommons.org/licenses/by/4.0/&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When you use these options Datasette will create the corresponding &lt;code&gt;metadata.json&lt;/code&gt; file for you as part of the deployment.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Facet_everything"&gt;&lt;/a&gt;Facet everything&lt;/h3&gt;
&lt;p&gt;I really love faceted search: it’s the first tool I turn to whenever I want to start understanding a collection of data. I’ve built faceted search engines on top of Solr, Elasticsearch and PostgreSQL and many of my favourite tools (like Splunk and Datadog) have it as a core feature.&lt;/p&gt;
&lt;p&gt;Datasette automatically attempts to calculate facets against every table. You can read &lt;a href="https://simonwillison.net/2018/May/20/datasette-facets/"&gt;more about the Datasette Facets feature here&lt;/a&gt; - as a huge faceted search fan it’s one of my all-time favourite features of the project. Now I can add SQLite to the list of technologies I’ve used to build faceted search!&lt;/p&gt;
&lt;h3&gt;&lt;a id="Respect_for_CSV"&gt;&lt;/a&gt;Respect for CSV&lt;/h3&gt;
&lt;p&gt;CSV is by far the most common format for sharing and publishing data online. Almost every useful data tool has the ability to export to it, and it remains the lingua franca of spreadsheet import and export.&lt;/p&gt;
&lt;p&gt;It has many flaws: it can’t easily represent nested data structures, escaping rules for values containing commas are inconsistently implemented, and it doesn’t have a standard way of representing character encoding.&lt;/p&gt;
&lt;p&gt;Datasette aims to promote SQLite as a much better default format for publishing data. I would much rather download a .db file full of pre-structured data than download a .csv and then have to re-structure it as a separate piece of work.&lt;/p&gt;
&lt;p&gt;But interacting well with the enormous CSV ecosystem is essential. Datasette has &lt;a href="https://datasette.readthedocs.io/en/stable/csv_export.html"&gt;deep CSV export functionality&lt;/a&gt;: any data you can see, you can export - including the results of arbitrary SQL queries. If your query can be paginated Datasette can stream down every page in a single CSV file for you.&lt;/p&gt;
&lt;p&gt;Datasette’s sister-tool &lt;a href="https://github.com/simonw/csvs-to-sqlite"&gt;csvs-to-sqlite&lt;/a&gt; handles the other side of the equation: importing data from CSV into SQLite tables. And the &lt;a href="https://simonwillison.net/2018/Jan/17/datasette-publish/"&gt;Datasette Publish web application&lt;/a&gt; allows users to upload their CSVs and have them deployed directly to their own fresh Datasette instance - no command line required.&lt;/p&gt;
&lt;h3&gt;&lt;a id="SQL_as_an_API_language"&gt;&lt;/a&gt;SQL as an API language&lt;/h3&gt;
&lt;p&gt;A lot of people these days are excited about &lt;a href="https://graphql.org/"&gt;GraphQL&lt;/a&gt;, because it allows API clients to request exactly the data they need, including traversing into related objects in a single query.&lt;/p&gt;
&lt;p&gt;Guess what? SQL has been able to do that since the 1970s!&lt;/p&gt;
&lt;p&gt;There are a number of reasons most APIs don’t allow people to pass them arbitrary SQL queries:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Security: we don’t want people messing up our data&lt;/li&gt;
&lt;li&gt;Performance: what if someone sends an accidental (or deliberate) expensive query that exhausts our resources?&lt;/li&gt;
&lt;li&gt;Hiding implementation details: if people write SQL against our API we can never change the structure of our database tables&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Datasette has answers to all three.&lt;/p&gt;
&lt;p&gt;On security: the data is read-only, using SQLite’s immutable mode. You can’t damage it with a query - INSERTs and UPDATEs will simply throw harmless errors.&lt;/p&gt;
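&lt;p&gt;You can see the same principle with Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module by opening a database read-only via a URI connection string. This is a sketch of the idea, not Datasette's code - immutable mode goes further, additionally assuming the file will never change:&lt;/p&gt;

```python
import os
import sqlite3
import tempfile

# create a throwaway database file with one row in it
path = os.path.join(tempfile.mkdtemp(), "data.db")
conn = sqlite3.connect(path)
conn.execute("create table avengers (name text)")
conn.execute("insert into avengers values ('Hulk')")
conn.commit()
conn.close()

# re-open it read-only: SELECTs work, writes raise harmless errors
ro = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
assert ro.execute("select name from avengers").fetchall() == [("Hulk",)]
try:
    ro.execute("insert into avengers values ('Loki')")
except sqlite3.OperationalError as e:
    print("write rejected:", e)
```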
&lt;p&gt;On performance: SQLite has a mechanism for canceling queries that take longer than a certain threshold. Datasette sets this to one second by default, though you can &lt;a href="https://datasette.readthedocs.io/en/stable/config.html#sql-time-limit-ms"&gt;alter that configuration&lt;/a&gt; if you need to (I often bump it up to ten seconds when exploring multi-GB data on my laptop).&lt;/p&gt;
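&lt;p&gt;One way to implement a limit like that is SQLite's progress handler, a callback invoked periodically during query execution that can abort the query by returning a truthy value. A minimal sketch of the pattern - Datasette's actual implementation differs in its details:&lt;/p&gt;

```python
import sqlite3
import time


def execute_with_time_limit(conn, sql, time_limit_ms):
    """Run sql, aborting it if it runs longer than time_limit_ms."""
    deadline = time.monotonic() + time_limit_ms / 1000
    def check():
        # a truthy return value aborts the running query
        return time.monotonic() > deadline
    # invoke the handler every 1000 SQLite virtual machine instructions
    conn.set_progress_handler(check, 1000)
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.set_progress_handler(None, 1000)


conn = sqlite3.connect(":memory:")
# an intentionally expensive query: count ten million generated rows
slow = """
with recursive c(x) as (select 1 union all select x + 1 from c limit 10000000)
select count(*) from c
"""
try:
    execute_with_time_limit(conn, slow, 50)
except sqlite3.OperationalError:
    print("query cancelled after 50ms")
```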
&lt;p&gt;On hidden implementation details: since we are publishing static data rather than maintaining an evolving API, we can mostly ignore this issue. If you are really worried about it you can take advantage of &lt;a href="https://datasette.readthedocs.io/en/stable/sql_queries.html#canned-queries"&gt;canned queries&lt;/a&gt; and &lt;a href="https://datasette.readthedocs.io/en/stable/sql_queries.html#views"&gt;SQL view definitions&lt;/a&gt; to expose a carefully selected forward-compatible view into your data.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Optimistic_query_execution_with_time_limits"&gt;&lt;/a&gt;Optimistic query execution with time limits&lt;/h3&gt;
&lt;p&gt;I mentioned Datasette’s SQL time limits above. These aren’t just there to avoid malicious queries: the idea of “optimistic SQL evaluation” is baked into some of Datasette’s core features.&lt;/p&gt;
&lt;p&gt;Consider &lt;a href="https://datasette.readthedocs.io/en/stable/facets.html#suggested-facets"&gt;suggested facets&lt;/a&gt; - where Datasette inspects any table you view and tries to suggest columns that are worth faceting against.&lt;/p&gt;
&lt;p&gt;The way this works is Datasette loops over &lt;em&gt;every&lt;/em&gt; column in the table and runs a query to see if there are fewer than 20 unique values for that column. On a large table this could take a prohibitive amount of time, so Datasette sets an aggressive timeout on those queries: &lt;a href="https://datasette.readthedocs.io/en/stable/config.html#facet-suggest-time-limit-ms"&gt;just 50ms&lt;/a&gt;. If the query fails to run in that time it is silently dropped and the column is not listed as a suggested facet.&lt;/p&gt;
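&lt;p&gt;A simplified version of that check, leaving out the 50ms time limit and Datasette's other heuristics (the table and column names here are purely illustrative):&lt;/p&gt;

```python
import sqlite3


def suggest_facets(conn, table, max_values=20):
    """Suggest columns whose number of distinct values is small.

    A sketch only: the real feature also applies a 50ms time limit
    to each query and has additional heuristics.
    """
    total = conn.execute(f'select count(*) from "{table}"').fetchone()[0]
    columns = [row[1] for row in conn.execute(f'pragma table_info("{table}")')]
    suggested = []
    for column in columns:
        # fetch at most max_values + 1 distinct values; getting the
        # extra one back means the column has too many to facet by
        distinct = conn.execute(
            f'select distinct "{column}" from "{table}" limit {max_values + 1}'
        ).fetchall()
        n = len(distinct)
        # interesting facets have more than one value, but not too
        # many, and repeat at least some values across rows
        if n > 1 and max_values >= n and total > n:
            suggested.append(column)
    return suggested


conn = sqlite3.connect(":memory:")
conn.execute("create table riders (id integer primary key, name text, team text)")
conn.executemany(
    "insert into riders (name, team) values (?, ?)",
    [("A", "red"), ("B", "red"), ("C", "blue"), ("D", "blue"), ("E", "green")],
)
print(suggest_facets(conn, "riders"))  # ['team']
```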
&lt;p&gt;Datasette’s JSON API provides a mechanism for JavaScript applications to use that same pattern. If you add &lt;code&gt;?_timelimit=20&lt;/code&gt; to any Datasette API call, the underlying query will only get 20ms to run. If it goes over you’ll get a very fast error response from the API. This means you can design your own features that attempt to optimistically run expensive queries without damaging the performance of your app.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Keyset_pagination"&gt;&lt;/a&gt;Keyset pagination&lt;/h3&gt;
&lt;p&gt;SQL pagination using OFFSET/LIMIT has a fatal flaw: if you request page number 300 at 20 per page the underlying SQL engine needs to calculate and sort all 6,000 preceding rows before it can return the 20 you have requested.&lt;/p&gt;
&lt;p&gt;This does not scale at all well.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://use-the-index-luke.com/sql/partial-results/fetch-next-page"&gt;Keyset pagination&lt;/a&gt; (often known by other names, including cursor-based pagination) is a far more efficient way to paginate through data. It works against ordered data. Each page is returned with a token representing the last record you saw, then when you request the next page the engine merely has to filter for records that are greater than that tokenized value and scan through the next 20 of them.&lt;/p&gt;
&lt;p&gt;(Actually, it scans through 21. By requesting one more record than you intend to display you can detect if another page of results exists - if you ask for 21 but get back 20 or fewer you know you are on the last page.)&lt;/p&gt;
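&lt;p&gt;The core of the technique fits in a few lines against SQLite. This is an illustrative sketch with made-up table and column names, not Datasette's actual code:&lt;/p&gt;

```python
import sqlite3


def keyset_page(conn, after_id=None, page_size=20):
    """Return one page of rows plus a token for the next page."""
    # fetch one extra row so we can tell whether another page exists
    rows = conn.execute(
        "select id, name from items where id > ? order by id limit ?",
        (after_id if after_id is not None else 0, page_size + 1),
    ).fetchall()
    has_more = len(rows) == page_size + 1
    rows = rows[:page_size]
    next_token = rows[-1][0] if has_more else None
    return rows, next_token


conn = sqlite3.connect(":memory:")
conn.execute("create table items (id integer primary key, name text)")
conn.executemany(
    "insert into items values (?, ?)",
    [(i, f"item {i}") for i in range(1, 46)],  # 45 rows
)

page, token = keyset_page(conn)
print(len(page), token)  # 20 20
page, token = keyset_page(conn, after_id=token)
print(len(page), token)  # 20 40
page, token = keyset_page(conn, after_id=token)
print(len(page), token)  # 5 None
```

Each page is a cheap indexed range scan on the primary key, so page 300 costs the same as page 1.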
&lt;p&gt;Datasette’s table view includes a sophisticated implementation of keyset pagination.&lt;/p&gt;
&lt;p&gt;Datasette defaults to sorting by primary key (or SQLite rowid). This is perfect for efficient pagination: running a select against the primary key column for values greater than X is one of the fastest range scan queries any database can support. This allows users to paginate as deep as they like without paying the offset/limit performance penalty.&lt;/p&gt;
&lt;p&gt;This is also how the “export all rows as CSV” option works: when you select that option, Datasette opens a stream to your browser and internally starts keyset-pagination over the entire table. This keeps resource usage in check even while streaming back millions of rows.&lt;/p&gt;
&lt;p&gt;Here’s where Datasette gets fancy: it handles keyset pagination for any other sort order as well. If you sort by any column and click “next” you’ll be requesting the next set of rows after the last value you saw. And this even works for columns containing duplicate values: if you sort by such a column, Datasette actually sorts by that column combined with the primary key. The “next” pagination token it generates encodes both the sorted value and the primary key, allowing it to correctly serve you the next page when you click the link.&lt;/p&gt;
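&lt;p&gt;A sketch of that compound cursor using SQLite's row value comparisons (supported since SQLite 3.15; the table and column names are again illustrative):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table tracks (id integer primary key, album text)")
conn.executemany(
    "insert into tracks (album) values (?)",
    [("a",), ("a",), ("a",), ("b",), ("b",)],  # album values repeat
)


def page_by_album(conn, after=None, page_size=2):
    # after is an (album, id) pair from the last row previously seen;
    # together the pair is unique, so no rows are skipped or repeated
    # even though album values are duplicated
    if after is None:
        sql = "select album, id from tracks order by album, id limit ?"
        params = (page_size + 1,)
    else:
        sql = ("select album, id from tracks where (album, id) > (?, ?) "
               "order by album, id limit ?")
        params = (after[0], after[1], page_size + 1)
    rows = conn.execute(sql, params).fetchall()
    has_more = len(rows) == page_size + 1
    rows = rows[:page_size]
    return rows, (rows[-1] if has_more else None)


rows, token = page_by_album(conn)
print(rows, token)  # [('a', 1), ('a', 2)] ('a', 2)
rows, token = page_by_album(conn, after=token)
print(rows, token)  # [('a', 3), ('b', 4)] ('b', 4)
rows, token = page_by_album(conn, after=token)
print(rows, token)  # [('b', 5)] None
```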
&lt;p&gt;Try clicking “next” &lt;a href="https://latest.datasette.io/fixtures/sortable?_sort_desc=sortable"&gt;on this page&lt;/a&gt; to see keyset pagination against a sorted column in action.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Interactive_demos_based_on_the_unit_tests"&gt;&lt;/a&gt;Interactive demos based on the unit tests&lt;/h3&gt;
&lt;p&gt;I love interactive demos. I decided it would be useful if every single release of Datasette had a permanent interactive demo illustrating its features.&lt;/p&gt;
&lt;p&gt;Thanks to Zeit Now, this was pretty easy to set up. I’ve actually taken it a step further: every successful push to master on GitHub is also deployed to a permanent URL.&lt;/p&gt;
&lt;p&gt;Some examples:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://latest.datasette.io/"&gt;https://latest.datasette.io/&lt;/a&gt; - the most recent commit to Datasette master. You can see the currently deployed commit hash on &lt;a href="https://latest.datasette.io/-/versions"&gt;https://latest.datasette.io/-/versions&lt;/a&gt; and compare it to &lt;a href="https://github.com/simonw/datasette/commits"&gt;https://github.com/simonw/datasette/commits&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://v0-25.datasette.io/"&gt;https://v0-25.datasette.io/&lt;/a&gt; is a permanent URL to the 0.25 tagged release of Datasette. See also &lt;a href="https://v0-24.datasette.io/"&gt;https://v0-24.datasette.io/&lt;/a&gt; and &lt;a href="https://v0-23-2.datasette.io/"&gt;https://v0-23-2.datasette.io/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://700d83d.datasette.io/-/versions"&gt;https://700d83d.datasette.io/-/versions&lt;/a&gt; is a permanent URL to the code from this commit: &lt;a href="https://github.com/simonw/datasette/commit/700d83d"&gt;https://github.com/simonw/datasette/commit/700d83d&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The database that is used for this demo is the exact same database that is created by Datasette’s &lt;a href="https://github.com/simonw/datasette/blob/master/tests/fixtures.py"&gt;unit test fixtures&lt;/a&gt;. The unit tests are already designed to exercise every feature, so reusing them for a live demo makes a lot of sense.&lt;/p&gt;
&lt;p&gt;You can view this test database on your own machine by checking out the full Datasette repository from GitHub and running the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;python tests/fixtures.py fixtures.db metadata.json
datasette fixtures.db -m metadata.json
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here’s &lt;a href="https://github.com/simonw/datasette/blob/96af802352e49e35751e295e9846aa39c5e22311/.travis.yml#L23-L42"&gt;the code in the Datasette Travis CI configuration&lt;/a&gt; that deploys a live demo for every commit and every released tag.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Documentation_unit_tests"&gt;&lt;/a&gt;Documentation unit tests&lt;/h3&gt;
&lt;p&gt;I wrote about the &lt;a href="https://simonwillison.net/2018/Jul/28/documentation-unit-tests/"&gt;Documentation unit tests&lt;/a&gt; pattern back in July.&lt;/p&gt;
&lt;p&gt;Datasette’s unit tests &lt;a href="https://github.com/simonw/datasette/blob/master/tests/test_docs.py"&gt;include some assertions&lt;/a&gt; that ensure that every plugin hook, configuration setting and underlying view class is mentioned in the documentation. A commit or pull request that adds or modifies these without also updating the documentation (or at least ensuring there is a corresponding heading in the docs) will fail its tests.&lt;/p&gt;
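&lt;p&gt;The heart of the pattern is tiny: gather the names you care about, then assert that each one appears somewhere in the documentation source. A distilled sketch - the hook names here are real Datasette plugin hooks, but the documentation snippet is invented and the real tests are more sophisticated about matching labelled sections:&lt;/p&gt;

```python
def undocumented(names, docs_text):
    """Return every name that never appears in the documentation."""
    return [name for name in names if name not in docs_text]


# real hook names, paired with a made-up documentation snippet that
# is missing one of them
hooks = ["prepare_connection", "prepare_jinja2_environment", "extra_css_urls"]
docs = """
prepare_connection(conn)
    Modify the SQLite connection when it is created...
extra_css_urls()
    Extra CSS URLs to include on every page...
"""

missing = undocumented(hooks, docs)
print("undocumented hooks:", missing)  # ['prepare_jinja2_environment']
```

In a test suite this becomes `assert undocumented(hooks, docs) == []`, so any commit adding a hook without documenting it fails CI.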
&lt;h3&gt;&lt;a id="Learning_more"&gt;&lt;/a&gt;Learning more&lt;/h3&gt;
&lt;p&gt;Datasette’s &lt;a href="http://datasette.readthedocs.io/"&gt;documentation&lt;/a&gt; is in pretty good shape now, and &lt;a href="https://datasette.readthedocs.io/en/stable/changelog.html"&gt;the changelog&lt;/a&gt; provides a detailed overview of new features that I’ve added to the project. I presented Datasette at the PyBay conference in August and I’ve published &lt;a href="https://static.simonwillison.net/static/2018/pybay-datasette/"&gt;my annotated slides&lt;/a&gt; from that talk. I was &lt;a href="https://changelog.com/podcast/296#t=00:54:45"&gt;interviewed about Datasette&lt;/a&gt; for the Changelog podcast in May and &lt;a href="https://simonwillison.net/2018/May/9/changelog/"&gt;my notes from that conversation&lt;/a&gt; include some of my favourite demos.&lt;/p&gt;
&lt;p&gt;Datasette now has an official Twitter account - you can follow &lt;a href="https://twitter.com/datasetteproj"&gt;@datasetteproj&lt;/a&gt; there for updates about the project.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/testing"&gt;testing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/baked-data"&gt;baked-data&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="projects"/><category term="sqlite"/><category term="testing"/><category term="datasette"/><category term="baked-data"/></entry></feed>