<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: zeit-now</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/zeit-now.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2021-03-21T05:50:25+00:00</updated><author><name>Simon Willison</name></author><entry><title>Weeknotes: django-sql-dashboard widgets</title><link href="https://simonwillison.net/2021/Mar/21/django-sql-dashboard-widgets/#atom-tag" rel="alternate"/><published>2021-03-21T05:50:25+00:00</published><updated>2021-03-21T05:50:25+00:00</updated><id>https://simonwillison.net/2021/Mar/21/django-sql-dashboard-widgets/#atom-tag</id><summary type="html">
    &lt;p&gt;A few small releases this week, for &lt;code&gt;django-sql-dashboard&lt;/code&gt;, &lt;code&gt;datasette-auth-passwords&lt;/code&gt; and &lt;code&gt;datasette-publish-vercel&lt;/code&gt;.&lt;/p&gt;
&lt;h4&gt;django-sql-dashboard widgets and permissions&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/django-sql-dashboard"&gt;django-sql-dashboard&lt;/a&gt;, my subset-of-Datasette-for-Django-and-PostgreSQL continues to come together.&lt;/p&gt;
&lt;p&gt;New this week: widgets and permissions.&lt;/p&gt;
&lt;p&gt;To recap: this Django app borrows some ideas from Datasette: it encourages you to create a read-only PostgreSQL user and grant authenticated users the ability to run one or more raw SQL queries directly against your database.&lt;/p&gt;
&lt;p&gt;You can execute more than one SQL query and combine them into a saved dashboard, which will then show multiple tables containing the results.&lt;/p&gt;
&lt;p&gt;This week I added support for dashboard widgets. You can construct SQL queries to return specific column patterns which will then be rendered on the page in different ways.&lt;/p&gt;
&lt;p&gt;There are four widgets at the moment: "big number", bar chart, HTML and Markdown.&lt;/p&gt;
&lt;p&gt;Big number is the simplest: define a SQL query that returns two columns called &lt;code&gt;label&lt;/code&gt; and &lt;code&gt;big_number&lt;/code&gt; and the dashboard will display that result as a big number:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;select&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Entries&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; label, &lt;span class="pl-c1"&gt;count&lt;/span&gt;(&lt;span class="pl-k"&gt;*&lt;/span&gt;) &lt;span class="pl-k"&gt;as&lt;/span&gt; big_number &lt;span class="pl-k"&gt;from&lt;/span&gt; blog_entry;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img alt="Entries: 2804 - an example of a big number display" src="https://static.simonwillison.net/static/2021/dashboard-big-number.png" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Bar chart is more sophisticated: return columns named &lt;code&gt;bar_label&lt;/code&gt; and &lt;code&gt;bar_quantity&lt;/code&gt; to display a bar chart of the results:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;select&lt;/span&gt;
  to_char(date_trunc(&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;month&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;, created), &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;YYYY-MM&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;) &lt;span class="pl-k"&gt;as&lt;/span&gt; bar_label,
  &lt;span class="pl-c1"&gt;count&lt;/span&gt;(&lt;span class="pl-k"&gt;*&lt;/span&gt;) &lt;span class="pl-k"&gt;as&lt;/span&gt; bar_quantity
&lt;span class="pl-k"&gt;from&lt;/span&gt;
  blog_entry
&lt;span class="pl-k"&gt;group by&lt;/span&gt;
  bar_label
&lt;span class="pl-k"&gt;order by&lt;/span&gt;
  &lt;span class="pl-c1"&gt;count&lt;/span&gt;(&lt;span class="pl-k"&gt;*&lt;/span&gt;) &lt;span class="pl-k"&gt;desc&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img alt="A bar chart showing the result of that query" src="https://static.simonwillison.net/static/2021/dashboard-bar-chart.png" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;HTML and Markdown are simpler: they display the rendered HTML or Markdown, after filtering it through the Bleach library to strip any harmful elements or scripts.&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;select&lt;/span&gt;
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;## Ten most recent blogmarks (of &lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; 
  &lt;span class="pl-k"&gt;||&lt;/span&gt; &lt;span class="pl-c1"&gt;count&lt;/span&gt;(&lt;span class="pl-k"&gt;*&lt;/span&gt;) &lt;span class="pl-k"&gt;||&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt; total)&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-k"&gt;as&lt;/span&gt; markdown &lt;span class="pl-k"&gt;from&lt;/span&gt; blog_blogmark;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I'm running the dashboard application on this blog, and I've set up &lt;a href="https://simonwillison.net/dashboard/example-dashboard/"&gt;an example dashboard&lt;/a&gt; here that illustrates the different types of widget.&lt;/p&gt;
&lt;p&gt;&lt;img alt="An example dashboard with several different widgets" src="https://static.simonwillison.net/static/2021/dashboard-example.png" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Defining custom widgets is easy: take the column names you would like to respond to, sort them alphabetically, join them with hyphens and create a custom widget in a template file with that name.&lt;/p&gt;
&lt;p&gt;So if you wanted to build a widget that looks for &lt;code&gt;label&lt;/code&gt; and &lt;code&gt;geojson&lt;/code&gt; columns and renders that data on a &lt;a href="https://leafletjs.com/"&gt;Leaflet map&lt;/a&gt;, you would create a &lt;code&gt;geojson-label.html&lt;/code&gt; template and drop it into your Django &lt;code&gt;templates/django-sql-dashboard/widgets&lt;/code&gt; folder. See &lt;a href="https://django-sql-dashboard.readthedocs.io/en/latest/widgets.html#custom-widgets"&gt;the custom widgets documentation&lt;/a&gt; for details.&lt;/p&gt;
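&lt;p&gt;The matching rule is simple enough to express in a few lines of Python - a sketch (not the actual library code) of how a query's column signature becomes a widget template name:&lt;/p&gt;

```python
def widget_template_name(columns):
    # django-sql-dashboard picks a widget template by sorting the
    # returned column names alphabetically and joining them with hyphens
    return "-".join(sorted(columns)) + ".html"

# A query returning "label" and "geojson" columns maps to geojson-label.html
print(widget_template_name(["label", "geojson"]))
```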
&lt;p&gt;Which reminds me: I decided a README wasn't quite enough space for documentation here, so I started a &lt;a href="https://django-sql-dashboard.readthedocs.io/"&gt;Read The Docs documentation site&lt;/a&gt; for the project.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://docs.datasette.io/"&gt;Datasette&lt;/a&gt; and &lt;a href="https://sqlite-utils.datasette.io/"&gt;sqlite-utils&lt;/a&gt; both use &lt;a href="https://www.sphinx-doc.org/"&gt;Sphinx&lt;/a&gt; and &lt;a href="https://docutils.sourceforge.io/rst.html"&gt;reStructuredText&lt;/a&gt; for their documentation.&lt;/p&gt;
&lt;p&gt;For &lt;code&gt;django-sql-dashboard&lt;/code&gt; I've decided to try out Sphinx and Markdown instead, using &lt;a href="https://myst-parser.readthedocs.io/"&gt;MyST&lt;/a&gt; - a Markdown flavour and parser for Sphinx.&lt;/p&gt;
&lt;p&gt;I picked this because I want to add inline help to &lt;code&gt;django-sql-dashboard&lt;/code&gt;, and since it ships with Markdown as a dependency already (to power the Markdown widget) my hope is that using Markdown for the documentation will allow me to ship some of the user-facing docs as part of the application itself. But it's also a fun excuse to try out MyST, which so far is working exactly as advertised.&lt;/p&gt;
&lt;p&gt;I've seen people in the past avoid Sphinx entirely because they preferred Markdown to reStructuredText, so MyST feels like an important addition to the Python documentation ecosystem.&lt;/p&gt;
&lt;h4&gt;HTTP Basic authentication&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://datasette.io/plugins/datasette-auth-passwords"&gt;datasette-auth-passwords&lt;/a&gt; implements password-based authentication to Datasette. The plugin defaults to providing a username and password login form which sets a signed cookie identifying the current user.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/datasette-auth-passwords/releases/tag/0.4"&gt;Version 0.4&lt;/a&gt; introduces &lt;a href="https://github.com/simonw/datasette-auth-passwords/issues/15"&gt;optional support&lt;/a&gt; for HTTP Basic authentication instead - where the user's browser handles the authentication prompt.&lt;/p&gt;
&lt;p&gt;Basic auth has some disadvantages - most notably that it doesn't support logout without the user entirely closing down their browser. But it's useful for a number of reasons:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;It's easy to protect every resource on a website with it - including static assets. Adding &lt;code&gt;"http_basic_auth": true&lt;/code&gt; to your plugin configuration enables this protection, covering all of Datasette's resources.&lt;/li&gt;
&lt;li&gt;It's much easier to authenticate with from automated scripts. &lt;code&gt;curl&lt;/code&gt;, &lt;code&gt;requests&lt;/code&gt; and &lt;code&gt;httpx&lt;/code&gt; all have simple built-in support for passing basic authentication usernames and passwords, which makes it a useful target for scripting - without having to install an additional authentication plugin such as &lt;a href="https://datasette.io/plugins/datasette-auth-tokens"&gt;datasette-auth-tokens&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
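&lt;p&gt;Those scripting clients all end up sending the same standard &lt;code&gt;Authorization&lt;/code&gt; header - a quick sketch of what &lt;code&gt;curl -u&lt;/code&gt; or the &lt;code&gt;auth=&lt;/code&gt; shorthand in requests and httpx produces under the hood:&lt;/p&gt;

```python
import base64

# requests and httpx accept auth=("username", "password") directly;
# this is the Authorization header that shorthand turns into
def basic_auth_header(username, password):
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return {"Authorization": "Basic " + token}

print(basic_auth_header("root", "password!"))
```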
&lt;p&gt;I'm continuing to flesh out authentication options for Datasette, and adding this to &lt;code&gt;datasette-auth-passwords&lt;/code&gt; is one of those small improvements that should pay off long into the future.&lt;/p&gt;
&lt;h4&gt;A fix for datasette-publish-vercel&lt;/h4&gt;
&lt;p&gt;Datasette instances published to &lt;a href="https://vercel.com/"&gt;Vercel&lt;/a&gt; using the &lt;a href="https://datasette.io/plugins/datasette-publish-vercel"&gt;datasette-publish-vercel&lt;/a&gt; plugin have previously been affected by an obscure Vercel bug: &lt;a href="https://github.com/vercel/vercel/issues/5575"&gt;characters such as + in the query string&lt;/a&gt; were being lost due to Vercel unescaping encoded characters before the request got to the Python application server.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://vercel.com/changelog/correcting-request-urls-with-python-serverless-functions"&gt;Vercel fixed this&lt;/a&gt; earlier this month, and the latest release of &lt;code&gt;datasette-publish-vercel&lt;/code&gt; includes their fix by switching to the new &lt;code&gt;@vercel/python&lt;/code&gt; builder. Thanks &lt;a href="https://twitter.com/styfle"&gt;@styfle&lt;/a&gt; from Vercel for shepherding this fix through!&lt;/p&gt;
&lt;h4&gt;New photos on Niche Museums&lt;/h4&gt;
&lt;p&gt;My Niche Museums project has been in hibernation since the start of the pandemic. Now that vaccines are rolling out it feels like there might be an end to this thing, so I've started thinking about my museum hobby again.&lt;/p&gt;
&lt;p&gt;I added some new photos to the site today - on the entries for &lt;a href="https://www.niche-museums.com/17"&gt;Novelty Automation&lt;/a&gt;, &lt;a href="https://www.niche-museums.com/21"&gt;DEVIL-ish Little Things&lt;/a&gt;, &lt;a href="https://www.niche-museums.com/24"&gt;Evergreen Aviation &amp;amp; Space Museum&lt;/a&gt; and &lt;a href="https://www.niche-museums.com/33"&gt;California State Capitol Dioramas&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Hopefully someday soon I'll get to visit and add an entirely new museum!&lt;/p&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/django-sql-dashboard"&gt;django-sql-dashboard&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/django-sql-dashboard/releases/tag/0.4a1"&gt;0.4a1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/django-sql-dashboard/releases"&gt;10 releases total&lt;/a&gt;) - 2021-03-21
&lt;br /&gt;Django app for building dashboards using raw SQL queries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-publish-vercel"&gt;datasette-publish-vercel&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-publish-vercel/releases/tag/0.9.2"&gt;0.9.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-publish-vercel/releases"&gt;14 releases total&lt;/a&gt;) - 2021-03-20
&lt;br /&gt;Datasette plugin for publishing data using Vercel&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-auth-passwords"&gt;datasette-auth-passwords&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-auth-passwords/releases/tag/0.4"&gt;0.4&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-auth-passwords/releases"&gt;9 releases total&lt;/a&gt;) - 2021-03-19
&lt;br /&gt;Datasette plugin for authentication using passwords&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/authentication"&gt;authentication&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dashboard"&gt;dashboard&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/django"&gt;django&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/postgresql"&gt;postgresql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zeit-now"&gt;zeit-now&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/django-sql-dashboard"&gt;django-sql-dashboard&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sphinx-docs"&gt;sphinx-docs&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="authentication"/><category term="dashboard"/><category term="django"/><category term="postgresql"/><category term="projects"/><category term="zeit-now"/><category term="weeknotes"/><category term="django-sql-dashboard"/><category term="sphinx-docs"/></entry><entry><title>Weeknotes: Hacking on 23 different projects</title><link href="https://simonwillison.net/2020/Apr/16/weeknotes-hacking-23-different-projects/#atom-tag" rel="alternate"/><published>2020-04-16T05:03:11+00:00</published><updated>2020-04-16T05:03:11+00:00</updated><id>https://simonwillison.net/2020/Apr/16/weeknotes-hacking-23-different-projects/#atom-tag</id><summary type="html">
    &lt;p&gt;I wrote a lot of code this week: 184 commits over 23 repositories! I've also started falling for Zeit Now v2, having found workarounds for some of my biggest problems with it.&lt;/p&gt;

&lt;h3&gt;Better Datasette on Zeit Now v2&lt;/h3&gt;

&lt;p&gt;Last week I &lt;a href="https://simonwillison.net/2020/Apr/8/weeknotes-zeit-now-v2/"&gt;bemoaned the loss of Zeit Now v1&lt;/a&gt; and documented &lt;a href="https://simonwillison.net/2020/Apr/8/weeknotes-zeit-now-v2/#hello-zeit-now-v2"&gt;my initial explorations&lt;/a&gt; of Zeit Now v2 with respect to Datasette.&lt;/p&gt;

&lt;p&gt;My favourite thing about Now v1 was that it ran from Dockerfiles, which gave me complete control over the versions of everything in my deployment environment.&lt;/p&gt;

&lt;p&gt;Now v2 runs on AWS Lambda, which means you are mostly stuck with what Zeit's flavour of Lambda gives you. This currently means Python 3.6 (not too terrible - Datasette fully supports it) and a positively ancient SQLite - 3.7.17 from May 2013.&lt;/p&gt;

&lt;p&gt;Lambda runs on Amazon Linux. Charles Leifer maintains a package called &lt;a href="https://github.com/coleifer/pysqlite3/"&gt;pysqlite3&lt;/a&gt; which bundles the latest version of SQLite3 as a standalone Python package, and includes a &lt;code&gt;pysqlite3-binary&lt;/code&gt; package precompiled for Linux. Could it work on Amazon Linux...?&lt;/p&gt;

&lt;p&gt;It turns out it does! A &lt;a href="https://github.com/simonw/datasette-publish-now/commit/529f978beeccbb45240d398a3bf24ed9d77ebd55"&gt;one-line change&lt;/a&gt; (not including tests) to my &lt;a href="https://github.com/simonw/datasette-publish-now"&gt;datasette-publish-now&lt;/a&gt; plugin means it now deploys Datasette on Now v2 &lt;a href="https://datasette-public.now.sh/-/versions"&gt;with SQLite 3.31.1&lt;/a&gt; - the &lt;a href="https://www.sqlite.org/changes.html#version_3_31_0"&gt;latest release&lt;/a&gt; from January this year, with window functions and all kinds of other goodness.&lt;/p&gt;
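&lt;p&gt;For anyone wanting to apply the same trick in their own code: the usual pattern (a sketch, not the exact line from the plugin) is to try the bundled module first and fall back to the standard library:&lt;/p&gt;

```python
# Prefer pysqlite3 (which bundles a recent SQLite) when it is installed,
# falling back to the standard library sqlite3 module otherwise
try:
    import pysqlite3 as sqlite3
except ImportError:
    import sqlite3

print(sqlite3.sqlite_version)
```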

&lt;p&gt;This means that Now v2 is back to being a really solid option for hosting Datasette instances. You get scale-to-zero, crazily low prices and really fast cold-boot times. It can only take databases up to around 50MB - if you need more space than that you're better off with &lt;a href="https://datasette.readthedocs.io/en/stable/publish.html#publishing-to-google-cloud-run"&gt;Cloud Run&lt;/a&gt; - but it's a great option for smaller data.&lt;/p&gt;

&lt;p&gt;I released &lt;a href="https://github.com/simonw/datasette-publish-now/releases"&gt;a few versions of datasette-publish-now&lt;/a&gt; as a result of this research. I plan to release the first non-alpha version at the same time as Datasette 0.40.&lt;/p&gt;

&lt;h3&gt;Various projects ported to Now v2 or Cloud Run&lt;/h3&gt;

&lt;p&gt;I had over 100 projects running on Now v1 that needed updating or deleting in time for that platform's shutdown in August. I've been porting some of them very quickly using &lt;code&gt;datasette-publish-now&lt;/code&gt;, but a few have been more work. Some highlights from this week:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;&lt;a href="https://ftfy.now.sh/"&gt;ftfy.now.sh&lt;/a&gt;, my web app that takes a string of broken unicode and figures out the sequence of transformations you can use to make sense of it (built on the incredible &lt;a href="https://github.com/LuminosoInsight/python-ftfy"&gt;FTFY Python library&lt;/a&gt; by Robyn Speer) has been upgraded to Now v2 - &lt;a href="https://github.com/simonw/ftfy-web"&gt;repo here&lt;/a&gt;.&lt;/li&gt;&lt;li&gt;&lt;a href="https://gzthermal.now.sh"&gt;gzthermal.now.sh&lt;/a&gt; offers a web interface to the &lt;code&gt;gzthermal&lt;/code&gt; gzip visualization tool, released by caveman &lt;a href="https://encode.su/threads/1889-gzthermal-pseudo-thermal-view-of-Gzip-Deflate-compression-efficiency"&gt;on the encode.ru (now encode.su) forum&lt;/a&gt;. My &lt;a href="https://github.com/simonw/gzthermal-web"&gt;repo is here&lt;/a&gt;.&lt;/li&gt;&lt;li&gt;My &lt;a href="https://github.com/simonw/cryptozoology"&gt;crowdsourced directory of range maps of cryptozoological creatures&lt;/a&gt; is now running on Cloud Run (I haven't figured out a way to run SpatiaLite on Now v2 yet).&lt;/li&gt;&lt;li&gt;The &lt;a href="https://datasette-sqlite-fts4.datasette.io/24ways-fts4-52e8a02?sql=select%0D%0A++++title%2C+author%2C%0D%0A++++decode_matchinfo%28matchinfo%28articles_fts%2C+%22pcx%22%29%29%2C%0D%0A++++json_object%28%22pre%22%2C+annotate_matchinfo%28matchinfo%28articles_fts%2C+%22pcxnalyb%22%29%2C+%22pcxnalyb%22%29%29%0D%0Afrom%0D%0A++++articles_fts%0D%0Awhere%0D%0A++++articles_fts+match+%3Asearch&amp;amp;search=jquery+maps"&gt;datasette-sqlite-fts4.datasette.io&lt;/a&gt; demo instance I used for explanations in &lt;a href="https://simonwillison.net/2019/Jan/7/exploring-search-relevance-algorithms-sqlite/"&gt;Exploring search relevance algorithms with SQLite&lt;/a&gt;.&lt;/li&gt;&lt;li&gt;The demo instance used for &lt;a href="https://github.com/simonw/datasette-jellyfish"&gt;datasette-jellyfish&lt;/a&gt; is on Now 
v2.&lt;/li&gt;&lt;li&gt;The demo for &lt;a href="https://github.com/simonw/datasette-jq"&gt;datasette-jq&lt;/a&gt; had to move to Cloud Run, because I couldn't install &lt;a href="https://github.com/doloopwhile/pyjq"&gt;pyjq&lt;/a&gt; on Now v2.&lt;/li&gt;&lt;/ul&gt;

&lt;h3&gt;big-local-datasette&lt;/h3&gt;

&lt;p&gt;I've been collaborating with the &lt;a href="https://biglocalnews.org/"&gt;Big Local&lt;/a&gt; team at Stanford on a number of projects related to the Covid-19 situation. It's not quite open to the public yet but I've been building a Datasette instance which shares data from the "open projects" maintained by that team.&lt;/p&gt;

&lt;p&gt;The implementation fits &lt;a href="https://simonwillison.net/2020/Jan/21/github-actions-cloud-run/"&gt;a common pattern&lt;/a&gt; for me: a &lt;a href="https://github.com/simonw/big-local-datasette/blob/afcb885b3e746d6380f4ad6bab899190b461975d/.github/workflows/deploy.yml"&gt;scheduled GitHub Action&lt;/a&gt; which fetches project data from a GraphQL API, seeks out CSV files which have changed (using HTTP HEAD requests to check their ETags), loads the CSV into SQLite tables and publishes the resulting database using &lt;code&gt;datasette publish cloudrun&lt;/code&gt;.&lt;/p&gt;
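&lt;p&gt;The ETag check is cheap because a HEAD request never downloads the CSV body - a minimal sketch of that step (hypothetical helper names, not the actual workflow code):&lt;/p&gt;

```python
import urllib.request

def fetch_etag(url):
    # HEAD request: fetch the response headers without downloading the body
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return response.headers.get("ETag")

def should_reimport(current_etag, previous_etag):
    # Re-import if the server stopped sending ETags or the value changed
    return current_etag is None or current_etag != previous_etag
```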

&lt;p&gt;There's one interesting new twist: I'm fetching the existing database files on every run using my new &lt;a href="https://simonwillison.net/2020/Apr/14/datasette-clone/"&gt;datasette-clone tool&lt;/a&gt; (written for this project), applying changes to them and then only publishing if the resulting MD5 sums have changed since last time.&lt;/p&gt;
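&lt;p&gt;The "has anything changed?" test boils down to hashing the SQLite file before and after the import - a sketch of the comparison (assumed logic, not the actual script):&lt;/p&gt;

```python
import hashlib

def md5_digest(path):
    # Hash the database file in chunks so large files stay memory-friendly
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def needs_publish(path, previous_digest):
    # Only redeploy when the rebuilt database differs from the last deploy
    return md5_digest(path) != previous_digest
```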

&lt;p&gt;It seems to work well, and I'm excited about this technique as a way of incrementally updating existing databases using stateless code running in a GitHub Action.&lt;/p&gt;

&lt;h3&gt;Datasette Cloud&lt;/h3&gt;

&lt;p&gt;I continue to work on the invite-only alpha of my SaaS Datasette platform, Datasette Cloud. This week I ported the CI and deployment scripts from GitLab to GitHub Actions, mainly to try and reduce the variety of CI systems I'm working with (I now have projects live on three: Travis, Circle CI and GitHub Actions).&lt;/p&gt;

&lt;p&gt;I've also been figuring out ways of supporting API tokens for making requests to authentication-protected Datasette instances. I shipped small releases of &lt;a href="https://github.com/simonw/datasette-auth-github/releases/tag/0.12"&gt;datasette-auth-github&lt;/a&gt; and &lt;a href="https://github.com/simonw/datasette-auth-existing-cookies/releases/tag/0.7"&gt;datasette-auth-existing-cookies&lt;/a&gt; to support this.&lt;/p&gt;

&lt;p&gt;In tinkering with Datasette Cloud I also shipped an upgrade to &lt;a href="https://github.com/simonw/datasette-mask-columns"&gt;datasette-mask-columns&lt;/a&gt;, which now shows visible REDACTED text on redacted columns in table view.&lt;/p&gt;

&lt;h3&gt;Miscellaneous&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;My &lt;a href="https://covid-19.datasettes.com/"&gt;covid-19.datasettes.com&lt;/a&gt; project now also imports data &lt;a href="https://github.com/simonw/covid-19-datasette/issues/11"&gt;from the LA Times&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;I added &lt;code&gt;.rows_where(..., order_by="column")&lt;/code&gt; in &lt;a href="https://sqlite-utils.readthedocs.io/en/stable/changelog.html#v2-6"&gt;release 2.6 of sqlite-utils&lt;/a&gt;.&lt;/li&gt;&lt;li&gt;I shipped a &lt;a href="https://github.com/simonw/paginate-json/releases/tag/0.3"&gt;new release&lt;/a&gt; of &lt;a href="https://github.com/simonw/paginate-json"&gt;paginate-json&lt;/a&gt;, a tool I built primarily for paginating through the GitHub API and piping the results to &lt;code&gt;sqlite-utils&lt;/code&gt;.&lt;/li&gt;&lt;li&gt;I fixed a minor bug &lt;a href="https://github.com/simonw/datasette/issues/724"&gt;with Datasette's --plugin-secret mechanism&lt;/a&gt; and added &lt;a href="https://github.com/simonw/datasette/issues/727"&gt;a CSS customization hook&lt;/a&gt; for the canned query page.&lt;/li&gt;&lt;li&gt;I built a &lt;a href="https://github.com/simonw/heic-to-jpeg"&gt;HEIC to JPEG converting proxy&lt;/a&gt; as part of my ongoing mission to eventually liberate my photos from Apple Photos and make them available to &lt;a href="https://simonwillison.net/tags/dogsheep/"&gt;Dogsheep&lt;/a&gt;. In doing so I &lt;a href="https://github.com/david-poirier-csn/pyheif/commit/8d03e0bf6dde6aa0317471792d698a30502f9e1d?short_path=04c6e90#diff-04c6e90faac2675aa89e2176d2eec7d8"&gt;contributed usage documentation&lt;/a&gt; to the pyheif Python library.&lt;/li&gt;&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zeit-now"&gt;zeit-now&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dogsheep"&gt;dogsheep&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette-cloud"&gt;datasette-cloud&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="projects"/><category term="zeit-now"/><category term="datasette"/><category term="dogsheep"/><category term="weeknotes"/><category term="datasette-cloud"/></entry><entry><title>Goodbye Zeit Now v1, hello datasette-publish-now - and talking to myself in GitHub issues</title><link href="https://simonwillison.net/2020/Apr/8/weeknotes-zeit-now-v2/#atom-tag" rel="alternate"/><published>2020-04-08T03:32:24+00:00</published><updated>2020-04-08T03:32:24+00:00</updated><id>https://simonwillison.net/2020/Apr/8/weeknotes-zeit-now-v2/#atom-tag</id><summary type="html">
    &lt;p&gt;This week I’ve been mostly dealing with the finally announced shutdown of Zeit Now v1. And having long-winded conversations with myself in GitHub issues.&lt;/p&gt;

&lt;h3&gt;How Zeit Now inspired Datasette&lt;/h3&gt;

&lt;p&gt;I first started experimenting with Zeit’s serverless &lt;a href="https://zeit.co/home"&gt;Now&lt;/a&gt; hosting platform back &lt;a href="https://simonwillison.net/2017/Oct/14/async-python-sanic-now/"&gt;in October 2017&lt;/a&gt;, when I used it to deploy &lt;a href="https://json-head.now.sh/"&gt;json-head.now.sh&lt;/a&gt; - an updated version of an API tool I originally built for Google App Engine &lt;a href="https://simonwillison.net/2008/Jul/29/jsonhead/"&gt;in July 2008&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I liked Zeit Now, a lot. Instant, inexpensive deploys of any stateless project that could be defined using a Dockerfile? Just type &lt;code&gt;now&lt;/code&gt; to deploy the project in your current directory? Every deployment gets its own permanent URL? Amazing!&lt;/p&gt;

&lt;p&gt;There was just one catch: since Now deployments are ephemeral, applications running on them need to be stateless. If you want a database, you need to involve another (potentially costly) service. It's a limitation shared by other scalable hosting solutions - Heroku, App Engine and so on. How much interesting stuff can you build without a database?&lt;/p&gt;

&lt;p&gt;I was musing about this in the shower one day (that &lt;a href="https://lifehacker.com/science-explains-why-our-best-ideas-come-in-the-shower-5987858"&gt;old cliche&lt;/a&gt; really happened for me) when I had a thought: sure, you can't write to a database... but if your data is read-only, why not bundle the database alongside the application code as part of the Docker image?&lt;/p&gt;

&lt;p&gt;Ever since I &lt;a href="https://simonwillison.net/2009/Mar/10/openplatform/"&gt;helped launch the Datablog&lt;/a&gt; at the Guardian back in 2009 I had been interested in finding better ways to publish data journalism datasets than CSV files or Google spreadsheets - so building something that could package and bundle read-only data was of extreme interest to me.&lt;/p&gt;

&lt;p&gt;In November 2017 I released &lt;a href="https://simonwillison.net/2017/Nov/13/datasette/"&gt;the first version&lt;/a&gt; of Datasette. The original idea was very much inspired by Zeit Now.&lt;/p&gt;

&lt;p&gt;I gave &lt;a href="https://www.youtube.com/watch?v=_uwrqB--eM4"&gt;a talk about Datasette&lt;/a&gt; at the Zeit Day conference in San Francisco in April 2018. Suffice to say I was a huge fan!&lt;/p&gt;

&lt;h3&gt;Goodbye, Zeit Now v1&lt;/h3&gt;

&lt;p&gt;In November 2018, Zeit &lt;a href="https://simonwillison.net/2018/Nov/19/smaller-python-docker-images/"&gt;announced Now v2&lt;/a&gt;. And it was, &lt;em&gt;different&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;v2 is an entirely different architecture from v1. Where v1 built on Docker containers, v2 is built on top of serverless functions - AWS Lambda in particular.&lt;/p&gt;

&lt;p&gt;I can see why Zeit did this. Lambda functions can launch from cold &lt;em&gt;way faster&lt;/em&gt; - v1's Docker infrastructure had tough cold-start times. They are much cheaper to run as well - crucial for Zeit given their &lt;a href="https://zeit.co/pricing"&gt;extremely generous pricing plans&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;But it was bad news for my projects. Lambdas are tightly size constrained, which is tough when you're bundling potentially large SQLite database files with your deployments.&lt;/p&gt;

&lt;p&gt;More importantly, in 2018 Amazon were deliberately excluding the Python &lt;code&gt;sqlite3&lt;/code&gt; standard library module from the Python Lambda environment! I guess they hadn't considered people who might want to work with read-only database files.&lt;/p&gt;

&lt;p&gt;So Datasette on Now v2 just wasn't going to work. Zeit kept v1 supported for the time being, but the writing was clearly on the wall.&lt;/p&gt;

&lt;p&gt;In April 2019 &lt;a href="https://cloud.google.com/blog/products/serverless/announcing-cloud-run-the-newest-member-of-our-serverless-compute-stack"&gt;Google announced Cloud Run&lt;/a&gt;, a serverless, scale-to-zero hosting environment based around Docker containers. In many ways it's Google's version of Zeit Now v1 - it has many of the characteristics I loved about v1, albeit with a clunkier developer experience and much more friction in assigning nice URLs to projects. Romain Primet &lt;a href="https://github.com/simonw/datasette/pull/434"&gt;contributed Cloud Run support to Datasette&lt;/a&gt; and it has since become my preferred hosting target for my new projects (see &lt;a href="https://simonwillison.net/2020/Jan/21/github-actions-cloud-run/"&gt;Deploying a data API using GitHub Actions and Cloud Run&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Last week, Zeit &lt;a href="https://twitter.com/simonw/status/1246300304917680128"&gt;finally announced&lt;/a&gt; the sunset date for v1. From the 1st of May new deploys won't be allowed, and on the 7th of August they'll be turning off the old v1 infrastructure and deleting all existing Now v1 deployments.&lt;/p&gt;

&lt;p&gt;I engaged in &lt;a href="https://twitter.com/simonw/status/1246300304917680128"&gt;an extensive Twitter conversation&lt;/a&gt; about this, where I praised Zeit's handling of the shutdown while bemoaning the loss of the v1 product I had loved so much.&lt;/p&gt;

&lt;h3 id="migrating-my-projects"&gt;Migrating my projects&lt;/h3&gt;

&lt;p&gt;My newer projects have been on Cloud Run for quite some time, but I still have a bunch of old projects that I care about and want to keep running past the v1 shutdown.&lt;/p&gt;

&lt;p&gt;The first project I ported was &lt;a href="https://latest.datasette.io/"&gt;latest.datasette.io&lt;/a&gt;, a live demo of Datasette which updates with the latest code any time I push to the Datasette master branch on GitHub.&lt;/p&gt;

&lt;p&gt;Any time I do some kind of ops task like this I've gotten into the habit of meticulously documenting every single step in comments on a GitHub issue. Here's &lt;a href="https://github.com/simonw/datasette/issues/705"&gt;the issue&lt;/a&gt; for porting latest.datasette.io to Cloud Run (and switching from Circle CI to GitHub Actions at the same time).&lt;/p&gt;

&lt;p&gt;My next project was &lt;a href="https://global-power-plants.datasettes.com/global-power-plants/global-power-plants"&gt;global-power-plants-datasette&lt;/a&gt;, a small project which takes a database of global power plants &lt;a href="https://www.wri.org/publication/global-power-plant-database"&gt;published by the World Resources Institute&lt;/a&gt; and publishes it using Datasette. It checks for new updates to &lt;a href="https://github.com/wri/global-power-plant-database"&gt;their repo&lt;/a&gt; once a day. I originally built it as a demo for &lt;a href="https://github.com/simonw/datasette-cluster-map"&gt;datasette-cluster-map&lt;/a&gt;, since it's fun seeing 33,000 power plants on a single map. Here's &lt;a href="https://github.com/simonw/global-power-plants-datasette/issues/1"&gt;that issue&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Having warmed up with these two, my next target was the most significant: porting my &lt;a href="https://www.niche-museums.com/"&gt;Niche Museums&lt;/a&gt; website.&lt;/p&gt;

&lt;p&gt;Niche Museums is the most heavily customized Datasette instance I've run anywhere - it incorporates custom templates, CSS and plugins.&lt;/p&gt;

&lt;p&gt;Here's &lt;a href="https://github.com/simonw/museums/issues/20"&gt;the tracking issue&lt;/a&gt; for porting it to Cloud Run. I ran into a few hurdles with DNS and TLS certificates, and I had to do &lt;a href="https://github.com/simonw/museums/issues/21"&gt;some additional work&lt;/a&gt; to ensure &lt;code&gt;niche-museums.com&lt;/code&gt; redirects to &lt;code&gt;www.niche-musums.com&lt;/code&gt;, but it's now fully migrated.&lt;/p&gt;

&lt;h3 id="hello-zeit-now-v2"&gt;Hello, Zeit Now v2&lt;/h3&gt;

&lt;p&gt;While &lt;a href="https://twitter.com/simonw/status/1246302021608591360"&gt;complaining about&lt;/a&gt; the lack of the essential &lt;code&gt;sqlite3&lt;/code&gt; module in Now v2's Python runtime, I figured it would be responsible to double-check that this was still true.&lt;/p&gt;

&lt;p&gt;It was not! Today Now's Python environment &lt;a href="https://twitter.com/simonw/status/1246600935289184256"&gt;includes sqlite3&lt;/a&gt; after all.&lt;/p&gt;

&lt;p&gt;Datasette's &lt;a href="https://datasette.readthedocs.io/en/0.39/plugins.html#publish-subcommand-publish"&gt;publish_subcommand() plugin hook&lt;/a&gt; lets plugins add new publishing targets to the &lt;code&gt;datasette publish&lt;/code&gt; command (I used it to build &lt;a href="https://github.com/simonw/datasette-publish-fly"&gt;datasette-publish-fly&lt;/a&gt; last month). How hard would it be to build a plugin for Zeit Now v2?&lt;/p&gt;

&lt;p&gt;I fired up a new &lt;a href="https://github.com/simonw/datasette/issues/717"&gt;lengthy talking-to-myself GitHub issue&lt;/a&gt; and started prototyping.&lt;/p&gt;

&lt;p&gt;Now v2 may not support Docker, but it does support the &lt;a href="https://asgi.readthedocs.io/en/latest/"&gt;ASGI Python standard&lt;/a&gt; (the asynchronous alternative to WSGI, shepherded by Andrew Godwin).&lt;/p&gt;

&lt;p&gt;Zeit are keen proponents of the &lt;a href="https://jamstack.org/"&gt;Jamstack&lt;/a&gt; approach, where websites are built using static pre-rendered HTML and JavaScript that calls out to APIs for dynamic data. v2 deployments are expected to consist of static HTML with "serverless functions" - standalone server-side scripts that live in an &lt;code&gt;api/&lt;/code&gt; directory by convention and are compiled into separate lambdas.&lt;/p&gt;

&lt;p&gt;Datasette works just fine without JavaScript, which means it needs to handle all of the URL routes for a site. Essentially I need to build a single function that runs the whole of Datasette, then route all incoming traffic to it.&lt;/p&gt;

&lt;p&gt;It took me a while to figure it out, but it turns out the Now v2 recipe for that is a &lt;code&gt;now.json&lt;/code&gt; file that looks like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;{
    "version": 2,
    "builds": [
        {
            "src": "index.py",
            "use": "@now/python"
        }
    ],
    "routes": [
        {
            "src": "(.*)",
            "dest": "index.py"
        }
    ]
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Thanks Aaron Boodman for &lt;a href="https://twitter.com/aboodman/status/1246605658067066882"&gt;the tip&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Given the above configuration, Zeit will install any Python dependencies in a &lt;code&gt;requirements.txt&lt;/code&gt; file, then treat an &lt;code&gt;app&lt;/code&gt; variable in the &lt;code&gt;index.py&lt;/code&gt; file as an ASGI application it should route all incoming traffic to. Exactly what I need to deploy Datasette!&lt;/p&gt;
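&lt;p&gt;To make the contract concrete, here's a minimal sketch of an &lt;code&gt;index.py&lt;/code&gt;. In the real plugin the &lt;code&gt;app&lt;/code&gt; variable is a configured Datasette ASGI application; the hand-rolled callable below is just my illustration of the interface the platform expects:&lt;/p&gt;

```python
# Minimal index.py sketch: @now/python looks for a module-level "app".
# Datasette would provide this for real; this hand-rolled ASGI callable
# just illustrates the interface being routed to.

async def app(scope, receive, send):
    # Only handle HTTP requests in this sketch
    assert scope["type"] == "http"
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [(b"content-type", b"text/plain")],
    })
    await send({
        "type": "http.response.body",
        "body": b"Hello from ASGI",
    })
```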

&lt;p&gt;This was everything I needed to build the new plugin. &lt;a href="https://github.com/simonw/datasette-publish-now"&gt;datasette-publish-now&lt;/a&gt; is the result.&lt;/p&gt;

&lt;p&gt;Here's &lt;a href="https://datasette-public.now.sh/_src"&gt;the generated source code&lt;/a&gt; for a project deployed using the plugin, showing how the underlyinng ASGI application is configured.&lt;/p&gt;

&lt;p&gt;It's currently an alpha - not every feature is supported (see &lt;a href="https://github.com/simonw/datasette-publish-now/milestone/1"&gt;this milestone&lt;/a&gt;) and it relies on a minor deprecated feature (which I've &lt;a href="https://github.com/zeit/now/discussions/4021"&gt;implored Zeit to reconsider&lt;/a&gt;) but it's already full-featured enough that I can start using it to upgrade some of my smaller existing Now projects.&lt;/p&gt;

&lt;p&gt;The first one I upgraded is one of my favourites: &lt;a href="https://polar-bears.now.sh/"&gt;polar-bears.now.sh&lt;/a&gt;, which visualizes tracking data from polar bear ear tags (using &lt;a href="https://github.com/simonw/datasette-cluster-map"&gt;datasette-cluster-map&lt;/a&gt;) that was &lt;a href="https://alaska.usgs.gov/products/data.php?dataid=130"&gt;published by the USGS Alaska Science Center, Polar Bear Research Program&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here's the command I used to deploy the site:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ pip install datasette-publish-now
$ datasette publish now2 polar-bears.db \
    --title "Polar Bear Ear Tags, 2009-2011" \
    --source "USGS Alaska Science Center, Polar Bear Research Program" \
    --source_url "https://alaska.usgs.gov/products/data.php?dataid=130" \
    --install datasette-cluster-map \
    --project=polar-bears&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;I exported a full list of my Now v1 projects from their handy &lt;a href="https://zeit.co/dashboard/active-v1-instances"&gt;active v1 instances&lt;/a&gt; page.&lt;/p&gt;

&lt;h3&gt;The rest of my projects&lt;/h3&gt;

&lt;p&gt;I scraped the page using the following JavaScript, constructed with the help of the &lt;a href="https://simonwillison.net/2020/Apr/7/new-developer-features-firefox-75/"&gt;instant evaluation&lt;/a&gt; console feature in Firefox 75:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;console.log(
  JSON.stringify(
    Array.from(
      Array.from(
        document.getElementsByTagName("table")[1].
          getElementsByTagName("tr")
      ).slice(1).map(
        (tr) =&amp;gt;
          Array.from(
            tr.getElementsByTagName("td")
          ).map((td) =&amp;gt; td.innerText)
      )
    )
  )
);&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Then I loaded them into Datasette for analysis.&lt;/p&gt;

&lt;p&gt;After filtering out the &lt;code&gt;datasette-latest-commithash.now.sh&lt;/code&gt; projects I had deployed for every push to GitHub, it turns out I have 34 distinct projects running there.&lt;/p&gt;

&lt;p&gt;I won't port all of them, but given &lt;code&gt;datasette-publish-now&lt;/code&gt; I should be able to port the ones that I care about without too much trouble.&lt;/p&gt;

&lt;h3 id="git-bisect"&gt;Debugging Datasette with git bisect run&lt;/h3&gt;

&lt;p&gt;I fixed two bugs in Datasette this week using &lt;code&gt;git bisect run&lt;/code&gt; - a tool I've been meaning to figure out for years, which lets you run an automated binary search against a commit log to find the source of a bug.&lt;/p&gt;

&lt;p&gt;Since I was figuring out a new tool, I fired up another GitHub issue self-conversation: in &lt;a href="https://github.com/simonw/datasette/issues/716"&gt;issue #716&lt;/a&gt; I document my process of both learning to use &lt;code&gt;git bisect run&lt;/code&gt; and using it to find a solution to that particular bug.&lt;/p&gt;

&lt;p&gt;It worked great, so I used the same trick on &lt;a href="https://github.com/simonw/datasette/issues/689"&gt;issue 689&lt;/a&gt; as well.&lt;/p&gt;

&lt;p&gt;Watching &lt;code&gt;git bisect run&lt;/code&gt; churn through 32 revisions in a few seconds and pinpoint the exact moment a bug was introduced is pretty delightful:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ git bisect start master 0.34
Bisecting: 32 revisions left to test after this (roughly 5 steps)
[dc80e779a2e708b2685fc641df99e6aae9ad6f97] Handle scope path if it is a string
$ git bisect run python check_templates_considered.py
running python check_templates_considered.py
Traceback (most recent call last):
...
AssertionError
Bisecting: 15 revisions left to test after this (roughly 4 steps)
[7c6a9c35299f251f9abfb03fd8e85143e4361709] Better tests for prepare_connection() plugin hook, refs #678
running python check_templates_considered.py
Traceback (most recent call last):
...
AssertionError
Bisecting: 7 revisions left to test after this (roughly 3 steps)
[0091dfe3e5a3db94af8881038d3f1b8312bb857d] More reliable tie-break ordering for facet results
running python check_templates_considered.py
Traceback (most recent call last):
...
AssertionError
Bisecting: 3 revisions left to test after this (roughly 2 steps)
[ce12244037b60ba0202c814871218c1dab38d729] Release notes for 0.35
running python check_templates_considered.py
Traceback (most recent call last):
...
AssertionError
Bisecting: 1 revision left to test after this (roughly 1 step)
[70b915fb4bc214f9d064179f87671f8a378aa127] Datasette.render_template() method, closes #577
running python check_templates_considered.py
Traceback (most recent call last):
...
AssertionError
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[286ed286b68793532c2a38436a08343b45cfbc91] geojson-to-sqlite
running python check_templates_considered.py
70b915fb4bc214f9d064179f87671f8a378aa127 is the first bad commit
commit 70b915fb4bc214f9d064179f87671f8a378aa127
Author: Simon Willison
Date:   Tue Feb 4 12:26:17 2020 -0800

    Datasette.render_template() method, closes #577

    Pull request #664.

:040000 040000 def9e31252e056845609de36c66d4320dd0c47f8 da19b7f8c26d50a4c05e5a7f05220b968429725c M	datasette
bisect run success&lt;/code&gt;&lt;/pre&gt;
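&lt;p&gt;The same pattern can be reproduced end-to-end on a throwaway repository. This sketch is my own illustration, not taken from the Datasette issue - the commit layout and check script are invented - but it shows the full mechanism: plant a bug at commit 6 of 10, then let &lt;code&gt;git bisect run&lt;/code&gt; track it down unattended:&lt;/p&gt;

```python
# Demo of the `git bisect run` pattern on a synthetic repo (assumes git
# is on PATH). A "bug" appears at commit 6 of 10; an automated check
# script (exit 0 = good, non-zero = bad) lets bisect find it for us.
import os
import subprocess
import sys
import tempfile

def git(*args, cwd):
    return subprocess.run(["git", *args], cwd=cwd, check=True,
                          capture_output=True, text=True).stdout

repo = tempfile.mkdtemp()
git("init", "-q", cwd=repo)
git("config", "user.email", "demo@example.com", cwd=repo)
git("config", "user.name", "demo", cwd=repo)
for i in range(1, 11):
    with open(os.path.join(repo, "state.txt"), "w") as f:
        f.write("ok" if i in range(1, 6) else "broken")
    git("add", "state.txt", cwd=repo)
    git("commit", "-q", "-m", "commit %d" % i, cwd=repo)

# The check script plays the role of check_templates_considered.py above
check = os.path.join(repo, "check.py")
with open(check, "w") as f:
    f.write("import sys\nsys.exit(0 if open('state.txt').read() == 'ok' else 1)\n")

git("bisect", "start", "HEAD", "HEAD~9", cwd=repo)  # bad ref, then good ref
out = git("bisect", "run", sys.executable, check, cwd=repo)
git("bisect", "reset", "-q", cwd=repo)
verdict = [l for l in out.splitlines() if "is the first bad commit" in l][0]
print(verdict)
```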

&lt;h3&gt;Supporting metadata.yaml&lt;/h3&gt;

&lt;p&gt;The other Datasette project I completed this week is a relatively small feature with hopefully a big impact: you can &lt;a href="https://github.com/simonw/datasette/issues/713"&gt;now use YAML for Datasette's metadata configuration&lt;/a&gt; as an alternative to JSON.&lt;/p&gt;

&lt;p&gt;I'm not crazy about YAML: I still don't feel like I've mastered it, and I've been &lt;a href="https://simonwillison.net/tags/yaml/"&gt;tracking it for 18 years&lt;/a&gt;! But it has one big advantage over JSON for configuration files: robust support for multi-line strings.&lt;/p&gt;

&lt;p&gt;Datasette's &lt;a href="https://datasette.readthedocs.io/en/latest/metadata.html"&gt;metadata file&lt;/a&gt; can include lengthy SQL statements and strings of HTML, both of which benefit from multi-line strings.&lt;/p&gt;
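&lt;p&gt;A hypothetical &lt;code&gt;metadata.yaml&lt;/code&gt; fragment along these lines (the database, table and query names are invented for illustration) shows why: YAML block scalars keep the SQL and HTML readable where JSON would need escaped single-line strings:&lt;/p&gt;

```yaml
# Hypothetical metadata.yaml fragment - names invented for illustration.
# The |- block scalars hold multi-line SQL and HTML without escaping.
title: Example Datasette
description_html: |-
  &lt;p&gt;Data published for &lt;em&gt;demonstration&lt;/em&gt; purposes.&lt;/p&gt;
databases:
  mydb:
    queries:
      recent_items:
        sql: |-
          select id, name, created
          from items
          order by created desc
          limit 10
```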

&lt;p&gt;I first used YAML for metadata for my &lt;a href="https://simonwillison.net/2018/Aug/6/russian-facebook-ads/"&gt;Analyzing US Election Russian Facebook Ads&lt;/a&gt; project. The &lt;a href="https://github.com/simonw/russian-ira-facebook-ads-datasette/blob/336ba87ef8071e664441ad0a95e3b8d0a33f682a/russian-ads-metadata.yaml"&gt;metadata file for that&lt;/a&gt; demonstrates both embedded HTML and embedded SQL - and an accompanying &lt;a href="https://github.com/simonw/russian-ira-facebook-ads-datasette/blob/336ba87ef8071e664441ad0a95e3b8d0a33f682a/build_metadata.py"&gt;build_metadata.py&lt;/a&gt; script converted it to JSON at build time. I've since used the same trick for a number of other projects.&lt;/p&gt;

&lt;p&gt;The next release of Datasette (hopefully within a week) will ship the new feature, at which point those conversion scripts won't be necessary.&lt;/p&gt;

&lt;p&gt;This should work particularly well with the forthcoming &lt;a href="https://github.com/simonw/datasette/issues/698"&gt;ability for a canned query to write to a database&lt;/a&gt;. Getting that wrapped up and shipped will be my focus for the next few days.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/yaml"&gt;yaml&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zeit-now"&gt;zeit-now&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-issues"&gt;github-issues&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="git"/><category term="github"/><category term="projects"/><category term="yaml"/><category term="zeit-now"/><category term="datasette"/><category term="weeknotes"/><category term="github-issues"/></entry><entry><title>Zeit Now v1 to sunset soon: no new deployments from 1st May, total shutdown 7th August</title><link href="https://simonwillison.net/2020/Apr/4/zeit-now-v1-sunset-soon-no-new-deployments-1st-may-total-shutdow/#atom-tag" rel="alternate"/><published>2020-04-04T05:32:02+00:00</published><updated>2020-04-04T05:32:02+00:00</updated><id>https://simonwillison.net/2020/Apr/4/zeit-now-v1-sunset-soon-no-new-deployments-1st-may-total-shutdow/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://twitter.com/simonw/status/1246300304917680128"&gt;Zeit Now v1 to sunset soon: no new deployments from 1st May, total shutdown 7th August&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I posted a thread on Twitter with some thoughts. Zeit Now v1 remains the best hosting platform I’ve ever used given my particular tastes. They’ve handled the shutdown very responsibly, but I’m sad to see it go.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/hosting"&gt;hosting&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zeit-now"&gt;zeit-now&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;



</summary><category term="hosting"/><category term="zeit-now"/><category term="datasette"/></entry><entry><title>Ministry of Silly Runtimes: Vintage Python on Cloud Run</title><link href="https://simonwillison.net/2019/Apr/9/vintage-python-on-cloud-run/#atom-tag" rel="alternate"/><published>2019-04-09T17:33:47+00:00</published><updated>2019-04-09T17:33:47+00:00</updated><id>https://simonwillison.net/2019/Apr/9/vintage-python-on-cloud-run/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://dev.to/di/ministry-of-silly-runtimes-vintage-python-on-cloud-run-3b9d"&gt;Ministry of Silly Runtimes: Vintage Python on Cloud Run&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Cloud Run is an exciting new hosting service from Google that lets you define a container using a Dockerfile and then run that container in a “scale to zero” environment, so you only pay for time spent serving traffic. It’s similar to the now-deprecated Zeit Now 1.0 which inspired me to create Datasette. Here Dustin Ingram demonstrates how powerful Docker can be as the underlying abstraction by deploying a web app using a 25-year-old version of Python 1.x.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/jacobian/status/1115665262215352320"&gt;@jacobian&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cloud"&gt;cloud&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zeit-now"&gt;zeit-now&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/docker"&gt;docker&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cloudrun"&gt;cloudrun&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dustin-ingram"&gt;dustin-ingram&lt;/a&gt;&lt;/p&gt;



</summary><category term="cloud"/><category term="python"/><category term="zeit-now"/><category term="docker"/><category term="datasette"/><category term="cloudrun"/><category term="dustin-ingram"/></entry><entry><title>Building smaller Python Docker images</title><link href="https://simonwillison.net/2018/Nov/19/smaller-python-docker-images/#atom-tag" rel="alternate"/><published>2018-11-19T03:13:40+00:00</published><updated>2018-11-19T03:13:40+00:00</updated><id>https://simonwillison.net/2018/Nov/19/smaller-python-docker-images/#atom-tag</id><summary type="html">
    &lt;p&gt;Changes are afoot at &lt;a href="https://zeit.co/now"&gt;Zeit Now&lt;/a&gt;, my preferred hosting provider for the past year (see &lt;a href="https://simonwillison.net/tags/zeitnow/"&gt;previous posts&lt;/a&gt;). They have &lt;a href="https://zeit.co/blog/now-2"&gt;announced Now 2.0&lt;/a&gt;, an intriguing new approach to providing auto-scaling immutable deployments. It’s built on top of lambdas, and comes with a whole host of new constraints: code needs to fit into a 5MB bundle for example (though it looks like this restriction will soon be &lt;a href="https://spectrum.chat/?t=0ab38384-5aa3-4b04-899a-5b056f9b83b9"&gt;relaxed a little&lt;/a&gt; - &lt;strong&gt;update November 19th&lt;/strong&gt; you can now &lt;a href="https://zeit.co/blog/customizable-lambda-sizes"&gt;bump this up to 50MB&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;Unfortunately, they have also announced their intent to deprecate the existing Now v1 Docker-based solution.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;“We will only start thinking about deprecation plans once we are able to accommodate the most common and critical use cases of v1 on v2” - &lt;a href="https://spectrum.chat/thread/96985341-e17f-4af4-a330-c726774ed436?m=MTU0MTcwOTU1ODIwNA=="&gt;Matheus Fernandes&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;“When we reach feature parity, we still intend to give customers plenty of time to upgrade (we are thinking at the very least 6 months from the time we announce it)” - &lt;a href="https://spectrum.chat/thread/46d54a53-f58d-4e8f-bce2-047a6ac93305?m=MTU0MjUyMDMwMzc5NQ=="&gt;Guillermo Rauch&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is pretty disastrous news for many of my projects, most crucially &lt;a href="https://github.com/simonw/datasette"&gt;Datasette&lt;/a&gt; and &lt;a href="https://simonwillison.net/2018/Jan/17/datasette-publish/"&gt;Datasette Publish&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Datasette should be fine - it supports Heroku as an alternative to Zeit Now &lt;a href="https://datasette.readthedocs.io/en/stable/publish.html"&gt;out of the box&lt;/a&gt;, and the &lt;a href="https://datasette.readthedocs.io/en/stable/plugins.html#publish-subcommand-publish"&gt;publish_subcommand plugin hook&lt;/a&gt; makes it easy to add further providers (I’m exploring several new options at the moment).&lt;/p&gt;
&lt;p&gt;Datasette Publish is a bigger problem. The whole point of that project is to make it easy for less-technical users to deploy their data as an interactive API to a Zeit Now account that they own themselves. Talking these users through what they need to do to upgrade should v1 be shut down in the future is not an exciting prospect.&lt;/p&gt;
&lt;p&gt;So I’m going to start hunting for an alternative backend for Datasette Publish, but in the meantime I’ve had to make some changes to how it works in order to handle a new size limit of 100MB for Docker images deployed by free users.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Building_smaller_Docker_images_18"&gt;&lt;/a&gt;Building smaller Docker images&lt;/h3&gt;
&lt;p&gt;Zeit &lt;a href="https://twitter.com/ppival/status/1063464380057055232"&gt;appear to have introduced a new limit&lt;/a&gt; for free users of their Now v1 platform: Docker images need to be no larger than 100MB.&lt;/p&gt;
&lt;p&gt;Datasette Publish was creating final image sizes of around 350MB, blowing way past that limit. I spent some time today figuring out how to get it to produce images within the new limit, and learned a lot about Docker image optimization in the process.&lt;/p&gt;
&lt;p&gt;I ended up using Docker’s &lt;a href="https://docs.docker.com/develop/develop-images/multistage-build/"&gt;multi-stage build feature&lt;/a&gt;, which allows you to create temporary images during a build, use them to  compile dependencies, then copy just the compiled assets into the final image.&lt;/p&gt;
&lt;p&gt;An example of the previous Datasette Publish generated Dockerfile &lt;a href="https://gist.github.com/simonw/365294fb51765fb07bc99fe5eb7fee22"&gt;can be seen here&lt;/a&gt;. Here’s a rough outline of what it does:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Start with the &lt;code&gt;python:3.6-slim-stretch&lt;/code&gt; image&lt;/li&gt;
&lt;li&gt;apt-installs &lt;code&gt;python3-dev&lt;/code&gt; and &lt;code&gt;gcc&lt;/code&gt; so it can compile Python libraries with binary dependencies (pandas and uvloop for example)&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;pip&lt;/code&gt; to install &lt;code&gt;csvs-to-sqlite&lt;/code&gt; and &lt;code&gt;datasette&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Add the uploaded CSV files, then run &lt;code&gt;csvs-to-sqlite&lt;/code&gt; to convert them into a SQLite database&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;datasette inspect&lt;/code&gt; to cache a JSON file with information about the different tables&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;datasette serve&lt;/code&gt; to serve the resulting web application&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There’s a lot of scope for improvement here. The final image has all sorts of cruft that’s not actually needed for serving the image: it has &lt;code&gt;csvs-to-sqlite&lt;/code&gt; and all of its dependencies, plus the original uploaded CSV files.&lt;/p&gt;
&lt;p&gt;Here’s the workflow I used to build a Dockerfile and check the size of the resulting image. My work-in-progress can be found in the &lt;a href="https://github.com/simonw/datasette-small"&gt;datasette-small repo&lt;/a&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Build the Dockerfile in the current directory and tag as datasette-small
$ docker build . -t datasette-small
# Inspect the size of the resulting image
$ docker images | grep datasette-small
# Start the container running
$ docker run -d -p 8006:8006 datasette-small
654d3fc4d3343c6b73414c6fb4b2933afc56fbba1f282dde9f515ac6cdbc5339
# Now visit http://localhost:8006/ to see it running
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;&lt;a id="Alpine_Linux_48"&gt;&lt;/a&gt;Alpine Linux&lt;/h3&gt;
&lt;p&gt;When you start looking for ways to build smaller Dockerfiles, the first thing you will encounter is &lt;a href="https://en.wikipedia.org/wiki/Alpine_Linux"&gt;Alpine Linux&lt;/a&gt;. Alpine is a Linux distribution that’s perfect for containers: it builds on top of &lt;a href="https://en.wikipedia.org/wiki/BusyBox"&gt;BusyBox&lt;/a&gt; to strip down to the smallest possible image that can still do useful things.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;python:3.6-alpine&lt;/code&gt; container should be perfect: it gives you the smallest possible container that can run Python 3.6 applications (including the ability to &lt;code&gt;pip install&lt;/code&gt; additional dependencies).&lt;/p&gt;
&lt;p&gt;There’s just one problem: in order to install C-based dependencies like &lt;a href="https://pandas.pydata.org/"&gt;pandas&lt;/a&gt; (used by csvs-to-sqlite) and &lt;a href="https://github.com/huge-success/sanic"&gt;Sanic&lt;/a&gt; (used by Datasette) you need a compiler toolchain. Alpine doesn’t have this out-of-the-box, but you can install one using Alpine’s &lt;code&gt;apk&lt;/code&gt; package manager. Of course, now you’re bloating your container with a bunch of compilation tools that you don’t need to serve the final image.&lt;/p&gt;
&lt;p&gt;This is what makes multi-stage builds so useful! We can spin up an Alpine image with the compilers installed, build our modules, then copy the resulting binary blobs into a fresh container.&lt;/p&gt;
&lt;p&gt;Here’s the basic recipe for doing that:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;FROM python:3.6-alpine as builder

# Install and compile Datasette + its dependencies
RUN apk add --no-cache gcc python3-dev musl-dev alpine-sdk
RUN pip install datasette

# Now build a fresh container, copying across the compiled pieces
FROM python:3.6-alpine

COPY --from=builder /usr/local/lib/python3.6 /usr/local/lib/python3.6
COPY --from=builder /usr/local/bin/datasette /usr/local/bin/datasette
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This pattern works really well, and produces delightfully slim images. My first attempt at this wasn’t quite slim enough to fit the 100MB limit though, so I had to break out some Docker tools to figure out exactly what was going on.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Inspecting_docker_image_layers_74"&gt;&lt;/a&gt;Inspecting docker image layers&lt;/h3&gt;
&lt;p&gt;Part of the magic of Docker is the concept of &lt;a href="https://medium.com/@jessgreb01/digging-into-docker-layers-c22f948ed612"&gt;layers&lt;/a&gt;. When Docker builds a container it uses a layered filesystem (&lt;a href="https://en.wikipedia.org/wiki/UnionFS"&gt;UnionFS&lt;/a&gt;) and creates a new layer for every executable line in the Dockerfile. This dramatically speeds up future builds (since layers can be reused if they have already been built) and also provides a powerful tool for inspecting different stages of the build.&lt;/p&gt;
&lt;p&gt;When you run &lt;code&gt;docker build&lt;/code&gt; part of the output is IDs of the different image layers as they are constructed:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;datasette-small $ docker build . -t datasette-small
Sending build context to Docker daemon  2.023MB
Step 1/21 : FROM python:3.6-slim-stretch as csvbuilder
 ---&amp;gt; 971a5d5dad01
Step 2/21 : RUN apt-get update &amp;amp;&amp;amp; apt-get install -y python3-dev gcc wget
 ---&amp;gt; Running in f81485df62dd
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Given a layer ID, like &lt;code&gt;971a5d5dad01&lt;/code&gt;, it’s possible to spin up a new container that exposes the exact state of that layer (&lt;a href="https://stackoverflow.com/a/26222636/6083"&gt;thanks, Stack Overflow&lt;/a&gt;). Here’s how to do that:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;docker run -it --rm 971a5d5dad01 sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;-it&lt;/code&gt; argument attaches standard input to the container (&lt;code&gt;-i&lt;/code&gt;) and allocates a pseudo-TTY (&lt;code&gt;-t&lt;/code&gt;). The &lt;code&gt;--rm&lt;/code&gt; option means that the container will be removed when you Ctrl+D back out of it. &lt;code&gt;sh&lt;/code&gt; is the command we want to run in the container - using a shell lets us start interacting with it.&lt;/p&gt;
&lt;p&gt;Now that we have a shell against that layer, we can use regular unix commands to start exploring it. &lt;code&gt;du -m&lt;/code&gt; (&lt;code&gt;m&lt;/code&gt; for &lt;code&gt;MB&lt;/code&gt;) is particularly useful here, as it will show us the largest directories in the filesystem. I pipe it through &lt;code&gt;sort&lt;/code&gt; like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ docker run -it --rm abc63755616b sh
# du -m | sort -n
...
58  ./usr/local/lib/python3.6
70  ./usr/local/lib
71  ./usr/local
76  ./usr/lib/python3.5
188 ./usr/lib
306 ./usr
350 .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Straight away we can start seeing where the space is being taken up in our image.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Deleting_unnecessary_files_108"&gt;&lt;/a&gt;Deleting unnecessary files&lt;/h3&gt;
&lt;p&gt;I spent quite a while inspecting different stages of my builds to try and figure out where the space was going. The alpine copy recipe worked neatly, but I was still a little over the limit. When I started to dig around in my final image I spotted some interesting patterns - in particular, the &lt;code&gt;/usr/local/lib/python3.6/site-packages/uvloop&lt;/code&gt; directory was 17MB!&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# du -m /usr/local | sort -n -r | head -n 5
96  /usr/local
95  /usr/local/lib
83  /usr/local/lib/python3.6
36  /usr/local/lib/python3.6/site-packages
17  /usr/local/lib/python3.6/site-packages/uvloop
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That seems like a lot of disk space for a compiled C module, so I dug in further…&lt;/p&gt;
&lt;p&gt;It turned out the &lt;code&gt;uvloop&lt;/code&gt; folder still contained a bunch of files that were used as part of the compilation, including a 6.7MB &lt;code&gt;loop.c&lt;/code&gt; file and a bunch of &lt;code&gt;.pxd&lt;/code&gt; and &lt;code&gt;.pyd&lt;/code&gt; files that are compiled by &lt;a href="https://cython.org/"&gt;Cython&lt;/a&gt;. None of these files are needed after the extension has been compiled, but they were there, taking up a bunch of precious space.&lt;/p&gt;
&lt;p&gt;So I added the following to my Dockerfile:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;RUN find /usr/local/lib/python3.6 -name '*.c' -delete
RUN find /usr/local/lib/python3.6 -name '*.pxd' -delete
RUN find /usr/local/lib/python3.6 -name '*.pyd' -delete
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then I noticed that there were &lt;code&gt;__pycache__&lt;/code&gt; files that weren’t needed either, so I added this as well:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;RUN find /usr/local/lib/python3.6 -name '__pycache__' | xargs rm -r
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(The &lt;code&gt;-delete&lt;/code&gt; flag didn’t work correctly for that one - &lt;code&gt;find&lt;/code&gt; can only delete empty directories, and the matched &lt;code&gt;__pycache__&lt;/code&gt; folders still contained files - so I used &lt;code&gt;xargs&lt;/code&gt; instead.)&lt;/p&gt;
&lt;p&gt;This shaved off around 15MB, putting me safely under the limit.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Running_csvstosqlite_in_its_own_stage_137"&gt;&lt;/a&gt;Running csvs-to-sqlite in its own stage&lt;/h3&gt;
&lt;p&gt;The above tricks had got me the smallest Alpine Linux image I could create that would still run Datasette… but Datasette Publish also needs to run &lt;code&gt;csvs-to-sqlite&lt;/code&gt; in order to convert the user’s uploaded CSV files to SQLite.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;csvs-to-sqlite&lt;/code&gt; has some pretty heavy dependencies of its own in the form of &lt;a href="https://pandas.pydata.org/"&gt;Pandas&lt;/a&gt; and &lt;a href="http://www.numpy.org/"&gt;NumPy&lt;/a&gt;. Even with the build chain installed I was having trouble installing these under Alpine, especially since building numpy for Alpine is &lt;a href="https://stackoverflow.com/questions/49037742/why-does-it-take-ages-to-install-pandas-on-alpine-linux"&gt;notoriously slow&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Then I realized that thanks to multi-stage builds there’s no need for me to use Alpine at all for this step. I switched back to &lt;code&gt;python:3.6-slim-stretch&lt;/code&gt; and used it to install &lt;code&gt;csvs-to-sqlite&lt;/code&gt; and compile the CSV files into a SQLite database. I also ran &lt;code&gt;datasette inspect&lt;/code&gt; there for good measure.&lt;/p&gt;
&lt;p&gt;Then in my final Alpine container I could use the following to copy in just those compiled assets:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;COPY --from=csvbuilder inspect-data.json inspect-data.json
COPY --from=csvbuilder data.db data.db
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;&lt;a id="Tying_it_all_together_150"&gt;&lt;/a&gt;Tying it all together&lt;/h3&gt;
&lt;p&gt;Here’s an example of &lt;a href="https://gist.github.com/simonw/ee63bc5e7feb6e8bb3af82f67a7a36fe"&gt;a full Dockerfile generated by Datasette Publish&lt;/a&gt; that combines all of these tricks. To summarize, here’s what it does:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Spin up a &lt;code&gt;python:3.6-slim-stretch&lt;/code&gt; - call it &lt;code&gt;csvbuilder&lt;/code&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;apt-get install -y python3-dev gcc&lt;/code&gt; so we can install compiled dependencies&lt;/li&gt;
&lt;li&gt;&lt;code&gt;pip install csvs-to-sqlite datasette&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Copy in the uploaded CSV files&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;csvs-to-sqlite&lt;/code&gt; to convert them into a SQLite database&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;datasette inspect data.db&lt;/code&gt; to generate an &lt;code&gt;inspect-data.json&lt;/code&gt; file with statistics about the tables. This can later be used to reduce startup time for &lt;code&gt;datasette serve&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Spin up a &lt;code&gt;python:3.6-alpine&lt;/code&gt; container - call it &lt;code&gt;buildit&lt;/code&gt;
&lt;ul&gt;
&lt;li&gt;We need a build chain to compile a copy of datasette for Alpine Linux…&lt;/li&gt;
&lt;li&gt;&lt;code&gt;apk add --no-cache gcc python3-dev musl-dev alpine-sdk&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Now we can &lt;code&gt;pip install datasette&lt;/code&gt;, plus any requested plugins&lt;/li&gt;
&lt;li&gt;Reduce the final image size by deleting any &lt;code&gt;__pycache__&lt;/code&gt; or &lt;code&gt;*.c&lt;/code&gt;, &lt;code&gt;*.pyd&lt;/code&gt; and &lt;code&gt;*.pxd&lt;/code&gt; files.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Spin up a fresh &lt;code&gt;python:3.6-alpine&lt;/code&gt; container for our final image
&lt;ul&gt;
&lt;li&gt;Copy in &lt;code&gt;data.db&lt;/code&gt; and &lt;code&gt;inspect-data.json&lt;/code&gt; from &lt;code&gt;csvbuilder&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Copy across &lt;code&gt;/usr/local/lib/python3.6&lt;/code&gt; and &lt;code&gt;/usr/local/bin/datasette&lt;/code&gt; from &lt;code&gt;buildit&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;… and we’re done! Expose port 8006 and set &lt;code&gt;datasette serve&lt;/code&gt; to run when the container is started&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
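&lt;p&gt;Condensed down, the three stages fit together roughly like this (an illustrative sketch, not the exact generated file - see the gist linked above for the real thing):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Stage 1: Debian-based, so pandas/numpy install without the Alpine pain
FROM python:3.6-slim-stretch AS csvbuilder
RUN apt-get update &amp;amp;&amp;amp; apt-get install -y python3-dev gcc
RUN pip install csvs-to-sqlite datasette
COPY *.csv /csvs/
RUN csvs-to-sqlite /csvs/*.csv data.db
RUN datasette inspect data.db &amp;gt; inspect-data.json

# Stage 2: compile Datasette and any plugins against Alpine's musl libc
FROM python:3.6-alpine AS buildit
RUN apk add --no-cache gcc python3-dev musl-dev alpine-sdk
RUN pip install datasette
# ... delete __pycache__, *.c, *.pyd and *.pxd files to shrink the image ...

# Stage 3: a fresh Alpine image containing only the built artifacts
FROM python:3.6-alpine
COPY --from=csvbuilder data.db data.db
COPY --from=csvbuilder inspect-data.json inspect-data.json
COPY --from=buildit /usr/local/lib/python3.6 /usr/local/lib/python3.6
COPY --from=buildit /usr/local/bin/datasette /usr/local/bin/datasette
EXPOSE 8006
CMD ["datasette", "serve", "data.db", "--host", "0.0.0.0", "--port", "8006"]
&lt;/code&gt;&lt;/pre&gt;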
&lt;p&gt;Now that I’ve finally learned how to take advantage of multi-stage builds I expect I’ll be using them for all sorts of interesting things in the future.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zeit-now"&gt;zeit-now&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/docker"&gt;docker&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="projects"/><category term="python"/><category term="zeit-now"/><category term="docker"/><category term="datasette"/></entry><entry><title>The Now CDN</title><link href="https://simonwillison.net/2018/Jul/12/now-cdn/#atom-tag" rel="alternate"/><published>2018-07-12T03:34:06+00:00</published><updated>2018-07-12T03:34:06+00:00</updated><id>https://simonwillison.net/2018/Jul/12/now-cdn/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://zeit.co/blog/now-cdn"&gt;The Now CDN&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Huge announcement from Zeit Now today: all .now.sh deployments are now served through the Cloudflare CDN, which means they benefit from 150 worldwide CDN locations that obey HTTP caching headers. This is particularly relevant for Datasette, since it serves far-future cache headers by default, so its pages can be cached aggressively at the edge. This means that both the “datasette publish now” CLI command and the Datasette Publish web app will now result in Cloudflare-accelerated deployments.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/zeithq/status/1017058342945087489"&gt;@zeithq&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cdn"&gt;cdn&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/performance"&gt;performance&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zeit-now"&gt;zeit-now&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cloudflare"&gt;cloudflare&lt;/a&gt;&lt;/p&gt;



</summary><category term="cdn"/><category term="performance"/><category term="zeit-now"/><category term="datasette"/><category term="cloudflare"/></entry><entry><title>Continuous Integration with Travis CI - ZEIT Documentation</title><link href="https://simonwillison.net/2018/Jun/1/zeit-with-travis-ci/#atom-tag" rel="alternate"/><published>2018-06-01T17:21:50+00:00</published><updated>2018-06-01T17:21:50+00:00</updated><id>https://simonwillison.net/2018/Jun/1/zeit-with-travis-ci/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://zeit.co/docs/continuous-integration/travis?utm_source=twitter&amp;amp;utm_medium=social&amp;amp;utm_campaign=travis_ci_guide"&gt;Continuous Integration with Travis CI - ZEIT Documentation&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
One of the neat things about Zeit Now is that since deployments are unlimited and are automatically assigned a unique URL you can set up a continuous integration system like Travis to deploy a brand new copy of every commit or every pull request. This documentation also shows how to have commits to master automatically aliased to a known URL. I have quite a few Datasette projects that are deployed automatically to Now by Travis and the pattern seems to be working great so far.
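&lt;p&gt;A minimal version of the setup looks something like this (a sketch only - the encrypted &lt;code&gt;NOW_TOKEN&lt;/code&gt; environment variable and the &lt;code&gt;my-project.now.sh&lt;/code&gt; alias are placeholders):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;install:
  - npm install -g now
script:
  # Every build gets its own immutable deployment with a unique URL
  - now --token=$NOW_TOKEN --public
after_success:
  # Commits to master also get aliased to a stable URL
  - test "$TRAVIS_BRANCH" = "master" &amp;amp;&amp;amp; now alias --token=$NOW_TOKEN my-project.now.sh
&lt;/code&gt;&lt;/pre&gt;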


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/continuous-deployment"&gt;continuous-deployment&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/continuous-integration"&gt;continuous-integration&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zeit-now"&gt;zeit-now&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/travis"&gt;travis&lt;/a&gt;&lt;/p&gt;



</summary><category term="continuous-deployment"/><category term="continuous-integration"/><category term="zeit-now"/><category term="travis"/></entry><entry><title>Datasette - a talk at Zeit Day SF 2018</title><link href="https://simonwillison.net/2018/Apr/28/datasette/#atom-tag" rel="alternate"/><published>2018-04-28T21:31:40+00:00</published><updated>2018-04-28T21:31:40+00:00</updated><id>https://simonwillison.net/2018/Apr/28/datasette/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://speakerdeck.com/simon/datasette"&gt;Datasette - a talk at Zeit Day SF 2018&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Slides from the talk I gave today about Datasette and Datasette Publish at the Zeit Day SF conference.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/simonw/status/990341615196688386"&gt;@simonw&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/my-talks"&gt;my-talks&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zeit-now"&gt;zeit-now&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;



</summary><category term="my-talks"/><category term="zeit-now"/><category term="datasette"/></entry><entry><title>Make Near Me</title><link href="https://simonwillison.net/2018/Apr/28/make-near-me/#atom-tag" rel="alternate"/><published>2018-04-28T21:28:44+00:00</published><updated>2018-04-28T21:28:44+00:00</updated><id>https://simonwillison.net/2018/Apr/28/make-near-me/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://make-near-me.now.sh/"&gt;Make Near Me&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
The natural evolution of owlsnearme.com—Make Near Me uses the Zeit Now API to allow anyone to deploy their own version of Owls Near Me for any species! I announced this on stage at Zeit Day SF 2018 as part of my talk on Datasette and Datasette Publish.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/simonw/status/990337905108303873"&gt;@simonw&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/owlsnearyou"&gt;owlsnearyou&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zeit-now"&gt;zeit-now&lt;/a&gt;&lt;/p&gt;



</summary><category term="owlsnearyou"/><category term="projects"/><category term="zeit-now"/></entry><entry><title>Domains Search for Web: Instant, Serverless &amp; Global</title><link href="https://simonwillison.net/2018/Jan/26/domains-search/#atom-tag" rel="alternate"/><published>2018-01-26T01:14:52+00:00</published><updated>2018-01-26T01:14:52+00:00</updated><id>https://simonwillison.net/2018/Jan/26/domains-search/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://zeit.co/blog/domains-search-web"&gt;Domains Search for Web: Instant, Serverless &amp;amp; Global&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
The team at Zeit are pioneering a whole bunch of fascinating web engineering architectural patterns. Their new domain name autocomplete search uses Next.js and server-side rendering on first load, then switches to client-side rendering from then on. It can then load results asynchronously over a custom WebSocket protocol as the microservices on the backend finish resolving domain availability from the various different TLD providers.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/rauchg/status/956402473354366977"&gt;Guillermo Rauch&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/domains"&gt;domains&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/websockets"&gt;websockets&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zeit-now"&gt;zeit-now&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/microservices"&gt;microservices&lt;/a&gt;&lt;/p&gt;



</summary><category term="domains"/><category term="websockets"/><category term="zeit-now"/><category term="microservices"/></entry><entry><title>API 2.0: Log-In with ZEIT, New Docs &amp; More</title><link href="https://simonwillison.net/2018/Jan/17/zeit-api-2/#atom-tag" rel="alternate"/><published>2018-01-17T15:23:15+00:00</published><updated>2018-01-17T15:23:15+00:00</updated><id>https://simonwillison.net/2018/Jan/17/zeit-api-2/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://zeit.co/blog/api-2"&gt;API 2.0: Log-In with ZEIT, New Docs &amp;amp; More&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Here’s Zeit’s write-up of their brand new API 2.0, which adds OAuth support and allows anything that can be done with their command-line tools to be achieved via their public API as well. This is the enabling technology that allowed me to build Datasette Publish.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/zeit-now"&gt;zeit-now&lt;/a&gt;&lt;/p&gt;



</summary><category term="zeit-now"/></entry><entry><title>Datasette Publish: a web app for publishing CSV files as an online database</title><link href="https://simonwillison.net/2018/Jan/17/datasette-publish/#atom-tag" rel="alternate"/><published>2018-01-17T14:11:05+00:00</published><updated>2018-01-17T14:11:05+00:00</updated><id>https://simonwillison.net/2018/Jan/17/datasette-publish/#atom-tag</id><summary type="html">
    &lt;p&gt;I’ve just released &lt;a href="https://publish.datasettes.com/"&gt;Datasette Publish&lt;/a&gt;, a web tool for turning one or more CSV files into an online database with a JSON API.&lt;/p&gt;
&lt;p&gt;Here’s &lt;a href="https://datasette-onrlszntsq.now.sh/"&gt;a demo application I built&lt;/a&gt; using Datasette Publish, showing Californian campaign finance data using CSV files released by the &lt;a href="https://www.californiacivicdata.org/"&gt;California Civic Data Coalition&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;And here’s an animated screencast showing exactly how I built it:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2018/datasette-publish-demo.gif" alt="Animated demo of Datasette Publish" style="max-width: 100%" /&gt;&lt;/p&gt;
&lt;p&gt;Datasette Publish combines my &lt;a href="https://github.com/simonw/datasette"&gt;Datasette&lt;/a&gt; tool for publishing SQLite databases as an API with my &lt;a href="https://github.com/simonw/csvs-to-sqlite"&gt;csvs-to-sqlite&lt;/a&gt; tool for generating them.&lt;/p&gt;
&lt;p&gt;It’s built on top of the &lt;a href="https://zeit.co/now"&gt;Zeit Now&lt;/a&gt; hosting service, which means anything you deploy with it lives on your own account with Zeit and stays entirely under your control. I used the brand new &lt;a href="https://zeit.co/blog/api-2"&gt;Zeit API 2.0&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Zeit’s generous free plan means you can try the tool out as many times as you like - and if you want to use it for an API powering a production website you can easily upgrade to a &lt;a href="https://zeit.co/pricing"&gt;paid hosting plan&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;&lt;a id="Who_should_use_it_16"&gt;&lt;/a&gt;Who should use it&lt;/h2&gt;
&lt;p&gt;Anyone who has data they want to share with the world!&lt;/p&gt;
&lt;p&gt;The fundamental idea behind Datasette is that publishing structured data as both a web interface and a JSON API should be as quick and easy as possible.&lt;/p&gt;
&lt;p&gt;The world is full of interesting data that often ends up trapped in PDF blobs or other hard-to-use formats, if it gets published at all. Datasette encourages using SQLite instead: a powerful, flexible format that enables analysis via SQL queries and can easily be shared and hosted online.&lt;/p&gt;
&lt;p&gt;Since so much of the data that IS published today uses CSV, this first release of Datasette Publish focuses on CSV conversion above anything else. I plan to add support for other useful formats in the future.&lt;/p&gt;
&lt;p&gt;The three areas I’m most excited in seeing adoption of Datasette are data journalism, civic open data and cultural institutions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Data journalism&lt;/strong&gt; because when I worked at the Guardian Datasette is the tool I wish I had had for publishing data. When we started &lt;a href="https://www.theguardian.com/data"&gt;the Guardian Datablog&lt;/a&gt; we ended up using Google Sheets for this.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Civic open data&lt;/strong&gt; because it turns out the open data movement mostly won! It’s incredible how much high quality data is published by local and national governments these days. My &lt;a href="https://sf-tree-search.now.sh"&gt;San Francisco tree search&lt;/a&gt; project for example uses data from the Department of Public Works - a &lt;a href="https://data.sfgov.org/City-Infrastructure/Street-Tree-List/tkzw-k3nq"&gt;CSV of 190,000 trees&lt;/a&gt; around the city.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cultural institutions&lt;/strong&gt; because the museums and libraries of the world are sitting on enormous treasure troves of valuable information, and have an institutional mandate to share that data as widely as possible.&lt;/p&gt;
&lt;p&gt;If you are involved in any of the above please &lt;a href="https://twitter.com/simonw"&gt;get in touch&lt;/a&gt;. I’d love your help improving the Datasette ecosystem to better serve your needs.&lt;/p&gt;
&lt;h2&gt;&lt;a id="How_it_works_36"&gt;&lt;/a&gt;How it works&lt;/h2&gt;
&lt;p&gt;Datasette Publish would not be possible without Zeit Now. Now is a revolutionary approach to hosting: it lets you instantly create immutable deployments with a unique URL, via a command-line tool or using &lt;a href="https://zeit.co/api"&gt;their recently updated API&lt;/a&gt;. It’s by far the most productive hosting environment I’ve ever worked with.&lt;/p&gt;
&lt;p&gt;I built the main Datasette Publish interface using React. Building a SPA here made a lot of sense, because it allowed me to construct the entire application without any form of server-side storage (aside from &lt;a href="https://keen.io/"&gt;Keen&lt;/a&gt; for analytics).&lt;/p&gt;
&lt;p&gt;When you sign in via Zeit OAuth I store your access token in a signed cookie. Each time you upload a CSV the file is stored directly using Zeit’s upload API, and the file metadata is persisted in JavaScript state in the React app. When you click “publish” the accumulated state is sent to the server where it is used to construct a new Zeit deployment.&lt;/p&gt;
&lt;p&gt;The deployment itself consists of the CSV files plus &lt;a href="https://gist.github.com/simonw/365294fb51765fb07bc99fe5eb7fee22"&gt;a Dockerfile&lt;/a&gt; that installs Python, Datasette, csvs-to-sqlite and their dependencies, then runs csvs-to-sqlite against the CSV files and starts up Datasette against the resulting database.&lt;/p&gt;
&lt;p&gt;If you specified a title, description, source or license I generate a Datasette &lt;a href="https://datasette.readthedocs.io/en/latest/metadata.html"&gt;metadata.json&lt;/a&gt; file and include that in the deployment as well.&lt;/p&gt;
&lt;p&gt;Since free deployments to Zeit are “source code visible”, you can see exactly how the resulting application is structured by visiting &lt;a href="https://datasette-onrlszntsq.now.sh/_src"&gt;https://datasette-onrlszntsq.now.sh/_src&lt;/a&gt; (the campaign finance app I built earlier).&lt;/p&gt;
&lt;p&gt;Using the Zeit API in this way has the neat effect that I don’t ever store any user data myself - neither the access token used to access your account nor any of the CSVs that you upload. Uploaded files go straight to your own Zeit account and stay under your control. Access tokens are never persisted. The deployed application lives on your own hosting account, where you can terminate it or upgrade it to a paid plan without any further involvement from the tool I have built.&lt;/p&gt;
&lt;p&gt;Not having to worry about storing encrypted access tokens or covering any hosting costs beyond the Datasette Publish tool itself is delightful.&lt;/p&gt;
&lt;p&gt;This ability to build tools that themselves deploy other tools is fascinating. I can’t wait to see what other kinds of interesting new applications it enables.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://news.ycombinator.com/item?id=16170892"&gt;Discussion on Hacker News&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zeit-now"&gt;zeit-now&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="csv"/><category term="projects"/><category term="zeit-now"/><category term="datasette"/></entry><entry><title>ftfy - fix unicode that's broken in various ways</title><link href="https://simonwillison.net/2018/Jan/9/ftfy/#atom-tag" rel="alternate"/><published>2018-01-09T03:22:25+00:00</published><updated>2018-01-09T03:22:25+00:00</updated><id>https://simonwillison.net/2018/Jan/9/ftfy/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://ftfy.now.sh/"&gt;ftfy - fix unicode that&amp;#x27;s broken in various ways&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I shipped a small web UI wrapper around the excellent Python FTFY library, which can take broken unicode strings and suggest a sequence of operations that can be applied to get back sensible text.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/simonw/status/950565555873837056"&gt;Me on Twitter&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/unicode"&gt;unicode&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zeit-now"&gt;zeit-now&lt;/a&gt;&lt;/p&gt;



</summary><category term="projects"/><category term="unicode"/><category term="zeit-now"/></entry><entry><title>gzthermal-web</title><link href="https://simonwillison.net/2017/Nov/21/gzthermal-web/#atom-tag" rel="alternate"/><published>2017-11-21T18:24:12+00:00</published><updated>2017-11-21T18:24:12+00:00</updated><id>https://simonwillison.net/2017/Nov/21/gzthermal-web/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://gzthermal.now.sh/"&gt;gzthermal-web&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I built a quick web application wrapping the &lt;code&gt;gzthermal&lt;/code&gt; gzip visualization tool and deployed it to Zeit Now wrapped up in a Docker container. Give it a URL and it shows you a PNG visualization of how gzip encodes that page.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://github.com/simonw/gzthermal-web"&gt;GitHub&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sanic"&gt;sanic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zeit-now"&gt;zeit-now&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/docker"&gt;docker&lt;/a&gt;&lt;/p&gt;



</summary><category term="projects"/><category term="sanic"/><category term="zeit-now"/><category term="docker"/></entry><entry><title>now-ab</title><link href="https://simonwillison.net/2017/Nov/16/now-ab/#atom-tag" rel="alternate"/><published>2017-11-16T23:03:55+00:00</published><updated>2017-11-16T23:03:55+00:00</updated><id>https://simonwillison.net/2017/Nov/16/now-ab/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/sergiodxa/now-ab"&gt;now-ab&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Intriguing example of a Zeit Now microservice: now-ab is a Node.js HTTP proxy which proxies through to one of two or more other Now-deployed applications based on a cookie. If you don’t have the cookie, it picks a backend at random and sets the cookie. Admittedly this is the easiest part of implementing A/B testing (the hard part is the analytics: tracking exposures and conversions) but as an example of a microservice architectural pattern this is fascinating.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ab-testing"&gt;ab-testing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nodejs"&gt;nodejs&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zeit-now"&gt;zeit-now&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/microservices"&gt;microservices&lt;/a&gt;&lt;/p&gt;



</summary><category term="ab-testing"/><category term="nodejs"/><category term="zeit-now"/><category term="microservices"/></entry><entry><title>ZEIT – 6x Faster Now Uploads with HTTP/2</title><link href="https://simonwillison.net/2017/Nov/8/zeit-http2/#atom-tag" rel="alternate"/><published>2017-11-08T01:04:56+00:00</published><updated>2017-11-08T01:04:56+00:00</updated><id>https://simonwillison.net/2017/Nov/8/zeit-http2/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://zeit.co/blog/http2-uploads"&gt;ZEIT – 6x Faster Now Uploads with HTTP/2&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Fantastic optimization write-up by Pranay Prakash. The Now deployment tool works by computing a hash for every local file in a project, then uploading just the ones that are missing. Pranay switched to uploading over HTTP/2 using the fetch-h2 library and got a 6x speedup for larger projects.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/rauchg/status/928045524430635008"&gt;Guillermo Rauch&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/nodejs"&gt;nodejs&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zeit-now"&gt;zeit-now&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/http2"&gt;http2&lt;/a&gt;&lt;/p&gt;



</summary><category term="nodejs"/><category term="zeit-now"/><category term="http2"/></entry><entry><title>Running a load testing Go utility using Docker for Mac</title><link href="https://simonwillison.net/2017/Nov/5/golang-docker-for-mac/#atom-tag" rel="alternate"/><published>2017-11-05T03:50:20+00:00</published><updated>2017-11-05T03:50:20+00:00</updated><id>https://simonwillison.net/2017/Nov/5/golang-docker-for-mac/#atom-tag</id><summary type="html">
    &lt;p&gt;I’m playing around with &lt;a href="https://zeit.co/now"&gt;Zeit Now&lt;/a&gt; at the moment (see &lt;a href="https://simonwillison.net/2017/Oct/14/async-python-sanic-now/"&gt;my previous entry&lt;/a&gt;) and decided to hit it with some traffic using Apache Bench. I got this SSL handshake error:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;simonw$ ab -n 10 -c 2 'https://json-head.now.sh/'
This is ApacheBench, Version 2.3 &amp;lt;$Revision: 1706008 $&amp;gt;
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking json-head.now.sh (be patient)...SSL handshake failed (1).
140735278280784:error:14077438:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert internal error:/Library/Caches/com.apple.xbs/Sources/libressl/libressl-1.60.1.2.1/libressl/ssl/s23_clnt.c:541:
SSL handshake failed (1).
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Some brief Googling turned up &lt;a href="https://stackoverflow.com/questions/38516969/apache-benchmark-https-issue"&gt;this thread on Stack Overflow&lt;/a&gt;, which suggested trying &lt;a href="https://github.com/rakyll/hey"&gt;hey&lt;/a&gt; as an alternative. Hey is a load testing utility written in Go, and the installation instructions are as follows:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;go get -u github.com/rakyll/hey
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Unfortunately, I don’t have a current Go environment set up on this laptop - I have Go 1.6, but Hey calls for at least Go 1.7.&lt;/p&gt;
&lt;p&gt;Rather than work through upgrading my Go environment, I decided to see if I could get this tool working using &lt;a href="https://www.docker.com/docker-mac"&gt;Docker for Mac&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We recently switched to Docker for Mac for running our development environments at work, and having worked through various iterations of Docker over the past few years Docker for Mac offers by far the most pleasant developer experience. You download the installer, run it, and now &lt;code&gt;docker info&lt;/code&gt; in a terminal will reveal a fully functioning Docker environment. Couldn’t be simpler.&lt;/p&gt;
&lt;p&gt;But how to use it to run a one-off tool written in Go? &lt;a href="https://blog.docker.com/2016/09/docker-golang/"&gt;This article on the official Docker blog&lt;/a&gt; gave me everything I needed to know.&lt;/p&gt;
&lt;p&gt;First step: run the &lt;code&gt;go get&lt;/code&gt; command in a brand new Docker container, like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;docker run golang go get -v github.com/rakyll/hey
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This runs the &lt;code&gt;go get&lt;/code&gt; command in a new instance of the &lt;a href="https://hub.docker.com/_/golang/"&gt;official golang container&lt;/a&gt;. If you’ve never used the container before, Docker will download everything it needs before executing the rest of the command.&lt;/p&gt;
&lt;p&gt;Once this command finishes, we have a container with the Go program compiled and installed in it. But how to run it?&lt;/p&gt;
&lt;p&gt;We can “commit” the container to freeze it into a new image that bakes in the command. Here’s how to do that:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;docker commit $(docker ps -lq) heyimage
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The nested &lt;code&gt;docker ps -lq&lt;/code&gt; command outputs the container ID. The outer &lt;code&gt;docker commit&lt;/code&gt; command then creates a new image freezing those latest changes.&lt;/p&gt;
&lt;p&gt;Having frozen the container, we can run the command like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;docker run heyimage hey -n 10 -c 2 'https://json-head.now.sh/'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And the command runs, exactly as if I’d installed it without using Docker at all.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;simonw$ docker run heyimage hey -n 10 -c 2 'https://json-head.now.sh/'
Summary:
  Total:    0.9778 secs
  Slowest:  0.6794 secs
  Fastest:  0.0564 secs
  Average:  0.1954 secs
  Requests/sec: 10.2266

Response time histogram:
  0.056 [1] |∎∎∎∎∎∎
  0.119 [7] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.181 [0] |
  0.243 [0] |
  0.306 [0] |
  0.368 [0] |
  0.430 [0] |
  0.493 [0] |
  0.555 [0] |
  0.617 [0] |
  0.679 [2] |∎∎∎∎∎∎∎∎∎∎∎

Latency distribution:
  10% in 0.0588 secs
  25% in 0.0653 secs
  50% in 0.0868 secs
  75% in 0.6792 secs
  90% in 0.6794 secs

Details (average, fastest, slowest):
  DNS+dialup:    0.1221 secs, 0.0000 secs, 0.6109 secs
  DNS-lookup:    0.0981 secs, 0.0000 secs, 0.4906 secs
  req write:     0.0001 secs, 0.0000 secs, 0.0001 secs
  resp wait:     0.0727 secs, 0.0561 secs, 0.0904 secs
  resp read:     0.0004 secs, 0.0001 secs, 0.0012 secs

Status code distribution:
  [200] 10 responses
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;One last puzzle: the above command worked for load testing externally hosted URLs, but I also wanted to try running it against a web server running on port 8000 on my Mac itself. Running &lt;code&gt;hey&lt;/code&gt; against &lt;code&gt;http://localhost:8000/&lt;/code&gt; didn't work inside the container. Instead, I ran &lt;code&gt;ipconfig getifaddr en0&lt;/code&gt; to find the local network IP address of my Mac and then ran &lt;code&gt;hey&lt;/code&gt; against that IP address (&lt;a href="https://stackoverflow.com/questions/22944631/how-to-get-the-ip-address-of-the-docker-host-from-inside-a-docker-container"&gt;thanks again, Stack Overflow&lt;/a&gt;):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;simonw$ docker run heyimage hey -n 100 -c 10 'http://10.0.0.12:8000/'
Summary:
  Total:	0.2481 secs
  ...&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;For me, this use-case illustrates a huge part of the value of Docker: it lets you execute tools written in basically anything without having to pollute your laptop with environment junk.&lt;/p&gt;

&lt;h2&gt;&lt;a id="Running_commands_against_files_83"&gt;&lt;/a&gt;Running commands against files&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;Update: 9th November 2017&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I decided to use this technique to try out &lt;a href="https://github.com/tdewolff/minify/tree/master/cmd/minify"&gt;this Go minify tool&lt;/a&gt; by Taco de Wolff. Building the tool into a container used the same pattern:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;docker run golang go get -v github.com/tdewolff/minify/cmd/minify
docker commit $(docker ps -lq) minify
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Running the command this time is a bit harder, because it needs access to files on my filesystem. I can give it access by mounting my current directory as part of the &lt;code&gt;docker run&lt;/code&gt; command, like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;docker run -v `pwd`:/mnt minify minify /mnt/all.css
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Running this minifies the contents of the &lt;code&gt;all.css&lt;/code&gt; file in my current directory and outputs the result to standard out. If I want to save it I can redirect it to a file like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;docker run -v `pwd`:/mnt minify minify /mnt/all.css &amp;gt; all.min.css
&lt;/code&gt;&lt;/pre&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/go"&gt;go&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/load-testing"&gt;load-testing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zeit-now"&gt;zeit-now&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/docker"&gt;docker&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="go"/><category term="load-testing"/><category term="zeit-now"/><category term="docker"/></entry><entry><title>Carbon</title><link href="https://simonwillison.net/2017/Oct/19/carbon/#atom-tag" rel="alternate"/><published>2017-10-19T18:31:47+00:00</published><updated>2017-10-19T18:31:47+00:00</updated><id>https://simonwillison.net/2017/Oct/19/carbon/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://carbon.now.sh/"&gt;Carbon&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Beautiful little tool that you can paste source code into to generate an image of that code with syntax highlighting applied, ready to be tweeted or shared anywhere that lets you share an image. Built in Node and next.js, with image generation handled client-side by the dom-to-image JavaScript library which loads HTML into an SVG foreignObject (sadly not yet supported by Safari) and uses that to populate a canvas and produce a PNG.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/rauchg/status/921031246385262592"&gt;Guillermo Rauch&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nodejs"&gt;nodejs&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/svg"&gt;svg&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zeit-now"&gt;zeit-now&lt;/a&gt;&lt;/p&gt;



</summary><category term="javascript"/><category term="nodejs"/><category term="svg"/><category term="zeit-now"/></entry><entry><title>Deploying an asynchronous Python microservice with Sanic and Zeit Now</title><link href="https://simonwillison.net/2017/Oct/14/async-python-sanic-now/#atom-tag" rel="alternate"/><published>2017-10-14T21:46:38+00:00</published><updated>2017-10-14T21:46:38+00:00</updated><id>https://simonwillison.net/2017/Oct/14/async-python-sanic-now/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;a href="https://simonwillison.net/tags/jsonhead/"&gt;Back in 2008&lt;/a&gt; Natalie Downe and I deployed what today we would call a microservice: &lt;a href="https://github.com/simonw/json-head"&gt;json-head&lt;/a&gt;, a tiny Google App Engine app that allowed you to make an HTTP head request against a URL and get back the HTTP headers as JSON. One of our initial use-scase for this was &lt;a href="https://gist.github.com/natbat/8406b8e5a8ed22d6a2e1bbd75771bc97"&gt;Natalie’s addSizes.js&lt;/a&gt;, an unobtrusive jQuery script that could annotate links to PDFs and other large files with their corresponding file size pulled from the &lt;code&gt;Content-Length&lt;/code&gt; header. Another potential use-case is detecting broken links, since the API can be used to spot 404 status codes (&lt;a href="https://json-head.now.sh/?url=https://simonwillison.net/page-does-not-exist"&gt;as in this example&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;At some point in the following decade &lt;code&gt;json-head.appspot.com&lt;/code&gt; stopped working. Today I’m bringing it back, mainly as an excuse to try out the combination of Python 3.5 async, the &lt;a href="https://github.com/channelcat/sanic/"&gt;Sanic&lt;/a&gt; microframework and Zeit’s brilliant &lt;a href="https://zeit.co/now"&gt;Now&lt;/a&gt; deployment platform.&lt;/p&gt;
&lt;p&gt;First, a demo. &lt;a href="https://json-head.now.sh/?url=https://simonwillison.net/"&gt;https://json-head.now.sh/?url=https://simonwillison.net/&lt;/a&gt; returns the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[
    {
        &amp;quot;ok&amp;quot;: true,
        &amp;quot;headers&amp;quot;: {
            &amp;quot;Date&amp;quot;: &amp;quot;Sat, 14 Oct 2017 18:37:52 GMT&amp;quot;,
            &amp;quot;Content-Type&amp;quot;: &amp;quot;text/html; charset=utf-8&amp;quot;,
            &amp;quot;Connection&amp;quot;: &amp;quot;keep-alive&amp;quot;,
            &amp;quot;Set-Cookie&amp;quot;: &amp;quot;__cfduid=dd0b71b4e89bbaca5b27fa06c0b95af4a1508006272; expires=Sun, 14-Oct-18 18:37:52 GMT; path=/; domain=.simonwillison.net; HttpOnly; Secure&amp;quot;,
            &amp;quot;Cache-Control&amp;quot;: &amp;quot;s-maxage=200&amp;quot;,
            &amp;quot;X-Frame-Options&amp;quot;: &amp;quot;SAMEORIGIN&amp;quot;,
            &amp;quot;Via&amp;quot;: &amp;quot;1.1 vegur&amp;quot;,
            &amp;quot;CF-Cache-Status&amp;quot;: &amp;quot;HIT&amp;quot;,
            &amp;quot;Vary&amp;quot;: &amp;quot;Accept-Encoding&amp;quot;,
            &amp;quot;Server&amp;quot;: &amp;quot;cloudflare-nginx&amp;quot;,
            &amp;quot;CF-RAY&amp;quot;: &amp;quot;3adca70269a51e8f-SJC&amp;quot;,
            &amp;quot;Content-Encoding&amp;quot;: &amp;quot;gzip&amp;quot;
        },
        &amp;quot;status&amp;quot;: 200,
        &amp;quot;url&amp;quot;: &amp;quot;https://simonwillison.net/&amp;quot;
    }
]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Given a URL, &lt;code&gt;json-head.now.sh&lt;/code&gt; performs an HTTP HEAD request and returns the resulting status code and the HTTP headers. Results are returned with the &lt;code&gt;Access-Control-Allow-Origin: *&lt;/code&gt; header so you can call the API using &lt;code&gt;fetch()&lt;/code&gt; or &lt;code&gt;XMLHttpRequest&lt;/code&gt; from JavaScript running on any page.&lt;/p&gt;
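&lt;p&gt;To make that shape concrete, here is a small illustrative sketch (the &lt;code&gt;head_result&lt;/code&gt; helper is hypothetical, not part of json-head) that builds one entry of the response list:&lt;/p&gt;

```python
import json

def head_result(status, headers, url, ok=True):
    # Hypothetical helper mirroring the shape of one entry
    # in the JSON list that json-head returns
    return {"ok": ok, "headers": dict(headers), "status": status, "url": url}

entry = head_result(
    200,
    {"Content-Type": "text/html; charset=utf-8", "Server": "cloudflare-nginx"},
    "https://simonwillison.net/",
)
print(json.dumps([entry], indent=4))
```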
&lt;h2&gt;&lt;a id="Sanic_and_Python_asyncawait_32"&gt;&lt;/a&gt;Sanic and Python async/await&lt;/h2&gt;
&lt;p&gt;A key new feature &lt;a href="https://docs.python.org/3/whatsnew/3.5.html"&gt;added to Python 3.5&lt;/a&gt; back in September 2015 was built-in syntactic support for coroutine control via the async/await statements. Python now has some serious credibility as a platform for asynchronous I/O (the concept that got me &lt;a href="https://simonwillison.net/2009/Nov/23/node/"&gt;so excited about Node.js back in 2009&lt;/a&gt;). This has led to an explosion of asynchronous innovation around the Python community.&lt;/p&gt;
&lt;p&gt;json-head is the perfect application for async - it’s little more than a dumbed-down HTTP proxy, accepting incoming HTTP requests, making its own requests elsewhere and then returning the results.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/channelcat/sanic/"&gt;Sanic&lt;/a&gt; is a Flask-like web framework built specifically to take advantage of async/await in Python 3.5. It’s designed for speed - built on top of &lt;a href="https://github.com/MagicStack/uvloop"&gt;uvloop&lt;/a&gt;, a Python wrapper for &lt;a href="https://github.com/libuv/libuv"&gt;libuv&lt;/a&gt; (which itself was originally built to power Node.js). uvloop’s self-selected benchmarks are &lt;a href="https://magic.io/blog/uvloop-blazing-fast-python-networking/"&gt;extremely impressive&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;&lt;a id="Zeit_Now_40"&gt;&lt;/a&gt;Zeit Now&lt;/h2&gt;
&lt;p&gt;To host this new microservice, I chose &lt;a href="https://zeit.co/now"&gt;Zeit Now&lt;/a&gt;. It’s a truly beautiful piece of software design.&lt;/p&gt;
&lt;p&gt;Now lets you treat deployments as immutable. Every time you deploy you get a brand new URL. You can then interact with your deployment directly, or point an existing alias to it if you want a persistent URL for your project.&lt;/p&gt;
&lt;p&gt;Deployments are free, and deployed code stays available forever due to &lt;a href="https://github.com/zeit/now-cli/issues/189"&gt;some clever engineering&lt;/a&gt; behind the scenes.&lt;/p&gt;
&lt;p&gt;Best of all: deploying a project takes just a single command: type &lt;code&gt;now&lt;/code&gt; and the code in your current directory will be deployed to their cloud and assigned a unique URL.&lt;/p&gt;
&lt;p&gt;Now was originally built for Node.js projects, but last August &lt;a href="https://zeit.co/blog/now-dockerfile"&gt;Zeit added Docker support&lt;/a&gt;. If the directory you run it in contains a Dockerfile, running &lt;code&gt;now&lt;/code&gt; will upload, build and run the corresponding image.&lt;/p&gt;
&lt;p&gt;There’s just one thing missing: good examples of how to deploy Python projects to Now using Docker. I’m hoping this article can help fill that gap.&lt;/p&gt;
&lt;p&gt;Here’s the &lt;a href="https://github.com/simonw/json-head/blob/master/Dockerfile"&gt;complete Dockerfile&lt;/a&gt; I’m using for json-head:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;FROM python:3
COPY . /app
WORKDIR /app
RUN pip install -r requirements.txt
EXPOSE 8006
CMD [&amp;quot;python&amp;quot;, &amp;quot;json_head.py&amp;quot;]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I’m using the &lt;a href="https://hub.docker.com/_/python/"&gt;official Docker Python image&lt;/a&gt; as a base, copying the current directory into the image, using &lt;code&gt;pip install&lt;/code&gt; to install dependencies, exposing port 8006 (for no reason other than that’s the port I use in my local development environment) and running the &lt;a href="https://github.com/simonw/json-head/blob/master/json_head.py"&gt;json_head.py&lt;/a&gt; script. Now is smart enough to forward incoming HTTP traffic on port 80 to the port exposed by the container.&lt;/p&gt;
&lt;p&gt;If you set up Now yourself (&lt;code&gt;npm install -g now&lt;/code&gt; or use &lt;a href="https://zeit.co/download"&gt;one of their installers&lt;/a&gt;) you can deploy my code directly from GitHub to your own instance with a single command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ now simonw/json-head
&amp;gt; Didn't find directory. Searching on GitHub...
&amp;gt; Deploying GitHub repository &amp;quot;simonw/json-head&amp;quot; under simonw
&amp;gt; Ready! https://simonw-json-head-xqkfgorgei.now.sh (copied to clipboard) [1s]
&amp;gt; Initializing…
&amp;gt; Building
&amp;gt; ▲ docker build
Sending build context to Docker daemon 7.168 kB
&amp;gt; Step 1 : FROM python:3
&amp;gt; 3: Pulling from library/python
&amp;gt; ... lots more stuff here ...
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;&lt;a id="Initial_implementation_79"&gt;&lt;/a&gt;Initial implementation&lt;/h2&gt;
&lt;p&gt;Here’s my first working version of json-head using Sanic:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from sanic import Sanic
from sanic import response
import aiohttp

app = Sanic(__name__)

async def head(session, url):
    try:
        async with session.head(url) as response:
            return {
                'ok': True,
                'headers': dict(response.headers),
                'status': response.status,
                'url': url,
            }
    except Exception as e:
        return {
            'ok': False,
            'error': str(e),
            'url': url,
        }

@app.route('/')
async def handle_request(request):
    url = request.args.get('url')
    if url:
        async with aiohttp.ClientSession() as session:
            head_info = await head(session, url)
            return response.json(
                head_info,
                headers={
                    'Access-Control-Allow-Origin': '*'
                },
            )
    else:
        return response.html('Try /?url=xxx')

if __name__ == '__main__':
    app.run(host=&amp;quot;0.0.0.0&amp;quot;, port=8006)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This exact code is deployed at &lt;a href="https://json-head-thlbstmwfi.now.sh/"&gt;https://json-head-thlbstmwfi.now.sh/&lt;/a&gt; - since Now deployments are free, there’s no reason not to leave work-in-progress examples hosted as throwaway deployments.&lt;/p&gt;
&lt;p&gt;In addition to Sanic, I’m also using the handy &lt;a href="https://github.com/aio-libs/aiohttp"&gt;aiohttp&lt;/a&gt; asynchronous HTTP library, whose API design is clearly inspired by my all-time favourite HTTP library, &lt;a href="https://github.com/requests/requests"&gt;requests&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The key new pieces of syntax to understand in the above code are the async and await statements. &lt;code&gt;async def&lt;/code&gt; is used to declare a function that acts as a coroutine. Coroutines need to be executed inside an event loop (which Sanic handles for us), but gain the ability to use the &lt;code&gt;await&lt;/code&gt; statement.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;await&lt;/code&gt; statement is the real magic here: it suspends the current coroutine until the coroutine it is calling has finished executing. It is this that allows us to write asynchronous code without descending into a messy hell of callback functions.&lt;/p&gt;
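&lt;p&gt;Here is a standalone sketch of that suspend-and-resume behaviour (it uses &lt;code&gt;asyncio.run&lt;/code&gt;, a later convenience API, for brevity; on Python 3.5 you would drive the event loop with &lt;code&gt;loop.run_until_complete&lt;/code&gt;):&lt;/p&gt;

```python
import asyncio

async def fetch_fake(name, delay):
    # await suspends this coroutine until the sleep completes,
    # freeing the event loop to run other coroutines meanwhile
    await asyncio.sleep(delay)
    return name

async def main():
    # each await resumes only once the awaited coroutine has returned
    first = await fetch_fake("a", 0.01)
    second = await fetch_fake("b", 0.01)
    return [first, second]

print(asyncio.run(main()))
```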
&lt;h2&gt;&lt;a id="Adding_parallel_requests_131"&gt;&lt;/a&gt;Adding parallel requests&lt;/h2&gt;
&lt;p&gt;So far we haven’t really taken advantage of what async I/O can do: if every incoming HTTP request results in a single outgoing HTTP request, async may help us serve more incoming requests at once, but it’s not giving us any new functionality.&lt;/p&gt;
&lt;p&gt;Executing multiple outbound HTTP requests in parallel is a much more interesting use-case. Let’s add support for multiple &lt;code&gt;?url=&lt;/code&gt; parameters, such as the following:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://json-head.now.sh/?url=https://simonwillison.net/&amp;amp;url=https://www.google.com/"&gt;https://json-head.now.sh/?url=https://simonwillison.net/&amp;amp;url=https://www.google.com/&lt;/a&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[
    {
        &amp;quot;ok&amp;quot;: true,
        &amp;quot;headers&amp;quot;: {
            &amp;quot;Date&amp;quot;: &amp;quot;Sat, 14 Oct 2017 19:35:29 GMT&amp;quot;,
            &amp;quot;Content-Type&amp;quot;: &amp;quot;text/html; charset=utf-8&amp;quot;,
            &amp;quot;Connection&amp;quot;: &amp;quot;keep-alive&amp;quot;,
            &amp;quot;Set-Cookie&amp;quot;: &amp;quot;__cfduid=ded486c1faaac166e8ae72a87979c02101508009729; expires=Sun, 14-Oct-18 19:35:29 GMT; path=/; domain=.simonwillison.net; HttpOnly; Secure&amp;quot;,
            &amp;quot;Cache-Control&amp;quot;: &amp;quot;s-maxage=200&amp;quot;,
            &amp;quot;X-Frame-Options&amp;quot;: &amp;quot;SAMEORIGIN&amp;quot;,
            &amp;quot;Via&amp;quot;: &amp;quot;1.1 vegur&amp;quot;,
            &amp;quot;CF-Cache-Status&amp;quot;: &amp;quot;EXPIRED&amp;quot;,
            &amp;quot;Vary&amp;quot;: &amp;quot;Accept-Encoding&amp;quot;,
            &amp;quot;Server&amp;quot;: &amp;quot;cloudflare-nginx&amp;quot;,
            &amp;quot;CF-RAY&amp;quot;: &amp;quot;3adcfb671c862888-SJC&amp;quot;,
            &amp;quot;Content-Encoding&amp;quot;: &amp;quot;gzip&amp;quot;
        },
        &amp;quot;status&amp;quot;: 200,
        &amp;quot;url&amp;quot;: &amp;quot;https://simonwillison.net/&amp;quot;
    },
    {
        &amp;quot;ok&amp;quot;: true,
        &amp;quot;headers&amp;quot;: {
            &amp;quot;Date&amp;quot;: &amp;quot;Sat, 14 Oct 2017 19:35:29 GMT&amp;quot;,
            &amp;quot;Expires&amp;quot;: &amp;quot;-1&amp;quot;,
            &amp;quot;Cache-Control&amp;quot;: &amp;quot;private, max-age=0&amp;quot;,
            &amp;quot;Content-Type&amp;quot;: &amp;quot;text/html; charset=ISO-8859-1&amp;quot;,
            &amp;quot;P3P&amp;quot;: &amp;quot;CP=\&amp;quot;This is not a P3P policy! See g.co/p3phelp for more info.\&amp;quot;&amp;quot;,
            &amp;quot;Content-Encoding&amp;quot;: &amp;quot;gzip&amp;quot;,
            &amp;quot;Server&amp;quot;: &amp;quot;gws&amp;quot;,
            &amp;quot;X-XSS-Protection&amp;quot;: &amp;quot;1; mode=block&amp;quot;,
            &amp;quot;X-Frame-Options&amp;quot;: &amp;quot;SAMEORIGIN&amp;quot;,
            &amp;quot;Set-Cookie&amp;quot;: &amp;quot;1P_JAR=2017-10-14-19; expires=Sat, 21-Oct-2017 19:35:29 GMT; path=/; domain=.google.com&amp;quot;,
            &amp;quot;Alt-Svc&amp;quot;: &amp;quot;quic=\&amp;quot;:443\&amp;quot;; ma=2592000; v=\&amp;quot;39,38,37,35\&amp;quot;&amp;quot;,
            &amp;quot;Transfer-Encoding&amp;quot;: &amp;quot;chunked&amp;quot;
        },
        &amp;quot;status&amp;quot;: 200,
        &amp;quot;url&amp;quot;: &amp;quot;https://www.google.com/&amp;quot;
    }
]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We’re now accepting multiple URLs and executing multiple HEAD requests, and Python 3.5 async makes it easy to run them in parallel, so our overall request time should match that of the single longest HEAD request we triggered.&lt;/p&gt;
&lt;p&gt;Here’s an implementation that adds support for multiple, parallel outbound HTTP requests:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import asyncio  # needed for asyncio.gather

@app.route('/')
async def handle_request(request):
    urls = request.args.getlist('url')
    if urls:
        async with aiohttp.ClientSession() as session:
            head_infos = await asyncio.gather(*[
                head(session, url) for url in urls
            ])
            return response.json(
                head_infos,
                headers={'Access-Control-Allow-Origin': '*'},
            )
    else:
        return response.html(INDEX)  # INDEX is an HTML index page defined in the full script
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We’re using the &lt;code&gt;asyncio&lt;/code&gt; module from the Python 3.5 standard library here - in particular the &lt;code&gt;gather&lt;/code&gt; function. &lt;a href="https://docs.python.org/3/library/asyncio-task.html#asyncio.gather"&gt;&lt;code&gt;asyncio.gather&lt;/code&gt;&lt;/a&gt; takes a list of coroutines and returns a future aggregating their results. This future will resolve (and return to a corresponding &lt;code&gt;await&lt;/code&gt; statement) as soon as all of those coroutines have returned their values.&lt;/p&gt;
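&lt;p&gt;A standalone sketch of that behaviour: gathering three sleeps takes roughly as long as the slowest one, not the sum (again using &lt;code&gt;asyncio.run&lt;/code&gt;, a later convenience API, for brevity):&lt;/p&gt;

```python
import asyncio
import time

async def slow(delay):
    await asyncio.sleep(delay)
    return delay

async def main():
    start = time.monotonic()
    # gather schedules all three coroutines concurrently and
    # returns their results in the order they were passed in
    results = await asyncio.gather(slow(0.05), slow(0.1), slow(0.02))
    return results, time.monotonic() - start

results, elapsed = asyncio.run(main())
print(results)  # results arrive in argument order, not completion order
```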
&lt;p&gt;My final code for json-head &lt;a href="https://github.com/simonw/json-head"&gt;can be found on GitHub&lt;/a&gt;. As I hope I’ve demonstrated, the combination of Python 3.5+, Sanic and Now makes deploying asynchronous Python microservices trivially easy.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/async"&gt;async&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jsonhead"&gt;jsonhead&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/natalie-downe"&gt;natalie-downe&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sanic"&gt;sanic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zeit-now"&gt;zeit-now&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/docker"&gt;docker&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="async"/><category term="jsonhead"/><category term="natalie-downe"/><category term="python"/><category term="sanic"/><category term="zeit-now"/><category term="docker"/></entry></feed>