<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: gzip</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/gzip.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2026-04-09T03:54:40+00:00</updated><author><name>Simon Willison</name></author><entry><title>asgi-gzip 0.3</title><link href="https://simonwillison.net/2026/Apr/9/asgi-gzip/#atom-tag" rel="alternate"/><published>2026-04-09T03:54:40+00:00</published><updated>2026-04-09T03:54:40+00:00</updated><id>https://simonwillison.net/2026/Apr/9/asgi-gzip/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;Release:&lt;/strong&gt; &lt;a href="https://github.com/simonw/asgi-gzip/releases/tag/0.3"&gt;asgi-gzip 0.3&lt;/a&gt;&lt;/p&gt;
    &lt;p&gt;I ran into trouble deploying a new feature using &lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events"&gt;SSE&lt;/a&gt; to a production Datasette instance, and it turned out that instance was using &lt;a href="https://github.com/simonw/datasette-gzip"&gt;datasette-gzip&lt;/a&gt; which uses &lt;a href="https://github.com/simonw/asgi-gzip"&gt;asgi-gzip&lt;/a&gt; which was incorrectly compressing &lt;code&gt;text/event-stream&lt;/code&gt; responses.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;asgi-gzip&lt;/code&gt; was extracted from Starlette, and has &lt;a href="https://simonwillison.net/2022/Apr/28/issue-on-changes/"&gt;a GitHub Actions scheduled workflow&lt;/a&gt; to check Starlette for updates that need to be ported to the library... but that action had stopped running and hence had missed &lt;a href="https://github.com/Kludex/starlette/commit/a9a8dab0cc3cbd05dca37650fc392717b9fe5bbf"&gt;Starlette's own fix&lt;/a&gt; for this issue.&lt;/p&gt;
&lt;p&gt;I ran the workflow and integrated the new fix, and now &lt;code&gt;datasette-gzip&lt;/code&gt; and &lt;code&gt;asgi-gzip&lt;/code&gt; both correctly handle &lt;code&gt;text/event-stream&lt;/code&gt; in SSE responses.&lt;/p&gt;
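&lt;p&gt;The underlying problem is that gzip buffers output, so individual SSE events can sit in the compressor instead of reaching the client as they happen. The fix boils down to checking the response content type before compressing. A minimal sketch of that check (an illustration of the idea, not the actual Starlette or asgi-gzip code):&lt;/p&gt;

```python
# Sketch of the content-type check that avoids gzipping SSE responses.
# Illustrative only - not the actual asgi-gzip/Starlette implementation.
def should_gzip(headers: dict[str, str]) -> bool:
    """Return False for responses that must not be gzip-buffered."""
    content_type = headers.get("content-type", "")
    # Server-sent events rely on incremental delivery; buffering them
    # inside a gzip stream delays every event until the buffer flushes.
    return not content_type.startswith("text/event-stream")

print(should_gzip({"content-type": "text/html; charset=utf-8"}))  # True
print(should_gzip({"content-type": "text/event-stream"}))         # False
```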
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/gzip"&gt;gzip&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/asgi"&gt;asgi&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="gzip"/><category term="python"/><category term="asgi"/></entry><entry><title>Automatically opening issues when tracked file content changes</title><link href="https://simonwillison.net/2022/Apr/28/issue-on-changes/#atom-tag" rel="alternate"/><published>2022-04-28T17:18:14+00:00</published><updated>2022-04-28T17:18:14+00:00</updated><id>https://simonwillison.net/2022/Apr/28/issue-on-changes/#atom-tag</id><summary type="html">
    &lt;p&gt;I figured out a GitHub Actions pattern to keep track of a file published somewhere on the internet and automatically open a new repository issue any time the contents of that file changes.&lt;/p&gt;
&lt;h4&gt;Extracting GZipMiddleware from Starlette&lt;/h4&gt;
&lt;p&gt;Here's why I needed to solve this problem.&lt;/p&gt;
&lt;p&gt;I want to add gzip support to my &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; open source project. Datasette builds on the Python &lt;a href="https://asgi.readthedocs.io/"&gt;ASGI&lt;/a&gt; standard, and &lt;a href="https://www.starlette.io/"&gt;Starlette&lt;/a&gt; provides an extremely well tested, robust &lt;a href="https://www.starlette.io/middleware/#gzipmiddleware"&gt;GZipMiddleware class&lt;/a&gt; that adds gzip support to any ASGI application. As with everything else in Starlette, it's &lt;em&gt;really&lt;/em&gt; good code.&lt;/p&gt;
&lt;p&gt;The problem is, I don't want to add the whole of Starlette as a dependency. I'm trying to keep Datasette's core as small as possible, so I'm very careful about new dependencies. Starlette itself is actually very light (and only has a tiny number of dependencies of its own) but I still don't want the whole thing just for that one class.&lt;/p&gt;
&lt;p&gt;So I decided to extract the &lt;code&gt;GZipMiddleware&lt;/code&gt; class into a separate Python package, under the same BSD license as Starlette itself.&lt;/p&gt;
&lt;p&gt;The result is my new &lt;a href="https://pypi.org/project/asgi-gzip/"&gt;asgi-gzip&lt;/a&gt; package, now available on PyPI.&lt;/p&gt;
&lt;h4&gt;What if Starlette fixes a bug?&lt;/h4&gt;
&lt;p&gt;The problem with extracting code like this is that Starlette is a very effectively maintained package. What if they make improvements or fix bugs in the &lt;code&gt;GZipMiddleware&lt;/code&gt; class? How can I make sure to apply those same fixes to my extracted copy?&lt;/p&gt;
&lt;p&gt;As I thought about this challenge, I realized I had most of the solution already.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt;&lt;/strong&gt; is the name I've given to the trick of running a periodic scraper that writes to a git repository in order to track changes to data over time.&lt;/p&gt;
&lt;p&gt;It may seem redundant to do this against a file that already &lt;a href="https://github.com/encode/starlette/commits/master/starlette/middleware/gzip.py"&gt;lives in version control&lt;/a&gt; elsewhere - but in addition to tracking changes, Git scraping can offer a cheap and easy way to add automation that triggers when a change is detected.&lt;/p&gt;
&lt;p&gt;I need an actionable alert any time the Starlette code changes so I can review the change and apply a fix to my own library, if necessary.&lt;/p&gt;
&lt;p&gt;Since I already run all of my projects out of GitHub issues, automatically opening an issue against the &lt;a href="https://github.com/simonw/asgi-gzip"&gt;asgi-gzip repository&lt;/a&gt; would be ideal.&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://github.com/simonw/asgi-gzip/blob/0.1/.github/workflows/track.yml"&gt;track.yml workflow&lt;/a&gt; does exactly that: it implements the Git scraping pattern against the &lt;a href="https://github.com/encode/starlette/blob/master/starlette/middleware/gzip.py"&gt;gzip.py module&lt;/a&gt; in Starlette, and files an issue any time it detects changes to that file.&lt;/p&gt;
&lt;p&gt;Starlette haven't made any changes to that file since I started tracking it, so I created &lt;a href="https://github.com/simonw/issue-when-changed"&gt;a test repo&lt;/a&gt; to try this out.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://github.com/simonw/issue-when-changed/issues/3"&gt;one of the example issues&lt;/a&gt;. I decided to include the visual diff in the issue description and have a link to it from the underlying commit as well.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/issue-when-changed.jpg" alt="Screenshot of an open issue page. The issues is titled &amp;quot;gzip.py was updated&amp;quot; and contains a visual diff showing the change to a file. A commit that references the issue is listed too." style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;h4&gt;How it works&lt;/h4&gt;
&lt;p&gt;The implementation is contained entirely in this &lt;a href="https://github.com/simonw/asgi-gzip/blob/0.1/.github/workflows/track.yml"&gt;track.yml workflow&lt;/a&gt;. I designed it as a single self-contained file to make it easy to copy, paste and adapt for other projects.&lt;/p&gt;
&lt;p&gt;It uses &lt;a href="https://github.com/actions/github-script"&gt;actions/github-script&lt;/a&gt;, which makes it easy to do things like file new issues using JavaScript.&lt;/p&gt;
&lt;p&gt;Here's a heavily annotated copy:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Track the Starlette version of this&lt;/span&gt;

&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Run on repo pushes, and if a user clicks the "run this action" button,&lt;/span&gt;
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; and on a schedule at 5:21am UTC every day&lt;/span&gt;
&lt;span class="pl-ent"&gt;on&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;push&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;workflow_dispatch&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;schedule&lt;/span&gt;:
  - &lt;span class="pl-ent"&gt;cron&lt;/span&gt;:  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;21 5 * * *&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Without this block I got this error when the action ran:&lt;/span&gt;
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; HttpError: Resource not accessible by integration&lt;/span&gt;
&lt;span class="pl-ent"&gt;permissions&lt;/span&gt;:
  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Allow the action to create issues&lt;/span&gt;
  &lt;span class="pl-ent"&gt;issues&lt;/span&gt;: &lt;span class="pl-s"&gt;write&lt;/span&gt;
  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Allow the action to commit back to the repository&lt;/span&gt;
  &lt;span class="pl-ent"&gt;contents&lt;/span&gt;: &lt;span class="pl-s"&gt;write&lt;/span&gt;

&lt;span class="pl-ent"&gt;jobs&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;check&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;runs-on&lt;/span&gt;: &lt;span class="pl-s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="pl-ent"&gt;steps&lt;/span&gt;:
    - &lt;span class="pl-ent"&gt;uses&lt;/span&gt;: &lt;span class="pl-s"&gt;actions/checkout@v2&lt;/span&gt;
    - &lt;span class="pl-ent"&gt;uses&lt;/span&gt;: &lt;span class="pl-s"&gt;actions/github-script@v6&lt;/span&gt;
      &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Using env: here to demonstrate how an action like this can&lt;/span&gt;
      &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; be adjusted to take dynamic inputs&lt;/span&gt;
      &lt;span class="pl-ent"&gt;env&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;URL&lt;/span&gt;: &lt;span class="pl-s"&gt;https://raw.githubusercontent.com/encode/starlette/master/starlette/middleware/gzip.py&lt;/span&gt;
        &lt;span class="pl-ent"&gt;FILE_NAME&lt;/span&gt;: &lt;span class="pl-s"&gt;tracking/gzip.py&lt;/span&gt;
      &lt;span class="pl-ent"&gt;with&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;script&lt;/span&gt;: &lt;span class="pl-s"&gt;|&lt;/span&gt;
&lt;span class="pl-s"&gt;          const { URL, FILE_NAME } = process.env;&lt;/span&gt;
&lt;span class="pl-s"&gt;          // promisify pattern for getting an await version of child_process.exec&lt;/span&gt;
&lt;span class="pl-s"&gt;          const util = require("util");&lt;/span&gt;
&lt;span class="pl-s"&gt;          // Used exec_ here because 'exec' variable name is already used:&lt;/span&gt;
&lt;span class="pl-s"&gt;          const exec_ = util.promisify(require("child_process").exec);&lt;/span&gt;
&lt;span class="pl-s"&gt;          // Use curl to download the file&lt;/span&gt;
&lt;span class="pl-s"&gt;          await exec_(`curl -o ${FILE_NAME} ${URL}`);&lt;/span&gt;
&lt;span class="pl-s"&gt;          // Use 'git diff' to detect if the file has changed since last time&lt;/span&gt;
&lt;span class="pl-s"&gt;          const { stdout } = await exec_(`git diff ${FILE_NAME}`);&lt;/span&gt;
&lt;span class="pl-s"&gt;          if (stdout) {&lt;/span&gt;
&lt;span class="pl-s"&gt;            // There was a diff to that file&lt;/span&gt;
&lt;span class="pl-s"&gt;            const title = `${FILE_NAME} was updated`;&lt;/span&gt;
&lt;span class="pl-s"&gt;            const body =&lt;/span&gt;
&lt;span class="pl-s"&gt;              `${URL} changed:` +&lt;/span&gt;
&lt;span class="pl-s"&gt;              "\n\n```diff\n" +&lt;/span&gt;
&lt;span class="pl-s"&gt;              stdout +&lt;/span&gt;
&lt;span class="pl-s"&gt;              "\n```\n\n" +&lt;/span&gt;
&lt;span class="pl-s"&gt;              "Close this issue once those changes have been integrated here";&lt;/span&gt;
&lt;span class="pl-s"&gt;            const issue = await github.rest.issues.create({&lt;/span&gt;
&lt;span class="pl-s"&gt;              owner: context.repo.owner,&lt;/span&gt;
&lt;span class="pl-s"&gt;              repo: context.repo.repo,&lt;/span&gt;
&lt;span class="pl-s"&gt;              title: title,&lt;/span&gt;
&lt;span class="pl-s"&gt;              body: body,&lt;/span&gt;
&lt;span class="pl-s"&gt;            });&lt;/span&gt;
&lt;span class="pl-s"&gt;            const issueNumber = issue.data.number;&lt;/span&gt;
&lt;span class="pl-s"&gt;            // Now commit and reference that issue number, so the commit shows up&lt;/span&gt;
&lt;span class="pl-s"&gt;            // listed at the bottom of the issue page&lt;/span&gt;
&lt;span class="pl-s"&gt;            const commitMessage = `${FILE_NAME} updated, refs #${issueNumber}`;&lt;/span&gt;
&lt;span class="pl-s"&gt;            // https://til.simonwillison.net/github-actions/commit-if-file-changed&lt;/span&gt;
&lt;span class="pl-s"&gt;            await exec_(`git config user.name "Automated"`);&lt;/span&gt;
&lt;span class="pl-s"&gt;            await exec_(`git config user.email "actions@users.noreply.github.com"`);&lt;/span&gt;
&lt;span class="pl-s"&gt;            await exec_(`git add -A`);&lt;/span&gt;
&lt;span class="pl-s"&gt;            await exec_(`git commit -m "${commitMessage}" || exit 0`);&lt;/span&gt;
&lt;span class="pl-s"&gt;            await exec_(`git pull --rebase`);&lt;/span&gt;
&lt;span class="pl-s"&gt;            await exec_(`git push`);&lt;/span&gt;
&lt;span class="pl-s"&gt;          }&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;In the &lt;a href="https://github.com/simonw/asgi-gzip"&gt;asgi-gzip&lt;/a&gt; repository I keep the fetched &lt;code&gt;gzip.py&lt;/code&gt; file in a &lt;code&gt;tracking/&lt;/code&gt; directory. This directory isn't included in the Python package that gets uploaded to PyPI - it's there only so that my code can track changes to it over time.&lt;/p&gt;
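&lt;p&gt;Stripped of the GitHub-specific parts, the detection step amounts to: fetch the upstream file, overwrite the tracked copy, and see whether it differed from the previous run. That core comparison can be sketched in Python (a hypothetical helper for illustration, not code from the workflow):&lt;/p&gt;

```python
from pathlib import Path


def file_changed(tracked_path: str, new_content: str) -> bool:
    """Overwrite the tracked copy and report whether it differed.

    Returns True (and updates the file) when the fetched content
    differs from what was stored on the previous run.
    """
    path = Path(tracked_path)
    old_content = path.read_text() if path.exists() else None
    if old_content == new_content:
        return False
    path.write_text(new_content)
    return True
```

In the real workflow the fetch is a `curl` call and the comparison is `git diff`, which has the bonus of producing the patch text that gets embedded in the issue body.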
&lt;h4&gt;More interesting applications&lt;/h4&gt;
&lt;p&gt;I built this to solve my "tell me when Starlette update their &lt;code&gt;gzip.py&lt;/code&gt; file" problem, but clearly this pattern has much more interesting uses.&lt;/p&gt;
&lt;p&gt;You could point this at any web page to get a new GitHub issue opened when that page content changes. Subscribe to notifications for that repository and you get a robust, shared mechanism for alerts - plus an issue system where you can post additional comments and close the issue once someone has reviewed the change.&lt;/p&gt;
&lt;p&gt;There's a lot of potential here for solving all kinds of interesting problems. And it doesn't cost anything either: GitHub Actions (somehow) remains completely free for public repositories!&lt;/p&gt;
&lt;h4&gt;Update: October 13th 2022&lt;/h4&gt;
&lt;p&gt;Almost six months after writing about this... it triggered for the first time!&lt;/p&gt;
&lt;p&gt;Here's the issue that the script opened: &lt;a href="https://github.com/simonw/asgi-gzip/issues/4"&gt;#4: tracking/gzip.py was updated&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I applied the improvement (Marcelo Trylesinski and Kai Klingenberg updated Starlette's code to avoid gzipping if the response already had a Content-Encoding header) and released &lt;a href="https://github.com/simonw/asgi-gzip/releases/tag/0.2"&gt;version 0.2&lt;/a&gt; of the package.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gzip"&gt;gzip&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/asgi"&gt;asgi&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-issues"&gt;github-issues&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/starlette"&gt;starlette&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="github"/><category term="gzip"/><category term="projects"/><category term="python"/><category term="datasette"/><category term="asgi"/><category term="github-actions"/><category term="git-scraping"/><category term="github-issues"/><category term="starlette"/></entry><entry><title>Weeknotes: Parallel SQL queries for Datasette, plus some middleware tricks</title><link href="https://simonwillison.net/2022/Apr/27/parallel-queries/#atom-tag" rel="alternate"/><published>2022-04-27T19:01:30+00:00</published><updated>2022-04-27T19:01:30+00:00</updated><id>https://simonwillison.net/2022/Apr/27/parallel-queries/#atom-tag</id><summary type="html">
    &lt;p&gt;A promising new performance optimization for Datasette, plus new &lt;code&gt;datasette-gzip&lt;/code&gt; and &lt;code&gt;datasette-total-page-time&lt;/code&gt; plugins.&lt;/p&gt;
&lt;h4&gt;Parallel SQL queries in Datasette&lt;/h4&gt;
&lt;p&gt;From the start of the project, Datasette has been built on top of Python's &lt;code&gt;asyncio&lt;/code&gt; capabilities - mainly to benefit things like &lt;a href="https://docs.datasette.io/en/stable/csv_export.html#streaming-all-records"&gt;streaming enormous CSV files&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This week I started experimenting with a new way to take advantage of them, by exploring the potential to run multiple SQL queries in parallel.&lt;/p&gt;
&lt;p&gt;Consider &lt;a href="https://latest-with-plugins.datasette.io/github/commits?_facet=repo&amp;amp;_facet=committer"&gt;this Datasette table&lt;/a&gt; page:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/sql-parallel-commits.jpg" alt="Screenshot of the commits table, showing a count and suggested facets and activated facets and some table data" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;That page has to execute quite a few SQL queries:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A &lt;code&gt;select count(*) ...&lt;/code&gt; to populate the 3,283 rows heading at the top&lt;/li&gt;
&lt;li&gt;Queries against each column to decide what the "suggested facets" should be (&lt;a href="https://docs.datasette.io/en/stable/facets.html#suggested-facets"&gt;details here&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;For each of the selected facets (in this case &lt;code&gt;repo&lt;/code&gt; and &lt;code&gt;committer&lt;/code&gt;) a &lt;code&gt;select name, count(*) from ... group by name order by count(*) desc&lt;/code&gt; query&lt;/li&gt;
&lt;li&gt;The actual &lt;code&gt;select * from ... limit 101&lt;/code&gt; query used to display the table itself&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It ends up executing more than 30 queries! Which may seem like a lot, but &lt;a href="https://www.sqlite.org/np1queryprob.html"&gt;Many Small Queries Are Efficient In SQLite&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;One thing that's interesting about the above list of queries though is that they don't actually have any dependencies on each other. There's no reason not to run all of them in parallel - later queries don't depend on the results from earlier queries.&lt;/p&gt;
&lt;p&gt;I've been exploring a fancy way of executing parallel code using pytest-style dependency injection in my &lt;a href="https://github.com/simonw/asyncinject"&gt;asyncinject&lt;/a&gt; library. But I decided to do a quick prototype to see what this would look like using &lt;a href="https://docs.python.org/3/library/asyncio-task.html#asyncio.gather"&gt;asyncio.gather()&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It turns out that simpler approach worked surprisingly well!&lt;/p&gt;
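&lt;p&gt;The &lt;code&gt;asyncio.gather()&lt;/code&gt; approach can be sketched against a plain &lt;code&gt;sqlite3&lt;/code&gt; database, running each blocking query in a thread so the event loop can overlap them (a simplified illustration of the pattern, not Datasette's internals - the table name is hypothetical):&lt;/p&gt;

```python
import asyncio
import sqlite3


async def run_query(db_path: str, sql: str):
    # Run the blocking sqlite3 call in a worker thread so several
    # queries can be in flight at once.
    def _query():
        conn = sqlite3.connect(db_path)  # one connection per thread
        try:
            return conn.execute(sql).fetchall()
        finally:
            conn.close()

    return await asyncio.to_thread(_query)


async def table_page(db_path: str):
    # The count and the row page have no dependency on each other,
    # so gather() can run them concurrently.
    count, rows = await asyncio.gather(
        run_query(db_path, "select count(*) from commits"),
        run_query(db_path, "select * from commits limit 101"),
    )
    return count, rows
```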
&lt;p&gt;You can follow my research &lt;a href="https://github.com/simonw/datasette/issues/1723"&gt;in this issue&lt;/a&gt;, but the short version is that as of a few days ago the Datasette &lt;code&gt;main&lt;/code&gt; branch runs many of the above queries in parallel.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://latest-with-plugins.datasette.io/github/commits?_facet=repo&amp;amp;_facet=committer&amp;amp;_trace=1"&gt;This trace&lt;/a&gt; (using the &lt;a href="https://datasette.io/plugins/datasette-pretty-traces"&gt;datasette-pretty-traces&lt;/a&gt; plugin) illustrates my initial results:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/sql-parallel-trace.jpg" alt="Screenshot of a trace - many SQL queries have overlapping lines" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;As you can see, the grey lines for many of those SQL queries are now overlapping.&lt;/p&gt;
&lt;p&gt;You can add the undocumented &lt;code&gt;?_noparallel=1&lt;/code&gt; query string parameter to disable parallel execution to &lt;a href="https://latest-with-plugins.datasette.io/github/commits?_facet=repo&amp;amp;_facet=committer&amp;amp;_trace=1&amp;amp;_noparallel=1"&gt;compare the difference&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/sql-parallel-trace-noparallel.jpg" alt="Same trace again, but this time each query ends before the next one begins" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;One thing that gives me pause: for this particular Datasette deployment (on the cheapest available Cloud Run instance) the overall performance difference between the two is very small.&lt;/p&gt;
&lt;p&gt;I need to dig into this deeper: on my laptop I feel like I'm seeing slightly better results, but definitely not conclusively. It may be that multiple cores are not being used effectively here.&lt;/p&gt;
&lt;p&gt;Datasette runs SQL queries in a pool of threads. You might expect Python's infamous GIL (Global Interpreter Lock) to prevent these from executing across multiple cores - but I checked, and &lt;a href="https://github.com/python/cpython/blob/f348154c8f8a9c254503306c59d6779d4d09b3a9/Modules/_sqlite/cursor.c#L749-L759"&gt;the GIL is released&lt;/a&gt; in Python's C code the moment control transfers to SQLite. And since SQLite can happily run multiple threads, my hunch is that this means parallel queries should be able to take advantage of multiple cores. Theoretically at least!&lt;/p&gt;
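&lt;p&gt;One quick way to poke at this is to run the same SQLite-heavy query serially and from a thread pool and compare wall-clock times - if the GIL really is released inside SQLite, the threaded run should be noticeably faster on a multi-core machine. A rough experiment sketch (not a rigorous benchmark):&lt;/p&gt;

```python
import sqlite3
import time
from concurrent.futures import ThreadPoolExecutor

# A query that keeps SQLite's C code busy for a measurable amount of time.
SLOW_SQL = """
with recursive n(i) as (select 1 union all select i + 1 from n where i < 200000)
select count(*) from n
"""


def run(db_path: str):
    conn = sqlite3.connect(db_path)  # one connection per thread
    try:
        return conn.execute(SLOW_SQL).fetchone()
    finally:
        conn.close()


def compare(db_path: str, workers: int = 4):
    start = time.perf_counter()
    serial_results = [run(db_path) for _ in range(workers)]
    serial = time.perf_counter() - start

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        threaded_results = list(pool.map(run, [db_path] * workers))
    parallel = time.perf_counter() - start

    assert serial_results == threaded_results  # same answers either way
    return serial, parallel
```

The timings themselves are noisy - and, as the update below explains, the Python-side work of assembling result rows still holds the GIL, which erodes most of the theoretical benefit.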
&lt;p&gt;I haven't yet figured out how to prove this though, and I'm not currently convinced that parallel queries are providing any overall benefit at all. If you have any ideas I'd love to hear them - I have a &lt;a href="https://github.com/simonw/datasette/issues/1727"&gt;research issue&lt;/a&gt; open, comments welcome!&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update 28th April 2022:&lt;/strong&gt; Research continues, but it looks like there's little performance benefit from this. Current leading theory is that this is because of the GIL - while the SQLite C code releases the GIL, much of the activity involved in things like assembling &lt;code&gt;Row&lt;/code&gt; objects returned by a query still uses Python - so parallel queries still end up mostly blocked on a single core. Follow &lt;a href="https://github.com/simonw/datasette/issues/1727"&gt;the issue&lt;/a&gt; for more details. I started &lt;a href="https://sqlite.org/forum/forumpost/4a4b00e6d6bcf63e"&gt;a discussion on the SQLite Forum&lt;/a&gt; which has some interesting clues in it as well.&lt;/p&gt;
&lt;p&gt;Further update: It's definitely the GIL. I know because I tried running it against Sam Gross's &lt;a href="https://github.com/colesbury/nogil"&gt;nogil Python fork&lt;/a&gt; and the parallel version soundly beat the non-parallel version! Details &lt;a href="https://github.com/simonw/datasette/issues/1727#issuecomment-1112889800"&gt;in this comment&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="datasette-gzip"&gt;datasette-gzip&lt;/h4&gt;
&lt;p&gt;I've been putting off investigating gzip support for Datasette for a long time, because it's easy to add as a separate layer. If you run Datasette behind Cloudflare or an Apache or Nginx proxy configuring gzip can happen there, with very little effort and fantastic performance.&lt;/p&gt;
&lt;p&gt;Then I noticed that my Global Power Plants demo returned an HTML table page that weighed in at 420KB... but gzipped was just 16.61KB. Turns out HTML tables have a ton of repeated markup and compress REALLY well!&lt;/p&gt;
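&lt;p&gt;The effect is easy to reproduce with Python's &lt;code&gt;gzip&lt;/code&gt; module, because table markup repeats the same tags thousands of times (synthetic markup and illustrative numbers here, not the actual Global Power Plants page):&lt;/p&gt;

```python
import gzip

# Simulate an HTML table: heavily repeated markup, small varying payload.
rows = "".join(
    f"<tr><td>plant-{i}</td><td>{i * 7} MW</td><td>Solar</td></tr>"
    for i in range(5000)
)
html = f"<table>{rows}</table>".encode("utf-8")

compressed = gzip.compress(html)
ratio = len(html) / len(compressed)
print(f"{len(html) / 1024:.0f} KB -> {len(compressed) / 1024:.1f} KB ({ratio:.0f}x smaller)")
```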
&lt;p&gt;More importantly: Google Cloud Run doesn't gzip for you. So all of my Datasette instances that were running on Cloud Run without also using Cloudflare were really suffering.&lt;/p&gt;
&lt;p&gt;So this morning I released &lt;a href="https://datasette.io/plugins/datasette-gzip"&gt;datasette-gzip&lt;/a&gt;, a plugin that gzips content if the browser sends an &lt;code&gt;Accept-Encoding: gzip&lt;/code&gt; header.&lt;/p&gt;
&lt;p&gt;The plugin is an incredibly thin wrapper around the thoroughly proven-in-production &lt;a href="https://www.starlette.io/middleware/#gzipmiddleware"&gt;GZipMiddleware&lt;/a&gt;. So thin that this is &lt;a href="https://github.com/simonw/datasette-gzip/blob/0.1/datasette_gzip/__init__.py"&gt;the full implementation&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;datasette&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;hookimpl&lt;/span&gt;
&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;starlette&lt;/span&gt;.&lt;span class="pl-s1"&gt;middleware&lt;/span&gt;.&lt;span class="pl-s1"&gt;gzip&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-v"&gt;GZipMiddleware&lt;/span&gt;

&lt;span class="pl-en"&gt;@&lt;span class="pl-en"&gt;hookimpl&lt;/span&gt;(&lt;span class="pl-s1"&gt;trylast&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;)&lt;/span&gt;
&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;asgi_wrapper&lt;/span&gt;(&lt;span class="pl-s1"&gt;datasette&lt;/span&gt;):
    &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-v"&gt;GZipMiddleware&lt;/span&gt;&lt;/pre&gt;
&lt;p&gt;This kind of thing is exactly why I &lt;a href="https://simonwillison.net/2019/Jun/23/datasette-asgi/"&gt;ported Datasette to ASGI&lt;/a&gt; back in 2019 - and why I continue to think that the burgeoning ASGI ecosystem is the most under-rated piece of today's Python web development environment.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/simonw/datasette-gzip/blob/0.1/tests/test_gzip.py"&gt;plugin's tests&lt;/a&gt; are a lot more interesting.&lt;/p&gt;
&lt;p&gt;That &lt;code&gt;@hookimpl(trylast=True)&lt;/code&gt; line is there to ensure that this plugin runs last, after every other plugin has executed.&lt;/p&gt;
&lt;p&gt;This is necessary because there are existing ASGI plugins for Datasette (such as the new &lt;a href="https://datasette.io/plugins/datasette-total-page-time"&gt;datasette-total-page-time&lt;/a&gt;) which modify the generated request.&lt;/p&gt;
&lt;p&gt;If the gzip plugin runs before they do, they'll get back a blob of gzipped data rather than the HTML that they were expecting. This is likely to break them.&lt;/p&gt;
&lt;p&gt;I wanted to prove to myself that &lt;code&gt;trylast=True&lt;/code&gt; would prevent these errors - so I ended up writing &lt;a href="https://github.com/simonw/datasette-gzip/blob/0.1/tests/test_gzip.py#L83"&gt;a test&lt;/a&gt; that demonstrated that the plugin registered with &lt;code&gt;trylast=True&lt;/code&gt; was compatible with a transforming content plugin (in the test it just converts everything to uppercase) whereas &lt;code&gt;tryfirst=True&lt;/code&gt; would instead result in an error.&lt;/p&gt;
&lt;p&gt;Thankfully I have an older TIL on &lt;a href="https://til.simonwillison.net/pytest/registering-plugins-in-tests"&gt;Registering temporary pluggy plugins inside tests&lt;/a&gt; that I could lean on to help figure out how to do this.&lt;/p&gt;
&lt;p&gt;The plugin is now running on my &lt;a href="https://latest-with-plugins.datasette.io/global-power-plants/global-power-plants"&gt;latest-with-plugins demo instance&lt;/a&gt;. Since that instance loads dozens of different plugins it ends up serving a bunch of extra JavaScript and CSS, all of which benefits from gzip:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/datasette-gzip-results.png" alt="Screenshot of the Firefox inspector pane showing 419KB of HTML reduced to 16.61KB and 922KB of JavaScript reduced to 187.47KB. Total of 3.08MB page weight but only 546KB transferred." style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;h4&gt;datasette-total-page-time&lt;/h4&gt;
&lt;p&gt;To help understand the performance improvements introduced by parallel SQL queries I decided I wanted the Datasette footer to be able to show how long it took for the entire page to load.&lt;/p&gt;
&lt;p&gt;This is a tricky thing to do: how do you measure the total time for a page and then include it on that page if the page itself hasn't finished loading when you render that template?&lt;/p&gt;
&lt;p&gt;I came up with a pretty devious middleware trick to solve this, released as the &lt;a href="https://datasette.io/plugins/datasette-total-page-time"&gt;datasette-total-page-time&lt;/a&gt; plugin.&lt;/p&gt;
&lt;p&gt;The trick is to start a timer when the page load begins, and then end that timer at the very last possible moment as the page is being served back to the user.&lt;/p&gt;
&lt;p&gt;Then, inject the following HTML directly after the closing &lt;code&gt;&amp;lt;/html&amp;gt;&lt;/code&gt; tag (which works fine, even though it's technically invalid):&lt;/p&gt;
&lt;div class="highlight highlight-text-html-basic"&gt;&lt;pre&gt;&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;script&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="pl-k"&gt;let&lt;/span&gt; &lt;span class="pl-s1"&gt;footer&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelector&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;"footer"&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;footer&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
    &lt;span class="pl-k"&gt;let&lt;/span&gt; &lt;span class="pl-s1"&gt;ms&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;37.224&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-k"&gt;let&lt;/span&gt; &lt;span class="pl-s1"&gt;s&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;` &amp;amp;middot; Page took &lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;${&lt;/span&gt;&lt;span class="pl-s1"&gt;ms&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;toFixed&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-c1"&gt;3&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;ms`&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-s1"&gt;footer&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;innerHTML&lt;/span&gt; &lt;span class="pl-c1"&gt;+=&lt;/span&gt; &lt;span class="pl-s1"&gt;s&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-kos"&gt;}&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;script&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This adds the timing information to the page's &lt;code&gt;&amp;lt;footer&amp;gt;&lt;/code&gt; element, if one exists.&lt;/p&gt;
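&lt;p&gt;The same trick can be sketched as a tiny ASGI middleware: start the clock when the request arrives, pass the response through unchanged, then append the script fragment to the final body message (a simplified sketch of the idea, not the &lt;code&gt;datasette-total-page-time&lt;/code&gt; source):&lt;/p&gt;

```python
import time


def total_page_time(app):
    """Wrap an ASGI app, appending a page-timing script to each HTML response."""

    async def wrapper(scope, receive, send):
        if scope["type"] != "http":
            await app(scope, receive, send)
            return
        start = time.perf_counter()

        async def send_timed(message):
            # The final body message is the last possible moment to stop the clock.
            if message["type"] == "http.response.body" and not message.get("more_body"):
                ms = (time.perf_counter() - start) * 1000
                script = (
                    "<script>let footer=document.querySelector('footer');"
                    f"if(footer){{footer.innerHTML+=' &middot; Page took {ms:.3f}ms';}}"
                    "</script>"
                ).encode("utf-8")
                # Appended after </html> - technically invalid, works in practice.
                message = {**message, "body": message.get("body", b"") + script}
            await send(message)

        await app(scope, receive, send_timed)

    return wrapper
```

A production version also has to fix up or remove the &lt;code&gt;Content-Length&lt;/code&gt; header, since the body just got longer - which is part of why the real plugin is more involved than this sketch.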
&lt;p&gt;You can see this running on &lt;a href="https://latest-with-plugins.datasette.io/github/commits?_trace=1"&gt;this latest-with-plugins page&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-gzip"&gt;datasette-gzip&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-gzip/releases/tag/0.1"&gt;0.1&lt;/a&gt; - 2022-04-27
&lt;br /&gt;Add gzip compression to Datasette&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-total-page-time"&gt;datasette-total-page-time&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-total-page-time/releases/tag/0.1"&gt;0.1&lt;/a&gt; - 2022-04-26
&lt;br /&gt;Add a note to the Datasette footer measuring the total page load time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/asyncinject"&gt;asyncinject&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/asyncinject/releases/tag/0.5"&gt;0.5&lt;/a&gt; - (&lt;a href="https://github.com/simonw/asyncinject/releases"&gt;7 releases total&lt;/a&gt;) - 2022-04-22
&lt;br /&gt;Run async workflows using pytest-fixtures-style dependency injection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/django-sql-dashboard"&gt;django-sql-dashboard&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/django-sql-dashboard/releases/tag/1.1"&gt;1.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/django-sql-dashboard/releases"&gt;35 releases total&lt;/a&gt;) - 2022-04-20
&lt;br /&gt;Django app for building dashboards using raw SQL queries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/shot-scraper"&gt;shot-scraper&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/shot-scraper/releases/tag/0.13"&gt;0.13&lt;/a&gt; - (&lt;a href="https://github.com/simonw/shot-scraper/releases"&gt;14 releases total&lt;/a&gt;) - 2022-04-18
&lt;br /&gt;Tools for taking automated screenshots of websites&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/sphinx/blacken-docs"&gt;Format code examples in documentation with blacken-docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/macos/open-files-with-opensnoop"&gt;Seeing files opened by a process using opensnoop&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/macos/atuin"&gt;Atuin for zsh shell history in SQLite&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/async"&gt;async&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gzip"&gt;gzip&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="async"/><category term="gzip"/><category term="python"/><category term="datasette"/><category term="weeknotes"/></entry><entry><title>gzthermal</title><link href="https://simonwillison.net/2017/Nov/21/gzthermal/#atom-tag" rel="alternate"/><published>2017-11-21T14:56:41+00:00</published><updated>2017-11-21T14:56:41+00:00</updated><id>https://simonwillison.net/2017/Nov/21/gzthermal/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://encode.ru/threads/1889-gzthermal-pseudo-thermal-view-of-Gzip-Deflate-compression-efficiency"&gt;gzthermal&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
“pseudo thermal view of Gzip/Deflate compression efficiency”—neat tool for visualizing gzip compressed data and understanding exactly how run-length encoding and back references apply to a gzipped file.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://blog.usejournal.com/of-svg-minification-and-gzip-21cd26a5d007"&gt;Of SVG, Minification and Gzip&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/gzip"&gt;gzip&lt;/a&gt;&lt;/p&gt;



</summary><category term="gzip"/></entry><entry><title>Of SVG, Minification and Gzip</title><link href="https://simonwillison.net/2017/Nov/21/svg-minification-and-gzip/#atom-tag" rel="alternate"/><published>2017-11-21T14:54:15+00:00</published><updated>2017-11-21T14:54:15+00:00</updated><id>https://simonwillison.net/2017/Nov/21/svg-minification-and-gzip/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.usejournal.com/of-svg-minification-and-gzip-21cd26a5d007"&gt;Of SVG, Minification and Gzip&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Delightfully nerdy exploration of tricks you can use to hand-optimize your SVG in order to maximize gzip compression. Premature optimization may be the root of all evil, but this is still a great way to learn about how gzip actually works.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/gzip"&gt;gzip&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/minification"&gt;minification&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/svg"&gt;svg&lt;/a&gt;&lt;/p&gt;



</summary><category term="gzip"/><category term="minification"/><category term="svg"/></entry><entry><title>gzip support for Amazon Web Services CloudFront</title><link href="https://simonwillison.net/2010/Nov/12/gzip/#atom-tag" rel="alternate"/><published>2010-11-12T05:33:00+00:00</published><updated>2010-11-12T05:33:00+00:00</updated><id>https://simonwillison.net/2010/Nov/12/gzip/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://www.nomitor.com/blog/2010/11/10/gzip-support-for-amazon-web-services-cloudfront/"&gt;gzip support for Amazon Web Services CloudFront&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
This would have saved me a bunch of work a few weeks ago. CloudFront can now be pointed at your own web server rather than S3, and you can ask it to forward on the Accept-Encoding header and cache multiple content versions based on the result.
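A rough sketch of the caching behavior described above, using a hypothetical cacheKey helper (an illustration of the principle, not CloudFront's implementation): forwarding Accept-Encoding effectively means the CDN keys its cache on both the URL and the negotiated encoding.

```javascript
// hypothetical sketch: a CDN that forwards Accept-Encoding must store
// separate cached variants for gzip-capable and non-gzip clients
function cacheKey(url, headers) {
  const enc = /\bgzip\b/.test(headers["accept-encoding"] || "") ? "gzip" : "identity";
  return url + "#" + enc;
}

console.log(cacheKey("/app.css", { "accept-encoding": "gzip, deflate" })); // prints "/app.css#gzip"
console.log(cacheKey("/app.css", {})); // prints "/app.css#identity"
```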


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cloudfront"&gt;cloudfront&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gzip"&gt;gzip&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/http"&gt;http&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/recovered"&gt;recovered&lt;/a&gt;&lt;/p&gt;



</summary><category term="cloudfront"/><category term="gzip"/><category term="http"/><category term="recovered"/></entry><entry><title>Velocity: Forcing Gzip Compression</title><link href="https://simonwillison.net/2010/Sep/30/gzip/#atom-tag" rel="alternate"/><published>2010-09-30T17:45:00+00:00</published><updated>2010-09-30T17:45:00+00:00</updated><id>https://simonwillison.net/2010/Sep/30/gzip/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://www.stevesouders.com/blog/2010/07/12/velocity-forcing-gzip-compression/"&gt;Velocity: Forcing Gzip Compression&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Almost every browser supports gzip these days, but 15% of web requests have had their Accept-Encoding header stripped or mangled, generally due to poorly implemented proxies or anti-virus software. Steve Souders passes on a trick used by Google Search, where an iframe is used to test the browser’s gzip support and set a cookie to force gzipping of future pages.
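The server-side decision could be sketched like this (a hypothetical shouldGzip helper, not Google's actual implementation): trust Accept-Encoding when it survives the proxies, and fall back to the cookie that the iframe test set after the browser proved it could decode a gzipped response.

```javascript
// hypothetical sketch of the forced-gzip decision: compress when the
// client advertises gzip support, or when the iframe test previously
// confirmed support and recorded it in a cookie
function shouldGzip(headers, cookies) {
  const accepts = /\bgzip\b/.test(headers["accept-encoding"] || "");
  return accepts || cookies["force_gzip"] === "1";
}
```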


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/browsers"&gt;browsers&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gzip"&gt;gzip&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/performance"&gt;performance&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/proxies"&gt;proxies&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/steve-souders"&gt;steve-souders&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/recovered"&gt;recovered&lt;/a&gt;&lt;/p&gt;



</summary><category term="browsers"/><category term="gzip"/><category term="performance"/><category term="proxies"/><category term="steve-souders"/><category term="recovered"/></entry><entry><title>PNGStore - Embedding compressed CSS &amp; JavaScript in PNGs</title><link href="https://simonwillison.net/2010/Aug/23/pngstore/#atom-tag" rel="alternate"/><published>2010-08-23T09:47:00+00:00</published><updated>2010-08-23T09:47:00+00:00</updated><id>https://simonwillison.net/2010/Aug/23/pngstore/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://www.iamcal.com/png-store/"&gt;PNGStore - Embedding compressed CSS &amp;amp; JavaScript in PNGs&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Cal did some further analysis on the CSS/JS to PNG compression trick (including producing some interesting images of jQuery compressed using different image packing techniques) and found it to be slightly less effective than regular GZipping.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cal-henderson"&gt;cal-henderson&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gzip"&gt;gzip&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/png"&gt;png&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/recovered"&gt;recovered&lt;/a&gt;&lt;/p&gt;



</summary><category term="cal-henderson"/><category term="gzip"/><category term="png"/><category term="recovered"/></entry><entry><title>Paul Buchheit: Make your site faster and cheaper to operate in one easy step</title><link href="https://simonwillison.net/2009/Apr/17/paul/#atom-tag" rel="alternate"/><published>2009-04-17T17:19:44+00:00</published><updated>2009-04-17T17:19:44+00:00</updated><id>https://simonwillison.net/2009/Apr/17/paul/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://paulbuchheit.blogspot.com/2009/04/make-your-site-faster-and-cheaper-to.html"&gt;Paul Buchheit: Make your site faster and cheaper to operate in one easy step&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Paul promotes gzip encoding using nginx as a proxy, and mentions that FriendFeed use a “custom, epoll-based python server” as their application server. Does that mean that they’re serving their real-time comet feeds directly from Python?


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/comet"&gt;comet&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/epoll"&gt;epoll&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/friendfeed"&gt;friendfeed&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gzip"&gt;gzip&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nginx"&gt;nginx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/paul-buchheit"&gt;paul-buchheit&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;&lt;/p&gt;



</summary><category term="comet"/><category term="epoll"/><category term="friendfeed"/><category term="gzip"/><category term="nginx"/><category term="paul-buchheit"/><category term="python"/></entry></feed>