<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: zero-downtime</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/zero-downtime.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2024-07-30T21:45:32+00:00</updated><author><name>Simon Willison</name></author><entry><title>Making Machines Move</title><link href="https://simonwillison.net/2024/Jul/30/making-machines-move/#atom-tag" rel="alternate"/><published>2024-07-30T21:45:32+00:00</published><updated>2024-07-30T21:45:32+00:00</updated><id>https://simonwillison.net/2024/Jul/30/making-machines-move/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://fly.io/blog/machine-migrations/"&gt;Making Machines Move&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Another deep technical dive into Fly.io infrastructure from Thomas Ptacek, this time describing how they can quickly boot up an instance with a persistent volume on a new host (for things like zero-downtime deploys) using a block-level cloning operation, so the new instance gets a volume that becomes accessible instantly, serving proxied blocks of data until the new volume has been completely migrated from the old host.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ops"&gt;ops&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/thomas-ptacek"&gt;thomas-ptacek&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zero-downtime"&gt;zero-downtime&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fly"&gt;fly&lt;/a&gt;&lt;/p&gt;



</summary><category term="ops"/><category term="thomas-ptacek"/><category term="zero-downtime"/><category term="fly"/></entry><entry><title>Skew protection in Vercel</title><link href="https://simonwillison.net/2024/Mar/20/skew-protection-in-vercel/#atom-tag" rel="alternate"/><published>2024-03-20T14:06:38+00:00</published><updated>2024-03-20T14:06:38+00:00</updated><id>https://simonwillison.net/2024/Mar/20/skew-protection-in-vercel/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://vercel.com/docs/deployments/skew-protection"&gt;Skew protection in Vercel&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Version skew is a name for the bug that occurs when your user loads a web application and then unintentionally keeps that browser tab open across a deployment of a new version of the app. If you’re unlucky this can lead to broken behaviour, where a client makes a call to a backend endpoint that has changed in an incompatible way.&lt;/p&gt;

&lt;p&gt;Vercel have an ingenious solution to this problem. Their platform already makes it easy to deploy many different instances of an application. You can now turn on “skew protection” for a number of hours which will keep older versions of your backend deployed.&lt;/p&gt;

&lt;p&gt;The application itself can then include its desired deployment ID in an x-deployment-id header, a __vdpl cookie or a ?dpl= query string parameter.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/vercel_changes/status/1770280131250286851"&gt;Vercel changes&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/frontend"&gt;frontend&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zero-downtime"&gt;zero-downtime&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vercel"&gt;vercel&lt;/a&gt;&lt;/p&gt;



</summary><category term="frontend"/><category term="zero-downtime"/><category term="vercel"/></entry><entry><title>pgroll</title><link href="https://simonwillison.net/2024/Jan/30/pgroll/#atom-tag" rel="alternate"/><published>2024-01-30T21:27:13+00:00</published><updated>2024-01-30T21:27:13+00:00</updated><id>https://simonwillison.net/2024/Jan/30/pgroll/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/xataio/pgroll"&gt;pgroll&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;“Zero-downtime, reversible, schema migrations for Postgres”&lt;/p&gt;

&lt;p&gt;I love this kind of thing. This one has a really interesting design: you define your schema modifications (adding/dropping columns, creating tables etc) using a JSON DSL, then apply them using a Go binary.&lt;/p&gt;

&lt;p&gt;When you apply a migration the tool first creates a brand new PostgreSQL schema (effectively a whole new database) which imitates your new schema design using PostgreSQL views. You can then point your applications that have been upgraded to the new schema at it, using the PostgreSQL search_path setting.&lt;/p&gt;

&lt;p&gt;Old applications can continue talking to the previous schema design, giving you an opportunity to roll out a zero-downtime deployment of the new code.&lt;/p&gt;

&lt;p&gt;Once your application has upgraded and the physical rows in the database have been transformed to the new schema you can run a --continue command to make the final destructive changes and drop the mechanism that simulates both schema designs at once.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://lobste.rs/s/buhd4e/postgresql_zero_downtime_reversible"&gt;lobste.rs&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/migrations"&gt;migrations&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/postgresql"&gt;postgresql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zero-downtime"&gt;zero-downtime&lt;/a&gt;&lt;/p&gt;



</summary><category term="migrations"/><category term="postgresql"/><category term="zero-downtime"/></entry><entry><title>Upgrading GitHub.com to MySQL 8.0</title><link href="https://simonwillison.net/2023/Dec/10/upgrading-github-to-mysql-8/#atom-tag" rel="alternate"/><published>2023-12-10T20:36:23+00:00</published><updated>2023-12-10T20:36:23+00:00</updated><id>https://simonwillison.net/2023/Dec/10/upgrading-github-to-mysql-8/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.blog/2023-12-07-upgrading-github-com-to-mysql-8-0/"&gt;Upgrading GitHub.com to MySQL 8.0&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I love a good zero-downtime upgrade story, and this is a fine example of the genre. GitHub spent a year upgrading MySQL from 5.7 to 8 across 1200+ hosts, covering 300+ TB that was serving 5.5 million queries per second. The key technique was extremely carefully managed replication, plus tricks like leaving enough 5.7 replicas available to handle a rollback should one be needed.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://lobste.rs/s/yqwasc/upgrading_github_com_mysql_8_0"&gt;lobste.rs&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mysql"&gt;mysql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ops"&gt;ops&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/replication"&gt;replication&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zero-downtime"&gt;zero-downtime&lt;/a&gt;&lt;/p&gt;



</summary><category term="github"/><category term="mysql"/><category term="ops"/><category term="replication"/><category term="zero-downtime"/></entry><entry><title>Stripe: Online migrations at scale</title><link href="https://simonwillison.net/2023/Nov/5/online-migrations-at-scale/#atom-tag" rel="alternate"/><published>2023-11-05T16:06:32+00:00</published><updated>2023-11-05T16:06:32+00:00</updated><id>https://simonwillison.net/2023/Nov/5/online-migrations-at-scale/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://stripe.com/blog/online-migrations"&gt;Stripe: Online migrations at scale&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
This 2017 blog entry from Jacqueline Xu at Stripe provides a very clear description of the “dual writes” pattern for applying complex data migrations without downtime: dual write to new and old tables, update the read paths, update the write paths and finally remove the now obsolete data—illustrated with an example of upgrading customers from having a single to multiple subscriptions.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/eatonphil/status/1721195409647829052"&gt;@eatonphil&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/databases"&gt;databases&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/migrations"&gt;migrations&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zero-downtime"&gt;zero-downtime&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/stripe"&gt;stripe&lt;/a&gt;&lt;/p&gt;



</summary><category term="databases"/><category term="migrations"/><category term="zero-downtime"/><category term="stripe"/></entry><entry><title>Database Migrations</title><link href="https://simonwillison.net/2023/Oct/1/database-migrations/#atom-tag" rel="alternate"/><published>2023-10-01T23:55:25+00:00</published><updated>2023-10-01T23:55:25+00:00</updated><id>https://simonwillison.net/2023/Oct/1/database-migrations/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://vadimkravcenko.com/shorts/database-migrations/"&gt;Database Migrations&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Vadim Kravcenko provides a useful, in-depth description of the less obvious challenges of applying database migrations successfully. Vadim uses and likes Django’s migrations (as do I) but notes that running them at scale still involves a number of thorny challenges.&lt;/p&gt;

&lt;p&gt;The biggest of these, which I’ve encountered myself multiple times, is that if you want truly zero downtime deploys you can’t guarantee that your schema migrations will be deployed at the exact same instant as changes you make to your application code.&lt;/p&gt;

&lt;p&gt;This means all migrations need to be forward-compatible: you need to apply a schema change in a way that your existing code will continue to work error-free, then ship the related code change as a separate operation.&lt;/p&gt;

&lt;p&gt;Vadim describes what this looks like in detail for a number of common operations: adding a field, removing a field and changing a field that has associated business logic implications. He also discusses the importance of knowing when to deploy a dual-write strategy.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/databases"&gt;databases&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/django"&gt;django&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/migrations"&gt;migrations&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ops"&gt;ops&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zero-downtime"&gt;zero-downtime&lt;/a&gt;&lt;/p&gt;



</summary><category term="databases"/><category term="django"/><category term="migrations"/><category term="ops"/><category term="zero-downtime"/></entry><entry><title>MRSK</title><link href="https://simonwillison.net/2023/Apr/29/mrsk/#atom-tag" rel="alternate"/><published>2023-04-29T23:54:40+00:00</published><updated>2023-04-29T23:54:40+00:00</updated><id>https://simonwillison.net/2023/Apr/29/mrsk/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://mrsk.dev/"&gt;MRSK&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
A new open source web application deployment tool from 37signals, developed to help migrate their Hey webmail app out of the cloud and onto their own managed hardware. The key feature is one that I care about deeply: it enables zero-downtime deploys by running all traffic through a Traefik reverse proxy in a way that allows requests to be paused while a new deployment is going out—so end users get a few seconds delay on their HTTP requests before being served by the replaced application.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/37signals"&gt;37signals&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deployment"&gt;deployment&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ops"&gt;ops&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zero-downtime"&gt;zero-downtime&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/traefik"&gt;traefik&lt;/a&gt;&lt;/p&gt;



</summary><category term="37signals"/><category term="deployment"/><category term="ops"/><category term="zero-downtime"/><category term="traefik"/></entry><entry><title>Software engineering practices</title><link href="https://simonwillison.net/2022/Oct/1/software-engineering-practices/#atom-tag" rel="alternate"/><published>2022-10-01T15:56:02+00:00</published><updated>2022-10-01T15:56:02+00:00</updated><id>https://simonwillison.net/2022/Oct/1/software-engineering-practices/#atom-tag</id><summary type="html">
    &lt;p&gt;Gergely Orosz &lt;a href="https://twitter.com/GergelyOrosz/status/1576161504260657152"&gt;started a Twitter conversation&lt;/a&gt; asking about recommended "software engineering practices" for development teams.&lt;/p&gt;
&lt;p&gt;(I really like his rejection of the term "best practices" here: I always feel it's prescriptive and misguiding to announce something as "best".)&lt;/p&gt;
&lt;p&gt;I decided to flesh some of my replies out into a longer post.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2022/Oct/1/software-engineering-practices/#docs-same-repo"&gt;Documentation in the same repo as the code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2022/Oct/1/software-engineering-practices/#create-test-data"&gt;Mechanisms for creating test data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2022/Oct/1/software-engineering-practices/#rock-solid-migrations"&gt;Rock solid database migrations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2022/Oct/1/software-engineering-practices/#new-project-templates"&gt;Templates for new projects and components&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2022/Oct/1/software-engineering-practices/#auto-formatting"&gt;Automated code formatting&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2022/Oct/1/software-engineering-practices/#tested-dev-environments"&gt;Tested, automated process for new development environments&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2022/Oct/1/software-engineering-practices/#automated-previews"&gt;Automated preview environments&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="docs-same-repo"&gt;Documentation in the same repo as the code&lt;/h4&gt;
&lt;p&gt;The most important characteristic of internal documentation is trust: do people trust that documentation both exists and is up-to-date?&lt;/p&gt;
&lt;p&gt;If they don't, they won't read it or contribute to it.&lt;/p&gt;
&lt;p&gt;The best trick I know of for improving the trustworthiness of documentation is to put it in the same repository as the code it documents, for a few reasons:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;You can enforce documentation updates as part of your code review process. If a PR changes code in a way that requires documentation updates, the reviewer can ask for those updates to be included.&lt;/li&gt;
&lt;li&gt;You get versioned documentation. If you're using an older version of a library you can consult the documentation for that version. If you're using the current main branch you can see documentation for that, without confusion over what corresponds to the most recent "stable" release.&lt;/li&gt;
&lt;li&gt;You can integrate your documentation with your automated tests! I wrote about this in &lt;a href="https://simonwillison.net/2018/Jul/28/documentation-unit-tests/"&gt;Documentation unit tests&lt;/a&gt;, which describes a pattern for introspecting code and then ensuring that the documentation at least has a section header that matches specific concepts, such as plugin hooks or configuration options.&lt;/li&gt;
&lt;/ol&gt;
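&lt;p&gt;A minimal sketch of the documentation unit test pattern from point 3, under stated assumptions: the &lt;code&gt;PluginHooks&lt;/code&gt; class, its hook names and the &lt;code&gt;DOCS&lt;/code&gt; string are hypothetical stand-ins, and the documentation is assumed to use headings underlined with dashes. A real project would read its own documentation files and introspect its own code.&lt;/p&gt;
&lt;pre&gt;
```python
import re


class PluginHooks:
    """Hypothetical hook definitions that the documentation should cover."""

    def on_startup(self): ...

    def on_request(self): ...


# Hypothetical documentation text; only on_startup has a section heading.
DOCS = """
on_startup
----------
Called once when the application starts.
"""


def documented_headings(docs):
    # A heading is a line holding a single word, followed by a line of dashes.
    return set(re.findall(r"^(\w+)\n-+$", docs, re.MULTILINE))


def undocumented_hooks(hooks_class, docs):
    # Introspect the class for public names, then subtract the documented ones.
    hooks = {name for name in vars(hooks_class) if not name.startswith("_")}
    return sorted(hooks - documented_headings(docs))


missing = undocumented_hooks(PluginHooks, DOCS)
```
&lt;/pre&gt;
&lt;p&gt;A test suite could assert that &lt;code&gt;missing&lt;/code&gt; is empty, so adding a hook without a matching documentation section fails CI.&lt;/p&gt;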
&lt;h4 id="create-test-data"&gt;Mechanisms for creating test data&lt;/h4&gt;
&lt;p&gt;When you work on large products, your customers will inevitably find surprising ways to stress or break your system. They might create an event with over a hundred different types of ticket for example, or an issue thread with a thousand comments.&lt;/p&gt;
&lt;p&gt;These can expose performance issues that don't affect the majority of your users, but can still lead to service outages or other problems.&lt;/p&gt;
&lt;p&gt;Your engineers need a way to replicate these situations in their own development environments.&lt;/p&gt;
&lt;p&gt;One way to handle this is to provide tooling to import production data into local environments. This has privacy and security implications - what if a developer laptop gets stolen that happens to have a copy of your largest customer's data?&lt;/p&gt;
&lt;p&gt;A better approach is to have a robust system in place for generating test data, that covers a variety of different scenarios.&lt;/p&gt;
&lt;p&gt;You might have a button somewhere that creates an issue thread with a thousand fake comments, with a note referencing the bug that this helps emulate.&lt;/p&gt;
&lt;p&gt;Any time a new edge case shows up, you can add a new recipe to that system. That way engineers can replicate problems locally without needing copies of production data.&lt;/p&gt;
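&lt;p&gt;As an illustration only, a test data recipe for the thousand-comment issue thread might look something like this. The function name, field names and word list are all hypothetical, and a real recipe would write to your application's own models rather than returning a dictionary.&lt;/p&gt;
&lt;pre&gt;
```python
import random


def make_issue_thread(num_comments=1000, seed=0):
    """Build a fake issue thread; a fixed seed keeps the output reproducible."""
    rng = random.Random(seed)
    words = ["deploy", "schema", "cache", "timeout", "retry"]
    comments = [
        {
            "id": i,
            "author": "user%d" % rng.randint(1, 50),
            "body": " ".join(rng.choices(words, k=8)),
        }
        for i in range(1, num_comments + 1)
    ]
    return {
        # The title records why the recipe exists, as suggested above.
        "title": "Stress test: thread with %d comments" % num_comments,
        "comments": comments,
    }


thread = make_issue_thread()
```
&lt;/pre&gt;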
&lt;h4 id="rock-solid-migrations"&gt;Rock solid database migrations&lt;/h4&gt;
&lt;p&gt;The hardest part of large-scale software maintenance is inevitably the bit where you need to change your database schema.&lt;/p&gt;
&lt;p&gt;(I'm confident that one of the biggest reasons NoSQL databases became popular over the last decade was the pain people had associated with relational databases due to schema changes. Of course, NoSQL database schema modifications are still necessary, and often they're even more painful!)&lt;/p&gt;
&lt;p&gt;So you need to invest in a really good, version-controlled mechanism for managing schema changes. And a way to run them in production without downtime.&lt;/p&gt;
&lt;p&gt;If you do not have this your engineers will respond by being fearful of schema changes. Which means they'll come up with increasingly complex hacks to avoid them, which piles on technical debt.&lt;/p&gt;
&lt;p&gt;This is a deep topic. I mostly use Django for large database-backed applications, and Django has the best &lt;a href="https://docs.djangoproject.com/en/4.1/topics/migrations/"&gt;migration system&lt;/a&gt; I've ever personally experienced. If I'm working without Django I try to replicate its approach as closely as possible:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The database knows which migrations have already been applied. This means when you run the "migrate" command it can run just the ones that are still needed - important for managing multiple databases, e.g. production, staging, test and development environments.&lt;/li&gt;
&lt;li&gt;A single command that applies pending migrations, and updates the database rows that record which migrations have been run.&lt;/li&gt;
&lt;li&gt;Optional: rollbacks. Django migrations can be rolled back, which is great for iterating in a development environment, but using that in production is actually quite rare: I'll often ship a new migration that reverses the change rather than using a rollback, partly to keep the record of the mistake in version control.&lt;/li&gt;
&lt;/ul&gt;
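&lt;p&gt;The first two points can be sketched with Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module. This is not how Django implements migrations; the &lt;code&gt;applied_migrations&lt;/code&gt; table, the &lt;code&gt;MIGRATIONS&lt;/code&gt; list and the &lt;code&gt;migrate()&lt;/code&gt; function are hypothetical names chosen for the example.&lt;/p&gt;
&lt;pre&gt;
```python
import sqlite3

# Hypothetical migrations: (name, SQL) pairs, applied in list order.
MIGRATIONS = [
    ("0001_create_users", "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)"),
    ("0002_add_email", "ALTER TABLE users ADD COLUMN email TEXT"),
]


def migrate(conn, migrations=MIGRATIONS):
    """Apply migrations not yet recorded in the database; return their names."""
    # The database itself records which migrations have been applied.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS applied_migrations (name TEXT PRIMARY KEY)"
    )
    applied = {row[0] for row in conn.execute("SELECT name FROM applied_migrations")}
    newly_applied = []
    for name, sql in migrations:
        if name in applied:
            continue  # already run against this database
        with conn:  # commits once both statements have succeeded
            conn.execute(sql)
            conn.execute("INSERT INTO applied_migrations (name) VALUES (?)", (name,))
        newly_applied.append(name)
    return newly_applied
```
&lt;/pre&gt;
&lt;p&gt;Running &lt;code&gt;migrate()&lt;/code&gt; a second time against the same database applies nothing, which is what makes the same command safe to run against production, staging, test and development environments.&lt;/p&gt;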
&lt;p&gt;Even harder is the challenge of making schema changes without any downtime. I'm always interested in reading about new approaches for this - GitHub's &lt;a href="https://github.com/github/gh-ost"&gt;gh-ost&lt;/a&gt; is a neat solution for MySQL.&lt;/p&gt;
&lt;p&gt;An interesting consideration here is that it's rarely possible to have application code and database schema changes go out at the exact same instant in time. As a result, to avoid downtime you need to design every schema change with this in mind. The process needs to be:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Design a new schema change that can be applied without changing the application code that uses it.&lt;/li&gt;
&lt;li&gt;Ship that change to production, upgrading your database while keeping the old code working.&lt;/li&gt;
&lt;li&gt;Now ship new application code that uses the new schema.&lt;/li&gt;
&lt;li&gt;Ship a new schema change that cleans up any remaining work - dropping columns that are no longer used, for example.&lt;/li&gt;
&lt;/ol&gt;
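&lt;p&gt;A small sketch of the first three steps, using an in-memory SQLite database. The &lt;code&gt;users&lt;/code&gt; table and the two insert functions are hypothetical, standing in for the old and new versions of the application code. Step 4 is omitted because adding a column leaves nothing to clean up.&lt;/p&gt;
&lt;pre&gt;
```python
import sqlite3


def old_code_insert(conn, name):
    # Old application code: only knows about the "name" column.
    conn.execute("INSERT INTO users (name) VALUES (?)", (name,))


def new_code_insert(conn, name, email):
    # New application code: relies on the "email" column added below.
    conn.execute("INSERT INTO users (name, email) VALUES (?, ?)", (name, email))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
old_code_insert(conn, "alice")

# Steps 1 and 2: an additive, nullable column can ship while old code is live.
conn.execute("ALTER TABLE users ADD COLUMN email TEXT")
old_code_insert(conn, "bob")  # old code keeps working against the new schema

# Step 3: only now does the new application code go out.
new_code_insert(conn, "carol", "carol@example.com")

rows = conn.execute("SELECT name, email FROM users ORDER BY id").fetchall()
```
&lt;/pre&gt;
&lt;p&gt;Rows written by the old code have no email value, so the new code has to tolerate that until any backfill has run.&lt;/p&gt;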
&lt;p&gt;This process is a pain. It's difficult to get right. The only way to get good at it is to practice it a lot over time.&lt;/p&gt;
&lt;p&gt;My rule is this: &lt;strong&gt;schema changes should be boring and common&lt;/strong&gt;, as opposed to being exciting and rare.&lt;/p&gt;
&lt;h4 id="new-project-templates"&gt;Templates for new projects and components&lt;/h4&gt;
&lt;p&gt;If you're working with microservices, your team will inevitably need to build new ones.&lt;/p&gt;
&lt;p&gt;If you're working in a monorepo, you'll still have elements of your codebase with similar structures - components and feature implementations of some sort.&lt;/p&gt;
&lt;p&gt;Be sure to have really good templates in place for creating these "the right way" - with the right directory structure, a README and a test suite with a single, dumb passing test.&lt;/p&gt;
&lt;p&gt;I like to use the Python &lt;a href="https://cookiecutter.readthedocs.io/"&gt;cookiecutter&lt;/a&gt; tool for this. I've also used GitHub template repositories, and I even have a neat trick for &lt;a href="https://simonwillison.net/2021/Aug/28/dynamic-github-repository-templates/"&gt;combining the two&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;These templates need to be maintained and kept up-to-date. The best way to do that is to make sure they are being used - every time a new project is created is a chance to revise the template and make sure it still reflects the recommended way to do things.&lt;/p&gt;
&lt;h4 id="auto-formatting"&gt;Automated code formatting&lt;/h4&gt;
&lt;p&gt;This one's easy. Pick a code formatting tool for your language - like &lt;a href="https://github.com/psf/black"&gt;Black&lt;/a&gt; for Python or &lt;a href="https://prettier.io/"&gt;Prettier&lt;/a&gt; for JavaScript (I'm so jealous of how Go has &lt;a href="https://pkg.go.dev/cmd/gofmt"&gt;gofmt&lt;/a&gt; built in) - and run its "check" mode in your CI flow.&lt;/p&gt;
&lt;p&gt;Don't argue with its defaults, just commit to them.&lt;/p&gt;
&lt;p&gt;This saves an incredible amount of time in two places:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;As an individual, you get back all of that mental energy you used to spend thinking about the best way to format your code and can spend it on something more interesting.&lt;/li&gt;
&lt;li&gt;As a team, your code reviews can entirely skip the pedantic arguments about code formatting. Huge productivity win!&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="tested-dev-environments"&gt;Tested, automated process for new development environments&lt;/h4&gt;
&lt;p&gt;The most painful part of any software project is inevitably setting up the initial development environment.&lt;/p&gt;
&lt;p&gt;The moment your team grows beyond a couple of people, you should invest in making this work better.&lt;/p&gt;
&lt;p&gt;At the very least, you need a documented process for creating a new environment - and it has to be known-to-work, so any time someone is onboarded using it they should be encouraged to fix any problems in the documentation or accompanying scripts as they encounter them.&lt;/p&gt;
&lt;p&gt;Much better is an automated process: a single script that gets everything up and running. Tools like Docker have made this a LOT easier over the past decade.&lt;/p&gt;
&lt;p&gt;I'm increasingly convinced that the best-in-class solution here is cloud-based development environments. The ability to click a button on a web page and have a fresh, working development environment running a few seconds later is a game-changer for large development teams.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.gitpod.io/"&gt;Gitpod&lt;/a&gt; and &lt;a href="https://github.com/features/codespaces"&gt;Codespaces&lt;/a&gt; are two of the most promising tools I've tried in this space.&lt;/p&gt;
&lt;p&gt;I've seen developers lose hours a week to issues with their development environment. Eliminating that across a large team is the equivalent of hiring several new full-time engineers!&lt;/p&gt;
&lt;h4 id="automated-previews"&gt;Automated preview environments&lt;/h4&gt;
&lt;p&gt;Reviewing a pull request is a lot easier if you can actually try out the changes.&lt;/p&gt;
&lt;p&gt;The best way to do this is with automated preview environments, directly linked to from the PR itself.&lt;/p&gt;
&lt;p&gt;These are getting increasingly easy to offer. &lt;a href="https://vercel.com/features/previews"&gt;Vercel&lt;/a&gt;, &lt;a href="https://www.netlify.com/products/deploy-previews/"&gt;Netlify&lt;/a&gt;, &lt;a href="https://render.com/docs/pull-request-previews"&gt;Render&lt;/a&gt; and &lt;a href="https://devcenter.heroku.com/articles/github-integration-review-apps"&gt;Heroku&lt;/a&gt; all have features that can do this. Building a custom system on top of something like &lt;a href="https://cloud.google.com/run"&gt;Google Cloud Run&lt;/a&gt; or &lt;a href="https://fly.io/blog/fly-machines/"&gt;Fly Machines&lt;/a&gt; is also possible with a bit of work.&lt;/p&gt;
&lt;p&gt;This is another one of those things which requires some up-front investment but will pay itself off many times over through increased productivity and quality of reviews.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/documentation"&gt;documentation&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/software-engineering"&gt;software-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/testing"&gt;testing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/version-control"&gt;version-control&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zero-downtime"&gt;zero-downtime&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/technical-debt"&gt;technical-debt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gergely-orosz"&gt;gergely-orosz&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="documentation"/><category term="software-engineering"/><category term="testing"/><category term="version-control"/><category term="zero-downtime"/><category term="github-actions"/><category term="technical-debt"/><category term="gergely-orosz"/></entry><entry><title>Zero Downtime Release: Disruption-free Load Balancing of a Multi-Billion User Website</title><link href="https://simonwillison.net/2020/Aug/5/zero-downtime-release/#atom-tag" rel="alternate"/><published>2020-08-05T03:27:27+00:00</published><updated>2020-08-05T03:27:27+00:00</updated><id>https://simonwillison.net/2020/Aug/5/zero-downtime-release/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://dl.acm.org/doi/abs/10.1145/3387514.3405885"&gt;Zero Downtime Release: Disruption-free Load Balancing of a Multi-Billion User Website&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I remain fascinated by techniques for zero downtime deployment—once you have it working it makes shipping changes to your software so much less stressful, which means you can iterate faster and generally be much more confident in shipping code.&lt;/p&gt;

&lt;p&gt;Facebook have invested vast amounts of effort into getting this right, and their new paper for the ACM SIGCOMM conference goes into detail about how it all works.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/copyconstruct/status/1290199786000244737"&gt;Cindy Sridharan&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/deployment"&gt;deployment&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zero-downtime"&gt;zero-downtime&lt;/a&gt;&lt;/p&gt;



</summary><category term="deployment"/><category term="zero-downtime"/></entry><entry><title>Weeknotes: Datasette Cloud and zero downtime deployments</title><link href="https://simonwillison.net/2020/Jan/21/weeknotes-datasette-cloud-and-zero-downtime-deployments/#atom-tag" rel="alternate"/><published>2020-01-21T20:56:46+00:00</published><updated>2020-01-21T20:56:46+00:00</updated><id>https://simonwillison.net/2020/Jan/21/weeknotes-datasette-cloud-and-zero-downtime-deployments/#atom-tag</id><summary type="html">
    &lt;p&gt;Yesterday's piece on &lt;a href="https://simonwillison.net/2020/Jan/21/github-actions-cloud-run/"&gt;Tracking FARA by deploying a data API using GitHub Actions and Cloud Run&lt;/a&gt; was originally intended to be my weeknotes, but ended up getting a bit too involved.&lt;/p&gt;

&lt;p&gt;Aside from playing with GitHub Actions and Cloud Run, my focus over the past week has been working on Datasette Cloud. Datasette Cloud is the current name I'm using for my hosted &lt;a href="https://datasette.readthedocs.io/"&gt;Datasette&lt;/a&gt; product - the idea being that I'll find it &lt;em&gt;a lot&lt;/em&gt; easier to get &lt;a href="https://simonwillison.net/2019/Sep/10/jsk-fellowship/"&gt;feedback on Datasette from journalists&lt;/a&gt; if they can use it without having to install anything!&lt;/p&gt;

&lt;p&gt;My MVP for Datasette Cloud is that I can use it to instantly provision a new, private Datasette instance for a journalist (or team of journalists) that they can then sign into, start playing with and start uploading their data to (initially as CSV files).&lt;/p&gt;

&lt;p&gt;I have to solve quite a few problems to get there:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;Secure, isolated instances of Datasette. A team or user should only be able to see their own files. I plan to solve this using Docker containers that are mounted such that they can only see their own dedicated volumes.&lt;/li&gt;&lt;li&gt;The ability to provision new instances as easily as possible - and give each one its own HTTPS subdomain.&lt;/li&gt;&lt;li&gt;Authentication: users need to be able to register and sign in to accounts. I could use &lt;a href="https://github.com/simonw/datasette-auth-github"&gt;datasette-auth-github&lt;/a&gt; for this but I'd like to be able to support regular email/password accounts too.&lt;/li&gt;&lt;li&gt;Users need to be able to upload CSV files and have them converted into a SQLite database compatible with Datasette.&lt;/li&gt;&lt;/ul&gt;

&lt;h3&gt;Zero downtime deployments&lt;/h3&gt;

&lt;p&gt;I have a stretch goal which I'm taking pretty seriously: I want to have a mechanism in place for zero-downtime deployments of new versions of the software.&lt;/p&gt;

&lt;p&gt;Arguably this is an unnecessary complication for an MVP. I may not fully implement it, but I do want to at least know that the path I've taken is compatible with zero downtime deployments.&lt;/p&gt;

&lt;p&gt;Why do zero downtime deployments matter so much to me? Because they are desirable for rapid iteration, and crucial for setting up continuous deployment. Even a couple of seconds of downtime during a deployment leaves a psychological desire not to deploy too often. I've seen the productivity boost that deploying fearlessly multiple times a day brings, and I want it.&lt;/p&gt;

&lt;p&gt;So I've been doing a bunch of research into zero downtime deployment options (thanks to some &lt;a href="https://twitter.com/simonw/status/1217599189921628160"&gt;great help on Twitter&lt;/a&gt;) and I think I have something that's going to work for me.&lt;/p&gt;

&lt;p&gt;The first ingredient is &lt;a href="https://docs.traefik.io/"&gt;Traefik&lt;/a&gt; - a new-to-me edge router (similar to nginx) which has a delightful focus on runtime configuration based on automatic discovery.&lt;/p&gt;

&lt;p&gt;It works with a bunch of different technology stacks, but I'm going to be using it with regular Docker. Traefik watches for new Docker containers, reads their labels and uses them to reroute traffic to those containers.&lt;/p&gt;

&lt;p&gt;So I can launch a new Docker container, apply the Docker label &lt;code&gt;"traefik.frontend.rule": "Host:subdomain.mydomain.com"&lt;/code&gt; and Traefik will start proxying traffic for that subdomain directly to that container.&lt;/p&gt;

&lt;p&gt;Traefik also has extremely robust built-in support for Let's Encrypt to issue certificates. I managed to &lt;a href="https://docs.traefik.io/https/acme/#wildcard-domains"&gt;issue a wildcard TLS certificate&lt;/a&gt; for my entire domain, so new subdomains are encrypted straight away. This did require me to give Traefik API access to modify DNS entries - I'm running DNS for this project on DigitalOcean and thankfully Traefik knows how to do this by talking to their API.&lt;/p&gt;

&lt;p&gt;That solves provisioning: when I create a new account I can call the Docker API (from Python) to start up a new, labelled container on a subdomain protected by a TLS certificate.&lt;/p&gt;
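
&lt;p&gt;A minimal sketch of what that provisioning call might look like, assuming the third-party &lt;code&gt;docker&lt;/code&gt; package for Python. The image name, the container naming scheme and the helper names are hypothetical; only the label format comes from the example above.&lt;/p&gt;

```python
def traefik_labels(subdomain, domain):
    """Build the Traefik routing label for one container (label format as quoted above)."""
    return {"traefik.frontend.rule": "Host:{}.{}".format(subdomain, domain)}


def container_options(subdomain, domain, image):
    """Keyword arguments for docker-py's containers.run(); building them needs no Docker."""
    return {
        "image": image,
        "name": "datasette-{}".format(subdomain),  # hypothetical naming scheme
        "labels": traefik_labels(subdomain, domain),
        "detach": True,  # return immediately instead of waiting for the container to exit
    }


def launch(subdomain, domain, image):
    """Start the container. Needs the `docker` package and a reachable Docker daemon."""
    import docker  # imported lazily so the helpers above work without it installed

    client = docker.from_env()
    return client.containers.run(**container_options(subdomain, domain, image))


options = container_options("subdomain", "mydomain.com", "my-datasette-image")
```

&lt;p&gt;Keeping the label construction separate from the call that talks to Docker means the routing rule can be checked without starting anything.&lt;/p&gt;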

&lt;p&gt;I still needed a way to run a zero-downtime deployment of a new container (for example when I release a new version of Datasette and want to upgrade everyone). After quite a bit of research (during which I discovered you can't modify the labels on a Docker container without restarting it) I settled on the approach described in &lt;a href="https://coderbook.com/@marcus/how-to-do-zero-downtime-deployments-of-docker-containers/"&gt;this article&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Essentially you configure Traefik to retry failed requests, start a new, updated container with the same routing information as the existing one (causing Traefik to load balance HTTP requests across both), then shut down the old container and trust Traefik to retry in-flight requests against the one that's still running.&lt;/p&gt;

&lt;p&gt;Rudimentary testing with &lt;code&gt;ab&lt;/code&gt; suggested that this is working as desired.&lt;/p&gt;
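
&lt;p&gt;The start-new, stop-old, retry sequence can be illustrated with a toy simulation. This is not Traefik's implementation, just a sketch of the idea under simple assumptions: a round-robin router that retries a failed request on the next backend. All class and variable names are made up for the example.&lt;/p&gt;

```python
class Backend:
    """Stand-in for a container; `up` becomes False once it has been stopped."""

    def __init__(self, name):
        self.name = name
        self.up = True

    def handle(self, request):
        if not self.up:
            raise ConnectionError("{} is stopped".format(self.name))
        return "{} served {}".format(self.name, request)


class RetryingRouter:
    """Round-robin router that retries a failed request on the next backend."""

    def __init__(self, backends, attempts=2):
        self.backends = list(backends)
        self.attempts = attempts
        self._next = 0

    def route(self, request):
        last_error = None
        for _ in range(self.attempts):
            backend = self.backends[self._next % len(self.backends)]
            self._next += 1
            try:
                return backend.handle(request)
            except ConnectionError as error:
                last_error = error  # fall through and try the next backend
        raise last_error


old, new = Backend("old"), Backend("new")
router = RetryingRouter([old, new])  # both containers share the same routing rule
old.up = False  # the old container is shut down mid-deploy
responses = [router.route("req-{}".format(i)) for i in range(4)]
```

&lt;p&gt;Every request in &lt;code&gt;responses&lt;/code&gt; ends up answered by the new backend, even though the router still tries the stopped one first on each request.&lt;/p&gt;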

&lt;p&gt;One remaining problem: if Traefik is running in a Docker container and proxying all of my traffic, how can I upgrade Traefik itself without any downtime?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://twitter.com/simonw/status/1218604019033100288"&gt;Consensus on Twitter&lt;/a&gt; seems to be that Docker on its own doesn't have a great mechanism for this (I was hoping I could re-route port 80 traffic to the host to a different container in an atomic way). But... &lt;code&gt;iptables&lt;/code&gt; has mechanisms that can re-route traffic from one port to another - so I should be able to run a new Traefik container on a different port and re-route to it at the operating system level.&lt;/p&gt;

&lt;p&gt;That's quite enough yak shaving around zero downtime deployments for now!&lt;/p&gt;

&lt;h3 id="datasette-upload-csvs"&gt;datasette-upload-csvs&lt;/h3&gt;

&lt;p&gt;A big problem I'm seeing with the current Datasette ecosystem is that while Datasette offers a web-based user interface for querying and accessing data, the &lt;a href="https://datasette.readthedocs.io/en/0.33/ecosystem.html#tools-for-creating-sqlite-databases"&gt;tools I've written for actually creating those databases&lt;/a&gt; are decidedly command-line only.&lt;/p&gt;

&lt;p&gt;Telling journalists they have to learn to install and run software on the command-line is way too high a barrier to entry.&lt;/p&gt;

&lt;p&gt;I've always intended to have Datasette plugins that can handle uploading and converting data. It's time to actually build one!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/simonw/datasette-upload-csvs"&gt;datasette-upload-csvs&lt;/a&gt; is what I've got so far. It has a big warning not to use it in the README - it's &lt;em&gt;very&lt;/em&gt; alpha sofware at the moment - but it does prove that the concept can work.&lt;/p&gt;

&lt;p&gt;It uses the &lt;a href="https://datasette.readthedocs.io/en/stable/plugins.html#asgi-wrapper-datasette"&gt;asgi_wrapper&lt;/a&gt; plugin hook to intercept requests to the path &lt;code&gt;/-/upload-csv&lt;/code&gt; and forward them on to another ASGI app, written using Starlette, which provides a basic upload form and then handles the upload.&lt;/p&gt;
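
&lt;p&gt;A rough sketch of that interception pattern in plain ASGI, not the plugin's actual code. &lt;code&gt;upload_app&lt;/code&gt; and &lt;code&gt;inner_app&lt;/code&gt; are placeholders standing in for the Starlette app and for Datasette, and &lt;code&gt;call&lt;/code&gt; is a hypothetical helper for driving an app once.&lt;/p&gt;

```python
import asyncio


async def upload_app(scope, receive, send):
    """Placeholder for the Starlette upload app: replies with a plain-text body."""
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [(b"content-type", b"text/plain")],
    })
    await send({"type": "http.response.body", "body": b"upload form goes here"})


async def inner_app(scope, receive, send):
    """Placeholder for the wrapped application (Datasette in the real plugin)."""
    await send({"type": "http.response.start", "status": 404, "headers": []})
    await send({"type": "http.response.body", "body": b"not found"})


def wrap_with_upload(app, path="/-/upload-csv"):
    """Return an ASGI app that diverts `path` to upload_app and passes the rest through."""

    async def wrapped(scope, receive, send):
        if scope["type"] == "http" and scope["path"] == path:
            await upload_app(scope, receive, send)
        else:
            await app(scope, receive, send)

    return wrapped


def call(app, path):
    """Drive an ASGI app once and return the messages it sent (demo helper)."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    asyncio.run(app({"type": "http", "path": path}, receive, send))
    return sent


# In a plugin, the wrapper would be registered through the hook roughly like this:
#
#     from datasette import hookimpl
#
#     @hookimpl
#     def asgi_wrapper(datasette):
#         return wrap_with_upload
```

&lt;p&gt;For example, &lt;code&gt;call(wrap_with_upload(inner_app), "/-/upload-csv")&lt;/code&gt; returns the upload app's messages, while any other path falls through to the wrapped application.&lt;/p&gt;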

&lt;p&gt;Uploaded CSVs are converted to SQLite using &lt;a href="https://sqlite-utils.readthedocs.io/"&gt;sqlite-utils&lt;/a&gt; and written to the first mutable database attached to Datasette.&lt;/p&gt;
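
&lt;p&gt;To give a feel for the conversion step, here is a small sketch that uses only Python's standard &lt;code&gt;csv&lt;/code&gt; and &lt;code&gt;sqlite3&lt;/code&gt; modules rather than sqlite-utils, so it is a stand-in for the technique and not what the plugin does. The function name, table name and sample data are invented, and every column is stored as text.&lt;/p&gt;

```python
import csv
import io
import sqlite3


def quote_identifier(name):
    """Quote a table or column name for SQLite, doubling any embedded quotes."""
    return '"{}"'.format(name.replace('"', '""'))


def csv_to_table(conn, table, csv_text):
    """Create `table` from the CSV header row and insert every data row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    columns = ", ".join(quote_identifier(name) for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute("CREATE TABLE {} ({})".format(quote_identifier(table), columns))
    # Assumes every row has the same number of fields as the header.
    conn.executemany(
        "INSERT INTO {} VALUES ({})".format(quote_identifier(table), placeholders),
        reader,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
csv_to_table(conn, "uploads", "name,city\nExample Museum,Cleveland\n")
rows = conn.execute("SELECT name, city FROM uploads").fetchall()
```

&lt;p&gt;A library like sqlite-utils handles the parts this sketch skips, such as ragged rows.&lt;/p&gt;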

&lt;p&gt;It needs a bunch more work (and tests) before I'm comfortable telling people to use it, but it does at least exist as a proof of concept for me to iterate on.&lt;/p&gt;

&lt;h3&gt;datasette-auth-django-cookies&lt;/h3&gt;

&lt;p&gt;No code for this yet, but I'm beginning to flesh it out as a concept.&lt;/p&gt;

&lt;p&gt;I don't particularly want to implement user registration and authentication and cookies and password hashing. I know how to do it, which means I know it's not something you should re-roll for every project.&lt;/p&gt;

&lt;p&gt;Django has a really well designed, robust authentication system. Can't I just use that?&lt;/p&gt;

&lt;p&gt;Since all of my applications will be running on subdomains of a single domain, my current plan is to have a regular Django application which handles registration and logins. Each subdomain will then run a custom piece of Datasette ASGI middleware which knows how to read and validate the Django authentication cookie.&lt;/p&gt;

&lt;p&gt;This should give me single sign-on with a single, audited codebase for registration and login with (hopefully) the least amount of work needed to integrate it with Datasette.&lt;/p&gt;

&lt;p&gt;Code for this will hopefully follow over the next week.&lt;/p&gt;

&lt;h3&gt;Niche Museums - now publishing weekly&lt;/h3&gt;

&lt;p&gt;I hit a milestone with my &lt;a href="https://www.niche-museums.com/"&gt;Niche Museums&lt;/a&gt; project: the site now lists details of 100 museums!&lt;/p&gt;

&lt;p&gt;For the 100th entry I decided to celebrate with by far the most rewarding (and exclusive) niche museum experience I've ever had: &lt;a href="https://www.niche-museums.com/browse/museums/100"&gt;Ray Bandar's Bone Palace&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You should read the entry. The short version is that Ray Bandar collected 7,000 animal skulls over a sixty-year period, and Natalie managed to score us a tour of his incredible basement mere weeks before the collection was donated to the California Academy of Sciences.&lt;/p&gt;

&lt;img src="https://niche-museums.imgix.net/ray-bandar.jpeg?w=1600&amp;amp;h=800&amp;amp;fit=crop&amp;amp;auto=compress" alt="The basement full of skulls" style="max-width: 100%" /&gt;

&lt;p&gt;Posting one museum a day was taking increasingly more of my time, as I had to delve into the depths of my museums-I-have-visited backlog and do increasing amounts of research. Now that I've hit 100 I'm going to switch to publishing one a week, which should also help me visit new ones quickly enough to keep the backlog full!&lt;/p&gt;

&lt;p&gt;So I only posted four this week:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/97"&gt;The ruins of Llano del Rio&lt;/a&gt; in Los Angeles County&lt;/li&gt;&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/98"&gt;Cleveland Hungarian Museum&lt;/a&gt; in Cleveland&lt;/li&gt;&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/99"&gt;New Orleans Historic Voodoo Museum&lt;/a&gt; in New Orleans&lt;/li&gt;&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/100"&gt;Ray Bandar's Bone Palace&lt;/a&gt; in San Francisco&lt;/li&gt;&lt;/ul&gt;

&lt;p&gt;I also &lt;a href="https://github.com/simonw/museums/commits/842dfb96"&gt;built a simple JavaScript image gallery&lt;/a&gt; to better display the 54 photos I published from our trip to Ray Bandar's basement.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deployment"&gt;deployment&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/museums"&gt;museums&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zero-downtime"&gt;zero-downtime&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/docker"&gt;docker&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/traefik"&gt;traefik&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette-cloud"&gt;datasette-cloud&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/digitalocean"&gt;digitalocean&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="csv"/><category term="deployment"/><category term="museums"/><category term="projects"/><category term="zero-downtime"/><category term="docker"/><category term="datasette"/><category term="weeknotes"/><category term="traefik"/><category term="datasette-cloud"/><category term="digitalocean"/></entry><entry><title>How to do Zero Downtime Deployments of Docker Containers</title><link href="https://simonwillison.net/2020/Jan/16/zero-downtime-deployments/#atom-tag" rel="alternate"/><published>2020-01-16T23:12:35+00:00</published><updated>2020-01-16T23:12:35+00:00</updated><id>https://simonwillison.net/2020/Jan/16/zero-downtime-deployments/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://coderbook.com/@marcus/how-to-do-zero-downtime-deployments-of-docker-containers/"&gt;How to do Zero Downtime Deployments of Docker Containers&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I’m determined to get reliable zero-downtime deploys working for a new project, because I know from experience that even a few seconds of downtime during a deploy changes the project mentality from “deploy any time you want” to “don’t deploy too often”. I’m using Docker containers behind Traefik, which means new containers should have traffic automatically balanced to them by Traefik based on their labels. After much fiddling around the pattern described by this article worked best for me: it lets me start a new container, then stop the old one and have Traefik’s “retry” mechanism send any requests to the stopped container over to the new one instead.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/deployment"&gt;deployment&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zero-downtime"&gt;zero-downtime&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/docker"&gt;docker&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/traefik"&gt;traefik&lt;/a&gt;&lt;/p&gt;



</summary><category term="deployment"/><category term="zero-downtime"/><category term="docker"/><category term="traefik"/></entry><entry><title>How to Create an Index in Django Without Downtime</title><link href="https://simonwillison.net/2019/Apr/11/index-in-django-without-downtime/#atom-tag" rel="alternate"/><published>2019-04-11T15:06:09+00:00</published><updated>2019-04-11T15:06:09+00:00</updated><id>https://simonwillison.net/2019/Apr/11/index-in-django-without-downtime/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://realpython.com/create-django-index-without-downtime/"&gt;How to Create an Index in Django Without Downtime&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Excellent advanced tutorial on Django migrations, which uses a desire to create indexes in PostgreSQL without locking the table (with CREATE INDEX CONCURRENTLY) to explain the SeparateDatabaseAndState and atomic features of Django’s migration framework.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/webology/status/1116109854492516353"&gt;Jeff Triplett&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/django"&gt;django&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/migrations"&gt;migrations&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/postgresql"&gt;postgresql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zero-downtime"&gt;zero-downtime&lt;/a&gt;&lt;/p&gt;



</summary><category term="django"/><category term="migrations"/><category term="postgresql"/><category term="zero-downtime"/></entry><entry><title>Migrating Messenger storage to optimize performance</title><link href="https://simonwillison.net/2018/Jun/27/migrating-messenger-storage-optimize-performance/#atom-tag" rel="alternate"/><published>2018-06-27T15:05:36+00:00</published><updated>2018-06-27T15:05:36+00:00</updated><id>https://simonwillison.net/2018/Jun/27/migrating-messenger-storage-optimize-performance/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://code.facebook.com/posts/201318390519340"&gt;Migrating Messenger storage to optimize performance&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Fascinating case-study of a truly gargantuan migration. Messenger has over a billion users, and Facebook successfully migrated its backend storage from HBase to their MyRocks database (a fork of MySQL with a storage engine built on their SSD-optimized RocksDB key/value library) without any user-visible downtime. They ended up using two migration paths: one for the 99.9% of regular accounts, and a separate path for extremely high volume accounts (businesses with very active chat bots or support systems).

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=17402241"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/facebook"&gt;facebook&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/migration"&gt;migration&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mysql"&gt;mysql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scaling"&gt;scaling&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zero-downtime"&gt;zero-downtime&lt;/a&gt;&lt;/p&gt;



</summary><category term="facebook"/><category term="migration"/><category term="mysql"/><category term="scaling"/><category term="zero-downtime"/></entry><entry><title>How the Citus distributed database rebalances your data</title><link href="https://simonwillison.net/2018/Feb/1/citus/#atom-tag" rel="alternate"/><published>2018-02-01T22:50:00+00:00</published><updated>2018-02-01T22:50:00+00:00</updated><id>https://simonwillison.net/2018/Feb/1/citus/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.citusdata.com/blog/2018/02/01/how-citus-database-rebalances-your-data/"&gt;How the Citus distributed database rebalances your data&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Citus is a fascinating implementation of database sharding built on top of PostgreSQL primitives. PostgreSQL 10 introduced extremely flexible logical replication—in this post Craig Kerstiens explains how Citus use this new ability to re-balance shards (e.g. when you move from two to four physical PostgreSQL nodes) without downtime.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/postgresql"&gt;postgresql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sharding"&gt;sharding&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/software-architecture"&gt;software-architecture&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zero-downtime"&gt;zero-downtime&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/craig-kerstiens"&gt;craig-kerstiens&lt;/a&gt;&lt;/p&gt;



</summary><category term="postgresql"/><category term="sharding"/><category term="software-architecture"/><category term="zero-downtime"/><category term="craig-kerstiens"/></entry><entry><title>How Balanced does Database Migrations with Zero-Downtime</title><link href="https://simonwillison.net/2017/Nov/7/how-balanced-does-database-migrations-with-zero-downtime/#atom-tag" rel="alternate"/><published>2017-11-07T11:36:25+00:00</published><updated>2017-11-07T11:36:25+00:00</updated><id>https://simonwillison.net/2017/Nov/7/how-balanced-does-database-migrations-with-zero-downtime/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://blog.balancedpayments.com/payments-infrastructure-suspending-traffic-zero-downtime-migrations/"&gt;How Balanced does Database Migrations with Zero-Downtime&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I’m fascinated by the idea of “pausing” traffic during a blocking site maintenance activity (like a database migration) and then un-pausing when the operation is complete—so end clients just see some of their requests taking a few seconds longer than expected. I first saw this trick described by Braintree. Balanced wrote about a neat way of doing this just using HAProxy, which lets you live reconfigure the maxconn for your backend down to zero (causing traffic to be queued up) and then bring the setting back up again a few seconds later to un-pause those requests.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/haproxy"&gt;haproxy&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/highavailability"&gt;highavailability&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/http"&gt;http&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/migrations"&gt;migrations&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scaling"&gt;scaling&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zero-downtime"&gt;zero-downtime&lt;/a&gt;&lt;/p&gt;



</summary><category term="haproxy"/><category term="highavailability"/><category term="http"/><category term="migrations"/><category term="scaling"/><category term="zero-downtime"/></entry><entry><title>Quoting Ryan King</title><link href="https://simonwillison.net/2010/May/29/spof/#atom-tag" rel="alternate"/><published>2010-05-29T11:36:00+00:00</published><updated>2010-05-29T11:36:00+00:00</updated><id>https://simonwillison.net/2010/May/29/spof/#atom-tag</id><summary type="html">
    &lt;blockquote cite="http://github.com/blog/655-scheduled-maintenance-today-22-00-pst#comment-7708"&gt;&lt;p&gt;The easiest way to have no-downtime upgrades is have an architecture that can tolerate some subset of their processes to be down at any time. De-SPOF and this gets easier (not that de-SPOFing is always trivial).&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="http://github.com/blog/655-scheduled-maintenance-today-22-00-pst#comment-7708"&gt;Ryan King&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/software-architecture"&gt;software-architecture&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/recovered"&gt;recovered&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zero-downtime"&gt;zero-downtime&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ryan-king"&gt;ryan-king&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/spof"&gt;spof&lt;/a&gt;&lt;/p&gt;



</summary><category term="software-architecture"/><category term="recovered"/><category term="zero-downtime"/><category term="ryan-king"/><category term="spof"/></entry><entry><title>Zero-downtime Redis upgrade discussion</title><link href="https://simonwillison.net/2010/May/28/scheduled/#atom-tag" rel="alternate"/><published>2010-05-28T14:50:00+00:00</published><updated>2010-05-28T14:50:00+00:00</updated><id>https://simonwillison.net/2010/May/28/scheduled/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://github.com/blog/655-scheduled-maintenance-today-22-00-pst"&gt;Zero-downtime Redis upgrade discussion&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
GitHub have a short window of scheduled downtime in order to upgrade their Redis server. I asked in their comments if they’d considered trying to run the upgrade with no downtime at all using Redis replication, and Ryan Tomayko has posted some interesting replies.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ops"&gt;ops&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/redis"&gt;redis&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ryan-tomayko"&gt;ryan-tomayko&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/upgrades"&gt;upgrades&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/recovered"&gt;recovered&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zero-downtime"&gt;zero-downtime&lt;/a&gt;&lt;/p&gt;



</summary><category term="github"/><category term="ops"/><category term="redis"/><category term="ryan-tomayko"/><category term="upgrades"/><category term="recovered"/><category term="zero-downtime"/></entry></feed>