Simon Willison's Weblog: upsert

Datasette 1.0a2: Upserts and finely grained permissions

2022-12-15T17:58:19+00:00

I've released the third alpha of Datasette 1.0. The 1.0a2 release introduces upsert support to the new JSON API and makes some major improvements to the Datasette permissions system.

Here are the annotated releases (see previous).

You can install and try out the alpha using:

pip install datasette==1.0a2

Upserts for the JSON API

New /db/table/-/upsert API, documented here. Upsert is an update-or-insert: existing rows will have specified keys updated, but if no row matches the incoming primary key a brand new row will be inserted instead. (#1878)

I wrote about the new JSON Write API when I released the first alpha a couple of weeks ago.

The API can be used to create and drop tables, and to insert, update and delete rows in those tables.

The new /db/table/-/upsert API adds upsert support to Datasette.

An upsert is a update-or-insert. Consider the following:

POST /books/authors/-/upsert
Authorization: Bearer $TOKEN
Content-Type: application/json

{
  "rows": [
    {
      "id": 1,
      "name": "Ursula K. Le Guin",
      "born": "1929-10-21"
    },
    {
      "id": 2,
      "name": "Terry Pratchett",
      "born": "1948-04-28"
    },
    {
      "id": 3,
      "name": "Neil Gaiman",
      "born": "1960-11-10"
    }
  ]
}

This table has a primary key of id. The above API call will create three records if the table is empty. But if the table already has records matching any of those primary keys, their name and born columns will be updated to match the incoming data.

Upserts can be a really convenient way of synchronizing data with an external data source. I had a couple of enquiries about them when I published the first alpha, so I decided to make them a key feature for this release.

Ignore and replace for the create table API

The /db/-/create API for creating a table now accepts "ignore": true and "replace": true options when called with the "rows" property that creates a new table based on an example set of rows. This means the API can be called multiple times with different rows, setting rules for what should happen if a primary key collides with an existing row. (#1927)

/db/-/create API now requires actor to have insert-row permission in order to use the "row" or "rows" properties. (#1937)

This feature is a little less obvious, but I think it's going to be really useful.

The /db/-/create API can be used to create a new table. You can feed it an explicit list of columns, but you can also give it one or more rows and have it infer the correct schema based on those examples.

Datasette inherits this feature from sqlite-utils - I've been finding this an incredibly productive way to work with SQLite databases for a few years now.

The real magic of this feature is that you can pipe data into Datasette without even needing to first check that the appropriate table has been created. It's a really fast way of getting from data to a populated database and a working API.

Prior to 1.0a2 you could call /db/-/create with "rows" more than once and it would probably work... unless you attempted to insert rows with primary keys that were already in use - in which case you would get an error. This limited the utility of the feature.

Now you can pass "ignore": true or "replace": true to the API call, to tell Datasette what to do if it encounters a primary key that already exists in the table.

Heres an example using the author data from above:

POST /books/-/create
Authorization: Bearer $TOKEN
Content-Type: application/json

{
  "table": "authors",
  "pk": "id",
  "replace": true,
  "rows": [
    {
      "id": 1,
      "name": "Ursula K. Le Guin",
      "born": "1929-10-21"
    },
    {
      "id": 2,
      "name": "Terry Pratchett",
      "born": "1948-04-28"
    },
    {
      "id": 3,
      "name": "Neil Gaiman",
      "born": "1960-11-10"
    }
  ]
}

This will create the authors table if it does not exist and ensure that those three rows exist in it, in their exact state. If a row already exists it will be replaced.

Note that this is subtly different from an upsert. An upsert will only update the columns that were provided in the incoming data, leaving any other columns unchanged. A replace will replace the entire row.

Finely grained permissions

This is the most significant area of improvement in this release.

New register_permissions(datasette) plugin hook. Plugins can now register named permissions, which will then be listed in various interfaces that show available permissions. (#1940)

Prior to this, permissions were just strings - things like "view-instance" or "view-table" or "insert-row".

Plugins can introduce their own permissions - many do already, like datasette-edit-schema which adds a "edit-schema" permission.

In order to start building UIs for managing permissions, I needed Datasette to know what they were!

The register_permissions() hook lets them do exactly that, and Datasette core uses it to register its own default set of permissions too.

Permissions are registered using the following named tuple:

Permission = collections.namedtuple(
    "Permission", (
        "name", "abbr", "description",
        "takes_database", "takes_resource", "default"
    )
)

The abbr is an abbreviation - e.g. insert-row can be abbreviated to ir. This is useful for creating things like signed API tokens where space is at a premium.

takes_database and takes_resource are booleans that indicate whether the permission can optionally be applied to a specific database (e.g. execute-sql) or to a "resource", which is the name I'm now using for something that could be a SQL table, a SQL view or a canned query.

The insert-row permission for example can be granted to the whole of Datasette, or to all tables in a specific database, or to specific tables.

Finally, the default value is a boolean that indicates whether the permission should be default-allow (view-instance for example) or default-deny (create-table and suchlike).

This next feature explains why I needed those permission names to be known to Datasette:

The /-/create-token page can now be used to create API tokens which are restricted to just a subset of actions, including against specific databases or resources. See API Tokens for details. (#1947)

Datasette now has finely grained permissions for API tokens!

This is the feature I always want when I'm working with other APIs: the ability to create a token that can only perform a restricted subset of actions.

When I'm working with the GitHub API for example I frequently find myself wanting to create a "personal access token" that only has the ability to read issues from a specific repository. It's infuriating how many APIs leave this ability out.

The /-/create-token interface (which you can try out on latest.datasette.io by first signing in as root and then visiting this page) lets you create an API token that can act on your behalf... and then optionally specify a subset of actions that the token is allowed to perform.

Thanks to the new permissions registration system, the UI on that page knows which permissions can be applied to which entities within Datasette itself.

Here's a partial screenshot of the UI:

Select a subset of permissions, hit "Create token" and the result will be an access token you can copy and paste into another application, or use to call Datasette with curl.

Here's an example token I created that grants view-instance permission against all of Datasette, and view-database and execute-sql permission against the ephemeral database, which in that demo is hidden from anonymous Datasette users.

dstok_eyJhIjoicm9vdCIsInQiOjE2NzEwODUzMDIsIl9yIjp7ImEiOlsidmkiXSwiZCI6eyJlcGhlbWVyYWwiOlsidmQiLCJlcyJdfX19.1uw0xyx8UND_Y_vTVg5kEF6m3GU

Here's a screenshot of the screen I saw when I created it:

I've expanded the "token details" section to show the JSON that is bundled inside the signed token. The "_r" block records the specific permissions granted by the token:

"a": ["vi"] indicates that the view-instance permission is granted against all of Datasette.
"d": {"ephemeral": ["vd", "es"]} indicates that the view-database and execute-sql permissions are granted against the ephemeral database.

The token also contains the ID of the user who created it ("a": "root") and the time that the token was created ("t": 1671085302). If the token was set to expire that expiry duration would be baked in here as well.

You can see the effect this has on the command-line using curl like so:

curl 'https://latest.datasette.io/ephemeral.json?sql=select+3+*+5&_shape=array'

This will return a forbidden error. But if you add the signed token:

curl 'https://latest.datasette.io/ephemeral.json?sql=select+3+*+5&_shape=array' -H 'Authorization: Bearer dstok_eyJhIjoicm9vdCIsInQiOjE2NzEwODUzMDIsIl9yIjp7ImEiOlsidmkiXSwiZCI6eyJlcGhlbWVyYWwiOlsidmQiLCJlcyJdfX19.1uw0xyx8UND_Y_vTVg5kEF6m3GU'

You'll get back a JSON response:

[{"3 * 5": 15}]

The datasette create-token CLI tool

Likewise, the datasette create-token CLI command can now create tokens with a subset of permissions. (#1855)

New datasette.create_token() API method for programmatically creating signed API tokens. (#1951)

The other way you can create Datasette tokens is on the command-line, using the datasette create-token command.

That's been upgraded to support finely grained permissions too.

Here's how you'd create a token for the same set of permissions as my ephemeral example above:

datasette create-token root \
  --all view-instance \
  --database ephemeral view-database \
  --database ephemeral execute-sql \
  --secret MY_DATASETTE_SECRET

In order to sign the token you need to pass in the --secret used by the server - although it will pick that up from the DATASETTE_SECRET environment variable if it's available.

This has the interesting side-effect that you can use that command to create valid tokens for other Datasette instances, provided you know the secret they're using. I think this ability will be really useful for people like myself who run lots of different Datasette instances on stateless hosting platforms such as Vercel and Google Cloud Run.

Configuring permissions in metadata.json/yaml

Arbitrary permissions can now be configured at the instance, database and resource (table, SQL view or canned query) level in Datasette's Metadata JSON and YAML files. The new "permissions" key can be used to specify which actors should have which permissions. See Other permissions in metadata for details. (#1636)

Datasette has long had the ability to set permissions for viewing databases and tables using blocks of configuration in the increasingly poorly named metadata.json/yaml files.

As I've built new plugins that introduce new permissions, I've found myself wishing for an easier way to say "user X is allowed to perform action Y" for arbitrary other permissions.

The new "permissions" key in metadata.json/yaml files allows you to do that.

Here's how to specify that the user with "id": "simon" is allowed to use the API to create tables and insert data into the docs database:

databases:
  docs:
    permissions:
      create-table:
        id: simon
      insert-row:
        id: simon

Here's a demo you can run on your own machine. Save the above to permissions.yaml and run the following in one terminal window:

datasette docs.db --create --secret sekrit -m permissions.yaml

This will create the docs.db database if it doesn't already exist, and start Datasette with the permissions.yaml metadata file.

It sets --secret to a known value (you should always use a random secure secret in production) so we can easily use it with create-token in the next step:

Then in another terminal window run:

export TOKEN=$(
  datasette create-token simon \
  --secret sekrit
)
curl -XPOST http://localhost:8001/docs/-/create \
  -H "Authorization: Bearer $TOKEN" \
  -d '{
    "table": "demo",
    "row": {"id": 1, "name": "Simon"},
    "pk": "id"
  }'

The first line creates a token that can act on behalf of the simon actor. The second curl line then uses that token to create a table using the /-/create endpoint.

Run this, then visit http://localhost:8001/docs/demo to see the newly created table.

What's next?

With the 1.0a2 release I'm reasonably confident that Datasette 1.0 is new-feature-complete. There's still a lot of work to do before the final release, but the remaining work is far more intimidating: I need to make clean backwards-incompatible breakages to a whole host of existing features in order to ship a 1.0 that I can keep stable for as long as possible.

First up: I'm going to redesign Datasette's default API output.

The current default JSON output for a simple table looks like this:

{
  "database": "fixtures",
  "table": "facet_cities",
  "is_view": false,
  "human_description_en": "sorted by name",
  "rows": [
    [
      3,
      "Detroit"
    ],
    [
      2,
      "Los Angeles"
    ],
    [
      4,
      "Memnonia"
    ],
    [
      1,
      "San Francisco"
    ]
  ],
  "truncated": false,
  "filtered_table_rows_count": 4,
  "expanded_columns": [],
  "expandable_columns": [],
  "columns": [
    "id",
    "name"
  ],
  "primary_keys": [
    "id"
  ],
  "units": {},
  "query": {
    "sql": "select id, name from facet_cities order by name limit 101",
    "params": {}
  },
  "facet_results": {},
  "suggested_facets": [],
  "next": null,
  "next_url": null,
  "private": false,
  "allow_execute_sql": true,
  "query_ms": 6.718471999647591,
  "source": "tests/fixtures.py",
  "source_url": "https://github.com/simonw/datasette/blob/main/tests/fixtures.py",
  "license": "Apache License 2.0",
  "license_url": "https://github.com/simonw/datasette/blob/main/LICENSE"
}

In addition to being really verbose, you'll note that the rows themselves are represented like this:

[
  [
    3,
    "Detroit"
  ],
  [
    2,
    "Los Angeles"
  ],
  [
    4,
    "Memnonia"
  ],
  [
    1,
    "San Francisco"
  ]
]

I originally designed it this way because I thought saving on repeating the column names for every row would be more efficient.

In practice, every single time I've used Datasette's API I've found myself using the ?_shape=array parameter, which outputs this format instead:

[
  {
    "id": 3,
    "name": "Detroit"
  },
  {
    "id": 2,
    "name": "Los Angeles"
  },
  {
    "id": 4,
    "name": "Memnonia"
  },
  {
    "id": 1,
    "name": "San Francisco"
  }
]

It's just so much more convenient to work with!

So the new default format will look like this:

{
  "rows": [
    {
      "id": 3,
      "name": "Detroit"
    },
    {
      "id": 2,
      "name": "Los Angeles"
    },
    {
      "id": 4,
      "name": "Memnonia"
    },
    {
      "id": 1,
      "name": "San Francisco"
    }
  ]
}

The rows key is there so I can add extra keys to the output, based on additional ?_extra= request parameters. You'll be able to get back everything you can get in the current full-fat table API, but you'll have to ask for it.

There are a ton of other changes I want to make to Datasette as a whole - things like renaming metadata.yaml to config.yaml to reflect that it's gone way beyond its origins as a way of attaching metadata to a database.

The 1.0 milestone is a dumping ground for many of these ideas. It's not a canonical reference though: I'd be very surprised if everything currently in that milestone makes it into the final 1.0 release.

As I get closer to 1.0 though I'll be refining that milestone so it should get more accurate over time.

Once again: now is the time to be providing feedback on this stuff! The Datasette Discord is a particularly valuable way for me to get feedback on the work so far, and my plans for the future.

Tags: api, permissions, projects, upsert, datasette, annotated-release-notes

sqlite-utils 2.0: real upserts

2019-12-30T06:48:40+00:00

I just released version 2.0 of my sqlite-utils library/CLI tool to PyPI.

It's 2.0 because this is my first backwards compatibility breaking version of the library. I was hoping to avoid this for as long as possible: given semantic versioning, my opinion is that the highest quality libraries are the 1.x versions with the highest sub-version. Version 1.253 means a library managed to ship 253 new features without once breaking compatibility!

For sqlite-utils I needed the clean break, because I made a big mistake in my definition of "upsert".

An upsert is a database operation which says "Update this record with these new values. If the record does not exist, create it."

This is actually a documented feature of SQLite version 3.24.0 and higher. I've seen the term used for other databases such as PostgreSQL as well.

The problem is: when I first implemented the .upsert() method for sqlite-utils I incorrectly assumed that upsert was an alias for INSERT OR REPLACE.

So in sqlite-utils 1.x, the following code:

db["dogs"].upsert({
    "id": 1,
    "name": "Cleo",
    "age": 4
}, pk="id")

Executes then following SQL:

INSERT OR REPLACE INTO dogs
    ("id", "name", "age")
VALUES
    (1, "Cleo", 4);

It turns out this isn't how upsert should be expected to work. This query will entirely replace any existing record. But... an upsert should update the record in place. Crucially, it should leave any columns not referenced in the upsert alone. My implementation over-wrote the entire row, deleting any missing values.

So if I run this Python code:

db["dogs"].upsert({
    "id": 1,
    "age": 5
}, pk="id")

The existing record for id=1 should have its age updated, but the name should stay the same.

I thought hard about whether I could keep the existing, incorrect .upsert() behavior and introduce a different method name or argument that did things right - maintaining backwards compatibility at the price of a less intuitive API - but this felt like too big a discrepancy. And this is what major version number changes are for! You can follow my deliberations about this in this issue.

While this is technically a breaking change, when I look at the code I have already written that uses sqlite-utils it turns out all of my uses of .upsert() are compatible with the new implementation anyway - they all send the full record object, so they will over-ride all of the fields no matter if they are running against 1.x or 2.x.

If I'm lucky, other users of sqlite-utils will be unaffected by this behavior change either. So hopefully it will be a painless upgrade.

If you do need to keep the old behavior, I've made it available as a replace=True argument to .insert() and .insert_all(), like so:

db["dogs"].insert({
    "id": 1,
    "name": "Cleo",
    "age": 4
}, pk="id", replace=True)

I did think about introducing a new piece of terminology for this - .inplace(), for insert-or-replace - but after getting in trouble redefining upsert by mistake I decided inventing new language wasn't worth the risk.

Using this for Niche Museums

The project that made me realize I needed this upsert change was Niche Museums.

Niche Museums works by building a SQLite database file from a museums.yaml file. The YAML file contains the data for all 81 museums on the site. To add or edit a listing, I edit that file - sometimes in Visual Code on my Mac, sometimes via the GitHub web interface and often by using the Working Copy git client on my iPhone.

Creating the database is done using my yaml-to-sqlite tool:

$ yaml-to-sqlite browse.db museums museums.yaml --pk=id

The build process then runs a few extra scripts to annotate the records with extra data.

annotate_timestamps.py iterates through the full git history and uses it to figure out when each record was first created or last updated.

annotate_nominatim.py uses the OpenStreetMap Nominatim API to reverse geocode the latitude and longitude of each museum, adding extra columns for the city, state, country and all kinds of other interesting geographical details.

I'm not doing much with this OSM data yet - mainly using it to include the OSM country and city name in the full-text search index - but I hope to use it for faceting and other navigation improvements in the future (and maybe location-based breadcrumbs).

Here's the problem though: every time I deploy the site, I run those scripts. Nominatim asks you not to hit their API more than once a second, so I have a sleep(1) in the code. This means the more museums I add, the longer that build step takes to reverse geocode them all.

It would be great to avoid geocoding museums that have already been geocoded. Properly implemented upsert let's me do exactly that.

I haven't made these changes yet, but my plan is to update the build script so that instead of starting from a blank database each time, it downloads the previously built version of the database and runs an upsert against it with the data from the YAML. This will overwrite all of the data that I might have updated in that file, but leave the osm_* columns that were created by the annotate_nominatim.py script in place.

That script can then just run against records for which osm_country is null. As such, it should only geocode newly created records.

Other improvements this week

I added a few new features to Niche Museums this week.

First, I added simple photo galleries to the listings pages, using CSS columns. This means I can add more than one photo to a museum! I've done this to a few now, including going back to the listing for the Ilfracombe Museum to add a whole bunch of extra photos.

I added photos and press links to the Atom feed for the site. Since the Atom feed is defined using a custom SQL query and datasette-atom, achieving this involved building a truly diabolical SQL query (making extensive use of the SQLite json1 extension).

I fixed an issue with the search engine where unexpected punctuation could cause an FTS syntax error in SQLite. I've extracted that into a fix for a Datasette issue as well.

The site search engine also now attempts to geocode the user's query and offers a link to the corresponding lat/lon radius search for that point. Try a search for new orleans to see what that looks like - the implementation (entirely client-side, calling the Nominatim API) is here.

Here are the museums I added this week:

Jukebox London in London
Stonehenge in Wiltshire
Yolo Causeway Bat Colony in Yolo County
Horniman Museum and Gardens in London
Southern Food and Beverage Museum in New Orleans
Oakley Court in Windsor and Maidenhead
Seaton Tramway in East Devon

Tags: museums, projects, sql, sqlite, upsert, weeknotes, sqlite-utils

Using Mongo for Real-Time Analytics

2009-06-30T19:28:43+00:00

Using Mongo for Real-Time Analytics

MongoDB supports an “upsert” query, which when combined with the $inc operator can cause counter fields to be incremented if they exist and created otherwise. This makes it a great fit for real-time analytics applications (one increment per page view), something that regular relational databases aren’t particularly good at.

Tags: counters, databases, increment, mongodb, upsert