<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: yaml</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/yaml.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-06-21T17:15:21+00:00</updated><author><name>Simon Willison</name></author><entry><title>model.yaml</title><link href="https://simonwillison.net/2025/Jun/21/model-yaml/#atom-tag" rel="alternate"/><published>2025-06-21T17:15:21+00:00</published><updated>2025-06-21T17:15:21+00:00</updated><id>https://simonwillison.net/2025/Jun/21/model-yaml/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://modelyaml.org/"&gt;model.yaml&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;From their &lt;a href="https://github.com/modelyaml/modelyaml"&gt;GitHub repo&lt;/a&gt; it looks like this effort quietly launched a couple of months ago, driven by the &lt;a href="https://lmstudio.ai/"&gt;LM Studio&lt;/a&gt; team. Their goal is to specify an "open standard for defining crossplatform, composable AI models".&lt;/p&gt;
&lt;p&gt;A model can be defined using a YAML file that &lt;a href="https://lmstudio.ai/models/mistralai/mistral-small-3.2"&gt;looks like this&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-ent"&gt;model&lt;/span&gt;: &lt;span class="pl-s"&gt;mistralai/mistral-small-3.2&lt;/span&gt;
&lt;span class="pl-ent"&gt;base&lt;/span&gt;:
  - &lt;span class="pl-ent"&gt;key&lt;/span&gt;: &lt;span class="pl-s"&gt;lmstudio-community/mistral-small-3.2-24b-instruct-2506-gguf&lt;/span&gt;
    &lt;span class="pl-ent"&gt;sources&lt;/span&gt;:
      - &lt;span class="pl-ent"&gt;type&lt;/span&gt;: &lt;span class="pl-s"&gt;huggingface&lt;/span&gt;
        &lt;span class="pl-ent"&gt;user&lt;/span&gt;: &lt;span class="pl-s"&gt;lmstudio-community&lt;/span&gt;
        &lt;span class="pl-ent"&gt;repo&lt;/span&gt;: &lt;span class="pl-s"&gt;Mistral-Small-3.2-24B-Instruct-2506-GGUF&lt;/span&gt;
&lt;span class="pl-ent"&gt;metadataOverrides&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;domain&lt;/span&gt;: &lt;span class="pl-s"&gt;llm&lt;/span&gt;
  &lt;span class="pl-ent"&gt;architectures&lt;/span&gt;:
    - &lt;span class="pl-s"&gt;mistral&lt;/span&gt;
  &lt;span class="pl-ent"&gt;compatibilityTypes&lt;/span&gt;:
    - &lt;span class="pl-s"&gt;gguf&lt;/span&gt;
  &lt;span class="pl-ent"&gt;paramsStrings&lt;/span&gt;:
    - &lt;span class="pl-c1"&gt;24B&lt;/span&gt;
  &lt;span class="pl-ent"&gt;minMemoryUsageBytes&lt;/span&gt;: &lt;span class="pl-c1"&gt;14300000000&lt;/span&gt;
  &lt;span class="pl-ent"&gt;contextLengths&lt;/span&gt;:
    - &lt;span class="pl-c1"&gt;4096&lt;/span&gt;
  &lt;span class="pl-ent"&gt;vision&lt;/span&gt;: &lt;span class="pl-c1"&gt;true&lt;/span&gt;&lt;/pre&gt;

&lt;p&gt;This should be enough information for an LLM serving engine - such as LM Studio - to understand where to get the model weights (here that's &lt;a href="https://huggingface.co/lmstudio-community/Mistral-Small-3.2-24B-Instruct-2506-GGUF"&gt;lmstudio-community/Mistral-Small-3.2-24B-Instruct-2506-GGUF&lt;/a&gt; on Hugging Face, but it leaves space for alternative providers) plus various other configuration options and important metadata about the capabilities of the model.&lt;/p&gt;
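&lt;p&gt;To make that concrete, here's a sketch of how a serving engine might resolve the download source from that manifest, assuming the YAML has already been parsed into a Python dict (for example with PyYAML). The &lt;code&gt;huggingface_repo_url()&lt;/code&gt; helper is my own illustration, not part of the spec:&lt;/p&gt;

```python
# Parsed form of the model.yaml shown above (e.g. yaml.safe_load(...)).
manifest = {
    "model": "mistralai/mistral-small-3.2",
    "base": [
        {
            "key": "lmstudio-community/mistral-small-3.2-24b-instruct-2506-gguf",
            "sources": [
                {
                    "type": "huggingface",
                    "user": "lmstudio-community",
                    "repo": "Mistral-Small-3.2-24B-Instruct-2506-GGUF",
                }
            ],
        }
    ],
}


def huggingface_repo_url(manifest):
    """Return the URL of the first Hugging Face source, if any.

    Helper name and URL construction are my own - the spec just lists
    sources; engines decide how to resolve them.
    """
    for base in manifest.get("base", []):
        for source in base.get("sources", []):
            if source.get("type") == "huggingface":
                return f"https://huggingface.co/{source['user']}/{source['repo']}"
    return None


print(huggingface_repo_url(manifest))
# https://huggingface.co/lmstudio-community/Mistral-Small-3.2-24B-Instruct-2506-GGUF
```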
&lt;p&gt;I like this concept a lot. I've actually been considering something similar for my LLM tool - my idea was to use Markdown with a YAML frontmatter block - but now that there's an early-stage standard for it I may well build on top of this work instead.&lt;/p&gt;
&lt;p&gt;I couldn't find any evidence that anyone outside of LM Studio is using this yet, so it's effectively a one-vendor standard for the moment. All of the models in their &lt;a href="https://lmstudio.ai/models"&gt;Model Catalog&lt;/a&gt; are defined using model.yaml.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/standards"&gt;standards&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/yaml"&gt;yaml&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;&lt;/p&gt;



</summary><category term="standards"/><category term="yaml"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="lm-studio"/></entry><entry><title>openai/openai-openapi</title><link href="https://simonwillison.net/2024/Dec/22/openai-openapi/#atom-tag" rel="alternate"/><published>2024-12-22T22:59:25+00:00</published><updated>2024-12-22T22:59:25+00:00</updated><id>https://simonwillison.net/2024/Dec/22/openai-openapi/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/openai/openai-openapi"&gt;openai/openai-openapi&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Seeing as the LLM world has semi-standardized on imitating OpenAI's API format for a whole host of different tools, it's useful to note that OpenAI themselves maintain a dedicated repository with an &lt;a href="https://www.openapis.org/"&gt;OpenAPI&lt;/a&gt; YAML representation of their current API.&lt;/p&gt;
&lt;p&gt;(I get OpenAI and OpenAPI typo-confused all the time, so &lt;code&gt;openai-openapi&lt;/code&gt; is a delightfully fiddly repository name.)&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/openai/openai-openapi/blob/master/openapi.yaml"&gt;openapi.yaml&lt;/a&gt; file itself is over 26,000 lines long, defining 76 API endpoints ("paths" in OpenAPI terminology) and 284 "schemas" for JSON that can be sent to and from those endpoints. A much more interesting view onto it is the &lt;a href="https://github.com/openai/openai-openapi/commits/master/openapi.yaml"&gt;commit history&lt;/a&gt; for that file, showing details of when each different API feature was released.&lt;/p&gt;
&lt;p&gt;Browsing 26,000 lines of YAML isn't pleasant, so I &lt;a href="https://gist.github.com/simonw/54b4e533481cc7a686b0172c3a9ac21e"&gt;got Claude&lt;/a&gt; to build me a rudimentary YAML expand/hide exploration tool. Here's that tool running against the OpenAI schema, loaded directly from GitHub via a CORS-enabled &lt;code&gt;fetch()&lt;/code&gt; call: &lt;a href="https://tools.simonwillison.net/yaml-explorer#eyJ1cmwiOiJodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vb3BlbmFpL29wZW5haS1vcGVuYXBpL3JlZnMvaGVhZHMvbWFzdGVyL29wZW5hcGkueWFtbCIsIm9wZW4iOlsiZDAiLCJkMjAiXX0="&gt;https://tools.simonwillison.net/yaml-explorer#eyJ1c...&lt;/a&gt; - the fragment after the &lt;code&gt;#&lt;/code&gt; is base64-encoded JSON capturing the current state of the tool (mostly Claude's idea).&lt;/p&gt;
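&lt;p&gt;That fragment state can be recovered with a couple of lines of Python:&lt;/p&gt;

```python
import base64
import json

# The fragment after the "#" in the yaml-explorer URL above.
fragment = "eyJ1cmwiOiJodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vb3BlbmFpL29wZW5haS1vcGVuYXBpL3JlZnMvaGVhZHMvbWFzdGVyL29wZW5hcGkueWFtbCIsIm9wZW4iOlsiZDAiLCJkMjAiXX0="

# Decodes to a dict with the YAML file URL plus the IDs of the expanded nodes.
state = json.loads(base64.b64decode(fragment))
print(state["open"])  # ['d0', 'd20']
```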
&lt;p&gt;&lt;img alt="Screenshot of the YAML explorer, showing a partially expanded set of sections from the OpenAI API specification." src="https://static.simonwillison.net/static/2024/yaml-explorer.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;The tool is a little buggy - the expand-all option doesn't work quite how I want - but it's useful enough for the moment.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: It turns out the &lt;a href="https://petstore.swagger.io/"&gt;petstore.swagger.io&lt;/a&gt; demo has an (as far as I can tell) undocumented &lt;code&gt;?url=&lt;/code&gt; parameter which can load external YAML files, so &lt;a href="https://petstore.swagger.io/?url=https://raw.githubusercontent.com/openai/openai-openapi/refs/heads/master/openapi.yaml"&gt;here's openai-openapi/openapi.yaml&lt;/a&gt; in an OpenAPI explorer interface.&lt;/p&gt;
&lt;p&gt;&lt;img alt="The Swagger API browser showing the OpenAI API" src="https://static.simonwillison.net/static/2024/swagger.jpg" /&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/apis"&gt;apis&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tools"&gt;tools&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/yaml"&gt;yaml&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-3-5-sonnet"&gt;claude-3-5-sonnet&lt;/a&gt;&lt;/p&gt;



</summary><category term="apis"/><category term="tools"/><category term="yaml"/><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="claude-3-5-sonnet"/></entry><entry><title>Weeknotes: airtable-export, generating screenshots in GitHub Actions, Dogsheep!</title><link href="https://simonwillison.net/2020/Sep/3/weeknotes-airtable-screenshots-dogsheep/#atom-tag" rel="alternate"/><published>2020-09-03T23:28:29+00:00</published><updated>2020-09-03T23:28:29+00:00</updated><id>https://simonwillison.net/2020/Sep/3/weeknotes-airtable-screenshots-dogsheep/#atom-tag</id><summary type="html">
    &lt;p&gt;This week I figured out how to populate Datasette from Airtable, wrote code to generate social media preview card page screenshots using Puppeteer, and made a big breakthrough with my Dogsheep project.&lt;/p&gt;
&lt;h4 id="weeknotes-2020-09-03-airtable-export"&gt;airtable-export&lt;/h4&gt;
&lt;p&gt;I wrote about &lt;a href="https://www.rockybeaches.com/"&gt;Rocky Beaches&lt;/a&gt; in my weeknotes &lt;a href="https://simonwillison.net/2020/Aug/21/weeknotes-rocky-beaches/"&gt;two weeks ago&lt;/a&gt;. It's a new website built by Natalie Downe that showcases great places to go rockpooling (tidepooling in American English), mixing in tide data from NOAA and species sighting data from iNaturalist.&lt;/p&gt;
&lt;p&gt;Rocky Beaches is powered by Datasette, using a GitHub Actions workflow that builds the site's underlying SQLite database using API calls and YAML data stored in the GitHub repository.&lt;/p&gt;
&lt;p&gt;Natalie wanted to use Airtable to maintain the structured data for the site, rather than hand-editing a YAML file. So I built &lt;a href="https://github.com/simonw/airtable-export"&gt;airtable-export&lt;/a&gt;, a command-line script for sucking down all of the data from an Airtable instance and writing it to disk as YAML or JSON.&lt;/p&gt;
&lt;p&gt;You run it like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;airtable-export out/ mybaseid table1 table2 --key=key
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will create a folder called &lt;code&gt;out/&lt;/code&gt; with a &lt;code&gt;.yml&lt;/code&gt; file for each of the tables.&lt;/p&gt;
&lt;p&gt;Sadly the Airtable API doesn't yet provide a mechanism to list all of the tables in a database (a &lt;a href="https://community.airtable.com/t/list-tables-given-api-key-and-baseid/1173"&gt;long-running feature request&lt;/a&gt;) so you have to list the tables yourself.&lt;/p&gt;
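&lt;p&gt;Under the hood an exporter like this has to follow Airtable's offset-based pagination - each API response can include an &lt;code&gt;offset&lt;/code&gt; token which you pass back to fetch the next page. Here's a minimal sketch of that loop with a canned fetcher standing in for the real HTTP calls; the function and record names are mine, not airtable-export's actual internals:&lt;/p&gt;

```python
def fetch_all_records(fetch_page):
    """Follow Airtable-style offset pagination.

    fetch_page(offset) returns a dict with "records" and, while more
    pages remain, an "offset" token to pass back on the next call.
    """
    records = []
    offset = None
    while True:
        page = fetch_page(offset)
        records.extend(page["records"])
        offset = page.get("offset")
        if offset is None:
            return records


# Canned pages standing in for real GET requests to
# https://api.airtable.com/v0/{base_id}/{table} (hypothetical data):
pages = {
    None: {"records": [{"id": "rec1"}, {"id": "rec2"}], "offset": "itr1"},
    "itr1": {"records": [{"id": "rec3"}]},
}
print(len(fetch_all_records(lambda offset: pages[offset])))  # 3
```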
&lt;p&gt;We're now &lt;a href="https://github.com/natbat/rockybeaches/blob/32a010292e7c1ba47db1a86523a61c666d977074/.github/workflows/deploy.yml#L31-L44"&gt;running that command&lt;/a&gt; as part of the Rocky Beaches build script, and committing the latest version of the YAML file back to the GitHub repo (thus gaining a &lt;a href="https://github.com/natbat/rockybeaches/commits/main/airtable"&gt;full change history&lt;/a&gt; for that data).&lt;/p&gt;
&lt;h4 id="weeknotes-2020-09-03-social-media-cards-tils"&gt;Social media cards for my TILs&lt;/h4&gt;
&lt;p&gt;I really like social media cards - &lt;code&gt;og:image&lt;/code&gt; HTML meta attributes for Facebook and &lt;code&gt;twitter:image&lt;/code&gt; for Twitter. I wanted them for articles on my &lt;a href="https://til.simonwillison.net/"&gt;TIL website&lt;/a&gt; since I often share those via Twitter.&lt;/p&gt;
&lt;p&gt;One catch: my TILs aren't very image heavy. So I decided to generate screenshots of the pages and use those as the 2x1 social media card images.&lt;/p&gt;
&lt;p&gt;The best way I know of programmatically generating screenshots is to use &lt;a href="https://developers.google.com/web/tools/puppeteer"&gt;Puppeteer&lt;/a&gt;, a Node.js library maintained by the Chrome DevTools team for automating a headless instance of the Chrome browser.&lt;/p&gt;
&lt;p&gt;My first attempt was to run Puppeteer in an AWS Lambda function on &lt;a href="https://vercel.com/"&gt;Vercel&lt;/a&gt;. I remembered seeing an example of how to do this in the Vercel documentation a few years ago. The example isn't there any more, but I found the &lt;a href="https://github.com/vercel/now-examples/pull/207"&gt;original pull request&lt;/a&gt; that introduced it.&lt;/p&gt;
&lt;p&gt;Since the example was MIT licensed I created my own fork at &lt;a href="https://github.com/simonw/puppeteer-screenshot"&gt;simonw/puppeteer-screenshot&lt;/a&gt; and updated it to work with the latest Chrome.&lt;/p&gt;
&lt;p&gt;It's pretty resource intensive, so I also added a secret &lt;code&gt;?key=&lt;/code&gt; mechanism so only my own automation code could call my instance running on Vercel.&lt;/p&gt;
&lt;p&gt;I needed to store the generated screenshots somewhere. They're pretty small - on the order of 60KB each - so I decided to store them in my SQLite database itself and use my &lt;a href="https://github.com/simonw/datasette-media"&gt;datasette-media&lt;/a&gt; plugin (see &lt;a href="https://simonwillison.net/2020/Jul/30/fun-binary-data-and-sqlite/"&gt;Fun with binary data and SQLite&lt;/a&gt;) to serve them up.&lt;/p&gt;
&lt;p&gt;This worked! Until it didn't... I ran into a showstopper bug when I realized that the screenshot process relies on the page being live on the site... but when a new article is added it isn't live yet while the build process runs, so the generated screenshot &lt;a href="https://github.com/simonw/til/issues/23"&gt;is of the 404 page&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;So I reworked it to generate the screenshots inside the GitHub Action as part of the build script, using &lt;a href="https://github.com/JarvusInnovations/puppeteer-cli"&gt;puppeteer-cli&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://github.com/simonw/til/blob/3fca996228ad54ee433b25840fcd3682e9f7bbfd/generate_screenshots.py"&gt;generate_screenshots.py&lt;/a&gt; script handles this by first shelling out to &lt;code&gt;datasette --get&lt;/code&gt; to render the HTML for the page, then running &lt;code&gt;puppeteer&lt;/code&gt; to generate the screenshot. Relevant code:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;png_for_path&lt;/span&gt;(&lt;span class="pl-s1"&gt;path&lt;/span&gt;):
    &lt;span class="pl-c"&gt;# Path is e.g. /til/til/python_debug-click-with-pdb.md&lt;/span&gt;
    &lt;span class="pl-s1"&gt;page_html&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;str&lt;/span&gt;(&lt;span class="pl-v"&gt;TMP_PATH&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; &lt;span class="pl-s"&gt;"generate-screenshots-page.html"&lt;/span&gt;)
    &lt;span class="pl-c"&gt;# Use datasette to generate HTML&lt;/span&gt;
    &lt;span class="pl-s1"&gt;proc&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;subprocess&lt;/span&gt;.&lt;span class="pl-en"&gt;run&lt;/span&gt;([&lt;span class="pl-s"&gt;"datasette"&lt;/span&gt;, &lt;span class="pl-s"&gt;"."&lt;/span&gt;, &lt;span class="pl-s"&gt;"--get"&lt;/span&gt;, &lt;span class="pl-s1"&gt;path&lt;/span&gt;], &lt;span class="pl-s1"&gt;capture_output&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;)
    &lt;span class="pl-en"&gt;open&lt;/span&gt;(&lt;span class="pl-s1"&gt;page_html&lt;/span&gt;, &lt;span class="pl-s"&gt;"wb"&lt;/span&gt;).&lt;span class="pl-en"&gt;write&lt;/span&gt;(&lt;span class="pl-s1"&gt;proc&lt;/span&gt;.&lt;span class="pl-s1"&gt;stdout&lt;/span&gt;)
    &lt;span class="pl-c"&gt;# Now use puppeteer screenshot to generate a PNG&lt;/span&gt;
    &lt;span class="pl-s1"&gt;proc2&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;subprocess&lt;/span&gt;.&lt;span class="pl-en"&gt;run&lt;/span&gt;(
        [
            &lt;span class="pl-s"&gt;"puppeteer"&lt;/span&gt;,
            &lt;span class="pl-s"&gt;"screenshot"&lt;/span&gt;,
            &lt;span class="pl-s1"&gt;page_html&lt;/span&gt;,
            &lt;span class="pl-s"&gt;"--viewport"&lt;/span&gt;,
            &lt;span class="pl-s"&gt;"800x400"&lt;/span&gt;,
            &lt;span class="pl-s"&gt;"--full-page=false"&lt;/span&gt;,
        ],
        &lt;span class="pl-s1"&gt;capture_output&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;,
    )
    &lt;span class="pl-s1"&gt;png_bytes&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;proc2&lt;/span&gt;.&lt;span class="pl-s1"&gt;stdout&lt;/span&gt;
    &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-s1"&gt;png_bytes&lt;/span&gt;&lt;/pre&gt;
&lt;p&gt;This worked great! Except for one thing... the site is hosted on Vercel, and Vercel has a 5MB &lt;a href="https://vercel.com/docs/platform/limits#serverless-function-payload-size-limit"&gt;response size limit&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Every time my GitHub build script runs it downloads the previous SQLite database file, so it can avoid regenerating screenshots and HTML for pages that haven't changed.&lt;/p&gt;
&lt;p&gt;The addition of the binary screenshots drove the size of the SQLite database over 5MB, so the part of my script that retrieved the previous database &lt;a href="https://github.com/simonw/til/issues/25"&gt;no longer worked&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I needed a reliable way to store that 5MB (and probably eventually 10-50MB) database file in between runs of my action.&lt;/p&gt;
&lt;p&gt;The best place to put this would be an S3 bucket, but I find the process of setting up IAM permissions for access to a new bucket so infuriating that I couldn't bring myself to do it.&lt;/p&gt;
&lt;p&gt;So... I created a new dedicated GitHub repository, &lt;a href="https://github.com/simonw/til-db"&gt;simonw/til-db&lt;/a&gt;, and updated my action to store the binary file in that repo - using &lt;a href="https://github.com/simonw/til/blob/1e29c3fe5e90c29b0e71d87dba805484ceb4393c/.github/workflows/build.yml#L80-L86"&gt;a force push&lt;/a&gt; so the repo doesn't need to maintain unnecessary version history of the binary asset.&lt;/p&gt;
&lt;p&gt;This is an abomination of a hack, and it made me cackle a lot. I &lt;a href="https://twitter.com/simonw/status/1301029346614718465"&gt;tweeted about it&lt;/a&gt; and got the suggestion to try &lt;a href="https://git-lfs.github.com/"&gt;Git LFS&lt;/a&gt; instead, which would definitely be a more appropriate way to solve this problem.&lt;/p&gt;
&lt;h4 id="weeknotes-2020-09-03-rendering-markdown"&gt;Rendering Markdown&lt;/h4&gt;
&lt;p&gt;I write my blog entries in Markdown and transform them into HTML before I post them on my blog. Some day I'll teach my blog to render Markdown itself, but so far I've got by through copying and pasting into Markdown tools.&lt;/p&gt;
&lt;p&gt;My favourite Markdown flavour is GitHub's, which adds a bunch of useful capabilities - most notably the ability to apply syntax highlighting. GitHub &lt;a href="https://docs.github.com/en/rest/reference/markdown"&gt;expose an API&lt;/a&gt; that applies their Markdown formatter and returns the resulting HTML.&lt;/p&gt;
&lt;p&gt;I built myself &lt;a href="https://til.simonwillison.net/tools/render-markdown"&gt;a quick and scrappy tool&lt;/a&gt; in JavaScript that sends Markdown through their API and then applies a few DOM manipulations to clean up what comes back. It was a nice opportunity to write some modern vanilla JavaScript using &lt;code&gt;fetch()&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;async&lt;/span&gt; &lt;span class="pl-k"&gt;function&lt;/span&gt; &lt;span class="pl-en"&gt;render&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;markdown&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
    &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-en"&gt;fetch&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'https://api.github.com/markdown'&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
        &lt;span class="pl-c1"&gt;method&lt;/span&gt;: &lt;span class="pl-s"&gt;'POST'&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
        &lt;span class="pl-c1"&gt;headers&lt;/span&gt;: &lt;span class="pl-kos"&gt;{&lt;/span&gt;
            &lt;span class="pl-s"&gt;'Content-Type'&lt;/span&gt;: &lt;span class="pl-s"&gt;'application/json'&lt;/span&gt;
        &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
        &lt;span class="pl-c1"&gt;body&lt;/span&gt;: &lt;span class="pl-c1"&gt;JSON&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;stringify&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s"&gt;'mode'&lt;/span&gt;: &lt;span class="pl-s"&gt;'markdown'&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s"&gt;'text'&lt;/span&gt;: &lt;span class="pl-s1"&gt;markdown&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;
    &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;text&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-kos"&gt;}&lt;/span&gt;

&lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;button&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;getElementsByTagName&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'button'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;[&lt;/span&gt;&lt;span class="pl-c1"&gt;0&lt;/span&gt;&lt;span class="pl-kos"&gt;]&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;output&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;getElementById&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'output'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;preview&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;getElementById&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'preview'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;

&lt;span class="pl-s1"&gt;button&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;addEventListener&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'click'&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-k"&gt;async&lt;/span&gt; &lt;span class="pl-k"&gt;function&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
    &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;rendered&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-en"&gt;render&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;input&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;value&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-s1"&gt;output&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;value&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;rendered&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-s1"&gt;preview&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;innerHTML&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;rendered&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h4 id="weeknotes-2020-09-03-dogsheep-beta"&gt;Dogsheep Beta&lt;/h4&gt;
&lt;p&gt;My most exciting project this week was getting out the first working version of &lt;a href="https://github.com/dogsheep/beta"&gt;Dogsheep Beta&lt;/a&gt; - the search engine that ties together results from my &lt;a href="https://dogsheep.github.io/"&gt;Dogsheep&lt;/a&gt; family of tools for personal analytics.&lt;/p&gt;
&lt;p&gt;I'm giving a talk about this tonight at PyCon Australia: &lt;a href="https://2020.pycon.org.au/program/73uk8x/"&gt;Build your own data warehouse for personal analytics with SQLite and Datasette&lt;/a&gt;. I'll be writing up detailed notes in the next few days, so watch this space.&lt;/p&gt;
&lt;h4 id="weeknotes-2020-09-03-til-this-week"&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/til/til/jq_reformatting-airtable-json.md"&gt;Converting Airtable JSON for use with sqlite-utils using jq&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/til/til/javascript_minifying-uglify-npx.md"&gt;Minifying JavaScript with npx uglify-js&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/til/til/pytest_subprocess-server.md"&gt;Start a server in a subprocess during a pytest session&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/til/til/bash_loop-over-csv.md"&gt;Looping over comma-separated values in Bash&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/til/til/cloudrun_gcloud-run-services-list.md"&gt;Using the gcloud run services list command&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/til/til/python_debug-click-with-pdb.md"&gt;Debugging a Click application using pdb&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="weeknotes-2020-09-03-releases-this-week"&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/dogsheep/dogsheep-beta/releases/tag/0.4.1"&gt;dogsheep-beta 0.4.1&lt;/a&gt; - 2020-09-03&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/dogsheep/dogsheep-beta/releases/tag/0.4"&gt;dogsheep-beta 0.4&lt;/a&gt; - 2020-09-03&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/dogsheep/dogsheep-beta/releases/tag/0.4a1"&gt;dogsheep-beta 0.4a1&lt;/a&gt; - 2020-09-03&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/dogsheep/dogsheep-beta/releases/tag/0.4a0"&gt;dogsheep-beta 0.4a0&lt;/a&gt; - 2020-09-03&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/dogsheep/dogsheep-beta/releases/tag/0.3"&gt;dogsheep-beta 0.3&lt;/a&gt; - 2020-09-02&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/dogsheep/dogsheep-beta/releases/tag/0.2"&gt;dogsheep-beta 0.2&lt;/a&gt; - 2020-09-01&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/dogsheep/dogsheep-beta/releases/tag/0.1"&gt;dogsheep-beta 0.1&lt;/a&gt; - 2020-09-01&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/dogsheep/dogsheep-beta/releases/tag/0.1a2"&gt;dogsheep-beta 0.1a2&lt;/a&gt; - 2020-09-01&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/dogsheep/dogsheep-beta/releases/tag/0.1a"&gt;dogsheep-beta 0.1a&lt;/a&gt; - 2020-09-01&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/airtable-export/releases/tag/0.4"&gt;airtable-export 0.4&lt;/a&gt; - 2020-08-30&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-yaml/releases/tag/0.1a"&gt;datasette-yaml 0.1a&lt;/a&gt; - 2020-08-29&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/airtable-export/releases/tag/0.3.1"&gt;airtable-export 0.3.1&lt;/a&gt; - 2020-08-29&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/airtable-export/releases/tag/0.3"&gt;airtable-export 0.3&lt;/a&gt; - 2020-08-29&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/airtable-export/releases/tag/0.2"&gt;airtable-export 0.2&lt;/a&gt; - 2020-08-29&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/airtable-export/releases/tag/0.1.1"&gt;airtable-export 0.1.1&lt;/a&gt; - 2020-08-29&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/airtable-export/releases/tag/0.1"&gt;airtable-export 0.1&lt;/a&gt; - 2020-08-29&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette/releases/tag/0.49a0"&gt;datasette 0.49a0&lt;/a&gt; - 2020-08-28&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/sqlite-utils/releases/tag/2.16.1"&gt;sqlite-utils 2.16.1&lt;/a&gt; - 2020-08-28&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/yaml"&gt;yaml&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/markdown"&gt;markdown&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dogsheep"&gt;dogsheep&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/airtable"&gt;airtable&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/puppeteer"&gt;puppeteer&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="projects"/><category term="yaml"/><category term="markdown"/><category term="dogsheep"/><category term="weeknotes"/><category term="github-actions"/><category term="airtable"/><category term="puppeteer"/></entry><entry><title>airtable-export</title><link href="https://simonwillison.net/2020/Aug/29/airtable-export/#atom-tag" rel="alternate"/><published>2020-08-29T21:48:37+00:00</published><updated>2020-08-29T21:48:37+00:00</updated><id>https://simonwillison.net/2020/Aug/29/airtable-export/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/airtable-export"&gt;airtable-export&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I wrote a command-line utility for exporting data from Airtable and dumping it to disk as YAML, JSON or newline-delimited JSON files. This means you can back up an Airtable database from a GitHub Action and get a commit history of changes made to your data.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/json"&gt;json&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/yaml"&gt;yaml&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/airtable"&gt;airtable&lt;/a&gt;&lt;/p&gt;



</summary><category term="json"/><category term="projects"/><category term="yaml"/><category term="airtable"/></entry><entry><title>Goodbye Zeit Now v1, hello datasette-publish-now - and talking to myself in GitHub issues</title><link href="https://simonwillison.net/2020/Apr/8/weeknotes-zeit-now-v2/#atom-tag" rel="alternate"/><published>2020-04-08T03:32:24+00:00</published><updated>2020-04-08T03:32:24+00:00</updated><id>https://simonwillison.net/2020/Apr/8/weeknotes-zeit-now-v2/#atom-tag</id><summary type="html">
    &lt;p&gt;This week I’ve been mostly dealing with the finally announced shutdown of Zeit Now v1. And having long-winded conversations with myself in GitHub issues.&lt;/p&gt;

&lt;h3&gt;How Zeit Now inspired Datasette&lt;/h3&gt;

&lt;p&gt;I first started experimenting with Zeit’s serverless &lt;a href="https://zeit.co/home"&gt;Now&lt;/a&gt; hosting platform back &lt;a href="https://simonwillison.net/2017/Oct/14/async-python-sanic-now/"&gt;in October 2017&lt;/a&gt;, when I used it to deploy &lt;a href="https://json-head.now.sh/"&gt;json-head.now.sh&lt;/a&gt; - an updated version of an API tool I originally built for Google App Engine &lt;a href="https://simonwillison.net/2008/Jul/29/jsonhead/"&gt;in July 2008&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I liked Zeit Now, a lot. Instant, inexpensive deploys of any stateless project that could be defined using a Dockerfile? Just type &lt;code&gt;now&lt;/code&gt; to deploy the project in your current directory? Every deployment gets its own permanent URL? Amazing!&lt;/p&gt;

&lt;p&gt;There was just one catch: since Now deployments are ephemeral, applications running on them need to be stateless. If you want a database, you need to involve another (potentially costly) service. It's a limitation shared by other scalable hosting solutions - Heroku, App Engine and so on. How much interesting stuff can you build without a database?&lt;/p&gt;

&lt;p&gt;I was musing about this in the shower one day (that &lt;a href="https://lifehacker.com/science-explains-why-our-best-ideas-come-in-the-shower-5987858"&gt;old cliche&lt;/a&gt; really happened for me) when I had a thought: sure, you can't write to a database... but if your data is read-only, why not bundle the database alongside the application code as part of the Docker image?&lt;/p&gt;
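The shower thought above later became Datasette's "baked data" pattern. A hypothetical minimal Dockerfile sketch of the idea (the `fixtures.db` filename and base image are stand-ins, not taken from any real project):

```dockerfile
# Hypothetical sketch: bake a read-only SQLite file into the image, so the
# running container stays completely stateless.
FROM python:3.6-slim
RUN pip install datasette
# The database ships inside the image alongside the application code
COPY fixtures.db /app/fixtures.db
WORKDIR /app
EXPOSE 8001
CMD ["datasette", "fixtures.db", "--host", "0.0.0.0", "--port", "8001"]
```

Redeploying with a fresh copy of the data is then just a matter of rebuilding and re-shipping the image.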

&lt;p&gt;Ever since I &lt;a href="https://simonwillison.net/2009/Mar/10/openplatform/"&gt;helped launch the Datablog&lt;/a&gt; at the Guardian back in 2009 I had been interested in finding better ways to publish data journalism datasets than CSV files or Google spreadsheets - so building something that could package and bundle read-only data was of extreme interest to me.&lt;/p&gt;

&lt;p&gt;In November 2017 I released &lt;a href="https://simonwillison.net/2017/Nov/13/datasette/"&gt;the first version&lt;/a&gt; of Datasette. The original idea was very much inspired by Zeit Now.&lt;/p&gt;

&lt;p&gt;I gave &lt;a href="https://www.youtube.com/watch?v=_uwrqB--eM4"&gt;a talk about Datasette&lt;/a&gt; at the Zeit Day conference in San Francisco in April 2018. Suffice to say I was a huge fan!&lt;/p&gt;

&lt;h3&gt;Goodbye, Zeit Now v1&lt;/h3&gt;

&lt;p&gt;In November 2018, Zeit &lt;a href="https://simonwillison.net/2018/Nov/19/smaller-python-docker-images/"&gt;announced Now v2&lt;/a&gt;. And it was, &lt;em&gt;different&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;v2 is an entirely different architecture from v1. Where v1 built on Docker containers, v2 is built on top of serverless functions - AWS Lambda in particular.&lt;/p&gt;

&lt;p&gt;I can see why Zeit did this. Lambda functions can launch from cold &lt;em&gt;way faster&lt;/em&gt; - v1's Docker infrastructure had tough cold-start times. They are much cheaper to run as well - crucial for Zeit given their &lt;a href="https://zeit.co/pricing"&gt;extremely generous pricing plans&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;But it was bad news for my projects. Lambdas are tightly size constrained, which is tough when you're bundling potentially large SQLite database files with your deployments.&lt;/p&gt;

&lt;p&gt;More importantly, in 2018 Amazon were deliberately excluding the Python &lt;code&gt;sqlite3&lt;/code&gt; standard library module from the Python Lambda environment! I guess they hadn't considered people who might want to work with read-only database files.&lt;/p&gt;

&lt;p&gt;So Datasette on Now v2 just wasn't going to work. Zeit kept v1 supported for the time being, but the writing was clearly on the wall.&lt;/p&gt;

&lt;p&gt;In April 2019 &lt;a href="https://cloud.google.com/blog/products/serverless/announcing-cloud-run-the-newest-member-of-our-serverless-compute-stack"&gt;Google announced Cloud Run&lt;/a&gt;, a serverless, scale-to-zero hosting environment based around Docker containers. In many ways it's Google's version of Zeit Now v1 - it has many of the characteristics I loved about v1, albeit with a clunkier developer experience and much more friction in assigning nice URLs to projects. Romain Primet &lt;a href="https://github.com/simonw/datasette/pull/434"&gt;contributed Cloud Run support to Datasette&lt;/a&gt; and it has since become my preferred hosting target for my new projects (see &lt;a href="https://simonwillison.net/2020/Jan/21/github-actions-cloud-run/"&gt;Deploying a data API using GitHub Actions and Cloud Run&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Last week, Zeit &lt;a href="https://twitter.com/simonw/status/1246300304917680128"&gt;finally announced&lt;/a&gt; the sunset date for v1. From the 1st of May new deploys won't be allowed, and on the 7th of August they'll be turning off the old v1 infrastructure and deleting all existing Now v1 deployments.&lt;/p&gt;

&lt;p&gt;I engaged in &lt;a href="https://twitter.com/simonw/status/1246300304917680128"&gt;an extensive Twitter conversation&lt;/a&gt; about this, where I praised Zeit's handling of the shutdown while bemoaning the loss of the v1 product I had loved so much.&lt;/p&gt;

&lt;h3 id="migrating-my-projects"&gt;Migrating my projects&lt;/h3&gt;

&lt;p&gt;My newer projects have been on Cloud Run for quite some time, but I still have a bunch of old projects that I care about and want to keep running past the v1 shutdown.&lt;/p&gt;

&lt;p&gt;The first project I ported was &lt;a href="https://latest.datasette.io/"&gt;latest.datasette.io&lt;/a&gt;, a live demo of Datasette which updates with the latest code any time I push to the Datasette master branch on GitHub.&lt;/p&gt;

&lt;p&gt;Any time I do some kind of ops task like this I've gotten into the habit of meticulously documenting every single step in comments on a GitHub issue. Here's &lt;a href="https://github.com/simonw/datasette/issues/705"&gt;the issue&lt;/a&gt; for porting latest.datasette.io to Cloud Run (and switching from Circle CI to GitHub Actions at the same time).&lt;/p&gt;

&lt;p&gt;My next project was &lt;a href="https://global-power-plants.datasettes.com/global-power-plants/global-power-plants"&gt;global-power-plants-datasette&lt;/a&gt;, a small project which takes a database of global power plants &lt;a href="https://www.wri.org/publication/global-power-plant-database"&gt;published by the World Resources Institute&lt;/a&gt; and publishes it using Datasette. It checks for new updates to &lt;a href="https://github.com/wri/global-power-plant-database"&gt;their repo&lt;/a&gt; once a day. I originally built it as a demo for &lt;a href="https://github.com/simonw/datasette-cluster-map"&gt;datasette-cluster-map&lt;/a&gt;, since it's fun seeing 33,000 power plants on a single map. Here's &lt;a href="https://github.com/simonw/global-power-plants-datasette/issues/1"&gt;that issue&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Having warmed up with these two, my next target was the most significant: porting my &lt;a href="https://www.niche-museums.com/"&gt;Niche Museums&lt;/a&gt; website.&lt;/p&gt;

&lt;p&gt;Niche Museums is the most heavily customized Datasette instance I've run anywhere - it incorporates custom templates, CSS and plugins.&lt;/p&gt;

&lt;p&gt;Here's &lt;a href="https://github.com/simonw/museums/issues/20"&gt;the tracking issue&lt;/a&gt; for porting it to Cloud Run. I ran into a few hurdles with DNS and TLS certificates, and I had to do &lt;a href="https://github.com/simonw/museums/issues/21"&gt;some additional work&lt;/a&gt; to ensure &lt;code&gt;niche-museums.com&lt;/code&gt; redirects to &lt;code&gt;www.niche-musums.com&lt;/code&gt;, but it's now fully migrated.&lt;/p&gt;

&lt;h3 id="hello-zeit-now-v2"&gt;Hello, Zeit Now v2&lt;/h3&gt;

&lt;p&gt;In &lt;a href="https://twitter.com/simonw/status/1246302021608591360"&gt;complaining about&lt;/a&gt; the lack of that essential &lt;code&gt;sqlite3&lt;/code&gt; module I figured it would be responsible to double-check and make sure that was still true.&lt;/p&gt;

&lt;p&gt;It was not! Today Now's Python environment &lt;a href="https://twitter.com/simonw/status/1246600935289184256"&gt;includes sqlite3&lt;/a&gt; after all.&lt;/p&gt;

&lt;p&gt;Datasette's &lt;a href="https://datasette.readthedocs.io/en/0.39/plugins.html#publish-subcommand-publish"&gt;publish_subcommand() plugin hook&lt;/a&gt; lets plugins add new publishing targets to the &lt;code&gt;datasette publish&lt;/code&gt; command (I used it to build &lt;a href="https://github.com/simonw/datasette-publish-fly"&gt;datasette-publish-fly&lt;/a&gt; last month). How hard would it be to build a plugin for Zeit Now v2?&lt;/p&gt;

&lt;p&gt;I fired up a new &lt;a href="https://github.com/simonw/datasette/issues/717"&gt;lengthy talking-to-myself GitHub issue&lt;/a&gt; and started prototyping.&lt;/p&gt;

&lt;p&gt;Now v2 may not support Docker, but it does support the &lt;a href="https://asgi.readthedocs.io/en/latest/"&gt;ASGI Python standard&lt;/a&gt; (the asynchronous alternative to WSGI, shepherded by Andrew Godwin).&lt;/p&gt;

&lt;p&gt;Zeit are keen proponents of the &lt;a href="https://jamstack.org/"&gt;Jamstack&lt;/a&gt; approach, where websites are built using static pre-rendered HTML and JavaScript that calls out to APIs for dynamic data. v2 deployments are expected to consist of static HTML with "serverless functions" - standalone server-side scripts that live in an &lt;code&gt;api/&lt;/code&gt; directory by convention and are compiled into separate lambdas.&lt;/p&gt;

&lt;p&gt;Datasette works just fine without JavaScript, which means it needs to handle all of the URL routes for a site. Essentially I need to build a single function that runs the whole of Datasette, then route all incoming traffic to it.&lt;/p&gt;

&lt;p&gt;It took me a while to figure it out, but it turns out the Now v2 recipe for that is a &lt;code&gt;now.json&lt;/code&gt; file that looks like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;{
    "version": 2,
    "builds": [
        {
            "src": "index.py",
            "use": "@now/python"
        }
    ],
    "routes": [
        {
            "src": "(.*)",
            "dest": "index.py"
        }
    ]
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Thanks Aaron Boodman for &lt;a href="https://twitter.com/aboodman/status/1246605658067066882"&gt;the tip&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Given the above configuration, Zeit will install any Python dependencies in a &lt;code&gt;requirements.txt&lt;/code&gt; file, then treat an &lt;code&gt;app&lt;/code&gt; variable in the &lt;code&gt;index.py&lt;/code&gt; file as an ASGI application it should route all incoming traffic to. Exactly what I need to deploy Datasette!&lt;/p&gt;
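So an `index.py` only has to expose a module-level ASGI callable named `app`. Here's a minimal stand-in sketch - plain ASGI rather than Datasette itself, just to show the shape of the contract Now expects (a real deployment would assign `app` from Datasette's ASGI interface instead):

```python
# index.py -- minimal ASGI application. Now v2's @now/python builder looks
# for a module-level variable named `app` and routes all traffic to it.
# (Stand-in sketch: a real Datasette deployment would build `app` from
# Datasette's ASGI interface rather than defining it by hand.)

async def app(scope, receive, send):
    # Only HTTP requests are expected from the router configured above
    assert scope["type"] == "http"
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [(b"content-type", b"text/plain")],
    })
    await send({"type": "http.response.body", "body": b"Hello from ASGI"})
```

Because the `routes` section sends `(.*)` to `index.py`, this one callable sees every URL on the site.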

&lt;p&gt;This was everything I needed to build the new plugin. &lt;a href="https://github.com/simonw/datasette-publish-now"&gt;datasette-publish-now&lt;/a&gt; is the result.&lt;/p&gt;

&lt;p&gt;Here's &lt;a href="https://datasette-public.now.sh/_src"&gt;the generated source code&lt;/a&gt; for a project deployed using the plugin, showing how the underlyinng ASGI application is configured.&lt;/p&gt;

&lt;p&gt;It's currently an alpha - not every feature is supported (see &lt;a href="https://github.com/simonw/datasette-publish-now/milestone/1"&gt;this milestone&lt;/a&gt;) and it relies on a minor deprecated feature (which I've &lt;a href="https://github.com/zeit/now/discussions/4021"&gt;implored Zeit to reconsider&lt;/a&gt;) but it's already full-featured enough that I can start using it to upgrade some of my smaller existing Now projects.&lt;/p&gt;

&lt;p&gt;The first one I upgraded is one of my favourites: &lt;a href="https://polar-bears.now.sh/"&gt;polar-bears.now.sh&lt;/a&gt;, which visualizes tracking data from polar bear ear tags (using &lt;a href="https://github.com/simonw/datasette-cluster-map"&gt;datasette-cluster-map&lt;/a&gt;) that was &lt;a href="https://alaska.usgs.gov/products/data.php?dataid=130"&gt;published by the USGS Alaska Science Center, Polar Bear Research Program&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here's the command I used to deploy the site:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ pip install datasette-publish-now
$ datasette publish now2 polar-bears.db \
    --title "Polar Bear Ear Tags, 2009-2011" \
    --source "USGS Alaska Science Center, Polar Bear Research Program" \
    --source_url "https://alaska.usgs.gov/products/data.php?dataid=130" \
    --install datasette-cluster-map \
    --project=polar-bears&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;I exported a full list of my Now v1 projects from their handy &lt;a href="https://zeit.co/dashboard/active-v1-instances"&gt;active v1 instances&lt;/a&gt; page.&lt;/p&gt;

&lt;h3&gt;The rest of my projects&lt;/h3&gt;

&lt;p&gt;I scraped the page using the following JavaScript, constructed with the help of the &lt;a href="https://simonwillison.net/2020/Apr/7/new-developer-features-firefox-75/"&gt;instant evaluation&lt;/a&gt; console feature in Firefox 75:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;console.log(
  JSON.stringify(
    Array.from(
      Array.from(
        document.getElementsByTagName("table")[1].
          getElementsByTagName("tr")
      ).slice(1).map(
        (tr) =&amp;gt;
          Array.from(
            tr.getElementsByTagName("td")
        ).map((td) =&amp;gt; td.innerText)
      )
    )
  )
);&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Then I loaded them into Datasette for analysis.&lt;/p&gt;
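The loading step itself is straightforward. Here's a stdlib-only sketch of it - the column names and sample rows are illustrative, not the actual Zeit dashboard headers, and a tool like sqlite-utils would do the same job in a single command:

```python
import json
import sqlite3

# Rows as scraped: a JSON array of arrays of cell text (illustrative data,
# with invented column values standing in for the real dashboard table).
rows = json.loads('[["polar-bears", "now-v1"], ["json-head", "now-v1"]]')

conn = sqlite3.connect(":memory:")  # use "projects.db" to keep a file on disk
conn.execute("create table projects (name text, platform text)")
conn.executemany("insert into projects values (?, ?)", rows)
conn.commit()
print(conn.execute("select count(*) from projects").fetchone()[0])
```

Once the rows are in SQLite, pointing Datasette at the file gives you filtering and faceting for free.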

&lt;p&gt;After filtering out the &lt;code&gt;datasette-latest-commithash.now.sh&lt;/code&gt; projects I had deployed for every push to GitHub it turns out I have 34 distinct projects running there.&lt;/p&gt;

&lt;p&gt;I won't port all of them, but given &lt;code&gt;datasette-publish-now&lt;/code&gt; I should be able to port the ones that I care about without too much trouble.&lt;/p&gt;

&lt;h3 id="git-bisect"&gt;Debugging Datasette with git bisect run&lt;/h3&gt;

&lt;p&gt;I fixed two bugs in Datasette this week using &lt;code&gt;git bisect run&lt;/code&gt; - a tool I've been meaning to figure out for years, which lets you run an automated binary search against a commit log to find the source of a bug.&lt;/p&gt;

&lt;p&gt;Since I was figuring out a new tool, I fired up another GitHub issue self-conversation: in &lt;a href="https://github.com/simonw/datasette/issues/716"&gt;issue #716&lt;/a&gt; I document my process of both learning to use &lt;code&gt;git bisect run&lt;/code&gt; and using it to find a solution to that particular bug.&lt;/p&gt;

&lt;p&gt;It worked great, so I used the same trick on &lt;a href="https://github.com/simonw/datasette/issues/689"&gt;issue 689&lt;/a&gt; as well.&lt;/p&gt;

&lt;p&gt;Watching &lt;code&gt;git bisect run&lt;/code&gt; churn through 32 revisions in a few seconds and pinpoint the exact moment a bug was introduced is pretty delightful:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ git bisect start master 0.34
Bisecting: 32 revisions left to test after this (roughly 5 steps)
[dc80e779a2e708b2685fc641df99e6aae9ad6f97] Handle scope path if it is a string
$ git bisect run python check_templates_considered.py
running python check_templates_considered.py
Traceback (most recent call last):
...
AssertionError
Bisecting: 15 revisions left to test after this (roughly 4 steps)
[7c6a9c35299f251f9abfb03fd8e85143e4361709] Better tests for prepare_connection() plugin hook, refs #678
running python check_templates_considered.py
Traceback (most recent call last):
...
AssertionError
Bisecting: 7 revisions left to test after this (roughly 3 steps)
[0091dfe3e5a3db94af8881038d3f1b8312bb857d] More reliable tie-break ordering for facet results
running python check_templates_considered.py
Traceback (most recent call last):
...
AssertionError
Bisecting: 3 revisions left to test after this (roughly 2 steps)
[ce12244037b60ba0202c814871218c1dab38d729] Release notes for 0.35
running python check_templates_considered.py
Traceback (most recent call last):
...
AssertionError
Bisecting: 1 revision left to test after this (roughly 1 step)
[70b915fb4bc214f9d064179f87671f8a378aa127] Datasette.render_template() method, closes #577
running python check_templates_considered.py
Traceback (most recent call last):
...
AssertionError
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[286ed286b68793532c2a38436a08343b45cfbc91] geojson-to-sqlite
running python check_templates_considered.py
70b915fb4bc214f9d064179f87671f8a378aa127 is the first bad commit
commit 70b915fb4bc214f9d064179f87671f8a378aa127
Author: Simon Willison
Date:   Tue Feb 4 12:26:17 2020 -0800

    Datasette.render_template() method, closes #577

    Pull request #664.

:040000 040000 def9e31252e056845609de36c66d4320dd0c47f8 da19b7f8c26d50a4c05e5a7f05220b968429725c M	datasette
bisect run success&lt;/code&gt;&lt;/pre&gt;
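The transcript above relies on `git bisect run`'s exit-status contract: the script exits 0 to mark a revision good, 1-127 (except 125, which means "skip this revision") to mark it bad. A hypothetical check script following that contract - not the actual `check_templates_considered.py` - looks like this:

```python
# Hypothetical bisect check script. git bisect run reads the exit status:
# 0 = this revision is good, 1-127 (except 125) = bad, 125 = cannot test.

def behaviour_is_correct():
    # Stand-in probe; a real script would import the project at the
    # checked-out revision, exercise the code path that regressed, and
    # return False when the bug is present.
    return True

status = 0 if behaviour_is_correct() else 1
print("exit status:", status)
# A real script would finish with: raise SystemExit(status)
```

`git bisect run` checks out each candidate revision and runs the script; the first revision where the status flips from 0 to non-zero is reported as the first bad commit.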

&lt;h3&gt;Supporting metadata.yaml&lt;/h3&gt;

&lt;p&gt;The other Datasette project I completed this week is a relatively small feature with hopefully a big impact: you can &lt;a href="https://github.com/simonw/datasette/issues/713"&gt;now use YAML for Datasette's metadata configuration&lt;/a&gt; as an alternative to JSON.&lt;/p&gt;

&lt;p&gt;I'm not crazy about YAML: I still don't feel like I've mastered it, and I've been &lt;a href="https://simonwillison.net/tags/yaml/"&gt;tracking it for 18 years&lt;/a&gt;! But it has one big advantage over JSON for configuration files: robust support for multi-line strings.&lt;/p&gt;

&lt;p&gt;Datasette's &lt;a href="https://datasette.readthedocs.io/en/latest/metadata.html"&gt;metadata file&lt;/a&gt; can include lengthy SQL statements and strings of HTML, both of which benefit from multi-line strings.&lt;/p&gt;
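For example, a canned SQL query reads far better as a YAML block scalar than as a JSON string full of `\n` escapes. An illustrative `metadata.yaml` fragment - the database and query names here are invented, not from a real project:

```yaml
title: Example Datasette instance
databases:
  ads:
    queries:
      search_ads:
        sql: |-
          select id, caption, url
          from display_ads
          where caption like '%' || :search || '%'
          order by id
```

The same structure in JSON would need the whole SQL statement collapsed onto one escaped line.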

&lt;p&gt;I first used YAML for metadata for my &lt;a href="https://simonwillison.net/2018/Aug/6/russian-facebook-ads/"&gt;Analyzing US Election Russian Facebook Ads&lt;/a&gt; project. The &lt;a href="https://github.com/simonw/russian-ira-facebook-ads-datasette/blob/336ba87ef8071e664441ad0a95e3b8d0a33f682a/russian-ads-metadata.yaml"&gt;metadata file for that&lt;/a&gt; demonstrates both embedded HTML and embedded SQL - and an accompanying &lt;a href="https://github.com/simonw/russian-ira-facebook-ads-datasette/blob/336ba87ef8071e664441ad0a95e3b8d0a33f682a/build_metadata.py"&gt;build_metadata.py&lt;/a&gt; script converted it to JSON at build time. I've since used the same trick for a number of other projects.&lt;/p&gt;

&lt;p&gt;The next release of Datasette (hopefully within a week) will ship the new feature, at which point those conversion scripts won't be necessary.&lt;/p&gt;

&lt;p&gt;This should work particularly well with the forthcoming &lt;a href="https://github.com/simonw/datasette/issues/698"&gt;ability for a canned query to write to a database&lt;/a&gt;. Getting that wrapped up and shipped will be my focus for the next few days.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/yaml"&gt;yaml&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zeit-now"&gt;zeit-now&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-issues"&gt;github-issues&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="git"/><category term="github"/><category term="projects"/><category term="yaml"/><category term="zeit-now"/><category term="datasette"/><category term="weeknotes"/><category term="github-issues"/></entry><entry><title>niche-museums.com, powered by Datasette</title><link href="https://simonwillison.net/2019/Nov/25/niche-museums/#atom-tag" rel="alternate"/><published>2019-11-25T22:27:46+00:00</published><updated>2019-11-25T22:27:46+00:00</updated><id>https://simonwillison.net/2019/Nov/25/niche-museums/#atom-tag</id><summary type="html">
    &lt;p&gt;I just released a major upgrade to my &lt;a href="https://www.niche-museums.com/"&gt;www.niche-museums.com&lt;/a&gt; website (launched &lt;a href="https://simonwillison.net/2019/Oct/28/niche-museums-kepler/"&gt;last month&lt;/a&gt;).&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The site is now rendered server-side. The previous version used &lt;a href="https://lit-html.polymer-project.org/"&gt;lit-html&lt;/a&gt; to render content using JavaScript.&lt;/li&gt;
&lt;li&gt;Each museum now has its own page. Here&amp;#39;s today&amp;#39;s new museum listing for the &lt;a href="https://www.niche-museums.com/browse/museums/46"&gt;Conservatory of Flowers&lt;/a&gt; in San Francisco. These pages have a map on them.&lt;/li&gt;
&lt;li&gt;The site has an &lt;a href="https://www.niche-museums.com/about"&gt;about page&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;You can now link to the page for a specific latitude and longitude, e.g. &lt;a href="https://www.niche-museums.com/?latitude=37.77&amp;amp;longitude=-122.458"&gt;this location in Golden Gate Park&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;The source code for the site is now &lt;a href="https://github.com/simonw/museums"&gt;available on GitHub&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Notably, the site is entirely powered by &lt;a href="https://github.com/simonw/datasette"&gt;Datasette&lt;/a&gt;. It&amp;#39;s a heavily customized Datasette instance, making extensive use of &lt;a href="https://datasette.readthedocs.io/en/0.32/custom_templates.html#custom-templates"&gt;custom templates&lt;/a&gt; and &lt;a href="https://datasette.readthedocs.io/en/0.32/plugins.html"&gt;plugins&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It&amp;#39;s a really fun experiment. I&amp;#39;m essentially using Datasette as a weird twist on a static site generator - no moving parts since the database is immutable but there&amp;#39;s still stuff happening server-side to render the pages.&lt;/p&gt;
&lt;h3 id="continuous-deployment"&gt;Continuous deployment&lt;/h3&gt;
&lt;p&gt;The site is entirely stateless and is published &lt;a href="https://circleci.com/gh/simonw/museums"&gt;using Circle CI&lt;/a&gt; to a serverless hosting provider (currently Zeit Now v1, but I&amp;#39;ll probably move it to Google Cloud Run in the near future.)&lt;/p&gt;
&lt;p&gt;The site content - 46 museums and counting - lives in the &lt;a href="https://github.com/simonw/museums/blob/master/museums.yaml"&gt;museums.yaml&lt;/a&gt; file. I&amp;#39;ve been adding a new museum listing every day by editing the YAML file using &lt;a href="https://workingcopyapp.com/"&gt;Working Copy&lt;/a&gt; on my iPhone.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/simonw/museums/blob/master/.circleci/config.yml"&gt;build script&lt;/a&gt; runs automatically on every commit. It converts the YAML file into a SQLite database using my &lt;a href="https://github.com/simonw/yaml-to-sqlite"&gt;yaml-to-sqlite&lt;/a&gt; tool, then runs &lt;code&gt;datasette publish now...&lt;/code&gt; to deploy the resulting database.&lt;/p&gt;
&lt;p&gt;The full deployment command is as follows:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;datasette publish now browse.db about.db \
    --token=$NOW_TOKEN \
    --alias=www.niche-museums.com \
    --name=niche-museums \
    --install=datasette-haversine \
    --install=datasette-pretty-json \
    --install=datasette-template-sql \
    --install=datasette-json-html \
    --install=datasette-cluster-map~=0.8 \
    --metadata=metadata.json \
    --template-dir=templates \
    --plugins-dir=plugins \
    --branch=master
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;There&amp;#39;s a lot going on here.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;browse.db&lt;/code&gt; is the SQLite database file that was built by running &lt;code&gt;yaml-to-sqlite&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;about.db&lt;/code&gt; is an empty database built using &lt;code&gt;sqlite3 about.db &amp;#39;&amp;#39;&lt;/code&gt; - more on this later.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;--alias=&lt;/code&gt; option tells Zeit Now to alias that URL to the resulting deployment. This is the single biggest feature that I&amp;#39;m missing from Google Cloud Run at the moment. It&amp;#39;s possible to point domains at deployments there but it&amp;#39;s not nearly as easy to script.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;--install=&lt;/code&gt; options tell &lt;code&gt;datasette publish&lt;/code&gt; which plugins should be installed on the resulting instance.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;--metadata=&lt;/code&gt;, &lt;code&gt;--template-dir=&lt;/code&gt; and &lt;code&gt;--plugins-dir=&lt;/code&gt; are the options that customize the instance.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;--branch=master&lt;/code&gt; means we always deploy the latest master of Datasette directly from GitHub, ignoring the most recent release to PyPI. This isn&amp;#39;t strictly necessary here.&lt;/p&gt;
&lt;h3 id="customization"&gt;Customization&lt;/h3&gt;
&lt;p&gt;The site itself is built almost entirely using Datasette custom templates. I have four of them:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/museums/blob/c81e8ec9f39d87f13481608832c94b8e824fd347/templates/index.html"&gt;index.html&lt;/a&gt; is the template used for the homepage, and for the page you see when you search for museums near a specific latitude and longitude.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/museums/blob/c81e8ec9f39d87f13481608832c94b8e824fd347/templates/row-browse-museums.html"&gt;row-browse-museums.html&lt;/a&gt; is the template used for the &lt;a href="https://www.niche-museums.com/browse/museums/43"&gt;individual museum pages&lt;/a&gt;. It includes the JavaScript used for the map (which is powered by &lt;a href="https://leafletjs.com/"&gt;Leaflet&lt;/a&gt; and uses &lt;a href="https://foundation.wikimedia.org/wiki/Maps_Terms_of_Use"&gt;Wikimedia&amp;#39;s OpenStreetMap tiles&lt;/a&gt;, which I discovered thanks to &lt;a href="https://observablehq.com/@tmcw/leaflet"&gt;this Observable notebook&lt;/a&gt; by Tom MacWright).&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/museums/blob/c81e8ec9f39d87f13481608832c94b8e824fd347/templates/_museum_card.html"&gt;_museum_card.html&lt;/a&gt; is an included template rendering a card for a museum, shared by the index and museum pages.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/museums/blob/c81e8ec9f39d87f13481608832c94b8e824fd347/templates/database-about.html"&gt;database-about.html&lt;/a&gt; is the template for &lt;a href="https://www.niche-museums.com/about"&gt;the about page&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The about page uses a particularly devious hack.&lt;/p&gt;
&lt;p&gt;Datasette doesn&amp;#39;t have an easy way to create additional custom pages with URLs at the moment (without abusing the &lt;a href="https://datasette.readthedocs.io/en/stable/plugins.html#asgi-wrapper-datasette"&gt;asgi_wrapper()&lt;/a&gt; hook, which is pretty low-level).&lt;/p&gt;
&lt;p&gt;But... every attached database gets its own URL at &lt;code&gt;/database-name&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;So, to create the &lt;code&gt;/about&lt;/code&gt; page I create an empty database called &lt;code&gt;about.db&lt;/code&gt; using the &lt;code&gt;sqlite3 about.db &amp;quot;&amp;quot;&lt;/code&gt; command. I serve that using Datasette, then create a custom template for that specific database using Datasette&amp;#39;s &lt;a href="https://datasette.readthedocs.io/en/0.32/custom_templates.html#custom-templates"&gt;template naming conventions&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I&amp;#39;ll probably come up with a less grotesque way of doing this and bake it into Datasette in the future. For the moment this seems to work pretty well.&lt;/p&gt;
&lt;h3 id="plugins"&gt;Plugins&lt;/h3&gt;
&lt;p&gt;The two key plugins here are &lt;code&gt;datasette-haversine&lt;/code&gt; and &lt;code&gt;datasette-template-sql&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/datasette-haversine"&gt;datasette-haversine&lt;/a&gt; adds a custom SQL function to Datasette called &lt;code&gt;haversine()&lt;/code&gt;, which calculates the haversine distance between two latitude/longitude points.&lt;/p&gt;
&lt;p&gt;It&amp;#39;s used by the SQL query which finds the nearest museums to the user.&lt;/p&gt;
&lt;p&gt;This is very inefficient - it&amp;#39;s essentially a brute-force approach which calculates that distance for every museum in the database and sorts them accordingly - but it will be years before I have enough museums listed for that to cause any kind of performance issue.&lt;/p&gt;
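That brute-force query is easy to sketch outside Datasette by registering a Python function with SQLite. The four-argument `haversine(lat1, lon1, lat2, lon2)` signature is an assumption mirroring the plugin, and the schema and coordinates are illustrative:

```python
import math
import sqlite3

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 6371 * 2 * math.asin(math.sqrt(a))

conn = sqlite3.connect(":memory:")
conn.create_function("haversine", 4, haversine)
conn.execute("create table museums (name text, latitude real, longitude real)")
conn.executemany(
    "insert into museums values (?, ?, ?)",
    [
        ("Cable Car Museum", 37.7947, -122.4117),
        ("Audium", 37.7885, -122.4239),
        ("Recoleta Cemetery", -34.5875, -58.3933),
    ],
)
# Brute force: compute the distance for every row, then sort by it
nearest = conn.execute(
    """
    select name, haversine(latitude, longitude, :lat, :lon) as distance_km
    from museums order by distance_km limit 2
    """,
    {"lat": 37.77, "lon": -122.458},
).fetchall()
print(nearest)
```

With no index to help, SQLite evaluates the function once per row - fine for dozens of museums, worth revisiting at tens of thousands.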
&lt;p&gt;&lt;a href="https://github.com/simonw/datasette-template-sql"&gt;datasette-template-sql&lt;/a&gt; is the new plugin I &lt;a href="https://simonwillison.net/2019/Nov/18/datasette-template-sql/"&gt;described last week&lt;/a&gt;, made possible by Datasette dropping Python 3.5 support. It allows SQL queries to be executed directly from templates. I&amp;#39;m using it here to &lt;a href="https://github.com/simonw/museums/blob/c81e8ec9f39d87f13481608832c94b8e824fd347/templates/index.html#L58-L69"&gt;run the queries&lt;/a&gt; that power homepage.&lt;/p&gt;
&lt;p&gt;I tried to get the site working just using code in the templates, but it got pretty messy. Instead, I took advantage of Datasette&amp;#39;s &lt;code&gt;--plugins-dir&lt;/code&gt; option, which causes Datasette to treat all Python modules in a specific directory as plugins and attempt to load them.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/museums/blob/c81e8ec9f39d87f13481608832c94b8e824fd347/plugins/index_vars.py"&gt;index_vars.py&lt;/a&gt; is a single custom plugin that I&amp;#39;m bundling with the site. It uses the &lt;a href="https://datasette.readthedocs.io/en/0.32/plugins.html#extra-template-vars-template-database-table-view-name-request-datasette"&gt;extra_template_vars()&lt;/a&gt; plugin took to detect requests to the &lt;code&gt;index&lt;/code&gt; page and inject some additional custom template variables based on values read from the querystring.&lt;/p&gt;
&lt;p&gt;This ends up acting a little bit like a custom Django view function. It&amp;#39;s a slightly weird pattern but again it does the job - and helps me further explore the potential of Datasette as a tool for powering websites in addition to just providing an API.&lt;/p&gt;
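The shape of that hook can be sketched as a plain function. In a real plugin it would be decorated with Datasette's `@hookimpl` and receive Datasette's actual request object; the dict-based request and the variable names below are stand-ins, not taken from index_vars.py:

```python
# Sketch of an extra_template_vars()-style hook as a plain function.
# A real plugin registers this with Datasette's @hookimpl decorator and
# gets a real request object; both are simulated here for illustration.

def extra_template_vars(view_name, request):
    # Only inject variables on the index page
    if view_name != "index":
        return {}
    args = request.get("args", {})
    return {
        "latitude": args.get("latitude"),
        "longitude": args.get("longitude"),
        "searching": "latitude" in args and "longitude" in args,
    }

# Simulated request for /?latitude=37.77&longitude=-122.458
fake_request = {"args": {"latitude": "37.77", "longitude": "-122.458"}}
print(extra_template_vars("index", fake_request))
```

The returned dict is merged into the template context, which is what makes the template behave a bit like a view function fed by the querystring.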
&lt;h2 id="weeknotes"&gt;Weeknotes&lt;/h2&gt;
&lt;p&gt;This post is standing in for my regular weeknotes, because it represents most of what I achieved this last week. A few other bits and pieces:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;I&amp;#39;ve been exploring ways to enable CSV upload directly into a Datasette instance. I&amp;#39;m building a prototype of this on top of &lt;a href="https://www.starlette.io/"&gt;Starlette&lt;/a&gt;, because it has solid ASGI &lt;a href="https://www.starlette.io/requests/#request-files"&gt;file upload support&lt;/a&gt;. This is currently a standalone web application but I&amp;#39;ll probably make it work as a Datasette ASGI plugin once I have something I like.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://sixcolors.com/post/2019/09/13-features-of-ios-13-shortcuts/"&gt;Shortcuts in iOS 13&lt;/a&gt; got some very interesting new features, most importantly the ability to trigger shortcuts automatically on specific actions - including every time you open a specific app. I&amp;#39;ve been experimenting with using this to automatically copy data from my iPhone up to a custom web application - maybe this could help ingest notes and photos into &lt;a href="https://simonwillison.net/2019/Oct/7/dogsheep/"&gt;Dogsheep&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Posted seven new museums to niche-museums.com: &lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/39"&gt;Cable Car Museum&lt;/a&gt; in San Francisco&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/40"&gt;Audium&lt;/a&gt; in San Francisco&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/41"&gt;House of Broel Dollhouse Museum&lt;/a&gt; in New Orleans&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/43"&gt;Neptune Society Columbarium&lt;/a&gt; in San Francisco&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/44"&gt;Recoleta Cemetery&lt;/a&gt; in Buenos Aires&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/45"&gt;NASA Glenn Visitor Center&lt;/a&gt; in Cleveland&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/46"&gt;Conservatory of Flowers&lt;/a&gt; in San Francisco&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;I composed &lt;a href="https://www.niche-museums.com/browse?sql=select+json_object%28%22pre%22%2C+group_concat%28%27*+%5B%27+%7C%7C+name+%7C%7C+%27%5D%28https%3A%2F%2Fwww.niche-museums.com%2Fbrowse%2Fmuseums%2F%27+%7C%7C+id+%7C%7C++%2B+%27%29+in+%27+%7C%7C+coalesce%28osm_city%2C+osm_county%2C+osm_state%2C+osm_country%2C+%27%27%29%2C+%27%0D%0A%27%29%29+from+%28select+*+from+%28select+*+from+museums+order+by+id+desc+limit+7%29+order+by+id%29%3B"&gt;devious SQL query&lt;/a&gt; for generating the markdown for the seven most recently added museums.&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/museums"&gt;museums&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/yaml"&gt;yaml&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/baked-data"&gt;baked-data&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="museums"/><category term="projects"/><category term="yaml"/><category term="datasette"/><category term="weeknotes"/><category term="baked-data"/></entry><entry><title>Analyzing US Election Russian Facebook Ads</title><link href="https://simonwillison.net/2018/Aug/6/russian-facebook-ads/#atom-tag" rel="alternate"/><published>2018-08-06T16:01:18+00:00</published><updated>2018-08-06T16:01:18+00:00</updated><id>https://simonwillison.net/2018/Aug/6/russian-facebook-ads/#atom-tag</id><summary type="html">
    &lt;p&gt;Two interesting data sources have emerged in the past few weeks concerning the Russian impact on the 2016 US elections.&lt;/p&gt;
&lt;p&gt;FiveThirtyEight &lt;a href="https://fivethirtyeight.com/features/why-were-sharing-3-million-russian-troll-tweets/"&gt;published nearly 3 million tweets&lt;/a&gt; from accounts associated with the Russian “Internet Research Agency” - see &lt;a href="https://simonwillison.net/2018/Aug/6/troll-tweets/"&gt;my article and searchable tweet archive here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Separately, the House Intelligence Committee Minority &lt;a href="https://democrats-intelligence.house.gov/social-media-content/"&gt;released 3,517 Facebook ads&lt;/a&gt; that were reported to have been bought by the Russian Internet Research Agency as a set of redacted PDF files.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Exploring_the_Russian_Facebook_Ad_spend_18"&gt;&lt;/a&gt;Exploring the Russian Facebook Ad spend&lt;/h3&gt;
&lt;p&gt;The initial data was released as &lt;a href="https://democrats-intelligence.house.gov/social-media-content/social-media-advertisements.htm"&gt;zip files full of PDFs&lt;/a&gt;, one of the least friendly formats you can use to publish data.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://twitter.com/edsu"&gt;Ed Summers&lt;/a&gt; took on the intimidating task of cleaning that up. &lt;a href="https://github.com/edsu/irads"&gt;His results are incredible&lt;/a&gt;: he used the &lt;a href="https://pypi.org/project/pytesseract/"&gt;pytesseract OCR library&lt;/a&gt; and &lt;a href="https://pypi.org/project/PyPDF2/"&gt;PyPDF2&lt;/a&gt; to extract both the images and the associated metadata and convert the whole lot into a single 3.9MB JSON file.&lt;/p&gt;
&lt;p&gt;I &lt;a href="https://github.com/simonw/russian-ira-facebook-ads-datasette"&gt;wrote some code&lt;/a&gt; to convert his JSON file to SQLite (more on the details later) and the result can be found here:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://russian-ira-facebook-ads.datasettes.com/"&gt;https://russian-ira-facebook-ads.datasettes.com/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Here’s an &lt;a href="https://russian-ira-facebook-ads.datasettes.com/russian-ads-919cbfd/display_ads?_search=cops&amp;amp;_sort_desc=spend_usd"&gt;example search for “cops” ordered by the USD equivalent spent on the ad&lt;/a&gt; (some of the spends are in rubles, so I convert those to USD using today’s exchange rate of 0.016).&lt;/p&gt;
&lt;p&gt;&lt;img style="max-width: 100%" src="https://static.simonwillison.net/static/2018/ads-cops-sorted-by-usd.png" alt="Search ads for cops, order by USD descending" /&gt;&lt;/p&gt;
&lt;p&gt;One of the most interesting things about this data is that it includes the Facebook ad targeting options that were used to promote the ads. I’ve built a separate interface for browsing those - you can see &lt;a href="https://russian-ira-facebook-ads.datasettes.com/russian-ads-919cbfd/top_targets"&gt;the most frequently applied targets&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img style="max-width: 100%" src="https://static.simonwillison.net/static/2018/top-targets.png" alt="Top targets" /&gt;&lt;/p&gt;
&lt;p&gt;And by browsing &lt;a href="https://russian-ira-facebook-ads.datasettes.com/russian-ads-919cbfd/faceted-targets?targets=%5B%22d6ade%22%5D"&gt;through the different facets&lt;/a&gt; you can construct e.g. a search for all ads that targeted people interested in both &lt;code&gt;interests:Martin Luther King&lt;/code&gt; and  &lt;code&gt;interests:Police Brutality is a Crime&lt;/code&gt;: &lt;a href="https://russian-ira-facebook-ads.datasettes.com/russian-ads-919cbfd/display_ads?_targets_json=%5B%22d6ade%22%2C%2240c27%22%5D"&gt;https://russian-ira-facebook-ads.datasettes.com/russian-ads-919cbfd/display_ads?_targets_json=[&amp;quot;d6ade&amp;quot;%2C&amp;quot;40c27&amp;quot;]&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;&lt;a id="New_tooling_under_the_hood_40"&gt;&lt;/a&gt;New tooling under the hood&lt;/h3&gt;
&lt;p&gt;I ended up spinning up several new projects to help process and explore this data.&lt;/p&gt;
&lt;h4&gt;&lt;a id="sqliteutils_44"&gt;&lt;/a&gt;sqlite-utils&lt;/h4&gt;
&lt;p&gt;The first is a new library called &lt;a href="https://sqlite-utils.readthedocs.io/en/latest/"&gt;sqlite-utils&lt;/a&gt;. If data is already in CSV I tend to convert it using csvs-to-sqlite, but if data is in a less tabular format (JSON or XML for example) I have to hand-write code. Here’s &lt;a href="https://github.com/simonw/register-of-members-interests/blob/2baf75956b8b9e93a3985ebeb2259f7f2af760c8/convert_xml_to_sqlite.py"&gt;a script&lt;/a&gt; I wrote to process the XML version of &lt;a href="https://simonwillison.net/2018/Apr/25/register-members-interests/"&gt;the UK Register of Members’ Interests&lt;/a&gt;, for example.&lt;/p&gt;
&lt;p&gt;My goal with sqlite-utils is to take some of the common patterns from those scripts and make them as easy to use as possible, in particular when running inside a Jupyter notebook. It’s still very early, but &lt;a href="https://github.com/simonw/russian-ira-facebook-ads-datasette/blob/336ba87ef8071e664441ad0a95e3b8d0a33f682a/fetch_and_build_russian_ads.py"&gt;the script I wrote&lt;/a&gt; to process the Russian ads JSON is a good example of the kind of thing I want to do with it.&lt;/p&gt;
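&lt;p&gt;For illustration, this is the kind of hand-written JSON-to-SQLite boilerplate that sqlite-utils is meant to collapse into something closer to a single insert_all() call (the records here are invented, not taken from the ads data):&lt;/p&gt;

```python
# Hand-rolled JSON -> SQLite loading, the pattern sqlite-utils abstracts.
# The records are invented for illustration.
import json
import sqlite3

records = json.loads(
    '[{"id": 1, "text": "ad one", "spend_usd": 120.5},'
    ' {"id": 2, "text": "ad two", "spend_usd": 48.0}]'
)

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table ads (id integer primary key, text text, spend_usd real)"
)
# executemany accepts dicts when the SQL uses :named placeholders
conn.executemany(
    "insert into ads (id, text, spend_usd) values (:id, :text, :spend_usd)",
    records,
)
total = conn.execute("select sum(spend_usd) from ads").fetchone()[0]
```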
&lt;h4&gt;&lt;a id="datasettejsonhtml_50"&gt;&lt;/a&gt;datasette-json-html&lt;/h4&gt;
&lt;p&gt;The second new tool is a new Datasette plugin (and &lt;a href="https://github.com/simonw/datasette/issues/352"&gt;corresponding plugin hook&lt;/a&gt;) called &lt;a href="https://github.com/simonw/datasette-json-html"&gt;datasette-json-html&lt;/a&gt;. I used this to solve the need to display both rendered images and customized links as part of the regular Datasette instance.&lt;/p&gt;
&lt;p&gt;It’s a pretty crazy solution (hence why it’s implemented as a plugin and not part of Datasette core) but it works surprisingly well. The basic idea is to support a mini JSON language which can be detected and rendered as HTML. A couple of examples:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{
  &amp;quot;img_src&amp;quot;: &amp;quot;https://raw.githubusercontent.com/edsu/irads/03fb4b/site/images/0771.png&amp;quot;,
  &amp;quot;width&amp;quot;: 200
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Is rendered as an HTML &lt;code&gt;&amp;lt;img src=&amp;quot;&amp;quot;&amp;gt;&lt;/code&gt; element.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[
  {
    &amp;quot;label&amp;quot;: &amp;quot;location:United States&amp;quot;,
    &amp;quot;href&amp;quot;: &amp;quot;/russian-ads/display_ads?_target=ec3ac&amp;quot;
  },
  {
    &amp;quot;label&amp;quot;: &amp;quot;interests:Martin Luther King&amp;quot;,
    &amp;quot;href&amp;quot;: &amp;quot;/russian-ads/display_ads?_target=d6ade&amp;quot;
  },
  {
    &amp;quot;label&amp;quot;: &amp;quot;interests:Jr.&amp;quot;,
    &amp;quot;href&amp;quot;: &amp;quot;/russian-ads/display_ads?_target=8e7b3&amp;quot;
  }
]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Is rendered as a comma-separated list of HTML links.&lt;/p&gt;
&lt;p&gt;Why use JSON for this? Because SQLite has some &lt;a href="https://www.sqlite.org/json1.html"&gt;incredibly powerful JSON features&lt;/a&gt;, making it trivial to output JSON as part of the result of a SQL query. Most interesting of all, it has &lt;code&gt;json_group_array()&lt;/code&gt; which can work as an aggregation function to combine a set of related rows into a single JSON array.&lt;/p&gt;
&lt;p&gt;The display_ads page shown above is powered by a SQL view. Here’s the relevant subset of that view:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;select ads.id,
    case when image is not null then
        json_object(&amp;quot;img_src&amp;quot;, &amp;quot;https://raw.githubusercontent.com/edsu/irads/03fb4b/site/&amp;quot; || image, &amp;quot;width&amp;quot;, 200)
    else
        &amp;quot;no image&amp;quot;
    end as img,
    json_group_array(
        json_object(
            &amp;quot;label&amp;quot;, targets.name,
            &amp;quot;href&amp;quot;, &amp;quot;/russian-ads/display_ads?_target=&amp;quot;
                || urllib_quote_plus(targets.id)
        )
    ) as targeting
from ads
    join ad_targets on ads.id = ad_targets.ad_id
    join targets on ad_targets.target_id = targets.id
group by ads.id limit 10
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I’m using SQLite’s JSON functions to dynamically assemble the JSON format that datasette-json-html knows how to render. I’m delighted at how well it works.&lt;/p&gt;
&lt;p&gt;I’ve turned off arbitrary SQL querying against the main Facebook ads Datasette instance, but there’s a copy running at &lt;a href="https://russian-ira-facebook-ads-sql-allowed.now.sh/russian-ads"&gt;https://russian-ira-facebook-ads-sql-allowed.now.sh/russian-ads&lt;/a&gt; if you want to play with these queries.&lt;/p&gt;
&lt;h4&gt;&lt;a id="Weird_implementation_details_106"&gt;&lt;/a&gt;Weird implementation details&lt;/h4&gt;
&lt;p&gt;The full source code for my implementation &lt;a href="https://github.com/simonw/russian-ira-facebook-ads-datasette"&gt;is available on GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I ended up using &lt;a href="https://github.com/simonw/datasette/commit/5116c4ec8aed5091e1f75415424b80f613518dc6"&gt;an experimental plugin hook&lt;/a&gt; to enable additional custom filtering on Datasette views in order to support showing ads against multiple m2m targets, but hopefully that will be made unnecessary as work on Datasette’s &lt;a href="https://github.com/simonw/datasette/issues/354"&gt;support for m2m relationships&lt;/a&gt; progresses.&lt;/p&gt;
&lt;p&gt;I also experimented with YAML to generate the &lt;code&gt;metadata.json&lt;/code&gt; file as JSON strings aren’t a great way of &lt;a href="https://github.com/simonw/russian-ira-facebook-ads-datasette/blob/336ba87ef8071e664441ad0a95e3b8d0a33f682a/russian-ads-metadata.yaml"&gt;representing multi-line HTML and SQL&lt;/a&gt;. And if you want to see some &lt;em&gt;really&lt;/em&gt; convoluted SQL have a look at how the &lt;a href="https://github.com/simonw/russian-ira-facebook-ads-datasette/blob/336ba87ef8071e664441ad0a95e3b8d0a33f682a/russian-ads-metadata.yaml#L52-L81"&gt;canned query&lt;/a&gt; for the &lt;a href="https://russian-ira-facebook-ads.datasettes.com/russian-ads-919cbfd/faceted-targets?targets=%5B%22371f0%22%2C%22cc5ed%22%5D"&gt;faceted targeting interface&lt;/a&gt; works.&lt;/p&gt;
&lt;p&gt;This was a really fun project, which further stretched my ideas about what Datasette should be capable of out of the box. I’m hoping that the &lt;a href="https://github.com/simonw/datasette/issues/354"&gt;m2m work&lt;/a&gt; will make a lot of these crazy hacks redundant.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/politics"&gt;politics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/yaml"&gt;yaml&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="politics"/><category term="projects"/><category term="yaml"/><category term="datasette"/><category term="sqlite-utils"/></entry><entry><title>twitter-text-conformance</title><link href="https://simonwillison.net/2010/Feb/6/twitter/#atom-tag" rel="alternate"/><published>2010-02-06T15:39:27+00:00</published><updated>2010-02-06T15:39:27+00:00</updated><id>https://simonwillison.net/2010/Feb/6/twitter/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://github.com/mzsanford/twitter-text-conformance"&gt;twitter-text-conformance&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
This is a neat idea: Twitter have released open source libraries for parsing standard tweet syntax in Ruby and Java, but they’ve also released a set of YAML unit tests aimed at anyone who wants to implement the same parsing logic in other languages.
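In spirit, consuming such a suite looks something like this sketch - the test cases and the extract_mentions() parser here are invented stand-ins (the real YAML suite covers mentions, hashtags, URLs and more):

```python
# Sketch of a conformance-suite harness: each case pairs an input tweet
# with the expected extraction result, and we assert our own parser
# against every case. Cases and parser are invented for illustration;
# the real suite ships the cases as YAML files.
import re

cases = [
    {"text": "@alice hello", "expected": ["alice"]},
    {"text": "no mentions here", "expected": []},
]

def extract_mentions(text):
    # Deliberately naive @mention extraction
    return re.findall(r"@(\w+)", text)

results = [extract_mentions(c["text"]) == c["expected"] for c in cases]
```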

    &lt;p&gt;&lt;small&gt;Via &lt;a href="http://engineering.twitter.com/2010/02/introducing-open-source-twitter-text.html"&gt;Twitter Engineering Blog&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/java"&gt;java&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ruby"&gt;ruby&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/testing"&gt;testing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/twitter"&gt;twitter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/yaml"&gt;yaml&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/conformance-suites"&gt;conformance-suites&lt;/a&gt;&lt;/p&gt;



</summary><category term="java"/><category term="ruby"/><category term="testing"/><category term="twitter"/><category term="yaml"/><category term="conformance-suites"/></entry><entry><title>More YAML</title><link href="https://simonwillison.net/2003/Feb/5/moreYaml/#atom-tag" rel="alternate"/><published>2003-02-05T23:49:43+00:00</published><updated>2003-02-05T23:49:43+00:00</updated><id>https://simonwillison.net/2003/Feb/5/moreYaml/#atom-tag</id><summary type="html">
    &lt;p&gt;Paul Tchistopolskii's &lt;a href="http://www.pault.com/pault/pxml/xmlalternatives.html"&gt;XML Alternatives&lt;/a&gt; reminded me to take another look at &lt;a href="http://www.yaml.org/" title="YAML Ain&amp;apos;t Markup Language"&gt;YAML&lt;/a&gt;. The specification has been updated since &lt;a href="/2002/Dec/05/yaml/"&gt;I last looked&lt;/a&gt; and seems to be a bit more complicated, but it's still a very nicely designed format. Implementations are available for Perl, Python and Ruby with C and Java on the way, but strangely no one seems to be doing one for &lt;acronym title="PHP: Hypertext Preprocessor"&gt;PHP&lt;/acronym&gt; yet. I'm doing a course at Uni on compilers at the moment which includes quite a lot of stuff about writing parsers, so I'm very tempted to have a go at a YAML implementation in the next few weeks just to try stuff out. The possibility of easily swapping relatively complex data structures between &lt;acronym title="PHP: Hypertext Preprocessor"&gt;PHP&lt;/acronym&gt; and Python is pretty tempting as well.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/xml"&gt;xml&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/yaml"&gt;yaml&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="xml"/><category term="yaml"/></entry><entry><title>YAML</title><link href="https://simonwillison.net/2002/Dec/5/yaml/#atom-tag" rel="alternate"/><published>2002-12-05T02:49:08+00:00</published><updated>2002-12-05T02:49:08+00:00</updated><id>https://simonwillison.net/2002/Dec/5/yaml/#atom-tag</id><summary type="html">
    &lt;p&gt;I forget quite how I got there, but the other day I found myself reading about &lt;acronym title="YAML Ain&amp;apos;t Markup Language"&gt;YAML&lt;/acronym&gt; - &lt;a href="http://www.yaml.org/"&gt;YAML Ain't Markup Language&lt;/a&gt;. It looks really interesting. YAML aims to be an easily human readable format for storing and transferring structured data - so far, so &lt;acronym title="eXtensible Markup Language"&gt;XML&lt;/acronym&gt;. Where it differs from the &lt;acronym title="Information Technology"&gt;IT&lt;/acronym&gt; world's favourite buzzword is that YAML is specifically designed to handle the three most common data structures - scalars (single values), lists and dictionaries. Here's a sample (taken from the &lt;a href="http://www.yaml.org/spec/" title="YAML Ain&amp;apos;t Markup Language"&gt;official specification&lt;/a&gt;):&lt;/p&gt;
&lt;pre&gt;
Time: 2001-11-23 15:01:42 -05:00
User: ed
Warning: &amp;gt;
  This is an error message
  for the log file
&lt;/pre&gt;
&lt;p&gt;YAML has a number of obvious influences, including Python and &lt;acronym title="Multipurpose Internet Mail Extensions"&gt;MIME&lt;/acronym&gt;. Implementations already exist for &lt;a href="http://wiki.yaml.org/yamlwiki/YamlPm" title="YamlPm"&gt;Perl&lt;/a&gt;, &lt;a href="http://wiki.yaml.org/yamlwiki/PurePythonParserForYaml" title="PurePythonParserForYaml"&gt;Python&lt;/a&gt; and &lt;a href="http://helide.com/g/yaml/" title="A YAML parser written in Java (work in progress)"&gt;Java&lt;/a&gt;. &lt;acronym title="eXtensible Markup Language - Remote Procedure Calls"&gt;XML-RPC&lt;/acronym&gt; aptly demonstrates how powerful the combination of lists, dictionaries and arrays can be for exchanging data between different systems and YAML looks like it offers a very nice alternative to XML based data structure syntax. I have to admit to being slightly concerned by the length of the specification - while YAML is definitely human readable it looks like it could take a while for a human to learn to write it. Then again, the actual generation of the format is meant to be handled by computers (I imagine that humans will make simple edits to YAML files more often than they create them from scratch) so the complexity of the more advanced parts of the specification is probably not too much of a problem.&lt;/p&gt;
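&lt;p&gt;To make that concrete, the sample above parses to an ordinary dictionary of scalars - the &amp;gt; folded-block indicator joins the wrapped lines with a space and keeps a trailing newline. Built by hand here rather than via a YAML library:&lt;/p&gt;

```python
# Hand-built Python equivalent of the YAML sample above (no YAML
# library used). The ">" folded style folds the two indented lines
# into one space-joined string ending in a single newline.
parsed = {
    "Time": "2001-11-23 15:01:42 -05:00",
    "User": "ed",
    "Warning": "This is an error message for the log file\n",
}
```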
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/markup"&gt;markup&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/yaml"&gt;yaml&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="markup"/><category term="yaml"/></entry></feed>