<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: duckdb</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/duckdb.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-08-23T21:21:02+00:00</updated><author><name>Simon Willison</name></author><entry><title>Spatial Joins in DuckDB</title><link href="https://simonwillison.net/2025/Aug/23/spatial-joins-in-duckdb/#atom-tag" rel="alternate"/><published>2025-08-23T21:21:02+00:00</published><updated>2025-08-23T21:21:02+00:00</updated><id>https://simonwillison.net/2025/Aug/23/spatial-joins-in-duckdb/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://duckdb.org/2025/08/08/spatial-joins"&gt;Spatial Joins in DuckDB&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Extremely detailed overview by Max Gabrielsson of DuckDB's new spatial join optimizations.&lt;/p&gt;
&lt;p&gt;Consider the following query, which counts the number of &lt;a href="https://citibikenyc.com/system-data"&gt;NYC Citi Bike Trips&lt;/a&gt; for each of the neighborhoods defined by the &lt;a href="https://www.nyc.gov/content/planning/pages/resources/datasets/neighborhood-tabulation"&gt;NYC Neighborhood Tabulation Areas polygons&lt;/a&gt; and returns the top three:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;SELECT&lt;/span&gt; neighborhood,
  &lt;span class="pl-c1"&gt;count&lt;/span&gt;(&lt;span class="pl-k"&gt;*&lt;/span&gt;) &lt;span class="pl-k"&gt;AS&lt;/span&gt; num_rides
&lt;span class="pl-k"&gt;FROM&lt;/span&gt; rides
&lt;span class="pl-k"&gt;JOIN&lt;/span&gt; hoods &lt;span class="pl-k"&gt;ON&lt;/span&gt; ST_Intersects(
  &lt;span class="pl-c1"&gt;rides&lt;/span&gt;.&lt;span class="pl-c1"&gt;start_geom&lt;/span&gt;, &lt;span class="pl-c1"&gt;hoods&lt;/span&gt;.&lt;span class="pl-c1"&gt;geom&lt;/span&gt;
)
&lt;span class="pl-k"&gt;GROUP BY&lt;/span&gt; neighborhood
&lt;span class="pl-k"&gt;ORDER BY&lt;/span&gt; num_rides &lt;span class="pl-k"&gt;DESC&lt;/span&gt;
&lt;span class="pl-k"&gt;LIMIT&lt;/span&gt; &lt;span class="pl-c1"&gt;3&lt;/span&gt;;&lt;/pre&gt;

&lt;p&gt;The rides table contains 58,033,724 rows. The hoods table has polygons for 310 neighborhoods.&lt;/p&gt;
&lt;p&gt;Without an optimized spatial join, this query requires a nested loop join, executing that expensive &lt;code&gt;ST_Intersects()&lt;/code&gt; operation 58m * 310 ~= 18 billion times. This took around 30 minutes on the 36GB MacBook M3 Pro used for the benchmark.&lt;/p&gt;
&lt;p&gt;The first optimization described - implemented from DuckDB 1.2.0 onwards - uses a "piecewise merge join". This takes advantage of the fact that a bounding box intersection is a whole lot faster to calculate, especially if you pre-cache the bounding box (aka the minimum bounding rectangle or MBR) in the stored binary &lt;code&gt;GEOMETRY&lt;/code&gt; representation.&lt;/p&gt;
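To see why the bounding box check is so much cheaper, here's the test sketched in a few lines of illustrative Python (not DuckDB's actual implementation):

```python
def bboxes_intersect(a, b):
    """Test whether two minimum bounding rectangles overlap.

    Each box is a (min_x, min_y, max_x, max_y) tuple. This is just four
    comparisons, versus the arbitrarily expensive exact polygon
    intersection that ST_Intersects() has to perform.
    """
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]
```

A false result here definitively rules out an intersection, so the exact test only needs to run on the candidates that pass.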
&lt;p&gt;Rewriting the query to use a fast bounding box intersection and then only running the more expensive &lt;code&gt;ST_Intersects()&lt;/code&gt; filters on those matches drops the runtime from 1800 seconds to 107 seconds.&lt;/p&gt;
&lt;p&gt;The second optimization, added in &lt;a href="https://duckdb.org/2025/05/21/announcing-duckdb-130.html"&gt;DuckDB 1.3.0&lt;/a&gt; in May 2025 via the new SPATIAL_JOIN operator, is significantly more sophisticated.&lt;/p&gt;
&lt;p&gt;DuckDB can now identify when a spatial join is working against large volumes of data and automatically build an in-memory R-Tree of bounding boxes for the larger of the two tables being joined.&lt;/p&gt;
&lt;p&gt;This new R-Tree further accelerates the bounding box intersection part of the join, and drops the runtime down to just 30 seconds.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://bsky.app/profile/mackaszechno.bsky.social/post/3lx3lnagg7s2t"&gt;@mackaszechno.bsky.social&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/geospatial"&gt;geospatial&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sql"&gt;sql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/duckdb"&gt;duckdb&lt;/a&gt;&lt;/p&gt;



</summary><category term="geospatial"/><category term="sql"/><category term="duckdb"/></entry><entry><title>DuckDB is Probably the Most Important Geospatial Software of the Last Decade</title><link href="https://simonwillison.net/2025/May/4/duckdb-is-probably-the-most-important-geospatial-software-of-the/#atom-tag" rel="alternate"/><published>2025-05-04T00:28:35+00:00</published><updated>2025-05-04T00:28:35+00:00</updated><id>https://simonwillison.net/2025/May/4/duckdb-is-probably-the-most-important-geospatial-software-of-the/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.dbreunig.com/2025/05/03/duckdb-is-the-most-impactful-geospatial-software-in-a-decade.html"&gt;DuckDB is Probably the Most Important Geospatial Software of the Last Decade&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Drew Breunig argues that the ease of installation of DuckDB is opening up geospatial analysis to a whole new set of developers.&lt;/p&gt;
&lt;p&gt;This inspired &lt;a href="https://news.ycombinator.com/item?id=43881468#43882914"&gt;a comment on Hacker News&lt;/a&gt; from DuckDB Labs geospatial engineer Max Gabrielsson which helps explain why the drop in friction introduced by DuckDB is so significant:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I think a big part is that duckdbs spatial extension provides a SQL interface to a whole suite of standard foss gis packages by statically bundling everything (including inlining the default PROJ database of coordinate projection systems into the binary) and providing it for multiple platforms (including WASM). I.E there are no transitive dependencies except libc.&lt;/p&gt;
&lt;p&gt;[...] the fact that you can e.g. convert too and from a myriad of different geospatial formats by utilizing GDAL, transforming through SQL, or pulling down the latest overture dump without having the whole workflow break just cause you updated QGIS has probably been the main killer feature for a lot of the early adopters.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I've lost count of the time I've spent fiddling with dependencies like GDAL trying to get various geospatial tools to work in the past. Bundling difficult dependencies statically is an under-appreciated trick!&lt;/p&gt;
&lt;p&gt;If the bold claim in the headline inspires you to provide a counter-example, bear in mind that a decade ago is 2015, and most of the key technologies in the modern geospatial stack - QGIS, PostGIS, geopandas, SpatiaLite - predate that by quite a bit.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/geospatial"&gt;geospatial&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sql"&gt;sql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/duckdb"&gt;duckdb&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/drew-breunig"&gt;drew-breunig&lt;/a&gt;&lt;/p&gt;



</summary><category term="geospatial"/><category term="sql"/><category term="duckdb"/><category term="drew-breunig"/></entry><entry><title>Abusing DuckDB-WASM by making SQL draw 3D graphics (Sort Of)</title><link href="https://simonwillison.net/2025/Apr/22/duckdb-wasm-doom/#atom-tag" rel="alternate"/><published>2025-04-22T16:29:13+00:00</published><updated>2025-04-22T16:29:13+00:00</updated><id>https://simonwillison.net/2025/Apr/22/duckdb-wasm-doom/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.hey.earth/posts/duckdb-doom"&gt;Abusing DuckDB-WASM by making SQL draw 3D graphics (Sort Of)&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Brilliant hack by Patrick Trainer, who got an ASCII-art Doom clone running in the browser using convoluted SQL queries running against the WebAssembly build of DuckDB. Here’s the &lt;a href="https://patricktrainer.github.io/duckdb-doom/"&gt;live demo&lt;/a&gt;, and the &lt;a href="https://github.com/patricktrainer/duckdb-doom"&gt;code on GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;div style="text-align: center; margin-bottom: 1em"&gt;
&lt;img alt="Animated demo GIF. Green ASCII art on black, with a map on the right and a Doom-style first person view on the left." src="https://static.simonwillison.net/static/2025/duckdb-wasm-doom.gif"&gt;
&lt;/div&gt;

&lt;p&gt;The SQL is &lt;a href="https://github.com/patricktrainer/duckdb-doom/blob/c36bcdab16bea40d916d3165f7bfdb437b86dde2/index.html#L140-L224"&gt;so much fun&lt;/a&gt;. Here’s a snippet that implements ray tracing as part of a SQL view:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;CREATE OR REPLACE&lt;/span&gt; &lt;span class="pl-k"&gt;VIEW&lt;/span&gt; &lt;span class="pl-en"&gt;render_3d_frame&lt;/span&gt; &lt;span class="pl-k"&gt;AS&lt;/span&gt;
WITH RECURSIVE
    &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;--&lt;/span&gt; ...&lt;/span&gt;
    rays &lt;span class="pl-k"&gt;AS&lt;/span&gt; (
        &lt;span class="pl-k"&gt;SELECT&lt;/span&gt; 
            &lt;span class="pl-c1"&gt;c&lt;/span&gt;.&lt;span class="pl-c1"&gt;col&lt;/span&gt;, 
            (&lt;span class="pl-c1"&gt;p&lt;/span&gt;.&lt;span class="pl-c1"&gt;dir&lt;/span&gt; &lt;span class="pl-k"&gt;-&lt;/span&gt; &lt;span class="pl-c1"&gt;s&lt;/span&gt;.&lt;span class="pl-c1"&gt;fov&lt;/span&gt;&lt;span class="pl-k"&gt;/&lt;/span&gt;&lt;span class="pl-c1"&gt;2&lt;/span&gt;.&lt;span class="pl-c1"&gt;0&lt;/span&gt; &lt;span class="pl-k"&gt;+&lt;/span&gt; &lt;span class="pl-c1"&gt;s&lt;/span&gt;.&lt;span class="pl-c1"&gt;fov&lt;/span&gt; &lt;span class="pl-k"&gt;*&lt;/span&gt; (&lt;span class="pl-c1"&gt;c&lt;/span&gt;.&lt;span class="pl-c1"&gt;col&lt;/span&gt;&lt;span class="pl-k"&gt;*&lt;/span&gt;&lt;span class="pl-c1"&gt;1&lt;/span&gt;.&lt;span class="pl-c1"&gt;0&lt;/span&gt; &lt;span class="pl-k"&gt;/&lt;/span&gt; (&lt;span class="pl-c1"&gt;s&lt;/span&gt;.&lt;span class="pl-c1"&gt;view_w&lt;/span&gt; &lt;span class="pl-k"&gt;-&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;))) &lt;span class="pl-k"&gt;AS&lt;/span&gt; angle 
        &lt;span class="pl-k"&gt;FROM&lt;/span&gt; cols c, s, p
    ),
    raytrace(col, step_count, fx, fy, angle) &lt;span class="pl-k"&gt;AS&lt;/span&gt; (
        &lt;span class="pl-k"&gt;SELECT&lt;/span&gt; 
            &lt;span class="pl-c1"&gt;r&lt;/span&gt;.&lt;span class="pl-c1"&gt;col&lt;/span&gt;, 
            &lt;span class="pl-c1"&gt;1&lt;/span&gt;, 
            &lt;span class="pl-c1"&gt;p&lt;/span&gt;.&lt;span class="pl-c1"&gt;x&lt;/span&gt; &lt;span class="pl-k"&gt;+&lt;/span&gt; COS(&lt;span class="pl-c1"&gt;r&lt;/span&gt;.&lt;span class="pl-c1"&gt;angle&lt;/span&gt;)&lt;span class="pl-k"&gt;*&lt;/span&gt;&lt;span class="pl-c1"&gt;s&lt;/span&gt;.&lt;span class="pl-c1"&gt;step&lt;/span&gt;, 
            &lt;span class="pl-c1"&gt;p&lt;/span&gt;.&lt;span class="pl-c1"&gt;y&lt;/span&gt; &lt;span class="pl-k"&gt;+&lt;/span&gt; SIN(&lt;span class="pl-c1"&gt;r&lt;/span&gt;.&lt;span class="pl-c1"&gt;angle&lt;/span&gt;)&lt;span class="pl-k"&gt;*&lt;/span&gt;&lt;span class="pl-c1"&gt;s&lt;/span&gt;.&lt;span class="pl-c1"&gt;step&lt;/span&gt;, 
            &lt;span class="pl-c1"&gt;r&lt;/span&gt;.&lt;span class="pl-c1"&gt;angle&lt;/span&gt; 
        &lt;span class="pl-k"&gt;FROM&lt;/span&gt; rays r, p, s 
        &lt;span class="pl-k"&gt;UNION ALL&lt;/span&gt; 
        &lt;span class="pl-k"&gt;SELECT&lt;/span&gt; 
            &lt;span class="pl-c1"&gt;rt&lt;/span&gt;.&lt;span class="pl-c1"&gt;col&lt;/span&gt;, 
            &lt;span class="pl-c1"&gt;rt&lt;/span&gt;.&lt;span class="pl-c1"&gt;step_count&lt;/span&gt; &lt;span class="pl-k"&gt;+&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;, 
            &lt;span class="pl-c1"&gt;rt&lt;/span&gt;.&lt;span class="pl-c1"&gt;fx&lt;/span&gt; &lt;span class="pl-k"&gt;+&lt;/span&gt; COS(&lt;span class="pl-c1"&gt;rt&lt;/span&gt;.&lt;span class="pl-c1"&gt;angle&lt;/span&gt;)&lt;span class="pl-k"&gt;*&lt;/span&gt;&lt;span class="pl-c1"&gt;s&lt;/span&gt;.&lt;span class="pl-c1"&gt;step&lt;/span&gt;, 
            &lt;span class="pl-c1"&gt;rt&lt;/span&gt;.&lt;span class="pl-c1"&gt;fy&lt;/span&gt; &lt;span class="pl-k"&gt;+&lt;/span&gt; SIN(&lt;span class="pl-c1"&gt;rt&lt;/span&gt;.&lt;span class="pl-c1"&gt;angle&lt;/span&gt;)&lt;span class="pl-k"&gt;*&lt;/span&gt;&lt;span class="pl-c1"&gt;s&lt;/span&gt;.&lt;span class="pl-c1"&gt;step&lt;/span&gt;, 
            &lt;span class="pl-c1"&gt;rt&lt;/span&gt;.&lt;span class="pl-c1"&gt;angle&lt;/span&gt; 
        &lt;span class="pl-k"&gt;FROM&lt;/span&gt; raytrace rt, s 
        &lt;span class="pl-k"&gt;WHERE&lt;/span&gt; &lt;span class="pl-c1"&gt;rt&lt;/span&gt;.&lt;span class="pl-c1"&gt;step_count&lt;/span&gt; &lt;span class="pl-k"&gt;&amp;lt;&lt;/span&gt; &lt;span class="pl-c1"&gt;s&lt;/span&gt;.&lt;span class="pl-c1"&gt;max_steps&lt;/span&gt; 
          &lt;span class="pl-k"&gt;AND&lt;/span&gt; NOT EXISTS (
              &lt;span class="pl-k"&gt;SELECT&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt; 
              &lt;span class="pl-k"&gt;FROM&lt;/span&gt; map m 
              &lt;span class="pl-k"&gt;WHERE&lt;/span&gt; &lt;span class="pl-c1"&gt;m&lt;/span&gt;.&lt;span class="pl-c1"&gt;x&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; CAST(&lt;span class="pl-c1"&gt;rt&lt;/span&gt;.&lt;span class="pl-c1"&gt;fx&lt;/span&gt; &lt;span class="pl-k"&gt;AS&lt;/span&gt; &lt;span class="pl-k"&gt;INT&lt;/span&gt;) 
                &lt;span class="pl-k"&gt;AND&lt;/span&gt; &lt;span class="pl-c1"&gt;m&lt;/span&gt;.&lt;span class="pl-c1"&gt;y&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; CAST(&lt;span class="pl-c1"&gt;rt&lt;/span&gt;.&lt;span class="pl-c1"&gt;fy&lt;/span&gt; &lt;span class="pl-k"&gt;AS&lt;/span&gt; &lt;span class="pl-k"&gt;INT&lt;/span&gt;) 
                &lt;span class="pl-k"&gt;AND&lt;/span&gt; &lt;span class="pl-c1"&gt;m&lt;/span&gt;.&lt;span class="pl-c1"&gt;tile&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;#&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
          )
    ),
    &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;--&lt;/span&gt; ...&lt;/span&gt;&lt;/pre&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=43761998"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/sql"&gt;sql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/webassembly"&gt;webassembly&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/duckdb"&gt;duckdb&lt;/a&gt;&lt;/p&gt;



</summary><category term="sql"/><category term="webassembly"/><category term="duckdb"/></entry><entry><title>OpenTimes</title><link href="https://simonwillison.net/2025/Mar/17/opentimes/#atom-tag" rel="alternate"/><published>2025-03-17T22:49:59+00:00</published><updated>2025-03-17T22:49:59+00:00</updated><id>https://simonwillison.net/2025/Mar/17/opentimes/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://sno.ws/opentimes/"&gt;OpenTimes&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Spectacular new open geospatial project by &lt;a href="https://sno.ws/"&gt;Dan Snow&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;OpenTimes is a database of pre-computed, point-to-point travel times between United States Census geographies. It lets you download bulk travel time data for free and with no limits.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://opentimes.org/?id=060816135022&amp;amp;mode=car#9.76/37.5566/-122.3085"&gt;what I get&lt;/a&gt; for travel times by car from El Granada, California:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Isochrone map showing driving times from the El Granada census tract to other places in the San Francisco Bay Area" src="https://static.simonwillison.net/static/2025/opentimes.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;The technical details are &lt;em&gt;fascinating&lt;/em&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;The entire OpenTimes backend is just static Parquet files on &lt;a href="https://www.cloudflare.com/developer-platform/products/r2/"&gt;Cloudflare's R2&lt;/a&gt;. There's no RDBMS or running service, just files and a CDN. The whole thing costs about $10/month to host and costs nothing to serve. In my opinion, this is a &lt;em&gt;great&lt;/em&gt; way to serve infrequently updated, large public datasets at low cost (as long as you partition the files correctly).&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Sure enough, &lt;a href="https://developers.cloudflare.com/r2/pricing/"&gt;R2 pricing&lt;/a&gt; charges "based on the total volume of data stored" - $0.015 / GB-month for standard storage, then $0.36 / million requests for "Class B" operations which include reads. They charge nothing for outbound bandwidth.&lt;/p&gt;
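As a rough sanity check on the ~$10/month figure, those two prices translate directly into arithmetic (a sketch using only the prices quoted above; a real bill would also include "Class A" write operations):

```python
def r2_monthly_cost_usd(stored_gb, class_b_requests):
    """Estimate a Cloudflare R2 monthly bill from the quoted prices.

    $0.015 per GB-month of standard storage, $0.36 per million
    Class B (read) operations, and zero egress charges.
    """
    storage = stored_gb * 0.015
    reads = (class_b_requests / 1_000_000) * 0.36
    return storage + reads
```

At those rates, even a few hundred GB of Parquet plus millions of reads stays in single-digit dollars per month.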
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;All travel times were calculated by pre-building the inputs (OSM, OSRM networks) and then distributing the compute over &lt;a href="https://github.com/dfsnow/opentimes/actions/workflows/calculate-times.yaml"&gt;hundreds of GitHub Actions jobs&lt;/a&gt;. This worked shockingly well for this specific workload (and was also completely free).&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's a &lt;a href="https://github.com/dfsnow/opentimes/actions/runs/13094249792"&gt;GitHub Actions run&lt;/a&gt; of the &lt;a href="https://github.com/dfsnow/opentimes/blob/a6a5f7abcdd69559b3e29f360fe0ff0399dbb400/.github/workflows/calculate-times.yaml#L78-L80"&gt;calculate-times.yaml workflow&lt;/a&gt; which uses a matrix to run 255 jobs!&lt;/p&gt;
&lt;p&gt;&lt;img alt="GitHub Actions run: calculate-times.yaml run by workflow_dispatch taking 1h49m to execute 255 jobs with names like run-job (2020-01) " src="https://static.simonwillison.net/static/2025/opentimes-github-actions.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;Relevant YAML:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;  matrix:
    year: ${{ fromJSON(needs.setup-jobs.outputs.years) }}
    state: ${{ fromJSON(needs.setup-jobs.outputs.states) }}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Where those JSON files were created by the previous step, which reads in the year and state values from &lt;a href="https://github.com/dfsnow/opentimes/blob/a6a5f7abcdd69559b3e29f360fe0ff0399dbb400/data/params.yaml#L72-L132"&gt;this params.yaml file&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;The query layer uses a single DuckDB database file with &lt;em&gt;views&lt;/em&gt; that point to static Parquet files via HTTP. This lets you query a table with hundreds of billions of records after downloading just the ~5MB pointer file.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is a really creative use of DuckDB's feature that lets you run queries against large data from a laptop using HTTP range queries to avoid downloading the whole thing.&lt;/p&gt;
&lt;p&gt;The README shows &lt;a href="https://github.com/dfsnow/opentimes/blob/3439fa2c54af227e40997b4a5f55678739e0f6df/README.md#using-duckdb"&gt;how to use that from R and Python&lt;/a&gt; - I got this working in the &lt;code&gt;duckdb&lt;/code&gt; client (&lt;code&gt;brew install duckdb&lt;/code&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;INSTALL httpfs;
LOAD httpfs;
ATTACH 'https://data.opentimes.org/databases/0.0.1.duckdb' AS opentimes;

SELECT origin_id, destination_id, duration_sec
  FROM opentimes.public.times
  WHERE version = '0.0.1'
      AND mode = 'car'
      AND year = '2024'
      AND geography = 'tract'
      AND state = '17'
      AND origin_id LIKE '17031%' limit 10;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In answer to a question about adding public transit times &lt;a href="https://news.ycombinator.com/item?id=43392521#43393183"&gt;Dan said&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In the next year or so maybe. The biggest obstacles to adding public transit are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Collecting all the necessary scheduling data (e.g. GTFS feeds) for every transit system in the county. Not insurmountable since there are services that do this currently.&lt;/li&gt;
&lt;li&gt;Finding a routing engine that can compute nation-scale travel time matrices quickly. Currently, the two fastest open-source engines I've tried (OSRM and Valhalla) don't support public transit for matrix calculations and the engines that do support public transit (R5, OpenTripPlanner, etc.) are too slow.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://gtfs.org/"&gt;GTFS&lt;/a&gt; is a popular CSV-based format for sharing transit schedules - here's &lt;a href="https://gtfs.org/resources/data/"&gt;an official list&lt;/a&gt; of available feed directories.&lt;/p&gt;
&lt;p&gt;This whole project feels to me like a great example of the &lt;a href="https://simonwillison.net/2021/Jul/28/baked-data/"&gt;baked data&lt;/a&gt; architectural pattern in action.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=43392521"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/census"&gt;census&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/geospatial"&gt;geospatial&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/open-data"&gt;open-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openstreetmap"&gt;openstreetmap&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cloudflare"&gt;cloudflare&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parquet"&gt;parquet&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/baked-data"&gt;baked-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/duckdb"&gt;duckdb&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/http-range-requests"&gt;http-range-requests&lt;/a&gt;&lt;/p&gt;



</summary><category term="census"/><category term="geospatial"/><category term="open-data"/><category term="openstreetmap"/><category term="cloudflare"/><category term="parquet"/><category term="github-actions"/><category term="baked-data"/><category term="duckdb"/><category term="http-range-requests"/></entry><entry><title>S1: The $6 R1 Competitor?</title><link href="https://simonwillison.net/2025/Feb/5/s1-the-6-r1-competitor/#atom-tag" rel="alternate"/><published>2025-02-05T20:00:26+00:00</published><updated>2025-02-05T20:00:26+00:00</updated><id>https://simonwillison.net/2025/Feb/5/s1-the-6-r1-competitor/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://timkellogg.me/blog/2025/02/03/s1"&gt;S1: The $6 R1 Competitor?&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Tim Kellogg shares his notes on a new paper, &lt;a href="https://arxiv.org/abs/2501.19393"&gt;s1: Simple test-time scaling&lt;/a&gt;, which describes an inference-scaling model fine-tuned on top of Qwen2.5-32B-Instruct for just $6 - the cost for 26 minutes on 16 NVIDIA H100 GPUs.&lt;/p&gt;
&lt;p&gt;Tim highlights the most exciting result:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;After sifting their dataset of 56K examples down to just the best 1K, they found that the core 1K is all that's needed to achieve o1-preview performance on a 32B model.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The paper describes a technique called "Budget forcing":&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;To enforce a minimum, we suppress the generation
of the end-of-thinking token delimiter and optionally append
the string “Wait” to the model’s current reasoning trace to
encourage the model to reflect on its current generation&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That's the same trick Theia Vogel described &lt;a href="https://simonwillison.net/2025/Jan/22/r1py/"&gt;a few weeks ago&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here's the &lt;code&gt;s1-32B&lt;/code&gt; model &lt;a href="https://huggingface.co/simplescaling/s1-32B"&gt;on Hugging Face&lt;/a&gt;. I found a GGUF version of it at &lt;a href="https://huggingface.co/brittlewis12/s1-32B-GGUF"&gt;brittlewis12/s1-32B-GGUF&lt;/a&gt;, which I ran using &lt;a href="https://ollama.com/"&gt;Ollama&lt;/a&gt; like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama run hf.co/brittlewis12/s1-32B-GGUF:Q4_0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I also found those 1,000 samples on Hugging Face in the &lt;a href="https://huggingface.co/datasets/simplescaling/s1K"&gt;simplescaling/s1K&lt;/a&gt; data repository.&lt;/p&gt;
&lt;p&gt;I used DuckDB to convert the parquet file to CSV (and turn one &lt;code&gt;VARCHAR[]&lt;/code&gt; column into JSON):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;COPY (
    SELECT 
      solution,
      question,
      cot_type,
      source_type,
      metadata,
      cot,
      json_array(thinking_trajectories) as thinking_trajectories,
      attempt
    FROM 's1k-00001.parquet'
) TO 'output.csv' (HEADER, DELIMITER ',');
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then I loaded that CSV into &lt;a href="https://sqlite-utils.datasette.io/"&gt;sqlite-utils&lt;/a&gt; so I could use the &lt;code&gt;convert&lt;/code&gt; command to turn a Python data structure into JSON using &lt;code&gt;json.dumps()&lt;/code&gt; and &lt;code&gt;eval()&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Load into SQLite
sqlite-utils insert s1k.db s1k output.csv --csv
# Fix that column
sqlite-utils convert s1k.db s1k metadata 'json.dumps(eval(value))' --import json
# Dump that back out to CSV
sqlite-utils rows s1k.db s1k --csv &amp;gt; s1k.csv
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's that CSV &lt;a href="https://gist.github.com/simonw/048385f27e351c11b488bd9737452fa7"&gt;in a Gist&lt;/a&gt;, which means I can &lt;a href="https://lite.datasette.io/?install=datasette-pretty-json&amp;amp;csv=https://gist.githubusercontent.com/simonw/048385f27e351c11b488bd9737452fa7/raw/5270dacc5aa4a7385f9a6e3d691c81cf3595abc9/s1k.csv#/data/s1k?_facet=cot_type"&gt;load it into Datasette Lite&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of Datasette Lite showing cot_type 1, crossword 15, Link: 93, rowid: 93, solution: &amp;quot;### Answer: INCIDENT ROOM ### Explanation: Definition: investigators' facility **Anagram of**(... changes) NOTICED MINOR. Defn: ... in a police station.&amp;quot; Question text: &amp;quot;Solve the crossword puzzle. You are presented with a clue as input and the number of letters in brackets.&amp;quot; Clue: &amp;quot;Noticed minor changes in investigators' facility (8,4)&amp;quot; cot_type: crossword, source_type: 0xharib/xword1, metadata: { &amp;quot;instruction&amp;quot;: &amp;quot;You are an expert level solver of cryptic crosswords. You are presented with a clue as input. Respond with the answer and explanation.&amp;quot; }" src="https://static.simonwillison.net/static/2025/s1k.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;It really is a tiny amount of training data. It's mostly math and science, but there are also &lt;a href="https://lite.datasette.io/?install=datasette-pretty-json&amp;amp;csv=https://gist.githubusercontent.com/simonw/048385f27e351c11b488bd9737452fa7/raw/5270dacc5aa4a7385f9a6e3d691c81cf3595abc9/s1k.csv#/data/s1k?_facet=cot_type&amp;amp;cot_type=crossword"&gt;15 cryptic crossword examples&lt;/a&gt;.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parquet"&gt;parquet&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/duckdb"&gt;duckdb&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette-lite"&gt;datasette-lite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="datasette"/><category term="parquet"/><category term="duckdb"/><category term="datasette-lite"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="qwen"/><category term="ollama"/><category term="llm-reasoning"/><category term="ai-in-china"/></entry><entry><title>Turning Your Root URL Into a DuckDB Remote Database</title><link href="https://simonwillison.net/2024/Dec/1/turning-your-root-url-into-a-duckdb-remote-database/#atom-tag" rel="alternate"/><published>2024-12-01T22:02:18+00:00</published><updated>2024-12-01T22:02:18+00:00</updated><id>https://simonwillison.net/2024/Dec/1/turning-your-root-url-into-a-duckdb-remote-database/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.dbreunig.com/2024/12/01/turning-your-root-url-into-a-duckdb-remote-database.html"&gt;Turning Your Root URL Into a DuckDB Remote Database&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Fun idea from Drew Breunig: DuckDB supports attaching existing databases that are accessible over HTTP using their URL. Drew suggests creating vanity URLs using your root domain, detecting the DuckDB user-agent and serving the database file directly - allowing tricks like this one:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ATTACH 'https://steplist.app/' AS steplist;
SELECT * FROM steplist.lists;
&lt;/code&gt;&lt;/pre&gt;
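The server-side half of the trick is a single check on the incoming request. Here's a hypothetical sketch - I haven't verified the exact User-Agent string DuckDB's HTTP client sends, so confirm it against your own server logs first:

```python
def is_duckdb_client(user_agent: str) -> bool:
    # DuckDB's HTTP client reportedly identifies itself with a
    # User-Agent containing "duckdb" (an assumption - check your logs).
    return "duckdb" in (user_agent or "").lower()

# A request handler would then branch: serve the .duckdb database file
# to DuckDB clients, and the normal HTML homepage to browsers.
```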


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/duckdb"&gt;duckdb&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/drew-breunig"&gt;drew-breunig&lt;/a&gt;&lt;/p&gt;



</summary><category term="duckdb"/><category term="drew-breunig"/></entry><entry><title>Foursquare Open Source Places: A new foundational dataset for the geospatial community</title><link href="https://simonwillison.net/2024/Nov/20/foursquare-open-source-places/#atom-tag" rel="alternate"/><published>2024-11-20T05:52:38+00:00</published><updated>2024-11-20T05:52:38+00:00</updated><id>https://simonwillison.net/2024/Nov/20/foursquare-open-source-places/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://location.foursquare.com/resources/blog/products/foursquare-open-source-places-a-new-foundational-dataset-for-the-geospatial-community/"&gt;Foursquare Open Source Places: A new foundational dataset for the geospatial community&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I did not expect this!&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[...] we are announcing today the general availability of a foundational open data set, Foursquare Open Source Places ("FSQ OS Places"). This base layer of 100mm+ global places of interest ("POI") includes 22 core attributes (see schema &lt;a href="https://docs.foursquare.com/data-products/docs/places-os-data-schema"&gt;here&lt;/a&gt;) that will be updated monthly and available for commercial use under the Apache 2.0 license framework.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The data is available &lt;a href="https://docs.foursquare.com/data-products/docs/access-fsq-os-places"&gt;as Parquet files&lt;/a&gt; hosted on Amazon S3.&lt;/p&gt;
&lt;p&gt;Here's how to list the available files:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;aws s3 ls s3://fsq-os-places-us-east-1/release/dt=2024-11-19/places/parquet/
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I got back &lt;code&gt;places-00000.snappy.parquet&lt;/code&gt; through &lt;code&gt;places-00024.snappy.parquet&lt;/code&gt;, each file around 455MB for a total of 10.6GB of data.&lt;/p&gt;
&lt;p&gt;I ran &lt;code&gt;duckdb&lt;/code&gt; and then used DuckDB's ability to remotely query Parquet on S3 to explore the data a bit more without downloading it to my laptop first:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;select count(*) from 's3://fsq-os-places-us-east-1/release/dt=2024-11-19/places/parquet/places-00000.snappy.parquet';
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This got back 4,180,424 - that number is similar for each file, suggesting around 104,000,000 records total.&lt;/p&gt;
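That estimate is just the per-file count multiplied across the 25 files (places-00000 through places-00024):

```python
per_file = 4_180_424   # count(*) from the first Parquet file
num_files = 25         # places-00000.snappy.parquet ... places-00024
estimate = per_file * num_files  # 104510600, i.e. ~104 million
```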
&lt;p&gt;&lt;strong&gt;Update:&lt;/strong&gt; DuckDB can use wildcards in S3 paths (thanks, &lt;a href="https://mas.to/@paulbailey/113520325087085448"&gt;Paul&lt;/a&gt;) so this query provides an exact count:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;select count(*) from 's3://fsq-os-places-us-east-1/release/dt=2024-11-19/places/parquet/places-*.snappy.parquet';
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That returned 104,511,073 - and Activity Monitor on my Mac confirmed that DuckDB only needed to fetch 1.2MB of data to answer that query.&lt;/p&gt;
&lt;p&gt;I ran this query to retrieve 1,000 places from that first file as newline-delimited JSON:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;copy (
    select * from 's3://fsq-os-places-us-east-1/release/dt=2024-11-19/places/parquet/places-00000.snappy.parquet'
    limit 1000
) to '/tmp/places.json';
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/53ad57ad42c7efe75e2028d975907180"&gt;that places.json file&lt;/a&gt;, and here it is &lt;a href="https://lite.datasette.io/?json=https://gist.github.com/simonw/53ad57ad42c7efe75e2028d975907180#/data/raw"&gt;imported into Datasette Lite&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Finally, I got ChatGPT Code Interpreter to &lt;a href="https://chatgpt.com/share/673d7b92-0b4c-8006-a442-c5e6c2713d9c"&gt;convert that file to GeoJSON&lt;/a&gt; and pasted the result &lt;a href="https://gist.github.com/simonw/1e2a170b7368932ebd3922cb5d234924"&gt;into this Gist&lt;/a&gt;, giving me a map of those thousand places (because Gists automatically render GeoJSON):&lt;/p&gt;
&lt;p&gt;&lt;img alt="A map of the world with 1000 markers on it. A marker in Columbia shows a dialog for Raisbeck, Bogota Dv, Cra 47 A 114 05 Second Floor" src="https://static.simonwillison.net/static/2024/places-geojson.jpg" /&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://waxy.org/2024/11/foursquare-open-sources-its-places-database/"&gt;Andy Baio&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/geospatial"&gt;geospatial&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/foursquare"&gt;foursquare&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/geojson"&gt;geojson&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parquet"&gt;parquet&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/duckdb"&gt;duckdb&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette-lite"&gt;datasette-lite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/code-interpreter"&gt;code-interpreter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;&lt;/p&gt;



</summary><category term="geospatial"/><category term="open-source"/><category term="foursquare"/><category term="geojson"/><category term="parquet"/><category term="duckdb"/><category term="datasette-lite"/><category term="ai-assisted-programming"/><category term="code-interpreter"/><category term="coding-agents"/></entry><entry><title>Wikidata is a Giant Crosswalk File</title><link href="https://simonwillison.net/2024/Oct/5/wikidata-is-a-giant-crosswalk-file/#atom-tag" rel="alternate"/><published>2024-10-05T15:45:36+00:00</published><updated>2024-10-05T15:45:36+00:00</updated><id>https://simonwillison.net/2024/Oct/5/wikidata-is-a-giant-crosswalk-file/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.dbreunig.com/2024/10/04/wikidata-is-a-giant-crosswalk-file.html"&gt;Wikidata is a Giant Crosswalk File&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Drew Breunig shows how to take the 140GB Wikidata JSON export, use &lt;code&gt;sed 's/,$//'&lt;/code&gt; to convert it to newline-delimited JSON, then use DuckDB to run queries and extract external identifiers, including a query that pulls out 500MB of latitude and longitude points.
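&lt;p&gt;The shape of that pipeline, as a hedged sketch (filenames are illustrative, and the export's opening and closing &lt;code&gt;[&lt;/code&gt; and &lt;code&gt;]&lt;/code&gt; lines need handling too):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sed 's/,$//' latest-all.json &gt; wikidata.ndjson
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then query the newline-delimited result with something like this (field names based on the Wikidata entity JSON format):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT id, type FROM read_json_auto('wikidata.ndjson') LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;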


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/json"&gt;json&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/wikipedia"&gt;wikipedia&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/duckdb"&gt;duckdb&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/drew-breunig"&gt;drew-breunig&lt;/a&gt;&lt;/p&gt;



</summary><category term="json"/><category term="wikipedia"/><category term="duckdb"/><category term="drew-breunig"/></entry><entry><title>Conflating Overture Places Using DuckDB, Ollama, Embeddings, and More</title><link href="https://simonwillison.net/2024/Sep/30/conflating-overture-places/#atom-tag" rel="alternate"/><published>2024-09-30T17:24:03+00:00</published><updated>2024-09-30T17:24:03+00:00</updated><id>https://simonwillison.net/2024/Sep/30/conflating-overture-places/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.dbreunig.com/2024/09/27/conflating-overture-points-of-interests-with-duckdb-ollama-and-more.html"&gt;Conflating Overture Places Using DuckDB, Ollama, Embeddings, and More&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Drew Breunig's detailed tutorial on "conflation" - combining different geospatial data sources by de-duplicating address strings such as &lt;code&gt;RESTAURANT LOS ARCOS,3359 FOOTHILL BLVD,OAKLAND,94601&lt;/code&gt; and &lt;code&gt;LOS ARCOS TAQUERIA,3359 FOOTHILL BLVD,OAKLAND,94601&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Drew uses an entirely offline stack based around Python, DuckDB and Ollama and finds that a combination of H3 geospatial tiles and &lt;code&gt;mxbai-embed-large&lt;/code&gt; embeddings (though other embedding models should work equally well) gets really good results.
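&lt;p&gt;As a rough illustration of the H3 blocking step - this assumes the DuckDB &lt;code&gt;h3&lt;/code&gt; community extension, and the table and column names are invented, not Drew's:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;INSTALL h3 FROM community;
LOAD h3;
-- Only compare candidate pairs that fall in the same H3 cell:
SELECT a.name, b.name
FROM overture_places a
JOIN other_places b
  ON h3_latlng_to_cell(a.lat, a.lon, 9) = h3_latlng_to_cell(b.lat, b.lon, 9);
&lt;/code&gt;&lt;/pre&gt;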


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/geospatial"&gt;geospatial&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/duckdb"&gt;duckdb&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/embeddings"&gt;embeddings&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/drew-breunig"&gt;drew-breunig&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/overture"&gt;overture&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;&lt;/p&gt;



</summary><category term="geospatial"/><category term="python"/><category term="ai"/><category term="duckdb"/><category term="embeddings"/><category term="drew-breunig"/><category term="overture"/><category term="ollama"/></entry><entry><title>An example running DuckDB in ChatGPT Code Interpreter</title><link href="https://simonwillison.net/2024/Jul/17/duckdb-in-chatgpt-code-interpreter/#atom-tag" rel="alternate"/><published>2024-07-17T21:04:27+00:00</published><updated>2024-07-17T21:04:27+00:00</updated><id>https://simonwillison.net/2024/Jul/17/duckdb-in-chatgpt-code-interpreter/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://chatgpt.com/share/de75e15e-d990-4c4e-b168-9f0390516dbe"&gt;An example running DuckDB in ChatGPT Code Interpreter&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I confirmed today that DuckDB can indeed be run inside ChatGPT Code Interpreter (aka "data analysis"), provided you upload the correct wheel file for it to install. The wheel file it needs is currently &lt;code&gt;duckdb-1.0.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl&lt;/code&gt; from the &lt;a href="https://pypi.org/project/duckdb/#files"&gt;PyPI releases page&lt;/a&gt; - I asked ChatGPT to identify its platform, and it said that it needs &lt;code&gt;manylinux2014_x86_64.whl&lt;/code&gt; wheels.&lt;/p&gt;
&lt;p&gt;Once the wheel is installed, ChatGPT already knows enough of the DuckDB API to start performing useful operations with it - and any brand new features in 1.0 will work if you tell it how to use them.
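&lt;p&gt;The install step inside Code Interpreter is just pip pointed at the uploaded file - something like this, assuming the upload landed in the usual &lt;code&gt;/mnt/data&lt;/code&gt; directory:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pip install /mnt/data/duckdb-1.0.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
&lt;/code&gt;&lt;/pre&gt;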

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/simonw/status/1813678487573852594"&gt;@simonw&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/duckdb"&gt;duckdb&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatgpt"&gt;chatgpt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/code-interpreter"&gt;code-interpreter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="duckdb"/><category term="generative-ai"/><category term="chatgpt"/><category term="llms"/><category term="code-interpreter"/><category term="coding-agents"/></entry><entry><title>Using DuckDB for Embeddings and Vector Search</title><link href="https://simonwillison.net/2024/Jun/15/duckdb-for-embeddings/#atom-tag" rel="alternate"/><published>2024-06-15T14:39:18+00:00</published><updated>2024-06-15T14:39:18+00:00</updated><id>https://simonwillison.net/2024/Jun/15/duckdb-for-embeddings/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.brunk.io/posts/similarity-search-with-duckdb/"&gt;Using DuckDB for Embeddings and Vector Search&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Sören Brunk's comprehensive tutorial combining DuckDB 1.0, a subset of German Wikipedia from Hugging Face (loaded using Parquet), the &lt;a href="https://huggingface.co/BAAI/bge-m3"&gt;BGE M3&lt;/a&gt; embedding model and DuckDB's &lt;a href="https://duckdb.org/2024/05/03/vector-similarity-search-vss.html"&gt;new vss extension&lt;/a&gt; for implementing an HNSW vector index.
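&lt;p&gt;The core of the &lt;code&gt;vss&lt;/code&gt; approach looks something like this - a toy 3-dimensional sketch, where Sören's tutorial uses real 1024-dimensional BGE M3 embeddings:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;INSTALL vss;
LOAD vss;
CREATE TABLE docs (id INTEGER, embedding FLOAT[3]);
-- Build an HNSW index over the fixed-length array column:
CREATE INDEX docs_hnsw ON docs USING HNSW (embedding);
-- Nearest neighbours by Euclidean distance:
SELECT id FROM docs
ORDER BY array_distance(embedding, [0.1, 0.2, 0.3]::FLOAT[3])
LIMIT 5;
&lt;/code&gt;&lt;/pre&gt;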

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/soebrunk/status/1801631086386012453"&gt;@soebrunk&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parquet"&gt;parquet&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/duckdb"&gt;duckdb&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vector-search"&gt;vector-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/embeddings"&gt;embeddings&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="parquet"/><category term="duckdb"/><category term="vector-search"/><category term="embeddings"/></entry><entry><title>DuckDB 1.0</title><link href="https://simonwillison.net/2024/Jun/3/duckdb-10/#atom-tag" rel="alternate"/><published>2024-06-03T13:23:50+00:00</published><updated>2024-06-03T13:23:50+00:00</updated><id>https://simonwillison.net/2024/Jun/3/duckdb-10/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://duckdb.org/2024/06/03/announcing-duckdb-100"&gt;DuckDB 1.0&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Six years in the making. The most significant feature in this milestone is stability of the file format: previous releases often required files to be upgraded to work with the new version.&lt;/p&gt;

&lt;p&gt;This release also aspires to provide stability for both the SQL dialect and the C API, though these may still change with sufficient warning in the future.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/duckdb/status/1797619191341551969"&gt;@duckdb&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/databases"&gt;databases&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sql"&gt;sql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/duckdb"&gt;duckdb&lt;/a&gt;&lt;/p&gt;



</summary><category term="databases"/><category term="sql"/><category term="duckdb"/></entry><entry><title>DuckDB as the New jq</title><link href="https://simonwillison.net/2024/Mar/21/duckdb-as-the-new-jq/#atom-tag" rel="alternate"/><published>2024-03-21T20:36:20+00:00</published><updated>2024-03-21T20:36:20+00:00</updated><id>https://simonwillison.net/2024/Mar/21/duckdb-as-the-new-jq/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.pgrs.net/2024/03/21/duckdb-as-the-new-jq/"&gt;DuckDB as the New jq&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
The DuckDB CLI tool can query JSON files directly, making it a surprisingly effective replacement for jq. Paul Gross demonstrates the following query:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;select license-&amp;gt;&amp;gt;'key' as license, count(*) from 'repos.json' group by 1&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;repos.json&lt;/code&gt; contains an array of &lt;code&gt;{"license": {"key": "apache-2.0"}..}&lt;/code&gt; objects. This example query shows counts for each of those licenses.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://lobste.rs/s/x5immj/duckdb_as_new_jq"&gt;lobste.rs&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sql"&gt;sql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jq"&gt;jq&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/duckdb"&gt;duckdb&lt;/a&gt;&lt;/p&gt;



</summary><category term="cli"/><category term="sql"/><category term="jq"/><category term="duckdb"/></entry><entry><title>Announcing DuckDB 0.10.0</title><link href="https://simonwillison.net/2024/Feb/13/duckdb-0100/#atom-tag" rel="alternate"/><published>2024-02-13T17:57:17+00:00</published><updated>2024-02-13T17:57:17+00:00</updated><id>https://simonwillison.net/2024/Feb/13/duckdb-0100/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://duckdb.org/2024/02/13/announcing-duckdb-0100.html"&gt;Announcing DuckDB 0.10.0&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Somewhat buried in this announcement: DuckDB has Fixed-Length Arrays now, along with &lt;code&gt;array_cross_product(a1, a2)&lt;/code&gt;, &lt;code&gt;array_cosine_similarity(a1, a2)&lt;/code&gt; and &lt;code&gt;array_inner_product(a1, a2)&lt;/code&gt; functions.&lt;/p&gt;
&lt;p&gt;This means you can now use DuckDB to find related content (and other tricks) using vector embeddings!&lt;/p&gt;
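&lt;p&gt;A toy example of the new functions (values invented):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT array_cosine_similarity(
  [1.0, 2.0, 3.0]::FLOAT[3],
  [2.0, 4.0, 6.0]::FLOAT[3]
) AS similarity;
-- parallel vectors give similarity = 1.0
&lt;/code&gt;&lt;/pre&gt;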
&lt;p&gt;Also notable:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;DuckDB can now attach MySQL, Postgres, and SQLite databases in addition to databases stored in its own format. This allows data to be read into DuckDB and moved between these systems in a convenient manner, as attached databases are fully functional, appear just as regular tables, and can be updated in a safe, transactional manner.&lt;/p&gt;
&lt;/blockquote&gt;
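&lt;p&gt;That looks something like this - connection strings are illustrative, and the Postgres and MySQL cases need their respective extensions installed:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ATTACH 'other.db' AS sqlite_db (TYPE sqlite);
ATTACH 'dbname=mydb' AS pg (TYPE postgres);
SELECT * FROM sqlite_db.some_table;
&lt;/code&gt;&lt;/pre&gt;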


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/databases"&gt;databases&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mysql"&gt;mysql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/postgresql"&gt;postgresql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sql"&gt;sql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/duckdb"&gt;duckdb&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/embeddings"&gt;embeddings&lt;/a&gt;&lt;/p&gt;



</summary><category term="databases"/><category term="mysql"/><category term="postgresql"/><category term="sql"/><category term="sqlite"/><category term="duckdb"/><category term="embeddings"/></entry><entry><title>Fastest Way to Read Excel in Python</title><link href="https://simonwillison.net/2024/Jan/3/fastest-way-to-read-excel-in-python/#atom-tag" rel="alternate"/><published>2024-01-03T20:04:00+00:00</published><updated>2024-01-03T20:04:00+00:00</updated><id>https://simonwillison.net/2024/Jan/3/fastest-way-to-read-excel-in-python/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://hakibenita.com/fast-excel-python"&gt;Fastest Way to Read Excel in Python&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Haki Benita produced a meticulously researched and written exploration of the options for reading a large Excel spreadsheet into Python. He explored Pandas, Tablib, Openpyxl, shelling out to LibreOffice, DuckDB and python-calamine (a Python wrapper of a Rust library). Calamine was the winner, taking 3.58s to read 500,000 rows—compared to Pandas in last place at 32.98s.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://lobste.rs/s/5tugrd/fastest_way_read_excel_python"&gt;lobste.rs&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/excel"&gt;excel&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pandas"&gt;pandas&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rust"&gt;rust&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/duckdb"&gt;duckdb&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/haki-benita"&gt;haki-benita&lt;/a&gt;&lt;/p&gt;



</summary><category term="excel"/><category term="pandas"/><category term="python"/><category term="rust"/><category term="duckdb"/><category term="haki-benita"/></entry><entry><title>Online gradient descent written in SQL</title><link href="https://simonwillison.net/2023/Mar/7/online-gradient-descent-written-in-sql/#atom-tag" rel="alternate"/><published>2023-03-07T18:56:21+00:00</published><updated>2023-03-07T18:56:21+00:00</updated><id>https://simonwillison.net/2023/Mar/7/online-gradient-descent-written-in-sql/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://maxhalford.github.io/blog/ogd-in-sql/"&gt;Online gradient descent written in SQL&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Max Halford trains an online gradient descent model against two years of AAPL stock data using just a single advanced SQL query. He built this against DuckDB—I tried to replicate his query in SQLite and it almost worked, but it gave me a “recursive reference in a subquery” error that I was unable to resolve.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=35054786"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/machine-learning"&gt;machine-learning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sql"&gt;sql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/duckdb"&gt;duckdb&lt;/a&gt;&lt;/p&gt;



</summary><category term="machine-learning"/><category term="sql"/><category term="sqlite"/><category term="ai"/><category term="duckdb"/></entry><entry><title>Querying Postgres Tables Directly From DuckDB</title><link href="https://simonwillison.net/2022/Oct/3/duckdb-postgres/#atom-tag" rel="alternate"/><published>2022-10-03T14:27:40+00:00</published><updated>2022-10-03T14:27:40+00:00</updated><id>https://simonwillison.net/2022/Oct/3/duckdb-postgres/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://duckdb.org/2022/09/30/postgres-scanner.html"&gt;Querying Postgres Tables Directly From DuckDB&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I learned a lot of interesting PostgreSQL tricks from this write-up of the new DuckDB feature that allows it to run queries against PostgreSQL servers directly. It works using &lt;code&gt;COPY (SELECT ...) TO STDOUT (FORMAT binary)&lt;/code&gt;, which writes rows to the protocol stream in an efficient binary format - but it also splits the table being read into parallel fetches against page ranges, and uses &lt;code&gt;SET TRANSACTION SNAPSHOT ...&lt;/code&gt; in those parallel queries to ensure they all see the same transactional snapshot of the database.
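&lt;p&gt;The extension exposes a &lt;code&gt;postgres_scan&lt;/code&gt; table function taking a connection string, a schema and a table name - usage is roughly this (connection string and table name illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT count(*) FROM postgres_scan(
  'dbname=mydb host=localhost', 'public', 'my_table'
);
&lt;/code&gt;&lt;/pre&gt;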

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/__alexmonahan__/status/1575868420226097152"&gt;@__alexmonahan__&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/databases"&gt;databases&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/postgresql"&gt;postgresql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/duckdb"&gt;duckdb&lt;/a&gt;&lt;/p&gt;



</summary><category term="databases"/><category term="postgresql"/><category term="duckdb"/></entry><entry><title>Notes on the SQLite DuckDB paper</title><link href="https://simonwillison.net/2022/Sep/1/sqlite-duckdb-paper/#atom-tag" rel="alternate"/><published>2022-09-01T17:33:14+00:00</published><updated>2022-09-01T17:33:14+00:00</updated><id>https://simonwillison.net/2022/Sep/1/sqlite-duckdb-paper/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;a href="https://vldb.org/pvldb/volumes/15/paper/SQLite%3A%20Past%2C%20Present%2C%20and%20Future"&gt;SQLite: Past, Present, and Future&lt;/a&gt; is a newly published paper authored by Kevin P. Gaffney, Martin Prammer and Jignesh M. Patel from the University of Wisconsin-Madison and D. Richard Hipp, Larry Brasfield and Dan Kennedy from the core SQLite engineering team.&lt;/p&gt;
&lt;p&gt;The paper compares &lt;a href="https://www.sqlite.org/"&gt;SQLite&lt;/a&gt; and &lt;a href="https://duckdb.org/"&gt;DuckDB&lt;/a&gt;, and describes some optimization work to make SQLite perform better for analytical queries.&lt;/p&gt;
&lt;p&gt;DuckDB is a relatively new project which is frequently nicknamed (including by this paper) "SQLite for analytics". It shares many of the characteristics of SQLite: an embedded database implemented as a header file and implementation file (using C++ as opposed to SQLite's C) with databases that are themselves single binary files.&lt;/p&gt;
&lt;p&gt;The primary difference is that DuckDB is optimized for analytical queries: queries that apply aggregate calculations across large numbers of rows, rather than being optimized for fast scanning and lookup of individual rows of data.&lt;/p&gt;
&lt;p&gt;I've been &lt;a href="https://simonwillison.net/tags/duckdb/"&gt;tracking DuckDB for a while&lt;/a&gt; - it's a very exciting project. If I ever introduce alternative database backends to &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; it's top of my list of things to try.&lt;/p&gt;
&lt;p&gt;The paper spends quite a bit of time on benchmark comparisons. To very loosely summarize those:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;SQLite out-performs DuckDB on a write transactions benchmark by 10x-500x on a powerful cloud server and 2x-60x on a Raspberry Pi, for small to large databases.&lt;/li&gt;
&lt;li&gt;For analytical benchmarks using the SSB (&lt;a href="https://www.cs.umb.edu/~poneil/StarSchemaB.PDF"&gt;Star Schema Benchmark&lt;/a&gt;) DuckDB out-performs SQLite by 30-50x at the highest margin and 3-8x at the lowest.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Neither of these are particularly surprising: DuckDB isn't optimized for write transactions and SQLite isn't optimized for analytical queries.&lt;/p&gt;
&lt;p&gt;Next, the paper explores optimizations that can be applied to SQLite to improve its analytical query performance.&lt;/p&gt;
&lt;p&gt;One of these has already shipped! In February 2022 the &lt;a href="https://www.sqlite.org/releaselog/3_38_0.html"&gt;SQLite 3.38.0&lt;/a&gt; release notes included this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Use a Bloom filter to speed up large analytic queries.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I had wondered at the time what the deal with this was - the paper explains it in some detail:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A key change is made to the join processing, which is to probe the Bloom filters before carrying out the rest of the join. Applying the Bloom filters early in the join pipeline dramatically reduces the number of tuples that flow through the join pipeline, and thus improves performance.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This had a pretty big impact on the benchmark:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The performance impact of our optimizations is shown in Figure 6. On the Raspberry Pi, SQLite is now 4.2X faster on SSB. Our optimizations are particularly effective for query flight 2, resulting in 10X speedup. On the cloud server, we observed an overall speedup of 2.7X and individual query speedups up to 7X&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I decided to quote in full the first part of section 5, "future development", because it does a great job of describing some of the key features of the culture of SQLite itself.&lt;/p&gt;
&lt;blockquote&gt;
&lt;h4&gt;5 FUTURE DEVELOPMENT&lt;/h4&gt;
&lt;p&gt;The developers intend to provide support for SQLite through the year 2050, and design decisions are made accordingly. SQLite’s code and database file format are fully cross-platform, ensuring that SQLite can run on any current or future platform with an 8-bit byte, two’s complement 32-bit and 64-bit integers, and a C compiler. Every machine-code branch in the SQLite library is tested with multiple platforms and compilers, which makes the code robust for future migrations. SQLite is also extensively documented and commented, which helps new developers quickly understand SQLite’s architecture. Finally, the developers work hard to evaluate new programming trends based on merit rather than popularity.&lt;/p&gt;
&lt;p&gt;While the performance gap has narrowed as a result of this work, DuckDB is still considerably faster than SQLite on SSB. This is somewhat expected; SQLite is a general-purpose database engine, whereas DuckDB is designed from the ground up for efficient OLAP. Although SQLite’s OLAP performance could be further improved in future work, there are several constraints that potential modifications to SQLite must satisfy.&lt;/p&gt;
&lt;p&gt;First, modifications should cause no significant performance regression across the broad range of workloads served by SQLite. Second, the benefit of an optimization must be weighed against its impact on the size of the source code and the compiled library. Finally, modifications should not break SQLite’s backwards compatibility with previous versions and cross-compatibility with different machine architectures&lt;/p&gt;
&lt;p&gt;Although SQLite’s performance is a key priority, it must be balanced with these (sometimes competing) goals. We considered several means of improving value extraction in SQLite, but no single solution satisfied all the constraints above. For example, changing the data format from row-oriented to column-oriented would streamline value extraction, but it would also likely increase overhead for OLTP workloads. Moreover, drastic changes to the data format are at odds with SQLite’s goal of stability for the database file format&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That section continues with this intriguing detail:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;An alternative approach to improving SQLite’s OLAP performance is a separate, yet tightly connected query engine that evaluates analytical queries on its own copy of the data, while SQLite continues to serve transactional requests, ensuring that the analytical engine stays up to date with the freshest data. If the extra space overhead is acceptable, the specialized analytical engine can provide substantial OLAP performance gains.&lt;/p&gt;
&lt;p&gt;This design has been successfully implemented in SQLite3/HE [46], a query acceleration path for analytics in SQLite. SQLite3/HE achieves speedups of over 100X on SSB with no degradation in OLTP performance. However, the current implementation of SQLite3/HE does not persist columnar data to storage and is designed to be used in a single process. Future work may explore similar approaches without these limitations.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is the first I had heard of SQLite3/HE. It's described in this paper: &lt;a href="https://www.cidrdb.org/cidr2022/papers/p56-prammer.pdf"&gt;Introducing a Query Acceleration Path for Analytics in SQLite3&lt;/a&gt; - by Martin Prammer, Suryadev Sahadevan Rajesh, Junda Chen, Jignesh M. Patel from the University of Wisconsin-Madison. Martin and Jignesh are both credited as authors on the SQLite/DuckDB paper.&lt;/p&gt;

&lt;p&gt;It's also the subject of this &lt;a href="https://www.youtube.com/watch?v=c9bQyzm6JRU"&gt;12 minute video&lt;/a&gt; from the CIDR DB conference.&lt;/p&gt;

&lt;p&gt;It's not clear to me if the code for this has been made available yet. I'd be fascinated to learn more about this project.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/databases"&gt;databases&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/duckdb"&gt;duckdb&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/d-richard-hipp"&gt;d-richard-hipp&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/paper-review"&gt;paper-review&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="databases"/><category term="sqlite"/><category term="duckdb"/><category term="d-richard-hipp"/><category term="paper-review"/></entry><entry><title>DuckDB-Wasm: Efficient Analytical SQL in the Browser</title><link href="https://simonwillison.net/2021/Oct/29/duckdb-wasm-efficient-analytical-sql-in-the-browser/#atom-tag" rel="alternate"/><published>2021-10-29T15:25:43+00:00</published><updated>2021-10-29T15:25:43+00:00</updated><id>https://simonwillison.net/2021/Oct/29/duckdb-wasm-efficient-analytical-sql-in-the-browser/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://duckdb.org/2021/10/29/duckdb-wasm.html"&gt;DuckDB-Wasm: Efficient Analytical SQL in the Browser&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
First SQLite, now DuckDB: options for running database engines in the browser using WebAssembly keep on growing. DuckDB means browsers now have a fast, intuitive mechanism for querying Parquet files too. This also supports the same HTTP Range header trick as the SQLite demo from a while back, meaning it can query large databases loaded over HTTP without downloading the whole file.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/hfmuehleisen/status/1454100577730838529"&gt;@hfmuehleisen&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parquet"&gt;parquet&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/webassembly"&gt;webassembly&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/duckdb"&gt;duckdb&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/http-range-requests"&gt;http-range-requests&lt;/a&gt;&lt;/p&gt;



</summary><category term="sqlite"/><category term="parquet"/><category term="webassembly"/><category term="duckdb"/><category term="http-range-requests"/></entry><entry><title>Querying Parquet using DuckDB</title><link href="https://simonwillison.net/2021/Jun/25/querying-parquet-using-duckdb/#atom-tag" rel="alternate"/><published>2021-06-25T22:40:27+00:00</published><updated>2021-06-25T22:40:27+00:00</updated><id>https://simonwillison.net/2021/Jun/25/querying-parquet-using-duckdb/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://duckdb.org/2021/06/25/querying-parquet.html"&gt;Querying Parquet using DuckDB&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
DuckDB is a relatively new SQLite-style database (released as an embeddable library) with a focus on analytical queries. This tutorial really made the benefits click for me: it ships with support for the Parquet columnar data format, and you can use it to execute SQL queries directly against Parquet files—e.g. “SELECT COUNT(*) FROM ’taxi_2019_04.parquet’”. Performance against large files is fantastic, and the whole thing can be installed just using “pip install duckdb”. I wonder if faceting-style group/count queries (pretty expensive with regular RDBMSs) could be sped up with this?
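&lt;p&gt;A facet-style query against that same file would look something like this - the &lt;code&gt;passenger_count&lt;/code&gt; column is an assumption about the taxi dataset's schema:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT passenger_count, count(*) AS n
FROM 'taxi_2019_04.parquet'
GROUP BY passenger_count
ORDER BY n DESC;
&lt;/code&gt;&lt;/pre&gt;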

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=27634840"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parquet"&gt;parquet&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/duckdb"&gt;duckdb&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="parquet"/><category term="duckdb"/></entry><entry><title>DuckDB</title><link href="https://simonwillison.net/2020/Sep/19/duckdb/#atom-tag" rel="alternate"/><published>2020-09-19T23:43:41+00:00</published><updated>2020-09-19T23:43:41+00:00</updated><id>https://simonwillison.net/2020/Sep/19/duckdb/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://duckdb.org/"&gt;DuckDB&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
This is a really interesting, relatively new database. It’s kind of a weird hybrid between SQLite and PostgreSQL: it uses the PostgreSQL parser but models itself after SQLite in that databases are a single file and the code is designed for use as an embedded library, distributed in a single amalgamation C++ file (SQLite uses a C amalgamation). It features a “columnar-vectorized query execution engine” inspired by MonetDB (also by the DuckDB authors) and is hence designed to run analytical queries really quickly. You can install it using “pip install duckdb”—the resulting module feels similar to Python’s sqlite3, and follows roughly the same DBAPI pattern.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://github.com/simonw/datasette/discussions/968"&gt;Any thoughts/future plans for using DuckDB?&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/databases"&gt;databases&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/postgresql"&gt;postgresql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/duckdb"&gt;duckdb&lt;/a&gt;&lt;/p&gt;



</summary><category term="databases"/><category term="postgresql"/><category term="sqlite"/><category term="duckdb"/></entry></feed>