<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: parquet</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/parquet.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-10-11T03:47:16+00:00</updated><author><name>Simon Willison</name></author><entry><title>An MVCC-like columnar table on S3 with constant-time deletes</title><link href="https://simonwillison.net/2025/Oct/11/mvcc-s3/#atom-tag" rel="alternate"/><published>2025-10-11T03:47:16+00:00</published><updated>2025-10-11T03:47:16+00:00</updated><id>https://simonwillison.net/2025/Oct/11/mvcc-s3/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.shayon.dev/post/2025/277/an-mvcc-like-columnar-table-on-s3-with-constant-time-deletes/"&gt;An MVCC-like columnar table on S3 with constant-time deletes&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
S3's support for conditional writes (&lt;a href="https://simonwillison.net/2024/Nov/26/s3-conditional-writes/"&gt;previously&lt;/a&gt;) makes it an interesting, scalable and often inexpensive platform for all kinds of database patterns.&lt;/p&gt;
&lt;p&gt;Shayon Mukherjee presents an ingenious design for a Parquet-backed database in S3 which accepts concurrent writes, presents a single atomic view for readers and even supports reliable row deletion despite Parquet requiring a complete file rewrite in order to remove data.&lt;/p&gt;
&lt;p&gt;The key to the design is a &lt;code&gt;_latest_manifest&lt;/code&gt; JSON file at the top of the bucket, containing an integer version number. Clients use compare-and-swap to increment that version - only one client can succeed at this, so the incremented version they get back is guaranteed unique to them.&lt;/p&gt;
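&lt;p&gt;Here's a minimal sketch of that compare-and-swap loop, using an in-memory dictionary standing in for S3. The real implementation would use S3's conditional PUT support (boto3's &lt;code&gt;IfMatch&lt;/code&gt;/&lt;code&gt;IfNoneMatch&lt;/code&gt; parameters) - the class and function names here are otherwise hypothetical:&lt;/p&gt;

```python
import json

class FakeS3:
    """In-memory stand-in for S3's compare-and-swap semantics.
    Real S3 does this with conditional PUTs (If-Match on the ETag)."""
    def __init__(self):
        self.objects = {}  # key: (etag, body)
        self.counter = 0

    def get(self, key):
        return self.objects.get(key)  # (etag, body) or None

    def put_if_match(self, key, body, expected_etag):
        """Write only if the stored ETag still matches; returns None on a lost race."""
        current = self.objects.get(key)
        current_etag = current[0] if current else None
        if current_etag != expected_etag:
            return None
        self.counter += 1
        etag = "etag-%d" % self.counter
        self.objects[key] = (etag, body)
        return etag

def claim_version(s3):
    """Atomically increment _latest_manifest; the returned version is unique to this client."""
    while True:
        current = s3.get("_latest_manifest")
        etag = current[0] if current else None
        version = (json.loads(current[1])["version"] if current else 0) + 1
        if s3.put_if_match("_latest_manifest", json.dumps({"version": version}), etag):
            return version

s3 = FakeS3()
print(claim_version(s3), claim_version(s3))  # 1 2
```

&lt;p&gt;A client whose conditional PUT fails simply re-reads the manifest and retries with the new ETag.&lt;/p&gt;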
&lt;p&gt;Having reserved a version number the client can write a unique manifest file for that version - &lt;code&gt;manifest/v00000123.json&lt;/code&gt; - with a more complex data structure referencing the current versions of every persisted file, including the one they just uploaded.&lt;/p&gt;
&lt;p&gt;Deleted rows are written to tombstone files as either a list of primary keys or a list of ranges. Clients consult these when executing reads, filtering out deleted rows as part of resolving a query.&lt;/p&gt;
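&lt;p&gt;A sketch of that read-side filtering - the tombstone file shapes here are invented for illustration, but they capture the two cases the post describes (explicit keys and key ranges):&lt;/p&gt;

```python
# Merge tombstone entries (explicit primary keys and/or key ranges)
# into a single membership check applied while resolving a query.
def build_deleted_check(tombstones):
    keys = set()
    ranges = []
    for t in tombstones:
        keys.update(t.get("keys", []))
        if "range" in t:
            ranges.append(t["range"])
    def is_deleted(pk):
        # pk is deleted if it appears explicitly or falls inside any range
        return pk in keys or any(pk in range(lo, hi + 1) for lo, hi in ranges)
    return is_deleted

is_deleted = build_deleted_check([{"keys": [2, 5]}, {"range": (7, 8)}])
live = [pk for pk in range(10) if not is_deleted(pk)]
print(live)  # [0, 1, 3, 4, 6, 9]
```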
&lt;p&gt;The pricing estimates are especially noteworthy:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;For a workload ingesting 6 TB/day with 2 TB of deletes and 50K queries/day:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;PUT requests: ~380K/day (≈4 req/s) = $1.88/day&lt;/li&gt;
&lt;li&gt;GET requests: highly variable, depends on partitioning effectiveness&lt;ul&gt;
&lt;li&gt;Best case (good time-based partitioning): ~100K-200K/day = $0.04-$0.08/day&lt;/li&gt;
&lt;li&gt;Worst case (poor partitioning, scanning many files): ~2M/day = $0.80/day&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;~$3/day for ingesting 6TB of data is pretty fantastic!&lt;/p&gt;
&lt;p&gt;Watch out for storage costs though - each new TB of data at $0.023/GB/month adds $23.55 to the ongoing monthly bill.
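&lt;p&gt;Those figures line up with S3's us-east-1 list prices (assumed here: $0.005 per 1,000 PUTs, $0.0004 per 1,000 GETs, $0.023 per GB-month of standard storage):&lt;/p&gt;

```python
# Re-deriving the quoted daily and monthly costs from S3 list prices
put_cost = 380_000 / 1_000 * 0.005      # about $1.90/day, matching the quoted ~$1.88
get_worst = 2_000_000 / 1_000 * 0.0004  # $0.80/day in the poor-partitioning case
tb_month = 1024 * 0.023                 # $23.55 per TB per month of storage
print(round(put_cost, 2), round(get_worst, 2), round(tb_month, 2))  # 1.9 0.8 23.55
```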

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=45493158"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/aws"&gt;aws&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/databases"&gt;databases&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parquet"&gt;parquet&lt;/a&gt;&lt;/p&gt;



</summary><category term="aws"/><category term="databases"/><category term="s3"/><category term="parquet"/></entry><entry><title>OpenTimes</title><link href="https://simonwillison.net/2025/Mar/17/opentimes/#atom-tag" rel="alternate"/><published>2025-03-17T22:49:59+00:00</published><updated>2025-03-17T22:49:59+00:00</updated><id>https://simonwillison.net/2025/Mar/17/opentimes/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://sno.ws/opentimes/"&gt;OpenTimes&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Spectacular new open geospatial project by &lt;a href="https://sno.ws/"&gt;Dan Snow&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;OpenTimes is a database of pre-computed, point-to-point travel times between United States Census geographies. It lets you download bulk travel time data for free and with no limits.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://opentimes.org/?id=060816135022&amp;amp;mode=car#9.76/37.5566/-122.3085"&gt;what I get&lt;/a&gt; for travel times by car from El Granada, California:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Isochrone map showing driving times from the El Granada census tract to other places in the San Francisco Bay Area" src="https://static.simonwillison.net/static/2025/opentimes.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;The technical details are &lt;em&gt;fascinating&lt;/em&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;The entire OpenTimes backend is just static Parquet files on &lt;a href="https://www.cloudflare.com/developer-platform/products/r2/"&gt;Cloudflare's R2&lt;/a&gt;. There's no RDBMS or running service, just files and a CDN. The whole thing costs about $10/month to host and costs nothing to serve. In my opinion, this is a &lt;em&gt;great&lt;/em&gt; way to serve infrequently updated, large public datasets at low cost (as long as you partition the files correctly).&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Sure enough, &lt;a href="https://developers.cloudflare.com/r2/pricing/"&gt;R2 pricing&lt;/a&gt; charges "based on the total volume of data stored" - $0.015 / GB-month for standard storage, then $0.36 / million requests for "Class B" operations which include reads. They charge nothing for outbound bandwidth.&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;All travel times were calculated by pre-building the inputs (OSM, OSRM networks) and then distributing the compute over &lt;a href="https://github.com/dfsnow/opentimes/actions/workflows/calculate-times.yaml"&gt;hundreds of GitHub Actions jobs&lt;/a&gt;. This worked shockingly well for this specific workload (and was also completely free).&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's a &lt;a href="https://github.com/dfsnow/opentimes/actions/runs/13094249792"&gt;GitHub Actions run&lt;/a&gt; of the &lt;a href="https://github.com/dfsnow/opentimes/blob/a6a5f7abcdd69559b3e29f360fe0ff0399dbb400/.github/workflows/calculate-times.yaml#L78-L80"&gt;calculate-times.yaml workflow&lt;/a&gt; which uses a matrix to run 255 jobs!&lt;/p&gt;
&lt;p&gt;&lt;img alt="GitHub Actions run: calculate-times.yaml run by workflow_dispatch taking 1h49m to execute 255 jobs with names like run-job (2020-01) " src="https://static.simonwillison.net/static/2025/opentimes-github-actions.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;Relevant YAML:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;  matrix:
    year: ${{ fromJSON(needs.setup-jobs.outputs.years) }}
    state: ${{ fromJSON(needs.setup-jobs.outputs.states) }}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Where those JSON files were created by the previous step, which reads in the year and state values from &lt;a href="https://github.com/dfsnow/opentimes/blob/a6a5f7abcdd69559b3e29f360fe0ff0399dbb400/data/params.yaml#L72-L132"&gt;this params.yaml file&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;The query layer uses a single DuckDB database file with &lt;em&gt;views&lt;/em&gt; that point to static Parquet files via HTTP. This lets you query a table with hundreds of billions of records after downloading just the ~5MB pointer file.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is a really creative use of DuckDB's feature that lets you run queries against large data from a laptop using HTTP range queries to avoid downloading the whole thing.&lt;/p&gt;
&lt;p&gt;The README shows &lt;a href="https://github.com/dfsnow/opentimes/blob/3439fa2c54af227e40997b4a5f55678739e0f6df/README.md#using-duckdb"&gt;how to use that from R and Python&lt;/a&gt; - I got this working in the &lt;code&gt;duckdb&lt;/code&gt; client (&lt;code&gt;brew install duckdb&lt;/code&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;INSTALL httpfs;
LOAD httpfs;
ATTACH 'https://data.opentimes.org/databases/0.0.1.duckdb' AS opentimes;

SELECT origin_id, destination_id, duration_sec
  FROM opentimes.public.times
  WHERE version = '0.0.1'
      AND mode = 'car'
      AND year = '2024'
      AND geography = 'tract'
      AND state = '17'
      AND origin_id LIKE '17031%' limit 10;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In answer to a question about adding public transit times &lt;a href="https://news.ycombinator.com/item?id=43392521#43393183"&gt;Dan said&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In the next year or so maybe. The biggest obstacles to adding public transit are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Collecting all the necessary scheduling data (e.g. GTFS feeds) for every transit system in the county. Not insurmountable since there are services that do this currently.&lt;/li&gt;
&lt;li&gt;Finding a routing engine that can compute nation-scale travel time matrices quickly. Currently, the two fastest open-source engines I've tried (OSRM and Valhalla) don't support public transit for matrix calculations and the engines that do support public transit (R5, OpenTripPlanner, etc.) are too slow.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://gtfs.org/"&gt;GTFS&lt;/a&gt; is a popular CSV-based format for sharing transit schedules - here's &lt;a href="https://gtfs.org/resources/data/"&gt;an official list&lt;/a&gt; of available feed directories.&lt;/p&gt;
&lt;p&gt;This whole project feels to me like a great example of the &lt;a href="https://simonwillison.net/2021/Jul/28/baked-data/"&gt;baked data&lt;/a&gt; architectural pattern in action.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=43392521"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/census"&gt;census&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/geospatial"&gt;geospatial&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/open-data"&gt;open-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openstreetmap"&gt;openstreetmap&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cloudflare"&gt;cloudflare&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parquet"&gt;parquet&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/baked-data"&gt;baked-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/duckdb"&gt;duckdb&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/http-range-requests"&gt;http-range-requests&lt;/a&gt;&lt;/p&gt;



</summary><category term="census"/><category term="geospatial"/><category term="open-data"/><category term="openstreetmap"/><category term="cloudflare"/><category term="parquet"/><category term="github-actions"/><category term="baked-data"/><category term="duckdb"/><category term="http-range-requests"/></entry><entry><title>The Best Way to Use Text Embeddings Portably is With Parquet and Polars</title><link href="https://simonwillison.net/2025/Feb/24/text-embeddings-parquet/#atom-tag" rel="alternate"/><published>2025-02-24T23:58:28+00:00</published><updated>2025-02-24T23:58:28+00:00</updated><id>https://simonwillison.net/2025/Feb/24/text-embeddings-parquet/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://minimaxir.com/2025/02/embeddings-parquet/"&gt;The Best Way to Use Text Embeddings Portably is With Parquet and Polars&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Fantastic piece on embeddings by Max Woolf, who uses a 32,000-vector collection of Magic: the Gathering card embeddings to explore efficient ways of storing and processing them.&lt;/p&gt;
&lt;p&gt;Max advocates for the brute-force approach to nearest-neighbor calculations:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;What many don't know about text embeddings is that you don't &lt;em&gt;need&lt;/em&gt; a vector database to calculate nearest-neighbor similarity if your data isn't too large. Using &lt;a href="https://numpy.org/doc/stable/index.html"&gt;numpy&lt;/a&gt; and my Magic card embeddings, a 2D matrix of 32,254 &lt;code&gt;float32&lt;/code&gt; embeddings at a dimensionality of 768D (common for "smaller" LLM embedding models) occupies &lt;strong&gt;94.49 MB&lt;/strong&gt; of system memory, which is relatively low for modern personal computers and can fit within free usage tiers of cloud VMs.&lt;/p&gt;
&lt;/blockquote&gt;
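&lt;p&gt;That 94.49 MB figure is easy to verify - a &lt;code&gt;float32&lt;/code&gt; is 4 bytes, and the MB here are mebibytes (1,048,576 bytes):&lt;/p&gt;

```python
# 32,254 vectors x 768 dimensions x 4 bytes per float32 value
size_bytes = 32_254 * 768 * 4
print(round(size_bytes / 2**20, 2))  # 94.49
```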
&lt;p&gt;He uses this brilliant snippet of Python code to find the top K matches by distance:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np

def fast_dot_product(query, matrix, k=3):
    dot_products = query @ matrix.T
    idx = np.argpartition(dot_products, -k)[-k:]
    idx = idx[np.argsort(dot_products[idx])[::-1]]
    score = dot_products[idx]
    return idx, score
&lt;/code&gt;&lt;/pre&gt;

&lt;blockquote&gt;
&lt;p&gt;Since dot products are such a fundamental aspect of linear algebra, numpy's implementation is extremely fast: with the help of additional numpy &lt;a href="https://numpy.org/doc/stable/reference/generated/numpy.argpartition.html"&gt;sorting&lt;/a&gt; &lt;a href="https://numpy.org/doc/2.1/reference/generated/numpy.argsort.html"&gt;shenanigans&lt;/a&gt;, on my M3 Pro MacBook Pro it takes just &lt;strong&gt;1.08 ms&lt;/strong&gt; on average to calculate all 32,254 dot products, find the top 3 most similar embeddings, and return their corresponding &lt;code&gt;idx&lt;/code&gt; of the matrix and cosine similarity &lt;code&gt;score&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I ran that Python code through Claude 3.7 Sonnet for an explanation, which I can &lt;a href="https://claude.ai/share/51bde7eb-17ed-493c-b3ec-75c9c21c0c65"&gt;share here&lt;/a&gt; using their brand new "Share chat" feature. TIL about &lt;a href="https://numpy.org/doc/stable/reference/generated/numpy.argpartition.html"&gt;numpy.argpartition&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;He explores multiple options for efficiently storing these embedding vectors, finding that naive CSV storage takes 631.5 MB while pickle uses 94.49 MB and his preferred option, Parquet via &lt;a href="https://pola.rs/"&gt;Polars&lt;/a&gt;, uses &lt;a href="https://huggingface.co/datasets/minimaxir/mtg-embeddings/blob/main/mtg_embeddings.parquet"&gt;94.3 MB&lt;/a&gt; and enables some neat zero-copy optimization tricks.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parquet"&gt;parquet&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/max-woolf"&gt;max-woolf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/embeddings"&gt;embeddings&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="parquet"/><category term="max-woolf"/><category term="embeddings"/><category term="claude"/></entry><entry><title>S1: The $6 R1 Competitor?</title><link href="https://simonwillison.net/2025/Feb/5/s1-the-6-r1-competitor/#atom-tag" rel="alternate"/><published>2025-02-05T20:00:26+00:00</published><updated>2025-02-05T20:00:26+00:00</updated><id>https://simonwillison.net/2025/Feb/5/s1-the-6-r1-competitor/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://timkellogg.me/blog/2025/02/03/s1"&gt;S1: The $6 R1 Competitor?&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Tim Kellogg shares his notes on a new paper, &lt;a href="https://arxiv.org/abs/2501.19393"&gt;s1: Simple test-time scaling&lt;/a&gt;, which describes an inference-scaling model fine-tuned on top of Qwen2.5-32B-Instruct for just $6 - the cost for 26 minutes on 16 NVIDIA H100 GPUs.&lt;/p&gt;
&lt;p&gt;Tim highlights the most exciting result:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;After sifting their dataset of 56K examples down to just the best 1K, they found that the core 1K is all that's needed to achieve o1-preview performance on a 32B model.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The paper describes a technique called "Budget forcing":&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;To enforce a minimum, we suppress the generation
of the end-of-thinking token delimiter and optionally append
the string “Wait” to the model’s current reasoning trace to
encourage the model to reflect on its current generation&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That's the same trick Theia Vogel described &lt;a href="https://simonwillison.net/2025/Jan/22/r1py/"&gt;a few weeks ago&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here's the &lt;code&gt;s1-32B&lt;/code&gt; model &lt;a href="https://huggingface.co/simplescaling/s1-32B"&gt;on Hugging Face&lt;/a&gt;. I found a GGUF version of it at &lt;a href="https://huggingface.co/brittlewis12/s1-32B-GGUF"&gt;brittlewis12/s1-32B-GGUF&lt;/a&gt;, which I ran using &lt;a href="https://ollama.com/"&gt;Ollama&lt;/a&gt; like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama run hf.co/brittlewis12/s1-32B-GGUF:Q4_0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I also found those 1,000 samples on Hugging Face in the &lt;a href="https://huggingface.co/datasets/simplescaling/s1K"&gt;simplescaling/s1K&lt;/a&gt; data repository.&lt;/p&gt;
&lt;p&gt;I used DuckDB to convert the parquet file to CSV (and turn one &lt;code&gt;VARCHAR[]&lt;/code&gt; column into JSON):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;COPY (
    SELECT 
      solution,
      question,
      cot_type,
      source_type,
      metadata,
      cot,
      json_array(thinking_trajectories) as thinking_trajectories,
      attempt
    FROM 's1k-00001.parquet'
) TO 'output.csv' (HEADER, DELIMITER ',');
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then I loaded that CSV into &lt;a href="https://sqlite-utils.datasette.io/"&gt;sqlite-utils&lt;/a&gt; so I could use the &lt;code&gt;convert&lt;/code&gt; command to turn a Python data structure into JSON using &lt;code&gt;json.dumps()&lt;/code&gt; and &lt;code&gt;eval()&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Load into SQLite
sqlite-utils insert s1k.db s1k output.csv --csv
# Fix that column
sqlite-utils convert s1k.db s1k metadata 'json.dumps(eval(value))' --import json
# Dump that back out to CSV
sqlite-utils rows s1k.db s1k --csv &amp;gt; s1k.csv
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's that CSV &lt;a href="https://gist.github.com/simonw/048385f27e351c11b488bd9737452fa7"&gt;in a Gist&lt;/a&gt;, which means I can &lt;a href="https://lite.datasette.io/?install=datasette-pretty-json&amp;amp;csv=https://gist.githubusercontent.com/simonw/048385f27e351c11b488bd9737452fa7/raw/5270dacc5aa4a7385f9a6e3d691c81cf3595abc9/s1k.csv#/data/s1k?_facet=cot_type"&gt;load it into Datasette Lite&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of Datasette Lite showing cot_type 1, crossword 15, Link: 93, rowid: 93, solution: &amp;quot;### Answer: INCIDENT ROOM ### Explanation: Definition: investigators' facility **Anagram of**(... changes) NOTICED MINOR. Defn: ... in a police station.&amp;quot; Question text: &amp;quot;Solve the crossword puzzle. You are presented with a clue as input and the number of letters in brackets.&amp;quot; Clue: &amp;quot;Noticed minor changes in investigators' facility (8,4)&amp;quot; cot_type: crossword, source_type: 0xharib/xword1, metadata: { &amp;quot;instruction&amp;quot;: &amp;quot;You are an expert level solver of cryptic crosswords. You are presented with a clue as input. Respond with the answer and explanation.&amp;quot; }" src="https://static.simonwillison.net/static/2025/s1k.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;It really is a tiny amount of training data. It's mostly math and science, but there are also &lt;a href="https://lite.datasette.io/?install=datasette-pretty-json&amp;amp;csv=https://gist.githubusercontent.com/simonw/048385f27e351c11b488bd9737452fa7/raw/5270dacc5aa4a7385f9a6e3d691c81cf3595abc9/s1k.csv#/data/s1k?_facet=cot_type&amp;amp;cot_type=crossword"&gt;15 cryptic crossword examples&lt;/a&gt;.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parquet"&gt;parquet&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/duckdb"&gt;duckdb&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette-lite"&gt;datasette-lite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="datasette"/><category term="parquet"/><category term="duckdb"/><category term="datasette-lite"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="qwen"/><category term="ollama"/><category term="llm-reasoning"/><category term="ai-in-china"/></entry><entry><title>Foursquare Open Source Places: A new foundational dataset for the geospatial community</title><link href="https://simonwillison.net/2024/Nov/20/foursquare-open-source-places/#atom-tag" rel="alternate"/><published>2024-11-20T05:52:38+00:00</published><updated>2024-11-20T05:52:38+00:00</updated><id>https://simonwillison.net/2024/Nov/20/foursquare-open-source-places/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://location.foursquare.com/resources/blog/products/foursquare-open-source-places-a-new-foundational-dataset-for-the-geospatial-community/"&gt;Foursquare Open Source Places: A new foundational dataset for the geospatial community&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I did not expect this!&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[...] we are announcing today the general availability of a foundational open data set, Foursquare Open Source Places ("FSQ OS Places"). This base layer of 100mm+ global places of interest ("POI") includes 22 core attributes (see schema &lt;a href="https://docs.foursquare.com/data-products/docs/places-os-data-schema"&gt;here&lt;/a&gt;) that will be updated monthly and available for commercial use under the Apache 2.0 license framework.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The data is available &lt;a href="https://docs.foursquare.com/data-products/docs/access-fsq-os-places"&gt;as Parquet files&lt;/a&gt; hosted on Amazon S3.&lt;/p&gt;
&lt;p&gt;Here's how to list the available files:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;aws s3 ls s3://fsq-os-places-us-east-1/release/dt=2024-11-19/places/parquet/
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I got back &lt;code&gt;places-00000.snappy.parquet&lt;/code&gt; through &lt;code&gt;places-00024.snappy.parquet&lt;/code&gt;, each file around 455MB for a total of 10.6GB of data.&lt;/p&gt;
&lt;p&gt;I ran &lt;code&gt;duckdb&lt;/code&gt; and then used DuckDB's ability to remotely query Parquet on S3 to explore the data a bit more without downloading it to my laptop first:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;select count(*) from 's3://fsq-os-places-us-east-1/release/dt=2024-11-19/places/parquet/places-00000.snappy.parquet';
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This got back 4,180,424 - that number is similar for each file, suggesting around 104,000,000 records total.&lt;/p&gt;
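&lt;p&gt;The extrapolation is just per-file count times file count:&lt;/p&gt;

```python
# One file held 4,180,424 rows; 25 similarly-sized files suggests ~104.5M total
estimate = 4_180_424 * 25
print(estimate)  # 104510600
```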
&lt;p&gt;&lt;strong&gt;Update:&lt;/strong&gt; DuckDB can use wildcards in S3 paths (thanks, &lt;a href="https://mas.to/@paulbailey/113520325087085448"&gt;Paul&lt;/a&gt;) so this query provides an exact count:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;select count(*) from 's3://fsq-os-places-us-east-1/release/dt=2024-11-19/places/parquet/places-*.snappy.parquet';
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That returned 104,511,073 - and Activity Monitor on my Mac confirmed that DuckDB only needed to fetch 1.2MB of data to answer that query.&lt;/p&gt;
&lt;p&gt;I ran this query to retrieve 1,000 places from that first file as newline-delimited JSON:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;copy (
    select * from 's3://fsq-os-places-us-east-1/release/dt=2024-11-19/places/parquet/places-00000.snappy.parquet'
    limit 1000
) to '/tmp/places.json';
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/53ad57ad42c7efe75e2028d975907180"&gt;that places.json file&lt;/a&gt;, and here it is &lt;a href="https://lite.datasette.io/?json=https://gist.github.com/simonw/53ad57ad42c7efe75e2028d975907180#/data/raw"&gt;imported into Datasette Lite&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Finally, I got ChatGPT Code Interpreter to &lt;a href="https://chatgpt.com/share/673d7b92-0b4c-8006-a442-c5e6c2713d9c"&gt;convert that file to GeoJSON&lt;/a&gt; and pasted the result &lt;a href="https://gist.github.com/simonw/1e2a170b7368932ebd3922cb5d234924"&gt;into this Gist&lt;/a&gt;, giving me a map of those thousand places (because Gists automatically render GeoJSON):&lt;/p&gt;
&lt;p&gt;&lt;img alt="A map of the world with 1000 markers on it. A marker in Columbia shows a dialog for Raisbeck, Bogota Dv, Cra 47 A 114 05 Second Floor" src="https://static.simonwillison.net/static/2024/places-geojson.jpg" /&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://waxy.org/2024/11/foursquare-open-sources-its-places-database/"&gt;Andy Baio&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/geospatial"&gt;geospatial&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/foursquare"&gt;foursquare&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/geojson"&gt;geojson&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parquet"&gt;parquet&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/duckdb"&gt;duckdb&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette-lite"&gt;datasette-lite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/code-interpreter"&gt;code-interpreter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;&lt;/p&gt;



</summary><category term="geospatial"/><category term="open-source"/><category term="foursquare"/><category term="geojson"/><category term="parquet"/><category term="duckdb"/><category term="datasette-lite"/><category term="ai-assisted-programming"/><category term="code-interpreter"/><category term="coding-agents"/></entry><entry><title>Using DuckDB for Embeddings and Vector Search</title><link href="https://simonwillison.net/2024/Jun/15/duckdb-for-embeddings/#atom-tag" rel="alternate"/><published>2024-06-15T14:39:18+00:00</published><updated>2024-06-15T14:39:18+00:00</updated><id>https://simonwillison.net/2024/Jun/15/duckdb-for-embeddings/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.brunk.io/posts/similarity-search-with-duckdb/"&gt;Using DuckDB for Embeddings and Vector Search&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Sören Brunk's comprehensive tutorial combining DuckDB 1.0, a subset of German Wikipedia from Hugging Face (loaded using Parquet), the &lt;a href="https://huggingface.co/BAAI/bge-m3"&gt;BGE M3&lt;/a&gt; embedding model and DuckDB's &lt;a href="https://duckdb.org/2024/05/03/vector-similarity-search-vss.html"&gt;new vss extension&lt;/a&gt; for implementing an HNSW vector index.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/soebrunk/status/1801631086386012453"&gt;@soebrunk&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parquet"&gt;parquet&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/duckdb"&gt;duckdb&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vector-search"&gt;vector-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/embeddings"&gt;embeddings&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="parquet"/><category term="duckdb"/><category term="vector-search"/><category term="embeddings"/></entry><entry><title>Fleet Context</title><link href="https://simonwillison.net/2023/Nov/15/fleet-context/#atom-tag" rel="alternate"/><published>2023-11-15T22:20:15+00:00</published><updated>2023-11-15T22:20:15+00:00</updated><id>https://simonwillison.net/2023/Nov/15/fleet-context/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/fleet-ai/context"&gt;Fleet Context&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
This project took the source code and documentation for 1221 popular Python libraries and ran them through the OpenAI text-embedding-ada-002 embedding model, then made those pre-calculated embedding vectors available as Parquet files for download from S3 or via a custom Python CLI tool.&lt;/p&gt;

&lt;p&gt;I haven’t seen many projects release pre-calculated embeddings like this, it’s an interesting initiative.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parquet"&gt;parquet&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/embeddings"&gt;embeddings&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="ai"/><category term="parquet"/><category term="llms"/><category term="embeddings"/></entry><entry><title>Overture Maps Foundation Releases Its First World-Wide Open Map Dataset</title><link href="https://simonwillison.net/2023/Jul/27/overture-maps/#atom-tag" rel="alternate"/><published>2023-07-27T16:45:09+00:00</published><updated>2023-07-27T16:45:09+00:00</updated><id>https://simonwillison.net/2023/Jul/27/overture-maps/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://overturemaps.org/overture-maps-foundation-releases-first-world-wide-open-map-dataset/"&gt;Overture Maps Foundation Releases Its First World-Wide Open Map Dataset&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
The Overture Maps Foundation is a collaboration led by Amazon, Meta, Microsoft and TomTom dedicated to producing “reliable, easy-to-use, and interoperable open map data”.&lt;/p&gt;

&lt;p&gt;Yesterday they put out their first release and it’s pretty astonishing: four different layers of geodata, covering Places of Interest (shops, restaurants, attractions etc), administrative boundaries, building outlines and transportation networks.&lt;/p&gt;

&lt;p&gt;The data is available as Parquet. I just downloaded the 8GB places dataset and can confirm that it contains 59 million listings from around the world—I filtered to just places in my local town and a spot check showed that recently opened businesses (last 12 months) were present and the details all looked accurate.&lt;/p&gt;

&lt;p&gt;The places data is licensed under “Community Data License Agreement – Permissive”, under which the only restriction appears to be that you must include that license when you further share the data.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/geospatial"&gt;geospatial&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/open-data"&gt;open-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parquet"&gt;parquet&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/meta"&gt;meta&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/overture"&gt;overture&lt;/a&gt;&lt;/p&gt;



</summary><category term="geospatial"/><category term="open-data"/><category term="parquet"/><category term="meta"/><category term="overture"/></entry><entry><title>Weeknotes: Parquet in Datasette Lite, various talks, more LLM hacking</title><link href="https://simonwillison.net/2023/Jun/4/parquet-in-datasette-lite/#atom-tag" rel="alternate"/><published>2023-06-04T21:14:27+00:00</published><updated>2023-06-04T21:14:27+00:00</updated><id>https://simonwillison.net/2023/Jun/4/parquet-in-datasette-lite/#atom-tag</id><summary type="html">
    &lt;p&gt;I've fallen a bit behind on my weeknotes. Here's a catchup for the last few weeks.&lt;/p&gt;
&lt;h4 id="parquet-datasette-lite"&gt;Parquet in Datasette Lite&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://lite.datasette.io/"&gt;Datasette Lite&lt;/a&gt; is my build of Datasette (a server-side Python web application) which runs entirely in the browser using WebAssembly and &lt;a href="https://pyodide.org/en/stable/"&gt;Pyodide&lt;/a&gt;. I recently added the ability to &lt;a href="https://github.com/simonw/datasette-lite/issues/67"&gt;directly load Parquet files over HTTP&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This required an upgrade to the underlying version of Pyodide, in order to use the WebAssembly compiled version of the &lt;a href="https://pypi.org/project/fastparquet/"&gt;fastparquet&lt;/a&gt; library. That upgrade was blocked by an &lt;code&gt;AttributeError: module 'os' has no attribute 'link'&lt;/code&gt; error, but Roman Yurchak &lt;a href="https://github.com/pyodide/pyodide/issues/3880#issuecomment-1560130092"&gt;showed me a workaround&lt;/a&gt; which unblocked me.&lt;/p&gt;
&lt;p&gt;So now the following works:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://lite.datasette.io/?parquet=https://github.com/Teradata/kylo/blob/master/samples/sample-data/parquet/userdata1.parquet"&gt;https://lite.datasette.io/?parquet=https://github.com/Teradata/kylo/blob/master/samples/sample-data/parquet/userdata1.parquet&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This will work with any URL to a Parquet file that is served with open CORS headers - files on GitHub (or in a GitHub Gist) get these headers automatically.&lt;/p&gt;
&lt;p&gt;Also new in Datasette Lite: the &lt;code&gt;?memory=1&lt;/code&gt; query string option, which starts Datasette Lite without loading any default demo databases. I added this to help me construct this demo for my new &lt;a href="https://github.com/simonw/datasette-sqlite-url-lite"&gt;datasette-sqlite-url-lite&lt;/a&gt; plugin:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://lite.datasette.io/?memory=1&amp;amp;install=datasette-sqlite-url-lite#/_memory?sql=select+'url_valid()'+as+fn%2C+url_valid(%3Aurl)+as+result%0Aunion+all%0Aselect+'url_scheme()'%2C+url_scheme(%3Aurl)%0Aunion+all%0Aselect+'url_host()'%2C+url_host(%3Aurl)%0Aunion+all%0Aselect+'url_path()'%2C+url_path(%3Aurl)%0Aunion+all%0Aselect+'url_fragment()'%2C+url_fragment(%3Aurl)%3B&amp;amp;url=https%3A%2F%2Fwww.sqlite.org%2Fvtab.html%23usage"&gt;https://lite.datasette.io/?memory=1&amp;amp;install=datasette-sqlite-url-lite#/_memory?sql=select+'url_valid()'+as+fn%2C+url_valid(%3Aurl)+as+result%0Aunion+all%0Aselect+'url_scheme()'%2C+url_scheme(%3Aurl)%0Aunion+all%0Aselect+'url_host()'%2C+url_host(%3Aurl)%0Aunion+all%0Aselect+'url_path()'%2C+url_path(%3Aurl)%0Aunion+all%0Aselect+'url_fragment()'%2C+url_fragment(%3Aurl)%3B&amp;amp;url=https%3A%2F%2Fwww.sqlite.org%2Fvtab.html%23usage&lt;/a&gt;&lt;/p&gt;
&lt;h4 id="datasette-sqlite-url-lite"&gt;datasette-sqlite-url-lite - mostly written by GPT-4&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://github.com/asg017/sqlite-url/tree/main/python/datasette_sqlite_url"&gt;datasette-sqlite-url&lt;/a&gt; is a really neat plugin by Alex Garcia which adds custom SQL functions to SQLite that allow you to parse URLs and extract their components.&lt;/p&gt;
&lt;p&gt;There's just one catch: the extension itself is written in C, and there isn't yet a version of it compiled for WebAssembly to work in Datasette Lite.&lt;/p&gt;
&lt;p&gt;I wanted to use some of the functions in it, so I decided to see if I could get a Pure Python alternative of it working. But this was a very low stakes project, so I decided to see if I could get GPT-4 to do essentially all of the work for me.&lt;/p&gt;
&lt;p&gt;I prompted it like this - copying and pasting the examples directly from Alex's documentation:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Write Python code to register the following SQLite custom functions:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;select url_valid('https://sqlite.org'); -- 1
select url_scheme('https://www.sqlite.org/vtab.html#usage'); -- 'https'
select url_host('https://www.sqlite.org/vtab.html#usage'); -- 'www.sqlite.org'
select url_path('https://www.sqlite.org/vtab.html#usage'); -- '/vtab.html'
select url_fragment('https://www.sqlite.org/vtab.html#usage'); -- 'usage'
&lt;/code&gt;&lt;/pre&gt;
&lt;/blockquote&gt;
&lt;p&gt;The code it produced was almost exactly what I needed.&lt;/p&gt;
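&lt;p&gt;For reference, all five functions can be implemented in a few lines of standard-library Python. This is my own sketch of the approach, not the exact code GPT-4 produced:&lt;/p&gt;

```python
import sqlite3
from urllib.parse import urlparse

def url_valid(url):
    # 1 if the URL has both a scheme and a host, otherwise 0
    parsed = urlparse(url)
    return 1 if parsed.scheme and parsed.netloc else 0

def url_scheme(url):
    return urlparse(url).scheme

def url_host(url):
    return urlparse(url).netloc

def url_path(url):
    return urlparse(url).path

def url_fragment(url):
    return urlparse(url).fragment

def register_url_functions(conn):
    # deterministic=True lets SQLite reuse results within a query
    for fn in (url_valid, url_scheme, url_host, url_path, url_fragment):
        conn.create_function(fn.__name__, 1, fn, deterministic=True)

conn = sqlite3.connect(":memory:")
register_url_functions(conn)
print(conn.execute(
    "select url_host('https://www.sqlite.org/vtab.html#usage')"
).fetchone()[0])  # www.sqlite.org
```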
&lt;p&gt;I wanted some tests too, so I prompted:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Write a suite of pytest tests for this&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This gave me the tests I needed - with one error in the way they called SQLite, but still doing 90% of the work for me.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://chat.openai.com/share/9a541ea9-eab7-4ea3-8b43-a521880dfd17"&gt;the full ChatGPT conversation&lt;/a&gt; and the &lt;a href="https://github.com/simonw/datasette-sqlite-url-lite/commit/14b2fefbf0b879d4c34e5961b70151564d31f7cc#diff-d741a233298e1ce8d45fc52005e9f9d7534c12b010e5d90a01da26979fff446e"&gt;resulting code I checked into the repo&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="various-talks"&gt;Various talks&lt;/h4&gt;
&lt;p&gt;Videos for three of my recent talks are now available on YouTube:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.youtube.com/watch?v=rsE0XhlPnug"&gt;Big Opportunities in Small Data&lt;/a&gt; is the keynote I gave at Citus Con: An Event for Postgres 2023 - talking about Datasette, SQLite and some tricks I would love to see the  PostgreSQL community adopt from the explorations I've been doing around small data.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.youtube.com/watch?v=zI43eaPc59Q"&gt;The Data Enthusiast's Toolkit&lt;/a&gt; is an hour long interview with Rizel Scarlett about both Datasette and my career to date. Frustratingly I had about 10 minutes of terrible microphone audio in the middle, but the conversation itself was really great.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.youtube.com/watch?v=5TdIxxBPUSI"&gt;Data analysis with SQLite and Python&lt;/a&gt; is a video from PyCon of the full 2hr45m tutorial I gave there last month. The handout notes for that are &lt;a href="https://sqlite-tutorial-pycon-2023.readthedocs.io/en/latest/"&gt;available online too&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I also spotted that the Changelog put up a video &lt;a href="https://www.youtube.com/watch?v=yayY-R4koPI"&gt;Just getting in to AI for development? Start here&lt;/a&gt; with an extract from our podcast episode &lt;a href="https://simonwillison.net/2023/Apr/8/llms-break-the-internet/"&gt;LLMs break the internet&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;Entries this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2023/Jun/4/closed-model-training/"&gt;It's infuriatingly hard to understand how closed models train on their input&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2023/May/30/chatgpt-inline-tips/"&gt;ChatGPT should include inline tips&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2023/May/27/lawyer-chatgpt/"&gt;Lawyer cites fake cases invented by ChatGPT, judge is not amused&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2023/May/18/cli-tools-for-llms/"&gt;llm, ttok and strip-tags - CLI tools for working with ChatGPT and other LLMs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2023/May/11/delimiters-wont-save-you/"&gt;Delimiters won't save you from prompt injection&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-sqlite-url-lite/releases/0.1"&gt;datasette-sqlite-url-lite 0.1&lt;/a&gt;&lt;/strong&gt; - 2023-05-26&lt;br /&gt;A pure Python alternative to sqlite-url ready to be used in Datasette Lite&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/sqlite-utils/releases/3.32.1"&gt;sqlite-utils 3.32.1&lt;/a&gt;&lt;/strong&gt; - 2023-05-21&lt;br /&gt;Python CLI utility and library for manipulating SQLite databases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/strip-tags/releases/0.3"&gt;strip-tags 0.3&lt;/a&gt;&lt;/strong&gt; - 2023-05-19&lt;br /&gt;CLI tool for stripping tags from HTML&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/ttok/releases/0.1"&gt;ttok 0.1&lt;/a&gt;&lt;/strong&gt; - 2023-05-18&lt;br /&gt;Count and truncate text based on tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/llm/releases/0.3"&gt;llm 0.3&lt;/a&gt;&lt;/strong&gt; - 2023-05-17&lt;br /&gt;Access large language models from the command-line&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://til.simonwillison.net/http/testing-cors-max-age"&gt;Testing the Access-Control-Max-Age CORS header&lt;/a&gt; - 2023-05-25&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://til.simonwillison.net/sqlite/comparing-datasets"&gt;Comparing two training datasets using sqlite-utils&lt;/a&gt; - 2023-05-23&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://til.simonwillison.net/llms/mlc-chat-redpajama"&gt;mlc-chat - RedPajama-INCITE-Chat-3B on macOS&lt;/a&gt; - 2023-05-22&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://til.simonwillison.net/misc/hexdump"&gt;hexdump and hexdump -C&lt;/a&gt; - 2023-05-22&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://til.simonwillison.net/datasette/baseline"&gt;Exploring Baseline with Datasette Lite&lt;/a&gt; - 2023-05-12&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/speaking"&gt;speaking&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tutorials"&gt;tutorials&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parquet"&gt;parquet&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette-lite"&gt;datasette-lite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="projects"/><category term="speaking"/><category term="tutorials"/><category term="datasette"/><category term="parquet"/><category term="weeknotes"/><category term="datasette-lite"/><category term="llms"/></entry><entry><title>Exploring the training data behind Stable Diffusion</title><link href="https://simonwillison.net/2022/Sep/5/laion-aesthetics-weeknotes/#atom-tag" rel="alternate"/><published>2022-09-05T00:18:42+00:00</published><updated>2022-09-05T00:18:42+00:00</updated><id>https://simonwillison.net/2022/Sep/5/laion-aesthetics-weeknotes/#atom-tag</id><summary type="html">
    &lt;p&gt;Two weeks ago, the Stable Diffusion image generation model was &lt;a href="https://stability.ai/blog/stable-diffusion-public-release"&gt;released to the public&lt;/a&gt;. I wrote about this last week, in &lt;a href="https://simonwillison.net/2022/Aug/29/stable-diffusion/"&gt;Stable Diffusion is a really big deal&lt;/a&gt; - a post which has since become one of the top ten results for "stable diffusion" on Google and shown up in all sorts of different places online.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://waxy.org"&gt;Andy Baio&lt;/a&gt; pinged me a week ago on Friday and asked if I'd be interested in collaborating with him on digging into the training data. The Stable Diffusion &lt;a href="https://github.com/CompVis/stable-diffusion/blob/main/Stable_Diffusion_v1_Model_Card.md"&gt;Model Card&lt;/a&gt; provides a detailed description of how the model was trained - primarily on the &lt;a href="https://huggingface.co/datasets/laion/laion2B-en"&gt;LAION 2B-en&lt;/a&gt;) dataset (a subset of &lt;a href="https://laion.ai/blog/laion-5b/"&gt;LAION 5B&lt;/a&gt;), with further emphasis given to images with higher calculated aesthetic scores.&lt;/p&gt;
&lt;p&gt;We ended up deciding to dig into the &lt;a href="https://huggingface.co/datasets/ChristophSchuhmann/improved_aesthetics_6plus"&gt;improved_aesthetics_6plus&lt;/a&gt; subset, which consists of 12 million images with an aesthetics score of 6 or higher.&lt;/p&gt;
&lt;p&gt;This isn't the full training set used for the model, but it's small enough that it fits comfortably in a SQLite database on inexpensive hosting...&lt;/p&gt;
&lt;p&gt;So I built a search engine, powered by &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update, 20th December 2023:&lt;/strong&gt; This search tool is no longer available.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;You can search for images by keyword using the following interface:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://laion-aesthetic.datasette.io/laion-aesthetic-6pls/images"&gt;laion-aesthetic.datasette.io/laion-aesthetic-6pls/images&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/laion-lemur.jpg" alt="Screenshot of the search interface, showing the results for lemur" style="max-width: 100%" /&gt;&lt;/p&gt;

&lt;p&gt;Or see a breakdown of image counts by the domain they were scraped from &lt;a href="https://laion-aesthetic.datasette.io/laion-aesthetic-6pls/domain?_sort_desc=image_counts"&gt;on this page&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The search engine provides access to 12,096,835 rows, and uses SQLite full-text search to power search across their text descriptions.&lt;/p&gt;
&lt;p&gt;Andy used this Datasette instance to conduct a thorough analysis of the underlying training data, which he wrote about in &lt;a href="https://waxy.org/2022/08/exploring-12-million-of-the-images-used-to-train-stable-diffusions-image-generator/"&gt;Exploring 12 Million of the 2.3 Billion Images Used to Train Stable Diffusion’s Image Generator&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This analysis has had a really huge impact! Stories mentioning it made the front page of the websites of both the New York Times and the Washington Post on the same day:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.washingtonpost.com/technology/2022/09/02/midjourney-artificial-intelligence-state-fair-colorado/"&gt;He used AI art from Midjourney to win a fine-arts prize. Did he cheat?&lt;/a&gt; - The Washington Post&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.nytimes.com/2022/09/02/technology/ai-artificial-intelligence-artists.html"&gt;An A.I.-Generated Picture Won an Art Prize. Artists Aren’t Happy.&lt;/a&gt; - The New York Times&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Further afield, we spotted coverage from publications that included:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.spiegel.de/netzwelt/web/stable-diffusion-verstoerend-gute-ki-kunst-und-jeder-kann-mitmischen-a-0bde407d-c0d5-474a-92fc-de2a99c01774"&gt;Diese Software macht Sie zum KI-Künstler&lt;/a&gt; - Der Spiegel. I get quoted in this one (a translated snippet from my blog at least).&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://gigazine.net/news/20220831-exploring-stable-diffusions/"&gt;23億枚もの画像で構成された画像生成AI「Stable Diffusion」のデータセットのうち1200万枚がどこから入手した画像かを調査した結果が公開される&lt;/a&gt; - Gigazine, a long-running (22 years old) Japanese online news magazine.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="how-i-built-the-database"&gt;How I built the database&lt;/h4&gt;
&lt;p&gt;The code for the Datasette instance can be found in &lt;a href="https://github.com/simonw/laion-aesthetic-datasette"&gt;this GitHub repository&lt;/a&gt;. The &lt;a href="https://github.com/simonw/laion-aesthetic-datasette/issues"&gt;issues&lt;/a&gt; in that repo contain a detailed record of the various steps I took to build the database.&lt;/p&gt;
&lt;p&gt;The data subset I loaded into the search engine is &lt;a href="https://huggingface.co/datasets/ChristophSchuhmann/improved_aesthetics_6plus"&gt;published on Hugging Face&lt;/a&gt; by Christoph Schuhmann. It consists of 7 Parquet files, each of which is 325MB and stored in the repository using Git LFS.&lt;/p&gt;
&lt;p&gt;The first step was to fetch that data.&lt;/p&gt;
&lt;p&gt;This was my first time running &lt;code&gt;git lfs&lt;/code&gt; - I had to install it first using:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;brew install git-lfs
git lfs install
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then I cloned the repo and fetched the data like this. Note that to make the actual files available in the directory you need to run both &lt;code&gt;git lfs fetch&lt;/code&gt; and &lt;code&gt;git lfs checkout&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;git clone https://huggingface.co/datasets/ChristophSchuhmann/improved_aesthetics_6plus
&lt;span class="pl-c1"&gt;cd&lt;/span&gt; improved_aesthetics_6plus
git lfs fetch
git lfs checkout&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The result is 7 Parquet files. I wanted to load these into SQLite.&lt;/p&gt;
&lt;p&gt;The first solution I found that worked was to use the &lt;a href="https://pypi.org/project/parquet-tools/"&gt;parquet-tools&lt;/a&gt; Python package:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pipx install parquet-tools
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I could then convert the parquet data to CSV like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;parquet-tools csv train-00002-of-00007-709151a2715d894d.parquet
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This outputs the contents of the file as CSV.&lt;/p&gt;
&lt;p&gt;Since this is a lot of data it made sense to create an empty SQLite table first (with columns with the correct column types) before inserting the data. I did that like so:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;sqlite3 laion-aesthetic-6pls.db &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;CREATE TABLE IF NOT EXISTS images (&lt;/span&gt;
&lt;span class="pl-s"&gt;   [url] TEXT,&lt;/span&gt;
&lt;span class="pl-s"&gt;   [text] TEXT,&lt;/span&gt;
&lt;span class="pl-s"&gt;   [width] INTEGER,&lt;/span&gt;
&lt;span class="pl-s"&gt;   [height] INTEGER,&lt;/span&gt;
&lt;span class="pl-s"&gt;   [similarity] FLOAT,&lt;/span&gt;
&lt;span class="pl-s"&gt;   [punsafe] FLOAT,&lt;/span&gt;
&lt;span class="pl-s"&gt;   [pwatermark] FLOAT,&lt;/span&gt;
&lt;span class="pl-s"&gt;   [aesthetic] FLOAT,&lt;/span&gt;
&lt;span class="pl-s"&gt;   [hash] TEXT,&lt;/span&gt;
&lt;span class="pl-s"&gt;   [__index_level_0__] INTEGER&lt;/span&gt;
&lt;span class="pl-s"&gt;);&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then I used a &lt;code&gt;bash&lt;/code&gt; loop to insert all of the data:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-smi"&gt;filename&lt;/span&gt; &lt;span class="pl-k"&gt;in&lt;/span&gt; &lt;span class="pl-k"&gt;*&lt;/span&gt;.parquet&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;do&lt;/span&gt;
    parquet-tools csv &lt;span class="pl-smi"&gt;$filename&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; sqlite3 -csv laion-aesthetic-6pls.db &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;.import --skip 1 '|cat -' images&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-k"&gt;done&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This uses the &lt;code&gt;sqlite3&lt;/code&gt; tool's &lt;code&gt;.import&lt;/code&gt; mechanism, because it's &lt;a href="https://til.simonwillison.net/sqlite/import-csv"&gt;really fast&lt;/a&gt;. The &lt;code&gt;--skip 1&lt;/code&gt; option is necessary to skip the first line, which is the CSV column names. The &lt;code&gt;'|cat -'&lt;/code&gt; is the idiom used to tell SQLite to read from standard input.&lt;/p&gt;
&lt;p&gt;This did the job! The result was a SQLite database file, about 3.5GB in size.&lt;/p&gt;
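&lt;p&gt;The same import can be sketched in pure Python using the standard library's &lt;code&gt;csv&lt;/code&gt; and &lt;code&gt;sqlite3&lt;/code&gt; modules - slower than &lt;code&gt;.import&lt;/code&gt;, but portable. The table and columns mirror the schema above; the tiny in-memory CSV is just for illustration:&lt;/p&gt;

```python
import csv
import io
import sqlite3

def import_csv(conn, table, csv_file, batch_size=10_000):
    # Stream rows into the table in batches, inside a transaction -
    # the Python equivalent of `.import --skip 1` from the shell.
    reader = csv.reader(csv_file)
    header = next(reader)  # skip the column-name row
    placeholders = ", ".join("?" for _ in header)
    sql = "insert into [{}] values ({})".format(table, placeholders)
    batch = []
    with conn:
        for row in reader:
            batch.append(row)
            if len(batch) == batch_size:
                conn.executemany(sql, batch)
                batch.clear()
        if batch:
            conn.executemany(sql, batch)

conn = sqlite3.connect(":memory:")
conn.execute("create table images (url text, text text, width integer)")
demo = io.StringIO("url,text,width\nhttps://example.com/a.jpg,A ring-tailed lemur,310\n")
import_csv(conn, "images", demo)
print(conn.execute("select count(*) from images").fetchone()[0])  # 1
```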
&lt;h4&gt;Enabling search&lt;/h4&gt;
&lt;p&gt;To enable SQLite full-text search against the images, I used &lt;a href="https://sqlite-utils.datasette.io/en/stable/cli.html#configuring-full-text-search"&gt;sqlite-utils enable-fts&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sqlite-utils enable-fts laion-aesthetic-6pls.db images text
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This took about a minute and a half to run. The resulting database file was around 3.9GB in size - the full text index didn't add as much to the file size as I had expected.&lt;/p&gt;
&lt;p&gt;Best of all, the search was fast! Most search queries took on the order of 20ms to run. My opinion of SQLite FTS keeps improving the more I use it.&lt;/p&gt;
&lt;h4&gt;Extracting domains with sqlite-utils --functions&lt;/h4&gt;
&lt;p&gt;We knew we wanted to count how many images had been scraped from each domain - but we currently only had the full image URLs:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;https://cdn.idahopotato.com/cache/4075b86c99bc2c46f927f3be5949d161_w310.jpg&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;While walking &lt;a href="https://twitter.com/cleopaws"&gt;Cleo&lt;/a&gt; I had an idea: what if &lt;code&gt;sqlite-utils&lt;/code&gt; made it really easy to register custom SQL functions and use them from the command-line? Then I could use a Python function to extract the domain names.&lt;/p&gt;
&lt;p&gt;This became the impetus for releasing &lt;a href="https://sqlite-utils.datasette.io/en/stable/changelog.html#v3-29"&gt;sqlite-utils 3.29&lt;/a&gt; with a brand new feature: &lt;a href="https://sqlite-utils.datasette.io/en/stable/cli.html#cli-query-functions"&gt;sqlite-utils --functions&lt;/a&gt;, which lets you do exactly that.&lt;/p&gt;
&lt;p&gt;Here's how I used that to extract the domain names from the URLs:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; First, add an empty 'domain' column to the table&lt;/span&gt;
sqlite-utils add-column laion-aesthetic-6pls.db images domain

&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Now populate it using a custom SQL function:&lt;/span&gt;
sqlite-utils laion-aesthetic-6pls.db &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;update images set domain = domain(url)&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
--functions &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;from urllib.parse import urlparse&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;def domain(url):&lt;/span&gt;
&lt;span class="pl-s"&gt;    return urlparse(url).netloc&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here we are executing this SQL query against the database:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;update&lt;/span&gt; images &lt;span class="pl-k"&gt;set&lt;/span&gt; domain &lt;span class="pl-k"&gt;=&lt;/span&gt; domain(url)&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Where that &lt;code&gt;domain(url)&lt;/code&gt; function is defined in the Python snippet passed to the &lt;code&gt;--functions&lt;/code&gt; option:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;urllib&lt;/span&gt;.&lt;span class="pl-s1"&gt;parse&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;urlparse&lt;/span&gt;

&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;domain&lt;/span&gt;(&lt;span class="pl-s1"&gt;url&lt;/span&gt;):
    &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-en"&gt;urlparse&lt;/span&gt;(&lt;span class="pl-s1"&gt;url&lt;/span&gt;).&lt;span class="pl-s1"&gt;netloc&lt;/span&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;sqlite-utils&lt;/code&gt; runs &lt;code&gt;exec()&lt;/code&gt; against the code in that block, then loops through any callable objects defined by that code (skipping those whose names start with an underscore) and registers them as custom SQL functions with SQLite.&lt;/p&gt;
&lt;p&gt;I'm really excited about this pattern. I think it makes &lt;code&gt;sqlite-utils&lt;/code&gt; an even more useful tool for running ad-hoc data cleanup and enrichment tasks.&lt;/p&gt;
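&lt;p&gt;The pattern is simple enough to sketch in a few lines. This is an illustrative approximation of the mechanism, not sqlite-utils' actual implementation:&lt;/p&gt;

```python
import sqlite3
from inspect import signature

def register_functions(conn, code):
    # Execute the user-supplied code, then register every callable it
    # defined (skipping names starting with an underscore) as a SQL function.
    namespace = {}
    exec(code, namespace)
    for name, value in namespace.items():
        if callable(value) and not name.startswith("_"):
            conn.create_function(name, len(signature(value).parameters), value)

conn = sqlite3.connect(":memory:")
register_functions(conn, """
from urllib.parse import urlparse

def domain(url):
    return urlparse(url).netloc
""")
print(conn.execute(
    "select domain('https://cdn.idahopotato.com/cache/x.jpg')"
).fetchone()[0])  # cdn.idahopotato.com
```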
&lt;h4&gt;Populating the domains table&lt;/h4&gt;
&lt;p&gt;The &lt;code&gt;domain&lt;/code&gt; column in the &lt;code&gt;images&lt;/code&gt; table was now populated, but it was a bit of a verbose column: it duplicated a chunk of text from the existing &lt;code&gt;url&lt;/code&gt;, and was repeated for over 12 million rows.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://sqlite-utils.datasette.io/en/stable/cli.html#extracting-columns-into-a-separate-table"&gt;sqlite-utils extract&lt;/a&gt; command is designed for this exact use-case. It can extract a column from an existing table out into a separate lookup table, reducing the database size by swapping those duplicate text fields for a much smaller integer foreign key column instead.&lt;/p&gt;
&lt;p&gt;I ran that like so:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;sqlite-utils extract laion-aesthetic-6pls.db images domain&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The result was a new &lt;code&gt;domain&lt;/code&gt; lookup table, and a &lt;code&gt;domain_id&lt;/code&gt; column in the &lt;code&gt;images&lt;/code&gt; table that pointed to records there.&lt;/p&gt;
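&lt;p&gt;Conceptually, the extraction looks something like this - a simplified sketch with toy data, not sqlite-utils' actual implementation (which also takes care of removing the old text column by rebuilding the table):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table images (url text, domain text);
insert into images values
  ('https://i.pinimg.com/a.jpg', 'i.pinimg.com'),
  ('https://i.pinimg.com/b.jpg', 'i.pinimg.com'),
  ('https://cdn.shopify.com/c.jpg', 'cdn.shopify.com');
""")
with conn:
    # Build a lookup table of the distinct values...
    conn.execute("create table domain (id integer primary key, domain text unique)")
    conn.execute("insert into domain (domain) select distinct domain from images")
    # ...then add a small integer foreign key pointing at it
    conn.execute("alter table images add column domain_id integer references domain(id)")
    conn.execute(
        "update images set domain_id = (select d.id from domain d where d.domain = images.domain)"
    )
print(conn.execute("select count(*) from domain").fetchone()[0])  # 2
```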
&lt;p&gt;One more step: I didn't want people visiting the site to have to run an expensive group by/count query to see which domains had the most images. So I denormalized that data into the &lt;code&gt;domain&lt;/code&gt; table.&lt;/p&gt;
&lt;p&gt;First I added a new integer column to it, called &lt;code&gt;image_counts&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;sqlite-utils add-column laion-aesthetic-6pls.db domain image_counts integer&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then I populated it with a query like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;sqlite&lt;span class="pl-k"&gt;-&lt;/span&gt;utils laion&lt;span class="pl-k"&gt;-&lt;/span&gt;aesthetic&lt;span class="pl-k"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;6pls&lt;/span&gt;.&lt;span class="pl-c1"&gt;db&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;with counts as (&lt;/span&gt;
&lt;span class="pl-s"&gt;  select domain_id, count(*) as c from images group by domain_id&lt;/span&gt;
&lt;span class="pl-s"&gt;)&lt;/span&gt;
&lt;span class="pl-s"&gt;update domain&lt;/span&gt;
&lt;span class="pl-s"&gt;  set image_counts = counts.c&lt;/span&gt;
&lt;span class="pl-s"&gt;  from counts&lt;/span&gt;
&lt;span class="pl-s"&gt;  where id = counts.domain_id&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I first learned to combine CTEs and SQL updates while working &lt;a href="https://til.simonwillison.net/django/migration-using-cte"&gt;with Django migrations&lt;/a&gt; - I was delighted to see the same trick works for SQLite as well.&lt;/p&gt;
&lt;p&gt;You can see the result of this query in the &lt;a href="https://laion-aesthetic.datasette.io/laion-aesthetic-6pls/domain?_sort_desc=image_counts"&gt;domain&lt;/a&gt; table. The first five rows look like this:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;id&lt;/th&gt;
&lt;th&gt;domain&lt;/th&gt;
&lt;th&gt;image_counts&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;td&gt;i.pinimg.com&lt;/td&gt;
&lt;td&gt;1043949&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;render.fineartamerica.com&lt;/td&gt;
&lt;td&gt;601106&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;us.123rf.com&lt;/td&gt;
&lt;td&gt;497244&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;cdn.shopify.com&lt;/td&gt;
&lt;td&gt;241632&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;136&lt;/td&gt;
&lt;td&gt;photos.smugmug.com&lt;/td&gt;
&lt;td&gt;225582&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h4&gt;Doing the same for celebrities, artists, characters&lt;/h4&gt;
&lt;p&gt;We also wanted to provide pre-calculated counts for searches against a number of celebrities, artists and fictional characters - to help give a sense of the kinds of images that were included in the data.&lt;/p&gt;
&lt;p&gt;Andy gathered the ones we wanted to track in &lt;a href="https://docs.google.com/spreadsheets/d/1JLQQ3U6P0d4vDkAGuB8avmXOPIDPeDUdAEZsWTRYpng/edit"&gt;this Google Sheet&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I recently learned how to use the &lt;code&gt;/export?format=csv&lt;/code&gt; endpoint to export a Google Sheet as CSV. I found out that you can use &lt;code&gt;/export?format=csv&amp;amp;gid=1037423923&lt;/code&gt; to target a specific tab in a multi-tabbed sheet.&lt;/p&gt;
&lt;p&gt;So I imported Andy's data into SQLite using the following:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;curl -L &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;https://docs.google.com/spreadsheets/d/1JLQQ3U6P0d4vDkAGuB8avmXOPIDPeDUdAEZsWTRYpng/export?format=csv&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
  &lt;span class="pl-k"&gt;|&lt;/span&gt; sqlite-utils insert laion-aesthetic-6pls.db artists - --csv
curl -L &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;https://docs.google.com/spreadsheets/d/1JLQQ3U6P0d4vDkAGuB8avmXOPIDPeDUdAEZsWTRYpng/export?format=csv&amp;amp;gid=1037423923&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
  &lt;span class="pl-k"&gt;|&lt;/span&gt; sqlite-utils insert laion-aesthetic-6pls.db celebrities - --csv
curl -L &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;https://docs.google.com/spreadsheets/d/1JLQQ3U6P0d4vDkAGuB8avmXOPIDPeDUdAEZsWTRYpng/export?format=csv&amp;amp;gid=480391249&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
  &lt;span class="pl-k"&gt;|&lt;/span&gt; sqlite-utils insert laion-aesthetic-6pls.db characters - --csv&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This gave me &lt;code&gt;artists&lt;/code&gt;, &lt;code&gt;celebrities&lt;/code&gt; and &lt;code&gt;characters&lt;/code&gt; tables.&lt;/p&gt;
&lt;p&gt;The next challenge was to run a search query for each row in each of those tables and return the count of results. After some experimentation I found that this one worked:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;select&lt;/span&gt; name, (
  &lt;span class="pl-k"&gt;select&lt;/span&gt; &lt;span class="pl-c1"&gt;count&lt;/span&gt;(&lt;span class="pl-k"&gt;*&lt;/span&gt;) &lt;span class="pl-k"&gt;from&lt;/span&gt; images_fts &lt;span class="pl-k"&gt;where&lt;/span&gt; images_fts match &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;"&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;||&lt;/span&gt; name &lt;span class="pl-k"&gt;||&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;"&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
) &lt;span class="pl-k"&gt;as&lt;/span&gt; search_count &lt;span class="pl-k"&gt;from&lt;/span&gt; celebrities &lt;span class="pl-k"&gt;order by&lt;/span&gt; search_count &lt;span class="pl-k"&gt;desc&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Note the &lt;code&gt;match '"' || name || '"'&lt;/code&gt; part - this wraps the name in double quotes so it is treated as a literal phrase, preventing names like &lt;code&gt;Dwayne 'The Rock' Johnson&lt;/code&gt; from breaking the search query.&lt;/p&gt;
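&lt;p&gt;Here's a standalone sketch of that quoting behaviour, using Python's &lt;code&gt;sqlite3&lt;/code&gt; module and a made-up caption - the bare apostrophe trips up the FTS5 query parser, while the double-quoted phrase matches cleanly:&lt;/p&gt;

```python
# Sketch (with a made-up caption) of why the match string needs quoting:
# FTS5 treats a bare apostrophe as query syntax, but a double-quoted
# string is matched as a literal phrase.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE images_fts USING fts5(caption)")
db.execute(
    "INSERT INTO images_fts VALUES (?)",
    ("Dwayne 'The Rock' Johnson at a premiere",),
)

name = "Dwayne 'The Rock' Johnson"
try:
    # Unquoted: the apostrophe causes an FTS5 syntax error
    db.execute("SELECT count(*) FROM images_fts WHERE images_fts MATCH ?", (name,))
except sqlite3.OperationalError as e:
    print("unquoted fails:", e)

# Quoted as a phrase - the same '"' || name || '"' trick as the SQL above
count = db.execute(
    "SELECT count(*) FROM images_fts WHERE images_fts MATCH ?",
    ('"' + name + '"',),
).fetchone()[0]
print(count)
```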
&lt;p&gt;Now that I had the query I could use that same CTE update trick to populate a counts column in the tables:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;sqlite-utils add-column laion-aesthetic-6pls.db celebrities image_counts integer

sqlite-utils laion-aesthetic-6pls.db &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;cat &lt;span class="pl-s"&gt;&lt;span class="pl-k"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="pl-k"&gt;EOF&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;with counts as (&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  select name,&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;    (&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;      select count(*) from images_fts where images_fts match '"' || name || '"'&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;    ) as search_count&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;    from celebrities&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;update celebrities&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  set image_counts = counts.search_count&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  from counts&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  where celebrities.name = counts.name&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-k"&gt;EOF&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I'm using the &lt;code&gt;cat &amp;lt;&amp;lt;EOF&lt;/code&gt; trick here to avoid having to use shell escaping for the single and double quotes, as described in this TIL: &lt;a href="https://til.simonwillison.net/zsh/argument-heredoc"&gt;Passing command arguments using heredoc syntax&lt;/a&gt;.&lt;/p&gt;
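&lt;p&gt;A minimal standalone sketch of the trick, with a toy SQL string rather than the full query:&lt;/p&gt;

```shell
# Build a shell variable containing both single and double quotes with no
# backslash escaping. Quoting the EOF delimiter stops variable expansion
# inside the heredoc body.
sql="$(cat <<'EOF'
select count(*) from images_fts where images_fts match '"' || name || '"'
EOF
)"
# $sql now holds the SQL verbatim, quotes and all:
printf '%s\n' "$sql"
```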
&lt;p&gt;Here are the finished tables: &lt;a href="https://laion-aesthetic.datasette.io/laion-aesthetic-6pls/characters"&gt;characters&lt;/a&gt;, &lt;a href="https://laion-aesthetic.datasette.io/laion-aesthetic-6pls/celebrities"&gt;celebrities&lt;/a&gt;, &lt;a href="https://laion-aesthetic.datasette.io/laion-aesthetic-6pls/artists"&gt;artists&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;Deploying it to Fly&lt;/h4&gt;
&lt;p&gt;At just under 4GB the resulting SQLite database was an awkward size. I often deploy ~1GB databases to Google Cloud Run, but this was a bit too large for me to feel comfortable with that. Cloud Run can also get expensive for projects that attract a great deal of traffic.&lt;/p&gt;
&lt;p&gt;I decided to use &lt;a href="https://fly.io/"&gt;Fly&lt;/a&gt; instead. Fly includes support for mountable volumes, which means it's a great fit for these larger database files.&lt;/p&gt;
&lt;p&gt;I wrote about &lt;a href="https://simonwillison.net/2022/Feb/15/fly-volumes/"&gt;Using SQLite and Datasette with Fly Volumes&lt;/a&gt; back in February, when I added support for volumes to the &lt;a href="https://datasette.io/plugins/datasette-publish-fly"&gt;datasette-publish-fly&lt;/a&gt; Datasette plugin.&lt;/p&gt;
&lt;p&gt;This was still the largest database I had ever deployed to Fly, and it took a little bit of work to figure out the best way to handle it.&lt;/p&gt;
&lt;p&gt;In the end, I used the following recipe:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;datasette publish fly \
  --app laion-aesthetic \
  --volume-name datasette \
  --install datasette-json-html \
  --extra-options &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;-i /data/laion-aesthetic-6pls.db --inspect-file /data/inspect.json --setting sql_time_limit_ms 10000 --setting suggest_facets 0 --setting allow_download 0&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \
  -m metadata.yml&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The first time I ran this I used &lt;code&gt;--create-volume 20&lt;/code&gt; to create a 20GB volume called &lt;code&gt;datasette&lt;/code&gt;. I over-provisioned this so I could run commands like &lt;code&gt;sqlite-utils vacuum&lt;/code&gt;, which need twice as much free space as the database file itself occupies.&lt;/p&gt;
&lt;p&gt;I uploaded the database file itself &lt;a href="https://til.simonwillison.net/fly/scp"&gt;using scp&lt;/a&gt;, and ran &lt;code&gt;fly ssh console -a laion-aesthetic&lt;/code&gt; to SSH in and execute other commands such as &lt;code&gt;datasette inspect laion-aesthetic-6pls.db &gt; inspect.json&lt;/code&gt; to create the inspect JSON file.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;--extra-options&lt;/code&gt; deserve explanation.&lt;/p&gt;
&lt;p&gt;Normally when you run &lt;code&gt;datasette publish&lt;/code&gt; the file you pass to the command is automatically deployed using &lt;a href="https://docs.datasette.io/en/stable/performance.html?highlight=immutable#immutable-mode"&gt;immutable mode&lt;/a&gt;. This mode is specifically designed for running read-only databases, and uses optimizations like only counting the rows in the table once on startup (or loading the counts from a pre-prepared &lt;code&gt;inspect.json&lt;/code&gt; file).&lt;/p&gt;
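&lt;p&gt;Spelled out as shell commands, that optimization looks something like this (a sketch, assuming the database file is in the current directory):&lt;/p&gt;

```shell
# Pre-compute table counts once, then serve the file in immutable mode (-i)
# so Datasette can skip counting rows on every startup.
datasette inspect laion-aesthetic-6pls.db > inspect.json
datasette serve -i laion-aesthetic-6pls.db --inspect-file inspect.json
```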
&lt;p&gt;I wanted those optimizations for this project. But &lt;code&gt;datasette publish fly&lt;/code&gt; is currently designed with the assumption that any databases you put in the &lt;code&gt;/data&lt;/code&gt; volume are designed to accept writes, and hence shouldn't be opened in immutable mode.&lt;/p&gt;
&lt;p&gt;I ended up coming up with a horrible hack: I added &lt;code&gt;-i /data/laion-aesthetic-6pls.db&lt;/code&gt; to &lt;code&gt;--extra-options&lt;/code&gt; to tell Datasette to open the file in immutable mode.&lt;/p&gt;
&lt;p&gt;But this wasn't enough! &lt;code&gt;datasette publish fly&lt;/code&gt; also configures Datasette to automatically open any databases in &lt;code&gt;/data&lt;/code&gt; in read-only mode, so that newly saved database files will be served correctly.&lt;/p&gt;
&lt;p&gt;This meant my instance was loading the same database twice - once in read-only mode and once in immutable mode.&lt;/p&gt;
&lt;p&gt;Rather than fixing the design of &lt;code&gt;datasette-publish-fly&lt;/code&gt;, I went for a cheap workaround. I start Datasette with the following &lt;code&gt;metadata.yml&lt;/code&gt; configuration (simplified):&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;databases&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;laion-aesthetic-6pls&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;tables&lt;/span&gt;:
      &lt;span class="pl-ent"&gt;domain&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;label_column&lt;/span&gt;: &lt;span class="pl-s"&gt;domain&lt;/span&gt;
  &lt;span class="pl-ent"&gt;laion-aesthetic-6pls_2&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;allow&lt;/span&gt;: &lt;span class="pl-c1"&gt;false&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This ensures that the &lt;code&gt;laion-aesthetic-6pls&lt;/code&gt; database - the immutable one - is served correctly, and has a label column set for the &lt;code&gt;domain&lt;/code&gt; table too.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;laion-aesthetic-6pls_2&lt;/code&gt; is the second copy of that database, loaded because Datasette spotted it in the &lt;code&gt;/data&lt;/code&gt; directory. Setting &lt;code&gt;allow: false&lt;/code&gt; on it uses Datasette's &lt;a href="https://docs.datasette.io/en/stable/authentication.html#defining-permissions-with-allow-blocks"&gt;permissions framework&lt;/a&gt; to hide that duplicate database from view.&lt;/p&gt;
&lt;p&gt;I'm not proud of these workarounds, and I hope to fix them in the future - but for the moment this is what it took to deploy the project.&lt;/p&gt;
&lt;h4&gt;Scaling it to meet demand&lt;/h4&gt;
&lt;p&gt;I launched the first version of the application on Fly's cheapest instance - 256MB of RAM, costing $1.87/month.&lt;/p&gt;
&lt;p&gt;This worked fine when it was just me and Andy playing with the site, but it began to struggle as traffic increased.&lt;/p&gt;
&lt;p&gt;Fly have a "scale app" button which lets you upgrade your instance. I hadn't actually used it before, but I was delighted to find that it worked exactly as expected: I bumped the RAM up to 4GB (not coincidentally the size of the SQLite database file) and the instance restarted within a few seconds with upgraded capacity.&lt;/p&gt;
&lt;p&gt;Fly provide a preconfigured Grafana interface for watching your instances, and it helped me feel confident that the resized instance was happily dealing with the traffic.&lt;/p&gt;
&lt;p&gt;I plan to dial back down to a cheaper instance once interest in the project starts to fade.&lt;/p&gt;
&lt;h4&gt;Got a problem? Throw a search engine at it&lt;/h4&gt;
&lt;p&gt;This is the third time I've used Datasette to build a search engine in the past three weeks! My other two recent projects are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://scotrail.datasette.io/"&gt;scotrail.datasette.io&lt;/a&gt;, described in &lt;a href="https://simonwillison.net/2022/Aug/21/scotrail/"&gt;Analyzing ScotRail audio announcements with Datasette—from prototype to production&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://archive.sfmicrosociety.org"&gt;archive.sfmicrosociety.org&lt;/a&gt;, described in &lt;a href="https://simonwillison.net/2022/Aug/25/sfms-archive/"&gt;Building a searchable archive for the San Francisco Microscopical Society&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Being able to spin up a full search engine for anything that you can stuff into a SQLite database table (which it turns out is almost everything) is really powerful. I plan to write a &lt;a href="https://datasette.io/tutorials"&gt;Datasette tutorial&lt;/a&gt; about this in the future.&lt;/p&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-render-image-tags"&gt;datasette-render-image-tags&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-render-image-tags/releases/tag/0.1"&gt;0.1&lt;/a&gt; - 2022-09-04
&lt;br /&gt;Turn any URLs ending in .jpg/.png/.gif into img tags with width 200&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-sitemap"&gt;datasette-sitemap&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-sitemap/releases/tag/1.0"&gt;1.0&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-sitemap/releases"&gt;3 releases total&lt;/a&gt;) - 2022-08-30
&lt;br /&gt;Generate sitemap.xml for Datasette sites&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-block-robots"&gt;datasette-block-robots&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-block-robots/releases/tag/1.1"&gt;1.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-block-robots/releases"&gt;6 releases total&lt;/a&gt;) - 2022-08-30
&lt;br /&gt;Datasette plugin that blocks robots and crawlers using robots.txt&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/sqlite-utils/releases/tag/3.29"&gt;3.29&lt;/a&gt; - (&lt;a href="https://github.com/simonw/sqlite-utils/releases"&gt;103 releases total&lt;/a&gt;) - 2022-08-28
&lt;br /&gt;Python CLI utility and library for manipulating SQLite databases&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/sqlite/vacum-disk-full"&gt;SQLite VACUUM: database or disk is full&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/fly/scp"&gt;How to scp files to and from Fly&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/twitter/birdwatch-sqlite"&gt;Loading Twitter Birdwatch into SQLite for analysis with Datasette&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/andy-baio"&gt;andy-baio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ethics"&gt;ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/search"&gt;search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parquet"&gt;parquet&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fly"&gt;fly&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/stable-diffusion"&gt;stable-diffusion&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/laion"&gt;laion&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/training-data"&gt;training-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/text-to-image"&gt;text-to-image&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="andy-baio"/><category term="ethics"/><category term="search"/><category term="ai"/><category term="datasette"/><category term="parquet"/><category term="weeknotes"/><category term="fly"/><category term="sqlite-utils"/><category term="stable-diffusion"/><category term="generative-ai"/><category term="laion"/><category term="training-data"/><category term="text-to-image"/><category term="ai-ethics"/></entry><entry><title>DuckDB-Wasm: Efficient Analytical SQL in the Browser</title><link href="https://simonwillison.net/2021/Oct/29/duckdb-wasm-efficient-analytical-sql-in-the-browser/#atom-tag" rel="alternate"/><published>2021-10-29T15:25:43+00:00</published><updated>2021-10-29T15:25:43+00:00</updated><id>https://simonwillison.net/2021/Oct/29/duckdb-wasm-efficient-analytical-sql-in-the-browser/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://duckdb.org/2021/10/29/duckdb-wasm.html"&gt;DuckDB-Wasm: Efficient Analytical SQL in the Browser&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
First SQLite, now DuckDB: options for running database engines in the browser using WebAssembly keep on growing. DuckDB means browsers now have a fast, intuitive mechanism for querying Parquet files too. This also supports the same HTTP Range header trick as the SQLite demo from a while back, meaning it can query large databases loaded over HTTP without downloading the whole file.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/hfmuehleisen/status/1454100577730838529"&gt;@hfmuehleisen&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parquet"&gt;parquet&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/webassembly"&gt;webassembly&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/duckdb"&gt;duckdb&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/http-range-requests"&gt;http-range-requests&lt;/a&gt;&lt;/p&gt;



</summary><category term="sqlite"/><category term="parquet"/><category term="webassembly"/><category term="duckdb"/><category term="http-range-requests"/></entry><entry><title>Querying Parquet using DuckDB</title><link href="https://simonwillison.net/2021/Jun/25/querying-parquet-using-duckdb/#atom-tag" rel="alternate"/><published>2021-06-25T22:40:27+00:00</published><updated>2021-06-25T22:40:27+00:00</updated><id>https://simonwillison.net/2021/Jun/25/querying-parquet-using-duckdb/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://duckdb.org/2021/06/25/querying-parquet.html"&gt;Querying Parquet using DuckDB&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
DuckDB is a relatively new SQLite-style database (released as an embeddable library) with a focus on analytical queries. This tutorial really made the benefits click for me: it ships with support for the Parquet columnar data format, and you can use it to execute SQL queries directly against Parquet files—e.g. “SELECT COUNT(*) FROM ’taxi_2019_04.parquet’”. Performance against large files is fantastic, and the whole thing can be installed just using “pip install duckdb”. I wonder if faceting-style group/count queries (pretty expensive with regular RDBMSs) could be sped up with this?

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=27634840"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parquet"&gt;parquet&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/duckdb"&gt;duckdb&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="parquet"/><category term="duckdb"/></entry><entry><title>athena-sqlite</title><link href="https://simonwillison.net/2019/Dec/18/athena-sqlite/#atom-tag" rel="alternate"/><published>2019-12-18T09:05:40+00:00</published><updated>2019-12-18T09:05:40+00:00</updated><id>https://simonwillison.net/2019/Dec/18/athena-sqlite/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/dacort/athena-sqlite"&gt;athena-sqlite&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Amazon Athena is the AWS tool for querying data stored in S3—as CSV, JSON or Apache Parquet files—using SQL. It’s an interesting way of building a very cheap data warehouse on top of S3 without having to run any additional services. Athena recently added a query federation SDK which lets you define additional custom data sources using Lambda functions. Damon Cortesi used this to write a custom connector for SQLite, which lets you run queries against data stored in SQLite files that you have uploaded to S3. You can then run joins between that data and other Athena sources.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/dacort/status/1206059672513306626"&gt;@dacort&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/athena"&gt;athena&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/aws"&gt;aws&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sql"&gt;sql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parquet"&gt;parquet&lt;/a&gt;&lt;/p&gt;



</summary><category term="athena"/><category term="aws"/><category term="s3"/><category term="sql"/><category term="sqlite"/><category term="parquet"/></entry><entry><title>Query Parquet files in SQLite</title><link href="https://simonwillison.net/2018/Jun/24/query-parquet-files-sqlite/#atom-tag" rel="alternate"/><published>2018-06-24T19:44:06+00:00</published><updated>2018-06-24T19:44:06+00:00</updated><id>https://simonwillison.net/2018/Jun/24/query-parquet-files-sqlite/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://cldellow.com/2018/06/22/sqlite-parquet-vtable.html"&gt;Query Parquet files in SQLite&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Colin Dellow built a SQLite virtual table extension that lets you query Parquet files directly using SQL. Parquet is interesting because it’s a columnar format that dramatically reduces the space needed to store tables with lots of duplicate column data—most CSV files, for example. Colin reports being able to shrink a 1291 MB CSV file from the Canadian census to an equivalent Parquet file weighing just 42MB (3% of the original)—then running a complex query against the data in just 60ms. I’d love to see someone get this extension working with Datasette.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/big-data"&gt;big-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parquet"&gt;parquet&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/colin-dellow"&gt;colin-dellow&lt;/a&gt;&lt;/p&gt;



</summary><category term="sqlite"/><category term="big-data"/><category term="datasette"/><category term="parquet"/><category term="colin-dellow"/></entry></feed>