Simon Willison's Weblog: How I blog

Adding TILs, releases, museums, tools and research to my blog

2026-02-20T23:47:10+00:00

I've been wanting to add indications of my various other online activities to my blog for a while now. I just turned on a new feature I'm calling "beats" (after story beats, naming this was hard!) which adds five new types of content to my site, all corresponding to activity elsewhere.

Here's what beats look like:

Those three are from the 30th December 2025 archive page.

Beats are little inline links with badges that fit into different content timeline views around my site, including the homepage, search and archive pages.

There are currently five types of beats:

Releases are GitHub releases of my many different open source projects, imported from this JSON file that was constructed by GitHub Actions.
TILs are the posts from my TIL blog, imported using a SQL query over JSON and HTTP against the Datasette instance powering that site.
Museums are new posts on my niche-museums.com blog, imported from this custom JSON feed.
Tools are HTML and JavaScript tools I've vibe-coded on my tools.simonwillison.net site, as described in Useful patterns for building HTML tools.
Research is for AI-generated research projects, hosted in my simonw/research repo and described in Code research projects with async coding agents like Claude Code and Codex.

That's five different custom integrations to pull in all of that data. The good news is that this kind of integration project is the kind of thing that coding agents really excel at. I knocked most of the feature out in a single morning while working in parallel on various other things.

I didn't have a useful structured feed of my Research projects, and it didn't matter because I gave Claude Code a link to the raw Markdown README that lists them all and it spun up a parser regex. Since I'm responsible for both the source and the destination I'm fine with a brittle solution that would be too risky against a source that I don't control myself.
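
As a rough illustration of that kind of brittle-but-good-enough parser, here is a minimal Python sketch. The list-item format (`- [Title](url) - description`) and the names `PROJECT_LINE` and `parse_projects` are assumptions made up for this example; they are not the layout of the real README or the code Claude Code wrote.

```python
import re

# Hypothetical line format: "- [Title](https://example.com/path) - Short description"
PROJECT_LINE = re.compile(
    r"^[-*]\s+\[(?P<title>[^\]]+)\]\((?P<url>[^)\s]+)\)(?:\s+-\s+(?P<description>.*))?$"
)


def parse_projects(markdown: str) -> list[dict]:
    """Return one dict per list item matching PROJECT_LINE; ignore every other line."""
    projects = []
    for line in markdown.splitlines():
        match = PROJECT_LINE.match(line.strip())
        if match:
            projects.append(
                {
                    "title": match.group("title"),
                    "url": match.group("url"),
                    # The description part is optional in this assumed format.
                    "description": match.group("description") or "",
                }
            )
    return projects
```

A parser like this silently skips any line that drifts from the expected shape, which is only acceptable when the same person controls the source file.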

Claude also handled all of the potentially tedious UI integration work with my site, making sure the new content worked on all of my different page types and was handled correctly by my faceted search engine.

Prototyping with Claude Artifacts

I actually prototyped the initial concept for beats in regular Claude - not Claude Code - taking advantage of the fact that it can clone public repos from GitHub these days. I started with:

Clone simonw/simonwillisonblog and tell me about the models and views

And then later in the brainstorming session said:

use the templates and CSS in this repo to create a new artifact with all HTML and CSS inline that shows me my homepage with some of those inline content types mixed in

After some iteration we got to this artifact mockup, which was enough to convince me that the concept had legs and was worth handing over to full Claude Code for web to implement.

If you want to see how the rest of the build played out the most interesting PRs are Beats #592 which implemented the core feature and Add Museums Beat importer #595 which added the Museums content type.

Tags: blogging, museums, ai, til, generative-ai, llms, ai-assisted-programming, claude-artifacts, claude-code, site-upgrades

How I automate my Substack newsletter with content from my blog

2025-11-19T22:00:34+00:00

I sent out my weekly-ish Substack newsletter this morning and took the opportunity to record a YouTube video demonstrating my process and describing the different components that make it work. There's a lot of digital duct tape involved, taking the content from Django+Heroku+PostgreSQL to GitHub Actions to SQLite+Datasette+Fly.io to JavaScript+Observable and finally to Substack.

The core process is the same as I described back in 2023. I have an Observable notebook called blog-to-newsletter which fetches content from my blog's database, filters out anything that has been in the newsletter before, formats what's left as HTML and offers a big "Copy rich text newsletter to clipboard" button.

I click that button, paste the result into the Substack editor, tweak a few things and hit send. The whole process usually takes just a few minutes.

I make very minor edits:

I set the title and the subheading for the newsletter. This is often a direct copy of the title of the featured blog post.
Substack turns YouTube URLs into embeds, which often isn't what I want - especially if I have a YouTube URL inside a code example.
Blocks of preformatted text often have an extra blank line at the end, which I remove.
Occasionally I'll make a content edit - removing a piece of content that doesn't fit the newsletter, or fixing a time reference like "yesterday" that doesn't make sense any more.
I pick the featured image for the newsletter and add some tags.

That's the whole process!

The Observable notebook

The most important cell in the Observable notebook is this one:

raw_content = {
  return await (
    await fetch(
      `https://datasette.simonwillison.net/simonwillisonblog.json?sql=${encodeURIComponent(
        sql
      )}&_shape=array&numdays=${numDays}`
    )
  ).json();
}

This uses the JavaScript fetch() function to pull data from my blog's Datasette instance, using a very complex SQL query that is composed elsewhere in the notebook.
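
For readers who want to build the same request outside Observable, here is a minimal Python sketch of how that URL is put together. The function name `build_query_url` is an assumption for this example, and the sketch only builds the URL; it does not perform the HTTP request.

```python
from urllib.parse import urlencode

# Base URL taken from the notebook cell above.
DATASETTE_JSON_URL = "https://datasette.simonwillison.net/simonwillisonblog.json"


def build_query_url(sql: str, num_days: int) -> str:
    """Build a Datasette JSON API URL for a SQL query that uses a :numdays parameter."""
    params = {
        "sql": sql,           # the SQL text, percent-encoded by urlencode
        "_shape": "array",    # ask for a plain JSON array of row objects
        "numdays": num_days,  # value bound to :numdays inside the query
    }
    return f"{DATASETTE_JSON_URL}?{urlencode(params)}"
```

`urlencode` encodes spaces as `+` where `encodeURIComponent` uses `%20`; both forms are valid in a query string.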

Here's a link to see and execute that query directly in Datasette. It's 143 lines of convoluted SQL that assembles most of the HTML for the newsletter using SQLite string concatenation! An illustrative snippet:

with content as (
  select
    id,
    'entry' as type,
    title,
    created,
    slug,
    '<h3><a href="' || 'https://simonwillison.net/' || strftime('%Y/', created)
      || substr('JanFebMarAprMayJunJulAugSepOctNovDec', (strftime('%m', created) - 1) * 3 + 1, 3) 
      || '/' || cast(strftime('%d', created) as integer) || '/' || slug || '/' || '">' 
      || title || '</a> - ' || date(created) || '</h3>' || body
      as html,
    'null' as json,
    '' as external_url
  from blog_entry
  union all
  -- ...

My blog's URLs look like /2025/Nov/18/gemini-3/ - this SQL constructs that three letter month abbreviation from the month number using a substring operation.

This is a terrible way to assemble HTML, but I've stuck with it because it amuses me.

The rest of the Observable notebook takes that data, filters out anything that links to content mentioned in the previous newsletters and composes it into a block of HTML that can be copied using that big button.

Here's the recipe it uses to turn HTML into rich text content on a clipboard suitable for Substack. I can't remember how I figured this out but it's very effective:

Object.assign(
  html`<button style="font-size: 1.4em; padding: 0.3em 1em; font-weight: bold;">Copy rich text newsletter to clipboard</button>`,
  {
    onclick: () => {
      const htmlContent = newsletterHTML;
      // Create a temporary element to hold the HTML content
      const tempElement = document.createElement("div");
      tempElement.innerHTML = htmlContent;
      document.body.appendChild(tempElement);
      // Select the HTML content
      const range = document.createRange();
      range.selectNode(tempElement);
      // Copy the selected HTML content to the clipboard
      const selection = window.getSelection();
      selection.removeAllRanges();
      selection.addRange(range);
      document.execCommand("copy");
      selection.removeAllRanges();
      document.body.removeChild(tempElement);
    }
  }
)

From Django+PostgreSQL to Datasette+SQLite

My blog itself is a Django application hosted on Heroku, with data stored in Heroku PostgreSQL. Here's the source code for that Django application. I use the Django admin as my CMS.

Datasette provides a JSON API over a SQLite database... which means something needs to convert that PostgreSQL database into a SQLite database that Datasette can use.

My system for doing that lives in the simonw/simonwillisonblog-backup GitHub repository. It uses GitHub Actions on a schedule that executes every two hours, fetching the latest data from PostgreSQL and converting that to SQLite.

My db-to-sqlite tool is responsible for that conversion. I call it like this:

db-to-sqlite \
  $(heroku config:get DATABASE_URL -a simonwillisonblog | sed s/postgres:/postgresql+psycopg2:/) \
  simonwillisonblog.db \
  --table auth_permission \
  --table auth_user \
  --table blog_blogmark \
  --table blog_blogmark_tags \
  --table blog_entry \
  --table blog_entry_tags \
  --table blog_quotation \
  --table blog_quotation_tags \
  --table blog_note \
  --table blog_note_tags \
  --table blog_tag \
  --table blog_previoustagname \
  --table blog_series \
  --table django_content_type \
  --table redirects_redirect

That heroku config:get DATABASE_URL command uses Heroku credentials in an environment variable to fetch the database connection URL for my blog's PostgreSQL database (and fixes a small difference in the URL scheme).
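
The scheme fix performed by that `sed` expression can be sketched in Python as follows. The function name `to_sqlalchemy_url` and the example connection string in the usage note are assumptions for illustration; the substitution itself mirrors the command above.

```python
def to_sqlalchemy_url(database_url: str) -> str:
    """Mirror the sed substitution: swap the first "postgres:" for "postgresql+psycopg2:"."""
    # count=1 matches sed's default of replacing only the first occurrence.
    return database_url.replace("postgres:", "postgresql+psycopg2:", 1)
```

For example, `to_sqlalchemy_url("postgres://user:secret@host:5432/dbname")` gives `postgresql+psycopg2://user:secret@host:5432/dbname`, while a URL that already starts with `postgresql://` passes through unchanged.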

db-to-sqlite can then export that data and write it to a SQLite database file called simonwillisonblog.db.

The --table options specify the tables that should be included in the export.

The repository does more than just that conversion: it also exports the resulting data to JSON files that live in the repository, which gives me a commit history of changes I make to my content. This is a cheap way to get a revision history of my blog content without having to mess around with detailed history tracking inside the Django application itself.
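
One way to get diff-friendly JSON out of a SQLite table is sketched below. This is not the repository's actual export code: the one-file-per-row layout, the `id` column and the function name `rows_as_json_files` are all assumptions chosen for the example.

```python
import json
import sqlite3


def rows_as_json_files(conn: sqlite3.Connection, table: str) -> dict[str, str]:
    """Map a hypothetical file path per row to stable, diff-friendly JSON text."""
    conn.row_factory = sqlite3.Row
    files = {}
    # The table name is interpolated directly, so only pass trusted names.
    for row in conn.execute(f"select * from [{table}] order by id"):
        record = dict(row)
        # sort_keys and indent keep the output stable, so a commit shows only real changes.
        path = f"{table}/{record['id']}.json"
        files[path] = json.dumps(record, indent=2, sort_keys=True) + "\n"
    return files
```

Writing each value to its path and committing the result would then produce a per-row change history in Git.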

At the end of my GitHub Actions workflow is this code that publishes the resulting database to Datasette running on Fly.io using the datasette publish fly plugin:

datasette publish fly simonwillisonblog.db \
  -m metadata.yml \
  --app simonwillisonblog-backup \
  --branch 1.0a2 \
  --extra-options "--setting sql_time_limit_ms 15000 --setting truncate_cells_html 10000 --setting allow_facet off" \
  --install datasette-block-robots \
  # ... more plugins

As you can see, there are a lot of moving parts! Surprisingly it all mostly just works - I rarely have to intervene in the process, and the cost of those different components is pleasantly low.

Tags: blogging, django, javascript, postgresql, sql, sqlite, youtube, heroku, datasette, observable, github-actions, fly, newsletter, substack, site-upgrades

My approach to running a link blog

2024-12-22T18:37:16+00:00

I started running a basic link blog on this domain back in November 2003 - publishing links (which I called "blogmarks") with a title, URL, short snippet of commentary and a "via" link where appropriate.

So far I've published 7,607 link blog posts and counting.

In April of this year I finally upgraded my link blog to support Markdown, allowing me to expand my link blog into something with a lot more room.

The way I use my link blog has evolved substantially in the eight months since then. I'm going to describe the informal set of guidelines I've set myself for how I link blog, in the hope that it might encourage other people to give this a try themselves.

Writing about things I've found

Back in November 2022 I wrote What to blog about, which started with this:

You should start a blog. Having your own little corner of the internet is good for the soul!

The point of that article was to emphasize that blogging doesn't have to be about unique insights. The value is in writing frequently and having something to show for it over time - worthwhile even if you don't attract much of an audience (or any audience at all).

In that article I proposed two categories of content that are low stakes and high value: things I learned and descriptions of my projects.

I realize now that link blogging deserves to be included as a third category of low stakes, high value writing. We could think of that category as things I've found.

That's the purpose of my link blog: it's an ongoing log of things I've found - effectively a combination of public bookmarks and my own thoughts and commentary on why those things are interesting.

Trying to add something extra

When I first started link blogging I would often post a link with a one sentence summary of the linked content, and maybe a tiny piece of opinionated commentary.

After I upgraded my link blog to support additional markup (links, images, quotations) I decided to be more ambitious. Here are some of the things I try to do:

I always include the names of the people who created the content I am linking to, if I can figure that out. Credit is really important, and it's also useful for myself because I can later search for someone's name and find other interesting things they have created that I linked to in the past. If I've linked to someone's work three or more times I also try to notice and upgrade them to a dedicated tag.
I try to add something extra. My goal with any link blog post is that if you read both my post and the source material you'll have an enhanced experience over if you read just the source material itself.
- Ideally I'd like you to take something useful away even if you don't follow the link itself. This can be a slightly tricky balance: I don't want to steal attention from the authors and plagiarize their message. Generally I'll try to find some key idea that's worth emphasizing. Slightly cynically, I may try to capture that idea as backup against the original source vanishing from the internet. Link rot is real!
- My most basic version of this is trying to provide context as to why I think this particular thing is worth reading - especially important for longer content. A good recent example is my post about Anthropic's Building effective agents essay the other day.
- I might tie it together to other similar concepts, including things I've written about in the past, for example linking Prompt caching with Claude to my coverage of Context caching for Google Gemini.
- If part of the material is a video, I might quote a snippet of the transcript (often extracted using MacWhisper) like I did in this post about Anthropic's Clio.
- A lot of stuff I link to involves programming. I'll often include a direct link to relevant code, using the GitHub feature where I can link to a snippet as-of a particular commit. One example is the fetch-rss.py link in this post.
I'm liberal with quotations. Finding and quoting a paragraph that captures the key theme of a post is a very quick and effective way to summarize it and help people decide if it's worth reading the whole thing. My post on François Chollet's o3 ARC-AGI analysis is an example of that.
If the original author reads my post, I want them to feel good about it. I know from my own experience that often when you publish something online the silence can be deafening. Knowing that someone else read, appreciated, understood and then shared your work can be very pleasant.
A slightly self-involved concern I have is that I like to prove that I've read it. This is more for me than for anyone else: I don't like to recommend something if I've not read that thing myself, and sticking in a detail that shows I read past the first paragraph helps keep me honest about that.
I've started leaning more into screenshots and even short video or audio clips. A screenshot can be considered a visual quotation - I'll sometimes snap these from interesting frames in a YouTube video or live demo associated with the content I'm linking to. I used a screenshot of the Clay debugger in my post about Clay.

There are a lot of great link blogs out there, but the one that has influenced me the most in how I approach my own is John Gruber's Daring Fireball. I really like the way he mixes commentary, quotations and value-added relevant information.

The technology

The technology behind my link blog is probably the least interesting thing about it. It's part of my simonwillisonblog Django application - the main model is called Blogmark and it inherits from a BaseModel defining things like tags and draft modes that are shared across my other types of content (entries and quotations).

I use the Django Admin to create and edit entries, configured here.

The most cumbersome part of link blogging for me right now is images. I convert these into smaller JPEGs using a tiny custom tool I built (with Claude), then upload them to my static.simonwillison.net S3 bucket using Transmit and drop them into my posts using a Markdown image reference. I generate a first draft of the alt text using a Claude Project with these custom instructions, then usually make a few changes before including that in the markup. At some point I'll wire together a UI that makes this process a little smoother.

That static.simonwillison.net bucket is then served via Cloudflare's free tier, which means I effectively never have to think about the cost of serving up those image files.

I wrote up a TIL about Building a blog in Django a while ago which describes a similar setup to the one I'm using for my link blog, including how the RSS feed works (using Django's syndication framework).

The most technically interesting component is my search feature. I wrote about how that works in Implementing faceted search with Django and PostgreSQL - the most recent code for that can be found in blog/search.py on GitHub.

One of the most useful small enhancements I added was draft mode, which lets me assign a URL to an item and preview it in my browser without publishing it to the world. This really helps when I am editing posts on my mobile phone as it gives me a reliable preview so I can check for any markup mistakes.

I also send out an approximately weekly email newsletter version of my blog, for people who want to subscribe in their inbox. This is a straight copy of content from my blog - Substack doesn't have an API for this but their editor does accept copy and paste, so I have a delightful digital duct tape solution for assembling the newsletter which I described in Semi-automating a Substack newsletter with an Observable notebook.

More people should do this

I posted this on Bluesky last night:

I wish people would post more links to interesting things

I feel like Twitter and LinkedIn and Instagram and TikTok have pushed a lot of people out of the habit of doing that, by penalizing shared links in the various "algorithms"

Bluesky doesn't have that misfeature, thankfully!

(In my ideal world everyone would get their own link blog too, but sharing links on Bluesky and Mastodon is almost as good)

Sharing interesting links with commentary is a low effort, high value way to contribute to internet life at large.

Tags: blogging, django, django-admin, john-gruber

A homepage redesign for my blog's 22nd birthday

2024-06-12T19:59:17+00:00

This blog is 22 years old today! I wrote up a whole bunch of highlights for the 20th birthday a couple of years ago. Today I'm celebrating with something a bit smaller: I finally redesigned the homepage.

I publish three kinds of content on my blog: entries (like this one), "blogmarks" (aka annotated links) and quotations. Until recently the entries were the main feature on the (desktop) homepage, with blogmarks and quotations relegated to the sidebar.

Back in April I implemented Markdown support for my blogmarks, allowing me to include additional links and quotations in the body of those descriptions.

I was inspired in this by Daring Fireball, which has long published a mix of annotated links and longer essay-style entries.

It turns out I really like posting longer-form content attached to links! Here's one from earlier today which rivals my full entries in length.

These were looking pretty cramped in the sidebar:

So I've done a small redesign. The left hand column on my homepage now displays entries, quotations and blogmarks as a combined list, reusing the format I already had in place for the tag page.

The right hand column is for "highlights", aka my longer form blog entries.

The mobile version of my site was already serving content mixed together like this, so this change mainly brings the desktop version in line with the mobile one.

Here's the issue on GitHub and the commit that implemented the change.

Tags: blogging, site-upgrades

Semi-automating a Substack newsletter with an Observable notebook

2023-04-04T17:55:28+00:00

I recently started sending out a weekly-ish email newsletter consisting of content from my blog. I've mostly automated that, using an Observable Notebook to generate the HTML. Here's how that system works.

What goes in my newsletter

My blog has three types of content: entries, blogmarks and quotations. "Blogmarks" is a name I came up with for bookmarks in 2003.

Blogmarks and quotations show up in my blog's sidebar, entries get the main column - but on mobile the three are combined into a single flow.

These live in a PostgreSQL database managed by Django. You can see them defined in models.py in my blog's open source repo.

My newsletter consists of all of the new entries, blogmarks and quotations since I last sent it out. I include the entries first in reverse chronological order, since usually the entry I've just written is the one I want to use for the email subject. The blogmarks and quotations come in chronological order afterwards.
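
That ordering rule can be sketched in a few lines of Python. The dictionary keys `type` and `created` follow the column names used on this blog; the function name `newsletter_order` is an assumption for the example, and `created` is assumed to be an ISO 8601 string so that string comparison matches chronological order.

```python
def newsletter_order(items: list[dict]) -> list[dict]:
    """Entries first (newest to oldest), then everything else (oldest to newest)."""
    entries = sorted(
        (item for item in items if item["type"] == "entry"),
        key=lambda item: item["created"],
        reverse=True,
    )
    others = sorted(
        (item for item in items if item["type"] != "entry"),
        key=lambda item: item["created"],
    )
    return entries + others
```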

I'm including the full HTML for everything: people don't need to click through back to my blog to read it, all of the content should be right there in their email client.

The Substack API: RSS and copy-and-paste

Substack doesn't yet offer an API, and have no public plans to do so.

They do offer an RSS feed of each newsletter though - add /feed to the newsletter subdomain to get it. Mine is at https://simonw.substack.com/feed.

So we can get data back out again... but what about getting data in? I don't want to manually assemble a newsletter from all of these different sources of data.

That's where copy-and-paste comes in.

The Substack compose editor incorporates a well built rich-text editor. You can paste content into it and it will clean it up to fit the subset of HTML that Substack supports... but that's a pretty decent subset. Headings, paragraphs, lists, links, code blocks and images are all supported.

The vast majority of content on my blog fits that subset neatly.

Crucially, pasting in images as part of that rich text content Just Works: Substack automatically copies any images to their substack-post-media S3 bucket and embeds links to their CDN in the body of the newsletter.

So... if I can generate the intended rich-text HTML for my whole newsletter, I can copy and paste it directly into Substack.

That's exactly what my new Observable notebook does: https://observablehq.com/@simonw/blog-to-newsletter

Generating HTML is a well trodden path, but I also wanted a "copy to clipboard" button that would copy the rich text version of that HTML such that pasting it into Substack would do the right thing.

With a bit of help from MDN and ChatGPT (my TIL) I figured out the following:

function copyRichText(html) {
  const htmlContent = html;
  // Create a temporary element to hold the HTML content
  const tempElement = document.createElement("div");
  tempElement.innerHTML = htmlContent;
  document.body.appendChild(tempElement);
  // Select the HTML content
  const range = document.createRange();
  range.selectNode(tempElement);
  // Copy the selected HTML content to the clipboard
  const selection = window.getSelection();
  selection.removeAllRanges();
  selection.addRange(range);
  document.execCommand("copy");
  selection.removeAllRanges();
  document.body.removeChild(tempElement);
}

This works great! Set up a button that triggers that function and clicking that button will copy a rich text version of the HTML to the clipboard, such that pasting it directly into the Substack editor has the desired effect.

Assembling the HTML

I love using Observable Notebooks for this kind of project: quick data integration tools that need a UI and will likely be incrementally improved over time.

Using Observable for these means I don't need to host anything and I can iterate my way to the right solution really quickly.

First, I needed to retrieve my entries, blogmarks and quotations.

I never built an API for my Django blog directly, but a while ago I set up a mechanism that exports the contents of my blog to my simonwillisonblog-backup GitHub repository for safety, and then deploys a Datasette/SQLite copy of that data to https://datasette.simonwillison.net/.

Datasette offers a JSON API for querying that data, and exposes open CORS headers which means JavaScript running in Observable can query it directly.

Here's an example SQL query running against that Datasette instance - click the .json link on that page to get that data back as JSON instead.

My Observable notebook can then retrieve the exact data it needs to construct the HTML for the newsletter.

The smart thing to do would have been to retrieve the data from the API and then use JavaScript inside Observable to compose that together into the HTML for the newsletter.

I decided to challenge myself to doing most of the work in SQL instead, and came up with the following absolute monster of a query:

with content as (
  select
    'entry' as type, title, created, slug,
    '<h3><a href="' || 'https://simonwillison.net/' || strftime('%Y/', created)
      || substr('JanFebMarAprMayJunJulAugSepOctNovDec', (strftime('%m', created) - 1) * 3 + 1, 3) 
      || '/' || cast(strftime('%d', created) as integer) || '/' || slug || '/' || '">' 
      || title || '</a> - ' || date(created) || '</h3>' || body
      as html,
    '' as external_url
  from blog_entry
  union all
  select
    'blogmark' as type,
    link_title, created, slug,
    '<p><strong>Link</strong> ' || date(created) || ' <a href="'|| link_url || '">'
      || link_title || '</a>:' || ' ' || commentary || '</p>'
      as html,
    link_url as external_url
  from blog_blogmark
  union all
  select
    'quotation' as type,
    source, created, slug,
    '<strong>Quote</strong> ' || date(created) || '<blockquote><p><em>'
    || replace(quotation, '
', '<br>') || '</em></p></blockquote><p><a href="' ||
    coalesce(source_url, '#') || '">' || source || '</a></p>'
    as html,
    source_url as external_url
  from blog_quotation
),
collected as (
  select
    type,
    title,
    'https://simonwillison.net/' || strftime('%Y/', created)
      || substr('JanFebMarAprMayJunJulAugSepOctNovDec', (strftime('%m', created) - 1) * 3 + 1, 3) || 
      '/' || cast(strftime('%d', created) as integer) || '/' || slug || '/'
      as url,
    created,
    html,
    external_url
  from content
  where created >= date('now', '-' || :numdays || ' days')   
  order by created desc
)
select type, title, url, created, html, external_url
from collected 
order by 
  case type 
    when 'entry' then 0 
    else 1 
  end,
  case type 
    when 'entry' then created 
    else -strftime('%s', created) 
  end desc

This logic really should be in the JavaScript instead! You can try that query in Datasette.

There are a bunch of tricks in there, but my favourite is this one:

select 'https://simonwillison.net/' || strftime('%Y/', created)
  || substr(
    'JanFebMarAprMayJunJulAugSepOctNovDec',
    (strftime('%m', created) - 1) * 3 + 1, 3
  ) ||  '/' || cast(strftime('%d', created) as integer) || '/' || slug || '/'
  as url

This is the trick I'm using to generate the URL for each entry, blogmark and quotation.

These are stored as datetime values in the database, but the eventual URLs look like this:

https://simonwillison.net/2023/Apr/2/calculator-for-words/

So I need to turn that date into a YYYY/Mon/DD URL component.

One problem: SQLite doesn't have a date format string that produces a three letter month abbreviation. But... with cunning application of the substr() function and a string of all the month abbreviations I can get what I need.
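
To see the arithmetic in action, the same expression can be run against an in-memory SQLite database from Python. The names `URL_PATH_SQL` and `date_path` are assumptions for this sketch; the SQL is the date portion of the query above with the column replaced by a `:created` parameter.

```python
import sqlite3

# Month 4 -> (4 - 1) * 3 + 1 = 10, and substr(..., 10, 3) is 'Apr'.
URL_PATH_SQL = """
select strftime('%Y/', :created)
  || substr('JanFebMarAprMayJunJulAugSepOctNovDec',
            (strftime('%m', :created) - 1) * 3 + 1, 3)
  || '/' || cast(strftime('%d', :created) as integer) || '/'
"""


def date_path(created: str) -> str:
    """Return the YYYY/Mon/D/ path component for a datetime string."""
    conn = sqlite3.connect(":memory:")
    try:
        return conn.execute(URL_PATH_SQL, {"created": created}).fetchone()[0]
    finally:
        conn.close()
```

Calling `date_path("2023-04-02 12:00:00")` returns `2023/Apr/2/`: the cast to integer is what strips the leading zero from the day.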

The above SQL query plus a little bit of JavaScript provides almost everything I need to generate the HTML for my newsletter.

Excluding previously sent content

There's one last problem to solve: I want to send a newsletter containing everything that's new since my last edition - I don't want to send out the same content twice.

I came up with a delightfully gnarly solution to that as well.

As mentioned earlier, Substack provides an RSS feed of previous editions. I can use that data to avoid including content that's already been sent.

One problem: the Substack RSS feed doesn't include CORS headers, which means I can't access it directly from my notebook.

GitHub offers CORS headers for every file in every repository. I already had a repo that was backing up my blog... so why not set that to back up my RSS feed from Substack as well?

I added this to my existing backup.yml GitHub Actions workflow:

- name: Backup Substack
  run: |-
    curl 'https://simonw.substack.com/feed' | \
      python -c "import sys, xml.dom.minidom; print(xml.dom.minidom.parseString(sys.stdin.read()).toprettyxml(indent='  '))" \
      > simonw-substack-com.xml

I'm piping it through a tiny Python script here to pretty-print the XML before saving it, because pretty-printed XML is easier to read diffs against later on.
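
Unrolled from the one-liner, that pretty-printing step looks like this. The function name `pretty_print_xml` is an assumption for the sketch; the `xml.dom.minidom` calls are the same ones used in the workflow above.

```python
import xml.dom.minidom


def pretty_print_xml(raw_xml: str) -> str:
    """Re-serialize XML with two-space indentation so later diffs are line-oriented."""
    return xml.dom.minidom.parseString(raw_xml).toprettyxml(indent="  ")
```

With each element on its own line, a new newsletter edition shows up in the commit history as a block of added lines rather than one giant changed line.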

Now simonw-substack-com.xml is a copy of my RSS feed in a GitHub repo, which means I can access the data directly from JavaScript running on Observable.

Here's the code I wrote there to fetch that RSS feed, parse it as XML and return a string containing just the HTML of all of the posts:

previousNewsletters = {
  const response = await fetch(
    "https://raw.githubusercontent.com/simonw/simonwillisonblog-backup/main/simonw-substack-com.xml"
  );
  const rss = await response.text();
  const parser = new DOMParser();
  const xmlDoc = parser.parseFromString(rss, "application/xml");
  const xpathExpression = "//content:encoded";

  const namespaceResolver = (prefix) => {
    const ns = {
      content: "http://purl.org/rss/1.0/modules/content/"
    };
    return ns[prefix] || null;
  };

  const result = xmlDoc.evaluate(
    xpathExpression,
    xmlDoc,
    namespaceResolver,
    XPathResult.ANY_TYPE,
    null
  );
  let node;
  let text = [];
  while ((node = result.iterateNext())) {
    text.push(node.textContent);
  }
  return text.join("\n");
}
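
The same extraction can be sketched in Python with the standard library, which may be handy for checking the backed-up feed outside the notebook. The names `CONTENT_NS` and `previous_newsletters_html` are assumptions for this example; the namespace URI is the one used in the notebook code above.

```python
import xml.etree.ElementTree as ET

# Same namespace mapping as the namespaceResolver in the notebook.
CONTENT_NS = {"content": "http://purl.org/rss/1.0/modules/content/"}


def previous_newsletters_html(rss_text: str) -> str:
    """Join the text of every <content:encoded> element in an RSS document."""
    root = ET.fromstring(rss_text)
    return "\n".join(
        node.text or "" for node in root.iterfind(".//content:encoded", CONTENT_NS)
    )
```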

Then I spun up a regular expression to extract all of the URLs from that HTML:

previousLinks = {
  const regex = /(?:"|&quot;)(https?:\/\/[^\s"<>]+)(?:"|&quot;)/g;
  return Array.from(previousNewsletters.matchAll(regex), (match) => match[1]);
}
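
Here is the same pattern as a Python sketch, useful for seeing what it does and does not match. The names `QUOTED_URL` and `extract_links` are assumptions for this example. The pattern only captures URLs wrapped in a literal `"` or an escaped `&quot;` on both sides.

```python
import re

# A URL is captured only when it sits between two quote markers (" or &quot;).
QUOTED_URL = re.compile(r'(?:"|&quot;)(https?://[^\s"<>]+)(?:"|&quot;)')


def extract_links(html: str) -> list[str]:
    """Return every quoted http(s) URL found in the HTML, in document order."""
    return QUOTED_URL.findall(html)
```

A bare URL in running text, with no quotes around it, is deliberately ignored by this pattern.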

Added a "skip existing" toggle checkbox to my notebook:

viewof skipExisting = Inputs.toggle({
  label: "Skip content sent in prior newsletters"
})

And added this code to filter the raw content based on whether or not the toggle was selected:

content = skipExisting
  ? raw_content.filter(
      (e) =>
        !previousLinks.includes(e.url) &&
        !previousLinks.includes(e.external_url)
    )
  : raw_content

The url is the URL to the post on my blog. external_url is the URL to the original source of the blogmark or quotation. A match against either of those should exclude the content from my next newsletter.
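
The filtering rule can also be written as a small Python function. The name `filter_new_content` is an assumption for this sketch, and it swaps the array lookup for a set, which is a change from the notebook code made here only to keep membership checks cheap.

```python
def filter_new_content(raw_content, previous_links, skip_existing=True):
    """Drop items whose own URL or external URL already appeared in a newsletter."""
    if not skip_existing:
        return list(raw_content)
    seen = set(previous_links)
    return [
        item
        for item in raw_content
        if item["url"] not in seen and item.get("external_url") not in seen
    ]
```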

My workflow for sending a newsletter

Given all of the above, sending a newsletter out is hardly any work at all:

Ensure the most recent backup of my blog has run, such that the Datasette instance contains my latest content. I do that by triggering this action.
Navigate to https://observablehq.com/@simonw/blog-to-newsletter - select "Skip content sent in prior newsletters" and then click the "Copy rich text newsletter to clipboard" button.
Navigate to the Substack "publish" interface and paste that content into the rich text editor.
Pick a title and subheading, and maybe add a bit of introductory text.
Preview it. If the preview looks good, hit "send".

Copy and paste APIs

I think copy and paste is under-rated as an API mechanism.

There are no rate limits or API keys to worry about.

It's supported by almost every application, even ones that are resistant to API integrations.

It even works great on mobile phones, especially if you include a "copy to clipboard" button.

My datasette-copyable plugin for Datasette is one of my earlier explorations of this. It makes it easy to copy data out of Datasette in a variety of useful formats.

This Observable newsletter project has further convinced me that the clipboard is an under-utilized mechanism for building tools to help integrate data together in creative ways.

Tags: blogging, projects, datasette, observable, cors, newsletter, substack, site-upgrades

What to blog about

2022-11-06T17:05:37+00:00

You should start a blog. Having your own little corner of the internet is good for the soul!

But what should you write about?

It's easy to get hung up on this. I've definitely felt the self-imposed pressure to only write something if it's new, and unique, and feels like it's never been said before. This is a mental trap that does nothing but hold you back.

Here are two types of content that I guarantee you can produce and feel great about producing: TILs, and writing descriptions of your projects.

Today I Learned

A TIL - Today I Learned - is the most liberating form of content I know of.

Did you just learn how to do something? Write about that.

Call it a TIL - that way you're not promising anyone a revelation or an in-depth tutorial. You're saying "I just figured this out: here are my notes, you may find them useful too".

I also like the humility of this kind of content. Part of the reason I publish them is to emphasize that even with 25 years of professional experience you should still celebrate learning even the most basic of things.

I learned the "interact" command in pdb the other day! Here's my TIL.

I started publishing TILs in April 2020. I'm up to 346 now, and most of them took less than 10 minutes to write. It's such a great format for quick and satisfying online writing.

My collection lives at https://til.simonwillison.net - which publishes content from my simonw/til GitHub repository.

Write about your projects

If you do a project, you should write about it.

I recommend adding "write about it" to your definition of "done" for anything that you build or create.

Like with TILs, this takes away the pressure to be unique. It doesn't matter if your project overlaps with thousands of others: the experience of building it is unique to you. You deserve to have a few paragraphs and a screenshot out there explaining (and quietly celebrating) what you made.

The screenshot is particularly important. Will your project still exist and work in a decade? I hope so, but we all know how quickly things succumb to bit-rot.

Even better than a screenshot: an animated GIF screenshot! I capture these with LICEcap. And a video is even better than that, but those take a lot more effort to produce.

It's incredibly tempting to skip the step where you write about a project. But any time you do that you're leaving a huge amount of uncaptured value from that project on the table.

These days I make myself do it: I tell myself that writing about something is the cost I have to pay for building it. And I always end up feeling that the effort was more than worthwhile.

Check out my projects tag for examples of this kind of content.

So that's my advice for blogging: write about things you've learned, and write about things you've built!

Update 22nd December 2024: I identified a third useful category: writing about things you've found.

Tags: blogging, writing

Twenty years of my blog

2022-06-12T22:59:31+00:00

I started this blog on June 12th 2002 - twenty years ago today! To celebrate two decades of blogging, I decided to pull together some highlights and dive down a self-indulgent nostalgia hole.

Some highlights

Some of my more influential posts, in chronological order.

A new XML-RPC library for PHP - 2nd September 2002

I was really excited about XML-RPC, one of the earliest technologies for building Web APIs. IXR, the Incutio library for XML-RPC, was one of my earliest ever open source library releases. Here's a capture of the old site.

I've not touched anything relating to this project in over 15 years now, but it has lived on in both WordPress and Drupal (now only in Drupal 6 LTS).

It's also been responsible for at least one CVE vulnerability in those platforms!
getElementsBySelector() - 25th March 2003

Andrew Hayward had posted a delightful snippet of JavaScript called document.getElementsByClassName() - like document.getElementsByTagName() but for classes instead.

Inspired by this, I built document.getElementsBySelector() - a function that could take a CSS selector and return all of the matching elements.

This ended up being very influential indeed! Paul Irish offers a timeline of JavaScript CSS selector engines which tracks some of what happens next. Most notably, getElementsBySelector() was part of John Resig's inspiration in creating the first version of jQuery. To this day, the jQuery source includes this testing fixture which is derived from my original demo page.

I guess you could call document.getElementsBySelector() the original polyfill for document.querySelectorAll().
I'm in Kansas - 27th August 2003

In May 2003 Adrian Holovaty posted about a job opportunity for a web developer at the Lawrence Journal-World newspaper in Lawrence, Kansas.

This coincided with my UK university offering a "year in industry" placement, which meant I could work for a year anywhere in the world with a student visa program. I'd been reading Adrian's blog for a while and really liked the way he thought about building for the web - we were big fans of Web Standards and CSS and cleanly-designed URLs, all of which were very hot new things at the time!

So I talked to Adrian about whether this could work as a year-long opportunity, and we figured out how to make it work.

At the Lawrence Journal-World Adrian and I decided to start using Python instead of PHP, in order to build a CMS for that local newspaper...
Introducing Django - 17th July 2005

... and this was the eventual outcome! Adrian and I didn't even know we were building a web framework at first - we called it "the CMS". But we kept having to solve new foundational problems: how should database routing work? What about templating? What's the best way to represent the incoming HTTP request?

I had left the Lawrence Journal-World in 2004, but by 2005 the team there had grown what's now known as Django far beyond where it was when I had left, and they got the go-ahead from the company to release it as open source (partly thanks to the example set by Ruby on Rails, which first released in August 2004).

In 2010 I wrote up a more detailed history of Django in a Quora answer, now mirrored to my blog.
Finally powered by Django - 15th December 2006

In which I replaced my duct-tape-and-mud PHP blogging engine with a new Django app. I sadly don't have the version history for this anymore (this was pre-git, I think I probably had it in Subversion or Mercurial somewhere) but today's implementation is still based on the same code, upgraded to Django 1.8 in 2015.

That 2006 version did include a very pleasing Flickr integration to import my photos (example on the Internet Archive):
How to turn your blog in to an OpenID - 19th December 2006

In late 2006 I got very, very excited about OpenID. I was convinced that Microsoft Passport was going to take over SSO on the internet, and that the only way to stop that was to promote an open, decentralized solution. I wrote posts about it, made screencasts (that one got 840 diggs! Annoyingly I was serving it from the Internet Archive who appear to have deleted it) and gave a whole bunch of conference talks about it too.

I spent the next few years advocating for OpenID - in particular the URL-based OpenID mechanism where any website can be turned into an identifier. It didn't end up taking off, and with hindsight I think that's likely for the best: expecting people to take control of their own security by choosing their preferred authentication provider sounded great to me in 2006, but I can understand why companies chose to instead integrate with a smaller, tightly controlled set of SSO partners over time.
A few notes on the Guardian Open Platform - 10th March 2009

In 2009 I was working at the Guardian newspaper in London in my first proper data journalism role - my work at the Lawrence Journal-World had hinted towards that a little, but I spent the vast majority of my time there building out a CMS.

In March we launched two major initiatives: the Datablog (also known as the Data Store) and the Guardian's Open Platform (an API that is still offered to this day).

The goal of the Datablog was to share the data behind the stories. Simon Rogers, the Guardian's data editor, had been collecting meticulous datasets about the world to help power infographics in the paper for years. The new plan was to share that raw data with the world.

We started out using Google Sheets for this. I desperately wanted to come up with something less proprietary than that - I spent quite some time experimenting with CouchDB - but Google Sheets was more than enough to get the project started.

Many years later my continued mulling of this problem formed part of the inspiration for my creation of Datasette, a story I told in my 2018 PyBay talk How to Instantly Publish Data to the Internet with Datasette.
Why I like Redis - 22nd October 2009

I got interested in NoSQL for a few years starting around 2009. I still think Redis was the most interesting new piece of technology to come out of that whole movement - an in-memory data structure server exposed over the network turns out to be a fantastic complement for other data stores, and even though I now default to PostgreSQL or SQLite for almost everything else I can still find problems for which Redis is a great solution.

In April 2010 I gave a three hour Redis tutorial at NoSQL Europe which I wrote up in Comprehensive notes from my three hour Redis tutorial.
Node.js is genuinely exciting - 23rd November 2009

In November 2009 I found out about Node.js. As a Python web developer I had been following the evolution of Twisted with great interest, but I'd also run into the classic challenge that once you start using event-driven programming almost every library you might want to use likely doesn't work for you any more.

Node.js had server-side event-driven programming baked into its very core. You couldn't accidentally make a blocking call and break your event loop because it didn't ever give you the option to do so!

I liked it so much I switched out my talk for Full Frontal 2009 at the last minute for one about Node.js instead.

I think this was an influential decision. I won't say who they are (for fear of mis-representing or mis-quoting them), but I've talked to entrepreneurs who built significant products on top of server-side JavaScript who told me that they heard about Node.js from me first.
Crowdsourced document analysis and MP expenses - 20th December 2009

This was my biggest data journalism project at the Guardian.

The UK government had finally got around to releasing our Member of Parliament expense reports, and there was a giant scandal brewing about the expenses that had been claimed. We recruited our audience to help dig through 10,000s of pages of PDFs to help us find more stories.

The first round of the MP's expenses crowdsourcing project launched in June, but I was too busy working on it to properly write about it! Charles Arthur wrote about it for the Guardian in The breakneck race to build an application to crowdsource MPs' expenses.

In December we launched round two, and I took the time to write about it properly.

Here's a Google Scholar search for guardian mps expenses - I think it was pretty influential. It's definitely one of the projects I'm most proud of in my career so far.
WildlifeNearYou: It began on a fort... - 12th January 2010

In October 2008 I participated in the first /dev/fort - a bunch of nerds rent a fortress (or similar historic building) for a week and hack on a project together.

Following that week of work it took 14 months to add the "final touches" before putting the site we had built live (partly because I insisted on implementing OpenID for it) but in January 2010 we finally went live with WildlifeNearYou.com (sadly no longer available). It was a fabulous website, which crowdsourced places that people had seen animals in order to answer the crucial question "where is my nearest Llama?".

Here's what it looked like:

Although it shipped after the Guardian MP's expenses project most of the work on WildlifeNearYou had come before that - building WildlifeNearYou (in Django) was the reason I was confident that the MP's expenses project was feasible.
Getting married and going travelling - 21st June 2010

On June 5th 2010 I married Natalie Downe, and we both quit our jobs to set off travelling around the world and see how far we could get.

We got as far as Casablanca, Morocco before we accidentally launched a startup together: Lanyrd, launched in August 2010. "Sign in with Twitter to see conferences that your friends are speaking at, attending or tracking, then add your own events."

We ended up spending the next three years on this: we went through Y Combinator, raised a sizable seed round, moved to London, hired a team and shipped a LOT of features. We even managed to ship some features that made the company money!

This also coincided with me putting the blog on the back-burner for a few years.

Here's an early snapshot:

In 2013 we sold Lanyrd to Eventbrite, and moved our entire team (and their families) from London to San Francisco. It had been a very wild ride.

Sadly the site itself is no longer available: as Eventbrite grew it became impossible to justify the work needed to keep Lanyrd maintained, safe and secure. Especially as it started to attract overwhelming volumes of spam.

Natalie told the full story of Lanyrd on her blog in September 2013: Lanyrd: from idea to exit - the story of our startup.
Scraping hurricane Irma - 10th September 2017

In 2017 hurricane Irma devastated large areas of the Caribbean and the southern USA.

I got involved with the Irma Response project, helping crowdsource and publish critical information for people affected by the storm.

I came up with a trick to help with scraping: I ran scrapers against important information sources and recorded the results to a git repository, in order to cheaply track changes to those sources over time.

I later coined the term "Git scraping" for this technique, see my series of posts about Git scraping over time.
Getting the blog back together - 1st October 2017

Running a startup, and then working at Eventbrite afterwards, had resulted in an almost 7 year gap in blogging for me. In October 2017 I decided to finally get my blog going again. I also back-filled content for the intervening years by scraping my content from Quora and from Ask Metafilter.

If you've been meaning to start a new blog or revive an old one this is a trick that I can thoroughly recommend: just because you initially wrote something elsewhere doesn't mean you shouldn't repost it on a site you own.
Recovering missing content from the Internet Archive - 8th October 2017

The other step in recovering my old blog's content was picking up some content that was missing from my old database backup. Here's how I pulled in that content by scraping the Internet Archive.
Implementing faceted search with Django and PostgreSQL - 5th October 2017

I absolutely love building faceted search engines. I realized a while ago that most of my career has been spent applying the exact same trick - faceted search - to different problem spaces. WildlifeNearYou offered faceted search over animal sightings. MP's expenses had faceted search across crowdsourced expense analysis. Lanyrd was faceted search for conferences.

I implemented faceted search for this blog on top of PostgreSQL, and wrote about how I did it.
Datasette: instantly create and publish an API for your SQLite databases - 13th November 2017

I shipped the first release of simonw/datasette in November 2017. Nearly five years later it's now my number-one focus, and I don't see myself losing interest in it for many decades to come.

Datasette was inspired by the Guardian Datablog, combined with my realization that Zeit Now (today called Vercel) meant you could bundle data up in a SQLite database and deploy it as part of an exploratory application almost for free.

My blog has 284 items tagged datasette at this point.
Datasette Facets - 20th May 2018

Given how much I love faceted search, it's surprising it took me until May 2018 to realize that I could bake them into Datasette itself - turning it into a tool for building faceted search engines against any data. It turns out to be my ideal solution to my favourite problem!
Documentation unit tests - 28th July 2018

I figured out a pattern for using unit tests to ensure that features of my projects were covered by the documentation. Four years later I can confirm that this technique works really well - though I wish I'd called it Test-driven documentation instead!
Letterboxing on Lundy - 18th September 2018

A brief foray into travel writing: Natalie and I spent a few days staying in a small castle on the delightful island of Lundy off the coast of North Devon, and I used it as an opportunity to enthuse about letterboxing and the Landmark Trust.
sqlite-utils: a Python library and CLI tool for building SQLite databases - 25th February 2019

Datasette helps you explore and publish data stored in SQLite, but how do you get data into SQLite in the first place?

sqlite-utils is my answer to that question - a combined CLI tool and Python library with all sorts of utilities for working with and creating SQLite databases.

It recently had its 100th release!
I commissioned an oil painting of Barbra Streisand’s cloned dogs - 7th March 2019

Not much I can add that's not covered by the title. It's a really good painting!
My JSK Fellowship: Building an open source ecosystem of tools for data journalism - 10th September 2019

In late 2019 I left Eventbrite to join the JSK fellowship program at Stanford. It was an opportunity to devote myself full-time to working on my growing collection of open source tools for data journalism, centered around Datasette.

I jumped on that opportunity with both hands, and I've been mostly working full-time on Datasette and associated projects (without being paid for it since the fellowship ended) ever since.
Weeknotes: ONA19, twitter-to-sqlite, datasette-rure - 13th September 2019

At the start of my fellowship I decided to publish weeknotes, to keep myself accountable for what I was working on now that I didn't have the structure of a full-time job.

I've managed to post them roughly once a week ever since - 128 posts and counting.

I absolutely love weeknotes as a format. Even if no-one else ever reads them, I find them really useful as a way to keep track of my progress and ensure that I have motivation to get projects to a point where I can write about them at the end of the week!
Using a self-rewriting README powered by GitHub Actions to track TILs - 20th April 2020

In April 2020 I started publishing TILs - Today I Learneds - at til.simonwillison.net.

The idea behind TILs is to dramatically reduce the friction involved in writing a blog post. If I learned something that was useful to me, I'll write it up as a TIL. These often take less than ten minutes to throw together and I find myself referring back to them all the time.

My main blog is a Django application, but my TILs run entirely using Datasette. You can see how that all works in the simonw/til GitHub repository.
Using SQL to find my best photo of a pelican according to Apple Photos - 21st May 2020

Dogsheep is my ongoing side project in which I explore ways to analyze my own personal data using SQLite and Datasette.

dogsheep-photos is my tool for extracting metadata about my photos from the undocumented Apple Photos SQLite database (building on osxphotos by Rhet Turnbull). I had been wanting to solve the photo problem for years and was delighted when osxphotos provided the capability I had been missing. And I really like pelicans, so I celebrated by using my photos of them for the demo.
Git scraping: track changes over time by scraping to a Git repository - 9th October 2020

If you really want people to engage with a technique, it's helpful to give it a name. I defined Git scraping in this post, and I've been promoting it heavily ever since.

There are now 275 public repositories on GitHub with the git-scraping topic, and if you sort them by recently updated you can see the scrapers on there that most recently captured some new data.
Personal Data Warehouses: Reclaiming Your Data - 14th November 2020

I gave this talk for GitHub's OCTO (previously Office of the CTO, since rebranded to GitHub Next) speaker series.

It's the Dogsheep talk, with a better title (thanks, Idan!) It includes a full video demo of my personal Dogsheep instance, including my dog's Foursquare checkins, my Twitter data, Apple Watch GPS trails and more.

I also explain why I called it Dogsheep: it's a devastatingly terrible pun on Wolfram.

I'm frustrated when information like this is only available in video format, so when I give particularly information-dense talks I like to turn them into full write-ups as well, providing extra notes and resources alongside screen captures from the talk.

For this one I added a custom template mechanism to my blog, to allow me to break out of my usual entry page design.
Trying to end the pandemic a little earlier with VaccinateCA - 28th February 2021

In February 2021 I joined the VaccinateCA effort to try and help end the pandemic a little bit earlier by crowdsourcing information about the best places to get vaccinated. It was a classic match-up for my skills and interests: a huge crowdsourcing effort that needed to be spun up as a fresh Django application as quickly as possible.

Django SQL Dashboard was one project that spun directly out of that effort.
The Baked Data architectural pattern - 28th July 2021

My second attempt at coining a new term, after Git scraping: Baked Data is the name I'm using for the architectural pattern embodied by Datasette where you bundle a read-only copy of your data alongside the code for your application, as part of the same deployment. I think it's a really good idea, and more people should be doing it.
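A minimal sketch of the read-only half of that idea, using Python's standard sqlite3 module rather than Datasette itself. The open_baked_database() helper and its path argument are hypothetical names for illustration; mode=ro is SQLite's own URI parameter for opening a file read-only.

```python
import sqlite3


def open_baked_database(path):
    """Open a SQLite file that shipped alongside the application code.

    The file is assumed to have been built ahead of time, as part of the
    deployment. mode=ro means any attempt to write raises an error.
    """
    return sqlite3.connect(f"file:{path}?mode=ro", uri=True)
```

Because the connection can't modify the file, every deployed copy of the application serves exactly the data it was built with, and updating the data means shipping a new deployment.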
How I build a feature - 12th January 2022

Over the years I’ve evolved a process for feature development that works really well for me, and scales down to small personal projects as well as scaling up to much larger pieces of work. I described that in detail in this post.

Picking out these highlights wasn't easy. I ended up setting myself a time limit (to ensure I could put this post live within a minute of midnight UTC time on my blog's 20th birthday) so there's plenty more that I would have liked to dig up.

My tags index page includes a 2010s-style word cloud that you can visit if you want to explore the rest of my content. Or use the faceted search!

A few more project release highlights:

GraphQL in Datasette with the new datasette-graphql plugin - 7th August 2020
git-history: a tool for analyzing scraped data collected using Git and SQLite - 7th December 2021
shot-scraper: automated screenshots for documentation, built on Playwright - 10th March 2022
Django SQL Dashboard - 10th May 2021
Datasette Desktop—a macOS desktop application for Datasette - 8th September 2021
Datasette Lite: a server-side Python web application running in a browser - 4th May 2022

Evolution over time

I started my blog in my first year as a student studying computer science at the University of Bath.

You can tell that Twitter wasn't a thing yet, because I wrote 107 posts in that first month. Lots of links to other people's blog posts (we did a lot of that back then) with extra commentary. Lots of blogging about blogging.

That first version of the site was hosted at http://www.bath.ac.uk/~cs1spw/blog/ - on my university's student hosting. Sadly the Internet Archive doesn't have a capture of it there, since I moved it to http://simon.incutio.com/ (my part-time employer at the time) in September 2002. Here's my note from then about rewriting it to use MySQL instead of flat file storage.

This is the earliest capture I could find on the Internet Archive, from June 2003:

Full entry on Using bookmarklets to experiment with CSS.

By November 2006 I had redesigned from orange to green, and started writing Blogmarks - the name I used for small, bookmark-style link posts. I've collected 6,304 of them over the years!

By 2010 I'd reached more-or-less my current purple on white design, albeit with the ability to sign in with OpenID to post a comment. I dropped comments entirely when I relaunched in 2017 - constantly fighting against spam comments makes blogging much less fun.

The source code for the current iteration of my blog is available on GitHub.

Taking screenshots of the Internet Archive with shot-scraper

Here's how I generated the screenshots in this post, using shot-scraper against the Internet Archive but with a line of JavaScript to hide the banner they display at the top of every archived page:

shot-scraper 'https://web.archive.org/web/20030610004652/http://simon.incutio.com/' \
  --javascript 'document.querySelector("#wm-ipp-base").style.display="none"' \
   --width 800 --height 600 --retina

mgdlbp on Hacker News pointed out that you can instead add if_ to the date part of the archive URLs to hide the banner, like this:

shot-scraper 'https://web.archive.org/web/20030610004652if_/http://simon.incutio.com/' \
   --width 800 --height 600 --retina
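If there are a lot of archive URLs to rewrite, the if_ trick is easy to script. Here's a small Python sketch; hide_wayback_banner() is a hypothetical helper, and it assumes URLs in the https://web.archive.org/web/TIMESTAMP/... shape shown above.

```python
import re

# prefix, numeric timestamp, then the rest of the URL starting at the slash
WAYBACK_URL = re.compile(r"^(https://web\.archive\.org/web/)(\d+)(/.*)$")


def hide_wayback_banner(url):
    """Insert if_ after the timestamp of a Wayback Machine URL."""
    match = WAYBACK_URL.match(url)
    if match is None:
        # Not a plain timestamped archive URL (or it already has a
        # modifier such as if_), so leave it unchanged.
        return url
    prefix, timestamp, rest = match.groups()
    return f"{prefix}{timestamp}if_{rest}"
```

The resulting URL can then be passed straight to shot-scraper without the --javascript option.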

Tags: adrian-holovaty, blogging, personal-news

One year of TILs

2021-05-02T18:01:44+00:00

Just over a year ago I started tracking TILs, inspired by Josh Branchaud's collection. I've since published 148 TILs across 43 different topics. It's a great format!

TIL stands for Today I Learned. The thing I like most about TILs is that they drop the barrier to publishing something online to almost nothing.

If I'm writing a blog entry, I feel like it needs to say something new. This pressure for originality leads to vast numbers of incomplete, draft posts and a sporadic publishing schedule that trends towards not publishing anything at all.

(Establishing a weeknotes habit has helped enormously here too.)

The bar for a TIL is literally "did I just learn something?" - they effectively act as a public notebook.

They also reflect my values as a software engineer. The thing I love most about this career is that the opportunities to learn new things never reduce - there will always be new sub-disciplines to explore, and I aspire to learn something new every single working day.

My hope is that by publishing a constant stream of TILs I can reinforce the idea that even if you've been working in this industry for twenty years there will always be new things to learn, and learning any new trick - even the most basic thing - should be celebrated.

Tags: blogging, til

Implementing faceted search with Django and PostgreSQL

2017-10-05T14:12:27+00:00

I’ve added a faceted search engine to this blog, powered by PostgreSQL. It supports regular text search (proper search, not just SQL "like" queries), filter by tag, filter by date, filter by content type (entries vs blogmarks vs quotations) and any combination of the above. Some example searches:

It also provides facet counts, so you can tell how many results you will get back before you apply one of these filters - and get a general feeling for the shape of the corpus as you navigate it.

I love this kind of search interface, because the counts tell you so much more about the underlying data. Turns out I was most active in quoting people talking about JavaScript back in 2007, for example.
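To make "facet counts" concrete, here is a toy illustration in plain Python. It is not how this blog computes them (that happens in PostgreSQL); facet_counts() and the type, year and tags keys are made up for the example.

```python
from collections import Counter


def facet_counts(items):
    """Count how many items fall under each value of each facet."""
    counts = {"type": Counter(), "year": Counter(), "tag": Counter()}
    for item in items:
        counts["type"][item["type"]] += 1
        counts["year"][item["year"]] += 1
        # An item can carry several tags, and each one counts once
        counts["tag"].update(item["tags"])
    return counts


# Hypothetical search results
results = [
    {"type": "quotation", "year": 2007, "tags": ["javascript"]},
    {"type": "quotation", "year": 2007, "tags": ["javascript", "jquery"]},
    {"type": "entry", "year": 2017, "tags": ["postgresql", "django"]},
]
counts = facet_counts(results)
```

Each counter answers "how many results would I get if I applied this filter?", which is what lets the interface show a number next to every tag, year and content type.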

I usually build faceted search engines using either Solr or Elasticsearch (though the first version of search on this blog was actually powered by Hyper Estraier) - but I’m hosting this blog as simply and inexpensively as possible on Heroku and I don’t want to shell out for a SaaS search solution or run an Elasticsearch instance somewhere myself. I thought I’d have to go back to using Google Custom Search.

Then I read Postgres full-text search is Good Enough! by Rachid Belaid - closely followed by Postgres Full-Text Search With Django by Nathan Shafer - and I decided to have a play with the new PostgreSQL search functionality that was introduced in Django 1.10.

… and wow! Full-text search is yet another example of a feature that’s been in PostgreSQL for nearly a decade now, incrementally improving with every release to the point where it’s now really, really good.

At its most basic level a search system needs to handle four things:

It needs to take user input and find matching documents.
It needs to understand and ignore stopwords (common words like “the” and “and”) and apply stemming - knowing that “ridicule” and “ridiculous” should be treated as the same root, for example. Both of these features need to be language-aware.
It needs to be able to apply relevance ranking, calculating which documents are the best match for a search query.
It needs to be fast - working against some kind of index rather than scanning every available document in full.

Modern PostgreSQL ticks all of those boxes. Let’s put it to work.

Simple search without an index

Here’s how to execute a full-text search query against a simple text column:

from blog.models import Entry
from django.contrib.postgres.search import SearchVector

results = Entry.objects.annotate(
    searchable=SearchVector('body')
).filter(searchable='django')

The generated SQL looks something like this:

SELECT "blog_entry"."id", ...,
to_tsvector(COALESCE("blog_entry"."body", '')) AS "searchable"
FROM "blog_entry"
WHERE to_tsvector(COALESCE("blog_entry"."body", ''))
    @@ (plainto_tsquery('django')) = true
ORDER BY "blog_entry"."created" DESC

The SearchVector class constructs a stemmed, stopword-removed representation of the body column ready to be searched. The resulting queryset contains entries that are a match for “django”.

My blog entries are stored as HTML, but I don’t want search to include those HTML tags. One (extremely un-performant) solution is to use Django’s Func helper to apply a regular expression inside PostgreSQL to strip tags before they are considered for search:

from django.db.models import Value, F, Func

results = Entry.objects.annotate(
    searchable=SearchVector(
        Func(
            F('body'), Value('<.*?>'), Value(''), Value('g'),
            function='regexp_replace'
        )
    )
).filter(searchable='http')

Update 6th October 8:23pm UTC - it turns out this step is entirely unnecessary. Paolo Melchiorre points out that the PostgreSQL to_tsvector() function already handles tag removal. Sure enough, executing SELECT to_tsvector('<div>Hey look what happens to <blockquote>this tag</blockquote></div>') using SQL Fiddle returns 'happen':4 'hey':1 'look':2 'tag':7, with the tags already stripped.
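For anyone curious what that <.*?> pattern does, it's easy to try outside of PostgreSQL. This is a sketch using Python's re module as a stand-in for regexp_replace; strip_tags() is a hypothetical helper name, and Python's regex dialect is not identical to PostgreSQL's.

```python
import re

# Non-greedy match: from a "<" up to the nearest following ">"
TAG_PATTERN = re.compile(r"<.*?>")


def strip_tags(html):
    """Remove anything that looks like an HTML tag, keeping the text."""
    return TAG_PATTERN.sub("", html)
```

One thing to watch with this approach: because tags are replaced with an empty string, strip_tags("<p>one</p><p>two</p>") produces "onetwo", gluing neighbouring words together.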

This works, but performance isn’t great. PostgreSQL ends up having to scan every row and construct a list of search vectors for each one every time you execute a query.

If you want it to go fast, you need to add a special search vector column to your table and then create the appropriate index on it. As of Django 1.11 this is trivial:

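from django.contrib.postgres.indexes import GinIndex
from django.contrib.postgres.search import SearchVectorField
from django.db import models

class Entry(models.Model):
    # ...
    # Stores the precomputed search vector for this row
    search_document = SearchVectorField(null=True)

    class Meta:
        indexes = [
            # GIN index over the precomputed search vector
            GinIndex(fields=['search_document'])
        ]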

Django’s migration system will automatically add both the field and the special GIN index.

What’s trickier is populating that search_document field. Django does not yet support an easy method to populate it directly in your initial INSERT call, instead recommending that you populate it with a SQL UPDATE statement after the fact. Here is a one-liner that will populate the field for everything in that table (and strip tags at the same time):

def strip_tags_func(field):
    return Func(
        F(field), Value('<.*?>'), Value(''), Value('g'),
        function='regexp_replace'
    )
 
Entry.objects.update(
    search_document=(
        SearchVector('title', weight='A') +
        SearchVector(strip_tags_func('body'), weight='C')
    )
)

I’m using a neat feature of the SearchVector class here: it can be concatenated together using the + operator, and each component can be assigned a weight of A, B, C or D. These weights affect ranking calculations later on.

Updates using signals

We could just set this up to run periodically (as I did in my initial implementation), but we can get better real-time results by ensuring this field gets updated automatically when the rest of the model is modified. Some people solve this with PostgreSQL triggers, but I’m still more comfortable handling this kind of thing in Python code - so I opted to use Django’s signals mechanism instead.

Since I need to run search queries across three different types of blog content - Entries, Blogmarks and Quotations - I added a method to each model that returns the text fragments corresponding to each of the weight values. Here’s that method for my Quotation model:

class Quotation(models.Model):
    quotation = models.TextField()
    source = models.CharField(max_length=255)
    tags = models.ManyToManyField(Tag, blank=True)

    def index_components(self):
        return {
            'A': self.quotation,
            'B': ' '.join(self.tags.values_list('tag', flat=True)),
            'C': self.source,
        }

As you can see, I’m including the tags that have been assigned to the quotation in the searchable document.

Here are my signals - loaded once via an import statement in my blog application’s AppConfig.ready() method:

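import operator
from functools import reduce

from django.contrib.postgres.search import SearchVector
from django.db import models, transaction
from django.db.models import Value
from django.db.models.signals import m2m_changed, post_save
from django.dispatch import receiver

# BaseModel and Tag are the blog application's own models.

@receiver(post_save)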
def on_save(sender, **kwargs):
    if not issubclass(sender, BaseModel):
        return
    transaction.on_commit(make_updater(kwargs['instance']))

@receiver(m2m_changed)
def on_m2m_changed(sender, **kwargs):
    instance = kwargs['instance']
    model = kwargs['model']
    if model is Tag:
        transaction.on_commit(make_updater(instance))
    elif isinstance(instance, Tag):
        for obj in model.objects.filter(pk__in=kwargs['pk_set']):
            transaction.on_commit(make_updater(obj))

def make_updater(instance):
    components = instance.index_components()
    pk = instance.pk

    def on_commit():
        search_vectors = []
        for weight, text in components.items():
            search_vectors.append(
                SearchVector(Value(text, output_field=models.TextField()), weight=weight)
            )
        instance.__class__.objects.filter(pk=pk).update(
            search_document=reduce(operator.add, search_vectors)
        )
    return on_commit

(The full code can be found here).

The on_save function is pretty straightforward - it checks if the model that was just saved has my BaseModel as a base class, then it calls make_updater to get a function to be executed by the transaction.on_commit hook.
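The two ideas inside make_updater (capturing the text fragments immediately, then folding a list of vectors together with + once the transaction commits) can be sketched without Django at all. Everything here is a stand-in: FakeVector plays the role of SearchVector, and the pending list plays the role of the callbacks queued by transaction.on_commit.

```python
import operator
from functools import reduce


class FakeVector:
    """Stand-in for SearchVector: only records (text, weight) pairs."""

    def __init__(self, parts):
        self.parts = list(parts)

    def __add__(self, other):
        # Concatenation, mirroring SearchVector's + operator.
        return FakeVector(self.parts + other.parts)


pending = []  # stand-in for callbacks queued via transaction.on_commit
saved = []    # stand-in for the search_document column


def make_updater_sketch(components):
    # components is captured now, at the moment the signal fires...
    def on_commit():
        vectors = [
            FakeVector([(text, weight)])
            for weight, text in components.items()
        ]
        # ...and folded into a single value with + when the callback runs.
        saved.append(reduce(operator.add, vectors).parts)
    return on_commit


pending.append(make_updater_sketch({'A': 'a quotation', 'C': 'its source'}))
assert saved == []  # nothing is written until the "commit"
for callback in pending:
    callback()
print(saved)
```

Because the components are read before the callback is queued, the text that ends up indexed is whatever the instance contained when the signal fired.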

The on_m2m_changed handler is significantly more complicated. There are a number of scenarios in which this will be called - I’m reasonably confident that the idiom I use here will capture all of the modifications that should trigger a re-indexing operation.

Running a search now looks like this:

results = Entry.objects.filter(
    search_document=SearchQuery('django')
)

We need one more thing though: we need to sort our search results by relevance. PostgreSQL has pretty good relevance built in, and sorting by the relevance score can be done by applying a Django ORM annotation:

query = SearchQuery('ibm')

results = Entry.objects.filter(
    search_document=query
).annotate(
    rank=SearchRank(F('search_document'), query)
).order_by('-rank')

We now have basic full text search implemented against a single Django model, making use of a GIN index. This is lightning fast.

Searching multiple tables using queryset.union()

My site has three types of content, represented in three different models and hence three different underlying database tables.

I’m using an abstract base model to define common fields shared by all three: the created date, the slug (used to construct permalink urls) and the search_document field populated above.

As of Django 1.11 it’s possible to combine queries across different tables using the SQL union operator.

Here’s what that looks like for running a search across three tables, all with the same search_document search vector field. I need to use .values() to restrict the querysets I am unioning to the same subset of fields:

query = SearchQuery('django')
rank_annotation = SearchRank(F('search_document'), query)
qs = Blogmark.objects.annotate(
    rank=rank_annotation,
).filter(
    search_document=query
).values('pk', 'created', 'rank').union(
    Entry.objects.annotate(
        rank=rank_annotation,
    ).filter(
        search_document=query
    ).values('pk', 'created', 'rank'),
    Quotation.objects.annotate(
        rank=rank_annotation,
    ).filter(
        search_document=query
    ).values('pk', 'created', 'rank'),
).order_by('-rank')[:5]

# Output
<QuerySet [
    {'pk': 186, 'rank': 0.875179, 'created': datetime.datetime(2008, 4, 8, 13, 48, 18, tzinfo=<UTC>)},
    {'pk': 134, 'rank': 0.842655, 'created': datetime.datetime(2007, 10, 20, 13, 46, 56, tzinfo=<UTC>)},
    {'pk': 1591, 'rank': 0.804502, 'created': datetime.datetime(2009, 9, 28, 23, 32, 4, tzinfo=<UTC>)},
    {'pk': 5093, 'rank': 0.788616, 'created': datetime.datetime(2010, 2, 26, 19, 22, 47, tzinfo=<UTC>)},
    {'pk': 2598, 'rank': 0.786928, 'created': datetime.datetime(2007, 1, 26, 12, 38, 46, tzinfo=<UTC>)}
]>

This is not enough information though - I have the primary keys, but I don’t know which type of model they belong to. In order to retrieve the actual resulting objects from the database I need to know which type of content is represented by each of those results.

I can achieve that using another annotation:

qs = Blogmark.objects.annotate(
    rank=rank_annotation,
    type=models.Value('blogmark', output_field=models.CharField())
).filter(
    search_document=query
).values('pk', 'type', 'rank').union(
    Entry.objects.annotate(
        rank=rank_annotation,
        type=models.Value('entry', output_field=models.CharField())
    ).filter(
        search_document=query
    ).values('pk', 'type', 'rank'),
    Quotation.objects.annotate(
        rank=rank_annotation,
        type=models.Value('quotation', output_field=models.CharField())
    ).filter(
        search_document=query
    ).values('pk', 'type', 'rank'),
).order_by('-rank')[:5]

# Output:
<QuerySet [
    {'pk': 186, 'type': u'quotation', 'rank': 0.875179},
    {'pk': 134, 'type': u'quotation', 'rank': 0.842655},
    {'pk': 1591, 'type': u'entry', 'rank': 0.804502},
    {'pk': 5093, 'type': u'blogmark', 'rank': 0.788616},
    {'pk': 2598, 'type': u'blogmark', 'rank': 0.786928}
]>
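The underlying SQL idea, a constant "type" column added to each branch of a UNION, can be tried out with Python's built-in sqlite3 module. This is only an illustration: the three tables, their score column and the values in them are made up, and score stands in for the rank that PostgreSQL would calculate.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# Hypothetical tables; "score" stands in for the ts_rank() value.
conn.executescript("""
    CREATE TABLE entry (id INTEGER PRIMARY KEY, score REAL);
    CREATE TABLE blogmark (id INTEGER PRIMARY KEY, score REAL);
    CREATE TABLE quotation (id INTEGER PRIMARY KEY, score REAL);
    INSERT INTO entry VALUES (1591, 0.80);
    INSERT INTO blogmark VALUES (5093, 0.79);
    INSERT INTO blogmark VALUES (2598, 0.78);
    INSERT INTO quotation VALUES (186, 0.88);
    INSERT INTO quotation VALUES (134, 0.84);
""")

# Each branch labels its rows with a literal string, so the combined
# result says which table every primary key came from.
rows = conn.execute("""
    SELECT id AS pk, 'blogmark' AS type, score AS rank FROM blogmark
    UNION
    SELECT id, 'entry', score FROM entry
    UNION
    SELECT id, 'quotation', score FROM quotation
    ORDER BY rank DESC
    LIMIT 5
""").fetchall()

for pk, type_name, rank in rows:
    print(pk, type_name, rank)
```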

Now I just need to write a function which can take a list of types and primary keys and return the full objects from the database:

def load_mixed_objects(dicts):
    """
    Takes a list of dictionaries, each of which must at least have a 'type'
    and a 'pk' key. Returns a list of ORM objects of those various types.
    Each returned ORM object has a .original_dict attribute populated.
    """
    to_fetch = {}
    for d in dicts:
        to_fetch.setdefault(d['type'], set()).add(d['pk'])
    fetched = {}
    for key, model in (
        ('blogmark', Blogmark),
        ('entry', Entry),
        ('quotation', Quotation),
    ):
        ids = to_fetch.get(key) or []
        objects = model.objects.prefetch_related('tags').filter(pk__in=ids)
        for obj in objects:
            fetched[(key, obj.pk)] = obj
    # Build list in same order as dicts argument
    to_return = []
    for d in dicts:
        item = fetched.get((d['type'], d['pk'])) or None
        if item:
            item.original_dict = d
        to_return.append(item)
    return to_return
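The shape of that function (group the primary keys by type, fetch each group in one go, then rebuild the list in the original ranked order) is easier to see with the database removed. A minimal sketch, using hypothetical in-memory dictionaries in place of the three models:

```python
# Hypothetical in-memory "tables", keyed by primary key.
TABLES = {
    'entry': {1591: 'Entry 1591'},
    'blogmark': {5093: 'Blogmark 5093', 2598: 'Blogmark 2598'},
    'quotation': {186: 'Quotation 186', 134: 'Quotation 134'},
}


def load_mixed_sketch(dicts):
    # Group primary keys by type so each "table" is consulted once.
    to_fetch = {}
    for d in dicts:
        to_fetch.setdefault(d['type'], set()).add(d['pk'])
    fetched = {}
    for key, pks in to_fetch.items():
        for pk in pks:
            if pk in TABLES[key]:
                fetched[(key, pk)] = TABLES[key][pk]
    # Rebuild the list in the same (ranked) order as the input.
    return [fetched.get((d['type'], d['pk'])) for d in dicts]


ranked = [
    {'pk': 186, 'type': 'quotation'},
    {'pk': 1591, 'type': 'entry'},
    {'pk': 9999, 'type': 'entry'},  # no longer exists
    {'pk': 5093, 'type': 'blogmark'},
]
loaded = load_mixed_sketch(ranked)
print(loaded)
```

As in load_mixed_objects above, a row that can't be found comes back as None in its original position, so calling code needs to be prepared to skip those.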

One last challenge: when I add filtering by type, I’m going to want to selectively union together only a subset of these querysets. I need a queryset to start unions against, but I don’t yet know which queryset I will be using. I can abuse Django’s queryset.none() method to create an empty ValuesQuerySet in the correct shape like this:

qs = Entry.objects.annotate(
    type=models.Value('empty', output_field=models.CharField()),
    rank=rank_annotation
).values('pk', 'type', 'rank').none()

Now I can progressively build up my union in a loop like this:

for type_name, klass in (
    ('entry', Entry),
    ('blogmark', Blogmark),
    ('quotation', Quotation),
):
    qs = qs.union(klass.objects.annotate(
        rank=rank_annotation,
        # Label each row with the type of the model it came from
        type=models.Value(type_name, output_field=models.CharField())
    ).filter(
        search_document=query
    ).values('pk', 'type', 'rank'))

The Django ORM is smart enough to compile away the empty queryset when it constructs the SQL, which ends up looking something like this:

(((SELECT "blog_entry"."id",
            "entry" AS "type",
            ts_rank("blog_entry"."search_document", plainto_tsquery(%s)) AS "rank"
     FROM "blog_entry"
     WHERE "blog_entry"."search_document" @@ (plainto_tsquery(%s)) = TRUE
     ORDER BY "blog_entry"."created" DESC))
 UNION
   (SELECT "blog_blogmark"."id",
           "blogmark" AS "type",
           ts_rank("blog_blogmark"."search_document", plainto_tsquery(%s)) AS "rank"
    FROM "blog_blogmark"
    WHERE "blog_blogmark"."search_document" @@ (plainto_tsquery(%s)) = TRUE
    ORDER BY "blog_blogmark"."created" DESC))
UNION
  (SELECT "blog_quotation"."id",
          "quotation" AS "type",
          ts_rank("blog_quotation"."search_document", plainto_tsquery(%s)) AS "rank"
   FROM "blog_quotation"
   WHERE "blog_quotation"."search_document" @@ (plainto_tsquery(%s)) = TRUE
   ORDER BY "blog_quotation"."created" DESC)

Applying filters

So far, our search engine can only handle user-entered query strings. If I am going to build a faceted search interface I need to be able to handle filtering as well. I want the ability to filter by year, tag and type.

The key difference between filtering and querying (borrowing these definitions from Elasticsearch) is that querying is loose - it involves stemming and stopwords - while filtering is exact. Additionally, querying affects the calculated relevance score while filtering does not - a document either matches the filter or it doesn’t.

Since PostgreSQL is a relational database, filtering can be handled by simply constructing extra SQL where clauses using the Django ORM.

Each of the filters I need requires a slightly different approach. Filtering by type is easy - I just selectively include or exclude that model from my union queryset.

Year and month work like this:

selected_year = request.GET.get('year', '')
selected_month = request.GET.get('month', '')
if selected_year:
    qs = qs.filter(created__year=int(selected_year))
if selected_month:
    qs = qs.filter(created__month=int(selected_month))

Tags involve a join through a many-to-many relationship against the Tags table. We want to be able to apply more than one tag, for example this search for all items tagged both python and javascript. Django’s ORM makes this easy:

selected_tags = request.GET.getlist('tag')
for tag in selected_tags:
    qs = qs.filter(tags__tag=tag)
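The effect of that loop is that the selected tags are ANDed together: every pass narrows the result to items that also carry the next tag. Here is an in-memory analogy using made-up items and plain Python sets; the real version does the same narrowing with SQL joins.

```python
# Hypothetical items, each with a set of tag names.
items = [
    {'pk': 1, 'tags': {'python', 'javascript'}},
    {'pk': 2, 'tags': {'python'}},
    {'pk': 3, 'tags': {'javascript', 'django'}},
]


def filter_by_tags(items, selected_tags):
    remaining = items
    for tag in selected_tags:
        # Each pass keeps only the items that also have this tag.
        remaining = [item for item in remaining if tag in item['tags']]
    return remaining


matching = [
    item['pk'] for item in filter_by_tags(items, ['python', 'javascript'])
]
print(matching)
```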

Adding facet counts

There is just one more ingredient needed to complete our faceted search: facet counts!

Again, the way we calculate these is different for each of our filters. For types, we need to call .count() on a separate queryset for each of the types we are searching:

queryset = make_queryset(Entry, 'entry')
type_counts['entry'] = queryset.count()

(the make_queryset function is defined here)

For years we can do this:

from django.db.models.functions import TruncYear

for row in queryset.order_by().annotate(
    year=TruncYear('created')
).values('year').annotate(n=models.Count('pk')):
    year_counts[row['year']] = year_counts.get(
        row['year'], 0
    ) + row['n']
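Since that loop runs once per content type, year_counts ends up holding totals across all of the types being searched. The merging step can be sketched on its own with hypothetical rows; TruncYear produces a datetime for the start of each year, which is why the keys below are datetime objects. The final list of dictionaries is just one plausible shape for handing the counts to a template.

```python
import datetime

# Hypothetical rows, shaped like the output of the per-type
# .values('year').annotate(n=Count('pk')) queries.
rows_by_type = {
    'entry': [
        {'year': datetime.datetime(2007, 1, 1), 'n': 3},
        {'year': datetime.datetime(2008, 1, 1), 'n': 5},
    ],
    'blogmark': [
        {'year': datetime.datetime(2008, 1, 1), 'n': 2},
    ],
}

year_counts = {}
for rows in rows_by_type.values():
    for row in rows:
        # Add this type's count to whatever other types contributed.
        year_counts[row['year']] = year_counts.get(row['year'], 0) + row['n']

# One plausible template-friendly shape: a list sorted by year.
year_facets = [
    {'year': year, 'n': n} for year, n in sorted(year_counts.items())
]
print(year_facets)
```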

Tags are trickiest. Let’s take advantage of the fact that Django’s ORM knows how to construct sub-selects if you pass another queryset to the __in operator:

tag_counts = {}
type_name = 'entry'
queryset = make_queryset(Entry, 'entry')
for tag, count in Tag.objects.filter(**{
    '%s__in' % type_name: queryset
}).annotate(
    n=models.Count('tag')
).values_list('tag', 'n'):
    tag_counts[tag] = tag_counts.get(tag, 0) + count

Rendering it all in a template

Having constructed the various facet counts in the view function, the template is really simple:

{% if type_counts %}
    <h3>Types</h3>
    <ul>
        {% for t in type_counts %}
            <li><a href="{% add_qsarg "type" t.type %}">{{ t.type }}</a> {{ t.n }}</li>
        {% endfor %}
    </ul>
{% endif %}
{% if year_counts %}
    <h3>Years</h3>
    <ul>
        {% for t in year_counts %}
            <li><a href="{% add_qsarg "year" t.year|date:"Y" %}">{{ t.year|date:"Y" }}</a> {{ t.n }}</li>
        {% endfor %}
    </ul>
{% endif %}
{% if tag_counts %}
    <h3>Tags</h3>
    <ul>
        {% for t in tag_counts %}
            <li><a href="{% add_qsarg "tag" t.tag %}">{{ t.tag }}</a> {{ t.n }}</li>
        {% endfor %}
    </ul>
{% endif %}

I am using custom template tags here to add arguments to the current URL. I’ve built systems like this in the past where the URLs are instead generated in the view logic, which I think I prefer. As always, perfect is the enemy of shipped.
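The core of a tag like add_qsarg is a small URL manipulation, which could look something like the following. This is a guess at the behaviour using only the standard library, not the actual template tag: add_qsarg_sketch and its rule of skipping exact duplicates are assumptions.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_qsarg_sketch(url, name, value):
    # Hypothetical helper: append name=value to the query string of url,
    # keeping existing arguments and skipping an exact duplicate.
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    if (name, str(value)) not in pairs:
        pairs.append((name, str(value)))
    return urlunsplit(parts._replace(query=urlencode(pairs)))


one_tag = add_qsarg_sketch('/search/?q=django', 'tag', 'python')
two_tags = add_qsarg_sketch(one_tag, 'tag', 'javascript')
print(one_tag)
print(two_tags)
```

Keeping the arguments as a list of pairs rather than a dictionary matters here, because the tag filter relies on the same argument name appearing more than once.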

And because the results are just a Django queryset, we can use Django’s pagination helpers for the pagination links.

The final implementation

The full current version of the code at time of writing can be seen here. You can follow my initial implementation of this feature through the following commits: 7e3a0217 c7e7b30c 7f6b524c a16ddb5e 7055c7e1 74c194d9 f3ffc100 6c24d9fd cb88c2d4 2c262c75 776a562a b8484c50 0b361c78 1322ada2 79b1b13d 3955f41b 3f5ca052.

And that’s how I built faceted search on top of PostgreSQL and Django! I don’t have my blog comments up and running yet, so please post any thoughts or feedback over on this GitHub issue or over on this thread on Hacker News.

Update 9th September 2021: A few years after implementing this I started to notice performance issues with my blog, which turned out to be caused by search engine crawlers hitting every possible combination of facets, triggering a ton of expensive SQL queries. I excluded /search from being crawled using robots.txt which fixed the problem.
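A rule like that can be checked with Python's urllib.robotparser before deploying it. The robots.txt contents, the crawler name and the example.com URLs below are assumptions for the sake of the sketch; the only detail taken from the update above is that /search is excluded.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt contents, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Every faceted URL starts with /search, so all combinations are covered.
search_allowed = parser.can_fetch(
    'ExampleBot', 'https://example.com/search/?tag=python&year=2017'
)
entry_allowed = parser.can_fetch(
    'ExampleBot', 'https://example.com/2017/Oct/5/'
)
print(search_allowed, entry_allowed)
```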

Tags: django, full-text-search, orm, postgresql, projects, search, facetedsearch

1000th Blogmark

2004-08-26T00:30:54+00:00

I just posted my 1000th blogmark. I can't emphasize enough how much of an impact this 15 minute hack has had on both my browsing and my blogging habits. While I still tend to leave browser windows open for days at a time, I now at least have a procedure for getting rid of the ones that still interest me. More importantly, having blogmarks has eliminated the temptation to write a full blog entry (with quotation) just to share a link. This has dramatically reduced my posting rate, but has meant that when I do post an entry I usually have something moderately interesting to say.

To celebrate this personal milestone, I've linked up the rudimentary LIKE query search engine I've been using for a while on the blogmarks index page. My long term aim is still to integrate them with my main content and add comments in the style of photomatt, but that would require more time spent hacking on my blogging system (or switching to WordPress) than I have to spend right now.

Tags: blogging, site-upgrades

Blogmarks

2003-11-24T00:52:16+00:00

This entry was going to be another list of links, together with a note about how much I really needed to set up a separate link blog. Then I realised that it would make more sense just to set one up so that's exactly what I've done. I still need to implement the archive but it's getting dark so I'm posting this and heading home.

My main points of inspiration were Paul Hammond's bookmark store, Mark Pilgrim's b-links, Anil Dash's Daily Links and Jason Kottke's Remaindered Links. Since there didn't seem to be any naming convention I decided to call them blogmarks, which isn't a new term but doesn't seem to have a widely accepted meaning yet either.

The system is powered by a simple bookmarklet. To make things a little more interesting I'm capturing the referral information and using it to automatically generate the 'via' link; since the title of the previous page isn't available in Javascript I extract it using a server side script instead. I swayed briefly between using page extracts a la Hammond or sarcastic commentary a la Pilgrim and decided that commentary would be far more fun.

Tags: anil-dash, blogging, jason-kottke, mark-pilgrim, paul-hammond, site-upgrades

One year of blogging

2003-06-12T23:59:37+00:00

Today marks the first anniversary of the start of my blog (and, by a slightly contrived coincidence, my thousandth blog entry). It's been a fun year. Here are my highlights - if you can't stand lengthy self-congratulatory bullet points, stop reading now.

My first post covered the launch of phase two of the Web Standards project. I can remember agonising over a first post for ages, before eventually copping out and going for something dull but unchallenging.
A few days later I had my first weblog driven discussion, a debate with Hixie about standards compliance. Unsurprisingly, I lost - but it took me nearly a year to properly understand the issues involved.
The 24th of June saw my first rant about website usability with respect to the Glastonbury festival site: amazingly the rant drew a response from the creator in October of that year.
My second rant was aimed at a far more deserving target: Connected Earth, whose site was so terrible it ended up as an example of what not to do in Jeffrey Zeldman's new book.
On the 6th of July I discovered blo.gs; I've been using it to power my blogroll ever since.
The 10th of July saw my first published CSS experiment: Numbered code listings. It's cropped up in a few different places since then.
Around the 14th, I discovered wikis, setting up the initial MACCAWS wiki and the Smarty wiki, which is still going strong.
Amazon launched their Web Service API on the 17th, and I released a PHP sample implementation on the same day.
I launched Archivist in August, the mailing list archive system used for the css-discuss archive.
September 2nd was another active day: I released the initial version of my XML-RPC library, and used it to create the first Pingback implementation, based on an idea by Stuart.
Towards the end of September I started an ill-fated experiment in blogging my lecture notes. I soon realised that lecture notes work better when confined to a separate site.
My big project for October was the css-discuss wiki, launched on the 11th.
Later in October came my experimental XML-RPC interface to the W3C HTML validator. I keep meaning to return to that and finish it off.
In November I attempted to run a PHP training course, and found that it was harder than I thought it would be. The training material I wrote is still available though.
I released Blockquote Citations in December, my first useful Javascript hack.
In January I switched to Linux.
The two big hacks for February were my Image Drag bookmarklet and Safe HTML Checker class for my comments system.
March was a month for playing with Javascript: I released a PHP/Javascript spell checker and getElementsBySelector, my most ambitious piece of javascript to date.
Also in March, I announced BCSS - the new Computer Science society at my University.
In April I released a bunch of PHP hacks, the most important being HttpClient, followed by code for supporting conditional GET and an XMLWriter class.
Finally, in May a rant about structural markup led to my CSS tutorial series, of which there is plenty more on the way.

I've gained a huge amount from the last year, thanks almost entirely to the many excellent bloggers who have inspired me along the way (most of whom are listed on my blogroll). Here's to another exciting year.

Tags: blogging, php