<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: full-text-search</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/full-text-search.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2024-10-04T16:22:09+00:00</updated><author><name>Simon Willison</name></author><entry><title>Hybrid full-text search and vector search with SQLite</title><link href="https://simonwillison.net/2024/Oct/4/hybrid-full-text-search-and-vector-search-with-sqlite/#atom-tag" rel="alternate"/><published>2024-10-04T16:22:09+00:00</published><updated>2024-10-04T16:22:09+00:00</updated><id>https://simonwillison.net/2024/Oct/4/hybrid-full-text-search-and-vector-search-with-sqlite/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://alexgarcia.xyz/blog/2024/sqlite-vec-hybrid-search/index.html"&gt;Hybrid full-text search and vector search with SQLite&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;As part of Alex’s work on his &lt;a href="https://github.com/asg017/sqlite-vec"&gt;sqlite-vec&lt;/a&gt; SQLite extension - adding fast vector lookups to SQLite - he’s been investigating hybrid search, where search results from both vector similarity and traditional full-text search are combined together.&lt;/p&gt;
&lt;p&gt;The most promising approach looks to be &lt;a href="https://learn.microsoft.com/en-us/azure/search/hybrid-search-ranking"&gt;Reciprocal Rank Fusion&lt;/a&gt;, which combines the top ranked items from both approaches. Here’s Alex’s SQL query:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;--&lt;/span&gt; the sqlite-vec KNN vector search results&lt;/span&gt;
with vec_matches &lt;span class="pl-k"&gt;as&lt;/span&gt; (
  &lt;span class="pl-k"&gt;select&lt;/span&gt;
    article_id,
    row_number() over (&lt;span class="pl-k"&gt;order by&lt;/span&gt; distance) &lt;span class="pl-k"&gt;as&lt;/span&gt; rank_number,
    distance
  &lt;span class="pl-k"&gt;from&lt;/span&gt; vec_articles
  &lt;span class="pl-k"&gt;where&lt;/span&gt;
    headline_embedding match lembed(:query)
    &lt;span class="pl-k"&gt;and&lt;/span&gt; k &lt;span class="pl-k"&gt;=&lt;/span&gt; :k
),
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;--&lt;/span&gt; the FTS5 search results&lt;/span&gt;
fts_matches &lt;span class="pl-k"&gt;as&lt;/span&gt; (
  &lt;span class="pl-k"&gt;select&lt;/span&gt;
    rowid,
    row_number() over (&lt;span class="pl-k"&gt;order by&lt;/span&gt; rank) &lt;span class="pl-k"&gt;as&lt;/span&gt; rank_number,
    rank &lt;span class="pl-k"&gt;as&lt;/span&gt; score
  &lt;span class="pl-k"&gt;from&lt;/span&gt; fts_articles
  &lt;span class="pl-k"&gt;where&lt;/span&gt; headline match :query
  &lt;span class="pl-k"&gt;limit&lt;/span&gt; :k
),
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;--&lt;/span&gt; combine FTS5 + vector search results with RRF&lt;/span&gt;
final &lt;span class="pl-k"&gt;as&lt;/span&gt; (
  &lt;span class="pl-k"&gt;select&lt;/span&gt;
    &lt;span class="pl-c1"&gt;articles&lt;/span&gt;.&lt;span class="pl-c1"&gt;id&lt;/span&gt;,
    &lt;span class="pl-c1"&gt;articles&lt;/span&gt;.&lt;span class="pl-c1"&gt;headline&lt;/span&gt;,
    &lt;span class="pl-c1"&gt;vec_matches&lt;/span&gt;.&lt;span class="pl-c1"&gt;rank_number&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; vec_rank,
    &lt;span class="pl-c1"&gt;fts_matches&lt;/span&gt;.&lt;span class="pl-c1"&gt;rank_number&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; fts_rank,
    &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;--&lt;/span&gt; RRF algorithm&lt;/span&gt;
    (
      coalesce(&lt;span class="pl-c1"&gt;1&lt;/span&gt;.&lt;span class="pl-c1"&gt;0&lt;/span&gt; &lt;span class="pl-k"&gt;/&lt;/span&gt; (:rrf_k &lt;span class="pl-k"&gt;+&lt;/span&gt; &lt;span class="pl-c1"&gt;fts_matches&lt;/span&gt;.&lt;span class="pl-c1"&gt;rank_number&lt;/span&gt;), &lt;span class="pl-c1"&gt;0&lt;/span&gt;.&lt;span class="pl-c1"&gt;0&lt;/span&gt;) &lt;span class="pl-k"&gt;*&lt;/span&gt; :weight_fts &lt;span class="pl-k"&gt;+&lt;/span&gt;
      coalesce(&lt;span class="pl-c1"&gt;1&lt;/span&gt;.&lt;span class="pl-c1"&gt;0&lt;/span&gt; &lt;span class="pl-k"&gt;/&lt;/span&gt; (:rrf_k &lt;span class="pl-k"&gt;+&lt;/span&gt; &lt;span class="pl-c1"&gt;vec_matches&lt;/span&gt;.&lt;span class="pl-c1"&gt;rank_number&lt;/span&gt;), &lt;span class="pl-c1"&gt;0&lt;/span&gt;.&lt;span class="pl-c1"&gt;0&lt;/span&gt;) &lt;span class="pl-k"&gt;*&lt;/span&gt; :weight_vec
    ) &lt;span class="pl-k"&gt;as&lt;/span&gt; combined_rank,
    &lt;span class="pl-c1"&gt;vec_matches&lt;/span&gt;.&lt;span class="pl-c1"&gt;distance&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; vec_distance,
    &lt;span class="pl-c1"&gt;fts_matches&lt;/span&gt;.&lt;span class="pl-c1"&gt;score&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; fts_score
  &lt;span class="pl-k"&gt;from&lt;/span&gt; fts_matches
  full outer &lt;span class="pl-k"&gt;join&lt;/span&gt; vec_matches &lt;span class="pl-k"&gt;on&lt;/span&gt; &lt;span class="pl-c1"&gt;vec_matches&lt;/span&gt;.&lt;span class="pl-c1"&gt;article_id&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;fts_matches&lt;/span&gt;.&lt;span class="pl-c1"&gt;rowid&lt;/span&gt;
  &lt;span class="pl-k"&gt;join&lt;/span&gt; articles &lt;span class="pl-k"&gt;on&lt;/span&gt; &lt;span class="pl-c1"&gt;articles&lt;/span&gt;.&lt;span class="pl-c1"&gt;rowid&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; coalesce(&lt;span class="pl-c1"&gt;fts_matches&lt;/span&gt;.&lt;span class="pl-c1"&gt;rowid&lt;/span&gt;, &lt;span class="pl-c1"&gt;vec_matches&lt;/span&gt;.&lt;span class="pl-c1"&gt;article_id&lt;/span&gt;)
  &lt;span class="pl-k"&gt;order by&lt;/span&gt; combined_rank &lt;span class="pl-k"&gt;desc&lt;/span&gt;
)
&lt;span class="pl-k"&gt;select&lt;/span&gt; &lt;span class="pl-k"&gt;*&lt;/span&gt; &lt;span class="pl-k"&gt;from&lt;/span&gt; final;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I’ve puzzled in the past over how best to do that, because the distance scores from vector similarity and the relevance scores from FTS aren’t directly comparable to each other. RRF doesn’t even attempt to compare them - it uses them purely for &lt;code&gt;row_number()&lt;/code&gt; ranking within each set and combines the results based on those ranks.&lt;/p&gt;
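&lt;p&gt;The RRF arithmetic itself is tiny. Here’s a minimal Python sketch of the same idea, operating on two ranked lists of IDs - the function name, the sample IDs and the default &lt;code&gt;k=60&lt;/code&gt; constant are my own choices, not part of Alex’s query:&lt;/p&gt;

```python
def reciprocal_rank_fusion(fts_ids, vec_ids, k=60, weight_fts=1.0, weight_vec=1.0):
    # Each list is ordered best-first; an item's contribution depends only
    # on its 1-based rank in that list, never on the raw FTS or distance score.
    scores = {}
    for rank, doc_id in enumerate(fts_ids, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + weight_fts / (k + rank)
    for rank, doc_id in enumerate(vec_ids, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + weight_vec / (k + rank)
    # Highest combined score first
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results: "b" appears in both lists, so it wins overall
fused = reciprocal_rank_fusion(["a", "b", "c"], ["b", "d"])
```

&lt;p&gt;With those inputs &lt;code&gt;fused&lt;/code&gt; comes out as &lt;code&gt;["b", "a", "d", "c"]&lt;/code&gt; - the item both searches agreed on floats to the top.&lt;/p&gt;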


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/full-text-search"&gt;full-text-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/search"&gt;search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sql"&gt;sql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/alex-garcia"&gt;alex-garcia&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vector-search"&gt;vector-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/embeddings"&gt;embeddings&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rag"&gt;rag&lt;/a&gt;&lt;/p&gt;



</summary><category term="full-text-search"/><category term="search"/><category term="sql"/><category term="sqlite"/><category term="alex-garcia"/><category term="vector-search"/><category term="embeddings"/><category term="rag"/></entry><entry><title>tantivy-cli</title><link href="https://simonwillison.net/2024/Jun/13/tantivy-cli/#atom-tag" rel="alternate"/><published>2024-06-13T06:03:00+00:00</published><updated>2024-06-13T06:03:00+00:00</updated><id>https://simonwillison.net/2024/Jun/13/tantivy-cli/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/quickwit-oss/tantivy-cli"&gt;tantivy-cli&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I tried out this Rust-based search engine today and I was very impressed.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/quickwit-oss/tantivy"&gt;Tantivy&lt;/a&gt; is the core project - it's an open source (MIT) Rust library that implements Lucene-style full text search, with a very full set of features: BM25 ranking, faceted search, range queries, incremental indexing etc.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;tantivy-cli&lt;/code&gt; offers a CLI wrapper around the Rust library. It's not actually as full-featured as I hoped: it's intended as more of a demo than a full exposure of the library's features. The JSON API server it runs can only be used to run simple keyword or phrase searches, for example - no faceting or filtering.&lt;/p&gt;
&lt;p&gt;Tantivy's performance is fantastic. I was able to index the entire contents of my link blog in a fraction of a second.&lt;/p&gt;
&lt;p&gt;I found &lt;a href="https://fulmicoton.com/posts/behold-tantivy/"&gt;this post&lt;/a&gt; from 2017 where Tantivy creator Paul Masurel described the initial architecture of his new search side-project that he created to help him learn Rust. Paul went on to found &lt;a href="https://quickwit.io/"&gt;Quickwit&lt;/a&gt;, an impressive looking analytics platform that uses Tantivy as one of its core components.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/quickwit-oss/tantivy-py"&gt;Python bindings&lt;/a&gt; for Tantivy look well maintained, wrapping the Rust library using &lt;a href="https://github.com/PyO3/maturin"&gt;maturin&lt;/a&gt;. Those are probably the best way for a developer like myself to really start exploring what it can do.&lt;/p&gt;
&lt;p&gt;Also notable: the &lt;a href="https://news.ycombinator.com/item?id=40492834"&gt;Hacker News thread&lt;/a&gt; has dozens of posts from happy Tantivy users reporting successful use in their projects.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=40492834"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/full-text-search"&gt;full-text-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/search"&gt;search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rust"&gt;rust&lt;/a&gt;&lt;/p&gt;



</summary><category term="cli"/><category term="full-text-search"/><category term="search"/><category term="rust"/></entry><entry><title>Weeknotes: datasette-dump, sqlite-backup, talks</title><link href="https://simonwillison.net/2020/Sep/11/weeknotes-sqlite-dump/#atom-tag" rel="alternate"/><published>2020-09-11T04:47:54+00:00</published><updated>2020-09-11T04:47:54+00:00</updated><id>https://simonwillison.net/2020/Sep/11/weeknotes-sqlite-dump/#atom-tag</id><summary type="html">
    &lt;p&gt;I spent some time this week digging into Python's sqlite3 internals. I also gave two talks and recorded a third, due to air at PyGotham in October.&lt;/p&gt;
&lt;h4&gt;sqlite-dump and datasette-backup&lt;/h4&gt;
&lt;p&gt;I'm running an increasing number of Datasette instances with mutable database files - databases that are updated through a variety of different mechanisms. So I need to start thinking about backups.&lt;/p&gt;
&lt;p&gt;Prior to this most of my database files had been relatively disposable: they're built from other sources of data (often by scheduled GitHub Actions) so backups weren't necessary since I could always rebuild them from their source of truth.&lt;/p&gt;
&lt;p&gt;Creating a straight copy of a SQLite database file isn't enough for robust backups, because the file may be accepting writes while you are creating the copy.&lt;/p&gt;
&lt;p&gt;SQLite has various mechanisms for backups. There's an &lt;a href="https://sqlite.org/backup.html"&gt;online backup API&lt;/a&gt; and more recent SQLite versions support a &lt;a href="https://sqlite.org/lang_vacuum.html#vacuuminto"&gt;VACUUM INTO&lt;/a&gt; command which also optimizes the backed up database.&lt;/p&gt;
&lt;p&gt;I figured it would be useful to expose this functionality via a Datasette plugin - one that could allow automated backups to be directly fetched from Datasette over HTTPS. So I started work on &lt;a href="https://github.com/simonw/datasette-backup"&gt;datasette-backup&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For the first backup mode, I decided to take advantage of the &lt;a href="https://docs.python.org/3/library/sqlite3.html#sqlite3.Connection.iterdump"&gt;connection.iterdump()&lt;/a&gt; method that's built into Python's sqlite3 module. This method is an iterator that outputs plain text SQL that can recreate a database. Crucially it's a streaming-compatible mechanism - unlike &lt;code&gt;VACUUM INTO&lt;/code&gt;, which would require me to create a temporary file the same size as the database I was backing up.&lt;/p&gt;
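&lt;p&gt;A minimal sketch of that streaming approach using only the standard library - the helper name and the demo schema here are hypothetical, not the actual datasette-backup code:&lt;/p&gt;

```python
import os
import sqlite3
import tempfile

def stream_backup(path):
    # iterdump() is a generator, so the dump can be fed straight into a
    # streaming HTTP response without buffering the whole database in memory.
    db = sqlite3.connect(path)
    try:
        for statement in db.iterdump():
            yield statement + "\n"
    finally:
        db.close()

# Build a throwaway database to dump
path = os.path.join(tempfile.mkdtemp(), "demo.db")
db = sqlite3.connect(path)
db.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, headline TEXT)")
db.execute("INSERT INTO articles (headline) VALUES ('Hello')")
db.commit()
db.close()

dump = "".join(stream_backup(path))
```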
&lt;p&gt;I started experimenting with it, and ran into &lt;a href="https://github.com/simonw/datasette-backup/issues/5"&gt;a big problem&lt;/a&gt;. I make extensive use of SQLite full-text search, but the &lt;code&gt;.sql&lt;/code&gt; dumps generated by &lt;code&gt;.iterdump()&lt;/code&gt; break with constraint errors if they include any FTS tables.&lt;/p&gt;
&lt;p&gt;After a bit of digging I came across &lt;a href="https://github.com/python/cpython/blame/v3.8.5/Lib/sqlite3/dump.py#L38-L45"&gt;a 13 year old comment about this&lt;/a&gt; in the CPython source code itself!&lt;/p&gt;
&lt;p&gt;The implementation for &lt;code&gt;.iterdump()&lt;/code&gt; turns out to be entirely in Python, and way less complicated than I had expected. So I decided to see if I could get FTS table exports working.&lt;/p&gt;
&lt;p&gt;In a classic case of yak shaving, I decided to create a Python library called &lt;a href="https://github.com/simonw/sqlite-dump"&gt;sqlite-dump&lt;/a&gt; to solve this problem. And since my existing cookiecutter templates only cover &lt;a href="https://github.com/simonw/datasette-plugin"&gt;Datasette Plugins&lt;/a&gt; or &lt;a href="https://github.com/simonw/click-app"&gt;Click apps&lt;/a&gt; I first needed to create a new &lt;a href="https://github.com/simonw/python-lib"&gt;python-lib&lt;/a&gt; template in order to create the library I needed for my plugin.&lt;/p&gt;
&lt;p&gt;I got it working! Install the &lt;a href="https://github.com/simonw/datasette-backup"&gt;datasette-backup&lt;/a&gt; plugin on any Datasette instance to get a &lt;code&gt;/-/backup/name-of-database.sql&lt;/code&gt; URL that will produce a streaming SQL dump of any attached database.&lt;/p&gt;
&lt;h4&gt;A weird bug with SQLite FTS and triggers&lt;/h4&gt;
&lt;p&gt;While working on &lt;code&gt;datasette-backup&lt;/code&gt; I noticed a weird issue with some of my SQLite full-text search enabled databases: they kept getting bigger. Way bigger than I would expect them to.&lt;/p&gt;
&lt;p&gt;I eventually noticed that the &lt;code&gt;licenses_fts&lt;/code&gt; table in my &lt;a href="https://github-to-sqlite.dogsheep.net/"&gt;github-to-sqlite demo database&lt;/a&gt; had 7 rows in it, but the accompanying &lt;code&gt;licenses_fts_docsize&lt;/code&gt; table had 9,141. I would expect it to have only 7 as well.&lt;/p&gt;
&lt;p&gt;I was stumped as to what was going on, so I &lt;a href="https://sqlite.org/forum/forumpost/e05a261524"&gt;turned to the official SQLite forum&lt;/a&gt;. I only recently discovered how useful this is as a resource. Dan Kennedy, one of the three core SQLite maintainers, replied within an hour and gave me some useful hints. The root cause turned out to be the way SQLite triggers work: by default, SQLite runs in &lt;code&gt;recursive_triggers=off&lt;/code&gt; mode (for backwards compatibility with older databases). This means that an &lt;code&gt;INSERT OR REPLACE&lt;/code&gt; update to a table that is backed by full-text search may not correctly trigger the updates needed on the FTS table itself.&lt;/p&gt;
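&lt;p&gt;The behaviour is easy to reproduce with a contrived schema - the table and trigger below are stand-ins of my own, not the real github-to-sqlite tables:&lt;/p&gt;

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE licenses (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE log (event TEXT);
CREATE TRIGGER licenses_ad AFTER DELETE ON licenses BEGIN
  INSERT INTO log (event) VALUES ('delete');
END;
""")
db.execute("INSERT INTO licenses VALUES (1, 'mit')")

# With recursive_triggers off (the default), the implicit DELETE that
# INSERT OR REPLACE performs does not fire the delete trigger:
db.execute("INSERT OR REPLACE INTO licenses VALUES (1, 'apache-2.0')")
deletes_before = db.execute("SELECT count(*) FROM log").fetchone()[0]  # 0

# Turn the pragma on and the trigger fires as expected:
db.execute("PRAGMA recursive_triggers=on")
db.execute("INSERT OR REPLACE INTO licenses VALUES (1, 'bsd')")
deletes_after = db.execute("SELECT count(*) FROM log").fetchone()[0]  # 1
```

&lt;p&gt;An FTS table maintained by delete triggers suffers exactly this fate: the old row is silently replaced in the content table, but the index never hears about the deletion.&lt;/p&gt;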
&lt;p&gt;Since there doesn't appear to be any disadvantage to running with &lt;code&gt;recursive_triggers=on&lt;/code&gt; I've now set that as the default for &lt;code&gt;sqlite-utils&lt;/code&gt;, as-of &lt;a href="https://sqlite-utils.readthedocs.io/en/stable/changelog.html#v2-17"&gt;version 2.17&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I then added a &lt;code&gt;sqlite-utils rebuild-fts data.db&lt;/code&gt; command in &lt;a href="https://sqlite-utils.readthedocs.io/en/stable/changelog.html#v2-18"&gt;version 2.18&lt;/a&gt; which can rebuild the FTS tables in a database and fix the &lt;code&gt;_fts_docsize&lt;/code&gt; problem.&lt;/p&gt;
&lt;h4&gt;Talks&lt;/h4&gt;
&lt;p&gt;I presented &lt;strong&gt;Build your own data warehouse for personal analytics with SQLite and Datasette&lt;/strong&gt; at PyCon AU last week. The &lt;a href="https://www.youtube.com/watch?v=CPQCD3Qxxik"&gt;video is here&lt;/a&gt; and includes my first public demo of &lt;a href="https://github.com/dogsheep/beta"&gt;Dogsheep Beta&lt;/a&gt;, my new combined search engine for personal analytics data imported using my &lt;a href="https://dogsheep.github.io/"&gt;Dogsheep&lt;/a&gt; family of tools. I took questions in &lt;a href="https://docs.google.com/document/d/1q0sH954Iv83sctmB_sfHJg2zZMYZGhY_n2s38yOCpOU/edit#heading=h.b1b8d2228j9"&gt;this Google Doc&lt;/a&gt;, and filled out more detailed answers after the talk.&lt;/p&gt;
&lt;p&gt;I gave a talk at PyRVA a couple of days ago called &lt;strong&gt;Rapid data analysis with SQLite and Datasette&lt;/strong&gt;. Here's &lt;a href="https://www.youtube.com/watch?v=5U3RyzxTKtQ"&gt;the video&lt;/a&gt; and &lt;a href="https://docs.google.com/document/d/1clxqE0xCqPw5baFLGKOr29iur1sV9ruMj8MNW-Xlzpo/edit"&gt;Google Doc&lt;/a&gt; for that one.&lt;/p&gt;
&lt;p&gt;I also pre-recorded my talk for PyGotham: &lt;a href="https://2020.pygotham.tv/talks/datasette-an-ecosystem-of-tools-for-working-with-small-data/"&gt;Datasette - an ecosystem of tools for working with Small Data&lt;/a&gt;. The conference is in the first week of October and I'll be hanging out there during the talk answering questions and chatting about the project, safe from the stress of also having to present it live!&lt;/p&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/til/til/typescript_basic-tsc.md"&gt;Very basic tsc usage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/til/til/aws_instance-costs-per-month.md"&gt;Display EC2 instance costs per month&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/til/til/linux_basic-strace.md"&gt;Basic strace to see what a process is doing&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-dns/releases/tag/0.1a1"&gt;datasette-dns 0.1a1&lt;/a&gt; - 2020-09-10&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-dns/releases/tag/0.1a0"&gt;datasette-dns 0.1a0&lt;/a&gt; - 2020-09-10&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/dogsheep/dogsheep-beta/releases/tag/0.7a0"&gt;dogsheep-beta 0.7a0&lt;/a&gt; - 2020-09-09&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/sqlite-utils/releases/tag/2.18"&gt;sqlite-utils 2.18&lt;/a&gt; - 2020-09-08&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/sqlite-utils/releases/tag/2.17"&gt;sqlite-utils 2.17&lt;/a&gt; - 2020-09-07&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-backup/releases/tag/0.1"&gt;datasette-backup 0.1&lt;/a&gt; - 2020-09-07&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/sqlite-dump/releases/tag/0.1.1"&gt;sqlite-dump 0.1.1&lt;/a&gt; - 2020-09-07&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/sqlite-dump/releases/tag/0.1"&gt;sqlite-dump 0.1&lt;/a&gt; - 2020-09-07&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/sqlite-dump/releases/tag/0.1a"&gt;sqlite-dump 0.1a&lt;/a&gt; - 2020-09-06&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-backup/releases/tag/0.1a"&gt;datasette-backup 0.1a&lt;/a&gt; - 2020-09-06&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-block-robots/releases/tag/0.3"&gt;datasette-block-robots 0.3&lt;/a&gt; - 2020-09-06&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-block-robots/releases/tag/0.2"&gt;datasette-block-robots 0.2&lt;/a&gt; - 2020-09-05&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/dogsheep/dogsheep-beta/releases/tag/0.6"&gt;dogsheep-beta 0.6&lt;/a&gt; - 2020-09-05&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/dogsheep/dogsheep-beta/releases/tag/0.5"&gt;dogsheep-beta 0.5&lt;/a&gt; - 2020-09-04&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/full-text-search"&gt;full-text-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/speaking"&gt;speaking&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="full-text-search"/><category term="speaking"/><category term="sqlite"/><category term="datasette"/><category term="weeknotes"/><category term="sqlite-utils"/></entry><entry><title>sqlite-utils 2.14</title><link href="https://simonwillison.net/2020/Aug/1/sqlite-utils-214/#atom-tag" rel="alternate"/><published>2020-08-01T21:19:04+00:00</published><updated>2020-08-01T21:19:04+00:00</updated><id>https://simonwillison.net/2020/Aug/1/sqlite-utils-214/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://sqlite-utils.readthedocs.io/en/stable/changelog.html#v2-14"&gt;sqlite-utils 2.14&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I finally figured out porter stemming with SQLite full-text search today—it turns out it’s as easy as adding &lt;code&gt;tokenize='porter'&lt;/code&gt; to the &lt;code&gt;CREATE VIRTUAL TABLE&lt;/code&gt; statement. So I just shipped sqlite-utils 2.14 with a &lt;code&gt;tokenize=&lt;/code&gt; option (plus the ability to insert binary file data from stdin).
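&lt;p&gt;A quick sketch of what that looks like against Python’s built-in sqlite3 module - FTS5 syntax shown here, and the table and column names are illustrative:&lt;/p&gt;

```python
import sqlite3

db = sqlite3.connect(":memory:")
# tokenize='porter' is the whole trick: terms are stemmed as they are
# indexed, and query terms are stemmed the same way at search time
db.execute("CREATE VIRTUAL TABLE docs USING fts5(body, tokenize='porter')")
db.execute("INSERT INTO docs (body) VALUES ('searching stemmed documents')")
# A query for the stem matches the inflected form in the document
matches = db.execute("SELECT body FROM docs WHERE docs MATCH 'search'").fetchall()
```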

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/simonw/status/1289669022410104832"&gt;@simonw&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/full-text-search"&gt;full-text-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/search"&gt;search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/p&gt;



</summary><category term="full-text-search"/><category term="projects"/><category term="search"/><category term="sqlite"/><category term="sqlite-utils"/></entry><entry><title>datasette-search-all: a new plugin for searching multiple Datasette tables at once</title><link href="https://simonwillison.net/2020/Mar/9/datasette-search-all/#atom-tag" rel="alternate"/><published>2020-03-09T00:59:37+00:00</published><updated>2020-03-09T00:59:37+00:00</updated><id>https://simonwillison.net/2020/Mar/9/datasette-search-all/#atom-tag</id><summary type="html">
    &lt;p&gt;I just released a new plugin for Datasette, and it's pretty fun. &lt;a href="https://github.com/simonw/datasette-search-all"&gt;datasette-search-all&lt;/a&gt; is a plugin written mostly in JavaScript that executes the same search query against every searchable table in every database connected to your Datasette instance.&lt;/p&gt;

&lt;p&gt;You can try it out on my FARA (Foreign Agents Registration Act) search site, &lt;a href="https://fara.datasettes.com/"&gt;fara.datasettes.com&lt;/a&gt; - see &lt;a href="https://simonwillison.net/2020/Jan/21/github-actions-cloud-run/"&gt;Deploying a data API using GitHub Actions and Cloud Run&lt;/a&gt; for background on that project.&lt;/p&gt;

&lt;p&gt;Here's &lt;a href="https://fara.datasettes.com/-/search?q=manafort"&gt;a search for manafort&lt;/a&gt; across all four FARA tables (derived from CSVs originally pulled from the &lt;a href="https://efile.fara.gov/ords/f?p=API:BULKDATA:0:"&gt;Justice Department bulk data site&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;&lt;img alt="Running a search for manafort" src="https://static.simonwillison.net/static/2020/datasette-search-all.gif" style="max-width: 100%" /&gt;&lt;/p&gt;

&lt;p&gt;I've been planning to build cross-table search for Datasette for quite a while now. It's a potentially very difficult problem: searching a single table is easy, but the moment you attempt to search multiple tables you run into a number of challenges:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;Different tables have different columns. How do you present those in a single search interface?&lt;/li&gt;&lt;li&gt;Full-text search relevance scores make sense within a single table (due to the statistics they rely on, see &lt;a href="https://simonwillison.net/2019/Jan/7/exploring-search-relevance-algorithms-sqlite/"&gt;Exploring search relevance algorithms with SQLite&lt;/a&gt;) but cannot be compared across multiple tables.&lt;/li&gt;&lt;/ul&gt;

&lt;p&gt;I have an idea for how I can address these, but it involves creating a single separate full-text index table that incorporates text from many different tables, along with a complex set of indexing mechanisms (maybe driven by triggers) for keeping it up to date.&lt;/p&gt;

&lt;h3&gt;But maybe I was overthinking this?&lt;/h3&gt;

&lt;p&gt;While I stewed on the ideal way to solve this problem, projects like my FARA site were stuck without cross-table search. Then this morning I realized that there was another way: I could build pretty much the simplest thing that could possibly work (always a good plan in my experience).&lt;/p&gt;

&lt;p&gt;Here's how the new plugin works: it scans through every table attached to Datasette looking for tables that are configured for full-text search. Then it presents a UI which can execute searches against ALL of those tables, showing the top five results from each one.&lt;/p&gt;

&lt;p&gt;The scanning-for-searchable-tables &lt;a href="https://github.com/simonw/datasette-search-all/blob/c0772d1012709af04906569588cd2be04fa71aa4/datasette_search_all/utils.py"&gt;happens in Python&lt;/a&gt;, but the actual searching is all &lt;a href="https://github.com/simonw/datasette-search-all/blob/c0772d1012709af04906569588cd2be04fa71aa4/datasette_search_all/templates/search_all.html#L111-L118"&gt;in client-side JavaScript&lt;/a&gt;. The searches run in parallel, which means the user sees results from the fastest (smallest) tables first, then the larger, slower tables drop in at the bottom.&lt;/p&gt;

&lt;p&gt;It's stupidly simple, but I really like the result. It's also a neat demonstration of running parallel SQL queries from JavaScript, a technique which I'm keen to apply to all sorts of other interesting problems.&lt;/p&gt;

&lt;h3&gt;JavaScript style&lt;/h3&gt;

&lt;p&gt;The JavaScript I wrote for this project is unconventional for 2020: it's a block of inline script on the page, using no libraries or frameworks, but taking advantage of modern niceties like &lt;a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Template_literals"&gt;backtick template literals&lt;/a&gt; and &lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/Fetch_API/Using_Fetch"&gt;fetch()&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The code is messy, short and extremely easy to change in the future. It doesn't require running a build tool. I'm pretty happy with it.&lt;/p&gt;

&lt;h3&gt;Adding a search form to the homepage&lt;/h3&gt;

&lt;p&gt;The other thing the plugin does is add a search box to the Datasette homepage (as seen on the FARA site) - but only if the attached databases contain at least one FTS-configured searchable table.&lt;/p&gt;

&lt;p&gt;There are two parts to the implementation here. The first is an &lt;code&gt;extra_template_vars()&lt;/code&gt; plugin hook which injects a &lt;code&gt;searchable_tables&lt;/code&gt; variable into the homepage context - &lt;a href="https://github.com/simonw/datasette-search-all/blob/f9b2a9c3328f35021a4744205d862c108f8f08eb/datasette_search_all/__init__.py#L6-L15"&gt;code here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The second is a custom &lt;code&gt;index.html&lt;/code&gt; template which &lt;a href="https://github.com/simonw/datasette-search-all/blob/f9b2a9c3328f35021a4744205d862c108f8f08eb/datasette_search_all/templates/index.html"&gt;ships with the plugin&lt;/a&gt;. When Datasette renders a template it looks first in the local &lt;code&gt;--template-dirs&lt;/code&gt; folder (if that option was used), secondly in all of the installed plugins and finally in the Datasette set of default templates.&lt;/p&gt;

&lt;p&gt;The new &lt;code&gt;index.html&lt;/code&gt; template starts with &lt;code&gt;{% extends "default:index.html" %}&lt;/code&gt;, which means it extends the default template that shipped with Datasette. It then redefines the &lt;code&gt;description_source_license&lt;/code&gt; block from that template to conditionally show the search form.&lt;/p&gt;

&lt;p&gt;I'm not at all happy with abusing &lt;code&gt;description_source_license&lt;/code&gt; in this way - it just happens to be a block located at the top of that page. As I write more plugins that customize the Datasette UI in some way I continually run into this problem: plugins need to add markup to pages at specific points, but they also need to do so in a way that won't override what other plugins are up to.&lt;/p&gt;

&lt;p&gt;I'm beginning to formulate an idea for how Datasette can better support this, but until that's ready I'll be stuck with hacks like the one used here.&lt;/p&gt;

&lt;h3&gt;Using this with datasette-configure-fts and datasette-upload-csvs&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://github.com/simonw/datasette-configure-fts"&gt;datasette-configure-fts plugin&lt;/a&gt; provides a simple UI for configuring search for different tables, by selecting which columns should be searchable.&lt;/p&gt;

&lt;p&gt;Combining this with &lt;code&gt;datasette-search-all&lt;/code&gt; is really powerful. It means you can dump a bunch of CSVs into Datasette (maybe using &lt;a href="https://github.com/simonw/datasette-upload-csvs"&gt;datasette-upload-csvs&lt;/a&gt;), select some columns and then run searches across all of those different data sources in one place.&lt;/p&gt;

&lt;p&gt;Not bad for 93 lines of JavaScript and a bit of Python glue!&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/full-text-search"&gt;full-text-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/plugins"&gt;plugins&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/search"&gt;search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="full-text-search"/><category term="javascript"/><category term="plugins"/><category term="projects"/><category term="search"/><category term="datasette"/></entry><entry><title>sqlite-utils 1.11</title><link href="https://simonwillison.net/2019/Sep/3/sqlite-utils-fts/#atom-tag" rel="alternate"/><published>2019-09-03T01:05:41+00:00</published><updated>2019-09-03T01:05:41+00:00</updated><id>https://simonwillison.net/2019/Sep/3/sqlite-utils-fts/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://sqlite-utils.readthedocs.io/en/stable/changelog.html#v1-11"&gt;sqlite-utils 1.11&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Amjith Ramanujam contributed an excellent new feature to sqlite-utils, which I’ve now released as part of version 1.11. Previously you could enable SQLite full-text search on a table using the .enable_fts() method (or the “sqlite-utils enable-fts” CLI command) but it wouldn’t reflect future changes to the table—you had to call populate_fts() any time you inserted new records. Thanks to Amjith you can now pass create_triggers=True (or --create-triggers) to have sqlite-utils automatically add triggers that keep the FTS index up-to-date any time a row is inserted, updated or deleted from the table.
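&lt;p&gt;Triggers of this shape are the standard way to keep an external-content FTS index in sync. Here’s a hand-rolled sketch of the pattern using FTS5 and illustrative names - sqlite-utils generates its own equivalents, which differ in detail:&lt;/p&gt;

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE articles (id INTEGER PRIMARY KEY, headline TEXT);
CREATE VIRTUAL TABLE articles_fts USING fts5(
    headline, content='articles', content_rowid='id'
);
-- Keep the index current on insert, delete and update:
CREATE TRIGGER articles_ai AFTER INSERT ON articles BEGIN
  INSERT INTO articles_fts (rowid, headline) VALUES (new.id, new.headline);
END;
CREATE TRIGGER articles_ad AFTER DELETE ON articles BEGIN
  INSERT INTO articles_fts (articles_fts, rowid, headline)
  VALUES ('delete', old.id, old.headline);
END;
CREATE TRIGGER articles_au AFTER UPDATE ON articles BEGIN
  INSERT INTO articles_fts (articles_fts, rowid, headline)
  VALUES ('delete', old.id, old.headline);
  INSERT INTO articles_fts (rowid, headline) VALUES (new.id, new.headline);
END;
""")
db.execute("INSERT INTO articles (headline) VALUES ('Hybrid search in SQLite')")
# The FTS index picked the new row up with no separate populate step:
hits = db.execute(
    "SELECT headline FROM articles_fts WHERE articles_fts MATCH 'hybrid'"
).fetchall()
```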


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/full-text-search"&gt;full-text-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/p&gt;



</summary><category term="cli"/><category term="full-text-search"/><category term="projects"/><category term="sqlite"/><category term="sqlite-utils"/></entry><entry><title>Exploring search relevance algorithms with SQLite</title><link href="https://simonwillison.net/2019/Jan/7/exploring-search-relevance-algorithms-sqlite/#atom-tag" rel="alternate"/><published>2019-01-07T03:29:29+00:00</published><updated>2019-01-07T03:29:29+00:00</updated><id>https://simonwillison.net/2019/Jan/7/exploring-search-relevance-algorithms-sqlite/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;a href="https://www.sqlite.org/index.html"&gt;SQLite&lt;/a&gt; isn’t just a fast, high quality embedded database: it also incorporates a powerful full-text search engine in the form of the &lt;a href="https://www.sqlite.org/fts3.html"&gt;FTS4&lt;/a&gt; and &lt;a href="https://sqlite.org/fts5.html"&gt;FTS5&lt;/a&gt; extensions. You’ve probably used these a bunch of times already: many iOS, Android and desktop applications use SQLite under-the-hood and use it to implement their built-in search.&lt;/p&gt;
&lt;p&gt;I’ve been using these capabilities for &lt;a href="https://datasette.readthedocs.io/en/stable/full_text_search.html"&gt;basic search in Datasette&lt;/a&gt; for over a year now, but I’ve recently started digging into some of their more advanced features. It turns out hacking around with SQLite is a great way to learn more about how fundamental information retrieval algorithms work under the hood.&lt;/p&gt;
&lt;p&gt;Today I’m releasing &lt;a href="https://github.com/simonw/sqlite-fts4"&gt;sqlite-fts4&lt;/a&gt; - a Python package that provides a collection of custom SQL functions for working with SQLite’s FTS4 module. It includes some neat tools for introspecting how relevancy ranking algorithms actually work.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Why_not_just_use_FTS5_8"&gt;&lt;/a&gt;Why not just use FTS5?&lt;/h3&gt;
&lt;p&gt;If it’s available to you, FTS5 is usually the best option: it has a good ranking algorithm built in. I described how to use it to build &lt;a href="https://24ways.org/2018/fast-autocomplete-search-for-your-website/"&gt;fast autocomplete search for your website&lt;/a&gt; for the 2018 &lt;a href="https://24ways.org/"&gt;24 ways advent calendar&lt;/a&gt;. You can join directly against a virtual table and order by a pre-calculated relevance score accessible through that table.&lt;/p&gt;
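As a minimal sketch of that pattern (assuming your Python sqlite3 build has FTS5 compiled in), you can query the virtual table directly and order by its hidden &lt;code&gt;rank&lt;/code&gt; column:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("create virtual table articles using fts5(title, body)")
db.executemany(
    "insert into articles (title, body) values (?, ?)",
    [
        ("Fast autocomplete search", "Building search with FTS5 and jQuery"),
        ("CSS grid layout", "A guide to grid layout on the web"),
    ],
)
# FTS5 exposes a pre-calculated relevance score as the hidden "rank" column
rows = db.execute(
    "select title from articles where articles match ? order by rank",
    ("search",),
).fetchall()
print(rows)
```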
&lt;p&gt;What makes FTS4 interesting is that it doesn’t include a scoring mechanism: it instead exposes raw statistical data to you in a way that lets you build your own ranking functions.&lt;/p&gt;
&lt;p&gt;You probably don’t &lt;em&gt;need&lt;/em&gt; to do this - unless you are stuck on an older SQLite version that doesn’t support the latest features. But… if you’re interested in understanding more about how search actually works, the need to implement a ranking function is an excellent learning opportunity.&lt;/p&gt;
&lt;p&gt;I’ll be demonstrating these functions using a hosted Datasette instance running at &lt;a href="https://datasette-sqlite-fts4.datasette.io/"&gt;datasette-sqlite-fts4.datasette.io&lt;/a&gt; (with the data from &lt;a href="https://24ways.org/2018/fast-autocomplete-search-for-your-website/"&gt;my 24 ways article&lt;/a&gt;). You can play with them out there, or if you want to use your own Datasette instance you can enable these custom SQL functions by pip installing my new &lt;a href="https://github.com/simonw/datasette-sqlite-fts4"&gt;datasette-sqlite-fts4&lt;/a&gt; plugin.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Raw_FTS4_matchinfo_data_18"&gt;&lt;/a&gt;Raw FTS4 matchinfo() data&lt;/h3&gt;
&lt;p&gt;When using FTS4, the only scoring help SQLite gives you is the built-in &lt;a href="https://www.sqlite.org/fts3.html#matchinfo"&gt;matchinfo() function&lt;/a&gt;. For each document in your search result set, this function will expose raw statistical data that can be used to calculate a score.&lt;/p&gt;
&lt;p&gt;Let’s try it out using the following query:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;select
    *, matchinfo(articles_fts, &amp;quot;pcx&amp;quot;)
from
    articles_fts
where
    articles_fts match :search
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;a href="https://datasette-sqlite-fts4.datasette.io/24ways-fts4?sql=select%0D%0A++++*%2C+matchinfo%28articles_fts%2C+%22pcx%22%29%0D%0Afrom%0D%0A++++articles_fts%0D%0Awhere%0D%0A++++articles_fts+match+%3Asearch&amp;amp;search=jquery+maps"&gt;Run matchinfo() in Datasette&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;pcx&lt;/code&gt; here is called the format string - it lets SQLite know what information about the match you would like to see.&lt;/p&gt;
&lt;p&gt;The results are returned as a binary string! For the first matching document, we get back the following:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;\x02\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\xa3\x00\x00\x00\x1f\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\\\x00\x00\x00\x15\x00\x00\x00&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;SQLite’s C heritage is showing through here.&lt;/p&gt;
&lt;h3&gt;&lt;a id="decode_matchinfo_to_decode_the_binary_41"&gt;&lt;/a&gt;decode_matchinfo() to decode the binary&lt;/h3&gt;
&lt;p&gt;The first step in working with matchinfo is to decode that binary string. It’s actually a sequence of unsigned 32 bit integers. We can turn it into a Python list of integers using the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import struct

struct.unpack(&amp;quot;I&amp;quot; * (len(matchinfo) // 4), matchinfo)
&lt;/code&gt;&lt;/pre&gt;
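To make that concrete, here is the same decoding applied to the exact binary value shown above (assuming a little-endian machine, which is what produced this blob):

```python
import struct

# the exact matchinfo() value shown above, as Python bytes
matchinfo = (
    b"\x02\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00"
    b"\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00"
    b"\x01\x00\x00\x00\xa3\x00\x00\x00\x1f\x00\x00\x00\x00\x00\x00\x00"
    b"\x02\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00"
    b"\x00\x00\x00\x00\x02\x00\x00\x00\x5c\x00\x00\x00\x15\x00\x00\x00"
)
# "I" uses native byte order, matching what SQLite emitted on this machine
decoded = list(struct.unpack("I" * (len(matchinfo) // 4), matchinfo))
print(decoded)
# [2, 3, 0, 2, 2, 0, 0, 0, 1, 163, 31, 0, 2, 2, 0, 0, 0, 2, 92, 21]
```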
&lt;p&gt;&lt;code&gt;sqlite-fts4&lt;/code&gt; exposes a SQL function called &lt;code&gt;decode_matchinfo()&lt;/code&gt; which does exactly this. Let’s expand our example to use it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;select
    title, author,
    decode_matchinfo(matchinfo(articles_fts, &amp;quot;pcx&amp;quot;)),
    matchinfo(articles_fts, &amp;quot;pcx&amp;quot;)
from
    articles_fts
where
    articles_fts match :search
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;a href="https://datasette-sqlite-fts4.datasette.io/24ways-fts4?sql=select%0D%0A++++title%2C+author%2C%0D%0A++++decode_matchinfo%28matchinfo%28articles_fts%2C+%22pcx%22%29%29%2C%0D%0A++++matchinfo%28articles_fts%2C+%22pcx%22%29%0D%0Afrom%0D%0A++++articles_fts%0D%0Awhere%0D%0A++++articles_fts+match+%3Asearch&amp;amp;search=jquery+maps"&gt;Run decode_matchinfo() in Datasette&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The matchinfo for our first matching document now looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[2, 3, 0, 2, 2, 0, 0, 0, 1, 163, 31, 0, 2, 2, 0, 0, 0, 2, 92, 21]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Better, but still obscure. What does it mean?&lt;/p&gt;
&lt;p&gt;The answer lies in the &lt;a href="https://www.sqlite.org/fts3.html#matchinfo"&gt;SQLite matchinfo documentation&lt;/a&gt;. In our format string, we requested &lt;code&gt;p&lt;/code&gt;, &lt;code&gt;c&lt;/code&gt; and &lt;code&gt;x&lt;/code&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;p&lt;/code&gt; requests a single integer representing the number of search terms we are matching. Since our search query is &lt;code&gt;jquery maps&lt;/code&gt; this is 2.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;c&lt;/code&gt; requests the number of searchable columns in our table. We created &lt;code&gt;articles_fts&lt;/code&gt; with 3 columns, so it’s 3. That’s the second integer in the list.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;x&lt;/code&gt; is much more interesting: it returns 3 integer values for each term/column combination. Since we have 2 terms and 3 columns, that means we get back 2 * 3 = 6 combinations, or 6 * 3 = 18 integers. If you count the items in the array above you’ll see there are 18 left after you remove the first two. Each triple represents the number of times the term appears in this column of the current row, the number of times it appears in this column across every row, and the total number of documents where this column matches the term at least once.&lt;/li&gt;
&lt;/ul&gt;
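To make the arithmetic concrete, here is a short sketch that slices the decoded list from above into those per-term/column triples:

```python
# the decoded matchinfo from above: p, c, then 18 integers for "x"
decoded = [2, 3, 0, 2, 2, 0, 0, 0, 1, 163, 31, 0, 2, 2, 0, 0, 0, 2, 92, 21]
phrases, columns = decoded[0], decoded[1]
x = decoded[2:]
assert len(x) == phrases * columns * 3  # 2 terms * 3 columns * 3 values = 18
# group into (hits_this_column_this_row, hits_this_column_all_rows, docs_with_hits)
triples = [tuple(x[i : i + 3]) for i in range(0, len(x), 3)]
print(triples)
# [(0, 2, 2), (0, 0, 0), (1, 163, 31), (0, 2, 2), (0, 0, 0), (2, 92, 21)]
```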
&lt;p&gt;Search relevancy scores are usually calculated against exactly this kind of collection of statistics: we rank based on how rare the matching terms are across the rest of the corpus.&lt;/p&gt;
&lt;h3&gt;&lt;a id="annotate_matchinfo_to_annotate_the_integers_74"&gt;&lt;/a&gt;annotate_matchinfo() to annotate the integers&lt;/h3&gt;
&lt;p&gt;Having a list of integers made things easier, but still not easy enough. That’s where &lt;code&gt;annotate_matchinfo()&lt;/code&gt; comes in. This custom SQL function expands the matchinfo list of integers into a giant JSON object describing exactly what each of the results means.&lt;/p&gt;
&lt;p&gt;We’ll try it out like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;select
    title, author,
    decode_matchinfo(matchinfo(articles_fts, &amp;quot;pcx&amp;quot;)),
    json_object(&amp;quot;pre&amp;quot;, annotate_matchinfo(matchinfo(articles_fts, &amp;quot;pcx&amp;quot;), &amp;quot;pcx&amp;quot;))
from
    articles_fts
where
    articles_fts match :search
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;a href="https://datasette-sqlite-fts4.datasette.io/24ways-fts4?sql=select%0D%0A++++title%2C+author%2C%0D%0A++++decode_matchinfo%28matchinfo%28articles_fts%2C+%22pcx%22%29%29%2C%0D%0A++++json_object%28%22pre%22%2C+annotate_matchinfo%28matchinfo%28articles_fts%2C+%22pcx%22%29%2C+%22pcx%22%29%29%0D%0Afrom%0D%0A++++articles_fts%0D%0Awhere%0D%0A++++articles_fts+match+%3Asearch&amp;amp;search=jquery+maps"&gt;Run annotate_matchinfo() in Datasette&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Note that we have to provide the format string twice, so that &lt;code&gt;annotate_matchinfo()&lt;/code&gt; knows the requested order of the binary matchinfo data.&lt;/p&gt;
&lt;p&gt;This returns a JSON object that looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{
  &amp;quot;p&amp;quot;: {
    &amp;quot;title&amp;quot;: &amp;quot;Number of matchable phrases in the query&amp;quot;,
    &amp;quot;value&amp;quot;: 2,
    &amp;quot;idx&amp;quot;: 0
  },
  &amp;quot;c&amp;quot;: {
    &amp;quot;title&amp;quot;: &amp;quot;Number of user defined columns in the FTS table&amp;quot;,
    &amp;quot;value&amp;quot;: 3,
    &amp;quot;idx&amp;quot;: 1
  },
  &amp;quot;x&amp;quot;: {
    &amp;quot;title&amp;quot;: &amp;quot;Details for each phrase/column combination&amp;quot;,
    &amp;quot;value&amp;quot;: [
      ...
      {
        &amp;quot;phrase_index&amp;quot;: 0,
        &amp;quot;column_index&amp;quot;: 2,
        &amp;quot;hits_this_column_this_row&amp;quot;: 1,
        &amp;quot;hits_this_column_all_rows&amp;quot;: 163,
        &amp;quot;docs_with_hits&amp;quot;: 31,
        &amp;quot;idxs&amp;quot;: [8, 9, 10]
      },
      {
        &amp;quot;phrase_index&amp;quot;: 1,
        &amp;quot;column_index&amp;quot;: 0,
        &amp;quot;hits_this_column_this_row&amp;quot;: 0,
        &amp;quot;hits_this_column_all_rows&amp;quot;: 2,
        &amp;quot;docs_with_hits&amp;quot;: 2,
        &amp;quot;idxs&amp;quot;: [11, 12, 13]
      }...
    ]
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Try it out &lt;a href="https://datasette-sqlite-fts4.datasette.io/24ways-fts4?sql=select%0D%0A++++title%2C+author%2C%0D%0A++++decode_matchinfo%28matchinfo%28articles_fts%2C+%22pcx%22%29%29%2C%0D%0A++++json_object%28%22pre%22%2C+annotate_matchinfo%28matchinfo%28articles_fts%2C+%22pcxnalyb%22%29%2C+%22pcxnalyb%22%29%29%0D%0Afrom%0D%0A++++articles_fts%0D%0Awhere%0D%0A++++articles_fts+match+%3Asearch&amp;amp;search=jquery+maps"&gt;with pcxnalyb&lt;/a&gt; to see the complete set of format string options.&lt;/p&gt;
&lt;p&gt;You may be wondering why I wrapped that function call in &lt;code&gt;json_object(&amp;quot;pre&amp;quot;, ...)&lt;/code&gt;. This is a Datasette trick: I recently added the ability to pretty-print JSON to my &lt;code&gt;datasette-html-json&lt;/code&gt; plugin - see that package’s README &lt;a href="https://github.com/simonw/datasette-json-html#preformatted-text"&gt;for details&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Building_ranking_functions_134"&gt;&lt;/a&gt;Building ranking functions&lt;/h3&gt;
&lt;p&gt;These statistics are everything we need to calculate relevance scores. &lt;code&gt;sqlite-fts4&lt;/code&gt; implements two such functions: &lt;code&gt;rank_score()&lt;/code&gt; is a simple TF/IDF function. &lt;code&gt;rank_bm25()&lt;/code&gt; is much more interesting - it’s an implementation of the &lt;a href="https://en.wikipedia.org/wiki/Okapi_BM25"&gt;Okapi BM25&lt;/a&gt; ranking algorithm, inspired by the one that ships &lt;a href="https://github.com/coleifer/peewee/blob/fb538c2b2fcc835b9eb7226e56682fbbce49de0f/playhouse/sqlite_ext.py#L1134"&gt;with the peewee ORM&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Let’s try them both out:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;select
    title, author,
    rank_score(matchinfo(articles_fts, &amp;quot;pcx&amp;quot;)) as score,
    rank_bm25(matchinfo(articles_fts, &amp;quot;pcnalx&amp;quot;)) as bm25,
    json_object(&amp;quot;pre&amp;quot;, annotate_matchinfo(matchinfo(articles_fts, &amp;quot;pcxnalyb&amp;quot;), &amp;quot;pcxnalyb&amp;quot;))
from
    articles_fts
where
    articles_fts match :search
order by bm25
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;a href="https://datasette-sqlite-fts4.datasette.io/24ways-fts4?sql=select%0D%0A++++title%2C+author%2C%0D%0A++++rank_score%28matchinfo%28articles_fts%2C+%22pcx%22%29%29+as+score%2C%0D%0A++++rank_bm25%28matchinfo%28articles_fts%2C+%22pcnalx%22%29%29+as+bm25%2C%0D%0A++++json_object%28%22pre%22%2C+annotate_matchinfo%28matchinfo%28articles_fts%2C+%22pcxnalyb%22%29%2C+%22pcxnalyb%22%29%29%0D%0Afrom%0D%0A++++articles_fts%0D%0Awhere%0D%0A++++articles_fts+match+%3Asearch%0D%0Aorder+by+bm25&amp;amp;search=jquery+maps"&gt;Try rank_score() and rank_bm25() in Datasette&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;You can switch the &lt;code&gt;order by&lt;/code&gt; clause between &lt;a href="https://datasette-sqlite-fts4.datasette.io/24ways-fts4?sql=select%0D%0A++++title%2C+author%2C%0D%0A++++rank_score%28matchinfo%28articles_fts%2C+%22pcx%22%29%29+as+score%2C%0D%0A++++rank_bm25%28matchinfo%28articles_fts%2C+%22pcnalx%22%29%29+as+bm25%0D%0Afrom%0D%0A++++articles_fts%0D%0Awhere%0D%0A++++articles_fts+match+%3Asearch%0D%0Aorder+by+bm25&amp;amp;search=jquery+maps"&gt;bm25&lt;/a&gt; and &lt;a href="https://datasette-sqlite-fts4.datasette.io/24ways-fts4?sql=select%0D%0A++++title%2C+author%2C%0D%0A++++rank_score%28matchinfo%28articles_fts%2C+%22pcx%22%29%29+as+score%2C%0D%0A++++rank_bm25%28matchinfo%28articles_fts%2C+%22pcnalx%22%29%29+as+bm25%0D%0Afrom%0D%0A++++articles_fts%0D%0Awhere%0D%0A++++articles_fts+match+%3Asearch%0D%0Aorder+by+score&amp;amp;search=jquery+maps"&gt;score&lt;/a&gt; to compare the two.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;bm25()&lt;/code&gt; is definitely a better option. It’s the default algorithm used these days by Elasticsearch, and they wrote up &lt;a href="https://www.elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables"&gt;an excellent explanation of how it works&lt;/a&gt; on their blog.&lt;/p&gt;
&lt;p&gt;Take a look at &lt;a href="https://github.com/simonw/sqlite-fts4/blob/2316cca52ff2a23ebbb09cf655609ea9e59ceb1b/sqlite_fts4/__init__.py#L197-L266"&gt;the source code for the ranking functions&lt;/a&gt; to see how they are implemented. They work against the data structure returned by &lt;code&gt;annotate_matchinfo()&lt;/code&gt; to try and make it clear what is going on.&lt;/p&gt;
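For illustration only (this is a toy scorer, not the library’s actual implementation), a minimal TF/IDF ranking function over those &lt;code&gt;x&lt;/code&gt;-style triples might look like this - the corpus size of 1000 is an invented number for the example:

```python
import math

def tf_idf_score(x_triples, total_docs):
    # x_triples: (hits_this_column_this_row, hits_this_column_all_rows,
    # docs_with_hits) tuples, mirroring the stats the "x" matchinfo flag exposes
    score = 0.0
    for hits_this_row, _hits_all_rows, docs_with_hits in x_triples:
        if hits_this_row and docs_with_hits:
            # rarer terms (matching fewer documents) contribute a higher IDF
            score += hits_this_row * math.log(total_docs / docs_with_hits)
    return score

# the two non-zero triples from the example document, in a 1000-document corpus
score = tf_idf_score([(1, 163, 31), (2, 92, 21)], 1000)
print(round(score, 2))
```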
&lt;p&gt;Building the &lt;code&gt;rank_bm25()&lt;/code&gt; function took me longer than I expected: I was comparing my results against &lt;code&gt;bm25()&lt;/code&gt; from &lt;a href="https://github.com/coleifer/peewee"&gt;peewee&lt;/a&gt; to ensure I was getting them right, but I couldn’t get them to match. After &lt;a href="https://gist.github.com/simonw/e0b9156d66b41b172a66d0cfe32d9391"&gt;some furious debugging&lt;/a&gt; I finally figured out the problem: peewee had a rare bug! I &lt;a href="https://github.com/coleifer/peewee/issues/1826"&gt;reported it to Charles Leifer&lt;/a&gt; and he analyzed it and turned around a fix in a matter of hours - it turns out the C library that peewee had ported to Python &lt;a href="https://github.com/rads/sqlite-okapi-bm25/issues/2"&gt;had the same problem&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Next_steps_161"&gt;&lt;/a&gt;Next steps&lt;/h3&gt;
&lt;p&gt;I’m really impressed with the flexibility that FTS4 provides - it turns out FTS5 isn’t the only worthwhile option for search in SQLite.&lt;/p&gt;
&lt;p&gt;I’m thinking about ways to expose some of the bm25 tuning parameters (in particular the magic B and K1 constants &lt;a href="https://www.elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables"&gt;explained by the Elasticsearch article&lt;/a&gt;) and I plan to upgrade Datasette’s search functionality to make ranking available as a first-class feature on the &lt;a href="https://datasette.readthedocs.io/en/stable/pages.html#table"&gt;searchable table view&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I’m also generally excited about SQLite as a learning tool for exploring different search ranking mechanisms. Once you’ve decoded that binary matchinfo string it’s impressive how much you can get done with the underlying data.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/full-text-search"&gt;full-text-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/search"&gt;search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="full-text-search"/><category term="projects"/><category term="search"/><category term="sqlite"/><category term="datasette"/></entry><entry><title>Datasette: Full-text search</title><link href="https://simonwillison.net/2018/May/12/datasette-full-text-search/#atom-tag" rel="alternate"/><published>2018-05-12T12:09:24+00:00</published><updated>2018-05-12T12:09:24+00:00</updated><id>https://simonwillison.net/2018/May/12/datasette-full-text-search/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://datasette.readthedocs.io/en/latest/full_text_search.html"&gt;Datasette: Full-text search&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I wrote some documentation for Datasette’s full-text search feature, which detects tables which have been configured to use the SQLite FTS module and adds a search input box and support for a _search= querystring parameter.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/full-text-search"&gt;full-text-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/search"&gt;search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;



</summary><category term="full-text-search"/><category term="search"/><category term="sqlite"/><category term="datasette"/></entry><entry><title>Implementing faceted search with Django and PostgreSQL</title><link href="https://simonwillison.net/2017/Oct/5/django-postgresql-faceted-search/#atom-tag" rel="alternate"/><published>2017-10-05T14:12:27+00:00</published><updated>2017-10-05T14:12:27+00:00</updated><id>https://simonwillison.net/2017/Oct/5/django-postgresql-faceted-search/#atom-tag</id><summary type="html">
&lt;p&gt;I’ve added &lt;a href="https://simonwillison.net/search/"&gt;a faceted search engine&lt;/a&gt; to this blog, powered by PostgreSQL. It supports regular text search (proper search, not just SQL “like” queries), filter by tag, filter by date, filter by content type (entries vs blogmarks vs quotations) and any combination of the above. Some example searches:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/search/?q=postgresql"&gt;All content matching “postgresql”&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/search/?q=django&amp;amp;type=quotation"&gt;Just quotations matching “django”&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/search/?q=python+javascript&amp;amp;tag=mozilla&amp;amp;year=2007"&gt;All content matching “python” and “javascript” with the tag “mozilla” posted in 2007&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It also provides facet counts, so you can tell how many results you will get back before you apply one of these filters - and get a general feeling for the shape of the corpus as you navigate it.&lt;/p&gt;

&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2017/faceted-search.png" alt="Screenshot of my faceted search interface" style="width: 100%"/&gt;&lt;/p&gt;

&lt;p&gt;I love this kind of search interface, because the counts tell you so much more about the underlying data. Turns out I was &lt;a href="https://simonwillison.net/search/?q=javascript&amp;amp;type=quotation"&gt;most active in quoting people talking about JavaScript back in 2007&lt;/a&gt;, for example.&lt;/p&gt;
&lt;p&gt;I usually build faceted search engines using either &lt;a href="https://simonwillison.net/tags/solr/"&gt;Solr&lt;/a&gt; or &lt;a href="https://simonwillison.net/tags/elasticsearch/"&gt;Elasticsearch&lt;/a&gt; (though the first version of search on this blog was actually powered by &lt;a href="http://fallabs.com/hyperestraier/intro-en.html"&gt;Hyper Estraier&lt;/a&gt;) - but I’m hosting this blog as simply and inexpensively as possible on Heroku and I don’t want to shell out for a SaaS search solution or run an Elasticsearch instance somewhere myself. I thought I’d have to go back to using &lt;a href="https://developers.google.com/custom-search/"&gt;Google Custom Search&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Then I read &lt;a href="http://rachbelaid.com/postgres-full-text-search-is-good-enough/"&gt;Postgres full-text search is Good Enough!&lt;/a&gt; by Rachid Belaid - closely followed by &lt;a href="http://blog.lotech.org/postgres-full-text-search-with-django.html"&gt;Postgres Full-Text Search With Django&lt;/a&gt; by Nathan Shafer - and I decided to have a play with the new PostgreSQL search functionality that was &lt;a href="https://docs.djangoproject.com/en/1.11/releases/1.10/#full-text-search-for-postgresql"&gt;introduced in Django 1.10&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;… and wow! Full-text search is yet another example of a feature that’s been in PostgreSQL for &lt;a href="https://www.postgresql.org/docs/8.3/static/release-8-3.html"&gt;nearly a decade now&lt;/a&gt;, incrementally improving with every release to the point where it’s now really, &lt;em&gt;really&lt;/em&gt; good.&lt;/p&gt;
&lt;p&gt;At its most basic level a search system needs to handle four things:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;It needs to take user input and find matching documents.&lt;/li&gt;
&lt;li&gt;It needs to understand and ignore stopwords (common words like “the” and “and”) and apply stemming - knowing that “ridicule” and “ridiculous” should be treated as the same root, for example. Both of these features need to be language-aware.&lt;/li&gt;
&lt;li&gt;It needs to be able to apply relevance ranking, calculating which documents are the best match for a search query.&lt;/li&gt;
&lt;li&gt;It needs to be &lt;em&gt;fast&lt;/em&gt; - working against some kind of index rather than scanning every available document in full.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Modern PostgreSQL &lt;a href="https://www.postgresql.org/docs/9.5/static/textsearch.html"&gt;ticks all of those boxes&lt;/a&gt;. Let’s put it to work.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Simple_search_without_an_index_29"&gt;&lt;/a&gt;Simple search without an index&lt;/h3&gt;
&lt;p&gt;Here’s how to execute a full-text search query against a simple text column:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;&lt;span class="hljs-keyword"&gt;from&lt;/span&gt; blog.models &lt;span class="hljs-keyword"&gt;import&lt;/span&gt; Entry
&lt;span class="hljs-keyword"&gt;from&lt;/span&gt; django.contrib.postgres.search &lt;span class="hljs-keyword"&gt;import&lt;/span&gt; SearchVector

results = Entry.objects.annotate(
    searchable=SearchVector(&lt;span class="hljs-string"&gt;'body'&lt;/span&gt;)
).filter(searchable=&lt;span class="hljs-string"&gt;'django'&lt;/span&gt;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The generated SQL looks something like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;&lt;span class="hljs-operator"&gt;&lt;span class="hljs-keyword"&gt;SELECT&lt;/span&gt; &lt;span class="hljs-string"&gt;"blog_entry"&lt;/span&gt;.&lt;span class="hljs-string"&gt;"id"&lt;/span&gt;, ...,
to_tsvector(&lt;span class="hljs-keyword"&gt;COALESCE&lt;/span&gt;(&lt;span class="hljs-string"&gt;"blog_entry"&lt;/span&gt;.&lt;span class="hljs-string"&gt;"body"&lt;/span&gt;, %s)) &lt;span class="hljs-keyword"&gt;AS&lt;/span&gt; &lt;span class="hljs-string"&gt;"searchable"&lt;/span&gt;
&lt;span class="hljs-keyword"&gt;FROM&lt;/span&gt; &lt;span class="hljs-string"&gt;"blog_entry"&lt;/span&gt;
&lt;span class="hljs-keyword"&gt;WHERE&lt;/span&gt; to_tsvector(&lt;span class="hljs-keyword"&gt;COALESCE&lt;/span&gt;(&lt;span class="hljs-string"&gt;"blog_entry"&lt;/span&gt;.&lt;span class="hljs-string"&gt;"body"&lt;/span&gt;, &lt;span class="hljs-string"&gt;"django"&lt;/span&gt;))
    @@ (plainto_tsquery(&lt;span class="hljs-string"&gt;"django"&lt;/span&gt;)) = &lt;span class="hljs-literal"&gt;true&lt;/span&gt;
&lt;span class="hljs-keyword"&gt;ORDER&lt;/span&gt; &lt;span class="hljs-keyword"&gt;BY&lt;/span&gt; &lt;span class="hljs-string"&gt;"blog_entry"&lt;/span&gt;.&lt;span class="hljs-string"&gt;"created"&lt;/span&gt; &lt;span class="hljs-keyword"&gt;DESC&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;SearchVector&lt;/code&gt; class constructs a stemmed, stopword-removed representation of the &lt;code&gt;body&lt;/code&gt; column ready to be searched. The resulting queryset contains entries that are a match for “django”.&lt;/p&gt;
&lt;p&gt;My blog entries are stored as HTML, but I don’t want search to include those HTML tags. One (extremely un-performant) solution is to use Django’s &lt;code&gt;Func&lt;/code&gt; helper to apply a regular expression inside PostgreSQL to strip tags before they are considered for search:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;&lt;span class="hljs-keyword"&gt;from&lt;/span&gt; django.db.models &lt;span class="hljs-keyword"&gt;import&lt;/span&gt; Value, F, Func

results = Entry.objects.annotate(
    searchable=SearchVector(
        Func(
            F(&lt;span class="hljs-string"&gt;'body'&lt;/span&gt;), Value(&lt;span class="hljs-string"&gt;'&amp;lt;.*?&amp;gt;'&lt;/span&gt;), Value(&lt;span class="hljs-string"&gt;''&lt;/span&gt;), Value(&lt;span class="hljs-string"&gt;'g'&lt;/span&gt;),
            function=&lt;span class="hljs-string"&gt;'regexp_replace'&lt;/span&gt;
        )
    )
).filter(searchable=&lt;span class="hljs-string"&gt;'http'&lt;/span&gt;)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update 6th October 8:23pm UTC&lt;/strong&gt; - it turns out this step is entirely unnecessary. &lt;a href="https://github.com/simonw/simonwillisonblog/issues/1#issuecomment-334770577"&gt;Paolo Melchiorre points out&lt;/a&gt; that the PostgreSQL to_tsvector() function already handles tag removal. Sure enough, executing &lt;samp&gt;SELECT to_tsvector('&amp;lt;div&amp;gt;Hey look what happens to &amp;lt;blockquote&amp;gt;this tag&amp;lt;/blockquote&amp;gt;&amp;lt;/div&amp;gt;')&lt;/samp&gt; &lt;a href="http://sqlfiddle.com/#!17/9eecb/4552"&gt;using SQL Fiddle&lt;/a&gt; returns &lt;samp&gt;'happen':4 'hey':1 'look':2 'tag':7&lt;/samp&gt;, with the tags already stripped.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This works, but performance isn’t great. PostgreSQL ends up having to scan every row and construct a list of search vectors for each one every time you execute a query.&lt;/p&gt;
&lt;p&gt;If you want it to go fast, you need to add a special search vector column to your table and then create the appropriate index on it. As of Django 1.11 this is trivial:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;&lt;span class="hljs-keyword"&gt;from&lt;/span&gt; django.contrib.postgres.search &lt;span class="hljs-keyword"&gt;import&lt;/span&gt; SearchVectorField
&lt;span class="hljs-keyword"&gt;from&lt;/span&gt; django.contrib.postgres.indexes &lt;span class="hljs-keyword"&gt;import&lt;/span&gt; GinIndex

&lt;span class="hljs-class"&gt;&lt;span class="hljs-keyword"&gt;class&lt;/span&gt; &lt;span class="hljs-title"&gt;Entry&lt;/span&gt;&lt;span class="hljs-params"&gt;(models.Model)&lt;/span&gt;:&lt;/span&gt;
    &lt;span class="hljs-comment"&gt;# ...&lt;/span&gt;
    search_document = SearchVectorField(null=&lt;span class="hljs-keyword"&gt;True&lt;/span&gt;)

    &lt;span class="hljs-class"&gt;&lt;span class="hljs-keyword"&gt;class&lt;/span&gt; &lt;span class="hljs-title"&gt;Meta&lt;/span&gt;:&lt;/span&gt;
        indexes = [
            GinIndex(fields=[&lt;span class="hljs-string"&gt;'search_document'&lt;/span&gt;])
        ]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Django’s migration system will automatically add both the field and the special &lt;a href="https://www.postgresql.org/docs/9.5/static/gin-intro.html"&gt;GIN index&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;What’s trickier is populating that &lt;code&gt;search_document&lt;/code&gt; field. Django does not yet provide an easy way to populate it directly in your initial INSERT call, instead recommending that you populate it with a SQL UPDATE statement after the fact. Here’s a one-liner that will populate the field for everything in that table (and strip tags at the same time):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;&lt;span class="hljs-function"&gt;&lt;span class="hljs-keyword"&gt;def&lt;/span&gt; &lt;span class="hljs-title"&gt;strip_tags_func&lt;/span&gt;&lt;span class="hljs-params"&gt;(field)&lt;/span&gt;:&lt;/span&gt;
    &lt;span class="hljs-keyword"&gt;return&lt;/span&gt; Func(
        F(field), Value(&lt;span class="hljs-string"&gt;'&amp;lt;.*?&amp;gt;'&lt;/span&gt;), Value(&lt;span class="hljs-string"&gt;''&lt;/span&gt;), Value(&lt;span class="hljs-string"&gt;'g'&lt;/span&gt;),
        function=&lt;span class="hljs-string"&gt;'regexp_replace'&lt;/span&gt;
    )
 
Entry.objects.update(
    search_document=(
        SearchVector(&lt;span class="hljs-string"&gt;'title'&lt;/span&gt;, weight=&lt;span class="hljs-string"&gt;'A'&lt;/span&gt;) +
        SearchVector(strip_tags_func(&lt;span class="hljs-string"&gt;'body'&lt;/span&gt;), weight=&lt;span class="hljs-string"&gt;'C'&lt;/span&gt;)
    )
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I’m using a neat feature of the &lt;code&gt;SearchVector&lt;/code&gt; class here: it can be concatenated together using the &lt;code&gt;+&lt;/code&gt; operator, and each component can be assigned a weight of &lt;code&gt;A&lt;/code&gt;, &lt;code&gt;B&lt;/code&gt;, &lt;code&gt;C&lt;/code&gt; or &lt;code&gt;D&lt;/code&gt;. These weights affect ranking calculations later on.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Updates_using_signals"&gt;&lt;/a&gt;Updates using signals&lt;/h3&gt;
&lt;p&gt;We could just set this up to run periodically (as I did in my &lt;a href="https://github.com/simonw/simonwillisonblog/commit/7e3a02178e3ca71c464ae68a3b68d70e5fa66692#diff-1cbcc518bc02f9495bba963e698143e0"&gt;initial implementation&lt;/a&gt;), but we can get better real-time results by ensuring this field gets updated automatically when the rest of the model is modified. Some people solve this with PostgreSQL triggers, but I’m still more comfortable handling this kind of thing in python code - so I opted to use Django’s &lt;a href="https://docs.djangoproject.com/en/1.11/topics/signals/"&gt;signals mechanism&lt;/a&gt; instead.&lt;/p&gt;
&lt;p&gt;Since I need to run search queries across three different types of blog content - Entries, Blogmarks and Quotations - I added a method to each model that returns the text fragments corresponding to each of the weight values. Here’s that method for my Quotation model:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;&lt;span class="hljs-class"&gt;&lt;span class="hljs-keyword"&gt;class&lt;/span&gt; &lt;span class="hljs-title"&gt;Quotation&lt;/span&gt;&lt;span class="hljs-params"&gt;(models.Model)&lt;/span&gt;:&lt;/span&gt;
    quotation = models.TextField()
    source = models.CharField(max_length=&lt;span class="hljs-number"&gt;255&lt;/span&gt;)
    tags = models.ManyToManyField(Tag, blank=&lt;span class="hljs-keyword"&gt;True&lt;/span&gt;)

    &lt;span class="hljs-function"&gt;&lt;span class="hljs-keyword"&gt;def&lt;/span&gt; &lt;span class="hljs-title"&gt;index_components&lt;/span&gt;&lt;span class="hljs-params"&gt;(self)&lt;/span&gt;:&lt;/span&gt;
        &lt;span class="hljs-keyword"&gt;return&lt;/span&gt; {
            &lt;span class="hljs-string"&gt;'A'&lt;/span&gt;: self.quotation,
            &lt;span class="hljs-string"&gt;'B'&lt;/span&gt;: &lt;span class="hljs-string"&gt;' '&lt;/span&gt;.join(self.tags.values_list(&lt;span class="hljs-string"&gt;'tag'&lt;/span&gt;, flat=&lt;span class="hljs-keyword"&gt;True&lt;/span&gt;)),
            &lt;span class="hljs-string"&gt;'C'&lt;/span&gt;: self.source,
        }
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As you can see, I’m including the tags that have been assigned to the quotation in the searchable document.&lt;/p&gt;
&lt;p&gt;Here are my signals - loaded once via an import statement in my blog application’s &lt;a href="https://github.com/simonw/simonwillisonblog/blob/3f5ca05248e409a946b53593f7d11b6f9551044f/blog/apps.py"&gt;&lt;code&gt;AppConfig.ready()&lt;/code&gt; method&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;&lt;span class="hljs-decorator"&gt;@receiver(post_save)&lt;/span&gt;
&lt;span class="hljs-function"&gt;&lt;span class="hljs-keyword"&gt;def&lt;/span&gt; &lt;span class="hljs-title"&gt;on_save&lt;/span&gt;&lt;span class="hljs-params"&gt;(sender, **kwargs)&lt;/span&gt;:&lt;/span&gt;
    &lt;span class="hljs-keyword"&gt;if&lt;/span&gt; &lt;span class="hljs-keyword"&gt;not&lt;/span&gt; issubclass(sender, BaseModel):
        &lt;span class="hljs-keyword"&gt;return&lt;/span&gt;
    transaction.on_commit(make_updater(kwargs[&lt;span class="hljs-string"&gt;'instance'&lt;/span&gt;]))

&lt;span class="hljs-decorator"&gt;@receiver(m2m_changed)&lt;/span&gt;
&lt;span class="hljs-function"&gt;&lt;span class="hljs-keyword"&gt;def&lt;/span&gt; &lt;span class="hljs-title"&gt;on_m2m_changed&lt;/span&gt;&lt;span class="hljs-params"&gt;(sender, **kwargs)&lt;/span&gt;:&lt;/span&gt;
    instance = kwargs[&lt;span class="hljs-string"&gt;'instance'&lt;/span&gt;]
    model = kwargs[&lt;span class="hljs-string"&gt;'model'&lt;/span&gt;]
    &lt;span class="hljs-keyword"&gt;if&lt;/span&gt; model &lt;span class="hljs-keyword"&gt;is&lt;/span&gt; Tag:
        transaction.on_commit(make_updater(instance))
    &lt;span class="hljs-keyword"&gt;elif&lt;/span&gt; isinstance(instance, Tag):
        &lt;span class="hljs-keyword"&gt;for&lt;/span&gt; obj &lt;span class="hljs-keyword"&gt;in&lt;/span&gt; model.objects.filter(pk__in=kwargs[&lt;span class="hljs-string"&gt;'pk_set'&lt;/span&gt;]):
            transaction.on_commit(make_updater(obj))

&lt;span class="hljs-function"&gt;&lt;span class="hljs-keyword"&gt;def&lt;/span&gt; &lt;span class="hljs-title"&gt;make_updater&lt;/span&gt;&lt;span class="hljs-params"&gt;(instance)&lt;/span&gt;:&lt;/span&gt;
    components = instance.index_components()
    pk = instance.pk

    &lt;span class="hljs-function"&gt;&lt;span class="hljs-keyword"&gt;def&lt;/span&gt; &lt;span class="hljs-title"&gt;on_commit&lt;/span&gt;&lt;span class="hljs-params"&gt;()&lt;/span&gt;:&lt;/span&gt;
        search_vectors = []
        &lt;span class="hljs-keyword"&gt;for&lt;/span&gt; weight, text &lt;span class="hljs-keyword"&gt;in&lt;/span&gt; components.items():
            search_vectors.append(
                SearchVector(Value(text, output_field=models.TextField()), weight=weight)
            )
        instance.__class__.objects.filter(pk=pk).update(
            search_document=reduce(operator.add, search_vectors)
        )
    &lt;span class="hljs-keyword"&gt;return&lt;/span&gt; on_commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(The full code can be &lt;a href="https://github.com/simonw/simonwillisonblog/blob/3f5ca05248e409a946b53593f7d11b6f9551044f/blog/signals.py"&gt;found here&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;on_save&lt;/code&gt; handler is pretty straightforward - it checks whether the model that was just saved has my &lt;code&gt;BaseModel&lt;/code&gt; as a base class, then calls &lt;code&gt;make_updater&lt;/code&gt; to get a function to be executed by the &lt;code&gt;transaction.on_commit&lt;/code&gt; hook.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;on_m2m_changed&lt;/code&gt; handler is &lt;a href="https://docs.djangoproject.com/en/1.11/ref/signals/#m2m-changed"&gt;significantly more complicated&lt;/a&gt;. There are a number of scenarios in which this will be called - I’m reasonably confident that the idiom I use here will capture all of the modifications that should trigger a re-indexing operation.&lt;/p&gt;
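To make those scenarios concrete, here is a pure-Python simulation of the branching in the handler above - no Django involved, so `Tag`, `Quotation` and the `related_objects` mapping are stand-ins. Editing tags from the content side means `model is Tag` and only the instance needs re-indexing; editing from the `Tag` side means `instance` is a `Tag` and every related object in `pk_set` does:

```python
class Tag:
    pass

class Quotation:
    pass

def objects_to_reindex(instance, model, pk_set, related_objects):
    """Mirrors the on_m2m_changed branching.

    related_objects maps pk -> object, standing in for
    model.objects.filter(pk__in=pk_set).
    """
    if model is Tag:
        # Forward direction: quotation.tags changed - re-index the quotation itself
        return [instance]
    elif isinstance(instance, Tag):
        # Reverse direction: tag.quotation_set changed - re-index each affected object
        return [related_objects[pk] for pk in sorted(pk_set)]
    return []

q1, q2, tag = Quotation(), Quotation(), Tag()

# Editing a quotation's tags: only that quotation is re-indexed
assert objects_to_reindex(q1, Tag, {1}, {}) == [q1]

# Editing a tag's related quotations: both quotations are re-indexed
assert objects_to_reindex(tag, Quotation, {1, 2}, {1: q1, 2: q2}) == [q1, q2]
```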
&lt;p&gt;Running a search now looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;results = Entry.objects.filter(
    search_document=SearchQuery(&lt;span class="hljs-string"&gt;'django'&lt;/span&gt;)
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We need one more thing though: we need to sort our search results by relevance. PostgreSQL has pretty good relevance ranking built in, and sorting by the relevance score can be done by applying a Django ORM annotation:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;query = SearchQuery(&lt;span class="hljs-string"&gt;'ibm'&lt;/span&gt;)

results = Entry.objects.filter(
    search_document=query
).annotate(
    rank=SearchRank(F(&lt;span class="hljs-string"&gt;'search_document'&lt;/span&gt;), query)
).order_by(&lt;span class="hljs-string"&gt;'-rank'&lt;/span&gt;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We now have basic full text search implemented against a single Django model, making use of a GIN index. This is lightning fast.&lt;/p&gt;
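The GIN index has to be created explicitly. One way to declare it - a sketch using the `GinIndex` class that arrived in Django 1.11, not necessarily how my actual migration does it (the linked repo has the real code):

```python
from django.contrib.postgres.indexes import GinIndex
from django.contrib.postgres.search import SearchVectorField
from django.db import models

class Entry(models.Model):
    # ... other fields ...
    search_document = SearchVectorField(null=True)

    class Meta:
        # Generates roughly: CREATE INDEX ... ON blog_entry USING gin (search_document)
        indexes = [GinIndex(fields=["search_document"])]
```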
&lt;h3&gt;&lt;a id="Searching_multiple_tables_using_querysetunion_192"&gt;&lt;/a&gt;Searching multiple tables using queryset.union()&lt;/h3&gt;
&lt;p&gt;My site has three types of content, represented in three different models and hence three different underlying database tables.&lt;/p&gt;
&lt;p&gt;I’m using &lt;a href="https://github.com/simonw/simonwillisonblog/blob/3f5ca05248e409a946b53593f7d11b6f9551044f/blog/models.py#L78-L107"&gt;an abstract base model&lt;/a&gt; to define common fields shared by all three: the created date, the slug (used to construct permalink urls) and the &lt;code&gt;search_document&lt;/code&gt; field populated above.&lt;/p&gt;
&lt;p&gt;As of Django 1.11 it’s possible to combine queries across different tables &lt;a href="https://docs.djangoproject.com/en/1.11/releases/1.11/#models"&gt;using the SQL union operator&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here’s what that looks like for running a search across three tables, all with the same &lt;code&gt;search_document&lt;/code&gt; search vector field. I need to use &lt;code&gt;.values()&lt;/code&gt; to restrict the querysets I am unioning to the same subset of fields:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;query = SearchQuery(&lt;span class="hljs-string"&gt;'django'&lt;/span&gt;)
rank_annotation = SearchRank(F(&lt;span class="hljs-string"&gt;'search_document'&lt;/span&gt;), query)
qs = Blogmark.objects.annotate(
    rank=rank_annotation,
).filter(
    search_document=query
).values(&lt;span class="hljs-string"&gt;'pk'&lt;/span&gt;, &lt;span class="hljs-string"&gt;'created'&lt;/span&gt;, &lt;span class="hljs-string"&gt;'rank'&lt;/span&gt;).union(
    Entry.objects.annotate(
        rank=rank_annotation,
    ).filter(
        search_document=query
    ).values(&lt;span class="hljs-string"&gt;'pk'&lt;/span&gt;, &lt;span class="hljs-string"&gt;'created'&lt;/span&gt;, &lt;span class="hljs-string"&gt;'rank'&lt;/span&gt;),
    Quotation.objects.annotate(
        rank=rank_annotation,
    ).filter(
        search_document=query
    ).values(&lt;span class="hljs-string"&gt;'pk'&lt;/span&gt;, &lt;span class="hljs-string"&gt;'created'&lt;/span&gt;, &lt;span class="hljs-string"&gt;'rank'&lt;/span&gt;),
).order_by(&lt;span class="hljs-string"&gt;'-rank'&lt;/span&gt;)[:&lt;span class="hljs-number"&gt;5&lt;/span&gt;]

&lt;span class="hljs-comment"&gt;# Output&lt;/span&gt;
&amp;lt;QuerySet [
    {&lt;span class="hljs-string"&gt;'pk'&lt;/span&gt;: &lt;span class="hljs-number"&gt;186&lt;/span&gt;, &lt;span class="hljs-string"&gt;'rank'&lt;/span&gt;: &lt;span class="hljs-number"&gt;0.875179&lt;/span&gt;, &lt;span class="hljs-string"&gt;'created'&lt;/span&gt;: datetime.datetime(&lt;span class="hljs-number"&gt;2008&lt;/span&gt;, &lt;span class="hljs-number"&gt;4&lt;/span&gt;, &lt;span class="hljs-number"&gt;8&lt;/span&gt;, &lt;span class="hljs-number"&gt;13&lt;/span&gt;, &lt;span class="hljs-number"&gt;48&lt;/span&gt;, &lt;span class="hljs-number"&gt;18&lt;/span&gt;, tzinfo=&amp;lt;UTC&amp;gt;)},
    {&lt;span class="hljs-string"&gt;'pk'&lt;/span&gt;: &lt;span class="hljs-number"&gt;134&lt;/span&gt;, &lt;span class="hljs-string"&gt;'rank'&lt;/span&gt;: &lt;span class="hljs-number"&gt;0.842655&lt;/span&gt;, &lt;span class="hljs-string"&gt;'created'&lt;/span&gt;: datetime.datetime(&lt;span class="hljs-number"&gt;2007&lt;/span&gt;, &lt;span class="hljs-number"&gt;10&lt;/span&gt;, &lt;span class="hljs-number"&gt;20&lt;/span&gt;, &lt;span class="hljs-number"&gt;13&lt;/span&gt;, &lt;span class="hljs-number"&gt;46&lt;/span&gt;, &lt;span class="hljs-number"&gt;56&lt;/span&gt;, tzinfo=&amp;lt;UTC&amp;gt;)},
    {&lt;span class="hljs-string"&gt;'pk'&lt;/span&gt;: &lt;span class="hljs-number"&gt;1591&lt;/span&gt;, &lt;span class="hljs-string"&gt;'rank'&lt;/span&gt;: &lt;span class="hljs-number"&gt;0.804502&lt;/span&gt;, &lt;span class="hljs-string"&gt;'created'&lt;/span&gt;: datetime.datetime(&lt;span class="hljs-number"&gt;2009&lt;/span&gt;, &lt;span class="hljs-number"&gt;9&lt;/span&gt;, &lt;span class="hljs-number"&gt;28&lt;/span&gt;, &lt;span class="hljs-number"&gt;23&lt;/span&gt;, &lt;span class="hljs-number"&gt;32&lt;/span&gt;, &lt;span class="hljs-number"&gt;4&lt;/span&gt;, tzinfo=&amp;lt;UTC&amp;gt;)},
    {&lt;span class="hljs-string"&gt;'pk'&lt;/span&gt;: &lt;span class="hljs-number"&gt;5093&lt;/span&gt;, &lt;span class="hljs-string"&gt;'rank'&lt;/span&gt;: &lt;span class="hljs-number"&gt;0.788616&lt;/span&gt;, &lt;span class="hljs-string"&gt;'created'&lt;/span&gt;: datetime.datetime(&lt;span class="hljs-number"&gt;2010&lt;/span&gt;, &lt;span class="hljs-number"&gt;2&lt;/span&gt;, &lt;span class="hljs-number"&gt;26&lt;/span&gt;, &lt;span class="hljs-number"&gt;19&lt;/span&gt;, &lt;span class="hljs-number"&gt;22&lt;/span&gt;, &lt;span class="hljs-number"&gt;47&lt;/span&gt;, tzinfo=&amp;lt;UTC&amp;gt;)},
    {&lt;span class="hljs-string"&gt;'pk'&lt;/span&gt;: &lt;span class="hljs-number"&gt;2598&lt;/span&gt;, &lt;span class="hljs-string"&gt;'rank'&lt;/span&gt;: &lt;span class="hljs-number"&gt;0.786928&lt;/span&gt;, &lt;span class="hljs-string"&gt;'created'&lt;/span&gt;: datetime.datetime(&lt;span class="hljs-number"&gt;2007&lt;/span&gt;, &lt;span class="hljs-number"&gt;1&lt;/span&gt;, &lt;span class="hljs-number"&gt;26&lt;/span&gt;, &lt;span class="hljs-number"&gt;12&lt;/span&gt;, &lt;span class="hljs-number"&gt;38&lt;/span&gt;, &lt;span class="hljs-number"&gt;46&lt;/span&gt;, tzinfo=&amp;lt;UTC&amp;gt;)}
]&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is not enough information though - I have the primary keys, but I don’t know which type of model they belong to. In order to retrieve the actual resulting objects from the database I need to know which type of content is represented by each of those results.&lt;/p&gt;
&lt;p&gt;I can achieve that using another annotation:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;qs = Blogmark.objects.annotate(
    rank=rank_annotation,
    type=models.Value(&lt;span class="hljs-string"&gt;'blogmark'&lt;/span&gt;, output_field=models.CharField())
).filter(
    search_document=query
).values(&lt;span class="hljs-string"&gt;'pk'&lt;/span&gt;, &lt;span class="hljs-string"&gt;'type'&lt;/span&gt;, &lt;span class="hljs-string"&gt;'rank'&lt;/span&gt;).union(
    Entry.objects.annotate(
        rank=rank_annotation,
        type=models.Value(&lt;span class="hljs-string"&gt;'entry'&lt;/span&gt;, output_field=models.CharField())
    ).filter(
        search_document=query
    ).values(&lt;span class="hljs-string"&gt;'pk'&lt;/span&gt;, &lt;span class="hljs-string"&gt;'type'&lt;/span&gt;, &lt;span class="hljs-string"&gt;'rank'&lt;/span&gt;),
    Quotation.objects.annotate(
        rank=rank_annotation,
        type=models.Value(&lt;span class="hljs-string"&gt;'quotation'&lt;/span&gt;, output_field=models.CharField())
    ).filter(
        search_document=query
    ).values(&lt;span class="hljs-string"&gt;'pk'&lt;/span&gt;, &lt;span class="hljs-string"&gt;'type'&lt;/span&gt;, &lt;span class="hljs-string"&gt;'rank'&lt;/span&gt;),
).order_by(&lt;span class="hljs-string"&gt;'-rank'&lt;/span&gt;)[:&lt;span class="hljs-number"&gt;5&lt;/span&gt;]

&lt;span class="hljs-comment"&gt;# Output:&lt;/span&gt;
&amp;lt;QuerySet [
    {&lt;span class="hljs-string"&gt;'pk'&lt;/span&gt;: &lt;span class="hljs-number"&gt;186&lt;/span&gt;, &lt;span class="hljs-string"&gt;'type'&lt;/span&gt;: &lt;span class="hljs-string"&gt;u'quotation'&lt;/span&gt;, &lt;span class="hljs-string"&gt;'rank'&lt;/span&gt;: &lt;span class="hljs-number"&gt;0.875179&lt;/span&gt;},
    {&lt;span class="hljs-string"&gt;'pk'&lt;/span&gt;: &lt;span class="hljs-number"&gt;134&lt;/span&gt;, &lt;span class="hljs-string"&gt;'type'&lt;/span&gt;: &lt;span class="hljs-string"&gt;u'quotation'&lt;/span&gt;, &lt;span class="hljs-string"&gt;'rank'&lt;/span&gt;: &lt;span class="hljs-number"&gt;0.842655&lt;/span&gt;},
    {&lt;span class="hljs-string"&gt;'pk'&lt;/span&gt;: &lt;span class="hljs-number"&gt;1591&lt;/span&gt;, &lt;span class="hljs-string"&gt;'type'&lt;/span&gt;: &lt;span class="hljs-string"&gt;u'entry'&lt;/span&gt;, &lt;span class="hljs-string"&gt;'rank'&lt;/span&gt;: &lt;span class="hljs-number"&gt;0.804502&lt;/span&gt;},
    {&lt;span class="hljs-string"&gt;'pk'&lt;/span&gt;: &lt;span class="hljs-number"&gt;5093&lt;/span&gt;, &lt;span class="hljs-string"&gt;'type'&lt;/span&gt;: &lt;span class="hljs-string"&gt;u'blogmark'&lt;/span&gt;, &lt;span class="hljs-string"&gt;'rank'&lt;/span&gt;: &lt;span class="hljs-number"&gt;0.788616&lt;/span&gt;},
    {&lt;span class="hljs-string"&gt;'pk'&lt;/span&gt;: &lt;span class="hljs-number"&gt;2598&lt;/span&gt;, &lt;span class="hljs-string"&gt;'type'&lt;/span&gt;: &lt;span class="hljs-string"&gt;u'blogmark'&lt;/span&gt;, &lt;span class="hljs-string"&gt;'rank'&lt;/span&gt;: &lt;span class="hljs-number"&gt;0.786928&lt;/span&gt;}
]&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now I just need to write a function which can take a list of types and primary keys and return the full objects from the database:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;&lt;span class="hljs-function"&gt;&lt;span class="hljs-keyword"&gt;def&lt;/span&gt; &lt;span class="hljs-title"&gt;load_mixed_objects&lt;/span&gt;&lt;span class="hljs-params"&gt;(dicts)&lt;/span&gt;:&lt;/span&gt;
    &lt;span class="hljs-string"&gt;"""
    Takes a list of dictionaries, each of which must at least have a 'type'
    and a 'pk' key. Returns a list of ORM objects of those various types.
    Each returned ORM object has a .original_dict attribute populated.
    """&lt;/span&gt;
    to_fetch = {}
    &lt;span class="hljs-keyword"&gt;for&lt;/span&gt; d &lt;span class="hljs-keyword"&gt;in&lt;/span&gt; dicts:
        to_fetch.setdefault(d[&lt;span class="hljs-string"&gt;'type'&lt;/span&gt;], set()).add(d[&lt;span class="hljs-string"&gt;'pk'&lt;/span&gt;])
    fetched = {}
    &lt;span class="hljs-keyword"&gt;for&lt;/span&gt; key, model &lt;span class="hljs-keyword"&gt;in&lt;/span&gt; (
        (&lt;span class="hljs-string"&gt;'blogmark'&lt;/span&gt;, Blogmark),
        (&lt;span class="hljs-string"&gt;'entry'&lt;/span&gt;, Entry),
        (&lt;span class="hljs-string"&gt;'quotation'&lt;/span&gt;, Quotation),
    ):
        ids = to_fetch.get(key) &lt;span class="hljs-keyword"&gt;or&lt;/span&gt; []
        objects = model.objects.prefetch_related(&lt;span class="hljs-string"&gt;'tags'&lt;/span&gt;).filter(pk__in=ids)
        &lt;span class="hljs-keyword"&gt;for&lt;/span&gt; obj &lt;span class="hljs-keyword"&gt;in&lt;/span&gt; objects:
            fetched[(key, obj.pk)] = obj
    &lt;span class="hljs-comment"&gt;# Build list in same order as dicts argument&lt;/span&gt;
    to_return = []
    &lt;span class="hljs-keyword"&gt;for&lt;/span&gt; d &lt;span class="hljs-keyword"&gt;in&lt;/span&gt; dicts:
        item = fetched.get((d[&lt;span class="hljs-string"&gt;'type'&lt;/span&gt;], d[&lt;span class="hljs-string"&gt;'pk'&lt;/span&gt;])) &lt;span class="hljs-keyword"&gt;or&lt;/span&gt; &lt;span class="hljs-keyword"&gt;None&lt;/span&gt;
        &lt;span class="hljs-keyword"&gt;if&lt;/span&gt; item:
            item.original_dict = d
        to_return.append(item)
    &lt;span class="hljs-keyword"&gt;return&lt;/span&gt; to_return
&lt;/code&gt;&lt;/pre&gt;
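The shape of that function - group the requested keys by type, bulk-fetch once per type, then rebuild the list in the caller's order - works on plain dictionaries too. A stand-alone sketch with an in-memory "database" (toy data, no ORM):

```python
# Toy stand-in for the three model tables: type -> {pk: object}
DATABASE = {
    "entry": {1591: "entry #1591"},
    "blogmark": {5093: "blogmark #5093"},
    "quotation": {186: "quotation #186"},
}

def load_mixed(dicts):
    # Pass 1: group the requested pks by type so each "table" is hit only once
    to_fetch = {}
    for d in dicts:
        to_fetch.setdefault(d["type"], set()).add(d["pk"])
    # Bulk "fetch" per type (the ORM version is model.objects.filter(pk__in=ids))
    fetched = {
        (type_name, pk): DATABASE[type_name][pk]
        for type_name, pks in to_fetch.items()
        for pk in pks
        if pk in DATABASE.get(type_name, {})
    }
    # Pass 2: rebuild the list in the caller's (i.e. rank) order
    return [fetched.get((d["type"], d["pk"])) for d in dicts]

results = load_mixed([
    {"type": "quotation", "pk": 186},
    {"type": "entry", "pk": 1591},
    {"type": "blogmark", "pk": 5093},
])
assert results == ["quotation #186", "entry #1591", "blogmark #5093"]
```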
&lt;p&gt;One last challenge: when I add filtering by type, I’m going to want to selectively union together only a subset of these querysets. I need a queryset to start unions against, but I don’t yet know which queryset I will be using. I can abuse Django’s &lt;code&gt;queryset.none()&lt;/code&gt; method to create an empty &lt;code&gt;ValuesQuerySet&lt;/code&gt; in the correct shape like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;qs = Entry.objects.annotate(
    type=models.Value(&lt;span class="hljs-string"&gt;'empty'&lt;/span&gt;, output_field=models.CharField()),
    rank=rank_annotation
).values(&lt;span class="hljs-string"&gt;'pk'&lt;/span&gt;, &lt;span class="hljs-string"&gt;'type'&lt;/span&gt;, &lt;span class="hljs-string"&gt;'rank'&lt;/span&gt;).none()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now I can progressively build up my union in a loop like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;&lt;span class="hljs-keyword"&gt;for&lt;/span&gt; klass &lt;span class="hljs-keyword"&gt;in&lt;/span&gt; (Entry, Blogmark, Quotation):
    qs = qs.union(klass.objects.annotate(
        rank=rank_annotation,
        type=models.Value(&lt;span class="hljs-string"&gt;'quotation'&lt;/span&gt;, output_field=models.CharField())
    ).filter(
        search_document=query
    ).values(&lt;span class="hljs-string"&gt;'pk'&lt;/span&gt;, &lt;span class="hljs-string"&gt;'type'&lt;/span&gt;, &lt;span class="hljs-string"&gt;'rank'&lt;/span&gt;))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The Django ORM is smart enough to compile away the empty queryset when it constructs the SQL, which ends up looking something like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;(((&lt;span class="hljs-operator"&gt;&lt;span class="hljs-keyword"&gt;SELECT&lt;/span&gt; &lt;span class="hljs-string"&gt;"blog_entry"&lt;/span&gt;.&lt;span class="hljs-string"&gt;"id"&lt;/span&gt;,
            &lt;span class="hljs-string"&gt;"entry"&lt;/span&gt; &lt;span class="hljs-keyword"&gt;AS&lt;/span&gt; &lt;span class="hljs-string"&gt;"type"&lt;/span&gt;,
            ts_rank(&lt;span class="hljs-string"&gt;"blog_entry"&lt;/span&gt;.&lt;span class="hljs-string"&gt;"search_document"&lt;/span&gt;, plainto_tsquery(%s)) &lt;span class="hljs-keyword"&gt;AS&lt;/span&gt; &lt;span class="hljs-string"&gt;"rank"&lt;/span&gt;
     &lt;span class="hljs-keyword"&gt;FROM&lt;/span&gt; &lt;span class="hljs-string"&gt;"blog_entry"&lt;/span&gt;
     &lt;span class="hljs-keyword"&gt;WHERE&lt;/span&gt; &lt;span class="hljs-string"&gt;"blog_entry"&lt;/span&gt;.&lt;span class="hljs-string"&gt;"search_document"&lt;/span&gt; @@ (plainto_tsquery(%s)) = &lt;span class="hljs-literal"&gt;TRUE&lt;/span&gt;
     &lt;span class="hljs-keyword"&gt;ORDER&lt;/span&gt; &lt;span class="hljs-keyword"&gt;BY&lt;/span&gt; &lt;span class="hljs-string"&gt;"blog_entry"&lt;/span&gt;.&lt;span class="hljs-string"&gt;"created"&lt;/span&gt; &lt;span class="hljs-keyword"&gt;DESC&lt;/span&gt;))
 &lt;span class="hljs-keyword"&gt;UNION&lt;/span&gt;
   (&lt;span class="hljs-keyword"&gt;SELECT&lt;/span&gt; &lt;span class="hljs-string"&gt;"blog_blogmark"&lt;/span&gt;.&lt;span class="hljs-string"&gt;"id"&lt;/span&gt;,
           &lt;span class="hljs-string"&gt;"blogmark"&lt;/span&gt; &lt;span class="hljs-keyword"&gt;AS&lt;/span&gt; &lt;span class="hljs-string"&gt;"type"&lt;/span&gt;,
           ts_rank(&lt;span class="hljs-string"&gt;"blog_blogmark"&lt;/span&gt;.&lt;span class="hljs-string"&gt;"search_document"&lt;/span&gt;, plainto_tsquery(%s)) &lt;span class="hljs-keyword"&gt;AS&lt;/span&gt; &lt;span class="hljs-string"&gt;"rank"&lt;/span&gt;
    &lt;span class="hljs-keyword"&gt;FROM&lt;/span&gt; &lt;span class="hljs-string"&gt;"blog_blogmark"&lt;/span&gt;
    &lt;span class="hljs-keyword"&gt;WHERE&lt;/span&gt; &lt;span class="hljs-string"&gt;"blog_blogmark"&lt;/span&gt;.&lt;span class="hljs-string"&gt;"search_document"&lt;/span&gt; @@ (plainto_tsquery(%s)) = &lt;span class="hljs-literal"&gt;TRUE&lt;/span&gt;
    &lt;span class="hljs-keyword"&gt;ORDER&lt;/span&gt; &lt;span class="hljs-keyword"&gt;BY&lt;/span&gt; &lt;span class="hljs-string"&gt;"blog_blogmark"&lt;/span&gt;.&lt;span class="hljs-string"&gt;"created"&lt;/span&gt; &lt;span class="hljs-keyword"&gt;DESC&lt;/span&gt;))
&lt;span class="hljs-keyword"&gt;UNION&lt;/span&gt;
  (&lt;span class="hljs-keyword"&gt;SELECT&lt;/span&gt; &lt;span class="hljs-string"&gt;"blog_quotation"&lt;/span&gt;.&lt;span class="hljs-string"&gt;"id"&lt;/span&gt;,
          &lt;span class="hljs-string"&gt;"quotation"&lt;/span&gt; &lt;span class="hljs-keyword"&gt;AS&lt;/span&gt; &lt;span class="hljs-string"&gt;"type"&lt;/span&gt;,
          ts_rank(&lt;span class="hljs-string"&gt;"blog_quotation"&lt;/span&gt;.&lt;span class="hljs-string"&gt;"search_document"&lt;/span&gt;, plainto_tsquery(%s)) &lt;span class="hljs-keyword"&gt;AS&lt;/span&gt; &lt;span class="hljs-string"&gt;"rank"&lt;/span&gt;
   &lt;span class="hljs-keyword"&gt;FROM&lt;/span&gt; &lt;span class="hljs-string"&gt;"blog_quotation"&lt;/span&gt;
   &lt;span class="hljs-keyword"&gt;WHERE&lt;/span&gt; &lt;span class="hljs-string"&gt;"blog_quotation"&lt;/span&gt;.&lt;span class="hljs-string"&gt;"search_document"&lt;/span&gt; @@ (plainto_tsquery(%s)) = &lt;span class="hljs-literal"&gt;TRUE&lt;/span&gt;
   &lt;span class="hljs-keyword"&gt;ORDER&lt;/span&gt; &lt;span class="hljs-keyword"&gt;BY&lt;/span&gt; &lt;span class="hljs-string"&gt;"blog_quotation"&lt;/span&gt;.&lt;span class="hljs-string"&gt;"created"&lt;/span&gt; &lt;span class="hljs-keyword"&gt;DESC&lt;/span&gt;)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;&lt;a id="Applying_filters_345"&gt;&lt;/a&gt;Applying filters&lt;/h3&gt;
&lt;p&gt;So far, our search engine can only handle user-entered query strings. If I am going to build a faceted search interface I need to be able to handle filtering as well. I want the ability to filter by year, tag and type.&lt;/p&gt;
&lt;p&gt;The key difference between filtering and querying (borrowing &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/query-filter-context.html"&gt;these definitions from Elasticsearch&lt;/a&gt;) is that querying is loose - it involves stemming and stopwords - while filtering is exact. Additionally, querying affects the calculated relevance score while filtering does not - a document either matches the filter or it doesn’t.&lt;/p&gt;
&lt;p&gt;Since PostgreSQL is a relational database, filtering can be handled by simply constructing extra SQL where clauses using the Django ORM.&lt;/p&gt;
&lt;p&gt;Each of the filters I need requires a slightly different approach. Filtering by type is easy - I just selectively include or exclude that model from my union queryset.&lt;/p&gt;
&lt;p&gt;Year and month work like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;selected_year = request.GET.get(&lt;span class="hljs-string"&gt;'year'&lt;/span&gt;, &lt;span class="hljs-string"&gt;''&lt;/span&gt;)
selected_month = request.GET.get(&lt;span class="hljs-string"&gt;'month'&lt;/span&gt;, &lt;span class="hljs-string"&gt;''&lt;/span&gt;)
&lt;span class="hljs-keyword"&gt;if&lt;/span&gt; selected_year:
    qs = qs.filter(created__year=int(selected_year))
&lt;span class="hljs-keyword"&gt;if&lt;/span&gt; selected_month:
    qs = qs.filter(created__month=int(selected_month))
&lt;/code&gt;&lt;/pre&gt;
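One caveat: `int()` raises `ValueError` on junk input like `?year=banana`, which would turn a bad querystring into a 500 error. A defensive parsing helper (the function here is my own invention for illustration) keeps things robust:

```python
def parse_int_param(value, lo, hi):
    """Return value as an int if it parses and falls in [lo, hi], else None."""
    try:
        number = int(value)
    except (TypeError, ValueError):
        return None
    return number if lo <= number <= hi else None

assert parse_int_param("2017", 2000, 2100) == 2017
assert parse_int_param("banana", 2000, 2100) is None
assert parse_int_param("13", 1, 12) is None  # month out of range
```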
&lt;p&gt;Tags involve a join through a many-to-many relationship against the Tags table. We want to be able to apply more than one tag, for example this search for &lt;a href="https://simonwillison.net/search/?tag=python&amp;amp;tag=javascript"&gt;all items tagged both python and javascript&lt;/a&gt;. Django’s ORM makes this easy:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;selected_tags = request.GET.getlist(&lt;span class="hljs-string"&gt;'tag'&lt;/span&gt;)
&lt;span class="hljs-keyword"&gt;for&lt;/span&gt; tag &lt;span class="hljs-keyword"&gt;in&lt;/span&gt; selected_tags:
    qs = qs.filter(tags__tag=tag)
&lt;/code&gt;&lt;/pre&gt;
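Each chained `.filter()` call adds another join, so the selected tags combine with AND semantics: an item must carry every one of them to survive. In set terms (a toy model, not the ORM):

```python
# Toy items with their tag sets
items = {
    1: {"python", "javascript"},
    2: {"python"},
    3: {"javascript", "css"},
}

def filter_by_tags(items, selected_tags):
    # Mirrors the chained qs.filter(tags__tag=tag) calls: every tag must match
    result = set(items)
    for tag in selected_tags:
        result = {pk for pk in result if tag in items[pk]}
    return result

assert filter_by_tags(items, ["python", "javascript"]) == {1}
assert filter_by_tags(items, ["python"]) == {1, 2}
```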
&lt;h3&gt;&lt;a id="Adding_facet_counts_374"&gt;&lt;/a&gt;Adding facet counts&lt;/h3&gt;
&lt;p&gt;There is just one more ingredient needed to complete our faceted search: facet counts!&lt;/p&gt;
&lt;p&gt;Again, the way we calculate these is different for each of our filters. For types, we need to call &lt;code&gt;.count()&lt;/code&gt; on a separate queryset for each of the types we are searching:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;queryset = make_queryset(Entry, &lt;span class="hljs-string"&gt;'entry'&lt;/span&gt;)
type_counts[&lt;span class="hljs-string"&gt;'entry'&lt;/span&gt;] = queryset.count()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(the make_queryset function is &lt;a href="https://github.com/simonw/simonwillisonblog/blob/3f5ca05248e409a946b53593f7d11b6f9551044f/blog/views.py#L408-L423"&gt;defined here&lt;/a&gt;)&lt;/p&gt;
&lt;p&gt;For years we can do this:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;&lt;span class="hljs-keyword"&gt;from&lt;/span&gt; django.db.models.functions &lt;span class="hljs-keyword"&gt;import&lt;/span&gt; TruncYear

&lt;span class="hljs-keyword"&gt;for&lt;/span&gt; row &lt;span class="hljs-keyword"&gt;in&lt;/span&gt; queryset.order_by().annotate(
    year=TruncYear(&lt;span class="hljs-string"&gt;'created'&lt;/span&gt;)
).values(&lt;span class="hljs-string"&gt;'year'&lt;/span&gt;).annotate(n=models.Count(&lt;span class="hljs-string"&gt;'pk'&lt;/span&gt;)):
    year_counts[row[&lt;span class="hljs-string"&gt;'year'&lt;/span&gt;]] = year_counts.get(
        row[&lt;span class="hljs-string"&gt;'year'&lt;/span&gt;], &lt;span class="hljs-number"&gt;0&lt;/span&gt;
    ) + row[&lt;span class="hljs-string"&gt;'n'&lt;/span&gt;]
&lt;/code&gt;&lt;/pre&gt;
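That loop runs once per content type, so the year counts accumulate across the three querysets. `collections.Counter` expresses the same merge more directly - a sketch over plain dicts standing in for the `TruncYear` queryset rows:

```python
from collections import Counter

# Per-type {year: n} results, standing in for the annotated querysets
per_type_counts = [
    {2007: 3, 2008: 1},   # entries
    {2007: 2, 2010: 5},   # blogmarks
]

year_counts = Counter()
for counts in per_type_counts:
    year_counts.update(counts)  # adds values for keys shared between types

assert year_counts == Counter({2007: 5, 2008: 1, 2010: 5})
```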
&lt;p&gt;Tags are trickiest. Let’s take advantage of the fact that Django’s ORM knows how to construct sub-selects if you pass another queryset to the &lt;code&gt;__in&lt;/code&gt; operator.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;tag_counts = {}
type_name = &lt;span class="hljs-string"&gt;'entry'&lt;/span&gt;
queryset = make_queryset(Entry, &lt;span class="hljs-string"&gt;'entry'&lt;/span&gt;)
&lt;span class="hljs-keyword"&gt;for&lt;/span&gt; tag, count &lt;span class="hljs-keyword"&gt;in&lt;/span&gt; Tag.objects.filter(**{
    &lt;span class="hljs-string"&gt;'%s__in'&lt;/span&gt; % type_name: queryset
}).annotate(
    n=models.Count(&lt;span class="hljs-string"&gt;'tag'&lt;/span&gt;)
).values_list(&lt;span class="hljs-string"&gt;'tag'&lt;/span&gt;, &lt;span class="hljs-string"&gt;'n'&lt;/span&gt;):
    tag_counts[tag] = tag_counts.get(tag, &lt;span class="hljs-number"&gt;0&lt;/span&gt;) + count
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;&lt;a id="Rendering_it_all_in_a_template_414"&gt;&lt;/a&gt;Rendering it all in a template&lt;/h3&gt;
&lt;p&gt;Having constructed the various facet counts in the view function, the template is really simple:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-html"&gt;{% if type_counts %}
    &lt;span class="hljs-tag"&gt;&amp;lt;&lt;span class="hljs-title"&gt;h3&lt;/span&gt;&amp;gt;&lt;/span&gt;Types&lt;span class="hljs-tag"&gt;&amp;lt;/&lt;span class="hljs-title"&gt;h3&lt;/span&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="hljs-tag"&gt;&amp;lt;&lt;span class="hljs-title"&gt;ul&lt;/span&gt;&amp;gt;&lt;/span&gt;
        {% for t in type_counts %}
            &lt;span class="hljs-tag"&gt;&amp;lt;&lt;span class="hljs-title"&gt;li&lt;/span&gt;&amp;gt;&lt;/span&gt;&lt;span class="hljs-tag"&gt;&amp;lt;&lt;span class="hljs-title"&gt;a&lt;/span&gt; &lt;span class="hljs-attribute"&gt;href&lt;/span&gt;=&lt;span class="hljs-value"&gt;"{% add_qsarg "&lt;/span&gt;&lt;span class="hljs-value"&gt;type"&lt;/span&gt; &lt;span class="hljs-attribute"&gt;t.type&lt;/span&gt; %}"&amp;gt;&lt;/span&gt;{{ t.type }}&lt;span class="hljs-tag"&gt;&amp;lt;/&lt;span class="hljs-title"&gt;a&lt;/span&gt;&amp;gt;&lt;/span&gt; {{ t.n }}&lt;span class="hljs-tag"&gt;&amp;lt;/&lt;span class="hljs-title"&gt;a&lt;/span&gt;&amp;gt;&lt;/span&gt;&lt;span class="hljs-tag"&gt;&amp;lt;/&lt;span class="hljs-title"&gt;li&lt;/span&gt;&amp;gt;&lt;/span&gt;
        {% endfor %}
    &lt;span class="hljs-tag"&gt;&amp;lt;/&lt;span class="hljs-title"&gt;ul&lt;/span&gt;&amp;gt;&lt;/span&gt;
{% endif %}
{% if year_counts %}
    &lt;span class="hljs-tag"&gt;&amp;lt;&lt;span class="hljs-title"&gt;h3&lt;/span&gt;&amp;gt;&lt;/span&gt;Years&lt;span class="hljs-tag"&gt;&amp;lt;/&lt;span class="hljs-title"&gt;h3&lt;/span&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="hljs-tag"&gt;&amp;lt;&lt;span class="hljs-title"&gt;ul&lt;/span&gt;&amp;gt;&lt;/span&gt;
        {% for t in year_counts %}
            &lt;span class="hljs-tag"&gt;&amp;lt;&lt;span class="hljs-title"&gt;li&lt;/span&gt;&amp;gt;&lt;/span&gt;&lt;span class="hljs-tag"&gt;&amp;lt;&lt;span class="hljs-title"&gt;a&lt;/span&gt; &lt;span class="hljs-attribute"&gt;href&lt;/span&gt;=&lt;span class="hljs-value"&gt;"{% add_qsarg "&lt;/span&gt;&lt;span class="hljs-value"&gt;year"&lt;/span&gt; &lt;span class="hljs-attribute"&gt;t.year&lt;/span&gt;|&lt;span class="hljs-attribute"&gt;date:&lt;/span&gt;"&lt;span class="hljs-attribute"&gt;Y&lt;/span&gt;" %}"&amp;gt;&lt;/span&gt;{{ t.year|date:"Y" }}&lt;span class="hljs-tag"&gt;&amp;lt;/&lt;span class="hljs-title"&gt;a&lt;/span&gt;&amp;gt;&lt;/span&gt; {{ t.n }}&lt;span class="hljs-tag"&gt;&amp;lt;/&lt;span class="hljs-title"&gt;a&lt;/span&gt;&amp;gt;&lt;/span&gt;&lt;span class="hljs-tag"&gt;&amp;lt;/&lt;span class="hljs-title"&gt;li&lt;/span&gt;&amp;gt;&lt;/span&gt;
        {% endfor %}
    &lt;span class="hljs-tag"&gt;&amp;lt;/&lt;span class="hljs-title"&gt;ul&lt;/span&gt;&amp;gt;&lt;/span&gt;
{% endif %}
{% if tag_counts %}
    &lt;span class="hljs-tag"&gt;&amp;lt;&lt;span class="hljs-title"&gt;h3&lt;/span&gt;&amp;gt;&lt;/span&gt;Tags&lt;span class="hljs-tag"&gt;&amp;lt;/&lt;span class="hljs-title"&gt;h3&lt;/span&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="hljs-tag"&gt;&amp;lt;&lt;span class="hljs-title"&gt;ul&lt;/span&gt;&amp;gt;&lt;/span&gt;
        {% for t in tag_counts %}
            &lt;span class="hljs-tag"&gt;&amp;lt;&lt;span class="hljs-title"&gt;li&lt;/span&gt;&amp;gt;&lt;/span&gt;&lt;span class="hljs-tag"&gt;&amp;lt;&lt;span class="hljs-title"&gt;a&lt;/span&gt; &lt;span class="hljs-attribute"&gt;href&lt;/span&gt;=&lt;span class="hljs-value"&gt;"{% add_qsarg "&lt;/span&gt;&lt;span class="hljs-value"&gt;tag"&lt;/span&gt; &lt;span class="hljs-attribute"&gt;t.tag&lt;/span&gt; %}"&amp;gt;&lt;/span&gt;{{ t.tag }}&lt;span class="hljs-tag"&gt;&amp;lt;/&lt;span class="hljs-title"&gt;a&lt;/span&gt;&amp;gt;&lt;/span&gt; {{ t.n }}&lt;span class="hljs-tag"&gt;&amp;lt;/&lt;span class="hljs-title"&gt;a&lt;/span&gt;&amp;gt;&lt;/span&gt;&lt;span class="hljs-tag"&gt;&amp;lt;/&lt;span class="hljs-title"&gt;li&lt;/span&gt;&amp;gt;&lt;/span&gt;
        {% endfor %}
    &lt;span class="hljs-tag"&gt;&amp;lt;/&lt;span class="hljs-title"&gt;ul&lt;/span&gt;&amp;gt;&lt;/span&gt;
{% endif %}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I am using &lt;a href="https://github.com/simonw/simonwillisonblog/blob/3f5ca05248e409a946b53593f7d11b6f9551044f/blog/templatetags/blog_tags.py#L54-L68"&gt;custom template tags&lt;/a&gt; here to add arguments to the current URL. I’ve built systems like this in the past where the URLs are instead generated in the view logic, which I think I prefer. As always, perfect is the enemy of shipped.&lt;/p&gt;
&lt;p&gt;And because the results are just a Django queryset, we can use Django’s pagination helpers for the pagination links.&lt;/p&gt;
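The core trick behind a template tag like add_qsarg can be sketched as a plain function, which makes the logic easy to test outside of Django. This is a hypothetical reimplementation of the idea, not the code linked above: preserve whatever querystring parameters are already on the URL and append one more.

```python
from urllib.parse import parse_qsl, urlencode

def add_qsarg(path_and_query, name, value):
    """Return the URL with name=value appended to its querystring,
    keeping any existing parameters - the idea behind a template
    tag like {% add_qsarg %} (hypothetical helper, not Simon's code)."""
    if "?" in path_and_query:
        path, query = path_and_query.split("?", 1)
    else:
        path, query = path_and_query, ""
    params = parse_qsl(query)          # existing facets stay in place
    params.append((name, str(value)))  # the new facet is appended
    return path + "?" + urlencode(params)
```

So add_qsarg("/search/?q=django", "type", "entry") keeps the q parameter and appends type=entry, which is exactly what each facet link needs to do.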
&lt;h3&gt;&lt;a id="The_final_implementation_449"&gt;&lt;/a&gt;The final implementation&lt;/h3&gt;
&lt;p&gt;The full current version of the code at time of writing &lt;a href="https://github.com/simonw/simonwillisonblog/blob/3f5ca05248e409a946b53593f7d11b6f9551044f/blog/views.py#L388-L552"&gt;can be seen here&lt;/a&gt;. You can follow my initial implementation of this feature through the following commits: &lt;a href="https://github.com/simonw/simonwillisonblog/commit/7e3a0217"&gt;7e3a0217&lt;/a&gt; &lt;a href="https://github.com/simonw/simonwillisonblog/commit/c7e7b30c"&gt;c7e7b30c&lt;/a&gt; &lt;a href="https://github.com/simonw/simonwillisonblog/commit/7f6b524c"&gt;7f6b524c&lt;/a&gt; &lt;a href="https://github.com/simonw/simonwillisonblog/commit/a16ddb5e"&gt;a16ddb5e&lt;/a&gt; &lt;a href="https://github.com/simonw/simonwillisonblog/commit/7055c7e1"&gt;7055c7e1&lt;/a&gt; &lt;a href="https://github.com/simonw/simonwillisonblog/commit/74c194d9"&gt;74c194d9&lt;/a&gt; &lt;a href="https://github.com/simonw/simonwillisonblog/commit/f3ffc100"&gt;f3ffc100&lt;/a&gt; &lt;a href="https://github.com/simonw/simonwillisonblog/commit/6c24d9fd"&gt;6c24d9fd&lt;/a&gt; &lt;a href="https://github.com/simonw/simonwillisonblog/commit/cb88c2d4"&gt;cb88c2d4&lt;/a&gt; &lt;a href="https://github.com/simonw/simonwillisonblog/commit/2c262c75"&gt;2c262c75&lt;/a&gt; &lt;a href="https://github.com/simonw/simonwillisonblog/commit/776a562a"&gt;776a562a&lt;/a&gt; &lt;a href="https://github.com/simonw/simonwillisonblog/commit/b8484c50"&gt;b8484c50&lt;/a&gt; &lt;a href="https://github.com/simonw/simonwillisonblog/commit/0b361c78"&gt;0b361c78&lt;/a&gt; &lt;a href="https://github.com/simonw/simonwillisonblog/commit/1322ada2"&gt;1322ada2&lt;/a&gt; &lt;a href="https://github.com/simonw/simonwillisonblog/commit/79b1b13d"&gt;79b1b13d&lt;/a&gt; &lt;a href="https://github.com/simonw/simonwillisonblog/commit/3955f41b"&gt;3955f41b&lt;/a&gt; &lt;a href="https://github.com/simonw/simonwillisonblog/commit/3f5ca052"&gt;3f5ca052&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;And that’s how I built faceted search on top of PostgreSQL and Django! I don’t have my blog comments up and running yet, so please post any thoughts or feedback over on &lt;a href="https://github.com/simonw/simonwillisonblog/issues/1"&gt;this GitHub issue&lt;/a&gt; or &lt;a href="https://news.ycombinator.com/item?id=15409733"&gt;over on this thread on Hacker News&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update 9th September 2021&lt;/strong&gt;: A few years after implementing this I started to notice performance issues with my blog, which turned out to be caused by search engine crawlers hitting every possible combination of facets, triggering a ton of expensive SQL queries. I excluded &lt;code&gt;/search&lt;/code&gt; from being crawled using &lt;code&gt;robots.txt&lt;/code&gt; which fixed the problem.&lt;/em&gt;&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/django"&gt;django&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/full-text-search"&gt;full-text-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/orm"&gt;orm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/postgresql"&gt;postgresql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/search"&gt;search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/facetedsearch"&gt;facetedsearch&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="django"/><category term="full-text-search"/><category term="orm"/><category term="postgresql"/><category term="projects"/><category term="search"/><category term="facetedsearch"/></entry><entry><title>A fast, fuzzy, full-text index using Redis</title><link href="https://simonwillison.net/2010/May/5/fuzzy/#atom-tag" rel="alternate"/><published>2010-05-05T17:51:00+00:00</published><updated>2010-05-05T17:51:00+00:00</updated><id>https://simonwillison.net/2010/May/5/fuzzy/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://playnice.ly/blog/2010/05/05/a-fast-fuzzy-full-text-index-using-redis/"&gt;A fast, fuzzy, full-text index using Redis&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Interesting twist on building a reverse-index using Redis sets: this one indexes only the metaphones of the words, resulting in a phonetic fuzzy search.
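The technique can be sketched in a few lines of Python, with plain sets standing in for Redis sets (adding to a set is SADD, intersecting is SINTER) and a simplified Soundex standing in for the metaphone algorithm the post uses - any phonetic key illustrates the idea.

```python
from collections import defaultdict

def soundex(word):
    """Simplified Soundex code - a stand-in for metaphone here;
    any phonetic hash works for this sketch."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    key, prev = word[0].upper(), codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:  # skip vowels and repeated codes
            key += code
        prev = code
    return (key + "000")[:4]

# A dict of sets stands in for Redis: .add() is SADD, intersection is SINTER.
index = defaultdict(set)

def index_doc(doc_id, text):
    for word in text.split():
        index[soundex(word)].add(doc_id)

def search(query):
    keys = [index[soundex(w)] for w in query.split()]
    return set.intersection(*keys) if keys else set()

index_doc(1, "john smith")
index_doc(2, "jon smyth")
matches = search("jon smith")  # finds both docs despite the spellings
```

Because only the phonetic keys are indexed, misspellings that sound alike hit the same Redis set, which is where the "fuzzy" comes from.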


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/full-text-search"&gt;full-text-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/redis"&gt;redis&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/search"&gt;search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/recovered"&gt;recovered&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fuzzy"&gt;fuzzy&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/metaphone"&gt;metaphone&lt;/a&gt;&lt;/p&gt;



</summary><category term="full-text-search"/><category term="redis"/><category term="search"/><category term="recovered"/><category term="fuzzy"/><category term="metaphone"/></entry><entry><title>Digg Search: Now With 99.987% Less Suck</title><link href="https://simonwillison.net/2009/Apr/10/digg/#atom-tag" rel="alternate"/><published>2009-04-10T22:17:57+00:00</published><updated>2009-04-10T22:17:57+00:00</updated><id>https://simonwillison.net/2009/Apr/10/digg/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://blog.digg.com/?p=653"&gt;Digg Search: Now With 99.987% Less Suck&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Really nice implementation of faceted search, still using Lucene and Solr under the hood.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/digg"&gt;digg&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/facets"&gt;facets&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/full-text-search"&gt;full-text-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lucene"&gt;lucene&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/search"&gt;search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/solr"&gt;solr&lt;/a&gt;&lt;/p&gt;



</summary><category term="digg"/><category term="facets"/><category term="full-text-search"/><category term="lucene"/><category term="search"/><category term="solr"/></entry><entry><title>Sphinx 0.9.9-rc2 is out</title><link href="https://simonwillison.net/2009/Apr/8/sphinx/#atom-tag" rel="alternate"/><published>2009-04-08T13:59:26+00:00</published><updated>2009-04-08T13:59:26+00:00</updated><id>https://simonwillison.net/2009/Apr/8/sphinx/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://sphinxsearch.com/news/37.html"&gt;Sphinx 0.9.9-rc2 is out&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Interesting new feature: the Sphinx search server now supports the MySQL binary protocol, so you can talk to it using a regular MySQL client library and fire off search queries using SELECT syntax and the new SphinxQL query language.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/full-text-search"&gt;full-text-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mysql"&gt;mysql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/search"&gt;search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sphinx-search"&gt;sphinx-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sql"&gt;sql&lt;/a&gt;&lt;/p&gt;



</summary><category term="full-text-search"/><category term="mysql"/><category term="search"/><category term="sphinx-search"/><category term="sql"/></entry><entry><title>Guardian + Lucene = Similar Articles + Categorisation</title><link href="https://simonwillison.net/2009/Mar/11/hublog/#atom-tag" rel="alternate"/><published>2009-03-11T12:53:39+00:00</published><updated>2009-03-11T12:53:39+00:00</updated><id>https://simonwillison.net/2009/Mar/11/hublog/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://hublog.hubmed.org/archives/001823.html"&gt;Guardian + Lucene = Similar Articles + Categorisation&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Alf Eaton loaded 13,000 Guardian articles tagged Science into Solr and Lucene and is using Solr’s MoreLikeThisHandler to find related articles and automatically apply Guardian tags to Nature News articles.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/alf-eaton"&gt;alf-eaton&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/full-text-search"&gt;full-text-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/guardian"&gt;guardian&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lucene"&gt;lucene&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/naturenews"&gt;naturenews&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openplatform"&gt;openplatform&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/search"&gt;search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/solr"&gt;solr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tagging"&gt;tagging&lt;/a&gt;&lt;/p&gt;



</summary><category term="alf-eaton"/><category term="full-text-search"/><category term="guardian"/><category term="lucene"/><category term="naturenews"/><category term="openplatform"/><category term="search"/><category term="solr"/><category term="tagging"/></entry><entry><title>Xapian performance comparision with Whoosh</title><link href="https://simonwillison.net/2009/Feb/14/xapian/#atom-tag" rel="alternate"/><published>2009-02-14T13:15:15+00:00</published><updated>2009-02-14T13:15:15+00:00</updated><id>https://simonwillison.net/2009/Feb/14/xapian/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://xapian.wordpress.com/2009/02/12/xapian-performance-comparision-with-whoosh/"&gt;Xapian performance comparision with Whoosh&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Whoosh appears to be around four times slower than Xapian for indexing and empty cache searches, but Xapian with a full cache blows Whoosh out of the water (5408 searches/second compared to 26.3). Considering how fast Xapian is, that’s still a pretty impressive result for the pure-Python Whoosh.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/full-text-search"&gt;full-text-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/richard-boulton"&gt;richard-boulton&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/search"&gt;search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/whoosh"&gt;whoosh&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xapian"&gt;xapian&lt;/a&gt;&lt;/p&gt;



</summary><category term="full-text-search"/><category term="python"/><category term="richard-boulton"/><category term="search"/><category term="whoosh"/><category term="xapian"/></entry><entry><title>Tokyo Cabinet and Tokyo Tyrant Presentation</title><link href="https://simonwillison.net/2009/Feb/14/cabinet/#atom-tag" rel="alternate"/><published>2009-02-14T11:34:58+00:00</published><updated>2009-02-14T11:34:58+00:00</updated><id>https://simonwillison.net/2009/Feb/14/cabinet/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://www.scribd.com/doc/12016121/Tokyo-Cabinet-and-Tokyo-Tyrant-Presentation"&gt;Tokyo Cabinet and Tokyo Tyrant Presentation&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
By Tokyo Cabinet author Mikio Hirabayashi. The third leg of the Tokyo tripod is Tokyo Dystopia, a full-text search engine which is presumably a modern replacement for Mikio’s older hyperestraier engine.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/full-text-search"&gt;full-text-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/hyperestraier"&gt;hyperestraier&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mikiohirabayashi"&gt;mikiohirabayashi&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tokyocabinet"&gt;tokyocabinet&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tokyodystopia"&gt;tokyodystopia&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tokyotyrant"&gt;tokyotyrant&lt;/a&gt;&lt;/p&gt;



</summary><category term="full-text-search"/><category term="hyperestraier"/><category term="mikiohirabayashi"/><category term="tokyocabinet"/><category term="tokyodystopia"/><category term="tokyotyrant"/></entry><entry><title>Whoosh</title><link href="https://simonwillison.net/2009/Feb/12/whoosh/#atom-tag" rel="alternate"/><published>2009-02-12T12:49:59+00:00</published><updated>2009-02-12T12:49:59+00:00</updated><id>https://simonwillison.net/2009/Feb/12/whoosh/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://whoosh.ca/"&gt;Whoosh&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
A brand new, pure-python full text indexing engine (think Lucene). Claims to offer performance in the same league as wrappers to C or Java libraries. If this works as well as it claims it will be an excellent tool for adding search to projects that wish to avoid a dependency on an external engine.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/full-text-search"&gt;full-text-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lucene"&gt;lucene&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/search"&gt;search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/whoosh"&gt;whoosh&lt;/a&gt;&lt;/p&gt;



</summary><category term="full-text-search"/><category term="lucene"/><category term="open-source"/><category term="python"/><category term="search"/><category term="whoosh"/></entry><entry><title>How-to: Full-text search in Google App Engine</title><link href="https://simonwillison.net/2008/Jun/27/app/#atom-tag" rel="alternate"/><published>2008-06-27T08:25:35+00:00</published><updated>2008-06-27T08:25:35+00:00</updated><id>https://simonwillison.net/2008/Jun/27/app/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://appengineguy.com/2008/06/how-to-full-text-search-in-google-app.html"&gt;How-to: Full-text search in Google App Engine&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Use search.SearchableModel instead of db.Model—it’s pretty rough at the moment which is probably why it’s still undocumented.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/full-text-search"&gt;full-text-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google-app-engine"&gt;google-app-engine&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/search"&gt;search&lt;/a&gt;&lt;/p&gt;



</summary><category term="full-text-search"/><category term="google-app-engine"/><category term="python"/><category term="search"/></entry><entry><title>PostgreSQL 8.3 beta 4 release notes</title><link href="https://simonwillison.net/2007/Dec/12/release/#atom-tag" rel="alternate"/><published>2007-12-12T00:43:58+00:00</published><updated>2007-12-12T00:43:58+00:00</updated><id>https://simonwillison.net/2007/Dec/12/release/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://developer.postgresql.org/pgdocs/postgres/release-8-3.html"&gt;PostgreSQL 8.3 beta 4 release notes&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
In addition to the huge speed improvements, 8.3 adds support for XML, UUID and ENUM data types and brings full-text search (tsearch2) into the core database engine.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/beta"&gt;beta&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/databases"&gt;databases&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/enum"&gt;enum&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/full-text-search"&gt;full-text-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/postgresql"&gt;postgresql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tsearch2"&gt;tsearch2&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uuid"&gt;uuid&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xml"&gt;xml&lt;/a&gt;&lt;/p&gt;



</summary><category term="beta"/><category term="databases"/><category term="enum"/><category term="full-text-search"/><category term="postgresql"/><category term="tsearch2"/><category term="uuid"/><category term="xml"/></entry><entry><title>Opera 9.5 alpha, Kestrel, released</title><link href="https://simonwillison.net/2007/Sep/16/opera/#atom-tag" rel="alternate"/><published>2007-09-16T20:34:34+00:00</published><updated>2007-09-16T20:34:34+00:00</updated><id>https://simonwillison.net/2007/Sep/16/opera/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://virtuelvis.com/archives/2007/09/kestrel"&gt;Opera 9.5 alpha, Kestrel, released&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
“With history search, Opera creates a full-text index of each and every page you visit, and when you go to the address bar, you can simply start entering words you know have been on pages you’ve visited before, and items matching your search show up.” I just tried this; it’s magic. I’m switching back to Opera from Camino.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/browsers"&gt;browsers&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/camino"&gt;camino&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/full-text-search"&gt;full-text-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/history"&gt;history&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/kestrel"&gt;kestrel&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/opera"&gt;opera&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/search"&gt;search&lt;/a&gt;&lt;/p&gt;



</summary><category term="browsers"/><category term="camino"/><category term="full-text-search"/><category term="history"/><category term="kestrel"/><category term="opera"/><category term="search"/></entry><entry><title>django-sphinx</title><link href="https://simonwillison.net/2007/Sep/9/djangosphinx/#atom-tag" rel="alternate"/><published>2007-09-09T00:35:19+00:00</published><updated>2007-09-09T00:35:19+00:00</updated><id>https://simonwillison.net/2007/Sep/9/djangosphinx/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://code.google.com/p/django-sphinx/"&gt;django-sphinx&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
More code from Curse Gaming; this time a really nice API for adding Sphinx full-text search to a Django model.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="http://www.davidcramer.net/code/54/mediawiki-markup-and-sphinxsearch-for-django.html"&gt;David Cramer&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cursegaming"&gt;cursegaming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/david-cramer"&gt;david-cramer&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/django"&gt;django&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/full-text-search"&gt;full-text-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/orm"&gt;orm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/search"&gt;search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sphinx-search"&gt;sphinx-search&lt;/a&gt;&lt;/p&gt;



</summary><category term="cursegaming"/><category term="david-cramer"/><category term="django"/><category term="full-text-search"/><category term="orm"/><category term="python"/><category term="search"/><category term="sphinx-search"/></entry><entry><title>SQLite Keynote</title><link href="https://simonwillison.net/2006/Oct/21/sqlite/#atom-tag" rel="alternate"/><published>2006-10-21T23:44:28+00:00</published><updated>2006-10-21T23:44:28+00:00</updated><id>https://simonwillison.net/2006/Oct/21/sqlite/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://blog.cleverly.com/permalinks/247.html"&gt;SQLite Keynote&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
SQLite 3.3.8 has full text indexing!
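That full-text support is still in SQLite today (the 2006 release shipped the original fts1 module; current builds ship FTS5), so it can be tried directly from Python's bundled sqlite3 - a minimal sketch, assuming your Python's SQLite was compiled with FTS5, as most modern builds are:

```python
import sqlite3

# Assumes the bundled SQLite includes FTS5; the 3.3.8 release
# discussed in the keynote shipped the earlier fts1 module.
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
db.execute("INSERT INTO docs VALUES ('SQLite Keynote', 'full text indexing lands in 3.3.8')")
db.execute("INSERT INTO docs VALUES ('Whoosh', 'a pure-python indexing engine')")
# MATCH runs a full-text query against all indexed columns
rows = db.execute(
    "SELECT title FROM docs WHERE docs MATCH 'indexing AND python'"
).fetchall()
```

Only the Whoosh row contains both terms, so that is the single row returned.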


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/full-text-search"&gt;full-text-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/d-richard-hipp"&gt;d-richard-hipp&lt;/a&gt;&lt;/p&gt;



</summary><category term="full-text-search"/><category term="sqlite"/><category term="d-richard-hipp"/></entry></feed>