Simon Willison's Weblog: data-structures

Scaling HNSWs

2025-11-11T23:38:39+00:00

Salvatore Sanfilippo spent much of this year working on vector sets for Redis, which first shipped in Redis 8 in May.

A big part of that work involved implementing HNSW - Hierarchical Navigable Small World - an indexing technique first introduced in this 2016 paper by Yu. A. Malkov and D. A. Yashunin.

Salvatore's detailed notes on the Redis implementation here offer an immersive trip through a fascinating modern field of computer science. He describes several new contributions he's made to the HNSW algorithm, mainly around efficient deletion and updating of existing indexes.

Since embedding vectors are notoriously memory-hungry, I particularly appreciated this note about how you can split a large HNSW vector set across many different nodes and run operations against them in parallel, for both reads and writes:

[...] if you have different vectors about the same use case split in different instances / keys, you can ask VSIM for the same query vector into all the instances, and add the WITHSCORES option (that returns the cosine distance) and merge the results client-side, and you have magically scaled your hundred of millions of vectors into multiple instances, splitting your dataset N times [One interesting thing about such a use case is that you can query the N instances in parallel using multiplexing, if your client library is smart enough].

Another very notable thing about HNSWs exposed in this raw way, is that you can finally scale writes very easily. Just hash your element modulo N, and target the resulting Redis key/instance. Multiple instances can absorb the (slow, but still fast for HNSW standards) writes at the same time, parallelizing an otherwise very slow process.
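Here's a minimal Python sketch of the two client-side pieces that note describes: routing a write to one of N shards by hashing the element, and merging the per-shard results of a read. It deliberately doesn't talk to Redis. The function names are my own invention, and the per-shard lists stand in for whatever `VSIM ... WITHSCORES` returned from each instance. I'm assuming a higher score means a closer match; if your scores are distances, swap `nlargest` for `nsmallest`.

```python
import heapq
import zlib
from typing import Iterable, List, Tuple

Scored = Tuple[str, float]  # (element name, score) pair


def shard_for(element: str, num_shards: int) -> int:
    """Pick the shard for a write by hashing the element modulo N."""
    # crc32 is stable across processes; Python's built-in hash() for
    # strings is randomized per process, so it can't be used for routing.
    return zlib.crc32(element.encode("utf-8")) % num_shards


def merge_results(per_shard: Iterable[List[Scored]], k: int) -> List[Scored]:
    """Merge the (element, score) lists from every shard into a global top k."""
    everything = (pair for results in per_shard for pair in results)
    # Assumption: higher score == more similar.
    return heapq.nlargest(k, everything, key=lambda pair: pair[1])


# Hypothetical results from two shards for the same query vector.
shard_a = [("doc:1", 0.90), ("doc:7", 0.50)]
shard_b = [("doc:3", 0.95), ("doc:9", 0.10)]
top = merge_results([shard_a, shard_b], k=3)
```

Each shard only needs to return its own top k for the merged top k to be correct, since no element outside a shard's local top k can make the global one.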

It's always exciting to see new implementations of fundamental algorithms and data structures like this make it into Redis because Salvatore's C code is so clearly commented and pleasant to read - here's vector-sets/hnsw.c and vector-sets/vset.c.

Via Hacker News

Tags: algorithms, c, computer-science, data-structures, redis, salvatore-sanfilippo, vector-search, embeddings

Zed Decoded: Rope & SumTree

2024-04-28T15:25:58+00:00

Text editors like Zed need in-memory data structures that are optimized for handling large strings where text can be inserted or deleted at any point without needing to copy the whole string.

Ropes are a classic, widely used data structure for this.

Zed have their own implementation of ropes in Rust, but it's backed by something even more interesting: a SumTree, described here as a thread-safe, snapshot-friendly, copy-on-write B+ tree where each leaf node contains multiple items and a Summary for each Item, and each internal tree node contains a Summary of the items in its subtree.

These summaries allow for some very fast tree traversal operations, such as turning an offset in the file into a row and column coordinate and vice versa. The summary itself can be anything, so each application of SumTree in Zed collects different summary information.
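To make the summary idea concrete, here's a toy Python sketch. It is not Zed's Rust implementation: there's no B+ tree balancing, no copy-on-write and no thread safety, and every name in it is made up for illustration. It only shows how a per-subtree summary (length, newline count, characters after the last newline) lets you convert an offset into a row and column by skipping whole subtrees.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Summary:
    length: int    # characters covered
    rows: int      # newline characters covered
    last_col: int  # characters after the final newline

    def __add__(self, other: "Summary") -> "Summary":
        # Order matters: `other` describes the text directly after `self`.
        last_col = other.last_col if other.rows else self.last_col + other.last_col
        return Summary(self.length + other.length, self.rows + other.rows, last_col)


EMPTY = Summary(0, 0, 0)


def summarize(text: str) -> Summary:
    return Summary(len(text), text.count("\n"), len(text) - (text.rfind("\n") + 1))


@dataclass(frozen=True)
class Leaf:
    text: str
    summary: Summary


@dataclass(frozen=True)
class Node:
    children: Tuple[Union["Node", Leaf], ...]
    summary: Summary  # sum of the children's summaries, in order


def build(chunks: List[str], fanout: int = 4) -> Union[Node, Leaf]:
    """Build the tree bottom-up from text chunks (no rebalancing)."""
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    level: list = [Leaf(c, summarize(c)) for c in chunks] or [Leaf("", EMPTY)]
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), fanout):
            group = tuple(level[i:i + fanout])
            total = EMPTY
            for child in group:
                total = total + child.summary
            parents.append(Node(group, total))
        level = parents
    return level[0]


def offset_to_point(root: Union[Node, Leaf], offset: int) -> Tuple[int, int]:
    """Convert a character offset into a zero-based (row, column)."""
    if not 0 <= offset <= root.summary.length:
        raise IndexError(offset)
    before = EMPTY  # summary of everything to the left of `node`
    node = root
    while isinstance(node, Node):
        for child in node.children:
            if offset <= before.length + child.summary.length:
                node = child  # the offset falls inside this child
                break
            before = before + child.summary  # skip the whole subtree
    total = before + summarize(node.text[:offset - before.length])
    return total.rows, total.last_col


root = build(["hel", "lo\nwo", "rld\n", "foo"], fanout=2)
point = offset_to_point(root, 8)  # (1, 2): the "r" in "world"
```

The lookup only reads text in the single leaf it lands on; everything to the left is accounted for by adding up summaries. A different summary type would let the same traversal answer a different question.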

Uses in Zed include tracking highlight regions, code folding state, git blame information, project file trees and more - over 20 different classes and counting.

Zed co-founder Nathan Sobo calls SumTree "the soul of Zed".

Also notable: this detailed article is accompanied by an hour-long video with a four-way conversation between Zed maintainers providing a tour of these data structures in the Zed codebase.

Via @eatonphil

Tags: data-structures, rust

Reducing search indexing latency to one second

2020-06-26T17:06:08+00:00

Really detailed dive into the nuts and bolts of Twitter’s latest iteration of search indexing technology, including a great explanation of skip lists.

Tags: data-structures, lucene, scaling, search, twitter

Why does Python not have any data structures that store data in sorted order?

2013-06-25T17:41:00+00:00

My answer to Why does Python not have any data structures that store data in sorted order? on Quora

The bisect module provides functions for achieving this using a Python list as the underlying data structure: http://docs.python.org/2/library...
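A short sketch of what that looks like in practice, using only the standard library. The list contents are made up for the example.

```python
import bisect

scores = [10, 20, 40]  # must already be sorted

# insort finds the right position with a binary search, then inserts there.
bisect.insort(scores, 30)

# bisect_left returns the index where a value is (or would be) located.
position = bisect.bisect_left(scores, 30)
```

Finding the position is O(log n), but the insert itself still shifts list elements, so it is O(n) overall.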

Tags: data-structures, python, quora

What data structures are used to implement the DOM tree?

2013-02-17T13:31:00+00:00

My answer to What data structures are used to implement the DOM tree? on Quora

You may enjoy this post from Hixie back in 2002 which illustrates how different browsers deal with incorrectly nested HTML. IE6 used to create a tree that wasn't actually a tree! http://ln.hixie.ch/?start=103791...

Tags: chrome, data-structures, firefox, html, internet-explorer, opera, webkit, quora, firefoxos

Speeding up dateutil: Python's heapq module turns minutes into seconds

2007-12-22T13:07:43+00:00

Neat case study in data structure optimisation.

Tags: brian-beck, data-structures, dateutil, heapq, python