<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: data-structures</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/data-structures.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-11-11T23:38:39+00:00</updated><author><name>Simon Willison</name></author><entry><title>Scaling HNSWs</title><link href="https://simonwillison.net/2025/Nov/11/scaling-hnsws/#atom-tag" rel="alternate"/><published>2025-11-11T23:38:39+00:00</published><updated>2025-11-11T23:38:39+00:00</updated><id>https://simonwillison.net/2025/Nov/11/scaling-hnsws/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://antirez.com/news/156"&gt;Scaling HNSWs&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Salvatore Sanfilippo spent much of this year working on &lt;a href="https://github.com/redis/redis/blob/8.2.3/modules/vector-sets/README.md"&gt;vector sets for Redis&lt;/a&gt;, which first shipped in &lt;a href="https://redis.io/blog/redis-8-ga/"&gt;Redis 8 in May&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;A big part of that work involved implementing HNSW - Hierarchical Navigable Small World - an indexing technique first introduced in &lt;a href="https://arxiv.org/abs/1603.09320"&gt;this 2016 paper&lt;/a&gt; by Yu. A. Malkov and D. A. Yashunin.&lt;/p&gt;
&lt;p&gt;Salvatore's detailed notes on the Redis implementation here offer an immersive trip through a fascinating modern field of computer science. He describes several new contributions he's made to the HNSW algorithm, mainly around efficient deletion and updating of existing indexes.&lt;/p&gt;
&lt;p&gt;Since embedding vectors are notoriously memory-hungry I particularly appreciated this note about how you can scale a large HNSW vector set across many different nodes and run parallel queries against them for both reads and writes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[...] if you have different vectors about the same use case split in different instances / keys, you can ask VSIM for the same query vector into all the instances, and add the WITHSCORES option (that returns the cosine distance) and merge the results client-side, and you have magically scaled your hundred of millions of vectors into multiple instances, splitting your dataset N times [One interesting thing about such a use case is that you can query the N instances in parallel using multiplexing, if your client library is smart enough].&lt;/p&gt;
&lt;p&gt;Another very notable thing about HNSWs exposed in this raw way, is that you can finally scale writes very easily. Just hash your element modulo N, and target the resulting Redis key/instance. Multiple instances can absorb the (slow, but still fast for HNSW standards) writes at the same time, parallelizing an otherwise very slow process.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It's always exciting to see new implementations of fundamental algorithms and data structures like this make it into Redis because Salvatore's C code is so clearly commented and pleasant to read - here's &lt;a href="https://github.com/redis/redis/blob/8.2.3/modules/vector-sets/hnsw.c"&gt;vector-sets/hnsw.c&lt;/a&gt; and &lt;a href="https://github.com/redis/redis/blob/8.2.3/modules/vector-sets/vset.c"&gt;vector-sets/vset.c&lt;/a&gt;.

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=45887466"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/algorithms"&gt;algorithms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/c"&gt;c&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/computer-science"&gt;computer-science&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-structures"&gt;data-structures&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/redis"&gt;redis&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/salvatore-sanfilippo"&gt;salvatore-sanfilippo&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vector-search"&gt;vector-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/embeddings"&gt;embeddings&lt;/a&gt;&lt;/p&gt;



</summary><category term="algorithms"/><category term="c"/><category term="computer-science"/><category term="data-structures"/><category term="redis"/><category term="salvatore-sanfilippo"/><category term="vector-search"/><category term="embeddings"/></entry><entry><title>Zed Decoded: Rope &amp; SumTree</title><link href="https://simonwillison.net/2024/Apr/28/zed-decoded-rope-sumtree/#atom-tag" rel="alternate"/><published>2024-04-28T15:25:58+00:00</published><updated>2024-04-28T15:25:58+00:00</updated><id>https://simonwillison.net/2024/Apr/28/zed-decoded-rope-sumtree/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://zed.dev/blog/zed-decoded-rope-sumtree"&gt;Zed Decoded: Rope &amp;amp; SumTree&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Text editors like &lt;a href="https://zed.dev/"&gt;Zed&lt;/a&gt; need in-memory data structures that are optimized for handling large strings where text can be inserted or deleted at any point without needing to copy the whole string.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://en.m.wikipedia.org/wiki/Rope_(data_structure)"&gt;Ropes&lt;/a&gt; are a classic, widely used data structure for this.&lt;/p&gt;
&lt;p&gt;Zed have their own implementation of ropes in Rust, but it's backed by something even more interesting: a SumTree, described here as a thread-safe, snapshot-friendly, copy-on-write B+ tree where each leaf node contains multiple items and a Summary for each Item, and internal tree nodes contain a Summary of the items in its subtree.&lt;/p&gt;
&lt;p&gt;These summaries allow for some very fast traversal tree operations, such as turning an offset in the file into a line and row coordinate and vice-versa. The summary itself can be anything, so each application of SumTree in Zed collects different summary information.&lt;/p&gt;
&lt;p&gt;Uses in Zed include tracking highlight regions, code folding state, git blame information, project file trees and more - over 20 different classes and counting.&lt;/p&gt;
&lt;p&gt;Zed co-founder Nathan Sobo calls SumTree "the soul of Zed".&lt;/p&gt;
&lt;p&gt;Also notable: this detailed article is accompanied by an &lt;a href="https://youtu.be/uUu9eFNNbjg"&gt;hour long video&lt;/a&gt; with a four-way conversation between Zed maintainers providing a tour of these data structures in the Zed codebase.

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://twitter.com/eatonphil/status/1784576184937799885"&gt;@eatonphil&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data-structures"&gt;data-structures&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rust"&gt;rust&lt;/a&gt;&lt;/p&gt;



</summary><category term="data-structures"/><category term="rust"/></entry><entry><title>Reducing search indexing latency to one second</title><link href="https://simonwillison.net/2020/Jun/26/reducing-search-indexing-latency-one-second/#atom-tag" rel="alternate"/><published>2020-06-26T17:06:08+00:00</published><updated>2020-06-26T17:06:08+00:00</updated><id>https://simonwillison.net/2020/Jun/26/reducing-search-indexing-latency-one-second/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.twitter.com/engineering/en_us/topics/infrastructure/2020/reducing-search-indexing-latency-to-one-second.html"&gt;Reducing search indexing latency to one second&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Really detailed dive into the nuts and bolts of Twitter’s latest iteration of search indexing technology, including a great explanation of skip lists.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data-structures"&gt;data-structures&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lucene"&gt;lucene&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scaling"&gt;scaling&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/search"&gt;search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/twitter"&gt;twitter&lt;/a&gt;&lt;/p&gt;



</summary><category term="data-structures"/><category term="lucene"/><category term="scaling"/><category term="search"/><category term="twitter"/></entry><entry><title>Why does Python not have any data structures that store data in sorted order?</title><link href="https://simonwillison.net/2013/Jun/25/why-does-python-not/#atom-tag" rel="alternate"/><published>2013-06-25T17:41:00+00:00</published><updated>2013-06-25T17:41:00+00:00</updated><id>https://simonwillison.net/2013/Jun/25/why-does-python-not/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;em&gt;My answer to &lt;a href="https://www.quora.com/Why-does-Python-not-have-any-data-structures-that-store-data-in-sorted-order/answer/Simon-Willison"&gt;Why does Python not have any data structures that store data in sorted order?&lt;/a&gt; on Quora&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The bisect module provides functions for achieving this using a python list as the underlying data structure: &lt;span&gt;&lt;a href="http://docs.python.org/2/library/bisect.html"&gt;http://docs.python.org/2/library...&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data-structures"&gt;data-structures&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/quora"&gt;quora&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="data-structures"/><category term="python"/><category term="quora"/></entry><entry><title>What data structures are used to implement the DOM tree?</title><link href="https://simonwillison.net/2013/Feb/17/what-data-structures-are/#atom-tag" rel="alternate"/><published>2013-02-17T13:31:00+00:00</published><updated>2013-02-17T13:31:00+00:00</updated><id>https://simonwillison.net/2013/Feb/17/what-data-structures-are/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;em&gt;My answer to &lt;a href="https://www.quora.com/What-data-structures-are-used-to-implement-the-DOM-tree/answer/Simon-Willison"&gt;What data structures are used to implement the DOM tree?&lt;/a&gt; on Quora&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You may enjoy this post from Hixie back in 2002 which illustrates how different browsers deal with incorrectly nested HTML. IE6 used to create a tree that wasn't actually a tree! &lt;span&gt;&lt;a href="http://ln.hixie.ch/?count=1&amp;amp;start=1037910467"&gt;http://ln.hixie.ch/?start=103791...&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/chrome"&gt;chrome&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-structures"&gt;data-structures&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/firefox"&gt;firefox&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/html"&gt;html&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/internet-explorer"&gt;internet-explorer&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/opera"&gt;opera&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/webkit"&gt;webkit&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/quora"&gt;quora&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/firefoxos"&gt;firefoxos&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="chrome"/><category term="data-structures"/><category term="firefox"/><category term="html"/><category term="internet-explorer"/><category term="opera"/><category term="webkit"/><category term="quora"/><category term="firefoxos"/></entry><entry><title>Speeding up dateutil: Python's heapq module turns minutes into seconds</title><link href="https://simonwillison.net/2007/Dec/22/heapq/#atom-tag" rel="alternate"/><published>2007-12-22T13:07:43+00:00</published><updated>2007-12-22T13:07:43+00:00</updated><id>https://simonwillison.net/2007/Dec/22/heapq/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://blog.brianbeck.com/post/22129050"&gt;Speeding up dateutil: Python&amp;#x27;s heapq module turns minutes into seconds&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Neat case study in data structure optimisation.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/brian-beck"&gt;brian-beck&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-structures"&gt;data-structures&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dateutil"&gt;dateutil&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/heapq"&gt;heapq&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;&lt;/p&gt;



</summary><category term="brian-beck"/><category term="data-structures"/><category term="dateutil"/><category term="heapq"/><category term="python"/></entry></feed>