Simon Willison's Weblog: sam-rose

Quantization from the ground up

2026-03-26T16:21:09+00:00

Sam Rose continues his streak of publishing spectacularly informative interactive essays, this time explaining how quantization of Large Language Models works (which he says might be "the best post I've ever made".)

Also included is the best visual explanation I've ever seen of how floating point numbers are represented using binary digits.

I hadn't heard about outlier values in quantization - rare float values that exist outside of the normal tiny-value distribution - but apparently they're very important:

Why do these outliers exist? [...] tl;dr: no one conclusively knows, but a small fraction of these outliers are very important to model quality. Removing even a single "super weight," as Apple calls them, can cause the model to output complete gibberish.

Given their importance, real-world quantization schemes sometimes do extra work to preserve these outliers. They might do this by not quantizing them at all, or by saving their location and value into a separate table, then removing them so that their block isn't destroyed.

Plus there's a section on How much does quantization affect model accuracy?. Sam explains the concepts of perplexity and ** KL divergence ** and then uses the llama.cpp perplexity tool and a run of the GPQA benchmark to show how different quantization levels affect Qwen 3.5 9B.

His conclusion:

It looks like 16-bit to 8-bit carries almost no quality penalty. 16-bit to 4-bit is more noticeable, but it's certainly not a quarter as good as the original. Closer to 90%, depending on how you want to measure it.

Tags: computer-science, ai, explorables, generative-ai, llms, sam-rose, qwen

Sam Rose explains how LLMs work with a visual essay

2025-12-19T18:33:41+00:00

Sam Rose explains how LLMs work with a visual essay

Sam Rose is one of my favorite authors of explorable interactive explanations - here's his previous collection.

Sam joined ngrok in September as a developer educator. Here's his first big visual explainer for them, ostensibly about how prompt caching works but it quickly expands to cover tokenization, embeddings, and the basics of the transformer architecture.

The result is one of the clearest and most accessible introductions to LLM internals I've seen anywhere.

Tags: ai, explorables, generative-ai, llms, sam-rose, tokenization

Reservoir Sampling

2025-05-08T21:00:22+00:00

Reservoir Sampling

Yet another outstanding interactive essay by Sam Rose (previously), this time explaining how reservoir sampling can be used to select a "fair" random sample when you don't know how many options there are and don't want to accumulate them before making a selection.

Reservoir sampling is one of my favourite algorithms, and I've been wanting to write about it for years now. It allows you to solve a problem that at first seems impossible, in a way that is both elegant and efficient.

I appreciate that Sam starts the article with "No math notation, I promise." Lots of delightful widgets to interact with here, all of which help build an intuitive understanding of the underlying algorithm.

Sam shows how this algorithm can be applied to the real-world problem of sampling log files when incoming logs threaten to overwhelm a log aggregator.

The dog illustration is commissioned art and the MIT-licensed code is available on GitHub.

Via Hacker News

Tags: algorithms, logging, rate-limiting, explorables, sam-rose

Load Balancing

2024-07-13T22:51:45+00:00

Load Balancing

Sam Rose built this interactive essay explaining how different load balancing strategies work. It's part of a series that includes memory allocation, bloom filters and more.

Tags: algorithms, load-balancing, explorables, sam-rose

Bloom Filters, explained by Sam Rose

2024-02-23T15:59:33+00:00

Bloom Filters, explained by Sam Rose

Beautifully designed explanation of bloom filters, complete with interactive demos that illustrate exactly how they work.

Tags: bloom-filters, explorables, sam-rose