<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: sam-rose</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/sam-rose.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2026-03-26T16:21:09+00:00</updated><author><name>Simon Willison</name></author><entry><title>Quantization from the ground up</title><link href="https://simonwillison.net/2026/Mar/26/quantization-from-the-ground-up/#atom-tag" rel="alternate"/><published>2026-03-26T16:21:09+00:00</published><updated>2026-03-26T16:21:09+00:00</updated><id>https://simonwillison.net/2026/Mar/26/quantization-from-the-ground-up/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://ngrok.com/blog/quantization"&gt;Quantization from the ground up&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Sam Rose continues &lt;a href="https://simonwillison.net/tags/sam-rose/"&gt;his streak&lt;/a&gt; of publishing spectacularly informative interactive essays, this time explaining how quantization of Large Language Models works (which he says might be "&lt;a href="https://twitter.com/samwhoo/status/2036845101561835968"&gt;the best post I've ever made&lt;/a&gt;".)&lt;/p&gt;
&lt;p&gt;Also included is the best visual explanation I've ever seen of how floating point numbers are represented using binary digits.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of an interactive float32 binary representation tool showing the value -48.92364502, with color-coded bit fields labeled S (sign), EXPONENT (blue), and SIGNIFICAND (pink), displaying the 32-bit pattern 11000010010000111101100001110100000, and a slider control at the bottom along with minus, plus, and reset buttons." src="https://static.simonwillison.net/static/2026/float.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;I hadn't heard about &lt;strong&gt;outlier values&lt;/strong&gt; in quantization - rare float values that exist outside of the normal tiny-value distribution - but apparently they're very important:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Why do these outliers exist? [...] tl;dr: no one conclusively knows, but a small fraction of these outliers are &lt;em&gt;very&lt;/em&gt; important to model quality. Removing even a &lt;em&gt;single&lt;/em&gt; "super weight," as Apple calls them, can cause the model to output complete gibberish.&lt;/p&gt;
&lt;p&gt;Given their importance, real-world quantization schemes sometimes do extra work to preserve these outliers. They might do this by not quantizing them at all, or by saving their location and value into a separate table, then removing them so that their block isn't destroyed.&lt;/p&gt;
&lt;/blockquote&gt;
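&lt;p&gt;Here's a rough Python sketch of why that side table helps. It applies simple absmax quantization to one block of weights, storing them as signed 4-bit integers. It also uses a made-up outlier rule: any weight with a magnitude above a fixed &lt;code&gt;outlier_threshold&lt;/code&gt; counts as an outlier. Real schemes, such as the llama.cpp formats, work differently, and every name here is hypothetical.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class QuantizedBlock:
    scale: float
    values: list[int]  # signed 4-bit integers in the range -7..7
    outliers: dict[int, float] = field(default_factory=dict)  # index -> original weight


def quantize_block(weights: list[float], outlier_threshold: float = 6.0) -> QuantizedBlock:
    """Absmax-quantize one block to 4 bits, setting outliers aside (illustrative only)."""
    # Made-up rule: any weight whose magnitude exceeds the threshold counts as an outlier
    outliers = {i: w for i, w in enumerate(weights) if abs(w) > outlier_threshold}
    # Zero the outliers so they don't inflate the scale for everything else
    inliers = [0.0 if i in outliers else w for i, w in enumerate(weights)]
    absmax = max(abs(w) for w in inliers) or 1.0  # guard against an all-zero block
    scale = absmax / 7
    values = [round(w / scale) for w in inliers]
    return QuantizedBlock(scale, values, outliers)


def dequantize_block(block: QuantizedBlock) -> list[float]:
    restored = [q * block.scale for q in block.values]
    for i, w in block.outliers.items():
        restored[i] = w  # outliers come back at full precision
    return restored


weights = [0.12, -0.05, 0.31, 9.5, -0.22, 0.07]
with_table = quantize_block(weights)
print(with_table.values, with_table.outliers)
print(dequantize_block(with_table))

# No outlier handling: the single large weight now sets the scale for the whole block
without_table = quantize_block(weights, outlier_threshold=float("inf"))
print(without_table.values)
```

&lt;p&gt;Without the side table, the 9.5 sets the scale for the whole block and every other weight rounds to zero. That is the destroyed block the quote describes.&lt;/p&gt;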
&lt;p&gt;Plus there's a section on &lt;a href="https://ngrok.com/blog/quantization#how-much-does-quantization-affect-model-accuracy"&gt;How much does quantization affect model accuracy?&lt;/a&gt; Sam explains the concepts of &lt;strong&gt;perplexity&lt;/strong&gt; and &lt;strong&gt;KL divergence&lt;/strong&gt; and then uses the &lt;a href="https://github.com/ggml-org/llama.cpp/tree/master/tools/perplexity"&gt;llama.cpp perplexity tool&lt;/a&gt; and a run of the GPQA benchmark to show how different quantization levels affect Qwen 3.5 9B.&lt;/p&gt;
&lt;p&gt;His conclusion:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It looks like 16-bit to 8-bit carries almost no quality penalty. 16-bit to 4-bit is more noticeable, but it's certainly not a quarter as good as the original. Closer to 90%, depending on how you want to measure it.&lt;/p&gt;
&lt;/blockquote&gt;
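&lt;p&gt;Both of the measures Sam uses are short formulas. Perplexity is the exponential of the average negative log-probability a model assigns to the tokens that actually came next, and lower is better. KL divergence measures how far the quantized model's next-token probability distribution drifts from the original model's, and zero means they are identical. Here's a minimal Python sketch. The probabilities are made-up toy numbers, not results from Sam's post. Real tools, such as the llama.cpp one, compute these measures over large amounts of real text.&lt;/p&gt;

```python
import math


def perplexity(token_probs: list[float]) -> float:
    """exp of the mean negative log-probability assigned to each actual next token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(P || Q) in nats: how far distribution q has drifted from reference p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy numbers: the probability a model gave to the correct next token at three positions
print(perplexity([0.5, 0.25, 0.8]))

# Toy next-token distributions over a 3-token vocabulary at a single position
p_original = [0.7, 0.2, 0.1]
p_quantized = [0.6, 0.25, 0.15]
print(kl_divergence(p_original, p_quantized))
```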


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/computer-science"&gt;computer-science&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/explorables"&gt;explorables&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sam-rose"&gt;sam-rose&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;&lt;/p&gt;



</summary><category term="computer-science"/><category term="ai"/><category term="explorables"/><category term="generative-ai"/><category term="llms"/><category term="sam-rose"/><category term="qwen"/></entry><entry><title>Sam Rose explains how LLMs work with a visual essay</title><link href="https://simonwillison.net/2025/Dec/19/sam-rose-llms/#atom-tag" rel="alternate"/><published>2025-12-19T18:33:41+00:00</published><updated>2025-12-19T18:33:41+00:00</updated><id>https://simonwillison.net/2025/Dec/19/sam-rose-llms/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://ngrok.com/blog/prompt-caching/"&gt;Sam Rose explains how LLMs work with a visual essay&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Sam Rose is one of my favorite authors of &lt;a href="https://simonwillison.net/tags/explorables/"&gt;explorable interactive explanations&lt;/a&gt; - here's &lt;a href="https://samwho.dev/"&gt;his previous collection&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Sam joined ngrok in September as a developer educator. Here's his first big visual explainer for them, ostensibly about how prompt caching works, but it quickly expands to cover tokenization, embeddings, and the basics of the transformer architecture.&lt;/p&gt;
&lt;p&gt;The result is one of the clearest and most accessible introductions to LLM internals I've seen anywhere.&lt;/p&gt;
&lt;div style="text-align: center"&gt;&lt;img alt="Animation. Starts in tokens mode with an array of 75, 305, 24, 887 - clicking embeddings animates those into a 2D array showing each one to be composed of three floating point numbers." src="https://static.simonwillison.net/static/2025/tokens-embeddings.gif" style="max-width: 100%"&gt;&lt;/div&gt;
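&lt;p&gt;That animation shows an embedding lookup: each token ID selects one row from a table of floating point numbers. Here's a toy Python sketch of the idea. The random table, the vocabulary size of 1,000 and the three dimensions per token (matching the animation) are all assumptions for illustration. Trained models learn these values, have much larger vocabularies and use hundreds or thousands of dimensions.&lt;/p&gt;

```python
import random

VOCAB_SIZE = 1000  # hypothetical; must exceed the largest token ID used below
EMBED_DIM = 3      # matches the animation

rng = random.Random(0)  # seeded so the toy "embeddings" are repeatable
# One row of EMBED_DIM floats per token ID; a real model learns these during training
embedding_table = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(VOCAB_SIZE)]


def embed(token_ids: list[int]) -> list[list[float]]:
    """Turn a sequence of token IDs into a len(token_ids) x EMBED_DIM array."""
    return [embedding_table[token_id] for token_id in token_ids]


tokens = [75, 305, 24, 887]
for token_id, vector in zip(tokens, embed(tokens)):
    print(token_id, [round(x, 3) for x in vector])
```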


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/explorables"&gt;explorables&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sam-rose"&gt;sam-rose&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tokenization"&gt;tokenization&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="explorables"/><category term="generative-ai"/><category term="llms"/><category term="sam-rose"/><category term="tokenization"/></entry><entry><title>Reservoir Sampling</title><link href="https://simonwillison.net/2025/May/8/reservoir-sampling/#atom-tag" rel="alternate"/><published>2025-05-08T21:00:22+00:00</published><updated>2025-05-08T21:00:22+00:00</updated><id>https://simonwillison.net/2025/May/8/reservoir-sampling/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://samwho.dev/reservoir-sampling/"&gt;Reservoir Sampling&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Yet another outstanding interactive essay by Sam Rose (&lt;a href="https://simonwillison.net/tags/sam-rose/"&gt;previously&lt;/a&gt;), this time explaining how reservoir sampling can be used to select a "fair" random sample when you don't know how many options there are and don't want to accumulate them before making a selection.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Reservoir sampling is one of my favourite algorithms, and I've been wanting to write about it for years now. It allows you to solve a problem that at first seems impossible, in a way that is both elegant and efficient.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I appreciate that Sam starts the article with "No math notation, I promise." Lots of delightful widgets to interact with here, all of which help build an intuitive understanding of the underlying algorithm.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Animated demo. As a slider moves from left to right the probability of cards drawn from a deck is simulated. Text at the bottom reads Anything older than 15 cards ago is has a less than 0.01% chance of being held when I stop." src="https://static.simonwillison.net/static/2025/sam-rose-cards.gif" /&gt;&lt;/p&gt;
&lt;p&gt;Sam shows how this algorithm can be applied to the real-world problem of sampling log files when incoming logs threaten to overwhelm a log aggregator.&lt;/p&gt;
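&lt;p&gt;The core of the algorithm is only a few lines. Here's a Python sketch of the classic version, often called "Algorithm R", which keeps a sample of &lt;code&gt;k&lt;/code&gt; items. It is applied here to a made-up stream of log lines and is not Sam's implementation. The n-th item seen replaces a random reservoir slot with probability k/n, which keeps every item equally likely to end up in the final sample.&lt;/p&gt;

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, rng: random.Random) -> list[T]:
    """Keep a uniform random sample of k items from a stream of unknown length."""
    reservoir: list[T] = []
    for i, item in enumerate(stream):
        if k > i:
            # Fill the reservoir with the first k items
            reservoir.append(item)
        else:
            # Item number i + 1 replaces a random slot with probability k / (i + 1)
            j = rng.randrange(i + 1)
            if k > j:
                reservoir[j] = item
    return reservoir


def log_lines(count: int) -> Iterator[str]:
    """Hypothetical stand-in for an incoming stream of log lines."""
    for n in range(count):
        yield f"request {n} handled"


print(reservoir_sample(log_lines(10_000), k=5, rng=random.Random(42)))
```

&lt;p&gt;Memory use stays at &lt;code&gt;k&lt;/code&gt; items however long the stream gets, which is what makes the technique a good fit for sampling logs.&lt;/p&gt;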
&lt;p&gt;The dog illustration is &lt;a href="https://samwho.dev/dogs/"&gt;commissioned art&lt;/a&gt; and the MIT-licensed code is &lt;a href="https://github.com/samwho/visualisations/tree/main/reservoir-sampling"&gt;available on GitHub&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=43928315"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/algorithms"&gt;algorithms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/logging"&gt;logging&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rate-limiting"&gt;rate-limiting&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/explorables"&gt;explorables&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sam-rose"&gt;sam-rose&lt;/a&gt;&lt;/p&gt;



</summary><category term="algorithms"/><category term="logging"/><category term="rate-limiting"/><category term="explorables"/><category term="sam-rose"/></entry><entry><title>Load Balancing</title><link href="https://simonwillison.net/2024/Jul/13/load-balancing/#atom-tag" rel="alternate"/><published>2024-07-13T22:51:45+00:00</published><updated>2024-07-13T22:51:45+00:00</updated><id>https://simonwillison.net/2024/Jul/13/load-balancing/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://samwho.dev/load-balancing/"&gt;Load Balancing&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Sam Rose built this interactive essay explaining how different load balancing strategies work. It's part of &lt;a href="https://samwho.dev/"&gt;a series&lt;/a&gt; that includes &lt;a href="https://samwho.dev/memory-allocation/"&gt;memory allocation&lt;/a&gt;, &lt;a href="https://samwho.dev/bloom-filters/"&gt;bloom filters&lt;/a&gt; and more.&lt;/p&gt;
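&lt;p&gt;To illustrate the kind of strategies being compared, here's a minimal Python sketch of two common ones. Round robin hands requests to servers in strict rotation. Least connections sends each request to the server with the fewest requests currently in flight. The class and server names are hypothetical, and this isn't code from Sam's essay.&lt;/p&gt;

```python
import itertools


class RoundRobin:
    """Hand out servers in a fixed rotation, ignoring how busy each one is."""

    def __init__(self, servers: list[str]) -> None:
        self._rotation = itertools.cycle(servers)

    def pick(self) -> str:
        return next(self._rotation)


class LeastConnections:
    """Pick the server with the fewest in-flight requests (ties go to the earliest listed)."""

    def __init__(self, servers: list[str]) -> None:
        self.in_flight = {server: 0 for server in servers}

    def pick(self) -> str:
        server = min(self.in_flight, key=self.in_flight.get)
        self.in_flight[server] += 1
        return server

    def finish(self, server: str) -> None:
        self.in_flight[server] -= 1


rr = RoundRobin(["a", "b", "c"])
print([rr.pick() for _ in range(5)])  # ['a', 'b', 'c', 'a', 'b']

lc = LeastConnections(["a", "b", "c"])
first, second = lc.pick(), lc.pick()  # 'a', then 'b'
lc.finish(first)                      # 'a' completes its request quickly
print(lc.pick())                      # 'a': tied with 'c' at zero, and listed first
```

&lt;p&gt;Round robin needs no knowledge of server load, but it keeps sending work to a server that is stuck on slow requests. Least connections adapts to that, at the cost of tracking in-flight counts for every server.&lt;/p&gt;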


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/algorithms"&gt;algorithms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/load-balancing"&gt;load-balancing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/explorables"&gt;explorables&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sam-rose"&gt;sam-rose&lt;/a&gt;&lt;/p&gt;



</summary><category term="algorithms"/><category term="load-balancing"/><category term="explorables"/><category term="sam-rose"/></entry><entry><title>Bloom Filters, explained by Sam Rose</title><link href="https://simonwillison.net/2024/Feb/23/bloom-filters-explained-by-sam-rose/#atom-tag" rel="alternate"/><published>2024-02-23T15:59:33+00:00</published><updated>2024-02-23T15:59:33+00:00</updated><id>https://simonwillison.net/2024/Feb/23/bloom-filters-explained-by-sam-rose/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://samwho.dev/bloom-filters/"&gt;Bloom Filters, explained by Sam Rose&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Beautifully designed explanation of bloom filters, complete with interactive demos that illustrate exactly how they work.&lt;/p&gt;
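&lt;p&gt;The core idea fits in a few lines of Python. This sketch is not Sam's code, and the sizes are arbitrary. To get several hash functions, it prefixes the item with an index before hashing it with SHA-256, which is a simplifying assumption rather than a standard recipe. Lookups can return false positives but never false negatives.&lt;/p&gt;

```python
import hashlib


class BloomFilter:
    """A set-like structure that may report false positives but never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit: readable rather than compact

    def _positions(self, item: str) -> list[int]:
        # Derive num_hashes bit positions by salting SHA-256 with the hash index
        return [
            int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest()[:8], "big")
            % self.num_bits
            for i in range(self.num_hashes)
        ]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # Any unset bit means the item was definitely never added
        return all(self.bits[pos] for pos in self._positions(item))


seen = BloomFilter()
for word in ["apple", "banana"]:
    seen.add(word)
print(seen.might_contain("apple"))   # True
print(seen.might_contain("cherry"))  # almost certainly False, but a false positive is possible
```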


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/bloom-filters"&gt;bloom-filters&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/explorables"&gt;explorables&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sam-rose"&gt;sam-rose&lt;/a&gt;&lt;/p&gt;



</summary><category term="bloom-filters"/><category term="explorables"/><category term="sam-rose"/></entry></feed>