<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: tobias-lutke</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/tobias-lutke.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2026-03-13T03:44:34+00:00</updated><author><name>Simon Willison</name></author><entry><title>Shopify/liquid: Performance: 53% faster parse+render, 61% fewer allocations</title><link href="https://simonwillison.net/2026/Mar/13/liquid/#atom-tag" rel="alternate"/><published>2026-03-13T03:44:34+00:00</published><updated>2026-03-13T03:44:34+00:00</updated><id>https://simonwillison.net/2026/Mar/13/liquid/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/Shopify/liquid/pull/2056"&gt;Shopify/liquid: Performance: 53% faster parse+render, 61% fewer allocations&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;PR from Shopify CEO Tobias Lütke against Liquid, Shopify's open source Ruby template engine that was somewhat inspired by Django when Tobi first created it &lt;a href="https://simonwillison.net/2005/Nov/6/liquid/"&gt;back in 2005&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Tobi found dozens of new performance micro-optimizations using a variant of &lt;a href="https://github.com/karpathy/autoresearch"&gt;autoresearch&lt;/a&gt;, Andrej Karpathy's new system for having a coding agent run hundreds of semi-autonomous experiments to find new effective techniques for training &lt;a href="https://github.com/karpathy/nanochat"&gt;nanochat&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Tobi's implementation started two days ago with this &lt;a href="https://github.com/Shopify/liquid/blob/2543fdc1a101f555db208fb0deeb2e3bf1ae9e36/auto/autoresearch.md"&gt;autoresearch.md&lt;/a&gt; prompt file and an &lt;a href="https://github.com/Shopify/liquid/blob/2543fdc1a101f555db208fb0deeb2e3bf1ae9e36/auto/autoresearch.sh"&gt;autoresearch.sh&lt;/a&gt; script for the agent to run to execute the test suite and report on benchmark scores.&lt;/p&gt;
&lt;p&gt;The PR now lists &lt;a href="https://github.com/Shopify/liquid/pull/2056/commits"&gt;93 commits&lt;/a&gt; from around 120 automated experiments. The PR description lists what worked in detail - some examples:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Replaced StringScanner tokenizer with &lt;code&gt;String#byteindex&lt;/code&gt;.&lt;/strong&gt; Single-byte &lt;code&gt;byteindex&lt;/code&gt; searching is ~40% faster than regex-based &lt;code&gt;skip_until&lt;/code&gt;. This alone reduced parse time by ~12%.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pure-byte &lt;code&gt;parse_tag_token&lt;/code&gt;.&lt;/strong&gt; Eliminated the costly &lt;code&gt;StringScanner#string=&lt;/code&gt; reset that was called for every &lt;code&gt;{% %}&lt;/code&gt; token (878 times). Manual byte scanning for tag name + markup extraction is faster than resetting and re-scanning via StringScanner. [...]&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cached small integer &lt;code&gt;to_s&lt;/code&gt;.&lt;/strong&gt; Pre-computed frozen strings for 0-999 avoid 267 &lt;code&gt;Integer#to_s&lt;/code&gt; allocations per render.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
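&lt;p&gt;The small-integer cache is easy to picture outside Ruby too. Here's a minimal Python sketch of the same idea - the names are mine for illustration, not from the PR, and the real change lives in Liquid's Ruby rendering path:&lt;/p&gt;

```python
# Python transposition of the "cached small integer to_s" optimization:
# pre-compute the strings for 0-999 once, so rendering reuses one
# immutable object instead of allocating a new string per conversion.
SMALL_INT_STRINGS = tuple(str(i) for i in range(1000))

def int_to_str(n):
    # Cache hit: the exact same string object every time.
    if isinstance(n, int) and n in range(1000):
        return SMALL_INT_STRINGS[n]
    # Cache miss (negative, large, or non-int): allocate as usual.
    return str(n)
```

&lt;p&gt;The win is the same in either language: hot render loops hand back a pre-built immutable string instead of allocating a fresh one on every call.&lt;/p&gt;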
&lt;p&gt;This all added up to a 53% improvement on benchmarks - truly impressive for a codebase that's been tweaked by hundreds of contributors over 20 years.&lt;/p&gt;
&lt;p&gt;I think this illustrates a number of interesting ideas:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Having a robust test suite - in this case 974 unit tests - is a &lt;em&gt;massive unlock&lt;/em&gt; for working with coding agents. This kind of research effort would not be possible without first having a tried and tested suite of tests.&lt;/li&gt;
&lt;li&gt;The autoresearch pattern - where an agent brainstorms a multitude of potential improvements and then experiments with them one at a time - is really effective.&lt;/li&gt;
&lt;li&gt;If you provide an agent with a benchmarking script "make it faster" becomes an actionable goal.&lt;/li&gt;
&lt;li&gt;CEOs can code again! Tobi has always been more hands-on than most, but this is a much more significant contribution than anyone would expect from the leader of a company with 7,500+ employees. I've seen this pattern play out a lot over the past few months: coding agents make it feasible for people in high-interruption roles to productively work with code again.&lt;/li&gt;
&lt;/ul&gt;
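&lt;p&gt;Stripped of everything that makes it practical, the experiment loop at the heart of that pattern fits in a few lines of Python. The function names here are hypothetical - the real harness (autoresearch.sh plus the pi-autoresearch plugin) handles brainstorming, logging and JSONL state on top of this:&lt;/p&gt;

```python
def autoresearch(ideas, apply_idea, revert_idea, tests_pass, benchmark):
    # Try candidate ideas one at a time; keep a change only if the test
    # suite still passes AND the benchmark strictly improves.
    best = benchmark()   # baseline score; lower is better (e.g. seconds)
    kept = []
    for idea in ideas:
        apply_idea(idea)
        score = benchmark()
        # i.e. score is strictly lower than the best so far
        improved = min(score, best) == score and score != best
        if tests_pass() and improved:
            best = score          # accept: this is the new baseline to beat
            kept.append(idea)
        else:
            revert_idea(idea)     # reject: undo the change and move on
    return kept
```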
&lt;p&gt;Here's Tobi's &lt;a href="https://github.com/tobi"&gt;GitHub contribution graph&lt;/a&gt; for the past year, showing a significant uptick following that &lt;a href="https://simonwillison.net/tags/november-2025-inflection/"&gt;November 2025 inflection point&lt;/a&gt; when coding agents got really good.&lt;/p&gt;
&lt;p&gt;&lt;img alt="1,658 contributions in the last year - scattered lightly through Jun, Aug, Sep, Oct and Nov and then picking up significantly in Dec, Jan, and Feb." src="https://static.simonwillison.net/static/2026/tobi-contribs.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;He used &lt;a href="https://github.com/badlogic/pi-mono"&gt;Pi&lt;/a&gt; as the coding agent and released a new &lt;a href="https://github.com/davebcn87/pi-autoresearch"&gt;pi-autoresearch&lt;/a&gt; plugin, built in collaboration with David Cortés, which maintains state in an &lt;code&gt;autoresearch.jsonl&lt;/code&gt; file &lt;a href="https://github.com/Shopify/liquid/blob/3182b7c1b3758b0f5fe2d0fcc71a48bbcb11c946/autoresearch.jsonl"&gt;like this one&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://x.com/tobi/status/2032212531846971413"&gt;@tobi&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/django"&gt;django&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/performance"&gt;performance&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rails"&gt;rails&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ruby"&gt;ruby&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/andrej-karpathy"&gt;andrej-karpathy&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/agentic-engineering"&gt;agentic-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/november-2025-inflection"&gt;november-2025-inflection&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tobias-lutke"&gt;tobias-lutke&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/autoresearch"&gt;autoresearch&lt;/a&gt;&lt;/p&gt;



</summary><category term="django"/><category term="performance"/><category term="rails"/><category term="ruby"/><category term="ai"/><category term="andrej-karpathy"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="coding-agents"/><category term="agentic-engineering"/><category term="november-2025-inflection"/><category term="tobias-lutke"/><category term="autoresearch"/></entry><entry><title>Context engineering</title><link href="https://simonwillison.net/2025/Jun/27/context-engineering/#atom-tag" rel="alternate"/><published>2025-06-27T23:42:43+00:00</published><updated>2025-06-27T23:42:43+00:00</updated><id>https://simonwillison.net/2025/Jun/27/context-engineering/#atom-tag</id><summary type="html">
    &lt;p&gt;The term &lt;strong&gt;context engineering&lt;/strong&gt; has recently started to gain traction as a better alternative to prompt engineering. I like it. I think this one may have sticking power.&lt;/p&gt;
&lt;p&gt;Here's an example tweet &lt;a href="https://twitter.com/tobi/status/1935533422589399127"&gt;from Shopify CEO Tobi Lutke&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I really like the term “context engineering” over prompt engineering. &lt;/p&gt;
&lt;p&gt;It describes the core skill better: the art of providing all the context for the task to be plausibly solvable by the LLM.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Recently amplified &lt;a href="https://twitter.com/karpathy/status/1937902205765607626"&gt;by Andrej Karpathy&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;+1 for "context engineering" over "prompt engineering".&lt;/p&gt;
&lt;p&gt;People associate prompts with short task descriptions you'd give an LLM in your day-to-day use. When in every industrial-strength LLM app, context engineering is the delicate art and science of filling the context window with just the right information for the next step. Science because doing this right involves task descriptions and explanations, few shot examples, RAG, related (possibly multimodal) data, tools, state and history, compacting [...] Doing this well is highly non-trivial. And art because of the guiding intuition around LLM psychology of people spirits. [...]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I've &lt;a href="https://simonwillison.net/2023/Feb/21/in-defense-of-prompt-engineering/"&gt;spoken favorably of prompt engineering&lt;/a&gt; in the past - I hoped that term could capture the inherent complexity of constructing reliable prompts. Unfortunately, most people's inferred definition is that it's a laughably pretentious term for typing things into a chatbot! &lt;/p&gt;
&lt;p&gt;It turns out that inferred definitions are the ones that stick. I think the inferred definition of "context engineering" is likely to be much closer to the intended meaning.&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/definitions"&gt;definitions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/andrej-karpathy"&gt;andrej-karpathy&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-engineering"&gt;prompt-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/context-engineering"&gt;context-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tobias-lutke"&gt;tobias-lutke&lt;/a&gt;&lt;/p&gt;



</summary><category term="definitions"/><category term="ai"/><category term="andrej-karpathy"/><category term="prompt-engineering"/><category term="generative-ai"/><category term="llms"/><category term="context-engineering"/><category term="tobias-lutke"/></entry><entry><title>Quoting Tobias Lütke</title><link href="https://simonwillison.net/2025/Apr/7/tobias/#atom-tag" rel="alternate"/><published>2025-04-07T18:32:20+00:00</published><updated>2025-04-07T18:32:20+00:00</updated><id>https://simonwillison.net/2025/Apr/7/tobias/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://twitter.com/tobi/status/1909231499448401946"&gt;&lt;p&gt;&lt;strong&gt;Using Al effectively is now a fundamental expectation of everyone at Shopify&lt;/strong&gt;. It's a tool of all trades today, and will only grow in importance. Frankly, I don't think it's feasible to opt out of learning the skill of applying Al in your craft; you are welcome to try, but I want to be honest I cannot see this working out today, and definitely not tomorrow. Stagnation is almost certain, and stagnation is slow-motion failure. If you're not climbing, you're sliding [...]&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;We will add Al usage questions to our performance and peer review questionnaire&lt;/strong&gt;. Learning to use Al well is an unobvious skill. My sense is that a lot of people give up after writing a prompt and not getting the ideal thing back immediately. Learning to prompt and load context is important, and getting peers to provide feedback on how this is going will be valuable.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://twitter.com/tobi/status/1909231499448401946"&gt;Tobias Lütke&lt;/a&gt;, CEO of Shopify, self-leaked memo&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/careers"&gt;careers&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tobias-lutke"&gt;tobias-lutke&lt;/a&gt;&lt;/p&gt;



</summary><category term="careers"/><category term="ai"/><category term="ai-ethics"/><category term="tobias-lutke"/></entry><entry><title>Could you train a ChatGPT-beating model for $85,000 and run it in a browser?</title><link href="https://simonwillison.net/2023/Mar/17/beat-chatgpt-in-a-browser/#atom-tag" rel="alternate"/><published>2023-03-17T15:43:38+00:00</published><updated>2023-03-17T15:43:38+00:00</updated><id>https://simonwillison.net/2023/Mar/17/beat-chatgpt-in-a-browser/#atom-tag</id><summary type="html">
    &lt;p&gt;I think it's now possible to train a large language model with similar functionality to GPT-3 for $85,000. And I think we might soon be able to run the resulting model entirely in the browser, and give it capabilities that leapfrog it ahead of ChatGPT.&lt;/p&gt;
&lt;p&gt;This is currently wild speculation on my part, but bear with me because I think this is worth exploring further.&lt;/p&gt;
&lt;p&gt;Large language models with GPT-3-like capabilities cost millions of dollars to build, thanks to the cost of running the expensive GPU servers needed to train them. Whether you are renting or buying those machines, there are still enormous energy costs to cover.&lt;/p&gt;
&lt;p&gt;Just one example of this: the &lt;a href="https://huggingface.co/bigscience/bloom-7b1"&gt;BLOOM large language model&lt;/a&gt; was trained in France with the support of the French government. The cost was estimated at $2-5M; it took almost four months to train, and the project boasts about its low carbon footprint because most of the power came from a nuclear reactor!&lt;/p&gt;
&lt;p&gt;[ Fun fact: as of a few days ago you can now &lt;a href="https://github.com/NouamaneTazi/bloomz.cpp"&gt;run the openly licensed BLOOM on your own laptop&lt;/a&gt;, using Nouamane Tazi's adapted copy of the &lt;code&gt;llama.cpp&lt;/code&gt; code that made that possible for LLaMA ]&lt;/p&gt;
&lt;p&gt;Recent developments have made me suspect that these costs could be made dramatically lower. I think a capable language model can now be trained from scratch for around $85,000.&lt;/p&gt;
&lt;h4&gt;It's all about that LLaMA&lt;/h4&gt;
&lt;p&gt;The LLaMA plus Alpaca combination is the key here.&lt;/p&gt;
&lt;p&gt;I wrote about these two projects previously:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2023/Mar/11/llama/"&gt;Large language models are having their Stable Diffusion moment&lt;/a&gt; discusses the significance of LLaMA&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2023/Mar/13/alpaca/"&gt;Stanford Alpaca, and the acceleration of on-device large language model development&lt;/a&gt; describes Alpaca&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To recap: &lt;a href="https://research.facebook.com/publications/llama-open-and-efficient-foundation-language-models/"&gt;LLaMA&lt;/a&gt; by Meta research provided a GPT-3 class model trained entirely on documented, available public training information, as opposed to OpenAI's continuing practice of not revealing the sources of their training data.&lt;/p&gt;
&lt;p&gt;This makes the model training a whole lot more likely to be replicable by other teams.&lt;/p&gt;
&lt;p&gt;The paper also describes some enormous efficiency improvements they made to the training process.&lt;/p&gt;
&lt;p&gt;The LLaMA research was still extremely expensive though. From the paper:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;... we estimate that we used 2048 A100-80GB for a period of approximately 5 months to develop our models&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;My friends at &lt;a href="https://replicate.com/"&gt;Replicate&lt;/a&gt; told me that a simple rule of thumb for A100 cloud costs is $1/hour.&lt;/p&gt;
&lt;p&gt;2048 * 5 * 30 * 24 = $7,372,800&lt;/p&gt;
&lt;p&gt;But... that $7M was the cost to both iterate on the model and to train all four sizes of LLaMA that they tried: 7B, 13B, 33B, and 65B.&lt;/p&gt;
&lt;p&gt;Here's Table 15 from the paper, showing the cost of training each model.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2023/llama-table-15.jpg" alt="Table 15: Carbon footprint of training different models in the same data center. We follow Wu et al. (2022) to compute carbon emission of training OPT, BLOOM and our models in the same data center. For the power consumption of a A100-80GB, we take the thermal design power for NVLink systems, that is 400W. We take a PUE of 1.1 and a carbon intensity factor set at the national US average of 0.385 kg COze per KWh. Lists 6 models. OPT-175B: 809,472 GPU hours, 356 MWh, 137 tons CO2. BLOOM-175B: 1,082,880 GPU hours, 475 MWh, 183 tons. LLaMA-7B: 82,432 GPU hours, 36 MWh, 14 tons. LLaMA-13B: 135,168 GPU hours, 59 MWh, 23 tons. LLaMA-33B: 530,432 GPU hours, 233 MWh, 90 tons. LLaMA-65B: 1,022,362 GPU hours, 449 MWh, 173 tons." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;This shows that the smallest model, LLaMA-7B, took 82,432 hours of A100-80GB GPU time to train, consuming 36 MWh and generating 14 tons of CO2.&lt;/p&gt;
&lt;p&gt;(That's about 28 people flying from London to New York.)&lt;/p&gt;
&lt;p&gt;Going by the $1/hour rule of thumb, this means that provided you get everything right on your first run you can train a LLaMA-7B scale model for around $82,432.&lt;/p&gt;
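&lt;p&gt;The arithmetic behind both figures is simple enough to check directly:&lt;/p&gt;

```python
# Reproducing the cost estimates above from the $1/hour A100 rule of thumb.
A100_HOURLY_USD = 1.0  # rule-of-thumb cloud price for one A100 per hour

# Whole LLaMA project: 2048 GPUs running for roughly 5 months.
project_cost = 2048 * 5 * 30 * 24 * A100_HOURLY_USD
print(project_cost)  # 7372800.0 -- the ~$7M figure

# One clean LLaMA-7B training run, per Table 15: 82,432 GPU-hours.
llama_7b_cost = 82_432 * A100_HOURLY_USD
print(llama_7b_cost)  # 82432.0 -- just under the $85,000 budget
```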
&lt;h4&gt;Upgrading to Alpaca&lt;/h4&gt;
&lt;p&gt;You can run LLaMA 7B &lt;a href="https://til.simonwillison.net/llms/llama-7b-m2"&gt;on your own laptop&lt;/a&gt; (or even &lt;a href="https://twitter.com/ggerganov/status/1635605532726681600"&gt;on a phone&lt;/a&gt;), but you may find it hard to get good results out of it. That's because it hasn't been instruction tuned, so it's not great at answering the kind of prompts that you might send to ChatGPT or GPT-3 or 4.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://crfm.stanford.edu/2023/03/13/alpaca.html"&gt;Alpaca&lt;/a&gt; is the project from Stanford that fixes that. They fine-tuned LLaMA on 52,000 instructions (of &lt;a href="https://simonwillison.net/2023/Mar/13/alpaca/#bonus-training-data"&gt;somewhat dubious origin&lt;/a&gt;) and claim to have gotten ChatGPT-like performance as a result... from that smallest 7B LLaMA model!&lt;/p&gt;
&lt;p&gt;You can &lt;a href="https://crfm.stanford.edu/alpaca/"&gt;try out their demo&lt;/a&gt; (&lt;strong&gt;update:&lt;/strong&gt; no you can't, "Our live demo is suspended until further notice") and see for yourself that it really does capture at least some of that ChatGPT magic.&lt;/p&gt;
&lt;p&gt;The best bit? The Alpaca fine-tuning can be done for less than $100. The Replicate team have repeated the training process and &lt;a href="https://replicate.com/blog/replicate-alpaca"&gt;published a tutorial&lt;/a&gt; about how they did it.&lt;/p&gt;
&lt;p&gt;Other teams have also been able to replicate the Alpaca fine-tuning process, for example &lt;a href="https://github.com/antimatter15/alpaca.cpp"&gt;antimatter15/alpaca.cpp&lt;/a&gt; on GitHub.&lt;/p&gt;
&lt;p&gt;We are still within our $85,000 budget! And Alpaca - or an Alpaca-like model using different fine tuning data - is the ChatGPT on your own device model that we've all been hoping for.&lt;/p&gt;
&lt;h4&gt;Could we run it in a browser?&lt;/h4&gt;
&lt;p&gt;Alpaca is effectively the same size as LLaMA 7B - around 3.9GB (after 4-bit quantization à la &lt;a href="https://github.com/ggerganov/llama.cpp"&gt;llama.cpp&lt;/a&gt;). And LLaMA 7B has already been shown running on a whole bunch of different personal devices: laptops, Raspberry Pis (very slowly) and even a Pixel 5 phone at a decent speed!&lt;/p&gt;
&lt;p&gt;The next frontier: running it in the browser.&lt;/p&gt;
&lt;p&gt;I saw two tech demos yesterday that made me think this may be possible in the near future.&lt;/p&gt;
&lt;p&gt;The first is &lt;a href="https://github.com/xenova/transformers.js"&gt;Transformers.js&lt;/a&gt;. This is a WebAssembly port of the Hugging Face &lt;a href="https://huggingface.co/docs/transformers/index"&gt;Transformers&lt;/a&gt; library of models - previously only available for server-side Python.&lt;/p&gt;
&lt;p&gt;It's worth spending some time with &lt;a href="https://xenova.github.io/transformers.js/"&gt;their demos&lt;/a&gt;, which include some smaller language models and some very impressive image analysis models too.&lt;/p&gt;
&lt;p&gt;The second is &lt;a href="https://github.com/mlc-ai/web-stable-diffusion"&gt;Web Stable Diffusion&lt;/a&gt;. This team managed to get the Stable Diffusion generative image model running entirely in the browser as well!&lt;/p&gt;
&lt;p&gt;Web Stable Diffusion uses WebGPU, a still emerging standard that's currently only working in Chrome Canary. But it does work! It rendered me this image of two raccoons eating a pie in the forest in 38 seconds.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2023/web-stable-diffusion-raccoons.jpg" alt="mig.ai/web-stable-diffusion/ in a browser. The input prompt is two racoons eating a pie in the woods, with the default 20 step scheduler. After 38 seconds elapsed on the prograss bar a realistic photograph of two raccoons eating a fruit pie appears - although on closer inspection the raccoon holding the pie has three paws!" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;The Stable Diffusion model this loads into the browser is around 1.9GB.&lt;/p&gt;
&lt;p&gt;LLaMA/Alpaca at 4bit quantization is 3.9GB.&lt;/p&gt;
&lt;p&gt;The sizes of these two models are similar enough that I would not be at all surprised to see an Alpaca-like model running in the browser in the not-too-distant future. I wouldn't be surprised if someone is working on that right now.&lt;/p&gt;
&lt;h4 id="react-pattern"&gt;Now give it extra abilities with ReAct&lt;/h4&gt;
&lt;p&gt;A model running in your browser that behaved like a less capable version of ChatGPT would be pretty impressive. But what if it could be MORE capable than ChatGPT?&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://react-lm.github.io/"&gt;ReAct prompt pattern&lt;/a&gt; is a simple, proven way of expanding a language model's abilities by giving it access to extra tools.&lt;/p&gt;
&lt;p&gt;Matt Webb explains the significance of the pattern in &lt;a href="https://interconnected.org/home/2023/03/16/singularity"&gt;The surprising ease and effectiveness of AI in a loop&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I got it working with a few dozen lines of Python myself, which I described in &lt;a href="https://til.simonwillison.net/llms/python-react-pattern"&gt;A simple Python implementation of the ReAct pattern for LLMs&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here's the short version: you tell the model that it must think out loud and now has access to tools. It can then work through a question like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Question:&lt;/strong&gt; Population of Paris, squared?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Thought:&lt;/strong&gt; I should look up the population of paris and then multiply it&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; search_wikipedia: Paris&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Then it stops. Your code harness for the model reads that last line, sees the action and goes and executes an API call against Wikipedia. It continues the dialog with the model like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Observation:&lt;/strong&gt; &amp;lt;truncated content from the Wikipedia page, including the 2,248,780 population figure&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The model continues:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Thought:&lt;/strong&gt; Paris population is 2,248,780 I should square that&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; calculator: 2248780 ** 2&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Control is handed back to the harness, which passes that to a calculator and returns:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Observation:&lt;/strong&gt; 5057011488400&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The model then provides the answer:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Answer:&lt;/strong&gt; The population of Paris squared is 5,057,011,488,400&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Adding new actions to this system is trivial: each one can be a few lines of code.&lt;/p&gt;
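&lt;p&gt;A toy version of that harness loop fits in a couple of dozen lines of Python. This is a fresh sketch, not the implementation from my TIL post - the "model" is scripted so the example runs standalone, and only the tool names come from the transcript above:&lt;/p&gt;

```python
def search_wikipedia(query):
    # Stand-in for a real Wikipedia API call.
    return "Paris is the capital of France. Population: 2,248,780 ..."

def calculator(expression):
    # eval() is acceptable in a sketch; a real harness must sandbox this.
    return str(eval(expression, {"__builtins__": {}}))

TOOLS = {"search_wikipedia": search_wikipedia, "calculator": calculator}

def react_loop(model, question):
    transcript = "Question: " + question
    while True:
        turn = model(transcript)  # model emits Thought + Action, or an Answer
        transcript += "\n" + turn
        if turn.startswith("Answer:"):
            return turn
        # Find the Action line, run the named tool, feed back an Observation.
        action = [line for line in turn.splitlines()
                  if line.startswith("Action:")][-1]
        name, arg = action[len("Action: "):].split(": ", 1)
        transcript += "\nObservation: " + TOOLS[name](arg)
```

&lt;p&gt;Swap the scripted model for a real LLM call and the loop is the whole pattern: the model only ever sees text, and the harness does the actual work.&lt;/p&gt;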
&lt;p&gt;But as &lt;a href="https://react-lm.github.io/"&gt;the ReAct paper&lt;/a&gt; demonstrates, adding these capabilities to even an under-powered model (such as LLaMA 7B) can dramatically improve its abilities, at least according to several common language model benchmarks.&lt;/p&gt;
&lt;p&gt;This is essentially what Bing is! It's GPT-4 with the added ability to run searches against the Bing search index.&lt;/p&gt;
&lt;p&gt;Obviously if you're going to give a language model the ability to execute API calls and evaluate code you need to do it in a safe environment! Like for example... a web browser, which runs code from untrusted sources as a matter of habit and has the most thoroughly tested sandbox mechanism of any piece of software we've ever created.&lt;/p&gt;
&lt;h4 id="llm-conclusion"&gt;Adding it all together&lt;/h4&gt;
&lt;p&gt;There are a lot more groups out there that can afford to spend $85,000 training a model than there are that can spend $2M or more.&lt;/p&gt;
&lt;p&gt;I think LLaMA and Alpaca are going to have a lot of competition soon, from an increasing pool of openly licensed models.&lt;/p&gt;
&lt;p&gt;A fine-tuned LLaMA scale model is leaning in the direction of a ChatGPT competitor already. But... if you hook in some extra capabilities as seen in ReAct and Bing even that little model should be able to way outperform ChatGPT in terms of actual ability to solve problems and do interesting things.&lt;/p&gt;
&lt;p&gt;And we might be able to run such a thing on our phones... or even in our web browsers... sooner than you think.&lt;/p&gt;
&lt;h4 id="llm-cheaper"&gt;And it's only going to get cheaper&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://twitter.com/tobi/status/1636810016140271616"&gt;Tobias Lütke on Twitter:&lt;/a&gt;&lt;/p&gt;

&lt;blockquote class="twitter-tweet" data-conversation="none"&gt;&lt;p lang="en" dir="ltr"&gt;H100s are shipping and you can half this again. Twice (or more) if fp8 works.&lt;/p&gt;- tobi lutke (@tobi) &lt;a href="https://twitter.com/tobi/status/1636810016140271616?ref_src=twsrc%5Etfw"&gt;March 17, 2023&lt;/a&gt;&lt;/blockquote&gt;

&lt;p&gt;The &lt;a href="https://www.nvidia.com/en-us/data-center/h100/"&gt;H100&lt;/a&gt; is the new Tensor Core GPU from NVIDIA, which they claim can offer up to a 30x performance improvement over their current A100s.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/webassembly"&gt;webassembly&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatgpt"&gt;chatgpt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/bloom"&gt;bloom&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlc"&gt;mlc&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/transformers-js"&gt;transformers-js&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-tool-use"&gt;llm-tool-use&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama-cpp"&gt;llama-cpp&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tobias-lutke"&gt;tobias-lutke&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="webassembly"/><category term="generative-ai"/><category term="chatgpt"/><category term="llama"/><category term="local-llms"/><category term="llms"/><category term="bloom"/><category term="mlc"/><category term="transformers-js"/><category term="llm-tool-use"/><category term="llama-cpp"/><category term="tobias-lutke"/></entry><entry><title>Quoting Tobi Lutke</title><link href="https://simonwillison.net/2019/Dec/26/tobi-lutke/#atom-tag" rel="alternate"/><published>2019-12-26T19:06:35+00:00</published><updated>2019-12-26T19:06:35+00:00</updated><id>https://simonwillison.net/2019/Dec/26/tobi-lutke/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://twitter.com/tobi/status/1210242188870930433"&gt;&lt;p&gt;For creative work, you can't cheat. My believe is that there are 5 creative hours in everyone's day. All I ask of people at Shopify is that 4 of those are channeled into the company.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://twitter.com/tobi/status/1210242188870930433"&gt;Tobi Lutke&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/productivity"&gt;productivity&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/management"&gt;management&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tobias-lutke"&gt;tobias-lutke&lt;/a&gt;&lt;/p&gt;



</summary><category term="productivity"/><category term="management"/><category term="tobias-lutke"/></entry></feed>