Simon Willison's Weblog: rodney

PCGamer Article Performance Audit

2026-03-22T22:49:00+00:00

Research: PCGamer Article Performance Audit

Stuart Breckenridge pointed out that PC Gamer Recommends RSS Readers in a 37MB Article That Just Keeps Downloading, highlighting a truly horrifying example of web bloat that added up to 100s more MBs thanks to auto-playing video ads. I decided to have Claude Code for web use Rodney to investigate the page - prompt here.

Tags: web-performance, rodney

Agentic manual testing

2026-03-06T05:43:54+00:00

The defining characteristic of a coding agent is that it can execute the code that it writes. This is what makes coding agents so much more useful than LLMs that simply spit out code without any way to verify it.

Never assume that code generated by an LLM works until that code has been executed.

Coding agents have the ability to confirm that the code they have produced works as intended, or iterate further on that code until it does.

Getting agents to write unit tests, especially using test-first TDD, is a powerful way to ensure they have exercised the code they are writing.

That's not the only worthwhile approach, though.

Just because code passes tests doesn't mean it works as intended. Anyone who's worked with automated tests will have seen cases where the tests all pass but the code itself fails in some obvious way - it might crash the server on startup, fail to display a crucial UI element, or miss some detail that the tests failed to cover.

Automated tests are no replacement for manual testing. I like to see a feature working with my own eyes before I land it in a release.

I've found that getting agents to manually test code is valuable as well, frequently revealing issues that weren't spotted by the automated tests.

Mechanisms for agentic manual testing

How an agent should "manually" test a piece of code varies depending on what that code is.

For Python libraries a useful pattern is python -c "... code ...". You can pass a string (or multiline string) of Python code directly to the Python interpreter, including code that imports other modules.
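As a minimal sketch of that pattern: the snippets below are illustrative only, and python3 is assumed to be the interpreter name on your system (substitute python if that is what you have). The first is a one-liner, the second passes a multiline string:

```shell
# One-liner: import a standard library module and call it directly.
python3 -c "import json; print(json.dumps({'ok': True}))"

# Multiline: the quoted string can span several lines of Python.
python3 -c '
import math
values = [2, 3, 4]
print(math.prod(values))
'
```

Single quotes around the multiline version stop the shell from expanding anything inside the Python code.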

The coding agents are all familiar with this trick and will sometimes use it without prompting. Reminding them to test using python -c can often be effective though:

Other languages may have similar mechanisms, and if they don't it's still quick for an agent to write out a demo file and then compile and run it. I sometimes encourage it to use /tmp purely to avoid those files being accidentally committed to the repository later on.

Many of my projects involve building web applications with JSON APIs. For these I tell the agent to exercise them using curl:

Telling an agent to "explore" often results in it trying out a bunch of different aspects of a new API, which can quickly cover a whole lot of ground.

If an agent finds something that doesn't work through their manual testing, I like to tell them to fix it with red/green TDD. This ensures the new case ends up covered by the permanent automated tests.

Using browser automation for web UIs

Having a manual testing procedure in place becomes even more valuable if a project involves an interactive web UI.

Historically these have been difficult to test from code, but the past decade has seen notable improvements in systems for automating real web browsers. Running a real Chrome or Firefox or Safari browser against an application can uncover all sorts of interesting problems in a realistic setting.

Coding agents know how to use these tools extremely well.

The most powerful of these today is Playwright, an open source library developed by Microsoft. Playwright offers a full-featured API with bindings in multiple popular programming languages and can automate any of the popular browser engines.

Simply telling your agent to "test that with Playwright" may be enough. The agent can then select the language binding that makes the most sense, or use Playwright's playwright-cli tool.

Coding agents work really well with dedicated CLIs. agent-browser by Vercel is a comprehensive CLI wrapper around Playwright specially designed for coding agents to use.

My own project Rodney serves a similar purpose, albeit using the Chrome DevTools Protocol to directly control an instance of Chrome.

Here's an example prompt I use to test things with Rodney:

There are three tricks in this prompt:

- Saying "use uvx rodney --help" causes the agent to run rodney --help via the uvx package management tool, which automatically installs Rodney the first time it is called.
- The rodney --help command is specifically designed to give agents everything they need to know to both understand and use the tool. Here's that help text.
- Saying "look at screenshots" hints to the agent that it should use the rodney screenshot command and reminds it that it can use its own vision abilities against the resulting image files to evaluate the visual appearance of the page.

That's a whole lot of manual testing baked into a short prompt!

Rodney and tools like it offer a wide array of capabilities, from running JavaScript on the loaded site to scrolling, clicking, typing, and even reading the accessibility tree of the page.

As with other forms of manual tests, issues found and fixed via browser automation can then be added to permanent automated tests as well.

Many developers have avoided writing too many automated browser tests in the past due to their reputation for flakiness - the smallest tweak to the HTML of a page can result in frustrating waves of test breaks.

Having coding agents maintain those tests over time greatly reduces the friction involved in keeping them up-to-date in the face of design changes to the web interfaces.

Have them take notes with Showboat

Having agents manually test code can catch extra problems, but it can also be used to create artifacts that can help document the code and demonstrate how it has been tested.

I'm fascinated by the challenge of having agents show their work. Being able to see demos or documented experiments is a really useful way of confirming that the agent has comprehensively solved the challenge it was given.

I built Showboat to facilitate building documents that capture the agentic manual testing flow.

Here's a prompt I frequently use:

As with Rodney above, the showboat --help command teaches the agent what Showboat is and how to use it. Here's that help text in full.

The three key Showboat commands are note, exec, and image.

note appends a Markdown note to the Showboat document. exec records a command, then runs that command and records its output. image adds an image to the document - useful for screenshots of web applications taken using Rodney.

The exec command is the most important of these, because it captures a command along with the resulting output. This shows you what the agent did and what the result was, and is designed to discourage the agent from cheating and writing what it hoped had happened into the document.
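To illustrate why recording the command together with its real output matters, here is a rough shell sketch of the idea. This is not Showboat's implementation: record_exec is a hypothetical helper, and the indented-block layout it writes is an assumption made for the example.

```shell
# Hypothetical helper (not part of Showboat): run a command and append
# both the command line and whatever it really printed to a document.
record_exec() {
    local doc="$1"
    shift
    local output
    local status=0
    # Capture stdout and stderr; remember the exit status without aborting.
    output=$("$@" 2>&1) || status=$?
    {
        printf '    $ %s\n' "$*"
        printf '%s\n' "$output" | sed 's/^/    /'
        printf '\n(exit status: %s)\n\n' "$status"
    } >> "$doc"
    return "$status"
}

doc=$(mktemp)
record_exec "$doc" echo "hello from the agent"
cat "$doc"
```

Because the output is captured from the command as it runs, the document reflects what happened rather than what the agent expected to happen.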

I've been finding the Showboat pattern to work really well for documenting the work that has been achieved during my agent sessions. I'm hoping to see similar patterns adopted across a wider set of tools.

Tags: playwright, testing, agentic-engineering, ai, llms, coding-agents, ai-assisted-programming, rodney, showboat

Rodney v0.4.0

2026-02-17T23:02:33+00:00

My Rodney CLI tool for browser automation attracted quite the flurry of PRs since I announced it last week. Here are the release notes for the just-released v0.4.0:

Errors now use exit code 2, which means exit code 1 is just for check failures. #15

New rodney assert command for running JavaScript tests, exit code 1 if they fail. #19

New directory-scoped sessions with --local/--global flags. #14

New reload --hard and clear-cache commands. #17

New rodney start --show option to make the browser window visible. Thanks, Antonio Cuni. #13

New rodney connect PORT command to debug an already-running Chrome instance. Thanks, Peter Fraenkel. #12

New RODNEY_HOME environment variable to support custom state directories. Thanks, Senko Rašić. #11

New --insecure flag to ignore certificate errors. Thanks, Jakub Zgoliński. #10

Windows support: avoid Setsid on Windows via build-tag helpers. Thanks, adm1neca. #18

Tests now run on windows-latest and macos-latest in addition to Linux.

I've been using Showboat to create demos of new features - here are the ones for rodney assert, rodney reload --hard, rodney exit codes, and rodney start --local.

The rodney assert command is pretty neat: you can now use Rodney to test a web app through multiple steps in a shell script that looks something like this (adapted from the README):

#!/bin/bash
set -euo pipefail

FAIL=0

check() {
    if ! "$@"; then
        echo "FAIL: $*"
        FAIL=1
    fi
}

rodney start
rodney open "https://example.com"
rodney waitstable

# Assert elements exist
check rodney exists "h1"

# Assert key elements are visible
check rodney visible "h1"
check rodney visible "#main-content"

# Assert JS expressions
check rodney assert 'document.title' 'Example Domain'
check rodney assert 'document.querySelectorAll("p").length' '2'

# Assert accessibility requirements
check rodney ax-find --role navigation

rodney stop

if [ "$FAIL" -ne 0 ]; then
    echo "Some checks failed"
    exit 1
fi
echo "All checks passed"
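Since v0.4.0 separates check failures (exit code 1) from errors (exit code 2), a variant of the check helper above could report the two differently. This is only a sketch under that assumption: the ERROR variable is a hypothetical addition that is not in the README script, and true, false and sh -c 'exit 2' stand in for rodney commands so the snippet runs without a browser.

```shell
#!/bin/bash
set -euo pipefail

FAIL=0
ERROR=0   # hypothetical extra flag, not part of the README script

check() {
    local status=0
    # "|| status=$?" records the exit code without tripping set -e
    "$@" || status=$?
    case "$status" in
        0) ;;                                           # check passed
        1) echo "FAIL: $*"; FAIL=1 ;;                   # check failure
        *) echo "ERROR (exit $status): $*"; ERROR=1 ;;  # exit code 2 or anything else
    esac
}

# Stand-ins for rodney commands, so this runs without Chrome
check true
check false
check sh -c 'exit 2'

echo "FAIL=$FAIL ERROR=$ERROR"
```

Keeping the two flags separate lets a CI job tell "the page is wrong" apart from "the tool could not run".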

Tags: browsers, projects, testing, annotated-release-notes, rodney

Rodney and Claude Code for Desktop

2026-02-16T16:38:57+00:00

I'm a very heavy user of Claude Code on the web, Anthropic's excellent but poorly named cloud version of Claude Code where everything runs in a container environment managed by them, greatly reducing the risk of anything bad happening to a computer I care about.

I don't use the web interface at all (hence my dislike of the name) - I access it exclusively through their native iPhone and Mac desktop apps.

Something I particularly appreciate about the desktop app is that it lets you see images that Claude is "viewing" via its Read /path/to/image tool. Here's what that looks like:

This means you can get a visual preview of what it's working on while it's working, without waiting for it to push code to GitHub for you to try out yourself later on.

The prompt I used to trigger the above screenshot was:

Run "uvx rodney --help" and then use Rodney to manually test the new pages and menu - look at screenshots from it and check you think they look OK

I designed Rodney to have --help output that provides everything a coding agent needs to know in order to use the tool.

The Claude iPhone app doesn't display opened images yet, so I requested it as a feature just now in a thread on Twitter.

Tags: projects, ai, generative-ai, llms, ai-assisted-programming, anthropic, claude, coding-agents, claude-code, async-coding-agents, rodney

Introducing Showboat and Rodney, so agents can demo what they’ve built

2026-02-10T17:45:29+00:00

A key challenge working with coding agents is having them both test what they’ve built and demonstrate that software to you, their supervisor. This goes beyond automated tests - we need artifacts that show their progress and help us see exactly what the agent-produced software is able to do. I’ve just released two new tools aimed at this problem: Showboat and Rodney.

Proving code actually works

I recently wrote about how the job of a software engineer isn't to write code, it's to deliver code that works. A big part of that is proving to ourselves and to other people that the code we are responsible for behaves as expected.

This becomes even more important - and challenging - as we embrace coding agents as a core part of our software development process.

The more code we churn out with agents, the more valuable it becomes to have tools that reduce the amount of manual QA time we need to spend.

One of the most interesting things about the StrongDM software factory model is how they ensure that their software is well tested and delivers value despite their policy that "code must not be reviewed by humans". Part of their solution involves expensive swarms of QA agents running through "scenarios" to exercise their software. It's fascinating, but I don't want to spend thousands of dollars on QA robots if I can avoid it!

I need tools that allow agents to clearly demonstrate their work to me, while minimizing the opportunities for them to cheat about what they've done.

Showboat: Agents build documents to demo their work

Showboat is the tool I built to help agents demonstrate their work to me.

It's a CLI tool (a Go binary, optionally wrapped in Python to make it easier to install) that helps an agent construct a Markdown document demonstrating exactly what their newly developed code can do.

It's not designed for humans to run, but here's how you would run it anyway:

showboat init demo.md 'How to use curl and jq'
showboat note demo.md "Here's how to use curl and jq together."
showboat exec demo.md bash 'curl -s https://api.github.com/repos/simonw/rodney | jq .description'
showboat note demo.md 'And the curl logo, to demonstrate the image command:'
showboat image demo.md 'curl -o curl-logo.png https://curl.se/logo/curl-logo.png && echo curl-logo.png'

Here's what the result looks like if you open it up in VS Code and preview the Markdown:

Here's that demo.md file in a Gist.

So a sequence of showboat init, showboat note, showboat exec and showboat image commands constructs a Markdown document one section at a time, with the output of those exec commands automatically added to the document directly following the commands that were run.

The image command is a little special - it looks for a file path to an image in the output of the command and copies that image to the current folder and references it in the file.

That's basically the whole thing! There's a pop command to remove the most recently added section if something goes wrong, a verify command to re-run the document and check nothing has changed (I'm not entirely convinced by the design of that one) and an extract command that reverse-engineers the CLI commands that were used to create the document.

It's pretty simple - just 172 lines of Go.

I packaged it up with my go-to-wheel tool which means you can run it without even installing it first like this:

uvx showboat --help

That --help command is really important: it's designed to provide a coding agent with everything it needs to know in order to use the tool. Here's that help text in full.

This means you can pop open Claude Code and tell it:

Run "uvx showboat --help" and then use showboat to create a demo.md document describing the feature you just built

And that's it! The --help text acts a bit like a Skill. Your agent can read the help text and use every feature of Showboat to create a document that demonstrates whatever it is you need demonstrated.

Here's a fun trick: if you set Claude off to build a Showboat document you can pop that open in VS Code and watch the preview pane update in real time as the agent runs through the demo. It's a bit like having your coworker talk you through their latest work in a screensharing session.

And finally, some examples. Here are documents I had Claude create using Showboat to help demonstrate features I was working on in other projects:

- shot-scraper: A Comprehensive Demo runs through the full suite of features of my shot-scraper browser automation tool, mainly to exercise the showboat image command.
- sqlite-history-json CLI demo demonstrates the CLI feature I added to my new sqlite-history-json Python library.
  - row-state-sql CLI Demo shows a new row-state-sql command I added to that same project.
  - Change grouping with Notes demonstrates another feature where groups of changes within the same transaction can have a note attached to them.
- krunsh: Pipe Shell Commands to an Ephemeral libkrun MicroVM is a particularly convoluted example where I managed to get Claude Code for web to run a libkrun microVM inside a QEMU emulated Linux environment inside the Claude gVisor sandbox.

I've now used Showboat often enough that I've convinced myself of its utility.

(I've also seen agents cheat! Since the demo file is Markdown the agent will sometimes edit that file directly rather than using Showboat, which could result in command outputs that don't reflect what actually happened. Here's an issue about that.)

Rodney: CLI browser automation designed to work with Showboat

Many of the projects I work on involve web interfaces. Agents often build entirely new pages for these, and I want to see those represented in the demos.

Showboat's image feature was designed to allow agents to capture screenshots as part of their demos, originally using my shot-scraper tool or Playwright.

The Showboat format benefits from CLI utilities. I went looking for good options for managing a multi-turn browser session from a CLI and came up short, so I decided to try building something new.

Claude Opus 4.6 pointed me to the Rod Go library for interacting with the Chrome DevTools protocol. It's fantastic - it provides a comprehensive wrapper across basically everything you can do with automated Chrome, all in a self-contained library that compiles to a few MBs.

All Rod was missing was a CLI.

I built the first version as an asynchronous report prototype, which convinced me it was worth spinning out into its own project.

I called it Rodney as a nod to the Rod library it builds on and a reference to Only Fools and Horses - and because the package name was available on PyPI.

You can run Rodney using uvx rodney or install it like this:

uv tool install rodney

(Or grab a Go binary from the releases page.)

Here's a simple example session:

rodney start # starts Chrome in the background
rodney open https://datasette.io/
rodney js 'Array.from(document.links).map(el => el.href).slice(0, 5)'
rodney click 'a[href="/for"]'
rodney js location.href
rodney js document.title
rodney screenshot datasette-for-page.png
rodney stop

Here's what that looks like in the terminal:

As with Showboat, this tool is not designed to be used by humans! The goal is for coding agents to be able to run rodney --help and see everything they need to know to start using the tool. You can see that help output in the GitHub repo.

Here are three demonstrations of Rodney that I created using Showboat:

- Rodney's original feature set, including screenshots of pages and executing JavaScript.
- Rodney's new accessibility testing features, built during development of those features to show what they could do.
- Using those features to run a basic accessibility audit of a page. I was impressed at how well Claude Opus 4.6 responded to the prompt "Use showboat and rodney to perform an accessibility audit of https://latest.datasette.io/fixtures" - transcript here.

Test-driven development helps, but we still need manual testing

After being a career-long skeptic of the test-first, maximum test coverage school of software development (I like tests-included development instead) I've recently come around to test-first processes as a way to force agents to write only the code that's necessary to solve the problem at hand.

Many of my Python coding agent sessions start the same way:

Run the existing tests with "uv run pytest". Build using red/green TDD.

Telling the agents how to run the tests doubles as an indicator that tests on this project exist and matter. Agents will read existing tests before writing their own, so having a clean test suite with good patterns makes it more likely they'll write good tests of their own.

The frontier models all understand that "red/green TDD" means they should write the test first, run it and watch it fail and then write the code to make it pass - it's a convenient shortcut.

I find this greatly increases the quality of the code and the likelihood that the agent will produce the right thing with the smallest number of prompts to guide it.

But anyone who's worked with tests will know that just because the automated tests pass doesn't mean the software actually works! That’s the motivation behind Showboat and Rodney - I never trust any feature until I’ve seen it running with my own eyes.

Before building Showboat I'd often add a “manual” testing step to my agent sessions, something like:

Once the tests pass, start a development server and exercise the new feature using curl

I built both of these tools on my phone

Both Showboat and Rodney started life as Claude Code for web projects created via the Claude iPhone app. Most of the ongoing feature work for them happened in the same way.

I'm still a little startled at how much of my coding work I get done on my phone now, but I'd estimate that the majority of code I ship to GitHub these days was written for me by coding agents driven via that iPhone app.

I initially designed these two tools for use in asynchronous coding agent environments like Claude Code for the web. So far that's working out really well.

Tags: go, projects, testing, markdown, ai, generative-ai, llms, ai-assisted-programming, coding-agents, async-coding-agents, showboat, rodney