<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: jules</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/jules.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-11-06T15:53:23+00:00</updated><author><name>Simon Willison</name></author><entry><title>Code research projects with async coding agents like Claude Code and Codex</title><link href="https://simonwillison.net/2025/Nov/6/async-code-research/#atom-tag" rel="alternate"/><published>2025-11-06T15:53:23+00:00</published><updated>2025-11-06T15:53:23+00:00</updated><id>https://simonwillison.net/2025/Nov/6/async-code-research/#atom-tag</id><summary type="html">
    &lt;p&gt;I've been experimenting with a pattern for LLM usage recently that's working out really well: &lt;strong&gt;asynchronous code research tasks&lt;/strong&gt;. Pick a research question, spin up an asynchronous coding agent and let it go and run some experiments and report back when it's done.&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Nov/6/async-code-research/#code-research"&gt;Code research&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Nov/6/async-code-research/#coding-agents"&gt;Coding agents&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Nov/6/async-code-research/#asynchronous-coding-agents"&gt;Asynchronous coding agents&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Nov/6/async-code-research/#give-them-a-dedicated-github-repository"&gt;Give them a dedicated GitHub repository&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Nov/6/async-code-research/#let-them-rip-with-unlimited-network-access"&gt;Let them rip with unlimited network access&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Nov/6/async-code-research/#my-simonw-research-collection"&gt;My simonw/research collection&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Nov/6/async-code-research/#this-is-total-slop-of-course"&gt;This is total slop, of course&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Nov/6/async-code-research/#try-it-yourself"&gt;Try it yourself&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="code-research"&gt;Code research&lt;/h4&gt;
&lt;p&gt;Software development benefits enormously from something I call &lt;strong&gt;code research&lt;/strong&gt;. The great thing about questions about code is that they can often be definitively answered by writing and executing code.&lt;/p&gt;
&lt;p&gt;I often see questions on forums which hint at a lack of understanding of this skill.&lt;/p&gt;
&lt;p&gt;"Could Redis work for powering the notifications feed for my app?" is a great example. The answer is &lt;em&gt;always&lt;/em&gt; "it depends", but a better answer is that a good programmer already has everything they need to answer that question for themselves. Build a proof-of-concept, simulate the patterns you expect to see in production, then run experiments to see if it's going to work.&lt;/p&gt;
&lt;p&gt;I've been a keen practitioner of code research for a long time. Many of my most interesting projects started out as a few dozen lines of experimental code to prove to myself that something was possible.&lt;/p&gt;
&lt;h4 id="coding-agents"&gt;Coding agents&lt;/h4&gt;
&lt;p&gt;It turns out &lt;strong&gt;coding agents&lt;/strong&gt; like Claude Code and Codex are a fantastic fit for this kind of work as well. Give them the right goal and a useful environment and they'll churn through a basic research project without any further supervision.&lt;/p&gt;
&lt;p&gt;LLMs hallucinate and make mistakes. This is far less important for code research tasks because the code itself doesn't lie: if they write code and execute it and it does the right things then they've demonstrated to both themselves and to you that something really does work.&lt;/p&gt;
&lt;p&gt;They can't prove something is impossible - just because the coding agent couldn't find a way to do something doesn't mean it can't be done - but they can often demonstrate that something &lt;em&gt;is&lt;/em&gt; possible in just a few minutes of crunching.&lt;/p&gt;
&lt;h4 id="asynchronous-coding-agents"&gt;Asynchronous coding agents&lt;/h4&gt;
&lt;p&gt;I've used interactive coding agents like Claude Code and Codex CLI for a bunch of these, but today I'm increasingly turning to their &lt;strong&gt;asynchronous coding agent&lt;/strong&gt; family members instead.&lt;/p&gt;
&lt;p&gt;An asynchronous coding agent is a coding agent that operates on a fire-and-forget basis. You pose it a task, it churns away on a server somewhere and when it's done it files a pull request against your chosen GitHub repository.&lt;/p&gt;
&lt;p&gt;OpenAI's &lt;a href="https://chatgpt.com/codex"&gt;Codex Cloud&lt;/a&gt;, Anthropic's &lt;a href="https://claude.ai/code"&gt;Claude Code for web&lt;/a&gt;, Google Gemini's &lt;a href="https://jules.google/"&gt;Jules&lt;/a&gt;, and GitHub's &lt;a href="https://docs.github.com/en/copilot/concepts/agents/coding-agent/about-coding-agent?utm_source=chatgpt.com"&gt;Copilot coding agent&lt;/a&gt; are four prominent examples of this pattern.&lt;/p&gt;
&lt;p&gt;These are &lt;em&gt;fantastic&lt;/em&gt; tools for code research projects. Come up with a clear goal, turn it into a few paragraphs of prompt, set them loose and check back ten minutes later to see what they've come up with.&lt;/p&gt;
&lt;p&gt;I'm firing off 2-3 code research projects a day right now. My own time commitment is minimal and they frequently come back with useful or interesting results.&lt;/p&gt;
&lt;h4 id="give-them-a-dedicated-github-repository"&gt;Give them a dedicated GitHub repository&lt;/h4&gt;
&lt;p&gt;You can run a code research task against an existing GitHub repository, but I find it's much more liberating to have a separate, dedicated repository for your coding agents to run their projects in.&lt;/p&gt;
&lt;p&gt;This frees you from being limited to research against just code you've already written, and also means you can be much less cautious about what you let the agents do.&lt;/p&gt;
&lt;p&gt;I have two repositories that I use for this - one public, one private. I use the public one for research tasks that have no need to be private, and the private one for anything that I'm not yet ready to share with the world.&lt;/p&gt;
&lt;h4 id="let-them-rip-with-unlimited-network-access"&gt;Let them rip with unlimited network access&lt;/h4&gt;
&lt;p&gt;The biggest benefit of a dedicated repository is that you don't need to be cautious about what the agents operating in that repository can do.&lt;/p&gt;
&lt;p&gt;Both Codex Cloud and Claude Code for web default to running agents in a locked-down environment, with strict restrictions on how they can access the network. This makes total sense if they are running against sensitive repositories - a prompt injection attack of the &lt;a href="https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/"&gt;lethal trifecta&lt;/a&gt; variety could easily be used to steal sensitive code or environment variables.&lt;/p&gt;
&lt;p&gt;If you're running in a fresh, non-sensitive repository you don't need to worry about this at all! I've configured my research repositories for full network access, which means my coding agents can install any dependencies they need, fetch data from the web and generally do anything I'd be able to do on my own computer.&lt;/p&gt;
&lt;h4 id="my-simonw-research-collection"&gt;My simonw/research collection&lt;/h4&gt;
&lt;p&gt;Let's dive into some examples. My public research repository is at &lt;a href="https://github.com/simonw/research"&gt;simonw/research&lt;/a&gt; on GitHub. It currently contains 13 folders, each of which is a separate research project. I only created it two weeks ago so I'm already averaging nearly one a day!&lt;/p&gt;
&lt;p&gt;It also includes &lt;a href="https://github.com/simonw/research/blob/main/.github/workflows/update-readme.yml"&gt;a GitHub Workflow&lt;/a&gt; which uses &lt;a href="https://docs.github.com/en/github-models"&gt;GitHub Models&lt;/a&gt; to automatically update &lt;a href="https://github.com/simonw/research/blob/main/README.md"&gt;the README&lt;/a&gt; file with a summary of every new project, using &lt;a href="https://cog.readthedocs.io/"&gt;Cog&lt;/a&gt;, &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt;, &lt;a href="https://github.com/tonybaloney/llm-github-models"&gt;llm-github-models&lt;/a&gt; and &lt;a href="https://github.com/simonw/research/blob/b059108dfefeb05a48e1c27f7a127dc9fd648129/README.md#L9-L116"&gt;this snippet of Python&lt;/a&gt;.&lt;/p&gt;
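&lt;p&gt;The Cog part of that pattern works by embedding Python between markers hidden in Markdown comments; Cog executes the code and splices in whatever it prints. A hand-written illustration (the model ID and prompt here are made up, not the repo's actual snippet):&lt;/p&gt;

```
<!-- [[[cog
import llm
# Model ID is illustrative; llm-github-models exposes GitHub Models to LLM.
model = llm.get_model("github/gpt-4o-mini")
summary = model.prompt("One-sentence summary of the node-pyodide project").text()
cog.out(f"- **node-pyodide**: {summary}\n")
]]] -->
- **node-pyodide**: (generated summary is written here)
<!-- [[[end]]] -->
```

&lt;p&gt;Running &lt;code&gt;cog -r README.md&lt;/code&gt; rewrites the file in place, replacing everything between the markers with freshly generated output.&lt;/p&gt;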
&lt;p&gt;Here are some example research projects from the repo.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/research/tree/main/node-pyodide"&gt;node-pyodide&lt;/a&gt;&lt;/strong&gt; shows an example of a &lt;a href="https://github.com/simonw/research/blob/main/node-pyodide/server-simple.js"&gt;Node.js script&lt;/a&gt; that runs the &lt;a href="https://pyodide.org/"&gt;Pyodide&lt;/a&gt; WebAssembly distribution of Python inside it - yet another of my &lt;a href="https://simonwillison.net/tags/sandboxing+python/"&gt;ongoing attempts&lt;/a&gt; to find a great way of running Python in a WebAssembly sandbox on a server.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/research/tree/main/python-markdown-comparison"&gt;python-markdown-comparison&lt;/a&gt;&lt;/strong&gt; (&lt;a href="https://gistpreview.github.io/?fb07c2a3fd2d4cfb814a46696a58a00e"&gt;transcript&lt;/a&gt;) provides a detailed performance benchmark of seven different Python Markdown libraries. I fired this one off because I stumbled across &lt;a href="https://pypi.org/project/cmarkgfm/"&gt;cmarkgfm&lt;/a&gt;, a Python binding around GitHub's Markdown implementation in C, and wanted to see how it compared to the other options. This one produced some charts! &lt;code&gt;cmarkgfm&lt;/code&gt; came out on top by a significant margin:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/markdown-performance.png" alt="Bar chart titled &amp;quot;Relative Performance vs cmarkgfm (Large Document)&amp;quot; comparing relative speed of markdown libraries, with marko at 52.1x, markdown2 at 16.9x, mistletoe at 14.1x, markdown at 12.9x, commonmark at 12.1x, mistune at 10.0x, and cmarkgfm at 1.0x baseline marked by a red dashed line; x-axis labeled &amp;quot;Relative Speed (lower is better)&amp;quot; ranging from 0 to 50+" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Here's the entire prompt I used for that project:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Create a performance benchmark and feature comparison report on PyPI cmarkgfm compared to other popular Python markdown libraries - check all of them out from github and read the source to get an idea for features, then design and run a benchmark including generating some charts, then create a report in a new python-markdown-comparison folder (do not create a _summary.md file or edit anywhere outside of that folder). Make sure the performance chart images are directly displayed in the README.md in the folder.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Note that I didn't specify any Markdown libraries other than &lt;code&gt;cmarkgfm&lt;/code&gt; - Claude Code ran a search and found the other six by itself.&lt;/p&gt;
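&lt;p&gt;The core of a benchmark like that is just a timing harness. Here's a generic sketch (my illustration, not the code the agent wrote); the stand-in renderer would be replaced with each library's render function, such as &lt;code&gt;cmarkgfm.github_flavored_markdown_to_html&lt;/code&gt;.&lt;/p&gt;

```python
# Generic best-of-N timing harness of the kind such a benchmark sits on.
# `renderers` maps a label to any callable taking the Markdown source;
# the str.upper stand-in keeps this sketch dependency-free.
import time

def bench(fn, doc: str, repeats: int = 5, loops: int = 20) -> float:
    """Return the best per-call time in seconds across `repeats` trials."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(loops):
            fn(doc)
        best = min(best, (time.perf_counter() - start) / loops)
    return best

doc = "# Heading\n\nSome *emphasis* and `code`.\n" * 200

# Swap in real libraries here, e.g.
# {"cmarkgfm": cmarkgfm.github_flavored_markdown_to_html, "markdown": markdown.markdown}
renderers = {"upper-stub": str.upper}

baseline = min(bench(fn, doc) for fn in renderers.values())
for name, fn in renderers.items():
    t = bench(fn, doc)
    print(f"{name}: {t * 1e6:.1f}µs/call ({t / baseline:.1f}x baseline)")
```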
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/research/tree/main/cmarkgfm-in-pyodide"&gt;cmarkgfm-in-pyodide&lt;/a&gt;&lt;/strong&gt; is a lot more fun. A neat thing about having all of my research projects in the same repository is that new projects can build on previous ones. Here I decided to see how hard it would be to get &lt;code&gt;cmarkgfm&lt;/code&gt; - which has a C extension - working inside Pyodide inside Node.js. Claude successfully compiled a 88.4KB &lt;code&gt;cmarkgfm_pyodide-2025.10.22-cp312-cp312-emscripten_3_1_46_wasm32.whl&lt;/code&gt; file with the necessary C extension and proved it could be loaded into Pyodide in WebAssembly inside of Node.js.&lt;/p&gt;
&lt;p&gt;I ran this one using Claude Code on my laptop after an initial attempt failed. The starting prompt was:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Figure out how to get the cmarkgfm markdown lover &lt;em&gt;[typo in prompt, this should have been "library" but it figured it out anyway]&lt;/em&gt; for Python working in pyodide. This will be hard because it uses C so you will need to compile it to pyodide compatible webassembly somehow. Write a report on your results plus code to a new cmarkgfm-in-pyodide directory. Test it using pytest to exercise a node.js test script that calls pyodide as seen in the existing node.js and pyodide directory&lt;/p&gt;
&lt;p&gt;There is an existing branch that was an initial attempt at this research, but which failed because it did not have Internet access. You do have Internet access. Use that existing branch to accelerate your work, but do not commit any code unless you are certain that you have successfully executed tests that prove that the pyodide module you created works correctly.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This one gave up halfway through, complaining that emscripten would take too long. I told it:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Complete this project, actually run emscripten, I do not care how long it takes, update the report if it works&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It churned away for a bit longer and complained that the existing Python library used CFFI which isn't available in Pyodide. I asked it:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Can you figure out how to rewrite cmarkgfm to not use FFI and to use a pyodide-friendly way of integrating that C code instead?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;... and it did. You can &lt;a href="https://gistpreview.github.io/?6d778a8f9c4c2c005a189ff308c3bc47"&gt;see the full transcript here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/research/tree/main/blog-tags-scikit-learn"&gt;blog-tags-scikit-learn&lt;/a&gt;&lt;/strong&gt;. Taking a short break from WebAssembly, I thought it would be fun to put &lt;a href="https://scikit-learn.org/stable/"&gt;scikit-learn&lt;/a&gt; through its paces on a text classification task against my blog:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Work in a new folder called blog-tags-scikit-learn&lt;/p&gt;
&lt;p&gt;Download &lt;code&gt;https://datasette.simonwillison.net/simonwillisonblog.db&lt;/code&gt; - a SQLite database. Take a look at the blog_entry table and the associated tags - a lot of the earlier entries do not have tags associated with them, where the later entries do. Design, implement and execute models to suggests tags for those earlier entries based on textual analysis against later ones&lt;/p&gt;
&lt;p&gt;Use Python scikit learn and try several different strategies&lt;/p&gt;
&lt;p&gt;Produce JSON of the results for each one, plus scripts for running them and a detailed markdown description&lt;/p&gt;
&lt;p&gt;Also include an HTML page with a nice visualization of the results that works by loading those JSON files.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This resulted in seven &lt;code&gt;.py&lt;/code&gt; files, four &lt;code&gt;.json&lt;/code&gt; results files and a detailed &lt;a href="https://github.com/simonw/research/blob/main/blog-tags-scikit-learn/README.md"&gt;report&lt;/a&gt;. (It ignored the bit about an HTML page with a nice visualization for some reason.) Not bad for a few moments of idle curiosity typed into my phone!&lt;/p&gt;
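&lt;p&gt;The heart of that kind of task is a TF-IDF multi-label pipeline. A minimal sketch of the approach, with a few synthetic titles standing in for the &lt;code&gt;blog_entry&lt;/code&gt; table (this is my illustration, not the code the agent produced):&lt;/p&gt;

```python
# Train on the tagged entries, then suggest tags for an untagged one.
# Synthetic data stands in for the real blog database.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

tagged = [
    ("Notes on the new Django ORM query features", ["django", "python"]),
    ("Exploring SQLite full-text search", ["sqlite", "search"]),
    ("Django signals and when to avoid them", ["django"]),
    ("Building a search engine on SQLite FTS5", ["sqlite", "search"]),
]
untagged = ["Why I still reach for Django on new projects"]

texts, tag_lists = zip(*tagged)
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(tag_lists)  # one binary column per tag

vec = TfidfVectorizer()
X = vec.fit_transform(texts)

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, y)

# Rank tags by probability rather than hard 0/1 predictions - multi-label
# thresholds are unreliable, especially on sparse early-blog-style data.
probs = clf.predict_proba(vec.transform(untagged))[0]
suggested = sorted(zip(mlb.classes_, probs), key=lambda pair: -pair[1])
print(suggested[:3])
```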
&lt;p&gt;That's just three of the thirteen projects in the repository so far. The commit history for each one usually links to the prompt and sometimes the transcript if you want to see how they unfolded.&lt;/p&gt;
&lt;p&gt;More recently I added a short &lt;code&gt;AGENTS.md&lt;/code&gt; file to the repo with a few extra tips for my research agents. You can &lt;a href="https://github.com/simonw/research/blob/b059108dfefeb05a48e1c27f7a127dc9fd648129/AGENTS.md"&gt;read that here&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="this-is-total-slop-of-course"&gt;This is total slop, of course&lt;/h4&gt;
&lt;p&gt;My preferred definition of &lt;a href="https://simonwillison.net/2024/May/8/slop/"&gt;AI slop&lt;/a&gt; is AI-generated content that is published without human review. I've not been reviewing these reports in great detail myself, and I wouldn't usually publish them online without some serious editing and verification.&lt;/p&gt;
&lt;p&gt;I want to share the pattern I'm using though, so I decided to keep them quarantined in this one public &lt;code&gt;simonw/research&lt;/code&gt; repository.&lt;/p&gt;
&lt;p&gt;A tiny feature request for GitHub: I'd love to be able to mark a repository as "exclude from search indexes" such that it gets labelled with &lt;code&gt;&amp;lt;meta name="robots" content="noindex"&amp;gt;&lt;/code&gt; tags. I still like to keep AI-generated content out of search, to avoid contributing more to the &lt;a href="https://en.wikipedia.org/wiki/Dead_Internet_theory"&gt;dead internet&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="try-it-yourself"&gt;Try it yourself&lt;/h4&gt;
&lt;p&gt;It's pretty easy to get started trying out this coding agent research pattern. Create a free GitHub repository (public or private) and let some agents loose on it and see what happens.&lt;/p&gt;
&lt;p&gt;You can run agents locally but I find the asynchronous agents to be more convenient - especially as I can run them (or trigger them from my phone) without any fear of them damaging my own machine or leaking any of my private data.&lt;/p&gt;
&lt;p&gt;Claude Code for web offers &lt;a href="https://support.claude.com/en/articles/12690958-claude-code-promotion"&gt;a free $250 of credits&lt;/a&gt; for their $20/month users for a limited time (until November 18, 2025). Gemini Jules has &lt;a href="https://jules.google/docs/usage-limits/"&gt;a free tier&lt;/a&gt;. There are plenty of other coding agents you can try out as well.&lt;/p&gt;
&lt;p&gt;Let me know if your research agents come back with anything interesting!&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/webassembly"&gt;webassembly&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/slop"&gt;slop&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-agents"&gt;ai-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-code"&gt;claude-code&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jules"&gt;jules&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/codex-cli"&gt;codex-cli&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="webassembly"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="slop"/><category term="ai-agents"/><category term="coding-agents"/><category term="claude-code"/><category term="jules"/><category term="codex-cli"/></entry><entry><title>Embracing the parallel coding agent lifestyle</title><link href="https://simonwillison.net/2025/Oct/5/parallel-coding-agents/#atom-tag" rel="alternate"/><published>2025-10-05T12:06:55+00:00</published><updated>2025-10-05T12:06:55+00:00</updated><id>https://simonwillison.net/2025/Oct/5/parallel-coding-agents/#atom-tag</id><summary type="html">
    &lt;p&gt;For a while now I've been hearing from engineers who run multiple coding agents at once - firing up several Claude Code or Codex CLI instances at the same time, sometimes in the same repo, sometimes against multiple checkouts or &lt;a href="https://docs.claude.com/en/docs/claude-code/common-workflows#run-parallel-claude-code-sessions-with-git-worktrees"&gt;git worktrees&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I was pretty skeptical about this at first. AI-generated code needs to be reviewed, which means the natural bottleneck on all of this is how fast I can review the results. It's tough keeping up with just a single LLM given how fast they can churn things out, so where's the benefit of running more than one at a time if it just leaves me further behind?&lt;/p&gt;
&lt;p&gt;Despite my misgivings, over the past few weeks I've noticed myself quietly starting to embrace the parallel coding agent lifestyle.&lt;/p&gt;
&lt;p&gt;I can only focus on reviewing and landing one significant change at a time, but I'm finding an increasing number of tasks that can still be fired off in parallel without adding too much cognitive overhead to my primary work.&lt;/p&gt;
&lt;p&gt;Here are some patterns I've found for applying parallel agents effectively.&lt;/p&gt;
&lt;h4 id="research-poc"&gt;Research for proof of concepts&lt;/h4&gt;
&lt;p&gt;The first category of tasks I've been applying this pattern to is &lt;strong&gt;research&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Research tasks answer questions or provide recommendations without making modifications to a project that you plan to keep.&lt;/p&gt;
&lt;p&gt;A lot of software projects start with a proof of concept. Can &lt;a href="https://yjs.dev"&gt;Yjs&lt;/a&gt; be used to implement a simple collaborative note writing tool with a Python backend? The &lt;a href="https://github.com/y-crdt/pycrdt"&gt;libraries exist&lt;/a&gt;, but do they work when you wire them together?&lt;/p&gt;
&lt;p&gt;Today's coding agents can build a proof of concept with new libraries and resolve those kinds of basic questions. Libraries too new to be in the training data? Doesn't matter: tell them to check out the repos for those new dependencies and read the code to figure out how to use them.&lt;/p&gt;
&lt;h4 id="how-does-that-work-again"&gt;How does that work again?&lt;/h4&gt;
&lt;p&gt;If you need a reminder about how a portion of your existing system works, modern "reasoning" LLMs can provide a detailed, actionable answer in just a minute or two.&lt;/p&gt;
&lt;p&gt;It doesn't matter how large your codebase is: coding agents are extremely effective with tools like grep and can follow codepaths through dozens of different files if they need to.&lt;/p&gt;
&lt;p&gt;Ask them to make notes on where your signed cookies are set and read, or how your application uses subprocesses and threads, or which aspects of your JSON API aren't yet covered by your documentation.&lt;/p&gt;
&lt;p&gt;These LLM-generated explanations are worth stashing away somewhere, because they can make excellent context to paste into further prompts in the future.&lt;/p&gt;
&lt;h4 id="small-maintenance-tasks"&gt;Small maintenance tasks&lt;/h4&gt;
&lt;p&gt;Now we're moving on to code edits that we intend to keep, albeit with &lt;em&gt;very&lt;/em&gt; low stakes. It turns out there are a lot of problems that really just require a little bit of extra cognitive overhead which can be outsourced to a bot.&lt;/p&gt;
&lt;p&gt;Warnings are a great example. Is your test suite spitting out a warning that something you are using is deprecated? Chuck that at a bot - tell it to run the test suite and figure out how to fix the warning. No need to take a break from what you're doing to resolve minor irritations like that.&lt;/p&gt;
&lt;p&gt;There is a definite knack to spotting opportunities like this. As always, the best way to develop that instinct is to try things - any small maintenance task is something that's worth trying with a coding agent. You can learn from both their successes &lt;em&gt;and&lt;/em&gt; their failures.&lt;/p&gt;
&lt;h4 id="carefully-specified-and-directed-actual-work"&gt;Carefully specified and directed actual work&lt;/h4&gt;
&lt;p&gt;Reviewing code that lands on your desk out of nowhere is a &lt;em&gt;lot&lt;/em&gt; of work. First you have to derive the goals of the new implementation: what's it trying to achieve? Is this something the project needs? Is the approach taken the best for this current project, given other future planned changes? A lot of big questions before you can even start digging into the details of the code.&lt;/p&gt;
&lt;p&gt;Code that started from your own specification is a lot less effort to review. If you already decided what to solve, picked the approach and worked out a detailed specification for the work itself, confirming it was built to your needs can take a lot less time.&lt;/p&gt;
&lt;p&gt;I described my &lt;a href="https://simonwillison.net/2025/Mar/11/using-llms-for-code/#tell-them-exactly-what-to-do"&gt;more authoritarian approach&lt;/a&gt; to prompting models for code back in March. If I tell them &lt;em&gt;exactly&lt;/em&gt; how to build something the work needed to review the resulting changes is a whole lot less taxing.&lt;/p&gt;
&lt;h4 id="how-i-m-using-these-tools-today"&gt;How I'm using these tools today&lt;/h4&gt;
&lt;p&gt;My daily drivers are currently &lt;a href="https://www.claude.com/product/claude-code"&gt;Claude Code&lt;/a&gt; (on Sonnet 4.5), &lt;a href="https://github.com/openai/codex"&gt;Codex CLI&lt;/a&gt; (on GPT-5-Codex), and &lt;a href="https://chatgpt.com/codex"&gt;Codex Cloud&lt;/a&gt; (for asynchronous tasks, frequently launched from my phone.)&lt;/p&gt;
&lt;p&gt;I'm also dabbling with &lt;a href="https://docs.github.com/en/copilot/concepts/agents/coding-agent/about-coding-agent"&gt;GitHub Copilot Coding Agent&lt;/a&gt; (the agent baked into the &lt;a href="https://github.com"&gt;GitHub.com&lt;/a&gt; web interface in various places) and &lt;a href="https://jules.google"&gt;Google Jules&lt;/a&gt;, Google's currently-free alternative to Codex Cloud.&lt;/p&gt;
&lt;p&gt;I'm still settling into patterns that work for me. I imagine I'll be iterating on my processes for a long time to come, especially as the landscape of coding agents continues to evolve.&lt;/p&gt;
&lt;p&gt;I frequently have multiple terminal windows open running different coding agents in different directories. These are currently a mixture of Claude Code and Codex CLI, running in &lt;a href="https://simonwillison.net/2025/Sep/30/designing-agentic-loops/#the-joy-of-yolo-mode"&gt;YOLO mode&lt;/a&gt; (no approvals) for tasks where I'm confident malicious instructions can't sneak into the context.&lt;/p&gt;
&lt;p&gt;(I need to start habitually running my local agents in Docker containers to further limit the blast radius if something goes wrong.)&lt;/p&gt;
&lt;p&gt;I haven't adopted git worktrees yet: if I want to run two agents in isolation against the same repo I do a fresh checkout, often into &lt;code&gt;/tmp&lt;/code&gt;.&lt;/p&gt;
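&lt;p&gt;For what it's worth, worktrees make that isolation cheap: each agent gets a sibling directory on its own branch, sharing the same object store instead of needing a second full clone. A sketch with illustrative names:&lt;/p&gt;

```shell
# Demo with throwaway paths: give an agent its own isolated checkout via
# a worktree instead of a second full clone. Names here are illustrative.
set -e
demo="$(mktemp -d)"
git init -q "$demo/myrepo" && cd "$demo/myrepo"
git -c user.name=demo -c user.email=demo@example.com commit -q --allow-empty -m "init"

# One worktree per agent task: a sibling directory on its own new branch,
# sharing the main checkout's object store.
git worktree add ../agent-task-1 -b agent-task-1
git worktree list                      # main checkout plus the new worktree

# ...point a coding agent at "$demo/agent-task-1" and let it work...

git worktree remove ../agent-task-1    # clean up once merged or abandoned
git branch -q -D agent-task-1
```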
&lt;p&gt;For riskier tasks I'm currently using asynchronous coding agents - usually Codex Cloud - so if anything goes wrong the worst that can happen is my source code getting leaked (since &lt;a href="https://simonwillison.net/2025/Jun/3/codex-agent-internet-access/"&gt;I allow it to have network access&lt;/a&gt; while running). Most of what I work on is open source anyway so that's not a big concern for me.&lt;/p&gt;
&lt;p&gt;I occasionally use &lt;a href="https://github.com/features/codespaces"&gt;GitHub Codespaces&lt;/a&gt; to run VS Code's agent mode, which is surprisingly effective and runs directly in my browser. This is particularly great for workshops and demos since it works for anyone with a GitHub account, no extra API key necessary.&lt;/p&gt;
&lt;h4 id="please-share-your-patterns-that-work"&gt;Please share your patterns that work&lt;/h4&gt;
&lt;p&gt;This category of coding agent software is still really new, and the models have only really got good enough to drive them effectively in the past few months - Claude 4 and GPT-5 in particular.&lt;/p&gt;
&lt;p&gt;I plan to write more as I figure out the ways of using them that are most effective. I encourage other practitioners to do the same!&lt;/p&gt;
&lt;h4 id="recommended-reading"&gt;Recommended reading&lt;/h4&gt;
&lt;p&gt;Jesse Vincent wrote &lt;a href="https://blog.fsck.com/2025/10/05/how-im-using-coding-agents-in-september-2025/"&gt;How I'm using coding agents in September, 2025&lt;/a&gt; which describes his workflow for parallel agents in detail, including having an architect agent iterate on a plan which is then reviewed and implemented by fresh instances of Claude Code.&lt;/p&gt;
&lt;p&gt;In &lt;a href="https://sketch.dev/blog/seven-prompting-habits"&gt;The 7 Prompting Habits of Highly Effective Engineers&lt;/a&gt; Josh Bleecher Snyder describes several patterns for this kind of work. I particularly like this one:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Send out a scout&lt;/strong&gt;. Hand the AI agent a task just to find out where the sticky bits are, so you don’t have to make those mistakes.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I've tried this a few times with good results: give the agent a genuinely difficult task against a large codebase, with no intention of actually landing its code, just to get ideas from which files it modifies and how it approaches the problem.&lt;/p&gt;
&lt;p&gt;Peter Steinberger's &lt;a href="https://steipete.me/posts/just-talk-to-it"&gt;Just Talk To It - the no-bs Way of Agentic Engineering&lt;/a&gt; provides a very detailed description of his current process built around Codex CLI.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-agents"&gt;ai-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-code"&gt;claude-code&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/async-coding-agents"&gt;async-coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jules"&gt;jules&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/codex-cli"&gt;codex-cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parallel-agents"&gt;parallel-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jesse-vincent"&gt;jesse-vincent&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/peter-steinberger"&gt;peter-steinberger&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/agentic-engineering"&gt;agentic-engineering&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="ai-agents"/><category term="coding-agents"/><category term="claude-code"/><category term="async-coding-agents"/><category term="jules"/><category term="codex-cli"/><category term="parallel-agents"/><category term="jesse-vincent"/><category term="peter-steinberger"/><category term="agentic-engineering"/></entry><entry><title>aavetis/PRarena</title><link href="https://simonwillison.net/2025/Oct/1/prarena/#atom-tag" rel="alternate"/><published>2025-10-01T23:59:40+00:00</published><updated>2025-10-01T23:59:40+00:00</updated><id>https://simonwillison.net/2025/Oct/1/prarena/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/aavetis/PRarena"&gt;aavetis/PRarena&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Albert Avetisian runs this repository on GitHub, which uses the GitHub Search API to track the number of PRs that can be credited to a collection of different coding agents. The repo runs &lt;a href="https://github.com/aavetis/PRarena/blob/main/collect_data.py"&gt;this collect_data.py script&lt;/a&gt; every three hours &lt;a href="https://github.com/aavetis/PRarena/blob/main/.github/workflows/pr%E2%80%91stats.yml"&gt;using GitHub Actions&lt;/a&gt; to collect the data, then updates the &lt;a href="https://prarena.ai/"&gt;PR Arena site&lt;/a&gt; with a visual leaderboard.&lt;/p&gt;
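&lt;p&gt;The underlying technique is simple: GitHub's issue search API returns a &lt;code&gt;total_count&lt;/code&gt; for any query, so counting an agent's PRs is one request per search term. A sketch of the URL construction (the example search term is illustrative, not necessarily PRarena's actual query):&lt;/p&gt;

```python
# Build GitHub search API URLs whose JSON responses carry a total_count
# field - one request per (agent, merged-or-not) combination is enough
# to populate a leaderboard. Search terms here are illustrative.
from urllib.parse import urlencode

API = "https://api.github.com/search/issues"

def pr_count_url(search_term: str, merged: bool = False) -> str:
    """URL for counting PRs matching a search term, optionally merged only."""
    q = f"{search_term} is:pr" + (" is:merged" if merged else "")
    return API + "?" + urlencode({"q": q, "per_page": 1})

print(pr_count_url("head:codex/"))
print(pr_count_url("head:codex/", merged=True))
```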
&lt;p&gt;The result is this neat chart showing adoption of different agents over time, along with their PR success rate:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Line and bar chart showing PR metrics over time from 05/26 to 10/01. The left y-axis shows &amp;quot;Number of PRs&amp;quot; from 0 to 1,800,000, the right y-axis shows &amp;quot;Success Rate (%)&amp;quot; from 0% to 100%, and the x-axis shows &amp;quot;Time&amp;quot; with dates. Five line plots track success percentages: &amp;quot;Copilot Success % (Ready)&amp;quot; and &amp;quot;Copilot Success % (All)&amp;quot; (both blue, top lines around 90-95%), &amp;quot;Codex Success % (Ready)&amp;quot; and &amp;quot;Codex Success % (All)&amp;quot; (both brown/orange, middle lines declining from 80% to 60%), and &amp;quot;Cursor Success % (Ready)&amp;quot; and &amp;quot;Cursor Success % (All)&amp;quot; (both purple, middle lines around 75-85%), &amp;quot;Devin Success % (Ready)&amp;quot; and &amp;quot;Devin Success % (All)&amp;quot; (both teal/green, lower lines around 65%), and &amp;quot;Codegen Success % (Ready)&amp;quot; and &amp;quot;Codegen Success % (All)&amp;quot; (both brown, declining lines). Stacked bar charts show total and merged PRs for each tool: light blue and dark blue for Copilot, light red and dark red for Codex, light purple and dark purple for Cursor, light green and dark green for Devin, and light orange for Codegen. The bars show increasing volumes over time, with the largest bars appearing at 10/01 reaching approximately 1,700,000 total PRs." src="https://static.simonwillison.net/static/2025/ai-agents-chart.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;I found this today while trying to pull off the exact same trick myself! I got as far as creating the following table before finding Albert's work and abandoning my own project.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Search term&lt;/th&gt;
&lt;th&gt;Total PRs&lt;/th&gt;
&lt;th&gt;Merged PRs&lt;/th&gt;
&lt;th&gt;% merged&lt;/th&gt;
&lt;th&gt;Earliest&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://claude.com/product/claude-code"&gt;Claude Code&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;is:pr in:body "Generated with Claude Code"&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/search?q=is%3Apr+in%3Abody+%22Generated+with+Claude+Code%22&amp;amp;type=pullrequests&amp;amp;s=created&amp;amp;o=asc"&gt;146,000&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/search?q=is%3Apr+in%3Abody+%22Generated+with+Claude+Code%22+is%3Amerged&amp;amp;type=pullrequests&amp;amp;s=created&amp;amp;o=asc"&gt;123,000&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;84.2%&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/turlockmike/hataraku/pull/83"&gt;Feb 21st&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/features/copilot"&gt;GitHub Copilot&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;is:pr author:copilot-swe-agent[bot]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/search?q=is%3Apr+author%3Acopilot-swe-agent%5Bbot%5D&amp;amp;type=pullrequests&amp;amp;s=created&amp;amp;o=asc"&gt;247,000&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/search?q=is%3Apr+author%3Acopilot-swe-agent%5Bbot%5D+is%3Amerged&amp;amp;type=pullrequests&amp;amp;s=created&amp;amp;o=asc"&gt;152,000&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;61.5%&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/abbhardwa/Relational-Database-Query-Parser/pull/2"&gt;March 7th&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://developers.openai.com/codex/cloud/"&gt;Codex Cloud&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;is:pr in:body "chatgpt.com" label:codex&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/search?q=is%3Apr+in%3Abody+%22chatgpt.com%22+label%3Acodex&amp;amp;type=pullrequests&amp;amp;s=created&amp;amp;o=asc"&gt;1,900,000&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/search?q=is%3Apr+in%3Abody+%22chatgpt.com%22+label%3Acodex+is%3Amerged&amp;amp;type=pullrequests&amp;amp;s=created&amp;amp;o=asc"&gt;1,600,000&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;84.2%&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/adrianadiwidjaja/my-flask-app/pull/1"&gt;April 23rd&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://jules.google/"&gt;Google Jules&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;is:pr author:google-labs-jules[bot]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/search?q=is%3Apr+author%3Agoogle-labs-jules%5Bbot%5D&amp;amp;type=pullrequests&amp;amp;s=created&amp;amp;o=asc"&gt;35,400&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/search?q=is%3Apr+author%3Agoogle-labs-jules%5Bbot%5D+is%3Amerged&amp;amp;type=pullrequests&amp;amp;s=created&amp;amp;o=asc"&gt;27,800&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;78.5%&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/yukikurage/memento-proto/pull/2"&gt;May 22nd&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;(Those "earliest" links are a little questionable; I tried to filter out false positives and find the oldest one that appeared to genuinely be from the agent in question.)&lt;/p&gt;
&lt;p&gt;It looks like OpenAI's Codex Cloud is &lt;em&gt;massively&lt;/em&gt; ahead of the competition right now in terms of numbers of PRs both opened and merged on GitHub.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: To clarify, these numbers are for the category of &lt;strong&gt;autonomous coding agents&lt;/strong&gt; - those systems where you assign a cloud-based agent a task or issue and the output is a PR against your repository. They do not (and cannot) capture the popularity of many forms of AI tooling that don't result in an easily identifiable pull request.&lt;/p&gt;
&lt;p&gt;Claude Code, for example, will be dramatically under-counted here because its version of an autonomous coding agent comes in the form of a somewhat obscure GitHub Actions workflow &lt;a href="https://docs.claude.com/en/docs/claude-code/github-actions"&gt;buried in the documentation&lt;/a&gt;.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-code"&gt;claude-code&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/async-coding-agents"&gt;async-coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jules"&gt;jules&lt;/a&gt;&lt;/p&gt;



</summary><category term="github"/><category term="ai"/><category term="git-scraping"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="anthropic"/><category term="coding-agents"/><category term="claude-code"/><category term="async-coding-agents"/><category term="jules"/></entry><entry><title>Jules, our asynchronous coding agent, is now available for everyone</title><link href="https://simonwillison.net/2025/Aug/6/asynchronous-coding-agents/#atom-tag" rel="alternate"/><published>2025-08-06T19:36:24+00:00</published><updated>2025-08-06T19:36:24+00:00</updated><id>https://simonwillison.net/2025/Aug/6/asynchronous-coding-agents/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.google/technology/google-labs/jules-now-available/"&gt;Jules, our asynchronous coding agent, is now available for everyone&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I wrote about the Jules beta &lt;a href="https://simonwillison.net/2025/May/19/jules/"&gt;back in May&lt;/a&gt;. Google's equivalent of OpenAI Codex, a hosted coding tool that submits PRs against your repository, graduated from beta today.&lt;/p&gt;
&lt;p&gt;I'm mainly linking to this now because I like the new term they are using in this blog entry: &lt;strong&gt;Asynchronous coding agent&lt;/strong&gt;. I like it so much I &lt;a href="https://simonwillison.net/tags/asynchronous-coding-agents/"&gt;gave it a tag&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I continue to avoid the term "agent" as infuriatingly vague, but I can grudgingly accept it when accompanied by a prefix that clarifies the type of agent we are talking about. "Asynchronous coding agent" feels just about obvious enough to me to be useful.&lt;/p&gt;
&lt;p&gt;... I just ran a Google search for &lt;code&gt;"asynchronous coding agent" -jules&lt;/code&gt; and came up with a few more notable examples of this name being used elsewhere:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://blog.langchain.com/introducing-open-swe-an-open-source-asynchronous-coding-agent/"&gt;Introducing Open SWE: An Open-Source Asynchronous Coding Agent&lt;/a&gt; is an announcement from LangChain just this morning of their take on this pattern. They provide a hosted version (bring your own API keys) or you can run it yourself with &lt;a href="https://github.com/langchain-ai/open-swe"&gt;their MIT licensed code&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;The press release for GitHub's own version of this, &lt;a href="https://github.com/newsroom/press-releases/coding-agent-for-github-copilot"&gt;GitHub Introduces Coding Agent For GitHub Copilot&lt;/a&gt;, states that "GitHub Copilot now includes an asynchronous coding agent".&lt;/li&gt;
&lt;/ul&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=44813854"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/definitions"&gt;definitions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/agent-definitions"&gt;agent-definitions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/async-coding-agents"&gt;async-coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jules"&gt;jules&lt;/a&gt;&lt;/p&gt;



</summary><category term="definitions"/><category term="github"/><category term="google"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="gemini"/><category term="agent-definitions"/><category term="async-coding-agents"/><category term="jules"/></entry><entry><title>PR #537: Fix Markdown in og descriptions</title><link href="https://simonwillison.net/2025/Jun/3/openai-codex-pr/#atom-tag" rel="alternate"/><published>2025-06-03T23:58:34+00:00</published><updated>2025-06-03T23:58:34+00:00</updated><id>https://simonwillison.net/2025/Jun/3/openai-codex-pr/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/simonwillisonblog/pull/537"&gt;PR #537: Fix Markdown in og descriptions&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Since &lt;a href="https://openai.com/index/introducing-codex/"&gt;OpenAI Codex&lt;/a&gt; is now available to us ChatGPT Plus subscribers I decided to try it out against my blog.&lt;/p&gt;
&lt;p&gt;It's a very nice implementation of the GitHub-connected coding "agent" pattern, as also seen in Google's &lt;a href="https://jules.google/"&gt;Jules&lt;/a&gt; and Microsoft's &lt;a href="https://github.blog/changelog/2025-05-19-github-copilot-coding-agent-in-public-preview/"&gt;Copilot Coding Agent&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;First I had to configure an environment for it. My Django blog uses PostgreSQL which isn't part of the &lt;a href="https://github.com/openai/codex-universal"&gt;default Codex container&lt;/a&gt;, so I had Claude Sonnet 4 &lt;a href="https://claude.ai/share/a5ce65c2-a9a4-4ae7-b645-71bd9fd6ea2c"&gt;help me&lt;/a&gt; come up with a startup recipe to get PostgreSQL working.&lt;/p&gt;
&lt;p&gt;I attached my &lt;a href="https://github.com/simonw/simonwillisonblog"&gt;simonw/simonwillisonblog&lt;/a&gt; GitHub repo and used the following as the "setup script" for the environment:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Install PostgreSQL
apt-get update &amp;amp;&amp;amp; apt-get install -y postgresql postgresql-contrib

# Start PostgreSQL service
service postgresql start

# Create a test database and user
sudo -u postgres createdb simonwillisonblog
sudo -u postgres psql -c "CREATE USER testuser WITH PASSWORD 'testpass';"
sudo -u postgres psql -c "GRANT ALL PRIVILEGES ON DATABASE simonwillisonblog TO testuser;"
sudo -u postgres psql -c "ALTER USER testuser CREATEDB;"

# Install Python dependencies
pip install -r requirements.txt
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I left "Agent internet access" off for reasons &lt;a href="https://simonwillison.net/2025/Jun/3/codex-agent-internet-access/"&gt;described previously&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Then I prompted Codex with the following (after one previous experimental task to check that it could run my tests):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Notes and blogmarks can both use Markdown.&lt;/p&gt;
&lt;p&gt;They serve &lt;code&gt;meta property="og:description" content="&lt;/code&gt; tags on the page, but those tags include that raw Markdown which looks bad on social media previews.&lt;/p&gt;
&lt;p&gt;Fix it so they instead use just the text with markdown stripped - so probably render it to HTML and then strip the HTML tags.&lt;/p&gt;
&lt;p&gt;Include passing tests.&lt;/p&gt;
&lt;p&gt;Try to run the tests, the postgresql details are:&lt;/p&gt;
&lt;p&gt;database = simonwillisonblog
username = testuser
password = testpass&lt;/p&gt;
&lt;p&gt;Put those in the DATABASE_URL environment variable.&lt;/p&gt;
&lt;/blockquote&gt;
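&lt;p&gt;Those three values combine into a single connection string. Here's a sketch of building the &lt;code&gt;DATABASE_URL&lt;/code&gt; the prompt asks for, assuming the default localhost host and port; the helper function is illustrative, not part of the blog's codebase:&lt;/p&gt;

```python
import os


def database_url(user: str, password: str, db: str,
                 host: str = "localhost", port: int = 5432) -> str:
    """Build a postgres:// connection string (host and port are assumed defaults)."""
    return f"postgres://{user}:{password}@{host}:{port}/{db}"


# Credentials from the prompt above
os.environ["DATABASE_URL"] = database_url("testuser", "testpass", "simonwillisonblog")
```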
&lt;p&gt;I left it to churn away for a few minutes (4m12s, to be precise) and &lt;a href="https://chatgpt.com/s/cd_683f8b81657881919a8d1ce71978a2df"&gt;it came back&lt;/a&gt; with a fix that edited two templates and added one more (passing) test. Here's &lt;a href="https://github.com/simonw/simonwillisonblog/pull/537/files"&gt;that change in full&lt;/a&gt;.&lt;/p&gt;
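&lt;p&gt;The transformation the prompt describes (render the Markdown to HTML, then strip the tags) can be sketched with the Python standard library. This is a hypothetical helper, not the code Codex actually wrote; rendering the Markdown to HTML is left to whatever library the project already uses:&lt;/p&gt;

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(rendered_html: str) -> str:
    """Return the plain text of rendered HTML, suitable for og:description."""
    extractor = _TextExtractor()
    extractor.feed(rendered_html)
    return "".join(extractor.parts).strip()
```

&lt;p&gt;Feeding it rendered HTML such as &lt;code&gt;&amp;lt;p&amp;gt;Since &amp;lt;strong&amp;gt;OpenAI Codex&amp;lt;/strong&amp;gt; is now available&amp;lt;/p&amp;gt;&lt;/code&gt; yields the tag-free text that belongs in the meta tag.&lt;/p&gt;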
&lt;p&gt;And sure enough, the social media cards for my posts now look like this - no visible Markdown any more:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of a web browser showing a blog post preview card on Bluesky. The URL in the address bar reads &amp;quot;https://simonwillison.net/2025/Jun/3/pr-537-fix-markdown-in-og-descriptions/&amp;quot;. The preview card shows the title &amp;quot;PR #537: Fix Markdown in og descriptions&amp;quot; and begins with the text &amp;quot;Since OpenAI Codex is now available to us ChatGPT Plus subscribers I decided to try it out against my blog. It's a very nice implementation of the GitHub-connected coding&amp;quot;. The domain &amp;quot;simonwillison.net&amp;quot; appears at the bottom of the card." src="https://static.simonwillison.net/static/2025/codex-fix.jpg" /&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/django"&gt;django&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/postgresql"&gt;postgresql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/testing"&gt;testing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatgpt"&gt;chatgpt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-agents"&gt;ai-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/async-coding-agents"&gt;async-coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jules"&gt;jules&lt;/a&gt;&lt;/p&gt;



</summary><category term="django"/><category term="github"/><category term="postgresql"/><category term="testing"/><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="chatgpt"/><category term="llms"/><category term="ai-assisted-programming"/><category term="ai-agents"/><category term="coding-agents"/><category term="async-coding-agents"/><category term="jules"/></entry><entry><title>Jules</title><link href="https://simonwillison.net/2025/May/19/jules/#atom-tag" rel="alternate"/><published>2025-05-19T21:40:11+00:00</published><updated>2025-05-19T21:40:11+00:00</updated><id>https://simonwillison.net/2025/May/19/jules/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://jules.google.com/"&gt;Jules&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
It seems like &lt;em&gt;everyone&lt;/em&gt; is rolling out AI coding assistants that attach to your GitHub account and submit PRs for you right now. We had &lt;a href="https://simonwillison.net/2025/May/16/openai-codex/"&gt;OpenAI Codex&lt;/a&gt; last week, today Microsoft announced &lt;a href="https://github.blog/changelog/2025-05-19-github-copilot-coding-agent-in-public-preview/"&gt;GitHub Copilot coding agent&lt;/a&gt; (confusingly not the same thing as &lt;a href="https://githubnext.com/projects/copilot-workspace"&gt;Copilot Workspace&lt;/a&gt;) and I found out just now that Google's Jules, &lt;a href="https://developers.googleblog.com/en/the-next-chapter-of-the-gemini-era-for-developers/"&gt;announced in December&lt;/a&gt;, is now in a beta preview.&lt;/p&gt;
&lt;p&gt;I'm flying home from PyCon but I managed to try out Jules from my phone. I took &lt;a href="https://github.com/datasette/datasette-chronicle/issues/3"&gt;this GitHub issue thread&lt;/a&gt;, converted it to copy-pasteable Markdown with &lt;a href="https://tools.simonwillison.net/github-issue-to-markdown"&gt;this tool&lt;/a&gt; and pasted it into Jules, with no further instructions.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://github.com/datasette/datasette-chronicle/pull/6"&gt;the resulting PR&lt;/a&gt; created from its branch. I haven't fully reviewed it yet and the tests aren't passing, so it's hard to evaluate from my phone how well it did. In a cursory first glance it looks like it's covered most of the requirements from the issue thread.&lt;/p&gt;
&lt;p&gt;My habit of &lt;a href="https://simonwillison.net/2022/Nov/26/productivity/#issue-thread"&gt;creating long issue threads&lt;/a&gt; where I talk to myself about the features I'm planning is proving to be a good fit for outsourcing implementation work to this new generation of coding assistants.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-issues"&gt;github-issues&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/async-coding-agents"&gt;async-coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jules"&gt;jules&lt;/a&gt;&lt;/p&gt;



</summary><category term="github"/><category term="google"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="gemini"/><category term="github-issues"/><category term="async-coding-agents"/><category term="jules"/></entry></feed>