<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: gpt-codex</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/gpt-codex.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-12-19T05:21:17+00:00</updated><author><name>Simon Willison</name></author><entry><title>Introducing GPT-5.2-Codex</title><link href="https://simonwillison.net/2025/Dec/19/introducing-gpt-52-codex/#atom-tag" rel="alternate"/><published>2025-12-19T05:21:17+00:00</published><updated>2025-12-19T05:21:17+00:00</updated><id>https://simonwillison.net/2025/Dec/19/introducing-gpt-52-codex/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://openai.com/index/introducing-gpt-5-2-codex/"&gt;Introducing GPT-5.2-Codex&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The latest in OpenAI's &lt;a href="https://simonwillison.net/tags/gpt-codex/"&gt;Codex family of models&lt;/a&gt; (not the same thing as their Codex CLI or Codex Cloud coding agent tools).&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;GPT‑5.2-Codex is a version of &lt;a href="https://openai.com/index/introducing-gpt-5-2/"&gt;GPT‑5.2⁠&lt;/a&gt; further optimized for agentic coding in Codex, including improvements on long-horizon work through context compaction, stronger performance on large code changes like refactors and migrations, improved performance in Windows environments, and significantly stronger cybersecurity capabilities.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;As with some previous Codex models, this one is available via their Codex coding agents now and will be coming to the API "in the coming weeks". New this time is an invite-only preview process that gives vetted cybersecurity professionals access to "more permissive models".&lt;/p&gt;
&lt;p&gt;I've been very impressed recently with GPT-5.2's ability to &lt;a href="https://simonwillison.net/2025/Dec/15/porting-justhtml/"&gt;tackle multi-hour agentic coding challenges&lt;/a&gt;. GPT-5.2-Codex scores 64% on the Terminal-Bench 2.0 benchmark, where GPT-5.2 scored 62.2%. I'm not sure how noticeable that 1.8 percentage point improvement will be in practice!&lt;/p&gt;
&lt;p&gt;I didn't hack API access together this time (see &lt;a href="https://simonwillison.net/2025/Nov/9/gpt-5-codex-mini/"&gt;previous attempts&lt;/a&gt;), instead opting to just ask Codex CLI to "Generate an SVG of a pelican riding a bicycle" while running the new model (effort medium). &lt;a href="https://tools.simonwillison.net/codex-timeline?url=https://gist.githubusercontent.com/simonw/10ad81e82889a97a7d28827e0ea6d768/raw/d749473b37d86d519b4c3fa0892b5e54b5941b38/rollout-2025-12-18T16-09-10-019b33f0-6111-7840-89b0-aedf755a6e10.jsonl#tz=local&amp;amp;q=&amp;amp;type=all&amp;amp;payload=all&amp;amp;role=all&amp;amp;hide=1&amp;amp;truncate=1&amp;amp;sel=3"&gt;Here's the transcript&lt;/a&gt; in my new Codex CLI timeline viewer, and here's the pelican it drew:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Alt text by GPT-5.2-Codex: A minimalist illustration of a white pelican with a large orange beak riding a teal bicycle across a sandy strip of ground. The pelican leans forward as if pedaling, its wings tucked back and legs reaching toward the pedals. Simple gray motion lines trail behind it, and a pale yellow sun sits in the top‑right against a warm beige sky." src="https://static.simonwillison.net/static/2025/5.2-codex-pelican.png" /&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/codex-cli"&gt;codex-cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-codex"&gt;gpt-codex&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="codex-cli"/><category term="gpt-codex"/></entry><entry><title>Building more with GPT-5.1-Codex-Max</title><link href="https://simonwillison.net/2025/Nov/19/gpt-51-codex-max/#atom-tag" rel="alternate"/><published>2025-11-19T23:15:10+00:00</published><updated>2025-11-19T23:15:10+00:00</updated><id>https://simonwillison.net/2025/Nov/19/gpt-51-codex-max/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://openai.com/index/gpt-5-1-codex-max/"&gt;Building more with GPT-5.1-Codex-Max&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Hot on the heels of yesterday's &lt;a href="https://simonwillison.net/2025/Nov/18/gemini-3/"&gt;Gemini 3 Pro release&lt;/a&gt; comes a new model from OpenAI called GPT-5.1-Codex-Max.&lt;/p&gt;
&lt;p&gt;(Remember when GPT-5 was meant to bring in a new era of less confusing model names? That didn't last!)&lt;/p&gt;
&lt;p&gt;It's currently only available through their &lt;a href="https://developers.openai.com/codex/cli/"&gt;Codex CLI coding agent&lt;/a&gt;, where it's the new default model:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Starting today, GPT‑5.1-Codex-Max will replace GPT‑5.1-Codex as the default model in Codex surfaces. Unlike GPT‑5.1, which is a general-purpose model, we recommend using GPT‑5.1-Codex-Max and the Codex family of models only for agentic coding tasks in Codex or Codex-like environments.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It's not available via the API yet but should be shortly.&lt;/p&gt;
&lt;p&gt;The timing of this release is interesting given that Gemini 3 Pro appears to have &lt;a href="https://simonwillison.net/2025/Nov/18/gemini-3/#benchmarks"&gt;aced almost all of the benchmarks&lt;/a&gt; just yesterday. It's reminiscent of the period in 2024 when OpenAI consistently made big announcements that happened to coincide with Gemini releases.&lt;/p&gt;
&lt;p&gt;OpenAI's self-reported &lt;a href="https://openai.com/index/introducing-swe-bench-verified/"&gt;SWE-Bench Verified&lt;/a&gt; score is particularly notable: 76.5% for thinking level "high" and 77.9% for the new "xhigh". That was the one benchmark where Gemini 3 Pro was out-performed by Claude Sonnet 4.5 - Gemini 3 Pro got 76.2% and Sonnet 4.5 got 77.2%. OpenAI now have the highest scoring model there by a full 0.7 of a percentage point!&lt;/p&gt;
&lt;p&gt;They also report a score of 58.1% on &lt;a href="https://www.tbench.ai/leaderboard/terminal-bench/2.0"&gt;Terminal Bench 2.0&lt;/a&gt;, beating Gemini 3 Pro's 54.2% (and Sonnet 4.5's 42.8%.)&lt;/p&gt;
&lt;p&gt;The most intriguing part of this announcement concerns the model's approach to long context problems:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;GPT‑5.1-Codex-Max is built for long-running, detailed work. It’s our first model natively trained to operate across multiple context windows through a process called &lt;em&gt;compaction&lt;/em&gt;, coherently working over millions of tokens in a single task. [...]&lt;/p&gt;
&lt;p&gt;Compaction enables GPT‑5.1-Codex-Max to complete tasks that would have previously failed due to context-window limits, such as complex refactors and long-running agent loops by pruning its history while preserving the most important context over long horizons. In Codex applications, GPT‑5.1-Codex-Max automatically compacts its session when it approaches its context window limit, giving it a fresh context window. It repeats this process until the task is completed.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;There's a lot of confusion &lt;a href="https://news.ycombinator.com/item?id=45982649"&gt;on Hacker News&lt;/a&gt; about what this actually means. Claude Code already does a version of compaction, automatically summarizing previous turns when the context runs out. Does this just mean that Codex-Max is better at that process?&lt;/p&gt;
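&lt;p&gt;My mental model of compaction is something like the following loop - this is my own illustrative sketch, not OpenAI's implementation, and the summarization step and 90% threshold are invented for the example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def token_count(messages):
    # Crude stand-in: roughly four characters per token.
    return sum(len(m) for m in messages) // 4

def summarize(messages):
    # A real agent would ask the model itself to write this summary.
    return "Summary of %d earlier turns: ..." % len(messages)

def compact(messages, context_limit, keep_recent=10):
    # As the transcript nears the context window, prune the history:
    # replace older turns with a summary, keep recent work verbatim,
    # and let the task continue in the freed-up window.
    if token_count(messages) &lt; int(0.9 * context_limit):
        return messages
    head, tail = messages[:-keep_recent], messages[-keep_recent:]
    return [summarize(head)] + tail
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If Codex-Max's advantage is being &lt;em&gt;natively trained&lt;/em&gt; to work across compactions, the interesting part is presumably what it learns to preserve, not the pruning mechanism itself.&lt;/p&gt;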
&lt;p&gt;I had it draw me a couple of pelicans by typing "Generate an SVG of a pelican riding a bicycle" directly into the Codex CLI tool. Here's thinking level medium:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A flat-style illustration shows a white, round-bodied bird with an orange beak pedaling a red-framed bicycle with thin black wheels along a sandy beach, with a calm blue ocean and clear sky in the background." src="https://static.simonwillison.net/static/2025/codex-max-medium.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;And here's thinking level "xhigh":&lt;/p&gt;
&lt;p&gt;&lt;img alt="A plump white bird with an orange beak and small black eyes crouches low on a blue bicycle with oversized dark wheels, shown racing forward with motion lines against a soft gradient blue sky." src="https://static.simonwillison.net/static/2025/codex-max-xhigh.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;I also tried xhigh on my &lt;a href="https://simonwillison.net/2025/Nov/18/gemini-3/#and-a-new-pelican-benchmark"&gt;longer pelican test prompt&lt;/a&gt;, which came out like this:&lt;/p&gt;
&lt;p id="advanced-pelican-codex-max"&gt;&lt;img alt="A stylized dark gray bird with layered wings, a yellow head crest, and a long brown beak leans forward in a racing pose on a black-framed bicycle, riding across a glossy blue surface under a pale sky." src="https://static.simonwillison.net/static/2025/codex-breeding-max-xhigh.jpg"&gt;&lt;/p&gt;

&lt;p&gt;Also today: &lt;a href="https://x.com/openai/status/1991266192905179613"&gt;GPT-5.1 Pro is rolling out today to all Pro users&lt;/a&gt;. According to the &lt;a href="https://help.openai.com/en/articles/6825453-chatgpt-release-notes"&gt;ChatGPT release notes&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;GPT-5.1 Pro is rolling out today for all ChatGPT Pro users and is available in the model picker. GPT-5 Pro will remain available as a legacy model for 90 days before being retired.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That's a pretty fast deprecation cycle for the GPT-5 Pro model that was released just three months ago.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=45982649"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/evals"&gt;evals&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-5"&gt;gpt-5&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/codex-cli"&gt;codex-cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-codex"&gt;gpt-codex&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/november-2025-inflection"&gt;november-2025-inflection&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="evals"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="gpt-5"/><category term="codex-cli"/><category term="gpt-codex"/><category term="november-2025-inflection"/></entry><entry><title>Introducing GPT-5.1 for developers</title><link href="https://simonwillison.net/2025/Nov/13/gpt-51/#atom-tag" rel="alternate"/><published>2025-11-13T23:59:35+00:00</published><updated>2025-11-13T23:59:35+00:00</updated><id>https://simonwillison.net/2025/Nov/13/gpt-51/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://openai.com/index/gpt-5-1-for-developers/"&gt;Introducing GPT-5.1 for developers&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;OpenAI announced GPT-5.1 yesterday, calling it &lt;a href="https://openai.com/index/gpt-5-1/"&gt;a smarter, more conversational ChatGPT&lt;/a&gt;. Today they've added it to their API.&lt;/p&gt;
&lt;p&gt;We actually got four new models today:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.openai.com/docs/models/gpt-5.1"&gt;gpt-5.1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.openai.com/docs/models/gpt-5.1-chat-latest"&gt;gpt-5.1-chat-latest&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.openai.com/docs/models/gpt-5.1-codex"&gt;gpt-5.1-codex&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.openai.com/docs/models/gpt-5.1-codex-mini"&gt;gpt-5.1-codex-mini&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There are a lot of details to absorb here.&lt;/p&gt;
&lt;p&gt;GPT-5.1 introduces a new reasoning effort called "none" (the previous options were minimal, low, medium, and high) - and none is the new default.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This makes the model behave like a non-reasoning model for latency-sensitive use cases, with the high intelligence of GPT‑5.1 and added bonus of performant tool-calling. Relative to GPT‑5 with 'minimal' reasoning, GPT‑5.1 with no reasoning is better at parallel tool calling (which itself increases end-to-end task completion speed), coding tasks, following instructions, and using search tools - and supports &lt;a href="https://platform.openai.com/docs/guides/tools-web-search?api-mode=responses"&gt;web search⁠&lt;/a&gt; in our API platform.&lt;/p&gt;
&lt;/blockquote&gt;
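&lt;p&gt;Here's what that looks like through the official Python SDK's Responses API - a minimal sketch; since "none" is now the default you would only need the &lt;code&gt;reasoning&lt;/code&gt; parameter to opt back in to thinking:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from openai import OpenAI

client = OpenAI()

# "none" is the new default reasoning effort for gpt-5.1;
# set "low", "medium", or "high" to enable thinking instead.
response = client.responses.create(
    model="gpt-5.1",
    reasoning={"effort": "none"},
    input="Generate an SVG of a pelican riding a bicycle",
)
print(response.output_text)
&lt;/code&gt;&lt;/pre&gt;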
&lt;p&gt;When you DO enable thinking you get to benefit from a new feature called "adaptive reasoning":&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;On straightforward tasks, GPT‑5.1 spends fewer tokens thinking, enabling snappier product experiences and lower token bills. On difficult tasks that require extra thinking, GPT‑5.1 remains persistent, exploring options and checking its work in order to maximize reliability.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Another notable new feature for 5.1 is &lt;a href="https://platform.openai.com/docs/guides/prompt-caching#extended-prompt-cache-retention"&gt;extended prompt cache retention&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Extended prompt cache retention keeps cached prefixes active for longer, up to a maximum of 24 hours. Extended Prompt Caching works by offloading the key/value tensors to GPU-local storage when memory is full, significantly increasing the storage capacity available for caching.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;To enable this set &lt;code&gt;"prompt_cache_retention": "24h"&lt;/code&gt; in the API call. Weirdly there's no price increase involved with this at all. I &lt;a href="https://x.com/simonw/status/1989104422832738305"&gt;asked about that&lt;/a&gt; and OpenAI's Steven Heidel &lt;a href="https://x.com/stevenheidel/status/1989113407149314199"&gt;replied&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;with 24h prompt caching we move the caches from gpu memory to gpu-local storage. that storage is not free, but we made it free since it moves capacity from a limited resource (GPUs) to a more abundant resource (storage). then we can serve more traffic overall!&lt;/p&gt;
&lt;/blockquote&gt;
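&lt;p&gt;In the Python SDK that parameter sits directly on the Responses API call - a minimal sketch:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-5.1",
    input="Prompt with a long shared prefix goes here",
    # Keep the cached prefix warm for up to 24 hours:
    prompt_cache_retention="24h",
)
&lt;/code&gt;&lt;/pre&gt;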
&lt;p&gt;The most interesting documentation I've seen so far is in the new &lt;a href="https://cookbook.openai.com/examples/gpt-5/gpt-5-1_prompting_guide"&gt;5.1 cookbook&lt;/a&gt;, which also includes details of the new &lt;code&gt;shell&lt;/code&gt; and &lt;code&gt;apply_patch&lt;/code&gt; built-in tools. The &lt;a href="https://github.com/openai/openai-cookbook/blob/main/examples/gpt-5/apply_patch.py"&gt;apply_patch.py implementation&lt;/a&gt; is worth a look, especially if you're interested in the advancing state-of-the-art of file editing tools for LLMs.&lt;/p&gt;
&lt;p&gt;I'm still working on &lt;a href="https://github.com/simonw/llm/issues/1300"&gt;integrating the new models into LLM&lt;/a&gt;. The Codex models are Responses-API-only.&lt;/p&gt;
&lt;p&gt;I got this pelican for GPT-5.1 default (no thinking):&lt;/p&gt;
&lt;p&gt;&lt;img alt="The bicycle wheels have no spokes at all, the pelican is laying quite flat on it" src="https://static.simonwillison.net/static/2025/gpt-5.1-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;And this one with reasoning effort set to high:&lt;/p&gt;
&lt;p&gt;&lt;img alt="This bicycle has four spokes per wheel, and the pelican is sitting more upright" src="https://static.simonwillison.net/static/2025/gpt-5.1-high-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;These actually feel like a &lt;a href="https://simonwillison.net/2025/Aug/7/gpt-5/#and-some-svgs-of-pelicans"&gt;regression from GPT-5&lt;/a&gt; to me. The bicycles have fewer spokes!&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-5"&gt;gpt-5&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-codex"&gt;gpt-codex&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/november-2025-inflection"&gt;november-2025-inflection&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/><category term="gpt-5"/><category term="gpt-codex"/><category term="november-2025-inflection"/></entry><entry><title>Reverse engineering Codex CLI to get GPT-5-Codex-Mini to draw me a pelican</title><link href="https://simonwillison.net/2025/Nov/9/gpt-5-codex-mini/#atom-tag" rel="alternate"/><published>2025-11-09T03:31:34+00:00</published><updated>2025-11-09T03:31:34+00:00</updated><id>https://simonwillison.net/2025/Nov/9/gpt-5-codex-mini/#atom-tag</id><summary type="html">
    &lt;p&gt;OpenAI partially released a new model yesterday called GPT-5-Codex-Mini, which they &lt;a href="https://x.com/OpenAIDevs/status/1986861734619947305"&gt;describe&lt;/a&gt; as "a more compact and cost-efficient version of GPT-5-Codex". It's currently only available via their Codex CLI tool and VS Code extension, with proper API access "&lt;a href="https://x.com/OpenAIDevs/status/1986861736041853368"&gt;coming soon&lt;/a&gt;". I decided to use Codex to reverse engineer the Codex CLI tool and give me the ability to prompt the new model directly.&lt;/p&gt;
&lt;p&gt;I made &lt;a href="https://www.youtube.com/watch?v=9o1_DL9uNlM"&gt;a video&lt;/a&gt; talking through my progress and demonstrating the final results.&lt;/p&gt;

&lt;p&gt;&lt;lite-youtube videoid="9o1_DL9uNlM" js-api="js-api" title="Reverse engineering Codex CLI to get GPT-5-Codex-Mini to draw me a pelican" playlabel="Play: Reverse engineering Codex CLI to get GPT-5-Codex-Mini to draw me a pelican"&gt; &lt;/lite-youtube&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Nov/9/gpt-5-codex-mini/#this-is-a-little-bit-cheeky"&gt;This is a little bit cheeky&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Nov/9/gpt-5-codex-mini/#codex-cli-is-written-in-rust"&gt;Codex CLI is written in Rust&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Nov/9/gpt-5-codex-mini/#iterating-on-the-code"&gt;Iterating on the code&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Nov/9/gpt-5-codex-mini/#let-s-draw-some-pelicans"&gt;Let's draw some pelicans&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Nov/9/gpt-5-codex-mini/#bonus-the-debug-option"&gt;Bonus: the --debug option&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id="this-is-a-little-bit-cheeky"&gt;This is a little bit cheeky&lt;/h4&gt;
&lt;p&gt;OpenAI clearly don't intend for people to access this model directly just yet. It's available exclusively through Codex CLI which is a privileged application - it gets to access a special backend API endpoint that's not publicly documented, and it uses a special authentication mechanism that bills usage directly to the user's existing ChatGPT account.&lt;/p&gt;
&lt;p&gt;I figured reverse-engineering that API directly would be somewhat impolite. But... Codex CLI is an open source project released under an Apache 2.0 license. How about upgrading that to let me run my own prompts through its existing API mechanisms instead?&lt;/p&gt;
&lt;p&gt;This felt like a somewhat absurd loophole, and I couldn't resist trying it out and seeing what happened.&lt;/p&gt;
&lt;h4 id="codex-cli-is-written-in-rust"&gt;Codex CLI is written in Rust&lt;/h4&gt;
&lt;p&gt;The &lt;a href="https://github.com/openai/codex"&gt;openai/codex&lt;/a&gt; repository contains the source code for the Codex CLI tool, which OpenAI rewrote in Rust just a few months ago.&lt;/p&gt;
&lt;p&gt;I don't know much Rust at all.&lt;/p&gt;
&lt;p&gt;I made my own clone on GitHub and checked it out locally:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;git clone git@github.com:simonw/codex
&lt;span class="pl-c1"&gt;cd&lt;/span&gt; codex&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then I fired up Codex itself (in dangerous mode, because I like living dangerously):&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;codex --dangerously-bypass-approvals-and-sandbox&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And ran this prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Figure out how to build the rust version of this tool and then build it&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This worked. It churned away for a bit and figured out how to build itself. This is a useful starting point for a project like this - in figuring out the compile step the coding agent gets seeded with a little bit of relevant information about the project, and if it can compile that means it can later partially test the code it is writing while it works.&lt;/p&gt;
&lt;p&gt;Once the compile had succeeded I fed it the design for the new feature I wanted:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Add a new sub-command to the Rust tool called "codex prompt"&lt;/p&gt;
&lt;p&gt;codex prompt "prompt goes here" - this runs the given prompt directly against the OpenAI API that Codex uses, with the same code path and authentication credentials as the rest of Codex.&lt;/p&gt;
&lt;p&gt;codex prompt -m &amp;lt;model_name&amp;gt; "prompt goes here" - same again but lets you set the model&lt;/p&gt;
&lt;p&gt;codex prompt "prompt goes here" -s/--system "system prompt goes here" - runs with a custom system/developer message&lt;/p&gt;
&lt;p&gt;codex prompt --models - this lists all available models that can be used with the "codex prompt" command&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;My design for this new sub-command is lifted directly from my own &lt;a href="https://llm.datasette.io/en/stable/help.html#llm-prompt-help"&gt;llm command&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="iterating-on-the-code"&gt;Iterating on the code&lt;/h4&gt;
&lt;p&gt;Codex got to work. I've &lt;a href="https://gistpreview.github.io/?ddabbff092bdd658e06d8a2e8f142098"&gt;shared the full transcript here&lt;/a&gt; (using the tool &lt;a href="https://simonwillison.net/2025/Oct/23/claude-code-for-web-video/"&gt;I described here&lt;/a&gt;), but this TODO list it made itself is a useful summary of what it decided to do:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Updated Plan
└ ✔ Inspect CLI structure and relevant core utilities for sending prompts and listing models
  ✔ Implement new `codex prompt` subcommand (CLI plumbing + command handler)
  ✔ Format, lint, and test (`just fmt`, `just fix -p codex-cli`, `cargo test -p codex-cli`)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I like that it figured out &lt;a href="https://github.com/openai/codex/blob/a47181e471b6efe55e95f9858c913fc89a3a44fa/codex-rs/justfile"&gt;the justfile&lt;/a&gt; in the repo and decided to use it to run formatting and linting commands without me needing to tell it to. (Update: it turns out that was dictated by the &lt;a href="https://github.com/openai/codex/blob/f8b30af6dc275b3e64de5f1987e6cafe604cb72a/AGENTS.md"&gt;AGENTS.md&lt;/a&gt; file.)&lt;/p&gt;
&lt;p&gt;I tried running the first version of the code it wrote like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;./target/debug/codex prompt &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Generate an SVG of a pelican riding a bicycle&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; -m gpt-5-codex-mini&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;... and it didn't quite work. I got this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(reasoning summary) **Seeking
(reasoning summary)  instructions
(reasoning summary)  and
(reasoning summary)  sandbox
(reasoning summary)  info
(reasoning summary) **
(reasoning summary) **Dec
(reasoning summary) iding
(reasoning summary)  on
(reasoning summary)  SVG
(reasoning summary)  creation
(reasoning summary)  approach
(reasoning summary) **
(reasoning summary) **Checking
(reasoning summary)  current
(reasoning summary)  directory
(reasoning summary) **
(reasoning summary) **Preparing
(reasoning summary)  to
(reasoning summary)  check
(reasoning summary)  current
(reasoning summary)  directory
(reasoning summary) **
I’m ready to help—what would you like me to do next?I’m ready to help—what would you like me to do next?
Token usage: total=2459 input=2374 cached_input=0 output=85 reasoning_output=64
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that it DID think about SVG creation, but then decided it should look at the current directory. This isn't what I want - it appeared to be running in Codex's normal mode with a system prompt telling it to edit files on disk. I wanted it to respond to the prompt without acting as if it had a full workspace available to it.&lt;/p&gt;
&lt;p&gt;I prompted it again, pasting in the broken output:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;this did not seem to work - here is what happened - note that the reasoning summary comes out on a lot of lines when I would rather it streams out to stderr on the same line (flushing constantly) - and then despite the reasoning summary thinking about pelican on SVG it did not actually act on that and display the result, which is very surprising. [pasted content]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It churned for a bit more, but from the output it looked like it was still trying to include Codex-specific tools and workspace information. So I said:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Note that this new 'codex prompt' command should NOT be able to execute any tools at all, and should not have to think about the workspace or sandbox - it should send a single prompt with no tools and then stream and display a response and then stop.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This got me a new build, but when I ran it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Error: unexpected status 400 Bad Request: {"detail":"Instructions are not valid"}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;My guess is that OpenAI's custom backend API for Codex doesn't like it if you attempt to call it &lt;em&gt;without&lt;/em&gt; passing in the expected instructions. Fair enough!&lt;/p&gt;
&lt;p&gt;Rather than paste that error in I got Codex to try the command itself, so it could see the error:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;You try running this and see what happens: ./target/debug/codex prompt 'Generate an SVG of a pelican riding a bicycle'&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Codex switched back to the default instructions. From its reply:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;cli/src/prompt_cmd.rs#L1-L320&lt;/code&gt; now forces a tool-free request: we keep the workspace defaults, inject a lightweight system/developer message (“don’t run tools”), disable apply_patch/web search via config overrides, and build the outbound prompt as just that system span plus the user text.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I guess that will have to do. I asked for one more feature:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Add a --debug option to the codex prompt command which causes the full JSON request and response to be printed to stderr, plus the URL that is being accessed and the HTTP verb&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;... and we're ready to try this thing out!&lt;/p&gt;
&lt;p&gt;Notably I haven't written a single line of Rust myself here and paid almost no attention to what it was actually doing. My main contribution was to run the binary every now and then to see if it was doing what I needed yet.&lt;/p&gt;
&lt;p&gt;I've pushed the working code to &lt;a href="https://github.com/simonw/codex/compare/a47181e471b6efe55e95f9858c913fc89a3a44fa...ae5f98a9248a8edb5d3c53261273a482fc0b5306"&gt;a prompt-subcommand branch in my repo&lt;/a&gt; if you want to take a look and see how it all works.&lt;/p&gt;

&lt;h4 id="let-s-draw-some-pelicans"&gt;Let's draw some pelicans&lt;/h4&gt;
&lt;p&gt;With the final version of the code built, I drew some pelicans. Here's the &lt;a href="https://gistpreview.github.io/?a11f9ac456d2b2bc3715ba900ef1203d"&gt;full terminal transcript&lt;/a&gt;, but here are some highlights.&lt;/p&gt;
&lt;p&gt;This is with the default GPT-5-Codex model:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;./target/debug/codex prompt &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Generate an SVG of a pelican riding a bicycle&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I pasted it into my &lt;a href="https://tools.simonwillison.net/svg-render"&gt;tools.simonwillison.net/svg-render&lt;/a&gt; tool and got the following:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/codex-hacking-default.png" alt="It's a dumpy little pelican with a weird face, not particularly great" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I ran it again for GPT-5:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;./target/debug/codex prompt &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Generate an SVG of a pelican riding a bicycle&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; -m gpt-5&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/codex-hacking-gpt-5.png" alt="Much better bicycle, pelican is a bit line-drawing-ish but does have the necessary parts in the right places" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And now the moment of truth... GPT-5 Codex Mini!&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;./target/debug/codex prompt &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Generate an SVG of a pelican riding a bicycle&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; -m gpt-5-codex-mini&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/codex-hacking-mini.png" alt="This is terrible. The pelican is an abstract collection of shapes, the bicycle is likewise very messed up" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I don't think I'll be adding that one to my SVG drawing toolkit any time soon.&lt;/p&gt;

&lt;h4 id="bonus-the-debug-option"&gt;Bonus: the --debug option&lt;/h4&gt;
&lt;p&gt;I had Codex add a &lt;code&gt;--debug&lt;/code&gt; option to help me see exactly what was going on.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;./target/debug/codex prompt -m gpt-5-codex-mini &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Generate an SVG of a pelican riding a bicycle&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; --debug&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The output starts like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[codex prompt debug] POST https://chatgpt.com/backend-api/codex/responses
[codex prompt debug] Request JSON:
&lt;/code&gt;&lt;/pre&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
  &lt;span class="pl-ent"&gt;"model"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;gpt-5-codex-mini&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
  &lt;span class="pl-ent"&gt;"instructions"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;You are Codex, based on GPT-5. You are running as a coding agent ...&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
  &lt;span class="pl-ent"&gt;"input"&lt;/span&gt;: [
    {
      &lt;span class="pl-ent"&gt;"type"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;message&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"role"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;developer&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"content"&lt;/span&gt;: [
        {
          &lt;span class="pl-ent"&gt;"type"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;input_text&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"text"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;You are a helpful assistant. Respond directly to the user request without running tools or shell commands.&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
        }
      ]
    },
    {
      &lt;span class="pl-ent"&gt;"type"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;message&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"role"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;user&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"content"&lt;/span&gt;: [
        {
          &lt;span class="pl-ent"&gt;"type"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;input_text&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"text"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Generate an SVG of a pelican riding a bicycle&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
        }
      ]
    }
  ],
  &lt;span class="pl-ent"&gt;"tools"&lt;/span&gt;: [],
  &lt;span class="pl-ent"&gt;"tool_choice"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;auto&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
  &lt;span class="pl-ent"&gt;"parallel_tool_calls"&lt;/span&gt;: &lt;span class="pl-c1"&gt;false&lt;/span&gt;,
  &lt;span class="pl-ent"&gt;"reasoning"&lt;/span&gt;: {
    &lt;span class="pl-ent"&gt;"summary"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;auto&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
  },
  &lt;span class="pl-ent"&gt;"store"&lt;/span&gt;: &lt;span class="pl-c1"&gt;false&lt;/span&gt;,
  &lt;span class="pl-ent"&gt;"stream"&lt;/span&gt;: &lt;span class="pl-c1"&gt;true&lt;/span&gt;,
  &lt;span class="pl-ent"&gt;"include"&lt;/span&gt;: [
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;reasoning.encrypted_content&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
  ],
  &lt;span class="pl-ent"&gt;"prompt_cache_key"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;019a66bf-3e2c-7412-b05e-db9b90bbad6e&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This reveals that OpenAI's private API endpoint for Codex CLI is &lt;code&gt;https://chatgpt.com/backend-api/codex/responses&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Also interesting is how the &lt;code&gt;"instructions"&lt;/code&gt; key (truncated above, &lt;a href="https://gist.github.com/simonw/996388ecf785ad54de479315bd4d33b7"&gt;full copy here&lt;/a&gt;) contains the default instructions, without which the API appears not to work - but it also shows that you can send a message with &lt;code&gt;role="developer"&lt;/code&gt; in advance of your user prompt.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rust"&gt;rust&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vibe-coding"&gt;vibe-coding&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-5"&gt;gpt-5&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/codex-cli"&gt;codex-cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-codex"&gt;gpt-codex&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="rust"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="vibe-coding"/><category term="coding-agents"/><category term="gpt-5"/><category term="codex-cli"/><category term="gpt-codex"/></entry><entry><title>GPT-5-Codex</title><link href="https://simonwillison.net/2025/Sep/23/gpt-5-codex/#atom-tag" rel="alternate"/><published>2025-09-23T23:59:20+00:00</published><updated>2025-09-23T23:59:20+00:00</updated><id>https://simonwillison.net/2025/Sep/23/gpt-5-codex/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://platform.openai.com/docs/models/gpt-5-codex"&gt;GPT-5-Codex&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;OpenAI &lt;a href="https://simonwillison.net/2025/Sep/15/gpt-5-codex/"&gt;half-released this model&lt;/a&gt; earlier this month, adding it to their Codex CLI tool but not their API.&lt;/p&gt;
&lt;p&gt;Today they've fixed that - the new model can now be accessed as &lt;code&gt;gpt-5-codex&lt;/code&gt;. It's priced the same as regular GPT-5: $1.25/million input tokens, $10/million output tokens, and the same hefty 90% discount for previously cached input tokens, especially important for agentic tool-using workflows which quickly produce a lengthy conversation.&lt;/p&gt;
&lt;p&gt;It's only available via their Responses API, which means you currently need to install the &lt;a href="https://github.com/simonw/llm-openai-plugin"&gt;llm-openai-plugin&lt;/a&gt; to use it with LLM:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install -U llm-openai-plugin
llm -m openai/gpt-5-codex -T llm_version 'What is the LLM version?'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Outputs:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The installed LLM version is 0.27.1.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I added &lt;a href="https://llm.datasette.io/en/stable/tools.html"&gt;tool support&lt;/a&gt; to that plugin today, &lt;a href="https://github.com/simonw/llm-openai-plugin/issues/20#issuecomment-3325921197"&gt;mostly authored by GPT-5 Codex itself&lt;/a&gt; using OpenAI's Codex CLI.&lt;/p&gt;
&lt;p&gt;The new &lt;a href="https://cookbook.openai.com/examples/gpt-5-codex_prompting_guide"&gt;prompting guide for GPT-5-Codex&lt;/a&gt; is worth a read.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;GPT-5-Codex is purpose-built for Codex CLI, the Codex IDE extension, the Codex cloud environment, and working in GitHub, and also supports versatile tool use. We recommend using GPT-5-Codex only for agentic and interactive coding use cases.&lt;/p&gt;
&lt;p&gt;Because the model is trained specifically for coding, many best practices you once had to prompt into general purpose models are built in, and over prompting can reduce quality.&lt;/p&gt;
&lt;p&gt;The core prompting principle for GPT-5-Codex is &lt;strong&gt;“less is more.”&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I &lt;a href="https://gist.github.com/simonw/b371949ae984b0431848cd16cba24b27"&gt;tried my pelican benchmark&lt;/a&gt; at a cost of &lt;a href="https://www.llm-prices.com/#it=16&amp;amp;ot=2154&amp;amp;ic=1.25&amp;amp;oc=10"&gt;2.156 cents&lt;/a&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m openai/gpt-5-codex "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;
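&lt;p&gt;That figure follows directly from the token counts in the price calculator link - 16 input tokens and 2,154 output tokens at GPT-5 rates:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;input_tokens, output_tokens = 16, 2154
cost = input_tokens / 1_000_000 * 1.25 + output_tokens / 1_000_000 * 10
print(f"{cost * 100:.3f} cents")  # 2.156 cents
&lt;/code&gt;&lt;/pre&gt;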
&lt;p&gt;&lt;img alt="See description below" src="https://static.simonwillison.net/static/2025/gpt-5-codex-api-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;I asked Codex to describe this image and it correctly identified it as a pelican!&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m openai/gpt-5-codex -a https://static.simonwillison.net/static/2025/gpt-5-codex-api-pelican.png \
  -s 'Write very detailed alt text'
&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote&gt;
&lt;p&gt;Cartoon illustration of a cream-colored pelican with a large orange beak and tiny black eye riding a minimalist dark-blue bicycle. The bird’s wings are tucked in, its legs resemble orange stick limbs pushing the pedals, and its tail feathers trail behind with light blue motion streaks to suggest speed. A small coral-red tongue sticks out of the pelican’s beak. The bicycle has thin light gray spokes, and the background is a simple pale blue gradient with faint curved lines hinting at ground and sky.&lt;/p&gt;
&lt;/blockquote&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-engineering"&gt;prompt-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-5"&gt;gpt-5&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/codex-cli"&gt;codex-cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-codex"&gt;gpt-codex&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="openai"/><category term="prompt-engineering"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/><category term="gpt-5"/><category term="codex-cli"/><category term="gpt-codex"/></entry><entry><title>GPT‑5-Codex and upgrades to Codex</title><link href="https://simonwillison.net/2025/Sep/15/gpt-5-codex/#atom-tag" rel="alternate"/><published>2025-09-15T18:55:35+00:00</published><updated>2025-09-15T18:55:35+00:00</updated><id>https://simonwillison.net/2025/Sep/15/gpt-5-codex/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://openai.com/index/introducing-upgrades-to-codex/"&gt;GPT‑5-Codex and upgrades to Codex&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;OpenAI half-released a new model today: GPT‑5-Codex, a fine-tuned GPT-5 variant explicitly designed for their various AI-assisted programming tools.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update&lt;/strong&gt;: OpenAI call it a "version of GPT-5", they don't explicitly describe it as a fine-tuned model. Calling it a fine-tune was my mistake here. &lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I say half-released because it's not yet available via their API, but they "plan to make GPT‑5-Codex available in the API soon".&lt;/p&gt;
&lt;p&gt;I wrote about &lt;a href="https://simonwillison.net/2025/May/16/openai-codex/"&gt;the confusing array of OpenAI products that share the name Codex&lt;/a&gt; a few months ago. This new model adds yet another, though at least "GPT-5-Codex" (using two hyphens) is unambiguous enough not to add much more to the confusion.&lt;/p&gt;
&lt;p&gt;At this point it's best to think of &lt;strong&gt;Codex&lt;/strong&gt; as OpenAI's brand name for their coding family of models and tools.&lt;/p&gt;
&lt;p&gt;The new model is already integrated into their VS Code extension, the Codex CLI and their Codex Cloud asynchronous coding agent. I'd been calling that last one "Codex Web" but I think Codex Cloud is a better name since it can also be accessed directly from their iPhone app.&lt;/p&gt;
&lt;p&gt;Codex Cloud also has a new feature: you can configure it to automatically run code review against specific GitHub repositories (I found that option on &lt;a href="https://chatgpt.com/codex/settings/code-review"&gt;chatgpt.com/codex/settings/code-review&lt;/a&gt;) and it will create a temporary container to use as part of those reviews. Here's the &lt;a href="https://developers.openai.com/codex/cloud/code-review"&gt;relevant documentation&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Some documented features of the new GPT-5-Codex model:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Specifically trained for code review, which directly supports their new code review feature.&lt;/li&gt;
&lt;li&gt;"GPT‑5-Codex adapts how much time it spends thinking more dynamically based on the complexity of the task." Simple tasks (like "list files in this directory") should run faster. Large, complex tasks should use run for much longer - OpenAI report Codex crunching for seven hours in some cases!&lt;/li&gt;
&lt;li&gt;Increased score on their proprietary "code refactoring evaluation" from 33.9% for GPT-5 (high) to 51.3% for GPT-5-Codex (high). It's hard to evaluate this without seeing the details of the eval but it does at least illustrate that refactoring performance is something they've focused on here.&lt;/li&gt;
&lt;li&gt;"GPT‑5-Codex also shows significant improvements in human preference evaluations when creating mobile websites" - in the past I've habitually prompted models to "make it mobile-friendly", maybe I don't need to do that any more.&lt;/li&gt;
&lt;li&gt;"We find that comments by GPT‑5-Codex are less likely to be incorrect or unimportant" - I originally misinterpreted this as referring to comments in code but it's actually about comments left on code reviews.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The &lt;a href="https://github.com/openai/codex/blob/rust-v0.36.0/codex-rs/core/gpt_5_codex_prompt.md"&gt;system prompt for GPT-5-Codex&lt;/a&gt; in Codex CLI is worth a read. It's notably shorter than the &lt;a href="https://github.com/openai/codex/blob/rust-v0.36.0/codex-rs/core/prompt.md"&gt;system prompt for other models&lt;/a&gt; - &lt;a href="https://gist.github.com/simonw/042f1428ce22ad55ac5bc9010263a4f4/revisions"&gt;here's a diff&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here's the section of the updated system prompt that talks about comments:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Add succinct code comments that explain what is going on if code is not self-explanatory. You should not add comments like "Assigns the value to the variable", but a brief comment might be useful ahead of a complex code block that the user would otherwise have to spend time parsing out. Usage of these comments should be rare.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Theo Browne &lt;a href="https://www.youtube.com/watch?v=j9wvCrON3XA"&gt;has a video review&lt;/a&gt; of the model and accompanying features. He was generally impressed but noted that it was surprisingly bad at using the Codex CLI search tool to navigate code. Hopefully that's something that can be fixed with a system prompt update.&lt;/p&gt;
&lt;p&gt;Finally, can it draw a pelican riding a bicycle? Without API access I instead got Codex Cloud to &lt;a href="https://chatgpt.com/s/cd_68c85f433cc881918acfd8a4aeda1cc4"&gt;have a go&lt;/a&gt; by prompting:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Generate an SVG of a pelican riding a bicycle, save as pelican.svg&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://github.com/simonw/codex-scratchpad/pull/3"&gt;the result&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="it's a bit messy - the pelican is quite good and the bicycle is quite good but the pelican is stood overlapping the bicycle not riding it." src="https://static.simonwillison.net/static/2025/gpt-5-codex-pelican.jpg" /&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/code-review"&gt;code-review&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/async-coding-agents"&gt;async-coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-5"&gt;gpt-5&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/codex-cli"&gt;codex-cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/theo-browne"&gt;theo-browne&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-codex"&gt;gpt-codex&lt;/a&gt;&lt;/p&gt;



</summary><category term="code-review"/><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="coding-agents"/><category term="async-coding-agents"/><category term="gpt-5"/><category term="codex-cli"/><category term="theo-browne"/><category term="gpt-codex"/></entry></feed>